Chipset debugging: FTF-DES-F1321-QorIQ-Debug.pptx

navidmirmotahhary1 18 views 38 slides Jun 04, 2024

About This Presentation

Debugging process for NXP QorIQ chips


Slide Content

Debugging Hardware Level Issues using QorIQ Debug Architecture
FTF-DES-F1321 | JUN.24.2015
Vakul Garg | Lead Software Engineer

Session Outline
- Hardware level debug assists available on QorIQ
- Non-intrusive methods of tracing & performance analysis
- Methods that can be used in field deployed systems
- QorIQ Debug as an alternative to emulator to discover hardware issues in SoC

Key takeaways from this session …
- Learn basics of QorIQ Debug Architecture
  - What are the building blocks
  - When to use hardware level debug
- Understand software tools to use QorIQ Debug hardware
- Learn to debug hardware level issues in field deployed systems

Agenda
- Overview of a Multicore processor
- Requirements of hardware level debugging
- Understanding QorIQ Debug Architecture
  - Event counting
  - Tracing
- Enablement tools to use Debug Architecture
- Case Studies
  - Debugging IPSEC performance issue
  - Discovering hardware errata using QorIQ Debug

Modern Multicore is not just ‘Multiple Cores’
[Block diagram: QorIQ P4080. Power Architecture e500-mc cores (32 KB D-cache, 32 KB I-cache, 128 KB backside L2 cache) on the CoreNet coherency fabric, with two 1024 KB frontside L3 caches and 64-bit DDR-2/3 memory controllers, PAMUs, two Frame Managers (4x 1GE + 10GE each), Queue Manager, Buffer Manager, SEC, PME, PCIe, SRIO, RapidIO Message Unit, 2x DMA, 18-lane 5 GHz SERDES, eLBC, and the real-time debug blocks (watchpoint cross trigger, perf monitor, CoreNet trace, Aurora)]
Multiple accelerators, memory controllers packed with cores

Current debugging approach
- Mostly core centric
  - Source level debugging using gdb etc.
  - Application, OS specific counters (implemented in software)
  - Software tracing, watchpoints
- High intrusiveness
  - Changes system run-time dynamics
  - JTAG requires core to be halted
  - Software tracing often too expensive
  - Race conditions do not reproduce when debugging turned ON
- Zero visibility inside accelerators, I/O peripherals

QorIQ Debug Architecture

QorIQ Debug Overview
[Block diagram: Core Debug (run control, trace, counters, events) for the Power Architecture cores/clusters; Platform Debug (CoreNet Debug, DDR Debug, OCeaN Debug, Data Path Debug) producing trace, events and marking; Event Processing Unit (select, combine, count events, act; interrupt; external EVT0-11); trace collection from cores and SoC through the NPC (filter/select) to Aurora]

P4080 Debug
[Block diagram: P4080 debug architecture. Labeled debug blocks: Nexus Debug, PerfMon and IJAM at the e500-mc core; CoreNet Debug; DDR Debug; OCeaN Debug; DPATH Debug (FMan, QMan, BMan); Perf Monitor; EPU (select, sequence, action); NPC with control/arbitration, 16K trace buffer and Aurora. Also labeled: SoC events and cross-triggers, the debug/development bus (to all debug IP), the memory mapped interface and IEEE 1149.1 (JTAG)]

Event Processing Unit (EPU) – Event Collector
[Block diagram: P4080 debug architecture, same diagram as on the P4080 Debug slide]

Event Processing Unit – EPU
- SoC level event counting
  - Analogous to core perfmons
  - 32 x 16-bit counters
  - Over 2K debug & perf events
  - Muxes select event to count
  - Chaining, reset, capture, overflow detection, freeze support
- Sequencing & Combining
  - Derive new events after applying logical conditions to reference events
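The "Sequencing & Combining" idea can be illustrated in software. The following is a minimal sketch, not the EPU programming model: it combines two reference events with a logical condition and counts the derived event in a fixed-width counter with overflow detection. The event names (`sec_dma_wait`, `ddr_busy`) and the function `count_derived_event` are hypothetical; the 16-bit default width is taken from the slide.

```python
from typing import Callable, Dict, Iterable, Tuple

Sample = Dict[str, bool]  # one snapshot of the reference event lines


def count_derived_event(
    samples: Iterable[Sample],
    condition: Callable[[Sample], bool],
    width_bits: int = 16,
) -> Tuple[int, bool]:
    """Count samples in which the derived event fires.

    Returns (counter value, overflow flag). The counter wraps at
    2**width_bits, mimicking a fixed-width hardware counter.
    """
    limit = 1 << width_bits
    count = 0
    overflowed = False
    for sample in samples:
        if condition(sample):
            count += 1
            if count == limit:
                count = 0
                overflowed = True
    return count, overflowed


# Hypothetical reference events; the names are illustrative only.
samples = [
    {"sec_dma_wait": True, "ddr_busy": True},
    {"sec_dma_wait": True, "ddr_busy": False},
    {"sec_dma_wait": False, "ddr_busy": True},
    {"sec_dma_wait": True, "ddr_busy": True},
]


def both(sample: Sample) -> bool:
    # Derived event: SEC waits on DMA while the DDR controller is busy.
    return sample["sec_dma_wait"] and sample["ddr_busy"]


count, overflowed = count_derived_event(samples, both)
```

In hardware the combination is done by the EPU itself, so no CPU cycles are spent evaluating the condition; the sketch only shows the logic.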

Nexus Port Controller (NPC) – Trace Collector
[Block diagram: P4080 debug architecture, same diagram as on the P4080 Debug slide]

Nexus Port Controller (NPC)
- Aggregation point of traces from multiple clients
  - Programmable arbitration
- Lossless trace collection
  - CPU cycles not consumed
- Multiple trace output ports
  - Internal buffer (16KB)
  - SERDES lanes, requires external probe
  - Pre-allocated buffer in main memory (DDR)
- Trace filtering
- Individual clients can output traces to different ports
[Diagram: cores, CoreNet, DDRC and accelerators feed traces into the NPC, which outputs to DDR, to the Aurora link over SERDES, or to the internal buffer]

e500mc Debug
[Block diagram: P4080 debug architecture, same diagram as on the P4080 Debug slide]

e500mc Event Counting
Performance Monitor
- Each core provides 4x 32-bit performance monitor counters (PMCs)
- PMCs count events of interest: instructions, branches, pipeline stalls, load/store, MMU, cache misses, etc.
- PMCs typically count event occurrence or duration above threshold
- PMCs can be chained
  - Overflow on one counter increments another
  - Chaining can reduce performance counter interrupts for frequent events
- PMCs are supervisor read/write, user read only
- Each counter can signal interrupts on overflow
- Each counter can signal Nexus watchpoint messages on overflow
- PMC overflows can generate events to EPU (x-triggering)
- PMC count values can be captured into shadow registers in response to a trigger. The shadow registers can be accessed non-intrusively
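Counter chaining ("overflow on one counter increments another") can be modeled in a few lines. This is a software sketch under stated assumptions, not the e500mc register interface: the class name `ChainedPMC` and its methods are hypothetical, and only the 32-bit width comes from the slide.

```python
class ChainedPMC:
    """Two fixed-width counters chained so that overflow of the low
    counter increments the high counter (illustrative model only)."""

    def __init__(self, width_bits: int = 32) -> None:
        self.width_bits = width_bits
        self.mask = (1 << width_bits) - 1
        self.low = 0
        self.high = 0

    def count(self, events: int = 1) -> None:
        total = self.low + events
        # Each wrap of the low counter carries into the high counter.
        self.high = (self.high + (total >> self.width_bits)) & self.mask
        self.low = total & self.mask

    @property
    def value(self) -> int:
        """Combined reading across both counters."""
        return (self.high << self.width_bits) | self.low


pmc = ChainedPMC()
pmc.count((1 << 32) + 5)  # more events than one 32-bit counter can hold
```

With chaining, a frequent event needs an overflow interrupt only when the combined 64-bit range is exhausted, rather than every 2^32 occurrences.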

e500mc Tracing
Real-Time Debug Traces
- Program (instruction) trace
  - Highly compressed branch messaging (with history)
  - Program correlation (events “asynchronous” to program flow)
- Data trace
  - Trace data writes (only store instructions supported)
  - Address and data values transmitted
- Ownership (process ID) trace
  - Tracking task switches, logical partition switches
- Data acquisition trace (DQM) and data acquisition events (DQE)
  - Software can generate trace messages for values of interest (custom data logging)
  - Requires instrumentation of code (mtspr instructions)
  - DQEs used for trace control, performance counters, and cross-triggers

CoreNet Debug
[Block diagram: P4080 debug architecture, same diagram as on the P4080 Debug slide]

CoreNet Debug
- Provides visibility to CoreNet transactions
  - CoreNet Address Messages
  - CoreNet Data Messages
  - CoreNet Watchpoint Messages (event logging)
  - Optional timestamping
- Supports filtering of accesses of interest
  - Data addresses and attributes
  - Data value compare
  - Transaction marking
- Supports performance event muxing
  - CoreNet performance events
  - Platform (L3) cache events
- Supported by software analysis tools for internal use

DDR Debug
[Block diagram: P4080 debug architecture, same diagram as on the P4080 Debug slide]

DDR Debug
- Monitor selective transactions at DDR controller
  - Source ID (cores, FMAN, SEC etc.)
  - Attribute (WIMGE bits)
  - Transaction type (read, write, atomic, decorated etc.)
- Monitoring modes
  - Count DDR events at EPU
  - Tracing with addresses and timestamps
- DDR bandwidth measurement
  - Application level estimation
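Turning counted DDR events into a bandwidth figure is simple arithmetic. The sketch below assumes each counted transaction moves one 64-byte cache line and that the bus is a 64-bit DDR interface at 1600 MT/s; both values, and the function names, are assumptions for illustration and not taken from the slides.

```python
def ddr_bandwidth(reads: int, writes: int, interval_s: float,
                  bytes_per_txn: int = 64) -> float:
    """Bytes per second moved by the counted transactions.

    bytes_per_txn=64 assumes one cache line per transaction.
    """
    return (reads + writes) * bytes_per_txn / interval_s


def bus_utilization(bandwidth: float, data_rate_mtps: float,
                    bus_width_bits: int = 64) -> float:
    """Fraction of the theoretical peak data-bus bandwidth in use."""
    peak = data_rate_mtps * 1e6 * bus_width_bits / 8  # bytes per second
    return bandwidth / peak


# Hypothetical counter readings sampled over one second.
bw = ddr_bandwidth(reads=1_000_000, writes=500_000, interval_s=1.0)
util = bus_utilization(bw, data_rate_mtps=1600)
```

Filtering the counted transactions by source ID, as the slide describes, would give the same estimate per initiator (for example SEC only).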

DataPath Debug
[Block diagram: P4080 debug architecture, same diagram as on the P4080 Debug slide]

DataPath Debug
- Provides visibility to queuing operations in architected Queue Manager (QM)
  - Frame queue ID, operation type (enqueue / dequeue)
  - Channel number
  - Portal number, portal type
- Provides visibility to debug context info within each Frame Manager (FM)
  - Trace data from each processing stage (BMI, QMI, KeyGen …)
  - Sophisticated frame comparators for each sub-block
  - Fine granularity frame tracing control
- Packet-marking capability
  - Trace only packets of required flows
- Aggregates performance events from all Data Path blocks (FM, QM, BM, PME, SEC) and forwards to EPU
- Supported by Software Analysis Tools

QorIQ Debug enablement Tools

Enablement Tools

Scenarios Tool - Overview
Optimized workflow for efficiently narrowing down performance issues anywhere on the system
Customer Benefits
- System optimization for Cores and SoC
- Complexity abstraction
- Delivers Freescale expertise to users
- Ease of use
- Streamlined to solve several performance issues
Key Features
- Stand alone – no CodeWarrior needed
- Performance analysis including visualization
- Connection auto discovery
- “Canned” measurement scenarios
  - 100+ scenarios covering Core and SoC blocks
- User-defined measurement scenarios
- Compare pairs of runs
- Graphically visualize all measurements
- “Live” view of events and metrics
- Supports “bare metal” or Linux applications
- Python scripting support
Devices Supported
- P2040, P3041, P5020, P5021, P5040, P4040, P4080
- T2080, T4240
- B4860
- Future Layerscape devices

Scenarios Tool – Available Scenarios
Approximately 50-100 scenarios are available depending on the target

Scenario Group          What can I measure?
CPU - Utilization       Branch Misses, Interrupt Counts, CPU Usage
CPU - Cache             Cache Misses (Data/Instruction L1, Data/Instruction Backside L2)
CPU - MMU               TLB4K Reloads, VSP Reloads, L2 MMU Misses
CPU - Core Complex      Core Complex Traffic
CPU - Load Store Unit   Data Line Fill Buffer (DLFB) Misses
DDR                     DDR Traffic, Page Misses, Collisions
CoreNet                 CoreNet Traffic
DPAA QMan               QMan Dequeue and/or Enqueue Counts
DPAA SEC                Security Engine Utilization
OCeaN                   DMA Performance
Combination             Broad measurements over the whole system

Performance optimization using counters

IPSEC processing pipeline
[Diagram: T4240 pipeline of MAC, CPU, CRYPTO Processor (SEC), CPU, MAC, with three DDR controllers (DDRC), each attached to DDR]

Problem Symptom
- Very low IPSEC throughput – 2.6 Mpps
  - Much lower than top SEC throughput > 10 Mpps
- Frames accumulated at SEC ingress interface
  - Means SEC is the bottleneck
[Diagram: T4240 pipeline of MAC, CPU, CRYPTO Processor (SEC), CPU, MAC]

Analyzing SEC performance with Scenarios Tool

DECO      Utilization (%)
DECO 1    90
DECO 2    88
DECO 3    86
DECO 4    83
DECO 5    77
DECO 6    62
DECO 7    25
DECO 8    17

SEC DMA wait cycles / second: 177,352,672
Gross DECO utilization: 66 %
SEC is under-utilized due to slow DMA
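The 66 % gross figure is consistent with the plain arithmetic mean of the eight per-DECO values. That the Scenarios Tool computes it this way is an assumption; the check below only reproduces the number from the slide's data. The 50 % threshold used to flag idle DECOs is likewise an arbitrary illustrative choice.

```python
from statistics import mean

# Per-DECO utilization (%) as reported on the slide.
deco_utilization = {1: 90, 2: 88, 3: 86, 4: 83, 5: 77, 6: 62, 7: 25, 8: 17}

# Assumption: gross utilization is the unweighted mean over all DECOs.
gross = mean(deco_utilization.values())

# DECOs below an arbitrary 50 % threshold, i.e. mostly idle engines.
mostly_idle = [deco for deco, pct in deco_utilization.items() if pct < 50]
```

The mean of the eight values is 528 / 8 = 66, matching the reported gross utilization, and DECO 7 and DECO 8 stand out as the engines starved of work.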

Analyzing DDR controller performance

DDR data bus utilization
DDRC1     DDRC2     DDRC3
.02 %     .05 %     33 %

DMA is slow due to single DDRC in use
Check DDRC interleaving setting

After we enable interleaving...

Metric                              Old            New
Gross DECO utilization (%)          66             90.31
DDR data bus utilization, DDRC1     .02 %          15.11 %
DDR data bus utilization, DDRC2     .05 %          14.9 %
DDR data bus utilization, DDRC3     33 %           14.8 %
SEC DMA wait cycles / second        177,352,672    20,728,328
IPSEC performance (improved)        2.6 Mpps       8.3 Mpps
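The before/after numbers can be reduced to a few ratios. This is only arithmetic on the slide's figures; the variable names are illustrative.

```python
# (old, new) pairs taken from the slide.
throughput_mpps = (2.6, 8.3)
dma_wait_cycles_per_s = (177_352_672, 20_728_328)
ddr_bus_utilization_pct = {
    "DDRC1": (0.02, 15.11),
    "DDRC2": (0.05, 14.9),
    "DDRC3": (33.0, 14.8),
}

# IPSEC throughput gain after enabling interleaving.
speedup = throughput_mpps[1] / throughput_mpps[0]

# Factor by which SEC DMA wait cycles dropped.
wait_reduction = dma_wait_cycles_per_s[0] / dma_wait_cycles_per_s[1]

# Spread between the busiest and the idlest controller, before and after.
old_spread = (max(v[0] for v in ddr_bus_utilization_pct.values())
              - min(v[0] for v in ddr_bus_utilization_pct.values()))
new_spread = (max(v[1] for v in ddr_bus_utilization_pct.values())
              - min(v[1] for v in ddr_bus_utilization_pct.values()))
```

Throughput improves by roughly 3.2x and DMA wait cycles fall by roughly 8.6x, while the utilization spread across the three controllers shrinks from about 33 percentage points to well under one.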

Cache debug using bus traces

System Setup
- B4860 based system
- One app thread per each PowerPC and DSP core
- Apps exchange IPC msgs over FQs
  - PowerPC app pre-sets pattern in msgs
  - DSP increments pattern and reflects back
  - PowerPC app verifies received pattern
- Works fine if DSP stashing is OFF
[Diagram: StarCore DSP and PowerPC e6500-mc core (32 KB D-cache, 32 KB I-cache, 128 KB backside L2 cache) connected through CoreNet, CPC, QMAN and DDR]

Pattern sent by PowerPC:       1000 2000 3000 4000 5000 6000 7000 8000
Expected pattern at PowerPC:   1001 2001 3001 4001 5001 6001 7001 8001

Problem Description
[Diagram: message exchange between PowerPC and DSP]
Sent by PowerPC:       1000 2000 3000 4000 5000 6000 7000 8000
Received by DSP:       1000 2000 3000 4000 5000 6000 7000 8000
Sent by DSP:           1001 2001 3001 4001 5001 6001 7001 8001
Received by PowerPC:   1001 2001 3001 4001 5000 6000 7000 xxxx

Problem
- With DSP stashing ON, sometimes an unexpected pattern is received on PowerPC
- Corruption always at half cache line boundary (i.e. last 32 bytes bad)
- Pattern got corrupted after DSP sends it
Suspect Areas
- Application code
- Any other DMA, I/O corrupting messages
  - Ruled out by turning them off, placing IOMMU restrictions
- Cache coherency settings
  - Reviewed again…
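The half-cache-line observation can be checked mechanically by comparing the received pattern with the expected one. The sketch assumes the eight pattern words fill one 64-byte cache line, i.e. 8 bytes per word; that word size is inferred from the "last 32 bytes bad" remark and is not stated on the slide. The unknown last word ("xxxx") is represented as `None`, and the function name is hypothetical.

```python
from typing import List, Optional

LINE_BYTES = 64   # cache line size, per the 64-byte transaction size
WORD_BYTES = 8    # assumption: 8 pattern words fill one cache line


def first_bad_offset(expected: List[int],
                     received: List[Optional[int]]) -> Optional[int]:
    """Byte offset of the first mismatching word, or None if all match."""
    for index, (want, got) in enumerate(zip(expected, received)):
        if want != got:
            return index * WORD_BYTES
    return None


expected = [1001, 2001, 3001, 4001, 5001, 6001, 7001, 8001]
received = [1001, 2001, 3001, 4001, 5000, 6000, 7000, None]  # None = "xxxx"

offset = first_bad_offset(expected, received)
at_half_line = offset == LINE_BYTES // 2
```

Under these assumptions the first bad word sits at byte offset 32, exactly the half-line boundary, and the stale values (5000, 6000, 7000) show that the DSP's increments to the upper half never reached memory.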

Debug using CoreNet traces
- Impractical to reproduce the problem on emulator
- Collected CoreNet trace for transactions on msg buffers:
  - For corrupt msg, DSP cache did bad CASTOUT of half cache line size data
- What triggers a bad CASTOUT?
  - Stashing gives a clue…
  - Bad CASTOUT happens when there is a STASH at same cache set
- Hardware team confirms RTL issue
  - STASH triggers a half line size CASTOUT
- Workaround: Prevent CASTOUT due to STASH transaction
  - Cache lock addresses which are stashed

Message # 41829
  MSGTYPE   : CoreNet Address Message
  SourceID  : 0x3c (PowerPC)
  TYPE      : 0xd STASH and WRITE
  SIZE      : 8 (64 bytes)
  QUALIFIER : 0x2 (WIMG)
  TIMESTAMP : 0 (0x0)
  F-ADDR    : 0x0000000f:f6055240

Message # 41832
  MSGTYPE   : CoreNet Address Message
  SourceID  : 0x31 (DSP cache)
  TYPE      : 0x10 CASTOUT
  SIZE      : 0 (reserved)   <- This is bad
  QUALIFIER : 0x0 (WIMG)
  TIMESTAMP : 0 (0x0)
  F-ADDR    : 0x00000000:60195240

STASH & CASTOUT addr fall in same cache set
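The "same cache set" observation can be verified from the two F-ADDR values in the trace. The sketch below assumes a 64-byte line and a hypothetical 512-set cache indexed by the address bits just above the line offset; the real DSP cache geometry is not given on the slides. Because the two addresses agree in their low 18 bits, the result holds for any power-of-two set count up to 4096 under this indexing scheme.

```python
def parse_faddr(text: str) -> int:
    """Convert an F-ADDR field such as '0x0000000f:f6055240' to an integer."""
    high, low = text.split(":")
    return (int(high, 16) << 32) | int(low, 16)


def set_index(addr: int, line_bytes: int = 64, num_sets: int = 512) -> int:
    """Set index for a simple physically indexed cache.

    line_bytes and num_sets are illustrative assumptions, not the
    documented geometry of the DSP cache.
    """
    return (addr // line_bytes) % num_sets


stash_addr = parse_faddr("0x0000000f:f6055240")    # message # 41829
castout_addr = parse_faddr("0x00000000:60195240")  # message # 41832

same_set = set_index(stash_addr) == set_index(castout_addr)
```

The two addresses belong to different lines but map to the same set, which is the condition under which the trace shows the bad half-line CASTOUT.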

What we learnt
- The integrated nature of multicore processors poses a debugging challenge, since it hides the interconnections of accelerators from the software domain
- QorIQ Debug Architecture
  - Enables non-intrusive debugging
- Debug tools for hardware debugging
  - Scenarios tool, Packet Trace tool, SPID libs
- Viable alternative to emulation