About This Presentation
Debugging process for QorIQ chips from NXP
Slide Content
Debugging Hardware Level Issues using QorIQ Debug Architecture
FTF-DES-F1321
JUN.24.2015
Vakul Garg | Lead Software Engineer
Session Outline
Hardware-level debug assists available on QorIQ
Non-intrusive methods of tracing and performance analysis
Methods that can be used in field-deployed systems
QorIQ Debug as an alternative to an emulator for discovering hardware issues in the SoC
Key takeaways from this session
Learn the basics of the QorIQ Debug Architecture: what the building blocks are, and when to use hardware-level debug
Understand the software tools that use the QorIQ Debug hardware
Learn to debug hardware-level issues in field-deployed systems
Agenda
Overview of a multicore processor
Requirements of hardware-level debugging
Understanding the QorIQ Debug Architecture: event counting and tracing
Enablement tools for the Debug Architecture
Case studies: debugging an IPSEC performance issue; discovering a hardware erratum using QorIQ Debug
Modern Multicore is not just 'Multiple Cores'
[QorIQ P4080 block diagram: e500mc Power Architecture cores (32 KB I/D caches, 128 KB backside L2 each), CoreNet coherency fabric with PAMUs, 1024 KB frontside L3 caches, 64-bit DDR2/3 memory controllers, Frame Managers (4x 1GE + 1x 10GE each), Queue and Buffer Managers, SEC, PME, RapidIO Message Unit, DMA, PCIe/SRIO over an 18-lane 5 GHz SERDES, eLBC, USB, SD/MMC, DUART, I2C, SPI, GPIO, and real-time debug blocks (Watchpoint Cross Trigger, Performance Monitor, CoreNet Trace, Aurora, Test Port/SAP)]
Multiple accelerators and memory controllers packed in with the cores
Current debugging approach
Mostly core-centric: source-level debugging using gdb etc.; application- and OS-specific counters (implemented in software); software tracing and watchpoints
High intrusiveness: changes the system's run-time dynamics; JTAG requires the core to be halted; software tracing is often too expensive; race conditions do not reproduce when debugging is turned on
Zero visibility inside accelerators and I/O peripherals
Nexus Port Control (NPC)
Aggregation point of traces from multiple clients: programmable arbitration; lossless trace collection; no CPU cycles consumed
Multiple trace output ports: internal buffer (16 KB); SERDES lanes (requires an external probe); pre-allocated buffer in main memory (DDR)
Trace filtering: individual clients can output traces to different ports
[Diagram: cores, accelerators, CoreNet, and the DDR controller feed the NPC, which outputs to the internal buffer, to the Aurora link over SERDES, or to DDR]
e500mc Event Counting: Performance Monitor
Each core provides 4x 32-bit performance monitor counters (PMCs)
PMCs count events of interest: instructions, branches, pipeline stalls, load/store, MMU, cache misses, etc.
PMCs typically count event occurrences or duration above a threshold
PMCs can be chained: overflow on one counter increments another; chaining can reduce performance-counter interrupts for frequent events
PMCs are supervisor read/write, user read-only
Each counter can signal an interrupt on overflow
Each counter can signal a Nexus watchpoint message on overflow
PMC overflows can generate events to the EPU (cross-triggering)
PMC count values can be captured into shadow registers in response to a trigger; the shadow registers can be accessed non-intrusively
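The PMC counting above is normally driven through the debug tooling, but software can also read core event counts itself. The sketch below is a generic Linux illustration using perf_event_open(), not the QorIQ-specific tooling; it assumes a Linux target whose PMU driver backs the generic hardware events with the e500mc PMCs. On a bare-metal target you would instead program the PMGC0/PMLCa/PMC registers directly via mtpmr/mfpmr as documented in the core reference manual.

```c
/* Minimal sketch: count one core event from software on a Linux target using
 * perf_event_open(). Generic stand-in for the e500mc PMCs described above. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* glibc provides no wrapper for perf_event_open, so call it via syscall(). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* instructions retired */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... workload under measurement ... */
    for (volatile int i = 0; i < 1000000; i++)
        ;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("instructions retired: %llu\n", (unsigned long long)count);

    close(fd);
    return 0;
}
```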
e500mc Tracing: Real-Time Debug Traces
Program (instruction) trace: highly compressed branch messaging (with history); program correlation (events "asynchronous" to program flow)
Data trace: traces data writes (only store instructions supported); address and data values are transmitted
Ownership (process ID) trace: tracks task switches and logical-partition switches
Data acquisition trace (DQM) and data acquisition events (DQE): software can generate trace messages for values of interest (custom data logging); requires instrumentation of the code (mtspr instructions); DQEs are used for trace control, performance counters, and cross-triggers
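As a concrete illustration of the DQM instrumentation mentioned above, the sketch below wraps the mtspr to the data acquisition register for use in application code. The SPR number is an assumption (0x240 / 576, matching SPRN_DDAM in the Linux Book E register headers); confirm it against the e500mc core reference manual for your silicon.

```c
/* Sketch of DQM instrumentation: software writes a value of interest to the
 * core's data acquisition SPR with mtspr, and the Nexus unit emits it as a
 * data acquisition trace message. SPR_DDAM below is an assumed value. */
#define SPR_DDAM 576   /* assumption: confirm against the core manual */

static inline void dqm_log(unsigned long value)
{
    __asm__ __volatile__("mtspr %0, %1" : : "i"(SPR_DDAM), "r"(value));
}

/* Usage: drop markers into code paths you want to correlate with the trace;
 * the cost is a single mtspr per logged value. */
void process_packet(unsigned long pkt_id)
{
    dqm_log(pkt_id);    /* pkt_id shows up in the Nexus trace stream */
    /* ... packet processing ... */
}
```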
CoreNet Debug
Provides visibility into CoreNet transactions: CoreNet address messages; CoreNet data messages; CoreNet watchpoint messages (event logging); optional timestamping
Supports filtering of accesses of interest: data addresses and attributes; data value compare; transaction marking
Supports performance event muxing: CoreNet performance events; platform (L3) cache events
Supported by software analysis tools for internal use
DataPath Debug
Provides visibility into queuing operations in the architected Queue Manager (QM): frame queue ID, operation type (enqueue/dequeue), channel number, portal number, portal type
Provides visibility into debug context info within each Frame Manager (FM): trace data from each processing stage (BMI, QMI, KeyGen, ...); sophisticated frame comparators for each sub-block; fine-granularity frame-tracing control
Packet-marking capability: trace only packets of the required flows
Aggregates performance events from all Data Path blocks (FM, QM, BM, PME, SEC) and forwards them to the EPU
Supported by software analysis tools
QorIQ Debug Enablement Tools
Enablement Tools
Scenarios Tool - Overview
Optimized workflow for efficiently narrowing down performance issues anywhere on the system
Customer benefits: system optimization for cores and SoC; complexity abstraction; delivers Freescale expertise to users; ease of use; streamlined to solve several performance issues
Key features: stand-alone (no CodeWarrior needed); performance analysis including visualization; connection auto-discovery; "canned" measurement scenarios (100+ scenarios covering core and SoC blocks); user-defined measurement scenarios; compare pairs of runs; graphically visualize all measurements; "live" view of events and metrics; supports "bare metal" or Linux applications; Python scripting support
Devices supported: P2040, P3041, P5020, P5021, P5040, P4040, P4080, T2080, T4240, B4860, and future Layerscape devices
Scenarios Tool - Available Scenarios
Approximately 50-100 scenarios are available depending on the target

Scenario Group        | What can I measure?
CPU - Utilization     | Branch misses, interrupt counts, CPU usage
CPU - Cache           | Cache misses (data/instruction L1, data/instruction backside L2)
CPU - MMU             | TLB4K reloads, VSP reloads, L2 MMU misses
CPU - Core Complex    | Core complex traffic
CPU - Load Store Unit | Data Line Fill Buffer (DLFB) misses
DDR                   | DDR traffic, page misses, collisions
CoreNet               | CoreNet traffic
DPAA QMan             | QMan dequeue and/or enqueue counts
DPAA SEC              | Security engine utilization
OCeaN                 | DMA performance
Combination           | Broad measurements over the whole system
Performance optimization using counters
IPSEC processing pipeline
[T4240 pipeline diagram: MAC -> CPU -> crypto processor (SEC) -> CPU -> MAC, with three DDR controllers (DDRC1-3) backing DDR memory]
Problem Symptom
Very low IPSEC throughput: 2.6 Mpps, much lower than the SEC's peak throughput of > 10 Mpps
Frames accumulate at the SEC ingress interface, which means the SEC is the bottleneck
Analyzing SEC performance with the Scenarios Tool

Deco   | Utilization (%)
Deco 1 | 90
Deco 2 | 88
Deco 3 | 86
Deco 4 | 83
Deco 5 | 77
Deco 6 | 62
Deco 7 | 25
Deco 8 | 17

SEC DMA wait cycles / second: 177,352,672
Gross Deco utilization: 66%
The SEC is under-utilized due to slow DMA
Analyzing DDR controller performance
DDR data bus utilization: DDRC1 = 0.02%, DDRC2 = 0.05%, DDRC3 = 33%
DMA is slow because only a single DDRC is in use
Check the DDRC interleaving setting
After we enable interleaving...

Metric                       | Old                                    | New
Gross Deco utilization (%)   | 66                                     | 90.31
DDR data bus utilization     | DDRC1 0.02%, DDRC2 0.05%, DDRC3 33%    | DDRC1 15.11%, DDRC2 14.9%, DDRC3 14.8%
SEC DMA wait cycles / second | 177,352,672                            | 20,728,328
IPSEC throughput             | 2.6 Mpps                               | 8.3 Mpps
Cache debug using bus traces
System Setup
B4860-based system
One app thread per PowerPC and DSP core
Apps exchange IPC messages over frame queues (FQs)
The PowerPC app pre-sets a pattern in the messages; the DSP increments the pattern and reflects it back; the PowerPC app verifies the received pattern
Works fine if DSP stashing is OFF
[Diagram: StarCore DSP and PowerPC e6500 core (32 KB I/D caches, 128 KB backside L2) connected via CoreNet/CPC to QMan and DDR]
Pattern sent by PowerPC:      1000 2000 3000 4000 5000 6000 7000 8000
Expected pattern at PowerPC:  1001 2001 3001 4001 5001 6001 7001 8001
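A hypothetical sketch of the verification step in this setup (the original application code is not shown in the deck, so the message layout and names here are illustrative): the PowerPC side checks that every word of the reflected 64-byte message was incremented by one and reports the byte offset of the first mismatch, which is what exposed the half-cache-line signature described next.

```c
/* Illustrative reconstruction of the pattern check: one 64-byte message
 * (a full cache line) is assumed to carry eight 64-bit words, and the DSP
 * is expected to add 1 to every word before reflecting the message. */
#include <stdio.h>
#include <stdint.h>

#define MSG_WORDS 8                     /* 8 x 8 bytes = 64-byte cache line */

/* Returns the byte offset of the first bad word, or -1 if the reply is good.
 * An offset >= 32 means the corruption sits in the second half of the cache
 * line, which is the signature seen in this case study. */
static int verify_reply(const uint64_t sent[MSG_WORDS],
                        const uint64_t rcvd[MSG_WORDS])
{
    for (int i = 0; i < MSG_WORDS; i++)
        if (rcvd[i] != sent[i] + 1)
            return i * (int)sizeof(uint64_t);
    return -1;
}

int main(void)
{
    /* Values from the slide: sent 1000..8000, good reply would be 1001..8001. */
    uint64_t sent[MSG_WORDS] = {1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000};
    uint64_t bad_reply[MSG_WORDS] = {1001, 2001, 3001, 4001,
                                     5000, 6000, 7000, 0xdead}; /* stale half */
    int off = verify_reply(sent, bad_reply);
    if (off >= 0)
        printf("corruption at byte offset %d (%s half of the line)\n",
               off, off >= 32 ? "second" : "first");
    return 0;
}
```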
Problem Description
Pattern sent by PowerPC:                1000 2000 3000 4000 5000 6000 7000 8000
Expected pattern back at PowerPC:       1001 2001 3001 4001 5001 6001 7001 8001
Pattern sometimes received at PowerPC:  1001 2001 3001 4001 5000 6000 7000 xxxx
Problem: with DSP stashing ON, an unexpected pattern is sometimes received on the PowerPC. The corruption is always at the half cache line boundary (i.e. the last 32 bytes are bad). The pattern gets corrupted after the DSP sends it.
Suspect areas:
Application code
Any other DMA or I/O corrupting the messages (ruled out by turning them off and placing IOMMU restrictions)
Cache coherency settings (reviewed again...)
Debug using CoreNet traces
Impractical to reproduce the problem on an emulator
Collected a CoreNet trace for transactions on the message buffers: for the corrupt message, the DSP cache did a bad CASTOUT of half-cache-line-size data
What triggers a bad CASTOUT? Stashing gives a clue: the bad CASTOUT happens when there is a STASH at the same cache set
Hardware team confirms an RTL issue: a STASH triggers a half-line-size CASTOUT
Workaround: prevent the CASTOUT due to the STASH transaction by cache-locking the addresses which are stashed

Message # 41829
MSGTYPE   : CoreNet Address Message
SourceID  : 0x3c (PowerPC)
TYPE      : 0xd (STASH and WRITE)
SIZE      : 8 (64 bytes)
QUALIFIER : 0x2 (WIMG)
TIMESTAMP : 0 (0x0)
F-ADDR    : 0x0000000f:f6055240

Message # 41832
MSGTYPE   : CoreNet Address Message
SourceID  : 0x31 (DSP cache)
TYPE      : 0x10 (CASTOUT)
SIZE      : 0 (reserved - this is bad)
QUALIFIER : 0x0 (WIMG)
TIMESTAMP : 0 (0x0)
F-ADDR    : 0x00000000:60195240

The STASH and CASTOUT addresses fall in the same cache set
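A minimal sketch of the workaround mentioned above, assuming the stashed message buffers can be locked into the data cache with the Book E dcbtls instruction. This is an illustration of the idea, not the exact code from the case study; cache locking requires the appropriate privilege and enable bits (see the core reference manual).

```c
/* Lock the stashed message buffers into the cache so that the STASH
 * transaction can no longer trigger the bad CASTOUT. dcbtls ("data cache
 * block touch and lock set") with CT=0 targets the L1 data cache. */
static inline void lock_cache_line(const void *addr)
{
    __asm__ __volatile__("dcbtls 0, 0, %0" : : "r"(addr) : "memory");
}

static void lock_stashed_buffer(const void *buf, unsigned long len)
{
    const char *p = (const char *)buf;
    /* 64-byte cache lines on e500mc/e6500-class cores */
    for (unsigned long off = 0; off < len; off += 64)
        lock_cache_line(p + off);
}
```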
What we learnt
The integrated nature of multicore processors poses a debugging challenge, since it hides the interconnections of the accelerators from the software domain
The QorIQ Debug Architecture enables non-intrusive debugging
Debug tools for hardware debugging: Scenarios tool, Packet Trace tool, SPID libs
A viable alternative to emulation