2 Prerequisite
All PCIe register addresses in this document are given as offsets from the base of the structure that contains the register.
Please complete the following training: https://sharedspaces.intel.com/sites/vlc/SitePages/Course.aspx?c=364#
3 Major PCIe Error Status Registers
Your test content must check the following registers, and your debug should start with them for triage.
Type 0/1 Common Configuration Space
  - PCISTS, offset 0x06
Type 1 Configuration Space
  - Secondary Status Register, offset 0x1E
PCIe Capability Structure
  - Device Status Register, offset 0x0A
Advanced Error Reporting Capability
  - Uncorrectable Error Status Register, offset 0x04
  - Correctable Error Status Register, offset 0x10
  - Root Error Status Register, offset 0x30
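Because every offset above is relative to its own structure, a triage script first has to add the base of that structure. The following is a minimal sketch of that arithmetic. The dictionary keys, the function name, and the example capability bases (0x40 and 0x100) are assumptions made for illustration; real bases must be discovered by walking the capability lists of the device under test.

```python
# Triage registers from the list above: name -> (structure, offset in structure).
TRIAGE_REGISTERS = {
    "PCISTS": ("common", 0x06),
    "SSTS": ("type1", 0x1E),
    "DSTS": ("pcie_cap", 0x0A),
    "UES": ("aer", 0x04),
    "CES": ("aer", 0x10),
    "RSTS": ("aer", 0x30),
}


def absolute_offsets(bases):
    """Map each register name to its configuration-space offset."""
    return {
        name: bases[structure] + offset
        for name, (structure, offset) in TRIAGE_REGISTERS.items()
    }


# Hypothetical bases: the headers start at 0x00; the capability bases are
# made-up example values, not values taken from any product.
example_bases = {"common": 0x00, "type1": 0x00, "pcie_cap": 0x40, "aer": 0x100}
offsets = absolute_offsets(example_bases)
```

With these example bases, DSTS lands at 0x4A and RSTS at 0x130.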
4 PCI Status (PCISTS) – offset 0x06 off Common Space
Bit[11]: Signaled Target Abort
Set when a Function completes a Posted or Non-Posted Request as a Completer Abort error. For a Function with a Type 1 Configuration header, this applies when the Completer Abort was generated by its Primary Side. (Root port: set whenever the root port forwards a target abort received from the downstream device onto the backbone.)
Bit[12]: Received Target Abort (CA)
Set when a Requester receives a Completion with Completer Abort Completion Status. On a Function with a Type 1 Configuration header, the bit is set when the Completer Abort is received by its Primary Side. (Root port: set when the root port receives a completion with completer abort from the backbone.)
Bit[13]: Received Master Abort (UR)
Set when a Requester receives a Completion with Unsupported Request Completion Status. On a Function with a Type 1 Configuration header, the bit is set when the Unsupported Request is received by its Primary Side. (Root port: set when the root port receives a completion with unsupported request status from the backbone.)
Bit[14]: Signaled System Error
Set when a Function sends an ERR_FATAL or ERR_NONFATAL Message and the SERR# Enable bit in the Command register is 1. (Root port: set when the root port signals a system error to the internal SERR# logic.)
Bit[15]: Detected Parity Error
Set by a Function whenever it receives a Poisoned TLP, regardless of the state of the Parity Error Response bit in the Command register. On a Function with a Type 1 Configuration header, the bit is set when the Poisoned TLP is received by its Primary Side. (Root port: set when the root port receives a command or data from the backbone with a parity error. This is set even if PCMD.PERE is not set.)
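During triage it is often easier to read bit names than a raw hex value. A minimal sketch of a decoder for the five PCISTS error bits listed above follows; the function and table names are illustrative, and how the 16-bit value is read from hardware is left out.

```python
# PCISTS error bits as listed in the table above.
PCISTS_ERROR_BITS = {
    11: "Signaled Target Abort",
    12: "Received Target Abort (CA)",
    13: "Received Master Abort (UR)",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}


def decode_pcists(value):
    """Return the names of the error bits set in a 16-bit PCISTS value."""
    return [
        name
        for bit, name in sorted(PCISTS_ERROR_BITS.items())
        if value & (1 << bit)
    ]
```

For example, `decode_pcists(0x2000)` reports only Received Master Abort (UR).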
5 Secondary Status (SSTS) – Status for downstream side of bridge – offset 0x1E off Type 1 Space
Bit[11]: Signaled Target Abort
Set when the Secondary Side of a Type 1 Configuration Space header Function (for Requests completed by the Type 1 header Function itself) completes a Posted or Non-Posted Request as a Completer Abort error.
Bit[12]: Received Target Abort (CA)
Set when the Secondary Side of a Type 1 Configuration Space header Function (for Requests initiated by the Type 1 header Function itself) receives a Completion with Completer Abort Completion Status.
Bit[13]: Received Master Abort (UR)
Set when the Secondary Side of a Type 1 Configuration Space header Function (for Requests initiated by the Type 1 header Function itself) receives a Completion with Unsupported Request Completion Status.
Bit[14]: Received System Error
Set when the Secondary Side of a Type 1 Configuration Space header Function receives an ERR_FATAL or ERR_NONFATAL Message.
Bit[15]: Detected Parity Error
Set by the Secondary Side of a Type 1 Configuration Space header Function whenever it receives a Poisoned TLP, regardless of the state of the Parity Error Response Enable bit in the Bridge Control register.
6 Device Status (DSTS) – Offset 0x0A off PCIe Cap structure
Bit[0]: Correctable Error Detected
Indicates a correctable error was detected. Set on an internal correctable error from receiver errors / framing errors, TLP CRC errors, DLLP CRC errors, Replay Number Rollover, or Replay Timer Timeout.
Bit[1]: Non-Fatal Error Detected
Indicates a non-fatal error was detected. Set when a non-fatal error occurred from a poisoned TLP, unexpected completion, unsupported request, completer abort, or completion timeout. Note that this classification depends on the Uncorrectable Error Severity register configuration.
Bit[2]: Fatal Error Detected
Indicates a fatal error was detected. Set when a fatal error occurred from a data link protocol error, buffer overflow, or malformed TLP.
Bit[3]: Unsupported Request Detected
Set when the Function receives an Unsupported Request. The bit is logged regardless of whether error reporting is enabled in the Device Control register.
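A quick way to use DSTS in triage is to reduce bits [2:0] to the most severe class that was detected and treat bit 3 separately. The sketch below does that; the function names and the returned strings are illustrative choices, not part of any existing tool.

```python
def worst_severity(dsts):
    """Return the most severe error class flagged in DSTS[2:0], or None."""
    if dsts & (1 << 2):
        return "fatal"
    if dsts & (1 << 1):
        return "non-fatal"
    if dsts & (1 << 0):
        return "correctable"
    return None


def unsupported_request_detected(dsts):
    """Return True if DSTS[3] (Unsupported Request Detected) is set."""
    return bool(dsts & (1 << 3))
```

A value of 0x0B, for instance, decodes as non-fatal with an Unsupported Request also detected.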
7 Device Control (DCTL) – Offset 0x08 off PCIe Cap structure
Bit[0]: Correctable Error Reporting Enable
When set, enables signaling of ERR_COR to the Root Control register due to internally detected errors or error messages received across the link. Other bits also control the full scope of related error reporting.
Bit[1]: Non-Fatal Error Reporting Enable
When set, enables signaling of ERR_NONFATAL to the Root Control register due to internally detected errors or error messages received across the link. Other bits also control the full scope of related error reporting.
Bit[2]: Fatal Error Reporting Enable
When set, enables signaling of ERR_FATAL to the Root Control register due to internally detected errors or error messages received across the link. Other bits also control the full scope of related error reporting.
Bit[3]: Unsupported Request Reporting Enable
When set, allows signaling of ERR_NONFATAL, ERR_FATAL, or ERR_COR to the Root Control register when an unmasked Unsupported Request (UR) is detected. ERR_COR is signaled when an unmasked Advisory Non-Fatal UR is received. ERR_FATAL or ERR_NONFATAL is sent to the Root Control register when an uncorrectable, non-advisory UR is received, with the severity set by the Uncorrectable Error Severity register.
8 Root Control (RCTL) – Offset 0x1C off PCIe Cap structure
Bit[0]: System Error on Correctable Error Enable
When set, an SERR# is generated if a correctable error is reported by any of the devices in the hierarchy of this root port, including correctable errors in this root port.
Bit[1]: System Error on Non-Fatal Error Enable
When set, an SERR# is generated if a non-fatal error is reported by any of the devices in the hierarchy of this root port, including non-fatal errors in this root port.
Bit[2]: System Error on Fatal Error Enable
When set, an SERR# is generated if a fatal error is reported by any of the devices in the hierarchy of this root port, including fatal errors in this root port.
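The three RCTL enables above map one-to-one onto the three error classes. A minimal sketch of that gating, assuming the caller already knows the class of the reported error, could look like this; the names are illustrative.

```python
# RCTL bit that enables SERR# for each reported error class (table above).
RCTL_SERR_ENABLE_BIT = {"correctable": 0, "non-fatal": 1, "fatal": 2}


def serr_on_error(rctl, severity):
    """Return True if RCTL enables SERR# for an error of this class."""
    return bool(rctl & (1 << RCTL_SERR_ENABLE_BIT[severity]))
```

With RCTL = 0x4, only fatal errors reported in the hierarchy raise SERR#.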
9 Correctable Error Status (offset 0x10) off AER
Bit[0]: Receiver Error
The Physical Layer detected an error in the incoming packet. The packet is discarded at the Physical Layer, any buffer space allocated to it is released, and the Link Layer is informed that a receive error occurred.
Bit[6]: Bad TLP
The Data Link Layer detected a packet with a bad LCRC, an out-of-sequence sequence number, or an incorrectly nullified packet. In each case, the Link Layer discards the packet and sends a Nak DLLP to the transmitter to trigger a TLP replay.

ERROR TYPE                 Bit   SKL CPU family   TGL, ICL-H (SIP16-Brooks)   PCH family
Advisory Non-Fatal Error   13    Logged           Logged                      Logged
Replay Timer Timeout       12    Logged           Logged                      Logged
Replay Number Rollover     8     Logged           Logged                      Logged
Bad DLLP                   7     Logged           Logged                      Logged
Bad TLP                    6     Logged           Logged                      Logged
Receiver Error             0     Logged           Logged                      Logged
10 Correctable Error Status (offset 0x10) off AER
Bit[7]: Bad DLLP
The Data Link Layer detected a 16-bit CRC failure on an incoming DLLP, so the packet is dropped. A subsequent DLLP of the same type is expected to make up for the information it contained.
Bit[8]: Replay Number Rollover
If a replay happens four times, this bit is set and the device initiates link recovery.
Bit[12]: Replay Timer Timeout
At the Data Link Layer, transmitted TLPs have not received an acknowledgement (Ack or Nak) within the timeout period. Hardware automatically replays all unacknowledged TLPs, meaning all packets in the Replay Buffer.
Replay Timer Configuration
  - TGL-U: Offset 0x0300 PCIERTP1, Offset 0x0304 PCIERTP2, Offset 0x06A0 PCIERTP3, Offset 0x06A4 PCIERTP4
  - PCH: Offset 0x300 PCIERTP1, Offset 0x304 PCIERTP2
  - SKL CPU family: Offset 0x238[10:0] LLTC.RT
Bit[13]: Advisory Non-Fatal Error Status
If the Uncorrectable Error Severity bit for the error is 0 (non-fatal), this bit is set to let software know that a non-fatal error occurred.
11 Uncorrectable Error Status (offset 0x04) off AER

ERROR TYPE                           Bit   SKL CPU family   TGL, ICL-H (SIP16-Brooks)       PCH family
Poisoned TLP Egress Blocked Status   26    Not Supported    Logged if DPCCAPR.PTLPEBS = 1   Logged if DPCCAPR.PTLPEBS = 1
TLP Prefix Blocked Error             25    Not Supported    Not Supported                   Not Supported
AtomicOp Egress Blocked Status       24    Not Supported    Logged                          Not Supported
ACS Violation Status                 21    Logged           Logged                          Logged
Unsupported Request Error            20    Logged           Logged                          Logged
ECRC Error                           19    Not Supported    Not Supported                   Not Supported
Malformed TLP                        18    Logged           Logged                          Logged
Receiver Overflow                    17    Logged           Logged                          Logged
Unexpected Completion                16    Logged           Logged                          Logged
Completer Abort                      15    Not Supported    Logged                          Logged
Completion Timeout                   14    Logged           Logged                          Logged
Flow Control Protocol Error          13    Not Supported    Not Supported                   Not Supported
Poisoned TLP                         12    Logged           Logged                          Logged
Surprise Down Error Status           5     Not Supported    Not Supported                   Not Supported
Data Link Protocol Error             4     Logged           Logged                          Logged
Training Error                       0     Not Supported    Not Supported                   Not Supported
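The support matrix above can double as a sanity check: a status bit that is set on a platform where the table says "Not Supported" deserves a second look at the read itself. The sketch below encodes the table and splits a UES value into expected and unexpected bits. The platform keys ("skl", "tgl", "pch") and all function and table names are assumptions for illustration, bit 0 for Training Error is taken from the AER register layout, and the DPCCAPR.PTLPEBS condition on bit 26 is noted only as a comment, not modeled.

```python
# bit: (name, SKL CPU family, TGL/ICL-H, PCH family); True means "Logged".
UES_SUPPORT = {
    26: ("Poisoned TLP Egress Blocked", False, True, True),  # needs DPCCAPR.PTLPEBS = 1
    25: ("TLP Prefix Blocked Error", False, False, False),
    24: ("AtomicOp Egress Blocked", False, True, False),
    21: ("ACS Violation", True, True, True),
    20: ("Unsupported Request Error", True, True, True),
    19: ("ECRC Error", False, False, False),
    18: ("Malformed TLP", True, True, True),
    17: ("Receiver Overflow", True, True, True),
    16: ("Unexpected Completion", True, True, True),
    15: ("Completer Abort", False, True, True),
    14: ("Completion Timeout", True, True, True),
    13: ("Flow Control Protocol Error", False, False, False),
    12: ("Poisoned TLP", True, True, True),
    5: ("Surprise Down Error", False, False, False),
    4: ("Data Link Protocol Error", True, True, True),
    0: ("Training Error", False, False, False),
}
PLATFORM_COLUMN = {"skl": 1, "tgl": 2, "pch": 3}


def decode_ues(value, platform):
    """Split the set UES bits into (logged, unexpected) name lists."""
    column = PLATFORM_COLUMN[platform]
    logged, unexpected = [], []
    for bit, row in sorted(UES_SUPPORT.items()):
        if value & (1 << bit):
            (logged if row[column] else unexpected).append(row[0])
    return logged, unexpected
```

For example, a value with bits 15 and 20 set decodes cleanly for "tgl" but flags Completer Abort as unexpected for "skl".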
12 Uncorrectable Error Status (offset 0x04) off AER
Bit[26]: Poisoned TLP Egress Blocked Status
Tied to DPC being enabled. The Root Port blocks the transmission of a poisoned TLP from its Egress Port.
Bit[24]: AtomicOp Egress Blocked Status
Egress Ports of routing elements can be programmed to block AtomicOps from being forwarded to agents that should not see them.
Bit[20]: Unsupported Request
If a receiver does not support a Request, it returns a Completion with UR status and logs this bit.
What can be an Unsupported Request?
  - A Message with an unsupported or undefined message code
  - A Request that does not reference address space mapped to the device
  - A Type 1 configuration Request received at an endpoint
  - A function (device) in D1, D2, or D3hot receives a Request other than a configuration request or Message
  - A TLP with No_Snoop = 0 in its header is routed to a port that has the Reject Snoop Transaction bit = 1 in the VC Resource Capability register
The receiver responds with a Completion with UR status.
13 Uncorrectable Error Status (offset 0x04) off AER
Bit[18]: Malformed TLP
Set when a check finds a violation of the TLP packet formatting rules:
  - Data payload exceeds MPS
  - Data length does not match the length specified in the header
  - Memory start address and length combine to cause a transaction to cross a naturally aligned 4KB boundary
  - TLP digest (TD field) indication does not correspond with packet size (ECRC is unexpectedly missing or present)
  - Byte Enable violation
  - Undefined Type field values
  - Completion that violates the Read Completion Boundary (RCB) value
  - Completion with Configuration Request Retry Status in response to a Request other than a configuration request
  - Traffic Class field contains a value not assigned to an enabled Virtual Channel (this is also known as TC filtering)
  - I/O and Configuration Request violations. Example: the TC field, Attr[1:0], and the AT field must all be zero, while the Length field must have a value of one
  - Interrupt emulation message sent downstream
  - TLP received with a TLP prefix error
      - TLP prefix but no TLP header
      - End-to-End TLP Prefixes preceding Local Prefixes
      - Local TLP Prefix type not supported
      - More than 4 End-to-End TLP Prefixes
      - More End-to-End TLP Prefixes than are supported
14 Uncorrectable Error Status (offset 0x04) off AER
  - Transaction type requiring use of TC0 has a different TC value
      - I/O Read or Write Requests and corresponding Completions
      - Configuration Read or Write Requests and corresponding Completions
      - Error Messages
      - INTx messages
      - Power Management messages
      - Unlock messages
      - Slot Power messages
      - LTR messages
      - OBFF messages
  - AtomicOp operand does not match an architected value
  - AtomicOp address is not naturally aligned with operand size
  - Routing is incorrect for the transaction type (e.g., transactions requiring routing to the Root Complex detected moving away from the Root Complex)
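Three of the Malformed TLP rules listed for bit 18 are plain arithmetic and are easy to check when generating or reviewing test traffic: payload versus MPS, the 4KB boundary rule, and the TC0-only transaction types. The sketch below covers only those three. The function names and the short strings used for transaction kinds are hypothetical labels chosen for this example.

```python
# Transaction kinds that must use TC0, using illustrative labels.
TC0_ONLY_KINDS = {
    "io", "config", "error_message", "intx", "power_management",
    "unlock", "slot_power", "ltr", "obff",
}


def exceeds_mps(payload_bytes, mps_bytes):
    """True if the data payload is larger than Max Payload Size."""
    return payload_bytes > mps_bytes


def crosses_4k_boundary(address, length_bytes):
    """True if [address, address + length_bytes) spans an aligned 4KB boundary."""
    if length_bytes <= 0:
        raise ValueError("length_bytes must be positive")
    first_page = address // 4096
    last_page = (address + length_bytes - 1) // 4096
    return first_page != last_page


def tc0_violation(kind, traffic_class):
    """True if a TC0-only transaction kind carries a non-zero TC."""
    return kind in TC0_ONLY_KINDS and traffic_class != 0
```

A 16-byte access starting at 0xFF0 ends exactly at the boundary and is legal; making it 17 bytes crosses into the next 4KB page.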
15 Uncorrectable Error Status (offset 0x04) off AER
Bit[17]: Receiver Overflow Status
More TLPs have arrived than the Receive buffer had room to accept.
When can this error occur?
  - The remote device does not adhere to the flow control rules
Bit[16]: Unexpected Completion Status
The Requester receives a Completion that does not match any Request that is awaiting a Completion.
  - Mismatched Requester ID between Request and Completion
  - Mismatched Tag number between Request and Completion
Bit[15]: Completer Abort
What can be a Completer Abort?
  - If the Completer of an AtomicOp Request encounters an uncorrectable error accessing the target location or carrying out the Atomic operation, the Completer must handle it as a Completer Abort.
  - The Completer receives a Request that it cannot process because of some permanent error condition in the device. For example, a wireless LAN card that will not accept a new packet because it cannot transmit or receive over its radio until an approved antenna is attached.
The receiver responds with a Completion with CA status.
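The Unexpected Completion rule above is a lookup: a Completion is expected only if its Requester ID and Tag match a Request that is still outstanding. A minimal sketch of that check follows, assuming the outstanding Requests are tracked as a set of (requester_id, tag) pairs; that data structure and the example values are illustrative, not how any particular device implements it.

```python
def is_unexpected_completion(outstanding, requester_id, tag):
    """True if no outstanding Request matches this Completion."""
    return (requester_id, tag) not in outstanding


# Hypothetical state: two reads in flight from Requester ID 0x0100.
outstanding_requests = {(0x0100, 0x05), (0x0100, 0x06)}
```

Here a Completion for tag 0x07, or one carrying a different Requester ID, would be flagged as unexpected.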
16 Uncorrectable Error Status (offset 0x04) off AER
Bit[14]: Completion Timeout Status
For a pending Request that never receives the Completion it is expecting, the spec defines a Completion Timeout mechanism. This bit is set if completion timeout is enabled and a Completion fails to return within the time specified by the Completion Timeout Value.
  - AECC[12] – Completion Timeout Prefix/Header Log Capable
  - DCTL2[4] – Completion Timeout Disable
  - DCTL2[3:0] – Completion Timeout Value
Bit[12]: Poisoned TLP
Data poisoning, also called "Error Forwarding", lets a device indicate that the data associated with a TLP is corrupted. In any write request with data or completion with data, the sender can set the EP bit in the TLP header to flag corrupted data. When a device receives a TLP with EP set in the header, bit 12 is set in the Uncorrectable Error Status register.
Why would a sender send a TLP that is already known to be bad?
Suppose a request results in a Completion with data, but the data encountered an error while it was gathered from the target, such as an ECC error or parity error in memory. If the Completion is not returned, the requester gets a Completion Timeout. If instead the Completion is delivered with the poisoned bit set, the requester can at least see that the path to the target is good, which is better than a timeout.
  - The requester may be able to accept data with errors, for example audio streaming.
  - A device might have a means of correcting the data.
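When a Completion Timeout is under investigation, the two DCTL2 fields named above are the first thing to check. A minimal sketch that extracts them from a raw DCTL2 value follows; the function name and dictionary keys are illustrative, and the mapping from the 4-bit value to a time range is deliberately left out.

```python
def completion_timeout_fields(dctl2):
    """Extract Completion Timeout Disable (bit 4) and Value (bits 3:0)."""
    return {
        "disabled": bool(dctl2 & (1 << 4)),
        "value": dctl2 & 0xF,
    }
```

For example, DCTL2 = 0x16 decodes as timeout disabled with a programmed value of 0x6.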
17 Uncorrectable Error Status (offset 0x04) off AER
Bit[4]: Data Link Protocol Error Status
Caused by Data Link Layer protocol errors, including errors in the Ack/Nak retry mechanism. For example, a transmitter receives an Ack or Nak whose sequence number does not correspond to an unacknowledged TLP or to the ACKD_SEQ number.
Example: the transmitter sends a TLP with Seq # = 0x2A and receives an Ack with Seq # = 0x29 or less.
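The Ack/Nak rule above can be modeled as a range check on 12-bit sequence numbers that wrap at 4096: an Ack or Nak is acceptable if it names ACKD_SEQ itself or any TLP sent after it that is still unacknowledged. The sketch below is a direct model of that statement, not a transcription of the check as worded in the specification. The parameter `next_transmit_seq` (the sequence number the next new TLP would get) and the example values are assumptions made for illustration.

```python
SEQ_MODULO = 4096  # TLP sequence numbers are 12 bits wide


def ack_is_valid(acknak_seq, ackd_seq, next_transmit_seq):
    """True if acknak_seq is ACKD_SEQ or names an unacknowledged TLP."""
    # Number of TLPs sent after ACKD_SEQ that are still unacknowledged.
    outstanding = (next_transmit_seq - 1 - ackd_seq) % SEQ_MODULO
    # How far the received sequence number is ahead of ACKD_SEQ.
    distance = (acknak_seq - ackd_seq) % SEQ_MODULO
    return distance <= outstanding
```

With ACKD_SEQ = 0x10 and TLPs 0x11 through 0x13 in flight, an Ack for 0x12 is fine, while an Ack for 0x14 or 0x0F would be a Data Link Protocol Error under this model.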
18 Root Error Status (RSTS) – Offset 0x30 off AER
Bit[0]: ERR_COR Received
Set when a correctable error message is received.
Bit[1]: Multiple ERR_COR Received
Set when a correctable error message is received and bit 0 is already set.
Bit[2]: ERR_FATAL/NONFATAL Received
Set when either a fatal or a non-fatal error message is received.
Bit[3]: Multiple ERR_FATAL/NONFATAL Received
Set when either a fatal or a non-fatal error message is received and bit 2 is already set.
Bit[4]: First Uncorrectable Fatal
Set when the first Uncorrectable Error message received is for a fatal error.
Bit[5]: Non-Fatal Error Message Received
Set when one or more Non-Fatal Uncorrectable error messages have been received.
Bit[6]: Fatal Error Message Received
Set when one or more Fatal Uncorrectable error messages have been received.
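The "multiple" and "first" bits above depend on the order in which messages arrive, which is easier to see in a small model. The sketch below applies the bit descriptions literally to a sequence of received messages. It is a software model for reasoning about expected values, not a description of any hardware implementation, and the constant and function names are illustrative.

```python
ERR_COR_RCVD = 1 << 0
MULTI_ERR_COR_RCVD = 1 << 1
ERR_FATAL_NONFATAL_RCVD = 1 << 2
MULTI_ERR_FATAL_NONFATAL_RCVD = 1 << 3
FIRST_UNCORRECTABLE_FATAL = 1 << 4
NONFATAL_MSG_RCVD = 1 << 5
FATAL_MSG_RCVD = 1 << 6


def receive_error_message(rsts, message):
    """Return the RSTS value after one more error message is received."""
    if message == "ERR_COR":
        rsts |= MULTI_ERR_COR_RCVD if rsts & ERR_COR_RCVD else ERR_COR_RCVD
    elif message in ("ERR_NONFATAL", "ERR_FATAL"):
        if rsts & ERR_FATAL_NONFATAL_RCVD:
            rsts |= MULTI_ERR_FATAL_NONFATAL_RCVD
        else:
            rsts |= ERR_FATAL_NONFATAL_RCVD
            # Bit 4 is only set when the first uncorrectable message is fatal.
            if message == "ERR_FATAL":
                rsts |= FIRST_UNCORRECTABLE_FATAL
        rsts |= FATAL_MSG_RCVD if message == "ERR_FATAL" else NONFATAL_MSG_RCVD
    else:
        raise ValueError("unknown message: " + message)
    return rsts
```

Starting from 0, an ERR_NONFATAL followed by an ERR_FATAL yields 0x6C under this model: bits 2, 3, 5, and 6 are set, but bit 4 is not, because the first uncorrectable message was non-fatal.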
19 Root Error Command (REC) – Offset 0x2C off AER
Bit[0]: Correctable Error Reporting Enable
When set, the root port generates an interrupt when a correctable error is reported by the attached device.
Bit[1]: Non-Fatal Error Reporting Enable
When set, the root port generates an interrupt when a non-fatal error is reported by the attached device.
Bit[2]: Fatal Error Reporting Enable
When set, the root port generates an interrupt when a fatal error is reported by the attached device.
20 Error Message Generated by an Endpoint Device
How can we debug in the following scenarios?
  - The root complex gets a completion timeout
  - We suspect something is wrong downstream, where an endpoint gets errors
In these scenarios, an endpoint device can send an error message upstream: a correctable error (ERR_COR), non-fatal error (ERR_NONFATAL), or fatal error (ERR_FATAL) message. We can therefore trigger a PCIe LA or protocol analyzer on any upstream error message to identify which downstream traffic causes the error.
Configuration on an endpoint device
  - PCICMD[8] = 1: when set, this bit enables upstream reporting of Non-Fatal and Fatal errors detected by the Function
  - DCTL[2:0] = 0x7: enables Correctable, Non-Fatal, and Fatal Error Reporting
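The endpoint configuration above amounts to a read-modify-write of two registers. A minimal sketch of the value computation follows; it only calculates the new register values and leaves the actual configuration reads and writes to whatever access tool is in use. The function and constant names are illustrative.

```python
PCICMD_SERR_ENABLE = 1 << 8          # PCICMD[8]
DCTL_ERROR_REPORTING_ENABLES = 0x7   # DCTL[2:0]


def enable_endpoint_error_reporting(pcicmd, dctl):
    """Return (new_pcicmd, new_dctl) with upstream error reporting enabled."""
    return pcicmd | PCICMD_SERR_ENABLE, dctl | DCTL_ERROR_REPORTING_ENABLES
```

For example, PCICMD = 0x0006 and DCTL = 0x2810 become 0x0106 and 0x2817, leaving all other bits unchanged.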
21 One Confusion for Most of Us in General
Scenario: a MemRd with an invalid address is answered by a Completion with UR. Here the Endpoint is the Requester and the Root Complex is the Completer.
Where is the Completion UR recorded in Uncorrectable Error Status, and which other error status bits are set?
Root Complex (root port)
  - UES[20]: URE = 1
  - PCISTS[13]: Received Master Abort = 0
  - SSTS[13]: Received Master Abort = 0
Endpoint
  - UES[20]: URE = 0
  - PCISTS[13]: Received Master Abort = 1
22 One Confusion for Most of Us in General
Scenario: a MemRd with an invalid address is answered by a Completion with UR. Here the Root Complex is the Requester and the Endpoint is the Completer.
Where is the Completion UR recorded in Uncorrectable Error Status, and which other error status bits are set?
Root Complex (root port)
  - UES[20]: URE = 0
  - PCISTS[13]: Received Master Abort = 0
  - SSTS[13]: Received Master Abort = 1
Endpoint
  - UES[20]: URE = 1
  - PCISTS[13]: Received Master Abort = 0
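The two register snapshots above follow one pattern: UES[20] is logged on the side that detected the Unsupported Request (the Completer), and Received Master Abort is logged on the side that received the Completion with UR (the Requester), in SSTS for the root port and in PCISTS for the endpoint. The sketch below turns that reading of the snapshots into a small classifier. The function name, argument names, and returned labels are illustrative, and the pattern is an interpretation of the two cases shown here, not a general rule for every topology.

```python
def ur_roles(rp_ues20, rp_ssts13, ep_ues20, ep_pcists13):
    """Infer who requested and who completed from the UR-related bits."""
    if rp_ues20 and ep_pcists13:
        return {"requester": "endpoint", "completer": "root port"}
    if ep_ues20 and rp_ssts13:
        return {"requester": "root port", "completer": "endpoint"}
    return None
```

Feeding in the first snapshot (root port UES[20] = 1, endpoint PCISTS[13] = 1) identifies the endpoint as the Requester; the second snapshot identifies the root port.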