2. Manufacturing Validation Engineering (MVE)
Delivering Quality Products That Delight Our Customers
Prerequisite
All PCIe register addresses in this material are given as offsets from the base of the capability structure that contains them.
Please complete the following training first:
https://sharedspaces.intel.com/sites/vlc/SitePages/Course.aspx?c=364#
Major PCIe Error Status Registers
Your test content must check the following registers
Your debug should start with the following registers for triage
Type0/1 Common Configuration Space
– PCISTS offset 0x06
Type1 Configuration Space
– Secondary Status Register offset 0x1E
PCIe Capability Structure
– Device Status Register offset 0x0A
Advanced Error Reporting Capability
– Uncorrectable Error Status Register offset 0x04
– Correctable Error Status Register offset 0x10
– Root Error Status Register offset 0x30
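Because every offset in this list is relative to its own structure, software first has to find where each structure starts. The following is a minimal sketch of that lookup, assuming the function's configuration space has already been read into a byte buffer (how it is read is platform-specific and not shown). The function names are illustrative. The capability IDs (0x10 for the PCIe Capability, 0x0001 for the AER extended capability) come from the PCIe Base Specification.

```python
import struct

PCIE_CAP_ID = 0x10        # PCI Express Capability (standard capability list)
AER_EXT_CAP_ID = 0x0001   # Advanced Error Reporting (extended capability list)


def find_capability(cfg, cap_id):
    """Walk the standard capability list; return the structure base or None."""
    status = struct.unpack_from("<H", cfg, 0x06)[0]
    if not status & (1 << 4):          # PCISTS bit 4: Capabilities List present
        return None
    ptr = cfg[0x34] & 0xFC             # Capabilities Pointer
    seen = set()
    while ptr and ptr not in seen and ptr + 1 < len(cfg):
        seen.add(ptr)                  # guard against a looping list
        if cfg[ptr] == cap_id:
            return ptr
        ptr = cfg[ptr + 1] & 0xFC      # Next Capability Pointer
    return None


def find_ext_capability(cfg, cap_id):
    """Walk the extended capability list, which starts at offset 0x100."""
    ptr = 0x100
    seen = set()
    while ptr and ptr not in seen and ptr + 4 <= len(cfg):
        seen.add(ptr)
        header = struct.unpack_from("<I", cfg, ptr)[0]
        if header in (0, 0xFFFFFFFF):  # no extended capabilities
            return None
        if header & 0xFFFF == cap_id:
            return ptr
        ptr = (header >> 20) & 0xFFC   # Next Capability Offset
    return None
```

With the bases in hand, the Device Status Register is at `find_capability(cfg, PCIE_CAP_ID) + 0x0A` and the Uncorrectable Error Status Register is at `find_ext_capability(cfg, AER_EXT_CAP_ID) + 0x04`.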
PCI Status (PCISTS) – offset 0x06 off Common Configuration Space
Bit[11]: Signaled Target Abort
• Set when a Function completes a Posted or Non-Posted Request as a Completer Abort error.
• On a Function with a Type 1 Configuration header, this applies when the Completer Abort was generated by its Primary Side. (Root port: set whenever the root port forwards a target abort received from the downstream device onto the backbone.)
Bit[12]: Received Target Abort (CA)
• Set when a Requester receives a Completion with Completer Abort Completion Status.
• On a Function with a Type 1 Configuration header, the bit is set when the Completer Abort is received by its Primary Side. (Root port: set when the root port receives a completion with completer abort status from the backbone.)
Bit[13]: Received Master Abort (UR)
• Set when a Requester receives a Completion with Unsupported Request Completion Status.
• On a Function with a Type 1 Configuration header, the bit is set when the Unsupported Request is received by its Primary Side. (Root port: set when the root port receives a completion with unsupported request status from the backbone.)
Bit[14]: Signaled System Error
• Set when a Function sends an ERR_FATAL or ERR_NONFATAL Message and the SERR# Enable bit in the Command register is 1.
• Root port: set when the root port signals a system error to the internal SERR# logic.
Bit[15]: Detected Parity Error
• Set by a Function whenever it receives a Poisoned TLP, regardless of the state of the Parity Error Response bit in the Command register.
• On a Function with a Type 1 Configuration header, the bit is set when the Poisoned TLP is received by its Primary Side. (Root port: set when the root port receives a command or data from the backbone with a parity error, even if PCMD.PERE is not set.)
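For triage it helps to turn a raw PCISTS value into the names of the error bits that are set. A minimal sketch follows, using only the bit positions listed above. The function name is illustrative, and the 16-bit value is assumed to have been read from offset 0x06 by some platform-specific means.

```python
# Error bits of PCISTS as listed above (bit position -> name)
PCISTS_ERROR_BITS = {
    11: "Signaled Target Abort",
    12: "Received Target Abort (CA)",
    13: "Received Master Abort (UR)",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}


def decode_pcists(value):
    """Return the names of the PCISTS error bits set in a 16-bit value."""
    return [name for bit, name in sorted(PCISTS_ERROR_BITS.items())
            if value & (1 << bit)]
```

For example, `decode_pcists(0x2010)` reports only Received Master Abort (UR), because bit 4 (Capabilities List) is not an error bit.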
Secondary Status (SSTS) – status for the downstream side of the bridge – offset 0x1E off Type 1 Configuration Space
Bit[11]: Signaled Target Abort
Set when the Secondary Side of a Type 1 Configuration Space header Function completes a Posted or Non-Posted Request as a Completer Abort error (for Requests completed by the Type 1 header Function itself).
Bit[12]: Received Target Abort (CA)
Set when the Secondary Side of a Type 1 Configuration Space header Function receives a Completion with Completer Abort Completion Status (for Requests initiated by the Type 1 header Function itself).
Bit[13]: Received Master Abort (UR)
Set when the Secondary Side of a Type 1 Configuration Space header Function receives a Completion with Unsupported Request Completion Status (for Requests initiated by the Type 1 header Function itself).
Bit[14]: Received System Error
Set when the Secondary Side of a Type 1 Configuration Space header Function receives an ERR_FATAL or ERR_NONFATAL Message.
Bit[15]: Detected Parity Error
Set by the Secondary Side of a Type 1 Configuration Space header Function whenever it receives a Poisoned TLP, regardless of the state of the Parity Error Response Enable bit in the Bridge Control register.
Device Status (DSTS) – offset 0x0A off PCIe Capability structure
Bit[0]: Correctable Error Detected
Indicates that a correctable error was detected. Set on an internal correctable error such as a receiver error / framing error, TLP CRC error, DLLP CRC error, Replay Number Rollover, or Replay Timer Timeout.
Bit[1]: Non-Fatal Error Detected
Indicates that a non-fatal error was detected. Set on a non-fatal error such as a poisoned TLP, unexpected completion, unsupported request, completer abort, or completion timeout. Note that which of these are treated as non-fatal depends on the Uncorrectable Error Severity register configuration.
Bit[2]: Fatal Error Detected
Indicates that a fatal error was detected. Set on a fatal error such as a data link protocol error, buffer overflow, or malformed TLP.
Bit[3]: Unsupported Request Detected
Indicates that the Function received an Unsupported Request.
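Test content usually needs to clear these status bits before a run so that stale errors are not mistaken for new ones. In the PCIe Base Specification, DSTS bits 3:0 are write-1-to-clear (RW1C). The sketch below models that behavior in software. The helper names are illustrative, and the model assumes every bit outside the RW1C mask is read-only.

```python
DSTS_RW1C_MASK = 0x000F  # bits 3:0 of Device Status are write-1-to-clear


def clear_value(current, rw1c_mask=DSTS_RW1C_MASK):
    """Value to write back so that exactly the error bits now set get cleared."""
    return current & rw1c_mask


def rw1c_write(current, written, rw1c_mask=DSTS_RW1C_MASK):
    """Model the register after a write: a 1 written to an RW1C bit clears it.

    Assumption: all bits outside rw1c_mask are read-only and keep their value.
    """
    return current & ~(written & rw1c_mask)
```

Writing back the value that was just read, masked to the RW1C bits, clears only the errors that were observed and leaves any error that arrives between the read and the write intact.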
Device Control (DCTL) – offset 0x08 off PCIe Capability structure
Bit[0]: Correctable Error Reporting Enable
When set, enables signaling of ERR_COR to the Root Control register due to internally detected errors or error messages received across the link. Other bits also control the full scope of related error reporting.
Bit[1]: Non-Fatal Error Reporting Enable
When set, enables signaling of ERR_NONFATAL to the Root Control register due to internally detected errors or error messages received across the link. Other bits also control the full scope of related error reporting.
Bit[2]: Fatal Error Reporting Enable
When set, enables signaling of ERR_FATAL to the Root Control register due to internally detected errors or error messages received across the link. Other bits also control the full scope of related error reporting.
Bit[3]: Unsupported Request Reporting Enable
When set, allows signaling of ERR_NONFATAL, ERR_FATAL, or ERR_COR to the Root Control register when an unmasked Unsupported Request (UR) is detected. ERR_COR is signaled when an unmasked Advisory Non-Fatal UR is received. ERR_FATAL or ERR_NONFATAL is sent to the Root Control register when an uncorrectable, non-advisory UR is received, with the severity set by the Uncorrectable Error Severity register.
Root Control (RCTL) – offset 0x1C off PCIe Capability structure
Bit[0]: System Error on Correctable Error Enable
When set, an SERR# will be generated if a correctable error is reported by any of the devices in the hierarchy of this root port, including correctable errors in this root port.
Bit[1]: System Error on Non-Fatal Error Enable
When set, an SERR# will be generated if a non-fatal error is reported by any of the devices in the hierarchy of this root port, including non-fatal errors in this root port.
Bit[2]: System Error on Fatal Error Enable
When set, an SERR# will be generated if a fatal error is reported by any of the devices in the hierarchy of this root port, including fatal errors in this root port.
Correctable Error Status(offset 0x10) off AER
Bit[0]: Receiver Error
The Physical Layer detected an error in the incoming packet. The packet is discarded at the Physical Layer, any buffer space allocated to it is released, and the Link Layer is informed that a receive error occurred.
Bit[6]: Bad TLP
The Data Link Layer detected a packet with a bad LCRC, an out-of-sequence Seq#, or an incorrectly nullified packet. In each case, the Link Layer discards the packet and returns a Nak DLLP to the transmitter to trigger TLP replay.
Error Type                 Bit   SKL CPU family   TGL, ICL-H (SIP16-Brooks)   PCH family
Advisory Non-Fatal Error   13    Logged           Logged                      Logged
Replay Timer Timeout       12    Logged           Logged                      Logged
Replay Number Rollover     8     Logged           Logged                      Logged
Bad DLLP                   7     Logged           Logged                      Logged
Bad TLP                    6     Logged           Logged                      Logged
Receiver Error             0     Logged           Logged                      Logged
Correctable Error Status(offset 0x10) off AER
Bit[7]: Bad DLLP
The Data Link Layer detected a 16-bit CRC failure on an incoming DLLP, so the packet is dropped. A subsequent DLLP of the same type is expected to make up for the information it contained.
Bit[8]: Replay Number Rollover
If a replay happens four times, this bit is set and the device initiates link recovery.
Bit[12]: Replay Timer Timeout
At the Data Link Layer, transmitted TLPs have not received an acknowledgement (Ack or Nak) within the timeout period. Hardware automatically replays all unacknowledged TLPs, meaning all packets in the Replay Buffer.
Replay Timer Configuration
– TGL-U: Offset 0x0300 PCIERTP1, Offset 0x0304 PCIERTP2, Offset 0x06A0 PCIERTP3, Offset 0x06A4 PCIERTP4
– PCH: Offset 0x300 PCIERTP1, Offset 0x304 PCIERTP2
– SKL CPU family: Offset 0x238[10:0] LLTC.RT
Bit[13]: Advisory Non-Fatal Error Status
If the Uncorrectable Error Severity bit for the error is 0 (non-fatal) and the error is handled as advisory, this bit is set to let software know that a non-fatal error occurred.
Uncorrectable Error Status (offset 0x04) off AER
Error Type                          Bit  SKL CPU family  TGL, ICL-H (SIP16-Brooks)      PCH family
Poisoned TLP Egress Blocked Status  26   Not Supported   Logged if DPCCAPR.PTLPEBS = 1  Logged if DPCCAPR.PTLPEBS = 1
TLP Prefix Blocked Error            25   Not Supported   Not Supported                  Not Supported
AtomicOp Egress Blocked Status      24   Not Supported   Logged                         Not Supported
ACS Violation Status                21   Logged          Logged                         Logged
Unsupported Request Error           20   Logged          Logged                         Logged
ECRC Error                          19   Not Supported   Not Supported                  Not Supported
Malformed TLP                       18   Logged          Logged                         Logged
Receiver Overflow                   17   Logged          Logged                         Logged
Unexpected Completion               16   Logged          Logged                         Logged
Completer Abort                     15   Not Supported   Logged                         Logged
Completion Timeout                  14   Logged          Logged                         Logged
Flow Control Protocol Error         13   Not Supported   Not Supported                  Not Supported
Poisoned TLP                        12   Logged          Logged                         Logged
Surprise Down Error Status          5    Not Supported   Not Supported                  Not Supported
Data Link Protocol Error            4    Logged          Logged                         Logged
Training Error                      0    Not Supported   Not Supported                  Not Supported
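The support table above can be folded into a decoder that separates the errors a platform is expected to log from bits that should never be set on that platform. This is a sketch only. The platform keys ("SKL", "TGL", "PCH") and the function name are illustrative, bit 26 is treated as supported on TGL and PCH without checking DPCCAPR.PTLPEBS, and the per-platform sets are transcribed from the table.

```python
UES_BITS = {
    26: "Poisoned TLP Egress Blocked", 25: "TLP Prefix Blocked Error",
    24: "AtomicOp Egress Blocked", 21: "ACS Violation",
    20: "Unsupported Request Error", 19: "ECRC Error",
    18: "Malformed TLP", 17: "Receiver Overflow",
    16: "Unexpected Completion", 15: "Completer Abort",
    14: "Completion Timeout", 13: "Flow Control Protocol Error",
    12: "Poisoned TLP", 5: "Surprise Down Error",
    4: "Data Link Protocol Error", 0: "Training Error",
}

# Bits the table marks "Not Supported", per platform column
NOT_SUPPORTED = {
    "SKL": {26, 25, 24, 19, 15, 13, 5, 0},
    "TGL": {25, 19, 13, 5, 0},
    "PCH": {25, 24, 19, 13, 5, 0},
}


def decode_ues(value, platform):
    """Split set UES bits into (logged, unexpected) name lists for a platform.

    'unexpected' holds bits that are set even though the table says the
    platform does not support them, which usually points at a bad read.
    """
    logged, unexpected = [], []
    for bit in sorted(UES_BITS, reverse=True):
        if value & (1 << bit):
            target = unexpected if bit in NOT_SUPPORTED[platform] else logged
            target.append(UES_BITS[bit])
    return logged, unexpected
```

A value such as 0xFFFFFFFF, which often means the device did not respond to the read, shows up as a long "unexpected" list rather than as a plausible set of errors.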
Uncorrectable Error Status (offset 0x04) off AER
Bit[26]: Poisoned TLP Egress Blocked Status
Tied to the DPC (Downstream Port Containment) feature being enabled. The Root Port blocks the transmission of a poisoned TLP from its Egress port.
Bit[24]: AtomicOp Egress Blocked Status
Egress Ports of routing elements can be programmed to block AtomicOps from being forwarded to agents that shouldn’t see them.
Bit[20]: Unsupported Request
If a receiver doesn’t support a Request, it returns a Completion with UR status and logs this bit.
What can cause an Unsupported Request:
– A Message with an unsupported or undefined message code
– A Request that doesn’t reference address space mapped to the device
– A Type 1 configuration Request received at an endpoint
– A function (device) in D1, D2 or D3hot receives a Request other than a configuration Request or Message
– A TLP with No_Snoop = 0 in its header is routed to a port that has Reject Snoop Transaction bit = 1 in the VC Resource Capability register
The receiver responds with a Completion with UR status.
Uncorrectable Error Status (offset 0x04) off AER
Bit[18]: Malformed TLP
Checks for violations of the TLP formatting rules:
– Data payload exceeds MPS
– Data length does not match the length specified in the header
– Memory start address and length combine to cause a transaction to cross a naturally aligned 4KB boundary
– TLP digest (TD field) indication doesn't correspond with packet size (ECRC is unexpectedly missing or present)
– Byte Enable violation
– Undefined Type field values
– Completion that violates the Read Completion Boundary (RCB) value
– Completion with Configuration Request Retry Status in response to a Request other than a Configuration Request
– Traffic Class field contains a value not assigned to an enabled Virtual Channel (also known as TC filtering)
– I/O and Configuration Request violations, for example: the TC field, Attr[1:0], and the AT field must all be zero, while the Length field must have a value of one
– Interrupt emulation message sent downstream
– TLP received with a TLP Prefix error:
  – TLP Prefix but no TLP header
  – End-to-End TLP Prefixes preceding Local TLP Prefixes
  – Local TLP Prefix type not supported
  – More than 4 End-to-End TLP Prefixes
  – More End-to-End TLP Prefixes than are supported
Uncorrectable Error Status (offset 0x04) off AER
Transaction type requiring use of TC0 has a different TC value
– I/O Read or Write Requests and corresponding Completions
– Configuration Read or Write Requests and corresponding Completions
– Error Messages
– INTx messages
– Power Management messages
– Unlock messages
– Slot Power messages
– LTR messages
– OBFF messages
AtomicOp operand doesn't match an architected value
AtomicOp address isn't naturally aligned with operand size
Routing is incorrect for transaction type (e.g., transactions requiring routing to Root Complex detected moving away from Root Complex)
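A few of the Malformed TLP checks listed above reduce to simple arithmetic on header fields, which is useful when reviewing analyzer traces or generating negative test cases. The sketch below covers three of them: payload versus MPS, the 4KB boundary rule, and the I/O and Configuration Request field constraints. The function and parameter names are illustrative and not taken from any tool.

```python
def exceeds_mps(payload_bytes, mps_bytes):
    """Data payload larger than Max Payload Size is a Malformed TLP."""
    return payload_bytes > mps_bytes


def crosses_4kb(address, length_bytes):
    """True if [address, address + length) spans a naturally aligned 4KB boundary."""
    if length_bytes <= 0:
        return False
    first_page = address >> 12
    last_page = (address + length_bytes - 1) >> 12
    return first_page != last_page


def io_cfg_header_ok(tc, attr, at, length_dw):
    """I/O and Configuration Requests: TC, Attr[1:0], and AT must be zero, Length must be one DW."""
    return tc == 0 and attr == 0 and at == 0 and length_dw == 1
```

Note that a request ending exactly on a 4KB boundary is legal: `crosses_4kb(0xFFC, 4)` is false because the last byte touched is 0xFFF, while the same request with 8 bytes reaches into the next page.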
Uncorrectable Error Status (offset 0x04) off AER
Bit[17]: Receiver Overflow Status
More TLPs have arrived than the Receive buffer had room to accept.
When can this error occur?
– The remote device doesn’t adhere to the flow control rules
Bit[16]: Unexpected Completion Status
A Requester receives a Completion that doesn’t match any Request awaiting a Completion.
– Mismatched Requester ID between Request and Completion
– Mismatched Tag number between Request and Completion
Bit[15]: Completer Abort
What can cause a Completer Abort:
– If the Completer of an AtomicOp Request encounters an uncorrectable error accessing the target location or carrying out the Atomic operation, the Completer must handle it as a Completer Abort.
– The Completer receives a Request that it cannot process because of some permanent error condition in the device. For example, a wireless LAN card that won’t accept new packets because it can’t transmit or receive over its radio until an approved antenna is attached.
The receiver responds with a Completion with CA status.
Uncorrectable Error Status (offset 0x04) off AER
Bit[14]: Completion Timeout Status
For a pending Request that never receives the Completion it’s expecting, the spec defines a Completion Timeout mechanism. This bit is set if completion timeout is enabled and a Completion fails to return within the time specified by the Completion Timeout Value.
AECC[12] - Completion Timeout Prefix/Header Log Capable
DCTL2[4] – Completion Timeout Disable
DCTL2[3:0] – Completion Timeout Value
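When a Completion Timeout is logged, a quick first check is how the timeout is configured. The sketch below extracts the two DCTL2 fields named above from a raw register value. The range strings follow the Completion Timeout Value encodings in the PCIe Base Specification and should be confirmed against the platform datasheet, since a given device may support only some ranges. The function name is illustrative.

```python
# Completion Timeout Value encodings per the PCIe Base Specification
# (Device Control 2, bits 3:0); verify against the platform datasheet.
TIMEOUT_RANGES = {
    0x0: "default range (50 us to 50 ms)",
    0x1: "50 us to 100 us",
    0x2: "1 ms to 10 ms",
    0x5: "16 ms to 55 ms",
    0x6: "65 ms to 210 ms",
    0x9: "260 ms to 900 ms",
    0xA: "1 s to 3.5 s",
    0xD: "4 s to 13 s",
    0xE: "17 s to 64 s",
}


def decode_completion_timeout(dctl2):
    """Extract DCTL2[4] (disable) and DCTL2[3:0] (value) from a raw DCTL2 value."""
    value = dctl2 & 0xF
    return {
        "disabled": bool(dctl2 & (1 << 4)),
        "value": value,
        "range": TIMEOUT_RANGES.get(value, "reserved encoding"),
    }
```

If `disabled` is true, a Completion Timeout in UES[14] from this function would be unexpected and worth a closer look.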
Bit[12]: Poisoned TLP
Data poisoning, also called “Error Forwarding”, lets a device indicate that the data associated with a TLP is corrupted. In any write request with data or completion with data, if the data is corrupted, the sender can set the EP bit in the TLP header to flag the corruption. When a device receives a TLP with EP set in the header, bit 12 is set in the Uncorrectable Error Status register.
Why would a sender send a TLP that is already known to be bad?
– Suppose a Request results in a Completion with data, but the data encountered an error as it was gathered from the target, such as an ECC or parity error in memory. If the Completion is not returned, the Requester gets a Completion Timeout. If instead the Completion is delivered with the poisoned bit set, the Requester can at least see that the path to the target is good, which is better than a timeout.
– The receiver may be able to accept data with errors, for example audio streaming.
– A device might have a means of correcting the data.
Uncorrectable Error Status (offset 0x04) off AER
Bit[4]: Data Link Protocol Error Status
Caused by Data Link Layer protocol errors, including errors in the Ack/Nak retry mechanism.
For example, a transmitter receives an Ack or Nak whose sequence number doesn’t correspond to an unacknowledged TLP or to the ACKD_SEQ number.
Example: the transmitter sends a TLP with Seq# = 0x2A and receives an Ack with Seq# = 0x29 or less.
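The rule above (the Ack/Nak sequence number must match either ACKD_SEQ or an unacknowledged TLP) can be written as a modular comparison, because TLP sequence numbers are 12 bits wide and wrap at 4096. The sketch below is one way to express that rule and is not a statement of how any particular hardware implements it. The function name and the `next_transmit_seq` parameter (the sequence number the next new TLP would get) are illustrative.

```python
SEQ_MOD = 4096  # TLP sequence numbers are 12 bits wide and wrap around


def ack_is_valid(ack_seq, ackd_seq, next_transmit_seq):
    """True if ack_seq equals ACKD_SEQ or names a still-unacknowledged TLP.

    Unacknowledged TLPs carry sequence numbers ACKD_SEQ + 1 through
    next_transmit_seq - 1, counted modulo 4096.
    """
    outstanding = (next_transmit_seq - 1 - ackd_seq) % SEQ_MOD
    distance = (ack_seq - ackd_seq) % SEQ_MOD
    return distance <= outstanding
```

With ACKD_SEQ = 0x29 and one TLP in flight with Seq# = 0x2A, an Ack for 0x2A or 0x29 passes this check, while an Ack for 0x28 or 0x2B fails it and would be reported as a Data Link Protocol Error.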
Root Error Status (RSTS) – offset 0x30 off AER
Bit[0]: ERR_COR Received
Set when a correctable error message is received.
Bit[1]: Multiple ERR_COR Received
Set when a correctable error message is received and bit 0 is already set.
Bit[2]: ERR_FATAL/NONFATAL Received
Set when either a fatal or a non-fatal error message is received.
Bit[3]: Multiple ERR_FATAL/NONFATAL Received
Set when either a fatal or a non-fatal error message is received and bit 2 is already set.
Bit[4]: First Uncorrectable Fatal
Set when the first uncorrectable error message received is for a fatal error.
Bit[5]: Non-Fatal Error Message Received
Set when one or more non-fatal uncorrectable error messages have been received.
Bit[6]: Fatal Error Message Received
Set when one or more fatal uncorrectable error messages have been received.
Root Error Command (REC) – offset 0x2C off AER
Bit[0]: Correctable Error Reporting Enable
When set, the root port will generate an interrupt when a correctable error is reported by the attached device.
Bit[1]: Non-Fatal Error Reporting Enable
When set, the root port will generate an interrupt when a non-fatal error is reported by the attached device.
Bit[2]: Fatal Error Reporting Enable
When set, the root port will generate an interrupt when a fatal error is reported by the attached device.
Error Message Generated by an Endpoint Device
How can we debug the following scenarios?
– The root complex gets a completion timeout
– We suspect something is wrong downstream, where an endpoint gets errors
In these scenarios, an endpoint device can send an error message upstream: correctable error (ERR_COR), non-fatal error (ERR_NONFATAL), or fatal error (ERR_FATAL). Hence we can trigger a PCIe LA or protocol analyzer on any upstream error message to identify which downstream traffic causes the error.
Configuration on an endpoint device
PCICMD[8] = 1
When set, this bit enables upstream reporting of Non-Fatal and Fatal errors detected by the Function
DCTL[2:0] = 0x7
Enables Correctable, Non-Fatal, and Fatal Error Reporting
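The two settings above are read-modify-write updates: the enable bits are ORed into the current register values so that unrelated fields are left alone. A minimal sketch of computing the new values follows. The function name is illustrative, and the actual configuration reads and writes are platform-specific and not shown.

```python
PCICMD_SERR_ENABLE = 1 << 8   # PCICMD[8]: SERR# Enable
DCTL_ERROR_REPORTING = 0x7    # DCTL[2:0]: Correctable, Non-Fatal, Fatal reporting enables


def enable_endpoint_error_messages(pcicmd, dctl):
    """Return (new_pcicmd, new_dctl) with upstream error reporting enabled.

    Only the enable bits are ORed in; every other field keeps its value.
    """
    return pcicmd | PCICMD_SERR_ENABLE, dctl | DCTL_ERROR_REPORTING
```

For instance, starting from PCICMD = 0x0006 and DCTL = 0x2810, the values to write back are 0x0106 and 0x2817.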
One Confusion for Most of Us in General
Scenario: the Root Complex sends a MemRd with an invalid address, and the Endpoint returns a Completion with UR.
Where is the Completion UR set in Uncorrectable Error Status? And which other error status bits are set?
A common but incorrect expectation:
Root Complex
– UES[20]:URE = 1
– PCISTS[13]: Received Master Abort = 0
– SSTS[13]: Received Master Abort = 0
Endpoint
– UES[20]:URE = 0
– PCISTS[13]: Received Master Abort = 1
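One Confusion for Most of Us in General (answer)
Scenario: the Root Complex sends a MemRd with an invalid address, and the Endpoint returns a Completion with UR.
Actual register state: the Endpoint, as Completer, logs the Unsupported Request, and the Root Complex, as Requester, records the UR Completion received on its secondary side.
Root Complex
– UES[20]:URE = 0
– PCISTS[13]: Received Master Abort = 0
– SSTS[13]: Received Master Abort = 1
Endpoint
– UES[20]:URE = 1
– PCISTS[13]: Received Master Abort = 0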