"@(#)hv_sun4v_errorphilosophy-V2.0.txt 1.17 06/08/15 " Sun4v Hypervisor Error Handling Interfaces [DRAFT FOR REVIEW] ------------------------------------------- NOTE: This document describes the error handling interfaces for CPU, memory, internal register and programmed I/O errors. The error handling interfaces for host bus adapter errors and directly accessible I/O device errors are still being developed and are not described in this version of the document. 1.0 Introduction Hardware errors which do not reset the system generate a trap to the hypervisor. The hyperprivileged trap handler virtualizes the errors from the CPU, memory, and virtual I/O devices like the host bus adapter, and sends an error notification to the affected guests. For errors that do not reset the guest, an error report indicating the impact of the error is sent to the guest. Section 5 of this document describes the structure of the error report sent to the guest. Errors from devices that are directly accessed by the sun4v guest are not virtualized by the hypervisor. They are handled by the device drivers of the sun4v guest. The sun4v architecture[1] defines two classes of errors based on their impact on the interrupted instruction stream: resumable errors and non-resumable errors. Resumable errors are those that do not affect the current instruction stream. Non-resumable errors are those that affect the current instruction stream and require software intervention before the interrupted instruction stream can be resumed. The sun4v architecture defines queues for the hypervisor to send error reports to its guests. The sun4v error reports for CPU, memory, and PIO errors are queued on to the resumable error queue or non-resumable error queue depending on the type of the error. The sun4v error reports for errors in virtual or direct I/O devices are queued on to the dev_mondo queue. 
The simplest implementation of a sun4v guest could, for example, perform a 'retry' on resumable error notifications and 'panic' on non-resumable error notifications. The intent, however, is to have enough information in the hypervisor-generated error reports that an advanced sun4v guest can take corrective actions and make forward progress when possible.

The remainder of this document is organized as follows. Section 2 defines new terms introduced in this document. Section 3 describes the sun4v hypervisor error handling philosophy. Section 4 provides a brief overview of the hypervisor-generated error notifications. Section 5 describes the sun4v error handling interfaces. Section 6 describes the hypervisor error handling principles of operation.


2.0 Terms and Definitions

2.1 Diagnosis Service Provider.

The platform is expected to include an FMA Fault Manager that provides a Diagnosis Service for the hardware components. The diagnosis service must provide a transport for FMA Error Reports, and can be implemented on any of the following:

    (1) The only sun4v guest partition on the platform
    (2) A sun4v service partition
    (3) The Service Processor

The diagnosis service implements the appropriate hardware diagnosis algorithms and triggers corrective actions, messaging, and other tasks resulting from the diagnosis of a fault in the platform hardware.

2.2 FMA Error Report Generator Service Provider.

If a hypervisor implementation does not itself produce FMA Error Reports, then an FMA Error Report Generator must be implemented to convert the hypervisor implementation-specific error data structures to FMA Error Reports and transport them to the diagnosis service.

2.3 Service Provider Interface.
A platform-specific Service Provider Interface must be implemented on the platform to transmit hypervisor-generated error reports to the FMA Error Report Generator (if one is required) or to the Diagnosis Service Provider (if the hypervisor itself produces FMA Error Reports).

For information on Sun SPARC error terminology, please refer to [2]. For more information on FMA, see http://fma.eng.


3.0 Philosophy

The sun4v hypervisor error handling philosophy is based on the following principles:

    (1) Abstract the underlying hardware characteristics from errors reported to a sun4v guest, so that sun4v guest error handlers can be implemented without built-in knowledge of the underlying hardware implementation.

    (2) Provide a separate mechanism for reporting errors for analysis and diagnosis of hardware faults. The FMA Error Report Generator and the Diagnosis Service Provider subscribe to this mechanism.


4.0 Brief Overview of Error Notifications

Hardware errors which do not reset the system trap to the hypervisor. For each error handled by the hypervisor:

    (1) If the error does not reset the sun4v guest, then a sun4v error report that virtualizes the underlying hardware error and describes the impact of the error is sent to the sun4v guest.

    (2) A service error report containing the raw error logs captured by the hardware and additional diagnostic data is sent to the Diagnosis Service Provider.

As shown in Figure 1 below, the sun4v error report is sent to the affected sun4v guest via the queues defined by the sun4v architecture. The service error report is sent via the Service Provider Interface to the FMA Error Report Generator, which sends an FMA Error Report to the Diagnosis Service Provider.

                                                Diagnosis Service Provider
                                  _________              _________
                                 (         )  forward   (         )
                                ( FMA Agent )=========>( FMA Agent )
                                 (_________)            (_________)
                                      ^ FMA Error            ^ FMA Error
                                      | Report               | Report
+--------------+ +-------------+ +-------------+       +------------+
|CPU/Mem/PIO   | |Virtual i/o  | |Direct i/o   |       | FMA Error  |
|error handler | |error handler| |error handler|       | Report     |
+--------------+ +-------------+ +-------------+       | Generator  |
           ^               ^          ^                +------------+
   +-------|-------+       |----+-----|                      ^
   |               |            |                            |
+-------------+ +-----------+ +-----------+           +--------------+
|non-resumable| |resumable  | |device     |           |Service       |
|error queue  | |error queue| |mondo queue|           |Provider I/F  |
+-------------+ +-----------+ +-----------+           +--------------+
       ^              ^             ^                        ^
       |              |             |                        |
       +--------------+             |                        |
       |cpu,                        |virtual i/o             |service
       |memory,                     |error report,           |error
       |PIO error                   |direct i/o              |report
       |report                      |error interrupt         |
       +---------------------+      |                        |
                             |      |                        |
                          +-------------+                    |
                          | Hypervisor  |--------------------+
                          +-------------+
                                 ^
                                 | hardware errors

Fig 1. Hypervisor Error Reports to sun4v Guest and FMA Service Provider

Some notes on Figure 1 above:

    (1) Virtual I/O refers to devices that cannot be directly accessed by the guest. They are either complete abstractions of the underlying physical devices, like the virtual console device, or are accessed indirectly using hypervisor calls, as is the host bus adapter.

    (2) The CPU, memory, and virtual I/O errors are diagnosed by the Diagnosis Service Provider based on the service error report data sent.

    (3) Direct I/O device errors are handled by the sun4v guest device drivers. Hardened drivers generate FMA Error Reports. Those FMA Error Reports are sent to the FMA Agent (as shown in Figure 1) on the sun4v guest and are forwarded to the FMA Agent on the Diagnosis Service Provider. Forwarding the FMA Error Reports may not be necessary if the Diagnosis Service Provider and the sun4v guest are on the same partition.

    (4) The sun4v error report is a virtualized error report used by the sun4v guest, and is not the same as the FMA Error Report that captures platform-specific information for the Diagnosis Service.

Figure 1 shows the error reports generated by the hypervisor when handling hardware errors and how they are propagated to the Diagnosis Service Provider and the sun4v guest.

The sun4v error report is sent to the sun4v guest. The CPU, memory, and PIO error reports are sent via the sun4v resumable_error and nonresumable_error queues. Both virtual and direct I/O device error reports are sent via the sun4v dev_mondo queue. These queues are allocated per CPU. Each queue has a head pointer and a tail pointer. When the queue is empty, the head and tail pointers are equal. The hypervisor queues the error report at the tail and advances the tail pointer to the next entry.

For resumable_error and dev_mondo queues, the hardware generates a disrupting trap whenever the head and tail pointers are not equal. The disrupting trap is taken on the CPU if interrupts are enabled (PSTATE.IE = 1) or remains pending if interrupts are disabled. The sun4v guest interrupt handler processes the sun4v error reports from the head pointer up to, but not including, the tail pointer, and then sets the head pointer equal to the tail pointer, leaving the queue in a non-interrupting state.

For nonresumable_error queues, the hardware does not generate a trap automatically when the head and tail pointers are not equal. Instead, the hypervisor emulates a nonresumable_error trap on the CPU by transferring control to the nonresumable_error trap handler of the sun4v guest. The sun4v guest trap handler processes the sun4v error reports from the head pointer up to, but not including, the tail pointer, and then sets the head pointer equal to the tail pointer.

For direct I/O device errors, the sun4v guest hardened device drivers generate the FMA Error Report to be sent for diagnosis.
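
The head and tail pointer handling described above can be sketched as a small model. This is only an illustration under stated assumptions: the class and method names are hypothetical, the 64-byte entry size is assumed, and the wrap-around and "queue full" rules are assumptions, since this document only says that the tail is advanced to the next entry and that the guest processes entries from the head up to the tail.

```python
from dataclasses import dataclass, field
from typing import Callable, List

ENTRY_SIZE = 64  # assumed bytes per queue entry; not specified in this document


@dataclass
class ErrorQueue:
    """Toy model of one per-CPU sun4v error queue (hypothetical names)."""

    num_entries: int
    head: int = 0  # byte offset of the next entry the guest reads
    tail: int = 0  # byte offset of the next entry the hypervisor writes
    entries: List[bytes] = field(init=False, default_factory=list)

    def __post_init__(self) -> None:
        self.entries = [bytes(ENTRY_SIZE)] * self.num_entries

    def is_empty(self) -> bool:
        # Head == tail means empty, i.e. the non-interrupting state.
        return self.head == self.tail

    def _next(self, offset: int) -> int:
        # Assumption: the queue is circular and offsets wrap at the end.
        return (offset + ENTRY_SIZE) % (self.num_entries * ENTRY_SIZE)

    def hv_enqueue(self, report: bytes) -> bool:
        """Hypervisor side: store the report at the tail, then advance the tail."""
        if self._next(self.tail) == self.head:
            return False  # assumed "full" rule: one slot is always left unused
        self.entries[self.tail // ENTRY_SIZE] = report
        self.tail = self._next(self.tail)
        return True

    def guest_drain(self, handle: Callable[[bytes], None]) -> int:
        """Guest side: process entries in [head, tail), leaving head == tail."""
        processed = 0
        while self.head != self.tail:
            handle(self.entries[self.head // ENTRY_SIZE])
            self.head = self._next(self.head)
            processed += 1
        return processed
```

A guest handler built this way leaves the queue with head equal to tail after each drain, which for the resumable_error and dev_mondo queues corresponds to the non-interrupting state.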

For CPU, memory, and virtual I/O device errors, the sun4v guest does not generate FMA Error Reports; instead, an FMA Error Report is generated from the service error report sent via the Service Provider Interface.

The service error report is sent to the FMA Error Report Generator and Diagnosis Service Provider via a platform-specific interface called the Service Provider Interface. The FMA Error Report Generator receives the error logs and diagnostic data from the hypervisor and generates an FMA Error Report. It sends the FMA Error Report to the Diagnosis Service Provider for analysis and diagnosis. The error recovery actions based on the platform's SERD policies and failure rates are then communicated back to the guests (not shown in Figure 1). Please refer to the proposed FMA Error Report Generator and Diagnosis Service Provider architecture[3] for more information.

The sun4v guest error reports and the error reports sent through the Service Provider Interface each contain an Error Handle (see 5.2.1) that can be used to correlate the reports.

This document describes the sun4v error reports for CPU, memory, and virtual host bus adapter errors. For a description of the service error reports, please refer to the platform's FMA and Service Entity documentation.

4.1 PIO store errors.

Note that this error report format is *not* used for PIO store errors. These errors are reported to the guest using a different, I/O-specific format. See [6] for details.


5.0 Sun4v Error Handling Interfaces

This section describes the sun4v error handling interfaces as a supplement to the Error Model section in the sun4v architecture specification[1].

5.1 Classification of Errors

An error, as defined in [2], occurs when a signal or datum is wrong. The hypervisor classifies hardware errors into three classes:

    (1) Resumable errors, where the error does not affect the current instruction stream.

    (2) Non-resumable errors, where the error affects the current instruction stream and requires software intervention before the program can be resumed.

    (3) Unconstrained or terminating errors, where the error results in a loss of system coherency and/or data integrity such that continuing execution can lead to further damage.

For unconstrained or terminating errors, a sun4v error report is not queued to the affected sun4v guests; the affected guests are reset. For resumable and non-resumable errors, a sun4v error report is queued to the affected sun4v guest. The sun4v architecture[1] defines the resumable, non-resumable, and dev_mondo queues, and their workings.

All I/O error interrupts are resumable errors, as they are derived from asynchronous interrupts generated by I/O devices and do not affect the current instruction stream. However, their error reports are queued on the dev_mondo queue to be handled by the nexus and device drivers of the sun4v guest. The structure of the I/O error reports [TBD] differs from that of the sun4v error reports for CPU, memory, and PIO errors defined in section 5.2 of this document.

Regardless of the class of the error, the hypervisor creates a service error report and attempts to deliver it to the FMA Error Report Generator. Upon receipt of a service error report, the FMA Error Report Generator creates an FMA Error Report and attempts to deliver it to the Diagnosis Service Provider for error recovery and fault diagnosis. If the Diagnosis Service Provider is implemented on a sun4v guest that is fatally affected by a non-resumable error, the FMA Error Report may not be successfully delivered to the diagnosis service.

5.1.1 Resumable Errors

For a resumable error, the originating error may either be corrected by the hardware or hypervisor, or left unchanged. If the originating error was corrected, then the guest is not sent an error report.
However, a service error report is sent to the Diagnosis Service Provider via the Service Provider Interface for fault analysis. Some examples of hardware errors that are not reported to the guest because the error was corrected are: correctable ECC error in cache data, TLB data parity error, cache data parity error in a clean cache, and uncorrectable ECC error in a clean cache line.

If the originating error was left unchanged, then an error report is sent to the affected guest. The impact of the error on the sun4v guest, for example, whether there was memory corruption or whether a CPU became unavailable, is indicated in the error report. Some examples of hardware errors that are reported as resumable errors where the originating error was left unchanged are: uncorrectable data ECC error on cache writeback, transaction timeout on a PIO read, data parity error on a PIO read data return, bus error on a PIO read, and recursive unrecoverable errors on a CPU. In the case of an uncorrectable data ECC error on cache writeback, the memory region that was corrupted is indicated in the error report. In the case of recursive errors on a CPU, the ID of the CPU that was marked in error, along with the execution mode of the CPU at the time of the error, is indicated in the error report. In the case of a failed PIO transaction, the PIO transaction address is indicated in the error report.

5.1.2 Non-Resumable Errors

For a non-resumable error, the originating error is not corrected. The sun4v error report indicates the location of the originating error. These errors require the intervention of the sun4v guest error handler to take corrective actions, when possible, before resuming or terminating the interrupted program. For example, the guest may use the hypervisor call to scrub the memory region in error indicated in the error report.

A non-resumable error may be reported to the sun4v guest as either a precise trap or a deferred trap.
The error report descriptor indicates the trap type. When multiple error reports are queued, the deferred error reports will be queued ahead of the precise error report, according to the age of the instructions that induced the errors.

Some examples of hardware errors that are reported as non-resumable errors are: uncorrectable data ECC error in cache on loads, instruction fetches, or atomics from a dirty line; uncorrectable ECC error in DRAM on loads, instruction fetches, or atomics; and an uncorrectable ECC error in a CPU's register file. For uncorrectable ECC errors in the memory hierarchy, the non-resumable error report indicates the memory region in error. For uncorrectable ECC errors in register files which can be cleared, the register file as well as the ID of the CPU whose register file had the error are indicated in the error report.

5.1.3 Unconstrained or Terminating Errors

Unconstrained or terminating errors are not reported to the sun4v guest OS. They result in resetting the sun4v guest. In some cases, the hardware generates a reset trap, and in others the hypervisor resets the sun4v guest.

Some examples of hardware errors that are treated as unconstrained or terminating errors to the guest are Niagara's L2 cache tag parity error and L2 cache directory parity error, ROCK's store buffer address or control parity error, and Niagara's TLB tag parity error. The L2 cache tag and directory parity errors are detected by hardware, which causes a warm reset of the entire chip. For a store buffer address or control parity error, the hardware generates a deferred trap to the hypervisor, which resets the affected partitions. For the TLB tag parity error, the hardware generates a precise trap to the hypervisor, which resets the partitions using that TLB.

Recursive errors on a CPU may result in the resetting of the partition if they leave all of the CPUs in that partition in error.

5.2 Sun4v Error Report For CPU, Memory, and Programmed I/O (PIO) Access

The sun4v error report for CPU, memory, and PIO access errors is a fixed-length error report that describes the underlying hardware error to the sun4v guest in terms of a resumable or non-resumable error. The intent is to have enough information in the error reports to enable an advanced guest to take corrective actions, when possible, and make forward progress. The sun4v error report is not meant for hardware fault analysis or diagnosis.

On startup, the sun4v guest and the hypervisor exchange the versions that they support and pick the latest version that is compatible. Please refer to [1] for more information.

Table 5.2-I below describes the format of the sun4v error report record.

--------------------------------------------------------------------
Offset  Size     Field   Description
        (bytes)
--------------------------------------------------------------------
0x0     8        EHDL#   Unique error handle
0x8     8        STICK   Value of the %STICK register
0x10    3        Rsvd    Reserved, always set to zero.
0x13    1        DESC    Error descriptor (see section 5.2.3)
0x14    4        ATTR    Error attributes (see section 5.2.4)
0x18    8        ADDR    Real address of the affected memory region
                         or PIO transaction address
                         Virtual address for the ASI register
0x20    4        SZ      Size, in bytes, of the affected memory
                         region or the size (in bytes) of the ASI
                         region in error
0x24    2        CPUID   ID of the affected CPU
0x26    2        SECS    Grace period for shutdown in seconds
0x28    1        ASI     Value of the %ASI register
0x29    1        Rsvd    Reserved, always set to zero.
0x2A    2        REG     Value of the ASR register#
--------------------------------------------------------------------
            Table 5.2-I. Sun4v Error Report Format

5.2.1 Error Handle (EHDL#).

This field specifies the handle of the error. Error handles are unique opaque values that will not be reused until the hypervisor in the hardware domain is restarted. If multiple error reports are generated for the same error, they will all have the same EHDL value.
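
As an illustration of Table 5.2-I, the record can be unpacked field by field. The sketch below is not part of the interface definition: the names `REPORT_LAYOUT`, `ErrorReport`, and `parse_error_report` are hypothetical, big-endian byte order is assumed (SPARC is big-endian), and REG is read at 0x2A, the offset directly after the reserved byte at 0x29, which assumes a contiguous layout.

```python
import struct
from typing import NamedTuple

# Big-endian, no padding: EHDL, STICK, 3 reserved bytes, DESC, ATTR, ADDR,
# SZ, CPUID, SECS, ASI, 1 reserved byte, REG (assumed to sit at 0x2A).
REPORT_LAYOUT = struct.Struct(">QQ3xBIQIHHBxH")


class ErrorReport(NamedTuple):
    ehdl: int   # 0x00 unique error handle
    stick: int  # 0x08 %STICK value captured by the hypervisor
    desc: int   # 0x13 error descriptor
    attr: int   # 0x14 error attributes
    addr: int   # 0x18 real address, PIO address, or ASI virtual address
    sz: int     # 0x20 size in bytes of the affected region
    cpuid: int  # 0x24 ID of the affected CPU
    secs: int   # 0x26 shutdown grace period in seconds
    asi: int    # 0x28 %ASI value
    reg: int    # 0x2A register number field (assumed offset)


def parse_error_report(entry: bytes) -> ErrorReport:
    """Decode the leading bytes of a queue entry into named fields."""
    if len(entry) < REPORT_LAYOUT.size:
        raise ValueError("entry is shorter than the error report record")
    return ErrorReport(*REPORT_LAYOUT.unpack_from(entry, 0))
```

Reading each field at a fixed offset, rather than relying on native structure alignment, keeps the decoder independent of the host that runs it. Which fields carry valid contents still depends on the DESC and ATTR values described in the following sections.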

5.2.2 Stick register (STICK).

This field specifies the contents of the %STICK register that was captured by the hypervisor trap handler.

5.2.3 Error Descriptor (DESC).

This field specifies the type of the error report. Table 5.2.3-I below lists the currently defined values.

------------------------------------------------------------
Value   Mnemonic   Description
------------------------------------------------------------
0       UNDEF      Undefined
1       R_UE       Uncorrected resumable error report
2       NR_PR      Precise non-resumable error report
3       NR_DF      Deferred non-resumable error report
4       SHT_R      Shutdown request (resumable)
5       DCORE      Dump Core (non-resumable)
------------------------------------------------------------
          Table 5.2.3-I. Error Report Descriptors

All other values are reserved. The values R_UE and SHT_R are valid only for error reports that are queued on the resumable_error queue. The values NR_PR, NR_DF, and DCORE are valid only for error reports that are queued on the nonresumable_error queue.

5.2.3.i Uncorrected resumable error report.

An uncorrected resumable error report is always queued on the resumable_error queue of a CPU that belongs to the affected partition. It specifies that the underlying error was not corrected. The resource in error is specified by the ATTR (5.2.4) field of the error report.

An uncorrected resumable error report is used to indicate a CPU in error. For example, in a partition with multiple CPUs, when a permanent error in a register file of a CPU is detected, the CPU is marked in error and an uncorrected resumable error report indicating the CPU in error is queued on a different CPU of the same partition. When the only running CPU in a partition is in error, the partition is reset.

5.2.3.ii Precise non-resumable error report.

A precise non-resumable error report is always queued on the nonresumable_error queue of the CPU that executed the instruction that induced the error.
It specifies that the nonresumable_error trap taken is a precise trap where TPC[TL] points to the instruction that induced the error. The error report contains enough information about the error for the guest to take appropriate actions before resuming or terminating the interrupted instruction stream. The location of the error is specified by the ATTR (5.2.4) field of the error report. A hypervisor call is provided for the guest to scrub the error location.

When multiple non-resumable error reports are queued on the nonresumable_error queue of a CPU, the deferred error reports will be queued ahead of the precise non-resumable error reports.

5.2.3.iii Deferred non-resumable error report.

A deferred non-resumable error report is always queued on the nonresumable_error queue of the CPU that executed the instruction that induced the error. It specifies that the nonresumable_error trap taken is a deferred trap, which means that the error is unrecoverable and the instruction stream should be terminated. The location of the error is specified by the ATTR (5.2.4) field of the error report. The MODE (5.2.4.viiii) field in the ATTR specifies the execution mode in which the error occurred.

When multiple non-resumable error reports are queued on the nonresumable_error queue of a CPU, the deferred error reports will be queued ahead of the precise non-resumable error reports.

5.2.3.iv Shutdown request.

This is used to request the guest to initiate a graceful shutdown sequence. This report will be queued on the resumable error queue.

5.2.3.v DCORE (Dump Core).

This is used to instruct the guest to initiate a dump core sequence. This report will be queued on the non-resumable error queue.

5.2.4 Error Attributes (ATTR).

The meaning of this field depends on the error descriptor (see 5.2.3) of the error report. It also includes the resumable queue full indicator (see 5.2.11).

In uncorrected resumable error reports, this field specifies the resource affected by the error.
When a CPU has an uncorrected error, whether the CPU was executing in user or privileged mode, if known, is also included in the error report.

In precise non-resumable error reports, this field specifies the location in error.

In deferred non-resumable error reports, this field specifies the location in error as well as the execution mode in which the error occurred, if that can be determined.

The settings of this field also determine which of the additional fields included in the error report have valid contents. Table 5.2.4-I below describes the format of this field.

---------------------------------------------------------------------
Field    Bit        Location/                  Valid Fields
         Position   Impact                     In Error Report
---------------------------------------------------------------------
RQFULL   31         Resumable Queue Full
RSVD     30:26      Undefined. Reserved for future use.
MODE     25:24      Execution Mode (see 5.2.4.viiii)
RSVD0    23:9       Undefined. Reserved for future use.
PREG     8          Sun4v Privileged           CPUID, REG
                    Register
ASI      7          Sun4v ASI register         ASI, ADDR, SZ
ASR      6          Sun4v ASR                  REG
SHUT     5          Shutdown request
FRF      4          Floating-point             CPUID, REG
                    Register File
IRF      3          Integer Register File      CPUID, REG
PIO      2          Programmed I/O Access      ADDR
MEM      1          Memory Hierarchy           ADDR, SZ
CPU      0          CPU                        CPUID
---------------------------------------------------------------------
     Table 5.2.4-I. Format of the Error Attributes (ATTR) Field

The unused bits may have undefined values and are reserved for future use. The PIO and MEM bits cannot be set in the same error report.

Table 5.2.4-II below shows the applicable attribute fields for the different types of error reports. 'Y' indicates applicable. '-' indicates not applicable.

+-----+----------------------------------------------------------+-------------------------------+
|Error|                     Error Attributes                     |                               |
|DESC |CPU |MEM |PIO |IRF |FRF |SHUT|ASR |ASI |PREG|MODE |RQFULL |             Notes             |
+-----+----+----+----+----+----+----+----+----+----+-----+-------+-------------------------------+
|R_UE | Y  | Y  | -  | -  | -  | -  | -  | Y  | -  |  Y  |   Y   | PIO, IRF, FRF, ASR, PREG and  |
|     |    |    |    |    |    |    |    |    |    |     |       | REG not applicable in         |
|     |    |    |    |    |    |    |    |    |    |     |       | uncorrected resumable error   |
|     |    |    |    |    |    |    |    |    |    |     |       | reports.                      |
|NR_PR| -  | Y  | Y  | Y  | Y  | -  | Y  | Y  | Y  |  -  |   -   | CPU not applicable in precise |
|     |    |    |    |    |    |    |    |    |    |     |       | non-resumable error reports.  |
|     |    |    |    |    |    |    |    |    |    |     |       | PIO and MEM cannot be set in  |
|     |    |    |    |    |    |    |    |    |    |     |       | the same report.              |
|NR_DF| -  | Y  | Y  | -  | -  | -  | -  | -  | -  |  Y  |   -   | CPU, IRF, FRF, ASR, ASI and   |
|     |    |    |    |    |    |    |    |    |    |     |       | PREG not applicable in        |
|     |    |    |    |    |    |    |    |    |    |     |       | deferred non-resumable error  |
|     |    |    |    |    |    |    |    |    |    |     |       | reports. PIO and MEM cannot   |
|     |    |    |    |    |    |    |    |    |    |     |       | be set in the same report.    |
|SHT_R| -  | -  | -  | -  | -  | Y  | -  | -  | -  |  -  |   -   |                               |
|DCORE| -  | -  | -  | -  | -  | -  | -  | -  | -  |  -  |   -   | No attributes for DCORE       |
+-----+----+----+----+----+----+----+----+----+----+-----+-------+-------------------------------+
                    Table 5.2.4-II. Applicable Error Attributes Map

5.2.4.i CPU Field.

In an uncorrected resumable error report, the CPU bit when set specifies that a CPU belonging to the same partition is in error. The ID of the CPU in error is specified by the CPUID (see 5.2.5) field in the error report.

The CPU bit is not used in non-resumable error reports.

5.2.4.ii MEM Field.

In uncorrected resumable error reports and in non-resumable error reports, the MEM bit when set specifies that there exists an uncorrected data error in the memory hierarchy. The uncorrected error could be due to either a bad ECC syndrome or NotData.
The starting real address and the size, in bytes, of the affected memory region are specified by the ADDR (5.2.6) and SZ (5.2.7) fields in the error report, respectively. Subsequent reads from the affected memory region would also generate an error unless there was an intervening hypervisor call to scrub the memory error. A hypervisor call is provided for the guest to scrub the memory region in error.

The MEM field cannot be set in the same error report as the PIO field (5.2.4.iii), ASR field (5.2.4.vi), or ASI field (5.2.4.vii).

5.2.4.iii PIO Field.

In non-resumable error reports, the PIO bit when set specifies that an unrecoverable error was encountered on a PIO access. The PIO address accessed is specified by the ADDR (5.2.6) field in the error report. The I/O device corresponding to the failed PIO transaction can be determined from the PIO address specified by the ADDR field in the error report.

The PIO bit is not used in resumable error reports.

The PIO field cannot be set in the same error report as the MEM field (5.2.4.ii), ASR field (5.2.4.vi), or ASI field (5.2.4.vii).

5.2.4.iv IRF Field.

In precise non-resumable error reports, the IRF bit when set specifies that a non-permanent uncorrectable error in the integer register file occurred when executing the instruction pointed to by TPC[TL]. The data in one or more register operands of that instruction has been corrupted by the error, but the source of the error has been cleared.

The IRF field is not used in uncorrected resumable error reports.

NOTE: For permanent errors in the integer register file of a CPU, the CPU is marked in error. An uncorrected resumable error report is sent to a different CPU in the same partition indicating the ID of the CPU in error.

5.2.4.v FRF Field.

This is the same as the IRF (5.2.4.iv) field, except that when set it specifies that the error was in the floating-point register file instead of the integer register file. Please see the IRF (5.2.4.iv) description for more information.

NOTE: For permanent errors in the floating-point register file of a CPU, the CPU is marked in error. An uncorrected resumable error report is sent to a different CPU in the same partition indicating the ID of the CPU in error.

5.2.4.vi ASR Field.

An error occurred in one of the internal ASRs of the CPU. The ASR in error is identified by the REG field in the error report (see 5.2.9).

5.2.4.vii ASI Field.

An error occurred in one or more registers accessed via alternate Address Space Identifiers. The register or registers in error are identified by the combination of their ASI, their start address, and their length, using the ASI (see 5.2.8), ADDR (see 5.2.6), and SZ (see 5.2.7) fields, respectively.

5.2.4.viii PREG Field.

An error occurred in one of the internal privileged registers of the CPU. The register in error is identified by the REG field in the error report (see 5.2.9).

NOTE: For permanent errors in the privileged register file of a CPU, the CPU is marked in error. An uncorrected resumable error report is sent to a different CPU in the same partition indicating the ID of the CPU in error.

5.2.4.viiii Execution Mode (MODE).

This field specifies the execution mode of the operation that induced the error. Table 5.2.4-III below lists the currently defined values.

---------------------------------
Value     Description
---------------------------------
0b00      Unknown
0b01      User mode
0b10      Privilege mode
0b11      Reserved
---------------------------------
 Table 5.2.4-III. Execution Mode

The 'Unknown' execution mode will be used in error reports when the hypervisor cannot determine the CPU's state at the time of the error.

5.2.5 ID of the CPU (CPUID).

This field specifies the ID of the CPU affected by the reported error. It is valid when the ATTR field in the error report has the CPU, IRF, or FRF bit set.

5.2.6 Address (ADDR).

If the MEM bit in the ATTR field in the error report is set, then this field contains the starting address of the memory region affected by the error.

If the PIO bit in the ATTR field in the error report is set, then this field contains the PIO transaction address.

If the ASI bit in the ATTR field in the error report is set, then this field contains the first virtual address of the ASI register(s) which caused the error. This is used in conjunction with the ASI field (see section 5.2.8) and the SZ field (see section 5.2.7) to identify the ASI register(s) in error.

A value of (-1) implies that the ADDR is unknown or unused.

5.2.7 Size of the Memory Region (SZ).

This field specifies the size in bytes of the memory region affected by the reported error when the MEM bit in the error attributes (ATTR) field is set.

When the ASI bit in the error attributes (ATTR) field is set, this field indicates the size (in bytes) of the ASI region in error. This must be a multiple of the sun4v ASI register size. For a single ASI/VA register, the SZ field must be set to the size of a single register (typically 8 bytes). The range of ASI/VAs in error will be

    [ADDR]ASI ... [ADDR + (SZ - (size of single register))]ASI.

Note that this implies that only a contiguous range of VAs can be described for a particular ASI region. Error handling software may, however, be aware of gaps in the range and act accordingly.

NB: SZ == 0 is reserved and must not be used.

5.2.8 ASI.

When the ASI bit of the ATTR field in the error report is set, this field contains the value of the sun4v %asi register when the error occurred. Together with the values of the ADDR and SZ fields, it identifies the register(s) which caused the error. If the error occurred on more than one register for that ASI, the SZ field can be used to specify the range of ASI virtual addresses which caused the error (see 5.2.7 above).
For example, if an error occurred in the Niagara2 MMU Primary Context Register 0, the ASI field would be set to 0x21, the ADDR field would be set to 0x8, and the SZ field set to 8 (bytes, the size of a register on N2). For the same CPU, if the error occurred on both primary and secondary context registers, the ASI field would be set to 0x21, the ADDR field would be set to 0x8, and the SZ field set to 16 (bytes, the size of two registers on N2).

5.2.9 REG.

When the ASR bit of the ATTR field in the error report is set, this field specifies the sun4v ASR number. For example, if the error occurred in the system tick register, this field would be set to 24 (%asr24).

When the IRF bit of the ATTR field in the error report is set, this field contains the number of the SPARC V9 general purpose register (see [4], section 5.1.3) which caused the error. For example, if the error occurred in register %o0, this field will contain the value 8, for general purpose register r[8].

When the FRF bit of the ATTR field in the error report is set, this field contains the number of the SPARC V9 floating point register (see [4], section 5.1.4) which caused the error. For example, if the error occurred in register %f9, this field will contain the value 9, for floating point register f[9].

When the PREG bit of the ATTR field in the error report is set, this field contains the number of the SPARC V9 privileged register (see [5], sections 5.8, 7.83) which caused the error.

Note that this field is a 2-byte (16-bit) word but only bits[14:0] are allocated for use as the register number. Bit[15] is the VALID bit. This bit must be set to indicate that the REG value in bits[14:0] is valid. If this bit is set, guest software may assume that the REG value has a valid value encoded. If this bit is not set, guest software must assume that the value in the REG field is not valid for this error report and should not use that value in its error handling.
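
The VALID/register-number split of the REG field can be sketched in C as follows. The names REG_VALID_BIT, REG_NUM_MASK and reg_field_decode are hypothetical and chosen only for this illustration; the bit positions are the ones stated above. The sample value 0x8018 assumes that the %asr24 example is reported with the VALID bit set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REG_VALID_BIT 0x8000u   /* bit[15]: REG contents are valid */
#define REG_NUM_MASK  0x7fffu   /* bits[14:0]: register number     */

/*
 * Hypothetical helper: extract the register number from the 16-bit REG
 * field. Returns false, leaving *regnum untouched, when the VALID bit is
 * clear, in which case the guest must not use the register number.
 */
static bool reg_field_decode(uint16_t reg_field, uint16_t *regnum)
{
    if ((reg_field & REG_VALID_BIT) == 0)
        return false;
    *regnum = (uint16_t)(reg_field & REG_NUM_MASK);
    return true;
}

/* Usage: VALID set plus register number 24 (the %asr24 example). */
static void reg_field_example(void)
{
    uint16_t n = 0;
    assert(reg_field_decode(0x8018, &n) && n == 24);
    assert(!reg_field_decode(0x0018, &n));  /* VALID clear: ignore REG */
}
```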
Table 5.2.4-IV below describes the format of the REG field.

---------------------------------------------------------------------
Field   Bit         Description
        Position
---------------------------------------------------------------------
VALID   15          1: The contents of this field are valid
                    0: This field does not contain a valid
                       register number
REG     14:0        Register number
---------------------------------------------------------------------
Table 5.2.4-IV. Format of the Register Number (REG) Field

5.2.10 SECS.

The number of seconds the guest should allow before shutdown.

5.2.11 Resumable queue is full (RQFULL).

This field applies only to resumable error reports. When set, it specifies that zero or more resumable errors might have been dropped between the queueing of that error report and the next one.

6.0 Hypervisor Error Handling Principles of Operation

This section describes the principles of operation of the hypervisor error handlers.

6.1 Handling of Errors

6.1.1 Corrected Errors

For hardware corrected errors where the error is not automatically cleared, the hypervisor attempts to clear the source of the error by writing back the corrected data (an attempt to clear a stuck-at bit will fail). For example, if a correctable ECC error was reported on an L2 cache line or DRAM memory, the hypervisor will attempt to write the corrected data back to the error location.

6.1.2 Uncorrectable Errors

6.1.2.i Register errors

For uncorrectable errors in the processor's integer or floating-point register files, the hypervisor attempts to clear the source of the error by writing a test pattern to the register and reading it back. If the error in the register cannot be cleared due to a stuck bit, then the CPU is stopped and a resumable error (uncorrected resumable error report) indicating the CPU in error is sent to another CPU of the same partition.
If the uncorrectable error in the register is cleared, a precise non-resumable error report is sent to the guest on the CPU that took the trap. The register that was reported in error then contains an undefined value.

6.1.2.ii Cache errors

For uncorrectable errors in the processor caches, the hypervisor clears the error from the cache by flushing the cache line with the bad data to memory, as long as there is no expansion of data poisoning or corruption (which is determined based on the granularity of the error protection in the processor caches and memory).

If flushing the cache line with the bad data would result in the expansion of data poisoning or corruption, the hypervisor leaves the bad data in the cache when reporting the error to the guest. (The guest can use the hypervisor call to scrub the bad data, which clears the cache line in error by filling it with zeroes and flushes it to memory.)

If the cache line with the bad data is clean, then the hypervisor evicts the line with the bad data out of the cache.

Here is an example. Suppose that the L2 cache has ECC protection for every 4 bytes and DRAM memory has ECC protection for every 16 bytes. In this case, an uncorrectable error in the L2 cache would mean that there are 4 bytes of bad data. If the line containing the error was modified, then flushing the line out of the cache to memory would expand the data corruption from 4 bytes to 16 bytes, because the ECC protection granularity of memory is 16 bytes. To avoid such expansion, the hypervisor will not attempt to clear the uncorrectable error that was detected in the L2 cache line.

6.1.2.iii Cache writeback errors

For uncorrectable errors during cache writebacks, if the processor turns the signalling error into a non-signalling error, thereby resulting in data corruption, the hypervisor will regard the writeback error as an unconstrained error and reset the affected guests.
If the uncorrectable error on a cache writeback remains a signalling error after the writeback, then an uncorrected resumable error report is sent to the affected guests.

NOTE: It is highly recommended that processors do not convert a signalling error to a non-signalling error on cache writebacks.

6.1.2.iv Memory errors

For uncorrectable memory errors, the hypervisor does not attempt to clear the source of the error. The hypervisor notifies the sun4v guest of the memory region in error. The sun4v guest is responsible for its recovery policy. It can scrub the memory region in error using the hypervisor call to scrub memory, which clears the memory region in error by filling it with zeroes.

The hypervisor call to scrub memory should return an error code to the guest if the scrub was not successful. The hypervisor should also notify the Diagnosis Service Provider about the scrub operations performed on behalf of the guest.

6.1.2.v ASR errors.

For uncorrectable ASR errors, the hypervisor does not attempt to clear the source of the error. The hypervisor notifies the sun4v guest of the ASR in error. The sun4v guest is responsible for identifying the ASR and determining the recovery policy. It may be able to correct the error or reload the ASR with correct data. For example:

        if (ATTR.ASR && REG == 24)      /* system tick register */
                read system time from TOD
                write new system time to %asr24
                retry

6.1.2.vi ASI errors.

For uncorrectable ASI errors, the hypervisor does not attempt to clear the source of the error. The hypervisor notifies the sun4v guest of the ASI in error using the ASI, ADDR and SZ fields of the error report. The sun4v guest is responsible for identifying the ASI register(s) and determining the recovery policy. It may be able to correct the error or reload the register with correct data.
For example, for a Rock CRP error we have

        ASI=0x21 VA=0x8         ASI_Primary_Context_ID_0
        ASI=0x21 VA=0x10        ASI_Secondary_Context_ID_0

        if (ATTR.ASI && ASI == 0x21) {
                if (ADDR == 0x8) {
                        reset primary context register()
                        if (SZ == 16)
                                reset secondary context register()
                }
                if (ADDR == 0x10) {
                        reset secondary context register()
                }
        }

Note: The ASR/ASI error types are essentially targeted at errors in registers which contain data that is maintained by the guest OS. The guest should have a valid copy of the data to reload the register and clear the error. Alternatively it may be possible to continue operating without correcting the error by disabling or avoiding some guest features/functionality.

6.1.2.vii CPU "error" state

When the hypervisor puts a CPU in error state, it must ensure the following:

(1) Hypervisor calls targeting CPUs in error state should return an error code to the guest indicating that one or more of the targeted CPUs are in error state.

(2) The guest cannot restart the CPU that is in error state.

6.2 Reporting of Errors

The guidelines for reporting errors are:

(1) All errors are reported to the FMA Error Report Generator and sent to the Diagnosis Service Provider.

(2) Always report an error that generates a precise or deferred trap to the CPU that took the trap unless the CPU is marked in error.

(3) For disrupting errors, notify only the affected guests as can be determined based on the error information logged.

(4) Errors in shared memory are reported to all of the affected guests. If the error was a precise or deferred error, then a non-resumable error report is sent to the guest that induced the operation, and a resumable error report is sent to the other guests that share the memory region in error. If the error was a disrupting trap (for example, as generated by a hardware scrub operation), then a resumable error is sent to all of the affected guests.
(5) The hypervisor should set the RQFULL bit in the error attributes field of the resumable error report that makes the queue full. (A queue is said to be full when the tail pointer, if incremented, would equal the head pointer.) The hypervisor drops resumable error reports if the resumable error queue is full. The setting of the RQFULL bit in the resumable error report indicates to the guest that zero or more resumable errors might have been dropped since the queueing of that error report.

(6) If the nonresumable_error queue of a CPU is non-empty or if it does not have enough room to queue the error report(s), then the hypervisor marks that CPU in error and sends a resumable error report to a different CPU of the same partition. If all the CPUs in a partition are in error, then the partition is reset.

(7) Errors in virtualized I/O devices should be reported to only the affected guests.

6.3 Handling Correctable Error Storms

The hypervisor must attempt to prevent a storm of correctable errors from pinning the system in the hypervisor for long periods of time. This is done by disabling correctable error trap generation for a finite period on the CPU that just took a correctable error trap. At the expiration of the period, if no correctable errors are logged on that CPU, then correctable error trap generation is reenabled.

The period for which correctable error trap generation is disabled on a CPU is determined based on platform policy and can be tuned from the platform's Diagnosis Service Provider.

6.4 Collecting diagnostic data for errors

For errors, the hypervisor must perform CPU-specific work to gather information required to populate the service error reports for diagnosis. Please refer to the CPU's Error Handling document for more information.

6.5 Switch Guest to New Hardware

TBD

7.0 Rules for future expansion

All bits of DESC/ATTR word are significant, including reserved bits.
New errors not covered in the current specification will be indicated by using reserved bits in one or both of these two fields. If a guest CPU encounters a non_resumable_error trap, and the error payload contains an unrecognized encoding in the DESC/ATTR word, it is recommended that the guest terminate.

Reserved fields in the structure from offsets 0x32-0x3f may be any value. Hypervisors implementing the current spec will fill these fields with zeroes; however, guests implementing the current spec should not rely on this, but should ignore the fields altogether.

8.0 References

1. The sun4v Architecture Specification.
   http://projectq.sfbay/

2. Sun SPARC Processor RAS and Error Handling Requirements
   http://chipweb.sfbay/archperf/SPARC-Arch-SWG/RASEH-doc.txt

3. Diagnosis Service Provider Architecture Proposal
   http://dtsw.sfbay/~sriniv/docs/niagara/diag_service_provider.txt

4. The SPARC V9 Architecture Manual
   https://systemsweb.sfbay.sun.com/archperf/SPARC-Arch-SWG/SPARC-V9-current.pdf

5. UltraSPARC Architecture 2006
   https://systemsweb.sfbay.sun.com/archperf/SPARC-Arch-SWG/restricted/UA2006-current-draft-HP-Sun.pdf

6. PCI-Express Root Complex Error Handling Interfaces for Sun4v
   http://projectq.sfbay.sun.com/docs/sun4v-err.txt

Appendix A. Sample Sun4v Guest OS Error Handler

Disclaimer: This is not intended to be an example of advanced OS error handler routines. It is an example of extremely simple guest error handlers.
A.1 Resumable error handler

        if (DESC == 1)          /* Uncorrected resumable error */
                if (ATTR.CPU)
                        if (ATTR.MODE == User)
                                kill user process
                        else
                                panic
                if (ATTR.MEM)
                        get ADDR, SZ
                        call hypervisor to scrub memory
                        retry
                if (ATTR.ASI)
                        get ASI, ADDR, SZ
                        if ASI register(s) valid for this CPU
                                if ASI register(s) is reloadable/recoverable
                                        reload/recover
                                        retry
                        panic
        if (DESC == 4)          /* Shutdown request */
                if (ATTR.SHUT)
                        get SECS
                        delay SECS seconds
                        shutdown

A.2 Non-resumable error handler

        if (DESC == 5)          /* dump core */
                panic
        if (DESC == 3)          /* deferred trap */
                if (ATTR.MODE == User)
                        kill user process
                else
                        panic
        if (ATTR.MEM)
                get ADDR, SZ
                make hypervisor call to scrub memory
                if (data not recoverable)
                        panic
                else
                        retry
        if (ATTR.PIO)
                get IOADDR
                panic
        if (ATTR.IRF or ATTR.FRF)
                if (user mode)
                        kill user process
                else
                        panic
        if (ATTR.ASR)
                get ASR register from REG
                if ASR valid for this CPU
                        if ASR is reloadable/recoverable
                                reload/recover
                                retry
                if (user mode)
                        kill user process
                else
                        panic
        if (ATTR.ASI)
                get ASI, ADDR, SZ
                if ASI register(s) valid for this CPU
                        if ASI register(s) is reloadable/recoverable
                                reload/recover
                                retry
                if (user mode)
                        kill user process
                else
                        panic
        if (ATTR.PREG)
                get REG (privileged register)
                if privileged register is reloadable/recoverable
                        reload/recover
                        retry
                if (user mode)
                        kill user process
                else
                        panic
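
As an illustration only, the decision logic of the resumable error handler in A.1 might be expressed in C as below. Everything apart from the DESC values (1 and 4) and the execution mode encodings is an assumption made for this sketch: the struct resumable_report is a hypothetical, already-decoded view of the error report (the real ATTR bit positions are defined in section 5 and are not reproduced here), the action names are invented, and the final fall-through to "retry" follows the simplest-guest behaviour described in the introduction rather than anything stated in A.1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Execution mode values from Table 5.2.4-III. */
enum exec_mode { MODE_UNKNOWN = 0, MODE_USER = 1, MODE_PRIV = 2 };

/* Hypothetical guest actions; the names are illustrative only. */
enum guest_action {
    ACT_RETRY, ACT_KILL_PROCESS, ACT_SCRUB_RETRY,
    ACT_RELOAD_RETRY, ACT_SHUTDOWN, ACT_PANIC
};

/* Hypothetical decoded view of a resumable error report. */
struct resumable_report {
    uint32_t desc;                  /* 1: uncorrected resumable, 4: shutdown */
    bool attr_cpu, attr_mem, attr_asi, attr_shut;
    enum exec_mode mode;
    bool asi_reloadable;            /* guest holds a good copy of the ASI data */
};

/* Choose an action for one report, checking attributes in the A.1 order. */
static enum guest_action resumable_policy(const struct resumable_report *r)
{
    if (r->desc == 1) {
        if (r->attr_cpu)
            return r->mode == MODE_USER ? ACT_KILL_PROCESS : ACT_PANIC;
        if (r->attr_mem)
            return ACT_SCRUB_RETRY;     /* scrub via hypervisor call, retry */
        if (r->attr_asi)
            return r->asi_reloadable ? ACT_RELOAD_RETRY : ACT_PANIC;
    }
    if (r->desc == 4 && r->attr_shut)
        return ACT_SHUTDOWN;            /* after waiting SECS seconds */
    return ACT_RETRY;                   /* assumption: default is retry */
}

/* Usage: a memory error is scrubbed and retried; a privileged-mode CPU
 * error panics. */
static void resumable_policy_example(void)
{
    struct resumable_report mem = { .desc = 1, .attr_mem = true };
    struct resumable_report cpu = { .desc = 1, .attr_cpu = true,
                                    .mode = MODE_PRIV };
    assert(resumable_policy(&mem) == ACT_SCRUB_RETRY);
    assert(resumable_policy(&cpu) == ACT_PANIC);
}
```

Separating the policy decision from the actions themselves keeps the sketch testable without a hypervisor; a real handler would perform the scrub, reload, or shutdown at the point where the action is returned.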