Copyright 2007 Sun Microsystems

1. Introduction
   1.1. Project/Component Working Name:

        Machine Description Latency Node Definitions

   1.2. Name of Document Author/Supplier:
   
        Steve Sistare
	Ashley Saulsbury

   1.3. Date of This Document:
   
        05/08/2007

   1.4. Name of Major Document Customer(s)/Consumer(s):
	1.4.1. The PAC or CPT you expect to review your project:
            HS PAC
	1.4.2. The ARC(s) you expect to review your project:
            FWARC
	1.4.3. The Director/VP who is "Sponsoring" this project:
            Tony Barreca
	1.4.4. The name of your business unit:
            Systems Group

   1.5. Email Aliases:
        1.5.1. Responsible Manager: chad.solomon@sun.com
        1.5.2. Responsible Engineer: kanth.ghatraju@sun.com
        1.5.3. Marketing Manager: N/A
        1.5.4. Interest List: mpo-maramba-core@sun.com

2. Project Summary
   2.1. Project Description:
   
        This project defines MD nodes that describe to the Solaris
	guest the latency between various components of a system 
	including CPUs, memory, and I/O devices.  These latencies can
	vary depending on the physical location of the initiator and
	the target, and exposing them allows the guest to co-locate its
	resources to achieve minimum latencies and improve domain
	application performance.  For example, Solaris Memory Placement 
	Optimization (MPO) can be enabled, in which a thread's memory 
	is allocated close to the CPU that runs the thread.
	
   2.2. Risks and Assumptions:

	Assumption: The Solaris changes to enable MPO will be delivered
	by a separate project.

3. Business Summary
   3.1. Problem Area:

	Larger servers may exhibit NUMA characteristics which hurt 
	application performance if ignored.  Solaris minimizes latency 
	by considering the physical topology when mapping threads and 
	data to CPUs and memory.  In the sun4v architecture, resources
	are virtualized, and the guest has no visibility into physical
	topology, so the latency relationships must be explicitly 
	represented in the guest MD.

   3.2. Market/Requester:
   
	The Maramba platform team, Systems Group. Maramba, based on
	the Victoria Falls processor, is the first sun4v platform
	with NUMA characteristics.  However, the latency node
	specification is general enough to support future sun4v NUMA
	platforms.

   3.3. Business Justification:
   
	Allows multi-processor sun4v systems to be performance
	competitive by reducing the impact of their NUMA characteristics.
	These changes plus the necessary Solaris changes will boost 
	performance on some strategic benchmarks to be published for 
	Maramba RR by an estimated 5% - 10%.

   3.4. Competitive Analysis:
        N/A

   3.5. Opportunity Window/Exposure:

	Must be available for Maramba RR for maximum sales impact.

   3.6. How will you know when you are done?:
      
	1. Code has been integrated into released system firmware image, and
	2. Nodes defined in this specification are visible in the guest MD,
	   which can be verified by examining the machine description
	   provided to a guest operating system, and
	3. Solaris MPO framework implements and shows correct lgroups, and
	4. The measured lmbench latency approaches the minimum hardware
	   memory latency.
	
	Note that 3 and 4 are not part of this project, but are the 
	ultimate check of its correctness.

4. Technical Description:
    4.1. Overview:

	The basis for this project is a type of MD node called
	the latency group, or lg for short, which links together a
	group of resources which can be accessed with a specified
	latency.  Existing nodes including cpu, mblock, and iodevice
	either point to a latency group via fwd arcs, or are pointed 
	to by the latency group.
	
	To map portions of an mblock to a locality group, the guest 
	must examine lower order bits of the PA.  To enable the guest 
	to recover lower order PA bits from an RA, while still hiding the
	higher order PA bits, an mblock property addr-congruence-offset 
	is defined.

	Latency group and congruence information is optional in the
	MD.  It may be added, changed, or removed during reconfiguration 
	of virtual resources. Furthermore, it may be inaccurate during
	transitional periods.  For these reasons, latency group information
	should be used only for performance optimizations, so that
	inaccuracies result in sub-optimal performance but never in
	incorrect behavior.

    4.1.1 Latency Group Nodes

	The latency group nodes are shown below.  Properties
	are shown in the next section.

                                       Required        Optional
	Name              Category   subordinates    subordinates
	------------------------------------------------------------------
	mem-lg            optional    mblock              -

	Defines the load and store latency between virtual CPU(s) and 
	memory; specifically between superior cpu nodes and subordinate
	mblock nodes.
	Topology: cpu -> mem-lg -> mblock

	------------------------------------------------------------------
	dma-lg            optional    mblock              -

	Defines the latency of DMA operations between I/O device(s) and
	memory; specifically between superior iodevice nodes plus 
	descendants, and subordinate mblock nodes.
	Topology: iodevice -> dma-lg -> mblock

	------------------------------------------------------------------
	pio-lg            optional    iodevice            -

	Defines the latency of load and store operations (aka programmed 
	I/O) between virtual CPU(s) and I/O device(s); specifically
	between superior cpu nodes and subordinate iodevice nodes plus
	descendants.
	Topology: cpu -> pio-lg -> iodevice

	------------------------------------------------------------------
	irq-lg            optional       -                -

	Defines interrupt delivery latency between I/O device(s) and 
	virtual CPU(s); specifically between superior iodevice nodes plus
	descendants, and superior cpu nodes.
	Topology: iodevice -> irq-lg <- cpu

	------------------------------------------------------------------
	latency-groups    optional       -               mem-lg, dma-lg,
						         pio-lg, irq-lg

	This collective node leads to all latency group nodes.
	Topology:
					   -> mem-lg
		platform -> latency-groups -> dma-lg
					   -> pio-lg
					   -> irq-lg

	------------------------------------------------------------------
	group             optional       -               any

	The group node is a nexus for incoming and outgoing arcs 
	and serves no other purpose.  This avoids a combinatorial
	explosion of arcs when defining many-to-many relationships.
	The group node is a pass-through node for MD traversals.
	Topology:
                     any ->       -> any
                     any -> group -> any
                     any ->       -> any
	------------------------------------------------------------------
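
	The traversal rules above can be illustrated with a minimal Python
	sketch.  The dictionary layout, the field names "name" and "fwd",
	the node ids, the latency values, and the helpers subordinates()
	and mem_latencies() are all hypothetical and chosen only for
	illustration; a real guest would walk the binary MD through its
	own MD access routines.  The sketch follows fwd arcs from a cpu
	node, treats group nodes as pass-through, and pairs each reachable
	mblock with the latency of the mem-lg that leads to it.

```python
# Hypothetical in-memory model of an MD fragment: node id -> node.
# "name" is the MD node name, "fwd" lists the targets of its fwd arcs.
# This layout is an assumption for illustration, not the MD binary format.
MD = {
    "cpu0": {"name": "cpu", "fwd": ["lg-near", "lg-far"]},
    "lg-near": {"name": "mem-lg", "latency": 100_000, "fwd": ["grp0"]},
    "lg-far": {"name": "mem-lg", "latency": 200_000, "fwd": ["mb2"]},
    "grp0": {"name": "group", "fwd": ["mb0", "mb1"]},
    "mb0": {"name": "mblock", "fwd": []},
    "mb1": {"name": "mblock", "fwd": []},
    "mb2": {"name": "mblock", "fwd": []},
}


def subordinates(md, node_id, wanted):
    """Return ids of nodes named `wanted` reached from node_id by fwd arcs,
    passing through any intervening group nodes."""
    found = []
    for child in md[node_id]["fwd"]:
        name = md[child]["name"]
        if name == wanted:
            found.append(child)
        elif name == "group":
            # A group node is only a nexus for arcs, so look through it.
            found.extend(subordinates(md, child, wanted))
    return found


def mem_latencies(md, cpu_id):
    """Return sorted (latency_ps, mblock_id) pairs for one cpu node."""
    pairs = []
    for lg in subordinates(md, cpu_id, "mem-lg"):
        for mb in subordinates(md, lg, "mblock"):
            pairs.append((md[lg]["latency"], mb))
    return sorted(pairs)
```

	Calling mem_latencies(MD, "cpu0") on this made-up fragment lists
	the mblocks behind the group node with the latency of "lg-near"
	and the directly attached mblock with the latency of "lg-far".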

    4.1.2 Latency Group Properties

	The properties of the latency nodes are defined below:

	Name              Tag        Req'd?
	------------------------------------------------------------------
	latency         PROP_VAL     Yes

	A 64-bit unsigned integer giving the approximate latency of 
	access in picoseconds.  Defined for mem-lg, dma-lg, pio-lg, 
	and irq-lg nodes.

	------------------------------------------------------------------
	addr-mask       PROP_VAL     No
	addr-match      PROP_VAL     No

	These are 64-bit unsigned integers that together define a memory
	stripe.  Memory in subordinate mblock nodes is a member of the
	latency group if (address & addr-mask) == addr-match.  If these 
	properties are absent, then all memory described by subordinate 
	mblock nodes is a member of the latency group.  Defined for
	mem-lg and dma-lg nodes.
	
	The value used in the mask and match equation is 
	(RA + addr-congruence-offset) as described in section 4.1.4 below. 
	
	------------------------------------------------------------------
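
	The mask and match rule above can be sketched in Python as
	follows.  The function and constant names are hypothetical, and
	the four match values are an assumed example of four groups
	striped on 1GB boundaries; only the comparison itself comes from
	the property definition.

```python
U64_MAX = (1 << 64) - 1  # the properties are 64-bit unsigned integers


def in_latency_group(address, addr_mask=None, addr_match=None):
    """Return True if `address` belongs to the latency group.

    `address` is (RA + addr-congruence-offset).  Passing None for either
    property models its absence from the MD node, in which case all memory
    in the subordinate mblock nodes belongs to the group.
    """
    if addr_mask is None or addr_match is None:
        return True
    return ((address & U64_MAX) & addr_mask) == addr_match


# Assumed example: four groups striped on 1GB boundaries.  The mask selects
# address bits 30 and 31; the four match values are illustrative.
STRIPE_MASK = 0xC0000000
STRIPE_MATCHES = [0x00000000, 0x40000000, 0x80000000, 0xC0000000]


def stripe_of(address):
    """Return the index of the first illustrative group containing address."""
    for i, match in enumerate(STRIPE_MATCHES):
        if in_latency_group(address, STRIPE_MASK, match):
            return i
    return None
```

	Because every address matches exactly one of the four assumed
	match values, stripe_of() never returns None for this mask; a
	guest handling arbitrary MD contents should not rely on that.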


    4.1.3 Subordinates

	The following nodes are augmented to allow subordinates.

                           Optional
	Name             subordinates
	------------------------------------------
	cpu               mem-lg, pio-lg, irq-lg
	iodevice          irq-lg, dma-lg
	platform          latency-groups
	------------------------------------------

    4.1.4 RA and PA Congruence

	The real address space used within a virtual machine is a
	remapping of portions of a system's underlying physical memory.
	A guest running within a virtual machine is not provided
	the physical addresses of its memory blocks. This abstraction of
	memory addresses enables guests to be moved in memory without
	changing their real address space layout.

	However, to support NUMA and page-coloring algorithms for a
	guest operating system further information is required that
	describes the congruency relationship between a real address and
	the underlying physical address to which it is mapped.
	
	To do this, this case adds the following optional property to an
	mblock node:

	    Node name: mblock
	    ------------------------------------------------------------------
	    addr-congruence-offset    PROP_VAL     No

	    A 64-bit unsigned integer.
	    addr-congruence-offset = (PA_base - RA_base) mod M.

	    M is a power of 2 strictly greater than all values of addr-mask 
	    and index-mask (see section 4.1.5) in the MD.
	    The guest is guaranteed that
	    (RA + addr-congruence-offset) = PA (mod M)
	    where "=" means "is congruent to".  In other words, the guest
	    may recover the lower order PA bits in positions below log2(M).

	    If this property is not present in the mblock, then its value
	    must be assumed to be 0.

	    Programming note:

		This property is typically provided when the congruency between
		the real and underlying physical address of an mblock covers
		fewer bits than lgroup or page color masking requires.

		For example, consider a NUMA machine where memory is
		striped on 1GB boundaries between 4 different memory
		controllers. Each cpu may see a different access latency
		to each of the memory controllers; each latency is
		represented by a latency group node (called an lgroup in
		this note) as described above.

		Now consider a 1GB memory segment that starts at real address
		0x400000000 and is bound to physical address 0x10000000.

		To identify 4 different memory controllers with a 1GB stripe
		the addr-mask property of one of the lgroups might have the
		value 0xc0000000.
	
		In this legitimate scenario, to apply the lgroup
		information correctly, the guest OS needs enough correctly
		congruent bits of the actual physical address to
		apply the lgroup address mask meaningfully.

		So for our example, real address 0x400000000 corresponds to
		physical address 0x10000000, and real address 0x430000000
		corresponds to physical address 0x40000000.

		If we apply the lgroup mask to 0x10000000 we get 0x0.
		If we apply the lgroup mask to 0x40000000 we get 0x40000000
		as the result. Therefore we see that these different
		address pages reside on different memory controllers
		with different access latencies.

		(Note: if we had applied the lgroup mask to the corresponding
		real addresses, the result would always be 0x0, implying the
		same memory controller, which would be incorrect.)

		Thus a means to recover the relevant bits
		of the physical address is required so that the address
		mask can be correctly applied.

		The addr-congruence-offset property in an mblock provides
		this information. As described above, the property is derived
		from the difference between the real addresses of an mblock
		and their corresponding physical addresses. However, to avoid
		disclosing the actual physical address bindings, this property
		is not the actual difference, but only enough bits of the
		RA/PA difference that an addr-mask can be correctly applied.

		Thus the value provided for addr-congruence-offset is
		sufficient that the equality:

			((ra + addr-congruence-offset) & addr-mask) == addr-match

		holds correctly for all the provided addr-mask and addr-match
		values within the MD in order to correctly match lgroups.
		
		If the addr-mask 0xc0000000 is the largest mask provided,
		then the addr-congruence-offset for the example above would be:

			(0x10000000 - 0x400000000) & 0xffffffff = 0x10000000

		The address matches for the real addresses above will be:

			(0x400000000 + 0x10000000) & 0xc0000000 = 0x0

			(0x430000000 + 0x10000000) & 0xc0000000 = 0x40000000

		As defined above, addr-congruence-offset is an optional
		property of an mblock node.  If it is not present, a value of 0
		can be assumed, and the equality for matching lgroups reduces to:

			(ra & addr-mask) == addr-match

	    ------------------------------------------------------------------
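
	The arithmetic in the programming note above can be reproduced
	with a short Python sketch.  The helper names are hypothetical.
	Choosing M as the smallest power of 2 strictly greater than the
	largest mask is an assumption of this sketch; the definition
	permits any larger power of 2 as well.

```python
def smallest_modulus(masks):
    """Smallest power of 2 strictly greater than every mask value."""
    return 1 << max(masks, default=0).bit_length()


def congruence_offset(pa_base, ra_base, modulus):
    """Compute (PA_base - RA_base) mod M for a power-of-2 modulus M."""
    if modulus <= 0 or modulus & (modulus - 1):
        raise ValueError("modulus must be a power of 2")
    # Python's % returns a non-negative result for a positive modulus,
    # so this is correct even when PA_base < RA_base.
    return (pa_base - ra_base) % modulus


# Values taken from the programming note in section 4.1.4.
RA_BASE = 0x400000000
PA_BASE = 0x10000000
ADDR_MASK = 0xC0000000

M = smallest_modulus([ADDR_MASK])
OFFSET = congruence_offset(PA_BASE, RA_BASE, M)


def masked(ra, offset=OFFSET, mask=ADDR_MASK):
    """Apply addr-mask to (RA + addr-congruence-offset)."""
    return (ra + offset) & mask
```

	With these inputs M is 0x100000000, which corresponds to the
	0xffffffff mask used in the note, and masked() distinguishes the
	two real addresses of the example, whereas masked(ra, offset=0)
	does not.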

    4.1.5 Page coloring

	Page coloring for large caches exhibits a similar set of problems
	to identifying lgroups.

	To assist, a cache node is extended with an optional property
	to compute a matching set within the corresponding cache.
	
	    Node name: cache
	    Property:
	    Name                      Tag        Req'd?
	    ------------------------------------------------------------------
	    index-mask                PROP_VAL     No

	    A 64-bit unsigned integer.  A bit in index-mask is set if that 
	    bit in a PA influences the cache index at which a memory location 
	    is stored when cache resident.

	    Programming note:

		The actual cache index employed by hardware is a
		function of multiple bits from the physical address
		of the memory reference. To compute a page coloring value,
		the index-mask property identifies the relevant bits of
		a physical address. Thus the index-bits for page coloring
		can be derived as:

		index-bits = (ra + addr-congruence-offset) & index-mask

		Where the addr-congruence-offset is the property from the
		mblock (corresponding to the given ra) as defined in section
		4.1.4 above.

		As with lgroup matching, if the addr-congruence-offset
		property is not provided for an mblock, its value can be
		assumed to be zero, reducing the equation to:

		index-bits = ra & index-mask

	    ------------------------------------------------------------------
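
	The index-bits computation above can be sketched in Python as
	follows.  The function names, the example mask, and the example
	offset are hypothetical; real values come from the cache and
	mblock nodes of the MD.  Treating two pages with equal index-bits
	as having the same color is an assumption about how a guest might
	use the value.

```python
def index_bits(ra, index_mask, addr_congruence_offset=0):
    """Return (ra + addr-congruence-offset) & index-mask.

    The default offset of 0 models an mblock without the property.
    """
    return (ra + addr_congruence_offset) & index_mask


def same_color(ra_a, ra_b, index_mask, addr_congruence_offset=0):
    """True if two real addresses in the same mblock yield equal index-bits."""
    return (index_bits(ra_a, index_mask, addr_congruence_offset)
            == index_bits(ra_b, index_mask, addr_congruence_offset))


# Hypothetical values, for illustration only: a mask covering PA bits 13-21
# and an mblock whose addr-congruence-offset is 0x4000.
EXAMPLE_MASK = 0x3FE000
EXAMPLE_OFFSET = 0x4000
```

	Both addresses passed to same_color() must lie in the same
	mblock, since the offset is a per-mblock property.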

    4.1.6 Example

	The file sample-lg.pdf shows an abstracted MD containing mem-lg,
	irq-lg, and pio-lg nodes for a theoretical system with 2
	physical processors, each of which contains 2 virtual CPUs
	and one PCI-E host adapter.  The latency-groups container
	node and the dma-lg nodes are omitted for clarity.

    4.2. Bug/RFE Number(s):

	6540324 represent memory locality and RA vs PA congruence in the MD
	6539799 represent I/O locality in the MD
	6540315 RA versus PA congruence property
	6539930 MPO for sun4v platforms
   
    4.3. In Scope:
   
        MD node definitions and properties.

    4.4. Out of Scope:

        Changes in the Solaris guest.
   
    4.5. Interfaces:

    4.5.1 Imported Interfaces

        Name                    Classification  Description
        -----------------       --------------- -------------------
        sun4v Machine           Sun Private     MD node definitions as
        Description nodes                       defined by FWARC/2005/115

        "iodevice" MD node      Committed       MD node name and properties as 
                                                described in FWARC/2007/070

    4.5.2 Exported Interfaces

	All interfaces are specified in Section 4.1 of this document.

        Name                    Classification  Description
        ----------------------  --------------  -------------------
	"latency-groups" node	 Committed	MD node name and properties
	"mem-lg" node		 Committed	MD node name and properties
	"pio-lg" node		 Committed	MD node name and properties
	"dma-lg" node		 Committed	MD node name and properties
	"irq-lg" node		 Committed	MD node name and properties
	"group"  node		 Committed	MD node name and properties
	"index-mask"             Committed	MD property of "cache" node
	"addr-congruence-offset" Committed	MD property of "mblock" node


5. Reference Documents:

        [1] Sample latency-group MD graph 
            sample-lg.pdf

6. Resources and Schedule:
   6.1. Projected Availability:
	Must be available in firmware release supporting Maramba RR.

   6.2. Cost of Effort:
        2-4 person weeks

   6.3. Cost of Capital Resources:
        None

   6.4. ARC review type:
	Fasttrack

7. Prototype Availability:
   7.1. Prototype Availability:
        Working prototype is available.

   7.2. Prototype Cost:
        N/A