4. Technical Description:
    4.1. Overview:

        The basis for this project is a type of MD node called
        the latency group, or lg for short, which links together a
        group of resources which can be accessed with a specified
        latency.  Existing nodes including cpu, mblock, and iodevice
        either point to a latency group via fwd arcs, or are pointed 
        to by the latency group.
        
        To map portions of an mblock to a latency group, the guest
        must examine lower order bits of the PA.  To enable the guest
        to recover lower order PA bits from an RA, while still hiding the
        higher order PA bits, an mblock property address-congruence-offset
        is defined.

        Latency group and congruence information is optional in the
        MD.  It may be added, changed, or removed during reconfiguration
        of virtual resources, and it may be inaccurate during
        transitional periods.  For these reasons, latency group information
        should be used only for performance optimizations, so that
        inaccuracies result in sub-optimal performance but never in
        incorrect behavior.

    4.1.1 Latency Group Nodes

        The latency group nodes are shown below.  Properties
        are shown in the next section.

                                               Required        Optional
        Name                      Category   subordinates    subordinates
        ------------------------------------------------------------------
        memory-latency-group      optional    mblock              -

        Defines the load and store latency between virtual CPU(s) and 
        memory; specifically between superior cpu nodes and subordinate
        mblock nodes.
        Topology: cpu -> memory-latency-group -> mblock

        ------------------------------------------------------------------
        dma-latency-group         optional    mblock              -

        Defines the latency of DMA operations between I/O device(s) and
        memory; specifically between superior iodevice nodes plus 
        descendants, and subordinate mblock nodes.
        Topology: iodevice -> dma-latency-group -> mblock

        ------------------------------------------------------------------
        pio-latency-group         optional    iodevice            -

        Defines the latency of load and store operations (aka programmed 
        I/O) between virtual CPU(s) and I/O device(s); specifically
        between superior cpu nodes and subordinate iodevice nodes plus
        descendants.
        Topology: cpu -> pio-latency-group -> iodevice

        ------------------------------------------------------------------
        interrupt-latency-group   optional       -                -

        Defines interrupt delivery latency between I/O device(s) and 
        virtual CPU(s); specifically between superior iodevice nodes plus
        descendants, and superior cpu nodes.
        Topology: iodevice -> interrupt-latency-group <- cpu

        ------------------------------------------------------------------
        latency-groups            optional       -       memory-latency-group
                                                         dma-latency-group
                                                         pio-latency-group
                                                         interrupt-latency-group

        This collective node leads to all latency group nodes.
        Topology:
                                           -> memory-latency-group
                platform -> latency-groups -> dma-latency-group
                                           -> pio-latency-group
                                           -> interrupt-latency-group

        ------------------------------------------------------------------
        group                     optional       -       any

        The group node is a nexus for incoming and outgoing arcs 
        and serves no other purpose.  This avoids a combinatorial
        explosion of arcs when defining many-to-many relationships.
        The group node is a pass-through node for MD traversals.
        Topology:
                     any ->       -> any
                     any -> group -> any
                     any ->       -> any
        ------------------------------------------------------------------
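
        One way a consumer might look through group nodes during a
        traversal is sketched below.  The dictionary-based MD model and
        the names md, fwd, and resolve_subordinates are hypothetical and
        chosen only for illustration; a real MD is not accessed this way.

```python
def resolve_subordinates(md, node_id):
    """Return the effective subordinates of node_id, in arc order,
    following fwd arcs and looking through any 'group' nodes."""
    result = []
    seen = set()
    stack = list(reversed(md[node_id]["fwd"]))
    while stack:
        nid = stack.pop()
        if nid in seen:
            continue
        seen.add(nid)
        node = md[nid]
        if node["name"] == "group":
            # A group node carries no meaning of its own: expand it.
            stack.extend(reversed(node["fwd"]))
        else:
            result.append(nid)
    return result


# Hypothetical MD fragment: two cpus share one group node that fans
# out to two memory-latency-group nodes.
md = {
    "cpu0": {"name": "cpu", "fwd": ["g0"]},
    "cpu1": {"name": "cpu", "fwd": ["g0"]},
    "g0": {"name": "group", "fwd": ["mlg0", "mlg1"]},
    "mlg0": {"name": "memory-latency-group", "fwd": ["mb0"]},
    "mlg1": {"name": "memory-latency-group", "fwd": ["mb0"]},
    "mb0": {"name": "mblock", "fwd": []},
}
```

        In this fragment both cpu0 and cpu1 resolve to mlg0 and mlg1, as
        if each cpu had arcs directly to both latency groups.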

    4.1.2 Latency Group Properties

        The properties of the latency nodes are defined below:

        Name                 Tag        Req'd?
        ------------------------------------------------------------------
        latency            PROP_VAL     Yes

        A 64-bit unsigned integer giving the approximate latency of 
        access in picoseconds.  Defined for memory-latency-group, 
        dma-latency-group, pio-latency-group, and interrupt-latency-group 
        nodes.

        Name                 Tag        Req'd?
        ------------------------------------------------------------------
        address-mask       PROP_VAL     No
        address-match      PROP_VAL     No

        These are 64-bit unsigned integers that together define a memory
        stripe.  Memory in subordinate mblock nodes is a member of the
        latency group if

            ((RA + address-congruence-offset) & address-mask) == address-match

        address-mask and address-match are defined in PA space.
        address-congruence-offset is a property of the mblock in
        which the RA lies, and transforms the RA bits into PA space
        for all bits covered by address-mask (see 4.1.4).

        If these properties are absent, then all memory described by
        subordinate mblock nodes is a member of the latency group.
        Defined for memory-latency-group and dma-latency-group nodes.
        ------------------------------------------------------------------
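
        The membership test above can be sketched as follows.  This is a
        minimal illustration, not part of the MD definition: the function
        name in_latency_group, the use of None to model absent properties,
        and the truncation of the sum to 64 bits are all assumptions.

```python
MASK64 = (1 << 64) - 1  # properties are 64-bit unsigned integers


def in_latency_group(ra, congruence_offset=0,
                     address_mask=None, address_match=None):
    """Return True if real address `ra` is a member of the latency group.

    congruence_offset is the address-congruence-offset of the mblock
    that contains `ra` (0 if the mblock does not define it).
    """
    # Assumption: address-mask and address-match appear together; if
    # either is absent, all memory in the subordinate mblocks is a member.
    if address_mask is None or address_match is None:
        return True
    # Assumption: the sum wraps at 64 bits, like unsigned arithmetic in C.
    pa_bits = (ra + congruence_offset) & MASK64
    return (pa_bits & address_mask) == address_match
```

        With the values used in the programming note of section 4.1.4,
        in_latency_group(0x430000000, 0x10000000, 0xc0000000, 0x40000000)
        evaluates to True.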


    4.1.3 Subordinates

        The following nodes are augmented to allow subordinates.

                               Optional
        Name                 subordinates
        ------------------------------------------
        cpu               memory-latency-group
                          pio-latency-group
                          interrupt-latency-group

        iodevice          interrupt-latency-group
                          dma-latency-group

        platform          latency-groups
        ------------------------------------------
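
        The arcs defined in 4.1.1 and 4.1.3 can be collected into a single
        lookup table, for example to sanity-check a generated MD.  The
        names LATENCY_ARCS and is_latency_arc are hypothetical, and the
        table deliberately omits any subordinates that these node types
        are allowed by other specifications.

```python
# Arcs taken from sections 4.1.1 and 4.1.3 only.
LATENCY_ARCS = {
    "platform": {"latency-groups"},
    "latency-groups": {
        "memory-latency-group", "dma-latency-group",
        "pio-latency-group", "interrupt-latency-group",
    },
    "cpu": {
        "memory-latency-group", "pio-latency-group",
        "interrupt-latency-group",
    },
    "iodevice": {"interrupt-latency-group", "dma-latency-group"},
    "memory-latency-group": {"mblock"},
    "dma-latency-group": {"mblock"},
    "pio-latency-group": {"iodevice"},
    "interrupt-latency-group": set(),
}


def is_latency_arc(superior, subordinate):
    """True if superior -> subordinate is an arc described in 4.1.1/4.1.3."""
    # A group node may sit on either end of any arc (see 4.1.1).
    if "group" in (superior, subordinate):
        return True
    return subordinate in LATENCY_ARCS.get(superior, set())
```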

    4.1.4 RA and PA Congruence

        The real address space used within a virtual machine is a
        remapping of portions of a system's underlying physical memory.
        A guest running within a virtual machine is not provided
        the physical addresses of its memory blocks. This abstraction of
        memory addresses enables guests to be moved in memory without
        changing their real address space layout.

        However, to support NUMA and page-coloring algorithms in a
        guest operating system, further information is required that
        describes the congruency relationship between a real address and
        the underlying physical address to which it is mapped.

        To do this, this case adds the following optional property to the
        mblock node:

            Node name: mblock

            Name                             Tag        Req'd?
            ------------------------------------------------------------------
            address-congruence-offset       PROP_VAL     No

            A 64-bit unsigned integer.
            address-congruence-offset = (PA_base - RA_base) mod M.

            M is a power of 2 strictly greater than all values of address-mask 
            and index-mask (see section 4.1.5) in the MD.

            The guest adds address-congruence-offset to an RA before applying
            masks based on the PA, such as address-mask and index-mask.
            See 4.1.2 and 4.1.5 for details.

            If this property is not present in the mblock, then its value
            must be assumed to be 0.
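
            How an MD producer might derive the property is sketched
            below.  The function names are hypothetical.  The definition
            only requires M to be some power of 2 strictly greater than
            every mask; choosing the smallest such M is an assumption
            made here because it exposes the fewest PA bits.

```python
def smallest_modulus(masks):
    """Smallest power of 2 strictly greater than every value in `masks`."""
    m = 1
    for mask in masks:
        while m <= mask:
            m <<= 1
    return m


def congruence_offset(pa_base, ra_base, masks):
    """address-congruence-offset = (PA_base - RA_base) mod M."""
    m = smallest_modulus(masks)
    # Python's % yields a non-negative result for a positive modulus,
    # so this is correct even when pa_base < ra_base.
    return (pa_base - ra_base) % m
```

            For the values used in the programming note below,
            congruence_offset(0x10000000, 0x400000000, [0xc0000000])
            yields 0x10000000.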

            Programming note:

                This property is typically provided when the congruency between
                the real and underlying physical addresses of an mblock
                is less than the size needed for lgroup or page color masking.

                For example, consider a NUMA machine where memory is
                striped on 1GB boundaries between 4 different memory
                controllers. Each cpu may see different access latencies
                to each of the memory controllers; each latency is
                represented by an lgroup node as described above.

                Now consider a 1GB memory segment that starts at real address
                0x400000000 and is bound to physical address 0x10000000.

                To identify 4 different memory controllers with a 1GB stripe
                the address-mask property of one of the lgroups might have the
                value 0xc0000000.
        
                In this scenario, to apply the lgroup information
                correctly, the guest OS needs enough congruent bits
                from the actual physical address to apply the lgroup
                address mask meaningfully.

                So for our example, real address 0x400000000 corresponds to
                physical address 0x10000000, and real address 0x430000000
                corresponds to physical address 0x40000000.

                If we apply the lgroup mask to 0x10000000 we get 0x0.
                If we apply the lgroup mask to 0x40000000 we get 0x40000000
                as the result. Therefore we see that these different
                address pages reside on different memory controllers
                with different access latencies.

                (Note: if we had applied the lgroup mask to the corresponding
                real addresses, the result would have been 0x0 in both cases,
                implying the same memory controller, which would be incorrect.)

                Thus a means to recover the relevant bits
                of the physical address is required so that the address
                mask can be correctly applied.

                The address-congruence-offset property in an mblock provides
                this information. As described above, the property is derived
                from the difference between the real addresses of an mblock
                and their corresponding physical addresses. However, to keep
                the actual physical address bindings ambiguous, this property
                is not the full difference, but only enough bits of the
                RA/PA difference that an address mask can be correctly applied.

                Thus the value provided for address-congruence-offset is
                sufficient for the test:

                        (RA + address-congruence-offset) & address-mask
                            == address-match

                to evaluate correctly for every address-mask and
                address-match pair within the MD, so that lgroups are
                matched correctly.
                
                If the address-mask 0xc0000000 is the largest mask provided,
                then the address-congruence-offset for the example above
                would be:

                        (0x10000000 - 0x400000000) & 0xffffffff = 0x10000000

                The address matches for the real addresses above are then:

                        (0x400000000 + 0x10000000) & 0xc0000000 = 0x0

                        (0x430000000 + 0x10000000) & 0xc0000000 = 0x40000000 

                As defined above, the address-congruence-offset is an optional
                property of an mblock node.  If it is not present, a value of
                0 is assumed, and the test for matching lgroups reduces to:

                        RA & address-mask == address-match

            ------------------------------------------------------------------
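
            The programming note can be replayed from the guest's side with
            the sketch below.  The function find_lgroup, the dictionary
            layout, the lgroup names mc0..mc3, and the mblock "base" and
            "size" fields are assumptions made for illustration; only the
            addresses, the mask, and the offset come from the note.

```python
GB = 1 << 30


def find_lgroup(ra, mblocks, lgroups):
    """Return the name of the first lgroup whose stripe contains `ra`,
    or None if `ra` lies outside every mblock or matches no lgroup."""
    for mb in mblocks:
        if mb["base"] <= ra < mb["base"] + mb["size"]:
            # A missing address-congruence-offset is treated as 0.
            offset = mb.get("address-congruence-offset", 0)
            for lg in lgroups:
                if ((ra + offset) & lg["address-mask"]) == lg["address-match"]:
                    return lg["name"]
    return None


# Four 1GB stripes, one per memory controller (hypothetical names).
lgroups = [
    {"name": "mc%d" % i, "address-mask": 0xc0000000, "address-match": i << 30}
    for i in range(4)
]

# The 1GB segment from the note, with and without the offset property.
with_offset = [{"base": 0x400000000, "size": GB,
                "address-congruence-offset": 0x10000000}]
without_offset = [{"base": 0x400000000, "size": GB}]
```

            With the offset present, real addresses 0x400000000 and
            0x430000000 resolve to mc0 and mc1 respectively.  Without it,
            both resolve to mc0, which is the incorrect result the note
            warns about.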

    4.1.5 Page coloring

        Page coloring for large caches exhibits a similar set of problems
        to identifying lgroups.

        To assist, a cache node is extended with an optional property
        to compute a matching set within the corresponding cache.
        
            Node name: cache

            Property:
            Name                      Tag        Req'd?
            ------------------------------------------------------------------
            index-mask                PROP_VAL     No

            A 64-bit unsigned integer.  A bit in index-mask is set if that 
            bit in a PA influences the cache index at which a memory location 
            is stored when cache resident.

            The actual cache index employed by hardware is a
            function of multiple bits of the physical address
            of the memory reference. To compute a page coloring value,
            the index-mask property identifies the relevant bits of
            a physical address. Thus the index-bits for page coloring
            can be derived as:

                index-bits = (RA + address-congruence-offset) & index-mask

            Where the address-congruence-offset is the property from the
            mblock (corresponding to the given RA) as defined in section
            4.1.4 above.

            As with lgroup matching, if the address-congruence-offset
            property is not provided for an mblock, its value is
            assumed to be zero, reducing the equation to:

                index-bits = RA & index-mask

            ------------------------------------------------------------------
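
            The index-bits computation can be sketched as below.  The
            function name index_bits is hypothetical, and the index-mask
            value 0x1fffe000 used in the usage note is invented for
            illustration and does not describe any particular cache.

```python
def index_bits(ra, index_mask, congruence_offset=0):
    """index-bits = (RA + address-congruence-offset) & index-mask.

    congruence_offset defaults to 0, matching the rule for an mblock
    that does not define address-congruence-offset.
    """
    return (ra + congruence_offset) & index_mask
```

            For real address 0x400002000 in an mblock whose
            address-congruence-offset is 0x10000000, this gives
            0x10002000 under the illustrative mask, whereas ignoring the
            offset would give 0x2000 and so a different color.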

    4.1.6 Example

        The file sample-lg.pdf shows an abstracted MD containing 
        memory-latency-group, interrupt-latency-group, and pio-latency-group 
        nodes for a theoretical system with 2 physical processors, each of 
        which contains 2 virtual CPUs and one PCI-E host adapter.  The 
        latency-groups container node and the dma-latency-group nodes are 
        omitted for clarity.

    4.2. Bug/RFE Number(s):

        6540324 represent memory locality and RA vs PA congruence in the MD
        6539799 represent I/O locality in the MD
        6540315 RA versus PA congruence property
        6539930 MPO for sun4v platforms
   
    4.3. In Scope:
   
        MD node definitions and properties.

    4.4. Out of Scope:

        Changes in the Solaris guest.
   
    4.5. Interfaces:

    4.5.1 Imported Interfaces

        Name                    Classification   Description
        -----------------       ---------------  -------------------
        sun4v Machine           Sun Private      MD node definitions as
        Description nodes                        defined by FWARC/2005/115

        "iodevice" MD node      Committed        MD node name and properties as 
                                                 described in FWARC/2007/070

    4.5.2 Exported Interfaces

        All interfaces are specified in Section 4.1 of this document.

        Name                     Classification   Description
        ----------------------   --------------   -------------------
        "latency-groups"            Committed     MD node name and properties
        "memory-latency-group"      Committed     MD node name and properties
        "pio-latency-group"         Committed     MD node name and properties
        "dma-latency-group"         Committed     MD node name and properties
        "interrupt-latency-group"   Committed     MD node name and properties
        "group"                     Committed     MD node name and properties
        "index-mask"                Committed     MD property of "cache" node
        "address-congruence-offset" Committed     MD property of "mblock" node