The R8000 microprocessor chip set is the first 64-bit superscalar implementation of the MIPS IV Instruction Set Architecture (ISA). The R8000 processor combines a large, fast-access, high-throughput cache subsystem with high-performance floating-point capabilities. Its bandwidth satisfies applications with large working sets of data.
In the past, microprocessors typically had to restrict on-chip functionality. Small cache subsystems and other limiting features degraded performance for large scientific applications. Silicon Graphics leveraged new chip technology and an advanced instruction set architecture to overcome these limitations and expand applicability. The highly integrated R8000 chip set successfully meets the requirements of numeric-intensive applications in the technical market.
The chip set delivers peak performance of 300 double-precision MFLOPS and 300 MIPS with a clock frequency of 75 MHz. The R8000 uses the MIPS IV instruction set, which is a superset of the MIPS III architecture and is backward-compatible with all previous MIPS processors.
The MIPS R8000 processor is designed to deliver extremely high floating-point performance. Figure 15 is a block diagram of the R8000 chip set.
FIGURE 15 R8000 Microprocessor Chip Set Block Diagram
This section explains the R8000's key architectural features.
Balanced Memory Throughput Implementation
The R8000 chip set is optimized for floating-point performance. However, both floating-point and integer computational capabilities are balanced with the required memory throughput. Two load/store units in the IU provide the integer and floating-point functional units with the 64-bit (double word) data required for sustained operation. The load/store units support at most two memory references in a single cycle. Any combination of floating-point and integer loads and stores is possible, except for two integer stores in the same cycle. Furthermore, there are no restrictions on pipelining any combination of floating-point and integer loads and stores.
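The pairing rule can be summarized in a few lines of C; this is a sketch for illustration only, and the enum and helper names are invented:

```c
#include <stdbool.h>

/* Memory-operation classes relevant to R8000 dual issue
 * (names invented for this sketch). */
typedef enum { INT_LOAD, INT_STORE, FP_LOAD, FP_STORE } mem_op;

/* Any combination of two memory operations may issue in the
 * same cycle, except two integer stores together. */
bool can_dual_issue(mem_op a, mem_op b)
{
    return !(a == INT_STORE && b == INT_STORE);
}
```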
The R8000 can dispatch up to four instructions, including two memory accesses, per cycle.
Instruction and Addressing Mode Extensions
Several new instructions improve performance for numerically intensive code. The floating-point multiply-add instructions achieve results comparable to chaining vector operations: multiple floating-point operations execute each machine cycle. This characteristic results in greater precision and higher performance.
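For example, a dot-product inner loop maps directly onto the multiply-add instruction. The sketch below uses the C99 fma() routine as a stand-in for the fused operation; MIPS IV defines madd.d, which performs the multiply and add as a single instruction with a single rounding:

```c
#include <math.h>
#include <stddef.h>

/* Dot product: each iteration is one multiply-add. On the R8000
 * the multiply and add fuse into a single madd.d instruction, so
 * the intermediate product is never rounded separately. fma()
 * (C99) models that single-rounding behavior here. */
double dot(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s = fma(a[i], b[i], s);
    return s;
}
```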
Many scientific applications have separately compiled subroutines containing parameter matrices with variable dimensions. Standard register-plus-offset addressing requires an extra integer addition for each access to these arrays. In contrast, the R8000's indexed addressing mode (base register plus index register) eliminates the extra addition for floating-point loads and stores.
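The pattern looks like the following C sketch (function and variable names invented). Each element address is base + 8*i; an indexed floating-point load such as the MIPS IV ldxc1 forms that sum inside the load itself rather than with a separate integer add:

```c
#include <stddef.h>

/* y += alpha * A(:,j), where A has runtime leading dimension lda
 * (a Fortran-style variably dimensioned matrix). With indexed
 * addressing (base register + index register), the compiler can
 * keep i*8 in an index register, so each load needs no extra
 * integer addition to form the element address. */
void axpy_col(double *y, const double *A, double alpha,
              size_t j, size_t lda, size_t n)
{
    const double *col = A + j * lda;   /* column base, hoisted */
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * col[i];        /* indexed load + madd */
}
```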
For high-end numeric processing, loops containing IF statements must execute efficiently. The R8000 design includes a set of four conditional move operators that allows IF statements to be represented without branches. The bodies of the THEN and ELSE clauses are computed unconditionally, and the results are placed in temporary registers. Conditional move operators then selectively transfer the temporary results to their actual destination registers. In summary, both legs of an IF statement are computed, and one of them is discarded.
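In C terms, this if-conversion replaces a branch with unconditional computation of both clauses followed by a select, as in this minimal sketch:

```c
/* Branchy form:
 *     if (x[i] > 0.0) y[i] = x[i]; else y[i] = -x[i];
 * If-converted form: both legs are computed into temporaries,
 * and a conditional move picks the survivor, so the loop body
 * contains no branch for the pipeline to mispredict. */
void abs_vec(double *y, const double *x, int n)
{
    for (int i = 0; i < n; i++) {
        double then_leg = x[i];      /* THEN clause */
        double else_leg = -x[i];     /* ELSE clause */
        /* compiles to a compare plus a conditional move */
        y[i] = (x[i] > 0.0) ? then_leg : else_leg;
    }
}
```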
Data Streaming Cache Architecture
The R8000 chip set incorporates a unique coherent cache scheme to address two disparate computational requirements. Most programs contain a mix of integer, floating-point, and address computation. Integer and address computation is best accomplished with a moderately sized low-latency, fast data cache. Floating-point calculations require large amounts of memory or a large data cache subsystem. The cache need only be moderately fast, but it must be sizable enough to hold large data sets and have high throughput to the floating-point functional units.
The R8000 chip set provides a unique cache hierarchy. The level-one data cache, on the integer chip, is 16 KB and allows very fast access for integer loads and stores. A large 4-MB off-chip cache, called the data streaming cache, serves as a second-level cache for integer data and instructions, and as a first-level cache for floating-point data. This configuration allows floating-point loads and stores to bypass the on-chip cache and communicate directly with the large off-chip cache. The data streaming cache is pipelined to allow continuous access by the floating-point functional units. Data can be transferred at the rate of two 64-bit double words per cycle (2 × 8 bytes × 75 MHz), or 1.2 GB per second. The large pipelined data streaming cache provides the capacity and sustained bandwidth needed to handle floating-point data objects.
FIGURE 16 R8000 Integer Unit Block Diagram
The IU contains four caches: the instruction cache, the data cache, the branch prediction cache, and the Translation Lookaside Buffer (TLB) cache. Dedicated interfaces let the IU generate and provide address information to the data streaming cache, the floating-point unit, and the tag RAMs. An additional 80-bit bus allows the IU and FPU to communicate. The IU also contains two integer arithmetic logic units (ALUs) and two address generation units. These functional units support up to four instructions per cycle: two integer instructions and two data accesses.
Instruction Cache and Instruction Cache Tag RAM
The instruction cache contains instructions to be executed. The 16-KB I-cache in the IU is direct-mapped, arranged as 1024 entries by 128 bits; each entry contains four 32-bit instructions, and every access to the cache fetches four instructions. The I-cache is virtually indexed and virtually tagged, eliminating the need for address translation on I-cache accesses.
The I-cache tag RAM is used to determine whether a valid instruction exists in the I-cache. It has 512 entries: one tag for every two I-cache entries, or one tag per I-cache line. Each I-cache tag RAM entry contains a tag, an address space identifier (ASID), a tag valid bit, and two region bits. The ASID differentiates instructions between processes, allowing instructions from multiple processes that have the same virtual address but different physical addresses to reside in the I-cache at the same time. It also ensures that two processes accessing the same memory space do not overwrite each other's instructions.
Instruction/Tag Queues and Dispatching
Effective utilization of a superscalar processor requires access to multiple instructions per cycle and the ability to dispatch multiple instructions per cycle. In the R8000's creative approach to fetching and dispatching instructions, the integer unit fetches four instructions (128 bits) per cycle from the I-cache. These instructions are partially decoded and then placed in a six-deep temporary storage queue, which is needed because dependencies might prevent instructions from being dispatched immediately.
The X-bar can dispatch from zero to four instructions every cycle. It uses resource modeling to determine when instructions are ready to be dispatched. The X-bar monitors the status of each execution unit from cycle to cycle and determines interdependencies among the four instructions in a given line. Figure 17 diagrams instruction dispatch.
FIGURE 17 R8000 Instruction Dispatch
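A highly simplified software model of the X-bar's decision follows. The structure and hazard checks are assumptions for illustration; the real resource model is more detailed. Dispatch proceeds in order and stops at the first busy unit or intra-line dependency:

```c
#include <stdbool.h>

/* Hypothetical per-instruction summary for a simplified dispatch
 * model: which functional unit the instruction needs and which
 * registers it reads and writes. */
typedef struct {
    int unit;          /* functional-unit id */
    int src1, src2;    /* source registers (-1 if unused) */
    int dst;           /* destination register (-1 if none) */
} insn;

/* Dispatch 0-4 instructions from a fetched line, in order,
 * stopping at the first structural hazard (unit busy) or true
 * dependency within the line. Returns how many issued. */
int dispatch_line(const insn line[4], bool unit_busy[])
{
    int issued = 0;
    for (int i = 0; i < 4; i++) {
        if (unit_busy[line[i].unit])
            break;                      /* structural hazard */
        for (int j = 0; j < i; j++)     /* intra-line RAW check */
            if (line[j].dst != -1 &&
                (line[j].dst == line[i].src1 ||
                 line[j].dst == line[i].src2))
                return issued;
        unit_busy[line[i].unit] = true;
        issued++;
    }
    return issued;
}
```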
Branch Prediction Cache and Branch Prediction
When a branch is taken, the contents of the program counter are altered. In a pipelined processor implementation, this forces one or more wasted cycles and reduces the performance of the processor. Research shows that during execution of a typical application, a branch instruction occurs every six to eight instructions. Hence, in the superscalar R8000 processor, a branch can be expected every other cycle. Branch prediction is therefore crucial for maintaining the pipeline and ensuring continuous execution. Figure 18 diagrams the branch prediction cache.
FIGURE 18 R8000 Branch Prediction Cache
The R8000 branch prediction cache is used to modify the program counter, redirecting it to the location of the target instruction, when the processor encounters a branch instruction. It works in conjunction with the I-cache and incorporates a simple branch prediction mechanism. The branch prediction cache has 1024 entries, one for each entry in the I-cache. Each entry is 15 bits wide, ten of which are used as a branch target address to any of the 1024 entries in the I-cache. The remaining bits indicate whether the branch is predicted taken and where in the target quadword execution should begin.
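A simplified software model of the lookup might look like the following. The field widths beyond the 10-bit target index are assumptions, as is the fall-through behavior:

```c
/* Simplified model of one branch prediction cache entry. Only
 * the 10-bit target index is documented above; the remaining
 * field widths are assumptions. */
typedef struct {
    unsigned target_index : 10;  /* indexes the 1024-entry I-cache */
    unsigned taken        : 1;   /* predicted taken? */
    unsigned start_slot   : 2;   /* first instruction in quadword */
} bpc_entry;

static bpc_entry bpc[1024];      /* one entry per I-cache entry */

/* Predict the next fetch index given the current one. */
unsigned predict_next(unsigned fetch_index, unsigned *start_slot)
{
    bpc_entry e = bpc[fetch_index & 1023u];
    if (e.taken) {
        *start_slot = e.start_slot;
        return e.target_index;           /* redirect fetch */
    }
    *start_slot = 0;
    return (fetch_index + 1) & 1023u;    /* fall through */
}
```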
Table 2 summarizes parameters for the instruction cache and branch prediction cache.
Parameter         Instruction Cache                    Branch Prediction Cache
Location          IU                                   IU
Contents/entry    128 bits (4 instructions)            16 bits
Size              16 KB                                2 KB
Mapping           Direct-mapped                        Direct-mapped
Index             Virtual address                      Virtual address
Tag               Virtual address                      N/A
Data access       Single cycle                         Single cycle
Data transfer     128 bits (4 instructions) per cycle  16 bits per cycle
Cache bandwidth   1.2 GB per second                    150 MB per second
Line size         32 bytes (8 words)                   N/A
Miss penalty      11 cycles to data streaming cache    3 cycles to actual branch result
TABLE 2 Parameters for R8000 Instruction Cache and Branch Prediction Cache
Translation Lookaside Buffer (TLB) Organization
The translation lookaside buffer converts virtual addresses to physical addresses. A single TLB in the IU handles instruction references when there is an I-cache miss and all data references. The TLB contains 384 entries to reduce TLB misses when processing large matrices, and is three-way set associative to maintain high performance.
The TLB is dual-ported to allow parallel references. It is split into two sections: one contains the virtual tags (VTAGs); the other contains the actual physical address (PA) corresponding to each virtual tag. Table 3 compares the R8000 and R4x00 TLBs.
Parameter         POWER Indigo2 TLB (R8000)                Indigo2 TLB (R4000)
Size              384 entries; one translation per entry   48 entries; two translations per entry (even-odd pages)
Mapping           3-way set associative, random placement  Fully associative, random placement
Index             Virtual address                          Virtual address
Ports             Two                                      One
Access            Single cycle                             Single cycle
Kernel page size  16 KB                                    4 KB
User page size    16 KB                                    4 KB
TABLE 3 Translation Lookaside Buffer Comparison
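A sketch of the R8000 TLB lookup under these parameters (384 entries as 128 sets of 3 ways, 16-KB pages) follows; the field names and indexing function are invented, and the real TLB's dual porting is not modeled:

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_SETS 128   /* 384 entries / 3 ways */
#define TLB_WAYS 3

/* One TLB entry: virtual tag plus ASID on one side, physical
 * address on the other, mirroring the VTAG/PA split. */
typedef struct {
    uint64_t vtag;     /* virtual page tag */
    uint8_t  asid;     /* address space identifier */
    bool     valid;
    uint64_t pfn;      /* physical frame number */
} tlb_entry;

static tlb_entry tlb[TLB_SETS][TLB_WAYS];

/* 16-KB pages: bits [13:0] are the page offset. */
bool tlb_lookup(uint64_t va, uint8_t asid, uint64_t *pa)
{
    uint64_t vpn = va >> 14;
    unsigned set = (unsigned)(vpn % TLB_SETS);
    for (int w = 0; w < TLB_WAYS; w++) {
        tlb_entry *e = &tlb[set][w];
        if (e->valid && e->vtag == vpn && e->asid == asid) {
            *pa = (e->pfn << 14) | (va & 0x3FFF);
            return true;   /* hit */
        }
    }
    return false;          /* miss: refill from page tables */
}
```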
Data Cache and Data Cache Tag
The 16-KB data cache in the IU is dual-ported and arranged as 2048 entries by 64 bits. The direct-mapped D-cache is virtually indexed and physically tagged; it implements a write-through protocol to maintain coherency with the data streaming cache. The D-cache allows either two loads or one load and one store to occur simultaneously. Loads and stores are supported down to byte (8-bit) resolution.
Most integer operations execute in only one cycle, but some operations such as integer multiply and divide require additional cycles to generate the result. The R8000 integer multiply operation is one of the fastest implementations available. This fast integer multiplication drastically improves the execution of loops that require integer multiplication. A clever approach was used for the integer division operation: the time required for an integer divide operation is a function of the quotient size. Table 4 summarizes integer latencies.
Integer Unit Operation  Latency (Cycle Count)
Add, shift, logical     1
Load, store             1
Multiply                4 (32-bit operands); 6 (64-bit operands)
Divide                  21 (quotient < 15 bits); 39 (quotient 16-31 bits); 73 (quotient 32-64 bits)
TABLE 4 R8000 Integer Latencies
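As a purely illustrative helper (the exact boundary handling between the table's quotient ranges is an assumption), the divide cost can be read off the quotient's significant bits:

```c
#include <stdint.h>

/* Cycle count for an integer divide as a function of the
 * quotient's significant bits, using the values in Table 4. */
int divide_cycles(uint64_t quotient)
{
    int bits = 0;
    while (quotient) {          /* count significant quotient bits */
        quotient >>= 1;
        bits++;
    }
    if (bits <= 15) return 21;  /* small quotient */
    if (bits <= 31) return 39;  /* 16-31 bits */
    return 73;                  /* 32-64 bits */
}
```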
The data streaming cache is implemented with separate load and store data buses, eliminating bus turn-around time. Load and store operations can be fully pipelined, making it possible to issue two memory operations to the data streaming cache every cycle. The 4-MB data streaming cache is split between even and odd 2-MB banks; one memory operation can go to each bank every cycle.
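Because the interleave is by 8-byte double word, bit 3 of the address (the lowest bit above the double-word offset) selects the bank, as in this sketch:

```c
#include <stdint.h>

/* Even/odd double-word interleave: bit 3 of the address selects
 * the data streaming cache bank. Two references can proceed in
 * the same cycle when they map to different banks. */
static inline int dsc_bank(uint64_t paddr)
{
    return (int)((paddr >> 3) & 1);  /* 0 = even, 1 = odd */
}

/* Do two simultaneous references conflict on a bank? */
static inline int bank_conflict(uint64_t a, uint64_t b)
{
    return dsc_bank(a) == dsc_bank(b);
}
```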
The data streaming cache tag RAM is four-way set-associative and is organized as 2048 entries by 128 bits. Each 128-bit entry contains 32 bits of information for each of the four sets. Each bank of the data streaming cache employs a dedicated custom tag RAM to address the cache. The dual tag RAMs also allow simultaneous and independent operation of the two data streaming cache banks every cycle. Figure 19 diagrams data streaming cache organization.
FIGURE 19 R8000 Data Streaming Cache Organization
Table 5 compares the data streaming cache and the R4400 level-2 secondary cache.
Parameter                   Data Streaming Cache (R8000)              Secondary Cache (R4400)
Location                    Off-chip; 16 SSRAMs (12 ns)               Off-chip; 4 SIMMs with SRAMs (10 ns)
Contents                    FP data, integer data, instructions       FP data, integer data, instructions
Size                        4 MB                                      1 MB
Mapping                     4-way set associative, random placement   Direct-mapped, no hashing
Index                       Physical address                          Physical address
Tag                         Physical address                          Physical address
Coherency policy on writes  Write back                                Write back
Data protection             Parity on 16-bit quantities               SECDED (single-error correction, double-error detection)
Ports                       Single-ported to even bank;               Single-ported
                            single-ported to odd bank
Interleaving                Two-way by even/odd double words          N/A
Data access                 5-stage, fully pipelined                  Asynchronous
Data transfer               Two 64-bit double words per cycle         128 bits per two cycles
                            (one even and one odd double word)        (one double word per cycle)
Latency                     5 cycles initially; 0 cycles when         4-11 cycles
                            pipeline is full
Cache bandwidth             600 MB/s even; 600 MB/s odd               400 MB/s, L2 cache to L1 cache
                            (cache to/from registers)
Total cache bandwidth       1.2 GB/s                                  400 MB/s
Line size                   128 bytes (32 words)                      128 bytes (32 words)
Miss penalty                80 cycles to main memory (128 bytes)
Access pattern              2 loads/cycle, 2 stores/cycle,            1 load/cycle or 1 store/cycle
                            or 1 load and 1 store/cycle
Dirtiness recorded          On a line basis                           On a line basis
TABLE 5 Data Streaming Cache and Level-2 Secondary Cache Comparison
In addition to being set associative, the data streaming cache is also two-way interleaved. This interleaving provides the memory bandwidth required to use both floating-point functional units simultaneously each cycle.
The design provides two 64-bit operands to the FPU each cycle for an effective 1.2 GB per second transfer rate. This allows each floating-point functional unit to execute a multiply/add every cycle. Figure 20 diagrams the data streaming cache interleaving.
FIGURE 20 R8000 Interleaved Data Streaming Cache
Table 6 summarizes latencies and staging associated with various FPU operations.
Floating-Point Unit Operation    Latency (Cycle Count)                           Staging (Cycle Count)
Move, Negate, Absolute Value     1                                               1
Add, Multiply, MADD              4                                               1
Load, Store                      1                                               1
Compare, Move, Conditional Move  1                                               1
Divide                           14 (single precision); 20 (double precision)    11 (single precision); 17 (double precision)
Square Root                      14 (single precision); 23 (double precision)    11 (single precision); 20 (double precision)
Reciprocal                       8 (single precision); 14 (double precision)     5 (single precision); 11 (double precision)
Reciprocal Square Root           8 (single precision); 17 (double precision)     5 (single precision); 14 (double precision)
TABLE 6 R8000 FPU Operations and Associated Latencies/Staging
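One practical consequence of these numbers: with a 4-cycle add latency and single-cycle staging, a summation loop needs about four independent accumulators to keep a floating-point unit busy. A sketch:

```c
#include <stddef.h>

/* Sum reduction unrolled with four independent accumulators.
 * With a 4-cycle add latency and 1-cycle staging (Table 6),
 * four in-flight adds keep one FP unit busy; a single-
 * accumulator loop would stall three cycles per add. */
double sum(const double *x, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)          /* remainder elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```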
In summary, the R8000 microprocessor chip set delivers the floating-point performance and bandwidth necessary to accommodate large, numerically intensive applications. The floating-point structure and innovative caches provide capabilities not previously available from a high-volume RISC microprocessor.
Cache Controller (TCC)
The TCC acts as the bridge between the R8000 Integer Unit and the POWER Indigo2 MC (Memory Controller). It manages the R8000 Global Cache, uses the sysAD bus to pass cache miss requests, writebacks, and uncached operations to the MC, and generates interrupts to the R8000. It also implements the logic needed to control the two data buffers (TDBs).
Data Buffers (TDBs)
The TDB implements the datapath interface between the Global Cache and the DMux (Data Multiplexer). Each of the two TDBs contains half of the 128-bit StoreData bus interface to the Global Cache and R8000 FPU, and half of the 64-bit sysAD interface to the DMuxes. The TDB provides parity generation and checking for each of these interfaces. Each TDB contains four cache line buffers (one for cache misses, two for prefetches, and one for cache writebacks), plus a 64-deep FIFO for buffering graphics stores.