The R8000 microprocessor chip set is the first 64-bit superscalar implementation of the MIPS IV Instruction Set Architecture (ISA). The R8000 processor combines a large, fast-access, high-throughput cache subsystem with high-performance floating-point capabilities. Its bandwidth satisfies applications with large working sets of data.
In the past, microprocessors typically had to restrict on-chip functionality. Small cache subsystems and other limiting features degraded performance for large scientific applications. Silicon Graphics leveraged new chip technology and an advanced instruction set architecture to overcome these limitations and expand applicability. The highly integrated R8000 chip set successfully meets the requirements of numeric-intensive applications in the technical market.
The chip set delivers peak performance of 300 double-precision MFLOPS and 300 MIPS with a clock frequency of 75 MHz. The R8000 uses the MIPS IV instruction set, which is a superset of the MIPS III architecture and is backward-compatible with all previous MIPS processors.
The MIPS R8000 processor is designed to deliver extremely high floating-point performance. Figure 15 is a block diagram of the R8000 chip set.
FIGURE 15 R8000 Microprocessor Chip Set Block Diagram
This section explains the R8000's key architectural features.
Balanced Memory Throughput Implementation
The R8000 chip set is optimized for floating-point performance. However, both floating-point and integer computational capabilities are balanced with the required memory throughput. Two load/store units in the IU provide the integer and floating-point functional units with the 64-bit (double word) data required for sustained operation. The load/store units support at most two memory references in a single cycle. Any combination of floating-point and integer loads and stores is possible, except for two integer stores in the same cycle. Furthermore, there are no restrictions on pipelining any combination of floating-point and integer loads and stores.
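The pairing rule can be summarized in a few lines of C; this is a sketch for illustration only, and the enum and helper names are invented:

```c
#include <stdbool.h>

/* Memory-operation classes relevant to R8000 dual issue
 * (names invented for this sketch). */
typedef enum { INT_LOAD, INT_STORE, FP_LOAD, FP_STORE } mem_op;

/* Any combination of two memory operations may issue in the
 * same cycle, except two integer stores together. */
bool can_dual_issue(mem_op a, mem_op b)
{
    return !(a == INT_STORE && b == INT_STORE);
}
```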
The R8000 can dispatch up to four instructions, including two memory accesses, per cycle.
Instruction and Addressing Mode Extensions
Several new instructions improve performance for numerically intensive code. The floating-point multiply-add instructions achieve results comparable to chaining vector operations: multiple floating-point operations execute each machine cycle. This characteristic results in greater precision and higher performance.
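For example, a dot-product inner loop maps directly onto the multiply-add instruction. The sketch below uses the C99 fma() routine as a stand-in for the fused operation; MIPS IV defines madd.d, which performs the multiply and add as a single instruction with a single rounding:

```c
#include <math.h>
#include <stddef.h>

/* Dot product: each iteration is one multiply-add. On the R8000
 * the multiply and add fuse into a single madd.d instruction, so
 * the intermediate product is never rounded separately. fma()
 * (C99) models that single-rounding behavior here. */
double dot(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s = fma(a[i], b[i], s);
    return s;
}
```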
Many scientific applications have separately compiled subroutines containing parameter matrices with variable dimensions. Standard register-plus-offset addressing requires an extra integer addition for each access to these arrays. In contrast, the R8000's indexed addressing mode (base register plus index register) eliminates the extra addition for floating-point loads and stores.
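The pattern looks like the following C sketch (function and variable names invented). Each element address is base + 8*i; an indexed floating-point load such as the MIPS IV ldxc1 forms that sum inside the load itself rather than with a separate integer add:

```c
#include <stddef.h>

/* y += alpha * A(:,j), where A has runtime leading dimension lda
 * (a Fortran-style variably dimensioned matrix). With indexed
 * addressing (base register + index register), the compiler can
 * keep i*8 in an index register, so each load needs no extra
 * integer addition to form the element address. */
void axpy_col(double *y, const double *A, double alpha,
              size_t j, size_t lda, size_t n)
{
    const double *col = A + j * lda;   /* column base, hoisted */
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * col[i];        /* indexed load + madd */
}
```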
For high-end numeric processing, loops containing IF statements must execute efficiently. The R8000 design includes a set of four conditional move operators that allows IF statements to be represented without branches. The bodies of the THEN and ELSE clauses are computed unconditionally, and the results are placed in temporary registers. Conditional move operators then selectively transfer the temporary results to their actual destination registers. In summary, both legs of an IF statement are computed, and one of them is discarded.
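In C terms, this if-conversion replaces a branch with unconditional computation of both clauses followed by a select, as in this minimal sketch:

```c
/* Branchy form:
 *     if (x[i] > 0.0) y[i] = x[i]; else y[i] = -x[i];
 * If-converted form: both legs are computed into temporaries,
 * and a conditional move picks the survivor, so the loop body
 * contains no branch for the pipeline to mispredict. */
void abs_vec(double *y, const double *x, int n)
{
    for (int i = 0; i < n; i++) {
        double then_leg = x[i];      /* THEN clause */
        double else_leg = -x[i];     /* ELSE clause */
        /* compiles to a compare plus a conditional move */
        y[i] = (x[i] > 0.0) ? then_leg : else_leg;
    }
}
```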
Data Streaming Cache Architecture
The R8000 chip set incorporates a unique coherent cache scheme to address two disparate computational requirements. Most programs contain a mix of integer, floating-point, and address computation. Integer and address computation is best accomplished with a moderately sized low-latency, fast data cache. Floating-point calculations require large amounts of memory or a large data cache subsystem. The cache need only be moderately fast, but it must be sizable enough to hold large data sets and have high throughput to the floating-point functional units.
The R8000 chip set provides a unique cache hierarchy. The level-one data cache, on the integer chip, is 16 KB and allows very fast access for integer loads and stores. A large 4-MB off-chip cache, called the data streaming cache, serves as a second-level cache for integer data and instructions, and as a first-level cache for floating-point data. This configuration allows floating-point loads and stores to bypass the on-chip cache and communicate directly with the large off-chip cache. The data streaming cache is pipelined to allow continuous access by the floating-point functional units. Data can be transferred at the rate of two 64-bit double words per cycle (2 × 8 bytes × 75 MHz), or 1.2 GB per second. The large pipelined data streaming cache provides the capacity and sustained bandwidth needed to handle floating-point data objects.
FIGURE 16 R8000 Integer Unit Block Diagram
The IU contains four caches: the instruction cache, the data cache, the branch prediction cache, and the Translation Lookaside Buffer (TLB) cache. Dedicated interfaces let the IU generate and provide address information to the data streaming cache, the floating-point unit, and the tag RAMs. An additional 80-bit bus allows the IU and FPU to communicate. The IU also contains two integer arithmetic logic units (ALUs) and two address generation units. These functional units support up to four instructions per cycle: two integer instructions and two data accesses.
Instruction Cache and Instruction Cache Tag RAM
The instruction cache contains instructions to be executed. The 16-KB I-cache in the IU is direct-mapped, arranged as 1024 entries by 128 bits; each entry contains four 32-bit instructions, and every access to the cache fetches four instructions. The I-cache is virtually indexed and virtually tagged, eliminating the need for address translation on I-cache accesses.
The I-cache tag RAM is used to determine whether a valid instruction exists in the I-cache. It has 512 entries: one tag for every two I-cache entries, or one tag per I-cache line. Each I-cache tag RAM entry contains a tag, an address space identifier (ASID), a tag valid bit, and two region bits. The ASID differentiates instructions between processes, allowing instructions from multiple processes that have the same virtual address but different physical addresses to reside in the I-cache at the same time. It also ensures that two processes accessing the same memory space do not overwrite each other's instructions.
Instruction/Tag Queues and Dispatching
Effective utilization of a superscalar processor requires access to multiple instructions per cycle and the ability to dispatch multiple instructions per cycle. In the R8000's creative approach to fetching and dispatching instructions, the integer unit fetches four instructions (128 bits) per cycle from the I-cache. These instructions are partially decoded and then placed in a six-deep temporary storage queue, which is needed because dependencies might prevent instructions from being dispatched immediately.
The X-bar can dispatch from zero to four instructions every cycle. It uses resource modeling to determine when instructions are ready to be dispatched. The X-bar monitors the status of each execution unit from cycle to cycle and determines interdependencies among the four instructions in a given line. Figure 17 diagrams instruction dispatch.
FIGURE 17 R8000 Instruction Dispatch
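A highly simplified software model of the X-bar's decision follows. The structure and hazard checks are assumptions for illustration; the real resource model is more detailed. Dispatch proceeds in order and stops at the first busy unit or intra-line dependency:

```c
#include <stdbool.h>

/* Hypothetical per-instruction summary for a simplified dispatch
 * model: which functional unit the instruction needs and which
 * registers it reads and writes. */
typedef struct {
    int unit;          /* functional-unit id */
    int src1, src2;    /* source registers (-1 if unused) */
    int dst;           /* destination register (-1 if none) */
} insn;

/* Dispatch 0-4 instructions from a fetched line, in order,
 * stopping at the first structural hazard (unit busy) or true
 * dependency within the line. Returns how many issued. */
int dispatch_line(const insn line[4], bool unit_busy[])
{
    int issued = 0;
    for (int i = 0; i < 4; i++) {
        if (unit_busy[line[i].unit])
            break;                      /* structural hazard */
        for (int j = 0; j < i; j++)     /* intra-line RAW check */
            if (line[j].dst != -1 &&
                (line[j].dst == line[i].src1 ||
                 line[j].dst == line[i].src2))
                return issued;
        unit_busy[line[i].unit] = true;
        issued++;
    }
    return issued;
}
```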
Branch Prediction Cache and Branch Prediction
When a branch is taken, the contents of the program counter are altered. In a pipelined processor implementation, this forces one or more wasted cycles and reduces the performance of the processor. Research shows that during execution of a typical application, a branch instruction occurs every six to eight instructions. Hence, in the superscalar R8000 processor, a branch can be expected every other cycle. Branch prediction is therefore crucial for maintaining the pipeline and ensuring continuous execution. Figure 18 diagrams the branch prediction cache.
FIGURE 18 R8000 Branch Prediction Cache
The R8000 branch prediction cache is used to modify the program counter, redirecting it to the location of the target instruction, when the processor encounters a branch instruction. It works in conjunction with the I-cache and incorporates a simple branch prediction mechanism. The branch prediction cache has 1024 entries, one for each entry in the I-cache. Each entry is 15 bits wide, ten of which are used as a branch target address to any of the 1024 entries in the I-cache. The remaining bits indicate whether the branch is predicted taken and where in the target quadword execution should begin.
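A simplified software model of the lookup might look like the following. The field widths beyond the 10-bit target index are assumptions, as is the fall-through behavior:

```c
/* Simplified model of one branch prediction cache entry. Only
 * the 10-bit target index is documented above; the remaining
 * field widths are assumptions. */
typedef struct {
    unsigned target_index : 10;  /* indexes the 1024-entry I-cache */
    unsigned taken        : 1;   /* predicted taken? */
    unsigned start_slot   : 2;   /* first instruction in quadword */
} bpc_entry;

static bpc_entry bpc[1024];      /* one entry per I-cache entry */

/* Predict the next fetch index given the current one. */
unsigned predict_next(unsigned fetch_index, unsigned *start_slot)
{
    bpc_entry e = bpc[fetch_index & 1023u];
    if (e.taken) {
        *start_slot = e.start_slot;
        return e.target_index;           /* redirect fetch */
    }
    *start_slot = 0;
    return (fetch_index + 1) & 1023u;    /* fall through */
}
```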
Table 2 summarizes parameters for the instruction cache and branch prediction cache.
Parameter         Instruction Cache                    Branch Prediction Cache
Location          IU                                   IU
Contents/entry    128 bits (4 instructions)            16 bits
Size              16 KB                                2 KB
Mapping           Direct-mapped                        Direct-mapped
Index             Virtual address                      Virtual address
Tag               Virtual address                      N/A
Data access       Single cycle                         Single cycle
Data transfer     128 bits (4 instructions) per cycle  16 bits per cycle
Cache bandwidth   1.2 GB per second                    150 MB per second
Line size         32 bytes (8 words)                   N/A
Miss penalty      11 cycles to data streaming cache    3 cycles to actual branch result
TABLE 2 Parameters for R8000 Instruction Cache and Branch Prediction Cache
Translation Lookaside Buffer (TLB) Organization
The translation lookaside buffer converts virtual addresses to physical addresses. A single TLB in the IU handles instruction references when there is an I-cache miss and all data references. The TLB contains 384 entries to reduce TLB misses when processing large matrices, and is three-way set associative to maintain high performance.
The TLB is dual-ported to allow parallel references. It is split into two sections: one contains the virtual tags (VTAGs); the other contains the actual physical address (PA) corresponding to each virtual tag. Table 3 compares the R8000 and R4x00 TLBs.
Parameter         POWER Indigo2 TLB (R8000)                Indigo2 TLB (R4000)
Size              384 entries; one translation per entry   48 entries; two translations per entry (even-odd pages)
Mapping           3-way set associative, random placement  Fully associative, random placement
Index             Virtual address                          Virtual address
Ports             Two                                      One
Access            Single cycle                             Single cycle
Kernel page size  16 KB                                    4 KB
User page size    16 KB                                    4 KB
TABLE 3 Translation Lookaside Buffer Comparison
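A sketch of the R8000 TLB lookup under these parameters (384 entries as 128 sets of 3 ways, 16-KB pages) follows; the field names and indexing function are invented, and the real TLB's dual porting is not modeled:

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_SETS 128   /* 384 entries / 3 ways */
#define TLB_WAYS 3

/* One TLB entry: virtual tag plus ASID on one side, physical
 * address on the other, mirroring the VTAG/PA split. */
typedef struct {
    uint64_t vtag;     /* virtual page tag */
    uint8_t  asid;     /* address space identifier */
    bool     valid;
    uint64_t pfn;      /* physical frame number */
} tlb_entry;

static tlb_entry tlb[TLB_SETS][TLB_WAYS];

/* 16-KB pages: bits [13:0] are the page offset. */
bool tlb_lookup(uint64_t va, uint8_t asid, uint64_t *pa)
{
    uint64_t vpn = va >> 14;
    unsigned set = (unsigned)(vpn % TLB_SETS);
    for (int w = 0; w < TLB_WAYS; w++) {
        tlb_entry *e = &tlb[set][w];
        if (e->valid && e->vtag == vpn && e->asid == asid) {
            *pa = (e->pfn << 14) | (va & 0x3FFF);
            return true;   /* hit */
        }
    }
    return false;          /* miss: refill from page tables */
}
```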
Data Cache and Data Cache Tag
The 16-KB data cache in the IU is dual-ported and arranged as 2048 entries by 64 bits. The direct-mapped D-cache is virtually indexed and physically tagged; it implements a write-through protocol to maintain coherency with the data streaming cache. The D-cache allows either two loads or one load and one store to occur simultaneously. Loads and stores are supported down to byte (8-bit) resolution.
Most integer operations execute in only one cycle, but some operations such as integer multiply and divide require additional cycles to generate the result. The R8000 integer multiply operation is one of the fastest implementations available. This fast integer multiplication drastically improves the execution of loops that require integer multiplication. A clever approach was used for the integer division operation: the time required for an integer divide operation is a function of the quotient size. Table 4 summarizes integer latencies.
Integer Unit Operation  Latency (Cycle Count)
Add, shift, logical     1
Load, store             1
Multiply                4 (32-bit operands); 6 (64-bit operands)
Divide                  21 (quotient < 15 bits); 39 (quotient 16-31 bits); 73 (quotient 32-64 bits)
TABLE 4 R8000 Integer Latencies
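As a purely illustrative helper (the exact boundary handling between the table's quotient ranges is an assumption), the divide cost can be read off the quotient's significant bits:

```c
#include <stdint.h>

/* Cycle count for an integer divide as a function of the
 * quotient's significant bits, using the values in Table 4. */
int divide_cycles(uint64_t quotient)
{
    int bits = 0;
    while (quotient) {          /* count significant quotient bits */
        quotient >>= 1;
        bits++;
    }
    if (bits <= 15) return 21;  /* small quotient */
    if (bits <= 31) return 39;  /* 16-31 bits */
    return 73;                  /* 32-64 bits */
}
```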
The data streaming cache is implemented with separate load and store data buses, eliminating bus turn-around time. Load and store operations can be fully pipelined, making it possible to issue two memory operations to the data streaming cache every cycle. The 4-MB data streaming cache is split between even and odd 2-MB banks; one memory operation can go to each bank every cycle.
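Because the interleave is by 8-byte double word, bit 3 of the address (the lowest bit above the double-word offset) selects the bank, as in this sketch:

```c
#include <stdint.h>

/* Even/odd double-word interleave: bit 3 of the address selects
 * the data streaming cache bank. Two references can proceed in
 * the same cycle when they map to different banks. */
static inline int dsc_bank(uint64_t paddr)
{
    return (int)((paddr >> 3) & 1);  /* 0 = even, 1 = odd */
}

/* Do two simultaneous references conflict on a bank? */
static inline int bank_conflict(uint64_t a, uint64_t b)
{
    return dsc_bank(a) == dsc_bank(b);
}
```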
The data streaming cache tag RAM is four-way set-associative and is organized as 2048 entries by 128 bits. Each 128-bit entry contains 32 bits of information for each of the four sets. Each bank of the data streaming cache employs a dedicated custom tag RAM to address the cache. The dual tag RAMs also allow simultaneous and independent operation of the two data streaming cache banks every cycle. Figure 19 diagrams data streaming cache organization.
FIGURE 19 R8000 Data Streaming Cache Organization
Table 5 compares the data streaming cache and the R4400 level-2 secondary cache.
Parameter                   Data Streaming Cache (R8000)              Secondary Cache (R4400)
Location                    Off-chip; 16 SSRAMs (12 ns)               Off-chip; 4 SIMMs with SRAMs (10 ns)
Contents                    FP data, integer data, instructions       FP data, integer data, instructions
Size                        4 MB                                      1 MB
Mapping                     4-way set associative, random placement   Direct-mapped, no hashing
Index                       Physical address                          Physical address
Tag                         Physical address                          Physical address
Coherency policy on writes  Write back                                Write back
Data protection             Parity on 16-bit quantities               SECDED (single-error correction, double-error detection)
Ports                       Single-ported to even bank;               Single-ported
                            single-ported to odd bank
Interleaving                Two-way by even/odd double words          N/A
Data access                 5-stage, fully pipelined                  Asynchronous
Data transfer               Two 64-bit double words per cycle         128 bits per two cycles
                            (one even and one odd double word)        (one double word per cycle)
Latency                     5 cycles initially; 0 cycles when         4-11 cycles
                            pipeline is full
Cache bandwidth             600 MB/s even; 600 MB/s odd               400 MB/s, L2 cache to L1 cache
                            (cache to/from registers)
Total cache bandwidth       1.2 GB/s                                  400 MB/s
Line size                   128 bytes (32 words)                      128 bytes (32 words)
Miss penalty                80 cycles to main memory (128 bytes)
Access pattern              2 loads/cycle, 2 stores/cycle,            1 load/cycle or 1 store/cycle
                            or 1 load and 1 store/cycle
Dirtiness recorded          On a line basis                           On a line basis
TABLE 5 Data Streaming Cache and Level-2 Secondary Cache Comparison
In addition to being set associative, the data streaming cache is also two-way interleaved. This interleaving provides the memory bandwidth required to use both floating-point functional units simultaneously each cycle.
The design provides two 64-bit operands to the FPU each cycle for an effective 1.2 GB per second transfer rate. This allows each floating-point functional unit to execute a multiply/add every cycle. Figure 20 diagrams the data streaming cache interleaving.
FIGURE 20 R8000 Interleaved Data Streaming Cache
Table 6 summarizes latencies and staging associated with various FPU operations.
Floating-Point Unit Operation    Latency (Cycle Count)                           Staging (Cycle Count)
Move, Negate, Absolute Value     1                                               1
Add, Multiply, MADD              4                                               1
Load, Store                      1                                               1
Compare, Move, Conditional Move  1                                               1
Divide                           14 (single precision); 20 (double precision)    11 (single precision); 17 (double precision)
Square Root                      14 (single precision); 23 (double precision)    11 (single precision); 20 (double precision)
Reciprocal                       8 (single precision); 14 (double precision)     5 (single precision); 11 (double precision)
Reciprocal Square Root           8 (single precision); 17 (double precision)     5 (single precision); 14 (double precision)
TABLE 6 R8000 FPU Operations and Associated Latencies/Staging
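One practical consequence of these numbers: with a 4-cycle add latency and single-cycle staging, a summation loop needs about four independent accumulators to keep a floating-point unit busy. A sketch:

```c
#include <stddef.h>

/* Sum reduction unrolled with four independent accumulators.
 * With a 4-cycle add latency and 1-cycle staging (Table 6),
 * four in-flight adds keep one FP unit busy; a single-
 * accumulator loop would stall three cycles per add. */
double sum(const double *x, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)          /* remainder elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```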
In summary, the R8000 microprocessor chip set delivers the floating-point performance and bandwidth necessary to accommodate large, numerically intensive applications. The floating-point structure and innovative caches provide capabilities not previously available from a high-volume RISC microprocessor.
Cache Controller (TCC)
The TCC acts as the bridge between the R8000 Integer Unit and the POWER Indigo2 MC (Memory Controller). It manages the R8000 Global Cache, uses the sysAD bus to pass cache miss requests, writebacks, and uncached operations to the MC, and generates interrupts to the R8000. It also implements the logic needed to control the two data buffers (TDBs).
Data Buffers (TDBs)
The TDB implements the datapath interface between the Global Cache and the DMux (Data Multiplexer). Each of the two TDBs contains half of the 128-bit StoreData bus interface to the Global Cache and R8000 FPU, and half of the 64-bit sysAD interface to the DMuxes. The TDB provides parity generation and checking for each of these interfaces. Each TDB contains four cache line buffers (one for cache misses, two for prefetches, and one for cache writebacks), plus a 64-deep FIFO for buffering graphics stores.