floating-point math operations common in 3D graphics
MIPS Technologies has taken a conscious step in another direction with its new R5000 processor. As the latest Rx000-compatible CPU, the R5000 takes advantage of a relatively simple and efficient design to deliver high clock speeds (250MHz by the end of this year) and competitive performance (an estimated 6.0 SPECint95 and 6.1 SPECfp95 at least) at a very attractive price (less than $300). The architects of the R5000 deliberately avoided complexity that would compromise their goal of making a fast, economical chip for low- to mid-range workstations.
That's not because MIPS is averse to advanced microprocessor design. The MIPS R8000 and R10000 have all the sophisticated features mentioned earlier and more. Where appropriate, the R5000 inherits some of these features, and it adds some new twists of its own, such as optimized logic for the single-precision floating-point (FP) math that characterizes today's 3D graphics.
Simple, Not Stupid
To strike a balance between performance and simplicity, R5000 architects made some interesting choices. Compatibility with the latest MIPS software was a paramount consideration, so they retained the 64-bit architecture first introduced in the R4000 [1992] and the MIPS IV instruction set that made its debut with the R10000. The R5000's 64-bit data paths and registers effectively double the chip's bandwidth, yet they can also operate in 32-bit mode to provide backward compatibility with older MIPS software.
MIPS retained many features of the high-end R10000, such as generous register files and primary caches. The R5000 has 32 integer and 32 FP registers, all of them 64 bits wide. Although the R10000 actually has 64 integer and 64 FP registers, half of those are programmer-invisible shadow registers for speculatively executed instructions. Since the R5000 doesn't speculatively execute, all its registers can be programmer-visible architectural registers.
The R5000 has separate primary caches for instructions and data, and each cache is 32KB and two-way set-associative - just like the R10000. In fact, the R5000's caches occupy as much space as the logic. Large on-board caches are becoming increasingly common in high-performance microprocessors to help alleviate the memory-latency problem of modern system design.
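To make "two-way set-associative" concrete, here is a rough sketch of how an address could map to a set in a 32KB, two-way cache. The 32-byte line size is an assumption for illustration only (the text above doesn't give one), and the names `set_index` and `conflicts` are hypothetical.

```python
CACHE_BYTES = 32 * 1024   # per the article: 32KB per primary cache
WAYS = 2                  # per the article: two-way set-associative
LINE_BYTES = 32           # ASSUMPTION: the line size is not given in the text

# Each set holds WAYS lines, so the number of sets follows from the sizes.
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)


def set_index(address):
    """Return the index of the set that a byte address falls into."""
    return (address // LINE_BYTES) % NUM_SETS


def conflicts(a, b):
    """True if two addresses are on different lines that share one set."""
    different_lines = (a // LINE_BYTES) != (b // LINE_BYTES)
    return different_lines and set_index(a) == set_index(b)
```

Under these assumptions the cache has 512 sets, and addresses 16KB apart compete for the same set. A two-way design can hold two such lines at once before it has to evict one.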
The final important feature that the R5000 inherits from the R10000 is superscalar pipelining. The R10000 was the first single-chip superscalar CPU from MIPS, and the engineers went all-out. They endowed the R10000 with four-way pipelines and the potential to execute as many as five instructions per cycle, although it can retire only four per cycle. For the economy-model R5000, MIPS pared down: The chip has two-way pipelines, with significant limitations on the types of instructions it can execute in parallel.
For instance, the R5000 cannot execute two integer instructions simultaneously. Unfortunately, integer instructions are the most common operations in general-purpose software. MIPS' thinking was that the R5000 will deliver sufficient integer performance for its target market even without parallelism. Instead, architects chose to optimize the R5000's pipelines for the instruction streams typically found in 2D and 3D graphics software: a mix of integer and single-precision floating-point operations.
MADD About Math
Consider the math behind 3D geometry processing. To calculate the vertices of a 3D object, a graphics program typically multiplies a 4x4 transform matrix of single-precision FP values against a 1x4 matrix of similar values representing a single vertex. The result is another 1x4 matrix. The graphics program must then repeat this operation for every vertex in the object - potentially tens of thousands of times for a complex object in a CAD drawing.
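The per-vertex arithmetic described above can be sketched in a few lines. This is only an illustration: the function name and sample data are made up, the vertex is treated as a column of four values, and Python floats are double-precision rather than the single-precision values the text describes.

```python
def transform_vertex(matrix, vertex):
    """Multiply a 4x4 matrix by a 4-component vertex; return 4 components."""
    result = []
    for row in matrix:
        # Each output component is a 4-element dot product:
        # four multiplies and three additions.
        total = row[0] * vertex[0]
        for k in range(1, 4):
            total = total + row[k] * vertex[k]
        result.append(total)
    return result


# Hypothetical sample data: a translation by (2, 3, 4).
translate = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 3.0],
    [0.0, 0.0, 1.0, 4.0],
    [0.0, 0.0, 0.0, 1.0],
]
vertices = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.5, -1.0, 1.0]]

# The same matrix is applied to every vertex in the object.
transformed = [transform_vertex(translate, v) for v in vertices]
```

Each vertex costs 16 multiplies and 12 additions when written this way, which is why the repeat count matters so much for a complex object.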
To do this kind of math, the MIPS IV instruction set includes both a single-precision and a double-precision multiply-add (MADD) operation that's similar to the multiply-accumulate (MAC) instruction in a DSP. The main difference is that MADD uses four operands instead of MAC's three (A*B+C=D instead of A*B+C=C).
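The operand difference is easier to see side by side. The sketch below models only the data flow, not the hardware: `madd`, `MacUnit`, and `dot4_with_madd` are hypothetical names, and nothing here reproduces single-precision rounding.

```python
def madd(a, b, c):
    """Four-operand form: return d = a*b + c; a, b and c are left untouched."""
    return a * b + c


class MacUnit:
    """Three-operand form: the accumulator is both an input and the result."""

    def __init__(self):
        self.acc = 0.0

    def mac(self, a, b):
        # C = A*B + C: the old accumulator value is overwritten.
        self.acc = a * b + self.acc
        return self.acc


def dot4_with_madd(row, vertex):
    """One output component of the transform: 1 multiply, then 3 MADDs."""
    total = row[0] * vertex[0]
    for k in range(1, 4):
        total = madd(row[k], vertex[k], total)
    return total
```

Counted this way, a full 4x4 transform of one vertex is 4 multiplies plus 12 MADDs, or 16 FP instructions, instead of 16 multiplies and 12 separate additions.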
As implemented in the R5000, the single-precision MADD instruction has a repeat rate of one cycle and a latency of four cycles. The FPU is subpipelined, so it can calculate the multiplication and addition components of a matrix problem in parallel. Once the R5000's five-stage pipeline is primed, there can be a MADD instruction in a different stage of completion in every pipe stage. To keep a new MADD issuing every cycle, the R5000 can also process an integer or load/store instruction at the same time as a single-precision MADD. For every MADD executing in the FP pipeline, an accompanying load/store instruction can be flowing through the second pipeline.
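A toy timing model shows why the one-cycle repeat rate pays off only when consecutive instructions are independent. The model is an assumption-laden sketch, not a description of the real chip: it issues FP instructions strictly in order, treats every operation as having the four-cycle latency quoted for MADD, and ignores loads, stores, and cache misses. All names are hypothetical.

```python
LATENCY = 4       # cycles from issue to result, as quoted for MADD
REPEAT_RATE = 1   # cycles between successive issues, as quoted for MADD


def total_cycles(ops):
    """ops is a list of (name, dependency_or_None) in program order.

    Returns the cycle at which the last result becomes available,
    assuming in-order issue (a simplification made for this sketch).
    """
    ready = {}        # name -> cycle at which its result is available
    next_issue = 0    # earliest cycle the next instruction may issue
    finish = 0
    for name, dep in ops:
        issue = next_issue
        if dep is not None:
            issue = max(issue, ready[dep])   # stall until the operand exists
        ready[name] = issue + LATENCY
        next_issue = issue + REPEAT_RATE
        finish = max(finish, ready[name])
    return finish


def step(r, k):
    """Step k of the dot product for output row r depends on step k-1."""
    return ((r, k), (r, k - 1) if k > 0 else None)


# One row at a time: every instruction waits on the one just before it.
row_by_row = [step(r, k) for r in range(4) for k in range(4)]

# Interleaved: the four rows take turns, so neighbours are independent.
interleaved = [step(r, k) for k in range(4) for r in range(4)]
```

Under this model the row-by-row order needs 55 cycles for the 16 operations of one vertex, while the interleaved order needs 19: sixteen issue slots plus three cycles for the last result to drain.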
Result: a highly tuned microarchitecture that rips through 3D geometry calculations at speeds you'd normally expect from a more expensive processor. Ordinary FP benchmarks may not see the whole story because they're not specifically measuring this performance. MIPS claims that 3D graphics software such as Pro-Engineer and UltraCAD will run faster on the R5000 than on similarly-benchmarked CPUs like the Pentium Pro.
BYTE hasn't verified those claims (*), but MIPS' parent company, Silicon Graphics (Mountain View, CA), is reporting significant performance gains on its latest R5000-based Indy workstations. According to SGI, the new Indys run 3D graphics software about 83 percent faster than existing R4400-based systems. And because they use early versions of the R5000 that run at 150 to 180MHz, greater gains lie ahead when the R5000 achieves its target clock speed of 250MHz (#).
Cutting Corners
To hold down costs, MIPS left several advanced features out of the R5000 chip. Having fewer pipelines is just one example. The R5000 also cannot execute instructions speculatively or out of order. This greatly reduces the chip's complexity because it doesn't have to bother with tricky techniques to retire instructions in their original program order.
The R5000 doesn't support dynamic branch prediction, either. However, it does have several branch-likely instructions that smart programmers and compilers can use as a sort of poor man's substitute. There's also an FP conditional-move instruction that saves a branch.
In another cost-cutting measure, MIPS eliminated the 128-bit secondary-cache bus found on the R4000 and R10000. The R5000 accesses its secondary cache over the general I/O bus, which is 64 bits wide. The maximum cache size is 2MB. This is similar to the Pentium's cache interface, except the R5000's bus can run at 100MHz, compared to the Pentium bus's top speed of 66MHz. [The R10000's cache limit is 16MB]
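The bus comparison comes down to simple arithmetic. The sketch below computes a theoretical peak only, ignoring protocol overhead and wait states. The 64-bit width used for the Pentium bus is not stated above and is taken as a given here, and `peak_mb_per_s` is a hypothetical helper.

```python
def peak_mb_per_s(bus_bits, clock_mhz):
    """Theoretical peak: bytes per transfer times million transfers/second."""
    return bus_bits / 8 * clock_mhz


r5000_peak = peak_mb_per_s(64, 100)    # 64-bit bus at 100MHz, per the article
pentium_peak = peak_mb_per_s(64, 66)   # width assumed; 66MHz per the article
```

On paper that works out to 800MB per second for the R5000's bus against 528MB per second for the Pentium's, roughly a 50 percent advantage from clock speed alone.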
All these economies significantly reduce the R5000's pin count. And the R5000's die is downright tiny: 84 sq mm on a 0.32-micron process. A smaller die means higher yields, lower manufacturing costs, and faster clock speeds. With CPUs, that's the whole ball game.