195MHz R10000 Performance Comparison Between O2,
Indigo2, Octane, Origin200, Origin2000 and Power Challenge
Last Change: 09/Aug/1998
SPECfp95 Analysis
SPECint95 Analysis
(Note: the 2D bar graphs shown on this page for the
various SPEC95 tests have been drawn to the same scale)
(the graphs are also to the same scale as those given on the 250MHz
R10000 comparison page)
195MHz R10000 SPECfp95 Performance
Comparison
Introductory Notes:
- SPECfp95 now permits autoparallelisation, ie. a compiler can
parallelise code on multiple-CPU systems as long as this is done
automatically, ie. no special flags can be used. The analysis given
here does not include autoparallelised results because, in my
opinion, they are not directly comparable with single-CPU results
since some tests are affected enormously, some are accelerated only
to a limited degree whilst the remainder are left totally unchanged.
Worse still, different vendors' compilers accelerate a different
selection of the ten tests from the SPECfp95 suite.
Thus, one may see similar results from two rival systems with the
same number of CPUs from different vendors, when in fact the
individual test results for the rival systems are totally different
(ie. a different subset of the tests are accelerated, but end up
leading to a similar final average). I shall construct a page
discussing autoparallelised results for R10000 at a later date. This
study will continue to be an examination of the behaviour of a single
R10000 195MHz CPU in different systems. Note that I strongly
recommend you do not use the final averages from autoparallelised
results at all since, in my opinion, they can be highly misleading.
John McCalpin, a
senior Performance Analyst at SGI and author of the STREAM memory
bandwidth benchmark, told me:
-
"I agree completely with your comments on the auto-parallelized SPECfp
results. By not allowing even the addition of comments, the SPEC
committee has practically invited vendors to invest time in
"special-case" compiler optimizations that are not likely to be of
help in the real world. SGI's automatic parallelization technology
is good, so our results are good, but the whole approach is too
fragile/sensitive to be a good thing.
...
It is not always easy to know what can and cannot be parallelized.
Taking the standard tests and allowing the addition of compiler
directives for the parallel runs would be appropriate for this
sort of test."
- SGI revised its compiler technology in early 1998; after the
change, SGI released new Origin2000 SPEC95 figures, showing minor
improvements to floating point (fp) performance and major
improvements to integer (int) performance. The diagrams and tables on
this page use the better figures released after the compiler change -
their relation to older figures is not discussed.
- A key feature of the R10000 (its support for multiple outstanding
cache misses) is disabled in certain systems that cannot support it.
This has an impact on certain tests and so is discussed below, though
not in great detail since it is difficult to know which tests are
affected and to what degree.
Objectives
I compared the test results for the R10000 on various SGI systems.
The goal was to discover how the same processor (in this case
the 195MHz R10000) behaved on different SGI systems that supported
it. It's interesting to see which systems benefit from a larger L2
cache, which are improved by a new architectural design or a faster
bus, etc., and which hardly vary at all - ie. perhaps some systems
would be faster simply with a higher clock speed, as opposed to
greater memory bandwidth, a larger L2 cache, etc.
To aid visualisation, I've constructed a 3D Inventor model of the
data; screenshots of this are included below. You can download the 3D model (1232 bytes
gzipped) if you wish: load the file into SceneViewer or ivview
and switch into Orthographic mode (ie. no perspective). Rotate the
object 30 degrees horizontally and then 30 degrees vertically (use
Roty and Rotx thumbwheels) - that'll give you the standard isometric
view. I actually found slightly smaller angles (15 or 20 degrees)
make things a little clearer, so feel free to experiment. Changing the
direction of the headlight can also help. Note that newer versions of
popular browsers may be able to load and show the object directly,
although I've found such browsers may not offer Orthographic
viewing.
All source data for this analysis came from www.specbench.org.
Note: The R10000 used in Origin200 in this
analysis is 180MHz, not 195MHz (the latter is not available for
Origin200 due to XIO timing issues). The figures for Origin200 given
here have not been scaled by 195/180 in an attempt to take
account of this, so please bear in mind that Origin200 results are
for a lower-clocked R10000. If one does scale Origin200 figures by
195/180 (just over 8%), then the differences between Origin200 and
other systems are lessened, but there is little point in comparing
published results to something which is not available as a buyable
system. Besides, scaling SPEC test results in a linear manner is not
recommended.
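For illustration only, the naive linear scaling described above is trivial to compute; a quick sketch (variable names are mine, the ratios are the Origin200 SPECfp95 figures from the table below). Remember that SPEC results do not really scale linearly with clock speed, so treat the output as a rough upper bound, not a real result:

```python
# Origin200/180 SPECfp95 ratios from the comparison table below.
o200_fp = {
    "tomcatv": 22.2, "swim": 34.5, "su2cor": 8.47, "hydro2d": 7.99,
    "mgrid": 14.8, "applu": 11.0, "turb3d": 14.3, "apsi": 11.9,
    "fpppp": 28.3, "wave5": 20.3,
}

# Naive linear clock scaling to a hypothetical 195MHz part - NOT a
# recommended practice, just the arithmetic the text refers to.
scale = 195 / 180  # just over 8%
scaled = {test: round(ratio * scale, 2) for test, ratio in o200_fp.items()}
print(scaled["tomcatv"])  # 22.2 * (195/180) = 24.05
```

Even with this generous scaling applied, the Origin200 column would still trail the Octane and Origin2000 columns on most tests.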
Given below is a comparison table of the various R10000/195 SPECfp95
test results. Faster systems are leftmost in this table (in the
Inventor graph, they're placed at the back). You may need to widen
your browser window to view the complete table. After the table and
3D graphs is a short-cut index to the original results pages for the
various systems.
Key:
- O200 = Origin200
O2000 = Origin2000
PChall = Power Challenge
I2 = Indigo2
System: O2000 Octane O200 PChall PChall I2 O2
L2: 4MB 1MB 1MB 2MB 1MB 1MB 1MB
tomcatv 26.9 25.3 22.2 16.7 16.1 12.1 9.78
swim 41.2 40.6 34.5 23.9 23.9 17.1 13.9
su2cor 11.5 9.64 8.47 8.74 7.35 6.40 4.72
hydro2d 12.6 9.97 7.99 6.36 5.03 4.01 3.17
mgrid 18.8 15.9 14.8 11.4 10.4 8.60 6.95
applu 11.7 11.2 11.0 8.69 8.54 7.44 5.92
turb3d 15.3 13.8 14.3 11.3 11.2 10.3 9.57
apsi 15.6 12.8 11.9 15.5 10.3 9.60 9.77
fpppp 29.6 29.7 28.3 31.1 31.3 31.4 29.3
wave5 25.5 22.4 20.3 21.3 18.4 17.3 11.8
SPECfp95 Comparison Table for MIPS R10000 195MHz
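Note that the single published SPECfp95 figure for a system is the geometric mean of the ten per-test ratios. A short sketch using the Origin2000 column above shows how the overall average is derived (the list ordering follows the table, tomcatv through wave5):

```python
from math import prod

# Origin2000 (4MB L2) per-test SPECfp95 ratios, from the table above.
o2000 = [26.9, 41.2, 11.5, 12.6, 18.8, 11.7, 15.3, 15.6, 29.6, 25.5]

# SPEC95's overall figure is the geometric mean of the individual ratios.
geomean = prod(o2000) ** (1 / len(o2000))
print(round(geomean, 1))
```

The geometric mean is exactly why a final average can hide huge per-test variance: two very different sets of ten ratios can produce the same overall number.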
Next, a separate comparison graph for each of the ten SPECfp95 tests:
tomcatv:
swim:
su2cor:
hydro2d:
mgrid:
applu:
turb3d:
apsi:
fpppp:
wave5:
Note: for Origin200, the peak apsi SPEC ratio (11.9) is less
than the base apsi SPEC ratio (12.6) shown in the submitted
results on www.specbench.org. This is apparently due to a mistake
in the flags used for the peak test run.
Observations
These are easier to spot from the graphs, which is why I made them in
the first place:
- some tests vary hardly at all between the six systems, so what is
the bottleneck for such cases? (eg. fpppp) It can't be memory
bandwidth, L2 size, etc. Is it just straight clock speed? If so,
existing faster-clocked systems should get much higher results, but
some don't (eg. 500MHz Alpha gets 38.3; if clock speed was the key
factor, it should be nearer 80). So what's going on?
Apparently, the fpppp data set is quite small and will fit nicely
into a typical L2 cache; in fact, fpppp can fit into a reasonably
sized L1 data cache. This is why the fpppp results do not vary
amongst the different R10000 systems - in all cases, the CPU is going
as fast as it can and is not being hindered by latency and bandwidth
factors concerning main RAM, or L1<-->L2 cache transfer
issues. Notice, therefore, that despite the problems with R10000 in
O2, the fpppp test goes very fast, which shows that the factors
limiting an R10K's performance in O2 are not to do with the CPU
itself, but with the connection between the CPU's L2 cache and the
rest of the system (ie. any task that fits into L1 or L2 cache for
R10K in O2 will run just as fast as it does on Octane).
However, the 21164 has a much smaller L1 data cache and so, even
though it should theoretically run fpppp very quickly based on its
clock speed, in fact it can't because now the bottleneck is
L1<-->L2 data transfer instead of L2<-->RAM transfer. The
21164's L1 data cache is 8K total, compared to the R10000's much
larger 32K. Before the release of 21264 SPEC95 results, I predicted
that the fpppp result for the 21264 would be high since that
processor is not hampered by a small L1 cache (my guess was 80, as
stated earlier). Sure enough, the 575MHz 21264 achieves 82.7 for
fpppp. QED, as the saying goes.
John told me:
-
fpppp actually fits into the L1 cache in the R10000. I suspect that
it won't fit into the 8kB cache in the 21164. The 21164 also has
trouble with branch mispredictions and pipeline stalls here. Note
that the latency of an FP multiply is about the same on the MIPS and
DEC chips: 2 cycles at 200 MHz on the R10000 vs 5 cycles at 500 MHz
on the 21164.
But what is fpppp? I asked John and he said:
-
"It is a horribly convoluted code extracted from an obsolete
molecular dynamics program. Despite Intel's request, it will not be
in SPEC98."
I asked if there was any need at all for a test like fpppp, where the
data set is very small. John replied:
-
"None of the commercial scientific/engineering apps that I have
tested are anywhere near as tiny as fpppp."
Asking for an overall summary, John said:
-
"With the Alpha line, DEC has chosen a long pipeline that runs fast.
As a consequence, some important operations take more cycles. The
21164 can issue independent FP Add instructions at a rate of one per
cycle, but dependent add instructions have a repeat rate of
one per 5 cycles. The R10000 can issue independent FP Add
instructions at a rate of one per cycle, while dependent add
instructions have a repeat rate of one per 2 cycles. The rate for
independent FP Adds is determined by lots of other bottlenecks
in the system(s), while the rate for dependent FP Adds is
about the same for a 500 MHz 21164 and a 195 MHz R10000. This is
visible in the comparable performance delivered on linear finite
element analysis codes, since these are dominated by direct sparse
matrix solvers that have lots of dependent FP arithmetic."
Asking for clarification on 'about the same', John replied:
-
"5 cycles at 500 MHz is the same amount of time as 2 cycles at 200 MHz
(10 ns in each case)."
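John's equivalence is simple arithmetic - latency in nanoseconds is cycles divided by clock rate - which a one-line helper (names are mine) makes explicit:

```python
# Latency in nanoseconds = cycle count / clock rate in GHz.
def latency_ns(cycles, mhz):
    return cycles / (mhz / 1000.0)

# Dependent FP add repeat rates quoted above: both work out to 10 ns.
print(latency_ns(5, 500))  # 21164 at 500 MHz
print(latency_ns(2, 200))  # R10000 at 200 MHz
```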
- 4 tests seem unaffected by larger L2 cache: swim, applu, turb3d
and fpppp.
- 4 tests benefit slightly from larger L2: tomcatv, su2cor, hydro2d and
mgrid.
- 2 tests benefit a lot from larger L2: apsi and wave5.
- Many tests benefit greatly from improved main memory systems
(tomcatv, swim, hydro2d, mgrid, applu, turb3d), some tests benefit a
fair amount (apsi and wave5) and one test benefits almost not at all
(fpppp) for reasons given above.
- Clearly, the operation of R10000 in O2 is
hampered by various factors, but it's worth noting that apsi on O2
actually runs faster than it does on Indigo2. apsi is obviously a
task that benefits more from a larger L2 cache than anything else.
- O2 shows the largest variance in performance results of any of the
systems; the difference between highest and lowest is almost an order
of magnitude. From this it can be concluded that the final SPECfp95
average for R10K in O2 should not be used at all as it could easily be
a factor of 3 away from the actual performance one might see from one's
target application (ie. if one judges R10K in O2 based on the final
average, the performance of a real-world problem might be 3 times
slower or 3 times faster, depending on the nature of the code).
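The 'almost an order of magnitude' claim is easy to verify from the O2 column of the table above (the dictionary below simply restates those figures):

```python
# SPECfp95 ratios for R10000/195 in O2, from the comparison table.
o2_fp = {
    "tomcatv": 9.78, "swim": 13.9, "su2cor": 4.72, "hydro2d": 3.17,
    "mgrid": 6.95, "applu": 5.92, "turb3d": 9.57, "apsi": 9.77,
    "fpppp": 29.3, "wave5": 11.8,
}

# Ratio of best to worst test: fpppp (cache-resident) vs hydro2d.
spread = max(o2_fp.values()) / min(o2_fp.values())
print(round(spread, 1))  # 29.3 / 3.17, just over 9x
```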
- As far as I know, none of the SPECfp95 tests are similar to
geometry/lighting matrix calculations found in 3D graphics. If this
is true, then one cannot use SPECfp95 as an indicator of performance
for 3D graphics applications. Asking for comment, John said:
-
"I know that 102.swim is coded as single precision. I have not looked
at the others... Swim is pretty strongly bandwidth limited on most
systems -- in this case, using 32-bit reals significantly increases
performance.
SWIM is fairly standard long vector code. It does not have the
characteristics of the 4x4 matrix manipulations that characterize
geometry calculations."
Of most note are the performance jumps from Indigo2/IMPACT to
Origin200. Even though the Origin200 has a 180MHz R10000 whilst the
Indigo2/IMPACT has a 195MHz R10000, many of the tasks run much
faster on the Origin200, by a factor of 2 in some cases. Such tasks are
clearly benefiting from the much higher memory bandwidth (STREAM
result is 5 times higher than Indigo2/IMPACT), lower memory latency
(Origin200's memory latency is about 40% better than Indigo2) and an
additional factor that is discussed below. As the complexity of a
data set becomes larger (ie. data set size increases), the
improvement seen with Origin200 over Indigo2/IMPACT will increase.
However, there is a confusing factor in all this which makes it
difficult to come to precise conclusions as to why some tests perform
in a particular way on a certain system:
- The R10000 supports multiple outstanding cache misses, but
this feature is restricted in certain systems (see section 2.4 of the
R10000
Technical Brief for more information). John commented:
-
"On the O2, Indigo2/10K, and Power Challenge/10K, the system PROM
re-programs the R10000 to only allow 1 outstanding cache miss, since
that is all that the memory system supports. Octane and Origin
systems allow 4 outstanding misses per cpu."
Restricting the number of outstanding cache misses in this way will
have an impact on certain types of code, but it is difficult to
ascertain, without detailed analysis, which of the SPECfp95 tests may
be affected; John's view was, "It is very hard to quantify this."
Cache accessing is obviously an important factor when looking at
performance. I asked John for a summary description of the way
cache systems operate and the relevant issues; he said:
-
"Any time you execute a Load instruction, the CPU checks in the
caches to see if a copy of the data is handy. If the data is in the
primary cache, the CPU proceeds at full speed, and the data can be
used two cycles after the load (on MIPS processors, anyway). If there
is not a copy of the data in the primary cache, the secondary cache
is checked. If there is a copy in the secondary cache, it takes about
9-12 cycles (on the R10000) to get the data. If there is no copy of
the data in the secondary cache, it takes about 65 cycles to get the
data.
If the CPU continues executing other instructions after the Load that
missed the primary cache, you have what is called a "non-blocking"
cache. Some systems allow more loads to occur while the cache miss of
the first load is outstanding, and will continue as long as all the
subsequent loads hit in the cache. This is called a "hit under miss"
cache. Typically a "hit under miss" cache will stall the CPU when a
second cache miss occurs, i.e. the CPU will sit there doing nothing
at all until the data for one of the loads gets returned either by
the secondary cache or by the memory controller.
More sophisticated processors allow multiple outstanding misses,
typically with different numbers of misses allowed at each level of
the memory hierarchy. On the R10000, you can have 4 loads miss the
cache and still continue executing instructions. The information
about these outstanding misses sits in a buffer on the CPU that
matches the data coming in from the secondary cache or main memory
controller with the corresponding instruction."
Apparently, PowerPC 604/604e systems operate a "hit under miss" cache
system.
I asked John for some examples of the types of code which are
sensitive to cache-access issues; he said:
- Traversing linked lists (pointer chasing),
- Using only 1 part of a multi-element structure per loop
iteration,
- Code with lots of branches.
Obviously, the precise reasons why a piece of code performs in a
particular way can be quite complex.
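The first of John's examples, pointer chasing, is the worst case for any non-blocking cache scheme: the address of each load is the value returned by the previous load, so the hardware can never overlap the misses no matter how many outstanding misses it supports. A sketch of that dependency chain (illustrative only - Python shows the access pattern, not the timing):

```python
import random

# Build a single-cycle permutation: next_node[i] is the successor of
# node i, and following it from any start visits every node once.
random.seed(42)
n = 1 << 16
perm = list(range(n))
random.shuffle(perm)
next_node = [0] * n
for i in range(n):
    next_node[perm[i]] = perm[(i + 1) % n]

# The chase: each iteration's load address depends on the previous
# load's result, forcing strictly serial memory accesses.
node, steps = 0, 0
while True:
    node = next_node[node]
    steps += 1
    if node == 0:
        break
print(steps)  # the chase covers all n nodes before returning to the start
```

Contrast this with a simple array sweep, where the addresses are known in advance and an R10000 in Octane or Origin can keep up to 4 misses in flight at once.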
Summary.
I've done this comparison because people still rely heavily on SPEC95
results when making purchasing and upgrade decisions, as I discovered
when helping a colleague with the buying of a new number crunching
system - he eventually ordered an Origin200 (full details given
below). Thus, I hope my 3D graphs can help people understand a little
more about how SPECfp95 is actually behaving on different systems
that use identical processors, Origin200/180 notwithstanding.
What this shows, once again, is that it's important not to rely on
final SPEC averages. More than anything, get your code physically
tested on the target system if possible before making a purchasing
decision.
Finally, please also read my page on what I call performance difference profiling - a
technique that I hope may aid those who are involved in making
upgrade decisions.
An Example R10000 Performance Comparison to Indigo2/R4400
and Indy/R5000
A colleague from the University of Central Lancashire's Fire and
Explosion Studies Centre asked me for advice on upgrading from
Indy/Indigo2 (using R4600, R5000 and R4400) to Indigo2/R10000 or
perhaps Origin200.
I arranged for SGI UK to bring along an Indigo2 R10000 195MHz Solid
IMPACT. The Centre then conducted tests on the R10000 Indigo2 to
compare its performance against their existing 133MHz R4600 Indys and
a 200MHz R4400 Indigo2. All the tests involved Fortran77 code using
the MIPS Pro 7.0 compilers, performing 64-bit floating point
calculations. These were pure time-based number crunching tests, ie.
no screen output or disk operations were involved.
The results are as follows (factor speedups, ie. how much faster the
Indigo2 R10000/195 was compared to the target system):
- Standard problem, 50x50 grid, 100 steps:
- 11.4 times faster than Indy R5000SC/150
- Larger Problem, 150x150 grid, 100 steps:
- 5.8 times faster than Indigo2 R4400/200
11.4 times faster than Indy R5000SC/150
- Standard Problem, 51x51 grid, 100 steps:
- 2 times faster than Indigo2 R4400/200
- Larger Problem, 101x101 grid, 10 steps:
- 2.6 times faster than Indigo2 R4400/200
3.2 times faster than Indy R5000SC/150
Other test results; each factor is how much faster the R10000/195
Indigo2 was compared to the R4400/200 Indigo2:
- Pseudospectral algorithm with Fast Fourier Transform:
[l = 2048]: 5.39 times faster
[l = 32768]: 3.41 times faster
- Finite-Difference algorithm in decomposed geometry:
[110x10/50x10/50x10]: 2.13 times faster
[1100x100/500x100/500x100]: 3.10 times faster
- Eigenvalues, LAPACK:
300x300 DGEEV (ordinary): 3.79 times faster
300x300 DGEGV (generalised): 2.89 times faster
1000x1000 DGEEV (ordinary): 2.83 times faster
1000x1000 DGEGV (generalised): 2.35 times faster
195MHz R10000 SPECint95 Performance
Comparison
The systems examined were the same as for the SPECfp95 comparison
given above, including the Origin200 using a 180MHz R10000, so please
bear this difference in mind when examining the
data. Just as above, you can download a 3D performance graph
(gzipped) if you wish: load the file into SceneViewer or ivview
and switch into Orthographic mode (ie. no perspective), etc.
The rationale and method for this examination were the same as for
SPECfp95. Thus, given below is a comparison table of the various
R10000/195 SPECint95 test results. You may need to widen your browser
window to view the complete table. After the table and 3D graphs is
a short-cut index to the original results pages for the various
systems.
Key:
- O200 = Origin200
O2000 = Origin2000
PChall = Power Challenge
I2 = Indigo2
System: O2000 Octane O200 PChall PChall I2 O2
L2: 4MB 1MB 1MB 2MB 1MB 1MB 1MB
go: 11.4 11.4 10.5 10.0 10.0 10.0 11.0
m88ksim: 11.3 11.3 10.4 9.15 9.18 9.14 11.1
gcc: 10.4 10.1 9.26 8.25 7.87 7.91 9.02
compress: 11.3 11.3 10.4 10.0 10.0 10.1 10.6
li: 9.57 9.59 8.79 7.79 7.85 7.87 9.42
ijpeg: 10.2 10.1 9.26 8.23 8.29 8.20 9.35
perl: 13.3 13.0 12.1 9.42 9.27 10.3 13.0
vortex: 14.4 11.2 12.4 8.25 7.86 7.97 8.20
SPECint95 Comparison Table for MIPS R10000 195MHz
Next, a separate comparison graph for each of the eight SPECint95 tests:
go:
m88ksim:
gcc:
compress:
li:
ijpeg:
perl:
vortex:
It is immediately obvious that the behaviour of these tests is very
different from the SPECfp95 suite. Firstly, there is a much lower
variance for all the systems. Other observations are:
- Few tests benefit to any significant degree from a larger L2
cache; the only test which does is vortex. Although some tests do
improve slightly from the 1MB L2 PowerChallenge (PChall) to the 2MB
L2 PChall (eg. gcc and perl), only vortex shows both this behaviour
and a significant jump from the 1MB Octane to the 4MB Origin2000.
- The Origin architecture is clearly helping all these tests to
some degree, with many showing significant improvements.
- Perhaps most important of all, O2 does very well in virtually all
the tests, almost always outperforming Indigo2, PChall and Origin200
(vortex on 2MB PChall and Origin200 is the only exception), in many
cases very close to Octane in performance (go, m88ksim, li) and in
one case matching Octane exactly (perl).
Vortex is an interesting case. Though O2 does well on all tests,
matching Origin and beating older systems handsomely most of the
time, vortex is the one test where O2 does not do as well as
Origin-based systems. Stranger still, Octane does not do as well as
Origin200 for vortex. It's a shame that vortex is the only test in
SPECint95 which gives rise to this behaviour; if one's code happens
to be like vortex, ascertaining that fact may not be easy. I asked
John about vortex; he replied:
-
"vortex in SPECint95 has a lot of trouble with TLB misses, as its
memory access patterns cover a large memory space. The lower latency
of the main memory system (for getting the new TLB entries) helps the
newer machines a fair amount relative to the old machines. Somewhere
along the lines we made an O/S modification that allowed this code to
use large pages and reduce the TLB miss rate -- this modification
would not be reflected in the results on the older machines, but
would help them if we went back and ran the tests again."
Note that SGI does not have the time to rerun tests on older
systems. Running the base tests is trivial, but it isn't easy to
gather together people who can work on finding the best optimising
compiler options for the peak tests. SGI is probably busy enough as
it is testing newer CPUs on the current systems.
The SPECfp95 discussion above refers to the restriction of the R10K's
multiple outstanding cache-miss feature on older SGI systems and O2. However,
for SPECint95, this appears to be much less of an issue. vortex might
be an exception but it's difficult to tell without detailed knowledge
of how vortex works. John's view on this was:
-
"Outstanding cache misses are not relevant on SPECint95 because
almost no secondary cache misses occur! With default page sizes, only
vortex and gcc seem to show any benefit from going from 1 MB to 4 MB
caches."
Asking John why no cache misses were happening, his response was:
-
"The jobs are working on small data sets."
The next question had to be, what kind of tasks do use large
data sets? John's reply was:
-
"Database stuff, large cpu simulators, integer
programming/optimization (like travelling salesman problems, airline
scheduling, etc.)"
Finally, given that the 2D graphs above are to the same scale, it's
very clear that the R10000 shows a much larger floating point (fp)
advantage over the SPEC reference system than an integer (int)
advantage and also a greater variance for fp results. However, the
problem with this observation is that one has no way of knowing how
'good' the original reference system was for int vs. fp work.
In other words, one has no way of knowing whether the int and fp
tests were 'equally' difficult for the reference system. It would
have been better perhaps if the tests could have been tailored such
that the variance in reference times between the int and fp tests
were similar; ie. SPEC95's reference times vary between 1400 and 9600
for the fp tests and between 1700 and 4600 for the int tests - it
would have been better if the spread was the same for both test
suites. Also, for SPECfp95, there is an odd correlation between tests
whose reference times are high and final high target system ratios
(swim and fpppp); is this because the reference system ran them
rather slowly, or because SPEC made the tests more complex to slow
them down? It is difficult to know what to conclude. Just a
coincidence? Or were some tests just not tough enough in the first
place? Or maybe they were, but the R10000 is just very good for that
kind of work anyway? Who knows?
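For reference, a SPEC95 ratio is simply the reference system's run time divided by the measured run time, so the spread in reference times quoted above translates directly into how differently the tests are weighted in wall-clock terms:

```python
# A SPEC95 ratio: reference run time / measured run time.
def spec_ratio(ref_seconds, measured_seconds):
    return ref_seconds / measured_seconds

# Spread of the reference-time ranges quoted above for each suite.
fp_spread = 9600 / 1400    # fp reference times span nearly 7x
int_spread = 4600 / 1700   # int reference times span under 3x
print(round(fp_spread, 1), round(int_spread, 1))
```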
Clearly, SPEC95 must be examined in detail to gain any genuinely
useful information.