Objectives
This analysis compares the SPECfp95 performance of different dual-CPU Octane configurations. At present, this means comparing dual-R10K/195 to dual-R10K/250.
SPECint95 is not covered because the MIPSpro Auto Parallelizing Option does not appear to be relevant for running integer tasks on multi-CPU systems. At least, that's what I must infer from the lack of any multi-CPU SPECint95 results. I don't think this means integer tasks can't be parallelised; rather, I believe it's simply the case that, at present, the compilers do not deal with parallel integer optimisation.
There are many other tasks which can benefit from dual CPU systems, eg. LSDyna (engineering analysis), Performer, etc. However, data on those tasks is hard to come by, so this page deals only with SPECfp95.
Note that I do not have any SPECfp95 data for dual-R10K/175MHz. Since many systems will be using this CPU, please contact me if you have any relevant detailed data.
As with all these studies, a 3D Inventor model of the data is available (screenshots of this are included below). Load the file into SceneViewer or ivview and switch into Orthographic mode (ie. no perspective), rotate the object horizontally then vertically, etc.
All source data for this analysis came from www.specbench.org.
Given below is a comparison table of available dual-CPU R10000 SPECfp95 test results for Octane. Faster configurations are leftmost in the table (in the Inventor graph, they're placed at the back). After the table and 3D graphs is a short-cut index to the original results pages for the various systems.
              R10000        R10000        % Increase
              2x250MHz      2x195MHz      (2x195 -> 2x250)

  tomcatv       49.1          45.7           7.44%
  swim          63.6          58.9           7.98%
  su2cor        20.0          17.0          17.7%
  hydro2d       15.6          14.4           8.33%
  mgrid         32.2          28.4          13.4%
  applu         19.9          17.3          15.0%
  turb3d        16.4          13.4          22.4%
  apsi          15.7          12.7          23.6%
  fpppp         37.7          29.8          26.5%
  wave5         29.3          21.7          35.0%

  Peak Avg:     26.6          22.7          17.2%

[Inventor graph: Octane SPECfp95 Comparison]
Next, a separate comparison graph for each of the ten SPECfp95 tests:
[Per-test comparison graphs: tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5]
Observations
Do the above results look rather confusing, compared to the Octane single-CPU comparison results and the Octane vs. Origin (single-CPU) comparison results? If so, good! I say this because I don't like the fact that SPECfp95 allows autoparallelisation. Why? My rationale is simple: not all of the ten tests can be parallelised to any great degree; in fact, some cannot be parallelised at all, as SPEC's own documentation states. There are two totally different statistical factors at work which decision makers need to be aware of, and they are mixed together in the following manner:
              R10000        % Increase
              2x250MHz      (2x195 -> 2x250)    Factor

  swim          63.6           7.98%             8.0
  tomcatv       49.1           7.44%             6.6
  mgrid         32.2          13.4%              2.4
  hydro2d       15.6           8.33%             1.9
  fpppp         37.7          26.5%              1.4
  applu         19.9          15.0%              1.3
  su2cor        20.0          17.7%              1.1
  wave5         29.3          35.0%              0.8
  turb3d        16.4          22.4%              0.7
  apsi          15.7          23.6%              0.6
I hope you can see what is happening: in general, the tests which were already doing well on the lower-clocked dual-CPU system because of parallelisation end up with the lowest percentage increase from the upgrade (and thus the highest factors); meanwhile, the tests which might as well be run on a single-CPU system, because they are hard to parallelise (and so had lower original scores), show the highest percentage increases. This produces the bizarre situation where the tests with the highest absolute SPEC ratios give the worst percentage gains from an upgrade, simply because a) the non-parallelisable tests behave as if they were running on a single-CPU system, and b) it is easy to forget that, for the parallelised tests, the percentage increase applies across both CPUs, which makes the gain look smaller than it really is (ie. the absolute SPEC ratio for such tests remains far higher than single-CPU performance).
In other words, if one focuses too much on percentage increases in performance for parallelisable tests, one can easily lose sight of the fact that the absolute performance is still enormous compared to single-CPU performance.
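To make the arithmetic explicit, here is a small Python sketch of my own (not part of the original SPEC data) which recomputes the percentage increases from the per-test scores above. The 'Factor' column appears to be simply the absolute 2x250MHz score divided by the percentage increase; that interpretation reproduces the table to within rounding, but it is my reading, not SPEC's definition:

  # Per-test SPECfp95 peak ratios from the tables above: (2x195MHz, 2x250MHz).
  scores = {
      "tomcatv": (45.7, 49.1), "swim":    (58.9, 63.6), "su2cor": (17.0, 20.0),
      "hydro2d": (14.4, 15.6), "mgrid":   (28.4, 32.2), "applu":  (17.3, 19.9),
      "turb3d":  (13.4, 16.4), "apsi":    (12.7, 15.7), "fpppp":  (29.8, 37.7),
      "wave5":   (21.7, 29.3),
  }

  rows = []
  for test, (old, new) in scores.items():
      pct = (new / old - 1.0) * 100.0   # % increase, 2x195 -> 2x250
      factor = new / pct                # absolute score per % of upgrade gain (my reading of 'Factor')
      rows.append((factor, test, new, pct))

  # Highest factor first, as in the second table above.
  for factor, test, new, pct in sorted(rows, reverse=True):
      print(f"{test:8s} {new:5.1f} {pct:5.1f}%  factor {factor:4.1f}")

Sorting by that factor puts the well-parallelised tests at the top and the effectively single-CPU tests at the bottom, which is exactly the pattern described above.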
What irritates me is the way the various results (ie. those which gain from a second CPU and those which don't) are all mixed together to produce a final SPECfp95 peak average which, statistically, has nothing to do with the actual variance and behaviour of each individual test, because we're effectively mixing single-CPU and dual-CPU performance metrics. This is like combining results from surveys that used different sample sizes (and thus have different error margins). Ouch!
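Keep in mind that the overall SPECfp95 figure is the geometric mean of the ten individual ratios, so this blending is baked into the metric itself. A quick sketch, using the dual-250MHz figures from the first table above, shows just how wide the spread around that single number is:

  # SPECfp95 peak ratios for the dual-R10K/250 Octane, from the first table above.
  ratios = [49.1, 63.6, 20.0, 15.6, 32.2, 19.9, 16.4, 15.7, 37.7, 29.3]

  # The published overall figure is the geometric mean of the ten ratios.
  geo_mean = 1.0
  for r in ratios:
      geo_mean *= r
  geo_mean **= 1.0 / len(ratios)

  print(f"geometric mean: {geo_mean:.1f}")   # comes out at roughly 26.6
  print(f"lowest ratio:  {min(ratios):.1f} ({min(ratios) / geo_mean:.2f}x the mean)")
  print(f"highest ratio: {max(ratios):.1f} ({max(ratios) / geo_mean:.2f}x the mean)")

The output gives roughly 26.6 for the mean, with the individual ratios ranging from about 0.6x to about 2.4x of that single headline figure.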
So which tests do actually benefit from autoparallelisation? Here is a comparison graph of single vs. dual Octane/195 (complete analysis available separately):
              R10000        R10000        % Increase
              2x195MHz      1x195MHz      (1x195 -> 2x195)

  tomcatv       45.7          25.3          80.6%
  swim          58.9          40.6          45.1%
  su2cor        17.0           9.64         76.4%
  hydro2d       14.4           9.97         44.4%
  mgrid         28.4          15.9          78.6%
  applu         17.3          11.2          54.5%
  turb3d        13.4          13.8          -2.9%
  apsi          12.7          12.8          -0.8%
  fpppp         29.8          29.7           0.3%
  wave5         21.7          22.4          -3.1%
Well, it couldn't be clearer in my opinion: turb3d, apsi, fpppp and wave5 are not affected, while all the other tests gain to a significant degree. This doesn't mean the unaffected tests cannot be parallelised; it merely means that SGI's compilers do not currently accelerate them (whether these tests genuinely cannot be accelerated is a separate issue). I say this because results from other vendors show that different autoparallelising compilers behave in very different ways. Note: if you're wondering why some tests appear to slow down slightly, my view is that the autoparallelising option interferes slightly with the optimisations which would normally apply to those tests on a single-CPU system (besides, the differences are well within normal margins of error anyway).
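If you want to separate the two groups mechanically rather than by eye, a small sketch along these lines will do it (the 10% threshold is my own arbitrary choice, not anything defined by SPEC):

  # Single- vs dual-CPU SPECfp95 peak ratios at 195MHz, from the table above.
  single_vs_dual = {
      "tomcatv": (25.3, 45.7), "swim":    (40.6, 58.9), "su2cor": (9.64, 17.0),
      "hydro2d": (9.97, 14.4), "mgrid":   (15.9, 28.4), "applu":  (11.2, 17.3),
      "turb3d":  (13.8, 13.4), "apsi":    (12.8, 12.7), "fpppp":  (29.7, 29.8),
      "wave5":   (22.4, 21.7),
  }

  THRESHOLD = 10.0  # % gain below which the second CPU is treated as doing nothing

  gains_from_parallel, effectively_single = [], []
  for test, (one_cpu, two_cpu) in single_vs_dual.items():
      gain = (two_cpu / one_cpu - 1.0) * 100.0
      (gains_from_parallel if gain >= THRESHOLD else effectively_single).append(test)

  print("gain from autoparallelisation:", gains_from_parallel)
  print("effectively single-CPU:       ", effectively_single)

Run on the figures above, this splits the tests into exactly the two groups just described: tomcatv, swim, su2cor, hydro2d, mgrid and applu on one side; turb3d, apsi, fpppp and wave5 on the other.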
Variance in SPECfp95 for most RISC systems is weird enough already (eg. the R10K O2 shows almost an order of magnitude difference between its highest and lowest individual results), but allowing autoparallelisation makes the situation for decision makers much worse. For the dual-250MHz Octane, the final peak result is 26.6, yet the individual figures range from little more than half this to nearly 2.5 times as much. Of what use then is the final peak average? It conveys no useful information whatsoever about the system's overall performance. SPEC95 is supposed to be a useful guide, but we now have a situation where people could be comparing single- vs. multi-CPU results without being aware of it, or comparing multi- vs. multi-CPU results for systems with totally different autoparallelisation profiles.
If you're thinking ahead by now, you'll realise that this situation becomes considerably worse when looking at multi-CPU Origin systems with 4, 8, 16 or more CPUs. Individual results range from 20 to more than 400, making complete nonsense of the final average. During early 1998, SGI held the 'absolute' SPECfp95 record simply because of the accidental effects of statistical averaging. If a rival vendor's system had accelerated a smaller number of the tests, each to a proportionally greater degree (eg. accelerated 5 tests by an average of 80% each instead of 6 tests by an average of 63.3% each), then the rival would have held the record, even though one could argue their autoparallelising software was of less use because it applied to fewer code types. This is insane! It's the Jacob's Ladder of statistical nightmares! Note that since I first wrote this page, DEC has released the 21264 and now holds the absolute SPECfp95 record, at least for the moment; naturally, their results show a similar spread of affected and unaffected tests.
So, I say never use autoparallelised multi-CPU peak final averages in conversations about system performance when comparing with other vendors, or even the same vendor. The results are meaningless, the average is meaningless and any conclusions drawn will be meaningless. This is a case where one must absolutely break away from the final average and deal with individual test results. Only then does one see the interesting phenomena which are important to decision makers. From the studies I've carried out, there is one important fact that a decision maker should know about autoparallelised results (burn this into your brain if you are such a person): the particular tests which benefit from autoparallelisation, and the degree to which they benefit, vary enormously from one vendor's compiler to another, so two systems can post similar final averages whilst behaving completely differently on individual tests.
For a typical example of this, examine the individual results for the AlphaServer 8400 5/625; this 8-CPU system gives a final peak result of 56.7 (not that far off the 8-CPU Origin2000's result of 66.4), yet the individual results show the DEC system to be accelerating its own selection of the ten fp tests in a very different way; eg. comparing 8-CPU results to 1-CPU results, tomcatv is accelerated on the Alpha by 143% compared to the Origin2000's 628%. On the other hand, turb3d is accelerated on the Alpha by 436%, compared to the Origin's 0.5%; both systems accelerate swim by a good margin: Alpha runs it 12X faster whilst Origin runs it 9X faster. Here's a quick index for you to compare results at your leisure:
Some general guidelines:
Thus, when looking at system results on the large SPEC95 summary page, note the number of CPUs listed and, if more than one CPU is in use, take special note of whether the 'Software Information' section on the detailed results page mentions any kind of compiler parallelising option. These two factors should determine how you treat a particular result.
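If you were to automate that check, the idea would look something like the sketch below; note that the record layout and field names are purely hypothetical, invented for illustration, and do not correspond to any file format SPEC actually publishes:

  # Hypothetical result records; the field names are mine, invented for
  # illustration, and are not any official SPEC data format.
  results = [
      {"system": "Example A", "cpus": 1, "software": "MIPSpro 7.2 compilers"},
      {"system": "Example B", "cpus": 2, "software": "MIPSpro 7.2, Auto Parallelizing Option"},
  ]

  for r in results:
      parallelised = "parallel" in r["software"].lower()
      if r["cpus"] > 1 and parallelised:
          note = "multi-CPU and autoparallelised: treat the overall average with great caution"
      elif r["cpus"] > 1:
          note = "multi-CPU, but no parallelising option listed"
      else:
          note = "single-CPU result"
      print(r["system"] + ": " + note)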
Well, that's enough moaning for now. Suffice to say that I think mixing parallelised results with non-parallelised results is really silly. It's made an already confusing situation much worse. There ought to be separate benchmarks for parallelised results, eg. SPECPAR95 (SPECpar_int95 and SPECpar_fp95, or something similar). I hope this happens with SPEC98.