Render Test 4: Maya V6.5 Render Benchmark Results, Simple Scene

SGI Performance Comparisons

Maya V6.5 Render Benchmark Results

Simple Scene (Benchmark_Maya.mb) Rendered Using mentalray

Last Change: 05/Aug/2010

This test uses the command-line Maya V6.5 'Render' program, employing the mentalray renderer. The example scene looks like this:

The Maya binary file for this scene is available for download if required. Note that the ZooRender website has a table of results, but they're pretty useless for comparison purposes since the table is full of bogus/joke numbers. However, I can at least use the test for comparing different SGIs.

The commands used for the tests are as follows:

Single-CPU Command:

  timex Render -r mr -rd $PWD -im test -of tif Benchmark_Maya.mb

Multi-CPU Command:

  timex Render -r mr -rt 4 -rd $PWD -im test -of tif Benchmark_Maya.mb

Here are the results, in hours, minutes, seconds and hundredths of a second (ie. the time format is 'hours:minutes:seconds.hundredths-of-a-second').

NOTE 1: any result shown in bold type is an overall 'throughput' result, ie. the effective speed per render when running multiple 4-thread renders at the same time.

NOTE 2: in order to demonstrate CPU scalability, any system with N CPUs that is tested with a number of threads K that is less than N is shown by having its entry in italics, ie. only K CPUs in that system are being used.


              Num    -------- CPU -------      Time
  System      CPUs   Type     MHz     L2     h:mm:ss.ss     Notes

  Onyx        24     R10000   195    1MB     0:01:57.00     4 threads x 6, throughput test (#1)
  Tezro        4     R16000  1000   16MB     0:02:24.39     4 threads
  Onyx        16     R10000   195    2MB     0:02:43.85     4 threads x 4, throughput test (#1)
  Origin350    4     R16000   700    4MB     0:03:11.23     4 threads  (32-CPU system,  only 4 CPUs used)
  Origin300    4     R14000   600    4MB     0:03:37.10     4 threads
  Tezro        4     R16000  1000   16MB     0:04:07.54     2 threads
  Origin300    4     R14000   500    2MB     0:04:13.26     4 threads
  Onyx2        4     R14000   500    8MB     0:04:29.24     4 threads
  Onyx2        4     R12000   400    8MB     0:05:16.42     4 threads
  Tezro        2     R16000   700    4MB     0:05:54.37     4 threads
  Origin2000   4     R12000   350    4MB     0:06:17.90     4 threads  (system has 4GB RAM, SSE+TRAM gfx with IO6G)
  Origin300    2     R14000   600    4MB     0:06:42.16     4 threads
  Octane       2     R14000   600    2MB     0:06:50.02     4 threads
  Tezro        4     R16000  1000   16MB     0:08:20.49     1 thread
  Fuel         1     R16000   900    8MB     0:08:39.93
  Origin300    1     R14000   600    4MB     0:09:33.12
  Octane       2     R12000   400    2MB     0:09:36.57     4 threads
  Fuel         1     R14000   800    4MB     0:09:45.99
  Onyx         4     R10000   195    2MB     0:10:55.67     4 threads
  Fuel         1     R16000   700    4MB     0:11:29.96
  Onyx         4     R10000   195    1MB     0:11:33.87     4 threads
  Octane       2     R12000   350    1MB     0:12:39.55     4 threads
  Origin300    1     R14000   600    4MB     0:13:04.09
  Fuel         1     R14000   600    4MB     0:13:17.18
  Octane       1     R14000   600    2MB     0:13:24.35
  Octane       2     R12000   300    2MB     0:13:30.56     4 threads
  Octane       1     R14000   550    2MB     0:15:25.44
  Octane       2     R10000   250    1MB     0:16:35.61     4 threads
  Octane       2     R12000   250    1MB     0:17:50.52     4 threads  (CPU mod, stage 1. No benefit until overclocked!)
  Onyx2        1     R12000   400    8MB     0:18:38.65
  Fuel         1     R14000   500    2MB     0:19:11.33
  Onyx2        2     R10000   195    4MB     0:19:46.46     2 threads
  Octane       1     R12000   400    2MB     0:20:04.25
  Onyx         2     R10000   195    1MB     0:20:11.07     2 threads
  Octane       2     R10000   195    1MB     0:20:22.90
  Octane       1     R12000   360    2MB     0:21:16.89
  Onyx         4     R4400    250    4MB     0:22:39.33     4 threads  [hinv]
  Octane       2     R10000   175    1MB     0:23:43.14     2 threads
  Octane       1     R12000   300    2MB     0:25:42.36
  O2           1     R7000    600  256K/1MB  0:26:01.79     [hinv]
  O2           1     R12000   400    2MB     0:30:04.04
  Octane       1     R10000   250    2MB     0:32:16.08
  Octane       1     R10000   250    1MB     0:33:00.92
  O2           1     R12000   300    1MB     0:36:07.57
  Onyx2        1     R10000   195    4MB     0:38:37.14
  O2           1     R7000    350    1MB     0:40:33.61
  O2           1     R12000   270    1MB     0:41:03.35
  O2           1     R10000   250    1MB     0:42:04.01
  Indigo2      1     R10000   195    1MB     0:43:19.61
  Octane       1     R10000   175    1MB     0:45:22.45
  Octane       1     R10000   195    1MB     0:45:50.43
  O2           1     R10000   225    1MB     0:47:21.28
  O2           1     R10000   195    1MB     0:51:46.01
  O2           1     R10000   175    1MB     1:00:54.02
  O2           1     R5200    300    1MB     1:01:22.76
  O2           1     R10000   150    1MB     1:13:20.93
  O2           1     R5000    200    1MB     1:23:11.92
  Indigo2      1     R4400    250    2MB     1:39:08.93
  O2           1     R5000    180    512K    1:40:08.31
  Indy         1     R5000    180    512K    1:52:18.56
  Indigo2      1     R4400    200    2MB     1:55:03.81
  Indy         1     R4400    200    1MB     2:02:30.20
  Indy         1     R5000    150    512K    2:03:14.05
  Indigo2      1     R4400    200    1MB     2:04:27.66
  O2           1     R5000    180    -       2:31:02.95
  Indy         1     R4400    150    1MB     2:34:36.26
  Indy         1     R5000    150    -       2:53:37.49
  Indy         1     R4600    133    512K    3:13:43.05
  Indy         1     R4000    100    1MB     4:10:05.47
  Indy         1     R4600    133    -       4:39:16.63
  Indy         1     R4600    100    -       5:05:16.15

Unlike the Alias render test, render time for this scene varies more or less directly with clock speed (except on older systems that have no L2 cache or a small L1 cache). It therefore scales very well with multiple CPUs on older Origin2000-based systems (ie. a linear speed increase), which suggests the render does not involve particularly complex memory access and/or does not benefit much from a larger L2. To put it another way, an Origin2000 or Onyx2 would scale nicely with more CPUs. Indeed, testing my older Onyx system confirms this idea: it scales nicely from 1 to 4 CPUs, and it also shows an almost linear speed increase when running multiple instances of the same render on a 24-CPU system, giving excellent overall throughput for rendering multiple frames.
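As a rough illustration of how scaling can be read off the results table, the sketch below computes speedup and per-thread efficiency for the quad-R16000/1000 Tezro, chosen only because it is the one system listed with 1, 2 and 4 thread results. The helper names (to_seconds, scaling) are made up for this example; the times are copied from the table.

```python
def to_seconds(stamp):
    """Convert an 'h:mm:ss.ss' time from the results table to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def scaling(times_by_threads):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread time."""
    baseline = to_seconds(times_by_threads[1])
    result = {}
    for threads, stamp in sorted(times_by_threads.items()):
        speedup = baseline / to_seconds(stamp)
        result[threads] = (speedup, speedup / threads)
    return result


# Quad-R16000/1000 Tezro rows from the results table, keyed by thread count.
tezro = {1: "0:08:20.49", 2: "0:04:07.54", 4: "0:02:24.39"}

for threads, (speedup, efficiency) in scaling(tezro).items():
    print(f"{threads} thread(s): {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

On these figures, two threads roughly halve the render time and four threads give about a 3.5x speedup. The same calculation can be applied to any other system in the table that has results at more than one thread count.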

By contrast, the Alias test scene scales better if run on a later Origin3000-series system (which includes Origin300, Onyx300, Onyx350, Fuel, Tezro, etc.). This is also why it is sometimes more efficient not to parallelise frame rendering too much for a complex scene, ie. simply render 1 frame per CPU/core instead. It depends on the scene. Smart render management software will adjust how it uses CPU resources based on the scene being processed, eg. I know of one movie company (MPC) which uses a system that does not use the 4th core in each quad-core Xeon for very complex renders (their renderfarm has 7000 cores).

If data can be reused between frames, though, good speedups can be obtained on shared-memory systems. These days not many companies bother doing this because it means employing people to write the custom software (ILM used to do this with 16-CPU Origin2K racks, giving much better results than would otherwise be the case).

The results show that older systems and multi-CPU systems are very effective for this kind of task, ie. a non-complex scene that can benefit from multiple CPUs. Usually the speedup from using more than one CPU is pretty much linear, no matter what CPUs a system has. Given Maya V6.5's thread limit of 4, this bodes well for rendering multiple frames for animations, ie. overall throughput as opposed to the speed of doing just one frame. Thus, for example, a 24-CPU Onyx is almost as fast as an 8 x 600MHz Origin300!
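The Onyx versus Origin300 comparison can be sanity-checked with a little arithmetic. Note the assumption: the table has no 8-CPU Origin300 entry, so the sketch below estimates one by supposing that an 8 x 600MHz system runs two 4-thread renders at once with no slowdown, using the quad Origin300/600 time. That estimate is hypothetical, not a measured result, and the variable and function names are invented for this example.

```python
def throughput_per_render(render_time, simultaneous_renders):
    """Effective seconds per render when several renders run at once."""
    return render_time / simultaneous_renders


# Table row: 24-CPU Onyx, 4 threads x 6, already a throughput figure.
onyx_24cpu = 117.00          # 0:01:57.00

# Table row: quad R14000/600 Origin300, single 4-thread render.
origin300_quad = 217.10      # 0:03:37.10

# ASSUMPTION (not measured): 8 CPUs run two such renders with no slowdown.
origin300_8cpu = throughput_per_render(origin300_quad, 2)

ratio = onyx_24cpu / origin300_8cpu
print(f"Onyx 24-CPU:          {onyx_24cpu:.2f} s per render")
print(f"Origin300 8-CPU est.: {origin300_8cpu:.2f} s per render")
print(f"Onyx takes {ratio:.2f}x as long per render")
```

Under that assumption the Onyx needs roughly 8% more time per render than the estimated 8-CPU Origin300, which fits the "almost as fast" description.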

NOTES:

The application startup time can be significant on older multi-CPU systems, partly masking the benefit of having extra CPUs. It's only a few seconds, but this can account for some variance between results. One could alleviate this by using a striped XLV volume to hold key application directories and/or data.

#1: These results refer to running n instances of the Render test at the same time, done by executing the n commands in n different shell windows (rlogged into from another system), acting upon copies of the scene file in different directories. I set up the command in each shell, then used the mouse middle button to paste a newline into each shell as fast as possible, so the n commands are activated all within the space of about 2 seconds at most. Since this means n copies of Maya have to be loaded at the same time, there is some variance in how long each instance takes to run, but the results are impressive; here is an example for six renders:

              Time
            mm:ss.ss

  Render 1: 11:41.09
  Render 2: 11:34.50
  Render 3: 11:32.88
  Render 4: 11:54.06
  Render 5: 11:48.55
  Render 6: 11:51.45

This is an average time of 11 min 44 sec, ie. an overall throughput time for 24 CPUs of 1 min 57 sec per render (the average divided by the six simultaneous renders). Very cool! 8)
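For anyone wanting to repeat the throughput calculation with their own timings, here is a minimal sketch of the arithmetic. It uses the six render times listed above; the helper name to_seconds is just an illustrative choice.

```python
def to_seconds(stamp):
    """Convert an 'mm:ss.ss' time to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


# The six simultaneous 4-thread renders on the 24-CPU Onyx.
renders = ["11:41.09", "11:34.50", "11:32.88", "11:54.06", "11:48.55", "11:51.45"]

times = [to_seconds(t) for t in renders]
average = sum(times) / len(times)

# Six renders finish in about the average time, so the effective
# time per render is the average divided by the number of renders.
throughput = average / len(times)

print(f"average:    {average:.2f} s")
print(f"throughput: {throughput:.2f} s per render")
```

The average comes to just under 704 seconds (about 11 min 44 sec) and the throughput figure to a little over 117 seconds (about 1 min 57 sec) per render.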

Unanswered questions:

Why does dual-175 Octane run better when only using 2 threads? All other dual-CPU Octanes give the best results with 4 threads.
Why is a 175MHz Octane faster than a 195MHz Octane?

Feedback is most welcome! :)

Octanes not yet tested:

  Dual-R12K/360
  Dual-R10K/225
  Single-R12K/270
  Single-R10K/225

Fuels not yet tested:

  R16K/800 (4MB)

O2s not yet tested:

  R7K/600

Indigo2s not yet tested:

  R10K/175
  R4K/175 (1MB)
  R4K/150 (1MB)
  R4K/100 (1MB)
  R4K/100
  R4600SC/133 (512K)
  R8000/75 (2MB)

Indys not yet tested:

  R4400SC/175 (1MB)
  R4000PC/100