Figure 23 illustrates the various components of the MIPSpro
compilation system. Components include the compiler back end and the
Parallel C and Fortran 77 analyzer. Scalar and superscalar
optimizations common to all the compilers are performed in the common
back end. All parallel processing optimizations are performed in the
parallel analyzer phases of the compilation system. Kuck and
Associates (KAI) technology is an integral part of the MIPSpro
compilation system.
The compilers support value-added extensions to:
A range of new optimization techniques takes maximum advantage of the new processor features such as on-chip and off-chip caches, pipelining, and superscalar chip architecture. The optimizations are applicable to a wide range of scientific and engineering applications and benefit both scalar and parallel performance of applications.
A rich assortment of command-line options can leverage different combinations of optimizations. In general, the optimizations are spread across the compilation system for better efficiency. For instance, high-level optimizations like loop interchange and loop unrolling are performed in the compiler front ends, whereas architecture-specific optimizations like software pipelining and automatic blocking are implemented in the common back end. All optimizations are fine-tuned to take advantage of the new system. Key optimizations:
Caches are based on the observation that most application programs exhibit some degree of locality of reference: programs access pieces of data that are "near" already requested data, in space and time. A program that accesses memory without regard to locality of reference might perform poorly because of a high number of cache misses. The compiler plays a crucial role in restructuring programs to reduce cache misses by interchanging loops, or by tiling or blocking loop nests so that data is consumed most efficiently by the processor. This arrangement is similar to traditional vectorizing compilers that restructured programs to fit in vector memory (registers) in pieces. In short, the compiler restructures programs so that a useful subset of the problem can fit into the cache. Thus the processor can work on patches of the original code and data from the cache memory (thereby avoiding main memory references) before moving on to the next patch.
Figure 24 illustrates the cache misses per FLOP as a function of different cache sizes for a broad range of scientific and engineering applications.
[1] From "Working Sets, Cache Sizes, and Node Granularity Issues for Large-scale Multiprocessors" by Jaswinder Pal Singh, Anoop Gupta, and Edward Rothberg, in Proceedings of the 20th International Symposium on Computer Architecture.
At cache sizes approaching 1 MB, the miss ratio becomes negligible.
This experiment illustrates how a combination of a moderately large
cache size and good compiler technology can reduce cache misses to a
negligible amount for a large (but not all-inclusive) class of big
scientific and engineering problems.
The compilers also provide a comprehensive set of standards-based comment directives that enable users to assist the compiler in the parallelization process. Users can use these directives to provide additional information to the compiler to boost parallel performance.
This parallelization technology means that developers who use a POWER Indigo2 to write applications destined for POWER Challenge and POWER Onyx systems can take full advantage of the POWER Challenge system architecture. A combination of automatic and user-assisted parallelization can lead to substantial improvements in the performance of many programs.
CHALLENGEcomplib is similar to scientific libraries provided by other supercomputing vendors like the Cray SCILIB, IBM ESSL and the Convex VECLIB. The library consists of two subcomponents: complib.sgimath and complib.slatec. These libraries include support for Basic Linear Algebra Subprograms (BLAS), Extended BLAS (Level 2 and Level 3), LAPACK, FFT, Convolutions, and selected routines from Cray's SCILIB and SLATEC from the Energy, Science and Technology Software Center.
Figure 25 illustrates three important parameters that influence the effectiveness of the optimization phases of the compilers.
This section discusses
Instruction-level optimizations are performed mostly in the common back end to extract maximum performance out of the R8000 superscalar processor. Common instruction-level optimizations include software pipelining, instruction scheduling, global instruction movement and register allocation.
Loop-level optimizations are performed in the early stages of the compilation process. Key loop-level optimizations include automatic loop blocking, loop interchange and loop unrolling.
Automatic loop blocking and loop interchange are memory hierarchy optimizations that take advantage of the cache architecture of the machine. Similarly, loop unrolling attempts to expose more instruction level parallelism to the optimizer for fine-grained parallelism.
Other optimizations such as loop distribution and loop fusion are important for efficient parallel execution. The compiler uses extensive analysis and transformation techniques to detect parallelism in programs. The compilation system supports a complete runtime environment for parallel execution. This runtime library is common to all the MIPSpro compilers.
Figure 26 illustrates the different kinds of parallelism exploited by the compilers.
The -OPT:roundoff=n flag is available to determine the extent to which optimizations are allowed to affect floating-point results, in terms of both accuracy and overflow/underflow behavior.
IEEE Option
The -OPT:IEEE_arithmetic=n flag specifies the extent to which optimizations should preserve IEEE floating-point arithmetic.
Reciprocal and Reciprocal Square Root
The flexible floating-point options provide users with a range of alternatives to trade off accuracy for speed. Thus applications can take advantage of fast MIPS IV instructions like recip and rsqrt. This is particularly significant for applications that were running on Crays, which have several fewer bits of precision than IEEE 64-bit. Heavy users of Cray and other non-IEEE compliant vector machines who have a need for speed can use these options.
In short, turning divides into multiplies by a reciprocal, and lifting the reciprocal calculation outside the loop, can yield substantial performance improvements. Consider the loop below. If IEEE conformance is required, the generated code must execute the n iterations in order, with a divide and an add in each iteration. If IEEE conformance is not required, the compiler can implement x/y as x * recip(y) (and sqrt(x) as x * rsqrt(x)), treating the divide as a(i) * (1.0/divisor). The reciprocal can then be calculated once before the loop is entered, reducing the loop body to a multiply and an add per iteration, which map to a single madd instruction on the MIPS R8000:
      INTEGER i, n
      REAL sum, divisor, a(n)
      sum = 0.0
      do i = 1, n
         sum = sum + a(i) / divisor
      enddo
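A minimal sketch of the transformed code the compiler is free to generate when the roundoff/IEEE options permit (the subroutine wrapper and the temporary rdiv are illustrative only; the compiler performs this rewrite automatically):

      subroutine sumdiv(a, n, divisor, sum)
      integer i, n
      real a(n), divisor, rdiv, sum
C     compute the reciprocal once, outside the loop
      rdiv = 1.0 / divisor
      sum = 0.0
      do i = 1, n
C        one multiply and one add per iteration (a single madd on the R8000)
         sum = sum + a(i) * rdiv
      enddo
      return
      end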
For example, a loop from WESDYN, a computational chemistry application developed at Wesleyan University, is representative of loops frequently encountered in the computation-intensive portions of chemistry codes. The loop contains reciprocal and square-root operations that are candidates for these faster instructions:
      do 200 i = 1, n
         r2(i) = 1 / ( xx(i3) ** 2 + xx(i3 + 1) ** 2 + xx(i3 + 2) ** 2 )
         r1(i) = sqrt( r2(i) )
         i3 = i3 + 3
200   continue
recip and rsqrt are also important to graphics applications that use the reciprocal and reciprocal square root operations in important computational parts.
Fast Intrinsics
The MIPSpro compilation system supports a fast version of intrinsic library functions. Selected mathematical functions from the standard mathematical library are hand-coded in assembly language to take maximum advantage of the MIPS IV instruction set of the MIPS R8000 architecture. Specifically, frequently used intrinsics like the transcendental functions (log, exp, power, sin, cos, cis and tan) are hand-coded in assembly and are part of a separate fast mathematical library.
The fast library can be invoked with the -lfastm command-line flag. The accuracy of all the hand-coded transcendental functions (except tan) is better than 2 ULPs (units in the last place).
Traditional global optimizers avoid moving instructions in cases that might cause them to be executed along control flow paths where they would not have been in the original program. The MIPSpro global optimizer does perform such code motion, called speculative code motion, because the instructions moved are executed on the speculation that they will actually prove useful. This kind of aggressive code motion is unique to the MIPSpro compilers. By default, global code motion (GCM) is very conservative in its speculation. However, a number of options are available to control the degree of speculation.
Valid speculative code motion must normally avoid moving operations that might cause runtime traps. As a result, turning off certain traps at runtime enables more code motion. Thus, applications that can ignore floating-point exceptions in certain segments of the program can take advantage of this optimization. Similarly, applications that can ignore memory access exceptions can also take advantage of this feature. For example, in the SPECfp92 benchmark 013.spice2g6, speculative code motion can be enabled for a critical loop by turning off floating-point and memory exceptions around that loop, resulting in a healthy performance gain.
Figure 27 illustrates the different kinds of speculations available.
What follows is a brief description of each of these optimizations
(with appropriate user-level flag control):
GCM normally does not move instructions to basic blocks that are already using most of the instruction execution resources available, since doing so will likely extend the execution time of the block. This option minimizes that bias, which often helps floating-point-intensive code.
A form of speculation that is often very effective is called bottom loading. It moves instructions from the top of a loop's body to both the block before the loop (for the first iteration) and the end of the loop body (which executes them at the end of one iteration so that they are ready early for the next iteration). Doing this, however, means that the instructions are executed one or more extra times in the last iteration(s) of the loop. If the instructions moved are loading elements of an array, extra accesses might occur beyond the end of the array. This option permits such out-of-bounds array references by padding the arrays to prevent the out-of-bounds references from causing memory exceptions.
This option allows speculative motion of loads of two kinds, both involving pointer usage. The first allows motion of loads through pointers that may be NULL. The second form moves a reference like *(p + n), for a small integer n, to a block that already contains a reference to *p. The assumption is that if p is a valid address, p+n will be, too. In the example below, the load of p->next->val can be moved before the if through a potentially NULL pointer, as can the load of p->final_val, which is offset by a small amount from the p->next reference.
    if ( p->next != NULL ) {
        sum = sum + p->next->val;
    } else {
        sum = sum + p->final_val;
    }
In software pipelining, iterations of loops are continuously initiated at constant intervals without having to wait for preceding iterations to complete. That is, multiple iterations, in different stages of computation, are in progress simultaneously. The steady state of this pipeline constitutes the loop body of the object code. Consider this simple DAXPY loop:
      do i = 1, n
         v(i) = v(i) + X * w(i)
      enddo
On the MIPS R8000 architecture, this loop can be coded in assembly language as two load instructions followed by a multiply-add instruction and a store. Figure 28 shows this execution order constraint.
This simple schedule completes one iteration of the loop body
in five machine cycles. Considering that the R8000 processor allows
up to two memory operations and two floating-point operations in the
same cycle, the instructions above are initiated by the machine as
outlined in Table 8. (This schedule completes one iteration of the
loop in five machine cycles):
Cycle Count   Memory Operations          Floating-point Operations
0             load &v(i); load &w(i)     madd X, w(i), v(i)
1
2
3
4             store &v(i)
TABLE 8 Simple Schedule for Loop Body of DAXPY
If the same loop were unrolled by a factor of four, the loop body would look like:
      do i = 1, n, 4
         v(i)   = v(i)   + X * w(i)
         v(i+1) = v(i+1) + X * w(i+1)
         v(i+2) = v(i+2) + X * w(i+2)
         v(i+3) = v(i+3) + X * w(i+3)
      enddo
The unrolled loop allows for instructions from four independent iterations to execute in parallel, thereby increasing the instruction level parallelism and hence the performance of this loop. Table 9 illustrates the sequence of loads, madds and stores for this unrolled loop. (This schedule completes four iterations of the loop in eight machine cycles)
Cycle Count   Memory Operations              Floating-point Operations
0             load &v(i);   load &w(i)       madd X, w(i), v(i)
1             load &v(i+1); load &w(i+1)     madd X, w(i+1), v(i+1)
2             load &v(i+2); load &w(i+2)     madd X, w(i+2), v(i+2)
3             load &v(i+3); load &w(i+3)     madd X, w(i+3), v(i+3)
4             store &v(i)
5             store &v(i+1)
6             store &v(i+2)
7             store &v(i+3)
TABLE 9 Schedule for Loop Body of Four-way Unrolled DAXPY
The schedule above completes four iterations of the loop body in eight cycles, improving performance to one quarter of the R8000's peak megaFLOPS rate. However, this schedule still leaves room for improvement. Each store has to wait three cycles for its corresponding madd instruction to complete, forcing the four store operations to be initiated in different cycles. In other words, the schedule above does not take advantage of the R8000's ability to do two stores in one cycle.
By using software pipelining, the loop instructions can be initiated at constant intervals such that each iteration executes a combination of loads and stores from different iterations. In the DAXPY example, this can result in a schedule that would complete two iterations every three cycles to realize significant performance improvements over the two previous schedules.
Table 10 illustrates the machine schedule for the software-pipelined version of the above loop. To prepare properly for entry into such a loop, a prolog section of code is added that sets up the registers for the first few stores in the main loop. Similarly, to exit the loop properly, an epilog section is added that performs the final stores. Any preparation of registers needed for the epilog is done in the cleanup section of the code. (The main loop completes four different iterations in six machine cycles.)
Cycle Count   Memory Operations              Floating-point Operations
PROLOG:
              t1 = load &v(i);               t7 = madd X, w(i), v(i)
              t2 = load &w(i)
              t4 = load &v(i+1);             t8 = madd X, w(i+1), v(i+1)
              t5 = load &w(i+1)
MAINLOOP:
0             t1 = load &v(i+2);             t3 = madd X, w(i+2), v(i+2)
              t2 = load &w(i+2)
1             t4 = load &v(i+3);             t6 = madd X, w(i+3), v(i+3)
              t5 = load &w(i+3)
2             store t7; store t8             beq CLEANUP
3             t1 = load &v(i+4);             t7 = madd X, w(i+4), v(i+4)
              t2 = load &w(i+4)
4             t4 = load &v(i+5);             t8 = madd X, w(i+5), v(i+5)
              t5 = load &w(i+5)
5             store t3; store t6             bne MAINLOOP
EPILOG:
              store t7; store t8             br ALLDONE
CLEANUP:
              t7 = t3; t8 = t6               br EPILOG
ALLDONE:
TABLE 10 Schedule for Software-pipelined DAXPY
In the main loop, the code completes four different iterations in six cycles, which is better than the previous two schedules. Table 11 illustrates the performance improvement in the three cases.
Scheduling Type       Performance                Speedup
Simple scheduling     1 iteration in 5 cycles    1.0
4-way unrolling       4 iterations in 8 cycles   2.5
Software pipelining   4 iterations in 6 cycles   3.3

TABLE 11 DAXPY Speedup Factors over the Simple Schedule
As Table 11 suggests, software pipelining can make a huge difference in the performance of compute-intensive applications.
Another advantage of software pipelining is its ability to generate compact code, compared to transformations like loop unrolling that can increase the program size by a noticeable amount. Compact code prevents instruction cache penalties resulting from increased code size. The most important aspect of software pipelining is its ability to generate near-optimal code for loops.
Software pipelining in the MIPSpro compilers is based on the concept of "modulo iteration-interval scheduling", a technique for software pipelining innermost loops. This is a proven technique for generating code with near-optimal performance. In effect, this technique sets a performance goal for each loop prior to scheduling and then attempts to achieve this goal by taking into account resource constraints and the program data reference constraints.
Typical benchmark programs, like the SPECfp92 benchmarks, illustrate substantial improvements in performance when compiled with software pipelining. Figure 29 summarizes the performance and cost effectiveness of software pipelining as implemented in the MIPSpro compilers.
Compilers must normally be conservative in optimizing memory references involving pointers (especially in languages like C), since aliases (i.e., different ways of accessing the same memory location) may be very hard to detect. Consider the following example:
float x[100];
float *c;

void f4(n, p, g)
int n;
float *p;
float *g;
{
    int i;
    for (i = 0; i < n; i++) {
        p[i] = g[i] + c[i] + x[i];
    }
}
To be safe, the compiler must assume that the references through p, g, and c may all be aliased to each other. This in turn precludes aggressive loop optimizations by the optimizer.
The MIPSpro compilers alleviate this general problem of aliasing by providing users with a number of different options for specifying pointer aliasing information to the compiler. The compiler can use this information to perform aggressive optimizations in the presence of pointers and thereby achieve healthy performance improvements.
Table 12 illustrates the various user-controlled options that can be useful for improving runtime performance.
-OPT:alias=any
    The compiler assumes that any pair of memory references may be aliased
    unless proven otherwise. This is the default setting and reflects a safe
    assumption by the compiler.

-OPT:alias=typed
    The compiler assumes that any pair of memory references of distinct types
    cannot be aliased. For example:

        void dbl(i, f)
        int *i;
        float *f;
        {
            *i = *i + *i;
            *f = *f + *f;
        }

    The compiler assumes that i and f point to different memory locations
    because they are of different types, and can therefore produce an
    overlapped schedule for the two calculations.

-OPT:alias=unnamed
    The compiler assumes that pointers never point to named objects. In the
    following example, the compiler assumes that the pointer p cannot point to
    the object g, and produces an overlapped schedule for the two calculations.
    This is the default assumption for the pointers implicit in Fortran dummy
    arguments according to the ANSI standard.

        float g;

        void dbl2(p)
        float *p;
        {
            g = g * g;
            *p = *p + *p;
        }

-OPT:alias=restrict
    The compiler assumes a very restrictive model of aliasing in which no two
    pointers ever point to the same memory area. This is a strong assumption,
    but when applied to specific, well-controlled, valid cases it can produce
    significantly better code.

        void dbl3(p, q)
        int *p;
        int *q;
        {
            *p = *p + *p;
            *q = *q + *q;
        }
TABLE 12 User-assisted Pointer Optimizations
Loop Interchange
Loop interchange is a memory hierarchy optimization that modifies the data access pattern of nested loops to match the way data is laid out in memory. Consider the typical Fortran 77 loop nest shown on the left in Figure 30. Fortran stores array elements in column-major order (not row-major order, as most programming languages do), so each iteration of the i loop steps across contiguous elements of A, while each iteration of the j loop steps over an entire column of A. Assuming that A is dimensioned as A(M,N), each iteration of the j loop steps across M elements of A. If M is larger than the machine's page size, each iteration of the j loop touches a new page, exhibiting bad locality. As a result, the program may spend a considerable portion of its time moving data between memory and the cache system, and exhibit poor performance.
This problem of the innermost loop having a large stride is
eliminated by interchanging the two loops, as shown in the right-hand
example in Figure 30. Now the innermost loop runs across contiguous
elements, minimizing page faults and cache misses. Depending on the
dimensions of the arrays, the transformed loop can exhibit
significantly better runtime performance.
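A representative sketch of the transformation described above (the loop body and the array B are illustrative; the nest stands in for the one depicted in Figure 30):

C     Before interchange: the inner j loop strides through A a whole
C     column (M elements) apart on every iteration
      do i = 1, M
         do j = 1, N
            A(i,j) = A(i,j) + B(i,j)
         enddo
      enddo

C     After interchange: the inner i loop walks contiguous elements of
C     each column, minimizing page faults and cache misses
      do j = 1, N
         do i = 1, M
            A(i,j) = A(i,j) + B(i,j)
         enddo
      enddo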
Another advantage of loop interchange is its ability to move parallelism to outer levels of a nested loop. Before interchange, the innermost j loop can be parallelized by the compiler. However, the amount of work performed within the j loop may not be sufficient for efficient parallel execution. Once loop interchange is performed, the parallelism moves to the outer loop thereby increasing the amount of work in the loop. In effect the compiler is able to parallelize a larger region of code for better performance.
Loop Distribution
The compiler performs loop distribution to partition a single loop into multiple loops. Loop distribution has the advantage of making a loop's working set better fit the paging structure of the underlying machine. It can also expose more parallelism to the compiler. By distributing the loop into a sequential loop and a parallel loop, the compiler is able to efficiently execute parts of the original loop in parallel. The multiple loops are usually smaller (in body size) compared to the original loop and are more amenable to software pipelining. Figure 31 illustrates this transformation:
The original loop (left) cannot be parallelized because of the data dependency arising from the reference to array D. However, after distribution the first i loop can be parallelized and the ii loop software pipelined for performance.
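A minimal sketch of this kind of distribution, assuming a recurrence on D is what blocks parallelization (the statement bodies are illustrative, not the contents of Figure 31):

C     Original loop: the recurrence on D serializes the entire loop
      do i = 2, n
         A(i) = B(i) + C(i)
         D(i) = D(i-1) + A(i)
      enddo

C     After distribution: the first i loop has no loop-carried dependence
C     and can be parallelized; the ii loop keeps the recurrence but can
C     still be software pipelined
      do i = 2, n
         A(i) = B(i) + C(i)
      enddo
      do ii = 2, n
         D(ii) = D(ii-1) + A(ii)
      enddo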
Loop Fusion
Loop fusion, the inverse of loop distribution, involves "jamming" two originally separate loops into a single loop. Figure 32 illustrates this transformation.
Loop fusion can be used in many cases to combine two loops, each of
which utilizes a large portion of the page space of the machine. The
fused loop can have a working set that is smaller than the sum of the
two individual loops, improving data reuse, and permitting better
register allocation. In Figure 32, after loop fusion the elements of
array A are immediately available for use by the second
statement in each iteration. The optimizer recognizes the reuse of
elements of A, and keeps them in registers for the
operation.
Loop fusion can also increase the size of loops to improve the efficiency of parallel execution. By combining two small loops into a bigger loop, fusion sets the stage for profitable parallelization of the bigger loop. In Figure 32, the two individual loops may be too small to overcome the overheads of parallelization. However, the combined loop after fusion may be large enough to realize performance improvements from parallelization.
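A minimal sketch of the kind of fusion described above (the arrays and statements are illustrative, not the contents of Figure 32):

C     Before fusion: A is written in the first loop and reloaded from
C     memory in the second
      do i = 1, n
         A(i) = B(i) + C(i)
      enddo
      do i = 1, n
         D(i) = A(i) * E(i)
      enddo

C     After fusion: each A(i) is still in a register when the second
C     statement uses it, and the single larger loop is a better
C     candidate for parallelization
      do i = 1, n
         A(i) = B(i) + C(i)
         D(i) = A(i) * E(i)
      enddo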
Loop Blocking
Loop blocking is an effective technique available in the MIPSpro compilers for optimizing the performance of the memory hierarchy for numerical algorithms. The reason for blocking is that entire matrices typically do not fit in the fast data storage (for example, the register file or cache) of the machine. Blocking decomposes matrix operations into submatrix operations, with a submatrix size chosen so that the operands can fit in the register file or cache. Since elements of a submatrix are reused in matrix operations, this reduces slow memory accesses and speeds up the computation.
Figure 33 (courtesy of Dr. James Winget, Chief Scientist, Silicon Graphics) shows the change in the memory access pattern as a result of loop blocking.
The before picture in Figure 33 references four sets of
consecutive addresses over a certain period of time before repeating
the access pattern. Blocking restructures the loop to reflect the
memory access pattern illustrated in the after picture. Here,
subsets of all four data sets reside in cache and are accessed in a
shorter period of time. This arrangement enables useful computation
to be performed efficiently on a cache resident subset of the
original dataset before moving on to the next subset. Performance
improvements come from reduced processor-to-main memory traffic as a
result of efficient cache utilization.
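As a concrete illustration of the idea (not a reproduction of Figure 33), a matrix multiply can be blocked roughly as follows; the block size nb is an illustrative parameter that the compiler would choose so that the submatrices being reused fit in the cache:

C     Unblocked matrix multiply: for large N, the submatrices of A and B
C     are evicted from the cache before they can be reused
      do j = 1, N
         do k = 1, N
            do i = 1, N
               C(i,j) = C(i,j) + A(i,k) * B(k,j)
            enddo
         enddo
      enddo

C     Blocked version: the j and k loops are tiled into nb-by-nb blocks
C     so that the data reused within a block stays cache resident
      do jj = 1, N, nb
         do kk = 1, N, nb
            do j = jj, min(jj+nb-1, N)
               do k = kk, min(kk+nb-1, N)
                  do i = 1, N
                     C(i,j) = C(i,j) + A(i,k) * B(k,j)
                  enddo
               enddo
            enddo
         enddo
      enddo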
Loop Unrolling
Loop unrolling is a fundamental transformation that is a basic component of other restructuring techniques like software pipelining and unroll-and-jam. Unrolling the outer loops of nested loop regions is usually important for good use of the memory hierarchy. Unrolling inner loops improves the usage of the floating-point registers and provides more room for instruction overlap. Unrolling also decreases the trip count of loops, thereby reducing the loop's conditional branch overhead.
The number of times a loop should be unrolled (unrolling factor) is determined by the compiler, based on numerous considerations including the amount of data referenced in the loop body, the data access dependencies, the availability of registers, the size of data cache, and the purpose of unrolling. Figure 34 illustrates the process of unrolling.
In Figure 34, the unrolled loop exposes a lot of instruction level
parallelism as the different assignments in the unrolled loop can be
overlapped for performance.
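For reference, a simple hand-written sketch of 4-way unrolling (the arrays x and y, the scalar s, and the cleanup loop for trip counts that are not a multiple of four are illustrative; the compiler chooses the actual unrolling factor):

C     Original loop
      do i = 1, n
         y(i) = y(i) + s * x(i)
      enddo

C     Unrolled by four: four independent multiply-adds per iteration
C     and one conditional branch per four elements
      m = n - mod(n, 4)
      do i = 1, m, 4
         y(i)   = y(i)   + s * x(i)
         y(i+1) = y(i+1) + s * x(i+1)
         y(i+2) = y(i+2) + s * x(i+2)
         y(i+3) = y(i+3) + s * x(i+3)
      enddo
C     Cleanup loop for the leftover iterations
      do i = m + 1, n
         y(i) = y(i) + s * x(i)
      enddo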
Loop Multiversioning
Multiversioning is a technique employed by the compiler to improve the efficiency of parallel execution. Many loops, especially in Fortran, use symbolic bounds as trip counts that cannot be determined at compile time. The compiler can, however, generate multiple versions of the original code at compile time; the resulting program then executes the appropriate version depending on the loop's trip count as determined at execution time.
Multiversioning improves the overall efficiency of parallel execution
by using accurate information at program execution time.
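A hedged sketch of what multiversioned code is conceptually equivalent to (the compiler generates the test and both versions automatically; the threshold of 1000 and the loop body are illustrative, and the c$doacross syntax is taken from the MP directive summary in Table 16):

C     Parallel version, chosen at run time when the trip count is large
C     enough to amortize the parallel startup overhead
      if (n .gt. 1000) then
c$doacross local(i)
         do i = 1, n
            a(i) = b(i) + c(i)
         enddo
      else
C        Serial version for small trip counts
         do i = 1, n
            a(i) = b(i) + c(i)
         enddo
      endif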
Pattern Matching
The compiler back end recognizes patterns of standard computational kernels like SAXPY, DAXPY, and other LINPACK-style operations, and generates optimal code for such code sequences. Pattern matching is also used to perform basic dependence analysis on loops to improve the effectiveness of software pipelining.
The directives provide comprehensive support for specifying and controlling the degree and dynamics of parallel execution. For example, the directives can be used to specify conditional parallelism to ensure that parallelism occurs only under certain dynamic conditions. In the two examples shown in Figure 36, the if clause in the directive specifies the conditions for parallel execution. In the Fortran 77 example, the loop executes in parallel only if the value of jmax is greater than 1000. Similarly, the C example executes in parallel only if the value of max is greater than 1000.
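Since Figure 36 is described rather than reproduced here, the Fortran 77 case presumably looks something like the following sketch (the array names and loop body are illustrative; the if clause syntax follows the c$doacross summary in Table 16):

c$doacross if (jmax .gt. 1000), local(j)
      do j = 1, jmax
         a(j) = a(j) + s * b(j)
      enddo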
The compilers automatically detect program parallelism by employing
the technique of data dependency analysis. Both Fortran 77 (-pfa
flag) and C (-pca flag) compilers have this capability. Data
dependence information is also used by a number of loop
transformations listed in previous sections of this chapter.
The compilers provide command-line options to direct the inlining of the specified list of subroutines. Flags are available to limit inlining to routines that are referenced in deeply nested loops, where the reduced call overhead or enhanced optimization is multiplied. Options exist to perform interprocedural inlining, whereby instances of routines can be inlined across different files.
One drawback of unlimited inlining is its tendency to increase the code size of the resulting program. Uncontrolled replacement of function or subroutine calls with the actual body of the called routine can cause "code explosion," which in turn increases compile time and reduces the effectiveness of other optimizations. The technique of Interprocedural Analysis (IPA) provides the benefits of inlining without necessitating inlining the code.
IPA tracks the flow of control and data across procedure boundaries and uses this information to drive the optimization process. IPA is particularly useful for interprocedural constant propagation, which makes constants passed into a routine, such as loop bounds, known at compile time. This information can then be used to drive optimization decisions. Figure 37 shows an example.
In this example, the value of n is used as a loop bound within
the subroutine foo. In the absence of IPA, the compiler
assumes that the value of n may be modified inside the call to
subroutine foo.
Moreover, the value of n on entry to subroutine foo will not be known at compile-time. As a result, the compiler must generate multiversion code when parallelizing the j loop within foo. However, with IPA turned on, it is possible to know (at compile time) the value of n on entry to foo at the call site. This information is then used by the automatic parallelizer to decide how to parallelize the j loop in subroutine foo. If the value of n is small, the loop may not be profitably parallelized. On the other hand, if the value of n is large, the loop gets parallelized for profitable execution.
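The situation is roughly the following (a sketch standing in for Figure 37; the program name, array, and literal bound 1000 are illustrative, while foo and n come from the text):

      program main
      real a(1000)
C     the constant 1000 is visible at the call site; IPA can propagate
C     it into foo
      call foo(a, 1000)
      end

      subroutine foo(a, n)
      integer n, j
      real a(n)
C     without IPA, n is unknown here at compile time, so the parallelized
C     j loop must be multiversioned; with IPA, n is known to be 1000
      do j = 1, n
         a(j) = a(j) * 2.0
      enddo
      end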
In short, IPA provides a mechanism to propagate information across procedure boundaries without having to inline calls, thereby increasing the effectiveness of all optimizations.
Before examining the degree of Cray compatibility provided by the MIPSpro compilers, it is worthwhile to understand the key features of Cray's compilation system. Cray extensions fall into three categories:
Cray Fortran Extension / Corresponding Silicon Graphics Functionality

real / real*8
    Replace real with real*8, or turn on the command-line option -r8, which
    promotes all real data types to 64-bit quantities.

integer / integer*8
    Replace integer declarations with integer*8, or turn on the command-line
    option -i8 to promote all integer declarations to 64-bit quantities. The
    -i8 flag provides a convenient way for users to selectively promote
    integers to be 64-bit quantities. This enables the default integer data
    type size to remain a 32-bit quantity, thereby preventing integer data
    cache conflicts that may result from larger integer quantities. Thus, the
    -i8 flag provides a much more flexible, performance-oriented approach to
    managing 64-bit integer data types. Library routines called from user
    applications should conform to the integer*8 format.

logical / logical*8
    Replace logical with logical*8 if a 64-bit quantity is required.

complex / complex*32
    Use complex*32 to get quad-precision support.
TABLE 13 Cray Fortran 77 Data Type Extensions
Cray Fortran 77 also supports Fortran 90-style indexed array syntax. Table 14 illustrates the equivalent Silicon Graphics functionality for handling Cray Fortran 77 array extensions. MIPSpro Fortran 77 supports the Fortran 90-style array syntax and also supports dynamic allocation of arrays.
Cray Program with Indexed and                 Equivalent Silicon Graphics Program
Vector-valued Array Section Selectors

dimension A(10), B(10), C(5), TEMP(5)         dimension A(10), B(10), C(5), TEMP(5)
dimension X(5,5), Y(2:10)                     dimension X(5,5), Y(2:10)

A = B                                         do i = 1, 10
                                                 A(i) = B(i)
                                              enddo

C = A(3:7)                                    do i = 1, 5
                                                 C(i) = A(i+2)
                                              enddo

B(1:5) = X(3, 1:5)                            do i = 1, 5
                                                 B(i) = X(3, i)
                                              enddo

A = sin(B)                                    do i = 1, 10
                                                 A(i) = sin(B(i))
                                              enddo

TEMP = A(C)                                   do i = 1, 5
                                                 TEMP(i) = A(C(i))
                                              enddo

TEMP = X(1, C)                                do i = 1, 5
                                                 TEMP(i) = X(1, C(i))
                                              enddo

TEMP = A(C+C)                                 do i = 1, 5
                                                 TEMP(i) = A(C(i) + C(i))
                                              enddo
TABLE 14 Cray Fortran 77 Array Syntax
Cray Fortran Vectorization Directive / Silicon Graphics Corresponding Functionality

cdir$ ivdep
    This directive informs the Cray compiler that there is no data dependency
    in the loop that follows, thus ensuring complete vectorization of the loop.
    The MIPSpro compilers accept this directive as is. It tells the compiler to
    be less strict when deciding whether it can exploit some form of parallelism
    between loop iterations. By default, the compiler makes conservative
    assumptions about possible memory reference conflicts; the directive allows
    the compiler to be more aggressive about such assumptions. Thus, superscalar
    optimizations like software pipelining can benefit from recognizing this
    directive.

cdir$ nextscalar
    This directive informs the compiler to generate only scalar (nonparallel)
    instructions for the loop that immediately follows. The MIPSpro compilers
    accept this directive as is and adhere to its original meaning.
TABLE 15 Cray Fortran 77 Vectorization Directives
Cray's Parallel Processing Capability:
    The "cmic$ parallel" directive is used to declare a parallel region in
    Cray Fortran.

        cmic$ parallel [ if (exp) ]
        cmic$ shared ( var, ... )
        cmic$ private ( var, ... )
        ... parallel code ...
        cmic$ endparallel

Silicon Graphics Equivalent:
    The PCF directive "c*KAP* parallel region" provides equivalent
    functionality.

        c*KAP* parallel region [ if (exp) ]
        c*KAP* shared ( var, ... )
        c*KAP* local ( var, ... )
        ... parallel code ...
        c*KAP* end parallel region

    In both cases, the parallel code includes loops with independent
    iterations, adjacent independent blocks of code, critical sections, or a
    combination of all these cases.

Cray's Parallel Processing Capability:
    The "cmic$ doall" directive is used for specifying loop-level parallelism
    in Cray Fortran.

        cmic$ doall [ if (exp) ]
        cmic$ shared ( var, ... )
        cmic$ private ( var, ... )
        cmic$ [ single | chunksize(n) | numchunks(n) | guided | vector ]
        do i = n1, n2
           ...
        enddo
        cmic$ endparallel

Silicon Graphics Equivalent:
    This functionality can be replicated by using the Silicon Graphics
    Fortran MP "c$doacross" directive.

        c$doacross [ if (exp) ]
        c$&  shared ( var, ... ), local ( var, ... )
        c$&  lastlocal ( var, ... )
        c$&  [ mp_schedtype=type, chunk=n ] [ savelast ]
        do i = n1, n2
           ...
        enddo
TABLE 16 Cray Parallel Processing Functionality
To summarize, the MIPSpro compilers have sufficient functionality built into them to facilitate smooth migration of Cray code onto Silicon Graphics platforms. Important Cray directives like cdir$ ivdep are recognized and processed to improve the performance of superscalar optimizations like software pipelining, while other vector-specific directives like cdir$ vector are essentially ignored by the MIPSpro compilers. In terms of general-purpose optimizations, the MIPSpro compilers favor superscalar optimizations like software pipelining and global speculative code motion, whereas the Cray compilers attempt vectorization for best performance.