CLRC Daresbury Laboratory, Daresbury, Warrington WA4 4AD.
FLITE3D is a finite-element code for solving the Euler equations governing airflow over whole aircraft; it comprises a suite of modules for obtaining Euler solutions of the flow over complex configurations. Parallelisation of FLITE3D for shared- and distributed-memory parallel systems has been undertaken as part of a collaboration between the Computational Engineering Group at Daresbury and the Sowerby Research Centre at British Aerospace.
Work has been carried out on the parallelisation of the steady Euler flow solver using standard mesh-partitioning techniques for a single-program multiple-data (SPMD) programming model, implemented in Fortran 77 and C with message passing via either MPI or PVM, selectable at compile time. The flow solver now reads in the partitioned mesh and performs the necessary communications at the boundaries between sub-domains. Fields are gathered onto the master processor for output, so no changes are needed in the post-processing stages; this also allows the flow solver to be stopped and restarted on a different number of processors if necessary. Table 1 shows timings on the Cray T3E/1200E, the IBM SP/WH2-375 and both the Pentium and Alpha Beowulf systems for two MPI-based FLITE3D benchmark studies: (i) a modest wing-body benchmark using 298,244 elements, and (ii) the more demanding F18 benchmark using 3,444,350 elements.
Table 1: Time in Wall Clock Seconds for the FLITE3D benchmarks on the Cray T3E/1200E, IBM SP/WH2-375 and Pentium and Alpha Beowulf Systems.
These benchmarks provide further compelling evidence of the value of the Beowulf clusters, and of the limited performance of the Cray EV56 node. Focusing on the larger F18 benchmark, we see that although the Cray scales well (a speedup of 81 on 128 nodes), the Pentium cluster outperforms the Cray T3E/1200E at all node counts: Beowulf II delivers 145% of the performance of the 32-node Cray T3E. This figure increases substantially on the more powerful CPUs of the IBM SP/WH2-375 and the Alpha cluster. The Linux Alpha Beowulf III outperforms the 32-node Cray T3E by a factor of 4.8, and the 32-node Alpha time is significantly faster than that recorded on 128 nodes of the Cray. The relative performance of the IBM SP/WH2 is also impressive: while slower than the Alpha cluster, the 32-CPU SP timing is again significantly faster than the 128-node Cray result. Although the code was originally developed for the Cray, these results strongly suggest that the individual node performance of the T3E is far from optimal.
We summarise in Table 1 the conclusions of the benchmarking exercise on the applications reported in a number of separate articles, by showing the percentage of the performance of a 32-node partition of the Cray T3E/1200E delivered by both the Pentium-based Beowulf II and Alpha-based Beowulf III systems (i.e. 100 × T(32-node Cray T3E) / T(32-node Beowulf)).
These figures suggest the following:
Table 1: Application Performance: Percentage of 32-node partition of the Cray T3E/1200E achieved by the 32-node Pentium Beowulf II and 32-node Alpha Beowulf III.