Crunching Quarks

A parallel particle simulation on the SGI Origin 2000
Mathias Kölsch

Domain and Motivation

Think of a particle as a molecule of water or as air. It has a position, a velocity and a mass. Multiple forces act on it, for example gravity, magnetic fields or wind. I wanted to simulate the behaviour of tiny clouds, steam, smoke or any other gaseous phenomina. Long range forces like mass attraction are neglectible in this domain, the real important force is a short distance repulsive force which becomes stronger on particles the closer they are to each other. These forces produce a sceenery similar to a pool table if there are only a few, far apart particles. As the number increases to a few thousand, condensed in a constraint space, they behave more like a fluid or smoke. My final goal is to generate animated steam that looks like smoke from an ancient steam train (just doesn't stink;-).
The animation and rendering of many natural phenomina are hot research areas these days. Virtually all of those processes are predestined for parallel processing for many different reasons. I will not contemplate about this since other people can do much better. In the following I want to describe my approach to gain the highest performance of a simulation of equal-mass particles.

Approaches to Overcome Computational Constraints

Several mechanisms are implemented to speed the execution time up.

Difficulties and Hindrances

My initial design used the graphical abilities of OpenGL to do the visual part of the project. This caused trouble with the Xserver connection from the Origin 2000 to my workstation, basically the server lacked the so called "glX extension". I switched to pure Xlib for the display then. Some of the GUI was already done in Tcl/Tk, so I kept this. The basic installation supported 32bit libraries only, so I compiled Tcl/Tk8.0 for 64bit and installed it on the crunch.
I still have some trouble with Tk, since the call of its main processing loop never returns. Only callbacks provide the means to perform own computations. There are ways to work around this, either by taking over the control and re-implementing the main loop or by new thread features in version 8.0. I neither had the time nor the joy to do any of these, so now I run one MPI thread entirely for the GUI and pre-display image processing (which is none so far, but could be some rendering for example). The other problem arises when the user wants to finish the program. Both MPI and Tk want to be exited gracefully, first MPI and then Tk. The missing link from Tk's callback to all MPI threads has to be implemented by the programmer, who tried some notification procedures, but none of them was constantly successful...

Evaluation and Benchmarks

Communication benchmarks: What is the overhead of setting up a send/recv connection with the non-blocking MPI functions I wanted to use ? How many floats can I send within this period of time ? A straight forward MPI benchmark revealed the following results:
overhead: init = 20us; per float: f = 0.2us or less; all on a heavily loaded crunch (32%+ load)
This means that we can send about 100 floats instead of initializing a new message. That justifies the approach with the fixed size buffer, since sending a size first would be too expensive in the average case. Furthermore it proves that a sliced layout of the processors is more effective than a cubed layout for some reasonable assumptions. Here is the break even point for communicating the border rows/columns between all processors:
Assumptions Cubed layout Sliced layout
  • 9 processors,
  • quadratic area (height == width),
  • not more than 10,000 (100x100) particles
  • and a maximum interaction distance such that no more than 100 particles affect each other on a cut over the whole length or width.
12 * init + 4*100 * f = 320us 8 * init + 8*100 * f = 320us
Since we did not account for the management of the more difficult layout of cubes, the slices are more effective (at least in terms of programming...).

Profiler: Running the program with ssrun -usertime and analysing the generated information files with prof produced two main results. The first one is not very surprising. It is a profile for the root process. This process mostly executes the Tk main loop, which does nothing but wait for user events. Only a relatively small fraction of time is spent in the display_cb or timer_cb routine. (A larger amount in the exit procedures, but that's due to the problem described above.)
In the second example, the number of samples as well as the total time spent in this process shows that in this simulator process was much more work done than in the root process. But the frightening figure is not much further down in line [5], 35% of the samples were taken in MPI_SGI_request_wait ! Actually, the percentage was even higher in some other samples. It shows one big problem with the application: It is constraint by the amount of communication and not the computation.
The amount of time spent in this procedure and in finishRecv (which more or less calls this MPI function) decreases a lot as soon as I take the approach mentioned above: Interleaved computation and communication. I will put up a result file as soon as crunch is up again. STL performance: All the functions for the manipulation of the grid datastructure account for a sum of 5-10% of total time spent in the process. This is quite a bit, but I doubt an array of pointers to particles (which is the leanest data structure to represent a grid) would have been much faster. For sure it would not have been as flexible and adaptable.

Speedup:
version # procs # particles sec per 100 updates ms per update speedup (to serial) speedup (to parallel 1+1)
serial 1 3312 15.5 155 1 1.68
parallel 1+1 3312 26 260 0.60 1
parallel 1+2 3312 13.8 138 1.12 1.88
parallel 1+3 3312 9.9 99 1.57 2.63
parallel 1+4 3312 7.9 79 1.96 3.29
parallel 1+5 3312 6.3 63 2.46 4.13
parallel 1+6 3312 5.5 55 2.81 4.73
serial 1 13152 48 480 1 1.67
parallel 1+1 13152 80 800 0.60 1
parallel 1+2 13152 40.6 406 1.18 1.97
parallel 1+4 13152 21.4 214 2.24 3.74
parallel 1+8 13152 11.6 116 4.14 6.90

Future Work

The ability to produce meaningful statistics about both the system domain and some implementation data must be included. In particular I want to monitor: The following features would add to the functionality of the program:
Last modified: Fri Jun 19 13:10:01 PDT 1998