| Assumptions | Cubed layout | Sliced layout |
| | 12 * init + 4 * 100 * f = 320 us | 8 * init + 8 * 100 * f = 320 us |
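As a quick consistency check, the two cost expressions can be read as a pair of linear equations in the same two unknowns. This reading is an assumption: init looks like a fixed per-message startup cost and f like a per-item transfer cost, but the table does not define them here. Under that assumption the two equations pin down both values:

```latex
\begin{aligned}
12\,\mathit{init} + 400\,f &= 320\ \mu\text{s} \\
 8\,\mathit{init} + 800\,f &= 320\ \mu\text{s}
\end{aligned}
\quad\Longrightarrow\quad
\mathit{init} = 100\,f, \qquad
f = 0.2\ \mu\text{s}, \qquad
\mathit{init} = 20\ \mu\text{s}
```

With these values the cubed layout spends 240 us on startup and 80 us on transfer, while the sliced layout spends 160 us on each, so both come out at the same 320 us.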
Profiler: Running the program with ssrun -usertime
and analysing the generated data files with prof
produced two main results. The first, a profile of the root
process, is not very surprising. This process mostly
executes the Tk main loop, which does nothing but wait for user
events. Only a relatively small fraction of the time is spent in the
display_cb or timer_cb routines. (A larger amount is spent in the exit
procedures, but that is due to the problem described above.)
The second profile is for a simulator process. The number of samples
and the total time spent in this process show that it did much more
work than the root process. The frightening figure, however, is not
much further down: in line [5], 35% of the samples were taken in
MPI_SGI_request_wait! The percentage was even higher in
some other profiles. This shows one big problem with the application:
it is constrained by the amount of communication, not by the
computation.
The amount of time spent in this procedure and in finishRecv
(which more or less just calls this MPI function) decreases a lot as
soon as I take the approach mentioned above: interleaving
computation and communication. I will put up a result file as soon
as crunch is up again.
STL performance: All the functions for the
manipulation of the grid data structure together account for 5-10%
of the total time spent in the process. This is quite a bit, but I
doubt that an array of pointers to particles (which is the leanest data
structure to represent a grid) would have been much faster. It
certainly would not have been as flexible and adaptable.
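To make the trade-off concrete, here is a minimal sketch of a grid built from STL containers. It is not the simulator's actual data structure: the names Particle, CellGrid, insert, cell and clear are hypothetical, and the 2D layout and clamping rule are assumptions made for illustration. Each cell is a std::vector of particle pointers, so it grows on demand, which is the kind of flexibility a fixed array of pointers would not offer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle type; a real simulator particle carries more state.
struct Particle {
    double x;
    double y;
};

// Hypothetical uniform grid: each cell holds pointers to the particles in it.
// Assumes nx >= 1, ny >= 1 and cell_size > 0.
class CellGrid {
public:
    CellGrid(std::size_t nx, std::size_t ny, double cell_size)
        : nx_(nx), ny_(ny), cell_size_(cell_size), cells_(nx * ny) {}

    // File a particle under the cell that contains its position.
    void insert(Particle* p) {
        cells_[to_index(p->y, ny_) * nx_ + to_index(p->x, nx_)].push_back(p);
    }

    // Particles currently filed under cell (ix, iy).
    const std::vector<Particle*>& cell(std::size_t ix, std::size_t iy) const {
        return cells_[iy * nx_ + ix];
    }

    // Empty every cell but keep the allocated capacity for the next update.
    void clear() {
        for (std::vector<Particle*>& c : cells_) c.clear();
    }

private:
    // Map a coordinate to a cell index, clamping positions outside the grid.
    std::size_t to_index(double coord, std::size_t n) const {
        if (coord <= 0.0) return 0;
        double cell = coord / cell_size_;
        if (cell >= static_cast<double>(n)) return n - 1;
        return static_cast<std::size_t>(cell);
    }

    std::size_t nx_, ny_;
    double cell_size_;
    std::vector<std::vector<Particle*> > cells_;
};
```

A typical update would call clear(), re-insert every particle, and then walk neighbouring cells. The per-cell vectors cost some bookkeeping on each push_back, which is consistent with grid handling showing up in a profile at all.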
Speedup:
| version | # procs | # particles | sec per 100 updates | ms per update | speedup (to serial) | speedup (to parallel 1+1) |
| serial | 1 | 3312 | 15.5 | 155 | 1 | 1.68 |
| parallel | 1+1 | 3312 | 26 | 260 | 0.60 | 1 |
| parallel | 1+2 | 3312 | 13.8 | 138 | 1.12 | 1.88 |
| parallel | 1+3 | 3312 | 9.9 | 99 | 1.57 | 2.63 |
| parallel | 1+4 | 3312 | 7.9 | 79 | 1.96 | 3.29 |
| parallel | 1+5 | 3312 | 6.3 | 63 | 2.46 | 4.13 |
| parallel | 1+6 | 3312 | 5.5 | 55 | 2.81 | 4.73 |
| serial | 1 | 13152 | 48 | 480 | 1 | 1.67 |
| parallel | 1+1 | 13152 | 80 | 800 | 0.60 | 1 |
| parallel | 1+2 | 13152 | 40.6 | 406 | 1.18 | 1.97 |
| parallel | 1+4 | 13152 | 21.4 | 214 | 2.24 | 3.74 |
| parallel | 1+8 | 13152 | 11.6 | 116 | 4.14 | 6.90 |
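The two speedup columns follow directly from the timing column: each is the baseline time divided by the time of the run in question, with either the serial run or the parallel 1+1 run as the baseline. A small sketch of that arithmetic, where speedup and round2 are illustrative helper names and not part of the project:

```cpp
#include <cassert>
#include <cmath>

// Speedup of a run relative to a baseline run with the same particle count:
// baseline time divided by the run's time (both in the same unit).
double speedup(double baseline_time, double run_time) {
    return baseline_time / run_time;
}

// Round to two decimal places, as in the table.
double round2(double value) {
    return std::round(value * 100.0) / 100.0;
}
```

For the 3312-particle runs, speedup(15.5, 13.8) is about 1.12 and speedup(26.0, 13.8) about 1.88, matching the 1+2 row. For 13152 particles on 1+8 processes, speedup(48.0, 11.6) is about 4.14 and speedup(80.0, 11.6) about 6.90. Because the table's timings are themselves rounded, a recomputed value can differ in the last digit: 15.5 / 5.5 rounds to 2.82 where the 1+6 row lists 2.81.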