The test programs are from the nofib parallel benchmark suite. I did five runs of each in succession and took the average elapsed wall-clock time. The machines were lightly loaded, though I can't see what's going on in the second slice of my Dept T2 (probably nothing).
- My desktop machine. Pentium 4 (Prescott 2M) CPU, 1 core, 2 threads per core, 3GHz clock, 1GB RAM.
- My MacBook Air. Core2 Duo (Merom) CPU, 2 cores, 1 thread per core, 1.6GHz clock, 2GB RAM.
- My Dept T2. UltraSPARC T2 (Niagara 2) CPU, a 4-core slice of an 8-core machine, 8 threads per core, 1.4GHz clock, 1GB RAM in the slice.
- The haskell.org T2. UltraSPARC T2 (Niagara 2) CPU, 8 cores, 8 threads per core, 1.4GHz clock, 32GB RAM.
Parfib is an embarrassingly parallel micro-benchmark. It's a good test of the parallel run-time system (RTS), because if this program doesn't parallelize then nothing will.
Note that this is a log-log graph. The fact that the initial plots for the T2s are straight lines with slope -1 means that runtime is inversely proportional to the number of threads: double the threads, half the runtime. With parfib, it's easy to divide the work into uniformly sized chunks (grains), and each thread only needs to talk to its parent and children, so there is little cross-thread communication.
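The grain-splitting idea can be sketched as follows. This is a minimal sketch, not the exact nofib source; it assumes `par` and `pseq` from the `parallel` package, and the threshold value is an arbitrary choice for illustration:

```haskell
import Control.Parallel (par, pseq)

-- Spark the first recursive call while evaluating the second.
-- Below the threshold we fall back to the sequential version, so
-- the grains don't become too small to be worth sparking.
parfib :: Int -> Int -> Integer
parfib threshold n
  | n < threshold = seqfib n
  | otherwise     = x `par` (y `pseq` x + y)
  where x = parfib threshold (n - 1)
        y = parfib threshold (n - 2)

seqfib :: Int -> Integer
seqfib n
  | n < 2     = 1
  | otherwise = seqfib (n - 1) + seqfib (n - 2)

main :: IO ()
main = print (parfib 11 20)
```

Raising the threshold gives fewer, larger grains; lowering it exposes more parallelism at the cost of more spark overhead, which is exactly the trade-off the plots show.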
On the T2, each core supports 8 hardware threads and has 2 ALUs and 1 load/store unit. However, because only one instruction from a given thread is issued at a time, a single thread cannot make use of all the hardware in a core. To get good performance, you really need lots of threads.
With parfib, lots of threads are no problem. The T2 beats the Pentium 4 at 5 threads, and the dual-core Merom chip at about 9. I don't know whether the OS will try to schedule those threads all on the same core, or spread them out -- I'll find out.
At about 20 threads, the speedup starts to flatten out. This is either because the grains are already too small to be worth dividing further, or because we're out of memory bandwidth. At a guess, I'd say the former, because the sumeuler test below has no problem scaling up to 64 threads.
Note that at the right of the graph, the plots start to get shaky because there is a lot of variance in the runtime. On some benchmarks, when we try to create too many threads the runtime can vary by a factor of 2. Averaging the time of 5 runs smooths a bit of it out, but we can still see the variance through the average.
partree shows what happens when we can't divide the work into enough threads to make proper use of the hardware. Either there's not enough thread-level parallelism inherent in this problem, or the program hasn't been written to expose it properly. On this benchmark, a single Pentium 4 thread is about 8x faster than a T2 thread.
Interestingly, the P4 and Merom degrade much worse than the T2 when we increase the number of threads, and the P4 much worse than the Merom. Something to investigate...
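To illustrate how a tree-shaped benchmark can fail to expose enough parallelism, here is a hypothetical sketch (not the actual partree code, and again assuming `par`/`pseq` from the `parallel` package). If parallelism comes only from sparking subtrees, a skewed tree leaves almost all the work in a single spark:

```haskell
import Control.Parallel (par, pseq)

data Tree = Leaf Int | Node Tree Tree

-- Spark the left subtree while evaluating the right. With a balanced
-- tree this exposes plenty of parallelism; with a skewed tree most of
-- the work lands in one spark, so extra hardware threads sit idle.
parSum :: Tree -> Int
parSum (Leaf x)   = x
parSum (Node l r) = ls `par` (rs `pseq` ls + rs)
  where ls = parSum l
        rs = parSum r

-- A left-skewed tree: each Node hangs a single Leaf on the right,
-- so each spark for a right subtree carries almost no work.
skewed :: Int -> Tree
skewed 0 = Leaf 1
skewed n = Node (skewed (n - 1)) (Leaf 1)

main :: IO ()
main = print (parSum (skewed 10))
```

A machine like the T2, which depends on having many runnable threads, suffers the most from this shape of workload.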
sumeuler is another embarrassingly parallel micro-benchmark, and is a good smoke test for the run-time system. This one scales right up to the limit of the hardware, getting faster until it runs out of hardware threads. Compared to parfib, the runtime degrades much faster once we add too many threads; I'm not sure why.
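The idea behind sumeuler is to sum Euler's totient function over an interval, with the interval split into chunks that are evaluated in parallel. A hedged sketch (assuming `Control.Parallel.Strategies` from the `parallel` package; not the nofib source, and the chunk size is an arbitrary choice):

```haskell
import Control.Parallel.Strategies (parListChunk, rseq, using)

-- Naive Euler totient: how many of 1..n are coprime to n.
totient :: Int -> Int
totient n = length [ k | k <- [1..n], gcd n k == 1 ]

-- Sum the totients of 1..n, evaluating the list in parallel
-- chunks of the given size.
sumEuler :: Int -> Int -> Int
sumEuler chunkSize n =
  sum (map totient [1..n] `using` parListChunk chunkSize rseq)

main :: IO ()
main = print (sumEuler 10 100)
```

Because each totient is independent and the chunks are uniform, there is enough exposed parallelism to keep all 64 hardware threads of the T2 busy.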
The fact that the plots for the Pentium 4 and Merom chips cross over is interesting. So is the fact that when the number of threads is increased on the Merom, the performance stays relatively even, but it degrades badly on the P4. This is similar to the behavior of partree. The T2's performance matches the others at about 16 threads, but there isn't enough exposed parallelism here to beat them.
The T2 is purpose-built to exploit thread-level parallelism. If you have lots of jobs (threads) to run, then the throughput of the machine is quite high. On the other hand, with the current GHC head, if you only have a single thread then performance ranges from sort-of-acceptable to miserable -- which means there's more work to be done.
In these benchmarks, the single-thread performance gap between the T2 and P4 varies significantly. With partree the P4 thread is 8x faster, but with sumeuler it's only 1.6x faster. I'll need to put in some more work to determine why this is.