Thursday, February 26, 2009

Benchmarking

I spent yesterday writing a tool (sekhmet) to help with the benchmarking, and collected the data today.

The test programs are from the nofib parallel benchmark suite. I did 5 runs of each in succession, and took the average elapsed wall-clock time. The machines were lightly loaded, though I can't see what's going on in the second slice of my Dept T2 (probably nothing).

Test Machines:

  • My Desktop machine. Pentium4 (Prescott 2M) CPU, 1 core, 2 threads per core, 3Ghz clock, 1GB RAM.

  • My Mac Book Air. Core2 Duo (Merom) CPU, 2 cores, 1 thread per core, 1.6Ghz clock, 2GB RAM.

  • My Dept T2. UltraSPARC T2 (Niagra 2) CPU, 4 core slice of 8 core machine, 8 threads per core, 1.4Ghz clock, 1GB ram in slice.

  • The haskell.org T2. UltraSPARC T2 (Niagra 2) CPU, 8 cores, 8 threads per core, 1.4Ghz clock, 32GB ram.



Benchmarks:



Parfib is an embarrassingly parallel micro-benchmark. It's a good test of the parallel Run Time System (RTS), because if this program doesn't parallelize then nothing will.

Note that this is a log-log graph. The fact that the initial plots for the T2s are linear means that the performance is scaling proportionally to the number of threads, that is, double the threads, half the runtime. With parfib, it's easy to divide the work into uniformly sized chunks (grains) and each thread only needs to talk to its parent and children, there is little cross-thread communication.

On the T2, each core has 8 threads, and each core has 2 ALUs and 1 Load/Store unit. However, because only one instruction from a thread is issued at a time, a single thread cannot make use of all the hardware in a core. To get good performance, you really need lots of threads.

With parfib, lots of threads are no problem. The T2 beats the Pentium4 at 5 threads, and the dual core Merom chip at about 9. I don't know whether the OS will try to schedule those threads all on the same core, or spread them out -- I'll find out.

At about 20 threads, the speedup starts to flatten out. This will either be because the grains are already too small to usefully divide them any further, or we're out of memory bandwidth. At a guess, I'd say the former because the sumeuler test below has no problem scaling up to 64 threads.

Note that in the right of the graph, the plots start to get shaky because there is a lot of variance in the runtime. On some benchmarks, when we try to create too many threads the runtime can vary by a factor of 2. Averaging the time of 5 runs smooths a bit of it out, but we can still see the variance though the average.



partree shows what happens when we can't divide the work into enough threads to make proper use of the hardware. Either there's not enough thread-level parallelism inherent in this problem, or the program hasn't been written to expose it properly. On this benchmark, a single Pentium 4 thread is about 8x faster than a T2 thread.

Interestingly, the P4 and Merom behave much worse than the T2 when we increase the number of threads, and the P4 much worse than the Merom. Something to investigate...



sumeuler is another embarrassingly parallel microbenchmark, and is good smoke-test for the run time system. This one scales right up to the limit of the hardware, getting faster until it runs out of hardware threads. Compared to parfib, when we add too many threads the runtime degrades much faster, not sure why.



The fact at the plots for the Pentium4 and Merom chips cross over is interesting. So is the fact that when the number of threads is increased on the Merom, the performance is relatively even, but it degrades badly on the P4. This seems to be like the behavior of partree. The T2 performances matches the others at about 16 threads, but there isn't enough exposed parallelism here to beat them.

Summary:

The T2 is purpose built to exploit thread level parallelism. If you have lots of jobs (threads) to run, then the throughput of the machine is quite high. On other hand, with the current GHC-head, if you only have a single thread then performance is going to be from sort-of-acceptable to miserable -- which mean's there's more work to be done.

In these benchmarks, the single thread performance between the T2 and P4 varies significantly. With partree the P4 thread is 8x faster, but in sumeuler it's only 1.6x faster. I'll need to put some more work in to determine why this is.

Tuesday, February 24, 2009

Bugfixes and pretty graphs

Fixed the problem with conc042-045 and concprog002. The bug was in the runtime system code that deals with software transactional memory (STM). Tag bits from a closure pointer weren't getting stripped before it was dereferenced, which was causing misaligned read error. This was a similar bug to one I found at the very start in Printer.c:printClosure.

I also made some pretty graphs of parallel nofib benchmarks. These are just for the T2, which I ran by hand. I'll do comparisons with other platforms tomorrow:





These are both embarrassingly parallel micro benchmarks. This is clear from the fact that the runtime scales almost linearly with the number of threads, until the available hardware is exhausted. Still, it shows its working like it ought to. Horray for +RTS -N64.

Monday, February 23, 2009

Sanity

Fixed the problem with enum02.

It turned out the code to compare values of type Integer in the runtime system was being mis-compiled. This was because some of the basic blocks created by the condition code generator didn't end in jumps, which confused the register liveness determinator, and caused a bad allocation.

Decided that 1 ounce of prevention is worth 100000000000 ounces of staring blindly at assembly dumps for hours, so I added code to the NCG that checks the above invariant holds in all code it generates. It also checks that the branch delay slot is always filled for branch instructions, and I'll add any other invariants I come across.