Wednesday, May 27, 2009

The CAS experiment

(Compare And Swap)

One of the main questions we wanted to address with the GHC on SPARC project was how well the T2 architecture supports parallel lazy evaluation. Earlier posts have shown that it does well for most of the standard parallel benchmarks in the nofib suite, with sumeuler scaling almost perfectly.

For practical parallel programming, the limiting performance factor tends to be the cost of inter-processor communication. We must distribute work to the various threads, then collect and combine their results into the final output of the program. For an "embarrassingly parallel" program, the amount of communication is low, and it can make full use of the available processors. However, for most programs we will reach a point where adding extra processors to the machine does not improve performance, because the benefit of having an extra processor is outweighed by the cost of communicating with it.

For lazy functional programs, the work can be logically divided into thunks, that is, unevaluated function applications. In this case, lots of communication overhead can arise due to the need for the processors to decide who should be evaluating which thunks. The paper Haskell on a Shared Memory Multiprocessor describes the way GHC does this. An important point is that in GHC the processors are only weakly synchronized, that is, there a chance that two processors both evaluate the same thunk. This is safe in GHC because the code is purely functional, it does not matter if the same expression is evaluated more than once.

An alternative approach is to place a lock on each thunk, and require a processor to set a thunk's lock before evaluating it. We call the action of setting the lock "claiming" the thunk. This mechanism eliminates the possibility of two processors claiming the same thunk, at the cost of increased communication overhead. We would like to know how much this overhead would be.

The experiment
In the paper, the authors modified GHC to perform an atomic compare and swap instruction in the entry code for each thunk. On i386 they reported an average slowdown of 50%, with a maximum of 300%. This is a substantial penalty, and the paper goes on to describe the lock-free algorithms they implemented to avoid it.

The SPARC T2 was designed specifically to have a low inter-thread communication overhead, so we would like to know if using locks would be feasible on this architecture. To answer this I repeated the experiment from the paper, by adding the following instruction to the entry code for each thunk:

casa [%sp], %i1, %g1

This means "If the data pointed to by %sp is the same as the value in %i1 then swap that data with the value in %g1, otherwise put it in %g1". On SPARC, %sp is the C stack pointer, and %i1 is the STG R1 register, so the the comparison will always fail. However, the rest of the thunk code doesn't use the %g1 register, so this instruction is a no-op, apart from causing a processor wide synchronization event.

The benchmarking results with this modification are here. In summary, we get an average slowdown of 34%, with a maximum of 94%.

Comparison with x86
For comparison, I redid the experiment on my Pentium4 machine as well. The paper was written back in 2005, and other parts of GHC have improved since then. Here is the code I added to the entry code for each thunk:

push %eax
mov %ebx,%eax
lock cmpxchg %ebx, (%epb)
pop %eax

This fragment contains more instructions than the SPARC version, but I did it this way to ensure the fragment was a no-op (apart from the slowdown due to synchronization). For a real implementation we'd need more than one instruction anyway, and the cmpxchg is substantially more expensive than the others, so it's a fair comparison.

The nofib results for x86 are here. In summary, for this architecture we have an average 130% slowdown, with a worst-case of 540%.

For our purposes, the T2 has roughly a quarter of the thread synchronization overhead of a Pentium 4 machine. This is very good, but Haskell uses lazy evaluation as default, so thunk-entry is something that we do a lot of. An average 34% slowdown is too high to want to move to a locking implementation, but not by much.

GHC is already a mature compiler, so it's unlikely we'd be able to get that 34% back by improving other parts. I had a brief look at the generated code, and for a processor to claim every thunk we'd need a lock every 50 instructions or so. Using locks would be fine for applications that don't use such fine grained parallelism.