Wednesday, May 27, 2009

The CAS experiment

(Compare And Swap)

One of the main questions we wanted to address with the GHC on SPARC project was how well the T2 architecture supports parallel lazy evaluation. Earlier posts have shown that it does well for most of the standard parallel benchmarks in the nofib suite, with sumeuler scaling almost perfectly.

For practical parallel programming, the limiting performance factor tends to be the cost of inter-processor communication. We must distribute work to the various threads, then collect and combine their results into the final output of the program. For an "embarrassingly parallel" program, the amount of communication is low, and it can make full use of the available processors. However, for most programs we will reach a point where adding extra processors to the machine does not improve performance, because the benefit of having an extra processor is outweighed by the cost of communicating with it.

For lazy functional programs, the work can be logically divided into thunks, that is, unevaluated function applications. In this case, lots of communication overhead can arise due to the need for the processors to decide who should be evaluating which thunks. The paper Haskell on a Shared Memory Multiprocessor describes the way GHC does this. An important point is that in GHC the processors are only weakly synchronized, that is, there a chance that two processors both evaluate the same thunk. This is safe in GHC because the code is purely functional, it does not matter if the same expression is evaluated more than once.

An alternative approach is to place a lock on each thunk, and require a processor to set a thunk's lock before evaluating it. We call the action of setting the lock "claiming" the thunk. This mechanism eliminates the possibility of two processors claiming the same thunk, at the cost of increased communication overhead. We would like to know how much this overhead would be.

The experiment
In the paper, the authors modified GHC to perform an atomic compare and swap instruction in the entry code for each thunk. On i386 they reported an average slowdown of 50%, with a maximum of 300%. This is a substantial penalty, and the paper goes on to describe the lock-free algorithms they implemented to avoid it.

The SPARC T2 was designed specifically to have a low inter-thread communication overhead, so we would like to know if using locks would be feasible on this architecture. To answer this I repeated the experiment from the paper, by adding the following instruction to the entry code for each thunk:

casa [%sp], %i1, %g1

This means "If the data pointed to by %sp is the same as the value in %i1 then swap that data with the value in %g1, otherwise put it in %g1". On SPARC, %sp is the C stack pointer, and %i1 is the STG R1 register, so the the comparison will always fail. However, the rest of the thunk code doesn't use the %g1 register, so this instruction is a no-op, apart from causing a processor wide synchronization event.

The benchmarking results with this modification are here. In summary, we get an average slowdown of 34%, with a maximum of 94%.

Comparison with x86
For comparison, I redid the experiment on my Pentium4 machine as well. The paper was written back in 2005, and other parts of GHC have improved since then. Here is the code I added to the entry code for each thunk:

push %eax
mov %ebx,%eax
lock cmpxchg %ebx, (%epb)
pop %eax

This fragment contains more instructions than the SPARC version, but I did it this way to ensure the fragment was a no-op (apart from the slowdown due to synchronization). For a real implementation we'd need more than one instruction anyway, and the cmpxchg is substantially more expensive than the others, so it's a fair comparison.

The nofib results for x86 are here. In summary, for this architecture we have an average 130% slowdown, with a worst-case of 540%.

For our purposes, the T2 has roughly a quarter of the thread synchronization overhead of a Pentium 4 machine. This is very good, but Haskell uses lazy evaluation as default, so thunk-entry is something that we do a lot of. An average 34% slowdown is too high to want to move to a locking implementation, but not by much.

GHC is already a mature compiler, so it's unlikely we'd be able to get that 34% back by improving other parts. I had a brief look at the generated code, and for a processor to claim every thunk we'd need a lock every 50 instructions or so. Using locks would be fine for applications that don't use such fine grained parallelism.


  1. On #haskell someone brought up the point that in order to avoid evaluating two thunks at once, when we first evaluate a thunk, write some code to it that says 'go do something else' to other threads of execution. Once evaluation has finished, you overwrite it with your value. This has the cost of one more memory operation per thunk evaluation.

    It sounds similar to Simon Marlow's -feager-blackholing option in HEAD, and Simon sez "To get the best results, any code which may be executed in parallel should be compiled with eager blackholing turned on." But I haven't investigated the patch or its ramifications on parallel programs. Maybe we should have a go at using it with nofib?

  2. Yes, "claiming" a thunk means exactly that, overwriting its info pointer with a pointer to a blackhole. If a thread tries to evaluate a thunk that has already been claimed it hits the blackhole code. This code adds the thread to a queue so it is woken up when the thunk has finished evaluating.

    Even with eager blackholing there is a race condition which can result in two threads evaluating the same thunk. To completely eliminate it you need to do an atomic CAS of the info pointer.

    I haven't compared runs with -feager-blackholing vs not, but it's a good idea, and easy to do. Stay tuned.