Monday, February 15, 2010

Memory Barriers and GHC 6.12.1

I've moved to UNSW (University of New South Wales) to help with the DPH (Data Parallel Haskell) project. The first order of business has been to make a SPARC/Solaris binary release for 6.12.1.

It looks like this week will be spent spelunking through the runtime system trying to find the source of #3875. Running the DPH Quickhull benchmark on SPARC/Solaris with the current head causes a consistent runtime crash, and running sumsq causes a crash about one time out of twenty. Simon M reckons this is likely to be caused by a missing memory barrier somewhere in the runtime system, and if that's true then it's going to be tricky to find.

According to this wikipedia page the default memory ordering properties on SPARC (TSO = Total Store Ordering) are supposed to be the same as on x86, but the x86 benchmarks don't crash like on SPARC. I'm not sure if this is because the memory ordering on SPARC and x86 are actually different, or if we've just been lucky on x86 so far. I'll spend today reading more about the ordering properties of these two architectures.

I'm also eagerly awaiting the new GHC buildbot, which Ian is working on. GHC HQ develops mostly on x86_64/Linux + Windows, so if we want to keep other platforms working then having reliable buildbots is an absolute necessity.

Wednesday, May 27, 2009

The CAS experiment

(Compare And Swap)

One of the main questions we wanted to address with the GHC on SPARC project was how well the T2 architecture supports parallel lazy evaluation. Earlier posts have shown that it does well for most of the standard parallel benchmarks in the nofib suite, with sumeuler scaling almost perfectly.

For practical parallel programming, the limiting performance factor tends to be the cost of inter-processor communication. We must distribute work to the various threads, then collect and combine their results into the final output of the program. For an "embarrassingly parallel" program, the amount of communication is low, and it can make full use of the available processors. However, for most programs we will reach a point where adding extra processors to the machine does not improve performance, because the benefit of having an extra processor is outweighed by the cost of communicating with it.

For lazy functional programs, the work can be logically divided into thunks, that is, unevaluated function applications. In this case, lots of communication overhead can arise due to the need for the processors to decide who should be evaluating which thunks. The paper Haskell on a Shared Memory Multiprocessor describes the way GHC does this. An important point is that in GHC the processors are only weakly synchronized, that is, there a chance that two processors both evaluate the same thunk. This is safe in GHC because the code is purely functional, it does not matter if the same expression is evaluated more than once.

An alternative approach is to place a lock on each thunk, and require a processor to set a thunk's lock before evaluating it. We call the action of setting the lock "claiming" the thunk. This mechanism eliminates the possibility of two processors claiming the same thunk, at the cost of increased communication overhead. We would like to know how much this overhead would be.

The experiment
In the paper, the authors modified GHC to perform an atomic compare and swap instruction in the entry code for each thunk. On i386 they reported an average slowdown of 50%, with a maximum of 300%. This is a substantial penalty, and the paper goes on to describe the lock-free algorithms they implemented to avoid it.

The SPARC T2 was designed specifically to have a low inter-thread communication overhead, so we would like to know if using locks would be feasible on this architecture. To answer this I repeated the experiment from the paper, by adding the following instruction to the entry code for each thunk:

casa [%sp], %i1, %g1

This means "If the data pointed to by %sp is the same as the value in %i1 then swap that data with the value in %g1, otherwise put it in %g1". On SPARC, %sp is the C stack pointer, and %i1 is the STG R1 register, so the the comparison will always fail. However, the rest of the thunk code doesn't use the %g1 register, so this instruction is a no-op, apart from causing a processor wide synchronization event.

The benchmarking results with this modification are here. In summary, we get an average slowdown of 34%, with a maximum of 94%.

Comparison with x86
For comparison, I redid the experiment on my Pentium4 machine as well. The paper was written back in 2005, and other parts of GHC have improved since then. Here is the code I added to the entry code for each thunk:

push %eax
mov %ebx,%eax
lock cmpxchg %ebx, (%epb)
pop %eax

This fragment contains more instructions than the SPARC version, but I did it this way to ensure the fragment was a no-op (apart from the slowdown due to synchronization). For a real implementation we'd need more than one instruction anyway, and the cmpxchg is substantially more expensive than the others, so it's a fair comparison.

The nofib results for x86 are here. In summary, for this architecture we have an average 130% slowdown, with a worst-case of 540%.

For our purposes, the T2 has roughly a quarter of the thread synchronization overhead of a Pentium 4 machine. This is very good, but Haskell uses lazy evaluation as default, so thunk-entry is something that we do a lot of. An average 34% slowdown is too high to want to move to a locking implementation, but not by much.

GHC is already a mature compiler, so it's unlikely we'd be able to get that 34% back by improving other parts. I had a brief look at the generated code, and for a processor to claim every thunk we'd need a lock every 50 instructions or so. Using locks would be fine for applications that don't use such fine grained parallelism.

Wednesday, April 22, 2009

Loose ends and register pairing

I've been spending some time cleaning up loose ends and addressing the floating point register pairing on SPARC.

First loose end was refactoring the way information about what STG registers are pinned to what hardware regs is managed. When I fixed the NCG back in Jan / Feb I had defined a local freeRegs function which records which regs are available for allocation, mostly so I could work out what was going on. Last week I redid it so its gets the information from include/MachRegs.h again via some #ifdef nastiness. Said nastiness is contained to a single module, with the benefit being that the SPARC NCG behaves the same way as the ones for the other architectures.

Second loose end was working out why we couldn't pin a register to hold the pointer to a thread's gc_thread structure in the garbage collector code. The reason is that names like %l1 are just offsets into the SPARC register stack, and change during a function call. It's all well and good to pin a certain variable to %l1, but its value gets lost as soon as you make a call. Seems obvious in hindsight. We can pin STG registers to hardware registers while running the program proper, because Haskell code doesn't use the register window. However, the garbage collector is written in C, so the same plan doesn't fly.

I'm currently refactoring the graph coloring register allocator to do a better job of handling the SPARC's floating point registers. SPARC has 32x 32bit float registers. To store a 64 bit float you place the value in a pair of 32bit regs. This is much like the situation on x86 where the 16 bit regs AL and AH can be combined to form AX. Previously, we had just statically reserved 6 of the FP regs to be 32 bit, and the rest to be 64 bit. This works, but is kludgy and doesn't make good use of the hardware.

When I wrote GHC's graph coloring allocator back in 2007, I based it on the algorithm described by Smith, Ramsey and Holloway in A Generalized Algorithm for Graph-Coloring Register Allocation. This algorithm handles register quite elegantly, and I wrote the code with the SPARC architecture in mind. However, back then the SPARC NCG didn't actually work, so I couldn't test this functionality. It does now though, so i'm going back and finishing this unfinished business.

Tuesday, April 14, 2009

Instruction counts on x86 vs SPARC

Here are breakdowns of the total number of instructions executed on a Pentium-M processor and the SPARC T2 for four of the nofib/parallel benchmarks. The vertical axis is gig-instructions (x10^9). The benchmarks were run with just a single thread (+RTS -N1). The figures were recorded using hardware performance counters.

SPARC/Solaris provides the cpustat and cputrack commands which allow direct access to the performance counters, as long as you're the super-user. On x86/Linux I originally tried using valgrind to count the instructions, but it turned out to be too slow. Valgrind is basically an x86 CPU simulator, the simulation imposes a 50 - 200x slowdown, and I didn't want to wait all week. I ended up using a machine with a Linux kernel patched with the perfex/perfctr library, generously provided by Daniel Frampton's GC research group.

Recall that Pentium-M architecture does an on-the-fly translation from x86 CISC instructions to RISC micro-operations (uops). On these benchmarks, about 10% more uops are generated than the number of x86 instructions. The SPARC versions of the benchmarks use about 25% more instructions than Pentium-M uops. This is partly because the SPARC needs two instructions to load a 32 bit immediate value, which is something GHC does a lot of. For example, here is some code which builds part of a stack frame:

sethi %hi(_module_registered),%g2
or %g2,%lo(_module_registered),%g2
st %g1,[%g2]
add %i0,-4,%i0
sethi %hi(__stginit_base_Prelude_),%g1
or %g1,%lo(__stginit_base_Prelude_),%g1
st %g1,[%i0]
add %i0,-4,%i0
sethi %hi(__stginit_base_SystemziEnvironment_),%g1
or %g1,%lo(__stginit_base_SystemziEnvironment_),%g1
st %g1,[%i0]

SPARC instructions are all 32 bits wide, and most ALU ops only have room for a 13 bit immediate value. The usual way to load a 32 bit value is to use the sethi instruction to set the top 22 bits, then or in the bottom few. Lack of room for immediate values hurts continuation passing styles of compilation more than the traditional loop+branch styles, as we can't use PC relative addressing as much. Still, on the T2, dependent ALU ops don't cause pipeline stalls, so it's not a big penalty.

The Pentium-M doesn't support x86_64, so we have to use the x86 instruction set, which doesn't have many registers. Due to this, the Pentium-M code performs about twice the number of memory accesses then the SPARC code. The number of branch instructions is about the same, as expected.

All over, on the x86 about 50% of the instructions are loads or stores, and 18% are branches. On the SPARC, about 20% are loads or stores, and about 16% branches.

In other news, I've been spending some time looking through the RTS code for the scheduler, and the way black-holes are handled. The paper Haskell on a Shared-Memory Multiprocessor discusses the mechanism that GHC uses to avoid fine-grained locking of thunks during parallel evaluation. On x86 this locking is expensive, but it should be less so on the T2. I'm planning to repeat the experiment mentioned in that paper to determine whether doing the locking is feasible. Doing so would eliminate the need for the complicated stack tracing scheme described in that paper to kill of duplicated computation (at least on SPARC), and may improve the latency of parallel code. Stay tuned.

Monday, March 23, 2009

Pinning threads to cores

The following graphs show runs of the nofib/parallel benchmarks sumeuler, parfib and mandel, run with a varying number of Haskell threads (+RTS -N1 .. -N14).

In the plots marked "8 thread", the program was constrained to run on a single core. A core has 8 hardware threads, can issue an instruction from two different threads each cycle, has 2 ALUs, 1 Load/Store unit, 8k L1 data cache and 16k L1 instr cache. In this mode the threads must share limited hardware, so if the two active threads need to use the Load/Store unit at the same time, then one will stall. However, the inter-thread communication overhead is low, as they can communicate via the shared L1 cache.

In the plots marked "8 core", the program was run on a processor set consisting of the first hardware thread of each core. This means the program effectively has 8 ALUs, 8 Load/Store units, 64k L1 data and 128k L1 instr cache available. The T2 does not issue multiple instructions from the same thread in a single cycle, so the second ALU of each core is left unused. In this mode the program has more hardware available, but the communication overhead is higher as the threads must communicate via L2 cache.

The question is how well the T2's hardware threading multiplexes GHC threads (capabilities). In the "8 core" mode, the program has more than 4x the hardware available, and will not suffer contention over limited Load/Store units. However, it is not expected to be able to issue 4x the number of instructions, due to pipeline stalls and cache latency.
For the runtime graph, both axes use log scales. The fact that the plots for the "8 core" mode are (mostly) straight lines means that the performance is scaling (almost) perfectly, that is, double the threads, half the runtime. In all cases the program performs better in the "8 core" mode, so the extra hardware available more than makes up for the higher communication overhead. The raw run-time for 8 GHC threads are as follows:

progruntime "8 thread" runtime "8 core" speedup

From these numbers we can see that in "8 core" mode, even though the program has more than 4x the hardware available, this doesn't translate into a 4x speedup. Each thread spends the majority of its time stalled while waiting for the cache to respond, and in "8 core" mode the cores are idle for much of the time.

This is less true for parfib. Note from the max issue rate graph that parfib achieves the peak single-core issue rate of 2.33 gig instrs/sec with only 6 threads. This implies the program is integer bound, not load/store bound. It is managing to issue two instructions on every cycle, and the overall execution of the program doesn't stall while waiting for memory.

In all cases, adding more GHC threads than hardware threads is a bad idea. Not only does the run time increase, but the total number of instructions for mandel and sumeuler increases dramatically. These extra instructions will be due to context switching between the Haskell threads, contribute to overall load, and will slow down other jobs running on the same machine.

Thursday, March 19, 2009

Peak issue rate is 18.64 Gig instrs/sec

This plot confirms the number for the peak instruction issue rate that I presented yesterday. On the T2 at 1165Mhz we get 18.64 Gig instructions/sec.

The test program spawns 64 threads that execute the following loop 10^8 times.

add %o0, %o1, %o2
add %o1, %o2, %o3
add %o3, %o4, %o5
add %o5, %o6, %o7

sub %l3, 1, %l3
subx %l2, 0, %l2

cmp %g0, %l3
bne here

Wednesday, March 18, 2009

Peak performance

These are plots of machine activity when running benchmarks from nofib/parallel on the T2. The Instr_cnt plot shows the machine-wide instruction issue rate at 100ms resolution. The Data_miss plot shows the number of loads which miss L1 data cache multiplied by 20. This scaling factor is the is the average number of cycles needed to fetch data from L2, according to Darryl's article on Calculating Processor Utilization.

The Solaris psrstat command informs me that the processor operates at 1165Mhz. The OS doesn't seem to scale the clock frequency with respect to load, so I've assumed that it's stable during the tests. The processor has 8 cores, with each core being able to issue two instructions per clock, each from different threads. This gives us a peak theoretical issue rate of 18.64 Gig instructions/sec (18.64 * 10^9). Of course, that's assuming we're just moving data between registers. For some applications we'll expect to run out of memory or IO bandwidth before we reach the peak issue rate.


These are plots of sumeuler running with 16, 32, 64, and 128 threads. The post from a few weeks ago shows that sumeuler scales perfectly, and that is reinforced here. As a rough approximation, when we double the number of threads, the issue rate doubles and the run-time halves. The idle time before and after each run corresponds to me flipping between terminals to control the logger vs run the program.

We have 64 hardware threads, so using -N64 Haskell threads (capabilities) gives us the shortest run time. I also ran the program with -N128 to confirm there's nothing to be gained by adding more capabilities than we have hardware threads (I tried for other values between -N64 and -N128 as well).

For -N64 we get a maximum issue rate of about 10 Gig instructions/sec. That is about 54% of the peak rate, which is pretty good, but it also indicates that we could gain something by improving the native code generator to do instruction reordering or prefetching. If the program was achieving the peak rate then there would be no free cycles left to use, so the only way to improve run time would be to reduce the number of instructions.

Over the next few days I'll run some C micro-benchmarks to confirm that my figure for the peak issue rate is correct.

mandel also scales well. This program sparks a new thread for each pixel in the output image. The time it takes to calculate the color of each pixel can vary by a factor of 10 or more, depending on where it is in the image. I imagine this is why the plot for the instruction issue rate is noisier than for sumeuler. The maximum issue rate is about the same.


There is something going on with matmult that I don't fully understand yet. The four plots below were taken in quick succession on a quiet machine. Note how the run time varies between 10 and 13 seconds, and that the second and fourth plots are quite similar but different from the other two. Perhaps the initial choice of which cores the threads are placed on is causing locality effects, but that wouldn't explain why those plots also seem to execute more instructions.

If another process was actively running on the machine, and contributing to the instruction count, then the two similar plots wouldn't be so similar. In these plots we can see an initial startup period, then seven distinct bursts of activity. Those bursts aren't garbage collection. The GHC runtime system reports that the major GCs are taking < 0.1s each, but the bursts in the plot have around 1s duration. More investigation required.