Thursday, March 12, 2009

Thread activity plotting

Here are plots of hardware thread activity on the T2 when running some of the benchmarks from nofib/parallel. They were obtained via the Solaris cpustat command, with custom post-processing hackery to make the images.

The horizontal lines are hardware threads, with the darkness of the line corresponding to the number of instructions executed per unit of time. The ticks along the X axis mark seconds of wall-clock runtime.

(sorry there are no numbers on the axes, custom hackery and all)
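
For the curious, the post-processing isn't anything deep. Here's a rough sketch of the idea in Haskell (not the actual script), assuming the cpustat samples have already been massaged into lines of "<seconds> <hw-thread-id> <instruction-count>":

    -- Sketch only: bucket (second, hw thread, instruction count) samples into
    -- a plain PGM image, one row per hardware thread, one column per second,
    -- with darker pixels meaning more instructions executed.
    import qualified Data.Map.Strict as M

    type Sample = (Int, Int, Integer)        -- (second, hw thread, instructions)

    parseSample :: String -> Sample
    parseSample s = case words s of
      [t, c, i] -> (floor (read t :: Double), read c, read i)
      _         -> error ("bad sample: " ++ s)

    main :: IO ()
    main = do
      samples <- map parseSample . lines <$> getContents
      let grid    = M.fromListWith (+) [ ((c, t), i) | (t, c, i) <- samples ]
          nCols   = 1 + maximum [ t | (t, _, _) <- samples ]
          nRows   = 1 + maximum [ c | (_, c, _) <- samples ]
          top     = maximum (M.elems grid)
          shade v = show (255 - (v * 255) `div` top)   -- 0 = black = busiest
          row c   = unwords [ shade (M.findWithDefault 0 (c, t) grid)
                            | t <- [0 .. nCols - 1] ]
      putStrLn ("P2\n" ++ show nCols ++ " " ++ show nRows ++ "\n255")
      mapM_ (putStrLn . row) [0 .. nRows - 1]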

sumeuler



[Plots of thread activity for -N8, -N32 and -N64]

These are with 8, 32 and 64 Haskell threads (+RTS -N8 ..) respectively. sumeuler scales well, so in the -N64 case we get even machine utilization and a short runtime.
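
For reference, the sumeuler kernel is roughly this shape. This is a hedged sketch using Control.Parallel.Strategies rather than the exact nofib source, with an illustrative chunk size and a naive totient:

    import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

    -- Naive Euler totient: how many k <= n are coprime to n.
    euler :: Int -> Int
    euler n = length [ k | k <- [1 .. n], gcd n k == 1 ]

    -- Sum the totients, evaluating the list in parallel chunks. Each chunk is
    -- an independent spark, which is why the work spreads evenly over 64 threads.
    sumEuler :: Int -> Int
    sumEuler n = sum ([ euler k | k <- [1 .. n] ] `using` parListChunk 100 rdeepseq)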

In the plot for -N8, the system seems to favor the first thread of each core, so every 8th trace is darker. The higher threads aren't used as much. There also seems to be more activity in the last two cores.

matmult


[Plots of thread activity for -N32 and -N64]

The plot for matmult reveals distinct phases of high and low parallelism. At about 7 sec into the -N32 plot we see a single running Haskell thread shuffling between the cores.

partree

[Plot of thread activity for -N32]

In this plot we see a faint background computation, along with what I expect are regular periods of garbage collection. However, it appears as though most of the work is being done by a single Haskell thread. The plots for 8 threads and 1 thread have a similar runtime.

I am eagerly awaiting the release of ThreadScope so I can match these hardware profiles up with GHC's view of the world.

Monday, March 9, 2009

The GNU Debugger and me

At the end of last week I tried to profile the performance of matmult and parfib, but discovered that profiling itself was broken. Specifically, the allocation counts for some cost centres are getting corrupted. For example:

COST CENTRE   MODULE   no.   entries                %time %alloc   %time %alloc

MAIN          MAIN       1   0                        0.0   20.3     0.0  100.0
main          Main     172   30590380562441576        0.0   19.0     0.0   19.0
CAF           Main     166   30590376291721096        0.0   20.2     0.0   20.2
main          Main     173   0                        0.0    0.0     0.0    0.0
dude          Main     174   30619204112395264        0.0    0.0     0.0    0.0
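
For anyone wanting to reproduce a profile of this shape, here is a minimal sketch of the kind of program involved (the dude cost centre is just an explicit SCC; this is not the actual test case):

    -- Build with something like: ghc -prof -auto-all Main.hs
    -- Run with:                  ./Main +RTS -p -RTS
    -- and the counts land in Main.prof.
    module Main where

    dude :: Int -> Int
    dude x = {-# SCC "dude" #-} (x * x + 1)

    main :: IO ()
    main = print (sum (map dude [1 .. 1000000 :: Int]))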

Sadly, for some reason it's also getting corrupted with the via-c build, so my usual approach of swapping out suspect -fasm closure code for known good -fvia-c code isn't going to fly.

That puts me squarely in GDB territory, and the nastiness of breakpoints and single stepping through instructions. I figure GDB is like the SWAT team of software engineering. It's fun to recount its exploits to your pals over beer, but you hope you're never in a situation where you actually need to call on its services.

Anyway, some time later I've got the offending Cmm lines:

    I32[CCCS] = CAF_ccs;
    I64[I32[R1 + 4] + 16] = I64[I32[R1 + 4] + 16] + 1 :: W64;


Which produces:

    sethi %hi(Main_CAFs_cc_ccs),%g1
    or    %g1,%lo(Main_CAFs_cc_ccs),%g1   -- ok, load CAF_ccs into %g1
    sethi %hi(CCCS),%g2
    or    %g2,%lo(CCCS),%g2               -- ok, load addr of CCCS into %g2
    st    %g1,[%g2]                       -- ok, I32[CCCS] = CAF_ccs
    add   %g1,1,%g1                       -- hmmm: produce a misaligned addr into %g1
    addx  %g2,%g0,%g2                     -- hmmm: a nop. %g0 is always zero
    ld    [%l1+4],%g3                     -- ok, load addr of the cost centre record into %g3
    add   %g3,16,%g3                      -- ok, calculate the addr of the allocation count
                                          --     for that cost centre

    st    %g2,[%g3]                       -- badness! write addr into the allocation count field
    st    %g1,[%g3+4]                     -- ditto
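
So the loaded 64-bit count never gets incremented; the add/addx pair is applied to the registers holding the CAF_ccs and CCCS addresses instead. What the increment should boil down to on a 32-bit target is an add-with-carry over the two word halves of the count, roughly this, written in Haskell rather than in the code generator's own terms:

    import Data.Word (Word32)

    -- The Cmm wants: 64-bit allocation count += 1, on a 32-bit machine.
    -- That is an add on the low word plus an add-with-carry into the high
    -- word (addcc / addx on SPARC), applied to the loaded count rather than
    -- to the address registers.
    inc64 :: (Word32, Word32) -> (Word32, Word32)    -- (high word, low word)
    inc64 (hi, lo) = (hi', lo')
      where
        lo' = lo + 1                           -- addcc: add 1, set the carry bit
        hi' = if lo' == 0 then hi + 1 else hi  -- addx:  fold the carry into hi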


Finding being 80% of fixing, hopefully this'll be working again tomorrow.

Project midpoint

I was away for a couple of days last week at my sister's wedding, but am back on the SPARC case.

After collecting the benchmarks in the last post, we've reached the midpoint of the project. The various stakeholders, interested parties and I then had an email discussion about what to do next, a summary of which is:

Discussion

  • We're pretty happy with the performance of the highly threaded benchmarks like sumeuler, but others such as matmult could use some work.

  • Darryl ran some tests that confirmed there is indeed a stall cost for load instructions followed by dependent ops that use the loaded value.

  • Although doing instruction reordering would reduce the cost, the overall throughput of the machine probably doesn't suffer much for highly threaded code. The T2 is designed to handle stalls well, as long as there is another useful thread to switch to.

  • We considered targeting the intermediate representation (IR) used in Sun's C compiler. That would do code generation better than we ever could, but the IR itself isn't publicly documented. We might be able to reverse engineer it, but the end product would rot when the IR changes.

  • Targeting LLVM is another option, but that's a longer term project, and not SPARC specific. Manuel mentioned on IRC that he has someone looking into targeting LLVM.

  • Doing our own instruction reordering is also an option, though it isn't SPARC specific either, and it would be subsumed by a prospective LLVM port.

  • Prefetching might help for single-threaded code, because GHC spends a fair amount of time copying data between the stack and the heap. The T2 does no hardware prefetching.

  • We decided it'd be better to spend the rest of the time looking into how well we can exploit the features specific to the T2, instead of focusing on the performance of single-threaded code. Manuel posted some initial results for DPH showing the T2 running about as well as an eight-core Xeon machine for a memory-bound benchmark, so that'll do for now.

Plan


  • First of all, get some profiles together comparing parfib and matmult running on the T2 and on x86. This should give us an idea of where the time is going.

  • An interesting feature of the T2 is its low thread synchronization overhead. This might support per-thread thunk locking, instead of the more complex machinery described in Haskell on a Shared-Memory Multiprocessor for killing off duplicated computations.

  • Simon, Simon and Satnam's recent paper Runtime Support for Multicore Haskell had a lot to say about locality effects.
    It would also be interesting to compare the cost of pinning threads to different cores, e.g. 8 threads on one core vs 1 thread per core (see the sketch after this list).

  • There are also outstanding bugs to fix. We want to have a stable SPARC port, with a native code generator, for the next major GHC release.
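
On the pinning point above, here is a hedged sketch of the comparison harness: Control.Concurrent.forkOn ties a Haskell thread to a particular capability (it was forkOnIO in the GHC of the day), while mapping capabilities onto particular hardware threads is left to the RTS and the OS (e.g. Solaris pbind). The work function and its argument are purely illustrative.

    import Control.Concurrent (forkOn, getNumCapabilities)   -- forkOnIO in older GHCs
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
    import Control.Monad (forM)

    -- Stand-in work for each pinned thread; purely illustrative.
    work :: Int -> Int
    work n = sum [1 .. n]

    main :: IO ()
    main = do
      caps  <- getNumCapabilities               -- set with +RTS -N<n>
      dones <- forM [0 .. caps - 1] $ \c -> do
        done <- newEmptyMVar
        _    <- forkOn c (putMVar done $! work 5000000)   -- tie a worker to capability c
        return done
      mapM_ takeMVar dones                      -- wait for all workers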