GHC on SPARC: 2009-02-15

Saturday, February 21, 2009

Alright, it's been a busy few days.

I've split the native code generator into architecture-specific modules, and chopped the SPARC part into bite-sized chunks. There are still some #ifdefs around for other architectures, but most of the native code generator now gets compiled regardless of the target architecture.

I think I've squashed most of the bugs as well, but I'll have to wait overnight for a complete test run to finish.

Here are some initial nofib results.

The tests compare:

My 3GHz Pentium 4 Desktop

My 1.6GHz Core 2 Duo (Merom) MacBook Air

The 1.4GHz SPARC T2

Take these results with a fair dose of salt: the libs on the T2 were built with -O2, but on the P4 and Merom I only used -O. I'll have some more comprehensive results in a few days. I'm not sure what's going on with "primetest"; I'll have to run it again.

In short, with the current compiler, a Haskell program on a single T2 thread runs about 5x slower than on a P4 desktop. Remember that the T2 is totally set up for parallel processing, and has none of the instruction reordering or multiple issue of the P4. If we scale for the clock rate difference between the P4 and T2, then a T2 thread runs about 2.3 times slower than a P4 thread, clock for clock. On the other hand, 16 threads (2 per core) can be active at once on the T2.
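The clock-for-clock figure is just the wall-clock slowdown scaled by the ratio of the two clock rates. Here is a minimal sketch of that arithmetic; the function and constant names are made up for illustration, and the 5x input is the rough figure quoted above, not a precise measurement.

```python
# Rough figures from the text above; names are illustrative only.
P4_CLOCK_GHZ = 3.0
T2_CLOCK_GHZ = 1.4
WALL_CLOCK_SLOWDOWN = 5.0  # a T2 thread takes about 5x as long as the P4


def per_clock_slowdown(wall_slowdown, fast_ghz, slow_ghz):
    """Scale a wall-clock slowdown by the clock ratio of the two machines.

    The slower-clocked machine gets fewer cycles per second, so part of the
    wall-clock gap is explained by clock rate alone. What remains is the
    slowdown per clock cycle.
    """
    return wall_slowdown * slow_ghz / fast_ghz


if __name__ == "__main__":
    ratio = per_clock_slowdown(WALL_CLOCK_SLOWDOWN, P4_CLOCK_GHZ, T2_CLOCK_GHZ)
    print(f"T2 thread is about {ratio:.1f}x slower per clock")
```

This only accounts for clock rate. It says nothing about cache sizes, memory latency, or the -O versus -O2 difference in how the libs were built.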

Anyway, as for single-thread performance on the T2, I think we're losing out big-time due to our total lack of instruction reordering.

Here is a piece of code from the nqueens test from nofib.


  .LcFW:
        add %i0,-20,%g1
        cmp %g1,%i2
        blu .LcFY
        nop
        add %i3,16,%i3                           <- dep
        cmp %i3,%i4                              <-
        bgu .LcFY
        nop
        sethi %hi(stg_upd_frame_info),%g1                  <- dep
        or %g1,%lo(stg_upd_frame_info),%g1                 <-
        st %g1,[%i0-8]                                     <-
        st %l1,[%i0-4]
        sethi %hi(sDJ_info),%g1                    <- dep
        or %g1,%lo(sDJ_info),%g1                   <- 
        st %g1,[%i3-12]                            <-
        ld [%l1+12],%g1                                 <- copy
        st %g1,[%i3-4]                                  <- 
        ld [%l1+16],%g1                         <- copy
        st %g1,[%i3]                            <-           
        add %i3,-12,%g1               
        st %g1,[%i0-16]              
        ld [%l1+8],%g1                          <- copy
        st %g1,[%i0-12]                         <-
        sethi %hi(base_GHCziNum_zdf6_closure),%l2           <- dep
        or %l2,%lo(base_GHCziNum_zdf6_closure),%l2          <-
        sethi %hi(sDK_info),%g1                 <- dep
        or %g1,%lo(sDK_info),%g1                <-
        st %g1,[%i0-20]
        add %i0,-20,%i0
        call base_GHCziNum_zdp1Num_info,0
        nop

A typical GHC-compiled Haskell program spends most of its time copying data around memory. Such is the price of lazy evaluation, which we all know and love. The copies create a huge amount of memory traffic, and the cache-miss stalls that go with it. In addition, many of the instructions depend directly on the one just above them, which creates data hazard stalls.

At least for the T2, I expect that doing some basic instruction reordering will be a big win, so I'll try that first. Maybe some prefetching also, considering that we're moving so much data around memory.
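To get a feel for what reordering could buy, here is a rough sketch that counts back-to-back register dependencies in the two sethi/or/st sequences from the listing, and then in a hand-reordered version. It is a toy model, not what the native code generator does: the reordered version assumes a second scratch register (%g2) is free at that point, which I haven't checked, and the count ignores real pipeline latencies and forwarding, so it only shows where stalls could occur.

```python
# Each instruction is (text, registers written, registers read).
# Stores write memory rather than a register, so their write set is empty.
ORIGINAL = [
    ("sethi %hi(stg_upd_frame_info),%g1",  {"g1"}, set()),
    ("or %g1,%lo(stg_upd_frame_info),%g1", {"g1"}, {"g1"}),
    ("st %g1,[%i0-8]",                     set(),  {"g1", "i0"}),
    ("st %l1,[%i0-4]",                     set(),  {"l1", "i0"}),
    ("sethi %hi(sDJ_info),%g1",            {"g1"}, set()),
    ("or %g1,%lo(sDJ_info),%g1",           {"g1"}, {"g1"}),
    ("st %g1,[%i3-12]",                    set(),  {"g1", "i3"}),
]

# Hand-reordered: the two address computations are interleaved.
# ASSUMPTION: %g2 is free here. The three stores keep their original order.
REORDERED = [
    ("sethi %hi(stg_upd_frame_info),%g1",  {"g1"}, set()),
    ("sethi %hi(sDJ_info),%g2",            {"g2"}, set()),
    ("or %g1,%lo(stg_upd_frame_info),%g1", {"g1"}, {"g1"}),
    ("or %g2,%lo(sDJ_info),%g2",           {"g2"}, {"g2"}),
    ("st %g1,[%i0-8]",                     set(),  {"g1", "i0"}),
    ("st %l1,[%i0-4]",                     set(),  {"l1", "i0"}),
    ("st %g2,[%i3-12]",                    set(),  {"g2", "i3"}),
]


def adjacent_raw_hazards(instrs):
    """Count instructions that read a register written by the one just before."""
    count = 0
    for (_, writes, _), (_, _, reads) in zip(instrs, instrs[1:]):
        if writes & reads:
            count += 1
    return count


if __name__ == "__main__":
    print("original: ", adjacent_raw_hazards(ORIGINAL))
    print("reordered:", adjacent_raw_hazards(REORDERED))
```

In this model the original sequence has four instructions that depend on their immediate predecessor and the reordered one has none, for the same number of instructions. Note that the reuse of %g1 is what forces the extra register: without renaming, the second sethi can't move above the first store.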
