I've split the native code generator into architecture-specific modules, and chopped the SPARC part into bite-sized chunks. There are still some #ifdefs around for other architectures, but most of the native code generator is now compiled independently of the target architecture.
I think I've squashed most of the bugs as well, but I'll have to wait overnight for a complete test run to finish.
Here are some initial nofib results.
The tests compare:
- My 3GHz Pentium 4 desktop
- My 1.6GHz Core 2 Duo (Merom) MacBook Air
- The 1.4GHz SPARC T2
Take these results with a fair dose of salt: the libs on the T2 were built with -O2, but on the P4 and Merom I only used -O. I'll have some more comprehensive results in a few days. I'm not sure what's going on with "primetest"; I'll have to run it again.
In short, with the current compiler, a Haskell program on a single T2 thread runs about 5x slower than on the P4 desktop. Remember that the T2 is designed from the ground up for parallel workloads, and has none of the instruction reordering or multiple-issue hardware of the P4. If we scale out the clock-rate difference between the P4 and T2, then a T2 thread runs about 2.3 times slower than a P4 thread, clock for clock. On the other hand, 16 threads (2 per core) can be active at once on the T2.
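For the record, the clock-for-clock figure is just the wall-clock slowdown scaled by the ratio of the two clock rates (the 5x figure and the clock speeds are from the runs above):

```python
# Rough clock-for-clock comparison between a P4 thread and a single T2 thread.
p4_clock_ghz = 3.0         # Pentium 4 desktop
t2_clock_ghz = 1.4         # SPARC T2
wall_clock_slowdown = 5.0  # T2 thread vs P4, from the nofib runs above

# Scale out the clock-rate difference to compare work done per cycle.
per_cycle_slowdown = wall_clock_slowdown * t2_clock_ghz / p4_clock_ghz
print(round(per_cycle_slowdown, 1))
```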
Anyway, with regard to single-thread performance on the T2, I think we're losing out big-time due to our total lack of instruction reordering.
Here is a piece of code from the nqueens test from nofib.
```
add   %i3,16,%i3                                <- dep
cmp   %i3,%i4                                   <-
sethi %hi(stg_upd_frame_info),%g1               <- dep
or    %g1,%lo(stg_upd_frame_info),%g1           <-
st    %g1,[%i0-8]                               <-
sethi %hi(sDJ_info),%g1                         <- dep
or    %g1,%lo(sDJ_info),%g1                     <-
st    %g1,[%i3-12]                              <-
ld    [%l1+12],%g1                              <- copy
st    %g1,[%i3-4]                               <-
ld    [%l1+16],%g1                              <- copy
st    %g1,[%i3]                                 <-
ld    [%l1+8],%g1                               <- copy
st    %g1,[%i0-12]                              <-
sethi %hi(base_GHCziNum_zdf6_closure),%l2       <- dep
or    %l2,%lo(base_GHCziNum_zdf6_closure),%l2   <-
sethi %hi(sDK_info),%g1                         <- dep
or    %g1,%lo(sDK_info),%g1                     <-
```
A typical GHC-compiled Haskell program spends most of its time copying data around memory. Such is the price of lazy evaluation, which we all know and love. The copies create a huge amount of memory traffic, and the associated cache-miss stalls. In addition, many of the instructions depend directly on the one just above them, which creates data-hazard stalls.
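To illustrate, the two sethi/or pairs at the end of the listing above are independent of each other, so even a basic scheduler could interleave them to hide the sethi -> or dependency. A hand-made sketch of what I mean, not actual compiler output:

```
sethi %hi(base_GHCziNum_zdf6_closure),%l2       <- dep
sethi %hi(sDK_info),%g1                         <- independent, fills the stall
or    %l2,%lo(base_GHCziNum_zdf6_closure),%l2
or    %g1,%lo(sDK_info),%g1
```

The two pairs use different registers (%l2 and %g1), so swapping the middle two instructions doesn't change the result, but each or now has another instruction between it and the sethi it depends on.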
At least on the T2, I expect that doing some basic instruction reordering will be a big win, so I'll try that first. Maybe some prefetching as well, considering how much data we're moving around memory.