Saturday, February 21, 2009

Triage


OVERALL SUMMARY for test run started at Thursday, 19 February 2009 11:39:01 AM EST
2301 total tests, which gave rise to
12203 test cases, of which
0 caused framework failures
2359 were skipped

9375 expected passes
342 expected failures
1 unexpected passes
126 unexpected failures

Unexpected passes:
hpc_ghc_ghci(normal)

Unexpected failures:
enum02(all) -- wrong output. 64 bit arith.

ffi019(all) -- segv
conc042(threaded2,profthreaded) -- segv, segv
conc043(threaded2,profthreaded) -- segv, segv
conc044(threaded2,profthreaded) -- segv, segv
conc045(threaded2,profthreaded) -- segv, segv
concprog002(threaded2) -- segv
random1283(threaded2) -- segv
ghciprog004(normal) -- segv

strings(all) -- wrong output. same as HEAD on x86_64.
bits(all) -- "

testblockalloc(normal,threaded1) -- ? out of memory error, as expected?
process007(all) -- ? what is this trying to test. uses sh tricks.

genUpTo(all) -- build problem. 'StringRep' not in scope.

hClose002(all) -- Solaris file locking difference.
user001(all) -- Solaris unix env lib differences.

ann01(profc,profasm) -- (*) driver / linker problem.
annrun01(all) -- "

1861(optc,profc) -- INFINITY undefined. C header file difference on Solaris.
derefnull(profc,profthreaded) -- segv as expected, but does not drop ./derefnull.prof.
divbyzero(profc,profthreaded) -- arith error as expected, does not drop ./divbyzero.prof.

apirecomp001(normal) -- ok by hand. Test framework makefile wibble.
ghcpkg02(normal) -- "

arith011(profc) -- ok by hand. GCC timeout.
barton-mangler-bug(optc,profc) -- "
seward-space-leak(ghci) -- "
joao-circular(optc,profc) -- "

1914(ghci) -- ok by hand. Needs GNU version of touch.
gadt23(normal) -- "
recomp004(normal) -- "

hpc_draft(normal) -- ok by hand. Not sure why it failed during run.
hpc_hand_overlay(normal) -- "
hpc_markup_002(normal) -- "
hpc_overlay(normal) -- "
hpc_overlay2(normal) -- "
hpc_show(normal) -- "
arr017(profthreaded) -- "


ann01

=====> ann01(profc)
cd . && '/data0/home/benl/devel/ghc/ghc-HEAD-build/ghc/stage2-inplace/ghc' -fforce-recomp
-dcore-lint -dcmm-lint -dno-debug-output -c ann01.hs -O -prof -auto-all -fvia-C -v0 >ann01.comp.stderr 2>&1
Compile failed (status 256) errors were:
ann01.hs:37:0:
Dynamic linking required, but this is a non-standard build (eg. prof).
You need to build the program twice: once the normal way, and then
in the desired way using -osuf to set the object file suffix.
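
The error message itself describes the workaround: annotations need object files GHCi can dynamically load, so a profiled build has to be done in two passes. A hedged sketch of what that looks like (file names and flags here are illustrative, not the exact testsuite invocation):

```
ghc -c ann01.hs                                # pass 1: normal objects, loadable for annotations
ghc -c ann01.hs -prof -auto-all -osuf p_o      # pass 2: profiled objects, kept apart via -osuf
```

The -osuf flag just changes the object-file suffix so the two sets of objects don't clobber each other.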

Tuesday, February 17, 2009

Thunderbirds are go

Alright, it's been a busy few days.

I've split the native code generator into architecture-specific modules, and chopped the SPARC part into bite-sized chunks. There are still some #ifdefs around for other architectures, but most of the native code generator is now compiled independently of the target architecture.

I think I've squashed most of the bugs as well, but I'll have to wait overnight for a complete test run to finish.

Here are some initial nofib results.

The tests compare:

  • My 3 GHz Pentium 4 desktop

  • My 1.6 GHz Core 2 Duo (Merom) MacBook Air

  • The 1.4 GHz SPARC T2


Take these results with a fair dose of salt: the libs on the T2 were built with -O2, but on the P4 and Merom I only used -O. I'll have some more comprehensive results in a few days. I'm not sure what's going on with "primetest"; I'll have to run it again.

In short, with the current compiler, a Haskell program on a single T2 thread runs about 5x slower than on a P4 desktop. Remember that the T2 is totally set up for parallel processing, and has none of the instruction reordering or multiple issue of the P4. If we scale for the clock rate difference between the P4 and T2, then a T2 thread runs about 2.3 times slower than a P4 thread, clock for clock. On the other hand, 16 threads (2 per core) can be active at once on the T2.
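
The clock-for-clock figure falls out of simple scaling. As a sanity check, using the 3 GHz P4 and 1.4 GHz T2 clock rates quoted above:

```python
# Observed: a single T2 thread runs ~5x slower (wall clock) than the 3 GHz P4.
p4_clock = 3.0   # GHz
t2_clock = 1.4   # GHz
slowdown = 5.0   # observed wall-clock ratio, T2 vs P4

# Normalise the wall-clock slowdown by the clock-rate ratio
# to compare how much work each machine does per cycle.
clock_ratio = p4_clock / t2_clock        # ~2.14
per_cycle_slowdown = slowdown / clock_ratio

print(round(per_cycle_slowdown, 1))      # ~2.3: the T2 does ~2.3x less per cycle
```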

Anyway, with regard to single-thread performance on the T2, I think we're losing out big-time due to our total lack of instruction reordering.

Here is a piece of code from the nqueens test from nofib.

.LcFW:
        add %i0,-20,%g1
        cmp %g1,%i2
        blu .LcFY
        nop
        add %i3,16,%i3                              <- dep
        cmp %i3,%i4                                 <-
        bgu .LcFY
        nop
        sethi %hi(stg_upd_frame_info),%g1           <- dep
        or %g1,%lo(stg_upd_frame_info),%g1          <-
        st %g1,[%i0-8]                              <-
        st %l1,[%i0-4]
        sethi %hi(sDJ_info),%g1                     <- dep
        or %g1,%lo(sDJ_info),%g1                    <-
        st %g1,[%i3-12]                             <-
        ld [%l1+12],%g1                             <- copy
        st %g1,[%i3-4]                              <-
        ld [%l1+16],%g1                             <- copy
        st %g1,[%i3]                                <-
        add %i3,-12,%g1
        st %g1,[%i0-16]
        ld [%l1+8],%g1                              <- copy
        st %g1,[%i0-12]                             <-
        sethi %hi(base_GHCziNum_zdf6_closure),%l2   <- dep
        or %l2,%lo(base_GHCziNum_zdf6_closure),%l2  <-
        sethi %hi(sDK_info),%g1                     <- dep
        or %g1,%lo(sDK_info),%g1                    <-
        st %g1,[%i0-20]
        add %i0,-20,%i0
        call base_GHCziNum_zdp1Num_info,0
        nop


A typical GHC compiled Haskell program spends most of its time copying data around memory. Such is the price of lazy evaluation, which we all know and love. The copies create a huge amount of memory traffic, and associated cache miss stalls. In addition, many of the instructions are directly dependent on the one just above them, which creates data hazard stalls.

At least for the T2, I expect that doing some basic instruction reordering will be a big win, so I'll try that first. Maybe some prefetching also, considering that we're moving so much data around memory.
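
To make the reordering idea concrete, here is a toy list scheduler in Python (my own illustration, nothing to do with GHC's internals): each instruction names the instructions it depends on, and the scheduler prefers to issue something independent of the instruction it just issued, so a consumer never sits immediately behind its producer when other work is ready.

```python
def schedule(instrs, deps):
    """Greedy list scheduling.
    instrs: instruction names in original program order.
    deps:   maps a name to the set of names it depends on.
    Picks any ready instruction, preferring one that does not
    depend on the instruction issued last, to hide result latency."""
    done, order, last = set(), [], None
    remaining = list(instrs)
    while remaining:
        # An instruction is ready once all its producers have issued.
        ready = [i for i in remaining if deps.get(i, set()) <= done]
        # Prefer a ready instruction independent of the previous one.
        pick = next((i for i in ready if last not in deps.get(i, set())),
                    ready[0])
        order.append(pick)
        remaining.remove(pick)
        done.add(pick)
        last = pick
    return order

# Two independent sethi/or address-constant chains, like those above.
instrs = ["sethi_a", "or_a", "sethi_b", "or_b"]
deps = {"or_a": {"sethi_a"}, "or_b": {"sethi_b"}}
print(schedule(instrs, deps))
# The two chains come out interleaved:
#   ['sethi_a', 'sethi_b', 'or_a', 'or_b']
```

A real scheduler would also model per-instruction latencies, branch delay slots, and register pressure, but even this greedy pass breaks up the back-to-back dependent pairs marked "dep" in the listing above.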