Thursday, January 22, 2009

Wait and perform

Spent time reading about hardware performance counters while waiting for builds and test runs. Realised that I had built the RTS with -fvia-c before, so there are still a couple of NCG things to fix before we can do a full stage2 -fasm build.

The RTS code uses a machine op MO_WriteBarrier that compiles to a nop on i386, but uses the LWSYNC instruction on PPC. I still have to read about that and work out what to do for sparc.

Also spent some time fighting a configure problem. The configure script can detect whether the linker supports the -x flag, which tells it whether to export local symbol names. The configure script looks at the ld in the current path, but ghc calls gcc to do the assembly. That gcc will use whatever ld /it/ was configured with, and on mavericks the Solaris linker doesn't support -x. (epic sigh). In the same vein, the Solaris assembler doesn't support the .space directive. Fixed that and my builds are running again.

Started off a full run of the testsuite with the stage1 compiler + via-c rts, but that'll take all day and all night. num012 is failing with 16 bit integer arithmetic, but that's probably no big deal.

The hardware performance counters on the T2 are exercised with the cputrack util, which I spent some time learning about. You can get counts of instrs executed, branches taken, cache misses etc. There are lots of measurements available, but you can only collect two at a time - because there are only two PICs (performance instrumentation counters) on the T2.

For example:

cputrack -t -T 10 -c Instr_cnt,DC_miss ./boyer 200
time lwp event %tick pic0 pic1
3.409 1 exit 3954145530 1730642411 32277574

cputrack -t -T 10 -c Instr_ld,Instr_st ./boyer 200
time lwp event %tick pic0 pic1
3.413 1 exit 3957292767 286778987 219450322

cputrack -t -T 10 -c Br_completed,Br_taken ./boyer 200
time lwp event %tick pic0 pic1
3.411 1 exit 3952535075 281792355 190004907

So, it looks like we're getting about 2.3 clocks per instruction (cpi). About 30% of instructions executed are loads or stores. If those loads and stores account for all of the L1 data cache misses then about 6% of them miss. That might be wrong though - I'll have to work out whether the info tables are stored in the data or instr cache. In any event, another 16% of instructions are branches, of which 67% are taken.
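
As a rough check on those figures, here is a small Python sketch that recomputes the ratios from the counter values in the three cputrack runs above. The variable names are mine, and it assumes the instruction count from the first run can stand in for the other two runs, since their %tick values are within about 0.1% of each other.

```python
# Counter values copied from the cputrack output above.
ticks = 3954145530         # %tick from the Instr_cnt,DC_miss run
instr_cnt = 1730642411     # Instr_cnt
dc_miss = 32277574         # DC_miss
instr_ld = 286778987       # Instr_ld
instr_st = 219450322       # Instr_st
br_completed = 281792355   # Br_completed
br_taken = 190004907       # Br_taken

# Clocks per instruction.
cpi = ticks / instr_cnt

# Fraction of executed instructions that are loads or stores.
mem_ops = instr_ld + instr_st
mem_fraction = mem_ops / instr_cnt

# Assumption: every L1 data cache miss is caused by a load or store.
miss_rate = dc_miss / mem_ops

# Fraction of instructions that are branches, and how many of those are taken.
branch_fraction = br_completed / instr_cnt
taken_fraction = br_taken / br_completed

print(f"cpi             {cpi:.2f}")
print(f"loads/stores    {mem_fraction:.1%}")
print(f"D-cache misses  {miss_rate:.1%} of loads/stores")
print(f"branches        {branch_fraction:.1%}")
print(f"branches taken  {taken_fraction:.1%}")
```

The counters for loads, stores and branches come from separate runs of the same program, so the ratios are only as good as the run-to-run repeatability.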

I'll post some nice graphs and whatnot once my nofib run has gone through. Should probably add the performance counters to nofib-analyse as well, in the place of valgrind on sparc / solaris.

Wednesday, January 21, 2009

The Strap

Implemented table-based switch, and fixed a problem when converting float to integer formats. The FSTOI instruction leaves its result in a float register, but in integer format. On at least V9 you then have to copy the value to mem and back to get it into an actual int register. Later architectures have instructions to do this without going via mem, but we're sticking with V9 for now. Perhaps we could get some speedup for FP code by using the later instruction set extensions (Vis2.0?).

The genCCall problem was that calls to out-of-line float ops like sin and exp were disabled for 32 bit floats, maybe because other things were broken before.

Also fixed a 64 bit FFI problem. Closures allocated by the storage manager are 32bit aligned, but the RTS was trying to do misaligned read/writes of 64bit words. The standard SPARC load/store instructions don't support misaligned read/writes, so I had to break them up into parts.
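
The idea of breaking a 64 bit access into parts can be modelled in a few lines of Python. This is only an illustration, not the RTS code: the helper names are hypothetical, memory is a plain bytearray, and it assumes big-endian byte order as on SPARC, with 32 bit accesses required to be 4-byte aligned.

```python
import struct

def read_word32(mem, addr):
    """Read a big-endian 32 bit word; refuses misaligned addresses."""
    if addr % 4 != 0:
        raise ValueError("misaligned 32 bit read")
    return struct.unpack_from(">I", mem, addr)[0]

def write_word32(mem, addr, value):
    """Write a big-endian 32 bit word; refuses misaligned addresses."""
    if addr % 4 != 0:
        raise ValueError("misaligned 32 bit write")
    struct.pack_into(">I", mem, addr, value & 0xFFFFFFFF)

def read_word64_split(mem, addr):
    """Read a 64 bit value that is only 4-byte aligned as two 32 bit reads."""
    hi = read_word32(mem, addr)        # most significant half comes first
    lo = read_word32(mem, addr + 4)
    return (hi << 32) | lo

def write_word64_split(mem, addr, value):
    """Write a 64 bit value that is only 4-byte aligned as two 32 bit writes."""
    write_word32(mem, addr, value >> 32)
    write_word32(mem, addr + 4, value)

mem = bytearray(16)
# Address 4 is 4-byte aligned but not 8-byte aligned.
write_word64_split(mem, 4, 0x0123456789ABCDEF)
print(hex(read_word64_split(mem, 4)))
```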

Looking good. A few tests still fail, but they fail the same way with all ways. I'm guessing that they are problems with Solaris or other environment stuff, and not the NCG.

I tried to build the stage2 compiler with -fasm, but I made the foolish mistake of pulling from the head beforehand, which broke the build. Did a full distclean, but will have to leave it overnight.

OVERALL SUMMARY for test run started at Wednesday, 21 January 2009 9:05:20 AM EST
2283 total tests, which gave rise to
8531 test cases, of which
0 caused framework failures
7429 were skipped

1047 expected passes
32 expected failures
0 unexpected passes
23 unexpected failures

Unexpected failures:

cvh_unboxing(optasm) -- fixed
seward-space-leak(optasm) -- fixed
1916(optasm) -- fixed
expfloat(optasm) -- fixed
fun_insts(optasm) -- fixed
2594(optasm) -- fixed
ffi019(optasm) -- fixed
arith011(optasm) -- fixed

barton-mangler-bug(optasm) -- optc: timeout. others: ok
joao-circular(optasm) -- timeout

---- noncrash fail all ways ----

Tuesday, January 20, 2009

The Grind

Decided to address some of the other outstanding bugs before starting on genSwitch. I don't want other bugs to confuse the issue.

Did a full run of the testsuite with optasm. I added the triage comments manually.

OVERALL SUMMARY for test run started at Thursday, 15 January 2009 7:47:05 PM EST
2283 total tests, which gave rise to
8531 test cases, of which
0 caused framework failures
7429 were skipped

1025 expected passes
32 expected failures
0 unexpected passes
45 unexpected failures

Unexpected failures:
arith004(optasm) -- hpc: iselExpr64 panic. optasm: getRegister panic.
num012(optasm) -- optasm: ppr match fail. others: wrong output
process007(optasm) -- hpc: iselExpr64. others: wrong output
time003(optasm) -- hpc: iselExpr64 panic. optasm: getRegister panic.
bits(optasm) -- hpc: iselExpr64 panic. optasm: getRegister panic.
tough(optasm) -- iselExpr64
hpc001(optasm) -- iselExpr64
hpc_fork(optasm) -- iselExpr64
tc213(optasm) -- getRegister

expfloat(optasm) -- genCCall can not reduce
fun_insts(optasm) -- genCCall can not reduce

num013(optasm) -- iselExpr64, match fail
2388(optasm) -- match fail
enum02(optasm) -- match fail
enum03(optasm) -- match fail
arith011(optasm) -- match fail
arith017(optasm) -- match fail
ffi017(optasm) -- match fail C types
ffi018(optasm) -- match fail 64bit ffi
ffi019(optasm) -- match fail 64bit ffi

1916(optasm) -- invalid register

2594(optasm) -- all ways: segv 64bit ffi

seward-space-leak(optasm) -- genSwitch
simpl007(optasm) -- genSwitch
syn-perf(optasm) -- genSwitch
tup001(optasm) -- genSwitch
andy_cherry(optasm) -- genSwitch
arrowrun001(optasm) -- genSwitch
arrowrun004(optasm) -- genSwitch
barton-mangler-bug(optasm) -- genSwitch
cg054(optasm) -- genSwitch
cvh_unboxing(optasm) -- genSwitch
drv005(optasm) -- genSwitch
drv006(optasm) -- genSwitch
drvrun014(optasm) -- genSwitch
joao-circular(optasm) -- genSwitch
jtod_circint(optasm) -- genSwitch

hClose002(optasm) -- all ways: same wrong output
user001(optasm) -- all ways: same wrong output
2910(optasm) -- all ways: same wrong output
tcrun007(optasm) -- all ways: missing import
T2914(optasm) -- all ways: type error
annrun01(optasm) -- all ways: unknown package ghc
ann01(optasm) -- all ways: no TH in stage1 all ways
haddockA028(optasm) -- all ways: test wibble

Looking into arith004. The code to generate integer remainder / divide instructions was missing. On further investigation, old SPARC implementations didn't have hardware support for this. GHC used to call out to a library. The SPARC T2 has hardware divide, but you have to compute remainders using div/mul/sub. Added code to do so. Not sure if we still want to maintain the software mul/div path - but I'll worry about that when the rest is fixed and refactored. Also fixed code to generate 64 bit operations on 32 bit SPARC, which was the iselExpr64 problem.
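
The div/mul/sub trick for remainders is easy to sketch in Python. This is a model of the arithmetic only, not the code the NCG emits: the function names are hypothetical, and it assumes the hardware signed divide truncates toward zero, which is also how Haskell's quot and rem behave. Python's own // rounds toward negative infinity, so the sketch builds a truncating divide first.

```python
def trunc_div(a, b):
    """Signed integer division rounding toward zero (assumed divide behaviour)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

def rem_via_div_mul_sub(a, b):
    """Remainder without a remainder instruction: a - (a / b) * b."""
    q = trunc_div(a, b)   # div
    p = q * b             # mul
    return a - p          # sub

for a, b in [(7, 3), (-7, 3), (7, -3), (-7, -3), (6, 3)]:
    print(a, b, rem_via_div_mul_sub(a, b))
```

With a truncating divide the remainder takes the sign of the dividend, so -7 and 3 give -1 rather than the 2 that Python's % operator would return.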

arith004 -- fixed
bits -- fixed
tc213 -- fixed
arith012 -- fixed
arith017 -- fixed
ffi017 -- fixed
ffi018 -- fixed
enum02 -- fixed
enum03 -- fixed

num012 -- invalid register 64bit

time003 -- genSwitch

ffi019(optasm) -- all ways: bus error

process007 -- all ways: same wrong output
tough -- all ways: same wrong output
hpc001 -- all ways: same wrong output
hpc_fork -- all ways: same wrong output