Thursday, January 22, 2009

Wait and perform

Spent time reading about hardware performance counters while waiting for builds and test runs. Realised that I had built the RTS with -fvia-c before, so there are still a couple of NCG things to fix before we can do a full stage2 -fasm build.

The RTS code uses a machine op MO_WriteBarrier that compiles to a nop on i386, but uses the LWSYNC instruction on PPC. I still have to read about that and work out what to do for sparc.

Also spent some time fighting a configure problem. The configure script can detect whether the linker supports the -x flag, which tells it whether to export local symbol names. The configure script looks at the ld in the current path, but ghc calls gcc to do the assembly. That gcc will use whatever ld /it/ was configured with, and on mavericks the Solaris linker doesn't support -x. (epic sigh). In the same vein, the Solaris assembler doesn't support the .space directive. Fixed that and my builds are running again.

Started off a full run of the testsuite with the stage1 compiler + via-c rts, but that'll take all day and all night. num012 is failing with 16 bit integer arithmetic, but that's probably no big deal.

The hardware performance counters on the T2 are exercised with the cputrack util, which I spent some time learning about. You can get counts of instrs executed, branches taken, cache misses etc. There are lots of measurements available, but you can only collect two at a time - because there are only two PICs (performance instrumentation counters) on the T2.

For example:

benl@mavericks:~/devel/ghc/ghc-HEAD-work/nofib/spectral/boyer>
cputrack -t -T 10 -c Instr_cnt,DC_miss ./boyer 200
time lwp event %tick pic0 pic1
3.409 1 exit 3954145530 1730642411 32277574
True

benl@mavericks:~/devel/ghc/ghc-HEAD-work/nofib/spectral/boyer>
cputrack -t -T 10 -c Instr_ld,Instr_st ./boyer 200
time lwp event %tick pic0 pic1
3.413 1 exit 3957292767 286778987 219450322
True

benl@mavericks:~/devel/ghc/ghc-HEAD-work/nofib/spectral/boyer>
cputrack -t -T 10 -c Br_completed,Br_taken ./boyer 200
time lwp event %tick pic0 pic1
3.411 1 exit 3952535075 281792355 190004907
True

So, it looks like we're getting about 2.3 clocks per instruction (cpi). About 30% of instructions executed are loads or stores. If those loads and stores account for all the L1 data cache misses then about 6% of them miss. That might be wrong though - I'll have to work out whether the info tables are stored in the data or instr cache. In any event, another 16% of instructions are branches, of which 67% are taken.
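The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch: the counter values are copied from the three cputrack runs above, the names (`derived_metrics` and the variables) are made up for illustration, and the miss rate assumes that every D-cache miss comes from a load or store, which is exactly the assumption in doubt. The counts also come from three separate runs, so the ratios are approximate.

```python
# Counter values copied from the three cputrack runs above.
ticks        = 3954145530   # %tick from the Instr_cnt,DC_miss run
instr_cnt    = 1730642411   # Instr_cnt
dc_miss      = 32277574     # DC_miss
instr_ld     = 286778987    # Instr_ld
instr_st     = 219450322    # Instr_st
br_completed = 281792355    # Br_completed
br_taken     = 190004907    # Br_taken


def derived_metrics():
    """Return the ratios quoted in the text (hypothetical helper)."""
    mem_ops = instr_ld + instr_st
    return {
        "cpi": ticks / instr_cnt,
        "mem_fraction": mem_ops / instr_cnt,
        # Assumption: all D-cache misses are caused by loads or stores.
        "dc_miss_rate": dc_miss / mem_ops,
        "branch_fraction": br_completed / instr_cnt,
        "taken_fraction": br_taken / br_completed,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name:16s} {value:.3f}")
```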

I'll post some nice graphs and whatnot once my nofib run has gone through. Should probably add the performance counters to nofib-analyse as well, in the place of valgrind on sparc / solaris.
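As a rough idea of what collecting these counters from a script could look like, here is a Python sketch. It is not how nofib-analyse works; the function names are hypothetical, the cputrack flags are just the ones used in the runs above, and the parser assumes the six-column "exit" line format shown in those transcripts. Since only two counters fit in one run, the events are split into pairs and each pair gets its own run.

```python
def event_pairs(events):
    """Split event names into consecutive pairs, one pair per cputrack run."""
    if len(events) % 2 != 0:
        raise ValueError("need an even number of events, two per run")
    return [(events[i], events[i + 1]) for i in range(0, len(events), 2)]


def cputrack_command(pair, program_args):
    """Build the argv for one run, with the same flags as the runs above."""
    return ["cputrack", "-t", "-T", "10", "-c", ",".join(pair)] + list(program_args)


def parse_exit_line(output, pair):
    """Pull the counts out of the 'exit' line of cputrack -t output.

    Assumes the columns are: time lwp event %tick pic0 pic1.
    Other lines (the header, the program's own output) are skipped.
    """
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[2] == "exit":
            tick, pic0, pic1 = (int(f) for f in fields[3:])
            return {"time": float(fields[0]), "%tick": tick,
                    pair[0]: pic0, pair[1]: pic1}
    raise ValueError("no exit line found in cputrack output")


if __name__ == "__main__":
    events = ["Instr_cnt", "DC_miss", "Instr_ld", "Instr_st",
              "Br_completed", "Br_taken"]
    for pair in event_pairs(events):
        print(" ".join(cputrack_command(pair, ["./boyer", "200"])))
```

Each printed command line could then be run with something like subprocess and its output fed to `parse_exit_line`.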

1 comment:

  1. Not having seen the intended usage of MO_WriteBarrier, I'm going to nonetheless speculate that this will be a no-op under TSO (certainly true if this operation is used *before* a write to ensure consistency); if it is not, a membar #StoreLoad will be sufficient (to ensure that the write occurs before subsequent loads).