Here are breakdowns of the total number of instructions executed on a Pentium-M processor and the SPARC T2 for four of the nofib/parallel benchmarks. The vertical axis is gig-instructions (x10^9). The benchmarks were run with just a single thread (+RTS -N1). The figures were recorded using hardware performance counters.
SPARC/Solaris provides the cpustat and cputrack commands which allow direct access to the performance counters, as long as you're the super-user. On x86/Linux I originally tried using valgrind to count the instructions, but it turned out to be too slow. Valgrind is basically an x86 CPU simulator, the simulation imposes a 50 - 200x slowdown, and I didn't want to wait all week. I ended up using a machine with a Linux kernel patched with the perfex/perfctr library, generously provided by Daniel Frampton's GC research group.
Recall that Pentium-M architecture does an on-the-fly translation from x86 CISC instructions to RISC micro-operations (uops). On these benchmarks, about 10% more uops are generated than the number of x86 instructions. The SPARC versions of the benchmarks use about 25% more instructions than Pentium-M uops. This is partly because the SPARC needs two instructions to load a 32 bit immediate value, which is something GHC does a lot of. For example, here is some code which builds part of a stack frame:
sethi %hi(_module_registered),%g2
or %g2,%lo(_module_registered),%g2
st %g1,[%g2]
add %i0,-4,%i0
sethi %hi(__stginit_base_Prelude_),%g1
or %g1,%lo(__stginit_base_Prelude_),%g1
st %g1,[%i0]
add %i0,-4,%i0
sethi %hi(__stginit_base_SystemziEnvironment_),%g1
or %g1,%lo(__stginit_base_SystemziEnvironment_),%g1
st %g1,[%i0]
SPARC instructions are all 32 bits wide, and most ALU ops only have room for a 13 bit immediate value. The usual way to load a 32 bit value is to use the sethi instruction to set the top 22 bits, then or in the bottom few. Lack of room for immediate values hurts continuation passing styles of compilation more than the traditional loop+branch styles, as we can't use PC relative addressing as much. Still, on the T2, dependent ALU ops don't cause pipeline stalls, so it's not a big penalty.
The Pentium-M doesn't support x86_64, so we have to use the x86 instruction set, which doesn't have many registers. Due to this, the Pentium-M code performs about twice the number of memory accesses then the SPARC code. The number of branch instructions is about the same, as expected.
All over, on the x86 about 50% of the instructions are loads or stores, and 18% are branches. On the SPARC, about 20% are loads or stores, and about 16% branches.
In other news, I've been spending some time looking through the RTS code for the scheduler, and the way black-holes are handled. The paper Haskell on a Shared-Memory Multiprocessor discusses the mechanism that GHC uses to avoid fine-grained locking of thunks during parallel evaluation. On x86 this locking is expensive, but it should be less so on the T2. I'm planning to repeat the experiment mentioned in that paper to determine whether doing the locking is feasible. Doing so would eliminate the need for the complicated stack tracing scheme described in that paper to kill of duplicated computation (at least on SPARC), and may improve the latency of parallel code. Stay tuned.