The Solaris psrinfo command informs me that the processor operates at 1165 MHz. The OS doesn't seem to scale the clock frequency with load, so I've assumed that it's stable during the tests. The processor has 8 cores, and each core can issue two instructions per clock, each from a different thread. This gives us a peak theoretical issue rate of 8 * 2 * 1165 * 10^6 = 18.64 Gig instructions/sec (18.64 * 10^9). Of course, that's assuming we're just moving data between registers. For some applications we'll expect to run out of memory or IO bandwidth before we reach the peak issue rate.
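The arithmetic behind that peak figure can be written down as a small Python sketch. The function name and parameter names are my own, chosen for illustration, and the inputs are just the numbers quoted above; it says nothing about what the hardware actually sustains.

```python
def peak_issue_rate(cores, issues_per_clock, clock_hz):
    """Theoretical peak issue rate in instructions per second.

    Assumes every core issues its maximum number of instructions
    on every clock cycle, which real programs rarely manage.
    """
    return cores * issues_per_clock * clock_hz


# Figures from the text: 8 cores, 2 instructions per clock, 1165 MHz.
rate = peak_issue_rate(cores=8, issues_per_clock=2, clock_hz=1165e6)
print(f"{rate / 1e9:.2f} Gig instructions/sec")  # prints "18.64 Gig instructions/sec"
```

Changing any one of the three inputs shows how the ceiling moves, for example if the clock turned out not to be stable at 1165 MHz after all.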
We have 64 hardware threads, so using -N64 Haskell threads (capabilities) gives us the shortest run time. I also ran the program with -N128, and with other values between -N64 and -N128, to confirm there's nothing to be gained by adding more capabilities than we have hardware threads.
For -N64 we get a maximum issue rate of about 10 Gig instructions/sec. That is about 54% of the peak rate, which is pretty good, but it also indicates that we could gain something by improving the native code generator to do instruction reordering or prefetching. If the program were achieving the peak rate then there would be no free cycles left to use, so the only way to improve run time would be to reduce the number of instructions.
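To make the 54% figure concrete, here is a minimal Python sketch that compares a measured issue rate against the theoretical peak. The helper name is hypothetical, and the measured value is the rough 10 Gig instructions/sec read off the plot, so the result is only as precise as that reading.

```python
def fraction_of_peak(measured_rate, peak_rate):
    """Fraction of the theoretical peak issue rate actually achieved."""
    if peak_rate <= 0:
        raise ValueError("peak_rate must be positive")
    return measured_rate / peak_rate


# Approximate measured rate for -N64, and the theoretical peak from
# 8 cores * 2 instructions per clock * 1165 MHz.
measured = 10e9
peak = 8 * 2 * 1165e6

used = fraction_of_peak(measured, peak)
print(f"{used:.0%} of peak used, {1 - used:.0%} of issue slots idle")
```

The idle fraction is the headroom that better instruction scheduling or prefetching could, in principle, recover without reducing the instruction count.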
Over the next few days I'll run some C micro-benchmarks to confirm that my figure for the peak issue rate is correct.
There is something going on with matmult that I don't fully understand yet. The four plots below were taken in quick succession on a quiet machine. Note how the run time varies between 10 and 13 seconds, and that the second and fourth plots are quite similar but different from the other two. Perhaps the initial choice of which cores the threads are placed on is causing locality effects, but that wouldn't explain why those plots also seem to execute more instructions.
If another process were actively running on the machine and contributing to the instruction count, then the two similar plots wouldn't be so similar. In these plots we can see an initial startup period, then seven distinct bursts of activity. Those bursts aren't garbage collection. The GHC runtime system reports that the major GCs are taking less than 0.1s each, but the bursts in the plot last around 1s each. More investigation required.