Thursday, March 19, 2009

Peak issue rate is 18.64 Gig instrs/sec

This plot confirms the number for the peak instruction issue rate that I presented yesterday. On the T2 at 1165 MHz we get 18.64 Gig instructions/sec.

The test program spawns 64 threads, each of which executes the following loop 10^8 times:

here:
add %o0, %o1, %o2
add %o1, %o2, %o3
add %o3, %o4, %o5
add %o5, %o6, %o7

sub %l3, 1, %l3
subx %l2, 0, %l2

cmp %g0, %l3
bne here
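
As a rough cross-check, the issue rate can be related to the loop's instruction count. The sketch below is in Python rather than SPARC assembly, and it rests on two assumptions: that each of the 64 threads runs the loop 10^8 times, and that one iteration issues exactly the 8 instructions listed above (a branch delay-slot instruction, if the real program has one, would raise that count). The function names are made up for illustration.

```python
def total_instructions(threads, iterations, instrs_per_iter):
    """Instructions issued by all threads over the whole run."""
    return threads * iterations * instrs_per_iter


def issue_rate(total_instrs, elapsed_seconds):
    """Average machine-wide issue rate in instructions per second."""
    return total_instrs / elapsed_seconds


def seconds_at_rate(total_instrs, rate):
    """How long the run would take if it sustained the given issue rate."""
    return total_instrs / rate


# Assumed: 8 instructions per iteration, counted from the listing above.
total = total_instructions(threads=64, iterations=10**8, instrs_per_iter=8)
print(total)
print(round(seconds_at_rate(total, 18.64e9), 2))
```

Under those assumptions the test issues 5.12 * 10^10 instructions, which would take about 2.75 seconds at 18.64 Gig instructions/sec. Feeding a measured wall-clock time into issue_rate gives the figure to compare against the plot.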

Wednesday, March 18, 2009

Peak performance

These are plots of machine activity when running benchmarks from nofib/parallel on the T2. The Instr_cnt plot shows the machine-wide instruction issue rate at 100ms resolution. The Data_miss plot shows the number of loads that miss the L1 data cache, multiplied by 20. This scaling factor is the average number of cycles needed to fetch data from L2, according to Darryl's article on Calculating Processor Utilization.
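
The scaling applied to the Data_miss plot can be written out as a small Python sketch. The factor of 20 is the one quoted above; the function name and the sample miss count are invented for illustration and are not taken from the plots.

```python
# Average cycles to fetch data from L2, as quoted above.
L2_FETCH_CYCLES = 20


def scaled_data_miss(l1_data_misses, cycles_per_miss=L2_FETCH_CYCLES):
    """Convert a count of L1 data cache load misses into estimated cycles."""
    return l1_data_misses * cycles_per_miss


# Hypothetical sample: one million load misses in one sampling interval.
print(scaled_data_miss(1_000_000))
```

Scaling the miss count this way puts it in units of cycles, so it can be read against the instruction count on roughly the same footing.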

The Solaris psrinfo command informs me that the processor operates at 1165 MHz. The OS doesn't seem to scale the clock frequency with respect to load, so I've assumed that it's stable during the tests. The processor has 8 cores, with each core being able to issue two instructions per clock, each from a different thread. This gives us a peak theoretical issue rate of 18.64 Gig instructions/sec (18.64 * 10^9). Of course, that's assuming we're just moving data between registers. For some applications we'll expect to run out of memory or IO bandwidth before we reach the peak issue rate.
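
The arithmetic behind the peak figure is cores times issue width times clock frequency. Here it is as a minimal Python sketch, using only the numbers given above; the function and parameter names are my own.

```python
def peak_issue_rate(cores, issues_per_clock, clock_hz):
    """Theoretical peak instruction issue rate in instructions per second."""
    return cores * issues_per_clock * clock_hz


# T2 figures from the text: 8 cores, 2 instructions per clock, 1165 MHz.
rate = peak_issue_rate(cores=8, issues_per_clock=2, clock_hz=1_165_000_000)
print(rate)        # instructions per second
print(rate / 1e9)  # in Gig instructions/sec
```

Keeping the three factors as parameters makes it easy to redo the calculation for a different clock rate or core count.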


These are plots of sumeuler running with 16, 32, 64, and 128 threads. The post from a few weeks ago shows that sumeuler scales perfectly, and that is reinforced here. As a rough approximation, when we double the number of threads, the issue rate doubles and the run time halves. The idle time before and after each run corresponds to me flipping between terminals to control the logger and to run the program.

We have 64 hardware threads, so using -N64 Haskell threads (capabilities) gives us the shortest run time. I also ran the program with -N128 to confirm there's nothing to be gained by adding more capabilities than we have hardware threads (I tried other values between -N64 and -N128 as well).

For -N64 we get a maximum issue rate of about 10 Gig instructions/sec. That is about 54% of the peak rate, which is pretty good, but it also indicates that we could gain something by improving the native code generator to do instruction reordering or prefetching. If the program were achieving the peak rate then there would be no free cycles left to use, so the only way to improve run time would be to reduce the number of instructions.
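
The 54% figure is just the observed rate divided by the theoretical peak. A small Python sketch of that calculation follows; the function names are hypothetical, and the 10 Gig figure is the approximate value read off the plot.

```python
def fraction_of_peak(observed_rate, peak_rate):
    """Share of the theoretical peak issue rate actually achieved."""
    return observed_rate / peak_rate


def unused_issue_slots(observed_rate, peak_rate):
    """Issue slots per second left idle at the observed rate."""
    return peak_rate - observed_rate


observed = 10e9  # approximate maximum for -N64, read from the plot
peak = 18.64e9   # theoretical peak for the T2 at 1165 MHz

print(round(fraction_of_peak(observed, peak) * 100))  # percent of peak
print(unused_issue_slots(observed, peak) / 1e9)       # Gig slots/sec idle
```

The second number is the headroom that instruction reordering or prefetching could in principle recover.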

Over the next few days I'll run some C micro-benchmarks to confirm that my figure for the peak issue rate is correct.

mandel also scales well. This program sparks a new thread for each pixel in the output image. The time it takes to calculate the color of each pixel can vary by a factor of 10 or more, depending on where it is in the image. I imagine this is why the plot for the instruction issue rate is noisier than for sumeuler. The maximum issue rate is about the same.


There is something going on with matmult that I don't fully understand yet. The four plots below were taken in quick succession on a quiet machine. Note how the run time varies between 10 and 13 seconds, and that the second and fourth plots are quite similar but different from the other two. Perhaps the initial choice of which cores the threads are placed on is causing locality effects, but that wouldn't explain why those plots also seem to execute more instructions.

If another process were actively running on the machine, and contributing to the instruction count, then the two similar plots wouldn't be so similar. In these plots we can see an initial startup period, then seven distinct bursts of activity. Those bursts aren't garbage collection. The GHC runtime system reports that the major GCs are taking < 0.1s each, but the bursts in the plot last around 1s each. More investigation required.