Monday, March 23, 2009

Pinning threads to cores

The following graphs show the nofib/parallel benchmarks sumeuler, parfib and mandel, run with a varying number of GHC threads (+RTS -N1 .. -N14).
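
For concreteness, here is a minimal sketch in the style of parfib (not the exact nofib code), assuming the par and pseq combinators from the parallel package's Control.Parallel; the spark threshold and the command lines in the comments are illustrative only.

  import Control.Parallel (par, pseq)
  import System.Environment (getArgs)

  -- Plain sequential Fibonacci, used below the spark threshold.
  seqfib :: Int -> Integer
  seqfib n
    | n < 2     = 1
    | otherwise = seqfib (n - 1) + seqfib (n - 2)

  -- Spark one branch of the recursion, evaluate the other in the
  -- current thread, then combine the results.
  parfib :: Int -> Integer
  parfib n
    | n < 25    = seqfib n
    | otherwise = x `par` (y `pseq` x + y)
    where
      x = parfib (n - 1)
      y = parfib (n - 2)

  main :: IO ()
  main = do
    [n] <- fmap (map read) getArgs
    print (parfib n)

  -- Build with:     ghc -O2 -threaded ParFib.hs
  -- Run with, e.g.: ./ParFib 40 +RTS -N8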

In the plots marked "8 thread", the program was constrained to run on a single core. Each core has 8 hardware threads and can issue an instruction from two different threads each cycle; it has 2 ALUs, 1 Load/Store unit, an 8 kB L1 data cache and a 16 kB L1 instruction cache. In this mode the threads must share this limited hardware, so if the two active threads need the Load/Store unit at the same time, one of them stalls. However, the inter-thread communication overhead is low, as the threads can communicate via the shared L1 cache.

In the plots marked "8 core", the program was run on a processor set consisting of the first hardware thread of each core. This means the program effectively has 8 ALUs, 8 Load/Store units, 64 kB of L1 data cache and 128 kB of L1 instruction cache available. The T2 does not issue multiple instructions from the same thread in a single cycle, so the second ALU of each core is left unused. In this mode the program has more hardware available, but the communication overhead is higher, as the threads must communicate via the L2 cache.

The question is how well the T2's hardware threading multiplexes GHC threads (capabilities). In the "8 core" mode, the program has more than 4x the hardware available, and will not suffer contention over limited Load/Store units. However, it is not expected to be able to issue 4x the number of instructions, due to pipeline stalls and cache latency.
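
As a reminder, each capability runs on one OS thread at a time, and the operating system schedules that OS thread onto a hardware thread; the RTS multiplexes Haskell threads over the capabilities. Here is a small sketch of placing one Haskell thread on each capability, assuming a base library that provides forkOn and getNumCapabilities (older GHCs spelled these forkOnIO and numCapabilities).

  import Control.Concurrent
  import Control.Monad (forM_)

  main :: IO ()
  main = do
    caps <- getNumCapabilities        -- set by +RTS -N<caps>
    done <- newEmptyMVar
    -- forkOn fixes each Haskell thread to one capability; the OS still
    -- decides which hardware thread that capability's OS thread runs on.
    forM_ [0 .. caps - 1] $ \c ->
      forkOn c (worker c >> putMVar done ())
    forM_ [1 .. caps] $ \_ -> takeMVar done

  worker :: Int -> IO ()
  worker c = putStrLn ("running on capability " ++ show c)
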
For the runtime graph, both axes use log scales. On log-log axes, perfect scaling (runtime proportional to 1/threads) appears as a straight line of slope -1, so the fact that the plots for the "8 core" mode are (mostly) straight means that performance is scaling (almost) perfectly: double the threads, halve the runtime. In all cases the program performs better in the "8 core" mode, so the extra hardware available more than makes up for the higher communication overhead. The raw run times for 8 GHC threads are as follows:

prog       runtime "8 thread"   runtime "8 core"   speedup
mandel     25.7                 20.5               1.25
parfib     14.0                  7.4               1.89
sumeuler    4.0                  3.2               1.25

From these numbers we can see that in "8 core" mode, even though the program has more than 4x the hardware available, this doesn't translate into a 4x speedup. Each thread spends the majority of its time stalled while waiting for the cache to respond, and in "8 core" mode the cores are idle for much of the time.

This is less true for parfib. Note from the max issue rate graph that parfib achieves the peak single-core issue rate of 2.33 billion instructions per second with only 6 threads. This implies the program is integer bound, not load/store bound: it manages to issue two instructions on every cycle, and the overall execution of the program doesn't stall while waiting for memory.

In all cases, adding more GHC threads than hardware threads is a bad idea. Not only does the run time increase, but the total number of instructions executed by mandel and sumeuler increases dramatically. These extra instructions will be due to context switching between the Haskell threads; they contribute to the overall load and will slow down other jobs running on the same machine.
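
One way to avoid over-subscribing the machine is to size the number of capabilities to the hardware at program startup. Here is a sketch, assuming a GHC newer than the one used for these measurements, since GHC.Conc only gained getNumProcessors and setNumCapabilities in later releases.

  import GHC.Conc (getNumProcessors, setNumCapabilities)

  main :: IO ()
  main = do
    procs <- getNumProcessors
    -- Use at most one capability per hardware thread, to avoid the
    -- extra context-switching work described above.
    setNumCapabilities procs
    putStrLn ("using " ++ show procs ++ " capabilities")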