After collecting the benchmarks in the last post, we've reached the midpoint of the project. The various stakeholders, interested parties and I then had an email discussion about what to do next, a summary of which is:
- We're pretty happy with the performance of the highly threaded benchmarks like sumeuler, but others such as matmult could use some work.
- Darryl ran some tests that confirmed there is indeed a stall cost for load instructions followed by dependent ops that use the loaded value.
- Although doing instruction reordering would reduce the cost, the overall throughput of the machine probably doesn't suffer much for highly threaded code. The T2 is designed to handle stalls well, as long as there is another useful thread to switch to.
- We considered targeting the intermediate representation (IR) used in Sun's C compiler. That would do code generation better than we ever could, but the IR itself isn't publicly documented. We might be able to reverse engineer it, but the end product would rot when the IR changes.
- Targeting LLVM is another option, but that's a longer term project, and not SPARC specific. Manuel mentioned on IRC that he has someone looking into targeting LLVM.
- Doing our own instruction reordering is possible too, but again it's not SPARC specific, and it would be subsumed by a prospective LLVM port.
- Performing prefetching might help for single threaded code, because GHC spends a fair amount of time copying data between the stack and the heap. The T2 does no hardware prefetching.
- We decided it'd be better to spend the rest of the time looking into how well we can exploit the features specific to the T2, instead of focusing on the performance of single threaded code. Manuel posted some initial results for DPH showing the T2 running about as well as an eight core Xeon machine for a memory-bound benchmark, so that'll do for now.
- First of all, get some profiles together comparing parfib and matmult running on the T2 and x86. This should give us an idea of where the time is going.
- An interesting feature of the T2 is the low thread synchronization overhead. This might support per-thread thunk locking instead of the complex machinery described in Haskell on a Shared-Memory Multiprocessor that is used to kill off duplicated computations.
- Simon, Simon and Satnam's recent paper Runtime Support for Multicore Haskell had a lot to say about locality effects. It would also be interesting to compare the cost of pinning threads to different cores, e.g. 8 threads on one core vs 1 thread per core.
- There are also outstanding bugs to fix. We want to have a stable SPARC port, with a native code generator, for the next major GHC release.
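
To make the prefetching idea from the list above a bit more concrete, here is a minimal sketch in C of a word-by-word copy loop that issues software prefetch hints ahead of the words it is copying. This is not code from GHC or its runtime system. The function name `copy_words_prefetch` and the constants `WORDS_PER_LINE` and `PREFETCH_DISTANCE` are made up for illustration, and their values are guesses that would need to be measured on the T2. `__builtin_prefetch` is a GCC/Clang builtin, so this assumes one of those compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: words per cache line. 8 corresponds to a 64-byte line with
   8-byte words; check the real line size before trusting this. */
#define WORDS_PER_LINE 8

/* ASSUMPTION: how far ahead of the copy to prefetch, in words.
   A hypothetical tuning knob, not a measured value. */
#define PREFETCH_DISTANCE (4 * WORDS_PER_LINE)

/* Copy n machine words from src to dst. The regions must not overlap.
   Once per (assumed) cache line, hint that a later source line will be
   read soon, so the load has a chance to be in flight before we need it. */
void copy_words_prefetch(uintptr_t *dst, const uintptr_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    for (size_t i = 0; i < n; i++) {
        if (i % WORDS_PER_LINE == 0 && i + PREFETCH_DISTANCE < n) {
            /* Second argument 0 = prefetch for reading,
               third argument 0 = low temporal locality (streaming). */
            __builtin_prefetch(&src[i + PREFETCH_DISTANCE], 0, 0);
        }
        dst[i] = src[i];
    }
}
```

The prefetch is only a hint, so the copy gives the same result with or without it. Whether it helps at all depends on the prefetch distance, and on whether another hardware thread is already hiding the load latency, which is why it is only likely to matter for single threaded code.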