Monday, March 9, 2009

Project midpoint

I was away for a couple of days last week at my sister's wedding, but am back on the SPARC case.

After collecting the benchmarks in the last post, we've reached the midpoint of the project. The various stakeholders, interested parties and I then had an email discussion about what to do next; in summary:

Discussion

  • We're pretty happy with the performance of the highly threaded benchmarks like sumeuler, but others such as matmult could use some work.

  • Darryl ran some tests that confirmed there is indeed a stall when a load is followed by a dependent instruction that uses the loaded value.

  • Although doing instruction reordering would reduce the cost, the overall throughput of the machine probably doesn't suffer much for highly threaded code. The T2 is designed to handle stalls well, as long as there is another useful thread to switch to.

  • We considered targeting the intermediate representation (IR) used in Sun's C compiler. Its code generator would do a better job than we ever could, but the IR itself isn't publicly documented. We might be able to reverse engineer it, but the end product would bit-rot whenever the IR changed.

  • Targeting LLVM is another option, but that's a longer term project, and not SPARC specific. Manuel mentioned on IRC that he has someone looking into targeting LLVM.

  • Doing our own instruction reordering is an option too, but it isn't SPARC specific either, and it would be subsumed by a prospective LLVM port.

  • Performing prefetching might help for single threaded code, because GHC spends a fair amount of time copying data between the stack and the heap. The T2 does no hardware prefetching.

  • We decided it'd be better to spend the rest of the time looking into how well we can exploit the features specific to the T2, instead of focusing on the performance of single threaded code. Manuel posted some initial results for DPH showing the T2 running about as well as an eight core Xeon machine for a memory-bound benchmark, so that'll do for now.

Plan

  • First of all, get some profiles together comparing parfib and matmult running on the T2 and x86. This should give us an idea of where the time is going.

  • An interesting feature of the T2 is its low thread synchronization overhead. This might make per-thread thunk locking viable, instead of the complex machinery described in Haskell on a Shared-Memory Multiprocessor that is used to kill off duplicated computations.

  • Simon, Simon and Satnam's recent paper Runtime Support for Multicore Haskell had a lot to say about locality effects.
    It would also be interesting to compare the cost of pinning threads to different cores, e.g. 8 threads on one core vs 1 thread per core.

  • There are also outstanding bugs to fix. We want to have a stable SPARC port, with native code generator, for the next major GHC release.

1 comment:

  1. Roman and I gathered some more numbers on x86 vs T2: http://bit.ly/VOo62 The story is still that thread scheduling on the T2 is very efficient and the hardware threads cover memory latency effectively. It appears odd that sumsq scales beyond 8 cores (as it is not memory bound), but maybe the stall after load that Darryl confirmed explains it (I didn't look at the assembly).

    It would be interesting to know whether different placement of threads on particular cores has an effect on performance. (Our previous experience with other Sun hardware was that Suns are a lot less sensitive to thread placement than x86-based machines.)

    All in all, our results so far suggest that the Sun engineers know what they are doing.
