Monday, February 15, 2010

Memory Barriers and GHC 6.12.1

I've moved to UNSW (University of New South Wales) to help with the DPH (Data Parallel Haskell) project. The first order of business has been to make a SPARC/Solaris binary release for 6.12.1.

It looks like this week will be spent spelunking through the runtime system trying to find the source of #3875. Running the DPH Quickhull benchmark on SPARC/Solaris with the current head causes a consistent runtime crash, and running sumsq causes a crash about one time out of twenty. Simon M reckons this is likely to be caused by a missing memory barrier somewhere in the runtime system, and if that's true then it's going to be tricky to find.

According to this Wikipedia page the default memory ordering on SPARC (TSO = Total Store Ordering) is supposed to be the same as on x86, but the x86 benchmarks don't crash the way the SPARC ones do. I'm not sure if this is because the memory ordering on SPARC and x86 is actually different, or if we've just been lucky on x86 so far. I'll spend today reading more about the ordering properties of these two architectures.
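
For reference, the one reordering that TSO does permit is a load completing before an earlier store to a different address becomes visible to other processors. The classic "store buffering" test below is a minimal sketch of that case, not code from the GHC runtime: the names x, y, r1, r2 and count_both_zero are made up for illustration, and it assumes a C11 compiler with pthreads. With sequentially consistent atomics, as written here, the compiler has to emit whatever barrier the target needs, so both loads can never return 0. With plain or relaxed accesses that outcome would be allowed on both SPARC TSO and x86.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Shared locations for the store-buffering litmus test (illustrative names). */
static atomic_int x, y;
static int r1, r2;   /* results, read by the parent only after joining */

static void *thread_a(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_seq_cst);   /* store x ... */
    r1 = atomic_load_explicit(&y, memory_order_seq_cst);  /* ... then load y */
    return NULL;
}

static void *thread_b(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_seq_cst);   /* store y ... */
    r2 = atomic_load_explicit(&x, memory_order_seq_cst);  /* ... then load x */
    return NULL;
}

/* Runs the test `iterations` times and returns how often both loads saw 0.
   Under sequential consistency that outcome is impossible, so this should
   always return 0. */
int count_both_zero(int iterations) {
    int hits = 0;
    for (int i = 0; i < iterations; i++) {
        pthread_t a, b;
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            hits++;
    }
    return hits;
}
```

Calling count_both_zero(1000) should return 0. Weakening the four accesses to memory_order_relaxed removes that guarantee, which is the kind of missing barrier that could hide for a long time on one machine and show up readily on another.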

I'm also eagerly awaiting the new GHC buildbot, which Ian is working on. GHC HQ develops mostly on x86_64/Linux + Windows, so if we want to keep other platforms working then having reliable buildbots is an absolute necessity.

3 comments:

  1. My guess: you've been getting lucky on x64. I believe there are subtle differences between x86 and TSO ordering, but for present purposes they ought to be the same. It's more likely that higher thread counts and/or different levels of sharing in memory pipelines are causing SPARC to crash more often.

  2. The bus error would probably indicate a misaligned memory access. This would not be a problem on x86 as it handles misaligned accesses, but SPARC doesn't.

  3. If it is a memory ordering thing it could be that the x86/amd64 chips that you've tested have been conservative in how they conform to their memory ordering standards.

    In non-parallel contexts I've seen code work on x86 but crash on SPARC because of genuine bugs. For example, C's heap may be laid out differently, causing a bug that is present in both programs to surface only on SPARC. I've also seen SIGBUS when I was using misaligned memory, as Darryl said.
