tag:blogger.com,1999:blog-12579714114391147722024-02-21T09:23:21.602+11:00GHC on SPARCAnonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.comBlogger33125tag:blogger.com,1999:blog-1257971411439114772.post-17622327275461387052010-02-15T11:13:00.002+11:002010-02-15T11:45:17.101+11:00Memory Barriers and GHC 6.12.1I've moved to UNSW (University of New South Wales) to help with the <a href="http://www.haskell.org/haskellwiki/GHC/Data_Parallel_Haskell">DPH (Data Parallel Haskell)</a> project. The first order of business has been to make a <a href="http://www.haskell.org/ghc/dist/6.12.1/benl/ghc-6.12.1-sparc-sun-solaris2-fixed.tar.bz2">SPARC/Solaris binary release for 6.12.1</a>.<br /><br />It looks like this week will be spent spelunking through the runtime system trying to find the source of <a href="http://hackage.haskell.org/trac/ghc/ticket/3875">#3875</a>. Running the DPH Quickhull benchmark on SPARC/Solaris with the current head causes a consistent runtime crash, and running sumsq causes a crash about one time out of twenty. Simon M reckons this is likely to be caused by a missing <a href="http://en.wikipedia.org/wiki/Memory_barrier">memory barrier</a> somewhere in the runtime system, and if that's true then it's going to be tricky to find.<br /><br />According to <a href="http://en.wikipedia.org/wiki/Memory_ordering">this wikipedia page</a> the default memory ordering properties on SPARC (TSO = Total Store Ordering) are supposed to be the same as on x86, but the x86 benchmarks don't crash like on SPARC. I'm not sure if this is because the memory ordering on SPARC and x86 are actually different, or if we've just been lucky on x86 so far. I'll spend today reading more about the ordering properties of these two architectures.<br /><br />I'm also eagerly awaiting <a href="http://code.haskell.org/builder/">the new GHC buildbot</a>, which Ian is working on. 
GHC HQ develops mostly on x86_64/Linux + Windows, so if we want to keep other platforms working then having reliable buildbots is an absolute necessity.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com3tag:blogger.com,1999:blog-1257971411439114772.post-37565065887399692412009-05-27T20:44:00.003+10:002009-05-29T11:07:31.689+10:00The CAS experiment(Compare And Swap)<br /><br />One of the main questions we wanted to address with the GHC on SPARC project was how well the T2 architecture supports parallel lazy evaluation. Earlier posts have shown that it does well for most of the standard parallel benchmarks in the nofib suite, with sumeuler scaling almost perfectly. <br /><br />For practical parallel programming, the limiting performance factor tends to be the cost of inter-processor communication. We must distribute work to the various threads, then collect and combine their results into the final output of the program. For an "embarrassingly parallel" program, the amount of communication is low, and it can make full use of the available processors. However, for most programs we will reach a point where adding extra processors to the machine does not improve performance, because the benefit of having an extra processor is outweighed by the cost of communicating with it.<br /><br />For lazy functional programs, the work can be logically divided into thunks, that is, unevaluated function applications. In this case, lots of communication overhead can arise due to the need for the processors to decide who should be evaluating which thunks. The paper <a href="http://www.haskell.org/~simonmar/papers/multiproc.pdf">Haskell on a Shared Memory Multiprocessor</a> describes the way GHC does this. An important point is that in GHC the processors are only weakly synchronized, that is, there is a chance that two processors both evaluate the same thunk. 
This is safe in GHC because the code is purely functional: it does not matter if the same expression is evaluated more than once. <br /><br />An alternative approach is to place a lock on each thunk, and require a processor to set a thunk's lock before evaluating it. We call the action of setting the lock "claiming" the thunk. This mechanism eliminates the possibility of two processors claiming the same thunk, at the cost of increased communication overhead. We would like to know how much this overhead would be.<br /><br /><span style="font-weight:bold;">The experiment</span><br />In the paper, the authors modified GHC to perform an atomic compare and swap instruction in the entry code for each thunk. On i386 they reported an average slowdown of 50%, with a maximum of 300%. This is a substantial penalty, and the paper goes on to describe the lock-free algorithms they implemented to avoid it.<br /><br />The SPARC T2 was designed specifically to have a low inter-thread communication overhead, so we would like to know if using locks would be feasible on this architecture. To answer this I repeated the experiment from the paper, by adding the following instruction to the entry code for each thunk:<pre><br /> casa [%sp], %i1, %g1<br /></pre><br />This means "If the data pointed to by %sp is the same as the value in %i1 then swap that data with the value in %g1, otherwise put it in %g1". On SPARC, %sp is the C stack pointer, and %i1 is the STG R1 register, so the comparison will always fail. However, the rest of the thunk code doesn't use the %g1 register, so this instruction is a no-op, apart from causing a processor-wide synchronization event.<br /><br />The benchmarking results with this modification are <a href="http://cs.anu.edu.au/~Ben.Lippmeier/project/ghc/sparc/blog/20090527/t2-sp-norm-cas.html">here</a>. 
In summary, we get an average slowdown of 34%, with a maximum of 94%.<br /><br /><br /><span style="font-weight:bold;">Comparison with x86</span><br />For comparison, I redid the experiment on my Pentium 4 machine as well. The paper was written back in 2005, and other parts of GHC have improved since then. Here is the code I added to the entry code for each thunk:<br /><pre><br /> push %eax<br /> mov %ebx,%eax<br /> lock cmpxchg %ebx, (%ebp)<br /> pop %eax<br /></pre><br />This fragment contains more instructions than the SPARC version, but I did it this way to ensure the fragment was a no-op (apart from the slowdown due to synchronization). For a real implementation we'd need more than one instruction anyway, and the cmpxchg is substantially more expensive than the others, so it's a fair comparison.<br /><br />The nofib results for x86 are <a href="http://cs.anu.edu.au/~Ben.Lippmeier/project/ghc/sparc/blog/20090527/i386-sp-norm-cas.html">here</a>. In summary, for this architecture we have an average 130% slowdown, with a worst-case of 540%. <br /><br /><span style="font-weight:bold;">Summary</span><br />For our purposes, the T2 has roughly a quarter of the thread synchronization overhead of a Pentium 4 machine. This is very good, but Haskell uses lazy evaluation by default, so thunk-entry is something that we do a lot of. An average 34% slowdown is too high to want to move to a locking implementation, but not by much. <br /><br />GHC is already a mature compiler, so it's unlikely we'd be able to get that 34% back by improving other parts. I had a brief look at the generated code, and for a processor to claim every thunk we'd need a lock every 50 instructions or so. 
Using locks would be fine for applications that don't use such fine grained parallelism.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com2tag:blogger.com,1999:blog-1257971411439114772.post-20508604948916710452009-04-22T19:02:00.003+10:002009-04-22T19:28:01.025+10:00Loose ends and register pairingI've been spending some time cleaning up loose ends and addressing the floating point register pairing on SPARC. <br /><br />The first loose end was refactoring the way information about what STG registers are pinned to what hardware regs is managed. When I fixed the NCG back in Jan / Feb I had defined a local freeRegs function which records which regs are available for allocation, mostly so I could work out what was going on. Last week I redid it so it gets the information from include/MachRegs.h again via some #ifdef nastiness. Said nastiness is contained to a single module, with the benefit being that the SPARC NCG behaves the same way as the ones for the other architectures. <br /><br />The second loose end was working out why we couldn't pin a register to hold the pointer to a thread's gc_thread structure in the garbage collector code. The reason is that names like %l1 are just offsets into the SPARC register stack, and change during a function call. It's all well and good to pin a certain variable to %l1, but its value gets lost as soon as you make a call. Seems obvious in hindsight. We can pin STG registers to hardware registers while running the program proper, because Haskell code doesn't use the register window. However, the garbage collector is written in C, so the same plan doesn't fly.<br /><br />I'm currently refactoring the graph coloring register allocator to do a better job of handling the SPARC's floating point registers. SPARC has 32x 32bit float registers. To store a 64 bit float you place the value in a pair of 32bit regs. This is much like the situation on x86 where the 8 bit regs AL and AH can be combined to form the 16 bit AX. 
Previously, we had just statically reserved 6 of the FP regs to be 32 bit, and the rest to be 64 bit. This works, but is kludgy and doesn't make good use of the hardware.<br /><br />When I wrote GHC's graph coloring allocator back in 2007, I based it on the algorithm described by Smith, Ramsey and Holloway in <a href="http://www.cs.tufts.edu/~nr/pubs/gcra-abstract.html">A Generalized Algorithm for Graph-Coloring Register Allocation.</a> This algorithm handles register aliasing quite elegantly, and I wrote the code with the SPARC architecture in mind. However, back then the SPARC NCG didn't actually work, so I couldn't test this functionality. It does now though, so I'm going back and finishing this unfinished business.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-49550441428077185472009-04-14T12:45:00.006+10:002009-04-15T00:22:36.753+10:00Instruction counts on x86 vs SPARC<table><tr><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090414/pentium-m.png"><img style="cursor:pointer; cursor:hand;width: 400px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090414/pentium-m.png" border="0" alt="" /></a></td><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090414/sparc.png"><img style="cursor:pointer; cursor:hand;width: 400px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090414/sparc.png" border="0" alt="" /></a></td></tr></table><br />Here are breakdowns of the total number of instructions executed on a Pentium-M processor and the SPARC T2 for four of the nofib/parallel benchmarks. The vertical axis is gig-instructions (x10^9). The benchmarks were run with just a single thread (+RTS -N1). 
The figures were recorded using hardware performance counters. <br /><br />SPARC/Solaris provides the cpustat and cputrack commands which allow direct access to the performance counters, as long as you're the super-user. On x86/Linux I originally tried using valgrind to count the instructions, but it turned out to be too slow. Valgrind is basically an x86 CPU simulator; the simulation imposes a 50 - 200x slowdown, and I didn't want to wait all week. I ended up using a machine with a Linux kernel patched with the <a href="http://user.it.uu.se/~mikpe/linux/perfctr">perfex/perfctr</a> library, generously provided by <a href="http://cs.anu.edu.au/people/Daniel.Frampton">Daniel Frampton's</a> GC research group.<br /><br />Recall that the Pentium-M architecture does an on-the-fly translation from x86 CISC instructions to RISC micro-operations (uops). On these benchmarks, about 10% more uops are generated than the number of x86 instructions. The SPARC versions of the benchmarks use about 25% more instructions than Pentium-M uops. This is partly because the SPARC needs two instructions to load a 32 bit immediate value, which is something GHC does a lot of. For example, here is some code which builds part of a stack frame:<pre><br /> sethi %hi(_module_registered),%g2<br /> or %g2,%lo(_module_registered),%g2<br /> st %g1,[%g2]<br /> add %i0,-4,%i0<br /> sethi %hi(__stginit_base_Prelude_),%g1<br /> or %g1,%lo(__stginit_base_Prelude_),%g1<br /> st %g1,[%i0]<br /> add %i0,-4,%i0<br /> sethi %hi(__stginit_base_SystemziEnvironment_),%g1<br /> or %g1,%lo(__stginit_base_SystemziEnvironment_),%g1<br /> st %g1,[%i0]</pre><br />SPARC instructions are all 32 bits wide, and most ALU ops only have room for a 13 bit immediate value. The usual way to load a 32 bit value is to use the sethi instruction to set the top 22 bits, then or in the bottom few. 
Lack of room for immediate values hurts continuation passing styles of compilation more than the traditional loop+branch styles, as we can't use PC relative addressing as much. Still, on the T2, dependent ALU ops don't cause pipeline stalls, so it's not a big penalty.<br /><br />The Pentium-M doesn't support x86_64, so we have to use the x86 instruction set, which doesn't have many registers. Due to this, the Pentium-M code performs about twice the number of memory accesses than the SPARC code. The number of branch instructions is about the same, as expected.<br /><br />Overall, on the x86 about 50% of the instructions are loads or stores, and 18% are branches. On the SPARC, about 20% are loads or stores, and about 16% branches.<br /><br />In other news, I've been spending some time looking through the RTS code for the scheduler, and the way black-holes are handled. The paper <a href="http://www.haskell.org/~simonmar/papers/multiproc.pdf">Haskell on a Shared-Memory Multiprocessor</a> discusses the mechanism that GHC uses to avoid fine-grained locking of thunks during parallel evaluation. On x86 this locking is expensive, but it should be less so on the T2. I'm planning to repeat the experiment mentioned in that paper to determine whether doing the locking is feasible. Doing so would eliminate the need for the complicated stack tracing scheme described in that paper to kill off duplicated computation (at least on SPARC), and may improve the latency of parallel code. Stay tuned.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-88704864077094039162009-03-23T18:48:00.013+11:002009-03-23T21:51:54.659+11:00Pinning threads to coresThe following graphs show runs of the nofib/parallel benchmarks sumeuler, parfib and mandel, run with a varying number of Haskell threads (+RTS -N1 .. -N14).<br /><br />In the plots marked "8 thread", the program was constrained to run on a single core. 
A core has 8 hardware threads, can issue an instruction from two different threads each cycle, has 2 ALUs, 1 Load/Store unit, 8k L1 data cache and 16k L1 instr cache. In this mode the threads must share limited hardware, so if the two active threads need to use the Load/Store unit at the same time, then one will stall. However, the inter-thread communication overhead is low, as they can communicate via the shared L1 cache.<br /><br />In the plots marked "8 core", the program was run on a processor set consisting of the first hardware thread of each core. This means the program effectively has 8 ALUs, 8 Load/Store units, 64k L1 data and 128k L1 instr cache available. The T2 does not issue multiple instructions from the same thread in a single cycle, so the second ALU of each core is left unused. In this mode the program has more hardware available, but the communication overhead is higher as the threads must communicate via L2 cache.<br /><br />The question is how well the T2's hardware threading multiplexes GHC threads (capabilities). In the "8 core" mode, the program has more than 4x the hardware available, and will not suffer contention over limited Load/Store units. 
However, it is not expected to be able to issue 4x the number of instructions, due to pipeline stalls and cache latency.<table><tr><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090323/runtime.png"><img style="cursor:pointer; cursor:hand;width: 400px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090323/runtime.png" border="0" alt="" /></a></td><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090323/max-issue.png"><img style="cursor:pointer; cursor:hand;width: 400px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090323/max-issue.png" border="0" alt="" /></a></td></tr><tr><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090323/total-instrs.png"><img style="cursor:pointer; cursor:hand;width: 400px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090323/total-instrs.png" border="0" alt="" /></a></td><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090323/total-dcmiss.png"><img style="cursor:pointer; cursor:hand;width: 400px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090323/total-dcmiss.png" border="0" alt="" /></a></td></tr></table>For the runtime graph, both axes use log scales. The fact that the plots for the "8 core" mode are (mostly) straight lines means that the performance is scaling (almost) perfectly, that is, double the threads, half the runtime. In all cases the program performs better in the "8 core" mode, so the extra hardware available more than makes up for the higher communication overhead. 
The raw run-times for 8 GHC threads are as follows:<br /><br /><table cellpadding=10><tr><td><span style="font-weight:bold;">prog</span></td><td><span style="font-weight:bold;">runtime "8 thread" (s)</span></td><td><span style="font-weight:bold;">runtime "8 core" (s)</span></td><td><span style="font-weight:bold;">speedup</span></td></tr><tr><td>mandel</td><td>25.7</td><td>20.5</td><td>1.25</td></tr><tr><td>parfib</td><td>14.0</td><td>7.4</td><td>1.89</td></tr><tr><td>sumeuler</td><td>4.0</td><td>3.2</td><td>1.25</td></tr></table><br />From these numbers we can see that in "8 core" mode, even though the program has more than 4x the hardware available, this doesn't translate into a 4x speedup. Each thread spends the majority of its time stalled while waiting for the cache to respond, and in "8 core" mode the cores are idle for much of the time.<br /><br />This is less true for parfib. Note from the max issue rate graph that parfib achieves the peak single-core issue rate of 2.33 gig instrs/sec with only 6 threads. This implies the program is integer bound, not load/store bound. It is managing to issue two instructions on every cycle, and the overall execution of the program doesn't stall while waiting for memory.<br /><br />In all cases, adding more GHC threads than hardware threads is a bad idea. Not only does the run time increase, but the total number of instructions for mandel and sumeuler increases dramatically. 
These extra instructions are likely due to context switching between the Haskell threads; they contribute to the overall load, and will slow down other jobs running on the same machine.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-72470527799627535332009-03-19T19:28:00.003+11:002009-03-19T19:40:36.615+11:00Peak issue rate is 18.64 Gig instrs/sec<center><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090319/saturate.png"><img style="cursor:pointer; cursor:hand;width: 300px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090319/saturate.png" border="0" alt="" /></a><br /></center><br /><br />This plot confirms the number for the peak instruction issue rate that I presented yesterday. On the T2 at 1165MHz we get 18.64 Gig instructions/sec. <br /><br />The test program spawns 64 threads that execute the following loop 10^8 times.<br /><br /><pre>here:<br /> add %o0, %o1, %o2<br /> add %o1, %o2, %o3<br /> add %o3, %o4, %o5<br /> add %o5, %o6, %o7<br /><br /> sub %l3, 1, %l3<br /> subx %l2, 0, %l2<br /><br /> cmp %g0, %l3<br /> bne here<br /> nop<br /></pre>Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-51530429551304752972009-03-18T18:40:00.005+11:002009-03-18T21:06:36.157+11:00Peak performanceThese are plots of machine activity when running benchmarks from nofib/parallel on the T2. The Instr_cnt plot shows the machine-wide instruction issue rate at 100ms resolution. The Data_miss plot shows the number of loads which miss L1 data cache multiplied by 20. 
This scaling factor is the average number of cycles needed to fetch data from L2, according to Darryl's article on <a href="http://developers.sun.com/solaris/articles/t1t2_perf_counter.html">Calculating Processor Utilization</a>. <br /><br />The Solaris psrstat command informs me that the processor operates at 1165MHz. The OS doesn't seem to scale the clock frequency with respect to load, so I've assumed that it's stable during the tests. The processor has 8 cores, with each core being able to issue two instructions per clock, each from different threads. This gives us a peak theoretical issue rate of 18.64 Gig instructions/sec (18.64 * 10^9). Of course, that's assuming we're just moving data between registers. For some applications we'll expect to run out of memory or IO bandwidth before we reach the peak issue rate. <br /><br /><br /><span style="font-weight:bold;">sumeuler</span><table><tr><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/sumeuler-N16.png"><img style="cursor:pointer; cursor:hand;width: 300px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/sumeuler-N16.png" border="0" alt="" /></a></td><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/sumeuler-N32.png"><img style="cursor:pointer; cursor:hand;width: 300px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/sumeuler-N32.png" border="0" alt="" /></a></td></tr><tr><td><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/sumeuler-N64.png"><img style="cursor:pointer; cursor:hand;width: 300px; height: 300px;" 
src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/sumeuler-N64.png" border="0" alt="" /></a></td><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/sumeuler-N128.png"><img style="cursor:pointer; cursor:hand;width: 300px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/sumeuler-N128.png" border="0" alt="" /></a></td></tr></table>These are plots of sumeuler running with 16, 32, 64, and 128 threads. The post from a few weeks ago shows that sumeuler scales perfectly, and that is reinforced here. As a rough approximation, when we double the number of threads, the issue rate doubles and the run-time halves. The idle time before and after each run corresponds to me flipping between terminals to control the logger vs running the program.<br /><br />We have 64 hardware threads, so using -N64 Haskell threads (capabilities) gives us the shortest run time. I also ran the program with -N128 to confirm there's nothing to be gained by adding more capabilities than we have hardware threads (I tried other values between -N64 and -N128 as well).<br /><br />For -N64 we get a maximum issue rate of about 10 Gig instructions/sec. That is about 54% of the peak rate, which is pretty good, but it also indicates that we could gain something by improving the native code generator to do instruction reordering or prefetching. 
If the program was achieving the peak rate then there would be no free cycles left to use, so the only way to improve run time would be to reduce the number of instructions.<br /><br />Over the next few days I'll run some C micro-benchmarks to confirm that my figure for the peak issue rate is correct.<br /><br /><span style="font-weight:bold;">mandel</span><table><tr><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/mandel-N8.png"><img style="cursor:pointer; cursor:hand;width: 300px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/mandel-N8.png" border="0" alt="" /></a></td><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/mandel-N16.png"><img style="cursor:pointer; cursor:hand;width: 300px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/mandel-N16.png" border="0" alt="" /></a></td></tr><tr><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/mandel-N32.png"><img style="cursor:pointer; cursor:hand;width: 300px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/mandel-N32.png" border="0" alt="" /></a></td><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/mandel-N64.png"><img style="cursor:pointer; cursor:hand;width: 300px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/mandel-N64.png" border="0" alt="" /></a></td></tr></table>mandel also scales well. This program sparks a new thread for each pixel in the output image. 
The time it takes to calculate the color of each pixel can vary by a factor of 10 or more, depending on where it is in the image. I imagine this is why the plot for the instruction issue rate is noisier than for sumeuler. The maximum issue rate is about the same.<br /><br /><span style="font-weight:bold;">matmult</span><br /><br />There is something going on with matmult that I don't fully understand yet. The four plots below were taken in quick succession on a quiet machine. Note how the run time varies between 10 and 13 seconds, and that the second and fourth plots are quite similar but different from the other two. Perhaps the initial choice of which cores the threads are placed on is causing locality effects, but that wouldn't explain why those plots also seem to execute more instructions. <br /><br />If another process was actively running on the machine, and contributing to the instruction count, then the two similar plots wouldn't be so similar. In these plots we can see an initial startup period, then seven distinct bursts of activity. Those bursts aren't garbage collection. The GHC runtime system reports that the major GCs are taking < 0.1s each, but the bursts in the plot have around 1s duration. 
More investigation required.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/matmult-N32-3.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/matmult-N32-3.png" border="0" alt="" /></a><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/matmult-N32-4.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/matmult-N32-4.png" border="0" alt="" /></a><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/matmult-N32-2.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/matmult-N32-2.png" border="0" alt="" /></a><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/matmult-N32-1.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 300px;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090318/matmult-N32-1.png" border="0" alt="" /></a>Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-27051123431287937982009-03-12T15:44:00.012+11:002009-03-13T12:44:51.685+11:00Thread activity plottingHere are plots of hardware thread activity on the T2 when running some of the benchmarks from nofib/parallel. 
They were obtained via the Solaris cpustat command, with custom post processing hackery to make the images.<br /><br />The horizontal lines are hardware threads, with the darkness of the line corresponding to the number of instructions executed per unit of time. The ticks along the X axis give the number of seconds wall-clock runtime.<br /><br />(sorry there are no numbers on the axes, custom hackery and all)<br /><br /><span style="font-weight:bold;">sumeuler</span><table><tr><td>-N8<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090312/sumeuler-N8.png"><img style="cursor:pointer; cursor:hand;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090312/sumeuler-N8.png" border="0" alt="" /></a></td><br /><td>-N32<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090312/sumeuler-N32.png"><img style="cursor:pointer; cursor:hand;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090312/sumeuler-N32.png" border="0" alt="" /></a></td><br /><td>-N64<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090312/sumeuler-N64.png"><img style="cursor:pointer; cursor:hand;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090312/sumeuler-N64.png" border="0" alt="" /></a></td><br /></tr><br /></table><br />These are with 8, 32 and 64 Haskell threads (+RTS -N8 ..) respectively. sumeuler scales well, so in the -N64 case we get even machine utilization and a short runtime.<br /><br />In the plot for -N8, the system seems to favor the first thread of each core, so every 8th trace is darker. The higher threads aren't used as much. 
There seems to be more activity in the last two cores.<br /><br /><span style="font-weight:bold;">matmult</span><table><tr><td>-N32<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090312/matmult-N32.png"><img style="cursor:pointer; cursor:hand;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090312/matmult-N32.png" border="0" alt="" /></a><br /></td><br /><td>-N64<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090312/matmult-N64.png"><img style="cursor:pointer; cursor:hand;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090312/matmult-N64.png" border="0" alt="" /></a><br /></td><br /></tr><br /></table><br />The plot for matmult reveals distinct phases of high and low parallelism. At about 7 sec into the -N32 plot we see a single running Haskell thread shuffling between the cores.<br /><br /><span style="font-weight:bold;">partree</span><br /><br />-N32<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090312/partree-N32.png"><img style="cursor:pointer; cursor:hand;" src="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090312/partree-N32.png" border="0" alt="" /></a><br /><br />In this plot we see a faint background computation, along with what I expect are regular periods of garbage collection. However, it appears as though most of the work is being done by a single Haskell thread. 
The plots for <a href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090312/partree-N8.png">8 threads</a> and <a href="http://cs.anu.edu.au/people/Ben.Lippmeier/project/ghc/sparc/blog/20090312/partree-N1.png">1 thread</a> have a similar runtime.<br /><br />I am eagerly awaiting the release of <a href="http://raintown.org/threadscope">ThreadScope</a> so I can match these hardware profiles up with GHC's view of the world.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-48648516984153564492009-03-09T17:57:00.002+11:002009-03-09T18:30:40.588+11:00The GNU Debugger and meAt the end of last week I tried to profile the performance of matmult and parfib, but discovered it was broken. Specifically, the allocation count for some cost centers is getting corrupted. For example:<br /><br /><pre>COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />MAIN MAIN 1 0 0.0 20.3 0.0 100.0<br /> main Main 172 30590380562441576 0.0 19.0 0.0 19.0<br /> CAF Main 166 30590376291721096 0.0 20.2 0.0 20.2<br /> main Main 173 0 0.0 0.0 0.0 0.0<br /> dude Main 174 30619204112395264 0.0 0.0 0.0 0.0<br /></pre><br />Sadly, for some reason it's also getting corrupted with the via-c build, so my usual approach of swapping out suspect -fasm closure code for known good -fvia-c code isn't going to fly.<br /><br />That puts me squarely in GDB territory, and the nastiness of breakpoints and single stepping through instructions. I figure GDB is like the SWAT team of software engineering. 
It's fun to recount its exploits to your pals over beer, but you hope you're never in a situation where you actually need to call on its services.<br /><br />Anyway, some time later I've got a Cmm line:<br /><pre> I32[CCCS] = CAF_ccs;<br /> I64[I32[R1 + 4] + 16] = I64[I32[R1 + 4] + 16] + 1 :: W64;</pre><br /><br />Which produces:<br /><pre> <br /> sethi %hi(Main_CAFs_cc_ccs),%g1<br /> or %g1,%lo(Main_CAFs_cc_ccs),%g1 -- ok, load CAF_ccs into %g1<br /> sethi %hi(CCCS),%g2<br /> or %g2,%lo(CCCS),%g2 -- ok, load addr of CCCS into %g2<br /> st %g1,[%g2] -- ok, I32[CCCS] = CAF_ccs<br /> add %g1,1,%g1 -- hmmm: produce a misaligned addr into %g1<br /> addx %g2,%g0,%g2 -- hmmm: a nop. %g0 is always zero<br /> ld [%l1+4],%g3 -- ok, load addr of the cost centre record into %g3<br /> add %g3,16,%g3 -- ok, calculate the addr of the allocation count<br /> -- for that cost centre <br /><br /> st %g2,[%g3] -- badness! write addr into the allocation count field<br /> st %g1,[%g3+4] -- ditto<br /></pre><br /><br />Finding being 80% of fixing, hopefully this'll be working again tomorrow.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-43816238790787543272009-03-09T17:09:00.007+11:002009-03-09T18:39:12.486+11:00Project midpointI was away for a couple of days last week at my sister's wedding, but am back on the SPARC case.<br /><br />After collecting the benchmarks in last post, we've reached the midpoint of the project. 
The various stakeholders, interested parties and I then had an email discussion about what to do next, a summary of which is:<br /><br /><h3>Discussion</h3><ul><li>We're pretty happy with the performance of the highly threaded benchmarks like sumeuler, but others such as matmult could use some work.</li><br /><li>Darryl ran some tests that confirmed there is indeed a stall cost for load instructions followed by dependent ops that use the loaded value.</li><br /><li>Although doing instruction reordering would reduce the cost, the overall throughput of the machine probably doesn't suffer much for highly threaded code. The T2 is designed to handle stalls well, as long as there is another useful thread to switch to.</li> <br /><li>We considered targeting the intermediate representation (IR) used in Sun's C compiler. That would do code generation better than we ever could, but the IR itself isn't publicly documented. We might be able to reverse engineer it, but the end product would rot when the IR changes.</li><br /><li>Targeting LLVM is another option, but that's a longer term project, and not SPARC specific. Manuel mentioned on IRC that he has someone looking into targeting LLVM.</li><br /><li>Doing our own instruction reordering is an option too, but again not SPARC specific, and it would be subsumed by a prospective LLVM port. </li><br /><li>Performing prefetching might help for single threaded code, because GHC spends a fair amount of time copying data between the stack and the heap. The T2 does no hardware prefetching.</li><br /><li>We decided it'd be better to use the rest of the time looking into how well we can exploit the features specific to the T2, instead of focusing on the performance of single threaded code. 
Manuel posted some <a href="http://justtesting.org/post/83014052/this-is-the-performance-of-a-dot-product-of-two">initial results for DPH</a> showing the T2 running about as well as an eight core Xeon machine for a memory-bound benchmark, so that'll do for now.</li><br /></ul><h3>Plan</h3><ul><br /><li>First of all, get some profiles together comparing parfib and matmult running on the T2 and x86. This should give us an idea of where the time is going.</li><br /><li>An interesting feature of the T2 is the low thread synchronization overhead. This might support per-thread thunk locking instead of the complex machinery described in <a href="http://www.haskell.org/~simonmar/papers/multiproc.pdf">Haskell on a Shared-Memory Multiprocessor</a> that is used to kill off duplicated computations.</li><br /><li>Simon, Simon and Satnam's recent paper <a href="http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel/multicore-ghc.pdf">Runtime Support for Multicore Haskell</a> had a lot to say about locality effects.<br />It would also be interesting to compare the cost of pinning threads to different cores, e.g. 8 threads on one core vs. 1 thread per core.</li><br /><li>There are also outstanding bugs to fix. We want to have a stable SPARC port, with native code generator, for the next major GHC release.</li><br /></ul>Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com1tag:blogger.com,1999:blog-1257971411439114772.post-48866730640674011442009-02-26T14:35:00.004+11:002009-02-26T17:16:38.631+11:00BenchmarkingI spent yesterday writing a tool (sekhmet) to help with the benchmarking, and collected the data today. <br /><br />The test programs are from the nofib parallel benchmark suite. I did 5 runs of each in succession, and took the average elapsed wall-clock time. 
The machines were lightly loaded, though I can't see what's going on in the second slice of my Dept T2 (probably nothing).<br /><br />Test Machines:<br /><ul><br /><li>My Desktop machine. Pentium4 (Prescott 2M) CPU, 1 core, 2 threads per core, 3GHz clock, 1GB RAM.</li><br /><li>My MacBook Air. Core2 Duo (Merom) CPU, 2 cores, 1 thread per core, 1.6GHz clock, 2GB RAM.</li><br /><li>My Dept T2. UltraSPARC T2 (Niagara 2) CPU, 4 core slice of an 8 core machine, 8 threads per core, 1.4GHz clock, 1GB RAM in the slice.</li><br /><li>The haskell.org T2. UltraSPARC T2 (Niagara 2) CPU, 8 cores, 8 threads per core, 1.4GHz clock, 32GB RAM.</li><br /></ul><br /><br />Benchmarks:<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://code.haskell.org/~benl/sparc/blog/20090226/parfib-compare.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 400px;" src="http://code.haskell.org/~benl/sparc/blog/20090226/parfib-compare.png" border="0" alt="" /></a><br /><br />Parfib is an embarrassingly parallel micro-benchmark. It's a good test of the parallel Run Time System (RTS), because if this program doesn't parallelize then nothing will. <br /><br />Note that this is a log-log graph. The fact that the initial plots for the T2s are straight lines means that performance scales with the number of threads: double the threads, half the runtime. With parfib, it's easy to divide the work into uniformly sized chunks (grains), and each thread only needs to talk to its parent and children, so there is little cross-thread communication. <br /><br />On the T2, each core has 8 hardware threads, 2 ALUs and 1 Load/Store unit. However, because only one instruction from a thread is issued at a time, a single thread cannot make use of all the hardware in a core. To get good performance, you really need lots of threads. <br /><br />With parfib, lots of threads are no problem. 
The T2 beats the Pentium4 at 5 threads, and the dual core Merom chip at about 9. I don't know whether the OS will try to schedule those threads all on the same core, or spread them out -- I'll find out.<br /><br />At about 20 threads, the speedup starts to flatten out. This will either be because the grains are already too small to usefully divide any further, or we're out of memory bandwidth. At a guess, I'd say the former because the sumeuler test below has no problem scaling up to 64 threads. <br /><br />Note that at the right of the graph, the plots start to get shaky because there is a lot of variance in the runtime. On some benchmarks, when we try to create too many threads the runtime can vary by a factor of 2. Averaging the time of 5 runs smooths a bit of it out, but we can still see the variance through the average.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://code.haskell.org/~benl/sparc/blog/20090226/partree-compare.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 400px;" src="http://code.haskell.org/~benl/sparc/blog/20090226/partree-compare.png" border="0" alt="" /></a><br /><br />partree shows what happens when we can't divide the work into enough threads to make proper use of the hardware. Either there's not enough thread-level parallelism inherent in this problem, or the program hasn't been written to expose it properly. On this benchmark, a single Pentium 4 thread is about 8x faster than a T2 thread. <br /><br />Interestingly, the P4 and Merom behave much worse than the T2 when we increase the number of threads, and the P4 much worse than the Merom. 
Something to investigate...<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://code.haskell.org/~benl/sparc/blog/20090226/sumeuler-compare.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 400px;" src="http://code.haskell.org/~benl/sparc/blog/20090226/sumeuler-compare.png" border="0" alt="" /></a><br /><br />sumeuler is another embarrassingly parallel micro-benchmark, and is a good smoke-test for the run time system. This one scales right up to the limit of the hardware, getting faster until it runs out of hardware threads. Compared to parfib, when we add too many threads the runtime degrades much faster; I'm not sure why.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://code.haskell.org/~benl/sparc/blog/20090226/matmult-compare.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 400px;" src="http://code.haskell.org/~benl/sparc/blog/20090226/matmult-compare.png" border="0" alt="" /></a><br /><br />The fact that the plots for the Pentium4 and Merom chips cross over is interesting. So is the fact that when the number of threads is increased on the Merom, the performance is relatively even, but it degrades badly on the P4. This seems similar to the behavior of partree. The T2's performance matches the others at about 16 threads, but there isn't enough exposed parallelism here to beat them.<br /><br />Summary:<br /><br />The T2 is purpose built to exploit thread level parallelism. If you have lots of jobs (threads) to run, then the throughput of the machine is quite high. On the other hand, with the current GHC-head, if you only have a single thread then performance is going to range from sort-of-acceptable to miserable -- which means there's more work to be done.<br /><br />In these benchmarks, the single thread performance between the T2 and P4 varies significantly. With partree the P4 thread is 8x faster, but in sumeuler it's only 1.6x faster. 
I'll need to put some more work in to determine why this is.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com1tag:blogger.com,1999:blog-1257971411439114772.post-54215223544349698212009-02-24T20:39:00.003+11:002009-02-24T21:13:38.712+11:00Bugfixes and pretty graphsFixed the problem with conc042-045 and concprog002. The bug was in the runtime system code that deals with software transactional memory (STM). Tag bits from a closure pointer weren't getting stripped before it was dereferenced, which was causing a misaligned read error. This was a similar bug to one I found at the very start in Printer.c:printClosure. <br /><br />I also made some pretty graphs of parallel nofib benchmarks. These are just for the T2; I ran them by hand. I'll do comparisons with other platforms tomorrow:<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://code.haskell.org/~benl/sparc/blog/20090224/parfib.png"><img style="cursor:pointer; cursor:hand;width: 400px; height: 200px;" src="http://code.haskell.org/~benl/sparc/blog/20090224/parfib.png" border="0" alt="" /></a><br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://code.haskell.org/~benl/sparc/blog/20090224/sumeuler.png"><img style="cursor:pointer; cursor:hand;width: 400px; height: 200px;" src="http://code.haskell.org/~benl/sparc/blog/20090224/sumeuler.png" border="0" alt="" /></a><br /><br />These are both embarrassingly parallel micro-benchmarks. This is clear from the fact that the speedup scales almost linearly with the number of threads, until the available hardware is exhausted. Still, it shows it's working like it ought to. 
Hooray for +RTS -N64.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-24129308032960818722009-02-23T18:01:00.002+11:002009-02-23T20:50:13.072+11:00SanityFixed the problem with enum02.<br /><br />It turned out the code to compare values of type Integer in the runtime system was being mis-compiled. This was because some of the basic blocks created by the condition code generator didn't end in jumps, which confused the register liveness determinator, and caused a bad allocation.<br /><br />Decided that 1 ounce of prevention is worth 100000000000 ounces of staring blindly at assembly dumps for hours, so I added code to the NCG that checks that the above invariant holds in all code it generates. It also checks that the branch delay slot is always filled for branch instructions, and I'll add any other invariants I come across.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-45715624392350021422009-02-21T17:49:00.002+11:002009-02-21T19:36:55.151+11:00Triage<pre><br />OVERALL SUMMARY for test run started at Thursday, 19 February 2009 11:39:01 AM EST<br /> 2301 total tests, which gave rise to<br /> 12203 test cases, of which<br /> 0 caused framework failures<br /> 2359 were skipped<br /><br /> 9375 expected passes<br /> 342 expected failures<br /> 1 unexpected passes<br /> 126 unexpected failures<br /><br />Unexpected passes:<br /> hpc_ghc_ghci(normal)<br /><br />Unexpected failures:<br /> enum02(all) -- wrong output. 64 bit arith.<br /><br /> ffi019(all) -- segv<br /> conc042(threaded2,profthreaded) -- segv, segv<br /> conc043(threaded2,profthreaded) -- segv, segv<br /> conc044(threaded2,profthreaded) -- segv, segv<br /> conc045(threaded2,profthreaded) -- segv, segv<br /> concprog002(threaded2) -- segv<br /> random1283(threaded2) -- segv<br /> ghciprog004(normal) -- segv<br /><br /> strings(all) -- wrong output. 
same as HEAD on x86_64.<br /> bits(all) -- "<br /><br /> testblockalloc(normal,threaded1) -- ? out of memory error, as expected?<br /> process007(all) -- ? what is this trying to test. uses sh tricks.<br /><br /> genUpTo(all) -- build problem. 'StringRep' not in scope.<br /><br /> hClose002(all) -- Solaris file locking difference.<br /> user001(all) -- Solaris unix env lib differences.<br /><br /> ann01(profc,profasm) -- (*) driver / linker problem.<br /> annrun01(all) -- "<br /><br /> 1861(optc,profc) -- INFINITY undefined. C header file difference on Solaris.<br /> derefnull(profc,profthreaded) -- segv as expected, but does not drop ./derefnull.prof.<br /> divbyzero(profc,profthreaded) -- arith error as expected, does not drop ./divbyzero.prof. <br /><br /> apirecomp001(normal) -- ok by hand. Test framework makefile wibble.<br /> ghcpkg02(normal) -- "<br /><br /> arith011(profc) -- ok by hand. GCC timeout.<br /> barton-mangler-bug(optc,profc) -- "<br /> seward-space-leak(ghci) -- "<br /> joao-circular(optc,profc) -- "<br /><br /> 1914(ghci) -- ok by hand. Needs GNU version of touch.<br /> gadt23(normal) -- "<br /> recomp004(normal) -- "<br /><br /> hpc_draft(normal) -- ok by hand. Not sure why it failed during run.<br /> hpc_hand_overlay(normal) -- "<br /> hpc_markup_002(normal) -- "<br /> hpc_overlay(normal) -- "<br /> hpc_overlay2(normal) -- "<br /> hpc_show(normal) -- "<br /> arr017(profthreaded) -- "<br /></pre><br /><br /><span style="font-weight:bold;">ann01</span><br /><pre><br />=====> ann01(profc)<br />cd . && '/data0/home/benl/devel/ghc/ghc-HEAD-build/ghc/stage2-inplace/ghc' -fforce-recomp<br />-dcore-lint -dcmm-lint -dno-debug-output -c ann01.hs -O -prof -auto-all -fvia-C -v0 >ann01.comp.stderr 2>&1<br />Compile failed (status 256) errors were:<br />ann01.hs:37:0:<br /> Dynamic linking required, but this is a non-standard build (eg. 
prof).<br /> You need to build the program twice: once the normal way, and then<br /> in the desired way using -osuf to set the object file suffix.<br /></pre>Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com1tag:blogger.com,1999:blog-1257971411439114772.post-34122975449611364572009-02-17T18:52:00.004+11:002009-02-17T20:13:30.008+11:00Thunderbirds are goAlright, it's been a busy few days.<br /><br />I've split the native code generator into architecture specific modules, and chopped the SPARC part into bite-sized chunks. There are still some #ifdefs around for other architectures, but most of the native code generator is compiled, independent of what the target architecture is.<br /><br />I think I've squashed most of the bugs as well, but I'll have to wait overnight for a complete test run to finish.<br /><br />Here are some <a href="http://code.haskell.org/~benl/sparc/blog/20090219/sp-p4_merom_t2-asm.html">initial nofib results</a>. <br /><br />The tests compare:<br /><ul><br /><li>My 3GHz Pentium4 Desktop</li><br /><li>My 1.6GHz Core2 Duo (Merom) MacBook Air</li><br /><li>The 1.4GHz SPARC T2</li><br /></ul><br />Take these results with a fair dose of salt: the libs on the T2 were built with -O2, but on the P4 and Merom I only used -O. I'll have some more comprehensive results in a few days. I'm not sure what's going on with "primetest"; I'll have to run it again.<br /><br />In short, with the current compiler, a Haskell program on a single T2 thread runs about 5x slower than on a P4 desktop. Remember that the T2 is totally set up for parallel processing, and has none of the instruction reordering or multiple issue of the P4. If we scale for the clock rate difference between the P4 and T2, then a T2 thread runs about 2.3 times slower than a P4 thread, clock for clock. 
On the other hand, 16 threads (2 per core) can be active at once on the T2.<br /><br />Anyway, in regards to single thread performance on the T2, I think we're losing out big-time due to our total lack of instruction reordering.<br /><br />Here is a piece of code from the nqueens test from nofib.<br /><pre><br /> .LcFW:<br /> add %i0,-20,%g1<br /> cmp %g1,%i2<br /> blu .LcFY<br /> nop<br /> add %i3,16,%i3 <- dep<br /> cmp %i3,%i4 <-<br /> bgu .LcFY<br /> nop<br /> sethi %hi(stg_upd_frame_info),%g1 <- dep<br /> or %g1,%lo(stg_upd_frame_info),%g1 <-<br /> st %g1,[%i0-8] <-<br /> st %l1,[%i0-4]<br /> sethi %hi(sDJ_info),%g1 <- dep<br /> or %g1,%lo(sDJ_info),%g1 <- <br /> st %g1,[%i3-12] <-<br /> ld [%l1+12],%g1 <- copy<br /> st %g1,[%i3-4] <- <br /> ld [%l1+16],%g1 <- copy<br /> st %g1,[%i3] <- <br /> add %i3,-12,%g1 <br /> st %g1,[%i0-16] <br /> ld [%l1+8],%g1 <- copy<br /> st %g1,[%i0-12] <-<br /> sethi %hi(base_GHCziNum_zdf6_closure),%l2 <- dep<br /> or %l2,%lo(base_GHCziNum_zdf6_closure),%l2 <-<br /> sethi %hi(sDK_info),%g1 <- dep<br /> or %g1,%lo(sDK_info),%g1 <-<br /> st %g1,[%i0-20]<br /> add %i0,-20,%i0<br /> call base_GHCziNum_zdp1Num_info,0<br /> nop<br /></pre><br /><br />A typical GHC compiled Haskell program spends most of its time copying data around memory. Such is the price of lazy evaluation, which we all know and love. The copies create a huge amount of memory traffic, and associated cache miss stalls. In addition, many of the instructions are directly dependent on the one just above them, which creates data hazard stalls.<br /><br />At least for the T2, I expect that doing some basic instruction reordering will be a big win, so I'll try that first. 
Maybe some prefetching also, considering that we're moving so much data around memory.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com1tag:blogger.com,1999:blog-1257971411439114772.post-42849887207239043552009-02-13T20:16:00.002+11:002009-02-13T20:40:14.209+11:00Refactoringuntie the knots<br />soften the glare<br />write the program<br />that was always thereAnonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-1342255254358333782009-02-12T17:06:00.004+11:002009-02-12T20:25:29.298+11:00Allocate!Last week I pushed some patches that started to break up the native code generator into architecture specific modules. Unfortunately, due to a miscommunication the patches weren't validated on powerpc properly, and now the head has been broken on that architecture for a few days.<br /><br />I've been using Thorkil's machine to debug the problem, and have narrowed it down to the linear register allocator (surprise surprise). The linear allocator has some code that tries to make sure that basic blocks are allocated in call-order. Unfortunately, for tiresome reasons, if there are no branch instructions to a particular block then the allocator goes into an endless loop. This problem is exposed when trying to compile AutoApply.cmm from the runtime system. Only a small amount of code from the RTS contains Cmm level loops, so it's not well tested in that respect. Regular, compiled Haskell code contains no assembly level loops. <br /><br />The graph coloring allocator compiles the code fine, so I'm pretty sure it's not a problem with code generation. I'm not sure why splitting up the code generator created this problem. <br /><br />There isn't a comment in the allocator code that fully explains the block ordering problem, or gives a test case. I tried disabling this code, but it didn't make things better. 
Also, when I try to dump the cmm code for AutoApply.cmm with -ddump-opt-cmm the pretty printer (or something) also goes into an endless loop. This happens when compiling with -fregs-graph as well, so I think it's an unrelated problem.<br /><br />More digging. Code generation for AutoApply.cmm produces basic blocks that have no explicit branches to them, because the branch is via a jump table...<br /><br />Dah! Found it. My previous patch to the handling of fixup code wasn't complete for powerpc. The BCTR instruction wasn't being treated as though it was a branch.<br /><br />Realise that there is a problem: 1 min<br />Setup to work on the problem: 1 hr<br />Find the problem: 5 hrs<br />Fix the problem: 1 min<br /><br />Ah, code generation. Who doesn't love it?Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-20294743996243265382009-02-05T12:35:00.003+11:002009-02-05T20:34:09.750+11:00Just TestingOk, I've got the stage3 build with RTS and Libs working with -fasm -O2. Did a full testsuite run last night, and it looks promising. However, a stack of the driver tests failed because the compiler or build system spews something about "/bin/sh: cygpath: not found". This line isn't in the file it's testing against, which results in a failure.<br /><br />.. the cygpath thing was easy enough to hack around. A full stage3 test run gives the following. 
I've sorted the failing tests by ways failed.<br /><br /><pre>OVERALL SUMMARY for test run started at Thursday, 5 February 2009 12:43:50 PM EST<br /> 2287 total tests, which gave rise to<br /> 8550 test cases, of which<br /> 0 caused framework failures<br /> 1580 were skipped<br /><br /> 6483 expected passes<br /> 247 expected failures<br /> 1 unexpected passes<br /> 239 unexpected failures<br /><br />Unexpected passes:<br /> hpc_ghc_ghci(normal)<br /><br />Unexpected failures:<br /> TH_ppr1(normal)<br /> apirecomp001(normal)<br /> gadt23(normal)<br /> ghciprog004(normal)<br /> ghcpkg02(normal)<br /> hpc_draft(normal)<br /> hpc_hand_overlay(normal)<br /> hpc_markup_001(normal)<br /> hpc_overlay(normal)<br /> hpc_overlay2(normal)<br /> rebindable5(normal)<br /> recomp004(normal)<br /><br /> ThreadDelay001(normal,threaded1)<br /> conc014(normal,threaded1)<br /> conc015(normal,threaded1)<br /> conc017(normal,threaded1)<br /> concprog001(normal,hpc,threaded1)<br /> signals001(normal,hpc,threaded1)<br /> list001(normal,hpc,threaded1)<br /> hSeek003(normal,hpc,threaded1)<br /> hSetBuffering003(normal,hpc,threaded1)<br /> copyFile002(normal,hpc,threaded1)<br /> dynamic001(normal,hpc,threaded1)<br /> dynamic002(normal,hpc,threaded1)<br /> enum03(normal,hpc,threaded1)<br /> integerBits(normal,hpc,threaded1)<br /> xmlish(normal,hpc,threaded1)<br /> memo001(normal,threaded1)<br /> memo002(normal,threaded1)<br /> num007(normal,threaded1)<br /> num013(normal,threaded1)<br /> openFile008(normal,threaded1)<br /> read001(normal,threaded1)<br /> readwrite002(normal,threaded1)<br /> testblockalloc(normal,threaded1)<br /> tup001(normal,threaded1)<br /> getDirContents001(normal,threaded1)<br /> jtod_circint(normal,threaded1)<br /> life_space_leak(normal,threaded1)<br /><br /> arr016(normal,threaded1,threaded2)<br /> random1283(normal,threaded1,threaded2)<br /><br /> 1914(ghci)<br /> seward-space-leak(ghci)<br /><br /> 1861(optc)<br /> barton-mangler-bug(optc)<br /> 
joao-circular(optc,hpc,threaded1,threaded2)<br /> time003(optc,hpc,optasm,threaded2)<br /> simplrun007(optc,optasm)<br /><br /><br /> 2910(hpc)<br /> arith012(hpc)<br /> arr019(hpc)<br /> cholewo-eval(hpc)<br /> concio002(hpc)<br /> hDuplicateTo001(hpc)<br /> copyFile001(hpc)<br /> hGetPosn001(hpc)<br /> hIsEOF002(hpc)<br /> hSeek001(hpc)<br /> hSeek002(hpc)<br /> hTell002(hpc)<br /> ioeGetHandle001(hpc)<br /> launchbury(hpc)<br /> num005(hpc)<br /> num009(hpc)<br /> num014(hpc)<br /> process004(hpc)<br /> process006(hpc)<br /> renameFile001(hpc)<br /> show001(hpc)<br /> signals002(hpc)<br /><br /> arith002(threaded1)<br /> arith005(threaded1)<br /> dsrun022(threaded1)<br /><br /> conc042(threaded2)<br /> conc043(threaded2)<br /> conc044(threaded2)<br /> conc045(threaded2)<br /> concprog002(threaded2)<br /><br /> andy_cherry(normal,optc,hpc,optasm,threaded1,threaded2)<br /> annrun01(normal,optc,hpc,optasm,threaded1,threaded2)<br /> arith011(normal,optc,hpc,optasm,ghci,threaded1,threaded2)<br /> bits(normal,optc,hpc,optasm,ghci,threaded1,threaded2)<br /> enum01(normal,optc,hpc,optasm,threaded1,threaded2)<br /> enum02(normal,optc,hpc,optasm,ghci,threaded1,threaded2)<br /> ffi009(normal,optc,hpc,optasm,threaded1,threaded2)<br /> ffi019(optc,hpc,optasm,ghci,threaded1,threaded2)<br /> genUpTo(normal,optc,hpc,optasm,ghci,threaded1,threaded2)<br /> hClose002(normal,optc,hpc,optasm,ghci,threaded1,threaded2)<br /> hGetBuf001(normal,optc,hpc,optasm,threaded1,threaded2)<br /> integerConversions(normal,optc,hpc,optasm,threaded1,threaded2)<br /> num012(normal,optc,hpc,optasm,ghci,threaded1,threaded2)<br /> process007(normal,optc,hpc,optasm,ghci,threaded1,threaded2)<br /> strings(normal,optc,hpc,optasm,ghci,threaded1,threaded2)<br /> tcrun007(normal,optc,hpc,optasm,ghci,threaded1,threaded2)<br /> user001(normal,optc,hpc,optasm,ghci,threaded1,threaded2)<br /><br /></pre><br /><br />A good percentage of these are seg faulting, so there is still a bug or two lurking around.<br 
/><br />Continued to split the native code generator into arch specific modules while waiting for test runs. I'm splitting into Alpha, X86, PPC, and SPARC subdirs. The Alpha code is long dead, but it seems a shame to just delete it, so I've made modules for it with the code commented out.<br /><br />Due to time constraints I'll have to leave some of the #ifdefs in, especially for the difference between i386 and x86_64. There are also some #ifdefs to choose between Darwin and other OSs.<br /><br />The way it's set up now is that (almost) all code from all arches gets compiled, no matter what the host platform is. This means that potential validate problems are always exposed. It should also help reduce the rot-factor for seldom used platforms.<br /><br />The top level modules like MachCodeGen still have an #ifdef to choose which one to import for the specific platform. However, it should be straightforward to go from this setup to having a data type representing the target platform, and being able to specify it on the command line.<br /><br />On the SPARC front, I've just finished cleaning up the horror show that was nativeGen/MachRegs.hs. I'll spare you the details, but this source comment (not mine) sums it up nicely: "Gag me with a spoon, eh?"<br /><br />It feels a lot simpler when all the SPARC code is isolated, and you don't have to expend mental effort to reject code for architectures you're not working on. It's also reassuring to know that if you change something you're not going to break something for another architecture that is in an #ifdef block that isn't being compiled on the current platform.<br /><br />Also went through and fixed all the warnings in code from MachRegs, MachInstrs, RegsAllocInfo and PprMach. 
The world seems shiny and new.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com1tag:blogger.com,1999:blog-1257971411439114772.post-67496872169899797642009-02-02T11:36:00.005+11:002009-02-02T20:17:41.846+11:00Join PointsThere were no updates last week on account of it being the Australia Day public holiday on Monday, and me going to <a href="http://groups.google.com/group/fp-syd">fp-syd</a> later in the week.<br /><br />Although the SPARC NCG now passes almost all of the testsuite, and can compile the Haskell source of GHC itself, it still has some problems when compiling the RTS. Unlike the Cmm code emitted when compiling Haskell source, the hand written Cmm code in the RTS contains loops as well as join points.<br /><br />Mid last week I hit a bug in the linear register allocator, but didn't finish fixing it. Here is some emitted assembler code when compiling rts/PrimOps.cmm:<br /><br /><pre> .Lch:<br /> ....<br /> st %l0,[%l6]<br /> add %l6,4,%l6<br /> b .Lci <--- JUMP to .Lci<br /> nop<br /><br /> .Lci:<br /> st %l0,[%i6-84]<br /> ld [%i6-76],%l0<br /> st %l0,[%i6-76]<br /> sll %l0,2,%l0<br /> ....<br /><br /> .Lcj:<br /> ....<br /> st %l6,[%i6-76]<br /> add %l0,8,%l6<br /> b .Lci <--- JUMP to .Lci<br /> nop<br /><br /> .Ln1l:<br /> b .Lci <--- JUMP to .Lci<br /> mov %l7,%l0<br /> ld [%i6-100],%l7 <--- ??? slot 100 never assigned to<br /> b .Lci <--- unreachable code??<br /> nop</pre><br /><br />Label .Lci is a join point because it is the target of several jumps. The code in block .Ln1l is supposed to be fixup code. Fixup code is used to make sure that register assignments match up for all jumps to a particular label.<br /><br />This code is badly broken though:<br />1) .Ln1l is unreachable, nothing ever jumps to it.<br />2) There should probably be a nop after the first branch instruction in .Ln1l to fill the SPARC branch delay slot. 
That's if that first jump is supposed to be there at all.<br />3) The instruction ld [%i6-100],%l7 loads from spill slot 100, but that slot is not stored to anywhere else in the code.<br />4) The three instructions after the first are also unreachable.<br /><br />This is a bit funny considering I spent July-Sept 2007 at GHC HQ writing a new graph colouring register allocator for GHC, specifically because the linear allocator was known to be dodgy. <br /><br />Alas, the graph allocator suffered the fate of many such "replacement projects". It is a little slower on some examples, and it comes with an ambient FUD cloud because it is fresh code. Sure, I <span style="font-style:italic;">could</span> just use the graph allocator for SPARC, but that would still leave known bugs in the linear allocator for someone else to trip over. Until such time as we ditch the linear allocator entirely, for all architectures, we still need to maintain it.<br /><br />The reason why .Ln1l wasn't being referenced was that the code to change the destination of the jump instruction at the end of block .Lch was missing. I finally got sick enough of the organisation of RegAllocLinear.hs that I spent most of the day splitting it out into its own set of modules, which was well overdue.<br /><br />Some time later...<br /><br />It seems as though the code in block .Ln1l is supposed to be swapping the values in regs %l7 and %l0. In block .Lch vreg %vI_co is in %l7 and %vI_cp is in %l0, but the block .Lci wants them the other way around. What a nightmare.<br /><br />More time later...<br /><br />From handleComponents in RegAllocLinear.hs:<br /><br /><pre> -- OH NOES!<br /> (_, slot) <- spillR (RealReg sreg) spill_id</pre><br />The _ is supposed to match the instruction, generated by spillR, that stores the value in sreg into the spill slot at [%i6-100]. 
The comment is mine.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-22061194480015277392009-01-22T18:46:00.005+11:002009-01-22T19:49:51.523+11:00Wait and performSpent time reading about hardware performance counters while waiting for builds and test runs. Realised that I had built the RTS with -fvia-c before, so there are still a couple of NCG things to fix before we can do a full stage2 -fasm build.<br /><br />The RTS code uses a machine op MO_WriteBarrier that compiles to a nop on i386, but uses the LWSYNC instruction on PPC. I still have to read about that and work out what to do for sparc.<br /><br />Also spent some time fighting a configure problem. The configure script can detect whether the assembler supports the -x flag, which tells it whether to export local symbol names. The configure script looks at the ld in the current path, but ghc calls gcc to do the assembly. That gcc will use whatever ld <span style="font-style:italic;">it</span> was configured with, and on mavericks the Solaris linker doesn't support -x. (epic sigh). In the same vein, the Solaris assembler doesn't support the .space directive. Fixed that and my builds are running again.<br /><br />Started off a full run of the testsuite with the stage1 compiler + via-c rts, but that'll take all day and all night. num012 is failing with 16-bit integer arithmetic, but that's probably no big deal.<br /><br />The hardware performance counters on the T2 are exercised with the cputrack util, which I spent some time learning about. You can get counts of instrs executed, branches taken, cache misses, etc. There are lots of measurements available, but you can only collect two at a time - because there are only two PICs (performance instrumentation counters) on the T2. 
<br /><br />For example:<pre><br /> benl@mavericks:~/devel/ghc/ghc-HEAD-work/nofib/spectral/boyer> <br /> cputrack -t -T 10 -c Instr_cnt,DC_miss ./boyer 200<br /> time lwp event %tick pic0 pic1 <br /> 3.409 1 exit 3954145530 1730642411 32277574<br />True<br /><br />benl@mavericks:~/devel/ghc/ghc-HEAD-work/nofib/spectral/boyer> <br /> cputrack -t -T 10 -c Instr_ld,Instr_st ./boyer 200<br /><br /> time lwp event %tick pic0 pic1 <br /> 3.413 1 exit 3957292767 286778987 219450322<br />True<br /><br />benl@mavericks:~/devel/ghc/ghc-HEAD-work/nofib/spectral/boyer> <br /> cputrack -t -T 10 -c Br_completed,Br_taken ./boyer 200<br /> time lwp event %tick pic0 pic1 <br /> 3.411 1 exit 3952535075 281792355 190004907<br />True</pre><br />So, it looks like we're getting about 2.3 clocks per instruction (CPI). About 30% of instructions executed are loads or stores. If those loads/stores account for all the L1 data cache misses then about 6% of them miss. That might be wrong though - I'll have to work out whether the info tables are stored in the data or instr cache. In any event, about 30% of all instructions are loads or stores, and another 16% are branches, of which 67% are taken. <br /><br />I'll post some nice graphs and whatnot once my nofib run has gone through. Should probably add the performance counters to nofib-analyse as well, in place of valgrind on sparc / solaris.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com1tag:blogger.com,1999:blog-1257971411439114772.post-48017115423719697942009-01-21T17:29:00.002+11:002009-01-21T19:07:42.896+11:00The StrapImplemented table-based switch (genSwitch), and fixed a problem when converting float to integer formats. The FSTOI instruction leaves its result in a float register, but in integer format. On at least V9 you then have to copy the value to mem and back to get it into an actual int register. Later architectures have instructions to do this without going via mem, but we're sticking with V9 for now. 
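The V9 sequence just described can be sketched with a toy instruction type. Everything here is a hypothetical model for illustration (invented constructors, an arbitrary stack slot), not the NCG's real Instr type:

```haskell
-- Toy model of V9 float->int conversion: FSTOI converts in place in a
-- float register, so the integer bits are bounced through memory to
-- reach an integer register. Names and the -8 slot are invented.
data Instr
  = FSTOI String String   -- fstoi %fsrc, %fdst
  | STF   String Int      -- st %fsrc, [%fp+off]
  | LD    Int    String   -- ld [%fp+off], %idst
  deriving (Eq, Show)

floatToInt :: String -> String -> [Instr]
floatToInt fsrc idst =
  [ FSTOI fsrc fsrc    -- result still sits in a float register
  , STF   fsrc (-8)    -- spill the integer bits to the stack
  , LD    (-8) idst    -- reload them into an integer register
  ]
```

The later extensions mentioned above would collapse this to a single register-to-register move, trading portability back to V9 for the two memory operations.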
Perhaps we could get some speedup for FP code by using the later instruction set extensions (VIS 2.0?)<br /><br />The genCCall problem was that calls to out-of-line float ops like sin and exp were disabled for 32-bit floats, maybe because other things were broken before. <br /><br />Also fixed a 64-bit FFI problem. Closures allocated by the storage manager are 32-bit aligned, but the RTS was trying to do misaligned reads/writes of 64-bit words. The standard SPARC load/store instructions don't support misaligned reads/writes, so I had to break them up into parts.<br /><br />Looking good. A few tests still fail, but they fail the same way with all ways. I'm guessing that they are problems with Solaris or other environment stuff, and not the NCG.<br /><br />I tried to build the stage2 compiler with -fasm, but I made the foolish mistake of pulling from the head beforehand, which broke the build. Did a full distclean, but will have to leave it overnight. <br /><br /><pre><br />OVERALL SUMMARY for test run started at Wednesday, 21 January 2009 9:05:20 AM EST<br /> 2283 total tests, which gave rise to<br /> 8531 test cases, of which<br /> 0 caused framework failures<br /> 7429 were skipped<br /><br /> 1047 expected passes<br /> 32 expected failures<br /> 0 unexpected passes<br /> 23 unexpected failures<br /><br />Unexpected failures:<br /><br /> cvh_unboxing(optasm) -- fixed<br /> seward-space-leak(optasm) -- fixed<br /> 1916(optasm) -- fixed<br /> expfloat(optasm) -- fixed<br /> fun_insts(optasm) -- fixed<br /> 2594(optasm) -- fixed<br /> ffi019(optasm) -- fixed<br /> arith011(optasm) -- fixed<br /><br /> barton-mangler-bug(optasm) -- optc: timeout. 
others: ok<br /> joao-circular(optasm) -- timeout<br /><br />---- noncrash fail all ways ----<br /> num012(optasm)<br /> hpc_fork(optasm)<br /> hClose002(optasm)<br /> user001(optasm)<br /> 2910(optasm)<br /> tcrun007(optasm)<br /> T2914(optasm)<br /> annrun01(optasm)<br /> ann01(optasm)<br /> haddockA028(optasm)<br /> hpc001(optasm)<br /> process007(optasm)<br /> tough(optasm)</pre>Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-1084875871007495742009-01-20T10:04:00.003+11:002009-01-20T20:00:25.546+11:00The GrindDecided to address some of the other outstanding bugs before starting on genSwitch. I don't want other bugs to confuse the issue.<br /><br />Did a full run of the testsuite with optasm. I added the triage comments manually.<br /><br /><pre>OVERALL SUMMARY for test run started at Thursday, 15 January 2009 7:47:05 PM EST<br /> 2283 total tests, which gave rise to<br /> 8531 test cases, of which<br /> 0 caused framework failures<br /> 7429 were skipped<br /><br /> 1025 expected passes<br /> 32 expected failures<br /> 0 unexpected passes<br /> 45 unexpected failures<br /><br />Unexpected failures:<br /> arith004(optasm) -- hpc: iselExpr64 panic. optasm: getRegister panic. <br /> num012(optasm) -- optasm: ppr match fail. others: wrong output<br /> process007(optasm) -- hpc: iselExpr64. others: wrong output<br /> time003(optasm) -- hpc: iselExpr64 panic. optasm: getRegister panic.<br /> bits(optasm) -- hpc: iselExpr64 panic. 
optasm: getRegister panic.<br /> tough(optasm) -- iselExpr64<br /> hpc001(optasm) -- iselExpr64<br /> hpc_fork(optasm) -- iselExpr64<br /> tc213(optasm) -- getRegister<br /><br /> expfloat(optasm) -- genCCall can not reduce<br /> fun_insts(optasm) -- genCCall can not reduce<br /><br /> num013(optasm) -- iselExpr64, match fail<br /> 2388(optasm) -- match fail<br /> enum02(optasm) -- match fail<br /> enum03(optasm) -- match fail<br /> arith011(optasm) -- match fail<br /> arith017(optasm) -- match fail<br /> ffi017(optasm) -- match fail C types<br /> ffi018(optasm) -- match fail 64bit ffi<br /> ffi019(optasm) -- match fail 64bit ffi<br /><br /> 1916(optasm) -- invalid register<br /><br /> 2594(optasm) -- all ways: segv 64bit ffi<br /><br /> seward-space-leak(optasm) -- genSwitch<br /> simpl007(optasm) -- genSwitch<br /> syn-perf(optasm) -- genSwitch<br /> tup001(optasm) -- genSwitch<br /> andy_cherry(optasm) -- genSwitch<br /> arrowrun001(optasm) -- genSwitch<br /> arrowrun004(optasm) -- genSwitch<br /> barton-mangler-bug(optasm) -- genSwitch<br /> cg054(optasm) -- genSwitch<br /> cvh_unboxing(optasm) -- genSwitch<br /> drv005(optasm) -- genSwitch<br /> drv006(optasm) -- genSwitch<br /> drvrun014(optasm) -- genSwitch<br /> joao-circular(optasm) -- genSwitch<br /> jtod_circint(optasm) -- genSwitch<br /><br /><br /> hClose002(optasm) -- all ways: same wrong output<br /> user001(optasm) -- all ways: same wrong output<br /> 2910(optasm) -- all ways: same wrong output<br /> tcrun007(optasm) -- all ways: missing import<br /> T2914(optasm) -- all ways: type error<br /> annrun01(optasm) -- all ways: unknown package ghc<br /> ann01(optasm) -- all ways: no TH in stage1 all ways<br /> haddockA028(optasm) -- all ways: test wibble</pre><br /><br />Looking into arith004. The code to generate integer remainder / divide instructions was missing. On further investigation, old SPARC implementations didn't have hardware support for this. GHC used to call out to a library. 
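For the arith004 case: with a hardware divide that yields only quotients, the remainder has to be rebuilt from the quotient. A minimal sketch of the identity involved, in plain Haskell rather than the actual NCG code:

```haskell
-- Remainder rebuilt from a truncating divide: r = x - (x `quot` y) * y.
-- This mirrors the div/mul/sub instruction sequence; it illustrates the
-- identity only, not the generated SPARC code.
remViaDiv :: Int -> Int -> Int
remViaDiv x y = x - (x `quot` y) * y
```

Because `quot` truncates toward zero, the result agrees with Haskell's `rem` for negative operands too.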
The SPARC T2 has hardware divide, but you have to compute remainders using div/mul/sub. Added code to do so. Not sure if we still want to maintain the software mul/div path - but I'll worry about that when the rest is fixed and refactored. Also fixed code to generate 64-bit operations on 32-bit SPARC, which was the iselExpr64 problem.<br /><br /><pre><br /> arith004 -- fixed<br /> bits -- fixed<br /> tc213 -- fixed<br /> arith012 -- fixed<br /> arith017 -- fixed<br /> ffi017 -- fixed<br /> ffi018 -- fixed<br /> enum02 -- fixed<br /> enum03 -- fixed<br /><br /> num012 -- invalid register 64bit<br /><br /> time003 -- genSwitch<br /><br /> ffi019(optasm) -- all ways: bus error<br /><br /> process007 -- all ways: same wrong output<br /> tough -- all ways: same wrong output<br /> hpc001 -- all ways: same wrong output<br /> hpc_fork -- all ways: same wrong output</pre>Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-83606521639704462022009-01-15T12:45:00.004+11:002009-01-15T19:46:50.781+11:00Info tablesThe info tables are getting broken for some reason. The Cmm code has:<br /><br /><pre>sEl_ret()<br /> { [const Main.$wf_srt-sEl_info;, const 1;, const 2228231;]<br /> }</pre><br />-fvia-c gives:<pre><br /> .text<br /> .align 4<br /> .long Main_zdwf_srt - sEl_info<br /> .long 1<br /> .long 2228231<br />sEl_info:<br /> ....</pre><br />but -fasm gives:<pre><br /> .text<br /> .align 4<br /> .long Main_zdwf_srt+0 <---------- lost sEl_info<br /> .long 1<br /> .long 2228231<br />sEl_info:<br /> ....</pre><br />Some time later...<pre><br />#if sparc_TARGET_ARCH<br />-- ToDo: This should really be fixed in the PIC support, but only<br />-- print a for now.<br />pprImm (ImmConstantDiff a b) = pprImm a <br />#else<br />pprImm (ImmConstantDiff a b) = pprImm a <> char '-'<br /> <> lparen <> pprImm b <> rparen<br />#endif</pre><br />!!!! .... sigh.<br /><br />Fixed that, but it still doesn't work. 
Here's the code for a whole closure:<br /><br /><pre> .data <-------------------------------- data<br /> .align 8<br /> .global Main_a_srt<br /> Main_a_srt:<br /> .long Main_lvl_closure<br /> .long base_GHCziHandle_stdout_closure<br /> .long base_GHCziIO_a28_closure<br /> .data<br /> .align 8<br /> .global Main_a_closure<br /> Main_a_closure:<br /> .long Main_a_info<br /> .long 0<br /> .text <-------------------------------- text<br /> .align 4<br /> .long Main_a_srt-(Main_a_info)+0 <-------- offset text to data<br /> .long 196609<br /> .long 0<br /> .long 983047<br /> .global Main_a_info<br /> Main_a_info:<br /> .LcGp:<br /> sethi %hi(base_GHCziHandle_stdout_closure),%l2<br /> or %l2,%lo(base_GHCziHandle_stdout_closure),%l2<br /> sethi %hi(Main_lvl_closure),%l3<br /> or %l3,%lo(Main_lvl_closure),%l3<br /> call base_GHCziIO_a28_info,0<br /> nop</pre><br />SRTs (Static Reference Tables) are used for garbage collecting CAFs, but the GHC commentary page on them seems out of date, or missing. In any event, the assembler can't make an offset between labels in .text and .data segments. On some architectures .text and .data use entirely separate address spaces. This probably got broken in a previous GHC release when info tables were moved to be next to the code. Checking against x86 reveals it does the same thing, but the x86 assembler is OK with cross-segment offsets.<br /><br />I ended up just changing the pretty printer so it prints out ReadOnlyData segments as .text, which is a bit nasty. 
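The workaround amounts to lying about the section name so the label difference stays within one segment. A toy sketch of the idea, with invented types standing in for the NCG's real section handling:

```haskell
-- Toy sketch of the nasty workaround: print read-only data sections as
-- .text so that info-table offsets like Main_a_srt-(Main_a_info) relate
-- two labels in the same segment. Types here are illustrative only.
data Section = Text | Data | ReadOnlyData
  deriving (Eq, Show)

pprSectionHeader :: Section -> String
pprSectionHeader Text         = ".text"
pprSectionHeader Data         = ".data"
pprSectionHeader ReadOnlyData = ".text"   -- the hack: not a rodata section
```

A cleaner fix would keep read-only data in its own section and compute the offset some other way, which is why this is flagged as temporary.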
I'll be able to handle this in a nicer way when the sparc NCG is factored out into its own set of modules.<br /><br />Win!<br /><pre><br /> 56 expected passes<br /> 1 expected failures<br /> 0 unexpected passes<br /> 3 unexpected failures<br /><br />Unexpected failures:<br /> 2080(optasm) -- wrong output<br /> cg015(optasm) -- unknown unary match op<br /> cg054(optasm) -- genSwitch<br /></pre><br />That seems to have fixed the seg faulting ones.<br /><br />Working on 2080.hs<br /><br /><pre>-- cmm code is<br /> _sFv::I32 = %MO_UU_Conv_W8_W32(I8[R2]); <--- load unsigned<br /> _sFx::I32 = _sFv::I32;<br /> _cG7::I32 = %MO_S_Le_W32(_sFx::I32, 127);<br /> if (_cG7::I32 >= 1) goto cGa;<br /><br />-- -fvia-c gives:<br /> ldub [%l2], %g1 <--- load unsigned<br /> cmp %g1, 127<br /> ble,pt %icc, .LL5<br /> sethi %hi(ghczmprim_GHCziBool_True_closure), %g1<br /> ...<br /><br />-- -fasm gives:<br /> ldsb [%l2],%l0 <--- load signed<br /> srl %l0,0,%l0 <--- nop<br /> cmp %l0,127<br /> ble .LcGa<br /> nop<br /> ...</pre><br />This was an easy operand format problem. However, I did notice that the Cmm code only ever loads unsigned data, like I8[R2]. If you were going to load signed data it would be better to use the sparc ldsb instruction which sign extends the byte in one go, versus doing an unsigned load then sign extending it separately. Another task for a simple peephole optimizer....<br /><br />Also fixed the unknown unary match op problem - sign extension code was unfinished. genSwitch here we come.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0tag:blogger.com,1999:blog-1257971411439114772.post-29702208955352740862009-01-14T10:10:00.005+11:002009-01-14T18:15:57.148+11:00Liveness liesYesterday I fixed the linear allocator to handle floating point register twinning, or at least I thought I did. The output code looked ok, but the programs I tried still crashed. 
I ended up spending the rest of the day writing a tool (mayet) to compare the -fasm and -fvia-c versions. Mayet takes the two .s files and splits them up into parts belonging to the individual closures. It then slowly substitutes the dubious -fasm sections for the known-good -fvia-c sections. <br /><br />Last night I got enough of it working to find a bad -fasm closure in cg034, which tests out floating point math. Curiously, the closure itself didn't do any float or double math. This morning I hand-adjusted the -fvia-c version to look like the -fasm one until it exhibited the same problem.<br /><br />After some wibbling around I found this:<br /><br /><pre> ld [%l1+12],%vI_s1GO<br /> # born: %vI_s1GO<br /> <br /> cmp %vI_s1GO,0<br /> # r_dying: %vI_s1GO <--------- LIES<br /> <br /> bne .Lc2cm<br /><br /> ......<br /> ......<br /> c2cm:<br /> ld [%l1+8],%vI_n2cH<br /> # born: %vI_n2cH<br /> <br /> st %vI_n2cH,[%i0-12]<br /> # r_dying: %vI_n2cH<br /> <br /> sethi %hi(base_GHCziFloat_a_closure),%l2<br /> <br /> or %l2,%lo(base_GHCziFloat_a_closure),%l2<br /> <br /> or %g0,%vI_s1GO,%l3<br /> # r_dying: %vI_s1GO <---------<br /></pre><br /><br />This is a dump of register liveness information. The line marked LIES shows that the allocator thinks that variable %vI_s1GO isn't used after the cmp instruction. Unfortunately, after the branch, it's used in an or. The vreg %vI_n2cH got allocated to the same register as %vI_s1GO, clobbering the contained value and causing the crash. <br /><br />Turns out the register liveness determinator wasn't treating BI and BF as though they were branch instructions, so liveness information wasn't being propagated across the basic blocks properly.<br /><br />Fixing that problem stopped cg034 from crashing, though it still gave the wrong answer. During debugging, I noticed that if ghc is executed with -v or -ddump-reg-liveness then the top-level labels emitted in the .s file change - which confuses mayet. Hmm... 
Let that be a lesson to all of us: changing compiler flags should not change top-level names, if at all possible.<br /><br />More digging...<br /><br /><pre> (_s1Ri::F32,) = foreign "ccall" <br /> __encodeFloat((_c2sm::I32, `signed'), (_c2sn::I32, PtrHint),<br /> (_c2so::I32, `signed'))[_unsafe_call_];<br /> F32[Sp] = _s1Ri::F32;</pre><br />is translated to:<br /><pre> call __int_encodeFloat,2<br /> nop<br /><br /> st %f28,[%i0] <- BOGUS %f28<br /></pre><br />A floating point return value should be placed in %f0, but for some reason the GHC code that does just that was missing. Fixed that, and it almost works... just gives the wrong answer.<br /><br />Loading of doubles looks broken.<br /><br />via-c says:<pre><br /> ld [%l1+3], %f8<br /> fitod %f8, %f2<br /></pre><br /><br />but the NCG does:<pre><br /> ld [%l1+3],%l0<br /> st %l0,[%o6-8]<br /> ld [%o6-8],%f10<br /> fitos %f10,%f10<br /></pre><br />Hmm.<br /><br />Remember that comment from a few days ago:<br /><pre>-- ToDo: Verify correctness</pre><br />Turns out it wasn't correct... Who would have known? :P<br /><br />That fixed cg034 and cg035. Now we're down to: <pre><br />Unexpected failures:<br /> 2080(optasm) -- segv<br /> cg015(optasm) -- unknown unary match op<br /> cg021(optasm) -- segv<br /> cg022(optasm) -- segv<br /> cg026(optasm) -- segv <br /> cg044(optasm) -- segv<br /> cg046(optasm) -- segv<br /> cg054(optasm) -- genSwitch <br /> cg058(optasm) -- segv<br /> cg060(optasm) -- segv<br /></pre>Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com1tag:blogger.com,1999:blog-1257971411439114772.post-6511131729374097352009-01-12T13:14:00.002+11:002009-01-12T20:11:36.205+11:00Bootstrapping 7I'm still chasing down the source of that bug from last time. Spent some time re-reading the STG paper, and trawling through the RTS code. <br /><br />Running the program with +RTS -DS didn't help, but I noticed a flag -Da that checks the format of closures while the program runs. 
However, it seemed to be disabled, or unused. There is code for it in the RTS, but it doesn't show up in +RTS --help. Running the program with +RTS -Da gives:<br /><br /><pre>benl@mavericks:~/devel/ghc/ghc-HEAD-native/tmp> ./Main +RTS -DS -Da<br />stg_ap_v_ret... PAP/1(3f5212, 3ef238)<br />stg_ap_0_ret... Bus error (core dumped)<br /></pre><br />Went chasing through the RTS code looking for the source of the Bus Error. -Da causes Printer.c:printClosure to be invoked to print out a description of each closure, but the pointers passed to it are misaligned. That is, misaligned because they still contain pointer tag bits. Reread the dynamic pointer tagging paper, then fixed the RTS code.<br /><br />Ended up doing a binary-ish search to find the problem. Starting with a known-good .s file generated with -fvia-c, I slowly copied dubious sections of the -fasm version into it, testing along the way. This works because in the current STG -> Cmm translation, top-level STG functions only share data via pinned registers. This is sort of like a standard calling convention for GHC functions, so it doesn't matter if GHC and GCC do register allocation a little differently. 
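For reference, the tagging arithmetic the RTS fix hinges on: per the dynamic pointer tagging paper, on a 32-bit target the low two bits of a closure pointer carry the tag and must be masked off before dereferencing. A sketch of just that arithmetic, assuming 2 tag bits:

```haskell
import Data.Bits ((.&.), complement)
import Data.Word (Word32)

-- On a 32-bit platform closures are word-aligned, so the bottom 2 bits
-- of a pointer are free to carry a tag (constructor number or arity).
tagMask :: Word32
tagMask = 3

getTag :: Word32 -> Word32
getTag p = p .&. tagMask

-- Debugging code like printClosure must untag before following the
-- pointer, otherwise the misaligned access bus-errors on SPARC.
untag :: Word32 -> Word32
untag p = p .&. complement tagMask
```

x86 tolerates the misaligned load a tagged pointer produces, which is one reason this class of bug surfaces first on SPARC.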
It might be worthwhile automating this process if I hit similar problems in the future.<br /><br />Anyway, it turns out there's a world of difference between<br /> mov %l1, %l2 and mov %l2, %l1...<br /><br />With that fixed, I ran the should_run codeGen tests and got:<br /><br /><pre>OVERALL SUMMARY for test run started at Monday, 12 January 2009 5:23:10 PM EST<br /> 61 total tests, which gave rise to<br /> 427 test cases, of which<br /> 0 caused framework failures<br /> 367 were skipped<br /><br /> 42 expected passes<br /> 1 expected failures<br /> 0 unexpected passes<br /> 17 unexpected failures<br /><br />Unexpected failures:<br /> 1852(optasm) -- regalloc<br /> 1861(optasm) -- regalloc<br /> 2080(optasm) -- wrong output<br /> cg015(optasm) -- unknown unary match op<br /> cg018(optasm) -- regalloc<br /> cg021(optasm) -- segv<br /> cg022(optasm) -- segv<br /> cg024(optasm) -- regalloc<br /> cg026(optasm) -- regalloc<br /> cg028(optasm) -- regalloc<br /> cg034(optasm) -- regalloc<br /> cg035(optasm) -- regalloc<br /> cg044(optasm) -- regalloc<br /> cg046(optasm) -- segv<br /> cg054(optasm) -- genSwitch not implemented<br /> cg058(optasm) -- segv<br /> cg060(optasm) -- segv<br /></pre><br /><br />The ones marked regalloc die with:<br /><pre>ghc: panic! (the 'impossible' happened)<br /> (GHC version 6.11.20090110 for sparc-sun-solaris2):<br /> RegAllocLinear.allocRegsAndSpill: no spill candidates</pre><br /><br />Repaired the rot in the linear register allocator. The free register map only worked for x86(_64) and PPC. 
Now we've got:<br /><br /><pre> 2080(optasm) -- wrong output<br /> cg015(optasm) -- unknown unary match op<br /> cg021(optasm) -- segv<br /> cg022(optasm) -- segv<br /> cg026(optasm) -- segv<br /> cg034(optasm) -- regalloc FF64<br /> cg035(optasm) -- regalloc FF64<br /> cg044(optasm) -- regalloc FF64<br /> cg046(optasm) -- segv<br /> cg054(optasm) -- genSwitch<br /> cg058(optasm) -- segv<br /> cg060(optasm) -- segv<br /></pre><br /><br />The others marked regalloc are dying because the allocator is messing up the float register twinning. That'll be tomorrow's problem.Anonymoushttp://www.blogger.com/profile/08287674468193351664noreply@blogger.com0
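A note on the twinning constraint itself: on 32-bit SPARC a double occupies an even/odd pair of single-precision float registers (%f0:%f1, %f2:%f3, ...), so the allocator must hand out an aligned pair for an FF64 vreg and mark both halves as taken. A toy sketch of that constraint, with invented names rather than GHC's actual free-register code:

```haskell
-- Toy model: free single-precision registers are tracked as a list of
-- %fN indices. An FF64 vreg needs an even register whose odd twin is
-- also free; allocating it removes both halves of the pair.
type FreeFloats = [Int]

allocFF64 :: FreeFloats -> Maybe (Int, FreeFloats)
allocFF64 free =
  case [ r | r <- free, even r, (r + 1) `elem` free ] of
    []      -> Nothing
    (r : _) -> Just (r, filter (\x -> x /= r && x /= r + 1) free)
```

A free map that ignores the pairing (as a map written for x86/PPC would) hands out odd or half-free registers and corrupts doubles, which matches the FF64 failures above.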