Wednesday, May 27, 2009

The CAS experiment

(Compare And Swap)

One of the main questions we wanted to address with the GHC on SPARC project was how well the T2 architecture supports parallel lazy evaluation. Earlier posts have shown that it does well for most of the standard parallel benchmarks in the nofib suite, with sumeuler scaling almost perfectly.

For practical parallel programming, the limiting performance factor tends to be the cost of inter-processor communication. We must distribute work to the various threads, then collect and combine their results into the final output of the program. For an "embarrassingly parallel" program, the amount of communication is low, and it can make full use of the available processors. However, for most programs we will reach a point where adding extra processors to the machine does not improve performance, because the benefit of having an extra processor is outweighed by the cost of communicating with it.

For lazy functional programs, the work can be logically divided into thunks, that is, unevaluated function applications. In this case, lots of communication overhead can arise due to the need for the processors to decide who should be evaluating which thunks. The paper Haskell on a Shared Memory Multiprocessor describes the way GHC does this. An important point is that in GHC the processors are only weakly synchronized, that is, there a chance that two processors both evaluate the same thunk. This is safe in GHC because the code is purely functional, it does not matter if the same expression is evaluated more than once.

An alternative approach is to place a lock on each thunk, and require a processor to set a thunk's lock before evaluating it. We call the action of setting the lock "claiming" the thunk. This mechanism eliminates the possibility of two processors claiming the same thunk, at the cost of increased communication overhead. We would like to know how much this overhead would be.

The experiment
In the paper, the authors modified GHC to perform an atomic compare and swap instruction in the entry code for each thunk. On i386 they reported an average slowdown of 50%, with a maximum of 300%. This is a substantial penalty, and the paper goes on to describe the lock-free algorithms they implemented to avoid it.

The SPARC T2 was designed specifically to have a low inter-thread communication overhead, so we would like to know if using locks would be feasible on this architecture. To answer this I repeated the experiment from the paper, by adding the following instruction to the entry code for each thunk:

casa [%sp], %i1, %g1

This means "If the data pointed to by %sp is the same as the value in %i1 then swap that data with the value in %g1, otherwise put it in %g1". On SPARC, %sp is the C stack pointer, and %i1 is the STG R1 register, so the the comparison will always fail. However, the rest of the thunk code doesn't use the %g1 register, so this instruction is a no-op, apart from causing a processor wide synchronization event.

The benchmarking results with this modification are here. In summary, we get an average slowdown of 34%, with a maximum of 94%.


Comparison with x86
For comparison, I redid the experiment on my Pentium4 machine as well. The paper was written back in 2005, and other parts of GHC have improved since then. Here is the code I added to the entry code for each thunk:

push %eax
mov %ebx,%eax
lock cmpxchg %ebx, (%epb)
pop %eax

This fragment contains more instructions than the SPARC version, but I did it this way to ensure the fragment was a no-op (apart from the slowdown due to synchronization). For a real implementation we'd need more than one instruction anyway, and the cmpxchg is substantially more expensive than the others, so it's a fair comparison.

The nofib results for x86 are here. In summary, for this architecture we have an average 130% slowdown, with a worst-case of 540%.

Summary
For our purposes, the T2 has roughly a quarter of the thread synchronization overhead of a Pentium 4 machine. This is very good, but Haskell uses lazy evaluation as default, so thunk-entry is something that we do a lot of. An average 34% slowdown is too high to want to move to a locking implementation, but not by much.

GHC is already a mature compiler, so it's unlikely we'd be able to get that 34% back by improving other parts. I had a brief look at the generated code, and for a processor to claim every thunk we'd need a lock every 50 instructions or so. Using locks would be fine for applications that don't use such fine grained parallelism.

Wednesday, April 22, 2009

Loose ends and register pairing

I've been spending some time cleaning up loose ends and addressing the floating point register pairing on SPARC.

First loose end was refactoring the way information about what STG registers are pinned to what hardware regs is managed. When I fixed the NCG back in Jan / Feb I had defined a local freeRegs function which records which regs are available for allocation, mostly so I could work out what was going on. Last week I redid it so its gets the information from include/MachRegs.h again via some #ifdef nastiness. Said nastiness is contained to a single module, with the benefit being that the SPARC NCG behaves the same way as the ones for the other architectures.

Second loose end was working out why we couldn't pin a register to hold the pointer to a thread's gc_thread structure in the garbage collector code. The reason is that names like %l1 are just offsets into the SPARC register stack, and change during a function call. It's all well and good to pin a certain variable to %l1, but its value gets lost as soon as you make a call. Seems obvious in hindsight. We can pin STG registers to hardware registers while running the program proper, because Haskell code doesn't use the register window. However, the garbage collector is written in C, so the same plan doesn't fly.

I'm currently refactoring the graph coloring register allocator to do a better job of handling the SPARC's floating point registers. SPARC has 32x 32bit float registers. To store a 64 bit float you place the value in a pair of 32bit regs. This is much like the situation on x86 where the 16 bit regs AL and AH can be combined to form AX. Previously, we had just statically reserved 6 of the FP regs to be 32 bit, and the rest to be 64 bit. This works, but is kludgy and doesn't make good use of the hardware.

When I wrote GHC's graph coloring allocator back in 2007, I based it on the algorithm described by Smith, Ramsey and Holloway in A Generalized Algorithm for Graph-Coloring Register Allocation. This algorithm handles register quite elegantly, and I wrote the code with the SPARC architecture in mind. However, back then the SPARC NCG didn't actually work, so I couldn't test this functionality. It does now though, so i'm going back and finishing this unfinished business.

Tuesday, April 14, 2009

Instruction counts on x86 vs SPARC


Here are breakdowns of the total number of instructions executed on a Pentium-M processor and the SPARC T2 for four of the nofib/parallel benchmarks. The vertical axis is gig-instructions (x10^9). The benchmarks were run with just a single thread (+RTS -N1). The figures were recorded using hardware performance counters.

SPARC/Solaris provides the cpustat and cputrack commands which allow direct access to the performance counters, as long as you're the super-user. On x86/Linux I originally tried using valgrind to count the instructions, but it turned out to be too slow. Valgrind is basically an x86 CPU simulator, the simulation imposes a 50 - 200x slowdown, and I didn't want to wait all week. I ended up using a machine with a Linux kernel patched with the perfex/perfctr library, generously provided by Daniel Frampton's GC research group.

Recall that Pentium-M architecture does an on-the-fly translation from x86 CISC instructions to RISC micro-operations (uops). On these benchmarks, about 10% more uops are generated than the number of x86 instructions. The SPARC versions of the benchmarks use about 25% more instructions than Pentium-M uops. This is partly because the SPARC needs two instructions to load a 32 bit immediate value, which is something GHC does a lot of. For example, here is some code which builds part of a stack frame:

sethi %hi(_module_registered),%g2
or %g2,%lo(_module_registered),%g2
st %g1,[%g2]
add %i0,-4,%i0
sethi %hi(__stginit_base_Prelude_),%g1
or %g1,%lo(__stginit_base_Prelude_),%g1
st %g1,[%i0]
add %i0,-4,%i0
sethi %hi(__stginit_base_SystemziEnvironment_),%g1
or %g1,%lo(__stginit_base_SystemziEnvironment_),%g1
st %g1,[%i0]

SPARC instructions are all 32 bits wide, and most ALU ops only have room for a 13 bit immediate value. The usual way to load a 32 bit value is to use the sethi instruction to set the top 22 bits, then or in the bottom few. Lack of room for immediate values hurts continuation passing styles of compilation more than the traditional loop+branch styles, as we can't use PC relative addressing as much. Still, on the T2, dependent ALU ops don't cause pipeline stalls, so it's not a big penalty.

The Pentium-M doesn't support x86_64, so we have to use the x86 instruction set, which doesn't have many registers. Due to this, the Pentium-M code performs about twice the number of memory accesses then the SPARC code. The number of branch instructions is about the same, as expected.

All over, on the x86 about 50% of the instructions are loads or stores, and 18% are branches. On the SPARC, about 20% are loads or stores, and about 16% branches.

In other news, I've been spending some time looking through the RTS code for the scheduler, and the way black-holes are handled. The paper Haskell on a Shared-Memory Multiprocessor discusses the mechanism that GHC uses to avoid fine-grained locking of thunks during parallel evaluation. On x86 this locking is expensive, but it should be less so on the T2. I'm planning to repeat the experiment mentioned in that paper to determine whether doing the locking is feasible. Doing so would eliminate the need for the complicated stack tracing scheme described in that paper to kill of duplicated computation (at least on SPARC), and may improve the latency of parallel code. Stay tuned.

Monday, March 23, 2009

Pinning threads to cores

The following graphs show runs of the nofib/parallel benchmarks sumeuler, parfib and mandel, run with a varying number of Haskell threads (+RTS -N1 .. -N14).

In the plots marked "8 thread", the program was constrained to run on a single core. A core has 8 hardware threads, can issue an instruction from two different threads each cycle, has 2 ALUs, 1 Load/Store unit, 8k L1 data cache and 16k L1 instr cache. In this mode the threads must share limited hardware, so if the two active threads need to use the Load/Store unit at the same time, then one will stall. However, the inter-thread communication overhead is low, as they can communicate via the shared L1 cache.

In the plots marked "8 core", the program was run on a processor set consisting of the first hardware thread of each core. This means the program effectively has 8 ALUs, 8 Load/Store units, 64k L1 data and 128k L1 instr cache available. The T2 does not issue multiple instructions from the same thread in a single cycle, so the second ALU of each core is left unused. In this mode the program has more hardware available, but the communication overhead is higher as the threads must communicate via L2 cache.

The question is how well the T2's hardware threading multiplexes GHC threads (capabilities). In the "8 core" mode, the program has more than 4x the hardware available, and will not suffer contention over limited Load/Store units. However, it is not expected to be able to issue 4x the number of instructions, due to pipeline stalls and cache latency.
For the runtime graph, both axes use log scales. The fact that the plots for the "8 core" mode are (mostly) straight lines means that the performance is scaling (almost) perfectly, that is, double the threads, half the runtime. In all cases the program performs better in the "8 core" mode, so the extra hardware available more than makes up for the higher communication overhead. The raw run-time for 8 GHC threads are as follows:

progruntime "8 thread" runtime "8 core" speedup
mandel25.720.51.25
parfib14.07.41.89
sumeuler4.03.21.25

From these numbers we can see that in "8 core" mode, even though the program has more than 4x the hardware available, this doesn't translate into a 4x speedup. Each thread spends the majority of its time stalled while waiting for the cache to respond, and in "8 core" mode the cores are idle for much of the time.

This is less true for parfib. Note from the max issue rate graph that parfib achieves the peak single-core issue rate of 2.33 gig instrs/sec with only 6 threads. This implies the program is integer bound, not load/store bound. It is managing to issue two instructions on every cycle, and the overall execution of the program doesn't stall while waiting for memory.

In all cases, adding more GHC threads than hardware threads is a bad idea. Not only does the run time increase, but the total number of instructions for mandel and sumeuler increases dramatically. These extra instructions will be due to context switching between the Haskell threads, contribute to overall load, and will slow down other jobs running on the same machine.

Thursday, March 19, 2009

Peak issue rate is 18.64 Gig instrs/sec





This plot confirms the number for the peak instruction issue rate that I presented yesterday. On the T2 at 1165Mhz we get 18.64 Gig instructions/sec.

The test program spawns 64 threads that execute the following loop 10^8 times.

here:
add %o0, %o1, %o2
add %o1, %o2, %o3
add %o3, %o4, %o5
add %o5, %o6, %o7

sub %l3, 1, %l3
subx %l2, 0, %l2

cmp %g0, %l3
bne here
nop

Wednesday, March 18, 2009

Peak performance

These are plots of machine activity when running benchmarks from nofib/parallel on the T2. The Instr_cnt plot shows the machine-wide instruction issue rate at 100ms resolution. The Data_miss plot shows the number of loads which miss L1 data cache multiplied by 20. This scaling factor is the is the average number of cycles needed to fetch data from L2, according to Darryl's article on Calculating Processor Utilization.

The Solaris psrstat command informs me that the processor operates at 1165Mhz. The OS doesn't seem to scale the clock frequency with respect to load, so I've assumed that it's stable during the tests. The processor has 8 cores, with each core being able to issue two instructions per clock, each from different threads. This gives us a peak theoretical issue rate of 18.64 Gig instructions/sec (18.64 * 10^9). Of course, that's assuming we're just moving data between registers. For some applications we'll expect to run out of memory or IO bandwidth before we reach the peak issue rate.


sumeuler

These are plots of sumeuler running with 16, 32, 64, and 128 threads. The post from a few weeks ago shows that sumeuler scales perfectly, and that is reinforced here. As a rough approximation, when we double the number of threads, the issue rate doubles and the run-time halves. The idle time before and after each run corresponds to me flipping between terminals to control the logger vs run the program.

We have 64 hardware threads, so using -N64 Haskell threads (capabilities) gives us the shortest run time. I also ran the program with -N128 to confirm there's nothing to be gained by adding more capabilities than we have hardware threads (I tried for other values between -N64 and -N128 as well).

For -N64 we get a maximum issue rate of about 10 Gig instructions/sec. That is about 54% of the peak rate, which is pretty good, but it also indicates that we could gain something by improving the native code generator to do instruction reordering or prefetching. If the program was achieving the peak rate then there would be no free cycles left to use, so the only way to improve run time would be to reduce the number of instructions.

Over the next few days I'll run some C micro-benchmarks to confirm that my figure for the peak issue rate is correct.

mandel
mandel also scales well. This program sparks a new thread for each pixel in the output image. The time it takes to calculate the color of each pixel can vary by a factor of 10 or more, depending on where it is in the image. I imagine this is why the plot for the instruction issue rate is noisier than for sumeuler. The maximum issue rate is about the same.

matmult

There is something going on with matmult that I don't fully understand yet. The four plots below were taken in quick succession on a quiet machine. Note how the run time varies between 10 and 13 seconds, and that the second and fourth plots are quite similar but different from the other two. Perhaps the initial choice of which cores the threads are placed on is causing locality effects, but that wouldn't explain why those plots also seem to execute more instructions.

If another process was actively running on the machine, and contributing to the instruction count, then the two similar plots wouldn't be so similar. In these plots we can see an initial startup period, then seven distinct bursts of activity. Those bursts aren't garbage collection. The GHC runtime system reports that the major GCs are taking < 0.1s each, but the bursts in the plot have around 1s duration. More investigation required.




Thursday, March 12, 2009

Thread activity plotting

Here are plots of hardware thread activity on the T2 when running some of the benchmarks from nofib/parallel. They were obtained via the Solaris cpustat command, with custom post processing hackery to make the images.

The horizontal lines are hardware threads, with the darkness of the line corresponding to the number of instructions executed per unit of time. The ticks along the X axis give the number of seconds wall-clock runtime.

(sorry there are no numbers on the axes, custom hackery and all)

sumeuler



-N8-N32-N64

These are with 8, 32 and 64 haskell threads (+RTS -N8 ..) respectively. sumeuler scales well, so in the -N64 case we get even machine utilization and a short runtime.

In the plot for -N8, the system seems to favor the the first thread of each core, so every 8th trace is darker. The higher threads aren't used as much. There seems to more activity in the last two cores.

matmult


-N32
-N64

The plot for matmult reveals distinct phases of high and low parallelism. At about 7 sec into the -N32 plot we see a single running Haskell thread shuffling between the cores.

partree

-N32

In this plot we see a faint background computation, along with what I expect are regular periods of garbage collection. However, it appears as though most of the work is being done by a single Haskell thread. The plots for 8 threads and 1 thread have a similar runtime.

I am eagerly awaiting the release of ThreadScope so I can match these hardware profiles up with GHC's view of the world.

Monday, March 9, 2009

The GNU Debugger and me

At the end of last week I tried to profile the performance of matmult and parfib, but discovered it was broken. Specifically, the allocation count for some cost centers is getting corrupted. For example:

COST CENTRE            MODULE    no. entries  %time %alloc   %time %alloc
MAIN MAIN 1 0 0.0 20.3 0.0 100.0
main Main 172 30590380562441576 0.0 19.0 0.0 19.0
CAF Main 166 30590376291721096 0.0 20.2 0.0 20.2
main Main 173 0 0.0 0.0 0.0 0.0
dude Main 174 30619204112395264 0.0 0.0 0.0 0.0

Sadly, for some reason it's also getting corrupted with the via-c build, so my usual approach of swapping out suspect -fasm closure code for known good -fvia-c code isn't going to fly.

That puts me squarely in GDB territory, and the nastiness of breakpoints and single stepping through instructions. I figure GDB is like the SWAT team of software engineering. It's fun to recount its exploits to your pals over beer, but you hope you're never in a situation where you actually need to call on its services.

Anyway, some time later I've got a Cmm line:
        I32[CCCS] = CAF_ccs;
I64[I32[R1 + 4] + 16] = I64[I32[R1 + 4] + 16] + 1 :: W64;


Which produces:
        
sethi %hi(Main_CAFs_cc_ccs),%g1
or %g1,%lo(Main_CAFs_cc_ccs),%g1 -- ok, load CAF_ccs into %g1
sethi %hi(CCCS),%g2
or %g2,%lo(CCCS),%g2 -- ok, load addr of CCCS into %g2
st %g1,[%g2] -- ok, I32[CCCS] = CAF_ccs
add %g1,1,%g1 -- hmmm: produce a misaligned addr into %g1
addx %g2,%g0,%g2 -- hmmm: a nop. %g0 is always zero
ld [%l1+4],%g3 -- ok, load addr of the cost centre record into %g3
add %g3,16,%g3 -- ok, calculate the addr of the allocation count
-- for that cost centre

st %g2,[%g3] -- badness! write addr into the allocation count field
st %g1,[%g3+4] -- ditto


Finding being 80% of fixing, hopefully this'll be working again tomorrow.

Project midpoint

I was away for a couple of days last week at my sister's wedding, but am back on the SPARC case.

After collecting the benchmarks in last post, we've reached the midpoint of the project. The various stakeholders, interested parties and I then had an email discussion about what to do next, a summary of which is:

Discussion

  • We're pretty happy with the performance of the highly threaded benchmarks like sumeuler, but others such as matmult could use some work.

  • Darryl ran some tests that confirmed there is indeed a stall cost for load instructions followed by dependent ops that use the loaded value.

  • Although doing instruction reordering would reduce the cost, the overall throughput of the machine probably doesn't suffer much for highly threaded code. The T2 is designed to handle stalls well, as long as there is another useful thread to switch to.

  • We considered targeting the intermediate representation (IR) used in Sun's C compiler. That would do code generation better than we ever could, but the IR itself isn't publicly documented. We might be able to reverse engineer it, but the end product would rot when the IR changes.

  • Targeting LLVM is another option, but that's a longer term project, and not SPARC specific. Manuel mentioned on IRC that he has someone looking into targeting LLVM.

  • Doing our own instruction reordering is also an option, but also not SPARC specific. It would also be subsumed by a prospective LLVM port.

  • Performing prefetching might help for single threaded code, because GHC spends a fair amount of time copying data between the stack and the heap. The T2 does no hardware prefetching.

  • We decided it'd be better using the rest of the time looking into how well we can exploit the features specific to the T2, instead of focusing on the performance of single threaded code. Manuel posted some initial results for DPH showing the T2 running about as well as an eight core Xenon machine for a memory-bound benchmark, so that'll do for now.

Plan


  • First of all, get some profiles together comparing parfib and matmult running on the T2 and x86. This should give us an idea of where the time is going

  • An interesting feature of the T2 is the low thread synchronization overhead. This might support per-thread thunk locking instead of the complex machinery described in Haskell on a Shared-Memory Multiprocessor that is used to kill off duplicated computations.

  • Simon, Simon and Satnam's recent paper Runtime Support for Multicore Haskell had a lot to say about locality effects.
    It would also be interesting to compare the cost of pinning threads to different cores eg 8 threads on one core vs 1 thread per core.

  • There are also outstanding bugs to fix. We want to have a stable SPARC port, with native code generator, for the next major GHC release.

Thursday, February 26, 2009

Benchmarking

I spent yesterday writing a tool (sekhmet) to help with the benchmarking, and collected the data today.

The test programs are from the nofib parallel benchmark suite. I did 5 runs of each in succession, and took the average elapsed wall-clock time. The machines were lightly loaded, though I can't see what's going on in the second slice of my Dept T2 (probably nothing).

Test Machines:

  • My Desktop machine. Pentium4 (Prescott 2M) CPU, 1 core, 2 threads per core, 3Ghz clock, 1GB RAM.

  • My Mac Book Air. Core2 Duo (Merom) CPU, 2 cores, 1 thread per core, 1.6Ghz clock, 2GB RAM.

  • My Dept T2. UltraSPARC T2 (Niagra 2) CPU, 4 core slice of 8 core machine, 8 threads per core, 1.4Ghz clock, 1GB ram in slice.

  • The haskell.org T2. UltraSPARC T2 (Niagra 2) CPU, 8 cores, 8 threads per core, 1.4Ghz clock, 32GB ram.



Benchmarks:



Parfib is an embarrassingly parallel micro-benchmark. It's a good test of the parallel Run Time System (RTS), because if this program doesn't parallelize then nothing will.

Note that this is a log-log graph. The fact that the initial plots for the T2s are linear means that the performance is scaling proportionally to the number of threads, that is, double the threads, half the runtime. With parfib, it's easy to divide the work into uniformly sized chunks (grains) and each thread only needs to talk to its parent and children, there is little cross-thread communication.

On the T2, each core has 8 threads, and each core has 2 ALUs and 1 Load/Store unit. However, because only one instruction from a thread is issued at a time, a single thread cannot make use of all the hardware in a core. To get good performance, you really need lots of threads.

With parfib, lots of threads are no problem. The T2 beats the Pentium4 at 5 threads, and the dual core Merom chip at about 9. I don't know whether the OS will try to schedule those threads all on the same core, or spread them out -- I'll find out.

At about 20 threads, the speedup starts to flatten out. This will either be because the grains are already too small to usefully divide them any further, or we're out of memory bandwidth. At a guess, I'd say the former because the sumeuler test below has no problem scaling up to 64 threads.

Note that in the right of the graph, the plots start to get shaky because there is a lot of variance in the runtime. On some benchmarks, when we try to create too many threads the runtime can vary by a factor of 2. Averaging the time of 5 runs smooths a bit of it out, but we can still see the variance though the average.



partree shows what happens when we can't divide the work into enough threads to make proper use of the hardware. Either there's not enough thread-level parallelism inherent in this problem, or the program hasn't been written to expose it properly. On this benchmark, a single Pentium 4 thread is about 8x faster than a T2 thread.

Interestingly, the P4 and Merom behave much worse than the T2 when we increase the number of threads, and the P4 much worse than the Merom. Something to investigate...



sumeuler is another embarrassingly parallel microbenchmark, and is good smoke-test for the run time system. This one scales right up to the limit of the hardware, getting faster until it runs out of hardware threads. Compared to parfib, when we add too many threads the runtime degrades much faster, not sure why.



The fact at the plots for the Pentium4 and Merom chips cross over is interesting. So is the fact that when the number of threads is increased on the Merom, the performance is relatively even, but it degrades badly on the P4. This seems to be like the behavior of partree. The T2 performances matches the others at about 16 threads, but there isn't enough exposed parallelism here to beat them.

Summary:

The T2 is purpose built to exploit thread level parallelism. If you have lots of jobs (threads) to run, then the throughput of the machine is quite high. On other hand, with the current GHC-head, if you only have a single thread then performance is going to be from sort-of-acceptable to miserable -- which mean's there's more work to be done.

In these benchmarks, the single thread performance between the T2 and P4 varies significantly. With partree the P4 thread is 8x faster, but in sumeuler it's only 1.6x faster. I'll need to put some more work in to determine why this is.

Tuesday, February 24, 2009

Bugfixes and pretty graphs

Fixed the problem with conc042-045 and concprog002. The bug was in the runtime system code that deals with software transactional memory (STM). Tag bits from a closure pointer weren't getting stripped before it was dereferenced, which was causing misaligned read error. This was a similar bug to one I found at the very start in Printer.c:printClosure.

I also made some pretty graphs of parallel nofib benchmarks. These are just for the T2, which I ran by hand. I'll do comparisons with other platforms tomorrow:





These are both embarrassingly parallel micro benchmarks. This is clear from the fact that the runtime scales almost linearly with the number of threads, until the available hardware is exhausted. Still, it shows its working like it ought to. Horray for +RTS -N64.

Monday, February 23, 2009

Sanity

Fixed the problem with enum02.

It turned out the code to compare values of type Integer in the runtime system was being mis-compiled. This was because some of the basic blocks created by the condition code generator didn't end in jumps, which confused the register liveness determinator, and caused a bad allocation.

Decided that 1 ounce of prevention is worth 100000000000 ounces of staring blindly at assembly dumps for hours, so I added code to the NCG that checks the above invariant holds in all code it generates. It also checks that the branch delay slot is always filled for branch instructions, and I'll add any other invariants I come across.

Saturday, February 21, 2009

Triage


OVERALL SUMMARY for test run started at Thursday, 19 February 2009 11:39:01 AM EST
2301 total tests, which gave rise to
12203 test cases, of which
0 caused framework failures
2359 were skipped

9375 expected passes
342 expected failures
1 unexpected passes
126 unexpected failures

Unexpected passes:
hpc_ghc_ghci(normal)

Unexpected failures:
enum02(all) -- wrong output. 64 bit arith.

ffi019(all) -- segv
conc042(threaded2,profthreaded) -- segv, segv
conc043(threaded2,profthreaded) -- segv, segv
conc044(threaded2,profthreaded) -- segv, segv
conc045(threaded2,profthreaded) -- segv, segv
concprog002(threaded2) -- segv
random1283(threaded2) -- segv
ghciprog004(normal) -- segv

strings(all) -- wrong output. same as HEAD on x86_64.
bits(all) -- "

testblockalloc(normal,threaded1) -- ? out of memory error, as expected?
process007(all) -- ? what is this trying to test. uses sh tricks.

genUpTo(all) -- build problem. 'StringRep' not in scope.

hClose002(all) -- Solaris file locking difference.
user001(all) -- Solaris unix env lib differences.

ann01(profc,profasm) -- (*) driver / linker problem.
annrun01(all) -- "

1861(optc,profc) -- INFINITY undefined. C header file difference on Solaris.
derefnull(profc,profthreaded) -- segv as expected, but does not drop ./derefnull.prof.
divbyzero(profc,profthreaded) -- arith error as expected, does not drop ./divbyzero.prof.

apirecomp001(normal) -- ok by hand. Test framework makefile wibble.
ghcpkg02(normal) -- "

arith011(profc) -- ok by hand. GCC timeout.
barton-mangler-bug(optc,profc) -- "
seward-space-leak(ghci) -- "
joao-circular(optc,profc) -- "

1914(ghci) -- ok by hand. Needs GNU version of touch.
gadt23(normal) -- "
recomp004(normal) -- "

hpc_draft(normal) -- ok by hand. Not sure why it failed during run.
hpc_hand_overlay(normal) -- "
hpc_markup_002(normal) -- "
hpc_overlay(normal) -- "
hpc_overlay2(normal) -- "
hpc_show(normal) -- "
arr017(profthreaded) -- "


ann01

=====> ann01(profc)
cd . && '/data0/home/benl/devel/ghc/ghc-HEAD-build/ghc/stage2-inplace/ghc' -fforce-recomp
-dcore-lint -dcmm-lint -dno-debug-output -c ann01.hs -O -prof -auto-all -fvia-C -v0 >ann01.comp.stderr 2>&1
Compile failed (status 256) errors were:
ann01.hs:37:0:
Dynamic linking required, but this is a non-standard build (eg. prof).
You need to build the program twice: once the normal way, and then
in the desired way using -osuf to set the object file suffix.

Tuesday, February 17, 2009

Thunderbirds are go

Alright, it's been a busy few days.

I've split the native code generator into architecture specific modules, and chopped the SPARC part into bite-sized chunks. There are still some #ifdefs around for other architectures, but most of the native code generator is compiled, independent of what the target architecture is.

I think I've squashed most of the bugs as well, but I'll have to wait overnight for a complete test run to finish.

Here are some inital nofib results.

The tests compare:

  • My 3Ghz Pentium4 Desktop

  • My 1.6Ghz Core2 Duo (Merom) MacBook Air

  • The 1.4Ghz SPARC T2


Take these results with a fair dose of salt, the libs on the T2 were built with -O2, but on the P4 and Merom I only used -O. I'll have some more comprehensive results in a few days. I'm not sure what's going on with "primetest", I'll have to run it again.

In short, with the current compiler, a Haskell program on a single T2 thread runs about 5x slower than on a P4 desktop. Remember that the T2 is totally set up for parallel processing, and has none of the instruction reordering or multiple issue of the P4. If we scale for the clock rate difference between the P4 and T2, then a T2 thread runs about 2.3 times slower than a P4 thread, clock for clock. On the other hand, 16 threads (2 per core) can be active at once on the T2.

Anyway, in regards to single thread performance on the T2, I think we're losing out big-time due to our total lack of instruction reordering.

Here is a piece of code from the nqueens test from nofib.

.LcFW:
add %i0,-20,%g1
cmp %g1,%i2
blu .LcFY
nop
add %i3,16,%i3 <- dep
cmp %i3,%i4 <-
bgu .LcFY
nop
sethi %hi(stg_upd_frame_info),%g1 <- dep
or %g1,%lo(stg_upd_frame_info),%g1 <-
st %g1,[%i0-8] <-
st %l1,[%i0-4]
sethi %hi(sDJ_info),%g1 <- dep
or %g1,%lo(sDJ_info),%g1 <-
st %g1,[%i3-12] <-
ld [%l1+12],%g1 <- copy
st %g1,[%i3-4] <-
ld [%l1+16],%g1 <- copy
st %g1,[%i3] <-
add %i3,-12,%g1
st %g1,[%i0-16]
ld [%l1+8],%g1 <- copy
st %g1,[%i0-12] <-
sethi %hi(base_GHCziNum_zdf6_closure),%l2 <- dep
or %l2,%lo(base_GHCziNum_zdf6_closure),%l2 <-
sethi %hi(sDK_info),%g1 <- dep
or %g1,%lo(sDK_info),%g1 <-
st %g1,[%i0-20]
add %i0,-20,%i0
call base_GHCziNum_zdp1Num_info,0
nop


A typical GHC compiled Haskell program spends most of its time copying data around memory. Such is the price of lazy evaluation, which we all know and love. The copies create a huge amount of memory traffic, and associated cache miss stalls. In addition, many of the instructions are directly dependent on the one just above them, which creates data hazard stalls.

At least for the T2, I expect that doing some basic instruction reordering will be a big win, so I'll try that first. Maybe some prefetching also, considering that we're moving so much data around memory.

Friday, February 13, 2009

Refactoring

untie the knots
soften the glare
write the program
that was always there

Thursday, February 12, 2009

Allocate!

Last week I pushed some patches that started to break up the native code generator into architecture specific modules. Unfortunately, due to a miscommunication the patches weren't validated on powerpc properly, and now the head has been broken on that architecture for a few days.

I've been using Thorkil's machine to debug the problem, and have narrowed it down to the linear register allocator (surprise surprise). The linear allocator has some code that tries to make sure that basic blocks are allocated in call-order. Unfortunately, for tiresome reasons, if there are no branch instructions to a particular block then the allocator goes into an endless loop. This problem is exposed when trying to compile AutoApply.cmm from the runtime system. Only a small amount of code from the RTS contains Cmm level loops, so it's not well tested in that respect. Regular, compiled Haskell code contains no assembly level loops.

The graph coloring allocator compiles the code fine, so I'm pretty sure it's not a problem with code generation. I'm not sure why splitting up the code generator created this problem.

There isn't a comment in the allocator code that fully explains the block ordering problem, or gives a test case. I tried disabling this code, but it didn't make things better. Also, when I try to dump the cmm code for AutoApply.cmm with -ddump-opt-cmm the pretty printer (or something) also goes into an endless loop. This happens when compiling with -fregs-graph as well, so I think it's an unrelated problem.

More digging. Code generation for AutoApply.cmm produces basic blocks that have no explicit branches to them, because the branch is via a jump table...

Dah! Found it. My previous patch to the handling of fixup code wasn't complete for powerpc. The BCTR instruction wasn't being treated as though it was a branch.

Realise that there is a problem: 1 min
Setup to work on the problem: 1 hr
Find the problem: 5 hrs
Fix the problem: 1 min

Ah, code generation. Who doesn't love it?

Thursday, February 5, 2009

Just Testing

Ok, I've got the stage3 build with RTS and Libs working with -fasm -O2. Did a full testsuite run last night, and it looks promising. However, a stack of the driver tests failed because the compiler or build system spews something about "/bin/sh: cygpath: not found". This line isn't in the file its testing against, which results in a failure.

.. the cygpath thing was easy enough to hack around. A full stage3 test run gives the following. I've sorted the failing tests by ways failed.

OVERALL SUMMARY for test run started at Thursday,  5 February 2009 12:43:50 PM EST
2287 total tests, which gave rise to
8550 test cases, of which
0 caused framework failures
1580 were skipped

6483 expected passes
247 expected failures
1 unexpected passes
239 unexpected failures

Unexpected passes:
hpc_ghc_ghci(normal)

Unexpected failures:
TH_ppr1(normal)
apirecomp001(normal)
gadt23(normal)
ghciprog004(normal)
ghcpkg02(normal)
hpc_draft(normal)
hpc_hand_overlay(normal)
hpc_markup_001(normal)
hpc_overlay(normal)
hpc_overlay2(normal)
rebindable5(normal)
recomp004(normal)

ThreadDelay001(normal,threaded1)
conc014(normal,threaded1)
conc015(normal,threaded1)
conc017(normal,threaded1)
concprog001(normal,hpc,threaded1)
signals001(normal,hpc,threaded1)
list001(normal,hpc,threaded1)
hSeek003(normal,hpc,threaded1)
hSetBuffering003(normal,hpc,threaded1)
copyFile002(normal,hpc,threaded1)
dynamic001(normal,hpc,threaded1)
dynamic002(normal,hpc,threaded1)
enum03(normal,hpc,threaded1)
integerBits(normal,hpc,threaded1)
xmlish(normal,hpc,threaded1)
memo001(normal,threaded1)
memo002(normal,threaded1)
num007(normal,threaded1)
num013(normal,threaded1)
openFile008(normal,threaded1)
read001(normal,threaded1)
readwrite002(normal,threaded1)
testblockalloc(normal,threaded1)
tup001(normal,threaded1)
getDirContents001(normal,threaded1)
jtod_circint(normal,threaded1)
life_space_leak(normal,threaded1)

arr016(normal,threaded1,threaded2)
random1283(normal,threaded1,threaded2)

1914(ghci)
seward-space-leak(ghci)

1861(optc)
barton-mangler-bug(optc)
joao-circular(optc,hpc,threaded1,threaded2)
time003(optc,hpc,optasm,threaded2)
simplrun007(optc,optasm)


2910(hpc)
arith012(hpc)
arr019(hpc)
cholewo-eval(hpc)
concio002(hpc)
hDuplicateTo001(hpc)
copyFile001(hpc)
hGetPosn001(hpc)
hIsEOF002(hpc)
hSeek001(hpc)
hSeek002(hpc)
hTell002(hpc)
ioeGetHandle001(hpc)
launchbury(hpc)
num005(hpc)
num009(hpc)
num014(hpc)
process004(hpc)
process006(hpc)
renameFile001(hpc)
show001(hpc)
signals002(hpc)

arith002(threaded1)
arith005(threaded1)
dsrun022(threaded1)

conc042(threaded2)
conc043(threaded2)
conc044(threaded2)
conc045(threaded2)
concprog002(threaded2)

andy_cherry(normal,optc,hpc,optasm,threaded1,threaded2)
annrun01(normal,optc,hpc,optasm,threaded1,threaded2)
arith011(normal,optc,hpc,optasm,ghci,threaded1,threaded2)
bits(normal,optc,hpc,optasm,ghci,threaded1,threaded2)
enum01(normal,optc,hpc,optasm,threaded1,threaded2)
enum02(normal,optc,hpc,optasm,ghci,threaded1,threaded2)
ffi009(normal,optc,hpc,optasm,threaded1,threaded2)
ffi019(optc,hpc,optasm,ghci,threaded1,threaded2)
genUpTo(normal,optc,hpc,optasm,ghci,threaded1,threaded2)
hClose002(normal,optc,hpc,optasm,ghci,threaded1,threaded2)
hGetBuf001(normal,optc,hpc,optasm,threaded1,threaded2)
integerConversions(normal,optc,hpc,optasm,threaded1,threaded2)
num012(normal,optc,hpc,optasm,ghci,threaded1,threaded2)
process007(normal,optc,hpc,optasm,ghci,threaded1,threaded2)
strings(normal,optc,hpc,optasm,ghci,threaded1,threaded2)
tcrun007(normal,optc,hpc,optasm,ghci,threaded1,threaded2)
user001(normal,optc,hpc,optasm,ghci,threaded1,threaded2)



A good percentage of these are seg faulting, so there is still a bug or two lurking around.

Continued to split the native code generator into arch specific modules while waiting for test runs. I'm splitting into Alpha, X86, PPC, and SPARC subdirs. The Alpha code is long dead, but it seems a shame to just delete it, so I've made modules for it but commented it out in those modules.

Due to time constraints I'll have to leave some of the #ifdefs in, expecially for the difference between i386 and x86_64. There are also some #ifdefs to choose between Darwin and other OSs.

The way it setup now is that (almost) all code from all arches gets compiled, no matter what the host platform is. This means that potential validate problems are always exposed. It should also help reduce the rot-factor for seldom used platforms.

The top level modules like MachCodeGen still have an #ifdef to choose which one to import for the specific platform. However, it should be straight forward to go from this setup to having a data type representing the target platform, and being able to specify it on the command line.

On the SPARC front, I've just finished cleaning up the horror show that was nativeGen/MachRegs.hs. I'll spare you the details, but this source comment (not mine) sums it up nicely: "Gag me with a spoon, eh?"

It feels a lot simpler when all the SPARC code is isolated, and you don't have to expend mental effort to reject code for architectures you're not working on. It's also reassuring to know that if you change something you're not going to break something for another architecture that is in an #ifdef block that isn't being compiled on the current platform.

Also went through and fixed all the warnings in code from MachRegs, MachInstrs, RegsAllocInfo and PprMach. The world seems shiny and new.

Monday, February 2, 2009

Join Points

There were no updates last week on account of it being the Australia Day public holiday on monday, and me going to fp-syd later in the week.

Although the SPARC NCG now passes almost all of the testsuite, and can compile the Haskell source of GHC itself, it still has some problems when compiling the RTS. Unlike the Cmm code emitted when compiling Haskell source, the hand written Cmm code in the RTS contains loops as well as join points.

Mid last week I hit a bug in the linear register allocator, but didn't finish fixing it. Here is some emitted assembler code when compiling rts/PrimOps.cmm

 .Lch:
....
st %l0,[%l6]
add %l6,4,%l6
b .Lci <--- JUMP to .Lci
nop

.Lci:
st %l0,[%i6-84]
ld [%i6-76],%l0
st %l0,[%i6-76]
sll %l0,2,%l0
....

.Lcj:
....
st %l6,[%i6-76]
add %l0,8,%l6
b .Lci <--- JUMP to .Lci
nop

.Ln1l:
b .Lci <--- JUMP to .Lci
mov %l7,%l0
ld [%i6-100],%l7 <--- ??? slot 100 never assigned to
b .Lci <--- unreachable code??
nop


Label .Lci is a join point because it is the target of several jumps. The code in block .Ln11 is supposed to be fixup code. Fixup code is used to make sure that register assignments match up for all jumps to a particular label.

This code is badly broken though
1) .Ln11 is unreachable, nothing ever jumps to it.
2) There should probably be a .nop after the first branch instruction in .Ln11 to fill the SPARC branch delay slot. That's if that first jump is supposed to be there at all.
3) The instruction ld[%i6-100],%l7 loads from spill slot 100, but that slot is not stored to anywhere else in the code.
4) The three instructions after the first are also unreachable.

This is a bit funny considering I spent July-Sept 2007 at GHC HQ writing a new graph colouring register allocator for GHC, specifically because the linear allocator was known to be dodgy.

Alas, the graph allocator suffered the fate of many such "replacement projects". It is a little slower on some examples, and it comes with an ambient FUD cloud because it is fresh code. Sure, I could just use the graph allocator for SPARC, but that would still leave known bugs in the linear allocator for someone else to trip over. Until such time as we ditch the linear allocator entirely, for all architectures, we still need to maintain it.

The reason why .n1l wasn't being referenced was that the code to change the destination of the jump instruction at the end of block .Lch was missing. ... finally got sick enough of the organisation of RegAllocLinear.hs that I spent most of the day splitting it out into its own set of modules, which was well over due.

Some time later...

It seems as though the code in block .Ln1l is supposed to be swapping the values in regs %l7 and %l0. In block .Lch vreg %vI_co is in %l7 and %vI_cp is in %l0, but the block .Lci wants them the other way around, what a nightmare.

More time later...

From handleComponents in RegAllocLinear.hs

 -- OH NOES!
(_, slot) <- spillR (RealReg sreg) spill_id

The _ is supposed to match the instruction, generated by spillR, that stores the value in sreg into the spill slot at [%i6-100]. The comment is mine.

Thursday, January 22, 2009

Wait and perform

Spent time reading about hardware performance counters while waiting for builds and test runs. Realised that I had built the RTS with -fvia-c before, so there are still a couple of NCG things to fix before we can do a full stage2 -fasm build.

The RTS code uses a machine op MO_WriteBarrier that compiles to a nop on i386, but uses the LWSYNC instruction on PPC. I still have to read about that and work out what to do for sparc.

Also spent some time fighting a configure problem. The configure script can detect whether the assembler supports the -x flag, that tells it whether to export local symbol names. The configure script looks at the ld in the current path, but ghc calls gcc to do the assembly. That gcc will use whatever ld /it/ was configured with, and on mavericks the Solaris linker doesn't support -x. (epic sigh). In the same vein, the Solaris assembler doesn't support the .space directive. Fixed that and my builds are running again.

Started of a full run of the testsuite with the stage1 compiler + via-c rts, but that'll take all day and all night. num012 is failing with 16 bit integer arithmetic, but that's probably no big deal.

The hardware performance counters on the T2 are exercised with the cputrack util, which I spent some time learning about. You can get counts of instrs executed, branches taken, cache misses etc. There are lots of measurements available, but you can only collect two at a time - because there are only two PICs (performance instrumentation counters) on the T2.

For example:

benl@mavericks:~/devel/ghc/ghc-HEAD-work/nofib/spectral/boyer>
cputrack -t -T 10 -c Instr_cnt,DC_miss ./boyer 200
time lwp event %tick pic0 pic1
3.409 1 exit 3954145530 1730642411 32277574
True

benl@mavericks:~/devel/ghc/ghc-HEAD-work/nofib/spectral/boyer>
cputrack -t -T 10 -c Instr_ld,Instr_st ./boyer 200

time lwp event %tick pic0 pic1
3.413 1 exit 3957292767 286778987 219450322
True

benl@mavericks:~/devel/ghc/ghc-HEAD-work/nofib/spectral/boyer>
cputrack -t -T 10 -c Br_completed,Br_taken ./boyer 200
time lwp event %tick pic0 pic1
3.411 1 exit 3952535075 281792355 190004907
True

So, it looks like we're getting about 2.3 clocks per instruction (cpi). About 30% of instructions executed are loads or stores. If those load/stores account for L1 data cache misses then about 6% of them miss. That might be wrong though - I'll have to work out whether the info tables are stored in the data or instr cache. In any event, about 30% of all instructions are loads or stores , and another 16% are branches of which 67% are taken.

I'll post some nice graphs and whatnot once my nofib run has gone through. Should probably add the performance counters to nofib-analyse as well, in the place of valgrind on sparc / solaris.

Wednesday, January 21, 2009

The Strap

Implemented tabled switch, and fixed a problem when converting float to integer formats. The FSTOI instruction leaves its result in a float register, but in integer format. On at least V9 you then have to copy the value to mem and back to get it into an actual int register. Later architectures have instructions to do this without going via mem, but we're sticking with V9 for now. Perhaps we could get some speedup for FP code by using the later instruction set extensions (Vis2.0?)

The genCCall problem was that calls to out-of-line float ops like sin and exp were disabled for 32 bit floats, maybe because other things were broken before.

Also fixed a 64 bit FFI problem. Closures allocated by the storage manager are 32bit aligned, but the RTS was trying to do misaligned read/writes of 64bit words. The standard SPARC load/store instructions don't support misaligned read/writes, so had to break it them up into parts.

Looking good. A few tests still fail, but they fail the same way with all ways. I'm guessing that they are problems with Solaris or other environment stuff, and not the NCG.

I tried to do build the stage2 compiler with -fasm, but I made the foolish mistake of pulling from the head beforehand, which broken the build. Did a full distclean, but will have to leave it overnight.


OVERALL SUMMARY for test run started at Wednesday, 21 January 2009 9:05:20 AM EST
2283 total tests, which gave rise to
8531 test cases, of which
0 caused framework failures
7429 were skipped

1047 expected passes
32 expected failures
0 unexpected passes
23 unexpected failures

Unexpected failures:

cvh_unboxing(optasm) -- fixed
seward-space-leak(optasm) -- fixed
1916(optasm) -- fixed
expfloat(optasm) -- fixed
fun_insts(optasm) -- fixed
2594(optasm) -- fixed
ffi019(optasm) -- fixed
arith011(optasm) -- fixed

barton-mangler-bug(optasm) -- optc: timeout. others: ok
joao-circular(optasm) -- timeout

---- noncrash fail all ways ----
num012(optasm)
hpc_fork(optasm)
hClose002(optasm)
user001(optasm)
2910(optasm)
tcrun007(optasm)
T2914(optasm)
annrun01(optasm)
ann01(optasm)
haddockA028(optasm)
hpc001(optasm)
process007(optasm)
tough(optasm)

Tuesday, January 20, 2009

The Grind

Decided to address some of the other outstanding bugs before starting on genSwitch. I don't want other bugs to confuse the issue.

Did a full run of the testsuite with optasm. I added the triage comments manually.

OVERALL SUMMARY for test run started at Thursday, 15 January 2009  7:47:05 PM EST
2283 total tests, which gave rise to
8531 test cases, of which
0 caused framework failures
7429 were skipped

1025 expected passes
32 expected failures
0 unexpected passes
45 unexpected failures

Unexpected failures:
arith004(optasm) -- hpc: iselExpr64 panic. optasm: getRegister panic.
num012(optasm) -- optasm: ppr match fail. others: wrong output
process007(optasm) -- hpc: iselExpr64. others: wrong output
time003(optasm) -- hpc: iselExpr64 panic. optasm: getRegister panic.
bits(optasm) -- hpc: iselExpr64 panic. optasm: getRegister panic.
tough(optasm) -- iselExpr64
hpc001(optasm) -- iselExpr64
hpc_fork(optasm) -- iselExpr64
tc213(optasm) -- getRegister

expfloat(optasm) -- genCCall can not reduce
fun_insts(optasm) -- genCCall can not reduce

num013(optasm) -- iselExpr64, match fail
2388(optasm) -- match fail
enum02(optasm) -- match fail
enum03(optasm) -- match fail
arith011(optasm) -- match fail
arith017(optasm) -- match fail
ffi017(optasm) -- match fail C types
ffi018(optasm) -- match fail 64bit ffi
ffi019(optasm) -- match fail 64bit ffi

1916(optasm) -- invalid register

2594(optasm) -- all ways: segv 64bit ffi

seward-space-leak(optasm) -- genSwitch
simpl007(optasm) -- genSwitch
syn-perf(optasm) -- genSwitch
tup001(optasm) -- genSwitch
andy_cherry(optasm) -- genSwitch
arrowrun001(optasm) -- genSwitch
arrowrun004(optasm) -- genSwitch
barton-mangler-bug(optasm) -- genSwitch
cg054(optasm) -- genSwitch
cvh_unboxing(optasm) -- genSwitch
drv005(optasm) -- genSwitch
drv006(optasm) -- genSwitch
drvrun014(optasm) -- genSwitch
joao-circular(optasm) -- genSwitch
jtod_circint(optasm) -- genSwitch


hClose002(optasm) -- all ways: same wrong output
user001(optasm) -- all ways: same wrong output
2910(optasm) -- all ways: same wrong output
tcrun007(optasm) -- all ways: missing import
T2914(optasm) -- all ways: type error
annrun01(optasm) -- all ways: unknown package ghc
ann01(optasm) -- all ways: no TH in stage1 all ways
haddockA028(optasm) -- all ways: test wibble


Looking into arith004. The code to generate integer remainder / divide instructions was missing. On further investigation, old SPARC implementations didn't have hardware support for this. GHC used to call out to a library. The SPARC T2 has hardware divide, but you have to compute remainders using div/mul/sub. Added code to do so. Not sure if we still want to maintain the software mul/div path - but I'll worry about that when the rest is fixed and refactored. Also fixed code to generate 64 bit operations on 32 bit SPARC, which was the isel64Expr problem.


arith004 -- fixed
bits -- fixed
tc213 -- fixed
arith012 -- fixed
arith017 -- fixed
ffi017 -- fixed
ffi018 -- fixed
enum02 -- fixed
enum03 -- fixed

num012 -- invalid register 64bit

time003 -- genSwitch

ffi019(optasm) -- all ways: bus error

process007 -- all ways: same wrong output
tough -- all ways: same wrong output
hpc001 -- all ways: same wrong output
hpc_fork -- all ways: same wrong output

Thursday, January 15, 2009

Info tables

The info tables are getting broken for some reason. The Cmm code has:

sEl_ret()
{ [const Main.$wf_srt-sEl_info;, const 1;, const 2228231;]
}

-fvia-c gives:

.text
.align 4
.long Main_zdwf_srt - sEl_info
.long 1
.long 2228231
sEl_info:
....

but -fasm gives:

.text
.align 4
.long Main_zdwf_srt+0 <---------- lost sE1_info
.long 1
.long 2228231
sEl_info:
....

some time later..

#if sparc_TARGET_ARCH
-- ToDo: This should really be fixed in the PIC support, but only
-- print a for now.
pprImm (ImmConstantDiff a b) = pprImm a
#else
pprImm (ImmConstantDiff a b) = pprImm a <> char '-'
<> lparen <> pprImm b <> rparen
#endif

!!!! .... sigh.

Fixed that, but it still doesn't work. Here's the code for a whole closure:

    .data                        <-------------------------------- data
.align 8
.global Main_a_srt
Main_a_srt:
.long Main_lvl_closure
.long base_GHCziHandle_stdout_closure
.long base_GHCziIO_a28_closure
.data
.align 8
.global Main_a_closure
Main_a_closure:
.long Main_a_info
.long 0
.text <-------------------------------- text
.align 4
.long Main_a_srt-(Main_a_info)+0 <-------- offset text to data
.long 196609
.long 0
.long 983047
.global Main_a_info
Main_a_info:
.LcGp:
sethi %hi(base_GHCziHandle_stdout_closure),%l2
or %l2,%lo(base_GHCziHandle_stdout_closure),%l2
sethi %hi(Main_lvl_closure),%l3
or %l3,%lo(Main_lvl_closure),%l3
call base_GHCziIO_a28_info,0
nop

SRTs (Static Resource Table?) are supposibly used for garbage collecting CAFs, but the GHC commentary page on them seems out of date, or missing. In any event, the assembler can't make an offset between labels in .text and .data segments. In some architectures .text and .data use entirely separate address spaces. This probably got broken in a previous GHC release when info tables were moved to be next to the code. Checking against x86 reveals it does the same thing, but the x86 assembler is ok with cross segment offsets.

I ended up just changing the pretty printer so it prints out ReadOnlyData segments as .text, which is a bit nasty. I'll be able to handle this in a nicer way when the sparc NCG is factored out into its own set of modules.

Win!

56 expected passes
1 expected failures
0 unexpected passes
3 unexpected failures

Unexpected failures:
2080(optasm) -- wrong output
cg015(optasm) -- unknown unary match op
cg054(optasm) -- genSwitch

That seems to have fixed the seg faulting ones.

Working on 2080.hs

-- cmm code is
_sFv::I32 = %MO_UU_Conv_W8_W32(I8[R2]); <--- load unsigned
_sFx::I32 = _sFv::I32;
_cG7::I32 = %MO_S_Le_W32(_sFx::I32, 127);
if (_cG7::I32 >= 1) goto cGa;

-- -fvia-c gives:
ldub [%l2], %g1 <--- load unsigned
cmp %g1, 127
ble,pt %icc, .LL5
sethi %hi(ghczmprim_GHCziBool_True_closure), %g1
...

-- -fasm gives:
ldsb [%l2],%l0 <--- load signed
srl %l0,0,%l0 <--- nop
cmp %l0,127
ble .LcGa
nop
...

This was an easy operand format problem. However, I did notice that the Cmm code only ever loads unsigned data, like I8[R2]. If you were going to load signed data it would be better to use the sparc ldsb instruction which sign extends the byte in one go, versus doing an unsigned load then sign extending it separately. Another task for a simple peephole optimizer....

Also fixed the unknown unary match op problem - sign extension code was unfinished. genSwitch here we come.

Wednesday, January 14, 2009

Liveness lies

Yesterday I fixed the linear allocator to handle floating point register twinning, or at least I thought I did. The output code looked ok, but the programs I tried still crashed. I ended up spending the rest of the day writing a tool (mayet) to compare the -fasm and -fvia-c versions. Mayet takes the two .s files and splits them up into parts belonging to the individual closures. It then slowly substitutes the dubious -fasm sections for the known good -fvia-c sections.

Last night I got enough of it working to find a bad -fasm closure in cg034, which tests out floating point math. Curiously, the closure itself didn't do any float or double math. This morning I hand adjusted the -fvia-c version to look like the -fasm one until it exhibited the same problem.

After some wibbling around found this:

             ld [%l1+12],%vI_s1GO
# born: %vI_s1GO

cmp %vI_s1GO,0
# r_dying: %vI_s1GO <--------- LIES

bne .Lc2cm

......
......
c2cm:
ld [%l1+8],%vI_n2cH
# born: %vI_n2cH

st %vI_n2cH,[%i0-12]
# r_dying: %vI_n2cH

sethi %hi(base_GHCziFloat_a_closure),%l2

or %l2,%lo(base_GHCziFloat_a_closure),%l2

or %g0,%vI_s1GO,%l3
# r_dying: %vI_s1GO <---------


This is a dump of register liveness information. The line marked LIES shows that the allocator thinks that variable %vI_s1G0 isn't used after the cmp instruction. Unfortuntately, after the branch, it's used in an or. The vreg %vI_n2cH got allocated to the same register as %vI_s1G0, clobbering the contained value and causing the crash.

Turns out the register liveness determinator wasn't treating BI and BF as though they were branch instructions, so liveness information wasn't being propagated across the basic blocks properly.

Fixing that problem stopped cg034 from crashing, though it still gave the wrong answer. During debugging, noticed that if ghc is executed with -v or -ddump-reg-liveness then the top level labels emitted in the .s file change - which confuses mayet. Hmm.. let that be a lesson to all of us: changing compiler flags should not change top level names, if at all possible.

More digging

        (_s1Ri::F32,) = foreign "ccall" 
__encodeFloat((_c2sm::I32, `signed'), (_c2sn::I32, PtrHint),
(_c2so::I32, `signed'))[_unsafe_call_];
F32[Sp] = _s1Ri::F32;

Is translated to:
        call __int_encodeFloat,2
nop

st %f28,[%i0] <- BOGUS %f28

A floating point return value should be placed in %f0, but for some reason the GHC code that does just that was missing. Fixed that, and it almost works... just gives the wrong answer.

Loading of doubles looks broken.

via-c says:

ld [%l1+3], %f8
fitod %f8, %f2


but the NGC does:

ld [%l1+3],%l0
st %l0,[%o6-8]
ld [%o6-8],%f10
fitos %f10,%f10

Hmm.

Remember that comment from a few days ago:
-- ToDo: Verify correctness

Turns out it wasn't correct.. Who would have known :P

That fixed cg034 and cg035. Now we're down to:

Unexpected failures:
2080(optasm) -- segv
cg015(optasm) -- unknown unary match op
cg021(optasm) -- segv
cg022(optasm) -- segv
cg026(optasm) -- segv
cg044(optasm) -- segv
cg046(optasm) -- segv
cg054(optasm) -- genSwitch
cg058(optasm) -- segv
cg060(optasm) -- segv

Monday, January 12, 2009

Bootstrapping 7

I'm still chasing down the source of that bug from last time. Spent some time re-reading the STG paper, and trolling through the RTS code.

Running the program with +RTS -DS didn't help, but I noticed a flag -Da that checks the format of closures while the program runs. However, it seemed to be disabled, or unused. There is code for it is in the RTS, but it doesn't show up in +RTS --help. Running the program with +RTS -Da gives:

benl@mavericks:~/devel/ghc/ghc-HEAD-native/tmp> ./Main +RTS -DS -Da
stg_ap_v_ret... PAP/1(3f5212, 3ef238)
stg_ap_0_ret... Bus error (core dumped)

Went chasing through the RTS code looking for the source of the Bus Error. -Da causes Printer.c:printClosure to be invoked print out a description of each closure, but the pointers passed to it are misaligned. That is, misaligned / containing pointer tag bits. Reread the dynamic pointer tagging paper, then fixed the RTS code.

Ended up doing a binary-ish search to find the problem. Starting with a known good .s file generated with -fvia-c, slowly copied dubious sections of the -fasm version into it, testing along the way. This works because in the current STG -> Cmm translation, top level STG functions only share data via pinned registers. This is sort of like a standard calling convention for GHC functions, so it doesn't matter if GHC and GCC do register allocation a little differently. It might be worthwhile automating this process if I hit similar problems in the future.

Anyway, it turn's out there's a world of difference between
mov %l1, %l2 and mov %l2, %l1...

With that fixed, ran the should_run codeGen tests and got

OVERALL SUMMARY for test run started at Monday, 12 January 2009  5:23:10 PM EST
61 total tests, which gave rise to
427 test cases, of which
0 caused framework failures
367 were skipped

42 expected passes
1 expected failures
0 unexpected passes
17 unexpected failures

Unexpected failures:
1852(optasm) -- regalloc
1861(optasm) -- regalloc
2080(optasm) -- wrong output
cg015(optasm) -- unknown unary match op
cg018(optasm) -- regalloc
cg021(optasm) -- segv
cg022(optasm) -- segv
cg024(optasm) -- regalloc
cg026(optasm) -- regalloc
cg028(optasm) -- regalloc
cg034(optasm) -- regalloc
cg035(optasm) -- regalloc
cg044(optasm) -- regalloc
cg046(optasm) -- segv
cg054(optasm) -- genSwitch not implemented
cg058(optasm) -- segv
cg060(optasm) -- segv


The ones marked spill die with:
ghc: panic! (the 'impossible' happened)
(GHC version 6.11.20090110 for sparc-sun-solaris2):
RegAllocLinear.allocRegsAndSpill: no spill candidates


Repaired the rot in the linear register allocator. The free register map only worked for x86(_64) and PPC. Now we've got:

   2080(optasm)    -- wrong output
cg015(optasm) -- unknown unary match op
cg021(optasm) -- segv
cg022(optasm) -- segv
cg026(optasm) -- segv
cg034(optasm) -- regalloc FF64
cg035(optasm) -- regalloc FF64
cg044(optasm) -- regalloc FF64
cg046(optasm) -- segv
cg054(optasm) -- genSwitch
cg058(optasm) -- segv
cg060(optasm) -- segv


The others marked regalloc are dying because the allocator is messing up the float register twinning. That'll be tomorrow's problem.

Saturday, January 10, 2009

Bootstrapping 6

The build with the SPARC native code generator went through. Tried to run the test suite. For some reason when the compiler panics, the test framework runs off and tries to consume all available memory - like a fork bomb. Not sure why, maybe it was built wrong. Ended up just running individual tests by hand.

Ah, the joy of debugging assembly code. Fixed one problem where bad instructions were being generated. If you have LD [%f26 + ... ], %l1 then that's easy to find, because you can't use float regs for addressing.

Found a program from the test suite that seg faulted and spent some time reducing it to a smaller test case.
main = print (dude (1, 2))
dude xx
= case xx of
(x, y) -> y
Seems to work, but
main = sum2 [1,2]
sum2 xx
= case xx of
[] -> 0
(x:xs) -> x + sum2 xs
always returns 0, no matter what size the list being summed is.

Spent about an hour staring at the STG, Cmm and NCG code, but couldn't see anything obviously wrong. Noticed that the current NCG doesn't use branch delay slots, there's a comment in the code saying that doing so would confuse the register allocator. This'll be a first port of call for optimization once the NCG is working again.

On the other hand, to compile the sum2 program above with -O0, GCC used 420 instructions but the NCG used only 361 (doesn't actually work though).

I'm finding it hugely useful to have the via-c path still in place. Besides the obvious fact that I can still compile the libraries with it, it's comforting to know that the Cmm code is good, and I can just concentrate on the Cmm -> Asm pass. All the tricky dynamic pointer tagging and whatnot is expressed in Cmm, which makes life a lot easier for me. It's also good to have multiple -O flags, because it's an easy way to create wibbles in the Cmm code when diagnosing bugs.

Got bored staring at assembly code, and went back to find a better test case. Ended up with:
main = print (up 0)
up x
= case x of
0 -> x


Note that up also takes a Num dictionary. Got:
Main: internal error: stg_ap_pp_ret
(GHC version 6.11.20090105 for sparc_sun_solaris2)
Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
Abort (core dumped)

Bingo! Assertions in the RTS code are my new best friends.

Wednesday, January 7, 2009

Bootstrapping 5

Re-enabled compilation of the native code generator for SPARC, then went through and plugged enough holes to get it to compile again. It looks like the representation of operand formats was changed in the recent Cmm work, but the SPARC code didn't keep up because it was hidden inside #ifdef blocks. Remember kids, #ifdef = evil. Besides operand formats and the missing genSwitch function, it doesn't look like the SPARC backend has rotted too badly.

All the NCG modules have recompiled, but I must have touched something at the root of the module dependency tree. It's recompiling TcRnDriver.lhs again, which has a 2MB intermediate .hc file and has been going for 40 mins.. sigh.

Spent some time comparing the .s files produced by GCC and the GCC for Sun Systems (gccfss) compiler. It seems that any .hs file compiled via gccfss generates bad .s code because it doesn't handle pinning of STG registers the way real GCC does.

Tried to speed up recompilation by slurping across the object and interface files for TcRnDriver from a previous build. For some reason it accepted TcRnDriver.hi but not Parser.hi

parser/HaddockLex.x:23:0:
Bad interface file: dist-stage1/build/Parser.hi
Something is amiss; requested module ghc-6.11.20090105:Parser differs from
name found in the interface file ghc-6.11.20081231:Parser

Curses @ sensible checking of interface file versions! :) It appears as though the configure date (or latest patch date?) is part of the GHC version, so the backup build tree I squirreled away isn't going to help me. I'll have to leave it building overnight and remember not to run ./configure again.

In other news, a stage3 build of the HEAD + yesterdays patch worked, so at least the via-c path is good again (modulo epic slowness).

Tuesday, January 6, 2009

Bootstrapping 4

Test of patched GHC 6.10.1 produced the following:

Unexpected failures:

1861(normal,optc,hpc,threaded1,threaded2)
1914(ghci)
2594(normal,optc,hpc,threaded1,threaded2)
andy_cherry(hpc)
arith011(optc,hpc,threaded2)
barton-mangler-bug(normal,optc,hpc,threaded1,threaded2)
break024(ghci)
cabal01(normal)
conc042(threaded2)
conc043(threaded2)
conc044(threaded2)
conc045(threaded2)
concprog001(hpc)
driver062.1(normal)
driver062.2(normal)
driver062.3(normal)
driver062.4(normal)
driver062.5(normal)
driver081.1(normal)
driver081.2(normal)
dynamic_flags_001(normal)
enum01(optc,hpc,threaded2)
enum02(optc,hpc,threaded2)
enum03(optc,hpc,threaded2)
ffi019(normal,optc,hpc,ghci,threaded1,threaded2)
gadt23(normal)
ghciprog004(normal)
hClose002(normal,optc,hpc,ghci,threaded1,threaded2)
hpc_raytrace(normal,optc,hpc,threaded1,threaded2)
joao-circular(normal,optc,hpc,threaded1,threaded2)
num012(normal,optc,hpc,ghci,threaded1,threaded2)
process007(normal,optc,hpc,ghci,threaded1,threaded2)
random1283(threaded2)
recomp001(normal)
recomp004(normal)
seward-space-leak(ghci)
tcrun007(normal,optc,hpc,ghci,threaded1,threaded2)
testblockalloc(normal,threaded1)
user001(normal,optc,hpc,ghci,threaded1,threaded2)


When building the ghc-HEAD-native with the native code generator turned on, looks like the genSwitch function for SPARC is missing from MachCodeGen.hs. Also missing are the ALLOCATABLE_REGS defs from MachRegs.lhs.

Went through the description of the register set in includes/MachRegs.h. Looks like there are 3 allocatable integer regs, 6 allocatable double regs, and 4 allocatable float regs. The graph coloring allocator currently only allocates doubles, not single precision floats as well. Will have to come back and check this.

isRegRegMove and mkRegRegMovInstr from RegAllocInfo.hs were missing. So was regDotColor from RegAllocStats.hs, isFloatSize from MachRegs.lhs

MachRegs.lhs has a function mkVReg which determines the vreg to use for a particular size word. This is the source of single precision floating point vregs. All the word size code has also rotted. The sparc Size type is different from the ones use for the i386(64) and ppc. Should come back and fix this, but for now I've just added the missing functions and kept the old size type.

In MachInstrs.hs i386 has OpReg but sparc has RI, similar. Should refactor sparc code to use OpReg.

Monday, January 5, 2009

Bootstrapping 3

Running tests on patched stage2 build of GHC 6.10.1 on sparky. Most seem to be going through, but have:

cc004.hs:16:0:
calling convention not supported on this architecture: stdcall
When checking declaration:
foreign import stdcall safe "static &p" m_stdcall
:: StablePtr a -> IO (StablePtr b)

Started teasing out the SPARC stuff from nativeGen/MachCodeGen.hs. At the moment it's a mess of #ifdefery. I'm sure #ifdefs were the best way to do it back then there were only one or two targets, but now there is code for i386, i386_64, powerpc, alpha and sparc all mixed in together.

My plan is to copy out the non-architecture-specific functions from nativeGen/MachCodeGen.hs into their own module nativeGen/MachCodeGenShared.hs. I'll split the sparc specific stuff into a set of modules under nativeGen/sparc. Once the sparc native gen works again I'll then go back and delete the sparc specific stuff from nativeGen/MachCodeGen.hs, and make it use the MachCodeGenShared.hs. This should leave the original support for all architectures untouched during development.

While going through the code, found an amusing comment in the sparc section:
-- Floating point assignment to a register/temporary
-- ToDo: Verify correctness
assignReg_FltCode :: Size -> CmmReg -> CmmExpr -> NatM InstrBlock
assignReg_FltCode pk reg src = do ...

hmmm. I wonder how long that ToDo has been there..

The build of GCC 4.3.2 died with
Configuring stage 1 in sparc-sun-solaris2.10/libgcc
...
checking for suffix of object files... configure: error: cannot compute
suffix of object files: cannot compile.

Investagation of the config.log reveals:
/home/benl/files/gcc/build/gcc-4.3.2-obj/./gcc/xgcc ...
conftest.c:1: internal compiler error: Segmentation Fault
Please submit a full bug report,


Tried to back off to GCC 4.2.1 then GCC 4.2.4. Both die during the build with configure problems:
config.status: executing gstdint.h commands
make[3]: Entering directory `/home/benl/files/gcc/build/gcc-4.2.4-obj/libdecnumber'
make[3]: Nothing to be done for `all'.
make[3]: Leaving directory `/home/benl/files/gcc/build/gcc-4.2.4-obj/libdecnumber'
make[3]: Entering directory `/home/benl/files/gcc/build/gcc-4.2.4-obj/gcc'
make[3]: *** No rule to make target `all'. Stop.
make[3]: Leaving directory `/home/benl/files/gcc/build/gcc-4.2.4-obj/gcc'

For some reason the configure script isn't dropping the Makefile in ./gcc.

Giving up trying to compile a more recent GCC under Solaris. Will just copy the 4.2.1 binaries across from mavericks. Let's hope we don't stumble across any more bugs in it.

Stop. Rewind. I've been tripping over myself because the builds are taking so long. I've got copies of the head on mavericks, code.haskell.org and sparky and they're all different versions. Some of the recent patches pushed to the head also seem to have broken the build, so I'm backing up to the head as of 2009/01/01.

I'm going to stick with this version, and the same compiler flags, for all builds until the NGC is fixed. I need to be able to reuse the .o files between builds because they take too long to remake.

Finally pushed the gc_thread patch. This should fix the via-c build. I've run through the testsuite, though I'm still waiting the actual stage2 build to finish - it's stuck on Parser.hs again.

Turned on the NCG for sparc. That'll be another nights worth of rebuilding.

Sunday, January 4, 2009

Bootstrapping 2

A test of the patched GHC 6.10.1 looks promising:
make stage=1 WAY=optc EXTRA_HC_OPTS=-L/data0/home/benl/lib

Gives:
Unexpected failures:
2594(optc) -- run segv. tests 64 bit FFI.
T2486(optc) -- diff top level specialisations, ok by hand.
andy_cherry(optc) -- timed out when compiling, ok by hand.
barton-mangler-bug(optc)
conc020(optc) -- does nothing. not sure.
concprog001(optc)
enum01(optc) -- run segv. 64 bit code? not sure.
enum02(optc) -- run segv. 64 bit code
enum03(optc) -- run segv. 64 bit code
ffi019(optc)
hClose002(optc)
hpc_raytrace(optc)
joao-circular(optc)
num012(optc)
process007(optc)
simpl015(optc)
tcrun007(optc)
user001(optc)


I'm getting linker problems when trying to build gdb on sparky.
gcc -g -O2      \
-o gdb gdb.o libgdb.a \
../readline/libreadline.a ../opcodes/libopcodes.a ../bfd/libbfd.a -lintl
../libiberty/libiberty.a ../libdecnumber/libdecnumber.a -ldl -lncurses
-lsocket -lnsl -lm -lexpat ../libiberty/libiberty.a
Undefined first referenced
symbol in file
initscr32 libgdb.a(tui.o)
w32addch libgdb.a(tui-io.o)
w32attron libgdb.a(tui-wingeneral.o)
w32attroff libgdb.a(tui-wingeneral.o)
acs32map libgdb.a(tui-win.o)
getcurx libgdb.a(tui-io.o)
getcury libgdb.a(tui-io.o)


After some Googling around, it turns out these syms are defined in the Solaris /usr/ccs/libcurses.so library and not GNU libncurses.

Spent some time going through the current trac tickets. #186 says that Int64 code has been broken for some time, which will be the reason for many of the test failures.

On mavericks I have two builds running with a single thread, and another running with 4 threads. This is creating frequent 3-4 second pauses on the console, which I guess is due to memory starvation. The builds are also dying every 20 min due to race conditions, I'm giving up on parallel make.

The stage1 build of ghc-HEAD-work on mavericks died with:
basicTypes/Var.lhs:80:30:
Module
`IdInfo'
(hi-boot interface)
does not export
`GlobalIdDetails'

basicTypes/Var.lhs:80:47:
Module `IdInfo' (hi-boot interface) does not export `notGlobalId'

I think this was because the repo got into a weird state because I control-c'd darcs when it was running. The formatting of the first error message looks weird though.

Darcs failures.. zzzz
== running darcs pull --repodir testsuite
Pulling from "/data0/home/benl/devel/ghc/ghc-HEAD"...
darcs: bug in get_extra commuting patches:
First patch is:
Fri Sep 7 18:23:27 EST 2001 simonmar
* [project @ 2001-09-07 08:23:27 by simonmar]
Fix some signatures after Ord was removed as a superclass of Ix.
Second patch is:
Fri Sep 14 01:54:43 EST 2001 simonmar
* [project @ 2001-09-13 15:54:43 by simonmar]
Back out the change to remove Ord as a superclass of Ix; the revised
Haskell 98 report will no longer have this change.
darcs failed: 256 at ./darcs-all line 69.
benl@mavericks:~/devel/ghc/ghc-HEAD-work> darcs --version
2.1.0 (release)
benl@mavericks:~/devel/ghc/ghc-HEAD-work>

Hrm. stage1 build of ghc-HEAD-work on sparky completed, but trying to compile something results in:
benl@sparky:~/devel/ghc/ghc-HEAD-work/tmp$
~/devel/ghc/ghc-HEAD-work/ghc/stage1-inplace/ghc --make Main.hs
[1 of 1] Compiling Main ( Main.hs, Main.o )
Linking Main ...
ld: fatal: relocation error: R_SPARC_32: file /home/benl/devel/ghc/ghc-HEAD-work/libffi/libHSffi.a(v8.o):
symbol : offset 0xfd0559b6 is non-aligned

Trying GNU ld instead gives:
ld: gct: TLS definition in /home/benl/devel/ghc/ghc-HEAD-work/rts/libHSrts.a(GC.o)
section .tbss mismatches non-TLS reference in
/home/benl/devel/ghc/ghc-HEAD-work/rts/libHSrts.a(GCUtils.o)

I expect this happened because i changed my ~/bin/gcc during the build... yah. I'm trying to do too many things at once because the builds take so long. Rebuilt the runtime system and now I'm getting this:
/opt/gcc/bin/../../SUNW0scgfss/4.0.4/prod/bin/fbe: "StgCRun.s", line 17: error: statement syntax
/opt/gcc/bin/../../SUNW0scgfss/4.0.4/prod/bin/fbe: "StgCRun.s", line 31: error: statement syntax

This is with GHC 4.0.4 installed on sparky. Checking out the .s files at those lines shows:
    sethi   %hi(%l1),%i5
ld [%i5+%lo(%l1)],%i0

From the OpenSPARC 2007 architecture manual, both of those instructions are bad. This looks like a legitimate GCC bug. Time to upgrade.