Saturday, January 3, 2009

Bootstrapping

Sent a message to cvs-ghc asking about the huge compile times. Simon PJ responded saying that Roman had mentioned that GCC is uniquely slow on the SPARC T2. No suggestions for reducing the huge intermediate .hc files. I rebuilt Language/Haskell/TH/Syntax.hs again while dumping core. The source is 24k, but the desugared code is 2.7MB. The bulk of it is derived instance functions for the Data typeclass - stacks of gmaps of various sorts. Not much I can do about that, so I'll try and remove the lib from subsequent compiles.

Anyway. A stage2 build of GHC 6.8.3 with -O on sparky has worked, and so has a stage2 build with -O0 on mavericks. I had two builds of GHC 6.10.1 running on mavericks last night and both have died with:

Configuring installPackage-1.0...
cabal-bin: ghc version >=6.4 is required but the version of
/data0/home/benl/devel/ghc/ghc-6.10.1-quickest/ghc/stage2-inplace/ghc could
not be determined.
make[3]: *** [with-stage-2] Error 1

The version can't be determined because the stage2 compiler segfaults, so we've made it to ticket #2692.

All programs produced by the stage1 compiler segfault, including:

main = return ()

The gdb stack trace says:

#0 0x0025e660 in todo_block_full ()
#1 0x0027d03c in evacuate ()
#2 0x0025e898 in markWeakPtrList ()
#3 0x0025d200 in GarbageCollect ()
#4 0x00256fa4 in scheduleDoGC ()
#5 0x002570ac in exitScheduler ()
#6 0x0025631c in hs_exit_ ()
#7 0x0025646c in shutdownHaskellAndExit ()
#8 0x00254770 in real_main ()
#9 0x002547c8 in main ()

So it's dying when performing the final GC, which corresponds to the error message on the ticket:

ghc: internal error: ASSERTION FAILED: file sm/GCUtils.c, line 140

This probably got broken when the new parallel GC was added.

Attempting to compile the same source on mavericks gives linker errors:
/opt/gnat/gcc/lib/gcc/sparc-sun-solaris2.10/4.2.1/../../../libbfd.a(libbfd.o): In function `warn_deprecated':
/var/tmp/Binutils/binutils-2.17.50/bfd/libbfd.c:978: undefined reference to `libintl_dgettext'
/var/tmp/Binutils/binutils-2.17.50/bfd/libbfd.c:981: undefined reference to `libintl_dgettext'

.. guess the build environment wasn't that sane after all. Turns out that on mavericks libintl is in:
/usr/local/stow/gettext-0.13/lib/libintl.so

Hacking on rts/sm/Evac.c, trying to find why the assertion above failed. Undid the STATIC_INLINEs to get a better stack trace:

(gdb) bt
#0 0xff1c5bf0 in _lwp_kill () from /lib/libc.so.1
#1 0xff164bfc in raise () from /lib/libc.so.1
#2 0xff141100 in abort () from /lib/libc.so.1
#3 0x00258d00 in rtsFatalInternalErrorFn (s=0x322428 "ASSERTION FAILED: file %s, line %u\n", ap=0xffbff330) at RtsMessages.c:164
#4 0x002589e8 in barf (s=0x322428 "ASSERTION FAILED: file %s, line %u\n") at RtsMessages.c:40
#5 0x00258a54 in _assertFail (filename=0x3241c8 "sm/GCUtils.c", linenum=147) at RtsMessages.c:55
#6 0x0026b7ac in todo_block_full (size=5, ws=0x36a1f4) at sm/GCUtils.c:147
#7 0x0029d730 in alloc_for_copy (size=5, stp=0x36a174) at sm/Evac.c:77
#8 0x0029d7ec in copy_tag (p=0xffbff5f4, info=0x27ac34, src=0xfee81244, size=5, stp=0x36a174, tag=0) at sm/Evac.c:96
#9 0x0029e408 in evacuate (p=0xffbff5f4) at sm/Evac.c:621
#10 0x0026ca5c in markWeakPtrList () at sm/MarkWeak.c:395
#11 0x00267d40 in GarbageCollect (force_major_gc=rtsFalse) at sm/GC.c:346
#12 0x0025bd60 in scheduleDoGC (cap=0x0, task=0x0, force_major=rtsFalse) at Schedule.c:1478
#13 0x0025c8ac in exitScheduler (wait_foreign=rtsFalse) at Schedule.c:2018
#14 0x00259414 in hs_exit_ (wait_foreign=rtsFalse) at RtsStartup.c:416
#15 0x002595bc in shutdownHaskellAndExit (n=0) at RtsStartup.c:554
#16 0x002550bc in real_main () at Main.c:141
#17 0x00255114 in main (argc=1, argv=0xffbffa4c) at Main.c:153


That pointer for ws looks dodgy. It's loaded via the global gct, which is stored in a global register, which is architecture specific. Adding the following to GCThread.h makes the trivial program above compile, but stage2 is still segfaulting. Will do a clean rebuild.

#if defined(sparc_HOST_ARCH)
// Don't use REG_base or R1 for gct on SPARC because they're getting clobbered
// by something else. Not sure what yet. -- BL 2009/01/03

extern __thread gc_thread* gct;
#define DECLARE_GCT __thread gc_thread* gct;
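
The difference between the two storage strategies can be sketched in isolation. This is only a minimal illustration of a thread-local pointer, assuming a GCC-compatible compiler that supports the __thread extension; gc_thread_sketch, gct_sketch, set_gct and current_thread_index are hypothetical names made up for the example and are not part of the RTS.

```c
#include <stddef.h>

/* Hypothetical stand-in for the RTS gc_thread structure. */
typedef struct {
    int thread_index;
} gc_thread_sketch;

/* Thread-local storage: every OS thread gets its own copy of this pointer,
   and it lives in memory rather than in a pinned machine register, so code
   that reuses that register cannot clobber it.
   __thread is a GCC extension (assumed to be available here). */
static __thread gc_thread_sketch *gct_sketch = NULL;

/* Point the calling thread's copy at its own GC state. */
static void set_gct(gc_thread_sketch *t) {
    gct_sketch = t;
}

/* Read through the calling thread's copy; -1 means "not set yet". */
static int current_thread_index(void) {
    return gct_sketch != NULL ? gct_sketch->thread_index : -1;
}
```

The trade-off is that each access goes through the thread-local lookup instead of a register read, which is presumably why the RTS prefers a register where one is safe to use.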


Building ghc-HEAD on sparky. The alex I installed before was the wrong version for some reason, so I'm reinstalling alex-2.2.

My attempted build of GHC 6.8.3 stage3 on sparky seems to have gone to sleep. Top shows it's done 15 sec of work all day. On further investigation, running stage2/ghc-inplace shows that it does nothing except sleep forever. A hello program built by the stage1 compiler segfaults. Lesson learned: test stage1 before building stage2.

Can't find gdb on sparky. Tried compiling gdb 6.8 from source. Died. FFS.
remote.c: In function `extended_remote_attach_1':
remote.c:2859: warning: unsigned int format, pid_t arg (arg 3)


When running tests for a patched GHC 6.10.1 on mavericks, I have to manually supply a -L libdir to stop it seeing the system 64-bit gmp libs. If it sees them, ld emits a warning to the console, which makes the test framework think the compile failed. Use:
make stage=1 WAY=optc TEST=cg036 EXTRA_HC_OPTS=-L/data0/home/benl/lib


Test framework was crashing because it was built with the old compiler. Must remember that all the framework stuff is built with stage1, not the host compiler.

Running the patched GHC 6.10.1 against the codeGen tests with just WAY=optc. They seem to be going through, so I'm hoping 6.10.1 is fixed for SPARC. Will run the full testsuite tomorrow when the stage2 build is done.

Friday, January 2, 2009

Fighting dependencies

Looks like the stage1 build has succeeded, but stage2 died trying to link against libreadline.so.4. I only had a lib for .so.5 .. sigh.

After soft linking libreadline.so.5 to libreadline.so.4, had to copy across a new version of the GHC readline lib because it had gotten into a weird state.

Alright. It looks like I've got a sane build environment on mavericks. My build of GHC 6.8.3 made it through stage1 and the libs. I ran part of the testsuite and it seems fine.

Installed the binary distro of GHC 6.8.3 on sparky. Trying to get the build working. I'm still learning about where all the progs and libs are supposed to be on Solaris. For some reason all the goodies in /usr/sfw, /opt/csw and /usr/ccs aren't in the default $PATH.

While compiling fresh copies of GHC 6.8.3 and GHC 6.10 on mavericks, some modules seem to take a very long time.. this is from top:
3273 benl       1   0    0  392M  142M cpu     47:25 79.63% cc1
                                               ^^^^^

47 minutes and still going? Smells like trouble.

Yah, my GHC 6.10.1 build died:
Constructor GHC.IOBase.IORef:
Can't find interface-file declaration for type constructor or class GHC.STRef.STRef
Probable cause: bug in .hi-boot file, or inconsistent .hi file
Use -ddump-if-trace to get an idea of which file caused the error
Cannot continue after interface file error


I'm guessing this was a build race. I just ran make again and it seems ok now.

That other module is still compiling:
3273 benl       1   0    0  393M  259M cpu     94:15 90.09% cc1
                                               ^^^^^

I looked into it and the .hc file was from the Happy generated parser, and is 2 MB big. I suppose cc doesn't like compiling 2MB source files..

Looks like the GHC 6.8.3 binary distro I installed on sparky didn't work. When I try and build something with it it dies with:
checking for path to top of build tree... In file included from /home/benl/software/ghc-6.8.3/lib/ghc-6.8.3/include/Stg.h:183,

from /tmp/ghc5891_0/ghc5891_0.hc:3:0:

/home/benl/software/ghc-6.8.3/lib/ghc-6.8.3/include/Regs.h:28:17:
gmp.h: No such file or directory
In file included from /home/benl/software/ghc-6.8.3/lib/ghc-6.8.3/include/Stg.h:183,

from /tmp/ghc5891_0/ghc5891_0.hc:3:0:

/home/benl/software/ghc-6.8.3/lib/ghc-6.8.3/include/Regs.h:121:0:
error: syntax error before "MP_INT"
....


I don't know if it's possible to change the pre-built GHC so it searches another include path, so I just added a softlink from /opt/csw/includes/gmp.h to ghc-6.8.3/lib/ghc-6.8.3/include/gmp.h. It's a hack, but it seems to have fixed it.

Now my other build on mavericks has caught up with the first one:
  3273 benl       1   0    0  394M  325M cpu    161:25 95.89% cc1
 15563 benl       1   0    0  371M  271M cpu     35:19 95.10% cc1


Configuring GHC 6.8.3 on sparky is causing problems, not sure why:

bash-3.00$ sh boot
Booting .
configure.ac:887: warning: AC_CACHE_VAL(fp_gcc_version, ...): suspicious cache-id, must contain _cv_ to be cached
autoconf/general.m4:1988: AC_CACHE_VAL is expanded from...
autoconf/general.m4:2001: AC_CACHE_CHECK is expanded from...
aclocal.m4:548: FP_HAVE_GCC is expanded from...
configure.ac:887: the top level
configure.ac:887: warning: AC_CACHE_VAL(fp_gcc_version, ...): suspicious cache-id, must contain _cv_ to be cached
autoconf/general.m4:1988: AC_CACHE_VAL is expanded from...
autoconf/general.m4:2001: AC_CACHE_CHECK is expanded from...


The configure script on sparky must not have detected alex properly. There's a happy error later on as well.

== make boot - --no-print-directory -r;
in /home/benl/devel/ghc/ghc-6.8.3-build/utils/genprimopcode
------------------------------------------------------------------------
g Lexer.x
make[2]: g: Command not found
make[2]: [Lexer.hs] Error 127 (ignored)


When trying to unpack happy:
bash-3.00$ gzip -dc happy-1.17.tar.gz |tar xf -
tar: ././@LongLink: typeflag 'L' not recognized, converting to regular file


Sigh. Built GNU tar, installed happy and alex on both mavericks and sparky.

Submitted a bug report about the 2MB intermediate C file for Parser.hs. There's no way a single source file is supposed to take 2 hrs to compile. Looking at the generated parser code, the parser has 545 states, and there are 109589 lines in the generated source file. This gives about 201 lines of C code per state.
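
The per-state figure is plain integer division of the two counts quoted above. As a tiny sketch (lines_per_state is a hypothetical helper name, used only for this illustration):

```c
/* Average number of generated lines per parser state, using integer
   division. Returns 0 when the state count is not positive. */
static long lines_per_state(long total_lines, long num_states) {
    return num_states > 0 ? total_lines / num_states : 0;
}
```

With the numbers from this entry, lines_per_state(109589, 545) gives 201.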

At a guess, the trouble is that every parser state is represented by its own named function. That's a reasonable approach from an FP point of view, but it could cause some serious code blowout if not handled well.
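
To make the code-size concern concrete, here is a toy comparison of the two styles. It is not what Happy emits and says nothing about the real GHC parser; it is a minimal sketch of a recogniser for the made-up language "one or more 'a' followed by one 'b'", and every name in it is hypothetical. In the first style each state is its own function, so the amount of code grows with the number of states. In the second style a single loop is driven by a transition table, so only the data grows.

```c
/* Input classes and states for the toy language a+b. */
enum { CLS_A, CLS_B, CLS_OTHER, NUM_CLS };
enum { ST_START, ST_SEEN_A, ST_DONE, ST_FAIL, NUM_STATES };

static int classify(char c) {
    return c == 'a' ? CLS_A : c == 'b' ? CLS_B : CLS_OTHER;
}

/* Style 1: one named function per state. */
static int state_seen_a(const char *s);
static int state_done(const char *s);

static int state_start(const char *s) {
    return *s == 'a' ? state_seen_a(s + 1) : 0;
}

static int state_seen_a(const char *s) {
    if (*s == 'a') return state_seen_a(s + 1);
    if (*s == 'b') return state_done(s + 1);
    return 0;
}

static int state_done(const char *s) {
    return *s == '\0';   /* accept only if the input ends here */
}

static int accepts_functions(const char *s) {
    return state_start(s);
}

/* Style 2: one shared loop driven by a transition table. */
static const unsigned char next_state[NUM_STATES][NUM_CLS] = {
    [ST_START]  = { ST_SEEN_A, ST_FAIL, ST_FAIL },
    [ST_SEEN_A] = { ST_SEEN_A, ST_DONE, ST_FAIL },
    [ST_DONE]   = { ST_FAIL,   ST_FAIL, ST_FAIL },
    [ST_FAIL]   = { ST_FAIL,   ST_FAIL, ST_FAIL },
};

static int accepts_table(const char *s) {
    int st = ST_START;
    for (; *s != '\0'; s++) {
        st = next_state[st][classify(*s)];
    }
    return st == ST_DONE;
}
```

Both recognisers accept "aab" and reject "abb"; the point is only that adding a state to the first version adds a function, while adding one to the second adds a table row.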

Killed an attempted compile of the parser module with GHC flag -O0, after 121 mins running. Tried to compile it instead with -O2. ... didn't make any difference. The .hc is still 2MB.

Another compile of GHC 6.8.3 managed to finish Parser.hs after several hours (not sure how many) and is stuck on libraries/template-haskell/Language/Haskell/TH/Syntax.hs. The intermediate C code is 4MB this time.

I'll leave it overnight and see how far it gets..

Thursday, January 1, 2009

A new year and a new project

As per http://www.haskell.org/opensparc/ my mission is to fix the SPARC backend, and see how well we can take advantage of the multi-threaded T2 architecture. I'm aiming to post status reports on this blog at least a couple of times a week. Feel free to pitch in comments, suggestions, or to ask questions.

We have three T2 machines at our disposal:
  • sparky.ce.chalmers.se
    Generously donated by Sun to the haskell.org community. I'll set up some buildbots here in the coming weeks.

  • mavericks.anu.edu.au
    Owned by the Computer Science department here at the ANU (Australian National University) in Canberra where I am based. I'll be using this one for the main development because the latency is lowest for me. This machine has been cut into 2 virtual machines of 4 cores each. Mavericks is only one half, but it'll be fine for development. I'll do benchmarking on sparky.
  • The T2 at UNSW (University of New South Wales) which Roman should be making me an account on, any day now! :)

Today was setting up day. After abandoning a previous attempt to cross compile GHC from i386-linux to sparc-solaris, I'm using Christian Maeder's binary distro of 6.8.3.

Spent most of the day in dependency hell. The GHC distro was compiled for 32-bit SPARC, so I've had to compile matching versions of libgmp and readline-5.2 from source. libgmp needs to be configured with --build=sparcv8-sun-solaris2.10, otherwise it defaults to 64-bit again.
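
One quick way to check which ABI a given compiler invocation targets is to compile a tiny C file with the same flags and look at the pointer width. A minimal sketch; pointer_bits is a hypothetical helper and has nothing to do with GMP or GHC themselves:

```c
#include <limits.h>

/* Width in bits of a pointer for whatever ABI this file was compiled for. */
static int pointer_bits(void) {
    return (int)(sizeof(void *) * CHAR_BIT);
}
```

A 32-bit build would report 32 here and a 64-bit build 64, which is the mismatch that matters when linking libgmp against a 32-bit GHC.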

After giving up on trying to get GIT working, it looks like my latest attempt at darcs 2.1.0 has succeeded. The configure incantation includes --without-curses --without-terminfo --without-manual --without-libcurl --disable-color.

I'm ignoring libcurl because that'd be another thing I'd have to fight. I'm ignoring curses because the GNU and Solaris versions don't quite match and I was getting header file problems with the Solaris version. Even when you specify --without-curses, darcs still tries to include it, so I had to patch the makefile.

Pulled down a current copy of the ghc head branch. Christian reports that GHC 6.10 is currently broken on SPARC, so I'll build 6.8.3 first to check that everything is working. I want to know that the GCC 4.2.1 installed on mavericks is good before I start fighting the head.

Spent some time reading through the SPARC architecture manual while waiting for GHC 6.8.3 to build. It's fun typing make -j32, but in stage1 each source file seems to take about 30sec on average to compile..

... it's still compiling an hour later. Roman mentioned something about "days", but I'll give it overnight before I get worried.

Check out:

ghc: 448311124 bytes, 457 GCs, 6902345/17301504 avg/max bytes residency (14 samples), 47M in use, 0.01 INIT (0.00 elapsed), 8.95 MUT (968.76 elapsed), 5.14 GC (7.63 elapsed)

woot!