Pjotr's rotating BLOG

Table of Contents

RSS feed

Tis document describes Pjotr's journey in (1) introducing a speedy LMM resolver for GWAS for GeneNetwork.org, (2) adding a REST API for genetics, and (3) solving the pipeline reproducibility challenge with GNU Guix.

Join us on the #genenetwork IRC channel on http://webchat.freenode.net/.

1 fast\lmm\d

[2017-03-16 Thu] Over the last months we have put in a lot of work to make fast\lmm\d happen, the single core version is faster than the python version, even without real optimizations.

And I wrote relocatable GNU Guix support to deploy packages on Beacon. Video of FOSDEM2017 presentation in the HPC room.

2 Beacon: getting Python to run with MKL

pip3 install --user virtualenv
~/.local/bin/virtualenv ~/virtualenv-python-3.4
cd ~/virtualenv-python-3.4
source bin/activate
pip3 install numpy

but now I find numpy is linked against OpenBLAS instead of MKL! Well, that is useful for benchmarking, but not what I wanted to do.

The Intel numpy build instructions are somewhat enlightening though they do not mention Phi as a target. What is interesting is explicit linking against MKL and that it is even possible to use the GNU compilers. The gcc provided is

gcc (GCC) 4.8.2 Copyright (C) 2013

which should be good enough to bootstrap Python and even Guix and friends.

3 Beacon: what software to use?

[2016-10-20 Thu] Beacon support wrote:

`Sorry about the empty module. We did support OpenCL at a time (when intel MPSS is at 3.3), but AFAIK the intel doesn't officially support OpenCL for MIC beyond MPSS 3.3. On Beacon we currently have MPSS 3.7.'

and suggests to use for Python Intel developer's pyMIC library (https://github.com/01org/pyMIC). In this library Python is bound against Intel's MKL library, see for example the BLAS functions.

Of course, the MKL is fast. But (like CUDA) it is propriety closed source software which makes it a no go beyond the short term for me.

For D there is MIR which I would love to try.

By the looks of it, as stated by Beacon support, Intel stopped support for OpenCL on Phi in favour of its own 'Manycore Platform Software Stack' which is nasty because it rules out all FOSS. In fact, Boost, BLAST and other tools on Beacon all want OpenCL support! And Beacon even has the intel-opencl compiler (which won't work without OpenCL drivers). The advantage of OpenCL is that it is easier for programmers to target multiple platforms. With my software it is not too hard to target different backends explicitely.

My conclusion: all non-Intel solutions on Beacon require significant software compilation and dependency resolution (read deployment) at this point. For example to get OpenCL going is non-trivial and GNU Guix would be a real help. Once bootstrapped I could even use D and LLVM. Also we may need to build Erlang and Elixir - something Guix is great at.

The current LMM implementation is in Python with Numpy, already targetting BLAS on CUDA. To get LMM running on Phi there are 3 routes now:

  1. Python targetting Intel's MKL libs (pyMIC or Numpy-mkl)
  2. Python using OpenCL
  3. D generating LLVM code for Phi

Python+MKL (1) appears to be the easiest, provided I get pyMIC or Numpy-mkl to compile. OpenCL (2) would be great to have to be able to roll out to different platforms - a true FOSS solution. D+LLVM (3) is the experimental one with great scope for performance improvements. Both (2) and (3) would need GNU Guix to succeed, in my opinion.

It looks like the pre-installed Python on Beacon already supports MKL to some degree:

config-3.4m/Makefile:CONFIGURE_LDFLAGS= -Wl,-rpath,/sw/cs300_centos6.6/python/3.4.3/centos6.6_intel16.1.056/lib,-rpath,/global/opt/intel/parallel_studio_xe_2016_update1/mkl/lib/intel64

I also find that Python numpy shared libs are linked against MKL

ldd ./site-packages/numpy/linalg/lapack_lite.cpython-34m.so|grep -e mkl
    libmkl_rt.so => /global/opt/intel/parallel_studio_xe_2016_update3/mkl/lib/intel64/libmkl_rt.so (0x00007f4f4a329000)

So, if these libraries know how to farm out to the Phi's we should be good to go. At least, it is worth exploring the default Python installation first with numpy. In parallel I'll try to install Guix packages using Roel's method so at least I'll be able to generate my own compiler environment (we'll probably want this for Elixir support too). Roel's method essentially has one build Guix packages on a build host and copy and run them in $HOME. This means Guix itself can not be run on Beacon, but the packages can, including a more recent gcc. I'll use Penguin2 to targe my $HOME which is /nics/b/home/pjotrp.

4 First steps on the Beacon Intel Phi supercomputer

The first step was getting access again. I had locked myself out somehow… But that was fixed in about 6 hours. I have ssh access again. One of the first things to check is that Beacon is running a 2012 Linux kernel using Red Hat 4.4.7.

Linux version 2.6.32-573.22.1.el6.x86_64 (mockbuild@c6b8.bsys.dev.centos.org)
  (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) )
  #1 SMP Wed Mar 23 03:35:39 UTC 2016

For OpenCL Beacon have the intel-opencl/1.2-3.0.67279 compiler. Also PLASMA is provided as the plasma and dplasma modules, a higher level linalg library. There is Magma, another Linalg lib. Boost is also available, but lacks OpenCL support (yet - coming in v1.6.1). Then there is a JVM targetting the Phi. Hmmm. Choice, but not (exactly) what I am looking for.

The first step is to see if I can get a Python to work with OpenCL using the intel-opencl compiler. I found there is a module

module add intel-opencl

which loads and sets paths that do not exist, e.g.


time for an E-mail to support…

5 Progress on GN2

[2016-10-18 Tue] Many good things happening on GN2. New datasets are being added at a rapid rate. We have a new genotype map for the BXD. The REST API gives access to almost all data. A genome browser has been embedded and the list just goes on… The coming period I am going to focus on running a pilot on a super computer named Beacon.

6 Updating the deployment of GN2

[2016-08-16 Tue] Christian has done a lot of work in getting BD ready for GN2. My job is to get it on staging by updating the Guix packages. The last update was from February. Unfortunately Guix was misbehaving. The GNU servers are giving errors and the latest checkout of the Guix tree is also erroneous. One can live with one or the other but not both! I spent quite a few hours and rolled back to a version from a Month ago to start rebuilding GN2. One of the first problems I encountered was

_imagingft.c:68:32: fatal error: freetype2/fterrors.h: No such file or directory
 #include <freetype2/fterrors.h>
   compilation terminated.
   error: command 'gcc' failed with exit status 1
   phase `build' failed after 16.0 seconds
 builder for `/gnu/store/dmb05a6g2wfj8rx8mx3an1mwqrlznpzw-python2-pil-1.1.6.drv' failed with exit code 1

now python-pil I packaged and it is really, really old. According to this python-pillow is the modern package and, in principle a drop-in replacement. It is also python3 compatible: we have to aim for that. So, my first attempt is simply replacing pil with pillow. I'll probably run into a few issues in GN2.

First error I encountered was related to python-pil/pillow:

  File "/home/pjotr/.guix-profile/lib/python2.7/site-packages/Pillow-3.1.1-py2.7-linux-x86_64.egg/PIL/ImageDraw.py", line 91, in setink
   "Please use keyword arguments instead.")
Exception: setink() has been removed. Please use keyword arguments instead.

This error is coming from piddle, another really old package from the nineties - latest version from 2002. We should really get rid of it, but it is used by the GN1-style plot and (apparently) a plink plot.

Anyway, after a long look, I decided it was easier to fix the old python-pil package so it can find fterrors.h again. Isn't deployment fun?!

Apart from that major changes are R running at 3.3.0, and also I also did critical updates of python-numpy, python-scipy and openblas (see bug in OpenBLAS). I'll also upgrade R/qtl and add a new package for the excellent R/qtl2. There are more upgrades, mostly because thanks to the rolling updates of GNU Guix.q

The short and long of it is that GN2 builds again using a recent checkout of GNU Guix. Mind, I also updated plink-ng in the current build. Some of the parallel tests fail. May need to fix that before we roll into production (and add a test in mechanize).

7 Getting the latest pylmm-gn2 into GN2

The current version of pylmm we are using is my multi-core build from May 2015. That is over a year ago and does not include the CUDA features I added later. The main reason for not updating was a bug in OpenBLAS and that the new code has a different data transport requirement. Now we have the new gnserver running I am ready to bring the CUDA stuff into GN2. First, I tested that the multi-core version gives the same results as the earlier one

bergamo:~/genenetwork/pylmm$ env PYTHONPATH=.:$PYTHONPATH python pylmm_gn2/runlmm.py --geno=../test_gn_pylmm/data/test8000.geno --pheno=../test_gn_pylmm/data/test8000.pheno run

[15%] Calculate Kinship: 1.84636998177
[85%] Doing GWAS: 10.251172781

[ 0.898  0.153  0.324 ...,  0.962  0.883  0.962]
8000 4070.02328033

bergamo:~/genenetwork/pylmm_gn2$ env PYTHONPATH=./pylmm_gn2/:$PYTHONPATH python ./bin/runlmm.py --geno=../test_gn_pylmm/data/test8000.geno --pheno=../test_gn_pylmm/data/test8000.pheno run
[8%] Calculate Kinship: 0.989095926285 sec.
[92%] Doing GWAS: 10.708812952 sec.

[ 0.898  0.153  0.324 ...,  0.962  0.883  0.962]
8000 4070.02312498

so, with Karl's previous lmmlite-pylmm comparison, we should be reasonably confident that pylmmgn2 is doing the right job. The round-off difference is interesting, and due to one extra SNP (I think I remember one SNP was excluded in the original). Actually,

8 Updating Genotype data for the BXD and reproducibility

The REST API serves genotypes. Apparently the version in GN2 was outdated, for one BXD103 was removed/renamed. To ascertain reproducibility we need to introduce rigorous versioning and we need to be able to serve the different versions.

The genotypes exist as files. So can easily calculate an MD5 checksum and pass that in as metadata. We will also name the files using the partial checksum and publishing data, so BXD.csv becomes, for example, BXD\20150722\83771d79.csv. Passing this to the user (in the form of meta data) will allow the to uniquely identify the dataset for the purpose of reproducibility. These values are captured in the downloadable metadata BXD.json (also a file). The YAML version:

description: BXD
crosstype: riself
geno: genotypes/BXD/geno.csv
geno_transposed: true
    source: GeneNetwork
    unique_id: 42171462281377824604ec3d83771d79
    date: '20150722'
    unique_id: 6a766888cf7a5b5b9376ee165b4518ab
    date: '20150722'
    unique_id: dfbafbe862fb3572d3e847b7b7859540
    date: '20150722'
    '1': maternal
    '2': paternal
    '3': heterozygous
  B: 1
  D: 2
  H: 3
x_chr: X
- U
gmap: genotypes/BXD/gmap.csv

Phenotypes, meanwhile, are stored in a 'rolling' SQL database. This raises challenges for reproducibility. Some people have the opinion that it does not matter (the latest is the so-called greatest - an opinion also often voiced on the semantic web), i.e., they argue the researcher should archive the base data that was used for the analysis. That is possible, but I get uncomfortable when data changes just like that.

What we want is some way to signal a change to a dataset.

My proposal is to send a HASH value and a date with the phenotypes in the form of meta data. This will tell the researcher that a dataset has changed - if not how (if an archive exists a comparison should be possible). To achieve this we will add columns to the dataset description table and use and external tool to regularly test the datasets for changes. When a change happens and the HASH value has not been updated an alarm needs to be raised.

9 A great summer of code

[2016-08-04 Thu] Time to update this feed. The last two months saw plenty of activity. We have a publication in JOSS. Zach has introduced sessions into GN2. We have a scalable REST server running and we are adding functionality as we go. And Christian is adding the biodalliance genome browser to GN2, now with genotype and QTL tracks.

In the next phase we'll be adding improved LMM support, more tracks and software running in heterogeneous environments.

10 REST: fetching phenotypes (2)

[2016-05-24 Tue] Lots of activity on the #genenetwork IRC channel now the Google Summer of code started. Also distracting, of course, but I think good things come out of it. I am reverse engineering the GN database to provide the REST services.

11 Adding a multi-phenotype QTL plot

[2016-05-23 Mon] Harm has modifed Karl's QTL plot so it can show multiple QTL plots in one figure. I am looking at embedding this form into GN. Basically the user selects an experiment and a gene (presumably as an expression phenotype) and has it looks for all genes that correlate. At this point I am not so interested in the logic of correlations, but I am interested in the plot itself which could plot multiple phenotypes in one GN collection (read bag of phenotypes). I like also the way that it adjusts the chromosome sizes.

Looking at the plot's source code this should be possible.

Interestingly Karl has also a mutliplot in chart 3 but written in clean coffeescript. I am severely tempted to use that one. Harm agreed to look into it and maybe clean up the code. It would be better to make it so that Karl can reuse it.

12 REST: fetching phenotypes

[2016-05-18 Wed] We are adding a new REST service to GN. Main reason is to provide data to people using R, Python, Ruby etc. Introducing the REST server is also an opportunity for splitting functionality out of the main python webserver. One thing we want to avoid is long running jobs in python as it blocks on those. The third reason for a REST interface is to provide WEB 2.0 support to the genome browser and other UI tools. I started writing a maru based REST server.

13 Everyone on Guix

[2016-05-15 Sun] We have a Guix powered staging server now running on http://test-gn2.genenetwork.org/. Others have been installing and we will all be byte-identical and reproducible soon.

Now the main challenge is to get Zach on my version. First step is to update Guix on penguin followed by the guix genenetwork2 install. This pulls in all packages except for ctl.

Dennis updated the r-preprocesscore package - used for something (it is unbelievable what gets pulled in by all these tools) - so that triggers a rebuild of some of the dependencies:


you can see r-wgcna pulls in this dependency.

For Zach I set up the new repositories: genenetwork2 and all dependencies (except Danny's CTL) are now on penguin, so we can start synchronizing our efforts using GNU Guix. To use guix add ~/.guix-profile/bin to the PATH. Install system-wide guix packages, e.g.

guix package -i hello

For deployment we use a slightly modified setup, involving two git repositories as described here. Now

cd genenetwork/guix-gn-latest
env GUIX_PACKAGE_PATH=../guix-bioinformatics/ ./pre-inst-env guix package -i genenetwork2 --substitute-urls=http://guix.genenetwork.org:8080

to get the latest update of genenetwork2 dependencies.

Guix suggests the following paths

export PYTHONPATH="/home/user/.guix-profile/lib/python2.7/site-packages"
export R_LIBS_SITE="/home/user/.guix-profile/site-library/"
export GUIX_GTK3_PATH="/home/user/.guix-profile/lib/gtk-3.0"
export GI_TYPELIB_PATH="/home/user/.guix-profile/lib/girepository-1.0"
export XDG_DATA_DIRS="/home/user/.guix-profile/share"
export GIO_EXTRA_MODULES="/home/user/.guix-profile/lib/gio/modules"

I have not added these to ~/.bashrc so as not to confuse the older GN2 setup.

So to use the latest GN staging tree I have checked out

git clone git@github.com:genenetwork/genenetwork2.git gn2
git checkout -b staging
git pull origin staging
git checkout -b dev

Rather than using the secure-server script (which is not secure at all, btw), us the newer ./bin/genenetwork2 script. So

cd ~/genenetwork/gn2

try and run that. It may complain it can not find the module yaml, or something. That is because you need guix paths here. Get them with

~/.guix-profile/bin/guix package --search-paths
export PATH="/home/user/.guix-profile/bin:/home/user/.guix-profile/sbin"
export PYTHONPATH="/home/user/.guix-profile/lib/python2.7/site-packages"
export R_LIBS_SITE="/home/user/.guix-profile/site-library/"
export GUIX_GTK3_PATH="/home/user/.guix-profile/lib/gtk-3.0"
export GI_TYPELIB_PATH="/home/user/.guix-profile/lib/girepository-1.0"
export XDG_DATA_DIRS="/home/user/.guix-profile/share"
export GIO_EXTRA_MODULES="/home/user/.guix-profile/lib/gio/modules"

copy and paste these into the terminal! Note you should not copy the path as is, but add /usr/bin and /bin, e.g.,

export PATH="/home/user/.guix-profile/bin:/home/user/.guix-profile/sbin":/usr/bin:/bin

otherwise you lose all commands. Start the server up with

user@penguin:~/genenetwork/gn2$ env TEMPDIR=~/tmp ./bin/genenetwork2 ~/default_settings.py

You can see we set the TEMPDIR now and we use a settings file (take a look inside ~/defaultsettings.py) that is similar to the old one.

The command started up the new server on port 5003. Your original server is running on port 5002, so you may want to change that setting or ask Lei to make 5003 visible from outside.

See if you can get it running.

Note that I have only added directories and files to your HOMEDIR. In particular ~/.guix-profile/ ~/genenetwork, ~/opt and ~/defaultsettings.py. Nothing else was changed or overwritten.

Your old system and setup should still work fine - there is no interference.

14 Getting a server running for testing

[2016-04-28 Thu] Testing is crucial. I am putting in a test framework for GN2 - I named it 'Mechanical Rob' :).

A few years back Zach had put in a few Selenium tests and we will continue to use Selenium to generate tests (it can generate code using a browser plugin - I tried that and it is helpful because turning clicks into code is pretty tedious). Unlike the previous edition by Zach, the new tests, will be converted to run without Selenium itself. I want to be able to run the tests on the server without a browser (any time).

Dennis is interested in helping us to get full test coverage. I am not sure how much he will do, but we have a good start. My aim is to cover all functionality in GN2 with these tests.

I am adding dead-link detection in the testing framework. This is how I find that GeneRIF link on the main page no longer works. I replaced it with the wikipedia reference which, funnily, also contains the broken link. Go, go NCBI! There are more dead links, however…

Next we need a server to test against. I got one started using the merged trees on Penguin2 (currently running on http://penguin2:5003/ for those with VPN access.

This server should contain the latest changes by Zach, Danny and myself. WGCNA is broken (it is a driver problem on Penguin2) as well as the CTL functionality (I need to bring in Danny's CTL code). But otherwise the thing should run. I'll fix these two problems. Please notify me, or raise an issue, if you see any unexpected differences between Zach's and my server.

One of the most intrusive changes is that the large files have moved to a different directory and temporary files are now generated in /tmp, rather than in the source tree (that is a no go!). This may explain a missing figure or a failing export function. Again, do notify me if you miss something that ought to be there. One figure that is missing is the bar plot for interval mapping. I'll fix that too.

To get the server up and running (on top of Guix) I expanded the README documentation.

15 A git merge nightmare

[2016-04-21 Thu] Working on three branches (Zach, Danny and myself) we needed to merge our work. This proved less easy than normal, mostly because we were working on essentially different repositories (after I diverged the diet version described below). I tried merging, cherry-picking and hand patching - it all proved too difficult because we had been intrusively removing and renaming on both ends!

Since Zach's tree is still the authoritative one I decided to drop the diet for now and bring in Danny and my changes by file (kind of a rebase). Again, non-trivial as we all three worked on the same files. The 'git am' command came in handily - another GNU Guix trick I learnt. It was the most tedious job. I guess I learnt another lesson here (even though we could merge, better don't diverge git source trees). For a little while we'll use the sumo tree again.

The files that had the majority of conflicts were:


I had to remove the first from .gitignore as it is now supposed to be generic.

There are a few things to fix. R/qtl output is broken (seen that one before) and CTL lacks the binary dependency. Also I need to tick off gemma mapping and Zach's work.

16 Guix distribution of GN2

I spent significant time trying to get 'guix archive' to work, but unfortunately there is a problem with R packages. So I am working around that for Danny.

There are two options: (1) he builds himself and (2) I create multiple archives. The first option is a no go - I think - because there are some problems with the latest guix versions. Multiple archives will work, but I need to recreate the symlink structure. I ought to try Docker now (grumble) as it offers a unified environment, something Danny can use.

Funnily (well not so funny) the guix builds are working again, so Danny is doing that on his machine following these instructions.

17 Preparing GN2 for distribution

17.1 Removing large files out of the git repository

In my git repository I have removed the genotype data files (they are now outside the source tree). This trimmed it down from 350Mb to 30MB). We also need to rewrite git history to remove the large files from the git database.

git filter-branch --tree-filter 'rm -rf genotype_files' HEAD

Next I found there was another copy of the directory in git history web/genotypes using

git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | tail -5
git rev-list --objects --all | grep  9fc0c77c360f8a0b1f7bc8a696a99b5dff340607
git filter-branch --index-filter 'git rm -rf web/genotypes/HSNIH.geno.gz'

and purge (after editing .git/packed-refs) them (see below). Other dirs that had to be mucked out were web/gemma, web/dbdoc, wqflask/output, web/newgenotypes. For each:

git filter-branch -f --index-filter 'git rm -rf --cached --ignore-unmatch web/ wqflask/output webtests/chromedriver '

After finishing the clean up the packed-refs file (remove foreign references), see if git fsck notices the unreachable objects with

rm -rf .git/refs/original/
git reflog expire --all --expire-unreachable=0
git fsck --full --unreachable

if you have a list then you can do:

#+beginsrc sh git repack -A -d git prune #+endsrc sh

And while I was at it I also removed webtests/chromedriver.

Now this can only work when we repack the repo and people clone afresh. We'll also need to remove some of the JS dirs.

Anyway, the diet version of the genenetwork2 repository has come down to 25Mb (from 350Mb).

17.2 Create archive of large genotype and mapping files

The large data directory of 3.8Gb of data was archived with

tar cvJf gn2_data.tar.xz gn2_data/

Next the pfff checksum was run to rename the file and upload it to the files repository. Actually I switched to lz4 because compressing these large files took forever with xz.

The small database only contains the files for the BXD. When testing BXD.geno, BXD.cross and BXD.json were the only ones used.

The latest files can be found here.

18 Many packages to get r-wgcna going

[2016-02-29 Mon] I wrote the R-WGCNA package with dependencies and had to write packages for, r-biocpreprocesscore r-wgcna, r-acepack, r-latticeextra, r-formula, r-hmisc, r-doparallel, r-iterators, r-foreach, r-fastcluster, r-dynamictreecut, and r-rcppeigen.

Now GN2 runs completely from GNU Guix, the beast named GN2 has been tamed! There will probably be more hickups and I still need to move the javascript libraries to Guix, but at least we can move forward from here.

19 Fixing GN2 file paths and a working GN2 server

Lots of file paths in GN2 are hard coded. To make the beast work and move genotype files out of the git repo (reducing that from 350Mb to a more manageable 30Mb).

I rejigged the (hard-coded) paths. A hickup was that GN2 generates images and puts them in a folder inside the source repository. Scary! Flask supports 'static' serving from subdirectory with the same name. We don't want that, but we can rewire the path which is what I do in this commit. Now, not only, do we avoid uploading images in the source tree (one problem is that Guix won't allow it because it is immutable) but also we have everything in temporary directory which is easier to monitor.

I spend much of my time this week getting paths and dependencies sorted. One ball-breaker was the piddle version I found on penguin. Despite its version number piddle-1.0.15-py2.7 it turned out to be different from the online version. Good grief, I am happy we are running Guix now!

Anyway, as it stands, GN2 runs fine on Guix which means it will run anywhere. pylmm, r-qtl and interval mapping work out of the box. I still need to work on the plink and gemma wiring as much of it is hard coded. We ought to move that out to the REST server anyway, so that is next.

But at least we have something we can roll out.

20 Packaging Genenetwork (cont 5)

[2016-02-22 Mon] Now that pylmm is packaged the final GN2 dependency is the database. MySQL allows for read-only databases these days, so we can create a nice testing/development database that is embedded in a GNU Guix package. To idea is to do a

guix package -i genenetwork2-database

which will install GN2 with the database and GN2 will fire up with one single command


against which we can run the test system and even do some integration testing (as long as it does not want to write data).

As we are probably going to split up private (mutable) and public (immutable) data in the future I think this is a brilliant way of deploying public datasets for testing. Should work right out of the box. Multiple mysql servers can even run on one single database, for example.

The first step is to fetch a database from somewhere. We now have http://files.genenetwork.org/raw_database/ for that purpose. Even the small database is still too large (1.5GB), so we may want to trim it down.

Then there are the flat files which we are going to move out of the various directories they live in today. There are some .map and .bed files living in a plinkgemma dir, for example, others are in the source code git repo (on github). I'll consolidate these on http://files.genenetwork.org/.

21 Packaging pylmm and sambamba

[2016-02-20 Sat] The current GN is running an old version of pylmm, git@github.com:genenetwork/pylmm.git commit b8a15885ed3701e079170d6a8bf69bb8d8349f9c (Importing multi-core version of pylmm for GN2). So that got pylmm-multicore packaged in guix-bioinformatics thanks to Nick allowing for a proper FOSS license.

In GN2 pylmm is called in two places, so I needed to fix those:

grep -r lmm.py *
utility/tools.py:            pylmm_command = 'python '+path+'/pylmm_gn2/lmm.py'
wqflask/heatmap/heatmap.py:  command = 'python /home/user/gene/wqflask/wqflask/my_pylmm/pyLMM/lmm.py --key {} --species {}'.format(key,

I added a wrapper for pylmm and adapted Guix and GN2 to call into the wrapper named pylmmredis (this version still uses Redis) so it says

INFO: lmm try loading module
WARNING: LMM standalone version missing the Genenetwork2 environment

which means it can be run standalone.

[2016-02-21 Sun] Roel started work on packaging sambamba, a fast performing NGS tool. To make it work I wrote a new sambamba makefile for Guix which builds fine in my environment. Over to him again.

22 Expanding the GeneNetwork effort (the team is growing)

[2016-02-15 Mon] The story continues. Good news is that Roel and Dennis are joining this week in the effort. And Harm.

Harm is working on a UI using Karl's D3 materials named the 'Arabidopsis eQTL analysis platform' or AraQTL, the Django source code (an MVC) is here. Dennis started work on GPU related stuff and Roel is packaging general bioinformatics software in GNU Guix.

To make it easier to collaborate we have created an IRC channel on freenode. Go to irc.freenode.net and join with

/join #genenetwork

Join on emacs with

#+beginsrc scheme (setq erc-autojoin-channels-alist '(("freenode.net" "#genenetwork"))) (erc :server "irc.freenode.net" :port 6667 :nick "name" :password "pwd") #+endsrc scheme

Of course it is blocked on the campus networks so people will need some other route.

23 Packaging gemma, plink

[2016-02-14 Sun] GN uses some external tools for computations. First I fixed and packaged the latest versions of gemma (was 0.94, 01/12/2014, now 0.95-alpha or 0.9.5-2de4bfab3) and plink (was PLINK v1.90b3s 64-bit (17 Jun 2015), now PLINK v1.90p 64-bit (12 Feb 2016)) so they faithfully show up.

24 Packaging Genenetwork (cont 4)

[2016-02-13 Sat] The qtlreaper I packaged a few weeks back. Today was htmlgen's turn (a piece of software dating 1998!). The version that comes with GN is slightly different from the standard version. In the Guix repo the python files are now correctly compiled and I only included the .pyc files we are using. Guix just makes me happy.

So, the good news is that all the GN2 code bootstraps using the new ./bin/genenetwork2 from Guix up to

Can't connect to local MySQL server through socket '/run/mysqld/mysqld.sock'

This error is because I am not yet running mysql. I am getting close, really close.

Lei has opened a http accessible storage for large files. This means large static files can be removed from the git repos and can be split out in GNU Guix deployment. This includes genotype files, source packages and even test databases. GNU Guix checksums these files, so we can make sure they do not change (just like that). I do think we should add the date and source to the file name, so it is clear where it came from. E.g., these filenames are not so descriptive and they are duplicates

d4364afdcce9b6071efb4016d585c288  genotype_files/genotypes/BXD300.geno.2brmv
d4364afdcce9b6071efb4016d585c288  genotype_files/genotypes/BXD300.geno.7636

25 Packaging Slurm

[2016-02-12 Fri] Today I packaged slurm, a resource manager for clusters, and submitted the patch to GNU Guix for inclusion. The first step towards using cluster effectively. It was a bit harder than I expected, mostly because non-free software had to be removed and the build system adapted.

26 Managing power of the cluster

[2016-02-11 Thu] Yesterday I got more of Victor's old boxes and it looks like some are in bad shape and some can be rescued. I have one node screaming behind me now and it has PXE, so we can boot over the network. One thing I want to do is easy provisioning of new images to make this a true throwaway GNU Guix cluster. A description of a simple PXE server can be found here, required is DHCP plus tftp. When it gets serious we should implement that. I also looked into wake-on-lan (WOL) which should also work on these machines. It is good we have these capabilities now because a running node pulls 200W. The powersupply for two nodes idling pulls 30W. I think we also need a master switch for ever 8 nodes (240W sleeping 1600W awake!). I am going to shut down 6 of the 8 existing nodes today. Waking them up is as easy as

wakeonlan hwaddr

To list the hwaddr (where machines.txt contains the output of ifconfig)

cat machines.txt |grep HWa|awk '{print $5}'

With the nodes sleeping they still pull 15W each. So that is 8x15W = 120W or about 2 KWh per day ~ EUR 0.40 p.d. or EUR 12 per month.

When the nodes double that will be EUR 24 per month (sleeping, running is 10x as much! You can see the point of cloud computing).

I still think I need we need a 220V master switch which can be managed remotely. So all cost drops to zero when the cluster is not in use.

Next time I am with the machines I need to configure the BIOS to accept PXE. After that we can roll out the provisioning.

On the hacker space mailing list the following question came up with my answer

Q: Would it be an idea to transform (some of) the nodes into a more generalized tinkering platform - for instance I heard someone mention botnets recently, and it would be a shame to rig a separate server for that if one could just spool a VM to a cluster node. Building a "general network tinkering and test deployment platform" might be a nice project in itself.

A: The cluster is exactly meant for experiments and I am happy to share.

The concept is a throwaway setup. That means you should write your installation in such a way that it can be regenerated every time. This is done through GNU Guix scripts which allow for VMs, containers and PXE installs on the nodes.

To run experiments needs some co-ordination with me and perhaps others. Every software deployment and configuration should be captured in Guix.

The idea is NOT to have ad hoc installation of software on the main node (though you are free to mess within your VM) and also NOT to have stuff running full-time.

Think of it as proof-of-concepts which can run for a while. I'll be happy to facilitate with 8x nodes with 8 cores and soon another 8x8.

See also the project page.

27 Back in NL and hacking GNU Guix

[2016-02-08 Mon] Returned a week ago and most time went into FOSDEM submitting a grant proposal. My FOSDEM talk can be found here - with slides and video.

This week ramping up GeneNetwork packaging again (after building and testing); So GN is on latest Guix. Also pushing the munge and slurm packages to Guix as well as updating the documentation including impressive graph (I'll add one for GN when it is complete).

I am purging LFS (because it is github propriety) from the GN2 git repo. I.e.

[master b8bbca6] Removing LFS
delete mode 100644 genotype_files/genotypes/.gitattributes
delete mode 100644 genotype_files/new_genotypes/.gitattributes

which also required a manual

rm .git/hooks/pre-push
rm -rf .git/lfs/

Ugly, but fixed.

The rest of the day was spent on a regression of python2-sqlalchemy (grrr). Turned out python2-sqlalchemy does not work with a more recent version of sqlite, so I introduced an older version again. It is amazing how many tools use sqlite. Anyway, the good news is that the build system is moving incrementally forward.

28 Packaging Genenetwork (cont 3)

[2016-01-23 Sat] The GNU Guix GN2 installer is pretty much finished though I don't have the bandwidth to test the project. Next up is the GN1 installer. I dug up an E-mail with instructions from Lei which points out Python 2.4.3 may be required as well as a bunch of modules that are in a 3rd party tar ball: graphviz-2.22.2, htmlgen, json, numarray-1.5.2, piddle, PIL (in guix), pp-1.5.7, pyx, pyXLWriter, svg. I.e., a few more modules to package. At least all these with source, because even a cursory look shows these modules are old and HTMLgen, for example, has no download location any longer. Since we are not going to write software for GN1, I will just use the tar ball with an installer script. The json, pp and numarray modules may be worth adding to GNU Guix. Anyway, it looks all doable now. Today I packaged python2-numarray as a GNU Guix package.

29 Packaging Genenetwork (cont 2)

[2016-01-22 Fri] The latest GNU Guix installer copies the full webserver code into the store


With that we have achieved clean versioning as a first step. Now I need to make the dependencies available for the webserver which is somewhat tedious. But once it is done it should last for a long time. Luckily a lot of the packages already exist in GNU Guix - sometimes it pays to wait in informatics ;). I think most of the dependencies are in the package now. Looks like I need to package another 4-5 myself. Hopefully they don't depend on too many modules themselves (r-wcgna has a long list I still need to do! The new bioconductor export module of GNU Guix will be a great help).

Today I added a GNU Guix package for qtlreaper. But looking into the GN2 source code I don't think it is actually used. Likewise a bunch of other modules.

30 Packaging Genenetwork (cont)

Today I managed to complete the python installer for GN2. This has led to a request for handling genotype data outside the tree and looking at contained packages not written by us. I also reported a bug in GNU Guix related to fetching repositories from git and updated R-qtl to the latest version in GNU Guix. I have started adding and testing dependencies following Artems Docker recipe with included requirements for the Python installation. Be interesting what kind of graph comes out!

31 Started packaging Genenetwork

Both Harm Nijveen and I are working on GeneNetwork installation. First hickup is that the git repo for GN1 is not complete and the tar-ball is 33GB. Meanwhile I am porting GN2 to GNU Guix. The first step was to create a python installer because the .py files need to be compiled to execute in a read-only store. The net effect is a GNU Guix install in


At least the python stuff looks complete, but the html and other files are still missing.

32 Install Guix

To solve the R/lmmlite install and am opting for a GNU Guix package. We are setting up an Arvados cluster based on GNU Guix and agreed that all installation should be through Guix. Progress is recorded here. I needed to do the base install anyway so Roel can continue later. Today I got to the point I can experiment with R/lmmlite.

33 Install R/lmmlite

Working from Zanzibar the time has come to align my version of pylmm with that of R/lmmlite by Prof. Karl Broman (lmmlite is an R/C++ hybrid implementation of Nick's pylmm). It ought to be interesting to compare output and performance of both tools and see how we can move forward hosting one or more of these tools into GN. Speed and memory usage are of interest when hosting such services.

For R/lmmlite the R libraries need to be installed and pulling over 100Mb for the R devtools and dependencies over narrow band is no joy. I am on narrow band and the 'lite' part obviously does not reflect the implicit dependencies. Karl sent me an update that I should use 'R CMD', but that in turn requires knitr etc. This is stuff I can only do at night, so as not to disrupt other users.

34 Running pylmm on penguin2

To build scipy I had to add

apt-get install gfortran libopenblas-dev liblapack-dev
pip install simplejson mako scikit-cuda

Running the Kinship calculation on the K80 it is 3x faster than the old K20:

env LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64 python ./bin/runlmm.py --pheno ../test_gn_pylmm/data/test8000.pheno --geno ../test_gn_pylmm/data/test8000.geno run

Timing report K20

[23%] Calculate Kinship: 1.86254692078
[77%] Doing GWAS: 6.14982700348

[ 0.898  0.153  0.324 ...,  0.962  0.883  0.962]

Timing report K80

[11%] Calculate Kinship: 0.609519958496 sec.
[89%] Doing GWAS: 4.69884514809 sec.

[ 0.898  0.153  0.324 ...,  0.962  0.883  0.962]

Next to the CUDA stuff I also did some work on the deploy tool. It does some basic file copying now from YAML descriptions, the most important part for deployment. I can use it now for configuring emacs, vim, sshd etc. Simple file editing is the next step.

35 Installing penguin2

The new production GN server (Penguin2) is sweet. 56 2Ghz Intel Xeon cores, 256Gb RAM and an NVIDIA K80m on board. Ready to R&R! At the moment it is not running anything - just a barebones Ubuntu 14.04 with Linux 3.19 kernel. I'll add CUDA (now where did I document that again? Aha, it is in my old BLOG):

First step is to install the CUDA driver and software provided by NVIDIA.

The toolkit (with driver!) is a 1.9Gb download and binary install. Luckily this 1.9GB download link can be used from the server directly. Initially, I installed the local .deb file

wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
dpkg -i cuda*.deb

To see what it installed

dpkg -L cuda-repo-ubuntu1404-7-5-local
cat /etc/apt/sources.list.d/cuda-7-5-local.list
  deb file:///var/cuda-repo-7-5-local /

Interesting, a local repository. Run

apt-get update
 apt-get install cuda

Bit of a roundabout way(!?).

By default it installs in /usr/local/cuda-7.5 which is OK on Penguin.

rmmod nouveau

And the CUDA installer blacklisted it with some settings in /etc adding a file modprobe.d/nvidia-installer-disable-nouveau.conf which blacklists the driver. The kernel log says

[4617358.866019] nvidia: module license 'NVIDIA' taints kernel.
[4617358.870873] nvidia: module verification failed: signature and/or  required key missing - tainting kernel
[4617358.879592] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:83:00.0 on minor 1
[4617358.879688] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:84:00.0 on minor 2
[4617358.879715] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  346.96  Sun Aug 23 22:29:21 PDT 2015
[4617358.881460] nvidia_uvm: Loaded the UVM driver, major device number 247
[4617358.883559] nvidia_uvm: Unregistered the UVM driver

The UVM is used to share memory between the CPU and GPU in CUDA programs. We won't need that for the time being.

To install the software I used screen (the installer is probably not atomic).

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-7.
Samples:  Installed in /home/pjotr, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda/lib64, or, add /usr/local/cuda/lib64 to /etc/ld.so.conf and run ldconfig as root

Logfile is /tmp/cuda_install_31036.log

I copied the log to my Download folder and

/usr/local/cuda/bin/nvcc --version
  nvcc: NVIDIA (R) Cuda compiler driver
  Copyright (c) 2005-2015 NVIDIA Corporation
  Built on Tue_Aug_11_14:27:32_CDT_2015
  Cuda compilation tools, release 7.5, V7.5.17

To test CUDA run one of the examples in /usr/local/cuda/sample. E.g.

root@penguin2:/usr/local/cuda/samples/0_Simple/simplePrintf# ./simplePrintf
GPU Device 0: "Tesla K80" with compute capability 3.7

Device 0: "Tesla K80" with Compute 3.7 capability
printf() is called. Output:

[3, 0]:         Value is:10
[3, 1]:         Value is:10
[3, 2]:         Value is:10

deviceQuery responds with

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K80"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.7
  Total amount of global memory:                 11520 MBytes (12079136768 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Max Clock rate:                            824 MHz (0.82 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 131 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Looking good!

Next we try to get pycuda up and running. I am using the GNU Guix Python, so (after installing Guix as described here):

guix package -i python2-virtualenv python2-numpy glibc-utf8-locales
export PATH=$HOME/.guix-profile/bin:$PATH
mkdir ~/guix-python
virtualenv ~/guix-python
cd guix-python
. bin/activate
export PATH=$HOME/guix-python/bin:$PATH


export CUDA_ROOT=/usr/local/cuda
export PATH=$CUDA_ROOT/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64 # optional
# pip install numpy # optional if no guix install
pip install pycuda

The install did warn that a default install may fail and that it was using unoptimized lapack. Running the following example crashed the machine on one Tesla the first time and segfaulted on another. Great.

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/lib/x86_64-linux-gnu
export PYTHON_PATH=$HOME/guix-python/lib/python2.7/site-packages:$HOME/.guix-profile/lib/python2.7/site-packages
~/guix_python/bin/python ~/Download/pycuda-2014.1/examples/hello_gpu.py

So much for hello! Is there another nvidia driver installed that conflicts with the CUDA one?

I went through the ldconf paths and removed the ones that are not required and rebooted. Then found that

ls -l /usr/lib/x86_64-linux-gnu/libcuda.so.1
  lrwxrwxrwx 1 root root 17 Aug 14 23:47 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.352.39

points to an older install. Hmmmm.

apt-get remove --purge nvidia-352 nvidia-modprobe nvidia-settings libcuda1-352

And reinstalled the nvidia modules v349.96, so

ls -l /usr/lib/x86_64-linux-gnu/libcuda.so.1
  lrwxrwxrwx 1 root root 17 Oct 15 07:32 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.346.96

is correct. However, when we want to install the CUDA packages it tries to install the more recent driver. WHAT NOW? I decided to use the binary installer (rather than the .deb) of the OLDER 7.0 CUDA version, to align with Penguin1. Note, however, that the newer CUDA libraries are more efficient at floating point storage, so we may want to upgrade when we start running out of CUDA RAM (sometime down the line). After installation:

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-7.0
Samples:  Installed in /home/pjotr

Please make sure that
 -   PATH includes /usr/local/cuda-7.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-7.0/lib64, or, add /usr/local/cuda-7.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-7.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Getting_Started_Guide_For_Linux.pdf in /usr/local/cuda-7.0/doc/pdf for detailed information on setting up CUDA.

Logfile is /tmp/cuda_install_13279.log


ls -l /usr/lib/x86_64-linux-gnu/libcuda.so.1
lrwxrwxrwx 1 root root 17 Oct 15 08:03 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.346.46

is the correct one. The CUDA examples work, but python still balks with

python[17633]: segfault at 100000018 ip 00007f03bb062f47 sp 00007ffea1323320 error 4 in libc-2.19.so[7f03bb041000+1bb000]

which differs from the Guix version which is libc-2.22. Hmmmm. Major HMMM. This does not look good. So, next I tried virtualenv with the Debian/Ubuntu packages. And, oh (banging head on table) I forgot to activate the virtualenv! That means I was mixing Pythons.

Now it works:

(python2.7)pjotr@penguin2:~/pylmm_gn2$ env LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64 python ~/pylmm_gn2/bin/cuda_properties.py
2 device(s) found.
Device #0: Tesla K80
  Compute Capability: 3.7

So I can work back towards using the GNU Guix version with the later CUDA 7.5.

Since we are aiming for reproducible installation paths it will take a bit of effort to install the server properly using GNU Guix, deploy etc. I also intend to write a new tool that takes care of monitoring running services. This has to happen in conjunction with the other stuff we are doing (not least getting pylmm to work with human data).

Next to programming I always end up working on (software) deployments. Maybe it is because few people are really interested. For reproducible software it is key.

36 Installing an Arvados cluster

[2015-10-14 Wed] For GN we are going to use a cluster for back-end computations. These weeks, to evaluate Arvados, I am building a ~180 core compute cluster. The installation process is documented here. The cluster will be hosted in Groningen. The installation is mostly automated. For development and testing of automation I use a KVM virtual machine as described here.

37 R/qtl control, genotype and phenotype parsers added

[2015-10-02 Fri] Today I added control, genotype and phenotype parsers for R/qtl support. R/qtl can add multiple phenotypes, so we need to add a switch for selecting the phenotype (or are we going to run them all by default?).

Run it with

./bin/runlmm.py rqtl --control data/rqtl/iron.yaml --pheno data/rqtl/iron_pheno.csv --geno data/rqtl/iron_geno.csv

the F2 Iron set is also nice because it has a lot of missing data. See how that comes out. In the process I am adding support for column headers.

38 Teaching R/qtl to pylmm

Because of an increase in human data and more complex crosses, such as the diversity outbred mice, there is a demand for using LMMs from R/qtl. The current R/lme4 library could arguably be used, but it misses the speed of tools like pylmm and FaST-LMM to crunch through the data. So, where we recently introduced R/qtl in GN2, now we are introducing pylmm for R/qtl (I rather enjoy inside-out use cases). With Karl using pylmm also scrutiny of the results will get another impetus - which is important.

The first step is to teach pylmm genotype and phenotype data formats that can be emitted from R/qtl and described here.

39 A new page - starting on pylmm again

[2015-10-01 Thu] The coming months I will be working on pylmm again, so time to restart this BLOG. Since my last BLOG we achieved a number of things: First, the pylmm license has been changed to a true FOSS license. Thanks Nick! Second, we got a new CUDA K80 server which should allow for some LMM heavy lifting. Third, we got Arvados to work and we are setting up a dedicated cluster as a pilot for backend computations. Forth, GNU Guix is flying and reproducible software installations are around the corner. Fifth, we are working on embedding pylmm into R/qtl. And, sixth, we did a Google Summer of Code project with LMMs.

There are still some outstanding TODOs for pylmm from the older BLOG. In a nutshell pylmm/CUDA works, but needs tweaking for the human data in GN2 (mostly issues around file size/transfer) and covariates need to be passed in. This work matches the missing links for R/qtl to use pylmm so it is clear what I should be working on.

But first we need to bring in the test system (currently sitting in a separate github repository) and teach pylmm to read R/qtl formats. I'll do that in parallel.

The outstanding pylmm issues on github matches the list from the previous BLOG reads (in no particular order):

39.1 IN PROGRESS Make sure the server immediately realizes an error

39.2 IN PROGRESS Fix Redis not accepting large objects on Penguin

39.3 IN PROGRESS Add data human and testing framework

39.4 IN PROGRESS Add covariate handler

39.5 IN PROGRESS integrate pylmm CUDA with GN2 using runlmm

39.6 IN PROGRESS Make sure SNPs are named (because order may differ)

39.7 IN PROGRESS Check difference of treatment human data in GN2/pylmm

39.8 IN PROGRESS Test human mapping for GN2

39.9 IN PROGRESS Adapt pylmm in GN2 to accept input files

39.10 DEFERRED Create python-CUDA package for GNU Guix

39.11 DEFERRED Handle plink input file

39.12 DEFERRED Get OmicABEL to compile and run the TODO list.

39.13 DEFERRED Get the GPU version of OmicABEL to compile and run

39.14 DEFERRED Create cache for K

39.15 DEFERRED Reduce precision of kinship matrix for easier storage

40 Older rotating BLOG

Older BLOG material on pylmm etc. can be found here.