UP | HOME

Pjotr's rotating BLOG

Table of Contents

RSS feed


Tis document describes Pjotr's journey in (1) introducing a speedy LMM resolver for GWAS for GeneNetwork.org, (2) Tools for pangenomes, and (3) solving the pipeline reproducibility challenge with GNU Guix. Ah, and then there is the APIs and bug fixing…

1 First code katas with Rust

[2020-12-03 Thu] To force ourselves to work with Rust I am introducing code katas to the pangenome team. First I set an egg timer to 1 hour and installed Rust and clang with Guix and checked out Christian's rs-wfa bindings because I want to test C bindings against Rust. Running cargo build pulled in a crazy number of dependencies. In a Guix container I had to set CC and LIBCLANGPATH. After a successful build all tests failed with cargo test. It says

ld: rs-wfa/target/debug/build/libwfa-e30b43a0c990e3e6/out/WFA/build/libwfa.a(mm_allocator.o):
  relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a PIE object;
  recompile with -fPIE

On my machine adding the PIE flag to the WFA C code worked:

diff --git a/Makefile b/Makefile
index 5cd3812..71c58c8 100644
--- a/Makefile
+++ b/Makefile
@@ -10,7 +10,7 @@ CC=gcc
 CPP=g++

 LD_FLAGS=-lm
-CC_FLAGS=-Wall -g
+CC_FLAGS=-Wall -g -fPIE
 ifeq ($(UNAME), Linux)
   LD_FLAGS+=-lrt
 endif

the PIE flag generated position independent code for executables. Because this is meant to be a library I switched -fPIE for -fPIC and that worked too.

test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

A thing missing from the repo is a software license. WFA is published under an MIT license. Erik also favours the MIT license, so it make sense to add that. After adding the license I cloned the repo to the pangenome org.

2 GEMMA additive and dominance effects

[2020-11-26 Thu] To compute the additive effect one takes the mean a = (uA - uB)/2 and for the dominance d = uAB - (uA + uB)/2. GEMMA estimates the PVE by typed genotypes or “chip heritability”. Rqtl2 has an estherit function to estimate heritability from pheno, kinship and covar.

3 Sambamba build

[2020-11-30 Mon] Sambamba is an important tool for sequencing, but maintenance is a struggle. I have decided to radically simplify dependencies (same thing will happen with freebayes). First, I am making a wager that CRAM is not necessary in Sambamba by removing that code and the htslib dependency. I may get back to that later in the pangenome context, but for now we don't want it.

[2020-12-03 Thu] This week a new release went out and a few fixes.

4 GEMMA randomizer

[2020-11-28 Sat] GEMMA uses a randomizer in bslmm and mcmc functionality. It has a command line option -seed that is -1 by default. If below 0 it will seed the randomizer from the hardware clock in param.cpp. The randomizer is set in three places:

bslmm.cpp
953:  gsl_rng_set(gsl_r, randseed);
1659:  gsl_rng_set(gsl_r, randseed);
param.cpp
2032:      gsl_rng_set(gsl_r, randseed);

and there are three (global) definitions of 'long int randseed' in the source. Bit of a mess really. Let's keep the random seed at startup only. The gslr structure will be shared by all. After fixing the randomizer we have a new 0.89.3 release of GEMMA!

5 GEMMA, Sambamba, Freebayes and pangenome tools

[2020-11-25 Wed] Working on pangenomes with Erik Garrison we have started to develop a path for genotyping through pangenomes. This means I am now working on three legacy tools and will try to make them ready for a new generation of tooling. Get the best from these tools for future work.

I just managed to build Freebayes using a Guix environment. The tree of git submodules is quite amazing.

The first job is to build freebayes for ARM. This is part of our effort to use the NVIDIA Jetson ARM board. On ARMf the package included in GNU Guix fails with missing file chdir vcflib/fastahack and bwa fails with

ksw.c:29:10: fatal error: emmintrin.h: No such file or directory
 #include <emmintrin.h>
          ^~~~~~~~~~~~~

The first thing we need to resolve is disabling the SSE2 extensions. This suggests that -DNOSSE2 can be used for BWA. We are not the first to deal with this issue. This page replaces the file with sse2neon.h which replaces SSE calls with NEON. My Raspberry PI has neon support, so that should work:

pi@raspberrypi:~ $ LD_SHOW_AUXV=1 uname | grep HWCAP
AT_HWCAP:        half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
AT_HWCAP2:       crc32

Efraim will have a go at the BWA port. After that I'll pick up freebayes which includes a 2016 BWA as part of vcflib. People should not do this, though I am guilty with sambamba too.

5.1 Compile in a Guix container

After checking out the repo with git recursive create a Guix container with all the build tools with

guix environment -C -l guix.scm

Next rebuild the CMake environment with

rm CMakeCache.txt ./vcflib-prefix/src/vcflib-build/CMakeCache.txt
make clean
cmake .
make -j 4
make test

The last command is not working yet.

6 GEMMA compute GRM (2)

[2020-10-02 Fri] With all that plumbing in place it is time to get the GRM computation sorted. Nick, in parallel, is working on a speedy replacement for LOCO written in Rust. I'll focus on a Python implementation right now. One thing I need to do is compute MAF correctly and (in line with that) translate the real numbers.

  • [ ] Translate BIMBAM
  • [ ] Translate R/qtl geno and genoiter
  • [X] Compute MAF correctly

7 Managing filters with GEMMA1

[2020-09-30 Wed] GEMMA has filters which I need to be able to disable for HEGP input. Tracking this on github.

8 HEGP and randomness

[2020-09-30 Wed] For the HEGP encryption we use rnorm. I had to dig through R's source code to find out how it generates random numbers. It does not use the Linux randomizer but its own implementation. Ouch.

Honestly, we should not use that! These are pragmatic implementations for sampling. Not for encryption.

Interestingly, R sets srand() on startup using a time stamp. This is, however, only used to generate temporary filenames using the glibc rand() function. Correct me if I am wrong, but it is the only time R uses the OS randomizer.

Meanwhile the implementation of rustiefel is pretty straightforward: All we need to do is replace rnorm with an array of floats. Both Rust and Python provide a normal distribution from /dev/random. I'd have to dig deeper to get details, but to me it looks like the better idea.

9 GEMMA keeping track of transformations

[2020-09-29 Tue] One important thing mentioned below is that we want to keep track of data transformations. For example, applying a MAF filter twice can lead to different results. At least you'd want to show a warning. For this I am adding a transformation field in the control file which shows a list of tranformations formed like

"transformations": [
     {
         "type": "filter",
         "pheno-NA": true,
         "maf": 0.01,
         "miss": 0.05,
         "command": "./bin/gemma2 --overwrite -o test/data/regression/21487_filter filter -c test/data/regression/21487_convert.json"
     }

which allows us to check whether the filter has been applied before. Running the filter twice will show. The grm command will check whether a filter has run. If not it will add.

The current output after two steps looks like:

<urce/code/genetics/gemmalib [env]$ cat test/data/regression/21487_filter.json
{
    "command": "filter",
    "crosstype": null,     // Cross-type is outbreeding
    "sep": "\t",           // Default separator
    "na.strings": [
        "NA",
        "nan",
        "-"
    ],
    "comment.char": "#",
    // keeping track of these for debugging:
    "individuals": 17,
    "markers": 7320,
    "phenotypes": 1,
    // all data files are tidy and gzipped (you can open directly in vim)
    "geno": "test/data/regression/21487_filter_geno.txt.gz",
    "pheno": "test/data/regression/21487_filter_pheno.txt.gz",
    "gmap": "21487_convert_gmap.txt.gz",
    "alleles": [
        "A",
        "B",
        "H"
    ],
    "genotypes": {
        "A": 0,
        "H": 1,
        "B": 2
    },
    "geno_sep": false,       // We don't have separators between genotypes
    "geno_transposed": true,
    "transformations": [     // keeps growing with every step
        {
            "type": "export",
            "original": "rqtl2",
            "format": "bimbam",
            "command": "./bin/gemma2 --overwrite -o test/data/regression/21487_convert convert --bimbam -g example/21487_BXD_geno.txt.gz -a example/BXD_snps.txt -p example/21487_BXDPublish_pheno.txt"
        },
        {
            "type": "filter",
            "pheno-NA": true,
            "maf": 0.01,
            "miss": 0.05,
            "command": "./bin/gemma2 --overwrite -o test/data/regression/21487_filter filter -c test/data/regression/21487_convert.json"
        }
    ],
    "na_strings": [ // na_strings is more consistent than na.strings above, also need it for creating a method
        "NA",
        "nan",
        "-"
    ],
    "name": "test/data/regression/21487_convert.json" // name of the underlying file
}

Note that the control file increasingly starts to look like a Monad because it passes state along with gemma2/lib steps.

10 Fix outstanding CI build

[2020-09-29 Tue] The GEMMA1 build on Travis was failing for some time. The problem is that GSL BLAS headers and OpenBLAS headers do not collaborate well. To fix it I stopped testing with libgslv1 (which is really old anyway and we need to use gslcblas.h over OpenBLAS cblas.h.

11 GEMMA testing frame work

[2020-09-27 Sun] So far, I developed without much of a testing frame work, mostly because I was trying stuff. For GEMMA2 it will be a combination of unit and doctests for small functionality, but we also need something to make sure overall results do not change without notice. For this I am introducing a regression test system that is similar to the one I developed in Ruby at the time. Note that gemma1 had a bash based system that I copied from sambamba. It was showing its limitations, better have something in Python. For this we are going to use pytest-regressions.

The first test is to convert BIMBAM to GEMMA2 format with

gemma2 --overwrite convert --bimbam -g example/21487_BXD_geno.txt.gz  -a example/BXD_snps.txt -p example/21487_BXDPublish_pheno.txt

which outputs 3 files

INFO:root:Writing GEMMA2/Rqtl2 marker/SNP result_gmap.txt.gz
INFO:root:Writing GEMMA2/Rqtl2 pheno result_pheno_bimbam.txt.gz
INFO:root:Writing GEMMA2/Rqtl2 geno result_geno_bimbam.txt.gz

pytest-reqressions writes its own files so I needed to use the low level interface for file comparisons. Also I'll need to unpack the .gz files for showing a diff. Progressing here.

12 GEMMA validate data

[2020-09-24 Thu] Validation is extremely important. People make mistakes and when the data is wrong GEMMA should complain. I did quite a bit of work there on gemma1 but it is still fairly limited. Full validation has to happen out-of-band because you don't want to impact speed for running pipelines.

Therefore, we validate the data when there is (almost) zero cost inline. But a full validation is a separate switch. Let's see what it brings out!

gemma2 validate -c control

One thing the maf filter does not do is check for the absolute number of informative genotypes (maf is a percentage!). Also gemma1 does not check whether the minor allele is actually the smaller one. We hit the problem immediately with

WARNING:root:Only one type of genotype Counter({'B': 9, 'A': 7, 'H': 1}) found in ['A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'H', 'B'] --- other similar counter warnings are ignored (rs3722740 file ./test_geno.txt.gz line 175)

It is clear that minor allele is actually the major allele. The implications are not large for gemma1, but the minor allele frequency (MAF) filter may not work properly. This is why validation is so important!

Unlike gemma1, gemma2 will figure out the minor allele dynamically. One reason is that R/qtl2 does the same and we are sharing the data format. It also allows a consistent use of genotype markers (e.g. B and D for the BXD). I added a validation step to make sure major allele genotypes vary across the dataset (one constant allele is suspect and may imply some other problem).

BIMBAM files, meanwhile, include the SNP variants. GeneNetwork kinda ignores them. I need to update the importer for that. Added a note. First, however, I want to speed up the GRM LOCO.

13 GEMMA running some data

[2020-09-14 Mon] For a dosage comparison and LOCO permutation run I extracted data from the 21481 set on GeneNetwork. First I needed to match the genotypes with phenotypes using below filter.

First create the Rqtl2 type dataset:

gemma2 -o 21481 convert --bimbam -g tmp/21481/BXD_geno.txt.gz -p tmp/21481/21481_BXDPublish_pheno.txt

Filtering

gemma2 -o output/21481-pheno -v filters -c 21481.json

Now the genotype file looks like

marker  7 10  11  12  38  39  42  54  60  65  67  68  70  71  73  77  81  91  92
rs31443144  AAABBABABAABBABBAAAA
rs6269442 AAABBABABAABBABBAAAA
rs32285189  AAABBABABAABBABBAAAA

Create the BIMBAM for gemma1

gemma2 -o 21481-pheno export --bimbam -c output/21481-pheno.json

and run the standard gemma commands. Note we can create the original dosage file by modifying the control file.

Running 1,000 permutations on 20 individuals and 7321 markers took

real    213m0.889s
user    2939m21.780s
sys     5423m37.356s

We have to improve on that!

14 GEMMA compute GRM

[2020-09-17 Thu] The first step is to compute the kinship matrix or GRM. We'll have different algorithms so we need a switch for method or –impl.

gemma2 --overwrite -o output/21487 convert --bimbam -g example/21487_BXD_geno.txt.gz -p example/21487_BXDPublish_pheno.txt
gemma2 --overwrite -o output/21487-filter filter -c output/21487.json

we can use the gemma1 implementation with

gemma2 grm --impl gemma1 -c output/21487-filter.json

or our new version

gemma2 grm -c output/21487-filter.json

[2020-09-26 Sat] Today I added the necessary filtering steps for the GRM listed in the next section. The maf filter is currently hard coded to make sure results match gemma1.

[2020-09-28 Mon] I finally got a matching K in Python compared to gemma1. Turns out that scaling is not the default option. Ha!

14.1 Calculating kinship

[2020-09-21 Mon] Gemma injects the mean genotype for a SNP into missing data fields (the mean over a SNP row). Next it subtracts the mean for every value in a row and if centering it scales

gsl_vector_scale(geno, 1.0 / sqrt(geno_var));

and finally over the full matrix a division over the number of SNPs

gsl_matrix_scale(matrix_kin, 1.0 / (double)ns_test)

Where ns_test is the number of SNPs included. SNPs get dropped in a MAF filter which I rewrote in a fast version in D. The GEMMA filter happens at reading of the Geno file.

So, essentially

  1. [X] Always apply the MAF filter when reading genotypes
  2. [X] Apply missiness filter

And when computing kinship

  1. [X] Always impute missing data (injecting the row mean)
  2. [X] Always subtract the row mean
  3. [X] Center the data by row (which is the NOT the default option -gk 1, gemma1 CenterMatrix)
  4. [X] Always scale the matrix dividing by # of SNPs (gemma1 ScaleMatrix)

See also R's scaling function.

Prasun's D code may be a bit more readable. And here is our older pylmm implementation which only scales K by dividing the number of SNPs.

  • [X] Check D implementation

Python's numpy treats NaN as follows:

>>> np.mean([1.0,2.0,3.0,0.0])
1.5
>>> np.mean([1.0,2.0,3.0,0.0,0.0])
1.2
>>> np.mean([1.0,2.0,3.0,0.0,0.0,np.NAN])
nan

which means we have to filter first.

x = np.array([1.0,2.0,3.0,0.0,0.0,np.NAN])
>>> np.mean(x[~numpy.isnan(x)])
1.2

that is better. Note the tilde. Python is syntactically a strange beast. Also I need to be careful about numpy types. They are easily converted to lists, for example.

  • [ ] And then there is this R2 filter I should check

r2 is simply the square of the sample correlation coefficient (i.e., r) between the observed outcomes and the observed predictor values. The manual says correlation with any covariate. By default, SNPs with r^2 correlation with any of the covariates above 0.9999 will not be included in the analysis. When I get to covariates I should include that.

14.2 Implementation

Starting with the MAF filter. It is used both for GRM and GWA. Gemma1 says

-miss     [num]           specify missingness threshold (default 0.05)
-maf      [num]           specify minor allele frequency threshold (default 0.01)
-notsnp                   minor allele frequency cutoff is not used

Note that using -notsnp the value of maf_level is set to -1.

With Gemma2 I want to make all filtering explicit. But what to do if someone forgets to filter? Or filters twice - which would lead to different results.

gemma2 filter -c data --maf 0.01

Creates a new dataset and control file. We can add the filtering state to the new control data structure with "maf": 0.05 which prevents a second run. If it is missing we should apply it by default in

gemma2 grm -c data

which will be the same as the single run

gemma2 filter -c data --maf 0.01 '=>' grm

(don't try this yet).

I wrote a MAF filter in Python which puts genotypes in classes and counts the minor alleles. Note that GEMMA1 does not a pure minor allele count because it counts heterozygous as 50%. At this point I am not making a clear distinction.

Another filter is missing data miss_level defaults to 0.05:

if ((double)n_miss / (double)ni_test > miss_level) pass...

Say we have 6 out of 100 missing it fails. This is rather strict.

15 GEMMA filtering data

[2020-09-01 Tue] With gemma1 people requested more transparent filtering. This is why I am making it a two-step process. First we filter on phenotypes:

15.1 Filter on phenotypes

The first filter is on phenotypes. When a phenotype is missing it should be removed from the kinship matrix and GWA. The R/qtl2 format is simply:

id     pheno1    pheno2
1       1.2        3.0
2       NA         3.1
(etc)

So, if we select pheno1 we need to drop id=2 because it is an NA. The filter also needs to update the genotypes. Here we filter using the 6th phenotype column:

gemma2 -o output/ms-filtered -v filters -c result.json -p 6

With the new phenotype filter I was able to create a new GRM based on a reduced genotype list (imported from BIMBAM) for a paper we are putting out.

16 GEMMA convert data

16.1 Convert PLINK to GEMMA2/Rqtl2 and BIMBAM

The plink .fam format has

  1. Family ID ('FID')
  2. Within-family ID ('IID'; cannot be '0')
  3. Within-family ID of father ('0' if father isn't in dataset)
  4. Within-family ID of mother ('0' if mother isn't in dataset)
  5. Sex code ('1' = male, '2' = female, '0' = unknown)
  6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)

The BIMBAM format just has columns of phenotypes and no headers(!)

For convenience, I am going to output BIMBAM so it can be fed directly into gemma1. Note Plink also outputs BIMBAM from plink files and VCF which is used by GEMMA users today.

[2020-09-06 Sun] To convert from plink to R/qtl2 we already have

gemma2 convert --plink example/mouse_hs1940

To convert from R/qtl2 to BIMBAM we can do

gemma2 convert --to-bimbam -c mouse_hs1940.json

which writes mouse_hs1940_pheno_bimbam.txt. Introducing a yield generator. I notice how Python is increasingly starting to look like Ruby. Yield was introduced in Python3 around 2012 - almost 20 years later than Ruby(!)

[2020-09-07 Mon] After some thought I decided to split convert into 'read' and 'write' commands. So now it becomes

gemma2 read --plink example/mouse_hs1940

To convert from R/qtl2 to BIMBAM we can do

gemma2 write --bimbam -c mouse_hs1940.json

I think that looks logical. On third thought I am making it

gemma2 convert --plink example/mouse_hs1940

To convert from R/qtl2 to BIMBAM we can do

gemma2 export --bimbam -c mouse_hs1940.json

It is important to come up with the right terms so it feels logical or predictable to users. Convert could be 'import', but Python does not allow that because it is a language keyword. And, in a way, I like 'convert' better. And 'export' is the irregular case that no one should really use. I could name it 'internal'. But hey.

Writing the BIMBAM genotype file it looks like

rs3683945, A, G, 1, 1, 0, 1, 0, 1, 1, ...

without a header. GEMMA does not actually use the allele values (A and G) and skips on spaces. So we can simplify it to

rs3683945 - - 1 1 0 1 0 1 1 ...

Using these new inputs (converted plink -> Rqtl2 -> BIMBAM) the computed cXX matrix is the same as the original PLINK we had.

gemma2 gemma1 -g mouse_hs1940_geno_bimbam.txt.gz -p mouse_hs1940_pheno_bimbam.txt   -gk -o mouse_hs1940

and same for the GWA

gemma2 gemma1 -g mouse_hs1940_geno_bimbam.txt.gz -p mouse_hs1940_pheno_bimbam.txt -n 1 -a ./example/mouse_hs1940.anno.txt -k ./output/mouse_hs1940.cXX.txt -lmm -o mouse_hs1940_CD8_lmm-new
==> output/mouse_hs1940_CD8_lmm-new.assoc.txt <==
chr     rs      ps      n_miss  allele1 allele0 af      beta    se      logl_H1 l_remle p_wald
1       rs3683945       3197400 0       -       -       0.443   -7.882975e-02   6.186788e-02    -1.581876e+03   4.332964e+00   2.028160e-01
1       rs3707673       3407393 0       -       -       0.443   -6.566974e-02   6.211343e-02    -1.582125e+03   4.330318e+00   2.905765e-01

==> output/result.assoc.txt <==
chr     rs      ps      n_miss  allele1 allele0 af      beta    se      logl_H1 l_remle p_wald
1       rs3683945       3197400 0       A       G       0.443   -7.882975e-02   6.186788e-02    -1.581876e+03   4.332964e+00   2.028160e-01
1       rs3707673       3407393 0       G       A       0.443   -6.566974e-02   6.211343e-02    -1.582125e+03   4.330318e+00   2.905765e-01

The BIMBAM version differs because the BIMBAM file in ./example differs slightly from the PLINK version (thanks Xiang, keeps me on my toes!). Minor differences exist because some values, such as 0.863, have been changed for the genotypes. For a faithful and lossless computation of that BIMBAM file we'll need to support those too. But that will come when we start importing BIMBAM files. I'll make a note of that.

16.2 Convert BIMBAM to GEMMA2/Rqtl2

[2020-09-08 Tue] From above we can also parse BIMBAM rather than export. It will need both geno and pheno files to create the GEMMA2/Rqtl2 format:

gemma2 convert --bimbam -g example/mouse_hs1940.geno.txt.gz -p example/mouse_hs1940.pheno.txt

Problem: BIMBAM files can contain any value while the Rqtl2 genotype file appears to be limited to alleles. Karl has a frequency format, but that uses some fancy binary 'standard'. I'll just use the genotypes now because GeneNetwork uses those too. Translating back and forth. BIMBAM

rs31443144, X, Y, 1, 1, 0, 0, 0, ... 0, 1, 0.5, 0.5

becomes Rqtl2

rs31443144

and then back to

rs31443144, - - 1 1 0 0 0 ... 0 1 2 2

For processing with GEMMAv1. Funny. Actually it should be comma delimited to align with PLINK output.

After some hacking

gemma2 convert --bimbam -g example/BXD_geno.txt.gz -p example/BXD_pheno.txt
gemma2 export -c result.json
gemma -gk -g result_geno_bimbam.txt.gz -p result_pheno.tsv
FAILED: Parsing input file 'result_geno_bimbam.txt.gz' failed in function ReadFile_geno in src/gemma_io.cpp at line 743

is still not happy. When looking at the code it fails to get (enough) genotypes. Also interesting is the hard coded in GEMMA1

geno = atof(ch_ptr);
if (geno >= 0 && geno <= 0.5) {
  n_0++;
}
if (geno > 0.5 && geno < 1.5) {
  n_1++;
}
if (geno >= 1.5 && geno <= 2.0) {
  n_2++;
}

GEMMA1 assumes Minor allele homozygous: 2.0; major: 0.0 for BIMBAM and 1.0 is H. The docs say BIMBAM format is particularly useful for imputed genotypes, as PLINK codes genotypes using 0/1/2, while BIMBAM can accommodate any real values between 0 and 2 (and any real values if paired with -notsnp option which sets cPar.maf_level = -1).

The first column is SNP id, the second and third columns are allele types with minor allele first, and the remaining columns are the posterior/imputed mean genotypes of different individuals numbered between 0 and 2. An example mean genotype file with two SNPs and three individuals is as follows:

rs1, A, T, 0.02, 0.80, 1.50
rs2, G, C, 0.98, 0.04, 1.00

GEMMA codes alleles exactly as provided in the mean genotype file, and ignores the allele types in the second and third columns. Therefore, the minor allele is the effect allele only if one codes minor allele as 1 and major allele as 0.

BIMBAM mode is described in its own manual. The posterior mean value is the minor allele dosage. This means the encoding should be

rs31443144,-,-,2,2,0,0,0,...,0,2,1,1

but GeneNetwork uses

rs31443144, X, Y, 1, 1, 0, 0, 0, ... 0, 1, 0.5, 0.5

which leads to a different GRM. Turns out the genotypes GeneNetwork is using is wrong. It will probably not be a huge difference because the dosage is just scaled. I'll test it on a live dataset - a job that needs to be done anyway.

[2020-09-18 Fri] GEMMA also has an annotation file for SNPs

rs31443144  3010274 1
rs6269442 3492195 1
rs32285189  3511204 1
rs258367496 3659804 1
etc.

Rqtl2 has a similar gmap file in tidy format:

marker,chr,pos
rs13475697,1,1.6449
rs3681603,1,1.6449001
rs13475703,1,1.73685561994571
rs13475710,1,2.57549035621086
rs6367205,1,2.85294211007162
rs13475716,1,2.85294221007162
etc.

entered as "gmap" in the control file. I suppose it is OK to use our chromosome positions there. I'll need to update the converter for BIMBAM and PLINK. The PLINK version for GEMMA looks like

rs3668922 111771071 13  65.0648
rs13480515  17261714  10  4.72355
rs13483034  53249416  17  30.175
rs4184231 48293994  16  33.7747
rs3687346 12815936  14  2.45302

so both positions are included, but no header. All different in other words! Rqtl2 also has a "pmap" file which is the physical mapping distance.

17 GEMMA GRM/K compute

[2020-08-31 Mon] GEMMA1 computes K fairly inefficiently. First of all it reads the large genotype file twice (once for K, once for GWA). Next it tries to optimize for memory consumption by reading the file in blocks and partially computing the multiplication matrix. This saves RAM, but should not be default behaviour (for one, OpenBLAS' threading is not really leveraged). Using gemma1 we can compute K with

time gemma -g ./example/mouse_hs1940.geno.txt.gz -p ./example/mouse_hs1940.pheno.txt -gk -o mouse_hs1940
GEMMA 0.98.2 (2020-05-28) by Xiang Zhou and team (C) 2012-2020
Reading Files ...
## number of total individuals = 1940
## number of analyzed individuals = 1410
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var        =    12226
## number of analyzed SNPs         =    10768
Calculating Relatedness Matrix ...
================================================== 100%
 real    0m7.545s
 user    0m14.468s
 sys     0m1.037s

Now the GRM output file (mousehs1940.cXX.txt) is pretty large at 54Mb and we keep a lot of those cached in GeneNetwork. It contains a matrix 1940x1940 of textual numbers

0.3350589588    -0.02272259412  0.0103535287    0.00838433365   0.04439930169   -0.01604468771  0.08336199305
-0.02272259412  0.3035959579    -0.02537616406  0.003454557308  ...

The gzip version is half that size. We can probably do beter storing 4-byte floats and only storing half the matrix (it is symmetrical after all) followed by compression.

Anyway, first thing to do is read R/qtl2 style files into gemma2 because that is our preferred format.

To pass in information we now use the control file defined below

gemma2 grm --control mouse_hs1940.json

[2020-09-01 Tue] I am reusing the code I wrote for pylmm five years ago(!) First results show numpy dot product outperforming gemma1 and blas today for this size dataset (I am not trying CUDA yet). Next stage is filtering and centering the GRM.

18 GEMMA with python-click and python-pandas-plink

[2020-08-28 Fri] This week I got the argument parsing going for gemma2 and the logic for running subcommands. It was a bit of a struggle: having the feeling I would be better off writing argument parsing from scratch. But today I added a parameter for finding the gemma1 binary (a command line switch and an environment variable with help in a one-liner). That was pleasingly quick and powerful. Python click is powerful and covers most use cases.

In the next step I wanted to convert PLINK files to GEMMA2/Rqtl2 formats. A useful tool to have for debugging anyhow. I decided to try pandas-plink and added that to GEMMA2 (via a Guix package). Reading the mouse data:

>>> G = read_plink1_bin("./example/mouse_hs1940.bed", "./example/mouse_hs1940.bim", "./example/mouse_hs1940.fam", ver>
>>> print(G)
<xarray.DataArray 'genotype' (sample: 1940, variant: 12226)>
dask.array<transpose, shape=(1940, 12226), dtype=float64, chunksize=(1024, 1024), chunktype=numpy.ndarray>
Coordinates:
  * sample   (sample) object '0.224991591484104' '-0.97454252753557' ... nan nan
  * variant  (variant) object '1_rs3683945' '1_rs3707673' ... '19_rs6193060'
    fid      (sample) <U21 '0.224991591484104' '-0.97454252753557' ... 'nan'
    iid      (sample) <U21 '0.224991591484104' '-0.97454252753557' ... 'nan'
    father   (sample) <U32 'nan' 'nan' 'nan' 'nan' ... 'nan' 'nan' 'nan' 'nan'
    mother   (sample) <U3 '1' '0' '1' 'nan' 'nan' ... 'nan' '0' '1' 'nan' 'nan'
    gender   (sample) <U32 'nan' 'nan' 'nan' 'nan' ... 'nan' 'nan' 'nan' 'nan'
    trait    (sample) float64 -0.2854 -2.334 0.04682 nan ... -0.09613 1.595 0.72
    chrom    (variant) <U2 '1' '1' '1' '1' '1' '1' ... '19' '19' '19' '19' '19'
    snp      (variant) <U18 'rs3683945' 'rs3707673' ... 'rs6193060'
    cm       (variant) float64 0.0 0.1 0.1175 0.1358 ... 53.96 54.02 54.06 54.07
    pos      (variant) int64 3197400 3407393 3492195 3580634 ... -9 -9 61221468
    a0       (variant) <U1 'A' 'G' 'A' 'A' 'G' 'A' ... 'G' 'G' 'C' 'G' 'G' 'A'
    a1       (variant) <U1 'G' 'A' 'G' 'G' 'G' 'C' ... 'A' 'A' 'G' 'A' 'A' 'G'

Very nice! With support for an embedded GRM we can also export. Note that this tool is not lazy, so for very large sets we may need to write some streaming code.

The first step is to create an R/qtl2 style genotype file to compute the GRM. The BIMBAM version for a marker looks like

rs3683945, A, G, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 2, 1, 1, 0, 1, 1, 1, 1, 2, 1, 0, 2, etc
rs3707673, G, A, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 2, 1, 1, 0, 1, 1, 1, 1, 2, 1, 0, 2,
rs6269442, A, G, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 2, 1, 0, 0, 1, 1, 1, 1, 2, 0, 0, 2,
(...)

reflecting row a0 and a1 vertically in the BED file.

===> BED (binary format)
[[1. 1. 2. 1. 2. 1. 1. 1. 1. 2. 2. 1. 2. 1. 1. 1. 1. 2. 0. 1. 1. 2. 1. 1.
  1. 1. 0. 1. 2. 0. 1. 0. 1. 2. 0. 2. 1. 1. 1. 1.
 [1. 1. 2. 1. 2. 1. 1. 1. 1. 2. 2. 1. 2. 1. 1. 1. 1. 2. 0. 1. 1. 2. 1. 1.
  1. 1. 0. 1. 2. 0. 1. 0. 1. 2. 0. 2. 1. 1. 1. 1.
 [1. 2. 2. 1. 2. 1. 1. 2. 2. 2. 2. 1. 2. 1. 1. 1. 2. 2. 0. 1. 2. 2. 1. 1.
  1. 1. 0. 2. 2. 0. 2. 0. 1. 2. 0. 2. 1. 1. 1. 1.
(...)
]]

So you can see 2 and 0 values switched meaning H.

The equivalent R/qtl2 can be the human readable

"genotypes": {"A":0, "B":1, "H": 2}

For the genotype file we'll go marker by individual (transposed) because that is what we can stream in GWA. The tentative 'control' file (mirroring recla.json) to start with:

{
    "description": "mouse_hs1940",
    "crosstype": "hs",
    "individuals": 1940,
    "markers": 12226,
    "phenotypes": 7,
    "geno": "mouse_hs1940_geno.tsv",
    "alleles": [
        "A",
        "B",
        "H"
    ],
    "genotypes": {
        "A": 1,
        "H": 2,
        "B": 3
    },
    "geno_transposed": true
}

Where cross-type "HS" should probably act similar to "DO". We'll use some (hopefully) sane defaults, such as a tab for separator and '-' and NA for missing data. Comments have '#' on the first position. We add number of individuals, phenotypes and markers/SNPs for easier processing and validation when parsing.

Now the GEMMA2 genotype file should become

marker, 1, 2, 3, 4, ...
rs3683945 B B A B A B B B B A A B A B B B B A H B B A B B B B H B A H etc
rs3707673 B B A B A B B B B A A B A B B B B A H B B A B B B B H B A H
rs6269442 B A A B A B B A A A A B A B B B A A H B A A B B B B H A A H
(...)

[2020-08-31 Mon] To the control file I added "geno_compact": true which makes it the more compact

marker,1,2,3,4,...
rs3683945 BBABABBBBAABABBBBAHBBABBBBHBAH etc
rs3707673 BBABABBBBAABABBBBAHBBABBBBHBAH
rs6269442 BAABABBAAAABABBBAAHBAABBBBHAAH
(...)

hope R/qtl2 will support this too. Next step is compression

The uncompressed space version:

-rw-r--r-- 1 wrk users  46M Aug 31 08:25 mouse_hs1940_geno.tsv

with gzip compresses to

-rw-r--r-- 1 wrk users 3.3M Aug 31 07:58 mouse_hs1940_geno.tsv.gz

while the compact version is half the size and compresses better too

-rw-r--r-- 1 wrk users 2.1M Aug 31 08:30 mouse_hs1940_geno.tsv.gz

The bz2 version is a little smaller but a lot slower. lz4 has a larger file

-rw-r--r-- 1 wrk users 5.4M Aug 31 08:37 mouse_hs1940_geno.tsv.lz4

but it is extremely fast. For now we'll just go for smaller files (these genotype files can be huge). gzip support is standard in Python3. Compressing the files from inside Python appeared slow with the default maximum compression. Reducing the level a bit it is comparable to the original writer and the file is 20x smaller. It is also 3x smaller than the original PLINK version (that is supposedly optimally compressed).

The control file probably has enough showing the compressed file name extension. Now it looks like

{
    "description": "mouse_hs1940",
    "crosstype": "hs",
    "sep": "\t",
    "individuals": 1940,
    "markers": 12226,
    "phenotypes": 7,
    "geno": "mouse_hs1940_geno.tsv.gz",
    "alleles": [
        "A",
        "B",
        "H"
    ],
    "genotypes": {
        "A": 1,
        "H": 2,
        "B": 3
    },
    "geno_sep": false,
    "geno_transposed": true
}

Next step is generating the GRM!

19 Building GEMMA

[2020-08-17 Mon] Gemma2 is going to act like a modular tool that can invoke many backends. Examples are to compute a GRM using R/qtl2, native GEMMA or some CUDA implementation. The command line will have to deal with all these variations too in an elegant way.

For deployment we'll use GNU Guix and Docker containers as described below. Gemma2 is going to be oblivious about how deployment hangs together because I think it is going to be an ecosystem on its own.

I am looking into command line parsing for gemma2. The requirement is reasonably sophisticated parameter checking that can grow over time. First of all I'll introduce 'commands' such as

gemma grm
gemma gwa

which allows for splitting option sets. Better readable too. Also I want to be able to inline 'pipe' commands:

gemma grm => gwa

which will reuse data stored in RAM to speed things up. You can imagine something like

gemma filter => grm --loco => gwa

as a single run. Now bash won't allow for this pipe operator so we may support a syntax like

gemma filter '=>' grm --loco '=>' gwa

or

gemma filter % grm --loco % gwa

Note that these 'pipes' will not be random workflows. It is just a CLI convenience notation that visually looks composable.

Also the current switches should be supported and gemma2 will drop to gemma version 1 if it does not understand a switch. For example

gemma -g /example/mouse_hs1940.geno.txt.gz -p mouse_hs1940.pheno.txt -a mouse_hs1940.anno.txt -gk -no-check

should still work, simply because current workflows expect that. The python click package looks like it can do this. What is tricky is that we want to check all parameters before the software runs. For now, the grouping works OK - you can chain commands, e.g.

@click.group(invoke_without_command=True)
@click.pass_context
def gemma2(context):
  if not context.invoked_subcommand:
      click.echo("** Call gemma1")

@click.command()
def grm():
    click.echo('** Kinship/Genetic Relationship Matrix (GRM) command')
    if second:
        gemma2(second)

@click.command()
# @click.argument('verbose', default=False)
def gwa():
    click.echo('** Genome-wide Association (GWA)')

gemma2.add_command(grm)
gemma2.add_command(gwa)

gemma2()

which allows

gemma grm ... -> calls into grm and gwa
gemma gwa ...
gemma (no options)

Not perfect because parameters are checked just before the actual chained command runs. But good enough for now. See also the group tutorial.

[2020-08-16 Sun] Performance metrics before doing a new release show openblas has gotten a bit faster for GEMMA. I also tried a build of recent openblas-git-0.3.10 with the usual tweaks. It is really nice to see how much effort is going into OpenBLAS development. The gains we get for free.

As people have trouble building GEMMA on MacOS (and I don't have a Mac) I released a test Docker container that can be run as

lario:/home/wrk/iwrk/opensource/code/genetics/gemma# time docker run -v `pwd`/example:/example 2e82532c7440 gemma -g /example/mouse_hs1940.geno.txt.gz -p /example/mouse_hs1940.pheno.txt -a /example/mouse_hs1940.anno.txt -gk -no-check
GEMMA 0.98.2 (2020-05-28) by Xiang Zhou and team (C) 2012-2020
Reading Files ...
## number of total individuals = 1940
## number of analyzed individuals = 1410
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var        =    12226
## number of analyzed SNPs         =    10768
Calculating Relatedness Matrix ...
================================================== 100%

real    0m8.161s
user    0m0.033s
sys     0m0.023s

Note that the local directory is mounted with the -v switch.

[2020-08-10 Mon] Last weeks got hijacked by grant writing, letters and failing machines. Back to GEMMA now. I was considering moving the development environment to our 8-core EPYC server, but it builds and runs actually slightly slower than my desktop (an 4-core Intel Nuc6i7KYK) which shows how IO bound we (still) are.

20 Starting on GEMMA2

[2020-07-27 Mon] A lot happened in the last month. Not least creating http://covid19.genenetwork.org which is getting quite a bit of attention.

Today is the official kick-off day for new GEMMA development! The coming months I am taking a coding sabbatical. Below you can read I was initially opting for D, but the last months we have increasingly invested in Rust and it looks like new GEMMA will be written in Python + Rust.

First, why Rust? Mostly because our work is showing it forces coders to work harder at getting things correct. Not having a GC can look like a step backward, but when binding to other languages it actually is handy. With Rust you know where you are - no GC kicking in and doing (black) magic. Anyway, I am going to give it a shot. I can revert later and D is not completely ruled out.

Now, why Python instead of Racket? Racket is vastly superior to Python in my opinion. Unfortunately, Python has the momentum and if I write the front-end in Racket it means no one will contribute. So, this is pragmatism to the Nth degree. We want GEMMA2/gemmalib to be useful to a wider community for experimentation. It will be a toolbox rather than an end-product. For example, I want to be able to swap in/out modules that present different algorithms. In the end it will be a toolbox for mapping phenotypes to genotypes.

I don't think we'll regret a choice for Python+Rust. Both languages are complementary and have amazing community support, both in terms of size and activity. Having Python as a front-end implies that is should be fairly trivial to bind the Rust back-end to other languages, such as Racket, R and Ruby. It will happen if we document it well enough.

One feature of new GEMMA2 will be that it actually can run GEMMA1. I am going to create a new development repository that can call into GEMMA1 transparently if functionality does not exist. This means the same command line parameters should work. GEMMA2 will fall back to GEMMA1 with a warning if it does not understand parameters. GEMMA2 specific parameters are a different set.

[2020-03-22 Sun] With Corona virus hitting us hard I am spending my time at home, hacking. For an introvert this is the normal. We are organising the COVID-19 hackathon - which will usurp quite a bit of my time. But that is OK, if we can scale up our precision medicine goals in such a way. Better than hacking it alone for sure! Other than that there are some papers to complete and perhaps one or two grant applications. There is always enough to do - fact of science.

I promised GeneNetwork that I would implement precompute for all GN mappings for GEMMA. I think that is a great idea. It is also an opportunity to clean up GEMMA. Essentially a rewrite where GEMMA becomes more of a library which can be called from any other language. That will also make it easier to optimise GEMMA for certain architectures. It is interesting to note that despite the neglect GEMMA is getting and its poor runtime performance it is still a surprisingly popular tool. The implementation, apparently, still rocks!

So, let's start with GEMMA version 2 (GEMMA2). What is GEMMA2? GEMMA2 is a fast implementation of GEMMA1 in D. Why not Rust? You may ask. Well, Rust is a consideration and we can still port, but D is close to idiomatic C++ which means existing GEMMA code is relatively easy to convert. I had been doing some of that already with Prasun, with faster-lmm-d. That project taught us a lot though we never really got it to replace GEMMA. Next to D I will be using Racket to create the library bindings. Racket is a Lisp and with a good FFI it should be easy to port that to (say) Python or Ruby. So, in short: Front-end Racket, back-end D. Though there may be variations.

21 Porting GeneNetwork1 to GNU Guix

[2020-01-31 Fri] GeneNetwork1 http://gn1.genenetwork.org/ is a legacy Python web application which was running on a 10 year old CentOS server. It depends on an older version of Python, mod-python and other obsolete modules. We decided to package it in GNU Guix because Guix gives full control over the dependency graph. Also GNU Guix has features like timemachine and containers which allows us to make snapshot of a deployment graph in time and serve different versions of releases.

The first step to package GN1 with the older packages was executed by Efraim. Also he created a container to run Apache, mod-python and GN1. Only problem is that mod-python in the container did not appear to be working.

22 Chasing that elusive sambamba bug (FIXED!)

[2019-11-24 Sun] With an upgrade of the LLVM D-compiler the garbage collector (GC) started bailing out on sambamba (ldc>1.10). Some progress on digging was tracked on github. Locating the bug was fairly easy - triggered by decompressBgzfBlock - and I manage to figure out it had to do with the copying of a thread object that messes up the stack. Fixing it, however is a different story! It took a while to get a minimal program to reproduce the problem. It looks like

BamReader bam;
auto fn = buildPath(dirName(__FILE__), "data", "ex1_header.bam");
foreach (i; 0 .. 60)
{
  bam = new BamReader(fn);
  assert(bam.header.format_version == "1.3");
}

which segfaults most of the time, but not always, with a stack trace

#0  0x00007ffff78b98e2 in invariant._d_invariant(Object) () from /usr/lib/x86_64-linux-gnu/libdruntime-ldc-debug-shared.so.87
#1  0x00007ffff7d4711e in std.parallelism.TaskPool.queueLock() () from /usr/lib/x86_64-linux-gnu/libphobos2-ldc-debug-shared.so.87
#2  0x00007ffff7d479b9 in _D3std11parallelism8TaskPool10deleteItemMFPSQBqQBp12AbstractTaskZb ()
   from /usr/lib/x86_64-linux-gnu/libphobos2-ldc-debug-shared.so.87
#3  0x00007ffff7d4791d in _D3std11parallelism8TaskPool16tryDeleteExecuteMFPSQBwQBv12AbstractTaskZv ()
#4  0x00005555557125c5 in _D3std11parallelism__T4TaskS_D3bio4core4bgzf5block19decompressBgzfBlockFSQBrQBqQBoQBm9BgzfBlockCQCoQCn5utils7memoize__T5CacheTQCcTSQDxQDwQDuQDs21DecompressedBgzfBlockZQBwZQBpTQDzTQDgZQGf10yieldForceMFNcNdNeZQCz (this=0x0)
#6  0x00007ffff78a56fc in object.TypeInfo_Struct.destroy(void*) const ()
#7  0x00007ffff78bc80a in rt_finalizeFromGC () from /usr/lib/x86_64-linux-gnu/libdruntime-ldc-debug-shared.so.87
#8  0x00007ffff7899f6b in _D2gc4impl12conservativeQw3Gcx5sweepMFNbZm ()
#9  0x00007ffff78946d2 in _D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm ()
#10  0x00007ffff78969fc in _D2gc4impl12conservativeQw3Gcx8bigAllocMFNbmKmkxC8TypeInfoZPv ()
#11 0x00007ffff78917c3 in _D2gc4impl12conservativeQw3Gcx5allocMFNbmKmkxC8TypeInfoZPv ()
(...)
    task_pool=0x7ffff736f000) at reader.d:130
#29 0x000055555575e445 in _D3bio3std3hts3bam6reader9BamReader6__ctorMFAyaZCQBvQBuQBtQBsQBrQBn (warning: (Internal error: pc 0x55555575e444 in read in psymtab, but not in symtab.)
    at reader.d:135
#30 0x00005555557bff0b in D main () at read_bam_file.d:47

where reader.d:130 adds a BamReader to the task pool. It is clear the GC kicks in and we end up with this mess.

Line #4 contains
  std.parallelism.TaskS
  bio.core.bgzf.block.decompressBgzfBlock
  utils.memoize.Cache-DecompressedBgzfBlock-yieldForce

yieldForce executes a task in the current thread which is coming from a cache:

alias Cache!(BgzfBlock, DecompressedBgzfBlock) BgzfBlockCache;

and Cache is part of BioD. One trick aspect of Sambamba is that the design is intricate. In our crashing example we only use the simple BamReader wich is defined in std.hts.bam.reader.d. We are using a default taskpool. In reader.d not much happens - it is almost all simple plumbing. std.hts.bam.read.d, meanwhile, represents the BAM format.

The Bgzf block processing happens in bio.core.bgzf.inputstream. The BamReader uses BgzfInputStream which has functions fillNextBlock, setupReadBuffer. The constructor sets up a RoundBuf!BlockAux(n_tasks). When I set n_tasks to a small number it no longer crashes!? The buffer

_task_buf = uninitializedArray!(DecompressionTask[])(n_tasks);

is a critical piece. Even increasing dim to n_tasks+2 is enough to remove most segfaults, but not all.

Remember it is defined as

alias Task!(decompressBgzfBlock, BgzfBlock, BgzfBlockCache) DecompressionTask;
DecompressionTask[] _task_buf;

with dimension n_tasks. Meanwhile BlockAux

static struct BlockAux {
  BgzfBlock block;
  ushort skip_start;
  ushort skip_end;

  DecompressionTask* task;
  alias task this;
}

Injecting code

struct DecompressedBgzfBlock {
  ~this() {
    stderr.writeln("destroy DecompressedBgzfBlock ",start_offset,":",end_offset," ",decompressed_data.sizeof);
  };

  ulong start_offset;
  ulong end_offset;
  ubyte[] decompressed_data;
}

It is interesting to see that even when not segfaulting the block offsets look corrupted:

destroy DecompressedBgzfBlock 0:0 16
destroy DecompressedBgzfBlock 0:0 16
destroy DecompressedBgzfBlock 0:0 16
destroy DecompressedBgzfBlock 89554:139746509800748 16
destroy DecompressedBgzfBlock 140728898420736:139748327903664 16
destroy DecompressedBgzfBlock 107263:124653 16
destroy DecompressedBgzfBlock 89554:107263 16
destroy DecompressedBgzfBlock 71846:89554 16
destroy DecompressedBgzfBlock 54493:71846 16
destroy DecompressedBgzfBlock 36489:54493 16
destroy DecompressedBgzfBlock 18299:36489 16
destroy DecompressedBgzfBlock 104:18299 16
destroy DecompressedBgzfBlock 0:104 16

and I am particularly suspicious about this piece of code in inputstream.d where task gets allocated and the resulting buffer gets copied to the roundbuffer. This is a hack, no doubt about it:

DecompressionTask tmp = void;
tmp = scopedTask!decompressBgzfBlock(b.block, _cache);
auto t = _task_buf.ptr + _offset / _max_block_size;
import core.stdc.string : memcpy;
memcpy(t, &tmp, DecompressionTask.sizeof);
b.task = t;
_tasks.put(b);
_pool.put(b.task);

and probably the reason why decompressBgzfBlock gets corrupted, followed by sending the GC in a tail spin when it kicks in. Artem obviously designed it this way to prevent allocating memory for the task, but I think he went a little too far here! One thing I tried earlier, I have to try again which is get rid of that copying.

First of all

alias Task!(decompressBgzfBlock, BgzfBlock, BgzfBlockCache) DecompressionTask;

defines DecompressionTask as calling decompressBgzfBlock with parameters which returns a DecompressedBgzfBlock. Remember it bails out with this block. There is something else that is notable, it is actually the cached version that bails out.

Removing the cache code makes it run more reliable. But not completely. Also we are getting memory errors now:

destroy DecompressedBgzfBlock 4294967306:0 16
core.exception.InvalidMemoryOperationError@core/exception.d(702): Invalid memory operation

but that leaves no stack trace. Now we get

std.parallelism.Task_bio.core.bgzf.block.decompressBgzfBlock-DecompressedBgzfBlock-yieldForce

so the caching itself is not the problem. In the next phase we are going to address that dodgy memory copying by introducing a task managed by GC instead of using the ScopedTask.

This is all happening in BgzfInputStream in inputstream.d used by reader.d which inherits from Stream. BamReaders uses that functionality to iterate through the reads usings D popFront() design. Streams allow reading a variable based on its type, e.g. a BAM read. The BamReader fetches the necessary data from BgzfBlockSupplier with getNextBgzfBlock which is used as the _bam variable. BamReader.reader itself returns an iterator.

It is interesting to note how the OOP design obfuscates what is going on. It is also clear that I have to fix BgzfInputStream in inputstream.d because it handles the tasks in the roundbuffer.

Part of sambamba's complexity is due to OOP and to having the threadpool running at the lowest level (unpacking bgzf). If I remove the threadpool there it means that threading will have to happen at a higher level. I.e., sambamba gets all its performance from multithreaded low level unpacking of data blocks. It is unusual, but it does have the (potential) advantage of leaving higher level code simple. I note with sambamba sort, however, Artem injected threads there too which begs the question what happens when you add different tasks to the same pool that have different timing characteristics. Be interesting to see the effect of using two task pools.

[2019-11-27 Wed] Paying some closer attention to block.d again, BgzfBlock is defined as a struct containing a _buffer defined as public ubyte[] _buffer = void; and is used in (indeed) block.d and inputstream.d only. The use of struct means that BgzfBlock gets allocated on the stack. Meanwhile _buffer get pointed into the uncompressed buffer which in turn is is a slice of uncompressed_buf that is also on the stack in decompressBgzfBlock (surprise, surprise) and gets assigned right before returning

block._buffer[0 .. block.input_size] = uncompressed[];

now, the cached version (which I disabled) actually does a copy to the heap

BgzfBlock compressed_bgzf_block = block;
compressed_bgzf_block._buffer = block._buffer.dup;
DecompressedBgzfBlock decompressed_bgzf_block;
with (decompressed_bgzf_block) {
    start_offset = block.start_offset;
    end_offset = block.end_offset;
    decompressed_data = uncompressed[].dup;
}
cache.put(compressed_bgzf_block, decompressed_bgzf_block);

Artem added a comment

/// A buffer is used to reduce number of allocations.
///
/// Its size is max(cdata_size, input_size)
/// Initially, it contains compressed data, but is rewritten
/// during decompressBgzfBlock -- indeed, who cares about
/// compressed data after it has been uncompressed?

Well, maybe the GC does! Or maybe the result does not fit the same buffer. Hmmm. If you go by the values

destroy DecompressedBgzfBlock 140728898420736:139748327903664 16

you can see the BgzfBlock is compromised by a stack overwrite. Changing the struct to a class fixes the offsets

destroy DecompressedBgzfBlock 104:18299 16
destroy DecompressedBgzfBlock 18299:36489 16
destroy DecompressedBgzfBlock 36489:54493 16

so things start to look normal. But it still segfaults on DecompressedBgzfBlock on GC in yieldForce.

std.parallelism.Task_bio.core.bgzf.block.decompressBgzfBlock-DecompressedBgzfBlock-yieldForce

[2019-11-28 Thu] So, the short of it is that when a thread kicks in with decompressBgzfBlock the underpinning data structure is corrupt. The problem has to be with struct BgzfBlock, struct BgzfBlockAux and the Roundbuf. Both BgzfBlockAux goes on a Roundbuf. I was thinking last night that there may be a struct size problem. The number of tasks in the roundbuffer should track the number of threads. Increasing the size of the roundbuffer makes a crash take longer.

Well I hit the jackpot! After disabling

_task_buf = uninitializedArray!(DecompressionTask[])(n_tasks);

there are no more segfaults. I should have looked at that more closely. I can only surmise that because the contained objects contain pointers (they do) the GC gets confused because it occassionaly finds something that looks like a valid pointer. Using uninitializedArray on an object that includes a pointer reference is dangerous.

Success!! That must have take me at least three days of work to find this bug, one of the most elusive bugs I have encountered. Annoyingly I was so close earlier when I expanded the size of n_tasks! The segfault got triggered by a new implementation of D's garbage collector. Pointer space is tricky and it shows how careful we have to be with non-initialized data.

Now what is the effect of disabling the cache and making more use of garbage collected structures (for each Bgzf block)? User time went down and CPU usage too, but wall clock time nudged up. Memory use also went up by 25%. The garbage collector kicked in twice as often. This shows Artem's aggressive avoidance of the garbage collector does have impact and I'll have to revert on some of my changes now. Note that releasing the current version should be OK, the performance difference does add to overall time and energy use. Even 10% of emissions is worth saving with tools run at this scale.

So, I started reverting on changes and after reverting two items:

- Sambamba
  - [-] speed test
    + [X] revert on class DecompressedBgzfBlock to struct
    + [X] revert on auto buf2 = (block._buffer[0 .. block.input_size]).dup;
    + [ ] revert on DecompressionTask[] _task_buf;
    + [ ] revert on tmp = scopedTask!decompressBgzfBlock(b.block)
    + [ ] revert on Cache

Sambamba is very similar to the last release. The new release is very slightly slower but uses less RAM. I decided not to revert on using the roundbuf, scopedTask and Cache because each of these introduces complexity with no obvious gain. v0.7.1 will be released after I update release notes and final tests.

23 It has been almost a year! And a new job.

[2019-07-13 Sat] A lot can happen in a year. After my last BLOG on the GEMMA release I got a job offer by the University of Tennessee and we moved in March to the USA. The coming years I am writing software and some grant proposals. Focus is still on GeneNetwork, genetic mapping and genomics. Some things changed, but I am working on GEMMA, sambamba, the variant graph and all that… One major shift is from Python to Racket (Scheme, a Lisp). I never liked Python that much (coming from Ruby and Perl earlier), but now I had the excuse to ditch Python. Racket rocks in so many ways! Others have written about that, so I am not going to repeat, but main thing is that I finally found a language that I can use for another 20 years, after trying many alternatives including Scala, Elixir, Erlang etc. Ironically, Lisp is one of the oldest language in use today. Go figure. For performance code I have D and we may replace some of that with Rust. It looks like it is Racket and D and perhaps Rust for me. I have come home to those choices after all these years.

I am writing a REST service which needs to maintain some state. The built-in Racket server has continuations - which is rather nice! Racket also has support for Redis, SQLite (nice example) and a simple key-value interface (ini-style) which I can use for sessions.

Ironically, two weeks after writing above I was hit by a car on my bicycle in Memphis. 14 fractures and a ripped AC joint. It took surgery and four months to feel halfway normal again…

24 Speeding up K

GEMMMA has been released with many bug fixes and speedups. Now it is time to focus on further speedups with K (and GWA after). There are several routes to try this. One possibility is to write our own matrix multiplication routine in D. My hunch is that because computing K is a special case we can get some significant speedups compared to a standard matrix multiplication for larger matrices. Directions we can go are:

  • Stop using BLAS and create multi-core dot-product based multiplication that makes use of
    • multi threaded decompression
    • start compute while reading data
    • use aligned rows only for dot product (less CPU cache misses)
    • compute half the result (K is symmetric!)
    • heavy use of threading
    • use AVX2 optimizations
    • use floats instead of doubles (we can do this with hardware checking)
    • chromosome-based computations (see below for LOCO)
  • Use tensor routines to reduce RAM IO

In other words, quite a few possible ways of improving things. It may be that the current BLAS routine is impossible to beat for our data, but there is only one way to find out: by trying.

The first step you would think is simple: take the genotype data, pass that in to calckinship(g) and get K back. Unfortunately GEMMA intertwines reading the genotype file, the computation of K in steps and scaling the matrix. All in one and the code is duplicated for Plink and BIMBAM formats. Great. Now I don't feel like writing any more code in C++ and, fortunately, Prasun has done some of the hard work in faster-lmm-d already. So the first step is to parse the BIMBAM format using an iterator. We'll use this to load the genotype data in RAM (we are not going to assume memory restrictions for now).

That genotype decompression and reading part is done now and I added it to BioD decompress.d. Next I added a tokenizer which is a safe replacement for the strtok gemma uses.

Using BLAS I added a full K computation after reading then genotype file which is on par (speed-wise) with the current GEMMA implementaton. The GEMMA implementation should be more optimal because it splits the matrix computation in smaller blocks and starts while streaming the genotype data. Prasun did an implementation of that in gemmakinship.d which is probably faster than my current version. Even so, I'll skip that method, for now, until I am convinced that none of the above dot-product optimizations pan out. Note that only a fraction of the time is used to do the actual matrix multiplication.

2018-10-23T09:53:38.299:api.d:flmmd_compute_bimbam_K:37 GZipbyLine
2018-10-23T09:53:43.911:kinship.d:kinship_full:23 Full kinship matrix used

Reading the file took 6 seconds.

2018-10-23T09:53:43.911:dmatrix.d:slow_matrix_transpose:57 slow_matrix_transpose
2018-10-23T09:53:44.422:blas.d:matrix_mult:48 matrix_mult
2018-10-23T09:53:45.035:kinship.d:kinship_full:33 DONE rows is 1940 cols 1940

Transpose + GxGT took 1 second. The total for

6e05143d717d30eb0b0157f8fd9829411f4cf2a0
real    0m9.613s
user    0m13.504s
sys     0m1.476s

I also wrote a threaded decompressor and it is slightly slower. Gunzip is so fast that building threads adds more overheads.

In this example reading the file dominates the time, but with LOCO we get ~20x K computation (one for each chromosome), so it makes sense to focus on K. For model species, for the forseeable future, we'll look at thousands of individuals, so it is possible to hold all K matrices in RAM. Which also means we can fill them using chromosome-based dot-products. Another possible optimization.

Next steps are to generate output for K and reproduce GEMMA output (also I need to filter on missing phenotype data). The idea is to compute K and LMM in one step, followed by developing the LOCO version as a first class citizen. Based on above metrics we should be able to reduce LOCO K of this sized dataset from 7 minutes to 30 seconds(!) That is even without any major optimizations.

25 MySQL to MariaDB

The version we are running today:

mysql --version
mysql  Ver 14.12 Distrib 5.0.95, for redhat-linux-gnu (x86_64) using readline 5.1

26 MySQL backups (stage2)

Backup to AWS.

env AWS_ACCESS_KEY_ID=* AWS_SECRET_ACCESS_KEY=* RESTIC_PASSWORD=genenetwork /usr/local/bin/restic --verbose init /mnt/big/mysql_copy_from_lily -r s3:s3.amazonaws.com/backupgn init
env AWS_ACCESS_KEY_ID=* AWS_SECRET_ACCESS_KEY=* RESTIC_PASSWORD=genenetwork /usr/local/bin/restic backup /mnt/big/mysql_copy_from_lily -r s3:s3.amazonaws.com/backupgn

I also added backups for genotypefiles and ES.

27 MySQL backups (stage1)

Before doing any serious work on MySQL I decided to create some backups. Lily is going to be the master for now, so the logical backup is on P1 which has a large drive. Much of the space is taken up by a running MySQL server (which is updated!) and a data file by Lei names 20151028dbsnp containing a liftover of dbSNP142. First I simply compressed the input files, not to throw them away.

Because ssh is so old on lily I can't login nicely from Penguin, but the other way works. This is temporary as mysql user

ssh -i /var/lib/mysql/.ssh/id_rsa pjotr@penguin

The script becomes (as a weekly CRON job for now) as mysql user

rsync -va /var/lib/mysql --rsync-path=/usr/bin/rsync -e "ssh -i /var/lib/mysql/.ssh/id_rsa" pjotr@penguin:/mnt/big/mysql_copy_from_lily/

This means we have a weekly backup for now. I'll improve it with more disk space and MariaDB to have incrementals.

Actually, it looks like only two tables really change

-rw-rw---- 1 pjotr mysql  29G Aug 16 18:59 ProbeSetData.MYD
-rw-rw---- 1 pjotr mysql  36G Aug 16 19:00 ProbeSetData.MYI

which are not that large.

To make incrementals we are opting for [restic](https://restic.readthedocs.io/). It looks modern and has interesting features.

env RESTIC_PASSWORD=genenetwork /usr/local/bin/restic init -r /mnt/big/backup_restic_mysql/

now backups can be generated incrementally with

env RESTIC_PASSWORD=genenetwork /usr/local/bin/restic --verbose backup /mnt/big/mysql_copy_from_lily -r /mnt/big/backup_restic_mysql

To list snapshots (directory format)

env RESTIC_PASSWORD=genenetwork /usr/local/bin/restic -r backup_restic_mysql/ snapshots

Check

env RESTIC_PASSWORD=genenetwork /usr/local/bin/restic -r backup_restic_mysql/ check

Prune can merge hashes, saving space

env RESTIC_PASSWORD=genenetwork /usr/local/bin/restic -r backup_restic_mysql/ prune

So, now we can do a daily backup from Lily and have incrementals stored on Penguin too. My cron reads

40 5 * * * env RESTIC_PASSWORD=genenetwork /usr/local/bin/restic --verbose backup /mnt/big/mysql_copy_from_lily -r /mnt/big/backup_restic_mysql|mail -s "MySQL restic backup" pjotr2017@thebird.nl

Restic can push to AWS S3 buckets. That is the next step (planned).

28 Migrating GN1 from EC2

GN1 is costing us $\600+ per month on EC2. With our new shiny server we should move it back into Memphis. The main problem is that the base image is

Linux version 2.6.18-398.el5 (mockbuild@builder17.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-55)) #1 SMP Tue Sep 16 20:50:52 EDT 2014

I mean, seriously, ten years old?

Compared to GN2, GN1 should be simpler to deploy. Main problem is that it requires Python 2.4 - currently. Not sure why that is.

Some system maintenance scripts live here. There is a Docker image by Artem here. And there are older notes referred to in the Artem doc. All of that is about GN2 really. So far, I have not found anything on GN1. In /usr/lib/python2.4/site-packages I find the following modules:

Alacarte
elementtree
gmenu.so
htmlgen
iniparse
invest
json
libsvn
nose
numarray
piddle
pirut
pp.py
pptransport.py
ppworker.py
pyx
pyXLWriter
reaper.so
sos
svg
urlgrabber

nothing too serious. Worse is that GN1 uses modpython which went out of vogue before 2013. The source code is still updated - you see, once things are out there…

This means we'll need to deploy Apache in the VM with modpython. The installation on Lily pulls in 50+ Apache modules. Argh. If we want to support this… The good news is that most modules come standard with Apache - and I think we can disable a lot of them.

29 Fixing Gunicorn in use

DEBUG:db.webqtlDatabaseFunction:.retrieve_species: retrieve_species result:: mouse
DEBUG:base.data_set:.create_dataset: dataset_type: ProbeSet
[2018-07-16 16:53:51 +0000] [4185] [ERROR] Connection in use: ('0.0.0.0', 5003)
[2018-07-16 16:53:51 +0000] [4185] [ERROR] Retrying in 1 second.

30 Updating ldc with latest LLVM

Updating ldc to latest leaves only 5 tests failing! That is rather good.

99% tests passed, 5 tests failed out of 1629

Total Test time (real) = 4135.71 sec

The following tests FAILED:
        387 - std.socket (Failed)
        785 - std.socket-debug (Failed)
        1183 - std.socket-shared (Failed)
        1581 - std.socket-debug-shared (Failed)
        1629 - lit-tests (Failed)

build/Testing/Temporary/LastTest.log: (core.exception.AssertError@std/socket.d(456): Assertion failure
build/Testing/Temporary/LastTest.log:core.exception.RangeError@std/socket.d(778): Range violation

and

Failing Tests (1):
    LDC :: plugins/addFuncEntryCall/testPlugin.d

Fixing these ldc 1.10.0 compiles on LLVM 3.8!

31 Fixing sambamba

We were getting this new error

/gnu/store/4snsi4vg06bdfi6qhdjfbhss16kvzxj7-ldc-1.10.0/include/d/std/numeric.d(1845):
Error: read-modify-write operations are not allowed for shared
variables. Use core.atomic.atomicOp!"+="(s, e) instead.

which was by the normalize class

bool normalize(R)(R range, ElementType!(R) sum = 1)

the normalization functions fractions range values to the value of sum, e.g.,

a = [ 1.0, 3.0 ];
assert(normalize(a));
assert(a == [ 0.25, 0.75 ]);

so, the question is why it needs to be a shared range. I modified it to cast to shared after normalization.

The next one was

BioD/bio/maf/reader.d(53): Error: cannot implicitly convert expression `this._f.byLine(cast(Flag)true, '\x0a')` of type `ByLineImpl!(char, char)` to `ByLine!(char, char)`

also reported on https://github.com/bioconda/bioconda-recipes/pull/8787#issuecomment-389195848

After fixing the compile time problems the tests failed for view, view -f unpack, subsample and sort(?!). In fact all md5sum's we test. For view it turns out the order of the output differs. view -f sam returns identity also converting back from BAM. It had to be something in the header. The header (apparently) contains the version of sambamba!

ldc2 -wi -I. -IBioD -IundeaD/src -g -O3 -release -enable-inlining -boundscheck=off -of=bin/sambamba bin/sambamba.o utils/ldc_version_info_.o htslib/libhts.a /usr/lib/x86_64-linux-gnu/liblz4.a -L-L/home/travis/dlang/ldc-1.10.0/lib -L-L/usr/lib/x86_64-linux-gnu -L-lrt -L-lpthread -L-lm -L-llz4
bin/sambamba.o: In function `_D5utils3lz426createDecompressionContextFZPv':
/home/travis/build/biod/sambamba/utils/lz4.d:199: undefined reference to `LZ4F_createDecompressionContext'
/home/travis/build/biod/sambamba/utils/lz4.d:200: undefined reference to `LZ4F_isError'
(...)

objdump -T /usr/lib/x86_64-linux-gnu/liblz4.so|grep LZ4F
000000000000cd30 g    DF .text  00000000000000f6  Base        LZ4F_flush
000000000000ce30 g    DF .text  0000000000000098  Base        LZ4F_compressEnd
000000000000c520 g    DF .text  00000000000002f5  Base        LZ4F_compressBegin
000000000000c4a0 g    DF .text  000000000000003f  Base        LZ4F_createCompressionContext
000000000000dd60 g    DF .text  00000000000000ee  Base        LZ4F_getFrameInfo
000000000000ced0 g    DF .text  00000000000002eb  Base        LZ4F_compressFrame
000000000000c4e0 g    DF .text  0000000000000033  Base        LZ4F_freeCompressionContext
000000000000c470 g    DF .text  0000000000000029  Base        LZ4F_getErrorName
000000000000c460 g    DF .text  000000000000000a  Base        LZ4F_isError
000000000000c8a0 g    DF .text  00000000000000e3  Base        LZ4F_compressFrameBound
000000000000d1c0 g    DF .text  0000000000000038  Base        LZ4F_createDecompressionContext
000000000000d200 g    DF .text  0000000000000037  Base        LZ4F_freeDecompressionContext
000000000000c990 g    DF .text  0000000000000396  Base        LZ4F_compressUpdate
000000000000d240 g    DF .text  0000000000000b1d  Base        LZ4F_decompress
000000000000c820 g    DF .text  000000000000007d  Base        LZ4F_compressBound

32 Trapping NaNs

When a floating point computation results in a NaN/inf/underflow/overflow GEMMA should stop The GSL has a generic function gslieeeenvsetup which works through setting an environment variable. Not exactly useful because I want GEMMA to run with just a –check switch.

The GNU compilers have feenableexcept as a function which did not work immediately. Turned out I needed to load fenv.h before the GSL:

#include <fenv.h>
#include "gsl/gsl_matrix.h"

Enabling FP checks returns

Thread 1 "gemma" received signal SIGFPE, Arithmetic exception.
0x00007ffff731983d in ieeeck_ () from /home/wrk/opt/gemma-dev-env/lib/libopenblas.so.0
(gdb) bt
#0  0x00007ffff731983d in ieeeck_ () from /home/wrk/opt/gemma-dev-env/lib/libopenblas.so.0
#1  0x00007ffff6f418dc in dsyevr_ () from /home/wrk/opt/gemma-dev-env/lib/libopenblas.so.0
#2  0x0000000000489181 in lapack_eigen_symmv (A=A@entry=0x731a00, eval=eval@entry=0x731850, evec=evec@entry=0x7317a0,
    flag_largematrix=<optimized out>) at src/lapack.cpp:195
#3  0x00000000004897f8 in EigenDecomp (flag_largematrix=<optimized out>, eval=0x731850, U=U@entry=0x7317a0, G=G@entry=0x731a00)
    at src/lapack.cpp:232
#4  EigenDecomp_Zeroed (G=G@entry=0x731a00, U=U@entry=0x7317a0, eval=eval@entry=0x731850, flag_largematrix=<optimized out>)
    at src/lapack.cpp:248
#5  0x0000000000457b3a in GEMMA::BatchRun (this=this@entry=0x7fffffffcc30, cPar=...) at src/gemma.cpp:2598

Even for the standard dataset!

Turns out it is division by zero and FP underflow in lapackeigensymmv.

33 A gemma-dev-env package

I went through difficulties of updating GNU Guix and writing a package that creates a GEMMA development environment. This was based on the need for having a really controlled dependency graph. It went wrong last time we released GEMMA, witness gemma got slower, leading to the discovery that I had linked in a less performing lapack.

GNU Guix was not behaving because I had not updated in a while and I discovered at least one bug. Anyway, I have a working build system now and we will work on code in the coming weeks to fix a number of GEMMA issues and bring out a new release.

34 Reviewing a CONDA package

Main achievement last week was getting GEMMA installed in a controlled fashion and proving performance is still up to scratch.

For JOSS I am reviewing a CONDA package for RNA-seq analysis. The author went through great lengths to make it easy to install with Bioconda, so I thought to have a go. GNU Guix has a conda bootstrap, so time to try that!

guix package -A conda
conda   4.3.16  out     gnu/packages/package-management.scm:704:2

and wants to install

/gnu/store/pj6d293c7r9xrc1nciabjxmh05z24fh0-Pillow-4.3.0.tar.xz
/gnu/store/m5prqxzlgaargahq5j74rnvz72yhb77l-python-olefile-0.44
/gnu/store/s9hzpsqf9zh9kb41b389rhmm8fh9ifix-python-clyent-1.2.1
/gnu/store/29wr2r35z2gnxbmvdmdbjncmj0d3l842-python-pytz-2017.3
/gnu/store/jv1p8504kgwp22j41ybd0j9nrz33pmc2-python-anaconda-client-1.6.3.tar.gz
/gnu/store/l6c40iwss9g23jkla75k5f1cadqbs4q5-python-dateutil-2.6.1
/gnu/store/y4h31l8xj4bd0705n0q7a8csz6m1s6s5-python-pycosat-0.6.1
/gnu/store/8rafww49qk2nxgr4la9i2v1yildhrvnm-python-cookies-2.2.1
/gnu/store/s5d94pbsv779nzi30n050qdq9w12pi52-python-responses-0.5.1
/gnu/store/kv8nvhmb6h3mkwyj7iw6zrnbqyb0hpld-python-conda-4.3.16.tar.gz
/gnu/store/cns9xhimr1i0fi8llx53s8kl33gsk3c4-python-ruamel.yaml-0.15.35

The CONDA package in Guix is a bit older - turns out CONDA has a ridiculous release rate. Let's try the older CONDA first

guix package -i conda

And

conda config --add channels defaults
  Warning: 'defaults' already in 'channels' list, moving to the top
conda config --add channels conda-forge
  Warning: 'conda-forge' already in 'channels' list, moving to the top
conda config --add channels bioconda
  Warning: 'bioconda' already in 'channels' list, moving to the top

so, that was all OK Next

conda install -c serine rnasik

Package plan for installation in environment /gnu/store/h260m1r0bgnyypl7r469lin9gpyrh12m-conda-4.3.16:

The following NEW packages will be INSTALLED:

    asn1crypto:                    0.24.0-py_1          conda-forge
    backports:                     1.0-py36_1           conda-forge
    backports.functools_lru_cache: 1.5-py_1             conda-forge
    bedtools:                      2.25.0-3             bioconda
    bigdatascript:                 v2.0rc10-0           serine
    bwa:                           0.7.17-pl5.22.0_2    bioconda
    bzip2:                         1.0.6-1              conda-forge
    ca-certificates:               2018.4.16-0          conda-forge
    certifi:                       2018.4.16-py36_0     conda-forge
    cffi:                          1.11.5-py36_0        conda-forge
    chardet:                       3.0.4-py36_2         conda-forge
    click:                         6.7-py_1             conda-forge
    colormath:                     3.0.0-py_2           conda-forge
    conda:                         4.5.8-py36_1         conda-forge
    conda-env:                     2.6.0-0              conda-forge
    cryptography:                  2.2.1-py36_0         conda-forge
    curl:                          7.60.0-0             conda-forge
    cycler:                        0.10.0-py_1          conda-forge
    dbus:                          1.11.0-0             conda-forge
    decorator:                     4.3.0-py_0           conda-forge
    expat:                         2.2.5-0              conda-forge
    fastqc:                        0.11.5-pl5.22.0_3    bioconda
    fontconfig:                    2.12.1-4             conda-forge
    freetype:                      2.7-1                conda-forge
    future:                        0.16.0-py_1          conda-forge
    gettext:                       0.19.8.1-0           conda-forge
    glib:                          2.53.5-1             conda-forge
    gst-plugins-base:              1.8.0-0              conda-forge
    gstreamer:                     1.8.0-1              conda-forge
    hisat2:                        2.1.0-py36pl5.22.0_0 bioconda
    icu:                           58.2-0               conda-forge
    idna:                          2.7-py36_2           conda-forge
    je-suite:                      2.0.RC-0             bioconda
    jinja2:                        2.10-py_1            conda-forge
    jpeg:                          9b-2                 conda-forge
    krb5:                          1.14.6-0             conda-forge
    libffi:                        3.2.1-3              conda-forge
    libgcc:                        5.2.0-0
    libiconv:                      1.14-4               conda-forge
    libpng:                        1.6.34-0             conda-forge
    libssh2:                       1.8.0-2              conda-forge
    libxcb:                        1.13-0               conda-forge
    libxml2:                       2.9.5-1              conda-forge
    lzstring:                      1.0.3-py36_0         conda-forge
    markdown:                      2.6.11-py_0          conda-forge
    markupsafe:                    1.0-py36_0           conda-forge
    matplotlib:                    2.1.0-py36_0         conda-forge
    mkl:                           2017.0.3-0
    multiqc:                       1.5-py36_0           bioconda
    ncurses:                       5.9-10               conda-forge
    networkx:                      2.0-py36_1           conda-forge
    numpy:                         1.13.1-py36_0
    openjdk:                       8.0.121-1
    openssl:                       1.0.2o-0             conda-forge
    pcre:                          8.41-1               conda-forge
    perl:                          5.22.0.1-0           conda-forge
    picard:                        2.18.2-py36_0        bioconda
    pip:                           9.0.3-py36_0         conda-forge
    pycosat:                       0.6.3-py36_0         conda-forge
    pycparser:                     2.18-py_1            conda-forge
    pyopenssl:                     18.0.0-py36_0        conda-forge
    pyparsing:                     2.2.0-py_1           conda-forge
    pyqt:                          5.6.0-py36_5         conda-forge
    pysocks:                       1.6.8-py36_1         conda-forge
    python:                        3.6.3-1              conda-forge
    python-dateutil:               2.7.3-py_0           conda-forge
    pytz:                          2018.5-py_0          conda-forge
    pyyaml:                        3.12-py36_1          conda-forge
    qt:                            5.6.2-3              conda-forge
    readline:                      6.2-0                conda-forge
    requests:                      2.19.1-py36_1        conda-forge
    rnasik:                        1.5.2-0              serine
    ruamel_yaml:                   0.15.35-py36_0       conda-forge
    samtools:                      1.5-1                bioconda
    setuptools:                    40.0.0-py36_0        conda-forge
    simplejson:                    3.8.1-py36_0         bioconda
    sip:                           4.18-py36_1          conda-forge
    six:                           1.11.0-py36_1        conda-forge
    skewer:                        0.2.2-1              bioconda
    spectra:                       0.0.11-py_0          conda-forge
    sqlite:                        3.13.0-1             conda-forge
    star:                          2.5.2b-0             bioconda
    subread:                       1.5.3-0              bioconda
    tk:                            8.5.19-2             conda-forge
    tornado:                       5.1-py36_0           conda-forge
    urllib3:                       1.23-py36_0          conda-forge
    wheel:                         0.31.1-py36_0        conda-forge
    xorg-libxau:                   1.0.8-3              conda-forge
    xorg-libxdmcp:                 1.1.2-3              conda-forge
    xz:                            5.2.3-0              conda-forge
    yaml:                          0.1.7-0              conda-forge
    zlib:                          1.2.11-0             conda-forge

That is a rather long list of packages, including OpenJDK. Conda

CondaIOError: IO error: Missing write permissions in: /gnu/store/h260m1r0bgnyypl7r469lin9gpyrh12m-conda-4.3.16

 You don't appear to have the necessary permissions to install packages
 into the install area '/gnu/store/h260m1r0bgnyypl7r469lin9gpyrh12m-conda-4.3.16'.
 However you can clone this environment into your home directory and
 then make changes to it.
 This may be done using the command:

 $ conda create -n my_root --clone=/gnu/store/h260m1r0bgnyypl7r469lin9gpyrh12m-conda-4.3.16

OK, I suppose that is an idea, though it kinda defeats the idea of a reproducible base repo. But this worked:

conda create -n joss-review-583
conda install -n joss-review-583 -c serine rnasik

The good news is that conda installs in one directory. But 2.7 GB downloaded…

conda info --envs
# conda environments:
#
conda                    /home/wrk/.conda/envs/conda
joss-review-583          /home/wrk/.conda/envs/joss-review-583
root                  *  /gnu/store/h260m1r0bgnyypl7r469lin9gpyrh12m-conda-4.3.16

Activate the environment

source activate joss-review-583
RNAsik --help
00:00:00.000    Bds 2.0rc10 (build 2018-07-05 09:58), by Pablo Cingolani
Usage: RNAsik -fqDir </path/to/your/fastqs> [options]
main options
        -fqDir <string>         : path to fastqs directory, can be nested
        -align <string>         : pick your aligner [star|hisat|bwa]
        -refFiles <string>      : directory with reference files
        -paired <bool>          : paired end data [false], will also set pairIds = "_R1,_R2"
        -all <bool>             : short hand for counts, mdups, exonicRate, qc, cov and multiqc
more options
        -gtfFile <string>       : path to refFile.gtf
        -fastaRef <string>      : path to refFile.fa
        -genomeIdx <string>     : genome index
        -counts <bool>          : do read counts [featureCounts]
        -mdups <bool>           : process bam files, sort and mark dups [picard]
        -qc <bool>              : do bunch of QCs, fastqc, picard QCs and samtools
        -exonicRate <bool>      : do Int(ra|er)genic rates [qualiMap]
        -multiqc <bool>         : do MultiQC report [multiqc]
        -trim <bool>            : do FASTQ trimming [skewer]
        -cov <bool>             : make coverage plots, bigWig files
        -umi <bool>             : deduplicates using UMIs
extra configs
        -samplesSheet <string>  : tab delimited file, each line: old_prefix \t new_prefix
        -outDir <string>        : output directory [sikRun]
        -extn <string>          : specify FASTQ files extension [.fastq.gz]
        -pairIds <string>       : specify read pairs, [none]
        -extraOpts <string>     : add extra options through a file, each line: toolName = options
        -configFile <string>    : specify custome configuration file

I may be critical about CONDA, but this works ;)

Now I tried on a different machine and there was a problem on activate where the environment bumped me out of a shell. Hmmm. The conda settings of activate are:

CONDA_DEFAULT_ENV=joss-review-583
CONDA_PATH_BACKUP=/home/wrk/opt/gemma-dev-env/bin:/usr/bin:/bin
CONDA_PREFIX=/home/wrk/.conda/envs/joss-review-583
CONDA_PS1_BACKUP='\[\033[0;35m\]\h:\w\[\033[0m\]$ '
JAVA_HOME=/home/wrk/.conda/envs/joss-review-583
JAVA_HOME_CONDA_BACKUP=
PATH=/home/wrk/.conda/envs/joss-review-583/bin:/home/wrk/opt/gemma-dev-env/bin:/usr/bin:/bin
_CONDA_D=/home/wrk/.conda/envs/joss-review-583/etc/conda/activate.d
_CONDA_DIR=/home/wrk/opt/gemma-dev-env/bin

I guess I can replicate that

penguin2:~$ export JAVA_HOME=$HOME/.conda/envs/joss-review-583
penguin2:~$ export CONDA_PREFIX=$HOME/.conda/envs/joss-review-583
penguin2:~$ export PATH=$HOME/.conda/envs/joss-review-583/bin:$PATH

conda install -n joss-review-583 -c bioconda qualimap

wget bioinformatics.erc.monash.edu/home/kirill/sikTestData/rawData/IndustrialAntifoamAgentsYeastRNAseqData.tar

35 Updates

It has been a while since I updated the BLOG (see below older BLOGs). Time to start afresh because we have interesting developments going and the time ahead looks particularly exciting for GEMMA and GeneNetwork with adventures in D, CUDA, Arvados, Jupyter labs, IPFS, blockchain and the list goes on! I also promised to write a BLOG on our development/deployment setup. Might as well start here. My environments are very complex but controlled thanks to GNU Guix.

36 Older BLOGS

Work on LOCO, faster-lmm-d, Intel Phi and packaging with Guix moved to here.

37 Even older BLOG

Older BLOG material on pylmm etc. can be found here.


Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-12-03 Thu 05:36
.