Pjotr's rotating BLOG
Table of Contents
- 1. First code katas with Rust
- 2. GEMMA additive and dominance effects
- 3. Sambamba build
- 4. GEMMA randomizer
- 5. GEMMA, Sambamba, Freebayes and pangenome tools
- 6. GEMMA compute GRM (2)
- 7. Managing filters with GEMMA1
- 8. HEGP and randomness
- 9. GEMMA keeping track of transformations
- 10. Fix outstanding CI build
- 11. GEMMA testing framework
- 12. GEMMA validate data
- 13. GEMMA running some data
- 14. GEMMA compute GRM
- 15. GEMMA filtering data
- 16. GEMMA convert data
- 17. GEMMA GRM/K compute
- 18. GEMMA with python-click and python-pandas-plink
- 19. Building GEMMA
- 20. Starting on GEMMA2
- 21. Porting GeneNetwork1 to GNU Guix
- 22. Chasing that elusive sambamba bug (FIXED!)
- 23. It has been almost a year! And a new job.
- 24. Speeding up K
- 25. MySQL to MariaDB
- 26. MySQL backups (stage2)
- 27. MySQL backups (stage1)
- 28. Migrating GN1 from EC2
- 29. Fixing Gunicorn in use
- 30. Updating ldc with latest LLVM
- 31. Fixing sambamba
- 32. Trapping NaNs
- 33. A gemma-dev-env package
- 34. Reviewing a CONDA package
- 35. Updates
- 36. Older BLOGS
- 37. Even older BLOG
This document describes Pjotr's journey in (1) introducing a speedy LMM resolver for GWAS for GeneNetwork.org, (2) tools for pangenomes, and (3) solving the pipeline reproducibility challenge with GNU Guix. Ah, and then there are the APIs and bug fixing…
1 First code katas with Rust
I am introducing code katas to the pangenome team. First I set an egg timer to 1 hour
and installed Rust and clang with Guix and checked out Christian's
rs-wfa bindings because I want to test C bindings against
Rust. Running cargo build pulled in a crazy number of dependencies. In a Guix container I had to set CC and LIBCLANG_PATH. After a successful build all tests failed with cargo test. It says
ld: rs-wfa/target/debug/build/libwfa-e30b43a0c990e3e6/out/WFA/build/libwfa.a(mm_allocator.o): relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a PIE object; recompile with -fPIE
On my machine adding the PIE flag to the WFA C code worked:
diff --git a/Makefile b/Makefile
index 5cd3812..71c58c8 100644
--- a/Makefile
+++ b/Makefile
@@ -10,7 +10,7 @@
 CC=gcc
 CPP=g++
 LD_FLAGS=-lm
-CC_FLAGS=-Wall -g
+CC_FLAGS=-Wall -g -fPIE
 ifeq ($(UNAME), Linux)
   LD_FLAGS+=-lrt
 endif
The PIE flag generates position-independent code for executables. Because this is meant to be a library I switched -fPIE for -fPIC and that worked too:
test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
A thing missing from the repo is a software license. WFA is published under an MIT license. Erik also favours the MIT license, so it makes sense to add that. After adding the license I cloned the repo to the pangenome org.
2 GEMMA additive and dominance effects
The additive effect is a = (uA - uB)/2 and the dominance effect d = uAB - (uA + uB)/2. GEMMA estimates the PVE by typed genotypes or “chip heritability”. R/qtl2 has an est_herit function to estimate heritability from pheno, kinship and covar.
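For illustration, the arithmetic in a minimal Python sketch (uA, uAB and uB are made-up genotype class means, not GEMMA output):

# Additive and dominance effects from per-genotype phenotype means
uA, uAB, uB = 10.0, 7.5, 4.0     # mean phenotype for the AA, AB and BB classes
a = (uA - uB) / 2                # additive effect
d = uAB - (uA + uB) / 2          # dominance deviation from the midpoint
print(a, d)                      # -> 3.0 0.5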
3 Sambamba build
Sambamba is an important tool for sequencing, but maintenance is a struggle. I have decided to radically simplify dependencies (same thing will happen with freebayes). First, I am making a wager that CRAM is not necessary in Sambamba by removing that code and the htslib dependency. I may get back to that later in the pangenome context, but for now we don't want it.
This week a new release went out, along with a few fixes.
4 GEMMA randomizer
GEMMA's random seed is -1 by default. If below 0 it will seed the randomizer from the hardware clock in param.cpp. The randomizer is set in three places:

bslmm.cpp
953:  gsl_rng_set(gsl_r, randseed);
1659: gsl_rng_set(gsl_r, randseed);
param.cpp
2032: gsl_rng_set(gsl_r, randseed);

and there are three (global) definitions of 'long int randseed' in the source. Bit of a mess really. Let's keep the random seed at startup only. The gsl_r structure will be shared by all. After fixing the randomizer we have a new 0.98.3 release of GEMMA!
5 GEMMA, Sambamba, Freebayes and pangenome tools
Working on pangenomes with Erik Garrison, we have started to develop a path for genotyping through pangenomes. This means I am now working on three legacy tools, trying to make them ready for a new generation of tooling and to get the best from these tools for future work.
I just managed to build Freebayes using a Guix environment. The tree of git submodules is quite amazing.
The first job is to build freebayes for ARM. This is part of our effort to use the NVIDIA Jetson ARM board. On armhf the package included in GNU Guix fails with missing file chdir vcflib/fastahack and bwa fails with

ksw.c:29:10: fatal error: emmintrin.h: No such file or directory
 #include <emmintrin.h>
          ^~~~~~~~~~~~~
The first thing we need to resolve is disabling the SSE2 extensions. This suggests that -DNOSSE2 can be used for BWA. We are not the first to deal with this issue. This page replaces the file with sse2neon.h, which translates SSE calls to NEON. My Raspberry Pi has NEON support, so that should work:
pi@raspberrypi:~ $ LD_SHOW_AUXV=1 uname | grep HWCAP
AT_HWCAP: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
AT_HWCAP2: crc32
Efraim will have a go at the BWA port. After that I'll pick up freebayes which includes a 2016 BWA as part of vcflib. People should not do this, though I am guilty with sambamba too.
5.1 Compile in a Guix container
After checking out the repo (recursively, with submodules), create a Guix container with all the build tools with
guix environment -C -l guix.scm
Next rebuild the CMake environment with
rm CMakeCache.txt ./vcflib-prefix/src/vcflib-build/CMakeCache.txt
make clean
cmake .
make -j 4
make test
The last command is not working yet.
6 GEMMA compute GRM (2)
With all that plumbing in place it is time to get the GRM computation sorted. Nick, in parallel, is working on a speedy replacement for LOCO written in Rust. I'll focus on a Python implementation right now. One thing I need to do is compute MAF correctly and (in line with that) translate the real numbers.
- [ ] Translate BIMBAM
- [ ] Translate R/qtl geno and genoiter
- [X] Compute MAF correctly
7 Managing filters with GEMMA1
8 HEGP and randomness
For the HEGP encryption we use rnorm. I had to dig through R's source code to find out how it generates random numbers. It does not use the Linux randomizer but its own implementation. Ouch. Honestly, we should not use that! These are pragmatic implementations for sampling. Not for encryption.
Interestingly, R sets srand() on startup using a time stamp. This is, however, only used to generate temporary filenames using the glibc rand() function. Correct me if I am wrong, but it is the only time R uses the OS randomizer.
Meanwhile the implementation of rustiefel is pretty straightforward: all we need to do is replace rnorm with an array of floats. Both Rust and Python can provide a normal distribution drawn from /dev/random. I'd have to dig deeper to get the details, but to me it looks like the better idea.
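For illustration, a minimal Python sketch that draws normal deviates from OS entropy (random.SystemRandom reads os.urandom, i.e. the kernel randomizer) rather than from a deterministic sampler like R's rnorm:

import random

# SystemRandom pulls entropy from the OS; gauss() turns it into
# normally distributed values - no reproducible seed, by design.
sysrand = random.SystemRandom()
samples = [sysrand.gauss(0.0, 1.0) for _ in range(16)]
print(samples[:4])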
9 GEMMA keeping track of transformations
One important thing mentioned below is that we want to keep track of data transformations. For example, applying a MAF filter twice can lead to different results. At least you'd want to show a warning. For this I am adding a transformations field in the control file which shows a list of transformations like
"transformations": [ { "type": "filter", "pheno-NA": true, "maf": 0.01, "miss": 0.05, "command": "./bin/gemma2 --overwrite -o test/data/regression/21487_filter filter -c test/data/regression/21487_convert.json" }
which allows us to check whether the filter has been applied before; running the filter twice will show up here. The grm command will check whether a filter has run and, if not, will add one.
The current output after two steps looks like:
<urce/code/genetics/gemmalib [env]$ cat test/data/regression/21487_filter.json { "command": "filter", "crosstype": null, // Cross-type is outbreeding "sep": "\t", // Default separator "na.strings": [ "NA", "nan", "-" ], "comment.char": "#", // keeping track of these for debugging: "individuals": 17, "markers": 7320, "phenotypes": 1, // all data files are tidy and gzipped (you can open directly in vim) "geno": "test/data/regression/21487_filter_geno.txt.gz", "pheno": "test/data/regression/21487_filter_pheno.txt.gz", "gmap": "21487_convert_gmap.txt.gz", "alleles": [ "A", "B", "H" ], "genotypes": { "A": 0, "H": 1, "B": 2 }, "geno_sep": false, // We don't have separators between genotypes "geno_transposed": true, "transformations": [ // keeps growing with every step { "type": "export", "original": "rqtl2", "format": "bimbam", "command": "./bin/gemma2 --overwrite -o test/data/regression/21487_convert convert --bimbam -g example/21487_BXD_geno.txt.gz -a example/BXD_snps.txt -p example/21487_BXDPublish_pheno.txt" }, { "type": "filter", "pheno-NA": true, "maf": 0.01, "miss": 0.05, "command": "./bin/gemma2 --overwrite -o test/data/regression/21487_filter filter -c test/data/regression/21487_convert.json" } ], "na_strings": [ // na_strings is more consistent than na.strings above, also need it for creating a method "NA", "nan", "-" ], "name": "test/data/regression/21487_convert.json" // name of the underlying file }
Note that the control file increasingly starts to look like a Monad because it passes state along with gemma2/lib steps.
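The check itself can stay tiny. A sketch with a hypothetical helper (assuming the control file parses as plain JSON; the // comments above are annotations for this BLOG):

import json

def has_transformation(control_fn, step_type):
    # Return True when the control file already records a step of this type
    with open(control_fn) as f:
        control = json.load(f)
    return any(t.get("type") == step_type
               for t in control.get("transformations", []))

if has_transformation("test/data/regression/21487_filter.json", "filter"):
    print("WARNING: filter already applied - a second run may change results")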
10 Fix outstanding CI build
The GEMMA1 build on Travis was failing for some time. The problem is that the GSL BLAS headers and the OpenBLAS headers do not collaborate well. To fix it I stopped testing with libgsl v1 (which is really old anyway) and we need to use gslcblas.h over OpenBLAS' cblas.h.
11 GEMMA testing framework
The first test is to convert BIMBAM to GEMMA2 format with
gemma2 --overwrite convert --bimbam -g example/21487_BXD_geno.txt.gz -a example/BXD_snps.txt -p example/21487_BXDPublish_pheno.txt
which outputs 3 files
INFO:root:Writing GEMMA2/Rqtl2 marker/SNP result_gmap.txt.gz
INFO:root:Writing GEMMA2/Rqtl2 pheno result_pheno_bimbam.txt.gz
INFO:root:Writing GEMMA2/Rqtl2 geno result_geno_bimbam.txt.gz
pytest-regressions writes its own files, so I needed to use the low-level interface for file comparisons. Also I'll need to unpack the .gz files to show a diff. Progressing here.
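Roughly what that looks like with the low-level interface (file_regression is the fixture shipped with pytest-regressions; the file name is an example):

import gzip

def test_convert_geno(file_regression):
    # Unpack the .gz output first so a failing test shows a readable diff
    with gzip.open("result_geno_bimbam.txt.gz", "rt") as f:
        file_regression.check(f.read(), extension=".txt")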
12 GEMMA validate data
Validation is extremely important. People make mistakes and when the data is wrong GEMMA should complain. I did quite a bit of work there on gemma1 but it is still fairly limited. Full validation has to happen out-of-band because you don't want to impact speed for running pipelines.
Therefore, we validate the data inline only where the cost is (almost) zero. A full validation is a separate switch. Let's see what it brings out!
gemma2 validate -c control
One thing the maf filter does not do is check for the absolute number of informative genotypes (maf is a percentage!). Also gemma1 does not check whether the minor allele is actually the smaller one. We hit the problem immediately with
WARNING:root:Only one type of genotype Counter({'B': 9, 'A': 7, 'H': 1}) found in ['A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'H', 'B'] --- other similar counter warnings are ignored (rs3722740 file ./test_geno.txt.gz line 175)
It is clear that the minor allele is actually the major allele here. The implications are not large for gemma1, but the minor allele frequency (MAF) filter may not work properly. This is why validation is so important!
Unlike gemma1, gemma2 will figure out the minor allele dynamically. One reason is that R/qtl2 does the same and we are sharing the data format. It also allows a consistent use of genotype markers (e.g. B and D for the BXD). I added a validation step to make sure major allele genotypes vary across the dataset (one constant allele is suspect and may imply some other problem).
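The check boils down to something like this sketch (allele letters are examples; heterozygous calls count as half a minor allele):

from collections import Counter

def check_minor_allele(row, minor="A", het="H"):
    # Warn when the declared minor allele is in fact the major one
    counts = Counter(row)
    freq = (counts[minor] + counts[het] / 2.0) / len(row)
    if freq > 0.5:
        print(f"WARNING: {minor} is not the minor allele (freq {freq:.2f})")
    return freq

check_minor_allele(list("AAABAABAAHAABAA"))   # triggers the warning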
BIMBAM files, meanwhile, include the SNP variants. GeneNetwork kinda ignores them. I need to update the importer for that. Added a note. First, however, I want to speed up the GRM LOCO.
13 GEMMA running some data
For a dosage comparison and LOCO permutation run I extracted data from the 21481 set on GeneNetwork. First I needed to match the genotypes with phenotypes using the filter below.
First create the Rqtl2 type dataset:
gemma2 -o 21481 convert --bimbam -g tmp/21481/BXD_geno.txt.gz -p tmp/21481/21481_BXDPublish_pheno.txt
Filtering
gemma2 -o output/21481-pheno -v filters -c 21481.json
Now the genotype file looks like
marker 7 10 11 12 38 39 42 54 60 65 67 68 70 71 73 77 81 91 92
rs31443144 AAABBABABAABBABBAAAA
rs6269442 AAABBABABAABBABBAAAA
rs32285189 AAABBABABAABBABBAAAA
Create the BIMBAM for gemma1
gemma2 -o 21481-pheno export --bimbam -c output/21481-pheno.json
and run the standard gemma commands. Note we can create the original dosage file by modifying the control file.
Running 1,000 permutations on 20 individuals and 7321 markers took
real 213m0.889s
user 2939m21.780s
sys 5423m37.356s
We have to improve on that!
14 GEMMA compute GRM
The first step is to compute the kinship matrix or GRM. We'll have different algorithms, so we need a switch for the method: --impl.
gemma2 --overwrite -o output/21487 convert --bimbam -g example/21487_BXD_geno.txt.gz -p example/21487_BXDPublish_pheno.txt
gemma2 --overwrite -o output/21487-filter filter -c output/21487.json
we can use the gemma1 implementation with
gemma2 grm --impl gemma1 -c output/21487-filter.json
or our new version
gemma2 grm -c output/21487-filter.json
Today I added the necessary filtering steps for the GRM listed in the next section. The maf filter is currently hard coded to make sure results match gemma1.
I finally got a matching K in Python compared to gemma1. Turns out that scaling is not the default option. Ha!
14.1 Calculating kinship
Gemma injects the mean genotype for a SNP into missing data fields (the mean over a SNP row). Next it subtracts the mean for every value in a row and, if scaling is requested, applies

gsl_vector_scale(geno, 1.0 / sqrt(geno_var));

and finally, over the full matrix, a division by the number of SNPs

gsl_matrix_scale(matrix_kin, 1.0 / (double)ns_test);

where ns_test is the number of SNPs included. SNPs get dropped by a MAF filter which I rewrote in a fast version in D. The GEMMA filter happens at reading of the Geno file.
So, essentially:

- [X] Always apply the MAF filter when reading genotypes
- [X] Apply missingness filter

And when computing kinship:

- [X] Always impute missing data (injecting the row mean)
- [X] Always subtract the row mean
- [X] Center the data by row (the -gk 1 default, gemma1 CenterMatrix); scaling is not the default
- [X] Always scale the matrix dividing by # of SNPs (gemma1 ScaleMatrix)
See also R's scaling function.
Prasun's D code may be a bit more readable. And here is our older pylmm implementation which only scales K by dividing by the number of SNPs.
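A minimal numpy sketch of the recipe above (assuming G is a markers x individuals dosage matrix with np.nan for missing calls):

import numpy as np

def kinship(G):
    G = G.astype(float).copy()
    row_mean = np.nanmean(G, axis=1, keepdims=True)   # per-SNP mean
    idx = np.where(np.isnan(G))
    G[idx] = np.take(row_mean, idx[0])                # impute missing with the row mean
    X = G - row_mean                                  # center by row (-gk 1)
    return X.T @ X / G.shape[0]                       # divide by # of SNPs

G = np.array([[0, 1, 2, np.nan],
              [2, 2, 0, 1.0]])
print(kinship(G))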
- [X] Check D implementation
Python's numpy treats NaN as follows:
>>> np.mean([1.0,2.0,3.0,0.0])
1.5
>>> np.mean([1.0,2.0,3.0,0.0,0.0])
1.2
>>> np.mean([1.0,2.0,3.0,0.0,0.0,np.NAN])
nan
which means we have to filter first.
>>> x = np.array([1.0,2.0,3.0,0.0,0.0,np.NAN])
>>> np.mean(x[~np.isnan(x)])
1.2
That is better. Note the tilde. Python is syntactically a strange beast. Also I need to be careful about numpy types. They are easily converted to lists, for example.
- [ ] And then there is this r2 filter I should check

r2 is simply the square of the sample correlation coefficient (i.e., r) between the observed outcomes and the observed predictor values.

The manual says correlation with any covariate. By default, SNPs with r^2 correlation with any of the covariates above 0.9999 will not be included in the analysis. When I get to covariates I should include that.
14.2 Implementation
Starting with the MAF filter. It is used both for GRM and GWA. Gemma1 says
-miss [num]   specify missingness threshold (default 0.05)
-maf [num]    specify minor allele frequency threshold (default 0.01)
-notsnp       minor allele frequency cutoff is not used
Note that using -notsnp the value of maf_level is set to -1.
With Gemma2 I want to make all filtering explicit. But what to do if someone forgets to filter? Or filters twice - which would lead to different results.
gemma2 filter -c data --maf 0.01
creates a new dataset and control file. We can add the filtering state to the new control data structure with "maf": 0.01, which prevents a second run.
If it is missing we should apply it by default in
gemma2 grm -c data
which will be the same as the single run
gemma2 filter -c data --maf 0.01 '=>' grm
(don't try this yet).
I wrote a MAF filter in Python which puts genotypes in classes and counts the minor alleles. Note that GEMMA1 does not do a pure minor allele count because it counts heterozygous calls as 50%. At this point I am not making a clear distinction.
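A sketch of that filter pass (gemma1-style thresholds; the genotype letters and NA marker are assumptions, and the missingness check is the one discussed next):

from collections import Counter

def pass_filters(row, maf_level=0.01, miss_level=0.05,
                 minor="A", het="H", missing="NA"):
    counts = Counter(row)
    n = len(row)
    if counts[missing] / n > miss_level:      # too much missing data
        return False
    called = n - counts[missing]
    maf = (counts[minor] + counts[het] / 2.0) / called
    maf = min(maf, 1.0 - maf)                 # make sure it is the *minor* frequency
    return maf >= maf_level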
Another filter is on missing data: miss_level defaults to 0.05:

if ((double)n_miss / (double)ni_test > miss_level) pass...

Say we have 6 out of 100 missing: it fails. This is rather strict.
15 GEMMA filtering data
With gemma1 people requested more transparent filtering. This is why I am making it a two-step process. First we filter on phenotypes:
15.1 Filter on phenotypes
The first filter is on phenotypes. When a phenotype is missing it should be removed from the kinship matrix and GWA. The R/qtl2 format is simply:
id pheno1 pheno2
1  1.2    3.0
2  NA     3.1
(etc)
So, if we select pheno1 we need to drop id=2 because it is an NA. The filter also needs to update the genotypes. Here we filter using the 6th phenotype column:
gemma2 -o output/ms-filtered -v filters -c result.json -p 6
With the new phenotype filter I was able to create a new GRM based on a reduced genotype list (imported from BIMBAM) for a paper we are putting out.
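For illustration, the phenotype filter amounts to this numpy sketch (assuming a transposed, markers x individuals genotype matrix):

import numpy as np

def filter_pheno(pheno, geno, col=0):
    # Keep only individuals whose selected phenotype is not NA
    keep = ~np.isnan(pheno[:, col])
    return pheno[keep], geno[:, keep]

pheno = np.array([[1.2, 3.0], [np.nan, 3.1]])
geno = np.array([[0, 2], [1, 1], [2, 0]])
print(filter_pheno(pheno, geno))   # id=2 is dropped from pheno and geno alike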
16 GEMMA convert data
16.1 Convert PLINK to GEMMA2/Rqtl2 and BIMBAM
The plink .fam format has
- Family ID ('FID')
- Within-family ID ('IID'; cannot be '0')
- Within-family ID of father ('0' if father isn't in dataset)
- Within-family ID of mother ('0' if mother isn't in dataset)
- Sex code ('1' = male, '2' = female, '0' = unknown)
- Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
The BIMBAM format just has columns of phenotypes and no headers(!)
For convenience, I am going to output BIMBAM so it can be fed directly into gemma1. Note Plink also outputs BIMBAM from plink files and VCF which is used by GEMMA users today.
To convert from plink to R/qtl2 we already have
gemma2 convert --plink example/mouse_hs1940
To convert from R/qtl2 to BIMBAM we can do
gemma2 convert --to-bimbam -c mouse_hs1940.json
which writes mouse_hs1940_pheno_bimbam.txt, introducing a yield generator. I notice how Python is increasingly starting to look like Ruby. Yield generators arrived in Python (2.2, PEP 255) years after Ruby had them(!)
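The generator idea, roughly (control.markers() is a hypothetical iterator, not the actual gemma2 API):

import gzip

def bimbam_rows(control):
    # Stream marker rows one at a time - nothing is held in RAM
    for marker, alleles, genotypes in control.markers():
        yield ",".join([marker, *alleles, *map(str, genotypes)])

def write_bimbam(control, fn="mouse_hs1940_geno_bimbam.txt.gz"):
    with gzip.open(fn, "wt") as out:
        for row in bimbam_rows(control):
            print(row, file=out)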
After some thought I decided to split convert into 'read' and 'write' commands. So now it becomes
gemma2 read --plink example/mouse_hs1940
To convert from R/qtl2 to BIMBAM we can do
gemma2 write --bimbam -c mouse_hs1940.json
I think that looks logical. On third thought I am making it
gemma2 convert --plink example/mouse_hs1940
To convert from R/qtl2 to BIMBAM we can do
gemma2 export --bimbam -c mouse_hs1940.json
It is important to come up with the right terms so it feels logical or predictable to users. Convert could be 'import', but Python does not allow that because it is a language keyword. And, in a way, I like 'convert' better. And 'export' is the irregular case that no one should really use. I could name it 'internal'. But hey.
The BIMBAM genotype file we write looks like
rs3683945, A, G, 1, 1, 0, 1, 0, 1, 1, ...
without a header. GEMMA does not actually use the allele values (A and G) and skips on spaces. So we can simplify it to
rs3683945 - - 1 1 0 1 0 1 1 ...
Using these new inputs (converted plink -> Rqtl2 -> BIMBAM) the computed cXX matrix is the same as the original PLINK we had.
gemma2 gemma1 -g mouse_hs1940_geno_bimbam.txt.gz -p mouse_hs1940_pheno_bimbam.txt -gk -o mouse_hs1940
and same for the GWA
gemma2 gemma1 -g mouse_hs1940_geno_bimbam.txt.gz -p mouse_hs1940_pheno_bimbam.txt -n 1 -a ./example/mouse_hs1940.anno.txt -k ./output/mouse_hs1940.cXX.txt -lmm -o mouse_hs1940_CD8_lmm-new
==> output/mouse_hs1940_CD8_lmm-new.assoc.txt <==
chr rs ps n_miss allele1 allele0 af beta se logl_H1 l_remle p_wald
1 rs3683945 3197400 0 - - 0.443 -7.882975e-02 6.186788e-02 -1.581876e+03 4.332964e+00 2.028160e-01
1 rs3707673 3407393 0 - - 0.443 -6.566974e-02 6.211343e-02 -1.582125e+03 4.330318e+00 2.905765e-01

==> output/result.assoc.txt <==
chr rs ps n_miss allele1 allele0 af beta se logl_H1 l_remle p_wald
1 rs3683945 3197400 0 A G 0.443 -7.882975e-02 6.186788e-02 -1.581876e+03 4.332964e+00 2.028160e-01
1 rs3707673 3407393 0 G A 0.443 -6.566974e-02 6.211343e-02 -1.582125e+03 4.330318e+00 2.905765e-01
The BIMBAM version differs because the BIMBAM file in ./example differs slightly from the PLINK version (thanks Xiang, keeps me on my toes!). Minor differences exist because some genotype values, such as 0.863, have been changed. For a faithful and lossless computation of that BIMBAM file we'll need to support those values too. But that will come when we start importing BIMBAM files. I'll make a note of that.
16.2 Convert BIMBAM to GEMMA2/Rqtl2
From above we can also parse BIMBAM rather than export. It will need both geno and pheno files to create the GEMMA2/Rqtl2 format:
gemma2 convert --bimbam -g example/mouse_hs1940.geno.txt.gz -p example/mouse_hs1940.pheno.txt
Problem: BIMBAM files can contain any value while the Rqtl2 genotype file appears to be limited to alleles. Karl has a frequency format, but that uses some fancy binary 'standard'. I'll just use the genotypes for now because GeneNetwork uses those too, translating back and forth. BIMBAM
rs31443144, X, Y, 1, 1, 0, 0, 0, ... 0, 1, 0.5, 0.5
becomes Rqtl2
rs31443144
and then back to
rs31443144, - - 1 1 0 0 0 ... 0 1 2 2
for processing with GEMMA v1. Funny. Actually it should be comma delimited to align with PLINK output.
After some hacking
gemma2 convert --bimbam -g example/BXD_geno.txt.gz -p example/BXD_pheno.txt
gemma2 export -c result.json
gemma -gk -g result_geno_bimbam.txt.gz -p result_pheno.tsv
FAILED: Parsing input file 'result_geno_bimbam.txt.gz' failed in function ReadFile_geno in src/gemma_io.cpp at line 743
is still not happy. Looking at the code, it fails to get (enough) genotypes. Also interesting is this hard-coded logic in GEMMA1:
geno = atof(ch_ptr);
if (geno >= 0 && geno <= 0.5) { n_0++; }
if (geno > 0.5 && geno < 1.5) { n_1++; }
if (geno >= 1.5 && geno <= 2.0) { n_2++; }
GEMMA1 assumes minor allele homozygous = 2.0 and major = 0.0 for BIMBAM, and 1.0 is H. The docs say the BIMBAM format is particularly useful for imputed genotypes, as PLINK codes genotypes using 0/1/2, while BIMBAM can accommodate any real values between 0 and 2 (and any real values if paired with the -notsnp option, which sets cPar.maf_level = -1).
The first column is SNP id, the second and third columns are allele types with minor allele first, and the remaining columns are the posterior/imputed mean genotypes of different individuals numbered between 0 and 2. An example mean genotype file with two SNPs and three individuals is as follows:
rs1, A, T, 0.02, 0.80, 1.50
rs2, G, C, 0.98, 0.04, 1.00
GEMMA codes alleles exactly as provided in the mean genotype file, and ignores the allele types in the second and third columns. Therefore, the minor allele is the effect allele only if one codes minor allele as 1 and major allele as 0.
BIMBAM mode is described in its own manual. The posterior mean value is the minor allele dosage. This means the encoding should be
rs31443144,-,-,2,2,0,0,0,...,0,2,1,1
but GeneNetwork uses
rs31443144, X, Y, 1, 1, 0, 0, 0, ... 0, 1, 0.5, 0.5
which leads to a different GRM. It turns out the genotypes GeneNetwork is using are wrong. It will probably not be a huge difference because the dosage is just scaled. I'll test it on a live dataset - a job that needs to be done anyway.
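The rescaling itself is trivial (sketch):

# GeneNetwork's 0/0.5/1 values scaled to the 0..2 minor allele dosage
def gn_to_bimbam(values):
    return [2.0 * v for v in values]

print(gn_to_bimbam([1, 1, 0, 0, 0, 1, 0.5, 0.5]))
# -> [2.0, 2.0, 0.0, 0.0, 0.0, 2.0, 1.0, 1.0]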
GEMMA also has an annotation file for SNPs
rs31443144 3010274 1
rs6269442 3492195 1
rs32285189 3511204 1
rs258367496 3659804 1
etc.
Rqtl2 has a similar gmap file in tidy format:
marker,chr,pos
rs13475697,1,1.6449
rs3681603,1,1.6449001
rs13475703,1,1.73685561994571
rs13475710,1,2.57549035621086
rs6367205,1,2.85294211007162
rs13475716,1,2.85294221007162
etc.
entered as "gmap" in the control file. I suppose it is OK to use our chromosome positions there. I'll need to update the converter for BIMBAM and PLINK. The PLINK version for GEMMA looks like
rs3668922 111771071 13 65.0648
rs13480515 17261714 10 4.72355
rs13483034 53249416 17 30.175
rs4184231 48293994 16 33.7747
rs3687346 12815936 14 2.45302
so both positions are included, but no header. All different in other words! Rqtl2 also has a "pmap" file which is the physical mapping distance.
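Converting the GEMMA annotation layout above into the tidy gmap is straightforward. A sketch (file names are examples):

import gzip

def anno_to_gmap(anno_fn="example/BXD_snps.txt", gmap_fn="result_gmap.txt.gz"):
    # GEMMA anno: marker, position, chromosome -> R/qtl2 gmap: marker,chr,pos
    with open(anno_fn) as inp, gzip.open(gmap_fn, "wt") as out:
        print("marker,chr,pos", file=out)
        for line in inp:
            marker, pos, chrom = line.split()[:3]
            print(f"{marker},{chrom},{pos}", file=out)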
17 GEMMA GRM/K compute
GEMMA1 computes K fairly inefficiently. First of all it reads the large genotype file twice (once for K, once for GWA). Next it tries to optimize for memory consumption by reading the file in blocks and partially computing the multiplication matrix. This saves RAM, but should not be default behaviour (for one, OpenBLAS' threading is not really leveraged). Using gemma1 we can compute K with
time gemma -g ./example/mouse_hs1940.geno.txt.gz -p ./example/mouse_hs1940.pheno.txt -gk -o mouse_hs1940
GEMMA 0.98.2 (2020-05-28) by Xiang Zhou and team (C) 2012-2020
Reading Files ...
## number of total individuals = 1940
## number of analyzed individuals = 1410
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var = 12226
## number of analyzed SNPs = 10768
Calculating Relatedness Matrix ...
================================================== 100%

real 0m7.545s
user 0m14.468s
sys 0m1.037s
Now the GRM output file (mouse_hs1940.cXX.txt) is pretty large at 54Mb, and we keep a lot of those cached in GeneNetwork. It contains a 1940x1940 matrix of textual numbers
0.3350589588 -0.02272259412 0.0103535287 0.00838433365 0.04439930169 -0.01604468771 0.08336199305 -0.02272259412 0.3035959579 -0.02537616406 0.003454557308 ...
The gzipped version is half that size. We can probably do better by storing 4-byte floats and only storing half the matrix (it is symmetric after all), followed by compression.
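A numpy sketch of that storage idea (file name is an example):

import gzip
import numpy as np

def save_grm(K, fn="K.grm.gz"):
    # Store only the upper triangle as 4-byte floats, then compress
    iu = np.triu_indices(K.shape[0])
    with gzip.open(fn, "wb") as f:
        f.write(K[iu].astype(np.float32).tobytes())

def load_grm(fn, n):
    with gzip.open(fn, "rb") as f:
        tri = np.frombuffer(f.read(), dtype=np.float32)
    K = np.zeros((n, n), dtype=np.float32)
    K[np.triu_indices(n)] = tri
    return K + np.triu(K, 1).T    # mirror the strict upper triangle back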
Anyway, first thing to do is read R/qtl2 style files into gemma2 because that is our preferred format.
To pass in information we now use the control file defined below
gemma2 grm --control mouse_hs1940.json
I did this before with pylmm five years ago(!) First results show the numpy dot product outperforming gemma1 and BLAS today for this size of dataset (I am not trying CUDA yet). The next stage is filtering and centering the GRM.
18 GEMMA with python-click and python-pandas-plink
Python click is powerful and covers most use cases. This week I got the argument parsing going for gemma2 and the logic for running subcommands. It was a bit of a struggle: I had the feeling I would be better off writing argument parsing from scratch. But today I added a parameter for finding the gemma1 binary (a command line switch and an environment variable, with help, in a one-liner). That was pleasingly quick and powerful.

In the next step I wanted to convert PLINK files to GEMMA2/Rqtl2 formats. A useful tool to have for debugging anyhow. I decided to try pandas-plink and added that to GEMMA2 (via a Guix package). Reading the mouse data:
>>> G = read_plink1_bin("./example/mouse_hs1940.bed", "./example/mouse_hs1940.bim", "./example/mouse_hs1940.fam", ver>
>>> print(G)
<xarray.DataArray 'genotype' (sample: 1940, variant: 12226)>
dask.array<transpose, shape=(1940, 12226), dtype=float64, chunksize=(1024, 1024), chunktype=numpy.ndarray>
Coordinates:
  * sample   (sample) object '0.224991591484104' '-0.97454252753557' ... nan nan
  * variant  (variant) object '1_rs3683945' '1_rs3707673' ... '19_rs6193060'
    fid      (sample) <U21 '0.224991591484104' '-0.97454252753557' ... 'nan'
    iid      (sample) <U21 '0.224991591484104' '-0.97454252753557' ... 'nan'
    father   (sample) <U32 'nan' 'nan' 'nan' 'nan' ... 'nan' 'nan' 'nan' 'nan'
    mother   (sample) <U3 '1' '0' '1' 'nan' 'nan' ... 'nan' '0' '1' 'nan' 'nan'
    gender   (sample) <U32 'nan' 'nan' 'nan' 'nan' ... 'nan' 'nan' 'nan' 'nan'
    trait    (sample) float64 -0.2854 -2.334 0.04682 nan ... -0.09613 1.595 0.72
    chrom    (variant) <U2 '1' '1' '1' '1' '1' '1' ... '19' '19' '19' '19' '19'
    snp      (variant) <U18 'rs3683945' 'rs3707673' ... 'rs6193060'
    cm       (variant) float64 0.0 0.1 0.1175 0.1358 ... 53.96 54.02 54.06 54.07
    pos      (variant) int64 3197400 3407393 3492195 3580634 ... -9 -9 61221468
    a0       (variant) <U1 'A' 'G' 'A' 'A' 'G' 'A' ... 'G' 'G' 'C' 'G' 'G' 'A'
    a1       (variant) <U1 'G' 'A' 'G' 'G' 'G' 'C' ... 'A' 'A' 'G' 'A' 'A' 'G'
Very nice! With support for an embedded GRM we can also export. Note that this tool is not lazy, so for very large sets we may need to write some streaming code.
The first step is to create an R/qtl2 style genotype file to compute the GRM. The BIMBAM version for a marker looks like
rs3683945, A, G, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 2, 1, 1, 0, 1, 1, 1, 1, 2, 1, 0, 2, etc
rs3707673, G, A, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 2, 1, 1, 0, 1, 1, 1, 1, 2, 1, 0, 2,
rs6269442, A, G, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 2, 1, 0, 0, 1, 1, 1, 1, 2, 0, 0, 2,
(...)
reflecting row a0 and a1 vertically in the BED file.
===> BED (binary format)
[[1. 1. 2. 1. 2. 1. 1. 1. 1. 2. 2. 1. 2. 1. 1. 1. 1. 2. 0. 1. 1. 2. 1. 1. 1. 1. 0. 1. 2. 0. 1. 0. 1. 2. 0. 2. 1. 1. 1. 1.
 [1. 1. 2. 1. 2. 1. 1. 1. 1. 2. 2. 1. 2. 1. 1. 1. 1. 2. 0. 1. 1. 2. 1. 1. 1. 1. 0. 1. 2. 0. 1. 0. 1. 2. 0. 2. 1. 1. 1. 1.
 [1. 2. 2. 1. 2. 1. 1. 2. 2. 2. 2. 1. 2. 1. 1. 1. 2. 2. 0. 1. 2. 2. 1. 1. 1. 1. 0. 2. 2. 0. 2. 0. 1. 2. 0. 2. 1. 1. 1. 1.
(...) ]]
So you can see the 2 and 0 values switched, with 1 meaning H.
The equivalent R/qtl2 can be the human readable
"genotypes": {"A":0, "B":1, "H": 2}
For the genotype file we'll go marker by individual (transposed) because that is what we can stream in GWA. The tentative 'control' file (mirroring recla.json) to start with:
{ "description": "mouse_hs1940", "crosstype": "hs", "individuals": 1940, "markers": 12226, "phenotypes": 7, "geno": "mouse_hs1940_geno.tsv", "alleles": [ "A", "B", "H" ], "genotypes": { "A": 1, "H": 2, "B": 3 }, "geno_transposed": true }
Where cross-type "HS" should probably act similar to "DO". We'll use some (hopefully) sane defaults, such as a tab for separator and '-' and NA for missing data. Comments have '#' on the first position. We add number of individuals, phenotypes and markers/SNPs for easier processing and validation when parsing.
Now the GEMMA2 genotype file should become
marker, 1, 2, 3, 4, ...
rs3683945 B B A B A B B B B A A B A B B B B A H B B A B B B B H B A H etc
rs3707673 B B A B A B B B B A A B A B B B B A H B B A B B B B H B A H
rs6269442 B A A B A B B A A A A B A B B B A A H B A A B B B B H A A H
(...)
"geno_compact": true
which makes it the more compact
marker,1,2,3,4,...
rs3683945 BBABABBBBAABABBBBAHBBABBBBHBAH etc
rs3707673 BBABABBBBAABABBBBAHBBABBBBHBAH
rs6269442 BAABABBAAAABABBBAAHBAABBBBHAAH
(...)
I hope R/qtl2 will support this too. The next step is compression.
The uncompressed space version:
-rw-r--r-- 1 wrk users 46M Aug 31 08:25 mouse_hs1940_geno.tsv
with gzip compresses to
-rw-r--r-- 1 wrk users 3.3M Aug 31 07:58 mouse_hs1940_geno.tsv.gz
while the compact version is half the size and compresses better too
-rw-r--r-- 1 wrk users 2.1M Aug 31 08:30 mouse_hs1940_geno.tsv.gz
The bz2 version is a little smaller but a lot slower. lz4 has a larger file
-rw-r--r-- 1 wrk users 5.4M Aug 31 08:37 mouse_hs1940_geno.tsv.lz4
but it is extremely fast. For now we'll just go for smaller files (these genotype files can be huge). gzip support is standard in Python3. Compressing the files from inside Python appeared slow with the default maximum compression. Reducing the level a bit makes it comparable to the original writer and the file is 20x smaller. It is also 3x smaller than the original PLINK version (which is supposedly optimally compressed).
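What "reducing the level a bit" looks like from Python (gzip's default compresslevel is 9; 6 is the kind of middle ground I mean):

import gzip

with gzip.open("mouse_hs1940_geno.tsv.gz", "wt", compresslevel=6) as out:
    out.write("marker,1,2,3,4,...\n")
    out.write("rs3683945 BBABABBBBAABABBBBAHBBABBBBHBAH\n")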
For the control file it is probably enough that the file name shows the compression extension. Now it looks like
{ "description": "mouse_hs1940", "crosstype": "hs", "sep": "\t", "individuals": 1940, "markers": 12226, "phenotypes": 7, "geno": "mouse_hs1940_geno.tsv.gz", "alleles": [ "A", "B", "H" ], "genotypes": { "A": 1, "H": 2, "B": 3 }, "geno_sep": false, "geno_transposed": true }
Next step is generating the GRM!
19 Building GEMMA
Gemma2 is going to act like a modular tool that can invoke many backends. Examples are to compute a GRM using R/qtl2, native GEMMA or some CUDA implementation. The command line will have to deal with all these variations too in an elegant way.
For deployment we'll use GNU Guix and Docker containers as described below. Gemma2 is going to be oblivious about how deployment hangs together because I think it is going to be an ecosystem on its own.
I am looking into command line parsing for gemma2. The requirement is reasonably sophisticated parameter checking that can grow over time. First of all I'll introduce 'commands' such as
gemma grm
gemma gwa
which allows for splitting option sets. Better readable too. Also I want to be able to inline 'pipe' commands:
gemma grm => gwa
which will reuse data stored in RAM to speed things up. You can imagine something like
gemma filter => grm --loco => gwa
as a single run. Now bash won't allow for this pipe operator so we may support a syntax like
gemma filter '=>' grm --loco '=>' gwa
or
gemma filter % grm --loco % gwa
Note that these 'pipes' will not be random workflows. It is just a CLI convenience notation that visually looks composable.
Also the current switches should be supported and gemma2 will drop to gemma version 1 if it does not understand a switch. For example
gemma -g /example/mouse_hs1940.geno.txt.gz -p mouse_hs1940.pheno.txt -a mouse_hs1940.anno.txt -gk -no-check
should still work, simply because current workflows expect that. The python click package looks like it can do this. What is tricky is that we want to check all parameters before the software runs. For now, the grouping works OK - you can chain commands, e.g.
import click

@click.group(invoke_without_command=True)
@click.pass_context
def gemma2(context):
    if not context.invoked_subcommand:
        click.echo("** Call gemma1")

@click.command()
def grm():
    click.echo('** Kinship/Genetic Relationship Matrix (GRM) command')

@click.command()
# @click.argument('verbose', default=False)
def gwa():
    click.echo('** Genome-wide Association (GWA)')

gemma2.add_command(grm)
gemma2.add_command(gwa)
gemma2()
which allows
gemma grm ... -> calls into grm and gwa
gemma gwa ...
gemma (no options)
Not perfect because parameters are checked just before the actual chained command runs. But good enough for now. See also the group tutorial.
Performance metrics before doing a new release show openblas has gotten a bit faster for GEMMA. I also tried a build of recent openblas-git-0.3.10 with the usual tweaks. It is really nice to see how much effort is going into OpenBLAS development. The gains we get for free.

As people have trouble building GEMMA on MacOS (and I don't have a Mac) I released a test Docker container that can be run as
lario:/home/wrk/iwrk/opensource/code/genetics/gemma# time docker run -v `pwd`/example:/example 2e82532c7440 gemma -g /example/mouse_hs1940.geno.txt.gz -p /example/mouse_hs1940.pheno.txt -a /example/mouse_hs1940.anno.txt -gk -no-check
GEMMA 0.98.2 (2020-05-28) by Xiang Zhou and team (C) 2012-2020
Reading Files ...
## number of total individuals = 1940
## number of analyzed individuals = 1410
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var = 12226
## number of analyzed SNPs = 10768
Calculating Relatedness Matrix ...
================================================== 100%

real 0m8.161s
user 0m0.033s
sys 0m0.023s
Note that the local directory is mounted with the -v switch.
The last weeks got hijacked by grant writing, letters and failing machines. Back to GEMMA now. I was considering moving the development environment to our 8-core EPYC server, but it builds and runs actually slightly slower than my desktop (a 4-core Intel Nuc6i7KYK), which shows how IO-bound we (still) are.
20 Starting on GEMMA2
A lot happened in the last month. Not least creating http://covid19.genenetwork.org which is getting quite a bit of attention.

Today is the official kick-off day for new GEMMA development! The coming months I am taking a coding sabbatical. Below you can read that I was initially opting for D, but in the last months we have increasingly invested in Rust and it looks like the new GEMMA will be written in Python + Rust.
First, why Rust? Mostly because our work is showing it forces coders to work harder at getting things correct. Not having a GC can look like a step backward, but when binding to other languages it actually is handy. With Rust you know where you are - no GC kicking in and doing (black) magic. Anyway, I am going to give it a shot. I can revert later and D is not completely ruled out.
Now, why Python instead of Racket? Racket is vastly superior to Python in my opinion. Unfortunately, Python has the momentum and if I write the front-end in Racket it means no one will contribute. So, this is pragmatism to the Nth degree. We want GEMMA2/gemmalib to be useful to a wider community for experimentation. It will be a toolbox rather than an end-product. For example, I want to be able to swap in/out modules that present different algorithms. In the end it will be a toolbox for mapping phenotypes to genotypes.
I don't think we'll regret a choice for Python+Rust. Both languages are complementary and have amazing community support, both in terms of size and activity. Having Python as a front-end implies that it should be fairly trivial to bind the Rust back-end to other languages, such as Racket, R and Ruby. It will happen if we document it well enough.
One feature of new GEMMA2 will be that it actually can run GEMMA1. I am going to create a new development repository that can call into GEMMA1 transparently if functionality does not exist. This means the same command line parameters should work. GEMMA2 will fall back to GEMMA1 with a warning if it does not understand parameters. GEMMA2 specific parameters are a different set.
With Corona virus hitting us hard I am spending my time at home, hacking. For an introvert this is the normal. We are organising the COVID-19 hackathon - which will usurp quite a bit of my time. But that is OK, if we can scale up our precision medicine goals in such a way. Better than hacking it alone for sure! Other than that there are some papers to complete and perhaps one or two grant applications. There is always enough to do - fact of science.
I promised GeneNetwork that I would implement precompute for all GN mappings for GEMMA. I think that is a great idea. It is also an opportunity to clean up GEMMA. Essentially a rewrite where GEMMA becomes more of a library which can be called from any other language. That will also make it easier to optimise GEMMA for certain architectures. It is interesting to note that despite the neglect GEMMA is getting and its poor runtime performance it is still a surprisingly popular tool. The implementation, apparently, still rocks!
So, let's start with GEMMA version 2 (GEMMA2). What is GEMMA2? GEMMA2 is a fast implementation of GEMMA1 in D. Why not Rust? You may ask. Well, Rust is a consideration and we can still port, but D is close to idiomatic C++ which means existing GEMMA code is relatively easy to convert. I had been doing some of that already with Prasun, with faster-lmm-d. That project taught us a lot though we never really got it to replace GEMMA. Next to D I will be using Racket to create the library bindings. Racket is a Lisp and with a good FFI it should be easy to port that to (say) Python or Ruby. So, in short: Front-end Racket, back-end D. Though there may be variations.
21 Porting GeneNetwork1 to GNU Guix
GeneNetwork1 http://gn1.genenetwork.org/ is a legacy Python web application which was running on a 10-year-old CentOS server. It depends on an older version of Python, mod-python and other obsolete modules. We decided to package it in GNU Guix because Guix gives full control over the dependency graph. Also, GNU Guix has features like time-machine and containers which allow us to make a snapshot of a deployment graph in time and serve different versions of releases.

The first step, packaging GN1 with the older packages, was executed by Efraim. He also created a container to run Apache, mod-python and GN1. The only problem is that mod-python in the container did not appear to be working.
22 Chasing that elusive sambamba bug (FIXED!)
The bug is tracked on github. Locating it was fairly easy - triggered by decompressBgzfBlock - and I managed to figure out it had to do with the copying of a thread object that messes up the stack. Fixing it, however, is a different story! It took a while to get a minimal program to reproduce the problem. It looks like
BamReader bam;
auto fn = buildPath(dirName(__FILE__), "data", "ex1_header.bam");
foreach (i; 0 .. 60) {
    bam = new BamReader(fn);
    assert(bam.header.format_version == "1.3");
}
which segfaults most of the time, but not always, with a stack trace
#0  0x00007ffff78b98e2 in invariant._d_invariant(Object) () from /usr/lib/x86_64-linux-gnu/libdruntime-ldc-debug-shared.so.87
#1  0x00007ffff7d4711e in std.parallelism.TaskPool.queueLock() () from /usr/lib/x86_64-linux-gnu/libphobos2-ldc-debug-shared.so.87
#2  0x00007ffff7d479b9 in _D3std11parallelism8TaskPool10deleteItemMFPSQBqQBp12AbstractTaskZb () from /usr/lib/x86_64-linux-gnu/libphobos2-ldc-debug-shared.so.87
#3  0x00007ffff7d4791d in _D3std11parallelism8TaskPool16tryDeleteExecuteMFPSQBwQBv12AbstractTaskZv ()
#4  0x00005555557125c5 in _D3std11parallelism__T4TaskS_D3bio4core4bgzf5block19decompressBgzfBlockFSQBrQBqQBoQBm9BgzfBlockCQCoQCn5utils7memoize__T5CacheTQCcTSQDxQDwQDuQDs21DecompressedBgzfBlockZQBwZQBpTQDzTQDgZQGf10yieldForceMFNcNdNeZQCz (this=0x0)
#6  0x00007ffff78a56fc in object.TypeInfo_Struct.destroy(void*) const ()
#7  0x00007ffff78bc80a in rt_finalizeFromGC () from /usr/lib/x86_64-linux-gnu/libdruntime-ldc-debug-shared.so.87
#8  0x00007ffff7899f6b in _D2gc4impl12conservativeQw3Gcx5sweepMFNbZm ()
#9  0x00007ffff78946d2 in _D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm ()
#10 0x00007ffff78969fc in _D2gc4impl12conservativeQw3Gcx8bigAllocMFNbmKmkxC8TypeInfoZPv ()
#11 0x00007ffff78917c3 in _D2gc4impl12conservativeQw3Gcx5allocMFNbmKmkxC8TypeInfoZPv ()
(...)
task_pool=0x7ffff736f000) at reader.d:130
#29 0x000055555575e445 in _D3bio3std3hts3bam6reader9BamReader6__ctorMFAyaZCQBvQBuQBtQBsQBrQBn (warning: (Internal error: pc 0x55555575e444 in read in psymtab, but not in symtab.) at reader.d:135
#30 0x00005555557bff0b in D main () at read_bam_file.d:47
where reader.d:130 adds a BamReader to the task pool. It is clear the GC kicks in and we end up with this mess.
Line #4 contains std.parallelism.TaskS bio.core.bgzf.block.decompressBgzfBlock utils.memoize.Cache-DecompressedBgzfBlock-yieldForce
yieldForce executes a task in the current thread which is coming from a cache:
alias Cache!(BgzfBlock, DecompressedBgzfBlock) BgzfBlockCache;
and Cache is part of BioD. One tricky aspect of Sambamba is that the design is intricate. In our crashing example we only use the simple BamReader which is defined in std.hts.bam.reader.d. We are using a default taskpool. In reader.d not much happens - it is almost all simple plumbing. std.hts.bam.read.d, meanwhile, represents the BAM format.
The Bgzf block processing happens in bio.core.bgzf.inputstream. The BamReader uses BgzfInputStream which has the functions fillNextBlock and setupReadBuffer. The constructor sets up a RoundBuf!BlockAux(n_tasks).
When I set n_tasks to a small number it no longer crashes!? The buffer

_task_buf = uninitializedArray!(DecompressionTask[])(n_tasks);

is a critical piece. Even increasing the dimension to n_tasks+2 is enough to remove most segfaults, but not all.
Remember it is defined as

alias Task!(decompressBgzfBlock, BgzfBlock, BgzfBlockCache) DecompressionTask;
DecompressionTask[] _task_buf;

with dimension n_tasks. Meanwhile BlockAux is

static struct BlockAux {
    BgzfBlock block;
    ushort skip_start;
    ushort skip_end;
    DecompressionTask* task;
    alias task this;
}
Injecting this code

struct DecompressedBgzfBlock {
    ~this() {
        stderr.writeln("destroy DecompressedBgzfBlock ", start_offset, ":",
                       end_offset, " ", decompressed_data.sizeof);
    }
    ulong start_offset;
    ulong end_offset;
    ubyte[] decompressed_data;
}
It is interesting to see that even when not segfaulting the block offsets look corrupted:
destroy DecompressedBgzfBlock 0:0 16
destroy DecompressedBgzfBlock 0:0 16
destroy DecompressedBgzfBlock 0:0 16
destroy DecompressedBgzfBlock 89554:139746509800748 16
destroy DecompressedBgzfBlock 140728898420736:139748327903664 16
destroy DecompressedBgzfBlock 107263:124653 16
destroy DecompressedBgzfBlock 89554:107263 16
destroy DecompressedBgzfBlock 71846:89554 16
destroy DecompressedBgzfBlock 54493:71846 16
destroy DecompressedBgzfBlock 36489:54493 16
destroy DecompressedBgzfBlock 18299:36489 16
destroy DecompressedBgzfBlock 104:18299 16
destroy DecompressedBgzfBlock 0:104 16
and I am particularly suspicious about this piece of code in inputstream.d where task gets allocated and the resulting buffer gets copied to the roundbuffer. This is a hack, no doubt about it:
DecompressionTask tmp = void;
tmp = scopedTask!decompressBgzfBlock(b.block, _cache);
auto t = _task_buf.ptr + _offset / _max_block_size;
import core.stdc.string : memcpy;
memcpy(t, &tmp, DecompressionTask.sizeof);
b.task = t;
_tasks.put(b);
_pool.put(b.task);
and probably the reason why decompressBgzfBlock gets corrupted, followed by sending the GC into a tail spin when it kicks in. Artem obviously designed it this way to prevent allocating memory for the task, but I think he went a little too far here! One thing I tried earlier, and have to try again, is getting rid of that copying.
First of all,

alias Task!(decompressBgzfBlock, BgzfBlock, BgzfBlockCache) DecompressionTask;

defines DecompressionTask as calling decompressBgzfBlock with parameters, which returns a DecompressedBgzfBlock. Remember it bails out with this block. There is something else that is notable: it is actually the cached version that bails out.
Removing the cache code makes it run more reliably, but not completely. Also we are getting memory errors now:

destroy DecompressedBgzfBlock 4294967306:0 16
core.exception.InvalidMemoryOperationError@core/exception.d(702): Invalid memory operation
but that leaves no stack trace. Now we get
std.parallelism.Task_bio.core.bgzf.block.decompressBgzfBlock-DecompressedBgzfBlock-yieldForce
so the caching itself is not the problem. In the next phase we are going to address that dodgy memory copying by introducing a task managed by the GC instead of using the ScopedTask.
This is all happening in BgzfInputStream in inputstream.d, used by reader.d, which inherits from Stream. BamReader uses that functionality to iterate through the reads using D's popFront() design. Streams allow reading a variable based on its type, e.g. a BAM read. The BamReader fetches the necessary data from BgzfBlockSupplier with getNextBgzfBlock, which is used as the _bam variable. BamReader.reader itself returns an iterator.
It is interesting to note how the OOP design obfuscates what is going on. It is also clear that I have to fix BgzfInputStream in inputstream.d because it handles the tasks in the roundbuffer.
Part of sambamba's complexity is due to OOP and to having the threadpool running at the lowest level (unpacking bgzf). If I remove the threadpool there it means that threading will have to happen at a higher level. I.e., sambamba gets all its performance from multithreaded low level unpacking of data blocks. It is unusual, but it does have the (potential) advantage of leaving higher level code simple. I note with sambamba sort, however, Artem injected threads there too which begs the question what happens when you add different tasks to the same pool that have different timing characteristics. Be interesting to see the effect of using two task pools.
Looking at block.d again, BgzfBlock is defined as a struct containing a _buffer defined as

public ubyte[] _buffer = void;

and is used in (indeed) block.d and inputstream.d only. The use of struct means that BgzfBlock gets allocated on the stack. Meanwhile _buffer gets pointed into the uncompressed buffer, which in turn is a slice of uncompressed_buf that is also on the stack in decompressBgzfBlock (surprise, surprise) and gets assigned right before returning:

block._buffer[0 .. block.input_size] = uncompressed[];
Now, the cached version (which I disabled) actually does a copy to the heap:

BgzfBlock compressed_bgzf_block = block;
compressed_bgzf_block._buffer = block._buffer.dup;
DecompressedBgzfBlock decompressed_bgzf_block;
with (decompressed_bgzf_block) {
    start_offset = block.start_offset;
    end_offset = block.end_offset;
    decompressed_data = uncompressed[].dup;
}
cache.put(compressed_bgzf_block, decompressed_bgzf_block);
Artem added a comment
/// A buffer is used to reduce number of allocations.
///
/// Its size is max(cdata_size, input_size)
/// Initially, it contains compressed data, but is rewritten
/// during decompressBgzfBlock -- indeed, who cares about
/// compressed data after it has been uncompressed?
Well, maybe the GC does! Or maybe the result does not fit the same buffer. Hmmm. If you go by the values

destroy DecompressedBgzfBlock 140728898420736:139748327903664 16

you can see the BgzfBlock is compromised by a stack overwrite.
Changing the struct to a class fixes the offsets
destroy DecompressedBgzfBlock 104:18299 16
destroy DecompressedBgzfBlock 18299:36489 16
destroy DecompressedBgzfBlock 36489:54493 16
so things start to look normal. But it still segfaults on DecompressedBgzfBlock on GC in yieldForce:

std.parallelism.Task_bio.core.bgzf.block.decompressBgzfBlock-DecompressedBgzfBlock-yieldForce

Somewhere in decompressBgzfBlock the underpinning data structure is corrupt.
The problem has to be with struct BgzfBlock, struct BgzfBlockAux and the Roundbuf. BgzfBlockAux goes on a Roundbuf. I was thinking last night that there may be a struct size problem. The number of tasks in the roundbuffer should track the number of threads. Increasing the size of the roundbuffer makes a crash take longer.
Well, I hit the jackpot! After disabling

_task_buf = uninitializedArray!(DecompressionTask[])(n_tasks);

there are no more segfaults. I should have looked at that more closely. I can only surmise that because the contained objects contain pointers (they do) the GC gets confused because it occasionally finds something that looks like a valid pointer. Using uninitializedArray on an object that includes a pointer reference is dangerous.
Success!! That must have taken me at least three days of work to find this bug, one of the most elusive bugs I have encountered. Annoyingly, I was so close earlier when I expanded the size of n_tasks! The segfault got triggered by a new implementation of D's garbage collector. Pointer space is tricky and it shows how careful we have to be with non-initialized data.
Now what is the effect of disabling the cache and making more use of garbage collected structures (for each Bgzf block)? User time went down and CPU usage too, but wall clock time nudged up. Memory use also went up by 25%. The garbage collector kicked in twice as often. This shows Artem's aggressive avoidance of the garbage collector does have impact and I'll have to revert on some of my changes now. Note that releasing the current version should be OK, the performance difference does add to overall time and energy use. Even 10% of emissions is worth saving with tools run at this scale.
So, I started reverting on changes and after reverting two items:
- Sambamba
  - [-] speed test
    + [X] revert on class DecompressedBgzfBlock to struct
    + [X] revert on auto buf2 = (block._buffer[0 .. block.input_size]).dup;
    + [ ] revert on DecompressionTask[] _task_buf;
    + [ ] revert on tmp = scopedTask!decompressBgzfBlock(b.block)
    + [ ] revert on Cache
Sambamba is very similar to the last release. The new release is very slightly slower but uses less RAM. I decided not to revert on using the roundbuf, scopedTask and Cache because each of these introduces complexity with no obvious gain. v0.7.1 will be released after I update release notes and final tests.
23 It has been almost a year! And a new job.
A lot can happen in a year. After my last BLOG on the GEMMA release I got a job offer from the University of Tennessee and we moved in March to the USA. The coming years I am writing software and some grant proposals. The focus is still on GeneNetwork, genetic mapping and genomics. Some things changed, but I am working on GEMMA, sambamba, the variant graph and all that… One major shift is from Python to Racket (Scheme, a Lisp). I never liked Python that much (coming from Ruby and Perl earlier), but now I had the excuse to ditch Python. Racket rocks in so many ways! Others have written about that, so I am not going to repeat it, but the main thing is that I finally found a language that I can use for another 20 years, after trying many alternatives including Scala, Elixir, Erlang etc. Ironically, Lisp is one of the oldest languages in use today. Go figure. For performance code I have D and we may replace some of that with Rust. It looks like it is Racket and D and perhaps Rust for me. I have come home to those choices after all these years.
I am writing a REST service which needs to maintain some state. The built-in Racket server has continuations - which is rather nice! Racket also has support for Redis, SQLite (nice example) and a simple key-value interface (ini-style) which I can use for sessions.
Ironically, two weeks after writing above I was hit by a car on my bicycle in Memphis. 14 fractures and a ripped AC joint. It took surgery and four months to feel halfway normal again…
24 Speeding up K
GEMMA has been released with many bug fixes and speedups. Now it is time to focus on further speedups with K (and GWA after). There are several routes to try this. One possibility is to write our own matrix multiplication routine in D. My hunch is that because computing K is a special case we can get some significant speedups compared to a standard matrix multiplication for larger matrices. Directions we can go are:
- Stop using BLAS and create multi-core dot-product based multiplication that makes use of
  - multi-threaded decompression
  - start compute while reading data
  - use aligned rows only for dot product (less CPU cache misses)
  - compute half the result (K is symmetric!)
  - heavy use of threading
  - use AVX2 optimizations
  - use floats instead of doubles (we can do this with hardware checking)
  - chromosome-based computations (see below for LOCO)
- Use tensor routines to reduce RAM IO
In other words, quite a few possible ways of improving things. It may be that the current BLAS routine is impossible to beat for our data, but there is only one way to find out: by trying.
The first step, you would think, is simple: take the genotype data, pass it to calckinship(g) and get K back. Unfortunately GEMMA intertwines reading the genotype file, the computation of K in steps, and scaling the matrix - all in one, with the code duplicated for PLINK and BIMBAM formats. Great. Now I don't feel like writing any more code in C++ and, fortunately, Prasun has done some of the hard work in faster-lmm-d already. So the first step is to parse the BIMBAM format using an iterator. We'll use this to load the genotype data in RAM (we are not going to assume memory restrictions for now).
That genotype decompression and reading part is done now and I added it to BioD's decompress.d. Next I added a tokenizer which is a safe replacement for the strtok GEMMA uses.
Using BLAS I added a full K computation after reading the genotype file, which is on par (speed-wise) with the current GEMMA implementation. The GEMMA implementation should be more optimal because it splits the matrix computation into smaller blocks and starts computing while streaming the genotype data. Prasun did an implementation of that in gemmakinship.d which is probably faster than my current version. Even so, I'll skip that method for now, until I am convinced that none of the above dot-product optimizations pan out. Note that only a fraction of the time is used to do the actual matrix multiplication.
2018-10-23T09:53:38.299:api.d:flmmd_compute_bimbam_K:37 GZipbyLine
2018-10-23T09:53:43.911:kinship.d:kinship_full:23 Full kinship matrix used
Reading the file took 6 seconds.
2018-10-23T09:53:43.911:dmatrix.d:slow_matrix_transpose:57 slow_matrix_transpose
2018-10-23T09:53:44.422:blas.d:matrix_mult:48 matrix_mult
2018-10-23T09:53:45.035:kinship.d:kinship_full:33 DONE rows is 1940 cols 1940
Transpose + GxGT took 1 second. The total for commit 6e05143d717d30eb0b0157f8fd9829411f4cf2a0:

real 0m9.613s
user 0m13.504s
sys 0m1.476s
I also wrote a threaded decompressor, and it is slightly slower: gunzip is so fast that spawning threads adds more overhead than it saves.
In this example reading the file dominates the time, but with LOCO we get ~20 K computations (one for each chromosome), so it makes sense to focus on K. For model species, for the foreseeable future, we'll be looking at thousands of individuals, so it is possible to hold all K matrices in RAM. Which also means we can fill them using chromosome-based dot products. Another possible optimization.
Next steps are to generate output for K and reproduce GEMMA's output (I also need to filter on missing phenotype data). The idea is to compute K and the LMM in one step, followed by developing the LOCO version as a first-class citizen. Based on the above metrics we should be able to reduce LOCO K for a dataset of this size from 7 minutes to 30 seconds(!), even without any major optimizations. A sketch of how the chromosome-based LOCO computation could work follows.
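Because K is a sum of per-SNP dot products, a LOCO kinship does not need a full recompute per chromosome: accumulate an unscaled K_c per chromosome once, then derive each leave-one-chromosome-out matrix by summing the others. A minimal sketch under those assumptions (locoK and the data layout are made up for illustration, not GEMMA code):

// kByChr holds the unscaled per-chromosome dot-product sums
// (n x n each); mByChr holds the marker count per chromosome.
// The LOCO kinship for chromosome chr sums all other chromosomes
// and rescales once by the number of remaining markers.
double[][] locoK(const double[][][string] kByChr,
                 const size_t[string] mByChr, string chr) {
    size_t mLoco = 0;
    foreach (c, m; mByChr)
        if (c != chr) mLoco += m;

    auto n = kByChr[chr].length;
    auto k = new double[][](n, n);
    foreach (i; 0 .. n)
        foreach (j; 0 .. n) {
            double sum = 0.0;
            foreach (c, kc; kByChr)
                if (c != chr) sum += kc[i][j];  // skip the left-out chromosome
            k[i][j] = sum / mLoco;              // rescale once at the end
        }
    return k;
}

With ~20 chromosomes this turns 20 full-genome multiplications into one pass over the genotypes plus cheap matrix additions.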
25 MySQL to MariaDB
The version we are running today:
mysql --version
mysql  Ver 14.12 Distrib 5.0.95, for redhat-linux-gnu (x86_64) using readline 5.1
26 MySQL backups (stage2)
Backup to AWS.
env AWS_ACCESS_KEY_ID=* AWS_SECRET_ACCESS_KEY=* RESTIC_PASSWORD=genenetwork /usr/local/bin/restic --verbose init /mnt/big/mysql_copy_from_lily -r s3:s3.amazonaws.com/backupgn
env AWS_ACCESS_KEY_ID=* AWS_SECRET_ACCESS_KEY=* RESTIC_PASSWORD=genenetwork /usr/local/bin/restic backup /mnt/big/mysql_copy_from_lily -r s3:s3.amazonaws.com/backupgn
I also added backups for the genotype files and ES.
27 MySQL backups (stage1)
Before doing any serious work on MySQL I decided to create some backups. Lily is going to be the master for now, so the logical backup is on P1, which has a large drive. Much of the space is taken up by a running MySQL server (which is updated!) and a data file by Lei named 20151028dbsnp containing a liftover of dbSNP142. First I simply compressed the input files rather than throw them away.
Because ssh is so old on lily I can't log in nicely from Penguin, but the other way around works. This is temporary, as the mysql user:
ssh -i /var/lib/mysql/.ssh/id_rsa pjotr@penguin
The script (run as a weekly cron job for now, as the mysql user) becomes
rsync -va /var/lib/mysql --rsync-path=/usr/bin/rsync -e "ssh -i /var/lib/mysql/.ssh/id_rsa" pjotr@penguin:/mnt/big/mysql_copy_from_lily/
This means we have a weekly backup for now. I'll improve it with more disk space and MariaDB so we can have incrementals.
Actually, it looks like only two tables really change
-rw-rw---- 1 pjotr mysql 29G Aug 16 18:59 ProbeSetData.MYD
-rw-rw---- 1 pjotr mysql 36G Aug 16 19:00 ProbeSetData.MYI
which are not that large.
To make incrementals we are opting for restic (https://restic.readthedocs.io/). It looks modern and has interesting features.
env RESTIC_PASSWORD=genenetwork /usr/local/bin/restic init -r /mnt/big/backup_restic_mysql/
now backups can be generated incrementally with
env RESTIC_PASSWORD=genenetwork /usr/local/bin/restic --verbose backup /mnt/big/mysql_copy_from_lily -r /mnt/big/backup_restic_mysql
To list snapshots (directory format)
env RESTIC_PASSWORD=genenetwork /usr/local/bin/restic -r backup_restic_mysql/ snapshots
Check
env RESTIC_PASSWORD=genenetwork /usr/local/bin/restic -r backup_restic_mysql/ check
Prune removes unneeded and duplicate data from the repository, saving space
env RESTIC_PASSWORD=genenetwork /usr/local/bin/restic -r backup_restic_mysql/ prune
So, now we can do a daily backup from Lily and have incrementals stored on Penguin too. My cron reads
40 5 * * * env RESTIC_PASSWORD=genenetwork /usr/local/bin/restic --verbose backup /mnt/big/mysql_copy_from_lily -r /mnt/big/backup_restic_mysql|mail -s "MySQL restic backup" pjotr2017@thebird.nl
Restic can push to AWS S3 buckets. That is the next step (planned).
28 Migrating GN1 from EC2
GN1 is costing us $600+ per month on EC2. With our shiny new server we should move it back into Memphis. The main problem is that the base image is
Linux version 2.6.18-398.el5 (mockbuild@builder17.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-55)) #1 SMP Tue Sep 16 20:50:52 EDT 2014
I mean, seriously, ten years old?
Compared to GN2, GN1 should be simpler to deploy. The main problem is that it requires Python 2.4 - currently. Not sure why that is.
Some system maintenance scripts live here. There is a Docker image by Artem here. And there are older notes referred to in the Artem doc. All of that is about GN2 really. So far, I have not found anything on GN1. In /usr/lib/python2.4/site-packages I find the following modules:
Alacarte elementtree gmenu.so htmlgen iniparse invest json libsvn nose numarray piddle pirut pp.py pptransport.py ppworker.py pyx pyXLWriter reaper.so sos svg urlgrabber
nothing too serious. Worse is that GN1 uses mod_python, which went out of vogue before 2013. The source code is still updated - you see, once things are out there…
This means we'll need to deploy Apache with mod_python in the VM. The installation on Lily pulls in 50+ Apache modules. Argh. If we want to support this… The good news is that most modules come standard with Apache - and I think we can disable a lot of them.
29 Fixing Gunicorn in use
DEBUG:db.webqtlDatabaseFunction:.retrieve_species: retrieve_species result:: mouse
DEBUG:base.data_set:.create_dataset: dataset_type: ProbeSet
[2018-07-16 16:53:51 +0000] [4185] [ERROR] Connection in use: ('0.0.0.0', 5003)
[2018-07-16 16:53:51 +0000] [4185] [ERROR] Retrying in 1 second.
30 Updating ldc with latest LLVM
Updating ldc to latest leaves only 5 tests failing! That is rather good.
99% tests passed, 5 tests failed out of 1629

Total Test time (real) = 4135.71 sec

The following tests FAILED:
        387 - std.socket (Failed)
        785 - std.socket-debug (Failed)
        1183 - std.socket-shared (Failed)
        1581 - std.socket-debug-shared (Failed)
        1629 - lit-tests (Failed)

build/Testing/Temporary/LastTest.log: core.exception.AssertError@std/socket.d(456): Assertion failure
build/Testing/Temporary/LastTest.log: core.exception.RangeError@std/socket.d(778): Range violation

and

Failing Tests (1):
    LDC :: plugins/addFuncEntryCall/testPlugin.d
With these fixed, ldc 1.10.0 compiles on LLVM 3.8!
31 Fixing sambamba
We were getting this new error
/gnu/store/4snsi4vg06bdfi6qhdjfbhss16kvzxj7-ldc-1.10.0/include/d/std/numeric.d(1845): Error: read-modify-write operations are not allowed for shared variables. Use core.atomic.atomicOp!"+="(s, e) instead.
which was triggered by the normalize function
bool normalize(R)(R range, ElementType!(R) sum = 1)
the normalize function scales the range values so that they sum to sum, e.g.,
auto a = [ 1.0, 3.0 ];
assert(normalize(a));
assert(a == [ 0.25, 0.75 ]);
So the question is why sambamba needs it to be a shared range. I modified the code to normalize first and cast to shared afterwards.
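In outline the workaround looks like this (a sketch of the idea, not the literal sambamba patch; normalizedShared is a made-up name):

import std.numeric : normalize;

// std.numeric.normalize does read-modify-write on its argument,
// which the compiler rejects for shared data. So: normalize a
// private copy and only publish it as shared once the writes are done.
shared(double[]) normalizedShared(const(double)[] values) {
    auto tmp = values.dup;    // private, unshared working copy
    normalize(tmp);           // in-place scaling is allowed here
    return cast(shared) tmp;  // hand the result to the shared world
}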
The next one was
BioD/bio/maf/reader.d(53): Error: cannot implicitly convert expression `this._f.byLine(cast(Flag)true, '\x0a')` of type `ByLineImpl!(char, char)` to `ByLine!(char, char)`
also reported on https://github.com/bioconda/bioconda-recipes/pull/8787#issuecomment-389195848
After fixing the compile-time problems, the tests failed for view, view -f unpack, subsample and sort(?!). In fact every md5sum we test failed. For view it turned out the order of the output differed. view -f sam returns identity, also when converting back from BAM, so it had to be something in the header. The header (apparently) contains the version of sambamba!
ldc2 -wi -I. -IBioD -IundeaD/src -g -O3 -release -enable-inlining -boundscheck=off -of=bin/sambamba bin/sambamba.o utils/ldc_version_info_.o htslib/libhts.a /usr/lib/x86_64-linux-gnu/liblz4.a -L-L/home/travis/dlang/ldc-1.10.0/lib -L-L/usr/lib/x86_64-linux-gnu -L-lrt -L-lpthread -L-lm -L-llz4

bin/sambamba.o: In function `_D5utils3lz426createDecompressionContextFZPv':
/home/travis/build/biod/sambamba/utils/lz4.d:199: undefined reference to `LZ4F_createDecompressionContext'
/home/travis/build/biod/sambamba/utils/lz4.d:200: undefined reference to `LZ4F_isError'
(...)

objdump -T /usr/lib/x86_64-linux-gnu/liblz4.so|grep LZ4F
000000000000cd30 g    DF .text  00000000000000f6  Base        LZ4F_flush
000000000000ce30 g    DF .text  0000000000000098  Base        LZ4F_compressEnd
000000000000c520 g    DF .text  00000000000002f5  Base        LZ4F_compressBegin
000000000000c4a0 g    DF .text  000000000000003f  Base        LZ4F_createCompressionContext
000000000000dd60 g    DF .text  00000000000000ee  Base        LZ4F_getFrameInfo
000000000000ced0 g    DF .text  00000000000002eb  Base        LZ4F_compressFrame
000000000000c4e0 g    DF .text  0000000000000033  Base        LZ4F_freeCompressionContext
000000000000c470 g    DF .text  0000000000000029  Base        LZ4F_getErrorName
000000000000c460 g    DF .text  000000000000000a  Base        LZ4F_isError
000000000000c8a0 g    DF .text  00000000000000e3  Base        LZ4F_compressFrameBound
000000000000d1c0 g    DF .text  0000000000000038  Base        LZ4F_createDecompressionContext
000000000000d200 g    DF .text  0000000000000037  Base        LZ4F_freeDecompressionContext
000000000000c990 g    DF .text  0000000000000396  Base        LZ4F_compressUpdate
000000000000d240 g    DF .text  0000000000000b1d  Base        LZ4F_decompress
000000000000c820 g    DF .text  000000000000007d  Base        LZ4F_compressBound
32 Trapping NaNs
When a floating point computation results in a NaN/inf/underflow/overflow, GEMMA should stop. The GSL has a generic function gsl_ieee_env_setup() which works through setting an environment variable. Not exactly useful, because I want GEMMA to trap with just a -check switch.
GNU glibc has the feenableexcept() function for this, which did not work immediately: it turned out I needed to include fenv.h before the GSL headers:
#include <fenv.h>
#include "gsl/gsl_matrix.h"
Enabling FP checks returns
Thread 1 "gemma" received signal SIGFPE, Arithmetic exception. 0x00007ffff731983d in ieeeck_ () from /home/wrk/opt/gemma-dev-env/lib/libopenblas.so.0 (gdb) bt #0 0x00007ffff731983d in ieeeck_ () from /home/wrk/opt/gemma-dev-env/lib/libopenblas.so.0 #1 0x00007ffff6f418dc in dsyevr_ () from /home/wrk/opt/gemma-dev-env/lib/libopenblas.so.0 #2 0x0000000000489181 in lapack_eigen_symmv (A=A@entry=0x731a00, eval=eval@entry=0x731850, evec=evec@entry=0x7317a0, flag_largematrix=<optimized out>) at src/lapack.cpp:195 #3 0x00000000004897f8 in EigenDecomp (flag_largematrix=<optimized out>, eval=0x731850, U=U@entry=0x7317a0, G=G@entry=0x731a00) at src/lapack.cpp:232 #4 EigenDecomp_Zeroed (G=G@entry=0x731a00, U=U@entry=0x7317a0, eval=eval@entry=0x731850, flag_largematrix=<optimized out>) at src/lapack.cpp:248 #5 0x0000000000457b3a in GEMMA::BatchRun (this=this@entry=0x7fffffffcc30, cPar=...) at src/gemma.cpp:2598
Even for the standard dataset!
Turns out it is division by zero and FP underflow in lapack_eigen_symmv.
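As an aside, on the D side (faster-lmm-d) the same traps can be switched on portably through Phobos, no C headers needed. A minimal sketch, assuming x86 hardware trap support:

import std.math : FloatingPointControl;
import std.stdio : writeln;

void main() {
    // Turn invalid operations (NaN), division by zero and overflow
    // into hardware traps (SIGFPE), so bad FP math stops the program
    // instead of propagating silently.
    FloatingPointControl fpctrl;
    fpctrl.enableExceptions(FloatingPointControl.severeExceptions);

    double zero = 0;
    writeln(1.0 / zero);  // now raises SIGFPE instead of printing inf
}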
33 A gemma-dev-env package
I went through the difficulties of updating GNU Guix and writing a package that creates a GEMMA development environment. This was driven by the need for a really controlled dependency graph. It went wrong the last time we released GEMMA - witness "gemma got slower" - which led to the discovery that I had linked in a less performant lapack.
GNU Guix was not behaving because I had not updated it in a while, and I discovered at least one bug. Anyway, I have a working build system now, and in the coming weeks we will work on code to fix a number of GEMMA issues and bring out a new release.
34 Reviewing a CONDA package
Main achievement last week was getting GEMMA installed in a controlled fashion and proving performance is still up to scratch.
For JOSS I am reviewing a CONDA package for RNA-seq analysis. The author went to great lengths to make it easy to install with Bioconda, so I thought I'd have a go. GNU Guix has a conda bootstrap, so time to try that!
guix package -A conda
conda   4.3.16  out     gnu/packages/package-management.scm:704:2
and wants to install
/gnu/store/pj6d293c7r9xrc1nciabjxmh05z24fh0-Pillow-4.3.0.tar.xz
/gnu/store/m5prqxzlgaargahq5j74rnvz72yhb77l-python-olefile-0.44
/gnu/store/s9hzpsqf9zh9kb41b389rhmm8fh9ifix-python-clyent-1.2.1
/gnu/store/29wr2r35z2gnxbmvdmdbjncmj0d3l842-python-pytz-2017.3
/gnu/store/jv1p8504kgwp22j41ybd0j9nrz33pmc2-python-anaconda-client-1.6.3.tar.gz
/gnu/store/l6c40iwss9g23jkla75k5f1cadqbs4q5-python-dateutil-2.6.1
/gnu/store/y4h31l8xj4bd0705n0q7a8csz6m1s6s5-python-pycosat-0.6.1
/gnu/store/8rafww49qk2nxgr4la9i2v1yildhrvnm-python-cookies-2.2.1
/gnu/store/s5d94pbsv779nzi30n050qdq9w12pi52-python-responses-0.5.1
/gnu/store/kv8nvhmb6h3mkwyj7iw6zrnbqyb0hpld-python-conda-4.3.16.tar.gz
/gnu/store/cns9xhimr1i0fi8llx53s8kl33gsk3c4-python-ruamel.yaml-0.15.35
The CONDA package in Guix is a bit older - it turns out CONDA has a ridiculous release rate. Let's try the older CONDA first
guix package -i conda
And
conda config --add channels defaults
Warning: 'defaults' already in 'channels' list, moving to the top
conda config --add channels conda-forge
Warning: 'conda-forge' already in 'channels' list, moving to the top
conda config --add channels bioconda
Warning: 'bioconda' already in 'channels' list, moving to the top
so that was all OK. Next
conda install -c serine rnasik

Package plan for installation in environment /gnu/store/h260m1r0bgnyypl7r469lin9gpyrh12m-conda-4.3.16:

The following NEW packages will be INSTALLED:

    asn1crypto:                    0.24.0-py_1          conda-forge
    backports:                     1.0-py36_1           conda-forge
    backports.functools_lru_cache: 1.5-py_1             conda-forge
    bedtools:                      2.25.0-3             bioconda
    bigdatascript:                 v2.0rc10-0           serine
    bwa:                           0.7.17-pl5.22.0_2    bioconda
    bzip2:                         1.0.6-1              conda-forge
    ca-certificates:               2018.4.16-0          conda-forge
    certifi:                       2018.4.16-py36_0     conda-forge
    cffi:                          1.11.5-py36_0        conda-forge
    chardet:                       3.0.4-py36_2         conda-forge
    click:                         6.7-py_1             conda-forge
    colormath:                     3.0.0-py_2           conda-forge
    conda:                         4.5.8-py36_1         conda-forge
    conda-env:                     2.6.0-0              conda-forge
    cryptography:                  2.2.1-py36_0         conda-forge
    curl:                          7.60.0-0             conda-forge
    cycler:                        0.10.0-py_1          conda-forge
    dbus:                          1.11.0-0             conda-forge
    decorator:                     4.3.0-py_0           conda-forge
    expat:                         2.2.5-0              conda-forge
    fastqc:                        0.11.5-pl5.22.0_3    bioconda
    fontconfig:                    2.12.1-4             conda-forge
    freetype:                      2.7-1                conda-forge
    future:                        0.16.0-py_1          conda-forge
    gettext:                       0.19.8.1-0           conda-forge
    glib:                          2.53.5-1             conda-forge
    gst-plugins-base:              1.8.0-0              conda-forge
    gstreamer:                     1.8.0-1              conda-forge
    hisat2:                        2.1.0-py36pl5.22.0_0 bioconda
    icu:                           58.2-0               conda-forge
    idna:                          2.7-py36_2           conda-forge
    je-suite:                      2.0.RC-0             bioconda
    jinja2:                        2.10-py_1            conda-forge
    jpeg:                          9b-2                 conda-forge
    krb5:                          1.14.6-0             conda-forge
    libffi:                        3.2.1-3              conda-forge
    libgcc:                        5.2.0-0
    libiconv:                      1.14-4               conda-forge
    libpng:                        1.6.34-0             conda-forge
    libssh2:                       1.8.0-2              conda-forge
    libxcb:                        1.13-0               conda-forge
    libxml2:                       2.9.5-1              conda-forge
    lzstring:                      1.0.3-py36_0         conda-forge
    markdown:                      2.6.11-py_0          conda-forge
    markupsafe:                    1.0-py36_0           conda-forge
    matplotlib:                    2.1.0-py36_0         conda-forge
    mkl:                           2017.0.3-0
    multiqc:                       1.5-py36_0           bioconda
    ncurses:                       5.9-10               conda-forge
    networkx:                      2.0-py36_1           conda-forge
    numpy:                         1.13.1-py36_0
    openjdk:                       8.0.121-1
    openssl:                       1.0.2o-0             conda-forge
    pcre:                          8.41-1               conda-forge
    perl:                          5.22.0.1-0           conda-forge
    picard:                        2.18.2-py36_0        bioconda
    pip:                           9.0.3-py36_0         conda-forge
    pycosat:                       0.6.3-py36_0         conda-forge
    pycparser:                     2.18-py_1            conda-forge
    pyopenssl:                     18.0.0-py36_0        conda-forge
    pyparsing:                     2.2.0-py_1           conda-forge
    pyqt:                          5.6.0-py36_5         conda-forge
    pysocks:                       1.6.8-py36_1         conda-forge
    python:                        3.6.3-1              conda-forge
    python-dateutil:               2.7.3-py_0           conda-forge
    pytz:                          2018.5-py_0          conda-forge
    pyyaml:                        3.12-py36_1          conda-forge
    qt:                            5.6.2-3              conda-forge
    readline:                      6.2-0                conda-forge
    requests:                      2.19.1-py36_1        conda-forge
    rnasik:                        1.5.2-0              serine
    ruamel_yaml:                   0.15.35-py36_0       conda-forge
    samtools:                      1.5-1                bioconda
    setuptools:                    40.0.0-py36_0        conda-forge
    simplejson:                    3.8.1-py36_0         bioconda
    sip:                           4.18-py36_1          conda-forge
    six:                           1.11.0-py36_1        conda-forge
    skewer:                        0.2.2-1              bioconda
    spectra:                       0.0.11-py_0          conda-forge
    sqlite:                        3.13.0-1             conda-forge
    star:                          2.5.2b-0             bioconda
    subread:                       1.5.3-0              bioconda
    tk:                            8.5.19-2             conda-forge
    tornado:                       5.1-py36_0           conda-forge
    urllib3:                       1.23-py36_0          conda-forge
    wheel:                         0.31.1-py36_0        conda-forge
    xorg-libxau:                   1.0.8-3              conda-forge
    xorg-libxdmcp:                 1.1.2-3              conda-forge
    xz:                            5.2.3-0              conda-forge
    yaml:                          0.1.7-0              conda-forge
    zlib:                          1.2.11-0             conda-forge
That is a rather long list of packages, including OpenJDK. Then Conda errored out:
CondaIOError: IO error: Missing write permissions in: /gnu/store/h260m1r0bgnyypl7r469lin9gpyrh12m-conda-4.3.16

You don't appear to have the necessary permissions to install packages
into the install area '/gnu/store/h260m1r0bgnyypl7r469lin9gpyrh12m-conda-4.3.16'.
However you can clone this environment into your home directory and
then make changes to it. This may be done using the command:

$ conda create -n my_root --clone=/gnu/store/h260m1r0bgnyypl7r469lin9gpyrh12m-conda-4.3.16
OK, I suppose that is an idea, though it kinda defeats the idea of a reproducible base repo. But this worked:
conda create -n joss-review-583
conda install -n joss-review-583 -c serine rnasik
The good news is that conda installs in one directory. But 2.7 GB downloaded…
conda info --envs
# conda environments:
#
conda                    /home/wrk/.conda/envs/conda
joss-review-583          /home/wrk/.conda/envs/joss-review-583
root                  *  /gnu/store/h260m1r0bgnyypl7r469lin9gpyrh12m-conda-4.3.16
Activate the environment
source activate joss-review-583
RNAsik --help
00:00:00.000 Bds 2.0rc10 (build 2018-07-05 09:58), by Pablo Cingolani

Usage: RNAsik -fqDir </path/to/your/fastqs> [options]

main options
-fqDir <string>        : path to fastqs directory, can be nested
-align <string>        : pick your aligner [star|hisat|bwa]
-refFiles <string>     : directory with reference files
-paired <bool>         : paired end data [false], will also set pairIds = "_R1,_R2"
-all <bool>            : short hand for counts, mdups, exonicRate, qc, cov and multiqc

more options
-gtfFile <string>      : path to refFile.gtf
-fastaRef <string>     : path to refFile.fa
-genomeIdx <string>    : genome index
-counts <bool>         : do read counts [featureCounts]
-mdups <bool>          : process bam files, sort and mark dups [picard]
-qc <bool>             : do bunch of QCs, fastqc, picard QCs and samtools
-exonicRate <bool>     : do Int(ra|er)genic rates [qualiMap]
-multiqc <bool>        : do MultiQC report [multiqc]
-trim <bool>           : do FASTQ trimming [skewer]
-cov <bool>            : make coverage plots, bigWig files
-umi <bool>            : deduplicates using UMIs

extra configs
-samplesSheet <string> : tab delimited file, each line: old_prefix \t new_prefix
-outDir <string>       : output directory [sikRun]
-extn <string>         : specify FASTQ files extension [.fastq.gz]
-pairIds <string>      : specify read pairs, [none]
-extraOpts <string>    : add extra options through a file, each line: toolName = options
-configFile <string>   : specify custome configuration file
I may be critical of CONDA, but this works ;)
Now I tried on a different machine and there was a problem on activate, where the environment bumped me out of the shell. Hmmm. The conda settings after activate are:
CONDA_DEFAULT_ENV=joss-review-583
CONDA_PATH_BACKUP=/home/wrk/opt/gemma-dev-env/bin:/usr/bin:/bin
CONDA_PREFIX=/home/wrk/.conda/envs/joss-review-583
CONDA_PS1_BACKUP='\[\033[0;35m\]\h:\w\[\033[0m\]$ '
JAVA_HOME=/home/wrk/.conda/envs/joss-review-583
JAVA_HOME_CONDA_BACKUP=
PATH=/home/wrk/.conda/envs/joss-review-583/bin:/home/wrk/opt/gemma-dev-env/bin:/usr/bin:/bin
_CONDA_D=/home/wrk/.conda/envs/joss-review-583/etc/conda/activate.d
_CONDA_DIR=/home/wrk/opt/gemma-dev-env/bin
I guess I can replicate that
penguin2:~$ export JAVA_HOME=$HOME/.conda/envs/joss-review-583
penguin2:~$ export CONDA_PREFIX=$HOME/.conda/envs/joss-review-583
penguin2:~$ export PATH=$HOME/.conda/envs/joss-review-583/bin:$PATH
conda install -n joss-review-583 -c bioconda qualimap
wget bioinformatics.erc.monash.edu/home/kirill/sikTestData/rawData/IndustrialAntifoamAgentsYeastRNAseqData.tar
35 Updates
It has been a while since I updated the BLOG (see the older BLOGs below). Time to start afresh, because we have interesting developments going on and the time ahead looks particularly exciting for GEMMA and GeneNetwork, with adventures in D, CUDA, Arvados, Jupyter labs, IPFS, blockchain - and the list goes on! I also promised to write a BLOG on our development/deployment setup. Might as well start here. My environments are very complex, but controlled thanks to GNU Guix.
36 Older BLOGS
Work on LOCO, faster-lmm-d, Intel Phi and packaging with Guix moved to here.
37 Even older BLOG
Older BLOG material on pylmm etc. can be found here.