2015-04-01 15:34 by pjotrp

1 D is a dragon, or why D matters for Bioinformatics

Ruby is a pony. Everyone loves a pony. Ruby is nice.

Scala is a thoroughbred. You know I like Scala - it is beautiful, and runs circles around the pony.

D is a dragon. Very powerful, and somewhat unpredictable.

The programming languages Ruby, Python, R and Perl have proven very popular in bioinformatics. They are interpreted, dynamically typed languages, and they are all great at parsing and handling genomic information. Results are quick to get, and the development cycle can be gratifying. However, as the Computer Language Benchmarks Game shows, they are also rather slow, and hard to parallelize. It is not easy to get them to use the multiple cores everyone has.

Some newer languages, such as Scala and D, are not only statically typed, which has a real impact on performance, but are also very good at inferring types automatically. This means that coding in Scala or D feels similar to coding in a dynamically typed language. Also, Scala and D are OOP languages that embrace the functional programming paradigm. In practice, that means we get OOP goodness (and badness), together with constructs that make it safer and easier to parallelize code.
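To give a feel for what inferred types look like in D, here is a minimal sketch. The helper gcContent and the example reads are made up for illustration; only the Phobos calls (count, map, array, writeln) are real library functions.

```d
import std.algorithm : count, map;
import std.array : array;
import std.stdio : writeln;

// Hypothetical helper (not from any library): GC fraction of a DNA string.
double gcContent(string seq)
{
    if (seq.length == 0)
        return 0.0;
    // `auto` asks the compiler to infer the type (size_t here).
    auto gc = seq.count!(c => c == 'G' || c == 'C');
    return cast(double) gc / seq.length;
}

void main()
{
    auto reads = ["GATTACA", "GGCC", "ATAT"];    // inferred as string[]
    auto fractions = reads.map!gcContent.array;  // inferred as double[]
    writeln(fractions);
}
```

No type is spelled out in main, yet everything is checked at compile time: passing a number to gcContent would be rejected by the compiler rather than fail at run time.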

At this point bioinformaticians should sit up and prick up their ears: it is not much harder to program in Scala and D than in Ruby, Python, R and Perl. A little harder, yes, but the rewards can be great. Bioinformatics has entered the era of Big Data (Trelles and Prins, 2010), and we need parallelization and cores to get the work done. There is a lot of hype about the cloud, but with the current affordable large multi-core systems, a lot of programming and analysis can be handled on a single multi-core machine with a largish amount of memory. Scala and D allow fine-grained parallelization using high-level abstractions. These languages have immutable data and high performance message passing in the form of actors (e.g. Akka for Scala, std.concurrency for D), which makes it possible to use those cores.
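As a rough sketch of what using those cores can look like in D, the example below uses the standard library modules std.parallelism (a parallel map) and std.concurrency (message passing between threads). The functions countGC and worker, and the example reads, are hypothetical names chosen for illustration.

```d
import std.algorithm : count;
import std.concurrency : receiveOnly, send, spawn, thisTid, Tid;
import std.parallelism : taskPool;
import std.stdio : writeln;

// Hypothetical helper: number of G and C bases in a sequence.
size_t countGC(string seq)
{
    return seq.count!(c => c == 'G' || c == 'C');
}

// Worker thread: waits for one sequence, sends the count back to its owner.
void worker(Tid owner)
{
    auto seq = receiveOnly!string();
    owner.send(countGC(seq));
}

void main()
{
    auto reads = ["GATTACA", "GGCC", "ATAT"];

    // Data parallelism: amap runs countGC on the pool's worker threads
    // and returns the results in input order.
    auto counts = taskPool.amap!countGC(reads);
    writeln(counts);

    // Message passing: a string is immutable data, so it can be sent
    // to another thread without locks.
    auto tid = spawn(&worker, thisTid);
    tid.send(reads[0]);
    writeln(receiveOnly!size_t());
}
```

For three short reads the threading overhead outweighs any gain, of course; the point is that the parallel version differs from the sequential one by a single call.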

I wrote BioScala, and you know I love the language. Beautifully designed, and lovely, it would be a programmer's dream, were it not for the JVM.

With big data, performance matters. And whatever the Java love boys claim, the JVM is slower than directly compiled code. With C and D, carefully crafted code easily runs 4x, and often 10x, faster than the fastest JIT compiled JVM code. That is a significant difference. At 4x, it means buying a 32 core machine instead of a 128 core machine. It means running your code on a 250 machine cluster (or cloud) instead of on a 1000 machine cluster. It means waiting a month, instead of 4 months, for a calculation on the same setup. In short, run time performance matters with big data.

D matters for bioinformatics, as it is a next generation computer language with the performance of C or C++. It comes with garbage collection - and you can still use malloc for fine tuning. Anyone using C or C++, and I know who you are, should reconsider. While programming in Java feels to us like programming with our hands tied behind our back, C programming is rather a matter of repeatedly shooting oneself in the foot.

D is by far the nicer language, and gives you almost the productivity of Ruby or Python. I should know, because I have programmed in all the languages mentioned, and over the last year I chose to do all of my coding in Ruby, Python and D.

Why use D over Ruby or Python? Well, I say, don't stop using Ruby and Python any time soon. D complements them. Ruby, Python, R and Perl are great languages and can easily be bound to D code. Just like with C code, the functions are connected through a foreign function interface (FFI). Use each language to its full potential. D is for tight memory control, raw speed and parallelization. The others are there to churn out code in the simplest and quickest way. In time, however, you may find you use D for more than time critical code. For true software engineering a strong type system can be very beneficial. For more on bridging computer languages, check out my soon to be published Springer book chapter on "Sharing programming resources between Bio* projects through remote procedure call and native call stack strategies".
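On the D side, exposing a function over the C ABI can be as small as the sketch below. The function name count_gc is hypothetical. How it is then loaded (for instance with Python's ctypes or Ruby's ffi gem, after building a shared library) is left out here, and a library that relies on the garbage collector would also need the D runtime initialised by the caller.

```d
import std.stdio : writeln;

// Exported with C linkage and an unmangled name, so any language with a
// C FFI can call it. The name count_gc is made up for this example.
extern (C) int count_gc(const(char)* seq, size_t len) nothrow @nogc
{
    int n = 0;
    // Slicing the pointer gives a bounds-aware view of the caller's
    // buffer without copying it.
    foreach (c; seq[0 .. len])
        if (c == 'G' || c == 'C')
            ++n;
    return n;
}

// Only here to try the function out from D itself.
void main()
{
    auto s = "GATTACA";
    writeln(count_gc(s.ptr, s.length));
}
```

Marking the function nothrow and @nogc keeps it free of exceptions and garbage collector allocations, which is what makes it safe to call from a host that knows nothing about D.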

Why use D over Scala? Simply because D's performance rocks. Not only is it usually faster than code running on the JVM, it is also much easier to write low level, tweaked code. For a simple GFF3 file parser I (somewhat unexpectedly) managed to increase speed significantly by allocating often accessed data on the stack, rather than on the heap. Modern CPUs take advantage of that (stack data tends to stay in the processor cache). It is something the JVM does not let you control. Another power feature of D is slicing, which lets you refer to part of an array or string without copying it. The fastest XML parser in the world is written in D. Look it up.
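A small sketch of both ideas, slicing and stack allocation, follows. It is not the GFF3 parser mentioned above; firstField and baseCounts are hypothetical helpers written only to illustrate the two features.

```d
import std.stdio : writeln;
import std.string : indexOf;

// Returns the first tab-separated field as a slice of `line`.
// No characters are copied; the result points into the same memory.
string firstField(string line)
{
    auto tab = line.indexOf('\t');
    return tab < 0 ? line : line[0 .. tab];
}

// Counts A, C, G and T. The fixed-size array is a value type that lives
// in the stack frame, so no heap allocation takes place.
size_t[4] baseCounts(const(char)[] seq)
{
    size_t[4] counts; // zero-initialised
    foreach (c; seq)
    {
        switch (c)
        {
            case 'A': ++counts[0]; break;
            case 'C': ++counts[1]; break;
            case 'G': ++counts[2]; break;
            case 'T': ++counts[3]; break;
            default: break;
        }
    }
    return counts;
}

void main()
{
    auto line = "chr1\tsource\tgene";
    auto field = firstField(line);
    writeln(field);
    writeln(field.ptr is line.ptr); // true: the slice shares the line's memory
    writeln(baseCounts("GATTACA"));
}
```

A parser built this way can hand out fields of a line all day without allocating a single new string.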

Why is D a dragon? D's language is amazing, but not as carefully designed as Scala's. Scala is simply beautiful. D feels more clunky and can get in the way sometimes. I find its functional language implementation less intuitive than that of Ruby or Scala. Still, it works rather well, and it even has tail recursion.

And, next to raw speed, there are three areas where D beats Scala, I believe. First, the compiler is blazingly fast. Second, the template system (generics) is simpler and easier to understand. I really have trouble with Scala's advanced generics, which is not a good sign. Third, D code generation, or compile time evaluation, rocks. Another thing to look into. In a Scala book by David Pollak he gives an example of a computer game that featured in Why the lucky stiff's world (for Ruby insiders). What was enlightening to me was the code repetition in David's book, necessary to build the players. That would not be necessary with D's compile time evaluation.

There are a few things I miss in D. The first is pattern matching on unpacking data, which is great in Haskell, Erlang, and Scala. D has something similar for actor messages, so it may come to the main language. The second is that language elements (such as an if statement) do not always return values. I use that in Ruby and Scala all over the place, and it makes for shorter code.
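To make compile time evaluation and code generation a bit more concrete, here is a minimal sketch. It is not the game from David's book; the names makeComplementTable, reverseComplement, makePlayers, Rabbit and Monster are all invented for the example.

```d
import std.stdio : writeln;

// An ordinary function. Because it is used to initialise a module-level
// constant, the compiler runs it during compilation.
char[256] makeComplementTable()
{
    char[256] t;
    foreach (i, ref c; t)
        c = cast(char) i; // by default every byte maps to itself
    t['A'] = 'T'; t['T'] = 'A';
    t['C'] = 'G'; t['G'] = 'C';
    return t;
}

static immutable char[256] complement = makeComplementTable();
static assert(complement['A'] == 'T'); // checked by the compiler, not at run time

string reverseComplement(string seq)
{
    auto result = new char[seq.length];
    foreach (i, c; seq)
        result[$ - 1 - i] = complement[c];
    return result.idup;
}

// Code generation: build source text at compile time and mix it in,
// instead of writing one near-identical struct per player by hand.
string makePlayers(string[] names)
{
    string code;
    foreach (n; names)
        code ~= "struct " ~ n ~ " { int life = 10; }\n";
    return code;
}

mixin(makePlayers(["Rabbit", "Monster"]));

void main()
{
    writeln(reverseComplement("GATTACA"));
    Rabbit r;
    writeln(r.life);
}
```

The lookup table costs nothing at start-up, and adding a third player is a matter of adding one name to the list.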

Finally, some things that keep cropping up when I bring up D. First, the licensing issues. D, for historical reasons, was closed source. That is changing now, with D2 compilers becoming part of Fedora, and Debian to follow. Second, the schism and negativism among D1 users caused by the move to D2. That you'll find on the Internet. D2 is not compatible with D1, and that has caused grief. D2 was reinvented as the language designers progressed their ideas. If you want to read more about the excellent D2 language I strongly recommend Andrei Alexandrescu's book, The D Programming Language. It is a classic in its own right, even if you never get to appreciate the power of D.