2010-11-15 10:03 by pjotrp

1 GFF3 support in BioRuby

by Pjotr Prins

1.1 Introduction

In this document we explore existing GFF3 functionality in BioRuby, and we create a BioRuby plugin for extracting mRNA and CDS from a file. Not only do we want to allow for very large data files (so we should use memory sparingly), we also want to validate the data to some degree. Even though many files are generated from databases, validation not only points out problems in the source data, it may also reveal problems in the parser we created. For example, I wish to check that the CDS sections of a single mRNA do not overlap and that there are only standard feature types in the file.

Note that I have also looked at other GFF3 parsers, notably those of EMBOSS, BioPython, and BioPerl. They all load the full GFF data in memory (BioPython allows filtering on features, which helps reduce memory use somewhat). Still, the existing BioRuby edition offers comparable functionality. So, we might as well continue in my favourite dynamic language.

The current source code for the BioRuby plugin can be found here.

1.2 Existing BioRuby functionality

BioRuby supports reading GFF3 feature data from file into memory. However, the functionality is not well documented and we have to glean usage from the source code and the unit tests.

The unit tests are in test/unit/bio/db/test_gff.rb. The implementation is in lib/bio/db/gff.rb.

From the unit tests we learn that, with GFF3 input data

Shell
##gff-version 3
##sequence-region test01 1 400
test01	RANDOM	contig	1	400	.	+	.	ID=test01;Note=this is test
test01	.	mRNA	101	230	.	+	.	ID=mrna01;Name=testmRNA;Note=this is test mRNA
test01	.	mRNA	101	280	.	+	.	ID=mrna01a;Name=testmRNAalterative;Note=test of alternative splicing variant
test01	.	exon	101	160	.	+	.	ID=exon01;Name=exon01;Alias=exon 1;Parent=mrna01,mrna01a
test01	.	exon	201	230	.	+	.	ID=exon02;Name=exon02;Alias=exon 2;Parent=mrna01
test01	.	exon	251	280	.	+	.	ID=exon02a;Name=exon02a;Alias=exon 2a;Parent=mrna01a
test01	.	Match	101	123	.	.	.	ID=match01;Name=match01;Target=EST101 1 21;Gap=M8 D3 M6 I1 M6
##FASTA
>test01
ACGAAGATTTGTATGACTGATTTATCCTGGACAGGCATTGGTCAGATGTCTCCTTCCGTATCGTCGTTTA
GTTGCAAATCCGAGTGTTCGGGGGTATTGCTATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACA
CCATAGCGAACACGCCGCCCCGGTGGCGACTATCGGTCGAAGTTAAGACAATTCATGGGCGAAACGAGAT
AATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGT
GCCCAAGAATGCGATCCCAGAAGTCTTGGTTCTAAAGTCGTCGGAAAGATTTGAGGAACTGCCATACAGC
CCGTGGGTGAAACTGTCGACATCCATTGTGCGAATAGGCCTGCTAGTGAC

we can look at the records with

Ruby
  gff3 = Bio::GFF::GFF3.new(File.read(fn))
  gff3.records.each do | rec |
    p rec
  end
  gff3.sequences.each do | seq |
    p seq
  end

for example

Shell
ruby -I lib/ ./sample/gff2fasta.rb test/data/gff/test.gff3 
#--- First record
#<Bio::GFF::GFF3::Record:0xb7c6c64c @seqname="Contig1", @frame=nil,
@start=1001, @strand="+", @feature="transcript", @score=42.0,
@source="confirmed", @attributes=[["ID", "Transcript:trans-1"], ["Gene",
"abc-1"], ["Gene", "xyz-2"], ["Note", "function+unknown"]], @end=2000>
#--- Second record
#<Bio::GFF::GFF3::Record:0xb7c6c570 @seqname="Contig1", @frame=nil,
@start=1001, @strand="+", @feature="exon", @score=nil, @source="confirmed",
@attributes=[["ID", "Transcript:trans-1"]], @end=1100>
#--- Sequences
#<Bio::Sequence:0xb7c2b354 @entry_id="test01", @source_data=#<Bio::FastaFormat:0xb7c31574 
@entry_overrun=nil,
@data="\nACGAAGATTTGTATGACTGATTTATCCTGGACAGGCATTGGTCAGATGTCTCCTTCCGTATCGTCGTTTA\nGTTGCAAATCCGAGTGTTCGGGGGTATTGCTATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACA\nCCATAGCGAACACGCCGCCCCGGTGGCGACTATCGGTCGAAGTTAAGACAATTCATGGGCGAAACGAGAT\nAATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGT\nGCCCAAGAATGCGATCCCAGAAGTCTTGGTTCTAAAGTCGTCGGAAAGATTTGAGGAACTGCCATACAGC\nCCGTGGGTGAAACTGTCGACATCCATTGTGCGAATAGGCCTGCTAGTGAC\n\n",
@definition="test01">>

The sequence contains newlines - which looks a bit odd, as a GFF record carries start-stop coordinates that should be independent of line breaks. Also, notice that Bio::FastaFormat is contained inside Bio::Sequence; it duplicates the definition and @entry_id. These design choices follow from 'lazy' parsing.

To output the embedded FASTA we can write:

Ruby
  gff3 = Bio::GFF::GFF3.new(File.read(fn))
  gff3.sequences.each do | item |
    #--- item.to_fasta(item.entry_id, 70) is deprecated
    print item.output(:fasta)
  end

1.3 Pulling mRNA from GFF3

We want to pull mRNA from a GFF3 file and turn it into FASTA. Using the above methods we can load the file into memory and fetch each record. The relationships between records allow us to paste together the sequence: starting from the mRNA tag you can find the sequence information, and you have to put together the CDSs, or exons, that relate to the mRNA sequence, splicing out the introns.

On the BioRuby mailing list Tomoaki Nishiyama presented a script that parses a data file into memory and puts the mRNA together:

Ruby
require 'bio'

genearray=Array.new
mRNAhash=Hash.new
exonhash=Hash.new
tehash=Hash.new    # transposable elements (not handled yet)
lastid = ''
lastrecord = nil
ARGF.each_line do |gffline|
  record=Bio::GFF::GFF3::Record.new(gffline)
  feature_type = record.feature_type
  if(feature_type == 'gene')
    genearray << record.id
  elsif(feature_type == 'mRNA')
    parent = record.get_attribute('Parent')
    if mRNAhash[parent] == nil
      mRNAhash[parent] = [record]
    else 
      mRNAhash[parent] << record
    end
  elsif(feature_type == 'transposable_element')
    #--- not yet implemented
  elsif(feature_type == 'exon')
    parents = record.get_attributes('Parent')
    parents.each do |parent|
      if exonhash[parent] == nil
        exonhash[parent] = Array.new
      end
      exonhash[parent] << record
    end
  end
end
genearray.each do |gene|
  mRNAs=mRNAhash[gene]
  mRNAs.each do |mRNA|
    print "gene target=#{mRNA.seqname} name=#{mRNA.id} strand=#{mRNA.strand} range=#{mRNA.start},#{mRNA.end} ";
    exonary=exonhash[mRNA.id]
    exonary.each do |exon|
      print "exon=#{exon.start},#{exon.end} "
    end
    print "\n"
  end
end

It is interesting that this shows you don't have to use BioRuby's file loader; you can go line by line, as a single line is a record - something I will use below. Note also that the above script will go wrong when there is embedded FASTA. I also think using the Parent ID as an identifier may run into problems when subsections of different genes happen to carry the same names. For example, gene01 and gene02 may both have an mRNA named mRNA1, and the CDSs in each mRNA would then share the same parent ID. I don't know if this ever happens, but I would not be surprised. Another thing to note is that the above example cannot handle mRNA without parent IDs. Even in the BioRuby example above, you can see that testmRNA has no parent information. We will have to piece that together from the first column ID, known as seqname.

There is something else. I am interested in acquiring and concatenating CDSs for different proteins. With some data the CDSs are unknown, and it may well be that I want to use the known features - mRNA, introns and transcripts - to patch together a putative mRNA. In other words, if a gene has no defined CDS, I'll want the mRNA. If there is no mRNA I may want to do something more, for example harvest contig information for the mRNA. With circular genomes there is a region to deal with too.

1.4 Writing a GFF3 Plugin

The BioRuby GFF3 reader is not suitable for really large datasets, as all features and sequences are loaded into RAM. Note also that the current implementation is buffer based: the whole file input buffer is in memory at the same time. For example:

Ruby
    gff3 = Bio::GFF::GFF3.new(File.read(TEST1))

We would like to do better.

These days BioRuby supports external plugins. Create a git repository, and it can be downloaded and loaded by an end-user with a 'bioruby install URL' command. For now we merely use the include path.

The advantage of BioRuby plugins is that you don't have to worry too much about BioRuby standards, and anyone can use the functionality from the word go.

The GFF3 plugin is on github (http://github.com/pjotrp/bioruby-gff3-plugin) and contains a script ./bin/gff3-fetch.

The first version uses the default BioRuby reader - which loads everything in memory. It is a good rule to always start with the simplest implementation.

What we want to do is load multiple GFF trees into a database, named GFFdb, and parse each element for mRNA or CDS records. Finally we want to export the data, for example in FASTA format.

1.4.1 Creating a database

To fetch mRNA we create a mixin that pulls mRNA and CDS from the DB and stores them in tables, similar to what the above code does:

Ruby
  module MRNA
    # Digest mRNA from the GFFdb and store in Hash
    # Next yield(id, seq) from Hash
    def each_mRNA
      genes = []   # Store gene ids
      mrnas = LinkedRecs.new   # Store linked mRNA records
      cdss  = LinkedRecs.new
      @gffs.each do | gff |
        gff.records.each do | rec |
          seqnames.add(rec.seqname)
          id = rec.id
          ids.add(id)
          case rec.feature_type
            when 'gene', 'SO:0000704' then genes << id
            when 'mRNA', 'SO:0000234' then mrnas.add(rec)
            when 'CDS',  'SO:0000316' then cdss.add(rec)
            when 'exon', 'SO:0000147' then false  # recognised, but not stored here
            else
              if !IGNORE_FEATURES.include?(rec.feature_type)
                ignored_features[rec.feature_type] = true
              end
          end
        end
      end
      # validate CDS sections do not overlap
      cdss.validate
      puts "---- Yield each mRNA"
      mrnas.each do | k, v |
        yield k,v
      end
      cdss.each do | k, v |
        yield k,v
      end
    end
  end

Note that we create one database of multiple GFF trees. This may be convenient later when we want to query multiple genomes.

1.4.2 Validation

Because we added validation of the parent ID, we find there are mRNAs and CDSs without a parent:

Shell
ruby -I ../bioruby/lib/ ./bin/gff3-fetch mrna test/data/gff/test.gff3 
Warning: record has no parent <Transcript:trans-1>
Adding CDS <Transcript:trans-1>
Warning: record has no parent <Transcript:trans-1>
Adding CDS <Transcript:trans-1>
Warning: record has no parent <Transcript:trans-1>
Adding CDS <Transcript:trans-1>
Warning: record has no parent <mRNA:trans-8>
Adding mRNA <mRNA:trans-8>
Warning: record has no parent <mRNA:trans-8>
Adding CDS <mRNA:trans-8>
Warning: record has no parent <mRNA:trans-8>
Adding CDS <mRNA:trans-8>
Warning: record has no parent <mRNA:trans-8>
Adding CDS <mRNA:trans-8>
Warning: record has no parent <Misc:thing2>
Adding mRNA <Misc:thing2>
Warning: record has no parent <Misc:thing3>
Adding CDS <Misc:thing3>
Warning: record has no parent <mrna01>
Adding mRNA <mrna01>
Warning: record has no parent <mrna01a>
Adding mRNA <mrna01a>

Without validation we may have missed this! I needed to treat parents differently.

When the parent is missing I want to link the record against the container ID. If there is no container ID (even in the example files some are missing), I will use the sequence name in the first column. This name can be shared too (e.g. a chromosome name), so we add the start and stop positions, for example ChrX_1000:1250. That way we can always trace back the sequence it came from. In other words, we should always have a valid, and hopefully unique, ID.
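
To make that concrete, a minimal sketch of the fallback could look like the following (the method name make_id is illustrative and not part of the plugin's actual API; it only relies on the standard Bio::GFF::GFF3::Record accessors):

Ruby
  # Illustrative sketch: prefer the Parent attribute, then the container ID,
  # and finally fall back to seqname plus the start/stop coordinates.
  def make_id rec, container_id = nil
    parent = rec.get_attribute('Parent')
    return parent if parent
    return container_id if container_id
    rec.seqname + '_' + rec.start.to_s + ':' + rec.end.to_s   # e.g. ChrX_1000:1250
  end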

Now, there may be the problem that an ID is shared between different GFF trees (read: files). This can happen when using genome information from two individuals. In a future version we may store the base filename with the ID to make it 'unique', assuming two files don't share the same basename.

Inside the LinkedRecs.add function we also have the opportunity to do some validation on the fly. For example, CDSs should not overlap:

Ruby
  def validate_nonoverlapping
    each do | id, rec |
      sections = []
      rec.each do | section |
        sections.push Section.new(section)
      end
      sections.sort.each_with_index do | check, i |
        neighbour = sections[i+1]
        if neighbour and check.intersection(neighbour)
          warn "Overlapping sections for ",id
        end
      end
    end
  end

Even this validation brought out a problem with my implementation: it found overlapping mRNAs, because I was treating mRNA records like CDS records.

1.4.3 mRNA

Another thing to validate is that the container/component name, stored in seqname, is the same for all mRNA and CDS records from the same sequence:

Ruby
  def validate
    each do | id, rec |
      seqname = rec.first.seqname
      rec.each do | section |
        raise "Non-matching seqname #{section.seqname} in #{seqname}" if section.seqname != seqname
      end
    end
  end

There is more we can validate, like sequence regions, but that we leave for now.

Next we can query the tables, as all the information is stored by ID. Let's write FASTA containing all mRNA:

Ruby
  def each_mRNA
    parse(@gff) if !@mrnalist
    @mrnalist.each do | id, rec |
      seqid = rec[0].seqname
      yield id, rec, @sequencelist[seqid] 
    end
  end

  # Usage:
  gffdb = Bio::GFFbrowser::GFFdb.new(gff3)
  gffdb.each_mRNA do | id, rec, sequence |
    puts ">"+id
    puts sequence
  end   

Renders

Shell
---- Digest DB and store data in mRNA Hash
Adding CDS <Transcript:trans-1>
Adding CDS <Transcript:trans-1>
Adding CDS <Transcript:trans-1>
Warning: Container <Component> has no ID, so taking sequence name <Contig2>
Warning: Container <Component> has no ID, so taking sequence name <Contig2>
Warning: Container <Component> has no ID, so taking sequence name <Contig2>
Warning: Container <Component> has no ID, so taking sequence name <Contig2>
Adding mRNA <mRNA:trans-8>
Adding CDS <mRNA:trans-8>
Adding CDS <mRNA:trans-8>
Adding CDS <mRNA:trans-8>
Adding mRNA <Misc:thing2>
Adding CDS <Misc:thing3>
Adding mRNA <mrna01>
Adding mRNA <mrna01a>
>Misc:thing2
nil
>mrna01a
ACGAAGATTTGTATGACTGATTTATCCTGGACAGGCATTGGTCAGATGTCTCCTTCCGTATCGTCGTTTAGTTGCAAATCCGAGTGTTCGGGGGTATTGCTATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACACCATAGCGAACACGCCGCCCCGGTGGCGACTATCGGTCGAAGTTAAGACAATTCATGGGCGAAACGAGATAATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGTGCCCAAGAATGCGATCCCAGAAGTCTTGGTTCTAAAGTCGTCGGAAAGATTTGAGGAACTGCCATACAGCCCGTGGGTGAAACTGTCGACATCCATTGTGCGAATAGGCCTGCTAGTGAC
>mrna01
ACGAAGATTTGTATGACTGATTTATCCTGGACAGGCATTGGTCAGATGTCTCCTTCCGTATCGTCGTTTAGTTGCAAATCCGAGTGTTCGGGGGTATTGCTATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACACCATAGCGAACACGCCGCCCCGGTGGCGACTATCGGTCGAAGTTAAGACAATTCATGGGCGAAACGAGATAATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGTGCCCAAGAATGCGATCCCAGAAGTCTTGGTTCTAAAGTCGTCGGAAAGATTTGAGGAACTGCCATACAGCCCGTGGGTGAAACTGTCGACATCCATTGTGCGAATAGGCCTGCTAGTGAC
>mRNA:trans-8
nil

Now, we have matching sequence info, though the sections/regions have not been cut out and the full component sequence is shown for mrna01a and mrna01. The other mRNA records have no known sequence in this GFF file.

With every component comes a region which should match the FASTA sequence. The header of mrna01 reads:

Shell
##gff-version 3
##sequence-region test01 1 400
test01  RANDOM  contig  1       400     .       +       .       ID=test01;Note=this is test
test01  .       mRNA    101     230     .       +       .       ID=mrna01;Name=testmRNA;Note=this is test mRNA
test01  .       mRNA    101     280     .       +       .       ID=mrna01a;Name=testmRNAalterative;Note=test of alternative splicing variant

So the contig component runs from 1..400 and the mRNAs run from 101..230 and 101..280 respectively. BioRuby, at this point, has no facility for tying this logic together. What I would like is to piece this information together for one component. Something like:

Ruby
  gffdb.each_mRNA_sequence do | id, sequence |
    puts ">"+id
    puts sequence
  end

Finding a component is elaborate:

Ruby
  # Walk the component list to find a matching component/container for a
  # record. First use the parent ID. If that is missing go by sequence
  # name.
  def find_component rec
    parent = rec.get_attribute('Parent')
    if @componentlist[parent] 
      # nice, there is a match
      return @componentlist[parent]
    end
    search = rec.seqname
    return @componentlist[search] if @componentlist[search]
    @componentlist.each do | componentid, component |
      # dissemble id
      (id, start, stop) = componentid.split(/ /)
      if id==search and rec.start >= start.to_i and rec.end <= stop.to_i
        return component
      end
    end
    # Ah, painful. At this point the record has no matching container, probably
    # because it has no parent ID and the component has an ID. We have to go by
    # ID for every component individually
    @componentlist.each do | componentid, component |
      if component.seqname==search and rec.start >= component.start and rec.end <= component.end
        return component
      end
    end
    warn "Could not find container/component for",Record::formatID(rec)
  end

Next putting the sequence together:

Ruby
def assemble sequence, startpos, rec
  retval = ""
  sections = []
  rec.each do | section |
    sections.push Section.new(section)
  end
  sections.sort.each do | section |
    retval += sequence[section.rec.start..section.rec.end]
  end
  retval
end

resulting in:

Shell
ruby -I ../bioruby/lib/ ./bin/gff3-fetch mrna test/data/gff/test.gff3 
---- Digest DB and store data in mRNA Hash
Added transcript with component ID Transcript:trans-1
Adding CDS <Transcript:trans-1>
Adding CDS <Transcript:trans-1>
Adding CDS <Transcript:trans-1>
Added transcript with component ID Transcript:trans-2
Warning: Container <Component> has no ID, so using sequence name instead <Contig2 1 2000>
Added Component with component ID Contig2 1 2000
Warning: Container <Component> has no ID, so using sequence name instead <Contig2 2001 5000>
Added Component with component ID Contig2 2001 5000
Warning: Container <Component> has no ID, so using sequence name instead <Contig2 5001 20000>
Added Component with component ID Contig2 5001 20000
Warning: Container <Component> has no ID, so using sequence name instead <Contig2 2001 37450>
Added transcript with component ID Transcript:trans-3
Added transcript with component ID Transcript:trans-4
Added Component with component ID Clone:AL12345.2
Adding mRNA <mRNA:trans-8>
Adding CDS <mRNA:trans-8>
Adding CDS <mRNA:trans-8>
Adding CDS <mRNA:trans-8>
Added Component with component ID Clone:ABC123
Added gene with component ID Misc:thing1
Adding mRNA <Misc:thing2>
Adding CDS <Misc:thing3>
Added contig with component ID test01
Adding mRNA <mrna01>
Adding mRNA <mrna01a>
Warning: No sequence information for <Misc:thing2>
>mrna01a
ATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACACCATAGCGAACACGCCGCCCCGGTGGCGACTATCGGTCGAAGTTAAGACAATTCATGGGCGAAACGAGATAATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGTG
>mrna01
ATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACACCATAGCGAACACGCCGCCCCGGTGGCGACTATCGGTCGAAGTTAAGACAATTCATGGGCGAAACGAGATAATGGGTACTGCACCCCTCGT
Warning: No sequence information for <mRNA:trans-8>

and, yes! We have the mRNA sequences.

1.4.4 CDS

Finding a CDS sequence is similar to finding the mRNA sequence. We only need to account for the fact that linked CDSs may have an mRNA parent. Here, we find some ambiguity again.

In the GFF3 standard the example shows

Shell
ChrX . gene XXXX YYYY . + . ID=gene01;name=resA 
ChrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01 
ChrX . CDS XXXX YYYY . + . Parent=tran01 

So the parent of CDS points to the mRNA ID. In our example file we have:

Shell
Contig4 clone   Component       1       50000   .       .       .       ID=Clone:ABC123
Contig4 confirmed       gene    32000   35000   .       +       .       ID=Misc:thing1;gene=gene-9
Contig4 confirmed       mRNA    32000   35000   .       +       .       ID=Misc:thing2;mRNA=trans-9;gene=gene-9
Contig4 confirmed       CDS     32000   35000   .       +       .       ID=Misc:thing3;mRNA=trans-9

So here the CDS contains an mRNA pointer. Next, such a reference may be missing altogether, and we have to assume the CDS matches the container, as in:

Shell
Contig1 confirmed       transcript      1001    2000    42      +       .      ID=Transcript:trans-1;Gene=abc-1;Gene=xyz-2;Note=function+unknown
Contig1 confirmed       exon    1001    1100    .       +       .       ID=Transcript:trans-1
Contig1 confirmed       exon    1201    1300    .       +       .       ID=Transcript:trans-1
Contig1 confirmed       exon    1401    1450    .       +       .       ID=Transcript:trans-1
Contig1 confirmed       CDS     1051    1100    .       +       0       ID=Transcript:trans-1

Doh!

The logic I am implementing is that if there is a Parent or mRNA attribute, we will find out whether it points to an mRNA, and use the information of the containing component. If the mRNA is missing we go straight to the component to create the list of CDSs that belong together. The descriptor will contain the ID of both the container and the mRNA, when available.
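
A sketch of that decision logic (the method name and the @mrnalist/@componentlist lookups are my shorthand here, not the plugin's literal code; find_component and Record::formatID are used as shown earlier):

Ruby
  # Illustrative sketch: resolve the container and descriptor for a CDS record
  def resolve_CDS_container rec
    parent = rec.get_attribute('Parent') || rec.get_attribute('mRNA')
    mrna = @mrnalist[parent] if parent          # assumed Hash of known mRNA records
    # when the Parent/mRNA link resolves, use the mRNA's component;
    # otherwise go straight to the component of the CDS itself
    component = find_component(mrna ? mrna.first : rec)
    descriptor = Record::formatID(rec)
    descriptor += " mRNA=#{parent}" if mrna     # add the mRNA ID when available
    return component, descriptor
  end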

This approach works, as long as the CDSs share the same ID and parent attributes. If either differs I'll have to look again. To test CDS handling I added the following to the test file:

Shell
test01  . CDS 192 200 . + . ID=cds1;Parent=mrna01a
test01  . CDS 164 190 . + . ID=cds1;Parent=mrna01a
test01  . CDS 192 200 . + . ID=cds2;Parent=mrna01a

resulting in

Shell
>cds1 Sequence:test01_1:400 (164:190, 192:200)
GGCGACTATCGGTCGAAGTTAAGACAATCATGGGCG
>cds2 Sequence:test01_1:400 (192:200)
TCATGGGCG

1.5 Exons

Exons have another interesting fact: an exon can have multiple parents:

Shell
ctg123 . exon 5000 5500 . + . ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003

Now that I know this, I suppose it can also happen for CDS. I wonder what BioRuby will make of that.

1.6 API Specs

At this point it is high time to write some Specs or Unit Tests, to see if the generated output is correct. It is best practice to start writing tests before coding, as it clarifies the interface and makes refactoring a lot easier. Right now I am not really sure the calculated sequences are correct, and I want to prepare for refactoring the code. RSpec is the tool of choice for me. RSpec is not (yet) part of BioRuby, but since this module is a BioRuby plugin we are free to use alternatives.

First I adapt the test.gff file to render some short sequences I can test on. The first Spec reads

Ruby
#---test01  .       mRNA    3       14      .       +       .       ID=mrna01short;Name=testmRNA;Note=this is test mRNA
describe Gffdb, "when fetching mRNA" do
  it "find mrna01short" do
    seq.should == "GAAGATTTGTAT"
  end
end

Specs are a good way to settle on an API. They can double as documentation, a bit like doctests, as they are more readable than unit tests - which always have an ad hoc flavour and get hard to read when seeking full test coverage.

I want my API to offer the following:

  1. iterate over all mRNA|exon|CDS
  2. return the contained objects themselves
  3. return a sequence as concatenated string with a description
  4. the parser should have an option for overall reduced memory
  5. reduced memory may have different modes to balance speed/memory

and later

  1. seek a single mRNA|exon|CDS
  2. allow plugging in an external database for storage

So, for iteration we'll have an API

  each_gene     : For every gene yield id, and container with sequence
  each_gene_seq : For every gene yield id and sequence
  each_mRNA     : For every mRNA yield id, list of objects, and container with sequence
  each_mRNA_seq : For every mRNA yield id and sequence
  each_exon     : For every exon yield id, list of objects, and container with sequence
  each_exon_seq : For every exon yield id and sequence
  each_CDS      : For every CDS  yield id, list of objects, and container with sequence
  each_CDS_seq  : For every CDS yield id and sequence

The Spec simply reads:

Ruby
describe GFFdb, "GFF3 API" do
  it "should support each_gene" 
  it "should support each_gene_seq" 
  it "should support each_mRNA" 
  it "should support each_mRNA_seq" 
  it "should support each_exon" 
  it "should support each_exon_seq" 
  it "should support each_CDS" 
  it "should support each_CDS_seq" 
end

and running the Spec says:

Shell
ruby -I ../bioruby/lib/ ~/.gems/bin/spec spec/gffdb_spec.rb 
********F
Pending:
Bio::GFFbrowser::GFFdb GFF3 API should support each_gene (Not Yet Implemented)
./spec/gffdb_spec.rb:15
Bio::GFFbrowser::GFFdb GFF3 API should support each_gene_seq (Not Yet Implemented)
./spec/gffdb_spec.rb:16
Bio::GFFbrowser::GFFdb GFF3 API should support each_mRNA (Not Yet Implemented)
./spec/gffdb_spec.rb:17
(etc)

pointing out the right thing :)

For memory reduction I plan to add two versions: one that reads everything from disk every time, and one that caches the parsed containers/components in memory, as these are the most often accessed. In both cases the seek positions in the files are simply held in a Hash. To define which method to use we can select one of:

Ruby
  GFFdb.new(gff3, :memory => :cache_all)         # default, load everything in memory
  GFFdb.new(gff3, :memory => :cache_components)  # only components in memory
  GFFdb.new(gff3, :memory => :cache_records)     # records
  GFFdb.new(gff3, :memory => :cache_none)        

The first test passes and reads:

Ruby
  it "should implement each_CDS_seq" do
    res = [] ; @gffdb.each_CDS_seq { | id, seq | res << id }
    res[0].should == "cds2 Sequence:test01_1:400 (192:200)"
  end

Another test fails, because of a bug in my module:

Ruby
  it "should implement each_mRNA_seq" do
    h = {} ; @gffdb.each_mRNA_seq { | id, seq | h[id] = seq }
    h["mrna01short Sequence:test01_1:400 (3:14)"].should == "GAAGATTTGTAT"
  end

with

Shell
'Bio::GFFbrowser::GFFdb GFF3 API with :cache_none should implement each_mRNA_seq' FAILED
expected: "GAAGATTTGTAT",
     got: "AAGATTTGTATG" (using ==)
./spec/gffdb_spec.rb:37:

It turns out the index is wrong when cutting out sections: I also needed to subtract the starting position of the component/container sequence. Funnily enough, I did pass that value into the assemble function - I just never used it. This is why you need tests. The new code reads:

Ruby
  # Patch a sequence together from a Sequence string and an array
  # of records
  def assemble sequence, startpos, rec
    retval = ""
    Sections::sort(rec).each do | section |
      retval += sequence[(section.rec.start-startpos)..(section.rec.end-startpos)]
    end
    retval
  end

Thinking about the caching, I am going to adjust the way we define the cache. We are going to implement two caches for records - one that holds components, and one that holds non-components (features). The cache contains the parsed records. In addition we are going to maintain an index of seek positions in the file(s).

Use one of:

Ruby
  GFFdb.new(gff3, :cache_components => size)     # int, :cache_all or :cache_none
  GFFdb.new(gff3, :cache_records    => size)     # int, :cache_all or :cache_none

where size is the size of the cache and :cache_all is the default. Sensible combinations are:

Ruby
  1: GFFdb.new(gff3)  # everything in memory
  2: GFFdb.new(gff3, :cache_components => 1000, :cache_records => :cache_none )
  3: GFFdb.new(gff3, :cache_components => 1000, :cache_records => 1000 )
  4: GFFdb.new(gff3, :cache_records => :cache_none )
  5: GFFdb.new(gff3, :cache_components => :cache_none, :cache_records => :cache_none )

My idea is that comparing the performance of options (1) and (3) will be interesting. Option (5) may be slow, not so much because of file seeks (as they will be near each other and we benefit from the OS file cache), but because every record has to be parsed every time. Also we may want to use an external FASTA file. This will use a :fasta => 'filename' option.

Another thing to consider is Parents that are far from their children in the file. First I am just going to display warnings for unresolved links. Later we can put them aside and resolve them after fully reading the file. This is necessary for all versions, including the in-memory one(!). Lincoln Stein specifies in the standard:

The GFF3 format does not enforce a rule in which features must be wholly contained within the location of their parents, since some elements of the Sequence Ontology (e.g. enhancers in genes) allow for distant cis relationships.

The full API Specs you can read FIXME here

1.7 Implementing the Cache

BioRuby has no generic caching facility, so we have to search. StackOverflow is always a great resource, and we don't want to add unnecessary dependencies (otherwise SQLite would be an option). I could use Ruby/Cache, as it comes with a Ruby license. But this blog post intrigued me. The author does not display a license, so I am just going to try it, and if it is good I can ask the author's permission. The code is in bio/system/lruhash.rb.

The idea is that the LRU Hash acts like a normal Hash, but when it exceeds a certain size it starts deleting the least recently accessed items. With GFF data we could probably do with a pure FIFO implementation, but that is for someone else to try. To use the LRUHash, do something like:

Ruby
  cache = LRUHash.new(1000)
  gff.records.each do | rec |
    # store this record in the cache
    cache[rec.id] = rec
    # say we want a parent
    parent = cache[rec.parent]
    if parent == nil
      # Cache miss, we need to fetch it and add it to the cache
      parent = read_rec(rec.parent)
      cache[rec.parent] = parent
    end
    # (etc., etc.)
  end

This is hidden behind a nice interface in RecordCache.
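
For illustration, a minimal LRU Hash could look like the sketch below. This is not the actual bio/system/lruhash.rb code, just the idea: a plain Hash plus an access-order list, evicting the least recently used key once a maximum size is exceeded.

Ruby
  # Illustrative sketch, not the plugin's lruhash.rb
  class SimpleLRUHash
    def initialize max_size
      @max_size = max_size
      @hash  = {}
      @order = []                  # least recently used keys come first
    end
    def [] key
      return nil if !@hash.has_key?(key)
      touch(key)
      @hash[key]
    end
    def []= key, value
      @hash[key] = value
      touch(key)
      @hash.delete(@order.shift) if @hash.size > @max_size   # evict the LRU key
      value
    end
    def size
      @hash.size
    end
    private
    def touch key
      @order.delete(key)           # move key to the most recently used end
      @order.push(key)
    end
  end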

1.8 Implementing the File seek index

BioRuby has a file indexer for flatfiles. Unfortunately, it is rather elaborate and convoluted; see indexer.rb (http://github.com/bioruby/bioruby/blob/master/lib/bio/io/flatfile/indexer.rb). My needs are simple: maintain a Hash of record seek positions in memory, referenced by key.

1.9 Tying it together

Every GFF record has a unique identifier. We first check whether the record is in the cache. If it is not, we use the seek index to reread the record from file. This is handled by the implementations of FetchRecord.

In the first implementation we stored everything simply in a list of linked records, e.g. mrnas.add(id,rec), which essentially is a Hash of Arrays. Now, rather than storing records, we are going to store the unique record ID, and delegate fetching the individual records. So, rather than self[id] << rec we get self[id] << rec.id. And we have to refactor the code to fetch records on demand with FetchRecord.
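
A sketch of what this fetch-on-demand looks like (the names fetch_record, @cache and @seek are illustrative; the plugin wraps this behind FetchRecord and RecordCache):

Ruby
  # Illustrative sketch: look up a record by ID, first in the cache, then on disk
  def fetch_record id
    rec = @cache[id]                # LRU cache of parsed records
    if rec == nil
      @fh.seek(@seek[id])           # seek index: record ID -> file offset
      rec = Bio::GFF::GFF3::Record.new(@fh.gets)
      @cache[id] = rec
    end
    rec
  end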

1.10 Design

The goal of designing a library and its components is to make every component as simple as possible (the KISS principle). If you can describe a class, object, or mixin in one line, it has a minimal number of persistent member variables, and its API is self-explanatory - then you have succeeded.

To run the different caching techniques I created a factory that allows different GFF3 assemblers, so they act as drop-in replacements for each other, for example Bio::GFFBrowser::Digest::InMemory and Bio::GFFBrowser::Digest::NoCache. These live in separate source files: gffinmemory.rb and gffnocache.rb. All shared functionality goes in mixin modules, which are in gffassemble.rb. One of the annoyances of BioRuby is that everything is in very lengthy files. By using smaller files you can explain some of the contained logic just by naming the files properly. We now have the assembler factory classes, which should expose the following functions:

Ruby
describe Bio::GFFBrowser::Digest::InMemory, "GFF3 Digest API" do
  it "should support each_gene" 
  it "should support each_gene_seq" 
  it "should support each_mRNA" 
  it "should support each_mRNA_seq" 
  it "should support each_exon" 
  it "should support each_exon_seq" 
  it "should support each_CDS" 
  it "should support each_CDS_seq" 
end

BioRuby does not allow iterating through a GFF3 file record by record. The way to achieve this now is by reading the file and parsing lines using something like:

Ruby
ARGF.each_line do |gffline|
  record=Bio::GFF::GFF3::Record.new(gffline)
  # (do something)
end

This is brittle. For one, we will have to handle FASTA records ourselves; second, we may miss out on things like empty lines. Time for a BioRuby addition! For now, I'll just add a class named Gff3::Iterator which does not consume memory and provides something like:

Ruby
   seek = {}   
   fh = File.open(filename)
   iter = Bio::GFF::GFF3::FileIterator.new(fh)
   iter.each_rec do | id, rec |
     seek[id] = rec.io_seek
     # (do something with rec)
     # maybe seek another record
     fh.seek(seek[parentid])
     parent = Bio::GFF::GFF3::Record.new(fh.gets)
   end
   # later, you may seek a record
   id1 = 'xxx'
   fh.seek(seek[id1])
   rec = Bio::GFF::GFF3::Record.new(fh.gets)

The seeking and fetching of a record, obviously, should go into its own module.

The iterator only takes care of correctly parsing the file and yielding data records. Note that it will seek at every iteration, just to make sure we haven't moved the file pointer between reads. We are morphing Record into FileRecord, adding rec.io_seek, so the file position is automatically stored when caching the record later.
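
A minimal sketch of such a FileRecord, assuming it simply extends the BioRuby Record with the offset it was read from:

Ruby
  # Illustrative sketch: a GFF3 Record that remembers its file offset
  class FileRecord < Bio::GFF::GFF3::Record
    attr_reader :io_seek
    def initialize io_seek, line
      @io_seek = io_seek
      super(line)
    end
  end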

In BioRuby the GFF3 line parser has the following logic:

Ruby
  # parses GFF lines
  str.each_line do |line|
    if /^\#\#([^\s]+)/ =~ line then
      parse_metadata($1, line)
      parse_fasta(str) if @in_fasta
    elsif /^\>/ =~ line then
      # line starts with '>' (gt) and is a FASTA record
      @in_fasta = true
      parse_fasta(str, line)
    elsif line.strip =~ /^$/
      # empty line
      next
    else
      @records << GFF3::Record.new(line) 
    end
  end
  # parses fasta format when str is a String and fasta data exists
  if fst then
    @in_fasta = true
    parse_fasta(fst)
  end

When the FASTA portion starts, the rest of the file consists of sequence information. This is according to the standard too:

'##FASTA' This notation indicates that the annotation portion of the file is at an end and that the remainder of the file contains one or more sequences (nucleotide or protein) in FASTA format. This allows features and sequences to be bundled together.

This logic is missing in the earlier line parser example. I will add it in FileIterator, and the next time someone wants a line based parser, they ought to be able to use this. This is why FileIterator belongs in BioRuby itself.

I am going to adapt the above line based parser, as it conflates two parsing actions (non-FASTA GFF3 and FASTA). The only thing I am ignoring, for now, is the GFF3 meta information. It becomes:

Ruby
  def each_rec
    fpos = 0
    @fh.each_line do | line |
      line = line.strip
      if line == "##FASTA"
        @fasta_io_seek = fpos
        break
      end
      if line.size != 0 and line !~ /^#/
        rec = FileRecord.new(fpos, line)
        lastpos = @fh.tell
        yield rec.id, rec
        @fh.seek(lastpos) # reset filepos, just in case it changed
      end
      fpos = @fh.tell
    end
  end

and the FASTA parser

Ruby
  def each_sequence
    header = nil
    seqs   = []
    @fh.each_line do | line |
      line = line.strip
      next if line =~ /^#/
      if line =~ /^>/  # FASTA record header
        yield fasta_rec(header, seqs) if header
        header = line
        seqs   = []
      else
        seqs << line
      end
    end
    yield fasta_rec(header, seqs) if header
  end
private
  def fasta_rec header, buf
    fst = Bio::FastaFormat.new(header+"\n"+buf.join("\n"))
    return fst.definition, fst
  end

Now that we have the iterators, the spec reads:

Ruby
describe Bio::GFF::GFF3::FileIterator, "iterates a file" do
  it "should handle embedded FASTA records" do
    iter = Bio::GFF::GFF3::FileIterator.new(TEST1)
    last = nil
    iter.each_rec do | id, rec |
      last = rec
    end
    last.io_seek.should == 3256 # position of the last record
    firstid = 'unknown'
    iter.each_sequence do | id, seq |
      firstid = id
    end
    firstid.should == "test01"
  end
end

OK, on to the NoCache parser. The above iterator allows parsing the file once for every type and creating compound objects (e.g. each CDS consists of several CDS segments). We will have to walk the file twice: first to compound the objects, next to iterate over them. The difference with InMemory is that we do not store the parsed records, only the file seek positions. As it happens, the code ends up being very similar, which makes it a candidate for refactoring.

The main problem was that find_component was based on a Hash of records in gffassemble. Since I want to reuse the business logic of finding a component I had two options: refactor the function to use seek positions, or create a class which acts like a Hash but fetches from file. If you read this:

Ruby
  def find_component rec
    parent = rec.get_attribute('Parent')
    if @componentlist[parent] 
      # nice, there is a match
      return @componentlist[parent]
    end
    search = rec.seqname
    return @componentlist[search] if @componentlist[search]
    @componentlist.each do | componentid, component |
      # dissemble id
      (id, start, stop) = componentid.split(/ /)
      if id==search and rec.start >= start.to_i and rec.end <= stop.to_i
        return component
      end
    end
    # Ah, painful. At this point the record has no matching container, probably
    # because it has no parent ID and the component has an ID. We have to go by
    # ID for every component individually
    @componentlist.each do | componentid, component |
      if component.seqname==search and rec.start >= component.start and rec.end <= component.end
        return component
      end
    end
    warn "Could not find container/component for",Record::formatID(rec)
  end

you may agree it probably is cleanest to make @componentlist fetch records transparently from file.

I created SeekRec, which fetches a record by ID, and SeekRecList, which behaves like a Hash but fetches records from disk using SeekRec. SeekRecList is really easy to make generic and link up with a caching function. It behaves like a Hash, which means the code base for InMemory and NoCache is almost the same. So I figure I am going to factor it out into a shared mixin module named Bio::GFFbrowser::Digest::Parser.
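
The idea behind SeekRec and SeekRecList can be sketched as follows (illustrative, not the plugin's exact code):

Ruby
  # Illustrative sketch: fetch a single record from a file offset
  class SeekRec
    def self.fetch fh, io_seek
      fh.seek(io_seek)
      FileRecord.new(io_seek, fh.gets)
    end
  end

  # Illustrative sketch: behaves like a Hash of records, but only stores
  # file offsets and parses records from disk on demand
  class SeekRecList
    def initialize fh
      @fh   = fh
      @seek = {}                    # record ID -> file offset
    end
    def []= id, io_seek
      @seek[id] = io_seek
    end
    def [] id
      pos = @seek[id]
      return nil if pos == nil
      SeekRec::fetch(@fh, pos)
    end
    def each
      @seek.each { | id, pos | yield id, self[id] }
    end
  end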

All tests pass for NoCache. I should be able to run this in a production setup now.

1.11 FASTA Handler

Sequence information with GFF3 comes in FASTA format, either at the end of the GFF file or as a separate file. I would like to use the indexed FastaReader of the BigBio project, so as not to duplicate code. It has a nifty feature: the index is built while iterating the first time, which means in most cases you only have to read the file once, or even less. We don't really need that with GFF3, as we won't save time.

So, a FastaReader should index a file on the first run, and allow accessing records by ID. Similar to the GFF3 Hash we did above:

Ruby
   # Open the FASTA file with index
   fasta = FastaReader.new(fasta_filename, :index => true)
   # Fetch a sequence by seeking the file and reading on demand
   seq = fasta['sequenceid'] 

The only additional functionality we would like is to fetch FASTA information from a seek position, as the FASTA sequences are at the end of the GFF file, and it saves us from rereading the file. So:

Ruby
   # Open the FASTA file with index and seekpos
   fasta = FastaReader.new(gff_filename, :index => true, :io_seek => 76755)
   # Fetch a sequence by seeking the file and reading on demand
   seq = fasta['sequenceid'] 

Because FASTA is such an easy format, I had already included a parser for the in-file FASTA in the FileIterator class, used by NoCache. However, this implementation is not used in InMemory. Hmmm. Either refactor this out, or create something from BigBio. Choices...

In this case I decided to refactor the built-in version, as it is a simple implementation, requires io_seek information, and should not live in the namespace of another plugin (e.g. BigBio). We create Bio::GFF::FastaReader. This FastaReader is used directly by NoCache, InMemory, and FileIterator to fetch sequence information.
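
To give an idea of what such an indexed reader involves, here is a sketch (the real Bio::GFF::FastaReader differs in detail): it records the offset of every FASTA header on a first pass, and afterwards fetches sequences by ID with a seek.

Ruby
  # Illustrative sketch of an indexed FASTA reader
  class SimpleFastaReader
    def initialize filename, opts = {}
      @fh = File.open(filename)
      @fh.seek(opts[:io_seek]) if opts[:io_seek]   # e.g. offset of ##FASTA in a GFF3 file
      @index = {}
      build_index
    end
    # Fetch a sequence (as a String) by its FASTA ID
    def [] id
      pos = @index[id]
      return nil if pos == nil
      @fh.seek(pos)
      @fh.gets                      # skip the '>' header line
      seq = ""
      while (line = @fh.gets) and line !~ /^>/
        seq += line.strip
      end
      seq
    end
    private
    def build_index
      pos = @fh.tell
      @fh.each_line do | line |
        @index[$1] = pos if line =~ /^>(\S+)/      # remember the offset of each header
        pos = @fh.tell
      end
    end
  end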

To test FastaReader I split test.gff into test-ext-fasta.gff and test-ext-fasta.fa. Next it was merely a question of adding specs and the minimal implementation. To the user, reading an external FASTA file consists of adding it as a parameter:

Ruby
  TESTGFF1EXT='test/data/gff/test-ext-fasta.gff3'
  TESTGFF1FASTA='test/data/gff/test-ext-fasta.fa'
  gffdb = Bio::GFFbrowser::GFFdb.new(TESTGFF1EXT, :fasta_filename => TESTGFF1FASTA, 
            :cache_components => :cache_none, :cache_records => :cache_none)
  gff = gffdb.assembler
  gff.each_mRNA_seq { | id, seq | print id, seq }

or with the command line tool

Shell
./bin/gff3-fetch --no-cache mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3   

Now that we can handle FASTA, the GFF3 plugin is fully functional, in both the InMemory and NoCache versions.

1.12 Amino acid translation

To make sure the library correctly puts CDSs together I added part of an existing database to the test data. The following contig contains:

Shell
# Gene gene:MhA1_Contig1133.frz3.gene4
MhA1_Contig1133 WormBase  gene  7838  8740  . + . ID=gene:MhA1_Contig1133.frz3.gene4;Name=MhA1_Contig1133.frz3.gene4;Note=PREDICTED protein_coding;public_name=MhA1_Contig1133.frz3.gene4
MhA1_Contig1133 WormBase  mRNA  7838  8740  . + . ID=transcript:MhA1_Contig1133.frz3.gene4;Parent=gene:MhA1_Contig1133.frz3.gene4;Name=MhA1_Contig1133.frz3.gene4;
MhA1_Contig1133 WormBase  exon  7838  7980  . + . ID=exon:MhA1_Contig1133.frz3.gene4.1;Parent=transcript:MhA1_Contig1133.frz3.gene4
MhA1_Contig1133 WormBase  exon  8065  8308  . + . ID=exon:MhA1_Contig1133.frz3.gene4.2;Parent=transcript:MhA1_Contig1133.frz3.gene4
MhA1_Contig1133 WormBase  exon  8585  8740  . + . ID=exon:MhA1_Contig1133.frz3.gene4.3;Parent=transcript:MhA1_Contig1133.frz3.gene4
MhA1_Contig1133 WormBase  CDS 7838  7980  . + 0 ID=cds:MhA1_Contig1133.frz3.gene4;Parent=transcript:MhA1_Contig1133.frz3.gene4
MhA1_Contig1133 WormBase  CDS 8065  8308  . + 1 ID=cds:MhA1_Contig1133.frz3.gene4;Parent=transcript:MhA1_Contig1133.frz3.gene4
MhA1_Contig1133 WormBase  CDS 8585  8740  . + 0 ID=cds:MhA1_Contig1133.frz3.gene4;Parent=transcript:MhA1_Contig1133.frz3.gene4

which I checked by hand to give the following for the CDSs (so I can add a test):

Shell
MhA1_Contig1133 WormBase  CDS 7838  7980  . + 0 ID=cds:MhA1_Contig1133.frz3.gene4;Parent=transcript:MhA1_Contig1133.frz3.gene4
TGCGTCCTTTAACAGATGAAGAAACTGAAAAGTTTTTCAAAAAACTTTCAAATTATATTGGTGACAATATTAAACTTTTATTGGAAAGAGAAGATGGAGAATATGTTTTTCGTTTACATAAAGACAGAGTTTATTATTGCA
>EMBOSS_001_3
RPLTDEETEKFFKKLSNYIGDNIKLLLEREDGEYVFRLHKDRVYYCX
MhA1_Contig1133 WormBase  CDS 8065  8308  . + 1 ID=cds:MhA1_Contig1133.frz3.gene4;Parent=transcript:MhA1_Contig1133.frz3.gene4
GAAAAATTAATGCGACAAGCAGCATGTATTGGACGTAAACAATTGGGATCTTTTGGAACTTGTTTGGGTAAATTCACAAAAGGAGGGTCTTTCTTTCTTCATATAACATCATTGGATTATTTGGCACCTTATGCTTTAGCAAAAATTTGGTTAAAACCACAAGCTGAACAACAATTTTTATATGGAAATAATATTGTTAAATCTGGTGTTGGAAGAATGAGTGAAGGGATTGAAGAAAAACA
>EMBOSS_001_1
EKLMRQAACIGRKQLGSFGTCLGKFTKGGSFFLHITSLDYLAPYALAKIWLKPQAEQQFLYGNNIVKSGVGRMSEGIEEKX
MhA1_Contig1133 WormBase  CDS 8585  8740  . + 0 ID=cds:MhA1_Contig1133.frz3.gene4;Parent=transcript:MhA1_Contig1133.frz3.gene4
GTATTATTATTTATAATATGTCAGATTTACCATTGGGTTTTGGAGTGGCTGCAAAGGGAACATTATCTTGTAGAAAAGTAGATCCTACAGCTTTAGTTGTTTTACATCAATCAGATTTGGGTGAATATATTCGAAATGAAGAGGGATTAATTTA
>EMBOSS_001_3
IIIYNMSDLPLGFGVAAKGTLSCRKVDPTALVVLHQSDLGEYIRNEEGLIX

I added the spec test:

Ruby
before :all do
  gffdb = Bio::GFFbrowser::GFFdb.new(GFF3FILE, :fasta_filename => FASTAFILE)
  @gff = gffdb.assembler
  @gff.parse
  @contigsequence = @gff.sequencelist["MhA1_Contig1133"]
  @cdslist = {}
  @gff.each_CDS do | id, reclist, component |
    @component = component
    @cdslist[id] = reclist
  end
end

it "should have CDS 7838:7980" do
  recs = @cdslist['cds:MhA1_Contig1133.frz3.gene4']
  cds0 = recs[0]
  cds0.start.should == 7838
  cds0.end.should == 7980
  cds0.frame.should == 0
  cds0.seqname.should == 'MhA1_Contig1133'
end

it "should have CDS 8065:8308" do
  recs = @cdslist['cds:MhA1_Contig1133.frz3.gene4']
  cds1 = recs[1]
  cds1.start.should == 8065
  cds1.end.should == 8308
  cds1.frame.should == 1
  cds1.strand.should == '+'
  cds1.seqname.should == 'MhA1_Contig1133'
end

it "should translate CDS 8065:8308 (in frame 1)" do
  recs = @cdslist['cds:MhA1_Contig1133.frz3.gene4']
  cds1 = recs[1]
  seq = @gff.assemble(@contigsequence,@component.start,[cds1])
  seq.should == "GAAAAATTAATGCGACAAGCAGCATGTATTGGACGTAAACAATTGGGATCTTTTGGAACTTGTTTGGGTAAATTCACAAAAGGAGGGTCTTTCTTTCTTCATATAACATCATTGGATTATTTGGCACCTTATGCTTTAGCAAAAATTTGGTTAAAACCACAAGCTGAACAACAATTTTTATATGGAAATAATATTGTTAAATCTGGTGTTGGAAGAATGAGTGAAGGGATTGAAGAAAAACA"
  aaseq = @gff.assembleAA(@contigsequence,@component.start,[cds1])
  # note it should handle the frame shift and direction!
  aaseq.should == "EKLMRQAACIGRKQLGSFGTCLGKFTKGGSFFLHITSLDYLAPYALAKIWLKPQAEQQFLYGNNIVKSGVGRMSEGIEEKX"
  # we are going to force a change of direction
  cds1rev = cds1
  cds1rev.strand = '-'
  seq = @gff.assemble(@contigsequence,@component.start,[cds1rev])
  seq[0..3].should == "ACAA"
end

Naturally the test failed :). That is what tests are for: I did not account for the frame shift in the second CDS. Also, the container component offset was wrong, as I subtracted the start position of the container. It probably is a good idea to test with real data from the start.

I also realise the API may not be descriptive enough: each_CDS should perhaps iterate all CDS sections - though I could add each_CDS_section instead. As it is, each_CDS returns a list of CDS sections. Good enough for now.

Another nice addition would be translation of the CDS, so we have something like each_AA_sequence.
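
A possible sketch, layered on each_CDS_seq and BioRuby's Bio::Sequence::NA#translate (the method name and frame handling here are assumptions, not the plugin's final API):

Ruby
  # Illustrative sketch: yield an amino acid sequence for every assembled CDS
  def each_AA_seq
    each_CDS_seq do | id, seq |
      na = Bio::Sequence::NA.new(seq)
      # a reverse strand CDS would first need reverse_complement, and the
      # GFF3 phase (0,1,2) maps onto translate frames 1,2,3
      yield id, na.translate
    end
  end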

1.13 Command line tool

The usage reads:

Shell
> BioRuby GFF3 Plugin Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
> 
>   Fetch and assemble mRNAs, or CDS and print in FASTA format. 
> 
>     gff3-fetch [--no-cache] mRNA|CDS [filename.fa] filename.gff
> 
>   Where:
> 
>     --no-cache      : do not load everything in memory 
>     mRNA            : assemble mRNA
>     CDS             : assemble CDS 
> 
>   Multiple GFF3 files can be used. For external FASTA files, always the last
>   one before the GFF file is used.
> 
>   Examples:
> 
>     Find mRNA and CDS information from test.gff3 (which includes sequence information)
> 
>       ./bin/gff3-fetch mRNA test/data/gff/test.gff3
>       ./bin/gff3-fetch CDS test/data/gff/test.gff3
> 
>     Find mRNA from an external FASTA file, without loading everything in RAM
> 
>       ./bin/gff3-fetch --no-cache mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3   
> 
>   If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
> 

1.14 TODO

  1. add each_AA_sequence (and gff3-fetch)
  2. do not fail on CDS when size not multiple of 3
  3. when assembling CDS, correct for wrong size (multiple of 3)

Sequence list caching

Introduce LRU Cache

Fetch gene information

Fetch exon information. Exons are pretty much like CDS, the way they are handled in GFF3.

Add more validation; more checks are possible.

Make the plugin easy to install on BioRuby. The plugin architecture is still in development. At this point it is probably easiest just to add the lib directory to the search path. See the README.

1.15 Conclusion

One important finding is that the GFF3 standard is open to interpretation. For example it allows people to take liberties with container (component) types and parent IDs. By using validation from the start these issues came up quickly.

The problem with standards like GFF3 is that they evolve with use and implementations. Writing an all-catching parser is therefore problematic; you have to adapt it to the data files you encounter. If a Bio* project strictly maintained the original specification, its parser would rapidly become useless.

Meanwhile it is also clear (to me) that GFF3 is a very useful standard, as it allows very descriptive genomic relationships. I can see why it is becoming common. One thing to check is the (Sanger) Wormbase definition of features, described here: http://www.sanger.ac.uk/Projects/C_elegans/DOCS/GFF_features.shtml

BioRuby supports plugins. This allows for supporting different parser interpretations.

The original BioRuby implementation is nice, as it was flexible enough for my needs; it only needed a fix for ignoring empty lines. However, it only provides basic GFF record parsing, not sequence assembly. That is what the plugin does now. Furthermore, the in-memory version of BioRuby is slower than the low-memory version: the low-memory version is actually about 50% faster than the InMemory BioRuby edition.

On a decent server with 15 GB RAM and fast drives (and ruby 1.8.7 (2010-08-16 patchlevel 302) [x86_64-linux]):

When I parse a 500Mb GFF3 file, without FASTA information, with BioRuby it consumes 8.5 Gb RAM and takes 20 minutes. My NoCache version takes 1Gb RAM and 13 minutes. On my 2Gb laptop the native BioRuby version never completed (which, in my opinion, is unacceptable).

I am not 100% sure why this is, but I know that BioRuby consumes the whole file in memory first, splits it by line and, next, starts parsing GFF. Probably memory allocation and regex are expensive with really large buffers.

When FASTA support was added, the NoCache version added 1 GB of RAM for the FASTA data. Because FASTA is stored in memory as a string of characters, it is quite efficient (storage-wise).

The current source code for the BioRuby plugin can be found on github.

