Month: March 2012

Bioperl Install and examples

Creative Commons ATT-SA

Perl is a widely used language in bioinformatics. As I have already experimented with Python and Biopython for handling a few simple bioinformatics tasks, I will now try Perl and Bioperl.

Install on Ubuntu 11.10 (oneiric)

Perl already comes with Ubuntu. Bioperl can be installed (without CPAN):

$ sudo apt-get install bioperl

After the installation you have several tools in your PATH:

bp_aacomp, bp_biblio, bp_biofetch_genbank_proxy, bp_bioflat_index, bp_biogetseq, bp_blast2tree, bp_bulk_load_gff, bp_chaos_plot, bp_classify_hits_kingdom, bp_composite_LD, bp_das_server, bp_dbsplit, bp_download_query_genbank, bp_einfo, bp_extract_feature_seq, bp_fast_load_gff, bp_fastam9_to_table, bp_fetch, bp_filter_search, bp_flanks, bp_gccalc, bp_genbank2gff, bp_genbank2gff3, bp_generate_histogram, bp_heterogeneity_test, bp_hivq, bp_hmmer_to_table, bp_index, bp_load_gff, bp_local_taxonomydb_query, bp_make_mrna_protein, bp_mask_by_search, bp_meta_gff, bp_mrtrans, bp_mutate, bp_netinstall, bp_nexus2nh, bp_nrdb, bp_oligo_count, bp_pairwise_kaks, bp_parse_hmmsearch, bp_process_gadfly, bp_process_sgd, bp_process_wormbase, bp_query_entrez_taxa, bp_remote_blast, bp_revtrans-motif, bp_search2BSML, bp_search2alnblocks, bp_search2gff, bp_search2table, bp_search2tribe, bp_seq_length, bp_seqconvert, bp_seqfeature_delete, bp_seqfeature_gff3, bp_seqfeature_load, bp_seqret, bp_seqretsplit, bp_split_seq, bp_sreformat, bp_taxid4species, bp_taxonomy2tree, bp_translate_seq, bp_tree2pag, bp_unflatten_seq

You can try to import a Bioperl module to check if everything is working properly.

#!/usr/bin/perl -w
# If Bioperl is installed correctly, this loads without errors
use Bio::Seq;

Writing a nucleotide sequence to a FASTA file

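#!/usr/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;

# Build the sequence object in memory
my $seq_obj = Bio::Seq->new(-seq => "gattaca",
         -display_id => "#10191997",
         -desc => "Example",
         -alphabet => "dna" );
# Open sequence.fasta for writing and write the sequence to it
my $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' );
$seqio_obj->write_seq($seq_obj);

The content of the resulting sequence.fasta file will be:

>#10191997 Example
gattaca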

Reading a GenBank file

Opening the same example I used last time (Hippopotamus amphibius mitochondrion, complete genome).

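#!/usr/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;

# -file takes the path to the GenBank file (empty here)
my $seqio_obj = Bio::SeqIO->new(-file => "", -format => "genbank" );

# Print the sequence of every record in the file
while (my $seq_obj = $seqio_obj->next_seq) {
    print $seq_obj->seq, "\n";
}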
Querying GenBank online

With Bioperl it is possible to programmatically query and retrieve data directly from GenBank. For example, the script below retrieves the same Hippopotamus mitochondrial genome I used in the example above.

#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;
use Bio::DB::Query::GenBank;

# Entrez query restricted to the organism and the record locus
my $query = "Hippopotamus amphibius[ORGN] AND NC_000889[LOCUS]";
my $query_obj = Bio::DB::Query::GenBank->new(-db => 'nucleotide', -query => $query );
my $gb_obj = Bio::DB::GenBank->new;
my $stream_obj = $gb_obj->get_Stream_by_query($query_obj);

# Print the identifier and length of each record returned
while (my $seq_obj = $stream_obj->next_seq) {
    print $seq_obj->display_id, "\t", $seq_obj->length, "\n";
}
Substitutions in a phylogenetic tree file

The Newick tree

The Newick tree format is a way of representing graph-theoretical trees with edge lengths using parentheses and commas.

A Newick tree example:

(((Espresso:2,(Milk Foam:2,Espresso Macchiato:5,((Steamed Milk:2,Cappucino:2,(Whipped Cream:1,Chocolate Syrup:1,Cafe Mocha:3):5):5,Flat White:2):5):5):1,Coffee arabica:0.1,(Columbian:1.5,((Medium Roast:1,Viennese Roast:3,American Roast:5,Instant Coffee:9):2,Heavy Roast:0.1,French Roast:0.2,European Roast:1):5,Brazilian:0.1):1):1,Americano:10,Water:1);

A graphical representation of the Newick tree above (using the library):

The Newick format is commonly used to store phylogenetic trees.
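To make the format more concrete, here is a minimal Python sketch that pulls the labels and edge lengths out of a Newick string using only a regular expression. The tree is a shortened, made-up variant of the coffee tree above, and the names `LABEL_WITH_LENGTH` and `leaf_lengths` are hypothetical helpers for this illustration, not part of any library.

```python
import re

# Shortened, illustrative variant of the coffee tree shown above.
newick = "((Espresso:2,(Milk Foam:2,Flat White:2):5):1,Americano:10,Water:1);"

# A label is a run of characters without Newick punctuation,
# immediately followed by ':' and a numeric edge length.
LABEL_WITH_LENGTH = re.compile(r"([^(),:;]+):([0-9.]+)")


def leaf_lengths(tree):
    """Return (label, edge_length) pairs in the order they appear."""
    return [(name.strip(), float(length))
            for name, length in LABEL_WITH_LENGTH.findall(tree)]


pairs = leaf_lengths(newick)
for name, length in pairs:
    print(name, length)
```

Edge lengths that directly follow a closing parenthesis belong to unnamed internal nodes, so the pattern skips them and only the named nodes are listed.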

The problem

A phylogenetic tree can be highly branched and dense, and even with proper visualization software it can be difficult to analyse. Additionally, as a tree is produced by a chain of different software tools from laboratory data, the label of each leaf/node may not be meaningful to a human reader.

For this particular problem, an example of a node label could be SXS_3014_Albula_vulpes_id_30.

There was a spreadsheet with more meaningful information where a node label could be used as a primary key. Example for the node above:

Taxon Order | Family | Genus | Species | ID
Albuliformes | Albulidae | Albula | vulpes | SXS_3014_Albula_vulpes_id_30

The problem consists of using the tree and the spreadsheet to produce a new tree with the same structure, where each node has a more meaningful label.
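The "same structure" requirement can be checked mechanically. The sketch below is one possible way to do it, not the check used in the original script: it strips every label from both trees and compares what is left (parentheses, commas and edge lengths). The function name `skeleton` and both example trees are invented for illustration.

```python
import re


def skeleton(tree):
    """Remove node labels, keeping parentheses, commas and edge lengths."""
    # A label is a token that directly follows '(', ',' or ')'.
    # Edge lengths follow ':' and are therefore left untouched.
    return re.sub(r"(?<=[(),])[^(),:;]+", "", tree)


old_tree = "(SXS_3014_Albula_vulpes_id_30:0.1,(B:1,C:2):0.5);"
new_tree = "(Albula_vulpes:0.1,(Bee:1,Sea:2):0.5);"

same_structure = skeleton(old_tree) == skeleton(new_tree)
print(skeleton(old_tree), same_structure)
```

If the two skeletons differ after relabelling, a substitution has touched something other than a label.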

The approach

The new tree can be built by substituting each label of the initial tree with the respective information from the spreadsheet. A script can be used to automate this process.

The solution

After converting the spreadsheet to a CSV file, which is easily handled by Python's csv library, the problem reduces to file handling and string substitution. Fortunately, due to the simplicity of the Newick format and its limited vocabulary, a tree parser is not necessary.
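The idea can be sketched in a few lines of Python. This is not the script from the repository, only an illustration under some assumptions: the column names follow the spreadsheet example above, the second CSV row and the two-leaf tree are invented, the new label format (fields joined with underscores) is my own choice, and `load_labels` and `relabel` are hypothetical names.

```python
import csv
import io
import re

# Illustrative stand-ins for the CSV export and the tree file.
CSV_TEXT = """Taxon Order,Family,Genus,Species,ID
Albuliformes,Albulidae,Albula,vulpes,SXS_3014_Albula_vulpes_id_30
Elopiformes,Elopidae,Elops,saurus,SXS_3015_Elops_saurus_id_31
"""
TREE = "(SXS_3014_Albula_vulpes_id_30:0.1,SXS_3015_Elops_saurus_id_31:0.2);"


def load_labels(csv_file):
    """Map each node ID to a readable label built from the other columns."""
    labels = {}
    for row in csv.DictReader(csv_file):
        labels[row["ID"]] = "_".join(
            [row["Taxon Order"], row["Family"], row["Genus"], row["Species"]]
        )
    return labels


def relabel(tree, labels):
    """Replace every known label in the Newick string, leaving the rest intact."""
    def swap(match):
        token = match.group(0)
        # Unknown tokens (for example edge lengths) are returned unchanged.
        return labels.get(token, token)

    # A token is any run of characters between Newick punctuation marks.
    return re.sub(r"[^(),:;]+", swap, tree)


new_tree = relabel(TREE, load_labels(io.StringIO(CSV_TEXT)))
print(new_tree)
```

Matching whole tokens rather than calling `str.replace` for each ID avoids accidental partial hits, for instance when one ID is a prefix of another.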

Source code on GitHub.

Difficulties found

The spreadsheet was originally a Microsoft Office Excel 2007 (.xlsx) file. The CSV conversion provided by Excel was not good, and no configuration options were available. In the end, the conversion provided by the LibreOffice productivity suite was more configurable, and its output was easier for the CSV library to read.

In the script, the DictReader class proved much more reliable in the long term and tolerant of changes in the spreadsheet, as long as the column names remain the same.
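A small sketch of why this matters, using invented data: the second CSV has the same content as the first, but with the columns reordered and an extra "Notes" column added. The helper names `genus_by_id` and `genus_by_position` are hypothetical. Lookup by column name keeps working, while lookup by position silently returns the wrong fields.

```python
import csv
import io

original = (
    "Family,Genus,Species,ID\n"
    "Albulidae,Albula,vulpes,SXS_3014_Albula_vulpes_id_30\n"
)
# Same data after the columns were reordered and a new one was added.
reordered = (
    "ID,Notes,Species,Genus,Family\n"
    "SXS_3014_Albula_vulpes_id_30,checked,vulpes,Albula,Albulidae\n"
)


def genus_by_id(text):
    """Look columns up by name, as csv.DictReader allows."""
    return {row["ID"]: row["Genus"] for row in csv.DictReader(io.StringIO(text))}


def genus_by_position(text):
    """Look columns up by index, assuming the original layout."""
    rows = list(csv.reader(io.StringIO(text)))[1:]  # skip the header row
    return {row[3]: row[1] for row in rows}


print(genus_by_id(original) == genus_by_id(reordered))
print(genus_by_position(reordered))
```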

P.S. Due to the nature of the original sources for the tree and spreadsheet, I don’t have authorization to publish their complete and original content. The artificial data displayed here is merely illustrative.