DNA-inspired sculpture by Charles Jencks. Creative Commons photo by Maria Keays.
What is GenBank?
The GenBank sequence database is a widely used collection of nucleotide sequences and their protein translations. A GenBank sequence record file typically has a .gbk or .gb extension and contains plain text. An example of a GenBank file can be found here.
Filename problem
Although several pieces of metadata are available inside a GenBank record, the name of the file does not always match its content. This is a potential source of confusion when organizing files, and renaming them according to their content takes additional effort.
Approach using Biopython
The Biopython project is a mature, open source, international collaboration of volunteer developers, providing Python libraries for a wide range of bioinformatics problems. Among other tools, Biopython includes modules for reading and writing different sequence file formats, including GenBank record files.
Although it is possible to write a parser for GenBank files, developing and maintaining such a tool would be redundant effort. Parsing can be delegated to Biopython, so the programming can focus on the renaming mechanism.
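As a rough illustration of what delegating the parsing looks like, the sketch below reads a single record with Bio.SeqIO and collects a few of its metadata fields. The helper names load_record and summarize are made up for this example, and it assumes the file holds exactly one record.

```python
def load_record(path):
    """Parse a GenBank file that contains exactly one record."""
    # Imported inside the function so that summarize() below can be used
    # on any record-like object even where Biopython is not installed.
    from Bio import SeqIO

    return SeqIO.read(path, "genbank")


def summarize(record):
    """Collect a few metadata fields from a parsed record."""
    return {
        "id": record.id,
        "description": record.description,
        # Annotation keys depend on the record; .get() avoids a KeyError.
        "organism": record.annotations.get("organism"),
        "length": len(record.seq),
    }
```

For example, summarize(load_record("sequence.gb")) would return a dictionary with the record's identifier, description, organism and sequence length.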
Biopython installation on Linux (Ubuntu 11.10) or Apple OS X (Lion)
For both Ubuntu 11.10 and OS X Lion, a modern version of Python already comes out of the box.
For Linux you just need to install the Biopython package. One way to install Biopython on an APT-based distribution such as Ubuntu 11.10 (Oneiric Ocelot) is:
# apt-get install python-biopython
On Apple OS X (Lion) you can install Biopython using easy_install, a popular package manager for Python. easy_install is bundled with Setuptools, a set of tools for building and distributing Python packages.
To install Setuptools, download the .egg file for your Python version (probably setuptools-0.6c11-py2.7.egg) and execute it as a shell script:
sudo sh setuptools-0.6c11-py2.7.egg
After this, easy_install is in place and you can use it to install the Biopython library:
sudo easy_install -f http://biopython.org/DIST/ biopython
On both operating systems you can check whether Biopython is already installed using the Python interactive interpreter:
$ python
Python 2.7.2+ (default, Oct 4 2011, 20:03:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import Bio
>>> Bio.__version__
'1.57'
>>>
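The same check can also be scripted instead of typed interactively. A minimal sketch, where biopython_version is a hypothetical helper name:

```python
def biopython_version():
    """Return the installed Biopython version string, or None if absent."""
    try:
        import Bio
    except ImportError:
        return None
    return Bio.__version__


if __name__ == "__main__":
    version = biopython_version()
    if version is None:
        print("Biopython is not installed")
    else:
        print("Biopython " + version)
```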
Automatic rename example through scripting
Below is the Python source code for a simple use of Biopython: renaming a GenBank file to its description after removing commas and spaces.
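A minimal sketch of such a script, which may differ from the original gbkrename.py, could look like this. It assumes one record per file and that the DEFINITION line is exposed as record.description; the function names are illustrative.

```python
import os
import sys


def description_to_filename(description, extension=".gbk"):
    """Turn a GenBank description into a file name."""
    # Drop a trailing period if present, remove commas, and join the
    # remaining words with underscores.
    cleaned = description.strip().rstrip(".").replace(",", "")
    return "_".join(cleaned.split()) + extension


def rename_genbank(path):
    """Rename one GenBank file after the description of its record."""
    # Imported here so description_to_filename() can be reused without
    # Biopython; SeqIO.read raises ValueError unless the file holds
    # exactly one record.
    from Bio import SeqIO

    record = SeqIO.read(path, "genbank")
    new_name = description_to_filename(record.description)
    new_path = os.path.join(os.path.dirname(path), new_name)
    if os.path.exists(new_path):
        # Refuse to overwrite an existing file silently.
        raise FileExistsError(new_path)
    os.rename(path, new_path)
    return new_path


if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(rename_genbank(path))
```

The renamed file stays in the same directory as the original, and the script stops rather than overwrite a file that already has the target name.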
Using the previous example GenBank file, suppose you have a file called sequence.gb. To rename this file after the GenBank description metadata inside it, you can use the script:
python gbkrename.py sequence.gb
After this, the file will be called Hippopotamus_amphibius_mitochondrial_DNA_complete_genome.gbk.
Improvements
There is plenty of room for improvement, such as:
- Better command line parsing with optparse and parameterization of all configuration options.
- A graphical interface.
- Handling special cases such as multiple sequences in a single GenBank file.
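Two of these ideas can be sketched briefly. The example below uses argparse, the standard library successor to optparse, and writes each record of a multi-record file to its own file. The option names and the one-file-per-record policy are assumptions made for illustration, not part of the original script.

```python
import argparse


def build_parser():
    """Command line options (names chosen for this example only)."""
    parser = argparse.ArgumentParser(
        description="Split GenBank files into one file per record, "
                    "named after each record's description."
    )
    parser.add_argument("files", nargs="+", help="GenBank files to process")
    parser.add_argument("--extension", default=".gbk",
                        help="extension for the files written (default: .gbk)")
    parser.add_argument("--dry-run", action="store_true",
                        help="print the new names without writing anything")
    return parser


def description_to_filename(description, extension):
    """Remove commas and a trailing period, join words with underscores."""
    cleaned = description.strip().rstrip(".").replace(",", "")
    return "_".join(cleaned.split()) + extension


def split_records(path, extension=".gbk", dry_run=False):
    """Write every record in path to its own file in the current directory."""
    from Bio import SeqIO  # needed only when files are actually parsed

    names = []
    for record in SeqIO.parse(path, "genbank"):
        name = description_to_filename(record.description, extension)
        if not dry_run:
            # Records with identical descriptions would overwrite each other.
            SeqIO.write(record, name, "genbank")
        names.append(name)
    return names


def main(argv=None):
    args = build_parser().parse_args(argv)
    for path in args.files:
        for name in split_records(path, args.extension, args.dry_run):
            print(name)
```

Calling main() from an if __name__ == "__main__": guard would turn this into a command line tool, for example python gbksplit.py --dry-run sequences.gb, where gbksplit.py is a hypothetical file name.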
How was your experience with Biopython versus Bioperl? If you were recommending one to somebody with an equal knowledge of both languages (even if that knowledge was 0) which would you recommend?