Substitutions in a phylogenetic tree file

The Newick tree

The Newick tree format is a way of representing graph-theoretical trees with edge lengths using parentheses and commas.

A Newick tree example:

(((Espresso:2,(Milk Foam:2,Espresso Macchiato:5,((Steamed Milk:2,Cappucino:2,(Whipped Cream:1,Chocolate Syrup:1,Cafe Mocha:3):5):5,Flat White:2):5):5):1,Coffee arabica:0.1,(Columbian:1.5,((Medium Roast:1,Viennese Roast:3,American Roast:5,Instant Coffee:9):2,Heavy Roast:0.1,French Roast:0.2,European Roast:1):5,Brazilian:0.1):1):1,Americano:10,Water:1);

A graphical representation of the Newick tree above (using the http://www.jsphylosvg.com/ library):

The Newick format is commonly used to store phylogenetic trees.

The problem

A phylogenetic tree can be highly branched and dense, and even with proper visualization software it can be difficult to analyse. Additionally, as a tree is produced by a chain of different programs from laboratory data, the label for each leaf or node may not be meaningful to a human reader.

For this particular problem, an example of a node label could be SXS_3014_Albula_vulpes_id_30.

There was a spreadsheet with more meaningful information where a node label could be used as a primary key. Example for the node above:

Taxon Order    Family      Genus    Species   ID
Albuliformes   Albulidae   Albula   vulpes    SXS_3014_Albula_vulpes_id_30

The problem consists of using the tree and the spreadsheet to produce a new tree with the same structure, where each node has a more meaningful label.

The approach

The new tree can be built by substituting each label of the initial tree with the respective information from the spreadsheet. A script can be used to automate this process.

The solution

After converting the spreadsheet to a CSV file, which is more easily handled by a Python CSV library, the problem is reduced to file handling and string substitution. Fortunately, due to the simplicity of the Newick format and its limited vocabulary, a tree parser is not necessary.
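As an illustration of the idea (not the actual script from the repository), here is a minimal sketch. It assumes the CSV has the column names from the example row above (Family, Genus, Species, ID); the function names load_labels and relabel, and the choice of joining the fields with underscores, are hypothetical. Labels in a Newick string are delimited by parentheses, commas, colons and semicolons, so a regular expression over those tokens is enough and no tree parser is involved.

```python
import csv
import io
import re


def load_labels(csv_file, key_column="ID",
                label_columns=("Family", "Genus", "Species")):
    """Map each original node label to a more readable one.

    Column names are assumptions based on the example spreadsheet row.
    """
    labels = {}
    for row in csv.DictReader(csv_file):
        labels[row[key_column]] = "_".join(row[c] for c in label_columns)
    return labels


def relabel(newick, labels):
    """Replace every token of the Newick string that is a known label.

    A token is any run of characters between the Newick delimiters
    ( ) , : ; so branch lengths are tokens too, but they are simply not
    found in the mapping and are left untouched.
    """
    return re.sub(r"[^(),:;]+",
                  lambda match: labels.get(match.group(0), match.group(0)),
                  newick)


# Artificial data, for illustration only.
csv_text = (
    "Taxon Order,Family,Genus,Species,ID\n"
    "Albuliformes,Albulidae,Albula,vulpes,SXS_3014_Albula_vulpes_id_30\n"
)
labels = load_labels(io.StringIO(csv_text))
tree = "(SXS_3014_Albula_vulpes_id_30:1.5,Unknown_sample:2);"
print(relabel(tree, labels))
# prints (Albulidae_Albula_vulpes:1.5,Unknown_sample:2);
```

Matching whole tokens instead of calling str.replace for each label avoids trouble when one label is a prefix of another (for example an ID ending in id_3 and another ending in id_30).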

Source code on GitHub.

Difficulties found

The spreadsheet was originally in Microsoft Office Excel 2007 format (.xlsx). The CSV conversion provided by Excel was not good, and no configuration options were available. In the end, the conversion provided by the LibreOffice productivity suite was more configurable, and its output was easier for the CSV library to read.

In the script, the DictReader class proved in the long term to be much more reliable and tolerant of changes in the spreadsheet, as long as the column names remain the same.

P.S. Due to the nature of the original sources for the tree and the spreadsheet, I don’t have authorization to publish their complete and original content. The artificial data displayed here is merely illustrative.

GenBank renaming

http://www.flickr.com/photos/maria_keays/1251843227/
DNA inspired sculpture by Charles Jencks. Creative Commons photo by Maria Keays.

What is GenBank?

The GenBank sequence database is a widely used collection of nucleotide sequences and their protein translations. A GenBank sequence record file typically has a .gbk or .gb extension and contains plain text. An example of a GenBank file can be found here.

Filename problem

Although several pieces of metadata are available inside a GenBank record, the name of the file does not always match its content. This is a potential source of confusion when organizing files, and renaming the files according to their content requires additional effort.

Approach using Biopython

The Biopython project is a mature, open source, international collaboration of volunteer developers providing Python libraries for a wide range of bioinformatics problems. Among other tools, Biopython includes modules for reading and writing different sequence file formats, including GenBank record files.

Although it is possible to write a parser for GenBank files, developing and maintaining such a tool would be a redundant effort. Parsing can be delegated to Biopython, leaving the programming to focus on the renaming mechanism.

Biopython installation on Linux (Ubuntu 11.10) or Apple OS X (Lion)

For both Ubuntu 11.10 and OS X Lion, a modern version of Python already comes out of the box.

For Linux you just need to install the Biopython package. One way to install Biopython on an APT-based distribution such as Ubuntu 11.10 (Oneiric Ocelot) is:

# apt-get install python-biopython

On Apple OS X (Lion) you can install Biopython using easy_install, a popular package manager for Python. easy_install is bundled with Setuptools, a set of tools for Python.

To install Setuptools, download the .egg file for your Python version (probably setuptools-0.6c11-py2.7.egg) and execute it as a shell script:

sudo sh setuptools-0.6c11-py2.7.egg

After this, easy_install is in place and you can use it to install the Biopython library:

sudo easy_install -f http://biopython.org/DIST/ biopython

On both operating systems you can test whether Biopython is already installed using the Python interactive interpreter:

$ python
Python 2.7.2+ (default, Oct 4 2011, 20:03:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import Bio
>>> Bio.__version__
'1.57'
>>>

Automatic rename example through scripting

Below is the Python source code for a simple use of Biopython to rename a GenBank file to its description after removing commas and spaces.

Using the previous example of a GenBank file, suppose you have a file called sequence.gb. To rename this file to the GenBank description metadata inside it, you can use the script:

python gbkrename.py sequence.gb

And after this it will be called Hippopotamus_amphibius_mitochondrial_DNA_complete_genome.gbk.
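For readers who just want the gist of the approach, here is a minimal sketch of how such a rename could be done. It is not the original gbkrename.py: the function names description_to_filename and rename_genbank are hypothetical, and stripping a trailing period from the description is my assumption about how the file name above is obtained. The only Biopython call used is SeqIO.read, which reads a single record and raises ValueError if the file holds none or more than one.

```python
import os
import re


def description_to_filename(description, extension=".gbk"):
    """Turn a GenBank description into a file name.

    Commas are removed and runs of whitespace become underscores.
    Dropping a trailing period is an assumption of this sketch.
    """
    name = description.strip().rstrip(".")
    name = name.replace(",", "")
    name = re.sub(r"\s+", "_", name)
    return name + extension


def rename_genbank(path):
    """Rename a single-record GenBank file after its description."""
    # Imported here so the helper above works without Biopython installed.
    from Bio import SeqIO

    record = SeqIO.read(path, "genbank")
    new_path = os.path.join(os.path.dirname(path),
                            description_to_filename(record.description))
    os.rename(path, new_path)
    return new_path


print(description_to_filename(
    "Hippopotamus amphibius mitochondrial DNA, complete genome."))
# prints Hippopotamus_amphibius_mitochondrial_DNA_complete_genome.gbk
```

With Biopython installed, calling rename_genbank("sequence.gb") would rename the file in place and return the new path.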

Improvements

There is plenty of room for improvement, such as:

  • Better command line parsing with optparse and parameterization of every configuration option.
  • A graphical interface.
  • Handling special cases such as multiple sequences in a single GenBank file.

C# class properties example

An example of using C# class properties to convert temperatures between Celsius, Fahrenheit and Kelvin. The temperature is encapsulated and stored in an internal representation, in this example in Celsius (private double c). Each conversion is accessible by getting or setting a property.

using System;

public class Temperature {
	private double c;

	public double celsius {
		get {
			return c;
		}
		set {
			c = value;
		}
	}

	public double fahrenheit {
		get {
			return (c * 9 / 5) + 32;
		}
		set {
			c = (value - 32) * 5 / 9;
		}
	}

	public double kelvin {
		get {
			return c + 273.15;
		}
		set {
			c = value - 273.15;
		}
	}
}

public class TemperatureExample {
	public static void Main(string[] args) {
		Temperature fortaleza = new Temperature();
		fortaleza.celsius = 26;

		Temperature washington = new Temperature();
		washington.fahrenheit = 32;

		Temperature sun = new Temperature();
		sun.kelvin = 5778;

		Console.WriteLine("Fortaleza {0}°C / {1}°F / {2} K",
			fortaleza.celsius, fortaleza.fahrenheit, fortaleza.kelvin);
		Console.WriteLine("Washington {0}°C / {1}°F / {2} K",
			washington.celsius, washington.fahrenheit, washington.kelvin);
		Console.WriteLine("Sun {0}°C / {1}°F / {2} K",
			sun.celsius, sun.fahrenheit, sun.kelvin);
	}
}

Output:

Fortaleza 26°C / 78.8°F / 299.15 K
Washington 0°C / 32°F / 273.15 K
Sun 5504.85°C / 9940.73°F / 5778 K

There are some good examples of C# class properties in Using Properties (C# Programming Guide) on MSDN.

Halloween Costume 2011

And again, this year people got more scared than I was expecting. It looks like Sandman (Wesley Dodds) or Spy vs. Spy.

My idea was to get some stuff from the wardrobe and spend a little on a special item that would not be completely useless after Halloween. 🙂

The costume items:

  • A classic trench coat from Men’s Wearhouse (it is actually a great coat for rain, cold weather or snow: water resistant, with an inner layer that can be removed when it is not so cold. It is a fairly common coat in DC).
  • A gray fedora hat from Filene’s Basement. It was not really matching but worked.
  • Black leather gloves from last winter.
  • Snow boots from my hiking in Colorado last year.
  • An Israeli Civilian Gas Mask. The only thing I had to buy. From the product description:

    This is the gas mask issued to Israeli civilians when threatened with chemical attack by Saddam’s Iraq. It has full NBC (neuclear, biological, chemical) protection, and comes with one sealed filter.

Silicon Alley 500

It was Friday night and I received an invitation to the SA500 in New York. I think I read about the event on Twitter and filled in the forms. “Well, I have no plans for this Saturday… I could go to New York, take a look at the protests on Wall Street, go to this SA500 and maybe meet some developers from the companies I love”. So I took the first flight I could and left Washington DC.

One thing that I didn’t initially realize was that the event was going to be held inside the New York Stock Exchange on Wall Street. The same day, a global protest was erupting from Liberty Square in Manhattan’s financial district, at the Occupy Wall Street camp. The city was in turmoil. Paradoxically (or not), in the center of all this a group of 50 selected creative startups and 500 students and IT graduates were gathering to get to know each other.

I was able to talk with several CEOs and developers, and here are some of my impressions.

Language agnostic environment

Almost all the companies I talked to have a very plural, multicultural, language-agnostic environment. When you have a problem that can be better solved in Ruby, you do it in Ruby. When you have a problem better solved in Java, you do it in Java, and so on. The effort of integrating different technologies is by far eclipsed by the benefits of having a broad range of solutions. It is the good old “don’t put all your eggs in one basket” that companies sometimes try to forget.

Of course, not every company uses every available language, but the norm I saw was about 3 or 4 languages or platforms per company.

Python Rocks

If you look through my blog you may notice that although I program in several languages, I have a favorite. At SA500 I saw that Python is much stronger in startups than I had imagined. The majority of the companies use some Python on the front end or back end, and a lot of them use Django.

For a while I thought that Python did not have a good share of the market I’m looking at, but fortunately I was wrong.

Small is Bigger

Looking at some of the products that those companies created, you may think that there is a huge development team working on them. Actually, no.

Most of the companies have a small team of motivated engineers with the right methodologies and tools in hand, usually only 5 to 15 developers. It is not rare for the CEO himself to also be a developer who built the initial product and is still coding.

Context Aware Content

Many of the products from the 50 startups at the SA500 were context-aware applications. They gather information about the user’s preferences and geolocation to deliver more specific and richer content to the user. For example, knowing that the user is in a restaurant and that it is a sunny day, the application can show him that there is an event in a park one block away. And it is not just about geolocation: many of those applications, even without a user profile, can retrieve meaningful content based on his latest searches or browsing.

It is not a new idea, but with more powerful smartphones it is now much more applicable. Despite being a very simple concept, it is full of challenges and possibilities.

The “Netflix business model”

Rent something through a powerful yet simple web interface and have it quickly delivered to your door. At Renttherunway.com you can rent luxury designer dresses! At Artsicle.com you can rent pieces of art!

I can see this “Netflix” model being expanded and mixed with so many others.

Through the Open Source. To the Open Source

Of course, all those companies use a lot of free and open source software. They are lean. Nowadays this is commonplace in software development. But many of them go beyond that and create open source products with a vibrant community around them. Just to cite some examples, 10gen created MongoDB, my favorite NoSQL database, which is used by a lot of the companies at SA500. Long Tail Video has the JW Player, which I have already used here on the blog, and so have many others.

Videos are hot. Ads are hot.

Wow. So many companies working on products related to video or advertisement and sometimes even both. There is a race out there and I wonder what will come out.


I love gits!

And now I don’t have to buy new t-shirts for a while. :3

SA500 was amazing. I was able to talk with people who created products I love and use every day, such as Venmo and Meetup. Thank you all, especially nextjump.com, for organizing this event.

what I feel when I see someone using JPG when he should be using a PNG

Seriously, is it that hard?

To discover which file type you should use for your image, just follow these simple instructions, in priority order:

  1. Have text? Use PNG.
  2. Is it a piece of art like a drawing, a painting or a webcomic? Use PNG.
  3. It is… moving! Use GIF.
  4. Is it a photo? Use JPG.
  5. Is it not exactly a photo but contains photos (like people, trees and landscapes)? Use JPG.
  6. Is it not a photo and does not contain a photo, but you remain concerned about the size of your file despite the breakthroughs in telecommunication speeds? Try PNG with an indexed palette and Floyd–Steinberg color dithering.
  7. Nah, man. Use JPG, but with the lowest compression or highest quality options you can find.
  8. It’s nothing listed above! Sir, your problem is far away from the scope of these instructions.

Thank you.

Python, flatten a list

Surprisingly, Python doesn’t have a shortcut for flattening a list (or, more generally, a list of lists of lists of…).

I made a simple implementation that doesn’t use recursion and tries to be written clearly.

I take an element from a “notflat” list (a list that can contain other lists). If the element is not a list, we store it in our flat list. If the element is still a list, we deal with it later. The flat list only ever contains elements that are not lists.
To preserve the original order we reverse the elements at the end.
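
A minimal sketch of the steps described above could look like this. The names flatten, notflat and flat follow the description, and the details (popping from the end of the list, checking only for the list type) are my assumptions about one reasonable way to write it.

```python
def flatten(nested):
    """Flatten arbitrarily nested lists without recursion."""
    notflat = list(nested)  # work on a copy so the argument is not modified
    flat = []
    while notflat:
        element = notflat.pop()  # take the last pending element
        if isinstance(element, list):
            # Still a list: put its items back to be dealt with later.
            notflat.extend(element)
        else:
            flat.append(element)
    # Elements were collected from right to left, so restore the order.
    flat.reverse()
    return flat


print(flatten([1, [2, [3, 4]], [], 5]))
# prints [1, 2, 3, 4, 5]
```

Because the pending elements live in an explicit list instead of the call stack, very deep nesting does not hit Python’s recursion limit.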