Merging n lists of size k, using two different approaches.
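The excerpt does not say which two approaches are compared, so the following is only a sketch of two common ones, assuming the n input lists are already sorted: a heap-based k-way merge and a round-by-round pairwise merge. The function names are illustrative. Both run in O(nk log n) time for n lists of k items each.

```python
import heapq


def merge_with_heap(lists):
    """Approach 1: keep the smallest unread item of each list in a heap."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, list_idx, item_idx = heapq.heappop(heap)
        merged.append(value)
        nxt = item_idx + 1
        if nxt < len(lists[list_idx]):
            heapq.heappush(heap, (lists[list_idx][nxt], list_idx, nxt))
    return merged


def merge_two(a, b):
    """Classic two-pointer merge of two sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def merge_pairwise(lists):
    """Approach 2: merge neighbouring lists in rounds until one remains."""
    current = [list(lst) for lst in lists]
    while len(current) > 1:
        merged_round = []
        for i in range(0, len(current), 2):
            if i + 1 < len(current):
                merged_round.append(merge_two(current[i], current[i + 1]))
            else:
                merged_round.append(current[i])
        current = merged_round
    return current[0] if current else []
```

The heap version touches each item once and pays log n per heap operation; the pairwise version performs about log n rounds, each of which scans all nk items.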
I was looking for a few plants to bring some green and some life into my apartment.
Since this is a new subject for me, I did a little research and came across the TED talk “How to grow your own fresh air” by researcher Kamal Meattle. I decided to look for the plants he recommended:
I ended up choosing an Epipremnum aureum, known in the US as “Money Plant” or “Pothos”, and a Chlorophytum comosum, known in Brazil as clorofito and in the US as “Spider Plant”. My spider plant is of the Variegatum variety, which has dark green leaves. Each one cost about US$ 12 (currently around R$ 24).
They have been very easy to care for and have survived very well in the conditions I described above. Recently I had to travel for two weeks and leave them without water. Before that I also had to keep them in a spot with little sun, because for a few days I was looking after a cat and both plants are toxic to cats. The pothos (jibóia in Brazil) got a little weak, with some yellowed leaves, but after a week back on its normal care it recovered. The spider plant, on the other hand, did great: you would not guess it had gone without care, and it even grew a little. It is a really tough plant. I have even been seeing the variety
The effect of caring for these plants was felt the moment they came into the apartment. The green they brought completely changed the space. I no longer know how I lived here without plants before. These two are in pots on a flat surface near the window, but they could also be in hanging pots. The next steps are to try other methods that allow growing something edible, such as tomatoes, and to try building a Window Farm :D.
So, there is the tip: if you are looking for an easy, good-looking plant to grow inside your apartment, in conditions like mine or probably better, these are my suggestions.
This:
i\hbar\frac{\partial}{\partial t}\left|\Psi(t)\right>=H\left|\Psi(t)\right>
Produces this:
$latex i\hbar\frac{\partial}{\partial t}\left|\Psi(t)\right>=H\left|\Psi(t)\right>$
If you are seeing a complicated math formula rendered as an image, then it worked.
To help me decide which classes to take each semester, I scraped the graduate and undergraduate bulletins. For every class I could get the class name, prerequisites, credits, teacher, program, description, etc., in a formatted tabular document.
Using the Python csv library I could read the tables and convert the data to other formats. One very useful format for handling graph structures is the DOT language (part of the Graphviz project), in which you can describe both the structure of the graph and the layout of its elements.
Here is the Python source code to convert the tables to graphs, at GitHub.
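The linked script is not reproduced here. As a rough idea of what such a conversion can look like, the sketch below reads a CSV table and emits a DOT digraph with one edge from each prerequisite to the class that requires it. The column names (code, name, prerequisites) and the ";" separator are assumptions made for this illustration, not the layout of the real scraped tables.

```python
import csv
import io


def csv_to_dot(csv_text, graph_name="courses"):
    """Build a DOT digraph: one node per class, one edge per prerequisite."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = ["digraph %s {" % graph_name]
    for row in reader:
        course = row["code"].strip()
        # Node declaration; the label is what Graphviz draws inside the node.
        lines.append('    "%s" [label="%s"];' % (course, row["name"].strip()))
        # Hypothetical convention: prerequisites separated by ';'.
        for prereq in row["prerequisites"].split(";"):
            prereq = prereq.strip()
            if prereq:
                lines.append('    "%s" -> "%s";' % (prereq, course))
    lines.append("}")
    return "\n".join(lines)


sample = """code,name,prerequisites
CS101,Intro to Programming,
CS201,Data Structures,CS101
CS301,Algorithms,CS201;MATH210
"""

print(csv_to_dot(sample))
```

The printed text can be saved to a file and rendered with the Graphviz dot tool, for example dot -Tpng courses.dot -o courses.png. Names containing double quotes would need escaping, which this sketch skips.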
The final result (click to view in full size):
Limitations and comments:
Perl is a widely used language in bioinformatics. As I have already experimented with Python and Biopython for a few simple bioinformatics tasks, I will now try Perl and Bioperl.
Install on Ubuntu 11.10 (oneiric)
Perl already comes with Ubuntu. Bioperl can be installed (without CPAN):
$ sudo apt-get install bioperl
After the installation you have several tools in your PATH:
bp_aacomp, bp_biblio, bp_biofetch_genbank_proxy, bp_bioflat_index, bp_biogetseq, bp_blast2tree, bp_bulk_load_gff, bp_chaos_plot, bp_classify_hits_kingdom, bp_composite_LD, bp_das_server, bp_dbsplit, bp_download_query_genbank, bp_einfo, bp_extract_feature_seq, bp_fast_load_gff, bp_fastam9_to_table, bp_fetch, bp_filter_search, bp_flanks, bp_gccalc, bp_genbank2gff, bp_genbank2gff3, bp_generate_histogram, bp_heterogeneity_test, bp_hivq, bp_hmmer_to_table, bp_index, bp_load_gff, bp_local_taxonomydb_query, bp_make_mrna_protein, bp_mask_by_search, bp_meta_gff, bp_mrtrans, bp_mutate, bp_netinstall, bp_nexus2nh, bp_nrdb, bp_oligo_count, bp_pairwise_kaks, bp_parse_hmmsearch, bp_process_gadfly, bp_process_sgd, bp_process_wormbase, bp_query_entrez_taxa, bp_remote_blast, bp_revtrans-motif, bp_search2BSML, bp_search2alnblocks, bp_search2gff, bp_search2table, bp_search2tribe, bp_seq_length, bp_seqconvert, bp_seqfeature_delete, bp_seqfeature_gff3, bp_seqfeature_load, bp_seqret, bp_seqretsplit, bp_split_seq, bp_sreformat, bp_taxid4species, bp_taxonomy2tree, bp_translate_seq, bp_tree2pag, bp_unflatten_seq
You can try to import a Bioperl module to check if everything is working properly.
#!/usr/bin/perl -w
use Bio::Seq;
Writing a nucleotide sequence to a FASTA file
#!/usr/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;

# Build an in-memory sequence object.
my $seq_obj = Bio::Seq->new(-seq        => "gattaca",
                            -display_id => "#10191997",
                            -desc       => "Example",
                            -alphabet   => "dna" );

# The leading '>' in the file name opens sequence.fasta for writing.
my $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' );
$seqio_obj->write_seq($seq_obj);
The resulting sequence.fasta file will contain:
>#10191997 Example
gattaca
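A FASTA record is just a header line that starts with ">" (identifier, then an optional description) followed by the sequence, usually wrapped to a fixed width. To make that layout explicit, here is a minimal plain-Python sketch that does not use Bioperl or Biopython; the function name and the width of 60 are choices made for this illustration.

```python
def format_fasta(display_id, description, sequence, width=60):
    """Return one FASTA record as text: '>' header plus wrapped sequence."""
    header = ">%s %s" % (display_id, description)
    # Split the sequence into chunks of at most `width` characters.
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + chunks) + "\n"


record = format_fasta("#10191997", "Example", "gattaca")
print(record, end="")
```

For real work the library writers are preferable, since they also handle multiple records and other formats.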
Reading a GenBank file
Opening the same example I used last time (Hippopotamus amphibius mitochondrion, complete genome).
#!/usr/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;

my $seqio_obj = Bio::SeqIO->new(-file => "sequence.gb", -format => "genbank" );

# Print the raw sequence of every record in the file.
while (my $seq_obj = $seqio_obj->next_seq) {
    print $seq_obj->seq, "\n";
}
Querying GenBank online
With Bioperl it is possible to query and retrieve data directly from GenBank programmatically. For example, the script below retrieves the same Hippopotamus mitochondrial genome I used in the example above.
#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;
use Bio::DB::Query::GenBank;

# Entrez query: restrict by organism and by locus name.
my $query = "Hippopotamus amphibius[ORGN] AND NC_000889[LOCUS]";
my $query_obj = Bio::DB::Query::GenBank->new(-db => 'nucleotide', -query => $query );
my $gb_obj = Bio::DB::GenBank->new;
my $stream_obj = $gb_obj->get_Stream_by_query($query_obj);

# Print the identifier and length of every sequence returned.
while (my $seq_obj = $stream_obj->next_seq) {
    print $seq_obj->display_id, "\t", $seq_obj->length, "\n";
}
The Newick tree
The Newick tree format is a way of representing trees (in the graph theory sense) with edge lengths, using parentheses and commas.
A Newick tree example:
(((Espresso:2,(Milk Foam:2,Espresso Macchiato:5,((Steamed Milk:2,Cappucino:2,(Whipped Cream:1,Chocolate Syrup:1,Cafe Mocha:3):5):5,Flat White:2):5):5):1,Coffee arabica:0.1,(Columbian:1.5,((Medium Roast:1,Viennese Roast:3,American Roast:5,Instant Coffee:9):2,Heavy Roast:0.1,French Roast:0.2,European Roast:1):5,Brazilian:0.1):1):1,Americano:10,Water:1);
A graphical representation of the Newick tree above (using the http://www.jsphylosvg.com/ library):
The Newick format is commonly used to store phylogenetic trees.
The problem
A phylogenetic tree can be highly branched and dense, and even with proper visualization software it can be difficult to analyse. Additionally, since a tree is produced by a chain of different software tools from laboratory data, the label of each leaf/node may not be meaningful to a human reader.
For this particular problem, an example of a node label could be SXS_3014_Albula_vulpes_id_30.
There was a spreadsheet with more meaningful information, in which the node label could be used as a primary key. Example for the node above:
Taxon Order | Family | Genus | Species | ID |
---|---|---|---|---|
Albuliformes | Albulidae | Albula | vulpes | SXS_3014_Albula_vulpes_id_30 |
The problem consists of using the tree and the spreadsheet to produce a new tree with the same structure, in which each node has a more meaningful label.
The approach
The new tree can be built by substituting each label of the initial tree with the corresponding information from the spreadsheet. A script can be used to automate this process.
The solution
After converting the spreadsheet to a CSV file, which is easily handled by Python's csv library, the problem is reduced to file handling and string substitution. Fortunately, due to the simplicity of the Newick format and its limited vocabulary, a tree parser is not necessary.
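The original script cannot be published, so what follows is only a sketch of the substitution idea. It assumes the CSV has the columns shown in the illustrative table (Taxon Order, Family, Genus, Species, ID), that labels contain no whitespace or Newick punctuation, and that the new label is the four taxonomy fields joined by underscores; that last choice, like the function names, is made up for this example.

```python
import csv
import io
import re

# A label is any run of characters that is not Newick punctuation or whitespace.
LABEL_PATTERN = re.compile(r"[^(),:;\s]+")


def load_labels(csv_file):
    """Map each node ID to a more readable label built from the spreadsheet."""
    labels = {}
    for row in csv.DictReader(csv_file):
        labels[row["ID"]] = "_".join(
            [row["Taxon Order"], row["Family"], row["Genus"], row["Species"]]
        )
    return labels


def relabel_newick(newick, labels):
    """Replace every known label; leave branch lengths and unknown labels alone."""
    return LABEL_PATTERN.sub(
        lambda match: labels.get(match.group(0), match.group(0)), newick
    )


csv_text = (
    "Taxon Order,Family,Genus,Species,ID\n"
    "Albuliformes,Albulidae,Albula,vulpes,SXS_3014_Albula_vulpes_id_30\n"
)
tree = "(SXS_3014_Albula_vulpes_id_30:1.5,SXS_0001_unknown_id_3:2);"

labels = load_labels(io.StringIO(csv_text))
print(relabel_newick(tree, labels))
```

Matching whole labels, rather than calling str.replace once per ID, avoids accidentally rewriting an ID that happens to be a prefix of another one (for example id_3 inside id_30).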
Difficulties found
The spreadsheet was originally a Microsoft Office Excel 2007 (.xlsx) file. The CSV conversion provided by Excel was not good and offered no configuration options. In the end, the conversion provided by the LibreOffice productivity suite was more configurable and its output was easier for the csv library to read.
In the script, the DictReader class proved much more reliable in the long term and tolerant of changes in the spreadsheet, as long as the column names remain the same.
P.S. Due to the nature of the original sources of the tree and the spreadsheet, I am not authorized to publish their complete, original content. The artificial data displayed here is merely illustrative.
DNA inspired sculpture by Charles Jencks. Creative Commons photo by Maria Keays.
What is GenBank?
The GenBank sequence database is a widely used collection of nucleotide sequences and their protein translations. A GenBank sequence record file typically has a .gbk or .gb extension and contains plain text. An example of a GenBank file can be found here.
Filename problem
Although plenty of metadata is available inside a GenBank record, the name of the file does not always match its content. This is a potential source of confusion when organizing files, and it takes additional effort to rename the files according to their content.
Approach using Biopython
The Biopython project is a mature open source international collaboration of volunteer developers, providing Python libraries for a wide range of bioinformatics problems. Among other tools, Biopython includes modules for reading and writing different sequence file formats including the GenBank’s record files.
Although it is possible to write a parser for GenBank files, developing and maintaining such a tool would be a redundant effort. The parsing can be delegated to Biopython, so that the programming can focus on the renaming mechanism.
Biopython installation on Linux (Ubuntu 11.10) or Apple OS X (Lion)
For both Ubuntu 11.10 and OS X Lion, a modern version of Python already comes out of the box.
For Linux you just need to install the Biopython package. One way to install Biopython on an APT-based distribution such as Ubuntu 11.10 (Oneiric Ocelot) is:
# apt-get install python-biopython
On Apple OS X (Lion) you can install Biopython using easy_install, a popular package manager for Python. easy_install is bundled with Setuptools, a set of tools for Python.
To install Setuptools, download the .egg file for your Python version (probably setuptools-0.6c11-py2.7.egg) and execute it as a shell script:
sudo sh setuptools-0.6c11-py2.7.egg
After this you already have easy_install in place and you can use it to install the Biopython library:
sudo easy_install -f http://biopython.org/DIST/ biopython
On both operating systems you can check whether Biopython is already installed using the Python interactive interpreter:
$ python
Python 2.7.2+ (default, Oct 4 2011, 20:03:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import Bio
>>> Bio.__version__
'1.57'
>>>
Automatic rename example through scripting
Below is the Python source code for a simple use of Biopython: renaming a GenBank file to its description, after removing commas and spaces.
Using the previous example of a GenBank file, suppose you have a file called sequence.gb. To rename this file to the description stored in its GenBank metadata, you can run the script:
python gbkrename.py sequence.gb
After this, the file will be called Hippopotamus_amphibius_mitochondrial_DNA_complete_genome.gbk.
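The script itself is not included in this excerpt. A minimal sketch of how such a rename could work is shown below; the function names are hypothetical, and replacing spaces with underscores and dropping a trailing period are assumptions inferred from the resulting file name above. The only Biopython call used is SeqIO.read, which expects the file to contain exactly one record.

```python
import os


def description_to_filename(description, extension=".gbk"):
    """Turn a GenBank description into a file name (assumed naming rules)."""
    # Drop a trailing period and commas, then join the words with underscores.
    cleaned = description.strip().rstrip(".").replace(",", "")
    return "_".join(cleaned.split()) + extension


def rename_genbank(path):
    """Rename a single-record GenBank file after its description."""
    # Imported here so the naming helper can be used without Biopython installed.
    from Bio import SeqIO

    record = SeqIO.read(path, "genbank")
    new_path = os.path.join(
        os.path.dirname(path), description_to_filename(record.description)
    )
    os.rename(path, new_path)
    return new_path


# Example call (needs Biopython and an existing file):
# rename_genbank("sequence.gb")
print(description_to_filename("Hippopotamus amphibius mitochondrial DNA, complete genome."))
```

Keeping the naming rule in its own function makes it easy to change the convention, or to test it, without touching any file.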
Improvements
There is plenty of room for improvement, such as: