Introduction to python for bioinformatics



A talk for the PRBB Technical Seminars series on Python and bioinformatics

Published in: Technology, Education

  • Reorganize everything better
  • Other voices: community?
  • Python's authors have always focused on: quick learning curve, readability
  • Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Readability counts. If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea. [.....]
  • You can mix instructions, functions and objects in the same python file (multi-paradigm)
  • The previous code (where each 'gene' was a dictionary of dictionaries with lists) can be better structured as an object
  • Introduction to python for bioinformatics

    1. 1. PRBB technical seminars: Introduction to Python for bioinformatics. Giovanni Marco Dall'Olio, Unidad de Biologia Evolutiva – CEXS, Barcelona (Spain)
    2. 2. Python
        • A programming language released in 1991 by Guido van Rossum
        • Used for a variety of applications, from scripting to web programming
        • Adopted by Google, Yahoo, YouTube, CERN, NASA, Red Hat...
        • Lots of jokes in the documentation (it is named after Monty Python)
    3. 3. Python and bioinformatics
        Python is widely used in bioinformatics (August 2007 survey)
    4. 4. Python – overall view
        • Learning curve: easy to learn, yet powerful ☺☺☺☺☺
        • Readability of a python program ☺☺☺☺☺
        • Community, availability of open source modules: for bioinformatics, CPAN is slightly bigger ☺☺☺☺
        • Programming paradigms: multi-paradigm (Object Oriented, structured, functional, etc.) ☺☺☺☺☺
        • Execution speed: interpreted language; importance of programmer effort over computer effort
        Notes: This talk is full of tables like this. They only reflect my opinion (biologist with 3-4 years of experience).
    5. 5. Python – Cons
        • State of open source libraries for bioinformatics: there is good support, but less than for perl and R
        • Execution speed: comparable to perl, java, ruby, ...
        • SOAP libraries: SOAPpy is very old, suds is the best one
        • Population Genetics modules: as with many other specific modules, perl and R are better supported
        • Lack of true multithreading support: a structural limit (the Global Interpreter Lock in CPython) prevents threads from running Python code in parallel (various workarounds exist)
    6. 6. Python – what makes me happy
        • General syntax ☺☺☺☺☺
        • People are forced to write programs similar to yours ☺☺☺☺☺
        • Quicker to write programs ☺☺☺☺☺
        • Object Oriented, multi-paradigm (will be explained later) ☺☺☺☺☺
        • Testing support (will be explained later) ☺☺☺☺
    7. 7. Python – learning curve
        Python's syntax is easy, so you can concentrate on algorithms and problems instead of the programming language.
    8. 8. Python – learning curve
        Python's syntax is easy, so you can concentrate on algorithms and problems instead of the programming language.
        With python you don't have to worry about:
        • Learning strange symbols (~=, <>, eq, '\n', {}...)
        • Alternative syntaxes to do the same task
        • Declaring variables
        • Inner structure of strings/arrays
        • Low level IO, passing variables by reference/value, etc.
    9. 9. Example of python code
            #!/usr/bin/env python
            '''Some python examples'''

            # example 1: a 'for' loop
            for name in ('Albert', 'Aristoteles', 'Archimedes'):
                print 'hello, ', name

            # example 2: Opening a file and parsing it
            filehandler = open('samplefile.txt', 'r')
            for line in filehandler.readlines():
                if line.startswith('>'):
                    print line
                else:
                    pass
    10. 10. Python syntax I - indentation
        In python, the indentation (= spaces at the beginning of the line) is part of the syntax. It is used to delimit loops and conditions, instead of curly brackets ({}).
        Example:
            for name in ('Albert', 'Aristoteles', 'Dayhoff'):
                print 'hello, ', name
            print 'and hello to you, too'
        The first 'print' is inside the loop, while the second is outside.
    11. 11. A quick perl/python comparison
        Python:
            #!/usr/bin/env python
            a = 3
            if a == 3:
                print 'a is eq to 3'
        Perl:
            #!/usr/bin/env perl
            my $a = 3;
            if ($a == 3){
                print "a is eq to 3\n";
            }
        Python code is usually easier to read and contains fewer symbols (like {}).
    12. 12. Python syntax II - simplicity
        Python keeps the number of syntax keywords to a minimum. There is:
        • only one way to open files (no 'fopen', 'openf', etc.)
        • only one way to print (no printn, printf, sprintf, sprint, etc.)
        • only two ways to define loops ('for' and 'while')
        Python's philosophy is about simplicity. Your colleagues are forced to write their programs in the same way as you.
    13. 13. Python syntax III – declaring variables
        You don't need to declare variables. The type of a variable is defined the first time you assign a value to it:
            a = 'cacagtcaga'    → a is a string
            b = 133             → b is an integer
            c = True            → c is a boolean
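The assignments on this slide can be checked with the built-in `type` function. This is a minimal sketch written for Python 3 (the slides themselves use Python 2 syntax); the rebinding of `b` at the end is an added illustration, not part of the talk.

```python
# Variables are created by assignment; no declaration is needed.
a = 'cacagtcaga'   # a string
b = 133            # an integer
c = True           # a boolean

# type() reports the type of the object each name currently refers to.
for value in (a, b, c):
    print(type(value).__name__)

# A name can later be rebound to an object of a different type.
b = 'now a string'
print(type(b).__name__)
```

The type belongs to the value, not to the name, which is why no declaration step exists.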
    14. 14. Notes on Python's speed
        Python is an interpreted language:
        • its speed is at the level of perl, java, etc.
        • programs are slower than C, but they are faster to write
        • importance of programmer effort over computer effort
        Many ways to speed up python:
        • modules can also be written in C
        • some compilers exist (PyPy)
        • Google is working on an enhanced version of python (news of March 2009)
    15. 15. Python – programming goodies
        • Installation and portability: installed by default in most linux distributions, interpreted ☺☺☺☺☺
        • IDLE / text editors: interactive shell, ipython, many editors ☺☺☺☺☺
        • Install and search new modules: easy_install, PyPI ☺☺☺
        • Testing support: doctest, unittest, nose ☺☺☺☺☺
        • Writing documentation ☺☺☺☺☺
        • Debugging: logging, pdb
    16. 16. Python – installation and portability
        • Python comes installed by default in most GNU/Linux distributions
        • Mac users have an old version (2.5), but can upgrade it
        • On Windows, you need to download an installer first
        • Being an interpreted language, python programs are easy to port to other platforms
    17. 17. PyPI (Python Package Index)
        PyPI is a repository of open source modules for python. For bioinformatics, it is smaller than CPAN, CRAN/Bioconductor, etc.
    18. 18. Python – installing new modules
        Modules can be automatically downloaded and installed using a tool called 'easy_install'.
        Examples:
            easy_install -U biopython               # install or update biopython from PyPI
            easy_install --prefix ~/usr biopython   # install biopython without requiring admin privileges
            easy_install biopython.tar.gz           # install biopython from a previously downloaded tar ball
            easy_install                            # install biopython from its web site
    19. 19. Using python
        Python can be used as an interactive shell (like R, octave, matlab, etc.) or by writing programs.
        Python interactive shell:
            gioby@dayhoff:~$ python
            >>> print 'hola'
            hola
            >>> range(5)
            [0, 1, 2, 3, 4]
        A python program:
            gioby@dayhoff:~$ cat >
            print 'hola'
            print range(5)
            gioby@dayhoff:~$ python
            hola
            [0, 1, 2, 3, 4]
    20. 20. Python interactive shell
            gioby@dayhoff:~$ python
            Python 2.5.2
            Type "help", "copyright", "credits" or "license" for more information.
            >>>
            >>> print 'hola'
            hola
            >>> range(5)
            [0, 1, 2, 3, 4]
        • You can use it to run code without having to save it to a script.
        • It has no equivalent of the saved 'sessions' of R.
        • Many programmers prefer to use 'ipython', an enhanced version of this shell.
    21. 21. IPython session
            gioby@dayhoff:~$ ipython
            Type "copyright", "credits" or "license" for more information.

            In [1]: import random
            In [2]: random.choice(['ciao', 'hola', 'hello'])
            Out[2]: 'hello'
            In [3]: 1200 / 2
            Out[3]: 600
            In [4]: random?        (shows documentation on the random module)
            In [5]: random.<TAB>   (shows auto-completion)
            In [6]: !ls            (executes a bash command)
    22. 22. Programming paradigms and testing
        • Programming paradigms: multi-paradigm (Object Oriented, Structured, Functional, etc.) ☺☺☺☺☺
        • Testing support: doctest, unittest, nose ☺☺☺☺☺
    23. 23. Python is a multiparadigm language
        Your python programs can be:
        • a simple list of instructions (imperative approach),
        • or you can write functions (functional),
        • or you can use objects (object oriented).
        It's a multi-paradigm language.
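To make the three approaches concrete, here is one small task, the GC fraction of a sequence, written once in each style. This is a sketch for Python 3; the example sequence is borrowed from the regular-expression slide, while the names `gc_content` and `Sequence` are hypothetical and not from the talk.

```python
sequence = 'ACGGCTAGGTCGATGCGATCG'

# 1. A plain list of instructions.
gc_count = 0
for base in sequence:
    if base in 'GC':
        gc_count += 1
gc_imperative = gc_count / float(len(sequence))


# 2. The same logic wrapped in a reusable function.
def gc_content(seq):
    '''Return the fraction of G and C bases in seq.'''
    return sum(1 for base in seq if base in 'GC') / float(len(seq))


# 3. Data and behaviour kept together in a class.
class Sequence(object):
    def __init__(self, bases):
        self.bases = bases

    def gc_content(self):
        return gc_content(self.bases)


print(gc_imperative)
print(gc_content(sequence))
print(Sequence(sequence).gc_content())
```

All three print the same number, and the three styles can live in the same file.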
    24. 24. Python as an imperative language
            print 'Hi, I am the psychotherapist'
            print 'How do you do? What brings you here?'
            response = raw_input()
            print 'can you elaborate on that?'
            response = raw_input()
            print 'Why do you say it is ', response, '?'
            ....
    25. 25. Python as a functional language
            def get_sequence(filehandler):
                '''extracts the sequence from a fasta file'''
                sequence = ''
                for line in filehandler.readlines():
                    if line.startswith('>'):
                        pass  # header line, not part of the sequence
                    else:
                        sequence += line.strip()
                return sequence

            def main():
                '''execute the main functions'''
                filepath = 'samplefile.txt'
                filehandler = open(filepath, 'r')
                get_sequence(filehandler)
                .....
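A function like the one on this slide can be tried without a file on disk. The sketch below, for Python 3, is a variation and not the author's code: the hypothetical `read_fasta` handles several records and returns a dictionary, and an in-memory `io.StringIO` object stands in for `open('samplefile.txt', 'r')`.

```python
import io


def read_fasta(filehandler):
    '''Return a dict mapping each header (without '>') to its sequence.'''
    sequences = {}
    header = None
    for line in filehandler:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            header = line[1:]
            sequences[header] = ''
        elif header is not None:
            sequences[header] += line
    return sequences


# Toy records, invented for this example.
fasta = io.StringIO(u'>seq1\nACGT\nTTGA\n>seq2\nGGCC\n')
print(read_fasta(fasta))
```

Because the function only iterates over lines, anything file-like works: a real file, a `StringIO`, or a list of strings.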
    26. 26. Object Oriented Programming explained in two sentences
        When you start having complicated nested variables (like arrays of hashes of arrays of lists of .....) → Object Oriented programming is something you should look at.
    27. 27. Object Oriented Programming example
            genes = {'gene1': {
                         'position': 10000,
                         'chromosome': 11,
                         'sequence': 'GTAGCCTGATGAACGGGCTAGCATGC....',
                         'transcripts': {
                             'transcript1': [......],
                             'transcript2': [......],
                             },
                         },
                     'gene2': {
                         'position': ...........},
                     .....
                     }

            def get_subseq(genes, geneid, start, end):
                '''
                get a subsequence of a gene, given a dictionary of
                gene annotations, a gene id, and start/end position
                '''
                pass
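The slide leaves `get_subseq` unimplemented. One possible body is sketched below for Python 3, under two assumptions that are not in the talk: coordinates are 0-based with the end excluded (Python slicing), and the dictionary is cut down to a single toy gene using the visible part of the slide's sequence.

```python
# Toy annotation dictionary, shortened from the slide.
genes = {
    'gene1': {
        'position': 10000,
        'chromosome': 11,
        'sequence': 'GTAGCCTGATGAACGGGCTAGCATGC',
        'transcripts': {},
    },
}


def get_subseq(genes, geneid, start, end):
    '''Return bases start..end (0-based, end excluded) of a gene's sequence.'''
    return genes[geneid]['sequence'][start:end]


print(get_subseq(genes, 'gene1', 0, 5))
```

Every function that works on a gene has to know the layout of the nested dictionary, which is the problem the class-based version addresses.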
    29. 29. A python class
            class Gene:
                def __init__(self):
                    self.position = None
                    self.sequence = ''
                    self.transcripts = []

                def get_subseq(self, start, end):
                    pass
        • Python's syntax for classes is easy
        • More concise than Java, and it is not mandatory to use classes
        • OO is very complicated in Perl
    30. 30. Python and Java classes
        A Python class:
            class Gene:
                def __init__(self, pos):
                    self.position = pos
                    self.sequence = ''
                    self.transcripts = []

                def get_subseq(self, start, end):
                    pass
        A Java class:
            public class Gene {
                public int position;
                public String chromosome;
                public String[] transcripts;

                public Gene(int pos) {
                    position = pos;
                }

                public void getSubseq(int start, int end) {
                    // not implemented
                }
            }
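For comparison with the nested dictionary of the earlier slide, here is a filled-in version of the Python class. It is a sketch for Python 3, not the author's implementation: the extra constructor arguments and the 0-based, end-excluded coordinates are assumptions made for the example.

```python
class Gene(object):
    '''A gene with its annotations and the operations that use them.'''

    def __init__(self, position, chromosome=None, sequence=''):
        self.position = position
        self.chromosome = chromosome
        self.sequence = sequence
        self.transcripts = []

    def get_subseq(self, start, end):
        '''Return bases start..end (0-based, end excluded).'''
        return self.sequence[start:end]


gene1 = Gene(10000, chromosome=11, sequence='GTAGCCTGATGAACGGGCTAGCATGC')
print(gene1.get_subseq(0, 5))
```

Callers no longer need to know how the annotations are stored; they only call `get_subseq` on the object.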
    31. 31. Three ways to test a python program
        When you write a program or a script and want to publish its results, you also need a way to prove that it works correctly.
        Python has good instruments for testing:
        • Doctest
        • Unittest
        • Nosetest
    32. 32. doctest
        With doctest, you put examples of the usage of a function in its documentation:
            >>> help(say_hello)
            Help on function say_hello in module __main__:

            say_hello(name)
                print hello <name> to the screen

                example:
                >>> say_hello('Albert Einstein')
                hello Albert Einstein!!!
        Doctest re-executes these examples, and if they don't produce the expected output, a failure is reported.
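A runnable version of the `say_hello` example might look as follows (Python 3). The function body is an assumption, since the slide only shows its help text. The example runs the docstring through `doctest.DocTestFinder` and `doctest.DocTestRunner` explicitly so that it behaves the same wherever it is pasted.

```python
import doctest


def say_hello(name):
    '''print hello <name> to the screen

    example:
    >>> say_hello('Albert Einstein')
    hello Albert Einstein!!!
    '''
    print('hello %s!!!' % name)


# Collect the examples written in the docstring and re-execute them.
finder = doctest.DocTestFinder(recurse=False)
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(say_hello, 'say_hello', globs={'say_hello': say_hello}):
    runner.run(test)

print('examples run: %d, failures: %d' % (runner.tries, runner.failures))
```

In a script, the usual shortcut is a single call to `doctest.testmod()`, which does the same for every docstring in the module.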
    33. 33. Doctest example 2
    34. 34. Doctest example 3
        Doctests are useful when you collaborate with non-programmers.
    35. 35. unittest
            import unittest

            class SimpleFastaSeqCase(unittest.TestCase):

                # Instructions to be executed before/after all the tests
                @classmethod
                def setUpClass(cls):
                    .....

                @classmethod
                def tearDownClass(cls):
                    .....

                # Instructions to be executed before/after each one of the tests
                def setUp(self):
                    .....

                def tearDown(self):
                    .....

                # Tests
                def testCondition1(self):
                    .....

                def testCondition2(self):
                    .....
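The skeleton on this slide can be filled in as below. It is a minimal sketch for Python 3; the function under test, `count_bases`, and the test case are hypothetical and only illustrate where `setUp` and the test methods fit.

```python
import unittest


def count_bases(sequence):
    '''Hypothetical function under test: count each base in a sequence.'''
    counts = {}
    for base in sequence:
        counts[base] = counts.get(base, 0) + 1
    return counts


class CountBasesCase(unittest.TestCase):

    def setUp(self):
        # Executed before each test method.
        self.sequence = 'ACGTTA'

    def test_counts(self):
        self.assertEqual(count_bases(self.sequence),
                         {'A': 2, 'C': 1, 'G': 1, 'T': 2})

    def test_empty_sequence(self):
        self.assertEqual(count_bases(''), {})


# Build and run the suite explicitly, so the example also works in a shell.
suite = unittest.TestLoader().loadTestsFromTestCase(CountBasesCase)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a test file, the last two lines are normally replaced by `unittest.main()`.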
    36. 36. nosetest
        Nosetest scans your code and looks for all the functions with 'test_' in their names:
            def getfasta(filename):
                pass

            def count_numbers(numbers, limit):
                pass

            def you_like_this_talk(subliminal = True):
                pass

            def test_everything_ok():    # this is a test
                pass
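Test functions of this kind are plain functions containing `assert` statements. The sketch below (Python 3) gives `count_numbers` a hypothetical body and then imitates discovery by name over a hand-written namespace. This is only an illustration of the idea: it is not how nose is implemented, and nose's real matching rule is broader than a `test_` prefix.

```python
def count_numbers(numbers, limit):
    '''Count how many numbers are below limit (hypothetical helper).'''
    return len([n for n in numbers if n < limit])


def test_count_numbers():
    assert count_numbers([1, 5, 10], 6) == 2


def test_count_numbers_empty():
    assert count_numbers([], 6) == 0


# A module's namespace, written out by hand for the example.
namespace = {
    'count_numbers': count_numbers,
    'test_count_numbers': test_count_numbers,
    'test_count_numbers_empty': test_count_numbers_empty,
}

# Pick the functions whose name starts with 'test_' and call each one.
found = sorted(name for name in namespace if name.startswith('test_'))
for name in found:
    namespace[name]()  # raises AssertionError if the test fails

print(found)
```

With a test runner installed, none of the discovery code is needed: the two `test_` functions alone are the test suite.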
    37. 37. Message
        • Python is easy to learn and write
        • It has good tools to test and demonstrate that your programs work correctly
    38. 38. Python – some bioinfo use cases
        • Regular expressions, motif search: re, TAMO, biopython. To use regular expressions, it is necessary to import a module. Getting help is easier ☺☺☺☺
        • Convert a sequence file to another format: biopython. Biopython is growing its support for bioinformatics formats ☺☺☺☺
        • Working with genomic data: pygr. Pygr is a great environment to work with genomic data ☺☺☺☺
        • Query Genbank: Biopython, pygr ☺☺☺☺
        • Structural Bioinformatics: I don't know
    39. 39. Regular Expressions in Python
        • Using Regular Expressions in Python requires one more step than in Perl
        • You have to import a module called re first
        • Regular expressions are also less 'central' to the developers of the language
    40. 40. Example – Regular Expressions in python
            >>> import re
            >>> sequence = 'ACGGCTAGGTCGATGCGATCG'
            >>> re.findall('A.G', sequence)
            ['ACG', 'AGG', 'ATG']
            >>> help(re)
            <get help on regular expressions>
        The only advantage of python over perl for regular expressions is that it is easier to get help.
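When the position of each match matters, as it usually does in a motif search, `re.finditer` returns match objects instead of plain strings. A short sketch for Python 3, reusing the sequence and pattern of the slide; the variable names are only illustrative.

```python
import re

sequence = 'ACGGCTAGGTCGATGCGATCG'

# Compile the pattern once if it is going to be reused.
motif = re.compile('A.G')

# Each match object knows where it starts and what it matched.
hits = [(match.start(), match.group()) for match in motif.finditer(sequence)]
print(hits)
# → [(0, 'ACG'), (6, 'AGG'), (12, 'ATG')]
```

Like `findall`, `finditer` reports non-overlapping matches, scanning the sequence from left to right.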
    41. 41. Biopython
        A collection of free modules for bioinformatics.
        Number of functionalities implemented: bioconductor > bioperl > biopython > all others
        Strong points:
        • File format support
        • NCBI – Entrez APIs
        • PDB / structures
    42. 42. Biopython Examples
            # Parse a Fasta file into a dictionary of sequence records
            from Bio import SeqIO
            seqfile = open('fastafile.fa', 'r')
            sequences = SeqIO.to_dict(SeqIO.parse(seqfile, 'fasta'))

            # Query NCBI
            from Bio import Entrez
            results = Entrez.esearch(db='nucleotide', term='cox2')
    43. 43. Pygr
        Great for genome-wide analysis. It automates:
        • Storing/retrieving data in databases or pickles
        • Using and configuring local blast databases
        • Creating annotations and storing them
        • Interfacing with NCBI, Ensembl (equivalent to the Ensembl perl APIs), UCSC
    44. 44. Pygr examples
            # Ensembl APIs
            serverRegistry = get_registry(host='', user='anonymous')
            coreDBAdaptor = serverRegistry.get_DBAdaptor('homo_sapiens', 'core', '47_36i')
            sequence = coreDBAdaptor.fetch_slice_by_seqregion(coordSystemName, seqregionName)

            # Download the sequence of the Human Genome (18)
            import pygr.Data
            hg18 = pygr.Data.Bio.Seq.Genome.HUMAN.hg18(download=True)
    45. 45. TAMO and pyHMM
        Module to work with motifs
            >>> from TAMO import MotifTools
            >>> msa = ['TGACTCA',
            ...        'TGACTCA',
            ...        'TGAGTCA',
            ...        'TGAGTCA']
            >>> m_msa = MotifTools.Motif(msa)
            >>> print m_msa
            TGAsTCA(4)
            >>> m_msa._print_counts()
            #       0      1      2      3      4      5      6
            #A  0.000  0.000  4.000  0.000  0.000  0.000  4.000
            #C  0.000  0.000  0.000  2.000  0.000  4.000  0.000
            #T  4.000  0.000  0.000  0.000  4.000  0.000  0.000
            #G  0.000  4.000  0.000  2.000  0.000  0.000  0.000
    46. 46. Python – bioinformatics utilities
        • Scientific and statistics: scipy + numpy ☺☺☺☺☺
        • Plotting graphs: Matplotlib (pylab) ☺☺☺☺☺
        • SOAP / web scraping utilities: suds ☺☺☺
        • ORM modules, database handling, HDF5: Sqlalchemy + elixir, sqlobject, pytables ☺☺☺☺☺
        • Persistent data: cPickle, shelve, ZODB; no R-like sessions ☺☺☺☺
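As an illustration of the "persistent data" row, the sketch below saves a small dictionary to disk and loads it back with the standard `pickle` module. It is written for Python 3, where `pickle` covers the role `cPickle` had in Python 2; the toy `genes` dictionary and the temporary file location are assumptions made for the example.

```python
import os
import pickle
import tempfile

# Toy data to store, invented for this example.
genes = {'gene1': {'position': 10000, 'chromosome': 11}}

# Write the object to a file in a temporary directory...
path = os.path.join(tempfile.mkdtemp(), 'genes.pickle')
with open(path, 'wb') as handle:
    pickle.dump(genes, handle)

# ...and load it back, as a later run of the program could do.
with open(path, 'rb') as handle:
    restored = pickle.load(handle)

print(restored == genes)
```

Pickles should only be loaded from sources you trust, because loading one can execute arbitrary code.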
    47. 47. Python and Databases
        • There are some good libraries for Object Relational Mapping (ORM)
        • ZODB: Object Oriented Database
        • PyTables: hierarchical database (supports HDF5, a binary format used in astronomy/physics to store big data)
    48. 48. sqlalchemy example
    49. 49. Scientific Python
        • Numpy: python module to work with arrays and matrices
        • Scipy: module to do advanced math, statistics, and more
        • Matplotlib: module to plot graphics
        To get started with python and plotting graphs:
            $: easy_install numpy scipy matplotlib ipython
            $: ipython -pylab
    50. 50. Numpy/Scipy example Hint: use ipython -pylab to have an R-like environment 
    51. 51. Is there anything I forgot? ????? ????? ????? 
    52. 52. Thank you for the attention!  PRBB technical seminars:   These slides will be uploaded on  
    53. 53. Discarded slides
    54. 54. Hint: use ipython -pylab
        The best way to work with python and plotting graphs is with ipython -pylab. It will give you a shell similar to matlab/octave/R/etc.
    55. 55. Regular expressions
        • To use regular expressions in python, you need to import the 're' module first
        • It's not as immediate as with perl, where you can use regular expressions without importing anything
        • However, it is easier to get the documentation
    56. 56. Main python modules for bioinformatics
        • Biopython
        • Pygr
    57. 57. Python – storing/accessing data
        • Reading/Writing files ☺☺☺☺☺
        • Persistent data: cPickle, shelve, ZODB ☺☺☺
        • Database – Object Relational Mapping libraries: sqlalchemy, elixir ☺☺☺☺☺
        • Binary formats (HDF5): pytables ☺☺☺☺
        • R-like sessions: nothing to my knowledge :(