Introduction to python for bioinformatics


A talk for the PRBB Technical Seminars series on Python and bioinformatics

Published in: Technology, Education

  1. 1. PRBB Technical Seminars: Introduction to Python for bioinformatics. Giovanni Marco Dall'Olio, Unidad de Biologia Evolutiva – CEXS, Barcelona (Spain)
  2. 2. Python: A programming language released in 1991 by Guido van Rossum. Used for a variety of applications, from scripting to web programming. Adopted by Google, Yahoo, YouTube, CERN, NASA, Red Hat... Lots of jokes in the documentation (it is named after Monty Python).
  3. 3. Python and bioinformatics: Python is widely used in bioinformatics (August 2007 survey).
  4. 4. Python – overall view
     - Learning curve: easy to learn, yet powerful ☺☺☺☺☺
     - Readability of a Python program ☺☺☺☺☺
     - Community, availability of open source modules: for bioinformatics, CPAN is slightly bigger ☺☺☺☺
     - Programming paradigms: multi-paradigm (object oriented, structured, functional, etc.) ☺☺☺☺☺
     - Execution speed: interpreted language; importance of programmer effort over computer effort

     Notes: This talk is full of tables like this. They only reflect my opinion (biologist with 3-4 years of experience).
  5. 5. Python – Cons
     - State of open source libraries for bioinformatics: there is good support, but less than for Perl and R
     - Execution speed: comparable to Perl, Java, Ruby, ...
     - SOAP libraries: SOAPpy is very old, suds is the best one
     - Population genetics modules: as with many other specific modules, Perl and R are better supported
     - Lack of true multithreading support: a structural limit makes it impossible to have real multithreading in Python (various solutions...)
  6. 6. Python – what makes me happy
     - General syntax ☺☺☺☺☺
     - People are forced to write programs similar to yours ☺☺☺☺☺
     - Quicker to write programs ☺☺☺☺☺
     - Object oriented, multi-paradigm (will be explained later) ☺☺☺☺☺
     - Testing support (will be explained later) ☺☺☺☺
  7. 7. Python – learning curve: Python's syntax is easy. You can concentrate on algorithms and problems instead of the programming language.
  8. 8. Python – learning curve: Python's syntax is easy, so you can concentrate on algorithms and problems instead of the programming language. With Python you don't have to worry about:
     - Learning strange symbols (~=, <>, eq, '\n', {}...)
     - Alternative syntaxes to do the same task
     - Declaring variables
     - Inner structure of strings/arrays
     - Low-level IO, passing variables by reference/value, etc.
  9. 9. Example of python code

         #!/usr/bin/env python
         '''Some python examples'''

         # example 1: a 'for' loop
         for name in ('Albert', 'Aristoteles', 'Archimedes'):
             print 'hello, ', name

         # example 2: Opening a file and parsing it
         filehandler = open('samplefile.txt', 'r')
         for line in filehandler.readlines():
             if line.startswith('>'):
                 print line
             else:
                 pass
  10. 10. Python syntax I - indentation: In Python, the indentation (= spaces at the beginning of the line) is part of the syntax. It is used to delimit loops and conditions, instead of curly braces ({}). Example:

          for name in ('Albert', 'Aristoteles', 'Dayhoff'):
              print 'hello, ', name
          print 'and hello to you, too'

      The first 'print' is inside the loop, while the second is outside.
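The effect of indentation can be checked with a small runnable sketch. It is written in Python 3 syntax (the slides use Python 2), and the function name `greet_all` is a hypothetical helper introduced only for this illustration.

```python
def greet_all(names):
    """Collect greetings to show which lines belong to the loop."""
    lines = []
    for name in names:
        # Indented: this line runs once per name (inside the loop).
        lines.append('hello, ' + name)
    # Not indented under 'for': this line runs once, after the loop.
    lines.append('and hello to you, too')
    return lines


for line in greet_all(('Albert', 'Aristoteles', 'Dayhoff')):
    print(line)
```

Moving the last `append` four spaces to the right would make it part of the loop, so it would run three times instead of once.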
  11. 11. A quick perl/python comparison

      (Python)

          #!/usr/bin/env python
          a = 3
          if a == 3:
              print 'a is eq to 3'

      (Perl)

          #!/usr/bin/env perl
          my $a = 3;
          if ($a == 3){
              print "a is eq to 3\n";
          }

      Python code is usually easier to read and contains fewer symbols (like {}).
  12. 12. Python syntax II - simplicity: Python has a minimal number of syntax keywords. There is:
      - only one way to open files (no 'fopen', 'openf', etc.)
      - only one way to print (no printn, printf, sprintf, sprint, etc.)
      - only two ways to define loops ('for' and 'while')

      Python's philosophy is about simplicity. Your colleagues are forced to write their programs in the same way as you.
  13. 13. Python syntax III – declaring var: You don't need to declare variables. The type of a variable is defined the first time you assign a value to it.

          a = 'cacagtcaga'   # a is a string
          b = 133            # b is an integer
          c = True           # c is a boolean
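A short sketch, in Python 3 syntax, shows how to inspect the types Python inferred for the three variables above. The name `type_names` is introduced here only for illustration; the last lines show that a name can later be rebound to a value of another type.

```python
a = 'cacagtcaga'
b = 133
c = True

# type() reports the type of the value each name currently refers to.
type_names = [type(value).__name__ for value in (a, b, c)]
print(type_names)

# No declaration is needed to rebind a name to a different type.
a = len(a)
print(type(a).__name__, a)
```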
  14. 14. Notes on Python's speed: Python is an interpreted language. Its speed is at the level of Perl, Java, etc. Programs are slower than in C, but faster to write: programmer effort matters more than computer effort. There are many ways to speed up Python: modules can also be written in C; some compilers exist (PyPy); Google is working on an enhanced version of Python (news of March 2009).
  15. 15. Python – programming goodies
      - Installation and portability: installed by default in most Linux distributions, interpreted ☺☺☺☺☺
      - IDLE / text editors: interactive shell, ipython, many editors ☺☺☺☺☺
      - Install and search new modules: easy_install, PyPI ☺☺☺
      - Testing support: doctest, unittest, nose ☺☺☺☺☺
      - Writing documentation ☺☺☺☺☺
      - Debugging: logging, pdb
  16. 16. Python – installation and portability: Python comes installed by default in most GNU/Linux distributions. Mac users have an old version (2.5), but can upgrade it. On Windows, you first need to download an installer. Being an interpreted language, Python programs are easy to port to other platforms.
  17. 17. PyPI (Python Package Index): PyPI is a repository of open source modules for Python. For bioinformatics, it is smaller than CPAN, CRAN/Bioconductor, etc.
  18. 18. Python – installing new modules: Modules can be automatically downloaded and installed using a tool called 'easy_install'. Examples:

          easy_install -U biopython              # install or update biopython from PyPI
          easy_install --prefix ~/usr biopython  # install biopython without requiring admin privileges
          easy_install biopython.tar.gz          # install biopython from a previously downloaded tarball
          easy_install <URL>                     # install biopython from its web site
  19. 19. Using python: Python can be used as an interactive shell (like R, octave, matlab, etc.) or by writing programs.

      (python interactive shell)

          gioby@dayhoff:~$ python
          >>> print 'hola'
          hola
          >>> range(5)
          [0, 1, 2, 3, 4]

      (a python program)

          gioby@dayhoff:~$ cat > ...
          print 'hola'
          print range(5)
          gioby@dayhoff:~$ python ...
          hola
          [0, 1, 2, 3, 4]
  20. 20. Python interactive shell

          gioby@dayhoff:~$ python
          Python 2.5.2
          Type "help", "copyright", "credits" or "license" for more information.
          >>>
          >>> print 'hola'
          hola
          >>> range(5)
          [0, 1, 2, 3, 4]

      You can use it to run programs without having to save them to a script. It has no equivalent of the 'sessions' found in R. Many programmers prefer to use 'ipython', an enhanced version of this shell.
  21. 21. IPython session

          gioby@dayhoff:~$ ipython
          Type "copyright", "credits" or "license" for more information.

          In [1]: import random
          In [2]: random.choice(['ciao', 'hola', 'hello'])
          Out[2]: 'hello'
          In [3]: 1200 / 2
          Out[3]: 600
          In [4]: random?       (shows documentation on the random module)
          In [5]: random.<TAB>  (shows auto-completion)
          In [6]: !ls           (executes a bash command)
  22. 22. Programming paradigms and testing
      - Programming paradigms: multi-paradigm (object oriented, structured, functional, etc.) ☺☺☺☺☺
      - Testing support: doctest, unittest, nose ☺☺☺☺☺
  23. 23. Python is a multiparadigm language: Your Python programs can be a simple list of instructions (imperative approach), or you can write functions (functional), or you can use objects (object oriented). It's a multi-paradigm language.
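To make the three styles concrete, here is a minimal sketch that computes the GC content of one sequence three times: as a plain list of instructions, through a function, and through an object. It uses Python 3 syntax; the sequence and the names `gc_content` and `Sequence` are hypothetical, chosen only for this illustration.

```python
sequence = 'GTAGCCTGATGAACGG'

# 1. A simple list of instructions.
gc_count = 0
for base in sequence:
    if base in 'GC':
        gc_count += 1
gc_imperative = gc_count / len(sequence)


# 2. The same computation wrapped in a reusable function.
def gc_content(seq):
    """Return the fraction of G and C bases in seq."""
    return sum(1 for base in seq if base in 'GC') / len(seq)


# 3. The same computation attached to an object.
class Sequence:
    def __init__(self, bases):
        self.bases = bases

    def gc_content(self):
        return gc_content(self.bases)


print(gc_imperative, gc_content(sequence), Sequence(sequence).gc_content())
```

All three give the same number; which style to pick depends mostly on how much the code has to be reused.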
  24. 24. Python as an imperative language

          print 'Hi, I am the psychotherapist'
          print 'How do you do? What brings you here?'
          response = raw_input()
          print 'can you elaborate on that?'
          response = raw_input()
          print 'Why do you say it is ', response, '?'
          ....
  25. 25. Python as a functional language

          def get_sequence(filehandler):
              '''extracts the sequence from a fasta file'''
              sequence = ''
              for line in filehandler.readlines():
                  if line.startswith('>'):
                      pass    # skip the header line
                  else:
                      sequence += line.strip()
              return sequence

          def main():
              '''execute the main functions'''
              filepath = 'samplefile.txt'
              filehandler = open(filepath, 'r')
              get_sequence(filehandler)
              .....
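The function on this slide can be tried without a file on disk. The sketch below is in Python 3 syntax and feeds it an in-memory file built with `io.StringIO`; the FASTA record is made up for the example, and the function assumes a file containing a single sequence.

```python
import io


def get_sequence(filehandler):
    """Return the sequence of a single-record FASTA file as one string."""
    sequence = ''
    for line in filehandler:
        if not line.startswith('>'):      # skip the header line
            sequence += line.strip()      # drop the trailing newline
    return sequence


# A made-up record standing in for 'samplefile.txt'.
sample = io.StringIO('>gene1 hypothetical example\nACGTAC\nGGTTAA\n')
print(get_sequence(sample))
```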
  26. 26. Object Oriented Programming explained in two sentences: When you start having complicated nested variables (like arrays of hashes of arrays of lists of ...), object oriented programming is something you should look at.
  27. 27. Object Oriented Programming example

          genes = {'gene1': {
                       'position': 10000,
                       'chromosome': 11,
                       'sequence': 'GTAGCCTGATGAACGGGCTAGCATGC....',
                       'transcripts': {
                           'transcript1': [......],
                           'transcript2': [......],
                       },
                   },
                   'gene2': {
                       'position': ...........},
                   .....
          }

          def get_subseq(genes, geneid, start, end):
              '''
              get a subsequence of a gene, given a dictionary of gene
              annotations, a gene id, and start/end position
              '''
              pass
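For comparison with the class-based version, this is one way the dictionary-based `get_subseq` could be filled in. It is a sketch in Python 3 syntax under two assumptions that the slide does not state: `start` and `end` are 0-based offsets into the gene's own sequence, and `end` is exclusive, as in Python slices. The sequence value is shortened from the slide and carries no biological meaning.

```python
genes = {
    'gene1': {
        'position': 10000,
        'chromosome': 11,
        'sequence': 'GTAGCCTGATGAACGGGCTAGCATGC',
    },
}


def get_subseq(genes, geneid, start, end):
    """Return bases start..end (end excluded) of the gene called geneid."""
    return genes[geneid]['sequence'][start:end]


print(get_subseq(genes, 'gene1', 0, 4))
```

Every function written this way has to know the exact layout of the nested dictionary, which is the maintenance burden the slide is pointing at.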
  29. 29. A python class

          class gene:
              def __init__(self):
                  self.position = None
                  self.sequence = ''
                  self.transcripts = []
              def get_subseq(self, start, end):
                  pass

      Python's syntax for classes is easy. It is more concise than Java, and using classes is not mandatory. OO is very complicated in Perl.
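A filled-in version of such a class might look like the sketch below (Python 3 syntax). The constructor arguments and the slice-style meaning of `start` and `end` (0-based, end excluded) are assumptions made for the example, not something the slides specify.

```python
class Gene:
    """A gene that carries its own annotations and behaviour."""

    def __init__(self, position=None, sequence=''):
        # Attributes must be set on self to be stored on the object.
        self.position = position
        self.sequence = sequence
        self.transcripts = []

    def get_subseq(self, start, end):
        """Return bases start..end (end excluded) of this gene."""
        return self.sequence[start:end]


gene1 = Gene(position=10000, sequence='GTAGCCTGATGAACGG')
print(gene1.get_subseq(0, 4))
```

The caller no longer needs to know how the data is stored: it only asks the object for a subsequence.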
  30. 30. Python and Java classes

      (A Python Class)

          class gene:
              def __init__(self, pos):
                  self.position = pos
                  self.sequence = ''
                  self.transcripts = []
              def get_subseq(self, start, end):
                  pass

      (A Java Class)

          public class Gene {
              public int position;
              public String chromosome;
              public String transcripts[];
              public Gene(int pos){
                  position = pos;
              }
              public void getSubseq(int start, int end) {
                  // not implemented
              }
          }
  31. 31. Three ways to test a python program: When you write a program or a script and want to publish its results, you also need a way to prove that it works correctly. Python has good instruments for testing:
      - Doctest
      - Unittest
      - Nosetest
  32. 32. doctest: With doctest, you put examples of the usage of a function in its documentation.

          >>> help(say_hello)
          Help on function say_hello in module __main__:

          say_hello(name)
              print hello <name> to the screen

              example:
              >>> say_hello('Albert Einstein')
              hello Albert Einstein!!!

      Doctest tries to re-execute these examples, and if they don't return the expected values, an error is raised.
  33. 33. Doctest example 2
  34. 34. Doctest example 3: Doctests are useful when you collaborate with non-programmers.
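Since the doctest example slides did not survive as text, here is a minimal sketch of the idea in Python 3 syntax. The `say_hello` function mirrors the one shown in the help output on the doctest slide; the `run_doctests` helper is hypothetical and only exists so the result can be inspected from code (in everyday use, `python -m doctest yourfile.py` does the same job).

```python
import doctest


def say_hello(name):
    """Print hello <name> to the screen.

    Example:
    >>> say_hello('Albert Einstein')
    hello Albert Einstein!!!
    """
    print('hello %s!!!' % name)


def run_doctests(func):
    """Run the examples found in func's docstring and return the results."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(func.__doc__, {func.__name__: func},
                              func.__name__, None, None)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


results = run_doctests(say_hello)
print(results.attempted, 'example(s) run,', results.failed, 'failed')
```

If someone later changes the greeting without updating the docstring, the example stops matching and doctest reports a failure.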
  35. 35. unittest

          import unittest

          class SimpleFastaSeqCase(unittest.TestCase):

              # Instructions to be executed before/after all the tests
              @classmethod
              def setUpClass(cls):
                  .....
              @classmethod
              def tearDownClass(cls):
                  .....

              # Instructions to be executed before/after each one of the tests
              def setUp(self):
                  .....
              def tearDown(self):
                  .....

              # Tests
              def testCondition1(self):
                  .....
              def testCondition2(self):
                  .....
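The skeleton above can be turned into something runnable with a small function under test. In this Python 3 sketch, `count_bases`, the test case, and `run_suite` are all hypothetical names made up for the illustration; only `unittest` itself comes from the standard library.

```python
import unittest


def count_bases(sequence):
    """Count how many times each base occurs, ignoring case."""
    counts = {}
    for base in sequence.upper():
        counts[base] = counts.get(base, 0) + 1
    return counts


class CountBasesCase(unittest.TestCase):

    def setUp(self):
        # Executed before each test.
        self.sequence = 'acgtAC'

    def test_counts(self):
        self.assertEqual(count_bases(self.sequence),
                         {'A': 2, 'C': 2, 'G': 1, 'T': 1})

    def test_empty_sequence(self):
        self.assertEqual(count_bases(''), {})


def run_suite():
    """Load and run the test case, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CountBasesCase)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a script, the usual alternative to `run_suite()` is to end the file with `unittest.main()` and run it from the command line.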
  36. 36. nosetest: Nosetest scans your code and looks for all the functions with the word 'test_' in their names.

          def getfasta(filename):
              pass

          def count_numbers(numbers, limit):
              pass

          def you_like_this_talk(subliminal = True):
              pass

          def test_everything_ok():    # This is a test
              pass
  37. 37. Message: Python is easy to learn and write. It has good tools to test and demonstrate that your programs work correctly.
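The discovery idea can be imitated in a few lines. The sketch below (Python 3 syntax) is not how nose is implemented; `run_tests` is a hypothetical stand-in that simply calls every function whose name starts with `test_` in a given namespace, and `count_gc` is a made-up function to test.

```python
def count_gc(sequence):
    """Count the G and C bases in a sequence, ignoring case."""
    return sum(1 for base in sequence.upper() if base in 'GC')


def test_count_gc():
    assert count_gc('acgt') == 2


def test_count_gc_empty():
    assert count_gc('') == 0


def run_tests(namespace):
    """Call every function named test_* and return the names that passed."""
    passed = []
    for name in sorted(namespace):
        obj = namespace[name]
        if name.startswith('test_') and callable(obj):
            obj()               # a failing assert raises AssertionError
            passed.append(name)
    return passed
```

Calling `run_tests(globals())` at the end of a script would run both tests; a real test runner adds reporting, fixtures, and discovery across files on top of this.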
  38. 38. Python – some bioinfo use cases
      - Regular expressions, motif search: re, TAMO, biopython. To use regular expressions, it is necessary to import a module. Getting help is easier ☺☺☺☺
      - Convert a sequence file to another format: biopython. Biopython is growing its support for bioinformatics formats ☺☺☺☺
      - Working with genomic data: pygr. Pygr is a great environment to work with genomic data ☺☺☺☺
      - Query Genbank: Biopython, pygr ☺☺☺☺
      - Structural bioinformatics: I don't know
  39. 39. Regular Expressions in Python: Using regular expressions in Python requires one more step than in Perl: you have to import a module called re first. Regular expressions are also less 'central' to the developers of the language.
  40. 40. Example – Regular Expressions in python

          >>> import re
          >>> sequence = 'ACGGCTAGGTCGATGCGATCG'
          >>> re.findall('A.G', sequence)
          ['ACG', 'AGG', 'ATG']
          >>> help(re)
          <get help on regular expressions>

      The only advantage of Python over Perl for regular expressions is that it is easier to get help.
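The same search can be extended to report where each match occurs. This sketch uses Python 3 syntax and only the standard `re` module; the variable names `matches` and `positions` are introduced for the illustration.

```python
import re

sequence = 'ACGGCTAGGTCGATGCGATCG'

# 'A.G' matches an A, then any single character, then a G.
matches = re.findall('A.G', sequence)

# finditer yields match objects, which also carry the position.
positions = [match.start() for match in re.finditer('A.G', sequence)]

print(matches)
print(positions)
```

Both calls report non-overlapping matches, scanning the sequence from left to right.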
  41. 41. Biopython: A collection of free modules for bioinformatics. Number of functionalities implemented: Bioconductor > BioPerl > Biopython > all others. Strong points:
      - File format support
      - NCBI Entrez APIs
      - PDB / structures
  42. 42. Biopython Examples

          # Parse a Fasta file (first step of converting it to Genbank)
          from Bio import SeqIO
          seqfile = open('fastafile.fa', 'r')
          sequences = SeqIO.to_dict(SeqIO.parse(seqfile, 'fasta'))

          # Query NCBI
          from Bio import Entrez
          results = Entrez.esearch(db='nucleotide', term='cox2')
  43. 43. Pygr: Great for genome-wide analysis. It automates:
      - Storing/retrieving data in databases or pickles
      - Using and configuring local BLAST databases
      - Creating annotations and storing them
      - Interfacing with NCBI, Ensembl (equivalent to the Ensembl Perl APIs), UCSC
  44. 44. Pygr examples

          # Ensembl APIs
          serverRegistry = get_registry(host='', user='anonymous')
          coreDBAdaptor = serverRegistry.get_DBAdaptor('homo_sapiens', 'core', '47_36i')
          sequence = coreDBAdaptor.fetch_slice_by_seqregion(coordSystemName, seqregionName)

          # Download the sequence of the Human Genome (18)
          import pygr.Data
          hg18 = pygr.Data.Bio.Seq.Genome.HUMAN.hg18(download=True)
  45. 45. TAMO and pyHMM: Module to work with motifs.

          >>> from TAMO import MotifTools
          >>> msa = ['TGACTCA',
          ...        'TGACTCA',
          ...        'TGAGTCA',
          ...        'TGAGTCA']
          >>> m_msa = MotifTools.Motif(msa)
          >>> print m_msa
          TGAsTCA(4)
          >>> m_msa._print_counts()
          #        0       1       2       3       4       5       6
          #A   0.000   0.000   4.000   0.000   0.000   0.000   4.000
          #C   0.000   0.000   0.000   2.000   0.000   4.000   0.000
          #T   4.000   0.000   0.000   0.000   4.000   0.000   0.000
          #G   0.000   4.000   0.000   2.000   0.000   0.000   0.000
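The count table printed by TAMO can be reproduced by hand to see what it contains. The sketch below does not use TAMO at all: it is a plain Python 3 illustration built on `collections.Counter`, and `count_matrix` is a hypothetical helper name.

```python
from collections import Counter

msa = ['TGACTCA',
       'TGACTCA',
       'TGAGTCA',
       'TGAGTCA']


def count_matrix(aligned_sequences):
    """Return one Counter per alignment column (base -> occurrences)."""
    # zip(*rows) walks the alignment column by column.
    return [Counter(column) for column in zip(*aligned_sequences)]


for position, counts in enumerate(count_matrix(msa)):
    print(position, dict(counts))
```

Position 3 is the only column where the sequences disagree (two C and two G), which is why TAMO prints the ambiguity letter `s` there.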
  46. 46. Python – bioinformatics utilities
      - Scientific and statistics: scipy + numpy ☺☺☺☺☺
      - Plotting graphs: Matplotlib (pylab) ☺☺☺☺☺
      - SOAP / web scraping utilities: suds ☺☺☺
      - ORM modules, database handling, HDF5: Sqlalchemy + elixir, sqlobject, pytables ☺☺☺☺☺
      - Persistent data: cPickle, shelve, ZODB; no R-like sessions ☺☺☺☺
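As an illustration of the persistent data entry, the sketch below saves a Python object to disk and reads it back. It uses Python 3 syntax, where the `cPickle` module of Python 2 has been merged into `pickle`; the helper `save_and_reload` and the sample data are made up for the example.

```python
import os
import pickle
import tempfile


def save_and_reload(data):
    """Pickle data to a temporary file, load it back, and clean up."""
    handle, path = tempfile.mkstemp(suffix='.pickle')
    os.close(handle)
    try:
        with open(path, 'wb') as outfile:
            pickle.dump(data, outfile)
        with open(path, 'rb') as infile:
            return pickle.load(infile)
    finally:
        os.remove(path)


annotations = {'gene1': {'position': 10000, 'chromosome': 11}}
print(save_and_reload(annotations) == annotations)
```

This stores one object at a time; it is not a replacement for the saved workspaces that R offers.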
  47. 47. Python and Databases: There are some good libraries for Object Relational Mapping (ORM). ZODB: object oriented database. PyTables: hierarchical database (supports HDF5, a binary format used in astronomy/physics to store big data).
  48. 48. sqlalchemy example
  49. 49. Scientific Python: Numpy: Python module to work with arrays and matrices. Scipy: module to do advanced math, statistics, and more. Matplotlib: module to plot graphics. To get started with Python and plotting graphs:

          $ easy_install numpy scipy matplotlib ipython
          $ ipython -pylab
  50. 50. Numpy/Scipy example Hint: use ipython -pylab to have an R-like environment 
  51. 51. Is there anything I forgot?
  52. 52. Thank you for the attention!
  53. 53. Discarded slides
  54. 54. Hint: use ipython -pylab: The best way to work with Python and plot graphs is with ipython -pylab. It will give you a shell similar to matlab/octave/R, etc.
  55. 55. Regular expressions: To use regular expressions in Python, you need to import the 're' module first. It is not as immediate as in Perl, where you can use regular expressions without importing anything. However, it is easier to get the documentation.
  56. 56. Main python modules for bioinformatics: Biopython, Pygr
  57. 57. Python – storing/accessing data
      - Reading/writing files ☺☺☺☺☺
      - Persistent data: cPickle, shelve, ZODB ☺☺☺
      - Database, Object Relational Mapping libraries: sqlalchemy, elixir ☺☺☺☺☺
      - Binary formats (HDF5): pytables ☺☺☺☺
      - R-like sessions: nothing to my knowledge :(