Introduction to python for bioinformatics

A talk for the PRBB Technical Seminars series (http://bg.imim.es/technical-seminars/) on python and bioinformatics

  • Reorganize everything better
  • Other voices:- community?
  • Python's authors have always focused on: quick learning curve, readability
  • Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Readability counts. If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea. [.....]
  • You can mix instructions, functions and objects in the same python file (multi-paradigm)
  • The previous code (where each 'gene' was a dictionary of dictionaries with lists) can be better structured as an object

Introduction to python for bioinformatics Presentation Transcript

  • 1. PRBB Technical Seminars: Introduction to Python for bioinformatics
    Giovanni Marco Dall'Olio, Unidad de Biologia Evolutiva – CEXS, Barcelona (Spain)
  • 2. Python
    - A programming language released in 1991 by Guido van Rossum
    - Used for a variety of applications, from scripting to web programming
    - Adopted by Google, Yahoo, YouTube, CERN, NASA, Red Hat...
    - Lots of jokes in the documentation (it is named after Monty Python)
  • 3. Python and bioinformatics
    Python is widely used in bioinformatics.
    [Figures: August 2007 survey from www.bioinformaticszen.com; www.bioinformatics.org survey]
  • 4. Python – overall view
    - Learning curve: easy to learn, yet powerful ☺☺☺☺☺
    - Readability of a python program ☺☺☺☺☺
    - Community, availability of open source modules: for bioinformatics, CPAN is slightly bigger ☺☺☺☺
    - Programming paradigms: multi-paradigm (Object Oriented, structured, functional, etc..) ☺☺☺☺☺
    - Execution speed: interpreted language; importance of programmer effort over computer effort
    Notes:
    - This talk is full of tables like this
    - They only reflect my opinion (biologist with 3-4 years experience)
  • 5. Python – Cons
    - State of open source libraries for bioinformatics: there is good support, but less than for perl and R
    - Execution speed: comparable to perl, java, ruby, ..
    - SOAP libraries: SOAPpy is very old, suds is the best one
    - Population Genetics modules: as with many other specific modules, perl and R are better supported
    - Lack of true multithreading support: a structural limit (the Global Interpreter Lock in the standard interpreter) makes it impossible to have real multithreading in python (various workarounds exist)
    (Each row was rated on the slide with an icon meaning either "very sad" or "fine".)
  • 6. Python – what makes me happy
    - General syntax ☺☺☺☺☺
    - People are forced to write programs similar to yours ☺☺☺☺☺
    - Quicker to write programs ☺☺☺☺☺
    - Object Oriented, multi-paradigm (will be explained later) ☺☺☺☺☺
    - Testing support (will be explained later) ☺☺☺☺
  • 7. Python – learning curve
    Python's syntax is easy, so you can concentrate on algorithms and problems instead of the programming language.
  • 8. Python – learning curve
    Python's syntax is easy, so you can concentrate on algorithms and problems instead of the programming language.
    With python you don't have to worry about:
    - Learning strange symbols (~=, <>, eq, '\n', {}...)
    - Alternative syntaxes to do the same task
    - Declaring variables
    - Inner structure of strings/arrays
    - Low level IO, passing variables per reference/value, etc..
  • 9. Example of python code
        #!/usr/bin/env python
        '''Some python examples'''

        # example 1: a 'for' loop
        for name in ('Albert', 'Aristoteles', 'Archimedes'):
            print 'hello, ', name

        # example 2: Opening a file and parsing it
        filehandler = open('samplefile.txt', 'r')
        for line in filehandler.readlines():
            if line.startswith('>'):
                print line
            else:
                pass
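
The slide above uses Python 2 syntax (the `print` statement). As a rough sketch only, the same two examples could look like this in Python 3. The in-memory FASTA text and its record names are made up for illustration, so that the snippet does not depend on a `samplefile.txt` being present.

```python
import io

# example 1: a 'for' loop (print is a function in Python 3)
for name in ('Albert', 'Aristoteles', 'Archimedes'):
    print('hello,', name)

# example 2: parsing FASTA-like text and keeping only the header lines.
# fasta_text is hypothetical sample data standing in for samplefile.txt.
fasta_text = ">seq1 sample\nACGT\n>seq2 sample\nGGCC\n"
filehandler = io.StringIO(fasta_text)

headers = []
for line in filehandler:          # a file object can be iterated line by line
    if line.startswith('>'):
        headers.append(line.rstrip('\n'))

print(headers)
```

To read a real file instead, `io.StringIO(fasta_text)` can be swapped for `open('samplefile.txt')`, ideally inside a `with` block so the file is closed afterwards.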
  • 10. Python syntax I - indentation
    In python, the indentation ( = spaces at the beginning of the line) is part of the syntax.
    It is used to delimit loops and conditions, instead of curly braces ({})
    Example:
        for name in ('Albert', 'Aristoteles', 'Dayhoff'):
            print 'hello, ', name
        print 'and hello to you, too'
    The first 'print' is inside the loop, while the second is outside
  • 11. A quick perl/python comparison
    (Python)
        #!/usr/bin/env python
        a = 3
        if a == 3:
            print 'a is eq to 3'
    (Perl)
        #!/usr/bin/env perl
        my $a = 3;
        if ($a == 3){
            print "a is eq to 3\n";
        }
    Python code is usually easier to read and contains fewer symbols (like {})
  • 12. Python syntax II - simplicity
    Python has a minimal number of syntax keywords. There is:
    - only one way to open files (no 'fopen', 'openf', etc..)
    - only one way to print (no printn, printf, sprintf, sprint, etc..)
    - only two ways to define loops ('for' and 'while')
    Python's philosophy is about simplicity.
    Your colleagues are forced to write their programs in the same way as you.
  • 13. Python syntax III – declaring variables
    - You don't need to declare variables
    - The type of a variable is defined the first time you assign a value to it
        a = 'cacagtcaga'    → a is a string
        b = 133             → b is an integer
        c = True            → c is a boolean
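
A small sketch of what the slide describes, written for Python 3: the type belongs to the value, and it can be inspected with the built-in `type()`. The list `types` and the rebinding of `a` at the end are additions for illustration, not part of the talk.

```python
a = 'cacagtcaga'   # a is a string
b = 133            # b is an integer
c = True           # c is a boolean

# type() reports the type of the value currently bound to each name
types = [type(a).__name__, type(b).__name__, type(c).__name__]
print(types)

# No declaration is needed to rebind a name to a value of another type
a = len(a)         # a is now an integer: the length of the sequence
print(a)
```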
  • 14. Notes on Python's speed
    Python is an interpreted language
    - its speed is at the level of perl, java, etc.
    - programs are slower than C, but it's faster to write them
    - importance of programmer effort over computer effort
    Many ways to speed up python
    - modules can also be written in C
    - some compilers exist (PyPy)
    - Google is working on an enhanced version of python (news of March 2009)
  • 15. Python – programming goodies
    - Installation and portability: installed by default in most linux distributions, interpreted ☺☺☺☺☺
    - IDLE / text editors: interactive shell, ipython, many editors ☺☺☺☺☺
    - Install and search new modules: easy_install, PyPI ☺☺☺
    - Testing support: doctest, unittest, nose ☺☺☺☺☺
    - Writing documentation ☺☺☺☺☺
    - Debugging: logging, pdb
  • 16. Python – installation and portability
    - Python comes installed by default in most GNU/Linux distributions
    - Mac users have an old version (2.5), but can upgrade it
    - On windows, you need to download an installer from www.python.org first
    - Being an interpreted language, python programs are easy to port to other platforms
  • 17. PyPI (Python Package Index)
    PyPI is a repository of open source modules for python. For bioinformatics, it is smaller than CPAN, CRAN/Bioconductor, etc..
    [Figure: PyPI (repository of public python modules), pypi.python.org]
  • 18. Python – installing new modules
    Modules can be automatically downloaded and installed using a tool called 'easy_install'
    Examples:
        easy_install -U biopython                      # install or update biopython from PyPI
        easy_install --prefix ~/usr biopython          # install biopython without requiring admin privileges
        easy_install biopython.tar.gz                  # install biopython from a previously downloaded tar ball
        easy_install http://www.biopython.org/install  # install biopython from its web site
  • 19. Using python
    Python can be used as an interactive shell (like R, octave, matlab, etc..) or by writing programs
    (python interactive shell)
        gioby@dayhoff:~$ python
        >>> print 'hola'
        hola
        >>> range(5)
        [0, 1, 2, 3, 4]
    (a python program)
        gioby@dayhoff:~$ cat > prog.py
        print 'hola'
        print range(5)
        gioby@dayhoff:~$ python prog.py
        hola
        [0, 1, 2, 3, 4]
  • 20. Python interactive shell
        gioby@dayhoff:~$ python
        Python 2.5.2
        Type "help", "copyright", "credits" or "license" for more information.
        >>>
        >>> print 'hola'
        hola
        >>> range(5)
        [0, 1, 2, 3, 4]
    - You can use it to run code without having to save it to a script.
    - It has no equivalent of R's saved sessions
    - Many programmers prefer to use 'ipython', an enhanced version of this shell
  • 21. IPython session
        gioby@dayhoff:~$ ipython
        Type "copyright", "credits" or "license" for more information.
        In [1]: import random
        In [2]: random.choice(['ciao', 'hola', 'hello'])
        Out[2]: 'hello'
        In [3]: 1200 / 2
        Out[3]: 600
        In [4]: random?          (shows documentation on the random module)
        In [5]: random.<TAB>     (shows auto-completion)
        In [6]: !ls              (executes a bash command)
  • 22. Programming paradigms and testing
    - Programming paradigms: multi-paradigm (Object Oriented, Structured, Functional, etc..) ☺☺☺☺☺
    - Testing support: doctest, unittest, nose ☺☺☺☺☺
  • 23. Python is a multiparadigm language
    - Your python programs can be a simple list of instructions (imperative approach),
    - or you can write functions (functional)
    - or you can use objects (object oriented)
    - It's a multi-paradigm language
  • 24. Python as an imperative language
        print 'Hi, I am the psychotherapist'
        print 'How do you do? What brings you here?'
        response = raw_input()
        print 'can you elaborate on that?'
        response = raw_input()
        print 'Why do you say it is ', response, '?'
        ....
  • 25. Python as a functional language
        def get_sequence(filehandler):
            '''extracts the sequence from a fasta file'''
            sequence = ''
            for line in filehandler.readlines():
                if line.startswith('>'):
                    # header line: not part of the sequence
                    pass
                else:
                    sequence += line.strip()
            return sequence

        def main():
            '''execute the main functions'''
            filepath = 'samplefile.txt'
            filehandler = open(filepath, 'r')
            get_sequence(filehandler)
            .....
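
As a hedged illustration of the function-based style on this slide, here is a self-contained Python 3 sketch of `get_sequence`. It assumes a single-record FASTA input and that the function should return the concatenated sequence lines; the sample record is invented and fed through `io.StringIO` in place of a real file.

```python
import io

def get_sequence(filehandler):
    '''Return the sequence of a single-record FASTA file as one string.'''
    sequence = ''
    for line in filehandler:
        if line.startswith('>'):
            continue                  # skip the header line
        sequence += line.strip()      # drop the newline and join the pieces
    return sequence

# Hypothetical sample data standing in for samplefile.txt
sample = io.StringIO(">gene1 hypothetical record\nACGT\nTTGA\n")
sequence = get_sequence(sample)
print(sequence)
```

Because the function only needs an object that yields lines, the same code works on a real file handle returned by `open()`.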
  • 26. Object Oriented Programming explained in two sentences
    When you start having complicated nested variables (like arrays of hashes of arrays of lists of .....)
    → Object Oriented programming is something you should look at
  • 27. Object Oriented Programming example
        genes = {'gene1': {
                     'position': 10000,
                     'chromosome': 11,
                     'sequence': 'GTAGCCTGATGAACGGGCTAGCATGC....',
                     'transcripts': {
                         'transcript1': [......],
                         'transcript2': [......],
                     },
                 },
                 'gene2': {
                     'position': ...........},
                 .....
                }

        def get_subseq(genes, geneid, start, end):
            '''
            get a subsequence of a gene, given a dictionary of gene
            annotations, a gene id, and start/end position
            '''
            pass
  • 29. A python class
        class gene:
            def __init__(self):
                # attributes must be set on self to belong to the object
                self.position = None
                self.sequence = ''
                self.transcripts = []
            def get_subseq(self, start, end):
                pass
    - Python's syntax for classes is easy
    - More concise than Java, and it is not mandatory to use classes
    - OO is very complicated in Perl
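
To make the class on this slide concrete, here is one possible Python 3 sketch in which `get_subseq` is actually implemented. The coordinate convention (0-based start, end excluded, as in Python slicing) is an assumption, since the slides leave the method body as `pass`. The sequence used is only the visible prefix of the truncated example sequence from the earlier slide.

```python
class Gene:
    '''A gene with its annotations, replacing the nested dictionaries.'''

    def __init__(self, position, chromosome, sequence):
        self.position = position
        self.chromosome = chromosome
        self.sequence = sequence
        self.transcripts = []

    def get_subseq(self, start, end):
        '''Return the subsequence from start (included) to end (excluded).

        0-based slice coordinates are an assumption made for this sketch.
        '''
        return self.sequence[start:end]

# Illustrative values; the sequence is the visible part of the slide's example
gene1 = Gene(position=10000, chromosome=11,
             sequence='GTAGCCTGATGAACGGGCTAGCATGC')
subseq = gene1.get_subseq(0, 5)
print(subseq)
```

Compared with `get_subseq(genes, geneid, start, end)` on the dictionary version, the object carries its own data, so the caller no longer passes the whole annotation structure around.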
  • 30. Python and Java classes
    (A Python Class)
        class gene:
            def __init__(self, pos):
                self.position = pos
                self.sequence = ''
                self.transcripts = []
            def get_subseq(self, start, end):
                pass
    (A Java Class)
        public class Gene {
            public int position;
            public String chromosome;
            public String transcripts[];
            public Gene(int pos){
                position = pos;
            }
            public void getSubseq(int start, int end) {
                // empty body
            }
        }
  • 31. Three ways to test a python program
    When you write a program or a script and want to publish its results, you also need a way to prove that it works correctly
    Python has good tools for testing:
    - Doctest
    - Unittest
    - Nosetest
  • 32. doctest
    With doctest, you put examples of the usage of a function in its documentation
        >>> help(say_hello)
        Help on function say_hello in module __main__:

        say_hello(name)
            print hello <name> to the screen

            example:
            >>> say_hello('Albert Einstein')
            hello Albert Einstein!!!
    Doctest tries to re-execute these examples, and if they don't return the expected values, an error is raised
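
The slide only shows the `help()` output, so the function body below is a guess at what `say_hello` might look like, written for Python 3. The snippet runs the docstring example through the standard `doctest` machinery explicitly, so that it works wherever it is pasted; in a normal script one would more commonly call `doctest.testmod()` or run `python -m doctest script.py`.

```python
import doctest

def say_hello(name):
    '''Print hello <name> to the screen.

    Example:

    >>> say_hello('Albert Einstein')
    hello Albert Einstein!!!
    '''
    print('hello ' + name + '!!!')

# Extract the examples from the docstring and re-execute them
parser = doctest.DocTestParser()
test = parser.get_doctest(say_hello.__doc__, {'say_hello': say_hello},
                          'say_hello', None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)   # TestResults(failed=..., attempted=...)
print(results)
```

If the function's output is changed without updating the docstring, `results.failed` becomes non-zero and doctest prints a report of the mismatch.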
  • 33. Doctest example 2
  • 34. Doctest example 3
    Doctests are useful when you collaborate with non-programmers
  • 35. unittest
        import unittest

        class SimpleFastaSeqCase(unittest.TestCase):

            # instructions to be executed before/after all the tests
            @classmethod
            def setUpClass(cls):
                .....
            @classmethod
            def tearDownClass(cls):
                .....

            # instructions to be executed before/after each one of the tests
            def setUp(self):
                .....
            def tearDown(self):
                .....

            # tests
            def testCondition1(self):
                .....
            def testCondition2(self):
                .....
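
The skeleton above leaves every method body elided. As a minimal, runnable sketch of the same structure in Python 3, the example below tests a small helper, `count_bases`, which is hypothetical and not part of the talk. The suite is built and run explicitly rather than through `unittest.main()`, so the snippet can be executed in any context.

```python
import unittest

def count_bases(sequence):
    '''Return a dictionary with the number of occurrences of each base.'''
    counts = {}
    for base in sequence:
        counts[base] = counts.get(base, 0) + 1
    return counts

class CountBasesCase(unittest.TestCase):

    def setUp(self):
        # executed before each one of the tests
        self.sequence = 'ACGGCTAG'

    def test_counts(self):
        expected = {'A': 2, 'C': 2, 'G': 3, 'T': 1}
        self.assertEqual(count_bases(self.sequence), expected)

    def test_empty_sequence(self):
        self.assertEqual(count_bases(''), {})

# Collect the tests of the class and run them
suite = unittest.TestLoader().loadTestsFromTestCase(CountBasesCase)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a standalone test file, the last two lines are usually replaced by `unittest.main()` under an `if __name__ == '__main__':` guard.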
  • 36. nosetest
    Nosetest scans your code and looks for all the functions with the word 'test_' in their names
        def getfasta(filename):
            pass
        def count_numbers(numbers, limit):
            pass
        def you_like_this_talk(subliminal = True):
            pass
        def test_everything_ok():    # this is a test
            pass
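
To show what such test functions might contain, here is a small Python 3 sketch. The slide leaves `count_numbers` as `pass`, so the behaviour used here (counting the numbers below `limit`) is purely an assumption for illustration. The tests are plain functions with bare `assert` statements, which is the style that name-based test collectors pick up.

```python
def count_numbers(numbers, limit):
    '''Count how many numbers are below limit (hypothetical behaviour).'''
    return sum(1 for number in numbers if number < limit)

def test_count_numbers():
    # collected automatically because the name starts with 'test_'
    assert count_numbers([1, 5, 10], 6) == 2

def test_count_numbers_empty():
    assert count_numbers([], 6) == 0
```

Saved in a file, these functions would be found and run by the test collector without any registration code; a failing `assert` is reported as a failed test.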
  • 37. Message
    - Python is easy to learn and write
    - It has good tools to test and demonstrate that your programs work correctly
  • 38. Python – some bioinfo use cases
    - Regular expressions, motif search: re, TAMO, biopython. To use regular expressions, it is necessary to import a module. Getting help is easier ☺☺☺☺
    - Convert a sequence file to another format: biopython. Biopython is growing its support for bioinformatics formats ☺☺☺☺
    - Working with genomic data: pygr. Pygr is a great environment to work with genomic data ☺☺☺☺
    - Query Genbank: Biopython, pygr ☺☺☺☺
    - Structural Bioinformatics: I don't know
  • 39. Regular Expressions in Python
    - Using Regular Expressions in Python requires one more step than in Perl
    - You have to import a module called re first
    - Regular expressions are also less 'central' to the developers of the language
  • 40. Example – Regular Expressions in python
        >>> import re
        >>> sequence = 'ACGGCTAGGTCGATGCGATCG'
        >>> re.findall('A.G', sequence)
        ['ACG', 'AGG', 'ATG']
        >>> help(re)
        <get help on regular expressions>
    The only advantage of python over perl for regular expressions is that it is easier to get help
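
Building on the `re.findall` call above, a motif search usually also needs the position of each hit. The Python 3 sketch below uses `re.finditer` on the same sequence and pattern; the variable names are only illustrative.

```python
import re

sequence = 'ACGGCTAGGTCGATGCGATCG'
pattern = re.compile('A.G')      # 'A', any one character, then 'G'

# finditer yields match objects, which know where each hit starts
matches = [(match.start(), match.group())
           for match in pattern.finditer(sequence)]
print(matches)   # [(0, 'ACG'), (6, 'AGG'), (12, 'ATG')]
```

Like `findall`, `finditer` reports non-overlapping matches, scanning the sequence from left to right.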
  • 41. Biopython
    - A collection of free modules for bioinformatics
    - Number of functionalities implemented: bioconductor > bioperl > biopython > all others
    - Strong points:
        - File format support
        - NCBI Entrez APIs
        - PDB / structures
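  • 42. Biopython Examples
        # Parse a Fasta file (the records can then be written in another
        # format, e.g. Genbank, with SeqIO.write)
        from Bio import SeqIO
        seqfile = open('fastafile.fa', 'r')
        sequences = SeqIO.to_dict(SeqIO.parse(seqfile, 'fasta'))

        # Query NCBI (NCBI asks you to set Entrez.email before querying)
        from Bio import Entrez
        results = Entrez.esearch(db='nucleotide', term='cox2')
        Entrez.read(results)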
  • 43. Pygr
    Great for genome-wide analysis. It makes it automatic to:
    - Store/retrieve data in databases or pickles
    - Use and configure local blast databases
    - Create annotations and store them
    - Interface with ncbi, ensembl (eq. to ensembl perl APIs), ucsc
  • 44. Pygr examples
        # Ensembl APIs
        serverRegistry = get_registry(
            host='ensembldb.ensembl.org', user='anonymous')
        coreDBAdaptor = serverRegistry.get_DBAdaptor(
            'homo_sapiens', 'core', '47_36i')
        sequence = coreDBAdaptor.fetch_slice_by_seqregion(
            coordSystemName, seqregionName)

        # Download the sequence of the Human Genome (18)
        import pygr.Data
        hg18 = pygr.Data.Bio.Seq.Genome.HUMAN.hg18(
            download=True)
  • 45. TAMO and pyHMM
    Module to work with motifs
        >>> from TAMO import MotifTools
        >>> msa = ['TGACTCA',
        ...        'TGACTCA',
        ...        'TGAGTCA',
        ...        'TGAGTCA']
        >>> m_msa = MotifTools.Motif(msa)
        >>> print m_msa
        TGAsTCA(4)
        >>> m_msa._print_counts()
        #       0      1      2      3      4      5      6
        #A  0.000  0.000  4.000  0.000  0.000  0.000  4.000
        #C  0.000  0.000  0.000  2.000  0.000  4.000  0.000
        #T  4.000  0.000  0.000  0.000  4.000  0.000  0.000
        #G  0.000  4.000  0.000  2.000  0.000  0.000  0.000
  • 46. Python – bioinformatics utilities
    - Scientific and statistics: scipy + numpy ☺☺☺☺☺
    - Plotting graphs: Matplotlib (pylab) ☺☺☺☺☺
    - SOAP / web scraping utilities: suds ☺☺☺
    - ORM modules, database handling, HDF5: Sqlalchemy + elixir, sqlobject, pytables ☺☺☺☺☺
    - Persistent data: cPickle, shelve, ZODB. No R-like sessions ☺☺☺☺
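
As a rough sketch of the "persistent data" row, the Python 3 example below saves a small annotation dictionary with `pickle` (the Python 3 name of the module the slide calls cPickle) and with `shelve`. The data and file names are invented, and a temporary directory is used so that nothing is left on disk.

```python
import os
import pickle
import shelve
import tempfile

# Hypothetical annotation data, loosely modelled on the 'genes' dictionary
genes = {'gene1': {'position': 10000, 'chromosome': 11}}

with tempfile.TemporaryDirectory() as tmpdir:
    # pickle: dump a whole object to a file, then load it back
    pickle_path = os.path.join(tmpdir, 'genes.pickle')
    with open(pickle_path, 'wb') as handle:
        pickle.dump(genes, handle)
    with open(pickle_path, 'rb') as handle:
        restored = pickle.load(handle)

    # shelve: a persistent dictionary, values are stored key by key
    shelf_path = os.path.join(tmpdir, 'genes_shelf')
    with shelve.open(shelf_path) as shelf:
        shelf['gene1'] = genes['gene1']
    with shelve.open(shelf_path) as shelf:
        position = shelf['gene1']['position']

print(restored == genes, position)
```

Pickle fits when the whole object is loaded at once; shelve is closer to a tiny key-value store, which may be convenient when only some entries are needed at a time.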
  • 47. Python and Databases
    - There are some good libraries for Object Relational Mapping (ORM)
    - ZODB: Object Oriented Database
    - PyTables: hierarchical database (supports HDF5, a binary format used in astronomy/physics to store big data)
  • 48. sqlalchemy example
  • 49. Scientific Python
    - Numpy: python module to work with arrays and matrices
    - Scipy: module to do advanced math, statistics, and more
    - Matplotlib: module to plot graphics
    To get started with python and plotting graphs:
        $: easy_install numpy scipy matplotlib ipython
        $: ipython -pylab
  • 50. Numpy/Scipy example Hint: use ipython -pylab to have an R-like environment 
  • 51. Is there anything I forgot? ????? ????? ????? 
  • 52. Thank you for the attention!
    - PRBB technical seminars: http://bg.imim.es/technical-seminars/
    - These slides will be uploaded on http://www.slideshare.net
  • 53. Discarded slides
  • 54. Hint: use ipython -pylab The best way to work with python and plotting  graphs is with ipython -pylab It will give you a shell similar to  matlab/octave/R/etc..
  • 55. Regular expressions
    - To use regular expressions in python, you need to import the 're' module first
    - It's not as immediate as with perl, where you can use regular expressions without importing anything
    - However, it is easier to get the documentation
  • 56. Main python modules for bioinformatics Biopython  Pygr 
  • 57. Python – storing/accessing data
    - Reading/Writing files ☺☺☺☺☺
    - Persistent data: cPickle, shelve, ZODB ☺☺☺
    - Database, Object Relational Mapping libraries: sqlalchemy, elixir ☺☺☺☺☺
    - Binary formats (HDF5): pytables ☺☺☺☺
    - R-like sessions: nothing to my knowledge :(