biopython, doctest and makefiles

This is a very short 30-minute talk that I gave at a Barcelona Python developers meeting.

It explains a proposal to use doctest for Biopython documentation (and, in general, in bioinformatics).

It also contains an introduction to the use of automated build tools in bioinformatics, like make and scons.

Statistics

Views

Total Views
3,180
Views on SlideShare
3,103
Embed Views
77

Actions

Likes
1
Downloads
53
Comments
0

3 Embeds 77

http://bioinfoblog.it 74
http://www.slideshare.net 2
https://www.linkedin.com 1

Accessibility

Categories

Upload Details

Uploaded via as OpenOffice

Usage Rights

CC Attribution-NonCommercial License


biopython, doctest and makefiles Presentation Transcript

  • 1. Barcelona Python Developers Seminars biopython, doctest and makefiles
  • 2. This is me
    • Giovanni
    • PhD student in a Population Genetics lab
    • Not a biopython dev
    • (that might not be my real photo)
  • 3. Intro
    • BioPython -> a collection of standard python modules for bioinformatics
    • Advantages of using open source libraries in science:
      • more reproducibility
      • easier to compare results
      • fewer errors
      • less time spent
  • 4. BioPython – some use cases
    • The human genome sequencing project (2001):
    • TCCATGGCCTCCCGGCAAGCCTAAGCTAGCGCAATTGTCAGACGCACAGGACCGGTCTGGGGAGACCAATGTGTTCAGACAACGATTCCCAGCTAGTACCACTGTTTGACTCGGAAGATGTGTACAACTATTGTAGCGACTGTGTCCCATCATTGCATTCAAACCCAAGTAATTGATGGATCAACAAAGGATACACTCCAAAAGTCGCACAGAGATTGGTCATCTTAACGCGAGATTAAACATGCGTCTATACGCCCGTGTTAAGTTCGGCCGCCATCGTACAAATAAGCGAGNNNNTATCAATCTAATCTTAAACCGGCTCTTGAGAAGGGCTAGCGGCGTTAGGACCCGCTGCCGGCCGTGAGCGTGCGTTCACTCTGAACAGCGCCATCGATGGGTCGCTTGTGTAGCTATTTTAAGGACGCGACATAGGCCCTGGGGCAGTTACTGGGGCATGCCCACTATATCCGCGGGCAAGTTGGTATTCAGCTATGTTTATCTCTCGCCCAATGCGTGAAAGCGCCAAACGTGGGTAGAGGACTTAGCAATTTGGGGCATGCCCTGCTCTTTTAGATCTGTTAAGCAATCCGCGCGTAGGGCTCGCTGCGTCGTAAATGTGAGCGCAAGTCACCGACGCAGTGGTAATATACGTGTAACTGATCATCNNNNNNTCCCGAACCATGCCTTCTAACAGGAGATGCCCAAGGTCGAGGGTCACCGCCAACGACCGGCTGATCCCTGTTGGTGAGGATTTATGGAGGTGGACTGTCAGGTAGGCAAGAACTCTGGGTGAATTTGCGAGCGCTATCTCTAAGTTACACGCTTTACTGGGGCATGCCCGGGCCGTAGAAGTTACTGGGGCATGCCCCACGTAATAGGTTTTCATGAGGAGATGTTTGGTCTGATTCTCGAGATTGTGGCTAAGTATTGAGTCAGACTTACTGGGGCATTTACTGGGGCATGCCCGCCCTGCTCTTTTAGATCTGTTAAGCAATCCGCGCGTAGGGCTCGCTGCGTCGTAAATGTGAGCGCAAGTCACCGACGCAGTGGTAATATACGTGTAACTGATCATCTTCATGATTCCCGAACCATGCCTTCTAACAGGAGATGCCCAAGGTCGAGGGTCACCGCCAACGACCGGCTGATTTACTGGGGCATGCCCCCCNNNNNGAGGATTTNNNNTGGAGCCTATCTCACATTTTAAACTTCAATCATCATAACACGTGCGCACTTTTTCCGCGCTTGACGGCGAAGTGACTGGCCACTTCCTGCTCCCTGTTTTTCCCAATACCTGACAAGTGTGGCATCTGTCCCCCTGAAGAGGACTAGAGTATCATTACGGGGGGCTTGACACTTACCTTCATAGG.............
    • Up to ~3×10^9 characters
    • Lots of regexes (Perl users like them)
    • Could be obtained for less than $1000 in the near future
  • 5. BioPython – use cases
    • Conversion between different formats
    • Structure data into objects (genes, proteins, species, etc.)
    • Match regular expressions/motifs
    • Launch external tools (web or local)
    • Retrieve data from public online resources
    • Interrogate databases
  • 6. BioPython documentation
    • What should the documentation of a project like biopython look like?
      • follow strict specifications (it already does, with epydoc)
      • always be up-to-date
      • have many examples of usage (there are many in the tutorials)
    • There is a Python module called 'doctest' that can help with this.
  • 7. doctest
    • doctest lets you incorporate examples of a function's usage in its docstring, and use them as tests.
    • Example (on the slide, the function's docstring is everything in green; the >>> line inside it is the example of say_hello's usage):
        def say_hello(name):
            '''
            print hello <name> to the screen

            example:
            >>> say_hello('Albert Einstein')
            hello Albert Einstein!!!
            '''
            print('hello ' + name + '!!!')
  • 8. The docstring
    • The docstring is what is shown when you ask for help on a function:
        >>> help(say_hello)
        Help on function say_hello in module __main__:

        say_hello(name)
            print hello <name> to the screen

            example:
            >>> say_hello('Albert Einstein')
            hello Albert Einstein!!!
  • 9. doctest – how does it work
    • Example script:
        #!/usr/bin/env python
        def sum(x, y):  # note: this shadows the built-in sum
            '''
            sums two numbers

            example:
            >>> print(sum(1, 2))
            3
            '''
            return x + y

        if __name__ == '__main__':
            import doctest
            doctest.testmod()
    • doctest.testmod() looks for any line beginning with '>>>' and executes it as a Python command
    • The result is compared with the subsequent lines (the expected output). If there are differences, a failure is reported.
    • If 'print(sum(1, 2))' doesn't print 3, the test fails
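To make the comparison step concrete, here is a minimal sketch that runs the docstring examples of one function at a time and counts how many differ from the expected output. The names `add`, `broken_add` and `run_examples` are made up for this illustration; only `DocTestFinder` and `DocTestRunner` come from the standard `doctest` module.

```python
import doctest


def add(x, y):
    """Sum two numbers.

    >>> add(1, 2)
    3
    """
    return x + y


def broken_add(x, y):
    """Same example as add, but the body has a bug.

    >>> broken_add(1, 2)
    3
    """
    return x - y


def run_examples(func):
    """Run the docstring examples of func; return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    failed = attempted = 0
    for test in finder.find(func, globs={func.__name__: func}):
        # out=... swallows the failure report so only the counts remain
        result = runner.run(test, out=lambda text: None)
        failed += result.failed
        attempted += result.attempted
    return failed, attempted


if __name__ == '__main__':
    print(run_examples(add))         # (0, 1): the output matched
    print(run_examples(broken_add))  # (1, 1): got -1, expected 3
```

In everyday use `doctest.testmod()` does all of this for a whole module; the helper only exposes the counts so the pass/fail logic is visible.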
  • 10. doctest - examples
    • BioPython - SeqIO.parse
  • 11. doctest – file parsing example
    • In bioinformatics there are many formats with semi-homonymous names
      • ped, tped, bed, tmap, pdb, fasta...
    • It is useful to put an example input file in the docstring of every parser function
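As a sketch of what this could look like, the function below carries a small example file inside its own docstring, wrapped in a `StringIO` so that the doctest does not need a file on disk. `parse_fasta` is a hypothetical, deliberately minimal parser written for this illustration; it is not Biopython's `SeqIO.parse`.

```python
import doctest
from io import StringIO


def parse_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted file handle.

    The example input is written out in full, so a reader can see what
    the format looks like without opening a real file:

    >>> example_file = StringIO('''>seq1 first example
    ... ACGTAC
    ... GGTA
    ... >seq2
    ... TTGACA
    ... ''')
    >>> for name, sequence in parse_fasta(example_file):
    ...     print(name, sequence)
    seq1 ACGTACGGTA
    seq2 TTGACA
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(chunks)
            # keep only the first word of the header as the name
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, ''.join(chunks)


if __name__ == '__main__':
    doctest.testmod()
```

With several similarly named formats around, the embedded example doubles as a compact description of the format the function expects.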
  • 12. Choose good examples
    • Write the doctest together with whoever will use the script (e.g. a fellow scientist)
    • Ask them 'how is this function supposed to behave in this example?'
    • Simplify: round all numbers to multiples of 100, add comments
  • 13. Doctest – Pros and Cons
    • Pros:
      • docs always up to date
      • Usage examples
      • Quick tests when you are coding
    • Cons:
      • Functions that read files (StringIO? NamedTemporaryFile?)
      • Still need to write unit tests
      • Can't use lines longer than 80 characters (PEP 8)
      • Random generators / statistics / rounding
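For the last point, one possible workaround (a sketch, not the only option) is to write examples that do not depend on the exact printed value: round floating point results before comparing, and check properties of random results instead of literal values. `mean` and `random_base` are made-up helper names for this illustration.

```python
import doctest
import random


def mean(values):
    """Return the arithmetic mean of a sequence of numbers.

    Floating point results rarely print cleanly, so the example rounds
    the value before it is compared with the expected output:

    >>> round(mean([0.1, 0.2, 0.4]), 2)
    0.23
    """
    return sum(values) / len(values)


def random_base(rng):
    """Pick one nucleotide using the random generator rng.

    The exact letter depends on the generator, so the examples check
    properties that hold for any seed instead of a literal value:

    >>> random_base(random.Random(42)) in 'ACGT'
    True
    >>> random_base(random.Random(42)) == random_base(random.Random(42))
    True
    """
    return rng.choice('ACGT')


if __name__ == '__main__':
    doctest.testmod()
```

Passing the generator in as an argument keeps the function deterministic under test without touching the global random state.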
  • 14. Bioinformatics – a different approach
    • Programming software and programming experiments call for different approaches:
      • Testing has different dimensions (biological meaning, reproducibility)
      • Usually you write numerous scripts, each one carrying out a small task, and glue them together with a pipeline/wrapper script/makefile/automated build tool/XML-described workflow/insert others here
    • I am a makefile guy
  • 15. What is a makefile?
    • GNU make is a utility typically used for building C/C++ programs.
    • It can be used to save shell commands (...) with their options and re-execute them at will.
    • Example:
        $ make all
        python retrieve_data.py --option1 --option2
        perl convert_format.pl --input inputfile --option3
        perl convert_format.pl --inputfile inputfile2
  • 16. Simplest Makefile example
    • $ cat Makefile
        # in a real Makefile, each recipe line must start with a tab
        help:
            echo 'execute "make all" to carry out the whole analysis'

        get_data:
            python retrieve_data.py --database ensembl --specie Human --output sequences.fasta

        calculate_results:
            perl calculate_results.pl --option1 --option2 --input sequences.fasta --output results.txt

        all: get_data calculate_results
  • 17. Makefiles – Pros
    • Conditional execution
      • If there is no need to execute a command, it is skipped (checks if the expected output file already exists and is up-to-date)
    • Chaining commands
      • You can define the order in which commands must be executed (download sequences first, then read them)
    • Support for clusters
    • Syntax is ugly, but standard
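The first two points can be sketched in a few lines of Python. This is only an illustration of the idea (compare modification times, build prerequisites first), not how make itself is implemented; `needs_rebuild`, `build` and the `rules` dictionary are names invented for this example.

```python
import os
import shutil
import tempfile


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)


def build(target, rules, log):
    """Build target after its prerequisites, skipping up-to-date files.

    rules maps a target file name to (prerequisite files, action), where
    action is a callable that creates the target. Names of rebuilt
    targets are appended to log.
    """
    if target not in rules:
        return  # a plain input file: nothing to build
    sources, action = rules[target]
    for src in sources:
        build(src, rules, log)          # chaining: prerequisites first
    if needs_rebuild(target, sources):  # conditional execution
        action()
        log.append(target)


def write(path, text):
    with open(path, 'w') as handle:
        handle.write(text)


workdir = tempfile.mkdtemp()
fasta = os.path.join(workdir, 'sequences.fasta')
results = os.path.join(workdir, 'results.txt')
rules = {
    fasta: ([], lambda: write(fasta, '>seq1\nACGT\n')),
    results: ([fasta], lambda: write(results, 'length: 4\n')),
}

first_run, second_run, third_run = [], [], []
build(results, rules, first_run)   # both files are missing: both are built
build(results, rules, second_run)  # everything is up to date: nothing runs

# Pretend the sequences were modified 10 seconds after the results.
newer = os.path.getmtime(results) + 10
os.utime(fasta, (newer, newer))
build(results, rules, third_run)   # only results.txt is rebuilt

shutil.rmtree(workdir)
```

Note that this only works when each step is tied to the file it produces, which is also what make needs in order to decide whether a command can be skipped.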
  • 18. Make - Cons
    • GNU make has a very ugly syntax
    • Really, I hate its syntax
    • I am looking for substitutes in python:
      • scons
      • paver
      • waf (Google Summer of Code project)
    • Still haven't started using them
    • Implement something in biopython?
  • 19. A more complicated Makefile
    • Variables like %, $@, $<
    • Modifiers like -, @
    • addprefix, addsuffix ??
    • Triple parentheses ??
  • 20. Thanks for the attention! Did you like the talk?
  • 21. BioPython – use cases
    • Single Nucleotide Polymorphisms (SNPs) are positions in the genome that tend to vary most between different individuals
    • We are working with data on 650,000 SNPs on 1000 individuals
    • Need to organize data into objects (SNPs, Genotypes, Individuals, Populations), use a database for support, and calculate statistics on them
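A minimal sketch of what organizing this kind of data into objects might look like, with one simple statistic on top. The class layout, the field names and the identifier 'rs0001' are assumptions made for this illustration; they are not taken from the talk or from Biopython.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SNP:
    """A variable position in the genome (identifier and location)."""
    snp_id: str
    chromosome: str
    position: int


@dataclass
class Individual:
    """One sampled individual, with genotypes stored per SNP identifier."""
    name: str
    population: str
    genotypes: dict = field(default_factory=dict)  # snp_id -> 'AG', 'GG', ...


def allele_frequencies(snp, individuals):
    """Return {allele: frequency} for one SNP across some individuals.

    >>> snp = SNP('rs0001', '1', 12345)
    >>> people = [
    ...     Individual('ind1', 'pop1', {'rs0001': 'AA'}),
    ...     Individual('ind2', 'pop1', {'rs0001': 'AG'}),
    ... ]
    >>> allele_frequencies(snp, people) == {'A': 0.75, 'G': 0.25}
    True
    """
    counts = Counter()
    for person in individuals:
        # a genotype string such as 'AG' contributes one count per allele;
        # individuals without data for this SNP contribute nothing
        counts.update(person.genotypes.get(snp.snp_id, ''))
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}
```

The docstring example follows the earlier advice: small, rounded numbers that a fellow scientist can check by hand.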
  • 22. Doctest – a closer look
    • Annotated example:
        #!/usr/bin/env python
        def say_hello(name):
            '''
            print hello (name) to the screen

            example:
            >>> say_hello('Albert Einstein')
            hello Albert Einstein!!!
            '''
            print('hello ' + name + '!!!')

        if __name__ == '__main__':
            import doctest
            doctest.testmod()
    • Labels on the slide: new function definition (the def line), normal doc (first line of the docstring), example of function usage (the >>> line), expected output (the line after it), body of the function (the print call), call to the doctest module (doctest.testmod())