Bioconductor with Python, What else ?
                ISMB / BOSC


     Laurent Gautier [laurent@cbs.dtu.dk]

Transcript of "Gautier bosc2010 pythonbioconductor"

  1. Bioconductor with Python, What else?
     ISMB / BOSC
     Laurent Gautier [laurent@cbs.dtu.dk], DMAC / CBS
     July 10th, 2010
  2. Disclaimer
     • This is not about the comparative merits of scripting languages
     • This is about being able to natively access libraries implemented in a different language
  3. About Bioconductor
     • Set of open-source packages for R
     • Started circa 2002 with a focus on microarrays
     • Rooted in statistics, data analysis, and visualization
     • Several hundred packages, addressing NGS, HTS, flow cytometry, protein-protein interactions, ...
     • Biannual releases
     • Presence on the publication circuit (> 2,300 citations for the BioC publication, > 600 for limma, > 500 for affy)
  4. About Python
     • Simple and clear all-purpose scripting language
     • Sometimes used in introductions to programming
     • Popular for agile development
     • Bioinformatics libraries:
       • biopython (libraries for bioinformatics)
       • galaxy (web front-end to pipelines)
       • PyCogent, pygr, bx-python (biological sequences-oriented)
     • Large selection of libraries:
       • Web development: Zope, Django, Google App Engine
       • Scientific computing: Scipy / Numpy
       • Cloud computing: Disco, execnet
       • Interface with C: ctypes, Cython
  5. A view on R/bioconductor and Python in bioinformatics
     [Figure: diagram positioning R/Bioconductor and Python. Callout: "Python is an all-purpose scripting language." Communities: Biologists, Statisticians, Physicists, Computer Scientists. Other labels: Bioinformatics data, Samples, Microarray, NGS, Flow-cytometry, proteomics, other assays, Automation, Annotation, Storage / Retrieval, Visualization, Non-interactive, abilities, Data storage / retrieval, Web, Statistical analysis, Algorithm development, Interactive programming, Scientific computing.]
  6. [Same figure, partial view]
  7. [Same figure, partial view]
  8. Running R code from Python (an example)
     Aim
       Running edgeR from Python
     Method
       Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140
     Data (lanes grouped as Control and Treated)
                        lane1  lane2  lane3  lane4  lane5  lane6  lane8
       ENSG00000230758      0      0      1      0      0      0      0
       ENSG00000182463      0      2      4      1      5      5      0
       ENSG00000124208     82    124    102    136     90    120     40
       ENSG00000230753      0      0      0      3      0      0      0
       ENSG00000224628      7      8      8     18      8      7      1
       ENSG00000125835    138    209    227    295    281    220     54
       ENSG00000125834     25     31     48     56     67     61     15
       ENSG00000197818     17     27     16     26     41     39      9
       ENSG00000243473      0      0      0      2      0      0      0
       ENSG00000226325      0      0      2      0      3      1      0
       ...                ...    ...    ...    ...    ...    ...    ...
  9.
       from rpy2.robjects.packages import importr
       from bioc import edger

       base = importr('base')

       summarized = edger.DGEList.new(counts = counts,
                                      lib_size = base.colSums(counts),
                                      group = grp)
       disp = edger.estimateCommonDisp(summarized)
       tested = edger.exactTest(disp)
       results = edger.topTags(tested)

                        logConc  logFC  PValue   FDR
       ENSG00000127954   -31.03  37.97    0.00  0.00
       ENSG00000151503   -12.96   5.40    0.00  0.00
       ENSG00000096060   -11.78   4.90    0.00  0.00
       ENSG00000091879   -15.36   5.77    0.00  0.00
       ENSG00000132437   -14.15  -5.90    0.00  0.00
       ENSG00000166451   -12.62   4.57    0.00  0.00
       ENSG00000131016   -14.80   5.27    0.00  0.00
       ENSG00000163492   -17.28   7.30    0.00  0.00
       ENSG00000113594   -12.25   4.05    0.00  0.00
       ENSG00000116285   -13.02   4.11    0.00  0.00
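The call `base.colSums(counts)` above supplies one library size per lane. As a rough illustration of what that value is, the sketch below sums the columns in plain Python, without R or rpy2. It uses only the ten rows displayed on the data slide, so the totals are not the library sizes of the full dataset; the helper name `col_sums` is hypothetical.

```python
# Ten rows of the count table shown on the data slide (lanes 1-6 and 8).
LANES = ["lane1", "lane2", "lane3", "lane4", "lane5", "lane6", "lane8"]
COUNTS = {
    "ENSG00000230758": [0, 0, 1, 0, 0, 0, 0],
    "ENSG00000182463": [0, 2, 4, 1, 5, 5, 0],
    "ENSG00000124208": [82, 124, 102, 136, 90, 120, 40],
    "ENSG00000230753": [0, 0, 0, 3, 0, 0, 0],
    "ENSG00000224628": [7, 8, 8, 18, 8, 7, 1],
    "ENSG00000125835": [138, 209, 227, 295, 281, 220, 54],
    "ENSG00000125834": [25, 31, 48, 56, 67, 61, 15],
    "ENSG00000197818": [17, 27, 16, 26, 41, 39, 9],
    "ENSG00000243473": [0, 0, 0, 2, 0, 0, 0],
    "ENSG00000226325": [0, 0, 2, 0, 3, 1, 0],
}


def col_sums(rows):
    """Sum each column across all rows, in the spirit of R's colSums()."""
    return [sum(column) for column in zip(*rows)]


# One total per lane, keyed by lane name.
lib_size = dict(zip(LANES, col_sums(COUNTS.values())))
for lane, total in lib_size.items():
    print(lane, total)
```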
  10. R code / Python code
      R:
        library(edgeR)
        summarized <- DGEList(counts = counts,
                              lib.size = colSums(counts),
                              group = grp)
        disp <- estimateCommonDisp(summarized)
      Python:
        from rpy2.robjects.packages import importr
        base = importr('base')
        from bioc import edger
        summarized = edger.DGEList.new(counts = counts,
                                       lib_size = base.colSums(counts),
                                       group = grp)
        disp = edger.estimateCommonDisp(summarized)
      Note:
      • namespaces are searched explicitly
      • call R functions as native Python functions
      • use R objects as Python objects
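The R argument `lib.size` appears as `lib_size` on the Python side, since a dot cannot be part of a Python identifier. The sketch below illustrates that kind of renaming in plain Python. It does not call rpy2 and is not its implementation; the helper names `translate_r_names` and `to_r_kwargs` are hypothetical.

```python
def translate_r_names(r_names):
    """Map Python-safe names to the R names they stand for.

    Dots are replaced with underscores; a clash between two R names
    that end up with the same Python name is reported as an error.
    """
    py_to_r = {}
    for r_name in r_names:
        py_name = r_name.replace(".", "_")
        if py_name in py_to_r and py_to_r[py_name] != r_name:
            raise ValueError(
                f"{r_name!r} and {py_to_r[py_name]!r} both map to {py_name!r}"
            )
        py_to_r[py_name] = r_name
    return py_to_r


def to_r_kwargs(py_to_r, **kwargs):
    """Rename Python keyword arguments back to their R spelling."""
    return {py_to_r.get(name, name): value for name, value in kwargs.items()}


# The three DGEList arguments used on the slide.
mapping = translate_r_names(["counts", "lib.size", "group"])
r_kwargs = to_r_kwargs(mapping, counts="<counts>", lib_size="<sizes>", group="<grp>")
print(r_kwargs)
```

Raising on a clash, rather than silently picking one name, keeps the translation explicit when an R function has both `a.b` and `a_b` as arguments.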
  11. Bioconductor library IRanges
  12. Bioconductor library Biostrings
  13. Separate communities
  14. Bilingual community
  15. Interpreters/Translators
  16. Cost of translation
      R package        lines of code    Python module
      AnnotationDbi              168    annotationdbi.py
      Biobase                    341    biobase.py
      Biostrings                 591    biostrings.py
      BSgenome                   112    bsgenome.py
      edgeR                      107    edger.py
      GEOquery                   102    geoquery.py
      GGbase                     104    ggbase.py
      GGtools                     77    ggtools.py
      goseq                       43    goseq.py
      GSEABase                   149    gseabase.py
      IRanges                    295    iranges.py
      ShortRead                  301    shortread.py
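To put the table in perspective, the figures can be totalled with a few lines of Python. This is only arithmetic on the numbers listed on the slide, keyed by R package name; the variable names are illustrative.

```python
# Lines of code per translated package, as listed in the table.
LINES_OF_CODE = {
    "AnnotationDbi": 168,
    "Biobase": 341,
    "Biostrings": 591,
    "BSgenome": 112,
    "edgeR": 107,
    "GEOquery": 102,
    "GGbase": 104,
    "GGtools": 77,
    "goseq": 43,
    "GSEABase": 149,
    "IRanges": 295,
    "ShortRead": 301,
}

total = sum(LINES_OF_CODE.values())
mean = total / len(LINES_OF_CODE)
largest = max(LINES_OF_CODE, key=LINES_OF_CODE.get)
smallest = min(LINES_OF_CODE, key=LINES_OF_CODE.get)

print(f"{len(LINES_OF_CODE)} packages, {total} lines in total")
print(f"{mean:.0f} lines per package on average")
print(f"largest: {largest}, smallest: {smallest}")
```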
  17. R within Python
      • R is running embedded in Python
      • R objects remain in the R workspace, but can be accessed from Python
      • Python-level shells to access the R objects
      • The rpy2 package is used to achieve this

        from rpy2.robjects.packages import importr

        biostrings = importr('Biostrings')

        # XString, conversion and _setExtractDelegators are not shown on the
        # slide; they are assumed to be defined elsewhere in the module.
        class AAString(XString):
            _aastring_constructor = biostrings.AAString

            @classmethod
            def new(cls, x):
                """ :param x: a string of amino-acids """
                # Convert the Python string, build the R object, wrap it.
                res = cls(cls._aastring_constructor(conversion.py2ri(x)))
                _setExtractDelegators(res)
                return res

        aas = AAString.new("PROTEIN")
  18. What is needed to continue
      More interpreters/translators
      • Many bioconductor packages.
      • Keep existing translations up to date.
      Keeping up-to-date
      • Frequent API-breaking changes in bioconductor
      • Tailored interfaces increase maintenance
      • Meta-programming and reflection can alleviate this
  19. Example with meta-programming:

        class AssayData(rpy2.robjects.methods.RS4):
            """ Abstract class. That class is a ClassUnionRepresentation
            in R, which is a way to create a parent class for existing
            classes. This is currently not modelled in Python. """
            __rname__ = 'AssayData'
            __metaclass__ = rpy2.robjects.methods.RS4_Type
            __accessors__ = (('featureNames', 'Biobase',
                              'featurenames', True,
                              'maps Biobase::featureNames'),
                             ('sampleNames', 'Biobase',
                              'samplenames', True,
                              'maps Biobase::sampleNames'),
                             ('storageMode', 'Biobase',
                              'storagemode', True,
                              'maps Biobase::storageMode'))
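The idea of generating accessors from a declarative `__accessors__` tuple can be sketched in plain Python. The code below is not rpy2's implementation: `AccessorType`, `AssayDataSketch` and the `FAKE_R_PACKAGES` table are hypothetical stand-ins, and the reading of each tuple as (R name, package, Python name, property flag, docstring) is an assumption inferred from the slide. It uses the Python 3 `metaclass=` keyword where the slide uses the Python 2 `__metaclass__` attribute.

```python
# Stand-in for functions that would live in an R package (hypothetical).
FAKE_R_PACKAGES = {
    "Biobase": {
        "featureNames": lambda obj: obj.data["features"],
        "sampleNames": lambda obj: obj.data["samples"],
    }
}


def _make_accessor(r_name, package, py_name, doc):
    """Build a function that delegates to the named backend function."""
    def accessor(self):
        # Looked up at call time, so the wrapper follows the backend.
        return FAKE_R_PACKAGES[package][r_name](self)
    accessor.__name__ = py_name
    accessor.__doc__ = doc
    return accessor


class AccessorType(type):
    """Metaclass turning each __accessors__ entry into a method or property."""

    def __new__(mcs, name, bases, namespace):
        for entry in namespace.get("__accessors__", ()):
            r_name, package, py_name, as_property, doc = entry
            accessor = _make_accessor(r_name, package, py_name, doc)
            namespace[py_name] = (
                property(accessor, doc=doc) if as_property else accessor
            )
        return super().__new__(mcs, name, bases, namespace)


class AssayDataSketch(metaclass=AccessorType):
    __accessors__ = (
        ("featureNames", "Biobase", "featurenames", True,
         "maps Biobase::featureNames"),
        ("sampleNames", "Biobase", "samplenames", False,
         "maps Biobase::sampleNames"),
    )

    def __init__(self, features, samples):
        self.data = {"features": features, "samples": samples}


assay = AssayDataSketch(["g1", "g2"], ["s1"])
print(assay.featurenames)   # generated as a property
print(assay.samplenames())  # generated as a regular method
```

With this pattern, following an API change in the wrapped package means editing one tuple rather than rewriting a hand-written method.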
  20. Example of a complete application
      A web-server to run edgeR.

        from bottle import abort, request, route, run
        # read_count_data is not imported on the slide; it is assumed here
        # to live in my_edger with the other helpers.
        from my_edger import get_toptags, make_results_page, read_count_data

        @route('/')
        def index():
            # Upload form for the count file.
            return '''
        <html>
          <body>
            <form action="/edger" method="post"
                  enctype="multipart/form-data">
              <input type="file" name="data" />
              <input type="submit" />
            </form>
          </body>
        </html>'''

        @route('/edger', method='POST')
        def run_edger():
            data = request.files.get('data')
            if data:
                counts, grp = read_count_data(data.file.name)
                top_tags = get_toptags(counts, grp)
                return make_results_page(top_tags)
            else:
                abort(404, "Invalid count file.")

        run(host='localhost', port=8080)
  21. Acknowledgements
      • Users, and communities from R, Bioconductor, Python, Biopython
      • (Vincent Davis, Nicolas Rapin, Brad Chapman)
      URLs
        http://pypi.python.org/pypi/rpy2-bioconductor-extensions/
        http://bitbucket.org/lgautier/rpy2-bioc-extensions
        http://packages.python.org/rpy2-bioconductor-extensions/
        http://rpy2.sourceforge.net/