Next-Generation Informatics

Talk from the Bioinformatics session of the Advances in Genome Biology and Technology 2009 meeting.

Published in: Technology, Education



  • There is too much data



    4 genomes to more than an order of magnitude increase



    Move from processing regions to single genomes to multi-genome comparisons



    This is a story about how we are trying to deal with this problem




  • This creates tension












  • Sample in -> answer out



    Don’t care how the sausage was made.


  • Never the same pipe twice (TJ Maxx)


  • And expanding beyond the laboratory


  • Different aligners, genotypers


  • How do we even begin to tackle this problem?



    How do we resolve the tension between changing pipelines and production systems?






  • Metadata



    Store DNA types, equipment, reagents, even process steps as rows rather than tables



    So maq is not maq, it is an aligner



    Standards like SAM help
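
    The note above, storing tools and process steps as rows rather than tables, can be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the schema of the LIMS described in the talk. The table names (`tool`, `process_step`), the column names, and the helper functions are all assumptions invented for the example.

```python
import sqlite3


def build_catalog():
    """Create an in-memory catalog in which each tool is a row, not a table.

    All table and column names are hypothetical, chosen for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tool (
            name TEXT PRIMARY KEY,
            role TEXT NOT NULL              -- e.g. 'aligner', 'assembler'
        );
        CREATE TABLE process_step (
            position INTEGER PRIMARY KEY,
            role     TEXT NOT NULL,         -- what the step needs
            tool     TEXT NOT NULL REFERENCES tool(name)  -- what fills it today
        );
        """
    )
    conn.executemany(
        "INSERT INTO tool (name, role) VALUES (?, ?)",
        [("maq", "aligner"), ("bowtie", "aligner"), ("velvet", "assembler")],
    )
    conn.execute(
        "INSERT INTO process_step (position, role, tool) VALUES (1, 'aligner', 'maq')"
    )
    conn.commit()
    return conn


def tools_for_role(conn, role):
    """Return the names of every tool registered under a role, sorted."""
    rows = conn.execute(
        "SELECT name FROM tool WHERE role = ? ORDER BY name", (role,)
    )
    return [name for (name,) in rows]


def swap_tool(conn, position, new_tool):
    """Point a step at a different tool of the same role. No schema change."""
    step = conn.execute(
        "SELECT role FROM process_step WHERE position = ?", (position,)
    ).fetchone()
    if step is None:
        raise KeyError(f"no process step at position {position}")
    tool = conn.execute(
        "SELECT role FROM tool WHERE name = ?", (new_tool,)
    ).fetchone()
    if tool is None or tool[0] != step[0]:
        raise ValueError(f"{new_tool!r} is not a registered {step[0]}")
    conn.execute(
        "UPDATE process_step SET tool = ? WHERE position = ?", (new_tool, position)
    )
    conn.commit()
```

    In this sketch the pipeline step asks for "an aligner"; Maq is one row that satisfies it. Adding a new aligner is an INSERT, and replacing Maq in a step is an UPDATE, so neither requires a new table or new code paths.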


  • Solexa/Maq-specific commands


  • Generic medical resequencing pipeline


  • Never write SQL


  • XML and flow chart


  • Click on any box to see processing details, including file system location






  • Screenshot of script vs. module


  • photograph










  • What I have talked about here is automation



    There is still much work to do in data reduction


  • How do you compare more than three genomes?



    How do you track all the analysis?



    So that’s one problem



  • Next-Generation Informatics

    1. Next-Generation Informatics David Dooling <ddooling@wustl.edu> AGBT Bioinformatics 2009-02-05
    2. Framing the problem
    3. Framing the problem [chart over 2000 to 2010; series: Moore's Law, CPU, Storage, Sequence, Personnel]
    4. Different perspectives
    5. LIMS
    6. LIMS - Illumina/Solexa
    7. LIMS - Roche/454
    8. Analysis
    9. Analysis - cDNA Solexa cDNA reads Maq/Tophat [Transcriptome] OR [Genome + SpliceJunctions (SJs)] OR [Genome] Maq Reads Reads Read SNPs map to map to depth Indels novel SJs or “non-genic” introns regions Velvet GenScan Gene Variant Splice Novel expression discovery/ isotypes Genes (to exquisite ASE sensitivity)
    10. Project Lead
    11. Changing pipelines
    12. Changing pipelines - LIMS Tech-Specific Primary Prep Submission Prep /Detection Analysis PCR (Technology- Solexa specific) NCBI SRA Hybrid 454 Selection Flow-space NCBI Medical cDNAs SOLiD Color-space Archive . Bisulfite Church Project . Polony(?) Jumping Archives . Libraries (e.g., DCC) Helicos(?) Sample Pooling 3730 Phred NCBI Trace … WGS Courtesy of Toby Bloom
    13. Changing pipelines - Analysis BLAST Phrap BLAT Arachne PASH PCAP ssaha Phusion runMapping Assemblers ELAND Euler Aligners mapreads ATLAS Arachne Newbler MAQ Velvet exonerate Forge SHRiMP SPLIGN SSAKE Mosaik VCAKE SLIM Search Euler-USR SXOligoSearch SHARCGS SOAP2 CABOG NovoCraft Bowtie Tophat
    14. Framing the solution
    15. Past is prologue
    16. Convert this…
    17. … into this
    18. Convert this…
    19. … into this
    20. UR
        • Object-relational mapping (ORM) layer
          – Interact with persistence layer (e.g., relational database) through objects and methods
          – Automatic, dynamic class definitions
          – Moose-like object definition syntax (Moose: http://www.iinteractive.com/moose/)
        • Object context
          – In-memory transactions (even across databases)
          – Caching/deferred loading
        • Dynamic command-line interface
        • Integrated documentation system
    21. Genome Workflow
    22. Genome Model
    23. Past is prologue…
    24. … but with a wrinkle
        • Lab personnel accept the software you give them
        • Analysts are more than happy to develop their own
        • We need to make it easy for analysts to build tools within the system
    25. Easy Perl API
    26. Pairing Analyst Programmer
    27. Variant Detection Pipeline
    28. cDNA Analysis
    29. 16S Pipeline
    30. Assembly and Annotation Pipeline
    31. Challenges
        • There is still much more work to do
        • Sequencing is demolishing Moore’s law
        • The cult of traces
        • The richness of data
        • Visualization
    32. CIRCOS
    33. Thanks
        Web Site: http://genome.wustl.edu/
        Blog: http://www.politigenomics.com/
        LIMS Paper: http://www.biomedcentral.com/1471-2105/8/362
        UR Presentation: http://www.media-landscape.com/yapc/2006-06-27.ScottSmith/
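
    Slide 20 lists caching and deferred loading among the features of the object context. As a rough illustration of that general pattern only, the Python sketch below loads an object's attributes on first access and serves later accesses from a cache. It is not UR's API; the class names (`ObjectContext`, `LazyObject`), the `loader` callable, and the `load_count` counter are all hypothetical.

```python
class ObjectContext:
    """Toy object context: loads each object at most once and caches it."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical callable: object id -> dict of attributes
        self._cache = {}        # object id -> attributes already loaded
        self.load_count = 0     # how many times the persistence layer was hit

    def get(self, object_id):
        """Return a handle immediately; nothing is loaded yet."""
        return LazyObject(self, object_id)

    def _load(self, object_id):
        # Hit the loader only on a cache miss.
        if object_id not in self._cache:
            self._cache[object_id] = self._loader(object_id)
            self.load_count += 1
        return self._cache[object_id]


class LazyObject:
    """Handle whose attributes are fetched on first access (deferred loading)."""

    def __init__(self, context, object_id):
        self._context = context
        self._id = object_id

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        data = self._context._load(self._id)
        try:
            return data[name]
        except KeyError:
            raise AttributeError(name) from None
```

    With a loader backed by a dictionary, `ctx.get(1)` performs no load, the first attribute read such as `obj.name` triggers exactly one, and a second handle for the same id reuses the cached attributes.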
