Welsh_BioHDF_BOSC2009

Transcript

  • 1. BioHDF: Toward Scalable Bioinformatics Infrastructures
    Todd Smith (1), Eric Olson (1), Mark Welsh (1), Mike Folk (2), Christopher Mason (3)
    1. Geospiza, Inc., 100 West Harrison St., North Tower #330, Seattle WA 98119
    2. The HDF Group, 1901 S. First St., Suite C-2, Champaign, IL 61820
    3. Yale University, New Haven CT
    © Copyright 2009 Geospiza, Inc. All Rights Reserved. Page 1 Confidential
  • 2. Overview
    • Driver: bioinformatics challenges in Next Generation DNA Sequencing (NGS)
    • BioHDF project and examples
    • HDF5 (Hierarchical Data Format)
    • How you can get involved
    • Geospiza
  • 3. Next Generation DNA Sequencing: NGS is Powerful
    “Transforms today’s biology”
    “Democratizing genomics”
    “Changing the landscape”
    “Genome center in a mail room”
    “The beginning of the end for microarrays”
  • 4. Example: Measuring Gene Expression
    dbEST - Jan 20, 2009
      Total Organisms     1,683
      Total ESTs          59,498,205
      Human               8,163,902
      Mouse               4,850,605
      Maize               2,018,337
      Arabidopsis         1,526,124
      Cattle              1,517,143
      Pig                 1,476,771
      Soybean             1,380,071
      Zebra Fish          1,380,017
      5 others            >1,000,000
    Other Technologies
      Experiment          Measurements
      SAGE                1,000-100,000
      Microarrays         1.4 M (probes), 0.8 M (probes), 48 K (transcripts)
    Next Generation Sequencing
      Instrument          Measurements
      454 Titanium        1,000,000
      Illumina GA         80,000,000
      SOLiD V3            180,000,000
    Greater sensitivity, higher dynamic ranges + Qualitative data: isoforms, alleles, …
  • 5. NGS is Daunting
    “Prepare for the deluge”
    “Byte-ing off more than you can chew”
    “These sequencers are going to totally screw you”
  • 6. NGS Data are Analyzed in Three Phases
    Primary Data Analysis - Images to bases Sequences + Quality values Run quality
    Secondary Data Analysis Ref Seq + Aligner Gene lists Secondary Data Read Density Production De novo assembly => Assembler Variant list Sample, run quality
    Tertiary Data Analysis One or more Data sets Differential expression Methylation sites Gene association Contigs + Annotation Genomic structure Experiment, science
  • 7. Secondary Analysis is Complex
    Examples: MAQ - http://maq.sourceforge.net
    Secondary analysis for: ChIP-Seq, tag profiling, resequencing
    Story repeats for BWA, Bowtie, TopHat, Mapreads, SOAP …
  • 8. Complexity Limits Scale and Productivity
    • Data are unstructured - no consistent data model
    • Solve problems using redundant data processing
      – Incremental processing with data filtering at each stage
      – New question? Then rerun alignment operations
    • Each analysis step has a new output format
      – One file for tables of alignments
      – Another file with bases aligned to see mismatches
      – Another file to ask statistical questions
      – More files and images for visualization
      – Files are linked by virtue of being in the same directory
      – Perl hashes used to link the data fill up memory
      – Redundant text-based formats fill up disk space
  • 9. Makes Getting Answers Difficult
    Process: 10 - 100 million reads => Align to reference data => Parse files, reformat data, create reports => Review results, make decisions
    Applications: Gene Expression, Small RNA, Epi-Genomics, Variation Analysis
  • 10. And Comparing Between Samples Hard
    Process: 10 - 100 million reads => Align to reference data => Review results, make decisions
    Repeat n times, with n samples
    Explore data between samples and drill into details: compare expression, alternative splicing
  • 11. What is Desired
    1. Scalable systems with smoothly operating user interfaces
    2. Summarize results and drill into details for single samples
    3. Compare results between samples and within groups
    Data must be structured, indexed, and annotated
    Need a better way to work with NGS data and information
  • 12. BioHDF Project
    • NIH STTR
      – Geospiza, Seattle WA
      – The HDF Group, Urbana/Champaign IL
    • Goal: Move bioinformatics problems from organizing and structuring data to asking questions and visualizing data
      – Develop data models and tools to work with NGS data in HDF (Hierarchical Data Format)
      – Create HDF5 domain-specific extensions and library modules to support the unique aspects of NGS data => BioHDF
      – Integrate BioHDF technologies into Geospiza products
    • Deliver core BioHDF technologies to the community as open-source software
  • 13. Performance Advantages
    Test case: 9.3 million GA reads aligned to HG build 36.1 (4-core 3GHz Intel Xeon)

    Storage               Flat File World                HDF5 World
    fasta file            609 MB - no random access      143 MB - compressed, random access
    Bowtie alignments     1033 MB - no random access     = 284 MB - index
    fasta + alignment                                    374 MB + index

    Export alignments     Alignments     Flat File World     HDF5 World
    chr5                  ~1 M           1470 ms             1100 ms
    100 Mbase region      450000         735 ms              540 ms
    10 Mbase region       44000          735 ms              62 ms
    1 Mbase region        4000           735 ms              19 ms
    0.1 Mbase region      600            735 ms              15 ms

    Development
    Flat File World: > Months - develop file formats, indices, access libraries, and debug to make efficient
    HDF5 World: Days, Weeks - write I/O code - parsers, loaders, and access methods

    HDF improves storage, access, and development efficiency and does not add to computational overhead
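The export timings above show the flat-file cost staying at 735 ms as the region shrinks, while the HDF5 cost falls with the number of alignments returned. A minimal sketch of why an index produces that pattern, using only the Python standard library: this is not the BioHDF implementation and does not use HDF5; the `Alignment` record, `scan_region`, and `SortedIndex` are hypothetical names introduced for illustration.

```python
from bisect import bisect_left
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical minimal alignment record (not the BioHDF data model)."""
    chrom: str
    start: int      # 0-based leftmost reference position
    read_id: str


def scan_region(alignments, chrom, lo, hi):
    """Flat-file style lookup: every record is examined, whatever the region size."""
    return [a for a in alignments if a.chrom == chrom and lo <= a.start < hi]


class SortedIndex:
    """Assumed in-memory stand-in for an on-disk index sorted by (chrom, start)."""

    def __init__(self, alignments):
        self.records = sorted(alignments, key=lambda a: (a.chrom, a.start))
        self.keys = [(a.chrom, a.start) for a in self.records]

    def query(self, chrom, lo, hi):
        """Return alignments with lo <= start < hi using two binary searches."""
        first = bisect_left(self.keys, (chrom, lo))
        last = bisect_left(self.keys, (chrom, hi))
        return self.records[first:last]


alignments = [
    Alignment("chr5", 1_200, "r1"),
    Alignment("chr1", 50, "r2"),
    Alignment("chr5", 900_000, "r3"),
    Alignment("chr5", 150, "r4"),
]
index = SortedIndex(alignments)
hits = index.query("chr5", 0, 10_000)
print([a.read_id for a in hits])
```

The scan does work proportional to the whole data set, whereas the indexed query does two binary searches plus a slice proportional to the hits, which is the behaviour the table attributes to random access.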
  • 14. Value of Development Time
    [Figure labels: ID, Splice exons, Exons observed, Index, Sample A vs Sample B]
    Instead of Software / IT:
    • Developing and debugging low level infrastructures to support “novel” binary data formats
    • Optimizing high-end hardware systems
    • Tuning and redesigning RDBMS and other implementations
    Focus on Science:
    • Working with 100s of million reads for 100s of samples
    • Measuring gene expression
    • Identifying isoforms
    • Observing sequence and structural variation
    • Drilling into details from summaries
  • 15. Enables a Different Approach
    • Integrate data across platforms
    • Integrate samples / annotations
    • Integrate between systems
    [Diagram labels: series of sources (reads, ref data, alignments); Sample 1 (exon-crossing), Sample 2 (exon-crossing); HDF file1, HDF file2, HDF file(n) for Sample 1, Sample 2, Sample (n); base comp.; data by range; queries; formatters; .wig and .bed file annotations]
  • 16. Explore Data with Different Questions
    One alignment step, different questions
    Sources: Adapter/Primers, rRNA/mtDNA, miRBase, Transcripts, Genome, Exon junctions
    • “Subtractive” question
    • “Biology” question
    • Examine match quality
    [Diagram labels: reads; alignments; queries; formatters; annotations; HDF file1, Sample 1; Small RNA Analysis; Splicing / Exon Analysis]
  • 17. Why HDF?
    HDF5: 20 Years in Physical Sciences
    A platform for creating software to work with many kinds of scientific data
    • Arrays, rich data types, groups accommodate every kind of data
    • Store any combination of data objects in one container
    • Performance: fast random access and efficient, scalable storage
    • Portability, data sharing: platform independent, self describing, common data models
    • Tools for viewing, analysis: HDFview, MATLAB, others
    • Widespread: used in academia, govt, industry - MATLAB, IDL, NASA-Earth Observing System
  • 18. HDF Software
    [Diagram labels: Tools, Applications, Libraries; Command Line Tools; HDF I/O Library; BioHDF Library Extensions; HDF File; Modifications]
  • 19. Benefits
    • Separates the model, implementation, and view of the data
    • Combines data from multiple samples
    • Compression, chunking and other performance advantages
    • Rapid prototyping environment
    • Significant reduction in development time
    • Approach NGS analysis differently
    Only had to define the data model, write data import and export tools
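To make "define the data model, write data import and export tools" concrete, here is a hedged sketch of what a tiny data model plus one export tool could look like. It is an illustration only: `AlignmentRecord`, `to_bed_line`, and `export_bed` are hypothetical names, the records live in memory rather than in an HDF5 file, and the output follows the standard six-column BED layout (chrom, start, end, name, score, strand) rather than any format confirmed by the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentRecord:
    """Hypothetical data model for one aligned read (the 'model')."""
    chrom: str
    start: int        # 0-based, inclusive, as BED expects
    end: int          # exclusive
    read_id: str
    strand: str       # "+" or "-"
    score: int = 0


def to_bed_line(rec: AlignmentRecord) -> str:
    """Format one record as a tab-separated BED6 line (the 'view')."""
    fields = [rec.chrom, str(rec.start), str(rec.end),
              rec.read_id, str(rec.score), rec.strand]
    return "\t".join(fields)


def export_bed(records) -> str:
    """Export tool: sort by position and emit one BED line per record."""
    ordered = sorted(records, key=lambda r: (r.chrom, r.start))
    return "\n".join(to_bed_line(r) for r in ordered) + "\n"


records = [
    AlignmentRecord("chr5", 100, 136, "r1", "+", 30),
    AlignmentRecord("chr1", 10, 46, "r2", "-"),
]
bed_text = export_bed(records)
print(bed_text, end="")
```

Keeping the record definition separate from the formatter means a second view (for example a .wig writer) can be added without touching how the data are stored, which is the separation the first benefit describes.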
  • 20. Getting Involved
    • BioHDF is being built on existing, available, and proven HDF5 technology
    • Import, export tools will be open source
    • Geospiza and The HDF Group are seeking collaborations
      – Bioinformatics pipeline developers
      – Algorithm developers
    • Funding - NIH STTR HG003792
    • Interested? Contact todd@geospiza.com
  • 21. Geospiza Products
    GeneSifter™ Laboratory and Analysis Software Systems
    • For: Core, Service, Data Production Labs and Research Scientists
    • Working with: Sanger Sequencing, Microarray, Next Generation Sequencing, and (or) other platforms
    • GeneSifter supports: Laboratory operations, Data Management, Multiple Levels of Data Analysis From Samples to Results™
    • Deployment: cost effective hosted or on-site models.
    BioHDF at ISMB:
    • Monday June 29th - 6pm - Poster U57: BioHDF Poster Session
    • Wednesday July 1st - 1pm - Room T1: BioHDF hack-a-thon / BOF
    BioHDF at BOSC:
    • Today - 5:30pm - here: BioHDF hack-a-thon / BOF
    Getting BioHDF software: http://www.hdfgroup.org/projects/bioinformatics/