Smith T Bio Hdf Bosc2008
Published in: Business, Technology
  • Transcript

    • 1. BioHDF: Open binary file formats for large scale data management
        Todd Smith (1), Christian Chilan (2), Rishi Sinha (3), Elena Pourmal (2), Mike Folk (2)
        1. Geospiza, Inc., 100 West Harrison St., North Tower #330, Seattle, WA 98119
        2. The HDF Group, 1901 S. First St., Suite C-2, Champaign, IL 61820
        3. Microsoft Corporation, Redmond, WA
    • 2. Overview
        • Driver: Next Generation DNA Sequencing
        • What is HDF5
        • BioHDF Project
        Laboratory and data workflow management for genetic analysis
    • 3. Next Generation DNA Sequencing
        • Next Gen sequencing platforms produce ~1500x more data than CE (Sanger)
        • A single Next Gen instrument can produce 20 times more data in a single run than a day’s operation of a genome center with 100 CE instruments
        • In Sequence quotes - July 2007
            • Toby Bloom, Broad Institute “Next-gen sequencing impacts all aspects of informatics.”
            • Phil Butcher, Sanger “The best way to move terabytes of data is still disk.” Want to process data closer to the machine.
            • Eugen Clark, Harvard “[community] needs to start talks about data retention.”
            • Kelly Carpenter, Wash U “these sequencers are going to totally screw you.”
        • Nature Methods July 2008: “Byte-ing off more than you can chew”
    • 4. Three Phases of Data Production (diagram labels, in transcript order): Primary Data Analysis - Images to bases; Secondary Data Analysis; Tertiary Data Analysis; Sequences + Quality values; Run quality; Gene lists; Read Density; Variant list; Sample, run quality; Differential expression; Methylation sites; Gene association; Genomic structure; Experiment, science; Ref Seq + Aligner; One or more Data sets; Secondary Data Production; De novo assembly => Assembler; Contigs + Annotation
    • 5. Proliferation of files, formats, formatters
        • Example: MAQ - http://maq.sourceforge.net
        • Secondary Analysis for: Tag profiling, ChIP-Seq, Resequencing
        • Additional files and formats needed for tertiary analysis
    • 6. Challenges
        • Complexity
            • Numerous programs, scripts, files, and formats
            • Redundant data
        • Computational overhead
            • All data typically reside in RAM during computation
            • Output and input formats differ, so data must be frequently reprocessed
        • Space, time, and bandwidth efficiency
            • Increased storage
            • Computation times increase disproportionately
            • Large data sets must be transported for processing
    • 7. What Needs to Be Done
        • Reduce complexity
            • Decrease numbers and kinds of files
            • Eliminate data duplication (performance)
            • API and tools for data access
        • Improve resource utilization
            • Reduce redundancy, work with compressed data
            • Improve program access to data, random reads and writes, map disk to computer memory
            • Parallel I/O, Remote access
            • Facilitate data sharing, preservation
        • Adopt a standard from other data intensive fields
            • Benefit from history and experience
            • Benefit from refinement
            • Build on a proven, widely accepted platform
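The "random reads and writes, map disk to computer memory" item on slide 7 can be illustrated with a minimal sketch that uses only the Python standard library. This is not BioHDF or HDF5 code: the fixed-width record layout, the function names, and the temporary file are assumptions made for illustration. The sketch shows one record being read and overwritten in place through a memory map, without loading or rewriting the rest of the file.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: one little-endian unsigned 32-bit value per record.
RECORD = struct.Struct("<I")


def write_records(path, values):
    """Write values as fixed-width binary records."""
    with open(path, "wb") as fh:
        for v in values:
            fh.write(RECORD.pack(v))


def read_record(mm, index):
    """Read one record by index without scanning the records before it."""
    return RECORD.unpack_from(mm, index * RECORD.size)[0]


def update_record(mm, index, value):
    """Overwrite one record in place through the memory map."""
    RECORD.pack_into(mm, index * RECORD.size, value)


def demo():
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_records(path, range(1000))
        with open(path, "r+b") as fh:
            with mmap.mmap(fh.fileno(), 0) as mm:
                before = read_record(mm, 742)
                update_record(mm, 742, 7)
                after = read_record(mm, 742)
                mm.flush()
        # Reopen with ordinary file I/O to confirm the change reached the file.
        with open(path, "rb") as fh:
            fh.seek(742 * RECORD.size)
            on_disk = RECORD.unpack(fh.read(RECORD.size))[0]
        return before, after, on_disk
    finally:
        os.remove(path)
```

Calling `demo()` returns the value of record 742 before the update, after the update, and as re-read from disk. A format library such as HDF5 adds typed, named, multidimensional datasets on top of this kind of direct access, which a flat record file does not provide.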
    • 8. HDF5: Single Platform / Multiple Uses
        • A file format for managing any kind of data
        • Software system to manage data in the format
        • Designed for high volume or complex data
        • Designed for every size and type of system
        • Open format and software
        • One library, with
            • Options to adapt I/O and storage to data needs
            • Layers on top and below
        • Ability to interact well with other technologies
        • Attention to past, present, future compatibility
    • 9. HDF5 - 20 yrs in Physical Sciences
        • Gain multiple “working with data efficiencies” slice, recombine …
        • Arrays, sets, organizations, compression already there
        • Server and remote access
        • Quick access to data via HDFView, MATLAB, other tools
        • Widely used - MATLAB, Mathematica, IDL, NASA-EOS
        Significantly reduce programming efforts needed to develop and maintain formats and software to explore scientific questions in your data
    • 10. HDF Software (diagram labels): Tools, Applications, Libraries (e.g. BioHDF); HDF I/O Library; HDF File
    • 11. BioHDF
        • SBIR Funded Project
        • Phase I - Feasibility for genotyping
        • Phase II - Open source technologies to support computation in Next Gen DNA sequencing applications
            • Support diverse types of data from multiple sequencing technologies by extending the BioHDF data model
            • Develop prototype BioHDF software applications that support common activities utilizing DNA
            • Develop methods for incorporating BioHDF into enterprise applications for clinical research and diagnostics
    • 12. Phase I - Pilot Project
        • Combined view of HapMap, chromosome LD, PolyPhred details
        • A 53,000 x 53,000 LD array
        • BioHDF file structure
        • 53,000 row, 100+ column HapMap table
        • PolyPhred data table, graphs, and chromats
    • 13. Benefits
        • Separated the model, implementation, and view of the data
        • Multiple levels of data in a single view
        • HapMap: convert, display, and scroll 100,000s of genotypes
        • Compressed 5.2 GB LD data into 300 MB (17x)
        • Quickly and randomly access subsets of data
        • Made use of standard features and a data viewer (HDFView)
        Only had to build the model and data importer
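Slide 13 pairs compression with quick random access to subsets. One way these two properties can coexist is chunked storage, where each chunk is compressed independently; HDF5 supports chunked datasets with compression filters, although the slides do not say how the LD array was laid out. The sketch below is a standard-library illustration of the idea only, not HDF5 code: the chunk size, the 16-bit value type, and the function names are assumptions, and any ratio it reports on synthetic data is unrelated to the 17x figure on the slide.

```python
import struct
import zlib

CHUNK_VALUES = 1024          # values per chunk (illustrative choice)
VALUE = struct.Struct("<H")  # hypothetical value type: unsigned 16-bit


def compress_chunks(values):
    """Split values into fixed-size chunks and compress each one independently."""
    chunks = []
    for start in range(0, len(values), CHUNK_VALUES):
        block = values[start:start + CHUNK_VALUES]
        raw = struct.pack("<%dH" % len(block), *block)
        chunks.append(zlib.compress(raw))
    return chunks


def read_value(chunks, index):
    """Fetch one value by decompressing only the chunk that holds it."""
    chunk_no, offset = divmod(index, CHUNK_VALUES)
    raw = zlib.decompress(chunks[chunk_no])
    return VALUE.unpack_from(raw, offset * VALUE.size)[0]


def compression_ratio(values, chunks):
    """Uncompressed size divided by the total size of the compressed chunks."""
    raw_bytes = len(values) * VALUE.size
    packed_bytes = sum(len(c) for c in chunks)
    return raw_bytes / packed_bytes
```

For example, with `values = [i % 7 for i in range(100_000)]` and `chunks = compress_chunks(values)`, the call `read_value(chunks, 54321)` touches a single chunk of 1,024 values rather than the whole array. Smaller chunks make single lookups cheaper but usually compress less well, so the chunk size is a tuning choice.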
    • 14. Phase II
        • Primary Data Analysis
            • Models for storing and accessing primary data
            • Implement and test models, develop compression methods
            • Create research tools to access and work with the data
        • Secondary Data Analysis
            • Models for storing common data structures (assembly graphs, density plots, variants)
            • APIs to work with programs, enable out-of-core processing
            • Develop research level applications utilizing HDFView, current and emerging genome browsers
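"Out-of-core processing" on slide 14 means computing over data that stays on disk, with only a small working block in memory at a time, in contrast to the "all data typically reside in RAM" pattern listed under Challenges. The following is a rough standard-library sketch of that pattern applied to a read-density count. The binary position file, the block size, and all names are hypothetical and do not describe a BioHDF API; the sketch also assumes every position is smaller than `window * n_windows`.

```python
import os
import struct
import tempfile

POSITION = struct.Struct("<I")  # hypothetical: one 32-bit alignment start per record
BLOCK_RECORDS = 4096            # records held in memory at once (illustrative)


def write_positions(path, positions):
    """Write alignment start positions as fixed-width binary records."""
    with open(path, "wb") as fh:
        for p in positions:
            fh.write(POSITION.pack(p))


def read_density(path, window, n_windows):
    """Count positions per window while holding only one block in memory."""
    counts = [0] * n_windows
    with open(path, "rb") as fh:
        while True:
            block = fh.read(BLOCK_RECORDS * POSITION.size)
            if not block:
                break
            for (pos,) in POSITION.iter_unpack(block):
                counts[pos // window] += 1
    return counts


def demo():
    fd, path = tempfile.mkstemp(suffix=".pos")
    os.close(fd)
    try:
        # Synthetic positions spread evenly over 0..999.
        positions = [i * 3 % 1000 for i in range(10000)]
        write_positions(path, positions)
        return read_density(path, window=100, n_windows=10)
    finally:
        os.remove(path)
```

Memory use here depends on `BLOCK_RECORDS` and the number of windows, not on the number of reads, which is the property the slide is asking for.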
    • 15. Collaborations
        • Planned
            • Software: SRF working group (A. Siddiqui), AMOS project (M. Pop), Assembly formats (G. Marth), Consed (D. Gordon)
            • Applications and data: University of Washington, University of Florida, Johns Hopkins University, Applied Biosystems
        • Emerging
            • Additional Sequencing Vendors, Microsoft Research, Intel, Institutes for Systems Biology
        • Seeking
            • Algorithm developers
            • Application developers
            • Frameworks, Bio*
            • Data sets
    • 16. Summary
        • Data challenges for Next Gen sequencing
            • Manage high volumes of data
            • Workflow complexity
            • Computational performance
        • BioHDF will be built on existing, available, and proven HDF5 technology
        • Geospiza and The HDF Group are seeking collaborations
        • Funding - NIH STTR 1R41HG003792-02
        • Interested? Contact todd@geospiza.com
