Smith T Bio Hdf Bosc2008
Published in: Business, Technology

  • Transcript

    • 1. BioHDF: Open binary file formats for large scale data management
      • Todd Smith (1), Christian Chilan (2), Rishi Sinha (3), Elena Pourmal (2), Mike Folk (2)
      • 1. Geospiza, Inc., 100 West Harrison St., North Tower #330, Seattle, WA 98119
      • 2. The HDF Group, 1901 S. First St., Suite C-2, Champaign, IL 61820
      • 3. Microsoft Corporation, Redmond, WA
    • 2. Overview
      • Driver: Next Generation DNA Sequencing
      • What is HDF5
      • BioHDF Project
      Laboratory and data workflow management for genetic analysis
    • 3. Next Generation DNA Sequencing
      • Next Gen Sequencing platforms produce ~1,500x more data than CE (Sanger)
      • A single Next Gen instrument can produce 20 times more data in a single run than a day’s operation of a genome center with 100 CE instruments
      • In Sequence quotes - July 2007
        • Toby Bloom, Broad Institute “Next-gen sequencing impacts all aspects of informatics.”
        • Phil Butcher, Sanger “The best way to move terabytes of data is still disk.” Want to process data closer to the machine.
        • Eugen Clark, Harvard “[community] needs to start talks about data retention.”
        • Kelly Carpenter, Wash U “these sequencers are going to totally screw you.”
      • Nature Methods July 2008: “Byte-ing off more than you can chew”
    • 4. Three Phases of Data Production
      • Primary Data Analysis: images to bases
      • Secondary Data Analysis (Secondary Data Production): Ref Seq + Aligner, or de novo assembly => Assembler
      • Tertiary Data Analysis: one or more data sets
      • [Diagram labels] Sequences + quality values; run quality; gene lists; read density; variant list; contigs + annotation; sample, run quality; differential expression; methylation sites; gene association; genomic structure; experiment, science
    • 5. Proliferation of files, formats, formatters
      • Example: MAQ - Secondary Analysis for: Tag profiling, ChIP-Seq, Resequencing
      • Additional files and formats needed for tertiary analysis
    • 6. Challenges
      • Complexity
        • Numerous programs, scripts, files, and formats
        • Redundant data
      • Computational overhead
        • All data typically reside in RAM during computation
        • Output and input formats differ, so data must be frequently reprocessed
      • Space, time, and bandwidth efficiency
        • Increased storage
        • Computation times increase disproportionately
        • Large data sets must be transported for processing
    • 7. What Needs to Be Done
      • Reduce complexity
        • Decrease numbers and kinds of files
        • Eliminate data duplication (performance)
        • API and tools for data access
      • Improve resource utilization
        • Reduce redundancy, work with compressed data
        • Improve program access to data, random reads and writes, map disk to computer memory
        • Parallel I/O, Remote access
        • Facilitate data sharing, preservation
      • Adopt a standard from other data intensive fields
        • Benefit from history and experience
        • Benefit from refinement
        • Build on a proven, widely accepted platform
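
The "random reads and writes, map disk to computer memory" item above can be illustrated with a minimal sketch that uses only Python's standard `mmap` and `struct` modules. This is not HDF5 or BioHDF code: the file name `density.bin` and the one-`uint32`-per-position record layout are assumptions made up for the illustration.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width record: one little-endian uint32 per position.
RECORD = struct.Struct("<I")


def write_counts(path, counts):
    """Write counts as consecutive fixed-width records."""
    with open(path, "wb") as fh:
        for value in counts:
            fh.write(RECORD.pack(value))


def read_count(mm, index):
    """Random read: unpack only the record at `index`."""
    return RECORD.unpack_from(mm, index * RECORD.size)[0]


def update_count(mm, index, value):
    """Random write: overwrite one record in place."""
    RECORD.pack_into(mm, index * RECORD.size, value)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "density.bin")
    write_counts(path, range(1000))
    # Map the whole file; the OS pages in only the parts that are touched.
    with open(path, "r+b") as fh, mmap.mmap(fh.fileno(), 0) as mm:
        before = read_count(mm, 500)
        update_count(mm, 500, 42)
        mm.flush()
    # Re-read through ordinary file I/O to confirm the change reached disk.
    with open(path, "rb") as fh:
        fh.seek(500 * RECORD.size)
        after = RECORD.unpack(fh.read(RECORD.size))[0]
    return before, after
```

Calling `demo()` reads one record, rewrites it in place, and returns the value before and after the update without loading the rest of the file.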
    • 8. HDF5: Single Platform / Multiple Uses
      • A file format for managing any kind of data
      • Software system to manage data in the format
      • Designed for high volume or complex data
      • Designed for every size and type of system
      • Open format and software
      • One library, with
        • Options to adapt I/O and storage to data needs
        • Layers on top and below
      • Ability to interact well with other technologies
      • Attention to past, present, future compatibility
    • 9. HDF5 - 20 yrs in Physical Sciences
      • Gain multiple “working with data” efficiencies: slice, recombine …
      • Arrays, sets, organizations, compression already there
      • Server and remote access
      • Quick access to data via HDFView, MATLAB, other tools
      • Widely used - MATLAB, Mathematica, IDL, NASA-EOS
      Significantly reduce programming efforts needed to develop and maintain formats and software to explore scientific questions in your data
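
As a rough illustration of the "slice" and "compression already there" points, the sketch below uses the third-party `h5py` and `numpy` packages (assumed to be installed) to store a chunked, gzip-compressed array and read back a small slice. The dataset name `ld/matrix`, the array contents, and the chunk shape are hypothetical choices for the example, not part of the BioHDF data model.

```python
import os
import tempfile

import h5py
import numpy as np


def demo():
    path = os.path.join(tempfile.mkdtemp(), "example.h5")
    # Synthetic 1000 x 1000 array standing in for real data.
    data = np.arange(1_000_000, dtype=np.int32).reshape(1000, 1000)

    with h5py.File(path, "w") as f:
        # Chunked layout plus the gzip filter: each 100 x 100 tile is
        # compressed independently. The group "ld" is created implicitly.
        f.create_dataset(
            "ld/matrix", data=data, chunks=(100, 100), compression="gzip"
        )

    with h5py.File(path, "r") as f:
        dset = f["ld/matrix"]
        # Slicing reads and decompresses only the chunks that overlap
        # the requested region, not the whole array.
        block = dset[200:203, 500:504]
        return dset.shape, block.shape, int(block[0, 0])
```

The same file can be opened in HDFView or other HDF5-aware tools, which is the kind of reuse the slide refers to.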
    • 10. HDF Software (layer diagram, top to bottom)
      • Tools, Applications, Libraries (e.g. BioHDF)
      • HDF I/O Library
      • HDF File
    • 11. BioHDF
      • SBIR Funded Project
      • Phase I - Feasibility for genotyping
      • Phase II - Open source technologies to support computation in Next Gen DNA sequencing applications
        • Support diverse types of data from multiple sequencing technologies by extending the BioHDF data model
        • Develop prototype BioHDF software applications that support common activities utilizing DNA
        • Develop methods for incorporating BioHDF into enterprise applications for clinical research and diagnostics
    • 12. Phase I - Pilot Project
      • Combined view of HapMap, chromosome LD, PolyPhred details
      • BioHDF file structure
      • A 53,000 x 53,000 LD array
      • 53,000 row, 100+ column HapMap table
      • PolyPhred data table, graphs, and chromats
    • 13. Benefits
      • Separated the model, implementation, and view of the data
      • Multiple levels of data in a single view
      • HapMap: convert, display, and scroll 100,000s of genotypes
      • Compressed 5.2 GB LD data into 300 MB (17x)
      • Quickly and randomly access subsets of data
      • Made use of standard features and a data viewer (HDFView)
      Only had to build the model and data importer
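
The 17x figure can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes); the slides do not say which convention was used, nor how many bytes each LD entry occupies, so only the ratio and the entry count of the 53,000 x 53,000 array are computed.

```python
GB = 10**9  # assumption: decimal gigabytes
MB = 10**6  # assumption: decimal megabytes


def compression_ratio(original_bytes, compressed_bytes):
    """Return how many times smaller the compressed data is."""
    return original_bytes / compressed_bytes


# Sizes quoted on the slide.
ratio = compression_ratio(5.2 * GB, 300 * MB)

# Number of entries in a dense 53,000 x 53,000 array.
entries = 53_000 ** 2

print(f"compression ratio: about {ratio:.1f}x")
print(f"dense LD array entries: {entries:,}")
```

The ratio comes out slightly above 17, consistent with the rounded value quoted on the slide.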
    • 14. Phase II
      • Primary Data Analysis
        • Models for storing and accessing primary data
        • Implement and test models, develop compression methods
        • Create research tools to access and work with the data
      • Secondary Data Analysis
        • Models for storing common data structures (assembly graphs, density plots, variants)
        • APIs to work with programs, enable out-of-core processing
        • Develop research level applications utilizing HDFView, current and emerging genome browsers
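
"Out-of-core processing" means computing over data that stays on disk rather than loading it all into RAM. A minimal standard-library sketch of the idea follows; it is not a BioHDF API, and the file `quals.bin` with one byte per quality value is a hypothetical layout invented for the example.

```python
import os
import tempfile


def mean_quality(path, block_size=64 * 1024):
    """Mean of one-byte quality values, reading one block at a time.

    Only `block_size` bytes are held in memory at once, so the file
    can be far larger than available RAM.
    """
    total = 0
    count = 0
    with open(path, "rb") as fh:
        while True:
            block = fh.read(block_size)
            if not block:
                break
            total += sum(block)  # iterating bytes yields integers 0-255
            count += len(block)
    return total / count if count else 0.0


def demo():
    path = os.path.join(tempfile.mkdtemp(), "quals.bin")
    with open(path, "wb") as fh:
        # 200,000 synthetic quality values.
        fh.write(bytes([10, 20, 30, 40]) * 50_000)
    return mean_quality(path, block_size=4096)
```

With a chunked format such as HDF5, the same pattern would iterate over dataset slices instead of raw byte blocks.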
    • 15. Collaborations
      • Planned
        • Software: SRF working group (A. Siddiqui), AMOS project (M. Pop), Assembly formats (G. Marth), Consed (D. Gordon)
        • Applications and data: University of Washington, University of Florida, Johns Hopkins University, Applied Biosystems
      • Emerging
        • Additional Sequencing Vendors, Microsoft Research, Intel, Institute for Systems Biology
      • Seeking
        • Algorithm developers
        • Application developers
        • Frameworks, Bio*
        • Data sets
    • 16. Summary
      • Data challenges for Next Gen sequencing
        • Manage high volumes of data
        • Workflow complexity
        • Computational performance
      • BioHDF will be built on existing, available, and proven HDF5 technology
      • Geospiza and The HDF Group are seeking collaborations
      • Funding - NIH STTR 1R41HG003792-02
      • Interested? Contact