Smith T Bio Hdf Bosc2008
Smith T Bio Hdf Bosc2008 Presentation Transcript

  • 1. BioHDF: Open binary file formats for large scale data management. Todd Smith (1), Christian Chilan (2), Rishi Sinha (3), Elena Pourmal (2), Mike Folk (2). (1) Geospiza, Inc., 100 West Harrison St., North Tower #330, Seattle, WA 98119. (2) The HDF Group, 1901 S. First St., Suite C-2, Champaign, IL 61820. (3) Microsoft Corporation, Redmond, WA.
  • 2. Overview
    • Driver: Next Generation DNA Sequencing
    • What is HDF5
    • BioHDF Project
    Laboratory and data workflow management for genetic analysis
  • 3. Next Generation DNA Sequencing
    • Next Gen Sequencing platforms produce ~1,500x more data than CE (Sanger)
    • A single Next Gen instrument can produce 20 times more data in a single run than a day’s operation of a genome center with 100 CE instruments
    • In Sequence quotes - July 2007
      • Toby Bloom, Broad Institute “Next-gen sequencing impacts all aspects of informatics.”
      • Phil Butcher, Sanger “The best way to move terabytes of data is still disk.” Want to process data closer to the machine.
      • Eugen Clark, Harvard “[community] needs to start talks about data retention.”
      • Kelly Carpenter, Wash U “these sequencers are going to totally screw you.”
    • Nature Methods July 2008: “Byte-ing off more than you can chew”
  • 4. Three Phases of Data Production
    • Primary Data Analysis - Images to bases
      • Sequences + Quality values
      • Run quality
    • Secondary Data Analysis
      • Secondary Data Production
      • Ref Seq + Aligner
      • De novo assembly => Assembler: Contigs + Annotation
      • Gene lists, Read Density, Variant list
      • Sample, run quality
    • Tertiary Data Analysis
      • One or more Data sets
      • Differential expression, Methylation sites, Gene association, Genomic structure
      • Experiment, science
  • 5. Proliferation of files, formats, formatters
    • Example: MAQ - Secondary Analysis for: Tag profiling, ChIP-Seq, Resequencing
    • Additional files and formats needed for tertiary analysis
  • 6. Challenges
    • Complexity
      • Numerous programs, scripts, files, and formats
      • Redundant data
    • Computational overhead
      • All data typically reside in RAM during computation
      • Output and input formats differ, so data must be frequently reprocessed
    • Space, time, and bandwidth efficiency
      • Increased storage
      • Computation times increase disproportionately
      • Large data sets must be transported for processing
  • 7. What Needs to Be Done
    • Reduce complexity
      • Decrease numbers and kinds of files
      • Eliminate data duplication (performance)
      • API and tools for data access
    • Improve resource utilization
      • Reduce redundancy, work with compressed data
      • Improve program access to data, random reads and writes, map disk to computer memory
      • Parallel I/O, Remote access
      • Facilitate data sharing, preservation
    • Adopt a standard from other data intensive fields
      • Benefit from history and experience
      • Benefit from refinement
      • Build on a proven, widely accepted platform
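
The "random reads and writes, map disk to computer memory" item above can be sketched with the Python standard library alone. This is a minimal illustration, not HDF5 and not BioHDF code: the fixed-width record layout (`RECORD`), the file name, and the helper names are all hypothetical, chosen only to show one record being read and updated in place without loading the rest of the file.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width record: read start position (uint32) + quality (uint8).
RECORD = struct.Struct("<IB")

def write_records(path, n):
    """Write n fixed-width records so any record can be located by offset."""
    with open(path, "wb") as f:
        for i in range(n):
            f.write(RECORD.pack(i * 10, i % 41))

def read_record(path, index):
    """Map the file and decode a single record; other records are not parsed."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return RECORD.unpack_from(mm, index * RECORD.size)

def update_quality(path, index, quality):
    """Overwrite the quality byte of one record in place through the mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
            position, _ = RECORD.unpack_from(mm, index * RECORD.size)
            RECORD.pack_into(mm, index * RECORD.size, position, quality)

path = os.path.join(tempfile.mkdtemp(), "reads.bin")
write_records(path, 1000)
record_500 = read_record(path, 500)
update_quality(path, 500, 7)
record_500_updated = read_record(path, 500)
print(record_500, record_500_updated)
```

A flat binary file like this gives random access but none of the self-description, compression, or tooling the slides attribute to HDF5, which is the gap a standard format is meant to close.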
  • 8. HDF5: Single Platform / Multiple Uses
    • A file format for managing any kind of data
    • Software system to manage data in the format
    • Designed for high volume or complex data
    • Designed for every size and type of system
    • Open format and software
    • One library, with
      • Options to adapt I/O and storage to data needs
      • Layers on top and below
    • Ability to interact well with other technologies
    • Attention to past, present, future compatibility
  • 9. HDF5 - 20 yrs in Physical Sciences
    • Gain multiple “working with data” efficiencies: slice, recombine …
    • Arrays, sets, organizations, compression already there
    • Server and remote access
    • Quick access to data via HDFView, MATLAB, other tools
    • Widely used - MATLAB, Mathematica, IDL, NASA-EOS,
    Significantly reduce programming efforts needed to develop and maintain formats and software to explore scientific questions in your data
  • 10. HDF Software
    • Tools, Applications, Libraries (e.g. BioHDF)
    • HDF I/O Library
    • HDF File
  • 11. BioHDF
    • SBIR Funded Project
    • Phase I - Feasibility for genotyping
    • Phase II - Open source technologies to support computation in Next Gen DNA sequencing applications
      • Support diverse types of data from multiple sequencing technologies by extending the BioHDF data model
      • Develop prototype BioHDF software applications that support common activities utilizing DNA
      • Develop methods for incorporating BioHDF into enterprise applications for clinical research and diagnostics
  • 12. Phase I - Pilot Project
    • Combined view of HapMap, chromosome LD, PolyPhred details
    • BioHDF file structure
    • A 53,000 x 53,000 LD array
    • 53,000 row, 100+ column HapMap table
    • PolyPhred data table, graphs, and chromats
  • 13. Benefits
    • Separated the model, implementation, and view of the data
    • Multiple levels of data in a single view
    • HapMap: convert, display, and scroll 100,000s of genotypes
    • Compressed 5.2 GB LD data into 300 MB (17x)
    • Quickly and randomly access subsets of data
    • Made use of standard features and a data viewer (HDFView)
    Only had to build the model and data importer
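
Two of the benefits above, compression and quick random access to subsets, work together when a large array is stored as independently compressed chunks. The sketch below imitates that idea with `zlib` from the Python standard library; it is not the HDF5 library and not the pilot's code. The matrix size, the chunk height `CHUNK_ROWS`, and the `toy_ld` values are assumptions made for illustration, so the ratio it prints says nothing about the 17x figure reported for the real LD data.

```python
import zlib

CHUNK_ROWS = 64  # hypothetical chunk height: rows compressed together as one unit

def toy_ld(r, c):
    """Toy stand-in for an LD score (0-255) that decays with marker distance."""
    return max(0, 255 - abs(r - c))

def build_chunks(n):
    """Compress an n x n matrix as independent blocks of CHUNK_ROWS rows."""
    chunks = []
    for start in range(0, n, CHUNK_ROWS):
        stop = min(start + CHUNK_ROWS, n)
        raw = bytes(toy_ld(r, c) for r in range(start, stop) for c in range(n))
        chunks.append(zlib.compress(raw))
    return chunks

def read_row(chunks, n, row):
    """Decompress only the chunk that holds `row`, then slice that row out."""
    raw = zlib.decompress(chunks[row // CHUNK_ROWS])
    offset = (row % CHUNK_ROWS) * n
    return list(raw[offset:offset + n])

N = 512
chunks = build_chunks(N)
raw_bytes = N * N  # one byte per value in this toy layout
compressed_bytes = sum(len(c) for c in chunks)
row_300 = read_row(chunks, N, 300)
print(f"{len(chunks)} chunks, compression ratio {raw_bytes / compressed_bytes:.1f}x")
```

Reading row 300 touches one of the eight chunks, so the cost of a subset read depends on the chunk size rather than on the size of the whole array.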
  • 14. Phase II
    • Primary Data Analysis
      • Models for storing and accessing primary data
      • Implement and test models, develop compression methods
      • Create research tools to access and work with the data
    • Secondary Data Analysis
      • Models for storing common data structures (assembly graphs, density plots, variants)
      • APIs to work with programs, enable out-of-core processing
      • Develop research level applications utilizing HDFView, current and emerging genome browsers
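
The "out-of-core processing" goal above means computing over data that stays on disk instead of loading it all into RAM. A minimal sketch, assuming a hypothetical binary file with one uint32 read start position per aligned read: it streams the file in batches and builds a read density table per fixed window. The layout, file name, and function names are illustrative and are not part of any BioHDF API.

```python
import os
import struct
import tempfile
from collections import Counter

# Hypothetical on-disk layout: one little-endian uint32 start position per read.
POSITION = struct.Struct("<I")

def iter_positions(path, batch=4096):
    """Yield read start positions, holding at most `batch` records in memory."""
    with open(path, "rb") as f:
        while True:
            block = f.read(batch * POSITION.size)
            if not block:
                break
            for (pos,) in POSITION.iter_unpack(block):
                yield pos

def read_density(path, window):
    """Count reads per fixed-size window of the reference sequence."""
    density = Counter()
    for pos in iter_positions(path):
        density[pos // window] += 1
    return density

# Toy data: 10,000 reads spread evenly over a 5,000-base reference.
path = os.path.join(tempfile.mkdtemp(), "positions.bin")
with open(path, "wb") as f:
    for i in range(10000):
        f.write(POSITION.pack((i * 37) % 5000))

density = read_density(path, window=1000)
print(sorted(density.items()))
```

Only the small density table and one batch of positions are held in memory at a time, so the same loop works whether the file holds ten thousand reads or many more.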
  • 15. Collaborations
    • Planned
      • Software: SRF working group (A. Siddiqui), AMOS project (M. Pop), Assembly formats (G. Marth), Consed (D. Gordon)
      • Applications and data: University of Washington, University of Florida, Johns Hopkins University, Applied Biosystems
    • Emerging
      • Additional Sequencing Vendors, Microsoft Research, Intel, Institute for Systems Biology
    • Seeking
      • Algorithm developers
      • Application developers
      • Frameworks, Bio*
      • Data sets
  • 16. Summary
    • Data challenges for Next Gen sequencing
      • Manage high volumes of data
      • Workflow complexity
      • Computational performance
    • BioHDF will be built on existing, available, and proven HDF5 technology
    • Geospiza and The HDF Group are seeking collaborations
    • Funding - NIH STTR 1R41HG003792-02
    • Interested? Contact