Transcript

  • 1. BioHDF: Open binary file formats for large scale data management. Todd Smith (1), Christian Chilan (2), Rishi Sinha (3), Elena Pourmal (2), Mike Folk (2). 1. Geospiza, Inc., 100 West Harrison St., North Tower #330, Seattle, WA 98119. 2. The HDF Group, 1901 S. First St., Suite C-2, Champaign, IL 61820. 3. Microsoft Corporation, Redmond, WA.
  • 2. Overview
    • Driver: Next Generation DNA Sequencing
    • What is HDF5
    • BioHDF Project
    Laboratory and data workflow management for genetic analysis
  • 3. Next Generation DNA Sequencing
    • Next Gen Sequencing platforms produce ~1,500x more data than CE (Sanger)
    • A single Next Gen instrument can produce 20 times more data in a single run than a day’s operation of a genome center with 100 CE instruments
    • In Sequence quotes - July 2007
      • Toby Bloom, Broad Institute “Next-gen sequencing impacts all aspects of informatics.”
      • Phil Butcher, Sanger “The best way to move terabytes of data is still disk.” Want to process data closer to the machine.
      • Eugen Clark, Harvard “[community] needs to start talks about data retention.”
      • Kelly Carpenter, Wash U “these sequencers are going to totally screw you.”
    • Nature Methods July 2008: “Byte-ing off more than you can chew”
  • 4. Three Phases of Data Production
    • Primary Data Analysis: images to bases
      • Sequences + quality values
      • Run quality
    • Secondary Data Analysis (Secondary Data Production)
      • Ref Seq + aligner; de novo assembly => assembler
      • Gene lists, read density, variant list, contigs + annotation
      • Sample, run quality
    • Tertiary Data Analysis
      • One or more data sets
      • Differential expression, methylation sites, gene association, genomic structure
      • Experiment, science
  • 5. Proliferation of files, formats, formatters
    • Example: MAQ - http://maq.sourceforge.net
    • Secondary Analysis for: tag profiling, ChIP-Seq, resequencing
    • Additional files and formats needed for tertiary analysis
  • 6. Challenges
    • Complexity
      • Numerous programs, scripts, files, and formats
      • Redundant data
    • Computational overhead
      • All data typically reside in RAM during computation
      • Output and input formats differ, so data must be frequently reprocessed
    • Space, time, and bandwidth efficiency
      • Increased storage
      • Computation times increase disproportionately
      • Large data sets must be transported for processing
  • 7. What Needs to Be Done
    • Reduce complexity
      • Decrease numbers and kinds of files
      • Eliminate data duplication (performance)
      • API and tools for data access
    • Improve resource utilization
      • Reduce redundancy, work with compressed data
      • Improve program access to data, random reads and writes, map disk to computer memory
      • Parallel I/O, Remote access
      • Facilitate data sharing, preservation
    • Adopt a standard from other data intensive fields
      • Benefit from history and experience
      • Benefit from refinement
      • Build on a proven, widely accepted platform
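The "random reads and writes, map disk to computer memory" item on slide 7 can be illustrated with a minimal sketch. It uses Python's standard `mmap` and `struct` modules as a stand-in, not HDF5 itself, and the record layout (`RECORD`), the file name, and all function names are hypothetical and do not come from the slides.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width record: uint32 position + uint8 quality (5 bytes).
RECORD = struct.Struct("<IB")


def write_records(path, records):
    """Write records back to back, so record i starts at byte i * RECORD.size."""
    with open(path, "wb") as fh:
        for position, quality in records:
            fh.write(RECORD.pack(position, quality))


def read_record(path, index):
    """Random read: map the file and unpack one record at its byte offset."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return RECORD.unpack_from(view, index * RECORD.size)


def update_quality(path, index, quality):
    """Random write: overwrite one record in place through a writable map."""
    with open(path, "r+b") as fh:
        with mmap.mmap(fh.fileno(), 0) as view:
            offset = index * RECORD.size
            position, _ = RECORD.unpack_from(view, offset)
            RECORD.pack_into(view, offset, position, quality)
            view.flush()


def demo():
    """Write 1,000 records, change one, and read two neighbours back."""
    path = os.path.join(tempfile.mkdtemp(), "reads.bin")
    write_records(path, [(100 * i, 30) for i in range(1000)])
    update_quality(path, 500, 12)
    return read_record(path, 499), read_record(path, 500)
```

Only the bytes of the requested record are touched; the rest of the file is never parsed. A format library would add typed datasets, compression, and portability on top of this basic idea.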
  • 8. HDF5: Single Platform / Multiple Uses
    • A file format for managing any kind of data
    • Software system to manage data in the format
    • Designed for high volume or complex data
    • Designed for every size and type of system
    • Open format and software
    • One library, with
      • Options to adapt I/O and storage to data needs
      • Layers on top and below
    • Ability to interact well with other technologies
    • Attention to past, present, future compatibility
  • 9. HDF5 - 20 yrs in Physical Sciences
    • Gain multiple “working with data” efficiencies: slice, recombine …
    • Arrays, sets, organizations, compression already there
    • Server and remote access
    • Quick access to data via HDFView, MATLAB, other tools
    • Widely used - MATLAB, Mathematica, IDL, NASA-EOS
    Significantly reduce programming efforts needed to develop and maintain formats and software to explore scientific questions in your data
  • 10. HDF Software
    • Tools, Applications, Libraries (e.g. BioHDF)
    • HDF I/O Library
    • HDF File
  • 11. BioHDF
    • SBIR Funded Project
    • Phase I - Feasibility for genotyping
    • Phase II - Open source technologies to support computation in Next Gen DNA sequencing applications
      • Support diverse types of data from multiple sequencing technologies by extending the BioHDF data model
      • Develop prototype BioHDF software applications that support common activities utilizing DNA
      • Develop methods for incorporating BioHDF into enterprise applications for clinical research and diagnostics
  • 12. Phase I - Pilot Project
    • Combined view of HapMap, chromosome LD, PolyPhred details
    • BioHDF file structure
      • A 53,000 x 53,000 LD array
      • 53,000 row, 100+ column HapMap table
      • PolyPhred data table, graphs, and chromats
  • 13. Benefits
    • Separated the model, implementation, and view of the data
    • Multiple levels of data in a single view
    • HapMap: convert, display, and scroll through 100,000s of genotypes
    • Compressed 5.2 GB LD data into 300 MB (17x)
    • Quickly and randomly access subsets of data
    • Made use of standard features and a data viewer (HDFView)
    Only had to build the model and data importer
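The compression figure on slide 13 works out to 5.2 GB / 300 MB ≈ 17. One way a large array can be both compressed and randomly accessible is to compress it tile by tile, so that reading one value decompresses only one tile. The sketch below shows that idea with the standard `zlib` and `array` modules. It is an illustration under stated assumptions, not the BioHDF implementation: the tile size, the toy LD function, and all names are hypothetical.

```python
import zlib
from array import array

TILE = 64  # hypothetical tile edge; real chunk shapes are tuned per data set


def build_tiles(n, value_at):
    """Split an n x n matrix into tiles and compress each tile separately."""
    tiles = {}
    for r0 in range(0, n, TILE):
        for c0 in range(0, n, TILE):
            block = array("f", (
                value_at(r, c)
                for r in range(r0, min(r0 + TILE, n))
                for c in range(c0, min(c0 + TILE, n))
            ))
            tiles[(r0 // TILE, c0 // TILE)] = zlib.compress(block.tobytes())
    return tiles


def read_value(tiles, n, r, c):
    """Decompress only the tile that holds (r, c) and return that element."""
    tr, tc = r // TILE, c // TILE
    width = min(TILE, n - tc * TILE)  # edge tiles may be narrower than TILE
    block = array("f")
    block.frombytes(zlib.decompress(tiles[(tr, tc)]))
    return block[(r % TILE) * width + (c % TILE)]


def toy_ld(r, c):
    """Toy stand-in for an LD score: 1.0 on the diagonal, 0.0 beyond 8 markers."""
    return max(0.0, 1.0 - abs(r - c) / 8.0)


def stored_vs_raw(tiles, n):
    """Return (compressed bytes, uncompressed bytes) for the whole matrix."""
    item = array("f").itemsize
    return sum(len(t) for t in tiles.values()), n * n * item
```

With `tiles = build_tiles(200, toy_ld)`, a call such as `read_value(tiles, 200, 10, 12)` touches a single compressed tile. Because the toy matrix is mostly zeros away from the diagonal, the stored size is far below the raw size; the ratio for real LD data depends on the data.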
  • 14. Phase II
    • Primary Data Analysis
      • Models for storing and accessing primary data
      • Implement and test models, develop compression methods
      • Create research tools to access and work with the data
    • Secondary Data Analysis
      • Models for storing common data structures (assembly graphs, density plots, variants)
      • APIs to work with programs, enable out-of-core processing
      • Develop research level applications utilizing HDFView, current and emerging genome browsers
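"Out-of-core processing" on slide 14 means computing over data that stays on disk, with only a small batch in memory at any time. The sketch below computes a read density (one of the secondary-analysis outputs named in the slides) by streaming alignment start positions from a binary file in fixed-size batches. The file layout, the batch size, and the function names are assumptions made for illustration; a BioHDF API would look different.

```python
import os
import tempfile
from array import array

BATCH = 4096  # positions held in memory at once; hypothetical, tune to RAM


def write_positions(path, positions):
    """Store alignment start positions as native unsigned ints, back to back."""
    with open(path, "wb") as fh:
        array("I", positions).tofile(fh)


def read_density(path, bin_size, n_bins):
    """Count reads per genomic bin while holding at most BATCH positions in RAM."""
    counts = [0] * n_bins
    item = array("I").itemsize
    with open(path, "rb") as fh:
        while True:
            raw = fh.read(BATCH * item)
            if not raw:
                break
            batch = array("I")
            batch.frombytes(raw)
            for pos in batch:
                # Positions past the last bin are clamped into the last bin.
                counts[min(pos // bin_size, n_bins - 1)] += 1
    return counts


def demo():
    """Write 10,000 synthetic positions and bin them into 10 bins of 100 bases."""
    path = os.path.join(tempfile.mkdtemp(), "positions.bin")
    write_positions(path, [(i * 7) % 1000 for i in range(10000)])
    return read_density(path, bin_size=100, n_bins=10)
```

Memory use is bounded by `BATCH` and the number of bins, not by the number of reads, which is the property that matters when a run does not fit in RAM.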
  • 15. Collaborations
    • Planned
      • Software: SRF working group (A. Siddiqui), AMOS project (M. Pop), Assembly formats (G. Marth), Consed (D. Gordon)
      • Applications and data: University of Washington, University of Florida, Johns Hopkins University, Applied Biosystems
    • Emerging
      • Additional Sequencing Vendors, Microsoft Research, Intel, Institute for Systems Biology
    • Seeking
      • Algorithm developers
      • Application developers
      • Frameworks, Bio*
      • Data sets
  • 16. Summary
    • Data challenges for Next Gen sequencing
      • Manage high volumes of data
      • Workflow complexity
      • Computational performance
    • BioHDF will be built on existing, available, and proven HDF5 technology
    • Geospiza and The HDF Group are seeking collaborations
    • Funding - NIH STTR 1R41HG003792-02
    • Interested? Contact todd@geospiza.com