Thursday, February 23, 12
Evolving an Analytical Data Platform
          With Applications to Medical Data


          Jeff Hammerbacher
          Chief Scientist, Cloudera
          February 23, 2012



Presentation Outline
         ▪   0. Context
         ▪   1. Philosophy
         ▪   2. Platform
         ▪   3. Applications




0. Context




Context
         About me
         ▪   Mathematics at Harvard
         ▪   Quant at Bear Stearns
         ▪   Manager, Data at Facebook
         ▪   Founder and Chief Scientist at Cloudera
         ▪   Director at Sage Foundation
         ▪   Teach “Introduction to Data Science” at Berkeley




Context
         About Cloudera
         ▪   Founded in 2008
         ▪   Headquarters in Palo Alto
         ▪   185 employees
         ▪   Software
             ▪   CDH
             ▪   Cloudera Manager
         ▪   Training
             ▪   Cloudera University



Context
         What I care about

            1) Open source software for data management and analysis
            2) Teaching the world to use this software effectively
            3) Using this software to effect positive change in the world




1. Philosophy




Philosophy
         ▪   The true challenges in the task of data mining
             ▪   Creating a data set with the relevant and accurate information
             ▪   Determining the appropriate analysis techniques




                                      Adapted from “Exploratory Data Mining and Data Cleaning” by Tamraparni Dasu and Ted Johnson




Philosophy
         Creating a data set
         ▪   Store all of your data in one place
         ▪   Data first, questions later
         ▪   Store first, structure later
         ▪   Keep raw data forever
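
One way to read "store first, structure later" is as schema-on-read: raw records are kept exactly as they arrived, and a schema is applied only when a question is asked. The sketch below is a minimal illustration under that assumption; `RAW_LINES`, `read_with_schema`, and the field names are hypothetical and do not come from the talk.

```python
import json

# Hypothetical raw event log: lines are stored exactly as they arrived.
RAW_LINES = [
    '{"user": "a", "action": "view", "ms": 120}',
    '{"user": "b", "action": "click"}',
    'not json at all',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time; the raw lines are never modified."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable lines stay in raw storage for a later pass
        if not isinstance(record, dict):
            continue  # valid JSON, but not a record with named fields
        # Missing fields become None instead of failing the whole read.
        yield {field: record.get(field) for field in fields}


# A question asked later chooses its own structure over the same raw data.
rows = list(read_with_schema(RAW_LINES, ["user", "ms"]))
print(rows)
```

Because the raw lines are untouched, a different question can later project different fields, or retry the unparseable line with a better parser.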




Philosophy
         Choosing an analysis technique
         ▪   Enable everyone to party on the data
             ▪   Developers
             ▪   Analysts
             ▪   Business users




Philosophy
         ▪   We have to produce tools to support the whole research cycle
             ▪   data capture
             ▪   data curation
             ▪   data analysis
             ▪   data visualization




                                                       Adapted from “The Fourth Paradigm” by Jim Gray




         [Diagram: Application Requests → Application Database → ETL → Data Warehouse → Business Intelligence / Analytics]




         [Diagram: Application Requests → Application Database → Hadoop + Hive → Data Warehouse → Business Intelligence / Analytics; Business Intelligence / Analytics also attached directly to Hadoop + Hive]




2. Platform




Platform
         Substrate
         ▪   Commodity servers
             ▪   Open Compute
         ▪   Open source operating system
             ▪   Linux
         ▪   Open source configuration management
             ▪   Puppet, Chef
         ▪   Coordination service
             ▪   ZooKeeper



Platform
         Storage
         ▪   Distributed schema-less storage
             ▪   HDFS
         ▪   Append-only table storage and metadata
             ▪   Hive
         ▪   Mutable table storage and metadata
             ▪   HBase




Platform
         Compute
         ▪   Cluster resource management
             ▪   YARN
         ▪   Processing frameworks
             ▪   MapReduce, MPI
         ▪   High-level interfaces
             ▪   Crunch, Pig Latin, HiveQL, Oozie
         ▪   Libraries
             ▪   DataFu, Mahout



Platform
         Integration
         ▪   Data access
             ▪   FUSE
             ▪   ODBC/JDBC
         ▪   Data ingest
             ▪   Sqoop
             ▪   Flume
         ▪   User interface
             ▪   Hue



3. Applications




Applications
         FDA
         ▪   Phase IV/post-market analysis of drug safety
         ▪   Find unsuspected adverse drug events (ADEs)
         ▪   Adverse Event Reporting System (AERS) data is available online
         ▪   Used Pig to identify novel 3-drug combinations
         ▪   No complex algorithms required
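
The point that no complex algorithms are required can be illustrated with plain counting. The slide says the work used Pig; the sketch below swaps in Python and only shows the general idea of tallying how often each 3-drug combination appears together in a report. The report data, the drug names, and `count_drug_triples` are hypothetical placeholders, not the AERS schema or the original analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical adverse event reports: each entry lists the drugs named in one report.
REPORTS = [
    ["drug_a", "drug_b", "drug_c"],
    ["drug_b", "drug_a", "drug_c", "drug_d"],
    ["drug_a", "drug_d"],
]


def count_drug_triples(reports):
    """Count how many reports mention each unordered 3-drug combination."""
    counts = Counter()
    for drugs in reports:
        # Deduplicate and sort so each combination has one canonical key.
        for triple in combinations(sorted(set(drugs)), 3):
            counts[triple] += 1
    return counts


counts = count_drug_triples(REPORTS)
for triple, n in counts.most_common(1):
    print(triple, n)
```

A real analysis would also have to compare these counts against what is already known about each combination before calling one novel; that step is outside this sketch.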




Applications
                            HIV Drug Interactions




Applications
         Michael Schatz
         ▪   Contrail: de novo assembly of large genomes from short reads
         ▪   CloudBurst: parallel read mapping
         ▪   Crossbow: find SNPs from short read data
         ▪   Genome indexing: suffix array, BWT
         ▪   Work done at Maryland and Cold Spring Harbor
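
For readers unfamiliar with the two index structures named above, here is a naive sketch of a suffix array and the Burrows-Wheeler transform (BWT) derived from it. It sorts whole suffixes, which is far too slow for a genome, and it is not how the tools on this slide build their indexes; the function names and the `$` sentinel are assumptions made for illustration.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform: the character preceding each sorted suffix."""
    # The sentinel must sort before every other character and not occur in text.
    s = text + sentinel
    # s[i - 1] wraps to the sentinel when i == 0, which is the intended behavior.
    return "".join(s[i - 1] for i in suffix_array(s))


print(suffix_array("banana$"))  # [6, 5, 3, 1, 0, 4, 2]
print(bwt("banana"))            # annb$aa
```

The BWT groups identical characters that share a following context, which is what makes it useful for compressed indexes that support short-read lookup.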




Applications
         SeqWare Query Engine
         ▪   Load and query variants over thousands of genomes
         ▪   Handles a variety of variants and annotations
         ▪   Proof of concept using the U87MG genome
         ▪   Runs on HBase
         ▪   Open source
         ▪   Work done at UCLA




Applications
         Nephele
         ▪   Genotyping without multiple sequence alignment
         ▪   Represent sequence with complete composition vector
         ▪   Use affinity propagation clustering to group sequences
         ▪   Code is open source
         ▪   Work done at MITRE




Applications
         Hadoop-GIS
         ▪   High performance queries for analytical pathology imaging
         ▪   Spatial query engine RESQUE
         ▪   Augments Hive with spatial query capabilities
         ▪   Will support analytical pathology imaging guided diagnosis
         ▪   Work done at Emory University




Applications
         Explorys
         ▪   “Medical informatics platform”
         ▪   Search and analyze
             ▪   patient populations
             ▪   treatment protocols
             ▪   clinical outcomes
         ▪   Explorys engineer Doug Meil is an HBase committer




Applications
         NextBio
         ▪   “Integrative Genomics”
         ▪   Platform for integrating public and private information
         ▪   Literature search, automated annotation
         ▪   Sequence-specific data management components
         ▪   Pipeline powered by Hadoop




Applications
         IBM Watson
         ▪   Automated diagnosis




Applications
         Microsoft Research
         ▪   Cyberchondria
             ▪   Understanding how web content is navigated
             ▪   Uses search logs for analysis




(c) 2012 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.



