This document summarizes Jeff Hammerbacher's presentation on evolving an analytical data platform, with applications to medical data. It covers his philosophy of data analysis, the Cloudera platform, and medical applications built on the Hadoop ecosystem, such as analyzing adverse drug events with Pig and assembling and indexing genomes. It also briefly mentions other efforts applying these techniques, including Explorys, NextBio, and IBM Watson.
2. Evolving an Analytical Data Platform
With Applications to Medical Data
Jeff Hammerbacher
Chief Scientist, Cloudera
February 23, 2012
Thursday, February 23, 12
5. Context
About me
▪ Mathematics at Harvard
▪ Quant at Bear Stearns
▪ Manager, Data at Facebook
▪ Founder and Chief Scientist at Cloudera
▪ Director at Sage Foundation
▪ Teach “Introduction to Data Science” at Berkeley
6. Context
About Cloudera
▪ Founded in 2008
▪ Headquarters in Palo Alto
▪ 185 employees
▪ Software
▪ CDH
▪ Cloudera Manager
▪ Training
▪ Cloudera University
7. Context
What I care about
1) Open source software for data management and analysis
2) Teaching the world to use this software effectively
3) Using this software to effect positive change in the world
9. Philosophy
▪ The true challenges in the task of data mining
▪ Creating a data set with the relevant and accurate information
▪ Determining the appropriate analysis techniques
Adapted from “Exploratory Data Mining and Data Cleaning” by Tamraparni Dasu and Ted Johnson
10. Philosophy
Creating a data set
▪ Store all of your data in one place
▪ Data first, questions later
▪ Store first, structure later
▪ Keep raw data forever
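A minimal sketch of "store first, structure later": raw records are kept untouched and a schema is applied only when a question is asked. The record layout and field names (`patient`, `drug`, `event`) are hypothetical and not taken from the presentation.

```python
import json

# Raw records are stored exactly as they arrived, one JSON document per line.
# The field names below are hypothetical.
RAW_LINES = [
    '{"patient": "p1", "drug": "aspirin", "dose_mg": 81}',
    '{"patient": "p2", "drug": "warfarin"}',
    '{"patient": "p1", "event": "nausea", "ts": "2012-02-23"}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time.

    Each raw record is projected onto `fields`; missing values become None
    instead of causing the record to be rejected at load time.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Two different questions, two different schemas, same raw data.
prescriptions = [
    r for r in read_with_schema(RAW_LINES, ["patient", "drug"]) if r["drug"]
]
events = [
    r for r in read_with_schema(RAW_LINES, ["patient", "event"]) if r["event"]
]
```

Because the raw lines are never rewritten, a later question that needs `dose_mg` or `ts` can be answered by reading the same data with a different field list.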
11. Philosophy
Choosing an analysis technique
▪ Enable everyone to party on the data
▪ Developers
▪ Analysts
▪ Business users
12. Philosophy
▪ We have to produce tools to support the whole research cycle
▪ data capture
▪ data curation
▪ data analysis
▪ data visualization
Adapted from “The Fourth Paradigm” by Jim Gray
13. [Diagram labels: Requests, Application, Database, ETL, Data Warehouse, Business Intelligence, Analytics]
14. [Diagram labels: Requests, Application, Database, Hadoop + Hive, Data Warehouse, Business Intelligence and Analytics (shown twice)]
16. Platform
Substrate
▪ Commodity servers
▪ Open Compute
▪ Open source operating system
▪ Linux
▪ Open source configuration management
▪ Puppet, Chef
▪ Coordination service
▪ ZooKeeper
17. Platform
Storage
▪ Distributed schema-less storage
▪ HDFS
▪ Append-only table storage and metadata
▪ Hive
▪ Mutable table storage and metadata
▪ HBase
21. Applications
FDA
▪ Phase IV/post-market analysis of drug safety
▪ Find unsuspected adverse drug events (ADEs)
▪ Adverse Event Reporting System (AERS) data is available online
▪ Used Pig to identify novel 3-drug combinations
▪ No complex algorithms required
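To illustrate why no complex algorithms are required, the search for 3-drug combinations can be framed as a group-and-count. The sketch below is in Python rather than Pig and is not the FDA's actual script; the report layout, drug names, and the threshold of 2 are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical adverse event reports: (report_id, drugs taken, events observed).
REPORTS = [
    ("r1", {"A", "B", "C"}, {"rash"}),
    ("r2", {"A", "B", "C", "D"}, {"rash"}),
    ("r3", {"A", "B"}, {"headache"}),
    ("r4", {"B", "C", "D"}, {"nausea"}),
]


def count_triples(reports):
    """Count how often each (3-drug combination, event) pair is reported."""
    counts = Counter()
    for _report_id, drugs, events in reports:
        # Sort so each combination has one canonical ordering.
        for triple in combinations(sorted(drugs), 3):
            for event in events:
                counts[(triple, event)] += 1
    return counts


counts = count_triples(REPORTS)

# Keep only pairs reported at least twice (an arbitrary illustrative cutoff).
frequent = {key: n for key, n in counts.items() if n >= 2}
```

In a distributed setting the same shape maps onto a flatten, group, and count over the report data; a real analysis would also need to compare counts against a baseline before calling a combination suspicious.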
22. Applications
HIV Drug Interactions
23. Applications
Michael Schatz
▪ Contrail: de novo assembly of large genomes from short reads
▪ CloudBurst: parallel read mapping
▪ Crossbow: find SNPs from short read data
▪ Genome indexing: suffix array, BWT
▪ Work done at Maryland and Cold Spring Harbor
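For readers unfamiliar with the genome indexing structures named above, here is a small single-machine sketch of a suffix array and the Burrows-Wheeler transform (BWT) derived from it. It uses a naive sort and is meant only to show the definitions; it is not the distributed construction from Schatz's work, and the `$` sentinel is a common convention assumed here.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of `text` in sorted order.

    Naive construction: sorts the suffixes directly, which is fine for
    short strings but far too slow for a real genome.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of `text`.

    A sentinel that sorts before every other character is appended, then
    the character preceding each sorted suffix is emitted.
    """
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    text = text + sentinel
    # text[i - 1] wraps to the sentinel when i == 0.
    return "".join(text[i - 1] for i in suffix_array(text))


sa = suffix_array("banana$")
transformed = bwt("banana")
```

For the string `banana`, the sorted suffixes start at positions 6, 5, 3, 1, 0, 4, 2 of `banana$`, and the transform is `annb$aa`.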
24. Applications
SeqWare Query Engine
▪ Load and query variants over thousands of genomes
▪ Handles a variety of variants and annotations
▪ Proof of concept using the U87MG genome
▪ Runs on HBase
▪ Open source
▪ Work done at UCLA
25. Applications
Nephele
▪ Genotyping without multiple sequence alignment
▪ Represent sequence with complete composition vector
▪ Use affinity propagation clustering to group sequences
▪ Code is open source
▪ Work done at MITRE
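The idea of comparing sequences without alignment can be sketched with k-mer frequencies. The code below is a simplified stand-in: it concatenates plain k-mer frequencies for k = 1..`max_k` and compares them with cosine similarity. It is an assumption-laden illustration, not Nephele's complete composition vector or its affinity propagation step, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def composition_vector(seq, max_k=2):
    """Concatenate k-mer frequency vectors for k = 1..max_k.

    Simplified illustration: uses raw frequencies only, with no
    background correction.
    """
    vector = []
    for k in range(1, max_k + 1):
        windows = max(len(seq) - k + 1, 1)
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        # Fixed k-mer order so vectors from different sequences line up.
        for kmer in product(ALPHABET, repeat=k):
            vector.append(counts["".join(kmer)] / windows)
    return vector


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


v1 = composition_vector("ACGTACGT")
v2 = composition_vector("ACGTACGA")
similarity = cosine(v1, v2)
```

A pairwise similarity matrix built this way is the kind of input a clustering method such as affinity propagation could consume to group sequences.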
26. Applications
Hadoop-GIS
▪ High performance queries for analytical pathology imaging
▪ Spatial query engine RESQUE
▪ Augments Hive with spatial query capabilities
▪ Will support analytical pathology imaging guided diagnosis
▪ Work done at Emory University
27. Applications
Explorys
▪ “Medical informatics platform”
▪ Search and analyze
▪ patient populations
▪ treatment protocols
▪ clinical outcomes
▪ Explorys engineer Doug Meil is an HBase committer
28. Applications
NextBio
▪ “Integrative Genomics”
▪ Platform for integrating public and private information
▪ Literature search, automated annotation
▪ Sequence-specific data management components
▪ Pipeline powered by Hadoop
29. Applications
IBM Watson
▪ Automated diagnosis
30. Applications
Microsoft Research
▪ Cyberchondria
▪ Understanding how web content is navigated
▪ Uses search logs for analysis
31. (c) 2012 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0
32. Why you might not care
▪ Long-lived organizations with multiple departments
▪ Data sources primarily internal to the organization
▪ Reporting and ad hoc query workloads as important as analysis
▪ CDH strengths
▪ data capture
▪ data curation
▪ CDH weaknesses
▪ interactive query performance
▪ model fitting (optimization)
▪ linear algebra (arrays are not a primitive type)