Thursday, February 23, 12
Evolving an Analytical Data Platform
          With Applications to Medical Data


          Jeff Hammerbacher
          Chief Scientist, Cloudera
          February 23, 2012



Presentation Outline
         ▪   0. Context
         ▪   1. Philosophy
         ▪   2. Platform
         ▪   3. Applications




0. Context




Context
         About me
         ▪   Mathematics at Harvard
         ▪   Quant at Bear Stearns
         ▪   Manager, Data at Facebook
         ▪   Founder and Chief Scientist at Cloudera
         ▪   Director at Sage Foundation
         ▪   Teach “Introduction to Data Science” at Berkeley




Context
         About Cloudera
         ▪   Founded in 2008
         ▪   Headquarters in Palo Alto
         ▪   185 employees
         ▪   Software
             ▪   CDH
             ▪   Cloudera Manager
         ▪   Training
             ▪   Cloudera University



Context
         What I care about

            1) Open source software for data management and analysis
            2) Teaching the world to use this software effectively
            3) Using this software to effect positive change in the world




1. Philosophy




Philosophy
         ▪   The true challenges in the task of data mining
             ▪   Creating a data set with the relevant and accurate information
             ▪   Determining the appropriate analysis techniques




                                      Adapted from “Exploratory Data Mining and Data Cleaning” by Tamraparni Dasu and Ted Johnson




Philosophy
         Creating a data set
         ▪   Store all of your data in one place
         ▪   Data first, questions later
         ▪   Store first, structure later
         ▪   Keep raw data forever
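
A minimal sketch of "store first, structure later", assuming raw events are kept as newline-delimited text and only given structure when someone asks a question. The sample records, field names, and the helpers `parse_event` and `clicks_per_user` are hypothetical and do not come from the talk.

```python
import json

# Raw records are kept exactly as received; nothing is validated on write.
# (Hypothetical sample data, for illustration only.)
RAW_LOG = [
    '{"user": "a", "action": "view", "ms": 120}',
    '{"user": "b", "action": "click"}',
    'not json at all',
    '{"user": "a", "action": "click", "ms": 45}',
]


def parse_event(line):
    """Apply structure at read time; return None for lines that do not parse."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None


def clicks_per_user(raw_lines):
    """A question asked later: how many click events did each user produce?"""
    counts = {}
    for line in raw_lines:
        event = parse_event(line)
        if event is None or event.get("action") != "click":
            continue
        user = event.get("user")
        if user is None:
            continue
        counts[user] = counts.get(user, 0) + 1
    return counts
```

Because the raw lines are never rewritten, a different question (or a better parser) can be run over the same data later.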




Philosophy
         Choosing an analysis technique
         ▪   Enable everyone to party on the data
             ▪   Developers
             ▪   Analysts
             ▪   Business users




Philosophy
         ▪   We have to produce tools to support the whole research cycle
             ▪   data capture
             ▪   data curation
             ▪   data analysis
             ▪   data visualization




                                                       Adapted from “The Fourth Paradigm” by Jim Gray




[Diagram. Labels: Application Requests; Application Database; ETL (between the database and the warehouse); Data Warehouse; Business Intelligence; Analytics]

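[Diagram. Labels: Application Requests; Application Database; Hadoop + Hive (between the database and the warehouse); Data Warehouse; Business Intelligence and Analytics (shown twice, once under Hadoop + Hive and once under the Data Warehouse)]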




2. Platform




Platform
         Substrate
         ▪   Commodity servers
             ▪   Open Compute
         ▪   Open source operating system
             ▪   Linux
         ▪   Open source configuration management
             ▪   Puppet, Chef
         ▪   Coordination service
             ▪   ZooKeeper



Platform
         Storage
         ▪   Distributed schema-less storage
             ▪   HDFS
         ▪   Append-only table storage and metadata
             ▪   Hive
         ▪   Mutable table storage and metadata
             ▪   HBase




Platform
         Compute
         ▪   Cluster resource management
             ▪   YARN
         ▪   Processing frameworks
             ▪   MapReduce, MPI
         ▪   High-level interfaces
             ▪   Crunch, PigLatin, HiveQL, Oozie
         ▪   Libraries
             ▪   DataFu, Mahout



Platform
         Integration
         ▪   Data access
             ▪   FUSE
             ▪   ODBC/JDBC
         ▪   Data ingest
             ▪   Sqoop
             ▪   Flume
         ▪   User interface
             ▪   Hue



3. Applications




Applications
         FDA
         ▪   Phase IV/post-market analysis of drug safety
         ▪   Find unsuspected adverse drug events (ADEs)
         ▪   Adverse Event Reporting System (AERS) data is available online
         ▪   Used Pig to identify novel 3-drug combinations
         ▪   No complex algorithms required
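
The slide says Pig was used and that no complex algorithms were required. A minimal Python sketch of that style of analysis, assuming each report lists the drugs taken and the adverse events observed, is a plain co-occurrence count over 3-drug combinations. The report layout and sample values below are invented for illustration; they are not the AERS schema, and this is not the original Pig script.

```python
from collections import Counter
from itertools import combinations

# Hypothetical reports: (set of drugs taken, set of adverse events reported).
REPORTS = [
    ({"drugA", "drugB", "drugC"}, {"rash"}),
    ({"drugA", "drugB", "drugC", "drugD"}, {"rash", "nausea"}),
    ({"drugA", "drugB"}, {"nausea"}),
    ({"drugB", "drugC", "drugD"}, {"headache"}),
]


def triple_event_counts(reports):
    """Count (drug triple, event) co-occurrences across all reports."""
    counts = Counter()
    for drugs, events in reports:
        # Sort so each combination has one canonical tuple form.
        for triple in combinations(sorted(drugs), 3):
            for event in events:
                counts[(triple, event)] += 1
    return counts
```

Raw counts like these are only a starting point; judging whether a combination is unexpected would need a comparison against background rates, which is not shown here.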




Applications
                            HIV Drug Interactions




Applications
         Michael Schatz
         ▪   Contrail: de novo assembly of large genomes from short reads
         ▪   CloudBurst: parallel read mapping
         ▪   Crossbow: find SNPs from short read data
         ▪   Genome indexing: suffix array, BWT
         ▪   Work done at Maryland and Cold Spring Harbor
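
For readers unfamiliar with the two index structures named above, here is a naive single-machine sketch of a suffix array and the Burrows-Wheeler transform (BWT) derived from it. It sorts whole suffixes, so it only suits short strings; it is not the distributed construction used in the work cited on the slide, and the function names are illustrative.

```python
def suffix_array(text):
    """Return the start positions of the suffixes of text in sorted order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform of text.

    Assumes the sentinel does not occur in text and sorts before every
    character of text (true for "$" against letters in ASCII).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    sa = suffix_array(s)
    # The BWT character for each sorted suffix is the character preceding it;
    # index -1 wraps around to the sentinel for the suffix starting at 0.
    return "".join(s[i - 1] for i in sa)
```

For example, `suffix_array("banana")` gives `[5, 3, 1, 0, 4, 2]` and `bwt("banana")` gives `"annb$aa"`.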




Applications
         SeqWare Query Engine
         ▪   Load and query variants over thousands of genomes
         ▪   Handles a variety of variants and annotations
         ▪   Proof of concept using the U87MG genome
         ▪   Runs on HBase
         ▪   Open source
         ▪   Work done at UCLA




Applications
         Nephele
         ▪   Genotyping without multiple sequence alignment
         ▪   Represent sequence with complete composition vector
         ▪   Use affinity propagation clustering to group sequences
         ▪   Code is open source
         ▪   Work done at MITRE
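
To give a feel for alignment-free sequence representation, the sketch below builds a plain k-mer frequency vector for a DNA string. This is a simplification offered as an assumption: the representation Nephele actually uses may differ from a single fixed-k frequency vector, and the clustering step is not shown. The function name and default alphabet are illustrative.

```python
from itertools import product


def kmer_frequency_vector(seq, k, alphabet="ACGT"):
    """Return relative frequencies of every length-k word over alphabet.

    Words are ordered as itertools.product generates them from alphabet.
    """
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing other symbols, e.g. N
            counts[word] += 1
            total += 1
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]
```

Two sequences can then be compared by any vector distance without first aligning them, which is the point of the approach.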




Applications
         Hadoop-GIS
         ▪   High performance queries for analytical pathology imaging
         ▪   Spatial query engine RESQUE
         ▪   Augments Hive with spatial query capabilities
         ▪   Will support analytical pathology imaging guided diagnosis
         ▪   Work done at Emory University




Applications
         Explorys
         ▪   “Medical informatics platform”
         ▪   Search and analyze
             ▪   patient populations
             ▪   treatment protocols
             ▪   clinical outcomes
         ▪   Explorys engineer Doug Meil is an HBase committer




Applications
         NextBio
         ▪   “Integrative Genomics”
         ▪   Platform for integrating public and private information
         ▪   Literature search, automated annotation
         ▪   Sequence-specific data management components
         ▪   Pipeline powered by Hadoop




Applications
         IBM Watson
         ▪   Automated diagnosis




Applications
         Microsoft Research
         ▪   Cyberchondria
             ▪   Understanding how web content is navigated
             ▪   Uses search logs for analysis




(c) 2012 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0




Why you might not care
         ▪   Long-lived organizations with multiple departments
         ▪   Data sources primarily internal to the organization
         ▪   Reporting and ad hoc query workloads as important as analysis
         ▪   CDH strengths
             ▪   data capture
             ▪   data curation
         ▪   CDH weaknesses
             ▪   interactive query performance
             ▪   model fitting (optimization)
             ▪   linear algebra (arrays are not a primitive type)