Hadoop ecosystem for life sciences
Uri Laserson
30 September 2013
About the speaker
• Currently “Data Scientist” at Cloudera
• PhD in Biomedical Engineering at
MIT/Harvard (2005-2012)
• Focused on next-generation DNA sequencing
technology in George Church’s lab
• Co-founded Good Start Genetics (2007-)
• First application of next-gen sequencing to
genetic carrier screening
• laserson@cloudera.com
Agenda
• Historical context
• Introduction to Hadoop ecosystem
• Genomics on Hadoop
• Other use cases in life sciences
Historical Context
Indexing the Web
• Web is Huge
• Hundreds of millions of pages in 1999
• How do you index it?
• Crawl all the pages
• Rank pages based on relevance metrics
• Build search index of keywords to pages
• Do it in real time!
Databases in 1999
1. Buy a really big machine
2. Install expensive DBMS on it
3. Point your workload at it
4. Hope it doesn’t fail
5. Ambitious: buy another big machine as backup
Database Limitations
• Didn’t scale horizontally
• High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking
• Complex analysis (PageRank)
• Unstructured data
Google does something different
• Designed their own storage and processing
infrastructure
• Google File System (GFS) and MapReduce (MR)
• Goals: KISS
• Cheap
• Scalable
• Reliable
• It worked!
• Powered Google Search for many years
• General framework for large-scale batch computation
tasks
• Still used internally at Google to this day
Google benevolent enough to publish
• GFS paper (2003), MapReduce paper (2004)
Birth of Hadoop at Yahoo!
• 2004-2006: Doug Cutting and Mike Cafarella
implement GFS/MR.
• 2006: Spun out as Apache Hadoop
• Named after Doug’s son’s yellow stuffed elephant
Industry strategy: Copy Google
Google → Open-source (function):
• GFS → HDFS (distributed file system)
• MapReduce → MapReduce (batch distributed data processing)
• Bigtable → HBase (distributed DB/key-value store)
• Protobuf/Stubby → Thrift or Avro (data serialization/RPC)
• Pregel → Giraph (distributed graph processing)
• Dremel/F1 → Cloudera Impala (scalable interactive SQL, MPP)
• FlumeJava → Crunch (abstracted data pipelines on Hadoop)
Overview of core technology
HDFS design assumptions
• Based on Google File System
• Files are large (GBs to TBs)
• Failures are common
• Massive scale means failures very likely
• Disk, node, or network failures
• Accesses are large and sequential
• Files are append-only
HDFS properties
• Fault-tolerant
• Gracefully responds to node/disk/network failures
• Horizontally scalable
• Low marginal cost
• High-bandwidth
[Diagram: HDFS storage distribution and MapReduce computation. An input file is split into five blocks (1-5); each block is replicated on three of the five nodes (Nodes A-E), so computation can run where the data lives.]
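The block layout in the diagram can be sketched as a toy placement policy in Python. The round-robin scheme and node names here are illustrative only, not HDFS's actual rack-aware placement algorithm:

```python
# Toy HDFS-style block placement: each block is copied to 3 of 5 nodes.
# A simplified round-robin stand-in for HDFS's rack-aware policy.

def place_blocks(num_blocks, nodes, replication=3):
    """Return {node: [block_ids]} with each block on `replication` nodes."""
    placement = {node: [] for node in nodes}
    for block in range(1, num_blocks + 1):
        for r in range(replication):
            node = nodes[(block + r) % len(nodes)]
            placement[node].append(block)
    return placement

nodes = ["A", "B", "C", "D", "E"]
placement = place_blocks(5, nodes)
# Every block lives on 3 distinct nodes, so any single node (or disk)
# can fail without losing data.
```

Because every block exists on three nodes, the scheduler can also choose among three locations when assigning a map task to run "near" its data.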
MapReduce
• Structured as
1. Embarrassingly parallel “map stage”
2. Cluster-wide distributed sort (“shuffle”)
3. Aggregation “reduce stage”
• Data-locality: process the data where it is stored
• Fault-tolerance: failed tasks automatically detected
and restarted
• Schema-on-read: data need not be stored conforming
to a rigid schema
WordCount example
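The three stages can be shown with WordCount in plain Python. This is a single-process simulation of map, shuffle/sort, and reduce, not actual Hadoop API code:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit (word, 1) for every word; embarrassingly parallel
    # across input splits in a real cluster.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def wordcount(lines):
    # Shuffle: the cluster-wide distributed sort groups identical keys.
    pairs = sorted(map_phase(lines), key=itemgetter(0))
    # Reduce: aggregate the counts for each word.
    return {word: sum(c for _, c in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

print(wordcount(["the cat sat", "the dog sat"]))
# → {'cat': 1, 'dog': 1, 'sat': 2, 'the': 2}
```

On Hadoop the same logic is split into a Mapper and Reducer class, and the framework handles the sort, task placement, and retries.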
HPC separates compute from storage
Storage infrastructure:
• Proprietary, distributed file system
• Expensive
Compute cluster:
• High-performance hardware
• Low failure rate
• Expensive
Connected by a big network pipe ($$$)
User typically works by manually submitting jobs to scheduler
e.g., LSF, Grid Engine, etc.
HPC is about compute.
Hadoop is about data.
Hadoop colocates compute and storage
Combined compute + storage cluster:
• Commodity hardware
• Data-locality
• Reduced networking needs
HPC is lower-level than Hadoop
• HPC only exposes job scheduling
• Parallelization typically occurs through MPI
• Very low-level communication primitives
• Difficult to horizontally scale by simply adding nodes
• Large data sets must be manually split
• Failures must be dealt with manually
• Hadoop has fault-tolerance, data locality, horizontal
scalability
Sqoop
Bidirectional data transfer
between Hadoop and
almost any SQL database
with a JDBC driver
Flume
A streaming data
collection and
aggregation system for
massive volumes of
data, such as RPC
services, Log4J,
Syslog, etc.
[Diagram: many clients sending events through a tier of Flume agents]
Cloudera Impala
Modern MPP
database built on top
of HDFS
Designed for
interactive queries
on terabyte-scale
data sets.
Cloudera Search
• Interactive search queries on top of
HDFS
• Built on Solr and SolrCloud
• Near-realtime indexing of new documents
Benefits of Hadoop ecosystem
• Inexpensive commodity compute/storage
• Tolerates random hardware failure
• Decreased need for high-bandwidth network pipes
• Co-locate compute and storage
• Exploit data locality
• Simple horizontal scalability by adding nodes
• MapReduce jobs effectively guaranteed to scale
• Fault-tolerance/replication built-in. Data is durable
• Large ecosystem of tools
• Flexible data storage. Schema-on-read. Unstructured data.
Scaling Genomics
NCBI Sequence Read Archive (SRA)
• One year ago: 609 terabytes
• Today: 1.14 petabytes
Every ‘ome has a -seq
• Genome: DNA-seq
• Transcriptome: RNA-seq, FRT-seq, NET-seq
• Methylome: Bisulfite-seq
• Immunome: Immune-seq
• Proteome: PhIP-seq, Bind-n-seq
Genomics ETL
GATK best practices
.fastq → [short read alignment] → .bam → [genotype calling] → .vcf
• Short read alignment is embarrassingly parallel
• Pileup/variant calling requires distributed sort
• GATK is a reimplementation of MapReduce; could run on Hadoop
• Already available Hadoop tools
• Crossbow: short read alignment/variant calling
• Hadoop-BAM: distributed bamtools
• BioPig: manipulating large fasta/q
• SEAL: Hadoop-enabled BWA
• Contrail: de-novo assembly
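The alignment stage is embarrassingly parallel because each FASTQ record can be processed independently. A minimal parser for the 4-line FASTQ record format makes that unit of work concrete (this assumes well-formed input; real pipelines would use a library such as Biopython or htslib):

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, qualities) from 4-line FASTQ records.

    Assumes well-formed input: each record is exactly four lines
    (@id, sequence, '+', quality string).
    """
    lines = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(lines), 4):
        header, seq, _, qual = lines[i:i + 4]
        yield (header[1:], seq, qual)  # strip the leading '@'

records = list(parse_fastq([
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "FFFF",
]))
# → [('read1', 'ACGT', 'IIII'), ('read2', 'TTGA', 'FFFF')]
```

Each record maps to an alignment independently (the map stage); the subsequent pileup/variant calling needs reads grouped by genomic position, which is exactly what MapReduce's distributed sort provides.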
Use case 1: Scaling a genome center
pipeline
• Currently at 5k genomes (150 TB incl. raw), looking to
scale to 25k now (1 PB) and eventually 100k
(requiring 4 PB)
• Current throughput
• >1300 samples per month
• >12 TB raw data per month
• Data ultimately served from MySQL database
• 750 GB of processed variant data
• 25k genomes requires >3.5 TB in MySQL
• Complex 4-tier storage system, including
tape, filer, and RDBMS
• Database serves population genetics applications and
case/control studies
• Unify all data processing into HDFS
• Replace MySQL with Impala on Hadoop for increased
scalability
• Possibly move raw data processing into MapReduce
Use case 2: Querying large, integrated data
sets
• Biotech client has thousands of genomes
• Want to expose ad hoc querying functionality on large
scale
• e.g., vcftools/PLINK-SEQ on terabyte-scale data sets
• Integrating data with public data sets (e.g., ENCODE,
UCSC browser)
• Terabyte-scale annotation sets
• Currently, these capabilities (e.g., data joins) are often
manually implemented
• Hadoop allows all data to be centrally stored and
accessible
• Impala exposes a SQL query interface to data sets in
Hadoop
Variant-filtering example
• “Give me all SNPs that are:
• on chromosome 5
• absent from dbSNP
• present in COSMIC
• observed in breast cancer samples
• absent from prostate cancer samples
• overlap a DNase hypersensitivity site
• overlap a ChIP-seq site for a particular TF”
• On the full 1000 Genomes data set (~37 billion
variants), the query finishes in a couple of seconds
All-vs-all eQTL
• Possible to generate trillions of hypothesis tests
• 10^7 loci x 10^4 phenotypes x 10s of tissues = 10^12 p-values
• Tested below on 120 billion associations
• Example queries:
• “Given 5 genes of interest, find top 20 most significant
eQTLs (cis and/or trans)”
• Finishes in several seconds
• “Find all cis-eQTLs across the entire genome”
• Finishes in a couple of minutes
• Limited by disk throughput
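A query like "top 20 most significant eQTLs for 5 genes" reduces to a filtered top-k by p-value over the association table. A minimal sketch with `heapq` (the tuple layout is an illustrative stand-in for the real table schema):

```python
import heapq

def top_eqtls(associations, genes, k=20):
    """Return the k smallest-p-value records among the given genes.

    `associations` is an iterable of (gene, locus, pvalue) tuples --
    a toy stand-in for a table of ~10^12 test results.
    """
    wanted = (a for a in associations if a[0] in genes)
    return heapq.nsmallest(k, wanted, key=lambda a: a[2])

assocs = [("G1", "rs1", 1e-8), ("G2", "rs2", 0.3),
          ("G1", "rs3", 1e-12), ("G3", "rs4", 1e-15)]
top = top_eqtls(assocs, genes={"G1", "G2"}, k=2)
# → [('G1', 'rs3', 1e-12), ('G1', 'rs1', 1e-8)]
```

A streaming top-k like this never materializes the full result set, which is why the gene-restricted query can finish in seconds while the genome-wide scan is limited by disk throughput.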
• “Find all SNPs that are:
• in LD with some lead SNP
or eQTL of interest
• align with some functional
annotation of interest”
• Still in testing, but likely
finishes in seconds
Schaub et al, Genome Research, 2012
Genomics summary
• ETL (raw data to analysis-ready data)
• Data integration
• e.g., interactively queryable UCSC genome browser
• De novo assembly
• NLP on scientific literature
Other use cases
• Clinical data
• Manufacturing
Use case 3: Clinical document queries for
EHR company
• EHR wants to expose query functionality to clinicians
• >16 million clinical documents with free text; processed
through NLP pipeline
• >500 million lab results
• Perform subject expansion on search queries via
ontologies
• e.g., “myocardial infarction” will match “heart disease”
• Search functionality implemented with Lucene
(serving) on top of HBase
(processing/storage/indexing)
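Subject expansion can be sketched as a transitive walk over an is-a hierarchy, here expanding a broad concept to all narrower ones before searching. The tiny ontology below is illustrative, not SNOMED/UMLS:

```python
def expand_query(term, ontology):
    """Expand a query term to itself plus all descendant concepts.

    `ontology` maps a concept to its narrower (child) concepts --
    a toy stand-in for a clinical terminology hierarchy.
    """
    terms, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in terms:
            terms.add(t)
            stack.extend(ontology.get(t, []))
    return terms

ontology = {
    "heart disease": ["myocardial infarction", "arrhythmia"],
    "arrhythmia": ["atrial fibrillation"],
}
expanded = expand_query("heart disease", ontology)
# A search for "heart disease" now also matches documents that only
# mention "myocardial infarction" or "atrial fibrillation".
```

The same walk run in the other direction (child to ancestors) lets a document tagged "myocardial infarction" match a "heart disease" query.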
• Interested in recommendation engine-enabled
queries, like:
• Clinician searches “diabetes” and has relevant lab results
already highlighted when opening a patient’s record
• Clinician wants to know what other conditions might be
correlated with a finding of interest
“Find other patients
similar to mine”
• The Stanford system is limited
to search
• Recommendation engines
allow a button “find similar”
Use case 4: Insurance company
• Data from 30 different EHRs across multiple business
units
• High variance in ICD9 coding between locales.
• Use NLP and machine learning to improve ICD9
coding to reduce variance in diagnosis
Use case 5: Pharma company variance in
yields
• Pharma company performs large batch fermentations
of its product
• Sees high variance in its yields
• Fermentations are automated and highly
instrumented
• e.g., dissolved oxygen, nutrients, COAs, temperature, etc.
• Perform time series analysis on fermentation runs to
predict yields and determine which variables control
variance.
Use case 6: AgTech company integrating
data sources
• Multiple reference genome sequences
• Genotyping on thousands of samples
• Weather data
• Soil data
• Microbiome data
• Yield data
• Geo data
• All integrated in HBase
• Can increase crop yields ~15% by “printing” seeds
onto a field
• Support search queries by name, ontology
concepts, protein families, creation dates,
assembly/chromosome positions, SNPs
• Import any annotation data in CSV/GFF
• Integration with cloning tools
• Supports a web front-end for easy access
Conclusions
Highly heterogeneous data
• Communications: location-based advertising
• Health care: patient sensors, monitoring, EHRs; quality of care
• Law enforcement & defense: threat analysis, social media monitoring, photo analysis
• Education & research: experiment sensor analysis
• Financial services: risk & portfolio analysis; new products
• Online services / social media: people & career matching; website optimization
• Utilities: smart-meter analysis for network capacity
• Consumer packaged goods: sentiment analysis of what's hot; customer service
• Media / entertainment: viewers / advertising effectiveness
• Travel & transportation: sensor analysis for optimal traffic flows; customer sentiment
• Life sciences: clinical trials; genomics
• Retail: consumer sentiment; optimized marketing
• Automotive: auto sensors reporting location, problems
• High tech / industrial mfg.: manufacturing quality; warranty analysis
• Oil & gas: drilling exploration sensor analysis
©2013 Cloudera, Inc. All Rights Reserved.
Flexibility
• Store any data
• Run any analysis and processing
• Keeps pace with the rate of change of incoming data
Scalability
• Proven growth to PBs/1,000s of nodes
• No need to rewrite queries, automatically scales
• Keeps pace with the rate of growth of incoming data
Efficiency
• Cost per TB at a fraction of other options
• Keep all of your data alive in an active archive
• Bringing computation to the data beats moving data to the algorithms
The Cloudera Enterprise Platform for Big Data
Cloudera Hadoop Stack
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Recently uploaded (20)

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Hadoop ecosystem for life sciences

  • 1. 1 Hadoop ecosystem for life sciences Uri Laserson 30 September 2013
  • 2. About the speaker • Currently “Data Scientist” at Cloudera • PhD in Biomedical Engineering at MIT/Harvard (2005-2012) • Focused on next-generation DNA sequencing technology in George Church’s lab • Co-founded Good Start Genetics (2007-) • First application of next-gen sequencing to genetic carrier screening • laserson@cloudera.com 2
  • 3. Agenda • Historical context • Introduction to Hadoop ecosystem • Genomics on Hadoop • Other use cases in life sciences 3
  • 5. 5
  • 6. Indexing the Web • Web is Huge • Hundreds of millions of pages in 1999 • How do you index it? • Crawl all the pages • Rank pages based on relevance metrics • Build search index of keywords to pages • Do it in real time! 6
  • 7. 7
  • 8. Databases in 1999 1. Buy a really big machine 2. Install expensive DBMS on it 3. Point your workload at it 4. Hope it doesn’t fail 5. Ambitious: buy another big machine as backup 8
  • 9. 9
  • 10. Database Limitations • Didn’t scale horizontally • High marginal cost ($$$) • No real fault-tolerance story • Vendor lock-in ($$$) • SQL unsuited for search ranking • Complex analysis (PageRank) • Unstructured data 10
  • 11. 11
  • 12. Google does something different • Designed their own storage and processing infrastructure • Google File System (GFS) and MapReduce (MR) • Goals: KISS • Cheap • Scalable • Reliable 12
  • 13. Google does something different • It worked! • Powered Google Search for many years • General framework for large-scale batch computation tasks • Still used internally at Google to this day 13
  • 14. Google benevolent enough to publish 14 2003 2004
  • 15. Birth of Hadoop at Yahoo! • 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR. • 2006: Spun out as Apache Hadoop • Named after Doug’s son’s yellow stuffed elephant 15
  • 16. Industry strategy: Copy Google 16 Google → Open-source (Function): • GFS → HDFS (distributed file system) • MapReduce → MapReduce (batch distributed data processing) • Bigtable → HBase (distributed DB/key-value store) • Protobuf/Stubby → Thrift or Avro (data serialization/RPC) • Pregel → Giraph (distributed graph processing) • Dremel/F1 → Cloudera Impala (scalable interactive SQL (MPP)) • FlumeJava → Crunch (abstracted data pipelines on Hadoop)
  • 17. 17 Overview of core technology
  • 18. HDFS design assumptions • Based on Google File System • Files are large (GBs to TBs) • Failures are common • Massive scale means failures very likely • Disk, node, or network failures • Accesses are large and sequential • Files are append-only 18
  • 19. HDFS properties • Fault-tolerant • Gracefully responds to node/disk/network failures • Horizontally scalable • Low marginal cost • High-bandwidth 19 1 2 3 4 5 2 4 5 1 2 5 1 3 4 2 3 5 1 3 4 Input File HDFS storage distribution Node A Node B Node C Node D Node E
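A toy Python sketch of the replication idea behind the slide's diagram: a file is split into fixed-size blocks and each block's replicas land on distinct nodes. The round-robin placement here is an assumption for illustration only; real HDFS uses a rack-aware placement policy.

```python
import itertools

def place_blocks(file_size, block_size, nodes, replication=3):
    """Toy placement: split a file into fixed-size blocks and assign each
    block's replicas to distinct nodes round-robin. Real HDFS placement
    is rack-aware; this only illustrates the replication idea."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    start_cycle = itertools.cycle(range(len(nodes)))
    placement = {}
    for block in range(n_blocks):
        start = next(start_cycle)
        placement[block] = [nodes[(start + r) % len(nodes)]
                            for r in range(replication)]
    return placement

layout = place_blocks(file_size=640, block_size=128,
                      nodes=["A", "B", "C", "D", "E"])
# Each of the 5 blocks lives on 3 distinct nodes, so losing any single
# node still leaves every block with at least 2 surviving copies.
```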
  • 21. MapReduce • Structured as 1. Embarrassingly parallel “map stage” 2. Cluster-wide distributed sort (“shuffle”) 3. Aggregation “reduce stage” • Data-locality: process the data where it is stored • Fault-tolerance: failed tasks automatically detected and restarted • Schema-on-read: data need not be stored conforming to a rigid schema 21
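The three stages above can be mimicked on a single machine. A minimal Python sketch (the `map_stage`/`shuffle`/`reduce_stage` names and the word-count example are illustrative, not Hadoop's API):

```python
from itertools import groupby
from operator import itemgetter

def map_stage(records, mapper):
    # Embarrassingly parallel: each record is mapped independently.
    return [kv for rec in records for kv in mapper(rec)]

def shuffle(pairs):
    # Stands in for the cluster-wide distributed sort by key.
    return sorted(pairs, key=itemgetter(0))

def reduce_stage(pairs, reducer):
    # groupby requires sorted input, which the shuffle guarantees.
    return {key: reducer(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

lines = ["the web is huge", "index the web"]
counts = reduce_stage(
    shuffle(map_stage(lines, lambda line: [(w, 1) for w in line.split()])),
    sum)
# counts == {'the': 2, 'web': 2, 'is': 1, 'huge': 1, 'index': 1}
```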
  • 23. HPC separates compute from storage 23 Storage infrastructure Compute cluster • Proprietary, distributed file system • Expensive • High-performance hardware • Low failure rate • Expensive Big network pipe ($$$) User typically works by manually submitting jobs to scheduler e.g., LSF, Grid Engine, etc. HPC is about compute. Hadoop is about data.
  • 24. Hadoop colocates compute and storage 24 Compute cluster Storage infrastructure • Commodity hardware • Data-locality • Reduced networking needs HPC is about compute. Hadoop is about data.
  • 25. HPC is lower-level than Hadoop • HPC only exposes job scheduling • Parallelization typically occurs through MPI • Very low-level communication primitives • Difficult to horizontally scale by simply adding nodes • Large data sets must be manually split • Failures must be dealt with manually • Hadoop has fault-tolerance, data locality, horizontal scalability 25
  • 26. Sqoop 26 Bidirectional data transfer between Hadoop and almost any SQL database with a JDBC driver
  • 27. Flume 27 A streaming data collection and aggregation system for massive volumes of data, such as RPC services, Log4J, Syslog, etc. Client Client Client Client Agent Agent Agent
  • 28. Cloudera Impala 28 Modern MPP database built on top of HDFS Designed for interactive queries on terabyte-scale data sets.
  • 29. Cloudera Search 29 • Interactive search queries on top of HDFS • Built on Solr and SolrCloud • Near-realtime indexing of new documents
  • 30. Benefits of Hadoop ecosystem • Inexpensive commodity compute/storage • Tolerates random hardware failure • Decreased need for high-bandwidth network pipes • Co-locate compute and storage • Exploit data locality • Simple horizontal scalability by adding nodes • MapReduce jobs effectively guaranteed to scale • Fault-tolerance/replication built-in. Data is durable • Large ecosystem of tools • Flexible data storage. Schema-on-read. Unstructured data. 30
  • 32. 32
  • 33. NCBI Sequence Read Archive (SRA) 33 Today… 1.14 petabytes One year ago… 609 terabytes
  • 34. Every ‘ome has a -seq 34 Genome DNA-seq Transcriptome RNA-seq FRT-seq NET-seq Methylome Bisulfite-seq Immunome Immune-seq Proteome PhIP-seq Bind-n-seq
  • 36. Genomics ETL 36 .fastq .bam .vcf short read alignment genotype calling • Short read alignment is embarrassingly parallel • Pileup/variant calling requires distributed sort • GATK is a reimplementation of MapReduce; could run on Hadoop • Already available Hadoop tools • Crossbow: short read alignment/variant calling • Hadoop-BAM: distributed bamtools • BioPig: manipulating large fasta/q • SEAL: Hadoop-enabled BWA • Contrail: de-novo assembly
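Why pileup needs the distributed sort: all aligned reads covering a position must be grouped by (chromosome, position) before bases can be counted. A toy single-machine sketch of that map-sort-reduce shape, using made-up reads for illustration:

```python
from collections import Counter

# Hypothetical aligned reads: (chromosome, position, base called there).
reads = [("chr5", 100, "A"), ("chr5", 100, "A"), ("chr5", 100, "G"),
         ("chr5", 101, "T"), ("chr1", 50, "C")]

# The "map" emits (locus, base) pairs; the sort plays the role of the
# shuffle, so all evidence for one locus lands in the same reduce group.
pileup = {}
for chrom, pos, base in sorted(reads):
    pileup.setdefault((chrom, pos), Counter())[base] += 1

# At chr5:100 the pileup is {'A': 2, 'G': 1}; a real variant caller
# would weigh base and mapping qualities before emitting a genotype.
```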
  • 37. Use case 1: Scaling a genome center pipeline • Currently at 5k genomes (150 TB incl. raw), looking to scale to 25k now (1 PB) and eventually 100k (requiring 4 PB) • Current throughput • >1300 samples per month • >12 TB raw data per month • Data ultimately served from MySQL database • 750 GB of processed variant data • 25k genomes require >3.5 TB in MySQL • Complex 4-tier storage system, including tape, filer, and RDBMS 37
  • 38. Use case 1: Scaling a genome center pipeline • Database serves population genetics applications and case/control studies • Unify all data processing into HDFS • Replace MySQL with Impala on Hadoop for increased scalability • Possibly move raw data processing into MapReduce 38
  • 39. Use case 2: Querying large, integrated data sets • Biotech client has thousands of genomes • Want to expose ad hoc querying functionality on large scale • e.g., vcftools/PLINK-SEQ on terabyte-scale data sets • Integrating data with public data sets (e.g., ENCODE, UCSC browser) • Terabyte-scale annotation sets • Currently, these capabilities (e.g., data joins) are often manually implemented 39
  • 40. Use case 2: Querying large, integrated data sets • Hadoop allows all data to be centrally stored and accessible • Impala exposes a SQL query interface to data sets in Hadoop 40
  • 41. Variant-filtering example • “Give me all SNPs that are: • on chromosome 5 • absent from dbSNP • present in COSMIC • observed in breast cancer samples • absent from prostate cancer samples • overlap a DNase hypersensitivity site • overlap a ChIP-seq site for a particular TF” • On full 1000 genome data set (~37 billion variants), query finishes in a couple seconds 41
  • 42. All-vs-all eQTL • Possible to generate trillions of hypothesis tests • 10^7 loci x 10^4 phenotypes x 10s of tissues = 10^12 p-values • Tested below on 120 billion associations • Example queries: • “Given 5 genes of interest, find top 20 most significant eQTLs (cis and/or trans)” • Finishes in several seconds • “Find all cis-eQTLs across the entire genome” • Finishes in a couple of minutes • Limited by disk throughput 42
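The first query above is just a filter on the genes of interest followed by top-k selection on p-value. A sketch over synthetic associations; the gene names, the flat (gene, locus, p-value) layout, and the random data are all hypothetical:

```python
import heapq
import random

random.seed(0)
genes_of_interest = {"TP53", "BRCA1"}

# Hypothetical association table: (gene, locus, p_value), standing in
# for the billions of rows a real all-vs-all scan would produce.
associations = [(random.choice(["TP53", "BRCA1", "EGFR", "MYC"]),
                 f"locus{i}", random.random())
                for i in range(10_000)]

# Top 20 most significant hits: smallest p-values first.
top = heapq.nsmallest(
    20,
    (a for a in associations if a[0] in genes_of_interest),
    key=lambda a: a[2])
```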
  • 43. All-vs-all eQTL • “Find all SNPs that are: • in LD with some lead SNP or eQTL of interest • align with some functional annotation of interest” • Still in testing, but likely finishes in seconds 43 Schaub et al, Genome Research, 2012
  • 44. Genomics summary • ETL (raw data to analysis-ready data) • Data integration • e.g., interactively queryable UCSC genome browser • De novo assembly • NLP on scientific literature 44
  • 46. Use case 3: Clinical document queries for EHR company • EHR wants to expose query functionality to clinicians • >16 million clinical documents with free text; processed through NLP pipeline • >500 million lab results • Perform subject expansion on search queries via ontologies • e.g., “myocardial infarction” will match “heart disease” • Search functionality implemented with Lucene (serving) on top of HBase (processing/storage/indexing) 46
  • 47. Use case 3: Clinical document queries for EHR company • Interested in recommendation engine-enabled queries, like: • Clinician searches “diabetes” and has relevant lab results already highlighted when opening a patient’s record • Clinician wants to know what other conditions might be correlated with a finding of interest 47
  • 48. Use case 3: Clinical document queries for EHR company 48 “Find other patients similar to mine” • The Stanford system is limited to search • Recommendation engines allow a button “find similar”
  • 49. Use case 4: Insurance company • Data from 30 different EHRs across multiple business units • High variance in ICD9 coding between locales. • Use NLP and machine learning to improve ICD9 coding to reduce variance in diagnosis 49
  • 50. Use case 5: Pharma company variance in yields • Pharma company performs large batch fermentations of their product • Find high levels of variance in their yield • Fermentations are automated and highly instrumented • e.g., dissolved oxygen, nutrients, COAs, temperature, etc. • Perform time series analysis on fermentation runs to predict yields and determine which variables control variance. 50
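One simple first step for such an analysis is correlating per-run process variables with final yield. A sketch with hypothetical run summaries (the variable names and numbers are invented; a real analysis would model the full instrumented time series, not per-run means):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-run summaries: mean dissolved O2 (%) and final yield (g/L).
o2 = [28.0, 31.5, 25.2, 33.1, 29.8, 24.0]
final_yield = [3.1, 3.6, 2.7, 3.9, 3.3, 2.5]

r = pearson(o2, final_yield)
# A strongly positive r flags dissolved O2 as a variable worth
# controlling when chasing down the variance in yields.
```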
  • 51. Use case 6: AgTech company integrating data sources • Multiple reference genome sequences • Genotyping on thousands of samples • Weather data • Soil data • Microbiome data • Yield data • Geo data • All integrated in HBase 51
  • 52. Use case 6: AgTech company integrating data sources • Can increase crop yields ~15% by “printing” seeds onto a field • Support search queries by name, ontology concepts, protein families, creation dates, assembly/chromosome positions, SNPs • Import any annotation data in CSV/GFF • Integration with cloning tools • Supports a web front-end for easy access 52
  • 54. Highly heterogeneous data 54 • COMMUNICATIONS: Location-based advertising • HEALTH CARE: Patient sensors, monitoring, EHRs; quality of care • LAW ENFORCEMENT & DEFENSE: Threat analysis, social media monitoring, photo analysis • EDUCATION & RESEARCH: Experiment sensor analysis • FINANCIAL SERVICES: Risk & portfolio analysis; new products • ON-LINE SERVICES / SOCIAL MEDIA: People & career matching; website optimization • UTILITIES: Smart meter analysis for network capacity • CONSUMER PACKAGED GOODS: Sentiment analysis of what’s hot; customer service • MEDIA / ENTERTAINMENT: Viewers / advertising effectiveness • TRAVEL & TRANSPORTATION: Sensor analysis for optimal traffic flows; customer sentiment • LIFE SCIENCES: Clinical trials; genomics • RETAIL: Consumer sentiment; optimized marketing • AUTOMOTIVE: Auto sensors reporting location, problems • HIGH TECH / INDUSTRIAL MFG.: Mfg. quality; warranty analysis • OIL & GAS: Drilling exploration sensor analysis ©2013 Cloudera, Inc. All Rights Reserved.
  • 55. The Cloudera Enterprise Platform for Big Data 55 • Flexibility: store any data; run any analysis and processing; keeps pace with the rate of change of incoming data • Scalability: proven growth to PBs/1,000s of nodes; no need to rewrite queries, automatically scales; keeps pace with the rate of growth of incoming data • Efficiency: cost per TB at a fraction of other options; keep all of your data alive in an active archive; powering the data beats algorithm movement ©2013 Cloudera, Inc. All Rights Reserved.
  • 57. 57

Editor's Notes

  1. Already mature technologies at this point. DB community thought it was silly. Companies other than Google were not yet at this scale. Google not in the business of releasing infrastructure software. They sell ads.
  2. Mostly through the Apache Software Foundation
  3. Talk HDFS and MapReduce. Then some other tools.
  4. Large blocks. Blocks replicated around.
  5. Two functions required.
  6. Only need to supply 2 functions
  7. Need to be careful because you can DDoS your database.
  8. Log scale.
  9. Define ETL
  10. Define ETL
  11. Volume, Variety, Velocity
  12. Software: Cloudera Enterprise, the Platform for Big Data. A complete data management solution powered by Apache Hadoop. A collection of open source projects form the foundation of the platform. Cloudera has wrapped the open source core with additional software for system and data management as well as technical support. 5 attributes of Cloudera Enterprise:
Scalable: storage and compute in a single system brings computation to data (rather than the other way around). Scale capacity and performance linearly; just add nodes. Proven at massive scale: tens of PB of data, millions of users.
Flexible: store any type of data (structured, unstructured, semi-structured) in its native format; no conversion required and no loss of data fidelity due to ETL. Fluid structuring: no single model or schema that the data must conform to. Determine how you want to look at data at the time you ask the question; if the attribute exists in the raw data, you can query against it. Alter structure to optimize query performance as desired (not required), with multiple open source file formats like Avro and Parquet. Multiple forms of computation: bring different tools to bear on the data, depending on your skill set and what you want to do. Batch processing: MapReduce, Hive, Pig, Java. Interactive SQL: Impala, BI tools. Interactive search: for non-technical users, or helping to identify datasets for further analysis. Machine learning: apply algorithms to large datasets using libraries like Apache Mahout. Math: tools like SAS and R for data scientists and statisticians. More to come.
Cost-effective: scale out on inexpensive, industry-standard hardware (vs. highly tuned, specialized hardware). Fault tolerance built in. Leverage cost structures with existing vendors. Reduced data movement: can perform more operations in a single place due to flexible tooling. Fewer redundant copies of data; less time spent migrating/managing. Open source software is easy to acquire, making it easy to prove the value/ROI.
Open: rapid innovation; large development communities; the most talented engineers from across the world. Easy to acquire and prove value: free to download and deploy, so you can demonstrate the value of the technology before you make a large-scale investment. No vendor lock-in: choose your vendor based solely on merit. Cloudera’s open source strategy: if it stores or processes data, it’s open source. Big commitment to open source: leading contributor to the Apache Hadoop ecosystem, defining the future of the platform together with the community.
Integrated: works with all your existing investments: databases and data warehouses, analytics and BI solutions, ETL tools, platforms and operating systems, hardware and networking equipment. Over 700 partners, including all of the leaders in the market segments above. Complements those investments by allowing you to align data and processes to the right solution.