Evolution of the Big Data Stack
Jonathan Hsieh | Tech Lead / Software Engineer @ Cloudera
BigDataCamp LA ’14
June 14, 2014
Who Am I?
• Cloudera since 2009
• Tech Lead HBase Team
• Software Engineer
• Apache HBase committer / PMC
• Apache Flume founder / PMC
• U of Washington:
• Research in Distributed Systems
6/14/14 BigDataCamp LA '14 - Hsieh
Big Data Stack Evolution
•Inspiration
•Imitation
•Innovation
Emergence of Big Data
Inspiration
The brute force solution
1. Collect all the data
2. Analyze all the data
3. Serve the results
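As a rough illustration of the three brute-force steps, the sketch below collects a few log lines, analyzes them with a map / shuffle / reduce style count, and serves lookups from the precomputed result. The sample records and the function names (`collect`, `analyze`, `serve`) are hypothetical and not from the talk; a real stack spreads each step across many machines.

```python
from collections import defaultdict

def collect():
    """Step 1: gather all raw records (hypothetical sample log lines)."""
    return [
        "GET /home",
        "GET /search",
        "POST /login",
        "GET /home",
    ]

def analyze(records):
    """Step 2: scan every record and count requests per path."""
    # Map: emit a (path, 1) pair for each record.
    pairs = [(line.split()[1], 1) for line in records]
    # Shuffle: group the emitted values by key.
    groups = defaultdict(list)
    for path, one in pairs:
        groups[path].append(one)
    # Reduce: sum the values for each key.
    return {path: sum(ones) for path, ones in groups.items()}

def serve(results, path):
    """Step 3: answer lookups from the precomputed results."""
    return results.get(path, 0)

results = analyze(collect())
print(serve(results, "/home"))
```

The analysis runs over every record rather than a sample, which is the "brute force" part; serving then becomes a cheap lookup.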
End of free MHz coincides with Rise of Big Data
http://cacm.acm.org/magazines/2012/4/147359-cpu-db-recording-microprocessor-history/abstract
A Move towards Distributed Systems
• Scaling Horizontally instead of Vertically
• Challenges:
• Reliability
• Fault tolerance
• Atomicity / Consistency / Isolation / Durability
• High-Availability
• Latency Predictability
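To make horizontal scaling and fault tolerance a little more concrete, here is a minimal sketch that spreads keys over a set of nodes with a stable hash and keeps extra replicas so a read can survive a node failure. The function names (`shard_for`, `replicas_for`, `read_node`) and the replication factor of 3 are assumptions for illustration, not something described in the talk.

```python
import hashlib

def shard_for(key, num_nodes):
    """Map a key to a node index with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def replicas_for(key, num_nodes, replication=3):
    """Pick the primary node plus the next nodes in ring order as replicas."""
    primary = shard_for(key, num_nodes)
    count = min(replication, num_nodes)
    return [(primary + i) % num_nodes for i in range(count)]

def read_node(key, num_nodes, down, replication=3):
    """Return the first live replica for a key, or None if all are down."""
    for node in replicas_for(key, num_nodes, replication):
        if node not in down:
            return node
    return None

# Hypothetical usage: 5 nodes, with the key's primary node marked as failed.
primary = shard_for("user:42", 5)
print(replicas_for("user:42", 5))
print(read_node("user:42", 5, down={primary}))
```

Plain modulo hashing moves most keys whenever the node count changes, which is one reason real systems tend to use range partitioning or consistent hashing instead.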
Google built a Big Data Stack
Sawzall
MapReduce
GFS
Google built a Big Data Stack
Sawzall
MapReduce
MySQL Gateway
BigTable
GFS
Chubby
Evenflow
Protobufs
The core of a Big Data Stack
• Query
• Processing
• Data Integration
• Fast Read / Write access
• File System
• Distributed Coordination
• Workflow and Scheduling
• Metadata
Big Data for the rest of us
Imitation
The core of a Hadoop stack
• Query
• Processing
• Data Integration
• Fast Read / Write access
• File System
• Distributed Coordination
• Workflow and Scheduling
• Metadata
Yahoo! built a Big Data stack
• Donated Hadoop + Friends to the Apache Software Foundation
Pig / Hive
Hadoop
Data Highway*
HBase
HDFS
ZooKeeper
Oozie
Hive
Parallel Components
Function | Google | Yahoo! | Facebook | The Rest of Us
File system | GFS => Colossus | HDFS | HDFS | HDFS
Low-latency data store (NoSQL) | BigTable => Megastore => Spanner | PNUTS => HBase | HBase | HBase
Batch processing | Google MapReduce | Hadoop MapReduce | Hadoop MapReduce | Hadoop MapReduce, Spark
Batch query | Sawzall, Tenzing, FlumeJava | Pig | Hive | Pig, Hive, Impala, Drill, Crunch
Resource management | Borg => Omega | => YARN | => Corona | YARN, Mesos
Ingest | EvenFlow, custom MySQL proxy | Custom | Scribe / Calligraphus, custom proxy | Sqoop, Flume, Kafka
Coordination | Chubby | ZooKeeper | ZooKeeper | ZooKeeper
Graph processing | Pregel | Giraph | | Giraph, Golden Orb, Hama, Titan
Stream processing | MillWheel | S4 => Storm | Puma / PTail | Storm, Spark
Simplify and remove features to enable scaling
• Scalable and simple first
• Focus only on needed features. Exclude others.
• Re-add them later.
• Ex: NoSQL
• No transactions
• No schema
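The NoSQL example can be sketched as a toy row store that accepts any columns (no schema) and guarantees atomicity only for a single-row write (no multi-row transactions). The class `SimpleKeyValueStore` and its methods are hypothetical names for illustration; this is not the API of HBase or any other real store.

```python
import threading

class SimpleKeyValueStore:
    """Toy row store: schemaless rows, atomic single-row writes only."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def put(self, row_key, columns):
        """Merge arbitrary columns into one row; no schema is enforced."""
        with self._lock:
            self._rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        """Return a copy of the row, or an empty dict if it is absent."""
        with self._lock:
            return dict(self._rows.get(row_key, {}))

store = SimpleKeyValueStore()
store.put("user1", {"name": "Ada"})
store.put("user1", {"email": "ada@example.com"})
store.put("user2", {"visits": 3})  # different columns, same store
print(store.get("user1"))
```

Because each operation touches exactly one row, rows can be spread across machines without cross-machine coordination. There is deliberately no way to update `user1` and `user2` as one atomic unit; that is the kind of feature left out first and re-added later.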
Big Data industry steps up
Innovation
Big Data Stack Timeline
• Oct ’03: Google GFS, SOSP ’03
• Dec ’04: Google MapReduce, OSDI ’04
• Nov ’06: Google BigTable and Chubby, OSDI ’06
• 2008: Cloudera founded
• 2008: Facebook Hive
• 2008: Pig Latin, SIGMOD ’08
• 2008: Google Tenzing (published at VLDB ’11)
• 2009: CDH1 GA (first Hadoop distro)
• Mar ’10: CDH2 GA with CM (manager)
• 2010: Google Percolator, OSDI ’10
• 2011: Google Megastore, CIDR ’11
• Apr ’11: CDH3 GA with HBase, Flume, Sqoop, Oozie
• Feb ’12: CDH4 GA with HDFS NN HA and YARN preview
• 2012: Google Spanner, OSDI ’12
• Apr ’14: CDH5 GA with Impala, Spark, Solr, Navigator
• 2014: Facebook discusses HydraBase
Usability
Security + Integration
New directions
oryx
Thanks!
@jmhsieh
