Your SlideShare is downloading. ×
140614 bigdatacamp-la-keynote-jon hsieh
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Saving this for later?

Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime - even offline.

Text the download link to your phone

Standard text messaging rates apply

140614 bigdatacamp-la-keynote-jon hsieh

212
views

Published on

Big Data Camp LA 2014, Evolution of the Big Data Stack by Jonathan Hsieh of Cloudera - Keynote

Big Data Camp LA 2014, Evolution of the Big Data Stack by Jonathan Hsieh of Cloudera - Keynote

Published in: Technology, Business

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
212
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Headline Goes Here Speaker Name or Subhead Goes Here DO NOT USE PUBLICLY PRIOR TO 10/23/12 Evolution of the Big Data Stack Jonathan Hsieh| Tech Lead / Software engineer @ Cloudera BigDataCamp LA ‘14 June 14, 2014
  • 2. Who Am I? • Cloudera since 2009 • Tech Lead HBase Team • Software Engineer • Apache HBase committer / PMC • Apache Flume founder / PMC • U of Washington: • Research in Distributed Systems 6/14/14 BigDataCamp LA '14 - Hsieh2
  • 3. Big Data Stack Evolution •Inspiration •Imitation •Innovation 6/14/14 BigDataCamp LA '14 - Hsieh3
  • 4. Big Data Stack Evolution •Inspiration •Imitation •Innovation 6/14/14 BigDataCamp LA '14 - Hsieh4
  • 5. Emergence of Big Data Inspiration 6/14/14 BigDataCamp LA '14 - Hsieh5
  • 6. 6/14/14 BigDataCamp LA '14 - Hsieh6
  • 7. 6/14/14 BigDataCamp LA '14 - Hsieh7
  • 8. The brute force solution 1. Collect all the data 2. Analyze all the data 3. Serve the results 6/14/14 BigDataCamp LA '14 - Hsieh8
  • 9. End of free MHz coincides with Rise of Big Data 6/14/14 BigDataCamp LA '14 - Hsieh http://cacm.acm.org/magazines/2012/4/147359-cpu-db-recording-microprocessor-history/abstract 9
  • 10. A Move towards Distributed Systems • Scaling Horizontally instead of Vertically • Challenges: • Reliability • Fault tolerance • Atomicity / Consistency / Isolation / Durability • High-Availability • Latency Predictability 6/14/14 BigDataCamp LA '14 - Hsieh10
  • 11. Google built a Big Data Stack Sawzall MapReduce GFS 6/14/14 BigDataCamp LA '14 - Hsieh11
  • 12. Google built a Big Data Stack Sawzall MapReduce MySql Gateway Big Table GFS Chubby Evenflow Protobufs 6/14/14 BigDataCamp LA '14 - Hsieh12
  • 13. The core of a Big Data Stack • . Query Processing Data Integration Fast Read / Write access File System Distributed Coordination Workflow and Scheduling Metadata 6/14/14 BigDataCamp LA '14 - Hsieh13
  • 14. Big Data for the rest of us Imitation 6/14/14 BigDataCamp LA '14 - Hsieh14
  • 15. 6/14/14 BigDataCamp LA '14 - Hsieh15
  • 16. The core of a Hadoop stack Query Processing Data Integration Fast Read / Write access File System Distributed Coordination Workflow and Scheduling Metadata 6/14/14 BigDataCamp LA '14 - Hsieh16
  • 17. built a Big Data stack • Donated Hadoop + Friends to the Apache Software Foundation Pig / Hive HadoopData Highway* HBase HDFS ZooKeeper Oozie Hive 6/14/14 BigDataCamp LA '14 - Hsieh17
  • 18. Parallel Components 6/14/14 BigDataCamp LA '14 - Hsieh Function Google Yahoo! Facebook The Rest of Us File system GFS => Colossus HDFS HDFS HDFS Low latency Data store (NoSQL) BigTable => Megastore => Spanner PNUTS => Hbase HBase Hbase Batch processing Google MapReduce Hadoop MapReduce Hadoop MapReduce Hadoop MapReduce Spark Batch query Sawzall, Tenzing, FlumeJava Pig Hive Pig, Hive, Impala, Drill, Crunch Resource Management Borg => Omega => YARN => Corona YARN Mesos Ingest EvenFlow Custom MySQL Proxy Custom Scribe / Calligraphus Custom proxy Sqoop Flume Kafka Coordination Chubby Zookeeper Zookeeper Zookeeper Graph Processing Pregel Giraph Giraph, Golden orb Hama, Titan Stream processing MillWheel S3 => Storm Puma/PTail Storm, Spark 18
  • 19. Simplify and remove features to enable scaling • Scalable and simple first • Focus only on needed features. Exclude others. • Re-add them later. • Ex: NoSQL • No transactions • No Schema 6/14/14 BigDataCamp LA '14 - Hsieh19
  • 20. Big Data industry steps up Innovation 6/14/14 BigDataCamp LA '14 - Hsieh20
  • 21. Nov ’06: Google BigTable, Chubby OSDI ‘06 Mar’10: Cloudera Founded Big Data Stack Timeline 6/14/14 BigDataCamp LA '14 - Hsieh 20142006 2007 2008 2009 2010 2011 20132012 Apr’11: CDH3 GA with HBase, Flume, Sqoop, Oozie Feb’12: CDH4 GA with HDFS NN HA, and YARN preview Mar’10: CDH2 GA with CM (manager) 2009: CDH1 GA (first hadoop distro) Mar ’04: Google MapReduce OSDI ‘04 Oct ’03: Google GFS SOSP ‘03 2008: Google Tenzing Pub (VLDB’11) 2008: Facebook Hive ICMD ‘08: Pig Latin 21
  • 22. Nov ’06: Google BigTable, Chubby OSDI ‘06 Mar’10: Cloudera Founded Big Data Stack Timeline 6/14/14 BigDataCamp LA '14 - Hsieh 20142006 2007 2008 2009 2010 2011 20132012 Apr’11: CDH3 GA with HBase, Flume, Sqoop, Oozie Feb’12: CDH4 GA with HDFS NN HA, and YARN preview Mar’10: CDH2 GA with CM (manager) 2009: CDH1 GA (first hadoop distro) Apr’14: CDH5 GA with Impala, Spark, Solr, Navigator Mar ’04: Google MapReduce OSDI ‘04 Oct ’03: Google GFS SOSP ‘03 2008: Google Tenzing Pub (VLDB’11) 2008: Google Spanner OSDI ‘12 2008: Facebook Hive 2014: Facebook discusses HydraBase ICMD ‘08: Pig Latin 2011: Google Megastore CIDR ‘11 2010: Google Percolator OSDI’10 22
  • 23. Usability 6/14/14 BigDataCamp LA '14 - Hsieh23
  • 24. Security + Integration 6/14/14 BigDataCamp LA '14 - Hsieh24
  • 25. New directions 6/14/14 BigDataCamp LA '14 - Hsieh oryx 25
  • 26. 6/14/14 BigDataCamp LA '14 - Hsieh26
  • 27. Thanks! @jmhsieh 6/14/14 BigDataCamp LA '14 - Hsieh27