SlideShare a Scribd company logo
1 of 14
The Meta of Hadoop


        Joydeep Sen Sarma
Ex-Facebook DI Lead, Founder Qubole
Intro
• File/Database Systems developer (ex- Netapp/Oracle)
• Yahoo (2005-07), Facebook (2007-11)

• @Facebook:
   – SysAdmin: operated massive Hadoop/Hive installs
   – Architect: conceived/wrote Apache Hive. made Hbase@FB
     happen
   – Herded cats: first manager of Data Infra team
   – IT engineer/DBA: built ETL tools, warehouse/reporting for
     FB Virtual Currency

• Founder Qubole Inc. (2011-)
Why Hadoop Succeeded

• Complete Solution and Extensible
   – useful to Engineers, Data Scientists, Analysts
   – performance isn’t everything.
   – Agile – Businesses much faster than before

• Market Dynamics
   – Captive Super-Reference Customer – Yahoo
   – Had early market to itself for Long-Time

• Separation of Compute and Storage
   – Parallel Computing != Database
Why Hadoop Succeeded
        • Data Consolidation!
          – Just store everything in HDFS
          – MR/Hive/Pig can chew
            anything


        • Lights Out Architecture
DATA      – Low System Operational Cost
          – Low Data Management Cost
             • Don’t need Data Priests
DATA
Meta Takeaways
Adaptive Lights-Out Software

• Successful efforts:
   – Automatic map-join/skew join implementations
   – Automatic local mode, resource cache


• Failed:
   – Statistics: alter table analyze table
   – Pre-Bucketing tables


 Learning Frameworks for Systems Software
Adaptive Lights-Out Software

• Caching + Prefetching is Adaptive
   – Replication is not
   – Can bridge gap between Compute and Storage


• Page Cache over Disk >> In-memory
   – Degrades gracefully


• Provide APIs – not packages
Murphy’s Law
• No Trusted Components

• Defend everything
  – Rate-Limit access to every resource
  – Log and Monitor everything


• Clear and Overwhelming Force
  – Oversize it!


• Think QOS from Day-1
Open Source

• Small is Beautiful
  – Build small easy to use/understand components
  – Redis!


• Iterative Small Changes
  – Operators HATE large releases
  – Hive (2 weeks) vs. Hadoop (2 years?)
Opportunities
Interesting Problems - I

• Collaborative Analysis
  – Most analysis is Repeat
  – Tracking and Searching historical analysis


• Consistency Aware Querying
  – OLAP: Snapshots instead of live tables
  – OLTP: Lookup stale caches instead of master
Interesting Problems - II

• SQL is Rope
  – Better than procedural – but still Rope
  – Higher Level templates: moving averages


• Data = Mutating + Immutable
  – Immutable data is easy to manage
  – Cheap: One copy per data center (Facebook
    Haystack)
Think Services, not Software

• Software is getting less interesting
  – Even Distributed Systems Software


• Run/Operate long-running, hot services
  – Innovate inside this boundary
Q&A

More Related Content

What's hot

How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!Databricks
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick viewRajesh Nadipalli
 
Hadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureHadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureRyan Hennig
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentationargonauts007
 
HBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraphHBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraphHBaseCon
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudZhenxiao Luo
 
Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Tugdual Grall
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Sujee Maniyam
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystemDataiku
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Amazon Web Services
 
Best Practices for NoSQL Workloads on Amazon EC2 and Amazon EBS - February 20...
Best Practices for NoSQL Workloads on Amazon EC2 and Amazon EBS - February 20...Best Practices for NoSQL Workloads on Amazon EC2 and Amazon EBS - February 20...
Best Practices for NoSQL Workloads on Amazon EC2 and Amazon EBS - February 20...Amazon Web Services
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkDatabricks
 
Hadoop at ayasdi
Hadoop at ayasdiHadoop at ayasdi
Hadoop at ayasdiMohit Jaggi
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseMichael Stack
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
Tuning up with Apache Tez
Tuning up with Apache TezTuning up with Apache Tez
Tuning up with Apache TezGal Vinograd
 
Introduction to MapReduce & hadoop
Introduction to MapReduce & hadoopIntroduction to MapReduce & hadoop
Introduction to MapReduce & hadoopColin Su
 

What's hot (20)

How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Hadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureHadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and Future
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
HBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraphHBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraph
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS Cloud
 
Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystem
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
Best Practices for NoSQL Workloads on Amazon EC2 and Amazon EBS - February 20...
Best Practices for NoSQL Workloads on Amazon EC2 and Amazon EBS - February 20...Best Practices for NoSQL Workloads on Amazon EC2 and Amazon EBS - February 20...
Best Practices for NoSQL Workloads on Amazon EC2 and Amazon EBS - February 20...
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache Spark
 
Hadoop at ayasdi
Hadoop at ayasdiHadoop at ayasdi
Hadoop at ayasdi
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
Tuning up with Apache Tez
Tuning up with Apache TezTuning up with Apache Tez
Tuning up with Apache Tez
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Introduction to MapReduce & hadoop
Introduction to MapReduce & hadoopIntroduction to MapReduce & hadoop
Introduction to MapReduce & hadoop
 

Similar to The Meta of Hadoop - COMAD 2012

Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Eric Baldeschwieler
 
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...Alex Gorbachev
 
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Global Business Events
 
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013StampedeCon
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoopinside-BigData.com
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game ChangerCaserta
 
Hadoop Infrastructure (Oct. 3rd, 2012)
Hadoop Infrastructure (Oct. 3rd, 2012)Hadoop Infrastructure (Oct. 3rd, 2012)
Hadoop Infrastructure (Oct. 3rd, 2012)John Dougherty
 
Hadoop as a data hub featuring sears
Hadoop as a data hub featuring searsHadoop as a data hub featuring sears
Hadoop as a data hub featuring searsDianna Doan
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2tcloudcomputing-tw
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Big data and mstr bridge the elephant
Big data and mstr   bridge the elephantBig data and mstr   bridge the elephant
Big data and mstr bridge the elephantKognitio
 
INTRODUCTION TO BIG DATA HADOOP
INTRODUCTION TO BIG DATA HADOOPINTRODUCTION TO BIG DATA HADOOP
INTRODUCTION TO BIG DATA HADOOPKrishna Sujeer
 
Hadoop, Infrastructure and Stack
Hadoop, Infrastructure and StackHadoop, Infrastructure and Stack
Hadoop, Infrastructure and StackJohn Dougherty
 

Similar to The Meta of Hadoop - COMAD 2012 (20)

Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
 
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
 
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
 
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Hadoop Infrastructure (Oct. 3rd, 2012)
Hadoop Infrastructure (Oct. 3rd, 2012)Hadoop Infrastructure (Oct. 3rd, 2012)
Hadoop Infrastructure (Oct. 3rd, 2012)
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop as a data hub featuring sears
Hadoop as a data hub featuring searsHadoop as a data hub featuring sears
Hadoop as a data hub featuring sears
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Big data and mstr bridge the elephant
Big data and mstr   bridge the elephantBig data and mstr   bridge the elephant
Big data and mstr bridge the elephant
 
INTRODUCTION TO BIG DATA HADOOP
INTRODUCTION TO BIG DATA HADOOPINTRODUCTION TO BIG DATA HADOOP
INTRODUCTION TO BIG DATA HADOOP
 
Hadoop, Infrastructure and Stack
Hadoop, Infrastructure and StackHadoop, Infrastructure and Stack
Hadoop, Infrastructure and Stack
 

The Meta of Hadoop - COMAD 2012

  • 1. The Meta of Hadoop Joydeep Sen Sarma Ex-Facebook DI Lead, Founder Qubole
  • 2. Intro • File/Database Systems developer (ex- Netapp/Oracle) • Yahoo (2005-07), Facebook (2007-11) • @Facebook: – SysAdmin: operated massive Hadoop/Hive installs – Architect: conceived/wrote Apache Hive. made Hbase@FB happen – Herded cats: first manager of Data Infra team – IT engineer/DBA: built ETL tools, warehouse/reporting for FB Virtual Currency • Founder Qubole Inc. (2011-)
  • 3. Why Hadoop Succeeded • Complete Solution and Extensible – useful to Engineers, Data Scientists, Analysts – performance isn’t everything. – Agile – Businesses much faster than before • Market Dynamics – Captive Super-Reference Customer – Yahoo – Had early market to itself for Long-Time • Separation of Compute and Storage – Parallel Computing != Database
  • 4. Why Hadoop Succeeded • Data Consolidation! – Just store everything in HDFS – MR/Hive/Pig can chew anything • Lights Out Architecture DATA – Low System Operational Cost – Low Data Management Cost • Don’t need Data Priests DATA
  • 6. Adaptive Lights-Out Software • Successful efforts: – Automatic map-join/skew join implementations – Automatic local mode, resource cache • Failed: – Statistics: alter table analyze table – Pre-Bucketing tables  Learning Frameworks for Systems Software
  • 7. Adaptive Lights-Out Software • Caching + Prefetching is Adaptive – Replication is not – Can bridge gap between Compute and Storage • Page Cache over Disk >> In-memory – Degrades gracefully • Provide APIs – not packages
  • 8. Murphy’s Law • No Trusted Components • Defend everything – Rate-Limit access to every resource – Log and Monitor everything • Clear and Overwhelming Force – Oversize it! • Think QOS from Day-1
  • 9. Open Source • Small is Beautiful – Build small easy to use/understand components – Redis! • Iterative Small Changes – Operators HATE large releases – Hive (2 weeks) vs. Hadoop (2 years?)
  • 11. Interesting Problems - I • Collaborative Analysis – Most analysis is Repeat – Tracking and Searching historical analysis • Consistency Aware Querying – OLAP: Snapshots instead of live tables – OLTP: Lookup stale caches instead of master
  • 12. Interesting Problems - II • SQL is Rope – Better than procedural – but still Rope – Higher Level templates: moving averages • Data = Mutating + Immutable – Immutable data is easy to manage – Cheap: One copy per data center (Facebook Haystack)
  • 13. Think Services, not Software • Software is getting less interesting – Even Distributed Systems Software • Run/Operate long-running, hot services – Innovate inside this boundary
  • 14. Q&A

Editor's Notes

  1. Intro- self, Qubole. In this video, we will see how users setup a Qubole Cluster in 3 simple steps..Those 3 steps are…
  2. They are…123Now lets look at the details of each step, starting with step #1.
  3. Intro- self, Qubole. In this video, we will see how users setup a Qubole Cluster in 3 simple steps..Those 3 steps are…
  4. Intro- self, Qubole. In this video, we will see how users setup a Qubole Cluster in 3 simple steps..Those 3 steps are…
  5. Intro- self, Qubole. In this video, we will see how users setup a Qubole Cluster in 3 simple steps..Those 3 steps are…