The Meta of Hadoop - COMAD 2012
What do you talk about to a hall full of database gurus? Instead of science, my talk focused on the art. What made Hadoop successful? What can we learn from it? What principles work well in building software for large-scale services? What are some interesting unsolved problems in a world overrun by open source (and VC investments :-))?


Comments
  • @vshreepadma - did your customers' run-time environment benefit (from an efficiency point of view) from collecting stats? If so - I am glad it worked out, and clearly I need to come up to speed. I tend to pick the easiest, biggest-bang projects to go after first. There are many runtime improvements to Hadoop and Hive that I think are still not done that could make dramatic improvements in speed and usability. (Many of these need to be in Hadoop land - which is complicated because of hadoop-1/2.)

    As an example - one common stats-related problem we see is the case of zero-cookies (null userids). This is a common cause of skew. A pointed effort to identify this pattern would help a lot of internet companies, and it hardly requires much difficult stats gathering.

    One of the things I tried to convey in the talk is to take a service-oriented mindset. For a company like ours providing a service - we also want to collect stats (not so much for efficiency right now - but just for helping users understand datasets) - and we would schedule this as low-priority work in the background (potentially using cheaper compute resources). So I would encourage taking a holistic approach. Could the next-generation Hive Server be used to control such background activities by default - with close coupling to the queuing systems in Hadoop (Fair/Capacity Scheduler) to minimize the impact on regular runtime? But still do it automatically.

    Constraints bring out the best in us. Could capturing stats on samples, in an automatic and transparent manner, be a good middle ground (if the end goal is to optimize the run-time)?
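    The zero-cookie problem described here - every null userid hashing to the same reducer - has a standard mitigation: salt the null keys so they fan out across reducers. A minimal sketch, assuming a hypothetical partition function (this is not Hive's actual implementation):

    ```python
    import random
    from collections import defaultdict

    def partition(key, n_reducers, n_salts=8):
        """Route a record to a reducer. Without salting, every null
        userid ("zero-cookie") would hash to the same reducer."""
        if key is None:
            # Hypothetical fix: tag nulls with a random salt so they
            # spread over up to n_salts reducers instead of one.
            key = ("__null__", random.randrange(n_salts))
        return hash(key) % n_reducers

    # Simulate a skewed log: 90% of events carry no userid.
    events = [None] * 9000 + [f"user{i}" for i in range(1000)]
    load = defaultdict(int)
    for k in events:
        load[partition(k, n_reducers=16)] += 1

    # With salting, no single reducer receives the bulk of the records.
    assert max(load.values()) < len(events) // 2
    ```

    A real engine would re-merge the salted groups afterwards (or simply drop null keys from the join); the point is that detecting this one pattern removes most of the skew.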
  • Thanks for the clarification. Recently, we contributed column-level statistics gathering to Hive. Here at Cloudera, we experimented on data sets that closely resemble those of some of our customers and noticed that some of the stats traditionally gathered on columns, such as NDVs and equi-height histograms, are computationally expensive. Computing these stats while loading/scanning the data would significantly impact the performance of the load/scan task. While I agree that it's good to build systems that require little tuning and support, there are cases such as the one outlined above where automating the task is not desirable for the sake of performance. I'm wondering what your take on it is.
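    The NDV statistic called expensive here can be approximated in a single pass with a k-minimum-values (KMV) sketch - one possible middle ground between full stats gathering and none. A sketch under illustrative parameters (not what Hive actually uses):

    ```python
    import hashlib
    import heapq

    def kmv_ndv(values, k=256):
        """Estimate the number of distinct values from the k smallest
        normalized hashes seen during one scan of the data."""
        hashes = set()
        for v in values:
            h = int.from_bytes(
                hashlib.md5(str(v).encode()).digest()[:8], "big"
            ) / 2**64  # pseudo-uniform in [0, 1)
            hashes.add(h)
        smallest = heapq.nsmallest(k, hashes)
        if len(smallest) < k:
            return len(smallest)            # fewer than k distinct values: exact
        return int((k - 1) / smallest[-1])  # classic KMV estimator

    data = [i % 1000 for i in range(100_000)]  # 1000 distinct values
    est = kmv_ndv(data)
    assert abs(est - 1000) / 1000 < 0.3        # close, at a fraction of the cost
    ```

    The sketch needs only k hashes of state per column, so it could plausibly ride along with a load/scan task without the cost of an exact distinct count.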
  • I also replied to another Hive contributor about the stats comment:

    My perspective is limited to a period of about a year or so when people tried to implement stats in Hive and deploy it at FB. It made no difference. Stats collection was not automated, nor were stats collected all the time. Within that timeframe - in spite of significant engineering effort - no impact was made on the runtime efficiency of queries. For me - this was a failure.

    I am not sure where things stand today - and while we run Hive as part of our service - I don't think we are collecting stats. A meta-point I was trying to make was that in a world where a lot of data is scanned only once, stats are an optimization. Adaptive strategies that discover near-optimal plans in the first pass are easier to manage. Similarly - query plans that cause support issues because the stats are out of date (for example) are also very bad. We need software that's really easy to support. Hence the title of the slide (adaptive lights-out software).
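    The "adaptive strategies that discover near-optimal plans in the first pass" can be illustrated with a join that picks its strategy from observed input sizes at run time, rather than from pre-collected (and possibly stale) statistics - roughly the idea behind Hive's automatic map-join. A hypothetical sketch (for brevity, both branches execute a hash join here; a real engine would shuffle and sort in the fallback):

    ```python
    def adaptive_join(left, right, key, memory_budget=10_000):
        """Pick a join strategy by looking at the actual inputs at
        execution time, with no reliance on catalog statistics."""
        small, big = sorted((left, right), key=len)
        if len(small) <= memory_budget:
            strategy = "map-join"         # broadcast the small side
        else:
            strategy = "sort-merge-join"  # a real engine would shuffle/sort

        # Demo simplification: always hash-join on the smaller side.
        index = {}
        for row in small:
            index.setdefault(row[key], []).append(row)
        pairs = [(s, b) for b in big for s in index.get(b[key], [])]
        return strategy, pairs

    users = [{"id": i} for i in range(3)]
    clicks = [{"id": i % 3} for i in range(100)]
    strategy, pairs = adaptive_join(users, clicks, "id")
    assert strategy == "map-join" and len(pairs) == 100
    ```

    Because the decision happens on the first pass over the real data, it cannot be wrong in the way a stale-statistics plan can.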
  • Hi Carl - apologies - there was a lot of verbal context to this talk. The point I tried to make to the audience was that the collection of statistics needed to be automatic and transparent. In a world where there was little explicit data management - if things were not automated, they failed. The initial revisions of statistics gathering required (AFAIK) explicit commands to gather stats. It also didn't gather stats as data was scanned (all the time). The project was completely useless.

    The perspective is not whether the software was written successfully or not - but whether it achieved the end goal.
  • You list statistics as a failed project but give no reason for this verdict. I'm really curious to know why you think this project didn't work out.
  • Transcript

    • 1. The Meta of Hadoop - Joydeep Sen Sarma, ex-Facebook DI Lead, Founder Qubole
    • 2. Intro
      • File/Database Systems developer (ex-Netapp/Oracle)
      • Yahoo (2005-07), Facebook (2007-11)
      • @Facebook:
        – SysAdmin: operated massive Hadoop/Hive installs
        – Architect: conceived/wrote Apache Hive; made HBase@FB happen
        – Herded cats: first manager of Data Infra team
        – IT engineer/DBA: built ETL tools, warehouse/reporting for FB Virtual Currency
      • Founder, Qubole Inc. (2011-)
    • 3. Why Hadoop Succeeded
      • Complete Solution and Extensible
        – Useful to Engineers, Data Scientists, Analysts
        – Performance isn't everything
        – Agile - businesses move much faster than before
      • Market Dynamics
        – Captive super-reference customer - Yahoo
        – Had the early market to itself for a long time
      • Separation of Compute and Storage
        – Parallel Computing != Database
    • 4. Why Hadoop Succeeded
      • Data Consolidation!
        – Just store everything in HDFS
        – MR/Hive/Pig can chew anything
      • Lights-Out Architecture
        – Low System Operational Cost
        – Low Data Management Cost
        – Don't need Data Priests
    • 5. Meta Takeaways
    • 6. Adaptive Lights-Out Software
      • Successful efforts:
        – Automatic map-join/skew-join implementations
        – Automatic local mode, resource cache
      • Failed:
        – Statistics: alter table / analyze table
        – Pre-bucketing tables
      • Learning Frameworks for Systems Software
    • 7. Adaptive Lights-Out Software
      • Caching + Prefetching is Adaptive
        – Replication is not
        – Can bridge the gap between Compute and Storage
      • Page Cache over Disk >> In-memory
        – Degrades gracefully
      • Provide APIs - not packages
    • 8. Murphy's Law
      • No Trusted Components
      • Defend everything
        – Rate-limit access to every resource
        – Log and Monitor everything
      • Clear and Overwhelming Force
        – Oversize it!
      • Think QoS from Day 1
    • 9. Open Source
      • Small is Beautiful
        – Build small, easy-to-use/understand components
        – Redis!
      • Iterative Small Changes
        – Operators HATE large releases
        – Hive (2 weeks) vs. Hadoop (2 years?)
    • 10. Opportunities
    • 11. Interesting Problems - I
      • Collaborative Analysis
        – Most analysis is repeated
        – Tracking and searching historical analyses
      • Consistency-Aware Querying
        – OLAP: snapshots instead of live tables
        – OLTP: look up stale caches instead of the master
    • 12. Interesting Problems - II
      • SQL is Rope
        – Better than procedural - but still rope
        – Higher-level templates: moving averages
      • Data = Mutating + Immutable
        – Immutable data is easy to manage
        – Cheap: one copy per data center (Facebook Haystack)
    • 13. Think Services, not Software
      • Software is getting less interesting
        – Even Distributed Systems Software
      • Run/operate long-running, hot services
        – Innovate inside this boundary
    • 14. Q&A
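
    Slide 8's advice to rate-limit access to every resource is usually realized with a token bucket per protected resource. A minimal sketch (the class and parameters are illustrative, not from the talk):

    ```python
    import time

    class TokenBucket:
        """One bucket per resource: tokens refill at a steady rate and
        requests are allowed only while tokens remain."""
        def __init__(self, rate, burst):
            self.rate, self.capacity = rate, burst
            self.tokens, self.last = burst, time.monotonic()

        def allow(self, cost=1.0):
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

    bucket = TokenBucket(rate=10, burst=5)  # 10 requests/s, bursts of 5
    granted = sum(bucket.allow() for _ in range(100))
    assert 5 <= granted <= 7  # the burst, plus any refill during the loop
    ```

    Because the bucket degrades to a steady drip under overload rather than failing outright, it fits the same "degrades gracefully" theme as slide 7.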