Doug Cutting on the State of the Hadoop Ecosystem

Doug Cutting, Apache Hadoop co-founder, explains how the growth of the Hadoop ecosystem has made Hadoop a much more powerful platform, and how its continued expansion will lead to great things.



  1. The Hadoop Ecosystem: Hidden Gems
     Doug Cutting, Chief Architect, Cloudera; Chairman, Apache Software Foundation
  2. Expanding Hadoop Ecosystem
     • Hadoop: the kernel
     • HDFS: scalable storage
     • MapReduce: scalable computation
     • HBase & Accumulo: online key/value store
     • Pig & Hive: query languages
     • Sqoop: RDBMS integration
     • Flume: data collection
     • Oozie: workflow
     • Whirr: cloud deployment
     • Mahout: machine learning
  3. Some Hidden Gems
     • YARN
     • Crunch
     • Avro
     • Trevni
  4. YARN (Yet Another Resource Negotiator)
     • generic scheduler for distributed applications
       o will permit non-MapReduce applications
     • consists of:
       o Resource Manager (per cluster)
       o Node Manager (per node)
         § runs Application Masters (per job)
         § & Application Containers (per task)
     • in Hadoop 2.0
       o replaces JobTracker & TaskTracker (MR1)
     (a client-side sketch follows this slide)
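To give a feel for how a client talks to this scheduler, here is a minimal sketch against the Hadoop 2.x YarnClient API (org.apache.hadoop.yarn.client.api); it only asks the Resource Manager for the cluster's running Node Managers, it assumes a yarn-site.xml is on the classpath, and the class name ListYarnNodes is ours, not from the deck.

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.yarn.api.records.NodeReport;
     import org.apache.hadoop.yarn.api.records.NodeState;
     import org.apache.hadoop.yarn.client.api.YarnClient;

     public class ListYarnNodes {
       public static void main(String[] args) throws Exception {
         // Connects to the (per-cluster) Resource Manager using the
         // Hadoop configuration found on the classpath.
         YarnClient yarn = YarnClient.createYarnClient();
         yarn.init(new Configuration());
         yarn.start();

         // Each NodeReport describes one (per-node) Node Manager.
         for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
           System.out.println(node.getNodeId()
               + " containers=" + node.getNumContainers());
         }
         yarn.stop();
       }
     }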
  5. YARN: MR2
     [architecture diagram: clients submit jobs to the Resource Manager; a
     per-job App Master runs in a container on a Node Manager, requests
     resources from the Resource Manager, and tracks MapReduce status for
     its tasks, which run in further containers on the Node Managers.
     CDH4 includes both MR1 & MR2.]
  6. Crunch
     • an API for MapReduce
       o alternative to Pig & Hive
       o inspired by Google's FlumeJava paper
       o in Java (& Scala)
     • easier to integrate application logic
       o with a full programming language
     • concepts:
       o PCollection: set of values w/ parallelDo operation
       o PTable: key/value mapping w/ groupBy operation
       o Pipeline: executor that runs MapReduce jobs
  7. Crunch Word Count
     import org.apache.crunch.DoFn;
     import org.apache.crunch.Emitter;
     import org.apache.crunch.PCollection;
     import org.apache.crunch.PTable;
     import org.apache.crunch.Pipeline;
     import org.apache.crunch.impl.mr.MRPipeline;
     import org.apache.crunch.lib.Aggregate;
     import org.apache.crunch.types.writable.Writables;

     public class WordCount {
       public static void main(String[] args) throws Exception {
         // Plans and runs the underlying MapReduce jobs
         Pipeline pipeline = new MRPipeline(WordCount.class);
         PCollection<String> lines = pipeline.readTextFile(args[0]);
         // parallelDo: apply a DoFn to every element, in parallel
         PCollection<String> words = lines.parallelDo("my splitter",
             new DoFn<String, String>() {
               public void process(String line, Emitter<String> emitter) {
                 for (String word : line.split("\\s+")) {
                   emitter.emit(word);
                 }
               }
             }, Writables.strings());
         // group identical words and count occurrences
         PTable<String, Long> counts = Aggregate.count(words);
         pipeline.writeTextFile(counts, args[1]);
         pipeline.run();
       }
     }
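Assuming the class above is packaged in a jar along with the Crunch dependencies, a run might look like the following; the jar name and paths here are hypothetical:

     hadoop jar crunch-wordcount.jar WordCount input.txt output

The word counts are then written as text files under the output directory.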
  8. Scrunch Word Count
     class WordCountExample {
       val pipeline = new Pipeline[WordCountExample]

       def wordCount(fileName: String) = {
         pipeline.read(from.textFile(fileName))
                 .flatMap(_.toLowerCase.split("\\W+"))
                 .filter(!_.isEmpty())
                 .count
       }
     }
  9. Avro: a format for Big Data
     • expressive
       o records, arrays, unions, enums
     • efficient
       o compact binary, compressed, splittable
     • interoperable
       o langs: C, C++, C#, Java, Perl, Python, Ruby, PHP
       o tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc.
     • dynamic
       o can read & write without generating code
         (see the sketch after this slide)
     • evolvable
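To make the "dynamic" bullet concrete, here is a minimal sketch using Avro's generic API in Java: the schema is parsed at runtime and records are written and read back with no generated classes. The record matches the X { name, id, size } example on the next slide; the file name x.avro is just for illustration.

     import java.io.File;
     import org.apache.avro.Schema;
     import org.apache.avro.file.DataFileReader;
     import org.apache.avro.file.DataFileWriter;
     import org.apache.avro.generic.GenericData;
     import org.apache.avro.generic.GenericDatumReader;
     import org.apache.avro.generic.GenericDatumWriter;
     import org.apache.avro.generic.GenericRecord;

     public class AvroDynamic {
       public static void main(String[] args) throws Exception {
         // Schema defined at runtime; no code generation required
         Schema schema = new Schema.Parser().parse(
             "{\"type\":\"record\",\"name\":\"X\",\"fields\":["
           + "{\"name\":\"name\",\"type\":\"string\"},"
           + "{\"name\":\"id\",\"type\":\"long\"},"
           + "{\"name\":\"size\",\"type\":\"int\"}]}");

         GenericRecord rec = new GenericData.Record(schema);
         rec.put("name", "Foo");
         rec.put("id", 0L);
         rec.put("size", 5);

         // Write a compact binary container file
         File file = new File("x.avro");
         DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
             new GenericDatumWriter<GenericRecord>(schema));
         writer.create(schema, file);
         writer.append(rec);
         writer.close();

         // Read it back; the schema is embedded in the file
         DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
             file, new GenericDatumReader<GenericRecord>());
         for (GenericRecord r : reader) {
           System.out.println(r.get("name") + " " + r.get("id") + " " + r.get("size"));
         }
         reader.close();
       }
     }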
  10. Column Files
      [diagram: a record X { String name; long id; int size; } with rows
      (Foo, 0x0, 5), (Bar, 0x1, 7), (Baz, 0x2, 9). A row file (Avro,
      SequenceFile) stores whole records one after another; a column file
      (Trevni) stores all names together, then all ids, then all sizes.]
  11. Column Files
      • faster queries
        o only process columns in query
      • better compression
        o since like data is together (see the sketch after this slide)
      • data set split into row groups
        o to permit parallelism
      • to localize processing,
        o row group should be in single HDFS block
      • independent of record serialization format
        o need shredder
      • primary format?
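One way to convince yourself of the compression bullet: deflate the same synthetic records laid out row-major versus column-major and compare sizes. This is a toy illustration with java.util.zip, not any real file format; the data and class name are made up.

     import java.util.zip.Deflater;

     public class ColumnLayoutDemo {
       // Deflate a buffer and return its compressed size in bytes.
       static int compressedSize(byte[] input) {
         Deflater deflater = new Deflater();
         deflater.setInput(input);
         deflater.finish();
         byte[] out = new byte[input.length * 2 + 64];
         int n = 0;
         while (!deflater.finished()) {
           n += deflater.deflate(out, n, out.length - n);
         }
         deflater.end();
         return n;
       }

       public static void main(String[] args) {
         StringBuilder rows = new StringBuilder();
         StringBuilder names = new StringBuilder();
         StringBuilder ids = new StringBuilder();
         StringBuilder sizes = new StringBuilder();
         for (int i = 0; i < 10000; i++) {
           String name = "user" + (i % 50);  // low-cardinality column
           // row-major: fields of each record interleaved
           rows.append(name).append(',').append(i).append(',').append(i % 10).append('\n');
           // column-major: like data kept together
           names.append(name).append('\n');
           ids.append(i).append('\n');
           sizes.append(i % 10).append('\n');
         }
         System.out.println("row-major compressed:    "
             + compressedSize(rows.toString().getBytes()));
         System.out.println("column-major compressed: "
             + (compressedSize(names.toString().getBytes())
              + compressedSize(ids.toString().getBytes())
              + compressedSize(sizes.toString().getBytes())));
       }
     }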
  12. Trevni: a column file format
      • one row group per file
        o & one file per HDFS block
        o minimizes seeks, localizes query
      • shredder & assembler for Avro records
        o supports nested structures
      • compression codec per column
      • in Avro 1.7.3+
