Doug Cutting, Apache Hadoop Co-founder, explains how the growth of the Hadoop ecosystem has made Hadoop a much more powerful machine, and how the continued expansion will lead to great things.
4. YARN (Yet Another Resource Negotiator)
• generic scheduler for distributed applications
  o will permit non-MapReduce applications
• consists of:
  o Resource Manager (per cluster)
  o Node Manager (per node)
    § runs Application Masters (per job)
    § & Application Containers (per task)
• in Hadoop 2.0
  o replaces JobTracker & TaskTracker (MR1)
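The negotiation flow on this slide can be sketched as a toy model. This is illustrative only, not the real org.apache.hadoop.yarn API: the class and method names below are made up. It shows the division of labor described above: a per-cluster Resource Manager grants containers out of per-node capacity that Node Managers track, and an Application Master (one per job) requests one container per task.

```java
import java.util.*;

// Toy model of YARN's resource negotiation (illustrative; not the real
// org.apache.hadoop.yarn API). The ResourceManager is cluster-wide; each
// NodeManager tracks free container slots on one node; an application
// master asks the RM for containers and receives grants spread over nodes.
public class YarnSketch {
    static class NodeManager {
        final String host;
        int freeContainers;
        NodeManager(String host, int capacity) {
            this.host = host;
            this.freeContainers = capacity;
        }
    }

    static class ResourceManager {
        final List<NodeManager> nodes = new ArrayList<>();
        void register(NodeManager nm) { nodes.add(nm); }

        // Grant up to 'requested' containers from nodes with free slots.
        List<String> allocate(int requested) {
            List<String> granted = new ArrayList<>();
            for (NodeManager nm : nodes) {
                while (nm.freeContainers > 0 && granted.size() < requested) {
                    nm.freeContainers--;
                    granted.add(nm.host + "/container-" + (granted.size() + 1));
                }
            }
            return granted; // may be fewer than requested if the cluster is full
        }
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager();
        rm.register(new NodeManager("node1", 2));
        rm.register(new NodeManager("node2", 2));
        // An application master (per job) requests containers (per task).
        List<String> containers = rm.allocate(3);
        System.out.println(containers);
    }
}
```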
5. YARN: MR2
[Diagram: a Client submits a job to the per-cluster Resource Manager; per-node Node Managers report node status to the Resource Manager and host App Masters and Containers; each App Master sends resource requests to the Resource Manager, and MapReduce status flows from the Containers back to the App Master.]
CDH4 includes both MR1 & MR2
6. Crunch
• an API for MapReduce
  o alternative to Pig & Hive
  o inspired by Google's FlumeJava paper
  o in Java (& Scala)
• easier to integrate application logic
  o with a full programming language
• concepts:
  o PCollection: set of values w/ parallelDo operation
  o PTable: key/value mapping w/ groupBy operation
  o Pipeline: executor that runs MapReduce jobs
7. Crunch Word Count
public class WordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, args[1]);
    pipeline.run();
  }
}
8. Scrunch Word Count
class WordCountExample {
  val pipeline = new Pipeline[WordCountExample]
  def wordCount(fileName: String) = {
    pipeline.read(from.textFile(fileName))
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(!_.isEmpty())
      .count
  }
}
9. Avro: a format for Big Data
• expressive
  o records, arrays, unions, enums
• efficient
  o compact binary, compressed, splittable
• interoperable
  o langs: C, C++, C#, Java, Perl, Python, Ruby, PHP
  o tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc.
• dynamic
  o can read & write without generating code
• evolvable
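The "evolvable" point can be illustrated with a toy resolver. This is not the real Avro API (Avro does this with writer/reader schemas and field defaults during datum reading); it is a minimal sketch of the idea: a reader with a newer schema fills in declared defaults for fields the writer's older schema lacked.

```java
import java.util.*;

// Toy illustration of schema evolution (not the real Avro API): resolve a
// record written under an older schema against the reader's newer schema,
// using declared defaults for fields the writer did not know about.
public class SchemaEvolutionSketch {
    // readerDefaults: field name -> default value (null = no default, field
    // must come from the written data).
    static Map<String, Object> resolve(Map<String, Object> readerDefaults,
                                       Map<String, Object> written) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            String name = field.getKey();
            if (written.containsKey(name)) {
                record.put(name, written.get(name)); // present in old data
            } else {
                record.put(name, field.getValue());  // new field: use default
            }
        }
        return record;
    }

    public static void main(String[] args) {
        // Writer produced records before a "size" field existed.
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("name", "Foo");
        written.put("id", 0L);

        // Reader's newer schema adds "size" with a default of 0.
        Map<String, Object> readerDefaults = new LinkedHashMap<>();
        readerDefaults.put("name", null);
        readerDefaults.put("id", null);
        readerDefaults.put("size", 0);

        System.out.println(resolve(readerDefaults, written));
        // {name=Foo, id=0, size=0}
    }
}
```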
10. Column Files
record X {
  String name;
  long id;
  int size;
}
Example records: (Foo, 0x0, 5), (Bar, 0x1, 7), (Baz, 0x2, 9)
• Row File (Avro, SequenceFile): records stored one after another:
  Foo 0x0 5 | Bar 0x1 7 | Baz 0x2 9 | ...
• Column File (Trevni): values stored column by column:
  names: Foo, Bar, Baz, ...; ids: 0x0, 0x1, 0x2, ...; sizes: 5, 7, 9, ...
11. Column Files
• faster queries
  o only process columns in query
• better compression
  o since like data is together
• data set split into row groups
  o to permit parallelism
• to localize processing,
  o row group should be in single HDFS block
• independent of record serialization format
  o need shredder
• primary format?
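The "like data is together" claim can be demonstrated with standard-library compression. The sketch below (made-up records, not Trevni's actual encoding) serializes the same records row-wise and column-wise, then compares deflate-compressed sizes: grouping a column of repeated sizes and a column of shared-prefix names gives the codec longer runs of similar bytes.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Toy demonstration of why column layouts compress better: identical records
// serialized row-wise vs. column-wise, compared after deflate compression.
public class ColumnLayoutDemo {
    // Deflate-compress a byte array and return the compressed size.
    static int compressedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.size();
    }

    // Build the same (name, id, size) records in both layouts:
    // [0] row-wise (one record per line), [1] column-wise (names, ids, sizes).
    static String[] layouts(int n) {
        StringBuilder rows = new StringBuilder();
        StringBuilder names = new StringBuilder();
        StringBuilder ids = new StringBuilder();
        StringBuilder sizes = new StringBuilder();
        for (int i = 0; i < n; i++) {
            String name = "user-" + i, id = Integer.toHexString(i), size = "5";
            rows.append(name).append(',').append(id).append(',').append(size).append('\n');
            names.append(name).append('\n');
            ids.append(id).append('\n');
            sizes.append(size).append('\n');
        }
        return new String[] { rows.toString(), names.toString() + ids + sizes };
    }

    public static void main(String[] args) {
        String[] l = layouts(1000);
        System.out.println("row-wise compressed:    " + compressedSize(l[0].getBytes()) + " bytes");
        System.out.println("column-wise compressed: " + compressedSize(l[1].getBytes()) + " bytes");
        // The column-wise layout compresses smaller: like data sits together.
    }
}
```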
12. Trevni: a column file format
• one row group per file
  o & one file per HDFS block
  o minimizes seeks, localizes query
• shredder & assembler for Avro records
  o supports nested structures
• compression codec per column
• in Avro 1.7.3+