Hadoop

Notable Hadoop users...
Yahoo! LinkedIn

Facebook New York Times

Twitter Rackspace

Baidu eHarmony

eBay Powerset
http://wiki.apache.org/hadoop/PoweredBy

Hadoop in the Real World...

Recommendation systems         Financial analysis

Natural Language               Correlation engines
Processing (NLP)

Data warehousing               Image/video processing

Market research/forecasting    Log analysis

Finance                        Social networking

Health & Life Sciences         Academic research

Government                     Telecommunications

Inspired by the Google File System (GFS)
and MapReduce papers, circa 2003-2004

Created by Doug Cutting

Originally built to support distribution
for Nutch search engine

Named after a stuffed elephant

OK, So what exactly
is Hadoop?

An open source...

batch/offline oriented...

data & I/O intensive...

general purpose framework for
creating distributed applications that
process huge amounts of data.

One definition of "huge"
25,000 machines

More than 10 clusters

3 petabytes of data (compressed, unreplicated)

700+ users

10,000+ jobs/week

Hadoop Major Components:

Distributed File System
(HDFS)

Map/Reduce System

But first, what
isn't Hadoop?

Hadoop is NOT:
...a relational database!

...an online transaction processing (OLTP) system!

...a structured data store of any kind!

Hadoop                         Relational

Scale-out                      Scale-up(*)
Key/value pairs                Tables
Say how to process the data    Say what you want (SQL)
Offline/batch                  Online/real-time

(*) Sharding attempts to horizontally scale RDBMS, but is difficult at best

HDFS
(Hadoop Distributed File System)

Data is distributed and replicated
over multiple machines

Designed for large files
(where "large" means GB to TB)

Block oriented

Linux-style commands, e.g. ls, cp,
mv, rm, etc.

NameNode
File Block Mappings:

/user/aaron/data1.txt -> 1, 2, 3
/user/aaron/data2.txt -> 4, 5
/user/andrew/data3.txt -> 6, 7

DataNode(s)

(Diagram: blocks 1-7 spread across the DataNodes, three replicas of each block)

fault tolerant when nodes fail

Self-healing rebalances files across cluster

scalable just by adding new nodes!
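
The bookkeeping above can be sketched as a toy, in-memory model in plain Java with no Hadoop dependency. The file-to-block mappings are the ones from the NameNode slide; everything else is an assumption made for illustration: the class and method names (`BlockMapSketch`, `failNode`, `underReplicated`), the five node names, and the round-robin placement of three replicas per block. Real HDFS placement and re-replication are considerably more involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of NameNode bookkeeping; not the real HDFS implementation.
class BlockMapSketch {
    // file path -> ordered list of block ids (from the slide)
    final Map<String, List<Integer>> fileBlocks = new LinkedHashMap<>();
    // block id -> DataNodes holding a replica (hypothetical placement)
    final Map<Integer, Set<String>> blockLocations = new LinkedHashMap<>();

    void addFile(String path, List<Integer> blocks) {
        fileBlocks.put(path, blocks);
    }

    void addReplica(int block, String dataNode) {
        blockLocations.computeIfAbsent(block, b -> new TreeSet<>()).add(dataNode);
    }

    // Drop a failed node from every block's replica set.
    void failNode(String dataNode) {
        for (Set<String> nodes : blockLocations.values()) {
            nodes.remove(dataNode);
        }
    }

    // Blocks with fewer live replicas than the target replication factor.
    List<Integer> underReplicated(int replication) {
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, Set<String>> e : blockLocations.entrySet()) {
            if (e.getValue().size() < replication) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    static BlockMapSketch example() {
        BlockMapSketch nn = new BlockMapSketch();
        nn.addFile("/user/aaron/data1.txt", List.of(1, 2, 3));
        nn.addFile("/user/aaron/data2.txt", List.of(4, 5));
        nn.addFile("/user/andrew/data3.txt", List.of(6, 7));
        String[] nodes = {"dn1", "dn2", "dn3", "dn4", "dn5"};
        // Place 3 replicas of each block on consecutive nodes (round-robin).
        for (int block = 1; block <= 7; block++) {
            for (int r = 0; r < 3; r++) {
                nn.addReplica(block, nodes[(block + r) % nodes.length]);
            }
        }
        return nn;
    }

    public static void main(String[] args) {
        BlockMapSketch nn = example();
        nn.failNode("dn3");
        // These blocks would need a new replica on one of the surviving nodes.
        System.out.println("Under-replicated after dn3 fails: " + nn.underReplicated(3));
    }
}
```

In this sketch no block is lost when a single node fails, because every block still has two live replicas; "self-healing" then amounts to copying the under-replicated blocks to other nodes.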

Split input files (e.g. by HDFS blocks)

Operate on key/value pairs

Mappers filter & transform input data

Reducers aggregate mapper output

map:
(K1, V1) -> list(K2, V2)

reduce:
(K2, list(V2)) -> list(K3, V3)

Word Count
(the canonical Map/Reduce example)

the quick brown fox
jumped over
the lazy brown dog

map phase - inputs
(K1, V1)

(0, "the quick brown fox")

(20, "jumped over")

(32, "the lazy brown dog")

map phase - outputs
list(K2, V2)

("the", 1) ("quick", 1)

("brown", 1) ("fox", 1)

("jumped", 1) ("over", 1)

("the", 1) ("lazy", 1)

("brown", 1) ("dog", 1)

reduce phase - inputs
(K2, list(V2))

("brown", (1, 1)) ("dog", (1))

("fox", (1)) ("jumped", (1))

("lazy", (1)) ("over", (1))

("quick", (1)) ("the", (1, 1))

reduce phase - outputs
list(K3, V3)

("brown", 2) ("dog", 1)

("fox", 1) ("jumped", 1)

("lazy", 1) ("over", 1)

("quick", 1) ("the", 2)
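
The four phase slides above can be reproduced in a single process. The following is a minimal sketch in plain Java with no Hadoop dependency; the class `WordCountPhases` and its method names are made up for illustration, and the in-memory `shuffle` only imitates the grouping and sorting that the framework performs between mappers and reducers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Single-process imitation of the map, shuffle, and reduce phases.
class WordCountPhases {
    // map: (K1 = byte offset, V1 = line) -> list(K2 = word, V2 = 1)
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            out.add(Map.entry(st.nextToken(), 1));
        }
        return out;
    }

    // shuffle: group all mapper output by key, sorted by key
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // reduce: (K2 = word, list(V2) = counts) -> V3 = sum of the counts
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            mapped.addAll(map(offset, line));
            offset += line.length() + 1; // +1 for the newline, giving 0, 20, 32
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(
                "the quick brown fox", "jumped over", "the lazy brown dog")));
        // prints {brown=2, dog=1, fox=1, jumped=1, lazy=1, over=1, quick=1, the=2}
    }
}
```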

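import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SimpleWordCount
        extends Configured implements Tool {

    public static class MapClass
            extends Mapper<Object, Text, Text, IntWritable> {
        ...
    }

    public static class Reduce
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        ...
    }

    public int run(String[] args) throws Exception { ... }

    public static void main(String[] args) throws Exception { ... }
}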

public static class MapClass
        extends Mapper<Object, Text, Text, IntWritable> {

    // IntWritable wraps an int, so the literal must be 1, not 1L
    private static final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

        // Emit (word, 1) for every whitespace-separated token in the line
        StringTokenizer st = new StringTokenizer(value.toString());
        while (st.hasMoreTokens()) {
            word.set(st.nextToken());
            context.write(word, ONE);
        }
    }
}

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable count = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
                          Context context)
            throws IOException, InterruptedException {

        // Sum all of the counts emitted for this word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        count.set(sum);
        context.write(key, count);
    }
}

public int run(String[] args) throws Exception {
    Configuration conf = getConf();

    Job job = new Job(conf, "Counting Words");
    job.setJarByClass(SimpleWordCount.class);
    job.setMapperClass(MapClass.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    // The reducer writes IntWritable values, so declare that type here
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    return job.waitForCompletion(true) ? 0 : 1;
}

public static void main(String[] args) throws Exception {
    int result = ToolRunner.run(new Configuration(),
                                new SimpleWordCount(),
                                args);
    System.exit(result);
}

Map/Reduce Data Flow

(Image from Hadoop in Action...great book!)

Partitioning
Deciding which keys go to which reducer

Desire even distribution across reducers

Skewed data can overload a single reducer!
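
A minimal sketch of how keys can be assigned to reducers, and of what skew looks like. The `partition` method uses the same arithmetic as Hadoop's default `HashPartitioner` (mask off the sign bit, then take the remainder), but it hashes a plain `String` rather than a Hadoop `Text`, so the actual reducer numbers will differ from a real job. The class name `PartitionSketch`, the `load` helper, and the sample keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class PartitionSketch {
    // Mask off the sign bit so a negative hashCode cannot yield a negative index.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Count how many records each reducer would receive.
    static int[] load(List<String> keys, int numReducers) {
        int[] counts = new int[numReducers];
        for (String key : keys) {
            counts[partition(key, numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Skewed input: one very frequent key plus a few rare ones.
        List<String> keys = new ArrayList<>(Collections.nCopies(1000, "the"));
        keys.addAll(List.of("quick", "brown", "fox", "lazy", "dog"));

        // Every record with the same key lands on the same reducer,
        // so whichever reducer owns "the" gets at least 1000 records.
        System.out.println(Arrays.toString(load(keys, 4)));
    }
}
```

Because all values for a key must reach one reducer, a better hash function cannot fix this kind of skew; it usually calls for a combiner or a different key design.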

Map/Reduce Partitioning & Shuffling

(Image from Hadoop in Action...great book!)

Combiner
Effectively a reduce in the mappers

a.k.a. "Local Reduce"

Shuffling WordCount
                    data             # k/v pairs shuffled

without combiner    ("the", 1)       1000

with combiner       ("the", 1000)    1

(looking at one mapper that sees the word "the" 1000 times)
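
The numbers in the table above can be checked with a small simulation. This is a sketch in plain Java rather than Hadoop code; `CombinerSketch`, `mapOutput`, and `combine` are illustrative names, and the "shuffle" is represented only by the number of pairs that would leave the mapper.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
    // Without a combiner the mapper emits one (word, 1) pair per occurrence.
    static List<Map.Entry<String, Integer>> mapOutput(List<String> words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : words) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // A combiner sums the counts per key on the mapper side, before the shuffle.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            combined.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // One mapper that sees the word "the" 1000 times.
        List<Map.Entry<String, Integer>> raw = mapOutput(Collections.nCopies(1000, "the"));
        Map<String, Integer> combined = combine(raw);

        System.out.println("without combiner: " + raw.size() + " pairs shuffled");
        System.out.println("with combiner:    " + combined.size() + " pair shuffled " + combined);
    }
}
```

Summing is associative and commutative, which is why a local reduce gives the same final answer here. In the word count job shown earlier, the same effect would likely be achieved by registering the reducer as the combiner with `job.setCombinerClass(Reduce.class)`.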

Advanced Map/Reduce
Hadoop Streaming

Chaining Map/Reduce jobs

Joining data

Bloom filters

HDFS: NameNode, Secondary NameNode, DataNode

Map/Reduce: JobTracker, TaskTracker

Secondary
NameNode

NameNode JobTracker

DataNode1 DataNode2 DataNodeN

TaskTracker1 TaskTracker2 TaskTrackerN
map map map

reduce reduce reduce

NameNode
Bookkeeper for HDFS

Manages DataNodes

Should not store data or run jobs

Single point of failure!

DataNode
Store actual file blocks on disk

Does not store entire files!

Report block info to NameNode

Receive instructions from NameNode

Secondary NameNode

Snapshot of NameNode

Not a failover server for NameNode!

Help minimize downtime/data loss
if NameNode fails

JobTracker

Partition tasks across HDFS cluster

Track map/reduce tasks

Re-start failed tasks on different nodes

Speculative execution

TaskTracker

Track individual map & reduce tasks

Report progress to JobTracker

distributed processing

distributed debugging

Logs
View task logs on machine where
specific task was processed
(or via web UI)

$HADOOP_HOME/logs/userlogs on task tracker

Counters
Define one or more counters

Increment counters during map/reduce tasks

Counter values displayed in job tracker UI

IsolationRunner

Re-run failed tasks with original input data

Must set keep.failed.task.files to 'true'

Skipping Bad Records
Data may not always be clean

New data may have new interesting twists

Can you pre-process to filter & validate input?
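
One way to combine the two ideas above, validating input and counting what was rejected, is sketched below in plain Java. It assumes records shaped like the patent citation lines used later in this deck (two comma-separated integers); the class `RecordFilterSketch`, the `RecordCounter` enum, and the in-memory counter map are hypothetical stand-ins for Hadoop's counters. In a real mapper the increment would probably be written as `context.getCounter(RecordCounter.BAD_RECORDS).increment(1)`.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class RecordFilterSketch {
    enum RecordCounter { GOOD_RECORDS, BAD_RECORDS }

    // Stand-in for job counters; Hadoop would aggregate these across tasks.
    final Map<RecordCounter, Long> counters = new EnumMap<>(RecordCounter.class);

    // Returns {citing, cited}, or null when the line is not two integers.
    static long[] parse(String line) {
        String[] parts = line.split(",", -1);
        if (parts.length != 2) {
            return null;
        }
        try {
            return new long[] {
                Long.parseLong(parts[0].trim()),
                Long.parseLong(parts[1].trim())
            };
        } catch (NumberFormatException e) {
            return null; // e.g. the quoted header row
        }
    }

    // Keep the clean records and count both outcomes instead of failing the task.
    List<long[]> filter(List<String> lines) {
        List<long[]> good = new ArrayList<>();
        for (String line : lines) {
            long[] record = parse(line);
            if (record == null) {
                counters.merge(RecordCounter.BAD_RECORDS, 1L, Long::sum);
            } else {
                counters.merge(RecordCounter.GOOD_RECORDS, 1L, Long::sum);
                good.add(record);
            }
        }
        return good;
    }

    public static void main(String[] args) {
        RecordFilterSketch sketch = new RecordFilterSketch();
        sketch.filter(List.of(
                "\"CITING\",\"CITED\"", "3858241,956203", "not a record", "3858242,1515701"));
        System.out.println(sketch.counters);
    }
}
```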

Speculative execution          Use a Combiner
(on by default)

Reduce amount of               JVM Re-use
input data                     (be careful)

Refactor code/algorithms       Data compression

Lots of knobs                  Trash can

Needs active management        Add/remove data nodes

Network topology/              "Fair" scheduling
rack awareness

NameNode/SNN management        Permissions/quotas

Hive
Simulate structure for data stored in Hadoop

Query language analogous to SQL (Hive QL)

Translates queries into Map/Reduce job(s)...

...so not for real-time processing!

Queries:
Projection Joins (inner, outer, semi)

Grouping Aggregation

Sub-queries Multi-table insert

Customizable:
User-defined functions

Input/output formats with SerDe

Patent citation dataset
http://www.nber.org/patents

/user/sleberkn/nber-patent/tables/patent_citation/cite75_99.txt

"CITING","CITED"
3858241,956203
3858241,1324234
3858241,3398406
3858241,3557384
3858241,3634889
3858242,1515701
3858242,3319261
3858242,3668705
3858242,3707004
3858243,2949611
3858243,3146465
3858243,3156927
3858243,3221341
3858243,3574238
...

create external table patent_citations (citing string, cited string)
row format delimited fields terminated by ','
stored as textfile
location '/user/sleberkn/nber-patent/tables/patent_citation';

create table citation_histogram (num_citations int, count int)
stored as sequencefile;

insert overwrite table citation_histogram
select num_citations, count(num_citations) from
(select cited, count(cited) as num_citations
from patent_citations group by cited) citation_counts
group by num_citations
order by num_citations;
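
To make the two levels of grouping in the histogram query easier to follow, here is a rough in-memory equivalent in plain Java. It is only a sketch of what the query computes, not of how Hive executes it; the class `CitationHistogramSketch` and the sample rows with patents "A" to "C" citing "X" to "Z" are invented for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class CitationHistogramSketch {
    // Each row is {citing, cited}, like one line of the patent_citations table.
    static TreeMap<Long, Long> histogram(List<String[]> rows) {
        // Inner query: group by cited, giving num_citations per cited patent.
        Map<String, Long> citationCounts = rows.stream()
                .collect(Collectors.groupingBy(row -> row[1], Collectors.counting()));

        // Outer query: group by num_citations and count patents in each group.
        // The TreeMap plays the role of "order by num_citations".
        return citationCounts.values().stream()
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[] {"A", "X"},
                new String[] {"B", "X"},
                new String[] {"C", "X"},
                new String[] {"A", "Y"},
                new String[] {"B", "Z"});
        // X is cited 3 times; Y and Z once each.
        System.out.println(histogram(rows)); // prints {1=2, 3=1}
    }
}
```

Each `groupingBy` corresponds roughly to one group-by stage in the query, which is why Hive is likely to need more than one Map/Reduce job to answer it.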

Amazon EC2 + S3
EC2 instances are compute nodes (Map/Reduce)

Storage options:

HDFS on EC2 nodes

HDFS on EC2 nodes loading data from S3

Native S3 (bypasses HDFS)

Amazon Elastic MapReduce
Interact via web-based console

Submit Map/Reduce job
(streaming, Hive, Pig, or JAR)

EMR configures & launches Hadoop cluster for job

Uses S3 for data input/output

Hadoop = HDFS + Map/Reduce

Distributed, parallel processing

Designed for fault tolerance

Horizontal scale-out

Structure & queries via Hive

http://hadoop.apache.org/

http://hadoop.apache.org/hive/

Hadoop in Action
http://www.manning.com/lam/

Hadoop: The Definitive Guide, 2nd ed.
http://oreilly.com/catalog/0636920010388

Yahoo! Hadoop blog
http://developer.yahoo.net/blogs/hadoop/

Cloudera
http://www.cloudera.com/

http://lmgtfy.com/?q=hadoop

http://www.letmebingthatforyou.com/?q=hadoop

(my info)

scott.leberknight@nearinfinity.com
www.nearinfinity.com/blogs/
twitter: sleberknight
