Intro To Hadoop

Bill Graham - @billgraham
Data Systems Engineer, Analytics Infrastructure
Info 290 - Analyzing Big Data With Twitter
UC Berkeley Information School
September 2012

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
Outline

  • What is Big Data?
  • Hadoop
    • HDFS
    • MapReduce
  • Twitter Analytics and Hadoop




                                   2
What is big data?


  • A bunch of data?
  • An industry?
  • An expertise?
  • A trend?
  • A cliche?




                       3
Wikipedia big data


 In information technology, big data is a loosely-defined
 term used to describe data sets so large and
 complex that they become awkward to work with
 using on-hand database management tools.




    Source: http://en.wikipedia.org/wiki/Big_data      4
How big is big?

  • 2008: Google processes 20 PB a day
  • 2009: Facebook has 2.5 PB user data + 15 TB/day
  • 2009: eBay has 6.5 PB user data + 50 TB/day
  • 2011: Yahoo! has 180-200 PB of data
  • 2012: Facebook ingests 500 TB/day



                                                   5
That’s a lot of data




   Credit: http://www.flickr.com/photos/19779889@N00/1367404058/   6
So what?




           s/data/knowledge/g




                                7
No really, what do you do with it?

  • User behavior analysis
  • AB test analysis
  • Ad targeting
  • Trending topics
  • User and topic modeling
  • Recommendations
  • And more...


                                     8
How to scale data?



                     9
Divide and Conquer
                “Work”
                                      Partition


       w1         w2         w3

     “worker”   “worker”   “worker”


        r1         r2         r3




                “Result”              Combine


                                                  10
Parallel processing is complicated

  • How do we assign tasks to workers?
  • What if we have more tasks than slots?
  • What happens when tasks fail?
  • How do you handle distributed
  synchronization?




  Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/   11
Data storage is not trivial
  • Data volumes are massive
  • Reliably storing PBs of data is challenging
  • Disk/hardware/network failures
  • Probability of failure event increases with number of
  machines

  For example:
   1000 hosts, each with 10 disks
   a disk lasts 3 years
   how many failures per day?
                                                        12
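
A rough back-of-the-envelope answer to the example above, assuming disk failures are spread
evenly over a 3-year lifetime: 1000 hosts × 10 disks = 10,000 disks, and
10,000 ÷ (3 × 365 days) ≈ 9 disk failures per day.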
Hadoop cluster




    Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)



                                                                   13
Hadoop



         14
Hadoop provides



  • Redundant, fault-tolerant data storage
  • Parallel computation framework
  • Job coordination




  http://hadoop.apache.org                  15
Joy




  Credit: http://www.flickr.com/photos/spyndle/3480602438/   16
Hadoop origins

  • Hadoop is an open-source implementation based
  on GFS and MapReduce from Google
  • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System
  • Jeffrey Dean and Sanjay Ghemawat. (2004)
  MapReduce: Simplified Data Processing on Large
  Clusters. OSDI 2004



                                                17
Hadoop Stack


        HBase                    Pig                 Hive               Cascading
  (Columnar Database)        (Data Flow)             (SQL)               (Java)


                                  MapReduce
                    (Distributed Programming Framework)


                                     HDFS
                      (Hadoop Distributed File System)




                                                                                        18
HDFS



       19
HDFS is...
  • A distributed file system
  • Redundant storage
  • Designed to reliably store data using commodity
  hardware
  • Designed to expect hardware failures
  • Intended for large files
  • Designed for batch inserts
  • The Hadoop Distributed File System


                                                      20
HDFS - files and blocks
• Files are stored as a collection of blocks
• Blocks are 64 MB chunks of a file (configurable)
• Blocks are replicated on 3 nodes (configurable)
• The NameNode (NN) manages metadata about files
and blocks
• The SecondaryNameNode (SNN) holds a backup
of the NN data
• DataNodes (DN) store and serve blocks


                                               21
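
To make the pieces above concrete, here is a minimal sketch of a client talking to HDFS
through Hadoop's Java FileSystem API. The path and file contents are hypothetical, and the
NameNode address is assumed to come from the cluster's configuration files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // HDFS client; asks the NN for metadata

        Path file = new Path("/user/demo/hello.txt");  // hypothetical path
        FSDataOutputStream out = fs.create(file);      // NN assigns blocks; data streams to DNs
        out.writeUTF("hello, HDFS");
        out.close();

        FSDataInputStream in = fs.open(file);          // NN returns block locations; bytes come from DNs
        System.out.println(in.readUTF());
        in.close();
    }
}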
Replication


• Multiple copies of a block are stored
• Replication strategy:
   • Copy #1 on another node on same rack
   • Copy #2 on another node on different rack




                                                 22
HDFS - writes
     Client                             Master
                                                    Note: Write path for a
      File                             NameNode     single block shown.
                                                    Client writes multiple
                                                    blocks in parallel.

             block


                             Rack #1                 Rack #2
               Slave node              Slave node   Slave node

               DataNode                DataNode     DataNode



                     Block               Block        Block



                                                                             23
HDFS - reads
     Client                          Master
                                              Client reads multiple blocks
      File                      NameNode      in parallel and re-assembles
                                              into a file.


                           block N
    block 1     block 2



              Slave node        Slave node    Slave node

              DataNode          DataNode       DataNode



                Block                Block       Block



                                                                        24
What about DataNode failures?

• DNs check in with the NN to report health
• Upon failure NN orders DNs to replicate under-replicated blocks




   Credit: http://www.flickr.com/photos/18536761@N00/367661087/   25
MapReduce



            26
MapReduce is...


  • A programming model for expressing distributed
  computations at a massive scale
  • An execution framework for organizing and
  performing such computations
  • An open-source implementation called Hadoop




                                                     27
Typical large-data problem


  • Iterate over a large number of records
  • Extract something of interest from each      (Map)
  • Shuffle and sort intermediate results
  • Aggregate intermediate results               (Reduce)
  • Generate final output




   (Dean and Ghemawat, OSDI 2004)                      28
MapReduce paradigm


  • Implement two functions:
      Map(k1, v1) -> list(k2, v2)
      Reduce(k2, list(v2)) -> list(v3)
  • Framework handles everything else*
  • Values with the same key go to the same reducer




                                             29
MapReduce Flow
               k1 v1   k2 v2   k3 v3      k4 v4   k5 v5    k6 v6




      map                  map                    map                map


     a 1    b 2        c   3   c   6           a 5   c 2           b 7   c 8

           Shuffle and Sort: aggregate values by keys
                  a    1 5               b     2 7           c     2 3 6 8




             reduce                reduce                 reduce


               r1 s1                   r2 s2               r3 s3
                                                                               30
MapReduce - word count example

function map(String name, String document):
  for each word w in document:
    emit(w, 1)

function reduce(String word, Iterator partialCounts):
  totalCount = 0
  for each count in partialCounts:
    totalCount += count
  emit(word, totalCount)




                                                        31
MapReduce paradigm - part 2
  • There’s more!
  • Partitioners decide which key goes to which reducer (see the sketch after this slide)
    • partition(k', numPartitions) -> partNumber
    • Divides the key space into chunks, one per parallel reducer
    • Default is hash-based
  • Combiners can combine Mapper output before
  sending to reducer
    • Reduce(k2, list(v2)) -> list(v3)

                                                    32
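
The sketch below shows roughly what the default hash-based behavior looks like when written
as a custom partitioner against the old org.apache.hadoop.mapred API used elsewhere in these
slides (the class name is hypothetical):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {

    // The same key always hashes to the same partition, so it always reaches the same reducer.
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public void configure(JobConf job) {
        // Nothing to configure in this simple example.
    }
}

It would be wired into a job with conf.setPartitionerClass(WordPartitioner.class).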
MapReduce flow
                     k1 v1   k2 v2   k3 v3     k4 v4     k5 v5      k6 v6




          map                   map                     map                   map


        a 1    b 2           c 3     c 6            a 5     c 2             b 7   c 8

         combine              combine                  combine               combine



        a 1    b 2                 c 9              a 5     c 2             b 7   c 8

         partition             partition               partition             partition

              Shuffle and Sort: aggregate values by keys
                       a     1 5              b     2 7               c     9 2 8




                 reduce                    reduce                  reduce


                     r1 s1                  r2 s2                   r3 s3


                                                                                         33
MapReduce additional details

  • Reduce starts after all mappers complete
  • Mapper output gets written to disk
  • Intermediate data can be copied sooner
  • Reducer gets keys in sorted order
  • Keys not sorted across reducers
  • Global sort requires 1 reducer or smart
  partitioning


                                               34
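
For example, the simplest way to get a single, globally sorted output is to force one reducer;
with the JobConf API used in the word-count driver later in these slides that is:

conf.setNumReduceTasks(1);   // one reducer sees every key, so its sorted output is a global sort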
MapReduce - jobs and tasks
  • Job: a user-submitted map and reduce
  implementation to apply to a data set
  • Task: a single mapper or reducer task
    • Failed tasks get retried automatically
    • Tasks run local to their data, ideally
  • JobTracker (JT) manages job submission and
  task delegation
  • TaskTrackers (TT) ask for work and execute
  tasks

                                                 35
MapReduce architecture
     Client                   Master

     Job                    JobTracker




              Slave node    Slave node    Slave node

              TaskTracker   TaskTracker   TaskTracker



                 Task          Task          Task



                                                        36
What about failed tasks?


 • Tasks will fail
 • JT will retry failed tasks up to N attempts (see the snippet after this slide)
 • After N failed attempts for a task, the job fails
 • Some tasks are slower than others
 • Speculative execution: the JT starts up multiple copies of the same task
 • The first one to complete wins; the others are killed

  Credit: http://www.flickr.com/photos/phobia/2308371224/   37
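
The retry limit N is configurable per job. With the old JobConf API used elsewhere in these
slides it can be set like this (4 is shown only as an illustration):

conf.setMaxMapAttempts(4);      // how many times a failing map task is retried
conf.setMaxReduceAttempts(4);   // same limit for reduce tasks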
MapReduce data locality


  • Move computation to the data
  • Moving data between nodes has a cost
  • MapReduce tries to schedule tasks on nodes
  with the data
  • When not possible TT has to fetch data from DN




                                                     38
MapReduce - Java API
• Mapper:
  void map(WritableComparable key,
           Writable value,
           OutputCollector output,
           Reporter reporter)

• Reducer:
   void reduce(WritableComparable key,
               Iterator values,
               OutputCollector output,
               Reporter reporter)

                                         39
MapReduce - Java API
• Writable (see the sketch after this slide)
      • Hadoop wrapper interface
      • Text, IntWritable, LongWritable, etc
• WritableComparable
      • Writable classes implement WritableComparable
• OutputCollector
      • Class that collects keys and values
• Reporter
      • Reports progress, updates counters
• InputFormat
      • Reads data and provides InputSplits
      • Examples: TextInputFormat, KeyValueTextInputFormat
• OutputFormat
      • Writes data
      • Examples: TextOutputFormat, SequenceFileOutputFormat
                                                               40
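
As a sketch of what the Writable interface listed above involves, a hypothetical custom value
type might look like this; keys would additionally implement WritableComparable so the
framework can sort them:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical (count, sum) pair used as a value type.
public class CountSumWritable implements Writable {
    private long count;
    private double sum;

    public CountSumWritable() {}   // Hadoop needs a no-arg constructor to deserialize instances

    public void write(DataOutput out) throws IOException {
        out.writeLong(count);      // serialize fields in a fixed order...
        out.writeDouble(sum);
    }

    public void readFields(DataInput in) throws IOException {
        count = in.readLong();     // ...and read them back in the same order
        sum = in.readDouble();
    }
}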
MapReduce - Counters are...
• A distributed count of events during a job
• A way to indicate job metrics without logging
• Your friend

• Bad:
System.out.println("Couldn't parse value");
• Good:
reporter.incrCounter(BadParseEnum, 1L);


                                                  41
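
A minimal sketch of how a counter like the one above might be used inside a mapper written
against the same old API as the word-count example (the enum and record format are
hypothetical):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ParsingMapper extends MapReduceBase
       implements Mapper<LongWritable, Text, Text, IntWritable> {

    // Each enum constant becomes one counter, grouped under the enum's class name.
    public static enum ParseErrors { BAD_RECORD }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 2) {
            // Counts from every task are summed by the framework and shown in the JobTracker UI.
            reporter.incrCounter(ParseErrors.BAD_RECORD, 1L);
            return;
        }
        output.collect(new Text(fields[0]), new IntWritable(1));
    }
}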
MapReduce - word count mapper
public static class Map extends MapReduceBase
       implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {

        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
           word.set(tokenizer.nextToken());
           output.collect(word, one);
        }
    }
}
                                                                  42
MapReduce - word count reducer
public static class Reduce extends MapReduceBase implements
       Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key,
                     Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
             sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
    }
}


                                                              43
MapReduce - word count main
  public static void main(String[] args) throws Exception {
       JobConf conf = new JobConf(WordCount.class);
       conf.setJobName("wordcount");

       conf.setOutputKeyClass(Text.class);
       conf.setOutputValueClass(IntWritable.class);

       conf.setMapperClass(Map.class);
       conf.setCombinerClass(Reduce.class);
       conf.setReducerClass(Reduce.class);

       conf.setInputFormat(TextInputFormat.class);
       conf.setOutputFormat(TextOutputFormat.class);

       FileInputFormat.setInputPaths(conf, new Path(args[0]));
       FileOutputFormat.setOutputPath(conf, new Path(args[1]));

       JobClient.runJob(conf);
  }

                                                                  44
MapReduce - running a job


• To run word count, add files to HDFS and do:

$ bin/hadoop jar wordcount.jar org.myorg.WordCount input_dir output_dir




                                                45
MapReduce is good for...


  • Embarrassingly parallel algorithms
  • Summing, grouping, filtering, joining
  • Off-line batch jobs on massive data sets
  • Analyzing an entire large dataset




                                               46
MapReduce is ok for...



  • Iterative jobs (e.g., graph algorithms)
     • Each iteration must read/write data to disk
     • IO and latency cost of an iteration is high




                                                     47
MapReduce is not good for...

  • Jobs that need shared state/coordination
    • Tasks are shared-nothing
    • Shared-state requires scalable state store
  • Low-latency jobs
  • Jobs on small datasets
  • Finding individual records



                                                   48
Hadoop combined architecture
                      Master

                    JobTracker
                                           Backup

                    NameNode        SecondaryNameNode




      Slave node    Slave node    Slave node

      TaskTracker   TaskTracker   TaskTracker



      DataNode      DataNode      DataNode



                                                        49
NameNode UI
  • Tool for browsing HDFS




                             50
JobTracker UI
  • Tool to see running/completed/failed jobs




                                                51
Running Hadoop

  • Multiple options
  • On your local machine (standalone or pseudo-distributed)
  • Locally with a virtual machine
  • In the cloud (e.g., Amazon EC2)
  • In your own datacenter



                                                   52
Cloudera VM
• Virtual machine with Hadoop and related technologies
pre-loaded
• Great tool for learning Hadoop
• Eases the pain of downloading/installing
• Pre-loaded with sample data and jobs
• Documented tutorials
• VM: https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4
• Tutorial: https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial

                                                        53
Twitter Analytics
  and Hadoop


                    54
Multiple teams use Hadoop


  • Analytics
  • Revenue
  • Personalization & Recommendations
  • Growth




                                        55
Twitter Analytics data flow

                                             Production Hosts
                        Log                                                 Application
                      events                                                Data
                               Scribe
                               Aggregators

       Third Party
                                                                                      Social graph
        Imports                         HDFS                              MySQL/      Tweets
                                                                          Gizzard     User profiles
                                                        Distributed                                              Analysts
                           Staging Hadoop Cluster       Crawler                                                  Engineers
              Crane                                                                                              PMs
                                                                  Crane                                          Sales
                                                                             Crane
                                  Log                                                     Rasvelg
HCatalog                          Mover


                                                                 Oink
           Oink       Main Hadoop DW           HBase                                                 Analytics
                                                                            Vertica
                                                                                                     Web Tools
                                                                Crane

                                                                          Crane
                                                                Crane

                                                       Oink
                                                                             MySQL
                                                                                                                       56
Example: active users

  Production Hosts




                     Main Hadoop DW




       MySQL/                                  Analytics
       Gizzard                        MySQL   Dashboards
                          Vertica




                                                           57
Example: active users
                                                                       Job DAG




                                                                 Log mover
  Production Hosts
                                   Log mover
                              (via staging cluster)

                         ib   e   web_events
                     Scr
                                                      Main Hadoop DW
                 Scr
Example: active users (Job DAG)

  • Production hosts write web_events and sms_events logs via Scribe
  • A log mover copies the logs (via a staging cluster) into the main
  Hadoop DW
  • Oink/Pig jobs cleanse, filter, transform, geo-lookup, union and
  de-duplicate (distinct) the raw events; Oink writes the result to
  user_sessions (a sketch of such a pass follows below)
  • Crane imports user_profiles from MySQL/Gizzard into the DW
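To make the cleansing stage concrete, here is a minimal sketch in the same old-style Java MapReduce API as the word-count example earlier in the deck: it filters malformed records, tags each event with a (stubbed) geo lookup, unions two input log directories and de-duplicates the normalized records. The tab-separated input layout (user_id, timestamp, ip, client), the CleanseEvents class name and the lookupCountry stub are assumptions made for illustration only; Twitter's actual cleansing ran as Oink/Pig jobs, not hand-written MapReduce.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class CleanseEvents {

  // Assumed (hypothetical) raw log layout: user_id <TAB> timestamp <TAB> ip <TAB> client
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, NullWritable> output, Reporter reporter) throws IOException {
      String[] f = value.toString().split("\t");
      if (f.length < 4 || f[0].isEmpty()) {               // cleanse/filter: drop malformed records
        reporter.incrCounter("Cleanse", "BAD_RECORDS", 1L); // counters, not println
        return;
      }
      String country = lookupCountry(f[2]);                // geo lookup (stubbed below)
      // Using the normalized record itself as the key makes the reduce step a DISTINCT
      output.collect(new Text(f[0] + "\t" + country + "\t" + f[3]), NullWritable.get());
    }

    private String lookupCountry(String ip) {
      return "UNKNOWN";  // stand-in for a real IP-to-geo lookup
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, NullWritable, Text, NullWritable> {
    public void reduce(Text key, Iterator<NullWritable> values,
        OutputCollector<Text, NullWritable> output, Reporter reporter) throws IOException {
      output.collect(key, NullWritable.get());             // emit each distinct record once
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CleanseEvents.class);
    conf.setJobName("cleanse_events");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(NullWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);                   // distinct is combiner-safe
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    // Union of web_events and sms_events: pass both log directories as inputs
    FileInputFormat.setInputPaths(conf, new Path(args[0]), new Path(args[1]));
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));
    JobClient.runJob(conf);
  }
}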
  • Rasvelg jobs then join, group and count to build aggregations such as
  active_by_geo, active_by_device and active_by_client (sketched below)
  • Crane exports the active_by_* tables to Vertica and MySQL, which feed
  the analytics dashboards
  • Downstream Oink, Crane and Rasvelg jobs continue the DAG from there
                                                                        57
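The Rasvelg-style aggregations are joins, groups and counts over user_sessions. As a rough illustration of just one of them, here is a hedged sketch of an active_by_geo style job in the same old-style Java API; the session record layout (user_id, country, device per tab-separated line), the ActiveByGeo class name and the paths are assumptions, and the real jobs also join against user_profiles and were not written as raw MapReduce.

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class ActiveByGeo {

  // Assumed (hypothetical) user_sessions layout: user_id <TAB> country <TAB> device
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      String[] f = value.toString().split("\t");
      if (f.length < 2) {
        reporter.incrCounter("ActiveByGeo", "BAD_RECORDS", 1L); // counters, not println
        return;
      }
      output.collect(new Text(f[1]), new Text(f[0])); // key = country, value = user_id
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, IntWritable> {
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      // Count distinct users per country. Holding the ids in memory is fine for a
      // sketch; a production job would pre-aggregate or use a second pass instead.
      Set<String> users = new HashSet<String>();
      while (values.hasNext()) {
        users.add(values.next().toString());
      }
      output.collect(key, new IntWritable(users.size()));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ActiveByGeo.class);
    conf.setJobName("active_by_geo");
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. user_sessions
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // e.g. active_by_geo
    JobClient.runJob(conf);
  }
}

Such a job would be launched the same way as the word-count example, e.g.:
$ bin/hadoop jar activebygeo.jar ActiveByGeo user_sessions_dir active_by_geo_dir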
Credits



  • Data-Intensive Information Processing Applications ―
  Session #1, Jimmy Lin, University of Maryland




                                                           58
Questions?

 Bill Graham - @billgraham




                             59


Intro To Hadoop

  • 1. Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School September 2012 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
  • 2. Outline • What is Big Data? • Hadoop • HDFS • MapReduce • Twitter Analytics and Hadoop 2
  • 3. What is big data? • A bunch of data? • An industry? • An expertise? • A trend? • A cliche? 3
  • 4. Wikipedia big data In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools. Source: http://en.wikipedia.org/wiki/Big_data 4
  • 5. How big is big? 5
  • 6. How big is big? • 2008: Google processes 20 PB a day 5
  • 7. How big is big? • 2008: Google processes 20 PB a day • 2009: Facebook has 2.5 PB user data + 15 TB/ day 5
  • 8. How big is big? • 2008: Google processes 20 PB a day • 2009: Facebook has 2.5 PB user data + 15 TB/ day • 2009: eBay has 6.5 PB user data + 50 TB/day 5
  • 9. How big is big? • 2008: Google processes 20 PB a day • 2009: Facebook has 2.5 PB user data + 15 TB/ day • 2009: eBay has 6.5 PB user data + 50 TB/day • 2011: Yahoo! has 180-200 PB of data 5
  • 10. How big is big? • 2008: Google processes 20 PB a day • 2009: Facebook has 2.5 PB user data + 15 TB/ day • 2009: eBay has 6.5 PB user data + 50 TB/day • 2011: Yahoo! has 180-200 PB of data • 2012: Facebook ingests 500 TB/day 5
  • 11. That’s a lot of data Credit: http://www.flickr.com/photos/19779889@N00/1367404058/ 6
  • 12. So what? 7
  • 13. So what? s/data/knowledge/g 7
  • 14. No really, what do you do with it? 8
  • 15. No really, what do you do with it? • User behavior analysis 8
  • 16. No really, what do you do with it? • User behavior analysis • AB test analysis 8
  • 17. No really, what do you do with it? • User behavior analysis • AB test analysis • Ad targeting 8
  • 18. No really, what do you do with it? • User behavior analysis • AB test analysis • Ad targeting • Trending topics 8
  • 19. No really, what do you do with it? • User behavior analysis • AB test analysis • Ad targeting • Trending topics • User and topic modeling 8
  • 20. No really, what do you do with it? • User behavior analysis • AB test analysis • Ad targeting • Trending topics • User and topic modeling • Recommendations 8
  • 21. No really, what do you do with it? • User behavior analysis • AB test analysis • Ad targeting • Trending topics • User and topic modeling • Recommendations • And more... 8
  • 22. How to scale data? 9
  • 23. Divide and Conquer “Work” Partition w1 w2 w3 “worker” “worker” “worker” r1 r2 r3 “Result” Combine 10
  • 24. Parallel processing is complicated Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/ 11
  • 25. Parallel processing is complicated • How do we assign tasks to workers? Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/ 11
  • 26. Parallel processing is complicated • How do we assign tasks to workers? • What if we have more tasks than slots? Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/ 11
  • 27. Parallel processing is complicated • How do we assign tasks to workers? • What if we have more tasks than slots? • What happens when tasks fail? Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/ 11
  • 28. Parallel processing is complicated • How do we assign tasks to workers? • What if we have more tasks than slots? • What happens when tasks fail? • How do you handle distributed synchronization? Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/ 11
  • 29. Data storage is not trivial • Data volumes are massive • Reliably storing PBs of data is challenging • Disk/hardware/network failures • Probability of failure event increases with number of machines For example: 1000 hosts, each with 10 disks a disk lasts 3 year how many failures per day? 12
  • 30. Hadoop cluster Cluster of machine running Hadoop at Yahoo! (credit: Yahoo!) 13
  • 31. Hadoop 14
  • 32. Hadoop provides • Redundant, fault-tolerant data storage • Parallel computation framework • Job coordination http://hapdoop.apache.org 15
  • 33. Joy Credit: http://www.flickr.com/photos/spyndle/3480602438/ 16
  • 35. Hadoop origins • Hadoop is an open-source implementation based on GFS and MapReduce from Google 17
  • 36. Hadoop origins • Hadoop is an open-source implementation based on GFS and MapReduce from Google • Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung. (2003) The Google File System 17
  • 37. Hadoop origins • Hadoop is an open-source implementation based on GFS and MapReduce from Google • Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung. (2003) The Google File System • Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004 17
  • 38. Hadoop origins • Hadoop is an open-source implementation based on GFS and MapReduce from Google • Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung. (2003) The Google File System • Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004 17
  • 39. Hadoop Stack (Columnar Database) Pig Hive Cascading (Data Flow) (SQL) (Java) HBase MapReduce (Distributed Programming Framework) HDFS (Hadoop Distributed File System) 18
  • 40. HDFS 19
  • 42. HDFS is... • A distributed file system 20
  • 43. HDFS is... • A distributed file system • Redundant storage 20
  • 44. HDFS is... • A distributed file system • Redundant storage • Designed to reliably store data using commodity hardware 20
  • 45. HDFS is... • A distributed file system • Redundant storage • Designed to reliably store data using commodity hardware • Designed to expect hardware failures 20
  • 46. HDFS is... • A distributed file system • Redundant storage • Designed to reliably store data using commodity hardware • Designed to expect hardware failures • Intended for large files 20
  • 47. HDFS is... • A distributed file system • Redundant storage • Designed to reliably store data using commodity hardware • Designed to expect hardware failures • Intended for large files • Designed for batch inserts 20
  • 48. HDFS is... • A distributed file system • Redundant storage • Designed to reliably store data using commodity hardware • Designed to expect hardware failures • Intended for large files • Designed for batch inserts • The Hadoop Distributed File System 20
  • 49. HDFS - files and blocks 21
  • 50. HDFS - files and blocks • Files are stored as a collection of blocks 21
  • 51. HDFS - files and blocks • Files are stored as a collection of blocks • Blocks are 64 MB chunks of a file (configurable) 21
  • 52. HDFS - files and blocks • Files are stored as a collection of blocks • Blocks are 64 MB chunks of a file (configurable) • Blocks are replicated on 3 nodes (configurable) 21
  • 53. HDFS - files and blocks • Files are stored as a collection of blocks • Blocks are 64 MB chunks of a file (configurable) • Blocks are replicated on 3 nodes (configurable) • The NameNode (NN) manages metadata about files and blocks 21
  • 54. HDFS - files and blocks • Files are stored as a collection of blocks • Blocks are 64 MB chunks of a file (configurable) • Blocks are replicated on 3 nodes (configurable) • The NameNode (NN) manages metadata about files and blocks • The SecondaryNameNode (SNN) holds a backup of the NN data 21
  • 55. HDFS - files and blocks • Files are stored as a collection of blocks • Blocks are 64 MB chunks of a file (configurable) • Blocks are replicated on 3 nodes (configurable) • The NameNode (NN) manages metadata about files and blocks • The SecondaryNameNode (SNN) holds a backup of the NN data • DataNodes (DN) store and serve blocks 21
  • 56. Replication • Multiple copies of a block are stored • Replication strategy: • Copy #1 on another node on same rack • Copy #2 on another node on different rack 22
  • 57. HDFS - writes Client Master Note: Write path for a File NameNode single block shown. Client writes multiple blocks in parallel. block Rack #1 Rack #2 Slave node Slave node Slave node DataNode DataNode DataNode Block Block Block 23
  • 58. HDFS - reads Client Master Client reads multiple blocks File NameNode in parallel and re-assembles into a file. block N block 1 block 2 Slave node Slave node Slave node DataNode DataNode DataNode Block Block Block 24
  • 59. What about DataNode failures? • DNs check in with the NN to report health • Upon failure NN orders DNs to replicate under- replicated blocks Credit: http://www.flickr.com/photos/18536761@N00/367661087/ 25
  • 60. MapReduce 26
  • 62. MapReduce is... • A programming model for expressing distributed computations at a massive scale 27
  • 63. MapReduce is... • A programming model for expressing distributed computations at a massive scale • An execution framework for organizing and performing such computations 27
  • 64. MapReduce is... • A programming model for expressing distributed computations at a massive scale • An execution framework for organizing and performing such computations • An open-source implementation called Hadoop 27
  • 65. Typical large-data problem (Dean and Ghemawat, OSDI 2004) 28
  • 66. Typical large-data problem • Iterate over a large number of records (Dean and Ghemawat, OSDI 2004) 28
  • 67. Typical large-data problem • Iterate over a large number of records • Extract something of interest from each (Dean and Ghemawat, OSDI 2004) 28
  • 68. Typical large-data problem • Iterate over a large number of records • Extract something of interest from each • Shuffle and sort intermediate results (Dean and Ghemawat, OSDI 2004) 28
  • 69. Typical large-data problem • Iterate over a large number of records • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results (Dean and Ghemawat, OSDI 2004) 28
  • 70. Typical large-data problem • Iterate over a large number of records • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results • Generate final output (Dean and Ghemawat, OSDI 2004) 28
  • 71. Typical large-data problem • Iterate over a large number of records Map • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results • Generate final output (Dean and Ghemawat, OSDI 2004) 28
  • 72. Typical large-data problem • Iterate over a large number of records Map • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results Reduce • Generate final output (Dean and Ghemawat, OSDI 2004) 28
  • 74. MapReduce paradigm • Implement two functions: 29
  • 75. MapReduce paradigm • Implement two functions: Map(k1, v1) -> list(k2, v2) 29
  • 76. MapReduce paradigm • Implement two functions: Map(k1, v1) -> list(k2, v2) Reduce(k2, list(v2)) -> list(v3) 29
  • 77. MapReduce paradigm • Implement two functions: Map(k1, v1) -> list(k2, v2) Reduce(k2, list(v2)) -> list(v3) • Framework handles everything else* 29
  • 78. MapReduce paradigm • Implement two functions: Map(k1, v1) -> list(k2, v2) Reduce(k2, list(v2)) -> list(v3) • Framework handles everything else* • Value with same key go to same reducer 29
  • 79. MapReduce Flow k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6 map map map map a 1 b 2 c 3 c 6 a 5 c 2 b 7 c 8 Shuffle and Sort: aggregate values by keys a 1 5 b 2 7 c 2 3 6 8 reduce reduce reduce r1 s1 r2 s2 r3 s3 30
  • 80. MapReduce - word count example function map(String name, String document): for each word w in document: emit(w, 1) function reduce(String word, Iterator partialCounts): totalCount = 0 for each count in partialCounts: totalCount += count emit(word, totalCount) 31
  • 81. MapReduce paradigm - part 2 32
  • 82. MapReduce paradigm - part 2 • There’s more! 32
  • 83. MapReduce paradigm - part 2 • There’s more! • Partioners decide what key goes to what reducer 32
  • 84. MapReduce paradigm - part 2 • There’s more! • Partioners decide what key goes to what reducer • partition(k’, numPartitions) -> partNumber 32
  • 85. MapReduce paradigm - part 2 • There’s more! • Partioners decide what key goes to what reducer • partition(k’, numPartitions) -> partNumber • Divides key space into parallel reducers chunks 32
  • 86. MapReduce paradigm - part 2 • There’s more! • Partioners decide what key goes to what reducer • partition(k’, numPartitions) -> partNumber • Divides key space into parallel reducers chunks • Default is hash-based 32
  • 87. MapReduce paradigm - part 2 • There’s more! • Partioners decide what key goes to what reducer • partition(k’, numPartitions) -> partNumber • Divides key space into parallel reducers chunks • Default is hash-based • Combiners can combine Mapper output before sending to reducer 32
  • 88. MapReduce paradigm - part 2 • There’s more! • Partioners decide what key goes to what reducer • partition(k’, numPartitions) -> partNumber • Divides key space into parallel reducers chunks • Default is hash-based • Combiners can combine Mapper output before sending to reducer • Reduce(k2, list(v2)) -> list(v3) 32
  • 89. MapReduce flow k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6 map map map map a 1 b 2 c 3 c 6 a 5 c 2 b 7 c 8 combine combine combine combine a 1 b 2 c 9 a 5 c 2 b 7 c 8 partition partition partition partition Shuffle and Sort: aggregate values by keys a 1 5 b 2 7 c 2 9 8 8 3 6 reduce reduce reduce r1 s1 r2 s2 r3 s3 33
  • 91. MapReduce additional details • Reduce starts after all mappers complete 34
  • 92. MapReduce additional details • Reduce starts after all mappers complete • Mapper output gets written to disk 34
  • 93. MapReduce additional details • Reduce starts after all mappers complete • Mapper output gets written to disk • Intermediate data can be copied sooner 34
  • 94. MapReduce additional details • Reduce starts after all mappers complete • Mapper output gets written to disk • Intermediate data can be copied sooner • Reducer gets keys in sorted order 34
  • 95. MapReduce additional details • Reduce starts after all mappers complete • Mapper output gets written to disk • Intermediate data can be copied sooner • Reducer gets keys in sorted order • Keys not sorted across reducers 34
  • 96. MapReduce additional details • Reduce starts after all mappers complete • Mapper output gets written to disk • Intermediate data can be copied sooner • Reducer gets keys in sorted order • Keys not sorted across reducers • Global sort requires 1 reducer or smart partitioning 34
  • 97. MapReduce - jobs and tasks 35
  • 98. MapReduce - jobs and tasks • Job: a user-submitted map and reduce implementation to apply to a data set 35
  • 99. MapReduce - jobs and tasks • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task 35
  • 100. MapReduce - jobs and tasks • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task • Failed tasks get retried automatically 35
  • 101. MapReduce - jobs and tasks • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task • Failed tasks get retried automatically • Tasks run local to their data, ideally 35
  • 102. MapReduce - jobs and tasks • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task • Failed tasks get retried automatically • Tasks run local to their data, ideally • JobTracker (JT) manages job submission and task delegation 35
  • 103. MapReduce - jobs and tasks • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task • Failed tasks get retried automatically • Tasks run local to their data, ideally • JobTracker (JT) manages job submission and task delegation • TaskTrackers (TT) ask for work and execute tasks 35
  • 104. MapReduce architecture Client Master Job JobTracker Slave node Slave node Slave node TaskTracker TaskTracker TaskTracker Task Task Task 36
  • 105. What about failed tasks? Credit: http://www.flickr.com/photos/phobia/2308371224/ 37
  • 106. What about failed tasks? • Tasks will fail Credit: http://www.flickr.com/photos/phobia/2308371224/ 37
  • 107. What about failed tasks? • Tasks will fail • JT will retry failed tasks up to N attempts Credit: http://www.flickr.com/photos/phobia/2308371224/ 37
  • 108. What about failed tasks? • Tasks will fail • JT will retry failed tasks up to N attempts • After N failed attempts for a task, job fails Credit: http://www.flickr.com/photos/phobia/2308371224/ 37
  • 109. What about failed tasks? • Tasks will fail • JT will retry failed tasks up to N attempts • After N failed attempts for a task, job fails • Some tasks are slower than other Credit: http://www.flickr.com/photos/phobia/2308371224/ 37
  • 110. What about failed tasks? • Tasks will fail • JT will retry failed tasks up to N attempts • After N failed attempts for a task, job fails • Some tasks are slower than other • Speculative execution is JT starting up multiple of the same task Credit: http://www.flickr.com/photos/phobia/2308371224/ 37
  • 111. What about failed tasks? • Tasks will fail • JT will retry failed tasks up to N attempts • After N failed attempts for a task, job fails • Some tasks are slower than other • Speculative execution is JT starting up multiple of the same task • First one to complete wins, other is killed Credit: http://www.flickr.com/photos/phobia/2308371224/ 37
  • 113. MapReduce data locality • Move computation to the data 38
  • 114. MapReduce data locality • Move computation to the data • Moving data between nodes has a cost 38
  • 115. MapReduce data locality • Move computation to the data • Moving data between nodes has a cost • MapReduce tries to schedule tasks on nodes with the data 38
  • 116. MapReduce data locality • Move computation to the data • Moving data between nodes has a cost • MapReduce tries to schedule tasks on nodes with the data • When not possible TT has to fetch data from DN 38
  • 117. MapReduce - Java API • Mapper: void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) • Reducer: void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) 39
  • 118. MapReduce - Java API 40
  • 119. MapReduce - Java API • Writable 40
  • 120. MapReduce - Java API • Writable • Hadoop wrapper interface 40
  • 121. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc 40
  • 122. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable 40
  • 123. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable 40
  • 124. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector 40
  • 125. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values 40
  • 126. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter 40
  • 127. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter • Reports progress, updates counters 40
  • 128. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter • Reports progress, updates counters • InputFormat 40
  • 129. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter • Reports progress, updates counters • InputFormat • Reads data and provide InputSplits 40
  • 130. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter • Reports progress, updates counters • InputFormat • Reads data and provide InputSplits • Examples: TextInputFormat, KeyValueTextInputFormat 40
  • 131. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter • Reports progress, updates counters • InputFormat • Reads data and provide InputSplits • Examples: TextInputFormat, KeyValueTextInputFormat • OutputFormat 40
  • 132. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter • Reports progress, updates counters • InputFormat • Reads data and provide InputSplits • Examples: TextInputFormat, KeyValueTextInputFormat • OutputFormat • Writes data 40
  • 133. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter • Reports progress, updates counters • InputFormat • Reads data and provide InputSplits • Examples: TextInputFormat, KeyValueTextInputFormat • OutputFormat • Writes data • Examples: TextOutputFormat, SequenceFileOutputFormat 40
  • 134. MapReduce - Counters are... 41
  • 135. MapReduce - Counters are... • A distributed count of events during a job 41
  • 136. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging 41
  • 137. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging • Your friend 41
  • 138. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging • Your friend 41
  • 139. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging • Your friend • Bad: 41
  • 140. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging • Your friend • Bad: System.out.println(“Couldn’t parse value”); 41
  • 141. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging • Your friend • Bad: System.out.println(“Couldn’t parse value”); • Good: 41
  • 142. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging • Your friend • Bad: System.out.println(“Couldn’t parse value”); • Good: reporter.incrCounter(BadParseEnum, 1L); 41
  • 143. MapReduce - word count mapper public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word, one); } } } 42
  • 144. MapReduce - word count reducer public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } 43
  • 145. MapReduce - word count main public static void main(String[] args) throws Exception { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(Map.class); conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf); } 44
  • 146. MapReduce - running a job • To run word count, add files to HDFS and do: $ bin/hadoop jar wordcount.jar org.myorg.WordCount input_dir output_dir 45
  • 147. MapReduce is good for... 46
  • 148. MapReduce is good for... • Embarrassingly parallel algorithms 46
  • 149. MapReduce is good for... • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining 46
  • 150. MapReduce is good for... • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining • Off-line batch jobs on massive data sets 46
  • 151. MapReduce is good for... • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining • Off-line batch jobs on massive data sets • Analyzing an entire large dataset 46
  • 152. MapReduce is ok for... 47
• 153. MapReduce is ok for... • Iterative jobs (e.g., graph algorithms) 47
• 154. MapReduce is ok for... • Iterative jobs (e.g., graph algorithms) • Each iteration must read/write data to disk 47
• 155. MapReduce is ok for... • Iterative jobs (e.g., graph algorithms) • Each iteration must read/write data to disk • IO and latency cost of an iteration is high 47
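A rough sketch of why iteration is costly: the driver typically launches one full MapReduce job per iteration, and HDFS is the only channel between iterations (the job class and paths below are hypothetical):

  // Illustrative only: each iteration is a separate job whose input and
  // output must round-trip through HDFS.
  public static void runIterations(int maxIterations) throws IOException {
    Path current = new Path("graph/iter-0");
    for (int i = 1; i <= maxIterations; i++) {
      Path next = new Path("graph/iter-" + i);
      JobConf conf = new JobConf(PageRankStep.class); // hypothetical per-iteration job
      FileInputFormat.setInputPaths(conf, current);
      FileOutputFormat.setOutputPath(conf, next);
      JobClient.runJob(conf);   // blocks until this iteration's job completes
      current = next;           // the next job reads what this one wrote to disk
    }
  }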
  • 156. MapReduce is not good for... 48
  • 157. MapReduce is not good for... • Jobs that need shared state/coordination 48
  • 158. MapReduce is not good for... • Jobs that need shared state/coordination • Tasks are shared-nothing 48
  • 159. MapReduce is not good for... • Jobs that need shared state/coordination • Tasks are shared-nothing • Shared-state requires scalable state store 48
  • 160. MapReduce is not good for... • Jobs that need shared state/coordination • Tasks are shared-nothing • Shared-state requires scalable state store • Low-latency jobs 48
  • 161. MapReduce is not good for... • Jobs that need shared state/coordination • Tasks are shared-nothing • Shared-state requires scalable state store • Low-latency jobs • Jobs on small datasets 48
  • 162. MapReduce is not good for... • Jobs that need shared state/coordination • Tasks are shared-nothing • Shared-state requires scalable state store • Low-latency jobs • Jobs on small datasets • Finding individual records 48
• 163. Hadoop combined architecture (diagram): a master node runs the JobTracker and NameNode, a backup node runs the SecondaryNameNode, and each slave node runs a TaskTracker and a DataNode. 49
  • 164. NameNode UI • Tool for browsing HDFS 50
  • 165. JobTracker UI • Tool to see running/completed/failed jobs 51
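With default Hadoop 1.x settings (the ports are configurable), these UIs are typically served at:

  http://<namenode-host>:50070/     # NameNode UI: browse HDFS, DataNode status
  http://<jobtracker-host>:50030/   # JobTracker UI: running/completed/failed jobs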
  • 167. Running Hadoop • Multiple options 52
• 168. Running Hadoop • Multiple options • On your local machine (standalone or pseudo-distributed) 52
• 169. Running Hadoop • Multiple options • On your local machine (standalone or pseudo-distributed) • Locally with a virtual machine 52
• 170. Running Hadoop • Multiple options • On your local machine (standalone or pseudo-distributed) • Locally with a virtual machine • In the cloud (e.g., Amazon EC2) 52
• 171. Running Hadoop • Multiple options • On your local machine (standalone or pseudo-distributed) • Locally with a virtual machine • In the cloud (e.g., Amazon EC2) • In your own datacenter 52
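As a minimal sketch of the local route, assuming a Hadoop 1.x tarball already configured for pseudo-distributed mode, the quick start looks roughly like:

  $ bin/hadoop namenode -format                 # one-time: format the NameNode
  $ bin/start-all.sh                            # start HDFS and MapReduce daemons
  $ bin/hadoop fs -put conf input               # stage some sample input in HDFS
  $ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
  $ bin/hadoop fs -cat output/*                 # view the results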
  • 172. Cloudera VM 53
  • 173. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded 53
  • 174. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded • Great tool for learning Hadoop 53
  • 175. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded • Great tool for learning Hadoop • Eases the pain of downloading/installing 53
  • 176. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded • Great tool for learning Hadoop • Eases the pain of downloading/installing • Pre-loaded with sample data and jobs 53
  • 177. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded • Great tool for learning Hadoop • Eases the pain of downloading/installing • Pre-loaded with sample data and jobs • Documented tutorials 53
• 178. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded • Great tool for learning Hadoop • Eases the pain of downloading/installing • Pre-loaded with sample data and jobs • Documented tutorials • VM: https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4 53
• 179. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded • Great tool for learning Hadoop • Eases the pain of downloading/installing • Pre-loaded with sample data and jobs • Documented tutorials • VM: https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4 • Tutorial: https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial 53
  • 180. Twitter Analytics and Hadoop 54
  • 181. Multiple teams use Hadoop • Analytics • Revenue • Personalization & Recommendations • Growth 55
• 183. Twitter Analytics data flow (architecture diagram, built up through slide 188): production hosts send application event logs through Scribe aggregators into HDFS on a staging Hadoop cluster, alongside third-party imports and social graph, Tweets, and user-profile data from MySQL/Gizzard; a distributed crawler and log mover feed the main Hadoop DW, where internal tools Crane, Oink, and Rasvelg move and aggregate data and HCatalog provides table metadata; results reach HBase, Vertica, MySQL, and analytics web tools, serving analysts, engineers, PMs, and sales. 56
• 189. Example: active users (job DAG diagram, built up through slide 195): web_events and sms_events are scribed from production hosts and delivered by the log mover via the staging cluster; an Oink/Pig job cleanses, filters, transforms, performs geo lookup, and unions/dedupes them into user_sessions on the main Hadoop DW; Crane imports user_profiles from MySQL/Gizzard; Rasvelg joins, groups, and counts to produce aggregations such as active_by_geo, active_by_device, and active_by_client; Crane then loads the active_by_* tables into MySQL and Vertica for analytics dashboards. 57
  • 196. Credits • Data-Intensive Information Processing Applications ― Session #1, Jimmy Lin, University of Maryland 58
  • 197. Questions? Bill Graham - @billgraham 59
