[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Introduction to Hadoop
Zak Stone <zak@eecs.harvard.edu>
PhD candidate, Harvard School of Engineering and Applied Sciences
Advisor: Todd Zickler (Computer Vision)

Hadoop distributes data and computation across a
large number of computers.

Outline

1. Why should you care about Hadoop?
2. What exactly is Hadoop?
3. An overview of Hadoop Map-Reduce
4. The Hadoop Distributed File System (HDFS)
5. Hadoop advantages and disadvantages
6. Getting started with Hadoop
7. Useful resources

Why should you care? - Lots of Data

LOTS OF DATA
EVERYWHERE



Why should you care? - Even Grocery Stores Care

...

Why Hadoop for big data?

• Most credible open-source toolset for large-scale, general-purpose computing

• Backed by ,

• Used by , , many others

• Increasing support from web services

• Hadoop closely imitates infrastructure developed by

• Hadoop processes petabytes daily, right now

Why Hadoop for big data?

DISCLAIMER
• Don’t use Hadoop if your data and computation fit on one machine

• Getting easier to use, but still complicated

http://www.wired.com/gadgetlab/2008/07/patent-crazines/

What exactly is Hadoop?

• Actually a growing collection of subprojects

What exactly is Hadoop?

• Actually a growing collection of subprojects; focus on two right now

An overview of Hadoop Map-Reduce

Traditional Computing (one computer)

Hadoop (many computers)

An overview of Hadoop Map-Reduce

(Actually more like this)

(many computers, little communication,
stragglers and failures)

Map-Reduce: Three phases

1. Map

2. Sort

3. Reduce

Map-Reduce: Map phase

Only specify operations on key-value pairs!
INPUT PAIR -> OUTPUT PAIRS
(key, value) -> (key, value), (key, value), (key, value)
(zero or more output pairs)

(each “elephant” works on an input pair;
doesn’t know other elephants exist )

Map-Reduce: Map phase, word-count example

(line1, “Hello there.”) -> (“hello”, 1), (“there”, 1)

(line2, “Why, hello.”) -> (“why”, 1), (“hello”, 1)
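
The example above emits lowercase words without punctuation, which a bare split() would not produce. A minimal sketch of a mapper that does, assuming lowercasing and stripping surrounding punctuation is the intended normalization (the slides do not say how words are cleaned):

```python
import string

def mapper(key, value):
    """Emit (word, 1) for each word in a line, lowercased and stripped of punctuation."""
    for token in value.split():
        # Assumption: normalize by trimming punctuation at the edges and lowercasing
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that were punctuation only
            yield word, 1

pairs = list(mapper("line1", "Hello there.")) + list(mapper("line2", "Why, hello."))
print(pairs)
```

Each call sees only its own input pair, just like the elephants on the previous slide.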

Map-Reduce: Sort phase

(key1, value289)
(key1, value43)
(key1, value3)
...
(key2, value512)
(key2, value11)
(key2, value67)
...

Map-Reduce: Sort phase, word-count example

(“hello”, 1)
(“hello”, 1)

(“there”, 1)

(“why”, 1)
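
A rough local stand-in for the sort phase, assuming all map output fits in memory (on a cluster Hadoop does this across machines). The helper name sort_and_group is made up for illustration:

```python
from itertools import groupby
from operator import itemgetter

def sort_and_group(pairs):
    """Sort (key, value) pairs by key, then yield (key, [values]) once per distinct key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

# Map output from the word-count example, in the order the mappers emitted it
map_output = [("hello", 1), ("there", 1), ("why", 1), ("hello", 1)]
grouped = list(sort_and_group(map_output))
print(grouped)
```

After this step every value for a given key sits together, which is all a reducer needs.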

Map-Reduce: Reduce phase

(key1, value289), (key1, value43), (key1, value3), ... -> (key1, output1)

...

Map-Reduce: Reduce phase, word-count example

(“hello”, 1), (“hello”, 1) -> (“hello”, 2)

(“there”, 1) -> (“there”, 1)

(“why”, 1) -> (“why”, 1)

Map-Reduce: Code for word-count

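def mapper(key, value):
    # Emit (word, 1) for every whitespace-separated word in the line
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Sum all the counts collected for one word
    yield key, sum(values)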

Seems like too much work
for a word-count!
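
To try the word-count code without a cluster, here is a minimal single-process sketch of the three phases. The run_local driver is a hypothetical helper, not part of Hadoop or Dumbo; it only imitates what the framework does around the mapper and reducer:

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

def run_local(records, mapper, reducer):
    """Run map, sort, and reduce in one process over an in-memory list of (key, value) records."""
    # 1. Map: each input pair is processed independently
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    # 2. Sort: bring equal keys next to each other
    mapped.sort(key=itemgetter(0))
    # 3. Reduce: one reducer call per distinct key
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (value for _, value in group)))
    return results

records = [("line1", "hello there"), ("line2", "why hello")]
counts = run_local(records, mapper, reducer)
print(counts)
```

The mapper and reducer are unchanged from the slide; only the driver is new.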

Map-Reduce: Imagine word-count on the Web

Map-Reduce: The main advantage

With Hadoop, this very same code could run on
the entire Web! (In theory, at least)
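def mapper(key, value):
    # Emit (word, 1) for every whitespace-separated word in the line
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Sum all the counts collected for one word
    yield key, sum(values)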

HDFS: Hadoop Distributed File System

Data -> chunks of data on computers
(each chunk replicated more than once for reliability)
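
A toy model of chunking and replication, to make the picture concrete. This is not HDFS's actual placement policy: the round-robin placement, the node names, and the tiny chunk size are all assumptions chosen for illustration.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into consecutive chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_replicas(num_chunks, nodes, replication):
    """Assign each chunk to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }

def readable_chunks(placement, failed_node):
    """Chunks that still have at least one replica after one node fails."""
    return [c for c, holders in placement.items()
            if any(node != failed_node for node in holders)]

data = b"0123456789" * 10                      # 100 bytes of sample data
chunks = split_into_chunks(data, chunk_size=32)
nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical machine names
placement = place_replicas(len(chunks), nodes, replication=3)
print(placement)
print(readable_chunks(placement, failed_node="node-b"))
```

With more than one replica per chunk, losing any single node leaves every chunk readable.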

HDFS: Hadoop Distributed File System
(key1, value1)
(key2, value2)
...

Computation is local to the data
Key-value pairs processed independently in parallel

HDFS: Inspired by the Google File System

Hadoop Map-Reduce and HDFS: Advantages
• Distribute data and computation

• Computation local to data avoids network overload

• Tasks are independent

• Easy to handle partial failures - entire nodes can fail and restart

• Avoid crawling horrors of failure-tolerant synchronous distributed systems

• Speculative execution to work around stragglers

• Linear scaling in the ideal case

• Designed for cheap, commodity hardware

• Simple programming model

• The “end-user” programmer only writes map-reduce tasks

Hadoop Map-Reduce and HDFS: Disadvantages
• Still rough - software under active development

• e.g. HDFS only recently added support for append operations

• Programming model is very restrictive

• Lack of central data can be frustrating

• “Joins” of multiple datasets are tricky and slow

• No indices! Often, entire dataset gets copied in the process

• Cluster management is hard (debugging, distributing software, collecting logs...)

• Still single master, which requires care and may limit scaling

• Managing job flow isn’t trivial when intermediate data should be kept

• Optimal configuration of nodes not obvious (# mappers, # reducers, mem. limits)
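
Why joins are awkward: with no indices, one common workaround is a "reduce-side join", sketched below. Both datasets go through the map phase, each value is tagged with its source, and the reducer pairs them up per key. The users/orders datasets and all function names here are hypothetical, and the in-memory dict stands in for the sort phase.

```python
from collections import defaultdict

def map_users(user_id, name):
    # Tag each value with its source so the reducer can tell the datasets apart
    yield user_id, ("user", name)

def map_orders(order_id, order):
    user_id, item = order
    # Re-key the order by the join key (user_id)
    yield user_id, ("order", item)

def join_reducer(user_id, tagged_values):
    names, items = [], []
    for tag, payload in tagged_values:
        if tag == "user":
            names.append(payload)
        else:
            items.append(payload)
    # Inner join: one output pair per matching (user, order) combination
    for name in names:
        for item in items:
            yield user_id, (name, item)

def run_join(users, orders):
    groups = defaultdict(list)  # stands in for the sort phase
    for record_key, value in users:
        for key, tagged in map_users(record_key, value):
            groups[key].append(tagged)
    for record_key, value in orders:
        for key, tagged in map_orders(record_key, value):
            groups[key].append(tagged)
    results = []
    for key in sorted(groups):
        results.extend(join_reducer(key, groups[key]))
    return results

users = [(1, "ada"), (2, "bob")]
orders = [("o1", (1, "book")), ("o2", (1, "pen")), ("o3", (3, "lamp"))]
joined = run_join(users, orders)
print(joined)
```

Note that every record of both datasets passes through the sort phase, even those with no match, which is one reason joins tend to be slow.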

Getting started: Installation options

• Cloudera virtual machine

• Your own virtual machine (install Ubuntu in VirtualBox, which is free)

• Elastic MapReduce on EC2

• StarCluster with Hadoop on EC2

• Cloudera’s distribution of Hadoop on EC2

• Install Cloudera’s distribution of Hadoop on your own machine

• Available for RPM and Debian deployments

• Or download Hadoop directly from http://hadoop.apache.org/

Getting started: Language choices

• Hadoop is written in Java

• However, Hadoop Streaming allows mappers and reducers in any language!

• Binary data is a little tricky with Hadoop Streaming

• Could use base64 encoding, but TypedBytes are much better

• For Python, try Dumbo: http://wiki.github.com/klbostee/dumbo

• The Python word-count example and others come with Dumbo

• Dumbo makes binary data with TypedBytes easy

• Also consider Hadoopy: https://github.com/bwhite/hadoopy
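
A sketch of what word-count looks like in the Hadoop Streaming style, where mappers and reducers exchange plain text lines, by default "key<TAB>value". In a real job the mapper and reducer would be separate scripts reading sys.stdin and writing sys.stdout, with Hadoop sorting in between; here they are functions over lists of lines so the sketch runs locally. All function names are made up for illustration. The last helper shows the base64 fallback for binary values mentioned above.

```python
import base64
from itertools import groupby

def stream_mapper(lines):
    """Turn input lines into 'word<TAB>1' output lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def stream_reducer(sorted_lines):
    """Sum counts per word; expects 'word<TAB>count' lines already sorted by word."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

def encode_value(raw):
    """Make a binary value safe for a line-oriented text protocol."""
    return base64.b64encode(raw).decode("ascii")

mapped = list(stream_mapper(["hello there", "why hello"]))
reduced = list(stream_reducer(sorted(mapped)))  # sorted() stands in for Hadoop's sort
print(reduced)
print(encode_value(b"\x00\t\n\xff"))
```

Raw bytes can contain tabs and newlines, which would break the line format; base64 avoids that at the cost of larger values.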

Useful resources and tips

• The Hadoop homepage: http://hadoop.apache.org/

• Cloudera: http://cloudera.com/

• Dumbo: http://wiki.github.com/klbostee/dumbo

• Hadoopy: https://github.com/bwhite/hadoopy

• Amazon Elastic Compute Cloud Getting Started Guide: http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/

• Always test locally on a tiny dataset before running on a cluster!
