Hadoop uses the MapReduce programming model to process and generate large data sets across clusters of machines. The model splits each job into independent map and reduce tasks: mappers process input key/value pairs to produce intermediate key/value pairs, and reducers then merge all intermediate values associated with the same key. This lets computations over large datasets be parallelized and distributed automatically.
7. MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
jeff@google.com, sanjay@google.com
Google, Inc.
To appear in OSDI 2004

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis for a rewrite of our production indexing system.
11. MapReduce history
"A programming model and implementation for processing and generating large data sets"
34. Server Funerals
No pagers go off when machines die
Report of dead machines once a week
Clean out the carcasses
35. Robustness attributes prevented from bleeding into application code
Data redundancy
Node death
Retries
Data geography
Parallelism
Scalability
67. Not right now...
Do you expect to tackle a very
large problem before you:
change jobs
change industries
retire
die
see the heat death of the universe
68. In the next decade, the class (scale) of
problems we are aiming to solve will
grow exponentially.
76. The process
Every item in the dataset is a parallel candidate for Map
Map(k1,v1) -> list(k2,v2)
The framework collects and groups pairs from all lists by key
Reduce runs in parallel on each group
Reduce(k2, list(v2)) -> list(v3)
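To make the two signatures concrete, here is a minimal word-count sketch in Java against the Hadoop Mapper/Reducer API. It is an illustrative example rather than code from the deck, and the class names (WordCount, TokenMapper, SumReducer) are my own.

// Word-count sketch: map(offset, line) -> list((word, 1)); reduce(word, [1, 1, ...]) -> (word, total)
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: turn each input line into (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit an intermediate (k2, v2) pair
                }
            }
        }
    }

    // Reduce phase: sum all the 1s grouped under the same word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                 // combine every value grouped under k2
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

The middle step of the list above is done by the framework itself: after the map tasks finish, it shuffles and sorts the (word, 1) pairs so each reducer sees one key together with all of its values.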
77. FP For the Grid
MapReduce: functional programming on a distributed processing platform
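The same map-then-group-then-reduce shape exists in ordinary functional code. The sketch below (my own example, not from the deck) counts pages per host on a single JVM with java.util.stream, which is the local analogue of what Hadoop distributes across a cluster.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FunctionalAnalogy {
    public static void main(String[] args) {
        // Each entry stands in for one web request log line: "host path".
        List<String> requestLog = List.of(
                "example.com /index.html",
                "example.com /about.html",
                "other.org /index.html");

        Map<String, Long> pagesPerHost = requestLog.parallelStream()
                .map(line -> line.split("\\s+")[0])                             // map: line -> host
                .collect(Collectors.groupingBy(h -> h, Collectors.counting())); // group by key and reduce

        System.out.println(pagesPerHost);   // e.g. {example.com=2, other.org=1}
    }
}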
110. Pig Basics
Yahoo-authored add-on DSL & tool
Origin: Pig Latin
Analyzes large data sets
High-level language for expressing data
analysis programs
111. Pig Questions
Ask big questions of unstructured data
How many ___?
Should we ____?
Decide on the questions you want to ask long after you've collected the data.
112. Pig Sample
A = load 'passwd' using PigStorage(':');  -- split each line of 'passwd' on ':'
B = foreach A generate $0 as id;          -- keep only the first field as 'id'
dump B;                                   -- print the ids to the console
store B into 'id.out';                    -- also write them to 'id.out'
133. EMR Functions
RunJobFlow: Creates a job flow request, starts EC2 instances, and begins processing.
DescribeJobFlows: Provides the status of your job flow request(s).
AddJobFlowSteps: Adds additional steps to an already running job flow.
TerminateJobFlows: Terminates a running job flow and shuts down all of its instances.
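A sketch of driving those four actions from the AWS SDK for Java (v1) might look like the following. The client and request class names reflect my understanding of that SDK, and the credentials, bucket, jar, and instance settings are placeholders, so treat this as an outline rather than a verified recipe.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.DescribeJobFlowsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;
import com.amazonaws.services.elasticmapreduce.model.TerminateJobFlowsRequest;

public class EmrJobFlowSketch {
    public static void main(String[] args) {
        // Placeholder credentials; real code would load these securely.
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        // RunJobFlow: create the job flow, start EC2 instances, run the first step.
        StepConfig firstStep = new StepConfig()
                .withName("word-count")
                .withHadoopJarStep(new HadoopJarStepConfig()
                        .withJar("s3://my-bucket/wordcount.jar")   // placeholder jar and paths
                        .withArgs("s3://my-bucket/input", "s3://my-bucket/output"));
        RunJobFlowRequest run = new RunJobFlowRequest()
                .withName("demo-job-flow")
                .withLogUri("s3://my-bucket/logs")
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(4)
                        .withMasterInstanceType("m1.small")
                        .withSlaveInstanceType("m1.small"))
                .withSteps(firstStep);
        String jobFlowId = emr.runJobFlow(run).getJobFlowId();

        // DescribeJobFlows: check the status of the job flow we just started.
        emr.describeJobFlows(new DescribeJobFlowsRequest().withJobFlowIds(jobFlowId));

        // AddJobFlowSteps: append another step to the running job flow.
        emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                .withJobFlowId(jobFlowId)
                .withSteps(new StepConfig()
                        .withName("follow-up-step")
                        .withHadoopJarStep(new HadoopJarStepConfig()
                                .withJar("s3://my-bucket/followup.jar"))));

        // TerminateJobFlows: stop the job flow and shut down all of its instances.
        emr.terminateJobFlows(new TerminateJobFlowsRequest().withJobFlowIds(jobFlowId));
    }
}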
145. Hadoop
Divide and conquer gigantic data
Matthew McCullough
Email matthewm@ambientideas.com
Twitter @matthewmccull
Blog http://ambientideas.com/blog