Hadoop Lecture for Harvard's CS 264 -- October 19, 2009

Apache Hadoop
Philip Zeyliger (Math, Dunster ’04)
philip@cloudera.com
@philz42 @cloudera
October 19, 2009
CS 264

Who am I?
Software Engineer
Zak’s classmate
Worked at
(Interns)

Outline
Review of last Wednesday
Your Homework
Data Warehousing
Some Hadoop Internals
Research & Hadoop
Short Break

The Basics
Clusters, not individual machines
Scale Linearly
Separate App Code from Fault-Tolerant Distributed Systems Code
Systems Programmers
Statisticians

Some Big Numbers
Yahoo! Hadoop Clusters: > 82PB, >25k machines (Eric14, HadoopWorld NYC ’09)
Google: 40 GB/s GFS read/write load (Jeff Dean, LADIS ’09) [~3,500 TB/day]
Facebook: 4TB new data per day; DW: 4800 cores, 5.5 PB (Dhruba Borthakur, HadoopWorld)

Physical Flow
M-R Model
Logical
Physical

Important APIs
M/R Flow
Input Format   data → K₁,V₁
Mapper         K₁,V₁ → K₂,V₂
Combiner       K₂,iter(V₂) → K₂,V₂
Partitioner    K₂,V₂ → int
Reducer        K₂,iter(V₂) → K₃,V₃
Out. Format    K₃,V₃ → data
→ is 1:many
Other
Writable
JobClient
*Context
Filesystem
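The Partitioner signature (K₂,V₂→int) can be tried out without a cluster. The sketch below uses the same arithmetic as Hadoop's default HashPartitioner (clear the sign bit, then take the remainder); PartitionSketch and partitionFor are hypothetical plain-Java stand-ins, not Hadoop classes.

```java
final class PartitionSketch {
    // Map a key to a reducer index in [0, numReduceTasks).
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Equal keys always land on the same reducer, which is what makes grouping work.
        for (String key : new String[] {"adams", "dunster", "kirkland"}) {
            System.out.println(key + " -> reducer " + partitionFor(key, 4));
        }
    }
}
```

Every value for a given K₂ goes to one reducer, so the reducer sees the full iter(V₂) for that key.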

the “grep” example
public int run(String[] args) throws Exception {
  if (args.length < 3) {
    System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }
  // scratch directory that hands job 1's output to job 2
  Path tempDir = new Path("grep-temp-" +
      Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
  JobConf grepJob = new JobConf(getConf(), Grep.class);
  try {
    // job 1: count the regex matches
    grepJob.setJobName("grep-search");
    FileInputFormat.setInputPaths(grepJob, args[0]);
    grepJob.setMapperClass(RegexMapper.class);
    grepJob.set("mapred.mapper.regex", args[2]);
    if (args.length == 4)
      grepJob.set("mapred.mapper.regex.group", args[3]);
    grepJob.setCombinerClass(LongSumReducer.class);
    grepJob.setReducerClass(LongSumReducer.class);
    FileOutputFormat.setOutputPath(grepJob, tempDir);
    grepJob.setOutputFormat(SequenceFileOutputFormat.class);
    grepJob.setOutputKeyClass(Text.class);
    grepJob.setOutputValueClass(LongWritable.class);
    JobClient.runJob(grepJob);
    // job 2: sort the matches by count
    JobConf sortJob = new JobConf(Grep.class);
    sortJob.setJobName("grep-sort");
    FileInputFormat.setInputPaths(sortJob, tempDir);
    sortJob.setInputFormat(SequenceFileInputFormat.class);
    sortJob.setMapperClass(InverseMapper.class);
    // write a single file
    sortJob.setNumReduceTasks(1);
    FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
    // sort by decreasing freq
    sortJob.setOutputKeyComparatorClass(
        LongWritable.DecreasingComparator.class);
    JobClient.runJob(sortJob);
  } finally {
    FileSystem.get(grepJob).delete(tempDir, true);
  }
  return 0;
}

$ cat input.txt
adams dunster kirkland dunster
kirland dudley dunster
adams dunster winthrop
$ bin/hadoop jar hadoop-0.18.3-examples.jar grep input.txt output1 'dunster|adams'
$ cat output1/part-00000
4 dunster
2 adams
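The same two-stage computation can be traced in memory, as a rough sketch. The plain-Java class below is hypothetical (GrepSketch and its method names are illustrative, and it uses no Hadoop classes): stage 1 plays the role of RegexMapper plus LongSumReducer, stage 2 plays the role of InverseMapper plus the decreasing sort.

```java
import java.util.*;
import java.util.regex.*;

final class GrepSketch {
    // Stage 1 (map + combine/reduce): emit (match, 1) for every regex match, then sum per match.
    static Map<String, Long> countMatches(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                counts.merge(m.group(), 1L, Long::sum);
            }
        }
        return counts;
    }

    // Stage 2 (inverse map + sort): swap to (count, match) and order by decreasing count.
    static List<String> sortByDecreasingCount(Map<String, Long> counts) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) {
            out.add(e.getValue() + "\t" + e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "adams dunster kirkland dunster",
            "kirland dudley dunster",
            "adams dunster winthrop");
        for (String row : sortByDecreasingCount(countMatches(input, "dunster|adams"))) {
            System.out.println(row);
        }
    }
}
```

On the three input lines above this yields 4 for dunster and 2 for adams, matching part-00000.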

Job 1 of 2
JobConf grepJob = new JobConf(getConf(), Grep.class);
try {
  grepJob.setJobName("grep-search");
  FileInputFormat.setInputPaths(grepJob, args[0]);
  grepJob.setMapperClass(RegexMapper.class);
  grepJob.set("mapred.mapper.regex", args[2]);
  if (args.length == 4)
    grepJob.set("mapred.mapper.regex.group", args[3]);
  grepJob.setCombinerClass(LongSumReducer.class);
  grepJob.setReducerClass(LongSumReducer.class);
  FileOutputFormat.setOutputPath(grepJob, tempDir);
  grepJob.setOutputFormat(SequenceFileOutputFormat.class);
  grepJob.setOutputKeyClass(Text.class);
  grepJob.setOutputValueClass(LongWritable.class);
  JobClient.runJob(grepJob);
} ...

Job 2 of 2 (implicit identity reducer)
  JobConf sortJob = new JobConf(Grep.class);
  sortJob.setJobName("grep-sort");
  FileInputFormat.setInputPaths(sortJob, tempDir);
  sortJob.setInputFormat(SequenceFileInputFormat.class);
  sortJob.setMapperClass(InverseMapper.class);
  // write a single file
  sortJob.setNumReduceTasks(1);
  FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
  // sort by decreasing freq
  sortJob.setOutputKeyComparatorClass(
      LongWritable.DecreasingComparator.class);
  JobClient.runJob(sortJob);
} finally {
  FileSystem.get(grepJob).delete(tempDir, true);
}
return 0;
}

The types there...
?,Text
Text, Long
Long,Text
Text, list(Long)
Text, Long

A Simple Join
People
Id  Last        First
1   Washington  George
2   Lincoln     Abraham
Entry Log
Location  Id  Time
Dunster   1   11:00am
Dunster   2   11:02am
Kirkland  2   11:08am
Key: Id
You want to track individuals throughout the day.
How would you do this in M/R, if you had to?
Your Homework
(this is the only lolcat in this lecture)

Implement PageRank over Wikipedia Pages
Mental Challenges
Learn an algorithm
Adapt it to M/R Model
Practical Challenges
Learn Finicky Software
Debug an unfamiliar environment

Advice
Tackle Parts Separately
Algorithm
Implementing in M/R (What are the type signatures?)
Starting a cluster on EC2
Small dataset
Large dataset
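To separate the algorithm from the M/R plumbing, it can help to run one PageRank iteration on a toy graph first. The sketch below is only an illustration under stated assumptions: the damping factor 0.85 is an assumed value (not from the lecture), every page appears as a key in the link map, every page has at least one outgoing link, and PageRankSketch is a hypothetical plain-Java class. The comments mark where the map and reduce type signatures would fall.

```java
import java.util.*;

final class PageRankSketch {
    static final double DAMPING = 0.85; // assumed value; not from the lecture

    // One iteration. links: page -> outgoing links; ranks: page -> current rank.
    static Map<String, Double> iterate(Map<String, List<String>> links,
                                       Map<String, Double> ranks) {
        // "Map": (page, (rank, outlinks)) -> (target, rank / outDegree) per outlink.
        Map<String, Double> sums = new TreeMap<>();
        for (String page : links.keySet()) {
            sums.put(page, 0.0);
        }
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            List<String> out = e.getValue();
            for (String target : out) {
                sums.merge(target, ranks.get(e.getKey()) / out.size(), Double::sum);
            }
        }
        // "Reduce": (page, iter(contribution)) -> (page, newRank), with damping applied.
        int n = links.size();
        Map<String, Double> next = new TreeMap<>();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            next.put(e.getKey(), (1 - DAMPING) / n + DAMPING * e.getValue());
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new TreeMap<>();
        links.put("A", Arrays.asList("B", "C"));
        links.put("B", Arrays.asList("C"));
        links.put("C", Arrays.asList("A"));
        Map<String, Double> ranks = new TreeMap<>();
        for (String page : links.keySet()) {
            ranks.put(page, 1.0 / links.size());
        }
        System.out.println(iterate(links, ranks));
    }
}
```

Real Wikipedia data breaks both assumptions (dangling pages, links to missing pages), which is part of the "data is dirty" challenge.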

More Advice
Wealth of “Getting Started” materials online
Feel free to work together
Don’t be a perfectionist about it; data is dirty!
if (____ ≫ Java), use “streaming”

What is DW?
a.k.a. BI “Business Intelligence”
Provides data to support decisions
Not the operational/transactional database
e.g., answers “what has our inventory been over time?”, not “what is our inventory now?”

Why DW?
Learn from data
Reporting
Ad-hoc analysis
e.g.: which trail mix should TJ’s discontinue? (and other important business questions)
(Image: page from a Trader Joe’s product flyer)

Traditionally...
Big databases
Schemas
Dimensional Modelling (Ralph Kimball)

“MAD Skills”
Magnetic
Agile
Deep
MAD Skills: New Analysis Practices for Big Data
Jeffrey Cohen (Greenplum), Brian Dolan (Fox Interactive Media), Mark Dunlap (Evergreen Technologies), Joseph M. Hellerstein (U.C. Berkeley), Caleb Welton (Greenplum)
ABSTRACT
As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world’s largest advertising networks at Fox Interactive Media, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.
1. INTRODUCTION
If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.
– Prof. Hal Varian, UC Berkeley, Chief Economist at Google [5]
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.
– UrbanDictionary.com [22]
Standard business practices for large-scale data analysis center on the notion of an “Enterprise Data Warehouse” (EDW) that is queried by “Business Intelligence” (BI) software. BI tools produce reports and interactive interfaces that summa[…] into groups. This was the topic of significant academic research and industrial development throughout the 1990’s.
Traditionally, a carefully designed EDW is considered to have a central role in good IT practice. The design and evolution of a comprehensive EDW schema serves as the rallying point for disciplined data integration within a large enterprise, rationalizing the outputs and representations of all business processes. The resulting database serves as the repository of record for critical business functions. In addition, the database server storing the EDW has traditionally been a major computational asset, serving as the central, scalable engine for key enterprise analytics. The conceptual and computational centrality of the EDW makes it a mission-critical, expensive resource, used for serving data-intensive reports targeted at executive decision-makers. It is traditionally controlled by a dedicated IT staff that not only maintains the system, but jealously controls access to ensure that executives can rely on a high quality of service. [12]
While this orthodox EDW approach continues today in many settings, a number of factors are pushing towards a very different philosophy for large-scale data management in the enterprise. First, storage is now so cheap that small subgroups within an enterprise can develop an isolated database of astonishing scale within their discretionary budget. The world’s largest data warehouse from just over a decade ago can be stored on less than 20 commodity disks priced at under $100 today. A department can pay for 1-2 orders of magnitude more storage than that without coordinating with management. Meanwhile, the number of massive-scale data sources in an enterprise has grown remarkably: massive databases arise today even from single sources like clickstreams, software logs, email and discussion forum archives, etc. Finally, the value of data analysis has entered common culture, with numerous companies showing how sophisticated data analysis leads to cost savings and even direct revenue. The end result of these opportunities is a grassroots move to collect and leverage data in multiple organizational […]
MADness is Enabling
Instrumentation
Collection
Storage (Raw Data)
ETL (Extraction, Transform, Load)
RDBMS (Aggregates)
BI / Reporting
Traditional DW
Ad-hoc Queries?
Data Mining?

Data Mining
Instrumentation
Collection
Storage (Raw Data)
ETL (Extraction, Transform, Load)
RDBMS (Aggregates)
BI / Reporting
Traditional DW
Ad-hoc Queries

Facebook’s DW (phase N)
Oracle Database Server
Data Collection Server
MySQL Tier
Scribe Tier

Facebook’s DW (phase M)
M > N
Facebook Data Infrastructure
2008
MySQL Tier
Scribe Tier
Hadoop Tier
Oracle RAC Servers
Wednesday, April 1, 2009

HDFS
Namenode
Datanodes
One Rack A Different Rack
3x64MB file, 3 rep
4x64MB file, 3 rep
Small file, 7 rep

HDFS Failures?
Datanode crash?
Clients read another copy
Background rebalance
Namenode crash?
uh-oh

M/R
Tasktrackers on the same machines as datanodes
One Rack A Different Rack
Job on stars
Different job
Idle

M/R Failures
Task fails
Try again?
Try again somewhere else?
Report failure
Retries possible because of idempotence
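The "try again, then report failure" policy can be sketched in a few lines. This is a hypothetical plain-Java illustration, not Hadoop's scheduler: RetrySketch and runWithRetries are made-up names, and the attempt limit is a parameter rather than any real configuration value.

```java
import java.util.function.Supplier;

final class RetrySketch {
    // Run an idempotent task, retrying up to maxAttempts times before reporting failure.
    static <T> T runWithRetries(Supplier<T> task, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();   // safe to repeat only because the task is idempotent
            } catch (RuntimeException e) {
                last = e;            // a real scheduler might pick a different machine here
            }
        }
        throw new IllegalStateException("task failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated flaky task: fails twice, then succeeds on the third attempt.
        String result = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("simulated task failure");
            }
            return "done";
        }, 4);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Blind retries are only safe when re-running the task produces the same output and no stray side effects.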

Programming these systems...
Everything can fail
Inherently multi-threaded
Toolset still young
Mental models are different...

Scheduling & Sharing
Mixed use
Batch
Interactive
Real-time
Isolation
Metrics: Latency, Throughput, Utilization (per resource)

Scheduling
Fair and LATE Scheduling (Berkeley)
Nexus (Berkeley)
Quincy (MSR)

Implementation
BOOM Project (Berkeley)
Overlog (Berkeley)
APPENDIX
A. NARADA IN OverLog
Here we provide an executable OverLog implementation of Narada’s mesh maintenance algorithms. Current limitations of the P2 parser and planner require slightly wordier syntax for some of our constructs. Specifically, handling of negation is still incomplete, requiring that we rewrite some rules to eliminate negation. Furthermore, our planner currently handles rules with collocated terms only. The OverLog specification below is directly parsed and executed by our current codebase.
/** Base tables */
materialize(member, infinity, infinity, keys(2)).
materialize(sequence, infinity, 1, keys(2)).
materialize(neighbor, infinity, infinity, keys(2)).
/* Environment table containing configuration
values */
materialize(env, infinity, infinity, keys(2,3)).
/* Setup of configuration values */
E0 neighbor@X(X,Y) :- periodic@X(X,E,0,1), env@X(X,
H, Y), H == "neighbor".
/** Start with sequence number 0 */
S0 sequence@X(X, Sequence) :- periodic@X(X, E, 0,
1), Sequence := 0.
/** Periodically start a refresh */
R1 refreshEvent@X(X) :- periodic@X(X, E, 3).
/** Increment my own sequence number */
R2 refreshSequence@X(X, NewSequence) :-
refreshEvent@X(X), sequence@X(X, Sequence),
NewSequence := Sequence + 1.
/** Save my incremented sequence */
R3 sequence@X(X, NewSequence) :-
refreshSequence@X(X, NewSequence).
/** Send a refresh to all neighbors with my current
membership */
R4 refresh@Y(Y, X, NewSequence, Address, ASequence,
ALive) :- refreshSequence@X(X, NewSequence),
member@X(X, Address, ASequence, Time, ALive),
neighbor@X(X, Y).
/** How many member entries that match the member
in a refresh message (but not myself) do I have? */
R5 membersFound@X(X, Address, ASeq, ALive,
count<*>) :- refresh@X(X, Y, YSeq, Address, ASeq,
ALive), member@X(X, Address, MySeq, MyTime,
MyLive), X != Address.
/** If I have none, just store what I got */
R6 member@X(X, Address, ASequence, T, ALive) :-
membersFound@X(X, Address, ASequence, ALive, C),
C == 0, T := f_now().
/** If I have some, just update with the
information I received if it has a higher
sequence number. */
R7 member@X(X, Address, ASequence, T, ALive) :-
membersFound@X(X, Address, ASequence, ALive, C),
C > 0, T := f_now(), member@X(X, Address,
MySequence, MyT, MyLive), MySequence < ASequence.
/** Update my neighbor’s member entry */
R8 member@X(X, Y, YSeq, T, YLive) :- refresh@X(X,
Y, YSeq, A, AS, AL), T := f_now(), YLive := 1.
/** Add anyone from whom I receive a refresh
message to my neighbors */
N1 neighbor@X(X, Y) :- refresh@X(X, Y,
YS, A, AS, L).
/** Probing of neighbor liveness */
L1 neighborProbe@X(X) :- periodic@X(X, E, 1).
L2 deadNeighbor@X(X, Y) :- neighborProbe@X(X), T :=
f_now(), neighbor@X(X, Y), member@X(X, Y, YS, YT,
L), T - YT > 20.
L3 delete neighbor@X(X, Y) :- deadNeighbor@X(X, Y).
L4 member@X(X, Neighbor, DeadSequence, T, Live) :-
deadNeighbor@X(X, Neighbor), member@X(X,
Neighbor, S, T1, L), Live := 0, DeadSequence := S
+ 1, T:= f_now().
B. CHORD IN OverLog
Here we provide the full OverLog specification for Chord.
This specification deals with lookups, ring maintenance with
a fixed number of successors, finger-table maintenance and
opportunistic finger table population, joins, stabilization,
and node failure detection.
/* The base tuples */
materialize(node, infinity, 1, keys(1)).
materialize(finger, 180, 160, keys(2)).
materialize(bestSucc, infinity, 1, keys(1)).
materialize(succDist, 10, 100, keys(2)).
materialize(succ, 10, 100, keys(2)).
materialize(pred, infinity, 100, keys(1)).
materialize(succCount, infinity, 1, keys(1)).
materialize(join, 10, 5, keys(1)).
materialize(landmark, infinity, 1, keys(1)).
materialize(fFix, infinity, 160, keys(2)).
materialize(nextFingerFix, infinity, 1, keys(1)).
materialize(pingNode, 10, infinity, keys(2)).
materialize(pendingPing, 10, infinity, keys(2)).
/** Lookups */
L1 lookupResults@R(R,K,S,SI,E) :- node@NI(NI,N),
lookup@NI(NI,K,R,E), bestSucc@NI(NI,S,SI), K in

Debugging and
Visualization
[Figure 5: Summarized Swimlanes plot for RandomWriter (top) and Sort (bottom). Panels: “Task durations (RandomWriter: 100GB written: 4 hosts): All nodes” and “Task durations (Sort: 20GB input: 4 hosts): All nodes”. x-axis: Time/s; y-axis: Per-task; series: JT_Map, JT_Reduce.]
[Figure 6: Matrix-vector Multiplication before optimization (above), and after optimization (below). Panels: “Task durations (Matrix-Vec Multiply, Inefficient # Reducers): Per-node” and “Task durations (Matrix-Vec Multiply, Efficient # Reducers): Per-node”. x-axis: Time/s; y-axis: Per-task; series: JT_Map, JT_Reduce.]
4 Examples of Mochi’s Value
We demonstrate the use of Mochi’s visualizations (using mainly Swimlanes due to space constraints). All
of the data is derived from log traces from the Yahoo! M45 [11] production cluster. The examples in § 4.1,
§ 4.2 involve 5-node clusters (4-slave, 1-master), and the example in § 4.3 is from a 25-node cluster. Mochi’s
analysis and visualizations have run on real-world data from 300-node Hadoop production clusters, but we
omit these results for lack of space; furthermore, at that scale, Mochi’s interactive visualization (zooming
in/out and targeted inspection) is of more benefit, rather than a static one.
4.1 Understanding Hadoop Job Structure
Figure 5 shows the Swimlanes plots from the Sort and RandomWriter benchmark workloads (part of the
Mochi (CMU)
Parallax (UW)

Performance
Need for benchmarks (besides GraySort)
Low-hanging fruit!

Higher-Level Languages
Hive (a lot like SQL) (Facebook/Apache)
Pig Latin (Yahoo!/Apache)
DryadLINQ (Microsoft)
Sawzall (Google)
SCOPE (Microsoft)
JAQL (IBM)

Optimizations
For a single query....
For a single workflow...
Across workflows...
Bring out last century’s DB research! (joins)
And file system research too! (RAID)
HadoopDB (Yale)
Data Formats (yes, in ’09)

New Datastore Models
File System
Bigtable, Dynamo, Cassandra, ...
Database

New Computation Models
MPI
M/R
Online M/R
Dryad
Pregel for Graphs
Iterative ML Algorithms

Hardware
Data Center Design (Hamilton, Barroso, Hölzle)
Energy-Efficiency
Network Topology and Hardware
What does flash mean in this context?
What about multi-core?
Larger-Scale Computing

Synchronization, Coordination, and Consistency
Chubby, ZooKeeper, Paxos, ...
Eventual Consistency

Applied Research
(research using M/R)
“Unreasonable Effectiveness of Data”
WebTables (Cafarella)
Translation
ML...

Conferences...
(some in exotic locales)
SIGMOD
VLDB
ICDE
CIDR
HPTS
SOSP
LADIS
OSDI
SIGCOMM
HotCloud
NSDI
SC/ISC
SoCC
Others
(ask a prof!)

The Wheel
Don’t Re-invent
Focus on your data/problem
What about... Reliability, Durability, Stability, Tooling
(Image: page from a Trader Joe’s product flyer)
Uh-oh. Looks like Joe’s been reinventing the wheel again.
“Look, there are lots of different types of wheels!” –Todd Lipcon
Re-invent!
Lots of new possibilities!
New Models!
New implementations!
Better optimizations!

Conclusion
It’s a great time to be in Distributed Systems.
Participate!
Build!
Collaborate!

Questions?
philip@cloudera.com
(we’re hiring) (interns)
