3. Multiple choice: MapReduce is…
a) A combination of two common functional programming functions
b) Used extensively* by Google
c) Implemented in libraries for all languages (that matter)
d) A framework for the management and execution of parallel processing
e) Getting more and more relevant with the emergence of “Big Data”
f) Implementable as a service via AWS
g) Targeted towards batch-style computation
h) All of the above
* Approx. 12K MR programs, from http://www.youtube.com/watch?v=NXCIItzkn3E
4. A potted history of MapReduce
Timeline (2002–2012):
- Google publishes the GFS and MapReduce papers
  (http://labs.google.com/papers/gfs.html, http://labs.google.com/papers/mapreduce.html)
- Hadoop started by Doug Cutting at Yahoo
- Yahoo announces a 10K Hadoop cluster
- AWS launch ElasticMapReduce
- Facebook announces a 21PB Hadoop cluster
5. Processing flow
1. Read and split the input into chunks
2. Call MAP for each chunk, returning intermediate results
3. Partition and sort the intermediate results
4. Call REDUCE for each partition
5. Process each partition and persist the output
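The processing flow above can be sketched as a single-process simulation. The function names (`run_mapreduce`, `map_fn`, `reduce_fn`) are my own for illustration; a real framework runs the same steps distributed across many machines.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(chunks, map_fn, reduce_fn):
    """Single-process sketch of the MapReduce flow (illustrative only)."""
    # 1-2. Call MAP for each chunk, collecting intermediate (key, value) pairs
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_fn(chunk))
    # 3. Partition and sort the intermediate results by key
    intermediate.sort(key=itemgetter(0))
    # 4-5. Call REDUCE once per key group and collect the output
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [value for _, value in group]
        results.append(reduce_fn(key, values))
    return results
```

With a word-count `map_fn` (emit `(word, 1)` per word) and a summing `reduce_fn`, this driver reproduces the word-count example later in the deck.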
6. Map and Reduce by example: word count

part_1.txt:
  Peter Piper picked a peck of pickled peppers,
  A peck of pickled peppers Peter Piper picked;

part_2.txt:
  If Peter Piper picked a peck of pickled peppers,
  Where's the peck of pickled peppers Peter Piper picked?

Map calls:

  Input key  | Input value                                      | Output (key, value) pairs
  -----------|--------------------------------------------------|--------------------------
  part_1.txt | Peter Piper picked a peck of pickled peppers,    | (peter, 1) (piper, 1) (picked, 1) (a, 1) (peck, 1) (of, 1) (pickled, 1) (peppers, 1)
  part_1.txt | A peck of pickled peppers Peter Piper picked;    | (a, 1) (peck, 1) (of, 1) (pickled, 1) (peppers, 1) (peter, 1) (piper, 1) (picked, 1)
  part_2.txt | If Peter Piper picked a peck of pickled peppers, | (if, 1) (peter, 1) (piper, 1) (picked, 1) (a, 1) (peck, 1) (of, 1) (pickled, 1) (peppers, 1)

Reduce calls:

  Input key | Input values  | Output value
  ----------|---------------|-------------
  a         | [1, 1, 1]     | a -> 3
  if        | [1]           | if -> 1
  of        | [1, 1, 1, 1]  | of -> 4
  peck      | [1, 1, 1, 1]  | peck -> 4
  peppers   | [1, 1, 1, 1]  | peppers -> 4
  peter     | [1, 1, 1, 1]  | peter -> 4
  picked    | [1, 1, 1, 1]  | picked -> 4
  pickled   | [1, 1, 1, 1]  | pickled -> 4
  piper     | [1, 1, 1, 1]  | piper -> 4
  the       | [1]           | the -> 1
8. Map and Reduce by pattern

  (A, B)  --map-->  [(C, D), (E, F), (G, H), …]

  (W, [X, Y, Z])  --reduce-->  V
9. Map and Reduce for Word count

  (fileoffset, line of text)  --map-->  [(word1, 1), (word2, 1), (word3, 1), …]

  (word, [1, 1, 1])  --reduce-->  (word, 3)
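The word-count pattern can be sketched as a pair of Python functions (the function names are my own, not from the deck):

```python
def wordcount_map(file_offset, line):
    """(fileoffset, line of text) -> [(word1, 1), (word2, 1), ...]"""
    # Lower-case and strip simple punctuation so "peppers," and
    # "peppers" are counted as the same word, as in the example.
    cleaned = line.lower().replace(",", " ").replace(";", " ").replace("?", " ")
    return [(word, 1) for word in cleaned.split()]

def wordcount_reduce(word, counts):
    """(word, [1, 1, 1]) -> (word, 3)"""
    return (word, sum(counts))
```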
10. Map and Reduce for Search

  (fileoffset, line of text)  --map-->  [(searchterm, filename + line1),
                                         (searchterm, filename + line2)]

  (searchterm, [filename1 + line1,      --reduce-->  (searchterm, [filename1 + line1 + line2,
                filename1 + line2,                                 filename2 + line1])
                filename2 + line1])
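One way to sketch the search pattern in Python: the map emits the search term as the key with a (filename, line) hit as the value, and the reduce groups the hits per file, as in the pattern above. The names and the closure-based way of fixing the search term per job are my own assumptions, not from the deck.

```python
def search_map(search_term, file_name):
    """Returns a mapper for one input file with the search term fixed."""
    def mapper(line_no, line):
        # (fileoffset, line of text) -> [(searchterm, filename + line)]
        # Emit a hit only for lines containing the term.
        if search_term in line:
            return [(search_term, (file_name, line_no))]
        return []
    return mapper

def search_reduce(search_term, hits):
    """(searchterm, [filename + line, ...]) -> hits grouped per file."""
    by_file = {}
    for file_name, line_no in hits:
        by_file.setdefault(file_name, []).append(line_no)
    return (search_term, by_file)
```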
11. Map and Reduce for Index

  (fileoffset, line of text)  --map-->  [(word1, filename),
                                         (word2, filename),
                                         (word3, filename)]

  (word1, [filename1,         --reduce-->  (word1, [filename1,
           filename2,                                filename2,
           filename3])                               filename3])
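The indexing pattern can be sketched as follows. In the pattern above the reduce is essentially the identity; this sketch additionally deduplicates and sorts the filename list, which is an assumption of mine rather than something the deck specifies.

```python
def index_map(file_name, line):
    """(fileoffset, line of text) -> [(word, filename), ...]"""
    # Emit the source filename once per word occurrence.
    return [(word, file_name) for word in line.lower().split()]

def index_reduce(word, file_names):
    """(word, [filename1, filename2, ...]) -> (word, sorted unique filenames)"""
    return (word, sorted(set(file_names)))
```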
30. Limitations
- Processing must be parallelisable
- Best suited to large amounts of uniform data requiring consistent processing, with few dependencies between records
- Not designed for high reliability, e.g. the NameNode is a single point of failure in the Hadoop DFS
31. MapReduce in practice
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance
Source: http://en.wikipedia.org/wiki/Apache_Hadoop
32. FABUQ
- What if my input has multiline records?
- What if my EMR instances don’t have the required libraries, etc. to run my steps?
- What if I need to nest jobs within steps?
- What are the signs that an MR solution might “fit” the problem?
- How do I control the number of mappers and reducers used?
- What if I don’t need to do any reduction?
- How does MR provide fault tolerance?
Outline: What is MapReduce? How does it work? An implementation without the framework; an implementation with the framework; AWS architecture for MapReduce; an example using Hive; an example using Pig; a custom example in Java; limitations.
Goals: unconscious incompetence -> conscious incompetence; high-level understanding; knowledge of low-level usage; limitations; have a conversation with the customer.
The MapReduce library in the user program first shards the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.

One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

A worker who is assigned a map task reads the contents of the corresponding input shard. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.

The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
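The partitioning into R regions described above is typically hash-based, so that every occurrence of a given intermediate key lands in the same reduce region. A minimal sketch (the function name and the specific hash are my own illustrative choices):

```python
def partition(key, num_reduce_tasks):
    """Assign an intermediate key to one of R reduce regions.

    Uses a simple deterministic string hash rather than Python's
    built-in hash(), which is salted per process and so would not
    be stable across workers.
    """
    h = 0
    for ch in str(key):
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h % num_reduce_tasks
```

Because the hash is deterministic, map workers on different machines agree on which reduce region each key belongs to.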
Note: the map input and output have different domains; the reduce input and output have different domains; the domain of the map output is the same as that of the reduce input.
Fibonacci ✖ (each term depends on the previous one) — Searching ✔ (each chunk can be searched independently)
“Monitoring the filesystem counters for a job- particularly relative to byte counts from the map and into the reduce- is invaluable to the tuning of these parameters.” (from http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Source+Code)