MAP-REDUCE
USING PERL
by Phil Whelan
Vancouver.pm 12th August 2009
What is Map-Reduce?
A way of processing large amounts of
data across many machines
Why use Map-Reduce?
If you need to increase your
computational power, you’ll need to
distribute it across more than one
machine
Processing large volumes
of data
Must be able to split up the data into
chunks for processing, which are then
recombined later
Requires a constant flow of data from
one simple state to another
Solving Problems...
An example...
Familiar with grep and sort?
grep "apple" fruit_diary.log | sort
[2007/12/25 09:15] Ate an apple
[2008/02/12 12:37] Thought about apples
[2009/01/09 19:55] MMmmm.. apples
“Grep” extracts all the matching lines
“Sort” sorts all the lines in memory
As the amount of data increases sort
requires more and more memory
What if my fruit_diary.log was 500GB?
grep "apple" fruit_diary.log | sort
[2007/12/25 09:15] Ate an apple
[2008/02/12 12:37] Thought about apples
[2009/01/09 19:55] MMmmm.. apples
[2009/02/15 10:19] I sure do like apples
[2009/02/16 10:20] Apples apples apples!!
We're going to have to re-engineer this
A bigger example...
What if this log was actually all the
tweets on Twitter?
grep "apple" twitter.log
Forget “grep”! How do we write all that
data to disk in the first place?
Distributed File-Systems
Share the file-system transparently across
many machines
You simply see the usual file structure
ls -l /root/data/example/
drwxr-xr-x 34 phil staff 1156 12 Sep 2008 file1.txt
drwxr-xr-x 34 phil staff 1156 12 Sep 2008 file2.txt
drwxr-xr-x 34 phil staff 1156 12 Sep 2008 file3.txt
drwxr-xr-x 34 phil staff 1156 12 Sep 2008 file4.txt
Each file may be stored across many
machines
Files can be replicated across many
machines
Let’s look at Hadoop...
What is Hadoop?
A Map-Reduce framework
“for running applications on large
clusters built of commodity
hardware”
Includes HDFS
What is HDFS?
The file system of Hadoop
Stands for
“Hadoop Distributed File System”
What is Map-Reduce?
A way of processing large amounts of data
across many machines
Map-Reduce is a way of breaking down a
large task into smaller manageable tasks
First we Map, then we Reduce
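The two steps can be tried in memory first. The sketch below is only an illustration of the idea on a single machine, not how Hadoop does it: it uses Perl's built-in map and the reduce function from the core List::Util module, and the sample diary lines are made up for the example.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Made-up sample lines, in the spirit of fruit_diary.log
my @lines = (
    'Ate an apple',
    'Thought about apples',
    'Had a pear',
);

# Map: turn every line into a small record (1 if it mentions "apple", else 0)
my @flags = map { /apple/ ? 1 : 0 } @lines;

# Reduce: combine all the small records into one result
my $total = reduce { $a + $b } 0, @flags;

print "$total\n";    # prints 2
```

Map-Reduce frameworks keep the same two-step shape, but run the Map step on many machines at once and group the records before the Reduce step.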
How MailChannels
uses Map-Reduce
We maintain a reputation system of IP
addresses
We give each IP a reputation score 0-100
We have scores for many millions of IP
addresses
We create these scores from billions of
log lines
Our IP Reputation System
Log lines -> Algorithm -> IP => score
Simplified Algorithm
For each unique IP
  foreach logline
    Good or Bad?
  score = count(Good) / count(Bad)
What we want
We want a count of all the good lines
and bad lines for each IP
# array of [<ip>, <good count>, <bad count>]
my @ip_data = (
    ['1.1.1.1', 5, 97],
    ['1.1.1.2', 121, 7],
    ['1.1.1.3', 15, 7954],
    # ...
    # ...
    ['255.255.255.254', 765, 807],
    ['255.255.255.255', 95, 97],
);
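For comparison, here is a minimal single-machine sketch of building that structure with a hash. The @classified input is hypothetical (log lines already judged good or bad), and the sketch is not MailChannels' code. It works for small inputs, but every IP has to fit in one machine's memory, which is the limit Map-Reduce is meant to remove.

```perl
use strict;
use warnings;

# HYPOTHETICAL input: already-classified log lines as [<ip>, <1 = good, 0 = bad>]
my @classified = (
    ['1.1.1.1', 0],
    ['1.1.1.2', 1],
    ['1.1.1.1', 0],
    ['1.1.1.2', 0],
    ['1.1.1.2', 1],
);

my %counts;    # ip => [<good count>, <bad count>]
for my $rec (@classified) {
    my ($ip, $is_good) = @$rec;
    $counts{$ip} ||= [0, 0];
    $counts{$ip}[ $is_good ? 0 : 1 ]++;    # slot 0 = good, slot 1 = bad
}

# Flatten into the array of [<ip>, <good count>, <bad count>]
my @ip_data = map { [ $_, @{ $counts{$_} } ] } sort keys %counts;

print join(' ', @$_), "\n" for @ip_data;
# prints:
# 1.1.1.1 0 2
# 1.1.1.2 2 1
```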
The Map-Reduce Way
Log lines
Map: How lines are grouped
Reduce: How groups of lines
are processed
Extract the IP and group log lines from the same IP
Map: IP => log line
...or run our log line algorithm now, which is more efficient
Map: IP => is_good, is_bad
...which is actually a lot more complicated...
Map: IP => is_good, is_bad, factor_x, delta_z....
...but let's keep it simple for now
Map: IP => is_good, is_bad
All the IP records are grouped together by Hadoop's sorting
Map: IP => is_good, is_bad
sort by key
Reduce: How groups of lines are processed
Mapped records after sort by key (IP, is_good, is_bad):
1.1.1.1 0 1
1.1.1.1 0 1
1.1.1.1 0 1
1.1.1.2 1 0
1.1.1.2 0 1
1.1.1.2 1 0
1.1.1.3 0 1
1.1.1.3 0 1
1.1.1.3 1 0
We count the good and the bad until the incoming IP changes
Map: IP => is_good, is_bad
Reduce: IP => count(is_good), count(is_bad)
1.1.1.1 0 1
1.1.1.1 0 1
1.1.1.1 0 1
1.1.1.2 1 0    <- Reset the counter here: good_count = 0, bad_count = 0
1.1.1.2 0 1
1.1.1.2 1 0
1.1.1.3 1 0    <- Output counter results here: "1.1.1.2 2 1"
1.1.1.3 0 1
1.1.1.3 1 0
Obviously our algorithm is actually a lot more complicated...
Map: IP => is_good, is_bad
Reduce: IP =>
...but you get the idea
Map: IP => is_good, is_bad
Reduce: IP => count(is_good), count(is_bad)
Let’s write some code...
Some Perl! (at last)
map.pl reduce.pl
map.pl
Hadoop streams log lines on STDIN
We extract the IP and decide if this log line
indicates good or bad behaviour
Skip anything that is not useful
Print out the "key=value" record
Where the IP is the key
Everything else is the value
Separate key and value with
a tab (for Hadoop)
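The steps above can be sketched as follows. This is not the code from the talk, only a minimal reconstruction from the captions. The log format is an assumption made for the example: each useful line is taken to contain an IPv4 address plus the word ACCEPT (good) or REJECT (bad), and map_line is a hypothetical helper name.

```perl
#!/usr/bin/perl
# map.pl - sketch of a streaming mapper.
# ASSUMPTION: hypothetical log format. A useful line contains an IPv4
# address and either the word ACCEPT (good) or REJECT (bad).
use strict;
use warnings;

# Turn one log line into a "key<TAB>value" record, or undef to skip it.
sub map_line {
    my ($line) = @_;

    # Extract the IP; skip the line if there is none
    my ($ip) = $line =~ /\b(\d{1,3}(?:\.\d{1,3}){3})\b/ or return;

    # Decide if this log line indicates good or bad behaviour
    my $is_good = $line =~ /\bACCEPT\b/ ? 1 : 0;
    my $is_bad  = $line =~ /\bREJECT\b/ ? 1 : 0;
    return unless $is_good || $is_bad;    # no verdict: not useful

    # The IP is the key, everything else is the value, tab in between
    return "$ip\t$is_good $is_bad";
}

# Log lines arrive on STDIN, one per line
while (my $line = <STDIN>) {
    chomp $line;
    my $record = map_line($line);
    print "$record\n" if defined $record;
}
```

Keeping the per-line logic in its own function makes it easy to check by hand before any cluster is involved.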
reduce.pl
Hadoop streams our records
back to us on STDIN
Split on tab to get IP key
Extract our good and bad
values from the remainder
Check for new IP
Output the aggregated
record for this IP
Reset the counters for
the next IP
Keep incrementing the
counts until the IP changes
The Reduce should output its
records in the same format as
the Map
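A matching sketch of the reducer, again reconstructed from the captions rather than copied from the talk. It assumes the records look like "IP<TAB>is_good is_bad" and arrive sorted by IP. reduce_stream is a hypothetical helper name, and it takes file handles so that it can be exercised without a cluster.

```perl
#!/usr/bin/perl
# reduce.pl - sketch of a streaming reducer.
# ASSUMPTION: input records are "IP<TAB>is_good is_bad", sorted by IP.
use strict;
use warnings;

sub reduce_stream {
    my ($in, $out) = @_;
    my ($current_ip, $good_count, $bad_count);

    while (my $line = <$in>) {
        chomp $line;

        # Split on tab to get the IP key, then pull the two values apart
        my ($ip, $value) = split /\t/, $line, 2;
        next unless defined $value;
        my ($is_good, $is_bad) = split ' ', $value;
        next unless defined $is_bad;

        # New IP: output the finished record, then reset the counters
        if (!defined $current_ip || $ip ne $current_ip) {
            print {$out} "$current_ip\t$good_count $bad_count\n"
                if defined $current_ip;
            ($current_ip, $good_count, $bad_count) = ($ip, 0, 0);
        }

        # Keep incrementing until the IP changes
        $good_count += $is_good;
        $bad_count  += $is_bad;
    }

    # Do not forget the last IP in the stream
    print {$out} "$current_ip\t$good_count $bad_count\n"
        if defined $current_ip;
}

reduce_stream(\*STDIN, \*STDOUT);
```

Because only the counters for the current IP are held at any time, memory use stays flat no matter how many records stream through.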
Sanity Checking
This should work with small data-sets
cat loglines.log
| perl map.pl
| sort
| perl reduce.pl
Running in “the cloud”
Define the Map and Reduce
commands to be run
Attach any required files
Specify the input and output
files within HDFS
Wait....
Checking HDFS
Checking Job Progress
Cluster Summary
Running Jobs
Completed Jobs
Failed Jobs
Job Statistics
Detailed Job Logs
Checking Cluster Health
List Data-Nodes
Dead Nodes
Node Heart-beat information
Map-Reduce Conclusion
Is a different paradigm for solving
large-scale problems
Not a silver bullet
Can (only) solve specific problems that
can be defined in a Map-Reduce way