MAP-REDUCE USING PERL
by Phil Whelan
Vancouver.pm, 12th August 2009
Talk given at Vancouver Perl Mongers meeting. Describes the basic concepts of Map-Reduce and how to use Hadoop to write Map-Reduce scripts in Perl.

  1. MAP-REDUCE USING PERL by Phil Whelan Vancouver.pm 12th August 2009
  2. What is Map-Reduce?
  3. What is Map-Reduce? A way of processing large amounts of data across many machines
  4. Why use Map-Reduce?
  5. Why use Map-Reduce? If you need to increase your computational power, you’ll need to distribute it across more than one machine
  6. Processing large volumes of data Must be able to split up the data into chunks for processing, which are then recombined later Requires a constant flow of data from one simple state to another
  7. Solving Problems...
  8. An example... Familiar with grep and sort? grep “apple” fruit_diary.log | sort [2007/12/25 09:15] Ate an apple [2008/02/12 12:37] Thought about apples [2009/01/09 19:55] MMmmm.. apples
  9. An example... “Grep” extracts all the matching lines grep “apple” fruit_diary.log | sort [2007/12/25 09:15] Ate an apple [2008/02/12 12:37] Thought about apples [2009/01/09 19:55] MMmmm.. apples
  10. An example... “Sort” sorts all the lines in memory grep “apple” fruit_diary.log | sort [2007/12/25 09:15] Ate an apple [2008/02/12 12:37] Thought about apples [2009/01/09 19:55] MMmmm.. apples
  11. An example... As the amount of data increases sort requires more and more memory grep “apple” fruit_diary.log | sort [2007/12/25 09:15] Ate an apple [2008/02/12 12:37] Thought about apples [2009/01/09 19:55] MMmmm.. apples
  12. An example... What if my fruit_diary.log was 500GB? grep “apple” fruit_diary.log | sort [2007/12/25 09:15] Ate an apple [2008/02/12 12:37] Thought about apples [2009/01/09 19:55] MMmmm.. apples [2009/02/15 10:19] I sure do like apples [2009/02/16 10:20] Apples apples apples!!
  13. An example... We’re going to have to re-engineer this grep “apple” fruit_diary.log | sort
  14. A bigger example... What if this log was actually all the tweets on Twitter? grep “apple” twitter.log
  15. A bigger example... What if this log was actually all the tweets on Twitter? grep “apple” twitter.log Forget “grep”! How do we write all that data to disk in the first place?
  16. Distributed File-Systems
  17. Distributed File-Systems Share the file-system transparently across many machines
  18. Distributed File-Systems Share the file-system transparently across many machines You simply see the usual file structure:
      ls /root/data/example/
      drwxr-xr-x 34 phil staff 1156 12 Sep 2008 file1.txt
      drwxr-xr-x 34 phil staff 1156 12 Sep 2008 file2.txt
      drwxr-xr-x 34 phil staff 1156 12 Sep 2008 file3.txt
      drwxr-xr-x 34 phil staff 1156 12 Sep 2008 file4.txt
  19. Distributed File-Systems Share the file-system transparently across many machines You simply see the usual file structure Each file may be stored across many machines
  20. Distributed File-Systems Share the file-system transparently across many machines You simply see the usual file structure Each file may be stored across many machines Files can be replicated across many machines
  21. Let’s look at Hadoop...
  22. What is Hadoop?
  23. What is Hadoop?
  24. What is Hadoop? A Map-Reduce framework “for running applications on large clusters built of commodity hardware”
  25. What is Hadoop? A Map-Reduce framework “for running applications on large clusters built of commodity hardware” Includes HDFS
  26. What is HDFS?
  27. What is HDFS? The file system of Hadoop
  28. What is HDFS? The file system of Hadoop Stands for “Hadoop Distributed File System”
  29. Interacting with HDFS HDFS supports familiar syntax:
      hadoop fs -cat /root/example/file1.txt
      hadoop fs -chmod -R 755 /root/example/file1.txt
      hadoop fs -chown phil /root/example/file1.txt
      hadoop fs -cp /root/example/file1.txt /root/example/file1.new
      hadoop fs -ls /root/example/
      hadoop fs -mkdir /root/example/new_directory
  30. Let’s get back to Map-Reduce...
  31. What is Map-Reduce? A way of processing large amounts of data across many machines
  32. What is Map-Reduce? A way of processing large amounts of data across many machines Map-Reduce is a way of breaking down a large task into smaller manageable tasks
  33. What is Map-Reduce? A way of processing large amounts of data across many machines Map-Reduce is a way of breaking down a large task into smaller manageable tasks First we Map, then we Reduce
  34. How MailChannels uses Map-Reduce We maintain a reputation system of IP addresses We give each IP a reputation score 0-100 We have scores for many millions of IP addresses We create these scores from billions of log lines
  35. Our IP Reputation System: Log lines -> Algorithm -> IP => score
  36. Simplified Algorithm For each unique IP, foreach log line: Good or Bad? score = count(Good) / count(Bad)
  37. What we want We want a count of all the good lines and bad lines for each IP:
      # array of [<ip>,<good count>,<bad count>]
      @ip_data = (
          [‘1.1.1.1’, 5, 97],
          [‘1.1.1.2’, 121, 7],
          [‘1.1.1.3’, 15, 7954],
          ...
          ...
          [‘255.255.255.254’, 765, 807],
          [‘255.255.255.255’, 95, 97]
      );
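For contrast, the single-machine version of this counting might look like the sketch below. The log format, the IP-extraction regex, and the "GOOD" marker are all assumptions for illustration, not MailChannels' real logic:

```perl
#!/usr/bin/perl
# Hypothetical in-memory version of the per-IP good/bad counting.
use strict;
use warnings;

# Count good and bad lines per IP across a list of log lines.
sub count_lines {
    my (@lines) = @_;
    my %counts;    # ip => [ good_count, bad_count ]
    for my $line (@lines) {
        # Assumed format: an IPv4 address appears somewhere in the line.
        my ($ip) = $line =~ /(\d{1,3}(?:\.\d{1,3}){3})/ or next;
        # Assumed marker: "GOOD" flags good behaviour, anything else is bad.
        my $good = ( $line =~ /GOOD/ ) ? 1 : 0;
        $counts{$ip} ||= [ 0, 0 ];
        $counts{$ip}[0] += $good;
        $counts{$ip}[1] += 1 - $good;
    }
    return %counts;
}

unless (caller) {
    my %counts = count_lines(<STDIN>);
    # Same shape as the @ip_data array on the slide.
    printf "%s %d %d\n", $_, @{ $counts{$_} } for sort keys %counts;
}
```

This keeps every IP in memory on one machine, which is exactly the limitation the Map-Reduce version described next works around.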
  38. The Map-Reduce Way Log lines Map: How lines are grouped Reduce: How groups of lines are processed
  39. The Map-Reduce Way Log lines Map: How lines are grouped Reduce: How groups of lines are processed
  40. The Map-Reduce Way Extract the IP and group log lines from the same IP Map: IP => log line Reduce: How groups of lines are processed
  41. The Map-Reduce Way ...or run our log line algorithm now, which is more efficient Map: IP => is_good, is_bad Reduce: How groups of lines are processed
  42. The Map-Reduce Way ...which is actually a lot more complicated... Map: IP => is_good, is_bad, factor_x, delta_z.... Reduce: How groups of lines are processed
  43. The Map-Reduce Way ...but let’s keep it simple for now Map: IP => is_good, is_bad Reduce: How groups of lines are processed
  44. The Map-Reduce Way All the IP records are grouped together by Hadoop’s sorting Map: IP => is_good, is_bad sort by key Reduce: How groups of lines are processed
  45. The Map-Reduce Way All the IP records are grouped together by Hadoop’s sorting (sort by key):
      1.1.1.1 0 1
      1.1.1.1 0 1
      1.1.1.1 0 1
      1.1.1.2 1 0
      1.1.1.2 0 1
      1.1.1.2 1 0
      1.1.1.3 1 0
      1.1.1.3 0 1
      1.1.1.3 1 0
      Map: IP => is_good, is_bad Reduce: How groups of lines are processed
  46. The Map-Reduce Way All the IP records are grouped together by Hadoop’s sorting (sort by key):
      1.1.1.1 0 1
      1.1.1.1 0 1
      1.1.1.1 0 1
      1.1.1.2 1 0
      1.1.1.2 0 1
      1.1.1.2 1 0
      1.1.1.3 1 0
      1.1.1.3 0 1
      1.1.1.3 1 0
      Map: IP => is_good, is_bad Reduce: How groups of lines are processed
  47. The Map-Reduce Way Log lines Map: IP => is_good, is_bad Reduce: How groups of lines are processed
  48. The Map-Reduce Way We count the good and the bad until the incoming IP changes Map: IP => is_good, is_bad Reduce: IP => count(is_good), count(is_bad)
  49. The Map-Reduce Way We count the good and the bad until the incoming IP changes:
      1.1.1.1 0 1
      1.1.1.1 0 1
      1.1.1.1 0 1
      1.1.1.2 1 0   <- reset the counters here: good_count = 0, bad_count = 0
      1.1.1.2 0 1
      1.1.1.2 1 0
      1.1.1.3 1 0
      1.1.1.3 0 1
      1.1.1.3 1 0
      Map: IP => is_good, is_bad Reduce: IP => count(is_good), count(is_bad)
  50. The Map-Reduce Way We count the good and the bad until the incoming IP changes:
      1.1.1.1 0 1
      1.1.1.1 0 1
      1.1.1.1 0 1
      1.1.1.2 1 0
      1.1.1.2 0 1
      1.1.1.2 1 0
      1.1.1.3 1 0   <- output the counter results here: “1.1.1.2 2 1”
      1.1.1.3 0 1
      1.1.1.3 1 0
      Map: IP => is_good, is_bad Reduce: IP => count(is_good), count(is_bad)
  51. The Map-Reduce Way Log lines Map: IP => is_good, is_bad Reduce: IP => count(is_good), count(is_bad)
  52. The Map-Reduce Way Obviously our algorithm is actually a lot more complicated... Map: IP => is_good, is_bad Reduce: IP =>
  53. The Map-Reduce Way ...but you get the idea Map: IP => is_good, is_bad Reduce: IP => count(is_good), count(is_bad)
  54. The Map-Reduce Way Log lines Map: IP => is_good, is_bad Reduce: IP => count(is_good), count(is_bad)
  55. Let’s write some code...
  56. Some Perl! (at last) map.pl reduce.pl
  57. map.pl
  58. map.pl Hadoop streams log lines on STDIN
  59. map.pl We extract the IP and decide if this log line indicates good or bad behaviour
  60. map.pl Skip anything that isn’t useful
  61. map.pl Print out the “key=value” record
  62. map.pl Where the IP is the key
  63. map.pl Everything else is the value
  64. map.pl Separate key and value with a tab (for Hadoop)
  65. map.pl End the record with a newline
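The slides show map.pl only as screenshots, so here is a minimal sketch along the lines just described. The IP-extraction regex and the "GOOD" marker are assumptions standing in for the real good/bad test:

```perl
#!/usr/bin/perl
# map.pl (sketch) -- Hadoop Streaming feeds raw log lines on STDIN;
# we emit one "IP<TAB>is_good<TAB>is_bad" record per useful line.
use strict;
use warnings;

# Turn one log line into a record, or return nothing for useless lines.
sub map_line {
    my ($line) = @_;

    # Extract the IP -- the key. Skip anything without one.
    my ($ip) = $line =~ /(\d{1,3}(?:\.\d{1,3}){3})/ or return;

    # Decide good or bad behaviour (assumed "GOOD" marker).
    my $is_good = ( $line =~ /GOOD/ ) ? 1 : 0;

    # Key first, then the value fields, separated by tabs for Hadoop.
    return join "\t", $ip, $is_good, 1 - $is_good;
}

unless (caller) {
    while ( my $line = <STDIN> ) {
        chomp $line;
        my $record = map_line($line);
        print "$record\n" if defined $record;    # newline ends the record
    }
}
```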
  66. Hadoop sorts our keys (IPs in this example), sort by key:
      1.1.1.1 0 1
      1.1.1.1 0 1
      1.1.1.1 0 1
      1.1.1.2 1 0
      1.1.1.2 0 1
      1.1.1.2 1 0
      1.1.1.3 1 0
      1.1.1.3 0 1
      1.1.1.3 1 0
  67. reduce.pl Hadoop streams our records back to us on STDIN
  68. reduce.pl Split on tab to get IP key
  69. reduce.pl Extract our good and bad values from the remainder
  70. reduce.pl Check for new IP
  71. reduce.pl Output the aggregated record for this IP
  72. reduce.pl Reset the counters for the next IP
  73. reduce.pl Keep incrementing the counts until the IP changes
  74. reduce.pl The Reduce should output its records in the same format as the Map
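Again, the slides show reduce.pl only as images; a sketch matching the steps above (consuming the tab-separated records the map.pl sketch emits) might be:

```perl
#!/usr/bin/perl
# reduce.pl (sketch) -- Hadoop streams the sorted records back on STDIN;
# we total is_good/is_bad per IP and emit one aggregated record per IP.
use strict;
use warnings;

# Aggregate a sorted list of "IP\tis_good\tis_bad" records.
sub reduce_records {
    my (@records) = @_;
    my @out;
    my ( $current_ip, $good_count, $bad_count ) = ( undef, 0, 0 );

    for my $record (@records) {
        # Split on tab to get the IP key, then the good/bad values.
        my ( $ip, $is_good, $is_bad ) = split /\t/, $record;

        # New IP? Output the aggregated record and reset the counters.
        if ( defined $current_ip && $ip ne $current_ip ) {
            push @out, join "\t", $current_ip, $good_count, $bad_count;
            ( $good_count, $bad_count ) = ( 0, 0 );
        }
        $current_ip = $ip;

        # Keep incrementing until the IP changes.
        $good_count += $is_good;
        $bad_count  += $is_bad;
    }

    # Don't forget the final IP's counts.
    push @out, join "\t", $current_ip, $good_count, $bad_count
        if defined $current_ip;
    return @out;
}

unless (caller) {
    chomp( my @records = <STDIN> );
    print "$_\n" for reduce_records(@records);
}
```

The output keeps the Map's tab-separated key/value format, so the result can feed a further Map-Reduce pass if needed.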
  75. Sanity Checking This should work with small data-sets: cat loglines.log | perl map.pl | sort | perl reduce.pl
  76. Running in “the cloud”
  77. Running in “the cloud” Define the Map and Reduce commands to be run
  78. Running in “the cloud” Attach any required files
  79. Running in “the cloud” Specify the input and output files within HDFS
  80. Running in “the cloud” Wait....
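The job-submission slides are screenshots; with Hadoop Streaming, the steps above typically collapse into a single command of roughly this shape. The streaming jar path and the HDFS paths are placeholders that vary by Hadoop version and cluster layout:

```shell
# Sketch of a Hadoop Streaming invocation for the two Perl scripts.
hadoop jar hadoop-streaming.jar \
    -mapper "perl map.pl" \
    -reducer "perl reduce.pl" \
    -file map.pl \
    -file reduce.pl \
    -input /user/phil/loglines \
    -output /user/phil/ip_counts
```

The -file options ship map.pl and reduce.pl to every node in the cluster; -input and -output name paths within HDFS, and the output directory must not already exist.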
  81. Checking HDFS
  82. Checking Job Progress Cluster Summary Running Jobs Completed Jobs Failed Jobs Job Statistics Detailed Job Logs
  83. Checking Cluster Health List Data-Nodes Dead Nodes Node Heart-beat information Failed Jobs Job Statistics Detailed Job Logs
  84. Map-Reduce Conclusion Is a different paradigm for solving large-scale problems Not a silver bullet Can (only) solve specific problems that can be defined in a Map-Reduce way
