Map Reduce Using Perl

Talk given at a Vancouver Perl Mongers meeting. Describes the basic concepts of Map-Reduce and how to use Hadoop to write Map-Reduce scripts in Perl.

  1. MAP-REDUCE USING PERL by Phil Whelan Vancouver.pm 12th August 2009
  2. What is Map-Reduce?
  3. What is Map-Reduce? A way of processing large amounts of data across many machines
  4. Why use Map-Reduce?
  5. Why use Map-Reduce? If you need to increase your computational power, you’ll need to distribute it across more than one machine
  6. Processing large volumes of data Must be able to split up the data into chunks for processing, which are then recombined later Requires a constant flow of data from one simple state to another
  7. Solving Problems...
  8. An example... Familiar with grep and sort?
      grep "apple" fruit_diary.log | sort
      [2007/12/25 09:15] Ate an apple
      [2008/02/12 12:37] Thought about apples
      [2009/01/09 19:55] MMmmm.. apples
  9. An example... “Grep” extracts all the matching lines
  10. An example... “Sort” sorts all the lines in memory
  11. An example... As the amount of data increases, sort requires more and more memory
  12. An example... What if my fruit_diary.log was 500GB?
      grep "apple" fruit_diary.log | sort
      [2007/12/25 09:15] Ate an apple
      [2008/02/12 12:37] Thought about apples
      [2009/01/09 19:55] MMmmm.. apples
      [2009/02/15 10:19] I sure do like apples
      [2009/02/16 10:20] Apples apples apples!!
  13. An example... We’re going to have to re-engineer this: grep "apple" fruit_diary.log | sort
  14. A bigger example... What if this log was actually all the tweets on Twitter? grep "apple" twitter.log
  15. A bigger example... Forget “grep”! How do we write all that data to disk in the first place?
  16. Distributed File-Systems
  17. Distributed File-Systems Share the file-system transparently across many machines
  18. Distributed File-Systems You simply see the usual file structure:
      ls /root/data/example/
      -rw-r--r-- 1 phil staff 1156 12 Sep 2008 file1.txt
      -rw-r--r-- 1 phil staff 1156 12 Sep 2008 file2.txt
      -rw-r--r-- 1 phil staff 1156 12 Sep 2008 file3.txt
      -rw-r--r-- 1 phil staff 1156 12 Sep 2008 file4.txt
  19. Distributed File-Systems Each file may be stored across many machines
  20. Distributed File-Systems Files can be replicated across many machines
  21. Let’s look at Hadoop...
  22. What is Hadoop?
  24. What is Hadoop? A Map-Reduce framework “for running applications on large clusters built of commodity hardware”
  25. What is Hadoop? Includes HDFS
  26. What is HDFS?
  27. What is HDFS? The file system of Hadoop
  28. What is HDFS? Stands for “Hadoop Distributed File System”
  29. Interacting with HDFS HDFS supports familiar syntax:
      hadoop fs -cat /root/example/file1.txt
      hadoop fs -chmod -R 755 /root/example/file1.txt
      hadoop fs -chown phil /root/example/file1.txt
      hadoop fs -cp /root/example/file1.txt /root/example/file1.new
      hadoop fs -ls /root/example/
      hadoop fs -mkdir /root/example/new_directory
  30. Let’s get back to Map-Reduce...
  31. What is Map-Reduce? A way of processing large amounts of data across many machines
  32. What is Map-Reduce? Map-Reduce is a way of breaking down a large task into smaller manageable tasks
  33. What is Map-Reduce? First we Map, then we Reduce
  34. How MailChannels uses Map-Reduce We maintain a reputation system of IP addresses We give each IP a reputation score 0-100 We have scores for many millions of IP addresses We create these scores from billions of log lines
  35. Our IP Reputation System Log lines -> Algorithm -> IP => score
  36. Simplified Algorithm For each unique IP, for each log line: Good or Bad? score = count(Good) / count(Bad)
  37. What we want We want a count of all the good and bad log lines for each IP:
      # array of [<ip>, <good count>, <bad count>]
      @ip_data = (
          ['1.1.1.1', 5, 97],
          ['1.1.1.2', 121, 7],
          ['1.1.1.3', 15, 7954],
          ...
          ['255.255.255.254', 765, 807],
          ['255.255.255.255', 95, 97]
      );
  38. The Map-Reduce Way Log lines -> Map: How lines are grouped -> Reduce: How groups of lines are processed
  40. The Map-Reduce Way Extract the IP and group log lines from the same IP: Map: IP => log line
  41. The Map-Reduce Way ...or run our log line algorithm now, which is more efficient: Map: IP => is_good, is_bad
  42. The Map-Reduce Way ...which is actually a lot more complicated: Map: IP => is_good, is_bad, factor_x, delta_z....
  43. The Map-Reduce Way ...but let’s keep it simple for now: Map: IP => is_good, is_bad
  44. The Map-Reduce Way All the IP records are grouped together by Hadoop’s sorting: Map: IP => is_good, is_bad -> sort by key -> Reduce: How groups of lines are processed
  45. The Map-Reduce Way The records, sorted by key:
      1.1.1.1 0 1
      1.1.1.1 0 1
      1.1.1.1 0 1
      1.1.1.2 1 0
      1.1.1.2 0 1
      1.1.1.2 1 0
      1.1.1.3 1 0
      1.1.1.3 0 1
      1.1.1.3 1 0
  48. The Map-Reduce Way We count the good and the bad until the incoming IP changes: Reduce: IP => count(is_good), count(is_bad)
  49. The Map-Reduce Way Reset the counters (good_count = 0, bad_count = 0) each time a new IP arrives
  50. The Map-Reduce Way Output the counter results when the IP changes, e.g. “1.1.1.2 2 1”
  52. The Map-Reduce Way Obviously our algorithm is actually a lot more complicated...
  53. The Map-Reduce Way ...but you get the idea
  55. Let’s write some code...
  56. Some Perl! (at last) map.pl reduce.pl
  57. map.pl
  58. map.pl Hadoop streams log lines on STDIN
  59. map.pl We extract the IP and decide if this log line indicates good or bad behaviour
  60. map.pl Skip anything that isn’t useful
  61. map.pl Print out the “key=value” record
  62. map.pl Where the IP is the key
  63. map.pl Everything else is the value
  64. map.pl Separate key and value with a tab (for Hadoop)
  65. map.pl End the record with a newline (a full sketch of the script follows below)
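
The map.pl listing itself isn’t in the transcript, but a minimal sketch consistent with the steps above might look like this. The log-line format and the good/bad test (matching “SPAM”) are illustrative assumptions, standing in for MailChannels’ real algorithm:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hadoop Streaming feeds raw log lines on STDIN
    while ( my $line = <STDIN> ) {
        chomp $line;

        # Extract the IP (assumed here to be the first IP-like token)
        my ($ip) = $line =~ /(\d{1,3}(?:\.\d{1,3}){3})/;

        # Skip anything that isn't useful
        next unless defined $ip;

        # Decide whether this line indicates good or bad behaviour.
        # The /SPAM/ test is a stand-in for the real, more complicated test.
        my ( $is_good, $is_bad ) = $line =~ /SPAM/ ? ( 0, 1 ) : ( 1, 0 );

        # Key (the IP) and value separated by a tab for Hadoop,
        # one record per line, ending with a newline
        print "$ip\t$is_good $is_bad\n";
    }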
  66. Hadoop sorts our keys (IPs in this example)... sort by key:
      1.1.1.1 0 1
      1.1.1.1 0 1
      1.1.1.1 0 1
      1.1.1.2 1 0
      1.1.1.2 0 1
      1.1.1.2 1 0
      1.1.1.3 1 0
      1.1.1.3 0 1
      1.1.1.3 1 0
  67. reduce.pl Hadoop streams our records back to us on STDIN
  68. reduce.pl Split on tab to get the IP key
  69. reduce.pl Extract our good and bad values from the remainder
  70. reduce.pl Check for a new IP
  71. reduce.pl Output the aggregated record for this IP
  72. reduce.pl Reset the counters for the next IP
  73. reduce.pl Keep incrementing the counts until the IP changes
  74. reduce.pl The Reduce should output its records in the same format as the Map (sketch below)
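
Again, the reduce.pl listing isn’t in the transcript; here is a sketch consistent with the steps above, assuming the record format produced by the map.pl sketch:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $current_ip;
    my ( $good_count, $bad_count ) = ( 0, 0 );

    # Hadoop streams the map records back to us on STDIN, sorted by key
    while ( my $line = <STDIN> ) {
        chomp $line;

        # Split on tab to get the IP key; the remainder is the value
        my ( $ip, $value ) = split /\t/, $line, 2;

        # Extract our good and bad values from the remainder
        my ( $is_good, $is_bad ) = split ' ', $value;

        # Check for a new IP
        if ( defined $current_ip && $ip ne $current_ip ) {

            # Output the aggregated record for the previous IP
            print "$current_ip\t$good_count $bad_count\n";

            # Reset the counters for the next IP
            ( $good_count, $bad_count ) = ( 0, 0 );
        }
        $current_ip = $ip;

        # Keep incrementing the counts until the IP changes
        $good_count += $is_good;
        $bad_count  += $is_bad;
    }

    # Don't forget to flush the final IP's counts
    print "$current_ip\t$good_count $bad_count\n" if defined $current_ip;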
  75. Sanity Checking This should work with small data-sets: cat loglines.log | perl map.pl | sort | perl reduce.pl
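
Given the nine sorted records shown on slide 66, that pipeline (assuming the sketches above) would emit three aggregated, tab-separated records:

    1.1.1.1  0 3
    1.1.1.2  2 1
    1.1.1.3  2 1

The middle record matches the “1.1.1.2 2 1” example from slide 50.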
  76. Running in “the cloud”
  77. Running in “the cloud” Define the Map and Reduce commands to be run
  78. Running in “the cloud” Attach any required files
  79. Running in “the cloud” Specify the input and output files within HDFS (example invocation below)
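
The job-submission slide contents didn’t survive the transcript; roughly, a Hadoop Streaming invocation covering those three steps would look like the following. The streaming jar path varies by Hadoop version, and the HDFS paths are illustrative:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -mapper "perl map.pl" \
        -reducer "perl reduce.pl" \
        -file map.pl \
        -file reduce.pl \
        -input /data/loglines \
        -output /data/ip_scores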
  80. Running in “the cloud” Wait....
  81. Checking HDFS
  82. Checking Job Progress Cluster Summary Running Jobs Completed Jobs Failed Jobs Job Statistics Detailed Job Logs
  83. Checking Cluster Health List Data-Nodes Dead Nodes Node Heart-beat information Failed Jobs Job Statistics Detailed Job Logs
  84. Map-Reduce Conclusion Is a different paradigm for solving large-scale problems Not a silver-bullet Can (only) solve specific problems that can be defined in a Map-Reduce way
