Map Reduce Using Perl

Talk given at Vancouver Perl Mongers meeting. Describes the basic concepts of Map-Reduce and how to use Hadoop to write Map-Reduce scripts in Perl.

Transcript

  • 1. MAP-REDUCE USING PERL by Phil Whelan Vancouver.pm 12th August 2009
  • 2. What is Map-Reduce?
  • 3. What is Map-Reduce? A way of processing large amounts of data across many machines
  • 4. Why use Map-Reduce?
  • 5. Why use Map-Reduce? If you need to increase your computational power, you’ll need to distribute it across more than one machine
  • 6. Processing large volumes of data We must be able to split the data into chunks for processing, which are then recombined later. This requires a constant flow of data from one simple state to another.
  • 7. Solving Problems...
  • 8. An example... Familiar with grep and sort? grep “apple” fruit_diary.log | sort [2007/12/25 09:15] Ate an apple [2008/02/12 12:37] Thought about apples [2009/01/09 19:55] MMmmm.. apples
  • 9. An example... “Grep” extracts all the matching lines grep “apple” fruit_diary.log | sort [2007/12/25 09:15] Ate an apple [2008/02/12 12:37] Thought about apples [2009/01/09 19:55] MMmmm.. apples
  • 10. An example... “Sort” sorts all the lines in memory grep “apple” fruit_diary.log | sort [2007/12/25 09:15] Ate an apple [2008/02/12 12:37] Thought about apples [2009/01/09 19:55] MMmmm.. apples
  • 11. An example... As the amount of data increases sort requires more and more memory grep “apple” fruit_diary.log | sort [2007/12/25 09:15] Ate an apple [2008/02/12 12:37] Thought about apples [2009/01/09 19:55] MMmmm.. apples
  • 12. An example... What if my fruit_diary.log was 500GB? grep “apple” fruit_diary.log | sort [2007/12/25 09:15] Ate an apple [2008/02/12 12:37] Thought about apples [2009/01/09 19:55] MMmmm.. apples [2009/02/15 10:19] I sure do like apples [2009/02/16 10:20] Apples apples apples!!
  • 13. An example... We’re going to have to re-engineer this grep “apple” fruit_diary.log | sort
  • 14. A bigger example... What if this log was actually all the tweets on Twitter? grep “apple” twitter.log
  • 15. A bigger example... What if this log was actually all the tweets on Twitter? grep “apple” twitter.log Forget “grep”! How do we write all that data to disk in the first place?
  • 16. Distributed File-Systems
  • 17. Distributed File-Systems Share the file-system transparently across many machines
  • 18. Distributed File-Systems Share the file-system transparently across many machines You simply see the usual file structure ls -l /root/data/example/ -rw-r--r-- 1 phil staff 1156 12 Sep 2008 file1.txt -rw-r--r-- 1 phil staff 1156 12 Sep 2008 file2.txt -rw-r--r-- 1 phil staff 1156 12 Sep 2008 file3.txt -rw-r--r-- 1 phil staff 1156 12 Sep 2008 file4.txt
  • 19. Distributed File-Systems Share the file-system transparently across many machines You simply see the usual file structure Each file may be stored across many machines
  • 20. Distributed File-Systems Share the file-system transparently across many machines You simply see the usual file structure Each file may be stored across many machines Files can be replicated across many machines
  • 21. Let’s look at Hadoop...
  • 22. What is Hadoop?
  • 23. What is Hadoop?
  • 24. What is Hadoop? A Map-Reduce framework “for running applications on large clusters built of commodity hardware”
  • 25. What is Hadoop? A Map-Reduce framework “for running applications on large clusters built of commodity hardware” Includes HDFS
  • 26. What is HDFS?
  • 27. What is HDFS? The file system of Hadoop
  • 28. What is HDFS? The file system of Hadoop Stands for “Hadoop Distributed File System”
  • 29. Interacting with HDFS HDFS supports familiar syntax hadoop fs -cat /root/example/file1.txt hadoop fs -chmod -R 755 /root/example/file1.txt hadoop fs -chown phil /root/example/file1.txt hadoop fs -cp /root/example/file1.txt /root/example/file1.new hadoop fs -ls /root/example/ hadoop fs -mkdir /root/example/new_directory
  • 30. Let’s get back to Map-Reduce...
  • 31. What is Map-Reduce? A way of processing large amounts of data across many machines
  • 32. What is Map-Reduce? A way of processing large amounts of data across many machines Map-Reduce is a way of breaking down a large task into smaller manageable tasks
  • 33. What is Map-Reduce? A way of processing large amounts of data across many machines Map-Reduce is a way of breaking down a large task into smaller manageable tasks First we Map, then we Reduce
  • 34. How MailChannels uses Map-Reduce We maintain a reputation system of IP addresses We give each IP a reputation score of 0-100 We have scores for many millions of IP addresses We create these scores from billions of log lines
  • 35. Our IP Reputation System Log lines → Algorithm → IP => score
  • 36. Simplified Algorithm For each unique IP: foreach log line, Good or Bad? score = count(Good) / count(Bad)
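
For concreteness, here is a minimal single-machine sketch of that simplified algorithm. The log format and the “REJECT” bad-line test are assumptions, since the deck does not show the real parsing rules:

    #!/usr/bin/perl
    # Naive single-machine version of the simplified algorithm
    use strict;
    use warnings;

    my %counts;    # ip => [ good_count, bad_count ]
    while ( my $line = <STDIN> ) {
        # Hypothetical parsing: pull the first IPv4 address out of the line
        my ($ip) = $line =~ /(\d{1,3}(?:\.\d{1,3}){3})/ or next;
        # Hypothetical heuristic: lines mentioning REJECT are "bad"
        my $bad = $line =~ /REJECT/ ? 1 : 0;
        $counts{$ip}[$bad]++;
    }
    for my $ip ( sort keys %counts ) {
        my $good = $counts{$ip}[0] // 0;
        my $bad  = $counts{$ip}[1] // 0;
        my $score = $good / ( $bad || 1 );    # score = count(Good) / count(Bad)
        print "$ip\t$score\n";
    }

Note that %counts must hold every IP in memory at once; that is exactly the single-machine limit that the Map-Reduce version below removes.
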
  • 37. What we want We want a count of all the good lines and bad lines for each IP Log lines # array of [<ip>, <good count>, <bad count>] @ip_data = ( ['1.1.1.1', 5, 97], ['1.1.1.2', 121, 7], ['1.1.1.3', 15, 7954], ... ... ['255.255.255.254', 765, 807], ['255.255.255.255', 95, 97] );
  • 38. The Map-Reduce Way Log lines Map: How lines are grouped Reduce: How groups of lines are processed
  • 39. The Map-Reduce Way Log lines Map: How lines are grouped Reduce: How groups of lines are processed
  • 40. The Map-Reduce Way Extract the IP and group log lines from the same IP Log lines Map: IP => log line Reduce: How groups of lines are processed
  • 41. The Map-Reduce Way ...or run our log line algorithm now, which is more efficient Log lines Map: IP => is_good, is_bad Reduce: How groups of lines are processed
  • 42. The Map-Reduce Way ...which is actually a lot more complicated... Log lines Map: IP => is_good, is_bad, factor_x, delta_z.... Reduce: How groups of lines are processed
  • 43. The Map-Reduce Way ...but let’s keep it simple for now Log lines Map: IP => is_good, is_bad Reduce: How groups of lines are processed
  • 44. The Map-Reduce Way All the IP records are grouped together by Hadoop’s sorting Log lines Map: IP => is_good, is_bad sort by key Reduce: How groups of lines are processed
  • 45. The Map-Reduce Way All the IP records are grouped together by Hadoop’s sorting Log lines Map: IP => is_good, is_bad sort by key Reduce: How groups of lines are processed 1.1.1.1 0 1 1.1.1.1 0 1 1.1.1.1 0 1 1.1.1.2 1 0 1.1.1.2 0 1 1.1.1.2 1 0 1.1.1.3 1 0 1.1.1.3 0 1 1.1.1.3 1 0
  • 46. The Map-Reduce Way All the IP records are grouped together by Hadoop’s sorting Log lines Map: IP => is_good, is_bad sort by key Reduce: How groups of lines are processed 1.1.1.1 0 1 1.1.1.1 0 1 1.1.1.1 0 1 1.1.1.2 1 0 1.1.1.2 0 1 1.1.1.2 1 0 1.1.1.3 1 0 1.1.1.3 0 1 1.1.1.3 1 0
  • 47. The Map-Reduce Way Log lines Map: IP => is_good, is_bad Reduce: How groups of lines are processed
  • 48. The Map-Reduce Way We count the good and the bad until the incoming IP changes Log lines Map: IP => is_good, is_bad Reduce: IP => count(is_good), count(is_bad)
  • 49. The Map-Reduce Way We count the good and the bad until the incoming IP changes Log lines Map: IP => is_good, is_bad Reduce: IP => count(is_good), count(is_bad) 1.1.1.1 0 1 1.1.1.1 0 1 1.1.1.1 0 1 1.1.1.2 1 0 1.1.1.2 0 1 1.1.1.2 1 0 1.1.1.3 1 0 1.1.1.3 0 1 1.1.1.3 1 0 (Reset the counters where the IP changes: good_count = 0, bad_count = 0)
  • 50. The Map-Reduce Way We count the good and the bad until the incoming IP changes Log lines Map: IP => is_good, is_bad Reduce: IP => count(is_good), count(is_bad) 1.1.1.1 0 1 1.1.1.1 0 1 1.1.1.1 0 1 1.1.1.2 1 0 1.1.1.2 0 1 1.1.1.2 1 0 1.1.1.3 1 0 1.1.1.3 0 1 1.1.1.3 1 0 (Output the counter results here: “1.1.1.2 2 1”)
  • 51. The Map-Reduce Way Log lines Map: IP => is_good, is_bad Reduce: IP => count(is_good), count(is_bad)
  • 52. The Map-Reduce Way Obviously our algorithm is actually a lot more complicated... Log lines Map: IP => is_good, is_bad Reduce: IP => count(is_good), count(is_bad)
  • 53. The Map-Reduce Way ...but you get the idea Log lines Map: IP => is_good, is_bad Reduce: IP => count(is_good), count(is_bad)
  • 54. The Map-Reduce Way Log lines Map: IP => is_good, is_bad Reduce: IP => count(is_good), count(is_bad)
  • 55. Let’s write some code...
  • 56. Some Perl! (at last) map.pl reduce.pl
  • 57. map.pl
  • 58. map.pl Hadoop streams log lines on STDIN
  • 59. map.pl We extract the IP and decide if this log line indicates good or bad behaviour
  • 60. map.pl Skip anything that isn’t useful
  • 61. map.pl Print out the “key=value” record
  • 62. map.pl Where the IP is the key
  • 63. map.pl Everything else is the value
  • 64. map.pl Separate key and value with a tab (for Hadoop)
  • 65. map.pl End the record with a newline
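
The map.pl source itself did not survive the transcript, but slides 57-65 pin down its shape. A minimal sketch consistent with those annotations (the log format and the “REJECT” bad-line test are assumptions):

    #!/usr/bin/perl
    # map.pl -- sketch reconstructed from the slide annotations above
    use strict;
    use warnings;

    while ( my $line = <STDIN> ) {    # Hadoop streams log lines on STDIN
        chomp $line;

        # Extract the IP; skip anything that isn't useful
        my ($ip) = $line =~ /(\d{1,3}(?:\.\d{1,3}){3})/ or next;

        # Decide if this log line indicates good or bad behaviour
        # (hypothetical heuristic -- the real rules were not shown)
        my $is_bad  = $line =~ /REJECT/ ? 1 : 0;
        my $is_good = 1 - $is_bad;

        # Key (the IP), a tab to separate key from value for Hadoop,
        # and a newline to end the record
        print "$ip\t$is_good\t$is_bad\n";
    }
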
  • 66. Hadoop sorts our keys (IPs in this example)... 1.1.1.1 0 1 1.1.1.1 0 1 1.1.1.1 0 1 1.1.1.2 1 0 1.1.1.2 0 1 sort by key 1.1.1.2 1 0 1.1.1.3 1 0 1.1.1.3 0 1 1.1.1.3 1 0
  • 67. reduce.pl Hadoop streams our records back to us on STDIN
  • 68. reduce.pl Split on tab to get IP key
  • 69. reduce.pl Extract our good and bad values from the remainder
  • 70. reduce.pl Check for new IP
  • 71. Output the aggregated record for this IP
  • 72. Reset the counters for the next IP
  • 73. Keep incrementing the counts until the IP changes
  • 74. reduce.pl The Reduce should output its records in the same format as the Map
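
Likewise, a reduce.pl sketch matching the annotations in slides 67-74 (same assumptions as the map.pl sketch), relying on Hadoop having already sorted the records by IP:

    #!/usr/bin/perl
    # reduce.pl -- sketch reconstructed from the slide annotations above
    use strict;
    use warnings;

    my ( $current_ip, $good_count, $bad_count ) = ( undef, 0, 0 );

    while ( my $line = <STDIN> ) {    # Hadoop streams sorted records on STDIN
        chomp $line;

        # Split on tab to get the IP key; the remainder is the value
        my ( $ip, $rest ) = split /\t/, $line, 2;
        next unless defined $rest;
        my ( $is_good, $is_bad ) = split /\t/, $rest;

        # Check for a new IP: output the aggregated record, reset the counters
        if ( defined $current_ip && $ip ne $current_ip ) {
            print "$current_ip\t$good_count\t$bad_count\n";
            ( $good_count, $bad_count ) = ( 0, 0 );
        }
        $current_ip = $ip;

        # Keep incrementing the counts until the IP changes
        $good_count += $is_good;
        $bad_count  += $is_bad;
    }

    # Flush the counts for the final IP
    print "$current_ip\t$good_count\t$bad_count\n" if defined $current_ip;
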
  • 75. Sanity Checking This should work with small data-sets cat loglines.log | perl map.pl | sort | perl reduce.pl
  • 76. Running in “the cloud”
  • 77. Running in “the cloud” Define the Map and Reduce commands to be run
  • 78. Running in “the cloud” Attach any required files
  • 79. Running in “the cloud” Specify the input and output files within HDFS
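
Slides 77-79 describe a Hadoop Streaming invocation along these lines; the jar path and the HDFS input/output paths below are placeholders, not the deck's actual values:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -mapper map.pl \
        -reducer reduce.pl \
        -file map.pl \
        -file reduce.pl \
        -input /root/data/loglines \
        -output /root/data/ip_counts

The -file options ship the two Perl scripts to every node in the cluster, and -input/-output name locations within HDFS.
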
  • 80. Running in “the cloud” Wait....
  • 81. Checking HDFS
  • 82. Checking Job Progress Cluster Summary Running Jobs Completed Jobs Failed Jobs Job Statistics Detailed Job Logs
  • 83. Checking Cluster Health List Data-Nodes Dead Nodes Node Heart-beat information Failed Jobs Job Statistics Detailed Job Logs
  • 84. Map-Reduce Conclusion Is a different paradigm for solving large-scale problems Not a silver-bullet Can (only) solve specific problems that can be defined in a Map-Reduce way