Map Reduce Using Perl


    Map Reduce Using Perl - Presentation Transcript

    1. MAP-REDUCE USING PERL by Phil Whelan Vancouver.pm 12th August 2009
    2. What is Map-Reduce?
    3. What is Map-Reduce? A way of processing large amounts of data across many machines
    4. Why use Map-Reduce?
    5. Why use Map-Reduce? If you need to increase your computational power, you’ll need to distribute it across more than one machine
    6. Processing large volumes of data: Must be able to split up the data into chunks for processing, which are then recombined later. Requires a constant flow of data from one simple state to another.
    7. Solving Problems...
    8. An example... Familiar with grep and sort?
         grep "apple" fruit_diary.log | sort
         [2007/12/25 09:15] Ate an apple
         [2008/02/12 12:37] Thought about apples
         [2009/01/09 19:55] MMmmm.. apples
    9. An example... "Grep" extracts all the matching lines
    10. An example... "Sort" sorts all the lines in memory
    11. An example... As the amount of data increases, sort requires more and more memory
    12. An example... What if my fruit_diary.log was 500GB?
         grep "apple" fruit_diary.log | sort
         [2007/12/25 09:15] Ate an apple
         [2008/02/12 12:37] Thought about apples
         [2009/01/09 19:55] MMmmm.. apples
         [2009/02/15 10:19] I sure do like apples
         [2009/02/16 10:20] Apples apples apples!!
    13. An example... We're going to have to re-engineer this
         grep "apple" fruit_diary.log | sort
    14. A bigger example... What if this log was actually all the tweets on Twitter?
         grep "apple" twitter.log
    15. A bigger example... What if this log was actually all the tweets on Twitter?
         grep "apple" twitter.log
         Forget "grep"! How do we write all that data to disk in the first place?
    16. Distributed File-Systems
    17. Distributed File-Systems: Share the file-system transparently across many machines.
    18. Distributed File-Systems: Share the file-system transparently across many machines. You simply see the usual file structure.
         ls -l /root/data/example/
         drwxr-xr-x 34 phil staff 1156 12 Sep 2008 file1.txt
         drwxr-xr-x 34 phil staff 1156 12 Sep 2008 file2.txt
         drwxr-xr-x 34 phil staff 1156 12 Sep 2008 file3.txt
         drwxr-xr-x 34 phil staff 1156 12 Sep 2008 file4.txt
    19. Distributed File-Systems: Share the file-system transparently across many machines. You simply see the usual file structure. Each file may be stored across many machines.
    20. Distributed File-Systems: Share the file-system transparently across many machines. You simply see the usual file structure. Each file may be stored across many machines. Files can be replicated across many machines.
    21. Let’s look at Hadoop...
    22. What is Hadoop?
    24. What is Hadoop? A Map-Reduce framework “for running applications on large clusters built of commodity hardware”
    25. What is Hadoop? A Map-Reduce framework “for running applications on large clusters built of commodity hardware” Includes HDFS
    26. What is HDFS?
    27. What is HDFS? The file system of Hadoop
    28. What is HDFS? The file system of Hadoop Stands for “Hadoop Distributed File System”
    29. Interacting with HDFS: HDFS supports familiar syntax
         hadoop fs -cat /root/example/file1.txt
         hadoop fs -chmod -R 755 /root/example/file1.txt
         hadoop fs -chown phil /root/example/file1.txt
         hadoop fs -cp /root/example/file1.txt /root/example/file1.new
         hadoop fs -ls /root/example/
         hadoop fs -mkdir /root/example/new_directory
    30. Let’s get back to Map-Reduce...
    31. What is Map-Reduce? A way of processing large amounts of data across many machines.
    32. What is Map-Reduce? A way of processing large amounts of data across many machines. Map-Reduce is a way of breaking down a large task into smaller manageable tasks.
    33. What is Map-Reduce? A way of processing large amounts of data across many machines. Map-Reduce is a way of breaking down a large task into smaller manageable tasks. First we Map, then we Reduce.
    34. How MailChannels uses Map-Reduce: We maintain a reputation system of IP addresses. We give each IP a reputation score 0-100. We have scores for many millions of IP addresses. We create these scores from billions of log lines.
    35. Our IP Reputation System
         Log lines
         Algorithm
         IP => score
    36. Simplified Algorithm
         For each unique IP
           foreach logline: Good or Bad?
           score = count(Good) / count(Bad)
    37. What we want: We want a count of all the good lines and bad lines for each IP
         # array of [<ip>, <good count>, <bad count>]
         @ip_data = (
             ['1.1.1.1', 5, 97],
             ['1.1.1.2', 121, 7],
             ['1.1.1.3', 15, 7954],
             ...
             ...
             ['255.255.255.254', 765, 807],
             ['255.255.255.255', 95, 97]
         );
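
For comparison, here is a minimal single-machine sketch of the count described on slide 37. It assumes each log line has already been judged and reduced to a hypothetical `[ip, is_good]` pair; that input shape and the name `count_by_ip` are illustrative, not taken from the talk.

```perl
use strict;
use warnings;

# Hypothetical input: one [ip, is_good] pair per log line that has already
# been judged good (1) or bad (0). This shape is an assumption for the sketch.
sub count_by_ip {
    my (@judged) = @_;
    my %counts;                          # ip => [good count, bad count]
    for my $rec (@judged) {
        my ($ip, $is_good) = @$rec;
        $counts{$ip} ||= [0, 0];
        $counts{$ip}[ $is_good ? 0 : 1 ]++;
    }
    # Flatten into [<ip>, <good count>, <bad count>] rows, ordered by IP string
    return map { [ $_, @{ $counts{$_} } ] } sort keys %counts;
}

my @ip_data = count_by_ip(
    ['1.1.1.1', 0], ['1.1.1.2', 1], ['1.1.1.1', 0], ['1.1.1.2', 0],
);
print join("\t", @$_), "\n" for @ip_data;
```

Note that this version keeps a counter for every IP in one process's memory, which is the limitation the Map-Reduce approach in the following slides works around.
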
    38. The Map-Reduce Way
         Log lines
         Map: How lines are grouped
         Reduce: How groups of lines are processed
    40. The Map-Reduce Way: Extract the IP and group log lines from the same IP
         Log lines
         Map: IP => log line
         Reduce: How groups of lines are processed
    41. The Map-Reduce Way: ...or run our log line algorithm now, which is more efficient
         Log lines
         Map: IP => is_good, is_bad
         Reduce: How groups of lines are processed
    42. The Map-Reduce Way: ...which is actually a lot more complicated...
         Log lines
         Map: IP => is_good, is_bad, factor_x, delta_z....
         Reduce: How groups of lines are processed
    43. The Map-Reduce Way: ...but let's keep it simple for now
         Log lines
         Map: IP => is_good, is_bad
         Reduce: How groups of lines are processed
    44. The Map-Reduce Way: All the IP records are grouped together by Hadoop's sorting
         Log lines
         Map: IP => is_good, is_bad
         sort by key
         Reduce: How groups of lines are processed
    45. The Map-Reduce Way: All the IP records are grouped together by Hadoop's sorting
         Log lines
         Map: IP => is_good, is_bad
         sort by key
         Reduce: How groups of lines are processed
         Sorted records:
           1.1.1.1  0  1
           1.1.1.1  0  1
           1.1.1.1  0  1
           1.1.1.2  1  0
           1.1.1.2  0  1
           1.1.1.2  1  0
           1.1.1.3  0  1
           1.1.1.3  0  1
           1.1.1.3  1  0
    47. The Map-Reduce Way
         Log lines
         Map: IP => is_good, is_bad
         Reduce: How groups of lines are processed
    48. The Map-Reduce Way: We count the good and the bad until the incoming IP changes
         Log lines
         Map: IP => is_good, is_bad
         Reduce: IP => count(is_good), count(is_bad)
    49. The Map-Reduce Way: We count the good and the bad until the incoming IP changes
         Log lines
         Map: IP => is_good, is_bad
         Reduce: IP => count(is_good), count(is_bad)
           1.1.1.1  0  1
           1.1.1.1  0  1
           1.1.1.1  0  1
           1.1.1.2  1  0    <- Reset the counter here (good_count = 0, bad_count = 0)
           1.1.1.2  0  1
           1.1.1.2  1  0
           1.1.1.3  1  0
           1.1.1.3  0  1
           1.1.1.3  1  0
    50. The Map-Reduce Way: We count the good and the bad until the incoming IP changes
         Log lines
         Map: IP => is_good, is_bad
         Reduce: IP => count(is_good), count(is_bad)
           1.1.1.1  0  1
           1.1.1.1  0  1
           1.1.1.1  0  1
           1.1.1.2  1  0
           1.1.1.2  0  1
           1.1.1.2  1  0    <- Output counter results here: "1.1.1.2 2 1"
           1.1.1.3  1  0
           1.1.1.3  0  1
           1.1.1.3  1  0
    51. The Map-Reduce Way
         Log lines
         Map: IP => is_good, is_bad
         Reduce: IP => count(is_good), count(is_bad)
    52. The Map-Reduce Way: Obviously our algorithm is actually a lot more complicated...
         Log lines
         Map: IP => is_good, is_bad
         Reduce: IP =>
    53. The Map-Reduce Way: ...but you get the idea
         Log lines
         Map: IP => is_good, is_bad
         Reduce: IP => count(is_good), count(is_bad)
    55. Let’s write some code...
    56. Some Perl! (at last) map.pl reduce.pl
    57. map.pl
    58. map.pl Hadoop streams log lines on STDIN
    59. map.pl We extract the IP and decide if this log line indicates good or bad behaviour
    60. map.pl Skip anything not useful
    61. map.pl Print out the “key=value” record
    62. map.pl Where the IP is the key
    63. map.pl Everything else is the value
    64. map.pl Separate key and value with a tab (for Hadoop)
    65. map.pl End the record with a newline
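
The transcript does not include the code shown on the map.pl slides, so the following is only a sketch of what slides 58 to 65 describe. The log format (`<ip> <status> <text>`, with `OK` meaning good) and the helper name `map_line` are assumptions made for illustration; the real classification was described as far more complicated.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical log format (an assumption, not from the talk):
#   <ip> <status> <free text>
# A status of "OK" counts as good; anything else counts as bad.
sub map_line {
    my ($line) = @_;
    chomp $line;
    my ($ip, $status) = $line =~ /^(\d{1,3}(?:\.\d{1,3}){3})\s+(\S+)/
        or return;                       # skip lines we cannot use
    my $is_good = $status eq 'OK' ? 1 : 0;
    my $is_bad  = $is_good ? 0 : 1;
    # The IP is the key; a tab separates it from the value; newline ends the record
    return "$ip\t$is_good\t$is_bad\n";
}

# Hadoop streams log lines on STDIN and collects our records from STDOUT
while (my $line = <STDIN>) {
    my $record = map_line($line);
    print $record if defined $record;
}
```

Keeping the per-line logic in its own subroutine makes it easy to try on a handful of sample lines before running it over a full data set.
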
    66. Hadoop sorts our keys (IPs in this example)... sort by key
           1.1.1.1  0  1
           1.1.1.1  0  1
           1.1.1.1  0  1
           1.1.1.2  1  0
           1.1.1.2  0  1
           1.1.1.2  1  0
           1.1.1.3  1  0
           1.1.1.3  0  1
           1.1.1.3  1  0
    67. reduce.pl Hadoop streams our records back to us on STDIN
    68. reduce.pl Split on tab to get IP key
    69. reduce.pl Extract our good and bad values from the remainder
    70. reduce.pl Check for new IP
    71. Output the aggregated record for this IP
    72. Reset the counters for the next IP
    73. Keep incrementing the counts until the IP changes
    74. reduce.pl The Reduce should output its records in the same format as the Map
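
The reduce.pl code itself is also missing from the transcript. A minimal sketch of the steps on slides 67 to 74 might look like this, assuming tab-separated records of the form `<ip>\t<is_good>\t<is_bad>` that arrive already sorted by IP. The name `reduce_stream` and the use of filehandle arguments are choices made for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reads records sorted by key from $in and writes one aggregated record
# per IP to $out, in the same tab-separated format the map step emits.
sub reduce_stream {
    my ($in, $out) = @_;
    my ($current_ip, $good_count, $bad_count) = (undef, 0, 0);
    while (my $line = <$in>) {
        chomp $line;
        # Split on tab: the first field is the IP key, the rest are the values
        my ($ip, $is_good, $is_bad) = split /\t/, $line;
        next unless defined $is_bad;     # skip malformed records
        if (defined $current_ip && $ip ne $current_ip) {
            # New IP: output the aggregated record, then reset the counters
            print {$out} "$current_ip\t$good_count\t$bad_count\n";
            ($good_count, $bad_count) = (0, 0);
        }
        $current_ip  = $ip;
        $good_count += $is_good;
        $bad_count  += $is_bad;
    }
    # The last IP is never followed by a key change, so flush it here
    print {$out} "$current_ip\t$good_count\t$bad_count\n" if defined $current_ip;
}

# Hadoop streams the sorted records back to us on STDIN
reduce_stream(\*STDIN, \*STDOUT);
```

Only the counters for the current IP are held in memory, which works because the sort step guarantees that all records for one IP arrive together.
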
    75. Sanity Checking: This should work with small data-sets
         cat loglines.log | perl map.pl | sort | perl reduce.pl
    76. Running in “the cloud”
    77. Running in “the cloud” Define the Map and Reduce commands to be run
    78. Running in “the cloud” Attach any required files
    79. Running in “the cloud” Specify the input and output files within HDFS
    80. Running in “the cloud” Wait....
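
The command shown on these slides is not in the transcript. As a rough sketch, a Hadoop streaming job matching slides 77 to 79 could be assembled from Perl as below. `-mapper`, `-reducer`, `-file`, `-input` and `-output` are Hadoop streaming options; the jar location and the HDFS paths are placeholders that depend on the installation, and `streaming_command` is a hypothetical helper.

```perl
use strict;
use warnings;

# Build the argument list for a Hadoop streaming job.
sub streaming_command {
    my (%opt) = @_;
    return (
        'hadoop', 'jar', $opt{jar},
        '-mapper',  'perl map.pl',       # the Map command to run
        '-reducer', 'perl reduce.pl',    # the Reduce command to run
        '-file',    'map.pl',            # attach the scripts to the job
        '-file',    'reduce.pl',
        '-input',   $opt{input},         # input path within HDFS
        '-output',  $opt{output},        # output directory within HDFS
    );
}

my @cmd = streaming_command(
    jar    => '/path/to/hadoop-streaming.jar',   # assumption: varies by install
    input  => '/root/example/loglines.log',      # hypothetical HDFS input
    output => '/root/example/ip_counts',         # hypothetical HDFS output
);

# Print the command rather than run it; system(@cmd) would submit the job.
print join(' ', map { /\s/ ? qq{"$_"} : $_ } @cmd), "\n";
```
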
    81. Checking HDFS
    82. Checking Job Progress: Cluster Summary, Running Jobs, Completed Jobs, Failed Jobs, Job Statistics, Detailed Job Logs
    83. Checking Cluster Health: List Data-Nodes, Dead Nodes, Node Heart-beat information, Failed Jobs, Job Statistics, Detailed Job Logs
    84. Map-Reduce Conclusion: Is a different paradigm for solving large-scale problems. Not a silver bullet. Can (only) solve specific problems that can be defined in a Map-Reduce way.

    Talk given at Vancouver Perl Mongers meeting.

    CC Attribution-NonCommercial License
