Perl on Amazon Elastic MapReduce

Hadoop isn't limited to running Java code; you can write your jobs in a variety of dynamic languages.

This talk is about Hadoop's Streaming API, and the best way we found to run Perl jobs on Amazon's Elastic MapReduce platform.

Speaker notes

  • Sort/shuffle runs between the two steps, guaranteeing that all mapper results for a single key go to the same reducer, and that the workload is distributed evenly.
  • The sorting guarantees that all values for a given key are sent to a single reducer.
  • Mozilla Glow tracked Firefox 4 downloads on a world map, in near real-time.
  • On a 50-node cluster, processing ~3BN events takes 11 minutes, including data transfers. Two hours' worth takes 3 minutes, so we can easily have data from 5 minutes ago. It took 1 day to modify the Glow protocol and 1 day to build. Everything is stored on S3.
  • Hadoop handles serialisation, heartbeat, node management, directory services, etc. Speculative task execution: the first task to finish wins. Your code can be very simple and contained.
  • You supply the mapper, reducer, and driver code.
  • S3 gives you virtually unlimited storage with very high redundancy. S3 performance: ~750MB of uncompressed data (110-byte rows -> ~7M rows/sec). All of this is controlled using a REST API. Jobs are called 'steps' in EMR lingo.
  • There is no way to customise the image and, e.g., install your own Perl, so it's a good idea to store the final results of a workflow in S3. There is no way to store dependencies in HDFS when the cluster is created.
  • If you set a value to 0, you know it will be the first (k, v) the reducer sees, 1 will be the second, etc. When the userid changes, it's a new user.
  • E.g., no control over output file names, many of the API settings can't be configured programmatically (command-line switches only), no separate mappers per input, etc. Because reducer input is also sorted on keys, when the key changes you know you won't be seeing any more of those. You might need to keep track of the current key, to use as the previous one.
  • So how do you get all the CPAN goodness you know and love in there? HDFS operations are limited to copy, move, and delete, and the host OS doesn't see it - no untar'ing!
  • You can have multiple inputs.
  • That -D is a Hadoop define, not a JVM system property definition.
  • On a streaming job you specify the programs to use as mapper and reducer.
  • The archive is unpacked in the unknown directory where the task is running, making it accessible to the task.
  • At the end of the job, Hadoop aggregates counters from all tasks.
  • Hive partitioning.

Transcript

  • 1. Perl on Amazon Elastic MapReduce
  • 2. A Gentle Introduction to MapReduce
    • Distributed computing model.
    • Mappers process the input and forward intermediate results to reducers.
    • Reducers aggregate these intermediate results, and emit the final results.
  • 3. $ map | sort | reduce
  • 4. MapReduce
    • Input data is sent to mappers as (k, v) pairs.
    • After processing, mappers emit (k_out, v_out).
    • These pairs are sorted and sent to reducers.
    • All (k_out, v_out) pairs for a given k_out are sent to a single reducer.
  • 5. MapReduce
    • Reducers get (k, [v_1, v_2, …, v_n]).
    • After processing, the reducer emits a (k_f, v_f) per result.
  • 6. MapReduce
    We wanted to have a world map showing where people were starting our games (like Mozilla Glow).
  • 7. Glowfish
  • 8. MapReduce
    • Input: ( epoch, IP address )
    • Mappers group these into 5-minute blocks, and emit ( blockId, IP address )
    • Reducers get ( blockId, [ ip_1, ip_2, …, ip_n ] )
    • Do a geo lookup and emit ( epoch, [ ( lat_1, lon_1 ), ( lat_2, lon_2 ), … ] )
  • 9. $ map | sort | reduce
  • 10. Apache Hadoop
    • Distributed programming framework.
    • Implements MapReduce.
    • Does all the usual distributed-programming heavy lifting for you.
    • Highly fault-tolerant, with automatic task re-assignment in case of failure.
    • You focus on mappers and reducers.
  • 11. Apache Hadoop
    • Native Java API.
    • Streaming API, which can use mappers and reducers written in any programming language.
    • Distributed file system (HDFS).
    • Distributed Cache.
  • 12. Amazon Elastic MapReduce
    • On-demand Hadoop clusters running on EC2 instances.
    • Improved S3 support for storage of input and output data.
    • Build workflows by sending jobs to a cluster.
  • 13. EMR Downsides
    • No control over the machine images.
    • Perl 5.8.8.
    • Ephemeral: when your cluster is shut down (or dies), HDFS is gone.
    • HDFS is not available at cluster-creation time.
    • Debian.
  • 14. Streaming vs. Native
    $ cat | map | sort | reduce
  • 15. Streaming vs. Native
    Instead of ( k, [ v_1, v_2, …, v_n ] ), reducers get
    ( ( k_1, v_1 ), …, ( k_1, v_n ), ( k_2, v_1 ), …, ( k_2, v_m ) )
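Because a streaming reducer sees a flat, sorted stream of (k, v) lines rather than grouped values, it has to detect key changes itself. As a rough illustration (the word-count task and the script names are not from the talk), a streaming reducer in Perl might look like this:

    #!/usr/bin/env perl
    # Hypothetical word-count reducer illustrating key-change detection.
    # Input: tab-separated "word<TAB>1" lines, sorted by key (Hadoop's
    # shuffle guarantees this). Output: "word<TAB>total".
    use strict;
    use warnings;

    my $current_word;
    my $count = 0;

    while ( <> ) {
        chomp;
        my ( $word, $n ) = split /\t/;
        if ( defined $current_word && $word ne $current_word ) {
            # key changed: we will never see $current_word again, flush it
            print "$current_word\t$count\n";
            $count = 0;
        }
        $current_word = $word;
        $count += $n;
    }
    # flush the final key
    print "$current_word\t$count\n" if defined $current_word;

Paired with a mapper that prints one "word\t1" line per token, this can be smoke-tested locally with the pipeline from slide 14: cat input.txt | ./mapper.pl | sort | ./reducer.pl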
  • 16. Composite Keys
    • Reducers receive both keys and values sorted.
    • Merge 3 tables:
      userid, 0, …          # customer info
      userid, 1, …          # payments history
      userid, recordid1, …  # clickstream
      userid, recordid2, …  # clickstream
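On the mapper side this just means tagging each record with a sort field, so that after the shuffle a user's records arrive in the 0, 1, recordid order shown above. A minimal sketch, not from the talk, with an invented record layout (see the comparator and partitioner options under slide 20 for the numeric secondary sort this relies on):

    #!/usr/bin/env perl
    # Hypothetical mapper for the customer-info table. The payments
    # mapper would emit 1 as the second field, and the clickstream
    # mapper would pass its recordid through unchanged.
    use strict;
    use warnings;

    while ( <> ) {
        chomp;
        my ( $userid, @fields ) = split /\t/;
        print join( "\t", $userid, 0, @fields ), "\n";
    }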
  • 17. Streaming vs. Native
    • Limited API.
    • About a 7-10% increase in run time.
    • About a 1000% decrease in development time (as reported by a non-representative sample of developers).
  • 18. Where’s My Towel?
    • Tasks run chrooted in a non-deterministic location.
    • It’s easy to store files in HDFS when submitting a job, impossible to store directory trees.
    • For native Java jobs, your dependencies get packaged in the JAR alongside your code.
  • 19. Streaming’s Little Helpers
    Define your inputs and outputs:
      --input s3://events/2011-30-10
      --output s3://glowfish/output/2011-30-10
  • 20. Streaming’s Little Helpers
    You can use any class in Hadoop’s classpath (as a comparator, partitioner, codec, etc.); several come bundled:
      -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
      -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
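These two classes are what make the composite keys of slide 16 work in streaming: the partitioner routes on the first key field only, while the comparator sorts the second field numerically. The key-field options below come from the Hadoop streaming documentation (linked on slide 50); the field counts are illustrative, and the remaining arguments are elided:

    hadoop jar hadoop-streaming.jar \
      -D stream.num.map.output.key.fields=2 \
      -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
      -D mapred.text.key.comparator.options='-k1,1 -k2,2n' \
      -D mapred.text.key.partitioner.options=-k1,1 \
      -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
      ...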
  • 21. Streaming’s Little Helpers
    Use S3 to store…
      • input data
      • output data
      • supporting data (e.g., Geo-IP)
      • your code
  • 22. Mapper and Reducer
    To specify the mapper and reducer to be used in your streaming job, you can point Hadoop to S3:
      --mapper s3://glowfish/bin/mapper.pl
      --reducer s3://glowfish/bin/reducer.pl
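Putting the pieces together, a streaming step can be submitted with Amazon's elastic-mapreduce command-line tool (linked on slide 51). A minimal sketch, reusing the S3 paths from slide 19, rather than a definitive recipe:

    elastic-mapreduce --create --stream \
      --input   s3://events/2011-30-10 \
      --output  s3://glowfish/output/2011-30-10 \
      --mapper  s3://glowfish/bin/mapper.pl \
      --reducer s3://glowfish/bin/reducer.pl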
  • 23. Support Files
    When specifying a file to store in the Distributed Cache, a URI fragment will be used as a symlink in the local filesystem:
      -cacheFile s3://glowfish/data/GeoLiteCity.dat#GeoLiteCity.dat
  • 25. Dependencies
    But if you store an archive (Zip, TGZ, or JAR) in the Distributed Cache, …
      -cacheArchive s3://glowfish/lib/perllib.tgz
  • 27. Dependencies
    But if you store an archive (Zip, TGZ, or JAR) in the Distributed Cache, …
      -cacheArchive s3://glowfish/lib/perllib.tgz#locallib
  • 28. Dependencies
    Hadoop will uncompress it and create a link to whatever directory it created, in the task’s working directory.
  • 29. Dependencies
    Which is where it stores your mapper and reducer.
  • 30. Dependencies
    use lib qw/ locallib /;
  • 31. Mapper
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use lib qw/ locallib /;
    use JSON::PP;

    my $decoder    = JSON::PP->new->utf8;
    my $missing_ip = 0;

    while ( <> ) {
        chomp;
        next unless /load_complete/;
        my @line = split /\t/;
        my ( $epoch, $payload ) = ( int( $line[1] / 1000 / 300 ), $line[5] );
        my $json = $decoder->decode( $payload );
        if ( ! exists $json->{ip} ) {
            $missing_ip++;
            next;
        }
        print "$epoch\t$json->{ip}\n";
    }
    print STDERR "reporter:counter:Job Counters,MISSING_IP,$missing_ip\n";
  • 33. Reducer
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use lib qw/ locallib /;
    use Geo::IP;
    use Regexp::Common qw/ net /;
    use Readonly;

    Readonly::Scalar my $TAB => "\t";

    my $geo = Geo::IP->open( 'GeoLiteCity.dat', GEOIP_MEMORY_CACHE )
        or die "Could not open GeoIP database: $!\n";

    my $format_errors      = 0;
    my $invalid_ip_address = 0;
    my $geo_lookup_errors  = 0;
    my $time_slot;
    my $previous_time_slot = -1;
  • 36. Reducer
    while ( <> ) {
        chomp;
        my @cols = split( $TAB );
        if ( scalar @cols != 2 ) {
            $format_errors++;
            next;
        }
        # assign (not 'my') so the outer $time_slot survives past the loop
        ( $time_slot, my $ip_addr ) = @cols;
        if ( $previous_time_slot != -1 && $time_slot != $previous_time_slot ) {
            # we've entered a new time slot, write the previous one out
            emit( $time_slot, $previous_time_slot );
        }
        if ( $ip_addr !~ /$RE{net}{IPv4}/ ) {
            $invalid_ip_address++;
            $previous_time_slot = $time_slot;
            next;
        }
  • 39. Reducer
        my $geo_record = $geo->record_by_addr( $ip_addr );
        if ( ! defined $geo_record ) {
            $geo_lookup_errors++;
            $previous_time_slot = $time_slot;
            next;
        }
        # update entry for time slot with lat and lon
        $previous_time_slot = $time_slot;
    } # while ( <> )

    emit( $time_slot + 1, $time_slot );

    print STDERR "reporter:counter:Job Counters,FORMAT_ERRORS,$format_errors\n";
    print STDERR "reporter:counter:Job Counters,INVALID_IPS,$invalid_ip_address\n";
    print STDERR "reporter:counter:Job Counters,GEO_LOOKUP_ERRORS,$geo_lookup_errors\n";
  • 43. Recap
    • EMR clusters are volatile!
  • 44. Recap
    • EMR clusters are volatile.
    • Values for a given key will all go to a single reducer, sorted.
  • 45. Recap
    • EMR clusters are volatile.
    • Values for a given key will all go to a single reducer, sorted.
    • Use S3 for everything, and plan your dataflow ahead.
  • 46. ( On data )
    • Store it wisely, e.g., using a directory structure like the following to get free partitioning in Hive and friends:
      s3://bucket/path/data/run_date=2011-11-12
    • Don’t worry about getting the data out of S3; you can always write a simple job that does that, and run it at the end of your workflow.
  • 47. Recap
    • EMR clusters are volatile.
    • Values for a given key will all go to a single reducer, sorted. Watch for the key changing.
    • Use S3 for everything, and plan your dataflow ahead.
    • Make carton a part of your life, and especially of your build tool’s.
  • 48. ( carton )
    • Shipwright for humans.
    • Reads dependencies from Makefile.PL.
    • Installs them locally to your app.
    • Deploy your stuff, including carton.lock.
    • Run carton install --deployment.
    • Tar the result and upload it to S3.
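Concretely, the packaging steps might look like the following on a build box. This is a sketch under two assumptions not in the talk: that carton installs into its default ./local directory, and that s3cmd (or any equivalent tool) is used for the upload:

    carton install --deployment               # install exactly what carton.lock records
    tar czf perllib.tgz -C local/lib/perl5 .  # module tree at the archive root, so the
                                              # 'locallib' symlink from -cacheArchive
                                              # works with 'use lib qw/ locallib /'
    s3cmd put perllib.tgz s3://glowfish/lib/perllib.tgz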
  • 49. URLs
    • The MapReduce Paper: http://labs.google.com/papers/mapreduce.html
    • Apache Hadoop: http://hadoop.apache.org/
    • Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/
  • 50. URLs
    • Hadoop Streaming Tutorial (Apache): http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
    • Hadoop Streaming How-To (Amazon): http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/CreateJobFlowStreaming.html
  • 51. URLs
    • Amazon EMR Perl Client Library: http://aws.amazon.com/code/Elastic-MapReduce/2309
    • Amazon EMR Command-Line Tool: http://aws.amazon.com/code/Elastic-MapReduce/2264
  • 52. That’s All, Folks!
    Slides available at http://slideshare.net/pfig/perl-on-amazon-elastic-mapreduce
    me@pedrofigueiredo.org