Aws map-reduce-aws
The slides from the presentation I gave to awsug.com.au on November 30, 2011 in Melbourne, Australia.

  • What is MapReduce; how does it work; an implementation without the framework; an implementation with the framework; AWS architecture for MapReduce; an example using Hive; an example using Pig; a custom example in Java; limitations.
  • Unconscious incompetence -> conscious incompetence; high-level understanding; knowledge of low-level usage; limitations; have a conversation with the customer.
  • The MapReduce library in the user program first shards the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task. A worker who is assigned a map task reads the contents of the corresponding input shard. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
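    To make that flow concrete, here is a minimal single-process Ruby sketch of the same pipeline: shard the input, map each shard, partition and group the intermediate pairs by key, then reduce each group. It is an illustration only (no distribution, no master/worker machinery, no fault tolerance), and all of the names in it are mine rather than the deck's or Hadoop's.

      # Minimal single-process sketch of the MapReduce flow described above.
      def mini_map_reduce(shards, num_partitions, map_fn, reduce_fn)
        # Map phase: each (key, value) record in each shard yields intermediate pairs.
        intermediate = shards.flat_map do |_shard_name, records|
          records.flat_map { |k, v| map_fn.call(k, v) }
        end

        # Shuffle phase: partition the intermediate pairs by key (hash mod R).
        partitions = intermediate.group_by { |key, _| key.hash % num_partitions }

        # Reduce phase: within each partition, sort by key, group, and reduce.
        partitions.values.flat_map do |pairs|
          pairs.sort_by(&:first)
               .group_by(&:first)
               .map { |key, kvs| reduce_fn.call(key, kvs.map(&:last)) }
        end
      end

      # Word count over two "shards", mirroring the example later in the deck.
      shards = {
        "part_1.txt" => [[0, "Peter Piper picked a peck of pickled peppers,"]],
        "part_2.txt" => [[0, "If Peter Piper picked a peck of pickled peppers,"]]
      }
      map_fn    = ->(_offset, line) { line.downcase.scan(/[[:alpha:]]+/).map { |w| [w, 1] } }
      reduce_fn = ->(word, counts)  { [word, counts.sum] }

      p mini_map_reduce(shards, 2, map_fn, reduce_fn)
      # => [["peter", 2], ["piper", 2], ..., ["if", 1]] (order depends on partitioning)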
  • Note: the map input and output have different domains; the reduce input and output have different domains; the domain of the map output is the same as the domain of the reduce input.
  • Fibonacci ✖ (each value depends on the previous ones, so the computation cannot be parallelised); searching ✔ (each chunk of the input can be searched independently).
  • “Monitoring the filesystem counters for a job - particularly relative to byte counts from the map and into the reduce - is invaluable to the tuning of these parameters.” (from http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Source+Code)

Presentation Transcript

  • Elastic MapReduce. Andy Marks, Principal Consultant, ThoughtWorks. amarks@thoughtworks.com
  • Objectives: high-level understanding; limitations; examples; inspired to try it out.
  • Multiple choice: MapReduce is…
    a) A combination of 2 common functional programming functions
    b) Used extensively* by Google
    c) Implemented in libraries for all languages (that matter)
    d) A framework for management and execution of processing in parallel
    e) Getting more and more relevant with the emergence of “Big Data”
    f) Implementable as a service via AWS
    g) Targeted towards batch style computation
    h) All of the above
    * Approx. 12K MR programs, from http://www.youtube.com/watch?v=NXCIItzkn3E
  • A potted history of MapReduce (2002-2012): Google publishes the MapReduce and GFS papers (http://labs.google.com/papers/mapreduce.html, http://labs.google.com/papers/gfs.html); Hadoop started by Doug Cutting at Yahoo; Yahoo announces a 10K Hadoop cluster; AWS launches Elastic MapReduce; Facebook announces a 21PB Hadoop cluster.
  • Processing flow: read and split the input into chunks; call MAP for each chunk, which processes the chunk and returns intermediate results; partition and sort the intermediate results; call REDUCE for each partition, which processes the partition; persist the output. (Many MAP and REDUCE tasks run in parallel.)
  • Map and Reduce by example: word count

    part_1.txt:
      Peter Piper picked a peck of pickled peppers,
      A peck of pickled peppers Peter Piper picked;
    part_2.txt:
      If Peter Piper picked a peck of pickled peppers,
      Wheres the peck of pickled peppers Peter Piper picked?

    map calls (input key, input value -> output keys and values):
      part_1.txt, "Peter Piper picked a peck of pickled peppers," ->
        peter 1, piper 1, picked 1, a 1, peck 1, of 1, pickled 1, peppers 1
      part_1.txt, "A peck of pickled peppers Peter Piper picked;" ->
        a 1, peck 1, of 1, pickled 1, peppers 1, peter 1, piper 1, picked 1
      part_2.txt, "If Peter Piper picked a peck of pickled peppers," ->
        if 1, peter 1, piper 1, picked 1, a 1, peck 1, of 1, pickled 1, peppers 1

    reduce calls (input key, input values -> output):
      a, [1, 1, 1] -> a -> 3
      if, [1] -> if -> 1
      of, [1, 1, 1, 1] -> of -> 4
      peck, [1, 1, 1, 1] -> peck -> 4
      peppers, [1, 1, 1, 1] -> peppers -> 4
      peter, [1, 1, 1, 1] -> peter -> 4
      picked, [1, 1, 1, 1] -> picked -> 4
      pickled, [1, 1, 1, 1] -> pickled -> 4
      piper, [1, 1, 1, 1] -> piper -> 4
      the, [1] -> the -> 1
  • The same word count as a Unix pipeline: cat part_* | tr -cs "[:alpha:]" "\n" | tr "[:upper:]" "[:lower:]" | sort | uniq -c
  • Map and Reduce by pattern: map: (A, B) -> [C -> D, E -> F, G -> H, …]; reduce: (W, [X, Y, Z]) -> V
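    The same pattern, rendered as Ruby lambda shapes (a minimal illustration; the names map_fn and reduce_fn are mine, and the bodies are placeholders matching the letters on the slide):

      # map_fn:    (input key, input value)     -> list of [output key, output value] pairs
      # reduce_fn: (output key, list of values) -> reduced result
      map_fn    = ->(a, b)      { [["c", "d"], ["e", "f"], ["g", "h"]] }  # placeholder pairs
      reduce_fn = ->(w, values) { "v" }                                   # placeholder result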
  • Map and Reduce for word count: map: (file offset, line of text) -> [word1 -> 1, word2 -> 1, word3 -> 1, …]; reduce: (word, [1, 1, 1]) -> word -> 3
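    As a sketch of what those two functions could look like in Ruby for word count (my own illustration, not code from the deck):

      # map: (file offset, line of text) -> [[word, 1], ...]
      def wc_map(_offset, line)
        line.downcase.scan(/[[:alpha:]]+/).map { |word| [word, 1] }
      end

      # reduce: (word, [1, 1, ...]) -> [word, count]
      def wc_reduce(word, counts)
        [word, counts.sum]
      end

      p wc_map(0, "Peter Piper picked a peck of pickled peppers,")
      # => [["peter", 1], ["piper", 1], ["picked", 1], ["a", 1], ["peck", 1],
      #     ["of", 1], ["pickled", 1], ["peppers", 1]]
      p wc_reduce("peck", [1, 1, 1, 1])  # => ["peck", 4]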
  • Map and Reduce for search: map: (file offset, line of text) -> [search term -> filename + line1, search term -> filename + line2]; reduce: (search term, [filename1 + line1, filename1 + line2, filename2 + line1]) -> search term -> [filename1 + line1 + line2, filename2 + line1]
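    A corresponding Ruby sketch for search (again my illustration; the search term is fixed in a constant here, and the filename is passed in explicitly rather than taken from the job context):

      SEARCH_TERM = "peck".freeze

      # map: (filename, line of text) -> [[search term, filename + line]] when the line matches
      def search_map(filename, line)
        line.include?(SEARCH_TERM) ? [[SEARCH_TERM, "#{filename}: #{line}"]] : []
      end

      # reduce: (search term, [matching lines...]) -> [search term, all matching lines]
      def search_reduce(term, matches)
        [term, matches]
      end

      p search_map("part_1.txt", "A peck of pickled peppers Peter Piper picked;")
      # => [["peck", "part_1.txt: A peck of pickled peppers Peter Piper picked;"]]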
  • Map and Reduce for index: map: (file offset, line of text) -> [word1 -> filename, word2 -> filename, word3 -> filename]; reduce: (word1, [filename1, filename2, filename3]) -> word1 -> [filename1, filename2, filename3]
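    And a Ruby sketch of the inverted index (my illustration, with the same simplification of passing the filename in directly):

      # map: (filename, line of text) -> [[word, filename], ...]
      def index_map(filename, line)
        line.downcase.scan(/[[:alpha:]]+/).uniq.map { |word| [word, filename] }
      end

      # reduce: (word, [filenames...]) -> [word, unique filenames containing the word]
      def index_reduce(word, filenames)
        [word, filenames.uniq]
      end

      p index_reduce("peck", ["part_1.txt", "part_1.txt", "part_2.txt"])
      # => ["peck", ["part_1.txt", "part_2.txt"]]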
  • A basic example - Java
  • Hadoop architecture
  • A basic example - Ruby
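    The transcript does not include the Ruby code from this slide, but a minimal pair of Hadoop Streaming scripts matching the map.rb and reduce.rb referenced in the EMR job further down might look like this (a sketch under that assumption, not the original code):

      # map.rb - streaming mapper: reads raw lines from STDIN, emits "word<TAB>1" per word.
      STDIN.each_line do |line|
        line.downcase.scan(/[[:alpha:]]+/).each { |word| puts "#{word}\t1" }
      end

      # reduce.rb - streaming reducer: input arrives sorted by key, so counts for the
      # same word are contiguous and can be summed in a single pass.
      current_word, current_count = nil, 0
      STDIN.each_line do |line|
        word, count = line.chomp.split("\t")
        if word == current_word
          current_count += count.to_i
        else
          puts "#{current_word}\t#{current_count}" if current_word
          current_word, current_count = word, count.to_i
        end
      end
      puts "#{current_word}\t#{current_count}" if current_word

    Piping sample input through the mapper, sort, and the reducer in a local shell reproduces what the streaming framework does between the two steps, which makes for a convenient smoke test before submitting the job.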
  • Getting started with AWS and EMR
  • MapReduce architecture in AWS: an EC2 master node (in the master security group, reachable over SSH) coordinates EC2 nodes 1…N (in the slave security group); S3 holds the application and input (read via s3n), the output, and the logging. Note: the EC2 AMIs are Debian/Lenny, 32 or 64 bit.
  • To the Ruby EMR CLI!

    credentials.json:
      {
        "access_id": "…",
        "private_key": "…",
        "keypair": "mr-oregon",
        "key-pair-file": "mr-oregon.pem",
        "log_uri": "s3n://mr-word-count/",
        "region": "us-west-2"
      }

    ./elastic-mapreduce \
      --create \
      --name word-count \
      --stream \
      --instance-count 1 \
      --instance-type m1.small \
      --key-pair mr-oregon \
      --input s3n://mr-word-count-input/ \
      --output s3n://mr-word-count-output/ \
      --mapper "ruby s3n://mr-word-count/map.rb" \
      --reducer "ruby s3n://mr-word-count/reduce.rb"
  • Setup S3 bucket
  • Create new EMR job
  • Supply name and set as streaming
  • Configure against S3 bucket
  • Configure instance types and #
  • Nothing to see here
  • Review and go!
  • Watch as job starts…
  • Runs…
  • And finishes!
  • Ta da!
  • Back to S3 for output
  • Limitations: processing must be parallelisable; large amounts of consistent data requiring consistent processing and few dependencies; not designed for high reliability, e.g. the NameNode is a single point of failure on the Hadoop DFS.
  • MapReduce in practice: log and/or clickstream analysis of various kinds; marketing analytics; machine learning and/or sophisticated data mining; image processing; processing of XML messages; web crawling and/or text processing; general archiving, including of relational/tabular data, e.g. for compliance. Source: http://en.wikipedia.org/wiki/Apache_Hadoop
  • FABUQ: What if my input has multiline records? What if my EMR instances don’t have the required libraries, etc., to run my steps? What if I need to nest jobs within steps? What are the signs that a MR solution might “fit” the problem? How do I control the number of mappers and reducers used? What if I don’t need to do any reduction? How does MR provide fault tolerance?
  • Recap: high-level understanding; limitations; examples; inspired to try it out.