Elastic MapReduce: Outsourcing BigData. Nathan McCourtney, @beaknit
Aws dc elastic-mapreduce

  1. Elastic MapReduce: Outsourcing BigData. Nathan McCourtney, @beaknit
  2. What is MapReduce?

     From Wikipedia:

     MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).

     "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

     "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
  3. The Map

     Mapping involves taking raw data and converting it into a series of symbols.

     For example, DNA sequencing:

         ddATP -> A
         ddGTP -> G
         ddCTP -> C
         ddTTP -> T

     Results in representations like "GATTACA".
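The symbol mapping on this slide can be sketched as a simple lookup. This is only an illustration: the table name `TERMINATOR_TO_BASE`, the method `map_reads`, and the sample reads are hypothetical, chosen so that the output spells the slide's example sequence.

```ruby
# Hypothetical lookup table: chain terminator -> base symbol (pairs from the slide).
TERMINATOR_TO_BASE = {
  "ddATP" => "A",
  "ddGTP" => "G",
  "ddCTP" => "C",
  "ddTTP" => "T"
}.freeze

# Map each raw reading to its symbol and join the symbols into one sequence.
# fetch raises KeyError on an unknown reading instead of silently skipping it.
def map_reads(reads)
  reads.map { |read| TERMINATOR_TO_BASE.fetch(read) }.join
end

# Made-up sample input for illustration.
sample_reads = %w[ddGTP ddATP ddTTP ddTTP ddATP ddCTP ddATP]
puts map_reads(sample_reads)
```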
  4. Practical Mapping

     Inputs are generally flat files containing lines of text.

         clever_critters.txt:
             foxes are clever
             cats are clever

     Files are read in and fed to a mapper one line at a time via STDIN.

         cat clever_critters.txt | mapper.rb
  5. Practical Mapping Cont'd

     The mapper processes the line and outputs a key/value pair to STDOUT for each symbol it maps.

         foxes 1
         are 1
         clever 1
         cats 1
         are 1
         clever 1
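A minimal sketch of such a mapper for the clever_critters.txt example, assuming words are separated by whitespace. The method name `map_line` is hypothetical. The slide prints the pairs with a space; the sketch separates key and value with a tab, which is the Hadoop Streaming default.

```ruby
# Emit one "word<TAB>1" pair for each whitespace-separated word in a line.
def map_line(line)
  line.chomp.split.map { |word| "#{word}\t1" }
end

# The two input lines from clever_critters.txt, held in memory for the demo.
sample_lines = ["foxes are clever\n", "cats are clever\n"]
sample_lines.each do |line|
  map_line(line).each { |pair| puts pair }
end
```

In a real streaming job the loop would read from STDIN instead of an in-memory array, for example with `ARGF.each`.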
  6. Work Partitioning

     These key/value pairs are passed to a "partition function", which organizes the output and assigns it to reducer nodes.

         foxes  -> node 1
         are    -> node 2
         clever -> node 3
         cats   -> node 4
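One common way to write a partition function is to hash the key and take the result modulo the number of reducers. The sketch below assumes that approach. The method name `partition_for` and the use of `Zlib.crc32` are illustrative choices, not what Hadoop itself does internally.

```ruby
require "zlib"

# Assign a key to one of num_reducers partitions (0-based).
# Zlib.crc32 is used because it gives the same value in every process;
# Ruby's String#hash is seeded per process. Assumption for illustration only.
def partition_for(key, num_reducers)
  Zlib.crc32(key) % num_reducers
end

pairs = ["foxes\t1", "are\t1", "clever\t1", "cats\t1", "are\t1", "clever\t1"]

# Group the mapper output by the partition its key hashes to.
by_partition = pairs.group_by { |pair| partition_for(pair.split("\t").first, 4) }

by_partition.sort.each do |partition, assigned|
  puts "node #{partition + 1}: #{assigned.join(' | ')}"
end
```

Every pair with the same key lands on the same node, which is what lets a reducer total that key on its own. Unlike the tidy picture on the slide, a hash can place two different keys on the same node.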
  7. Practical Reduction

     The reducers each receive the sharded workload assigned to them by the partitioning.

     Typically the work is received as a stream of key/value pairs via STDIN:

         "foxes 1"           -> node 1
         "are 1|are 1"       -> node 2
         "clever 1|clever 1" -> node 3
         "cats 1"            -> node 4
  8. Practical Reduction Cont'd

     The reduction is essentially whatever you want it to be.

     There are common patterns that are often pre-solved by the map-reduce framework. See Hadoop's built-in reducers.

     E.g., "Aggregate": give me a total of all the key/values.

         foxes  - 1
         are    - 2
         clever - 2
         cats   - 1
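To make the "total of all the key/values" pattern concrete, here is a hand-written summing reducer. It is a sketch of the idea, not Hadoop's built-in aggregate reducer. The method name `reduce_pairs` is hypothetical, and the input is assumed to be tab-separated "key value" lines.

```ruby
# Sum the values for each key from "key<TAB>value" lines.
def reduce_pairs(lines)
  totals = Hash.new(0)
  lines.each do |line|
    next if line.strip.empty?            # ignore blank lines
    key, value = line.chomp.split("\t", 2)
    totals[key] += value.to_i
  end
  totals
end

# The pairs from the running example, as a reducer would receive them.
pairs = ["foxes\t1", "are\t1", "are\t1", "clever\t1", "clever\t1", "cats\t1"]
reduce_pairs(pairs).each { |key, total| puts "#{key}\t#{total}" }
```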
  9. What is Hadoop?

     From Wikipedia:

     Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of computational independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.

     Essentially, Hadoop is a practical implementation of all the pieces you'd need to accomplish everything we've discussed thus far. It takes in the data, organizes the tasks, passes the data through its entire path and finally outputs the reduction.
  10. Hadoop's Guts

      Source: http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html
  11. Fun to build? No.
  12. Solution? Amazon's Elastic MapReduce
  13. Look complex? It's not.

      1. Sign up for the service
      2. Download the tools (requires Ruby 1.8)
      3. mkdir ~/elastic-mapreduce-cli; cd ~/elastic-mapreduce-cli
      4. Create your credentials.json file

             {
               "access_id": "<key>",
               "private_key": "<secret key>",
               "keypair": "<name of keypair>",
               "key-pair-file": "~/.ssh/<key>.pem",
               "log_uri": "s3://<unique s3 bucket>/",
               "region": "us-east-1"
             }

      5. unzip ~/Downloads/elastic-mapreduce-ruby.zip
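Before running the tools, it can help to check that credentials.json parses and contains the keys shown on this slide. The sketch below does only that. The names `EXPECTED_KEYS` and `missing_credential_keys` are hypothetical, and this is not a description of any validation the EMR tools perform themselves.

```ruby
require "json"

# Keys taken from the credentials.json example on the slide.
EXPECTED_KEYS = %w[access_id private_key keypair key-pair-file log_uri region].freeze

# Return the expected keys that are missing or blank in the given JSON text.
# JSON.parse raises JSON::ParserError if the text is not valid JSON.
def missing_credential_keys(json_text)
  config = JSON.parse(json_text)
  EXPECTED_KEYS.reject do |key|
    config[key].is_a?(String) && !config[key].strip.empty?
  end
end

# Placeholder values as on the slide; substitute your own.
sample = <<~JSON
  {
    "access_id": "<key>",
    "private_key": "<secret key>",
    "keypair": "<name of keypair>",
    "key-pair-file": "~/.ssh/<key>.pem",
    "log_uri": "s3://<unique s3 bucket>/",
    "region": "us-east-1"
  }
JSON

missing = missing_credential_keys(sample)
puts missing.empty? ? "all expected keys present" : "missing: #{missing.join(', ')}"
```

To check a real file, pass `File.read("credentials.json")` instead of the sample text.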
  14. Run it

          ruby elastic-mapreduce --list
          ruby elastic-mapreduce --create --alive
          ruby elastic-mapreduce --list
          ruby elastic-mapreduce --terminate <JobFlowID>

      Note: you can also view it in the Amazon EMR web interface.

      Logs can be viewed by looking in the S3 bucket you specified in your credentials.json file. Just drill down via the S3 web interface and double-click the file.
  15. Creating a minimal job

      1. Set up a dedicated S3 bucket
      2. Create a folder called "input" in that bucket
      3. Upload your inputs into s3://bucket/input

             s3cmd put *log s3://bucket/input
  16. Minimal Job Cont'd

      4. Write a mapper, e.g.:

             ARGF.each do |line|
               # remove any newline
               line = line.chomp
               if /ERROR/.match(line)
                 puts "ERROR\t1"
               end
               if /INFO/.match(line)
                 puts "INFO\t1"
               end
               if /DEBUG/.match(line)
                 puts "DEBUG\t1"
               end
             end

      See http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/ for examples.
  17. Minimal Job Cont'd

      5. Upload your mapper to your S3 bucket

             s3cmd put mapper.rb s3://bucket

      6. Run it

             elastic-mapreduce --create --stream \
               --mapper s3://bucket/mapper.rb \
               --input s3://bucket/input \
               --output s3://bucket/output \
               --reducer aggregate

         NOTE: This job uses the built-in aggregator.
         NOTE: The output directory must NOT exist at the time of the run.

         Amazon will scale EC2 instances to consume the load dynamically.

      7. Pick up your results in the output folder
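Before uploading anything, the minimal job above can be dry-run locally as map, sort, then sum. The sketch below does that in memory, under a few assumptions: the method names `map_log_line` and `local_run` are hypothetical, the sample log lines are made up, and the summing step is a hand-written stand-in for the built-in aggregator. Hadoop Streaming's documented aggregate examples prefix each key with an aggregator name such as `LongValueSum:`, so the optional `key_prefix` parameter is there to try that form.

```ruby
LEVELS = %w[ERROR INFO DEBUG].freeze

# Map step: emit "LEVEL<TAB>1" for each log level named in the line,
# mirroring the deck's mapper (at most one pair per level per line).
def map_log_line(line, key_prefix: "")
  LEVELS.select { |level| line.include?(level) }
        .map { |level| "#{key_prefix}#{level}\t1" }
end

# Local stand-in for the job: map every line, sort the pairs (the shuffle),
# then total the values for each key (the reduce).
def local_run(lines, key_prefix: "")
  pairs = lines.flat_map { |line| map_log_line(line, key_prefix: key_prefix) }.sort
  pairs.each_with_object(Hash.new(0)) do |pair, totals|
    key, value = pair.split("\t", 2)
    totals[key] += value.to_i
  end
end

# Made-up log lines for illustration.
sample_log = [
  "2012-01-01 INFO starting up",
  "2012-01-01 DEBUG loaded config",
  "2012-01-01 ERROR disk full",
  "2012-01-01 INFO shutting down"
]

local_run(sample_log).each { |key, total| puts "#{key}\t#{total}" }
```

If the local totals look right, the same mapping logic can go into mapper.rb, reading from STDIN as in step 4.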
  18. AWS Demo App

      AWS has a very cool publicly available app to run:

          elastic-mapreduce --create --stream \
            --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
            --input s3://elasticmapreduce/samples/wordcount/input \
            --output s3://bucket/output \
            --reducer aggregate

      See Amazon Example Doc.
  19. Possibilities

      EMR is a fully functional Hadoop implementation.

      Mappers and reducers can be written in Python, Ruby, PHP and Java.

      Go crazy.
  20. Further Reading

      - Tom White's O'Reilly on Hadoop
      - AWS EMR Getting Started Guide
      - Hadoop Wiki