AWS DC: Elastic MapReduce

  1. Elastic MapReduce: Outsourcing Big Data (Nathan McCourtney, @beaknit)
  2. What is MapReduce? From Wikipedia:
     MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).
     "Map" step: the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.
     "Reduce" step: the master node then collects the answers to all the sub-problems and combines them in some way to form the output: the answer to the problem it was originally trying to solve.
  3. The Map
     Mapping involves taking raw data and converting it into a series of symbols. For example, DNA sequencing:
     ddATP -> A
     ddGTP -> G
     ddCTP -> C
     ddTTP -> T
     results in representations like "GATTACA".
  4. Practical Mapping
     Inputs are generally flat files containing lines of text.
     clever_critters.txt:
       foxes are clever
       cats are clever
     Files are read in and fed to a mapper one line at a time via STDIN:
       cat clever_critters.txt | mapper.rb
  5. Practical Mapping, Contd
     The mapper processes the line and outputs a key/value pair to STDOUT for each symbol it maps:
       foxes 1
       are 1
       clever 1
       cats 1
       are 1
       clever 1
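A minimal mapper along these lines might look like the sketch below (this is illustrative, not the deck's own code; `map_line` is a name introduced here so the mapping logic can be exercised on its own):

```ruby
# Word-count mapper sketch: emit one "word<TAB>1" pair per
# whitespace-separated token in a line of input.
def map_line(line)
  line.chomp.split(/\s+/).reject(&:empty?).map { |word| "#{word}\t1" }
end

# Hadoop-streaming entry point: read lines from STDIN, print the pairs.
if __FILE__ == $PROGRAM_NAME
  ARGF.each { |line| puts map_line(line) }
end
```

Saved as mapper.rb, this is exactly the shape of script the `cat clever_critters.txt | mapper.rb` pipeline on the previous slide expects.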
  6. Work Partitioning
     These key/value pairs are passed to a "partition function", which organizes the output and assigns it to reducer nodes:
       foxes -> node 1
       are -> node 2
       clever -> node 3
       cats -> node 4
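Hadoop's default partitioner hashes the key modulo the number of reducers; a sketch of the same idea in Ruby (`partition` and `num_reducers` are illustrative names, not EMR options):

```ruby
require 'zlib'

# Hash-partitioner sketch: map a key to one of num_reducers nodes.
# Zlib.crc32 is used because it is stable across processes; Ruby's
# built-in String#hash is randomized per process, so different nodes
# would disagree about which reducer owns a key.
def partition(key, num_reducers)
  Zlib.crc32(key) % num_reducers
end
```

The important property is that every occurrence of the same key lands on the same reducer, which is what lets a reducer see all the counts for one word together.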
  7. Practical Reduction
     The reducers each receive the sharded workload assigned to them by the partitioning. Typically the work is received as a stream of key/value pairs via STDIN:
       "foxes 1" -> node 1
       "are 1|are 1" -> node 2
       "clever 1|clever 1" -> node 3
       "cats 1|cats 1" -> node 4
  8. Practical Reduction, Contd
     The reduction is essentially whatever you want it to be. There are common patterns that are often pre-solved by the map-reduce framework; see Hadoop's built-in reducers.
     E.g., "aggregate" gives a total of all the key/values:
       foxes - 1
       are - 2
       clever - 2
       cats - 1
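The aggregate pattern is just a sum per key, and can be sketched in a few lines of Ruby (`reduce_lines` is an illustrative name; the real built-in is requested with `--reducer aggregate`, shown later):

```ruby
# Sum-reducer sketch: total the counts for each key in a stream of
# "key<TAB>count" lines, the same shape the mapper emitted.
def reduce_lines(lines)
  totals = Hash.new(0)
  lines.each do |line|
    key, count = line.chomp.split("\t")
    totals[key] += count.to_i
  end
  totals
end
```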
  9. What is Hadoop? From Wikipedia:
     Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.
     Essentially, Hadoop is a practical implementation of all the pieces you'd need to accomplish everything we've discussed thus far. It takes in the data, organizes the tasks, passes the data through its entire path, and finally outputs the reduction.
  10. Hadoop's Guts (architecture diagram)
  11. Fun to build? No.
  12. Solution? Amazon's Elastic MapReduce
  13. Look complex? It's not.
      1. Sign up for the service
      2. Download the tools (requires Ruby 1.8)
      3. mkdir ~/elastic-mapreduce-cli; cd ~/elastic-mapreduce-cli
      4. Create your credentials.json file:
         {
           "access_id": "<key>",
           "private_key": "<secret key>",
           "keypair": "<name of keypair>",
           "key-pair-file": "~/.ssh/<key>.pem",
           "log_uri": "s3://<unique s3 bucket>/",
           "region": "us-east-1"
         }
      5. unzip ~/Downloads/
  14. Run it
      ruby elastic-mapreduce --list
      ruby elastic-mapreduce --create --alive
      ruby elastic-mapreduce --list
      ruby elastic-mapreduce --terminate <JobFlowID>
      Note you can also view it in the Amazon EMR web interface. Logs can be viewed by looking into the S3 bucket you specified in your credentials.json file; just drill down via the S3 web interface and double-click the file.
  15. Creating a minimal job
      1. Set up a dedicated S3 bucket
      2. Create a folder called "input" in that bucket
      3. Upload your inputs into s3://bucket/input:
         s3cmd put *log s3://bucket/input
  16. Minimal Job, Contd
      4. Write a mapper, e.g.:
         ARGF.each do |line|
           # remove any newline
           line = line.chomp
           if /ERROR/.match(line)
             puts "ERROR\t1"
           end
           if /INFO/.match(line)
             puts "INFO\t1"
           end
           if /DEBUG/.match(line)
             puts "DEBUG\t1"
           end
         end
      See ... for examples
  17. Minimal Job, Contd
      5. Upload your mapper to your S3 bucket:
         s3cmd put mapper.rb s3://bucket
      6. Run it:
         elastic-mapreduce --create --stream \
           --mapper s3://bucket/mapper.rb \
           --input s3://bucket/input \
           --output s3://bucket/output \
           --reducer aggregate
         NOTE: This job uses the built-in aggregate reducer.
         NOTE: The output directory must NOT exist at the time of the run.
         Amazon will scale EC2 instances to consume the load dynamically.
      7. Pick up your results in the output folder
  18. AWS Demo App
      AWS has a very cool publicly-available app to run:
      elastic-mapreduce --create --stream \
        --mapper s3://elasticmapreduce/samples/wordcount/ \
        --input s3://elasticmapreduce/samples/wordcount/input \
        --output s3://bucket/output \
        --reducer aggregate
      See the Amazon example doc.
  19. Possibilities
      EMR is a fully-functional Hadoop implementation. Mappers and reducers can be written in Python, Ruby, PHP, and Java. Go crazy.
  20. Further Reading
      Tom White's O'Reilly book on Hadoop (Hadoop: The Definitive Guide)
      AWS EMR Getting Started Guide
      Hadoop Wiki