Elastic MapReduce



              Andy Marks
  Principal Consultant, ThoughtWorks
      amarks@thoughtworks.com
Objectives




High level understanding
                                                    Limitations




           Examples        Inspired to try it out
Multiple choice: MapReduce is…

 a) A combination of 2 common functional programming
    messages
 b) Used extensively* by Google
 c) Implemented in libraries for all languages (that matter)
 d) A framework for management and execution of processing in
    parallel
 e) Getting more and more relevant with the emergence of “Big
    Data”
 f) Implementable as a service via AWS
 g) Targeted towards batch style computation
 h) All of the above


                            * Approx 12K MR programs from http://www.youtube.com/watch?v=NXCIItzkn3E
A potted history of MapReduce
  2003   Google publishes the Google File System paper
         http://labs.google.com/papers/gfs.html
  2004   Google publishes the MapReduce paper
         http://labs.google.com/papers/mapreduce.html
  2006   Hadoop started by Doug Cutting at Yahoo
  2008   Yahoo announces 10K Hadoop cluster
  2009   AWS launches Elastic MapReduce
  2010   Facebook announces 21PB Hadoop cluster
Processing flow

  1. Read and split input into chunks
  2. Call MAP for each chunk
  3. MAP: process chunk, returning intermediate results
  4. Partition and sort results
  5. Call REDUCE for each partition
  6. REDUCE: process partition
  7. Persist output

  Many MAP tasks (MAP, MAP, …, MAP) feed a smaller number of REDUCE tasks (REDUCE, …, REDUCE).
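
The flow above can be sketched in a single process. This is only a toy illustration, not how Hadoop or EMR implement it: the function name run_mapreduce, the num_partitions parameter and the sample job are all hypothetical, and the partitioning uses Python's built-in hash() purely for brevity.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, num_partitions=2):
    """Toy, single-process walk through the flow: map, partition, sort, reduce."""
    # Steps 1-3: call map_fn for each input chunk and collect intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Step 4: assign every intermediate pair to a partition, grouping values by key.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in intermediate:
        partitions[hash(key) % num_partitions][key].append(value)

    # Steps 5-7: call reduce_fn once per key, in sorted key order within a partition.
    output = []
    for partition in partitions:
        for key in sorted(partition):
            output.append(reduce_fn(key, partition[key]))
    return output


# Hypothetical job: total length of the words seen for each initial letter.
def initial_map(chunk_id, text):
    return [(word[0], len(word)) for word in text.split()]


def total_reduce(letter, lengths):
    return letter, sum(lengths)


records = [("chunk-1", "aa b"), ("chunk-2", "b ccc")]
result = dict(run_mapreduce(records, initial_map, total_reduce))
```

In a real cluster each MAP and REDUCE call would run on a different worker and the intermediate pairs would move over the network; the sketch only shows the order of the steps.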
Map and Reduce by example: word count

part_1.txt
  Peter Piper picked a peck of pickled peppers,
  A peck of pickled peppers Peter Piper picked;

part_2.txt
  If Peter Piper picked a peck of pickled peppers,
  Where's the peck of pickled peppers Peter Piper picked?

map calls

Input key    Input value                                       Output key   Output value
part_1.txt   Peter Piper picked a peck of pickled peppers,     peter        1
                                                               piper        1
                                                               picked       1
                                                               a            1
                                                               peck         1
                                                               of           1
                                                               pickled      1
                                                               peppers      1
part_1.txt   A peck of pickled peppers Peter Piper picked      a            1
                                                               peck         1
                                                               of           1
                                                               pickled      1
                                                               peppers      1
                                                               peter        1
                                                               piper        1
                                                               picked       1
part_2.txt   If Peter Piper picked a peck of pickled peppers   if           1
                                                               peter        1
                                                               piper        1
                                                               picked       1
                                                               a            1
                                                               peck         1
                                                               of           1
                                                               pickled      1
                                                               …

reduce calls

Input key    Input value     Output value
a            [1, 1, 1]       a -> 3
if           [1]             if -> 1
of           [1, 1, 1, 1]    of -> 4
peck         [1, 1, 1, 1]    peck -> 4
peppers      [1, 1, 1, 1]    peppers -> 4
peter        [1, 1, 1, 1]    peter -> 4
picked       [1, 1, 1, 1]    picked -> 4
pickled      [1, 1, 1, 1]    pickled -> 4
piper        [1, 1, 1, 1]    piper -> 4
the          [1]             the -> 1
cat part_* | tr -cs "[:alpha:]" "\n" | tr "[:upper:]" "[:lower:]" | sort | uniq -c
Map and Reduce by pattern

  A -> B           map      [C -> D,
                             E -> F,
                             G -> H,
                             …]

  W -> [X, Y, Z]   reduce   V
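
One way to read this pattern is as a pair of function signatures. The sketch below is an interpretation rather than anything shown on the slide: the aliases MapFn and ReduceFn and the two example functions are hypothetical, and the type variables reuse the slide's letters, with the reduce key W and values X, Y, Z sharing the types of the map output C and D.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

A = TypeVar("A")  # map input key
B = TypeVar("B")  # map input value
C = TypeVar("C")  # map output key, which is also the reduce input key (W)
D = TypeVar("D")  # map output value, which is also each reduce input value (X, Y, Z)
V = TypeVar("V")  # reduce output

# map: one input pair in, any number of intermediate pairs out.
MapFn = Callable[[A, B], Iterable[Tuple[C, D]]]
# reduce: one key plus every value emitted for that key in, one result out.
ReduceFn = Callable[[C, Iterable[D]], V]


def map_word_lengths(filename: str, line: str) -> Iterator[Tuple[int, str]]:
    """Hypothetical map: key each word by its length."""
    for word in line.split():
        yield len(word), word


def reduce_count(length: int, words: Iterable[str]) -> int:
    """Hypothetical reduce: how many words had this length."""
    return sum(1 for _ in words)


example_map: MapFn = map_word_lengths
example_reduce: ReduceFn = reduce_count
```

The input and output types of each function are free to differ; the only constraint is that the map output types line up with the reduce input types.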
Map and Reduce for Word count

  fileoffset -> line of text   map      [word1 -> 1,
                                         word2 -> 1,
                                         word3 -> 1,
                                         …]

  word -> [1, 1, 1]            reduce   word -> 3
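
As a rough illustration of this word count pattern, here is a minimal in-memory sketch. The function names, the regular expression used to split words and the byte offsets are assumptions made for the example; the grouping loop stands in for the sort and partition work that the framework would normally do.

```python
import re
from collections import defaultdict


def map_word_count(file_offset, line):
    """Emit (word, 1) for every alphabetic word in the line, lower-cased."""
    for word in re.findall(r"[a-z]+", line.lower()):
        yield word, 1


def reduce_word_count(word, counts):
    """Sum the 1s emitted for a single word."""
    return word, sum(counts)


# Illustrative input: (file offset, line of text) pairs from part_1.txt.
lines = [
    (0, "Peter Piper picked a peck of pickled peppers,"),
    (46, "A peck of pickled peppers Peter Piper picked;"),
]

# Stand-in for the framework's shuffle: collect the values emitted for each key.
grouped = defaultdict(list)
for offset, line in lines:
    for word, one in map_word_count(offset, line):
        grouped[word].append(one)

totals = dict(reduce_word_count(word, counts) for word, counts in sorted(grouped.items()))
```

The map function ignores its key entirely, which is common when the key is just a position in the file.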
Map and Reduce for Search

  fileoffset -> line of text          map      [searchterm -> filename + line1,
                                                              filename + line2]

  searchterm -> [filename1 + line1,   reduce   searchterm -> [filename1 + line1 + line2,
                 filename1 + line2,                           filename2 + line1]
                 filename2 + line1]
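
A possible sketch of the search pattern follows. It makes several assumptions not on the slide: the filename and line number are passed to the map function explicitly (in a real job they would come from the task context), matching is a case-insensitive substring test, and the names map_search and reduce_search are hypothetical.

```python
from collections import defaultdict


def map_search(filename, line_number, line, search_term):
    """Emit (search_term, (filename, line_number)) when the line contains the term."""
    if search_term in line.lower():
        yield search_term, (filename, line_number)


def reduce_search(search_term, matches):
    """Merge the matches for one term into a filename -> line numbers mapping."""
    lines_by_file = defaultdict(list)
    for filename, line_number in matches:
        lines_by_file[filename].append(line_number)
    return search_term, {name: sorted(nums) for name, nums in sorted(lines_by_file.items())}


documents = {
    "part_1.txt": [
        "Peter Piper picked a peck of pickled peppers,",
        "A peck of pickled peppers Peter Piper picked;",
    ],
    "part_2.txt": [
        "If Peter Piper picked a peck of pickled peppers,",
        "Where's the peck of pickled peppers Peter Piper picked?",
    ],
}

# Run every map call, keeping only the emitted values for the single search term.
matches = [
    value
    for filename, lines in documents.items()
    for number, line in enumerate(lines, start=1)
    for _, value in map_search(filename, number, line, "peck")
]
term, hits = reduce_search("peck", matches)
```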
Map and Reduce for Index


 fileoffset line of text    map     [word1  filename,
                                     word2  filename,
                                     word3  filename]



word1 [filename1,                   word1  [filename1,
        filename2,          reduce           filename2,
        filename3]                           filename3]
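
The index pattern could look roughly like this in code. As with the other sketches, the function names and the word-splitting rule are assumptions, the filename is passed in explicitly, and the reduce step additionally de-duplicates and sorts the filenames, which the slide does not specify.

```python
import re
from collections import defaultdict


def map_index(filename, line):
    """Emit (word, filename) once for each distinct word in the line."""
    for word in sorted(set(re.findall(r"[a-z]+", line.lower()))):
        yield word, filename


def reduce_index(word, filenames):
    """Collapse the filenames seen for a word into a sorted, de-duplicated list."""
    return word, sorted(set(filenames))


documents = {
    "part_1.txt": ["Peter Piper picked a peck of pickled peppers,"],
    "part_2.txt": ["If Peter Piper picked a peck of pickled peppers,"],
}

# Stand-in for the framework's shuffle: collect filenames per word.
grouped = defaultdict(list)
for filename, lines in documents.items():
    for line in lines:
        for word, name in map_index(filename, line):
            grouped[word].append(name)

index = dict(reduce_index(word, names) for word, names in grouped.items())
```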
A basic example - Java
Hadoop architecture
A basic example - Ruby
Getting started with AWS and EMR
MapReduce architecture in AWS
  SSH

  S3
    app (s3n)
    input (s3n)
    output
    logging

  Master security group
    EC2 Master

  Slave security group
    EC2 Node 1, EC2 Node 2, …, EC2 Node N

  Note: EC2 AMIs are Debian/Lenny 32 or 64 bit
To the Ruby EMR CLI!
credentials.json

  {
    "access_id": "…",
    "private_key": "…",
    "keypair": "mr-oregon",
    "key-pair-file": "mr-oregon.pem",
    "log_uri": "s3n://mr-word-count/",
    "region": "us-west-2"
  }

Command

  ./elastic-mapreduce \
    --create \
    --name word-count \
    --stream \
    --instance-count 1 \
    --instance-type m1.small \
    --key-pair mr-oregon \
    --input s3n://mr-word-count-input/ \
    --output s3n://mr-word-count-output/ \
    --mapper "ruby s3n://mr-word-count/map.rb" \
    --reducer "ruby s3n://mr-word-count/reduce.rb"
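
The contents of map.rb and reduce.rb are not shown in the deck. As a hedged illustration of what a streaming mapper and reducer do, here is a Python analogue of the usual Hadoop Streaming convention: each script reads lines on standard input and writes tab-separated key and value lines on standard output, and the reducer receives its input sorted by key. The function names are hypothetical, and the functions take iterables of lines so they can be tried without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Turn each input line into one 'word<TAB>1' output line per word."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run: sorted() plays the role of the framework's sort between the phases.
mapped = list(mapper(["a peck of peppers\n", "A peck\n"]))
reduced = list(reducer(sorted(mapped)))
```

In a real streaming job each function would live in its own script that iterates over sys.stdin and prints every yielded line.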
Setup S3 bucket
Create new EMR job
Supply name and set as streaming
Configure against S3 bucket
Configure instance types and #
Nothing to see here
Review and go!
Watch as job starts…
Runs…
And finishes!
Ta da!
Back to S3 for output
Limitations

  Processing must be parallelisable
     Large amounts of consistent data requiring consistent
     processing and few dependencies
  Not designed for high reliability
     E.g., the NameNode is a single point of failure in Hadoop DFS (HDFS)
MapReduce in practice

    Log and/or clickstream analysis of various kinds
    Marketing analytics
    Machine learning and/or sophisticated data mining
    Image processing
    Processing of XML messages
    Web crawling and/or text processing
    General archiving, including of relational/tabular data, e.g.
    for compliance



                         Source: http://en.wikipedia.org/wiki/Apache_Hadoop
FABUQ

 What if my input has multiline records?
 What if my EMR instances don’t have the required libraries, etc.,
 to run my steps?
 What if I needed to nest jobs within steps?
 What are the signs that an MR solution might “fit” the problem?
 How do I control the number of mappers and reducers used?
 What if I don’t need to do any reduction?
 How does MR provide fault tolerance?
Recap




High level understanding
                                                    Limitations




           Examples        Inspired to try it out
Editor's Notes

  1. What is MapReduce
     How does it work
     An implementation without the framework
     An implementation with the framework
     AWS architecture for MapReduce
     An example using Hive
     An example using Pig
     A custom example in Java
     Limitations
  2. Unconscious incompetence -> conscious incompetence
     High level understanding
     Knowledge of low level usage
     Limitations
     Have conversation with customer
  3. The MapReduce library in the user program first shards the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.

     One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

     A worker who is assigned a map task reads the contents of the corresponding input shard. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

     Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

     When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.

     The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

     When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
  4-8. Note: different domains for map input and output
     Different domains for reduce input and output
     Domain of map output is same as for reduce input
  9. Fibonacci ✖
     Searching ✔
  10. “Monitoring the filesystem counters for a job- particularly relative to byte counts from the map and into the reduce- is invaluable to the tuning of these parameters.” (from http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Source+Code)
  11. Unconscious incompetence -> conscious incompetence
     High level understanding
     Knowledge of low level usage
     Limitations
     Have conversation with customer
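
Note 3 says the buffered pairs are "partitioned into R regions by the partitioning function". A common default is a hash of the key modulo R, which the sketch below illustrates. The function names are hypothetical, and zlib.crc32 is used only because it gives the same result in every process, unlike Python's built-in hash() for strings.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Pick the reduce region for a key: a stable hash of the key, modulo R."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def split_into_regions(pairs, num_reducers):
    """Distribute intermediate (key, value) pairs into R regions, each sorted by key."""
    regions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        regions[partition(key, num_reducers)].append((key, value))
    return [sorted(region) for region in regions]


intermediate = [("peck", 1), ("peter", 1), ("peck", 1), ("piper", 1)]
regions = split_into_regions(intermediate, num_reducers=3)
```

Because the region depends only on the key, every value for a given key ends up with the same reduce worker, which is what allows that worker to see the complete list of values.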