MapReduce
           Daniel Reverri
         Developer Advocate




basho
Overview
        Why MapReduce?
        MapReduce Basics
        Using MapReduce
        Examples
        Comparisons


Why MapReduce?

        Parallel, distributed queries
        Easy to write
        Easy to run




Riak is a
        Key/Value store




Key/Value Data

        /riak/cat/snowball1
        /riak/cat/snowball2
        /riak/cat/snowball3

Cluster

        catlady@192.168.1.10
        catlady@192.168.1.11
        catlady@192.168.1.12

MapReduce Basics
        Operates over a known set of keys
        Runs near the data
        Consists of two types of functions
           Map
           Reduce

What is a Map Function?

        Function applied to one piece of data
        Operates in isolation
        Returns a list of results

What can I do with a Map Function?
        Filtering
           Filter documents by “tags”
        Extracting
           Count words in a document
           Extract links to related data

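As a rough illustration of the filtering case, a map function might look like the sketch below. It assumes documents are stored as JSON with a `tags` array and that the stored body is reachable as `value.values[0].data`; the function name, the `tags` field, and the sample cats are hypothetical.

```javascript
// Sketch of a filtering map function (ES5 style).
// Assumption: each document is JSON with a "tags" array; the wanted tag arrives in arg.
function filterByTag(value, keyData, arg) {
  var doc = JSON.parse(value.values[0].data);
  var tags = doc.tags || [];
  // A map function returns a list: one key on a match, nothing otherwise.
  return tags.indexOf(arg) !== -1 ? [value.key] : [];
}

// Local stand-ins for stored objects, only so the sketch can be run outside Riak.
var snowball1 = { bucket: "cat", key: "snowball1",
                  values: [{ data: JSON.stringify({ tags: ["fluffy", "white"] }) }] };
var snowball2 = { bucket: "cat", key: "snowball2",
                  values: [{ data: JSON.stringify({ tags: ["grumpy"] }) }] };

console.log(filterByTag(snowball1, undefined, "fluffy")); // [ 'snowball1' ]
console.log(filterByTag(snowball2, undefined, "fluffy")); // []
```

Returning an empty list for non-matching documents is what makes the function act as a filter: those documents simply contribute nothing to the next phase.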
Map
        cross_the_road(cat)



        cross_the_road(cat)



        cross_the_road(cat)


What is a Reduce Function?

        Function applied to a list of results
        Merges results from Map phases

What can I do with a Reduce Function?

        Aggregate
        Sort

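The aggregate case can be sketched as a reduce function that sums numbers. This is a minimal example under two assumptions: earlier phases emit plain numbers, and the reduce function may be applied to partial results before the final merge, so it should give the same answer either way. The function name is hypothetical.

```javascript
// Sketch of an aggregating reduce function.
// Assumption: earlier phases emit plain numbers (for example, per-document counts).
function sumCounts(values, arg) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  // Returned as a one-element list so the result can be fed back into the same function.
  return [total];
}

// Reducing partial results first must match reducing everything at once.
var partial = sumCounts([2, 1]);
console.log(sumCounts(partial.concat([5]))); // [ 8 ]
console.log(sumCounts([2, 1, 5]));           // [ 8 ]
```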
Reduce
        cross_the_road(cat)




        cross_the_road(cat)        sort(cats)




        cross_the_road(cat)




Using MapReduce

        Define and submit request
          REST
          Protocol Buffers
        Review results



Request (REST)
        POST to “/mapred”
        Content-Type: application/json

        List of bucket/key pairs
        List of phase definitions
        Timeout in milliseconds

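The three parts of the request body can be sketched as a plain object that is serialized to JSON. The field names `inputs`, `query`, and `timeout` and the built-in function name `Riak.mapValuesJson` are given here as assumptions about the request format; the bucket and key names and the timeout value are hypothetical.

```javascript
// Sketch of a /mapred request body built as a plain object.
var request = {
  // List of bucket/key pairs (hypothetical bucket and keys).
  inputs: [["cat", "snowball1"], ["cat", "snowball2"]],
  // List of phase definitions; Riak.mapValuesJson is assumed to be a
  // built-in named function that parses each stored value as JSON.
  query: [
    { map: { language: "javascript", name: "Riak.mapValuesJson", keep: true } }
  ],
  // Timeout in milliseconds.
  timeout: 60000
};

// This string would be sent with POST to /mapred and Content-Type: application/json.
var body = JSON.stringify(request);
console.log(body);
```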
Inputs

Query

Phase

Phase
        Type (map, reduce, link)

Phase
        Function (named)

Phase
        Function (anonymous)

Phase
        Keep (true|false)

Phase
        Argument

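The phase fields listed above (type, named or anonymous function, keep, argument) might be combined as in the sketch below. The field names `language`, `name`, `source`, `keep`, and `arg` are assumptions about the JSON format, the anonymous function body is made up for illustration, and link phases are not shown.

```javascript
// Phase with a named function: refers to code already loaded on the nodes.
var namedPhase = {
  map: { language: "javascript", name: "Riak.mapValuesJson", keep: false }
};

// Phase with an anonymous function: the source travels inside the request.
var anonymousPhase = {
  reduce: {
    language: "javascript",
    source: "function(values, arg) { return values.slice(0, arg); }",
    arg: 2,      // static argument handed to the function
    keep: true   // include this phase's output in the response
  }
};

// The phase type is the single top-level key of each definition.
function phaseType(phase) {
  return Object.keys(phase)[0];
}

var query = [namedPhase, anonymousPhase];
console.log(query.map(phaseType)); // [ 'map', 'reduce' ]
```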
Function Arguments

Map - value

Map - keyData, arg

Reduce - arg

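To make the argument lists concrete, the sketch below calls a map and a reduce function locally and echoes what each one receives. The object shape (`bucket`, `key`, `values[0].data`) is an assumption, trimmed to the fields used here; `keyData` is assumed to be extra data supplied alongside an individual input, and `arg` the static argument from the phase definition. All names and sample strings are hypothetical.

```javascript
// Map receives (value, keyData, arg): the stored object, per-input data, and the phase argument.
function describeMapCall(value, keyData, arg) {
  return [{ key: value.key, body: value.values[0].data, keyData: keyData, arg: arg }];
}

// Reduce receives (values, arg): the list produced so far and the phase argument.
function describeReduceCall(values, arg) {
  return [{ received: values.length, arg: arg }];
}

// Local stand-in for one stored object (assumed shape).
var stored = { bucket: "map_demo", key: "key1.txt",
               values: [{ metadata: {}, data: "Random boring demo data for map demo" }] };

var mapped = describeMapCall(stored, "per-input data", "phase arg");
console.log(mapped[0].keyData, "|", mapped[0].arg);   // per-input data | phase arg
console.log(describeReduceCall(mapped, "phase arg")); // [ { received: 1, arg: 'phase arg' } ]
```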
Examples



Map Demo

        Count the number of times the word
        “demo” appears in a set of documents

Demo Data
        map_demo/key1.txt
          Random boring demo data for map demo

        map_demo/key2.txt
          More useless demo data

        map_demo/key3.txt
          demo demo demo demo demo

Request

Inputs

Query

Map

Results

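The request and results for this demo were shown as images and are not reproduced here. As a stand-in, the sketch below runs a counting map function locally over the three demo documents. The function name, the `[key, count]` result shape, and the object shape are assumptions, not the code from the slides.

```javascript
// Sketch of a map function that counts occurrences of "demo" in one document.
function countDemo(value, keyData, arg) {
  var text = value.values[0].data;
  var matches = text.match(/demo/g); // null when nothing matches
  return [[value.key, matches ? matches.length : 0]];
}

// Demo data from the slide, wrapped in the assumed object shape.
function doc(key, text) {
  return { bucket: "map_demo", key: key, values: [{ data: text }] };
}
var docs = [
  doc("key1.txt", "Random boring demo data for map demo"),
  doc("key2.txt", "More useless demo data"),
  doc("key3.txt", "demo demo demo demo demo")
];

// Run the map function over each document and collect the per-document lists.
var results = [];
docs.forEach(function (d) { results = results.concat(countDemo(d)); });
console.log(JSON.stringify(results));
// [["key1.txt",2],["key2.txt",1],["key3.txt",5]]
```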
Reduce Demo

        Sort documents by the number of times
        “demo” appears

Request

Inputs

Query

Reduce

Results

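A reduce phase for the sorting step might look like the following sketch. It assumes the preceding map phase emits `[key, count]` pairs and that the highest count should come first; both the pair shape and the sort direction are assumptions, and the function name is hypothetical.

```javascript
// Sketch of a reduce function that orders [key, count] pairs by count, highest first.
function sortByCount(values, arg) {
  // slice() copies the list so the input is left untouched.
  return values.slice().sort(function (a, b) { return b[1] - a[1]; });
}

// Pairs as a counting map phase might emit them for the demo data.
var counts = [["key1.txt", 2], ["key2.txt", 1], ["key3.txt", 5]];
console.log(JSON.stringify(sortByCount(counts)));
// [["key3.txt",5],["key1.txt",2],["key2.txt",1]]
```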
Argument Demo
        Enhance “demo” count example to count
        words matching a regular expression

Map with arg

Results

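One way to generalize the count is to pass the pattern in through the phase argument. The sketch below assumes `arg` arrives as a string holding a regular expression; the function name, the sample pattern, and the object shape are hypothetical.

```javascript
// Sketch: the pattern arrives as a string in arg and is compiled on each call.
function countByRegex(value, keyData, arg) {
  var pattern = new RegExp(arg, "g");
  var matches = value.values[0].data.match(pattern);
  return [[value.key, matches ? matches.length : 0]];
}

var key1 = { key: "key1.txt", values: [{ data: "Random boring demo data for map demo" }] };
var key3 = { key: "key3.txt", values: [{ data: "demo demo demo demo demo" }] };

// "d(emo|ata)" matches both "demo" and "data".
console.log(JSON.stringify(countByRegex(key1, undefined, "d(emo|ata)"))); // [["key1.txt",3]]
console.log(JSON.stringify(countByRegex(key3, undefined, "d(emo|ata)"))); // [["key3.txt",5]]
```

Keeping the pattern in `arg` means the same function can serve different queries without its source changing.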
Deploying Demo

        Deploy enhanced count function as a
        named function

js_source_dir
        app.config

        $ riak restart

Named Function
        /tmp/js_source/count_by_regex.js

        $ riak-admin js_reload

Query

Results

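The deployed file and the query that refers to it might look roughly like this. The sketch assumes `js_source_dir` in app.config points at `/tmp/js_source` and that a function defined in a file there can be referenced by name from a phase definition; the function name `countByRegex` and the `"demo"` argument are hypothetical, since the slides showed the real contents only as images.

```javascript
// Possible contents of /tmp/js_source/count_by_regex.js.
// Assumption: functions defined in files under js_source_dir become
// available to phases by name once the nodes have loaded them.
function countByRegex(value, keyData, arg) {
  var matches = value.values[0].data.match(new RegExp(arg, "g"));
  return [[value.key, matches ? matches.length : 0]];
}

// The query then names the function instead of shipping its source.
var query = [
  { map: { language: "javascript", name: "countByRegex", arg: "demo", keep: true } }
];
console.log(JSON.stringify(query));
```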
Comparisons



Hadoop (similarities)

Distributed across multiple machines
Provides data locality (HDFS)
Phases run near the data
Hadoop (differences)

Used for large, long running jobs (hours)
Restarts failed tasks
3 phases (map, combine, reduce)
CouchDB (differences)

Not distributed across multiple machines
Runs over all docs in a database
Computes cached views for lookups
No query time arguments
2 phases (map, reduce)
MongoDB (differences)

Not run in parallel
Not spread across multiple machines
3 phases (map, reduce, finalize)
Closing thoughts



Good to Know

        Phases must always return lists
        Map inputs are always bucket/key pairs
        Bucket queries are bad
        Anonymous functions are bad

Features not Reviewed
        Link phase (link walking)
        Results from multiple phases
        Erlang MapReduce functions
        Streaming results

Questions?



Riak MapReduce

Slides from webinar on MapReduce:
http://blog.basho.com/2010/07/15/free-webinar---map/reduce-querying-in-riak---july-22-@-2pm-eastern/

Published in: Technology
Notes (Hadoop differences):
  • Tasks: individual map processes
  • Combine: function run over map results on the local node before the data is shipped to the reduce operations






  • Riak MapReduce

    1. MapReduce Daniel Reverri Developer Advocate basho
    2. Overview Why MapReduce? MapReduce Basics Using MapReduce Examples Comparisons
    3. Why MapReduce? Parallel, distributed queries Easy to write Easy to run
    4. Riak is a Key/Value store
    5. Key/Value Data /riak/cat/snowball1 /riak/cat/snowball2 /riak/cat/snowball3
    6. Cluster catlady@192.168.1.10 catlady@192.168.1.11 catlady@192.168.1.12
    7. MapReduce Basics Operates over a known set of keys Runs near the data Consists of two types of functions Map Reduce
    8. What is a Map Function? Function applied to one piece of data Operates in isolation Returns a list of results
    9. What can I do with a Map Function? Filtering Filter documents by “tags” Extracting Count words in a document Extract links to related data
    10. Map cross_the_road(cat) cross_the_road(cat) cross_the_road(cat)
    11. What is a Reduce Function? Function applied to a list of results Merges results from Map phases
    12. What can I do with a Reduce Function? Aggregate Sort
    13. Reduce cross_the_road(cat) cross_the_road(cat) sort(cats) cross_the_road(cat)
    14. Using MapReduce Define and submit request REST Protocol Buffers Review results
    15. Request (REST) POST to “/mapred” Content-Type: application/json List of bucket/key pairs List of phase definitions Timeout in milliseconds
    16. Inputs
    17. Query
    18. Phase
    19. Phase Type (map, reduce, link)
    20. Phase Function (named)
    21. Phase Function (anonymous)
    22. Phase Keep (true|false)
    23. Phase Argument
    24. Function Arguments
    25. Map - value
    26. Map - keyData, arg
    27. Reduce - arg
    28. Examples
    29. Map Demo Count the number of times the word “demo” appears in a set of documents
    30. Demo Data map_demo/key1.txt Random boring demo data for map demo map_demo/key2.txt More useless demo data map_demo/key3.txt demo demo demo demo demo
    31. Request
    32. Inputs
    33. Query
    34. Map
    35. Results
    36. Reduce Demo Sort documents by the number of times “demo” appears
    37. Request
    38. Inputs
    39. Query
    40. Reduce
    41. Results
    42. Argument Demo Enhance “demo” count example to count words matching a regular expression
    43. Map with arg
    44. Results
    45. Deploying Demo Deploy enhanced count function as a named function
    46. js_source_dir app.config $ riak restart
    47. Named Function /tmp/js_source/count_by_regex.js $ riak-admin js_reload
    48. Query
    49. Results
    50. Comparisons
    51. Hadoop (similarities) Distributed across multiple machines Provides data locality (HDFS) Phases run near the data
    52. Hadoop (differences) Used for large, long running jobs (hours) Restarts failed tasks 3 phases (map, combine, reduce)
    53. CouchDB (differences) Not distributed across multiple machines Runs over all docs in a database Computes cached views for lookups No query time arguments 2 phases (map, reduce)
    54. MongoDB (differences) Not run in parallel Not spread across multiple machines 3 phases (map, reduce, finalize)
    55. Closing thoughts
    56. Good to Know Phases must always return lists Map inputs are always bucket/key pairs Bucket queries are bad Anonymous functions are bad
    57. Features not Reviewed Link phase (link walking) Results from multiple phases Erlang MapReduce functions Streaming results
    58. Questions?