Riak MapReduce

Slides from webinar on MapReduce:
http://blog.basho.com/2010/07/15/free-webinar---map/reduce-querying-in-riak---july-22-@-2pm-eastern/

Published in: Technology

Slide notes:
  • Tasks: individual map processes
  • Combine: function to run over map results on local nodes before shipping data to reduce operations

    1. MapReduce. Daniel Reverri, Developer Advocate, Basho
    2. Overview: Why MapReduce?; MapReduce Basics; Using MapReduce; Examples; Comparisons
    3. Why MapReduce? Parallel, distributed queries; easy to write; easy to run
    4. Riak is a Key/Value store
    5. Key/Value Data: /riak/cat/snowball1, /riak/cat/snowball2, /riak/cat/snowball3
    6. Cluster: catlady@192.168.1.10, catlady@192.168.1.11, catlady@192.168.1.12
    7. MapReduce Basics: operates over a known set of keys; runs near the data; consists of two types of functions (Map, Reduce)
    8. What is a Map Function? Function applied to one piece of data; operates in isolation; returns a list of results
    9. What can I do with a Map Function? Filtering: filter documents by “tags”. Extracting: count words in a document; extract links to related data
    10. Map (diagram): cross_the_road(cat), cross_the_road(cat), cross_the_road(cat)
    11. What is a Reduce Function? Function applied to a list of results; merges results from Map phases
    12. What can I do with a Reduce Function? Aggregate; sort
    13. Reduce (diagram): the results of three cross_the_road(cat) calls feed into sort(cats)
    14. Using MapReduce: define and submit request (REST, Protocol Buffers); review results
    15. Request (REST): POST to “/mapred”; Content-Type: application/json; list of bucket/key pairs; list of phase definitions; timeout in milliseconds
    16. Inputs
    17. Query
    18. Phase
    19. Phase Type (map, reduce, link)
    20. Phase Function (named)
    21. Phase Function (anonymous)
    22. Phase Keep (true|false)
    23. Phase Argument
    24. Function Arguments
    25. Map - value
    26. Map - keyData, arg
    27. Reduce - arg
    28. Examples
    29. Map Demo: count the number of times the word “demo” appears in a set of documents
    30. Demo Data: map_demo/key1.txt: “Random boring demo data for map demo”; map_demo/key2.txt: “More useless demo data”; map_demo/key3.txt: “demo demo demo demo demo”
    31. Request
    32. Inputs
    33. Query
    34. Map
    35. Results
    36. Reduce Demo: sort documents by the number of times “demo” appears
    37. Request
    38. Inputs
    39. Query
    40. Reduce
    41. Results
    42. Argument Demo: enhance “demo” count example to count words matching a regular expression
    43. Map with arg
    44. Results
    45. Deploying Demo: deploy enhanced count function as a named function
    46. js_source_dir (in app.config); $ riak restart
    47. Named Function: /tmp/js_source/count_by_regex.js; $ riak-admin js_reload
    48. Query
    49. Results
    50. Comparisons
    51. Hadoop (similarities): distributed across multiple machines; provides data locality (HDFS); phases run near the data
    52. Hadoop (differences): used for large, long-running jobs (hours); restarts failed tasks; 3 phases (map, combine, reduce)
    53. CouchDB (differences): not distributed across multiple machines; runs over all docs in a database; computes cached views for lookups; no query-time arguments; 2 phases (map, reduce)
    54. MongoDB (differences): not run in parallel; not spread across multiple machines; 3 phases (map, reduce, finalize)
    55. Closing thoughts
    56. Good to Know: phases must always return lists; map inputs are always bucket/key pairs; bucket queries are bad; anonymous functions are bad
    57. Features not Reviewed: link phase (link walking); results from multiple phases; Erlang MapReduce functions; streaming results
    58. Questions?
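
Slides 15 to 23 list the parts of a REST request (bucket/key inputs, phase definitions with a type, a function, a keep flag and an argument, plus a timeout), but the request bodies shown on the slides were images and did not survive in this transcript. The Python sketch below only assembles and serializes one plausible JSON body; it does not contact a cluster. The helper name build_mapred_request, the embedded JavaScript source, the choice of the built-in Riak.reduceSum as the named reduce function and the 60000 ms timeout are all assumptions made for illustration, not the presenter's code.

```python
import json

# Hypothetical anonymous map function (JavaScript source sent as a string).
# It returns a one-element list holding the count of "demo" in one document.
MAP_SOURCE = (
    "function(value, keyData, arg) {"
    " var m = value.values[0].data.match(/demo/g);"
    " return [m ? m.length : 0]; }"
)


def build_mapred_request(inputs, phases, timeout_ms=60000):
    """Assemble the body for a POST to /mapred (hypothetical helper)."""
    return {"inputs": inputs, "query": phases, "timeout": timeout_ms}


# Inputs: an explicit list of bucket/key pairs (demo data from slide 30).
inputs = [
    ["map_demo", "key1.txt"],
    ["map_demo", "key2.txt"],
    ["map_demo", "key3.txt"],
]

# Query: a list of phases; each phase has a type, a function and a keep flag.
phases = [
    # Anonymous function: the source travels with the request.
    {"map": {"language": "javascript", "source": MAP_SOURCE, "keep": False}},
    # Named function: only the name is sent (assumed built-in summing reducer).
    {"reduce": {"language": "javascript", "name": "Riak.reduceSum", "keep": True}},
]

request_body = build_mapred_request(inputs, phases)
encoded = json.dumps(request_body, indent=2)
print(encoded)
```

Per slide 15, a body like this would be sent as a POST to “/mapred” with Content-Type: application/json. With keep set to false on the map phase, only the reduce phase's output would be returned.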

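
The map, reduce and argument demos (slides 29 to 44) can be imitated locally to show the data flow: a map function sees one document and returns a list, and a reduce function receives the merged list and returns a list. The sketch below is a plain-Python simulation over the demo data from slide 30, not Riak code. The function name count_by_regex is borrowed from the file name on slide 47; run_job, the tuple layout of each value and the descending sort order are assumptions.

```python
import re

# Demo documents from slide 30, keyed by (bucket, key).
DOCUMENTS = {
    ("map_demo", "key1.txt"): "Random boring demo data for map demo",
    ("map_demo", "key2.txt"): "More useless demo data",
    ("map_demo", "key3.txt"): "demo demo demo demo demo",
}


def count_by_regex(value, key_data, arg):
    """Map: applied to one document in isolation; returns a list of results."""
    _bucket, key, text = value
    return [(key, len(re.findall(arg, text)))]


def sort_by_count(results, arg):
    """Reduce: applied to the list of map results; returns a list."""
    return sorted(results, key=lambda pair: pair[1], reverse=True)


def run_job(inputs, map_fn, map_arg, reduce_fn, reduce_arg=None):
    """Run one map phase and one reduce phase over bucket/key inputs."""
    mapped = []
    for bucket, key in inputs:
        value = (bucket, key, DOCUMENTS[(bucket, key)])
        # key_data is unused in this sketch, so None is passed.
        mapped.extend(map_fn(value, None, map_arg))
    return reduce_fn(mapped, reduce_arg)


results = run_job(sorted(DOCUMENTS), count_by_regex, r"demo", sort_by_count)
print(results)
# [('key3.txt', 5), ('key1.txt', 2), ('key2.txt', 1)]
```

Passing the pattern as the map argument mirrors the argument demo: the same function can count matches for a different regular expression without being rewritten.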