Riak MapReduce
Slides from webinar on MapReduce: …


Published in: Technology

  • Tasks - individual map processes
    Combine - function to run over map results on local nodes before shipping data to reduce operations

  • Transcript

    • 1. MapReduce. Daniel Reverri, Developer Advocate, Basho
    • 2. Overview: Why MapReduce?, MapReduce Basics, Using MapReduce, Examples, Comparisons
    • 3. Why MapReduce? Parallel, distributed queries; Easy to write; Easy to run
    • 4. Riak is a Key/Value store
    • 5. Key/Value Data: /riak/cat/snowball1, /riak/cat/snowball2, /riak/cat/snowball3
    • 6. Cluster: catlady@, catlady@, catlady@
    • 7. MapReduce Basics: Operates over a known set of keys; Runs near the data; Consists of two types of functions (Map, Reduce)
    • 8. What is a Map Function? Function applied to one piece of data; Operates in isolation; Returns a list of results
    • 9. What can I do with a Map Function? Filtering (filter documents by “tags”); Extracting (count words in a document, extract links to related data)
    • 10. Map: cross_the_road(cat), cross_the_road(cat), cross_the_road(cat)
    • 11. What is a Reduce Function? Function applied to a list of results; Merges results from Map phases
    • 12. What can I do with a Reduce Function? Aggregate; Sort
    • 13. Reduce: cross_the_road(cat), cross_the_road(cat), cross_the_road(cat), then sort(cats)
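
The map and reduce roles described on slides 7 to 13 can be sketched locally in plain JavaScript. This is only an in-memory illustration, not how Riak schedules work across a cluster; `runMapReduce`, `crossTheRoad`, `sortCats`, and the sample cats are hypothetical names chosen to echo the slides.

```javascript
// Local, in-memory illustration only: no Riak cluster is involved.
// All names here (runMapReduce, crossTheRoad, sortCats) are hypothetical.

// Map: applied to one piece of data in isolation; always returns a list.
function crossTheRoad(cat) {
  return [{ name: cat.name, crossed: true }];
}

// Reduce: applied to a list of results; merges them and returns a list.
function sortCats(cats) {
  return cats.slice().sort((a, b) => a.name.localeCompare(b.name));
}

// Apply the map function to every input, concatenate the resulting lists,
// then hand the combined list to the reduce function.
function runMapReduce(inputs, mapFn, reduceFn) {
  const mapped = inputs.flatMap((input) => mapFn(input));
  return reduceFn(mapped);
}

const cats = [{ name: "snowball3" }, { name: "snowball1" }, { name: "snowball2" }];
const sorted = runMapReduce(cats, crossTheRoad, sortCats);
console.log(sorted.map((cat) => cat.name));
```

Because each map call sees a single input and returns a list, the map step can run wherever the data lives; only the reduce step needs to see the merged results.
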
    • 14. Using MapReduce: Define and submit request (REST, Protocol Buffers); Review results
    • 15. Request (REST): POST to “/mapred”; Content-Type: application/json; List of bucket/key pairs; List of phase definitions; Timeout in milliseconds
    • 16. Inputs
    • 17. Query
    • 18. Phase
    • 19. Phase Type (map, reduce, link)
    • 20. Phase Function (named)
    • 21. Phase Function (anonymous)
    • 22. Phase Keep (true|false)
    • 23. Phase Argument
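
The slide images for the request did not survive in this transcript, so the following is a minimal sketch of what a request body with those parts might look like. The bucket and keys come from slide 5; the anonymous map source, the choice of `Riak.reduceSort` as the named function, and the timeout value are assumptions made for illustration.

```javascript
// Sketch of a JSON body for a POST to /mapred (Content-Type: application/json).
// The anonymous source, the named function, and the timeout are assumptions.
const request = {
  // Inputs: a list of bucket/key pairs.
  inputs: [
    ["cat", "snowball1"],
    ["cat", "snowball2"],
    ["cat", "snowball3"]
  ],
  // Query: a list of phase definitions, run in order.
  query: [
    {
      // Phase type: map. Phase function: anonymous (inline source).
      map: {
        language: "javascript",
        source: "function(value, keyData, arg) { return [value.key]; }",
        keep: false // do not include this phase's output in the response
      }
    },
    {
      // Phase type: reduce. Phase function: named.
      reduce: {
        language: "javascript",
        name: "Riak.reduceSort",
        keep: true // return this phase's output to the client
      }
    }
  ],
  // Timeout in milliseconds.
  timeout: 60000
};

const body = JSON.stringify(request);
console.log(body);
```

Only the body is built here; sending it is left to whichever HTTP client is at hand.
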
    • 24. Function Arguments
    • 25. Map - value
    • 26. Map - keyData, arg
    • 27. Reduce - arg
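
As a rough sketch of the arguments named on slides 25 to 27: a JavaScript map function takes `(value, keyData, arg)` and a reduce function takes a list of values plus `arg`. The shape of `sampleValue` below (bucket, key, and a `values` list with `metadata` and `data`) is a hand-built stand-in assumed for illustration, and `describe` and `takeFirst` are hypothetical function names.

```javascript
// Hypothetical stand-in for the object handed to a map function.
const sampleValue = {
  bucket: "map_demo",
  key: "key1.txt",
  values: [{ metadata: {}, data: "Random boring demo data for map demo" }]
};

// Map: receives the stored value, optional per-key data, and the phase argument.
// It must return a list.
function describe(value, keyData, arg) {
  return [{ key: value.key, keyData: keyData, arg: arg }];
}

// Reduce: receives a list of earlier results and the phase argument.
// It must also return a list. Here arg is treated as "how many to keep".
function takeFirst(values, arg) {
  return values.slice(0, arg);
}

console.log(describe(sampleValue, "extra", "phase-arg"));
console.log(takeFirst(["a", "b", "c"], 2));
```
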
    • 28. Examples
    • 29. Map Demo: Count the number of times the word “demo” appears in a set of documents
    • 30. Demo Data: map_demo/key1.txt (“Random boring demo data for map demo”); map_demo/key2.txt (“More useless demo data”); map_demo/key3.txt (“demo demo demo demo demo”)
    • 31. Request
    • 32. Inputs
    • 33. Query
    • 34. Map
    • 35. Results
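
The map function from the demo slides is not preserved in the transcript. A minimal sketch of one way to count “demo” per document follows, run locally against the slide 30 data. It assumes the demo data splits into the three documents shown, and that a map function reads the document text from `value.values[0].data`; `countDemo` and `demoDocs` are hypothetical names.

```javascript
// Map function sketch: count occurrences of "demo" in one document.
// Returns a list containing a single [key, count] pair.
function countDemo(value) {
  const text = value.values[0].data;
  const matches = text.match(/demo/g); // null when there is no match
  return [[value.key, matches ? matches.length : 0]];
}

// Hand-built stand-ins for the three documents on slide 30.
const demoDocs = [
  { bucket: "map_demo", key: "key1.txt", values: [{ data: "Random boring demo data for map demo" }] },
  { bucket: "map_demo", key: "key2.txt", values: [{ data: "More useless demo data" }] },
  { bucket: "map_demo", key: "key3.txt", values: [{ data: "demo demo demo demo demo" }] }
];

// Simulate the map phase locally by applying the function to each document.
const counts = demoDocs.flatMap((doc) => countDemo(doc));
console.log(counts);
// → [ [ 'key1.txt', 2 ], [ 'key2.txt', 1 ], [ 'key3.txt', 5 ] ]
```
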
    • 36. Reduce Demo: Sort documents by the number of times “demo” appears
    • 37. Request
    • 38. Inputs
    • 39. Query
    • 40. Reduce
    • 41. Results
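
A reduce function for this demo could look like the sketch below. It assumes the preceding map phase emits `[key, count]` pairs and that the sort is descending by count; the slides do not state the direction, and `sortByCount` is a hypothetical name.

```javascript
// Reduce function sketch: sort [key, count] pairs by count, highest first.
// Takes a list and returns a list, as every phase must.
function sortByCount(values) {
  return values.slice().sort((a, b) => b[1] - a[1]);
}

// Assumed output of the map phase for the three demo documents.
const mapResults = [
  ["key1.txt", 2],
  ["key2.txt", 1],
  ["key3.txt", 5]
];

const ranked = sortByCount(mapResults);
console.log(ranked);
// → [ [ 'key3.txt', 5 ], [ 'key1.txt', 2 ], [ 'key2.txt', 1 ] ]
```

Sorting is a convenient reduce operation because applying it again to already sorted partial results still produces the correct final order.
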
    • 42. Argument Demo: Enhance “demo” count example to count words matching a regular expression
    • 43. Map with arg
    • 44. Results
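
One way the enhanced map function might use the phase argument is sketched here. It assumes `arg` arrives as a regular-expression source string and that the document text is at `value.values[0].data`; `countByRegex` is a hypothetical name and the pattern `data|demo` is just an example.

```javascript
// Map function sketch: count matches of a pattern supplied as the phase argument.
function countByRegex(value, keyData, arg) {
  const pattern = new RegExp(arg, "g"); // arg is assumed to be a regex source string
  const matches = value.values[0].data.match(pattern);
  return [[value.key, matches ? matches.length : 0]];
}

// Hand-built stand-ins for the demo documents.
const docs = [
  { key: "key1.txt", values: [{ data: "Random boring demo data for map demo" }] },
  { key: "key2.txt", values: [{ data: "More useless demo data" }] },
  { key: "key3.txt", values: [{ data: "demo demo demo demo demo" }] }
];

// Count words matching either "data" or "demo" in each document.
const regexCounts = docs.flatMap((doc) => countByRegex(doc, undefined, "data|demo"));
console.log(regexCounts);
// → [ [ 'key1.txt', 3 ], [ 'key2.txt', 2 ], [ 'key3.txt', 5 ] ]
```

Passing the pattern as an argument means the same function can serve many queries without editing its source.
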
    • 45. Deploying Demo: Deploy enhanced count function as a named function
    • 46. js_source_dir: set in app.config, then $ riak restart
    • 47. Named Function: /tmp/js_source/count_by_regex.js, then $ riak-admin js_reload
    • 48. Query
    • 49. Results
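
Once a function is deployed, a phase can refer to it by name instead of carrying inline source. The sketch below builds such a phase definition. It assumes the deployed file defines a global function called `countByRegex`; that name is not given in the slides, and `namedMapPhase` is a hypothetical helper.

```javascript
// Build a map phase that refers to a deployed function by name.
// "countByRegex" is an assumed function name, not taken from the slides.
function namedMapPhase(name, arg) {
  return { map: { language: "javascript", name: name, arg: arg, keep: true } };
}

const namedQuery = [namedMapPhase("countByRegex", "demo")];
const namedQueryJson = JSON.stringify(namedQuery);
console.log(namedQueryJson);
// → [{"map":{"language":"javascript","name":"countByRegex","arg":"demo","keep":true}}]
```

Compared with an anonymous phase, the request carries only a name and an argument, so the function body lives in one place on the server.
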
    • 50. Comparisons
    • 51. Hadoop (similarities): Distributed across multiple machines; Provides data locality (HDFS); Phases run near the data
    • 52. Hadoop (differences): Used for large, long-running jobs (hours); Restarts failed tasks; 3 phases (map, combine, reduce)
    • 53. CouchDB (differences): Not distributed across multiple machines; Runs over all docs in a database; Computes cached views for lookups; No query-time arguments; 2 phases (map, reduce)
    • 54. MongoDB (differences): Not run in parallel; Not spread across multiple machines; 3 phases (map, reduce, finalize)
    • 55. Closing thoughts
    • 56. Good to Know: Phases must always return lists; Map inputs are always bucket/key pairs; Bucket queries are bad; Anonymous functions are bad
    • 57. Features not Reviewed: Link phase (link walking); Results from multiple phases; Erlang MapReduce functions; Streaming results
    • 58. Questions?