Riak MapReduce: A Story In Three Acts

Presentation describing the current state of Riak's MapReduce and future features.


Published in Technology
Transcript

  • 1. Riak MapReduce: A Story In Three Acts. Kevin A. Smith, Basho Technologies
  • 2. Part I: Current Release
  • 3-7. Map/Reduce Terms
      • Phase: A step within a job
      • Job: A sequence of phases and inputs
      • Map: Data collection phase
      • Reduce: Data collation or processing phase
  • 8-12. Map/Reduce Facts
      • Map phases execute in parallel with data locality
      • Reduce phases execute in parallel on the node where the job was submitted
      • Results are not cached or stored
      • Phases can be written in Erlang or JavaScript
  • 13-16. Map Phase
      • Inputs must generate bucket/key pairs
      • Must return a list
      • Parallel results are aggregated into a single list
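
The map-phase rules above can be sketched in plain JavaScript. The `function(value, keyData, arg)` signature follows Riak's JavaScript map convention; the sample objects, the `mapClose` function, and the `runMapPhase` helper are hypothetical stand-ins for what the cluster does, not Riak code.

```javascript
// Hypothetical stored objects, shaped like the objects Riak hands to
// JavaScript map functions (bucket, key, values[].data).
const objects = [
  { bucket: "stocks", key: "goog", values: [{ data: '{"close": 510.5}' }] },
  { bucket: "stocks", key: "msft", values: [{ data: '{"close": 28.9}' }] },
];

// A map function receives one object and must return a list,
// even when it produces a single result (or none at all).
function mapClose(value, keyData, arg) {
  const doc = JSON.parse(value.values[0].data);
  return [doc.close];
}

// Stand-in for the cluster: run the map function on every input and
// aggregate the per-object lists into one flat list.
function runMapPhase(mapFn, inputs, arg) {
  return inputs.reduce((acc, obj) => acc.concat(mapFn(obj, null, arg)), []);
}

const mapped = runMapPhase(mapClose, objects, null);
console.log(mapped);
```

Returning a list from every call is what makes the aggregation step a simple concatenation.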
  • 17-20. Parallel Map (diagram slides; no text survives in the transcript)
  • 21-24. Reduce Phase
      • Performed on the node coordinating the map/reduce job
      • Two processes per reduce phase to add minor parallelism
      • Must return a list
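
A minimal sketch of why a reduce function must return a list: its output can be fed back into the same function. The `runReducePhase` helper is hypothetical, and splitting the input into two halves is only an assumption about how the two reduce processes might share the work.

```javascript
// A reduce function takes a list and returns a list, so partial
// results can be reduced again by the same function.
function reduceSum(values, arg) {
  return [values.reduce((total, v) => total + v, 0)];
}

// Hypothetical stand-in for the coordinating node: two workers each
// reduce half of the inputs, then the partial lists are combined and
// reduced once more.
function runReducePhase(reduceFn, inputs, arg) {
  const middle = Math.ceil(inputs.length / 2);
  const partials = [inputs.slice(0, middle), inputs.slice(middle)].map(
    (chunk) => reduceFn(chunk, arg)
  );
  return reduceFn([].concat(...partials), arg);
}

const total = runReducePhase(reduceSum, [1, 2, 3, 4, 5], null);
console.log(total);
```

A reduce function written this way gives the same answer whether it sees all inputs at once or a mix of raw inputs and earlier partial results.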
  • 25-29. Submitting Jobs via HTTP
      • Riak exposes M/R via its REST API
      • Job is described in JSON
      • Submitted via POST
      • Default URL is /mapred
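
The submission steps above can be sketched as a small request builder. The host and port (a local node on 8098) are assumptions, as is the `buildMapRedRequest` helper; the job places `keep` inside the phase definition, as the key-filter example on slide 78 does. The sketch only builds the request so it runs without a cluster.

```javascript
// Assumed endpoint: a local Riak node on port 8098, default /mapred path.
const RIAK_MAPRED_URL = "http://127.0.0.1:8098/mapred";

// Hypothetical helper: wrap a job description in a POST request.
function buildMapRedRequest(job) {
  return {
    url: RIAK_MAPRED_URL,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(job), // the job is described in JSON
    },
  };
}

const job = {
  inputs: "stocks",
  query: [
    { map: { language: "javascript", name: "Riak.mapValuesJson", keep: true } },
  ],
};

const request = buildMapRedRequest(job);
console.log(request.options.method, request.url);

// To actually submit (requires a running node and a fetch-capable runtime):
// fetch(request.url, request.options).then((res) => res.json()).then(console.log);
```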
  • 30-34. MapReduce via JSON
      {"inputs": [["stocks", "goog"]],
       "query": [{"map": {"language": "javascript",
                          "name": "Riak.mapValuesJson",
                          "keep": true}}]}
  • 35-43. MapReduce via JSON
      {"inputs": "stocks",
       "query": [{"map": {"language": "javascript",
                          "name": "App.extractTickers",
                          "arg": "GOOG",
                          "keep": false}},
                 {"reduce": {"language": "javascript",
                             "name": "Riak.reduceMin",
                             "keep": true}}]}
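
A rough sketch of how the two-phase job above fits together: the map output feeds the reduce phase, and only phases marked `keep` contribute to the returned results. The slides do not show `App.extractTickers`, so the version here is hypothetical, as are the sample data, the local `reduceMin` stand-in for `Riak.reduceMin`, and the `runJob` helper.

```javascript
// Hypothetical data: each object's value is JSON with a ticker and a close.
const stocks = [
  { bucket: "stocks", key: "a", values: [{ data: '{"ticker":"GOOG","close":510.5}' }] },
  { bucket: "stocks", key: "b", values: [{ data: '{"ticker":"MSFT","close":28.9}' }] },
  { bucket: "stocks", key: "c", values: [{ data: '{"ticker":"GOOG","close":498.2}' }] },
];

// Hypothetical map function: emit the close only for the ticker in `arg`.
function extractTickers(value, keyData, arg) {
  const doc = JSON.parse(value.values[0].data);
  return doc.ticker === arg ? [doc.close] : [];
}

// Local stand-in for a min-style reduce: list in, one-element list out.
function reduceMin(values) {
  return values.length === 0 ? [] : [Math.min(...values)];
}

// Run phases in order; collect the output of every phase with keep: true.
function runJob(inputs, phases) {
  let current = inputs;
  const kept = [];
  for (const phase of phases) {
    current =
      phase.type === "map"
        ? current.reduce((acc, obj) => acc.concat(phase.fn(obj, null, phase.arg)), [])
        : phase.fn(current, phase.arg);
    if (phase.keep) kept.push(current);
  }
  return kept;
}

const results = runJob(stocks, [
  { type: "map", fn: extractTickers, arg: "GOOG", keep: false },
  { type: "reduce", fn: reduceMin, keep: true },
]);
console.log(results);
```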
  • 44. Part II: Up Next
  • 45. Problem: Mapping beats up nodes and is inefficient for large buckets.
  • 46. [Diagram: eight vnodes (vnode 1 to vnode 8) spread across four nodes (Node 1 to Node 4)]
  • 47. [Diagram: the ring of vnodes and nodes, with a MapReduce Request arriving]
  • 48-53. [Diagram sequence: the same ring with a MapReduce Request; each slide shows a single map call: <<map("goog", "20100510")>>, <<map("goog", "20100522")>>, <<map("goog", "20100511")>>, <<map("goog", "20100514")>>, <<map("goog", "20100517")>>, <<map("goog", "20100510")>>]
  • 54-57. Symptoms
      • Hotspots
      • JavaScript VM contention
      • Memory footprint
  • 58. Solution: Write a real query scheduler.
  • 59-60. [Diagram sequence: the same ring with a MapReduce Request; slide 59 shows two <<get("goog", ...)>> calls, slide 60 shows one]
  • 61-65. Benefits
      • Groups keys into batches
      • Reduces contention for JavaScript VMs
      • Uses replicas for better cluster utilization
      2x Performance Improvement
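
The "groups keys into batches" idea can be illustrated with a toy grouping step. This is a sketch only: the slides do not describe the scheduler's internals, and `toyVnodeFor` is a made-up hash rather than Riak's consistent hashing. The vnode count of eight matches the diagrams.

```javascript
const VNODE_COUNT = 8; // the diagrams show eight vnodes

// Toy placement function (assumption): NOT how Riak assigns keys.
function toyVnodeFor(bucket, key) {
  const text = bucket + "/" + key;
  let sum = 0;
  for (let i = 0; i < text.length; i++) sum += text.charCodeAt(i);
  return (sum % VNODE_COUNT) + 1;
}

// Group keys by owning vnode so each vnode receives one batched
// request instead of one request per key.
function batchByVnode(bucket, keys) {
  const batches = new Map();
  for (const key of keys) {
    const vnode = toyVnodeFor(bucket, key);
    if (!batches.has(vnode)) batches.set(vnode, []);
    batches.get(vnode).push(key);
  }
  return batches;
}

const keys = ["20100510", "20100511", "20100514", "20100517", "20100522"];
const batches = batchByVnode("goog", keys);
for (const [vnode, batch] of batches) {
  console.log("vnode " + vnode + ": get(\"goog\", [" + batch.join(", ") + "])");
}
```

Fewer, larger requests per vnode is one plausible way batching could ease contention for JavaScript VMs, since each batch can be handled in a single pass.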
  • 66. Problem: Querying data is expensive when all you have are map and reduce functions.
  • 67-70. Symptoms
      • Resource contention
      • Job throughput
      • Higher memory consumption
  • 71. Solution: Integrate key filtering operations into the MapReduce pipeline.
  • 72-75. Key Filtering
      • Select objects based on their bucket and key
      • Takes advantage of "meaningful keys"
      • Reduces the work required to satisfy a query
  • 76-77. Comparison: Find all the ticker entries for a stock between 2009 and 2010.
      • Map Functions: Use a map function to inspect each object's key to determine relevancy
      • Key Filters: Filter input keys to select only relevant objects
  • 78. Example
      {"inputs": {"bucket": "msft",
                  "key_filters": [["tokenize", "-", 1],
                                  ["string_to_int"],
                                  ["between", 2009, 2010]]},
       "query": [{"map": {"language": "javascript",
                          "source": "function(obj) { var obj2 = JSON.parse(obj.values[0].data); obj2.key = obj.key; obj2.bucket = obj.bucket; return [obj2]; }",
                          "keep": true}}]}
  • 79-82. Data Conversions
      • Datatypes: string_to_int, int_to_string, float_to_string, string_to_float
      • Formatting: tokenize, urldecode
      • Strings: to_upper, to_lower
  • 83-87. Filtering Ops
      • Numbers: greater_than, less_than, greater_than_eq, less_than_eq
      • Comparisons: eq, neq, similar_to
      • Range: between (numbers and strings)
      • Misc: matches (regular expr), set_member
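
To make the filter pipeline concrete, here is a small local evaluator for a subset of the operations listed above, applied to the filter from slide 78. It is a sketch, not Riak's implementation: the `keyMatches` helper and the sample keys are hypothetical, `tokenize` is assumed to count tokens from 1, and `between` is treated as inclusive.

```javascript
// Conversions rewrite the key; predicates decide whether it is kept.
const transforms = {
  tokenize: (key, sep, n) => String(key).split(sep)[n - 1], // assumed 1-based
  string_to_int: (key) => parseInt(key, 10),
  to_upper: (key) => String(key).toUpperCase(),
  to_lower: (key) => String(key).toLowerCase(),
};

const predicates = {
  greater_than: (key, n) => key > n,
  less_than: (key, n) => key < n,
  eq: (key, other) => key === other,
  neq: (key, other) => key !== other,
  between: (key, lo, hi) => key >= lo && key <= hi, // assumed inclusive
  matches: (key, pattern) => new RegExp(pattern).test(String(key)),
  set_member: (key, ...members) => members.includes(key),
};

const has = (table, name) => Object.prototype.hasOwnProperty.call(table, name);

// Apply each filter step in order; the first predicate decides the outcome.
function keyMatches(key, filters) {
  let current = key;
  for (const [name, ...args] of filters) {
    if (has(transforms, name)) current = transforms[name](current, ...args);
    else if (has(predicates, name)) return predicates[name](current, ...args);
    else throw new Error("unsupported filter: " + name);
  }
  return true;
}

const yearFilter = [["tokenize", "-", 1], ["string_to_int"], ["between", 2009, 2010]];
const sampleKeys = ["2008-12-31", "2009-01-02", "2010-06-30", "2011-01-03"];
const selected = sampleKeys.filter((key) => keyMatches(key, yearFilter));
console.log(selected);
```

Only the selected keys would go on to the map phase, which is where the saving over inspecting every object in a map function comes from.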
  • 88. Demo
  • 89. Part III: The Future
  • 90-93. Future Projects
      • Upgrade erlang_js to Jaegermonkey
      • Distributed reduce phases
      • External MapReduce processes