Riak MapReduce: A Story In Three Acts

Presentation describing the current state of Riak's MapReduce and future features.

Riak MapReduce: A Story In Three Acts (Presentation Transcript)

  • Riak MapReduce: A Story In Three Acts, by Kevin A. Smith, Basho Technologies
  • Part I: Current Release
  • Map/Reduce Terms
      • Phase: A step within a job
      • Job: A sequence of phases and inputs
      • Map: Data collection phase
      • Reduce: Data collation or processing phase
  • Map/Reduce Facts
      • Map phases execute in parallel with data locality
      • Reduce phases execute in parallel on the node where the job was submitted
      • Results are not cached or stored
      • Phases can be written in Erlang or JavaScript
  • Map Phase
      • Inputs must generate bucket/key pairs
      • Must return a list
      • Parallel results are aggregated into a single list
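
As a rough illustration of the map-phase rules above, here is a minimal JavaScript sketch. The function name `mapClosingPrice`, the `close` field, and the sample objects are assumptions made for this example and do not come from the slides. The final lines imitate, inside a single process, how the per-object lists end up in one result list; they are not Riak's own aggregation code.

```javascript
// Hypothetical map function: returns the closing price of one stored object.
// A map function must always return a list, even for a single value.
function mapClosingPrice(value, keyData, arg) {
  var data = JSON.parse(value.values[0].data);
  return [data.close];
}

// Invented sample objects, shaped like what a JavaScript map function receives.
var objects = [
  { bucket: "goog", key: "20100510", values: [{ data: '{"close": 521.65}' }] },
  { bucket: "goog", key: "20100511", values: [{ data: '{"close": 509.05}' }] }
];

// Local stand-in for aggregation: concatenate each per-object list.
var mapResults = objects.reduce(function (acc, obj) {
  return acc.concat(mapClosingPrice(obj, undefined, undefined));
}, []);

console.log(mapResults);
```

Returning a list even for one value is what lets the results from many objects be concatenated without special cases.
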
  • Parallel Map
  • Reduce Phase
      • Performed on the node coordinating the map/reduce job
      • Two processes per reduce phase to add minor parallelism
      • Must return a list
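
One way to read the "must return a list" rule is that a reduce function's output should be usable as its own input, so partial results from separate processes can be combined. The sketch below assumes that reading; `reduceMinSketch` and the sample numbers are hypothetical, and how Riak actually divides work between the two reduce processes is not described in the slides.

```javascript
// Hypothetical reduce function: keeps the smallest number seen so far.
// It returns a list so that its output can be fed back in as input.
function reduceMinSketch(values, arg) {
  if (values.length === 0) {
    return [];
  }
  return [Math.min.apply(null, values)];
}

// Two partial reductions, then a final pass over their combined output.
var partA = reduceMinSketch([521.65, 509.05]);
var partB = reduceMinSketch([505.39, 507.53]);
var finalResult = reduceMinSketch(partA.concat(partB));

console.log(finalResult);
```

Because the function gives the same answer whether it sees all values at once or partial results, the order in which partial lists arrive does not matter.
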
  • Submitting Jobs via HTTP
      • Riak exposes M/R via its REST API
      • Job is described in JSON
      • Submitted via POST
      • Default URL is /mapred
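
The submission steps above can be sketched as a small helper that assembles the HTTP request without sending it. The helper name `buildMapredRequest` and the base URL `http://localhost:8098` are assumptions for this example; only the POST method, the JSON body, and the `/mapred` path come from the slides.

```javascript
// Build (but do not send) the HTTP request for a MapReduce job.
// baseUrl is an assumption; /mapred is the default path named in the slides.
function buildMapredRequest(baseUrl, job) {
  return {
    url: baseUrl + "/mapred",
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(job)
  };
}

// A one-phase job over every object in the "stocks" bucket.
var job = {
  inputs: "stocks",
  query: [{ map: { language: "javascript", name: "Riak.mapValuesJson", keep: true } }]
};

var request = buildMapredRequest("http://localhost:8098", job);
console.log(request.method, request.url);
```

The resulting object can be handed to whichever HTTP client is available; keeping the construction separate from the network call makes the job description easy to inspect.
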
  • MapReduce via JSON
        {"inputs": [["stocks", "goog"]],
         "query": [{"map": {"language": "javascript",
                            "name": "Riak.mapValuesJson",
                            "keep": true}}]}
  • MapReduce via JSON
        {"inputs": "stocks",
         "query": [{"map": {"language": "javascript",
                            "name": "App.extractTickers",
                            "arg": "GOOG",
                            "keep": false}},
                   {"reduce": {"language": "javascript",
                               "name": "Riak.reduceMin",
                               "keep": true}}]}
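
To make the flow of the two-phase job above easier to follow, here is a toy in-memory model of it. Everything in it is an assumption for illustration: `runJob` is not Riak code, `extractLows` is only a guess at the kind of work `App.extractTickers` might do, and the sample entries are invented. It shows map output feeding the reduce phase and `keep` deciding which phase results are returned.

```javascript
// Toy, in-memory model of a multi-phase job. Not Riak code.
function runJob(inputs, phases) {
  var kept = [];
  var current = inputs;
  phases.forEach(function (phase) {
    if (phase.type === "map") {
      // Map runs once per input; the per-input lists are concatenated.
      current = current.reduce(function (acc, input) {
        return acc.concat(phase.fn(input, phase.arg));
      }, []);
    } else {
      // Reduce runs over the whole list produced by the previous phase.
      current = phase.fn(current, phase.arg);
    }
    if (phase.keep) {
      kept.push(current);
    }
  });
  return kept;
}

// Hypothetical stand-in for App.extractTickers: emit the low price for one ticker.
function extractLows(entry, ticker) {
  return entry.ticker === ticker ? [entry.low] : [];
}

// Hypothetical stand-in for Riak.reduceMin.
function reduceMin(values) {
  return values.length ? [Math.min.apply(null, values)] : [];
}

var entries = [
  { ticker: "GOOG", low: 498.5 },
  { ticker: "MSFT", low: 28.1 },
  { ticker: "GOOG", low: 503.2 }
];

var results = runJob(entries, [
  { type: "map", fn: extractLows, arg: "GOOG", keep: false },
  { type: "reduce", fn: reduceMin, keep: true }
]);

console.log(results);
```

Only the reduce phase is kept here, so the intermediate list of lows never reaches the caller.
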
  • Part II: Up Next
  • Problem: Mapping beats up nodes and is inefficient for large buckets.
  • [Diagram: eight vnodes (vnode 1 to vnode 8) spread across four nodes (Node 1 to Node 4). A MapReduce request sends one call per key, such as <<map("goog", "20100510")>>, <<map("goog", "20100511")>>, <<map("goog", "20100514")>>, <<map("goog", "20100517")>> and <<map("goog", "20100522")>>, to individual vnodes.]
  • Symptoms
      • Hotspots
      • JavaScript VM contention
      • Memory footprint
  • Solution: Write a real query scheduler.
  • [Diagram: the same eight vnodes across four nodes. The MapReduce request now issues <<get("goog", ...)>> calls to vnodes.]
  • Benefits: 2x Performance Improvement
      • Groups keys into batches
      • Reduces contention for JavaScript VMs
      • Uses replicas for better cluster utilization
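
The idea of grouping keys into batches can be sketched as follows, assuming batches are formed per vnode, which the slides do not spell out. The placement function `vnodeFor` is a deliberately simple, hypothetical checksum and not Riak's consistent hashing; the vnode count of 4 is also arbitrary.

```javascript
// Hypothetical placement function: maps a key to one of `vnodeCount` vnodes.
// This character-sum checksum is for illustration only.
function vnodeFor(key, vnodeCount) {
  var sum = 0;
  for (var i = 0; i < key.length; i++) {
    sum = (sum + key.charCodeAt(i)) % vnodeCount;
  }
  return sum;
}

// Group keys so each vnode receives one batch instead of one request per key.
function batchByVnode(keys, vnodeCount) {
  var batches = {};
  keys.forEach(function (key) {
    var vnode = vnodeFor(key, vnodeCount);
    if (!batches[vnode]) {
      batches[vnode] = [];
    }
    batches[vnode].push(key);
  });
  return batches;
}

var keys = ["20100510", "20100511", "20100514", "20100517", "20100522"];
var batches = batchByVnode(keys, 4);

console.log(batches);
```

With five keys and three resulting batches, the sketch issues three requests where a per-key approach would issue five.
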
  • Problem: Querying data is expensive when all you have are map and reduce functions.
  • Symptoms
      • Resource contention
      • Job throughput
      • Higher memory consumption
  • Solution: Integrate key filtering operations into the MapReduce pipeline.
  • Key Filtering
      • Select objects based on their bucket and key
      • Takes advantage of "meaningful keys"
      • Reduces the work required to satisfy a query
  • Comparison: Find all the ticker entries for a stock between 2009 and 2010.
      • Map Functions: Use a map function to inspect each object's key to determine relevancy
      • Key Filters: Filter input keys to select only relevant objects
  • Example
        {"inputs": {"bucket": "msft",
                    "key_filters": [["tokenize", "-", 1],
                                    ["string_to_int"],
                                    ["between", 2009, 2010]]},
         "query": [{"map": {"language": "javascript",
                            "source": "function(obj) { var data = JSON.parse(obj.values[0].data); data.key = obj.key; data.bucket = obj.bucket; return [data]; }",
                            "keep": true}}]}
  • Data Conversions
      • Datatypes: string_to_int, int_to_string, float_to_string, string_to_float
      • Formatting: tokenize, urldecode
      • Strings: to_upper, to_lower
  • Filtering Ops
      • Numbers: greater_than, less_than, greater_than_eq, less_than_eq
      • Comparisons: eq, neq, similar_to
      • Range: between (numbers and strings)
      • Misc: matches (regular expression), set_member
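
To show how a chain of conversions and filtering operations narrows a set of keys, here is a minimal interpreter for a handful of the operations listed above. It is a sketch, not Riak's implementation: the `YYYY-MM-DD` key format and the sample keys are assumptions, and treating `between` as inclusive at both ends is also assumed for this example.

```javascript
// Conversions rewrite the key before later steps see it.
var transforms = {
  // Split on a delimiter and keep the Nth token (1-based, as in ["tokenize", "-", 1]).
  tokenize: function (key, delimiter, index) {
    return String(key).split(delimiter)[index - 1];
  },
  string_to_int: function (key) {
    return parseInt(key, 10);
  },
  to_upper: function (key) {
    return String(key).toUpperCase();
  }
};

// Predicates accept or reject the (possibly converted) key.
var predicates = {
  // Inclusive range check; inclusivity is an assumption of this sketch.
  between: function (key, low, high) {
    return key >= low && key <= high;
  },
  greater_than: function (key, n) {
    return key > n;
  },
  eq: function (key, other) {
    return key === other;
  }
};

// Apply each filter step in order and report whether the key survives.
function keyMatches(key, filters) {
  var current = key;
  for (var i = 0; i < filters.length; i++) {
    var name = filters[i][0];
    var args = [current].concat(filters[i].slice(1));
    if (transforms[name]) {
      current = transforms[name].apply(null, args);
    } else if (predicates[name]) {
      if (!predicates[name].apply(null, args)) {
        return false;
      }
    } else {
      throw new Error("Unsupported filter: " + name);
    }
  }
  return true;
}

var filters = [["tokenize", "-", 1], ["string_to_int"], ["between", 2009, 2010]];
var sampleKeys = ["2008-12-31", "2009-06-15", "2010-01-04", "2011-03-01"];
var selected = sampleKeys.filter(function (key) {
  return keyMatches(key, filters);
});

console.log(selected);
```

Keys rejected at this stage never need to be read or passed to a map function, which is where the saving in work comes from.
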
  • Demo
  • Part III: The Future
  • Future Projects
      • Upgrade erlang_js to JägerMonkey
      • Distributed reduce phases
      • External MapReduce processes