Scaling geodata with MapReduce


  1. Scaling geodata with MapReduce. Nathan Vander Wilt, CouchConf Seattle (2012 May)
  2. Background. I’m Nate, a freelancer. I do all sorts of web, native, and embedded product development. Of all the technologies I use, Couch is one of the most fun to talk about. So we’ve learned how Couchbase’s elasticity and query model let you grow an app to massive scale. Well, for better or for worse, the user in this user story is me. Here’s how I turned almost ten years of personal geodata into something I can relive in less than two seconds. I hope you’ll find some of these ideas helpful when dealing with tens of thousands of users.
  3. ON THE DOCKET:
       • Views as “indexes”
       • Basic location examples
       • Geo Hacks
     (pls to interrupt)
     Gonna divide this up into three general chunks: first make sure we’re all on the same page as far as concepts go, then dive into the main examples. Finally we’ll explore some interesting twists on the available features. Feel free to interrupt at any time. I’d like to have some discussion between each of these main sections, so you can also save questions for then.
  4. Views as “indexes”
  5. Views as “indexes”. [Diagram: “Database (Everything Ever)” with three arrows, each labeled “efficient filtered lookup”.] What do I mean by “indexes”? Efficient lookup of data, extracted from documents. Think of a word index in a book, or the topical index of a “real” encyclopedia set.
  6. Views as “indexes”. [Diagram: Database, “map” function emits, Index.] With Couch you have full control. Your code defines which terms (“keys”) go into this index.
  7. GeoCouch for indexing: spatial index. Using R-trees as a 2-dimensional index to speed bounding box queries.
  8. GeoCouch for indexing: spatial lookup.
  9. GeoCouch for indexing. Warning: drawbacks if a lot of points fall inside the bounding box...
  10. MapReduce indexing: B-tree index. Use B-trees as an index to speed 1-dimensional lookups.
  11. MapReduce indexing: B-tree lookup. Efficient “start/end” range queries...
  12. MapReduce indexing: B-tree reduction. [Diagram: tree nodes labeled 1, 4, 2, 3, 1.] ...as well as grouped “reduce” queries!
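To make the “start/end” idea concrete, here is a toy sketch in plain JavaScript. It is not CouchDB’s B-tree; it only assumes that a view keeps its rows sorted by key, and it uses numeric keys for simplicity (CouchDB has its own key collation rules). The names `buildIndex`, `lowerBound`, and `rangeQuery` are made up for this illustration.

```javascript
// Toy stand-in for a view index: rows kept sorted by key.
// This is NOT how CouchDB stores its B-tree; it only illustrates why
// sorted keys make "start/end" range queries cheap.
function buildIndex(rows) {
  return rows.slice().sort((a, b) => a.key - b.key);
}

// Binary search for the first position whose key is >= target.
function lowerBound(index, target) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid].key < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Rows with startkey <= key <= endkey (both ends inclusive).
function rangeQuery(index, startkey, endkey) {
  const out = [];
  for (let i = lowerBound(index, startkey); i < index.length && index[i].key <= endkey; i++) {
    out.push(index[i]);
  }
  return out;
}

const index = buildIndex([
  { key: 30, value: "c" },
  { key: 10, value: "a" },
  { key: 40, value: "d" },
  { key: 20, value: "b" },
]);
const hits = rangeQuery(index, 15, 35).map((row) => row.value);
console.log(hits); // [ 'b', 'c' ]
```

The lookup jumps straight to the first matching key and then walks forward, so the cost depends on the size of the result rather than the size of the whole index.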
  13. View “indexes” as... PROJECTION. [Diagram: Database, Index.] Because Couch exposes its index keys directly, you can imagine the indexes as “projections” of data viewed from different perspectives.
  14. “Indexes” as projections. (Image: http://en.wikipedia.org/wiki/File:Axonometric_projection.svg) What do I mean by projection? Example of 3D onto 2D: pictures.
  15. “Indexes” as projections. (Image: USGS) Example of n-D onto 1-D: map functions.
  16. Views as projections. [Diagram: n-dimensional data, single dimension view.] Example of n-D onto 1-D: map functions! Imagine each “aspect” of your data as its own dimension. We’ll focus back on this in the last section on Geo Hacks, but it’s helpful to keep in mind.
  17. View “indexes” as... PROJECTION. [Diagram: Database, Database* (*index posing as...).] Because they’re sort of just a different “perspective” on your data, in all the examples ahead we’ll look at Couch indexes almost as databases themselves.
  18. Questions so far?
  19. Basic location indexes
  20. Location example #1: Where was I when...?
  21. Location example #1. Map function:
        for each pt in doc:
          emit(pt.timestamp, pt.coord)
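As a rough sketch, the pseudocode map function could look like this as a CouchDB-style JavaScript map function. The document shape is an assumption: the slides only say “for each pt in doc”, so the `points` field name and the sample document are invented here, and `emit` is stubbed so the snippet runs outside the database.

```javascript
// Stub for CouchDB's emit() so the sketch runs outside the database.
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Map function in the shape CouchDB expects: called once per document.
// `doc.points` is an assumed field name, not taken from the talk.
function map(doc) {
  (doc.points || []).forEach(function (pt) {
    emit(pt.timestamp, pt.coord);
  });
}

// Hypothetical sample document with two location breadcrumbs.
map({
  _id: "log-1",
  points: [
    { timestamp: "2012-05-01T10:00:00Z", coord: [47.61, -122.33] },
    { timestamp: "2012-05-01T10:05:00Z", coord: [47.62, -122.35] },
  ],
});
console.log(emitted.length); // 2
```

With timestamps as keys, “where was I when...?” becomes a range query on the view using the `startkey` and `endkey` query parameters.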
  22. Location example #1 (Demo)
  23. Location example #2: Where have I been?
  24. Location example #2. Map function:
        for each pt in doc:
          emit(pt.timestamp, pt.coord)
      Reduce function:
        for each value in reduction:
          avg += value.coord
          n += value.n || 1
        return {avg, n}
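One way to flesh out the averaging reduce is sketched below, using CouchDB’s `(keys, values, rereduce)` reduce signature. It assumes each emitted value is a `[lat, lon]` pair (the talk does not show the coordinate format) and weights earlier partial averages by their counts so that re-reducing gives the same answer as a single pass. Plain arithmetic averaging of latitude and longitude is a simplification that misbehaves near the antimeridian.

```javascript
// Reduce sketch: running average of [lat, lon] pairs.
// Assumes map values are [lat, lon] arrays; this format is an assumption.
function reduce(keys, values, rereduce) {
  let sumLat = 0;
  let sumLon = 0;
  let n = 0;
  values.forEach(function (value) {
    // On rereduce each value is an earlier {avg, n} result;
    // otherwise it is a raw [lat, lon] pair that counts once.
    const coord = rereduce ? value.avg : value;
    const weight = rereduce ? value.n : 1;
    sumLat += coord[0] * weight;
    sumLon += coord[1] * weight;
    n += weight;
  });
  return { avg: [sumLat / n, sumLon / n], n: n };
}

// Two partial reductions, then a rereduce over their outputs.
const first = reduce(null, [[0, 0], [2, 2]], false);
const second = reduce(null, [[4, 4]], false);
const combined = reduce(null, [first, second], true);
console.log(combined); // { avg: [ 2, 2 ], n: 3 }
```

Carrying `n` alongside `avg` is what lets the function reduce its own output without skewing the result toward small groups.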
  25. Location example #2 (Demo)
  26. Location example #3: What photos did I capture in this area? The reason I’ve been recording my location for so many years.
  27. Location example #3: tiles, quadkeys. emit([2,0,2,…]), queried with ?group_level=Z. (Image: http://msdn.microsoft.com/en-us/library/bb259689.aspx)
  28. Location example #3. Map function:
        for each pt in doc:
          key = quadkey(pt.coord)
          emit(key, pt)
      Reduce function (basic):
        _count
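The slides leave `quadkey()` undefined. A minimal sketch, following the Bing Maps tile system that the slide’s image links to, might look like this. The function names, the separate `lat`/`lon` arguments, and the choice to return an array of digits (so that `group_level` can group by zoom level) are assumptions for illustration.

```javascript
// Convert tile coordinates at a zoom level into quadkey digits.
// In the Bing Maps tile system a set X bit adds 1 and a set Y bit adds 2.
function tileToQuadkey(tileX, tileY, level) {
  const digits = [];
  for (let i = level; i > 0; i--) {
    const mask = 1 << (i - 1);
    let digit = 0;
    if (tileX & mask) digit += 1;
    if (tileY & mask) digit += 2;
    digits.push(digit);
  }
  return digits;
}

// Project latitude/longitude (degrees) with Web Mercator, then pick the tile.
function coordToQuadkey(lat, lon, level) {
  // Web Mercator is only defined up to about +/-85.05 degrees latitude.
  const clampedLat = Math.max(-85.05112878, Math.min(85.05112878, lat));
  const sinLat = Math.sin((clampedLat * Math.PI) / 180);
  const x = (lon + 180) / 360;
  const y = 0.5 - Math.log((1 + sinLat) / (1 - sinLat)) / (4 * Math.PI);
  const tiles = Math.pow(2, level);
  const clampTile = (t) => Math.max(0, Math.min(tiles - 1, Math.floor(t * tiles)));
  return tileToQuadkey(clampTile(x), clampTile(y), level);
}

console.log(tileToQuadkey(3, 5, 3)); // [ 2, 1, 3 ]
```

Because each extra digit subdivides the parent tile, a query with `?group_level=3` and the `_count` reduce returns one count per level-3 tile, and a larger `group_level` zooms in.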
  29. Location example #3 (Demo)
  30. Questions now? Have I managed to confuse anyone by now?
  31. Scalable geo hacks. Geo hacks -> scalable geo hacks. Keep in mind that everything we’ve talked about so far will scale well. I have “only” a few million location breadcrumbs. That’s a lot, but you might have many, many more. The same index trees that got me this far are designed to keep going farther. Now we’re going to talk about some hacks: some I haven’t actually used yet, the next makes your code uglier, and the last one adds cheating on top of that. But all of these “hacks” still have some good scaling properties.
  32. Hack #1. Geo: not just for geo
  33. Hack #1. [Diagram: n-dimensional data, one-dimensional view.] We talked about this: “projecting” an aspect of multi-dimensional data onto a 1-d sorted index. But why not project onto a...
  34. Hack #1. [Diagram: n-dimensional data, two-dimensional view!] 2-d index!
  35. Hack #1. [Chart: vertical axis ticks 0, 2750, 5500, 8250, 11000; horizontal axis 2008 to 2012.] Time and altitude, e.g. when did I fly in 2009–2010?
  36. Hack #1. [Same chart as the previous slide.] With B-trees, the emitted values are in some sense actually 0-dimensional points: they don’t cover any space. You can only find data lying *within* a given range. With GeoCouch’s R-tree, you can actually find index entries that go *across* your query range. That is, you can project (emit) shapes that cover space, and intersections are efficiently detected by the index. (Jordan curve / mean value theorem)
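The difference between “within” and “across” can be shown with two small predicates. This is only an illustration of the geometry, not GeoCouch’s R-tree or its API; the box layout `[minX, minY, maxX, maxY]`, the year-fraction time values, and the sample flight are all assumptions.

```javascript
// Boxes are [minX, minY, maxX, maxY]; here X is time (as a year fraction)
// and Y is altitude. Both the layout and the sample data are hypothetical.
function intersects(a, b) {
  return a[0] <= b[2] && b[0] <= a[2] && a[1] <= b[3] && b[1] <= a[3];
}

// A 0-dimensional point is only found if it lies inside the query box.
function containsPoint(box, x, y) {
  return box[0] <= x && x <= box[2] && box[1] <= y && y <= box[3];
}

// A flight that starts before and ends after the query window.
const flight = [2009.9, 0, 2010.1, 11000];
const query = [2010.0, 5000, 2010.05, 12000];

// Indexed as a shape, the flight is found because it crosses the window.
const foundAsShape = intersects(flight, query);

// Indexed only by its start and end points, it is missed entirely.
const foundAsPoints =
  containsPoint(query, 2009.9, 0) || containsPoint(query, 2010.1, 0);

console.log(foundAsShape, foundAsPoints); // true false
```

A spatial index answers the `intersects` question for many stored shapes at once without testing each one individually.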
  37. Hack #1. [Chart: one axis with ticks 0 to 5, the other with ticks 1 to 4.] Those first examples are still somewhat “geo”, that is, geometric. It doesn’t have to be: camera and rating, e.g. clean up bad photos from recent cameras.
  38. Hack #1. Consider: ?bbox=…&limit=500 and ?bbox=…&count=true. Unfortunately, this does not provide a *sorted* index. GeoCouch is scalable in the sense that determining which objects are within the bounding box will stay fast, but the result set might be manageable only when counting or limiting.
  39. Hack #1 (No demo, sorry)
  40. Hack #2: All log(n) you can eat
  41. Hack #2. Map function: (usual timestamp/location emit) ...
      Reduce function:
        n = Math.log(values.length)
        while averages.length < n:
          averages.push(another)
        return averages
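The slide does not say how each “another” average is chosen, so the following is only one possible reading: split the incoming values into roughly log(n) contiguous chunks and average each chunk, so the reduce output grows logarithmically rather than linearly. The `[lat, lon]` value format, the helper `mergeSummaries`, and the use of the total point count (instead of `values.length`) on rereduce are assumptions of this sketch.

```javascript
// Weighted mean of a list of {avg: [lat, lon], n} summaries.
function mergeSummaries(items) {
  let lat = 0;
  let lon = 0;
  let n = 0;
  items.forEach(function (item) {
    lat += item.avg[0] * item.n;
    lon += item.avg[1] * item.n;
    n += item.n;
  });
  return { avg: [lat / n, lon / n], n: n };
}

// Reduce sketch: return about log(n) averages instead of a single one.
function reduce(keys, values, rereduce) {
  // Normalise both passes to a flat list of {avg, n} summaries.
  const items = rereduce
    ? values.reduce((all, group) => all.concat(group), [])
    : values.map((coord) => ({ avg: coord, n: 1 }));
  const total = items.reduce((sum, item) => sum + item.n, 0);
  const target = Math.max(1, Math.ceil(Math.log(total)));
  const size = Math.ceil(items.length / target);
  const averages = [];
  for (let i = 0; i < items.length; i += size) {
    averages.push(mergeSummaries(items.slice(i, i + size)));
  }
  return averages;
}

// Eight hypothetical points along a diagonal.
const points = [];
for (let i = 0; i < 8; i++) points.push([i, i]);
const averages = reduce(null, points, false);
console.log(averages.length); // 3
```

Which points end up averaged together depends on how the values happen to be grouped, so the individual averages are not stable, but the output size stays small as the data grows.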
  42. Hack #3: Out-of-the-ordinary location summary
  43. Hack #3. [Diagram: …?] Averaging locations puts dots on places I’ve never actually been.
  44. Hack #3. What might be more interesting is to pick good samples of actual locations I’ve been to. “Representative” locations, so to speak. What’s a good “representative” location? The most common ones, ones with a lot of other points nearby, or ones closest to that average we calculate?
  45. Hack #3. Finally realized that it was actually the “outliers” that provided the most interesting answer to “where have I all been”!
  46. Hack #3. Map function: ...
      Reduce function:
        n = Math.log(values.length)
        while outliers.length < n:
          outliers.push(another)
        return outliers
      (using the log(n) trick from before)
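Again the slide leaves “another” open. A hedged sketch of one interpretation: keep the roughly log(n) points that lie farthest from the mean of whatever values the reduce call receives. The “farthest from the mean” rule, the `[lat, lon]` value format, the squared-degree distance, and the `{outliers, n}` return shape are all assumptions, not the author’s implementation.

```javascript
// Squared distance in degrees; crude, but enough to rank candidates.
function dist2(a, b) {
  const dLat = a[0] - b[0];
  const dLon = a[1] - b[1];
  return dLat * dLat + dLon * dLon;
}

// Reduce sketch: keep about log(n) actual points that sit far from the mean.
function reduce(keys, values, rereduce) {
  // On rereduce, pool the outliers picked by earlier passes.
  const points = rereduce
    ? [].concat(...values.map((v) => v.outliers))
    : values;
  const n = rereduce ? values.reduce((sum, v) => sum + v.n, 0) : values.length;
  const mean = [
    points.reduce((sum, p) => sum + p[0], 0) / points.length,
    points.reduce((sum, p) => sum + p[1], 0) / points.length,
  ];
  const keep = Math.max(1, Math.ceil(Math.log(n)));
  const outliers = points
    .slice()
    .sort((a, b) => dist2(b, mean) - dist2(a, mean))
    .slice(0, keep);
  return { outliers: outliers, n: n };
}

// Three hypothetical points close together and one far away.
const result = reduce(null, [[0, 0], [0.1, 0.1], [0.2, 0], [50, 50]], false);
console.log(result.outliers[0]); // [ 50, 50 ]
```

Each pass only sees the points handed to it, so the final selection depends on how the values were grouped along the way.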
  47. Hack #3. Caution: RESULT IS “UNDEFINED”. This cheats. A reduce function should be “commutative and associative for the array value input, to be able reduce on its own output and get the same answer”. This is not really associative: the result depends on the internal tree structure. The points picked are random/unstable.
  48. Hack #3 (but is interesting anyway). In this case, I’d rather have randomly picked but practically useful output data than stable “averages”.
  49. Hack #3 (Demo)
  50. Thanks! (Any more questions?) @natevw, http://exts.ch. Code from https://github.com/natevw/LocLog and https://github.com/natevw/ShutterStem
