
RTree Spatial Indexing with MongoDB - MongoDC

14,473 views

Presentation by Nicholas W. Knize for MongoDC describing a scalable R-Tree implementation extension for MongoDB

Published in: Technology
  • Slide 9 doesn't illustrate the point. Huge mistake there: the two red dots on the top-right and bottom-right should be shown close to each other, yet they are in different rectangles. Geographically very close, but the index says "too far". The red dots incorrectly illustrate the issue.
  • Slide 7 says that a B-Tree cannot store multi-dimension (>2D) data. This is a mistake.

  1. WHY WE CHOSE MONGODB TO PUT BIG-DATA ‘ON THE MAP’
     JUNE 2012, @nknize, +Nicholas Knize
  2. “The 3D UDOP allows near real time visibility of all SOUTHCOM Directorates information in one location…this capability allows for unprecedented situational awareness and information sharing” - Gen. Doug Frasier
     TST PRODUCTS
     ACCOMPLISHING THE IMPOSSIBLE
  3. • Expose enterprise data in a geo-temporal user defined environment
     • Provide a flexible and scalable spatial indexing framework for heterogeneous data
     • Visualize spatially referenced data on 3D globe & 2D maps
     • Manage real-time data feeds and mobile messaging
     • View data over geo-rectified imagery with 3D terrain
     • Support mission planning and simulation
     • Provide real-time collaboration and sharing
     ISPATIAL OVERVIEW
  4. Desired Data Store Characteristics for ‘Big Data’
     • Horizontally scalable – Large volume / elastic
     • Vertically scalable – Heterogeneous data types (“Data Stack”)
     • Smartly Distributed – Reduce the distance bits must travel
     • Fault Tolerant – Replication Strategy and Consistency model
     • High Availability – Node recovery
     • Fast – Reads or writes (can’t always have both)
     BIG DATA STORAGE CHARACTERISTICS
  5. Subset of Evaluated NoSQL Options
     • Cassandra
       – Nice Bring Your Own Index (BYOI) design
       – … but Java, Java, Java… Memory management can be an issue
       – Adding new nodes can be a pain (Token Changes, nodetool)
       – Key-Value store…good for simple data models
     • HBase
       – Nice BigTable model
       – Theory grounded heavily in CAP, inflexible trade-offs
       – Complicated setup and maintenance
     • CouchDB
       – Provides some GeoSpatial functionality (Currently being rewritten)
       – HEAVILY dependent on Map-Reduce model (complicated design)
       – Erlang based – poor multi-threaded heap management
     NOSQL OPTIONS
  6. Why MongoDB for Thermopylae?
     • Documents based on JSON – A GeoJSON match made in heaven!
     • C++ – No Garbage Collection Overhead! Efficient memory management design reduces disk swapping and paging
     • Disk storage is memory mapped, enabling fast swapping when necessary
     • Built-in auto-failover with replica sets and fast recovery with journaling
     • Tunable Consistency – Consistency defined at application layer
     • Schema Flexible – friendly properties of SQL enable easy port
     • Provided initial spatial indexing support – Point-based, limited!
     WHY TST LIKES MONGODB
  7. ... The Spatial Indexer wasn’t quite right
     • MongoDB (like nearly all relational DBs) uses a B-Tree
       – Data structure that keeps data sorted, with search, insert, and delete in logarithmic time
       – Great for indexing numerical and text documents (1D attribute data)
       – Cannot store multi-dimension (>2D) data
       – NOT COMPLEX GEOMETRY FRIENDLY
     MONGODB SPATIAL INDEXER
  8. How does MongoDB solve the dimensionality problem?
     • Space Filling (Z) Curve
       – A continuous line that intersects every point in a two-dimensional plane
     • Use Geohash to represent lat/lon values
       – Interleave the bits of a lat/lon pair
       – Base32 encode the result
     DIMENSIONALITY REDUCTION
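The two Geohash steps named on this slide (interleave the bits, then Base32 encode) can be sketched in a few lines of Python. This is a minimal illustration of the public Geohash scheme, not MongoDB's internal code; the function name `geohash_encode` and the `precision` parameter are assumptions made for the example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (no a, i, l, o)

def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    """Encode a lat/lon pair as a Geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True  # interleaving starts with a longitude bit
    while len(bits) < precision * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1)
                lon_lo = mid
            else:
                bits.append(0)
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1)
                lat_lo = mid
            else:
                bits.append(0)
                lat_hi = mid
        use_lon = not use_lon
    # Pack each group of 5 interleaved bits into one Base32 character.
    chars = []
    for i in range(0, len(bits), 5):
        value = 0
        for bit in bits[i:i + 5]:
            value = (value << 1) | bit
        chars.append(BASE32[value])
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744, 6))  # u4pruy
```

Because the result is a plain string, it can be stored in an ordinary one-dimensional B-Tree index, and a shared prefix implies a shared cell.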
  9. Issues with the Geohash B-Tree approach
     • Neighbors aren’t so close!
       – Neighboring points on the Geoid may end up on opposite ends of the plane
       – Impacts search efficiency
     • What about Geometry?
       – Doesn’t support > 2D
       – Mongo uses Multi-Location documents, which really just index multiple points that link back to a single document
     GEOHASH BTREE ISSUES
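The "neighbors aren't so close" problem can be reproduced with a plain Z-order (Morton) index on a small integer grid. This is a sketch under assumed conventions (x bits in even positions, y bits in odd positions, an 8x8 grid); the helper name `interleave` is hypothetical.

```python
def interleave(x: int, y: int, bits: int = 3) -> int:
    """Z-order index of grid cell (x, y): x bits go to even positions, y bits to odd."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

center = interleave(3, 3)       # 15
east = interleave(4, 3)         # 26: one cell to the right
north = interleave(3, 4)        # 37: one cell up
print(center, east, north)      # 15 26 37
```

Cells (3, 3), (4, 3), and (3, 4) touch each other on the grid, yet their positions along the curve are 11 and 22 steps apart, so a range scan over the one-dimensional key has to visit many unrelated cells to collect true neighbors.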
  10. Mongo Multi-location Document Clipping Issues ($within search doesn’t always work w/ multi-location)
      [Figure: four cases of a Multi-Location Document (aka Polygon) against a Search Polygon. Case 1: Success! Case 2: Success! Case 3: Fail! Case 4: Fail!]
      MULTI-LOCATION CLIPPING
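The figure itself is not recoverable from the transcript, so the following is only one plausible way such a failure can arise, assuming the index sees nothing but a polygon's vertices. A search box that crosses an edge without containing any vertex is missed by a vertex test, while a bounding-rectangle test still flags the polygon as a candidate. All names and coordinates here are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def point_in_box(p: Point, box: Box) -> bool:
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]

def mbr(points: List[Point]) -> Box:
    """Minimum bounding rectangle of a set of points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def boxes_overlap(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

polygon = [(0, 0), (10, 0), (10, 10), (0, 10)]  # stored as four indexed points
search = (4, -1, 6, 1)                          # straddles the bottom edge

vertex_hit = any(point_in_box(v, search) for v in polygon)
mbr_hit = boxes_overlap(mbr(polygon), search)
print(vertex_hit, mbr_hit)  # False True
```

The bounding-rectangle test is only a candidate filter; an exact geometry check would still be needed afterwards.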
  11. Potential Solutions
      • Constrain the system to single point searches
        – Multi-dimension support will be exponentially complex (won’t scale)
      • Interpolate points along the edge of the shape
        – Multi-dimension support will be exponentially complex (won’t scale)
      • Customize the spatial indexer
        – Selected approach
      SOLUTIONS TO GEOHASH PROBLEM
  12. Thermopylae Custom Tuned MongoDB for Geo
      TST leverages Guttman’s 1984 research in R/R* Trees
      • R-Trees organize any-dimensional data by representing the data as a minimum bounding box.
      • Each node bounds its children. A node can have many objects in it (max: m, min: ceil(m/2))
      • Splits and merges optimized by minimizing overlaps
      • The leaves point to the actual objects (stored on disk probably)
      • Height balanced – tree height is O(log n); search approaches O(log n) when MBR overlap is low
      CUSTOM TUNED SPATIAL INDEXER
  13. Spatial Indexing at Scale with R-Trees
      Spatial data represented as minimum bounding rectangles (2-dimension), cubes (3-dimension), hexadecant (4-dimension)
      Index represented as: <I, DiskLoc> where:
        I = (I0, I1, …, In-1), n = number of dimensions
        Each Ii is a set in the form of [min, max] describing the MBR range along dimension i
      RTREE THEORY
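The <I, DiskLoc> entry described above maps naturally onto a tuple of per-dimension intervals. The sketch below is an assumption-laden illustration, not the TST implementation: `IndexEntry`, `intersects`, `volume`, and `union` are hypothetical names, and `disk_loc` is an integer stand-in for MongoDB's DiskLoc.

```python
from dataclasses import dataclass
from math import prod
from typing import Tuple

Interval = Tuple[float, float]  # [min, max] along one dimension

@dataclass(frozen=True)
class IndexEntry:
    intervals: Tuple[Interval, ...]  # one interval per dimension
    disk_loc: int                    # hypothetical stand-in for DiskLoc

def intersects(a: IndexEntry, b: IndexEntry) -> bool:
    """Two MBRs intersect only if their ranges overlap in every dimension."""
    return all(a_lo <= b_hi and b_lo <= a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(a.intervals, b.intervals))

def volume(e: IndexEntry) -> float:
    """Area in 2D, volume in 3D, hyper-volume beyond that."""
    return prod(hi - lo for lo, hi in e.intervals)

def union(a: IndexEntry, b: IndexEntry, disk_loc: int = -1) -> IndexEntry:
    """Smallest MBR covering both entries."""
    merged = tuple((min(a_lo, b_lo), max(a_hi, b_hi))
                   for (a_lo, a_hi), (b_lo, b_hi) in zip(a.intervals, b.intervals))
    return IndexEntry(merged, disk_loc)

a = IndexEntry(((0, 2), (0, 2), (0, 1)), 101)  # a 3D box
b = IndexEntry(((1, 3), (1, 3), (0, 1)), 102)
c = IndexEntry(((5, 6), (5, 6), (0, 1)), 103)
print(intersects(a, b), intersects(a, c), volume(union(a, b)))  # True False 9
```

The same three operations work unchanged for 2, 3, or 4 dimensions because they only iterate over the interval tuple.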
  14. R*-Tree Spatial Index Example
      [Figure: R*-tree with upper-level entries m, n, o, p and leaf entries a through l]
      • Sample insertion result for 4th order tree
      • Objectives:
        1. Minimize area
        2. Minimize overlaps
        3. Minimize margins
        4. Maximize inner node utilization
      R*-TREE INDEX OBJECTIVES
  15. Insert
      • Similar to insertion into B+-tree but may insert into any leaf; leaf splits in case capacity exceeded.
        – Which leaf to insert into?
        – How to split a node?
      R*-TREE INSERT EXAMPLE
  16. Insert - Leaf Selection
      • Follow a path from root to leaf.
      • At each node move into subtree whose MBR area increases least with addition of new rectangle.
      [Figure: MBRs m, n, o, p]
  17. Insert - Leaf Selection
      • Insert into m.
  18. Insert - Leaf Selection
      • Insert into n.
  19. Insert - Leaf Selection
      • Insert into o.
  20. Insert - Leaf Selection
      • Insert into p.
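The leaf-selection rule from these slides (descend into the subtree whose MBR grows least) can be sketched as follows. This is a minimal 2D illustration; the `Node` class, the tie-break on smaller area (as in Guttman's ChooseLeaf), and the sample rectangles are assumptions for the example, and node splitting is left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(mbr: Rect, rect: Rect) -> float:
    """Extra area the MBR would need in order to cover rect."""
    return area(union(mbr, rect)) - area(mbr)

@dataclass
class Node:
    name: str
    mbr: Rect
    children: List["Node"] = field(default_factory=list)  # empty means leaf

def choose_leaf(root: Node, rect: Rect) -> Node:
    """Walk from root to a leaf, always taking the least-enlargement child."""
    node = root
    while node.children:
        node = min(node.children,
                   key=lambda c: (enlargement(c.mbr, rect), area(c.mbr)))
    return node

m = Node("m", (0, 0, 2, 2))
n = Node("n", (5, 5, 8, 8))
root = Node("root", (0, 0, 8, 8), [m, n])
print(choose_leaf(root, (1, 1, 3, 3)).name)  # m
```

Inserting (1, 1, 3, 3) would grow m by 5 area units but n by 40, so the walk ends in m.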
  21. Query
      • Start at root
      • Find all overlapping MBRs
      • Search subtrees recursively
      [Figure: query rectangle x over the tree with upper-level entries m, n, o, p and leaf entries a through l]
  22. Query
      • Search m.
      [Figure: query rectangle x against the entries under m]
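The three query steps (start at the root, find overlapping MBRs, recurse) amount to a short recursive function. The sketch below assumes a simplified 2D node layout; the placement of entries under nodes m and n is invented for the example and does not reproduce the slide's figure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def overlaps(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)          # inner nodes
    entries: List[Tuple[Rect, str]] = field(default_factory=list)  # leaf payload

def search(node: Node, query: Rect) -> List[str]:
    """Return ids of all entries whose rectangle overlaps the query."""
    if not overlaps(node.mbr, query):
        return []  # prune the whole subtree
    hits = [doc for rect, doc in node.entries if overlaps(rect, query)]
    for child in node.children:
        hits.extend(search(child, query))
    return hits

m = Node((0, 0, 3, 3), entries=[((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")])
n = Node((6, 6, 9, 9), entries=[((6, 6, 7, 7), "e"), ((8, 8, 9, 9), "f")])
root = Node((0, 0, 9, 9), children=[m, n])
print(search(root, (2.5, 2.5, 6.5, 6.5)))  # ['b', 'e']
```

Unlike a B-Tree lookup, the search may follow several branches, which is why minimizing overlap between sibling MBRs matters for query cost.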
  23. R*-Tree Leverages B-Tree Base Data Structures (buckets)
      R*-TREE MONGODB IMPLEMENTATION
  24. Geo-Sharding – (in work)
      Scalable Distributed R* Tree (SD-r*Tree)
      “Balanced” binary tree, with nodes distributed on a set of servers:
      • Each internal node has exactly two children
      • Each leaf node stores a subset of the indexed dataset
      • At each node, the heights of the subtrees differ by at most one
      • mongos “routing” node maintains binary tree
      GEO-SHARDING
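The structural rules listed for the SD-r*Tree (exactly two children per internal node, subtree heights differing by at most one) can be checked with a small recursive function. This is a sketch of the invariants only, with no distribution or spatial data; `SDNode` and `checked_height` are hypothetical names, and the node labels borrow the d (data node) and r (coverage node) naming used in the deck.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SDNode:
    name: str
    left: Optional["SDNode"] = None
    right: Optional["SDNode"] = None

def checked_height(node: SDNode) -> int:
    """Return the subtree height, raising ValueError if an invariant is broken."""
    if node.left is None and node.right is None:
        return 1  # leaf (data node)
    if node.left is None or node.right is None:
        raise ValueError(f"{node.name}: internal node needs exactly two children")
    left_h = checked_height(node.left)
    right_h = checked_height(node.right)
    if abs(left_h - right_h) > 1:
        raise ValueError(f"{node.name}: subtree heights differ by more than one")
    return 1 + max(left_h, right_h)

# Coverage node r1 routes to data node d0 and coverage node r2 (d1, d2).
tree = SDNode("r1", SDNode("d0"), SDNode("r2", SDNode("d1"), SDNode("d2")))
print(checked_height(tree))  # 3
```

In a sharded deployment each leaf would correspond to a chunk on some server, but that mapping is outside the scope of this check.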
  25. SD-r*Tree Data Structure Illustration
      [Figure: data nodes d0, d1, d2 and coverage nodes r1, r2 shown with their spatial coverage as the tree grows]
      • di = Data Node (Chunk)
      • ri = Coverage Node
      Leveraged work from Litwin, Mouza, Rigaux 2007
      SD-r*Tree DATA STRUCTURE
  26. SD-r*Tree Structure Distribution
      [Figure: the tree nodes distributed across mongos, GeoShard 1, GeoShard 2, and GeoShard 3]
      SD-r*TREE STRUCTURE DISTRIBUTION
  27. GeoSharding Alternative – 3D / 4D Hilbert Scanning Order
      GEO-SHARDING ALTERNATIVE
  28. Next Steps: Beyond 4-Dimensions - X-Tree (Berchtold, Keim, Kriegel – 1996)
      [Figure legend: Normal Internal Nodes, Supernodes, Data Nodes]
      • Avoid MBR overlaps
      • Avoid node splits (main cause for high overlap)
      • Introduce new node structure: Supernodes
        – Large Directory nodes of variable size
      BEYOND 4-DIMENSIONS
  29. X-Tree Performance Results (Berchtold, Keim, Kriegel – 1996)
      X-TREE PERFORMANCE
  30. T-Sciences Custom Tuned Spatial Indexer
      • Optimized Spatial Search
        – Finds intersecting MBR and recurses into those nodes
      • Optimized Spatial Inserts
        – Uses the Hilbert Value of MBR centroid to guide search
        – 28% reduction in number of nodes touched
      • Optimized Deletes
        – Leverages R* split/merge approach for rebalancing tree when nodes become over/under-full
      • Low maintenance
        – Leverages MongoDB’s automatic data compaction and partitioning
      CONCLUSION
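The deck does not show how the Hilbert value of an MBR centroid is computed, so the following is only a sketch of the general idea: map the centroid onto a 2^order by 2^order grid and convert the cell to its position along a 2D Hilbert curve using the well-known iterative conversion. The function names, the world bounds, and the default `order` are assumptions.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def xy2d(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve over an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_key(mbr: Rect,
                world: Rect = (-180.0, -90.0, 180.0, 90.0),
                order: int = 16) -> int:
    """Hilbert value of the MBR centroid on a 2**order grid covering `world`."""
    n = 1 << order
    cx = (mbr[0] + mbr[2]) / 2
    cy = (mbr[1] + mbr[3]) / 2
    gx = min(n - 1, max(0, int((cx - world[0]) / (world[2] - world[0]) * n)))
    gy = min(n - 1, max(0, int((cy - world[1]) / (world[3] - world[1]) * n)))
    return xy2d(n, gx, gy)

print([xy2d(2, x, y) for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]])  # [0, 1, 2, 3]
```

Consecutive Hilbert positions are always adjacent grid cells, so sorting entries by this key tends to place spatially close MBRs near each other, which is the property an insert heuristic can exploit.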
  31. Example Use Case – OSINT (Foursquare Data)
      • Sample Foursquare data set mashed with Government Intel Data (poly reports)
      • 100 million Geo Document test (3D points and polys)
      • 4 server replica set
      • ~350ms query response
      • ~300% improvement over PostGIS
      EXAMPLE
  32. Community Support
      • Thermopylae contributes fixes to the codebase
        – http://github.com/mongodb
      • TST will work with 10gen to fold into the baseline
      • Active developer collaboration
        – IRC: #mongodb freenode.net
      FIND US
  33. THANK YOU
      Questions?
      Nicholas Knize
      nknize@t-sciences.com
  34. Backup
  35. Thermopylae Sciences & Technology – Who are we?
      • Advanced technology w/ 160+ employees
      • Core customers in national security, venues and events, military and police, and city planning
      • Partnered with Google and imagery providers
      • Long term relationship focused – TS/SCI Staff
      TST + 10gen + Google = Game-changing approach
      ENTERPRISE PARTNER
      WHO ARE THESE GUYS?
  36. Key Customers - Government
      • US Dept of State Bureau of Diplomatic Security
        – Build and support 30 TB Google Earth Globe with multi-terabytes of individual globes sent to embassies throughout the world. Integrated Google Earth and iSpatial framework.
      • US Army Intelligence Security Command
        – Provide expertise in managing technology integration – prime contractor providing operations, intelligence, and IT support worldwide. Partners include IBM, Lockheed Martin, Google, MIT, Carnegie Mellon. Integrated Google Earth and iSpatial framework.
      • US Southern Command
        – Coordinate Intelligence management systems spatial data collection, indexing, and distribution. Integrated Google Earth, iSpatial, and iHarvest.
        – Index large volume imagery and expose it for different services (Air Force, Navy, Army, Marines, Coast Guard)
      GOVERNMENT CUSTOMERS
  37. Key Customers - Commercial
      • Cleveland Cavaliers
      • USGIF
      • Las Vegas Motor Speedway
      • Baltimore Grand Prix
      iSpatial framework serves thousands of mobile devices
      COMMERCIAL CUSTOMERS
