Presentation by Nicholas W. Knize for MongoDC describing a scalable R-Tree implementation extension for MongoDB


- 1. WHY WE CHOSE MONGODB TO PUT BIG-DATA ‘ON THE MAP’ JUNE 2012 @nknize +Nicholas Knize
- 2. TST PRODUCTS: “The 3D UDOP allows near real time visibility of all SOUTHCOM Directorates information in one location… this capability allows for unprecedented situational awareness and information sharing” (Gen. Doug Frasier). ACCOMPLISHING THE IMPOSSIBLE
- 3. ISPATIAL OVERVIEW
  - Expose enterprise data in a geo-temporal, user-defined environment
  - Provide a flexible and scalable spatial indexing framework for heterogeneous data
  - Visualize spatially referenced data on 3D globe & 2D maps
  - Manage real-time data feeds and mobile messaging
  - View data over geo-rectified imagery with 3D terrain
  - Support mission planning and simulation
  - Provide real-time collaboration and sharing
- 4. BIG DATA STORAGE CHARACTERISTICS: Desired Data Store Characteristics for ‘Big Data’
  - Horizontally scalable: large volume / elastic
  - Vertically scalable: heterogeneous data types (“Data Stack”)
  - Smartly distributed: reduce the distance bits must travel
  - Fault tolerant: replication strategy and consistency model
  - High availability: node recovery
  - Fast: reads or writes (can’t always have both)
- 5. NOSQL OPTIONS: Subset of Evaluated NoSQL Options
  - Cassandra
    - Nice Bring Your Own Index (BYOI) design
    - … but Java, Java, Java… Memory management can be an issue
    - Adding new nodes can be a pain (token changes, nodetool)
    - Key-value store: good for simple data models
  - HBase
    - Nice BigTable model
    - Theory grounded heavily in CAP, inflexible trade-offs
    - Complicated setup and maintenance
  - CouchDB
    - Provides some geospatial functionality (currently being rewritten)
    - HEAVILY dependent on Map-Reduce model (complicated design)
    - Erlang based: poor multi-threaded heap management
- 6. WHY TST LIKES MONGODB: Why MongoDB for Thermopylae?
  - Documents based on JSON: a GeoJSON match made in heaven!
  - C++: no garbage collection overhead! Efficient memory management design reduces disk swapping and paging
  - Disk storage is memory mapped, enabling fast swapping when necessary
  - Built-in auto-failover with replica sets and fast recovery with journaling
  - Tunable consistency: consistency defined at application layer
  - Schema flexible: friendly properties of SQL enable easy port
  - Provided initial spatial indexing support: point based, limited!
- 7. MONGODB SPATIAL INDEXER: ... The Spatial Indexer wasn’t quite right
  - MongoDB (like nearly all relational DBs) uses a B-tree
    - Data structure that keeps data sorted and supports search, insert, and delete in logarithmic time
    - Great for indexing numerical and text documents (1D attribute data)
    - Cannot store multi-dimension (>2D) data
    - NOT COMPLEX GEOMETRY FRIENDLY
- 8. DIMENSIONALITY REDUCTION: How does MongoDB solve the dimensionality problem?
  - Space-filling (Z) curve
    - A continuous line that intersects every point in a two-dimensional plane
  - Use Geohash to represent lat/lon values
    - Interleave the bits of a lat/lon pair
    - Base32 encode the result
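
The two Geohash steps on this slide (interleave the bits, then Base32 encode) can be sketched in a few lines. This is a minimal illustration of the public Geohash scheme, not MongoDB's internal code; the function name `geohash_encode` and the `precision` parameter are assumptions made for the example.

```python
# Geohash alphabet: digits plus lowercase letters, omitting a, i, l, o.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=9):
    """Encode a lat/lon pair as a Geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # Geohash interleaving starts with a longitude bit
    while len(bits) < precision * 5:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
    # Pack each group of 5 interleaved bits into one Base32 character.
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = (n << 1) | b
        chars.append(BASE32[n])
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, precision=3))
```

Because the resulting string sorts like a single 1D key, it can be stored in an ordinary B-tree, which is the dimensionality reduction the slide describes.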
- 9. GEOHASH BTREE ISSUES: Issues with the Geohash B-tree approach
  - Neighbors aren’t so close!
    - Neighboring points on the geoid may end up on opposite ends of the plane
    - Impacts search efficiency
  - What about geometry?
    - Doesn’t support > 2D
    - Mongo uses multi-location documents, which really just index multiple points that link back to a single document
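
The "neighbors aren't so close" problem can be shown with a plain Z-order (Morton) code on a small grid. The sketch below is illustrative only: the 16x16 grid, the `bits=4` default, and the function name `interleave` are assumptions for the example, not taken from the slides.

```python
def interleave(x, y, bits=4):
    """Z-order (Morton) code: alternate x and y bits, most significant first."""
    code = 0
    for i in reversed(range(bits)):
        code = (code << 1) | ((x >> i) & 1)
        code = (code << 1) | ((y >> i) & 1)
    return code


# Cells (7, 0) and (8, 0) touch on a 16x16 grid, but they sit on either side
# of the curve's top-level split, so their 1D keys are far apart.
left = interleave(7, 0)
right = interleave(8, 0)
print(left, right, right - left)

# Cells (0, 0) and (1, 0) also touch, and here the keys stay close.
print(interleave(0, 0), interleave(1, 0))
```

A range scan over the 1D key therefore has to cover a wide span of unrelated cells to pick up both neighbors, which is the search-efficiency cost noted on the slide.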
- 10. MULTI-LOCATION CLIPPING: Mongo Multi-location Document Clipping Issues ($within search doesn’t always work with multi-location documents)
  - Case 1: Success!
  - Case 2: Success!
  - Case 3: Fail!
  - Case 4: Fail!
  - Figure legend: Multi-Location Document (aka Polygon); Search Polygon
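
The slide's figure is not available, so the following is only one plausible way such a failure can arise, under the assumption that a multi-location document is matched purely by its stored vertices: a polygon can overlap the search area without any vertex falling inside it. All names and coordinates here are hypothetical.

```python
def point_in_box(point, box):
    """True if the point lies inside the axis-aligned box (xmin, ymin, xmax, ymax)."""
    x, y = point
    xmin, ymin, xmax, ymax = box
    return xmin <= x <= xmax and ymin <= y <= ymax


def mbr(points):
    """Minimum bounding rectangle of a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a, b):
    """True if two axis-aligned boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# A large polygon that completely surrounds a small search box.
polygon = [(0, 0), (10, 0), (10, 10), (0, 10)]
search_box = (4, 4, 6, 6)

vertex_hit = any(point_in_box(v, search_box) for v in polygon)
mbr_hit = boxes_intersect(mbr(polygon), search_box)
print(vertex_hit, mbr_hit)
```

A vertex-only test misses the polygon while a bounding-rectangle test finds it, which is the kind of gap an MBR-based index is meant to close.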
- 11. SOLUTIONS TO GEOHASH PROBLEM: Potential Solutions
  - Constrain the system to single point searches
    - Multi-dimension support will be exponentially complex (won’t scale)
  - Interpolate points along the edge of the shape
    - Multi-dimension support will be exponentially complex (won’t scale)
  - Customize the spatial indexer
    - Selected approach
- 12. CUSTOM TUNED SPATIAL INDEXER: Thermopylae Custom Tuned MongoDB for Geo. TST leverages Guttman’s 1984 research in R/R* trees
  - R-Trees organize any-dimensional data by representing the data as a minimum bounding box.
  - Each node bounds its children. A node can have many objects in it (max: m, min: ceil(m/2))
  - Splits and merges optimized by minimizing overlaps
  - The leaves point to the actual objects (stored on disk probably)
  - Height balanced: tree height is O(log n), so search cost approaches O(log n) when MBR overlap is low
- 13. RTREE THEORY: Spatial Indexing at Scale with R-Trees
  - Spatial data represented as minimum bounding rectangles (2-dimension), cubes (3-dimension), hexadecant (4-dimension)
  - Index represented as <I, DiskLoc>, where:
    - I = (I_0, I_1, …, I_{n-1}), n = number of dimensions
    - Each I_i is a set in the form of [min, max] describing the MBR range along dimension i
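
The `<I, DiskLoc>` entry can be pictured as a small record: one [min, max] interval per dimension plus a pointer to the stored document. The sketch below is an assumption-laden illustration, not the MongoDB extension's data layout; `MBR`, `IndexEntry`, and the integer `disk_loc` are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MBR:
    """n-dimensional minimum bounding region: one (min, max) pair per dimension."""
    intervals: Tuple[Tuple[float, float], ...]

    @classmethod
    def of_points(cls, points: Sequence[Sequence[float]]) -> "MBR":
        dims = range(len(points[0]))
        return cls(tuple(
            (min(p[d] for p in points), max(p[d] for p in points)) for d in dims
        ))

    def intersects(self, other: "MBR") -> bool:
        # Two regions intersect only if their intervals overlap in every dimension.
        return all(
            lo <= o_hi and o_lo <= hi
            for (lo, hi), (o_lo, o_hi) in zip(self.intervals, other.intervals)
        )


@dataclass(frozen=True)
class IndexEntry:
    mbr: MBR
    disk_loc: int  # hypothetical stand-in for MongoDB's DiskLoc


# A 3D geometry reduced to its bounding cube.
box = MBR.of_points([(0, 0, 0), (2, 3, 1), (1, 5, -1)])
entry = IndexEntry(box, disk_loc=42)
print(entry.mbr.intervals)
```

The same record shape works for 2, 3, or 4 dimensions, since only the number of intervals changes.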
- 14. R*-TREE INDEX OBJECTIVES: R*-Tree Spatial Index Example
  - Sample insertion result for a 4th-order tree (figure labels: nodes m, n, o, p; entries a through l)
  - Objectives:
    1. Minimize area
    2. Minimize overlaps
    3. Minimize margins
    4. Maximize inner node utilization
- 15. R*-TREE INSERT EXAMPLE: Insert
  - Similar to insertion into a B+-tree, but may insert into any leaf; the leaf splits if its capacity is exceeded.
    - Which leaf to insert into?
    - How to split a node?
- 16. Insert: Leaf Selection (figure labels: candidate nodes m, n, o, p)
  - Follow a path from root to leaf.
  - At each node, move into the subtree whose MBR area increases least with the addition of the new rectangle.
- 17. Insert: Leaf Selection. Insert into m.
- 18. Insert: Leaf Selection. Insert into n.
- 19. Insert: Leaf Selection. Insert into o.
- 20. Insert: Leaf Selection. Insert into p.
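
The leaf-selection rule on slides 16 to 20 (descend into the subtree whose MBR grows least) can be sketched as follows. The rectangle coordinates for m, n, o, and p are made up for illustration, and breaking ties by smaller area follows Guttman's original R-tree description rather than anything stated on the slides.

```python
def area(r):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a, b):
    """Smallest rectangle that covers both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(node_mbr, new_rect):
    """How much the node's MBR area must grow to include new_rect."""
    return area(union(node_mbr, new_rect)) - area(node_mbr)


def choose_subtree(children, new_rect):
    """Pick the child needing the least enlargement; ties go to the smaller MBR."""
    return min(
        children,
        key=lambda name: (enlargement(children[name], new_rect), area(children[name])),
    )


# Hypothetical MBRs for the four nodes named on the slides.
children = {
    "m": (0, 0, 4, 4),
    "n": (5, 0, 9, 4),
    "o": (0, 5, 4, 9),
    "p": (5, 5, 9, 9),
}

print(choose_subtree(children, (6, 6, 7, 7)))      # already inside p
print(choose_subtree(children, (4.5, 1, 5.5, 2)))  # between m and n
```

In a full insert this choice is repeated at every level until a leaf is reached; a split follows only if that leaf overflows.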
- 21. Query (figure labels: nodes m, n, o, p; entries a through l)
  - Start at root
  - Find all overlapping MBRs
  - Search subtrees recursively
- 22. Query: search m.
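
The query procedure on slides 21 and 22 (start at the root, keep only overlapping MBRs, recurse) can be sketched as a short recursive function. The two-leaf tree and its coordinates below are hypothetical; only the node and entry names echo the slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def overlaps(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)           # internal nodes
    entries: List[Tuple[str, Rect]] = field(default_factory=list)  # leaf nodes


def search(node: Node, query: Rect) -> List[str]:
    """Return the names of all leaf entries whose MBR overlaps the query."""
    if not overlaps(node.mbr, query):
        return []  # prune the whole subtree
    hits = [name for name, rect in node.entries if overlaps(rect, query)]
    for child in node.children:
        hits.extend(search(child, query))
    return hits


m = Node((0, 0, 4, 4), entries=[("a", (0, 0, 1, 1)), ("b", (2, 2, 4, 4))])
n = Node((5, 0, 9, 4), entries=[("c", (5, 0, 6, 1)), ("d", (8, 3, 9, 4))])
root = Node((0, 0, 9, 4), children=[m, n])

print(search(root, (3, 3, 5.5, 3.5)))
```

In this example the query touches the MBRs of both m and n, so both leaves are visited even though only m holds a match. That extra visit is the cost of overlap that the R*-tree objectives try to minimize.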
- 23. R*-TREE MONGODB IMPLEMENTATION: R*-Tree leverages B-tree base data structures (buckets)
- 24. GEO-SHARDING: Geo-Sharding (in work), Scalable Distributed R* Tree (SD-r*Tree). “Balanced” binary tree, with nodes distributed on a set of servers:
  - Each internal node has exactly two children
  - Each leaf node stores a subset of the indexed dataset
  - At each node, the heights of the subtrees differ by at most one
  - mongos “routing” node maintains the binary tree
- 25. SD-r*Tree DATA STRUCTURE: SD-r*Tree Data Structure Illustration (figure not reproduced; panels show data nodes and their spatial coverage)
  - d_i = Data Node (Chunk)
  - r_i = Coverage Node
  - Leveraged work from Litwin, Mouza, Rigaux 2007
- 26. SD-r*TREE STRUCTURE DISTRIBUTION (figure not reproduced; labels: mongos, GeoShard 1, GeoShard 2, GeoShard 3, coverage nodes r1 and r2, data nodes d0, d1, d2)
- 27. GEO-SHARDING ALTERNATIVE: 3D / 4D Hilbert Scanning Order
- 28. BEYOND 4-DIMENSIONS: Next Steps: Beyond 4 Dimensions, X-Tree (Berchtold, Keim, Kriegel, 1996)
  - Figure legend: Normal Internal Nodes, Supernodes, Data Nodes
  - Avoid MBR overlaps
  - Avoid node splits (main cause for high overlap)
  - Introduce new node structure: Supernodes
    - Large directory nodes of variable size
- 29. X-TREE PERFORMANCE: X-Tree Performance Results (Berchtold, Keim, Kriegel, 1996)
- 30. CONCLUSION: T-Sciences Custom Tuned Spatial Indexer
  - Optimized spatial search
    - Finds intersecting MBRs and recurses into those nodes
  - Optimized spatial inserts
    - Uses the Hilbert value of the MBR centroid to guide search
    - 28% reduction in number of nodes touched
  - Optimized deletes
    - Leverages R* split/merge approach for rebalancing the tree when nodes become over/under-full
  - Low maintenance
    - Leverages MongoDB’s automatic data compaction and partitioning
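
The slides do not show how the Hilbert value of an MBR centroid is computed, so the following is only a sketch of the general idea: map the centroid to a grid cell, then convert that cell to its position along a Hilbert curve using the standard iterative conversion. The grid size, the world bounds, and the names `hilbert_index` and `centroid_key` are assumptions for the example.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def centroid_key(mbr, n=16, world=(-180.0, -90.0, 180.0, 90.0)):
    """Hilbert value of an MBR's centroid; mbr and world are (xmin, ymin, xmax, ymax)."""
    cx = (mbr[0] + mbr[2]) / 2
    cy = (mbr[1] + mbr[3]) / 2
    gx = min(n - 1, int((cx - world[0]) / (world[2] - world[0]) * n))
    gy = min(n - 1, int((cy - world[1]) / (world[3] - world[1]) * n))
    return hilbert_index(n, gx, gy)


# Order of the four cells of a 2x2 grid along the curve.
print([hilbert_index(2, x, y) for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]])
print(centroid_key((-77.1, 38.8, -76.9, 39.0)))
```

Unlike the Z curve, consecutive Hilbert values always belong to adjacent cells, so sorting by this key tends to group nearby geometries.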
- 31. EXAMPLE: Example Use Case, OSINT (Foursquare Data)
  - Sample Foursquare data set mashed with Government Intel Data (poly reports)
  - 100 million Geo Document test (3D points and polys)
  - 4 server replica set
  - ~350ms query response
  - ~300% improvement over PostGIS
- 32. FIND US: Community Support
  - Thermopylae contributes fixes to the codebase
    - http://github.com/mongodb
  - TST will work with 10gen to fold into the baseline
  - Active developer collaboration
    - IRC: #mongodb freenode.net
- 33. THANK YOU. Questions? Nicholas Knize, nknize@t-sciences.com
- 34. Backup
- 35. ENTERPRISE PARTNER / WHO ARE THESE GUYS?: Thermopylae Sciences & Technology, who are we?
  - Advanced technology w/ 160+ employees
  - Core customers in national security, venues and events, military and police, and city planning
  - Partnered with Google and imagery providers
  - Long-term relationship focused: TS/SCI staff
  - TST + 10gen + Google = Game-changing approach
- 36. GOVERNMENT CUSTOMERS: Key Customers - Government
  - US Dept of State Bureau of Diplomatic Security
    - Build and support 30 TB Google Earth Globe with multi-terabytes of individual globes sent to embassies throughout the world. Integrated Google Earth and iSpatial framework.
  - US Army Intelligence Security Command
    - Provide expertise in managing technology integration: prime contractor providing operations, intelligence, and IT support worldwide. Partners include IBM, Lockheed Martin, Google, MIT, Carnegie Mellon. Integrated Google Earth and iSpatial framework.
  - US Southern Command
    - Coordinate intelligence management systems spatial data collection, indexing, and distribution. Integrated Google Earth, iSpatial, and iHarvest.
    - Index large volume imagery and expose it for different services (Air Force, Navy, Army, Marines, Coast Guard)
- 37. COMMERCIAL CUSTOMERS: Key Customers - Commercial
  - Cleveland Cavaliers, USGIF, Las Vegas Motor Speedway, Baltimore Grand Prix
  - iSpatial framework serves thousands of mobile devices

If not, then I assume Thermopylae are selling a product?