High Dimensional Indexing using MongoDB (MongoSV 2012)


Published in: Education
  • Screen shot of UDOP…blow-out of key features (sharing, presentation builder, etc)

    2. 2. WHO ARE THESE GUYS?
        Thermopylae Sciences & Technology – Who are we?
        • Mixed Government (70%) and Commercial (30%) contracting company w/ ~150 employees
        • Core customers:
            – SOUTHCOM, Intel & Security Command, Army Intel Sector, DOI
            – LVMS, Select Energy Oil & Gas, OSU, Cleveland Cavaliers, and STL Rams
        • #1 Google Enterprise partner for Federal and partner w/ imagery providers (GeoEye / Digital Globe)
        • FOSS4G contributor and 10gen Enterprise partner
        ACCOMPLISHING THE IMPOSSIBLE
    3. 3. TST PRODUCTS
        “The 3D UDOP allows near real time visibility of all SOUTHCOM Directorates information in one location…this capability allows for unprecedented situational awareness and information sharing” - Gen. Doug Fraser
    4. 4. COMMERCIAL CUSTOMERS
        Commercial Examples
        • Cleveland Cavaliers
        • USGIF
        • Las Vegas Motor Speedway
        • Baltimore Grand Prix
        iSpatial framework serves millions of mobile devices
    5. 5. TST ARCHITECTURE
        1. iSpatial (UI/Visualization): provides web-based interface for Multi-INT visualization and collaboration
        2. Hadoop M/R (Processing / Analysis): Map/Reduce provides spatial statistic processing (spatial regression) and heuristics
        3. MongoDB (Spatial Data Management @ Scale): modified MongoDB provides storing and indexing of multi-dimension spatial data at scale
    6. 6. THAT’S A LOT OF DATA….
        What the….. HOW MUCH DATA?!?
        • “Swimming in sensors drowning in data”
            – What size data tsunami are we talking about?
        • “Fix and Finish are meaningless until FIND is accomplished”
            – A “Big Data” Spatial Search Problem
        Sensor Type | Resolution | Data Bandwidth | TB/Hr
        FMV | 640 x 480 (Std Def); 1920 x 1080 (HD) | HD: 16bit x 3 bands @ 30fps, ~1 Gbps | ~0.45 TB
        WAMI | Constant Hawk = 96 Mpx; Gorgon Stare = 460 Mpx; Argus = 1.8 Gpx | GS @ 16bit x 3 bands @ 2fps, ~15.3 Gbps; Argus @ 16bit x 3 bands @ 12fps, ~345.6 Gbps | ~6.89 TB (GS); ~155 TB (Argus)
        Satellite | NITF / JP2 resolutions: 32K x 32K; 432K x 216K | 32K x 32K @ 8bit x 3 bands @ 1 frame/5 mins, ~27 Gbps | ~12.15 TB
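The TB/Hr column on this slide follows directly from the bandwidth column. The sketch below shows that conversion, assuming the slide's rates are gigabits per second and that a terabyte is 10^12 bytes. The function name is illustrative and not from the deck.

```python
def gbps_to_tb_per_hour(gbps: float) -> float:
    """Convert a sustained rate in gigabits/s to decimal terabytes per hour."""
    bytes_per_second = gbps * 1e9 / 8      # 8 bits per byte
    return bytes_per_second * 3600 / 1e12  # 3600 s per hour, 1e12 bytes per TB


# Bandwidth figures quoted on the slide (sensor labels are shorthand).
rates_gbps = {
    "FMV (HD)": 1.0,
    "Gorgon Stare": 15.3,
    "Argus": 345.6,
    "Satellite": 27.0,
}

for sensor, gbps in rates_gbps.items():
    print(f"{sensor}: ~{gbps_to_tb_per_hour(gbps):.2f} TB/hr")
```

Under these assumptions the results land on the slide's rounded values (about 0.45, 6.89, 155 and 12.15 TB per hour).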
    7. 7. BIG DATA STORAGE CHARACTERISTICS
        Desired Data Store Characteristics for ‘Big Data’
        • Horizontally scalable – Large volume / elastic
        • Vertically scalable – Heterogeneous data types (“Data Stack”)
        • Smartly Distributed – Reduce the distance bits must travel
        • Fault Tolerant – Replication Strategy and Consistency model
        • High Availability – Node recovery
        • Fast – Reads or writes (can’t always have both)
    8. 8. NOSQL OPTIONS
        Subset of Evaluated NoSQL Options
        • Cassandra
            – Nice Bring Your Own Index (BYOI) design
            – … but Java, Java, Java… Memory management can be a maintenance issue
            – Adding new nodes can be a pain (Token Changes, nodetool)
            – Key-Value store…good for simple data models
        • HBase
            – Nice BigTable model
            – Key-Value store…good for simple data models
            – Lots of Java JNI (primarily based on std:hashmap of std:hashmap)
        • CouchDB
            – Provides some GeoSpatial functionality (Currently being rewritten)
            – HEAVILY dependent on Map-Reduce model (complicated design)
            – Erlang based – poor multi-threaded heap management
    9. 9. WHY TST <3’S MONGODB
        Why MongoDB for Thermopylae?
        • Documents based on JSON – A GEOJSON match made in heaven! (OGC)
        • C++ - No Garbage Collection Overhead! Efficient memory management design reduces disk swapping and paging
        • Disk storage is memory mapped, enabling fast swapping when necessary
        • Built in auto-failover with replica sets and fast recovery with journaling
        • Tunable Consistency – Consistency defined at application layer
        • Schema Flexible – friendly properties of SQL enable easy port
        • Provided initial spatial indexing support – Point based limited!
    10. 10. MONGODB SPATIAL INDEXER
        ... The Spatial Indexer wasn’t quite right
        • MongoDB (like nearly all relational DBs) uses a b-Tree
            – Data structure for storing sorted data in log time
            – Great for indexing numerical and text documents (1D attribute data)
            – Cannot store multi-dimension (>2D) data – NOT COMPLEX GEOMETRY FRIENDLY
    11. 11. DIMENSIONALITY REDUCTION
        How does MongoDB solve the dimensionality problem?
        • Space Filling (Z) Curve
            – A continuous line that intersects every point in a two-dimensional plane
        • Use Geohash to represent lat/lon values
            – Interleave the bits of a lat/lon pair
            – Base32 encode the result
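The two Geohash steps named on this slide (interleave the bits of a lat/lon pair, then Base32 encode) can be sketched as follows. This follows the publicly documented Geohash scheme, which starts with a longitude bit and uses its own 32-character alphabet. It is an illustration only and is not taken from MongoDB's source. The function name and the `precision` parameter are assumptions.

```python
# Geohash alphabet: digits plus lowercase letters without a, i, l, o.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    """Interleave longitude/latitude bisection bits and Base32 encode them."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, value = 0, 0
    use_lon = True  # Geohash emits a longitude bit first, then alternates
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1  # upper half: emit 1, raise lower bound
            rng[0] = mid
        else:
            value <<= 1               # lower half: emit 0, lower upper bound
            rng[1] = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:                 # every 5 bits become one Base32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, precision=3))
```

Because each extra character only refines the same cell, a shorter hash is always a prefix of a longer one for the same point, which is what lets a sorted b-Tree answer prefix range scans over the 1D key.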
    12. 12. GEOHASH BTREE ISSUES
        Issues with the Geohash b-Tree approach
        • Neighbors aren’t so close!
            – Neighboring points on the Geoid may end up on opposite ends of the plane
            – Impacts search efficiency
        • What about Geometry?
            – Doesn’t support > 2D
            – Mongo uses Multi-Location documents which really just indexes multiple points that link back to a single document
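The "neighbors aren’t so close" issue can be seen on a tiny grid. The sketch below bit-interleaves integer cell coordinates into a Z-order value on an assumed 8 x 8 grid. The helper name `z_value` is illustrative. Two cells that touch across the middle of the grid end up far apart in the 1D ordering, while two cells in the same quadrant stay close.

```python
def z_value(x: int, y: int, bits: int = 3) -> int:
    """Interleave the bits of x and y (x in the higher bit of each pair)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


# Cells (3, 0) and (4, 0) share an edge, but sit on opposite sides of the
# grid's vertical midline, so their 1D keys jump apart.
across_midline = abs(z_value(4, 0) - z_value(3, 0))

# Cells (0, 0) and (1, 0) share an edge inside the same quadrant.
same_quadrant = abs(z_value(1, 0) - z_value(0, 0))

print("gap across midline:", across_midline)
print("gap within quadrant:", same_quadrant)
```

A range scan over the sorted keys therefore has to visit several disjoint key ranges to cover one small spatial window, which is the search-efficiency cost the slide refers to.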
    13. 13. GEO-SHARDING ALTERNATIVE
        Sort Order and Multi-Dimension…a nightmare (3D / 4D Hilbert Scanning Order)
    14. 14. MULTI-LOCATION CLIPPING
        Mongo Multi-location Document Clipping Issues ($within search doesn’t always work w/ multi-location)
        (Figure: four cases, Case 1 to Case 4, of a Multi-Location Document (aka. Polygon) against a Search Polygon; two cases are labeled Success! and two are labeled Fail!)
    15. 15. SOLUTIONS TO GEOHASH PROBLEM
        Potential Solutions
        • Constrain the system to single point searches
            – Multi-dimension support will be exponentially complex (won’t scale)
        • Interpolate points along the edge of the shape
            – Multi-dimension support will be exponentially complex (won’t scale)
        • Customize the spatial indexer
            – Selected approach
    16. 16. CUSTOM TUNED SPATIAL INDEXER
        Thermopylae Custom Tuned MongoDB for Geo
        TST leverages Kriegel’s 1996 research in R* Trees
        • R-Trees organize any-dimensional data by representing the data as a minimum bounding box.
        • Each node bounds its children. A node can have many objects in it (max: m, min: ceil(m/2))
        • Splits and merges optimized by minimizing overlaps
        • The leaves point to the actual objects (probably stored on disk)
        • Height balanced – search is always O(log n)
    17. 17. RTREE THEORY
        Spatial Indexing at Scale with R-Trees
        Spatial data represented as minimum bounding rectangles (2-dimension), cubes (3-dimension), hexadecant (4-dimension)
        Index represented as: <I, DiskLoc> where:
            I = (I0, I1, … In) : n = number of dimensions
            Each I is a set in the form of [min,max] describing MBR range along a dimension
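The <I, DiskLoc> entry described on this slide can be modeled as a tuple of per-dimension [min, max] intervals plus a record pointer. The sketch below is a hypothetical in-memory model, not the actual on-disk layout of the modified MongoDB. The class names and the integer `disk_loc` stand-in are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

Interval = Tuple[float, float]  # [min, max] along one dimension


@dataclass(frozen=True)
class MBR:
    """Minimum bounding region: one [min, max] interval per dimension."""
    intervals: Tuple[Interval, ...]

    def intersects(self, other: "MBR") -> bool:
        # Two boxes overlap only if their intervals overlap in every dimension.
        return all(a_min <= b_max and b_min <= a_max
                   for (a_min, a_max), (b_min, b_max)
                   in zip(self.intervals, other.intervals))

    def union(self, other: "MBR") -> "MBR":
        # Smallest box that encloses both inputs.
        return MBR(tuple((min(a_min, b_min), max(a_max, b_max))
                         for (a_min, a_max), (b_min, b_max)
                         in zip(self.intervals, other.intervals)))

    def volume(self) -> float:
        result = 1.0
        for lo, hi in self.intervals:
            result *= hi - lo
        return result


@dataclass(frozen=True)
class IndexEntry:
    mbr: MBR
    disk_loc: int  # hypothetical stand-in for MongoDB's DiskLoc


a = MBR(((0, 2), (0, 2), (0, 2)))  # a 3-dimension cube
b = MBR(((1, 3), (1, 3), (1, 3)))
entry = IndexEntry(mbr=a.union(b), disk_loc=42)
print(a.intersects(b), entry.mbr.volume())
```

Because every operation loops over the interval tuple, the same code covers rectangles, cubes and higher-dimension boxes without changes.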
    18. 18. R*-TREE INDEX OBJECTIVES
        R*-Tree Spatial Index Example
        • Sample insertion result for 4th order tree
        • Objectives:
            1. Minimize area
            2. Minimize overlaps
            3. Minimize margins
            4. Maximize inner node utilization
        (Figure: example tree with nodes labeled a through p)
    19. 19. R*-TREE INSERT EXAMPLE
        Insert
        • Similar to insertion into B+-tree but may insert into any leaf; leaf splits in case capacity exceeded.
            – Which leaf to insert into?
            – How to split a node?
    20. 20. Insert: Leaf Selection
        • Follow a path from root to leaf.
        • At each node move into subtree whose MBR area increases least with addition of new rectangle.
        (Figure: subtrees m, n, o, p)
    21. 21. Insert: Leaf Selection
        • Insert into m.
    22. 22. Insert: Leaf Selection
        • Insert into n.
    23. 23. Insert: Leaf Selection
        • Insert into o.
    24. 24. Insert: Leaf Selection
        • Insert into p.
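The leaf selection rule from these slides (descend into the subtree whose MBR area grows least when the new rectangle is added) can be sketched in 2D as below. The rectangles for m, n, o and p are made-up coordinates, since the slides show them only as a figure. Breaking ties by smaller area is a common convention and an assumption here. The full R*-Tree algorithm also considers overlap at the leaf level, which this sketch leaves out.

```python
from typing import Dict, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def enlarge(r: Rect, s: Rect) -> Rect:
    """Smallest rectangle that covers both r and s."""
    return (min(r[0], s[0]), min(r[1], s[1]), max(r[2], s[2]), max(r[3], s[3]))


def choose_subtree(children: Dict[str, Rect], new_rect: Rect) -> str:
    """Pick the child whose MBR needs the least area enlargement."""
    def cost(name: str) -> Tuple[float, float]:
        mbr = children[name]
        growth = area(enlarge(mbr, new_rect)) - area(mbr)
        return (growth, area(mbr))  # tie-break on smaller current area
    return min(children, key=cost)


# Hypothetical MBRs for the four subtrees named on the slides.
children = {
    "m": (0.0, 0.0, 4.0, 4.0),
    "n": (5.0, 0.0, 9.0, 4.0),
    "o": (0.0, 5.0, 4.0, 9.0),
    "p": (5.0, 5.0, 9.0, 9.0),
}

print(choose_subtree(children, (3.0, 3.0, 5.0, 4.0)))
```

In a real tree this choice is repeated at each level from the root down until a leaf is reached.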
    25. 25. Query
        • Start at root
        • Find all overlapping MBRs
        • Search subtrees recursively
        (Figure: query rectangle x over the example tree with nodes labeled a through p)
    26. 26. Query
        • Search m.
        (Figure: query rectangle x over the example tree; search descends into m)
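The three query steps (start at root, find all overlapping MBRs, search subtrees recursively) map onto a short recursive function. The sketch below uses a hypothetical two-level tree with made-up coordinates and object names. It shows the traversal only, not the bucket-based implementation described later in the deck.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)            # inner nodes
    entries: List[Tuple[str, Rect]] = field(default_factory=list)   # leaf objects


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node: Node, query: Rect) -> List[str]:
    """Return names of all stored objects whose rectangle overlaps the query."""
    if not intersects(node.mbr, query):
        return []  # prune: nothing under this node can match
    hits = [name for name, rect in node.entries if intersects(rect, query)]
    for child in node.children:
        hits.extend(search(child, query))
    return hits


leaf_a = Node((0, 0, 4, 4), entries=[("obj1", (0, 0, 1, 1)), ("obj2", (3, 3, 4, 4))])
leaf_b = Node((5, 5, 9, 9), entries=[("obj3", (5, 5, 6, 6)), ("obj4", (8, 8, 9, 9))])
root = Node((0, 0, 9, 9), children=[leaf_a, leaf_b])

print(search(root, (3.5, 3.5, 5.5, 5.5)))
```

When sibling MBRs overlap heavily, the pruning test rarely fails and the search has to visit many subtrees, which is why the deck stresses minimizing overlap.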
    27. 27. R*-TREE MONGODB IMPLEMENTATION
        R*-Tree Leverages B-Tree Base Data Structures (buckets)
    28. 28. SPATIAL INDEX ARCH & ORG
        Spatial Index Architecture, Organization, & Performance
        (Figure labels: MBR, Key Node(s), Bucket Header, MBR Header, …)
        1B Polygon Read Performance (worst case O(n))
        Dimensions | Num Buckets | Tree Height | Read Time
        3 | 3,448,276 | 3 | 190 ms
        5 | 5,076,143 | 3 | 275 ms
        100 | 90,909,091 | 8 | ~4.9 sec
    29. 29. GEO-SHARDING
        Geo-Sharding – (in work)
        Scalable Distributed R* Tree (SD-r*Tree)
        “Balanced” binary tree, with nodes distributed on a set of servers:
        • Each internal node has exactly two children
        • Each leaf node stores a subset of the indexed dataset
        • At each node, the heights of the subtrees differ by at most one
        • mongos “routing” node maintains binary tree
    30. 30. SD-r*Tree DATA STRUCTURE
        SD-r*Tree Data Structure Illustration
        • di = Data Node (Chunk)
        • ri = Coverage Node
        Leveraged work from Litwin, Mouza, Rigaux 2007
        (Figure: data nodes d0, d1, d2 and coverage nodes r1, r2, with the spatial coverage of each data node over objects a through e)
    31. 31. SD-r*TREE STRUCTURE DISTRIBUTION
        SD-r*Tree Structure Distribution
        (Figure: data nodes d0, d1, d2 and coverage nodes r1, r2 distributed across GeoShard 1, GeoShard 2 and GeoShard 3, with mongos routing)
    32. 32. BEYOND 4-DIMENSIONS
        Beyond 4-Dimensions - X-Tree (Berchtold, Keim, Kriegel – 1996)
        • Avoid MBR overlaps – more overlaps approaches worst case O(n) read
        • Avoid node splits (main cause for high overlap)
        • Introduce new node structure: Supernodes – Large Directory nodes of variable size
        (Figure: Normal Internal Nodes, Supernodes, Data Nodes)
    33. 33. X-TREE PERFORMANCE
        X-Tree Performance Results (Berchtold, Keim, Kriegel – 1996)
    34. 34. CONCLUSION
        T-Sciences Custom Tuned Spatial Indexer
        • Optimized Spatial Search – Finds intersecting MBR and recurses into those nodes
        • Optimized Spatial Inserts – Uses the Hilbert Value of MBR centroid to guide search
            – 28% reduction in number of nodes touched
        • Optimized Deletes – Leverages R* split/merge approach for rebalancing tree when nodes become over/under-full
        • Low maintenance – Leverages MongoDB’s automatic data compaction and partitioning
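The "Hilbert Value of MBR centroid" mentioned on this slide can be illustrated in 2D with the widely published iterative mapping from a grid cell to its distance along a Hilbert curve. The deck does not show the indexer's own code, so the grid size, the integer centroid rounding and both function names below are assumptions, and the real indexer works in more than two dimensions.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def hilbert_index(n: int, x: int, y: int) -> int:
    """Distance along the Hilbert curve of cell (x, y) on an n x n grid.

    n must be a power of two; x and y must lie in [0, n - 1].
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def centroid_key(mbr: Rect, n: int) -> int:
    """Hilbert value of the grid cell containing the MBR's centroid."""
    xmin, ymin, xmax, ymax = mbr
    cx = int((xmin + xmax) / 2)
    cy = int((ymin + ymax) / 2)
    return hilbert_index(n, cx, cy)


# Hypothetical MBRs on a 4 x 4 grid, ordered by the Hilbert value of their centroids.
boxes = [(2, 2, 3, 3), (0, 0, 1, 1), (3, 0, 3, 1), (0, 3, 1, 3)]
print(sorted(boxes, key=lambda b: centroid_key(b, 4)))
```

Unlike the Z-order key, consecutive Hilbert values always belong to adjacent cells, so boxes with nearby keys tend to be spatially close, which is a plausible reason for using the key to guide where an insert descends.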
    35. 35. Example: Mosaicked Video with KLV Footprints
        • Rip through KLV Metadata
        • Index frame footprints and annotations as MBR into X(R*)-Tree
        • Leverage Geo-Sharding for spatially relevant scale
    36. 36. EXAMPLE
        Example Use Case – OSINT (Foursquare Data)
        • Sample Foursquare data set mashed with Government Intel Data (poly reports)
        • 100 million Geo Document test (3D points and polys)
        • 4 server replica set
        • ~350ms query response
        • ~300% improvement over PostGIS
    37. 37. FIND US
        Community Support
        • Thermopylae plans to open source
            – http://github.com/thermopylae
        • TST working with 10gen to offer as a spatial extension
        • Active developer collaboration
            – IRC: #mongodb freenode.net
    38. 38. THANK YOU
        Questions?
        Nicholas Knize
        nknize@t-sciences.com
    39. 39. Backup
    40. 40. GOVERNMENT CUSTOMERS
        Key Customers - Government
        • US Dept of State Bureau of Diplomatic Security
            – Build and support 30 TB Google Earth Globe with multi-terabytes of individual globes sent to embassies throughout the world. Integrated Google Earth and iSpatial framework.
        • US Army Intelligence Security Command
            – Provide expertise in managing technology integration – prime contractor providing operations, intelligence, and IT support worldwide. Partners include IBM, Lockheed Martin, Google, MIT, Carnegie Mellon. Integrated Google Earth and iSpatial framework.
        • US Southern Command
            – Coordinate Intelligence management systems spatial data collection, indexing, and distribution. Integrated Google Earth, iSpatial, and iHarvest.
            – Index large volume imagery and expose it for different services (Air Force, Navy, Army, Marines, Coast Guard)
    41. 41. COMMERCIAL CUSTOMERS
        Key Customers - Commercial
        • Cleveland Cavaliers
        • USGIF
        • Las Vegas Motor Speedway
        • Baltimore Grand Prix
        iSpatial framework serves millions of mobile devices
    42. 42. ISPATIAL OVERVIEW
        • Expose and manage Multi-INT enterprise data in a geo-temporal user defined environment
        • Provide a flexible and scalable spatial data infrastructure (SDI) for Multi-INT data access and analysis
        • Spatially referenced data visualization on 3D globe & 2D maps
        • Access real/near real-time data feeds from forward deployed devices
        • Enable real-time information sharing and mission collaboration