Thermopylae Sciences & Technology has developed a custom spatial indexing solution for MongoDB to allow it to index and query multi-dimensional spatial data at scale. They implemented an R-tree spatial index that stores spatial objects as minimum bounding rectangles. This allows MongoDB to efficiently store and query geometries in more than two dimensions. Their solution also includes a geo-sharding approach to distribute the R-tree across multiple servers for additional scalability. Thermopylae has seen over 300% performance improvements versus PostGIS for spatial queries on large datasets with this customized indexing solution.
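The core pruning test behind such an R-tree is simple: two minimum bounding rectangles intersect only if they overlap on every axis, which works unchanged in any number of dimensions. A minimal sketch in Python (illustrative only, not Thermopylae's actual implementation):

```python
# Minimal sketch of the minimum-bounding-rectangle (MBR) overlap test
# an R-tree uses to prune candidates; it works in any dimension.

def mbr_intersects(a, b):
    """a, b: (mins, maxs) pairs of equal-length coordinate lists."""
    a_min, a_max = a
    b_min, b_max = b
    # Two boxes overlap iff their intervals overlap on every axis.
    return all(lo1 <= hi2 and lo2 <= hi1
               for lo1, hi1, lo2, hi2 in zip(a_min, a_max, b_min, b_max))

# 3-D example: two boxes that overlap on x and y but not on z.
box_a = ([0, 0, 0], [2, 2, 2])
box_b = ([1, 1, 3], [4, 4, 5])
print(mbr_intersects(box_a, box_b))  # False: disjoint on the z axis
```

An R-tree applies this test top-down: if a node's MBR fails it, the whole subtree is skipped, which is what makes multi-dimensional queries cheap.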
Datastax Breakfast (Petit Déjeuner) 14-04-15, Courbo Spark: a Machine Learning example on... (OCTO Technology)
In recent years we have seen a major evolution of the ecosystem of data-management solutions. Usage has also evolved, on both the analytical and the transactional side: the next-day (J+1) batch is no longer inevitable!
What are the lessons learned, and what are the prospects for traditional information systems, now that event-driven technologies are increasingly accessible and widely adopted?
Courbo-Spark: a Machine Learning example on time series
Decision trees are well-known classification and regression models in the Machine Learning world. In EDF's industrial context it is often necessary to apply this kind of algorithm to time series. We will present how EDF and OCTO adapted the decision-tree implementation in Spark to process large volumes of load curves.
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use workload management.
This deck leans towards Hadoop/Hive installation experience and ecosystem concepts. Its content is derived from a book in preparation, Fundamentals of Big Data.
In this paper we propose Regularised Cross-Modal Hashing (RCMH), a new cross-modal hashing model that projects annotation and visual feature descriptors into a common Hamming space. RCMH optimises the hashcode similarity of related data-points in the annotation modality using an iterative three-step hashing algorithm: in the first step, each training image is assigned a K-bit hashcode based on hyperplanes learnt at the previous iteration; in the second step, the binary bits are smoothed by a formulation of graph regularisation so that similar data-points have similar bits; in the third step, a set of binary classifiers is trained to predict the regularised bits with maximum margin. Visual descriptors are projected into the annotation Hamming space by a set of binary classifiers learnt using the bits of the corresponding annotations as labels. RCMH is shown to consistently improve retrieval effectiveness over state-of-the-art baselines.
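Step one of the algorithm described above, assigning a K-bit hashcode from K hyperplanes, amounts to thresholding K signed projections. A toy sketch (the hyperplanes here are random; in RCMH they are the max-margin classifiers learnt at the previous iteration):

```python
import random

# Illustrative sketch, not the authors' code: assign a K-bit hashcode
# to a feature vector from K hyperplanes, as in step one of RCMH.

def hashcode(x, hyperplanes):
    """Each hyperplane is (w, b); bit k is 1 iff w.x + b >= 0."""
    return tuple(1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0
                 for w, b in hyperplanes)

def hamming(h1, h2):
    """Hamming distance: number of differing bits between two codes."""
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))

random.seed(0)
K, D = 8, 4  # 8-bit codes over 4-dimensional descriptors
planes = [([random.gauss(0, 1) for _ in range(D)], 0.0) for _ in range(K)]
a = hashcode([1.0, 0.2, -0.5, 0.3], planes)
b = hashcode([1.1, 0.1, -0.4, 0.2], planes)  # a near-duplicate point
print(len(a), hamming(a, b))  # nearby points should get similar codes
```

Graph regularisation (step two) then nudges these bits so that neighbours in the annotation graph agree on more of them.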
Expressing and Exploiting Multi-Dimensional Locality in DASH (Menlo Systems GmbH)
DASH is a realization of the PGAS (partitioned global address space) programming model in the form of a C++ template library. It provides a multidimensional array abstraction which is typically used as an underlying container for stencil- and dense matrix operations.
Efficiency of operations on a distributed multi-dimensional array depends strongly on the distribution of its elements to processes and on the communication strategy used to propagate values between them. Locality can only be improved by employing an optimal distribution that is specific to the implementation of the algorithm, run-time parameters such as node topology, and numerous additional aspects. Application developers are typically unaware of these implications, which may also change in future releases of DASH.
In the following, we identify fundamental properties of distribution patterns that are prevalent in existing HPC applications.
We describe a classification scheme of multi-dimensional distributions based on these properties and demonstrate how distribution patterns can be optimized for locality and communication avoidance automatically and, to a great extent, at compile time.
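To illustrate the kind of distribution pattern being classified, here is a toy sketch, in Python rather than DASH's C++, of which process owns a given array element under two common patterns; the function names are illustrative, not the DASH API:

```python
from math import ceil

# Toy sketch of the locality question above: which process owns element
# (i, j) of an N x M array under two common distribution patterns.

def owner_blocked_rows(i, j, N, M, P):
    """BLOCKED over rows: each process gets one contiguous row block."""
    rows_per_proc = ceil(N / P)
    return i // rows_per_proc

def owner_block_cyclic_rows(i, j, N, M, P, blocksize):
    """BLOCKCYCLIC over rows: row blocks dealt out round-robin."""
    return (i // blocksize) % P

N, M, P = 8, 8, 4
# BLOCKED: rows 0-1 -> proc 0, rows 2-3 -> proc 1, ...
print([owner_blocked_rows(i, 0, N, M, P) for i in range(N)])
# [0, 0, 1, 1, 2, 2, 3, 3]
# BLOCKCYCLIC(1): rows cycle over processes 0,1,2,3,0,1,2,3
print([owner_block_cyclic_rows(i, 0, N, M, P, 1) for i in range(N)])
# [0, 1, 2, 3, 0, 1, 2, 3]
```

A stencil computation favours the blocked layout (neighbours are mostly local), while a load-imbalanced sweep may favour the cyclic one; this trade-off is exactly what an automatic pattern classification can decide.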
Scaling Storage and Computation with Hadoop (yaevents)
Hadoop provides distributed storage and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. Hadoop partitions data and computation across thousands of hosts and executes application computations in parallel, close to their data. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers. Hadoop is an Apache Software Foundation project; it unites hundreds of developers, and hundreds of organizations worldwide report using Hadoop. This presentation will give an overview of the Hadoop family of projects with a focus on its distributed storage solutions.
Distributed Computing with Apache Hadoop. Introduction to MapReduce. (Konstantin V. Shvachko)
Abstract: The presentation describes
- What is the BigData problem
- How Hadoop helps to solve BigData problems
- The main principles of the Hadoop architecture as a distributed computational platform
- History and definition of the MapReduce computational model
- Practical examples of how to write MapReduce programs and run them on Hadoop clusters
The talk is targeted to a wide audience of engineers who do not have experience using Hadoop.
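The MapReduce model the talk introduces can be sketched in a few lines of plain Python (map, then shuffle/sort by key, then reduce), without Hadoop's actual API:

```python
from itertools import groupby

# The canonical MapReduce example: word count, in plain Python to show
# the model (map -> shuffle/sort -> reduce), not Hadoop's Java API.

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle/sort by key, then sum each key's counts (the reducer)."""
    shuffled = sorted(pairs)
    for word, group in groupby(shuffled, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

docs = ["big data big ideas", "data pipelines"]
print(dict(reduce_phase(map_phase(docs))))
# {'big': 2, 'data': 2, 'ideas': 1, 'pipelines': 1}
```

On a real cluster the same map and reduce functions run on many hosts, with the framework handling the shuffle, fault tolerance and data locality.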
GeoServer on Steroids at FOSS4G Europe 2014 (GeoSolutions)
Setting up a GeoServer can sometimes be deceptively simple. However, going from proof of concept to production requires a number of steps to be taken in order to optimize the server in terms of availability, performance and scalability.
The presentation will show how to get from a basic setup to a battle-ready, rock-solid installation by walking through the techniques that advanced users have already mastered.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the second half of the tutorial.
3D Repo (http://3drepo.org), winner of the MongoDB Innovation Award, is a non-linear version control system that enables coordinated management of large-scale 3D models over the Internet. It is currently the only cloud-based architecture able to support maintenance and transmission of 3D models and associated metadata, as well as rendering at the scale required by the industry. With MongoDB we can deliver significant improvements in the engineering workflow that supports collaborative design, improvements not possible otherwise. Instead of architects, engineers and constructors sharing massive files in a costly and time-consuming manner, they can simply point their web browser to a shared online 3D repository. With our system, all stakeholders are able to examine their projects virtually, even on mobile devices. During the presentation, we will demonstrate the management of massive 3D models in a repository built directly atop MongoDB. We will also demonstrate our online web-browser viewer, capable of rendering 3D models directly from the DB without the need to install any plug-ins or firewall exceptions.
This presentation covers:
Based on the service model:
• SaaS (Software as a Service)
• PaaS (Platform as a Service)
• IaaS (Infrastructure as a Service)
Based on the deployment or access model:
• Public Cloud
• Private Cloud
• Hybrid Cloud
For more details you can visit:
http://vibranttechnologies.co.in/salesforce-classes-in-mumbai.html
Robotics classes in Mumbai
Best Robotics classes in Mumbai with job assistance.
Our features are:
expert guidance by IT industry professionals
lowest fees of 5000
practical exposure to handling projects
well-equipped lab
resume-writing guidance after the course
How Fannie Mae Leverages Data Quality to Improve the Business (DLT Solutions)
James Barrett, Data Quality Service Manager in Enterprise Data, Operations & Technology at Fannie Mae, shares how Fannie Mae leverages data quality to improve the business at the 2015 Informatica Government Summit.
Vibrant Technologies is headquartered in Mumbai, India. We are the best Robotics training provider in Navi Mumbai, providing Live Projects to students. We provide corporate training as well. We are the best Robotics classes in Mumbai according to our students and corporate clients.
Contact us at: http://vibranttechnologies.co.in/
Python classes in Mumbai
Best Python classes in Mumbai with job assistance.
Our features are:
expert guidance by IT industry professionals
lowest fees of 5000
practical exposure to handling projects
well-equipped lab
resume-writing guidance after the course
How to Accelerate Backup Performance with Dell DR Series Backup Appliances (DLT Solutions)
Join us for a live demonstration of Dell DR series disk backup and disaster recovery appliances. Learn how to transform your backup operations without the pain of replacing your existing backup software. DR series appliances can be deployed in any environment quickly and easily – fast and affordable data reduction with the future built in. Discover how to easily accelerate backup performance, reduce backup storage footprint, and simplify disaster recovery.
Cloud Ready Data: Speeding Your Journey to the Cloud (DLT Solutions)
Ronen Schwartz, Vice President and General Manager Informatica Cloud at Informatica, shares how to speed your journey to the cloud from the 2015 Informatica Government Summit.
GET READY FOR INTEL'S KNIGHTS LANDING
As the leading provider of code modernization and optimization training, Colfax now offers a 1-hour webinar: “Introduction to Next-Generation Intel® Xeon Phi™ Processor: Developer’s Guide to Knights Landing”.
ANOTHER LEAP IN PARALLEL PERFORMANCE
Next-generation Intel Xeon Phi processors codenamed Knights Landing (KNL) are expected to provide up to 3X higher performance than the current generation. With on-board high-bandwidth memory and an optional integrated high-speed fabric, plus the availability of a socket form factor, these powerful components will transform the fundamental building block of technical computing.
The transformation of the scalable manycore coprocessor into a standalone processor is going to be a remarkable step in the parallel computing field, and we are offering help to developers worldwide on getting the best out of the new processor. The webinar will help get you up to speed with:
- Knights Landing architecture
- New KNL features
- Code transition and modernization strategy
MULTIPLE RUNS AND FLEXIBLE TIMINGS FOR ALL GEOS
The webinar will air multiple times in different time slots, making it convenient for developers worldwide to attend.
REGISTER TODAY
at http://colfaxresearch.com/knl-webinar/
Experts from immixGroup’s Market Intelligence organization identify and explain targeted sales opportunities for COTS manufacturers and solution providers and discuss how to navigate the complex waters of DOD. Topics will include agency IT budgets, organizational landscapes, major acquisition drivers, and FY15 programs. Click here to view the full presentation: http://immixgroup.com/Resources/Webcasts/Market-Intelligence-FY15-Defense-Budget/
Linux administration classes in Mumbai
Best Linux administration classes in Mumbai with job assistance.
Our features are:
expert guidance by IT industry professionals
lowest fees of 5000
practical exposure to handling projects
well-equipped lab
resume-writing guidance after the course
Scalable Machine Learning: The Role of Stratified Data Sharding (inside-BigData.com)
In this deck from the 2019 Stanford HPC Conference, Srinivasan Parthasarathy from Ohio State University presents: Scalable Machine Learning: The Role of Stratified Data Sharding.
"With the increasing popularity of structured data stores, social networks and Web 2.0 and 3.0 applications, complex data formats, such as trees and graphs, are becoming ubiquitous. Managing and learning from such large and complex data stores, on modern computational eco-systems, to realize actionable information efficiently, is daunting. In this talk I will begin by discussing some of these challenges. Subsequently I will discuss a critical element at the heart of this challenge: the sharding, placement, storage and access of such tera- and peta-scale data. In this work we develop a novel distributed framework to ease the burden on the programmer and propose an agile and intelligent placement service layer as a flexible yet unified means to address this challenge. Central to our framework is the notion of stratification, which seeks to initially group structurally (or semantically) similar entities into strata. Subsequently, strata are partitioned within this eco-system according to the needs of the application to maximize locality, balance load, minimize data skew or even take into account energy consumption. Results on several real-world applications validate the efficacy and efficiency of our approach. (Note: joint work with Y. Wang (Airbnb) and A. Chakrabarti (MSR).)"
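The stratification idea in the abstract can be illustrated with a toy sketch (not the authors' implementation): group similar items into strata first, then deal each stratum across shards so every shard receives a balanced mix:

```python
# Toy sketch of stratified sharding: stratify by a similarity key, then
# partition each stratum round-robin so shards stay balanced.

def stratify(items, key):
    """Group items into strata by a structural/semantic key."""
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    return strata

def shard(strata, n_shards):
    """Deal each stratum across the shards round-robin."""
    shards = [[] for _ in range(n_shards)]
    for stratum in strata.values():
        for k, item in enumerate(stratum):
            shards[k % n_shards].append(item)
    return shards

items = [("graph", i) for i in range(4)] + [("tree", i) for i in range(4)]
shards = shard(stratify(items, key=lambda it: it[0]), n_shards=2)
print(shards)  # each shard holds two graph items and two tree items
```

Real placement services weigh locality, load and skew rather than a simple round-robin, but the two-phase structure (stratify, then partition) is the same.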
Srinivasan Parthasarathy, Professor of Computer Science & Engineering, The Ohio State University
Srinivasan Parthasarathy is a Professor of Computer Science and Engineering and the director of the data mining research laboratory at Ohio State. His research interests span databases, data mining and high performance computing. He is among a handful of researchers nationwide to have won both the Department of Energy and National Science Foundation Career awards. He and his students have won multiple best paper awards or "best of" nominations from leading forums in the field including: SIAM Data Mining, ACM SIGKDD, VLDB, ISMB, WWW, ICDM, and ACM Bioinformatics. He chairs the SIAM data mining conference steering committee and serves on the action board of ACM TKDD and ACM DMKD --leading journals in the field. Since 2012 he also helped lead the creation of OSU's first-of-a-kind nationwide (USA) undergraduate major in data analytics and serves as one of its founding directors.
Watch the video: https://youtu.be/hOJI8e0p-UI
Learn more: http://web.cse.ohio-state.edu/~parthasarathy.2/
and
http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
PostGIS is a spatial extension for PostgreSQL
PostGIS aims to be an “OpenGIS Simple Features for SQL” compliant spatial database
I am the principal developer
Accelerating Data Science with Better Data Engineering on Databricks (Databricks)
Whether you’re processing IoT data from millions of sensors or building a recommendation engine to provide a more engaging customer experience, the ability to derive actionable insights from massive volumes of diverse data is critical to success. MediaMath, a leading adtech company, relies on Apache Spark to process billions of data points ranging from ads, user cookies, impressions, clicks, and more — translating to several terabytes of data per day. To support the needs of the data science teams, data engineering must build data pipelines for both ETL and feature engineering that are scalable, performant, and reliable.
Join this webinar to learn how MediaMath leverages Databricks to simplify mission-critical data engineering tasks that surface data directly to clients and drive actionable business outcomes. This webinar will cover:
- Transforming TBs of data with RDDs and PySpark responsibly
- Using the JDBC connector to write results to production databases seamlessly
- Comparisons with a similar approach using Hive
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C... (Reynold Xin)
(Berkeley CS186 guest lecture)
Big Data Analytics Systems: What Goes Around Comes Around
Introduction to MapReduce, GFS, HDFS, Spark, and differences between "Big Data" and database systems.
Advanced Non-Relational Schemas For Big Data (Victor Smirnov)
This is the presentation from barcamp in Altoros where I was explaining how various advanced non-relational schemas (or, simply, data structures) can be modelled on top of Key/Value storage. The set of covered schemas includes Dynamic Vector, File System, Searchable Bitmap, LOUDS Tree, Wavelet Tree and Inverted Index.
See https://bitbucket.org/vsmirnov/memoria/wiki/MemoriaForBigData
for additional details.
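One of the schemas listed, the inverted index, can be modelled on a key/value store in a few lines; here a Python dict stands in for the store, and the key naming is illustrative, not Memoria's API:

```python
# Sketch of an inverted index modelled on top of a plain key/value
# store (a dict stands in for the store): one key per term, whose
# value is the sorted postings list of documents containing it.

kv = {}  # the key/value store

def index_document(doc_id, text):
    for term in set(text.lower().split()):
        key = f"idx:{term}"
        postings = kv.get(key, [])
        postings.append(doc_id)
        kv[key] = sorted(postings)  # keep postings lists ordered

def search(term):
    return kv.get(f"idx:{term.lower()}", [])

index_document(1, "big data on key value stores")
index_document(2, "key value schemas for big data")
print(search("data"))     # [1, 2]
print(search("schemas"))  # [2]
```

The same pattern (composite keys plus ordered values) underlies the other structures mentioned, such as searchable bitmaps and LOUDS trees.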
Databases Basics and Spacial Matrix - Discussing Geographic Potentials of Data... (Jerin John)
A core introduction to data types, the databases around us and their use cases, and integrating geospatial arrangements into our projects. This includes speaker notes and is intended to give an overall idea of things rather than focusing on one specific topic. It includes references to relational databases like PostgreSQL and MySQL, key/value databases like Redis, document DBs like MongoDB, and search engines like ElasticSearch; the second part deals with geospatial arrangements in PostGIS.
In the session from Game Developers Conference 2011, we'll take a complete look at the terrain system in Frostbite 2 as it was applied in Battlefield 3. The session is partitioned into three parts. We begin with the scalability aspects and discuss how consistent use of hierarchies allowed us to combine high resolutions with high view distances. We then turn towards workflow aspects and describe how we achieved full in-game realtime editing. A fair amount of time is spent describing how issues were addressed.
Finally, we look at the runtime side. We describe usage of CPU, GPU and memory resources and how it was kept to a minimum. We discuss how the GPU is offloaded by caching intermediate results in a procedural virtual texture and how prioritization was done to allow for work throttling without sacrificing quality. We also go into depth about the flexible streaming system that works with both FPS and driving games.
Find out how NoSQL can help your application with practical examples and use-cases from our Cloud Data Services Developer Advocate Glynn Bird. This webinar won't dwell on the science behind the database, but will walk you through real-life use-cases for NoSQL technologies that you can start using today.
Webinar: https://youtu.be/M_Jqw
Miguel Angel Fajardo - NewSQL: the magic wand of data - Codemotion Rome 2019 (Codemotion)
New winds are blowing in the world of Data. They say there are magic systems that are capable of the impossible. Systems that guarantee the scalable performance of the NoSQLs while still maintaining the ACID transactions of relational databases. What is this kind of magic? How did it come to be? How can it be used? And more importantly, how does it work? Welcome to the world of NewSQL. Welcome to the future.
A quick tour in 16 slides of Amazon's Redshift clustered, massively parallel database.
Find out what differentiates it from the other database products Amazon has, including SimpleDB, DynamoDB and RDS (MySQL, SQL Server and Oracle).
Learn how it stores data on disk in a columnar format and how this relates to performance and interesting compression techniques.
Contrast the difference between Redshift and a MySQL instance and discover how the clustered architecture may help to dramatically reduce query time.
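The columnar storage point can be made concrete with a toy sketch: storing a column contiguously makes simple encodings such as run-length encoding effective (RLE is one of several compression encodings Redshift offers):

```python
# Toy sketch of why columnar layout compresses well: a column stored
# contiguously has long runs of equal values, which run-length
# encoding collapses into (value, run_length) pairs.

def rle_encode(column):
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1   # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return runs

# A sorted "status" column compresses far better than the same values
# interleaved row-by-row with other fields would.
status = ["ok"] * 5 + ["error"] * 2 + ["ok"] * 3
print(rle_encode(status))  # [['ok', 5], ['error', 2], ['ok', 3]]
```

This is also why sort keys matter in a columnar warehouse: sorting lengthens the runs, and scans can skip whole compressed blocks.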
Similar to High Dimensional Indexing using MongoDB (MongoSV 2012)
2024.06.01 Introducing a competency framework for language learning materials ... (Sandy Millin)
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
Embracing GenAI - A Strategic Imperative (Peter Windle)
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
A Strategic Approach: GenAI in Education (Peter Windle)
Operation “Blue Star” is the only event in the history of independent India in which the state went to war with its own people. Even after about 40 years it is not clear whether it was the culmination of the state's anger toward the people of the region, a political game of power, or the start of a dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from the mainstream due to the denial of their just demands during a long democratic struggle since independence. As happens all over the world, this led to a militant struggle with great loss of life among military, police and civilian personnel. The killing of Indira Gandhi and the massacre of innocent Sikhs in Delhi and other Indian cities were also associated with this movement.
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit www.vavaclasses.com
Palestine last event orientation.pptx (RaedMohamed3)
An EFL lesson about the current events in Palestine. It is intended for intermediate students who wish to improve their listening skills through a short PowerPoint lesson.
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...Levi Shapiro
Letter from the Congress of the United States regarding Anti-Semitism sent June 3rd to MIT President Sally Kornbluth, MIT Corp Chair, Mark Gorenberg
Dear Dr. Kornbluth and Mr. Gorenberg,
The US House of Representatives is deeply concerned by ongoing and pervasive acts of antisemitic
harassment and intimidation at the Massachusetts Institute of Technology (MIT). Failing to act decisively to ensure a safe learning environment for all students would be a grave dereliction of your responsibilities as President of MIT and Chair of the MIT Corporation.
This Congress will not stand idly by and allow an environment hostile to Jewish students to persist. The House believes that your institution is in violation of Title VI of the Civil Rights Act, and the inability or
unwillingness to rectify this violation through action requires accountability.
Postsecondary education is a unique opportunity for students to learn and have their ideas and beliefs challenged. However, universities receiving hundreds of millions of federal funds annually have denied
students that opportunity and have been hijacked to become venues for the promotion of terrorism, antisemitic harassment and intimidation, unlawful encampments, and in some cases, assaults and riots.
The House of Representatives will not countenance the use of federal funds to indoctrinate students into hateful, antisemitic, anti-American supporters of terrorism. Investigations into campus antisemitism by the Committee on Education and the Workforce and the Committee on Ways and Means have been expanded into a Congress-wide probe across all relevant jurisdictions to address this national crisis. The undersigned Committees will conduct oversight into the use of federal funds at MIT and its learning environment under authorities granted to each Committee.
• The Committee on Education and the Workforce has been investigating your institution since December 7, 2023. The Committee has broad jurisdiction over postsecondary education, including its compliance with Title VI of the Civil Rights Act, campus safety concerns over disruptions to the learning environment, and the awarding of federal student aid under the Higher Education Act.
• The Committee on Oversight and Accountability is investigating the sources of funding and other support flowing to groups espousing pro-Hamas propaganda and engaged in antisemitic harassment and intimidation of students. The Committee on Oversight and Accountability is the principal oversight committee of the US House of Representatives and has broad authority to investigate “any matter” at “any time” under House Rule X.
• The Committee on Ways and Means has been investigating several universities since November 15, 2023, when the Committee held a hearing entitled From Ivory Towers to Dark Corners: Investigating the Nexus Between Antisemitism, Tax-Exempt Universities, and Terror Financing. The Committee followed the hearing with letters to those institutions on January 10, 202
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdfTechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
2. Thermopylae Sciences & Technology – Who are we?
• Mixed Government (70%) and Commercial (30%) contracting company w/ ~150 employees
• Core customers:
  – SOUTHCOM, Intel & Security Command, Army Intel Sector, DOI
  – LVMS, Select Energy Oil & Gas, OSU, Cleveland Cavaliers, and STL Rams
• #1 Google Enterprise partner for Federal and partner w/ imagery providers (GeoEye / Digital Globe)
• FOSS4G contributor and 10gen Enterprise partner
WHO ARE THESE GUYS?
ACCOMPLISHING THE IMPOSSIBLE
3. “The 3D UDOP allows near real time visibility of all SOUTHCOM Directorates’ information in one location…this capability allows for unprecedented situational awareness and information sharing”
-Gen. Doug Frasier
TST PRODUCTS
4. COMMERCIAL CUSTOMERS
Commercial Examples
• Cleveland Cavaliers
• USGIF
• Las Vegas Motor Speedway
• Baltimore Grand Prix
iSpatial framework serves millions of mobile devices
5. 1. iSpatial provides a web-based interface for Multi-INT visualization and collaboration
2. Map/Reduce provides spatial statistics processing (spatial regression) and heuristics
3. Modified MongoDB provides storage and indexing of multi-dimensional spatial data at scale
TST ARCHITECTURE
iSpatial – UI/Visualization
Hadoop M/R – Processing / Analysis
MongoDB – Spatial Data Management @ Scale
6. What the…..HOW MUCH DATA?!?
• “Swimming in sensors drowning in data”
– What size data tsunami are we talking about?
• “Fix and Finish are meaningless until FIND is accomplished”
– A “Big Data” Spatial Search Problem
THAT’S A LOT OF DATA….
Sensor Type | Resolution | Data Bandwidth | TB/Hr
FMV | 640 x 480 (Std Def); 1920 x 1080 (HD) | HD: 16bit x 3 bands @ 30fps ~1Gbps | ~0.45 TB
WAMI | Constant Hawk = 96 Mpx; Gorgon Stare = 460 Mpx; Argus = 1.8 Gpx | GS @ 16bit x 3 bands @ 2fps ~15.3Gbps; Argus @ 16bit x 3 bands @ 12fps ~345.6Gbps | ~6.89 TB (GS); ~155 TB (Argus)
Satellite | NITF / JP2 resolutions: 32K x 32K; 432K x 216K | 32K x 32K @ 8bit x 3 bands @ 1 frame/5mins ~27Gbps | ~12.15 TB
7. • Horizontally scalable – Large volume / elastic
• Vertically scalable – Heterogeneous data types (“Data Stack”)
• Smartly Distributed – Reduce the distance bits must travel
• Fault Tolerant – Replication Strategy and Consistency model
• High Availability – Node recovery
• Fast – Reads or writes (can’t always have both)
BIG DATA STORAGE CHARACTERISTICS
Desired Data Store Characteristics for ‘Big Data’
8. • Cassandra
– Nice Bring Your Own Index (BYOI) design
– … but Java, Java, Java… Memory management can be a maintenance issue
– Adding new nodes can be a pain (Token Changes, nodetool)
– Key-Value store…good for simple data models
• HBase
– Nice BigTable model
– Key-Value store…good for simple data models
– Lots of Java JNI (primarily based on std:hashmap of std:hashmap)
• CouchDB
– Provides some GeoSpatial functionality (Currently being rewritten)
– HEAVILY dependent on Map-Reduce model (complicated design)
– Erlang based – poor multi-threaded heap management
NOSQL OPTIONS
Subset of Evaluated NoSQL Options
9. Why MongoDB for Thermopylae?
• Documents based on JSON – A GEOJSON match made in heaven! (OGC)
• C++ – No garbage collection overhead! Efficient memory management design reduces disk swapping and paging
• Disk storage is memory mapped, enabling fast swapping when necessary
• Built-in auto-failover with replica sets and fast recovery with journaling
• Tunable Consistency – consistency defined at the application layer
• Schema Flexible – friendly properties of SQL enable an easy port
• Provided initial spatial indexing support – point-based and limited!
WHY TST <3’S MONGODB
10. MONGODB SPATIAL INDEXER
... The Spatial Indexer wasn’t quite right
• MongoDB (like nearly all relational DBs) uses a b-Tree
  – Data structure for storing sorted data, searchable in log time
  – Great for indexing numerical and text documents (1D attribute data)
  – Cannot store multi-dimension (>2D) data – NOT COMPLEX GEOMETRY FRIENDLY
11. DIMENSIONALITY REDUCTION
How does MongoDB solve the dimensionality problem?
• Space Filling (Z) Curve
  – A continuous line that intersects every point in a two-dimensional plane
• Use Geohash to represent lat/lon values
  – Interleave the bits of a lat/long pair
  – Base32 encode the result
12. GEOHASH BTREE ISSUES
• Neighbors aren’t so close!
  – Neighboring points on the geoid may end up on opposite ends of the plane
  – Impacts search efficiency
• What about geometry?
  – Doesn’t support > 2D
  – Mongo uses multi-location documents, which really just index multiple points that link back to a single document
Issues with the Geohash b-Tree approach
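A toy demonstration of the "neighbors aren't so close" problem: quantize lat/lon to 16-bit integers, interleave the bits into a Morton (Z-order) code, and compare two points roughly 1.4 km apart that straddle the prime meridian. The 16-bit grid and sample coordinates are illustrative assumptions, not the actual MongoDB encoding.

```python
# Two nearby points on opposite sides of a curve boundary get 1-D keys
# that sort very far apart, hurting range-scan locality.

def morton(lat, lon, bits=16):
    # Map lat [-90, 90] and lon [-180, 180] onto unsigned integer grids.
    y = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    x = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # even bits: longitude
        code |= ((y >> i) & 1) << (2 * i + 1)  # odd bits: latitude
    return code

west = morton(51.5, -0.01)  # just west of the meridian
east = morton(51.5, 0.01)   # just east, roughly 1.4 km away
# The codes land on opposite sides of a major bit boundary, so they sort
# far apart even though the points are neighbors on the geoid.
print(abs(west - east) > 10**8)  # True
```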
13. Sort Order and Multi-Dimension…a nightmare
(3D / 4D Hilbert Scanning Order)
GEO-SHARDING ALTERNATIVE
14. Mongo Multi-location Document Clipping Issues
($within search doesn’t always work w/ multi-location)
(diagram: a search polygon tested against a multi-location document (aka polygon) in four cases: two succeed, two fail)
MULTI-LOCATION CLIPPING
15. • Constrain the system to single point searches
– Multi-dimension support will be exponentially complex (won’t scale)
• Interpolate points along the edge of the shape
– Multi-dimension support will be exponentially complex (won’t scale)
• Customize the spatial indexer
– Selected approach
SOLUTIONS TO GEOHASH PROBLEM
Potential Solutions
16. CUSTOM TUNED SPATIAL INDEXER
Thermopylae Custom Tuned MongoDB for Geo
TST leverages Kriegel’s 1996 research in R* Trees
• R-Trees organize any-dimensional data by representing each object as a minimum bounding box.
• Each node bounds its children. A node can hold many objects (max: m, min: ceil(m/2))
• Splits and merges are optimized by minimizing overlaps
• The leaves point to the actual objects (typically stored on disk)
• Height balanced – search is always O(log n)
17. Spatial Indexing at Scale with R-Trees
RTREE THEORY
Spatial data represented as minimum bounding rectangles (2-dimension), cubes (3-dimension), or hyperrectangles (4-dimension)
Index represented as <I, DiskLoc>, where:
  I = (I0, I1, … In) : n = number of dimensions
  Each Ii is a range of the form [min, max] describing the MBR extent along one dimension
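The <I, DiskLoc> entry can be modeled like this. The class and field names (`IndexEntry`, `disk_loc`) are hypothetical stand-ins for illustration, not MongoDB's actual C++ types.

```python
# An index entry pairs an n-dimensional MBR with a pointer to the record
# on disk; intersection is a per-dimension range-overlap test.
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[float, float]  # [min, max] along one dimension

@dataclass
class IndexEntry:
    mbr: List[Interval]   # I = (I0, I1, ... In)
    disk_loc: int         # points at the record on disk

    def intersects(self, other: "IndexEntry") -> bool:
        # MBRs intersect iff their ranges overlap in every dimension.
        return all(a[0] <= b[1] and b[0] <= a[1]
                   for a, b in zip(self.mbr, other.mbr))

# A 3-D point observation and a 3-D search box:
obs = IndexEntry(mbr=[(10, 10), (20, 20), (5, 5)], disk_loc=0x1A2B)
box = IndexEntry(mbr=[(0, 15), (15, 25), (0, 9)], disk_loc=-1)
print(box.intersects(obs))  # True
```

Note that a point is just a degenerate MBR whose min equals its max in every dimension, which is why one entry format covers both points and complex geometry.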
18. R*-Tree Spatial Index Example
• Sample insertion result for a 4th-order tree
• Objectives:
  1. Minimize area
  2. Minimize overlaps
  3. Minimize margins
  4. Maximize inner node utilization
R*-TREE INDEX OBJECTIVES
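The four objectives can be scored with simple MBR metrics. This is a sketch assuming MBRs are lists of (min, max) pairs; the function names are my own.

```python
# Metrics an R*-tree weighs when choosing where to insert and how to
# split: area, margin (generalized perimeter), and pairwise overlap.

def area(mbr):
    p = 1.0
    for lo, hi in mbr:
        p *= (hi - lo)
    return p

def margin(mbr):
    # Sum of edge lengths along each dimension.
    return sum(hi - lo for lo, hi in mbr)

def overlap(a, b):
    # Area of the intersection MBR; 0 if disjoint in any dimension.
    p = 1.0
    for (alo, ahi), (blo, bhi) in zip(a, b):
        side = min(ahi, bhi) - max(alo, blo)
        if side <= 0:
            return 0.0
        p *= side
    return p

a = [(0, 4), (0, 3)]
b = [(2, 6), (1, 5)]
print(area(a), margin(a), overlap(a, b))  # 12.0 7 4.0
```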
19. Insert
• Similar to insertion into a B+-tree, but may insert into any leaf; the leaf splits if its capacity is exceeded.
– Which leaf to insert into?
– How to split a node?
R*-TREE INSERT EXAMPLE
20. Insert—Leaf Selection
• Follow a path from root to leaf.
• At each node, move into the subtree whose MBR area increases least with the addition of the new rectangle.
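The least-enlargement descent described above can be sketched as follows; the dict-based node shape (`mbr` / `children` / `entries`) is an illustrative assumption, not the actual index structure.

```python
# Leaf selection: walk from the root, at each level picking the child
# whose MBR area grows least if the new rectangle is added to it.

def area(mbr):
    p = 1.0
    for lo, hi in mbr:
        p *= (hi - lo)
    return p

def enlarge(mbr, rect):
    # Smallest MBR covering both mbr and rect.
    return [(min(alo, blo), max(ahi, bhi))
            for (alo, ahi), (blo, bhi) in zip(mbr, rect)]

def choose_leaf(node, rect):
    while node.get("children"):  # descend through internal nodes
        node = min(node["children"],
                   key=lambda c: area(enlarge(c["mbr"], rect)) - area(c["mbr"]))
    return node  # leaf: insert rect here (split if over capacity)

root = {"mbr": [(0, 10), (0, 10)], "children": [
    {"mbr": [(0, 4), (0, 4)], "children": [], "entries": []},
    {"mbr": [(6, 10), (6, 10)], "children": [], "entries": []},
]}
leaf = choose_leaf(root, [(1, 2), (1, 2)])
print(leaf["mbr"])  # [(0, 4), (0, 4)] since no enlargement is needed
```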
27. R*-Tree Leverages B-Tree Base Data Structures (buckets)
R*-TREE MONGODB IMPLEMENTATION
28. Spatial Index
Architecture, Organization, & Performance
(index bucket layout: BucketHeader → MBRHeader → MBRKeyNode(s) …)
Dimensions | Num Buckets | Tree Height | Read Time
3 | 3,448,276 | 3 | 190 ms
5 | 5,076,143 | 3 | 275 ms
100 | 90,909,091 | 8 | ~4.9 sec
1B Polygon Read Performance (worst case O(n))
SPATIAL INDEX ARCH & ORG
29. Geo-Sharding – (in work)
Scalable Distributed R* Tree (SD-r*Tree)
“Balanced” binary tree, with nodes distributed on a set of servers:
• Each internal node has exactly two children
• Each leaf node stores a subset of the indexed dataset
• At each node, the heights of the subtrees differ by at most one
• A mongos “routing” node maintains the binary tree
GEO-SHARDING
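A minimal sketch of the routing idea: a mongos-style router walks a binary coverage tree to the shard whose leaf region covers the query point. The shard names and the non-overlapping coverage here are illustrative assumptions; a real SD-R*-tree allows overlapping coverage and uses correction steps.

```python
# Route a spatial query to the server holding the relevant chunk by
# descending a binary tree of coverage MBRs maintained by the router.

def contains(mbr, point):
    return all(lo <= v <= hi for (lo, hi), v in zip(mbr, point))

def route(node, point):
    # Internal nodes have left/right children; leaves name a shard.
    while "shard" not in node:
        left = node["left"]
        node = left if contains(left["mbr"], point) else node["right"]
    return node["shard"]

tree = {
    "mbr": [(-180, 180), (-90, 90)],
    "left":  {"mbr": [(-180, 0), (-90, 90)], "shard": "rs-west"},
    "right": {"mbr": [(0, 180), (-90, 90)], "shard": "rs-east"},
}
print(route(tree, (-77.0, 38.9)))  # rs-west
```

Because each internal node has exactly two children and subtree heights differ by at most one, routing touches O(log s) nodes for s shards.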
SD-r*Tree Data Structure Illustration
• di = Data Node (Chunk)
• ri = Coverage Node
Leveraged work from Litwin, Mouza, Rigaux 2007
SD-r*Tree DATA STRUCTURE
32. Beyond 4-Dimensions - X-Tree
(Berchtold, Keim, Kriegel – 1996)
(diagram node types: normal internal nodes, supernodes, data nodes)
• Avoid MBR overlaps – more overlap approaches the worst-case O(n) read
• Avoid node splits (main cause for high overlap)
• Introduce new node structure: Supernodes – Large Directory nodes of variable size
BEYOND 4-DIMENSIONS
34. T-Sciences Custom Tuned Spatial Indexer
• Optimized Spatial Search – finds intersecting MBRs and recurses into those nodes
• Optimized Spatial Inserts – uses the Hilbert value of the MBR centroid to guide the search
  – 28% reduction in the number of nodes touched
• Optimized Deletes – leverages the R* split/merge approach for rebalancing the tree when nodes become over/under-full
• Low maintenance – leverages MongoDB’s automatic data compaction and partitioning
CONCLUSION
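The optimized spatial search (find intersecting MBRs, recurse into only those subtrees) can be sketched as follows, using an illustrative dict-based node layout rather than TST's actual implementation.

```python
# Branch-and-bound R-tree search: prune any subtree whose MBR does not
# intersect the query box; collect matching entries at the leaves.

def intersects(a, b):
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))

def search(node, query, hits):
    if "entries" in node:  # leaf node
        hits += [e for e in node["entries"] if intersects(e["mbr"], query)]
        return hits
    for child in node["children"]:  # internal node: prune by MBR
        if intersects(child["mbr"], query):
            search(child, query, hits)
    return hits

leaf_a = {"entries": [{"mbr": [(1, 2), (1, 2)], "doc": "a"},
                      {"mbr": [(3, 4), (3, 4)], "doc": "b"}]}
leaf_b = {"entries": [{"mbr": [(8, 9), (8, 9)], "doc": "c"}]}
root = {"children": [{"mbr": [(1, 4), (1, 4)], **leaf_a},
                     {"mbr": [(8, 9), (8, 9)], **leaf_b}]}

print([h["doc"] for h in search(root, [(0, 2), (0, 2)], [])])  # ['a']
```

The pruning step is what keeps reads near O(log n) when overlap is low; as MBR overlap grows, more subtrees must be visited and performance degrades toward the O(n) worst case noted on the X-Tree slide.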
35. Example: Mosaicked Video with KLV Footprints
• Rip through KLV metadata
• Index frame footprints and annotations as MBRs into the X(R*)-Tree
• Leverage Geo-Sharding for spatially relevant scale
36. Example Use Case – OSINT (Foursquare Data)
• Sample Foursquare data set mashed with Government Intel Data (poly reports)
• 100 million geo document test (3D points and polys)
• 4-server replica set
• ~350 ms query response
• ~300% improvement over PostGIS
EXAMPLE
37. Community Support
• Thermopylae plans to open source
– http://github.com/thermopylae
• TST working with 10gen to offer as a spatial extension
• Active developer collaboration
– IRC: #mongodb freenode.net
FIND US
40. Key Customers - Government
• US Dept of State Bureau of Diplomatic Security
  – Build and support a 30 TB Google Earth Globe, with multi-terabyte individual globes sent to embassies throughout the world. Integrated Google Earth and the iSpatial framework.
• US Army Intelligence Security Command
  – Provide expertise in managing technology integration – prime contractor providing operations, intelligence, and IT support worldwide. Partners include IBM, Lockheed Martin, Google, MIT, Carnegie Mellon. Integrated Google Earth and the iSpatial framework.
• US Southern Command
  – Coordinate intelligence management systems’ spatial data collection, indexing, and distribution. Integrated Google Earth, iSpatial, and iHarvest.
  – Index large-volume imagery and expose it to different services (Air Force, Navy, Army, Marines, Coast Guard)
GOVERNMENT CUSTOMERS
41. COMMERCIAL CUSTOMERS
Key Customers - Commercial
• Cleveland Cavaliers
• USGIF
• Las Vegas Motor Speedway
• Baltimore Grand Prix
iSpatial framework serves millions of mobile devices
42. • Expose and manage Multi-INT enterprise data in a geo-temporal user-defined environment
• Provide a flexible and scalable spatial data infrastructure (SDI) for Multi-INT data access and analysis
• Spatially referenced data visualization on a 3D globe & 2D maps
• Access real/near-real-time data feeds from forward-deployed devices
• Enable real-time information sharing and mission collaboration
ISPATIAL OVERVIEW