Thermopylae Sciences + Technology (T-Sciences) offers iSpatial, a web-based Spatial Data Infrastructure (SDI) that enables integration of third-party applications with geo-visualization tools. The iHarvest tool further enables mining and analysis of the data aggregated in the iSpatial platform for spatio-temporal behavior modelling. At the back end of both products is MongoDB, providing the fundamental framework capabilities for the spatial indexing and data analysis techniques. Come see how Thermopylae Sciences and Technology leveraged the Aggregation Framework and extended the spatial capabilities of MongoDB to tackle dynamic spatio-behavioral data at scale.
2. Overview
• About Thermopylae Sciences + Technology
• What is iHarvest?
• The Problem
• Other Solutions
• Why we chose MongoDB
• Lessons Learned
• What's next?
3. What is iHarvest?
iHarvest (Interest Harvest) is a system that builds profiles of activities per discrete node based on a
number of variables and then analyzes those models for similarity. It is an automated and
intelligent system that continually monitors changes in activities and models using
advanced, proprietary algorithms. iHarvest is designed to:
• Identify - Collect and store event activities and data feeds
• Model - Build and identify related interests to store as a profile model
• Analyze - Identify similarities and comparisons between common activities
• Report - Aggregate and provide recommendations and analytics on findings
Features
• Operates unobtrusively on any closed-network system
• Adapts to system and usage activity changes as it is used
• Hones in on user-specific needs, becoming more accurate, efficient, and easier to use
• Delivers customized solutions such as collaboration, monitoring, and even insider-threat
analysis
• Alerts on "non-observable" data and relationships
5. Our Problems
• Data Storage was Difficult to Scale
o 2012 iHarvest roadmap releases required adding significantly more analytic
processing and storage of results.
• Needed a Document-Based Data Store (JSON)
o Need to rapidly and dynamically increase the richness of our data models so we don't
have to redesign our data access layer and schema with each update/change.
• Geospatial Index – event data is not purely textual – we needed a solution that
included support for spatial qualities
• Increased Analytics requiring more processing power – as data grows, so does
the analytic processing requirement
• Requirement to provide statistical and aggregate results of our data
6. Other Solutions We Tried/Looked at
• PostgreSQL. Used a "NoSQL"-like key-value pair
store, but performance fell apart when trying
to access sub-field data.
• Accumulo. Very difficult to set up, configure, and
develop against. Required HDFS, Hadoop, and
ZooKeeper, as well as expert admins that just
didn't exist yet. On the plus side, it provided
MapReduce capability.
7. Why we chose MongoDB
• Built-in MapReduce
– Based on the fact that we are predominantly doing massive amounts of
analytics on our data
• Aggregation Framework
– Connected directly to REST endpoints for developer/prototyping use –
substantial decrease in development time
• No Need for separate Hadoop Cluster
– Faster development and reduced installation/integration/maintenance for
customers
• Developer Friendly
• Instead of using complex JDBC and SQL, we are able to simply instantiate objects
and call methods
• Great Documentation
• Easy to Scale
8. How we're using MongoDB in iHarvest
Scalable Dynamic Storage for:
• Events, Feeds, Profile models
• Processed Analytic Results
Aggregation Framework
• Statistics and Data Aggregation
MapReduce
• Primarily to run our k-means clustering algorithm on geo
data
9. Dynamic Storage
• We leverage a JSON based document model
• Allows us to add new fields/attributes without having to update a schema
• Shard addition allows us to scale easily with our data
• Events
– High volume of incoming data – data can grow to be very large very
quickly
• Profile Engine processes events
– 16 x 16 profile tables – keying on profile ID gives an even distribution
that lets us dedicate profile-engine processing to the specific profiles
that require updating – no one processing engine has to do all the
work
– MongoDB allows us to dynamically grow our dimensions by adding
new tables (collections) to the grid programmatically
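The even distribution across the profile grid can be sketched as follows. This is an illustrative assumption, not the production code: the talk does not show the actual keying scheme, so the hash choice, helper names, and collection-naming convention here are invented for the example.

```python
import hashlib

GRID_DIM = 16  # the 16 x 16 profile grid described above


def profile_cell(profile_id: str, dim: int = GRID_DIM) -> tuple:
    """Map a profile ID to a (row, col) cell of the dim x dim grid.

    Hashing the ID spreads profiles evenly across cells, so each
    processing engine can own a subset of cells and no single engine
    has to do all the work.
    """
    digest = hashlib.md5(profile_id.encode("utf-8")).digest()
    return digest[0] % dim, digest[1] % dim


def cell_collection(profile_id: str) -> str:
    """Derive the collection ("table") name for a profile's cell."""
    row, col = profile_cell(profile_id)
    return f"profiles_{row}_{col}"
```

Growing the grid programmatically is then just a matter of raising `dim` and creating the new collections as they are first written to.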
10. Aggregation Framework
• Create statistical endpoints by making calls to Aggregation Framework
– We created a REST API that lets a JSON query be passed to the Aggregation
Framework against any of our tables/indexes
– This gives us a very powerful way to prototype new statistics and data
aggregations quickly
• Temporal aggregation is very valuable to us and we basically get it for
“free” with MongoDB built-in functions
• We leverage Aggregation Framework on the following components
– Activity
• Raw incoming data related to event
– Events
• Summarization of Activities
– Node
• Discrete item the Activity/Event is related to
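The "free" temporal aggregation can be illustrated with a pure-Python stand-in for a `$group` on the day portion of a timestamp. The documents and field names below are hypothetical; in MongoDB the same result comes from the built-in date operators in an aggregation pipeline.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event documents; "ts" would be an ISODate in MongoDB.
events = [
    {"node": "n1", "ts": datetime(2012, 6, 1, 9, 30)},
    {"node": "n1", "ts": datetime(2012, 6, 1, 14, 0)},
    {"node": "n2", "ts": datetime(2012, 6, 2, 8, 15)},
]


def events_per_day(docs):
    """Stand-in for grouping by day, i.e. a $group on the day of "ts"
    with {"count": {"$sum": 1}}."""
    return Counter(d["ts"].strftime("%Y-%m-%d") for d in docs)


print(events_per_day(events))  # Counter({'2012-06-01': 2, '2012-06-02': 1})
```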
11. MapReduce
• Geo Clustering is the predominant use for built-in MapReduce at this point
(outside of what Aggregation Framework is already doing)
• K-means is the cluster analysis method we use to look at geo
similarity/overlap
• In order to take advantage of this method, we use the inherent MongoDB
geo-indexing mechanism to quickly spatially access data by geographic
region
• Segregating this data allows us to quickly perform k-means clustering
on it alone, without having to process it alongside other event data
• Developed our own k-means MapReduce queries to support the spatial-
clustering model development process
• Processing automatically scales across the number of shards, which is very
helpful since k-means is computationally intensive
12. Lessons Learned
• Moving to NoSQL from a relational database requires a switch in mind-set
on data storage and processing, i.e., more back-end processing, with
immediate access to results once they are ready (done being processed).
• The Aggregation Framework is powerful, but could use more tutorials and
examples on usage.
• Built-in MapReduce allowed us to offload much of our processing and
take advantage of MongoDB's auto-sharding / processing.
• When storing dates in MongoDB, be sure to use ISODate to take
advantage of its date/time-related functions.
• Understanding the data types provided by MongoDB is important to fully
take advantage of the inherent Aggregation Framework capabilities
• Go to a schema workshop!
• You can of course write your own MapReduce query, but you can do a lot out of the box by
being mindful/knowledgeable of what is already provided for you
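The ISODate lesson can be illustrated with a driver-flavored sketch (the documents and field names are hypothetical): a Python `datetime` is stored by the driver as a BSON Date/ISODate, which sorts, subtracts, and feeds the aggregation date operators, while a formatted string does not.

```python
from datetime import datetime

# A datetime value becomes an ISODate when inserted via a driver,
# enabling $hour, $dayOfWeek, date-range queries, etc.
good = {"node": "n1", "ts": datetime(2012, 6, 20, 14, 30)}

# An opaque string stores fine but loses all date/time behavior.
bad = {"node": "n1", "ts": "06/20/2012 2:30 PM"}

# Typed dates compare and subtract correctly:
hours_left = (datetime(2012, 6, 21) - good["ts"]).total_seconds() / 3600
print(hours_left)  # 9.5 hours until midnight
```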
13. What's next?
• Increased use of MapReduce to:
o Enhance similarity analytics processing to increase
efficiency
o Add interest-building algorithms
for profile generation
• Integration of Mahout and MongoDB for additional
Clustering Algorithms
• Integration of Spring/MongoDB to better abstract
the data model