MONGODB FOR SPATIO-BEHAVIORAL DATA ANALYSIS &
               VISUALIZATION

                 JOHN-ISAAC CLARK
             CHIEF INNOVATION OFFICER
Overview
•   About Thermopylae Sciences + Technology
•   What is iHarvest?
•   The Problem
•   Other Solutions
•   Why we chose MongoDB
•   Lessons Learned
•   What's next?
What is iHarvest?
iHarvest (Interest Harvest) is a system that builds profiles of activities by discrete node based on a
    number of variables and then analyzes those models for similarity. It is an automated and
    intelligent system that continually monitors changes in activities and models using
    advanced, proprietary algorithms. iHarvest is designed to:
 •    Identify - Collect and store event activities and data feeds
 •    Model - Build and identify related interests to store as a profile model
 •    Analyze - Identify similarities and comparisons between common activities
 •    Report - Aggregate and provide recommendations and analytics on findings

Features
 •   Operates unobtrusively on any closed-network system
 •   Adapts to changes in system and usage activity as it is used
 •   Homes in on user-specific needs, becoming more accurate, efficient, and easy to use
 •   Delivers customized solutions such as collaboration, monitoring, and even insider-threat
     analysis
 •   Alerts on "non-observable" data and relationships
iHarvest Architecture
Our Problems
•   Data Storage was Difficult to Scale
    o    2012 iHarvest Roadmap Releases required adding significantly more analytic
         processing and storage of results.

•   Document-Based Data Store (JSON)
    o    Needed to rapidly increase the richness of our data models dynamically so we wouldn't
         have to redesign our data access layer and schema with each update/change.

•   Geospatial Index – event data is not purely textual – we needed a solution that
    included support for spatial queries

•   Increased Analytics requiring more processing power – as the data grew, so did the
    analytic processing requirement

•   Had a requirement to provide Statistical and Aggregate results of our data
Other Solutions We Tried/Looked at
•   PostgreSQL. Used a "NoSQL"-like key-value pair
    store, but performance was unacceptable when trying
    to access sub-field data.

•   Accumulo. Very difficult to set up, configure, and
    develop against. Required HDFS, Hadoop, and
    Zookeeper. Also required expert admins who just
    didn't exist yet. On the plus side, it provided
    MapReduce capability.
Why we chose MongoDB
•   Built-in MapReduce
    –   We predominantly do massive amounts of analytics on our data
•   Aggregation Framework
    –   Connected directly to REST endpoints for developer/prototyping use –
        substantial decrease in development time
•   No Need for separate Hadoop Cluster
    –   Faster development and reduced installation/integration/maintenance for
        customers
•   Developer Friendly
    –   Instead of using complex JDBC and SQL, we can simply instantiate objects
        and call methods
•   Great Documentation
•   Easy to Scale
How we're using MongoDB in iHarvest
Scalable Dynamic Storage for:
•   Events, Feeds, Profile models
•   Processed Analytic Results

Aggregation Framework
•  Statistics and Data Aggregation

MapReduce
• Primarily to run our K-means clustering algorithm on geo
  data
Dynamic Storage
• We leverage a JSON-based document model
• Allows us to add new fields/attributes without having to update a schema
• Shard addition allows us to scale easily with our data
• Events
   – High volume of incoming data – data can grow to be very large very
      quickly
• Profile Engine processes events
   – A 16 x 16 grid of profile collections – keys based on profile ID give an even
      distribution, letting us dedicate profile-engine processing to the specific
      profiles that require updating - no single processing engine has to do all
      the work
   – MongoDB allows us to dynamically grow our dimensions by adding
      new collections to the grid programmatically (a minimal sketch follows)
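The mongo-shell sketch below illustrates the schemaless pattern and programmatic collection creation described above. The collection, field names, and grid coordinates are hypothetical stand-ins, not iHarvest's actual schema.

```javascript
// Hedged sketch: schemaless inserts plus programmatic collection creation.
// All names here (events, nodeId, loc, profiles_R_C) are illustrative only.
db.events.insert({
  nodeId: "node-42",
  type: "page_view",
  ts: ISODate("2012-05-01T12:34:56Z"),
  loc: [-77.05, 38.88]                  // [longitude, latitude]
});
db.events.ensureIndex({ loc: "2d" });   // geospatial index for region queries

// Richer events can add attributes later with no schema migration:
db.events.insert({
  nodeId: "node-42",
  type: "page_view",
  ts: ISODate("2012-06-01T09:00:00Z"),
  loc: [-77.03, 38.90],
  sessionId: "abc123",                  // new field, no ALTER TABLE required
  tags: ["geo", "behavioral"]
});

// Growing the profile grid by creating a new collection programmatically:
var row = 3, col = 7;
db.createCollection("profiles_" + row + "_" + col);
```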
Aggregation Framework
• Create statistical endpoints by making calls to the Aggregation Framework
   – We built a REST API that passes a JS query to the Aggregation Framework
      against any of our collections/indexes
   – This gives us a very powerful way to prototype new statistics and data
      aggregations quickly
• Temporal aggregation is very valuable to us, and we basically get it for
  "free" with MongoDB's built-in date functions (see the sketch after this list)
• We leverage Aggregation Framework on the following components
    – Activity
         •   Raw incoming data related to an event
    – Events
        • Summarization of Activities
    – Node
        • Discrete item the Activity/Event is related to
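As a hedged illustration of the temporal aggregation mentioned above, the pipeline below counts events per day using the framework's built-in date operators. The events collection and ts field are assumptions made for the example, not iHarvest's real schema.

```javascript
// Sketch: daily event counts via the Aggregation Framework's date operators.
// "events" and "ts" are hypothetical names used only for illustration.
db.events.aggregate([
  { $match: { ts: { $gte: ISODate("2012-01-01T00:00:00Z") } } },
  { $group: {
      _id: {
        year:  { $year: "$ts" },
        month: { $month: "$ts" },
        day:   { $dayOfMonth: "$ts" }
      },
      count: { $sum: 1 }
  } },
  { $sort: { "_id.year": 1, "_id.month": 1, "_id.day": 1 } }
]);
```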
MapReduce
• Geo Clustering is the predominant use for built-in MapReduce at this point
  (outside of what Aggregation Framework is already doing)
• K-means is the cluster analysis method we use to look at geo
  similarity/overlap
• To take advantage of this method, we use MongoDB's inherent
  geo-indexing mechanism to quickly access data spatially by geographic
  region
• Segregating this data lets us perform k-means clustering on it alone,
  without having to wade through other event data
• Developed our own k-means MapReduce queries to support the spatial-
  clustering model development process (a simplified sketch follows this list)
• Processing automatically scales across the number of shards, which is very
  helpful since k-means is computationally intensive
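Below is a simplified, hedged sketch of one k-means iteration written as a MongoDB mapReduce job, in the spirit of the custom queries described above. It is not iHarvest's actual implementation; the collection, field names, and bounding box are invented for the example, and a 2d index on loc is assumed.

```javascript
// Sketch of a single k-means assignment/update step. Current centroids are
// injected via "scope"; the query limits work to one geo-indexed region.
var centroids = [ [-77.0, 38.9], [-76.6, 39.3], [-77.4, 38.6] ];

var map = function () {
  // Assign this point to the nearest current centroid (squared distance).
  var best = 0, bestDist = Infinity;
  for (var i = 0; i < centroids.length; i++) {
    var dx = this.loc[0] - centroids[i][0];
    var dy = this.loc[1] - centroids[i][1];
    var d = dx * dx + dy * dy;
    if (d < bestDist) { bestDist = d; best = i; }
  }
  emit(best, { x: this.loc[0], y: this.loc[1], n: 1 });
};

var reduce = function (key, values) {
  // Sum coordinates and counts so finalize can average them;
  // output shape matches emit shape, so re-reduction is safe.
  var out = { x: 0, y: 0, n: 0 };
  values.forEach(function (v) { out.x += v.x; out.y += v.y; out.n += v.n; });
  return out;
};

var finalize = function (key, v) {
  // New centroid = mean of all points assigned to this cluster.
  return { x: v.x / v.n, y: v.y / v.n, n: v.n };
};

db.events.mapReduce(map, reduce, {
  finalize: finalize,
  scope: { centroids: centroids },
  query: { loc: { $within: { $box: [[-78, 38], [-76, 40]] } } },
  out: { inline: 1 }
});
```

Each iteration re-runs the job with the updated centroids until the assignments stabilize; the per-shard map phase is what lets the work scale with the cluster.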
Lessons Learned
•   Moving to NoSQL from a relational database requires a switch in mind-set
    on data storage and processing, i.e., more back-end processing, with
    results available as soon as they are ready (done being processed).
•   The Aggregation Framework is powerful, but could use more tutorials and
    examples on usage.
•   Built-in MapReduce allowed us to offload much of our processing and
    take advantage of MongoDB's auto-sharding and distributed processing.
•   When storing dates in MongoDB, be sure to use ISODate to take
    advantage of the built-in Date/Time functions (see the snippet after this list).
•   Understanding the data types provided by MongoDB is important to fully
    take advantage of the inherent Aggregation Framework capabilities
    •   Go to schema workshop!
    •   You can of course write your own MapReduce queries, but you can do a lot out of the box
        by being mindful of what is already provided for you
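A small illustration of the ISODate lesson above, with made-up collection and field names: dates stored as strings cannot be used with the date operators, while ISODate values can.

```javascript
// Hypothetical example: string dates block the date operators; ISODate works.
db.bad.insert({ ts: "2012-05-01 12:00:00" });            // plain string
db.good.insert({ ts: ISODate("2012-05-01T12:00:00Z") }); // BSON Date

// $dayOfWeek evaluates against ISODate values, not strings:
db.good.aggregate([
  { $group: { _id: { $dayOfWeek: "$ts" }, count: { $sum: 1 } } }
]);
```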
What's next?
•   Increased use of MapReduce:
    o   Enhanced Similarity Analytics processing for greater
        efficiency
    o   Additional interest-building algorithms
        for profile generation
•   Integration of Mahout and MongoDB for additional
    Clustering Algorithms

•   Integration of Spring/MongoDB to better abstract
    the data model
Questions?
