MONGODB FOR SPATIO-BEHAVIORAL DATA ANALYSIS &
               VISUALIZATION

                 JOHN-ISAAC CLARK
             CHIEF INNOVATION OFFICER
Overview
•   About Thermopylae Sciences + Technology
•   What is iHarvest?
•   The Problem
•   Other Solutions
•   Why we chose MongoDB
•   Lessons Learned
•   What's next?
What is iHarvest?
iHarvest (Interest Harvest) is a system that builds profiles of activities per discrete node based on a
    number of variables and then analyzes those models for similarity. It is an automated,
    intelligent system that continually monitors changes in activities and models using
    advanced, proprietary algorithms. iHarvest is designed to:
 •    Identify - Collect and store event activities and data feeds
 •    Model - Build and identify related interests to store as a profile model
 •    Analyze - Identify similarities and comparisons between common activities
 •    Report - Aggregate and provide recommendations and analytics on findings

Features
 •   Operates unobtrusively on any closed-network system
 •   Adapts to system and usage-activity changes as it is used
 •   Homes in on user-specific needs, becoming more accurate, efficient, and easier to use
 •   Delivers customized solutions such as collaboration, monitoring, and even insider threat
     analysis
 •   Alerts on "non-observable" data and relationships
iHarvest Architecture
Our Problems
•   Data Storage was Difficult to Scale
    o    2012 iHarvest Roadmap Releases required adding significantly more analytic
         processing and storage of results.

•   Document Based Data Store (JSON)
    o    Needed to rapidly increase the richness of our data models dynamically so we don't
         have to redesign our data access layer and schema with each change.

•   Geo Spatial Index – event data is not purely textual – we needed a solution that
    included support for spatial data

•   Increased Analytics requiring more processing power – as data grows, so does the
    analytic processing requirement

•   Had a requirement to provide Statistical and Aggregate results of our data
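The geospatial requirement above can be made concrete. Below is a minimal sketch of an event document with a GeoJSON location, plus the index and query shapes MongoDB uses for spatial lookups, expressed as Python dicts. The field names are illustrative assumptions, not the actual iHarvest schema:

```python
from datetime import datetime, timezone

# Hypothetical event document: one activity observed at a discrete node.
event = {
    "nodeId": "node-42",
    "type": "activity",
    # Stored as a BSON date (ISODate) when inserted via a driver.
    "timestamp": datetime(2012, 6, 1, 12, 0, tzinfo=timezone.utc),
    "location": {                       # GeoJSON point: [longitude, latitude]
        "type": "Point",
        "coordinates": [-77.0365, 38.8977],
    },
}

# A 2dsphere index enables spherical-geometry queries on the location field.
index_spec = [("location", "2dsphere")]

# Query shape: events within 5 km of a center point.
geo_query = {
    "location": {
        "$nearSphere": {
            "$geometry": {"type": "Point", "coordinates": [-77.0, 38.9]},
            "$maxDistance": 5000,       # meters
        }
    }
}
```

With a live collection, these would be applied via `create_index(index_spec)` and `find(geo_query)`; here they only illustrate the document and query shapes.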
Other Solutions We Tried/Looked at
•   PostgreSQL. Used a "NoSQL"-style key-value pair
    store, but performance failed badly when trying
    to access sub-field data.

•   Accumulo. Very difficult to set up, configure, and
    develop against. Required HDFS, Hadoop, and
    ZooKeeper, as well as expert admins who just
    didn't exist yet. On the plus side, it provided
    MapReduce capability.
Why we chose MongoDB
•   Built-in MapReduce
    –   We are predominantly doing massive amounts of analytics
        on our data
•   Aggregation Framework
    –   Connected directly to REST endpoints for developer/prototyping use –
        substantial decrease in development time
•   No Need for separate Hadoop Cluster
    –   Faster development and reduced installation/integration/maintenance for
        customers
•   Developer Friendly
    –   Instead of using complex JDBC and SQL, we can simply instantiate objects
        and call methods
•   Great Documentation
•   Easy to Scale
How we're using MongoDB in iHarvest
Scalable Dynamic Storage for:
•   Events, Feeds, Profile models
•   Processed Analytic Results

Aggregation Framework
•  Statistics and Data Aggregation

MapReduce
• Primarily to run the K-Means clustering algorithm on Geo
  Data
Dynamic Storage
• We leverage a JSON based document model
• Allows us to add new fields/attributes without having to update a schema
• Shard addition allows us to scale easily with our data
• Events
   – High volume of incoming data – data can grow to be very large very
      quickly
• Profile Engine processes events
   – 16 x 16 profile tables – keying on profile ID gives an even distribution
      and lets us dedicate profile-engine processing to the specific profiles
      that require updating – no one processing engine has to do all the
      work
   – MongoDB allows us to dynamically grow our dimensions by adding
      new tables to the grid programmatically
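The even key distribution described above can be sketched with a simple hash: map each profile ID to one cell of the 16 x 16 grid so that no single engine owns a disproportionate share of the work. The hashing scheme below is an illustrative assumption, not the actual iHarvest algorithm:

```python
import hashlib

GRID = 16  # 16 x 16 grid of profile tables

def profile_cell(profile_id: str) -> tuple:
    """Map a profile ID to a deterministic (row, col) cell in the grid.

    Using a cryptographic hash spreads arbitrary IDs roughly uniformly
    across the 256 cells, so each profile engine can own a fixed subset
    of cells and process only the profiles that hash into them.
    """
    digest = hashlib.sha256(profile_id.encode("utf-8")).digest()
    return (digest[0] % GRID, digest[1] % GRID)

row, col = profile_cell("profile-1234")
```

Because the mapping is deterministic, the same profile always lands in the same cell, which is what lets updates be routed to a dedicated engine.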
Aggregation Framework
• Create statistical endpoints by making calls to Aggregation Framework
   – We created a REST API that lets a JavaScript query hit the Aggregation
      Framework against any of our tables/indexes
   – This gives us a very powerful way to prototype new statistics and data
      aggregations quickly
• Temporal aggregation is very valuable to us and we basically get it for
  “free” with MongoDB built-in functions
• We leverage Aggregation Framework on the following components
    – Activity
          •   Raw incoming data related to an event
    – Events
        • Summarization of Activities
    – Node
        • Discrete item the Activity/Event is related to
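As a concrete sketch, here is the kind of temporal aggregation pipeline such a REST endpoint might forward to MongoDB: counting events per node per day using the built-in date operators. Collection and field names are illustrative assumptions:

```python
# Hypothetical pipeline: daily event counts per node. Assumes "timestamp"
# is stored as a real BSON date (ISODate), so the $year / $month /
# $dayOfMonth operators can extract calendar components from it.
daily_counts = [
    {"$match": {"type": "activity"}},
    {"$group": {
        "_id": {
            "node": "$nodeId",
            "year": {"$year": "$timestamp"},
            "month": {"$month": "$timestamp"},
            "day": {"$dayOfMonth": "$timestamp"},
        },
        "count": {"$sum": 1},
    }},
    {"$sort": {"count": -1}},   # busiest node-days first
]

# db.events.aggregate(daily_counts)  # would execute against a live collection
```

This is the "free" temporal aggregation mentioned above: no application-side bucketing code, just date operators in the pipeline.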
MapReduce
• Geo Clustering is the predominant use for built-in MapReduce at this point
  (outside of what Aggregation Framework is already doing)
• K-means is the cluster analysis method we use to look at geo
  similarity/overlap
• To take advantage of this method, we use MongoDB's inherent
  geo-indexing mechanism to quickly access data spatially by geographic
  region
• Segregating this data allows us to quickly perform k-means clustering
  on it alone, without having to cram it alongside other event data
• Developed our own k-means MapReduce queries to support the spatial-
  clustering model development/process
• Processing automatically scales across the number of shards, which is very
  helpful since k-means is very computationally intensive
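One k-means iteration maps naturally onto the two MapReduce phases. The pure-Python sketch below mirrors that structure for illustration only; the real job runs as JavaScript inside MongoDB's mapReduce over the geo-indexed points:

```python
def kmeans_step(points, centroids):
    """One k-means iteration expressed as map and reduce phases.

    points:    list of (x, y) coordinates
    centroids: list of (x, y) current cluster centers
    """
    # Map phase: emit (nearest-centroid-index, point) for every point.
    emitted = {}
    for x, y in points:
        idx = min(
            range(len(centroids)),
            key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
        )
        emitted.setdefault(idx, []).append((x, y))

    # Reduce phase: average each centroid's assigned points to get the
    # new center; centroids with no points keep their old position.
    new_centroids = list(centroids)
    for idx, pts in emitted.items():
        new_centroids[idx] = (
            sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts),
        )
    return new_centroids

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans_step(pts, [(0, 0), (10, 10)]))  # -> [(0.0, 0.5), (10.0, 10.5)]
```

Because the map phase is independent per point, this is exactly the shape of work that MongoDB can parallelize across shards.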
Lessons Learned
•   Moving to NoSQL from a relational database requires switching your mind-set
    on data storage and processing, i.e., more back-end processing, with
    results available as soon as they are ready (done being processed).
•   The Aggregation Framework is powerful, but could use more tutorials and
    examples of usage.
•   Built-in MapReduce allowed us to offload much of our processing and
    take advantage of MongoDB's auto-sharding / processing.
•   When storing dates in MongoDB, be sure to use ISODate to take
    advantage of MongoDB's date/time functions.
•   Understanding the data types provided by MongoDB is important to fully
    take advantage of the inherent Aggregation Framework capabilities
    •   Go to schema workshop!
    •   You can of course write your own MapReduce queries, but you can do a lot out of
        the box by being mindful of what is already provided for you
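A small illustration of why the ISODate advice matters: real date values compare chronologically, while ad-hoc date strings compare only lexically and can sort in the wrong order:

```python
from datetime import datetime, timezone

# Real date values (what a driver stores as ISODate) compare chronologically.
a = datetime(2012, 6, 9, tzinfo=timezone.utc)
b = datetime(2012, 6, 10, tzinfo=timezone.utc)
assert a < b

# The same dates as non-zero-padded strings sort the wrong way,
# because string comparison is character by character: '9' > '1'.
assert "2012-6-9" > "2012-6-10"
```

On top of correct ordering, only real dates can be fed to the Aggregation Framework's date operators; strings cannot.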
What's next?
•   Increased use of MapReduce:
    o   Enhanced Similarity Analytics Processing to increase
        efficiency
    o   Additional Interest-Building Algorithms
        for Profile generation
•   Integration of Mahout and MongoDB for additional
    Clustering Algorithms

•   Integration of Spring/MongoDB to better abstract
    the data model
Questions?
