Architecting Wide-Ranging Analytical Solutions on MongoDB
Matt Kalan
Sr. Solutions & Enterprise Architect, MongoDB
#MDBW16
Agenda
01 Why Focus on Analytics
02 Relevant MongoDB Capabilities
03 Analytics Scenarios
04 Recommendation Engine With Spark
05 Quick Demo
06 Summary
Why Focus on Analytics
How to Drive More Value From Data?
Light bulb image from: http://smallbusinessbc.ca/article/five-ways-discover-additional-value-your-business/business-value-idea/
So Many Options
Part of image from: http://mattturck.com/wp-content/uploads/2016/01/matt_turck_big_data_landscape_full.png
Why Are Analytics Important?
From http://www.bain.com/publications/capability-insights/advanced-analytics.aspx
What Criteria To Consider For Choosing Technology
• Assumption: you have identified which derived data or analytics have ROI
• Criteria
  • Operations on data (read/write, transform, aggregation, algorithm)
  • Time SLA – both how up-to-date the data is and response times
  • Effort (training, development, management)
  • Processing model for the analytic (partitionable, iterative, streaming, etc.)
  • Cost (data duplication, memory, servers, software)
MongoDB Capabilities Available
MongoDB Capabilities to Highlight for Analytics
Community/Open Source
1. Aggregation Framework
2. Reading from secondaries (priority = votes = 0 recommended)
3. Mongo Connector – replication to other MongoDB, search engines, etc.
4. Hadoop Connector – exposes MongoDB as native input/output for Hive, Pig, MR, etc.
5. Spark Connector – exposes MongoDB as an RDD/DataFrame/DataSet for read/write
Enterprise Advanced
1. In-memory storage engine – now GA for production use
2. BI Connector – BI & SQL read access to MongoDB
Aggregation Framework
Aggregation Pipeline Stages
• $match – Filter documents
• $geoNear – Geospatial query
• $project – Reshape documents
• $lookup – Left-outer joins (new)
• $unwind – Expand arrays in documents
• $group – Summarize documents
• $sample – Randomly selects a subset of documents (new)
• $sort – Order documents
• $skip – Jump over a number of documents
• $limit – Limit number of documents
• $redact – Restrict documents
• $out – Sends results to a new collection
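As a rough illustration of how a few of these stages compose, the sketch below mimics $match, $group (with a sum), $sort and $limit using plain Scala collections. The `Order` case class, its fields, and the sample data are hypothetical, and this is only an analogy: a real pipeline executes inside MongoDB rather than in application memory.

```scala
// Hypothetical document shape, for illustration only.
final case class Order(customer: String, status: String, amount: Double)

object PipelineSketch {
  val orders: Seq[Order] = Seq(
    Order("ann", "shipped", 20.0),
    Order("bob", "shipped", 5.0),
    Order("ann", "pending", 7.0),
    Order("ann", "shipped", 10.0),
    Order("cat", "shipped", 50.0)
  )

  // Returns (customer, total) for the top `n` customers by shipped amount.
  def topCustomers(docs: Seq[Order], n: Int): Seq[(String, Double)] =
    docs
      .filter(_.status == "shipped")                      // like $match
      .groupBy(_.customer)                                // like $group by customer ...
      .map { case (c, os) => (c, os.map(_.amount).sum) }  // ... with a sum accumulator
      .toSeq
      .sortBy { case (_, total) => -total }               // like $sort, descending
      .take(n)                                            // like $limit

  def main(args: Array[String]): Unit =
    topCustomers(orders, 2).foreach(println)
}
```

Putting the filter first mirrors the usual advice to place $match early so that later stages see fewer documents.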
Aggregation With a Sharded Database
Workload split between shards
1. Client works through mongos as with any query
2. Shards execute pipeline up to a point
3. A single shard merges cursors and continues processing
4. $lookup & $out performed within the primary shard for the database
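The split-then-merge flow in these steps can be sketched in plain Scala: each "shard" computes a partial grouped sum over its own documents, and a single merge step combines the partial results. The shard contents and function names are hypothetical, and this is an analogy for how the work divides, not a description of how mongos or mongod are implemented.

```scala
object ShardedGroupSketch {
  // Hypothetical data: each inner Seq stands for the (key, value) documents held by one shard.
  val shards: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1, "b" -> 2, "a" -> 3),
    Seq("b" -> 4, "c" -> 5),
    Seq("a" -> 6)
  )

  // Step run on every shard: group and sum locally.
  def partialSums(docs: Seq[(String, Int)]): Map[String, Int] =
    docs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // Step run once, on the merging node: combine the per-shard partial results.
  def merge(partials: Seq[Map[String, Int]]): Map[String, Int] =
    partials.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  def main(args: Array[String]): Unit =
    println(merge(shards.map(partialSums)))
}
```

A sum works here because partial sums can themselves be summed; an aggregate without that property would need more of the work done in the merge step.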
Analytics Scenarios
Using Aggregation Framework
On-Demand Analytics with Agg FW
Benefits
1. Up-to-date data
2. One technology
3. Only raw data stored
4. Flexible
Tradeoff
1. Slow if scanning many documents
Common Uses
Groups, counts, sums, averages for small subsets of data
[Diagram: the Application sends an agg pipeline to the Aggregation Framework at runtime; results are returned in real time]
Offline Analytics With Aggregation Framework
Benefits
1. One technology
2. Can filter at DB on aggregations
3. Low latency (in C++)
Tradeoffs
1. Storing additional data
2. One thread per server/instance
3. Advanced functions not included
Common Uses
1. Pre-calculating values across dataset
2. Batch transformations
[Diagram: the Application runs an agg pipeline* in the Aggregation Framework, which writes with $out: "results"; it can also return data to the application]
* MapReduce is also possible but slower (it runs in JavaScript), and most requirements can be met with the agg fw
To output to a sharded collection with the agg fw, results are returned to the driver and written from there to the sharded collection
Microsharding for Highly Parallel Processing
Benefits
1. Multiple threads for agg fw query per server
2. One technology
Tradeoffs
1. # of parallel threads and partitions in DB predefined
2. No native job scheduling or resource management
Common Uses
Analytics on large result sets to minimize latency
[Diagram: the Application sends an agg pipeline through Mongos; it runs in parallel on N partitions on each server, and data is returned in parallel]
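The fan-out pattern on this slide (run the same work on N partitions in parallel, then combine) might look like the following from the application side, sketched with Scala Futures. `runOnPartition` is a hypothetical stand-in for an aggregation call against one partition; here it only sums an in-memory slice so the example stays self-contained.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object MicroshardSketch {
  // Hypothetical stand-in for running one aggregation pipeline against one partition;
  // it sums an in-memory slice instead of querying a database.
  def runOnPartition(partition: Seq[Int]): Future[Long] =
    Future(partition.map(_.toLong).sum)

  // Fan the same work out to every partition, then combine the partial results.
  def parallelTotal(partitions: Seq[Seq[Int]]): Long = {
    val partials: Future[Seq[Long]] = Future.sequence(partitions.map(runOnPartition))
    Await.result(partials, 10.seconds).sum
  }

  def main(args: Array[String]): Unit = {
    val partitions = (1 to 100).grouped(25).toVector // 4 partitions of 25 values each
    println(parallelTotal(partitions))
  }
}
```

The number of partitions is fixed when the data is laid out, which matches the tradeoff noted on the slide: parallelism is predefined rather than scheduled dynamically.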
Analytics in Application
Analytics in Custom Application/Framework
Benefits
1. Flexible & in app team control
2. All language libraries & frameworks available
3. Tailing the oplog gives near real-time updates
Tradeoffs
1. Data might not fit in memory
2. Threading managed by developer
Common Uses
1. Statistical analysis w/ R, Matlab, etc.
2. Advanced analytics & algos
3. Updating counts & aggregations
[Diagram: the Application queries raw data and gets results in real time; it can optionally store analyzed data back in the DB and can use a tailable cursor for tracking events]
Analytics in 3rd-Party Products
Benefits
1. Pre-built UI and toolkits
2. Supports almost all 3rd-party SQL-based tools
3. Can migrate to MongoDB & keep reporting tools
Tradeoffs
1. Optimal performance often requires configuring views
2. Joins between 2 sharded collections can be slow
Common Products
1. Pentaho, Jaspersoft, Alteryx
2. Tableau, Qlikview,
[Diagram: a BI or other analytics product sends a SQL query to the MongoDB BI Connector, which sends a MongoDB query to MongoDB; documents are returned to the connector and SQL result sets to the product. Label: Native Integrations]
Analytics in Distributed Processing Frameworks
Partitionable Analytics (e.g. MapReduce)
From http://www.milanor.net/blog/an-example-of-mapreduce-with-rmr2/
Partitionable Distributed Analytics
Benefits
1. Very parallelizable to scale horizontally
2. Intermediate results can be on disk, not necessarily memory
Tradeoff
1. Often significant overhead in learning the framework
Common Frameworks
1. Hadoop
2. Spark
[Diagram: a Master coordinates multiple Workers, each paired with a Mongos; partitions are lined up between workers & shards]
Iterative Analytics (e.g. Machine Learning)
From http://www.learnbymarketing.com/methods/k-means-clustering/
Iterative Distributed Analytics
Benefits
1. Great for machine learning
2. Memory-based frameworks can be much faster
Tradeoff
1. Harder overall to speed up with horizontal scaling
Common Framework
1. Spark
[Diagram: a Master coordinates Workers, each paired with a Mongos; stages of iterations might be partitionable]
Streaming Distributed Analytics
From http://docs.streambase.com/latest/index.jsp?topic=/com.streambase.sb.ide.help/data/html/admin/execorder.html
Streaming Distributed Analytics
Benefits
1. Analysis on current data
2. Can analyze incrementally to avoid batch windows
3. Can use some frameworks for streaming + batch
Tradeoffs
1. Depends on streaming sources being available
2. Some analytics cannot be calculated incrementally
Common Uses & Frameworks
1. Sentiment analysis
2. Spark Streaming, Storm, Flink, Kafka Streams
[Diagram: Event Sources feed Stream Processing Frameworks, which store events & analytic results and read historical or reference data on demand; a tailable cursor is also shown feeding a Stream Processing Framework]
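The point about incremental calculation can be made concrete with a running count and mean, which can be updated one event at a time without rereading history. An exact median, by contrast, generally cannot be maintained from such a small fixed-size state. The `RunningStats` type and the sample values below are hypothetical and not tied to any particular streaming framework.

```scala
// Hypothetical fixed-size state kept between events.
final case class RunningStats(count: Long, mean: Double) {
  // Fold one new event value into the state without revisiting earlier events.
  def add(x: Double): RunningStats = {
    val n = count + 1
    RunningStats(n, mean + (x - mean) / n)
  }
}

object StreamingSketch {
  val empty: RunningStats = RunningStats(0L, 0.0)

  def main(args: Array[String]): Unit = {
    val events = Seq(4.0, 8.0, 6.0) // stand-in for values arriving from a stream
    val stats = events.foldLeft(empty)(_ add _)
    println(stats)
  }
}
```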
Machine Learning Example with Spark
Recommendation Engine Problem Description
Given Users’ ratings for some Items, how to infer users’ ratings for all items
Useful for:
1. Recommendations
2. Cross-sell
3. Accurate targeting
Image from: https://www.mapr.com/ebooks/spark/08-recommendation-engine-spark.html
Alternating Least Squares (ALS) Algo
Image from http://netprophetblog.blogspot.com/2013/10/local-regression.html
2-dimensional
Given $f(x) = a x + b$
Can minimize $d = \sum_i (y_i - f(x_i))^2$
ALS approach
Fix a and solve for b
Alternate: fix b, and solve for a
ALS can extend to n-dimensional
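The two-dimensional toy case on this slide can be worked through directly. The sketch below alternates the two closed-form updates for minimizing the sum of squared residuals of f(x) = a*x + b: with a fixed, the best b is the mean of y_i - a*x_i; with b fixed, the best a is sum x_i*(y_i - b) / sum x_i^2. The function name, iteration count, and sample data are assumptions for illustration, and this is not the matrix-factorization ALS that Spark implements, only the same alternating idea in its simplest form.

```scala
object AlternatingLineFit {
  // Alternately minimize sum_i (y_i - (a*x_i + b))^2 over b (a fixed) and over a (b fixed).
  def fit(xs: Seq[Double], ys: Seq[Double], iterations: Int): (Double, Double) = {
    require(xs.nonEmpty && xs.length == ys.length, "need equally sized, non-empty inputs")
    val n = xs.length
    val sumXX = xs.map(x => x * x).sum
    require(sumXX > 0.0, "need at least one non-zero x")
    var a = 0.0
    var b = 0.0
    for (_ <- 1 to iterations) {
      // Fix a, solve for b: the best intercept is the mean residual y_i - a*x_i.
      b = xs.zip(ys).map { case (x, y) => y - a * x }.sum / n
      // Fix b, solve for a: the best slope is sum x_i*(y_i - b) / sum x_i^2.
      a = xs.zip(ys).map { case (x, y) => x * (y - b) }.sum / sumXX
    }
    (a, b)
  }

  def main(args: Array[String]): Unit = {
    val xs = Seq(0.0, 1.0, 2.0, 3.0, 4.0)
    val ys = xs.map(x => 2.0 * x + 1.0) // hypothetical noise-free data on the line y = 2x + 1
    println(fit(xs, ys, 200))
  }
}
```

Each half-step solves a one-variable least-squares problem exactly, so the squared error never increases from one step to the next.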
Example Solution
Image from: https://www.mapr.com/ebooks/spark/08-recommendation-engine-spark.html
Architecture of Solution
[Diagram: a Spark Master and two Spark Workers, with the Workers reading from and writing to MongoDB]
• Spark Master pushes ALSExampleMongoDB to Workers
• Each worker handles partitions of data as appropriate and also shuffle
• Worker reads its partition of User ratings for Items from MongoDB
• Worker writes its partition of data for predictions back to MongoDB
• On startup, shared libraries loaded by Workers
  1. MongoDB Spark Connector
  2. Java Driver
Full code for example can be found at:
https://github.com/matthewkalan/mongo-spark-recommender-example
Code for Configuration and Reading from MongoDB
import java.util.Calendar
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import com.mongodb.spark.sql._ //adds mongo() to the DataFrameReader
object ALSExampleMongoDB {
  def main(args: Array[String]) {
    val startTime = Calendar.getInstance().getTime() //used later to report the running time
    //a locally defined conf should only be used when run locally, because SparkContext.getOrCreate() reuses an already running SparkContext
    val sc = SparkContext.getOrCreate()
    val sqlContext = SQLContext.getOrCreate(sc)
    val inputUri = args(1) //pass MongoDB connection string from args
    //setting up DataFrame to read from MongoDB - Connector automatically partitions the data to spread across workers
    val ratingsAll = sqlContext.read.options(
      Map(
        "uri" -> inputUri
        //"localThreshold" -> "0", //Add these two parameters to connect to the nearest Mongos, if desired
        //"readPreference.name" -> "nearest",
        //"partitionerOptions.partitionSizeMB" -> "128", //Typically partitions should be 64 - 512 MB
        //"partitioner" -> "MongoSamplePartitioner" //If custom partitioner desired
      )).mongo()
    val userIdThreshold = args(3)
    val ratings = ratingsAll.filter(ratingsAll("userId") > userIdThreshold) //Filtering & aggregation pushed down to DB w/ indexes
    //caching the DataFrame in memory of Spark workers
    ratings.cache()
Code for Training ALS Algo and Making Predictions
    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.recommendation.ALS
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.DoubleType
    val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2)) //split into a training and test dataset
    // Build the recommendation model using ALS on the training data
    val als = new ALS()
      .setMaxIter(5)
      .setRegParam(0.01)
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")
    val model = als.fit(training) //train the model
    // Evaluate the model by computing the RMSE on the test data
    val predictions = model.transform(test)
      .withColumn("rating", col("rating").cast(DoubleType))
      .withColumn("prediction", col("prediction").cast(DoubleType))
    //remove NaN values if a user is not in both the training and test dataset
    val predictionsValidUsers = predictions.na.drop("any", Seq("rating", "prediction"))
    val evaluator = new RegressionEvaluator()
      .setMetricName("rmse")
      .setLabelCol("rating")
      .setPredictionCol("prediction")
    val rmse = evaluator.evaluate(predictionsValidUsers)
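For readers unfamiliar with the metric, the RMSE evaluated on this slide is the square root of the mean squared difference between actual and predicted ratings. A minimal plain-Scala sketch of that calculation follows; the `rmse` helper and the sample pairs are hypothetical and independent of Spark's RegressionEvaluator.

```scala
object RmseSketch {
  // Root-mean-square error over (actual rating, predicted rating) pairs.
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one pair")
    val squaredErrors = pairs.map { case (actual, predicted) => math.pow(actual - predicted, 2) }
    math.sqrt(squaredErrors.sum / pairs.length)
  }

  def main(args: Array[String]): Unit = {
    val pairs = Seq((4.0, 3.0), (2.0, 2.0), (5.0, 3.0), (1.0, 2.0)) // hypothetical ratings
    println(rmse(pairs))
  }
}
```

A lower value means predictions sit closer to the held-out ratings, in the same units as the ratings themselves.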
Code for Writing Predictions to MongoDB
    import com.mongodb.spark.MongoSpark
    //store the users' predictions back into MongoDB
    val outputUri = args(2)
    MongoSpark.save(predictionsValidUsers.write.option("uri", outputUri))
    //calculate and print running time in seconds
    //(startTime and the java.util.Calendar import are assumed from the start of the program)
    val endTime = Calendar.getInstance().getTime()
    val elapsedTime = (endTime.getTime() - startTime.getTime()) / 1000
    println(s"Elapsed time: $elapsedTime seconds")
  }
}
Quick Demo in Databricks
Summary
Start simple, expand as required
1. Aggregation Framework
2. Language libraries
3. 3rd Party Products
4. Distributed Processing Frameworks
Light bulb image from: http://smallbusinessbc.ca/article/five-ways-discover-additional-value-your-business/business-value-idea/
For More Information
Resource – Location
MongoDB Connector for Spark – github.com/mongodb/mongo-spark
Spark ALS Recommendation Engine Example – github.com/matthewkalan/mongo-spark-recommender-example
Blog: Future Big Data Architecture - Delivering on the Data Lake Vision – www.mongodb.com/blog/post/the-future-of-big-data-architecture
White Paper: Unlocking Operational Intelligence from the Data Lake – www.mongodb.com/collateral/unlocking-operational-intelligence-from-the-data-lake
Blog: Using MongoDB with Hadoop – www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
Free Online Training – university.mongodb.com
Documentation – docs.mongodb.org
MongoDB Downloads – mongodb.com/download
Architecting Wide-ranging Analytical Solutions with MongoDB

More Related Content

Similar to Architecting Wide-ranging Analytical Solutions with MongoDB

MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB
 
Apache Spark and MongoDB - Turning Analytics into Real-Time Action
Apache Spark and MongoDB - Turning Analytics into Real-Time ActionApache Spark and MongoDB - Turning Analytics into Real-Time Action
Apache Spark and MongoDB - Turning Analytics into Real-Time ActionJoão Gabriel Lima
 
MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014Dylan Tong
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
Spring data presentation
Spring data presentationSpring data presentation
Spring data presentationOleksii Usyk
 
Conceptos básicos. Seminario web 6: Despliegue de producción
Conceptos básicos. Seminario web 6: Despliegue de producciónConceptos básicos. Seminario web 6: Despliegue de producción
Conceptos básicos. Seminario web 6: Despliegue de producciónMongoDB
 
Headless approach for offloading heavy tasks in Magento
Headless approach for offloading heavy tasks in MagentoHeadless approach for offloading heavy tasks in Magento
Headless approach for offloading heavy tasks in MagentoSander Mangel
 
EEDC 2010. Scaling Web Applications
EEDC 2010. Scaling Web ApplicationsEEDC 2010. Scaling Web Applications
EEDC 2010. Scaling Web ApplicationsExpertos en TI
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsAli Hodroj
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Spark Summit
 
Drupal performance and scalability
Drupal performance and scalabilityDrupal performance and scalability
Drupal performance and scalabilityTwinbit
 
MongoDB at Gilt Groupe
MongoDB at Gilt GroupeMongoDB at Gilt Groupe
MongoDB at Gilt GroupeMongoDB
 
MongoDB : Scaling, Security & Performance
MongoDB : Scaling, Security & PerformanceMongoDB : Scaling, Security & Performance
MongoDB : Scaling, Security & PerformanceSasidhar Gogulapati
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Anant Corporation
 
Mongo db pefrormance optimization strategies
Mongo db pefrormance optimization strategiesMongo db pefrormance optimization strategies
Mongo db pefrormance optimization strategiesronwarshawsky
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDBMongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDBMongoDB
 

Similar to Architecting Wide-ranging Analytical Solutions with MongoDB (20)

MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
 
Apache Spark and MongoDB - Turning Analytics into Real-Time Action
Apache Spark and MongoDB - Turning Analytics into Real-Time ActionApache Spark and MongoDB - Turning Analytics into Real-Time Action
Apache Spark and MongoDB - Turning Analytics into Real-Time Action
 
MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Spring data presentation
Spring data presentationSpring data presentation
Spring data presentation
 
Conceptos básicos. Seminario web 6: Despliegue de producción
Conceptos básicos. Seminario web 6: Despliegue de producciónConceptos básicos. Seminario web 6: Despliegue de producción
Conceptos básicos. Seminario web 6: Despliegue de producción
 
mongodb tutorial
mongodb tutorialmongodb tutorial
mongodb tutorial
 
Mongodb
MongodbMongodb
Mongodb
 
Headless approach for offloading heavy tasks in Magento
Headless approach for offloading heavy tasks in MagentoHeadless approach for offloading heavy tasks in Magento
Headless approach for offloading heavy tasks in Magento
 
EEDC 2010. Scaling Web Applications
EEDC 2010. Scaling Web ApplicationsEEDC 2010. Scaling Web Applications
EEDC 2010. Scaling Web Applications
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 
MongoDB on Azure
MongoDB on AzureMongoDB on Azure
MongoDB on Azure
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
Drupal performance and scalability
Drupal performance and scalabilityDrupal performance and scalability
Drupal performance and scalability
 
MongoDB at Gilt Groupe
MongoDB at Gilt GroupeMongoDB at Gilt Groupe
MongoDB at Gilt Groupe
 
MongoDB : Scaling, Security & Performance
MongoDB : Scaling, Security & PerformanceMongoDB : Scaling, Security & Performance
MongoDB : Scaling, Security & Performance
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Mongo db pefrormance optimization strategies
Mongo db pefrormance optimization strategiesMongo db pefrormance optimization strategies
Mongo db pefrormance optimization strategies
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDBMongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
 

Recently uploaded

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Recently uploaded (20)

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Architecting Wide-ranging Analytical Solutions with MongoDB

  • 2. Matt Kalan Sr. Solutions & Enterprise Architect, MongoDB
  • 3. #MDBW16 Agenda Why Focus on Analytics 01 Analytics Scenarios 03 Relevant MongoDB Capabilities 02 Recommendation Engine With Spark 04 Quick Demo 05 Summary 06
  • 4. Why Focus on Analytics
  • 5. #MDBW16 How to Drive More Value From Data ? Light bulb image from: http://smallbusinessbc.ca/article/five-ways-discover-additional-value-your-business/business-value-idea/
  • 6. #MDBW16 So Many Options Part of image from: http://mattturck.com/wp-content/uploads/2016/01/matt_turck_big_data_landscape_full.png
  • 7. #MDBW16 Why Are Analytics Important? From http://www.bain.com/publications/capability-insights/advanced-analytics.aspx
  • 8. #MDBW16 What Criteria To Consider For Choosing Technology • Assumption: you identified what derived data/analytic(s) has ROI • Criteria • Operations on data (read/write, transform, aggregation, algorithm) • Time SLA – both how up-to-date data is and response times • Effort (training, development, management) • Processing model for analytic (partitionable, iterative, streaming, etc.) • Cost (data duplication, memory, servers, software)
  • 10. #MDBW16 MongoDB Capabilities to Highlights for Analytics Community/Open Source 1. Aggregation Framework 2. Reading from secondaries (priority = votes = 0 recommended) 3. Mongo Connector – replication to other MongoDB, search engines, etc. 4. Hadoop Connector – exposes MongoDB as native input/output for Hive, Pig, MR, etc. 5. Spark Connector – exposes MongoDB as an RDD/DataFrame/DataSet for read/write Enterprise Advanced 1. In-memory storage engine – now GA for production use 2. BI Connector – BI & SQL read access to MongoDB
  • 12. #MDBW16 Aggregation Pipeline Stages • $match Filter documents • $geoNear Geospherical query • $project Reshape documents • $lookup New – Left-outer joins • $unwind Expand arrays in documents • $group Summarize documents • $sample New – Randomly selects a subset of documents • $sort Order documents • $skip Jump over a number of documents • $limit Limit number of documents • $redact Restrict documents • $out Sends results to a new collection
  • 13. #MDBW16 Aggregation With a Sharded Database Workload split between shards 1. Client works through mongos as with any query 2. Shards execute pipeline up to a point 3. A single shard merges cursors and continues processing 4. $lookup & $out performed within Primary shard for the database
  • 16. #MDBW16 On-Demand Analytics with Agg FW Benefits 1. Up-to-date data 2. One technology 3. Only raw data stored 4. Flexible Tradeoff 1. Slow if scanning many documents Common Uses Groups, counts, sum, averages for small subsets of data Aggregation Framework Runtime agg pipeline Results in real-time Application
  • 17. #MDBW16 Offline Analytics With Aggregation Framework Benefits 1. One technology 2. Can filter at DB on aggregations 3. Low latency (in C++) Tradeoffs 1. Storing additional data 2. One thread per server/instance 3. Advanced functions not included Common Uses 1. Pre-calculating values across dataset 2. Batch transformations Aggregation Framework $out: “results” *Agg Pipeline Application * MapReduce also possible but slower (run in Javascript) and most requirements can be done in agg fw Outputting to a sharded collection with agg fw would be returned to driver and written from there to sharded collection Also can return data to application
  • 18. Microsharding for Highly Parallel Processing. Benefits: 1. Multiple threads for agg fw query per server 2. One technology. Tradeoffs: 1. Number of parallel threads and partitions in DB is predefined 2. No native job scheduling or resource management. Common Uses: analytics on large result sets to minimize latency. Diagram: the Application sends an agg pipeline through Mongos; on each server it runs in parallel on N partitions, and data is returned in parallel.
  • 20. Analytics in Custom Application/Framework. Benefits: 1. Flexible & in app team control 2. All language libraries & frameworks available 3. Tailing oplog gives near real-time. Tradeoffs: 1. Data might not fit in memory 2. Threading managed by developer. Common Uses: 1. Statistical analysis w/ R, Matlab, etc. 2. Advanced analytics & algos 3. Updating counts & aggregations. Diagram: the Application queries raw data and gets results in real time; it can optionally store analyzed data back in the DB and can use a tailable cursor for tracking events.
  • 21. Analytics in 3rd-Party Products (BI or other analytics product). Benefits: 1. Pre-built UI and toolkits 2. Supports almost all 3rd-party SQL-based tools 3. Can migrate to MongoDB & keep reporting tools. Tradeoffs: 1. Optimal performance often requires configuring views 2. Joins between 2 sharded collections can be slow. Common Products: 1. Pentaho, Jaspersoft, Alteryx 2. Tableau, Qlikview. Diagram labels: Native Integrations send a MongoDB query and get documents returned; the MongoDB BI Connector takes a SQL query and returns SQL result sets.
  • 23. Partitionable Analytics (e.g. MapReduce). Image from http://www.milanor.net/blog/an-example-of-mapreduce-with-rmr2/
  • 24. Partitionable Distributed Analytics. Benefits: 1. Very parallelizable to scale horizontally 2. Intermediate results can be on disk, not necessarily memory. Tradeoff: 1. Often significant overhead in learning the framework. Common Frameworks: 1. Hadoop 2. Spark. Diagram: a Master and several Workers, each Worker reading through a Mongos, with partitions lined up between workers & shards.
  • 25. Iterative Analytics (e.g. Machine Learning). Image from http://www.learnbymarketing.com/methods/k-means-clustering/
  • 26. Iterative Distributed Analytics. Benefits: 1. Great for machine learning 2. Memory-based frameworks can be much faster. Tradeoff: 1. Harder overall to speed up with horizontal scaling. Common Framework: 1. Spark. Diagram: a Master and Workers, each Worker reading through a Mongos; stages of iterations might be partitionable.
  • 27. Streaming Distributed Analytics. Image from http://docs.streambase.com/latest/index.jsp?topic=/com.streambase.sb.ide.help/data/html/admin/execorder.html
  • 28. Streaming Distributed Analytics. Benefits: 1. Analysis on current data 2. Can analyze incrementally to avoid batch windows 3. Can use some frameworks for streaming + batch. Tradeoffs: 1. Depends on streaming sources being available 2. Some analytics cannot be calculated incrementally. Common Uses & Frameworks: 1. Sentiment analysis 2. Spark Streaming, Storm, Flink, Kafka Streams. Diagram labels: Event Sources; Stream Processing Framework; storing events & analytic results; historical or reference data on-demand; tailable cursor.
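As a minimal sketch of what "analyze incrementally" can mean, the following keeps a running mean that is updated per event, without rescanning earlier events. It uses plain Scala rather than any streaming framework, and `RunningMean` and the sample events are hypothetical.

```scala
// Sketch of an incrementally maintained analytic: a running mean.
// Assumption: events arrive one at a time as plain Double values.
object IncrementalMeanSketch {
  // Small running state that could be stored alongside the events.
  final case class RunningMean(count: Long, mean: Double) {
    // Fold one new event into the state without revisiting earlier events.
    def update(x: Double): RunningMean = {
      val n = count + 1
      RunningMean(n, mean + (x - mean) / n)
    }
  }

  val empty: RunningMean = RunningMean(0L, 0.0)

  def main(args: Array[String]): Unit = {
    val events = Seq(4.0, 8.0, 6.0)
    val finalState = events.foldLeft(empty)(_.update(_))
    println(finalState)
  }
}
```

A mean works because count and current mean are enough state to absorb the next value. An exact median is the usual counterexample: it generally needs the full set of values, which illustrates the second tradeoff on the slide.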
  • 30. Recommendation Engine Problem Description. Given users' ratings for some items, how to infer users' ratings for all items? Useful for: 1. Recommendations 2. Cross-sell 3. Accurate targeting. Image from: https://www.mapr.com/ebooks/spark/08-recommendation-engine-spark.html
  • 31. Alternating Least Squares (ALS) Algo. 2-dimensional: given $f(x) = a x + b$, can minimize $d = \sum_i (y_i - f(x_i))^2$. ALS approach: fix a and solve for b; alternate: fix b and solve for a. ALS can extend to n-dimensional. Image from http://netprophetblog.blogspot.com/2013/10/local-regression.html
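The 2-dimensional case on this slide can be sketched in a few lines of plain Scala. This is an illustration of the alternating idea only, not the Spark ALS implementation; `fit`, the starting values of 0, and the sample points are assumptions made for the example. Each half-step has a closed-form solution because the squared error is quadratic in a for fixed b, and in b for fixed a.

```scala
// Sketch: alternating least squares for the line f(x) = a*x + b.
// Assumption: names, starting values and data are illustrative only.
object AlternatingLineFitSketch {
  // Alternately solve for b with a fixed, then for a with b fixed,
  // each step minimizing sum_i (y_i - (a*x_i + b))^2 in closed form.
  def fit(xs: Seq[Double], ys: Seq[Double], iterations: Int): (Double, Double) = {
    require(xs.nonEmpty && xs.length == ys.length, "need equally sized, non-empty inputs")
    val sumXX = xs.map(x => x * x).sum
    require(sumXX > 0.0, "need at least one non-zero x")
    var a = 0.0
    var b = 0.0
    for (_ <- 1 to iterations) {
      // Fix a, solve for b: b is the mean of the residuals y_i - a*x_i.
      b = xs.zip(ys).map { case (x, y) => y - a * x }.sum / xs.length
      // Fix b, solve for a: a = sum_i x_i*(y_i - b) / sum_i x_i^2.
      a = xs.zip(ys).map { case (x, y) => x * (y - b) }.sum / sumXX
    }
    (a, b)
  }

  def main(args: Array[String]): Unit = {
    val xs = Seq(0.0, 1.0, 2.0, 3.0)
    val ys = xs.map(x => 2.0 * x + 1.0) // points on the line y = 2x + 1
    val (a, b) = fit(xs, ys, iterations = 200)
    println(f"a = $a%.4f, b = $b%.4f")
  }
}
```

For the recommendation problem, the same alternation is applied to two sets of unknowns (user factors and item factors) instead of the two scalars a and b.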
  • 32. Example Solution. Image from: https://www.mapr.com/ebooks/spark/08-recommendation-engine-spark.html
  • 33. Architecture of Solution. Spark Master: pushes ALSExampleMongoDB to Workers. Spark Workers: each worker handles partitions of data as appropriate and also shuffles; each worker reads its partition of user ratings for items from MongoDB and writes its partition of prediction data back to MongoDB. On startup, shared libraries loaded by Workers: 1. MongoDB Spark Connector 2. Java Driver. Full code for the example can be found at: https://github.com/matthewkalan/mongo-spark-recommender-example
  • 34. Code for Configuration and Reading from MongoDB
      import org.apache.spark.SparkContext
      import org.apache.spark.sql.SQLContext
      import com.mongodb.spark.sql._ // adds .mongo() to the DataFrameReader

      object ALSExampleMongoDB {
        def main(args: Array[String]): Unit = {
          // This conf should only be used when run locally, because
          // SparkContext.getOrCreate() reuses an already running SparkContext
          val sc = SparkContext.getOrCreate()
          val sqlContext = SQLContext.getOrCreate(sc)

          val inputUri = args(1) // pass MongoDB connection string from args

          // Set up a DataFrame to read from MongoDB; the Connector automatically
          // partitions the data to spread it across workers
          val ratingsAll = sqlContext.read.options(
            Map(
              "uri" -> inputUri
              //"localThreshold" -> "0",            // add these two parameters to connect to the nearest Mongos, if desired
              //"readPreference.name" -> "nearest",
              //"partitionerOptions.partitionSizeMB" -> "128", // typically partitions should be 64 - 512 MB
              //"partitioner" -> "MongoSamplePartitioner"      // if a custom partitioner is desired
            )).mongo()

          val userIdThreshold = args(3)
          // Filtering & aggregation are pushed down to the DB w/ indexes
          val ratings = ratingsAll.filter(ratingsAll("userId") > userIdThreshold)

          // Cache the DataFrame in memory of the Spark workers
          ratings.cache()
  • 35. Code for Training ALS Algo and Making Predictions
          import org.apache.spark.ml.evaluation.RegressionEvaluator
          import org.apache.spark.ml.recommendation.ALS
          import org.apache.spark.sql.functions.col
          import org.apache.spark.sql.types.DoubleType

          // Split into a training and a test dataset
          val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))

          // Build the recommendation model using ALS on the training data
          val als = new ALS()
            .setMaxIter(5)
            .setRegParam(0.01)
            .setUserCol("userId")
            .setItemCol("movieId")
            .setRatingCol("rating")
          val model = als.fit(training) // train the model

          // Evaluate the model by computing the RMSE on the test data
          val predictions = model.transform(test)
            .withColumn("rating", col("rating").cast(DoubleType))
            .withColumn("prediction", col("prediction").cast(DoubleType))

          // Remove NaN values if a user is not in both the training and test dataset
          val predictionsValidUsers = predictions.na.drop("any", Seq("rating", "prediction"))

          val evaluator = new RegressionEvaluator()
            .setMetricName("rmse")
            .setLabelCol("rating")
            .setPredictionCol("prediction")
          val rmse = evaluator.evaluate(predictionsValidUsers)
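For readers unfamiliar with the metric, here is a small sketch of what the RMSE step on slide 35 computes. It uses plain Scala, not Spark's RegressionEvaluator; `dropNaN`, `rmse`, and the sample values are hypothetical. NaN pairs are dropped first, then the square root of the mean squared difference between rating and prediction is taken.

```scala
// Sketch of the evaluation step: drop NaN pairs, then compute RMSE.
// Assumption: (rating, prediction) pairs are held in a local Seq for illustration.
object RmseSketch {
  // Keep only pairs where both the rating and the prediction are real numbers.
  def dropNaN(pairs: Seq[(Double, Double)]): Seq[(Double, Double)] =
    pairs.filterNot { case (r, p) => r.isNaN || p.isNaN }

  // Root mean squared error over (rating, prediction) pairs.
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (rating, prediction) pair")
    math.sqrt(pairs.map { case (r, p) => (r - p) * (r - p) }.sum / pairs.length)
  }

  def main(args: Array[String]): Unit = {
    val raw = Seq((3.0, 2.0), (4.0, 4.0), (5.0, 6.0), (2.0, Double.NaN))
    println(rmse(dropNaN(raw)))
  }
}
```

A lower RMSE means predictions sit closer to the held-out ratings, in the same units as the ratings themselves.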
  • 36. Code for Writing Predictions to MongoDB
          import java.util.Calendar
          import com.mongodb.spark.MongoSpark

          // Store the users' predictions back into MongoDB
          val outputUri = args(2)
          MongoSpark.save(predictionsValidUsers.write.option("uri", outputUri))

          // Calculate running time in seconds (the print is not shown on the slide);
          // startTime is not shown either and is assumed to be captured at the start of main
          val endTime = Calendar.getInstance().getTime()
          val elapsedTime = (endTime.getTime() - startTime.getTime()) / 1000
  • 37. Quick Demo in Databricks
  • 39. Start simple, expand as required: 1. Aggregation Framework 2. Language libraries 3. 3rd Party Products 4. Distributed Processing Frameworks. Light bulb image from: http://smallbusinessbc.ca/article/five-ways-discover-additional-value-your-business/business-value-idea/
  • 40. For More Information (Resource: Location). MongoDB Connector for Spark: github.com/mongodb/mongo-spark; Spark ALS Recommendation Engine Example: github.com/matthewkalan/mongo-spark-recommender-example; Blog, Future Big Data Architecture - Delivering on the Data Lake Vision: www.mongodb.com/blog/post/the-future-of-big-data-architecture; White Paper, Unlocking Operational Intelligence from the Data Lake: www.mongodb.com/collateral/unlocking-operational-intelligence-from-the-data-lake; Blog, Using MongoDB with Hadoop: www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup; Free Online Training: university.mongodb.com; Documentation: docs.mongodb.org; MongoDB Downloads: mongodb.com/download

Editor's Notes

  1. Explain that I mean a broad definition of analytics: really any derived data. Addresses: is MongoDB enough? Should I be using other products in addition?
  2. Poll audience for what analytics they are considering
  3. For aggregation operations that run on multiple shards, if the operations do not require running on the database’s primary shard, these operations can route the results to any shard to merge the results and avoid overloading the primary shard for that database. Aggregation operations that require running on the database’s primary shard are the $out stage and $lookup stage. Note: place before the scenario that deals with this & remove some bullets
  4. Replica set or shards are hidden in the database icon. The application uses a programming language driver to send the agg pipeline. Example: total balance or value of a customer, total number of posts, especially for a given entity (i.e. one you can filter on) and NOT for the whole database. This is the simplest and most common scenario: get the data you want and then call a library in the application.
  5. Example: good for pre-calculating totals and aggregations, e.g. balances, documents, dollar values, etc. If you need to generate bulk reports, you could send the data back to the reporting tool.
  6. Example: good for longer-running jobs, e.g. a top-10 bank has a personal in-memory data mart with 2GB allocated per person for report data (from their 2PB DW), spread across all shards so it is queried in parallel. Note: be sure to explain an easily digestible example and point out that it is not a common pattern. This can be on each server, or you can shard across instances to get parallelism. The main concept here is sharding earlier than otherwise necessary to get parallelism in analytical processing.
  7. Previous slides were focused on agg fw. Point this out because some people hear "analytics" and think of Hadoop/Spark, but there are many libraries and analytics in Java, Python, R, etc. If the data can be filtered well, the latency should be similar for an analytic in the app vs. in agg fw (the difference is between C++ and the language in use). Example: using R, Matlab, and other statistical packages directly against MongoDB.
  8. Examples are SAS, Tableau, etc., or any tool that is read-only from the DB.
  9. Point out that you could even run the Workers on the same server as each MongoDB node, but you have to know in advance how big an instance to use. Having the Worker node separate (and it is stateless) allows the Worker to be sized dynamically depending on the job.
  10. Point out that you could even run the Workers on the same server as each MongoDB node, but you have to know in advance how big an instance to use. Having the Worker node separate (and it is stateless) allows the Worker to be sized dynamically depending on the job. This can be combined with partitionable portions of the algo so that each iteration is partitionable. Then in-memory processing and distribution are both important.