5. #MDBW16
How to Drive More Value From Data?
Light bulb image from: http://smallbusinessbc.ca/article/five-ways-discover-additional-value-your-business/business-value-idea/
6. #MDBW16
So Many Options
Part of image from: http://mattturck.com/wp-content/uploads/2016/01/matt_turck_big_data_landscape_full.png
7. #MDBW16
Why Are Analytics Important?
From http://www.bain.com/publications/capability-insights/advanced-analytics.aspx
8. #MDBW16
What Criteria to Consider for Choosing Technology
• Assumption: you have identified which derived data/analytic(s) have ROI
• Criteria
• Operations on data (read/write, transform, aggregation, algorithm)
• Time SLA – both how up-to-date data is and response times
• Effort (training, development, management)
• Processing model for analytic (partitionable, iterative, streaming, etc.)
• Cost (data duplication, memory, servers, software)
10. #MDBW16
MongoDB Capabilities to Highlight for Analytics
Community/Open Source
1. Aggregation Framework
2. Reading from secondaries (priority = votes = 0 recommended)
3. Mongo Connector – replication to other MongoDB, search engines, etc.
4. Hadoop Connector – exposes MongoDB as native input/output for Hive, Pig, MR, etc.
5. Spark Connector – exposes MongoDB as an RDD/DataFrame/DataSet for read/write
Enterprise Advanced
1. In-memory storage engine – now GA for production use
2. BI Connector – BI & SQL read access to MongoDB
12. #MDBW16
Aggregation Pipeline Stages
• $match – Filter documents
• $geoNear – Geospatial query
• $project – Reshape documents
• $lookup – New: left-outer joins
• $unwind – Expand arrays in documents
• $group – Summarize documents
• $sample – New: randomly selects a subset of documents
• $sort – Order documents
• $skip – Jump over a number of documents
• $limit – Limit number of documents
• $redact – Restrict documents
• $out – Sends results to a new collection
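A minimal sketch of how several of these stages combine into one pipeline, written as a PyMongo-style list of stage documents. The "orders" collection and its fields are hypothetical, chosen only to illustrate the stage order:

```python
# Hypothetical "orders" collection: {cust_id, status, amount, ...}
# Pipeline: filter, summarize per customer, order, and cap the result set.
pipeline = [
    {"$match": {"status": "A"}},                       # filter documents
    {"$group": {"_id": "$cust_id",                     # summarize documents
                "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},                          # order documents
    {"$limit": 5},                                     # limit number of documents
]
# Against a live connection this would run as:
#   db.orders.aggregate(pipeline)
```

Stages execute top to bottom, each consuming the previous stage's output, which is why $match is placed first (it can use indexes and shrinks the data early).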
13. #MDBW16
Aggregation With a Sharded Database
Workload split between shards
1. Client works through mongos as with any query
2. Shards execute pipeline up to a point
3. A single shard merges cursors and continues processing
4. $lookup & $out performed within the primary shard for the database
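The split-then-merge flow can be sketched in miniature. This is a simplified Python model of the idea, not the server's implementation: each shard computes partial $group sums over its own documents, then a single merging shard combines the partials.

```python
def partial_group(docs):
    """Stage run independently on each shard: partial $sum per group key."""
    out = {}
    for d in docs:
        out[d["key"]] = out.get(d["key"], 0) + d["amount"]
    return out

def merge_partials(partials):
    """Stage run on the single merging shard: combine the partial sums."""
    merged = {}
    for p in partials:
        for k, v in p.items():
            merged[k] = merged.get(k, 0) + v
    return merged

# Two shards, each holding part of the collection:
shard1 = [{"key": "a", "amount": 1}, {"key": "b", "amount": 2}]
shard2 = [{"key": "a", "amount": 3}]
totals = merge_partials([partial_group(shard1), partial_group(shard2)])
# totals == {"a": 4, "b": 2}, the same result as grouping the full collection
```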
16. #MDBW16
On-Demand Analytics with Agg FW
Benefits
1. Up-to-date data
2. One technology
3. Only raw data stored
4. Flexible
Tradeoff
1. Slow if scanning many documents
Common Uses
Groups, counts, sums, averages for small subsets of data
[Diagram: application sends an agg pipeline at runtime to the Aggregation Framework; results returned in real time]
17. #MDBW16
Offline Analytics With Aggregation Framework
Benefits
1. One technology
2. Can filter at DB on aggregations
3. Low latency (in C++)
Tradeoffs
1. Storing additional data
2. One thread per server/instance
3. Advanced functions not included
Common Uses
1. Pre-calculating values across dataset
2. Batch transformations
[Diagram: application runs an agg pipeline ending in $out: "results", writing to a new collection; data can also be returned to the application]
* MapReduce is also possible but slower (it runs in JavaScript), and most requirements can be met with the agg fw. Output from an agg pipeline to a sharded collection would be returned to the driver and written from there to the sharded collection.
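An offline pipeline of this shape might look like the following PyMongo-style sketch. The "transactions" collection and field names are hypothetical; the point is the trailing $out stage that persists the pre-calculated values:

```python
# Nightly batch: pre-calculate per-account balances across the dataset
# and persist them for fast reads. Collection/field names are illustrative.
pipeline = [
    {"$group": {"_id": "$account_id",
                "balance": {"$sum": "$amount"},
                "txn_count": {"$sum": 1}}},
    {"$out": "results"},   # write the aggregated documents to a new collection
]
# db.transactions.aggregate(pipeline)  # run as an offline/batch job
```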
18. #MDBW16
Microsharding for Highly Parallel Processing
Benefits
1. Multiple threads for agg fw query per server
2. One technology
Tradeoffs
1. # of parallel threads and partitions in DB predefined
2. No native job scheduling or resource management
Common Uses
Analytics on large result sets to minimize latency
[Diagram: application sends the agg pipeline through mongos, run in parallel on N partitions per server; data returned in parallel]
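The microsharding idea, stripped to its essence: run the same aggregation concurrently on every predefined partition, then merge. A toy Python sketch (lists stand in for micro-shards, a sum stands in for the agg pipeline):

```python
from concurrent.futures import ThreadPoolExecutor

def aggregate_partition(docs):
    """Stand-in for one agg pipeline run against one partition (micro-shard)."""
    return sum(d["amount"] for d in docs)

# Pretend each list is the data held by one of N predefined partitions.
partitions = [
    [{"amount": 1}, {"amount": 2}],
    [{"amount": 3}],
    [{"amount": 4}, {"amount": 5}],
]

# Run the same aggregation in parallel on every partition, then merge.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    total = sum(pool.map(aggregate_partition, partitions))
# total == 15
```

This is why the number of partitions must be predefined: the parallelism is fixed by how the data was split, not negotiated at query time.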
20. #MDBW16
Analytics in Custom Application/Framework
Benefits
1. Flexible & in app team's control
2. All language libraries & frameworks available
3. Tailing oplog gives near real-time
Tradeoffs
1. Data might not fit in memory
2. Threading managed by developer
Common Uses
1. Statistical analysis w/ R, Matlab, etc.
2. Advanced analytics & algos
3. Updating counts & aggregations
[Diagram: application queries raw data and computes results in real time; analyzed data optionally stored back in the DB; a tailable cursor can track events]
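App-side analytics in this pattern means: query the raw documents, then compute with whatever library you like. A minimal self-contained Python stand-in (the `summarize` helper and the "score" field are hypothetical; in practice the `docs` would come from `collection.find(query)` and the math from NumPy/R/etc.):

```python
def summarize(docs, field):
    """App-side analytic over documents fetched from the database."""
    values = [d[field] for d in docs]
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n  # population variance
    return {"n": n, "mean": mean, "variance": variance}

# Documents as they might be returned by a query on raw data:
docs = [{"score": 2.0}, {"score": 4.0}, {"score": 6.0}]
stats = summarize(docs, "score")
```

The tradeoff listed above shows up here directly: `values` materializes the whole result set in application memory.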
21. #MDBW16
Analytics in 3rd-Party Products
Benefits
1. Pre-built UI and toolkits
2. Supports most 3rd-party SQL-based tools
3. Can migrate to MongoDB & keep reporting tools
Tradeoffs
1. Optimal performance often requires configuring views
2. Joins between 2 sharded collections can be slow
Common Products
1. Pentaho, Jaspersoft, Alteryx
2. Tableau, QlikView
[Diagram: SQL queries go through the MongoDB BI Connector and SQL result sets are returned; native integrations issue MongoDB queries directly and documents are returned]
24. #MDBW16
Partitionable Distributed Analytics
Benefits
1. Very parallelizable to scale horizontally
2. Intermediate results can be on disk, not necessarily memory
Tradeoff
1. Often significant overhead in learning the framework
Common Frameworks
1. Hadoop
2. Spark
[Diagram: a Master coordinates Workers, each paired with a mongos; partitions lined up between workers & shards]
26. #MDBW16
Iterative Distributed Analytics
Benefits
1. Great for machine learning
2. Memory-based frameworks can be much faster
Tradeoff
1. Harder overall to speed up with horizontal scaling
Common Framework
1. Spark
[Diagram: a Master coordinates Workers reading via mongos; stages of iterations might be partitionable]
28. #MDBW16
Streaming Distributed Analytics
Benefits
1. Analysis on current data
2. Can analyze incrementally to avoid batch windows
3. Can use some frameworks for streaming + batch
Tradeoffs
1. Depends on streaming sources being available
2. Some analytics cannot be calculated incrementally
Common Uses & Frameworks
1. Sentiment analysis
2. Spark Streaming, Storm, Flink, Kafka Streams
[Diagram: event sources feed a stream processing framework, which stores events & analytic results in MongoDB and pulls historical or reference data on demand, e.g. via a tailable cursor]
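"Analyze incrementally" means the analytic is updated per event rather than recomputed over a batch. A toy Python sketch of the state a streaming framework would keep (a running mean is one of the analytics that *can* be computed incrementally; a median, for example, cannot):

```python
class RunningMean:
    """Per-event incremental aggregation - no batch window required.
    A toy stand-in for the state kept by a stream processing framework."""
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, value):
        self.n += 1
        self.total += value
        return self.total / self.n   # current mean after this event

rm = RunningMean()
means = [rm.update(v) for v in [1.0, 2.0, 3.0, 4.0]]
# The final incremental mean equals the batch mean over the same events.
```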
30. #MDBW16
Recommendation Engine Problem Description
Given users’ ratings for some items, how to infer users’ ratings for all items?
Useful for:
1. Recommendations
2. Cross-sell
3. Accurate targeting
Image from: https://www.mapr.com/ebooks/spark/08-recommendation-engine-spark.html
31. #MDBW16
Alternating Least Squares (ALS) Algo
Image from http://netprophetblog.blogspot.com/2013/10/local-regression.html
2-dimensional case:
Given f(x) = a·x + b, minimize the sum of squared residuals
d = Σᵢ (yᵢ − f(xᵢ))²
ALS approach: fix a and solve for b; alternate: fix b and solve for a.
ALS extends to the n-dimensional case.
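The 2-dimensional case on this slide can be coded directly: each half-step has a closed-form minimum, and alternating them converges to the least-squares line. A self-contained Python sketch (the function name and data are illustrative, not from the talk):

```python
def als_line_fit(xs, ys, iters=200):
    """Alternating least squares for f(x) = a*x + b:
    fix a and solve for b, then fix b and solve for a, repeat."""
    a, b = 0.0, 0.0
    n = len(xs)
    sxx = sum(x * x for x in xs)
    for _ in range(iters):
        b = sum(y - a * x for x, y in zip(xs, ys)) / n               # a held fixed
        a = sum(x * (y - b) for x, y in zip(xs, ys)) / sxx           # b held fixed
    return a, b

# Points lying exactly on y = 2x + 1, so ALS should recover a=2, b=1:
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = als_line_fit(xs, ys)
```

In the recommendation setting the same alternation happens over user-factor and item-factor matrices instead of two scalars, which is what Spark's ALS implements.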
33. #MDBW16
Architecture of Solution
[Diagram: the Spark Master pushes ALSExampleMongoDB to the Spark Workers; each worker handles its partitions of data as appropriate, as well as shuffle]
• Each worker reads its partition of user ratings for items from MongoDB
• Each worker writes its partition of prediction data back to MongoDB
• On startup, shared libraries are loaded by the Workers: 1. MongoDB Spark Connector, 2. Java Driver
Full code for example can be found at:
https://github.com/matthewkalan/mongo-spark-recommender-example
34. #MDBW16
Code for Configuration and Reading from MongoDB
object ALSExampleMongoDB {
  def main(args: Array[String]) {
    // this conf should only be used when run locally because sc.getOrCreate() reuses already running SparkContexts
    val sc = SparkContext.getOrCreate()
    val sqlContext = SQLContext.getOrCreate(sc)
    val inputUri = args(1)  // pass MongoDB connection string from args
    // setting up DataFrame to read from MongoDB - Connector automatically partitions the data to spread across workers
    val ratingsAll = sqlContext.read.options(
      Map(
        "uri" -> inputUri
        //"localThreshold" -> "0",                        // Add these two parameters to connect to the nearest mongos, if desired
        //"readPreference.name" -> "nearest",
        //"partitionerOptions.partitionSizeMB" -> "128",  // Typically partitions should be 64 - 512 MB
        //"partitioner" -> "MongoSamplePartitioner"       // If a custom partitioner is desired
      )).mongo()
    val userIdThreshold = args(3)
    val ratings = ratingsAll.filter(ratingsAll("userId") > userIdThreshold)  // Filtering & aggregation pushed down to DB w/ indexes
    // caching the DataFrame in memory of the Spark workers
    ratings.cache()
35. #MDBW16
Code for Training ALS Algo and Making Predictions
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2)) //split into a training and test dataset
// Build the recommendation model using ALS on the training data
val als = new ALS()
.setMaxIter(5)
.setRegParam(0.01)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating")
val model = als.fit(training) //train the model
// Evaluate the model by computing the RMSE on the test data
val predictions = model.transform(test)
.withColumn("rating", col("rating").cast(DoubleType))
.withColumn("prediction", col("prediction").cast(DoubleType))
//remove NaN values if a user is not in both the training and test dataset
val predictionsValidUsers = predictions.na.drop("any", Seq("rating", "prediction"))
val evaluator = new RegressionEvaluator()
.setMetricName("rmse")
.setLabelCol("rating")
.setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictionsValidUsers)
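The metric the RegressionEvaluator computes with setMetricName("rmse") reduces to a short formula; a hand-rolled Python equivalent for illustration (the example ratings are made up):

```python
from math import sqrt

def rmse(ratings, predictions):
    """Root-mean-square error: sqrt of the mean squared difference
    between true ratings and model predictions."""
    n = len(ratings)
    return sqrt(sum((r - p) ** 2 for r, p in zip(ratings, predictions)) / n)

# Example: true ratings vs. model predictions
err = rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0])
```

Lower is better; an RMSE of 0 would mean the model reproduces every held-out rating exactly, which is why dropping NaN rows first (as above) matters - a single NaN would poison the mean.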
36. #MDBW16
Code for Writing Predictions to MongoDB
//store the users predictions back into MongoDB
var outputUri = args(2)
MongoSpark.save(predictionsValidUsers.write.option("uri", outputUri))
//calculate and print running time in seconds
val endTime = Calendar.getInstance().getTime()
var elapsedTime = (endTime.getTime() - startTime.getTime()) / 1000
39. #MDBW16
Start simple, expand as required
1. Aggregation Framework
2. Language libraries
3. 3rd-Party Products
4. Distributed Processing Frameworks
Light bulb image from: http://smallbusinessbc.ca/article/five-ways-discover-additional-value-your-business/business-value-idea/
40. #MDBW16
For More Information
MongoDB Connector for Spark – github.com/mongodb/mongo-spark
Spark ALS Recommendation Engine Example – github.com/matthewkalan/mongo-spark-recommender-example
Blog: Future Big Data Architecture – Delivering on the Data Lake Vision – www.mongodb.com/blog/post/the-future-of-big-data-architecture
White Paper: Unlocking Operational Intelligence from the Data Lake – www.mongodb.com/collateral/unlocking-operational-intelligence-from-the-data-lake
Blog: Using MongoDB with Hadoop – www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
Free Online Training – university.mongodb.com
Documentation – docs.mongodb.org
MongoDB Downloads – mongodb.com/download
Editor's Notes
Explain I mean a broad definition for analytics, really any derived data
Addresses: is MongoDB enough? Should I be using other products in addition?
Poll audience for what analytics they are considering
For aggregation operations that run on multiple shards, if the operations do not require running on the database’s primary shard, these operations can route the results to any shard to merge the results and avoid overloading the primary shard for that database. Aggregation operations that require running on the database’s primary shard are the $out stage and $lookup stage.
Note: place before the scenario that deals with this & remove some bullets
Replica set or shards are hidden in the database icon
Application uses a programming language driver to send agg pipeline
Example: total balance or value of customer, total number of posts, esp. for a given entity (i.e. can filter) and NOT for the whole database
Obviously the simplest and most common, get the data you want and then call a library in the application
Example: good for pre-calculating totals and aggregations, e.g. balances, documents, dollar values, etc.
If need to generate bulk reports, could send the data back to the reporting tool
Example: good for longer-running jobs, e.g. a Top 10 bank has a personal in-memory data mart with 2 GB allocated per person for report data (from their 2 PB DW), spread across all shards so it is queried in parallel
Note: Be sure to explain an easily digestible example and point out it is not a common pattern
This can be on each server or you can shard across instances to get parallelism – the main concept here is sharding earlier than otherwise necessary to get parallelism in analytical processing
Previous slides were focused on agg fw
Point this out because some hear "analytics" and think Hadoop/Spark – but there are many libraries and analytics in Java, Python, R, etc.
If data can be filtered well, the latency should be similar for analytic in app vs. agg fw (difference between C++ and language in use)
Example: Using R, Matlab, and other statistical packages directly against MongoDB
Example are SAS, Tableau, etc. or any tool that is read-only from the DB
Point out could even run the Workers on the same server as each MongoDB node, but have to know in advance how big an instance to use. Having the Worker node separate (and it is stateless) allows the Worker to be sized dynamically depending on the job
Can be combined with partitionable portions of algo so that each iteration is partitionable. Then in-memory and distribution are important