5. #MDBW16
How to Drive More Value From Data?
Light bulb image from: http://smallbusinessbc.ca/article/five-ways-discover-additional-value-your-business/business-value-idea/
6. #MDBW16
So Many Options
Part of image from: http://mattturck.com/wp-content/uploads/2016/01/matt_turck_big_data_landscape_full.png
7. #MDBW16
Why Are Analytics Important?
From http://www.bain.com/publications/capability-insights/advanced-analytics.aspx
8. #MDBW16
What Criteria to Consider for Choosing Technology
• Assumption: you have identified which derived data/analytic(s) have ROI
• Criteria
• Operations on data (read/write, transform, aggregation, algorithm)
• Time SLA – both how up-to-date data is and response times
• Effort (training, development, management)
• Processing model for analytic (partitionable, iterative, streaming, etc.)
• Cost (data duplication, memory, servers, software)
10. #MDBW16
MongoDB Capabilities to Highlight for Analytics
Community/Open Source
1. Aggregation Framework
2. Reading from secondaries (priority = votes = 0 recommended)
3. Mongo Connector – replication to other MongoDB, search engines, etc.
4. Hadoop Connector – exposes MongoDB as native input/output for Hive, Pig, MR, etc.
5. Spark Connector – exposes MongoDB as an RDD/DataFrame/DataSet for read/write
Enterprise Advanced
1. In-memory storage engine – now GA for production use
2. BI Connector – BI & SQL read access to MongoDB
12. #MDBW16
Aggregation Pipeline Stages
• $match – Filter documents
• $geoNear – Geospatial query
• $project – Reshape documents
• $lookup – New: left-outer joins
• $unwind – Expand arrays in documents
• $group – Summarize documents
• $sample – New: randomly selects a subset of documents
• $sort – Order documents
• $skip – Jump over a number of documents
• $limit – Limit number of documents
• $redact – Restrict documents
• $out – Sends results to a new collection
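A minimal sketch of how several of these stages combine into one pipeline, written as a PyMongo-style list of stage documents. The "orders" collection and its fields are hypothetical, chosen only to illustrate the stage order:

```python
# Hypothetical "orders" collection: {cust_id, status, amount, ...}
# Pipeline: filter, summarize per customer, order, and cap the result set.
pipeline = [
    {"$match": {"status": "A"}},                       # filter documents
    {"$group": {"_id": "$cust_id",                     # summarize documents
                "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},                          # order documents
    {"$limit": 5},                                     # limit number of documents
]
# Against a live connection this would run as:
#   db.orders.aggregate(pipeline)
```

Stages execute top to bottom, each consuming the previous stage's output, which is why $match is placed first (it can use indexes and shrinks the data early).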
13. #MDBW16
Aggregation With a Sharded Database
Workload split between shards
1. Client works through mongos as with any query
2. Shards execute pipeline up to a point
3. A single shard merges cursors and continues processing
4. $lookup & $out performed within the primary shard for the database
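The split-then-merge flow can be sketched in miniature. This is a simplified Python model of the idea, not the server's implementation: each shard computes partial $group sums over its own documents, then a single merging shard combines the partials.

```python
def partial_group(docs):
    """Stage run independently on each shard: partial $sum per group key."""
    out = {}
    for d in docs:
        out[d["key"]] = out.get(d["key"], 0) + d["amount"]
    return out

def merge_partials(partials):
    """Stage run on the single merging shard: combine the partial sums."""
    merged = {}
    for p in partials:
        for k, v in p.items():
            merged[k] = merged.get(k, 0) + v
    return merged

# Two shards, each holding part of the collection:
shard1 = [{"key": "a", "amount": 1}, {"key": "b", "amount": 2}]
shard2 = [{"key": "a", "amount": 3}]
totals = merge_partials([partial_group(shard1), partial_group(shard2)])
# totals == {"a": 4, "b": 2}, the same result as grouping the full collection
```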
16. #MDBW16
On-Demand Analytics with Agg FW
Benefits
1. Up-to-date data
2. One technology
3. Only raw data stored
4. Flexible
Tradeoff
1. Slow if scanning many documents
Common Uses
Groups, counts, sums, averages for small subsets of data
[Diagram: application sends an agg pipeline at runtime to the Aggregation Framework; results returned in real time]
17. #MDBW16
Offline Analytics With Aggregation Framework
Benefits
1. One technology
2. Can filter at DB on aggregations
3. Low latency (in C++)
Tradeoffs
1. Storing additional data
2. One thread per server/instance
3. Advanced functions not included
Common Uses
1. Pre-calculating values across dataset
2. Batch transformations
[Diagram: application runs an agg pipeline ending in $out: "results", writing to a new collection; data can also be returned to the application]
* MapReduce is also possible but slower (it runs in JavaScript), and most requirements can be met with the agg fw. Output from an agg pipeline to a sharded collection would be returned to the driver and written from there to the sharded collection.
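An offline pipeline of this shape might look like the following PyMongo-style sketch. The "transactions" collection and field names are hypothetical; the point is the trailing $out stage that persists the pre-calculated values:

```python
# Nightly batch: pre-calculate per-account balances across the dataset
# and persist them for fast reads. Collection/field names are illustrative.
pipeline = [
    {"$group": {"_id": "$account_id",
                "balance": {"$sum": "$amount"},
                "txn_count": {"$sum": 1}}},
    {"$out": "results"},   # write the aggregated documents to a new collection
]
# db.transactions.aggregate(pipeline)  # run as an offline/batch job
```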
18. #MDBW16
Microsharding for Highly Parallel Processing
Benefits
1. Multiple threads for agg fw query per server
2. One technology
Tradeoffs
1. # of parallel threads and partitions in DB predefined
2. No native job scheduling or resource management
Common Uses
Analytics on large result sets to minimize latency
[Diagram: application sends the agg pipeline through mongos, run in parallel on N partitions per server; data returned in parallel]
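The microsharding idea, stripped to its essence: run the same aggregation concurrently on every predefined partition, then merge. A toy Python sketch (lists stand in for micro-shards, a sum stands in for the agg pipeline):

```python
from concurrent.futures import ThreadPoolExecutor

def aggregate_partition(docs):
    """Stand-in for one agg pipeline run against one partition (micro-shard)."""
    return sum(d["amount"] for d in docs)

# Pretend each list is the data held by one of N predefined partitions.
partitions = [
    [{"amount": 1}, {"amount": 2}],
    [{"amount": 3}],
    [{"amount": 4}, {"amount": 5}],
]

# Run the same aggregation in parallel on every partition, then merge.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    total = sum(pool.map(aggregate_partition, partitions))
# total == 15
```

This is why the number of partitions must be predefined: the parallelism is fixed by how the data was split, not negotiated at query time.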
20. #MDBW16
Analytics in Custom Application/Framework
Benefits
1. Flexible & in app team's control
2. All language libraries & frameworks available
3. Tailing oplog gives near real-time
Tradeoffs
1. Data might not fit in memory
2. Threading managed by developer
Common Uses
1. Statistical analysis w/ R, Matlab, etc.
2. Advanced analytics & algos
3. Updating counts & aggregations
[Diagram: application queries raw data and computes results in real time; analyzed data optionally stored back in the DB; a tailable cursor can track events]
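App-side analytics in this pattern means: query the raw documents, then compute with whatever library you like. A minimal self-contained Python stand-in (the `summarize` helper and the "score" field are hypothetical; in practice the `docs` would come from `collection.find(query)` and the math from NumPy/R/etc.):

```python
def summarize(docs, field):
    """App-side analytic over documents fetched from the database."""
    values = [d[field] for d in docs]
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n  # population variance
    return {"n": n, "mean": mean, "variance": variance}

# Documents as they might be returned by a query on raw data:
docs = [{"score": 2.0}, {"score": 4.0}, {"score": 6.0}]
stats = summarize(docs, "score")
```

The tradeoff listed above shows up here directly: `values` materializes the whole result set in application memory.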
21. #MDBW16
Analytics in 3rd-Party Products
Benefits
1. Pre-built UI and toolkits
2. Supports most 3rd-party SQL-based tools
3. Can migrate to MongoDB & keep reporting tools
Tradeoffs
1. Optimal performance often requires configuring views
2. Joins between 2 sharded collections can be slow
Common Products
1. Pentaho, Jaspersoft, Alteryx
2. Tableau, QlikView
[Diagram: SQL queries go through the MongoDB BI Connector and SQL result sets are returned; native integrations issue MongoDB queries directly and documents are returned]
24. #MDBW16
Partitionable Distributed Analytics
Benefits
1. Very parallelizable to scale horizontally
2. Intermediate results can be on disk, not necessarily memory
Tradeoff
1. Often significant overhead in learning the framework
Common Frameworks
1. Hadoop
2. Spark
[Diagram: a Master coordinates Workers, each paired with a mongos; partitions lined up between workers & shards]
26. #MDBW16
Iterative Distributed Analytics
Benefits
1. Great for machine learning
2. Memory-based frameworks can be much faster
Tradeoff
1. Harder overall to speed up with horizontal scaling
Common Framework
1. Spark
[Diagram: a Master coordinates Workers reading via mongos; stages of iterations might be partitionable]
28. #MDBW16
Streaming Distributed Analytics
Benefits
1. Analysis on current data
2. Can analyze incrementally to avoid batch windows
3. Can use some frameworks for streaming + batch
Tradeoffs
1. Depends on streaming sources being available
2. Some analytics cannot be calculated incrementally
Common Uses & Frameworks
1. Sentiment analysis
2. Spark Streaming, Storm, Flink, Kafka Streams
[Diagram: event sources feed a stream processing framework, which stores events & analytic results in MongoDB and pulls historical or reference data on demand, e.g. via a tailable cursor]
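"Analyze incrementally" means the analytic is updated per event rather than recomputed over a batch. A toy Python sketch of the state a streaming framework would keep (a running mean is one of the analytics that *can* be computed incrementally; a median, for example, cannot):

```python
class RunningMean:
    """Per-event incremental aggregation - no batch window required.
    A toy stand-in for the state kept by a stream processing framework."""
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, value):
        self.n += 1
        self.total += value
        return self.total / self.n   # current mean after this event

rm = RunningMean()
means = [rm.update(v) for v in [1.0, 2.0, 3.0, 4.0]]
# The final incremental mean equals the batch mean over the same events.
```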
30. #MDBW16
Recommendation Engine Problem Description
Given users’ ratings for some items, how to infer users’ ratings for all items?
Useful for:
1. Recommendations
2. Cross-sell
3. Accurate targeting
Image from: https://www.mapr.com/ebooks/spark/08-recommendation-engine-spark.html
31. #MDBW16
Alternating Least Squares (ALS) Algo
Image from http://netprophetblog.blogspot.com/2013/10/local-regression.html
2-dimensional case:
Given f(x) = a·x + b, minimize the sum of squared residuals
d = Σᵢ (yᵢ − f(xᵢ))²
ALS approach: fix a and solve for b; alternate: fix b and solve for a.
ALS extends to the n-dimensional case.
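The 2-dimensional case on this slide can be coded directly: each half-step has a closed-form minimum, and alternating them converges to the least-squares line. A self-contained Python sketch (the function name and data are illustrative, not from the talk):

```python
def als_line_fit(xs, ys, iters=200):
    """Alternating least squares for f(x) = a*x + b:
    fix a and solve for b, then fix b and solve for a, repeat."""
    a, b = 0.0, 0.0
    n = len(xs)
    sxx = sum(x * x for x in xs)
    for _ in range(iters):
        b = sum(y - a * x for x, y in zip(xs, ys)) / n               # a held fixed
        a = sum(x * (y - b) for x, y in zip(xs, ys)) / sxx           # b held fixed
    return a, b

# Points lying exactly on y = 2x + 1, so ALS should recover a=2, b=1:
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = als_line_fit(xs, ys)
```

In the recommendation setting the same alternation happens over user-factor and item-factor matrices instead of two scalars, which is what Spark's ALS implements.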
33. #MDBW16
Architecture of Solution
[Diagram: the Spark Master pushes ALSExampleMongoDB to the Spark Workers; each worker handles its partitions of data as appropriate, as well as shuffle]
• Each worker reads its partition of user ratings for items from MongoDB
• Each worker writes its partition of prediction data back to MongoDB
• On startup, shared libraries are loaded by the Workers: 1. MongoDB Spark Connector, 2. Java Driver
Full code for example can be found at:
https://github.com/matthewkalan/mongo-spark-recommender-example
34. #MDBW16
Code for Configuration and Reading from MongoDB
object ALSExampleMongoDB {
  def main(args: Array[String]) {
    // this conf should only be used when run locally because sc.getOrCreate() reuses already running SparkContexts
    val sc = SparkContext.getOrCreate()
    val sqlContext = SQLContext.getOrCreate(sc)
    val inputUri = args(1)  // pass MongoDB connection string from args
    // setting up DataFrame to read from MongoDB - Connector automatically partitions the data to spread across workers
    val ratingsAll = sqlContext.read.options(
      Map(
        "uri" -> inputUri
        //"localThreshold" -> "0",                        // Add these two parameters to connect to the nearest mongos, if desired
        //"readPreference.name" -> "nearest",
        //"partitionerOptions.partitionSizeMB" -> "128",  // Typically partitions should be 64 - 512 MB
        //"partitioner" -> "MongoSamplePartitioner"       // If a custom partitioner is desired
      )).mongo()
    val userIdThreshold = args(3)
    val ratings = ratingsAll.filter(ratingsAll("userId") > userIdThreshold)  // Filtering & aggregation pushed down to DB w/ indexes
    // caching the DataFrame in memory of the Spark workers
    ratings.cache()
35. #MDBW16
Code for Training ALS Algo and Making Predictions
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2)) //split into a training and test dataset
// Build the recommendation model using ALS on the training data
val als = new ALS()
.setMaxIter(5)
.setRegParam(0.01)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating")
val model = als.fit(training) //train the model
// Evaluate the model by computing the RMSE on the test data
val predictions = model.transform(test)
.withColumn("rating", col("rating").cast(DoubleType))
.withColumn("prediction", col("prediction").cast(DoubleType))
//remove NaN values if a user is not in both the training and test dataset
val predictionsValidUsers = predictions.na.drop("any", Seq("rating", "prediction"))
val evaluator = new RegressionEvaluator()
.setMetricName("rmse")
.setLabelCol("rating")
.setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictionsValidUsers)
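The metric the RegressionEvaluator computes with setMetricName("rmse") reduces to a short formula; a hand-rolled Python equivalent for illustration (the example ratings are made up):

```python
from math import sqrt

def rmse(ratings, predictions):
    """Root-mean-square error: sqrt of the mean squared difference
    between true ratings and model predictions."""
    n = len(ratings)
    return sqrt(sum((r - p) ** 2 for r, p in zip(ratings, predictions)) / n)

# Example: true ratings vs. model predictions
err = rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0])
```

Lower is better; an RMSE of 0 would mean the model reproduces every held-out rating exactly, which is why dropping NaN rows first (as above) matters - a single NaN would poison the mean.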
36. #MDBW16
Code for Writing Predictions to MongoDB
//store the users predictions back into MongoDB
var outputUri = args(2)
MongoSpark.save(predictionsValidUsers.write.option("uri", outputUri))
//calculate and print running time in seconds
val endTime = Calendar.getInstance().getTime()
var elapsedTime = (endTime.getTime() - startTime.getTime()) / 1000
39. #MDBW16
Start simple, expand as required
1. Aggregation Framework
2. Language libraries
3. 3rd-Party Products
4. Distributed Processing Frameworks
Light bulb image from: http://smallbusinessbc.ca/article/five-ways-discover-additional-value-your-business/business-value-idea/
40. #MDBW16
For More Information
MongoDB Connector for Spark – github.com/mongodb/mongo-spark
Spark ALS Recommendation Engine Example – github.com/matthewkalan/mongo-spark-recommender-example
Blog: Future Big Data Architecture – Delivering on the Data Lake Vision – www.mongodb.com/blog/post/the-future-of-big-data-architecture
White Paper: Unlocking Operational Intelligence from the Data Lake – www.mongodb.com/collateral/unlocking-operational-intelligence-from-the-data-lake
Blog: Using MongoDB with Hadoop – www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
Free Online Training – university.mongodb.com
Documentation – docs.mongodb.org
MongoDB Downloads – mongodb.com/download
Editor's Notes
Explain I mean a broad definition for analytics, really any derived data
Addresses: is MongoDB enough? Should I be using other products in addition?
Poll audience for what analytics they are considering
For aggregation operations that run on multiple shards, if the operations do not require running on the database’s primary shard, these operations can route the results to any shard to merge the results and avoid overloading the primary shard for that database. Aggregation operations that require running on the database’s primary shard are the $out stage and $lookup stage.
Note: place before the scenario that deals with this & remove some bullets
Replica set or shards are hidden in the database icon
Application uses a programming language driver to send agg pipeline
Example: total balance or value of customer, total number of posts, esp. for a given entity (i.e. can filter) and NOT for the whole database
Obviously the simplest and most common, get the data you want and then call a library in the application
Example: good for pre-calculating totals and aggregations, e.g. balances, documents, dollar values, etc.
If need to generate bulk reports, could send the data back to the reporting tool
Example: good for longer-running jobs, e.g. a Top 10 bank has a personal in-memory data mart with 2 GB allocated per person for report data (from their 2 PB DW), spread across all shards so it is queried in parallel
Note: Be sure to explain an easily digestible example and point out it is not a common pattern
This can be on each server or you can shard across instances to get parallelism – the main concept here is sharding earlier than otherwise necessary to get parallelism in analytical processing
Previous slides were focused on agg fw
Point this out because some hear "analytics" and think Hadoop/Spark – but there are many libraries and analytics in Java, Python, R, etc.
If data can be filtered well, the latency should be similar for analytic in app vs. agg fw (difference between C++ and language in use)
Example: Using R, Matlab, and other statistical packages directly against MongoDB
Example are SAS, Tableau, etc. or any tool that is read-only from the DB
Point out could even run the Workers on the same server as each MongoDB node, but have to know in advance how big an instance to use. Having the Worker node separate (and it is stateless) allows the Worker to be sized dynamically depending on the job
Can be combined with partitionable portions of algo so that each iteration is partitionable. Then in-memory and distribution are important