Blazing Fast Analytics with MongoDB & Spark

Muthu Chinnasamy
Senior Solutions Architect
muthu@mongodb.com
Twitter: @MuthuMongo

Agenda
The data challenge
Spark
Use Cases
Connectors
Demo

“Every two days now we create as much information as we did from the dawn of civilization up until 2003.”
Eric Schmidt, 2010

“Apache Spark is the Taylor Swift of big data software.”
Derrick Harris, Fortune

What is Spark?
Fast and general computing engine for clusters
• Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, graph processing
• It’s fundamentally different to what’s come before

Why not just use Hadoop?
• Spark is FAST
–Faster to write.
–Faster to run.
• Up to 100x faster than Hadoop MapReduce in memory
• Up to 10x faster on disk

A visual comparison
Hadoop vs. Spark

RDD Operations
Transformations    Actions
map                reduce
filter             collect
flatMap            count
mapPartitions      save
sample             lookupKey
union              take
join               foreach
groupByKey
reduceByKey
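
The two columns differ in when work happens: a transformation only describes a new dataset, and nothing is computed until an action asks for a result. The sketch below imitates that behaviour with plain Python generators. It is an analogy rather than the Spark API, and the class name `LazyDataset` is hypothetical.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for an RDD: transformations are lazy, actions run the chain."""

    def __init__(self, source):
        # `source` is a zero-argument callable returning a fresh iterator,
        # so the dataset can be evaluated more than once.
        self._source = source

    # Transformations: return a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self._source() if pred(x)))

    # Actions: force evaluation and return a value.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())

    def reduce(self, fn):
        return reduce(fn, self._source())


calls = []


def square(x):
    calls.append(x)  # record when the function really runs
    return x * x


numbers = LazyDataset(lambda: iter(range(1, 6)))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)  # nothing has been computed yet
result = evens_squared.collect()  # the action triggers the whole chain
```

Chaining `map` and `filter` costs nothing until `collect` runs, which is why `calls` stays empty before the action.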

Spark higher level libraries
Spark SQL | Spark Streaming | MLlib | GraphX
Spark

Data Management
OLTP                       Offline Processing
Applications               Analytics
Fine grained operations    Data Warehousing
Low Latency                High Throughput

Spark + MongoDB top use cases:
– Business Intelligence
– Data Warehousing
– Recommendation
– Log processing
– User Facing Services
– Fraud detection

Spark reading directly from MongoDB

Aggregation pipeline to Pre-filter
Aggregation pipeline filter: $match
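
Pre-filtering means non-matching documents are dropped inside MongoDB, so less data is handed to Spark. A pipeline is a list of stage documents. The sketch below builds a single `$match` stage and applies a simplified, equality-only model of it to in-memory documents to show the effect. The field name `lang`, the sample documents, and the helper `apply_match` are assumptions for illustration; the real matching runs in the MongoDB server.

```python
# An aggregation pipeline is a list of stage documents; this one has a
# single $match stage. The field name "lang" is an illustrative assumption.
pipeline = [{"$match": {"lang": "en"}}]


def apply_match(documents, stage):
    """Simplified model of $match: keep documents whose fields equal the
    values in the stage. Real $match supports many more operators."""
    criteria = stage["$match"]
    return [
        doc for doc in documents
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


tweets = [
    {"id_str": "1", "lang": "en"},
    {"id_str": "2", "lang": "fr"},
    {"id_str": "3", "lang": "en"},
]

# Only the matching documents would be passed on to Spark.
prefiltered = apply_match(tweets, pipeline[0])
```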

Spark writing directly to MongoDB

Fraud Detection
I'm so in love!
Me, too<3
Now send me your
CC number
?
Ok, XXXX-123-zzz
$$$

Sharing Workloads
Chat App
HDFS HDFS HDFS
Archiving
Data Crunching
Login
User Profile
Contacts
Messages
…
Fraud Detection
Segmentation
Recommendations
Spark

MongoDB Spark Connector
https://spark-packages.org/?q=official+mongodb

MongoDB
Spark
Connector
MongoDB
Shard
Spark
MongoDB Spark Connector
https://github.com/mongodb/mongo-spark

Spark Streaming
Twitter Feed → Spark

Spark Streaming
Twitter Feed
{
  "statuses": [
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Mon Sep 24 03:35:21 +0000 2012",
      "id_str": "250075927172759552",
      "entities": {
        "urls": [],
        "hashtags": [
          {
            "text": "freebandnames",
            "indices": [20, 34]
          }
        ],
        "user_mentions": []
      }
    }
  ]
}

Spark Streaming
Twitter Feed: four tweets tagged "freebandnames", each created at "Mon Sep 24 03:35:21 +0000 2012" → Spark
Count after the first tweet:
{
  "time": "Mon Sep 24 03:35",
  "freebandnames": 1
}
Count after all four tweets:
{
  "time": "Mon Sep 24 03:35",
  "freebandnames": 4
}
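
The per-minute hashtag counts can be sketched in plain Python as follows. This is a batch illustration of the aggregation only, not the Spark Streaming API. The function names `minute_of` and `count_hashtags` are hypothetical, and the input is assumed to be a list of status documents shaped like the tweet shown earlier.

```python
from collections import Counter, defaultdict


def minute_of(created_at):
    """'Mon Sep 24 03:35:21 +0000 2012' -> 'Mon Sep 24 03:35'."""
    return created_at.rsplit(":", 1)[0]


def count_hashtags(statuses):
    """Return one document per minute with a count for each hashtag seen."""
    per_minute = defaultdict(Counter)
    for status in statuses:
        minute = minute_of(status["created_at"])
        for tag in status["entities"]["hashtags"]:
            per_minute[minute][tag["text"]] += 1
    return [{"time": minute, **counts} for minute, counts in per_minute.items()]


tweet = {
    "created_at": "Mon Sep 24 03:35:21 +0000 2012",
    "entities": {"hashtags": [{"text": "freebandnames", "indices": [20, 34]}]},
}

# Four tweets carrying the same hashtag within the same minute.
summary = count_hashtags([tweet] * 4)
```

Each resulting document has the same shape as the count documents on the slide, so it could be written back to MongoDB as-is.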

Capped Collection
MongoDB and Spark Streaming feature
{
"time": "Mon Sep 24 03:35",
"freebandnames": 4
}
{
"time": "Mon Nov 5 09:40",
"mongoDBLondon": 400
}
{
"time": "Mon Nov 5 11:50",
"spark": 7556
}
{
"time": "Mon Nov 24 12:50",
"itshappening": 100
}
Tailable Cursor
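
A capped collection behaves like a fixed-size buffer that keeps insertion order and discards the oldest documents once it is full. The sketch below models only that eviction behaviour with a Python `deque`. It is an analogy, not MongoDB code, and the limit of three documents is a hypothetical value (real capped collections are sized in bytes, with an optional maximum document count).

```python
from collections import deque

# Hypothetical limit of three documents, chosen for illustration.
capped = deque(maxlen=3)

counts = [
    {"time": "Mon Sep 24 03:35", "freebandnames": 4},
    {"time": "Mon Nov 5 09:40", "mongoDBLondon": 400},
    {"time": "Mon Nov 5 11:50", "spark": 7556},
    {"time": "Mon Nov 24 12:50", "itshappening": 100},
]

for doc in counts:
    capped.append(doc)  # once full, the oldest document is dropped

oldest_kept = capped[0]["time"]
newest = capped[-1]["time"]
```

A tailable cursor then plays the role of a reader that stays at the end of this buffer and receives each new document as it is appended.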

Collaborative Filtering
• Two parts
• Collaborative: uses rating preferences from many users
• Filtering: recommends items based on those preferences
UserId / MovieId Star Wars Toy Story Frozen
Buzz 4 4 5
Woody 5 4
Jessie 5 ?
Movie Ratings as a matrix

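MLlib ALS
• Approximate the ratings matrix as the product of User & Movie latent factor matrices
Ratings matrix (users Buzz, Woody, Jessie × movies Frozen, Toy Story, Star Wars)
≈ user factor matrix (latent factors x, y for each user)
× movie factor matrix (latent factors x, y for each movie)
Each rating rij is approximated from the user factors f(i) and the movie factors f(j)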
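
In this kind of factorization, a predicted rating rij is the dot product of user i's factor vector f(i) and movie j's factor vector f(j). A minimal sketch, assuming two latent factors per user and per movie; the factor values are made up for illustration and are not taken from the talk.

```python
def predict_rating(user_factors, movie_factors):
    """Predicted rating r_ij = f(i) . f(j), the dot product of the two vectors."""
    return sum(u * m for u, m in zip(user_factors, movie_factors))


# Hypothetical latent factors (x, y); real values are learned by ALS.
jessie = [1.5, 0.5]
frozen = [2.0, 2.0]

predicted = predict_rating(jessie, frozen)  # 1.5*2.0 + 0.5*2.0
```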

Prediction Process
• Load movie ratings data from MongoDB
• Reflect and Infer the input formats for the ALS algorithm
• Split the data
–80% for training and 20% for validating the model
• Calculate the best model using ALS algorithm
–Build/train a User Movie matrix model
• Combine the data with user preferences and retrain the model
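
The split, train, and validate steps can be sketched in plain Python. This is a simplified rank-1 alternating least squares with plain ridge regularization, not MLlib's implementation. The ratings are synthetic stand-ins for documents loaded from MongoDB, the split is a fixed 80/20 holdout rather than a random one, and all names and parameter values (`train_als`, `reg`, `iterations`) are assumptions for illustration.

```python
import math

# Synthetic (user, movie, rating) triples: 4 users x 5 movies = 20 ratings.
taste = [1.0, 0.9, 0.8, 1.1]          # hypothetical per-user values
appeal = [4.0, 3.0, 4.5, 2.0, 3.5]    # hypothetical per-movie values
ratings = [(u, m, taste[u] * appeal[m]) for u in range(4) for m in range(5)]

# Hold out one rating in five (20%) for validation, train on the rest.
validation = [r for r in ratings if r[0] == r[1]]
training = [r for r in ratings if r[0] != r[1]]


def train_als(training, n_users, n_movies, reg=0.01, iterations=10):
    """Rank-1 ALS: fix one side's factors, solve the other side, repeat."""
    user_f = [1.0] * n_users
    movie_f = [1.0] * n_movies
    for _ in range(iterations):
        # Fix movie factors; each user's factor has a closed-form solution.
        for u in range(n_users):
            rated = [(m, r) for (uu, m, r) in training if uu == u]
            num = sum(r * movie_f[m] for m, r in rated)
            den = reg + sum(movie_f[m] ** 2 for m, _ in rated)
            user_f[u] = num / den
        # Fix user factors; solve each movie's factor the same way.
        for m in range(n_movies):
            raters = [(u, r) for (u, mm, r) in training if mm == m]
            num = sum(r * user_f[u] for u, r in raters)
            den = reg + sum(user_f[u] ** 2 for u, _ in raters)
            movie_f[m] = num / den
    return user_f, movie_f


def rmse(data, user_f, movie_f):
    """Root mean squared error of predicted vs. actual ratings."""
    errors = [(r - user_f[u] * movie_f[m]) ** 2 for u, m, r in data]
    return math.sqrt(sum(errors) / len(errors))


user_f, movie_f = train_als(training, n_users=4, n_movies=5)
validation_rmse = rmse(validation, user_f, movie_f)
```

In practice the validation error would be compared across several parameter settings to pick the best model, and the final retraining step would add the new user's own ratings to the training data.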

Explore as a Databricks Notebook
http://cdn2.hubspot.net/hubfs/438089/notebooks/MongoDB_guest_blog/Using_MongoDB_Connector_for_Spark.html

China Eastern Airlines – Fare Engine
130K seats, 180 million fares & 1.6 billion daily searches

Spark and MongoDB
• An extremely powerful combination
• Many possible use cases
• Some operations are actually faster if performed using the Aggregation Framework
• Evolving all the time

Questions?
Muthu Chinnasamy
muthu@mongodb.com
@muthumongo
