5. “Every two days now we create as much information as we did from the dawn of civilization up until 2003.”
Eric Schmidt, 2010
7. “Apache Spark is the Taylor Swift of big data software.”
Derrick Harris, Fortune
8. What is Spark?
Fast and general computing engine for clusters
• Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, and graph processing
• It’s fundamentally different to what’s come before
9. Why not just use Hadoop?
• Spark is FAST
–Faster to write.
–Faster to run.
• Up to 100x faster than Hadoop MapReduce in memory
• Up to 10x faster on disk
32. Collaborative Filtering
• Two parts
–Collaborative: uses rating preferences from many users
–Filtering: recommends items based on those preferences
UserId / MovieId Star Wars Toy Story Frozen
Buzz 4 4 5
Woody 5 4
Jessie 5 ?
Movie Ratings as a matrix
33. MLlib ALS
• Approximate the ratings matrix as the product of user and movie latent factor matrices
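[Figure: the UserId / MovieId ratings matrix (rows Buzz, Woody, Jessie; columns Frozen, Toy Story, Star Wars; unrated cells left blank) is approximated by the product of a user factor matrix with two latent factors (x, y) per user and a movie factor matrix with two latent factors (x, y) per movie. A rating rij is predicted from the user factors f(i) and the movie factors f(j).]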
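The factorization on this slide can be sketched in a few lines of plain Python. This is a minimal illustration of alternating least squares, not MLlib's implementation: it uses a fixed regularization weight `lam`, rank 2 to match the x/y factors in the figure, and a tiny hand-written linear solver. The function names (`solve`, `update`, `als`, `predict`) are hypothetical. Buzz's ratings follow the slide; which movies Woody and Jessie rated is an assumption made only so the example runs.

```python
import random

# Illustrative (user, movie, rating) triples. Buzz's row follows the slide;
# the movies assigned to Woody and Jessie are assumptions.
RATINGS = [
    ("Buzz", "Star Wars", 4.0), ("Buzz", "Toy Story", 4.0), ("Buzz", "Frozen", 5.0),
    ("Woody", "Star Wars", 5.0), ("Woody", "Toy Story", 4.0),
    ("Jessie", "Toy Story", 5.0),
]


def solve(a, b):
    """Solve the small dense system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def update(fixed, observed_by_key, rank, lam):
    """Holding one side's factors fixed, ridge-regress the other side's factors."""
    result = {}
    for key, observed in observed_by_key.items():
        # Normal equations: (sum of v v^T + lam * I) f = sum of r * v
        a = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other, r in observed:
            v = fixed[other]
            for i in range(rank):
                b[i] += r * v[i]
                for j in range(rank):
                    a[i][j] += v[i] * v[j]
        result[key] = solve(a, b)
    return result


def als(ratings, rank=2, lam=0.1, iterations=20, seed=0):
    """Alternate between solving for user factors and movie factors."""
    rng = random.Random(seed)
    by_user, by_movie = {}, {}
    for user, movie, r in ratings:
        by_user.setdefault(user, []).append((movie, r))
        by_movie.setdefault(movie, []).append((user, r))
    movies = {m: [rng.random() for _ in range(rank)] for m in by_movie}
    users = {}
    for _ in range(iterations):
        users = update(movies, by_user, rank, lam)
        movies = update(users, by_movie, rank, lam)
    return users, movies


def predict(users, movies, user, movie):
    """Estimate rij as the dot product of f(i) and f(j)."""
    return sum(x * y for x, y in zip(users[user], movies[movie]))
```

Calling `users, movies = als(RATINGS)` and then `predict(users, movies, "Jessie", "Frozen")` gives an estimate for a cell that has no rating in the input. Each half-step is an ordinary least-squares problem, which is what makes the approach easy to parallelize across users and movies.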
34. Prediction Process
• Load movie ratings data from MongoDB
• Reflect and infer the input formats for the ALS algorithm
• Split the data
–80% for training and 20% for validating the model
• Calculate the best model using the ALS algorithm
–Build/train a User Movie matrix model
• Combine the data with user preferences and retrain the model
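The split-and-select steps above can be sketched without Spark. The example below is an assumption-laden illustration: the ratings are synthetic, the helper names (`split`, `rmse`, `pick_best`) are hypothetical, and two simple baseline models stand in for the ALS configurations that the real process would compare. RMSE is used here as the validation metric, which the slide does not specify.

```python
import math
import random


def split(ratings, train_fraction=0.8, seed=42):
    """Shuffle a copy of the ratings and cut it into training and validation parts."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def rmse(predict, ratings):
    """Root-mean-square error of predict(user, movie) over held-out ratings."""
    squared = [(predict(user, movie) - r) ** 2 for user, movie, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


def fit_global_mean(train):
    """Stand-in model: always predict the mean training rating."""
    mean = sum(r for _, _, r in train) / len(train)
    return lambda user, movie: mean


def fit_movie_mean(train):
    """Stand-in model: predict each movie's mean rating, else the global mean."""
    overall = fit_global_mean(train)
    totals = {}
    for _, movie, r in train:
        count, total = totals.get(movie, (0, 0.0))
        totals[movie] = (count + 1, total + r)
    means = {movie: total / count for movie, (count, total) in totals.items()}
    return lambda user, movie: means.get(movie, overall(user, movie))


def pick_best(train, validation, candidates):
    """Fit every candidate on the training part and keep the lowest validation RMSE."""
    scores = {name: rmse(fit(train), validation) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


# Synthetic ratings, generated only so the sketch has something to run on.
RATINGS = [
    (f"user{u}", f"movie{m}", float(1 + (u * 3 + m * 2) % 5))
    for u in range(10)
    for m in range(5)
]

train, validation = split(RATINGS)  # 80% training, 20% validation
best, scores = pick_best(
    train,
    validation,
    {"global_mean": fit_global_mean, "movie_mean": fit_movie_mean},
)
```

In the real pipeline each candidate would be an ALS model trained with different settings, and the winner would then be retrained on the ratings combined with the user's own preferences.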
35. Explore as a Databricks Notebook
http://cdn2.hubspot.net/hubfs/438089/notebooks/MongoDB_guest_blog/Using_MongoDB_Connector_for_Spark.html
38. Spark and MongoDB
• An extremely powerful combination
• Many possible use cases
• Some operations are actually faster if performed using the MongoDB Aggregation Framework
• Evolving all the time
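As an example of work that may be better done inside MongoDB, here is a sketch of an aggregation pipeline that ranks movies by average rating. The document layout (one document per rating with `movieId` and `rating` fields), the output field names, and the `top_movies_pipeline` helper are assumptions for illustration; the stage operators themselves are standard Aggregation Framework stages.

```python
def top_movies_pipeline(min_ratings=50, limit=10):
    """Build an aggregation pipeline over a hypothetical ratings collection."""
    return [
        # Group rating documents by movie, computing the mean and the count.
        {"$group": {
            "_id": "$movieId",
            "avgRating": {"$avg": "$rating"},
            "numRatings": {"$sum": 1},
        }},
        # Drop movies with too few ratings to be meaningful.
        {"$match": {"numRatings": {"$gte": min_ratings}}},
        # Highest average rating first.
        {"$sort": {"avgRating": -1}},
        {"$limit": limit},
    ]


pipeline = top_movies_pipeline()
```

With a driver such as pymongo, passing this list to a collection's `aggregate` method runs the grouping on the server, so only the small summarized result needs to leave the database.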