This Edureka "Apache Spark Training" tutorial shows how Apache Spark works in practice by walking through a Movie Recommendation project built with Apache Spark. The topics covered are:
1) Use Cases Of Real Time Analytics
2) Movie Recommendation System Using Spark
3) What Is Spark?
4) Getting Movie Dataset
5) Spark Streaming
6) Collaborative Filtering
7) Spark MLlib
8) Fetching Results
9) Storing Results
www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Use Cases of Real Time Analytics
Government
Government agencies perform real-time analysis mostly in the field of national security.
Countries need to continuously keep track of all military and police agencies for updates regarding threats to security.
Healthcare
The healthcare domain uses real-time analysis to continuously check the medical status of critical patients.
Hospitals on the lookout for blood and organ transplants need to stay in real-time contact with each other during emergencies.
Getting medical attention on time is a matter of life and death for patients.
Telecommunications
Companies whose services revolve around calls, video chats and streaming use real-time analysis to reduce customer churn and stay ahead of the competition.
They also extract measurements of jitter and delay in mobile networks to improve customer experience.
Banking
Banks handle almost all of the world's money, so it becomes very important to ensure fault-tolerant transactions across the whole system.
Real-time analytics makes fraud detection possible in banking.
Stock Market
Stock brokers use real-time analytics to predict the movement of stock portfolios.
Companies rethink their business model after using real-time analytics to analyze the market demand for their brand.
Movie Recommendation System
Problem Statement
To build a Movie Recommendation System that recommends movies based on a user's preferences using Apache Spark.
Our Requirements:
Process huge amounts of data
Input from multiple sources
Easy to use
Fast processing
Apache Spark is well suited to implementing our Movie Recommendation System.
Use Case – Flow Diagram
1) Huge amount of movie rating data
2) Data from streaming / HDFS
3) Getting input using Spark Streaming
4) Machine learning using MLlib: train the data, evaluate ALS, generate recommendations
5) Fetching results using Spark SQL
6) Storing results in an RDBMS for websites
What is Spark?
Apache Spark is an open-source cluster-computing framework for real-time processing, originally developed at UC Berkeley's AMPLab and now maintained by the Apache Software Foundation.
Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
It is not built on top of Hadoop MapReduce; rather, it extends the MapReduce model to efficiently support more types of computations.
Figure: Data Parallelism In Spark (parallel versus serial execution, showing the reduction in time)
Figure: Real Time Processing In Spark
Figure: Support for multiple source formats
Figure: Lazy Evaluation
Spark Features
Speed: 100x faster than [comparison target shown only as an image in the original slide] for large-scale data processing
Powerful Caching: Simple programming layer provides powerful caching and disk persistence capabilities
Deployment: Can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
Polyglot: Can be programmed in Scala, Java, Python and R
Getting Dataset
For our Movie Recommendation System, we can get user ratings from many popular websites such as IMDb, Rotten Tomatoes and Times Movie Ratings.
These datasets are available in many formats, such as CSV files, text files and databases.
We can either stream the data live from the websites or download it and store it in our local file system or HDFS.
Figure: Various File Formats
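As a rough illustration of the "download and store" option, the sketch below parses ratings from CSV text using only Python's standard library. The column names (user, movie, rating) and the sample rows are assumptions made for this example; a real export from any of the sites above may use a different layout, and in the actual project the data would be read through Spark rather than plain Python.

```python
import csv
import io
from typing import NamedTuple


class Rating(NamedTuple):
    user: str
    movie: str
    rating: float


# Hypothetical sample data; a real dataset would come from a local file or HDFS.
SAMPLE_CSV = """user,movie,rating
Alice,Shutter Island,4
Bob,Shutter Island,3
Carol,Fight Club,4
Dave,Home Alone,5
"""


def parse_ratings(text: str) -> list[Rating]:
    """Parse CSV text with a header row into Rating records."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        Rating(row["user"], row["movie"], float(row["rating"]))
        for row in reader
    ]


ratings = parse_ratings(SAMPLE_CSV)
for record in ratings:
    print(record.user, record.movie, record.rating)
```

Converting every row into a small typed record early makes the later steps (training and storing results) independent of the original file format.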
Collaborative Filtering
We will use Collaborative Filtering (CF) to predict a user's rating for a particular movie based on that user's ratings for other movies.
We then combine this with other users' ratings for that particular movie.
Movie           Alice  Bob  Carol  Dave
Shutter Island  4      3    5      1
Fight Club      5      4    4      2
Dark Knight     5      3    4      ?
21              4      3    ?      5
Home Alone      4      4    5      5
Figure: Predicting the rating of Dave for Dark Knight and Carol for 21 using Collaborative Filtering
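To make the idea concrete, here is a minimal sketch of user-based collaborative filtering on the ratings table above, in plain Python. It assumes the two missing cells are Dave's rating for Dark Knight and Carol's rating for 21, as the figure caption states. It uses cosine similarity between users and a similarity-weighted average, which is one simple CF variant chosen for illustration; it is not the ALS approach that Spark MLlib uses, and the function names are invented for this example.

```python
from math import sqrt

# Known ratings from the table; the two "?" cells are simply absent.
ratings = {
    "Alice": {"Shutter Island": 4, "Fight Club": 5, "Dark Knight": 5,
              "21": 4, "Home Alone": 4},
    "Bob": {"Shutter Island": 3, "Fight Club": 4, "Dark Knight": 3,
            "21": 3, "Home Alone": 4},
    "Carol": {"Shutter Island": 5, "Fight Club": 4, "Dark Knight": 4,
              "Home Alone": 5},
    "Dave": {"Shutter Island": 1, "Fight Club": 2, "21": 5,
             "Home Alone": 5},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def predict(user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    weighted_sum = 0.0
    total_weight = 0.0
    for other, their_ratings in ratings.items():
        if other == user or movie not in their_ratings:
            continue
        weight = cosine_similarity(ratings[user], their_ratings)
        weighted_sum += weight * their_ratings[movie]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


print("Dave / Dark Knight:", predict("Dave", "Dark Knight"))
print("Carol / 21:", predict("Carol", "21"))
```

Because all similarities here are positive, each prediction is a weighted average and always falls between the lowest and highest rating the other users gave that movie.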
Spark MLlib
Spark MLlib (Machine Learning Library) is the component used to perform machine learning in Apache Spark.
Machine Learning Using Spark MLlib:
1) Train Data using Alternating Least Squares (ALS)
2) Generate Recommendations using Collaborative Filtering
Figure: Machine Learning Flow Diagram
Machine Learning Tools: ML Algorithms, Featurization, Pipelines, Persistence, Utilities
Figure: Machine Learning Tools
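The sketch below illustrates the idea behind Alternating Least Squares in plain Python. It is not MLlib's implementation or API. It assumes a rank-1 model, where each user and each movie gets a single number and a rating is approximated by their product, so every update has a simple closed form. The data is the ratings table from the Collaborative Filtering slide with the two unknown cells left out; the function names, the regularization value and the iteration count are choices made for this example.

```python
# (user, movie, rating) triples; the two unknown cells are omitted.
RATINGS = [
    ("Alice", "Shutter Island", 4), ("Bob", "Shutter Island", 3),
    ("Carol", "Shutter Island", 5), ("Dave", "Shutter Island", 1),
    ("Alice", "Fight Club", 5), ("Bob", "Fight Club", 4),
    ("Carol", "Fight Club", 4), ("Dave", "Fight Club", 2),
    ("Alice", "Dark Knight", 5), ("Bob", "Dark Knight", 3),
    ("Carol", "Dark Knight", 4),
    ("Alice", "21", 4), ("Bob", "21", 3), ("Dave", "21", 5),
    ("Alice", "Home Alone", 4), ("Bob", "Home Alone", 4),
    ("Carol", "Home Alone", 5), ("Dave", "Home Alone", 5),
]


def regularized_loss(ratings, user_f, movie_f, lam):
    """Squared error on known ratings plus an L2 penalty on all factors."""
    error = sum((r - user_f[u] * movie_f[m]) ** 2 for u, m, r in ratings)
    penalty = lam * (sum(x * x for x in user_f.values())
                     + sum(x * x for x in movie_f.values()))
    return error + penalty


def als_rank1(ratings, iterations=10, lam=0.1):
    """Alternate between solving for user factors and movie factors."""
    user_f = {u: 1.0 for u, _, _ in ratings}
    movie_f = {m: 1.0 for _, m, _ in ratings}
    history = []
    for _ in range(iterations):
        # Hold movie factors fixed; each user factor is a 1-D least squares fit.
        for u in user_f:
            rated = [(m, r) for uu, m, r in ratings if uu == u]
            user_f[u] = (sum(r * movie_f[m] for m, r in rated)
                         / (lam + sum(movie_f[m] ** 2 for m, _ in rated)))
        # Hold user factors fixed; solve for each movie factor the same way.
        for m in movie_f:
            raters = [(u, r) for u, mm, r in ratings if mm == m]
            movie_f[m] = (sum(r * user_f[u] for u, r in raters)
                          / (lam + sum(user_f[u] ** 2 for u, _ in raters)))
        history.append(regularized_loss(ratings, user_f, movie_f, lam))
    return user_f, movie_f, history


user_f, movie_f, history = als_rank1(RATINGS)
predicted = user_f["Dave"] * movie_f["Dark Knight"]
print("Loss per iteration:", [round(x, 3) for x in history])
print("Predicted rating, Dave / Dark Knight:", round(predicted, 2))
```

Each half-step exactly minimizes the regularized loss with the other set of factors held fixed, so the recorded loss never increases between iterations. A real recommender would use more than one factor per user and movie, which turns each update into a small linear system instead of a single division.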
Spark SQL for Fetching Results
Machine Learning Output -> Spark SQL -> Results
To get the results from our machine learning step, we use Spark SQL's DataFrame, Dataset and SQL service.
The machine learning results need to be stored in an RDBMS so that our web application can display the recommendations to a particular user.
Storing Results
The results of our Movie Recommendation System can be stored either locally or in external storage systems.
We can store the recommended movies along with their ratings in a text file or a CSV file.
We should prefer storing the results in an RDBMS so that we can access them directly from our web application and display recommendations and top movies.
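The RDBMS option can be sketched with Python's built-in sqlite3 module, used here only as a lightweight stand-in for whatever database the web application actually uses. The table name, column names and the predicted ratings below are all hypothetical placeholders for this example, not output from the tutorial's model.

```python
import sqlite3

# Hypothetical (user, movie, predicted rating) rows standing in for model output.
recommendations = [
    ("Dave", "Dark Knight", 3.9),
    ("Dave", "Fight Club", 2.4),
    ("Carol", "21", 4.1),
]

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE recommendations (
        user_name TEXT NOT NULL,
        movie TEXT NOT NULL,
        predicted_rating REAL NOT NULL,
        PRIMARY KEY (user_name, movie)
    )
    """
)
conn.executemany(
    "INSERT INTO recommendations VALUES (?, ?, ?)", recommendations
)
conn.commit()


def top_movies(connection, user_name, limit=5):
    """Return a user's recommendations, highest predicted rating first."""
    cursor = connection.execute(
        "SELECT movie, predicted_rating FROM recommendations "
        "WHERE user_name = ? ORDER BY predicted_rating DESC LIMIT ?",
        (user_name, limit),
    )
    return cursor.fetchall()


dave_top = top_movies(conn, "Dave")
print(dave_top)
```

Keeping the recommendations in a table keyed by user and movie lets the web application fetch a user's top movies with a single parameterized query, without touching the Spark job.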
Spark Job Trends
The slide shows the job trend for Apache Spark across the world.
According to the source, Spark has almost three times the average number of jobs of its competitors and has been the market leader since 2014.
Source: www.indeed.com
Conclusion
Congrats!
We have demonstrated the power of Apache Spark in real-time analysis.
The hands-on examples should give you the confidence to work on future Apache Spark projects.