Spark is an open source cluster computing framework that allows processing of large datasets across clusters of computers using a simple programming model. It provides high-level APIs in Java, Scala, Python and R.
Typical machine learning workflows in Spark involve loading data, preprocessing, feature engineering, training models, evaluating performance, and tuning hyperparameters. Spark MLlib provides algorithms for common tasks like classification, regression, clustering and collaborative filtering.
The document provides an example of building a spam filtering application in Spark. It involves reading email data, extracting features using tokenization and hashing, training a logistic regression model, evaluating performance on test data, and tuning hyperparameters via cross validation.
1. Large scale machine learning
with Apache Spark
Md. Mahedi Kaysar (Research Master), Insight Centre for Data Analytics [DCU]
mahedi.kaysar@insight-centre.org
3. Spark Overview
• Open source large-scale data processing engine
• Up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• Applications can be written in Java, Scala, Python and R
• Runs on Mesos, YARN or Spark's standalone cluster manager
• Can access diverse data sources including HDFS, Cassandra, HBase and S3
4. Spark Overview
• MapReduce: distributed execution model
– Map tasks read data from disk, process it and write the results back to disk; before the shuffle operation, the map output is sent to the reducers
– Reduce tasks read the shuffled data from disk, process it and write the results back to disk
7. Spark Overview
• RDD: Resilient distributed dataset
– We write programs in terms of operations on distributed datasets
– A partitioned collection of objects spread across the cluster, stored in memory or on disk
– RDDs are built and manipulated through a diverse set of parallel transformations (map, filter, join) and actions (save, count, collect)
– RDDs are automatically rebuilt on machine failure
8. Spark Overview
• RDD: Resilient distributed dataset
– Immutable; the programmer specifies the number of partitions for an RDD (see the sketch below)
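A minimal sketch of these ideas (not from the original slides), assuming a SparkContext named sc and a hypothetical input file:

val lines = sc.textFile("data/input.txt", 4)   // RDD split into 4 partitions
val words = lines.flatMap(_.split(" "))        // transformation (lazy)
val longWords = words.filter(_.length > 3)     // transformation (lazy)
val n = longWords.count()                      // action: triggers the computation
longWords.saveAsTextFile("data/output")        // action: writes results; lineage lets lost partitions be rebuilt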
10. Spark Ecosystem
• Spark Core: the underlying general execution engine; it provides in-memory computing, and the higher-level APIs are built on top of it
• Spark SQL
• Spark MLlib
• Spark GraphX
• Spark Streaming
11. Apache Spark 2.0
• Spark SQL
– Module for structured or tabular data processing
– Built around a data abstraction originally called SchemaRDD (now the DataFrame/Dataset)
– Internally it has more information about the structure of both the data and the computation being performed
– Two ways to interact with Spark SQL (see the sketch below)
• SQL queries: “SELECT * FROM PEOPLE”
• Dataset/DataFrame: a domain-specific language
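A minimal sketch of the two styles (not from the original slides), assuming a SparkSession named spark and a hypothetical people.json file:

import spark.implicits._
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")                              // register for SQL queries
val bySql = spark.sql("SELECT name, age FROM people WHERE age > 21")  // SQL query string
val byDsl = people.select("name", "age").where($"age" > 21)           // DataFrame DSL, same result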
13. Apache Spark 2.0
• Spark MLlib
– Machine learning library
– ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
• e.g. SVM, decision trees
– Featurization: feature extraction, transformation, dimensionality reduction, and selection
• e.g. term frequency, document frequency
– Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
– Persistence: saving and loading algorithms, models, and Pipelines
– Utilities: linear algebra, statistics, data handling, etc.
– The DataFrame-based API (spark.ml) is the primary API
14. Apache Spark 2.0
• Spark Streaming
– Provides a high-level abstraction called a discretized stream or DStream, which represents a continuous stream of data
– A DStream is internally represented as a sequence of RDDs (see the sketch below)
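A minimal DStream sketch (not from the slides), assuming a text source on a local socket and a SparkSession named spark:

import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))   // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)               // DStream of text lines
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                                    // each batch is one RDD
ssc.start()
ssc.awaitTermination()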
15. Apache Spark 2.0
• Structured Streaming (Experimental)
– A scalable and fault-tolerant stream processing engine built on the Spark SQL engine
– The Spark SQL engine takes care of running the query incrementally and continuously, updating the final result as streaming data continues to arrive
– Queries can be expressed with the Dataset or DataFrame APIs (see the sketch below)
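A minimal Structured Streaming sketch (an assumption, not from the slides), again reading from a local socket:

import spark.implicits._
val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()
val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()
val query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()   // the result table is updated incrementally as data arrives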
16. Apache Spark 2.0
• GraphX
– Extends the Spark RDD by introducing a new Graph abstraction
– A directed multigraph with properties attached to each vertex and edge
– Built-in algorithms include (see the sketch below):
• PageRank: measures the importance of a vertex
• Connected components
• Triangle counting
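A small GraphX sketch (not from the slides), assuming a SparkContext named sc; the vertices and edges are made up for illustration:

import org.apache.spark.graphx.{Edge, Graph}
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
val graph = Graph(vertices, edges)            // property graph: attributes on vertices and edges
val ranks = graph.pageRank(0.0001).vertices   // importance of each vertex
val components = graph.connectedComponents().vertices
val triangles = graph.triangleCount().vertices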
17. Apache Spark 2.0
• RDD vs. DataFrame vs. Dataset
– All are immutable, distributed datasets
– RDD (resilient distributed dataset) is the basic building block of Apache Spark; it processes data in memory for efficient reuse
– DataFrame and Dataset are more abstract than RDD; they are optimized and well suited to structured data such as CSV, JSON or Hive tables
– When you have raw data such as a plain text file, you can start with an RDD and transform it into structured data with the help of the DataFrame and Dataset APIs (see the sketch below)
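A sketch of that last point (not from the slides), turning a raw text file with hypothetical "name,age" lines into a DataFrame:

import spark.implicits._
case class Person(name: String, age: Int)
val raw = spark.sparkContext.textFile("people.txt")       // unstructured RDD of lines
val people = raw.map(_.split(","))
  .map(cols => Person(cols(0), cols(1).trim.toInt))       // parse each line into a case class
val df = people.toDF()                                    // structured, optimizable DataFrame
df.filter($"age" > 21).show()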
18. Apache Spark 2.0
• RDD:
– Immutable, partitioned collections of objects
– Two main kinds of operations: transformations and actions
19. Apache Spark 2.0
• DataFrame
– A dataset organized into named columns.
– It is conceptually equivalent to a table in a
relational database
– Can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs
20. Apache Spark 2.0
• Dataset
– A distributed collection of data
– Has all the benefits of RDDs and DataFrames, with further optimization
– A Dataset can be converted to and from the other forms of data
– It is the newest API for data collections
21. Apache Spark 2.0
• DataFrame/Dataset
– Reading JSON data into a Dataset (see the sketch below)
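A minimal reconstruction of this example (the original slide showed a screenshot), assuming a people.json file with name and age fields:

import spark.implicits._
case class Person(name: String, age: Long)
val peopleDS = spark.read.json("people.json").as[Person]   // typed Dataset[Person]
peopleDS.filter(_.age > 21).show()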
22. Apache Spark 2.0
• DataFrame/Dataset
– Connecting to Hive and querying it with HiveQL (see the sketch below)
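A minimal reconstruction (the original slide showed a screenshot); the table name and query are assumptions:

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("HiveExample")
  .enableHiveSupport()        // lets Spark SQL talk to the Hive metastore
  .getOrCreate()
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
spark.sql("SELECT key, value FROM src WHERE key < 10").show()   // HiveQL query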
23. Apache Spark 2.0
• Dataset
– You can transform a Dataset to an RDD and an RDD to a Dataset (see the sketch below)
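A minimal sketch (not from the slides) of converting in both directions:

import spark.implicits._
val ds = Seq(1, 2, 3).toDS()   // Dataset[Int]
val rdd = ds.rdd               // Dataset -> RDD
val ds2 = rdd.toDS()           // RDD -> Dataset
val df = rdd.toDF("value")     // RDD -> DataFrame with a named column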
24. Spark Cluster Overview
• Spark uses a master/slave architecture
• One central coordinator, called the driver, communicates with many distributed workers (executors)
• Drivers and executors each run in their own Java process
A driver is the process where the main method runs.
It converts the user program into tasks and schedules the tasks on the executors with the help of the cluster manager.
The cluster manager launches the executors and manages the worker nodes.
Executors run the Spark tasks and send the results back to the driver program.
They also provide in-memory storage for RDDs that are cached by the user program.
The workers are in charge of reporting the availability of their resources to the cluster manager.
25. Spark Cluster Overview
• Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster.
• Example: A standalone cluster with 2 worker nodes (each node having 2 cores)
– Local machine
– Cloud EC2
conf/spark-env.sh:
export SPARK_WORKER_MEMORY=1g
export SPARK_EXECUTOR_MEMORY=1g
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=2
export SPARK_WORKER_DIR=/home/work/sparkdata
Start the master: ./sbin/start-master.sh
conf/slaves: add the master node IP
Start the workers: ./sbin/start-slaves.sh
28. Machine Learning with Spark
• Typical machine learning workflow:
– Load the sample data.
– Parse the data into the input format for the algorithm.
– Pre-process the data and handle the missing values.
– Split the data into two sets: one for building the model (training dataset) and one for testing the model (validation dataset).
– Run the algorithm to build or train your ML model.
29. Machine Learning with Spark
• Typical machine learning workflow (continued):
– Make predictions with the training data and observe the results.
– Test and evaluate the model with the test data, or alternatively validate the model with some cross-validation technique using a third dataset, called the validation dataset.
– Tune the model for better performance and accuracy.
– Scale up the model so that it can handle massive datasets in the future.
– Deploy the ML model in production.
30. Machine Learning with Spark
• Pre-processing
– The three most common data preprocessing steps are:
• Formatting: the data may not be in a usable shape
• Cleaning: the data may have unwanted records or records with missing entries; cleaning deals with removing or fixing missing data
• Sampling: useful when the available data is very large
– Data transformation
– Dataset, RDD and DataFrame
31. Machine Learning with Spark
• Feature Engineering
– Extraction: extracting features from “raw” data
– Transformation: scaling, converting, or modifying features
– Selection: selecting a subset from a larger set of features
32. Machine Learning with Spark
• ML Algorithms
– Classification
– Regression
– Tuning
33. Machine Learning with Spark
• ML Pipeline:
– Higher-level API built on top of DataFrames
– Combines multiple algorithms into a complete workflow
– For example, text analytics (see the sketch after this list):
• Split the text => words
• Convert words => numerical feature vectors
• Combine the numerical feature vectors with labels
• Build an ML model as a prediction model using the vectors and labels
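A sketch of that text-analytics pipeline (not taken from the slides); it assumes a training DataFrame with "label" and "text" columns:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val pipelineModel = pipeline.fit(training)   // one fit call runs the whole workflow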
34. Machine Learning with Spark
• ML Pipeline Component:
– Transformers
• An abstraction that includes feature transformers and learned models
• An algorithm for transforming one Dataset or DataFrame into another
• Example: HashingTF
– Estimators
• An algorithm that can be fit on a Dataset or DataFrame to produce a Transformer or model. Example: LogisticRegression
35. Machine Learning with Spark
• Spam detection or spam filtering:
– Given some e-mails in an inbox, the task is to
identify those e-mails that are spam and those
that are non-spam (often called ham) e-mail
messages.
36. Machine Learning with Spark
• Spam detection or spam filtering:
– Reading the dataset (see the sketch below)
– SparkSession is the single entry point for interacting with the underlying Spark functionality; it enables programming with DataFrames and Datasets
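A sketch of this step (the original slide showed a screenshot); the application name, master URL and file name are assumptions:

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("SpamFilter")
  .master("local[*]")
  .getOrCreate()                              // single entry point to Spark
val raw = spark.read.text("data/spam.txt")    // DataFrame with one "value" column per line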
37. Machine Learning with Spark
• Spam detection or spam filtering:
– Pre-process the dataset (see the sketch below)
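A hypothetical preprocessing sketch, assuming each line of the file is a label ("spam" or "ham") followed by a tab and the message text; the format is an assumption, not from the slides:

import spark.implicits._
val parsed = raw.as[String].map { line =>
  val Array(tag, text) = line.split("\t", 2)
  (if (tag == "spam") 1.0 else 0.0, text)   // spark.ml expects a numeric label
}.toDF("label", "text")
val cleaned = parsed.na.drop()              // drop rows with missing values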
38. Machine Learning with Spark
• Spam detection or spam filtering:
– Feature extraction: build the feature vectors
– TF: term frequency is the number of times a term appears in a document
• A feature vectorization method
39. Machine Learning with Spark
• Spam detection or spam filtering:
– Tokenizer: a Transformer that tokenizes the text into words
– HashingTF: a Transformer that builds feature vectors using the TF technique (see the sketch below)
• Takes a set of terms
• Converts it into a feature vector
• Uses the hashing trick to index terms
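A sketch of these two transformers applied directly, reusing the cleaned DataFrame from the earlier sketch; the column names and feature count are assumptions:

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsDF = tokenizer.transform(cleaned)   // text -> array of words
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
val featurized = hashingTF.transform(wordsDF)   // words -> sparse TF feature vectors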
40. Machine Learning with Spark
• Spam detection or spam filtering:
– Train a model (see the sketch below)
– Define the classifier
– Fit it on the training set
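A sketch of the training step, reusing the featurized DataFrame from above; the split ratio and parameter values are assumptions:

import org.apache.spark.ml.classification.LogisticRegression
val Array(trainSet, testSet) = featurized.randomSplit(Array(0.8, 0.2), seed = 12345)
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val lrModel = lr.fit(trainSet)                // Estimator.fit returns a fitted model (a Transformer)
val predictions = lrModel.transform(testSet)  // adds a "prediction" column for evaluation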
42. Machine Learning with Spark
• Tuning
– Model selection:
• Hyperparameter tuning
• Find the best model or parameters for a given task
• Tuning can be done for an individual estimator, such as logistic regression, or for an entire pipeline
– Model selection via cross-validation
– Model selection via train-validation split
43. Machine Learning with Spark
• Tuning
– Model selection workflow
• Split the input data into separate training and test sets
• For each (training, test) pair, iterate through the set of ParamMaps:
– For each ParamMap, fit the estimator using those parameters
– Take the fitted model and evaluate its performance using the evaluator
• Select the model produced by the best-performing set of parameters
44. Machine Learning with Spark
• Tuning
– Model selection workflow
• The evaluator can be a RegressionEvaluator, a BinaryClassificationEvaluator, and so on
45. Machine Learning with Spark
• Tuning
– Model selection via cross-validation (see the sketch below)
• CrossValidator begins by splitting the dataset into a set of folds; k=3 means it creates 3 (training, test) dataset pairs
• Each pair uses 2/3 of the data for training and 1/3 for testing
• To evaluate a particular ParamMap, it computes the average evaluation metric over the three models fitted by the estimator
• It is a well-established method for choosing parameters that is more statistically sound than heuristic hand-tuning
– Model selection via train-validation split
• Only evaluates each combination of parameters once
• Less expensive, but will not produce as reliable results when the training dataset is not sufficiently large
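A sketch of cross-validated model selection, reusing the pipeline, hashingTF, lr and parsed DataFrame from the earlier sketches; the grid values are assumptions:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())   // could also be a RegressionEvaluator, etc.
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)                                      // 3 (training, test) pairs
val cvModel = cv.fit(parsed)                           // picks the best ParamMap by average metric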
46. Spam Filtering Application
• What have we done so far?
– Reading the dataset
– Cleaning
– Feature engineering
– Training
– Testing
– Tuning
– Deploying
– Persisting the model
– Reusing the existing model (see the sketch below)
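A sketch of the last two items, persisting a fitted pipeline model and reloading it for reuse; the names and path come from the earlier sketches and are assumptions:

import org.apache.spark.ml.PipelineModel
pipelineModel.write.overwrite().save("models/spam-model")   // persist the fitted model
val reloaded = PipelineModel.load("models/spam-model")      // reuse it later, e.g. in another application
val scored = reloaded.transform(testSet)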
Spark extends the MapReduce programming model to better support iterative workloads such as machine learning and graph processing. The motivation for the Spark programming model comes from the limitations of most current programming models, which are based on acyclic data flow: data flows from stable storage to stable storage. This lets the runtime decide where to run tasks and recover automatically from failures, but it is inefficient for applications that repeatedly reuse a working set of data, for example machine learning and graph algorithms; such applications had to reload data from stable (persistent) storage on each query.
Apache Spark brings a solution called the resilient distributed dataset (RDD), which allows applications to keep a working set in memory for efficient reuse. It also keeps the attractive properties of MapReduce: fault tolerance, data locality and scalability.
Here spam is not a feature. You have to extract the features and the label from the raw data, and then transform them into feature vectors, which are numerical representations of the text.