Large scale machine learning
with Apache Spark
Md. Mahedi Kaysar (Research Master), Insight Centre for Data Analytics [DCU]
mahedi.kaysar@insight-centre.org
Agenda
• Spark Overview
• Installing Spark and deploying applications
• Machine learning with Spark 2.0
– Typical machine learning workflow
– Spark MLlib
• Develop a machine learning application
– Spam filtering
Spark Overview
• Open-source, large-scale data processing engine
• Up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• Applications can be written in Java, Scala, Python and R
• Runs on Mesos, YARN or its own standalone cluster manager
• Can access diverse data sources including HDFS, Cassandra, HBase and S3
Spark Overview
• MapReduce: distributed execution model
– Map reads data from disk, processes it and writes the result back to disk; the shuffle operation then sends this data to the reducers
– Reduce reads the data from disk, processes it and writes the result back to disk
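The map, shuffle and reduce phases described above can be illustrated with a small word count. This is a plain-Python sketch that runs in memory, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample lines are made up for this example, and a real MapReduce job would write to disk between the phases.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, which is what the shuffle step achieves."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values for each key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["spark is fast", "hadoop is reliable", "spark is popular"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)
```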
Spark Overview
• MapReduce execution model:
– Iterative jobs: many disk I/O operations
Spark Overview
• Spark execution model
– Keeps intermediate data in memory instead of writing it to disk
Spark Overview
• RDD: Resilient Distributed Dataset
– Programs are written in terms of operations on distributed datasets
– A partitioned collection of objects spread across the cluster, stored in memory or on disk
– RDDs are built and manipulated through a diverse set of parallel transformations (map, filter, join) and actions (save, count, collect)
– RDDs are automatically rebuilt on machine failure
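To make the difference between transformations and actions concrete, here is a toy, single-machine sketch. `MiniRDD` is a hypothetical class written for this example and is not part of Spark: transformations only record what should be done, and an action triggers the computation. Recomputing from the recorded operations loosely mirrors how a lost RDD partition can be rebuilt.

```python
class MiniRDD:
    """Toy stand-in for an RDD: immutable data plus pending transformations."""

    def __init__(self, data, ops=()):
        self._data = tuple(data)
        self._ops = tuple(ops)

    # Transformations: return a new MiniRDD; nothing is computed yet.
    def map(self, fn):
        return MiniRDD(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._data, self._ops + (("filter", pred),))

    def _compute(self):
        # Replay the recorded operations over the source data.
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a result.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


numbers = MiniRDD(range(1, 11))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())
print(evens_squared.count())
```

Note that `numbers` itself is never modified: each transformation yields a new object, which is the sense in which RDDs are immutable.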
Spark Overview
• RDD: Resilient Distributed Dataset
– An RDD is immutable, and the programmer can specify its number of partitions.
Spark Overview
• RDD: Transformation
– Creates a new dataset from an existing one
Spark Ecosystem
• Spark Core: the underlying general execution engine. It provides in-memory computing, and the other APIs are built upon it.
• Spark SQL
• Spark MLlib
• Spark GraphX
• Spark Streaming
Apache Spark 2.0
• Spark SQL
– Module for structured or tabular data processing
– Built on a data abstraction originally called SchemaRDD (renamed DataFrame in Spark 1.3)
– Internally it has more information about the structure of both the data and the computation being performed
– Two ways to interact with Spark SQL
• SQL queries: “SELECT * FROM PEOPLE”
• Dataset/DataFrame: domain-specific language
Apache Spark 2.0
• Spark MLlib
– Machine learning library
– ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
• SVM, decision tree
– Featurization: feature extraction, transformation, dimensionality reduction, and selection
• Term frequency, document frequency
– Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
– Persistence: saving and loading algorithms, models, and Pipelines
– Utilities: linear algebra, statistics, data handling, etc.
– The DataFrame-based API (spark.ml) is the primary API
Apache Spark 2.0
• Spark Streaming
– Provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data
– A DStream is represented as a sequence of RDDs.
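The idea of a stream cut into a sequence of small batches can be sketched in plain Python. This is only an illustration of the micro-batch concept, not Spark Streaming code; `discretize`, the one-second batch width and the sample events are assumptions made for this example, and timestamps are assumed to be non-negative.

```python
def discretize(events, batch_seconds=1.0):
    """Group (timestamp, value) events into consecutive fixed-width batches."""
    if not events:
        return []
    last_index = int(max(t for t, _ in events) // batch_seconds)
    batches = [[] for _ in range(last_index + 1)]
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    return batches


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events)
# Apply the same operation to every batch, as one would to every RDD of a DStream.
counts = [len(batch) for batch in batches]
print(batches)
print(counts)
```

An interval with no events still produces an (empty) batch, so downstream code sees one batch per interval.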
Apache Spark 2.0
• Structured Streaming (Experimental)
– Scalable and fault-tolerant stream processing engine built on the Spark SQL engine
– The Spark SQL engine takes care of running the computation incrementally and continuously, updating the final result as streaming data continues to arrive
– Can be used through the Dataset or DataFrame APIs
Apache Spark 2.0
• GraphX
– Extends the Spark RDD by introducing a new Graph abstraction
– A directed multigraph with properties attached to each vertex and edge
– PageRank: measures the importance of a vertex
– Connected components
– Triangle count
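As a rough illustration of how PageRank measures the importance of a vertex, the sketch below runs the textbook power iteration on a tiny directed graph. It is plain Python, not GraphX's implementation; the damping factor of 0.85, the iteration count and the sample edges are assumptions chosen for this example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank for a directed graph given as (src, dst) pairs."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        new_ranks = {v: (1.0 - damping) / n for v in vertices}
        for src, targets in out_links.items():
            if targets:
                # Split this vertex's rank evenly over its outgoing edges.
                share = damping * ranks[src] / len(targets)
                for dst in targets:
                    new_ranks[dst] += share
            else:
                # Dangling vertex: spread its rank evenly over all vertices.
                for v in vertices:
                    new_ranks[v] += damping * ranks[src] / n
        ranks = new_ranks
    return ranks


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
for vertex, rank in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(vertex, round(rank, 3))
```

Vertex `c` receives links from three other vertices and ends up with the highest rank, while `d`, which nothing links to, keeps only the baseline share.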
Apache Spark 2.0
• RDD vs. DataFrame vs. Dataset
– All three are immutable, distributed datasets
– The RDD (resilient distributed dataset) is the main building block of Apache Spark. It processes data in memory for efficiency.
– DataFrame and Dataset are more abstract than RDD; they are optimized and work well when you have structured data such as CSV, JSON, Hive tables and so on.
– When you have raw data such as a text file, you can use an RDD and transform it into structured data with the help of DataFrame and Dataset
Apache Spark 2.0
• RDD:
– Immutable, partitioned collections of objects
– Two main kinds of operations: transformations and actions
Apache Spark 2.0
• DataFrame
– A dataset organized into named columns.
– It is conceptually equivalent to a table in a
relational database
– can be constructed from a wide array of sources
such as: structured data files, tables in Hive,
external databases, or existing RDDs.
Apache Spark 2.0
• Dataset
– A distributed collection of data
– It combines the benefits of RDDs and DataFrames with more optimization
– A Dataset can be converted to the other forms (RDD or DataFrame)
– It is the latest API for data collections
Apache Spark 2.0
• DataFrame/Dataset
– Reading JSON data into a Dataset
Apache Spark 2.0
• DataFrame/Dataset
– Connecting to Hive and querying it with HiveQL
Apache Spark 2.0
• Dataset
– A Dataset can be converted to an RDD, and an RDD back to a Dataset
Spark Cluster Overview
• Spark uses a master/slave architecture
• One central coordinator, called the driver, communicates with many distributed workers (executors)
• The driver and each executor run in their own Java processes
• The driver is the process where the main method runs. It converts the user program into tasks and schedules those tasks on the executors with the help of the cluster manager
• The cluster manager launches the executors and manages the worker nodes
• Executors run the Spark tasks and send the results back to the driver program. They also provide in-memory storage for RDDs that are cached by the user program
• The workers are in charge of reporting the availability of their resources to the cluster manager
Spark Cluster Overview
• Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster.
• Example: a standalone cluster with 2 worker nodes (each node having 2 cores)
– Local machine
– Cloud EC2
• conf/spark-env.sh:
export SPARK_WORKER_MEMORY=1g
export SPARK_EXECUTOR_MEMORY=1g
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=2
export SPARK_WORKER_DIR=/home/work/sparkdata
• Start the master:
./sbin/start-master.sh
• conf/slaves:
Master node IP
• Start the workers:
./sbin/start-slaves.sh
Application Deployment
• Standalone mode
– Client deploy mode
– Cluster deploy mode
Machine Learning with Spark
• Typical Machine learning workflow:
– Load the sample data.
– Parse the data into the input format for the algorithm.
– Pre-process the data and handle the missing values.
– Split the data into two sets, one for building the model (training dataset) and one for testing the model (test dataset).
– Run the algorithm to build or train your ML model.
Machine Learning with Spark
• Typical Machine learning workflow:
– Make predictions with the training data and observe the results.
– Test and evaluate the model with the test data, or alternatively validate the model with a cross-validation technique using a third dataset, called the validation dataset.
– Tune the model for better performance and accuracy.
– Scale up the model so that it can handle massive datasets in the future.
– Deploy the ML model to production.
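The split step of the workflow above can be sketched in a few lines of plain Python. The helper name `train_test_split`, the fixed seed and the sample records are assumptions made for this example; in Spark itself, the DataFrame method `randomSplit` serves the same purpose.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and cut it into (training, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (text, label) records standing in for a real dataset.
records = [(f"message {i}", i % 2) for i in range(10)]
training, test = train_test_split(records, test_fraction=0.3)
print(len(training), len(test))
```

Fixing the seed keeps the split reproducible, which makes later evaluation results comparable between runs.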
Machine Learning with Spark
• Pre-processing
– The three most common data pre-processing steps are:
• Formatting: the data may not be in a usable shape
• Cleaning: the data may contain unwanted records or records with missing entries; cleaning removes or fixes them
• Sampling: useful when the available data is large
– Data transformation
– Dataset, RDD and DataFrame
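A minimal plain-Python sketch of the cleaning and sampling steps listed above follows. The rules used here (drop records without a label, replace missing text with an empty string) and the sample records are assumptions for illustration only; the right policy depends on the dataset.

```python
import random


def clean(records):
    """Drop records without a label; replace missing text with an empty string."""
    cleaned = []
    for text, label in records:
        if label is None:
            continue  # removal: an unlabeled record cannot be used for training
        cleaned.append(((text or "").strip(), label))  # fixing: fill and trim text
    return cleaned


def sample(records, fraction, seed=0):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    return [record for record in records if rng.random() < fraction]


raw = [(" Win cash now ", 1), (None, 0), ("See you at lunch", None), ("Hello", 0)]
cleaned = clean(raw)
print(cleaned)
print(sample(cleaned, fraction=0.5))
```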
Machine Learning with Spark
• Feature Engineering
– Extraction: extracting features from “raw” data
– Transformation: scaling, converting, or modifying features
– Selection: selecting a subset from a larger set of features
Machine Learning with Spark
• ML Algorithms
– Classification
– Regression
– Tuning
Machine Learning with Spark
• ML Pipeline:
– Higher-level API built on top of DataFrames
– Can combine multiple algorithms into a complete workflow.
– For example: text analytics
• Split the text => words
• Convert words => numerical feature vectors
• Attach labels to the numerical feature vectors
• Build an ML model as a prediction model using the vectors and labels
Machine Learning with Spark
• ML Pipeline Component:
– Transformers
• An abstraction that includes feature transformers and learned models
• An algorithm for transforming one Dataset or DataFrame into another Dataset or DataFrame
• Example: HashingTF
– Estimators
• An algorithm which can be fit on a Dataset or DataFrame to produce a Transformer or model. Example: LogisticRegression
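The Transformer and Estimator roles described above can be mimicked in plain Python to show how they relate. The classes below (`Tokenizer`, `VocabularyEstimator`, `VocabularyModel`) are hypothetical stand-ins written for this sketch, operating on lists of dicts rather than DataFrames; they are not Spark's classes.

```python
class Tokenizer:
    """Transformer: adds a 'words' column derived from the 'text' column."""

    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class VocabularyModel:
    """Transformer produced by fitting: maps words to a count vector."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        out = []
        for row in rows:
            vector = [0] * len(self.vocabulary)
            for word in row["words"]:
                if word in self.vocabulary:
                    vector[self.vocabulary[word]] += 1
            out.append({**row, "features": vector})
        return out


class VocabularyEstimator:
    """Estimator: learns a vocabulary from the data and returns a model."""

    def fit(self, rows):
        words = sorted({w for row in rows for w in row["words"]})
        return VocabularyModel({w: i for i, w in enumerate(words)})


rows = [{"text": "Win cash now"}, {"text": "Lunch now"}]
tokenized = Tokenizer().transform(rows)
model = VocabularyEstimator().fit(tokenized)
for row in model.transform(tokenized):
    print(row["words"], row["features"])
```

The key point is the shape of the two interfaces: `transform` maps data to data, while `fit` maps data to something that can itself `transform`.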
Machine Learning with Spark
• Spam detection or spam filtering:
– Given some e-mails in an inbox, the task is to identify which messages are spam and which are non-spam (often called ham).
Machine Learning with Spark
• Spam detection or spam filtering:
– Reading the dataset
– SparkSession is the single entry point to the underlying Spark functionality. It allows programming with the DataFrame and Dataset APIs
Machine Learning with Spark
• Spam detection or spam filtering:
– Pre-process the dataset
Machine Learning with Spark
• Spam detection or spam filtering:
– Feature extraction: make feature vectors
– TF: term frequency is the number of times a term appears in a document
• Feature vectorization method
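Term frequency as defined above is simply a per-document count. A minimal plain-Python sketch, with a made-up sample message and whitespace tokenization as a simplifying assumption:

```python
from collections import Counter


def term_frequencies(document):
    """Count how many times each term appears in one document."""
    return Counter(document.lower().split())


tf = term_frequencies("Free prize free entry")
print(tf)
```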
Machine Learning with Spark
• Spam detection or spam filtering:
– Tokenizer: Transformer that tokenizes the text into words
– HashingTF: Transformer that builds feature vectors using term frequencies
• Takes sets of terms
• Converts them to fixed-length feature vectors
• Uses the hashing trick to index terms
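The hashing trick can be sketched in a few lines: each term is hashed, and the hash modulo the vector length gives the index whose count is incremented. The sketch below is plain Python and uses CRC32 as a stand-in hash; Spark's HashingTF uses a different hash function (MurmurHash3), so the indices here will not match Spark's. The vector size of 16 and the sample text are assumptions for illustration.

```python
import zlib


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def hashing_tf(terms, num_features=16):
    """Map terms to a fixed-length term-frequency vector via the hashing trick."""
    vector = [0] * num_features
    for term in terms:
        # A stable hash (CRC32 here) modulo the vector size gives the index.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


vector = hashing_tf(tokenize("Free prize free entry"))
print(vector)
```

No vocabulary has to be built or stored, at the cost of possible collisions: two different terms may land in the same index, which a larger `num_features` makes less likely.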
Machine Learning with Spark
• Spam detection or spam filtering:
– Train a model
– Define the classifier
– Fit it on the training set
Machine Learning with Spark
• Spam detection or spam filtering:
– Test a model
Machine Learning with Spark
• Tuning
– Model selection:
• Hyperparameter tuning
• Find the best model or parameters for a given task
• Tuning can be done for an individual Estimator such as LogisticRegression, or for an entire Pipeline
– Model selection via cross-validation
– Model selection via train-validation split
Machine Learning with Spark
• Tuning
– Model selection workflow
• The tuning tools split the input data into separate training and test datasets
• For each (training, test) pair, they iterate through the set of ParamMaps:
– For each ParamMap, they fit the Estimator using those parameters
– They get the fitted Model and evaluate its performance using the Evaluator
• They select the Model produced by the best-performing set of parameters.
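The loop described above (fit once per parameter map, evaluate, keep the best) can be sketched in plain Python. Everything here is illustrative: `fit_knn` is a toy one-dimensional k-nearest-neighbour estimator, `accuracy` plays the role of the evaluator, the parameter maps are plain dicts, and the data points are made up. None of these names come from Spark.

```python
def fit_knn(training, k):
    """Toy estimator: returns a k-nearest-neighbour model over 1-D points."""
    def predict(x):
        nearest = sorted(training, key=lambda point: abs(point[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return int(votes * 2 > len(nearest))
    return predict


def accuracy(model, data):
    """Evaluator: fraction of correctly predicted labels."""
    return sum(model(x) == label for x, label in data) / len(data)


def select_model(training, validation, param_maps):
    """Fit one model per parameter map and keep the best-scoring one."""
    best = None
    for params in param_maps:
        model = fit_knn(training, **params)
        score = accuracy(model, validation)
        if best is None or score > best[0]:
            best = (score, params, model)
    return best


training = [(1, 0), (2, 0), (3, 0), (4, 1), (7, 1), (8, 1), (9, 1)]
validation = [(2.5, 0), (3.4, 0), (6, 1), (8.5, 1)]
param_maps = [{"k": k} for k in (1, 3, 5)]
score, params, model = select_model(training, validation, param_maps)
print(params, score)
```

When several parameter maps tie, this sketch keeps the first one it tried; that tie-breaking rule is a choice made for the example.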
Machine Learning with Spark
• Tuning
– Model selection workflow
• The Evaluator can be a RegressionEvaluator, a BinaryClassificationEvaluator, and so on.
Machine Learning with Spark
• Tuning
– Model selection via cross-validation
• CrossValidator begins by splitting the dataset into a set of folds; with k=3 folds it creates 3 (training, test) dataset pairs
• Each pair uses 2/3 of the data for training and 1/3 for testing
• To evaluate a particular ParamMap, it computes the average evaluation metric of the 3 models produced by fitting the Estimator on the 3 pairs
• Cross-validation is expensive, but it is a well-established method for choosing parameters that is more statistically sound than heuristic hand-tuning.
– Model selection via train-validation split
• Evaluates each combination of parameters only once
• Less expensive, but will not produce results as reliable when the training dataset is not sufficiently large.
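The fold arithmetic above (k pairs, each using (k-1)/k of the data for training, with the metric averaged over the k fits) can be checked with a short plain-Python sketch. The strided split, the mean-predicting "model" and the mean-absolute-error metric are assumptions chosen to keep the example small; Spark's CrossValidator splits the data randomly.

```python
def k_fold_pairs(records, k=3):
    """Yield k (training, test) pairs; each record is tested exactly once."""
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, test


def cross_validate(records, k, fit, evaluate):
    """Average the evaluation metric over the k fitted models."""
    scores = [evaluate(fit(training), test)
              for training, test in k_fold_pairs(records, k)]
    return sum(scores) / len(scores)


def fit_mean(training):
    """Toy estimator: the 'model' is just the mean of the training values."""
    return sum(training) / len(training)


def mean_abs_error(model, test):
    """Evaluator: mean absolute error of predicting `model` for every value."""
    return sum(abs(x - model) for x in test) / len(test)


records = list(range(9))
for training, test in k_fold_pairs(records, k=3):
    print(len(training), len(test))
print(cross_validate(records, 3, fit_mean, mean_abs_error))
```

A train-validation split corresponds to running only one of these pairs, which is why it is cheaper but noisier.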
Spam Filtering Application
• What have we done so far?
– Reading the dataset
– Cleaning
– Feature engineering
– Training
– Testing
– Tuning
– Deploying
– Persisting the model
– Reusing the existing model
Spam Filtering Application
• Deployment
Thanks
Questions?

Vitthal Shirke
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
e20449
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
Donna Lenk
 

Recently uploaded (20)

First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 

Large Scale Machine learning with Spark

  • 1. Large scale machine learning with Apache Spark Md. Mahedi Kaysar (Research Master), Insight Centre for Data Analytics [DCU] mahedi.kaysar@insight-centre.org
  • 2. Agenda • Spark Overview • Installing Spark and deploying applications • Machine learning with Spark 2.0 – Typical machine learning workflow – Spark MLlib • Develop a machine learning application – Spam filtering
  • 3. Spark Overview • Open source, large-scale data processing engine • Up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk • Applications can be written in Java, Scala, Python and R • Runs on Mesos, YARN or its own standalone cluster manager • Can access diverse data sources including HDFS, Cassandra, HBase and S3
  • 4. Spark Overview • MapReduce: distributed execution model – Map reads data from disk, processes it and writes it back to disk; the shuffle then sends that output to the reducers – Reduce reads data from disk, processes it and writes the result back to disk
  • 5. Spark Overview • MapReduce Execution Model: – Iterative jobs: lots of disk I/O operations
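
The map, shuffle and reduce phases described on slides 4 and 5 can be sketched in plain Python with a word count. This is only an in-memory illustration under simplifying assumptions: the function names are hypothetical, and a real Hadoop job would write the output of each phase to disk, which is exactly the I/O cost that iterative jobs pay on every pass.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all values by key so each key reaches one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["spark is fast", "hadoop is reliable", "spark is in memory"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["spark"], counts["is"])  # prints "2 3"
```

An iterative algorithm would repeat this whole chain once per iteration, re-reading its input each time.
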
  • 6. Spark Overview • Spark Execution Model – Keeps working data in memory between steps instead of writing it to disk
  • 7. Spark Overview • RDD: Resilient Distributed Dataset – Programs are written in terms of operations on distributed datasets – A partitioned collection of objects spread across the cluster, stored in memory or on disk – RDDs are built and manipulated through a diverse set of parallel transformations (map, filter, join) and actions (save, count, collect) – RDDs are automatically rebuilt on machine failure
  • 8. Spark Overview • RDD: Resilient Distributed Dataset – Immutable; the programmer can specify the number of partitions for an RDD
  • 9. Spark Overview • RDD: Transformation – Creates a new dataset from an existing one
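
The RDD properties on slides 7 to 9 (immutability, transformations that return a new dataset, actions that trigger computation, rebuilding from recorded steps) can be illustrated with a toy class. `ToyRDD` is a hypothetical, single-machine stand-in written for this sketch; it is not Spark's API and has no partitions or cluster.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, lazy, rebuilt from lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # base data, never modified
        self._lineage = tuple(lineage)  # recorded transformations, in order

    def map(self, fn):
        # Transformation: returns a NEW dataset; nothing is computed yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy, also returns a new dataset.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def _compute(self):
        # Replay the lineage over the source data. Replaying recorded steps
        # is also the idea behind rebuilding lost data after a failure.
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    def collect(self):
        # Action: triggers the computation and returns all elements.
        return list(self._compute())

    def count(self):
        # Action: triggers the computation and returns the element count.
        return sum(1 for _ in self._compute())


numbers = ToyRDD(range(1, 11))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.collect())  # prints [4, 16, 36, 64, 100]
print(numbers.count())         # prints 10; the original dataset is unchanged
```
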
  • 10. Spark Ecosystem (Apache Spark 2.0) • Spark Core: the underlying general execution engine. It provides in-memory computing, and the other APIs are built on top of it. • Spark SQL • Spark MLlib • Spark GraphX • Spark Streaming
  • 11. Apache Spark 2.0 • Spark SQL – Module for structured or tabular data processing – Its data abstraction was originally called SchemaRDD and has been named DataFrame since Spark 1.3 – Internally it has more information about the structure of both the data and the computation being performed – Two ways to interact with Spark SQL • SQL queries: “SELECT * FROM PEOPLE” • Dataset/DataFrame: domain-specific language
  • 12. Apache Spark 2.0 • Spark SQL
  • 13. Apache Spark 2.0 • Spark MLlib – Machine learning library – ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering • SVM, Decision Tree – Featurization: feature extraction, transformation, dimensionality reduction, and selection • Term frequency, document frequency – Pipelines: tools for constructing, evaluating, and tuning ML Pipelines – Persistence: saving and loading algorithms, models, and Pipelines – Utilities: linear algebra, statistics, data handling, etc. – The DataFrame-based API (spark.ml) is the primary API
  • 14. Apache Spark 2.0 • Spark Streaming – Provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data – A DStream is represented as a sequence of RDDs
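
The idea that a DStream is a sequence of per-interval datasets can be sketched by cutting a list of timestamped events into fixed-width micro-batches. The function name, the event format and the one-second interval are assumptions made for this illustration; plain lists stand in for RDDs.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width batches.

    Intervals with no events yield an empty batch, so the result is one
    batch per interval between the first and last event.
    """
    batches = {}
    for timestamp, value in events:
        batches.setdefault(int(timestamp // batch_interval), []).append(value)
    if not batches:
        return []
    first, last = min(batches), max(batches)
    return [batches.get(i, []) for i in range(first, last + 1)]


events = [(0.2, "spark"), (0.9, "ml"), (1.1, "spark"), (3.5, "ml")]
stream = micro_batches(events, batch_interval=1.0)

# The same batch-style operation is applied to every micro-batch in turn.
for batch in stream:
    print(len(batch), batch)
```
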
  • 15. Apache Spark 2.0 • Structured Streaming (Experimental) – Scalable and fault-tolerant stream processing engine built on the Spark SQL engine – The Spark SQL engine takes care of running the query incrementally and continuously, updating the final result as streaming data continues to arrive – Can be used with the Dataset or DataFrame APIs
  • 16. Apache Spark 2.0 • GraphX – Extends the Spark RDD by introducing a new Graph abstraction – A directed multigraph with properties attached to each vertex and edge – PageRank: measures the importance of a vertex – Connected components – Triangle count
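
To make "PageRank measures the importance of a vertex" concrete, here is a minimal sketch of the textbook power-iteration formulation on a small directed graph. It is not GraphX code: the function, the damping factor of 0.85, the fixed iteration count and the sample edges are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over a list of (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every vertex starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A vertex without outgoing edges spreads its rank over all vertices.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 3))
```

Vertex `c` ends up most important because three vertices link to it, while `d`, which nothing links to, keeps only the random-jump share.
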
  • 17. Apache Spark 2.0 • RDD vs. DataFrame vs. Dataset – All are immutable, distributed datasets – The RDD (resilient distributed dataset) is the main building block of Apache Spark. It processes data in memory for efficient reuse. – DataFrame and Dataset are more abstract than RDD; they are optimized and work well when you have structured data such as CSV, JSON, Hive and so on. – When you have raw data such as a text file, you can use an RDD and transform it to structured data with the help of DataFrame and Dataset
  • 18. Apache Spark 2.0 • RDD: – Immutable, partitioned collections of objects – Two main operations
  • 19. Apache Spark 2.0 • DataFrame – A dataset organized into named columns – Conceptually equivalent to a table in a relational database – Can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs
  • 20. Apache Spark 2.0 • Dataset – Distributed collection of data – Has all the benefits of RDD and DataFrame, with more optimization – A Dataset can be converted to the other forms of data – It is the latest API for data collections
  • 21. Apache Spark 2.0 • DataFrame/Dataset – Reading JSON data using a Dataset
  • 22. Apache Spark 2.0 • DataFrame/Dataset – Connect to Hive and query it with HiveQL
  • 23. Apache Spark 2.0 • Dataset – You can transform a Dataset to an RDD and an RDD to a Dataset
  • 24. Spark Cluster Overview • Spark uses a master/slave architecture • One central coordinator, called the driver, communicates with many distributed workers (executors) • The driver and the executors each run in their own Java process • The driver is the process where the main method runs • It converts the user program into tasks and schedules those tasks on the executors with the help of the cluster manager • The cluster manager launches the executors and manages the worker nodes • Executors run the Spark tasks and send the results back to the driver program • They provide in-memory storage for RDDs that are cached by the user program • The workers are in charge of communicating the availability of their resources to the cluster manager
  • 25. Spark Cluster Overview • Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster. • Example: a standalone cluster with 2 worker nodes (each node having 2 cores) – Local machine – Cloud EC2 • conf/spark-env.sh: export SPARK_WORKER_MEMORY=1g; export SPARK_EXECUTOR_MEMORY=1g; export SPARK_WORKER_INSTANCES=2; export SPARK_WORKER_CORES=2; export SPARK_WORKER_DIR=/home/work/sparkdata • Start the master: ./sbin/start-master.sh • conf/slaves: Master node IP • Start the workers: ./sbin/start-slaves.sh
  • 27. Application Deployment • Standalone mode – Client deploy mode – Cluster deploy mode
  • 28. Machine Learning with Spark • Typical machine learning workflow: • Load the sample data. • Parse the data into the input format for the algorithm. • Pre-process the data and handle the missing values. • Split the data into two sets, one for building the model (training dataset) and one for testing the model (validation dataset). • Run the algorithm to build or train your ML model.
  • 29. Machine Learning with Spark • Typical machine learning workflow: • Make predictions with the training data and observe the results. • Test and evaluate the model with the test data, or alternatively validate the model with a cross-validation technique using a third dataset, called the validation dataset. • Tune the model for better performance and accuracy. • Scale up the model so that it can handle massive datasets in the future. • Deploy the ML model for commercial use.
  • 30. Machine Learning with Spark • Pre-processing – The three most common data pre-processing steps are • Formatting: the data may not be in a usable shape • Cleaning: the data may contain unwanted records, or records with missing entries; cleaning removes or fixes them • Sampling: used when the available data is very large – Data transformation – Dataset, RDD and DataFrame
  • 31. Machine Learning with Spark • Feature Engineering • Extraction: extracting features from “raw” data • Transformation: scaling, converting, or modifying features • Selection: selecting a subset from a larger set of features
  • 32. Machine Learning with Spark • ML Algorithms • Classification • Regression • Tuning
  • 33. Machine Learning with Spark • ML Pipeline: – Higher-level API built on top of DataFrame – Can combine multiple algorithms into a complete workflow – For example: text analytics • Split the texts => words • Convert words => numerical feature vectors • Numerical feature vectors => labeling • Build an ML model as a prediction model using vectors and labels
  • 34. Machine Learning with Spark • ML Pipeline Components: – Transformers • An abstraction that includes feature transformers and learned models • An algorithm for transforming one Dataset or DataFrame into another Dataset or DataFrame • Example: HashingTF – Estimators • An algorithm that can be fit on a Dataset or DataFrame to produce a transformer or model • Example: Logistic Regression
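
The transformer/estimator distinction on slides 33 and 34 can be sketched with a few small classes. These are hypothetical stand-ins that only mirror the shape of the idea (`transform` maps data to data, `fit` returns a fitted transformer); they are not the spark.ml classes, rows are plain dictionaries instead of DataFrame rows, and the "model" is deliberately trivial.

```python
class Lowercase:
    """Transformer: maps one dataset (a list of row dicts) to another."""

    def transform(self, rows):
        return [dict(row, text=row["text"].lower()) for row in rows]


class MajorityLabelModel:
    """Fitted model; it is itself a transformer that adds a prediction."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(row, prediction=self.label) for row in rows]


class MajorityLabelEstimator:
    """Estimator: fit() learns from the data and returns a transformer."""

    def fit(self, rows):
        labels = [row["label"] for row in rows]
        return MajorityLabelModel(max(sorted(set(labels)), key=labels.count))


class PipelineModel:
    """Result of fitting a pipeline: a chain of transformers only."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs stages in order, fitting every estimator on the data so far."""

    def __init__(self, stages):
        self.stages = list(stages)

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator -> fitted transformer
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)   # feed the next stage
        return PipelineModel(fitted)


training = [
    {"text": "WIN a PRIZE now", "label": 1},
    {"text": "Lunch at noon?", "label": 0},
    {"text": "Meeting moved to 3pm", "label": 0},
]
model = Pipeline([Lowercase(), MajorityLabelEstimator()]).fit(training)
predictions = model.transform([{"text": "Free PRIZE inside"}])
print(predictions)  # prints [{'text': 'free prize inside', 'prediction': 0}]
```
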
  • 35. Machine Learning with Spark • Spam detection or spam filtering: – Given some e-mails in an inbox, the task is to identify which e-mail messages are spam and which are non-spam (often called ham)
  • 36. Machine Learning with Spark • Spam detection or spam filtering: – Reading the dataset – SparkSession is the single entry point for interacting with the underlying Spark functionality. It allows programming with the DataFrame and Dataset APIs
  • 37. Machine Learning with Spark • Spam detection or spam filtering: – Pre-process the dataset
  • 38. Machine Learning with Spark • Spam detection or spam filtering: – Feature extraction: make feature vectors – TF: term frequency is the number of times a term appears in a document • Feature vectorization method
  • 39. Machine Learning with Spark • Spam detection or spam filtering: – Tokenizer: Transformer that tokenises the text into words – HashingTF: Transformer that builds feature vectors using the TF technique • Takes sets of terms • Converts them to fixed-length feature vectors • Uses the hashing trick to index terms
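
The tokenising and hashing-trick steps on slides 38 and 39 can be sketched without Spark. The two functions below are hypothetical helpers for illustration only; in particular, `zlib.crc32` is used as a stand-in hash so the result is reproducible, and the vector length of 16 is an arbitrary assumption, so the bucket indices will not match those produced by Spark's HashingTF.

```python
import zlib


def tokenize(text):
    """Lowercase the text and split it into words on whitespace."""
    return text.lower().split()


def hashing_tf(tokens, num_features=16):
    """Term-frequency vector built with the hashing trick.

    Each term is hashed to a bucket index, so no vocabulary has to be
    stored; different terms may collide in the same bucket.
    """
    vector = [0] * num_features
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


tokens = tokenize("Win money now win a prize")
features = hashing_tf(tokens)
print(tokens)
print(features)
```

A smaller `num_features` saves memory but increases the chance that unrelated terms share a bucket.
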
  • 40. Machine Learning with Spark • Spam detection or spam filtering: – Train a model – Define the classifier – Fit it on the training set
  • 41. Machine Learning with Spark • Spam detection or spam filtering: – Test a model
  • 42. Machine Learning with Spark • Tuning – Model selection: • Hyper-parameter tuning • Find the best model or parameters for a given task • Tuning can be done for an individual estimator such as logistic regression, or for an entire pipeline – Model selection via cross-validation – Model selection via train-validation split
  • 43. Machine Learning with Spark • Tuning – Model selection workflow • The tuning tools split the input data into separate training and test sets • For each (training, test) pair, they iterate through the set of ParamMaps – For each ParamMap, they fit the estimator using those parameters – They get the fitted model and evaluate its performance using the evaluator • They select the model produced by the best-performing set of parameters
  • 44. Machine Learning with Spark • Tuning – Model selection workflow • The evaluator can be a RegressionEvaluator, a BinaryClassificationEvaluator and so on
  • 45. Machine Learning with Spark • Tuning – Model selection via cross-validation • CrossValidator begins by splitting the dataset into a set of folds; with k=3 folds it creates 3 (training, test) dataset pairs • Each pair uses 2/3 of the data for training and 1/3 for testing • To evaluate a particular ParamMap, it computes the average evaluation metric over the three models produced by fitting the estimator on the three pairs • Cross-validation over a grid of parameters is expensive; however, it is also a well-established method for choosing parameters which is more statistically sound than heuristic hand-tuning – Model selection via train-validation split • Only evaluates each combination of parameters once • Less expensive, but will not produce results that are as reliable when the training dataset is not sufficiently large
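
The model selection workflow on slides 43 to 45 can be sketched as a k-fold loop over a parameter grid. Everything here is an assumption made for illustration: the toy 1-D k-nearest-neighbour classifier, the accuracy metric, the hand-made dataset with one deliberately noisy label, and the round-robin fold assignment (Spark's CrossValidator assigns rows to folds randomly).

```python
def k_fold_indices(n, num_folds):
    """Yield (train, test) index lists; fold f holds indices f, f+k, f+2k, ..."""
    folds = [list(range(f, n, num_folds)) for f in range(num_folds)]
    for f in range(num_folds):
        train = [i for g in range(num_folds) if g != f for i in folds[g]]
        yield train, folds[f]


def knn_predict(train, x, k):
    """Predict a 0/1 label by majority vote among the k nearest points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def cross_validate(data, param_grid, num_folds=3):
    """Return (best parameter, {parameter: average accuracy over folds})."""
    scores = {}
    for k in param_grid:
        fold_scores = []
        for train_idx, test_idx in k_fold_indices(len(data), num_folds):
            train = [data[i] for i in train_idx]
            test = [data[i] for i in test_idx]
            correct = sum(knn_predict(train, x, k) == y for x, y in test)
            fold_scores.append(correct / len(test))
        scores[k] = sum(fold_scores) / num_folds
    return max(scores, key=scores.get), scores


# (feature, label) pairs; the point (3, 1) is a deliberately noisy label.
data = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 0), (5, 0),
        (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)]
best_k, scores = cross_validate(data, param_grid=[1, 3])
print(best_k, scores)
```

With 3 folds, each candidate value is fitted and evaluated three times, which is why cross-validation costs roughly three times as much as a single train-validation split over the same grid.
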
  • 46. Spam Filtering Application • What have we done so far? – Reading the dataset – Cleaning – Feature engineering – Training – Testing – Tuning – Deploying – Persisting the model – Reusing the existing model

Editor's Notes

  1. Spark extends the MapReduce programming model to better support iterative workloads such as machine learning and graph processing. The motivation is that most current programming models are based on acyclic data flow, meaning that data flows from stable storage to stable storage. This lets the runtime decide where to run tasks and recover automatically from failures, but it is inefficient for applications that repeatedly reuse a working set of data, such as machine learning and graph algorithms: they have to reload the data from stable (persistent) storage on each query. Apache Spark's solution is the resilient distributed dataset (RDD), which allows applications to keep a working set in memory for efficient reuse. It also keeps the attractive properties of MapReduce: fault tolerance, data locality and scalability.
  2. Here spam is not a feature. You have to extract the features and the label from here. Then you have to transform them into feature vectors, which are numerical representations of