SlideShare a Scribd company logo
1 of 29
Download to read offline
Understanding Parallelization
of Machine Learning Algorithms
in Apache Spark™
About Richard
Richard Garris has spent 15 years
advising customers in data management
and analytics. As a Director of Solutions
Architecture at Databricks the past 3
years, he works closely with data
scientists to build machine learning
pipelines and advise companies on best
practices in advanced analytics. He has
advanced degrees from Carnegie Mellon
and The Ohio State University.
2
Agenda
●Machine Learning Overview
●Machine Learning as an Optimization Problem
●Logistic Regression
●Single Machine vs Distributed Machine Learning
●Other Algorithms e.g. Random Forest
●Other ML Parallelization Techniques
●Demonstration
●Q&A
3
AI is Changing the World
AlphaGoSelf-driving cars Alexa/Google Home
Data is the Fuel that Drives ML
Andrew Ng calls the algorithms the
“rocket ship” and the data “the fuel that
you feed machine learning” to build
deep learning applications
*Source: Buckham & Duffy
5
What is Machine Learning?
Supervised Machine Learning
A set of techniques that, given a set of examples, attempts to
predict outcomes for future values.
The set of techniques are the algorithms which includes both
transformational algorithms (featurizers) and machine learning
methods (estimators)
Examples are used to “train” the algorithm to produce the Model
A Model is a Mathematical Function
 
Example Bank Marketing
Who has heard of Signet Bank?
“data on the specific transactions of a bank’s customers can
improve models for deciding what product offers to make.1
”
(1) Data Science for Business
Provost, Foster
O’Reilly 2013
8
But …
given a set of examples how do I get training data to the
function (i.e. model)?
9
I want to
predict if my
customer
will do this
Given this data ...
I want to create this
Logistic regression
function
Machine Learning is an Optimization Problem
● Given the training data how do I as quickly as possible come up with
the most optimal equation given training data X that solves for Y
● In generalized linear regression, the goal is to learn the coefficients
(i.e. weights) based on the examples and fit a line based on the
points
● In decision tree learning, we are learning where to split on features
● Time can be saved by minimizing the amount of work and
optimization is achieved when the result produced yields the least
error on the test set
● Getting to the optimal model is called convergence
10
Error
The goal of the optimization is to either maximize a positive
metric like the area under the curve (AUC) or minimize a negative
metric (like error.)
This goal is called the objective function (or the loss / cost
function.)
Over each iteration
the AUC goes up.
11
Train - Tune - Test
Data is split into samples
Training is used to fit the model(s)
The tuning (or validation) set is used to find the optimal
parameters
The test set is used to
evaluate the final
model
12
Source: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7
Amount of Work
● Big O Notation - provides a worst case scenario for how long an algorithm will take
relative to the amount of data (or number of iterations)
● In Spark MLLib we always want scalable machine learning so that the runtime
doesn’t explode when we get massive datasets
● Example: finding a single element in a set of 1M values (say it takes 0.001 seconds per
comparison):
○ Item by Item Search - O(N):
■ Worst Case - 16 minutes
○ Binary Search - O(Log N)
■ Worst Case - 6 milliseconds
○ Hash Search - O (1)
■ Worst Case - 1 millisecond
13
Data Representation
Pandas - DataFrames represented on a single machine as Python
data structures
RDDs - Spark’s foundational structure Resilient Distributed
Dataset is represented as a reference to partitioned data without
types
DataFrames - Spark’s strongly typed optimized distributed
collection of rows
All ML Algorithms need to represent data as vectors (array of
Double values) in order to train machine learning algorithms
14
ML Pipelines
Train model
Evaluate
Load data
Extract features
A very simple pipeline
ML Pipelines
Train model 1
Evaluate
Datasource 1
Datasource 2
Datasource 3
Extract featuresExtract features
Feature transform 1
Feature transform 2
Feature transform 3
Train model 2
Ensemble
A real pipeline!
Parallelization Technique # 1
Feature engineer in Spark but train the model on a single
driver machine
○ Gather source data, perform joins, do complex ETL and feature
engineer
○ Works well when you have big data for your transactions but smaller
aggregated or subsamples once you have features to train on
○ May need to use a larger size driver node that can fit all the training
features into memory (may need to adjust
spark.driver.maxResultSize)
○ Use Scikit-learn, R, TensorFlow (optionally with GPU) or other single
machine learners
17
Parallelization in ML Pipelines
Train model 1
Evaluate
Datasource 1
Datasource 2
Datasource 3
Extract featuresExtract features
Feature transform 1
Feature transform 2
Feature transform 3
Train model 2
Ensemble
You can parallelize
here
Single machine here
Parallelization Technique # 2
Train the entire model with distributed machine learning
○ MLLib built for large scale model creation (millions of instances)
19
○ You will create a single model with all this
data (you use model selection techniques
like Train Tune Validation splits or a cross
validator)
○ Spark Deep Learning, Horvod (for
TensorFlow), XGBoost (the JVM version)
and H2O also support distributed ML
Spark Architecture
20
Parallelization Technique # 3
Create many models one on each worker
○ Works for creating one model per customer, one model per set of
features for feature selection, one model per set of hyperparameters
for tuning…
○ Spark-sklearn helps facilitate creating one model per worker
○ You can as of Spark 2.3 use the cross validation parallelism parameter
to run do model selection / hyperparameter tuning in parallel in
MLLib as well
○ Broadcast your data but use Spark to train models on each worker
with different sets of hyperparameters to find the optimal model
21
Single Machine Learning
In a single machine algorithm such as scikit-learn, the data is
loaded into the memory of a single machine (usually as Pandas or
NumPy vectors)
The single machine algorithm iterates (loops) over the data in
memory and using a single thread (sometimes multicore) takes a
“guess” at the equation
The error is calculated and if the error from the previous iteration
is less than the tolerance or the iterations are >= max iterations
then STOP
22
Distributed Machine Learning
In Distributed Machine Learning like Spark MLlib, the data is represented as an RDD or
a DataFrame typically in distributed file storage such as S3, Blob Store or HDFS
As a best practice, you should cache the data in memory across the nodes so that
multiple iterations don’t have to query the disk each time
The calculation of the gradients is distributed across all the nodes using Spark’s
distributed compute engine (similar to MapReduce)
After each iteration, the results return to the driver and iterations continue until either
max_iter or tol is reached
23
Algorithms
Logistic Regression Basics
The goal of logistic regression is to fit a sigmoid curve over
the dataset.
Why not use a standard straight line?
Probabilities have to between 0 and 1
25
Logistic Regression Solvers
There are many methods to finding an
optimal equation in Logistic regression
● Stochastic Gradient Descent
● LBFGS
● Newtonian
The basic premise is at each step you
“guess” a new model by taking a step
toward reducing the error or increasing
the AUC
26
Random Forest
Random Forest uses a collection of weighted decision trees trained on
random subsample of data to “vote” on the label
Random Forest is a general purpose algorithm that works well for larger
datasets due to its ability to find non-linearities in the data
Because Random Forest uses a number of trees it “averages out” outliers
and noise and therefore is resistant to overfitting
Each decision tree has depth and a number of decision trees in the forest
27
Single vs Distributed Random Forest
In Spark because each tree is trained on a subsample of data (and
features) and do not depend on the other trees many trees can be
trained in parallel.
The selection of features to split on is also distributed
Splits are chosen based on information gain
Tree creation is stopped when maxDepth is
hit, information gain is < minInfoGain or
no split candidates have minInstancesPerNode
28
Demonstration
https://community.cloud.databricks.com/?o=15269310110807
74#notebook/2040911831196221/command/174442750355907
3
Databricks Community Edition or Free Trial
https://databricks.com/try-databricks
Additional Questions?
Contact us at http://go.databricks.com/contact-databricks
2
9

More Related Content

What's hot

Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
 
Designing Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLDesigning Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLVenu Anuganti
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and BoostingMohit Rajput
 
Lecture 16 memory bounded search
Lecture 16 memory bounded searchLecture 16 memory bounded search
Lecture 16 memory bounded searchHema Kashyap
 
Introduction to Real-time data processing
Introduction to Real-time data processingIntroduction to Real-time data processing
Introduction to Real-time data processingYogi Devendra Vyavahare
 
Push down automata
Push down automataPush down automata
Push down automataSomya Bagai
 
Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Hayim Makabee
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)KU Leuven
 
Training language models to follow instructions with human feedback.pdf
Training language models to follow instructions
with human feedback.pdfTraining language models to follow instructions
with human feedback.pdf
Training language models to follow instructions with human feedback.pdfPo-Chuan Chen
 
INTRODUCTION TO LISP
INTRODUCTION TO LISPINTRODUCTION TO LISP
INTRODUCTION TO LISPNilt1234
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Sparkdatamantra
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learningsafa cimenli
 
Big Data Analytics with Spark
Big Data Analytics with SparkBig Data Analytics with Spark
Big Data Analytics with SparkMohammed Guller
 
Relational algebra in DBMS
Relational algebra in DBMSRelational algebra in DBMS
Relational algebra in DBMSArafat Hossan
 

What's hot (20)

Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Turing machine
Turing machineTuring machine
Turing machine
 
Designing Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLDesigning Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQL
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and Boosting
 
Lecture 16 memory bounded search
Lecture 16 memory bounded searchLecture 16 memory bounded search
Lecture 16 memory bounded search
 
Introduction to Real-time data processing
Introduction to Real-time data processingIntroduction to Real-time data processing
Introduction to Real-time data processing
 
Push down automata
Push down automataPush down automata
Push down automata
 
Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
 
Apache Spark MLlib
Apache Spark MLlib Apache Spark MLlib
Apache Spark MLlib
 
Training language models to follow instructions with human feedback.pdf
Training language models to follow instructions
with human feedback.pdfTraining language models to follow instructions
with human feedback.pdf
Training language models to follow instructions with human feedback.pdf
 
INTRODUCTION TO LISP
INTRODUCTION TO LISPINTRODUCTION TO LISP
INTRODUCTION TO LISP
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Spark
 
LISP: Introduction to lisp
LISP: Introduction to lispLISP: Introduction to lisp
LISP: Introduction to lisp
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 
Big Data Analytics with Spark
Big Data Analytics with SparkBig Data Analytics with Spark
Big Data Analytics with Spark
 
Unit05 dbms
Unit05 dbmsUnit05 dbms
Unit05 dbms
 
svm classification
svm classificationsvm classification
svm classification
 
Er diagram
Er diagramEr diagram
Er diagram
 
Relational algebra in DBMS
Relational algebra in DBMSRelational algebra in DBMS
Relational algebra in DBMS
 

Similar to Understanding Parallelization of Machine Learning Algorithms in Apache Spark™

Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streamingAdam Doyle
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Anant Corporation
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...butest
 
Web Traffic Time Series Forecasting
Web Traffic  Time Series ForecastingWeb Traffic  Time Series Forecasting
Web Traffic Time Series ForecastingBillTubbs
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaChetan Khatri
 
House price prediction
House price predictionHouse price prediction
House price predictionSabahBegum
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
Presentation
PresentationPresentation
Presentationbutest
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Universitat Politècnica de Catalunya
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionEmanuele Bezzi
 
Technical_Report_on_ML_Library
Technical_Report_on_ML_LibraryTechnical_Report_on_ML_Library
Technical_Report_on_ML_LibrarySaurabh Chauhan
 
FlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache FlinkFlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache FlinkTheodoros Vasiloudis
 
Towards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning ResearchTowards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning ResearchArtemSunfun
 
Biomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLABBiomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLABCodeOps Technologies LLP
 
Machine Learning for Dummies (without mathematics)
Machine Learning for Dummies (without mathematics)Machine Learning for Dummies (without mathematics)
Machine Learning for Dummies (without mathematics)ActiveEon
 

Similar to Understanding Parallelization of Machine Learning Algorithms in Apache Spark™ (20)

Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...
 
unit 2 hpc.pptx
unit 2 hpc.pptxunit 2 hpc.pptx
unit 2 hpc.pptx
 
Web Traffic Time Series Forecasting
Web Traffic  Time Series ForecastingWeb Traffic  Time Series Forecasting
Web Traffic Time Series Forecasting
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
House price prediction
House price predictionHouse price prediction
House price prediction
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Presentation
PresentationPresentation
Presentation
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an Introduction
 
Technical_Report_on_ML_Library
Technical_Report_on_ML_LibraryTechnical_Report_on_ML_Library
Technical_Report_on_ML_Library
 
Unit 1 dsa
Unit 1 dsaUnit 1 dsa
Unit 1 dsa
 
FlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache FlinkFlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache Flink
 
Towards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning ResearchTowards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning Research
 
AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)
 
Using Apache Spark with IBM SPSS Modeler
Using Apache Spark with IBM SPSS ModelerUsing Apache Spark with IBM SPSS Modeler
Using Apache Spark with IBM SPSS Modeler
 
Biomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLABBiomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLAB
 
Machine Learning for Dummies (without mathematics)
Machine Learning for Dummies (without mathematics)Machine Learning for Dummies (without mathematics)
Machine Learning for Dummies (without mathematics)
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...Akihiro Suda
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 

Recently uploaded (20)

UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 

Understanding Parallelization of Machine Learning Algorithms in Apache Spark™

  • 1. Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
  • 2. About Richard Richard Garris has spent 15 years advising customers in data management and analytics. As a Director of Solutions Architecture at Databricks the past 3 years, he works closely with data scientists to build machine learning pipelines and advise companies on best practices in advanced analytics. He has advanced degrees from Carnegie Mellon and The Ohio State University. 2
  • 3. Agenda ●Machine Learning Overview ●Machine Learning as an Optimization Problem ●Logistic Regression ●Single Machine vs Distributed Machine Learning ●Other Algorithms e.g. Random Forest ●Other ML Parallelization Techniques ●Demonstration ●Q&A 3
  • 4. AI is Changing the World AlphaGoSelf-driving cars Alexa/Google Home
  • 5. Data is the Fuel that Drives ML Andrew Ng calls the algorithms the “rocket ship” and the data “the fuel that you feed machine learning” to build deep learning applications *Source: Buckham & Duffy 5
  • 6. What is Machine Learning? Supervised Machine Learning A set of techniques that, given a set of examples, attempts to predict outcomes for future values. The set of techniques are the algorithms which includes both transformational algorithms (featurizers) and machine learning methods (estimators) Examples are used to “train” the algorithm to produce the Model
  • 7. A Model is a Mathematical Function  
  • 8. Example Bank Marketing Who has heard of Signet Bank? “data on the specific transactions of a bank’s customers can improve models for deciding what product offers to make.1 ” (1) Data Science for Business Provost, Foster O’Reilly 2013 8
  • 9. But … given a set of examples how do I get training data to the function (i.e. model)? 9 I want to predict if my customer will do this Given this data ... I want to create this Logistic regression function
  • 10. Machine Learning is an Optimization Problem ● Given the training data how do I as quickly as possible come up with the most optimal equation given training data X that solves for Y ● In generalized linear regression, the goal is to learn the coefficients (i.e. weights) based on the examples and fit a line based on the points ● In decision tree learning, we are learning where to split on features ● Time can be saved by minimizing the amount of work and optimization is achieved when the result produced yields the least error on the test set ● Getting to the optimal model is called convergence 10
  • 11. Error The goal of the optimization is to either maximize a positive metric like the area under the curve (AUC) or minimize a negative metric (like error.) This goal is called the objective function (or the loss / cost function.) Over each iteration the AUC goes up. 11
  • 12. Train - Tune - Test Data is split into samples Training is used to fit the model(s) The tuning (or validation) set is used to find the optimal parameters The test set is used to evaluate the final model 12 Source: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7
  • 13. Amount of Work ● Big O Notation - provides a worst case scenario for how long an algorithm will take relative to the amount of data (or number of iterations) ● In Spark MLLib we always want scalable machine learning so that the runtime doesn’t explode when we get massive datasets ● Example: finding a single element in a set of 1M values (say it takes 0.001 seconds per comparison): ○ Item by Item Search - O(N): ■ Worst Case - 16 minutes ○ Binary Search - O(Log N) ■ Worst Case - 6 milliseconds ○ Hash Search - O (1) ■ Worst Case - 1 millisecond 13
  • 14. Data Representation Pandas - DataFrames represented on a single machine as Python data structures RDDs - Spark’s foundational structure Resilient Distributed Dataset is represented as a reference to partitioned data without types DataFrames - Spark’s strongly typed optimized distributed collection of rows All ML Algorithms need to represent data as vectors (array of Double values) in order to train machine learning algorithms 14
  • 15. ML Pipelines Train model Evaluate Load data Extract features A very simple pipeline
  • 16. ML Pipelines Train model 1 Evaluate Datasource 1 Datasource 2 Datasource 3 Extract featuresExtract features Feature transform 1 Feature transform 2 Feature transform 3 Train model 2 Ensemble A real pipeline!
  • 17. Parallelization Technique # 1 Feature engineer in Spark but train the model on a single driver machine ○ Gather source data, perform joins, do complex ETL and feature engineer ○ Works well when you have big data for your transactions but smaller aggregated or subsamples once you have features to train on ○ May need to use a larger size driver node that can fit all the training features into memory (may need to adjust spark.driver.maxResultSize) ○ Use Scikit-learn, R, TensorFlow (optionally with GPU) or other single machine learners 17
  • 18. Parallelization in ML Pipelines Train model 1 Evaluate Datasource 1 Datasource 2 Datasource 3 Extract featuresExtract features Feature transform 1 Feature transform 2 Feature transform 3 Train model 2 Ensemble You can parallelize here Single machine here
  • 19. Parallelization Technique # 2 Train the entire model with distributed machine learning ○ MLLib built for large scale model creation (millions of instances) 19 ○ You will create a single model with all this data (you use model selection techniques like Train Tune Validation splits or a cross validator) ○ Spark Deep Learning, Horvod (for TensorFlow), XGBoost (the JVM version) and H2O also support distributed ML
  • 21. Parallelization Technique # 3 Create many models one on each worker ○ Works for creating one model per customer, one model per set of features for feature selection, one model per set of hyperparameters for tuning… ○ Spark-sklearn helps facilitate creating one model per worker ○ You can as of Spark 2.3 use the cross validation parallelism parameter to run do model selection / hyperparameter tuning in parallel in MLLib as well ○ Broadcast your data but use Spark to train models on each worker with different sets of hyperparameters to find the optimal model 21
  • 22. Single Machine Learning In a single machine algorithm such as scikit-learn, the data is loaded into the memory of a single machine (usually as Pandas or NumPy vectors) The single machine algorithm iterates (loops) over the data in memory and using a single thread (sometimes multicore) takes a “guess” at the equation The error is calculated and if the error from the previous iteration is less than the tolerance or the iterations are >= max iterations then STOP 22
  • 23. Distributed Machine Learning In Distributed Machine Learning like Spark MLlib, the data is represented as an RDD or a DataFrame typically in distributed file storage such as S3, Blob Store or HDFS As a best practice, you should cache the data in memory across the nodes so that multiple iterations don’t have to query the disk each time The calculation of the gradients is distributed across all the nodes using Spark’s distributed compute engine (similar to MapReduce) After each iteration, the results return to the driver and iterations continue until either max_iter or tol is reached 23
  • 25. Logistic Regression Basics The goal of logistic regression is to fit a sigmoid curve over the dataset. Why not use a standard straight line? Probabilities have to between 0 and 1 25
  • 26. Logistic Regression Solvers There are many methods to finding an optimal equation in Logistic regression ● Stochastic Gradient Descent ● LBFGS ● Newtonian The basic premise is at each step you “guess” a new model by taking a step toward reducing the error or increasing the AUC 26
  • 27. Random Forest Random Forest uses a collection of weighted decision trees trained on random subsample of data to “vote” on the label Random Forest is a general purpose algorithm that works well for larger datasets due to its ability to find non-linearities in the data Because Random Forest uses a number of trees it “averages out” outliers and noise and therefore is resistant to overfitting Each decision tree has depth and a number of decision trees in the forest 27
  • 28. Single vs Distributed Random Forest In Spark because each tree is trained on a subsample of data (and features) and do not depend on the other trees many trees can be trained in parallel. The selection of features to split on is also distributed Splits are chosen based on information gain Tree creation is stopped when maxDepth is hit, information gain is < minInfoGain or no split candidates have minInstancesPerNode 28
  • 29. Demonstration https://community.cloud.databricks.com/?o=15269310110807 74#notebook/2040911831196221/command/174442750355907 3 Databricks Community Edition or Free Trial https://databricks.com/try-databricks Additional Questions? Contact us at http://go.databricks.com/contact-databricks 2 9