SlideShare a Scribd company logo
1 of 36
Download to read offline
Generalized Linear Models
in Spark MLlib and SparkR
Xiangrui Meng
joint with Joseph Bradley, Eric Liang, Yanbo Liang (MiningLamp), DB Tsai (Netflix), et al.
2016/02/17 - Spark Summit East
About me
• Software Engineer at Databricks
• Spark PMC member and MLlib/PySpark maintainer
• Ph.D. from Stanford on randomized algorithms for large-
scale linear regression problems
2
Outline
• Generalized linear models (GLMs)
• linear regression / logistic regression / general form
• accelerated failure time (AFT) model for survival analysis
• intercept / regularization / weights
• GLMs in MLlib and SparkR
• demo: R formula in Spark
• Implementing GLMs
• gradient descent / L-BFGS / OWL-QN
• weighted least squares / iteratively re-weighted least squares (IRLS)
• performance tips
3
Generalized linear models
Linear regression
5
https://en.wikipedia.org/wiki/Linear_regression
inference / prediction
Linear least squares
• m observations:
• x: explanatory variables, y: dependent variable
• assumes linear relationship between x and y
• minimizes the sum of the squares of the errors
6
Linear least squares
• the oldest linear model, trace back to Gauss
• the simplest and the most studied linear model
• has analytic solutions
• easy to solve
• easy to inspect
• sensitive to outliers
7
Logistic regression
8
https://en.wikipedia.org/wiki/Logistic_regression
Logistic regression
• classification with binary response:
• true/false, clicked/not clicked, liked/disliked
• uses logistic function to indicate the likelihood
• maximizes the sum of the log-likelihoods, i.e.,
9
Logistic regression
• one of the simplest binary classification models
• widely used in industry
• relatively easy to solve
• easy to interpret
10
Multinomial logistic regression
• classification with multiclass response:
• uses softmax function to indicate likelihood
• maximizes the sum of log-likelihoods
11
Generalized linear models (GLMs)
• Both linear least squares and logistic regression are
special cases of generalized linear models.
• A GLM is specified by the following:
• a distribution of the response (from the exponential family),
• a link function g such that
• maximizes the sum of log-likelihoods
12
Distributions and link functions
13
Model Distribution Link
linear least squares normal identity
logistic regression binomial logit
multinomial logic regression multinomial generalized logit
Poisson regression Poisson log
gamma regression gamma inverse
Accelerated failure time (AFT) model
• m observations:
• y: survival time, c: censor variable (alive or dead)
• assumes the effect of an explanatory variable is to
accelerate or decelerate the life time by some constant
• uses maximum likelihood estimation while treating
censored and uncensored observations differently
14
AFT model for survival analysis
• one popular parametric model for survival analysis
• widely used for lifetime estimation and churn analysis
• could be solved under the same framework as GLMs
15
Intercept, regularization, and weights
In practice, a linear model is often more complex





where w describes instance weights, beta_0 is the
intercept term to adjust bias, and sigma regularized beta
with a constant lambda > 0 to avoid overfitting.
16
Types of regularization
• Ridge (L2):
• easy to solve (strongly convex)
• Lasso (L1):
• enhance model sparsity
• harder to solver (though still convex)
• Elastic-Net:
• Others: group lasso, nonconvex, etc
17
GLMs in MLlib and SparkR
GLMs in Spark MLlib
Linear models in MLlib are implemented as ML pipeline
estimators. They accept the following params:
• featuresCol: a vector column containing features (x)
• labelCol: a double column containing responses (y)
• weightCol: a double column containing weights (w)
• regType: regularization type, “none”, “l1”, “l2”, “elastic-net”
• regParam: regularization constant
• fitIntercept: whether to fit an intercept term
• …
19
Fit a linear model in MLlib
from pyspark.ml.classification import LogisticRegression
# Load training data
training = sqlContext.read.parquet(“path/to/training“)
lr = LogisticRegression(
weightCol=“weight”, fitIntercept=False, maxIter=10,
regParam=0.3, elasticNetParam=0.8)
# Fit the model
model = lr.fit(training)
20
Make predictions and evaluate models
from pyspark.ml.evaluation import BinaryClassificationEvaluator
test = sqlContext.read.parquet(“path/to/test“)
# make predictions by calling transform
predictions = model.tranform(test)
# create a binary classification evaluator
evaluator = BinaryClassificationEvaluator(
metricName=“areaUnderROC”)
evaulator.evaluate(predictions)
21
GLMs in SparkR
In Python/Scala/Java, we keep the APIs about the same
for consistency. But in SparkR, we make the APIs similar
to existing ones in R (or R packages).
# Create the DataFrame
df <- read.df(sqlContext, “path/to/training”)
# Fit a Gaussian GLM model
model <- glm(y ~ x1 + x2, data = df, family = "gaussian")
22
R formula in SparkR
23
• R provides model formula to express linear models.
• We support the following R formula operators in SparkR:
• `~` separate target and terms
• `+` concat terms, "+ 0" means removing intercept
• `-` remove a term, "- 1" means removing intercept
• `:` interaction (multiplication for numeric values, or binarized categorical
values)
• `.` all columns except target
• For example, “y ~ x + z + x:z -1” means using x, z, and their
interaction (x:z) to predict y without intercept (-1).
Demo: GLMs in Spark
… using Databricks Community Edition!
Implementing GLMs
Row-based distributed storage
26
w x y
w x y
partition 1
partition 2
Gradient descent methods
• Stochastic gradient descent (SGD):
• trade-offs on the merge scheme and convergence
• Mini-batch SGD:
• hard to sample mini-batches efficiently
• communication overhead on merging gradients
• Batch gradient descent:
• slow convergence
27
Quasi-Newton methods
• Newton’s method converges much than GD, but it
requires second-order information:
• L-BFGS works for smooth objectives. It approximates the
inverse Hessian using only first-order information.
• OWL-QN works for objectives with L1 regularization.
• MLlib calls L-BFGS/OWL-QN implemented in breeze.
28
Direct methods for linear least squares
• Linear least squares has an analytic solution:
• The solution could be computed directly or through QR
factorization, both of which are implemented in Spark.
• requires only a single pass
• efficient when the number of features is small (<4000)
• provides R-like model summary statistics
29
Iteratively re-weighted least squares (IRLS)
• Generalized linear models with exponential family can
be solved via iteratively re-weighted least squares (IRLS).
• linearizes the objective at the current solution
• solves the weighted linear least squares problem
• repeat above steps until convergence
• efficient when the number of features is small (<4000)
• provides R-like model summary statistics
• This is the implementation in R.
30
Verification using R
Besides normal tests, we also verify our implementation using R.
/*

df <- as.data.frame(cbind(A, b))

for (formula in c(b ~ . -1, b ~ .)) {

model <- lm(formula, data=df, weights=w)

print(as.vector(coef(model)))

}



[1] -3.727121 3.009983

[1] 18.08 6.08 -0.60

*/
val expected = Seq(Vectors.dense(0.0, -3.727121, 3.009983),

Vectors.dense(18.08, 6.08, -0.60))
31
Standardization
To match the result in both R and glmnet, the most
popular R package for GLMs, we provide options to
standardize features and labels before training:







where delta is the stddev of labels, and sigma_j is the
stddev of the j-th feature column.
32
Performance tips
• Utilize sparsity.
• Use tree aggregation and torrent broadcast.
• Watch numerical issues, e.g., log(1+exp(x)).
• Do not change input data. Scaling could be applied after
each iteration and intercept could be derived later.
33
Future directions
• easy handling of categorical features and labels
• better R formula support
• more model summary statistics
• feature parity in different languages
• model parallelism
• vector-free L-BFGS with 2D partitioning (WIP)
• using matrix kernels
34
Other GLM implementations on Spark
• CoCoA+: communication-efficient optimization
• LIBLINEAR for Spark: a Spark port of LIBLINEAR
• sparkGLM: an R-like GLM package for Spark
• TFOCS for Spark: first-order conic solvers for Spark
• General-purpose packages that implement GLMs
• aerosolve, DistML, sparkling-water, thunder, zen, etc
• … and more on Spark Packages
35
Thank you.
• MLlibuserguideandroadmapforSpark2.0
• GLMsonWikipedia
• DatabricksCommunityEdition,blogposts,andcareers

More Related Content

What's hot

The world of loss function
The world of loss functionThe world of loss function
The world of loss function홍배 김
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionJaroslaw Szymczak
 
Optimization/Gradient Descent
Optimization/Gradient DescentOptimization/Gradient Descent
Optimization/Gradient Descentkandelin
 
Matrix and Tensor Tools for Computer Vision
Matrix and Tensor Tools for Computer VisionMatrix and Tensor Tools for Computer Vision
Matrix and Tensor Tools for Computer VisionActiveEon
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter TuningJon Lederman
 
Brief introduction on GAN
Brief introduction on GANBrief introduction on GAN
Brief introduction on GANDai-Hai Nguyen
 
Dimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptxDimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptxRohanBorgalli
 
Maximum Likelihood Estimation
Maximum Likelihood EstimationMaximum Likelihood Estimation
Maximum Likelihood Estimationguestfee8698
 
Cross validation
Cross validationCross validation
Cross validationRidhaAfrawe
 
Introduction to XGboost
Introduction to XGboostIntroduction to XGboost
Introduction to XGboostShuai Zhang
 
Lasso and ridge regression
Lasso and ridge regressionLasso and ridge regression
Lasso and ridge regressionSreerajVA
 
K-Folds Cross Validation Method
K-Folds Cross Validation MethodK-Folds Cross Validation Method
K-Folds Cross Validation MethodSHUBHAM GUPTA
 
Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in VisionSangmin Woo
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learningKien Le
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 

What's hot (20)

The world of loss function
The world of loss functionThe world of loss function
The world of loss function
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
 
Optimization/Gradient Descent
Optimization/Gradient DescentOptimization/Gradient Descent
Optimization/Gradient Descent
 
Resampling methods
Resampling methodsResampling methods
Resampling methods
 
Matrix and Tensor Tools for Computer Vision
Matrix and Tensor Tools for Computer VisionMatrix and Tensor Tools for Computer Vision
Matrix and Tensor Tools for Computer Vision
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
 
Brief introduction on GAN
Brief introduction on GANBrief introduction on GAN
Brief introduction on GAN
 
Dimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptxDimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptx
 
Maximum Likelihood Estimation
Maximum Likelihood EstimationMaximum Likelihood Estimation
Maximum Likelihood Estimation
 
LSTM Basics
LSTM BasicsLSTM Basics
LSTM Basics
 
Cross validation
Cross validationCross validation
Cross validation
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Introduction to XGboost
Introduction to XGboostIntroduction to XGboost
Introduction to XGboost
 
Lasso and ridge regression
Lasso and ridge regressionLasso and ridge regression
Lasso and ridge regression
 
Demystifying Xgboost
Demystifying XgboostDemystifying Xgboost
Demystifying Xgboost
 
K-Folds Cross Validation Method
K-Folds Cross Validation MethodK-Folds Cross Validation Method
K-Folds Cross Validation Method
 
Xgboost
XgboostXgboost
Xgboost
 
Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in Vision
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learning
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 

Viewers also liked

Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengGeneralized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengSpark Summit
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondXiangrui Meng
 
Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O Sri Ambati
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrDatabricks
 
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Spark Summit
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRDatabricks
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Spark Summit
 
Twitter Sentiment Prediction.pptx
Twitter Sentiment Prediction.pptxTwitter Sentiment Prediction.pptx
Twitter Sentiment Prediction.pptxKrishnesh Pujari
 
10 examples for if statement
10 examples for if statement10 examples for if statement
10 examples for if statementfyjordan9
 
Study Electronic And Mechanical Properties Of Carbon, Silicon, And Hypothetic...
Study Electronic And Mechanical Properties Of Carbon, Silicon, And Hypothetic...Study Electronic And Mechanical Properties Of Carbon, Silicon, And Hypothetic...
Study Electronic And Mechanical Properties Of Carbon, Silicon, And Hypothetic...IOSR Journals
 
Implementation of linear regression and logistic regression on Spark
Implementation of linear regression and logistic regression on SparkImplementation of linear regression and logistic regression on Spark
Implementation of linear regression and logistic regression on SparkDalei Li
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkMila, Université de Montréal
 
Application of local search methods for solving a quadratic assignment proble...
Application of local search methods for solving a quadratic assignment proble...Application of local search methods for solving a quadratic assignment proble...
Application of local search methods for solving a quadratic assignment proble...ertekg
 
2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache SparkDB Tsai
 
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...Spark Summit
 
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellDatabricks
 
Solving Poisson Equation using Conjugate Gradient Method and its implementation
Solving Poisson Equation using Conjugate Gradient Methodand its implementationSolving Poisson Equation using Conjugate Gradient Methodand its implementation
Solving Poisson Equation using Conjugate Gradient Method and its implementationJongsu "Liam" Kim
 
Foundations for Scaling ML in Apache Spark
Foundations for Scaling ML in Apache SparkFoundations for Scaling ML in Apache Spark
Foundations for Scaling ML in Apache SparkDatabricks
 
Exploiting Ranking Factorization Machines for Microblog Retrieval
Exploiting Ranking Factorization Machines for Microblog RetrievalExploiting Ranking Factorization Machines for Microblog Retrieval
Exploiting Ranking Factorization Machines for Microblog RetrievalRunwei Qiang
 

Viewers also liked (20)

Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengGeneralized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O
 
Training
TrainingTraining
Training
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkr
 
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
 
Twitter Sentiment Prediction.pptx
Twitter Sentiment Prediction.pptxTwitter Sentiment Prediction.pptx
Twitter Sentiment Prediction.pptx
 
10 examples for if statement
10 examples for if statement10 examples for if statement
10 examples for if statement
 
Study Electronic And Mechanical Properties Of Carbon, Silicon, And Hypothetic...
Study Electronic And Mechanical Properties Of Carbon, Silicon, And Hypothetic...Study Electronic And Mechanical Properties Of Carbon, Silicon, And Hypothetic...
Study Electronic And Mechanical Properties Of Carbon, Silicon, And Hypothetic...
 
Implementation of linear regression and logistic regression on Spark
Implementation of linear regression and logistic regression on SparkImplementation of linear regression and logistic regression on Spark
Implementation of linear regression and logistic regression on Spark
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using spark
 
Application of local search methods for solving a quadratic assignment proble...
Application of local search methods for solving a quadratic assignment proble...Application of local search methods for solving a quadratic assignment proble...
Application of local search methods for solving a quadratic assignment proble...
 
2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark
 
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
 
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
 
Solving Poisson Equation using Conjugate Gradient Method and its implementation
Solving Poisson Equation using Conjugate Gradient Methodand its implementationSolving Poisson Equation using Conjugate Gradient Methodand its implementation
Solving Poisson Equation using Conjugate Gradient Method and its implementation
 
Foundations for Scaling ML in Apache Spark
Foundations for Scaling ML in Apache SparkFoundations for Scaling ML in Apache Spark
Foundations for Scaling ML in Apache Spark
 
Exploiting Ranking Factorization Machines for Microblog Retrieval
Exploiting Ranking Factorization Machines for Microblog RetrievalExploiting Ranking Factorization Machines for Microblog Retrieval
Exploiting Ranking Factorization Machines for Microblog Retrieval
 

Similar to Generalized Linear Models in Spark MLlib and SparkR

Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlibXiangrui Meng
 
Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...
Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...
Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...WithTheBest
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...DB Tsai
 
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...
Forecasting Default Probabilities  in Emerging Markets and   Dynamical Regula...Forecasting Default Probabilities  in Emerging Markets and   Dynamical Regula...
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...SSA KPI
 
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...GlobalLogic Ukraine
 
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonGraph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonChristopher Conlan
 
PR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation LearningPR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation LearningSungchul Kim
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsMark Peng
 
# Can we trust ai. the dilemma of model adjustment
# Can we trust ai. the dilemma of model adjustment# Can we trust ai. the dilemma of model adjustment
# Can we trust ai. the dilemma of model adjustmentTerence Huang
 
H2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas NykodymH2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas NykodymSri Ambati
 
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...Jihun Yun
 
Lecture slides stats1.13.l23.air
Lecture slides stats1.13.l23.airLecture slides stats1.13.l23.air
Lecture slides stats1.13.l23.airatutor_te
 
Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...Dalei Li
 
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...Philip Goddard
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Sparkfelixcss
 

Similar to Generalized Linear Models in Spark MLlib and SparkR (20)

Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlib
 
Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...
Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...
Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
 
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...
Forecasting Default Probabilities  in Emerging Markets and   Dynamical Regula...Forecasting Default Probabilities  in Emerging Markets and   Dynamical Regula...
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...
 
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
 
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonGraph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
 
Group Project
Group ProjectGroup Project
Group Project
 
PR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation LearningPR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation Learning
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle Competitions
 
# Can we trust ai. the dilemma of model adjustment
# Can we trust ai. the dilemma of model adjustment# Can we trust ai. the dilemma of model adjustment
# Can we trust ai. the dilemma of model adjustment
 
H2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas NykodymH2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas Nykodym
 
Module-2_ML.pdf
Module-2_ML.pdfModule-2_ML.pdf
Module-2_ML.pdf
 
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
 
Machine Learning - Supervised Learning
Machine Learning - Supervised LearningMachine Learning - Supervised Learning
Machine Learning - Supervised Learning
 
Machine Learning - Principles
Machine Learning - PrinciplesMachine Learning - Principles
Machine Learning - Principles
 
Lecture slides stats1.13.l23.air
Lecture slides stats1.13.l23.airLecture slides stats1.13.l23.air
Lecture slides stats1.13.l23.air
 
Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...
 
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Spark
 
Java 8
Java 8Java 8
Java 8
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 

Recently uploaded (20)

CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 

Generalized Linear Models in Spark MLlib and SparkR

  • 1. Generalized Linear Models in Spark MLlib and SparkR Xiangrui Meng joint with Joseph Bradley, Eric Liang, Yanbo Liang (MiningLamp), DB Tsai (Netflix), et al. 2016/02/17 - Spark Summit East
  • 2. About me • Software Engineer at Databricks • Spark PMC member and MLlib/PySpark maintainer • Ph.D. from Stanford on randomized algorithms for large- scale linear regression problems 2
  • 3. Outline • Generalized linear models (GLMs) • linear regression / logistic regression / general form • accelerated failure time (AFT) model for survival analysis • intercept / regularization / weights • GLMs in MLlib and SparkR • demo: R formula in Spark • Implementing GLMs • gradient descent / L-BFGS / OWL-QN • weighted least squares / iteratively re-weighted least squares (IRLS) • performance tips 3
  • 6. Linear least squares • m observations: • x: explanatory variables, y: dependent variable • assumes linear relationship between x and y • minimizes the sum of the squares of the errors 6
  • 7. Linear least squares • the oldest linear model, trace back to Gauss • the simplest and the most studied linear model • has analytic solutions • easy to solve • easy to inspect • sensitive to outliers 7
  • 9. Logistic regression • classification with binary response: • true/false, clicked/not clicked, liked/disliked • uses logistic function to indicate the likelihood • maximizes the sum of the log-likelihoods, i.e., 9
  • 10. Logistic regression • one of the simplest binary classification models • widely used in industry • relatively easy to solve • easy to interpret 10
  • 11. Multinomial logistic regression • classification with multiclass response: • uses softmax function to indicate likelihood • maximizes the sum of log-likelihoods 11
  • 12. Generalized linear models (GLMs) • Both linear least squares and logistic regression are special cases of generalized linear models. • A GLM is specified by the following: • a distribution of the response (from the exponential family), • a link function g such that • maximizes the sum of log-likelihoods 12
  • 13. Distributions and link functions 13 Model Distribution Link linear least squares normal identity logistic regression binomial logit multinomial logic regression multinomial generalized logit Poisson regression Poisson log gamma regression gamma inverse
  • 14. Accelerated failure time (AFT) model • m observations: • y: survival time, c: censor variable (alive or dead) • assumes the effect of an explanatory variable is to accelerate or decelerate the life time by some constant • uses maximum likelihood estimation while treating censored and uncensored observations differently 14
  • 15. AFT model for survival analysis • one popular parametric model for survival analysis • widely used for lifetime estimation and churn analysis • could be solved under the same framework as GLMs 15
  • 16. Intercept, regularization, and weights In practice, a linear model is often more complex
 
 
 where w describes instance weights, beta_0 is the intercept term to adjust bias, and sigma regularized beta with a constant lambda > 0 to avoid overfitting. 16
  • 17. Types of regularization • Ridge (L2): • easy to solve (strongly convex) • Lasso (L1): • enhance model sparsity • harder to solver (though still convex) • Elastic-Net: • Others: group lasso, nonconvex, etc 17
  • 18. GLMs in MLlib and SparkR
  • 19. GLMs in Spark MLlib Linear models in MLlib are implemented as ML pipeline estimators. They accept the following params: • featuresCol: a vector column containing features (x) • labelCol: a double column containing responses (y) • weightCol: a double column containing weights (w) • regType: regularization type, “none”, “l1”, “l2”, “elastic-net” • regParam: regularization constant • fitIntercept: whether to fit an intercept term • … 19
  • 20. Fit a linear model in MLlib from pyspark.ml.classification import LogisticRegression # Load training data training = sqlContext.read.parquet(“path/to/training“) lr = LogisticRegression( weightCol=“weight”, fitIntercept=False, maxIter=10, regParam=0.3, elasticNetParam=0.8) # Fit the model model = lr.fit(training) 20
  • 21. Make predictions and evaluate models from pyspark.ml.evaluation import BinaryClassificationEvaluator test = sqlContext.read.parquet(“path/to/test“) # make predictions by calling transform predictions = model.tranform(test) # create a binary classification evaluator evaluator = BinaryClassificationEvaluator( metricName=“areaUnderROC”) evaulator.evaluate(predictions) 21
  • 22. GLMs in SparkR In Python/Scala/Java, we keep the APIs about the same for consistency. But in SparkR, we make the APIs similar to existing ones in R (or R packages). # Create the DataFrame df <- read.df(sqlContext, “path/to/training”) # Fit a Gaussian GLM model model <- glm(y ~ x1 + x2, data = df, family = "gaussian") 22
  • 23. R formula in SparkR 23 • R provides model formula to express linear models. • We support the following R formula operators in SparkR: • `~` separate target and terms • `+` concat terms, "+ 0" means removing intercept • `-` remove a term, "- 1" means removing intercept • `:` interaction (multiplication for numeric values, or binarized categorical values) • `.` all columns except target • For example, “y ~ x + z + x:z -1” means using x, z, and their interaction (x:z) to predict y without intercept (-1).
  • 24. Demo: GLMs in Spark … using Databricks Community Edition!
  • 26. Row-based distributed storage 26 w x y w x y partition 1 partition 2
  • 27. Gradient descent methods • Stochastic gradient descent (SGD): • trade-offs on the merge scheme and convergence • Mini-batch SGD: • hard to sample mini-batches efficiently • communication overhead on merging gradients • Batch gradient descent: • slow convergence 27
  • 28. Quasi-Newton methods • Newton’s method converges much than GD, but it requires second-order information: • L-BFGS works for smooth objectives. It approximates the inverse Hessian using only first-order information. • OWL-QN works for objectives with L1 regularization. • MLlib calls L-BFGS/OWL-QN implemented in breeze. 28
  • 29. Direct methods for linear least squares • Linear least squares has an analytic solution: • The solution could be computed directly or through QR factorization, both of which are implemented in Spark. • requires only a single pass • efficient when the number of features is small (<4000) • provides R-like model summary statistics 29
  • 30. Iteratively re-weighted least squares (IRLS) • Generalized linear models with exponential family can be solved via iteratively re-weighted least squares (IRLS). • linearizes the objective at the current solution • solves the weighted linear least squares problem • repeat above steps until convergence • efficient when the number of features is small (<4000) • provides R-like model summary statistics • This is the implementation in R. 30
  • 31. Verification using R Besides normal tests, we also verify our implementation using R. /*
 df <- as.data.frame(cbind(A, b))
 for (formula in c(b ~ . -1, b ~ .)) {
 model <- lm(formula, data=df, weights=w)
 print(as.vector(coef(model)))
 }
 
 [1] -3.727121 3.009983
 [1] 18.08 6.08 -0.60
 */ val expected = Seq(Vectors.dense(0.0, -3.727121, 3.009983),
 Vectors.dense(18.08, 6.08, -0.60)) 31
  • 32. Standardization To match the result in both R and glmnet, the most popular R package for GLMs, we provide options to standardize features and labels before training:
 
 
 
 where delta is the stddev of labels, and sigma_j is the stddev of the j-th feature column. 32
  • 33. Performance tips • Utilize sparsity. • Use tree aggregation and torrent broadcast. • Watch numerical issues, e.g., log(1+exp(x)). • Do not change input data. Scaling could be applied after each iteration and intercept could be derived later. 33
  • 34. Future directions • easy handling of categorical features and labels • better R formula support • more model summary statistics • feature parity in different languages • model parallelism • vector-free L-BFGS with 2D partitioning (WIP) • using matrix kernels 34
  • 35. Other GLM implementations on Spark • CoCoA+: communication-efficient optimization • LIBLINEAR for Spark: a Spark port of LIBLINEAR • sparkGLM: an R-like GLM package for Spark • TFOCS for Spark: first-order conic solvers for Spark • General-purpose packages that implement GLMs • aerosolve, DistML, sparkling-water, thunder, zen, etc • … and more on Spark Packages 35
  • 36. Thank you. • MLlibuserguideandroadmapforSpark2.0 • GLMsonWikipedia • DatabricksCommunityEdition,blogposts,andcareers