Automation and optimisation of machine learning pipelines on top of Apache Spark
Peter Rudenko
@peter_rud
peter.rudenko@datarobot.com
DataRobot data pipeline
Data upload → Exploratory data analysis → Training models, selecting best models & hyperparameters → Models leaderboard → Prediction API
Our journey to Apache Spark
PySpark vs Scala API?
Each Spark worker pairs a JVM with a Python process:
instructions cross via py4j, data crosses via IPC/serde.
Sending instructions: df.agg({"age": "max"}) => FAST!
Sending data: data.map(lambda x: …), data.filter(lambda x: …) => SLOW!
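A minimal Scala sketch of why the instruction path is fast: with the DataFrame API the whole query is planned and executed inside the JVM, so no per-record data crosses into a Python process (the people.json file and its "age" column are assumptions for illustration):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.max

// Assumes an existing SparkContext `sc` and a people.json file with an "age" column.
val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("people.json")

// Only the query plan is shipped around; execution stays in the JVM.
people.agg(max("age")).show()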
Our journey to Apache Spark
RDD vs DataFrame
RDD[Row[(Double, String, Vector)]]
vs.
DataFrame: (DoubleType, nullable=true), (StringType, nullable=true), (VectorType, nullable=true)
Each column also carries Attributes (in spark-1.4):
● NumericAttribute
● NominalAttribute (Ordinal)
● BinaryAttribute
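A small sketch, assuming a DataFrame df with a numeric "age" column, of how such attributes can be attached to a column as metadata in spark-1.4:

import org.apache.spark.ml.attribute.{Attribute, NumericAttribute}

// Build a numeric attribute and attach it to the (assumed) "age" column as metadata.
val ageAttr = NumericAttribute.defaultAttr.withName("age")
val dfWithMeta = df.withColumn("age", df("age").as("age", ageAttr.toMetadata()))

// Downstream stages can recover the attribute from the schema.
val recovered = Attribute.fromStructField(dfWithMeta.schema("age"))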
Our journey to Apache Spark
MLlib vs ML
MLlib:
● Low-level implementations of machine learning algorithms
● Based on RDDs
ML:
● High-level pipeline abstractions
● Based on DataFrames
● Uses MLlib under the hood.
Columnar format
● Compression
● Scan optimization
● Null-imputer improvement:
- val na2mean = { value: Double =>
-   if (value.isNaN) meanValue else value
- }
- dataset.withColumn(map(outputCol),
-   callUDF(na2mean, DoubleType, dataset(map(inputCol))))
+ dataset.na.fill(map(inputCols).zip(meanValues).toMap)
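For reference, a minimal usage sketch of the na.fill call above (the column names and mean values are made up for illustration):

// Replace missing values in the assumed numeric columns with their precomputed means.
val imputed = dataset.na.fill(Map("age" -> 41.5, "income" -> 52000.0))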
Typical machine learning pipeline
● Features extraction
● Missing values imputation
● Variables encoding
● Dimensionality reduction
● Training model (finding the optimal model parameters)
● Selecting hyperparameters
● Model evaluation on some metric (AUC, R2, RMSE, etc.)
Train data (features + label) → Model state (parameters + hyperparameters)
Test data (features) + Model state → Prediction
Introducing Blueprint
Pipeline config
pipeline: {
  "1": {
    input: ["NUM"],
    class: "org.apache.spark.ml.feature.MeanImputor"
  },
  "2": {
    input: ["CAT"],
    class: "org.apache.spark.ml.feature.OneHotEncoder"
  },
  "3": {
    input: ["1", "2"],
    class: "org.apache.spark.ml.feature.VectorAssembler"
  },
  "4": {
    input: "3",
    class: "org.apache.spark.ml.classification.LogisticRegression",
    params: {
      optimizer: "LBFGS",
      regParam: [0.5, 0.1, 0.01, 0.001]
    }
  }
}
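A rough sketch, not the actual Blueprint implementation, of how entries like these could be turned into pipeline stages by class name (the StageConfig case class and the wiring of inputs/params are assumptions):

import org.apache.spark.ml.PipelineStage

// Hypothetical shape of one blueprint entry.
case class StageConfig(input: Seq[String], className: String, params: Map[String, Any] = Map.empty)

// Instantiate a stage from its fully qualified class name via reflection;
// setting params and wiring input columns is omitted here.
def buildStage(conf: StageConfig): PipelineStage =
  Class.forName(conf.className).newInstance().asInstanceOf[PipelineStage]

val stages: Seq[PipelineStage] = Seq(
  StageConfig(Seq("1", "2"), "org.apache.spark.ml.feature.VectorAssembler"),
  StageConfig(Seq("3"), "org.apache.spark.ml.classification.LogisticRegression")
).map(buildStage)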
Introducing Blueprint
Blueprint → Spark jobserver → YARN cluster
Transformer (pure function)
abstract class Transformer extends PipelineStage with Params {
/**
* Transforms the dataset with provided parameter map as additional parameters.
* @param dataset input dataset
* @param paramMap additional parameters, overwrite embedded params
* @return transformed dataset
*/
def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame
}
Example:
(new HashingTF).
setInputCol("categorical_column").
setOutputCol("Hashing_tf_1").
setNumFeatures(1<<20).
transform(data)
Estimator
abstract class Estimator[M <: Model[M]] extends PipelineStage with Params {
/**
* Fits a single model to the input data with optional parameters.
*
* @param dataset input dataset
* @param paramPairs Optional list of param pairs.
* These values override any specified in this Estimator's embedded ParamMap.
* @return fitted model
*/
@varargs
def fit(dataset: DataFrame, paramPairs: ParamPair[_]*): M = {
val map = ParamMap(paramPairs: _*)
fit(dataset, map)
}
}
Example:
val oneHotEncoderModel = (new OneHotEncoder).
setInputCol("vector_col").
fit(trainingData)
oneHotEncoderModel.transform(trainingData)
oneHotEncoderModel.transform(testData)
Estimator => Transformer
Predictor
An Estimator that predicts a value.
Predictor → Classifier, Regressor
Classifier → ProbabilisticClassifier
Evaluator
abstract class Evaluator extends Identifiable {
/**
* Evaluates the output.
*
* @param dataset a dataset that contains labels/observations and predictions.
* @param paramMap parameter map that specifies the input columns and output metrics
* @return metric
*/
def evaluate(dataset: DataFrame, paramMap: ParamMap): Double
}
Example:
val areaUnderROC = (new BinaryClassificationEvaluator).
setScoreCol("prediction").
evaluate(data)
Pipeline
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
Input data → Tokenizer → HashingTF → LogisticRegression
pipeline.fit => PipelineModel
A Pipeline is an Estimator that encapsulates other transformers / estimators.
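A minimal usage sketch, assuming a training DataFrame with "text" and "label" columns and a test DataFrame with a "text" column:

// Fitting the pipeline runs each stage in order and returns a PipelineModel.
val model = pipeline.fit(trainingData)

// The fitted PipelineModel is itself a Transformer: apply it to new data to get predictions.
model.transform(testData).select("text", "prediction").show()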
CrossValidator
val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()
crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(3)
val cvModel = crossval.fit(training.toDF)
Input data → Tokenizer → HashingTF → LogisticRegression
crossval.fit => CrossValidatorModel
Param grid: numFeatures ∈ {10, 100, 1000}, regParam ∈ {0.1, 0.01}, evaluated over the folds
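A short follow-up sketch (testData is assumed): the fitted CrossValidatorModel keeps the best model found over the grid and scores new data like any other model:

// Best pipeline found during the grid search / cross-validation.
val best = cvModel.bestModel

// The CrossValidatorModel is a Transformer as well.
cvModel.transform(testData).select("prediction").show()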
Pluggable backend
● H2O
● Flink
● DeepLearning4J
● http://keystone-ml.org/
● etc.
Optimization
● Disable k-fold cross validation
● Minimize redundant pre-processing
● Parallel grid search
● Parallel DAG pipeline
● Pluggable optimizer
● Non-gridsearch hyperparameter optimization (Bayesian & hypergradient):
http://arxiv.org/pdf/1502.03492v2.pdf
http://arxiv.org/pdf/1206.2944.pdf
http://arxiv.org/pdf/1502.05700v1.pdf
Minimize redundant pre-processing
Two branches that differ only in regParam (0.1 vs 0.01) share the same pre-processing, yet Spark gives no automatic reuse:
val rdd1 = rdd.map(function)
val rdd2 = rdd.map(function)
rdd1 != rdd2 // identical transformations still produce two separate lineages, computed twice
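A minimal sketch of the idea (featurePipeline, raw, and the "features"/"label" columns are assumptions): run the shared pre-processing once, cache the result, and re-fit only the final estimator per regParam value:

import org.apache.spark.ml.classification.LogisticRegression

// Shared pre-processing stages run once; the result is cached for reuse.
val prepared = featurePipeline.fit(raw).transform(raw).cache()

// Only the final estimator is re-fit for each regParam value.
val models = Seq(0.1, 0.01).map { reg =>
  new LogisticRegression().setRegParam(reg).fit(prepared)
}

// Don't forget to clean up after yourself.
prepared.unpersist()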
Summary
● Good model != good result. Feature engineering is the key.
● Spark provides a good abstraction, but you need to tune some parts to achieve good performance.
● The ml pipeline API gives you pluggable and reusable building blocks.
● Don't forget to clean up after yourself (unpersist cache).
Thanks!
Demo & Q&A
