Kaminski, Schlegel | Oct. 25, 2017
BUILDING CUSTOM ML PIPELINESTAGES
FOR FEATURE SELECTION.
SPARK SUMMIT EUROPE 2017.
WHAT YOU WILL LEARN DURING THIS SESSION.
• What data-driven car diagnostics look like at BMW.
• A good understanding of the most important elements of Spark ML PipelineStages (using a feature selection example).
• Attention: there will be Scala code examples!
• How to use spark-FeatureSelection in your Spark ML Pipeline.
• The impact of feature selection on learning performance and on the understanding of the big data black box.
Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 2
MOTIVATION.
• The #3 contributor to warranty incidents for OEMs is “no trouble found” cases. [1]
• Potential root causes:
  • Manually formalized expert knowledge cannot cope with the vast number of possibilities.
  • Cars are getting more and more complex (hybridization, connectivity).
  • Less experienced workshop staff in evolving markets.
• Improve three workflows at once by shifting from a manual to a data-driven approach:
  • Automatic knowledge generation.
  • Automatic workshop diagnostics.
  • Predictive maintenance.
[1] BearingPoint, Global Automotive Warranty Survey Report 2009
THE DATASET AND ITS CHALLENGES.
MV_S MV_0 … MV_4000 MV_BGEE_TP SC_IP SC_1 SC_2 DTC_PU DTC_1 DTC_2 CP Label
44 3 … 20 -0.06 false 2 77 27 false true v.10 false
72 36 … 73 -0.01 false 16 29 false v.10 false
100 4 … 16 -0.02 true 45 1 false false v.10 false
44 14 … 54 -0.02 true 76 false v.10 true
95 34 … 73 -0.07 false 80 22 false false v.10 false
16 50 … 33 -0.02 true 61 93 false false false v.11 false
4 … 27 -0.09 false 59 91 false v.10 false
48 … 20 -0.07 false 32 31 false v.10 false
88 60 … 72 -0.01 true 1.9 96 53 true false true v.10 false
27 14 … 88 false 73 14 false v.10 false
High-dimensional feature space (7000+ features)
High sparsity
High class imbalance
SPARK PIPELINE.
Relational DWH → ETL → Handling imbalance → Preprocessing → Cross-validation loop (Feature selection → Classifier) → Model

ETL: Loading, Imputation
MV_S  SC_IP  DTC_PU  CP    Label
44    2      false   v.10  false
72    1.5    true    v.11  false
23    1.4    false   v.11  false
44    1.5    true    v.10  true

Handling imbalance (true vs. false labels): SMOTE [2], Undersampling, …

Preprocessing: StringIndexer, OneHotEncoder, VectorAssembler, Discretization, Std. Scaler
Features          Label
[0.34,0.8,0,1]    0.0
[0.7,0.4,1,0]     0.0
[0.31,0.35,1,0]   1.0
[0.3,0.4,1.1]     1.0

Cross-validation loop:
• Feature selection [3]: InformationGain, Correlation, ChiSquared, Random Forest, Gini, L1 LogReg
• Classifier: Logistic Regression / Random Forest

[2] Chawla et al.: SMOTE: Synthetic Minority Over-sampling Technique
[3] Schlegel et al.: Design and optimization of an autonomous feature selection pipeline for high dimensional, heterogeneous feature spaces.
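The undersampling option named in the imbalance step can be illustrated without Spark. The following is a minimal sketch in plain Scala, not the pipeline used in the talk: `Row`, `undersample` and the seed are hypothetical names, and the strategy (keep all minority rows, draw an equally sized random sample of the majority class) is only one of several possible variants.

```scala
import scala.util.Random

object UndersamplingSketch {
  // Hypothetical row type: a feature vector plus a binary label.
  final case class Row(features: Vector[Double], label: Boolean)

  // Keep every minority-class row and draw an equally sized random
  // sample (without replacement) from the majority class.
  def undersample(rows: Seq[Row], seed: Long): Seq[Row] = {
    val (pos, neg) = rows.partition(_.label)
    val (minority, majority) = if (pos.size <= neg.size) (pos, neg) else (neg, pos)
    val rng = new Random(seed)
    minority ++ rng.shuffle(majority).take(minority.size)
  }
}
```

With 2 positive and 8 negative rows, the result holds 2 rows of each class. On a real cluster the same idea would be expressed on a DataFrame rather than on a local collection.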
SPARK PIPELINE API.
• PipelineStage: interface for usage in a Pipeline.
• Transformer: “transforms data” (data → data).
• Estimator: “learns from data” (data → Transformer).
ORG.APACHE.SPARK.ML.*
PipelineStage: interface for usage in a Pipeline
  Estimator: learns from data
    Pipeline: concatenates PipelineStages
    Predictor: interface for predictors
    FeatureSelector: interface for feature selection (FS)
  Transformer: transforms data
    Model: fitted model
      PipelineModel: model from a Pipeline
      PredictionModel: model from a Predictor
      FeatureSelectorModel: model from a FeatureSelector
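The Estimator/Transformer contract in this hierarchy can be shown with a toy example. The sketch below uses simplified stand-in traits in plain Scala, not Spark's actual classes: `Data`, `Centering` and `CenteringModel` are hypothetical names chosen for illustration. The point is only that `fit` learns something from data and returns a model, and that the model is itself a transformer.

```scala
object PipelineContractSketch {
  type Data = Seq[Double]

  // Simplified stand-ins for the roles of Spark's Transformer and Estimator.
  trait Transformer { def transform(data: Data): Data }
  trait Estimator[M <: Transformer] { def fit(data: Data): M }

  // Fitted model: stores the learnt mean and subtracts it from every value.
  final class CenteringModel(val mean: Double) extends Transformer {
    def transform(data: Data): Data = data.map(_ - mean)
  }

  // Estimator: learns the mean from the data and returns the fitted model.
  final class Centering extends Estimator[CenteringModel] {
    def fit(data: Data): CenteringModel = {
      require(data.nonEmpty, "cannot fit on empty data")
      new CenteringModel(data.sum / data.size)
    }
  }
}
```

Calling `new Centering().fit(Seq(1.0, 2.0, 3.0))` yields a model with mean 2.0, which can then transform any other sequence.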
MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE.
transformSchema (= input validation): takes the schema of the input DataFrame and returns the transformed schema, or throws an exception.
DataFrame with schema:
features    label
[0,1,0,1]   1.0
[1,0,0,0]   1.0
  features: VectorColumn
  selected: VectorColumn
  label: Double
Attention: VectorColumns have metadata: name, type, range, etc.
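A fail-fast schema check of the kind transformSchema performs could look roughly as follows. This is a sketch under assumptions, not the package's implementation: the helper name `validateAndTransformSchema` and its parameters are hypothetical, it needs Spark SQL and MLlib on the classpath, and it ignores the vector column metadata mentioned above.

```scala
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types.{NumericType, StructField, StructType}

object SchemaCheckSketch {
  // Hypothetical helper: validates the input columns and appends the
  // output column. Any failed check throws IllegalArgumentException
  // before data is touched.
  def validateAndTransformSchema(
      schema: StructType,
      featuresCol: String,
      labelCol: String,
      outputCol: String): StructType = {
    require(schema.fieldNames.contains(featuresCol),
      s"Features column $featuresCol does not exist.")
    require(schema(featuresCol).dataType == VectorType,
      s"Features column $featuresCol must be a vector column.")
    require(schema.fieldNames.contains(labelCol),
      s"Label column $labelCol does not exist.")
    require(schema(labelCol).dataType.isInstanceOf[NumericType],
      s"Label column $labelCol must be numeric.")
    require(!schema.fieldNames.contains(outputCol),
      s"Output column $outputCol already exists.")
    // Metadata handling is omitted in this sketch.
    schema.add(StructField(outputCol, VectorType, nullable = false))
  }
}
```

For an input schema with `features` (vector) and `label` (double), the returned schema has the three fields `features`, `label` and `selected`.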
fit (= learn from data): Dataset → Transformer
MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE.
abstract class FeatureSelector[
    Learner <: FeatureSelector[Learner, M],
    M <: FeatureSelectorModel[M]]
  extends Estimator[M] with FeatureSelectorParams with DefaultParamsWritable {
  // Setters for params in FeatureSelectorParams (setParam* is a placeholder, one setter per param)
  def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner]
  // PipelineStage and Estimator methods (bodies elided)
  override def transformSchema(schema: StructType): StructType = ???
  override def fit(dataset: Dataset[_]): M = ???
  override def copy(extra: ParamMap): Learner
  // Abstract methods that are called from fit()
  protected def train(dataset: Dataset[_]): Array[(Int, Double)]
  protected def make(uid: String, selectedFeatures: Array[Int],
      featureImportances: Map[String, Double]): M
}
• Estimator[M]: needs to know what it shall return.
• FeatureSelectorParams: defined later.
• DefaultParamsWritable: makes all Params writable.
• Setters return Learner to allow setter concatenation: mdl.setParam1(val1).setParam2(val2)...
• transformSchema: performs input checking and fails fast. Can throw exceptions.
• fit: learns from data and returns a Model. Here: calculate feature importances.
• train and make: not necessary, but avoid code duplication.
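Why the setters return `Learner` rather than the base type becomes clearer in a stripped-down version of the pattern. The sketch below is plain Scala without Spark: `Selector`, `GiniLikeSelector`, `setNumBins` and the map-based parameter store are hypothetical stand-ins, used only to show how a self-referencing type parameter keeps setter concatenation working across base class and subclass.

```scala
object SetterChainingSketch {
  // Base class: each setter returns the concrete subtype `Learner`,
  // so subclass-specific setters remain available after the call.
  abstract class Selector[Learner <: Selector[Learner]] {
    protected var params: Map[String, Any] = Map.empty

    def get(name: String): Option[Any] = params.get(name)

    def setOutputCol(value: String): Learner = {
      params = params.updated("outputCol", value)
      this.asInstanceOf[Learner]
    }
  }

  // Concrete selector with one additional, subclass-only setter.
  final class GiniLikeSelector extends Selector[GiniLikeSelector] {
    def setNumBins(value: Int): GiniLikeSelector = {
      params = params.updated("numBins", value)
      this
    }
  }
}
```

The chain `new GiniLikeSelector().setOutputCol("gini").setNumBins(10)` compiles only because `setOutputCol` returns `GiniLikeSelector`. If it returned the base type, `setNumBins` would not be visible at that point.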
MAKING A TRANSFORMER – FEATURESELECTION EXAMPLE.
abstract class FeatureSelectorModel[M <: FeatureSelectorModel[M]](
    override val uid: String,
    val selectedFeatures: Array[Int],
    val featureImportances: Map[String, Double])
  extends Model[M] with FeatureSelectorParams with MLWritable {
  // Setters for params in FeatureSelectorParams
  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
  // PipelineStage and Transformer methods (bodies elided)
  override def transformSchema(schema: StructType): StructType = ???
  override def transform(dataset: Dataset[_]): DataFrame = ???
  def write: MLWriter
}
• MLWritable: for persistence.
• Same idea as in the Estimator, but different tasks.
• transform: transforms data.
• write: adds persistence.
GIVING YOUR NEW PIPELINESTAGE PARAMETERS.
import org.apache.spark.ml.param._
import org.apache.spark.ml.param.shared._
private[selection] trait FeatureSelectorParams extends Params
  with HasFeaturesCol with HasOutputCol with HasLabelCol {
  // Define params and getters here...
  final val param = new Param[Type](this, "name", "description")
  def getParam: Type = $(param)
}
• Using the shared params is possible because the package is in org.apache.spark.ml.
• Params work out of the box for several types, e.g. DoubleParam, IntParam, BooleanParam, StringArrayParam, ...
• Other types: need to implement jsonEncode and jsonDecode to maintain persistence.
• Getters are shared between Estimator and Transformer. Setters are not, to allow setter concatenation.
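For a parameter type without built-in support, overriding jsonEncode and jsonDecode might look like the sketch below. `IntRangeParam` and its bracketed text encoding are hypothetical and chosen only for illustration; the example assumes Spark MLlib on the classpath and encodes the value by hand to stay independent of any JSON library.

```scala
import org.apache.spark.ml.param.Param

// Hypothetical param holding an (Int, Int) range, stored as "[lo,hi]".
class IntRangeParam(parent: String, name: String, doc: String)
  extends Param[(Int, Int)](parent, name, doc) {

  // Called when the stage's params are written as part of the metadata.
  override def jsonEncode(value: (Int, Int)): String =
    s"[${value._1},${value._2}]"

  // Called when the stage is loaded again; must invert jsonEncode.
  override def jsonDecode(json: String): (Int, Int) = {
    val parts = json.trim.stripPrefix("[").stripSuffix("]").split(",").map(_.trim.toInt)
    require(parts.length == 2, s"Expected two integers, got: $json")
    (parts(0), parts(1))
  }
}
```

The property that matters for persistence is the round trip: decoding an encoded value must return the original value.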
ADDING PERSISTENCE TO YOUR NEW PIPELINEMODEL.
• What has to be saved?
  • Metadata: uid, timestamp, version, …
  • Parameters
    For metadata and parameters, since we are in org.apache.spark.ml, use DefaultParamsWriter.saveMetadata() and DefaultParamsReader.loadMetadata().
  • Learnt data: selectedFeatures & featureImportances
    Create a DataFrame and use write.parquet(…).
• How do we do that?
  • Create a companion object FeatureSelectorModel, which offers the following classes:
    • abstract class FeatureSelectorModelReader[M <: FeatureSelectorModel[M]] extends MLReader[M] {…}
    • class FeatureSelectorModelWriter[M <: FeatureSelectorModel[M]](instance: M) extends MLWriter {…}
HOW TO USE SPARK-FEATURESELECTION.
import org.apache.spark.ml.feature.selection.filter._
import org.apache.spark.ml.feature.selection.util.VectorMerger
import org.apache.spark.ml.Pipeline
// Load data
val df = spark.read.parquet("path/to/data/train.parquet")
val corSel = new CorrelationSelector().setInputCol("features").setOutputCol("cor")
val giniSel = new GiniSelector().setInputCol("features").setOutputCol("gini")
// VectorMerger merges vector columns and removes duplicates. Requires vector columns with names!
val merger = new VectorMerger().setInputCols(Array("cor", "gini")).setOutputCol("selected")
// Put everything in a pipeline and fit together
val plModel = new Pipeline().setStages(Array(corSel, giniSel, merger)).fit(df)
// Apply the fitted pipeline and drop the original feature column
val dfT = plModel.transform(df).drop("features")

Feature selectors offer different selection methods.

df (input):
features   Label
[0,1,0,1]  1.0
[0,0,0,0]  0.0
[1,1,0,0]  0.0
[1,0,0,0]  1.0

Scores computed during fit:
Feature  F1   F2   F3   F4
Score 1  0.9  0.7  0.0  0.5
Score 2  0.6  0.8  0.0  0.4

dfT (after transform):
selected  Label
[0,1]     1.0
[0,0]     0.0
[1,1]     0.0
[1,0]     1.0
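The tables on this slide can be reproduced with a small plain-Scala sketch that does not use Spark. It only illustrates the idea of selecting features per score and merging the selections without duplicates; it is not the package's implementation. The object MergeSketch, its helper names, and the choice of keeping the two best features per score are assumptions made for this example.

```scala
// Illustrative sketch in plain Scala (no Spark); all names are hypothetical.
object MergeSketch {
  // Keep the names of the k highest-scoring features, in their original order.
  def topK(names: Seq[String], scores: Seq[Double], k: Int): Seq[String] = {
    val keep = names.zip(scores).sortBy { case (_, s) => -s }.take(k).map(_._1).toSet
    names.filter(n => keep.contains(n))
  }

  // Union of several selections without duplicates, ordered as in names.
  def merge(names: Seq[String], selections: Seq[Seq[String]]): Seq[String] = {
    val union = selections.flatten.toSet
    names.filter(n => union.contains(n))
  }

  // Project each row onto the selected features.
  def project(names: Seq[String], selected: Seq[String], rows: Seq[Seq[Int]]): Seq[Seq[Int]] = {
    val idx = selected.map(n => names.indexOf(n))
    rows.map(r => idx.map(i => r(i)))
  }

  val names  = Seq("F1", "F2", "F3", "F4")
  val score1 = Seq(0.9, 0.7, 0.0, 0.5) // "Score 1" row of the slide
  val score2 = Seq(0.6, 0.8, 0.0, 0.4) // "Score 2" row of the slide
  val rows   = Seq(Seq(0, 1, 0, 1), Seq(0, 0, 0, 0), Seq(1, 1, 0, 0), Seq(1, 0, 0, 0))

  // Assumption: each selector keeps its two best features.
  val selected = merge(names, Seq(topK(names, score1, 2), topK(names, score2, 2)))
  val reduced  = project(names, selected, rows)
}
```

Under these assumptions both scores pick F1 and F2, so MergeSketch.selected contains each name once and MergeSketch.reduced holds the rows [0,1], [0,0], [1,1], [1,0], matching the "selected" column shown on the slide.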
SPARK-FEATURESELECTION PACKAGE.
- Offers selection based on:
  - Gini coefficient
  - Correlation coefficient
  - Information gain
  - L1-Logistic regression weights
  - Random forest importances
- Utility stage:
  - VectorMerger
- Three modes:
  - Percentile (default)
  - Fixed number of columns
  - Compare to random column [4]
Find on GitHub: spark-FeatureSelection or on Spark-packages
[4]: Stoppiglia et al.: Ranking a Random Feature for Variable and Feature Selection
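As a rough illustration of the three modes, the following plain-Scala sketch applies each of them to a list of (feature name, score) pairs. It is a sketch under stated assumptions, not the package's code: the object SelectionModes, its function names, and its parameters are hypothetical, and the slide does not say how the package defines the percentile or how it scores the random column.

```scala
// Illustrative sketch in plain Scala (no Spark); all names are hypothetical.
object SelectionModes {
  type Scores = Seq[(String, Double)]

  // Fixed number of columns: keep the n highest-scoring features.
  def byFixedNumber(scores: Scores, n: Int): Seq[String] =
    scores.sortBy { case (_, s) => -s }.take(n).map(_._1)

  // Percentile: keep the top fraction of features by score (at least one).
  // Assumption: fraction is given as a value in (0, 1].
  def byPercentile(scores: Scores, fraction: Double): Seq[String] = {
    val n = math.max(1, math.ceil(scores.size * fraction).toInt)
    byFixedNumber(scores, n)
  }

  // Compare to random column: keep features that score higher than
  // a random probe feature whose score is passed in as probeScore.
  def byRandomProbe(scores: Scores, probeScore: Double): Seq[String] =
    scores.collect { case (name, s) if s > probeScore => name }
}
```

With the "Score 1" values from the usage slide, Seq("F1" -> 0.9, "F2" -> 0.7, "F3" -> 0.0, "F4" -> 0.5), byPercentile with a fraction of 0.5 keeps F1 and F2, byFixedNumber with n = 3 keeps F1, F2, and F4, and byRandomProbe with a probe score of 0.6 keeps F1 and F2.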
PERFORMANCE.
[Chart: Area under normalized PRC and ROC. Normalized area under PRC and area under ROC (axis 0 to 1) for FS - 25 Trees, FS - 100 Trees, No FS - 25 Trees, No FS - 100 Trees.]
[Chart: Time for FS methods and random forest. Time [s] (axis 0 to 450) for the same four configurations, split into Multibucketizer, Gini, Correlation, Informationgain, Chi², Randomforest.]
[Chart: Correlation between feature importances from feature selection and random forest (axis 0 to 1,2) for Chi², Correlation, Gini, InfoGain.]
LESSONS LEARNT.
- Know what your data looks like and where it is located! Example:
  - Operations can succeed in local mode, but fail on a cluster.
  - Use .persist(StorageLevel.MEMORY_ONLY) when the data fits into memory. The default for .cache is MEMORY_AND_DISK.
- Do not reinvent the wheel for common methods -> Consider putting your stages into the spark.ml namespace.
- Use the Spark Web GUI to understand your Spark jobs.
QUESTIONS?
Marc.Kaminski@bmw.de
Bernhard.bb.Schegel@bmw.de
BACKUP.
DETERMINING WHERE YOUR PIPELINESTAGE SHOULD LIVE.
Own namespace
  Pro: Safer solution
  Con: Code duplication
vs.
org.apache.spark.ml.*
  Pro: Less code duplication (sharedParams, SchemaUtils, …); easier to implement persistence
  Con: More dangerous, when not cautious
FEATURE SELECTION.
- Motivation:
  - Many sparse features -> feature space has to be reduced -> select features that carry a lot of information for prediction.
  - Feature selection (unlike feature transformation) enables understanding of which features have a high impact on the model.

Input data:
F1  F2  Noise  Label = F1 XOR F2
0   0   0      0
1   0   0      1
0   1   0      1
1   1   1      0

Feature selection, e.g.:
- Correlation
- InformationGain
- RandomForest
etc.

Feature    Importance
Feature 1  0.7
Feature 2  0.7
Noise      0.2

Data after feature selection:
F1  F2  Label = F1 XOR F2
0   0   0
1   0   1
0   1   1
1   1   0
FEATURE SELECTION.
Filter
  Description: Evaluate intrinsic data properties
  Advantages: Fast; scalable
  Disadvantages: Ignore inter-feature dependencies; ignore interaction with classifier
  Examples: Chi-squared; information gain; correlation
Wrapper
  Description: Evaluate model performance of feature subset
  Advantages: Feature dependencies; simple
  Disadvantages: Classifier dependent selection; computationally expensive; risk of overfitting
  Examples: Genetic algorithms; search algorithms
Embedded
  Description: Feature selection is embedded in classifier training
  Advantages: Feature dependencies
  Disadvantages: Classifier dependent selection
  Examples: L1-Logistic regression; random forest
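The filter disadvantage "ignore inter-feature dependencies" can be made concrete with the four-row XOR toy table from the feature selection motivation slide. The sketch below computes a plain Pearson correlation between each column and the label. It is written only for illustration: the object XorCorrelation and its names are hypothetical, it is not taken from the package, and the importance values shown on the slide are not reproduced by it.

```scala
// Illustrative sketch in plain Scala (no Spark); all names are hypothetical.
object XorCorrelation {
  // Pearson correlation between two equally long sequences.
  // Returns NaN if one of the inputs is constant (zero variance).
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.nonEmpty && x.size == y.size, "inputs must be non-empty and of equal length")
    val mx  = x.sum / x.size
    val my  = y.sum / y.size
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val vx  = x.map(a => (a - mx) * (a - mx)).sum
    val vy  = y.map(b => (b - my) * (b - my)).sum
    cov / math.sqrt(vx * vy)
  }

  // Columns of the XOR toy table.
  val f1    = Seq(0.0, 1.0, 0.0, 1.0)
  val f2    = Seq(0.0, 0.0, 1.0, 1.0)
  val noise = Seq(0.0, 0.0, 0.0, 1.0)
  val label = Seq(0.0, 1.0, 1.0, 0.0) // F1 XOR F2
}
```

On these four rows, pearson(f1, label) and pearson(f2, label) are both 0, although F1 and F2 together determine the label completely, while pearson(noise, label) is about -0.577. A score that looks at one feature at a time cannot see the XOR interaction, which is the dependency that wrapper and embedded methods are listed as capturing.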
CHALLENGES.
- Big plans for DataFrames when performing many operations on many columns -> Can take a long time to build and optimize the DAG.
- Column limit for DataFrames, described in several Jira issues, especially SPARK-18016 -> Hopefully fixed in Spark 2.3.0.
- Spark PipelineStages are not consistent in how they handle DataFrame schemas -> Sometimes no schema is appended.

Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
 

Recently uploaded

Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 

Recently uploaded (20)

Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 

Building Custom ML PipelineStages for Feature Selection with Marc Kaminski

  • 1. Kaminski, Schlegel | Oct. 25, 2017 BUILDING CUSTOM ML PIPELINESTAGES FOR FEATURE SELECTION. SPARK SUMMIT EUROPE 2017.
  • 2. WHAT YOU WILL LEARN DURING THIS SESSION.
      - What data-driven car diagnostics look like at BMW.
      - The most important elements of Spark ML PipelineStages (using a feature selection example).
      - Attention: there will be Scala code examples!
      - How to use spark-FeatureSelection in your Spark ML Pipeline.
      - The impact of feature selection on learning performance and on understanding the big data black box.
      Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 2
  • 7. MOTIVATION.
      - The #3 contributor to warranty incidents for OEMs is "no trouble found" cases. [1]
      - Potential root causes:
          - Manually formalized expert knowledge cannot cope with the vast number of possibilities.
          - Cars are getting more and more complex (hybridization, connectivity).
          - Less experienced workshop staff in evolving markets.
      - Improve three workflows at once by shifting from a manual to a data-driven approach:
          - Automatic knowledge generation.
          - Automatic workshop diagnostics.
          - Predictive maintenance.
      [1] BearingPoint, Global Automotive Warranty Survey Report 2009
  • 11. THE DATASET AND ITS CHALLENGES.
      Columns: MV_S, MV_0, …, MV_4000, MV_BGEE_TP, SC_IP, SC_1, SC_2, DTC_PU, DTC_1, DTC_2, CP, Label
      Sample rows (missing values leave gaps, so rows differ in length):
          44 3 … 20 -0.06 false 2 77 27 false true v.10 false
          72 36 … 73 -0.01 false 16 29 false v.10 false
          100 4 … 16 -0.02 true 45 1 false false v.10 false
          44 14 … 54 -0.02 true 76 false v.10 true
          95 34 … 73 -0.07 false 80 22 false false v.10 false
          16 50 … 33 -0.02 true 61 93 false false false v.11 false
          4 … 27 -0.09 false 59 91 false v.10 false
          48 … 20 -0.07 false 32 31 false v.10 false
          88 60 … 72 -0.01 true 1.9 96 53 true false true v.10 false
          27 14 … 88 false 73 14 false v.10 false
      Challenges:
      - High-dimensional feature space (7000+ features).
      - High sparsity.
      - High class imbalance.
  • 17. SPARK PIPELINE.
      - Relational DWH.
      - ETL: Imputation, Loading. Example data:
          MV_S  SC_IP  DTC_PU  CP    Label
          44    2      false   v.10  false
          72    1.5    true    v.11  false
          23    1.4    false   v.11  false
          44    1.5    true    v.10  true
      - Handling imbalance (labels true/false): SMOTE [2], Undersampling, …
      - Preprocessing: StringIndexer, OneHotEncoder, VectorAssembler, Discretization, Std. Scaler. Example data:
          Features          Label
          [0.34,0.8,0,1]    0.0
          [0.7,0.4,1,0]     0.0
          [0.31,0.35,1,0]   1.0
          [0.3,0.4,1.1]     1.0
      - Cross-validation loop:
          - Feature selection [3]: InformationGain, Correlation, ChiSquared, Random Forest Gini, L1 LogReg.
          - Classifier: Logistic Regression / Random Forest.
      - Model.
      [2] Chawla et al.: SMOTE: Synthetic Minority Over-sampling Technique.
      [3] Schlegel et al.: Design and optimization of an autonomous feature selection pipeline for high dimensional, heterogeneous feature spaces.
  • 20. SPARK PIPELINE API.
      - PipelineStage: interface for usage in a Pipeline.
      - Transformer: 'transforms data' (data → data).
      - Estimator: 'learns from data' (data → Transformer).
  • 23. ORG.APACHE.SPARK.ML.*
      - PipelineStage: interface for usage in a Pipeline.
          - Estimator: learns from data.
              - Pipeline: concatenates PipelineStages.
              - Predictor: interface for predictors.
              - FeatureSelector: interface for feature selection (FS).
          - Transformer: transforms data.
              - Model: fitted model.
                  - PipelineModel: model from a Pipeline.
                  - PredictionModel: model from a Predictor.
                  - FeatureSelectionModel: model from a FeatureSelector.
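The Estimator/Transformer contract in the hierarchy above can be illustrated without Spark. The following is a minimal sketch under stated assumptions: `ToyPipeline`, `Stage`, `Scale`, `MeanCenterer`, `fitPipeline` and `applyAll` are hypothetical stand-ins invented for illustration, not part of `org.apache.spark.ml`, and the "dataset" is just a vector of numbers. It only shows the idea that fitting a pipeline walks the stages in order, fits each Estimator on the data produced so far, and keeps the resulting Transformers.

```scala
// Toy stand-ins for the Spark ML abstractions; NOT the org.apache.spark.ml API.
object ToyPipeline {
  type Data = Vector[Double]

  sealed trait Stage
  trait Transformer extends Stage { def transform(data: Data): Data }
  trait Estimator extends Stage { def fit(data: Data): Transformer }

  // A Transformer needs no training: it maps data to data.
  final case class Scale(factor: Double) extends Transformer {
    def transform(data: Data): Data = data.map(_ * factor)
  }

  // An Estimator learns something (here: the mean) and returns a fitted Transformer.
  object MeanCenterer extends Estimator {
    def fit(data: Data): Transformer = {
      val mean = if (data.isEmpty) 0.0 else data.sum / data.size
      new Transformer { def transform(d: Data): Data = d.map(_ - mean) }
    }
  }

  // Fit stages in order; each Estimator sees the output of the stages before it.
  def fitPipeline(stages: Seq[Stage], training: Data): Seq[Transformer] = {
    val (fitted, _) = stages.foldLeft((Vector.empty[Transformer], training)) {
      case ((acc, current), stage) =>
        val fittedStage = stage match {
          case e: Estimator    => e.fit(current)
          case tr: Transformer => tr
        }
        (acc :+ fittedStage, fittedStage.transform(current))
    }
    fitted
  }

  // Apply the fitted stages to new data, again in order.
  def applyAll(model: Seq[Transformer], data: Data): Data =
    model.foldLeft(data)((d, tr) => tr.transform(d))
}
```

With `fitPipeline(Seq(Scale(2.0), MeanCenterer), Vector(1.0, 2.0, 3.0))`, the centerer is fitted on the already scaled data, so the learned mean is 4.0 rather than 2.0.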
  • 31. MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE.
      abstract class FeatureSelector[
          Learner <: FeatureSelector[Learner, M],
          M <: FeatureSelectorModel[M]]
        extends Estimator[M]            // Needs to know what it shall return.
        with FeatureSelectorParams      // Defined later.
        with DefaultParamsWritable {    // Makes all Params writable.

        // Setters for params in FeatureSelectorParams.
        // For setter concatenation: mdl.setParam1(val1).setParam2(val2)...
        def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner]
      }
      transformSchema (= input validation): returns the transformed schema or throws an exception.
      - Example DataFrame: (features = [0,1,0,1], label = 1.0), (features = [1,0,0,0], label = 1.0).
      - DataFrame with schema: features: VectorColumn, selected: VectorColumn, label: Double.
      - Attention: VectorColumns have metadata: name, type, range, etc.
  • 33. MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE.
      abstract class FeatureSelector[
          Learner <: FeatureSelector[Learner, M],
          M <: FeatureSelectorModel[M]]
        extends Estimator[M]            // Needs to know what it shall return.
        with FeatureSelectorParams      // Defined later.
        with DefaultParamsWritable {    // Makes all Params writable.

        // Setters for params in FeatureSelectorParams.
        // For setter concatenation: mdl.setParam1(val1).setParam2(val2)...
        def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner]

        // PipelineStage and Estimator methods.
        // Performs input checking and fails fast. Can throw exceptions.
        override def transformSchema(schema: StructType): StructType = {}
      }
      fit (= learn from data): Dataset → Transformer.
  • 36. MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE. abstract class FeatureSelector[ Learner <: FeatureSelector[Learner, M], M <: FeatureSelectorModel[M]] extends Estimator[M] with FeatureSelectorParams with DefaultParamsWritable { // Setters for params in FeatureSelectorParams def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner] // PipelineStage and Estimator methods override def transformSchema(schema: StructType): StructType = {} override def fit(dataset: Dataset[_]): M = {} override def copy(extra: ParamMap): Learner // Abstract methods that are called from fit() protected def train(dataset: Dataset[_]): Array[(Int, Double)] protected def make(uid: String, selectedFeatures: Array[Int], featureImportances: Map[String, Double]): M } Learns from data and returns a Model. Here: calculate feature importances. Not necessary, but avoids code duplication.Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 10 For setter concatenation: mdl.setParam1(val1) .setParam2(val2)... Defined later. Makes all Param writable. Performs input checking and fails fast. Canthrow exceptions. Needsto know, what it shall return.
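The self-referencing `Learner` type parameter is what keeps a chain of setters typed to the concrete selector. The following is a minimal, Spark-free sketch of that pattern. The names `Selector`, `GiniLike`, `setTopK` and `setBins` are hypothetical and only mirror the shape of `FeatureSelector`; they are not part of spark-FeatureSelection.

```scala
// Spark-free sketch of the self-typed ("F-bounded") setter pattern.
// All names are hypothetical illustrations, not the package's API.
abstract class Selector[Learner <: Selector[Learner]] {
  protected var topK: Int = 10

  // Returning Learner instead of Selector keeps the concrete type in a chain.
  def setTopK(value: Int): Learner = {
    topK = value
    this.asInstanceOf[Learner]
  }

  def getTopK: Int = topK
}

final class GiniLike extends Selector[GiniLike] {
  private var bins: Int = 2

  // A setter that only exists on the concrete class.
  def setBins(value: Int): GiniLike = {
    bins = value
    this
  }

  def getBins: Int = bins
}

// setBins is still reachable after setTopK, because setTopK returns GiniLike.
val sel = new GiniLike().setTopK(5).setBins(4)
```

If `setTopK` returned the abstract base type instead, the call to `setBins` in the last line would not compile.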
  • 37-42. MAKING A TRANSFORMER – FEATURESELECTION EXAMPLE.

        abstract class FeatureSelectorModel[M <: FeatureSelectorModel[M]] (
            override val uid: String,
            val selectedFeatures: Array[Int],
            val featureImportances: Map[String, Double])
          extends Model[M] with FeatureSelectorParams with MLWritable {

          // Setters for params in FeatureSelectorParams
          def setFeaturesCol(value: String): this.type = set(featuresCol, value)

          // PipelineStage and Transformer methods
          override def transformSchema(schema: StructType): StructType = {}
          override def transform(dataset: Dataset[_]): DataFrame = {}

          def write: MLWriter
        }

    Slide callouts:
     MLWritable: for persistence.
     transform: transforms data.
     write: adds persistence.
     Same idea as in Estimator, but different tasks.
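What `transform` does with `selectedFeatures` can be pictured without Spark: each feature vector is reduced to the learnt indices. Below is a minimal sketch on plain arrays. The helper name `selectFeatures` is hypothetical, and keeping the indices in ascending order is an assumption; the real model works on a vector column of a Dataset.

```scala
// Keep only the entries at the selected indices (ascending index order assumed).
def selectFeatures(features: Array[Double], selected: Array[Int]): Array[Double] =
  selected.sorted.map(i => features(i))

// Rows of the toy "features" column used in the usage slides.
val rows = Seq(
  Array(0.0, 1.0, 0.0, 1.0),
  Array(0.0, 0.0, 0.0, 0.0),
  Array(1.0, 1.0, 0.0, 0.0),
  Array(1.0, 0.0, 0.0, 0.0))

// Selecting indices 0 and 1 corresponds to keeping F1 and F2.
val reduced = rows.map(row => selectFeatures(row, Array(0, 1)))
```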
  • 43-46. GIVING YOUR NEW PIPELINESTAGE PARAMETERS.

        import org.apache.spark.ml.param._
        import org.apache.spark.ml.param.shared._

        private[selection] trait FeatureSelectorParams extends Params
          with HasFeaturesCol with HasOutputCol with HasLabelCol {

          // Define params and getters here...
          final val param = new Param[Type](this, "name", "description")
          def getParam: Type = $(param)
        }

    Slide callouts:
     Shared params (HasFeaturesCol, HasOutputCol, HasLabelCol): possible, because the package is in org.apache.spark.ml.
     Params work out of the box for several types, e.g. DoubleParam, IntParam, BooleanParam, StringArrayParam, ... For other types you need to implement jsonEncode and jsonDecode to maintain persistence.
     Getters are shared between Estimator and Transformer. Setters are not, so that they can be chained.
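As a concrete instance of the `param` / `getParam` placeholders, here is a small sketch of a params trait with one validated `IntParam`. It assumes Spark ML on the classpath and lives in a user namespace, so it does not use the `private[ml]` shared params. The names `TopKParams`, `TopKHolder` and `numTopFeatures` are hypothetical and not taken from spark-FeatureSelection.

```scala
import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators, Params}
import org.apache.spark.ml.util.Identifiable

// Hypothetical params trait in a user namespace.
trait TopKParams extends Params {
  // IntParam is one of the types that persist out of the box.
  final val numTopFeatures: IntParam = new IntParam(
    this, "numTopFeatures", "number of features to keep (> 0)", ParamValidators.gt(0))

  setDefault(numTopFeatures -> 10)

  // The getter lives in the shared trait.
  def getNumTopFeatures: Int = $(numTopFeatures)
}

// Minimal concrete holder; an Estimator or Model would mix the trait in the same way.
class TopKHolder(override val uid: String) extends TopKParams {
  def this() = this(Identifiable.randomUID("topK"))

  // The setter lives on the concrete class and returns this.type for chaining.
  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)

  override def copy(extra: ParamMap): TopKHolder = defaultCopy(extra)
}

val holder = new TopKHolder().setNumTopFeatures(3)
```

Setting a value that fails the validator, for example `setNumTopFeatures(0)`, is rejected when the param is set rather than later during `fit`.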
  • 47-50. ADDING PERSISTENCE TO YOUR NEW PIPELINEMODEL.
     What has to be saved?
       Metadata: uid, timestamp, version, …
       Parameters
         Since we are in org.apache.spark.ml, use DefaultParamsWriter.saveMetadata() and DefaultParamsReader.loadMetadata() for metadata and parameters.
       Learnt data: selectedFeatures & featureImportances
         Create a DataFrame and use write.parquet(…)
     How do we do that?
       Create a companion object FeatureSelectorModel, which offers the following classes:
         abstract class FeatureSelectorModelReader[M <: FeatureSelectorModel[M]] extends MLReader[M] {…}
         class FeatureSelectorModelWriter[M <: FeatureSelectorModel[M]](instance: M) extends MLWriter {…}
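The "learnt data" part of the writer can be sketched with public API only. The example below assumes Spark ML on the classpath and an active SparkSession. `SelectorData` and `SelectorDataWriter` are hypothetical names, and the `data` subdirectory is an assumption modelled on how Spark's own models lay out their files. The metadata step is left as a comment because `DefaultParamsWriter` is only accessible from inside `org.apache.spark.ml`.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.ml.util.MLWriter

// Hypothetical container for the learnt state named on the slide.
case class SelectorData(selectedFeatures: Seq[Int], featureImportances: Map[String, Double])

// Sketch of a writer that stores only the learnt data as Parquet.
class SelectorDataWriter(data: SelectorData) extends MLWriter {
  override protected def saveImpl(path: String): Unit = {
    // Inside org.apache.spark.ml, metadata and params would be written first
    // with DefaultParamsWriter.saveMetadata (omitted here: it is private[ml]).
    val dataPath = new Path(path, "data").toString
    sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
  }
}

// Usage (needs an active SparkSession):
//   new SelectorDataWriter(SelectorData(Seq(0, 1), Map("F1" -> 0.9, "F2" -> 0.7)))
//     .overwrite().save("path/to/model")
```

A matching reader would load the Parquet file under `data`, read the two columns back from its single row, and pass them to the model constructor.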
  • 51-57. HOW TO USE SPARK-FEATURESELECTION.

        import org.apache.spark.ml.feature.selection.filter._
        import org.apache.spark.ml.feature.selection.util.VectorMerger
        import org.apache.spark.ml.Pipeline

        // load Data
        val df = spark.read.parquet("path/to/data/train.parquet")

        // Feature selectors. Offer different selection methods.
        val corSel = new CorrelationSelector().setInputCol("features").setOutputCol("cor")
        val giniSel = new GiniSelector().setInputCol("features").setOutputCol("gini")

        // VectorMerger merges VectorColumns and removes duplicates. Requires vector columns with names!
        val merger = new VectorMerger().setInputCols(Array("cor", "gini")).setOutputCol("selected")

        // Put everything in a pipeline and fit together
        val plModel = new Pipeline().setStages(Array(corSel, giniSel, merger)).fit(df)

        val dfT = plModel.transform(df).drop("features")

    df (input):
        features    Label
        [0,1,0,1]   1.0
        [0,0,0,0]   0.0
        [1,1,0,0]   0.0
        [1,0,0,0]   1.0

    Scores learnt during fit:
        Feature   F1    F2    F3    F4
        Score 1   0.9   0.7   0.0   0.5
        Score 2   0.6   0.8   0.0   0.4

    dfT (after transform):
        selected   Label
        [0,1]      1.0
        [0,0]      0.0
        [1,1]      0.0
        [1,0]      1.0
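VectorMerger needs vector columns that carry feature names in their metadata. The slides do not say how the `features` column was built; one common way to get such a column, assumed here, is Spark's `VectorAssembler`, which records the input column names as ML attributes. The raw column names `f1` to `f4` are hypothetical.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("named-vectors").getOrCreate()
import spark.implicits._

// Hypothetical raw columns f1..f4 plus a label, matching the toy table on the slide.
val raw = Seq(
  (0.0, 1.0, 0.0, 1.0, 1.0),
  (0.0, 0.0, 0.0, 0.0, 0.0),
  (1.0, 1.0, 0.0, 0.0, 0.0),
  (1.0, 0.0, 0.0, 0.0, 1.0)
).toDF("f1", "f2", "f3", "f4", "label")

// Assemble the raw columns into one vector column named "features".
val df = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3", "f4"))
  .setOutputCol("features")
  .transform(raw)
  .select("features", "label")

// The attribute names travel in the metadata of the vector column.
val names = AttributeGroup.fromStructField(df.schema("features"))
  .attributes.get.map(_.name.get)
```

Inspecting `names` before fitting the pipeline is a quick way to check that a vector column actually has names attached.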
  • 58. SPARK-FEATURESELECTION PACKAGE.
     Offers selection based on:
       Gini coefficient
       Correlation coefficient
       Information gain
       L1-Logistic regression weights
       Random forest importances
     Utility stage:
       VectorMerger
     Three modes:
       Percentile (default)
       Fixed number of columns
       Compare to random column [4]
    Find on GitHub: spark-FeatureSelection or on Spark-packages
    [4]: Stoppiglia et al.: Ranking a Random Feature for Variable and Feature Selection
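The three modes can be pictured as three ways of cutting a ranked list of importances. The sketch below is plain Scala and only illustrates the idea; the function names, the rounding rule for the percentile, and the way the random probe is compared are assumptions, not the package's implementation.

```scala
// Importances as (featureIndex, score), the shape returned by train() on the estimator slide.

// Fixed number of columns: keep the k highest-scoring features.
def topK(scores: Array[(Int, Double)], k: Int): Array[Int] =
  scores.sortBy(-_._2).take(k).map(_._1).sorted

// Percentile: keep the top fraction p of all features (rounded up, at least one).
def topPercentile(scores: Array[(Int, Double)], p: Double): Array[Int] =
  topK(scores, math.max(1, math.ceil(p * scores.length).toInt))

// Compare to random column: keep features that score above a random probe feature.
def aboveRandom(scores: Array[(Int, Double)], probeIndex: Int): Array[Int] = {
  val probeScore = scores.find(_._1 == probeIndex).map(_._2).getOrElse(0.0)
  scores.filter { case (i, s) => i != probeIndex && s > probeScore }.map(_._1).sorted
}

// "Score 1" row from the usage slide (F1..F4) plus a hypothetical probe at index 4.
val scores = Array(0 -> 0.9, 1 -> 0.7, 2 -> 0.0, 3 -> 0.5, 4 -> 0.2)

val fixed      = topK(scores, 2)
val percentile = topPercentile(scores, 0.5)
val vsRandom   = aboveRandom(scores, 4)
```

The probe score of 0.2 is made up for the illustration; in practice it would come from scoring a column of random values alongside the real features.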
  • 59-63. PERFORMANCE.
    [Chart: "Area under normalized PRC and ROC". Series: Normalized Area under PRC, Area under ROC. Groups: FS - 25 Trees, FS - 100 Trees, No FS - 25 Trees, No FS - 100 Trees. Axis range: 0 to 1.]
    [Chart: "Time for FS methods and random forest". Axis: Time [s], 0 to 450. Groups: FS - 25 Trees, FS - 100 Trees, No FS - 25 Trees, No FS - 100 Trees. Series: Multibucketizer, Gini, Correlation, Information gain, Chi², Random forest. Slide 63 repeats this chart with only the series Multibucketizer, Gini, Information gain, Random forest.]
    [Chart (slides 60-61): "Correlation between feature importances from feature selection and random forest". Categories: Chi², Correlation, Gini, InfoGain. Axis range: 0 to 1.2.]
  • 64. LESSONS LEARNT.
     Know what your data looks like and where it is located! Example:
       Operations can succeed in local mode, but fail on a cluster.
     Use .persist(StorageLevel.MEMORY_ONLY) when the data fits into memory. The default for .cache on a DataFrame is MEMORY_AND_DISK.
     Do not reinvent the wheel for common methods.
       Consider putting your stages into the spark.ml namespace.
     Use the Spark Web GUI to understand your Spark jobs.
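The caching advice can be tried out in a few lines. This is a small sketch assuming a local SparkSession; the DataFrame built with `spark.range` is a stand-in for real training data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").appName("persist-demo").getOrCreate()

// Stand-in for real training data.
val df = spark.range(0, 1000).toDF("id")

// Explicit storage level; df.cache() would use MEMORY_AND_DISK for a DataFrame.
val cached = df.persist(StorageLevel.MEMORY_ONLY)
cached.count()                    // an action materializes the cache
val level = cached.storageLevel   // inspect the level that is actually in use
cached.unpersist()                // release the memory when done
```

The Storage tab of the Spark Web UI shows the same information while the DataFrame is cached.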
  • 65. QUESTIONS? Marc.Kaminski@bmw.de Bernhard.bb.Schegel@bmw.de
  • 66. BACKUP.
  • 67. DETERMINING WHERE YOUR PIPELINESTAGE SHOULD LIVE.
    Own namespace vs. org.apache.spark.ml.*
     Own namespace
       Pro: Safer solution
       Con: Code duplication
     org.apache.spark.ml.*
       Pro: Less code duplication (sharedParams, SchemaUtils, …)
       Pro: Easier to implement persistence
       Con: More dangerous, when not cautious
  • 68-69. FEATURE SELECTION.
     Motivation:
       Many sparse features → feature space has to be reduced → select features that carry a lot of information for prediction.
       Feature selection (unlike feature transformation) enables understanding of which features have a high impact on the model.

    Input data:
        F1   F2   Noise   Label = F1 XOR F2
        0    0    0       0
        1    0    0       1
        0    1    0       1
        1    1    1       0

    Feature selection (e.g. Correlation, InformationGain, RandomForest etc.) yields feature importances:
        Feature     Importance
        Feature 1   0.7
        Feature 2   0.7
        Noise       0.2

    Reduced data:
        F1   F2   Label = F1 XOR F2
        0    0    0
        1    0    1
        0    1    1
        1    1    0
  • 70. FEATURE SELECTION.
     Filter
       Description: Evaluate intrinsic data properties
       Advantages: Fast; scalable
       Disadvantages: Ignore inter-feature dependencies; ignore interaction with classifier
       Examples: Chi-squared, information gain, correlation
     Wrapper
       Description: Evaluate model performance of feature subset
       Advantages: Accounts for feature dependencies; simple
       Disadvantages: Classifier-dependent selection; computationally expensive; risk of overfitting
       Examples: Genetic algorithms, search algorithms
     Embedded
       Description: Feature selection is embedded in classifier training
       Advantages: Accounts for feature dependencies
       Disadvantages: Classifier-dependent selection
       Examples: L1-Logistic regression, random forest
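The filter disadvantage "ignore inter-feature dependencies" can be made concrete with the XOR table from the earlier backup slide. The sketch below is plain Scala and computes a per-column Pearson correlation with the label. It is an illustration only: it is not the package's CorrelationSelector, and it does not reproduce the illustrative importance values shown on that slide.

```scala
// Pearson correlation between two equally long columns.
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  val n = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}

// Columns of the XOR table (rows in slide order).
val f1    = Seq(0.0, 1.0, 0.0, 1.0)
val f2    = Seq(0.0, 0.0, 1.0, 1.0)
val noise = Seq(0.0, 0.0, 0.0, 1.0)
val label = Seq(0.0, 1.0, 1.0, 0.0) // F1 XOR F2

val corF1    = pearson(f1, label)
val corF2    = pearson(f2, label)
val corNoise = pearson(noise, label)
```

Taken one column at a time, F1 and F2 are each uncorrelated with the label even though together they determine it, while the noise column gets a nonzero score on this tiny sample. A univariate filter would therefore rank the noise column first here, which is the case where wrapper or embedded methods have the advantage.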
  • 71. CHALLENGES.
     Big plans for DataFrames when performing many operations on many columns
       Can take a long time to build and optimize the DAG.
     Column limit for DataFrames, reported in several Jira issues, especially SPARK-18016
       Hopefully fixed in Spark 2.3.0.
     Spark PipelineStages are not consistent in how they handle DataFrame schemas
       Sometimes no schema is appended.