
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski

For predicting vehicle defects at BMW, a machine learning pipeline evaluating several thousand features was implemented. Because only a subset of these features is informative for any specific defect, a feature selection step was added. To further assess feature importance, several feature selection techniques (filters and wrappers) were implemented as Spark ML PipelineStages that operate on DataFrames, so they can be incorporated into a complete Spark ML Pipeline alongside preprocessing and classification. The talk presents the general steps for building custom Spark ML Estimators, demonstrates the API of the newly implemented feature selection stages, shows results of a performance analysis, and shares lessons learned and pitfalls to avoid.



  1. Kaminski, Schlegel | Oct. 25, 2017. BUILDING CUSTOM ML PIPELINESTAGES FOR FEATURE SELECTION. SPARK SUMMIT EUROPE 2017.
  2. WHAT YOU WILL LEARN DURING THIS SESSION.  What data-driven car diagnostics look like at BMW.  The most important elements of Spark ML PipelineStages, illustrated with a feature selection example.  Attention: there will be Scala code examples!  How to use spark-FeatureSelection in your Spark ML Pipeline.  The impact of feature selection on learning performance and on understanding the big-data black box. Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 2
  3. MOTIVATION.  The #3 contributor to warranty incidents for OEMs is the “no trouble found” case. [1]  Potential root causes:  Manually formalized expert knowledge cannot cope with the vast number of possibilities.  Cars are getting more and more complex (hybridization, connectivity).  Less experienced workshop staff in evolving markets.  Improve three workflows at once by shifting from a manual to a data-driven approach:  Automatic knowledge generation.  Automatic workshop diagnostics.  Predictive maintenance. [1] BearingPoint, Global Automotive Warranty Survey Report 2009
  8. THE DATASET AND ITS CHALLENGES. [Table: example rows with columns MV_S, MV_0, …, MV_4000, MV_BGEE_TP, SC_IP, SC_1, SC_2, DTC_PU, DTC_1, DTC_2, CP, Label; many cells are empty.]  High-dimensional feature space (7000+ features).  High sparsity.  High class imbalance.
  12. SPARK PIPELINE.  Loading: ETL and imputation from the relational DWH (columns such as MV_S, SC_IP, DTC_PU, CP, Label).  Handling imbalance: SMOTE [2], undersampling, …  Preprocessing: StringIndexer, OneHotEncoder, VectorAssembler, discretization, StandardScaler, producing a features vector column plus a label column.  Crossvalidation loop: feature selection [3] (information gain, correlation, chi-squared, random forest Gini, L1 logistic regression) followed by a classifier (logistic regression / random forest), yielding the model. [2] Chawla et al.: SMOTE: Synthetic Minority Over-sampling Technique. [3] Schlegel et al.: Design and optimization of an autonomous feature selection pipeline for high dimensional, heterogeneous feature spaces.
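The preprocessing and classification stages above can be sketched as an ordinary Spark ML Pipeline. This is a minimal, hedged sketch assuming Spark 2.x; the column names mirror the dataset slide, and the discretization, imbalance-handling and feature selection steps are omitted for brevity:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler}

// Index the categorical code-version column, one-hot encode it,
// assemble all columns into a single vector and scale it.
val indexer   = new StringIndexer().setInputCol("CP").setOutputCol("CP_idx")
val encoder   = new OneHotEncoder().setInputCol("CP_idx").setOutputCol("CP_vec")
val assembler = new VectorAssembler()
  .setInputCols(Array("MV_S", "SC_IP", "CP_vec"))
  .setOutputCol("rawFeatures")
val scaler    = new StandardScaler().setInputCol("rawFeatures").setOutputCol("features")
val lr        = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

// A Pipeline chains the stages; fit() runs each Estimator in order
// and returns a PipelineModel that can transform unseen data.
val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, scaler, lr))
// val model = pipeline.fit(trainDf)   // trainDf: a DataFrame with the columns above
```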
  18. SPARK PIPELINE API.  PipelineStage: interface for usage in a Pipeline.  Transformer: ‘transforms data’ (data in, data out).  Estimator: ‘learns from data’ (data in, Transformer out).
  21. ORG.APACHE.SPARK.ML.*  PipelineStage: interface for usage in a Pipeline.  Estimator: learns from data.  Transformer: transforms data.  Model: a fitted model.  Pipeline: concatenates PipelineStages; PipelineModel: model from a Pipeline.  Predictor: interface for predictors; PredictionModel: model from a Predictor.  FeatureSelector: interface for feature selection; FeatureSelectionModel: model from a FeatureSelector.
  24. MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE.

      abstract class FeatureSelector[
          Learner <: FeatureSelector[Learner, M],
          M <: FeatureSelectorModel[M]]
        extends Estimator[M] with FeatureSelectorParams with DefaultParamsWritable {

        // Setters for params in FeatureSelectorParams
        def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner]

        // PipelineStage and Estimator methods
        override def transformSchema(schema: StructType): StructType = {}
        override def fit(dataset: Dataset[_]): M = {}
        override def copy(extra: ParamMap): Learner

        // Abstract methods that are called from fit()
        protected def train(dataset: Dataset[_]): Array[(Int, Double)]
        protected def make(uid: String, selectedFeatures: Array[Int],
                           featureImportances: Map[String, Double]): M
      }

      Type parameters: the Estimator needs to know what it shall return; M is defined later.
      DefaultParamsWritable makes all Params writable.
      Setters return Learner to allow concatenation: mdl.setParam1(val1).setParam2(val2)...
      transformSchema performs input checking and fails fast: it maps the input schema
     (features: VectorColumn, label: Double) to the transformed schema (adding selected:
     VectorColumn) or throws an exception. Attention: VectorColumns carry metadata
     (name, type, range, etc.).
      fit learns from data and returns a Model; here it calculates feature importances.
      The abstract train() and make() methods called from fit() are not strictly
     necessary, but avoid code duplication across concrete selectors.
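The `Learner <: FeatureSelector[Learner, M]` bound is an instance of the F-bounded (self-recursive) type pattern. A minimal, Spark-free sketch (the names are illustrative, not the package's API) shows why setters declared this way chain without losing the concrete subtype:

```scala
// Minimal sketch of the F-bounded pattern used above. Each setter
// returns the *concrete* subtype, so calls can be chained without
// falling back to the abstract base type.
abstract class Learner[L <: Learner[L]] { self: L =>
  private var threshold: Double = 0.5
  def setThreshold(value: Double): L = { threshold = value; self }
  def getThreshold: Double = threshold
}

class GiniLearner extends Learner[GiniLearner] {
  def giniOnly(): String = "gini"
}

// Chaining keeps the GiniLearner type, so subtype-specific methods
// remain available after calling a setter defined in the base class:
val l = new GiniLearner().setThreshold(0.2)
println(l.giniOnly())
```

In the slides' code the same effect is achieved with `asInstanceOf[Learner]` instead of a self-type, which is a pragmatic shortcut: the cast is safe as long as every subclass binds `Learner` to itself.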
  37. MAKING A TRANSFORMER – FEATURESELECTION EXAMPLE.

      abstract class FeatureSelectorModel[M <: FeatureSelectorModel[M]] (
          override val uid: String,
          val selectedFeatures: Array[Int],
          val featureImportances: Map[String, Double])
        extends Model[M] with FeatureSelectorParams with MLWritable {

        // Setters for params in FeatureSelectorParams
        def setFeaturesCol(value: String): this.type = set(featuresCol, value)

        // PipelineStage and Transformer methods
        override def transformSchema(schema: StructType): StructType = {}
        override def transform(dataset: Dataset[_]): DataFrame = {}
        def write: MLWriter
      }

      uid, selectedFeatures and featureImportances are constructor values, kept for persistence.
      transformSchema follows the same idea as in the Estimator, but with different tasks.
      transform transforms the data.
      write adds persistence (MLWritable).
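The essential work of `transform()` is keeping only the selected indices of each row's feature vector. A Spark-free sketch of that per-row logic (in the real Transformer this would run as a UDF over the vector column and also rewrite the column's metadata):

```scala
// Per-row core of a feature selection transform: project the feature
// vector onto the indices that were selected during fit().
val selectedFeatures = Array(0, 2)            // learned by the Estimator
def select(row: Array[Double]): Array[Double] =
  selectedFeatures.map(i => row(i))

println(select(Array(0.9, 0.1, 0.7, 0.0)).mkString(","))  // keeps positions 0 and 2
```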
  43. GIVING YOUR NEW PIPELINESTAGE PARAMETERS.

      import org.apache.spark.ml.param._
      import org.apache.spark.ml.param.shared._

      private[selection] trait FeatureSelectorParams extends Params
        with HasFeaturesCol with HasOutputCol with HasLabelCol {
        // Define params and getters here...
        final val param = new Param[Type](this, "name", "description")
        def getParam: Type = $(param)
      }

      Mixing in the shared Has*Col traits is possible because the package lives in
     org.apache.spark.ml.
      Typed params come out of the box for several types, e.g. DoubleParam, IntParam,
     BooleanParam, StringArrayParam, ... For other types, implement jsonEncode and
     jsonDecode to maintain persistence.
      Getters are shared between Estimator and Transformer; setters are not, so that
     each side can return its own type for concatenation.
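As a concrete instance of the pattern above, a hedged sketch of one shared-param trait; the param name `selectionThreshold` is hypothetical, not part of the package:

```scala
import org.apache.spark.ml.param._

// Hypothetical shared param, mixed into both the Estimator and the Model.
private[selection] trait HasSelectionThreshold extends Params {
  // DoubleParam is one of the ready-made typed params; the validator
  // rejects bad values at set() time rather than during fit().
  final val selectionThreshold: DoubleParam = new DoubleParam(this,
    "selectionThreshold",
    "minimum importance a feature needs to be kept, in [0, 1]",
    ParamValidators.inRange(0.0, 1.0))

  def getSelectionThreshold: Double = $(selectionThreshold)

  setDefault(selectionThreshold -> 0.1)
}
```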
  47. ADDING PERSISTENCE TO YOUR NEW PIPELINEMODEL.

      What has to be saved?
      Metadata: uid, timestamp, version, … Since we are in org.apache.spark.ml, use
     DefaultParamsWriter.saveMetadata() and DefaultParamsReader.loadMetadata().
      Parameters (saved along with the metadata).
      Learnt data: selectedFeatures & featureImportances. Create a DataFrame and use
     write.parquet(…).

      How do we do that? Create a companion object FeatureSelectorModel, which offers
     the following classes:
      abstract class FeatureSelectorModelReader[M <: FeatureSelectorModel[M]] extends MLReader[M] {…}
      class FeatureSelectorModelWriter[M <: FeatureSelectorModel[M]](instance: M) extends MLWriter {…}
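A hedged sketch of what such a writer might look like, assuming the layout described above (DefaultParamsWriter is package-private to org.apache.spark.ml, which is another reason the package placement matters; the exact file layout is an assumption):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.ml.util.{DefaultParamsWriter, MLWriter}

// Sketch: metadata via DefaultParamsWriter, learnt state as parquet.
class FeatureSelectorModelWriter[M <: FeatureSelectorModel[M]](instance: M)
  extends MLWriter {

  override protected def saveImpl(path: String): Unit = {
    // uid, timestamp, Spark version and all Params:
    DefaultParamsWriter.saveMetadata(instance, path, sc)
    // learnt state as a one-row DataFrame next to the metadata:
    import sparkSession.implicits._
    val data = Seq((instance.selectedFeatures, instance.featureImportances.toSeq))
      .toDF("selectedFeatures", "featureImportances")
    data.repartition(1).write.parquet(new Path(path, "data").toString)
  }
}
```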
  51. 51. HOW TO USE SPARK-FEATURESELECTION. Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 14
  52. 52. import org.apache.spark.ml.feature.selection.filter._ import org.apache.spark.ml.feature.selection.util.VectorMerger import org.apache.spark.ml.Pipeline HOW TO USE SPARK-FEATURESELECTION. Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 14
  53. 53. import org.apache.spark.ml.feature.selection.filter._ import org.apache.spark.ml.feature.selection.util.VectorMerger import org.apache.spark.ml.Pipeline // load Data val df = spark.read.parquet("path/to/data/train.parquet") HOW TO USE SPARK-FEATURESELECTION. features Label [0,1,0,1] 1.0 [0,0,0,0] 0.0 [1,1,0,0] 0.0 [1,0,0,0] 1.0 df Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 14
  54. 54. import org.apache.spark.ml.feature.selection.filter._ import org.apache.spark.ml.feature.selection.util.VectorMerger import org.apache.spark.ml.Pipeline // load Data val df = spark.read.parquet("path/to/data/train.parquet") val corSel = new CorrelationSelector().setInputCol("features").setOutputCol(“cor") val giniSel = new GiniSelector().setInputCol("features").setOutputCol(“gini") HOW TO USE SPARK-FEATURESELECTION. features Label [0,1,0,1] 1.0 [0,0,0,0] 0.0 [1,1,0,0] 0.0 [1,0,0,0] 1.0 Feature selectors. Offer different selection methods. df Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 14
  55. 55. import org.apache.spark.ml.feature.selection.filter._ import org.apache.spark.ml.feature.selection.util.VectorMerger import org.apache.spark.ml.Pipeline // load Data val df = spark.read.parquet("path/to/data/train.parquet") val corSel = new CorrelationSelector().setInputCol("features").setOutputCol(“cor") val giniSel = new GiniSelector().setInputCol("features").setOutputCol(“gini") // VectorMerger merges VectorColumns and removes duplicates. Requires vector columns with names! val merger = new VectorMerger().setInputCols(Array(“cor", “gini")).setOutputCol(“selected") HOW TO USE SPARK-FEATURESELECTION. features Label [0,1,0,1] 1.0 [0,0,0,0] 0.0 [1,1,0,0] 0.0 [1,0,0,0] 1.0 Feature selectors. Offer different selection methods. df Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 14
  56. 56. import org.apache.spark.ml.feature.selection.filter._ import org.apache.spark.ml.feature.selection.util.VectorMerger import org.apache.spark.ml.Pipeline // load Data val df = spark.read.parquet("path/to/data/train.parquet") val corSel = new CorrelationSelector().setInputCol("features").setOutputCol(“cor") val giniSel = new GiniSelector().setInputCol("features").setOutputCol(“gini") // VectorMerger merges VectorColumns and removes duplicates. Requires vector columns with names! val merger = new VectorMerger().setInputCols(Array(“cor", “gini")).setOutputCol(“selected") // Put everything in a pipeline and fit together val plModel = new Pipeline().setStages(Array(corSel, giniSel, merger)).fit(df) HOW TO USE SPARK-FEATURESELECTION. features Label [0,1,0,1] 1.0 [0,0,0,0] 0.0 [1,1,0,0] 0.0 [1,0,0,0] 1.0 Feature selectors. Offer different selection methods. df Feature F1 F2 F3 F4 Score 1 0.9 0.7 0.0 0.5 Score 2 0.6 0.8 0.0 0.4 Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 14 fit
  57. 57. HOW TO USE SPARK-FEATURESELECTION.

import org.apache.spark.ml.feature.selection.filter._
import org.apache.spark.ml.feature.selection.util.VectorMerger
import org.apache.spark.ml.Pipeline

// Load data
val df = spark.read.parquet("path/to/data/train.parquet")

// Feature selectors. Offer different selection methods.
val corSel = new CorrelationSelector().setInputCol("features").setOutputCol("cor")
val giniSel = new GiniSelector().setInputCol("features").setOutputCol("gini")

// VectorMerger merges vector columns and removes duplicates.
// Requires vector columns with names!
val merger = new VectorMerger().setInputCols(Array("cor", "gini")).setOutputCol("selected")

// Put everything in a pipeline and fit together
val plModel = new Pipeline().setStages(Array(corSel, giniSel, merger)).fit(df)
val dfT = plModel.transform(df).drop("features")

df (input):
features | Label
[0,1,0,1] | 1.0
[0,0,0,0] | 0.0
[1,1,0,0] | 0.0
[1,0,0,0] | 1.0

Feature scores after fit:
Feature | F1 | F2 | F3 | F4
Score 1 | 0.9 | 0.7 | 0.0 | 0.5
Score 2 | 0.6 | 0.8 | 0.0 | 0.4

dfT (after transform):
selected | Label
[0,1] | 1.0
[0,0] | 0.0
[1,1] | 0.0
[1,0] | 1.0

Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 14
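The merge step above can be sketched in plain Python. This is only an illustration of the idea, not the package's implementation: `merge_top_features` and the fixed top-2 cutoff are assumptions made for the example; the scores are the ones from the slide's table.

```python
def merge_top_features(rankings, k):
    # Union of the top-k feature names from each ranking,
    # duplicates removed, first-seen order preserved
    # (roughly what VectorMerger does with the selectors' outputs).
    selected = []
    for scores in rankings:
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        for name in top:
            if name not in selected:
                selected.append(name)
    return selected

# Scores from the slide's table (Score 1 and Score 2).
cor  = {"F1": 0.9, "F2": 0.7, "F3": 0.0, "F4": 0.5}
gini = {"F1": 0.6, "F2": 0.8, "F3": 0.0, "F4": 0.4}

print(merge_top_features([cor, gini], k=2))  # ['F1', 'F2']
```

Both rankings agree on F1 and F2 here, so the merged vector keeps two of the four columns, matching the `selected` column in the transformed DataFrame.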
  58. 58. SPARK-FEATURESELECTION PACKAGE.  Offers selection based on:  Gini coefficient  Correlation coefficient  Information gain  L1-logistic regression weights  Random forest importances  Utility stage:  VectorMerger  Three modes:  Percentile (default)  Fixed number of columns  Compare to random column [4]  Find it on GitHub: spark-FeatureSelection, or on Spark-packages. Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 15 [4]: Stoppiglia et al.: Ranking a Random Feature for Variable and Feature Selection
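The "compare to random column" mode follows the random-probe idea from [4]: append a feature of pure noise, score everything, and keep only features that beat the probe. A minimal plain-Python sketch of that idea, not the package's API (the function names, the |Pearson correlation| scoring, and the single probe column are assumptions for illustration):

```python
import random

def pearson(xs, ys):
    # Pearson correlation of two equal-length numeric sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def select_above_random_probe(features, label, seed=0):
    # Score each feature by |correlation| with the label and keep
    # only those that outscore a freshly generated random column.
    rng = random.Random(seed)
    probe = [rng.random() for _ in label]
    threshold = abs(pearson(probe, label))
    return [name for name, col in features.items()
            if abs(pearson(col, label)) > threshold]

label = [float(i % 2) for i in range(200)]
features = {
    "signal": label[:],    # perfectly informative copy of the label
    "const": [1.0] * 200,  # carries no information
}
print(select_above_random_probe(features, label))  # ['signal']
```

Any feature that cannot beat noise is unlikely to carry predictive information, which makes the probe a cheap, data-driven cutoff compared to a hand-picked percentile.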
  59. 59. PERFORMANCE.  Chart: Area under normalized PRC and ROC (FS vs. no FS, with 25 and 100 trees).  Chart: Time for FS methods and random forest (Multibucketizer, Gini, Correlation, Information gain, Chi², Random forest).  Chart: Correlation between feature importances from feature selection and random forest (Chi², Correlation, Gini, InfoGain). Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 16
  64. 64. LESSONS LEARNT.  Know what your data looks like and where it is located! Example:  Operations can succeed in local mode, but fail on a cluster.  Use .persist(StorageLevel.MEMORY_ONLY) when the data fits into memory. The default for .cache is MEMORY_AND_DISK.  Do not reinvent the wheel for common methods  Consider putting your stages into the spark.ml namespace.  Use the Spark Web UI to understand your Spark jobs. Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 17
  65. 65. QUESTIONS? Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Marc.Kaminski@bmw.de Bernhard.bb.Schlegel@bmw.de Page 18
  66. 66. Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 19 BACKUP.
  67. 67. DETERMINING WHERE YOUR PIPELINESTAGE SHOULD LIVE. Own namespace vs. org.apache.spark.ml.*

Own namespace. Pro: safer solution. Con: code duplication.
org.apache.spark.ml.* namespace. Pro: less code duplication (sharedParams, SchemaUtils, …); easier to implement persistence. Con: more dangerous when not cautious.

Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 20
  69. 69. FEATURE SELECTION.  Motivation:  Many sparse features → the feature space has to be reduced → select features that carry a lot of information for prediction.  Feature selection (unlike feature transformation) enables understanding of which features have a high impact on the model.

Before selection:
F1 | F2 | Noise | Label = F1 XOR F2
0 | 0 | 0 | 0
1 | 0 | 0 | 1
0 | 1 | 0 | 1
1 | 1 | 1 | 0

Feature selection (e.g. Correlation, Information gain, Random forest) yields feature importances:
Feature 1: 0.7, Feature 2: 0.7, Noise: 0.2

After selection:
F1 | F2 | Label = F1 XOR F2
0 | 0 | 0
1 | 0 | 1
0 | 1 | 1
1 | 1 | 0

Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 21
  70. 70. FEATURE SELECTION.

Filter. Description: evaluate intrinsic data properties. Advantages: fast; scalable. Disadvantages: ignore inter-feature dependencies; ignore interaction with the classifier. Examples: chi-squared, information gain, correlation.

Wrapper. Description: evaluate model performance of a feature subset. Advantages: considers feature dependencies; simple. Disadvantages: classifier-dependent selection; computationally expensive; risk of overfitting. Examples: genetic algorithms, search algorithms.

Embedded. Description: feature selection is embedded in classifier training. Advantages: considers feature dependencies. Disadvantages: classifier-dependent selection. Examples: L1-logistic regression, random forest.

Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 22
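The filter disadvantage "ignores inter-feature dependencies" is easy to demonstrate on the XOR data from the previous slide. A plain-Python sketch of one filter score, information gain (this is an illustration of the metric, not the package's implementation):

```python
import math

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(feature, label):
    # Information gain of a single feature with respect to the label:
    # H(label) minus the weighted entropy of each feature-value partition.
    n = len(label)
    gain = entropy(label)
    for v in set(feature):
        subset = [y for x, y in zip(feature, label) if x == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# XOR data: F1 and F2 jointly determine the label, but a filter
# scoring them one at a time sees no information in either.
f1, f2 = [0, 1, 0, 1], [0, 0, 1, 1]
noise = [0, 0, 0, 1]
label = [0, 1, 1, 0]  # F1 XOR F2

print(info_gain(f1, label))               # 0.0
print(round(info_gain(noise, label), 3))  # 0.311
```

On this dataset the filter would rank the noise column above both genuinely informative features, which is exactly why embedded methods such as random forests, which see feature interactions during training, are included in the package alongside the fast filters.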
  71. 71. CHALLENGES.  Big plans for DataFrames when performing many operations on many columns  it can take a long time to build and optimize the DAG.  Column limit for DataFrames introduced by several Jiras, especially SPARK-18016  hopefully fixed in Spark 2.3.0.  Spark PipelineStages are not consistent in how they handle DataFrame schemas  sometimes no schema is appended. Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 23
