R, Scikit-Learn and Apache Spark ML -
What difference does it make?
Villu Ruusmann
Openscoring OÜ
Overview
● Identifying long-standing, high-value opportunities in the
applied predictive analytics domain
● Thinking about problems in API terms
● Providing solutions in API terms
● Developing and applying custom tools
+ A couple of tips if you're looking to buy or sell a VW Golf
The trade-off
"More data beats better algorithms"
The state of the art
Scaling out horizontally
Elements of reproducibility
Standardized, human- and machine-readable descriptions:
● Dataset
● Data pre- and post-processing steps:
○ From real-life input table (SQL, CSV) to model
○ From model to real-life output table
● Model
● Statistics
Calling R from within Apache Spark
1. Create and initialize R runtime
2. Format and upload input RDD; upload and execute R
model; download output and parse into result RDD
3. Destroy R runtime
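The three steps above can be sketched as a create/use/destroy lifecycle. This is a minimal sketch only: `FakeRuntime`, `score_partition` and `create_runtime` are hypothetical names, the "model" merely sums the input fields, and scoping the runtime to one batch of rows is an assumption rather than something the slide specifies.

```python
class FakeRuntime:
    """Hypothetical stand-in for an external R runtime (not a real API)."""

    def __init__(self):
        self.closed = False

    def execute(self, lines):
        # Pretend "model": sum the comma-separated fields of each input line.
        return [str(sum(float(field) for field in line.split(","))) for line in lines]

    def close(self):
        self.closed = True


def score_partition(rows, create_runtime):
    runtime = create_runtime()  # 1. create and initialize the runtime
    try:
        # 2a. format the input rows as text lines
        lines = [",".join(str(value) for value in row) for row in rows]
        # 2b. upload the input, execute the model, download the output
        output = runtime.execute(lines)
        # 2c. parse the output into result values
        return [float(line) for line in output]
    finally:
        runtime.close()  # 3. destroy the runtime, even if scoring failed


runtimes = []

def create_runtime():
    runtime = FakeRuntime()
    runtimes.append(runtime)
    return runtime

predictions = score_partition([(1, 2), (3, 4.5)], create_runtime)
```

The `try`/`finally` block is the point of the sketch: the runtime is an external resource, so step 3 has to run whether or not step 2 succeeds.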
Calling Scikit-Learn from within Apache Spark
1. Format input RDD (e.g. using Java NIO) as numpy.array
2. Invoke Scikit-Learn via Python/C API
3. Parse output numpy.array into result RDD
API prioritization
Training << Maintenance ~ Deployment
One-time activity << Repeated activities
Short-term << Long-term
JPMML - Java PMML API
● Conversion API
● Maintenance API
● Execution API
○ Interpreted mode
○ Translated + compiled ("Transpiled") mode
● Serving API
○ Integrations with popular Big Data frameworks
○ REST web service
Calling JPMML-Spark from within Apache Spark
org.jpmml.spark.TransformerBuilder pmmlTransformerBuilder = ..;
org.apache.spark.ml.Transformer pmmlTransformer = pmmlTransformerBuilder.build();
org.apache.spark.sql.Dataset<Row> input = ..;
org.apache.spark.sql.Dataset<Row> result = pmmlTransformer.transform(input);
The case study
Predicting the price of VW Golf cars using GBT algorithms:
● 71 columns:
○ A continuous label: log(price)
○ Two string and four numeric categorical features
○ 64 binary-like (0/1) and numeric continuous features
● 270'458 rows:
○ 153'978 complete cases
○ 116'480 incomplete (i.e. with missing values) cases
Gradient-Boosted Trees (GBTs)
R training and conversion API
#library("caret")
library("gbm")
library("r2pmml")
cars = read.csv("cars.tsv", sep = "\t", na.strings = "N/A")
factor_cols = c("category", "colour", "ac", "fuel_type", "gearbox", "interior_color", "interior_type")
for(factor_col in factor_cols){
cars[, factor_col] = as.factor(cars[, factor_col])
}
# Doesn't work with factors with missing values
#cars.gbm = train(price ~ ., data = cars, method = "gbm", na.action = na.pass, ..)
cars.gbm = gbm(price ~ ., data = cars, n.trees = 100, shrinkage = 0.1, interaction.depth = 6)
r2pmml(cars.gbm, "gbm.pmml")
Scikit-Learn training and conversion API
import pandas
from sklearn_pandas import DataFrameMapper
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import sklearn2pmml, PMMLPipeline
cars = pandas.read_csv("cars.tsv", sep = "\t", na_values = ["N/A", "NA"])
mapper = DataFrameMapper(..)
regressor = ..
tuner = GridSearchCV(regressor, param_grid = .., fit_params = ..)
tuner.fit(mapper.fit_transform(cars), cars["price"])
pipeline = PMMLPipeline([
("mapper", mapper),
("regressor", tuner.best_estimator_)
])
sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)
Dataset
                   | R                 | LightGBM             | XGBoost                  | Scikit-Learn             | Apache Spark ML
Abstraction        | data.frame        | lgb.Dataset          | xgb.DMatrix              | numpy.array              | RDD<Vector>
Memory layout      | Contiguous, dense | Contiguous, dense(?) | Contiguous, dense/sparse | Contiguous, dense/sparse | Distributed, dense/sparse
Data type          | Any               | double               | float                    | float or double          | double
Categorical values | As-is (factor)    | Encoded              | Binarized                | Binarized                | Binarized
Missing values     | Yes               | Pseudo (NaN)         | Pseudo (NaN)             | No                       | No
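The "Encoded" and "Binarized" entries of the Dataset comparison can be illustrated with a small standard-library sketch. The helpers `label_encode` and `binarize` are hypothetical and do not reproduce any library's implementation. Representing a missing value as `None` on input and as NaN (encoded) or an all-zero row (binarized) on output is an assumption made for illustration.

```python
import math

def label_encode(values):
    """Map each distinct category to one numeric code; missing (None) becomes NaN."""
    categories = sorted({value for value in values if value is not None})
    index = {category: code for code, category in enumerate(categories)}
    codes = [float(index[value]) if value is not None else math.nan for value in values]
    return categories, codes

def binarize(values):
    """Expand one categorical column into one 0/1 column per category.

    A missing value matches no category, so its row is all zeros.
    """
    categories = sorted({value for value in values if value is not None})
    rows = [[1.0 if value == category else 0.0 for category in categories] for value in values]
    return categories, rows

colours = ["Rot", "Blau", None, "Rot"]
categories, encoded = label_encode(colours)
_, binarized = binarize(colours)
```

Encoding keeps one column per feature, so a tree can still refer to the original feature. Binarization produces one column per category level, which is why the resulting splits behave like "equals" tests.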
LightGBM via Scikit-Learn
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelEncoder
from lightgbm import LGBMRegressor
mapper = DataFrameMapper(
[(factor_column, PMMLLabelEncoder()) for factor_column in factor_columns] +
[(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)
regressor = LGBMRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6, num_leaves = 64)
regressor.fit(transformed_cars, cars["price"],
categorical_feature = list(range(0, len(factor_columns))))
XGBoost via Scikit-Learn
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelBinarizer
from xgboost.sklearn import XGBRegressor
mapper = DataFrameMapper(
[(factor_column, PMMLLabelBinarizer()) for factor_column in factor_columns] +
[(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)
regressor = XGBRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6)
regressor.fit(transformed_cars, cars["price"])
GBT algorithm (training)
                   | R              | LightGBM      | XGBoost           | Scikit-Learn              | Apache Spark ML
Abstraction        | gbm            | LGBMRegressor | XGBRegressor      | GradientBoostingRegressor | GBTRegressor
Parameterizability | Medium         | High          | High              | Medium                    | Medium
Split type         | Multi-way      | Binary        | Binary            | Binary                    | Binary
Categorical values | "set contains" | "equals"      | Pseudo ("equals") | Pseudo ("equals")         | "equals"
Missing values     | First-class    | Pseudo        | Pseudo            | No                        | No
gbm-style splits
<Node id="9">
<SimplePredicate field="interior_type" operator="isMissing"/>
<Node id="12" score="3.0702062395803734E-4">
<SimplePredicate field="colour" operator="isMissing"/>
</Node>
<Node id="10" score="-0.018950416258408962">
<SimpleSetPredicate field="colour" booleanOperator="isIn">
<Array type="string">Grün Rot Violett Weiß</Array>
</SimpleSetPredicate>
</Node>
<Node id="11" score="-0.0017446280908351925">
<SimpleSetPredicate field="colour" booleanOperator="isIn">
<Array type="string">Beige Blau Braun Gelb Gold Grau Orange Schwarz Silber</Array>
</SimpleSetPredicate>
</Node>
</Node>
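Read as code, the three children of node 9 form a multi-way split on `colour`: one branch for a missing value and one per category set. A minimal sketch follows, using the scores and category sets from the fragment above. The function name `score_colour` is hypothetical, and returning `None` for a colour that matches no child is an assumption, since the fragment does not show how that case is handled.

```python
# Category sets copied from the two SimpleSetPredicate elements.
SET_NODE_10 = {"Grün", "Rot", "Violett", "Weiß"}
SET_NODE_11 = {"Beige", "Blau", "Braun", "Gelb", "Gold", "Grau", "Orange", "Schwarz", "Silber"}

def score_colour(colour):
    """Pick the first child of node 9 whose predicate matches `colour`."""
    if colour is None:                 # Node 12: operator="isMissing"
        return 3.0702062395803734e-4
    if colour in SET_NODE_10:          # Node 10: "set contains"
        return -0.018950416258408962
    if colour in SET_NODE_11:          # Node 11: "set contains"
        return -0.0017446280908351925
    return None                        # Assumption: no child matched

missing_score = score_colour(None)
green_score = score_colour("Grün")
orange_score = score_colour("Orange")
```

The missing value gets its own branch and its own score, which is what "first-class" missing value support means in the training comparison.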
LightGBM- and XGBoost-style splits (1/3)
<Node id="39" defaultChild="76">
<SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
<Node id="76" score="0.0030283758">
<SimplePredicate field="colour" operator="notEqual" value="Orange"/>
</Node>
<Node id="77" score="0.02483887">
<SimplePredicate field="colour" operator="equal" value="Orange"/>
</Node>
</Node>
LightGBM- and XGBoost-style splits (2/3)
<Node id="39">
<SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
<!-- if(colour == null || !"Orange".equals(colour)) return 0.0030283758 -->
<Node id="76" score="0.0030283758">
<CompoundPredicate booleanOperator="or">
<SimplePredicate field="colour" operator="isMissing"/>
<SimplePredicate field="colour" operator="notEqual" value="Orange"/>
</CompoundPredicate>
</Node>
<!-- else if("Orange".equals(colour)) return 0.02483887 -->
<Node id="77" score="0.02483887">
<SimplePredicate field="colour" operator="equal" value="Orange"/>
</Node>
<!-- else return null -->
</Node>
LightGBM- and XGBoost-style splits (3/3)
<Node id="39">
<SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
<!-- if(colour != null && "Orange".equals(colour)) return 0.02483887 -->
<Node id="77" score="0.02483887">
<CompoundPredicate booleanOperator="and">
<SimplePredicate field="colour" operator="isNotMissing"/>
<SimplePredicate field="colour" operator="equal" value="Orange"/>
</CompoundPredicate>
</Node>
<!-- else return 0.0030283758 -->
<Node id="76" score="0.0030283758">
<True/>
</Node>
</Node>
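The three encodings of node 39 are meant to score identically. The sketch below restates each one as a plain function and compares them on a missing, a matching and a non-matching colour. The function names are hypothetical, and treating `defaultChild` as "go to node 76 when the colour is missing" is a reading of the first fragment, not a full implementation of PMML missing value handling.

```python
LOW, HIGH = 0.0030283758, 0.02483887  # scores of Node 76 and Node 77

def variant_default_child(colour):
    """First encoding: defaultChild="76" catches the missing value."""
    if colour is None:
        return LOW
    if colour != "Orange":
        return LOW
    return HIGH

def variant_explicit_missing(colour):
    """Second encoding: the isMissing test is part of Node 76's predicate."""
    if colour is None or colour != "Orange":
        return LOW
    if colour == "Orange":
        return HIGH
    return None  # unreachable here; mirrors the "else return null" comment

def variant_true_fallback(colour):
    """Third encoding: Node 77 is tested first, Node 76 is a catch-all."""
    if colour is not None and colour == "Orange":
        return HIGH
    return LOW

samples = [None, "Orange", "Blau"]
results = [
    (variant_default_child(c), variant_explicit_missing(c), variant_true_fallback(c))
    for c in samples
]
```

The third form needs no special missing value handling at evaluation time, because the catch-all child absorbs every input that is not a present "Orange".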
Model measurement using JPMML
org.dmg.pmml.tree.TreeModel treeModel = ..;
treeModel.accept(new org.jpmml.model.visitors.AbstractVisitor(){
private int count = 0; // Number of Node elements
private int maxDepth = 0; // Max "nesting depth" of Node elements
@Override
public VisitorAction visit(org.dmg.pmml.tree.Node node){
this.count++;
int depth = 0;
for(org.dmg.pmml.PMMLObject parent : getParents()){
if(!(parent instanceof org.dmg.pmml.tree.Node)) break;
depth++;
}
this.maxDepth = Math.max(this.maxDepth, depth);
return super.visit(node);
}
});
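For a quick check outside the JVM, the same two measurements can be approximated with Python's standard library XML parser. This is a sketch under assumptions: it operates on a namespace-free fragment (a real PMML document declares an XML namespace, so its tag names would need adjusting), the `measure` helper is hypothetical, and depth is counted as the number of enclosing Node elements.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<Node id="39">
  <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
  <Node id="77" score="0.02483887">
    <CompoundPredicate booleanOperator="and">
      <SimplePredicate field="colour" operator="isNotMissing"/>
      <SimplePredicate field="colour" operator="equal" value="Orange"/>
    </CompoundPredicate>
  </Node>
  <Node id="76" score="0.0030283758">
    <True/>
  </Node>
</Node>
"""

def measure(element, depth=0):
    """Return (count, max_depth) of Node elements at or below `element`."""
    is_node = element.tag == "Node"
    count = 1 if is_node else 0
    max_depth = depth if is_node else 0
    for child in element:
        # Only Node elements add a nesting level.
        child_count, child_depth = measure(child, depth + 1 if is_node else depth)
        count += child_count
        max_depth = max(max_depth, child_depth)
    return count, max_depth

root = ET.fromstring(FRAGMENT.strip())
node_count, max_depth = measure(root)
```

On this fragment the root Node sits at depth 0 and its two child Nodes at depth 1.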
GBT algorithm (interpretation)
                    | R            | LightGBM           | XGBoost                    | Scikit-Learn    | Apache Spark ML
Feature importances | Direct       | Direct             | Transformed                | Transformed     | Transformed
Decision path       | No           | No(?)              | No(?)                      | Transformed     | Transformed
Model persistence   | RDS (binary) | Proprietary (text) | Proprietary (binary, text) | Pickle (binary) | SER (binary) or JSON (text)
Model reusability   | Good         | Fair(?)            | Good                       | Fair            | Fair
Java API            | No           | No                 | Pseudo                     | No              | Yes
LightGBM feature importances
Age 936
Mileage 887
Performance 738
[Category] 205
New? 179
[Type of fuel] 170
[Type of interior] 167
Airbags? 130
[Colour] 129
[Type of gearbox] 105
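The raw values are easier to compare as shares. The sketch below normalizes the ten values listed above. It assumes they are additive counts (for example, the number of splits per feature), which the slide does not state, and the shares are relative to these ten features only, not to all features in the model.

```python
importances = {
    "Age": 936,
    "Mileage": 887,
    "Performance": 738,
    "[Category]": 205,
    "New?": 179,
    "[Type of fuel]": 170,
    "[Type of interior]": 167,
    "Airbags?": 130,
    "[Colour]": 129,
    "[Type of gearbox]": 105,
}

total = sum(importances.values())
shares = {feature: value / total for feature, value in importances.items()}

# Combined share of the three continuous features at the top of the list.
top_three_share = shares["Age"] + shares["Mileage"] + shares["Performance"]

for feature, share in shares.items():
    print(f"{feature:20s} {share:6.1%}")
```

Under these assumptions, Age, Mileage and Performance together account for roughly 70% of the listed total.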
Model execution using JPMML
org.dmg.pmml.PMML pmml;
try(InputStream is = ..){
pmml = org.jpmml.model.PMMLUtil.unmarshal(is);
}
org.jpmml.evaluator.Evaluator evaluator =
new org.jpmml.evaluator.mining.MiningModelEvaluator(pmml);
org.jpmml.evaluator.InputField inputField = selectField(evaluator.getInputFields(), ..);
org.jpmml.evaluator.TargetField targetField = selectField(evaluator.getTargetFields(), ..);
for(int value = min; value <= max; value += increment){
Map<FieldName, FieldValue> arguments =
Collections.singletonMap(inputField.getName(), inputField.prepare(value));
Map<FieldName, ?> result = evaluator.evaluate(arguments);
System.out.println(result.get(targetField.getName()));
}
Lessons (to be) learned
● Limits and limitations of individual APIs
● Vertical integration vs. horizontal integration:
○ All capabilities on a single platform
○ Specialized capabilities on specialized platforms
● Ease-of-use and robustness beat raw performance in
most application scenarios
● "Conventions over configuration"
Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml
