SlideShare a Scribd company logo
R, Scikit-Learn and Apache Spark ML -
What difference does it make?
Villu Ruusmann
Openscoring OÜ
Overview
● Identifying long-standing, high-value opportunities in the
applied predictive analytics domain
● Thinking about problems in API terms
● Providing solutions in API terms
● Developing and applying custom tools
+ A couple of tips if you're looking to buy or sell a VW Golf
The trade-off
"More data beats better algorithms"
The state of the art
Scaling out horizontally
Elements of reproducibility
Standardized, human- and machine-readable descriptions:
● Dataset
● Data pre- and post-processing steps:
○ From real-life input table (SQL, CSV) to model
○ From model to real-life output table
● Model
● Statistics
Calling R from within Apache Spark
1. Create and initialize R runtime
2. Format and upload input RDD; upload and execute R
model; download output and parse into result RDD
3. Destroy R runtime
Calling Scikit-Learn from within Apache Spark
1. Format input RDD (eg. using Java NIO) as numpy.array
2. Invoke Scikit-Learn via Python/C API
3. Parse output numpy.array into result RDD
API prioritization
Training << Maintenance ~ Deployment
One-time activity << Repeated activities
Short-term << Long-term
JPMML - Java PMML API
● Conversion API
● Maintenance API
● Execution API
○ Interpreted mode
○ Translated + compiled ("Transpiled") mode
● Serving API
○ Integrations with popular Big Data frameworks
○ REST web service
Calling JPMML-Spark from within Apache Spark
org.jpmml.spark.TransformerBuilder pmmlTransformerBuilder = ..;
org.apache.spark.ml.Transformer pmmlTransformer = pmmlTransformerBuilder.build();
org.apache.spark.sql.Dataset<Row> input = ..;
org.apache.spark.sql.DataSet<Row> result = pmmlTransformer.transform(input);
The case study
Predicting the price of VW Golf cars using GBT algorithms:
● 71 columns:
○ A continuous label: log(price)
○ Two string and four numeric categorical features
○ 64 binary-like (0/1) and numeric continuous features
● 270'458 rows:
○ 153'978 complete cases
○ 116'480 incomplete (ie. with missing values) cases
Gradient-Boosted Trees (GBTs)
R training and conversion API
#library("caret")
library("gbm")
library("r2pmml")
cars = read.csv("cars.tsv", sep = "t", na.strings = "N/A")
factor_cols = c("category", "colour", "ac", "fuel_type", "gearbox", "interior_color", "interior_type")
for(factor_col in factor_cols){
cars[, factor_col] = as.factor(cars[, factor_col])
}
# Doesn't work with factors with missing values
#cars.gbm = train(price ~ ., data = cars, method = "gbm", na.action = na.pass, ..)
cars.gbm = gbm(price ~ ., data = cars, n.trees = 100, shrinkage = 0.1, interaction.depth = 6)
r2pmml(cars.gbm, "gbm.pmml")
Scikit-Learn training and conversion API
from sklearn_pandas import DataFrameMapper
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import sklearn2pmml, PMMLPipeline
cars = pandas.read_csv("cars.tsv", sep = "t", na_values = ["N/A", "NA"])
mapper = DataFrameMapper(..)
regressor = ..
tuner = GridSearchCV(regressor, param_grid = .., fit_params = ..)
tuner.fit(mapper.fit_transform(cars), cars["price"])
pipeline = PMMLPipeline([
("mapper", mapper),
("regressor", tuner.best_estimator_)
])
sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)
Dataset
R LightGBM XGBoost
Scikit-
Learn
Apache
Spark ML
Abstraction data.frame lgb.Dataset xgb.DMatrix numpy.array RDD<Vector>
Memory
layout
Contiguous,
dense
Contiguous,
dense(?)
Contiguous,
dense/sparse
Contiguous,
dense/sparse
Distributed,
dense/sparse
Data type Any double float float or
double
double
Categorical
values
As-is (factor) Encoded Binarized Binarized Binarized
Missing
values
Yes Pseudo (NaN) Pseudo (NaN) No No
LightGBM via Scikit-Learn
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelEncoder
from lightgbm import LGBMRegressor
mapper = DataFrameMapper(
[(factor_column, PMMLLabelEncoder()) for factor_column in factor_columns] +
[(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)
regressor = LGBMRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6, num_leaves = 64)
regressor.fit(transformed_cars, cars["price"],
categorical_feature = list(range(0, len(factor_columns))))
XGBoost via Scikit-Learn
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelBinarizer
from xgboost.sklearn import XGBRegressor
mapper = DataFrameMapper(
[(factor_column, PMMLLabelBinarizer()) for factor_column in factor_columns] +
[(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)
regressor = XGBRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6)
regressor.fit(transformed_cars, cars["price"])
GBT algorithm (training)
R LightGBM XGBoost
Scikit-
Learn
Apache
Spark ML
Abstraction gbm LGBMRegressor XGBRegressor GradientBoost
ingRegressor
GBTRegressor
Parameterizab
ility
Medium High High Medium Medium
Split type Multi-way Binary Binary Binary Binary
Categorical
values
"set contains" "equals" Pseudo
("equals")
Pseudo
("equals")
"equals"
Missing
values
First-class Pseudo Pseudo No No
gbm-style splits
<Node id="9">
<SimplePredicate field="interior_type" operator="isMissing"/>
<Node id="12" score="3.0702062395803734E-4">
<SimplePredicate field="colour" operator="isMissing"/>
</Node>
<Node id="10" score="-0.018950416258408962">
<SimpleSetPredicate field="colour" booleanOperator="isIn">
<Array type="string">Grün Rot Violett Weiß</Array>
</SimpleSetPredicate>
</Node>
<Node id="11" score="-0.0017446280908351925">
<SimpleSetPredicate field="colour" booleanOperator="isIn">
<Array type="string">Beige Blau Braun Gelb Gold Grau Orange Schwarz Silber</Array>
</SimpleSetPredicate>
</Node>
</Node>
LightGBM- and XGBoost-style splits (1/3)
<Node id="39" defaultChild="76">
<SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
<Node id="76" score="0.0030283758">
<SimplePredicate field="colour" operator="notEqual" value="Orange"/>
</Node>
<Node id="77" score="0.02483887">
<SimplePredicate field="colour" operator="equal" value="Orange"/>
</Node>
</Node>
LightGBM- and XGBoost-style splits (2/3)
<Node id="39">
<SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
<!-- if(colour == null || !"Orange".equals(colour)) return 0.0030283758 -->
<Node id="76" score="0.0030283758">
<CompoundPredicate booleanOperator="or">
<SimplePredicate field="colour" operator="isMissing"/>
<SimplePredicate field="colour" operator="notEqual" value="Orange"/>
</CompoundPredicate>
</Node>
<!-- else if("Orange".equals(colour)) return 0.02483887 -->
<Node id="77" score="0.02483887">
<SimplePredicate field="colour" operator="equal" value="Orange"/>
</Node>
<!-- else return null -->
</Node>
LightGBM- and XGBoost-style splits (2/3)
<Node id="39">
<SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
<!-- if(colour != null && "Orange".equals(colour)) return 0.02483887 -->
<Node id="77" score="0.02483887">
<CompoundPredicate booleanOperator="and">
<SimplePredicate field="colour" operator="isNotMissing"/>
<SimplePredicate field="colour" operator="equal" value="Orange"/>
</CompoundPredicate>
</Node>
<!-- else return 0.0030283758 -->
<Node id="76" score="0.0030283758">
<True/>
</Node>
</Node>
Model measurement using JPMML
org.dmg.pmml.tree.TreeModel treeModel = ..;
treeModel.accept(new org.jpmml.model.visitors.AbstractVisitor(){
private int count = 0; // Number of Node elements
private int maxDepth = 0; // Max "nesting depth" of Node elements
@Override
public VisitorAction visit(org.dmg.pmml.tree.Node node){
this.count++;
int depth = 0;
for(org.dmg.pmml.PMMLObject parent : getParents()){
if(!(parent instanceof org.dmg.pmml.tree.Node)) break;
depth++;
}
this.maxDepth = Math.max(this.maxDepth, depth);
return super.visit(node);
}
});
GBT algorithm (interpretation)
R LightGBM XGBoost
Scikit-
Learn
Apache
Spark ML
Feature
importances
Direct Direct Transformed Transformed Transformed
Decision path No No(?) No(?) Transformed Transformed
Model
persistence
RDS (binary) Proprietary
(text)
Proprietary
(binary, text)
Pickle (binary) SER (binary) or
JSON (text)
Model
reusability
Good Fair(?) Good Fair Fair
Java API No No Pseudo No Yes
LightGBM feature importances
Age 936
Mileage 887
Performance 738
[Category] 205
New? 179
[Type of fuel] 170
[Type of interior] 167
Airbags? 130
[Colour] 129
[Type of gearbox] 105
Model execution using JPMML
org.dmg.pmml.PMML pmml;
try(InputStream is = ..){
pmml = org.jpmml.model.PMMLUtil.unmarshal(is);
}
org.jpmml.evaluator.Evaluator evaluator =
new org.jpmml.evaluator.mining.MiningModelEvaluator(pmml);
org.jpmml.evaluator.InputField inputField = selectField(evaluator.getInputFields(), ..);
org.jpmml.evaluator.TargetField targetField = selectField(evaluator.getTargetFields(), ..);
for(int value = min; value <= max; value += increment){
Map<FieldName, FieldValue> arguments =
Collections.singletonMap(inputField.getName(), inputField.prepare(value));
Map<FieldName, ?> result = evaluator.evaluate(arguments);
System.out.println(result.get(targetField.getName()));
}
Lessons (to be-) learned
● Limits and limitations of individual APIs
● Vertical integration vs. horizontal integration:
○ All capabilities on a single platform
○ Specialized capabilities on specialized platforms
● Ease-of-use and robustness beat raw performance in
most application scenarios
● "Conventions over configuration"
Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml

More Related Content

What's hot

Spark architecture
Spark architectureSpark architecture
Spark architecture
GauravBiswas9
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Computer vision, machine, and deep learning
Computer vision, machine, and deep learningComputer vision, machine, and deep learning
Computer vision, machine, and deep learning
Igi Ardiyanto
 
Recommendation system
Recommendation system Recommendation system
Recommendation system
Vikrant Arya
 
Movie recommendation project
Movie recommendation projectMovie recommendation project
Movie recommendation project
Abhishek Jaisingh
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
Databricks
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
Databricks
 
Apache flink
Apache flinkApache flink
Apache flink
Ahmed Nader
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
datamantra
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
Shrey Malik
 
Apache Spark Data Validation
Apache Spark Data ValidationApache Spark Data Validation
Apache Spark Data Validation
Databricks
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
Databricks
 
Azure Synapse Analytics
Azure Synapse AnalyticsAzure Synapse Analytics
Azure Synapse Analytics
WinWire Technologies Inc
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
Databricks
 
Accelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflowAccelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflow
Databricks
 
Recommendation System Explained
Recommendation System ExplainedRecommendation System Explained
Recommendation System Explained
Crossing Minds
 
Monitoring Models in Production
Monitoring Models in ProductionMonitoring Models in Production
Monitoring Models in Production
Jannes Klaas
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with Databricks
Databricks
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
Joseph Arriola
 

What's hot (20)

Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Computer vision, machine, and deep learning
Computer vision, machine, and deep learningComputer vision, machine, and deep learning
Computer vision, machine, and deep learning
 
Recommendation system
Recommendation system Recommendation system
Recommendation system
 
Movie recommendation project
Movie recommendation projectMovie recommendation project
Movie recommendation project
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
 
Apache flink
Apache flinkApache flink
Apache flink
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Apache Spark Data Validation
Apache Spark Data ValidationApache Spark Data Validation
Apache Spark Data Validation
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
 
Azure Synapse Analytics
Azure Synapse AnalyticsAzure Synapse Analytics
Azure Synapse Analytics
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
 
Accelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflowAccelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflow
 
Recommendation System Explained
Recommendation System ExplainedRecommendation System Explained
Recommendation System Explained
 
Monitoring Models in Production
Monitoring Models in ProductionMonitoring Models in Production
Monitoring Models in Production
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with Databricks
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
 

Viewers also liked

Representing TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLRepresenting TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMML
Villu Ruusmann
 
Yace 3.0
Yace 3.0Yace 3.0
Yace 3.0
Atul Ashar
 
Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms
Timothy Spann
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
Taras Matyashovsky
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) models
Villu Ruusmann
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
Renato Javier Marroquín Mogrovejo
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317
Nan Zhu
 
Weld Strata talk
Weld Strata talkWeld Strata talk
Weld Strata talk
Deepak Narayanan
 
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
Instaclustr
 
Production Grade Data Science for Hadoop
Production Grade Data Science for HadoopProduction Grade Data Science for Hadoop
Production Grade Data Science for Hadoop
DataWorks Summit/Hadoop Summit
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Anyscale
 
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
MLconf
 
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Chris Fregly
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
Mohit Jain
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
Databricks
 
Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3
Alluxio, Inc.
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Databricks
 
Product Update: EDB Postgres Platform 2017
Product Update: EDB Postgres Platform 2017Product Update: EDB Postgres Platform 2017
Product Update: EDB Postgres Platform 2017
EDB
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
DataWorks Summit/Hadoop Summit
 

Viewers also liked (20)

Representing TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLRepresenting TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMML
 
Yace 3.0
Yace 3.0Yace 3.0
Yace 3.0
 
Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) models
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317
 
Weld Strata talk
Weld Strata talkWeld Strata talk
Weld Strata talk
 
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
 
Production Grade Data Science for Hadoop
Production Grade Data Science for HadoopProduction Grade Data Science for Hadoop
Production Grade Data Science for Hadoop
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
 
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
Product Update: EDB Postgres Platform 2017
Product Update: EDB Postgres Platform 2017Product Update: EDB Postgres Platform 2017
Product Update: EDB Postgres Platform 2017
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
 

Similar to R, Scikit-Learn and Apache Spark ML - What difference does it make?

TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
Chetan Khatri
 
Semantic search in databases
Semantic search in databasesSemantic search in databases
Semantic search in databasesTomáš Drenčák
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Chetan Khatri
 
cbse 12 computer science investigatory project
cbse 12 computer science investigatory project  cbse 12 computer science investigatory project
cbse 12 computer science investigatory project
D. j Vicky
 
cbse 12 computer science investigatory project
cbse 12 computer science investigatory project  cbse 12 computer science investigatory project
cbse 12 computer science investigatory project
D. j Vicky
 
Use Angular Schematics to Simplify Your Life - Develop Denver 2019
Use Angular Schematics to Simplify Your Life - Develop Denver 2019Use Angular Schematics to Simplify Your Life - Develop Denver 2019
Use Angular Schematics to Simplify Your Life - Develop Denver 2019
Matt Raible
 
cbse 12 computer science IP
cbse 12 computer science IPcbse 12 computer science IP
cbse 12 computer science IP
D. j Vicky
 
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
Matt Raible
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
Modern Data Stack France
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
Karthik Murugesan
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
Faisal Siddiqi
 
Adaptive Query Optimization
Adaptive Query OptimizationAdaptive Query Optimization
Adaptive Query Optimization
Anju Garg
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
Knoldus Inc.
 
Computer graphics practical(jainam)
Computer graphics practical(jainam)Computer graphics practical(jainam)
Computer graphics practical(jainam)
JAINAM KAPADIYA
 
VSSML18. Introduction to WhizzML
VSSML18. Introduction to WhizzMLVSSML18. Introduction to WhizzML
VSSML18. Introduction to WhizzML
BigML, Inc
 
Deep Dive Into Swift
Deep Dive Into SwiftDeep Dive Into Swift
Deep Dive Into Swift
Sarath C
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
François Garillot
 
R console
R consoleR console
R console
Ananth Raj
 
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
Oleksandr Tarasenko
 

Similar to R, Scikit-Learn and Apache Spark ML - What difference does it make? (20)

TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
Semantic search in databases
Semantic search in databasesSemantic search in databases
Semantic search in databases
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
cbse 12 computer science investigatory project
cbse 12 computer science investigatory project  cbse 12 computer science investigatory project
cbse 12 computer science investigatory project
 
cbse 12 computer science investigatory project
cbse 12 computer science investigatory project  cbse 12 computer science investigatory project
cbse 12 computer science investigatory project
 
Use Angular Schematics to Simplify Your Life - Develop Denver 2019
Use Angular Schematics to Simplify Your Life - Develop Denver 2019Use Angular Schematics to Simplify Your Life - Develop Denver 2019
Use Angular Schematics to Simplify Your Life - Develop Denver 2019
 
cbse 12 computer science IP
cbse 12 computer science IPcbse 12 computer science IP
cbse 12 computer science IP
 
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
 
Adaptive Query Optimization
Adaptive Query OptimizationAdaptive Query Optimization
Adaptive Query Optimization
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Computer graphics practical(jainam)
Computer graphics practical(jainam)Computer graphics practical(jainam)
Computer graphics practical(jainam)
 
VSSML18. Introduction to WhizzML
VSSML18. Introduction to WhizzMLVSSML18. Introduction to WhizzML
VSSML18. Introduction to WhizzML
 
Deep Dive Into Swift
Deep Dive Into SwiftDeep Dive Into Swift
Deep Dive Into Swift
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
R console
R consoleR console
R console
 
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
 

Recently uploaded

【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 

Recently uploaded (20)

【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 

R, Scikit-Learn and Apache Spark ML - What difference does it make?

  • 1. R, Scikit-Learn and Apache Spark ML - What difference does it make? Villu Ruusmann Openscoring OÜ
  • 2. Overview ● Identifying long-standing, high-value opportunities in the applied predictive analytics domain ● Thinking about problems in API terms ● Providing solutions in API terms ● Developing and applying custom tools + A couple of tips if you're looking to buy or sell a VW Golf
  • 4. "More data beats better algorithms"
  • 5. The state of the art
  • 7. Elements of reproducibility Standardized, human- and machine-readable descriptions: ● Dataset ● Data pre- and post-processing steps: ○ From real-life input table (SQL, CSV) to model ○ From model to real-life output table ● Model ● Statistics
  • 8. Calling R from within Apache Spark 1. Create and initialize R runtime 2. Format and upload input RDD; upload and execute R model; download output and parse into result RDD 3. Destroy R runtime
  • 9. Calling Scikit-Learn from within Apache Spark 1. Format input RDD (eg. using Java NIO) as numpy.array 2. Invoke Scikit-Learn via Python/C API 3. Parse output numpy.array into result RDD
  • 10. API prioritization Training << Maintenance ~ Deployment One-time activity << Repeated activities Short-term << Long-term
  • 11. JPMML - Java PMML API ● Conversion API ● Maintenance API ● Execution API ○ Interpreted mode ○ Translated + compiled ("Transpiled") mode ● Serving API ○ Integrations with popular Big Data frameworks ○ REST web service
  • 12. Calling JPMML-Spark from within Apache Spark org.jpmml.spark.TransformerBuilder pmmlTransformerBuilder = ..; org.apache.spark.ml.Transformer pmmlTransformer = pmmlTransformerBuilder.build(); org.apache.spark.sql.Dataset<Row> input = ..; org.apache.spark.sql.DataSet<Row> result = pmmlTransformer.transform(input);
  • 13. The case study Predicting the price of VW Golf cars using GBT algorithms: ● 71 columns: ○ A continuous label: log(price) ○ Two string and four numeric categorical features ○ 64 binary-like (0/1) and numeric continuous features ● 270'458 rows: ○ 153'978 complete cases ○ 116'480 incomplete (ie. with missing values) cases
  • 15. R training and conversion API #library("caret") library("gbm") library("r2pmml") cars = read.csv("cars.tsv", sep = "t", na.strings = "N/A") factor_cols = c("category", "colour", "ac", "fuel_type", "gearbox", "interior_color", "interior_type") for(factor_col in factor_cols){ cars[, factor_col] = as.factor(cars[, factor_col]) } # Doesn't work with factors with missing values #cars.gbm = train(price ~ ., data = cars, method = "gbm", na.action = na.pass, ..) cars.gbm = gbm(price ~ ., data = cars, n.trees = 100, shrinkage = 0.1, interaction.depth = 6) r2pmml(cars.gbm, "gbm.pmml")
  • 16. Scikit-Learn training and conversion API from sklearn_pandas import DataFrameMapper from sklearn.model_selection import GridSearchCV from sklearn2pmml import sklearn2pmml, PMMLPipeline cars = pandas.read_csv("cars.tsv", sep = "t", na_values = ["N/A", "NA"]) mapper = DataFrameMapper(..) regressor = .. tuner = GridSearchCV(regressor, param_grid = .., fit_params = ..) tuner.fit(mapper.fit_transform(cars), cars["price"]) pipeline = PMMLPipeline([ ("mapper", mapper), ("regressor", tuner.best_estimator_) ]) sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)
  • 17. Dataset R LightGBM XGBoost Scikit- Learn Apache Spark ML Abstraction data.frame lgb.Dataset xgb.DMatrix numpy.array RDD<Vector> Memory layout Contiguous, dense Contiguous, dense(?) Contiguous, dense/sparse Contiguous, dense/sparse Distributed, dense/sparse Data type Any double float float or double double Categorical values As-is (factor) Encoded Binarized Binarized Binarized Missing values Yes Pseudo (NaN) Pseudo (NaN) No No
  • 18. LightGBM via Scikit-Learn from sklearn_pandas import DataFrameMapper from sklearn2pmml.preprocessing import PMMLLabelEncoder from lightgbm import LGBMRegressor mapper = DataFrameMapper( [(factor_column, PMMLLabelEncoder()) for factor_column in factor_columns] + [(continuous_columns, None)] ) transformed_cars = mapper.fit_transform(cars) regressor = LGBMRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6, num_leaves = 64) regressor.fit(transformed_cars, cars["price"], categorical_feature = list(range(0, len(factor_columns))))
  • 19. XGBoost via Scikit-Learn from sklearn_pandas import DataFrameMapper from sklearn2pmml.preprocessing import PMMLLabelBinarizer from xgboost.sklearn import XGBRegressor mapper = DataFrameMapper( [(factor_column, PMMLLabelBinarizer()) for factor_column in factor_columns] + [(continuous_columns, None)] ) transformed_cars = mapper.fit_transform(cars) regressor = XGBRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6) regressor.fit(transformed_cars, cars["price"])
  • 20. GBT algorithm (training) R LightGBM XGBoost Scikit- Learn Apache Spark ML Abstraction gbm LGBMRegressor XGBRegressor GradientBoost ingRegressor GBTRegressor Parameterizab ility Medium High High Medium Medium Split type Multi-way Binary Binary Binary Binary Categorical values "set contains" "equals" Pseudo ("equals") Pseudo ("equals") "equals" Missing values First-class Pseudo Pseudo No No
  • 21. gbm-style splits <Node id="9"> <SimplePredicate field="interior_type" operator="isMissing"/> <Node id="12" score="3.0702062395803734E-4"> <SimplePredicate field="colour" operator="isMissing"/> </Node> <Node id="10" score="-0.018950416258408962"> <SimpleSetPredicate field="colour" booleanOperator="isIn"> <Array type="string">Grün Rot Violett Weiß</Array> </SimpleSetPredicate> </Node> <Node id="11" score="-0.0017446280908351925"> <SimpleSetPredicate field="colour" booleanOperator="isIn"> <Array type="string">Beige Blau Braun Gelb Gold Grau Orange Schwarz Silber</Array> </SimpleSetPredicate> </Node> </Node>
  • 22. LightGBM- and XGBoost-style splits (1/3) <Node id="39" defaultChild="76"> <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/> <Node id="76" score="0.0030283758"> <SimplePredicate field="colour" operator="notEqual" value="Orange"/> </Node> <Node id="77" score="0.02483887"> <SimplePredicate field="colour" operator="equal" value="Orange"/> </Node> </Node>
  • 23. LightGBM- and XGBoost-style splits (2/3) <Node id="39"> <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/> <!-- if(colour == null || !"Orange".equals(colour)) return 0.0030283758 --> <Node id="76" score="0.0030283758"> <CompoundPredicate booleanOperator="or"> <SimplePredicate field="colour" operator="isMissing"/> <SimplePredicate field="colour" operator="notEqual" value="Orange"/> </CompoundPredicate> </Node> <!-- else if("Orange".equals(colour)) return 0.02483887 --> <Node id="77" score="0.02483887"> <SimplePredicate field="colour" operator="equal" value="Orange"/> </Node> <!-- else return null --> </Node>
  • 24. LightGBM- and XGBoost-style splits (2/3) <Node id="39"> <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/> <!-- if(colour != null && "Orange".equals(colour)) return 0.02483887 --> <Node id="77" score="0.02483887"> <CompoundPredicate booleanOperator="and"> <SimplePredicate field="colour" operator="isNotMissing"/> <SimplePredicate field="colour" operator="equal" value="Orange"/> </CompoundPredicate> </Node> <!-- else return 0.0030283758 --> <Node id="76" score="0.0030283758"> <True/> </Node> </Node>
  • 25. Model measurement using JPMML org.dmg.pmml.tree.TreeModel treeModel = ..; treeModel.accept(new org.jpmml.model.visitors.AbstractVisitor(){ private int count = 0; // Number of Node elements private int maxDepth = 0; // Max "nesting depth" of Node elements @Override public VisitorAction visit(org.dmg.pmml.tree.Node node){ this.count++; int depth = 0; for(org.dmg.pmml.PMMLObject parent : getParents()){ if(!(parent instanceof org.dmg.pmml.tree.Node)) break; depth++; } this.maxDepth = Math.max(this.maxDepth, depth); return super.visit(node); } });
  • 26.
  • 27.
  • 28.
  • 29. GBT algorithm (interpretation) R LightGBM XGBoost Scikit- Learn Apache Spark ML Feature importances Direct Direct Transformed Transformed Transformed Decision path No No(?) No(?) Transformed Transformed Model persistence RDS (binary) Proprietary (text) Proprietary (binary, text) Pickle (binary) SER (binary) or JSON (text) Model reusability Good Fair(?) Good Fair Fair Java API No No Pseudo No Yes
  • 30. LightGBM feature importances Age 936 Mileage 887 Performance 738 [Category] 205 New? 179 [Type of fuel] 170 [Type of interior] 167 Airbags? 130 [Colour] 129 [Type of gearbox] 105
  • 31. Model execution using JPMML org.dmg.pmml.PMML pmml; try(InputStream is = ..){ pmml = org.jpmml.model.PMMLUtil.unmarshal(is); } org.jpmml.evaluator.Evaluator evaluator = new org.jpmml.evaluator.mining.MiningModelEvaluator(pmml); org.jpmml.evaluator.InputField inputField = selectField(evaluator.getInputFields(), ..); org.jpmml.evaluator.TargetField targetField = selectField(evaluator.getTargetFields(), ..); for(int value = min; value <= max; value += increment){ Map<FieldName, FieldValue> arguments = Collections.singletonMap(inputField.getName(), inputField.prepare(value)); Map<FieldName, ?> result = evaluator.evaluate(arguments); System.out.println(result.get(targetField.getName())); }
  • 32.
  • 33.
  • 34. Lessons (to be-) learned ● Limits and limitations of individual APIs ● Vertical integration vs. horizontal integration: ○ All capabilities on a single platform ○ Specialized capabilities on specialized platforms ● Ease-of-use and robustness beat raw performance in most application scenarios ● "Conventions over configuration"