SlideShare a Scribd company logo
Converting Scikit-Learn to PMML
Villu Ruusmann
Openscoring OÜ
"Train once, deploy anywhere"
Scikit-Learn challenge
pipeline = Pipeline([...])
pipeline.fit(Xtrain
, ytrain
)
ytest
= pipeline.predict(Xtest
)
● Serializing and deserializing the fitted pipeline:
○ (C)Pickle
○ Joblib
● Creating Xtest
that is functionally equivalent to Xtrain
:
○ ???
(J)PMML solution
Evaluator evaluator = ...;
List<InputField> argumentFields = evaluator.getInputFields();
List<ResultField> resultFields =
Lists.union(evaluator.getTargetFields(), evaluator.getOutputFields());
Map<FieldName, ?> arguments = readRecord(argumentFields);
Map<FieldName, ?> result = evaluator.evaluate(arguments);
writeRecord(result, resultFields);
● The fitted pipeline (data pre-and post-processing, model)
is represented using standardized PMML data structures
● Well-defined data entry and exit interfaces
Workflow
The PMML pipeline
A smart pipeline that collects supporting information about
"passed through" features and label(s):
● Name
● Data type (eg. string, float, integer, boolean)
● Operational type (eg. continuous, categorical, ordinal)
● Domain of valid values
● Missing and invalid value treatments
Making a PMML pipeline (1/2)
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml
pipeline = PMMLPipeline([...])
pipeline.fit(X, y)
sklearn2pmml(pipeline, "pipeline.pmml")
Making a PMML pipeline (2/2)
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
pipeline = Pipeline([...])
pipeline.fit(X, y)
pipeline = make_pmml_pipeline(
pipeline,
active_fields = ["x1", "x2", ...],
target_fields = ["y"]
)
sklearn2pmml(pipeline, "pipeline.pmml")
Pipeline setup (1/3)
1. Feature and label definition
Specific to (J)PMML
2. Feature engineering
3. Feature selection
Scikit-Learn persists all features, (J)PMML persists "surviving" features
4. Estimator fitting
5. Decision engineering
Specific to (J)PMML
Pipeline setup (2/3)
Workflow:
1. Column- and column set-oriented feature definition,
engineering and selection
2. Table-oriented feature engineering and selection
3. Estimator fitting
Full support for pipeline nesting, branching.
Pipeline setup (3/3)
from sklearn2pmml import PMMLPipeline
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn_pandas import DataFrameMapper
pipeline = PMMLPipeline([
("stage1", DataFrameMapper([
(["x1"], [CategoricalDomain(), ...]),
(["x2", "x3"], [ContinuousDomain(), ...]),
])),
("stage2", SelectKBest(10)),
("stage3", LogisticRegression())
])
Feature definition
Continuous features (1/2)
Without missing values:
("Income", [ContinuousDomain(invalid_value_treatment = "return_invalid",
missing_value_treatment = "as_is", with_data = True,
with_statistics = True)])
With missing values (encoded as None/NaN):
("Income", [ContinuousDomain(), Imputer()])
With missing values (encoded using dummy values):
("Income", [ContinuousDomain(missing_values = -999),
Imputer(missing_values = -999)])
Continuous features (2/2)
<DataDictionary>
<DataField name="Income" optype="continuous" dataType="double">
<Interval closure="closedClosed" leftMargin="1598.95" rightMargin="481259.5"/>
</DataField>
</DataDictionary>
<RegressionModel>
<MiningSchema>
<MiningField name="Income" invalidValueTreatment="returnInvalid"
missingValueTreatment="asMean" missingValueReplacement="84807.39297450424"/>
</MiningSchema>
<ModelStats>
<UnivariateStats field="Income">
<Counts totalFreq="1899.0" missingFreq="487.0" invalidFreq="0.0"/>
<NumericInfo minimum="1598.95" maximum="481259.5" mean="84807.39297450424" median="60596.35"
standardDeviation="69696.87637064351" interQuartileRange="81611.1225"/>
</UnivariateStats>
</ModelStats>
</RegressionModel>
Categorical features (1/3)
Without missing values:
("Education", [CategoricalDomain(invalid_value_treatment =
"return_invalid", missing_value_treatment = "as_is", with_data = True,
with_statistics = True), LabelBinarizer()])
With missing values, missing value-aware estimator:
("Education", [CategoricalDomain(), PMMLLabelBinarizer()])
With missing values, missing value-unaware estimator:
("Education", [CategoricalDomain(), CategoricalImputer(),
LabelBinarizer()])
Categorical features (2/3)
<DataDictionary>
<DataField name="Education" optype="categorical" dataType="string">
<Value value="Associate"/>
<Value value="Bachelor"/>
<Value value="College"/>
<Value value="Doctorate"/>
<Value value="HSgrad"/>
<Value value="Master"/>
<Value value="Preschool"/>
<Value value="Professional"/>
<Value value="Vocational"/>
<Value value="Yr10t12"/>
<Value value="Yr1t4"/>
<Value value="Yr5t9"/>
</DataField>
</DataDictionary>
Categorical features (3/3)
<RegressionModel>
<MiningSchema>
<MiningField name="Education" invalidValueTreatment="asIs", x-invalidValueReplacement="HSgrad"
missingValueTreatment="asMode" missingValueReplacement="HSgrad"/>
</MiningSchema>
<UnivariateStats field="Education">
<Counts totalFreq="1899.0" missingFreq="497.0" invalidFreq="0.0"/>
<DiscrStats>
<Array type="string">Associate Bachelor College Doctorate HSgrad Master Preschool
Professional Vocational Yr10t12 Yr1t4 Yr5t9</Array>
<Array type="int">55 241 309 22 462 78 6 15 55 95 4 60</Array>
</DiscrStats>
</UnivariateStats>
</RegressionModel>
Feature engineering
Continuous features (1/2)
from sklearn.preprocessing import Binarizer, FunctionTransformer
from sklearn2pmml.preprocessing import ExpressionTransformer
features = FeatureUnion([
("identity", DataFrameMapper([
(["Income", "Hours"], ContinuousDomain())
])),
("transformation", DataFrameMapper([
(["Income"], FunctionTransformer(numpy.log10)),
(["Hours"], Binarizer(threshold = 40)),
(["Income", "Hours"], ExpressionTransformer("X[:,0]/(X[:,1]*52)"))
]))
])
Continuous features (2/2)
<TransformationDictionary>
<DerivedField name="log10(Income)" optype="continuous" dataType="double">
<Apply function="log10"><FieldRef field="Income"/></Apply>
</DerivedField>
<DerivedField name="binarizer(Hours)" optype="continuous" dataType="double">
<Apply function="threshold"><FieldRef field="Hours"/><Constant>40</Constant></Apply>
</DerivedField>
<DerivedField name="eval(X[:,0]/(X[:,1]*52))" optype="continuous" dataType="double">
<Apply function="/">
<FieldRef field="Income"/>
<Apply function="*">
<FieldRef field="Hours"/>
<Constant dataType="integer">52</Constant>
</Apply>
</Apply>
</DerivedField>
</TransformationDictionary>
Categorical features
from sklearn preprocessing import LabelBinarizer, PolynomialFeatures
features = Pipeline([
("identity", DataFrameMapper([
("Education", [CategoricalDomain(), LabelBinarizer()]),
("Occupation", [CategoricalDomain(), LabelBinarizer()])
])),
("interaction", PolynomialFeatures())
])
Feature selection
Scikit-Learn challenge
class SelectPercentile(BaseTransformer, SelectorMixin):
def _get_support_mask(self):
scores = self.scores
threshold = stats.scoreatpercentile(scores, 100 - self.percentile)
return (scores > threshold)
Python methods don't have a persistent state:
Exception in thread "main": java.lang.IllegalArgumentException: The
selector object does not have a persistent '_get_support_mask' attribute
(J)PMML solution
"Hiding" a stateless selector behind a stateful meta-selector:
from sklearn2pmml import SelectorProxy, PMMLPipeline
pipeline = PMMLPipeline([
("stage1", DataFrameMapper([
(["x1"], [CategoricalDomain(), LabelBinarizer(),
SelectorProxy(SelectFromModel(DecisionTreeClassifier()))])
])),
("stage2", SelectorProxy(SelectPercentile(percentile = 50)))
])
Estimator fitting
Hyper-parameter tuning (1/2)
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml
pipeline = PMMLPipeline([...])
tuner = GridSearchCV(pipeline, param_grid = {...})
tuner.fit(X, y)
# GridSearchCV.best_estimator_ is of type PMMLPipeline
sklearn2pmml(tuner.best_estimator_, "pipeline.pmml")
Hyper-parameter tuning (2/2)
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
pipeline = Pipeline([...])
tuner = GridSearchCV(pipeline, param_grid = {...})
tuner.fit(X, y)
# GridSearchCV.best_estimator_ is of type Pipeline
sklearn2pmml(make_pmml_pipeline(tuner.best_estimator_, active_fields =
[...], target_fields = [...]), "pipeline.pmml")
Algorithm tuning
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
from tpot import TPOTClassifier
# See https://github.com/rhiever/tpot
tpot = TPOTClassifier()
tpot.fit(X, y)
sklearn2pmml(make_pmml_pipeline(tpot.fitted_pipeline_, active_fields =
[...], target_fields = [...]), "pipeline.pmml")
Model customization (1/2)
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml
# Keep reference to the estimator
classifier = DecisionTreeClassifier()
pipeline = PMMLPipeline([..., ("classifier", classifier)])
pipeline.fit(X, y)
# Apply customization(s) to the fitted estimator
classifier.compact = True
sklearn2pmml(pipeline, "pipeline.pmml")
Model customization (2/2)
● org.jpmml.sklearn.HasOptions
○ lightgbm.sklearn.HasLightGBMOptions
■ compact
■ num_iteration
○ sklearn.tree.HasTreeOptions
■ compact
○ xgboost.sklearn.HasXGBoostOptions
■ compact
■ ntree_limit
Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml
Software (Nov 2017)
● The Python side:
○ sklearn 0.19(.0)
○ sklearn2pmml 0.26(.0)
○ sklearn_pandas 1.5(.0)
○ TPOT 0.9(.0)
● The Java side:
○ JPMML-SkLearn 1.4(.0)
○ JPMML-Evaluator 1.3(.10)

More Related Content

What's hot

AWS Neptune - A Fast and reliable Graph Database Built for the Cloud
AWS Neptune - A Fast and reliable Graph Database Built for the CloudAWS Neptune - A Fast and reliable Graph Database Built for the Cloud
AWS Neptune - A Fast and reliable Graph Database Built for the Cloud
Amazon Web Services
 
Kafka Intro With Simple Java Producer Consumers
Kafka Intro With Simple Java Producer ConsumersKafka Intro With Simple Java Producer Consumers
Kafka Intro With Simple Java Producer Consumers
Jean-Paul Azar
 
[기초개념] Graph Convolutional Network (GCN)
[기초개념] Graph Convolutional Network (GCN)[기초개념] Graph Convolutional Network (GCN)
[기초개념] Graph Convolutional Network (GCN)
Donghyeon Kim
 
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
HostedbyConfluent
 
OpenVX 1.3 Reference Guide
OpenVX 1.3 Reference GuideOpenVX 1.3 Reference Guide
OpenVX 1.3 Reference Guide
The Khronos Group Inc.
 
Developing R Graphical User Interfaces
Developing R Graphical User InterfacesDeveloping R Graphical User Interfaces
Developing R Graphical User Interfaces
Setia Pramana
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
Gabriel Moreira
 
Using Grafana with InfluxDB 2.0 and Flux Lang by Jacob Lisi
Using Grafana with InfluxDB 2.0 and Flux Lang by Jacob LisiUsing Grafana with InfluxDB 2.0 and Flux Lang by Jacob Lisi
Using Grafana with InfluxDB 2.0 and Flux Lang by Jacob Lisi
InfluxData
 
Extending MariaDB with user-defined functions
Extending MariaDB with user-defined functionsExtending MariaDB with user-defined functions
Extending MariaDB with user-defined functions
MariaDB plc
 
Rules engine
Rules engineRules engine
Rules engine
Stanimir Todorov
 
Machine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and CredibilityMachine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and Credibility
Pier Luca Lanzi
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
Xiang Fu
 
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
DataStax
 
Conditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN DecodersConditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN Decoders
suga93
 
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
confluent
 
3D Rigid Body Transformation for SLAM
3D Rigid Body Transformation for SLAM3D Rigid Body Transformation for SLAM
3D Rigid Body Transformation for SLAM
EdwardIm1
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
Daniel Marcous
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Universitat Politècnica de Catalunya
 
Introduction to KSQL: Streaming SQL for Apache Kafka®
Introduction to KSQL: Streaming SQL for Apache Kafka®Introduction to KSQL: Streaming SQL for Apache Kafka®
Introduction to KSQL: Streaming SQL for Apache Kafka®
confluent
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
confluent
 

What's hot (20)

AWS Neptune - A Fast and reliable Graph Database Built for the Cloud
AWS Neptune - A Fast and reliable Graph Database Built for the CloudAWS Neptune - A Fast and reliable Graph Database Built for the Cloud
AWS Neptune - A Fast and reliable Graph Database Built for the Cloud
 
Kafka Intro With Simple Java Producer Consumers
Kafka Intro With Simple Java Producer ConsumersKafka Intro With Simple Java Producer Consumers
Kafka Intro With Simple Java Producer Consumers
 
[기초개념] Graph Convolutional Network (GCN)
[기초개념] Graph Convolutional Network (GCN)[기초개념] Graph Convolutional Network (GCN)
[기초개념] Graph Convolutional Network (GCN)
 
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
 
OpenVX 1.3 Reference Guide
OpenVX 1.3 Reference GuideOpenVX 1.3 Reference Guide
OpenVX 1.3 Reference Guide
 
Developing R Graphical User Interfaces
Developing R Graphical User InterfacesDeveloping R Graphical User Interfaces
Developing R Graphical User Interfaces
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
Using Grafana with InfluxDB 2.0 and Flux Lang by Jacob Lisi
Using Grafana with InfluxDB 2.0 and Flux Lang by Jacob LisiUsing Grafana with InfluxDB 2.0 and Flux Lang by Jacob Lisi
Using Grafana with InfluxDB 2.0 and Flux Lang by Jacob Lisi
 
Extending MariaDB with user-defined functions
Extending MariaDB with user-defined functionsExtending MariaDB with user-defined functions
Extending MariaDB with user-defined functions
 
Rules engine
Rules engineRules engine
Rules engine
 
Machine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and CredibilityMachine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and Credibility
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
 
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
 
Conditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN DecodersConditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN Decoders
 
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
 
3D Rigid Body Transformation for SLAM
3D Rigid Body Transformation for SLAM3D Rigid Body Transformation for SLAM
3D Rigid Body Transformation for SLAM
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
 
Introduction to KSQL: Streaming SQL for Apache Kafka®
Introduction to KSQL: Streaming SQL for Apache Kafka®Introduction to KSQL: Streaming SQL for Apache Kafka®
Introduction to KSQL: Streaming SQL for Apache Kafka®
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 

Viewers also liked

K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-Learn
Sarah Guido
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pôle Systematic Paris-Region
 
Intro to scikit learn may 2017
Intro to scikit learn may 2017Intro to scikit learn may 2017
Intro to scikit learn may 2017
Francesco Mosconi
 
Data Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnData Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learn
Asim Jalis
 
Introduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnIntroduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learn
Matt Hagy
 
Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learn
Jeff Klukas
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
Gael Varoquaux
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
Chetan Khatri
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Arnaud Joly
 
Realtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnRealtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learn
AWeber
 
Machine learning with scikit-learn
Machine learning with scikit-learnMachine learning with scikit-learn
Machine learning with scikit-learn
Qingkai Kong
 
Intro to scikit-learn
Intro to scikit-learnIntro to scikit-learn
Intro to scikit-learn
AWeber
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
Gilles Louppe
 
Exploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnExploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-Learn
Kan Ouivirach, Ph.D.
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
PyData
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learn
odsc
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learn
Yoss Cohen
 
Clustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn TutorialClustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn Tutorial
Damian R. Mingle, MBA
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
Gael Varoquaux
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-Learn
Gilles Louppe
 

Viewers also liked (20)

K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-Learn
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
 
Intro to scikit learn may 2017
Intro to scikit learn may 2017Intro to scikit learn may 2017
Intro to scikit learn may 2017
 
Data Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnData Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learn
 
Introduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnIntroduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learn
 
Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learn
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
 
Realtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnRealtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learn
 
Machine learning with scikit-learn
Machine learning with scikit-learnMachine learning with scikit-learn
Machine learning with scikit-learn
 
Intro to scikit-learn
Intro to scikit-learnIntro to scikit-learn
Intro to scikit-learn
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
Exploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnExploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-Learn
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learn
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learn
 
Clustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn TutorialClustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn Tutorial
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-Learn
 

Similar to Converting Scikit-Learn to PMML

svm classification
svm classificationsvm classification
svm classification
Akhilesh Joshi
 
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
Abhishek Thakur
 
logistic regression with python and R
logistic regression with python and Rlogistic regression with python and R
logistic regression with python and R
Akhilesh Joshi
 
knn classification
knn classificationknn classification
knn classification
Akhilesh Joshi
 
Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning Algorithms
Hichem Felouat
 
Assignment 6.2a.pdf
Assignment 6.2a.pdfAssignment 6.2a.pdf
Assignment 6.2a.pdf
dash41
 
Xgboost
XgboostXgboost
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
Vivian S. Zhang
 
Deep learning study 3
Deep learning study 3Deep learning study 3
Deep learning study 3
San Kim
 
Xgboost
XgboostXgboost
Machine Learning on GCP
Machine Learning on GCPMachine Learning on GCP
Machine Learning on GCP
AllCloud
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
Vivian S. Zhang
 
20.1 Java working with abstraction
20.1 Java working with abstraction20.1 Java working with abstraction
20.1 Java working with abstraction
Intro C# Book
 
Scikit learn cheat_sheet_python
Scikit learn cheat_sheet_pythonScikit learn cheat_sheet_python
Scikit learn cheat_sheet_python
Zahid Hasan
 
Scikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-PythonScikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-Python
Dr. Volkan OBAN
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learn
Karlijn Willems
 
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...
Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...
FarhanAhmade
 
생산적인 개발을 위한 지속적인 테스트
생산적인 개발을 위한 지속적인 테스트생산적인 개발을 위한 지속적인 테스트
생산적인 개발을 위한 지속적인 테스트기룡 남
 
Advance Java Programs skeleton
Advance Java Programs skeletonAdvance Java Programs skeleton
Advance Java Programs skeleton
Iram Ramrajkar
 

Similar to Converting Scikit-Learn to PMML (20)

svm classification
svm classificationsvm classification
svm classification
 
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
 
logistic regression with python and R
logistic regression with python and Rlogistic regression with python and R
logistic regression with python and R
 
knn classification
knn classificationknn classification
knn classification
 
Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning Algorithms
 
Assignment 6.2a.pdf
Assignment 6.2a.pdfAssignment 6.2a.pdf
Assignment 6.2a.pdf
 
Xgboost
XgboostXgboost
Xgboost
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
 
Deep learning study 3
Deep learning study 3Deep learning study 3
Deep learning study 3
 
Xgboost
XgboostXgboost
Xgboost
 
Machine Learning on GCP
Machine Learning on GCPMachine Learning on GCP
Machine Learning on GCP
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
 
20.1 Java working with abstraction
20.1 Java working with abstraction20.1 Java working with abstraction
20.1 Java working with abstraction
 
Scikit learn cheat_sheet_python
Scikit learn cheat_sheet_pythonScikit learn cheat_sheet_python
Scikit learn cheat_sheet_python
 
Scikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-PythonScikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-Python
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learn
 
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...
Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...
 
생산적인 개발을 위한 지속적인 테스트
생산적인 개발을 위한 지속적인 테스트생산적인 개발을 위한 지속적인 테스트
생산적인 개발을 위한 지속적인 테스트
 
Advance Java Programs skeleton
Advance Java Programs skeletonAdvance Java Programs skeleton
Advance Java Programs skeleton
 
Functional Programming
Functional ProgrammingFunctional Programming
Functional Programming
 

Recently uploaded

Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)
abdulrafaychaudhry
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
Enterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptxEnterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptx
QuickwayInfoSystems3
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
informapgpstrackings
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Shahin Sheidaei
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
Globus
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 

Recently uploaded (20)

Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
Enterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptxEnterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptx
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 

Converting Scikit-Learn to PMML

  • 1. Converting Scikit-Learn to PMML Villu Ruusmann Openscoring OÜ
  • 3. Scikit-Learn challenge pipeline = Pipeline([...]) pipeline.fit(Xtrain , ytrain ) ytest = pipeline.predict(Xtest ) ● Serializing and deserializing the fitted pipeline: ○ (C)Pickle ○ Joblib ● Creating Xtest that is functionally equivalent to Xtrain : ○ ???
  • 4. (J)PMML solution Evaluator evaluator = ...; List<InputField> argumentFields = evaluator.getInputFields(); List<ResultField> resultFields = Lists.union(evaluator.getTargetFields(), evaluator.getOutputFields()); Map<FieldName, ?> arguments = readRecord(argumentFields); Map<FieldName, ?> result = evaluator.evaluate(arguments); writeRecord(result, resultFields); ● The fitted pipeline (data pre-and post-processing, model) is represented using standardized PMML data structures ● Well-defined data entry and exit interfaces
  • 6. The PMML pipeline A smart pipeline that collects supporting information about "passed through" features and label(s): ● Name ● Data type (eg. string, float, integer, boolean) ● Operational type (eg. continuous, categorical, ordinal) ● Domain of valid values ● Missing and invalid value treatments
  • 7. Making a PMML pipeline (1/2) from sklearn2pmml import PMMLPipeline from sklearn2pmml import sklearn2pmml pipeline = PMMLPipeline([...]) pipeline.fit(X, y) sklearn2pmml(pipeline, "pipeline.pmml")
  • 8. Making a PMML pipeline (2/2) from sklearn2pmml import make_pmml_pipeline, sklearn2pmml pipeline = Pipeline([...]) pipeline.fit(X, y) pipeline = make_pmml_pipeline( pipeline, active_fields = ["x1", "x2", ...], target_fields = ["y"] ) sklearn2pmml(pipeline, "pipeline.pmml")
  • 9. Pipeline setup (1/3) 1. Feature and label definition Specific to (J)PMML 2. Feature engineering 3. Feature selection Scikit-Learn persists all features, (J)PMML persists "surviving" features 4. Estimator fitting 5. Decision engineering Specific to (J)PMML
  • 10. Pipeline setup (2/3) Workflow: 1. Column- and column set-oriented feature definition, engineering and selection 2. Table-oriented feature engineering and selection 3. Estimator fitting Full support for pipeline nesting, branching.
  • 11. Pipeline setup (3/3) from sklearn2pmml import PMMLPipeline from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain from sklearn_pandas import DataFrameMapper pipeline = PMMLPipeline([ ("stage1", DataFrameMapper([ (["x1"], [CategoricalDomain(), ...]), (["x2", "x3"], [ContinuousDomain(), ...]), ])), ("stage2", SelectKBest(10)), ("stage3", LogisticRegression()) ])
  • 13. Continuous features (1/2) Without missing values: ("Income", [ContinuousDomain(invalid_value_treatment = "return_invalid", missing_value_treatment = "as_is", with_data = True, with_statistics = True)]) With missing values (encoded as None/NaN): ("Income", [ContinuousDomain(), Imputer()]) With missing values (encoded using dummy values): ("Income", [ContinuousDomain(missing_values = -999), Imputer(missing_values = -999)])
  • 14. Continuous features (2/2) <DataDictionary> <DataField name="Income" optype="continuous" dataType="double"> <Interval closure="closedClosed" leftMargin="1598.95" rightMargin="481259.5"/> </DataField> </DataDictionary> <RegressionModel> <MiningSchema> <MiningField name="Income" invalidValueTreatment="returnInvalid" missingValueTreatment="asMean" missingValueReplacement="84807.39297450424"/> </MiningSchema> <ModelStats> <UnivariateStats field="Income"> <Counts totalFreq="1899.0" missingFreq="487.0" invalidFreq="0.0"/> <NumericInfo minimum="1598.95" maximum="481259.5" mean="84807.39297450424" median="60596.35" standardDeviation="69696.87637064351" interQuartileRange="81611.1225"/> </UnivariateStats> </ModelStats> </RegressionModel>
  • 15. Categorical features (1/3) Without missing values: ("Education", [CategoricalDomain(invalid_value_treatment = "return_invalid", missing_value_treatment = "as_is", with_data = True, with_statistics = True), LabelBinarizer()]) With missing values, missing value-aware estimator: ("Education", [CategoricalDomain(), PMMLLabelBinarizer()]) With missing values, missing value-unaware estimator: ("Education", [CategoricalDomain(), CategoricalImputer(), LabelBinarizer()])
  • 16. Categorical features (2/3) <DataDictionary> <DataField name="Education" optype="categorical" dataType="string"> <Value value="Associate"/> <Value value="Bachelor"/> <Value value="College"/> <Value value="Doctorate"/> <Value value="HSgrad"/> <Value value="Master"/> <Value value="Preschool"/> <Value value="Professional"/> <Value value="Vocational"/> <Value value="Yr10t12"/> <Value value="Yr1t4"/> <Value value="Yr5t9"/> </DataField> </DataDictionary>
  • 17. Categorical features (3/3) <RegressionModel> <MiningSchema> <MiningField name="Education" invalidValueTreatment="asIs", x-invalidValueReplacement="HSgrad" missingValueTreatment="asMode" missingValueReplacement="HSgrad"/> </MiningSchema> <UnivariateStats field="Education"> <Counts totalFreq="1899.0" missingFreq="497.0" invalidFreq="0.0"/> <DiscrStats> <Array type="string">Associate Bachelor College Doctorate HSgrad Master Preschool Professional Vocational Yr10t12 Yr1t4 Yr5t9</Array> <Array type="int">55 241 309 22 462 78 6 15 55 95 4 60</Array> </DiscrStats> </UnivariateStats> </RegressionModel>
  • 19. Continuous features (1/2) from sklearn.preprocessing import Binarizer, FunctionTransformer from sklearn2pmml.preprocessing import ExpressionTransformer features = FeatureUnion([ ("identity", DataFrameMapper([ (["Income", "Hours"], ContinuousDomain()) ])), ("transformation", DataFrameMapper([ (["Income"], FunctionTransformer(numpy.log10)), (["Hours"], Binarizer(threshold = 40)), (["Income", "Hours"], ExpressionTransformer("X[:,0]/(X[:,1]*52)")) ])) ])
  • 20. Continuous features (2/2) <TransformationDictionary> <DerivedField name="log10(Income)" optype="continuous" dataType="double"> <Apply function="log10"><FieldRef field="Income"/></Apply> </DerivedField> <DerivedField name="binarizer(Hours)" optype="continuous" dataType="double"> <Apply function="threshold"><FieldRef field="Hours"/><Constant>40</Constant></Apply> </DerivedField> <DerivedField name="eval(X[:,0]/(X[:,1]*52))" optype="continuous" dataType="double"> <Apply function="/"> <FieldRef field="Income"/> <Apply function="*"> <FieldRef field="Hours"/> <Constant dataType="integer">52</Constant> </Apply> </Apply> </DerivedField> </TransformationDictionary>
  • 21. Categorical features from sklearn preprocessing import LabelBinarizer, PolynomialFeatures features = Pipeline([ ("identity", DataFrameMapper([ ("Education", [CategoricalDomain(), LabelBinarizer()]), ("Occupation", [CategoricalDomain(), LabelBinarizer()]) ])), ("interaction", PolynomialFeatures()) ])
  • 23. Scikit-Learn challenge class SelectPercentile(BaseTransformer, SelectorMixin): def _get_support_mask(self): scores = self.scores threshold = stats.scoreatpercentile(scores, 100 - self.percentile) return (scores > threshold) Python methods don't have a persistent state: Exception in thread "main": java.lang.IllegalArgumentException: The selector object does not have a persistent '_get_support_mask' attribute
  • 24. (J)PMML solution "Hiding" a stateless selector behind a stateful meta-selector: from sklearn2pmml import SelectorProxy, PMMLPipeline pipeline = PMMLPipeline([ ("stage1", DataFrameMapper([ (["x1"], [CategoricalDomain(), LabelBinarizer(), SelectorProxy(SelectFromModel(DecisionTreeClassifier()))]) ])), ("stage2", SelectorProxy(SelectPercentile(percentile = 50))) ])
  • 26. Hyper-parameter tuning (1/2) from sklearn.model_selection import GridSearchCV from sklearn2pmml import PMMLPipeline from sklearn2pmml import sklearn2pmml pipeline = PMMLPipeline([...]) tuner = GridSearchCV(pipeline, param_grid = {...}) tuner.fit(X, y) # GridSearchCV.best_estimator_ is of type PMMLPipeline sklearn2pmml(tuner.best_estimator_, "pipeline.pmml")
  • 27. Hyper-parameter tuning (2/2) from sklearn2pmml import make_pmml_pipeline, sklearn2pmml pipeline = Pipeline([...]) tuner = GridSearchCV(pipeline, param_grid = {...}) tuner.fit(X, y) # GridSearchCV.best_estimator_ is of type Pipeline sklearn2pmml(make_pmml_pipeline(tuner.best_estimator_, active_fields = [...], target_fields = [...]), "pipeline.pmml")
  • 28. Algorithm tuning from sklearn2pmml import make_pmml_pipeline, sklearn2pmml from tpot import TPOTClassifier # See https://github.com/rhiever/tpot tpot = TPOTClassifier() tpot.fit(X, y) sklearn2pmml(make_pmml_pipeline(tpot.fitted_pipeline_, active_fields = [...], target_fields = [...]), "pipeline.pmml")
  • 29. Model customization (1/2) from sklearn2pmml import PMMLPipeline from sklearn2pmml import sklearn2pmml # Keep reference to the estimator classifier = DecisionTreeClassifier() pipeline = PMMLPipeline([..., ("classifier", classifier)]) pipeline.fit(X, y) # Apply customization(s) to the fitted estimator classifier.compact = True sklearn2pmml(pipeline, "pipeline.pmml")
  • 30. Model customization (2/2) ● org.jpmml.sklearn.HasOptions ○ lightgbm.sklearn.HasLightGBMOptions ■ compact ■ num_iteration ○ sklearn.tree.HasTreeOptions ■ compact ○ xgboost.sklearn.HasXGBoostOptions ■ compact ■ ntree_limit
  • 32. Software (Nov 2017) ● The Python side: ○ sklearn 0.19(.0) ○ sklearn2pmml 0.26(.0) ○ sklearn_pandas 1.5(.0) ○ TPOT 0.9(.0) ● The Java side: ○ JPMML-SkLearn 1.4(.0) ○ JPMML-Evaluator 1.3(.10)