SlideShare a Scribd company logo
Converting R to PMML
Villu Ruusmann
Openscoring OÜ
"Train once, deploy anywhere"
R challenge
"Fat code around thin model objects":
● Many packages to solve the same problem, many ways of
doing and/or representing the same thing
● The unit of reusability is R script
● Model object is tightly coupled to the R script that
produced it. Incomplete/split schemas
● No (widely adopted-) pipeline formalization
(J)PMML solution
Evaluator evaluator = ...;
List<InputField> argumentFields = evaluator.getInputFields();
List<ResultField> resultFields =
Lists.union(evaluator.getTargetFields(), evaluator.getOutputFields());
Map<FieldName, ?> arguments = readRecord(argumentFields);
Map<FieldName, ?> result = evaluator.evaluate(arguments);
writeRecord(result, resultFields);
● The R script (data pre-and post-processing, model) is
represented using standardized PMML data structures
● Well-defined data entry and exit interfaces
Workflow
R challenge
Training:
audit_train = read.csv("Audit.csv")
audit_train$HourlyIncome = (audit_train$Income / (audit_train$Hours * 52))
audit.model = model(Adjusted ~ ., data = audit_train)
saveRDS(audit.model, "Audit.rds")
Deployment:
audit_test = read.csv("Audit.csv")
audit.model = readRDS("Audit.rds")
# "Error in eval(expr, envir, enclos) : object 'HourlyIncome' not found"
predict(audit.model, newdata = audit_test)
R solution
Matrix interface ( ):
label = audit$Adjusted
features = audit[, !(colnames(audit) %in% c("Adjusted"))]
# Returns a `model` object
audit.model = model(x = features, y = label)
Formula interface ( ):
# Returns a `model.formula` object
audit.model = model(Adjusted ~ . + I(Income / (Hours * 52)), data = audit)
From `pmml` to `r2pmml`
library("pmml")
library("r2pmml")
audit = read.csv("Audit.csv")
audit$Adjusted = as.factor(audit$Adjusted)
audit.glm = glm(Adjusted ~ . + I(Income / (Hours * 52)),
data = audit, family = "binomial")
saveXML(pmml(audit.glm), "Audit.pmml")
r2pmml(audit.glm, "Audit.pmml")
Package `pmml` ( )
<PMML>
<DataDictionary>
<DataField name="I(Income/(Hours * 52))" optype="continuous" dataType="double"/>
</DataDictionary>
<GeneralRegressionModel>
<MiningSchema>
<MiningField name="I(Income/(Hours * 52))"/>
</MiningSchema>
</GeneralRegressionModel>
</PMML>
Package `r2pmml` ( )
<PMML>
<DataDictionary>
<DataField name="Income" optype="continuous" dataType="double"/>
<DataField name="Hours" optype="continuous" dataType="double"/>
</DataDictionary>
<TransformationDictionary>
<DerivedField name="Income/(Hours * 52)" optype="continuous" dataType="double">
<Apply function="/"><FieldRef field="Income"/><Apply function="*"><FieldRef
field="Hours"/><Constant>52</Constant></Apply></Apply>
</DerivedField>
</TransformationDictionary>
<GeneralRegressionModel>
<MiningSchema>
<MiningField name="Income"/>
<MiningField name="Hours"/>
</MiningSchema>
</GeneralRegressionModel>
</PMML>
Manual conversion (1/2)
r2pmml::r2pmml
r2pmml::decorate base::saveRDS RDS file JPMML-R library
The R side The Java side
Manual conversion (2/2)
The R side:
audit.model = model(Adjusted ~ ., data = audit)
# Decorate the model object with supporting information
audit.model = r2pmml::decorate(audit.model, dataset = audit)
saveRDS(audit.model, "Audit.rds")
The Java side:
$ git clone https://github.com/jpmml/jpmml-r.git
$ cd jpmml-r; mvn clean package
$ java -jar target/converter-executable-1.2-SNAPSHOT.jar --rds-input
/path/to/Audit.rds --pmml-output /path/to/Audit.pmml
In-formula feature engineering
Sanitization (1/2)
?base::ifelse
# Static replacement value
as.formula(Adjusted ~ ifelse(is.na(Hours), 0, Hours))
# Dynamic replacement value
as.formula(paste("Adjusted ~ ifelse((is.na(Hours) | Hours < 0 | Hours >
168),", mean(df$Hours), ", Hours)", sep = ""))
Sanitization (2/2)
<DerivedField name="ifelse(is.na(Hours))" optype="continuous" dataType="double">
<Extension>ifelse(is.na(Hours), 0, Hours)</Extension>
<Apply function="if">
<Apply function="isMissing">
<FieldRef field="Hours"/>
</Apply>
<Constant>0</Constant>
<FieldRef field="Hours"/>
</Apply>
</DerivedField>
Binarization (1/2)
?base::ifelse
as.formula(Adjusted ~ ifelse(Hours >= 40, "full-time", "part-time"))
Binarization (2/2)
<DerivedField name="ifelse(Hours &gt;= 40)" optype="categorical" dataType="string">
<Extension>ifelse(Hours &gt;= 40, "full-time", "part-time")</Extension>
<Apply function="if">
<Apply function="greaterOrEqual">
<FieldRef field="Hours"/>
<Constant>40</Constant>
</Apply>
<Constant>full-time</Constant>
<Constant>part-time</Constant>
</Apply>
</DerivedField>
Bucketization (1/2)
?base::cut
# Static intervals
as.formula(Adjusted ~ cut(Income, breaks = 3))
# Dynamic intervals
as.formula(Adjusted ~ cut(log10(Income), breaks = quantile(log10(Income)))
Bucketization (2/2)
<DerivedField name="cut(Income)" optype="categorical" dataType="string">
<Extension>cut(Income, breaks = 3)</Extension>
<Discretize field="Income">
<DiscretizeBin binValue="(129,1.61e+05]">
<Interval closure="openClosed" leftMargin="129.0" rightMargin="161000.0"/>
</DiscretizeBin>
<DiscretizeBin binValue="(1.61e+05,3.21e+05]">
<Interval closure="openClosed" leftMargin="161000.0" rightMargin="321000.0"/>
</DiscretizeBin>
<DiscretizeBin binValue="(3.21e+05,4.82e+05]">
<Interval closure="openClosed" leftMargin="321000.0" rightMargin="482000.0"/>
</DiscretizeBin>
</Discretize>
</DerivedField>
Mathematical expressions (1/2)
?base::I
as.formula(Adjusted ~ I(floor(Income / (Hours * 52))))
Supported constructs:
● Arithmetic operators `+`, `-`, `*`, `/`, `^` and `**`
● Relational operators `==`, `!=`, `<`, `<=`, `>=` and `>`
● Logical operators `!`, `&` and `|`
● Functions `abs`, `ceiling`, `exp`, `floor`, `is.na`, `log`, `log10`, `round`
and `sqrt`
Mathematical expressions (2/2)
<DerivedField name="floor(Income/(Hours * 52))" optype="continuous" dataType="double">
<Extension>I(floor(Income/(Hours * 52)))</Extension>
<Apply function="floor">
<Apply function="/">
<FieldRef field="Income"/>
<Apply function="*">
<FieldRef field="Hours"/>
<Constant>52</Constant>
</Apply>
</Apply>
</Apply>
</DerivedField>
Renaming and regrouping categories (1/2)
library("plyr")
?plyr::revalue
?plyr::mapvalues
as.formula(Adjusted ~ plyr::revalue(Employment, c("PSFederal" = "Public",
"PSState" = "Public", "PSLocal" = "Public")))
as.formula(Adjusted ~ plyr::revalue(Education, c("Yr1t4" = "Yr1t6",
"Yr5t6" = "Yr1t6", "Yr7t8" = "Yr7t9", "Yr9" = "Yr7t9", "Yr10" = "Yr10t12",
"Yr11" = "Yr10t12", "Yr12" = "Yr10t12")))
as.formula(Adjusted ~ plyr::mapvalues(Gender, c("Male", "Female"), c(0, 1)))
Renaming and regrouping categories (2/2)
<DerivedField name="revalue(Employment)" optype="categorical" dataType="string">
<Extension>plyr::revalue(Employment, c(PSFederal = "Public", PSState = "Public",
PSLocal = "Public"))</Extension>
<MapValues outputColumn="to">
<FieldColumnPair field="Employment" column="from"/>
<InlineTable>
<row><from>PSFederal</from><to>Public</to></row>
<row><from>PSState</from><to>Public</to></row>
<row><from>PSLocal</from><to>Public</to></row>
<row><from>Consultant</from><to>Consultant</to></row>
<row><from>Private</from><to>Private</to></row>
<row><from>SelfEmp</from><to>SelfEmp</to></row>
<row><from>Volunteer</from><to>Volunteer</to></row>
</InlineTable>
</MapValues>
</DerivedField>
Interactions
as.formula(Adjusted ~
# Continuous vs. continuous
Age:Income +
# Continuous vs. transformed continuous
Age:I(log10(Income)) +
# Continuous vs. categorical
Gender:Income +
# Categorical vs. categorical
(Gender + Education) ^ 2
)
Estimator fitting
Alternative encodings (1/2)
# Continuous label ~ Continuous and categorical features
audit.glm = glm(Income ~ Age + Employment + Education + Marital +
Occupation + Gender + Hours, data = audit, family = "gaussian")
# Encoded as GeneralRegressionModel element
r2pmml(audit.glm, "Audit.pmml")
# Encoded as RegressionModel element
r2pmml(audit.glm, "Audit.pmml", converter = "org.jpmml.rexp.LMConverter")
Alternative encodings (2/2)
# Binary label ~ Categorical features
audit.glm = glm(Adjusted ~ cut(Age, breaks = c(0, 21, 65, 100)) +
Employment + Education + Marital + Occupation + cut(Income, breaks =
quantile(Income)) + Gender + ifelse(Hours >= 40, "full-time",
"part-time"), data = audit, family = "binomial")
audit.scorecard = r2pmml::as.scorecard(audit.glm)
# Encoded as Scorecard element
r2pmml(audit.scorecard, "Audit.pmml")
Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml
Software (Nov 2017)
● The R side:
○ base 3.3(.1)
○ pmml 1.5(.2)
○ r2pmml 0.15(.0)
○ plyr 1.8(.4)
● The Java side:
○ JPMML-R 1.2(.20)
○ JPMML-Evaluator 1.3(.10)

More Related Content

What's hot

Python matplotlib cheat_sheet
Python matplotlib cheat_sheetPython matplotlib cheat_sheet
Python matplotlib cheat_sheet
Nishant Upadhyay
 
Java: Inheritance
Java: InheritanceJava: Inheritance
Java: Inheritance
Tareq Hasan
 
Python Programming - Files & Exceptions
Python Programming - Files & ExceptionsPython Programming - Files & Exceptions
Python Programming - Files & Exceptions
Omid AmirGhiasvand
 
Proxies are Awesome!
Proxies are Awesome!Proxies are Awesome!
Proxies are Awesome!
Brendan Eich
 
Sql
SqlSql
Sql
krymo
 
Domain Modeling with FP (DDD Europe 2020)
Domain Modeling with FP (DDD Europe 2020)Domain Modeling with FP (DDD Europe 2020)
Domain Modeling with FP (DDD Europe 2020)
Scott Wlaschin
 
An introduction to Rust: the modern programming language to develop safe and ...
An introduction to Rust: the modern programming language to develop safe and ...An introduction to Rust: the modern programming language to develop safe and ...
An introduction to Rust: the modern programming language to develop safe and ...
Claudio Capobianco
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learn
Karlijn Willems
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data Analysis
Andrew Henshaw
 
PHP
PHPPHP
JavaScript Promises
JavaScript PromisesJavaScript Promises
JavaScript Promises
L&T Technology Services Limited
 
Web technology lab manual
Web technology lab manualWeb technology lab manual
Web technology lab manual
neela madheswari
 
Javascript built in String Functions
Javascript built in String FunctionsJavascript built in String Functions
Javascript built in String Functions
Avanitrambadiya
 
Jedi slides 2.1 object-oriented concepts
Jedi slides 2.1 object-oriented conceptsJedi slides 2.1 object-oriented concepts
Jedi slides 2.1 object-oriented concepts
Maryo Manjaruni
 
String.ppt
String.pptString.ppt
String.ppt
ajeela mushtaq
 
You code sucks, let's fix it
You code sucks, let's fix itYou code sucks, let's fix it
You code sucks, let's fix it
Rafael Dohms
 
C++ 프로젝트에 단위 테스트 도입하기
C++ 프로젝트에 단위 테스트 도입하기C++ 프로젝트에 단위 테스트 도입하기
C++ 프로젝트에 단위 테스트 도입하기OnGameServer
 
Java and XML
Java and XMLJava and XML
Java and XML
Raji Ghawi
 
DDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFrameworkDDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFramework
banq jdon
 
Introduction to JSX
Introduction to JSXIntroduction to JSX
Introduction to JSX
Micah Wood
 

What's hot (20)

Python matplotlib cheat_sheet
Python matplotlib cheat_sheetPython matplotlib cheat_sheet
Python matplotlib cheat_sheet
 
Java: Inheritance
Java: InheritanceJava: Inheritance
Java: Inheritance
 
Python Programming - Files & Exceptions
Python Programming - Files & ExceptionsPython Programming - Files & Exceptions
Python Programming - Files & Exceptions
 
Proxies are Awesome!
Proxies are Awesome!Proxies are Awesome!
Proxies are Awesome!
 
Sql
SqlSql
Sql
 
Domain Modeling with FP (DDD Europe 2020)
Domain Modeling with FP (DDD Europe 2020)Domain Modeling with FP (DDD Europe 2020)
Domain Modeling with FP (DDD Europe 2020)
 
An introduction to Rust: the modern programming language to develop safe and ...
An introduction to Rust: the modern programming language to develop safe and ...An introduction to Rust: the modern programming language to develop safe and ...
An introduction to Rust: the modern programming language to develop safe and ...
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learn
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data Analysis
 
PHP
PHPPHP
PHP
 
JavaScript Promises
JavaScript PromisesJavaScript Promises
JavaScript Promises
 
Web technology lab manual
Web technology lab manualWeb technology lab manual
Web technology lab manual
 
Javascript built in String Functions
Javascript built in String FunctionsJavascript built in String Functions
Javascript built in String Functions
 
Jedi slides 2.1 object-oriented concepts
Jedi slides 2.1 object-oriented conceptsJedi slides 2.1 object-oriented concepts
Jedi slides 2.1 object-oriented concepts
 
String.ppt
String.pptString.ppt
String.ppt
 
You code sucks, let's fix it
You code sucks, let's fix itYou code sucks, let's fix it
You code sucks, let's fix it
 
C++ 프로젝트에 단위 테스트 도입하기
C++ 프로젝트에 단위 테스트 도입하기C++ 프로젝트에 단위 테스트 도입하기
C++ 프로젝트에 단위 테스트 도입하기
 
Java and XML
Java and XMLJava and XML
Java and XML
 
DDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFrameworkDDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFramework
 
Introduction to JSX
Introduction to JSXIntroduction to JSX
Introduction to JSX
 

Similar to Converting R to PMML

Innovative Specifications for Better Performance Logging and Monitoring
Innovative Specifications for Better Performance Logging and MonitoringInnovative Specifications for Better Performance Logging and Monitoring
Innovative Specifications for Better Performance Logging and Monitoring
Cary Millsap
 
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
Alithya
 
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, HadoopACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
Paco Nathan
 
Angular 2 Architecture (Bucharest 26/10/2016)
Angular 2 Architecture (Bucharest 26/10/2016)Angular 2 Architecture (Bucharest 26/10/2016)
Angular 2 Architecture (Bucharest 26/10/2016)
Eyal Vardi
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
Chester Chen
 
Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)
Jonathan Felch
 
Drupal 8 migrate!
Drupal 8 migrate!Drupal 8 migrate!
Drupal 8 migrate!
Pavel Makhrinsky
 
Zend framework service
Zend framework serviceZend framework service
Zend framework service
Michelangelo van Dam
 
Zend framework service
Zend framework serviceZend framework service
Zend framework service
Michelangelo van Dam
 
Introduction to Zend Framework web services
Introduction to Zend Framework web servicesIntroduction to Zend Framework web services
Introduction to Zend Framework web services
Michelangelo van Dam
 
Mind Your Business. And Its Logic
Mind Your Business. And Its LogicMind Your Business. And Its Logic
Mind Your Business. And Its Logic
Vladik Khononov
 
Finagle and Java Service Framework at Pinterest
Finagle and Java Service Framework at PinterestFinagle and Java Service Framework at Pinterest
Finagle and Java Service Framework at Pinterest
Pavan Chitumalla
 
Reason and GraphQL
Reason and GraphQLReason and GraphQL
Reason and GraphQL
Nikolaus Graf
 
Mapfilterreducepresentation
MapfilterreducepresentationMapfilterreducepresentation
Mapfilterreducepresentation
ManjuKumara GH
 
Broadleaf Presents Thymeleaf
Broadleaf Presents ThymeleafBroadleaf Presents Thymeleaf
Broadleaf Presents Thymeleaf
Broadleaf Commerce
 
Functional Reactive Programming with RxJS
Functional Reactive Programming with RxJSFunctional Reactive Programming with RxJS
Functional Reactive Programming with RxJS
stefanmayer13
 
An introduction to Test Driven Development on MapReduce
An introduction to Test Driven Development on MapReduceAn introduction to Test Driven Development on MapReduce
An introduction to Test Driven Development on MapReduceAnanth PackkilDurai
 
ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...
ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...
ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...ZFConf Conference
 
Flink Batch Processing and Iterations
Flink Batch Processing and IterationsFlink Batch Processing and Iterations
Flink Batch Processing and Iterations
Sameer Wadkar
 
Sparkling Water Meetup
Sparkling Water MeetupSparkling Water Meetup
Sparkling Water Meetup
Sri Ambati
 

Similar to Converting R to PMML (20)

Innovative Specifications for Better Performance Logging and Monitoring
Innovative Specifications for Better Performance Logging and MonitoringInnovative Specifications for Better Performance Logging and Monitoring
Innovative Specifications for Better Performance Logging and Monitoring
 
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
 
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, HadoopACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
 
Angular 2 Architecture (Bucharest 26/10/2016)
Angular 2 Architecture (Bucharest 26/10/2016)Angular 2 Architecture (Bucharest 26/10/2016)
Angular 2 Architecture (Bucharest 26/10/2016)
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
 
Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)
 
Drupal 8 migrate!
Drupal 8 migrate!Drupal 8 migrate!
Drupal 8 migrate!
 
Zend framework service
Zend framework serviceZend framework service
Zend framework service
 
Zend framework service
Zend framework serviceZend framework service
Zend framework service
 
Introduction to Zend Framework web services
Introduction to Zend Framework web servicesIntroduction to Zend Framework web services
Introduction to Zend Framework web services
 
Mind Your Business. And Its Logic
Mind Your Business. And Its LogicMind Your Business. And Its Logic
Mind Your Business. And Its Logic
 
Finagle and Java Service Framework at Pinterest
Finagle and Java Service Framework at PinterestFinagle and Java Service Framework at Pinterest
Finagle and Java Service Framework at Pinterest
 
Reason and GraphQL
Reason and GraphQLReason and GraphQL
Reason and GraphQL
 
Mapfilterreducepresentation
MapfilterreducepresentationMapfilterreducepresentation
Mapfilterreducepresentation
 
Broadleaf Presents Thymeleaf
Broadleaf Presents ThymeleafBroadleaf Presents Thymeleaf
Broadleaf Presents Thymeleaf
 
Functional Reactive Programming with RxJS
Functional Reactive Programming with RxJSFunctional Reactive Programming with RxJS
Functional Reactive Programming with RxJS
 
An introduction to Test Driven Development on MapReduce
An introduction to Test Driven Development on MapReduceAn introduction to Test Driven Development on MapReduce
An introduction to Test Driven Development on MapReduce
 
ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...
ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...
ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...
 
Flink Batch Processing and Iterations
Flink Batch Processing and IterationsFlink Batch Processing and Iterations
Flink Batch Processing and Iterations
 
Sparkling Water Meetup
Sparkling Water MeetupSparkling Water Meetup
Sparkling Water Meetup
 

More from Villu Ruusmann

State of the (J)PMML art
State of the (J)PMML artState of the (J)PMML art
State of the (J)PMML art
Villu Ruusmann
 
Converting Scikit-Learn to PMML
Converting Scikit-Learn to PMMLConverting Scikit-Learn to PMML
Converting Scikit-Learn to PMML
Villu Ruusmann
 
The case for (J)PMML
The case for (J)PMMLThe case for (J)PMML
The case for (J)PMML
Villu Ruusmann
 
R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?
Villu Ruusmann
 
Representing TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLRepresenting TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMML
Villu Ruusmann
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) models
Villu Ruusmann
 

More from Villu Ruusmann (6)

State of the (J)PMML art
State of the (J)PMML artState of the (J)PMML art
State of the (J)PMML art
 
Converting Scikit-Learn to PMML
Converting Scikit-Learn to PMMLConverting Scikit-Learn to PMML
Converting Scikit-Learn to PMML
 
The case for (J)PMML
The case for (J)PMMLThe case for (J)PMML
The case for (J)PMML
 
R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?
 
Representing TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLRepresenting TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMML
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) models
 

Recently uploaded

Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 

Recently uploaded (20)

Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 

Converting R to PMML

  • 1. Converting R to PMML Villu Ruusmann Openscoring OÜ
  • 3. R challenge "Fat code around thin model objects": ● Many packages to solve the same problem, many ways of doing and/or representing the same thing ● The unit of reusability is R script ● Model object is tightly coupled to the R script that produced it. Incomplete/split schemas ● No (widely adopted-) pipeline formalization
  • 4. (J)PMML solution Evaluator evaluator = ...; List<InputField> argumentFields = evaluator.getInputFields(); List<ResultField> resultFields = Lists.union(evaluator.getTargetFields(), evaluator.getOutputFields()); Map<FieldName, ?> arguments = readRecord(argumentFields); Map<FieldName, ?> result = evaluator.evaluate(arguments); writeRecord(result, resultFields); ● The R script (data pre-and post-processing, model) is represented using standardized PMML data structures ● Well-defined data entry and exit interfaces
  • 6. R challenge Training: audit_train = read.csv("Audit.csv") audit_train$HourlyIncome = (audit_train$Income / (audit_train$Hours * 52)) audit.model = model(Adjusted ~ ., data = audit_train) saveRDS(audit.model, "Audit.rds") Deployment: audit_test = read.csv("Audit.csv") audit.model = readRDS("Audit.rds") # "Error in eval(expr, envir, enclos) : object 'HourlyIncome' not found" predict(audit.model, newdata = audit_test)
  • 7. R solution Matrix interface ( ): label = audit$Adjusted features = audit[, !(colnames(audit) %in% c("Adjusted"))] # Returns a `model` object audit.model = model(x = features, y = label) Formula interface ( ): # Returns a `model.formula` object audit.model = model(Adjusted ~ . + I(Income / (Hours * 52)), data = audit)
  • 8. From `pmml` to `r2pmml` library("pmml") library("r2pmml") audit = read.csv("Audit.csv") audit$Adjusted = as.factor(audit$Adjusted) audit.glm = glm(Adjusted ~ . + I(Income / (Hours * 52)), data = audit, family = "binomial") saveXML(pmml(audit.glm), "Audit.pmml") r2pmml(audit.glm, "Audit.pmml")
  • 9. Package `pmml` ( ) <PMML> <DataDictionary> <DataField name="I(Income/(Hours * 52))" optype="continuous" dataType="double"/> </DataDictionary> <GeneralRegressionModel> <MiningSchema> <MiningField name="I(Income/(Hours * 52))"/> </MiningSchema> </GeneralRegressionModel> </PMML>
  • 10. Package `r2pmml` ( ) <PMML> <DataDictionary> <DataField name="Income" optype="continuous" dataType="double"/> <DataField name="Hours" optype="continuous" dataType="double"/> </DataDictionary> <TransformationDictionary> <DerivedField name="Income/(Hours * 52)" optype="continuous" dataType="double"> <Apply function="/"><FieldRef field="Income"/><Apply function="*"><FieldRef field="Hours"/><Constant>52</Constant></Apply></Apply> </DerivedField> </TransformationDictionary> <GeneralRegressionModel> <MiningSchema> <MiningField name="Income"/> <MiningField name="Hours"/> </MiningSchema> </GeneralRegressionModel> </PMML>
  • 11. Manual conversion (1/2) r2pmml::r2pmml r2pmml::decorate base::saveRDS RDS file JPMML-R library The R side The Java side
  • 12. Manual conversion (2/2) The R side: audit.model = model(Adjusted ~ ., data = audit) # Decorate the model object with supporting information audit.model = r2pmml::decorate(audit.model, dataset = audit) saveRDS(audit.model, "Audit.rds") The Java side: $ git clone https://github.com/jpmml/jpmml-r.git $ cd jpmml-r; mvn clean package $ java -jar target/converter-executable-1.2-SNAPSHOT.jar --rds-input /path/to/Audit.rds --pmml-output /path/to/Audit.pmml
  • 14. Sanitization (1/2) ?base::ifelse # Static replacement value as.formula(Adjusted ~ ifelse(is.na(Hours), 0, Hours)) # Dynamic replacement value as.formula(paste("Adjusted ~ ifelse((is.na(Hours) | Hours < 0 | Hours > 168),", mean(df$Hours), ", Hours)", sep = ""))
  • 15. Sanitization (2/2) <DerivedField name="ifelse(is.na(Hours))" optype="continuous" dataType="double"> <Extension>ifelse(is.na(Hours), 0, Hours)</Extension> <Apply function="if"> <Apply function="isMissing"> <FieldRef field="Hours"/> </Apply> <Constant>0</Constant> <FieldRef field="Hours"/> </Apply> </DerivedField>
  • 16. Binarization (1/2) ?base::ifelse as.formula(Adjusted ~ ifelse(Hours >= 40, "full-time", "part-time"))
  • 17. Binarization (2/2) <DerivedField name="ifelse(Hours &gt;= 40)" optype="categorical" dataType="string"> <Extension>ifelse(Hours &gt;= 40, "full-time", "part-time")</Extension> <Apply function="if"> <Apply function="greaterOrEqual"> <FieldRef field="Hours"/> <Constant>40</Constant> </Apply> <Constant>full-time</Constant> <Constant>part-time</Constant> </Apply> </DerivedField>
  • 18. Bucketization (1/2) ?base::cut # Static intervals as.formula(Adjusted ~ cut(Income, breaks = 3)) # Dynamic intervals as.formula(Adjusted ~ cut(log10(Income), breaks = quantile(log10(Income)))
  • 19. Bucketization (2/2) <DerivedField name="cut(Income)" optype="categorical" dataType="string"> <Extension>cut(Income, breaks = 3)</Extension> <Discretize field="Income"> <DiscretizeBin binValue="(129,1.61e+05]"> <Interval closure="openClosed" leftMargin="129.0" rightMargin="161000.0"/> </DiscretizeBin> <DiscretizeBin binValue="(1.61e+05,3.21e+05]"> <Interval closure="openClosed" leftMargin="161000.0" rightMargin="321000.0"/> </DiscretizeBin> <DiscretizeBin binValue="(3.21e+05,4.82e+05]"> <Interval closure="openClosed" leftMargin="321000.0" rightMargin="482000.0"/> </DiscretizeBin> </Discretize> </DerivedField>
  • 20. Mathematical expressions (1/2) ?base::I as.formula(Adjusted ~ I(floor(Income / (Hours * 52)))) Supported constructs: ● Arithmetic operators `+`, `-`, `*`, `/`, `^` and `**` ● Relational operators `==`, `!=`, `<`, `<=`, `>=` and `>` ● Logical operators `!`, `&` and `|` ● Functions `abs`, `ceiling`, `exp`, `floor`, `is.na`, `log`, `log10`, `round` and `sqrt`
  • 21. Mathematical expressions (2/2) <DerivedField name="floor(Income/(Hours * 52))" optype="continuous" dataType="double"> <Extension>I(floor(Income/(Hours * 52)))</Extension> <Apply function="floor"> <Apply function="/"> <FieldRef field="Income"/> <Apply function="*"> <FieldRef field="Hours"/> <Constant>52</Constant> </Apply> </Apply> </Apply> </DerivedField>
  • 22. Renaming and regrouping categories (1/2) library("plyr") ?plyr::revalue ?plyr::mapvalues as.formula(Adjusted ~ plyr::revalue(Employment, c("PSFederal" = "Public", "PSState" = "Public", "PSLocal" = "Public"))) as.formula(Adjusted ~ plyr::revalue(Education, c("Yr1t4" = "Yr1t6", "Yr5t6" = "Yr1t6", "Yr7t8" = "Yr7t9", "Yr9" = "Yr7t9", "Yr10" = "Yr10t12", "Yr11" = "Yr10t12", "Yr12" = "Yr10t12"))) as.formula(Adjusted ~ plyr::mapvalues(Gender, c("Male", "Female"), c(0, 1)))
  • 23. Renaming and regrouping categories (2/2) <DerivedField name="revalue(Employment)" optype="categorical" dataType="string"> <Extension>plyr::revalue(Employment, c(PSFederal = "Public", PSState = "Public", PSLocal = "Public"))</Extension> <MapValues outputColumn="to"> <FieldColumnPair field="Employment" column="from"/> <InlineTable> <row><from>PSFederal</from><to>Public</to></row> <row><from>PSState</from><to>Public</to></row> <row><from>PSLocal</from><to>Public</to></row> <row><from>Consultant</from><to>Consultant</to></row> <row><from>Private</from><to>Private</to></row> <row><from>SelfEmp</from><to>SelfEmp</to></row> <row><from>Volunteer</from><to>Volunteer</to></row> </InlineTable> </MapValues> </DerivedField>
  • 24. Interactions as.formula(Adjusted ~ # Continuous vs. continuous Age:Income + # Continuous vs. transformed continuous Age:I(log10(Income)) + # Continuous vs. categorical Gender:Income + # Categorical vs. categorical (Gender + Education) ^ 2 )
  • 26. Alternative encodings (1/2) # Continuous label ~ Continuous and categorical features audit.glm = glm(Income ~ Age + Employment + Education + Marital + Occupation + Gender + Hours, data = audit, family = "gaussian") # Encoded as GeneralRegressionModel element r2pmml(audit.glm, "Audit.pmml") # Encoded as RegressionModel element r2pmml(audit.glm, "Audit.pmml", converter = "org.jpmml.rexp.LMConverter")
  • 27. Alternative encodings (2/2) # Binary label ~ Categorical features audit.glm = glm(Adjusted ~ cut(Age, breaks = c(0, 21, 65, 100)) + Employment + Education + Marital + Occupation + cut(Income, breaks = quantile(Income)) + Gender + ifelse(Hours >= 40, "full-time", "part-time"), data = audit, family = "binomial") audit.scorecard = r2pmml::as.scorecard(audit.glm) # Encoded as Scorecard element r2pmml(audit.scorecard, "Audit.pmml")
  • 29. Software (Nov 2017) ● The R side: ○ base 3.3(.1) ○ pmml 1.5(.2) ○ r2pmml 0.15(.0) ○ plyr 1.8(.4) ● The Java side: ○ JPMML-R 1.2(.20) ○ JPMML-Evaluator 1.3(.10)