Pattern: An Open Source Project for Migrating Predictive Models from SAS

DRIVING INNOVATION THROUGH DATA
PATTERN: AN OPEN SOURCE PROJECT FOR MIGRATING PREDICTIVE MODELS FROM SAS, ETC., ONTO HADOOP
Alexis Roos | June 3 2014 | Hadoop Summit

Copyright 2014, Concurrent Inc. Confidential
PATTERN IN A NUTSHELL
Pattern is:
• An open source project that runs on top of Cascading to support scoring of PMML models (exported from R, SAS, etc.) at scale on Hadoop.
• Models are reused and deployed within Cascading workflows.

AGENDA
• Cascading
• PMML and Cascading
• Pattern Scenarios
• Demo

Experiments - comparing models
• Much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
• Run multiple variants, then measure relative “lift”
• Concurrent runtime: tag and track models

The following example compares two models trained with different machine learning algorithms.
The comparison is exaggerated: one model has an important variable intentionally omitted to help illustrate the experiment.

Experiments - Random Forest model
## load the required packages
library(randomForest)
library(pmml)
library(XML)

## load the "baseline" reference data
dat_folder <- '.'
data <- read.table(file=paste(dat_folder, "data/orders.tsv", sep="/"), sep="\t", quote="",
                   na.strings="NULL", header=TRUE, encoding="UTF8")

## split data into test and train sets
set.seed(71)
split_ratio <- 2/10
split <- round(dim(data)[1] * split_ratio)
data_tests <- data[1:split,]
data_train <- data[(split + 1):dim(data)[1],]

## drop the order_id column from the training set
i <- colnames(data_train) == "order_id"
j <- 1:length(i)
data_train <- data_train[,-j[i]]

## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=25)

## inspect variable importance and the out-of-bag (OOB) error estimate
print(fit$importance)
print(fit)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "data/antifraud.rf.xml", sep="/"))
OOB estimate of error rate: 13.12%
Confusion matrix:
   0  1 class.error
0 57  9   0.1363636
1 12 82   0.1276596
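The figures in this output can be reproduced by hand from the four counts in the confusion matrix. The following is a minimal sketch, assuming rows are actual labels and columns are predicted labels (the layout randomForest prints); the class and method names are illustrative and not part of Pattern or Cascading.

```java
public class ConfusionMatrixCheck {
    // Per-class error for row `actual`: misclassified count divided by the row total.
    static double classError(int[][] m, int actual) {
        int total = 0;
        for (int count : m[actual]) {
            total += count;
        }
        return (double) (total - m[actual][actual]) / total;
    }

    // Overall error rate: all off-diagonal counts divided by all counts.
    static double errorRate(int[][] m) {
        int total = 0;
        int wrong = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i != j) {
                    wrong += m[i][j];
                }
            }
        }
        return (double) wrong / total;
    }

    public static void main(String[] args) {
        // Counts copied from the slide: rows = actual (0, 1), columns = predicted (0, 1).
        int[][] m = { {57, 9}, {12, 82} };
        System.out.printf("class 0 error: %.7f%n", classError(m, 0)); // 9 / 66
        System.out.printf("class 1 error: %.7f%n", classError(m, 1)); // 12 / 94
        System.out.printf("overall error: %.5f%n", errorRate(m));     // 21 / 160
    }
}
```

The overall rate, 21 misclassified out of 160 rows, is the 13.12% that the output reports as the OOB estimate.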

<?xml version="1.0"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_1 http://www.dmg.org/v4-1/pmml-4-1.xsd">
  <Header copyright="Copyright (c) 2014 alexisroos" description="Random Forest Tree Model">
    <Extension name="user" value="alexisroos" extender="Rattle/PMML"/>
    <Application name="Rattle/PMML" version="1.4"/>
    <Timestamp>2014-02-17 22:11:37</Timestamp>
  </Header>
  <DataDictionary numberOfFields="4">
    <DataField name="label" optype="categorical" dataType="string">
      <Value value="0"/>
      <Value value="1"/>
    </DataField>
    <DataField name="var0" optype="continuous" dataType="double"/>
  </DataDictionary>
  <MiningModel modelName="randomForest_Model" functionName="classification">
    <MiningSchema>
      <MiningField name="label" usageType="predicted" invalidValueTreatment="asIs"/>
      <MiningField name="var0" usageType="active" invalidValueTreatment="asIs"/>
    </MiningSchema>
    <Output>
      <OutputField name="Predicted_label" feature="predictedValue"/>
      <OutputField name="Probability_0" optype="continuous" dataType="double" feature="probability" value="0"/>
...

In pattern/pattern-examples:

gradle clean jar
hadoop dfs -rmr out/classify
hadoop jar build/libs/pattern-examples-*.jar data/sample.tsv out/classify --pmml data/antifraud.rf.xml
hadoop dfs -cat out/classify/part-*

CASCADING OVERVIEW
• Enterprise Grade - Proven application development framework for building robust and complex Big Data applications, with thousands of deployments.
• Productive - Cascading relies on software patterns to provide an optimal level of abstraction, greatly simplifying the creation, testing, deployment, and operation of applications by focusing on business logic first.
• Flexible & Extensible - Runs on all popular Hadoop distributions, but is not limited to Hadoop. An easily extensible framework supporting a variety of extensions, tools, and other integrations.
[Diagram: Data Applications (ETL, Analytics, Data Processing, Machine Learning) on top of Hadoop, On-Premise or Cloud]

CASCADING ECOSYSTEM
[Diagram]
• Languages: Clojure, SQL
• Projects: LINGUAL, PATTERN, SCALDING, CASCALOG
• On-Premise Deployments: Hadoop Distributions
• Other Data Stores: RDBMS, MPP, EDW

BUSINESSES DEPEND ON US
[Customer logos]
• >30% of Marketplace’s 1000-node Hadoop cluster runs Cascading applications
• Cascading powers Revenues, Publisher Analytics, and User Engagement applications
• Built their business for weather insurance using Cascading; sold to Monsanto for $950MM
• Standardize on Cascading for their fraud detection business

CASCADING - DATA APPS
• Enterprise IT: Extract Transform Load, Log File Analysis, Systems Integration, Operations Analysis
• Corporate Apps: HR Analytics, Employee Behavioral Analysis, Customer Support | eCRM, Business Reporting
• Telecom: Data processing of Open Data, Geospatial Indexing, Consumer Mobile Apps, Location based services
• Marketing / Retail: Mobile, Social, Search Analytics; Funnel analysis; Revenue attribution; Customer experiments; Ad Optimization; Retail recommenders
• Consumer / Entertainment: Music Recommendation, Comparison Shopping, Restaurant Rankings, Real Estate, Rental Listings, Travel Search & Forecast
• Finance: Fraud and Anomaly Detection, Fraud Experiments, Customer Analytics, Insurance Risk Metric
• Health / Biotech: Aggregate metrics for Govt, Person biometrics, Veterinary diagnostics, Next-Gen Genomics, Agronomics, Environmental Maps

CASCADING - MODEL METAPHOR
The Cascading processing model is based on a metaphor of flows composed from patterns.
[Diagram labels: Source Tap, Sink Tap, Pipe, Pipe Assembly, Tuple, Stream, Fields, Flow]
• Data is represented as flows of tuples.
• Pipes let you manage a data flow in a functional programming style: splitting, merging, filtering, parsing, transforming, grouping, aggregating, buffering, joining, etc.

CASCADE
• A Cascade joins together multiple flows and executes them based on their dependencies.

FLOWS EXECUTION
• Flow planners make Flows independent of the execution platform; the planner is responsible for defining, sharing, and executing data-processing workflows.
• Currently there are two kinds of flow planners:
  - Local
  - Hadoop (1 & 2)
• Allows for “fail fast”:
  - The flow planners can check completeness of flows, operations, type safety, etc.
• Maps the pipe assembly to MapReduce in a deterministic way
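To make the "fail fast" idea concrete, here is a minimal sketch of a plan-time check in plain Java. It is not Cascading's planner: the `Step` type and `validate` method are hypothetical stand-ins, and the only assumption is that each operation declares which fields it needs and which it adds, so a missing field can be reported before any data is read.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FailFastCheck {
    // Hypothetical stand-in for one operation in a pipe assembly:
    // it needs some incoming fields and adds some result fields.
    static final class Step {
        final String name;
        final List<String> needs;
        final List<String> adds;

        Step(String name, List<String> needs, List<String> adds) {
            this.name = name;
            this.needs = needs;
            this.adds = adds;
        }
    }

    // Walk the steps in order, tracking which fields are available.
    // Throws before any data is read if a step needs a field nobody produced.
    static Set<String> validate(List<String> sourceFields, List<Step> steps) {
        Set<String> available = new LinkedHashSet<>(sourceFields);
        for (Step step : steps) {
            for (String field : step.needs) {
                if (!available.contains(field)) {
                    throw new IllegalStateException(
                        "step '" + step.name + "' needs missing field '" + field + "'");
                }
            }
            available.addAll(step.adds);
        }
        return available;
    }

    public static void main(String[] args) {
        List<Step> steps = Arrays.asList(
            new Step("parse", Arrays.asList("line"), Arrays.asList("ip", "size")),
            new Step("count", Arrays.asList("ip"), Arrays.asList("IPcount")));
        // Succeeds because "line" is a source field and "ip" is added by "parse".
        System.out.println(validate(Arrays.asList("line"), steps));
    }
}
```

Swapping the source field for one that "parse" does not expect makes `validate` throw immediately, which is the behaviour the slide attributes to the flow planners.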

FLOWS EXECUTION
[Diagram: on the client, a Flow is defined as a pipe assembly (Source, func, aggr, GroupBy, CoGroup, temp, Sink); on the cluster, it runs as Jobs made of Map and Reduce phases]
Cascading automatically generates MapReduce jobs for the specified platform.

CASCADE - EXAMPLE CODE
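• Top 10 IPs for an Apache log file
import cascading.operation.aggregator.Count;
import cascading.operation.filter.Limit;
import cascading.operation.regex.RegexParser;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

// Parse each log line into the ip, time, request, response, and size fields.
RegexParser parser = new RegexParser(new Fields("ip", "time", "request", "response", "size"),
    "^([^ ]*) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) ([^ ]*).*$", new int[]{1, 2, 3, 4, 5});
// Group by IP and count the requests for each one.
Pipe processPipe = new Each("processPipe", new Fields("line"), parser, Fields.RESULTS);
processPipe = new GroupBy(processPipe, new Fields("ip"));
processPipe = new Every(processPipe, Fields.GROUP, new Count(new Fields("IPcount")), Fields.ALL);
// Sort by count in descending order and keep the first 10 rows.
Pipe sortedCountByIpPipe = new GroupBy(processPipe, new Fields("IPcount"), true);
sortedCountByIpPipe = new Each(sortedCountByIpPipe, new Fields("IPcount"), new Limit(10));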
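For readers without a Cascading setup, the same parse, group, count, sort, and limit steps can be sketched with plain Java streams. This is an in-memory analogue for illustration only, not the Cascading API; the class name, the sample log lines, and the alphabetical tie-break are assumptions added here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TopIps {
    // Same capture groups as the slide's pattern: ip, time, request, response, size.
    static final Pattern LOG = Pattern.compile(
        "^([^ ]*) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) ([^ ]*).*$");

    // Parse each line, group by IP, count, sort by count descending, keep `limit` rows.
    static List<Map.Entry<String, Long>> topIps(List<String> lines, int limit) {
        Map<String, Long> counts = lines.stream()
            .map(LOG::matcher)
            .filter(Matcher::matches) // lines that do not parse are skipped
            .collect(Collectors.groupingBy(m -> m.group(1), Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey())) // ties: alphabetical
            .limit(limit)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample lines in Apache access-log layout.
        List<String> lines = Arrays.asList(
            "10.0.0.1 - - [03/Jun/2014:10:00:00 -0700] \"GET /a HTTP/1.1\" 200 512",
            "10.0.0.2 - - [03/Jun/2014:10:00:01 -0700] \"GET /b HTTP/1.1\" 200 128",
            "10.0.0.1 - - [03/Jun/2014:10:00:02 -0700] \"GET /c HTTP/1.1\" 404 0");
        topIps(lines, 10).forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```

The difference in practice is where the work runs: the stream version holds everything in one JVM, while the pipe assembly is planned into MapReduce jobs.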

CASCADE - EXAMPLE CODE - DRIVEN

AGENDA
• Cascading
• PMML in Cascading
• Demo

PREDICTIVE MODEL MARKUP LANGUAGE (PMML)
• Established XML standard for predictive model markup (specifies the model, not an implementation of the model)
• Organized by the Data Mining Group (DMG) since 1997, http://dmg.org/
• Open standard for data mining and statistical models
• PMML producers: applications that create predictive models
• PMML consumers: applications that read or consume models

“PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”

PMML MODEL COVERAGE
• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classifiers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• Support Vector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element

PMML VENDORS COVERAGE

BUILDING AND RUNNING PMML MODELS
[Diagram]
• Model Producer (Training): Data → ETL, prepare data → explore data and build model using regression, clustering, etc. → PMML model
• Model Consumer (Scoring, with PATTERN): New Data → ETL, prepare data → score with the PMML model → scores → Post Processing
• Measure and improve model

PATTERN: CREATE A MODEL IN R
## train a RandomForest model
f <- as.formula("as.factor(species) ~ .")
fit <- randomForest(f, data=iris_train, proximity=TRUE, ntree=50)

## print the fitted model (OOB error estimate and confusion matrix)
print(fit)

## predict labels for the full data set
out <- iris_full
out$predict <- predict(fit, out, type="class")

## export predicted labels to TSV
write.table(out, file=paste(dat_folder, "iris.rf.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

PATTERN: CAPTURE MODEL IN PMML
<?xml version="1.0"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_1 http://www.dmg.org/v4-1/pmml-4-1.xsd">
  <Header copyright="Copyright (c) 2014 alexisroos" description="Random Forest Tree Model">
    <Extension name="user" value="alexisroos" extender="Rattle/PMML"/>
    <Application name="Rattle/PMML" version="1.4"/>
    <Timestamp>2014-06-02 18:04:36</Timestamp>
  </Header>
  <DataDictionary numberOfFields="5">
    <DataField name="species" optype="categorical" dataType="string">
      <Value value="setosa"/>
      <Value value="versicolor"/>
      <Value value="virginica"/>
    </DataField>
    <DataField name="sepal_length" optype="continuous" dataType="double"/>
    <DataField name="sepal_width" optype="continuous" dataType="double"/>
    <DataField name="petal_length" optype="continuous" dataType="double"/>
    <DataField name="petal_width" optype="continuous" dataType="double"/>
  </DataDictionary>
  <MiningModel modelName="randomForest_Model" functionName="classification">
    <MiningSchema>
      <MiningField name="species" usageType="predicted" invalidValueTreatment="asIs"/>
      <MiningField name="sepal_length" usageType="active" invalidValueTreatment="asIs"/>
      <MiningField name="sepal_width" usageType="active" invalidValueTreatment="asIs"/>
      <MiningField name="petal_length" usageType="active" invalidValueTreatment="asIs"/>
      <MiningField name="petal_width" usageType="active" invalidValueTreatment="asIs"/>
    </MiningSchema>
    <Output>
      <OutputField name="Predicted_species" feature="predictedValue"/>
      <OutputField name="Probability_setosa" optype="continuous" dataType="double" feature="probability" value="setosa"/>
      <OutputField name="Probability_versicolor" optype="continuous" dataType="double" feature="probability" value="versicolor"/>
      <OutputField name="Probability_virginica" optype="continuous" dataType="double" feature="probability" value="virginica"/>
    </Output>
...
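Because PMML is plain XML, a consumer can find out which input fields a model expects before scoring any data. The sketch below uses only the JDK's DOM parser; it does not show how Pattern reads PMML internally, and the class name and the abbreviated `SAMPLE` document (a cut-down version of the model above) are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PmmlFields {
    // Abbreviated PMML document; only the MiningSchema is kept.
    static final String SAMPLE =
        "<PMML version=\"4.1\" xmlns=\"http://www.dmg.org/PMML-4_1\">"
        + "<MiningModel modelName=\"randomForest_Model\" functionName=\"classification\">"
        + "<MiningSchema>"
        + "<MiningField name=\"species\" usageType=\"predicted\"/>"
        + "<MiningField name=\"sepal_length\" usageType=\"active\"/>"
        + "<MiningField name=\"sepal_width\" usageType=\"active\"/>"
        + "<MiningField name=\"petal_length\" usageType=\"active\"/>"
        + "<MiningField name=\"petal_width\" usageType=\"active\"/>"
        + "</MiningSchema>"
        + "</MiningModel>"
        + "</PMML>";

    // Return the names of all MiningField elements with the given usageType.
    static List<String> miningFields(String pmml, String usageType) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(pmml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("MiningField");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                if (usageType.equals(field.getAttribute("usageType"))) {
                    names.add(field.getAttribute("name"));
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse PMML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("active:    " + miningFields(SAMPLE, "active"));
        System.out.println("predicted: " + miningFields(SAMPLE, "predicted"));
    }
}
```

The "active" fields are the columns the incoming data must provide; the "predicted" field is the one the model produces.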

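PATTERN: REUSE A MODEL
import java.io.File;
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pattern.pmml.PMMLPlanner;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import joptsimple.OptionParser;
import joptsimple.OptionSet;

public class Main {
  public static void main(String[] args) throws RuntimeException {
    String inputPath = args[0];
    String classifyPath = args[1];

    // set up the Hadoop flow connector for this application jar
    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, Main.class);
    HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

    // source and sink taps: tab-delimited text with a header line
    Tap inputTap = new Hfs(new TextDelimited(true, "\t"), inputPath);
    Tap classifyTap = new Hfs(new TextDelimited(true, "\t"), classifyPath);

    // parse the optional --pmml argument
    OptionParser optParser = new OptionParser();
    optParser.accepts("pmml").withRequiredArg();
    OptionSet options = optParser.parse(args);

    FlowDef flowDef = FlowDef.flowDef()
      .setName("classify")
      .addSource("input", inputTap)
      .addSink("classify", classifyTap);

    // let the PMML planner build the scoring assembly from the model file
    if (options.hasArgument("pmml")) {
      String pmmlPath = (String) options.valuesOf("pmml").get(0);
      PMMLPlanner pmmlPlanner = new PMMLPlanner()
        .setPMMLInput(new File(pmmlPath))
        .retainOnlyActiveIncomingFields()
        .setDefaultPredictedField(new Fields("predict", Double.class)); // default value if missing from the model
      flowDef.addAssemblyPlanner(pmmlPlanner);
    }

    Flow classifyFlow = flowConnector.connect(flowDef);
    classifyFlow.complete();
  }
}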

PATTERN: SCORE A MODEL
## run an RF classifier at scale
hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml

## run an RF classifier at scale, assert regression test, measure confusion matrix
hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml --measure out/measure

## run a predictive model at scale, measure RMSE
hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml --rmse out/measure
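The slides do not show how the demo's `--rmse` option is implemented. As a reference point, the standard root-mean-square error between actual and predicted values can be sketched as follows; the class name and sample numbers are made up for illustration.

```java
public class Rmse {
    // Root-mean-square error: square root of the mean squared difference
    // between each actual value and its prediction.
    static double rmse(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / actual.length);
    }

    public static void main(String[] args) {
        // Made-up values: only the last prediction is off, by 2.
        double[] actual = {1.0, 2.0, 3.0, 4.0};
        double[] predicted = {1.0, 2.0, 3.0, 6.0};
        // Squared errors sum to 4, the mean is 1, and the square root is 1.
        System.out.println(rmse(actual, predicted));
    }
}
```

RMSE suits models with a numeric output such as regression, while the confusion matrix from `--measure` is the natural check for classifiers.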

PATTERN: ALGOS IMPLEMENTED
• Hierarchical Clustering
• K-Means Clustering
• Linear Regression
• Logistic Regression
• Random Forest

Also, model chaining and general support for ensembles.
Algorithms can be added or extended based on customer use cases.
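An ensemble such as a random forest classifier combines the predictions of its member models into one answer. The slides do not describe how Pattern does this, so the following is only a generic majority-vote sketch; the class name and the alphabetical tie-break are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MajorityVote {
    // Return the label predicted by the most models.
    // Ties go to the alphabetically first label (an arbitrary choice for this sketch).
    static String vote(List<String> predictions) {
        if (predictions.isEmpty()) {
            throw new IllegalArgumentException("no predictions to combine");
        }
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys give a stable tie-break
        for (String label : predictions) {
            counts.merge(label, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Five hypothetical trees voting on one iris record.
        List<String> treeVotes = Arrays.asList(
            "versicolor", "virginica", "versicolor", "setosa", "versicolor");
        System.out.println(vote(treeVotes));
    }
}
```

For regression ensembles the usual counterpart is averaging the member outputs rather than counting votes.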

BUILDING AND RUNNING PMML MODELS
[Diagram]
• Model Producer (Training): Data → ETL, prepare data (LINGUAL) → explore data and build model using regression, clustering, etc. → PMML model
• Model Consumer (Scoring, with PATTERN): New Data → ETL, prepare data (LINGUAL) → score with the PMML model → scores → Post Processing
• Measure and improve model

PATTERN: SINGLE MODEL ARCHITECTURE
Cascading allows multiple departments to combine their workflow components into a single integrated app (jar), based on 100% open source, that can be managed by a single tool.
[Diagram: Data → ETL and Data preparation with LINGUAL (ANSI SQL) → Predictive Model with PATTERN (PMML), all on CASCADING]
• decrease the project costs…
• reduce licensing costs…

PATTERN BENEFITS
• Can score data and run experiments at scale on Hadoop
• Run different models using ensembles
• In turn, this makes it possible to improve existing models and their accuracy

PATTERN: ARCHITECTURE
[Diagram: Data → ETL → Data preparation → Predictive Model]
FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "data.source1", emplTap )
  .addSource( "data.source2", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );

PATTERN: ARCHITECTURE
[Diagram: Data → ETL → Data preparation → Predictive Model]
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );

PATTERN: DEMO
1. Generate the model in R
2. Examine the PMML model
3. Write and run a Cascading app to score the model

KEY TAKEAWAYS
• Reuse existing learning models and investments to run data scoring at scale
• Leverage existing skill sets: Java, Scala, SQL, PMML, etc.
• Allow teams to collaborate on a single model that can be visualized, managed, and monitored

@ALEXISROOS
QUESTIONS?
