DRIVING
INNOVATION
THROUGH
DATA
PATTERN: AN OPEN SOURCE PROJECT FOR MIGRATING
PREDICTIVE MODELS FROM SAS, ETC., ONTO HADOOP
Alexis Roos | June 3 2014 | Hadoop Summit
Copyright 2014, Concurrent Inc. Confidential
PATTERN IN A NUTSHELL
Pattern is:
• An open source project that works on top of Cascading to support scoring of PMML models (from R, SAS, etc.) at scale on Hadoop.
• Models are reused and deployed within Cascading workflows.
AGENDA
• Cascading
• PMML and Cascading
• Pattern Scenarios
• Demo
CASCADING OVERVIEW
• Enterprise Grade - Proven application development framework for building robust and complex Big Data applications, with thousands of deployments.
• Productive - Cascading relies on software patterns to provide an optimal level of abstraction, which greatly simplifies the creation, testing, deployment and operation of applications by focusing on business logic first.
• Flexible & Extensible - Runs on all popular Hadoop distributions, but is not limited to Hadoop. Easily extensible framework supporting a variety of extensions, tools, and other integrations.
[Diagram: Data Applications (ETL, Analytics, Data Processing, Machine Learning) on Hadoop, On-Premise or Cloud]
CASCADING ECOSYSTEM
[Diagram: Languages: SQL (Lingual), Pattern, Scalding, Clojure (Cascalog). On-Premise Deployments: Hadoop Distributions. Other Data Stores: RDBMS, MPP, EDW.]
BUSINESSES DEPEND ON US
• >30% of Marketplace’s 1000 node Hadoop cluster runs Cascading applications
• Cascading powers Revenues, Publisher Analytics, and User Engagement applications
• Built their business for weather insurance using Cascading; sold to Monsanto for $950MM
• Standardized on Cascading for their fraud detection business
CASCADING - DATA APPS
Enterprise IT
• Extract Transform Load
• Log File Analysis
• Systems Integration
• Operations Analysis
Corporate Apps
• HR Analytics
• Employee Behavioral Analysis
• Customer Support | eCRM
• Business Reporting
Telecom
• Data processing of Open Data
• Geospatial Indexing
• Consumer Mobile Apps
• Location based services
Marketing / Retail
• Mobile, Social, Search Analytics
• Funnel analysis
• Revenue attribution
• Customer experiments
• Ad Optimization
• Retail recommenders
Consumer / Entertainment
• Music Recommendation
• Comparison Shopping
• Restaurant Rankings
• Real Estate
• Rental Listings
• Travel Search & Forecast
Finance
• Fraud and Anomaly Detection
• Fraud Experiments
• Customer Analytics
• Insurance Risk Metric
Health / Biotech
• Aggregate metrics for Govt
• Person biometrics
• Veterinary diagnostics
• Next-Gen Genomics
• Agronomics
• Environmental Maps
CASCADING - MODEL METAPHOR
The Cascading processing model is based on a metaphor of flows based on patterns.
• Data is represented as flows of tuples.
• Pipes allow you to manage a data flow through functional programming: splitting, merging, filtering, parsing, transforming, grouping, aggregating, buffering, joining, etc.
[Diagram labels: Source Tap, Sink Tap, Pipe, Tuple, Stream, Pipe Assembly, Flow, Fields]
CASCADE
• A Cascade joins together multiple flows and executes them based on their dependencies.
FLOWS EXECUTION
• Flow planners allow Flows to be independent of the execution platform; the query planner is responsible for defining, sharing, and executing data-processing workflows
• Currently there are two kinds of flow planners
  - Local
  - Hadoop (1 & 2)
• Allows for “fail fast”
  - The flow planners can check completeness of flows, operations, type safety, etc.
• Maps the pipe assembly to MapReduce in a deterministic way
[Diagram labels: Client (Assembly, Flow), Cluster (Job, Job); the steps Source, func, CoGroup, aggr, GroupBy, temp and Sink are grouped into Map and Reduce phases]
Cascading automatically generates MapReduce jobs for the specified platform.
CASCADE - EXAMPLE CODE
• Top 10 IPs for Apache log file
// Parse each log line into the fields ip, time, request, response and size
RegexParser parser = new RegexParser(new Fields("ip", "time", "request", "response", "size"),
    "^([^ ]*) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) ([^ ]*).*$", new int[]{1, 2, 3, 4, 5});

// Group by IP and count the requests per IP
Pipe processPipe = new Each("processPipe", new Fields("line"), parser, Fields.RESULTS);
processPipe = new GroupBy(processPipe, new Fields("ip"));
processPipe = new Every(processPipe, Fields.GROUP, new Count(new Fields("IPcount")), Fields.ALL);

// Sort by count in reverse (descending) order and keep 10 tuples
Pipe sortedCountByIpPipe = new GroupBy(processPipe, new Fields("IPcount"), true);
sortedCountByIpPipe = new Each(sortedCountByIpPipe, new Fields("IPcount"), new Limit(10));
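To make the data flow concrete, the same computation can be sketched in plain Java without Cascading. This is only an illustration under assumptions: the class name TopIpsSketch, the method topIps and the sample log lines are hypothetical, and a single-JVM loop says nothing about how Cascading plans the work on a cluster. The regular expression assumes the usual backslash escapes of an Apache access log pattern.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-JDK sketch (no Cascading): parse log lines, count per IP, keep the top N.
final class TopIpsSketch {
    // Capture groups: ip, time, request, response, size.
    static final Pattern LOG_LINE = Pattern.compile(
        "^([^ ]*) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) ([^ ]*).*$");

    // Returns up to n (ip, count) entries, highest count first.
    static List<Map.Entry<String, Long>> topIps(List<String> lines, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = LOG_LINE.matcher(line);
            if (m.matches()) {                            // parse step
                counts.merge(m.group(1), 1L, Long::sum);  // group by ip and count
            }
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        // Descending by count; ties are broken by IP so the result is deterministic.
        sorted.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return sorted.subList(0, Math.min(n, sorted.size())); // limit step
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, for illustration only.
        List<String> lines = List.of(
            "10.0.0.1 - - [03/Jun/2014:10:00:00 -0700] \"GET / HTTP/1.1\" 200 512",
            "10.0.0.2 - - [03/Jun/2014:10:00:01 -0700] \"GET /a HTTP/1.1\" 404 0",
            "10.0.0.1 - - [03/Jun/2014:10:00:02 -0700] \"GET /b HTTP/1.1\" 200 128");
        for (Map.Entry<String, Long> entry : topIps(lines, 10)) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Lines that do not match the pattern are skipped silently here; a real job would more likely route them to a trap or a counter.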
CASCADE - EXAMPLE CODE - DRIVEN
[Slide images not reproduced]
AGENDA
• Cascading
• PMML in Cascading
• Pattern Scenarios
• Demo
PREDICTIVE MODEL MARKUP LANGUAGE (PMML)
• Established XML standard for predictive model markup (specifies the model, not an implementation of the model)
• Organized by the Data Mining Group (DMG), since 1997, http://dmg.org/
• Open standards for data mining and statistical models
• PMML producers: applications that create predictive models
• PMML consumers: applications that read or consume models
“PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”
PMML MODEL COVERAGE
• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classifiers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• Support Vector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element
PMML VENDORS COVERAGE
BUILDING AND RUNNING PMML MODELS
[Diagram: Training: data goes through "ETL, prepare data" to a model producer, which explores the data and builds a model using regression, clustering, etc., and outputs a PMML model. Scoring: new data goes through "ETL, prepare data" to a model consumer (PATTERN), which applies the PMML model and outputs scores. Post-processing: measure and improve the model.]
PATTERN: CREATE A MODEL IN R
library(randomForest)
library(pmml)
library(XML)

## iris_train, iris_full and dat_folder are assumed to be defined earlier (not shown on the slide)

## train a RandomForest model
f <- as.formula("as.factor(species) ~ .")
fit <- randomForest(f, data=iris_train, proximity=TRUE, ntree=50)

## inspect the fitted model (OOB error estimate and confusion matrix)
print(fit)

out <- iris_full
out$predict <- predict(fit, out, type="class")

## export predicted labels to TSV
write.table(out, file=paste(dat_folder, "iris.rf.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
PATTERN: CAPTURE MODEL IN PMML
<?xml version="1.0"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_1 http://www.dmg.org/v4-1/pmml-4-1.xsd">
  <Header copyright="Copyright (c) 2014 alexisroos" description="Random Forest Tree Model">
    <Extension name="user" value="alexisroos" extender="Rattle/PMML"/>
    <Application name="Rattle/PMML" version="1.4"/>
    <Timestamp>2014-06-02 18:04:36</Timestamp>
  </Header>
  <DataDictionary numberOfFields="5">
    <DataField name="species" optype="categorical" dataType="string">
      <Value value="setosa"/>
      <Value value="versicolor"/>
      <Value value="virginica"/>
    </DataField>
    <DataField name="sepal_length" optype="continuous" dataType="double"/>
    <DataField name="sepal_width" optype="continuous" dataType="double"/>
    <DataField name="petal_length" optype="continuous" dataType="double"/>
    <DataField name="petal_width" optype="continuous" dataType="double"/>
  </DataDictionary>
  <MiningModel modelName="randomForest_Model" functionName="classification">
    <MiningSchema>
      <MiningField name="species" usageType="predicted" invalidValueTreatment="asIs"/>
      <MiningField name="sepal_length" usageType="active" invalidValueTreatment="asIs"/>
      <MiningField name="sepal_width" usageType="active" invalidValueTreatment="asIs"/>
      <MiningField name="petal_length" usageType="active" invalidValueTreatment="asIs"/>
      <MiningField name="petal_width" usageType="active" invalidValueTreatment="asIs"/>
    </MiningSchema>
    <Output>
      <OutputField name="Predicted_species" feature="predictedValue"/>
      <OutputField name="Probability_setosa" optype="continuous" dataType="double" feature="probability" value="setosa"/>
      <OutputField name="Probability_versicolor" optype="continuous" dataType="double" feature="probability" value="versicolor"/>
      <OutputField name="Probability_virginica" optype="continuous" dataType="double" feature="probability" value="virginica"/>
    </Output>
    ...
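Because PMML is plain XML, a consumer can inspect a model file with any XML parser before scoring. The sketch below lists the DataField entries of a DataDictionary using only the JDK's DOM API. It is an illustration, not how Pattern reads PMML: the class name PmmlFieldsSketch, the method dataFields and the shortened inline document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Plain-JDK sketch: list the fields declared in a PMML DataDictionary.
final class PmmlFieldsSketch {
    // Returns "name:optype:dataType" for each DataField, in document order.
    static List<String> dataFields(String pmmlXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // PMML elements live in a namespace
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(pmmlXml.getBytes(StandardCharsets.UTF_8)));
            // "*" matches any namespace, so the PMML version does not matter here.
            NodeList nodes = doc.getElementsByTagNameNS("*", "DataField");
            List<String> fields = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                fields.add(field.getAttribute("name") + ":"
                    + field.getAttribute("optype") + ":"
                    + field.getAttribute("dataType"));
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse PMML document", e);
        }
    }

    public static void main(String[] args) {
        // Shortened, hypothetical document modeled on the excerpt above.
        String pmml = "<?xml version=\"1.0\"?>"
            + "<PMML version=\"4.1\" xmlns=\"http://www.dmg.org/PMML-4_1\">"
            + "<DataDictionary numberOfFields=\"2\">"
            + "<DataField name=\"species\" optype=\"categorical\" dataType=\"string\">"
            + "<Value value=\"setosa\"/></DataField>"
            + "<DataField name=\"sepal_length\" optype=\"continuous\" dataType=\"double\"/>"
            + "</DataDictionary></PMML>";
        dataFields(pmml).forEach(System.out::println);
    }
}
```

A check like this can catch a mismatch between the model's expected input fields and the columns of the data to be scored before a job is submitted.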
PATTERN: REUSE A MODEL
import java.io.File;
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pattern.pmml.PMMLPlanner;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import joptsimple.OptionParser;
import joptsimple.OptionSet;

public static void main(String[] args) throws RuntimeException {
    String inputPath = args[0];
    String classifyPath = args[1];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, Main.class);
    HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

    // tab-delimited input and output, both with a header row
    Tap inputTap = new Hfs(new TextDelimited(true, "\t"), inputPath);
    Tap classifyTap = new Hfs(new TextDelimited(true, "\t"), classifyPath);

    OptionParser optParser = new OptionParser();
    optParser.accepts("pmml").withRequiredArg();
    OptionSet options = optParser.parse(args);

    FlowDef flowDef = FlowDef.flowDef()
        .setName("classify")
        .addSource("input", inputTap)
        .addSink("classify", classifyTap);

    // let the PMML planner build the scoring assembly from the model file
    if (options.hasArgument("pmml")) {
        String pmmlPath = (String) options.valuesOf("pmml").get(0);
        PMMLPlanner pmmlPlanner = new PMMLPlanner()
            .setPMMLInput(new File(pmmlPath))
            .retainOnlyActiveIncomingFields()
            .setDefaultPredictedField(new Fields("predict", Double.class)); // default value if missing from the model
        flowDef.addAssemblyPlanner(pmmlPlanner);
    }

    Flow classifyFlow = flowConnector.connect(flowDef);
    classifyFlow.complete();
}
PATTERN: SCORE A MODEL
## run an RF classifier at scale
hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml

## run an RF classifier at scale, assert regression test, measure confusion matrix
hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml --measure out/measure

## run a predictive model at scale, measure RMSE
hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml --rmse out/measure
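The slides do not show how the demo's --rmse option computes its measure. As a reminder of the metric itself, root mean squared error is the square root of the mean squared difference between predicted and actual values. A minimal sketch, with the hypothetical names RmseSketch and rmse and made-up sample values:

```java
// Plain-JDK sketch of the RMSE metric; not taken from the Pattern code base.
final class RmseSketch {
    // Root mean squared error between two equally long, non-empty arrays.
    static double rmse(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double error = predicted[i] - actual[i];
            sumOfSquares += error * error;
        }
        return Math.sqrt(sumOfSquares / actual.length);
    }

    public static void main(String[] args) {
        // Hypothetical values, for illustration only.
        double[] actual = {3.0, 5.0, 2.5};
        double[] predicted = {2.5, 5.0, 4.0};
        System.out.println("RMSE: " + rmse(actual, predicted));
    }
}
```

RMSE suits models with a numeric prediction; for a classifier such as the random forest above, the confusion matrix is the more natural measure.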
AGENDA
• Cascading
• PMML in Cascading
• Pattern Scenarios
• Demo
Experiments – comparing models
• Much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
• Run multiple variants, then measure relative “lift”
• Concurrent runtime – tag and track models
The following example compares two models trained with different machine learning algorithms.
This is exaggerated: one has an important variable intentionally omitted to help illustrate the experiment.
Experiments – Random Forest model
library(randomForest)
library(pmml)
library(XML)

## load the "baseline" reference data
dat_folder <- '.'
data <- read.table(file=paste(dat_folder, "data/orders.tsv", sep="/"), sep="\t", quote="",
                   na.strings="NULL", header=TRUE, encoding="UTF8")

## split data into test and train sets
set.seed(71)
split_ratio <- 2/10
split <- round(dim(data)[1] * split_ratio)
data_tests <- data[1:split,]

data_train <- data[(split + 1):dim(data)[1],]
## drop the order_id column so that it is not used as a predictor
i <- colnames(data_train) == "order_id"
j <- 1:length(i)
data_train <- data_train[,-j[i]]

## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=25)

## inspect variable importance and the out-of-bag (OOB) error estimate
print(fit$importance)
print(fit)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "data/antifraud.rf.xml", sep="/"))
OOB estimate of error rate: 13.12%
Confusion matrix:
   0  1 class.error
0 57  9   0.1363636
1 12 82   0.1276596
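The error figures in this output follow directly from the confusion matrix, where rows are the actual classes and columns the predicted ones: class 0 has 9 of 66 cases misclassified, class 1 has 12 of 94, and overall 21 of 160 cases are wrong, which is about 13.1% and agrees with the reported OOB estimate. The sketch below reproduces that arithmetic; the class and method names are hypothetical.

```java
// Plain-JDK sketch: recompute the error figures from a confusion matrix.
final class ConfusionMatrixSketch {
    // matrix[actual][predicted] holds case counts.
    // Fraction of cases of the given actual class that were misclassified.
    static double classError(long[][] matrix, int actual) {
        long rowTotal = 0;
        for (long count : matrix[actual]) {
            rowTotal += count;
        }
        return (double) (rowTotal - matrix[actual][actual]) / rowTotal;
    }

    // Fraction of all cases that fall off the diagonal.
    static double overallError(long[][] matrix) {
        long total = 0;
        long correct = 0;
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
                if (i == j) {
                    correct += matrix[i][j];
                }
            }
        }
        return (double) (total - correct) / total;
    }

    public static void main(String[] args) {
        long[][] matrix = {{57, 9}, {12, 82}}; // counts from the output above
        System.out.println("class 0 error: " + classError(matrix, 0));
        System.out.println("class 1 error: " + classError(matrix, 1));
        System.out.println("overall error: " + overallError(matrix));
    }
}
```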
<?xml version="1.0"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_1 http://www.dmg.org/v4-1/pmml-4-1.xsd">
  <Header copyright="Copyright (c) 2014 alexisroos" description="Random Forest Tree Model">
    <Extension name="user" value="alexisroos" extender="Rattle/PMML"/>
    <Application name="Rattle/PMML" version="1.4"/>
    <Timestamp>2014-02-17 22:11:37</Timestamp>
  </Header>
  <DataDictionary numberOfFields="4">
    <DataField name="label" optype="categorical" dataType="string">
      <Value value="0"/>
      <Value value="1"/>
    </DataField>
    <DataField name="var0" optype="continuous" dataType="double"/>
    <DataField name="var1" optype="continuous" dataType="double"/>
    <DataField name="var2" optype="continuous" dataType="double"/>
  </DataDictionary>
  <MiningModel modelName="randomForest_Model" functionName="classification">
    <MiningSchema>
      <MiningField name="label" usageType="predicted" invalidValueTreatment="asIs"/>
      <MiningField name="var0" usageType="active" invalidValueTreatment="asIs"/>
      <MiningField name="var1" usageType="active" invalidValueTreatment="asIs"/>
      <MiningField name="var2" usageType="active" invalidValueTreatment="asIs"/>
    </MiningSchema>
    <Output>
      <OutputField name="Predicted_label" feature="predictedValue"/>
      <OutputField name="Probability_0" optype="continuous" dataType="double" feature="probability" value="0"/>
      ...
In pattern/pattern-examples:
gradle clean jar
hadoop dfs -rmr out/classify
hadoop jar build/libs/pattern-examples-*.jar data/sample.tsv out/classify --pmml data/antifraud.rf.xml
hadoop dfs -cat out/classify/part-*
PATTERN: ALGOS IMPLEMENTED
• Hierarchical Clustering
• K-Means Clustering
• Linear Regression
• Logistic Regression
• Random Forest
Also, model chaining and general support for ensembles.
Algorithms can be added or extended based on customer use cases.
BUILDING AND RUNNING PMML MODELS
[Diagram: the training and scoring pipeline again (model producer, PMML model, model consumer, scores, post-processing), with the "ETL, prepare data" steps labeled LINGUAL and the model consumer labeled PATTERN]
PATTERN: SINGLE MODEL ARCHITECTURE
Cascading allows multiple departments to combine their workflow components into a single integrated app (jar) – based on 100% open source – that can be managed by a single tool.
[Diagram: within CASCADING, data flows through ETL and data preparation, LINGUAL (ANSI SQL), into the predictive model, PATTERN (PMML)]
• decrease the project costs…
• reduce licensing costs…
PATTERN BENEFITS
• Can score data and run experiments at scale on Hadoop
• Run different models using ensembles
• In turn, this makes it possible to improve existing models and their accuracy
PATTERN: ARCHITECTURE
LINGUAL (ANSI SQL): ETL and data preparation
// emplTap, salesTap, resultsTap and sqlStatement are assumed to be defined elsewhere (not shown on the slide)
FlowDef flowDef = FlowDef.flowDef()
    .setName( "etl" )
    .addSource( "data.source1", emplTap )
    .addSource( "data.source2", salesTap )
    .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
    .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
PATTERN (PMML): predictive model
// inputTap, classifyTap and pmmlModel are assumed to be defined elsewhere (not shown on the slide)
FlowDef flowDef = FlowDef.flowDef()
    .setName( "classifier" )
    .addSource( "input", inputTap )
    .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
    .setPMMLInput( new File( pmmlModel ) )
    .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );
AGENDA
• Cascading
• PMML in Cascading
• Pattern Scenarios
• Demo
PATTERN: DEMO
1. Generate the model in R
2. Examine the PMML model
3. Write and run a Cascading app to score the model
KEY TAKEAWAYS
• Reuse existing learning models and investments to run data scoring at scale
• Leverage existing skill sets: Java, Scala, SQL, PMML, etc.
• Allow teams to collaborate on a single model that can be visualized, managed and monitored
@ALEXISROOS
QUESTIONS?

More Related Content

Viewers also liked

Kaahwa armstrong intern report
Kaahwa armstrong intern reportKaahwa armstrong intern report
Kaahwa armstrong intern report
kaahwa Armstrong
 
Moving From SAS to R Webinar Presentation - 07Aug14
Moving From SAS to R Webinar Presentation - 07Aug14Moving From SAS to R Webinar Presentation - 07Aug14
Moving From SAS to R Webinar Presentation - 07Aug14
Revolution Analytics
 
Production Grade Data Science for Hadoop
Production Grade Data Science for HadoopProduction Grade Data Science for Hadoop
Production Grade Data Science for Hadoop
DataWorks Summit/Hadoop Summit
 
Evaluation task 1
Evaluation task 1 Evaluation task 1
Evaluation task 1
Saskia Tarn
 
Varios cuentos
Varios cuentosVarios cuentos
Varios cuentos
Cindy Zurita
 
навигатор N9
навигатор N9навигатор N9
навигатор N9
Евгений Марков
 
Responsive Web Design Chalk Talk
Responsive Web Design Chalk TalkResponsive Web Design Chalk Talk
Responsive Web Design Chalk Talk
Beau Ulrey
 
Powerpoint
PowerpointPowerpoint
Powerpoint
pikinurjamil
 
Fashion & customize inspiration
Fashion & customize inspirationFashion & customize inspiration
Fashion & customize inspiration
Pim Studio Unicps
 
Fashion Illustratie & inspiration
Fashion Illustratie & inspirationFashion Illustratie & inspiration
Fashion Illustratie & inspiration
Pim Studio Unicps
 
Valgen Case Study - Midsized Apparel Manufacturer
Valgen Case Study - Midsized Apparel ManufacturerValgen Case Study - Midsized Apparel Manufacturer
Valgen Case Study - Midsized Apparel Manufacturer
ValgenMobility
 
Mas leather international
Mas leather internationalMas leather international
Mas leather international
Mas Leather International
 
Stich ii trial for supratentorial intra cerebral bleed
Stich ii trial for supratentorial intra cerebral bleedStich ii trial for supratentorial intra cerebral bleed
Stich ii trial for supratentorial intra cerebral bleed
garry07
 
Critical appraisal of Stitch Trial by Dr. Akshay Mehta
Critical appraisal of Stitch Trial by Dr. Akshay MehtaCritical appraisal of Stitch Trial by Dr. Akshay Mehta
Critical appraisal of Stitch Trial by Dr. Akshay Mehta
cardiositeindia
 
stitching details of basic shirt
stitching details of  basic shirtstitching details of  basic shirt
stitching details of basic shirt
MD. SAJJADUL KARIM BHUIYAN
 
Урок 8 для 6 класу - Поняття операційної системи, її призначення Графічний ін...
Урок 8 для 6 класу - Поняття операційної системи, її призначення Графічний ін...Урок 8 для 6 класу - Поняття операційної системи, її призначення Графічний ін...
Урок 8 для 6 класу - Поняття операційної системи, її призначення Графічний ін...
VsimPPT
 
Scaling Up Improved Seed and Agronomic Practices in Southern Africa
Scaling Up Improved Seed and Agronomic Practices in Southern AfricaScaling Up Improved Seed and Agronomic Practices in Southern Africa
Scaling Up Improved Seed and Agronomic Practices in Southern Africa
International Institute of Tropical Agriculture
 
Transportatopn problm
Transportatopn problmTransportatopn problm
Transportatopn problm
Anshul Singh
 
4. STUDY ONVARIATION OF JOINT FORCES IN STEEL TRUSS BRIDGE
4.	STUDY ONVARIATION OF JOINT FORCES IN STEEL TRUSS BRIDGE4.	STUDY ONVARIATION OF JOINT FORCES IN STEEL TRUSS BRIDGE
4. STUDY ONVARIATION OF JOINT FORCES IN STEEL TRUSS BRIDGE
AELC
 
The fertilizer industry Roydon D'mello
The fertilizer industry  Roydon D'melloThe fertilizer industry  Roydon D'mello
The fertilizer industry Roydon D'mello
Roydon D'mello
 

Viewers also liked (20)

Kaahwa armstrong intern report
Kaahwa armstrong intern reportKaahwa armstrong intern report
Kaahwa armstrong intern report
 
Moving From SAS to R Webinar Presentation - 07Aug14
Moving From SAS to R Webinar Presentation - 07Aug14Moving From SAS to R Webinar Presentation - 07Aug14
Moving From SAS to R Webinar Presentation - 07Aug14
 
Production Grade Data Science for Hadoop
Production Grade Data Science for HadoopProduction Grade Data Science for Hadoop
Production Grade Data Science for Hadoop
 
Evaluation task 1
Evaluation task 1 Evaluation task 1
Evaluation task 1
 
Varios cuentos
Varios cuentosVarios cuentos
Varios cuentos
 
навигатор N9
навигатор N9навигатор N9
навигатор N9
 
Responsive Web Design Chalk Talk
Responsive Web Design Chalk TalkResponsive Web Design Chalk Talk
Responsive Web Design Chalk Talk
 
Powerpoint
PowerpointPowerpoint
Powerpoint
 
Fashion & customize inspiration
Fashion & customize inspirationFashion & customize inspiration
Fashion & customize inspiration
 
Fashion Illustratie & inspiration
Fashion Illustratie & inspirationFashion Illustratie & inspiration
Fashion Illustratie & inspiration
 
Valgen Case Study - Midsized Apparel Manufacturer
Valgen Case Study - Midsized Apparel ManufacturerValgen Case Study - Midsized Apparel Manufacturer
Valgen Case Study - Midsized Apparel Manufacturer
 
Mas leather international
Mas leather internationalMas leather international
Mas leather international
 
Stich ii trial for supratentorial intra cerebral bleed
Stich ii trial for supratentorial intra cerebral bleedStich ii trial for supratentorial intra cerebral bleed
Stich ii trial for supratentorial intra cerebral bleed
 
Critical appraisal of Stitch Trial by Dr. Akshay Mehta
Critical appraisal of Stitch Trial by Dr. Akshay MehtaCritical appraisal of Stitch Trial by Dr. Akshay Mehta
Critical appraisal of Stitch Trial by Dr. Akshay Mehta
 
stitching details of basic shirt
stitching details of  basic shirtstitching details of  basic shirt
stitching details of basic shirt
 
Урок 8 для 6 класу - Поняття операційної системи, її призначення Графічний ін...
Урок 8 для 6 класу - Поняття операційної системи, її призначення Графічний ін...Урок 8 для 6 класу - Поняття операційної системи, її призначення Графічний ін...
Урок 8 для 6 класу - Поняття операційної системи, її призначення Графічний ін...
 
Scaling Up Improved Seed and Agronomic Practices in Southern Africa
Scaling Up Improved Seed and Agronomic Practices in Southern AfricaScaling Up Improved Seed and Agronomic Practices in Southern Africa
Scaling Up Improved Seed and Agronomic Practices in Southern Africa
 
Transportatopn problm
Transportatopn problmTransportatopn problm
Transportatopn problm
 
4. STUDY ONVARIATION OF JOINT FORCES IN STEEL TRUSS BRIDGE
4.	STUDY ONVARIATION OF JOINT FORCES IN STEEL TRUSS BRIDGE4.	STUDY ONVARIATION OF JOINT FORCES IN STEEL TRUSS BRIDGE
4. STUDY ONVARIATION OF JOINT FORCES IN STEEL TRUSS BRIDGE
 
The fertilizer industry Roydon D'mello
The fertilizer industry  Roydon D'melloThe fertilizer industry  Roydon D'mello
The fertilizer industry Roydon D'mello
 

Similar to Pattern: An Open Source Project for Migrating Predictive Models from SAS

Reducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop ApplicationsReducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop Applications
Cascading
 
Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Migrating from Closed to Open Source - Fonda Ingram & Ken SanfordMigrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Sri Ambati
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Felicia Haggarty
 
Accelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingAccelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with Cascading
Cascading
 
Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014
Sri Ambati
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
MapR Technologies
 
Pre-Con Ed: There has to be a Better Way to Fast Test Coverage!
Pre-Con Ed: There has to be a Better Way to Fast Test Coverage!Pre-Con Ed: There has to be a Better Way to Fast Test Coverage!
Pre-Con Ed: There has to be a Better Way to Fast Test Coverage!
CA Technologies
 
Pattern: An Open Source Project for Migrating Predictive Models from SAS

  • 1. DRIVING INNOVATION THROUGH DATA PATTERN: AN OPEN SOURCE PROJECT FOR MIGRATING PREDICTIVE MODELS FROM SAS, ETC., ONTO HADOOP Alexis Roos | June 3 2014 | Hadoop Summit
  • 2. PATTERN IN A NUTSHELL. Copyright 2014, Concurrent Inc. Pattern is:
      • An open source project that works on top of Cascading to support scoring of PMML models (from R, SAS, etc.) at scale on Hadoop.
      • Models are reused and deployed within Cascading workflows.
  • 3. AGENDA
      • Cascading
      • PMML and Cascading
      • Pattern Scenarios
      • Demo
  • 4. Experiments – comparing models
      • Much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
      • Run multiple variants, then measure relative “lift”
      • Concurrent runtime – tag and track models
    The following example compares two models trained with different machine learning algorithms. It is exaggerated: one model has an important variable intentionally omitted to help illustrate the experiment.
  • 5. Experiments – Random Forest model

    ## packages that provide randomForest(), pmml() and saveXML()
    library(randomForest)
    library(pmml)
    library(XML)

    ## load the "baseline" reference data
    dat_folder <- '.'
    data <- read.table(file=paste(dat_folder, "data/orders.tsv", sep="/"), sep="\t", quote="",
                       na.strings="NULL", header=TRUE, encoding="UTF8")

    ## split data into test and train sets
    set.seed(71)
    split_ratio <- 2/10
    split <- round(dim(data)[1] * split_ratio)
    data_tests <- data[1:split,]

    ## drop the order_id column from the training set
    data_train <- data[(split + 1):dim(data)[1],]
    i <- colnames(data_train) == "order_id"
    j <- 1:length(i)
    data_train <- data_train[,-j[i]]

    ## train a RandomForest model
    f <- as.formula("as.factor(label) ~ .")
    fit <- randomForest(f, data_train, ntree=25)

    ## inspect the fitted model (variable importance and out-of-bag error)
    print(fit$importance)
    print(fit)

    ## export RF model to PMML
    saveXML(pmml(fit), file=paste(dat_folder, "data/antifraud.rf.xml", sep="/"))

    Output:

    OOB estimate of error rate: 13.12%
    Confusion matrix:
       0  1 class.error
    0 57  9   0.1363636
    1 12 82   0.1276596
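The error figures in this output follow directly from the confusion matrix: 9 + 12 = 21 of the 160 training rows are misclassified out of bag, which is 0.13125 and consistent with the 13.12% shown. The following is a minimal Java sketch of that arithmetic; the class and method names are hypothetical and exist only to check the numbers, they are not part of Pattern or of the randomForest package.

```java
// Hypothetical helper: recomputes the error rates printed by randomForest
// from the 2x2 confusion matrix on the slide (rows = actual, columns = predicted).
class ConfusionMatrixCheck {

    // Fraction of misclassified rows for one actual class (one matrix row).
    static double classError(long[] row, int correctColumn) {
        long total = 0;
        for (long count : row) {
            total += count;
        }
        return (double) (total - row[correctColumn]) / total;
    }

    // Overall error rate: off-diagonal counts divided by all counts.
    static double errorRate(long[][] matrix) {
        long total = 0;
        long correct = 0;
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
                if (i == j) {
                    correct += matrix[i][j];
                }
            }
        }
        return (double) (total - correct) / total;
    }

    public static void main(String[] args) {
        long[][] matrix = { {57, 9}, {12, 82} };
        System.out.println("overall error: " + errorRate(matrix));
        System.out.println("class 0 error: " + classError(matrix[0], 0));
        System.out.println("class 1 error: " + classError(matrix[1], 1));
    }
}
```

The per-class values are 9/66 and 12/94, which match the class.error column above.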
  • 6. Experiments – Random Forest model (PMML excerpt, cut off on the slide)

    <?xml version="1.0"?>
    <PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.dmg.org/PMML-4_1 http://www.dmg.org/v4-1/pmml-4-1.xsd">
     <Header copyright="Copyright (c) 2014 alexisroos" description="Random Forest Tree Model">
      <Extension name="user" value="alexisroos" extender="Rattle/PMML"/>
      <Application name="Rattle/PMML" version="1.4"/>
      <Timestamp>2014-02-17 22:11:37</Timestamp>
     </Header>
     <DataDictionary numberOfFields="4">
      <DataField name="label" optype="categorical" dataType="string">
       <Value value="0"/>
       <Value value="1"/>
      </DataField>
      <DataField name="var0" optype="continuous" dataType="double"/>
      <DataField name="var1" optype="continuous" dataType="double"/>
      <DataField name="var2" optype="continuous" dataType="double"/>
     </DataDictionary>
     <MiningModel modelName="randomForest_Model" functionName="classification">
      <MiningSchema>
       <MiningField name="label" usageType="predicted" invalidValueTreatment="asIs"/>
       <MiningField name="var0" usageType="active" invalidValueTreatment="asIs"/>
       <MiningField name="var1" usageType="active" invalidValueTreatment="asIs"/>
       <MiningField name="var2" usageType="active" invalidValueTreatment="asIs"/>
      </MiningSchema>
      <Output>
       <OutputField name="Predicted_label" feature="predictedValue"/>
       <OutputField name="Probability_0" optype="continuous" dataType="double" feature="probability" value="0"/>
  • 7. Experiments – Random Forest model. In pattern/pattern-examples:

    gradle clean jar
    hadoop dfs -rmr out/classify
    hadoop jar build/libs/pattern-examples-*.jar data/sample.tsv out/classify --pmml data/antifraud.rf.xml
    hadoop dfs -cat out/classify/part-*
  • 8. CASCADING OVERVIEW
      • Enterprise Grade - Proven application development framework for building robust and complex Big Data applications, with thousands of deployments.
      • Productive - Cascading relies on software patterns to provide an optimal level of abstraction, greatly simplifying the creation, testing, deployment and operation of applications by focusing on business logic first.
      • Flexible & Extensible - Runs on all popular Hadoop distributions, but is not limited to Hadoop. Easily extensible framework supporting a variety of extensions, tools, and other integrations.
    Diagram labels: Data Applications (ETL, Analytics, Data Processing, Machine Learning); Hadoop; On-Premise or Cloud.
  • 9. CASCADING ECOSYSTEM. Diagram labels: On-Premise Deployments; Hadoop Distributions; Other Data Stores; RDBMS; MPP; EDW; Languages; SQL; Clojure; LINGUAL; PATTERN; SCALDING; CASCALOG.
  • 10. BUSINESSES DEPEND ON US (company names appear only as logos on the slide)
      • >30% of Marketplace’s 1000 node Hadoop cluster runs Cascading applications
      • Cascading powers Revenues, Publisher Analytics, and User Engagement applications
      • Built their business for weather insurance using Cascading; sold to Monsanto for $950MM
      • Standardize on Cascading for their fraud detection business
  • 11. CASCADING - DATA APPS
      • Enterprise IT: Extract Transform Load; Log File Analysis; Systems Integration; Operations Analysis
      • Corporate Apps: HR Analytics; Employee Behavioral Analysis; Customer Support | eCRM; Business Reporting
      • Telecom: Data processing of Open Data; Geospatial Indexing; Consumer Mobile Apps; Location based services
      • Marketing / Retail: Mobile, Social, Search Analytics; Funnel analysis; Revenue attribution; Customer experiments; Ad Optimization; Retail recommenders
      • Consumer / Entertainment: Music Recommendation; Comparison Shopping; Restaurant Rankings; Real Estate; Rental Listings; Travel Search & Forecast
      • Finance: Fraud and Anomaly Detection; Fraud Experiments; Customer Analytics; Insurance Risk Metric
      • Health / Biotech: Aggregate metrics for Govt; Person biometrics; Veterinary diagnostics; Next-Gen Genomics; Agronomics; Environmental Maps
  • 12. CASCADING - MODEL METAPHOR
    The Cascading processing model is based on a metaphor of flows based on patterns.
    Diagram labels: Source Tap, Sink Tap, Pipe, Tuple Stream, Pipe Assembly, Flow, Fields.
    • Data is represented as flows of tuples.
    • Pipes allow you to manage a data flow through functional programming: Splitting, Merging, Filtering, Parsing, Transforming, Grouping, Aggregating, Buffering, Joining, etc.
  • 13. CASCADE
    • A Cascade joins together multiple flows and executes them based on their dependencies.
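The dependency-driven execution described on this slide can be illustrated with a topological sort. The following is a minimal sketch in plain Java, not Cascading's own Cascade API; the class name `FlowOrder`, the flow names, and the map-of-dependencies input are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Cascading): orders flows so that each
// flow runs only after the flows whose output it reads.
public class FlowOrder {

    /**
     * deps maps a flow name to the names of the flows it depends on.
     * Returns a valid execution order (Kahn's algorithm) or throws on a cycle.
     */
    public static List<String> order(Map<String, List<String>> deps) {
        Map<String, Integer> pending = new LinkedHashMap<>();        // unmet dependency count
        Map<String, List<String>> dependents = new LinkedHashMap<>(); // upstream -> downstream
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            pending.putIfAbsent(e.getKey(), 0);
            for (String upstream : e.getValue()) {
                pending.putIfAbsent(upstream, 0);
                pending.merge(e.getKey(), 1, Integer::sum);
                dependents.computeIfAbsent(upstream, k -> new ArrayList<>()).add(e.getKey());
            }
        }

        // Start with the flows that depend on nothing.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }

        List<String> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            String flow = ready.remove();
            result.add(flow);
            // Finishing a flow may unblock the flows that read its output.
            for (String next : dependents.getOrDefault(flow, List.of())) {
                if (pending.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (result.size() != pending.size()) {
            throw new IllegalStateException("cyclic flow dependencies");
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("score", List.of("etl"));     // scoring reads the ETL output
        deps.put("report", List.of("score"));  // reporting reads the scores
        deps.put("etl", List.of());
        System.out.println(order(deps));
    }
}
```

In this sketch a flow becomes runnable as soon as all of its upstream flows have finished, which is the behavior the slide attributes to a Cascade.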
  • 14. FLOWS EXECUTION
    • Flow planners allow Flows to be independent from the execution platform; the planner is responsible for defining, sharing, and executing data-processing workflows
    • Currently there are two kinds of flow planners: Local and Hadoop (1 & 2)
    • Allows for “fail fast”: the flow planners can check completeness of flows, operations, type safety, etc.
    • Maps the pipe assembly to MapReduce in a deterministic way
  • 15. FLOWS EXECUTION
    Diagram labels: Client (Flow, Assembly: Source, func, CoGroup, GroupBy, aggr, temp, Sink); Cluster (Job: Map, Reduce; Job: Map, Reduce).
    Cascading automatically generates MapReduce jobs for the specified platform.
  • 16. CASCADE - EXAMPLE CODE
    • Top 10 IPs for Apache log file
    // Parse each Apache log line into five fields.
    RegexParser parser = new RegexParser(new Fields("ip", "time", "request", "response", "size"),
        "^([^ ]*) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) ([^ ]*).*$", new int[]{1, 2, 3, 4, 5});

    Pipe processPipe = new Each("processPipe", new Fields("line"), parser, Fields.RESULTS);
    // Group by IP and count the requests in each group.
    processPipe = new GroupBy(processPipe, new Fields("ip"));
    processPipe = new Every(processPipe, Fields.GROUP, new Count(new Fields("IPcount")), Fields.ALL);

    // Sort by count in reverse order and keep the first 10 tuples.
    Pipe sortedCountByIpPipe = new GroupBy(processPipe, new Fields("IPcount"), true);
    sortedCountByIpPipe = new Each(sortedCountByIpPipe, new Fields("IPcount"), new Limit(10));
  • 17. CASCADE - EXAMPLE CODE - DRIVEN
  • 18. CASCADE - EXAMPLE CODE - DRIVEN
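To make clear what the top-10-IPs pipe assembly on slide 16 computes, here is a minimal single-machine sketch in plain Java. It does not use the Cascading API: the class name `TopIps` and its method are hypothetical, and the regular expression is the log pattern from the slide with its backslash escapes written out. Ordering tied counts by IP is an added assumption, made only so the result is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone equivalent of the parse -> group -> count ->
// sort -> limit steps; not Cascading code.
public class TopIps {

    // Groups: 1 = ip, 2 = time, 3 = request, 4 = response, 5 = size.
    private static final Pattern LOG_LINE = Pattern.compile(
        "^([^ ]*) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) ([^ ]*).*$");

    /** Returns up to 'limit' (ip, count) pairs, highest count first. */
    public static List<Map.Entry<String, Long>> top(List<String> lines, int limit) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = LOG_LINE.matcher(line);
            if (m.matches()) {                      // lines that do not parse are skipped
                counts.merge(m.group(1), 1L, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        // Descending by count; ties broken by IP so the order is deterministic.
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed()
            .thenComparing(Map.Entry.<String, Long>comparingByKey()));
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "10.0.0.1 - - [03/Jun/2014:10:00:00 -0700] \"GET / HTTP/1.1\" 200 512",
            "10.0.0.2 - - [03/Jun/2014:10:00:01 -0700] \"GET /a HTTP/1.1\" 200 128",
            "10.0.0.1 - - [03/Jun/2014:10:00:02 -0700] \"GET /b HTTP/1.1\" 404 0");
        for (Map.Entry<String, Long> e : top(lines, 10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The Cascading version expresses the same steps as pipes so that the planner can distribute them across MapReduce jobs; the sketch only shows the logic on an in-memory list.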
  • 19. AGENDA • Cascading • PMML in Cascading • Pattern Scenarios • Demo
  • 20. PREDICTIVE MODEL MARKUP LANGUAGE (PMML)
    • Established XML standard for predictive model markup (specifies the model, not an implementation of the model)
    • Organized by the Data Mining Group (DMG) since 1997, http://dmg.org/
    • Open standard for data mining and statistical models
    • PMML producers: applications that create predictive models
    • PMML consumers: applications that read or consume models
    “PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”
  • 21. PMML MODEL COVERAGE
    • Association Rules: AssociationModel element
    • Cluster Models: ClusteringModel element
    • Decision Trees: TreeModel element
    • Naïve Bayes Classifiers: NaiveBayesModel element
    • Neural Networks: NeuralNetwork element
    • Regression: RegressionModel and GeneralRegressionModel elements
    • Rulesets: RuleSetModel element
    • Sequences: SequenceModel element
    • Support Vector Machines: SupportVectorMachineModel element
    • Text Models: TextModel element
    • Time Series: TimeSeriesModel element
  • 22. PMML VENDORS COVERAGE
  • 23. BUILDING AND RUNNING PMML MODELS
    Diagram, Training (Model Producer): Data; ETL, prepare data; explore data and build model using regression, clustering, etc.; PMML Model.
    Diagram, Scoring (Model Consumer, PATTERN): New Data; ETL, prepare data; PMML model; Data scores; Post Processing; measure and improve model.
  • 24. PATTERN: CREATE A MODEL IN R
    ## train a RandomForest model
    f <- as.formula("as.factor(species) ~ .")
    fit <- randomForest(f, data=iris_train, proximity=TRUE, ntree=50)

    ## test the model on the holdout test set
    print(fit)

    out <- iris_full
    out$predict <- predict(fit, out, type="class")

    ## export predicted labels to TSV
    write.table(out, file=paste(dat_folder, "iris.rf.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

    ## export RF model to PMML
    saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
  • 25. PATTERN: CAPTURE MODEL IN PMML
    <?xml version="1.0"?>
    <PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_1 http://www.dmg.org/v4-1/pmml-4-1.xsd">
     <Header copyright="Copyright (c) 2014 alexisroos" description="Random Forest Tree Model">
      <Extension name="user" value="alexisroos" extender="Rattle/PMML"/>
      <Application name="Rattle/PMML" version="1.4"/>
      <Timestamp>2014-06-02 18:04:36</Timestamp>
     </Header>
     <DataDictionary numberOfFields="5">
      <DataField name="species" optype="categorical" dataType="string">
       <Value value="setosa"/>
       <Value value="versicolor"/>
       <Value value="virginica"/>
      </DataField>
      <DataField name="sepal_length" optype="continuous" dataType="double"/>
      <DataField name="sepal_width" optype="continuous" dataType="double"/>
      <DataField name="petal_length" optype="continuous" dataType="double"/>
      <DataField name="petal_width" optype="continuous" dataType="double"/>
     </DataDictionary>
     <MiningModel modelName="randomForest_Model" functionName="classification">
      <MiningSchema>
       <MiningField name="species" usageType="predicted" invalidValueTreatment="asIs"/>
       <MiningField name="sepal_length" usageType="active" invalidValueTreatment="asIs"/>
       <MiningField name="sepal_width" usageType="active" invalidValueTreatment="asIs"/>
       <MiningField name="petal_length" usageType="active" invalidValueTreatment="asIs"/>
       <MiningField name="petal_width" usageType="active" invalidValueTreatment="asIs"/>
      </MiningSchema>
      <Output>
       <OutputField name="Predicted_species" feature="predictedValue"/>
       <OutputField name="Probability_setosa" optype="continuous" dataType="double" feature="probability" value="setosa"/>
       <OutputField name="Probability_versicolor" optype="continuous" dataType="double" feature="probability" value="versicolor"/>
       <OutputField name="Probability_virginica" optype="continuous" dataType="double" feature="probability" value="virginica"/>
      </Output>
    ...
  • 26. PATTERN: REUSE A MODEL
    public static void main(String[] args) throws RuntimeException {
      String inputPath = args[0];
      String classifyPath = args[1];

      Properties properties = new Properties();
      AppProps.setApplicationJarClass(properties, Main.class);
      HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

      // source and sink are tab-delimited text files with a header row
      Tap inputTap = new Hfs(new TextDelimited(true, "\t"), inputPath);
      Tap classifyTap = new Hfs(new TextDelimited(true, "\t"), classifyPath);

      OptionParser optParser = new OptionParser();
      optParser.accepts("pmml").withRequiredArg();
      OptionSet options = optParser.parse(args);

      FlowDef flowDef = FlowDef.flowDef()
        .setName("classify")
        .addSource("input", inputTap)
        .addSink("classify", classifyTap);

      // build the scoring assembly from the PMML file, if one was given
      if (options.hasArgument("pmml")) {
        String pmmlPath = (String) options.valuesOf("pmml").get(0);
        PMMLPlanner pmmlPlanner = new PMMLPlanner()
          .setPMMLInput(new File(pmmlPath))
          .retainOnlyActiveIncomingFields()
          .setDefaultPredictedField(new Fields("predict", Double.class)); // default value if missing from the model
        flowDef.addAssemblyPlanner(pmmlPlanner);
      }

      Flow classifyFlow = flowConnector.connect(flowDef);
      classifyFlow.complete();
    }
  • 27. PATTERN: SCORE A MODEL
    ## run an RF classifier at scale
    hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml

    ## run an RF classifier at scale, assert regression test, measure confusion matrix
    hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml --measure out/measure

    ## run a predictive model at scale, measure RMSE
    hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml --rmse out/measure
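The last command on this slide measures RMSE (root mean squared error), which is the square root of the mean of the squared differences between the predicted and the actual values. The following is a minimal sketch of that definition in plain Java; it is not the demo application's implementation, and the class name `Rmse` is hypothetical.

```java
// Hypothetical helper illustrating the RMSE definition; not Pattern code.
public class Rmse {

    /** Root mean squared error between actual values and model predictions. */
    public static double rmse(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = predicted[i] - actual[i];
            sumSquares += diff * diff;   // accumulate the squared error
        }
        return Math.sqrt(sumSquares / actual.length);
    }

    public static void main(String[] args) {
        double[] actual = {1.0, 2.0, 3.0};
        double[] predicted = {1.5, 2.0, 2.0};
        System.out.println(rmse(actual, predicted));
    }
}
```

A lower RMSE means the scores sit closer to the observed values, which makes it a convenient single number for comparing regression models scored on the same data.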
  • 28. AGENDA • Cascading • PMML in Cascading • Pattern Scenarios • Demo
  • 29. Experiments – comparing models
    • Much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
    • Run multiple variants, then measure relative “lift”
    • Concurrent runtime – tag and track models
    The following example compares two models trained with different machine learning algorithms. It is exaggerated: one model has an important variable intentionally omitted to help illustrate the experiment.
  • 30. Experiments – Random Forest model
    ## load the "baseline" reference data
    dat_folder <- '.'
    data <- read.table(file=paste(dat_folder, "data/orders.tsv", sep="/"), sep="\t", quote="", na.strings="NULL", header=TRUE, encoding="UTF8")

    ## split data into test and train sets
    set.seed(71)
    split_ratio <- 2/10
    split <- round(dim(data)[1] * split_ratio)
    data_tests <- data[1:split,]

    data_train <- data[(split + 1):dim(data)[1],]
    i <- colnames(data_train) == "order_id"
    j <- 1:length(i)
    data_train <- data_train[,-j[i]]

    ## train a RandomForest model
    f <- as.formula("as.factor(label) ~ .")
    fit <- randomForest(f, data_train, ntree=25)

    ## test the model on the holdout test set
    print(fit$importance)
    print(fit)

    ## export RF model to PMML
    saveXML(pmml(fit), file=paste(dat_folder, "data/antifraud.rf.xml", sep="/"))

    OOB estimate of error rate: 13.12%
    Confusion matrix:
       0  1 class.error
    0 57  9   0.1363636
    1 12 82   0.1276596
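The figures printed for this Random Forest model can be checked by hand from the confusion matrix: each class.error is the off-diagonal count in a row divided by the row total (9/66 and 12/94), and the overall error rate is all misclassified rows divided by all rows (21/160, which R prints as 13.12%). The sketch below reproduces that arithmetic in plain Java. The class name `ConfusionStats` is hypothetical, and reading rows as actual classes and columns as predicted classes is the usual randomForest layout, assumed here.

```java
// Hypothetical helper that recomputes the error figures from a confusion
// matrix; rows are assumed to be actual classes, columns predicted classes.
public class ConfusionStats {

    /** Fraction of one actual class (one row) that was misclassified. */
    public static double classError(long[][] matrix, int row) {
        long rowTotal = 0;
        for (long count : matrix[row]) {
            rowTotal += count;
        }
        return (double) (rowTotal - matrix[row][row]) / rowTotal;
    }

    /** Fraction of all rows that were misclassified. */
    public static double errorRate(long[][] matrix) {
        long total = 0;
        long correct = 0;
        for (int r = 0; r < matrix.length; r++) {
            for (int c = 0; c < matrix[r].length; c++) {
                total += matrix[r][c];
                if (r == c) {
                    correct += matrix[r][c];   // diagonal cells are correct predictions
                }
            }
        }
        return (double) (total - correct) / total;
    }

    public static void main(String[] args) {
        long[][] matrix = {{57, 9}, {12, 82}};  // values from the slide
        System.out.println("class 0 error: " + classError(matrix, 0));
        System.out.println("class 1 error: " + classError(matrix, 1));
        System.out.println("overall error: " + errorRate(matrix));
    }
}
```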
  • 31. Experiments – Random Forest model
    <?xml version="1.0"?>
    <PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_1 http://www.dmg.org/v4-1/pmml-4-1.xsd">
     <Header copyright="Copyright (c) 2014 alexisroos" description="Random Forest Tree Model">
      <Extension name="user" value="alexisroos" extender="Rattle/PMML"/>
      <Application name="Rattle/PMML" version="1.4"/>
      <Timestamp>2014-02-17 22:11:37</Timestamp>
     </Header>
     <DataDictionary numberOfFields="4">
      <DataField name="label" optype="categorical" dataType="string">
       <Value value="0"/>
       <Value value="1"/>
      </DataField>
      <DataField name="var0" optype="continuous" dataType="double"/>
      <DataField name="var1" optype="continuous" dataType="double"/>
      <DataField name="var2" optype="continuous" dataType="double"/>
     </DataDictionary>
     <MiningModel modelName="randomForest_Model" functionName="classification">
      <MiningSchema>
       <MiningField name="label" usageType="predicted" invalidValueTreatment="asIs"/>
       <MiningField name="var0" usageType="active" invalidValueTreatment="asIs"/>
       <MiningField name="var1" usageType="active" invalidValueTreatment="asIs"/>
       <MiningField name="var2" usageType="active" invalidValueTreatment="asIs"/>
      </MiningSchema>
      <Output>
       <OutputField name="Predicted_label" feature="predictedValue"/>
       <OutputField name="Probability_0" optype="continuous" dataType="double" feature="probability" value="0"/>
  • 32. Experiments – Random Forest model
    In pattern/pattern-examples:
    gradle clean jar
    hadoop dfs -rmr out/classify
    hadoop jar build/libs/pattern-examples-*.jar data/sample.tsv out/classify --pmml data/antifraud.rf.xml
    hadoop dfs -cat out/classify/part-*
  • 33. PATTERN: ALGOS IMPLEMENTED
    • Hierarchical Clustering
    • K-Means Clustering
    • Linear Regression
    • Logistic Regression
    • Random Forest
    Also: model chaining and general support for ensembles.
    Algorithms can be added or extended based on customer use cases.
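As background for the ensemble support mentioned on this slide: a Random Forest classifier is an ensemble in which each tree predicts a label and the most frequent label wins. The sketch below shows that majority vote in plain Java. It is only an illustration of the general technique, not Pattern's implementation; the class name `MajorityVote` and the tie-breaking rule (first label seen wins) are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of ensemble voting; not Pattern code.
public class MajorityVote {

    /**
     * Returns the label predicted by the largest number of models.
     * On a tie, the label that appeared first in the list wins (an assumption).
     */
    public static String vote(List<String> predictions) {
        if (predictions.isEmpty()) {
            throw new IllegalArgumentException("no predictions to combine");
        }
        // LinkedHashMap keeps first-seen order, which drives the tie-break.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : predictions) {
            counts.merge(label, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {   // strictly greater: earlier label keeps ties
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three trees vote on the fraud label of one order.
        System.out.println(vote(List.of("1", "0", "1")));
    }
}
```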
  • 34. BUILDING AND RUNNING PMML MODELS
    Diagram, Training (Model Producer): Data; ETL, prepare data (LINGUAL); explore data and build model using regression, clustering, etc.; PMML Model.
    Diagram, Scoring (Model Consumer, PATTERN): New Data; ETL, prepare data (LINGUAL); PMML model; Data scores; Post Processing; measure and improve model.
  • 35. PATTERN: SINGLE MODEL ARCHITECTURE
    Cascading allows multiple departments to combine their workflow components into a single integrated app (jar) – based on 100% open source – that can be managed by a single tool.
    Diagram labels: CASCADING; LINGUAL (ANSI SQL): ETL, Data preparation; PATTERN (PMML): Predictive Model; Data.
    decrease the project costs… reduce licensing costs…
  • 36. PATTERN BENEFITS
    • Score data and run experiments at scale on Hadoop
    • Run different models using ensembles
    • In turn, this makes it possible to improve existing models and their accuracy
  • 37. PATTERN: ARCHITECTURE
    Cascading allows multiple departments to combine their workflow components into a single integrated app (jar) – based on 100% open source – that can be managed by a single tool.
    Diagram labels: LINGUAL (ANSI SQL): ETL, Data preparation; PATTERN (PMML): Predictive Model; Data.
    // ETL step: a flow whose assembly is planned from a SQL statement (Lingual)
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "etl" )
      .addSource( "data.source1", emplTap )
      .addSource( "data.source2", salesTap )
      .addSink( "results", resultsTap );

    SQLPlanner sqlPlanner = new SQLPlanner()
      .setSql( sqlStatement );

    flowDef.addAssemblyPlanner( sqlPlanner );
  • 38. PATTERN: ARCHITECTURE
    Cascading allows multiple departments to combine their workflow components into a single integrated app (jar) – based on 100% open source – that can be managed by a single tool.
    Diagram labels: LINGUAL (ANSI SQL): ETL, Data preparation; PATTERN (PMML): Predictive Model; Data.
    // Scoring step: a flow whose assembly is planned from a PMML model (Pattern)
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "classifier" )
      .addSource( "input", inputTap )
      .addSink( "classify", classifyTap );

    PMMLPlanner pmmlPlanner = new PMMLPlanner()
      .setPMMLInput( new File( pmmlModel ) )
      .retainOnlyActiveIncomingFields();

    flowDef.addAssemblyPlanner( pmmlPlanner );
  • 39. AGENDA • Cascading • PMML in Cascading • Pattern Scenarios • Demo
  • 40. PATTERN: DEMO
    1. Generate the model in R
    2. Examine the PMML model
    3. Write and run a Cascading app to score the model
  • 41. KEY TAKEAWAYS
    • Reuse existing learning models and investments to run data scoring at scale.
    • Leverage existing skill sets: Java, Scala, SQL, PMML, etc.
    • Allow teams to collaborate on a single model that can be visualized, managed, and monitored.
  • 42. QUESTIONS? @ALEXISROOS