Pattern: An Open Source Project for Migrating Predictive Models from SAS

Published in: Technology, Education

Transcript of "Pattern: An Open Source Project for Migrating Predictive Models from SAS"

  1. DRIVING INNOVATION THROUGH DATA
     PATTERN: AN OPEN SOURCE PROJECT FOR MIGRATING PREDICTIVE MODELS FROM SAS, ETC., ONTO HADOOP
     Alexis Roos | June 3 2014 | Hadoop Summit
  2. PATTERN IN A NUTSHELL
     Pattern is:
     • An open source project that works on top of Cascading to support scoring of PMML models (from R, SAS, etc.) at scale on Hadoop.
     • Models are reused and deployed within Cascading workflows.
     Copyright 2014, Concurrent Inc.
  3. AGENDA
     • Cascading
     • PMML and Cascading
     • Pattern Scenarios
     • Demo
  8. CASCADING OVERVIEW
     • Enterprise Grade: a proven application development framework for building robust and complex Big Data applications, with thousands of deployments.
     • Productive: Cascading relies on software patterns to provide an optimal level of abstraction, which greatly simplifies the creation, testing, deployment and operation of applications by focusing on business logic first.
     • Flexible & Extensible: runs on all popular Hadoop distributions, but is not limited to Hadoop. An easily extensible framework supporting a variety of extensions, tools, and other integrations.
     [Diagram labels: Data Applications (ETL, Analytics, Data Processing, Machine Learning); Hadoop; On-Premise or Cloud]
  9. CASCADING ECOSYSTEM
     [Diagram labels: On-Premise Deployments; Hadoop Distributions; Other Data Stores; RDBMS; MPP; EDW; Languages; Clojure; SQL; LINGUAL; PATTERN; SCALDING; CASCALOG]
  10. BUSINESSES DEPEND ON US
      • >30% of Marketplace’s 1000-node Hadoop cluster runs Cascading applications
      • Cascading powers Revenues, Publisher Analytics, and User Engagement applications
      • Built their business for weather insurance using Cascading; sold to Monsanto for $950MM
      • Standardize on Cascading for their fraud detection business
  11. CASCADING - DATA APPS
      • Enterprise IT: Extract Transform Load; Log File Analysis; Systems Integration; Operations Analysis
      • Corporate Apps: HR Analytics; Employee Behavioral Analysis; Customer Support | eCRM; Business Reporting
      • Telecom: Data processing of Open Data; Geospatial Indexing; Consumer Mobile Apps; Location based services
      • Marketing / Retail: Mobile, Social, Search Analytics; Funnel analysis; Revenue attribution; Customer experiments; Ad Optimization; Retail recommenders
      • Consumer / Entertainment: Music Recommendation; Comparison Shopping; Restaurant Rankings; Real Estate; Rental Listings; Travel Search & Forecast
      • Finance: Fraud and Anomaly Detection; Fraud Experiments; Customer Analytics; Insurance Risk Metric
      • Health / Biotech: Aggregate metrics for Govt; Person biometrics; Veterinary diagnostics; Next-Gen Genomics; Agronomics; Environmental Maps
  12. CASCADING - MODEL METAPHOR
      The Cascading processing model is based on a metaphor of flows based on patterns.
      [Diagram labels: Source Tap; Sink Tap; Pipe; Tuple Stream; Pipe Assembly; Flow; Fields]
      • Data is represented as flows of tuples.
      • Pipes allow you to manage a data flow through functional programming: splitting, merging, filtering, parsing, transforming, grouping, aggregating, buffering, joining, etc.
  13. CASCADE
      • A Cascade joins together multiple flows and executes them based on their dependencies.
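To make "executes them based on dependencies" concrete, here is a minimal sketch of dependency-ordered execution in plain Java. It is not Cascading's scheduler and uses none of its API; the class name `CascadeOrderSketch`, the method `executionOrder`, and the flow names are all hypothetical. It simply applies a standard topological sort (Kahn's algorithm) to a map from each flow to the flows it depends on.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CascadeOrderSketch {

    /** Orders flows so that every flow comes after the flows it depends on. */
    static List<String> executionOrder(Map<String, List<String>> dependsOn) {
        // Number of not-yet-completed upstream flows per flow (sorted for stable output).
        Map<String, Integer> pending = new TreeMap<>();
        // Reverse edges: upstream flow -> flows waiting on it.
        Map<String, List<String>> dependents = new HashMap<>();
        for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
            pending.putIfAbsent(e.getKey(), 0);
            for (String upstream : e.getValue()) {
                pending.putIfAbsent(upstream, 0);
                pending.merge(e.getKey(), 1, Integer::sum);
                dependents.computeIfAbsent(upstream, k -> new ArrayList<>()).add(e.getKey());
            }
        }

        // Start with the flows that have no upstream dependencies.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String flow = ready.remove();
            order.add(flow);
            // Completing a flow may unblock the flows that depend on it.
            for (String next : dependents.getOrDefault(flow, Collections.<String>emptyList())) {
                if (pending.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalStateException("cyclic dependency between flows");
        }
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical flows: "classify" needs the output of "etl", "measure" needs "classify".
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("etl", Collections.<String>emptyList());
        deps.put("classify", Arrays.asList("etl"));
        deps.put("measure", Arrays.asList("classify"));
        System.out.println(executionOrder(deps)); // prints [etl, classify, measure]
    }
}
```

A cycle between flows cannot be ordered, so the sketch rejects it with an exception instead of looping.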
  14. FLOWS EXECUTION
      • Flow planners allow Flows to be independent of the execution platform; the processing query planner is responsible for defining, sharing, and executing data-processing workflows
      • Currently there are two kinds of flow planners:
        - Local
        - Hadoop (1 & 2)
      • Allows for “fail fast”: the flow planners can check completeness of flows, operations, type safety, etc.
      • Maps the pipe assembly to MapReduce in a deterministic way
  15. FLOWS EXECUTION
      [Diagram labels: Client; Flow; Assembly; Source; func; aggr; GroupBy; CoGroup; temp; Sink; Cluster; Job; Map; Reduce]
      Cascading automatically generates MapReduce jobs for the specified platform.
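  16. CASCADE - EXAMPLE CODE
      • Top 10 IPs for Apache log file

          import cascading.operation.aggregator.Count;
          import cascading.operation.filter.Limit;
          import cascading.operation.regex.RegexParser;
          import cascading.pipe.Each;
          import cascading.pipe.Every;
          import cascading.pipe.GroupBy;
          import cascading.pipe.Pipe;
          import cascading.tuple.Fields;

          // Parse each Apache log line into named fields (regex escapes restored for Java source).
          RegexParser parser = new RegexParser(new Fields("ip", "time", "request", "response", "size"),
              "^([^ ]*) \\S+ \\S+ \\[([\\w:/]+\\s[+-]\\d{4})\\] \"(.+?)\" (\\d{3}) ([^ ]*).*$",
              new int[]{1, 2, 3, 4, 5});

          // Group by IP and count the lines per IP.
          Pipe processPipe = new Each("processPipe", new Fields("line"), parser, Fields.RESULTS);
          processPipe = new GroupBy(processPipe, new Fields("ip"));
          processPipe = new Every(processPipe, Fields.GROUP, new Count(new Fields("IPcount")), Fields.ALL);

          // Sort by count in descending order and keep the first 10 tuples.
          Pipe sortedCountByIpPipe = new GroupBy(processPipe, new Fields("IPcount"), true);
          sortedCountByIpPipe = new Each(sortedCountByIpPipe, new Fields("IPcount"), new Limit(10));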
  17. CASCADE - EXAMPLE CODE - DRIVEN
  18. CASCADE - EXAMPLE CODE - DRIVEN
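For readers who do not know the Cascading operators, the "Top 10 IPs" example on slide 16 boils down to parse, group by IP, count, sort descending, and limit. The sketch below reproduces that logic on an in-memory list using only the Java standard library. It is an illustration, not a Cascading program: the class `TopIpsSketch`, the method `topIps`, and the sample log lines are hypothetical, and the IP is taken as the text before the first space rather than extracted with the slide's regex.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopIpsSketch {

    /** Counts lines per client IP and returns the most frequent IPs first. */
    static List<Map.Entry<String, Long>> topIps(List<String> logLines, int limit) {
        // "Each" + "GroupBy" + "Every(Count)": extract the IP and count occurrences.
        Map<String, Long> counts = logLines.stream()
            .map(line -> line.split(" ", 2)[0])
            .filter(ip -> !ip.isEmpty())
            .collect(Collectors.groupingBy(ip -> ip, Collectors.counting()));

        // "GroupBy(..., true)" + "Limit": sort by count descending, keep the top entries.
        // Ties are broken by IP so the result is deterministic.
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(limit)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical log lines in Apache common log format.
        List<String> lines = Arrays.asList(
            "10.0.0.1 - - [03/Jun/2014:10:00:00 -0700] \"GET / HTTP/1.1\" 200 512",
            "10.0.0.2 - - [03/Jun/2014:10:00:01 -0700] \"GET /a HTTP/1.1\" 200 128",
            "10.0.0.1 - - [03/Jun/2014:10:00:02 -0700] \"GET /b HTTP/1.1\" 404 0");
        for (Map.Entry<String, Long> entry : topIps(lines, 10)) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

The in-memory version fits in a few lines; what Cascading adds is planning the same pipeline onto MapReduce jobs so that it runs on data that does not fit on one machine.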
  19. AGENDA
      • Cascading
      • PMML in Cascading
      • Pattern Scenarios
      • Demo
  20. PREDICTIVE MODEL MARKUP LANGUAGE (PMML)
      • Established XML standard for predictive model markup (specifies the model, not an implementation of the model)
      • Organized by the Data Mining Group (DMG), since 1997, http://dmg.org/
      • Open standards for Data Mining and Statistical models
      • PMML producers: applications that create predictive models
      • PMML consumers: applications that read or consume models
      “PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”
  21. PMML MODEL COVERAGE
      • Association Rules: AssociationModel element
      • Cluster Models: ClusteringModel element
      • Decision Trees: TreeModel element
      • Naïve Bayes Classifiers: NaiveBayesModel element
      • Neural Networks: NeuralNetwork element
      • Regression: RegressionModel and GeneralRegressionModel elements
      • Rulesets: RuleSetModel element
      • Sequences: SequenceModel element
      • Support Vector Machines: SupportVectorMachineModel element
      • Text Models: TextModel element
      • Time Series: TimeSeriesModel element
  22. PMML VENDORS COVERAGE
  23. BUILDING AND RUNNING PMML MODELS
      [Diagram labels]
      • Training (Model Producer): Data; ETL, prepare data; Explore data and build model using regression, clustering, etc.; PMML Model
      • Scoring (Model Consumer, PATTERN): New Data; ETL, prepare data; PMML model; Data scores; Post Processing; Measure and improve model
  24. PATTERN: CREATE A MODEL IN R

          library(randomForest)  # randomForest()
          library(pmml)          # pmml()
          library(XML)           # saveXML()

          ## iris_train, iris_full and dat_folder are defined earlier in the script (not shown on the slide)

          ## train a RandomForest model
          f <- as.formula("as.factor(species) ~ .")
          fit <- randomForest(f, data=iris_train, proximity=TRUE, ntree=50)

          ## print the fitted model summary
          print(fit)

          out <- iris_full
          out$predict <- predict(fit, out, type="class")

          ## export predicted labels to TSV
          write.table(out, file=paste(dat_folder, "iris.rf.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

          ## export RF model to PMML
          saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
  25. PATTERN: CAPTURE MODEL IN PMML

          <?xml version="1.0"?>
          <PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_1 http://www.dmg.org/v4-1/pmml-4-1.xsd">
           <Header copyright="Copyright (c) 2014 alexisroos" description="Random Forest Tree Model">
            <Extension name="user" value="alexisroos" extender="Rattle/PMML"/>
            <Application name="Rattle/PMML" version="1.4"/>
            <Timestamp>2014-06-02 18:04:36</Timestamp>
           </Header>
           <DataDictionary numberOfFields="5">
            <DataField name="species" optype="categorical" dataType="string">
             <Value value="setosa"/>
             <Value value="versicolor"/>
             <Value value="virginica"/>
            </DataField>
            <DataField name="sepal_length" optype="continuous" dataType="double"/>
            <DataField name="sepal_width" optype="continuous" dataType="double"/>
            <DataField name="petal_length" optype="continuous" dataType="double"/>
            <DataField name="petal_width" optype="continuous" dataType="double"/>
           </DataDictionary>
           <MiningModel modelName="randomForest_Model" functionName="classification">
            <MiningSchema>
             <MiningField name="species" usageType="predicted" invalidValueTreatment="asIs"/>
             <MiningField name="sepal_length" usageType="active" invalidValueTreatment="asIs"/>
             <MiningField name="sepal_width" usageType="active" invalidValueTreatment="asIs"/>
             <MiningField name="petal_length" usageType="active" invalidValueTreatment="asIs"/>
             <MiningField name="petal_width" usageType="active" invalidValueTreatment="asIs"/>
            </MiningSchema>
            <Output>
             <OutputField name="Predicted_species" feature="predictedValue"/>
             <OutputField name="Probability_setosa" optype="continuous" dataType="double" feature="probability" value="setosa"/>
             <OutputField name="Probability_versicolor" optype="continuous" dataType="double" feature="probability" value="versicolor"/>
             <OutputField name="Probability_virginica" optype="continuous" dataType="double" feature="probability" value="virginica"/>
            </Output>
            ...
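Because PMML is plain XML, any consumer can inspect a model's data dictionary before scoring. The sketch below shows one way to list the `DataField` names and their `optype` with the JDK's built-in DOM parser. It is not how Pattern reads PMML internally; the class `PmmlFieldsSketch`, the method `dataFields`, and the shortened two-field document are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PmmlFieldsSketch {

    /** Returns DataField name -> optype, in document order. */
    static Map<String, String> dataFields(String pmmlXml) {
        try {
            // The default factory is not namespace-aware, so elements match by plain tag name.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(pmmlXml)));
            NodeList nodes = doc.getElementsByTagName("DataField");
            Map<String, String> fields = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                fields.put(field.getAttribute("name"), field.getAttribute("optype"));
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("input could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened, hypothetical PMML document with two of the iris fields.
        String pmml = "<PMML version=\"4.1\" xmlns=\"http://www.dmg.org/PMML-4_1\">"
            + "<DataDictionary numberOfFields=\"2\">"
            + "<DataField name=\"species\" optype=\"categorical\" dataType=\"string\"/>"
            + "<DataField name=\"sepal_length\" optype=\"continuous\" dataType=\"double\"/>"
            + "</DataDictionary></PMML>";
        System.out.println(dataFields(pmml)); // prints {species=categorical, sepal_length=continuous}
    }
}
```

A check like this can confirm that the columns of an input file match the fields the model expects before a job is submitted.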
  26. PATTERN: REUSE A MODEL

          import java.io.File;
          import java.util.Properties;

          import cascading.flow.Flow;
          import cascading.flow.FlowDef;
          import cascading.flow.hadoop.HadoopFlowConnector;
          import cascading.pattern.pmml.PMMLPlanner;
          import cascading.property.AppProps;
          import cascading.scheme.hadoop.TextDelimited;
          import cascading.tap.Tap;
          import cascading.tap.hadoop.Hfs;
          import cascading.tuple.Fields;
          import joptsimple.OptionParser;
          import joptsimple.OptionSet;

          public class Main {
            public static void main(String[] args) throws RuntimeException {
              String inputPath = args[0];
              String classifyPath = args[1];

              Properties properties = new Properties();
              AppProps.setApplicationJarClass(properties, Main.class);
              HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

              // Tab-separated input and output files with a header row
              Tap inputTap = new Hfs(new TextDelimited(true, "\t"), inputPath);
              Tap classifyTap = new Hfs(new TextDelimited(true, "\t"), classifyPath);

              OptionParser optParser = new OptionParser();
              optParser.accepts("pmml").withRequiredArg();
              OptionSet options = optParser.parse(args);

              FlowDef flowDef = FlowDef.flowDef()
                .setName("classify")
                .addSource("input", inputTap)
                .addSink("classify", classifyTap);

              // Let the PMML planner build the scoring assembly from the model file
              if (options.hasArgument("pmml")) {
                String pmmlPath = (String) options.valuesOf("pmml").get(0);
                PMMLPlanner pmmlPlanner = new PMMLPlanner()
                  .setPMMLInput(new File(pmmlPath))
                  .retainOnlyActiveIncomingFields()
                  .setDefaultPredictedField(new Fields("predict", Double.class)); // default value if missing from the model
                flowDef.addAssemblyPlanner(pmmlPlanner);
              }

              Flow classifyFlow = flowConnector.connect(flowDef);
              classifyFlow.complete();
            }
          }
  27. PATTERN: SCORE A MODEL

          ## run an RF classifier at scale
          hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml

          ## run an RF classifier at scale, assert regression test, measure confusion matrix
          hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml --measure out/measure

          ## run a predictive model at scale, measure RMSE
          hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml --rmse out/measure
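The last command on slide 27 reports RMSE (root mean squared error), the usual accuracy measure for models that predict a number: RMSE = sqrt(mean((actual - predicted)^2)). The slides do not show how the demo application computes it, so the following is only a minimal sketch of the formula itself; the class `RmseSketch`, the method `rmse`, and the sample values are hypothetical.

```java
public class RmseSketch {

    /** Root mean squared error between observed and predicted values. */
    static double rmse(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double error = actual[i] - predicted[i];
            sumOfSquares += error * error;
        }
        return Math.sqrt(sumOfSquares / actual.length);
    }

    public static void main(String[] args) {
        // Hypothetical observed and predicted values.
        double[] actual = {5.1, 4.9, 6.3};
        double[] predicted = {5.0, 5.1, 6.0};
        System.out.println("RMSE = " + rmse(actual, predicted));
    }
}
```

At scale the same quantity is an aggregation, a sum of squared errors and a row count, which is why it fits naturally into a flow that runs after scoring.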
  28. AGENDA
      • Cascading
      • PMML in Cascading
      • Pattern Scenarios
      • Demo
  29. Experiments – comparing models
      • Much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
      • Run multiple variants, then measure relative “lift”
      • Concurrent runtime: tag and track models
      The following example compares two models trained with different machine learning algorithms. It is exaggerated: one model has an important variable intentionally omitted to help illustrate the experiment.
  30. Experiments – Random Forest model

          library(randomForest)  # randomForest()
          library(pmml)          # pmml()
          library(XML)           # saveXML()

          ## load the "baseline" reference data
          dat_folder <- '.'
          data <- read.table(file=paste(dat_folder, "data/orders.tsv", sep="/"), sep="\t", quote="", na.strings="NULL", header=TRUE, encoding="UTF8")

          ## split data into test and train sets
          set.seed(71)
          split_ratio <- 2/10
          split <- round(dim(data)[1] * split_ratio)
          data_tests <- data[1:split,]

          data_train <- data[(split + 1):dim(data)[1],]
          ## drop the order_id column so it is not used as a predictor
          i <- colnames(data_train) == "order_id"
          j <- 1:length(i)
          data_train <- data_train[,-j[i]]

          ## train a RandomForest model
          f <- as.formula("as.factor(label) ~ .")
          fit <- randomForest(f, data_train, ntree=25)

          ## inspect variable importance and the out-of-bag (OOB) error estimate
          print(fit$importance)
          print(fit)

          ## export RF model to PMML
          saveXML(pmml(fit), file=paste(dat_folder, "data/antifraud.rf.xml", sep="/"))

      Output:

          OOB estimate of error rate: 13.12%
          Confusion matrix:
             0  1 class.error
          0 57  9   0.1363636
          1 12 82   0.1276596
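The error figures printed for the Random Forest model follow directly from the four cells of its confusion matrix, assuming the usual layout in which rows are actual classes and columns are predicted classes. The per-class error is the off-diagonal share of a row (9/66 and 12/94), and the overall error is all misclassified cases over all cases (21/160 = 0.13125, shown as 13.12%). The sketch below redoes that arithmetic; the class `ConfusionMatrixSketch` and its method names are hypothetical.

```java
public class ConfusionMatrixSketch {

    /** Share of the cases in row cls (actual class) that were predicted as another class. */
    static double classError(long[][] matrix, int cls) {
        long rowTotal = 0;
        for (long count : matrix[cls]) {
            rowTotal += count;
        }
        return (double) (rowTotal - matrix[cls][cls]) / rowTotal;
    }

    /** Share of all cases that lie off the diagonal, i.e. were misclassified. */
    static double overallError(long[][] matrix) {
        long total = 0;
        long correct = 0;
        for (int row = 0; row < matrix.length; row++) {
            for (int col = 0; col < matrix[row].length; col++) {
                total += matrix[row][col];
                if (row == col) {
                    correct += matrix[row][col];
                }
            }
        }
        return (double) (total - correct) / total;
    }

    public static void main(String[] args) {
        // Counts taken from the confusion matrix on the slide.
        long[][] matrix = {
            {57, 9},
            {12, 82}
        };
        System.out.println("class 0 error: " + classError(matrix, 0));
        System.out.println("class 1 error: " + classError(matrix, 1));
        System.out.println("overall error: " + overallError(matrix));
    }
}
```

Comparing these numbers for two model variants is the simplest form of the experiment described on the previous slide.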
  31. Experiments – Random Forest model

          <?xml version="1.0"?>
          <PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_1 http://www.dmg.org/v4-1/pmml-4-1.xsd">
           <Header copyright="Copyright (c) 2014 alexisroos" description="Random Forest Tree Model">
            <Extension name="user" value="alexisroos" extender="Rattle/PMML"/>
            <Application name="Rattle/PMML" version="1.4"/>
            <Timestamp>2014-02-17 22:11:37</Timestamp>
           </Header>
           <DataDictionary numberOfFields="4">
            <DataField name="label" optype="categorical" dataType="string">
             <Value value="0"/>
             <Value value="1"/>
            </DataField>
            <DataField name="var0" optype="continuous" dataType="double"/>
            <DataField name="var1" optype="continuous" dataType="double"/>
            <DataField name="var2" optype="continuous" dataType="double"/>
           </DataDictionary>
           <MiningModel modelName="randomForest_Model" functionName="classification">
            <MiningSchema>
             <MiningField name="label" usageType="predicted" invalidValueTreatment="asIs"/>
             <MiningField name="var0" usageType="active" invalidValueTreatment="asIs"/>
             <MiningField name="var1" usageType="active" invalidValueTreatment="asIs"/>
             <MiningField name="var2" usageType="active" invalidValueTreatment="asIs"/>
            </MiningSchema>
            <Output>
             <OutputField name="Predicted_label" feature="predictedValue"/>
             <OutputField name="Probability_0" optype="continuous" dataType="double" feature="probability" value="0"/>
  32. Experiments – Random Forest model
      In pattern/pattern-examples:

          gradle clean jar

          hadoop dfs -rmr out/classify

          hadoop jar build/libs/pattern-examples-*.jar data/sample.tsv out/classify --pmml data/antifraud.rf.xml

          hadoop dfs -cat out/classify/part-*
  33. PATTERN: ALGOS IMPLEMENTED
      • Hierarchical Clustering
      • K-Means Clustering
      • Linear Regression
      • Logistic Regression
      • Random Forest
      Also: model chaining and general support for ensembles.
      Algorithms can be added or extended based on customer use cases.
  34. BUILDING AND RUNNING PMML MODELS
      [Diagram labels]
      • Training (Model Producer): Data; ETL, prepare data (LINGUAL); Explore data and build model using regression, clustering, etc.; PMML Model
      • Scoring (Model Consumer, PATTERN): New Data; ETL, prepare data (LINGUAL); PMML model; Data scores; Post Processing; Measure and improve model
  35. PATTERN: SINGLE MODEL ARCHITECTURE
      Cascading allows multiple departments to combine their workflow components into a single integrated app (jar), based on 100% open source, that can be managed by a single tool.
      [Diagram labels: Data; ETL; Data preparation; Predictive Model; LINGUAL (ANSI SQL); PATTERN (PMML); CASCADING]
      • Decrease the project costs…
      • Reduce licensing costs…
  36. PATTERN BENEFITS
      • Can score data and run experiments at scale on Hadoop
      • Run different models using ensembles
      • In turn, this makes it possible to improve existing models and increase accuracy
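One common way an ensemble combines several classifiers is a majority vote over their predicted labels, which is also how a random forest combines its trees. The slides do not describe how Pattern combines ensemble members, so the sketch below only illustrates the general technique; the class `MajorityVoteSketch`, the method `vote`, and the tie-breaking rule (alphabetically first label wins) are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MajorityVoteSketch {

    /** Returns the label predicted by the most models; ties go to the alphabetically first label. */
    static String vote(List<String> predictions) {
        if (predictions.isEmpty()) {
            throw new IllegalArgumentException("at least one prediction is required");
        }
        // TreeMap iterates labels in sorted order, which makes tie-breaking deterministic.
        Map<String, Integer> tally = new TreeMap<>();
        for (String label : predictions) {
            tally.merge(label, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : tally.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical predictions from three models for the same record.
        System.out.println(vote(Arrays.asList("versicolor", "setosa", "versicolor"))); // prints versicolor
    }
}
```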
  37. PATTERN: ARCHITECTURE
      Cascading allows multiple departments to combine their workflow components into a single integrated app (jar), based on 100% open source, that can be managed by a single tool.
      [Diagram labels: Data; ETL; Data preparation; Predictive Model; LINGUAL (ANSI SQL); PATTERN (PMML)]

          // LINGUAL: the ETL step is planned from an ANSI SQL statement
          FlowDef flowDef = FlowDef.flowDef()
            .setName( "etl" )
            .addSource( "data.source1", emplTap )
            .addSource( "data.source2", salesTap )
            .addSink( "results", resultsTap );

          SQLPlanner sqlPlanner = new SQLPlanner()
            .setSql( sqlStatement );

          flowDef.addAssemblyPlanner( sqlPlanner );
  38. PATTERN: ARCHITECTURE
      Cascading allows multiple departments to combine their workflow components into a single integrated app (jar), based on 100% open source, that can be managed by a single tool.
      [Diagram labels: Data; ETL; Data preparation; Predictive Model; LINGUAL (ANSI SQL); PATTERN (PMML)]

          // PATTERN: the scoring step is planned from a PMML model file
          FlowDef flowDef = FlowDef.flowDef()
            .setName( "classifier" )
            .addSource( "input", inputTap )
            .addSink( "classify", classifyTap );

          PMMLPlanner pmmlPlanner = new PMMLPlanner()
            .setPMMLInput( new File( pmmlModel ) )
            .retainOnlyActiveIncomingFields();

          flowDef.addAssemblyPlanner( pmmlPlanner );
  39. AGENDA
      • Cascading
      • PMML in Cascading
      • Pattern Scenarios
      • Demo
  40. PATTERN: DEMO
      1. Generate the model in R
      2. Examine the PMML model
      3. Write and run a Cascading app to score the model
  41. KEY TAKEAWAYS
      • Reuse existing learning models and investments to run data scoring at scale
      • Leverage existing skill sets: Java, Scala, SQL, PMML, etc.
      • Allow teams to collaborate on a single model that can be visualized, managed and monitored
  42. QUESTIONS?
      @ALEXISROOS
