Pattern: An Open Source Project for Migrating Predictive Models from SAS
    Presentation Transcript

    • DRIVING INNOVATION THROUGH DATA
      PATTERN: AN OPEN SOURCE PROJECT FOR MIGRATING PREDICTIVE MODELS FROM SAS, ETC., ONTO HADOOP
      Alexis Roos | June 3 2014 | Hadoop Summit
    • PATTERN IN A NUTSHELL
      Pattern is:
      - An open source project that works on top of Cascading to score PMML models (exported from R, SAS, etc.) at scale on Hadoop.
      - Models are reused and deployed within Cascading workflows.
      Copyright 2014, Concurrent Inc.
    • AGENDA
      - Cascading
      - PMML and Cascading
      - Pattern Scenarios
      - Demo
    • CASCADING OVERVIEW
      - Enterprise grade: a proven application development framework for building robust and complex Big Data applications, with thousands of deployments.
      - Productive: Cascading relies on software patterns to provide a level of abstraction that greatly simplifies the creation, testing, deployment, and operation of applications by focusing on business logic first.
      - Flexible and extensible: runs on all popular Hadoop distributions, but is not limited to Hadoop. An easily extensible framework supporting a variety of extensions, tools, and other integrations.
      Diagram labels: Data Applications (ETL, Analytics, Data Processing, Machine Learning); Hadoop (On-Premise or Cloud).
    • CASCADING ECOSYSTEM
      Diagram labels: Languages (Lingual, Pattern, Scalding, Cascalog; SQL, Clojure); Hadoop Distributions; On-Premise Deployments; Other Data Stores (RDBMS, MPP, EDW).
    • BUSINESSES DEPEND ON US
      - >30% of Marketplace’s 1000 node Hadoop cluster runs Cascading applications
      - Cascading powers Revenues, Publisher Analytics, and User Engagement applications
      - Built their business for weather insurance using Cascading; sold to Monsanto for $950MM
      - Standardize on Cascading for their fraud detection business
    • CASCADING - DATA APPS
      - Enterprise IT: Extract Transform Load; Log File Analysis; Systems Integration; Operations Analysis
      - Corporate Apps: HR Analytics; Employee Behavioral Analysis; Customer Support | eCRM; Business Reporting
      - Telecom: Data processing of Open Data; Geospatial Indexing; Consumer Mobile Apps; Location based services
      - Marketing / Retail: Mobile, Social, Search Analytics; Funnel analysis; Revenue attribution; Customer experiments; Ad Optimization; Retail recommenders
      - Consumer / Entertainment: Music Recommendation; Comparison Shopping; Restaurant Rankings; Real Estate; Rental Listings; Travel Search & Forecast
      - Finance: Fraud and Anomaly Detection; Fraud Experiments; Customer Analytics; Insurance Risk Metric
      - Health / Biotech: Aggregate metrics for Govt; Person biometrics; Veterinary diagnostics; Next-Gen Genomics; Agronomics; Environmental Maps
    • CASCADING - MODEL METAPHOR
      The Cascading processing model is based on a metaphor of flows based on patterns.
      - Data is represented as flows of tuples.
      - Pipes allow you to manage a data flow through functional programming: splitting, merging, filtering, parsing, transforming, grouping, aggregating, buffering, joining, etc.
      Diagram labels: Source Tap, Pipe, Pipe Assembly, Tuple Stream (+ Fields), Sink Tap, Flow.
    • CASCADE
      - A Cascade joins together multiple flows and executes them based on their dependencies.
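To make "executes them based on dependencies" concrete, here is a minimal plain-Java sketch of dependency ordering (a topological sort). It is an illustration only and not Cascading's Cascade implementation: the class `FlowOrder`, the method `executionOrder`, and the flow names are all hypothetical, and each flow is assumed to simply declare which other flows' output it reads.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not Cascading's Cascade implementation.
class FlowOrder {
    // deps maps a flow name to the names of the flows whose output it reads.
    static List<String> executionOrder(Map<String, List<String>> deps) {
        Map<String, Integer> pending = new LinkedHashMap<>();    // unmet dependency counts
        Map<String, List<String>> dependents = new HashMap<>();  // reverse edges
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            pending.put(e.getKey(), e.getValue().size());
            for (String upstream : e.getValue()) {
                dependents.computeIfAbsent(upstream, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Flows with no unmet dependencies can run immediately.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String flow = ready.poll();
            order.add(flow);
            // Completing a flow releases the flows that were waiting on it.
            for (String next : dependents.getOrDefault(flow, Collections.<String>emptyList())) {
                if (pending.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalStateException("cycle or unknown dependency among flows");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("report", Arrays.asList("etl", "score"));
        deps.put("score", Arrays.asList("etl"));
        deps.put("etl", Collections.<String>emptyList());
        System.out.println(executionOrder(deps));
    }
}
```

In this sketch a flow becomes runnable only once every flow it reads from has finished, which is the ordering guarantee the slide describes.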
    • FLOWS EXECUTION
      - Flow planners allow Flows to be independent of the execution platform; the query planner is responsible for defining, sharing, and executing data-processing workflows.
      - Currently there are two kinds of flow planners: Local and Hadoop (1 & 2).
      - Allows for “fail fast”: the flow planners can check completeness of flows, operations, type safety, etc.
      - Maps the pipe assembly to MapReduce in a deterministic way.
    • FLOWS EXECUTION
      Cascading automatically generates MapReduce jobs for the specified platform.
      Diagram labels: Client (Flow, Assembly: Source, func, aggr, GroupBy, CoGroup, temp, Sink); Cluster (Job: Map, Reduce).
    • CASCADE - EXAMPLE CODE
      Top 10 IPs for an Apache log file:

        RegexParser parser = new RegexParser(
            new Fields("ip", "time", "request", "response", "size"),
            "^([^ ]*) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) ([^ ]*).*$",
            new int[]{1, 2, 3, 4, 5});

        // parse each log line into fields
        Pipe processPipe = new Each("processPipe", new Fields("line"), parser, Fields.RESULTS);
        // group by IP and count the lines per IP
        processPipe = new GroupBy(processPipe, new Fields("ip"));
        processPipe = new Every(processPipe, Fields.GROUP, new Count(new Fields("IPcount")), Fields.ALL);

        // order by count, descending, and keep the first 10
        Pipe sortedCountByIpPipe = new GroupBy(processPipe, new Fields("IPcount"), true);
        sortedCountByIpPipe = new Each(sortedCountByIpPipe, new Fields("IPcount"), new Limit(10));
    • CASCADE - EXAMPLE CODE - DRIVEN (screenshot slides)
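For readers unfamiliar with pipe assemblies, the "Top 10 IPs" example can be read as parse, group, count, sort, limit. The sketch below expresses the same steps in plain Java with no Cascading dependency. The class `TopIps` and method `topIps` are hypothetical names, and the regular expression assumes the backslash escapes that the slide text appears to have lost.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical plain-Java equivalent of the pipe assembly; not Cascading code.
class TopIps {
    // Apache access-log pattern; capture group 1 is the client IP.
    static final Pattern LOG_LINE = Pattern.compile(
        "^([^ ]*) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) ([^ ]*).*$");

    // Returns up to `limit` "ip<TAB>count" strings, highest count first.
    static List<String> topIps(List<String> lines, int limit) {
        Map<String, Long> counts = lines.stream()
            .map(LOG_LINE::matcher)
            .filter(Matcher::matches)   // skip lines that do not parse
            .collect(Collectors.groupingBy(m -> m.group(1), Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(limit)
            .map(e -> e.getKey() + "\t" + e.getValue())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String fmt = "%s - - [10/Oct/2000:13:55:36 -0700] \"GET /index.html HTTP/1.0\" 200 2326";
        List<String> sample = Arrays.asList(
            String.format(fmt, "10.0.0.1"),
            String.format(fmt, "10.0.0.2"),
            String.format(fmt, "10.0.0.1"));
        topIps(sample, 10).forEach(System.out::println);
    }
}
```

The in-memory version only works for data that fits on one machine; the point of the Cascading assembly is that the same logical steps are planned onto MapReduce jobs.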
    • AGENDA
      - Cascading
      - PMML in Cascading
      - Pattern Scenarios
      - Demo
    • PREDICTIVE MODEL MARKUP LANGUAGE (PMML)
      - Established XML standard for predictive model markup (specifies the model, not an implementation of the model)
      - Organized by the Data Mining Group (DMG), since 1997, http://dmg.org/
      - Open standards for data mining and statistical models
      - PMML producers: applications that create predictive models
      - PMML consumers: applications that read or consume models
      “PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”
    • PMML MODEL COVERAGE
      - Association Rules: AssociationModel element
      - Cluster Models: ClusteringModel element
      - Decision Trees: TreeModel element
      - Naïve Bayes Classifiers: NaiveBayesModel element
      - Neural Networks: NeuralNetwork element
      - Regression: RegressionModel and GeneralRegressionModel elements
      - Rulesets: RuleSetModel element
      - Sequences: SequenceModel element
      - Support Vector Machines: SupportVectorMachineModel element
      - Text Models: TextModel element
      - Time Series: TimeSeriesModel element
    • PMML VENDORS COVERAGE
    • BUILDING AND RUNNING PMML MODELS
      Diagram labels:
      - Training (Model Producer): Data; ETL, prepare data; explore data and build model using regression, clustering, etc.; PMML model
      - Scoring (Model Consumer, PATTERN): New Data; ETL, prepare data; PMML model; Data scores; Post Processing
      - Measure and improve model
    • PATTERN: CREATE A MODEL IN R

        ## train a RandomForest model
        f <- as.formula("as.factor(species) ~ .")
        fit <- randomForest(f, data=iris_train, proximity=TRUE, ntree=50)

        ## test the model on the holdout test set
        print(fit)

        out <- iris_full
        out$predict <- predict(fit, out, type="class")

        ## export predicted labels to TSV
        write.table(out, file=paste(dat_folder, "iris.rf.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

        ## export RF model to PMML
        saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
    • PATTERN: CAPTURE MODEL IN PMML

        <?xml version="1.0"?>
        <PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://www.dmg.org/PMML-4_1 http://www.dmg.org/v4-1/pmml-4-1.xsd">
          <Header copyright="Copyright (c) 2014 alexisroos" description="Random Forest Tree Model">
            <Extension name="user" value="alexisroos" extender="Rattle/PMML"/>
            <Application name="Rattle/PMML" version="1.4"/>
            <Timestamp>2014-06-02 18:04:36</Timestamp>
          </Header>
          <DataDictionary numberOfFields="5">
            <DataField name="species" optype="categorical" dataType="string">
              <Value value="setosa"/>
              <Value value="versicolor"/>
              <Value value="virginica"/>
            </DataField>
            <DataField name="sepal_length" optype="continuous" dataType="double"/>
            <DataField name="sepal_width" optype="continuous" dataType="double"/>
            <DataField name="petal_length" optype="continuous" dataType="double"/>
            <DataField name="petal_width" optype="continuous" dataType="double"/>
          </DataDictionary>
          <MiningModel modelName="randomForest_Model" functionName="classification">
            <MiningSchema>
              <MiningField name="species" usageType="predicted" invalidValueTreatment="asIs"/>
              <MiningField name="sepal_length" usageType="active" invalidValueTreatment="asIs"/>
              <MiningField name="sepal_width" usageType="active" invalidValueTreatment="asIs"/>
              <MiningField name="petal_length" usageType="active" invalidValueTreatment="asIs"/>
              <MiningField name="petal_width" usageType="active" invalidValueTreatment="asIs"/>
            </MiningSchema>
            <Output>
              <OutputField name="Predicted_species" feature="predictedValue"/>
              <OutputField name="Probability_setosa" optype="continuous" dataType="double" feature="probability" value="setosa"/>
              <OutputField name="Probability_versicolor" optype="continuous" dataType="double" feature="probability" value="versicolor"/>
              <OutputField name="Probability_virginica" optype="continuous" dataType="double" feature="probability" value="virginica"/>
            </Output>
            ...
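Because PMML is plain XML, a consumer can inspect a model's inputs and target without any PMML-specific library. The sketch below uses only the JDK's DOM parser to list the `MiningField` names for a given `usageType`. It is a minimal illustration, not how Pattern itself parses PMML; the class `PmmlFields` and method `fieldsWithUsage` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; it only illustrates that a PMML file is ordinary XML.
class PmmlFields {
    // Returns the names of all MiningField elements with the given usageType.
    static List<String> fieldsWithUsage(String pmmlXml, String usageType) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(pmmlXml.getBytes(StandardCharsets.UTF_8)));
            NodeList fields = doc.getElementsByTagName("MiningField");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                if (usageType.equals(field.getAttribute("usageType"))) {
                    names.add(field.getAttribute("name"));
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse PMML", e);
        }
    }

    public static void main(String[] args) {
        // A cut-down fragment shaped like the iris model on the slide.
        String xml = "<?xml version=\"1.0\"?>"
            + "<PMML version=\"4.1\" xmlns=\"http://www.dmg.org/PMML-4_1\"><MiningModel><MiningSchema>"
            + "<MiningField name=\"species\" usageType=\"predicted\"/>"
            + "<MiningField name=\"sepal_length\" usageType=\"active\"/>"
            + "<MiningField name=\"petal_width\" usageType=\"active\"/>"
            + "</MiningSchema></MiningModel></PMML>";
        System.out.println("inputs: " + fieldsWithUsage(xml, "active"));
        System.out.println("target: " + fieldsWithUsage(xml, "predicted"));
    }
}
```

The `active` fields are the columns the scoring input must supply, and the `predicted` field is the label the model was trained on.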
    • PATTERN: REUSE A MODEL

        public static void main(String[] args) throws RuntimeException {
          String inputPath = args[0];
          String classifyPath = args[1];

          Properties properties = new Properties();
          AppProps.setApplicationJarClass(properties, Main.class);
          HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

          // tab-delimited input and output, both with header rows
          Tap inputTap = new Hfs(new TextDelimited(true, "\t"), inputPath);
          Tap classifyTap = new Hfs(new TextDelimited(true, "\t"), classifyPath);

          OptionParser optParser = new OptionParser();
          optParser.accepts("pmml").withRequiredArg();
          OptionSet options = optParser.parse(args);

          FlowDef flowDef = FlowDef.flowDef()
            .setName("classify")
            .addSource("input", inputTap)
            .addSink("classify", classifyTap);

          if (options.hasArgument("pmml")) {
            String pmmlPath = (String) options.valuesOf("pmml").get(0);
            PMMLPlanner pmmlPlanner = new PMMLPlanner()
              .setPMMLInput(new File(pmmlPath))
              .retainOnlyActiveIncomingFields()
              .setDefaultPredictedField(new Fields("predict", Double.class)); // default value if missing from the model
            flowDef.addAssemblyPlanner(pmmlPlanner);
          }

          Flow classifyFlow = flowConnector.connect(flowDef);
          classifyFlow.complete();
        }
    • PATTERN: SCORE A MODEL

        ## run an RF classifier at scale
        hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml

        ## run an RF classifier at scale, assert regression test, measure confusion matrix
        hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml --measure out/measure

        ## run a predictive model at scale, measure RMSE
        hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml --rmse out/measure
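The slides do not show how the `--rmse` option computes its measure, so the following is only the textbook definition of root mean squared error, sketched in plain Java for reference. The class `Rmse` and its method are hypothetical names and are not part of the demo jar.

```java
// Hypothetical helper showing the standard RMSE definition; not taken from the demo application.
class Rmse {
    // sqrt of the mean of squared differences between predictions and actual values
    static double rmse(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / predicted.length);
    }

    public static void main(String[] args) {
        double[] predicted = {2.5, 0.0, 2.0, 8.0};
        double[] actual = {3.0, -0.5, 2.0, 7.0};
        System.out.println(rmse(predicted, actual));
    }
}
```

RMSE suits models with a continuous predicted value (for example, a regression); for a classifier such as the random forest above, the confusion matrix from `--measure` is the more natural check.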
    • AGENDA
      - Cascading
      - PMML in Cascading
      - Pattern Scenarios
      - Demo
    • Experiments – comparing models
      - Much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
      - Run multiple variants, then measure relative “lift”
      - Concurrent runtime: tag and track models
      The following example compares two models trained with different machine learning algorithms. It is exaggerated: one model has an important variable intentionally omitted to help illustrate the experiment.
    • Experiments – Random Forest model

        library(randomForest)  # randomForest()
        library(pmml)          # pmml()
        library(XML)           # saveXML()

        ## load the "baseline" reference data
        dat_folder <- '.'
        data <- read.table(file=paste(dat_folder, "data/orders.tsv", sep="/"),
                           sep="\t", quote="", na.strings="NULL", header=TRUE, encoding="UTF8")

        ## split data into test and train sets
        set.seed(71)
        split_ratio <- 2/10
        split <- round(dim(data)[1] * split_ratio)
        data_tests <- data[1:split,]

        data_train <- data[(split + 1):dim(data)[1],]
        i <- colnames(data_train) == "order_id"
        j <- 1:length(i)
        data_train <- data_train[,-j[i]]

        ## train a RandomForest model
        f <- as.formula("as.factor(label) ~ .")
        fit <- randomForest(f, data_train, ntree=25)

        ## test the model on the holdout test set
        print(fit$importance)
        print(fit)

        ## export RF model to PMML
        saveXML(pmml(fit), file=paste(dat_folder, "data/antifraud.rf.xml", sep="/"))

      Output:

        OOB estimate of error rate: 13.12%
        Confusion matrix:
             0   1  class.error
          0  57   9  0.1363636
          1  12  82  0.1276596
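The error figures on the Random Forest slide follow directly from the confusion matrix: each class error is the share of that row's cases that were misclassified, and the overall rate is all off-diagonal counts over the total, here (9 + 12) / 160 = 0.13125. The plain-Java sketch below recomputes them. The class `ConfusionMatrix` is a hypothetical helper, and it assumes rows are actual classes and columns are predicted classes, which is how the per-row `class.error` column reads.

```java
// Hypothetical helper that recomputes error rates from a confusion matrix.
class ConfusionMatrix {
    // counts[actual][predicted]; returns the misclassified share of one actual class (one row).
    static double classError(long[][] counts, int actual) {
        long rowTotal = 0;
        for (long c : counts[actual]) rowTotal += c;
        return (double) (rowTotal - counts[actual][actual]) / rowTotal;
    }

    // Share of all cases that fall off the diagonal.
    static double overallError(long[][] counts) {
        long total = 0;
        long wrong = 0;
        for (int a = 0; a < counts.length; a++) {
            for (int p = 0; p < counts[a].length; p++) {
                total += counts[a][p];
                if (a != p) wrong += counts[a][p];
            }
        }
        return (double) wrong / total;
    }

    public static void main(String[] args) {
        long[][] counts = {{57, 9}, {12, 82}};  // values from the slide
        System.out.println("class 0 error: " + classError(counts, 0));
        System.out.println("class 1 error: " + classError(counts, 1));
        System.out.println("overall error: " + overallError(counts));
    }
}
```

The same two functions can be applied to the confusion matrix of a second model variant, which gives a simple basis for comparing the experiment's models.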
    • Experiments – Random Forest model

        <?xml version="1.0"?>
        <PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://www.dmg.org/PMML-4_1 http://www.dmg.org/v4-1/pmml-4-1.xsd">
          <Header copyright="Copyright (c) 2014 alexisroos" description="Random Forest Tree Model">
            <Extension name="user" value="alexisroos" extender="Rattle/PMML"/>
            <Application name="Rattle/PMML" version="1.4"/>
            <Timestamp>2014-02-17 22:11:37</Timestamp>
          </Header>
          <DataDictionary numberOfFields="4">
            <DataField name="label" optype="categorical" dataType="string">
              <Value value="0"/>
              <Value value="1"/>
            </DataField>
            <DataField name="var0" optype="continuous" dataType="double"/>
            <DataField name="var1" optype="continuous" dataType="double"/>
            <DataField name="var2" optype="continuous" dataType="double"/>
          </DataDictionary>
          <MiningModel modelName="randomForest_Model" functionName="classification">
            <MiningSchema>
              <MiningField name="label" usageType="predicted" invalidValueTreatment="asIs"/>
              <MiningField name="var0" usageType="active" invalidValueTreatment="asIs"/>
              <MiningField name="var1" usageType="active" invalidValueTreatment="asIs"/>
              <MiningField name="var2" usageType="active" invalidValueTreatment="asIs"/>
            </MiningSchema>
            <Output>
              <OutputField name="Predicted_label" feature="predictedValue"/>
              <OutputField name="Probability_0" optype="continuous" dataType="double" feature="probability" value="0"/>
    • Experiments – Random Forest model
      In pattern/pattern-examples:

        gradle clean jar
        hadoop dfs -rmr out/classify
        hadoop jar build/libs/pattern-examples-*.jar data/sample.tsv out/classify --pmml data/antifraud.rf.xml
        hadoop dfs -cat out/classify/part-*
    • PATTERN: ALGOS IMPLEMENTED
      - Hierarchical Clustering
      - K-Means Clustering
      - Linear Regression
      - Logistic Regression
      - Random Forest
      Also: model chaining and general support for ensembles. Algorithms can be added or extended based on customer use cases.
    • BUILDING AND RUNNING PMML MODELS
      Diagram labels:
      - Training (Model Producer): Data; ETL, prepare data (LINGUAL); explore data and build model using regression, clustering, etc.; PMML model
      - Scoring (Model Consumer, PATTERN): New Data; ETL, prepare data (LINGUAL); PMML model; Data scores; Post Processing
      - Measure and improve model
    • PATTERN: SINGLE MODEL ARCHITECTURE
      Cascading allows multiple departments to combine their workflow components into a single integrated app (jar), based on 100% open source, that can be managed by a single tool.
      Diagram labels: LINGUAL (ANSI SQL): ETL, data preparation; PATTERN (PMML): predictive model; Data; CASCADING.
      Decrease the project costs… reduce licensing costs…
    • PATTERN BENEFITS
      - Score data and run experiments at scale on Hadoop
      - Run different models using ensembles
      - In turn, this makes it possible to improve existing models and their accuracy
    • PATTERN: ARCHITECTURE
      Cascading allows multiple departments to combine their workflow components into a single integrated app (jar), based on 100% open source, that can be managed by a single tool.
      Diagram labels: LINGUAL (ANSI SQL): ETL, data preparation; PATTERN (PMML): predictive model; Data.
      ETL step with Lingual:

        FlowDef flowDef = FlowDef.flowDef()
          .setName( "etl" )
          .addSource( "data.source1", emplTap )
          .addSource( "data.source2", salesTap )
          .addSink( "results", resultsTap );

        SQLPlanner sqlPlanner = new SQLPlanner()
          .setSql( sqlStatement );

        flowDef.addAssemblyPlanner( sqlPlanner );
    • PATTERN: ARCHITECTURE
      Cascading allows multiple departments to combine their workflow components into a single integrated app (jar), based on 100% open source, that can be managed by a single tool.
      Diagram labels: LINGUAL (ANSI SQL): ETL, data preparation; PATTERN (PMML): predictive model; Data.
      Predictive model step with Pattern:

        FlowDef flowDef = FlowDef.flowDef()
          .setName( "classifier" )
          .addSource( "input", inputTap )
          .addSink( "classify", classifyTap );

        PMMLPlanner pmmlPlanner = new PMMLPlanner()
          .setPMMLInput( new File( pmmlModel ) )
          .retainOnlyActiveIncomingFields();

        flowDef.addAssemblyPlanner( pmmlPlanner );
    • AGENDA
      - Cascading
      - PMML in Cascading
      - Pattern Scenarios
      - Demo
    • PATTERN: DEMO
      1. Generate the model in R
      2. Examine the PMML model
      3. Write and run a Cascading app to score the model
    • KEY TAKEAWAYS
      - Reuse existing learning models and investments to run data scoring at scale
      - Leverage existing skill sets: Java, Scala, SQL, PMML, etc.
      - Allow teams to collaborate on a single model that can be visualized, managed, and monitored
    • QUESTIONS? @ALEXISROOS