Pattern: an open source project for migrating predictive models onto Apache Hadoop

Published in: Technology

Transcript of "Pattern: an open source project for migrating predictive models onto Apache Hadoop"

  1. “Pattern – an open source project for migrating predictive models onto Apache Hadoop”
     Paco Nathan, Concurrent, Inc., San Francisco, CA
     @pacoid
     Copyright ©2013, Concurrent, Inc.
     Sunday, 17 March 13
  2. Pattern: predictive models at scale
     • Enterprise Data Workflows
     • Sample Code
     • A Little Theory…
     • Pattern
     • PMML
     • Roadmap
     • Customer Experiments
     [flow diagram: word count with stop-word filtering. Nodes: Document Collection, Tokenize, Scrub token, HashJoin Left, Stop Word List (RHS), Regex token, GroupBy token, Count, Word Count; M = map, R = reduce]
  3. Cascading – origins
     API author Chris Wensel worked as a system architect at an Enterprise firm well-known for many popular data products.
     Wensel was following the Nutch open source project – where Hadoop started.
     Observation: would be difficult to find Java developers to write complex Enterprise apps in MapReduce – potential blocker for leveraging new open source technology.
  4. Cascading – functional programming
     Key insight: MapReduce is based on functional programming – back to LISP in 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
     To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows:
     • leverages JVM and Java-based tools without any need to create new languages
     • allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters
  5. functional programming… in production
     • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments
     • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
       Cascalog in Clojure (2010): github.com/nathanmarz/cascalog/wiki
       Scalding in Scala (2012): github.com/twitter/scalding/wiki
  6. Cascading – definitions
     • a pattern language for Enterprise Data Workflows
     • simple to build, easy to test, robust in production
     • design principles ⟹ ensure best practices at scale
     [architecture diagram. Labels: Customers, Web App, logs, Cache, Logs, Support, source taps, trap tap, sink taps, Data Workflow, Modeling, PMML, Analytics Cubes, customer profile DBs, Customer Prefs, Hadoop Cluster, Reporting]
  7. Cascading – usage
     • Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL
     • ASL 2 license, GitHub src, http://conjars.org
     • 5+ yrs production use, multiple Enterprise verticals
  8. Cascading – integrations
     • partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera
     • taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.
     • serialization: Avro, Thrift, Kryo, JSON, etc.
     • topologies: Apache Hadoop, tuple spaces, local mode
  9. Cascading – deployments
     • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
     • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.
  10. Cascading – deployments
      workflow abstraction addresses:
      • staffing bottleneck
      • system integration
      • operational complexity
      • test-driven development
  11. Pattern: predictive models at scale (agenda, as on slide 2)
  12. The Ubiquitous Word Count
      Definition: count how often each word appears in a collection of text documents
      This simple program provides an excellent test case for parallel processing, since it:
      • requires a minimal amount of code
      • demonstrates use of both symbolic and numeric values
      • shows a dependency graph of tuples as an abstraction
      • is not many steps away from useful search indexing
      • serves as a “Hello World” for Hadoop apps
      Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.

          void map (String doc_id, String text):
            for each word w in segment(text):
              emit(w, "1");

          void reduce (String word, Iterator group):
            int count = 0;
            for each pc in group:
              count += Int(pc);
            emit(word, String(count));
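The map and reduce pseudocode on this slide can be sketched in plain Java without any Hadoop dependency. This is only an illustration: the class name `WordCountSketch`, the whitespace tokenizer standing in for `segment(text)`, and the in-memory `run` driver that plays the role of the framework's shuffle are all assumptions, not part of the talk.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    // "map" phase: emit one (word, 1) pair per token in a document
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : text.split("\\s+")) {      // assumed stand-in for segment(text)
            if (!w.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return pairs;
    }

    // "reduce" phase: sum the partial counts collected for one word
    static int reduce(String word, List<Integer> group) {
        int count = 0;
        for (int pc : group) {
            count += pc;
        }
        return count;
    }

    // in-memory driver standing in for the framework: map, group by word, reduce
    static Map<String, Integer> run(Map<String, String> docs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            counts.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc01", "a rain shadow is a dry area");
        docs.put("doc02", "rain falls on the windward side");
        System.out.println(run(docs));
    }
}
```

On a cluster the grouping step is done by the framework between the two phases; here it is a simple `TreeMap` so the whole flow fits in one process.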
  13. word count – conceptual flow diagram
      [flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; M = map, R = reduce]
      1 map, 1 reduce, 18 lines code
      cascading.org/category/impatient
      gist.github.com/3900702
  14. word count – Cascading app in Java

          String docPath = args[ 0 ];
          String wcPath = args[ 1 ];
          Properties properties = new Properties();
          AppProps.setApplicationJarClass( properties, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

          // create source and sink taps
          Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
          Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

          // specify a regex to split "document" text lines into token stream
          Fields token = new Fields( "token" );
          Fields text = new Fields( "text" );
          RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
          // only returns "token"
          Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

          // determine the word counts
          Pipe wcPipe = new Pipe( "wc", docPipe );
          wcPipe = new GroupBy( wcPipe, token );
          wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
           .addSource( docPipe, docTap )
           .addTailSink( wcPipe, wcTap );

          // write a DOT file and run the flow
          Flow wcFlow = flowConnector.connect( flowDef );
          wcFlow.writeDOT( "dot/wc.dot" );
          wcFlow.complete();
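To see what the splitter's delimiter pattern does to a line of text, here is a small stand-alone sketch using only `String.split`. It assumes the character class covers space, square brackets, parentheses, comma and period; the class name `TokenizeSketch` and the choice to drop empty tokens are illustrative assumptions, and the sketch does not claim to reproduce `RegexSplitGenerator` exactly.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenizeSketch {
    // assumed delimiter class: space, [ ], ( ), comma, period
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String t : line.split(DELIMITERS)) {
            if (!t.isEmpty()) {          // adjacent delimiters produce empty strings
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("A rain shadow (dry area), on the lee side."));
    }
}
```

Because the pattern matches one delimiter at a time, a run such as `), ` yields empty strings between delimiters, which is why the sketch filters them out before counting.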
  15. word count – generated flow diagram
      [head]
      Hfs[TextDelimited[[doc_id, text]->[ALL]]][data/rain.txt]]
        [{2}:doc_id, text] [{2}:doc_id, text]
      map
      Each(token)[RegexSplitGenerator[decl:token][args:1]]
        [{1}:token] [{1}:token]
      GroupBy(wc)[by:[token]]
        wc[{1}:token] [{1}:token]
      reduce
      Every(wc)[Count[decl:count]]
        [{2}:token, count] [{1}:token]
      Hfs[TextDelimited[[UNKNOWN]->[token, count]]][output/wc]]
        [{2}:token, count] [{2}:token, count]
      [tail]
  16. word count – Cascalog / Clojure

          (ns impatient.core
            (:use [cascalog.api]
                  [cascalog.more-taps :only (hfs-delimited)])
            (:require [clojure.string :as s]
                      [cascalog.ops :as c])
            (:gen-class))

          (defmapcatop split [line]
            "reads in a line of string and splits it by regex"
            (s/split line #"[\[\]\\\(\),.)\s]+"))

          (defn -main [in out & args]
            (?<- (hfs-delimited out)
                 [?word ?count]
                 ((hfs-delimited in :skip-header? true) _ ?line)
                 (split ?line :> ?word)
                 (c/count ?count)))

          ; Paul Lam
          ; github.com/Quantisan/Impatient
  17. word count – Cascalog / Clojure
      github.com/nathanmarz/cascalog/wiki
      • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language
      • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL
      • composable subqueries, used for test-driven development (TDD) practices at scale
      • Leiningen build: simple, no surprises, in Clojure itself
      • more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog
      • has a learning curve, limited number of Clojure developers
      • aggregators are the magic, and those take effort to learn
  18. word count – Scalding / Scala

          import com.twitter.scalding._

          class WordCount(args : Args) extends Job(args) {
            Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
              .read
              .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
              .groupBy('token) { _.size('count) }
              .write(Tsv(args("wc"), writeHeader = true))
          }
  19. word count – Scalding / Scala
      github.com/twitter/scalding/wiki
      • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
      • code is compact, easy to understand
      • nearly 1:1 between elements of conceptual flow diagram and function calls
      • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
      • significant investments by Twitter, Etsy, eBay, etc.
      • great for data services at scale
      • less learning curve than Cascalog
  20. word count – Scalding / Scala
      Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping limit complexity in process
  21. Two Avenues to the App Layer…
      Enterprise: must contend with complexity at scale every day…
      incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
      Start-ups: crave complexity and scale to become viable…
      new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
      [chart axes: complexity ➞, scale ➞]
  22. Pattern: predictive models at scale (agenda, as on slide 2)
  23. workflow abstraction – pattern language
      Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
      Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java.
      In formal terms, this provides a pattern language.
  24. references…
      pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices
      amazon.com/dp/0195019199
      design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by “Gang of Four”
      amazon.com/dp/0201633612
  25. workflow abstraction – pattern language
      design principles of the pattern language ensure best practices for robust, parallel data workflows at scale
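The “plumbing” idea, small operations connected end to end over a flow of tuples, can be imitated with ordinary function composition. The sketch below is plain Java and does not use the Cascading API; `PlumbingSketch`, its stage names, and the use of in-memory lists as "taps" are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class PlumbingSketch {
    // scrub: trim and lowercase each line
    static final Function<List<String>, List<String>> SCRUB =
        lines -> lines.stream()
                      .map(s -> s.trim().toLowerCase())
                      .collect(Collectors.toList());

    // tokenize: split each line on whitespace, dropping empty tokens
    static final Function<List<String>, List<String>> TOKENIZE =
        lines -> lines.stream()
                      .flatMap(s -> Arrays.stream(s.split("\\s+")))
                      .filter(t -> !t.isEmpty())
                      .collect(Collectors.toList());

    // filter: drop tokens that appear in the stop word list
    static Function<List<String>, List<String>> dropStopWords(Set<String> stopWords) {
        return tokens -> tokens.stream()
                               .filter(t -> !stopWords.contains(t))
                               .collect(Collectors.toList());
    }

    // aggregate: group identical tokens and count them
    static Map<String, Long> count(List<String> tokens) {
        return tokens.stream()
                     .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // connect the stages: source -> scrub -> tokenize -> stop-word filter -> count
    static Map<String, Long> run(List<String> source, Set<String> stopWords) {
        Function<List<String>, List<String>> pipeline =
            SCRUB.andThen(TOKENIZE).andThen(dropStopWords(stopWords));
        return count(pipeline.apply(source));
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("a", "the", "is"));
        System.out.println(run(Arrays.asList("A rain shadow is a dry area", "The rain falls"), stop));
    }
}
```

Each stage knows nothing about its neighbours, so stages can be reordered, tested alone, or swapped, which is the property the pattern language is meant to encourage.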
  26. workflow abstraction – literate programming
      Cascading workflows generate their own visual documentation: flow diagrams
      In formal terms, flow diagrams leverage a methodology called literate programming
      Provides intuitive, visual representations for apps – great for cross-team collaboration
  27. references…
      by Don Knuth
      Literate Programming
      Univ of Chicago Press, 1992
      literateprogramming.com/
      “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
  28. workflow abstraction – test-driven development
      • assert patterns (regex) on the tuple streams
      • adjust assert levels, like log4j levels
      • trap edge cases as “data exceptions”
      • TDD at scale:
        1. start from raw inputs in the flow graph
        2. define stream assertions for each stage of transforms
        3. verify exceptions, code to remove them
        4. when impl is complete, app has full test coverage
      redirect traps in production to Ops, QA, Support, Audit, etc.
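A rough picture of stream assertions and traps, in plain Java rather than the Cascading API: each tuple is checked against a regex, and tuples that fail are diverted to a trap instead of stopping the job. The class `TrapSketch`, the tab-separated record format, and the on/off switch standing in for assertion levels are all hypothetical simplifications.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class TrapSketch {
    // hypothetical assertion: a numeric id, a tab, then non-empty text
    static final Pattern EXPECTED = Pattern.compile("\\d+\\t[^\\t]+");

    static final class Result {
        final List<String> passed = new ArrayList<>();
        final List<String> trapped = new ArrayList<>();
    }

    // with assertions on, non-matching tuples go to the trap;
    // with assertions off, every tuple passes through unchecked
    static Result check(List<String> tuples, boolean assertionsOn) {
        Result result = new Result();
        for (String tuple : tuples) {
            if (!assertionsOn || EXPECTED.matcher(tuple).matches()) {
                result.passed.add(tuple);
            } else {
                result.trapped.add(tuple);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = check(Arrays.asList("1\train shadow", "malformed row", "2\tlee side"), true);
        System.out.println("passed:  " + r.passed);
        System.out.println("trapped: " + r.trapped);
    }
}
```

The trapped list plays the role of the trap tap: in production it could be written somewhere Ops or QA can inspect, while the main flow carries on with the clean tuples.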
  29. workflow abstraction – business process
      Following the essence of literate programming, Cascading workflows provide statements of business process.
      This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data).
      Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.).
      This is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.”
      By virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale.
  30. references…
      by Edgar Codd
      “A relational model of data for large shared data banks”
      Communications of the ACM, 1970
      dl.acm.org/citation.cfm?id=362685
      Rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on what apps do: the process of structuring data
      Closely related to functional relational programming paradigm:
      “Out of the Tar Pit”, Moseley & Marks, 2006
      http://goo.gl/SKspn
  31. workflow abstraction – API design principles
      • specify what is required, not how it must be achieved
      • plan far ahead, before consuming cluster resources – fail fast prior to submit
      • fail the same way twice – deterministic flow planners help reduce engineering costs for debugging at scale
      • same JAR, any scale – app does not require a recompile to change data taps or cluster topologies
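The "fail fast prior to submit" principle can be illustrated with a toy planner that checks, before anything runs, that every step's input fields are produced upstream. This is a hypothetical sketch, not how the Cascading flow planner is implemented; `PlannerSketch`, `Step`, and the field names are invented for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PlannerSketch {
    // one step of a workflow: the fields it reads and the fields it adds
    static final class Step {
        final String name;
        final Set<String> needs;
        final Set<String> adds;

        Step(String name, Set<String> needs, Set<String> adds) {
            this.name = name;
            this.needs = needs;
            this.adds = adds;
        }
    }

    static Set<String> fields(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    // walk the plan without executing it; throw on the first unresolved field
    static void validate(Set<String> sourceFields, List<Step> steps) {
        Set<String> available = new HashSet<>(sourceFields);
        for (Step step : steps) {
            for (String field : step.needs) {
                if (!available.contains(field)) {
                    throw new IllegalStateException(
                        "step '" + step.name + "' needs missing field '" + field + "'");
                }
            }
            available.addAll(step.adds);
        }
    }

    public static void main(String[] args) {
        List<Step> plan = Arrays.asList(
            new Step("tokenize", fields("text"), fields("token")),
            new Step("count", fields("token"), fields("count")));
        validate(fields("doc_id", "text"), plan);
        System.out.println("plan is valid");
    }
}
```

Because the check is deterministic and touches no data, a misnamed field fails the same way every time and fails before any cluster resources are consumed.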
  32. workflow abstraction – building apps in layers
      • business process: separation of concerns: focus on specifying what is required, not how the computers must accomplish it – not unlike BPM/BPEL for BigData
      • test-driven development: assert expected patterns in tuple flows, adjust assertion levels, verify that tests fail, code until tests pass, repeat … route exceptional data to appropriate department
      • pattern language: syntax of the pattern language conveys expertise – much like building a tower with Lego blocks: ensure best practices for robust, parallel data workflows at scale
      • flow planner/optimizer: enables the functional programming aspects: compiler within a compiler, mapping flows to topologies (e.g., create and sequence Hadoop job steps)
      • compiler/build: entire app is visible to the compiler: resolves issues of crossing boundaries for troubleshooting, exception handling, notifications, etc.; one app = one JAR
      • topology: Apache Hadoop MR, IMDGs, etc., – upcoming MR2, etc.
      • JVM cluster: cluster scheduler, instrumentation, etc.
  33. workflow abstraction – building apps in layers
      several theoretical aspects converge into software engineering practices which minimize the complexity of building and maintaining Enterprise data workflows
  34. Pattern: predictive models at scale (agenda, as on slide 2)
  35. Pattern – analytics workflows
      • open source project – ASL 2, GitHub repo
      • multiple companies contributing
      • complementary to Apache Mahout – while leveraging workflow abstraction, multiple topologies, etc.
      • model scoring: generates workflows from PMML models
      • model creation: estimation at scale, captured as PMML
      • use sample Hadoop app at scale – no coding required
      • integrate with 2 lines of Java (1 line Clojure or Scala)
      • excellent use cases for customer experiments at scale
      cascading.org/pattern
  36. Pattern – analytics workflows
      greatly reduced development costs, fewer licensing issues at scale – leveraging the economics of Apache Hadoop clusters, plus the core competencies of analytics staff, plus existing IP in predictive models
      cascading.org/pattern
  37. Pattern – model scoring
      • migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
      • great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.
      • integrate with other libraries – Matrix API, etc.
      • leverage PMML as another kind of DSL
      cascading.org/pattern
  38. Pattern – an example classifier
      1. use customer order history as the training data set
      2. train a risk classifier for orders, using Random Forest
      3. export model from R to PMML
      4. build a Cascading app to execute the PMML model
         4.1. generate flow from PMML description
         4.2. plan the flow for a topology (Hadoop)
         4.3. compile app to a JAR file
      5. verify results with a regression test
      6. deploy the app at scale to calculate scores
      7. potentially, reuse classifier for real-time scoring
  39. Pattern – an example classifier
      [diagram: risk classifier along two dimensions, customer 360 and per-order. Labels: Cascading apps, analyst's laptop, data prep, training data sets, customer transactions, predict model costs, score new orders, PMML model, detect fraudsters, anomaly detection, segment customers, velocity metrics, Hadoop batch workloads, IMDG real-time workloads, Customer DB, ETL, partner data, DW, chargebacks, etc.]
  40. Pattern – create a model in R

          ## packages providing randomForest(), pmml(), and saveXML()
          library(randomForest)
          library(pmml)
          library(XML)

          ## train a RandomForest model
          f <- as.formula("as.factor(label) ~ .")
          fit <- randomForest(f, data_train, ntree=50)

          ## test the model on the holdout test set
          print(fit$importance)
          print(fit)

          predicted <- predict(fit, data)
          data$predicted <- predicted
          confuse <- table(pred = predicted, true = data[,1])
          print(confuse)

          ## export predicted labels to TSV
          write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

          ## export RF model to PMML
          saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
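The `table(pred = ..., true = ...)` call in the R script tabulates a confusion matrix: for each pair of predicted and true labels, how many rows fall in that cell. A minimal Java equivalent is sketched below; `ConfusionSketch` and the string labels are assumptions for illustration and are not taken from the Pattern code base.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ConfusionSketch {
    // result.get(pred).get(actual) = number of rows with that (predicted, true) pair
    static Map<String, Map<String, Integer>> confusion(List<String> predicted, List<String> actual) {
        if (predicted.size() != actual.size()) {
            throw new IllegalArgumentException("predicted and actual must have the same length");
        }
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (int i = 0; i < predicted.size(); i++) {
            counts.computeIfAbsent(predicted.get(i), k -> new TreeMap<>())
                  .merge(actual.get(i), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> predicted = Arrays.asList("1", "0", "1", "1");
        List<String> actual = Arrays.asList("1", "0", "0", "1");
        System.out.println(confusion(predicted, actual));
    }
}
```

Cells on the diagonal (predicted equals true) are correct classifications; off-diagonal cells show which kinds of mistakes the classifier makes.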
  41. Pattern – capture model parameters as PMML

          <?xml version="1.0"?>
          <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
           <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
            <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
            <Application name="Rattle/PMML" version="1.2.30"/>
            <Timestamp>2012-10-22 19:39:28</Timestamp>
           </Header>
           <DataDictionary numberOfFields="4">
            <DataField name="label" optype="categorical" dataType="string">
             <Value value="0"/>
             <Value value="1"/>
            </DataField>
            <DataField name="var0" optype="continuous" dataType="double"/>
            <DataField name="var1" optype="continuous" dataType="double"/>
            <DataField name="var2" optype="continuous" dataType="double"/>
           </DataDictionary>
           <MiningModel modelName="randomForest_Model" functionName="classification">
            <MiningSchema>
             <MiningField name="label" usageType="predicted"/>
             <MiningField name="var0" usageType="active"/>
             <MiningField name="var1" usageType="active"/>
             <MiningField name="var2" usageType="active"/>
            </MiningSchema>
            <Segmentation multipleModelMethod="majorityVote">
             <Segment id="1">
              <True/>
              <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
               <MiningSchema>
                <MiningField name="label" usageType="predicted"/>
                <MiningField name="var0" usageType="active"/>
                <MiningField name="var1" usageType="active"/>
                <MiningField name="var2" usageType="active"/>
               </MiningSchema>
          ...
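The `multipleModelMethod="majorityVote"` attribute means each segment (one tree of the forest) predicts a label and the most frequent label wins. A minimal sketch of that combination step follows; `MajorityVoteSketch` is a hypothetical name, and the tie-breaking rule (the label that sorts first wins) is an assumption of this example, not something stated in the slides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MajorityVoteSketch {
    // combine per-tree predictions into a single label by majority vote
    static String vote(List<String> treePredictions) {
        if (treePredictions.isEmpty()) {
            throw new IllegalArgumentException("need at least one prediction");
        }
        Map<String, Integer> tally = new TreeMap<>();
        for (String label : treePredictions) {
            tally.merge(label, 1, Integer::sum);
        }
        String best = null;
        int bestCount = -1;
        // TreeMap iterates in sorted label order, so ties go to the first label (assumed rule)
        for (Map.Entry<String, Integer> entry : tally.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(vote(Arrays.asList("1", "0", "1", "1", "0")));
    }
}
```

With `ntree=50` as in the R script, the list passed to `vote` would hold fifty labels, one per `Segment`.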
  42. 42. Pattern – score a model, within an app

      public class Main {
        public static void main( String[] args ) {
          String pmmlPath = args[ 0 ];
          String ordersPath = args[ 1 ];
          String classifyPath = args[ 2 ];
          String trapPath = args[ 3 ];

          Properties properties = new Properties();
          AppProps.setApplicationJarClass( properties, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

          // create source and sink taps (tab-delimited text with a header row)
          Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
          Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
          Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

          // define a "Classifier" model from PMML to evaluate the orders
          ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
          Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
            .addSource( classifyPipe, ordersTap )
            .addTrap( classifyPipe, trapTap )
            .addSink( classifyPipe, classifyTap );

          // write a DOT file and run the flow
          Flow classifyFlow = flowConnector.connect( flowDef );
          classifyFlow.writeDOT( "dot/classify.dot" );
          classifyFlow.complete();
        }
      }
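The PMML shown earlier declares `multipleModelMethod="majorityVote"`, so scoring a random forest conceptually means collecting one label per tree and returning the most frequent one. The following is a conceptual sketch only: `MajorityVote` is a hypothetical class, the tie-breaking rule (first label seen wins) is an assumption, and none of this claims to mirror the internals of `ClassifierFunction`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of majority voting across the trees of an ensemble.
final class MajorityVote {
  // Returns the label predicted by the most trees.
  // Assumption: on a tie, the label that appeared first in the list wins.
  static String vote(List<String> treePredictions) {
    if (treePredictions.isEmpty()) {
      throw new IllegalArgumentException("need at least one tree prediction");
    }
    // LinkedHashMap keeps first-seen order, which makes tie-breaking deterministic
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String label : treePredictions) {
      counts.merge(label, 1, Integer::sum);
    }
    String best = null;
    int bestCount = -1;
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      if (entry.getValue() > bestCount) {
        best = entry.getKey();
        bestCount = entry.getValue();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // three hypothetical trees voting on one order
    System.out.println(vote(Arrays.asList("1", "0", "1")));
  }
}
```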
  43. 43. Pattern – score a model, using pre-defined Cascading app
      [flow diagram; node labels: Customer Orders, PMML Model, Classify, Scored Orders, Assert, GroupBy token, Count, Confusion Matrix, Failure Traps]
  44. 44. Pattern – score a model, using pre-defined Cascading app

      ## run an RF classifier at scale
      hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
        --pmml data/sample.rf.xml

      ## run an RF classifier at scale, assert regression test, measure confusion matrix
      hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
        --pmml data/sample.rf.xml --assert --measure out/measure

      ## run a predictive model at scale, measure RMSE
      hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
        --pmml data/iris.lm_p.xml --rmse out/measure
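For reference, the RMSE reported by the `--rmse` option is the standard root-mean-square error between actual and predicted values. A minimal sketch of that formula follows; the class `Rmse` and the sample numbers are hypothetical, and this is not the code Pattern runs on Hadoop.

```java
// Hypothetical helper showing the standard RMSE formula: sqrt(mean((predicted - actual)^2)).
final class Rmse {
  static double rmse(double[] actual, double[] predicted) {
    if (actual.length == 0 || actual.length != predicted.length) {
      throw new IllegalArgumentException("arrays must be non-empty and of equal length");
    }
    double sumSquaredError = 0.0;
    for (int i = 0; i < actual.length; i++) {
      double error = predicted[i] - actual[i];
      sumSquaredError += error * error;
    }
    return Math.sqrt(sumSquaredError / actual.length);
  }

  public static void main(String[] args) {
    // made-up values, purely for illustration
    double[] actual = { 1.0, 2.0, 3.0 };
    double[] predicted = { 1.0, 2.0, 5.0 };
    System.out.println(rmse(actual, predicted));
  }
}
```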
  45. 45. Pattern – evaluating results

      bash-3.2$ head out/classify/part-00000
      label  var0  var1  var2  order_id  predicted  score
      1      0     1     0     6f8e1014  1          1
      0      0     0     1     6f8ea22e  0          0
      1      0     1     0     6f8ea435  1          1
      0      0     0     1     6f8ea5e1  0          0
      1      0     1     0     6f8ea785  1          1
      1      0     1     0     6f8ea91e  1          1
      0      1     0     0     6f8eaaba  0          0
      1      0     1     0     6f8eac54  1          1
      0      1     1     0     6f8eade3  1          1
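As a small worked example, the `label` and `predicted` columns of the nine rows above can be tallied into a confusion matrix. This is only a sketch: `ConfusionMatrix` is a hypothetical class, and it assumes binary labels 0 and 1 as in the sample output.

```java
// Hypothetical helper that tallies (label, predicted) pairs into a 2x2 confusion matrix.
final class ConfusionMatrix {
  // (label, predicted) pairs copied from the nine sample rows shown above
  static final int[][] ROWS = {
    { 1, 1 }, { 0, 0 }, { 1, 1 }, { 0, 0 }, { 1, 1 },
    { 1, 1 }, { 0, 0 }, { 1, 1 }, { 0, 1 }
  };

  // Returns counts[actual][predicted]; assumes labels are 0 or 1.
  static int[][] tally(int[][] labelAndPredicted) {
    int[][] counts = new int[2][2];
    for (int[] row : labelAndPredicted) {
      counts[row[0]][row[1]]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    int[][] m = tally(ROWS);
    // prints: TN=3 FP=1 FN=0 TP=5
    System.out.printf("TN=%d FP=%d FN=%d TP=%d%n", m[0][0], m[0][1], m[1][0], m[1][1]);
  }
}
```

Only the last sample row disagrees (label 0, predicted 1), which is the single false positive.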
  46. 46. Lingual – connecting Hadoop and R

      # load the JDBC package
      library(RJDBC)

      # set up the driver
      drv <- JDBC("cascading.lingual.jdbc.Driver",
        "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

      # set up a database connection to a local repository
      connection <- dbConnect(drv,
        "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

      # query the repository: in this case the MySQL sample database (CSV files)
      df <- dbGetQuery(connection,
        "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
      head(df)

      # use R functions to summarize and visualize part of the data
      df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
      summary(df$hire_age)

      library(ggplot2)
      m <- ggplot(df, aes(x=hire_age))
      m <- m + ggtitle("Age at hire, people named Gina")
      m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
  47. 47. Lingual – connecting Hadoop and R

      > summary(df$hire_age)
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        20.86   27.89   31.70   31.61   35.01   43.92

      cascading.org/lingual
      launchpad.net/test-db
  48. 48. Pattern: predictive models at scale
      • Enterprise Data Workflows
      • Sample Code
      • A Little Theory…
      • Pattern
      • PMML
      • Roadmap
      • Customer Experiments
  49. 49. PMML – standard
      • established XML standard for predictive model markup
      • organized by Data Mining Group (DMG), since 1997 http://dmg.org/
      • members: IBM, SAS, Visa, NASA, Equifax, MicroStrategy, Microsoft, etc.
      • PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows

      “PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”
      wikipedia.org/wiki/Predictive_Model_Markup_Language
  50. 50. PMML – models
      • Association Rules: AssociationModel element
      • Cluster Models: ClusteringModel element
      • Decision Trees: TreeModel element
      • Naïve Bayes Classifiers: NaiveBayesModel element
      • Neural Networks: NeuralNetwork element
      • Regression: RegressionModel and GeneralRegressionModel elements
      • Rulesets: RuleSetModel element
      • Sequences: SequenceModel element
      • Support Vector Machines: SupportVectorMachineModel element
      • Text Models: TextModel element
      • Time Series: TimeSeriesModel element

      ibm.com/developerworks/industry/library/ind-PMML2/
  51. 51. PMML – vendor coverage
  52. 52. Pattern: predictive models at scale
      • Enterprise Data Workflows
      • Sample Code
      • A Little Theory…
      • Pattern
      • PMML
      • Roadmap
      • Customer Experiments
  53. 53. roadmap – existing algorithms for scoring
      • Random Forest
      • Decision Trees
      • Linear Regression
      • GLM
      • Logistic Regression
      • K-Means Clustering
      • Hierarchical Clustering
      • Support Vector Machines

      cascading.org/pattern
  54. 54. roadmap – top priorities for creating models at scale
      • Random Forest
      • Logistic Regression
      • K-Means Clustering

      A wealth of recent research indicates many opportunities to parallelize popular algorithms for training models at scale on Apache Hadoop…

      cascading.org/pattern
  55. 55. roadmap – next priorities for scoring
      • Time Series (ARIMA forecast)
      • Association Rules (basket analysis)
      • Naïve Bayes
      • Neural Networks

      Algorithms are extended based on customer use cases; contact @pacoid

      cascading.org/pattern
  56. 56. Pattern: predictive models at scale
      • Enterprise Data Workflows
      • Sample Code
      • A Little Theory…
      • Pattern
      • PMML
      • Roadmap
      • Customer Experiments
  57. 57. experiments – comparing models
      • much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
      • run multiple variants, then measure relative “lift”
      • Concurrent runtime – tag and track models

      The following example compares two models trained with different machine learning algorithms. It is exaggerated: one model has an important variable intentionally omitted to help illustrate the experiment.
  58. 58. experiments – Random Forest model

      ## train a Random Forest model
      ## example: http://mkseo.pe.kr/stats/?p=220
      f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
      fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
      print(fit)
      saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

      OOB estimate of error rate: 14%
      Confusion matrix:
          0   1 class.error
      0  69  16   0.1882353
      1  12 103   0.1043478
  59. 59. experiments – Logistic Regression model

      ## train a Logistic Regression model (special case of GLM)
      ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
      f <- as.formula("as.factor(label) ~ var0 + var2")
      fit <- glm(f, family=binomial, data=data)
      print(summary(fit))
      saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

      Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
      (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
      var0         -1.3755     0.4355  -3.159  0.00159 **
      var2         -3.7742     0.5794  -6.514 7.30e-11 ***
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      NB: this model has “var1” intentionally omitted
  60. 60. experiments – comparing results
      • use a confusion matrix to compare results for the classifiers
      • Logistic Regression has a lower “false negative” rate (5% vs. 11%); however, it has a much higher “false positive” rate (52% vs. 14%)
      • assign a cost model to select a winner, for example in an ecommerce anti-fraud classifier:
          FN ∼ chargeback risk
          FP ∼ customer support costs
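One way to make the cost-model idea concrete is to weight each error rate by a per-error cost and compare the totals. The sketch below uses the error rates quoted on the slide, but the class `CostModel`, the cost figures (100 per false negative, 10 per false positive), and the simplification of ignoring class balance are all assumptions for illustration.

```java
// Hypothetical cost model: weighted sum of false-negative and false-positive rates.
// Simplifying assumption: class balance is ignored, so rates are compared directly.
final class CostModel {
  static double cost(double fnRate, double fpRate, double costPerFn, double costPerFp) {
    return fnRate * costPerFn + fpRate * costPerFp;
  }

  // Returns the name of the cheaper model; "Random Forest" is returned on an exact tie.
  static String winner(double costPerFn, double costPerFp) {
    // error rates as quoted on the slide
    double randomForest = cost(0.11, 0.14, costPerFn, costPerFp);
    double logisticRegression = cost(0.05, 0.52, costPerFn, costPerFp);
    return logisticRegression < randomForest ? "Logistic Regression" : "Random Forest";
  }

  public static void main(String[] args) {
    // hypothetical costs: a chargeback (FN) is 10x as expensive as a support call (FP)
    System.out.println(winner(100.0, 10.0));
    // hypothetical costs: a chargeback is only 5x as expensive
    System.out.println(winner(50.0, 10.0));
  }
}
```

With the first set of hypothetical costs Logistic Regression comes out cheaper (about 10.2 vs. 12.4); with the second, Random Forest does (about 6.9 vs. 7.7). The winner depends entirely on the relative cost assigned to each kind of error.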
  61. 61. references…
      Enterprise Data Workflows with Cascading
      O’Reilly, 2013
      amazon.com/dp/1449358721
  62. 62. drill-down…
      blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities:
      cascading.org
      zest.to/group11
      github.com/Cascading
      conjars.org
      goo.gl/KQtUL
      concurrentinc.com

      Copyright @2013, Concurrent, Inc.