Pattern: an open source project for migrating predictive models onto Apache Hadoop

Presentation Transcript (CC Attribution-ShareAlike License)

• “Pattern – an open source project for migrating predictive models onto Apache Hadoop”
  Paco Nathan, Concurrent, Inc., San Francisco, CA – @pacoid
  Copyright @2013, Concurrent, Inc.
• Pattern: predictive models at scale
  [flow diagram: Document Collection → Scrub/Tokenize → HashJoin (Left) with Stop Word List (RHS) → GroupBy token → Count → Word Count]
  • Enterprise Data Workflows
  • Sample Code
  • A Little Theory…
  • Pattern
  • PMML
  • Roadmap
  • Customer Experiments
• Cascading – origins
  API author Chris Wensel worked as a system architect at an Enterprise firm well known for many popular data products.
  Wensel was following the Nutch open source project – where Hadoop started.
  Observation: it would be difficult to find Java developers to write complex Enterprise apps directly in MapReduce – a potential blocker for leveraging the new open source technology.
• Cascading – functional programming
  Key insight: MapReduce is based on functional programming – back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
  To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007 as a new Java API to implement functional programming for large-scale data workflows:
  • leverages the JVM and Java-based tools without any need to create new languages
  • allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters
• functional programming… in production
  • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments
  • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
    Cascalog in Clojure (2010) – github.com/nathanmarz/cascalog/wiki
    Scalding in Scala (2012) – github.com/twitter/scalding/wiki
• Cascading – definitions
  • a pattern language for Enterprise Data Workflows
  • simple to build, easy to test, robust in production
  • design principles ⟹ ensure best practices at scale
  [architecture diagram: Customers, Web App, logs, Cache, Support; source/sink/trap taps; Data Workflow with PMML Modeling; Analytics Cubes; Customer Prefs and customer profile DBs; Hadoop Cluster; Reporting]
• Cascading – usage
  • Java API; DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL
  • ASL 2 license, GitHub src, http://conjars.org
  • 5+ yrs production use, multiple Enterprise verticals
  [architecture diagram, as above]
• Cascading – integrations
  • partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera
  • taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.
  • serialization: Avro, Thrift, Kryo, JSON, etc.
  • topologies: Apache Hadoop, tuple spaces, local mode
  [architecture diagram, as above]
• Cascading – deployments
  • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
  • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.
• Cascading – deployments (continued)
  The workflow abstraction addresses:
  • staffing bottleneck
  • system integration
  • operational complexity
  • test-driven development
• Pattern: predictive models at scale – agenda recap (next: Sample Code)
• The Ubiquitous Word Count
  Definition: count how often each word appears in a collection of text documents.
  This simple program provides an excellent test case for parallel processing, since it:
  • requires a minimal amount of code
  • demonstrates use of both symbolic and numeric values
  • shows a dependency graph of tuples as an abstraction
  • is not many steps away from useful search indexing
  • serves as a “Hello World” for Hadoop apps

  void map (String doc_id, String text):
    for each word w in segment(text):
      emit(w, "1");

  void reduce (String word, Iterator group):
    int count = 0;
    for each pc in group:
      count += Int(pc);
    emit(word, String(count));

  Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.
• word count – conceptual flow diagram
  [flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; 1 map, 1 reduce, 18 lines of code]
  cascading.org/category/impatient
  gist.github.com/3900702
• word count – Cascading app in Java

  String docPath = args[ 0 ];
  String wcPath = args[ 1 ];
  Properties properties = new Properties();
  AppProps.setApplicationJarClass( properties, Main.class );
  HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

  // create source and sink taps
  Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
  Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

  // specify a regex to split "document" text lines into a token stream
  Fields token = new Fields( "token" );
  Fields text = new Fields( "text" );
  RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
  // only returns "token"
  Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

  // determine the word counts
  Pipe wcPipe = new Pipe( "wc", docPipe );
  wcPipe = new GroupBy( wcPipe, token );
  wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

  // connect the taps, pipes, etc., into a flow
  FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
    .addSource( docPipe, docTap )
    .addTailSink( wcPipe, wcTap );

  // write a DOT file and run the flow
  Flow wcFlow = flowConnector.connect( flowDef );
  wcFlow.writeDOT( "dot/wc.dot" );
  wcFlow.complete();
• word count – generated flow diagram
  [generated DOT flow diagram: Hfs source tap [doc_id, text] on data/rain.txt → map: Each(token)[RegexSplitGenerator] → GroupBy(wc)[by: token] → reduce: Every(wc)[Count] → Hfs sink tap [token, count] on output/wc]
• word count – Cascalog / Clojure

  (ns impatient.core
    (:use [cascalog.api]
          [cascalog.more-taps :only (hfs-delimited)])
    (:require [clojure.string :as s]
              [cascalog.ops :as c])
    (:gen-class))

  (defmapcatop split [line]
    "reads in a line of string and splits it by regex"
    (s/split line #"[\[\](),.)\s]+"))

  (defn -main [in out & args]
    (?<- (hfs-delimited out)
         [?word ?count]
         ((hfs-delimited in :skip-header? true) _ ?line)
         (split ?line :> ?word)
         (c/count ?count)))

  ; Paul Lam
  ; github.com/Quantisan/Impatient
• word count – Cascalog / Clojure
  github.com/nathanmarz/cascalog/wiki
  • implements Datalog in Clojure, with predicates backed by Cascading – a highly declarative language
  • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL
  • composable subqueries, used for test-driven development (TDD) practices at scale
  • Leiningen build: simple, no surprises, in Clojure itself
  • more new deployments than other Cascading DSLs – Climate Corp is the largest use case: 90% Clojure/Cascalog
  • has a learning curve, limited number of Clojure developers
  • aggregators are the magic, and those take effort to learn
• word count – Scalding / Scala

  import com.twitter.scalding._

  class WordCount(args : Args) extends Job(args) {
    Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
      .read
      .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
      .groupBy('token) { _.size('count) }
      .write(Tsv(args("wc"), writeHeader = true))
  }
• word count – Scalding / Scala
  github.com/twitter/scalding/wiki
  • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
  • code is compact, easy to understand
  • nearly 1:1 between elements of the conceptual flow diagram and function calls
  • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
  • significant investments by Twitter, Etsy, eBay, etc.
  • great for data services at scale
  • less learning curve than Cascalog
• word count – Scalding / Scala (continued)
  The Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping limit complexity in the process.
• Two Avenues to the App Layer…
  [diagram axes: complexity ➞, scale ➞]
  Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff.
  Start-ups: crave complexity and scale to become viable… new ventures move into the Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding.
• Pattern: predictive models at scale – agenda recap (next: A Little Theory…)
• workflow abstraction – pattern language
  Cascading uses a “plumbing” metaphor in the Java API to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. (see the sketch after this slide).
  [flow diagram: Document Collection → Scrub/Tokenize → HashJoin (Left) with Stop Word List (RHS) → GroupBy token → Count → Word Count]
  Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java.
  In formal terms, this provides a pattern language.
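  To make the plumbing metaphor concrete, here is a minimal Java sketch of the diagrammed stop-word join, assuming the docPipe, token field, and tap setup from the Word Count app above; stopPath and the filter idiom are assumptions for illustration, not the deck's own code (classes are from cascading.pipe, cascading.pipe.joiner, cascading.pipe.assembly, and cascading.operation.regex):

  // the stop word list arrives on the right-hand side (RHS) of a HashJoin
  Fields stop = new Fields( "stop" );
  Tap stopTap = new Hfs( new TextDelimited( stop, true, "\t" ), stopPath );
  Pipe stopPipe = new Pipe( "stop" );

  // left join tokens against the stop word list, then keep only tokens
  // which did NOT match – their joined "stop" field remains empty
  Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );
  tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );
  tokenPipe = new Retain( tokenPipe, token );

  // remember to register the stop word source in the FlowDef:
  //   flowDef.addSource( stopPipe, stopTap )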
• references…
  pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices – amazon.com/dp/0195019199
  design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by the “Gang of Four” – amazon.com/dp/0201633612
• workflow abstraction – pattern language (continued)
  Design principles of the pattern language ensure best practices for robust, parallel data workflows at scale.
• workflow abstraction – literate programming
  Cascading workflows generate their own visual documentation: flow diagrams.
  [flow diagram, as above]
  In formal terms, flow diagrams leverage a methodology called literate programming.
  This provides intuitive, visual representations for apps – great for cross-team collaboration.
• references…
  Literate Programming, by Don Knuth – Univ of Chicago Press, 1992 – literateprogramming.com/
  “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
• workflow abstraction – test-driven development
  • assert patterns (regex) on the tuple streams (see the sketch after this slide)
  • adjust assert levels, like log4j levels
  • trap edge cases as “data exceptions”
  • TDD at scale:
    1. start from raw inputs in the flow graph
    2. define stream assertions for each stage of transforms
    3. verify exceptions, code to remove them
    4. when the impl is complete, the app has full test coverage
  • redirect traps in production to Ops, QA, Support, Audit, etc.
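  A minimal Java sketch of a stream assertion plus a failure trap, assuming the pipes and taps from the Word Count app above and a trapTap like the one in the scoring app later; AssertMatches and AssertionLevel come from the Cascading API, while the regex here is illustrative:

  // assert that every "token" value is non-empty word text;
  // STRICT assertions can be planned out of production runs
  AssertMatches assertion = new AssertMatches( "\\w+" );
  wcPipe = new Each( wcPipe, token, AssertionLevel.STRICT, assertion );

  // tuples which fail an operation get rerouted to the trap tap
  // as "data exceptions" instead of killing the job
  FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
    .addSource( docPipe, docTap )
    .addTrap( wcPipe, trapTap )
    .addTailSink( wcPipe, wcTap );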
• workflow abstraction – business process
  Following the essence of literate programming, Cascading workflows provide statements of business process.
  This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data).
  Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.).
  This is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.”
  By virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale.
• references…
  “A relational model of data for large shared data banks”, by Edgar Codd – Communications of the ACM, 1970 – dl.acm.org/citation.cfm?id=362685
  Rather than arguing SQL vs. NoSQL, or structured vs. unstructured data frameworks, this approach focuses on what apps do: the process of structuring data.
  Closely related to the functional relational programming paradigm: “Out of the Tar Pit”, Moseley & Marks, 2006 – http://goo.gl/SKspn
• workflow abstraction – API design principles
  • specify what is required, not how it must be achieved
  • plan far ahead, before consuming cluster resources – fail fast prior to submit
  • fail the same way twice – deterministic flow planners help reduce engineering costs for debugging at scale
  • same JAR, any scale – the app does not require a recompile to change data taps or cluster topologies (see the sketch after this list)
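  As an illustration of “same JAR, any scale”, a hedged sketch which selects the topology at launch time; LocalFlowConnector ships in Cascading's local mode module, while the --local flag is a hypothetical CLI switch for this sketch:

  // choose a topology at launch time, not at compile time
  FlowConnector flowConnector = args[ args.length - 1 ].equals( "--local" )
    ? new LocalFlowConnector( properties )     // in-memory local mode
    : new HadoopFlowConnector( properties );   // Apache Hadoop cluster

  // the same FlowDef – pipes, taps, traps – plans onto either topology
  Flow flow = flowConnector.connect( flowDef );
  flow.complete();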
• workflow abstraction – building apps in layers
  business process – separation of concerns: focus on specifying what is required, not how the computers must accomplish it; not unlike BPM/BPEL for Big Data
  test-driven development – assert expected patterns in tuple flows, adjust assertion levels, verify that tests fail, code until tests pass, repeat… route exceptional data to the appropriate department
  pattern language – syntax of the pattern language conveys expertise, much like building a tower with Lego blocks: ensure best practices for robust, parallel data workflows at scale
  flow planner/optimizer – enables the functional programming aspects: a compiler within a compiler, mapping flows to topologies (e.g., create and sequence Hadoop job steps)
  compiler/build – the entire app is visible to the compiler: resolves issues of crossing boundaries for troubleshooting, exception handling, notifications, etc.; one app = one JAR
  topology – Apache Hadoop MR, IMDGs, etc.; upcoming: MR2, etc.
  cluster – JVM cluster, scheduler, instrumentation, etc.
• workflow abstraction – building apps in layers (continued)
  Several theoretical aspects converge into software engineering practices which minimize the complexity of building and maintaining Enterprise data workflows.
• Pattern: predictive models at scale – agenda recap (next: Pattern)
• Pattern – analytics workflows
  • open source project – ASL 2, GitHub repo
  • multiple companies contributing
  • complementary to Apache Mahout – while leveraging the workflow abstraction, multiple topologies, etc.
  • model scoring: generates workflows from PMML models
  • model creation: estimation at scale, captured as PMML
  • use the sample Hadoop app at scale – no coding required
  • integrate with 2 lines of Java (1 line of Clojure or Scala) – see the sketch after this list
  • excellent use cases for customer experiments at scale
  cascading.org/pattern
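  The “2 lines of Java” correspond to the PMML scoring app shown in full later in this deck: define a classifier function from the PMML file, then apply it within a pipe:

  // define a "Classifier" model from PMML, then evaluate it per tuple
  ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
  Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );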
• Pattern – analytics workflows (continued)
  Greatly reduced development costs and fewer licensing issues at scale – leveraging the economics of Apache Hadoop clusters, plus the core competencies of analytics staff, plus existing IP in predictive models.
  cascading.org/pattern
• Pattern – model scoring
  • migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
  • great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.
  • integrate with other libraries – Matrix API, etc.
  • leverage PMML as another kind of DSL
  [architecture diagram, as above]
  cascading.org/pattern
• Pattern – an example classifier
  1. use customer order history as the training data set
  2. train a risk classifier for orders, using Random Forest
  3. export the model from R to PMML
  4. build a Cascading app to execute the PMML model
     4.1. generate a flow from the PMML description
     4.2. plan the flow for a topology (Hadoop)
     4.3. compile the app to a JAR file
  5. verify results with a regression test
  6. deploy the app at scale to calculate scores
  7. potentially, reuse the classifier for real-time scoring
  [diagram: risk classifier dimensions (customer 360, per-order); Cascading apps for data prep, training data sets, scoring; analyst's laptop exports the PMML model; Hadoop batch workloads (ETL, partner data, DW, chargebacks) and real-time IMDG workloads (anomaly detection, velocity metrics); customer transactions; Customer DB]
• Pattern – an example classifier
  [diagram, as above: Cascading apps connecting the analyst's laptop (data prep, training data sets, PMML model) with Hadoop batch workloads and real-time IMDG workloads over the Customer DB]
• Pattern – create a model in R

  ## train a RandomForest model
  f <- as.formula("as.factor(label) ~ .")
  fit <- randomForest(f, data_train, ntree=50)

  ## test the model on the holdout test set
  print(fit$importance)
  print(fit)
  predicted <- predict(fit, data)
  data$predicted <- predicted
  confuse <- table(pred = predicted, true = data[,1])
  print(confuse)

  ## export predicted labels to TSV
  write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

  ## export RF model to PMML
  saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
• Pattern – capture model parameters as PMML

  <?xml version="1.0"?>
  <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
    <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
      <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
      <Application name="Rattle/PMML" version="1.2.30"/>
      <Timestamp>2012-10-22 19:39:28</Timestamp>
    </Header>
    <DataDictionary numberOfFields="4">
      <DataField name="label" optype="categorical" dataType="string">
        <Value value="0"/>
        <Value value="1"/>
      </DataField>
      <DataField name="var0" optype="continuous" dataType="double"/>
      <DataField name="var1" optype="continuous" dataType="double"/>
      <DataField name="var2" optype="continuous" dataType="double"/>
    </DataDictionary>
    <MiningModel modelName="randomForest_Model" functionName="classification">
      <MiningSchema>
        <MiningField name="label" usageType="predicted"/>
        <MiningField name="var0" usageType="active"/>
        <MiningField name="var1" usageType="active"/>
        <MiningField name="var2" usageType="active"/>
      </MiningSchema>
      <Segmentation multipleModelMethod="majorityVote">
        <Segment id="1">
          <True/>
          <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
            <MiningSchema>
              <MiningField name="label" usageType="predicted"/>
              <MiningField name="var0" usageType="active"/>
              <MiningField name="var1" usageType="active"/>
              <MiningField name="var2" usageType="active"/>
            </MiningSchema>
  ...
• Pattern – score a model, within an app

  public class Main {
    public static void main( String[] args ) {
      String pmmlPath = args[ 0 ];
      String ordersPath = args[ 1 ];
      String classifyPath = args[ 2 ];
      String trapPath = args[ 3 ];

      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
      Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
      Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

      // define a "Classifier" model from PMML to evaluate the orders
      ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
      Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
        .addSource( classifyPipe, ordersTap )
        .addTrap( classifyPipe, trapTap )
        .addSink( classifyPipe, classifyTap );

      // write a DOT file and run the flow
      Flow classifyFlow = flowConnector.connect( flowDef );
      classifyFlow.writeDOT( "dot/classify.dot" );
      classifyFlow.complete();
    }
  }
• Pattern – score a model, using the pre-defined Cascading app
  [flow diagram: Customer Orders → Classify (PMML Model) → Assert → GroupBy token → Count → Scored Orders; Failure Traps; Confusion Matrix]
• Pattern – score a model, using the pre-defined Cascading app

  ## run an RF classifier at scale
  hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
    --pmml data/sample.rf.xml

  ## run an RF classifier at scale, assert a regression test, measure the confusion matrix
  hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
    --pmml data/sample.rf.xml --assert --measure out/measure

  ## run a predictive model at scale, measure RMSE
  hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
    --pmml data/iris.lm_p.xml --rmse out/measure
• Pattern – evaluating results

  bash-3.2$ head out/classify/part-00000
  label  var0  var1  var2  order_id  predicted  score
  1      0     1     0     6f8e1014  1          1
  0      0     0     1     6f8ea22e  0          0
  1      0     1     0     6f8ea435  1          1
  0      0     0     1     6f8ea5e1  0          0
  1      0     1     0     6f8ea785  1          1
  1      0     1     0     6f8ea91e  1          1
  0      1     0     0     6f8eaaba  0          0
  1      0     1     0     6f8eac54  1          1
  0      1     1     0     6f8eade3  1          1
• Lingual – connecting Hadoop and R

  # load the JDBC package
  library(RJDBC)

  # set up the driver
  drv <- JDBC("cascading.lingual.jdbc.Driver",
              "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

  # set up a database connection to a local repository
  connection <- dbConnect(drv,
      "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

  # query the repository: in this case the MySQL sample database (CSV files)
  df <- dbGetQuery(connection,
      "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
  head(df)

  # use R functions to summarize and visualize part of the data
  df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
  summary(df$hire_age)

  library(ggplot2)
  m <- ggplot(df, aes(x=hire_age))
  m <- m + ggtitle("Age at hire, people named Gina")
  m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
• Lingual – connecting Hadoop and R

  > summary(df$hire_age)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    20.86   27.89   31.70   31.61   35.01   43.92

  cascading.org/lingual
  launchpad.net/test-db
• Pattern: predictive models at scale – agenda recap (next: PMML)
• PMML – standard
  • established XML standard for predictive model markup
  • organized by the Data Mining Group (DMG), since 1997 – http://dmg.org/
  • members: IBM, SAS, Visa, NASA, Equifax, Microstrategy, Microsoft, etc.
  • PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows
  “PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”
  wikipedia.org/wiki/Predictive_Model_Markup_Language
• PMML – models
  • Association Rules: AssociationModel element
  • Cluster Models: ClusteringModel element
  • Decision Trees: TreeModel element
  • Naïve Bayes Classifiers: NaiveBayesModel element
  • Neural Networks: NeuralNetwork element
  • Regression: RegressionModel and GeneralRegressionModel elements
  • Rulesets: RuleSetModel element
  • Sequences: SequenceModel element
  • Support Vector Machines: SupportVectorMachineModel element
  • Text Models: TextModel element
  • Time Series: TimeSeriesModel element
  ibm.com/developerworks/industry/library/ind-PMML2/
• PMML – vendor coverage
  [chart: vendor coverage of the PMML standard]
• Pattern: predictive models at scale – agenda recap (next: Roadmap)
• roadmap – existing algorithms for scoring
  • Random Forest
  • Decision Trees
  • Linear Regression
  • GLM
  • Logistic Regression
  • K-Means Clustering
  • Hierarchical Clustering
  • Support Vector Machines
  cascading.org/pattern
• roadmap – top priorities for creating models at scale
  • Random Forest
  • Logistic Regression
  • K-Means Clustering
  A wealth of recent research indicates many opportunities to parallelize popular algorithms for training models at scale on Apache Hadoop…
  cascading.org/pattern
• roadmap – next priorities for scoring
  • Time Series (ARIMA forecast)
  • Association Rules (basket analysis)
  • Naïve Bayes
  • Neural Networks
  Algorithms are extended based on customer use cases – contact @pacoid
  cascading.org/pattern
• Pattern: predictive models at scale – agenda recap (next: Customer Experiments)
• experiments – comparing models
  • much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
  • run multiple variants, then measure relative “lift”
  • Concurrent runtime – tag and track models
  The following example compares two models trained with different machine learning algorithms. The comparison is exaggerated: one model has an important variable intentionally omitted, to help illustrate the experiment.
• experiments – Random Forest model

  ## train a Random Forest model
  ## example: http://mkseo.pe.kr/stats/?p=220
  f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
  fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
  print(fit)
  saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

  OOB estimate of error rate: 14%
  Confusion matrix:
       0    1  class.error
  0   69   16    0.1882353
  1   12  103    0.1043478
• experiments – Logistic Regression model

  ## train a Logistic Regression model (special case of GLM)
  ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
  f <- as.formula("as.factor(label) ~ var0 + var2")
  fit <- glm(f, family=binomial, data=data)
  print(summary(fit))
  saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

  Coefficients:
              Estimate Std. Error z value Pr(>|z|)
  (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
  var0         -1.3755     0.4355  -3.159  0.00159 **
  var2         -3.7742     0.5794  -6.514 7.30e-11 ***
  ---
  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  NB: this model has “var1” intentionally omitted
• experiments – comparing results
  • use a confusion matrix to compare results for the two classifiers
  • Logistic Regression has a lower “false negative” rate (5% vs. 11%); however, it has a much higher “false positive” rate (52% vs. 14%)
  • assign a cost model to select a winner (see the sketch after this list) – for example, in an ecommerce anti-fraud classifier:
    FN ∼ chargeback risk
    FP ∼ customer support costs
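  A minimal Java sketch of such a cost model, using the FN/FP rates above; the per-error dollar costs are hypothetical placeholders:

  public class CostModel {
    // expected cost per scored order = FN rate * cost(FN) + FP rate * cost(FP)
    static double expectedCost( double fnRate, double fpRate, double fnCost, double fpCost ) {
      return fnRate * fnCost + fpRate * fpCost;
    }

    public static void main( String[] args ) {
      double fnCost = 80.0;  // hypothetical chargeback risk per false negative
      double fpCost = 10.0;  // hypothetical support cost per false positive

      double rf = expectedCost( 0.11, 0.14, fnCost, fpCost );  // Random Forest rates above
      double lr = expectedCost( 0.05, 0.52, fnCost, fpCost );  // Logistic Regression rates above

      System.out.println( rf <= lr ? "winner: Random Forest" : "winner: Logistic Regression" );
    }
  }

  Under these placeholder costs the Logistic Regression model wins (9.2 vs. 10.2 per order); shifting the relative FN/FP costs can flip the outcome, which is the point of assigning a cost model.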
• references…
  Enterprise Data Workflows with Cascading – O’Reilly, 2013 – amazon.com/dp/1449358721
• drill-down…
  blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities:
  cascading.org
  zest.to/group11
  github.com/Cascading
  conjars.org
  goo.gl/KQtUL
  concurrentinc.com
  Copyright @2013, Concurrent, Inc.