“Pattern –
              an open source project for migrating
              predictive models onto Apache Hadoop”

                  Paco Nathan
                  Concurrent, Inc.
                  San Francisco, CA

                 Copyright @2013, Concurrent, Inc.

Sunday, 17 March 13                                   1
Pattern: predictive models at scale





            • Enterprise Data Workflows

            • Sample Code
            • A Little Theory…
            • Pattern
            • PMML
            • Roadmap
            • Customer Experiments

Cascading – origins

           API author Chris Wensel worked as a system architect
           at an Enterprise firm well-known for many popular
           data products.
           Wensel was following the Nutch open source project –
           where Hadoop started.
           Observation: would be difficult to find Java developers
           to write complex Enterprise apps in MapReduce –
           potential blocker for leveraging new open source

Cascading – functional programming

           Key insight: MapReduce is based on functional programming
           – back to LISP in 1970s. Apache Hadoop use cases are
           mostly about data pipelines, which are functional in nature.
           To ease staffing problems as “Main Street” Enterprise firms
           began to embrace Hadoop, Cascading was introduced
           in late 2007, as a new Java API to implement functional
           programming for large-scale data workflows:

             • leverages JVM and Java-based tools without any
                 need to create new languages
             •   allows programmers who have J2EE expertise
                 to leverage the economics of Hadoop clusters

functional programming… in production

             • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
                 have invested in open source projects atop Cascading
                 – used for their large-scale production deployments
             •   new case studies for Cascading apps are mostly
                 based on domain-specific languages (DSLs) in JVM
                 languages which emphasize functional programming:

                 Cascalog in Clojure (2010)
                 Scalding in Scala (2012)


Cascading – definitions

             • a pattern language for Enterprise Data Workflows
             • simple to build, easy to test, robust in production
             • design principles ⟹ ensure best practices at scale

Cascading – usage

             • Java API, DSLs in Scala, Clojure,
                 Jython, JRuby, Groovy, ANSI SQL
             • ASL 2 license, GitHub src,
             • 5+ yrs production use,
                 multiple Enterprise verticals

Cascading – integrations

             • partners: Microsoft Azure, Hortonworks,
                 Amazon AWS, MapR, EMC, SpringSource,
                 Cloudera
             • taps: Memcached, Cassandra, MongoDB,
                 HBase, JDBC, Parquet, etc.
             • serialization: Avro, Thrift, Kryo,
                 JSON, etc.
             • topologies: Apache Hadoop,
                 tuple spaces, local mode

Cascading – deployments

             • case studies: Climate Corp, Twitter, Etsy,
                 Williams-Sonoma, uSwitch, Airbnb, Nokia,
                 YieldBot, Square, Harvard, etc.
             • use cases: ETL, marketing funnel, anti-fraud,
                 social media, retail pricing, search analytics,
                 recommenders, eCRM, utility grids, telecom,
                 genomics, climatology, agronomics, etc.

Cascading – deployments

             workflow abstraction addresses:
             • staffing bottleneck;
             • system integration;
             • operational complexity;
             • test-driven development

The Ubiquitous Word Count


           count how often each word appears
           in a collection of text documents

           This simple program provides an excellent test case for
           parallel processing, since it:
            • requires a minimal amount of code
            • demonstrates use of both symbolic and numeric values
            • shows a dependency graph of tuples as an abstraction
            • is not many steps away from useful search indexing
            • serves as a “Hello World” for Hadoop apps

           Any distributed computing framework which can run Word
           Count efficiently in parallel at scale can handle much
           larger and more interesting compute problems.

           void map (String doc_id, String text):
             for each word w in segment(text):
               emit(w, "1");

           void reduce (String word, Iterator group):
             int count = 0;
             for each pc in group:
               count += Int(pc);
             emit(word, String(count));
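
The map and reduce pseudocode on this slide can be sketched in plain Java running in a single JVM. This is only an illustration under stated assumptions: the class name `WordCountSketch`, the whitespace tokenizer standing in for `segment()`, and the in-memory `TreeMap` that plays the role of the shuffle are all hypothetical, and none of it uses Hadoop or Cascading APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // map phase: emit a (word, "1") pair for every token in the text
    // (splitting on whitespace is an assumed stand-in for segment())
    static List<String[]> map(String docId, String text) {
        List<String[]> emitted = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                emitted.add(new String[] { w, "1" });
            }
        }
        return emitted;
    }

    // reduce phase: sum the partial counts collected for one word
    static String reduce(String word, Iterable<String> group) {
        int count = 0;
        for (String pc : group) {
            count += Integer.parseInt(pc);
        }
        return String.valueOf(count);
    }

    // stand-in for the framework: run map, group values by key, then reduce each group
    static Map<String, String> run(Map<String, String> docs) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String[] pair : map(doc.getKey(), doc.getValue())) {
                groups.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
            }
        }
        Map<String, String> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> g : groups.entrySet()) {
            counts.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc01", "rain shadow rain");
        docs.put("doc02", "dry shadow");
        System.out.println(run(docs)); // prints {dry=1, rain=2, shadow=2}
    }
}
```

On a real cluster the grouping step is what the framework distributes across machines; the `map` and `reduce` functions themselves stay this small.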

word count – conceptual flow diagram


              [flow diagram]

               1 map
               1 reduce
              18 lines code

word count – Cascading app in Java

           String docPath = args[ 0 ];
           String wcPath = args[ 1 ];

           Properties properties = new Properties();
           AppProps.setApplicationJarClass( properties, Main.class );
           HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

           // create source and sink taps (tab-delimited, with header row)
           Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
           Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

           // specify a regex to split "document" text lines into token stream
           Fields token = new Fields( "token" );
           Fields text = new Fields( "text" );
           RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
           // only returns "token"
           Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

           // determine the word counts
           Pipe wcPipe = new Pipe( "wc", docPipe );
           wcPipe = new GroupBy( wcPipe, token );
           wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

           // connect the taps, pipes, etc., into a flow
           FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
            .addSource( docPipe, docTap )
            .addTailSink( wcPipe, wcTap );

           // write a DOT file and run the flow
           Flow wcFlow = flowConnector.connect( flowDef );
           wcFlow.writeDOT( "dot/" );
           wcFlow.complete();

word count – generated flow diagram

           [head]
           Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
           [{2}:'doc_id', 'text']
           [intermediate pipe nodes not recoverable]
           [{2}:'token', 'count']
           Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']


word count – Cascalog / Clojure

           (ns impatient.core
             (:use [cascalog.api]
                   [cascalog.more-taps :only (hfs-delimited)])
             (:require [clojure.string :as s]
                       [cascalog.ops :as c])
             (:gen-class))

           (defmapcatop split [line]
             "reads in a line of string and splits it by regex"
             (s/split line #"[\[\]\(\),.\s]+"))

           (defn -main [in out & args]
             (?<- (hfs-delimited out)
                  [?word ?count]
                  ((hfs-delimited in :skip-header? true) _ ?line)
                  (split ?line :> ?word)
                  (c/count ?count)))

           ; Paul Lam

word count – Cascalog / Clojure


             • implements Datalog in Clojure, with predicates backed
               by Cascading – for a highly declarative language
             • run ad-hoc queries from the Clojure REPL –
               approx. 10:1 code reduction compared with SQL
             • composable subqueries, used for test-driven development
               (TDD) practices at scale
             • Leiningen build: simple, no surprises, in Clojure itself
             • more new deployments than other Cascading DSLs –
               Climate Corp is largest use case: 90% Clojure/Cascalog
             • has a learning curve, limited number of Clojure developers
             • aggregators are the magic, and those take effort to learn

word count – Scalding / Scala

          import com.twitter.scalding._

          class WordCount(args : Args) extends Job(args) {
            Tsv(args("doc"),
                 ('doc_id, 'text),
                 skipHeader = true)
              .read
              .flatMap('text -> 'token) {
                 text : String => text.split("[ \\[\\]\\(\\),.]")
              }
              .groupBy('token) { _.size('count) }
              .write(Tsv(args("wc"), writeHeader = true))
          }

word count – Scalding / Scala


             • extends the Scala collections API so that distributed lists
               become “pipes” backed by Cascading
             • code is compact, easy to understand
             • nearly 1:1 between elements of conceptual flow diagram
               and function calls
             • extensive libraries are available for linear algebra, abstract
               algebra, machine learning – e.g., Matrix API, Algebird, etc.
             • significant investments by Twitter, Etsy, eBay, etc.
             • great for data services at scale
             • gentler learning curve than Cascalog

word count – Scalding / Scala

             Cascalog and Scalding DSLs
             leverage the functional aspects
             of MapReduce, helping limit
             complexity in process

Two Avenues to the App Layer…

            Enterprise: must contend with
            complexity at scale every day…
            incumbents extend current practices and
            infrastructure investments – using J2EE,
            ANSI SQL, SAS, etc. – to migrate
            workflows onto Apache Hadoop while
            leveraging existing staff

            Start-ups: crave complexity and
            scale to become viable…
            new ventures move into Enterprise space
            to compete using relatively lean staff,
            while leveraging sophisticated engineering
            practices, e.g., Cascalog and Scalding

            [chart axes: complexity ➞, scale ➞]

workflow abstraction – pattern language

           Cascading uses a “plumbing” metaphor in the Java API,
           to define workflows out of familiar elements: Pipes, Taps,
           Tuple Flows, Filters, Joins, Traps, etc.

           Data is represented as flows of tuples. Operations within
           the flows bring functional programming aspects into Java

           In formal terms, this provides a pattern language


                      pattern language: a structured method for solving
                      large, complex design problems, where the syntax of
                      the language promotes the use of best practices


                      design patterns: the notion originated in consensus
                      negotiation for architecture, later applied in OOP
                      software engineering by “Gang of Four”

workflow abstraction – pattern language

             design principles of the pattern
             language ensure best practices
             for robust, parallel data workflows
             at scale

workflow abstraction – literate programming

           Cascading workflows generate their own visual
           documentation: flow diagrams

           In formal terms, flow diagrams leverage a methodology
           called literate programming

           Provides intuitive, visual representations for apps –
           great for cross-team collaboration


                      by Don Knuth
                      Literate Programming
                      Univ of Chicago Press, 1992

                      “Instead of imagining that our main task is
                       to instruct a computer what to do, let us
                       concentrate rather on explaining to human
                       beings what we want a computer to do.”

workflow abstraction – test-driven development

             •   assert patterns (regex) on the tuple streams
             •   adjust assert levels, like log4j levels
             •   trap edge cases as “data exceptions”
             •   TDD at scale:
                 1. start from raw inputs in the flow graph
                 2. define stream assertions for each stage
                    of transforms
                 3. verify exceptions, code to remove them
                 4. when impl is complete, app has full
                    test coverage

           redirect traps in production
           to Ops, QA, Support, Audit, etc.
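
The idea of stream assertions with adjustable levels and traps can be sketched in plain Java. This is a minimal illustration, not Cascading's own assertion or trap classes: the names `StreamAssertionSketch`, `Level`, `Result`, and `assertMatches` are hypothetical, tuples are simplified to tab-delimited strings, and the "trap" is just a second list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class StreamAssertionSketch {

    // hypothetical assertion levels, ordered from least to most strict
    enum Level { NONE, VALID, STRICT }

    // output of one stage: tuples that passed, plus "data exceptions" routed to the trap
    static class Result {
        final List<String> passed = new ArrayList<>();
        final List<String> trapped = new ArrayList<>();
    }

    // checks each tuple against a regex; a failing tuple goes to the trap instead of
    // stopping the flow. The assertion is skipped when the run level is below the
    // level at which the assertion was declared.
    static Result assertMatches(List<String> tuples, Pattern expected,
                                Level declaredLevel, Level runLevel) {
        boolean active = declaredLevel != Level.NONE
            && runLevel.ordinal() >= declaredLevel.ordinal();
        Result result = new Result();
        for (String tuple : tuples) {
            if (!active || expected.matcher(tuple).matches()) {
                result.passed.add(tuple);
            } else {
                result.trapped.add(tuple);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tuples = Arrays.asList("doc01\tA rain shadow", "doc02", "doc03\tdry");
        // expected shape: a doc id, a tab, then some text
        Pattern docIdThenText = Pattern.compile("^\\S+\\t.+$");

        // test run: assertions on, the malformed tuple lands in the trap
        Result strict = assertMatches(tuples, docIdThenText, Level.STRICT, Level.STRICT);
        System.out.println("passed=" + strict.passed.size() + " trapped=" + strict.trapped);

        // same code with assertions switched off, e.g. for a production run
        Result off = assertMatches(tuples, docIdThenText, Level.STRICT, Level.NONE);
        System.out.println("passed=" + off.passed.size() + " trapped=" + off.trapped);
    }
}
```

In this sketch the trapped list is what would be redirected to Ops, QA, Support, or Audit, while the passed list continues to the next stage of transforms.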

workflow abstraction – business process

           Following the essence of literate programming, Cascading
           workflows provide statements of business process
           This recalls a sense of business process management
           for Enterprise apps (think BPM/BPEL for Big Data)
           Cascading creates a separation of concerns between
           business process and implementation details (Hadoop, etc.)
           This is especially apparent in large-scale Cascalog apps:
               “Specify what you require, not how to achieve it.”
           By virtue of the pattern language, the flow planner then
           determines how to translate business process into efficient,
           parallel jobs at scale


                      by Edgar Codd
                      “A relational model of data for large shared data banks”
                      Communications of the ACM, 1970
                      Rather than arguing between SQL vs. NoSQL…
                      structured vs. unstructured data frameworks…
                      this approach focuses on what apps do:
                        the process of structuring data

                      Closely related to functional relational programming paradigm:
                        “Out of the Tar Pit”
                        Moseley & Marks 2006

workflow abstraction – API design principles

             • specify what is required, not how it must be achieved
             • plan far ahead, before consuming cluster resources –
                 fail fast prior to submit

             • fail the same way twice – deterministic flow planners
                 help reduce engineering costs for debugging at scale

             • same JAR, any scale – app does not require a recompile
                 to change data taps or cluster topologies
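
One way to picture "fail fast prior to submit" is a planner that checks field dependencies across the whole workflow before any data is read. The sketch below is a toy in plain Java, not Cascading's flow planner: `FailFastPlannerSketch`, `Step`, and `plan` are hypothetical names, and a workflow is reduced to an ordered list of steps that each declare the fields they read and the fields they add.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FailFastPlannerSketch {

    // a hypothetical workflow step: the fields it reads and the fields it adds
    static class Step {
        final String name;
        final List<String> requires;
        final List<String> produces;

        Step(String name, List<String> requires, List<String> produces) {
            this.name = name;
            this.requires = requires;
            this.produces = produces;
        }
    }

    // walks the steps in order and checks field dependencies without touching any data;
    // the same inputs always produce the same plan or the same error
    static List<String> plan(List<String> sourceFields, List<Step> steps) {
        Set<String> available = new LinkedHashSet<>(sourceFields);
        List<String> planned = new ArrayList<>();
        for (Step step : steps) {
            for (String field : step.requires) {
                if (!available.contains(field)) {
                    throw new IllegalStateException(
                        "step '" + step.name + "' requires missing field '" + field + "'");
                }
            }
            available.addAll(step.produces);
            planned.add(step.name);
        }
        return planned;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("doc_id", "text");

        List<Step> good = Arrays.asList(
            new Step("tokenize", Arrays.asList("text"), Arrays.asList("token")),
            new Step("count", Arrays.asList("token"), Arrays.asList("count")));
        System.out.println(plan(source, good)); // prints [tokenize, count]

        // a misspelled field name is caught at plan time, before any job is submitted
        List<Step> bad = Arrays.asList(
            new Step("tokenize", Arrays.asList("text"), Arrays.asList("token")),
            new Step("count", Arrays.asList("tokne"), Arrays.asList("count")));
        try {
            plan(source, bad);
        } catch (IllegalStateException e) {
            System.out.println("plan failed: " + e.getMessage());
        }
    }
}
```

Because the check is a pure function of the declared steps, it also illustrates "fail the same way twice": rerunning the planner on the same definition reports the same error.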

workflow abstraction – building apps in layers

                        business      separation of concerns: focus on specifying what is required, not how the computers
                                      must accomplish it – not unlike BPM/BPEL for BigData

                       test-driven    assert expected patterns in tuple flows, adjust assertion levels, verify that tests fail,
                      development     code until tests pass, repeat … route exceptional data to appropriate department

                         pattern      syntax of the pattern language conveys expertise – much like building a tower with
                                      Lego blocks: ensure best practices for robust, parallel data workflows at scale

                      flow planner/   enables the functional programming aspects: compiler within a compiler, mapping
                         optimizer    flows to topologies (e.g., create and sequence Hadoop job steps)

                       compiler/      entire app is visible to the compiler: resolves issues of crossing boundaries for
                         build        troubleshooting, exception handling, notifications, etc.; one app = one JAR

                        topology      Apache Hadoop MR, IMDGs, etc., – upcoming MR2, etc.

                       JVM cluster    cluster scheduler, instrumentation, etc.

workflow abstraction – building apps in layers

             several theoretical aspects converge
             into software engineering practices
             which minimize the complexity of
             building and maintaining Enterprise
             data workflows

Pattern – analytics workflows

             • open source project – ASL 2, GitHub repo
             • multiple companies contributing
             • complementary to Apache Mahout – while leveraging
                 workflow abstraction, multiple topologies, etc.
             •   model scoring: generates workflows from PMML models
             •   model creation: estimation at scale, captured as PMML
             •   use sample Hadoop app at scale – no coding required
             •   integrate with 2 lines of Java (1 line Clojure or Scala)
             •   excellent use cases for customer experiments at scale


           greatly reduced development costs, fewer licensing issues at scale –
           leveraging the economics of Apache Hadoop clusters, plus the core
           competencies of analytics staff, plus existing IP in predictive models

Pattern – model scoring

             • migrate workloads: SAS, Teradata, etc.,
                 exporting predictive models as PMML
             • great open source tools – R, Weka,
                 KNIME, Matlab, RapidMiner, etc.
             • integrate with other libraries –
                 Matrix API, etc.
             • leverage PMML as another kind
                 of DSL

             [diagram labels: Customers, Web, logs, Cache, Support, trap, tap, sink,
              Modeling, PMML, Cubes, customer profile DBs]

Pattern – an example classifier

               1. use customer order history as the training data set
               2. train a risk classifier for orders, using Random Forest
               3. export model from R to PMML
               4. build a Cascading app to execute the PMML model
                      4.1. generate flow from PMML description
                      4.2. plan the flow for a topology (Hadoop)
                      4.3. compile app to a JAR file
               5. verify results with a regression test
               6. deploy the app at scale to calculate scores
               7. potentially, reuse classifier for real-time scoring

               [diagram, example classifier. Left column: risk classifier, dimension: customer 360;
                Cascading apps: data prep, predict model costs, detect fraudsters, segment customers;
                Hadoop, batch workloads. Right column: risk classifier, dimension: per-order;
                customer transactions, score new orders, anomaly detection, velocity metrics;
                IMDG, real-time workloads. Center: training data sets, analyst's laptop, Customer.
                Bottom: DW, chargebacks, etc., partner data.]

Pattern – create a model in R

                      ## load the packages used below
                      library(randomForest)
                      library(pmml)
                      library(XML)
                      ## train a RandomForest model
                      f <- as.formula("as.factor(label) ~ .")
                      fit <- randomForest(f, data=data_train, ntree=50)
                      ## test the model on the holdout test set
                      predicted <- predict(fit, data)
                      data$predicted <- predicted
                      confuse <- table(pred = predicted, true = data[,1])
                      ## export predicted labels to TSV
                      write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
                        quote=FALSE, sep="\t", row.names=FALSE)
                      ## export RF model to PMML
                      saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

Pattern – capture model parameters as PMML
                      <?xml version="1.0"?>
                      <PMML version="4.0" xmlns="">
                       <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
                        <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
                        <Application name="Rattle/PMML" version="1.2.30"/>
                        <Timestamp>2012-10-22 19:39:28</Timestamp>
                       </Header>
                       <DataDictionary numberOfFields="4">
                        <DataField name="label" optype="categorical" dataType="string">
                         <Value value="0"/>
                         <Value value="1"/>
                        </DataField>
                        <DataField name="var0" optype="continuous" dataType="double"/>
                        <DataField name="var1" optype="continuous" dataType="double"/>
                        <DataField name="var2" optype="continuous" dataType="double"/>
                       </DataDictionary>
                       <MiningModel modelName="randomForest_Model" functionName="classification">
                        <MiningSchema>
                         <MiningField name="label" usageType="predicted"/>
                         <MiningField name="var0" usageType="active"/>
                         <MiningField name="var1" usageType="active"/>
                         <MiningField name="var2" usageType="active"/>
                        </MiningSchema>
                        <Segmentation multipleModelMethod="majorityVote">
                         <Segment id="1">
                          <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
                           <MiningSchema>
                            <MiningField name="label" usageType="predicted"/>
                            <MiningField name="var0" usageType="active"/>
                            <MiningField name="var1" usageType="active"/>
                            <MiningField name="var2" usageType="active"/>
                           </MiningSchema>
                           ...
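
The multipleModelMethod="majorityVote" attribute above means that each tree segment predicts a label and the most frequent label becomes the ensemble's prediction. A minimal sketch of that combination step in plain Java follows. It is not Pattern's implementation; the class name MajorityVote and its tie-breaking rule are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Pattern or Cascading.
final class MajorityVote {
    /**
     * Returns the label predicted by the most trees, or null for an empty list.
     * On a tie, the label that first reached the winning count is returned
     * (an arbitrary choice for this sketch).
     */
    static String vote(List<String> treePredictions) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String label : treePredictions) {
            int c = counts.merge(label, 1, Integer::sum);
            if (c > bestCount) {
                best = label;
                bestCount = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three trees vote "1", "0", "1".
        System.out.println(vote(List.of("1", "0", "1")));
    }
}
```

With three trees voting "1", "0", "1", the ensemble label is "1".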

Pattern – score a model, within an app
                      import java.util.Properties;

                      import cascading.flow.Flow;
                      import cascading.flow.FlowDef;
                      import cascading.flow.hadoop.HadoopFlowConnector;
                      import cascading.pipe.Each;
                      import cascading.pipe.Pipe;
                      import cascading.property.AppProps;
                      import cascading.scheme.hadoop.TextDelimited;
                      import cascading.tap.Tap;
                      import cascading.tap.hadoop.Hfs;
                      import cascading.tuple.Fields;
                      // ClassifierFunction comes from the Pattern library (its import is not shown on the slide)

                      public class Main {
                        public static void main( String[] args ) {
                          String pmmlPath = args[ 0 ];
                          String ordersPath = args[ 1 ];
                          String classifyPath = args[ 2 ];
                          String trapPath = args[ 3 ];

                          Properties properties = new Properties();
                          AppProps.setApplicationJarClass( properties, Main.class );
                          HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

                          // create source and sink taps (tab-delimited text with a header row)
                          Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
                          Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
                          Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

                          // define a "Classifier" model from PMML to evaluate the orders
                          ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
                          Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

                          // connect the taps, pipes, etc., into a flow
                          FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
                           .addSource( classifyPipe, ordersTap )
                           .addTrap( classifyPipe, trapTap )
                           .addSink( classifyPipe, classifyTap );

                          // write a DOT file and run the flow
                          Flow classifyFlow = flowConnector.connect( flowDef );
                          classifyFlow.writeDOT( "dot/" );
                          classifyFlow.complete();
                        }
                      }

Pattern – score a model, using pre-defined Cascading app

                      [flow diagram labels: Classify, Scored Orders, Assert, GroupBy token;
                       map (M) and reduce (R) phases; Failure Traps, Confusion Matrix]

                      ## run an RF classifier at scale
                      hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
                        --pmml data/sample.rf.xml

                      ## run an RF classifier at scale, assert regression test, measure confusion matrix
                      hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
                        --pmml data/sample.rf.xml --assert --measure out/measure

                      ## run a predictive model at scale, measure RMSE
                      hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
                        --pmml data/iris.lm_p.xml --rmse out/measure
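
The --rmse option above reports root mean squared error for a regression model. The slides do not show how Pattern computes it, so the sketch below simply applies the standard definition, the square root of the mean squared difference between predicted and actual values. The class and method names are hypothetical.

```java
// Hypothetical helper illustrating the standard RMSE definition;
// not taken from the Pattern code base.
final class Rmse {
    /** Root mean squared error between actual and predicted values. */
    static double rmse(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = predicted[i] - actual[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / actual.length);
    }

    public static void main(String[] args) {
        // Made-up values, for illustration only.
        double[] actual = {0.0, 0.0};
        double[] predicted = {3.0, 4.0};
        System.out.println(rmse(actual, predicted));
    }
}
```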

Pattern – evaluating results

                      bash-3.2$ head out/classify/part-00000
                      label  var0  var1  var2  order_id  predicted  score
                      1      0     1     0     6f8e1014  1          1
                      0      0     0     1     6f8ea22e  0          0
                      1      0     1     0     6f8ea435  1          1
                      0      0     0     1     6f8ea5e1  0          0
                      1      0     1     0     6f8ea785  1          1
                      1      0     1     0     6f8ea91e  1          1
                      0      1     0     0     6f8eaaba  0          0
                      1      0     1     0     6f8eac54  1          1
                      0      1     1     0     6f8eade3  1          1
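
One way to evaluate output like this is to tally a confusion matrix from the label and predicted columns. The sketch below does that for the nine rows shown above, assuming binary labels 0 and 1. The class name is hypothetical, and this is not the code behind the --measure option.

```java
// Hypothetical helper; tallies a 2x2 confusion matrix for labels 0 and 1.
final class ConfusionMatrix {
    /** Returns m where m[actual][predicted] counts the rows with that pair. */
    static int[][] count(int[] labels, int[] predicted) {
        if (labels.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        int[][] m = new int[2][2];
        for (int i = 0; i < labels.length; i++) {
            m[labels[i]][predicted[i]]++;
        }
        return m;
    }

    public static void main(String[] args) {
        // "label" and "predicted" columns from the nine rows of sample output.
        int[] labels    = {1, 0, 1, 0, 1, 1, 0, 1, 0};
        int[] predicted = {1, 0, 1, 0, 1, 1, 0, 1, 1};
        int[][] m = count(labels, predicted);
        System.out.println("TN=" + m[0][0] + " FP=" + m[0][1]
            + " FN=" + m[1][0] + " TP=" + m[1][1]);
    }
}
```

For these nine rows the tally is 3 true negatives, 1 false positive, 0 false negatives and 5 true positives; the single disagreement is order 6f8eade3.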

Lingual – connecting Hadoop and R

                      # load the JDBC package
                      library(RJDBC)
                      # set up the driver
                      drv <- JDBC("cascading.lingual.jdbc.Driver",
                      # set up a database connection to a local repository
                      connection <- dbConnect(drv,
                      # query the repository: in this case the MySQL sample database (CSV files)
                      df <- dbGetQuery(connection,
                        "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
                      # use R functions to summarize and visualize part of the data
                      df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25

                      m <- ggplot(df, aes(x=hire_age))
                      m <- m + ggtitle("Age at hire, people named Gina")
                      m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()

Lingual – connecting Hadoop and R

                      > summary(df$hire_age)
                         Min. 1st Qu. Median     Mean 3rd Qu.    Max.
                        20.86   27.89   31.70   31.61   35.01   43.92


PMML – standard

             • established XML standard for predictive model markup
             • organized by Data Mining Group (DMG), since 1997
             • members: IBM, SAS, Visa, NASA, Equifax, Microstrategy,
                 Microsoft, etc.
             • PMML concepts for metadata, ensembles, etc., translate
                 directly into Cascading tuple flows

           “PMML is the leading standard for statistical and data mining models and
            supported by over 20 vendors and organizations. With PMML, it is easy
            to develop a model on one system using one application and deploy the
            model on another system using another application.”


PMML – models

             •   Association Rules: AssociationModel element
             •   Cluster Models: ClusteringModel element
             •   Decision Trees: TreeModel element
             •   Naïve Bayes Classifiers: NaiveBayesModel element
             •   Neural Networks: NeuralNetwork element
             •   Regression: RegressionModel and GeneralRegressionModel elements
             •   Rulesets: RuleSetModel element
             •   Sequences: SequenceModel element
             •   Support Vector Machines: SupportVectorMachineModel element
             •   Text Models: TextModel element
             •   Time Series: TimeSeriesModel element


PMML – vendor coverage

roadmap – existing algorithms for scoring


             •   Random Forest
             •   Decision Trees
             •   Linear Regression
             •   GLM
             •   Logistic Regression
             •   K-Means Clustering
             •   Hierarchical Clustering
             •   Support Vector Machines


roadmap – top priorities for creating models at scale


             • Random Forest
             • Logistic Regression
             • K-Means Clustering

           a wealth of recent research indicates many opportunities
           to parallelize popular algorithms for training models at scale
           on Apache Hadoop…


roadmap – next priorities for scoring


             •   Time Series (ARIMA forecast)
             •   Association Rules (basket analysis)
             •   Naïve Bayes
             •   Neural Networks

           algorithms extended based on customer use cases –
           contact @pacoid


experiments – comparing models

             • much customer interest in leveraging Cascading and
                 Apache Hadoop to run customer experiments at scale
             • run multiple variants, then measure relative “lift”
             • Concurrent runtime – tag and track models

           the following example compares two models trained
           with different machine learning algorithms

           this is exaggerated: one model has an important variable
           intentionally omitted to help illustrate the experiment

experiments – Random Forest model

                      ## train a Random Forest model
                      ## example:
                      f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
                      fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
                      saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

                      OOB estimate of error rate: 14%
                      Confusion matrix:
                          0   1 class.error
                      0  69  16   0.1882353
                      1  12 103   0.1043478
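
The class.error column and the OOB error rate can be re-derived from the four counts in the matrix, which is a quick sanity check on the output. The sketch below assumes the usual randomForest layout, where rows are the true class and columns the predicted class. The class name is hypothetical.

```java
// Hypothetical helper that re-derives error rates from a confusion matrix
// whose rows are the true class and whose columns are the predicted class.
final class ClassError {
    /** Per-class error: off-diagonal count in a row divided by the row total. */
    static double[] classError(int[][] m) {
        double[] err = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            int rowTotal = 0;
            for (int count : m[i]) {
                rowTotal += count;
            }
            err[i] = (rowTotal - m[i][i]) / (double) rowTotal;
        }
        return err;
    }

    /** Overall error: all off-diagonal counts divided by the grand total. */
    static double overallError(int[][] m) {
        int total = 0;
        int correct = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return (total - correct) / (double) total;
    }

    public static void main(String[] args) {
        int[][] m = {{69, 16}, {12, 103}};  // counts from the slide
        double[] err = classError(m);
        System.out.println(err[0] + " " + err[1] + " " + overallError(m));
    }
}
```

Here 16/85 and 12/115 reproduce the two class.error values, and 28 errors out of 200 cases give the 14% OOB estimate.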

experiments – Logistic Regression model

                      ## train a Logistic Regression model (special case of GLM)
                      ## example:
                      f <- as.formula("as.factor(label) ~ var0 + var2")
                      fit <- glm(f, family=binomial, data=data)
                      saveXML(pmml(fit), file=paste(out_folder, "", sep="/"))

                                  Estimate Std. Error z value Pr(>|z|)
                      (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
                      var0         -1.3755     0.4355  -3.159  0.00159 **
                      var2         -3.7742     0.5794  -6.514 7.30e-11 ***
                      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                      NB: this model has “var1” intentionally omitted
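
To make the coefficient table concrete, the sketch below scores a single record by applying the logistic function to the fitted linear predictor. The coefficients are the estimates shown above. The class name is hypothetical, and reading the result as the probability of label 1 assumes R's usual convention for a binomial glm, where the second factor level counts as success.

```java
// Hypothetical scorer built from the coefficient estimates on the slide.
final class LogisticScore {
    static final double INTERCEPT = 1.8524;
    static final double B_VAR0 = -1.3755;
    static final double B_VAR2 = -3.7742;

    /** Probability from the logistic function applied to the linear predictor. */
    static double probability(double var0, double var2) {
        double z = INTERCEPT + B_VAR0 * var0 + B_VAR2 * var2;
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        // var1 does not appear: it was intentionally omitted from this model.
        System.out.println(probability(0, 0));
        System.out.println(probability(0, 1));
    }
}
```

Because both slopes are negative, setting var0 or var2 to 1 lowers the predicted probability. With var2 = 1 and var0 = 0 the linear predictor turns negative, so the probability falls below 0.5.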

experiments – comparing results


             • use a confusion matrix to compare results for the classifiers
             • Logistic Regression has a lower “false negative” rate (5% vs. 11%);
                 however, it has a much higher “false positive” rate (52% vs. 14%)
             • assign a cost model to select a winner –
                 for example, in an ecommerce anti-fraud classifier:
                      FN ∼ chargeback risk
                      FP ∼ customer support costs
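
A cost model of this kind can be sketched as a weighted sum of the two error rates. The rates below are the ones quoted above. The unit costs are made-up numbers for illustration, and the sketch ignores class balance, so it is a simplification, not a complete cost model.

```java
// Hypothetical cost comparison; the unit costs are illustrative assumptions.
final class CostModel {
    /** Weighted cost: false-negative rate times its cost plus false-positive rate times its cost. */
    static double expectedCost(double fnRate, double fpRate, double costFn, double costFp) {
        return fnRate * costFn + fpRate * costFp;
    }

    public static void main(String[] args) {
        double costFn = 100.0;  // assumed cost of a missed fraud (chargeback risk)
        double costFp = 10.0;   // assumed cost of a false alarm (customer support)
        double logistic = expectedCost(0.05, 0.52, costFn, costFp);
        double forest = expectedCost(0.11, 0.14, costFn, costFp);
        System.out.println(logistic < forest ? "Logistic Regression" : "Random Forest");
    }
}
```

The winner depends on the cost ratio. With a false negative costing ten times a false positive, Logistic Regression scores lower; if the ratio drops to two (for example 20 vs. 10), Random Forest does.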


                      Enterprise Data Workflows
                      with Cascading
                      O’Reilly, 2013


                      blog, dev community, code/wiki/gists, maven repo,
                      commercial products, career opportunities:

                                                                          Copyright @2013, Concurrent, Inc.

Sunday, 17 March 13                                                                                           62

More Related Content

What's hot

The Other Way of Doing Big Data
The Other Way of Doing Big DataThe Other Way of Doing Big Data
The Other Way of Doing Big Data
Infochimps, a CSC Big Data Business
Bi303 data warehousing with fast track and pdw - Assaf Fraenkel
Bi303 data warehousing with fast track and pdw - Assaf FraenkelBi303 data warehousing with fast track and pdw - Assaf Fraenkel
Bi303 data warehousing with fast track and pdw - Assaf
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hortonworks
RIPEstat Public demo 16 April 2012
RIPEstat Public demo 16 April 2012RIPEstat Public demo 16 April 2012
RIPEstat Public demo 16 April 2012RIPE NCC
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High AvailabilityCloudera, Inc.
HP Microsoft SQL Server Data Management Solutions
HP Microsoft SQL Server Data Management SolutionsHP Microsoft SQL Server Data Management Solutions
HP Microsoft SQL Server Data Management Solutions
Eduardo Castro
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera, Inc.
Standards for Semantic Mashups
Standards for Semantic MashupsStandards for Semantic Mashups
Standards for Semantic Mashups
Laurent Lefort
Laserdata i skyen - Geomatikkdagene 2013
Laserdata i skyen - Geomatikkdagene 2013Laserdata i skyen - Geomatikkdagene 2013
Laserdata i skyen - Geomatikkdagene 2013Geodata AS
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopBrock Noland
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
DataWorks Summit

What's hot (13)

The Other Way of Doing Big Data
The Other Way of Doing Big DataThe Other Way of Doing Big Data
The Other Way of Doing Big Data
Bi303 data warehousing with fast track and pdw - Assaf Fraenkel
Bi303 data warehousing with fast track and pdw - Assaf FraenkelBi303 data warehousing with fast track and pdw - Assaf Fraenkel
Bi303 data warehousing with fast track and pdw - Assaf Fraenkel
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)
RIPEstat Public demo 16 April 2012
RIPEstat Public demo 16 April 2012RIPEstat Public demo 16 April 2012
RIPEstat Public demo 16 April 2012
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High Availability
HP Microsoft SQL Server Data Management Solutions
HP Microsoft SQL Server Data Management SolutionsHP Microsoft SQL Server Data Management Solutions
HP Microsoft SQL Server Data Management Solutions
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Standards for Semantic Mashups
Standards for Semantic MashupsStandards for Semantic Mashups
Standards for Semantic Mashups
Laserdata i skyen - Geomatikkdagene 2013
Laserdata i skyen - Geomatikkdagene 2013Laserdata i skyen - Geomatikkdagene 2013
Laserdata i skyen - Geomatikkdagene 2013
Hana Offerings Engl
Hana Offerings EnglHana Offerings Engl
Hana Offerings Engl
User Group Bi
User Group BiUser Group Bi
User Group Bi
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability

Viewers also liked

Panorama de l'utilisation des médias sociaux dans les collectivités locales
Panorama de l'utilisation des médias sociaux dans les collectivités localesPanorama de l'utilisation des médias sociaux dans les collectivités locales
Panorama de l'utilisation des médias sociaux dans les collectivités locales
Emilie Marquois
Open Data: From the Information Age to the Action Age (PDF with notes)
Open Data: From the Information Age to the Action Age (PDF with notes)Open Data: From the Information Age to the Action Age (PDF with notes)
Open Data: From the Information Age to the Action Age (PDF with notes)
Tim O'Reilly
Some Lessons for Startups (pdf with notes)
Some Lessons for Startups (pdf with notes)Some Lessons for Startups (pdf with notes)
Some Lessons for Startups (pdf with notes)
Tim O'Reilly
OSCON 2012 US Patriot Act Implications for Cloud Computing - Diane Mueller, A...
OSCON 2012 US Patriot Act Implications for Cloud Computing - Diane Mueller, A...OSCON 2012 US Patriot Act Implications for Cloud Computing - Diane Mueller, A...
OSCON 2012 US Patriot Act Implications for Cloud Computing - Diane Mueller, A...
The Workflow Abstraction
The Workflow AbstractionThe Workflow Abstraction
The Workflow Abstraction
Paco Nathan
What to Do Once You Have an Idea (case study)
What to Do Once You Have an Idea (case study)What to Do Once You Have an Idea (case study)
What to Do Once You Have an Idea (case study)
Sergey Sundukovskiy
Government 2.0
Government 2.0Government 2.0
Government 2.0
Tim O'Reilly
Bilan de mobilité
Bilan de mobilitéBilan de mobilité
Bilan de mobilité
Cursus Management
AWS Start-Up Tour 2009 / ShareThis
Pattern: an open source project for migrating predictive models onto Apache Hadoop

  • 1. “Pattern – an open source project for migrating predictive models onto Apache Hadoop” Paco Nathan Concurrent, Inc. San Francisco, CA @pacoid Copyright @2013, Concurrent, Inc. Sunday, 17 March 13 1
  • 2. Pattern: predictive models at scale • Enterprise Data Workflows • Sample Code • A Little Theory… • Pattern • PMML • Roadmap • Customer Experiments [figure: Cascading flow diagram for word count with a stop word list]
  • 3. Cascading – origins: API author Chris Wensel worked as a system architect at an Enterprise firm well known for many popular data products. Wensel was following the Nutch open source project, where Hadoop started. Observation: it would be difficult to find Java developers to write complex Enterprise apps in MapReduce, a potential blocker for leveraging new open source technology.
  • 4. Cascading – functional programming: Key insight: MapReduce is based on functional programming, going back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007 as a new Java API to implement functional programming for large-scale data workflows: • leverages the JVM and Java-based tools without any need to create new languages • allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters
  • 5. functional programming… in production • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading, used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010), Scalding in Scala (2012)
  • 6. Cascading – definitions • a pattern language for Enterprise Data Workflows • simple to build, easy to test, robust in production • design principles ⟹ ensure best practices at scale [figure: Enterprise workflow diagram: Customers, Web App, logs, Cache, Support, source/sink/trap taps, Data Workflow, Modeling, PMML, Analytics Cubes, customer profile DBs, Customer Prefs, Reporting, Hadoop Cluster]
  • 7. Cascading – usage • Java API; DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL • ASL 2 license, GitHub src • 5+ yrs production use, multiple Enterprise verticals
  • 8. Cascading – integrations • partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera • taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc. • serialization: Avro, Thrift, Kryo, JSON, etc. • topologies: Apache Hadoop, tuple spaces, local mode
  • 9. Cascading – deployments • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc. • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.
  • 10. Cascading – deployments: workflow abstraction addresses: • staffing bottleneck • system integration • operational complexity • test-driven development
  • 11. Pattern: predictive models at scale • Enterprise Data Workflows • Sample Code • A Little Theory… • Pattern • PMML • Roadmap • Customer Experiments
  • 12. The Ubiquitous Word Count. Definition: count how often each word appears in a collection of text documents. This simple program provides an excellent test case for parallel processing, since it: • requires a minimal amount of code • demonstrates use of both symbolic and numeric values • shows a dependency graph of tuples as an abstraction • is not many steps away from useful search indexing • serves as a “Hello World” for Hadoop apps. Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems. Pseudocode:
      void map (String doc_id, String text):
        for each word w in segment(text):
          emit(w, "1");

      void reduce (String word, Iterator group):
        int count = 0;
        for each pc in group:
          count += Int(pc);
        emit(word, String(count));
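The map and reduce pseudocode above can be approximated in a single JVM with plain Java streams. This is only a sketch under stated assumptions: the class name `WordCountSketch`, the helper names, the sample sentences, and the delimiter set are hypothetical, and nothing here uses Hadoop or Cascading.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-JVM sketch of the map/reduce pseudocode; not Cascading or Hadoop code.
class WordCountSketch {

    // "map" phase: split one document's text into tokens.
    // The delimiter set (whitespace, brackets, parentheses, comma, period) is an assumption.
    static List<String> segment(String text) {
        return Arrays.stream(text.split("[\\s\\[\\](),.]+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    // "reduce" phase: group identical tokens and count the size of each group.
    static Map<String, Long> wordCount(List<String> documents) {
        return documents.stream()
                .flatMap(doc -> segment(doc).stream())
                .collect(Collectors.groupingBy(token -> token, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "A rain shadow is a dry area.",
                "The rain falls on the windward side.");
        // Print one "token<TAB>count" line per distinct token.
        wordCount(docs).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

The grouping step plays the role of the shuffle between map and reduce: every occurrence of a token lands in the same group before it is counted.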
  • 13. word count – conceptual flow diagram [figure: Document Collection → Tokenize → GroupBy token → Count → Word Count] 1 map, 1 reduce, 18 lines code
  • 14. word count – Cascading app in Java:
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect( flowDef );
      wcFlow.writeDOT( "dot/" );
      wcFlow.complete();
  • 15. word count – generated flow diagram:
      [head]
      Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
        [{2}:'doc_id', 'text'] [{2}:'doc_id', 'text']
      map: Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
        [{1}:'token'] [{1}:'token']
      GroupBy('wc')[by:['token']]
        wc[{1}:'token'] [{1}:'token']
      reduce: Every('wc')[Count[decl:'count']]
        [{2}:'token', 'count'] [{1}:'token']
      Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
        [{2}:'token', 'count'] [{2}:'token', 'count']
      [tail]
  • 16. word count – Cascalog / Clojure:
      (ns impatient.core
        (:use [cascalog.api]
              [cascalog.more-taps :only (hfs-delimited)])
        (:require [clojure.string :as s]
                  [cascalog.ops :as c])
        (:gen-class))

      (defmapcatop split [line]
        "reads in a line of string and splits it by regex"
        (s/split line #"[\[\]\\\(\),.)\s]+"))

      (defn -main [in out & args]
        (?<- (hfs-delimited out)
             [?word ?count]
             ((hfs-delimited in :skip-header? true) _ ?line)
             (split ?line :> ?word)
             (c/count ?count)))

      ; Paul Lam
  • 17. word count – Cascalog / Clojure • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL • composable subqueries, used for test-driven development (TDD) practices at scale • Leiningen build: simple, no surprises, in Clojure itself • more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog • has a learning curve, limited number of Clojure developers • aggregators are the magic, and those take effort to learn
  • 18. word count – Scalding / Scala:
      import com.twitter.scalding._

      class WordCount(args : Args) extends Job(args) {
        Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
          .read
          .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
          .groupBy('token) { _.size('count) }
          .write(Tsv(args("wc"), writeHeader = true))
      }
  • 19. word count – Scalding / Scala • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading • code is compact, easy to understand • nearly 1:1 between elements of conceptual flow diagram and function calls • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc. • significant investments by Twitter, Etsy, eBay, etc. • great for data services at scale • gentler learning curve than Cascalog
  • 20. word count – Scalding / Scala: Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping limit complexity in process
  • 21. Two Avenues to the App Layer… Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff. Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding [figure: axes labeled complexity ➞ and scale ➞]
  • 22. Pattern: predictive models at scale • Enterprise Data Workflows • Sample Code • A Little Theory… • Pattern • PMML • Roadmap • Customer Experiments
  • 23. workflow abstraction – pattern language: Cascading uses a “plumbing” metaphor in the Java API to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java. In formal terms, this provides a pattern language.
  • 24. references… pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices. design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by the “Gang of Four”.
  • 25. workflow abstraction – pattern language: design principles of the pattern language ensure best practices for robust, parallel data workflows at scale
  • 26. workflow abstraction – literate programming: Cascading workflows generate their own visual documentation: flow diagrams. In formal terms, flow diagrams leverage a methodology called literate programming. They provide intuitive, visual representations for apps – great for cross-team collaboration.
  • 27. references… Literate Programming, by Don Knuth, Univ of Chicago Press, 1992: “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
  • 28. workflow abstraction – test-driven development • assert patterns (regex) on the tuple streams • adjust assert levels, like log4j levels • trap edge cases as “data exceptions” • TDD at scale: 1. start from raw inputs in the flow graph 2. define stream assertions for each stage of transforms 3. verify exceptions, code to remove them 4. when impl is complete, app has full test coverage • redirect traps in production to Ops, QA, Support, Audit, etc.
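The idea of asserting a regex on a tuple stream and trapping the failures can be illustrated in plain Java. This is a minimal sketch, assuming a hypothetical record layout (an 8-character hex order id, a tab, then a 0/1 label); the class and method names are invented for illustration and it does not use Cascading's own assertion or trap classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of "assert a regex on a tuple stream, trap the failures".
// Plain Java for illustration; it does not use Cascading's assertion or trap classes.
class StreamAssertionSketch {

    // Holds the tuples that passed the assertion and those diverted to the trap.
    static final class Result {
        final List<String> passed = new ArrayList<>();
        final List<String> trapped = new ArrayList<>();
    }

    // Route each tuple to "passed" when it fully matches the pattern, else to "trapped".
    static Result assertMatches(List<String> tuples, Pattern expected) {
        Result result = new Result();
        for (String tuple : tuples) {
            if (expected.matcher(tuple).matches()) {
                result.passed.add(tuple);
            } else {
                result.trapped.add(tuple); // a "data exception", kept for later review
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed record layout: 8 hex characters, a tab, then a 0/1 label.
        Pattern orderLine = Pattern.compile("[0-9a-f]{8}\\t[01]");
        List<String> orders = Arrays.asList("6f8e1014\t1", "6f8ea22e\t0", "bad-record");
        Result result = assertMatches(orders, orderLine);
        System.out.println("passed:  " + result.passed.size());
        System.out.println("trapped: " + result.trapped);
    }
}
```

Keeping the trapped records rather than throwing lets the rest of the flow complete, which is the point of treating bad input as a "data exception".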
  • 29. workflow abstraction – business process: Following the essence of literate programming, Cascading workflows provide statements of business process. This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data). Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.). This is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.” By virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale.
  • 30. references… “A relational model of data for large shared data banks”, by Edgar Codd, Communications of the ACM, 1970. Rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on what apps do: the process of structuring data. Closely related to the functional relational programming paradigm: “Out of the Tar Pit”, Moseley & Marks, 2006.
  • 31. workflow abstraction – API design principles • specify what is required, not how it must be achieved • plan far ahead, before consuming cluster resources – fail fast prior to submit • fail the same way twice – deterministic flow planners help reduce engineering costs for debugging at scale • same JAR, any scale – app does not require a recompile to change data taps or cluster topologies
  • 32. workflow abstraction – building apps in layers • business process: separation of concerns: focus on specifying what is required, not how the computers must accomplish it – not unlike BPM/BPEL for Big Data • test-driven development: assert expected patterns in tuple flows, adjust assertion levels, verify that tests fail, code until tests pass, repeat … route exceptional data to appropriate department • pattern language: syntax of the pattern language conveys expertise – much like building a tower with Lego blocks: ensure best practices for robust, parallel data workflows at scale • flow planner/optimizer: enables the functional programming aspects: compiler within a compiler, mapping flows to topologies (e.g., create and sequence Hadoop job steps) • compiler/build: entire app is visible to the compiler: resolves issues of crossing boundaries for troubleshooting, exception handling, notifications, etc.; one app = one JAR • topology: Apache Hadoop MR, IMDGs, etc. – upcoming MR2, etc. • JVM cluster: cluster scheduler, instrumentation, etc.
  • 33. workflow abstraction – building apps in layers: several theoretical aspects converge into software engineering practices which minimize the complexity of building and maintaining Enterprise data workflows
  • 34. Pattern: predictive models at scale • Enterprise Data Workflows • Sample Code • A Little Theory… • Pattern • PMML • Roadmap • Customer Experiments
  • 35. Pattern – analytics workflows • open source project – ASL 2, GitHub repo • multiple companies contributing • complementary to Apache Mahout – while leveraging workflow abstraction, multiple topologies, etc. • model scoring: generates workflows from PMML models • model creation: estimation at scale, captured as PMML • use sample Hadoop app at scale – no coding required • integrate with 2 lines of Java (1 line Clojure or Scala) • excellent use cases for customer experiments at scale
  • 36. Pattern – analytics workflows: greatly reduced development costs, fewer licensing issues at scale – leveraging the economics of Apache Hadoop clusters, plus the core competencies of analytics staff, plus existing IP in predictive models
  • 37. Pattern – model scoring • migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML • great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc. • integrate with other libraries – Matrix API, etc. • leverage PMML as another kind of DSL
  • 38. Pattern – an example classifier: 1. use customer order history as the training data set 2. train a risk classifier for orders, using Random Forest 3. export model from R to PMML 4. build a Cascading app to execute the PMML model 4.1. generate flow from PMML description 4.2. plan the flow for a topology (Hadoop) 4.3. compile app to a JAR file 5. verify results with a regression test 6. deploy the app at scale to calculate scores 7. potentially, reuse classifier for real-time scoring
  • 39. Pattern – an example classifier [figure: risk classifier architecture with two dimensions, customer 360 and per-order. Elements: Cascading apps, data prep, training data sets, analyst's laptop, customer transactions, predict model costs, score new orders, PMML model, detect fraudsters, anomaly detection, segment customers, velocity metrics, Hadoop batch workloads, IMDG real-time workloads, Customer DB, ETL, partner data, DW, chargebacks, etc.]
  • 40. Pattern – create a model in R:
      ## train a RandomForest model
      f <- as.formula("as.factor(label) ~ .")
      fit <- randomForest(f, data_train, ntree=50)

      ## test the model on the holdout test set
      print(fit$importance)
      print(fit)

      predicted <- predict(fit, data)
      data$predicted <- predicted
      confuse <- table(pred = predicted, true = data[,1])
      print(confuse)

      ## export predicted labels to TSV
      write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

      ## export RF model to PMML
      saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
  • 41. Pattern – capture model parameters as PMML:
      <?xml version="1.0"?>
      <PMML version="4.0" xmlns="" xmlns:xsi="" xsi:schemaLocation="">
       <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
        <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
        <Application name="Rattle/PMML" version="1.2.30"/>
        <Timestamp>2012-10-22 19:39:28</Timestamp>
       </Header>
       <DataDictionary numberOfFields="4">
        <DataField name="label" optype="categorical" dataType="string">
         <Value value="0"/>
         <Value value="1"/>
        </DataField>
        <DataField name="var0" optype="continuous" dataType="double"/>
        <DataField name="var1" optype="continuous" dataType="double"/>
        <DataField name="var2" optype="continuous" dataType="double"/>
       </DataDictionary>
       <MiningModel modelName="randomForest_Model" functionName="classification">
        <MiningSchema>
         <MiningField name="label" usageType="predicted"/>
         <MiningField name="var0" usageType="active"/>
         <MiningField name="var1" usageType="active"/>
         <MiningField name="var2" usageType="active"/>
        </MiningSchema>
        <Segmentation multipleModelMethod="majorityVote">
         <Segment id="1">
          <True/>
          <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
           <MiningSchema>
            <MiningField name="label" usageType="predicted"/>
            <MiningField name="var0" usageType="active"/>
            <MiningField name="var1" usageType="active"/>
            <MiningField name="var2" usageType="active"/>
           </MiningSchema>
      ...
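The `multipleModelMethod="majorityVote"` attribute in the PMML above means each segment (one tree of the forest) casts a vote and the ensemble reports the most frequent label. The following plain Java sketch shows that combination step only. It is not Pattern's implementation: the class name, the sample votes, and the tie-break rule (first label seen wins) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of PMML's multipleModelMethod="majorityVote".
// Not Pattern's implementation; the tie-break rule is an assumption.
class MajorityVoteSketch {

    // Return the label predicted by the most segments; on a tie, the label seen first wins.
    // Returns null when there are no predictions.
    static String majorityVote(List<String> segmentPredictions) {
        Map<String, Integer> tally = new LinkedHashMap<>();
        for (String label : segmentPredictions) {
            tally.merge(label, 1, Integer::sum);
        }
        String winner = null;
        int best = 0;
        for (Map.Entry<String, Integer> entry : tally.entrySet()) {
            if (entry.getValue() > best) {
                best = entry.getValue();
                winner = entry.getKey();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        // Pretend five trees each voted on one order; labels "0" and "1" follow the DataDictionary.
        List<String> votes = Arrays.asList("1", "0", "1", "1", "0");
        System.out.println("ensemble label: " + majorityVote(votes));
    }
}
```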
  • 42. Pattern – score a model, within an app:
      public class Main {
        public static void main( String[] args ) {
          String pmmlPath = args[ 0 ];
          String ordersPath = args[ 1 ];
          String classifyPath = args[ 2 ];
          String trapPath = args[ 3 ];

          Properties properties = new Properties();
          AppProps.setApplicationJarClass( properties, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

          // create source and sink taps
          Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
          Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
          Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

          // define a "Classifier" model from PMML to evaluate the orders
          ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
          Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
            .addSource( classifyPipe, ordersTap )
            .addTrap( classifyPipe, trapTap )
            .addSink( classifyPipe, classifyTap );

          // write a DOT file and run the flow
          Flow classifyFlow = flowConnector.connect( flowDef );
          classifyFlow.writeDOT( "dot/" );
          classifyFlow.complete();
        }
      }
  • 43. Pattern – score a model, using pre-defined Cascading app [figure: flow diagram with Customer Orders, PMML Model, Classify, Assert, Scored Orders, Failure Traps, GroupBy token, Count, Confusion Matrix]
  • 44. Pattern – score a model, using pre-defined Cascading app:
      ## run an RF classifier at scale
      hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap --pmml data/sample.rf.xml

      ## run an RF classifier at scale, assert regression test, measure confusion matrix
      hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap --pmml data/sample.rf.xml --assert --measure out/measure

      ## run a predictive model at scale, measure RMSE
      hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap --pmml data/iris.lm_p.xml --rmse out/measure
  • 45. Pattern – evaluating results:
      bash-3.2$ head out/classify/part-00000
      label  var0  var1  var2  order_id  predicted  score
      1      0     1     0     6f8e1014  1          1
      0      0     0     1     6f8ea22e  0          0
      1      0     1     0     6f8ea435  1          1
      0      0     0     1     6f8ea5e1  0          0
      1      0     1     0     6f8ea785  1          1
      1      0     1     0     6f8ea91e  1          1
      0      1     0     0     6f8eaaba  0          0
      1      0     1     0     6f8eac54  1          1
      0      1     1     0     6f8eade3  1          1
  • 46. Lingual – connecting Hadoop and R:
      # load the JDBC package
      library(RJDBC)

      # set up the driver
      drv <- JDBC("cascading.lingual.jdbc.Driver", "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

      # set up a database connection to a local repository
      connection <- dbConnect(drv, "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

      # query the repository: in this case the MySQL sample database (CSV files)
      df <- dbGetQuery(connection, "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
      head(df)

      # use R functions to summarize and visualize part of the data
      df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
      summary(df$hire_age)

      library(ggplot2)
      m <- ggplot(df, aes(x=hire_age))
      m <- m + ggtitle("Age at hire, people named Gina")
      m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
  • 47. Lingual – connecting Hadoop and R:
      > summary(df$hire_age)
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        20.86   27.89   31.70   31.61   35.01   43.92
  • 48. Pattern: predictive models at scale • Enterprise Data Workflows • Sample Code • A Little Theory… • Pattern • PMML • Roadmap • Customer Experiments
  • 49. PMML – standard • established XML standard for predictive model markup • organized by Data Mining Group (DMG), since 1997 • members: IBM, SAS, Visa, NASA, Equifax, Microstrategy, Microsoft, etc. • PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows. “PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”
  • 50. PMML – models • Association Rules: AssociationModel element • Cluster Models: ClusteringModel element • Decision Trees: TreeModel element • Naïve Bayes Classifiers: NaiveBayesModel element • Neural Networks: NeuralNetwork element • Regression: RegressionModel and GeneralRegressionModel elements • Rulesets: RuleSetModel element • Sequences: SequenceModel element • Support Vector Machines: SupportVectorMachineModel element • Text Models: TextModel element • Time Series: TimeSeriesModel element
  • 51. PMML – vendor coverage
  • 52. Pattern: predictive models at scale • Enterprise Data Workflows • Sample Code • A Little Theory… • Pattern • PMML • Roadmap • Customer Experiments
  • 53. roadmap – existing algorithms for scoring • Random Forest • Decision Trees • Linear Regression • GLM • Logistic Regression • K-Means Clustering • Hierarchical Clustering • Support Vector Machines
  • 54. roadmap – top priorities for creating models at scale • Random Forest • Logistic Regression • K-Means Clustering. A wealth of recent research indicates many opportunities to parallelize popular algorithms for training models at scale on Apache Hadoop…
  • 55. roadmap – next priorities for scoring • Time Series (ARIMA forecast) • Association Rules (basket analysis) • Naïve Bayes • Neural Networks. Algorithms are extended based on customer use cases – contact @pacoid
  • 56. Pattern: predictive models at scale • Enterprise Data Workflows • Sample Code • A Little Theory… • Pattern • PMML • Roadmap • Customer Experiments
  • 57. experiments – comparing models • much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale • run multiple variants, then measure relative “lift” • Concurrent runtime – tag and track models. The following example compares two models trained with different machine learning algorithms. It is exaggerated: one model has an important variable intentionally omitted to help illustrate the experiment.
  • 58. experiments – Random Forest model:
      ## train a Random Forest model
      ## example:
      f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
      fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
      print(fit)
      saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

      OOB estimate of error rate: 14%
      Confusion matrix:
         0   1 class.error
      0 69  16   0.1882353
      1 12 103   0.1043478
  • 59. experiments – Logistic Regression model:
      ## train a Logistic Regression model (special case of GLM)
      ## example:
      f <- as.formula("as.factor(label) ~ var0 + var2")
      fit <- glm(f, family=binomial, data=data)
      print(summary(fit))
      saveXML(pmml(fit), file=paste(out_folder, "", sep="/"))

      Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
      (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
      var0         -1.3755     0.4355  -3.159  0.00159 **
      var2         -3.7742     0.5794  -6.514 7.30e-11 ***
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      NB: this model has “var1” intentionally omitted
  • 60. experiments – comparing results • use a confusion matrix to compare results for the classifiers • Logistic Regression has a lower “false negative” rate (5% vs. 11%); however, it has a much higher “false positive” rate (52% vs. 14%) • assign a cost model to select a winner – for example, in an ecommerce anti-fraud classifier: FN ∼ chargeback risk, FP ∼ customer support costs
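One way to turn "assign a cost model to select a winner" into arithmetic is an expected cost per order. The sketch below is only an illustration: the error rates are the ones quoted in the comparison above, while the fraud prior and the two per-error costs are made-up placeholders, and treating each rate as conditional on the true class is an assumption.

```java
// Hypothetical cost model for choosing between two classifiers.
// Error rates come from the comparison above; costs and the fraud prior are made-up placeholders.
class CostModelSketch {

    // Expected cost per order = P(fraud) * FN rate * FN cost + P(legit) * FP rate * FP cost.
    static double expectedCost(double fnRate, double fpRate,
                               double fraudPrior, double fnCost, double fpCost) {
        return fraudPrior * fnRate * fnCost + (1.0 - fraudPrior) * fpRate * fpCost;
    }

    public static void main(String[] args) {
        double fraudPrior = 0.02;  // assumed share of fraudulent orders
        double fnCost = 100.0;     // assumed chargeback cost per missed fraud
        double fpCost = 5.0;       // assumed support cost per wrongly flagged order

        double logistic = expectedCost(0.05, 0.52, fraudPrior, fnCost, fpCost);
        double forest = expectedCost(0.11, 0.14, fraudPrior, fnCost, fpCost);

        System.out.printf("Logistic Regression: %.3f per order%n", logistic);
        System.out.printf("Random Forest:       %.3f per order%n", forest);
        System.out.println("winner: " + (forest < logistic ? "Random Forest" : "Logistic Regression"));
    }
}
```

With these placeholder values the false positive term dominates, so the Random Forest model comes out cheaper; a much higher chargeback cost or fraud prior would shift the balance toward the model with the lower false negative rate.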
  • 61. references… Enterprise Data Workflows with Cascading, O’Reilly, 2013
  • 62. drill-down… blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities: Copyright @2013, Concurrent, Inc.