Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop
 

Hadoop Summit 2013 talk:

“Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop”

Usage rights: CC Attribution-ShareAlike License

    Presentation Transcript

    • Paco Nathan, Concurrent, Inc., San Francisco, CA, @pacoid
      “Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop”
      Tuesday, 25 June 13
    • [Flow diagram labels: Failure Traps, bonus allocation, employee, PMML classifier, quarterly sales, Join, Count, leads]
      • Cascading: background
      • The Workflow Abstraction
      • PMML: Predictive Model Markup
      • Pattern: PMML in Cascading
      • PMML for Customer Experiments
      • Ensemble Models with Pattern
      • Workflow Design Pattern
    • Cascading – origins
      API author Chris Wensel worked as a system architect at an Enterprise firm well known for many popular data products.
      Wensel was following the Nutch open source project – where Hadoop started.
      Observation: it would be difficult to find Java developers to write complex Enterprise apps in MapReduce – a potential blocker for leveraging new open source technology.
    • Cascading – functional programming
      Key insight: MapReduce is based on functional programming – back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
      To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows:
      • leverages JVM and Java-based tools without any need to create new languages
      • allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters
    • Cascading – definitions
      [Diagram labels: Hadoop Cluster, source taps, sink taps, trap tap, customer profile DBs, Customer Prefs, logs, Data Workflow, Cache, Customers, Support, Web App, Reporting, Analytics Cubes, Modeling, PMML]
      • a pattern language for Enterprise Data Workflows
      • simple to build, easy to test, robust in production
      • design principles ⟹ ensure best practices at scale
    • Cascading – usage
      • Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL
      • ASL 2 license, GitHub src, http://conjars.org
      • 5+ yrs production use, multiple Enterprise verticals
    • Cascading – integrations
      • partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera
      • taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.
      • serialization: Avro, Thrift, Kryo, JSON, etc.
      • topologies: Apache Hadoop, tuple spaces, local mode
    • Cascading – deployments
      • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc.
      • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.
      workflow abstraction addresses:
      • staffing bottleneck
      • system integration
      • operational complexity
      • test-driven development
    • Enterprise Data Workflows
      Let’s consider a “strawman” architecture for an example app… at the front end
      LOB use cases drive demand for apps
    • Enterprise Data Workflows
      Same example… in the back office
      Organizations have substantial investments in people, infrastructure, process
    • Enterprise Data Workflows
      Same example… the heavy lifting!
      “Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out
    • Cascading workflows – taps
      • taps integrate other data frameworks, as tuple streams
      • these are “plumbing” endpoints in the pattern language
      • sources (inputs), sinks (outputs), traps (exceptions)
      • text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
      • data serialization: Avro, Thrift, Kryo, JSON, etc.
      • extend a new kind of tap in just a few lines of Java
      schema and provenance get derived from analysis of the taps
    • Cascading workflows – taps
      source and sink taps for TSV data in HDFS

      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps (tab-delimited text with a header row)
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect( flowDef );
      wcFlow.writeDOT( "dot/wc.dot" );
      wcFlow.complete();
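
The word-count flow on this slide tokenizes each line, groups by token, and counts. As a rough illustration only, the same logic can be sketched with plain Java streams and no Cascading or Hadoop dependency. The class name `WordCountSketch` and the sample sentences are hypothetical, and dropping empty tokens is a choice made for this sketch, not something the slide shows.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stdlib-only sketch; it does not use the Cascading API.
final class WordCountSketch {
    // Delimiter class matching the slide's splitter: space, brackets, parentheses, comma, period.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Tokenize every line, then group identical tokens and count them (sorted by token).
    static Map<String, Long> count(String... lines) {
        return Arrays.stream(lines)
            .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))
            .filter(token -> !token.isEmpty()) // sketch-only choice: skip empty tokens
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("rain shadow (dry area)", "a dry area, in rain"));
    }
}
```

The stream stages correspond loosely to the Each (split), GroupBy, and Every (Count) pipes in the flow; the flow planner, taps, and cluster execution have no counterpart here.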
    • Cascading workflows – topologies
      • topologies execute workflows on clusters
      • flow planner is like a compiler for queries
        - Hadoop (MapReduce jobs)
        - local mode (dev/test or special config)
        - in-memory data grids (real-time)
      • flow planner can be extended to support other topologies
      blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG)
    • Cascading workflows – topologies
      [same WordCount code as on the “Cascading workflows – taps” code slide]
      flow planner for Apache Hadoop topology
    • Cascading workflows – test-driven development
      • assert patterns (regex) on the tuple streams
      • adjust assert levels, like log4j levels
      • trap edge cases as “data exceptions”
      • TDD at scale:
        1. start from raw inputs in the flow graph
        2. define stream assertions for each stage of transforms
        3. verify exceptions, code to remove them
        4. when impl is complete, app has full test coverage
      redirect traps in production to Ops, QA, Support, Audit, etc.
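
The idea of asserting a regex on a stream and trapping the records that fail can be sketched in plain Java. This is only an illustration of the concept under stated assumptions: the class `TrapSketch`, its `Split` holder, and the sample pattern are hypothetical, and none of it is the Cascading assertion or trap API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stdlib-only sketch of "assert on the stream, trap the exceptions".
final class TrapSketch {
    // Holds the records that passed the assertion and those diverted to the trap.
    static final class Split {
        final List<String> passed = new ArrayList<>();
        final List<String> trapped = new ArrayList<>();
    }

    // A record passes only if the whole record matches the assertion pattern;
    // otherwise it goes to the trap instead of failing the run.
    static Split assertOrTrap(List<String> records, Pattern assertion) {
        Split split = new Split();
        for (String record : records) {
            if (assertion.matcher(record).matches()) {
                split.passed.add(record);
            } else {
                split.trapped.add(record);
            }
        }
        return split;
    }

    public static void main(String[] args) {
        // Assumed record shape for the example: numeric id, a tab, then a word.
        Pattern idThenWord = Pattern.compile("\\d+\\t\\w+");
        Split split = assertOrTrap(List.of("1\tfoo", "x\tbar", "22\tbaz"), idThenWord);
        System.out.println("passed=" + split.passed.size() + " trapped=" + split.trapped.size());
    }
}
```

In this sketch the trapped list plays the role of the trap tap: bad records are kept for inspection by another team rather than discarded or allowed to abort the job.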
    • Workflow Abstraction – pattern language
      Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
      [Flow diagram labels: Document Collection, Tokenize, Scrub token, Regex token, HashJoin Left, RHS, Stop Word List, GroupBy token, Count, Word Count, M, R]
      Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java.
      In formal terms, this provides a pattern language.
    • Pattern Language
      structured method for solving large, complex design problems, where the syntax of the language ensures the use of best practices – i.e., conveying expertise
      A Pattern Language, Christopher Alexander, et al.
      amazon.com/dp/0195019199
    • Workflow Abstraction – literate programming
      Cascading workflows generate their own visual documentation: flow diagrams
      in formal terms, flow diagrams leverage a methodology called literate programming
      provides intuitive, visual representations for apps – great for cross-team collaboration
    • Literate Programming
      by Don Knuth
      Literate Programming, Univ of Chicago Press, 1992
      literateprogramming.com/
      “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
    • Workflow Abstraction – business process
      following the essence of literate programming, Cascading workflows provide statements of business process
      this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)
      Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.)
      this is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.”
      by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale
    • Business Process
      by Edgar Codd
      “A relational model of data for large shared data banks”, Communications of the ACM, 1970
      dl.acm.org/citation.cfm?id=362685
      rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on what apps do: the process of structuring data
    • Cascading – functional programming
      • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments
      • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
        - Cascalog in Clojure (2010): github.com/nathanmarz/cascalog/wiki
        - Scalding in Scala (2012): github.com/twitter/scalding/wiki
      Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
      Dan Woods, 2013-04-17, Forbes
      forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
    • Functional Programming for Big Data
      WordCount with token scrubbing…
      Apache Hive: 52 lines HQL + 8 lines Python (UDF)
      compared to
      Scalding: 18 lines Scala/Cascading
      functional programming languages help reduce software engineering costs at scale, over time
    • Two Avenues to the App Layer…
      [Chart axes: scale ➞, complexity ➞]
      Enterprise: must contend with complexity at scale every day…
      incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
      Start-ups: crave complexity and scale to become viable…
      new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
    • PMML – standard
      • established XML standard for predictive model markup
      • organized by Data Mining Group (DMG), since 1997: http://dmg.org/
      • members: IBM, SAS, Visa, NASA, Equifax, MicroStrategy, Microsoft, etc.
      • PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows
      “PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”
      wikipedia.org/wiki/Predictive_Model_Markup_Language
    • PMML – model coverage
      • Association Rules: AssociationModel element
      • Cluster Models: ClusteringModel element
      • Decision Trees: TreeModel element
      • Naïve Bayes Classifiers: NaiveBayesModel element
      • Neural Networks: NeuralNetwork element
      • Regression: RegressionModel and GeneralRegressionModel elements
      • Rulesets: RuleSetModel element
      • Sequences: SequenceModel element
      • Support Vector Machines: SupportVectorMachineModel element
      • Text Models: TextModel element
      • Time Series: TimeSeriesModel element
      ibm.com/developerworks/industry/library/ind-PMML2/
    • PMML – vendor coverage
    • Pattern – model scoring
      • migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
      • great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.
      • integrate with other libraries – Matrix API, etc.
      • leverage PMML as another kind of DSL
      cascading.org/pattern
    • Pattern – create a model in R

      ## train a RandomForest model
      f <- as.formula("as.factor(label) ~ .")
      fit <- randomForest(f, data_train, ntree=50)

      ## test the model on the holdout test set
      print(fit$importance)
      print(fit)
      predicted <- predict(fit, data)
      data$predicted <- predicted
      confuse <- table(pred = predicted, true = data[,1])
      print(confuse)

      ## export predicted labels to TSV
      write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

      ## export RF model to PMML
      saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
    • Pattern – capture model parameters as PMML

      <?xml version="1.0"?>
      <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
       <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
        <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
        <Application name="Rattle/PMML" version="1.2.30"/>
        <Timestamp>2012-10-22 19:39:28</Timestamp>
       </Header>
       <DataDictionary numberOfFields="4">
        <DataField name="label" optype="categorical" dataType="string">
         <Value value="0"/>
         <Value value="1"/>
        </DataField>
        <DataField name="var0" optype="continuous" dataType="double"/>
        <DataField name="var1" optype="continuous" dataType="double"/>
        <DataField name="var2" optype="continuous" dataType="double"/>
       </DataDictionary>
       <MiningModel modelName="randomForest_Model" functionName="classification">
        <MiningSchema>
         <MiningField name="label" usageType="predicted"/>
         <MiningField name="var0" usageType="active"/>
         <MiningField name="var1" usageType="active"/>
         <MiningField name="var2" usageType="active"/>
        </MiningSchema>
        <Segmentation multipleModelMethod="majorityVote">
         <Segment id="1">
          <True/>
          <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
           <MiningSchema>
            <MiningField name="label" usageType="predicted"/>
            <MiningField name="var0" usageType="active"/>
            <MiningField name="var1" usageType="active"/>
            <MiningField name="var2" usageType="active"/>
           </MiningSchema>
      ...
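
The `multipleModelMethod="majorityVote"` attribute in this PMML says that each segment (tree) votes for a label and the most common label wins. A minimal sketch of that combination step in plain Java follows. The class `MajorityVoteSketch` is hypothetical and is not Pattern's implementation; the tie-breaking rule (the label that was seen first wins) is an assumption of this sketch, not taken from the PMML specification.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stdlib-only sketch of combining per-tree predictions by majority vote.
final class MajorityVoteSketch {
    // Returns the most frequent label, or null for an empty list.
    // Ties go to the label that appeared first (sketch-only assumption).
    static String majorityVote(List<String> segmentPredictions) {
        Map<String, Integer> tally = new LinkedHashMap<>();
        for (String label : segmentPredictions) {
            tally.merge(label, 1, Integer::sum);
        }
        String winner = null;
        int best = 0;
        for (Map.Entry<String, Integer> entry : tally.entrySet()) {
            if (entry.getValue() > best) {
                best = entry.getValue();
                winner = entry.getKey();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        // Three trees vote on one tuple; labels "0" and "1" as in the DataDictionary above.
        System.out.println(majorityVote(List.of("1", "0", "1")));
    }
}
```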
    • Pattern – score a model, within an app

      public static void main( String[] args ) throws RuntimeException {
        String inputPath = args[ 0 ];
        String classifyPath = args[ 1 ];

        // set up the config properties
        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

        // create source and sink taps (tab-delimited text with a header row)
        Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
        Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );

        // handle command line options
        OptionParser optParser = new OptionParser();
        optParser.accepts( "pmml" ).withRequiredArg();
        OptionSet options = optParser.parse( args );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
          .addSource( "input", inputTap )
          .addSink( "classify", classifyTap );

        if( options.hasArgument( "pmml" ) ) {
          String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
          PMMLPlanner pmmlPlanner = new PMMLPlanner()
            .setPMMLInput( new File( pmmlPath ) )
            .retainOnlyActiveIncomingFields()
            .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model
          flowDef.addAssemblyPlanner( pmmlPlanner );
        }

        // write a DOT file and run the flow
        Flow classifyFlow = flowConnector.connect( flowDef );
        classifyFlow.writeDOT( "dot/classify.dot" );
        classifyFlow.complete();
      }
    • Pattern – score a model, using pre-defined Cascading app
      [Flow diagram labels: Customer Orders, Classify, Scored Orders, GroupBy token, Count, PMML Model, M, R, Failure Traps, Assert, Confusion Matrix]
      cascading.org/pattern
    • Pattern – score a model, using pre-defined Cascading app

      ## run an RF classifier at scale
      hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap --pmml data/sample.rf.xml

      ## run an RF classifier at scale, assert regression test, measure confusion matrix
      hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap --pmml data/sample.rf.xml --assert --measure out/measure

      ## run a predictive model at scale, measure RMSE
      hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap --pmml data/iris.lm_p.xml --rmse out/measure
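
The last command measures RMSE (root mean squared error) for a regression model. The slides do not show how the pre-defined app computes it; the following is only a sketch of the standard definition, RMSE = sqrt(mean((predicted − actual)²)), in plain Java. The class `RmseSketch` and its inputs are hypothetical.

```java
// Hypothetical stdlib-only sketch of the standard RMSE formula.
final class RmseSketch {
    // Root of the mean of squared differences between predicted and actual values.
    static double rmse(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double error = predicted[i] - actual[i];
            sumOfSquares += error * error;
        }
        return Math.sqrt(sumOfSquares / predicted.length);
    }

    public static void main(String[] args) {
        // Made-up values purely for illustration.
        System.out.println(rmse(new double[] {2.0, 4.0}, new double[] {0.0, 0.0}));
    }
}
```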
    • Roadmap – existing algorithms for scoring
      • Random Forest
      • Decision Trees
      • Linear Regression
      • GLM
      • Logistic Regression
      • K-Means Clustering
      • Hierarchical Clustering
      • Multinomial
      • Support Vector Machines (prepared for release)
      also, model chaining and general support for ensembles
      cascading.org/pattern
    • Roadmap – next priorities for scoring
      • Time Series (ARIMA forecast)
      • Association Rules (basket analysis)
      • Naïve Bayes
      • Neural Networks
      algorithms extended based on customer use cases – contact groups.google.com/forum/?fromgroups#!forum/pattern-user
      cascading.org/pattern
    • Roadmap – top priorities for creating models at scale
      • Random Forest
      • Logistic Regression
      • K-Means Clustering
      • Association Rules
      …plus all models which can be trained via sparse matrix factorization (TSQR => PCA, SVD, least squares, etc.)
      a wealth of recent research indicates many opportunities to parallelize popular algorithms for training models at scale on Apache Hadoop…
      cascading.org/pattern
    • Experiments – comparing models
      • much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
      • run multiple variants, then measure relative “lift”
      • Concurrent runtime – tag and track models
      the following example compares two models trained with different machine learning algorithms
      this is exaggerated: one has an important variable intentionally omitted to help illustrate the experiment
    • Experiments – Random Forest model

      ## train a Random Forest model
      ## example: http://mkseo.pe.kr/stats/?p=220
      f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
      fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
      print(fit)
      saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

      OOB estimate of error rate: 14%
      Confusion matrix:
          0   1 class.error
      0  69  16   0.1882353
      1  12 103   0.1043478
    • Experiments – Logistic Regression model

      ## train a Logistic Regression model (special case of GLM)
      ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
      f <- as.formula("as.factor(label) ~ var0 + var2")
      fit <- glm(f, family=binomial, data=data)
      print(summary(fit))
      saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

      Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
      (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
      var0         -1.3755     0.4355  -3.159  0.00159 **
      var2         -3.7742     0.5794  -6.514 7.30e-11 ***
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      NB: this model has “var1” intentionally omitted
    • Experiments – comparing results
      • use a confusion matrix to compare results for the classifiers
      • Logistic Regression has a lower “false negative” rate (5% vs. 11%); however, it has a much higher “false positive” rate (52% vs. 14%)
      • assign a cost model to select a winner – for example, in an ecommerce anti-fraud classifier:
        FN ∼ chargeback risk
        FP ∼ customer support costs
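
To make the cost-model idea concrete, here is a small plain-Java sketch that derives false positive and false negative rates from a 2×2 confusion matrix and weights the error counts by a cost per error. The counts in `main` are the Random Forest confusion matrix from the earlier R output in this deck; the class `ConfusionCostSketch`, the choice of label 1 as the positive class, and the per-error costs are all assumptions made for illustration.

```java
// Hypothetical stdlib-only sketch; matrix is indexed [actual label][predicted label].
final class ConfusionCostSketch {
    // Share of actual negatives (label 0) that were predicted positive (label 1).
    static double falsePositiveRate(long[][] matrix) {
        return (double) matrix[0][1] / (matrix[0][0] + matrix[0][1]);
    }

    // Share of actual positives (label 1) that were predicted negative (label 0).
    static double falseNegativeRate(long[][] matrix) {
        return (double) matrix[1][0] / (matrix[1][0] + matrix[1][1]);
    }

    // Total cost = false negatives * cost per FN + false positives * cost per FP.
    static double totalCost(long[][] matrix, double costPerFalseNegative, double costPerFalsePositive) {
        return matrix[1][0] * costPerFalseNegative + matrix[0][1] * costPerFalsePositive;
    }

    public static void main(String[] args) {
        long[][] randomForest = { {69, 16}, {12, 103} };
        System.out.println("FP rate: " + falsePositiveRate(randomForest));
        System.out.println("FN rate: " + falseNegativeRate(randomForest));
        // Made-up costs: a chargeback (FN) assumed ten times as costly as a support call (FP).
        System.out.println("cost: " + totalCost(randomForest, 50.0, 5.0));
    }
}
```

Computing the same total for each candidate model and picking the lowest is one simple way to "select a winner"; the two rates here match the class.error column of the R output (about 0.188 and 0.104).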
    • Two Cultures
      “A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.”
      Statistical Modeling: The Two Cultures, Leo Breiman, 2001
      bit.ly/eUTh9L
      in other words, seeing the forest for the trees…
      this paper chronicled a sea change from data modeling practices (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization)
    • Why Do Ensembles Matter?
      [Figure labels: The World… per Data Modeling; The World…]
    • Algorithmic Modeling
      “The trick to being a scientist is to be open to using a wide variety of tools.” – Breiman
      circa 2001: Random Forest, bootstrap aggregation, etc., yield dramatic increases in predictive power over earlier modeling such as Logistic Regression
      major learnings from the Netflix Prize: the power of ensembles, model chaining, etc.
      the problems at hand have become simply too big and too complex for ONE distribution, ONE model, ONE team…
    • Ensemble Models
      Breiman: “a multiplicity of data models”
      BellKor team: 100+ individual models in 2007 Progress Prize
      while the process of combining models adds complexity (making it more difficult to anticipate or explain predictions), accuracy may increase substantially
      Ensemble Learning: Better Predictions Through Diversity, Todd Holloway, ETech (2008)
      abeautifulwww.com/EnsembleLearningETech.pdf
      The Story of the Netflix Prize: An Ensemblers Tale, Lester Mackey, National Academies Seminar, Washington, DC (2011)
      stanford.edu/~lmackey/papers/
    • KDD 2013 PMML Workshop
      Pattern: PMML for Cascading and Hadoop
      Paco Nathan, Girish Kathalagiri
      Chicago, 2013-08-11 (accepted)
      19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
      kdd13pmml.wordpress.com
    • Anatomy of an Enterprise app
      Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
      [Diagram labels: data sources, ETL, data prep, predictive model, end uses]
      • ANSI SQL for ETL
      • J2EE for business logic
      • SAS for predictive models
      • SAS for predictive models, ANSI SQL for ETL: most of the licensing costs…
      • J2EE for business logic: most of the project costs…
    • Anatomy of an Enterprise app
      Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source
      [Diagram labels: data sources, ETL, data prep, predictive model, end uses]
      • source taps for Cassandra, JDBC, Splunk, etc.
      • Lingual: DW → ANSI SQL
      • Pattern: SAS, R, etc. → PMML
      • business logic in Java, Clojure, Scala, etc.
      • sink taps for Memcached, HBase, MongoDB, etc.
      a compiler sees it all…
      cascading.org
    • Anatomy of an Enterprise app: a compiler sees it all… (Lingual)

      FlowDef flowDef = FlowDef.flowDef().setName( "etl" )
        .addSource( "example.employee", emplTap )
        .addSource( "example.sales", salesTap )
        .addSink( "results", resultsTap );

      SQLPlanner sqlPlanner = new SQLPlanner().setSql( sqlStatement );

      flowDef.addAssemblyPlanner( sqlPlanner );
    • Anatomy of an Enterprise app: a compiler sees it all… (Pattern)

      FlowDef flowDef = FlowDef.flowDef().setName( "classifier" )
        .addSource( "input", inputTap )
        .addSink( "classify", classifyTap );

      PMMLPlanner pmmlPlanner = new PMMLPlanner()
        .setPMMLInput( new File( pmmlModel ) )
        .retainOnlyActiveIncomingFields();

      flowDef.addAssemblyPlanner( pmmlPlanner );
    • Anatomy of an Enterprise app
      visual collaboration for the business logic is a great way to improve how teams work together
    • Anatomy of an Enterprise app
      multiple departments, working in their respective frameworks, integrate results into a combined app, which runs at scale on a cluster… business process combined in a common space (DAG) for flow planners, compiler, optimization, troubleshooting, exception handling, notifications, security audit, performance monitoring, etc.
      cascading.org
    • references…
      Enterprise Data Workflows with Cascading, O’Reilly, 2013
      amazon.com/dp/1449358721
      newsletter updates: liber118.com/pxn/
    • acknowledgements…
      Many thanks to others who have contributed code, ideas, suggestions, etc., to Pattern:
      • Chris Wensel @ Concurrent
      • Girish Kathalagiri @ AgilOne
      • Vijay Srinivas Agneeswaran @ Impetus
      • Chris Severs @ eBay
      • Ofer Mendelevitch @ Hortonworks
      • Sergey Boldyrev @ Nokia
      • Quinton Anderson @ IZAZI Solutions
      • Chris Gutierrez @ Airbnb
      • Villu Ruusmann @ JPMML project
    • drill-down…
      blog, developer community, code/wiki/gists, maven repo, commercial products, etc.:
      • cascading.org
      • zest.to/group11
      • github.com/Cascading
      • conjars.org
      • goo.gl/KQtUL
      • concurrentinc.com