Enterprise Data Workflows with Cascading and Windows Azure HDInsight


Published at the SF Bay Area Azure Developers meetup, Microsoft, San Francisco, 2013-06-11

http://www.meetup.com/bayazure/events/120889902/



  1. Paco Nathan, Concurrent, Inc., San Francisco, CA (@pacoid): “Enterprise Data Workflows with Cascading and Windows Azure HDInsight”
  2. Cascading and Windows Azure HDInsight
     First, some highly recommended reading:
     • “Using MapReduce with HDInsight”
       http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
     • “Drive Smarter Decisions with Hadoop and Windows Azure HDInsight”
       http://www.slideshare.net/Hadoop_Summit/drive-smarter-decisions-with-hadoop-and-windows-azure-hdinsight
  3. Cascading and Windows Azure HDInsight
     Second, a highly recommended talk: “Building Tools for the Hadoop Developer”
     Matt Winkler, Microsoft (blogs.msdn.com/mwinkle, @mwinkle)
     Hadoop Summit 2013 San Jose, Wed, June 26th, 2:05–2:45 pm
     hadoopsummit.org/san-jose/schedule/
     “In this session we’ll first discuss our experience extending Hadoop development to new platforms & languages, and then discuss our experiments and experiences building supporting developer tools and plugins for those platforms. First, we’ll take a hands-on approach to showing our experiments and successes extending Hadoop to languages such as JavaScript and .NET with LINQ. Second, we’ll walk through some of the developer & developer-ops tools and plugins we’ve experimented with in an effort to simplify life for the Hadoop developer across both on-premises and cloud-based projects.”
  4. Cascading: background
     (flow diagram: employee, quarterly sales, and leads data assembled through Join and Count steps, with a PMML classifier, bonus allocation, and failure traps)
  5. Cascading – origins
     API author Chris Wensel worked as a system architect at an Enterprise firm well known for many popular data products.
     Wensel was following the Nutch open source project – where Hadoop started.
     Observation: it would be difficult to find Java developers to write complex Enterprise apps in MapReduce – a potential blocker for leveraging the new open source technology.
  6. Cascading – functional programming
     Key insight: MapReduce is based on functional programming – back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
     To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007 as a new Java API to implement functional programming for large-scale data workflows:
     • leverages the JVM and Java-based tools without any need to create new languages
     • allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters
  7. Cascading – definitions
     (diagram: an Enterprise data workflow on a Hadoop cluster – source, sink, and trap taps connecting customer profile DBs, customer prefs, logs, a customer cache, support and web apps, reporting, analytics cubes, and PMML modeling)
     • a pattern language for Enterprise Data Workflows
     • simple to build, easy to test, robust in production
     • design principles ⟹ ensure best practices at scale
  8. Cascading – usage
     (same workflow diagram as slide 7)
     • Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL
     • ASL 2 license, GitHub src, http://conjars.org
     • 5+ yrs production use, multiple Enterprise verticals
  9. Cascading – integrations
     (same workflow diagram as slide 7)
     • partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera
     • taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.
     • serialization: Avro, Thrift, Kryo, JSON, etc.
     • topologies: Apache Hadoop, tuple spaces, local mode
  10. Cascading – deployments
     • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc.
     • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.
  11. Cascading – deployments (repeats slide 10, adding:) the workflow abstraction addresses:
     • staffing bottleneck
     • system integration
     • operational complexity
     • test-driven development
  12. Cascading: sample code
     (same flow diagram as slide 4)
  13. The Ubiquitous Word Count
     Definition: count how often each word appears in a collection of text documents.

      void map (String doc_id, String text):
        for each word w in segment(text):
          emit(w, "1");

      void reduce (String word, Iterator group):
        int count = 0;
        for each pc in group:
          count += Int(pc);
        emit(word, String(count));

     This simple program provides an excellent test case for parallel processing, since it:
     • requires a minimal amount of code
     • demonstrates use of both symbolic and numeric values
     • shows a dependency graph of tuples as an abstraction
     • is not many steps away from useful search indexing
     • serves as a “Hello World” for Hadoop apps
     Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.
     (diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; M/R)
  14. word count – conceptual flow diagram
     (diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; 1 map, 1 reduce)
     18 lines of code: gist.github.com/3900702
     cascading.org/category/impatient
  15. word count – Cascading app in Java
     (conceptual flow diagram as on slide 14)

      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // specify a regex to split "document" text lines into a token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef()
        .setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect( flowDef );
      wcFlow.writeDOT( "dot/wc.dot" );
      wcFlow.complete();
  16. word count – generated flow diagram
     (generated DOT output: Hfs[TextDelimited[doc_id, text]][data/rain.txt] → map: Each(token)[RegexSplitGenerator] → GroupBy(wc)[by: token] → reduce: Every(wc)[Count] → Hfs[TextDelimited[token, count]][output/wc])
  17. word count – Cascalog / Clojure
     (conceptual flow diagram as on slide 14)

      (ns impatient.core
        (:use [cascalog.api]
              [cascalog.more-taps :only (hfs-delimited)])
        (:require [clojure.string :as s]
                  [cascalog.ops :as c])
        (:gen-class))

      (defmapcatop split [line]
        "reads in a line of string and splits it by regex"
        (s/split line #"[\[\](),.)\s]+"))

      (defn -main [in out & args]
        (?<- (hfs-delimited out)
             [?word ?count]
             ((hfs-delimited in :skip-header? true) _ ?line)
             (split ?line :> ?word)
             (c/count ?count)))

      ; Paul Lam
      ; github.com/Quantisan/Impatient
  18. word count – Cascalog / Clojure
     github.com/nathanmarz/cascalog/wiki
     • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language
     • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL
     • composable subqueries, used for test-driven development (TDD) practices at scale
     • Leiningen build: simple, no surprises, in Clojure itself
     • more new deployments than other Cascading DSLs – Climate Corp is the largest use case: 90% Clojure/Cascalog
     • has a learning curve, limited number of Clojure developers
     • aggregators are the magic, and those take effort to learn
  19. word count – Scalding / Scala
     (conceptual flow diagram as on slide 14)

      import com.twitter.scalding._

      class WordCount(args : Args) extends Job(args) {
        Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
          .read
          .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
          .groupBy('token) { _.size('count) }
          .write(Tsv(args("wc"), writeHeader = true))
      }
  20. word count – Scalding / Scala
     github.com/twitter/scalding/wiki
     • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
     • code is compact, easy to understand
     • nearly 1:1 between elements of the conceptual flow diagram and function calls
     • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
     • significant investments by Twitter, Etsy, eBay, etc.
     • great for data services at scale
     • less learning curve than Cascalog
  21. word count – Scalding / Scala (repeats slide 20, adding:) Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping limit complexity in process.
  22. The Workflow Abstraction
     (same flow diagram as slide 4)
  23. Enterprise Data Workflows
     Let’s consider a “strawman” architecture for an example app… at the front end: LOB use cases drive demand for apps.
     (same workflow diagram as slide 7, highlighting the customer-facing web app)
  24. Enterprise Data Workflows
     Same example… in the back office: organizations have substantial investments in people, infrastructure, and process.
     (same workflow diagram as slide 7, highlighting reporting, analytics, and support)
  25. Enterprise Data Workflows
     Same example… the heavy lifting! “Main Street” firms are migrating workflows to Hadoop for cost savings and scale-out.
     (same workflow diagram as slide 7, highlighting the Hadoop cluster)
  26. Cascading workflows – taps
     (same workflow diagram as slide 7)
     • taps integrate other data frameworks, as tuple streams
     • these are “plumbing” endpoints in the pattern language
     • sources (inputs), sinks (outputs), traps (exceptions)
     • text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
     • data serialization: Avro, Thrift, Kryo, JSON, etc.
     • extend a new kind of tap in just a few lines of Java (see the sketch below)
     Schema and provenance get derived from analysis of the taps.
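     To make the three tap roles concrete, here is a minimal sketch of wiring a source, a sink, and a failure trap into one flow. It assumes the pipe assembly and imports from the slide 15 app; trapPath is a hypothetical extra command-line argument. FlowDef.addTrap diverts tuples that throw exceptions into the trap tap as “data exceptions”, instead of failing the job:

      // source, sink, and trap taps (sketch; assumes the slide 15 app context)
      Tap docTap  = new Hfs( new TextDelimited( true, "\t" ), docPath );   // source (input)
      Tap wcTap   = new Hfs( new TextDelimited( true, "\t" ), wcPath );    // sink (output)
      Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );  // trap (exceptions)

      FlowDef flowDef = FlowDef.flowDef()
        .setName( "wc" )
        .addSource( docPipe, docTap )     // read raw tuples
        .addTailSink( wcPipe, wcTap )     // write results
        .addTrap( docPipe, trapTap );     // divert tuples that fail inside docPipe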
  27. Cascading workflows – taps
     (the same Java word count app as slide 15, with two lines highlighted) source and sink taps for TSV data in HDFS:

      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
  28. Cascading workflows – topologies
     (same workflow diagram as slide 7)
     • topologies execute workflows on clusters
     • the flow planner is like a compiler for queries:
       - Hadoop (MapReduce jobs)
       - local mode (dev/test or special config)
       - in-memory data grids (real-time)
     • the flow planner can be extended to support other topologies
     Blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG).
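     As a sketch of how little changes between topologies (assuming Cascading 2.x; the pipe assembly is the one from slide 15): swap the Hadoop connector and taps for their cascading.*.local counterparts and the same assembly runs in-process, with no cluster:

      import cascading.flow.local.LocalFlowConnector;
      import cascading.scheme.local.TextDelimited;
      import cascading.tap.local.FileTap;

      // local-mode taps read/write the local file system instead of HDFS
      Tap docTap = new FileTap( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap  = new FileTap( new TextDelimited( true, "\t" ), wcPath );

      FlowDef flowDef = FlowDef.flowDef()
        .setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

      // the local planner runs the same pipe assembly in memory – handy for dev/test
      LocalFlowConnector flowConnector = new LocalFlowConnector();
      flowConnector.connect( flowDef ).complete();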
  29. Cascading workflows – topologies
     (the same Java word count app as slide 15, with one line highlighted) flow planner for the Apache Hadoop topology:

      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
  30. Cascading workflows – test-driven development
     (same workflow diagram as slide 7)
     • assert patterns (regex) on the tuple streams
     • adjust assert levels, like log4j levels
     • trap edge cases as “data exceptions”
     • TDD at scale:
       1. start from raw inputs in the flow graph
       2. define stream assertions for each stage of transforms
       3. verify exceptions, code to remove them
       4. when the impl is complete, the app has full test coverage
     Redirect traps in production to Ops, QA, Support, Audit, etc. (see the sketch below)
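     A minimal sketch of a stream assertion (assuming the slide 15 pipes; AssertMatches and AssertionLevel are standard Cascading 2.x classes, though the pattern here is only illustrative). Tuples failing the assertion raise data exceptions, which a trap tap then collects; assertion levels can be dialed down for production, much like log4j levels:

      import cascading.operation.AssertionLevel;
      import cascading.operation.assertion.AssertMatches;

      // assert that every tuple in the stream matches a regex;
      // failures become "data exceptions" caught by a trap tap
      docPipe = new Each( docPipe, AssertionLevel.STRICT, new AssertMatches( "\\S+" ) );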
  31. Cascading – functional programming
     • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments
     • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
       - Cascalog in Clojure (2010): github.com/nathanmarz/cascalog/wiki
       - Scalding in Scala (2012): github.com/twitter/scalding/wiki
     “Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology”
     Dan Woods, 2013-04-17, Forbes
     forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
  32. Functional Programming for Big Data
     Word Count with token scrubbing…
     Apache Hive: 52 lines HQL + 8 lines Python (UDF)
     compared to
     Scalding: 18 lines Scala/Cascading
     Functional programming languages help reduce software engineering costs at scale, over time.
  33. Workflow Abstraction – pattern language
     Cascading uses a “plumbing” metaphor in the Java API to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
     (diagram: Document Collection → Tokenize → HashJoin [left] with a Stop Word List [RHS] → Regex scrub token → GroupBy token → Count → Word Count; M/R)
     Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java.
     In formal terms, this provides a pattern language. (A sketch of the stop-word join follows.)
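     A minimal sketch of that stop-word scrub (patterned on the Cascading for the Impatient tutorial; stopPath is a hypothetical argument, and docPipe/token come from slide 15): left-join tokens against a stop-word list, keep only tuples where the join found no match, then retain just the token field:

      import cascading.operation.regex.RegexFilter;
      import cascading.pipe.HashJoin;
      import cascading.pipe.assembly.Retain;
      import cascading.pipe.joiner.LeftJoin;

      // scrub tokens against a stop word list (sketch)
      Fields stop = new Fields( "stop" );
      Tap stopTap = new Hfs( new TextDelimited( stop, true, "\t" ), stopPath );

      Pipe stopPipe = new Pipe( "stop" );
      Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );
      // RegexFilter keeps tuples matching the pattern: an empty "stop" field
      // means the token was not found in the stop word list
      tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );
      tokenPipe = new Retain( tokenPipe, token );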
  34. Pattern Language
     A structured method for solving large, complex design problems, where the syntax of the language ensures the use of best practices – i.e., conveying expertise.
     “A Pattern Language”, Christopher Alexander, et al.
     amazon.com/dp/0195019199
     (same flow diagram as slide 4)
  35. Workflow Abstraction – literate programming
     Cascading workflows generate their own visual documentation: flow diagrams.
     In formal terms, flow diagrams leverage a methodology called literate programming.
     This provides intuitive, visual representations for apps – great for cross-team collaboration.
     (same scrubbed word count diagram as slide 33)
  36. references…
     “Literate Programming” by Don Knuth
     Univ of Chicago Press, 1992
     literateprogramming.com/
     “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
  37. Workflow Abstraction – business process
     Following the essence of literate programming, Cascading workflows provide statements of business process.
     This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data).
     Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.). This is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.”
     By virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale.
  38. references…
     “A relational model of data for large shared data banks” by Edgar Codd
     Communications of the ACM, 1970
     dl.acm.org/citation.cfm?id=362685
     Rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on what apps do: the process of structuring data.
  39. Two Avenues to the App Layer…
     (chart: complexity ➞ vs. scale ➞)
     Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff.
     Start-ups: crave complexity and scale to become viable… new ventures move into the Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding.
  40. Lingual: ANSI SQL in Cascading
     (same flow diagram as slide 4)
  41. Cascading workflows – ANSI SQL
     (same workflow diagram as slide 7)
     • collab with Optiq – industry-proven code base
     • ANSI SQL parser/optimizer atop the Cascading flow planner
     • JDBC driver to integrate into existing tools and app servers
     • relational catalog over a collection of unstructured data
     • SQL shell prompt to run queries
     • enable analysts without retraining on Hadoop, etc.
     • transparency for Support, Ops, Finance, et al.
     A language for queries – not a database, but ANSI SQL as a DSL for workflows.
  42. Lingual – CSV data in the local file system (screenshot) cascading.org/lingual
  43. Lingual – shell prompt, catalog (screenshot) cascading.org/lingual
  44. Lingual – queries (screenshot) cascading.org/lingual
  45. abstraction layers in queries…

      abstraction   | RDBMS                             | JVM Cluster
      --------------+-----------------------------------+------------------------------------------------
      parser        | ANSI SQL compliant parser         | ANSI SQL compliant parser
      optimizer     | logical plan, optimized on stats  | logical plan, optimized on stats
      planner       | physical plan                     | API “plumbing”
      machine data  | query history, table stats        | app history, tuple stats
      topology      | b-trees, etc.                     | heterogeneous, distributed: Hadoop, in-memory, etc.
      visualization | ERD                               | flow diagram
      schema        | table schema                      | tuple schema
      catalog       | relational catalog                | tap usage DB
      provenance    | (manual audit)                    | data set producers/consumers
  46. Lingual – articles
     • “Open Source Lingual Helps SQL Devs Unlock Hadoop”, Thor Olavsrud, 2013-02-22
       cio.com/article/729283/Open_Source_Lingual_Helps_SQL_Devs_Unlock_Hadoop
     • “Hadoop Apps Without MapReduce Mindsets”, Adrian Bridgwater, 2013-02-28
       drdobbs.com/open-source/hadoop-apps-without-mapreduce-mindsets/240149708
     • “Concurrent gives old SQL users new Hadoop tricks”, Jack Clark, 2013-02-20
       theregister.co.uk/2013/02/20/hadoop_sql_translator_lingual_launches/
     • “Concurrent Raises $4 Million, Will Expand Big Data App Framework”, James Johnson, 2013-03-20
       inquisitr.com/581755/concurrent-raises-4-million-will-expand-big-data-app-framework
     • “Concurrent Releases Lingual, a SQL DSL for Hadoop”, Boris Lublinsky, 2013-02-28
       infoq.com/news/2013/02/Lingual
  47. Lingual – JDBC driver

      public void run() throws ClassNotFoundException, SQLException {
        Class.forName( "cascading.lingual.jdbc.Driver" );
        Connection connection = DriverManager.getConnection(
            "jdbc:lingual:local;schemas=src/main/resources/data/example" );
        Statement statement = connection.createStatement();

        ResultSet resultSet = statement.executeQuery(
            "select *\n"
            + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
            + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
            + "on e.\"EMPID\" = s.\"CUST_ID\"" );

        while( resultSet.next() ) {
          int n = resultSet.getMetaData().getColumnCount();
          StringBuilder builder = new StringBuilder();

          for( int i = 1; i <= n; i++ ) {
            builder.append( ( i > 1 ? "; " : "" )
              + resultSet.getMetaData().getColumnLabel( i )
              + "="
              + resultSet.getObject( i ) );
          }
          System.out.println( builder );
        }

        resultSet.close();
        statement.close();
        connection.close();
      }
  48. Lingual – JDBC result set

      $ gradle clean jar
      $ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar

      CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
      CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian

     Caveat: if you absolutely, positively must have sub-second SQL query response for PB-scale data on a 1000+ node cluster… good luck with that! (call the MPP vendors)
     This ANSI SQL library is primarily intended for batch workflows – high throughput, not low latency – for many under-represented use cases in Enterprise IT. In other words, SQL as a DSL.
     cascading.org/lingual
  49. Lingual – connecting Hadoop and R

      # load the JDBC package
      library(RJDBC)

      # set up the driver
      drv <- JDBC("cascading.lingual.jdbc.Driver",
        "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

      # set up a database connection to a local repository
      connection <- dbConnect(drv,
        "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

      # query the repository: in this case the MySQL sample database (CSV files)
      df <- dbGetQuery(connection,
        "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
      head(df)

      # use R functions to summarize and visualize part of the data
      df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
      summary(df$hire_age)

      library(ggplot2)
      m <- ggplot(df, aes(x=hire_age))
      m <- m + ggtitle("Age at hire, people named Gina")
      m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
  50. Lingual – connecting Hadoop and R

      > summary(df$hire_age)
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        20.86   27.89   31.70   31.61   35.01   43.92

     cascading.org/lingual
  51. Pattern: PMML in Cascading
     (same flow diagram as slide 4)
  52. Pattern – model scoring
     (same workflow diagram as slide 7)
     • migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
     • great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.
     • integrate with other libraries – Matrix API, etc.
     • leverage PMML as another kind of DSL
     cascading.org/pattern
  53. Pattern – create a model in R

      ## train a RandomForest model
      f <- as.formula("as.factor(label) ~ .")
      fit <- randomForest(f, data_train, ntree=50)

      ## test the model on the holdout test set
      print(fit$importance)
      print(fit)

      predicted <- predict(fit, data)
      data$predicted <- predicted
      confuse <- table(pred = predicted, true = data[,1])
      print(confuse)

      ## export predicted labels to TSV
      write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
        quote=FALSE, sep="\t", row.names=FALSE)

      ## export RF model to PMML
      saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
  54. Pattern – capture model parameters as PMML

      <?xml version="1.0"?>
      <PMML version="4.0"
          xmlns="http://www.dmg.org/PMML-4_0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
       <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
        <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
        <Application name="Rattle/PMML" version="1.2.30"/>
        <Timestamp>2012-10-22 19:39:28</Timestamp>
       </Header>
       <DataDictionary numberOfFields="4">
        <DataField name="label" optype="categorical" dataType="string">
         <Value value="0"/>
         <Value value="1"/>
        </DataField>
        <DataField name="var0" optype="continuous" dataType="double"/>
        <DataField name="var1" optype="continuous" dataType="double"/>
        <DataField name="var2" optype="continuous" dataType="double"/>
       </DataDictionary>
       <MiningModel modelName="randomForest_Model" functionName="classification">
        <MiningSchema>
         <MiningField name="label" usageType="predicted"/>
         <MiningField name="var0" usageType="active"/>
         <MiningField name="var1" usageType="active"/>
         <MiningField name="var2" usageType="active"/>
        </MiningSchema>
        <Segmentation multipleModelMethod="majorityVote">
         <Segment id="1">
          <True/>
          <TreeModel modelName="randomForest_Model" functionName="classification"
              algorithmName="randomForest" splitCharacteristic="binarySplit">
           <MiningSchema>
            <MiningField name="label" usageType="predicted"/>
            <MiningField name="var0" usageType="active"/>
            <MiningField name="var1" usageType="active"/>
            <MiningField name="var2" usageType="active"/>
           </MiningSchema>
      ...
  55. Pattern – score a model, within an app

      public static void main( String[] args ) throws RuntimeException {
        String inputPath = args[ 0 ];
        String classifyPath = args[ 1 ];

        // set up the config properties
        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

        // create source and sink taps
        Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
        Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );

        // handle command line options
        OptionParser optParser = new OptionParser();
        optParser.accepts( "pmml" ).withRequiredArg();

        OptionSet options = optParser.parse( args );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef()
          .setName( "classify" )
          .addSource( "input", inputTap )
          .addSink( "classify", classifyTap );

        if( options.hasArgument( "pmml" ) ) {
          String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
          PMMLPlanner pmmlPlanner = new PMMLPlanner()
            .setPMMLInput( new File( pmmlPath ) )
            .retainOnlyActiveIncomingFields()
            .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model
          flowDef.addAssemblyPlanner( pmmlPlanner );
        }

        // write a DOT file and run the flow
        Flow classifyFlow = flowConnector.connect( flowDef );
        classifyFlow.writeDOT( "dot/classify.dot" );
        classifyFlow.complete();
      }
  56. Pattern – score a model, using the pre-defined Cascading app
     (diagram: Customer Orders → Classify [PMML Model] → Scored Orders → GroupBy token → Count → Confusion Matrix, with Assert and Failure Traps; M/R)
     cascading.org/pattern
  57. Pattern – score a model, using the pre-defined Cascading app

      ## run an RF classifier at scale
      hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
        --pmml data/sample.rf.xml

      ## run an RF classifier at scale, assert regression test, measure confusion matrix
      hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
        --pmml data/sample.rf.xml --assert --measure out/measure

      ## run a predictive model at scale, measure RMSE
      hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
        --pmml data/iris.lm_p.xml --rmse out/measure
  58. PMML – standard
     • established XML standard for predictive model markup
     • organized by the Data Mining Group (DMG), since 1997: http://dmg.org/
     • members: IBM, SAS, Visa, NASA, Equifax, Microstrategy, Microsoft, etc.
     • PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows
     “PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”
     wikipedia.org/wiki/Predictive_Model_Markup_Language
  59. PMML – model coverage
     • Association Rules: AssociationModel element
     • Cluster Models: ClusteringModel element
     • Decision Trees: TreeModel element
     • Naïve Bayes Classifiers: NaiveBayesModel element
     • Neural Networks: NeuralNetwork element
     • Regression: RegressionModel and GeneralRegressionModel elements
     • Rulesets: RuleSetModel element
     • Sequences: SequenceModel element
     • Support Vector Machines: SupportVectorMachineModel element
     • Text Models: TextModel element
     • Time Series: TimeSeriesModel element
     ibm.com/developerworks/industry/library/ind-PMML2/
  60. PMML – vendor coverage (chart)
  61. roadmap – existing algorithms for scoring
     • Random Forest
     • Decision Trees
     • Linear Regression
     • GLM
     • Logistic Regression
     • K-Means Clustering
     • Hierarchical Clustering
     • Support Vector Machines
     • Multinomial
     Also, model chaining and general support for ensembles.
     cascading.org/pattern
  62. roadmap – top priorities for creating models at scale
     • Random Forest
     • Logistic Regression
     • K-Means Clustering
     A wealth of recent research indicates many opportunities to parallelize popular algorithms for training models at scale on Apache Hadoop…
     cascading.org/pattern
  63. roadmap – next priorities for scoring
     • Time Series (ARIMA forecast)
     • Association Rules (basket analysis)
     • Naïve Bayes
     • Neural Networks
     Algorithms extended based on customer use cases – contact groups.google.com/forum/?fromgroups#!forum/pattern-user
     cascading.org/pattern
  64. experiments – comparing models
     • much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
     • run multiple variants, then measure relative “lift”
     • Concurrent runtime – tag and track models
     The following example compares two models trained with different machine learning algorithms. This is exaggerated: one has an important variable intentionally omitted, to help illustrate the experiment.
  65. experiments – Random Forest model

      ## train a Random Forest model
      ## example: http://mkseo.pe.kr/stats/?p=220

      f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
      fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
      print(fit)
      saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

      OOB estimate of error rate: 14%
      Confusion matrix:
          0    1  class.error
      0  69   16    0.1882353
      1  12  103    0.1043478
  66. experiments – Logistic Regression model

      ## train a Logistic Regression model (special case of GLM)
      ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r

      f <- as.formula("as.factor(label) ~ var0 + var2")
      fit <- glm(f, family=binomial, data=data)
      print(summary(fit))
      saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

      Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
      (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
      var0         -1.3755     0.4355  -3.159  0.00159 **
      var2         -3.7742     0.5794  -6.514 7.30e-11 ***
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

     NB: this model has “var1” intentionally omitted.
  67. experiments – comparing results
     • use a confusion matrix to compare results for the classifiers
     • Logistic Regression has a lower “false negative” rate (5% vs. 11%); however, it has a much higher “false positive” rate (52% vs. 14%)
     • assign a cost model to select a winner (see the sketch below) – for example, in an ecommerce anti-fraud classifier:
       FN ∼ chargeback risk
       FP ∼ customer support costs
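     A minimal sketch of that cost model (the error rates are the ones quoted above; the per-error dollar costs are purely illustrative assumptions): weight each classifier’s false-negative and false-positive rates by cost, then pick the cheaper model:

      // pick a winner by expected cost per scored order (sketch, illustrative costs)
      double costChargeback = 50.0;  // assumed cost of a missed fraud case (FN)
      double costSupport    = 5.0;   // assumed cost of a wrongly flagged order (FP)

      double costRF = 0.11 * costChargeback + 0.14 * costSupport;  // Random Forest
      double costLR = 0.05 * costChargeback + 0.52 * costSupport;  // Logistic Regression

      System.out.println( costRF <= costLR ? "Random Forest wins" : "Logistic Regression wins" );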
  68. Design Patterns for Workflows, Across Departments
     (same flow diagram as slide 4)
  69. Anatomy of an Enterprise app
     Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
     (diagram: data sources → ETL → data prep → predictive model → end uses)
  70. Anatomy of an Enterprise app (same workflow as slide 69): ANSI SQL for ETL
  71. Anatomy of an Enterprise app (same workflow as slide 69): J2EE for business logic
  72. Anatomy of an Enterprise app (same workflow as slide 69): SAS for predictive models
  73. Anatomy of an Enterprise app (same workflow as slide 69): SAS for predictive models plus ANSI SQL for ETL account for most of the licensing costs…
  74. Anatomy of an Enterprise app (same workflow as slide 69): J2EE for business logic accounts for most of the project costs…
  75. Anatomy of an Enterprise app
     Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source:
     • Lingual: DW → ANSI SQL
     • Pattern: SAS, R, etc. → PMML
     • business logic in Java, Clojure, Scala, etc.
     • source taps for Cassandra, JDBC, Splunk, etc.
     • sink taps for Memcached, HBase, MongoDB, etc.
     A compiler sees it all…
     cascading.org
  76. a compiler sees it all… (same anatomy diagram as slide 75)
     For the ETL stage, Lingual embeds the SQL in the same flow:

      FlowDef flowDef = FlowDef.flowDef()
        .setName( "etl" )
        .addSource( "example.employee", emplTap )
        .addSource( "example.sales", salesTap )
        .addSink( "results", resultsTap );

      SQLPlanner sqlPlanner = new SQLPlanner()
        .setSql( sqlStatement );

      flowDef.addAssemblyPlanner( sqlPlanner );

     cascading.org
  77. a compiler sees it all… (same anatomy diagram as slide 75)
     For the model scoring stage, Pattern embeds the PMML in the same flow:

      FlowDef flowDef = FlowDef.flowDef()
        .setName( "classifier" )
        .addSource( "input", inputTap )
        .addSink( "classify", classifyTap );

      PMMLPlanner pmmlPlanner = new PMMLPlanner()
        .setPMMLInput( new File( pmmlModel ) )
        .retainOnlyActiveIncomingFields();

      flowDef.addAssemblyPlanner( pmmlPlanner );
  78. Anatomy of an Enterprise app (same anatomy diagram as slide 75)
     Visual collaboration for the business logic is a great way to improve how teams work together.
     (flow diagram: the employee bonus allocation workflow from slide 4)
     cascading.org
  79. Anatomy of an Enterprise app (same anatomy diagram as slide 75)
     Multiple departments, working in their respective frameworks, integrate results into a combined app, which runs at scale on a cluster… business process combined in a common space (DAG) for flow planners, compiler, optimization, troubleshooting, exception handling, notifications, security audit, performance monitoring, etc.
     (flow diagram: the employee bonus allocation workflow from slide 4)
     cascading.org
  80. references…
     “Enterprise Data Workflows with Cascading”
     O’Reilly, 2013
     amazon.com/dp/1449358721
     newsletter updates: liber118.com/pxn/
  81. drill-down…
     blog, developer community, code/wiki/gists, maven repo, commercial products, etc.:
     cascading.org
     zest.to/group11
     github.com/Cascading
     conjars.org
     goo.gl/KQtUL
     concurrentinc.com
