Cascading: Enterprise Data Workflows based on Functional Programming
 

Airbnb tech talk about Cascading and functional programming.


https://www.airbnb.com/meetups/3sdaratwe-tech-talk-paco-nathan

Usage Rights

CC Attribution-ShareAlike License

Cascading: Enterprise Data Workflows based on Functional Programming Presentation Transcript

  • 1. “Cascading: Enterprise Data Workflows based on Functional Programming” Paco Nathan, Concurrent, Inc., San Francisco, CA, @pacoid. Copyright © 2013, Concurrent, Inc.
  • 2. Cascading: Workflow Abstraction. Agenda: 1. Machine Data; 2. Cascading; 3. Sample Code; 4. A Little Theory…; 5. Workflows; 6. Lingual; 7. Pattern; 8. Open Data. (Flow diagram: Document Collection → Scrub/Tokenize → HashJoin Left against Stop Word List [RHS] → GroupBy token → Count → Word Count.)
  • 3. Q3 1997: inflection point. Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware. This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG. MapReduce and the Apache Hadoop open source stack emerged from this.
  • 4. Circa 1996: pre-inflection point. (Diagram: Stakeholder, Customers, Product; Excel pivot tables and PowerPoint slide decks for strategy; BI and Analysts running SQL queries against an RDBMS; requirements handed to Engineering for optimized code in a Web App; result sets, transactions.)
  • 5. Circa 1996: pre-inflection point, annotated: “Throw it over the wall.” (Same diagram as the previous slide.)
  • 6. Circa 2001: post- big ecommerce successes. (Diagram: Stakeholder, Product, Customers; UX and Engineering; dashboards, models, recommenders, classifiers; Algorithmic Modeling, Middleware, servlets, Web Apps; aggregation of event history and customer transactions; SQL query result sets; Logs, ETL, DW, RDBMS.)
  • 7. Circa 2001: post- big ecommerce successes, annotated: “Data products.” (Same diagram as the previous slide.)
  • 8. Circa 2013: clusters everywhere. (Diagram of roles and systems: Domain Expert, Data Scientist, App Dev, Ops; data products, workflows, dashboard metrics, data science, discovery, modeling, optimized capacity; Web Apps, Mobile, services; use cases across topologies: Hadoop, log events, in-memory data grid; batch and near time; Cluster Scheduler; introduced capability alongside existing SDLC, DW, RDBMS.)
  • 9. Circa 2013: clusters everywhere, annotated: “Optimizing topologies.” (Same diagram as the previous slide.)
  • 10. references… by Leo Breiman Statistical Modeling: The Two Cultures Statistical Science, 2001 bit.ly/eUTh9L 10
  • 11. references…
    Amazon: “Early Amazon: Splitting the website” – Greg Linden, glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
    eBay: “The eBay Architecture” – Randy Shoup, Dan Pritchett, addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html and addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
    Inktomi (YHOO Search): “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff), youtube.com/watch?v=E91oEn1bnXM
    Google: “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff), youtube.com/watch?v=qsan-GQaeyk and perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
  • 12. Cascading: Workflow Abstraction (agenda slide repeated from slide 2; section divider).
  • 13. Cascading – origins. API author Chris Wensel worked as a system architect at an Enterprise firm well-known for many popular data products. Wensel was following the Nutch open source project – where Hadoop started. Observation: it would be difficult to find Java developers to write complex Enterprise apps in MapReduce – a potential blocker for leveraging the new open source technology.
  • 14. Cascading – functional programming. Key insight: MapReduce is based on functional programming – back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007 as a new Java API to implement functional programming for large-scale data workflows: • leverages JVM and Java-based tools without any need to create new languages • allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters
  • 15. functional programming… in production. • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading, used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010), Scalding in Scala (2012). github.com/nathanmarz/cascalog/wiki github.com/twitter/scalding/wiki
  • 16. Cascading – definitions. • a pattern language for Enterprise Data Workflows • simple to build, easy to test, robust in production • design principles ⟹ ensure best practices at scale. (Reference architecture diagram: Customers, Web App, logs, Cache, Support; source/trap/sink taps; Data Workflow; Modeling, PMML; Analytics Cubes; Customer Prefs, customer profile DBs; Hadoop Cluster; Reporting.)
  • 17. Cascading – usage. • Java API, with DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL • ASL 2 license, GitHub src, http://conjars.org • 5+ yrs production use, multiple Enterprise verticals. (Same reference architecture diagram as the previous slide.)
  • 18. Cascading – integrations. • partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera • taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc. • serialization: Avro, Thrift, Kryo, JSON, etc. • topologies: Apache Hadoop, tuple spaces, local mode. (Same reference architecture diagram.)
  • 19. Cascading – deployments• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc.• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. 19
  • 20. Cascading – deployments. • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. Annotation: the workflow abstraction addresses: • staffing bottleneck; • system integration; • operational complexity; • test-driven development
  • 21. Cascading: Workflow Abstraction (agenda slide repeated; section divider).
  • 22. The Ubiquitous Word Count. Definition: count how often each word appears in a collection of text documents. This simple program provides an excellent test case for parallel processing, since it illustrates:
    • requires a minimal amount of code
    • demonstrates use of both symbolic and numeric values
    • shows a dependency graph of tuples as an abstraction
    • is not many steps away from useful search indexing
    • serves as a “Hello World” for Hadoop apps
    Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems. Pseudocode:

    void map (String doc_id, String text):
      for each word w in segment(text):
        emit(w, "1");

    void reduce (String word, Iterator group):
      int count = 0;
      for each pc in group:
        count += Int(pc);
      emit(word, String(count));
  • 23. word count – conceptual flow diagram. (Document Collection → Tokenize → GroupBy token → Count → Word Count: 1 map, 1 reduce, 18 lines of code.) cascading.org/category/impatient gist.github.com/3900702
  • 24. word count – Cascading app in Java:

    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex to split "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
  • 25. word count – generated flow diagram. (DOT output for the “wc” flow: Hfs[TextDelimited[doc_id, text]][data/rain.txt] → map: Each(token)[RegexSplitGenerator[decl:token]] → GroupBy(wc)[by:token] → reduce: Every(wc)[Count[decl:count]] → Hfs[TextDelimited[token, count]][output/wc].)
  • 26. word count – Cascalog / Clojure:

    (ns impatient.core
      (:use [cascalog.api]
            [cascalog.more-taps :only (hfs-delimited)])
      (:require [clojure.string :as s]
                [cascalog.ops :as c])
      (:gen-class))

    (defmapcatop split [line]
      "reads in a line of string and splits it by regex"
      (s/split line #"[[](),.)s]+"))

    (defn -main [in out & args]
      (?<- (hfs-delimited out)
           [?word ?count]
           ((hfs-delimited in :skip-header? true) _ ?line)
           (split ?line :> ?word)
           (c/count ?count)))

    ; Paul Lam; github.com/Quantisan/Impatient
  • 27. word count – Cascalog / Clojure. github.com/nathanmarz/cascalog/wiki • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL • composable subqueries, used for test-driven development (TDD) practices at scale • Leiningen build: simple, no surprises, in Clojure itself • more new deployments than other Cascading DSLs – Climate Corp is the largest use case: 90% Clojure/Cascalog • has a learning curve, limited number of Clojure developers • aggregators are the magic, and those take effort to learn
  • 28. word count – Scalding / Scala:

    import com.twitter.scalding._

    class WordCount(args : Args) extends Job(args) {
      Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
        .read
        .flatMap('text -> 'token) { text : String => text.split("[ [](),.]") }
        .groupBy('token) { _.size('count) }
        .write(Tsv(args("wc"), writeHeader = true))
    }
  • 29. word count – Scalding / Scala. github.com/twitter/scalding/wiki • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading • code is compact, easy to understand • nearly 1:1 between elements of the conceptual flow diagram and function calls • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc. • significant investments by Twitter, Etsy, eBay, etc. • great for data services at scale • less learning curve than Cascalog
  • 30. word count – Scalding / Scala (repeat of the previous slide), annotated: the Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping limit complexity in the process.
  • 31. Cascading: Workflow Abstraction (agenda slide repeated; section divider).
  • 32. workflow abstraction – pattern language. Cascading uses a “plumbing” metaphor in the Java API to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java. In formal terms, this provides a pattern language. (Flow diagram: Document Collection → Scrub/Tokenize → HashJoin Left against Stop Word List [RHS] → GroupBy token → Count → Word Count.) A minimal sketch of these elements in the Java API follows.
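The following sketch is not verbatim from the slides; it assumes the Cascading 2.x Java API, in the style of the “Impatient” tutorial, and shows the plumbing elements for the flow diagram above: a second source tap for a stop-word list, a HashJoin with a LeftJoin to drop stop words from the token stream, then the familiar GroupBy/Count. The stopPath variable, the “stop” field, and the filter-on-empty idiom are illustrative assumptions; docPipe and token are the pipe and field from the Word Count example on slide 24.

    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexFilter;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.HashJoin;
    import cascading.pipe.Pipe;
    import cascading.pipe.assembly.Retain;
    import cascading.pipe.joiner.LeftJoin;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    // a stop-word list as a second source tap (the RHS of the join)
    Fields stop = new Fields( "stop" );
    Tap stopTap = new Hfs( new TextDelimited( stop, true, "\t" ), stopPath );
    Pipe stopPipe = new Pipe( "stop" );

    // left-join tokens against stop words, then keep only the tokens which did not match
    Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );
    tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );
    tokenPipe = new Retain( tokenPipe, token );

    // the rest of the assembly is the same Word Count pattern as before
    Pipe wcPipe = new GroupBy( new Pipe( "wc", tokenPipe ), token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );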
  • 33. references… pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices amazon.com/dp/0195019199 design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by “Gang of Four” amazon.com/dp/0201633612 33
  • 34. workflow abstraction – pattern language (repeat of slide 32), annotated: design principles of the pattern language ensure best practices for robust, parallel data workflows at scale.
  • 35. workflow abstraction – literate programming. Cascading workflows generate their own visual documentation: flow diagrams. In formal terms, flow diagrams leverage a methodology called literate programming. This provides intuitive, visual representations for apps – great for cross-team collaboration. (Flow diagram: Document Collection → Scrub/Tokenize → HashJoin Left against Stop Word List [RHS] → GroupBy token → Count → Word Count.)
  • 36. references… by Don Knuth Literate Programming Univ of Chicago Press, 1992 literateprogramming.com/ “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.” 36
  • 37. workflow abstraction – business process. Following the essence of literate programming, Cascading workflows provide statements of business process. This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data). Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.). This is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.” By virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale.
  • 38. references… by Edgar Codd “A relational model of data for large shared data banks” Communications of the ACM, 1970 dl.acm.org/citation.cfm?id=362685 Rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on what apps do: the process of structuring data 38
  • 39. workflow abstraction – functional relational programming. The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL. Cascalog, in particular, implements more of what Codd intended for a “data sublanguage” and is considered to be close to a full implementation of the functional relational programming paradigm defined in: Moseley & Marks, 2006, “Out of the Tar Pit” goo.gl/SKspn
  • 40. workflow abstraction – functional relational programming (repeat of the previous slide), annotated: several theoretical aspects converge into software engineering practices which minimize the complexity of building and maintaining Enterprise data workflows.
  • 41. Cascading: Workflow Abstraction (agenda slide repeated; section divider).
  • 42. Enterprise Data Workflows. Let’s consider a “strawman” architecture for an example app… at the front end: LOB use cases drive demand for apps. (Reference architecture diagram: Customers, Web App, logs, Cache, Support, source/trap/sink taps, Data Workflow, Modeling/PMML, Analytics Cubes, Customer Prefs, customer profile DBs, Hadoop Cluster, Reporting.)
  • 43. Enterprise Data Workflows. Same example… in the back office: organizations have substantial investments in people, infrastructure, and process. (Same reference architecture diagram.)
  • 44. Enterprise Data Workflows. Same example… the heavy lifting! “Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out. (Same reference architecture diagram.)
  • 45. Cascading workflows – taps. • taps integrate other data frameworks, as tuple streams • these are “plumbing” endpoints in the pattern language • sources (inputs), sinks (outputs), traps (exceptions) • text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc. • data serialization: Avro, Thrift, Kryo, JSON, etc. • extend a new kind of tap in just a few lines of Java. Schema and provenance get derived from analysis of the taps. (Same reference architecture diagram.)
  • 46. Cascading workflows – taps. (Same Word Count code as slide 24, highlighting the source and sink taps for TSV data in HDFS: the two Hfs/TextDelimited taps bound to docPath and wcPath.)
  • 47. Cascading workflows – topologies. • topologies execute workflows on clusters • the flow planner is like a compiler for queries: Hadoop (MapReduce jobs); local mode (dev/test or special config); in-memory data grids (real-time) • the flow planner can be extended to support other topologies • blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG). (Same reference architecture diagram.) A minimal local-mode sketch follows.
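As a concrete illustration of swapping topologies, here is a minimal, hedged sketch (assuming the Cascading 2.x local-mode classes LocalFlowConnector, FileTap, and the local TextDelimited scheme) which runs the same Word Count pipe assembly from slide 24 in local mode for dev/test; only the taps and the connector change, while docPipe, wcPipe, docPath, and wcPath are the names from that example.

    import java.util.Properties;
    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.scheme.local.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;

    // local-mode taps against plain files, instead of Hfs taps against HDFS
    Tap docTap = new FileTap( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap  = new FileTap( new TextDelimited( true, "\t" ), wcPath );

    // same pipe assembly, different topology
    FlowDef flowDef = FlowDef.flowDef().setName( "wc-local" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // the local planner runs the flow in-process -- handy for tests
    Flow wcFlow = new LocalFlowConnector( new Properties() ).connect( flowDef );
    wcFlow.complete();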
  • 48. Cascading workflows – topologies. (Same Word Count code as slide 24, highlighting the flow planner for the Apache Hadoop topology: the HadoopFlowConnector which plans and connects the flow.)
  • 49. example topologies… 49
  • 50. Cascading workflows – test-driven development. • assert patterns (regex) on the tuple streams • adjust assert levels, like log4j levels • trap edge cases as “data exceptions” • TDD at scale: 1. start from raw inputs in the flow graph; 2. define stream assertions for each stage of transforms; 3. verify exceptions, code to remove them; 4. when the impl is complete, the app has full test coverage • redirect traps in production to Ops, QA, Support, Audit, etc. (Same reference architecture diagram.) A minimal assertion/trap sketch follows.
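A minimal sketch of the stream-assertion and trap ideas above, assuming the Cascading 2.x assertion API (AssertionLevel, AssertMatches, AssertSizeEquals) and the docPipe, docTap, wcTap, token, and trapPath names from the Word Count example; the regex and trap path are illustrative assumptions.

    import cascading.flow.FlowDef;
    import cascading.operation.AssertionLevel;
    import cascading.operation.aggregator.Count;
    import cascading.operation.assertion.AssertMatches;
    import cascading.operation.assertion.AssertSizeEquals;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    // assert that each tuple holds exactly one non-empty token;
    // the planner can strip assertions below a chosen level, much like log4j levels
    Pipe checkedPipe = new Each( docPipe, AssertionLevel.STRICT, new AssertSizeEquals( 1 ) );
    checkedPipe = new Each( checkedPipe, AssertionLevel.VALID, new AssertMatches( "^\\S+$" ) );

    // continue the Word Count assembly downstream of the assertions
    Pipe wcPipe = new GroupBy( new Pipe( "wc", checkedPipe ), token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // a trap tap captures tuples which cause operations to fail, instead of
    // failing the whole job -- review them in QA, Ops, Support, Audit, etc.
    Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

    FlowDef flowDef = FlowDef.flowDef().setName( "wc-tdd" )
      .addSource( docPipe, docTap )
      .addTrap( checkedPipe, trapTap )
      .addTailSink( wcPipe, wcTap );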
  • 51. Two Avenues to the App Layer… Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff. Start-ups: crave complexity and scale to become viable… new ventures move into the Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding. (Chart axes: complexity ➞, scale ➞.)
  • 52. Cascading: Workflow Abstraction (agenda slide repeated; section divider).
  • 53. Cascading workflows – ANSI SQL. • collab with Optiq – industry-proven code base • ANSI SQL parser/optimizer atop the Cascading flow planner • JDBC driver to integrate into existing tools and app servers • relational catalog over a collection of unstructured data • SQL shell prompt to run queries • enable analysts without retraining on Hadoop, etc. • transparency for Support, Ops, Finance, et al. A language for queries – not a database, but ANSI SQL as a DSL for workflows. (Same reference architecture diagram.)
  • 54. Lingual – CSV data in the local file system. cascading.org/lingual
  • 55. Lingual – shell prompt, catalog. cascading.org/lingual
  • 56. Lingual – queries. cascading.org/lingual
  • 57. abstraction layers in queries…

    abstraction    | RDBMS                                   | JVM Cluster
    parser         | ANSI SQL compliant parser               | ANSI SQL compliant parser
    optimizer      | logical plan, optimized based on stats  | logical plan, optimized based on stats
    planner        | physical plan                           | API “plumbing”
    machine data   | query history, table stats              | app history, tuple stats
    topology       | b-trees, etc.                           | heterogenous, distributed: Hadoop, in-memory, etc.
    visualization  | ERD                                     | flow diagram
    schema         | table schema                            | tuple schema
    catalog        | relational catalog                      | tap usage DB
    provenance     | (manual audit)                          | data set producers/consumers
  • 58. Lingual – JDBC driver:

    public void run() throws ClassNotFoundException, SQLException {
      Class.forName( "cascading.lingual.jdbc.Driver" );
      Connection connection = DriverManager.getConnection(
        "jdbc:lingual:local;schemas=src/main/resources/data/example" );
      Statement statement = connection.createStatement();

      ResultSet resultSet = statement.executeQuery(
        "select *\n"
        + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
        + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
        + "on e.\"EMPID\" = s.\"CUST_ID\"" );

      while( resultSet.next() ) {
        int n = resultSet.getMetaData().getColumnCount();
        StringBuilder builder = new StringBuilder();

        for( int i = 1; i <= n; i++ ) {
          builder.append(
            ( i > 1 ? "; " : "" )
            + resultSet.getMetaData().getColumnLabel( i )
            + "=" + resultSet.getObject( i ) );
        }
        System.out.println( builder );
      }

      resultSet.close();
      statement.close();
      connection.close();
    }
  • 59. Lingual – JDBC result set:

    $ gradle clean jar
    $ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar
    CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
    CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian

    Caveat: if you absolutely positively must have sub-second SQL query response for Pb-scale data on a 1000+ node cluster… good luck with that! (Call the MPP vendors.) This ANSI SQL library is primarily intended for batch workflows – high throughput, not low latency – for many under-represented use cases in Enterprise IT. In other words, SQL as a DSL. cascading.org/lingual
  • 60. Lingual – connecting Hadoop and R:

    # load the JDBC package
    library(RJDBC)

    # set up the driver
    drv <- JDBC("cascading.lingual.jdbc.Driver",
      "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

    # set up a database connection to a local repository
    connection <- dbConnect(drv,
      "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

    # query the repository: in this case the MySQL sample database (CSV files)
    df <- dbGetQuery(connection,
      "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
    head(df)

    # use R functions to summarize and visualize part of the data
    df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
    summary(df$hire_age)

    library(ggplot2)
    m <- ggplot(df, aes(x=hire_age))
    m <- m + ggtitle("Age at hire, people named Gina")
    m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
  • 61. Lingual – connecting Hadoop and R:

    > summary(df$hire_age)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      20.86   27.89   31.70   31.61   35.01   43.92

    cascading.org/lingual
  • 62. Cascading: Workflow Abstraction (agenda slide repeated; section divider).
  • 63. Pattern – model scoring. • migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML • great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc. • integrate with other libraries – Matrix API, etc. • leverage PMML as another kind of DSL. cascading.org/pattern (Same reference architecture diagram.)
  • 64. Pattern – create a model in R:

    ## train a RandomForest model
    f <- as.formula("as.factor(label) ~ .")
    fit <- randomForest(f, data_train, ntree=50)

    ## test the model on the holdout test set
    print(fit$importance)
    print(fit)

    predicted <- predict(fit, data)
    data$predicted <- predicted
    confuse <- table(pred = predicted, true = data[,1])
    print(confuse)

    ## export predicted labels to TSV
    write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
      quote=FALSE, sep="\t", row.names=FALSE)

    ## export RF model to PMML
    saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
  • 65. Pattern – capture model parameters as PMML <?xml version="1.0"?> <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">  <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">   <Extension name="user" value="ceteri" extender="Rattle/PMML"/>   <Application name="Rattle/PMML" version="1.2.30"/>   <Timestamp>2012-10-22 19:39:28</Timestamp>  </Header>  <DataDictionary numberOfFields="4">   <DataField name="label" optype="categorical" dataType="string">    <Value value="0"/>    <Value value="1"/>   </DataField>   <DataField name="var0" optype="continuous" dataType="double"/>   <DataField name="var1" optype="continuous" dataType="double"/>   <DataField name="var2" optype="continuous" dataType="double"/>  </DataDictionary>  <MiningModel modelName="randomForest_Model" functionName="classification">   <MiningSchema>    <MiningField name="label" usageType="predicted"/>    <MiningField name="var0" usageType="active"/>    <MiningField name="var1" usageType="active"/>    <MiningField name="var2" usageType="active"/>   </MiningSchema>   <Segmentation multipleModelMethod="majorityVote">    <Segment id="1">     <True/>     <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">      <MiningSchema>       <MiningField name="label" usageType="predicted"/>       <MiningField name="var0" usageType="active"/>       <MiningField name="var1" usageType="active"/>       <MiningField name="var2" usageType="active"/>      </MiningSchema> ... 65
  • 66. Pattern – score a model, within an app:

    public class Main {
      public static void main( String[] args ) {
        String pmmlPath = args[ 0 ];
        String ordersPath = args[ 1 ];
        String classifyPath = args[ 2 ];
        String trapPath = args[ 3 ];

        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

        // create source and sink taps
        Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
        Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
        Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

        // define a "Classifier" model from PMML to evaluate the orders
        ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
        Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
          .addSource( classifyPipe, ordersTap )
          .addTrap( classifyPipe, trapTap )
          .addSink( classifyPipe, classifyTap );

        // write a DOT file and run the flow
        Flow classifyFlow = flowConnector.connect( flowDef );
        classifyFlow.writeDOT( "dot/classify.dot" );
        classifyFlow.complete();
      }
    }
  • 67. Pattern – score a model, using a pre-defined Cascading app. (Flow diagram: Customer Orders → Classify [PMML Model] → Assert → GroupBy token → Count → Scored Orders; Failure Traps; Confusion Matrix.) cascading.org/pattern
  • 68. Pattern – score a model, using a pre-defined Cascading app:

    ## run an RF classifier at scale
    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
      --pmml data/sample.rf.xml

    ## run an RF classifier at scale, assert regression test, measure confusion matrix
    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
      --pmml data/sample.rf.xml --assert --measure out/measure

    ## run a predictive model at scale, measure RMSE
    hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
      --pmml data/iris.lm_p.xml --rmse out/measure
  • 69. PMML – model coverage. • Association Rules: AssociationModel element • Cluster Models: ClusteringModel element • Decision Trees: TreeModel element • Naïve Bayes Classifiers: NaiveBayesModel element • Neural Networks: NeuralNetwork element • Regression: RegressionModel and GeneralRegressionModel elements • Rulesets: RuleSetModel element • Sequences: SequenceModel element • Support Vector Machines: SupportVectorMachineModel element • Text Models: TextModel element • Time Series: TimeSeriesModel element. ibm.com/developerworks/industry/library/ind-PMML2/
  • 70. PMML – vendor coverage 70
  • 71. experiments – Random Forest model:

    ## train a Random Forest model
    ## example: http://mkseo.pe.kr/stats/?p=220
    f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
    fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
    print(fit)
    saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

    OOB estimate of error rate: 14%
    Confusion matrix:
          0    1  class.error
    0    69   16    0.1882353
    1    12  103    0.1043478
  • 72. experiments – Logistic Regression model:

    ## train a Logistic Regression model (special case of GLM)
    ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
    f <- as.formula("as.factor(label) ~ var0 + var2")
    fit <- glm(f, family=binomial, data=data)
    print(summary(fit))
    saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

    Coefficients:
                 Estimate  Std. Error  z value  Pr(>|z|)
    (Intercept)    1.8524      0.3803    4.871  1.11e-06 ***
    var0          -1.3755      0.4355   -3.159   0.00159 **
    var2          -3.7742      0.5794   -6.514  7.30e-11 ***
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    NB: this model has “var1” intentionally omitted
  • 73. experiments – evaluating results• use a confusion matrix to compare results for the classifiers• Logistic Regression has a lower “false negative” rate (5% vs. 11%) however it has a much higher “false positive” rate (52% vs. 14%)• assign a cost model to select a winner – for example, in an ecommerce anti-fraud classifier: FN ∼ chargeback risk FP ∼ customer support costs• can extend this to evaluate N models, M labels in an N × M × M matrix 73
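To make the cost-model idea concrete, here is a small illustrative sketch. The false-negative and false-positive rates are the ones quoted on the slide above; the per-error dollar costs are made-up assumptions, purely to show how a winner could be selected.

    // a minimal sketch of weighting classifier error rates by business costs
    public class CostModelSketch {

      // FN ~ chargeback risk, FP ~ customer support cost (per the anti-fraud example)
      static double expectedCost( double fnRate, double fpRate,
                                  double chargebackCost, double supportCost ) {
        return fnRate * chargebackCost + fpRate * supportCost;
      }

      public static void main( String[] args ) {
        double chargeback = 80.0;   // assumed cost of a missed fraud case
        double support = 5.0;       // assumed cost of a falsely flagged order

        // rates from the slide: LR = 5% FN / 52% FP, RF = 11% FN / 14% FP
        System.out.println( "Logistic Regression: " + expectedCost( 0.05, 0.52, chargeback, support ) );
        System.out.println( "Random Forest:       " + expectedCost( 0.11, 0.14, chargeback, support ) );
      }
    }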
  • 74. Cascading: Workflow Abstraction (agenda slide repeated; section divider).
  • 75. Palo Alto is quite a pleasant place. • temperate weather • lots of parks, enormous trees • great coffeehouses • walkable downtown • not particularly crowded. On a nice summer day, who wants to be stuck indoors on a phone call? Instead, take it outside – go for a walk. An example open source project: github.com/Cascading/CoPA/wiki
  • 76. 1. Open Data about municipal infrastructure (GIS data: trees, roads, parks) ✚ 2. Big Data about where people like to walk (smartphone GPS logs) ✚ 3. some curated metadata (which surfaces the value) → 4. personalized recommendations: “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sipping a latte or enjoying some fro-yo.” (Flow diagram: Document Collection → Scrub/Tokenize → HashJoin Left against Stop Word List [RHS] → GroupBy token → Count → Word Count.)
  • 77. discovery. The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates. This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good. paloalto.opendata.junar.com/dashboards/7576/geographic-information/
  • 78. discovery. GIS data about trees in Palo Alto:
  • 79. discoveryGeographic_Information,,,"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australisSource: davey tree Protected: Designated: Heritage: Appraised Value:Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point""Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: WilkieWay From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID:598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 YearConstructed: 1950 Traffic Count: 596 Traffic Index: residential local TrafficClass: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width:40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: 2.0 BaseType Pvmt: crusher run base Base Thickness: 6.0 Soil Class: 2 Soil Value: 15Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 DistrictNumber: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 BaseFailure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: SurfaceTreatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity:none Block Extent: 0 Longitude and Transverse Severity: none Longitude and TransverseExtent: 0Trench Severity: Ravelling Severity: none none Trench Extent: 0 (unstructured data…) Ravelling Extent: Rutting Severity: 0 Ridability Severity: none Rutting Extent: none 0Road Performance: UL (Urban Local) Bike Lane: 0 Bus Route: 0 Truck Route: 0Remediation: Deduct Value: 100 Priority: Pavement Condition: excellentStreet Cut Fee per SqFt: 10.00 Source Date: 6/10/2009 User Modified By: mnicolsIdentifier System: 21410 ","-122.1249640794,37.4155803115645,0.0-122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0-122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0-122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line" 79
  • 80. discovery:

    (defn parse-gis [line]
      "leverages parse-csv for complex CSV format in GIS export"
      (first (csv/parse-csv line)))

    (defn etl-gis [gis trap]
      "subquery to parse data sets from the GIS source tap"
      (<- [?blurb ?misc ?geo ?kind]
          (gis ?line)
          (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
          (:trap (hfs-textline trap))))

    (specify what you require, not how to achieve it… data prep costs are 80/20)
  • 81. discovery (ad-hoc queries get refined into composable predicates): Identifier: 474 Tree ID: 412 Tree: 412 site 1 at 115 HAWTHORNE AV Tree Site: 1 Street_Name: HAWTHORNE AV Situs Number: 115 Private: -1 Species: Liquidambar styraciflua Source: davey tree Hardscape: None 37.446001565119,-122.167713417554,0.0 Point
  • 82. discovery (curate valuable metadata).
  • 83. discovery:

    (defn get-trees [src trap tree_meta]
      "subquery to parse/filter the tree data"
      (<- [?blurb ?tree_id ?situs ?tree_site ?species ?wikipedia ?calflora
           ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash]
          (src ?blurb ?misc ?geo ?kind)
          (re-matches #"^\s+Private.*Tree ID.*" ?misc)
          (parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
          ((c/comp s/trim s/lower-case) ?raw_species :> ?species)
          (tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height)
          (avg ?min_height ?max_height :> ?avg_height)
          (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
          (read-string ?tree_lat :> ?lat)
          (read-string ?tree_lng :> ?lng)
          (geohash ?lat ?lng :> ?geohash)
          (:trap (hfs-textline trap))))

    Sample result:
    ?blurb        Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22
    ?tree_id      412
    ?situs        115
    ?tree_site    1
    ?species      liquidambar styraciflua
    ?wikipedia    http://en.wikipedia.org/wiki/Liquidambar_styraciflua
    ?calflora     http://calflora.org/cgi-bin/species_query.cgi?where-calre
    ?avg_height   27.5
    ?tree_lat     37.446001565119
    ?tree_lng     -122.167713417554
    ?tree_alt     0.0
    ?geohash      9q9jh0
  • 84. discovery// run analysis and visualization in Rlibrary(ggplot2)dat_folder <- ~/src/concur/CoPA/out/treedata <- read.table(file=paste(dat_folder, "part-00000", sep="/"), sep="t", quote="", na.strings="NULL", header=FALSE, encoding="UTF8") summary(data)t <- head(sort(table(data$V5), decreasing=TRUE)trees <- as.data.frame.table(t, n=20))colnames(trees) <- c("species", "count") m <- ggplot(data, aes(x=V8))m <- m + ggtitle("Estimated Tree Height (meters)")m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density() par(mar = c(7, 4, 4, 2) + 0.1)plot(trees, xaxt="n", xlab="")axis(1, labels=FALSE)text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1, labels=trees$species, xpd=TRUE)grid(nx=nrow(trees)) 84
  • 85. discoveryanalysis of the tree data: sweetgum 85
  • 86. discovery. (Flow diagram, gis tree: GIS export → Regex parse-gis → src → Scrub species → Regex parse-tree → Join with Tree Metadata → Estimate height → Geohash → tree; Failure Traps.)
  • 87. modeling. A geohash with 6-digit resolution approximates a 5-block square centered at lat: 37.445, lng: -122.162, giving the cell 9q9jh0. A small encoder sketch follows.
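For reference, here is a small, self-contained sketch of standard geohash encoding (interleaving longitude and latitude bits into a base-32 string). It is not code from the CoPA project, but encoding the slide’s coordinates at 6-character precision reproduces the 9q9jh0 cell.

    // minimal geohash encoder sketch, assuming the standard base-32 alphabet
    // and the convention that even bit positions encode longitude
    public class GeohashSketch {
      private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

      public static String encode( double lat, double lng, int precision ) {
        double[] latRange = { -90.0, 90.0 };
        double[] lngRange = { -180.0, 180.0 };
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true;   // even bits refine longitude, odd bits refine latitude
        int bit = 0, ch = 0;

        while( hash.length() < precision ) {
          double[] range = evenBit ? lngRange : latRange;
          double value = evenBit ? lng : lat;
          double mid = ( range[0] + range[1] ) / 2.0;

          ch <<= 1;
          if( value >= mid ) { ch |= 1; range[0] = mid; } else { range[1] = mid; }
          evenBit = !evenBit;

          if( ++bit == 5 ) {        // every 5 bits become one base-32 character
            hash.append( BASE32.charAt( ch ) );
            bit = 0;
            ch = 0;
          }
        }
        return hash.toString();
      }

      public static void main( String[] args ) {
        // the coordinates from the slide; prints "9q9jh0"
        System.out.println( encode( 37.445, -122.162, 6 ) );
      }
    }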
  • 88. modeling. Each road in the GIS export is listed as a block between two cross roads, and each may have multiple road segments to represent turns:
    " -122.161776959558,37.4518836690781,0.0
    " -122.161390381489,37.4516410983794,0.0
    " -122.160786011735,37.4512589903357,0.0
    " -122.160531178368,37.4510977281699,0.0
    (Diagram: points labeled ( lat0, lng0, alt0 ) … ( lat3, lng3, alt3 ) along the segment.)
    NB: segments in the raw GIS have the order of geo coordinates scrambled: (lng, lat, alt)
  • 89. modeling. Filter trees which are too far away to provide shade. Calculate a sum of moments for tree height × distance, as an estimator for shade. (Diagram: road segment in geohash cell 9q9jh0 with nearby trees marked X.)
  • 90. modeling(defn get-shade [trees roads] "subquery to join tree and road estimates, maximize for shade" (<- [?road_name ?geohash ?road_lat ?road_lng ?road_alt ?road_metric ?tree_metric] (roads ?road_name _ _ _ ?albedo ?road_lat ?road_lng ?road_alt ?geohash ?traffic_count _ ?traffic_class _ _ _ _) (road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric) (trees _ _ _ _ _ _ _ ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash) (read-string ?avg_height :> ?height) ;; limit to trees which are higher than people (> ?height 2.0) (tree-distance ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance) ;; limit to trees within a one-block radius (not meters) (<= ?distance 25.0) (/ ?height ?distance :> ?tree_moment) (c/sum ?tree_moment :> ?sum_tree_moment) ;; magic number 200000.0 used to scale tree moment ;; based on median (/ ?sum_tree_moment 200000.0 :> ?tree_metric) )) 90
  • 91. modeling. (Flow diagram, shade: Filter tree height → Join → Calculate distance → Filter distance → Sum moment → Filter sum_moment → Estimate road shade; road traffic.)
  • 92. modeling(defn get-gps [gps_logs trap] "subquery to aggregate and rank GPS tracks per user" (<- [?uuid ?geohash ?gps_count ?recent_visit] (gps_logs ?date ?uuid ?gps_lat ?gps_lng ?alt ?speed ?heading ?elapsed ?distance) (read-string ?gps_lat :> ?lat) (read-string ?gps_lng :> ?lng) (geohash ?lat ?lng :> ?geohash) (c/count :> ?gps_count) (date-num ?date :> ?visit) (c/max ?visit :> ?recent_visit) ))?uuid ?geohash ?gps_count ?recent_visitcf660e041e994929b37cc5645209c8ae 9q8yym 7 1972376866448342ac6fd3f5f44c6b97724d618d587cf 9q9htz 4 197237669096932cc09e69bc042f1ad22fc16ee275e21 9q9hv3 3 1972376670935342ac6fd3f5f44c6b97724d618d587cf 9q9hv3 3 1972376691356342ac6fd3f5f44c6b97724d618d587cf 9q9hwn 13 1972376690782342ac6fd3f5f44c6b97724d618d587cf 9q9hwp 58 1972376690965482dc171ef0342b79134d77de0f31c4f 9q9jh0 15 1972376952532b1b4d653f5d9468a8dd18a77edcc5143 9q9jh0 18 1972376945348 92
  • 93. modeling:

    (defn get-reco [tracks shades]
      "subquery to recommend road segments based on GPS tracks"
      (<- [?uuid ?road ?geohash ?lat ?lng ?alt ?gps_count ?recent_visit ?road_metric ?tree_metric]
          (tracks ?uuid ?geohash ?gps_count ?recent_visit)
          (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)))

    Recommenders often combine multiple signals, via weighted averages, to rank personalized results:
    • GPS of person ∩ road segment
    • frequency and recency of visit
    • traffic class and rate
    • road albedo (sunlight reflection)
    • tree shade estimator
    Adjusting the mix allows for further personalization for the end user.
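A tiny sketch of the weighted-average ranking described above; the weights, signal values, and method name are illustrative assumptions rather than CoPA code.

    // combine the signals listed on the slide into a single ranking score
    public class RecoScoreSketch {

      static double score( double visitFrequency, double recency,
                           double roadMetric, double treeMetric, double[] w ) {
        // weighted average of: GPS visit frequency, recency of visit,
        // road metric (traffic class/rate, albedo), and the tree-shade estimator
        return w[0] * visitFrequency + w[1] * recency + w[2] * roadMetric + w[3] * treeMetric;
      }

      public static void main( String[] args ) {
        double[] weights = { 0.2, 0.1, 0.3, 0.4 };   // adjust the mix to personalize results
        // example values: 15 visits, recency 0.8, road metric 0.5, shade metric 4.363
        System.out.println( score( 15, 0.8, 0.5, 4.363, weights ) );
      }
    }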
  • 94. apps‣ addr: 115 HAWTHORNE AVE‣ lat/lng: 37.446, -122.168‣ geohash: 9q9jh0‣ tree: 413 site 2‣ species: Liquidambar styraciflua‣ est. height: 23 m‣ shade metric: 4.363‣ traffic: local residential, light traffic‣ recent visit: 1972376952532‣ a short walk from my train stop ✔ 94
  • 95. references… Enterprise Data Workflows with Cascading O’Reilly, 2013 amazon.com/dp/1449358721 95
  • 96. drill-down… blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities: cascading.org zest.to/group11 github.com/Cascading conjars.org goo.gl/KQtUL concurrentinc.com Copyright @2013, Concurrent, Inc. 96