Chicago Hadoop Users Group: Enterprise Data Workflows

Presentation for CHUG: http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/events/95464182/

Transcript of "Chicago Hadoop Users Group: Enterprise Data Workflows"

  1. “Enterprise Data Workflows with Cascading”
     Paco Nathan, Concurrent, Inc.
     San Francisco, CA, @pacoid
     zest.to/event63_77, 2013-02-12
     Copyright @2013, Concurrent, Inc.

     Speaker notes: You may not have heard about us much, but you use our API in lots of places: your bank, your airline, your hospital, your mobile device, your social network, etc.
  2. Unstructured Data meets Enterprise Scale
     • an example considered
     • system integration: tearing down silos
     • code samples
     • data science perspectives: how we got here
     • the workflow abstraction: many aspects of an app
     • developer, analyst, scientist
     • summary, references

     Speaker notes: Background: I’m a data scientist, an engineering director, spent the past decade building/leading Data teams which created large-scale apps. This talk is about using Cascading and related DSLs to build Enterprise Data Workflows. Our emphasis is on leveraging the workflow abstraction for system integration, for mitigating complexity, and for producing simple, robust apps at scale. We’ll show a little something for the developers, the analysts, and the scientists in the room.
  3. Enterprise Data Workflows: an example considered
     [flow diagram: Document Collection → Tokenize / Scrub token (Regex, map) → HashJoin (Left) with Stop Word List on the RHS → GroupBy token → Count (reduce) → Word Count]

     Speaker notes: Let’s consider the matter of handling Big Data from the perspective of building and maintaining Enterprise apps…
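     The deck only shows code for the basic Word Count, so as a rough orientation, the scrub/stop-word stage pictured above might look like the Cascading 2.x fragment below. This is not code from the slides: it assumes the docPipe token stream from the Word Count app shown later in the deck, and the stop-word tap path and the RegexFilter cleanup step are illustrative assumptions.

     // assumes docPipe already emits a "token" field, as in the Word Count app later in the deck
     Fields token = new Fields( "token" );
     Fields stop = new Fields( "stop" );

     Tap stopTap = new Hfs( new TextDelimited( stop, true, "\t" ), "data/stop.txt" );  // placeholder path
     Pipe stopPipe = new Pipe( "stop" );

     // left join the token stream against the stop-word list;
     // tokens with no match come through with an empty "stop" field
     Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );

     // keep only the tokens that did NOT match a stop word, then drop the join field
     tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );
     tokenPipe = new Retain( tokenPipe, token );

     // count as in the basic Word Count
     Pipe wcPipe = new GroupBy( "wc", tokenPipe, token );
     wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

     // remember to register the extra source: flowDef.addSource( stopPipe, stopTap )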
  4. Enterprise Data Workflows: an example…
     [architecture diagram: Customers and a Web App emit logs (cache, Logs) and Support data; source, trap, and sink taps connect them to a workflow on a Hadoop cluster (Data Modeling, PMML), which feeds Analytics Cubes, Customer Prefs, customer profile DBs, and Reporting]

     Speaker notes: Apache Hadoop rarely ever gets used in isolation.

  5. Enterprise Data Workflows: an example… the front end
     [same architecture diagram as in slide 4]

     Speaker notes: LOB use cases drive the demand for Big Data apps.

  6. Enterprise Data Workflows: an example… the back office
     [same architecture diagram as in slide 4]

     Speaker notes: Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes.

  7. Enterprise Data Workflows: an example… the heavy lifting!
     [same architecture diagram as in slide 4]

     Speaker notes: “Main Street” firms have invested in Hadoop to address Big Data needs, off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.

  8. Enterprise Data Workflows: system integration: tearing down silos
     [word count flow diagram, as in slide 3]

     Speaker notes: The process of building Enterprise apps is largely about system integration and business process, meeting in the middle.
  9. Cascading – definitions
     • a pattern language for Enterprise Data Workflows
     • simple to build, easy to test, robust in production
     • design principles ⟹ ensure best practices at scale
     [architecture diagram, as in slide 4]

     Speaker notes: A pattern language ensures that best practices are followed by an implementation. In this case, parallelization of deterministic query plans for reliable, Enterprise-scale workflows on Hadoop, etc.

  10. Cascading – usage
     • Java API, Scala DSL Scalding, Clojure DSL Cascalog
     • ASL 2 license, GitHub src, http://conjars.org
     • 5+ yrs production use, multiple Enterprise verticals
     [architecture diagram, as in slide 4]

     Speaker notes: More than a 5-year history of large-scale Enterprise deployments. DSLs in Scala, Clojure, Jython, JRuby, Groovy, etc. Maven repo for third-party contribs.
  11. quotes…
     “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.”
     CIO, Thor Olavsrud, 2012-06-06
     cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading

     “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.”
     2012 BOSSIE Awards, James Borck, 2012-09-18
     infoworld.com/slideshow/65089

     Speaker notes: Industry analysts are picking up on the staffing costs related to Hadoop; “no free lunch”.
  12. Cascading – deployments
     • case studies: Twitter, Etsy, Climate Corp, Nokia, Factual, Williams-Sonoma, uSwitch, Airbnb, Square, Harvard, etc.
     • partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera
     • OSS frameworks built atop Cascading by: Twitter, Etsy, eBay, Climate Corp, uSwitch, YieldBot, etc.
     • use cases: ETL, anti-fraud, advertising, recommenders, retail pricing, eCRM, marketing funnel, search analytics, genomics, climatology, etc.
     [architecture diagram, as in slide 4]

     Speaker notes: Several published case studies about Cascading, Scalding, Cascalog, etc. Wide range of use cases. Significant investment by Twitter, Etsy, and other firms in OSS based on Cascading. Partnerships with all Hadoop vendors.

  13. case studies…
     • Upstream (Williams-Sonoma, Neiman Marcus):
       concurrentinc.com/case-studies/upstream/
       upstreamsoftware.com/blog/bid/86333/
     • Twitter (revenue team, publisher analytics):
       concurrentinc.com/case-studies/twitter/
       github.com/twitter/scalding/wiki
     • Airbnb (infrastructure team):
       concurrentinc.com/case-studies/airbnb/
       gigaom.com/data/meet-the-combo-behind-etsy-airbnb-and-climate-corp-hadoop-jobs/

     Speaker notes: Several customers using Cascading / Scalding / Cascalog have published case studies. Here are a few.
  14. Cascading – taps
     • taps integrate other data frameworks, as tuple streams
     • these are “plumbing” endpoints in the pattern language
     • sources (inputs), sinks (outputs), traps (exceptions)
     • where schema and provenance get determined
     • text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
     • data serialization: Avro, Thrift, Kryo, JSON, etc.
     • extend in ~4 lines of Java
     [architecture diagram, as in slide 4]

     Speaker notes: Speaking of system integration, taps provide the simplest approach for integrating different frameworks.
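     As a concrete sketch of the tap vocabulary, reusing the docPipe/wcPipe assembly from the Word Count app later in the deck (the paths below are placeholders, not values from the slides):

     // source, sink, and trap taps over HDFS, using a tab-delimited text scheme
     Tap docTap  = new Hfs( new TextDelimited( true, "\t" ), "data/docs.tsv" );   // source (input)
     Tap wcTap   = new Hfs( new TextDelimited( true, "\t" ), "output/wc" );       // sink (output)
     Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), "output/trap" );     // trap ("data exceptions")

     FlowDef flowDef = FlowDef.flowDef()
       .setName( "wc" )
       .addSource( docPipe, docTap )      // bind the input tuple stream
       .addTailSink( wcPipe, wcTap )      // bind the output tuple stream
       .addTrap( docPipe, trapTap );      // tuples that fail an operation on docPipe divert here

     Swapping the Hfs/TextDelimited pair for a JDBC, HBase, or other tap changes the integration point without touching the pipe assembly.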
  15. Cascading – topologies
     • topologies execute workflows on clusters
     • flow planner is much like a compiler for queries
     • abstraction layers reduce training costs
     • Hadoop (MapReduce jobs)
     • local mode (dev/test or special config)
     • in-memory data grids (real-time)
     • flow planner can be extended to support other topologies
     • blend flows from different topologies into one app
     [architecture diagram, as in slide 4]

     Speaker notes: Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
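     In practice the topology is chosen by the flow connector that plans the flow. A minimal sketch, assuming the Cascading 2.x local-mode connector and a hypothetical runLocally flag; note that local mode also expects local taps and schemes (e.g. FileTap) rather than Hfs:

     Properties properties = new Properties();
     AppProps.setApplicationJarClass( properties, Main.class );

     // same flow definition, different topology: only the connector changes
     FlowConnector connector = runLocally
       ? new LocalFlowConnector( properties )      // single JVM, for dev/test
       : new HadoopFlowConnector( properties );    // plans MapReduce jobs for the cluster

     Flow flow = connector.connect( flowDef );
     flow.complete();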
  16. example topologies…

     Speaker notes: Here are some examples of topologies for distributed computing: Apache Hadoop being the first supported by Cascading, followed by local mode, and now a tuple space (IMDG) flow planner in the works. Several other widely used platforms would also be likely suspects for Cascading flow planners.
  17. Cascading – ANSI SQL
     • ANSI SQL parser/optimizer atop the Cascading flow planner
     • JDBC driver to integrate into existing tools and app servers
     • surface a relational catalog over a collection of unstructured data
     • launch a SQL shell prompt to run queries
     • enable the analysts without retraining on Hadoop, etc.
     • transparency for Support, Ops, Finance, et al.
     • combine SQL flows with Scalding, Cascalog, etc.
     • based on collab with Optiq – industry-proven code base
     • keep the DBAs happy, and go home a hero!
     [architecture diagram, as in slide 4]

     Speaker notes: Quite a number of projects have started out with Hadoop, then grafted a SQL-like syntax onto it, somewhere. We started out with a query planner used in Enterprise, then partnered with Optiq, the team behind an Enterprise-proven code base for an ANSI SQL parser/optimizer. In the sense that Splunk handles “machine data”, this SQL implementation provides “machine code” as the lingua franca of Enterprise system integration.
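     Since the integration point is a standard JDBC driver, existing tools and app servers connect in the usual way. A hedged sketch using plain java.sql follows; the driver class name, JDBC URL, and table name are assumptions for illustration, not values given in the deck:

     import java.sql.Connection;
     import java.sql.DriverManager;
     import java.sql.ResultSet;
     import java.sql.Statement;

     public class SqlOverWorkflows {
       public static void main( String[] args ) throws Exception {
         // assumed driver class and URL format; consult the project docs for the real values
         Class.forName( "cascading.lingual.jdbc.Driver" );
         Connection conn = DriverManager.getConnection( "jdbc:lingual:hadoop" );

         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
           "SELECT token, COUNT(*) AS cnt FROM example.wc GROUP BY token" );

         while( rs.next() )
           System.out.println( rs.getString( "token" ) + "\t" + rs.getLong( "cnt" ) );

         conn.close();
       }
     }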
  18. how to query…

     abstraction    | RDBMS                                   | JVM Cluster
     parser         | ANSI SQL compliant parser               | ANSI SQL compliant parser
     optimizer      | logical plan, optimized based on stats  | logical plan, optimized based on stats
     planner        | physical plan                           | API “plumbing”
     machine data   | query history, table stats              | app history, tuple stats
     topology       | b-trees, etc.                           | heterogeneous, distributed: Hadoop, in-memory, etc.
     visualization  | ERD                                     | flow diagram
     schema         | table schema                            | tuple schema
     catalog        | relational catalog                      | tap usage DB
     provenance     | (manual audit)                          | data set producers/consumers

     Speaker notes: When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters.
  19. Cascading – machine learning
     • export predictive models as PMML
     • Cascading compiles to JVM classes for parallelization
     • migrate workloads: SAS, Microstrategy, Teradata, etc.
     • great OSS tools: R, Weka, KNIME, RapidMiner, etc.
     • run multiple models in parallel as customer experiments
     • Random Forest, Logistic Regression, GLM, Assoc Rules, Decision Trees, K-Means, Hierarchical Clustering, etc.
     • 2 lines of code required for PMML integration
     • integrate with other libraries: Matrix API, Algebird, etc.
     • combine with other flows into one app: Java for ETL, Scala for data services, SQL for reporting, etc.
     [architecture diagram, as in slide 4]

     Speaker notes: PMML has been around for a while, and export is supported by virtually every analytics platform, covering a wide variety of predictive modeling algorithms. Cascading reads PMML, building out workflows under the hood which run efficiently in parallel. Much cheaper than buying a SAS license for your 2000-node Hadoop cluster ;) Five companies are collaborating on this open source project: https://github.com/Cascading/cascading.pattern
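     For a sense of what the “2 lines of code” integration can look like, here is a hedged sketch based on the cascading.pattern project linked in the speaker notes; the PMMLPlanner class, the addAssemblyPlanner hook, the tap names, and the model path are assumptions about that project’s API rather than code from the slides:

     FlowDef flowDef = FlowDef.flowDef()
       .setName( "classify" )
       .addSource( "input", inputTap )        // scoring data
       .addSink( "classify", resultsTap );    // predictions

     // the "2 lines": point the planner at an exported PMML model,
     // and it expands the model into Cascading pipes under the hood
     PMMLPlanner pmmlPlanner = new PMMLPlanner()             // assumed: cascading.pattern.pmml.PMMLPlanner
       .setPMMLInput( new File( "model.pmml" ) )             // e.g. a model exported from R, KNIME, RapidMiner
       .retainOnlyActiveIncomingFields();
     flowDef.addAssemblyPlanner( pmmlPlanner );              // assumed FlowDef hook used by the Pattern project

     flowConnector.connect( flowDef ).complete();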
  20. PMML support…

     Speaker notes: Here are just a few of the tools that people use to create predictive models for export as PMML.
  21. Cascading – test-driven development
     • assert patterns (regex) on the tuple streams
     • trap edge cases as “data exceptions”
     • adjust assert levels, like log4j levels
     • TDD at scale:
       1. start from raw inputs in the flow graph
       2. define stream assertions for each stage of transforms
       3. verify exceptions, code to eliminate them
       4. rinse, lather, repeat…
       5. when impl is complete, app has full test coverage
     • TDD follows from Cascalog’s composable subqueries
     • redirect traps in production to Ops, QA, Support, Audit, etc.
     [architecture diagram, as in slide 4]

     Speaker notes: TDD is not usually high on the list when people start discussing Big Data apps. Chris Wensel introduced into Cascading the notion of a “data exception”, and how to set stream assertion levels as part of the business logic of an application. Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., arguably uses TDD as its methodology, in the transition from ad-hoc queries as logic predicates, then composing those predicates into large-scale apps.
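     A small sketch of a stream assertion on the Word Count token stream; the Each overload, AssertionLevel, and AssertMatches usage reflect a reading of the Cascading 2.x API and should be double-checked, and the regex itself is only an example:

     // assert that every token is non-empty and contains no whitespace;
     // the assertion level works like a log4j level, so the planner can
     // strip assertions below a chosen threshold for production runs
     Fields token = new Fields( "token" );
     docPipe = new Each( docPipe, token, AssertionLevel.STRICT,
                         new AssertMatches( "^\\S+$" ) );

     // edge cases that fail an operation can be trapped as "data exceptions"
     // and redirected to Ops/QA, rather than failing the whole job
     flowDef.addTrap( docPipe, trapTap );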
  22. Cascading – API design principles
     • specify what is required, not how it must be achieved
     • provide the “glue” for system integration
     • no surprises
     • same JAR, any scale
     • plan far ahead (before consuming cluster resources)
     • fail the same way twice
     Closely related to the “functional relational programming” paradigm from Moseley & Marks 2006: http://goo.gl/SKspn

     Speaker notes: Overview of the design principles embodied by Cascading as a pattern language… Some aspects (Cascalog in particular) are closely related to “FRP” from Moseley/Marks 2006.

  23. Enterprise Data Workflows: code samples: Word Count
     [word count flow diagram, as in slide 3]

     Speaker notes: Let’s make this real, show some code…
  24. the ubiquitous word count
     definition: count how often each word appears in a collection of text documents

     this simple program provides an excellent test case for parallel processing, since it:
     ‣ requires a minimal amount of code
     ‣ demonstrates use of both symbolic and numeric values
     ‣ shows a dependency graph of tuples as an abstraction
     ‣ is not many steps away from useful search indexing
     ‣ serves as a “Hello World” for Hadoop apps

     any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems
  25. word count – pseudocode

     void map (String doc_id, String text):
       for each word w in segment(text):
         emit(w, "1");

     void reduce (String word, Iterator partial_counts):
       int count = 0;
       for each pc in partial_counts:
         count += Int(pc);
       emit(word, String(count));

  26. word count – flow diagram
     [flow diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
     cascading.org/category/impatient
     gist.github.com/3900702
     1 map, 1 reduce, ~18 lines of code
  27. word count – Cascading app
     [flow diagram, as in slide 26]

     String docPath = args[ 0 ];
     String wcPath = args[ 1 ];
     Properties properties = new Properties();
     AppProps.setApplicationJarClass( properties, Main.class );
     HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

     // create source and sink taps
     Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
     Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

     // specify a regex to split "document" text lines into a token stream
     Fields token = new Fields( "token" );
     Fields text = new Fields( "text" );
     RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
     // only returns "token"
     Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

     // determine the word counts
     Pipe wcPipe = new Pipe( "wc", docPipe );
     wcPipe = new GroupBy( wcPipe, token );
     wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

     // connect the taps, pipes, etc., into a flow
     FlowDef flowDef = FlowDef.flowDef()
       .setName( "wc" )
       .addSource( docPipe, docTap )
       .addTailSink( wcPipe, wcTap );

     // write a DOT file and run the flow
     Flow wcFlow = flowConnector.connect( flowDef );
     wcFlow.writeDOT( "dot/wc.dot" );
     wcFlow.complete();
  28. word count – flow plan
     [flow diagram, as in slide 26]

     [head]
     Hfs[TextDelimited[[doc_id, text]->[ALL]]][data/rain.txt]
       [{2}:doc_id, text]  [{2}:doc_id, text]
     map
       Each(token)[RegexSplitGenerator[decl:token][args:1]]
       [{1}:token]  [{1}:token]
     GroupBy(wc)[by:[token]]
       wc[{1}:token]  [{1}:token]
     reduce
       Every(wc)[Count[decl:count]]
       [{2}:token, count]  [{1}:token]
     Hfs[TextDelimited[[UNKNOWN]->[token, count]]][output/wc]
       [{2}:token, count]  [{2}:token, count]
     [tail]
  29. word count – Scalding / Scala
     [flow diagram, as in slide 26]

     import com.twitter.scalding._

     class WordCount(args : Args) extends Job(args) {
       Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
         .read
         .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
         .groupBy('token) { _.size('count) }
         .write(Tsv(args("wc"), writeHeader = true))
     }
  30. word count – Scalding / Scala
     github.com/twitter/scalding/wiki
     ‣ extends the Scala collections API, distributed lists become “pipes” backed by Cascading
     ‣ code is compact, easy to understand – very close to the conceptual flow diagram
     ‣ functional programming is great for expressing complex workflows in MapReduce, etc.
     ‣ large-scale, complex problems can be handled in just a few lines of code
     ‣ significant investments by Twitter, Etsy, eBay, etc., in this open source project
     ‣ extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., “Matrix API”
     ‣ several large-scale apps in production deployments
     ‣ IMHO, especially great for data services at scale

     Speaker notes: Using a functional programming language to build flows works even better than trying to represent functional programming constructs within Java…
  31. word count – Cascalog / Clojure
     [flow diagram, as in slide 26]

     (ns impatient.core
       (:use [cascalog.api]
             [cascalog.more-taps :only (hfs-delimited)])
       (:require [clojure.string :as s]
                 [cascalog.ops :as c])
       (:gen-class))

     (defmapcatop split [line]
       "reads in a line of string and splits it by regex"
       (s/split line #"[\[\](),.)\s]+"))

     (defn -main [in out & args]
       (?<- (hfs-delimited out)
            [?word ?count]
            ((hfs-delimited in :skip-header? true) _ ?line)
            (split ?line :> ?word)
            (c/count ?count)))

     ; Paul Lam
     ; github.com/Quantisan/Impatient
  32. word count – Cascalog / Clojure
     github.com/nathanmarz/cascalog/wiki
     ‣ implements Datalog in Clojure, with predicates backed by Cascading
     ‣ a truly declarative language – whereas Scalding lacks that aspect of functional programming
     ‣ run ad-hoc queries from the Clojure REPL, approx. 10:1 code reduction compared with SQL
     ‣ composable subqueries, for test-driven development (TDD) at scale
     ‣ fault-tolerant workflows which are simple to follow
     ‣ same framework used from discovery through to production apps
     ‣ FRP mitigates the s/w engineering costs of Accidental Complexity
     ‣ focus on the process of structuring data; not un/structured
     ‣ Leiningen build: simple, no surprises, in Clojure itself
     ‣ has a learning curve, limited number of Clojure developers
     ‣ aggregators are the magic, those take effort to learn
  33. Enterprise Data Workflows: data science perspectives: how we got here
     [word count flow diagram, as in slide 3]

     Speaker notes: Let’s examine an evolution of Data Science practice, subsequent to the 1997 Q3 inflection point which enabled huge ecommerce successes and commercialized Big Data.
  34. circa 1996: pre- inflection point
     [diagram: Stakeholder, Customers, Product, Engineering; BI Analysts, SQL queries, result sets; Web App, transactions, RDBMS; Excel pivot tables, PowerPoint slide decks, strategy, requirements, optimized code]

     Speaker notes: Ah, teh olde days: Perl and C++ for CGI :) Feedback loops shown in red represent data innovations at the time… Characterized by slow, manual processes: data modeling / business intelligence; “throw it over the wall”… this thinking led to impossible silos.

  35. circa 2001: post- big ecommerce successes
     [diagram: Stakeholder, Product, Customers; dashboards, UX, Engineering; models, servlets, recommenders + classifiers, Algorithmic Modeling, Web Apps, Middleware, aggregation, event history, SQL queries, result sets, customer transactions, Logs, DW, ETL, RDBMS]

     Speaker notes: Q3 1997: Greg Linden @ Amazon, Randy Shoup @ eBay -- independent teams arrived at the same conclusion: parallelize workloads onto clusters of commodity servers (Intel/Linux) to scale out horizontally. Google and Inktomi (YHOO Search) were working along the same lines. MapReduce grew directly out of this effort. LinkedIn, Facebook, Twitter, Apple, etc., follow. Algorithmic modeling, which leveraged machine data, allowed for Big Data to become monetized. REALLY monetized :) Leo Breiman wrote an excellent paper in 2001, “Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization). MapReduce came from work in 2002. Google is now three generations beyond that -- while the Global 1000 struggles to rationalize Hadoop practices. Google gets upset when people try to “open the kimono”; however, Twitter is in SF where that’s a national pastime :) To get an idea of what powers Google internally, check the open source projects: Scalding, Matrix API, Algebird, etc.

  36. circa 2013: clusters everywhere
     [diagram: Data Products, Customers, business process, Domain Expert, Prod, Workflow, dashboard metrics, data science, Data Scientist, App Dev, Ops, Web Apps / Mobile / services, s/w dev, social interactions, transactions, content, discovery + modeling, optimized taps, capacity, Planner, Cluster Scheduler, Use Cases Across Topologies, Hadoop, In-Memory Data Grid, Log Events, DW, batch, near time, introduced capability, existing SDLC, RDBMS]

     Speaker notes: Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams. Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric. Not unlike a practice at LLL, where 4x more data gets collected about the machine than about the experiment.
  37. asymptotically…
     • smarter, more robust clusters
     • increased leverage of machine data for automation and optimization
     • DSLs focused on scalability, testability, reducing s/w engineering complexity
     • increased use of “machine code”; who writes SQL directly?
     • workflows incorporating more “moving parts”
     • less about “bigness” of data, more about complexity of process
     • greater instrumentation ⟹ even more machine data, increased feedback
     [stack diagram: DSL → Planner/Optimizer → Workflow → App History → Cluster → Cluster Scheduler]

     Speaker notes: Enterprise Data Workflows: more about “complex” process than about “big” data.
  38. references…
     Leo Breiman, “Statistical Modeling: The Two Cultures”, Statistical Science, 2001
     bit.ly/eUTh9L
     also check out RStudio: rstudio.org/ and rpubs.com/

     Speaker notes: For a really great discussion about the fundamentals of Data Science and process for algorithmic modeling (analyzing the 1997 inflection point), refer back to Breiman 2001.

  39. references…
     DJ Patil, Data Jujitsu, O’Reilly, 2012
     amazon.com/dp/B008HMN5BE
     DJ Patil, Building Data Science Teams, O’Reilly, 2011
     amazon.com/dp/B005O4U3ZE

     Speaker notes: In terms of building data products, see DJ Patil’s mini-books on O’Reilly: Building Data Science Teams and Data Jujitsu.
  40. Enterprise Data Workflows: the workflow abstraction: many aspects of an app
     [word count flow diagram, as in slide 3]

     Speaker notes: The workflow abstraction helps make Hadoop accessible to a broader audience of developers. Let’s take a look at how organizations can leverage it in other important ways…

  41. the workflow abstraction
     Tuple Flows, Pipes, Taps, Filters, Joins, Traps, etc.
     …in other words, “plumbing” as a pattern language for managing the complexity of Big Data in Enterprise apps on many levels
     [word count flow diagram, as in slide 3]

     Speaker notes: The workflow abstraction, a pattern language for building robust, scalable Enterprise apps, which works on many levels across an organization…
  42. rather than arguing SQL vs. NoSQL…
     this kind of work focuses on the process of structuring data, which must occur long before work on large-scale joins, visualizations, predictive models, etc.
     so the process of structuring data is what we examine here: i.e., how to build workflows for Big Data
     thank you Dr. Codd
     “A relational model of data for large shared data banks”
     dl.acm.org/citation.cfm?id=362685

     Speaker notes: Instead, in Data Science work we must focus on *the process of structuring data* that must happen before the large-scale joins, predictive models, visualizations, etc. The process of structuring data is what I will show here: how to build workflows from Big Data. Thank you, Dr. Codd.
  43. workflow – abstraction layer
     • Cascading initially grew from interaction with the Nutch project, before Hadoop had a name; API author Chris Wensel recognized that MapReduce would be too complex for substantial work in an Enterprise context
     • 5+ years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts
     • the pattern language provides a structured method for solving large, complex design problems where the syntax of the language promotes use of best practices – which addresses staffing issues
     [word count flow diagram, as in slide 3]

     Speaker notes: First and foremost, the workflow represents an abstraction layer to mitigate the complexity and costs of coding large apps directly in MapReduce.

  44. workflow – literate collaboration
     • provides an intuitive visual representation for apps: flow diagrams
     • flow diagrams are quite valuable for cross-team collaboration
     • this approach leverages literate programming methodology, especially in DSLs written in functional programming languages
     • example: nearly 1:1 correspondence between function calls and flow diagram elements in Scalding
     • example: expert developers on the cascading-users email list use flow diagrams to help troubleshoot issues remotely
     [word count flow diagram, as in slide 3]

     Speaker notes: Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming. Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling -- the expert developers generally ask a novice to provide a flow diagram first.

  45. workflow – business process
     • imposes a separation of concerns between the capture of business process requirements and the implementation details (Hadoop, etc.)
     • workflow orchestration evokes the notion of business process management for Enterprise apps (think BPM/BPEL)
     • Cascalog leverages Datalog features to make business process executable: “specify what you require, not how to achieve it”
     [word count flow diagram, as in slide 3]

     Speaker notes: Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL).