Functional programming for optimization problems in Big Data


Invited talk for the INFORMS chapter at Stanford, published 2013-03-06.

Published in: Technology


1. “Functional programming for optimization problems in Big Data”

Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid

Copyright @2013, Concurrent, Inc.
Wednesday, 06 March 13
2. The Workflow Abstraction

[flow diagram: Document Collection → Scrub token / Tokenize → HashJoin Left (Regex token, Stop Word List RHS) → GroupBy token → Count → Word Count]

1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example

Notes: Let’s consider a trendline subsequent to the Q3 1997 inflection point, which enabled huge ecommerce successes and commercialized Big Data. Where did Big Data come from, and where is this kind of work headed?
3. Q3 1997: inflection point

Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware. This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG. MapReduce and the Apache Hadoop open source stack emerged from this.

Notes: Q3 1997: Greg Linden, et al., @ Amazon, and Randy Shoup, et al., @ eBay – independent teams arrived at the same conclusion: parallelize workloads onto clusters of commodity servers to scale out horizontally. Google and Inktomi (YHOO Search) were working along the same lines.
4. Circa 1996: pre-inflection point

[diagram: Stakeholder ↔ Customers; Excel pivot tables and PowerPoint slide decks → strategy; BI → Product Analysts → requirements; SQL Query → optimized code; Engineering → Web App; result sets and transactions ↔ RDBMS]

Notes: Perl and C++ for CGI :) Feedback loops shown in red represent data innovations at the time… these are rather static. Characterized by slow, manual processes: data modeling / business intelligence; “throw it over the wall”… this thinking led to impossible silos.
5. Circa 2001: post- big ecommerce successes

[diagram: Stakeholder, Product, Customers; dashboards, UX, Engineering; models, servlets, recommenders + classifiers; Algorithmic Modeling, Web Apps, Middleware; aggregation, event history, SQL Query result sets, customer transactions; Logs → ETL → DW; RDBMS]

Notes: Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models – e.g., ad networks automating parts of the marketing funnel, as in our case study. LinkedIn, Facebook, Twitter, Apple, etc., followed early successes. Algorithmic modeling, leveraging machine data, allowed for Big Data to become monetized.
6. Circa 2013: clusters everywhere

[diagram: Data Products ↔ Customers; roles (Domain Expert, Data Scientist, App Dev, Ops) around business process, workflow, dashboards, metrics; Web Apps, Mobile, etc.; discovery + modeling over social interactions, transactions, content; app history feeding a Planner / Cluster Scheduler; Hadoop, Log Events, In-Memory Data Grid; DW and Ops in batch and near-time; existing SDLC and RDBMS]

Notes: Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams. Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric. Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment. We see this feeding into cluster optimization in YARN, Apache Mesos, etc.
7. references…

by Leo Breiman
Statistical Modeling: The Two Cultures
Statistical Science, 2001
bit.ly/eUTh9L

Notes: Leo Breiman wrote an excellent paper in 2001, “Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization).
8. references…

Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtube.com/watch?v=E91oEn1bnXM

Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtube.com/watch?v=qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx

Notes: In their own words…
9. core values

Data Science teams develop actionable insights, building confidence for decisions.

That work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords) – probably somewhere in-between… solving for pattern, at scale.

By definition, this is a multi-disciplinary pursuit which requires teams, not sole players.
10. team process = needs

discovery: help people ask the right questions
modeling: allow automation to place informed bets
integration: deliver products at scale to customers
apps: build smarts into product features
systems: keep infrastructure running, cost-effective
11. team composition = roles

[flow diagram: Document Collection → Scrub token / Tokenize → HashJoin Left (Regex token, Stop Word List RHS) → GroupBy token → Count → Word Count, annotated with the roles below]

Domain Expert: business process, stakeholder
Data Scientist: data prep, discovery, modeling, etc.
App Dev: software engineering, automation
Ops: systems engineering, access

Notes: This is an example of multi-disciplinary team composition for data science, while other emerging problem spaces will require other, more specific kinds of team roles.
12. matrix: evaluate needs × roles

[matrix: needs (discovery, modeling, integration, apps, systems) × roles (stakeholder, scientist, developer, ops)]
13. most valuable skills

Approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc. Unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up.

most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable

The rest of the skills – modeling, algorithms, etc. – those are secondary.
14. science in data science?

[screenshot: mirrored game event log – login, chat, purchase, and inventory events]

in a nutshell, what we do…
‣ estimate probability
‣ calculate analytic variance
‣ manipulate order complexity
‣ leverage use of learning theory
+ collab with DevOps, Stakeholders
+ reduce work to cron entries
15. references…

by DJ Patil

Data Jujitsu
O’Reilly, 2012
amazon.com/dp/B008HMN5BE

Building Data Science Teams
O’Reilly, 2011
amazon.com/dp/B005O4U3ZE
16. The Workflow Abstraction

[flow diagram: Document Collection → Scrub token / Tokenize → HashJoin Left (Regex token, Stop Word List RHS) → GroupBy token → Count → Word Count]

1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example

Notes: Origin and overview of the Cascading API as a workflow abstraction for Enterprise Big Data apps.
17. Cascading – origins

API author Chris Wensel worked as a system architect at an Enterprise firm well-known for several popular data products.

Wensel was following the Nutch open source project – before Hadoop even had a name. He noted that it would become difficult to find Java developers to write complex Enterprise apps directly in Apache Hadoop – a potential blocker for leveraging this new open source technology.

Notes: Cascading initially grew from interaction with the Nutch project, before Hadoop had a name. API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without an abstraction layer.
18. Cascading – functional programming

Key insight: MapReduce is based on functional programming – back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.

To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007 as a new Java API to implement functional programming for large-scale data workflows.

Notes: Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
19. examples…

• Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments

• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:

  Cascalog in Clojure (2010)
  github.com/nathanmarz/cascalog/wiki

  Scalding in Scala (2012)
  github.com/twitter/scalding/wiki

Notes: Many case studies, many Enterprise production deployments now for 5+ years.
20. The Ubiquitous Word Count

Definition: count how often each word appears in a collection of text documents.

This simple program provides an excellent test case for parallel processing, since it illustrates:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps

  void map (String doc_id, String text):
    for each word w in segment(text):
      emit(w, "1");

  void reduce (String word, Iterator group):
    int count = 0;
    for each pc in group:
      count += Int(pc);
    emit(word, String(count));

Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.

Notes: Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already...
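For readers who want to see the shape of this computation run locally, here is a minimal sequential sketch in plain Java (not from the original deck; class and method names are illustrative only): the flatMap stage plays the role of the map/segment step above, and groupingBy + counting plays the role of the reduce step.

```java
import java.util.*;
import java.util.stream.*;

public class WordCount {
    // Count how often each token appears across a collection of documents.
    // The split regex mirrors the deck's token delimiters: whitespace,
    // brackets, parens, commas, and periods.
    static Map<String, Long> wordCount(List<String> docs) {
        return docs.stream()
            .flatMap(text -> Arrays.stream(text.toLowerCase().split("[\\s\\[\\](),.]+")))
            .filter(w -> !w.isEmpty())                               // drop empty tokens
            .collect(Collectors.groupingBy(w -> w,                   // "GroupBy token"
                                           Collectors.counting()));  // "Count"
    }

    public static void main(String[] args) {
        Map<String, Long> wc = wordCount(List.of("Rain shadow", "rain falls"));
        System.out.println(wc.get("rain")); // prints 2
    }
}
```

In a real Cascading/Hadoop deployment the groupingBy step becomes a distributed shuffle, but the functional structure is identical, which is exactly the deck's point.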
21. word count – conceptual flow diagram

[flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

1 map, 1 reduce, 18 lines of code

cascading.org/category/impatient
gist.github.com/3900702

Notes: Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
22. word count – Cascading app in Java

  String docPath = args[ 0 ];
  String wcPath = args[ 1 ];
  Properties properties = new Properties();
  AppProps.setApplicationJarClass( properties, Main.class );
  HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

  // create source and sink taps
  Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
  Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

  // specify a regex to split "document" text lines into a token stream
  Fields token = new Fields( "token" );
  Fields text = new Fields( "text" );
  RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
  // only returns "token"
  Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

  // determine the word counts
  Pipe wcPipe = new Pipe( "wc", docPipe );
  wcPipe = new GroupBy( wcPipe, token );
  wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

  // connect the taps, pipes, etc., into a flow
  FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
    .addSource( docPipe, docTap )
    .addTailSink( wcPipe, wcTap );

  // write a DOT file and run the flow
  Flow wcFlow = flowConnector.connect( flowDef );
  wcFlow.writeDOT( "dot/wc.dot" );
  wcFlow.complete();

Notes: Based on a Cascading implementation of Word Count, here is sample code – approx. 1/3 the code size of the Word Count example from Apache Hadoop. The 2nd to last line generates a DOT file for the flow diagram.
23. word count – generated flow diagram

[generated DOT flow diagram:]
[head]
  → Hfs[TextDelimited[[doc_id, text]->[ALL]]][data/rain.txt]] [{2}:doc_id, text]
  → map: Each(token)[RegexSplitGenerator[decl:token][args:1]] [{1}:token]
  → GroupBy(wc)[by:[token]] wc[{1}:token]
  → reduce: Every(wc)[Count[decl:count]] [{2}:token, count]
  → Hfs[TextDelimited[[UNKNOWN]->[token, count]]][output/wc]] [{2}:token, count]
[tail]

Notes: As a concrete example of literate programming in Cascading, here is the DOT representation of the flow plan – generated by the app itself.
24. word count – Cascalog / Clojure

  (ns impatient.core
    (:use [cascalog.api]
          [cascalog.more-taps :only (hfs-delimited)])
    (:require [clojure.string :as s]
              [cascalog.ops :as c])
    (:gen-class))

  (defmapcatop split [line]
    "reads in a line of string and splits it by regex"
    (s/split line #"[\[\](),.)\s]+"))

  (defn -main [in out & args]
    (?<- (hfs-delimited out)
         [?word ?count]
         ((hfs-delimited in :skip-header? true) _ ?line)
         (split ?line :> ?word)
         (c/count ?count)))

  ; Paul Lam
  ; github.com/Quantisan/Impatient

Notes: Here is the same Word Count app written in Clojure, using Cascalog.
25. word count – Cascalog / Clojure

github.com/nathanmarz/cascalog/wiki

• implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development (TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn

Notes: From what we see about language features, customer case studies, and best practices in general, Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments. Great for large-scale, complex apps, where small teams must limit the complexities in their process.
26. word count – Scalding / Scala

  import com.twitter.scalding._

  class WordCount(args : Args) extends Job(args) {
    Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
      .read
      .flatMap('text -> 'token) { text : String => text.split("[ \\[\\](),.]") }
      .groupBy('token) { _.size('count) }
      .write(Tsv(args("wc"), writeHeader = true))
  }

Notes: Here is the same Word Count app written in Scala, using Scalding. Very compact, easy to understand; however, also more imperative than Cascalog.
27. word count – Scalding / Scala

github.com/twitter/scalding/wiki

• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of the conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog, not as much of a high-level language

Notes: If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project – that’s Scalding. That’s what they’re doing.
28. word count – Scalding / Scala

github.com/twitter/scalding/wiki

(same points as the previous slide, with a callout overlay:)

Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process

• great for data services at scale (imagine SOA infra @ Google as an open source project)

Notes: Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
29. The Workflow Abstraction

[flow diagram: Document Collection → Scrub token / Tokenize → HashJoin Left (Regex token, Stop Word List RHS) → GroupBy token → Count → Word Count]

1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example

Notes: CS theory related to data workflow abstraction, to manage complexity.
30. Cascading workflows – pattern language

Cascading uses a “plumbing” metaphor in the Java API to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.

Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps.

In formal terms, this provides a pattern language.

Notes: A pattern language, based on the metaphor of “plumbing”.
31. references…

pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices.
amazon.com/dp/0195019199

design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by the “Gang of Four”.
amazon.com/dp/0201633612

Notes: Chris Alexander originated the use of pattern language in a project called “The Oregon Experiment”, in the 1970s.
32. Cascading workflows – literate programming

Cascading workflows generate their own visual documentation: flow diagrams.

In formal terms, flow diagrams leverage a methodology called literate programming. This provides intuitive, visual representations for apps – great for cross-team collaboration.

Notes: Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming. Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling – expert developers generally ask a novice to provide a flow diagram first.
33. references…

by Don Knuth
Literate Programming
Univ of Chicago Press, 1992
literateprogramming.com/

“Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

Notes: Don Knuth originated the notion of literate programming, or code as “literature” which explains itself.
34. examples…

• Scalding apps have nearly 1:1 correspondence between function calls and the elements in their flow diagrams – excellent elision and literate representation

• noticed on the cascading-users email list: when troubleshooting issues, Cascading experts ask novices to provide an app’s flow diagram (generated as a DOT file), sometimes in lieu of showing code

In formal terms, a flow diagram is a directed, acyclic graph (DAG) on which lots of interesting math applies for query optimization, predictive models about app execution, parallel efficiency metrics, etc.

Notes: Literate programming examples observed on the email list are some of the best illustrations of this methodology.
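To make the DAG point concrete, here is a small illustrative sketch in plain Java (not from the deck; the class, method, and step names are hypothetical): a topological sort over a word-count flow graph, the kind of ordering a flow planner needs before it can translate steps into parallel jobs.

```java
import java.util.*;

public class FlowDag {
    // Topologically order the steps of a flow DAG.
    // Edges map each step to its downstream steps (Kahn's algorithm).
    static List<String> topoSort(Map<String, List<String>> edges) {
        Map<String, Integer> indegree = new HashMap<>();
        edges.forEach((src, dsts) -> {
            indegree.putIfAbsent(src, 0);
            for (String d : dsts) indegree.merge(d, 1, Integer::sum);
        });
        Deque<String> ready = new ArrayDeque<>();
        indegree.forEach((n, deg) -> { if (deg == 0) ready.add(n); });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.poll();
            order.add(n);
            // A step becomes ready once all of its upstream steps are ordered.
            for (String d : edges.getOrDefault(n, List.of()))
                if (indegree.merge(d, -1, Integer::sum) == 0) ready.add(d);
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> wc = Map.of(
            "head", List.of("tokenize"),
            "tokenize", List.of("groupby"),
            "groupby", List.of("count"),
            "count", List.of("tail"));
        System.out.println(topoSort(wc)); // prints [head, tokenize, groupby, count, tail]
    }
}
```

The same graph representation supports the other analyses the slide mentions: critical-path estimates for app execution, counting independent branches for parallel efficiency, and so on.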
35. Cascading workflows – business process

Following the essence of literate programming, Cascading workflows provide statements of business process. This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data): a separation of concerns between business process and implementation details (Hadoop, etc.).

This is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.”

By virtue of the pattern language, the flow planner used in a Cascading app determines how to translate business process into efficient, parallel jobs at scale.

Notes: Business stakeholder POV: business process management for workflow orchestration (think BPM/BPEL).
36. references…
by Edgar Codd
"A relational model of data for large shared data banks"
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685
Rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on: the process of structuring data.
That's what apps do – Making Data Work.
[Speaker notes: Focus on *the process of structuring data*, which must happen before the large-scale joins, predictive models, visualizations, etc. Just because your data is loaded into a "structured" store, that does not imply that your app has finished structuring it for the purpose of making data work. BTW, anybody notice that the O'Reilly "animal" for the Cascading book is an Atlantic Cod? (pun intended)]
37. Cascading workflows – functional relational programming
The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL.
Cascalog, in particular, implements more of what Codd intended for a "data sublanguage" and is considered to be close to a full implementation of the functional relational programming paradigm defined in:
Moseley & Marks, 2006, "Out of the Tar Pit", goo.gl/SKspn
[Speaker notes: A more contemporary statement along similar lines...]
38. Two Avenues…
Enterprise: must contend with complexity at scale everyday… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff.
Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding.
[Speaker notes: Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity.]
39. Cascading workflows – functional relational programming
[Same slide as 37, annotated with a callout:] several theoretical aspects converge into software engineering practices which mitigate the complexity of building and maintaining Enterprise data workflows.
40. The Workflow Abstraction
[Flow diagram: Document Collection → Scrub/Tokenize → HashJoin (Regex token, Left) with Stop Word List (RHS) → GroupBy token → Count → Word Count; M = map, R = reduce]
1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example
[Speaker notes: Here are a few use cases to consider, for Enterprise data workflows.]
41. Cascading – deployments
• 5+ year history of Enterprise production deployments, ASL 2 license, GitHub src, http://conjars.org
• partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera
• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, genomics, climatology, etc.
[Speaker notes: Several published case studies about Cascading, Cascalog, Scalding, etc. Wide range of use cases. Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading. Partnerships with the various Hadoop distro vendors, cloud providers, etc.]
42. Finance: Ecommerce Risk (stat.berkeley.edu)
Problem:
<1% chargeback rate allowed by Visa; others follow
• may leverage CAPTURE/AUTH wait period
• Cybersource, Vindicia, others haven't stopped fraud
>15% chargeback rate common for mobile in US:
• not much info shared with merchant
• carrier as judge/jury/executioner; customer assumed correct
most common: professional fraud (identity theft, etc.)
• patterns of attack change all the time
• widespread use of IP proxies, to mask location
• global market for stolen credit card info
other common case is friendly fraud
• teenager billing to parent's cell phone
43. Finance: Ecommerce Risk
KPI:
chargeback rate (CB)
• ground truth for how much fraud the bank/carrier claims
• 7-120 day latencies from the bank
false positive rate (FP)
• estimated cost: predicts customer support issues
• complaints due to incorrect fraud scores on valid orders (or lies)
false negative rate (FN)
• estimated risk: how much fraud may pass undetected in future orders
• changes with new product features/services/inventory/marketing
44. Finance: Ecommerce Risk
Data Science Issues:
• chargeback limits imply few training cases
• sparse data implies lots of missing values – must impute
• long latency on chargebacks – "good" flips to "bad"
• most detection occurs within large-scale batch, yet decisions are required during real-time event processing
• not just one pattern to detect – many, ever-changing
• many unknowns: blocked orders scare off professional fraud, so inferences cannot be confirmed
• cannot simply use raw data as input – requires lots of data preparation and statistical modeling
• each ecommerce firm has shopping/policy nuances which get exploited differently – hard to generalize solutions
45. Finance: Ecommerce Risk
Predictive Analytics:
batch
• cluster/segment customers for expected behaviors
• adjust for seasonal variation
• geospatial indexing / bayesian point estimates (fraud by lat/lng)
• impute missing values ("guesses" to fill in sparse data)
• run anti-fraud classifier (customer 360)
real-time
• exponential smoothing (estimators for velocity)
• calculate running medians (anomaly detection)
• run anti-fraud classifier (per order)
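The exponential smoothing estimator in the real-time list is simple enough to sketch; the alpha value and the "orders per hour" velocity feature below are illustrative assumptions, not taken from the deck.

```python
def exp_smooth(values, alpha=0.3):
    """Exponentially weighted moving average: s_t = alpha*x_t + (1-alpha)*s_{t-1}.
    Higher alpha weights recent observations more; lower alpha smooths harder."""
    s = values[0]
    smoothed = [s]
    for x in values[1:]:
        s = alpha * x + (1 - alpha) * s
        smoothed.append(s)
    return smoothed

# hypothetical velocity feature: orders per hour seen from one card number;
# the sudden burst at the end is the pattern a fraud classifier would consume
orders_per_hour = [1, 1, 2, 1, 9, 12, 11]
est = exp_smooth(orders_per_hour)
print(est[-1])
```

The smoothed value lags the raw burst but rises quickly, which is why it works as a cheap streaming velocity estimator feeding the per-order classifier.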
46. Finance: Ecommerce Risk
1. Data Preparation (batch)
‣ ETL from bank, log sessionization, customer profiles, etc.
  - large-scale joins of customers + orders
‣ apply time window
  - too long: patterns lose currency
  - too short: not enough wait for chargebacks
‣ segment customers
  - temporary fraud (identity theft which has been resolved)
  - confirmed fraud (chargebacks from the bank)
  - estimated fraud (blocked/banned by Customer Support)
  - valid orders (but different clusters of expected behavior)
‣ subsample to rebalance data
  - produce training set + test holdout
  - adjust balance for FP/FN bias (company risk profile)
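The "subsample to rebalance" step can be sketched as downsampling the majority (valid) class before splitting off a holdout; the 900/100 class mix and the 80/20 split below are illustrative assumptions.

```python
import random

def rebalance(valid, fraud, ratio=1.0, seed=42):
    """Downsample the majority (valid) class to `ratio` times the fraud count,
    since confirmed chargebacks yield few positive training cases."""
    rng = random.Random(seed)
    k = min(len(valid), int(len(fraud) * ratio))
    return rng.sample(valid, k) + fraud

def split_holdout(rows, test_frac=0.2, seed=42):
    """Shuffle and split into training set + test holdout."""
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]

# hypothetical labeled orders: label 0 = valid, label 1 = confirmed chargeback
valid = [("order", i, 0) for i in range(900)]
fraud = [("order", i, 1) for i in range(100)]
balanced = rebalance(valid, fraud)
train, test = split_holdout(balanced)
print(len(balanced), len(train), len(test))
```

Adjusting `ratio` (or per-class sampling weights) is one way to encode the FP/FN bias dictated by the company risk profile.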
47. Finance: Ecommerce Risk
2. Model Creation (analyst)
‣ distinguish between different IV data types
  - continuous (e.g., age)
  - boolean (e.g., paid lead)
  - categorical (e.g., gender)
  - computed (e.g., geo risk, velocities)
‣ use geospatial smoothing for lat/lng
‣ determine distributions for IV
‣ adjust IV for seasonal variation, where appropriate
‣ impute missing values based on density functions / medians
‣ factor analysis: determine which IV to keep (too many create problems)
‣ train model: random forest (RF) classifiers predict likely fraud
‣ calculate the confusion matrix (TP/FP/TN/FN)
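The confusion-matrix step at the end is easy to make concrete. The labels below are invented for illustration, and a real workflow would score them with a trained RF model rather than hand-written predictions.

```python
def confusion_matrix(actual, predicted):
    """Count TP/FP/TN/FN for a binary fraud classifier (1 = fraud)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, tn, fn

# hypothetical holdout labels vs. classifier output
actual    = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_matrix(actual, predicted)

fp_rate = fp / (fp + tn)   # valid orders incorrectly flagged (customer support cost)
fn_rate = fn / (fn + tp)   # fraud that slips through (risk)
print(tp, fp, tn, fn)
```

These two rates map directly onto the FP and FN KPIs on slide 43, which is why the confusion matrix is what the stakeholder sees when weighing risk vs. benefit.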
48. Finance: Ecommerce Risk
3. Test Model (analyst/batch loop)
‣ calculate estimated fraud rates
‣ identify potential found fraud cases
‣ report to Customer Support for review
‣ generate risk vs. benefit curves
‣ visualize estimated impact of new model
4. Decision (stakeholder)
‣ decide risk vs. benefit (minimize fraud + customer support costs)
‣ coordinate with bank/carrier if there are current issues
‣ determine go/no-go, when to deploy in production, size of rollout
49. Finance: Ecommerce Risk
5. Production Deployment (near-time)
‣ run model on in-memory grid / transaction processing
‣ A/B test to verify model in production (progressive rollout)
‣ detect anomalies
  - use running medians on continuous IVs
  - use exponential smoothing on computed IVs (velocities)
  - trigger notifications
‣ monitor KPI and other metrics in dashboards
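A running median over a sliding window, as used for anomaly detection above, might look like the sketch below; the window size, threshold multiplier, and per-minute order amounts are all illustrative assumptions.

```python
from collections import deque
from statistics import median

def anomalies(stream, window=5, threshold=3.0):
    """Flag values more than `threshold` times the running median of a
    sliding window -- medians resist distortion by the outliers themselves,
    unlike a running mean."""
    buf = deque(maxlen=window)
    flagged = []
    for x in stream:
        if len(buf) == window and x > threshold * median(buf):
            flagged.append(x)
        buf.append(x)
    return flagged

# hypothetical per-minute order amounts; the 500 spike should trigger
amounts = [20, 25, 22, 30, 24, 26, 500, 23, 27]
print(anomalies(amounts))
```

In production the flag would trigger a notification rather than just a return value, and the window would be keyed per customer or per card.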
50. Finance: Ecommerce Risk
[Architecture diagram: Cascading apps connect batch workloads on Hadoop (ETL from the DW – chargebacks, partner data; data prep; training data sets; customer segmentation; PMML model export) with real-time workloads on an IMDG (risk classifier scored per order and as customer 360; anomaly detection; velocity metrics). Analysts create models on laptops; outputs include predicted costs, scored new orders, and detected fraudsters.]
51. Ecommerce: Marketing Funnel (Wikipedia)
Problem:
• must optimize large ad spend budget
• different vendors report different kinds of metrics
• some campaigns are much smaller than others
• seasonal variation distorts performance
• inherent latency in spend vs. effect
• ads channels cannot scale up immediately
• must "scrub" leads to dispute payments/refunds
• hard to predict ROI for incremental ad spend
• many issues of diminishing returns in general
52. Ecommerce: Marketing Funnel
KPI:
cost per paying user (CPP)
• must align metrics for different ad channels
• generally need to estimate out to end-of-month
customer lifetime value (LTV)
• big differences based on geographic region, age, gender, etc.
• assumes that new customers behave like previous customers
return on investment (ROI)
• relationship between CPP and LTV
• adjust to invest in marketing (>CPP) vs. extract profit (>LTV)
other metrics
• reach: how many people get a brand message
• customer satisfaction: would recommend to a friend, etc.
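The relationship between CPP, LTV, and ROI can be made concrete with a toy calculation; every figure below is invented for illustration.

```python
def cpp(ad_spend, paying_users):
    """Cost per paying user for one channel over one period."""
    return ad_spend / paying_users

def roi(ltv, cost_per_payer):
    """Return on investment: net value per dollar of acquisition cost."""
    return (ltv - cost_per_payer) / cost_per_payer

# hypothetical monthly figures for one ad channel
spend = 50_000.0
payers = 1_250
ltv_estimate = 90.0   # assumes new customers behave like previous ones

c = cpp(spend, payers)
print(c, roi(ltv_estimate, c))
```

While the LTV estimate stays above CPP, the "invest in marketing" lever makes sense; once ROI approaches zero the channel has hit diminishing returns.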
53. Ecommerce: Marketing Funnel
Predictive Analytics:
batch
• log aggregation, followed with cohort analysis
• bayesian point estimates compare different-sized ad tests
• time series analysis normalizes for seasonal variation
• geolocation adjusts for regional cost/benefit
• customer lifetime value estimates ROI of new leads
• linear programming models estimate elasticity of demand
real-time
• determine whether this is actually a new customer…
• new: modify initial UX based on ad channel, region, friends, etc.
• old: recommend products/services/friends based on behaviors
• adjust spend on poorly performing channels
• track back to top referring sites/partners
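One common form of the bayesian point estimate for comparing different-sized ad tests is the posterior mean of a conversion rate under a Beta prior; the uniform prior and the campaign numbers below are illustrative assumptions.

```python
def beta_posterior_mean(conversions, trials, alpha=1.0, beta=1.0):
    """Posterior mean of a conversion rate under a Beta(alpha, beta) prior.
    Small campaigns shrink toward the prior mean instead of swinging wildly
    on a handful of conversions."""
    return (conversions + alpha) / (trials + alpha + beta)

# two hypothetical campaigns of very different size
small = beta_posterior_mean(3, 10)       # raw rate 0.30, but only 10 trials
large = beta_posterior_mean(210, 1000)   # raw rate 0.21, well supported

print(round(small, 3), round(large, 3))
```

The small campaign's estimate is pulled toward the prior mean of 0.5, so an apparently hot tiny test no longer dominates a well-measured large one on raw rate alone.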
54. Airlines
Problem:
• minimize schedule delays
• re-route around weather and airport conditions
• manage supplier channels and inventories to minimize AOG
KPI:
• forecast future passenger demand
• customer loyalty
• aircraft on ground (AOG)
• mean time between failures (MTBF)
55. Airlines
Predictive Analytics:
batch
• predict "last mile" failures
• optimize capacity utilization
• operations research problem to optimize stocking / minimize fuel waste
• boost customer loyalty by adjusting incentives in frequent flyer programs
real-time
• forecast schedule delays
• monitor factors for travel conditions: weather, airports, etc.