“The Workflow Abstraction”

                     Strata SC

                     Paco Nathan
                     Concurrent, Inc.
                     San Francisco, CA

                   Copyright @2013, Concurrent, Inc.

Friday, 01 March 13                                                                                           1
Background: a dual background in quantitative and distributed systems.
I’ve spent the past decade leading innovative Data teams responsible for many successful large-scale apps -
The Workflow Abstraction



                       1. Funnel
                       2. Circa 2008
                       3. Cascading
                       4. Sample Code
                       5. Workflows
                       6. Abstraction
                       7. Trendlines

This talk is about the workflow abstraction:
 * the business process of structuring data
 * the practices of building robust apps at scale
 * the open source projects for Enterprise Data Workflows

We’ll consider some theory, examples, best practices, trendlines --
what are the drivers that brought us here, and where is this work heading?

Most of all, the goal is to make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows -- robust apps at scale -- for Hadoop and beyond.
Marketing Funnel – overview

             In reference to Making Data Work…
             Almost every business uses a model
             similar to this – give or take a few steps.

             Customer leads go in at the top,
             those get refined through several stages,
             then results flow out the bottom.





Let’s consider one of the most fundamental predictive models used in business: a marketing funnel.

This is an exercise which I’ve had to run through at nearly every firm in recent years -- analytics for the marketing funnel.
Marketing Funnel – clickstream

              Different funnel stages get represented
              in ecommerce by events captured in
              log files, as a class of machine data
              called clickstream

                •   ad impressions
                •   URL clicks
                •   landing page views
                •   new user registrations
                •   session cookies
                •   online purchases
                •   social network activity
                •   etc.

              [Figure: marketing funnel, labeled Customers, Campaigns, Awareness, Click, Interest, Sign Up, Conversion, "Like", Referral]

Online advertising involves what we call “clickstream” data, lots of events in log files -- i.e., lots of unstructured data.
Marketing Funnel – metrics

             A variety of clickstream metrics can
             be used as performance indicators
             at different stages of the funnel:
              •    CPM: cost per thousand impressions
              •    CTR: click-through rate
              •    CPA: cost per action
              •    etc.

             [Figure: marketing funnel, annotated with a metric per stage]
              stage          metric
              Awareness      CPM
              Interest       CTR
              Evaluation     behaviors
              Conversion     CPA
              Referral       NPS, social graph, etc.
              Repeat         loyalty, win back, etc.
             Events shown: Impression, Click, Sign Up

The many different highly-nuanced metrics which apply are mind-boggling :)
Marketing Funnel – example calculations

             metric   cost      events    formula                   rate
             CPM      $4,000    10^6      $4,000 ÷ (10^6 ÷ 10^3)    $4.00
             CTR      -         3∙10^3    3∙10^3 ÷ 10^6             0.3%
             CPA      -         20        $4,000 ÷ 20               $200

Here are examples of the kinds of calculations performed...
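The calculations on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than code from the talk: the function names are hypothetical, and it assumes the $4,000 cost applies to all three metrics, with 10^6 impressions, 3∙10^3 clicks, and 20 actions.

```python
def cpm(cost, impressions):
    """Cost per thousand impressions."""
    return cost / (impressions / 1_000)


def ctr(clicks, impressions):
    """Click-through rate, as a fraction of impressions."""
    return clicks / impressions


def cpa(cost, actions):
    """Cost per action (for example, per sign-up or purchase)."""
    return cost / actions


# Values taken from the example slide; sharing one cost figure
# across all three metrics is an assumption.
cost = 4_000
impressions = 10**6
clicks = 3 * 10**3
actions = 20

print(f"CPM = ${cpm(cost, impressions):.2f}")   # CPM = $4.00
print(f"CTR = {ctr(clicks, impressions):.1%}")  # CTR = 0.3%
print(f"CPA = ${cpa(cost, actions):.0f}")       # CPA = $200
```

At real scale the event counts would come out of aggregated log data rather than literals, which is where the data workflow comes in.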
Marketing Funnel – predictive model

             Given these metrics, we can go further
             to estimate cost per paying user (CPP),
             customer lifetime value (LTV), etc.
             Then we can build a predictive model for
             return on investment (ROI) per customer,
             summarizing the funnel performance:
                     ROI = (LTV − CPP) ∕ CPP

             As an example, after crunching lots of logs,
             suppose that…

                     CPP = $200
                     LTV = $2000

                     ROI = ($2000 − $200) ∕ $200
             for a 9x multiple

For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers,
which describes the efficiency of the marketing funnel at different stages.
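As a rough sketch of the ROI formula from this slide, the snippet below computes the 9x multiple and then ranks two customer segments. The `roi` function name and the per-segment figures are hypothetical, invented only to show how the metric might be compared across segments.

```python
def roi(ltv, cpp):
    """Return on investment per customer: (LTV - CPP) / CPP."""
    if cpp <= 0:
        raise ValueError("CPP must be positive")
    return (ltv - cpp) / cpp


# Figures from the slide.
print(roi(ltv=2000, cpp=200))  # 9.0

# Hypothetical per-segment estimates, invented for illustration only.
segments = {
    "search": {"cpp": 200, "ltv": 2000},
    "display": {"cpp": 400, "ltv": 1000},
}
ranked = sorted(segments, key=lambda s: roi(**segments[s]), reverse=True)
print(ranked)  # ['search', 'display']
```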
Marketing Funnel – example architecture

             Let’s consider an example architecture
             for calculating, reporting, and taking action
             on funnel metrics, based on large-scale
             clickstream data…

             [Figure: architecture diagram, labeled Customers, Interest, Web App, logs, Cache, trap tap, sink tap, Modeling, PMML, Cubes, customer profile DBs]

Here’s an example architecture for using clickstream metrics within an online business.
Marketing Funnel – complexities

             Multiple ad partners, different contract
             terms, reporting different metrics at
             different times, click scrubs, etc.
             Campaigns target specific geo/demo,
             test alternate landing pages, probably
             need to segment customer base…

             These issues make clickstream data
             large and yet sparse.

             Other issues:
             • seasonal variation
             • fluctuating currency exchange rates
             • distortions due to credit card fraud
             • diminishing returns
             • forecasting requirements

However, real life intervenes. In many businesses, this is a complicated model to calculate correctly:
 * many vendors, data sources, and different metrics to be aligned
 * lots of roll-ups
 * Bayesian point estimates
 * forecasts and dashboards

The social dimension makes this convoluted, not simple.
Marketing Funnel – very large scale

             Even a small start-up may need to
             make decisions about billions of
             events, many millions of users, and
             millions of dollars in annual ad spend.

             Ad networks attempt to simplify and
             optimize parts of the funnel process
             as a value-add.

             The need for these insights has been a
             driver for Hadoop-related technologies.

The needs for large scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related “Big Data” technologies.
Marketing Funnel – very large scale

             funnel modeling and optimization
             requires complex data workflows
             to obtain the required insights

These needs imply complex data workflows.

It’s not about doing a BI query or a pivot table;
that’s how retailers were thinking when Amazon came along.
The Workflow Abstraction



                      1. Funnel
                      2. Circa 2008
                      3. Cascading
                      4. Sample Code
                      5. Workflows
                      6. Abstraction
                      7. Trendlines

A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.
Circa 2008 – Hadoop at scale

             Scenario: Analytics team at a large ad network…

             Company had invested $MM capex in a
             large data warehouse across LOBs

             Mission-critical app had been written as
             a large SQL workflow in the DW

             Marketing funnel metrics were estimated
             for many advertisers, many campaigns,
             many publishers, many customers –
             billions of calculations daily

             Predictive models matched publisher ~ advertiser
             and campaign ~ user, to optimize marketing
             funnel performance

             [Figure: workflow diagram, labeled clickstream, RDBMS, per-user, roll-ups, collab, alongside funnel stages Campaigns, Interest, Evaluation, Repeat]

Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network.

Most of the revenue depended on one app, written in a DW -- monolithic SQL which nobody at the company understood.
Circa 2008 – Hadoop at scale

             Issues:
              • critical app had hit hard limits for scalability
              • several TB of data, 100’s of servers
              • batch window length vs. failure rate vs. SLA
                in the context of business growth posed
                an existential risk

             We built out a team to address these issues
             as rapidly as possible…
             Needed to re-create that data workflow
             based on Enterprise requirements.

Marching orders:
5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City;
5 weeks to reverse engineer the mission-critical app without any access to its author;
5 weeks to implement a Hadoop version which could scale-out on EC2.

We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.
Circa 2008 – Hadoop at scale

            Approach:
             • reverse-engineered business process from
               ~1500 lines of undocumented SQL
             • created a large, multi-step Apache Hadoop
               app on AWS
             • leveraged cloud strategy to trade $MM
               capex for lower, scalable opex
             • Amazon identified our app as one of the
               largest Hadoop deployments on EC2
             • our app became a case study for AWS
               prior to Elastic MapReduce launch

            [Figure: workflow diagram, labeled clickstream, query/load, msg, HDFS, recommends, roll-ups]

Our solution involved dependencies among more than a dozen Hadoop job steps.
Circa 2008 – Hadoop at scale

             Unresolved:
              • ETL was still a separate app
              • difficult to handle exceptions, notifications,
                debugging, etc., across the entire workflow
              • data scientists wore beepers since Ops
                lacked visibility into business process
              • coding directly in MapReduce created
                a staffing bottleneck

This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM --
for troubleshooting, handling exceptions, notifications, etc.

Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, which is almost never a good idea.

Three issues about Enterprise workflows:
 * staffing bottleneck unless there’s a good abstraction layer
 * operational complexity, mostly due to lack of transparency
 * system integration problems *are* the main problem to solve
Circa 2008 – Hadoop at scale

             Unresolved:
              • ETL was still a separate app
              • difficult to handle exceptions, notifications,
                debugging, etc., across the entire workflow
              • data scientists wore beepers since Ops
                lacked visibility into the app’s business logic
              • coding directly in MapReduce created
                a staffing bottleneck

             a good solution for a large, commercial
             Apache Hadoop deployment, but
             workflow management lacked crucial …
             which led to a search for a better
             workflow abstraction

While leading this team, I sought out other ways of managing a complex workflow involving Hadoop.

I found out about the Cascading open source project, and called the API author. Oddly enough, as I was walking into the interview for my next job, we passed each other in the parking lot.
The Workflow Abstraction



                       1. Funnel
                       2. Circa 2008
                       3. Cascading
                       4. Sample Code
                       5. Workflows
                       6. Abstraction
                       7. Trendlines

Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
Cascading – origins

             API author Chris Wensel worked as a system architect
             at an Enterprise firm well-known for several popular
             data products.
             Wensel was following the Nutch open source project –
             before Hadoop even had a name.
             He noted that it would become difficult to find Java
             developers to write complex Enterprise apps directly
             in Apache Hadoop – a potential blocker for leveraging
             this new open source technology.

Cascading initially grew from interaction with the Nutch project, before Hadoop had a name.

API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without an abstraction layer.
Cascading – functional programming

             Key insight: MapReduce is based on functional programming
             – back to LISP in the 1970s. Apache Hadoop use cases are
             mostly about data pipelines, which are functional in nature.
             To ease staffing problems as “Main Street” Enterprise firms
             began to embrace Hadoop, Cascading was introduced
             in late 2007, as a new Java API to implement functional
             programming for large-scale data workflows:

               •    leverages JVM and Java-based tools without any need
                    to create an entirely new language
               •    allows many programmers who have J2EE expertise
                    to build apps that leverage the economics of Hadoop

Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
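To make the point about MapReduce and functional programming concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop or the Cascading API, and the function names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical; it only mimics the map, group, and reduce stages of a data pipeline as pure functions.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for each token in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: fold the values for one key into a single count."""
    return reduce(lambda a, b: a + b, values, 0)


def word_count(lines):
    """Compose the three stages into one pipeline."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return {word: reduce_phase(vals) for word, vals in shuffle(pairs).items()}


counts = word_count(["a rose is a rose", "is a rose"])
print(counts)
```

Each stage takes data in and returns new data without side effects, which is the property that lets a framework distribute the work across many machines.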

                       “Cascading gives Java developers the ability to build
                        Big Data applications on Hadoop using their existing
                        skillset … Management can really go out and build a
                        team around folks that are already very experienced
                        with Java. Switching over to this is really a very short …”
                            CIO, Thor Olavsrud

                       “Masks the complexity of MapReduce, simplifies the
                        programming, and speeds you on your journey toward
                        actionable analytics … A vast improvement over native
                        MapReduce functions or Pig UDFs.”
                            2012 BOSSIE Awards, James Borck

Industry analysts are picking up on the staffing costs related to Hadoop, “no free lunch”

The issues:
 * staffing bottleneck
 * operational complexity
 * system integration
Cascading – deployments

              • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma,
                   uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
              • partners: Amazon AWS, Microsoft Azure, Hortonworks,
                   MapR, EMC, SpringSource, Cloudera
              • 5+ year history of Enterprise production deployments,
                   ASL 2 license, GitHub src,
              • use cases: ETL, marketing funnel, anti-fraud, social media,
                   retail pricing, search analytics, recommenders, eCRM,
                   utility grids, genomics, climatology, etc.

Several published case studies about Cascading, Cascalog, Scalding, etc.
Wide range of use cases.

Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading.
Partnerships with the various Hadoop distro vendors, cloud providers, etc.

                       • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested
                           in functional programming open source projects atop
                           Cascading – used for their large-scale production
                       •   new case studies for Cascading apps are mostly
                           based on domain-specific languages (DSLs) in JVM
                           languages which emphasize functional programming:

                           Cascalog in Clojure (2010)
                           Scalding in Scala (2012)


Many case studies, many Enterprise production deployments now for 5+ years.

                       Cascading as the basis for workflow
                       abstractions atop Hadoop and more,
                       with a 5+ year history of production
                       deployments across multiple verticals

Friday, 01 March 13                                                                    24
Cascading as a basis for workflow abstraction, for Enterprise data workflows
The Workflow Abstraction

                      1. Funnel
                      2. Circa 2008
                      3. Cascading
                      4. Sample Code
                      5. Workflows
                      6. Abstraction
                      7. Trendlines

Friday, 01 March 13                                                                                                                                      25
Code samples in Cascading / Cascalog / Scalding, based on Word Count
The Ubiquitous Word Count

             Definition:

               count how often each word appears
               in a collection of text documents

             This simple program provides an excellent test case for
             parallel processing, since it:

              • requires a minimal amount of code
              • demonstrates use of both symbolic and numeric values
              • shows a dependency graph of tuples as an abstraction
              • is not many steps away from useful search indexing
              • serves as a “Hello World” for Hadoop apps

             Any distributed computing framework which can run Word
             Count efficiently in parallel at scale can handle much
             larger and more interesting compute problems.

             void map (String doc_id, String text):
               for each word w in segment(text):
                 emit(w, "1");

             void reduce (String word, Iterator group):
               int count = 0;
               for each pc in group:
                 count += Int(pc);
               emit(word, String(count));

Friday, 01 March 13                                                                                                                                                    26
Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already...

Due to my close ties to Freemasonry, I’m obligated to speak about WordCount at this point.
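For readers who want to run the map and reduce pseudocode from the Word Count slide, here is a minimal single-process sketch in plain Java. It is not Hadoop code: the class name `WordCountSketch`, the whitespace-based stand-in for `segment(text)`, and the in-memory grouping that plays the role of the shuffle are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process sketch of the map and reduce steps from the slide.
// All names here are hypothetical; no Hadoop API is involved.
class WordCountSketch {

    // map: emit a (word, 1) pair for every word in the text
    // (splitting on whitespace is an assumed stand-in for segment(text))
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : text.split("\\s+")) {
            if (!w.isEmpty()) {
                pairs.add(Map.entry(w, 1));
            }
        }
        return pairs;
    }

    // reduce: sum the partial counts collected for one word
    static int reduce(String word, List<Integer> group) {
        int count = 0;
        for (int pc : group) {
            count += pc;
        }
        return count;
    }

    // driver: groups the emitted pairs by word, which is the job the
    // framework's shuffle phase would do on a cluster
    static Map<String, Integer> wordCount(Map<String, String> docs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            counts.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
            "doc01", "rain shadow rain",
            "doc02", "dry shadow");
        System.out.println(wordCount(docs)); // prints {dry=1, rain=2, shadow=2}
    }
}
```

The only part a framework really has to supply is the grouping between the two functions, which is why Word Count works so well as a first test of a parallel runtime.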
word count – conceptual flow diagram


                [conceptual flow diagram; labels: M, token, Count, R, Word]

                1 map
                1 reduce
               18 lines code

Friday, 01 March 13                                                                                                                                                                                                      27
Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
word count – Cascading app in Java

             String docPath = args[ 0 ];
             String wcPath = args[ 1 ];

             Properties properties = new Properties();
             AppProps.setApplicationJarClass( properties, Main.class );
             HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

             // create source and sink taps (tab-delimited text with a header row)
             Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
             Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

             // specify a regex to split "document" text lines into token stream
             Fields token = new Fields( "token" );
             Fields text = new Fields( "text" );
             RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
             // only returns "token"
             Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

             // determine the word counts
             Pipe wcPipe = new Pipe( "wc", docPipe );
             wcPipe = new GroupBy( wcPipe, token );
             wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

             // connect the taps, pipes, etc., into a flow
             FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
              .addSource( docPipe, docTap )
              .addTailSink( wcPipe, wcTap );

             // write a DOT file and run the flow
             Flow wcFlow = flowConnector.connect( flowDef );
             wcFlow.writeDOT( "dot/" );
             wcFlow.complete();

Friday, 01 March 13                                                                                                                                    28
Based on a Cascading implementation of Word Count, here is sample code --
approx 1/3 the code size of the Word Count example from Apache Hadoop

2nd to last line: generates a DOT file for the flow diagram
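To make the tokenizing regex concrete, here is a small sketch in plain Java that applies the same character class with `java.util.regex`. It does not use Cascading; the class name `TokenizeSketch` and the sample sentence are made up, and the assumption is only that the splitter treats each listed character (space, brackets, parentheses, comma, period) as a delimiter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper showing what the delimiter character class matches.
class TokenizeSketch {

    // same character class as in the slide: space [ ] ( ) , .
    static final Pattern DELIMS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // split one line of text into tokens on any single delimiter character
    static List<String> tokenize(String line) {
        return Arrays.asList(DELIMS.split(line));
    }

    public static void main(String[] args) {
        for (String token : tokenize("a dry area, on the (lee) side.")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because the pattern matches one delimiter at a time, adjacent delimiters such as a comma followed by a space produce empty tokens. A real app would likely want to scrub those before counting.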
word count – generated flow diagram

[generated flow diagram (DOT); recoverable node labels, top to bottom:]

    [head]
    Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
    [{2}:'doc_id', 'text']
    [{2}:'token', 'count']
    Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
    [{2}:'token', 'count']


Friday, 01 March 13                                                                                                                                                                                     29
As a concrete example of literate programming in Cascading,
here is the DOT representation of the flow plan -- generated by the app itself.
word count – Cascalog / Clojure

             (ns impatient.core
               (:use [cascalog.api]
                     [cascalog.more-taps :only (hfs-delimited)])
               (:require [clojure.string :as s]
                         [cascalog.ops :as c])
               (:gen-class))

             (defmapcatop split [line]
               "reads in a line of string and splits it by regex"
               (s/split line #"[\[\]\(\),.)\s]+"))

             (defn -main [in out & args]
               (?<- (hfs-delimited out)
                    [?word ?count]
                    ((hfs-delimited in :skip-header? true) _ ?line)
                    (split ?line :> ?word)
                    (c/count ?count)))

             ; Paul Lam

Friday, 01 March 13                                                                                                     30
Here is the same Word Count app written in Clojure, using Cascalog.
word count – Cascalog / Clojure


               • implements Datalog in Clojure, with predicates backed
                 by Cascading – for a highly declarative language
               • run ad-hoc queries from the Clojure REPL –
                 approx. 10:1 code reduction compared with SQL
               • composable subqueries, used for test-driven development
                 (TDD) practices at scale
               • Leiningen build: simple, no surprises, in Clojure itself
               • more new deployments than other Cascading DSLs –
                 Climate Corp is largest use case: 90% Clojure/Cascalog
               • has a learning curve, limited number of Clojure developers
               • aggregators are the magic, and those take effort to learn

Friday, 01 March 13                                                                                                                                                   31
From what we see about language features, customer case studies, and best practices in general --
Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments.

Great for large-scale, complex apps, where small teams must limit the complexities in their process.
word count – Scalding / Scala

           import com.twitter.scalding._

           class WordCount(args : Args) extends Job(args) {
             // source: TSV with a header row; the "doc" argument name is assumed
             Tsv(args("doc"),
                 ('doc_id, 'text),
                 skipHeader = true)
               .flatMap('text -> 'token) {
                 text : String => text.split("[ \\[\\]\\(\\),.]")
               }
               .groupBy('token) { _.size('count) }
               .write(Tsv(args("wc"), writeHeader = true))
           }

Friday, 01 March 13                                                                                                                32
Here is the same Word Count app written in Scala, using Scalding.

Very compact, easy to understand; however, also more imperative than Cascalog.
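For comparison with the Java version of the app, the same flatMap / groupBy shape can be sketched with the Java streams API. This runs in a single JVM and has nothing to do with Scalding or Cascading; the class name `StreamWordCount` is hypothetical, and the point is only to show how each step of the flow maps onto one function call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-JVM analogue of the Scalding pipeline.
class StreamWordCount {

    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
            // analogous to .flatMap('text -> 'token): one line becomes many tokens
            .flatMap(text -> Arrays.stream(text.split("[ \\[\\]\\(\\),.]")))
            // analogous to .groupBy('token) { _.size('count) }
            .collect(Collectors.groupingBy(
                token -> token, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("rain shadow rain", "dry shadow")));
    }
}
```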
word count – Scalding / Scala


                • extends the Scala collections API so that distributed lists
                  become “pipes” backed by Cascading
                • code is compact, easy to understand
                • nearly 1:1 between elements of conceptual flow diagram
                  and function calls
                • extensive libraries are available for linear algebra, abstract
                  algebra, machine learning – e.g., Matrix API, Algebird, etc.
                • significant investments by Twitter, Etsy, eBay, etc.
                • great for data services at scale
                • less learning curve than Cascalog,
                  not as much of a high-level language

Friday, 01 March 13                                                                                                                                                                                                                 33
If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project -- that’s Scalding. That’s what they’re doing.
word count – Scalding / Scala

               • extends the Scala collections API so that distributed lists
                 become “pipes” backed by Cascading
               • code is compact, easy to understand
               • nearly 1:1 between elements of conceptual flow diagram
                 and function calls
               • extensive libraries are available for linear algebra, abstract
                 algebra, machine learning – e.g., Matrix API, Algebird, etc.
               • significant investments by Twitter, Etsy, eBay, etc.
               • great for data services at scale
                 (imagine SOA infra @ Google as an open source project)
               • less learning curve than Cascalog,
                 not as much of a high-level language

                                           Cascalog and Scalding DSLs
                                           leverage the functional aspects
                                           of MapReduce, helping to limit
                                           complexity in process

Friday, 01 March 13                                                                                                                                                                                   34
Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
The Workflow Abstraction

                     1. Funnel
                     2. Circa 2008
                     3. Cascading
                     4. Sample Code
                     5. Workflows
                     6. Abstraction
                     7. Trendlines

Friday, 01 March 13                                                                                                                                              35
Tracking back to the Marketing Funnel as an example workflow…
Let’s consider how Cascading apps incorporate other components beyond Hadoop
Enterprise Data Workflows
            Back to our marketing funnel, let’s consider
            an example app… at the front end
            LOB use cases drive demand for apps

            [diagram labels: Web, logs, Cache, trap tap, sink tap, Modeling, PMML, Cubes, customer profile DBs]

Friday, 01 March 13                                                                               36
LOB use cases drive the demand for Big Data apps
Enterprise Data Workflows
             An example… in the back office
             Organizations have substantial investments
             in people, infrastructure, process

             [diagram labels: Web, logs, Cache, trap tap, sink tap, Modeling, PMML, Cubes, customer profile DBs]

Friday, 01 March 13                                                                                                            37
Enterprise organizations have seriously ginormous investments in existing back office practices:
people, infrastructure, processes
Enterprise Data Workflows
              An example… for the heavy lifting!
              “Main Street” firms are migrating
              workflows to Hadoop, for cost
              savings and scale-out

              [diagram labels: Web, logs, Cache, trap tap, sink tap, Modeling, PMML, Cubes, customer profile DBs]

Friday, 01 March 13                                                                                                     38
“Main Street” firms have invested in Hadoop to address Big Data needs,
off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.
Cascading workflows – taps

               • taps integrate other data frameworks, as tuple streams
               • these are “plumbing” endpoints in the pattern language
               • sources (inputs), sinks (outputs), traps (exceptions)
               • text delimited, JDBC, Memcached,
                 HBase, Cassandra, MongoDB, etc.
               • data serialization: Avro, Thrift,
                 Kryo, JSON, etc.
               • extend a new kind of tap in just
                 a few lines of Java

             schema and provenance get
             derived from analysis of the taps

             [diagram labels: Web, logs, trap tap, sink tap, Modeling, PMML, sink, Cubes, customer profile DBs, Hadoop, Reporting]

Friday, 01 March 13                                                                                                       39
Speaking of system integration,
taps provide the simplest approach for integrating different frameworks.
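As a rough picture of the three tap roles, here is a plain-Java sketch in which good records flow from a source to a sink while records that fail parsing are diverted to a trap instead of stopping the run. This is an analogy only, not the Cascading `Tap` API: the class `TapSketch`, its fields, and the integer-parsing step are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for tap roles; not the Cascading Tap API.
class TapSketch {

    // sink: where successfully processed records end up
    final List<Integer> sink = new ArrayList<>();
    // trap: where records that raised an exception are set aside
    final List<String> trap = new ArrayList<>();

    // source: any list of raw text records
    void run(List<String> source) {
        for (String record : source) {
            try {
                sink.add(Integer.parseInt(record.trim()));
            } catch (NumberFormatException e) {
                // a bad record goes to the trap and the flow keeps going
                trap.add(record);
            }
        }
    }

    public static void main(String[] args) {
        TapSketch flow = new TapSketch();
        flow.run(List.of("1", " 2 ", "x"));
        System.out.println("sink=" + flow.sink + " trap=" + flow.trap);
    }
}
```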
Cascading workflows – taps

            String docPath = args[ 0 ];
            String wcPath = args[ 1 ];
            Properties properties = new Properties();
            AppProps.setApplicationJarClass( properties, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

            // create source and sink taps (tab-delimited text with a header row)
            Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
            Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

            // specify a regex to split "document" text lines into token stream
            Fields token = new Fields( "token" );
            Fields text = new Fields( "text" );
            RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
            // only returns "token"
            Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
            // determine the word counts
            Pipe wcPipe = new Pipe( "wc", docPipe );
            wcPipe = new GroupBy( wcPipe, token );
            wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
             .addSource( docPipe, docTap )
             .addTailSink( wcPipe, wcTap );
            // write a DOT file and run the flow
            Flow wcFlow = flowConnector.connect( flowDef );
            wcFlow.writeDOT( "dot/" );
            wcFlow.complete();

                                                    source and sink taps
                                                    for TSV data in HDFS

Friday, 01 March 13                                                                                                        40
Here are the taps in the WordCount source
Cascading workflows – topologies

               • topologies execute workflows on clusters
               • flow planner is like a compiler for queries
                 - Hadoop (MapReduce jobs)
                 - local mode (dev/test or special config)
                 - in-memory data grids (real-time)
               • flow planner can be extended
                 to support other topologies

             blend flows in different topologies
             into the same app – for example,
             batch (Hadoop) + transactions (IMDG)

             [diagram labels: Web, logs, Cache, trap tap, sink tap, Modeling, PMML, Cubes, customer profile DBs, Hadoop]


Friday, 01 March 13                                                                                                                                         41
Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
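One way to think about a flow planner is that the logical step stays the same while the planner decides how it gets executed. The sketch below is a loose analogy in plain Java, not Cascading's planner API: the `Planner` interface and the `LOCAL` and `PARALLEL` instances are hypothetical, standing in for something like local mode and a cluster topology.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical analogy: one logical step, two execution strategies.
class TopologySketch {

    // a "planner" decides how a logical step is executed over the input
    interface Planner {
        <A, B> List<B> run(List<A> input, Function<A, B> step);
    }

    // sequential execution, in the spirit of a local dev/test mode
    static final Planner LOCAL = new Planner() {
        public <A, B> List<B> run(List<A> input, Function<A, B> step) {
            return input.stream().map(step).collect(Collectors.toList());
        }
    };

    // multi-threaded execution, standing in for a distributed topology
    static final Planner PARALLEL = new Planner() {
        public <A, B> List<B> run(List<A> input, Function<A, B> step) {
            return input.parallelStream().map(step).collect(Collectors.toList());
        }
    };

    public static void main(String[] args) {
        Function<String, Integer> step = String::length;
        List<String> input = List.of("a", "bb", "ccc");
        System.out.println(LOCAL.run(input, step));
        System.out.println(PARALLEL.run(input, step));
    }
}
```

Both planners return the same result for the same step, which is the property that lets one app definition target more than one topology.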
Cascading workflows – topologies

            String docPath = args[ 0 ];
            String wcPath = args[ 1 ];
            Properties properties = new Properties();
            AppProps.setApplicationJarClass( properties, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

            // create source and sink taps (tab-delimited text with a header row)
            Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
            Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

            // specify a regex to split "document" text lines into token stream
            Fields token = new Fields( "token" );
            Fields text = new Fields( "text" );
            RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
            // only returns "token"
            Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
            // determine the word counts
            Pipe wcPipe = new Pipe( "wc", docPipe );
            wcPipe = new GroupBy( wcPipe, token );
            wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
             .addSource( docPipe, docTap )
             .addTailSink( wcPipe, wcTap );
            // write a DOT file and run the flow
            Flow wcFlow = flowConnector.connect( flowDef );
            wcFlow.writeDOT( "dot/" );
            wcFlow.complete();

                                                    flow planner for
                                                    Apache Hadoop
                                                    topology

Friday, 01 March 13                                                                                                   42
Here is the flow planner for Hadoop in the WordCount source
example topologies…

Friday, 01 March 13                                                                             43
Here are some examples of topologies for distributed computing --
Apache Hadoop being the first supported by Cascading,
followed by local mode, and now a tuple space (IMDG) flow planner in the works.

Several other widely used platforms would also be likely suspects for Cascading flow planners.
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction

More Related Content

Similar to The Workflow Abstraction

Using Cascalog to build
 an app based on City of Palo Alto Open Data
Using Cascalog to build
 an app based on City of Palo Alto Open DataUsing Cascalog to build
 an app based on City of Palo Alto Open Data
Using Cascalog to build
 an app based on City of Palo Alto Open Data
Paco Nathan
Enterprise Data Workflows with Cascading
Enterprise Data Workflows with CascadingEnterprise Data Workflows with Cascading
Enterprise Data Workflows with Cascading
Paco Nathan
10 Ways Your Blog Can Provide Real Value to You, Your Organization and Your B...
10 Ways Your Blog Can Provide Real Value to You, Your Organization and Your B...10 Ways Your Blog Can Provide Real Value to You, Your Organization and Your B...
10 Ways Your Blog Can Provide Real Value to You, Your Organization and Your B...
Rob Cottingham
10 ways your blog can provide value to you, your organization and your brand

The Workflow Abstraction

  • 1. “The Workflow Abstraction”, Strata SC, 2013-02-28. Paco Nathan (@pacoid), Concurrent, Inc., San Francisco, CA. Copyright @2013, Concurrent, Inc. Friday, 01 March 13. Background: a dual background in quantitative and distributed systems. I’ve spent the past decade leading innovative Data teams responsible for many successful large-scale apps.
  • 2. The Workflow Abstraction: 1. Funnel; 2. Circa 2008; 3. Cascading; 4. Sample Code; 5. Workflows; 6. Abstraction; 7. Trendlines. [Diagram: word count workflow. Labels: Document Collection, Tokenize, Scrub token, Regex token, HashJoin Left, Stop Word List (RHS), GroupBy token, Count, Word Count, with M and R markers.] This talk is about the workflow abstraction: the business process of structuring data; the practices of building robust apps at scale; the open source projects for Enterprise Data Workflows. We’ll consider some theory, examples, best practices, and trendlines: what are the drivers that brought us here, and where is this work heading? Most of all, the goal is to make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows, robust apps at scale, for Hadoop and beyond.
  • 3. Marketing Funnel – overview. In reference to Making Data Work… Almost every business uses a model similar to this, give or take a few steps. Customer leads go in at the top, those get refined through several stages, then results flow out the bottom. Funnel stages: Customers, Campaigns, Awareness, Interest, Evaluation, Conversion, Referral, Repeat. Let’s consider one of the most fundamental predictive models used in business: a marketing funnel. This is an exercise which I’ve had to run through at nearly every firm in recent years: analytics for the marketing funnel.
  • 4. Marketing Funnel – clickstream. Different funnel stages get represented in ecommerce by events captured in log files, as a class of machine data called clickstream: ad impressions; URL clicks; landing page views; new user registrations; session cookies; online purchases; social network activity; etc. [Diagram: funnel stages Customers, Campaigns, Awareness, Interest, Evaluation, Conversion, Referral, Repeat, annotated with the events Impression, Click, Sign Up, Purchase, and "Like".] Online advertising involves what we call “clickstream” data: lots of events in log files, i.e., lots of unstructured data.
  • 5. Marketing Funnel – metrics. A variety of clickstream metrics can be used as performance indicators at different stages of the funnel: CPM (cost per thousand impressions); CTR (click-through rate); CPA (cost per action); etc. [Diagram: funnel annotated with events and metrics: Impression, then Awareness (CPM); Click, then Interest (CTR); Sign Up, then Evaluation (behaviors); Purchase, then Conversion (CPA); "Like", then Referral (NPS, social graph, etc.); Repeat (loyalty, win back, etc.).] The many different highly-nuanced metrics which apply are mind-boggling :)
  • 6. Marketing Funnel – example calculations. Table columns: metric, cost, events, formula, rate. CPM: cost $4,000; events 10^6; formula $4,000 ÷ (10^6 ÷ 10^3); rate $4.00. CTR: cost -; events 3∙10^3; formula 3∙10^3 ÷ 10^6; rate 0.3%. CPA: cost -; events 20; formula $4,000 ÷ 20; rate $200. Here are examples of the kinds of calculations performed...
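The three formulas on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than code from the talk: the function names are assumptions, and the inputs are the slide's example figures ($4,000 spend, 10^6 impressions, 3∙10^3 clicks, 20 actions).

```python
def cpm(cost, impressions):
    """Cost per thousand impressions."""
    return cost / (impressions / 1000)


def ctr(clicks, impressions):
    """Click-through rate, as a fraction of impressions."""
    return clicks / impressions


def cpa(cost, actions):
    """Cost per action (e.g. per purchase)."""
    return cost / actions


# Example figures taken from the slide.
spend = 4000          # dollars
impressions = 10**6
clicks = 3 * 10**3
actions = 20

print(f"CPM = ${cpm(spend, impressions):.2f}")   # CPM = $4.00
print(f"CTR = {ctr(clicks, impressions):.1%}")   # CTR = 0.3%
print(f"CPA = ${cpa(spend, actions):.0f}")       # CPA = $200
```

At production scale these ratios would be computed per advertiser, campaign, and time window from aggregated log events; the arithmetic itself stays this simple.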
  • 7. Marketing Funnel – predictive model. Given these metrics, we can go further to estimate cost per paying user (CPP), customer lifetime value (LTV), etc. Then we can build a predictive model for return on investment (ROI) per customer, summarizing the funnel performance: ROI = (LTV − CPP) ∕ CPP. As an example, after crunching lots of logs, suppose that CPP = $200 and LTV = $2000; then ROI = ($2000 − $200) ∕ $200, for a 9x multiple. For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers, which describes the efficiency of the marketing funnel at different stages.
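The ROI formula on this slide can be checked with a short Python sketch. The function name `roi` and the guard against a non-positive CPP are assumptions added for illustration; the inputs are the slide's example values.

```python
def roi(ltv, cpp):
    """Return on investment per customer: (LTV - CPP) / CPP."""
    if cpp <= 0:
        # Assumption: a non-positive cost per paying user is treated as invalid input.
        raise ValueError("cost per paying user must be positive")
    return (ltv - cpp) / cpp


# Slide example: CPP = $200, LTV = $2000.
multiple = roi(ltv=2000, cpp=200)
print(f"ROI = {multiple:.0f}x")  # ROI = 9x
```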
  • 8. Marketing Funnel – example architecture. Let’s consider an example architecture for calculating, reporting, and taking action on funnel metrics, based on large-scale clickstream data… [Diagram labels: Customers, Web App, logs, Cache, Logs, Support, source tap, trap tap, sink tap, Data Workflow, Modeling, PMML, Analytics Cubes, customer profile DBs, Customer Prefs, Hadoop Cluster, Reporting.] Here’s an example architecture of using clickstream metrics within an online business.
  • 9. Marketing Funnel – complexities. Multiple ad partners, different contract terms, reporting different metrics at different times, click scrubs, etc. Campaigns target specific geo/demo, test alternate landing pages, and probably need to segment the customer base… These issues make clickstream data large and yet sparse. Other issues: seasonal variation; fluctuating currency exchange rates; distortions due to credit card fraud; diminishing returns; forecasting requirements. However, real life intervenes. In many businesses, this is a complicated model to calculate correctly: scrubs; many vendors, data sources, and different metrics to be aligned; lots of roll-ups; Bayesian point estimates; forecasts and dashboards; a social dimension which makes this convoluted, not simple.
  • 10. Marketing Funnel – very large scale. Even a small start-up may need to make decisions about billions of events, many millions of users, and millions of dollars in annual ad spend. Ad networks attempt to simplify and optimize parts of the funnel process as a value-add. The need for these insights has been a driver for Hadoop-related technologies. The needs for large scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related “Big Data” technologies.
  • 11. Marketing Funnel – very large scale. Callout over the previous slide: funnel modeling and optimization requires complex data workflows to obtain the required insights. These needs imply complex data workflows. It’s not about doing a BI query or a pivot table; that’s how retailers were thinking when Amazon came along.
  • 12. The Workflow Abstraction: 1. Funnel; 2. Circa 2008; 3. Cascading; 4. Sample Code; 5. Workflows; 6. Abstraction; 7. Trendlines. A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.
  • 13. Circa 2008 – Hadoop at scale. Scenario: Analytics team at a large ad network… Company had invested $MM capex in a large data warehouse across LOBs. Mission-critical app had been written as a large SQL workflow in the DW. Marketing funnel metrics were estimated for many advertisers, many campaigns, many publishers, many customers: billions of calculations daily. Predictive models matched publisher ~ advertiser and campaign ~ user, to optimize marketing funnel performance. [Diagram labels: clickstream, query/load, RDBMS, roll-ups, collab filter, per-user recommends.] Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network. Most of the revenue depended on one app, written in a DW: monolithic SQL which nobody at the company understood.
  • 14. Circa 2008 – Hadoop at scale. Issues: critical app had hit hard limits for scalability; several TB of data, 100’s of servers; batch window length vs. failure rate vs. SLA, in the context of business growth, posed an existential risk. We built out a team to address these issues as rapidly as possible… Needed to re-create that data workflow based on Enterprise requirements. Marching orders: 5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City; 5 weeks to reverse engineer the mission-critical app without any access to its author; 5 weeks to implement a Hadoop version which could scale out on EC2. We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.
  • 15. Circa 2008 – Hadoop at scale. Approach: reverse-engineered business process from ~1500 lines of undocumented SQL; created a large, multi-step Apache Hadoop app on AWS; leveraged cloud strategy to trade $MM capex for lower, scalable opex; Amazon identified our app as one of the largest Hadoop deployments on EC2; our app became a case study for AWS prior to Elastic MapReduce launch. [Diagram labels: clickstream, query/load, RDBMS, msg queue, HDFS, roll-ups, collab filter, per-user recommends.] Our solution involved dependencies among more than a dozen Hadoop job steps.
  • 16. Circa 2008 – Hadoop at scale. Unresolved: ETL was still a separate app; difficult to handle exceptions, notifications, debugging, etc., across the entire workflow; data scientists wore beepers since Ops lacked visibility into business process; coding directly in MapReduce created a staffing bottleneck. This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM, for troubleshooting, handling exceptions, notifications, etc. Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, which is almost never a good idea. Three issues about Enterprise workflows: a staffing bottleneck unless there’s a good abstraction layer; operational complexity, mostly due to lack of transparency; system integration problems *are* the main problem to solve.
  • 17. Circa 2008 – Hadoop at scale
    Unresolved:
    • ETL was still a separate app
    • difficult to handle exceptions, notifications, debugging, etc., across the entire workflow
    • data scientists wore beepers since Ops lacked visibility into business logic
    • coding directly in MapReduce created a staffing bottleneck
    Callout: a good solution for a large, commercial Apache Hadoop deployment, but the app’s workflow management lacked crucial features… which led to a search for a better workflow abstraction
    [Diagram: clickstream, RDBMS query/load, msg queue, HDFS, roll-ups, collab filter, per-user recommends]
    Speaker notes: While leading this team, I sought out other ways of managing a complex workflow involving Hadoop. I found out about the Cascading open source project, and called the API author. Oddly enough, as I was walking into the interview for my next job, we passed each other in the parking lot.
  • 18. The Workflow Abstraction
    1. Funnel 2. Circa 2008 3. Cascading 4. Sample Code 5. Workflows 6. Abstraction 7. Trendlines
    Speaker notes: Origin and overview of the Cascading API as a workflow abstraction for Enterprise Big Data apps.
  • 19. Cascading – origins
    API author Chris Wensel worked as a system architect at an Enterprise firm well-known for several popular data products.
    Wensel was following the Nutch open source project – before Hadoop even had a name.
    He noted that it would become difficult to find Java developers to write complex Enterprise apps directly in Apache Hadoop – a potential blocker for leveraging this new open source technology.
    Speaker notes: Cascading initially grew from interaction with the Nutch project, before Hadoop had a name. API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without an abstraction layer.
  • 20. Cascading – functional programming
    Key insight: MapReduce is based on functional programming – back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
    To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows:
    • leverages JVM and Java-based tools without a need to create an entirely new language
    • allows many programmers who have J2EE expertise to build apps that leverage the economics of Hadoop clusters
    Speaker notes: Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
  • 21. quotes…
    “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.”
    CIO, Thor Olavsrud, 2012-06-06
    “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.”
    2012 BOSSIE Awards, James Borck, 2012-09-18
    Speaker notes: Industry analysts are picking up on the staffing costs related to Hadoop: “no free lunch”.
    The issues:
    * staffing bottleneck
    * operational complexity
    * system integration
  • 22. Cascading – deployments
    • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
    • partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera
    • 5+ year history of Enterprise production deployments, ASL 2 license, GitHub src
    • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, genomics, climatology, etc.
    Speaker notes: Several published case studies about Cascading, Cascalog, Scalding, etc. Wide range of use cases. Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading. Partnerships with the various Hadoop distro vendors, cloud providers, etc.
  • 23. examples…
    • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments
    • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
      Cascalog in Clojure (2010)
      Scalding in Scala (2012)
    Speaker notes: Many case studies, many Enterprise production deployments now for 5+ years.
  • 24. examples…
    • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments
    • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
      Cascalog in Clojure (2010)
      Scalding in Scala (2012)
    Callout: Cascading as the basis for workflow abstractions atop Hadoop and more, with a 5+ year history of production deployments across multiple verticals
    Speaker notes: Cascading as a basis for workflow abstraction, for Enterprise data workflows.
  • 25. The Workflow Abstraction
    1. Funnel 2. Circa 2008 3. Cascading 4. Sample Code 5. Workflows 6. Abstraction 7. Trendlines
    Speaker notes: Code samples in Cascading / Cascalog / Scalding, based on Word Count.
  • 26. The Ubiquitous Word Count
    Definition: count how often each word appears in a collection of text documents
    This simple program provides an excellent test case for parallel processing, since it:
    • requires a minimal amount of code
    • demonstrates use of both symbolic and numeric values
    • shows a dependency graph of tuples as an abstraction
    • is not many steps away from useful search indexing
    • serves as a “Hello World” for Hadoop apps
    Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.
    Pseudocode:
      void map (String doc_id, String text):
        for each word w in segment(text):
          emit(w, "1");

      void reduce (String word, Iterator group):
        int count = 0;
        for each pc in group:
          count += Int(pc);
        emit(word, String(count));
    [Diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
    Speaker notes: Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already... Due to my close ties to Freemasonry, I’m obligated to speak about WordCount at this point.
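    The map and reduce pseudocode on this slide can be sketched in plain Java, with no Hadoop or Cascading involved. This is only an illustration under stated assumptions: the class name `WordCountSketch`, the whitespace tokenizer standing in for `segment(text)`, and the in-memory grouping standing in for the shuffle step are all hypothetical choices, not anything from the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // "map" phase: emit one (word, 1) pair per token in the text.
    // Splitting on whitespace is an assumed stand-in for segment(text).
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : text.split("\\s+")) {
            if (!w.isEmpty()) {
                pairs.add(Map.entry(w, 1));
            }
        }
        return pairs;
    }

    // "reduce" phase: sum the partial counts collected for one word.
    static int reduce(String word, List<Integer> group) {
        int count = 0;
        for (int pc : group) {
            count += pc;
        }
        return count;
    }

    // Local stand-in for the framework: run map over every document,
    // group the emitted pairs by word, then reduce each group.
    static Map<String, Integer> wordCount(Map<String, String> docs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            counts.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
            "doc01", "a rain shadow is a dry area",
            "doc02", "the dry area on the lee side");
        System.out.println(wordCount(docs));
    }
}
```

    In a real cluster the grouping step is done by the framework across machines; here it is a single sorted map, which is enough to show how the two functions divide the work.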
  • 27. word count – conceptual flow diagram
    [Diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
    1 map, 1 reduce, 18 lines of code
    Speaker notes: Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
  • 28. word count – Cascading app in Java
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect( flowDef );
      wcFlow.writeDOT( "dot/" );
      wcFlow.complete();
    [Diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
    Speaker notes: Based on a Cascading implementation of Word Count, here is sample code -- approx. 1/3 the code size of the Word Count example from Apache Hadoop. The second-to-last line generates a DOT file for the flow diagram.
  • 29. word count – generated flow diagram
    Flow plan, from head to tail, with the tuple fields passed along each edge:
      [head]
      Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
        ↓ [{2}:'doc_id', 'text']
      (map) Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
        ↓ [{1}:'token']
      GroupBy('wc')[by:['token']]
        ↓ wc[{1}:'token']
      (reduce) Every('wc')[Count[decl:'count']]
        ↓ [{2}:'token', 'count']
      Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
      [tail]
    Speaker notes: As a concrete example of literate programming in Cascading, here is the DOT representation of the flow plan -- generated by the app itself.
  • 30. word count – Cascalog / Clojure
      (ns impatient.core
        (:use [cascalog.api]
              [cascalog.more-taps :only (hfs-delimited)])
        (:require [clojure.string :as s]
                  [cascalog.ops :as c])
        (:gen-class))

      (defmapcatop split [line]
        "reads in a line of string and splits it by regex"
        (s/split line #"[\[\]\\\(\),.)\s]+"))

      (defn -main [in out & args]
        (?<- (hfs-delimited out)
             [?word ?count]
             ((hfs-delimited in :skip-header? true) _ ?line)
             (split ?line :> ?word)
             (c/count ?count)))

      ; Paul Lam
    [Diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
    Speaker notes: Here is the same Word Count app written in Clojure, using Cascalog.
  • 31. word count – Cascalog / Clojure
    • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language
    • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL
    • composable subqueries, used for test-driven development (TDD) practices at scale
    • Leiningen build: simple, no surprises, in Clojure itself
    • more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog
    • has a learning curve, limited number of Clojure developers
    • aggregators are the magic, and those take effort to learn
    [Diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
    Speaker notes: From what we see about language features, customer case studies, and best practices in general -- Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments. Great for large-scale, complex apps, where small teams must limit the complexities in their process.
  • 32. word count – Scalding / Scala
      import com.twitter.scalding._

      class WordCount(args : Args) extends Job(args) {
        Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
          .read
          .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
          .groupBy('token) { _.size('count) }
          .write(Tsv(args("wc"), writeHeader = true))
      }
    [Diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
    Speaker notes: Here is the same Word Count app written in Scala, using Scalding. Very compact, easy to understand; however, also more imperative than Cascalog.
  • 33. word count – Scalding / Scala
    • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
    • code is compact, easy to understand
    • nearly 1:1 between elements of conceptual flow diagram and function calls
    • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
    • significant investments by Twitter, Etsy, eBay, etc.
    • great for data services at scale
    • less of a learning curve than Cascalog, not as much of a high-level language
    [Diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
    Speaker notes: If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project -- that’s Scalding. That’s what they’re doing.
  • 34. word count – Scalding / Scala
    • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
    • code is compact, easy to understand
    • nearly 1:1 between elements of conceptual flow diagram and function calls
    • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
    • significant investments by Twitter, Etsy, eBay, etc.
    • great for data services at scale (imagine SOA infra @ Google as an open source project)
    • less of a learning curve than Cascalog, not as much of a high-level language
    Callout: Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process
    [Diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
    Speaker notes: Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
  • 35. The Workflow Abstraction
    1. Funnel 2. Circa 2008 3. Cascading 4. Sample Code 5. Workflows 6. Abstraction 7. Trendlines
    Speaker notes: Tracking back to the Marketing Funnel as an example workflow… Let’s consider how Cascading apps incorporate other components beyond Hadoop.
  • 36. Enterprise Data Workflows
    Back to our marketing funnel, let’s consider an example app… at the front end
    LOB use cases drive demand for apps
    [Diagram: Customers, Web App, logs, Cache, Support; source, sink, and trap taps around a Workflow on a Hadoop Cluster; Data Modeling (PMML), Analytics Cubes, customer profile DBs, Customer Prefs, Reporting]
    Speaker notes: LOB use cases drive the demand for Big Data apps.
  • 37. Enterprise Data Workflows
    An example… in the back office
    Organizations have substantial investments in people, infrastructure, process
    [Diagram: Customers, Web App, logs, Cache, Support; source, sink, and trap taps around a Workflow on a Hadoop Cluster; Data Modeling (PMML), Analytics Cubes, customer profile DBs, Customer Prefs, Reporting]
    Speaker notes: Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes.
  • 38. Enterprise Data Workflows
    An example… for the heavy lifting!
    “Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out
    [Diagram: Customers, Web App, logs, Cache, Support; source, sink, and trap taps around a Workflow on a Hadoop Cluster; Data Modeling (PMML), Analytics Cubes, customer profile DBs, Customer Prefs, Reporting]
    Speaker notes: “Main Street” firms have invested in Hadoop to address Big Data needs, offsetting their rising costs for Enterprise licenses from SAS, Teradata, etc.
  • 39. Cascading workflows – taps
    • taps integrate other data frameworks, as tuple streams
    • these are “plumbing” endpoints in the pattern language
    • sources (inputs), sinks (outputs), traps (exceptions)
    • text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
    • data serialization: Avro, Thrift, Kryo, JSON, etc.
    • extend a new kind of tap in just a few lines of Java
    Callout: schema and provenance get derived from analysis of the taps
    [Diagram: Customers, Web App, logs, Cache, Support; source, sink, and trap taps around a Workflow on a Hadoop Cluster; Data Modeling (PMML), Analytics Cubes, customer profile DBs, Customer Prefs, Reporting]
    Speaker notes: Speaking of system integration, taps provide the simplest approach for integrating different frameworks.
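    The three roles listed on this slide (source, sink, trap) can be illustrated with a small plain-Java sketch. The interfaces and class names below (`SourceTap`, `SinkTap`, `ListTap`, `TapSketch.run`) are hypothetical stand-ins invented for this example; they are not the Cascading `Tap` API, and in-memory lists take the place of HDFS, JDBC, and the other endpoints named above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class TapSketch {
    // Hypothetical stand-ins for the endpoint roles; NOT the Cascading API.
    interface SourceTap<T> { Iterable<T> read(); }
    interface SinkTap<T> { void write(T tuple); }

    // An in-memory endpoint that can play either role.
    static class ListTap<T> implements SourceTap<T>, SinkTap<T> {
        final List<T> tuples = new ArrayList<>();
        ListTap(List<T> initial) { tuples.addAll(initial); }
        public Iterable<T> read() { return tuples; }
        public void write(T tuple) { tuples.add(tuple); }
    }

    // Stream tuples from the source through one operation into the sink.
    // Any tuple whose operation throws is diverted to the trap instead of
    // failing the whole run.
    static <I, O> void run(SourceTap<I> source, Function<I, O> op,
                           SinkTap<O> sink, SinkTap<I> trap) {
        for (I tuple : source.read()) {
            try {
                sink.write(op.apply(tuple));
            } catch (RuntimeException e) {
                trap.write(tuple);
            }
        }
    }

    public static void main(String[] args) {
        ListTap<String> source = new ListTap<>(List.of("1", "2", "oops", "4"));
        ListTap<Integer> sink = new ListTap<>(List.of());
        ListTap<String> trap = new ListTap<>(List.of());
        run(source, (String s) -> Integer.parseInt(s), sink, trap);
        // prints: sink=[1, 2, 4] trap=[oops]
        System.out.println("sink=" + sink.tuples + " trap=" + trap.tuples);
    }
}
```

    The point of the sketch is the shape: the operation in the middle knows nothing about where tuples come from or go, so swapping an endpoint does not touch the business logic, and bad records land somewhere inspectable.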
  • 40. Cascading workflows – taps
    Same WordCount listing as slide 28, with the tap definitions highlighted:
      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
    Callout: source and sink taps for TSV data in HDFS
    Speaker notes: Here are the taps in the WordCount source.
  • 41. Cascading workflows – topologies
    • topologies execute workflows on clusters
    • flow planner is like a compiler for queries
      - Hadoop (MapReduce jobs)
      - local mode (dev/test or special config)
      - in-memory data grids (real-time)
    • flow planner can be extended to support other topologies
    Callout: blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG)
    [Diagram: Customers, Web App, logs, Cache, Support; source, sink, and trap taps around a Workflow on a Hadoop Cluster; Data Modeling (PMML), Analytics Cubes, customer profile DBs, Customer Prefs, Reporting]
    Speaker notes: Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
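    The idea that one logical flow can be handed to different planners can be sketched in plain Java. Everything here is an assumption made for illustration: `FlowPlanner`, `LocalPlanner`, and `ParallelPlanner` are hypothetical names, a flow is reduced to a single per-tuple function, and a parallel stream is only a toy stand-in for a cluster topology. None of this is the Cascading flow planner.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class PlannerSketch {
    // Hypothetical: a planner decides HOW a logical flow gets executed.
    interface FlowPlanner {
        <I, O> List<O> run(Function<I, O> flow, List<I> input);
    }

    // "local mode": run the flow sequentially in this thread.
    static class LocalPlanner implements FlowPlanner {
        public <I, O> List<O> run(Function<I, O> flow, List<I> input) {
            return input.stream().map(flow).collect(Collectors.toList());
        }
    }

    // Toy stand-in for a distributed topology: same flow, parallel execution.
    static class ParallelPlanner implements FlowPlanner {
        public <I, O> List<O> run(Function<I, O> flow, List<I> input) {
            return input.parallelStream().map(flow).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        // The logical flow is defined once, independent of any planner.
        Function<String, String> scrub = String::trim;
        Function<String, Integer> flow = scrub.andThen(String::length);

        List<String> input = List.of(" rain ", "shadow");
        System.out.println(new LocalPlanner().run(flow, input));
        System.out.println(new ParallelPlanner().run(flow, input));
    }
}
```

    Both planners return the same result for the same flow, which is the property that lets an app be developed in a local mode and later run on a cluster without rewriting the workflow definition.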
  • 42. Cascading workflows – topologies
    Same WordCount listing as slide 28, with the flow planner highlighted:
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
    Callout: flow planner for Apache Hadoop topology
    Speaker notes: Here is the flow planner for Hadoop in the WordCount source.
  • 43. example topologies…
    Speaker notes: Here are some examples of topologies for distributed computing -- Apache Hadoop being the first supported by Cascading, followed by local mode, and now a tuple space (IMDG) flow planner in the works. Several other widely used platforms would also be likely suspects for Cascading flow planners.