“Enterprise Data Workflows with Cascading”

Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid

zest.to/event63_77
2013-02-12
Copyright @2013, Concurrent, Inc.




Tuesday, 12 February 13                                                                                                     1
You may not have heard much about us, but you use our API in lots of places:
your bank, your airline, your hospital, your mobile device, your social network, etc.
Unstructured Data meets Enterprise Scale
• an example considered
• system integration: tearing down silos
• code samples
• data science perspectives: how we got here
• the workflow abstraction: many aspects of an app
• developer, analyst, scientist
• summary, references



Background: I’m a data scientist and engineering director who has spent the past decade
building and leading Data teams that created large-scale apps.

This talk is about using Cascading and related DSLs to build Enterprise Data Workflows.
Our emphasis is on leveraging the workflow abstraction for system integration, for mitigating complexity, and for producing simple, robust apps at scale.
We’ll show a little something for the developers, the analysts, and the scientists in the room.
Enterprise Data Workflows
[Figure: word-count workflow diagram. Labels: Document Collection, Tokenize, Scrub token, HashJoin (Left; RHS: Stop Word List), Regex token, GroupBy token, Count, Word Count; M and R markers]
an example considered




Let’s consider the matter of handling Big Data
from the perspective of building and maintaining Enterprise apps…
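
To make the example concrete, the word-count workflow in the diagram can be sketched in a few lines of plain Python. This is not the Cascading API, only an illustration of the steps the diagram names: tokenize, scrub, join against a stop-word list, group, and count. The step order is one reading of the diagram, and the stop-word list, the split pattern, and all function names are assumptions made for this sketch.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the slide does not show the real one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def tokenize(documents):
    """Split each document's text into raw tokens on whitespace and punctuation."""
    for text in documents:
        for token in re.split(r"[\s,.()\[\]]+", text):
            yield token


def scrub(tokens):
    """Normalize tokens: trim, lowercase, and drop empty strings."""
    for token in tokens:
        cleaned = token.strip().lower()
        if cleaned:
            yield cleaned


def drop_stop_words(tokens, stop_words):
    """Stand-in for the left join against the stop-word list: keep tokens with no match."""
    return (token for token in tokens if token not in stop_words)


def word_count(documents, stop_words=STOP_WORDS):
    """Run the whole pipeline; Counter groups identical tokens and counts each group."""
    return Counter(drop_stop_words(scrub(tokenize(documents)), stop_words))


docs = ["The rain in Spain", "A rain of data, and the data of rain."]
counts = word_count(docs)
```

Each stage is a generator, so tokens stream through the pipeline one at a time, which loosely mirrors the tuple-stream idea without any cluster behind it.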
Enterprise Data Workflows
an example…
[Figure: architecture diagram. Labels: Customers, Web App, Cache, Logs, Support, Modeling, PMML, Data Workflow, Analytics Cubes, Reporting, Customer Prefs, customer profile DBs, Hadoop Cluster; source, sink, and trap taps]




Apache Hadoop rarely gets used in isolation.
Enterprise Data Workflows
an example… the front end
[Figure: architecture diagram, repeated]




Line-of-business (LOB) use cases drive the demand for Big Data apps.
Enterprise Data Workflows
an example… the back office
[Figure: architecture diagram, repeated]




Enterprise organizations have seriously ginormous investments in existing back office practices:
people, infrastructure, processes
Enterprise Data Workflows
an example… the heavy lifting!
[Figure: architecture diagram, repeated]




“Main Street” firms have invested in Hadoop to address Big Data needs,
offsetting their rising costs for Enterprise licenses from SAS, Teradata, etc.
Enterprise Data Workflows
[Figure: word-count workflow diagram, repeated]
system integration: tearing down silos




The process of building Enterprise apps is largely about
system integration and business process meeting in the middle.
Cascading – definitions
• a pattern language for Enterprise Data Workflows
• simple to build, easy to test, robust in production
• design principles ⟹ ensure best practices at scale
[Figure: architecture diagram, repeated]




A pattern language ensures that best practices are followed by an implementation.

In this case, parallelization of deterministic query plans for reliable, Enterprise-scale workflows on Hadoop, etc.
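
One way to picture a deterministic plan is a workflow that gets described as a graph and checked before any data moves. The sketch below uses only the Python standard library (`graphlib`, Python 3.9+) and is not how Cascading's planner is implemented. The `plan` function and the step names, borrowed loosely from the word-count diagram, are assumptions for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def plan(steps):
    """Turn {step: [upstream steps]} into an execution order, failing before any data moves."""
    known = set(steps)
    for step, upstream in steps.items():
        missing = [name for name in upstream if name not in known]
        if missing:
            raise ValueError(f"{step} depends on undefined step(s): {missing}")
    try:
        # static_order() yields every step after all of its upstream steps.
        return list(TopologicalSorter(steps).static_order())
    except CycleError as err:
        # graphlib reports the detected cycle as the second element of args.
        raise ValueError(f"workflow has a cycle: {err.args[1]}") from err


# Hypothetical step names, loosely following the word-count diagram.
steps = {
    "document_collection": [],
    "stop_word_list": [],
    "tokenize": ["document_collection"],
    "scrub": ["tokenize"],
    "hash_join": ["scrub", "stop_word_list"],
    "group_by": ["hash_join"],
    "count": ["group_by"],
}
order = plan(steps)
```

A workflow with a missing or circular dependency is rejected at plan time, which is the kind of early failure a pattern language is meant to enforce.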
Cascading – usage
• Java API, Scala DSL Scalding, Clojure DSL Cascalog
• ASL 2 license, GitHub src, http://conjars.org
• 5+ yrs production use, multiple Enterprise verticals
[Figure: architecture diagram, repeated]




More than 5 years of large-scale Enterprise deployments
DSLs in Scala, Clojure, Jython, JRuby, Groovy, etc.
Maven repo for third-party contribs
quotes…

“Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.”
    CIO, Thor Olavsrud
    2012-06-06
    cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading

“Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.”
    2012 BOSSIE Awards, James Borck
    2012-09-18
    infoworld.com/slideshow/65089




Industry analysts are picking up on the staffing costs related to Hadoop, “no free lunch”
Cascading – deployments
• case studies: Twitter, Etsy, Climate Corp, Nokia, Factual, Williams-Sonoma, uSwitch, Airbnb, Square, Harvard, etc.
• partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera
• OSS frameworks built atop Cascading by: Twitter, Etsy, eBay, Climate Corp, uSwitch, YieldBot, etc.
• use cases: ETL, anti-fraud, advertising, recommenders, retail pricing, eCRM, marketing funnel, search analytics, genomics, climatology, etc.
[Figure: architecture diagram, repeated]




Several published case studies about Cascading, Scalding, Cascalog, etc.
Wide range of use cases.

Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading.
Partnerships with all Hadoop vendors.
case studies…

Upstream (Williams-Sonoma, Neiman Marcus)
concurrentinc.com/case-studies/upstream/
upstreamsoftware.com/blog/bid/86333/

Twitter (revenue team, publisher analytics)
concurrentinc.com/case-studies/twitter/
github.com/twitter/scalding/wiki

Airbnb (infrastructure team)
concurrentinc.com/case-studies/airbnb/
gigaom.com/data/meet-the-combo-behind-etsy-airbnb-and-climate-corp-hadoop-jobs/




Several customers using Cascading / Scalding / Cascalog have published case studies.
Here are a few.
Cascading – taps
• taps integrate other data frameworks, as tuple streams
• these are “plumbing” endpoints in the pattern language
• sources (inputs), sinks (outputs), traps (exceptions)
• where schema and provenance get determined
• text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
• data serialization: Avro, Thrift, Kryo, JSON, etc.
• extend in ~4 lines of Java
[Figure: architecture diagram, repeated]




Speaking of system integration,
taps provide the simplest approach for integrating different frameworks.
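
The three roles on the taps slide (source, sink, trap) can be sketched in plain Python. This is only an illustration of the pattern, not the Cascading tap API. The function names, the tab-delimited sample data, and the `to_cents` operation are all assumptions.

```python
import csv
import io


def source_tap(text):
    """Source: read tab-delimited text and emit each row as a tuple of named fields."""
    yield from csv.DictReader(io.StringIO(text), delimiter="\t")


def run_flow(source, operation, sink, trap):
    """Apply `operation` to each tuple; route failures to the trap instead of failing the flow."""
    for row in source:
        try:
            sink.append(operation(row))
        except (KeyError, ValueError) as err:
            trap.append((row, repr(err)))


def to_cents(row):
    # Hypothetical operation: convert a dollar amount field to integer cents.
    return {"customer": row["customer"], "cents": round(float(row["amount"]) * 100)}


raw = "customer\tamount\nalice\t12.50\nbob\tn/a\ncarol\t3.00\n"
sink, trap = [], []
run_flow(source_tap(raw), to_cents, sink, trap)
```

Here the row for `bob` lands in `trap` with the error that rejected it, while the other rows reach `sink`. One bad record does not stop the workflow, and the rejected data stays available for someone to inspect.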
Cascading – topologies
            • topologies execute workflows on clusters
            • flow planner is much like a compiler for queries
            • abstraction layers reduce training costs
            • Hadoop (MapReduce jobs)
            • local mode (dev/test or special config)
            • in-memory data grids (real-time)
            • flow planner can be extended to support other topologies
            • blend flows from different topologies into one app

Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
example topologies…




Here are some examples of topologies for distributed computing --
Apache Hadoop being the first supported by Cascading,
followed by local mode, and now a tuple space (IMDG) flow planner in the works.

Several other widely used platforms would also be likely suspects for Cascading flow planners.
Cascading – ANSI SQL
            • ANSI SQL parser/optimizer atop Cascading flow planner
            • JDBC driver to integrate into existing tools and app servers
            • surface a relational catalog over a collection of unstructured data
            • launch a SQL shell prompt to run queries
            • enable the analysts without retraining on Hadoop, etc.
            • transparency for Support, Ops, Finance, et al.
            • combine SQL flows with Scalding, Cascalog, etc.
            • based on collab with Optiq – industry-proven code base
            • keep the DBAs happy, and go home a hero!

Quite a number of projects have started out with Hadoop, then grafted a SQL-like syntax onto it. Somewhere.

We started out with a query planner used in Enterprise, then partnered with Optiq -- the team behind an Enterprise-proven code base for an ANSI SQL parser/optimizer.

In the sense that Splunk handles “machine data”, this SQL implementation provides “machine code”, as the lingua franca of Enterprise system integration.
how to query…
         abstraction     RDBMS                                    JVM Cluster
         parser          ANSI SQL compliant parser                ANSI SQL compliant parser
         optimizer       logical plan, optimized based on stats   logical plan, optimized based on stats
         planner         physical plan                            API “plumbing”
         machine data    query history, table stats               app history, tuple stats
         topology        b-trees, etc.                            heterogenous, distributed: Hadoop, in-memory, etc.
         visualization   ERD                                      flow diagram
         schema          table schema                             tuple schema
         catalog         relational catalog                       tap usage DB
         provenance      (manual audit)                           data set producers/consumers

When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters.
Cascading – machine learning
            • export predictive models as PMML
            • Cascading compiles to JVM classes for parallelization
            • migrate workloads: SAS, Microstrategy, Teradata, etc.
            • great OSS tools: R, Weka, KNIME, RapidMiner, etc.
            • run multiple models in parallel as customer experiments
            • Random Forest, Logistic Regression, GLM, Assoc Rules, Decision Trees, K-Means, Hierarchical Clustering, etc.
            • 2 lines of code required for integration
            • integrate with other libraries: Matrix API, Algebird, etc.
            • combine with other flows into one app: Java for ETL, Scala for data services, SQL for reporting, etc.

PMML has been around for a while, and export is supported by virtually every analytics platform,
covering a wide variety of predictive modeling algorithms.

Cascading reads PMML, building out workflows under the hood which run efficiently in parallel.

Much cheaper than buying a SAS license for your 2000-node Hadoop cluster ;)

Five companies are collaborating on this open source project, https://github.com/Cascading/cascading.pattern
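
To make the "read PMML, then score tuples" idea concrete, here is a minimal plain-Java sketch. It assumes a stripped-down PMML fragment containing only a RegressionTable with NumericPredictor elements (a full PMML document also carries a header, data dictionary, and mining schema). The class name PmmlSketch and its methods are hypothetical; this is not how the cascading.pattern project is implemented, only an illustration of turning a model file into a per-tuple function.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: parses a minimal PMML regression fragment and scores tuples.
class PmmlSketch {
    final double intercept;
    final Map<String, Double> coefficients;

    PmmlSketch(double intercept, Map<String, Double> coefficients) {
        this.intercept = intercept;
        this.coefficients = coefficients;
    }

    // Read the intercept and the per-field coefficients from the first RegressionTable.
    static PmmlSketch parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element table = (Element) doc.getElementsByTagName("RegressionTable").item(0);
            double intercept = Double.parseDouble(table.getAttribute("intercept"));
            Map<String, Double> coefficients = new LinkedHashMap<>();
            NodeList predictors = table.getElementsByTagName("NumericPredictor");
            for (int i = 0; i < predictors.getLength(); i++) {
                Element p = (Element) predictors.item(i);
                coefficients.put(p.getAttribute("name"),
                                 Double.parseDouble(p.getAttribute("coefficient")));
            }
            return new PmmlSketch(intercept, coefficients);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse PMML fragment", e);
        }
    }

    // Score one tuple: intercept plus the sum of coefficient * field value.
    double score(Map<String, Double> tuple) {
        double y = intercept;
        for (Map.Entry<String, Double> c : coefficients.entrySet()) {
            y += c.getValue() * tuple.get(c.getKey());
        }
        return y;
    }
}
```

Because score() depends only on the tuple it is handed, the same function can be applied to every tuple independently, which is what makes this kind of model easy to run in parallel.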
PMML support…




Here are just a few of the tools that people use to create predictive models for export as PMML
Cascading – test-driven development
            • assert patterns (regex) on the tuple streams
            • trap edge cases as “data exceptions”
            • adjust assert levels, like log4j levels
            • TDD at scale:
                1. start from raw inputs in the flow graph
                2. define stream assertions for each stage of transforms
                3. verify exceptions, code to eliminate them
                4. rinse, lather, repeat…
                5. when impl is complete, app has full test coverage
            • TDD follows from Cascalog’s composable subqueries
            • redirect traps in production to Ops, QA, Support, Audit, etc.

TDD is not usually high on the list when people start discussing Big Data apps.

Chris Wensel introduced into Cascading the notion of a “data exception”, and how to set stream assertion levels as part of the business logic of an application.

Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., arguably uses TDD as its methodology: start from ad-hoc queries written as logic predicates, then compose those predicates into large-scale apps.
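
The idea of stream assertions plus traps can be sketched in a few lines of plain Java. This is an illustration only, not the Cascading API: the TrapSketch class, its fields, and the Level enum (modeled loosely on the notion of adjustable assertion levels) are hypothetical names. Tuples that fail a regex assertion are diverted to a trap instead of failing the run, and an assertion planned above the active level is skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration of a stream assertion with a trap; not the Cascading API.
class TrapSketch {

    // Assertion levels, adjustable much like logging levels (hypothetical names).
    enum Level { NONE, VALID, STRICT }

    final List<String> passed = new ArrayList<>();
    final List<String> trapped = new ArrayList<>();

    // Apply a regex assertion to each tuple. Failures go to the trap as
    // "data exceptions" rather than aborting. If the assertion's level is
    // above the active level, the assertion is disabled and everything passes.
    void assertMatches(List<String> tuples, Pattern pattern,
                       Level assertionLevel, Level activeLevel) {
        boolean enabled = activeLevel != Level.NONE
            && assertionLevel.compareTo(activeLevel) <= 0;
        for (String t : tuples) {
            if (!enabled || pattern.matcher(t).matches()) {
                passed.add(t);
            } else {
                trapped.add(t);
            }
        }
    }
}
```

In a TDD loop, the contents of the trap are the failing test cases: inspect them, fix the transform, rerun, and repeat until the trap stays empty. In production the same trap output could be routed to whoever needs to see the bad records.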
Cascading – API design principles
           •   specify what is required, not how it must be achieved
           •   provide the “glue” for system integration
           •   no surprises
           •   same JAR, any scale
           •   plan far ahead (before consuming cluster resources)
           •   fail the same way twice




                Closely related to “functional relational programming”
                paradigm from Moseley & Marks 2006
                http://goo.gl/SKspn


Overview of the design principles embodied by Cascading as a pattern language…

Some aspects (Cascalog in particular) are closely related to “FRP” from Moseley/Marks 2006
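
As a rough illustration of "plan far ahead (before consuming cluster resources)", the sketch below checks that every stage of a flow can resolve the fields it needs before any data is read. It is plain Java under stated assumptions: PlanSketch and Stage are hypothetical types invented for this example, and the real flow planner does far more than field resolution.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration of planning before running; Stage is a hypothetical
// type, not a Cascading class.
class PlanSketch {

    static final class Stage {
        final String name;
        final List<String> requires;  // fields this stage reads
        final List<String> declares;  // fields this stage emits
        final boolean resultsOnly;    // true: keep only declared fields; false: append them

        Stage(String name, List<String> requires, List<String> declares, boolean resultsOnly) {
            this.name = name;
            this.requires = requires;
            this.declares = declares;
            this.resultsOnly = resultsOnly;
        }
    }

    // Walk the stages in order, tracking which fields are available, and fail
    // on the first field that cannot be resolved. No data is touched here.
    static Set<String> plan(List<String> sourceFields, List<Stage> stages) {
        Set<String> available = new LinkedHashSet<>(sourceFields);
        for (Stage s : stages) {
            for (String f : s.requires) {
                if (!available.contains(f)) {
                    throw new IllegalStateException("stage '" + s.name
                        + "' cannot resolve field '" + f + "' from " + available);
                }
            }
            if (s.resultsOnly) {
                available = new LinkedHashSet<>(s.declares);
            } else {
                available.addAll(s.declares);
            }
        }
        return available;
    }
}
```

A misnamed field then fails deterministically at planning time, on a laptop or on a cluster alike, which is one way to read "fail the same way twice".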
Enterprise Data Workflows
         [flow diagram: Document Collection → Tokenize → Scrub token → HashJoin (left side; Stop Word List on the RHS) → Regex token → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]

                 code samples:
                 Word Count

Let’s make this real, show some code…
the ubiquitous word count
         definition:
           count how often each word appears in a collection of text documents

         this simple program provides an excellent test case for
         parallel processing, since it:
            ‣ requires a minimal amount of code
            ‣ demonstrates use of both symbolic and numeric values
            ‣ shows a dependency graph of tuples as an abstraction
            ‣ is not many steps away from useful search indexing
            ‣ serves as a “Hello World” for Hadoop apps

         any distributed computing framework which can run Word Count
         efficiently in parallel at scale can handle much larger and
         more interesting compute problems



word count – pseudocode

         void map (String doc_id, String text):
              for each word w in segment(text):
                  emit(w, "1");

         void reduce (String word, Iterator partial_counts):
              int count = 0;
              for each pc in partial_counts:
                  count += Int(pc);
              emit(word, String(count));
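
For anyone who wants to run the pseudocode, here is a minimal single-process Java sketch of the same map and reduce functions, with a small driver standing in for the shuffle. It assumes that segment(text) means splitting on whitespace, and the class name WordCountSketch is made up for this example; no Hadoop is involved.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the map and reduce phases; no Hadoop involved.
class WordCountSketch {

    // map phase: emit one (word, 1) pair per token in the text
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : text.split("\\s+")) {   // assumption: segment() = whitespace split
            if (!w.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return pairs;
    }

    // reduce phase: sum the partial counts collected for one word
    static int reduce(String word, List<Integer> partialCounts) {
        int count = 0;
        for (int pc : partialCounts) {
            count += pc;
        }
        return count;
    }

    // driver: map every document, group pairs by word (the "shuffle"), then reduce
    static Map<String, Integer> run(Map<String, String> docs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }
}
```

The grouping step in run() is the part a cluster framework performs across machines; map() and reduce() stay the same shape at any scale.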




word count – flow diagram

            [flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

            cascading.org/category/impatient
            gist.github.com/3900702

            1 map
            1 reduce
            18 lines code

word count – Cascading app
         String docPath = args[ 0 ];
         String wcPath = args[ 1 ];
         Properties properties = new Properties();
         AppProps.setApplicationJarClass( properties, Main.class );
         HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

         // create source and sink taps
         Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
         Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

         // specify a regex to split "document" text lines into token stream
         Fields token = new Fields( "token" );
         Fields text = new Fields( "text" );
         RegexSplitGenerator splitter =
           new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
         // only returns "token"
         Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
         // determine the word counts
         Pipe wcPipe = new Pipe( "wc", docPipe );
         wcPipe = new GroupBy( wcPipe, token );
         wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

         // connect the taps, pipes, etc., into a flow
         FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
          .addSource( docPipe, docTap )
          .addTailSink( wcPipe, wcTap );
         // write a DOT file and run the flow
         Flow wcFlow = flowConnector.connect( flowDef );
         wcFlow.writeDOT( "dot/wc.dot" );
         wcFlow.complete();
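
The split pattern in this app is meant as a character class covering space, square brackets, parentheses, comma, and period; in Java source the brackets and backslashes need escaping. The stand-alone sketch below shows what that pattern does to a line of text using only java.util.regex. TokenizeSketch is a hypothetical helper, and unlike the flow above it drops empty tokens, which is an extra step added here for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Stand-alone look at the tokenizing regex; does not use Cascading.
class TokenizeSketch {

    // character class: space, '[', ']', '(', ')', ',', '.'
    static final Pattern DELIMS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Split a line on any delimiter and drop the empty strings that appear
    // when two delimiters are adjacent, e.g. ", " or " (".
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : DELIMS.split(text)) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }
}
```

Without the empty-token filter, adjacent delimiters produce empty strings that would be counted as a "word" of their own, which is worth knowing when reading the raw output.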


word count – flow plan
         [head]
         Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
              [{2}:'doc_id', 'text']
              [{2}:'doc_id', 'text']
         Each('token')[RegexSplitGenerator[decl:'token'][args:1]]                (map)
              [{1}:'token']
              [{1}:'token']
         GroupBy('wc')[by:['token']]
              wc[{1}:'token']
              [{1}:'token']
         Every('wc')[Count[decl:'count']]                                        (reduce)
              [{2}:'token', 'count']
              [{1}:'token']
         Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
              [{2}:'token', 'count']
              [{2}:'token', 'count']
         [tail]

word count – Scalding / Scala
        import com.twitter.scalding._
         
        class WordCount(args : Args) extends Job(args) {
          Tsv(args("doc"),
               ('doc_id, 'text),
               skipHeader = true)
            .read
            .flatMap('text -> 'token) {
               text : String => text.split("[ \\[\\]\\(\\),.]")
             }
            .groupBy('token) { _.size('count) }
            .write(Tsv(args("wc"), writeHeader = true))
        }




word count – Scalding / Scala
          github.com/twitter/scalding/wiki
             ‣ extends the Scala collections API, distributed
                 lists become “pipes” backed by Cascading
             ‣ code is compact, easy to understand –
                 very close to conceptual flow diagram
             ‣ functional programming is great for expressing
                 complex workflows in MapReduce, etc.
             ‣ large-scale, complex problems can be handled
                 in just a few lines of code
             ‣ significant investments by Twitter, Etsy, eBay, etc.,
                 in this open source project
             ‣ extensive libraries are available for linear algebra,
                 abstract algebra, machine learning – e.g., “Matrix API”
             ‣ several large-scale apps in production deployments

             ‣ IMHO, especially great for data services at scale



Using a functional programming language to build flows works even better than trying to represent functional programming constructs within Java…
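To make the "extends the Scala collections API" point concrete, here is a minimal sketch of the same word count on plain in-memory Scala collections, with no Scalding or Hadoop involved. The object name `LocalWordCount`, the sample input, and the empty-token filter are assumptions added for illustration; they are not part of the slide code.

```scala
// A local, in-memory analogue of the Scalding job: same steps, ordinary collections.
object LocalWordCount {
  // Same delimiter class as the Scalding example: space, brackets, parens, comma, period.
  val Delimiters = "[ \\[\\]\\(\\),.]"

  // docs is a sequence of (doc_id, text) pairs, mirroring the TSV fields.
  def wordCount(docs: Seq[(String, String)]): Map[String, Int] =
    docs
      .flatMap { case (_, text) => text.split(Delimiters).toSeq } // Tokenize
      .filter(_.nonEmpty)                 // drop empty tokens (an addition, not in the slide code)
      .groupBy(identity)                  // GroupBy token
      .map { case (token, hits) => token -> hits.size } // Count
}

object LocalWordCountDemo extends App {
  val docs = Seq(
    ("doc01", "rain shadow (rain)"),
    ("doc02", "shadow, rain.")
  )
  LocalWordCount.wordCount(docs).toSeq.sorted.foreach(println)
}
```

The chain of `flatMap`, `groupBy`, and a size count lines up step for step with the Tokenize, GroupBy, and Count boxes in the flow diagram, which is the correspondence the Scalding version keeps while running on a cluster.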
word count – Cascalog / Clojure
          [flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]




         (ns impatient.core
           (:use [cascalog.api]
                 [cascalog.more-taps :only (hfs-delimited)])
           (:require [clojure.string :as s]
                     [cascalog.ops :as c])
           (:gen-class))

         (defmapcatop split [line]
           "reads in a line of string and splits it by regex"
           (s/split line #"[\[\]\\\(\),.)\s]+"))

         (defn -main [in out & args]
           (?<- (hfs-delimited out)
                [?word ?count]
                ((hfs-delimited in :skip-header? true) _ ?line)
                (split ?line :> ?word)
                (c/count ?count)))

         ; Paul Lam
         ; github.com/Quantisan/Impatient


word count – Cascalog / Clojure
          [flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]



         github.com/nathanmarz/cascalog/wiki
            ‣ implements Datalog in Clojure, with predicates backed by Cascading

            ‣ a truly declarative language – whereas Scalding lacks that aspect
                of functional programming
            ‣ run ad-hoc queries from the Clojure REPL, approx. 10:1 code
                reduction compared with SQL
            ‣ composable subqueries, for test-driven development (TDD) at scale

            ‣ fault-tolerant workflows which are simple to follow

            ‣ same framework used from discovery through to production apps

            ‣ FRP mitigates the s/w engineering costs of Accidental Complexity

            ‣ focus on the process of structuring data; not un/structured

            ‣ Leiningen build: simple, no surprises, in Clojure itself

            ‣ has a learning curve, limited number of Clojure developers

            ‣ aggregators are the magic, those take effort to learn

Enterprise Data Workflows
          [flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]




                  data science perspectives:
                  how we got here




Let’s examine an evolution of Data Science practice, subsequent to the 1997 Q3 inflection point which enabled huge ecommerce successes, and commercialized Big Data
circa 1996: pre- inflection point

          [diagram labels: Stakeholder, Customers, Product, Engineering, BI Analysts, Web App, RDBMS; Excel pivot tables, PowerPoint slide decks, SQL Query result sets; flows: strategy, requirements, optimized code, transactions]




Ah, teh olde days - Perl and C++ for CGI :)

Feedback loops shown in red represent data innovations at the time…

Characterized by slow, manual processes:
data modeling / business intelligence; “throw it over the wall”…
this thinking led to impossible silos
circa 2001: post- big ecommerce successes

          [diagram labels: Stakeholder, Product, Customers, Engineering, Algorithmic Modeling, Web Apps, Middleware, Logs, DW, ETL, RDBMS; dashboards, UX, models, servlets, recommenders + classifiers, aggregation, event history, SQL Query result sets, customer transactions]




Q3 1997: Greg Linden @ Amazon, Randy Shoup @ eBay -- independent teams arrived at the same conclusion:
parallelize workloads onto clusters of commodity servers (Intel/Linux) to scale-out horizontally.
Google and Inktomi (YHOO Search) were working along the same lines.

MapReduce grew directly out of this effort. LinkedIn, Facebook, Twitter, Apple, etc., follow.
Algorithmic modeling, which leveraged machine data, allowed for Big Data to become monetized.
REALLY monetized :)

Leo Breiman wrote an excellent paper in 2001, “Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization).

MapReduce came from work in 2002. Google is now three generations beyond that -- while the Global 1000 struggles to rationalize Hadoop practices.

Google gets upset when people try to “open the kimono”; however, Twitter is in SF where that’s a national pastime :) To get an idea of what powers Google internally, check the open source
projects: Scalding, Matrix API, Algebird, etc.
circa 2013: clusters everywhere

          [diagram labels: introduced capability: Domain Expert, Data Scientist, App Dev, Ops; existing SDLC: Prod, s/w dev, Eng, Ops; Data Products, Customers, Workflow, History, Planner, taps, Web Apps, Mobile, etc., Use Cases Across Topologies, Hadoop etc., Log Events, In-Memory Data Grid, DW, Cluster Scheduler, RDBMS; business process, dashboard metrics, data science, discovery + modeling, optimized capacity, services, social interactions, transactions, content, batch, near time]




Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams.
Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric.

Not unlike a practice at LLL, where 4x more data gets collected about the machine than about the experiment.
asymptotically…
           • smarter, more robust clusters
           • increased leverage of machine data
               for automation and optimization
           • DSLs focused on scalability, testability,
               reducing s/w engineering complexity
           • increased use of “machine code”,
               who writes SQL directly?
           • workflows incorporating more
               “moving parts”
           • less about “bigness” of data,
               more about complexity of process
           • greater instrumentation ⟹
               even more machine data,
               increased feedback

           [diagram labels: DSL, Planner/Optimizer, Workflow, App History, Cluster, Cluster Scheduler]




Enterprise Data Workflows: more about “complex” process than about “big” data
references…


                  by Leo Breiman
                  Statistical Modeling:
                  The Two Cultures
                  Statistical Science, 2001
                  bit.ly/eUTh9L

                  also check out RStudio:
                  rstudio.org/
                  rpubs.com/

For a really great discussion of the fundamentals of Data Science and the process of algorithmic modeling (analyzing the 1997 inflection point), refer back to Breiman 2001.
references…


                   by DJ Patil

                   Data Jujitsu
                   O’Reilly, 2012
                   amazon.com/dp/B008HMN5BE

                   Building Data Science Teams
                   O’Reilly, 2011
                   amazon.com/dp/B005O4U3ZE

In terms of building data products, see DJ Patil's mini-books from O'Reilly:
Building Data Science Teams
Data Jujitsu
Enterprise Data Workflows




                  the workflow abstraction:
                  many aspects of an app




The workflow abstraction helps make Hadoop accessible to a broader audience of developers.

Let’s take a look at how organizations can leverage it in other important ways…
the workflow abstraction
          Tuple Flows, Pipes, Taps, Filters, Joins, Traps, etc.
          …in other words, “plumbing” as a pattern language
          for managing the complexity of Big Data in Enterprise apps
          on many levels



          [flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]




The workflow abstraction,
a pattern language for building robust, scalable Enterprise apps,
which works on many levels across an organization…
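As a rough illustration of the "plumbing" idea, the stop-word flow in the diagram can be sketched on plain Scala collections. This is not Cascading or Scalding code: the object name `StopWordFlow`, the `scrub` rule (trim and lowercase), and the sample data are assumptions, and the Regex step is modeled simply as "keep tokens that found no match in the stop-word list".

```scala
// In-memory sketch of: Tokenize -> Scrub -> HashJoin Left -> Regex -> GroupBy -> Count.
object StopWordFlow {
  val Delimiters = "[ \\[\\]\\(\\),.]"

  // Scrub token: an assumed normalization rule (trim, lowercase).
  def scrub(token: String): String = token.trim.toLowerCase

  def run(docs: Seq[(String, String)], stopWords: Set[String]): Map[String, Int] = {
    val tokens = docs
      .flatMap { case (_, text) => text.split(Delimiters).toSeq } // Tokenize
      .map(scrub)                                                  // Scrub token
      .filter(_.nonEmpty)

    // HashJoin Left: every token survives, paired with its stop-word match if any (RHS).
    val joined: Seq[(String, Option[String])] =
      tokens.map(token => (token, stopWords.find(_ == token)))

    joined
      .collect { case (token, None) => token }            // Regex token: keep unmatched rows
      .groupBy(identity)                                  // GroupBy token
      .map { case (token, hits) => token -> hits.size }   // Count
  }
}

object StopWordFlowDemo extends App {
  val docs = Seq(
    ("doc01", "A rain shadow is a dry area"),
    ("doc02", "The rain is dry")
  )
  println(StopWordFlow.run(docs, Set("a", "is", "the")))
}
```

Each line of the pipeline maps onto one box of the diagram, so adding or removing a step changes the code and the picture in the same place.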
rather than arguing SQL vs. NoSQL…
           this kind of work focuses on
           the process of structuring data

           which must occur long before work
           on large-scale joins, visualizations,
           predictive models, etc.
           so the process of structuring data is
           what we examine here:
           i.e., how to build workflows
           for Big Data

           thank you Dr. Codd
           “A relational model of data for large shared data banks”
           dl.acm.org/citation.cfm?id=362685




Instead, in Data Science work we must focus on *the process of structuring data*,
which must happen before the large-scale joins, predictive models, visualizations, etc.
The process of structuring data is what I will show here:
how to build workflows for Big Data.
Thank you, Dr. Codd.
workflow – abstraction layer
            • Cascading initially grew from interaction with the Nutch project, before
                Hadoop had a name; API author Chris Wensel recognized that MapReduce
                would be too complex for substantial work in an Enterprise context
            • 5+ years later, Enterprise app deployments on Hadoop are limited by
                staffing issues: difficulty of retraining staff, scarcity of Hadoop experts
            • the pattern language provides a structured method for solving large,
                complex design problems where the syntax of the language promotes
                use of best practices – which addresses staffing issues


First and foremost, the workflow represents an abstraction layer
to mitigate the complexity and costs of coding large apps directly in MapReduce.
workflow – literate collaboration
            • provides an intuitive visual representation for apps: flow diagrams
            • flow diagrams are quite valuable for cross-team collaboration
            • this approach leverages literate programming methodology,
                especially in DSLs written in functional programming languages
            • example: nearly 1:1 correspondence between function calls and
                flow diagram elements in Scalding
            • example: expert developers on cascading-users email list
                use flow diagrams to help troubleshoot issues remotely

Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming.

Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling -- the expert developers generally ask a novice to provide a flow diagram first.
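Cascading can export a planned flow as a Graphviz DOT file (`Flow.writeDOT`), which is where those shared diagrams typically come from. The sketch below does not call Cascading; it only shows, under assumed names (`Step`, `toDot`), how a list of steps and their inputs turns into DOT text that any Graphviz viewer can render.

```scala
// Hypothetical, self-contained model of a flow diagram rendered as Graphviz DOT.
object FlowDiagram {
  // A named step plus the names of the steps that feed it.
  final case class Step(name: String, inputs: Seq[String] = Nil)

  // One DOT node per step, one DOT edge per (input -> step) pair.
  def toDot(flowName: String, steps: Seq[Step]): String = {
    val nodes = steps.map(s => s"""  "${s.name}";""")
    val edges = for (s <- steps; input <- s.inputs) yield s"""  "$input" -> "${s.name}";"""
    (Seq(s"""digraph "$flowName" {""") ++ nodes ++ edges ++ Seq("}")).mkString("\n")
  }
}

object FlowDiagramDemo extends App {
  import FlowDiagram._
  val wordCount = Seq(
    Step("Document Collection"),
    Step("Tokenize", Seq("Document Collection")),
    Step("GroupBy token", Seq("Tokenize")),
    Step("Count", Seq("GroupBy token")),
    Step("Word Count", Seq("Count"))
  )
  println(toDot("wc", wordCount))
}
```

Because the diagram is plain text, it can be pasted into an email thread and compared against the code step by step.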
workflow – business process
           • imposes a separation of concerns between the capture of business
               process requirements, and the implementation details (Hadoop, etc.)
           • workflow orchestration evokes the notion of business process
               management for Enterprise apps (think BPM/BPEL)
           • Cascalog leverages Datalog features to make business process
               executable: “specify what you require, not how to achieve it”




Business Stakeholder POV:
business process management for workflow orchestration (think BPM/BPEL)
workflow – data architect
           • represents a physical plan for large-scale data flow management
           • tap schemes and tuple streams determine the relevant schema
           • a producer/consumer graph of tap identifier URIs provides a
               view of data provenance
           • cluster utilization vs. producer/consumer graph surfaces ROI
               for Hadoop-based data products
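
The provenance idea in the bullets above can be sketched in a few lines. This is a minimal illustration in plain Python, not the Cascading API: the flow names, the tap URIs, and the helpers `build_lineage` and `provenance` are all hypothetical, made up for this example. It assumes each flow can be described by the URIs of the taps it reads and writes, and walks that producer/consumer graph backwards from an output.

```python
from collections import defaultdict

# Hypothetical flow descriptions (not taken from the talk): each flow
# reads from source tap URIs and writes to sink tap URIs.
FLOWS = [
    {"name": "tokenize",
     "sources": ["hdfs://data/docs"],
     "sinks": ["hdfs://tmp/tokens"]},
    {"name": "wordcount",
     "sources": ["hdfs://tmp/tokens", "hdfs://data/stopwords"],
     "sinks": ["hdfs://out/wc"]},
]


def build_lineage(flows):
    """Map each sink URI to the set of source URIs that directly feed it."""
    parents = defaultdict(set)
    for flow in flows:
        for sink in flow["sinks"]:
            parents[sink].update(flow["sources"])
    return parents


def provenance(uri, parents):
    """Return every upstream URI that contributes to `uri`, directly or not."""
    seen = set()
    stack = [uri]
    while stack:
        current = stack.pop()
        # .get() avoids creating empty entries in the defaultdict
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    lineage = build_lineage(FLOWS)
    for upstream in sorted(provenance("hdfs://out/wc", lineage)):
        print(upstream)
```

Joining the same graph against per-flow cluster usage would be one way to approach the utilization question in the last bullet, though that step is not shown here.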




[Figure: workflow diagram with nodes Document Collection, Tokenize, Scrub token, Stop Word List (RHS), HashJoin Left, Regex token, GroupBy token, Count, Word Count; markers M and R]
Data Architect POV:
a physical plan for large-scale data flow management
Chicago Hadoop Users Group: Enterprise Data Workflows

  • 1. “Enterprise Data Workflows with Cascading”
    Paco Nathan, Concurrent, Inc., San Francisco, CA, @pacoid
    zest.to/event63_77, 2013-02-12. Copyright @2013, Concurrent, Inc.
    Tuesday, 12 February 13
    Notes: You may not have heard about us much, but you use our API in lots of places: your bank, your airline, your hospital, your mobile device, your social network, etc.
  • 2. Unstructured Data meets Enterprise Scale
      • an example considered
      • system integration: tearing down silos
      • code samples
      • data science perspectives: how we got here
      • the workflow abstraction: many aspects of an app
      • developer, analyst, scientist
      • summary, references
    Notes: Background: I’m a data scientist and an engineering director, and I spent the past decade building and leading Data teams which created large-scale apps. This talk is about using Cascading and related DSLs to build Enterprise Data Workflows. Our emphasis is on leveraging the workflow abstraction for system integration, for mitigating complexity, and for producing simple, robust apps at scale. We’ll show a little something for the developers, the analysts, and the scientists in the room.
  • 3. Enterprise Data Workflows: an example considered
    [Figure: word count flow diagram; labels: Document Collection, Tokenize, Scrub token, HashJoin Left, Regex token, GroupBy token, Stop Word List, RHS, Count, Word Count, M, R]
    Notes: Let’s consider the matter of handling Big Data from the perspective of building and maintaining Enterprise apps…
  • 4. Enterprise Data Workflows: an example…
    [Figure: example architecture diagram; labels: Customers, Web App, logs, Cache, Logs, Support, source tap, sink tap, trap tap, Data Workflow, Modeling, PMML, Analytics Cubes, customer profile DBs, Customer Prefs, Hadoop Cluster, Reporting]
    Notes: Apache Hadoop rarely gets used in isolation.
  • 5. Enterprise Data Workflows: an example… the front end
    [Figure: example architecture diagram, as on slide 4]
    Notes: LOB use cases drive the demand for Big Data apps.
  • 6. Enterprise Data Workflows: an example… the back office
    [Figure: example architecture diagram, as on slide 4]
    Notes: Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes.
  • 7. Enterprise Data Workflows: an example… the heavy lifting!
    [Figure: example architecture diagram, as on slide 4]
    Notes: “Main Street” firms have invested in Hadoop to address Big Data needs, offsetting their rising costs for Enterprise licenses from SAS, Teradata, etc.
  • 8. Enterprise Data Workflows: system integration: tearing down silos
    [Figure: word count flow diagram, as on slide 3]
    Notes: The process of building Enterprise apps is largely about system integration and business process, meeting in the middle.
  • 9. Cascading – definitions
      • a pattern language for Enterprise Data Workflows
      • simple to build, easy to test, robust in production
      • design principles ⟹ ensure best practices at scale
    [Figure: example architecture diagram, as on slide 4]
    Notes: A pattern language ensures that best practices are followed by an implementation. In this case, parallelization of deterministic query plans for reliable, Enterprise-scale workflows on Hadoop, etc.
  • 10. Cascading – usage
      • Java API, Scala DSL Scalding, Clojure DSL Cascalog
      • ASL 2 license, GitHub src, http://conjars.org
      • 5+ yrs production use, multiple Enterprise verticals
    [Figure: example architecture diagram, as on slide 4]
    Notes: More than 5 years of large-scale Enterprise deployments. DSLs in Scala, Clojure, Jython, JRuby, Groovy, etc. Maven repo for third-party contribs.
  • 11. quotes…
      “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.”
      CIO, Thor Olavsrud, 2012-06-06, cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading
      “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.”
      2012 BOSSIE Awards, James Borck, 2012-09-18, infoworld.com/slideshow/65089
    Notes: Industry analysts are picking up on the staffing costs related to Hadoop, “no free lunch”.
  • 12. Cascading – deployments
      • case studies: Twitter, Etsy, Climate Corp, Nokia, Factual, Williams-Sonoma, uSwitch, Airbnb, Square, Harvard, etc.
      • partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera
      • OSS frameworks built atop by: Twitter, Etsy, eBay, Climate Corp, uSwitch, YieldBot, etc.
      • use cases: ETL, anti-fraud, advertising, recommenders, retail pricing, eCRM, marketing funnel, search analytics, genomics, climatology, etc.
    [Figure: example architecture diagram, as on slide 4]
    Notes: Several published case studies about Cascading, Scalding, Cascalog, etc. Wide range of use cases. Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading. Partnerships with all Hadoop vendors.
  • 13. case studies…
      • (Williams-Sonoma, Neiman Marcus): concurrentinc.com/case-studies/upstream/ and upstreamsoftware.com/blog/bid/86333/
      • (revenue team, publisher analytics): concurrentinc.com/case-studies/twitter/ and github.com/twitter/scalding/wiki
      • (infrastructure team): concurrentinc.com/case-studies/airbnb/ and gigaom.com/data/meet-the-combo-behind-etsy-airbnb-and-climate-corp-hadoop-jobs/
    Notes: Several customers using Cascading / Scalding / Cascalog have published case studies. Here are a few.
  • 14. Cascading – taps
      • taps integrate other data frameworks, as tuple streams
      • these are “plumbing” endpoints in the pattern language
      • sources (inputs), sinks (outputs), traps (exceptions)
      • where schema and provenance get determined
      • text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
      • data serialization: Avro, Thrift, Kryo, JSON, etc.
      • extend in ~4 lines of Java
    [Figure: example architecture diagram, as on slide 4]
    Notes: Speaking of system integration, taps provide the simplest approach for integrating different frameworks.
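The source / sink / trap roles listed on slide 14 can be illustrated with a small plain-Java sketch. This is not the Cascading API: the class `TrapSketch` and its fields are hypothetical names invented for illustration, and in-memory lists stand in for real taps. It only shows the idea that a bad record is diverted to a trap rather than failing the whole flow.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of source / sink / trap roles; not the Cascading API.
final class TrapSketch {
    final List<Integer> sink = new ArrayList<>();  // normal output
    final List<String> trap = new ArrayList<>();   // records that failed ("data exceptions")

    // Pull each record from the source; route failures to the trap instead of aborting.
    void run(List<String> source) {
        for (String record : source) {
            try {
                sink.add(Integer.parseInt(record.trim()));
            } catch (NumberFormatException e) {
                trap.add(record);
            }
        }
    }

    public static void main(String[] args) {
        TrapSketch flow = new TrapSketch();
        flow.run(List.of("1", " 2 ", "three", "4"));
        System.out.println("sink=" + flow.sink + " trap=" + flow.trap);
    }
}
```

In a real deployment the trap would be another tap (a file or table) that Ops or QA can inspect, which is the point slide 21 returns to.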
  • 15. Cascading – topologies
      • topologies execute workflows on clusters
      • flow planner is much like a compiler for queries
      • abstraction layers reduce training costs
      • Hadoop (MapReduce jobs)
      • local mode (dev/test or special config)
      • in-memory data grids (real-time)
      • flow planner can be extended to support other topologies
      • blend flows from different topologies into one app
    [Figure: example architecture diagram, as on slide 4]
    Notes: Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
  • 16. example topologies…
    Notes: Here are some examples of topologies for distributed computing: Apache Hadoop being the first supported by Cascading, followed by local mode, and now a tuple space (IMDG) flow planner in the works. Several other widely used platforms would also be likely suspects for Cascading flow planners.
  • 17. Cascading – ANSI SQL
      • ANSI SQL parser/optimizer atop Cascading flow planner
      • JDBC driver to integrate into existing tools and app servers
      • surface a relational catalog over a collection of unstructured data
      • launch a SQL shell prompt to run queries
      • enable the analysts without retraining on Hadoop, etc.
      • transparency for Support, Ops, Finance, et al.
      • combine SQL flows with Scalding, Cascalog, etc.
      • based on collab with Optiq – industry-proven code base
      • keep the DBAs happy, and go home a hero!
    [Figure: example architecture diagram, as on slide 4]
    Notes: Quite a number of projects have started out with Hadoop, then grafted a SQL-like syntax onto it. Somewhere. We started out with a query planner used in Enterprise, then partnered with Optiq, the team behind an Enterprise-proven code base for an ANSI SQL parser/optimizer. In the sense that Splunk handles “machine data”, this SQL implementation provides “machine code”, as the lingua franca of Enterprise system integration.
  • 18. how to query…
      abstraction   | RDBMS                                   | JVM Cluster
      parser        | ANSI SQL compliant parser               | ANSI SQL compliant parser
      optimizer     | logical plan, optimized based on stats  | logical plan, optimized based on stats
      planner       | physical plan                           | API “plumbing”
      machine data  | query history, table stats              | app history, tuple stats
      topology      | b-trees, etc.                           | heterogeneous, distributed: Hadoop, in-memory, etc.
      visualization | ERD                                     | flow diagram
      schema        | table schema                            | tuple schema
      catalog       | relational catalog                      | tap usage DB
      provenance    | (manual audit)                          | data set producers/consumers
    Notes: When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters.
  • 19. Cascading – machine learning
      • export predictive models as PMML
      • Cascading compiles to JVM classes for parallelization
      • migrate workloads: SAS, MicroStrategy, Teradata, etc.
      • great OSS tools: R, Weka, KNIME, RapidMiner, etc.
      • run multiple models in parallel as customer experiments
      • Random Forest, Logistic Regression, GLM, Assoc Rules, Decision Trees, K-Means, Hierarchical Clustering, etc.
      • 2 lines of code required for PMML integration
      • integrate with other libraries: Matrix API, Algebird, etc.
      • combine with other flows into one app: Java for ETL, Scala for data services, SQL for reporting, etc.
    [Figure: example architecture diagram, as on slide 4]
    Notes: PMML has been around for a while, and export is supported by virtually every analytics platform, covering a wide variety of predictive modeling algorithms. Cascading reads PMML, building out workflows under the hood which run efficiently in parallel. Much cheaper than buying a SAS license for your 2000-node Hadoop cluster ;) Five companies are collaborating on this open source project, https://github.com/Cascading/cascading.pattern
  • 20. PMML support…
    Notes: Here are just a few of the tools that people use to create predictive models for export as PMML.
  • 21. Cascading – test-driven development
      • assert patterns (regex) on the tuple streams
      • trap edge cases as “data exceptions”
      • adjust assert levels, like log4j levels
      • TDD at scale:
          1. start from raw inputs in the flow graph
          2. define stream assertions for each stage of transforms
          3. verify exceptions, code to eliminate them
          4. rinse, lather, repeat…
          5. when impl is complete, app has full test coverage
      • TDD follows from Cascalog’s composable subqueries
      • redirect traps in production to Ops, QA, Support, Audit, etc.
    [Figure: example architecture diagram, as on slide 4]
    Notes: TDD is not usually high on the list when people start discussing Big Data apps. Chris Wensel introduced into Cascading the notion of a “data exception”, and how to set stream assertion levels as part of the business logic of an application. Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., arguably uses TDD as its methodology, in the transition from ad-hoc queries as logic predicates, then composing those predicates into large-scale apps.
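The idea on slide 21 of regex assertions on a tuple stream, switched on or off by level much like log4j levels, might be sketched as below. This is a plain-Java illustration under assumptions, not the Cascading API: the class `StreamAssertSketch`, its `Level` enum, and the method `assertMatches` are hypothetical names, and assertions are assumed to be declared at either `VALID` or `STRICT`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of regex stream assertions with adjustable levels; not the Cascading API.
final class StreamAssertSketch {
    enum Level { NONE, VALID, STRICT }  // higher levels enable more checks

    final Level planned;                // level the flow is run at
    final List<String> trapped = new ArrayList<>();

    StreamAssertSketch(Level planned) {
        this.planned = planned;
    }

    // Apply an assertion declared at `level` (VALID or STRICT);
    // it is skipped entirely when the flow runs at a lower level.
    List<String> assertMatches(Level level, String regex, List<String> tuples) {
        if (planned.compareTo(level) < 0) {
            return tuples;              // assertion switched off, stream passes through
        }
        Pattern pattern = Pattern.compile(regex);
        List<String> passed = new ArrayList<>();
        for (String tuple : tuples) {
            if (pattern.matcher(tuple).matches()) {
                passed.add(tuple);
            } else {
                trapped.add(tuple);     // edge case captured as a "data exception"
            }
        }
        return passed;
    }

    public static void main(String[] args) {
        StreamAssertSketch flow = new StreamAssertSketch(Level.STRICT);
        List<String> ok = flow.assertMatches(Level.STRICT, "[a-z]+", List.of("rain", "R2D2"));
        System.out.println("passed=" + ok + " trapped=" + flow.trapped);
    }
}
```

Running the same flow with `Level.NONE` skips every assertion, which mirrors the idea of strict checks during development and cheaper runs in production.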
  • 22. Cascading – API design principles
      • specify what is required, not how it must be achieved
      • provide the “glue” for system integration
      • no surprises
      • same JAR, any scale
      • plan far ahead (before consuming cluster resources)
      • fail the same way twice
    Closely related to the “functional relational programming” paradigm from Moseley & Marks 2006, http://goo.gl/SKspn
    Notes: Overview of the design principles embodied by Cascading as a pattern language… Some aspects (Cascalog in particular) are closely related to “FRP” from Moseley/Marks 2006.
  • 23. Enterprise Data Workflows: code samples: Word Count
    [Figure: word count flow diagram, as on slide 3]
    Notes: Let’s make this real, show some code…
  • 24. the ubiquitous word count
    definition: count how often each word appears in a collection of text documents
    this simple program provides an excellent test case for parallel processing, since it:
      ‣ requires a minimal amount of code
      ‣ demonstrates use of both symbolic and numeric values
      ‣ shows a dependency graph of tuples as an abstraction
      ‣ is not many steps away from useful search indexing
      ‣ serves as a “Hello World” for Hadoop apps
    any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems
  • 25. word count – pseudocode
      void map (String doc_id, String text):
        for each word w in segment(text):
          emit(w, "1");

      void reduce (String word, Iterator partial_counts):
        int count = 0;
        for each pc in partial_counts:
          count += Int(pc);
        emit(word, String(count));
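The pseudocode on slide 25 can be made concrete with a single-process Java sketch. No Hadoop is involved here: the class `WordCountSketch` and its `wordCount` driver are hypothetical stand-ins for the framework's shuffle step, the tokenizer regex is an assumption borrowed from the Cascading sample on slide 27, and the empty-token filter is an addition not present in the pseudocode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process sketch of the map/reduce pseudocode; no Hadoop involved.
final class WordCountSketch {

    // "map" phase: emit one (word, 1) pair per non-empty token in the text.
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : text.split("[ \\[\\]\\(\\),.]")) {
            if (!word.isEmpty()) {  // assumption: skip empty tokens between adjacent delimiters
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // "reduce" phase: sum the partial counts collected for one word.
    static int reduce(String word, List<Integer> partialCounts) {
        int count = 0;
        for (int pc : partialCounts) {
            count += pc;
        }
        return count;
    }

    // Stand-in for the framework: run map, group pairs by word, then reduce each group.
    static Map<String, Integer> wordCount(Map<String, String> docs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
            "doc01", "A rain shadow is a dry area.",
            "doc02", "This rain shadow is large.");
        System.out.println(wordCount(docs));
    }
}
```

On a cluster the grouping step is done by the framework between the map and reduce phases; here it is an in-memory map so the logic can be read end to end.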
  • 26. word count – flow diagram
    [Figure: flow diagram; labels: Document Collection, Tokenize, GroupBy token, Count, Word Count, M, R]
    cascading.org/category/impatient
    gist.github.com/3900702
    1 map, 1 reduce, 18 lines code
  • 27. word count – Cascading app
    [Figure: flow diagram, as on slide 26]
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect( flowDef );
      wcFlow.writeDOT( "dot/wc.dot" );
      wcFlow.complete();
  • 28. word count – flow plan
    [Figure: flow plan graph, node and edge labels in order from head to tail]
      [head]
      Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
        [{2}:'doc_id', 'text'] [{2}:'doc_id', 'text']
      map
      Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
        [{1}:'token'] [{1}:'token']
      GroupBy('wc')[by:['token']]
        wc[{1}:'token'] [{1}:'token']
      reduce
      Every('wc')[Count[decl:'count']]
        [{2}:'token', 'count'] [{1}:'token']
      Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
        [{2}:'token', 'count'] [{2}:'token', 'count']
      [tail]
  • 29. word count – Scalding / Scala
    [Figure: flow diagram, as on slide 26]
      import com.twitter.scalding._

      class WordCount(args : Args) extends Job(args) {
        Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
          .read
          .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
          .groupBy('token) { _.size('count) }
          .write(Tsv(args("wc"), writeHeader = true))
      }
  • 30. word count – Scalding / Scala
    [Figure: flow diagram, as on slide 26]
    github.com/twitter/scalding/wiki
      ‣ extends the Scala collections API, distributed lists become “pipes” backed by Cascading
      ‣ code is compact, easy to understand – very close to conceptual flow diagram
      ‣ functional programming is great for expressing complex workflows in MapReduce, etc.
      ‣ large-scale, complex problems can be handled in just a few lines of code
      ‣ significant investments by Twitter, Etsy, eBay, etc., in this open source project
      ‣ extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., “Matrix API”
      ‣ several large-scale apps in production deployments
      ‣ IMHO, especially great for data services at scale
    Notes: Using a functional programming language to build flows works even better than trying to represent functional programming constructs within Java…
  • 31. word count – Cascalog / Clojure
    [Figure: flow diagram, as on slide 26]
      (ns impatient.core
        (:use [cascalog.api]
              [cascalog.more-taps :only (hfs-delimited)])
        (:require [clojure.string :as s]
                  [cascalog.ops :as c])
        (:gen-class))

      (defmapcatop split [line]
        "reads in a line of string and splits it by regex"
        (s/split line #"[\[\]\(\),.)\s]+"))

      (defn -main [in out & args]
        (?<- (hfs-delimited out)
             [?word ?count]
             ((hfs-delimited in :skip-header? true) _ ?line)
             (split ?line :> ?word)
             (c/count ?count)))

      ; Paul Lam
      ; github.com/Quantisan/Impatient
  • 32. word count – Cascalog / Clojure
    [Figure: flow diagram, as on slide 26]
    github.com/nathanmarz/cascalog/wiki
      ‣ implements Datalog in Clojure, with predicates backed by Cascading
      ‣ a truly declarative language – whereas Scalding lacks that aspect of functional programming
      ‣ run ad-hoc queries from the Clojure REPL, approx. 10:1 code reduction compared with SQL
      ‣ composable subqueries, for test-driven development (TDD) at scale
      ‣ fault-tolerant workflows which are simple to follow
      ‣ same framework used from discovery through to production apps
      ‣ FRP mitigates the s/w engineering costs of Accidental Complexity
      ‣ focus on the process of structuring data; not un/structured
      ‣ Leiningen build: simple, no surprises, in Clojure itself
      ‣ has a learning curve, limited number of Clojure developers
      ‣ aggregators are the magic, those take effort to learn
  • 33. Enterprise Data Workflows: data science perspectives: how we got here
    [Figure: word count flow diagram, as on slide 3]
    Notes: Let’s examine an evolution of Data Science practice, subsequent to the 1997 Q3 inflection point which enabled huge ecommerce successes, and commercialized Big Data.
  • 34. circa 1996: pre-inflection point
    [Figure: process diagram; labels: Stakeholder, Customers, Excel pivot tables, PowerPoint slide decks, strategy, BI Analysts, Product, requirements, SQL Query, result sets, optimized code, Engineering, Web App, transactions, RDBMS]
    Notes: Ah, teh olde days - Perl and C++ for CGI :) Feedback loops shown in red represent data innovations at the time… Characterized by slow, manual processes: data modeling / business intelligence; “throw it over the wall”… this thinking led to impossible silos.
  • 35. circa 2001: post-big ecommerce successes
    [Figure: process diagram; labels: Stakeholder, Product, Customers, dashboards, UX, Engineering, models, servlets, recommenders + classifiers, Algorithmic Modeling, Web Apps, Middleware, aggregation, event history, SQL Query, result sets, customer transactions, Logs, DW, ETL, RDBMS]
    Notes: Q3 1997: Greg Linden @ Amazon, Randy Shoup @ eBay: independent teams arrived at the same conclusion: parallelize workloads onto clusters of commodity servers (Intel/Linux) to scale out horizontally. Google and Inktomi (YHOO Search) were working along the same lines. MapReduce grew directly out of this effort. LinkedIn, Facebook, Twitter, Apple, etc., follow. Algorithmic modeling, which leveraged machine data, allowed for Big Data to become monetized. REALLY monetized :)
    Leo Breiman wrote an excellent paper in 2001, “Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization).
    MapReduce came from work in 2002. Google is now three generations beyond that, while the Global 1000 struggles to rationalize Hadoop practices. Google gets upset when people try to “open the kimono”; however, Twitter is in SF where that’s a national pastime :) To get an idea of what powers Google internally, check the open source projects: Scalding, Matrix API, Algebird, etc.
  • 36. circa 2013: clusters everywhere
    [Figure: process diagram; labels include: Data Products, Customers, Domain Expert, business process, Workflow, dashboard metrics, data science, Data Scientist, discovery + modeling, Planner, optimized capacity, taps, App Dev, s/w dev, Web Apps, Mobile, etc., History, services, social interactions, transactions, content, Prod, Eng, Use Cases Across Topologies, Hadoop, etc., In-Memory Data Grid, Log Events, DW, Ops, batch, near time, Cluster Scheduler, RDBMS, SDLC; legend: introduced capability, existing]
    Notes: Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams. Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric. Not unlike a practice at LLL, where 4x more data gets collected about the machine than about the experiment.
  • 37. asymptotically…
      • smarter, more robust clusters
      • increased leverage of machine data for automation and optimization
      • DSLs focused on scalability, testability, reducing s/w engineering complexity
      • increased use of “machine code”, who writes SQL directly?
      • workflows incorporating more “moving parts”
      • less about “bigness” of data, more about complexity of process
      • greater instrumentation ⟹ even more machine data, increased feedback
    [Figure: stack diagram; labels: DSL, Planner/Optimizer, Workflow, App History, Cluster, Cluster Scheduler]
    Notes: Enterprise Data Workflows: more about “complex” process than about “big” data.
  • 38. references…
      by Leo Breiman: Statistical Modeling: The Two Cultures, Statistical Science, 2001, bit.ly/eUTh9L
      also check out RStudio: rstudio.org/ and rpubs.com/
    Notes: For a really great discussion about the fundamentals of Data Science and process for algorithmic modeling (analyzing the 1997 inflection point), refer back to Breiman 2001.
  • 39. references…
      by DJ Patil:
      Data Jujitsu, O’Reilly, 2012, amazon.com/dp/B008HMN5BE
      Building Data Science Teams, O’Reilly, 2011, amazon.com/dp/B005O4U3ZE
    Notes: In terms of building data products, see DJ Patil's mini-books on O'Reilly: Building Data Science Teams and Data Jujitsu.
  • 40. Enterprise Data Workflows: the workflow abstraction: many aspects of an app
    [Figure: word count flow diagram, as on slide 3]
    Notes: The workflow abstraction helps make Hadoop accessible to a broader audience of developers. Let’s take a look at how organizations can leverage it in other important ways…
  • 41. the workflow abstraction
    Tuple Flows, Pipes, Taps, Filters, Joins, Traps, etc. …in other words, “plumbing” as a pattern language for managing the complexity of Big Data in Enterprise apps on many levels
    [Figure: word count flow diagram, as on slide 3]
    Notes: The workflow abstraction, a pattern language for building robust, scalable Enterprise apps, which works on many levels across an organization…
  • 42. rather than arguing SQL vs. NoSQL…
    this kind of work focuses on the process of structuring data, which must occur long before work on large-scale joins, visualizations, predictive models, etc.
    so the process of structuring data is what we examine here: i.e., how to build workflows for Big Data
    thank you Dr. Codd: “A relational model of data for large shared data banks”, dl.acm.org/citation.cfm?id=362685
    Notes: Instead, in Data Science work we must focus on *the process of structuring data* that must happen before the large-scale joins, predictive models, visualizations, etc. The process of structuring data is what I will show here: how to build workflows from Big Data. Thank you, Dr. Codd.
  • 43. workflow – abstraction layer
      • Cascading initially grew from interaction with the Nutch project, before Hadoop had a name; API author Chris Wensel recognized that MapReduce would be too complex for substantial work in an Enterprise context
      • 5+ years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts
      • the pattern language provides a structured method for solving large, complex design problems where the syntax of the language promotes use of best practices – which addresses staffing issues
    [Figure: word count flow diagram, as on slide 3]
    Notes: First and foremost, the workflow represents an abstraction layer to mitigate the complexity and costs of coding large apps directly in MapReduce.
  • 44. workflow – literate collaboration
      • provides an intuitive visual representation for apps: flow diagrams
      • flow diagrams are quite valuable for cross-team collaboration
      • this approach leverages literate programming methodology, especially in DSLs written in functional programming languages
      • example: nearly 1:1 correspondence between function calls and flow diagram elements in Scalding
      • example: expert developers on cascading-users email list use flow diagrams to help troubleshoot issues remotely
    [Figure: word count flow diagram, as on slide 3]
    Notes: Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming. Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling: the expert developers generally ask a novice to provide a flow diagram first.
  • 45. workflow – business process
      • imposes a separation of concerns between the capture of business process requirements, and the implementation details (Hadoop, etc.)
      • workflow orchestration evokes the notion of business process management for Enterprise apps (think BPM/BPEL)
      • Cascalog leverages Datalog features to make business process executable: “specify what you require, not how to achieve it”
    [Figure: word count flow diagram, as on slide 3]
    Notes: Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL).
  • 46. workflow – data architect
      • represents a physical plan for large-scale data flow management
      • tap schemes and tuple streams determine the relevant schema
      • a producer/consumer graph of tap identifier URIs provides a view of data provenance
      • cluster utilization vs. producer/consumer graph surfaces ROI for Hadoop-based data products
    [Figure: word count flow diagram, as on slide 3]
    Notes: Data Architect POV: a physical plan for large-scale data flow management.
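The producer/consumer graph of tap URIs mentioned on slide 46 might look roughly like the plain-Java sketch below. Everything here is an assumption made for illustration: the class `ProvenanceSketch`, its methods, the flow names, and the URIs are hypothetical and are not taken from Cascading or from the talk.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical provenance index keyed by tap URI; not part of Cascading.
final class ProvenanceSketch {
    // tap URI -> names of flows that write it / read it
    final Map<String, Set<String>> producers = new TreeMap<>();
    final Map<String, Set<String>> consumers = new TreeMap<>();

    // Register one flow with the URIs of its source taps and sink taps.
    void addFlow(String flow, List<String> sourceUris, List<String> sinkUris) {
        for (String uri : sourceUris) {
            consumers.computeIfAbsent(uri, k -> new TreeSet<>()).add(flow);
        }
        for (String uri : sinkUris) {
            producers.computeIfAbsent(uri, k -> new TreeSet<>()).add(flow);
        }
    }

    // Flows that write the data set at this URI.
    Set<String> producersOf(String uri) {
        return producers.getOrDefault(uri, Set.of());
    }

    // Flows that read the data set at this URI.
    Set<String> consumersOf(String uri) {
        return consumers.getOrDefault(uri, Set.of());
    }

    public static void main(String[] args) {
        ProvenanceSketch graph = new ProvenanceSketch();
        // Made-up flow names and URIs, for illustration only.
        graph.addFlow("etl", List.of("hdfs:/logs"), List.of("hdfs:/clean"));
        graph.addFlow("report", List.of("hdfs:/clean"), List.of("jdbc:/cubes"));
        System.out.println("producers of hdfs:/clean: " + graph.producersOf("hdfs:/clean"));
        System.out.println("consumers of hdfs:/clean: " + graph.consumersOf("hdfs:/clean"));
    }
}
```

Joining such an index against cluster utilization per flow is one way the slide's "ROI for Hadoop-based data products" could be surfaced, though the talk does not say how that is done.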