Building Enterprise Apps
for Big Data with Cascading

Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid

[Figure: Word Count flow diagram, with M and R labels: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count]

Copyright @2012, Concurrent, Inc.
Enterprise Apps
 for Big Data
with Cascading
 1. backstory: how we got here
 2. build: Data Science teams
 3. pattern: common use cases
 4. intro: Cascading API
 5. tutorial: for the impatient
 6. code: sample apps
Intro to Cascading
1. backstory:
how we got here
inflection point
 huge Internet successes after 1997 holiday season…
 AMZN, EBAY, Inktomi (YHOO Search), then GOOG

 consider this metric:
   annual revenue per customer / amount of data stored
 which dropped 100x within a few years after 1997

 [Timeline graphic: 1997, 1998, 2004]

 storage and processing costs plummeted, now we must
 work much smarter to extract ROI from Big Data…
 our methods must adapt

 “conventional wisdom” of RDBMS and BI tools became
 less viable; however, business cadre was still focused on
 pivot tables and pie charts… which tends toward inertia!

 MapReduce and the Hadoop open source stack grew
 directly out of that contention… however, that effort
 only solves parts of the puzzle
inflection point: consequences
 Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm)
 Hadoop Summit, 2012:

 “All of Fortune 500 is now on notice over the next 10-year period.”
 Amazon and Google as exemplars of massive disruption in retail,
 advertising, etc.
 data as the major force displacing Global 1000 over the next decade,
 mostly through apps — verticals, leveraging domain expertise


 Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, etc.)
 XLDB, 2012:

 “Complex analytics workloads are now displacing SQL as the basis
  for Enterprise apps.”
primary sources
 Amazon
 “Early Amazon: Splitting the website” – Greg Linden
 glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

 eBay
 “The eBay Architecture” – Randy Shoup, Dan Pritchett
 addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
 addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

 Inktomi (YHOO Search)
 “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
 youtube.com/watch?v=E91oEn1bnXM

 Google
 “The Birth of Google” – John Battelle
 wired.com/wired/archive/13.08/battelle.html
 “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
 youtube.com/watch?v=qsan-GQaeyk
 perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
the world before…

BI, SQL, and highly
optimized code
data innovation: circa 1996
  [Diagram labels: Stakeholder; Customers; Product; Engineering; BI Analysts; Excel pivot tables; PowerPoint slide decks; strategy; requirements; optimized code; Web App; transactions; SQL Query result sets; RDBMS]
the world after…

machine learning,
leveraging log files
data innovation: circa 2001
  [Diagram labels: Stakeholder; Product; Customers; Engineering; Algorithmic Modeling; dashboards; UX; models; servlets; recommenders + classifiers; Web Apps; Middleware; aggregation; event history; SQL Query result sets; customer transactions; Logs; DW; ETL; RDBMS]
the world ahead…

what our customers
are doing now
data innovation: circa 2013
  [Diagram labels: Domain Expert, Data Scientist, App Dev, Ops (introduced capability); Prod, s/w dev, Eng, Ops (existing SDLC); Customers; Data Apps; business process; Workflow; dashboard metrics; History; services; Web Apps, Mobile, etc.; data science; Planner; social interactions; discovery + modeling; optimized capacity; transactions, content; endpoints; Data Access Patterns; Hadoop, etc.; Log Events; In-Memory Data Grid; DW; batch; "real time"; Cluster Scheduler; RDBMS]
a key difference…
statistical thinking


      Process              Variation               Data             Tools



  employing a mode of thought which includes both logical and analytical reasoning:
  evaluating the whole of a problem, as well as its component parts; attempting
  to assess the effects of changing one or more variables

  this approach attempts to understand not just problems and solutions,
  but also the processes involved and their variances

  particularly valuable in Big Data work when combined with hands-on experience in
  physics – roughly 50% of my peers come from physics or physical engineering…

  programmers typically don’t think this way…
  however, both systems engineers and data scientists must!
reference

  by Leo Breiman
  Statistical Modeling:
  The Two Cultures
  Statistical Science, 2001
  bit.ly/eUTh9L
2. build:
Data Science teams
core values

  Data Science teams develop actionable insights,
  building confidence for decisions

  that work may influence a few decisions worth
  billions (e.g., M&A) or billions of small decisions
  (e.g., AdWords)

  probably somewhere in-between…

  solving for pattern, at scale.

  an interdisciplinary pursuit which
  requires teams, not sole players
most valuable skills
 approximately 80% of the costs for data-related projects
 get spent on data preparation – mostly on cleaning up
 data quality issues: ETL, log file analysis, etc.

 unfortunately, data-related budgets for many companies tend
 to go into frameworks which can only be used after clean up

 most valuable skills:
   ‣ learn to use programmable tools that prepare data

   ‣ learn to generate compelling data visualizations

   ‣ learn to estimate the confidence for reported results

   ‣ learn to automate work, making analysis repeatable

 the rest of the skills – modeling,
 algorithms, etc. – those are secondary
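
One way to "estimate the confidence for reported results" is a percentile bootstrap. The following is a minimal sketch, not a method taken from the slides: the function name `bootstrap_ci`, its parameters, and the sample numbers are all hypothetical, chosen only for illustration.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)          # fixed seed keeps the analysis repeatable
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical daily conversion rates (made-up illustration data).
sample = [0.031, 0.027, 0.035, 0.029, 0.040, 0.026, 0.033, 0.030, 0.038, 0.028]
low, high = bootstrap_ci(sample)
print(f"mean={statistics.fmean(sample):.4f}  95% CI=({low:.4f}, {high:.4f})")
```

Reporting the interval alongside the point estimate, with a fixed seed, also serves the "make analysis repeatable" bullet.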
social caveats
 “This data cannot be correct!” may be an early warning
 about an organization itself
 much depends on how the people whom you work alongside
 tend to arrive at decisions:
     ‣ probably good: Induction, Abduction, Circumscription
     ‣ probably poor: Deduction, Speculation, Justification


 in general, one good data visualization
 puts many ongoing verbal arguments to rest
 however, let domain experts handle
 “data storytelling”, not data scientists
the science in data science?
  in a nutshell, what we do…

  ‣ estimate probability

  ‣ calculate analytic variance

  ‣ manipulate order complexity

  ‣ make use of learning theory

  +   collab with DevOps, Stakeholders

  +   reduce our work to cron entries

  [Figure: chart with rotated event-type axis labels, e.g. NUI:DressUpMode, Client Inventory Panel Apply Product, Website Login, Add Buddy, Chat Now, Unique Registration]
synthesis of the above
  MapReduce is Good Enough?
  Jimmy Lin, U Maryland + Twitter
  arxiv.org/pdf/1209.2191v1.pdf



  A Few Useful Things to Know about Machine Learning
  Pedro Domingos, U Washington
  homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
team process = needs

    discovery     help people ask the right questions
    modeling      allow automation to place informed bets
    integration   deliver products at scale to customers
    apps          build smarts into product features
    systems       keep infrastructure running, cost-effective
team composition = roles

       Domain Expert    business process, stakeholder
       Data Scientist   data prep, discovery, modeling, etc.
       App Dev          software engineering, automation
       Ops              systems engineering, access

       [Diagram labels: data science; introduced capability]
matrix = needs × roles
                 discovery   modeling   integration   apps   systems
   stakeholder
   scientist
   developer
   ops
matrix: example team
                 discovery   modeling   integration   apps   systems
   stakeholder
   scientist
   developer
   ops

 [Figure: needs × roles grid filled in for an example team]

 summary: this team seems heavy on systems, may need more overlap
 between modeling and integration, particularly among team leads
Q:
Can I simply hire one
rockstar data scientist
to cover all this work?
A: No, interdisciplinary
work requires teams.

A: Hire leads who speak
the lingo of each domain.

A: Hire people who cover
2+ roles, when possible.
reference

  by DJ Patil

  Data Jujitsu
  O’Reilly, 2012
  amazon.com/dp/B008HMN5BE

  Building Data Science Teams
  O’Reilly, 2011
  amazon.com/dp/B005O4U3ZE
3. pattern:
common use cases
CAP theorem
 purpose: theoretical limits for data access patterns
 essence:
    ‣ consistency
    ‣ availability
    ‣ partition tolerance




 best case scenario: you may pick two … or spend billions
 struggling to obtain all three at scale (GOOG)
 translated: cost of doing business

   www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf

   julianbrowne.com/article/viewer/brewers-cap-theorem
data access patterns
 design patterns: originated in consensus negotiation
 for architecture, later used in software engineering
 consider the corollaries in large-scale data work…
 essence: select data frameworks based on
 your data access patterns
 in other words, decouple use cases based on needs
  – avoid the “one size fits all” (OSFA) anti-pattern
 let’s review some examples…
access → frameworks → forfeits
  financial transactions              general ledger in RDBMS            CAx
  ad-hoc queries                      RDS (hosted MySQL)                 CAx
  reporting, dashboards               like Pentaho                       CAx
  log rotation/persistence            like Riak                          xxP
  search indexes                      like Lucene/Solr                   xAP
  static content, archives            S3 (durable storage)               xAP
  customer facts                      like Redis, Membase                xAP
  distributed counters, locks, sets   like Redis                         xAP*
  data objects CRUD                   key/value, like NoSQL on MySQL     CxP
  authoritative metadata              like Zookeeper                     CxP
  data prep, modeling at scale        like Hadoop/Cascading + R          CxP
  graph analysis                      like Hadoop + Redis + Gephi        CxP
  data marts                          like Hadoop/HBase                  CxP
and, since
“One Size Fits All”…
doesn’t
a selection of great tools…
   reporting:           Graphite, PowerPivot, Pentaho, Jaspersoft, SAS
   visualization:       ggplot2, D3, Gephi
   analytics/modeling:  R, Weka, Matlab, PMML, GLPK
   text:                LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK
   apps:                Cascading, Scalding, Cascalog, R markdown, SWF
   scale-out:           Scalr, RightScale, CycleComputing, vFabric, Beanstalk
   graph:               Gremlin, GraphLab, Neo4J
   column:              Vertica, HBase, Drill, Dynamo
   key/val:             Redis, Membase, MySQL
   index:               Lucene/Solr, ElasticSearch
   relational:          usual suspects
   imdg:                Spark, Storm, Gigaspaces
   hadoop:              EMR, HW, MapR, EMC, Azure, Compute
   machine data:        Splunk, collectd, Nagios
   durable storage:     S3, ASV, GCS, Riak, Couch
common use cases
  app patterns
use case: marketing funnel
  • must optimize a very large ad spend
  • different vendors report different metrics
  • seasonal variation distorts performance
  • some campaigns are much smaller than others
  • hard to predict ROI for incremental spend

  approach:
  • log aggregation, followed with cohort analysis
  • bayesian point estimates compare different-sized ad tests
  • customer lifetime value quantifies ROI of new leads
  • time series analysis normalizes for seasonal variation
  • geolocation adjusts for regional cost/benefit
  • linear programming models estimate elasticity of demand
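
To make the "bayesian point estimates compare different-sized ad tests" bullet concrete, here is a minimal sketch using a Beta-Binomial posterior mean. The prior (`prior_a=2`, `prior_b=48`, i.e. a 4% prior rate) and the campaign counts are assumptions invented for this example, not figures from the deck.

```python
def posterior_mean(conversions, impressions, prior_a=2.0, prior_b=48.0):
    """Posterior mean conversion rate under a Beta(prior_a, prior_b) prior.

    The prior acts like `prior_a + prior_b` pseudo-impressions, so small
    tests are pulled toward the prior rate more strongly than large ones.
    """
    return (prior_a + conversions) / (prior_a + prior_b + impressions)


# Hypothetical campaigns: a tiny test with a lucky streak vs. a large test.
small = posterior_mean(conversions=3, impressions=20)        # raw rate 0.15
large = posterior_mean(conversions=900, impressions=10_000)  # raw rate 0.09

print(f"small test: raw={3 / 20:.3f}  shrunk={small:.4f}")       # shrunk is about 0.0714
print(f"large test: raw={900 / 10_000:.3f}  shrunk={large:.4f}")  # shrunk is about 0.0898
```

On raw rates the small test looks better; after shrinkage the large test ranks higher, which is the point of comparing different-sized tests on posterior estimates rather than raw ratios.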
use case: ecommerce fraud
  • sparse data means lots of missing values
  • “needle in a haystack” lack of training cases
  • answers are available in large-scale batch, results
      are needed in real-time event processing
  • not just one pattern to detect – many, ever-changing

  approach:
  • random forest (RF) classifiers predict likely fraud
  • subsampled data to re-balance training sets
  • impute missing values based on density functions
  • train on massive log files, run on in-memory grid
  • adjust metrics to minimize customer support costs
  • detect novelty – report anomalies via notifications
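
The "subsampled data to re-balance training sets" step can be sketched as downsampling the majority class before training a classifier. This is an illustrative sketch only: the field name `is_fraud`, the function `rebalance`, and the synthetic rows are hypothetical.

```python
import random


def rebalance(rows, label_key="is_fraud", ratio=1.0, seed=7):
    """Downsample the majority (non-fraud) class to `ratio` x the fraud count."""
    rng = random.Random(seed)
    fraud = [r for r in rows if r[label_key]]
    legit = [r for r in rows if not r[label_key]]
    keep = min(len(legit), int(len(fraud) * ratio))
    balanced = fraud + rng.sample(legit, keep)   # sample without replacement
    rng.shuffle(balanced)                        # avoid label-ordered batches
    return balanced


# Synthetic "needle in a haystack" data: 10 fraud cases in 1000 rows.
rows = [{"id": i, "is_fraud": i % 100 == 0} for i in range(1000)]
balanced = rebalance(rows)
print(len(balanced), sum(r["is_fraud"] for r in balanced))
```

Downsampling discards data, so in practice the `ratio` parameter is a tuning choice; the sketch keeps every fraud case and only thins the non-fraud rows.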
use case: customer segmentation
  • many millions of customers, hard to determine
      which features resonate
  • multi-modal distributions get obscured by the
      practice of calculating an “average”
  • not much is known about individual customers

  approach:
  • connected components for sessionization, determining
      uniques from logs
  • estimates for age, gender, income, geo, etc.
  • clustering algorithms to group into market segments
  • social graph infers “unknown” relationships
  • covariance/heat maps visualize segments vs. feature sets
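
"Connected components for sessionization" can be sketched with a union-find structure: identifiers seen together in a log event are linked, and each resulting component is treated as one unique visitor. The identifier format (`cookie:…`, `login:…`) and the edge list are assumptions made up for this example.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a         # merge the two components

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Hypothetical pairs of identifiers observed in the same log event.
edges = [
    ("cookie:a1", "login:pat"),
    ("cookie:b2", "login:pat"),   # same login on a second browser
    ("cookie:c3", "login:sam"),
    ("cookie:d4", "cookie:d4"),   # anonymous visitor, never logged in
]
components = connected_components(edges)
print(len(components), "unique visitors")
```

Here the two cookies sharing `login:pat` collapse into one visitor, so four cookies resolve to three uniques.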
use case: monetizing content
  • need to suggest relevant content which would
      otherwise get buried in the back catalog
  • big disconnect between inventory and limited
      performance ad market
  • enormous amounts of text, hard to categorize

  approach:
  • text analytics glean key phrases from documents
  • hierarchical clustering of character frequencies detects language
  • latent Dirichlet allocation (LDA) reduces dimension to
      topic models
  • recommenders suggest similar topics to customers
  • collaborative filters connect known users with less known
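
  The character-frequency idea can be illustrated with a small sketch: build a letter-frequency profile per document and compare profiles with cosine similarity, which a hierarchical clustering step could then use as its similarity measure. The `CharProfile` class and its methods are hypothetical names, and cosine similarity is an assumption here; the slide does not say which measure was used.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a per-document character-frequency profile.
class CharProfile {
  // Relative frequency of each letter (case-folded; non-letters are ignored).
  static Map<Character, Double> profile(String text) {
    Map<Character, Double> freq = new HashMap<>();
    int total = 0;
    for (char ch : text.toLowerCase().toCharArray()) {
      if (Character.isLetter(ch)) {
        freq.merge(ch, 1.0, Double::sum);
        total++;
      }
    }
    for (Map.Entry<Character, Double> e : freq.entrySet()) {
      e.setValue(e.getValue() / total);
    }
    return freq;
  }

  // Cosine similarity between two profiles: 1.0 for identical shapes, 0.0 for no overlap.
  static double cosine(Map<Character, Double> a, Map<Character, Double> b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (Map.Entry<Character, Double> e : a.entrySet()) {
      dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
      normA += e.getValue() * e.getValue();
    }
    for (double v : b.values()) {
      normB += v * v;
    }
    if (normA == 0.0 || normB == 0.0) {
      return 0.0;
    }
    return dot / Math.sqrt(normA * normB);
  }
}
```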
Intro to Cascading
4. intro: Cascading API
Cascading API: purpose
  ‣ simplify data processing development and deployment

  ‣ improve application developer productivity

  ‣ enable data processing application manageability
Cascading API: a few facts
  Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc.

  in production (~5 yrs) at hundreds of enterprise Hadoop deployments:
  Finance, Health Care, Transportation, other verticals

  studies published about large use cases: Twitter, Etsy, Airbnb, Square,
  Climate Corporation, FlightCaster, Williams-Sonoma

  partnerships and distribution with SpringSource, Amazon AWS,
  Microsoft Azure, Hortonworks, MapR, EMC

  several open source projects built atop, managed by Twitter, Etsy, etc.,
  which provide substantial Machine Learning libraries

  DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy

  data “taps” integrate popular data frameworks via JDBC, Memcached, HBase,
  plus serialization in Apache Thrift, Avro, Kryo, etc.

  entire app compiles into a single JAR: fully connected for compiler optimization,
  exception handling, debugging, config, scheduling, etc.
Cascading API: a few quotes
 “Cascading gives Java developers the ability to build Big Data applications
  on Hadoop using their existing skillset … Management can really go out
  and build a team around folks that are already very experienced with Java.
  Switching over to this is really a very short exercise.”
   CIO, Thor Olavsrud, 2012-06-06
   cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading


 “Masks the complexity of MapReduce, simplifies the programming, and
  speeds you on your journey toward actionable analytics … A vast
  improvement over native MapReduce functions or Pig UDFs.”
   2012 BOSSIE Awards, James Borck, 2012-09-18
   infoworld.com/slideshow/65089


 “Company’s promise to application developers is an opportunity to build
  and test applications on their desktops in the language of choice with
  familiar constructs and reusable components”
   Dr. Dobb’s, Adrian Bridgwater, 2012-06-08
   drdobbs.com/jvm/where-does-big-data-go-to-get-data-inten/240001759
data+code “political spectrum”
 “Notes from the Mystery Machine Bus”
 by Steve Yegge, Google
 goo.gl/SeRZa
         “conservative”                             “liberal”
           (mostly) Enterprise                   (mostly) Start-Up

            risk management                    customer experiments

                assurance                            flexibility

          well-defined schema                   schema follows code
          explicit configuration                     convention

         type-checking compiler                 interpreted scripts

           wants no surprises                  wants no impediments

         Java, Scala, Clojure, etc.            PHP, Ruby, Python, etc.

  Cascading, Scalding, Cascalog, etc.   Hive, Pig, Hadoop Streaming, etc.
Cascading API: adoption

    As Enterprise apps move into
    Hadoop and related BigData
    frameworks, risk profiles shift
    toward more conservative
    programming practices

    Cascading provides a popular
    API for defining and managing
    Enterprise data workflows
enterprise data workflows
 Tuples, Pipelines, Endpoints, Operations, Joins, Assertions, Traps, etc.
 …in other words, “plumbing”

data workflows: team
  ‣ Business Stakeholder POV:
    business process management for workflow orchestration (think BPM/BPEL)

  ‣ Systems Integrator POV:
    system integration of heterogeneous data sources and compute platforms

  ‣ Data Scientist POV:
    a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc.

  ‣ Data Architect POV:
    a physical plan for large-scale data flow management

  ‣ Software Architect POV:
    a pattern language, similar to plumbing or circuit design

  ‣ App Developer POV:
    API bindings for Java, Scala, Clojure, Jython, JRuby, etc.

  ‣ Systems Engineer POV:
    a JAR file, has passed CI, available in a Maven repo
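
  For the Data Scientist POV above, Amdahl's Law gives the best-case speedup when a fraction p of the workflow parallelizes across n workers: 1 / ((1 - p) + p / n). A minimal sketch, with a hypothetical `Amdahl` class; estimating p for a real flow graph is outside what the slide covers.

```java
// Hypothetical helper illustrating Amdahl's Law.
class Amdahl {
  // p: fraction of the work that can run in parallel (0.0 to 1.0)
  // n: number of parallel workers
  static double speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
  }
}
```

  For example, with p = 0.9 the speedup on 10 workers is 1 / 0.19, a little over 5x, and it can never exceed 10x no matter how many workers are added, since the serial 10% remains.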
data workflows: layers
   business     domain expertise, business trade-offs,
   process      operating parameters, market position, etc.

   API          Java, Scala, Clojure, Jython, JRuby, Groovy, etc.
   language     …envision whatever runs in a JVM

   optimize /   major changes in technology now
   schedule

   physical     the flow diagram (Document Collection -> … -> Word Count),
   plan         labeled “assembler” code

   compute      Apache Hadoop, in-memory local mode
   substrate    …envision GPUs, streaming, etc.

   machine      Splunk, Nagios, Collectd, New Relic, etc.
   data
data workflows: SQL
        Relational
          SQL parser


          logical plan,
    optimized based on stats
          physical plan


         query history,
           table stats
          b-trees, etc.

              ERD


         table schema


            catalog
data workflows: SQL vs. JVM
   Relational                      Cascading + Driven

   SQL parser                      SQL-92 compliant parser (in progress)
   logical plan,                   TODO: logical plan,
     optimized based on stats        optimized based on stats
   physical plan                   API “plumbing”
   query history, table stats      app history, tuple stats
   b-trees, etc.                   distributed compute substrate:
                                     Hadoop, in-memory, etc.
   ERD                             flow diagram
   table schema                    tuple schema
   catalog                         endpoint usage DB
Intro to Cascading
5. tutorial: for the impatient
“Cascading for the Impatient”
  cascading.org/category/impatient/
  ‣ a series of introductory tutorials and code samples

  ‣ 1:1 code comparisons in Scalding, Cascalog, Pig, Hive

1: copy
 (diagram: Source -> Sink, map-only)

   import java.util.Properties;

   import cascading.flow.FlowDef;
   import cascading.flow.hadoop.HadoopFlowConnector;
   import cascading.pipe.Pipe;
   import cascading.property.AppProps;
   import cascading.scheme.hadoop.TextDelimited;
   import cascading.tap.Tap;
   import cascading.tap.hadoop.Hfs;

   public class
     Main
     {
     public static void
     main( String[] args )
       {
       String inPath = args[ 0 ];
       String outPath = args[ 1 ];

       Properties props = new Properties();
       AppProps.setApplicationJarClass( props, Main.class );
       HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

       // create the source tap (tab-delimited, with a header line)
       Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

       // create the sink tap
       Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

       // specify a pipe to connect the taps
       Pipe copyPipe = new Pipe( "copy" );

       // connect the taps, pipes, etc., into a flow
       FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
        .addSource( copyPipe, inTap )
        .addTailSink( copyPipe, outTap );

       // run the flow
       flowConnector.connect( flowDef ).complete();
       }
     }

  1 mapper
  0 reducers
 10 lines code
wait!



  ten lines of code
  for a file copy…
  seems like a lot.
same JAR, any scale…
                                                       MegaCorp Enterprise IT:
                                                       PBs of data
                                                       1000+ node private cluster
                                                       EVP calls you when app fails
                                                       runtime: days+

                                        Production Cluster:
                                        TBs of data
                                        EMR w/ 50 HPC Instances
                                        Ops monitors results
                                        runtime: hours – days

                    Staging Cluster:
                    GBs of data
                    EMR + 4 Spot Instances
                    CI shows red or green lights
                    runtime: minutes – hours

 Your Laptop:
 MBs of data
 Hadoop standalone mode
 passes unit tests, or not
 runtime: seconds – minutes
2: word count


 (diagram: Document Collection -> Tokenize -> GroupBy token -> Count -> Word Count)

  1 mapper
  1 reducer
 18 lines code                        gist.github.com/3900702
Cascading / Java

String docPath = args[ 0 ];
String wcPath = args[ 1 ];

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
 .addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
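
For checking expected counts on a small sample without a cluster, the same Tokenize -> GroupBy -> Count shape can be sketched with only the JDK. This is not Cascading code: `LocalWordCount` is a hypothetical name, it reuses the delimiter class from the splitter above, and dropping empty tokens is a choice of this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical JDK-only sketch of the word count flow, for small local samples.
class LocalWordCount {
  // Delimiters: space, square brackets, parentheses, comma, period.
  static final String DELIMS = "[ \\[\\]\\(\\),.]";

  static Map<String, Long> count(List<String> lines) {
    return lines.stream()
        // Tokenize: split each line into tokens
        .flatMap(line -> Arrays.stream(line.split(DELIMS)))
        // drop empty tokens produced by adjacent delimiters
        .filter(token -> !token.isEmpty())
        // GroupBy token, then Count
        .collect(Collectors.groupingBy(
            Function.identity(), TreeMap::new, Collectors.counting()));
  }
}
```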
Scalding / Scala

// Sujit Pal
// sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

package com.mycompany.impatient

import com.twitter.scalding._

class Part2(args : Args) extends Job(args) {
  val input = Tsv(args("input"), ('docId, 'text))
  val output = Tsv(args("output"))
  input.read.
    flatMap('text -> 'word) {
       text : String => text.split("""\s+""")
    }.
    groupBy('word) { group => group.size }.
    write(output)
}
Cascalog / Clojure

; Paul Lam
; github.com/Quantisan/Impatient

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
Hive

-- Steve Severance
-- stackoverflow.com/questions/10039949/word-count-program-in-hive

CREATE TABLE input (line STRING);

LOAD DATA LOCAL INPATH 'input.tsv'
OVERWRITE INTO TABLE input;

-- split each line on spaces, then count occurrences per word
SELECT
 word, COUNT(*)
FROM input
 LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word
;
Pig

-- kudos to Dmitriy Ryaboy

docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource')
  AS (doc_id, text);
docPipe = FILTER docPipe BY doc_id != 'doc_id';

-- specify regex to split "document" text lines into token stream
tokenPipe = FOREACH docPipe
  GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

-- determine the word counts
tokenGroups = GROUP tokenPipe BY token;
wcPipe = FOREACH tokenGroups
  GENERATE group AS token, COUNT(tokenPipe) AS count;

-- output
STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
EXPLAIN -out dot/wc_pig.dot -dot wcPipe;
3: wc + scrub


 (diagram: Document Collection -> Tokenize -> Scrub token -> GroupBy token -> Count -> Word Count)

  1 mapper
  1 reducer
 22+10 lines code
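
 The slide does not show the scrub step itself. A plausible reading, sketched here in plain Java under that assumption, is that each token is trimmed and lower-cased and empty tokens are dropped before grouping. `Scrub`, `scrubToken`, and `scrubAll` are hypothetical names, not Cascading API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a token "scrub" step (assumed behavior, not from the slides).
class Scrub {
  // Normalize one token: strip surrounding whitespace and lower-case it.
  static String scrubToken(String token) {
    return token.trim().toLowerCase(Locale.ROOT);
  }

  // Scrub every token and drop the ones that end up empty.
  static List<String> scrubAll(List<String> tokens) {
    List<String> out = new ArrayList<>();
    for (String token : tokens) {
      String scrubbed = scrubToken(token);
      if (!scrubbed.isEmpty()) {
        out.add(scrubbed);
      }
    }
    return out;
  }
}
```

 Scrubbing before the GroupBy means "Rain" and "rain" land in the same group, so the counts are per normalized word.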
4: wc + scrub + stop words


 (diagram: Document Collection -> Tokenize -> Scrub token -> HashJoin Left, with
  Stop Word List on the RHS -> Regex token -> GroupBy token -> Count -> Word Count)

  1 mapper
  1 reducer
 28+10 lines code
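
 Reading the diagram, stop words are removed by left-joining the token stream against the stop word list and then filtering on the joined field. The plain-Java sketch below mimics that shape with an in-memory set; `StopWords.remove` is a hypothetical name, and the exact filter used in the tutorial code is not shown on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: left join against a small stop list, then filter.
class StopWords {
  static List<String> remove(List<String> tokens, Set<String> stopList) {
    List<String> kept = new ArrayList<>();
    for (String token : tokens) {
      // left join: every token survives; unmatched tokens get an empty right-hand side
      String rhs = stopList.contains(token) ? token : "";
      // filter: keep only the rows whose right-hand side is empty (not a stop word)
      if (rhs.isEmpty()) {
        kept.add(token);
      }
    }
    return kept;
  }
}
```

 Holding the stop list in memory as the right-hand side is what lets this kind of join run on the map side, which fits the slide's count of one mapper and one reducer.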
5: tf-idf


 (diagram: the stop-word-filtered token stream from part 4 feeds three branches:
    D:  Unique doc_id -> Insert 1 -> SumBy doc_id
    DF: Unique token -> GroupBy token -> Count
    TF: GroupBy doc_id, token -> Count
  D is the RHS of a HashJoin with DF; the result is CoGrouped with TF, then
  ExprFunc tf-idf -> TF-IDF. A separate GroupBy token -> Count branch still
  produces Word Count.)

  11 mappers
   9 reducers
  65+10 lines code
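
  The three branches (D, DF, TF) feed a single expression. The slide does not spell the expression out; the sketch below assumes the common form tf * ln(nDocs / (1 + df)), which is at least consistent with the 0.9163 values in the sample results if the corpus has five documents: 1 * ln(5 / 2) = ln(2.5) ≈ 0.9163. `TfIdf` is a hypothetical class name.

```java
// Hypothetical sketch of the tf-idf expression (assumed formula, see lead-in).
class TfIdf {
  // tf:    count of the token within one document   (TF branch)
  // df:    number of documents containing the token (DF branch)
  // nDocs: total number of documents                (D branch)
  static double tfidf(long tf, long df, long nDocs) {
    return tf * Math.log((double) nDocs / (1.0 + df));
  }
}
```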
6: tf-idf + tdd


 (diagram: same flow as part 5, with an Assert ahead of Tokenize, a Checkpoint
  after the HashJoin, and Failure Traps attached to the flow)

  12 mappers
   9 reducers
  76+14 lines code
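
  The Assert and Failure Traps boxes express one idea: records that violate an expectation are diverted to a side output instead of killing the whole job. A JDK-only sketch of that routing follows; `TrapDemo.assertOrTrap` is a hypothetical name, and real Cascading assertions and traps are configured on the flow rather than called like this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of "assert, and trap what fails".
class TrapDemo {
  // Records that satisfy the assertion continue downstream;
  // records that fail it are appended to the trap for later inspection.
  static <T> List<T> assertOrTrap(List<T> input, Predicate<T> assertion, List<T> trap) {
    List<T> passed = new ArrayList<>();
    for (T record : input) {
      if (assertion.test(record)) {
        passed.add(record);
      } else {
        trap.add(record);
      }
    }
    return passed;
  }
}
```

  With a predicate such as "the line has exactly two tab-separated fields", malformed input lines end up in the trap output while the rest of the flow completes.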
deployed on AWS…


 elastic-mapreduce --create --name "TF-IDF" \
   --jar s3n://temp.cascading.org/impatient/part6.jar \
   --arg s3n://temp.cascading.org/impatient/rain.txt \
   --arg s3n://temp.cascading.org/impatient/out/wc \
   --arg s3n://temp.cascading.org/impatient/en.stop \
   --arg s3n://temp.cascading.org/impatient/out/tfidf \
   --arg s3n://temp.cascading.org/impatient/out/trap \
   --arg s3n://temp.cascading.org/impatient/out/check




 aws.amazon.com/elasticmapreduce/
results?

input:

doc_id  text
doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
zoink   null

output:

doc_id  tf-idf  token
doc02   0.9163  air
doc05   0.9163  australia
doc05   0.9163  broken
doc04   0.9163  california's
doc04   0.9163  cause
doc02   0.9163  cloudcover
doc04   0.9163  death
doc04   0.9163  deserts
doc03   0.9163  downwind
…
doc02   0.9163  sinking
doc04   0.9163  such
doc04   0.9163  valley
doc05   0.9163  women
doc03   0.5108  land
doc05   0.5108  land
doc01   0.5108  lee
doc02   0.5108  lee
doc03   0.5108  leeward
doc04   0.5108  leeward
doc01   0.4463  area
doc02   0.2231  area
doc03   0.2231  area
doc01   0.2231  dry
doc02   0.2231  dry
doc03   0.2231  dry
doc02   0.2231  mountain
doc03   0.2231  mountain
doc04   0.2231  mountain
doc01   0.0000  rain
doc02   0.0000  rain
doc03   0.0000  rain
doc04   0.0000  rain
doc01   0.0000  shadow
doc02   0.0000  shadow
doc03   0.0000  shadow
doc04   0.0000  shadow

[flow diagram: Document Collection → Assert → Tokenize → Scrub token → HashJoin Left (Stop Word List as RHS) → Regex token; branch D: Unique doc_id → Insert 1 → SumBy doc_id; branch DF: Unique token → GroupBy token → Count; branch TF: GroupBy doc_id, token → Count; HashJoin → Checkpoint → CoGroup → ExprFunc tf-idf → TF-IDF; GroupBy token → Count → Word Count; Failure Traps]
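The scores listed above are consistent with tf-idf = tf × ln(n_docs / (1 + df)), taking n_docs = 5 (that is, assuming the "zoink" row is not counted as a document). The Java sketch below only reproduces that arithmetic. The class and method names are hypothetical, and the expression is inferred from the numbers on the slide rather than quoted from the app's source.

```java
import java.util.Locale;

final class TfIdfSketch {
    /**
     * Assumed scoring expression: tf * ln(nDocs / (1 + df)).
     * tf    = occurrences of the token in one document
     * df    = number of documents containing the token
     * nDocs = total number of documents
     */
    static double tfidf(long tf, long df, long nDocs) {
        return tf * Math.log((double) nDocs / (1.0 + df));
    }

    static String fmt(double score) {
        // Locale.ROOT keeps the decimal point independent of the default locale.
        return String.format(Locale.ROOT, "%.4f", score);
    }

    public static void main(String[] args) {
        long nDocs = 5;
        System.out.println("air      (tf=1, df=1): " + fmt(tfidf(1, 1, nDocs))); // 0.9163
        System.out.println("land     (tf=1, df=2): " + fmt(tfidf(1, 2, nDocs))); // 0.5108
        System.out.println("area     (tf=2, df=3): " + fmt(tfidf(2, 3, nDocs))); // 0.4463
        System.out.println("mountain (tf=1, df=3): " + fmt(tfidf(1, 3, nDocs))); // 0.2231
        System.out.println("rain     (tf=1, df=4): " + fmt(tfidf(1, 4, nDocs))); // 0.0000
    }
}
```

Under this expression a token found in four of the five documents ("rain", "shadow") scores zero, while "area" in doc01 scores twice the value of its single occurrences in doc02 and doc03 because it appears twice in that document.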
comparisons?


 compare similar code in Scalding (Scala) and Cascalog (Clojure):

 sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
 based on: github.com/twitter/scalding/wiki


 github.com/Quantisan/Impatient
 based on: github.com/nathanmarz/cascalog/wiki
6. code: sample apps
Social Recommender

[diagram: Twitter → filter tweets (stop words) → calculate similarity (QA; threshold min, max) → Neo4j, LDA, Redis]

github.com/Cascading/SampleRecommender
 ‣ social recommender based on Twitter: suggest users who tweet about similar stocks
 ‣ instead of a cross-product (potential bottleneck) this runs in parallel on Hadoop
 ‣ uses a stop word list to remove common words, offensive phrases, etc.
 ‣ one tap measures token frequency: for QA, adjust stop words, improve filter, etc.
 ‣ adapted in Spring by Costin Leau
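One way to avoid a cross-product of all users, sketched here in plain Java rather than as a Cascading flow: invert the data to token → users, then score only the pairs that share at least one token. The deck does not say which similarity measure the app uses, so Jaccard overlap is an assumption, and the class name, method name, and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class SimilaritySketch {
    /** Returns "userA|userB" -> Jaccard similarity, for pairs sharing at least one token. */
    static Map<String, Double> similarUsers(Map<String, Set<String>> tokensByUser) {
        // Invert the input: token -> sorted set of users who tweeted it.
        Map<String, TreeSet<String>> usersByToken = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : tokensByUser.entrySet()) {
            for (String token : e.getValue()) {
                usersByToken.computeIfAbsent(token, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        // Count shared tokens per pair, touching only users that co-occur on a token.
        Map<String, Integer> shared = new TreeMap<>();
        for (TreeSet<String> users : usersByToken.values()) {
            List<String> sorted = new ArrayList<>(users);
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    shared.merge(sorted.get(i) + "|" + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        // Jaccard = |A ∩ B| / |A ∪ B|, with |A ∪ B| = |A| + |B| - |A ∩ B|.
        Map<String, Double> scores = new TreeMap<>();
        for (Map.Entry<String, Integer> e : shared.entrySet()) {
            String[] pair = e.getKey().split("\\|");
            int union = tokensByUser.get(pair[0]).size()
                      + tokensByUser.get(pair[1]).size() - e.getValue();
            scores.put(e.getKey(), (double) e.getValue() / union);
        }
        return scores;
    }
}
```

Grouping by token is the step that maps onto a GroupBy in a distributed flow: each token's user list can be processed independently, and users with no tokens in common never meet. The sketch assumes user ids contain no "|" character.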
SocRec: architecture

[architecture diagram:
   source: Twitter firehose ( uid, tweet, t )
   filter tweets, using stop words (low-freq batch updates)
   checkpoint: tokenized tweets
   calculate similarity, using similarity thresholds (threshold min, max)
   checkpoint: token frequency → QA → analysis + curation
   checkpoint: similar users
   sink: Neo4j: social graph
   sink: LDA: topic trending
   sink: Redis: results (uid: uidx, rank)]
SocRec: results

                  uid               recommend         weight
                  carbonfiberxrm    ClosingBellNews   0.1459
                  carbonfiberxrm    DJFunkyGrrL       0.0870
                  ClosingBellNews   DJFunkyGrrL       0.1491
                  CloudStocks       DJFunkyGrrL       0.1206
                  ElmoreNicole      DJFunkyGrrL       0.1798
                  EsNeey            alexiolo_         0.8603
                  ...
City of Palo Alto open data
[flow diagram:
   CoPA GIS export → Regex parser → tsv → Checkpoint tsv; Failure Traps
   tree: Regex filter → Regex parser → Scrub species → HashJoin Left (Tree Metadata as RHS) → Geohash
   road: Regex filter → Regex parser → HashJoin Left (Road Metadata as RHS) → Estimate Albedo → Road Segments → Geohash
   park: Regex filter
   tree + road: CoGroup → Tree Distance → Filter tree_dist → GroupBy tree_name → Checkpoint shade
   GPS logs → Geohash; CoGroup with shade → reco]

github.com/Cascading/CoPA/wiki
  ‣ GIS export for parks, roads, trees (unstructured / open data)
  ‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
  ‣ curated metadata, used to enrich the dataset
  ‣ could extend via mash-up with many available public data APIs

Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
“Find a shady spot on a summer day to walk near downtown and take a call…”
CoPA: log events
CoPA: results
[plot: Estimated Tree Height (meters); x-axis: avg_height (0 to 50), y-axis: density (0.00 to 0.12), legend: count (0 to 300)]

 ‣ addr: 115 HAWTHORNE AVE
 ‣ lat/lng: 37.446, -122.168
 ‣ geohash: 9q9jh0
 ‣ tree: 413 site 2
 ‣ species: Liquidambar styraciflua
 ‣ avg height: 23 m
 ‣ road albedo: 0.12
 ‣ distance: 10 m
 ‣ a short walk from my train stop ✔
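The result record above pairs a lat/lng with a geohash, the kind of key that lets tree, road, and GPS log records be grouped by location. Below is a minimal Java sketch of standard geohash encoding. The class and method names are hypothetical and this is not taken from the CoPA source. Note that coordinates rounded to three decimals can fall on the other side of a cell boundary, so longer hashes computed from the rounded values shown on the slide need not match the slide's six-character value.

```java
final class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Encodes a latitude/longitude pair as a geohash with the given number of characters. */
    static String encode(double lat, double lng, int length) {
        double latLo = -90.0, latHi = 90.0;
        double lngLo = -180.0, lngHi = 180.0;
        StringBuilder hash = new StringBuilder();
        boolean useLng = true;   // bits alternate, starting with longitude
        int bits = 0;
        int value = 0;
        while (hash.length() < length) {
            if (useLng) {
                double mid = (lngLo + lngHi) / 2;
                if (lng >= mid) { value = (value << 1) | 1; lngLo = mid; }
                else            { value = value << 1;       lngHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                if (lat >= mid) { value = (value << 1) | 1; latLo = mid; }
                else            { value = value << 1;       latHi = mid; }
            }
            useLng = !useLng;
            if (++bits == 5) {   // every 5 bits become one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // Four-character cell containing the slide's rounded coordinates.
        System.out.println(encode(37.446, -122.168, 4)); // 9q9j
    }
}
```

Records whose geohashes share a prefix fall in the same grid cell, so a fixed-length prefix can serve as a join key. Points that are close together but sit on opposite sides of a cell edge get different prefixes, which a production flow would need to handle, for example by also checking neighboring cells.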
drill-down


  blog, code/wiki/gists, jars, list, DevOps products:
  cascading.org/
  github.com/Cascading/
  conjars.org/
  goo.gl/KQtUL
  concurrentinc.com/
                                      pnathan@concurrentinc.com
                                      @pacoid

More Related Content

What's hot

Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Jonathan Seidman
 
Anil_BigData Resume
Anil_BigData ResumeAnil_BigData Resume
Anil_BigData ResumeAnil Sokhal
 
a9TD6cbzTZotpJihekdc+w==.docx
a9TD6cbzTZotpJihekdc+w==.docxa9TD6cbzTZotpJihekdc+w==.docx
a9TD6cbzTZotpJihekdc+w==.docxVasimMemon4
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Jonathan Seidman
 
Advanced analytics with sap hana and r
Advanced analytics with sap hana and rAdvanced analytics with sap hana and r
Advanced analytics with sap hana and rSAP Technology
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Search Engine - How to Make it
Search Engine - How to Make itSearch Engine - How to Make it
Search Engine - How to Make itAndreas Yunanto
 
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousingDataWorks Summit
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High AvailabilityCloudera, Inc.
 
Data Driven Innovation with Amazon Web Services
Data Driven Innovation with Amazon Web ServicesData Driven Innovation with Amazon Web Services
Data Driven Innovation with Amazon Web ServicesAmazon Web Services
 
[Hadoop] NexR Terapot: Massive Email Archiving
[Hadoop] NexR Terapot: Massive Email Archiving[Hadoop] NexR Terapot: Massive Email Archiving
[Hadoop] NexR Terapot: Massive Email ArchivingJinho Jung
 

What's hot (20)

Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
 
Anil_BigData Resume
Anil_BigData ResumeAnil_BigData Resume
Anil_BigData Resume
 
Treasure Data: Big Data Analytics on Heroku
Treasure Data: Big Data Analytics on HerokuTreasure Data: Big Data Analytics on Heroku
Treasure Data: Big Data Analytics on Heroku
 
Resume - Narasimha Rao B V (TCS)
Resume - Narasimha  Rao B V (TCS)Resume - Narasimha  Rao B V (TCS)
Resume - Narasimha Rao B V (TCS)
 
Galaxy of bits
Galaxy of bitsGalaxy of bits
Galaxy of bits
 
a9TD6cbzTZotpJihekdc+w==.docx
a9TD6cbzTZotpJihekdc+w==.docxa9TD6cbzTZotpJihekdc+w==.docx
a9TD6cbzTZotpJihekdc+w==.docx
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
Advanced analytics with sap hana and r
Advanced analytics with sap hana and rAdvanced analytics with sap hana and r
Advanced analytics with sap hana and r
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
The Other Way of Doing Big Data
The Other Way of Doing Big DataThe Other Way of Doing Big Data
The Other Way of Doing Big Data
 
Resume
ResumeResume
Resume
 
PRAFUL_HADOOP
PRAFUL_HADOOPPRAFUL_HADOOP
PRAFUL_HADOOP
 
PRAFUL_HADOOP
PRAFUL_HADOOPPRAFUL_HADOOP
PRAFUL_HADOOP
 
Treasure Data and Heroku
Treasure Data and HerokuTreasure Data and Heroku
Treasure Data and Heroku
 
Os Lonergan
Os LonerganOs Lonergan
Os Lonergan
 
Search Engine - How to Make it
Search Engine - How to Make itSearch Engine - How to Make it
Search Engine - How to Make it
 
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousing
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High Availability
 
Data Driven Innovation with Amazon Web Services
Data Driven Innovation with Amazon Web ServicesData Driven Innovation with Amazon Web Services
Data Driven Innovation with Amazon Web Services
 
[Hadoop] NexR Terapot: Massive Email Archiving
[Hadoop] NexR Terapot: Massive Email Archiving[Hadoop] NexR Terapot: Massive Email Archiving
[Hadoop] NexR Terapot: Massive Email Archiving
 

Building Enterprise Apps for Big Data with Cascading

  • 1. Building Enterprise Apps for Big Data with Cascading. Paco Nathan, Concurrent, Inc. pnathan@concurrentinc.com, @pacoid. [Diagram: word count workflow with nodes Document Collection, Tokenize, Scrub token, HashJoin Left (RHS: Stop Word List), Regex token, GroupBy token, Count, Word Count; markers M and R.] Copyright @2012, Concurrent, Inc.
  • 2. Enterprise Apps for Big Data with Cascading: (1) backstory: how we got here; (2) build: Data Science teams; (3) pattern: common use cases; (4) intro: Cascading API; (5) tutorial: for the impatient; (6) code: sample apps.
  • 3. Intro to Cascading. 1. backstory: how we got here. [Diagram: word count workflow with nodes Document Collection, Tokenize, Scrub token, HashJoin Left (RHS: Stop Word List), Regex token, GroupBy token, Count, Word Count.]
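The workflow diagram on this slide names a handful of stages: Tokenize, Scrub token, a HashJoin against a Stop Word List, GroupBy token, and Count. As a rough illustration of what such a flow computes, here is a minimal sketch in plain Python. It is not the Cascading API; the stage order, the function names, the stop word list, and the sample documents are all assumptions made for illustration, and a simple set lookup stands in for the join against the stop word list.

```python
import re
from collections import Counter

# Hypothetical stop word list; the list used in the deck is not shown.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def tokenize(doc):
    """Split a document into raw tokens on spaces and basic punctuation."""
    return re.split(r"[ \[\]\(\),.]", doc)


def scrub(token):
    """Normalize a token by trimming whitespace and lowercasing it."""
    return token.strip().lower()


def word_count(docs, stop_words=STOP_WORDS):
    """Tokenize, scrub, drop stop words, then group by token and count."""
    counts = Counter()
    for doc in docs:
        for raw in tokenize(doc):
            token = scrub(raw)
            # Skip empty tokens and anything found in the stop word list.
            if token and token not in stop_words:
                counts[token] += 1
    return counts


# Hypothetical sample documents.
docs = ["The rain in Spain", "rain and the plain"]
counts = word_count(docs)
```

Each function corresponds to one box in the diagram, which keeps the steps independently testable; in a distributed setting the grouping and counting would run as a separate phase from the per-token work.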
  • 4. inflection point: huge Internet successes after the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG. (Timeline markers on the slide: 1997, 1998, 2004.) Consider this metric: annual revenue per customer / amount of data stored, which dropped 100x within a few years after 1997. Storage and processing costs plummeted; now we must work much smarter to extract ROI from Big Data… our methods must adapt. The “conventional wisdom” of RDBMS and BI tools became less viable; however, the business cadre was still focused on pivot tables and pie charts… which tends toward inertia! MapReduce and the Hadoop open source stack grew directly out of that contention… however, that effort only solves parts of the puzzle.
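To make the metric on this slide concrete, the following is a small worked example in Python. The dollar and storage figures are hypothetical, chosen only so that the ratio falls by the 100x the slide mentions; they are not taken from the deck, and the function name and units are likewise assumptions.

```python
def revenue_per_data(annual_revenue_per_customer, data_stored):
    """Return annual revenue per customer divided by the amount of data stored."""
    return annual_revenue_per_customer / data_stored


# Hypothetical figures, for illustration only (not from the slides).
before = revenue_per_data(100.0, 1.0)    # e.g. $100 per customer, 1 unit of data stored
after = revenue_per_data(100.0, 100.0)   # same revenue, 100 units of data stored

# How many times smaller the metric became.
drop_factor = before / after
```

With revenue per customer held flat, a 100-fold growth in stored data is enough on its own to cut the metric by a factor of 100, which is the sense in which each stored byte has to be worked harder to return the same value.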
  • 5. inflection point: consequences. Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm), Hadoop Summit, 2012: “All of Fortune 500 is now on notice over the next 10-year period.” Amazon and Google as exemplars of massive disruption in retail, advertising, etc. Data as the major force displacing Global 1000 over the next decade, mostly through apps: verticals, leveraging domain expertise. Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, etc.), XLDB, 2012: “Complex analytics workloads are now displacing SQL as the basis for Enterprise apps.”
  • 6. primary sources. Amazon: “Early Amazon: Splitting the website” – Greg Linden, glinden.blogspot.com/2006/02/early-amazon-splitting-website.html. eBay: “The eBay Architecture” – Randy Shoup, Dan Pritchett, addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html, addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf. Inktomi (YHOO Search): “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff), youtube.com/watch?v=E91oEn1bnXM. Google: “The Birth of Google” – John Battelle, wired.com/wired/archive/13.08/battelle.html; “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff), youtube.com/watch?v=qsan-GQaeyk, perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
  • 7. the world before… BI, SQL, and highly optimized code
  • 8. data innovation: circa 1996. [Diagram labels: Stakeholder; Customers; Excel pivot tables; PowerPoint slide decks; strategy; BI Analysts; Product; requirements; SQL Query; result sets; Engineering; optimized code; Web App; transactions; RDBMS.]
  • 9. the world after… machine learning, leveraging log files
  • 10. data innovation: circa 2001 (diagram; roles and systems: Stakeholder, Product, Customers, UX, Engineering, Algorithmic Modeling, Web Apps, Middleware, Logs, DW, ETL, RDBMS; flows: dashboards, models, servlets, recommenders + classifiers, aggregation, event history, SQL query result sets, customer transactions)
  • 11. the world ahead… what our customers are doing now
  • 12. data innovation: circa 2013 (diagram of roles, data flows, and infrastructure; labels include: Customers, Domain Expert, Data Scientist, App Dev, Ops, Web Apps, Mobile, Data Access Patterns, Hadoop, In-Memory Data Grid, Log Events, DW, RDBMS, Cluster Scheduler, batch, "real time"; legend: introduced capability, existing SDLC)
  • 14. statistical thinking (diagram labels: Process, Variation, Data, Tools)
      ‣ employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables
      ‣ this approach attempts to understand not just problems and solutions, but also the processes involved and their variances
      ‣ particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering…
      ‣ programmers typically don’t think this way… however, both systems engineers and data scientists must!
  • 15. reference by Leo Breiman Statistical Modeling: The Two Cultures Statistical Science, 2001 bit.ly/eUTh9L
  • 16. Intro to Cascading (word count flow diagram) 2. build: Data Science teams
  • 17. core values (image credit: Wikipedia)
      ‣ Data Science teams develop actionable insights, building confidence for decisions
      ‣ that work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords)
      ‣ probably somewhere in-between… solving for pattern, at scale
      ‣ an interdisciplinary pursuit which requires teams, not sole players
  • 18. most valuable skills (image credit: D3)
      approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc.
      unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up
      most valuable skills:
      ‣ learn to use programmable tools that prepare data
      ‣ learn to generate compelling data visualizations
      ‣ learn to estimate the confidence for reported results
      ‣ learn to automate work, making analysis repeatable
      the rest of the skills – modeling, algorithms, etc. – those are secondary
  • 19. social caveats (image credit: xkcd)
      “This data cannot be correct!” may be an early warning about an organization itself
      much depends on how the people whom you work alongside tend to arrive at decisions:
      ‣ probably good: Induction, Abduction, Circumscription
      ‣ probably poor: Deduction, Speculation, Justification
      in general, one good data visualization puts many ongoing verbal arguments to rest
      however, let domain experts handle “data storytelling”, not data scientists
  • 20. the science in data science? in a nutshell, what we do… (background image: mirrored event-log labels, omitted)
      ‣ estimate probability
      ‣ calculate analytic variance
      ‣ manipulate order complexity
      ‣ make use of learning theory
      + collab with DevOps, Stakeholders
      + reduce our work to cron entries
  • 21. synthesis of the above MapReduce is Good Enough? Jimmy Lin, U Maryland + Twitter arxiv.org/pdf/1209.2191v1.pdf A Few Useful Things to Know about Machine Learning Pedro Domingos, U Washington homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
  • 22. team process = needs (image credit: Gephi)
      ‣ discovery: help people ask the right questions
      ‣ modeling: allow automation to place informed bets
      ‣ integration: deliver products at scale to customers
      ‣ apps: build smarts into product features
      ‣ systems: keep infrastructure running, cost-effective
  • 23. team composition = roles (shown beside the word count flow diagram; legend: introduced capability)
      ‣ Domain Expert: business process, stakeholder
      ‣ Data Scientist: data science: data prep, discovery, modeling, etc.
      ‣ App Dev: software engineering, automation
      ‣ Ops: systems engineering, access
  • 24. matrix = needs × roles (grid; columns: discovery, modeling, integration, apps, systems; rows: stakeholder, scientist, developer, ops)
  • 25. matrix: example team (grid; columns: discovery, modeling, integration, apps, systems; rows: stakeholder, scientist, developer, ops) summary: this team seems heavy on systems, may need more overlap between modeling and integration, particularly among team leads
  • 26. Q: Can I simply hire one rockstar data scientist to cover all this work?
  • 27. A: No, interdisciplinary work requires teams. A: Hire leads who speak the lingo of each domain. A: Hire people who cover 2+ roles, when possible.
  • 28. reference by DJ Patil Data Jujitsu O’Reilly, 2012 amazon.com/dp/B008HMN5BE Building Data Science Teams O’Reilly, 2011 amazon.com/dp/B005O4U3ZE
  • 29. Intro to Cascading (word count flow diagram) 3. pattern: common use cases
  • 30. CAP theorem purpose: theoretical limits for data access patterns essence: ‣ consistency ‣ availability ‣ partition tolerance best case scenario: you may pick two … or spend billions struggling to obtain all three at scale (GOOG) translated: cost of doing business www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf julianbrowne.com/article/viewer/brewers-cap-theorem
  • 31. data access patterns design patterns: originated in consensus negotiation for architecture, later used in software engineering consider the corollaries in large-scale data work… essence: select data frameworks based on your data access patterns in other words, decouple use cases based on needs – avoid the “one size fits all” (OSFA) anti-pattern let’s review some examples…
  • 32. access → frameworks → forfeits
      access pattern | framework | CAP forfeit
      financial transactions | general ledger in RDBMS | CAx
      ad-hoc queries | RDS (hosted MySQL) | CAx
      reporting, dashboards | like Pentaho | CAx
      log rotation/persistence | like Riak | xxP
      search indexes | like Lucene/Solr | xAP
      static content, archives | S3 (durable storage) | xAP
      customer facts | like Redis, Membase | xAP
      distributed counters, locks, sets | like Redis | xAP*
      data objects CRUD | key/value – like, NoSQL on MySQL | CxP
      authoritative metadata | like Zookeeper | CxP
      data prep, modeling at scale | like Hadoop/Cascading + R | CxP
      graph analysis | like Hadoop + Redis + Gephi | CxP
      data marts | like Hadoop/HBase | CxP
  • 33. access → frameworks → forfeits (same table as slide 32)
  • 34. and, since “One Size Fits All”… doesn’t
  • 35. a selection of great tools… reporting: visualization: Graphite, PowerPivot, ggplot2, D3, Gephi analytics/modeling: Pentaho, Jaspersoft, SAS R, Weka, Matlab, PMML, GLPK text: LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK apps: Cascading, Scalding, Cascalog, R markdown, SWF scale-out: Scalr, RightScale, CycleComputing, vFabric, Beanstalk graph: column: Gremlin, Vertica, GraphLab, HBase, key/val: index: relational: Neo4J Drill, Redis, Lucene/Solr, usual suspects Dynamo Membase, ElasticSearch MySQL imdg: Spark, Storm, hadoop: EMR, HW, MapR, machine data: Gigaspaces EMC, Azure, Compute Splunk, collectd, durable storage: Nagios S3, ASV, GCS, Riak, Couch
  • 36. common use cases app patterns
  • 37. use case: marketing funnel (image credit: Wikipedia)
      • must optimize a very large ad spend
      • different vendors report different metrics
      • seasonal variation distorts performance
      • some campaigns are much smaller than others
      • hard to predict ROI for incremental spend
      approach:
      • log aggregation, followed with cohort analysis
      • Bayesian point estimates compare different-sized ad tests
      • customer lifetime value quantifies ROI of new leads
      • time series analysis normalizes for seasonal variation
      • geolocation adjusts for regional cost/benefit
      • linear programming models estimate elasticity of demand
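One way to read "Bayesian point estimates compare different-sized ad tests" is as a posterior mean for each campaign's conversion rate. The sketch below is only an illustration under stated assumptions: the uniform Beta(1, 1) prior, the class name `AdTestEstimate`, and the campaign numbers are all hypothetical and do not come from the slides.

```java
// Minimal sketch (assumption: Beta-Binomial model, not taken from the deck).
class AdTestEstimate {
    // Posterior mean of a conversion rate under a Beta(alpha, beta) prior,
    // after observing `conversions` successes out of `impressions` trials.
    public static double posteriorMean(long conversions, long impressions,
                                       double alpha, double beta) {
        if (conversions < 0 || impressions < conversions) {
            throw new IllegalArgumentException("need 0 <= conversions <= impressions");
        }
        return (conversions + alpha) / (impressions + alpha + beta);
    }

    public static void main(String[] args) {
        // Hypothetical campaigns of very different sizes.
        double small = posteriorMean(3, 10, 1.0, 1.0);
        double large = posteriorMean(250, 1000, 1.0, 1.0);
        // The prior pulls the small test toward 0.5 more strongly than the large one.
        System.out.println("small test estimate: " + small);
        System.out.println("large test estimate: " + large);
    }
}
```

The point of the prior is that a 3-in-10 result is not treated as equally trustworthy as a 250-in-1000 result; the raw ratio would hide that difference.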
  • 38. use case: ecommerce fraud (image credit: stat.berkeley.edu)
      • sparse data means lots of missing values
      • “needle in a haystack” lack of training cases
      • answers are available in large-scale batch, results are needed in real-time event processing
      • not just one pattern to detect – many, ever-changing
      approach:
      • random forest (RF) classifiers predict likely fraud
      • subsampled data to re-balance training sets
      • impute missing values based on density functions
      • train on massive log files, run on in-memory grid
      • adjust metrics to minimize customer support costs
      • detect novelty – report anomalies via notifications
  • 39. use case: customer segmentation (image credit: Mathworks)
      • many millions of customers, hard to determine which features resonate
      • multi-modal distributions get obscured by the practice of calculating an “average”
      • not much is known about individual customers
      approach:
      • connected components for sessionization, determining uniques from logs
      • estimates for age, gender, income, geo, etc.
      • clustering algorithms to group into market segments
      • social graph infers “unknown” relationships
      • covariance/heat maps visualize segments vs. feature sets
  • 40. use case: monetizing content (image credit: Digital Humanities)
      • need to suggest relevant content which would otherwise get buried in the back catalog
      • big disconnect between inventory and limited performance ad market
      • enormous amounts of text, hard to categorize
      approach:
      • text analytics glean key phrases from documents
      • hierarchical clustering of character frequencies detects language
      • latent Dirichlet allocation (LDA) reduces dimension to topic models
      • recommenders suggest similar topics to customers
      • collaborative filters connect known users with less known
  • 41. Intro to Cascading (word count flow diagram) 4. intro: Cascading API
  • 42. Cascading API: purpose ‣ simplify data processing development and deployment ‣ improve application developer productivity ‣ enable data processing application manageability
  • 43. Cascading API: a few facts
      ‣ Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc.
      ‣ in production (~5 yrs) at hundreds of enterprise Hadoop deployments: Finance, Health Care, Transportation, other verticals
      ‣ studies published about large use cases: Twitter, Etsy, Airbnb, Square, Climate Corporation, FlightCaster, Williams-Sonoma
      ‣ partnerships and distribution with SpringSource, Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC
      ‣ several open source projects built atop, managed by Twitter, Etsy, etc., which provide substantial Machine Learning libraries
      ‣ DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy
      ‣ data “taps” integrate popular data frameworks via JDBC, Memcached, HBase, plus serialization in Apache Thrift, Avro, Kryo, etc.
      ‣ entire app compiles into a single JAR: fully connected for compiler optimization, exception handling, debugging, config, scheduling, etc.
  • 44. Cascading API: a few quotes “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.” CIO, Thor Olavsrud, 2012-06-06 cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.” 2012 BOSSIE Awards, James Borck, 2012-09-18 infoworld.com/slideshow/65089 “Company’s promise to application developers is an opportunity to build and test applications on their desktops in the language of choice with familiar constructs and reusable components” Dr. Dobb’s, Adrian Bridgwater, 2012-06-08 drdobbs.com/jvm/where-does-big-data-go-to-get-data-inten/240001759
  • 45. data+code “political spectrum” “Notes from the Mystery Machine Bus” by Steve Yegge, Google goo.gl/SeRZa
      “conservative” | “liberal”
      (mostly) Enterprise | (mostly) Start-Up
      risk management | customer experiments
      assurance | flexibility
      well-defined schema | schema follows code
      explicit configuration | convention
      type-checking compiler | interpreted scripts
      wants no surprises | wants no impediments
      Java, Scala, Clojure, etc. | PHP, Ruby, Python, etc.
      Cascading, Scalding, Cascalog, etc. | Hive, Pig, Hadoop Streaming, etc.
  • 46. Cascading API: adoption As Enterprise apps move into Hadoop and related BigData frameworks, risk profiles shift toward more conservative programming practices Cascading provides a popular API for defining and managing Enterprise data workflows
  • 47. enterprise data workflows Tuples, Pipelines, Endpoints, Operations, Joins, Assertions, Traps, etc. …in other words, “plumbing” (word count flow diagram)
  • 48. data workflows: team (shown beside the word count flow diagram)
      ‣ Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)
      ‣ Systems Integrator POV: system integration of heterogeneous data sources and compute platforms
      ‣ Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc.
      ‣ Data Architect POV: a physical plan for large-scale data flow management
      ‣ Software Architect POV: a pattern language, similar to plumbing or circuit design
      ‣ App Developer POV: API bindings for Java, Scala, Clojure, Jython, JRuby, etc.
      ‣ Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo
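The Data Scientist POV mentions applying Amdahl's Law to a workflow DAG. As a rough illustration of what that calculation looks like, the sketch below evaluates the standard formula, speedup = 1 / ((1 − p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of workers. The class name `AmdahlSketch` and the sample numbers are hypothetical, not from the slides.

```java
// Minimal sketch of Amdahl's Law; the inputs below are illustrative only.
class AmdahlSketch {
    // parallelFraction: share of total runtime that can be parallelized (0..1).
    // workers: number of parallel workers (e.g., task slots), at least 1.
    public static double speedup(double parallelFraction, int workers) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || workers < 1) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and workers >= 1");
        }
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        // If 90% of a flow parallelizes, adding workers has sharply diminishing returns:
        // the serial 10% caps the speedup at 10x no matter how large the cluster is.
        for (int n : new int[] {1, 10, 100, 1000}) {
            System.out.println(n + " workers: " + speedup(0.9, n));
        }
    }
}
```

In a flow diagram, the serial fraction roughly corresponds to steps that funnel through a single reducer or a non-parallel join, which is why the DAG view is useful for capacity planning.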
  • 49. data workflows: layers
      ‣ business process: domain expertise, business trade-offs, operating parameters, market position, etc.
      ‣ API language: Java, Scala, Clojure, Jython, JRuby, Groovy, etc. …envision whatever runs in a JVM
      ‣ optimize / schedule: major changes in technology now
      ‣ physical plan: “assembler” code (word count flow diagram)
      ‣ compute substrate: Apache Hadoop, in-memory local mode …envision GPUs, streaming, etc.
      ‣ machine data: Splunk, Nagios, Collectd, New Relic, etc.
  • 50. data workflows: SQL
      Relational: SQL parser; logical plan, optimized based on stats; physical plan; query history, table stats; b-trees, etc.; ERD; table schema; catalog
  • 51. data workflows: SQL vs. JVM
      Relational | Cascading + Driven
      SQL parser | SQL-92 compliant parser (in progress)
      logical plan, optimized based on stats | TODO: logical plan, optimized based on stats
      physical plan | API “plumbing”
      query history, table stats | app history, tuple stats
      b-trees, etc. | distributed compute substrate: Hadoop, in-memory, etc.
      ERD | flow diagram
      table schema | tuple schema
      catalog | endpoint usage DB
  • 52. Intro to Cascading (word count flow diagram) 5. tutorial: for the impatient
  • 53. “Cascading for the Impatient” cascading.org/category/impatient/ (word count flow diagram)
      ‣ a series of introductory tutorials and code samples
      ‣ 1:1 code comparisons in Scalding, Cascalog, Pig, Hive
  • 54. 1: copy (diagram: Source, M, Sink; 1 mapper, 0 reducers, 10 lines of code)
      import java.util.Properties;

      import cascading.flow.FlowDef;
      import cascading.flow.hadoop.HadoopFlowConnector;
      import cascading.pipe.Pipe;
      import cascading.property.AppProps;
      import cascading.scheme.hadoop.TextDelimited;
      import cascading.tap.Tap;
      import cascading.tap.hadoop.Hfs;

      public class Main
        {
        public static void main( String[] args )
          {
          String inPath = args[ 0 ];
          String outPath = args[ 1 ];

          Properties props = new Properties();
          AppProps.setApplicationJarClass( props, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

          // create the source tap (tab-delimited text, with a header line)
          Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

          // create the sink tap
          Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

          // specify a pipe to connect the taps
          Pipe copyPipe = new Pipe( "copy" );

          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
           .addSource( copyPipe, inTap )
           .addTailSink( copyPipe, outTap );

          // run the flow
          flowConnector.connect( flowDef ).complete();
          }
        }
  • 55. wait! ten lines of code for a file copy… seems like a lot.
  • 56. same JAR, any scale…
      ‣ MegaCorp Enterprise IT: Pb’s data; 1000+ node private cluster; EVP calls you when app fails; runtime: days+
      ‣ Production Cluster: Tb’s data; EMR w/ 50 HPC Instances; Ops monitors results; runtime: hours – days
      ‣ Staging Cluster: Gb’s data; EMR + 4 Spot Instances; CI shows red or green lights; runtime: minutes – hours
      ‣ Your Laptop: Mb’s data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds – minutes
  • 57. 2: word count (flow diagram: Document Collection, Tokenize, GroupBy token, Count, Word Count) 1 mapper, 1 reducer, 18 lines of code gist.github.com/3900702
  • 58. Cascading / Java (word count flow diagram)
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];

      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
       .addSource( docPipe, docTap )
       .addTailSink( wcPipe, wcTap );

      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect( flowDef );
      wcFlow.writeDOT( "dot/wc.dot" );
      wcFlow.complete();
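For readers without a Hadoop setup at hand, the logic of the flow on slide 58 (split each text line on a regex, group by token, count) can be approximated in plain Java. This is only an in-memory sketch, not the Cascading API: the class name `WordCountSketch` is hypothetical, and dropping empty tokens is a choice made here for readability rather than something the slide specifies.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// In-memory approximation of Tokenize -> GroupBy token -> Count.
class WordCountSketch {
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
            // Tokenize: same delimiter class as the regex on the slide.
            .flatMap(line -> Arrays.stream(line.split("[ \\[\\]\\(\\),.]")))
            // Assumption of this sketch: skip empty tokens left between adjacent delimiters.
            .filter(token -> !token.isEmpty())
            // GroupBy token + Count, sorted by token for stable output.
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "A rain shadow is a dry area on the lee back side of a mountainous area.");
        wordCount(lines).forEach((token, count) -> System.out.println(token + "\t" + count));
    }
}
```

The trade-off is the one slide 56 describes: this version only works while the data fits on one machine, whereas the Cascading flow keeps the same shape at cluster scale.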
  • 59. Scalding / Scala (word count flow diagram)
      // Sujit Pal
      // sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
      package com.mycompany.impatient

      import com.twitter.scalding._

      class Part2(args : Args) extends Job(args) {
        val input = Tsv(args("input"), ('docId, 'text))
        val output = Tsv(args("output"))
        input.read.
          flatMap('text -> 'word) { text : String => text.split("""\s+""") }.
          groupBy('word) { group => group.size }.
          write(output)
      }
  • 60. Cascalog / Clojure (word count flow diagram)
      ; Paul Lam
      ; github.com/Quantisan/Impatient
      (ns impatient.core
        (:use [cascalog.api]
              [cascalog.more-taps :only (hfs-delimited)])
        (:require [clojure.string :as s]
                  [cascalog.ops :as c])
        (:gen-class))

      (defmapcatop split [line]
        "reads in a line of string and splits it by regex"
        (s/split line #"[\[\]\\\(\),.)\s]+"))

      (defn -main [in out & args]
        (?<- (hfs-delimited out)
             [?word ?count]
             ((hfs-delimited in :skip-header? true) _ ?line)
             (split ?line :> ?word)
             (c/count ?count)))
  • 61. Hive (word count flow diagram)
      -- Steve Severance
      -- stackoverflow.com/questions/10039949/word-count-program-in-hive
      CREATE TABLE input (line STRING);

      LOAD DATA LOCAL INPATH 'input.tsv'
      OVERWRITE INTO TABLE input;

      -- split on the table's "line" column (the slide referenced an undefined "text" column)
      SELECT word, COUNT(*)
      FROM input
      LATERAL VIEW explode(split(line, ' ')) lTable AS word
      GROUP BY word
      ;
  • 62. Pig (word count flow diagram)
      -- kudos to Dmitriy Ryaboy
      docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource') AS (doc_id, text);
      docPipe = FILTER docPipe BY doc_id != 'doc_id';

      -- specify regex to split "document" text lines into token stream
      tokenPipe = FOREACH docPipe GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
      tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

      -- determine the word counts
      tokenGroups = GROUP tokenPipe BY token;
      wcPipe = FOREACH tokenGroups GENERATE group AS token, COUNT(tokenPipe) AS count;

      -- output
      STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
      EXPLAIN -out dot/wc_pig.dot -dot wcPipe;
  • 63. 3: wc + scrub (flow diagram: Document Collection, Tokenize, Scrub token, GroupBy token, Count, Word Count) 1 mapper, 1 reducer, 22+10 lines of code
  • 64. 4: wc + scrub + stop words (flow diagram: Document Collection, Tokenize, Scrub token, HashJoin Left with Stop Word List as RHS, Regex token, GroupBy token, Count, Word Count) 1 mapper, 1 reducer, 28+10 lines of code
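Slides 63 and 64 only show diagrams, so the following is a hedged, plain-Java reading of what the "Scrub token" and "HashJoin Left against a Stop Word List" steps accomplish together: normalize each token, then keep only tokens that have no match in the stop word list. The class name `StopWordSketch`, the specific scrubbing rules (trim and lower-case), and the sample stop words are assumptions for illustration; the tutorial's actual operations may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// In-memory approximation of Scrub token -> left join with stop words -> keep unmatched.
class StopWordSketch {
    // Assumed scrubbing rules: trim whitespace and lower-case the token.
    public static String scrub(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    public static List<String> scrubAndFilter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String raw : tokens) {
            String token = scrub(raw);
            if (token.isEmpty()) {
                continue; // nothing left after scrubbing
            }
            if (stopWords.contains(token)) {
                continue; // token matched the stop word side of the join, so drop it
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "the", "of"));
        List<String> tokens = Arrays.asList(" Rain ", "a", "", "Shadow", "THE");
        System.out.println(scrubAndFilter(tokens, stopWords));
    }
}
```

A hash set lookup plays the role of the hash join here because the stop word list is small enough to hold in memory on every mapper, which is also why the diagram places it on the right-hand side of the join.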
  • 65. 5: tf-idf (flow diagram; branches labeled D, DF, TF, TF-IDF, and Word Count; operations shown include Tokenize, Scrub token, HashJoin, Regex token, Unique, Insert, SumBy, GroupBy, Count, CoGroup, ExprFunc tf-idf) 11 mappers, 9 reducers, 65+10 lines of code
  • 66. 6: tf-idf + tdd (same flow diagram as slide 65, with Assert, Checkpoint, and Failure Traps added) 12 mappers, 9 reducers, 76+14 lines of code
  • 67. deployed on AWS… aws.amazon.com/elasticmapreduce/
      elastic-mapreduce --create --name "TF-IDF" \
        --jar s3n://temp.cascading.org/impatient/part6.jar \
        --arg s3n://temp.cascading.org/impatient/rain.txt \
        --arg s3n://temp.cascading.org/impatient/out/wc \
        --arg s3n://temp.cascading.org/impatient/en.stop \
        --arg s3n://temp.cascading.org/impatient/out/tfidf \
        --arg s3n://temp.cascading.org/impatient/out/trap \
        --arg s3n://temp.cascading.org/impatient/out/check
  • 68. results? (shown beside the tf-idf flow diagram)
      input:
      doc_id | text
      doc01 | A rain shadow is a dry area on the lee back side of a mountainous area.
      doc02 | This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
      doc03 | A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
      doc04 | This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
      doc05 | Two Women. Secrets. A Broken Land. [DVD Australia]
      zoink | null
      output:
      doc_id | tf-idf | token
      doc02 | 0.9163 | air
      doc05 | 0.9163 | australia
      doc05 | 0.9163 | broken
      doc04 | 0.9163 | california's
      doc04 | 0.9163 | cause
      doc02 | 0.9163 | cloudcover
      doc04 | 0.9163 | death
      doc04 | 0.9163 | deserts
      doc03 | 0.9163 | downwind
      …
      doc02 | 0.9163 | sinking
      doc04 | 0.9163 | such
      doc04 | 0.9163 | valley
      doc05 | 0.9163 | women
      doc03 | 0.5108 | land
      doc05 | 0.5108 | land
      doc01 | 0.5108 | lee
      doc02 | 0.5108 | lee
      doc03 | 0.5108 | leeward
      doc04 | 0.5108 | leeward
      doc01 | 0.4463 | area
      doc02 | 0.2231 | area
      doc03 | 0.2231 | area
      doc01 | 0.2231 | dry
      doc02 | 0.2231 | dry
      doc03 | 0.2231 | dry
      doc02 | 0.2231 | mountain
      doc03 | 0.2231 | mountain
      doc04 | 0.2231 | mountain
      doc01 | 0.0000 | rain
      doc02 | 0.0000 | rain
      doc03 | 0.0000 | rain
      doc04 | 0.0000 | rain
      doc01 | 0.0000 | shadow
      doc02 | 0.0000 | shadow
      doc03 | 0.0000 | shadow
      doc04 | 0.0000 | shadow
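The scores on slide 68 are consistent with the formula tf × ln(N / (1 + df)), with N = 5 documents, where tf is the token's count within a document and df is the number of documents containing it. That formula is inferred here from the numbers, since the slide does not state it, and the class name `TfIdfSketch` is hypothetical. The sketch below reproduces a few of the slide's values.

```java
// Minimal sketch: reproduces slide values under the inferred formula
// tfidf = tf * ln(nDocs / (1 + df)). The formula is an assumption, not quoted from the deck.
class TfIdfSketch {
    public static double tfidf(long tf, long nDocs, long df) {
        return tf * Math.log((double) nDocs / (1.0 + df));
    }

    public static void main(String[] args) {
        long nDocs = 5; // doc01..doc05; the "zoink" row has null text
        System.out.printf("air      (tf=1, df=1): %.4f%n", tfidf(1, nDocs, 1));
        System.out.printf("lee      (tf=1, df=2): %.4f%n", tfidf(1, nDocs, 2));
        System.out.printf("dry      (tf=1, df=3): %.4f%n", tfidf(1, nDocs, 3));
        System.out.printf("area/doc01 (tf=2, df=3): %.4f%n", tfidf(2, nDocs, 3));
        System.out.printf("rain     (tf=1, df=4): %.4f%n", tfidf(1, nDocs, 4));
    }
}
```

Under this reading, "rain" and "shadow" score zero because they occur in four of the five documents, so they carry no distinguishing weight, while "area" in doc01 scores double because it appears twice in that document.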
  • 69. comparisons? compare similar code in Scalding (Scala) and Cascalog (Clojure):
      ‣ sujitpal.blogspot.com/2012/08/scalding-for-impatient.html based on: github.com/twitter/scalding/wiki
      ‣ github.com/Quantisan/Impatient based on: github.com/nathanmarz/cascalog/wiki
  • 70. Intro to Cascading (word count flow diagram) 6. code: sample apps
  • 71. Social Recommender github.com/Cascading/SampleRecommender (diagram: Twitter tweets, filter stop words, calculate similarity, QA, threshold min, max, Neo4j, LDA, Redis)
      ‣ social recommender based on Twitter: suggest users who tweet about similar stocks
      ‣ instead of a cross-product (potential bottleneck) this runs in parallel on Hadoop
      ‣ uses a stop word list to remove common words, offensive phrases, etc.
      ‣ one tap measures token frequency: for QA, adjust stop words, improve filter, etc.
      ‣ adapted in Spring by Costin Leau
  • 72. SocRec: architecture (diagram)
      ‣ source: Twitter firehose tweets ( uid, tweet, t ), batch updates
      ‣ filter low-freq stop words; checkpoint: tokenized tweets
      ‣ checkpoint: token frequency, used for analysis + QA curation of thresholds
      ‣ calculate similarity; threshold min, max; checkpoint: similar users
      ‣ sinks: Neo4j: social graph; Redis: results (uid: uidx, rank); LDA: topic trending
  • 73. SocRec: results
      uid | recommend | weight
      carbonfiberxrm | ClosingBellNews | 0.1459
      carbonfiberxrm | DJFunkyGrrL | 0.0870
      ClosingBellNews | DJFunkyGrrL | 0.1491
      CloudStocks | DJFunkyGrrL | 0.1206
      ElmoreNicole | DJFunkyGrrL | 0.1798
      EsNeey | alexiolo_ | 0.8603
      ...
  • 74. City of Palo Alto open data github.com/Cascading/CoPA/wiki (flow diagram: CoPA GIS export parsed by regex into tree and road branches, joined with Tree Metadata and Road Metadata, Geohash indexing, albedo and shade estimates, GPS logs, checkpoints, failure traps, park and reco outputs)
      ‣ GIS export for parks, roads, trees (unstructured / open data)
      ‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
      ‣ curated metadata, used to enrich the dataset
      ‣ could extend via mash-up with many available public data APIs
      Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
      “Find a shady spot on a summer day to walk near downtown and take a call…”
  • 76. CoPA: results (plot: density of Estimated Tree Height (meters), x-axis avg_height from 0 to 50, shaded by count)
      ‣ addr: 115 HAWTHORNE AVE
      ‣ lat/lng: 37.446, -122.168
      ‣ geohash: 9q9jh0
      ‣ tree: 413 site 2
      ‣ species: Liquidambar styraciflua
      ‣ avg height 23 m
      ‣ road albedo: 0.12
      ‣ distance: 10 m
      ‣ a short walk from my train stop ✔
  • 77. drill-down
      ‣ blog: cascading.org/
      ‣ code/wiki/gists: github.com/Cascading/
      ‣ jars: conjars.org/
      ‣ list: goo.gl/KQtUL
      ‣ DevOps products: concurrentinc.com/
      pnathan@concurrentinc.com @pacoid
