Building Enterprise Apps
for Big Data with Cascading

Paco Nathan


Concurrent, Inc.



                                         Copyright @2012, Concurrent, Inc.
Enterprise Apps
 for Big Data
with Cascading
 1. backstory: how we got here
 2. build: Data Science teams
 3. pattern: common use cases
 4. intro: Cascading API
 5. tutorial: for the impatient
 6. code: sample apps
Intro to Cascading






1. backstory:
how we got here
inflection point
 huge Internet successes after 1997 holiday season…
 AMZN, EBAY, Inktomi (YHOO Search), then GOOG
 consider this metric:
   annual revenue per customer / amount of data stored
 which dropped 100x within a few years after 1997

 storage and processing costs plummeted, now we must
 work much smarter to extract ROI from Big Data…
 our methods must adapt

 “conventional wisdom” of RDBMS and BI tools became
 less viable; however, business cadre was still focused on
 pivot tables and pie charts… which tends toward inertia!

 MapReduce and the Hadoop open source stack grew
 directly out of that contention… however, that effort
 only solves parts of the puzzle
inflection point: consequences
 Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm)
 Hadoop Summit, 2012:

 “All of Fortune 500 is now on notice over the next 10-year period.”
 Amazon and Google as exemplars of massive disruption in retail,
 advertising, etc.
 data as the major force displacing Global 1000 over the next decade,
 mostly through apps — verticals, leveraging domain expertise

 Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, etc.)
 XLDB, 2012:

 “Complex analytics workloads are now displacing SQL as the basis
  for Enterprise apps.”
primary sources
 “Early Amazon: Splitting the website” – Greg Linden

 “The eBay Architecture” – Randy Shoup, Dan Pritchett

 Inktomi (YHOO Search)
 “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)

 “The Birth of Google” – John Battelle
 “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
the world before…

BI, SQL, and highly
optimized code
data innovation: circa 1996
 [diagram labels: Stakeholder, Customers, Excel pivot tables, PowerPoint slide decks,
  strategy, SQL Query result sets, Engineering, optimized code, Web App]

the world after…

machine learning,
leveraging log files
data innovation: circa 2001
 [diagram labels: Stakeholder, Product, Customers, dashboards, UX, models, servlets,
  Algorithmic Modeling, classifiers, Web Apps, SQL Query result sets, history,
  customer, DW, ETL, RDBMS]
the world ahead…

what our customers
are doing now
data innovation: circa 2013
 [diagram labels: Domain, Data Apps, business process, Workflow, dashboard, History,
  services, Web Apps, Mobile, etc., Prod, data science, discovery + modeling,
  optimized capacity, interactions, transactions, content, s/w dev, Eng, App Dev,
  Data Access Patterns, Hadoop, etc., Log Events, In-Memory Data Grid, Ops, DW,
  batch, "real time", Cluster Scheduler, introduced capability, existing SDLC]

a key difference…
statistical thinking

      Process              Variation               Data             Tools

  employing a mode of thought which includes both logical and analytical reasoning:
  evaluating the whole of a problem, as well as its component parts; attempting
  to assess the effects of changing one or more variables

  this approach attempts to understand not just problems and solutions,
  but also the processes involved and their variances

  particularly valuable in Big Data work when combined with hands-on experience in
  physics – roughly 50% of my peers come from physics or physical engineering…

  programmers typically don’t think this way…
  however, both systems engineers and data scientists must!

  by Leo Breiman
  Statistical Modeling:
  The Two Cultures
  Statistical Science, 2001
2. build:
Data Science teams
core values

  Data Science teams develop actionable insights,
  building confidence for decisions

  that work may influence a few decisions worth
  billions (e.g., M&A) or billions of small decisions
  (e.g., AdWords)

  probably somewhere in-between…

  solving for pattern, at scale.

  an interdisciplinary pursuit which
  requires teams, not sole players
most valuable skills
 approximately 80% of the costs for data-related projects
 get spent on data preparation – mostly on cleaning up
 data quality issues: ETL, log file analysis, etc.

 unfortunately, data-related budgets for many companies tend
 to go into frameworks which can only be used after clean up

 most valuable skills:
   ‣ learn to use programmable tools that prepare data

   ‣ learn to generate compelling data visualizations

   ‣ learn to estimate the confidence for reported results

   ‣ learn to automate work, making analysis repeatable
 the rest of the skills – modeling,
 algorithms, etc. – those are secondary
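
 As one illustration of "estimate the confidence for reported results", here is a minimal sketch that puts an interval around a reported rate. It uses the Wilson score interval for a binomial proportion; the class name, method name, and the sample counts are illustrative assumptions, not taken from the talk.

```java
class ConfidenceSketch {
    /**
     * Wilson score interval for a binomial proportion.
     * z = 1.96 corresponds to roughly 95% confidence.
     * Returns { lower bound, upper bound }.
     */
    public static double[] wilsonInterval(long successes, long trials, double z) {
        if (trials <= 0 || successes < 0 || successes > trials) {
            throw new IllegalArgumentException("need trials > 0 and 0 <= successes <= trials");
        }
        double n = trials;
        double p = successes / n;
        double z2 = z * z;
        double denom = 1.0 + z2 / n;
        double center = (p + z2 / (2.0 * n)) / denom;
        double half = z * Math.sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n)) / denom;
        return new double[] { center - half, center + half };
    }

    public static void main(String[] args) {
        // hypothetical report: 30 conversions out of 1000 visits
        double[] ci = wilsonInterval(30, 1000, 1.96);
        System.out.printf("rate=%.4f, interval=[%.4f, %.4f]%n", 30 / 1000.0, ci[0], ci[1]);
    }
}
```

 Reporting the interval next to the point value makes it clear when two numbers on a dashboard are not really distinguishable.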
social caveats
 “This data cannot be correct!” may be an early warning
 about an organization itself
 much depends on how the people whom you work alongside
 tend to arrive at decisions:
     ‣ probably good: Induction, Abduction, Circumscription
     ‣ probably poor: Deduction, Speculation, Justification

 in general, one good data visualization
 puts many ongoing verbal arguments to rest
 however, let domain experts handle
 “data storytelling”, not data scientists

the science in data science?
  in a nutshell, what we do…

  ‣ estimate probability

  ‣ calculate analytic variance

  ‣ manipulate order complexity

  ‣ make use of learning theory

  +   collab with DevOps, Stakeholders

  +   reduce our work to cron entries
synthesis of the above
  MapReduce is Good Enough?
  Jimmy Lin, U Maryland + Twitter

  A Few Useful Things to Know about Machine Learning
  Pedro Domingos, U Washington
team process = needs

                  help people ask the
    discovery     right questions

                  allow automation to place
     modeling     informed bets

                  deliver products at
    integration   scale to customers

                  build smarts into
       apps       product features

                  keep infrastructure
     systems      running, cost-effective
team composition = roles

         Data                   business process,
       Scientist                data prep, discovery,
                                modeling, etc.

        App Dev                 software engineering,
                                automation

          Ops                   systems engineering, access

matrix = needs × roles
 [matrix figure: needs (discovery, modeling, integration, apps, systems) × roles]




matrix: example team
 [matrix figure: needs (discovery, modeling, integration, apps, systems) × roles,
  filled in for an example team]





 summary: this team seems heavy on systems, may need more overlap
 between modeling and integration, particularly among team leads
Can I simply hire one
rockstar data scientist
to cover all this work?
A: No, interdisciplinary
work requires teams.

A: Hire leads who speak
the lingo of each domain.

A: Hire people who cover
2+ roles, when possible.

  by DJ Patil

  Data Jujitsu
  O’Reilly, 2012

  Building Data Science Teams
  O’Reilly, 2011
3. pattern:
common use cases
CAP theorem
 purpose: theoretical limits for data access patterns
    ‣ consistency
    ‣ availability
    ‣ partition tolerance

 best case scenario: you may pick two … or spend billions
 struggling to obtain all three at scale (GOOG)
 translated: cost of doing business
data access patterns
 design patterns: originated in consensus negotiation
 for architecture, later used in software engineering
 consider the corollaries in large-scale data work…
 essence: select data frameworks based on
 your data access patterns
 in other words, decouple use cases based on needs
  – avoid the “one size fits all” (OSFA) anti-pattern
 let’s review some examples…
access → frameworks → forfeits
  financial transactions               general ledger in RDBMS            CAx
  ad-hoc queries                      RDS (hosted MySQL)                 CAx
  reporting, dashboards               like Pentaho                       CAx
  log rotation/persistence            like Riak                          xxP
  search indexes                      like Lucene/Solr                   xAP
  static content, archives            S3 (durable storage)               xAP
  customer facts                      like Redis, Membase                xAP
  distributed counters, locks, sets   like Redis                         xAP*
  data objects CRUD                   key/value – like NoSQL on MySQL    CxP
  authoritative metadata              like Zookeeper                     CxP
  data prep, modeling at scale        like Hadoop/Cascading + R          CxP
  graph analysis                      like Hadoop + Redis + Gephi        CxP
  data marts                          like Hadoop/HBase                  CxP
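
 To make the "forfeits" column concrete, here is a minimal sketch that decodes the CAP codes for a few rows copied from the table. The class and method names are illustrative assumptions; the only rule encoded is that an "x" in a position marks the property given up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AccessPatterns {
    // a few rows from the table: use case -> { framework, CAP code }
    static final Map<String, String[]> TABLE = new LinkedHashMap<>();
    static {
        TABLE.put("financial transactions", new String[] { "general ledger in RDBMS", "CAx" });
        TABLE.put("log rotation/persistence", new String[] { "like Riak", "xxP" });
        TABLE.put("search indexes", new String[] { "like Lucene/Solr", "xAP" });
        TABLE.put("authoritative metadata", new String[] { "like Zookeeper", "CxP" });
        TABLE.put("data prep, modeling at scale", new String[] { "like Hadoop/Cascading + R", "CxP" });
    }

    /** Expands a three-letter code such as "CxP" into the names of the forfeited properties. */
    public static String forfeits(String capCode) {
        if (capCode == null || capCode.length() != 3) {
            throw new IllegalArgumentException("expected a three-letter code such as CxP");
        }
        String[] names = { "consistency", "availability", "partition tolerance" };
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < names.length; i++) {
            if (capCode.charAt(i) == 'x') {
                if (out.length() > 0) {
                    out.append(", ");
                }
                out.append(names[i]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String[]> row : TABLE.entrySet()) {
            String[] v = row.getValue();
            System.out.println(row.getKey() + " -> " + v[0] + " (forfeits " + forfeits(v[1]) + ")");
        }
    }
}
```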
and, since
“One Size Fits All”…
a selection of great tools…
   analytics/modeling:   R, Weka, Matlab, PMML, GLPK
                         ggplot2, D3, Gephi
                         Graphite, PowerPivot, Pentaho, Jaspersoft, SAS
                         LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK
                         Cascading, Scalding, Cascalog, R markdown, SWF
                         Scalr, RightScale, CycleComputing, vFabric, Beanstalk
   graph:                Gremlin, GraphLab, Neo4J
   column:               Vertica, HBase, Drill, Dynamo
   key/val:              Redis, Membase,
   index:                Lucene/Solr, ElasticSearch
   relational:           usual suspects
                         Spark, Storm,
   hadoop:               EMR, HW, MapR, EMC, Azure, Compute
   machine data:         Splunk, collectd, Nagios
   durable storage:      S3, ASV, GCS, Riak, Couch
common use cases
  app patterns
use case: marketing funnel
  •   must optimize a very large ad spend
  •   different vendors report different metrics

  •   seasonal variation distorts performance
  •   some campaigns are much smaller than others
  •   hard to predict ROI for incremental spend

  • log aggregation, followed with cohort analysis
  • Bayesian point estimates compare different-sized ad tests
  • customer lifetime value quantifies ROI of new leads
  • time series analysis normalizes for seasonal variation
  • geolocation adjusts for regional cost/benefit
  • linear programming models estimate elasticity of demand
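
 A minimal sketch of how a Bayesian point estimate lets a small ad test be compared with a large one: the posterior mean of a rate under a Beta prior pulls small samples toward the prior. The class name, the Beta(1, 9) prior, and the campaign counts are illustrative assumptions, not numbers from the talk.

```java
class BayesianRate {
    /**
     * Posterior mean of a success rate under a Beta(alpha, beta) prior,
     * after observing `successes` out of `trials`.
     */
    public static double posteriorMean(long successes, long trials, double alpha, double beta) {
        if (trials < 0 || successes < 0 || successes > trials || alpha <= 0 || beta <= 0) {
            throw new IllegalArgumentException("invalid counts or prior");
        }
        return (successes + alpha) / (trials + alpha + beta);
    }

    public static void main(String[] args) {
        // hypothetical campaigns: a small test and a large one
        double small = posteriorMean(3, 20, 1.0, 9.0);
        double large = posteriorMean(1200, 10000, 1.0, 9.0);
        System.out.printf("small test: raw=%.4f, estimate=%.4f%n", 3 / 20.0, small);
        System.out.printf("large test: raw=%.4f, estimate=%.4f%n", 1200 / 10000.0, large);
    }
}
```

 The small test's estimate moves noticeably toward the prior mean, while the large test's estimate stays close to its raw rate, which is what makes the two comparable.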
use case: ecommerce fraud
  • sparse data means lots of missing values

  • “needle in a haystack” lack of training cases
  • answers are available in large-scale batch, results
      are needed in real-time event processing
  •   not just one pattern to detect – many, ever-changing

  • random forest (RF) classifiers predict likely fraud
  • subsampled data to re-balance training sets
  • impute missing values based on density functions
  • train on massive log files, run on in-memory grid
  • adjust metrics to minimize customer support costs
  • detect novelty – report anomalies via notifications
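
 The "subsampled data to re-balance training sets" step can be sketched as follows: keep every rare (fraud) case and a random subsample of the common cases. This is a minimal sketch under assumptions; the class name, the `ratio` parameter, and the fixed seed are illustrative choices, not the method used in any particular production system.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class Rebalance {
    /**
     * Returns all minority rows plus at most ratio * minority.size() randomly
     * chosen majority rows, shuffled together. A fixed seed keeps runs repeatable.
     */
    public static <T> List<T> downsample(List<T> majority, List<T> minority, int ratio, long seed) {
        if (ratio < 0) {
            throw new IllegalArgumentException("ratio must be non-negative");
        }
        List<T> shuffled = new ArrayList<>(majority);
        Collections.shuffle(shuffled, new Random(seed));
        int keep = (int) Math.min((long) shuffled.size(), (long) ratio * minority.size());
        List<T> out = new ArrayList<>(minority);
        out.addAll(shuffled.subList(0, keep));
        Collections.shuffle(out, new Random(seed + 1));
        return out;
    }

    public static void main(String[] args) {
        // hypothetical labels standing in for full training rows
        List<String> ok = new ArrayList<>(Collections.nCopies(1000, "ok"));
        List<String> fraud = new ArrayList<>(Collections.nCopies(10, "fraud"));
        List<String> training = downsample(ok, fraud, 3, 42L);
        System.out.println("training rows: " + training.size()
            + ", fraud rows: " + Collections.frequency(training, "fraud"));
    }
}
```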
use case: customer segmentation
  • many millions of customers, hard to determine
      which features resonate

  •   multi-modal distributions get obscured by the
      practice of calculating an “average”
  •   not much is known about individual customers

  • connected components for sessionization, determining
      uniques from logs
  •   estimates for age, gender, income, geo, etc.
  •   clustering algorithms to group into market segments
  •   social graph infers “unknown” relationships
  • covariance/heat maps visualize segments vs. feature sets
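
 A minimal sketch of "connected components for sessionization": identifiers that appear together in a log event (cookie, login, device) are linked, and each resulting component counts as one unique visitor. The union-find structure and the identifier strings are illustrative assumptions; at scale this would run as a distributed job rather than in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class UniqueVisitors {
    private final Map<String, String> parent = new HashMap<>();

    /** Finds the representative of an identifier's component, compressing the path. */
    public String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        String cur = id;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    /** Records that two identifiers were seen in the same log event. */
    public void link(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (!ra.equals(rb)) {
            parent.put(ra, rb);
        }
    }

    /** Number of connected components, i.e. estimated unique visitors. */
    public int countComponents() {
        Set<String> roots = new HashSet<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            roots.add(find(id));
        }
        return roots.size();
    }

    public static void main(String[] args) {
        UniqueVisitors uv = new UniqueVisitors();
        // hypothetical log events pairing identifiers
        uv.link("cookie:a", "user:1");
        uv.link("device:x", "user:1");
        uv.link("cookie:b", "user:2");
        System.out.println("unique visitors: " + uv.countComponents());
    }
}
```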
use case: monetizing content
  • need to suggest relevant content which would
      otherwise get buried in the back catalog
  •   big disconnect between inventory and limited
      performance ad market
  •   enormous amounts of text, hard to categorize

  • text analytics glean key phrases from documents
  • hierarchical clustering of char frequencies detects lang
  • latent Dirichlet allocation (LDA) reduces dimension to
      topic models
  •   recommenders suggest similar topics to customers
  • collaborative filters connect known users with less known
4. intro:
Cascading API
Cascading API: purpose
  ‣ simplify data processing development and deployment

  ‣ improve application developer productivity

  ‣ enable data processing application manageability
Cascading API: a few facts
  Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc.

  in production (~5 yrs) at hundreds of enterprise Hadoop deployments:
  Finance, Health Care, Transportation, other verticals

  studies published about large use cases: Twitter, Etsy, Airbnb, Square,
  Climate Corporation, FlightCaster, Williams-Sonoma

  partnerships and distribution with SpringSource, Amazon AWS,
  Microsoft Azure, Hortonworks, MapR, EMC

  several open source projects built atop, managed by Twitter, Etsy, etc.,
  which provide substantial Machine Learning libraries

  DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy

  data “taps” integrate popular data frameworks via JDBC, Memcached, HBase,
  plus serialization in Apache Thrift, Avro, Kryo, etc.

  entire app compiles into a single JAR: fully connected for compiler optimization,
  exception handling, debugging, config, scheduling, etc.
Cascading API: a few quotes
 “Cascading gives Java developers the ability to build Big Data applications
  on Hadoop using their existing skillset … Management can really go out
  and build a team around folks that are already very experienced with Java.
  Switching over to this is really a very short exercise.”
   CIO, Thor Olavsrud, 2012-06-06

 “Masks the complexity of MapReduce, simplifies the programming, and
  speeds you on your journey toward actionable analytics … A vast
  improvement over native MapReduce functions or Pig UDFs.”
   2012 BOSSIE Awards, James Borck, 2012-09-18

 “Company’s promise to application developers is an opportunity to build
  and test applications on their desktops in the language of choice with
  familiar constructs and reusable components”
   Dr. Dobb’s, Adrian Bridgwater, 2012-06-08
data+code “political spectrum”
 “Notes from the Mystery Machine Bus”
 by Steve Yegge, Google
         “conservative”                             “liberal”
           (mostly) Enterprise                   (mostly) Start-Up

            risk management                    customer experiments

                assurance                            flexibility

          well-defined schema                   schema follows code
          explicit configuration                     convention

         type-checking compiler                 interpreted scripts

           wants no surprises                  wants no impediments

         Java, Scala, Clojure, etc.            PHP, Ruby, Python, etc.

  Cascading, Scalding, Cascalog, etc.   Hive, Pig, Hadoop Streaming, etc.
Cascading API: adoption

    As Enterprise apps move into
    Hadoop and related BigData
    frameworks, risk profiles shift
    toward more conservative
    programming practices

    Cascading provides a popular
    API for defining and managing
    Enterprise data workflows
enterprise data workflows
 Tuples, Pipelines, Endpoints, Operations, Joins, Assertions, Traps, etc.
 …in other words, “plumbing”






data workflows: team
  ‣ Business Stakeholder POV:
    business process management for workflow orchestration (think BPM/BPEL)

  ‣ Systems Integrator POV:
    system integration of heterogeneous data sources and compute platforms

  ‣ Data Scientist POV:
    a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc.

  ‣ Data Architect POV:
    a physical plan for large-scale data flow management

  ‣ Software Architect POV:
    a pattern language, similar to plumbing or circuit design

  ‣ App Developer POV:
    API bindings for Java, Scala, Clojure, Jython, JRuby, etc.

  ‣ Systems Engineer POV:
    a JAR file, has passed CI, available in a Maven repo
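
 One way to read the Data Scientist POV: once a workflow is expressed as a DAG, the share of the work that can run in parallel bounds the speedup gained by adding nodes. A minimal sketch of Amdahl's Law follows; the class name and the sample fraction are illustrative assumptions and not part of the Cascading API.

```java
class Amdahl {
    /**
     * Amdahl's Law: speedup = 1 / ((1 - p) + p / n), where p is the fraction
     * of the work that parallelizes and n is the number of workers.
     */
    public static double speedup(double parallelFraction, int workers) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || workers < 1) {
            throw new IllegalArgumentException("need 0 <= fraction <= 1 and workers >= 1");
        }
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        // hypothetical workflow in which 90% of the work parallelizes
        for (int n : new int[] { 1, 10, 100, 1000 }) {
            System.out.printf("%4d workers -> speedup %.2f%n", n, speedup(0.9, n));
        }
    }
}
```

 The serial portion of the DAG sets the ceiling: with 10% serial work, no cluster size yields more than a 10x speedup.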
data workflows: layers
   business     domain expertise, business trade-offs,
   process      operating parameters, market position, etc.

      API       Java, Scala, Clojure, Jython, JRuby, Groovy, etc.
                …envision whatever runs in a JVM

   optimize /
    schedule    major changes in technology now

   compute      Apache Hadoop, in-memory local mode
                …envision GPUs, streaming, etc.

    data        Splunk, Nagios, Collectd, New Relic, etc.
data workflows: SQL
          SQL parser

          logical plan,
    optimized based on stats
          physical plan

         query history,
           table stats
          b-trees, etc.


         table schema

data workflows: SQL vs. JVM
         Relational              Cascading + Driven
           SQL parser             SQL-92 compliant parser
                                       (in progress)
           logical plan,              TODO: logical plan,
     optimized based on stats      optimized based on stats
           physical plan               API “plumbing”

          query history,                 app history,
            table stats                   tuple stats
           b-trees, etc.        distributed compute substrate:
                                   Hadoop, in-memory, etc.
               ERD                      flow diagram

          table schema                  tuple schema

             catalog                 endpoint usage DB
5. tutorial:
for the impatient
“Cascading for the Impatient”
  ‣ a series of introductory tutorials and code samples

  ‣ 1:1 code comparisons in Scalding, Cascalog, Pig, Hive






1: copy
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class Main
  {
  public static void main( String[] args )
    {
    // input and output paths come from the command line
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create the source tap (tab-delimited text with a header line)
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
      .addSource( copyPipe, inTap )
      .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
    }
  }

 1 mapper
 0 reducers
10 lines code

  ten lines of code
  for a file copy…
  seems like a lot.
same JAR, any scale…
                                                       MegaCorp Enterprise IT:
                                                       PBs of data
                                                       1000+ node private cluster
                                                       EVP calls you when app fails
                                                       runtime: days+

                                        Production Cluster:
                                        Tb’s data
                                        EMR w/ 50 HPC Instances
                                        Ops monitors results
                                        runtime: hours – days

                    Staging Cluster:
                    Gb’s data
                    EMR + 4 Spot Instances
                    CI shows red or green lights
                    runtime: minutes – hours

 Your Laptop:
 Mb’s data
 Hadoop standalone mode
 passes unit tests, or not
 runtime: seconds – minutes
2: word count

[flow diagram: Document, token, Count, Word Count]

 1 mapper
 1 reducer
18 lines code
Cascading / Java

String docPath = args[ 0 ];
String wcPath = args[ 1 ];

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
 .addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/" );
wcFlow.complete();
Scalding / Scala

// Sujit Pal

package com.mycompany.impatient

import com.twitter.scalding._

class Part2(args : Args) extends Job(args) {
  val input = Tsv(args("input"), ('docId, 'text))
  val output = Tsv(args("output"))
  input.read.
    flatMap('text -> 'word) {
      text : String => text.split("""\s+""")
    }.
    groupBy('word) { group => group.size }.
    write(output)
}
Cascalog / Clojure

; Paul Lam

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\(\),.\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
Hive

-- Steve Severance

SELECT word, COUNT(*)
FROM input
 LATERAL VIEW explode(split(text, ' ')) lTable AS word
GROUP BY word;
Pig

-- kudos to Dmitriy Ryaboy

docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource')
  AS (doc_id, text);
docPipe = FILTER docPipe BY doc_id != 'doc_id';

-- specify regex to split "document" text lines into token stream
tokenPipe = FOREACH docPipe
  GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

-- determine the word counts
tokenGroups = GROUP tokenPipe BY token;
wcPipe = FOREACH tokenGroups
  GENERATE group AS token, COUNT(tokenPipe) AS count;

-- output
STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
EXPLAIN -out dot/ -dot wcPipe;
3: wc + scrub

[flow diagram: Scrub token, GroupBy token, Word Count]

 1 mapper
 1 reducer
22+10 lines code
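
The scrub step normalizes each token before the GroupBy, so that "Rain" and "rain" land in the same group. The slide does not show the scrub function itself, so the following is only a plain-Java sketch of the idea, outside Cascading, assuming that scrubbing means trimming whitespace, lowercasing, and dropping empty tokens. `ScrubSketch` and `scrub` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ScrubSketch {
  // Assumption: "scrub" = trim + lowercase; tokens that end up empty are dropped.
  static List<String> scrub(List<String> tokens) {
    List<String> cleaned = new ArrayList<>();
    for (String token : tokens) {
      String t = token.trim().toLowerCase(Locale.ROOT);
      if (!t.isEmpty()) {
        cleaned.add(t);
      }
    }
    return cleaned;
  }

  public static void main(String[] args) {
    // prints [rain, shadow, lee]
    System.out.println(scrub(Arrays.asList(" Rain", "SHADOW ", "  ", "lee")));
  }
}
```

In the Cascading app this logic would sit in a custom operation applied per tuple; the sketch only shows the per-token transformation.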
4: wc + scrub + stop words

[flow diagram: Stop Word list, HashJoin (Left), Regex token, GroupBy token, Word Count]

 1 mapper
 1 reducer
28+10 lines code
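
The diagram for this part joins the token stream against a stop word list and keeps only the tokens that found no match. A minimal plain-Java sketch of that filtering, outside Cascading, assuming the stop word list is small enough to hold in memory (which is also the premise of a HashJoin). `StopWordSketch` and `removeStopWords` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopWordSketch {
  // Keep only tokens that have no match on the stop-word side of the join.
  static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
    List<String> kept = new ArrayList<>();
    for (String token : tokens) {
      if (!stopWords.contains(token)) {
        kept.add(token);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is"));
    List<String> tokens = Arrays.asList("a", "rain", "shadow", "is", "a", "dry", "area");
    // prints [rain, shadow, dry, area]
    System.out.println(removeStopWords(tokens, stopWords));
  }
}
```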
5: tf-idf

[flow diagram: Stop Word, HashJoin (Left), Regex token, Unique doc_id, Insert 1, SumBy doc_id, Unique token, GroupBy token, Count, CoGroup, ExprFunc tf-idf]

  11 mappers
   9 reducers
  65+10 lines code
6: tf-idf + tdd

[flow diagram: the tf-idf flow, plus Assert, Scrub, Stop Word List, Checkpoint, and Failure Traps]

  12 mappers
   9 reducers
  76+14 lines code
deployed on AWS…

 elastic-mapreduce --create --name "TF-IDF" \
   --jar s3n:// \
   --arg s3n:// \
   --arg s3n:// \
   --arg s3n:// \
   --arg s3n:// \
   --arg s3n:// \
   --arg s3n://
results?

input:

doc_id  text
doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
zoink   null

output:

doc_id  tf-idf  token
doc02   0.9163
doc05   0.9163  australia
doc05   0.9163  broken
doc04   0.9163  california's
doc04   0.9163  cause
doc02   0.9163  cloudcover
doc04   0.9163  death
doc04   0.9163  deserts
doc03   0.9163  downwind
…
doc02   0.9163  sinking
doc04   0.9163  such
doc04   0.9163  valley
doc05   0.9163  women
doc03   0.5108  land
doc05   0.5108  land
doc01   0.5108  lee
doc02   0.5108  lee
doc03   0.5108  leeward
doc04   0.5108  leeward
doc01   0.4463  area
doc02   0.2231  area
doc03   0.2231  area
doc01   0.2231  dry
doc02   0.2231  dry
doc03   0.2231  dry
doc02   0.2231  mountain
doc03   0.2231  mountain
doc04   0.2231  mountain
doc01   0.0000  rain
doc02   0.0000  rain
doc03   0.0000  rain
doc04   0.0000  rain
doc01   0.0000  shadow
doc02   0.0000  shadow
doc03   0.0000  shadow
doc04   0.0000  shadow
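
The scores above are consistent with tf-idf = tf × ln( N / (1 + df) ), taking N = 5 documents: a token found in one document scores ln(5/2) ≈ 0.9163, in two documents ln(5/3) ≈ 0.5108, in three ln(5/4) ≈ 0.2231, and in four ln(5/5) = 0. The formula is inferred from the numbers, not stated on the slide. A small plain-Java sketch under that assumption (`TfIdfSketch` and `tfIdf` are hypothetical names):

```java
import java.util.Locale;

final class TfIdfSketch {
  // Assumed formula: tf * ln( nDocs / (1 + df) ), inferred from the slide's output.
  static double tfIdf(long tfCount, long dfCount, long nDocs) {
    return tfCount * Math.log((double) nDocs / (1.0 + dfCount));
  }

  public static void main(String[] args) {
    // "sinking": once in doc02, found in 1 of 5 documents -> 0.9163
    System.out.println(String.format(Locale.ROOT, "%.4f", tfIdf(1, 1, 5)));
    // "area": twice in doc01, found in 3 of 5 documents -> 0.4463
    System.out.println(String.format(Locale.ROOT, "%.4f", tfIdf(2, 3, 5)));
    // "rain": found in 4 of 5 documents -> 0.0000
    System.out.println(String.format(Locale.ROOT, "%.4f", tfIdf(1, 4, 5)));
  }
}
```

Tokens such as "rain" and "shadow" score zero because they occur in nearly every document, so they carry no signal for telling documents apart.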

 compare similar code in Scalding (Scala) and Cascalog (Clojure):
 based on:
 based on:

6. code:
sample apps
Social Recommender

[diagram: Twitter, stop words, min/max thresholds, LDA, Redis]

 ‣ social recommender based on Twitter: suggest users who tweet about similar stocks
 ‣ instead of a cross-product (potential bottleneck) this runs in parallel on Hadoop
 ‣ uses a stop word list to remove common words, offensive phrases, etc.
 ‣ one tap measures token frequency: for QA, adjust stop words, improve filter, etc.
 ‣ adapted in Spring by Costin Leau
SocRec: architecture

[architecture diagram, components shown: Twitter firehose ( uid, tweet, t ); source; filter tweets; stop words; low-freq batch updates; tokenized tweets; calculate similarity; checkpoint: token frequency; analysis + curation; checkpoint: similar users; similarity thresholds (min, max); two sinks; social graph; LDA: topic; Redis; results (uid: uidx, rank)]
SocRec: results

 uid               recommend         weight
 carbonfiberxrm    ClosingBellNews   0.1459
 carbonfiberxrm    DJFunkyGrrL       0.0870
 ClosingBellNews   DJFunkyGrrL       0.1491
 CloudStocks       DJFunkyGrrL       0.1206
 ElmoreNicole      DJFunkyGrrL       0.1798
 EsNeey            alexiolo_         0.8603

City of Palo Alto open data

[flow diagram, components shown: GIS export; Regex parser; Checkpoint tsv; Failure Traps; Regex filter and Regex parser (tree species); Tree Metadata; Geohash; Regex filter and Regex parser; Metadata; HashJoin (Left); Estimate Albedo; Road Segments; Geohash; CoGroup; Tree Distance; Filter tree_dist; GroupBy tree_name; Checkpoint shade; GPS logs; CoGroup; reco]

  ‣ GIS export for parks, roads, trees (unstructured / open data)
  ‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
  ‣ curated metadata, used to enrich the dataset
  ‣ could extend via mash-up with many available public data APIs

Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
“Find a shady spot on a summer day to walk near downtown and take a call…”
CoPA: log events
CoPA: results

[plot: Estimated Tree Height (meters)]

 ‣   addr: 115 HAWTHORNE AVE
 ‣   lat/lng: 37.446, -122.168
 ‣   geohash: 9q9jh0
 ‣   tree: 413 site 2
 ‣   species: Liquidambar styraciflua
 ‣   avg height 23 m
 ‣   road albedo: 0.12
 ‣   distance: 10 m
 ‣   a short walk from my train stop ✔
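
The result above pairs a lat/lng with a geohash, the key used for geospatial indexing in this app. The slides do not show how the app computes it, so the following is a sketch of the standard geohash encoding in plain Java: interleave longitude and latitude bisection bits, then emit one base-32 character per 5 bits. `GeohashSketch` and `encode` are hypothetical names.

```java
final class GeohashSketch {
  private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

  // Standard geohash: even bits bisect the longitude range, odd bits the latitude range.
  static String encode(double lat, double lng, int precision) {
    double latLo = -90.0, latHi = 90.0;
    double lngLo = -180.0, lngHi = 180.0;
    StringBuilder hash = new StringBuilder();
    boolean evenBit = true;
    int bits = 0;
    int value = 0;

    while (hash.length() < precision) {
      if (evenBit) {
        double mid = (lngLo + lngHi) / 2;
        if (lng >= mid) { value = (value << 1) | 1; lngLo = mid; }
        else { value = value << 1; lngHi = mid; }
      } else {
        double mid = (latLo + latHi) / 2;
        if (lat >= mid) { value = (value << 1) | 1; latLo = mid; }
        else { value = value << 1; latHi = mid; }
      }
      evenBit = !evenBit;

      // every 5 bits become one base-32 character
      if (++bits == 5) {
        hash.append(BASE32.charAt(value));
        bits = 0;
        value = 0;
      }
    }
    return hash.toString();
  }

  public static void main(String[] args) {
    // prints 9q9j
    System.out.println(encode(37.446, -122.168, 4));
  }
}
```

With the rounded coordinates from the slide, the first four characters match the slide's geohash. The remaining characters depend on digits that the three-decimal rounding discards, so they need not reproduce 9q9jh0 exactly. Nearby points share a prefix, which is what lets a join on geohash stand in for a distance cross-product.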

  blog, code/wiki/gists, jars, list, DevOps products:

More Related Content

What's hot

Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Jonathan Seidman
Anil_BigData Resume
Anil_BigData ResumeAnil_BigData Resume
Anil_BigData ResumeAnil Sokhal
Treasure Data: Big Data Analytics on Heroku
Treasure Data: Big Data Analytics on HerokuTreasure Data: Big Data Analytics on Heroku
Treasure Data: Big Data Analytics on Heroku
Salesforce Developers Japan
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
Jonathan Seidman
Advanced analytics with sap hana and r
Advanced analytics with sap hana and rAdvanced analytics with sap hana and r
Advanced analytics with sap hana and r
SAP Technology
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Jonathan Seidman
The Other Way of Doing Big Data
The Other Way of Doing Big DataThe Other Way of Doing Big Data
The Other Way of Doing Big Data
Infochimps, a CSC Big Data Business
Treasure Data and Heroku
Treasure Data and HerokuTreasure Data and Heroku
Treasure Data and Heroku
Treasure Data, Inc.
Search Engine - How to Make it
Search Engine - How to Make itSearch Engine - How to Make it
Search Engine - How to Make it
Andreas Yunanto
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousing
DataWorks Summit
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High AvailabilityCloudera, Inc.
Data Driven Innovation with Amazon Web Services
Data Driven Innovation with Amazon Web ServicesData Driven Innovation with Amazon Web Services
Data Driven Innovation with Amazon Web ServicesAmazon Web Services
[Hadoop] NexR Terapot: Massive Email Archiving
[Hadoop] NexR Terapot: Massive Email Archiving[Hadoop] NexR Terapot: Massive Email Archiving
[Hadoop] NexR Terapot: Massive Email Archiving
Jinho Jung


Building Enterprise Apps for Big Data with Cascading

  • 1. Building Enterprise Apps for Big Data with Cascading. Paco Nathan (@pacoid), Concurrent, Inc. Copyright @2012, Concurrent, Inc. [figure: word count workflow diagram]
  • 2. Enterprise Apps for Big Data with Cascading: 1. backstory: how we got here; 2. build: Data Science teams; 3. pattern: common use cases; 4. intro: Cascading API; 5. tutorial: for the impatient; 6. code: sample apps
  • 3. Intro to Cascading. 1. backstory: how we got here
  • 4. inflection point: huge Internet successes after 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG. consider this metric: annual revenue per customer / amount of data stored, which dropped 100x within a few years after 1997. storage and processing costs plummeted, now we must work much smarter to extract ROI from Big Data… our methods must adapt. “conventional wisdom” of RDBMS and BI tools became less viable; however, business cadre was still focused on pivot tables and pie charts… which tends toward inertia! MapReduce and the Hadoop open source stack grew directly out of that contention… however, that effort only solves parts of the puzzle. [figure: timeline marking 1997, 1998, 2004]
  • 5. inflection point: consequences. Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm), Hadoop Summit, 2012: “All of Fortune 500 is now on notice over the next 10-year period.” Amazon and Google as exemplars of massive disruption in retail, advertising, etc.; data as the major force displacing Global 1000 over the next decade, mostly through apps: verticals, leveraging domain expertise. Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, etc.), XLDB, 2012: “Complex analytics workloads are now displacing SQL as the basis for Enterprise apps.”
  • 6. primary sources. Amazon: “Early Amazon: Splitting the website” – Greg Linden. eBay: “The eBay Architecture” – Randy Shoup, Dan Pritchett. Inktomi (YHOO Search): “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff). Google: “The Birth of Google” – John Battelle; “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
  • 7. the world before… BI, SQL, and highly optimized code
  • 8. data innovation: circa 1996 [diagram labels: Stakeholder; Customers; Excel pivot tables; PowerPoint slide decks; strategy; BI Analysts; Product; requirements; SQL Query; result sets; optimized code; Engineering; Web App; transactions; RDBMS]
  • 9. the world after… machine learning, leveraging log files
  • 10. data innovation: circa 2001 [diagram labels: Stakeholder; Product; Customers; dashboards; UX; Engineering; models; servlets; recommenders + classifiers; Algorithmic Modeling; Web Apps; Middleware; aggregation; event history; SQL Query; result sets; customer transactions; Logs; DW; ETL; RDBMS]
  • 11. the world ahead… what our customers are doing now
  • 12. data innovation: circa 2013 [diagram labels: Customers; Data Apps; business process; Domain Expert; Workflow; Prod; dashboard metrics; History; Web Apps, services, Mobile, etc.; data science; s/w dev; Data Scientist; Planner; social interactions; discovery + modeling; optimized capacity; transactions, content; Eng; endpoints; App Dev; Data Access Patterns; Hadoop, etc.; Log Events; In-Memory Data Grid; Ops; DW; batch; "real time"; Cluster Scheduler; RDBMS; legend: introduced capability, existing SDLC]
  • 14. statistical thinking [figure labels: Process, Variation, Data, Tools]: employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables. this approach attempts to understand not just problems and solutions, but also the processes involved and their variances. particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering… programmers typically don’t think this way… however, both systems engineers and data scientists must!
  • 15. reference: Leo Breiman, “Statistical Modeling: The Two Cultures”, Statistical Science, 2001
  • 16. Intro to Cascading. 2. build: Data Science teams
  • 17. core values: Data Science teams develop actionable insights, building confidence for decisions. that work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords); probably somewhere in-between… solving for pattern, at scale: an interdisciplinary pursuit which requires teams, not sole players. [image credit: Wikipedia]
  • 18. most valuable skills: approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc. unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up. most valuable skills: ‣ learn to use programmable tools that prepare data ‣ learn to generate compelling data visualizations ‣ learn to estimate the confidence for reported results ‣ learn to automate work, making analysis repeatable. the rest of the skills – modeling, algorithms, etc. – are secondary. [image credit: D3]
  • 19. social caveats: “This data cannot be correct!” may be an early warning about an organization itself. much depends on how the people whom you work alongside tend to arrive at decisions: ‣ probably good: Induction, Abduction, Circumscription ‣ probably poor: Deduction, Speculation, Justification. in general, one good data visualization puts many ongoing verbal arguments to rest; however, let domain experts handle “data storytelling”, not data scientists. [image credit: xkcd]
  • 20. the science in data science? in a nutshell, what we do… ‣ estimate probability ‣ calculate analytic variance ‣ manipulate order complexity ‣ make use of learning theory + collab with DevOps, Stakeholders + reduce our work to cron entries [figure: graph of application event names such as “Website Login”, “Add Buddy”, “Add Friend”; labels extracted mirrored and otherwise omitted]
  • 21. synthesis of the above: “MapReduce is Good Enough?” – Jimmy Lin, U Maryland + Twitter; “A Few Useful Things to Know about Machine Learning” – Pedro Domingos, U Washington
  • 22. team process = needs. discovery: help people ask the right questions; modeling: allow automation to place informed bets; integration: deliver products at scale to customers; apps: build smarts into product features; systems: keep infrastructure running, cost-effective. [image credit: Gephi]
  • 23. team composition = roles. Domain Expert: business process, stakeholder; Data Scientist: data prep, discovery, modeling, etc.; App Dev: software engineering, automation; Ops: systems engineering, access. [figure labels: data science; introduced capability; word count workflow diagram]
  • 24. matrix = needs × roles. needs: discovery, modeling, integration, apps, systems; roles: stakeholder, scientist, developer, ops
  • 25. matrix: example team. needs: discovery, modeling, integration, apps, systems; roles: stakeholder, scientist, developer, ops. summary: this team seems heavy on systems, may need more overlap between modeling and integration, particularly among team leads
  • 26. Q: Can I simply hire one rockstar data scientist to cover all this work?
  • 27. A: No, interdisciplinary work requires teams. A: Hire leads who speak the lingo of each domain. A: Hire people who cover 2+ roles, when possible.
  • 28. reference: DJ Patil, “Data Jujitsu”, O’Reilly, 2012; “Building Data Science Teams”, O’Reilly, 2011
  • 29. Intro to Cascading. 3. pattern: common use cases
  • 30. CAP theorem. purpose: theoretical limits for data access patterns. essence: ‣ consistency ‣ availability ‣ partition tolerance. best case scenario: you may pick two … or spend billions struggling to obtain all three at scale (GOOG). translated: cost of doing business
  • 31. data access patterns. design patterns: originated in consensus negotiation for architecture, later used in software engineering; consider the corollaries in large-scale data work… essence: select data frameworks based on your data access patterns; in other words, decouple use cases based on needs – avoid the “one size fits all” (OSFA) anti-pattern. let’s review some examples…
  • 32. access → frameworks → forfeits
      access | frameworks | forfeits
      financial transactions | general ledger in RDBMS | CAx
      ad-hoc queries | RDS (hosted MySQL) | CAx
      reporting, dashboards | like Pentaho | CAx
      log rotation/persistence | like Riak | xxP
      search indexes | like Lucene/Solr | xAP
      static content, archives | S3 (durable storage) | xAP
      customer facts | like Redis, Membase | xAP
      distributed counters, locks, sets | like Redis | xAP*
      data objects CRUD | key/value – like, NoSQL on MySQL | CxP
      authoritative metadata | like Zookeeper | CxP
      data prep, modeling at scale | like Hadoop/Cascading + R | CxP
      graph analysis | like Hadoop + Redis + Gephi | CxP
      data marts | like Hadoop/HBase | CxP
  • 33. access → frameworks → forfeits (repeats the content of slide 32)
  • 34. and, since “One Size Fits All”… doesn’t
  • 35. a selection of great tools…
      reporting: Graphite, PowerPivot, Pentaho, Jaspersoft, SAS
      visualization: ggplot2, D3, Gephi
      analytics/modeling: R, Weka, Matlab, PMML, GLPK
      text: LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK
      apps: Cascading, Scalding, Cascalog, R markdown, SWF
      scale-out: Scalr, RightScale, CycleComputing, vFabric, Beanstalk
      graph: Gremlin, GraphLab, Neo4J
      column: Vertica, HBase, Drill, Dynamo
      key/val: Redis, Membase, MySQL
      index: Lucene/Solr, ElasticSearch
      relational: usual suspects
      imdg: Spark, Storm, Gigaspaces
      hadoop: EMR, HW, MapR, EMC, Azure, Compute
      machine data: Splunk, collectd, Nagios
      durable storage: S3, ASV, GCS, Riak, Couch
  • 36. common use cases app patterns
  • 37. use case: marketing funnel • must optimize a very large ad spend • different vendors report different metrics • seasonal variation distorts performance • some campaigns are much smaller than others • hard to predict ROI for incremental spend. approach: • log aggregation, followed with cohort analysis • Bayesian point estimates compare different-sized ad tests • customer lifetime value quantifies ROI of new leads • time series analysis normalizes for seasonal variation • geolocation adjusts for regional cost/benefit • linear programming models estimate elasticity of demand [image credit: Wikipedia]
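One way to read “bayesian point estimates compare different-sized ad tests” is as a Beta-Binomial posterior mean, which pulls the rate of a small test toward a prior. The sketch below is an assumption about the technique, not the method used in the deck; the class name, the prior (2 hits, 18 misses), and the campaign numbers are all hypothetical.

```java
// Hypothetical sketch: Beta-Binomial posterior mean as a point estimate
// for conversion rates of ad tests with very different sample sizes.
class AdTestEstimate {
    /**
     * Posterior mean of a conversion rate under a Beta(priorHits, priorMisses)
     * prior, after observing `conversions` successes in `trials` attempts.
     */
    static double posteriorMean(long conversions, long trials,
                                double priorHits, double priorMisses) {
        if (conversions < 0 || trials < conversions) {
            throw new IllegalArgumentException("need 0 <= conversions <= trials");
        }
        return (conversions + priorHits) / (trials + priorHits + priorMisses);
    }

    public static void main(String[] args) {
        // Assumed prior: about a 10% conversion rate, worth 20 pseudo-observations.
        double priorHits = 2.0, priorMisses = 18.0;
        // Small test: 2 of 5 (raw rate 0.40). Large test: 150 of 1000 (raw rate 0.15).
        System.out.println("small test: " + posteriorMean(2, 5, priorHits, priorMisses));
        System.out.println("large test: " + posteriorMean(150, 1000, priorHits, priorMisses));
    }
}
```

With this assumed prior, the small test's raw 40% shrinks to 4/25 = 16%, while the large test barely moves (152/1020, about 14.9%), so the two campaigns can be compared on a similar footing.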
  • 38. use case: ecommerce fraud • sparse data means lots of missing values • “needle in a haystack” lack of training cases • answers are available in large-scale batch, results are needed in real-time event processing • not just one pattern to detect – many, ever-changing approach: • random forest (RF) classifiers predict likely fraud • subsampled data to re-balance training sets • impute missing values based on density functions • train on massive log files, run on in-memory grid • adjust metrics to minimize customer support costs • detect novelty – report anomalies via notifications
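The “subsampled data to re-balance training sets” step can be sketched as random downsampling of the majority class. This is a minimal illustration under stated assumptions: the class name, the 1:1 target ratio, and the sample rows are hypothetical, and the deck does not say which ratio or sampler was used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: downsample the majority class so both classes are the same size.
class Rebalance {
    /** Returns all minority rows plus an equal-sized random sample of majority rows. */
    static <T> List<T> rebalance(List<T> minority, List<T> majority, long seed) {
        List<T> shuffled = new ArrayList<>(majority);
        Collections.shuffle(shuffled, new Random(seed)); // seeded for repeatable runs
        int keep = Math.min(minority.size(), shuffled.size());
        List<T> balanced = new ArrayList<>(minority);
        balanced.addAll(shuffled.subList(0, keep));
        return balanced;
    }

    public static void main(String[] args) {
        List<String> fraud = new ArrayList<>();
        List<String> legit = new ArrayList<>();
        for (int i = 0; i < 3; i++) fraud.add("fraud-" + i);
        for (int i = 0; i < 1000; i++) legit.add("legit-" + i);
        // 3 fraud rows plus 3 sampled legit rows
        System.out.println(rebalance(fraud, legit, 42L).size());
    }
}
```

A fixed seed keeps the training set reproducible between runs, which matters when the resulting classifier is compared across experiments.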
  • 39. use case: customer segmentation • many millions of customers, hard to determine which features resonate • multi-modal distributions get obscured by the practice of calculating an “average” • not much is known about individual customers. approach: • connected components for sessionization, determining uniques from logs • estimates for age, gender, income, geo, etc. • clustering algorithms to group into market segments • social graph infers “unknown” relationships • covariance/heat maps visualize segments vs. feature sets [image credit: Mathworks]
  • 40. use case: monetizing content • need to suggest relevant content which would otherwise get buried in the back catalog • big disconnect between inventory and limited performance ad market • enormous amounts of text, hard to categorize. approach: • text analytics glean key phrases from documents • hierarchical clustering of char frequencies detects lang • latent Dirichlet allocation (LDA) reduces dimension to topic models • recommenders suggest similar topics to customers • collaborative filters connect known users with less known [image credit: Digital Humanities]
  • 41. Intro to Cascading. 4. intro: Cascading API
  • 42. Cascading API: purpose ‣ simplify data processing development and deployment ‣ improve application developer productivity ‣ enable data processing application manageability
  • 43. Cascading API: a few facts. Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc.; in production (~5 yrs) at hundreds of enterprise Hadoop deployments: Finance, Health Care, Transportation, other verticals; studies published about large use cases: Twitter, Etsy, Airbnb, Square, Climate Corporation, FlightCaster, Williams-Sonoma; partnerships and distribution with SpringSource, Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC; several open source projects built atop, managed by Twitter, Etsy, etc., which provide substantial Machine Learning libraries; DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy; data “taps” integrate popular data frameworks via JDBC, Memcached, HBase, plus serialization in Apache Thrift, Avro, Kryo, etc.; entire app compiles into a single JAR: fully connected for compiler optimization, exception handling, debugging, config, scheduling, etc.
  • 44. Cascading API: a few quotes “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.” CIO, Thor Olavsrud, 2012-06-06 “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.” 2012 BOSSIE Awards, James Borck, 2012-09-18 “Company’s promise to application developers is an opportunity to build and test applications on their desktops in the language of choice with familiar constructs and reusable components” Dr. Dobb’s, Adrian Bridgwater, 2012-06-08
  • 45. data+code “political spectrum”: “Notes from the Mystery Machine Bus” by Steve Yegge, Google
      “conservative” | “liberal”
      (mostly) Enterprise | (mostly) Start-Up
      risk management | customer experiments
      assurance | flexibility
      well-defined schema | schema follows code
      explicit configuration | convention
      type-checking compiler | interpreted scripts
      wants no surprises | wants no impediments
      Java, Scala, Clojure, etc. | PHP, Ruby, Python, etc.
      Cascading, Scalding, Cascalog, etc. | Hive, Pig, Hadoop Streaming, etc.
  • 46. Cascading API: adoption As Enterprise apps move into Hadoop and related BigData frameworks, risk profiles shift toward more conservative programming practices Cascading provides a popular API for defining and managing Enterprise data workflows
  • 47. enterprise data workflows: Tuples, Pipelines, Endpoints, Operations, Joins, Assertions, Traps, etc. …in other words, “plumbing” [figure: word count workflow diagram]
  • 48. data workflows: team ‣ Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL) ‣ Systems Integrator POV: system integration of heterogeneous data sources and compute platforms ‣ Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc. ‣ Data Architect POV: a physical plan for large-scale data flow management ‣ Software Architect POV: a pattern language, similar to plumbing or circuit design ‣ App Developer POV: API bindings for Java, Scala, Clojure, Jython, JRuby, etc. ‣ Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo [figure: word count workflow diagram]
  • 49. data workflows: layers. business process: domain expertise, business trade-offs, operating parameters, market position, etc.; API language: Java, Scala, Clojure, Jython, JRuby, Groovy, etc. …envision whatever runs in a JVM; optimize / schedule: major changes in technology now; physical plan: “assembler” code [figure: word count workflow diagram]; compute substrate: Apache Hadoop, in-memory local mode …envision GPUs, streaming, etc.; machine data: Splunk, Nagios, Collectd, New Relic, etc.
  • 50. data workflows: SQL. Relational: SQL parser; logical plan, optimized based on stats; physical plan; query history, table stats; b-trees, etc.; ERD; table schema; catalog
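The Data Scientist point of view on slide 48 mentions applying Amdahl's Law to the workflow DAG. As a reminder of what that bound looks like, here is a small sketch; the class name and the 90% parallel fraction are illustrative assumptions, not figures from the deck.

```java
// Sketch of Amdahl's Law; class and parameter names are illustrative.
class Amdahl {
    /**
     * Upper bound on speedup when a fraction `parallel` (0..1) of the work
     * is spread across `workers` and the remainder stays serial.
     */
    static double speedup(double parallel, int workers) {
        if (parallel < 0.0 || parallel > 1.0 || workers < 1) {
            throw new IllegalArgumentException("need 0 <= parallel <= 1 and workers >= 1");
        }
        return 1.0 / ((1.0 - parallel) + parallel / workers);
    }

    public static void main(String[] args) {
        // Assumed workload: 90% of the flow parallelizes across mappers/reducers.
        for (int workers : new int[] {1, 10, 100, 1000}) {
            System.out.println(workers + " workers: " + speedup(0.9, workers));
        }
    }
}
```

For a flow that is 90% parallel, the bound approaches 1 / (1 - 0.9) = 10 no matter how many nodes are added, which is why the serial stages of a DAG deserve attention first.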
  • 51. data workflows: SQL vs. JVM
      Relational | Cascading + Driven
      SQL parser | SQL-92 compliant parser (in progress)
      logical plan, optimized based on stats | TODO: logical plan, optimized based on stats
      physical plan | API “plumbing”
      query history, table stats | app history, tuple stats
      b-trees, etc. | distributed compute substrate: Hadoop, in-memory, etc.
      ERD | flow diagram
      table schema | tuple schema
      catalog | endpoint usage DB
  • 52. Intro to Cascading. 5. tutorial: for the impatient
  • 53. “Cascading for the Impatient” ‣ a series of introductory tutorials and code samples ‣ 1:1 code comparisons in Scalding, Cascalog, Pig, Hive [figure: word count workflow diagram]
  • 54. 1: copy. 1 mapper, 0 reducers, 10 lines code [figure: Source → M → Sink]
      import java.util.Properties;
      import cascading.flow.FlowDef;
      import cascading.flow.hadoop.HadoopFlowConnector;
      import cascading.pipe.Pipe;
      import cascading.property.AppProps;
      import cascading.scheme.hadoop.TextDelimited;
      import cascading.tap.Tap;
      import cascading.tap.hadoop.Hfs;
      public class Main
        {
        public static void main( String[] args )
          {
          String inPath = args[ 0 ];
          String outPath = args[ 1 ];
          Properties props = new Properties();
          AppProps.setApplicationJarClass( props, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );
          // create the source tap (tab-delimited, with a header row)
          Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );
          // create the sink tap
          Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );
          // specify a pipe to connect the taps
          Pipe copyPipe = new Pipe( "copy" );
          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
           .addSource( copyPipe, inTap )
           .addTailSink( copyPipe, outTap );
          // run the flow
          flowConnector.connect( flowDef ).complete();
          }
        }
  • 55. wait! ten lines of code for a file copy… seems like a lot.
  • 56. same JAR, any scale… MegaCorp Enterprise IT: Pb’s data; 1000+ node private cluster; EVP calls you when app fails; runtime: days+. Production Cluster: Tb’s data; EMR w/ 50 HPC Instances; Ops monitors results; runtime: hours – days. Staging Cluster: Gb’s data; EMR + 4 Spot Instances; CI shows red or green lights; runtime: minutes – hours. Your Laptop: Mb’s data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds – minutes
  • 57. 2: word count. 1 mapper, 1 reducer, 18 lines code [figure: flow diagram: Document Collection, Tokenize, GroupBy token, Count, Word Count]
  • 58. Cascading / Java [figure: word count flow diagram]
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
       .addSource( docPipe, docTap )
       .addTailSink( wcPipe, wcTap );
      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect( flowDef );
      wcFlow.writeDOT( "dot/" );
      wcFlow.complete();
  • 59. Scalding / Scala [figure: word count flow diagram]
      // Sujit Pal
      //
      package com.mycompany.impatient
      import com.twitter.scalding._
      class Part2(args : Args) extends Job(args) {
        val input = Tsv(args("input"), ('docId, 'text))
        val output = Tsv(args("output"))
        input.read.
          flatMap('text -> 'word) { text : String => text.split("""\s+""") }.
          groupBy('word) { group => group.size }.
          write(output)
      }
  • 60. Cascalog / Clojure [figure: word count flow diagram]
      ; Paul Lam
      ;
      (ns impatient.core
        (:use [cascalog.api]
              [cascalog.more-taps :only (hfs-delimited)])
        (:require [clojure.string :as s]
                  [cascalog.ops :as c])
        (:gen-class))
      (defmapcatop split [line]
        "reads in a line of string and splits it by regex"
        (s/split line #"[\[\]\\\(\),.)\s]+"))
      (defn -main [in out & args]
        (?<- (hfs-delimited out)
             [?word ?count]
             ((hfs-delimited in :skip-header? true) _ ?line)
             (split ?line :> ?word)
             (c/count ?count)))
  • 61. Hive [figure: word count flow diagram]
      -- Steve Severance
      --
      CREATE TABLE input (line STRING);
      LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;
      SELECT word, COUNT(*)
      FROM input LATERAL VIEW explode(split(line, ' ')) lTable AS word
      GROUP BY word;
  • 62. Pig [figure: word count flow diagram]
      -- kudos to Dmitriy Ryaboy
      docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource') AS (doc_id, text);
      docPipe = FILTER docPipe BY doc_id != 'doc_id';
      -- specify regex to split "document" text lines into token stream
      tokenPipe = FOREACH docPipe GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
      tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';
      -- determine the word counts
      tokenGroups = GROUP tokenPipe BY token;
      wcPipe = FOREACH tokenGroups GENERATE group AS token, COUNT(tokenPipe) AS count;
      -- output
      STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
      EXPLAIN -out dot/ -dot wcPipe;
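For readers without a Hadoop setup, the tokenize → group → count dataflow from slides 57 to 62 can be mimicked in plain JDK code. This is only an in-memory sketch: the class and method names are hypothetical, it is not Cascading code, and unlike a raw regex split it drops the empty tokens that appear between adjacent delimiters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory analogue of the word count flow (not Cascading API code).
class WordCountSketch {
    // Delimiter set: space, brackets, parentheses, comma, period.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    /** Splits each line into tokens, then counts occurrences per token. */
    static Map<String, Long> wordCount(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys, like a grouped output
        for (String line : lines) {
            for (String token : line.split(DELIMITERS)) {
                if (token.isEmpty()) {
                    continue; // deviation from a raw split: skip empty tokens
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "A rain shadow is a dry area",
            "rain, rain.");
        System.out.println(wordCount(docs));
    }
}
```

Note that tokens are case-sensitive here (“A” and “a” are counted separately), which is the gap the later “scrub” step addresses.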
  • 63. 3: wc + scrub. 1 mapper, 1 reducer, 22+10 lines code [figure: flow diagram: Document Collection, Tokenize, Scrub token, GroupBy token, Count, Word Count]
  • 64. 4: wc + scrub + stop words. 1 mapper, 1 reducer, 28+10 lines code [figure: flow diagram adding a Stop Word List joined via HashJoin (Left / RHS) and a Regex token step]
  • 65. 5: tf-idf. 11 mappers, 9 reducers, 65+10 lines code [figure: tf-idf flow diagram; labels include D, DF, TF, Unique, Insert 1, SumBy, HashJoin, CoGroup, GroupBy, Count, ExprFunc tf-idf, TF-IDF, Word Count]
  • 66. 6: tf-idf + tdd. 12 mappers, 9 reducers, 76+14 lines code [figure: tf-idf flow diagram extended with Assert, Checkpoint, and Failure Traps]
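The stop word step on slide 64 pairs a left-hand token stream with a small right-hand stop word list via a HashJoin, then filters on the join result. A plain JDK sketch of that idea follows; the names are hypothetical, the in-memory `Set` stands in for the right-hand side, and this is an analogue of the pattern rather than the Cascading implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical analogue of "HashJoin (left) + filter" for stop word removal.
class HashJoinSketch {
    /** Left join: every token is kept, paired with its stop word match or null. */
    static List<String[]> leftJoin(List<String> tokens, Set<String> stopWords) {
        List<String[]> joined = new ArrayList<>();
        for (String token : tokens) {
            String match = stopWords.contains(token) ? token : null;
            joined.add(new String[] {token, match});
        }
        return joined;
    }

    /** Keeps only the rows whose right-hand side is null, i.e. non-stop-words. */
    static List<String> dropStopWords(List<String[]> joined) {
        List<String> kept = new ArrayList<>();
        for (String[] row : joined) {
            if (row[1] == null) {
                kept.add(row[0]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("a", "is"));
        List<String> tokens = Arrays.asList("a", "rain", "shadow", "is", "a", "dry", "area");
        System.out.println(dropStopWords(leftJoin(tokens, stop)));
    }
}
```

Holding the small side in memory is what lets the join run without a reduce step, as long as the stop word list fits on each node.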
  • 67. deployed on AWS… elastic-mapreduce --create --name "TF-IDF" --jar s3n:// --arg s3n:// --arg s3n:// --arg s3n:// --arg s3n:// --arg s3n:// --arg s3n://
  • 68. results? [figure: tf-idf flow diagram with Assert, Checkpoint, and Failure Traps]
      input:
      doc_id | text
      doc01 | A rain shadow is a dry area on the lee back side of a mountainous area.
      doc02 | This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
      doc03 | A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
      doc04 | This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
      doc05 | Two Women. Secrets. A Broken Land. [DVD Australia]
      zoink | null
      output:
      doc_id | tf-idf | token
      doc02 | 0.9163 | air
      doc05 | 0.9163 | australia
      doc05 | 0.9163 | broken
      doc04 | 0.9163 | california's
      doc04 | 0.9163 | cause
      doc02 | 0.9163 | cloudcover
      doc04 | 0.9163 | death
      doc04 | 0.9163 | deserts
      doc03 | 0.9163 | downwind
      …
      doc02 | 0.9163 | sinking
      doc04 | 0.9163 | such
      doc04 | 0.9163 | valley
      doc05 | 0.9163 | women
      doc03 | 0.5108 | land
      doc05 | 0.5108 | land
      doc01 | 0.5108 | lee
      doc02 | 0.5108 | lee
      doc03 | 0.5108 | leeward
      doc04 | 0.5108 | leeward
      doc01 | 0.4463 | area
      doc02 | 0.2231 | area
      doc03 | 0.2231 | area
      doc01 | 0.2231 | dry
      doc02 | 0.2231 | dry
      doc03 | 0.2231 | dry
      doc02 | 0.2231 | mountain
      doc03 | 0.2231 | mountain
      doc04 | 0.2231 | mountain
      doc01 | 0.0000 | rain
      doc02 | 0.0000 | rain
      doc03 | 0.0000 | rain
      doc04 | 0.0000 | rain
      doc01 | 0.0000 | shadow
      doc02 | 0.0000 | shadow
      doc03 | 0.0000 | shadow
      doc04 | 0.0000 | shadow
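The weights on slide 68 are consistent with tf-idf = tf × ln(D / (1 + df)) over D = 5 documents, where tf is the token's count within a document and df is the number of documents containing it. That formula is inferred here from the numbers on the slide rather than quoted from the deck, so the sketch below should be read as an assumption; the class and method names are hypothetical.

```java
// Hypothetical sketch: reproduces slide 68 weights under an assumed formula.
class TfIdfSketch {
    /**
     * Assumed weighting, inferred from the slide's numbers:
     * tf * ln(totalDocs / (1 + docFreq)).
     */
    static double tfIdf(int termFreq, int totalDocs, int docFreq) {
        return termFreq * Math.log((double) totalDocs / (1.0 + docFreq));
    }

    public static void main(String[] args) {
        int totalDocs = 5; // doc01 through doc05
        System.out.println("air in doc02    (tf=1, df=1): " + tfIdf(1, totalDocs, 1));
        System.out.println("lee in doc01    (tf=1, df=2): " + tfIdf(1, totalDocs, 2));
        System.out.println("area in doc01   (tf=2, df=3): " + tfIdf(2, totalDocs, 3));
        System.out.println("area in doc02   (tf=1, df=3): " + tfIdf(1, totalDocs, 3));
        System.out.println("rain in doc01   (tf=1, df=4): " + tfIdf(1, totalDocs, 4));
    }
}
```

Worked by hand: ln(5/2) ≈ 0.9163 for tokens found in one document, ln(5/3) ≈ 0.5108 for two, ln(5/4) ≈ 0.2231 for three (doubled to 0.4463 for “area” in doc01, where it appears twice), and ln(5/5) = 0 for “rain” and “shadow”, which occur in four of the five documents.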
  • 69. comparisons? compare similar code in Scalding (Scala) and Cascalog (Clojure): based on: based on:
  • 70. Intro to Cascading. 6. code: sample apps
  • 71. Social Recommender ‣ social recommender based on Twitter: suggest users who tweet about similar stocks ‣ instead of a cross-product (potential bottleneck) this runs in parallel on Hadoop ‣ uses a stop word list to remove common words, offensive phrases, etc. ‣ one tap measures token frequency: for QA, adjust stop words, improve filter, etc. ‣ adapted in Spring by Costin Leau [diagram labels: Twitter tweets; filter stop words; calculate similarity; QA; threshold min, max; Neo4j; LDA; Redis]
  • 72. SocRec: architecture [diagram labels: source: Twitter firehose, tweets ( uid, tweet, t ); filter stop words; low-freq batch updates; checkpoint: tokenized tweets; checkpoint: token frequency, for analysis + QA curation; calculate similarity; checkpoint: similarity thresholds; threshold min, max; similar users (uid: uidx, rank); sinks: Neo4j: social graph; Redis results; LDA: topic trending]
  • 73. SocRec: results
      uid | recommend | weight
      carbonfiberxrm | ClosingBellNews | 0.1459
      carbonfiberxrm | DJFunkyGrrL | 0.0870
      ClosingBellNews | DJFunkyGrrL | 0.1491
      CloudStocks | DJFunkyGrrL | 0.1206
      ElmoreNicole | DJFunkyGrrL | 0.1798
      EsNeey | alexiolo_ | 0.8603
      ...
  • 74. City of Palo Alto open data ‣ GIS export for parks, roads, trees (unstructured / open data) ‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks ‣ curated metadata, used to enrich the dataset ‣ could extend via mash-up with many available public data APIs. Enterprise-scale app: road albedo + tree species metadata + geospatial indexing. “Find a shady spot on a summer day to walk near downtown and take a call…” [figure: flow diagram; labels include CoPA GIS export, Regex parser, Regex filter, Scrub species, Tree Metadata, Road Metadata, Road Segments, Geohash, HashJoin, CoGroup, GroupBy, Checkpoint, Tree Distance, Estimate Albedo, GPS logs, Failure Traps, and outputs tree, road, park, shade, reco]
  • 76. CoPA: results ‣ addr: 115 HAWTHORNE AVE ‣ lat/lng: 37.446, -122.168 ‣ geohash: 9q9jh0 ‣ tree: 413 site 2 ‣ species: Liquidambar styraciflua ‣ avg height 23 m ‣ road albedo: 0.12 ‣ distance: 10 m ‣ a short walk from my train stop ✔ [figure: density plot titled “Estimated Tree Height (meters)”; x-axis avg_height (0 to 50), y-axis density (0.00 to 0.12), with a count legend (0 to 300)]
  • 77. drill-down blog, code/wiki/gists, jars, list, DevOps products: @pacoid
