Offline Processing with Hadoop

Chris K Wensel
Concurrent, Inc.
Introduction

Chris K Wensel
chris@wensel.net

• Cascading, Lead Developer
  – http://cascading.org/
• Concurrent, Inc., Founder
  – Hadoop/Cascading support and tools
  – http://concurrentinc.com/
Computing Systems

[diagram: data → info → value]

• Exist to create value out of data
• Everything else is an implementation detail
In Today's Computing Environment

• Lots of relevant medium-to-large data sets
  – that individually could fit in an RDBMS
• Lots of applications touching that data
  – where do you think Perl came from?
• Underutilized hardware owning (intermediate) data
  – Xen/VMware add complexity (sprawl)
continued...

• Raw data continuously arriving (and in bursts)
  – we mostly care about the new stuff
• Raw data is dirty
  – bots and bugs
• Demands on timely/predictable result availability
  – downstream systems must be fed
• The ‘Cloud’ is enabling an on-demand model
Data Warehousing != Data Processing

[diagram: ETL hub-and-spoke [monolithic] vs. process streams [distributed]]

• Data Warehousing
  – monolithic systems and data schema
  – distribution through manual federation/sharding
• Data Processing
  – cluster of peer systems
  – dynamic, even distribution of data and processing
Data Warehousing

[diagram: loggers → raw data → ETL → data warehouse [cache] → ETL → reporting [BI, KPI, etc] and data mining → product → Consumer; Analyst pulls some data into R, SAS, Excel, etc]

• Agility: no “one size fits all” schema, resistant to change
• Complex Analytics: cannot be represented by SQL
• Massive Data Sets: won’t fit or too
Production Data Processing

[diagram: loggers → raw data → data processing → valuable data → Consumer]

• Online / Real-Time processing
  – low latency (milliseconds to seconds for results)
  – smaller datasets - streams
• Offline / Batch
  – high latency (minutes to days for results)
  – larger datasets - files
Hadoop Adoption

[diagram: Cluster → Racks → Nodes, forming a Global Compute-space and a Global Namespace]

• Distributed, replicated storage for large files
• Distributed, fault-tolerant execution of batch processes
• Scale out vs. (legacy) scale up
• Java API allows complex analysis
But Stuffed into Legacy Roles

[diagram: loggers → raw data → ETL → data warehouse (Hadoop + Pig/Hive) → ETL → data mining → Analyst]

• Hadoop deployments mirror legacy architectures
  – ETL into cached “structured storage”
• Pig/Hive are syntaxes for Data Mining “Big” data
  – SQL-like, but hard to customize and not “advanced”
Hadoop for Data Processing

[diagram: Simplicity → Scalability → Value Creation]

• More Value through Innovation
• Scalability, Not Performance
• Simplifies Infrastructure
Simplicity

[diagram: Cluster → Racks → Nodes; CPUs form a Global Compute-space, disks form a Global Namespace]

• Virtualization across resources, not within (PaaS)
  – A single FileSystem across disks - no DBA
  – A single Execution System across CPUs - less IT
Scalability

[diagram: Users (Clients) submit jobs to the Cluster; jobs spread across Racks and Nodes]

• Scalability - continued reliability and met expectations as demand changes
• Application Scalability - as data grows, app/infra expand
• Organizational Scalability - simpler infrastructure
Creating Value

[diagram: Producer → loggers → raw data → data processing (Hadoop + Cascading: ETL, analytics) → events, reporting, product, operational data → Consumer → Value]

• Unconstrained processing model
• Data processing requires integration
• Processing must not fail or fall behind
Consequences

• Improved reliability of production processes
  – “we had a failed disk yet jobs never failed”
• Greater utilization of hardware resources
  – dynamically moves code to available cores
• Increased rate of innovation
  – diverse analytics over larger sets, less bureaucracy
• Fewer staff
Hadoop MapReduce

[diagram: Count Job: File → Map → [ k, v ] → Reduce over [ k, [v] ] → File; Sort Job: File → Map → [ k, v ] → Reduce over [ k, [v] ] → File]

  [ k, v ] = key and value pair
  [ k, [v] ] = key and associated values collection

• Nearly impossible to “think in”
• Apps are many dependent MR jobs
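The [ k, v ] → [ k, [v] ] flow above, and the way apps chain dependent jobs, can be sketched as a minimal in-memory simulation of the model (plain Python, not Hadoop code; the word/line data is illustrative):

```python
from itertools import groupby
from operator import itemgetter

def run_mr(records, mapper, reducer):
    """Minimal in-memory MapReduce: map, shuffle/sort by key, reduce."""
    mapped = [kv for rec in records for kv in mapper(rec)]   # emit [ k, v ] pairs
    mapped.sort(key=itemgetter(0))                           # shuffle/sort phase
    out = []
    for key, group in groupby(mapped, key=itemgetter(0)):    # reduce sees [ k, [v] ]
        out.extend(reducer(key, [v for _, v in group]))
    return out

# Job 1: count words (Count Job)
counts = run_mr(
    ["the quick fox", "the lazy dog"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: [(word, sum(ones))],
)

# Job 2: rank by count (Sort Job) - a dependent job that rekeys Job 1's output
ranked = run_mr(
    counts,
    mapper=lambda kv: [(kv[1], kv[0])],
    reducer=lambda n, words: [(n, w) for w in sorted(words)],
)
```

Even this trivial word-count/sort takes two chained jobs, which is the point of the last bullet: real applications become long graphs of dependent MR jobs.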
Cascading

[diagram: Word Count/Sort Flow: Data → Parse (Map) → Group → Count (Reduce) → Sort (Map/Reduce) → Data, passing [ f1, f2, ... ] tuples between operations]

  [ f1, f2, ... ] = tuples with field names

• Alternative model & API to MapReduce
  – pipes/filters of re-usable operations
• For rapidly implementing Data Processing Systems
• Open-Source
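The Parse → Group → Count → Sort flow above can be sketched as pipe/filter stages over field-named tuples. This is a minimal Python illustration of the model only, not the actual Cascading Java API; the field names and sample lines are made up:

```python
# Pipe/filter sketch: each stage consumes and emits a stream of
# field-named tuples (modeled here as dicts), like [ f1, f2, ... ].
from collections import Counter

def parse(lines):                      # Parse: raw line -> {"word": ...} tuples
    for line in lines:
        for word in line.lower().split():
            yield {"word": word}

def group_and_count(tuples, on):       # Group + Count: one tuple per group
    counts = Counter(t[on] for t in tuples)
    for key, n in counts.items():
        yield {on: key, "count": n}

def sort_by(tuples, field):            # Sort: order tuples by a field
    return sorted(tuples, key=lambda t: t[field])

# Assemble the flow by composing reusable operations:
flow = sort_by(
    group_and_count(parse(["the quick fox", "the lazy dog"]), on="word"),
    field="count",
)
```

The appeal is that the author thinks in named fields and reusable operations, and a planner (in Cascading's case) translates the assembled flow into however many MapReduce jobs are needed.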
Emerging Tool Support

• Karmasphere IDE (soon)
  – Developing and Debugging
• Bixo (Bixo Labs) Data Mining Toolkit
  – Apache Nutch replacement
  – Easier to customize to meet new business models
• Clojure & JRuby Domain Specific Languages (DSLs)
  – Machine Learning
  – Simple/Complex Ad-Hoc queries
Practical Applications
• Log/event analysis, device and system
  monitoring
• Web crawling and content mining
• Behavioral ad-targeting segmentation
• Ad campaign ROI
• Demand and event prediction
• POS analytics for product demand pricing
Successes

• Publicis/RazorFish - Behavioral Ad-Targeting
  – Cascading + AWS (Elastic MapReduce)
  – Daily automated User Behavior Segmentation
  – 6 wks dev, 3 TB/day, $13k/mo
  – 500% increase in return on ad spend from a similar campaign a year before
continued...

• FlightCaster - Predicting flight delays
  – Clojure + Cascading + AWS
  – Machine learning and production processing
  – 3 mos dev, 10 GB/day, <1 TB total currently, <$2k/mo
• Etsy - Online Marketplace
  – JRuby + Cascading
  – Data mining (Hadoop as a DW!)
  – 750M page-views/mo, 60 GB/day of logs
Resources
• Chris K Wensel
  – chris@wensel.net
  – @cwensel
• Cascading
  – an API for optimizing production data
    processing
  – http://cascading.org
• Concurrent, Inc.
  – Support and Mentoring
  – http://concurrentinc.com
