Processing Big Data


Cloud Connect presentation on "Big" Data Processing

1 Comment
  • It's unfortunate you feel the need to bad-mouth the data warehouse to substantiate the business value of Hadoop and Cascading. You are ignorant of what a good data warehouse can do; your claims are in error. If the only way Hadoop succeeds is to displace the DW, it will be obliterated by Oracle, IBM, MS, etc. I believe Hadoop has a wonderful future, but it ain't a database, so get over it. Focus on what it does best and quit bad-mouthing a $40B industry that runs in every data center.

  • Processing Big Data

    1. Offline Processing with Hadoop. Chris K Wensel, Concurrent, Inc.
    2. Introduction. Chris K Wensel • Cascading, Lead Developer • Concurrent, Inc., Founder • Hadoop/Cascading support and tools
    3. Computing Systems. [diagram: data → info → value] • Exist to create value out of data • Everything else is an implementation detail
    4. In Today's Computing Environment • Lots of relevant medium-to-large data sets – that individually could fit in an RDBMS • Lots of applications touching that data – where do you think Perl came from? • Underutilized hardware owning (intermediate) data – Xen/VMware add complexity (sprawl)
    5. continued... • Raw data continuously arriving (and in bursts) – we mostly care about the new stuff • Raw data is dirty – bots and bugs • Demands on timely/predictable result availability – downstream systems must be fed • The ‘Cloud’ is enabling an on-demand model
    6. Data Warehousing != Data Processing. [diagram: hub-and-spoke ETL (monolithic) vs. process streams (distributed)] • Data Warehousing – monolithic systems and data schema – distribution through manual federation/sharding • Data Processing – cluster of peer systems – dynamic, even distribution of data and processing
    7. Data Warehousing. [diagram: loggers → ETL → raw data (cache) → data warehouse → ETL → reporting (BI, KPIs), data mining, product; analysts consume via R, SAS, Excel, etc.] • Agility – no “one size fits all” schema, resistant to change • Complex analytics – cannot be represented by SQL • Massive data sets – won’t fit or too…
    8. Production Data Processing. [diagram: loggers → raw data → data processing → valuable data → consumer] • Online / Real-Time process – low latency (milliseconds to seconds for results) – smaller datasets: streams • Offline / Batch – high latency (minutes to days for results) – larger datasets: files
    9. Hadoop Adoption. [diagram: cluster → racks → nodes; global compute-space, global namespace] • Distributed replicated storage for large files • Distributed fault-tolerant execution of batch processes • Scale out vs. (legacy) scale up • Java API allows complex analysis
    10. But Stuffed into Legacy Roles. [diagram: loggers → ETL → Hadoop + Pig/Hive → data warehouse, data mining; analysts] • Hadoop deployments mirror legacy architectures – ETL into cached “structured storage” • Pig/Hive are syntaxes for data mining “Big” data – SQL-like, but hard to customize and not “advanced”
    11. Hadoop for Data Processing. Value Creation, Scalability, Simplicity • More value through innovation • Scalability, not performance • Simplifies infrastructure
    12. Simplicity. [diagram: cluster → racks → nodes; CPUs form a global compute-space, disks a global namespace] • Virtualization across resources, not within (PaaS) – a single FileSystem across disks: no DBA – a single execution system across CPUs: less IT
    13. Scalability. [diagram: clients submitting jobs to a cluster of racks and nodes] • Scalability – continued reliability and met expectations as demand changes • Application scalability – as data grows, app/infrastructure expand • Organizational scalability – simpler infrastructure
    14. Creating Value. [diagram: loggers → raw data → data processing (Hadoop + Cascading ETL, Hadoop + Cascading analytics) → events, reporting, product, operational value; producer → consumer] • Unconstrained processing model • Data processing requires integration • Processing must not fail or fall behind
    15. Consequences • Improved reliability of production processes – “we had a failed disk yet jobs never failed” • Greater utilization of hardware resources – dynamically moves code to available cores • Increased rate of innovation – diverse analytics over larger sets, less bureaucracy • Fewer staff
    16. Hadoop MapReduce. [diagram: a Count job chained into a Sort job; each job's Map phase reads [k, v] pairs from files, the shuffle groups them into [k, [v]], and the Reduce phase writes [k, v] back to files; [k, v] = key and value pair, [k, [v]] = key and associated values collection] • Nearly impossible to “think in” • Apps are many dependent MR jobs
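The slide's two chained jobs (count words, then sort by count) can be sketched as a single-machine simulation in plain Java. This is illustrative only: the class and method names here are hypothetical, not Hadoop's `Mapper`/`Reducer` API, and on a real cluster the framework performs the shuffle between phases and runs the two jobs as separate passes over files.

```java
import java.util.*;
import java.util.stream.*;

// Single-machine sketch of the slide's Count job -> Sort job chain.
// Names (countJob, sortJob) are hypothetical stand-ins, not Hadoop API.
public class WordCountSort {

    static Map<String, Integer> countJob(List<String> lines) {
        // Map phase: emit a [word, 1] pair for every word in every line.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));
        // Shuffle + Reduce: group values by key ([k, [v]]) and sum them.
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    static List<Map.Entry<String, Integer>> sortJob(Map<String, Integer> counts) {
        // Second job: reorder the [word, count] output by descending count.
        // (Real MapReduce gets sorting "for free" from the shuffle's key order.)
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown fox", "the lazy dog the fox");
        List<Map.Entry<String, Integer>> result = sortJob(countJob(lines));
        System.out.println(result.get(0).getKey() + "=" + result.get(0).getValue());
        // prints: the=3
    }
}
```

Even this toy version shows the slide's complaint: the word-count logic is smeared across two jobs and an implicit shuffle, which is exactly the coordination burden that motivates a higher-level API.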
    17. Cascading Word Count/Sort Flow. [diagram: a Parse → Group → Count → Sort pipe assembly over [f1, f2, ...] tuples with field names, planned onto two Map/Reduce stages] • Alternative model & API to MapReduce – pipes/filters of reusable operations • For rapidly implementing Data Processing Systems • Open source
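The pipes-and-filters idea on this slide can be sketched in plain Java by making each stage a reusable function over a stream of tuples and composing them into a "flow". This is a hypothetical stand-in for Cascading's actual `Each`/`GroupBy`/`Every` pipes, written with only the standard library; the real Cascading planner additionally maps such an assembly onto chained MapReduce jobs for you.

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.*;

// Plain-Java sketch of a Parse -> Group/Count -> Sort pipe assembly.
// Stage and class names are illustrative, not the Cascading API.
public class WordCountFlow {

    // Parse: split each "line" tuple into one "word" tuple per token.
    static Function<Stream<String>, Stream<String>> parse =
        lines -> lines.flatMap(l -> Arrays.stream(l.toLowerCase().split("\\s+")))
                      .filter(w -> !w.isEmpty());

    // Group + Count: group the "word" tuples and attach a "count" field.
    static Function<Stream<String>, Stream<Map.Entry<String, Long>>> groupCount =
        words -> words.collect(Collectors.groupingBy(w -> w, Collectors.counting()))
                      .entrySet().stream();

    // Sort: order the (word, count) tuples by descending count.
    static Function<Stream<Map.Entry<String, Long>>, List<Map.Entry<String, Long>>> sort =
        tuples -> tuples.sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                        .collect(Collectors.toList());

    // The flow is just the stages composed, like a pipe assembly.
    static List<Map.Entry<String, Long>> run(List<String> lines) {
        return parse.andThen(groupCount).andThen(sort).apply(lines.stream());
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Long> e : run(List.of("the fox", "the dog")))
            System.out.println(e.getKey() + "\t" + e.getValue());
    }
}
```

Contrast with the previous sketch: the same computation reads as one linear chain of named, reusable operations, with no visible job boundaries or shuffle, which is the slide's argument for the pipe/filter model.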
    18. Emerging Tool Support • Karmasphere IDE (soon) – developing and debugging • Bixo (Bixo Labs) data mining toolkit – Apache Nutch replacement – easier to customize to meet new business models • Clojure & JRuby Domain-Specific Languages (DSLs) – machine learning – simple/complex ad-hoc queries
    19. Practical Applications • Log/event analysis, device and system monitoring • Web crawling and content mining • Behavioral ad-targeting segmentation • Ad campaign ROI • Demand and event prediction • POS analytics for product demand pricing
    20. Successes • Publicis/RazorFish – Behavioral Ad-Targeting – Cascading + AWS (Elastic MapReduce) – daily automated user behavior segmentation – 6 weeks dev, 3 TB/day, $13k/mo – 500% increase in return on ad spend over a similar campaign a year before
    21. continued... • FlightCaster – predicting flight delays – Clojure + Cascading + AWS – machine learning and production processing – 3 months dev, 10 GB/day, <1 TB total currently, <$2k/mo • Etsy – online marketplace – JRuby + Cascading – data mining (Hadoop as a DW!) – 750M page-views/mo, 60 GB/day of logs
    22. Resources • Chris K Wensel – @cwensel • Cascading – an API for optimizing production data processing • Concurrent, Inc. – support and mentoring