Processing Big Data

2,470 views

Cloud Connect presentation on "Big" Data Processing

Published in: Technology
1 Comment
4 Likes
Statistics
Notes
  • It's unfortunate you feel the need to bad-mouth the data warehouse to substantiate the business value of Hadoop and Cascading. You are ignorant of what a good data warehouse can do -- your claims are in error. If the only way Hadoop succeeds is to displace the DW, it will be obliterated by Oracle, IBM, MS, etc. I believe Hadoop has a wonderful future -- but it ain't a database, so get over it. Focus on what it does best and quit bad-mouthing a $40B industry that runs in every data center.

    1. Offline Processing with Hadoop
       Chris K Wensel, Concurrent, Inc.
    2. Introduction
       Chris K Wensel, chris@wensel.net
       • Cascading, Lead Developer (http://cascading.org/)
       • Concurrent, Inc., Founder: Hadoop/Cascading support and tools (http://concurrentinc.com/)
    3. Computing Systems
       [diagram: data → info → value]
       • Computing systems exist to create value out of data
       • Everything else is an implementation detail
    4. In Today's Computing Environment
       • Lots of relevant medium-to-large data sets
         – that individually could fit in an RDBMS
       • Lots of applications touching that data
         – where do you think Perl came from?
       • Underutilized hardware owning (intermediate) data
         – Xen/VMware add complexity (sprawl)
    5. continued...
       • Raw data continuously arriving (and in bursts)
         – we mostly care about the new stuff
       • Raw data is dirty
         – bots and bugs
       • Demands on timely/predictable result availability
         – downstream systems must be fed
       • The "Cloud" is enabling an on-demand model
    6. Data Warehousing != Data Processing
       [diagram: hub-and-spoke ETL (monolithic) vs. process streams (distributed)]
       • Data Warehousing
         – monolithic systems and data schema
         – distribution through manual federation/sharding
       • Data Processing
         – cluster of peer systems
         – dynamic, even distribution of data and processing
    7. Data Warehousing
       [diagram: loggers → ETL → raw data / data warehouse (cache) → ETL → reporting (BI, KPI, etc.), data mining (R, SAS, Excel, etc.) → Analyst, Consumer]
       • Agility: no "one size fits all" schema; resistant to change
       • Complex Analytics: cannot be represented by SQL
       • Massive Data Sets: won't fit or too
    8. Production Data Processing
       [diagram: loggers → raw data → data processing → valuable data → Consumer]
       • Online / Real-Time processing
         – low latency (milliseconds to seconds for results)
         – smaller datasets: streams
       • Offline / Batch
         – high latency (minutes to days for results)
         – larger datasets: files
    9. Hadoop Adoption
       [diagram: cluster of racks of nodes, forming a global compute-space and a global namespace]
       • Distributed, replicated storage for large files
       • Distributed, fault-tolerant execution of batch processes
       • Scale out vs. (legacy) scale up
       • Java API allows complex analysis
    10. But Stuffed into Legacy Roles
        [diagram: loggers → ETL → Hadoop + Pig/Hive → ETL → data warehouse, data mining, Analyst]
        • Hadoop deployments mirror legacy architectures
          – ETL into cached "structured storage"
        • Pig/Hive are syntaxes for data mining "Big" data
          – SQL-like, but hard to customize and not "advanced"
    11. Hadoop for Data Processing
        [diagram: Value Creation, Scalability, Simplicity]
        • More Value through Innovation
        • Scalability, Not Performance
        • Simplifies Infrastructure
    12. Simplicity
        [diagram: cluster of racks of nodes; a global compute-space over CPUs, a global namespace over disks]
        • Virtualization across resources, not within (PaaS)
          – A single FileSystem across disks: no DBA
          – A single Execution System across CPUs: less IT
    13. Scalability
        [diagram: users/clients submitting jobs to a cluster of racks of nodes]
        • Scalability: continued reliability and met expectations as demand changes
        • Application Scalability: as data grows, app/infrastructure expand
        • Organizational Scalability: simpler infrastructure
    14. Creating Value
        [diagram: loggers → raw data → data processing (Hadoop + Cascading ETL, Hadoop + Cascading analytics) → events, reporting, operational product value; Producer → Consumer]
        • Unconstrained processing model
        • Data processing requires integration
        • Processing must not fail or fall behind
    15. Consequences
        • Improved reliability of production processes
          – "we had a failed disk yet jobs never failed"
        • Greater utilization of hardware resources
          – dynamically moves code to available cores
        • Increased rate of innovation
          – diverse analytics over larger sets, less bureaucracy
        • Fewer staff
    16. Hadoop MapReduce
        [diagram: a Count Job feeding a Sort Job; File → Map emits [ k, v ] → Reduce receives [ k, [v] ] → File]
        [ k, v ] = key and value pair; [ k, [v] ] = key and associated collection of values
        • Nearly impossible to "think in"
        • Apps are many dependent MR jobs
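The two chained jobs on this slide (a count job feeding a sort job) can be illustrated with a minimal in-memory sketch of the MapReduce model. This is not the Hadoop API; the `map_reduce` helper and the toy input are assumptions made purely to show how [ k, v ] pairs become [ k, [v] ] groups, and why multi-step apps end up as chains of dependent jobs.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal in-memory MapReduce: map each record to [k, v] pairs,
    shuffle them into [k, [v]] groups (sorted by key), reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for k, v in mapper(record):
            groups[k].append(v)
    out = []
    for k, vs in sorted(groups.items()):
        out.extend(reducer(k, vs))
    return out

lines = ["the quick fox", "the lazy dog", "the fox"]

# Job 1 (Count): emit (word, 1) per word, then sum each word's values.
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: [(word, sum(ones))],
)

# Job 2 (Sort): re-key by count so the shuffle's key ordering
# sorts the output by frequency -- a second, dependent MR job.
by_freq = map_reduce(
    counts,
    mapper=lambda kv: [(kv[1], kv[0])],
    reducer=lambda count, words: [(count, w) for w in sorted(words)],
)
print(by_freq)  # -> [(1, 'dog'), (1, 'lazy'), (1, 'quick'), (2, 'fox'), (3, 'the')]
```

Even this trivial word-count-and-sort needs two jobs wired together by hand, which is the "apps are many dependent MR jobs" pain the slide points at.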
    17. Cascading Word Count/Sort Flow
        [diagram: Data → Parse → Group → Count → Sort → Data, as pipes over [ f1, f2, ... ] tuples with field names, planned onto Map and Reduce phases]
        • Alternative model & API to MapReduce
          – pipes/filters of re-usable operations
        • For rapidly implementing Data Processing Systems
        • Open-Source
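The Parse → Group → Count → Sort assembly can be approximated as a pipes-and-filters composition over field-named tuples. The sketch below is a hedged model of that idea in Python, not the Cascading Java API: the `parse`, `group_count`, `sort_by`, and `flow` names are invented for illustration, standing in for Cascading's Each/GroupBy/Every/Flow concepts.

```python
# Each "pipe" is a function from a list of tuples (dicts keyed by
# field name) to a new list of tuples, so pipes compose freely.

def parse(tuples):
    # Split each "line" field into one tuple per "word".
    return [{"word": w} for t in tuples for w in t["line"].split()]

def group_count(tuples, key):
    # Group on a field and count occurrences per group.
    counts = {}
    for t in tuples:
        counts[t[key]] = counts.get(t[key], 0) + 1
    return [{key: k, "count": n} for k, n in counts.items()]

def sort_by(tuples, key):
    return sorted(tuples, key=lambda t: t[key])

def flow(source, *pipes):
    # A "flow" threads the source tuples through the pipe assembly;
    # a planner (as in Cascading) would map this onto MR jobs.
    for pipe in pipes:
        source = pipe(source)
    return source

data = [{"line": "the quick fox"}, {"line": "the fox"}]
result = flow(
    data,
    parse,
    lambda ts: group_count(ts, "word"),
    lambda ts: sort_by(ts, "count"),
)
print(result)
```

The point of the model is that the developer composes reusable operations over named fields and never writes a mapper or reducer; the underlying framework decides how the assembly becomes Map and Reduce phases.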
    18. Emerging Tool Support
        • Karmasphere IDE (soon)
          – Developing and Debugging
        • Bixo (Bixo Labs), a Data Mining Toolkit
          – Apache Nutch replacement
          – easier to customize to meet new business models
        • Clojure & JRuby Domain-Specific Languages (DSLs)
          – Machine Learning
          – Simple/Complex Ad-Hoc queries
    19. Practical Applications
        • Log/event analysis, device and system monitoring
        • Web crawling and content mining
        • Behavioral ad-targeting segmentation
        • Ad campaign ROI
        • Demand and event prediction
        • POS analytics for product demand pricing
    20. Successes
        • Publicis/RazorFish: Behavioral Ad-Targeting
          – Cascading + AWS (Elastic MapReduce)
          – daily automated User Behavior Segmentation
          – 6 weeks dev, 3T/day, $13k/mo
          – 500% increase in return on ad spend over a similar campaign a year before
    21. continued...
        • FlightCaster: predicting flight delays
          – Clojure + Cascading + AWS
          – machine learning and production processing
          – 3 months dev, 10G/day, <1T total currently, <$2k/mo
        • Etsy: online marketplace
          – JRuby + Cascading
          – data mining (Hadoop as a DW!)
          – 750M page-views/mo, 60G/day of logs
    22. Resources
        • Chris K Wensel
          – chris@wensel.net
          – @cwensel
        • Cascading, an API for optimizing production data processing
          – http://cascading.org
        • Concurrent, Inc., support and mentoring
          – http://concurrentinc.com
