Cloud Computing: Hadoop

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    6 Favorites

    Cloud Computing: Hadoop - Presentation Transcript

    1. Data Processing in the Cloud Parand Tony Darugar http://parand.com/say/ [email_address]
    2. What is Hadoop
      • Flexible infrastructure for large scale computation and data processing on a network of commodity hardware.
    3. Why?
      • A common infrastructure pattern extracted from building distributed systems
      • Scale
      • Incremental growth
      • Cost
      • Flexibility
    4. Built-in Resilience to Failure
      • When dealing with large numbers of commodity servers, failure is a fact of life
      • Assume failure, build protections and recovery into your architecture
        • Data level redundancy
        • Job/Task level monitoring and automated restart and re-allocation
    5. Current State of Hadoop Project
      • Top level Apache Foundation project
      • In production use at Yahoo, Facebook, Amazon, IBM, Fox, NY Times, Powerset, …
      • Large, active user base, mailing lists, user groups
      • Very active development, strong development team
    6. Widely Adopted
      • A valuable and reusable skill set
        • Taught at major universities
        • Easier to hire for
        • Easier to train on
        • Portable across projects, groups
    7. Plethora of Related Projects
      • Pig
      • Hive
      • Hbase
      • Cascading
      • Hadoop on EC2
      • JAQL , X-Trace, Happy, Mahout
    8. What is Hadoop
      • The Linux of distributed processing.
    9. How Does Hadoop Work?
    10. Hadoop File System
      • A distributed file system for large data
        • Your data in triplicate
        • Built-in redundancy, resiliency to large scale failures
        • Intelligent distribution, striping across racks
        • Accommodates very large data sizes
        • On commodity hardware
    11. Programming Model: Map/Reduce
      • Very simple programming model:
        • Map(anything)->key, value
        • Sort, partition on key
        • Reduce(key,value)->key, value
      • No parallel processing / message passing semantics
      • Programmable in Java or any other language (streaming)
    12. Processing Model
      • Create or allocate a cluster
      • Put data onto the file system:
        • Data is split into blocks, stored in triplicate across your cluster
      • Run your job:
        • Your Map code is copied to the allocated nodes, preferring nodes that contain copies of your data
          • Move computation to data, not data to computation
    13. Processing Model
          • Monitor workers, automatically restarting failed or slow tasks
        • Gather output of Map, sort and partition on key
        • Run Reduce tasks
          • Monitor workers, automatically restarting failed or slow tasks
      • Results of your job are now available on the Hadoop file system
    14. Hadoop on the Grid
      • Managed Hadoop clusters
      • Shared resources
        • improved utilization
      • Standard data sets, storage
      • Shared, standardized operations management
      • Hosted internally or externally (eg. on EC2)
    15. Usage Patterns
    16. ETL
      • Put large data source (eg. Log files) onto the Hadoop File System
      • Perform aggregations, transformations, normalizations on the data
      • Load into RDBMS / data mart
    17. Reporting and Analytics
      • Run canned and ad-hoc queries over large data
      • Run analytics and data mining operations on large data
      • Produce reports for end-user consumption or loading into data mart
    18. Data Processing Pipelines
      • Multi-step pipelines for data processing
      • Coordination, scheduling, data collection and publishing of feeds
      • SLA carrying, regularly scheduled jobs
    19. Machine Learning & Graph Algorithms
      • Traverse large graphs and data sets, building models and classifiers
      • Implement machine learning algorithms over massive data sets
    20. General Back-End Processing
      • Implement significant portions of back-end, batch oriented processing on the grid
      • General computation framework
      • Simplify back-end architecture
    21. What Next?
      • Dowload Hadoop:
        • http://hadoop.apache.org/
      • Try it on your laptop
      • Try Pig
        • http://hadoop.apahe.org/pig/
      • Deploy to multiple boxes
      • Try it on EC2

    + darugardarugar, 2 years ago

    custom

    3243 views, 6 favs, 3 embeds more stats

    Data Processing in the Cloud with Hadoop from Data more

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 3243
      • 3015 on SlideShare
      • 228 from embeds
    • Comments 0
    • Favorites 6
    • Downloads 261
    Most viewed embeds
    • 218 views on http://parand.com
    • 9 views on http://cloudrecruiter.blogspot.com
    • 1 views on http://74.125.155.132

    more

    All embeds
    • 218 views on http://parand.com
    • 9 views on http://cloudrecruiter.blogspot.com
    • 1 views on http://74.125.155.132

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories