Hadoop - An Introduction


Introduction to Hadoop


Transcript

  • 1. Shankar Radhakrishnan
    HCL Technologies
    Hadoop – An Introduction
  • 2. State of the Data
    What is Hadoop
    Hadoop Ecosystem
    References
    Agenda
  • 3. Data driven businesses
    Businesses have been collecting information all the time
    Mine more == Collect more (and vice-versa)
    Challenges
    Application Complexities
    Data growth
    Infrastructure
    Economics
    Need of the day
    State of the data
  • 4. Data driven business
    Businesses have been collecting information all the time
    Mine more == Collect more (and vice-versa)
    Challenges
    Application Complexities
    Data growth
    Infrastructure
    Economics
    State of the data
  • 5. Applications
    Searches, Message posts, Comments, Emails, Blogs, Photos, Video Clips, Product Listings
    ERP, CRM, Databases, Internal Applications, Customer/Consumer facing products
    Mobile
    Context
    Web, Customers, Products, Business Systems, Processes, Services
    Support Systems
    CRM, SOA, Recommendation Systems/processes, Data warehouses, Business Intelligence, BPM
    Data driven business
  • 6. Data driven businesses
    Businesses have been collecting information all the time
    Mine more == Collect more (and vice-versa)
    Challenges
    Application Complexities
    Data growth
    Infrastructure
    Economics
    State of the data
  • 7. Drivers
    ROI
    Customer Retention
    Product Affinity
    Market Trends
    Research Analysis
    Customer/Consumer Analytics
    Process
    Clustering
    Classification
    Build Relationships
    Regression
    Types
    Structured
    Semi-structured
    Unstructured
    Mine more
  • 8. Data driven businesses
    Businesses have been collecting information all the time
    Mine more == Collect more (and vice-versa)
    Challenges
    Application Complexities
    Data growth
    Infrastructure
    Economics
    State of the data
  • 9. Complex Applications
    Data integration is a good but complex problem to solve
    Data Growth
    Growth is exponential
    Infrastructure
    Availability
    Unscalable hardware
    Economics
    Managing high data volume comes at a price
    Failures are very costly
    Challenges
  • 10. System that can handle high volume data
    System that can perform complex operations
    Scalable
    Robust
    Highly Available
    Fault Tolerant
    Cheap
    Need of the day
  • 11. Top level Apache project
    Open source
    Inspired by Google’s white papers on Map/Reduce (MR) and the Google File System (GFS)
    Originally developed to support Apache Nutch Search Engine
    Software Framework - Java
    Designed
    For sophisticated analysis
    To deal with structured and unstructured complex data
  • 12. Runs on commodity hardware
    Shared-nothing architecture
    Scale hardware whenever you want
    System compensates for hardware scaling and issues (if any)
    Run large-scale, high volume data processes
    Scales well with complex analysis jobs
    Handles failures
    Ideal to consolidate data from both new and legacy data sources
    Value to the business
    Why Hadoop?
  • 13. Hadoop in an enterprise - Example
  • 14. HDFS: Hadoop Distributed File System
    Map/Reduce: Software framework for clustered, distributed data processing
    ZooKeeper: Coordination service for distributed applications
    Avro: Data serialization
    Chukwa: Data collection system for monitoring distributed systems
    HBase: Data storage for large distributed tables
    Hive: Data warehousing infrastructure
    Pig: High-level data-flow language
    Hadoop Ecosystem
  • 15. Master/Slave Architecture
    Runs on commodity hardware
    Fault Tolerant
    Handle large volumes of data
    Provides High Throughput
    Streaming data-access
    Simple file coherency model
    Portable to heterogeneous hardware and software
    Robust
    Handles disk failures, replication (& re-replication)
    Performs cluster rebalancing, data integrity checks
    HDFS – Hadoop Distributed File System
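
The "replication (& re-replication)" point above can be illustrated with a toy model. This is a minimal sketch, not the HDFS API: the block map, the node names, the `re_replicate` helper, and the replica count of 3 are all assumptions made for the example.

```python
# Hypothetical in-memory model of block replicas; not real HDFS code.
REPLICATION = 3  # assumed target number of replicas per block


def re_replicate(block_map, live_nodes, replication=REPLICATION):
    """Drop replicas held by dead nodes, then top each block back up
    to the target replica count using other live nodes."""
    live = sorted(live_nodes)
    repaired = {}
    for block, holders in block_map.items():
        survivors = [n for n in holders if n in live_nodes]
        if not survivors:
            repaired[block] = []  # no surviving copy left to clone from
            continue
        candidates = [n for n in live if n not in survivors]
        missing = max(0, replication - len(survivors))
        repaired[block] = survivors + candidates[:missing]
    return repaired


block_map = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
# dn2 has failed; dn5 has joined the cluster.
after_failure = re_replicate(block_map, live_nodes={"dn1", "dn3", "dn4", "dn5"})
print(after_failure)
```

Every block that lost a replica on the failed node ends up with three live holders again, which is the behaviour the slide attributes to HDFS.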
  • 16. HDFS – Example
    Name node
    • File system operations
    • Maps data-nodes
    Data node
    • Process read/write
    • Handles Data-blocks
    • Replication
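
The division of work between the name node and the data nodes can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Hadoop code: the `NameNode` and `DataNode` classes, the `put`/`get` helpers, the 8-byte block size, and the round-robin placement are all invented for the example.

```python
BLOCK_SIZE = 8    # bytes; tiny on purpose (assumed value for the demo)
REPLICATION = 2   # assumed replica count; must not exceed the node count


class DataNode:
    """Stores block contents and serves block reads and writes."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.files = {}  # path -> [(block_id, [DataNode, ...]), ...]

    def allocate(self, path, num_blocks):
        n = len(self.data_nodes)
        # Round-robin placement: an assumption, not the real HDFS policy.
        self.files[path] = [
            (f"{path}#{i}", [self.data_nodes[(i + r) % n] for r in range(REPLICATION)])
            for i in range(num_blocks)
        ]
        return self.files[path]

    def locate(self, path):
        return self.files[path]


def put(name_node, path, data):
    """Client side: split into blocks, ask for placements, write each replica."""
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for (block_id, nodes), chunk in zip(name_node.allocate(path, len(chunks)), chunks):
        for node in nodes:
            node.write(block_id, chunk)


def get(name_node, path):
    """Client side: look up block locations, read one replica of each block."""
    return b"".join(nodes[0].read(bid) for bid, nodes in name_node.locate(path))


nodes = [DataNode(f"dn{i}") for i in range(1, 4)]
nn = NameNode(nodes)
payload = b"hello hadoop, hello hdfs"
put(nn, "/demo.txt", payload)
restored = get(nn, "/demo.txt")
print(restored == payload)
```

The name node never touches file contents here; it only answers "which blocks, on which nodes", while the data nodes handle the actual block reads and writes.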
  • Tagged by a job
    Splits input data-set into separate chunks
    Processed by map tasks, in parallel
    Sorts the output of the maps
    Processed by reduce tasks, in parallel
    Input and output are typically stored in a file system
    Framework takes care of
    Scheduling tasks
    Monitoring
    Re-executing failed tasks
    Hadoop Map/Reduce
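
The flow listed above (split, map in parallel, sort, reduce in parallel) can be mimicked on a single machine. This is a minimal sketch, not the Hadoop framework: `run_job`, the thread pool, and the word-count mapper and reducer are assumptions chosen for illustration, and scheduling, monitoring, and re-execution of failed tasks are left out.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_job(records, mapper, reducer, num_splits=3):
    """Toy single-process imitation of a map/reduce job."""
    # 1. Split the input data set into separate chunks.
    splits = [records[i::num_splits] for i in range(num_splits)]

    # 2. Process each chunk with a map task, in parallel.
    def map_task(split):
        return [pair for record in split for pair in mapper(record)]

    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_task, splits))

    # 3. Group the map output by key and sort the keys.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    keys = sorted(groups)

    # 4. Process each key group with a reduce task, in parallel.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda k: reducer(k, groups[k]), keys))


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def word_reducer(word, counts):
    return (word, sum(counts))


lines = ["big data", "big cluster", "data node"]
result = run_job(lines, word_mapper, word_reducer)
print(result)
```

Because the job author only supplies `word_mapper` and `word_reducer`, the sketch mirrors the slide's point that the framework, not the job, owns splitting, sorting, and task execution.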
  • 20. Example : Mapper Function
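
The slide's own code is not part of this transcript. As a stand-in, here is a hypothetical word-count mapper written in the style used with Hadoop Streaming, where a mapper reads text lines and emits tab-separated key/value lines. The function names and the word-count task are assumptions, not the presenter's example.

```python
import io
import sys


def map_line(line):
    """Emit a (word, 1) pair for every word on the line."""
    for word in line.strip().lower().split():
        yield word, 1


def run_mapper(stream=sys.stdin, out=sys.stdout):
    """Read lines from the stream and write 'word<TAB>1' lines."""
    for line in stream:
        for word, count in map_line(line):
            out.write(f"{word}\t{count}\n")


# Demo with in-memory streams instead of real stdin/stdout.
demo_out = io.StringIO()
run_mapper(io.StringIO("Big data\nbig cluster\n"), demo_out)
mapper_output = demo_out.getvalue()
print(mapper_output, end="")
```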
  • 21. Example : Reduce Function
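
The slide's own code is not part of this transcript. As a stand-in, here is a hypothetical word-count reducer in the Hadoop Streaming style. It assumes input lines of the form `word<TAB>count` that arrive already sorted by word, which is what lets it sum each word's counts in a single pass. The function name and the task are assumptions, not the presenter's example.

```python
import io
from itertools import groupby


def reduce_sorted(lines):
    """Sum the counts of consecutive lines that share the same word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)


# Demo input: tab-separated pairs, sorted by key as the framework would deliver them.
sorted_input = io.StringIO("big\t1\nbig\t1\ncluster\t1\ndata\t1\ndata\t1\n")
totals = list(reduce_sorted(sorted_input))
print(totals)
```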
  • 22. Who runs Hadoop?