Hadoop - An Introduction
Usage Rights

© All Rights Reserved

    Hadoop - An Introduction: Presentation Transcript

    • Shankar Radhakrishnan
      HCL Technologies
      Hadoop – An Introduction
    • State of the Data
      What is Hadoop
      Hadoop Ecosystem
      References
      Agenda
    • Data driven businesses
      Businesses have been collecting information all the time
      Mine more == Collect more (and vice-versa)
      Challenges
      Application Complexities
      Data growth
      Infrastructure
      Economics
      Need of the day
      State of the data
    • Applications
      Searches, Message posts, Comments, Emails, Blogs, Photos, Video Clips, Product Listings
      ERP, CRM, Databases, Internal Applications, Customer/Consumer facing products
      Mobile
      Context
      Web, Customers, Products, Business Systems, Processes, Services
      Support Systems
      CRM, SOA, Recommendation Systems/processes, Data warehouses, Business Intelligence, BPM
      Data driven business
    • Drivers
      ROI
      Customer Retention
      Product Affinity
      Market Trends
      Research Analysis
      Customer/Consumer Analytics
      Process
      Clustering
      Classification
      Build Relationships
      Regression
      Types
      Structured
      Semi-structured
      Unstructured
      Mine more
    • Complex Applications
      Data integration is a worthwhile but complex problem to solve
      Data Growth
      Growth is exponential
      Infrastructure
      Availability
      Unscalable hardware
      Economics
      Managing high data volume comes at a price
      Failures are very costly
      Challenges
    • System that can handle high volume data
      System that can perform complex operations
      Scalable
      Robust
      Highly Available
      Fault Tolerant
      Cheap
      Need of the day
    • Top level Apache project
      Open source
      Inspired by Google’s white papers on Map/Reduce (MR) and the Google File System (GFS)
      Originally developed to support the Apache Nutch search engine
      Software Framework - Java
      Designed
      For sophisticated analysis
      To deal with structured and unstructured complex data
    • Runs on commodity hardware
      Shared-nothing architecture
      Scale hardware whenever you want
      System compensates for hardware scaling and issues (if any)
      Run large-scale, high volume data processes
      Scales well with complex analysis jobs
      Handles failures
      Ideal to consolidate data from both new and legacy data sources
      Value to the business
      Why Hadoop?
    • Hadoop in an enterprise - Example
    • HDFS: Hadoop Distributed File System
      Map/Reduce: Software framework for clustered, distributed data processing
      ZooKeeper: Coordination service for distributed applications
      Avro: Data serialization
      Chukwa: Data collection system to monitor distributed systems
      HBase: Data storage for large distributed tables
      Hive: Data warehousing infrastructure
      Pig: High-level data-flow language
      Hadoop Ecosystem
    • Master/Slave Architecture
      Runs on commodity hardware
      Fault Tolerant
      Handle large volumes of data
      Provides High Throughput
      Streaming data-access
      Simple file coherency model
      Portable to heterogeneous hardware and software
      Robust
      Handles disk failures, replication (& re-replication)
      Performs cluster rebalancing, data integrity checks
      HDFS – Hadoop Distributed File System
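The data integrity checks listed on this slide can be illustrated with a small sketch. It is not the HDFS implementation: the tiny block size, the helper names (`split_into_blocks`, `checksum`, `find_corrupt_blocks`), and the use of `zlib.crc32` as the checksum are all assumptions chosen to keep the example short.

```python
import zlib

BLOCK_SIZE = 8  # bytes; deliberately tiny, real HDFS blocks are far larger


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def checksum(block):
    """CRC32 of one block, used here as a stand-in integrity check."""
    return zlib.crc32(block)


def find_corrupt_blocks(blocks, expected):
    """Return the indexes of blocks whose checksum no longer matches."""
    return [i for i, (block, crc) in enumerate(zip(blocks, expected))
            if checksum(block) != crc]


blocks = split_into_blocks(b"hadoop stores files as blocks")
stored_checksums = [checksum(block) for block in blocks]

# Simulate a disk error by overwriting the first byte of the second block.
blocks[1] = b"X" + blocks[1][1:]
print(find_corrupt_blocks(blocks, stored_checksums))  # prints [1]
```

In this toy model a block flagged as corrupt would be discarded and restored from a healthy replica, which is the role replication plays on the slide.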
    • HDFS – Example
      Name node
      • File system operations
      • Maps blocks to data nodes
      Data node
      • Processes read/write requests
      • Handles Data-blocks
      • Replication
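The division of labour between the name node and the data nodes can be sketched with a toy model. Everything here is hypothetical: `ToyNameNode`, its placement rule (least-loaded node first), and the node names `dn1` to `dn4` are illustrative assumptions, not the HDFS API or its real placement policy.

```python
class ToyNameNode:
    """Holds metadata only: which data nodes keep a replica of each block."""

    def __init__(self, data_nodes, replication=3):
        self.live_nodes = set(data_nodes)
        self.replication = replication
        self.block_map = {}  # block id -> list of data node names

    def _load(self, node):
        """Number of block replicas currently assigned to a node."""
        return sum(node in holders for holders in self.block_map.values())

    def _pick_nodes(self, count, exclude=()):
        """Choose the least-loaded live nodes (assumed policy, ties by name)."""
        candidates = sorted(self.live_nodes - set(exclude),
                            key=lambda node: (self._load(node), node))
        return candidates[:count]

    def add_block(self, block_id):
        """Register a new block and assign its replicas to data nodes."""
        self.block_map[block_id] = self._pick_nodes(self.replication)
        return self.block_map[block_id]

    def handle_node_failure(self, failed_node):
        """Drop a dead node and re-replicate every block that lost a copy."""
        self.live_nodes.discard(failed_node)
        for holders in self.block_map.values():
            if failed_node in holders:
                holders.remove(failed_node)
                missing = self.replication - len(holders)
                holders.extend(self._pick_nodes(missing, exclude=holders))


name_node = ToyNameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
name_node.add_block("blk_1")
name_node.add_block("blk_2")
name_node.handle_node_failure("dn1")
print(name_node.block_map)
```

After the simulated failure of `dn1`, every block is back to three replicas on the surviving nodes, which is the re-replication behaviour the HDFS slides describe.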
    • Tagged by a job
      Splits the input data set into separate chunks
      Chunks are processed by map tasks, in parallel
      Sorts the output of the maps
      Sorted map output is processed by reduce tasks, in parallel
      Input and output are typically stored in a file system
      Framework takes care of
      Scheduling tasks
      Monitoring
      Re-executing failed tasks
      Hadoop Map/Reduce
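The job flow on this slide (split, map in parallel, sort, reduce) can be approximated on a single machine. The sketch below is a local simulation, not Hadoop code: `split_input`, `run_job`, the thread pool, and the temperature records are all assumptions made for illustration, and scheduling, monitoring, and task re-execution are left out.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(records, num_splits):
    """Divide the input records into chunks, one per map task."""
    return [records[i::num_splits] for i in range(num_splits)]


def run_job(records, map_fn, reduce_fn, num_splits=3):
    """Run a map/reduce job locally and return {key: reduced value}."""
    splits = split_input(records, num_splits)

    def map_task(split):
        # One map task: apply map_fn to every record in its chunk.
        return [pair for record in split for pair in map_fn(record)]

    # Map phase: chunks are independent, so the tasks can run in parallel.
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(map_task, splits))

    # Shuffle: collect every value under its key.
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)

    # Sort the keys, then reduce (sequentially here, for brevity).
    return {key: reduce_fn(key, grouped[key]) for key in sorted(grouped)}


def parse_reading(record):
    """Map function: 'year,temperature' -> (year, temperature)."""
    year, temperature = record.split(",")
    yield year, int(temperature)


def hottest(year, temperatures):
    """Reduce function: highest temperature seen for a year."""
    return max(temperatures)


readings = ["1950,22", "1950,31", "1951,18", "1951,25", "1950,27"]
result = run_job(readings, parse_reading, hottest)
print(result)  # prints {'1950': 31, '1951': 25}
```

The map and reduce functions never see how the input was split or grouped; that separation is what lets the framework own scheduling and failure handling.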
    • Example : Mapper Function
    • Example : Reduce Function
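The code shown on these two slides did not survive in the transcript. As a stand-in, here is a minimal word-count sketch; the choice of word count, the function names `mapper` and `reducer`, and the sample lines are assumptions, and a plain `sorted` call plays the part of the framework's sort step.

```python
from itertools import groupby


def mapper(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Sum the counts for each word; expects pairs sorted by word."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


lines = ["Hadoop runs map tasks", "Hadoop runs reduce tasks"]
pairs = sorted(mapper(lines))  # stands in for the framework's sort
counts = dict(reducer(pairs))
print(counts)
# prints {'hadoop': 2, 'map': 1, 'reduce': 1, 'runs': 2, 'tasks': 2}
```

The reducer relies on its input being sorted by key, which is why the sort between the map and reduce phases matters.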
    • Who runs Hadoop?