
Hadoop Introduction


Published on

The document begins with an introduction to Hadoop and covers the Hadoop 1.x / 2.x services (HDFS / MapReduce / YARN).
It also explains the architecture of Hadoop and how the Hadoop Distributed File System and the MapReduce programming model work.

Published in: Software


  1. Apache Hadoop is a Java-based software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
  2. Distributed, scalable and reliable
     • Hadoop Distributed File System: fault-tolerant storage system
     • Map-Reduce Programming Model: high-performance parallel data processing; employs the divide-and-conquer principle
  3. A class teacher of class 5 needs to find the name of the student with the highest marks in each subject.
     • Input: 50 students, 5 subjects
     • Our goal: to minimize the total time spent
     • Time to process each subject per student: 1 min
     • Total time spent: 50 × 5 × 1 min = 250 mins
     • Output: Subject 1: S1-98; Subject 2: S13-95; Subject 3: S1-97; Subject 4: S23-100; Subject 5: S8-99
  4. a) HDFS: Distribute the data in blocks across multiple nodes. Distribute the papers across 5 peons; each peon gets the papers of 10 students for every subject (50 papers each).
     b) Map phase: Apply the business logic to the distributed data in parallel. Each peon returns, for each subject, the name and marks of the top student among his 10 students. Total time spent: 50 mins (in parallel).
     c) Reduce phase: Iterate over the map phase output to get the final result. Records left: 5 per peon (one per subject), 25 in total. Time to find the top student for each subject: 25 mins.
     Total time spent: 50 + 25 = 75 mins
  5. • Social media data analysis
     • Web clickstream data
     • Server log data
     • Machine and sensor data
  6. HDFS layer: stores files across the storage nodes in a Hadoop cluster.
     • Consists of: Namenode & Datanodes
     Map-Reduce engine: processes vast amounts of data in parallel on large clusters in a reliable and fault-tolerant manner.
     • Consists of: Job Tracker & Task Trackers
  7. Storage & Replication of Blocks in HDFS
     [Diagram: an HDFS client sends a file write request; the file is divided into blocks (Block 1 to Block 4), which are stored and replicated across Datanode_1, Datanode_2 and Datanode_3 under the Namenode.]
  8. [Diagram: a client submits a Map-Reduce job to the Job Tracker, which works with Task Tracker 1, Task Tracker 2 and Task Tracker 3; the HDFS blocks (Block 1 to Block 4) sit alongside the Task Trackers.]
     • Task Trackers execute the individual Map-Reduce tasks assigned by the Job Tracker.
     • Task Trackers retrieve data from HDFS that is stored on the Datanode, i.e. on the same system where the Task Tracker is running.
     • Each slave machine runs a Task Tracker and a Datanode.
  9. Hadoop Daemons
     NameNode
     • Maps a block to the Datanodes
     • Controls read/write access to files
     • Manages the replication engine for blocks
     DataNode
     • Responsible for serving read and write requests (block creation, deletion, and replication)
     JobTracker
     • Accepts Map-Reduce jobs from clients
     • Assigns tasks to the Task Trackers and monitors their status
     TaskTracker
     • Worker daemon; runs Map-Reduce tasks
     • Sends heartbeats to the Job Tracker
     • Retrieves job resources from HDFS
  10. Hadoop Services: HDFS, MapReduce, YARN
      YARN stands for "Yet Another Resource Negotiator", a framework that provides a generic resource management solution for Hadoop clusters.
  11. Allows easy integration of multiple data processing algorithms with the data stored in HDFS
  12. Query language; Pig scripting; coordination service; columnar database; log management; data exchange; workflow design; machine learning; messaging system
  13. a) Apache website
      b) Learning YARN
      c) Hadoop: The Definitive Guide
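
The class-teacher example on slides 3 and 4 can be sketched in plain Python. This is only an illustration of the distribute / map / reduce idea, not Hadoop code: the function names, the constants and the dummy marks are all assumptions made for the sketch, and the "parallel" map step simply runs in a loop here.

```python
NUM_STUDENTS = 50
NUM_SUBJECTS = 5
NUM_WORKERS = 5          # the 5 peons from slide 4
MINUTES_PER_PAPER = 1


def make_papers():
    """Build hypothetical (student, subject, marks) records; the marks are made up."""
    papers = []
    for student in range(1, NUM_STUDENTS + 1):
        for subject in range(1, NUM_SUBJECTS + 1):
            marks = (student * 37 + subject * 11) % 101  # deterministic dummy marks, 0..100
            papers.append((f"S{student}", f"Subject {subject}", marks))
    return papers


def split_into_blocks(papers, num_workers):
    """HDFS-like step: hand each worker an equal, contiguous share of the papers."""
    size = len(papers) // num_workers
    return [papers[i * size:(i + 1) * size] for i in range(num_workers)]


def map_phase(block):
    """One worker: find the top (student, marks) per subject within its own block."""
    best = {}
    for student, subject, marks in block:
        if subject not in best or marks > best[subject][1]:
            best[subject] = (student, marks)
    return best


def reduce_phase(partial_results):
    """Combine the per-worker results into one top (student, marks) per subject."""
    best = {}
    for partial in partial_results:
        for subject, (student, marks) in partial.items():
            if subject not in best or marks > best[subject][1]:
                best[subject] = (student, marks)
    return best


papers = make_papers()
blocks = split_into_blocks(papers, NUM_WORKERS)
partials = [map_phase(block) for block in blocks]  # sequential here, parallel on a cluster
toppers = reduce_phase(partials)

# Time estimates under the slide's assumption of 1 minute per paper or record.
serial_minutes = len(papers) * MINUTES_PER_PAPER
map_minutes = max(len(block) for block in blocks) * MINUTES_PER_PAPER  # workers run side by side
reduce_minutes = sum(len(p) for p in partials) * MINUTES_PER_PAPER
total_minutes = map_minutes + reduce_minutes

for subject in sorted(toppers):
    student, marks = toppers[subject]
    print(f"{subject}: {student}-{marks}")
print(f"serial: {serial_minutes} mins, map + reduce: {total_minutes} mins")
```

With 250 papers split into 5 blocks of 50, the map step costs 50 minutes because the workers run side by side, and the reduce step sees 25 records (5 workers × 5 subjects), which matches the 50 + 25 = 75 minutes on slide 4.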