Hadoop data analysis



This presentation covers how Hadoop is useful and describes the architecture of Hadoop.



Presentation Transcript

  • Problem Statement
  • Solution
    • Open-source Apache project
    • Runs on Linux, Mac OS X, Windows, and Solaris
    • Runs on commodity hardware
  • Who Uses Hadoop?
  • What Is Hadoop?
    • Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
    • Large datasets: terabytes or petabytes of data
    • Large clusters: hundreds or thousands of nodes
    • Hadoop is based on a simple programming model called MapReduce
  • Hadoop Clusters
    • One of the largest clusters in the world runs on around 2,000 nodes at Yahoo! and executes hundreds of jobs every month
  • Hadoop Architecture
    • Distributed file system (HDFS)
    • Execution engine (MapReduce)
    • One master node and many slave nodes
  • Hadoop: How Does It Work?
  • Hadoop Distributed File System (HDFS)
    • A centralized namenode maintains metadata about files
    • Many datanodes (thousands of them) store the actual data
    • Files are divided into blocks (64 MB each); a file F might span blocks 1 through 5
    • Each block is replicated N times (default N = 3)
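The block-and-replica layout described above can be sketched in Python. This is an illustrative model only (the function names and round-robin placement are assumptions, not the real HDFS implementation):

```python
# Hypothetical sketch of HDFS-style block splitting and replica placement.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as on the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return how many blocks a file of file_size bytes occupies."""
    return (file_size + block_size - 1) // block_size

def place_replicas(block_id, datanodes, replication=3):
    """Pick `replication` distinct datanodes for one block (simple round-robin;
    real HDFS uses rack-aware placement)."""
    n = len(datanodes)
    return [datanodes[(block_id + i) % n] for i in range(replication)]

# A 300 MB file needs 5 blocks of 64 MB
blocks = split_into_blocks(300 * 1024 * 1024)
print(blocks)  # 5
nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(0, nodes))  # ['dn1', 'dn2', 'dn3']
```

Each block thus ends up on three different datanodes, which is what lets the system survive node failures.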
  • Main Properties of HDFS
    • Large: scales to very large files and clusters
    • Replication: each block is stored on multiple datanodes
    • Failure: node failure is expected, not exceptional
    • Fault tolerance: failures are detected and recovered from automatically
  • Hadoop MapReduce
    • Handles large-scale web search applications
    • Popular programming model for data centers
    • Effective for data mining, machine learning, and search applications
    • Abstracts away the issues of parallelization, scheduling, input partitioning, failover, and replication
  • MapReduce Phases
  • Properties of the MapReduce Engine
    • The Job Tracker is the master node (it runs alongside the namenode)
    • It receives the user's job and determines the number of mappers
    • Concept of locality: if a file has 5 blocks, run 5 map tasks, each near its block
    • Example: for the task reading block 1, try to run it on Node 1 or Node 3 (the nodes holding that block's replicas)
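The locality rule on this slide can be sketched as a small scheduling function. This is a hypothetical illustration of the idea, not the actual JobTracker code:

```python
# Locality-aware task placement: prefer a free node that already stores
# a replica of the block, so the map task reads its input locally.
def pick_node_for_block(block_replicas, free_nodes):
    """Return a free node holding a replica if possible, else any free node."""
    for node in free_nodes:
        if node in block_replicas:
            return node
    return free_nodes[0] if free_nodes else None

# Block 1 is replicated on Node 1 and Node 3 (as in the slide's example);
# Node 1 is busy, so the task lands on Node 3.
replicas = {"node1", "node3"}
print(pick_node_for_block(replicas, ["node2", "node3"]))  # node3
```

Running the map task where its block already lives avoids shipping 64 MB of data across the network.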
  • Properties of the MapReduce Engine (Cont'd)
    • The Task Tracker is the slave process (it runs on each datanode)
    • It receives tasks from the Job Tracker
    • It runs each task (map or reduce) to completion
    • It stays in constant communication with the Job Tracker, reporting progress
    • In the slide's example, one MapReduce job consists of 4 map tasks and 3 reduce tasks
  • Key-Value Pairs
    • Mappers and reducers are user-provided functions
    • They just need to obey the key-value pair interface
    • Mappers: consume <key, value> pairs and produce <key, value> pairs
    • Reducers: consume <key, <list of values>> and produce <key, value>
    • Shuffling and sorting: a hidden phase between mappers and reducers that groups all identical keys from all mappers, sorts them, and passes each to one reducer as <key, <list of values>>
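The hidden shuffle-and-sort phase described above can be sketched as follows. This is a minimal single-machine simulation of the idea, not Hadoop's actual implementation:

```python
# Shuffle and sort: group every <key, value> pair emitted by the mappers
# into <key, [values]>, sorted by key, as delivered to the reducers.
from collections import defaultdict

def shuffle_and_sort(mapper_outputs):
    groups = defaultdict(list)
    for key, value in mapper_outputs:
        groups[key].append(value)
    return sorted(groups.items())  # [(key, [values]), ...] in key order

pairs = [("b", 1), ("a", 1), ("b", 1)]
print(shuffle_and_sort(pairs))  # [('a', [1]), ('b', [1, 1])]
```

Because all values for a given key end up at a single reducer, each reducer can compute its result independently of the others.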
  • Example 1: Word Count
    • Job: count the occurrences of each word in a data set
    • (Figure: map tasks feeding into reduce tasks)
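The word-count job can be expressed with the mapper/reducer interface from the previous slide. The following is a single-process Python simulation of the job, not a real Hadoop program:

```python
# Word count as map + shuffle + reduce, simulated in one process.
from collections import defaultdict

def mapper(line):
    """Emit <word, 1> for every word in an input line."""
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    """Sum all the 1s collected for one word."""
    return word, sum(counts)

def run_job(lines):
    groups = defaultdict(list)
    for line in lines:                      # map phase
        for word, one in mapper(line):
            groups[word].append(one)        # shuffle: group by key
    return dict(reducer(w, c) for w, c in sorted(groups.items()))  # reduce

print(run_job(["hadoop map reduce", "map reduce"]))
# {'hadoop': 1, 'map': 2, 'reduce': 2}
```

In a real cluster the map calls run in parallel on many nodes, the grouping happens during shuffle-and-sort, and each reducer handles a disjoint set of words.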
  • Thank You!