Hadoop data analysis

This presentation covers how Hadoop is useful and describes the Hadoop architecture.

Transcript of "Hadoop data analysis"

  1. BIG DATA ANALYSIS
  2. Problem Statement
  3. Solution
     • Open-source Apache project
     • Runs on Linux, Mac OS X, Windows, and Solaris
     • Runs on commodity hardware
  4. Who uses Hadoop?
     (diagram: Hadoop-optimized content)
  5. What is Hadoop?
     • Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
     • Large datasets → terabytes or petabytes of data
     • Large clusters → hundreds or thousands of nodes
     • Hadoop is based on a simple programming model called MapReduce
  6. Hadoop clusters
     One of the largest clusters in the world, at Yahoo!, runs around 2,000 nodes and executes hundreds of jobs every month.
  7. Hadoop Architecture
     • Distributed file system (HDFS)
     • Execution engine (MapReduce)
     (diagram: a single master node coordinating many slave nodes)
  8. Hadoop: How it Works?
  9. Hadoop Distributed File System (HDFS)
     • Centralized namenode: maintains metadata about files
     • Many datanodes (1000s): store the actual data
     • Files are divided into blocks (64 MB); in the slide's diagram, file F is split into blocks 1-5
     • Each block is replicated N times (default N = 3)
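
     As a concrete illustration of this slide (not part of the original deck), the sketch below writes a small file to HDFS with the Java FileSystem API and reads back its block size and replication factor. The class name and the path are hypothetical; the namenode address and defaults come from the local Hadoop configuration files.

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.FSDataOutputStream;
     import org.apache.hadoop.fs.FileStatus;
     import org.apache.hadoop.fs.FileSystem;
     import org.apache.hadoop.fs.Path;

     public class HdfsBlockInfo {
         public static void main(String[] args) throws Exception {
             // Configuration picks up core-site.xml / hdfs-site.xml, which point
             // the client at the centralized namenode.
             Configuration conf = new Configuration();
             FileSystem fs = FileSystem.get(conf);

             Path file = new Path("/user/demo/sample.txt");  // hypothetical path
             try (FSDataOutputStream out = fs.create(file)) {
                 out.writeUTF("hello hdfs");                 // the bytes land on datanodes
             }

             // The namenode's metadata records how the file is split and replicated:
             // fixed-size blocks (64 MB by default in older releases), each replicated
             // N times (default N = 3).
             FileStatus status = fs.getFileStatus(file);
             System.out.println("block size:  " + status.getBlockSize());
             System.out.println("replication: " + status.getReplication());
         }
     }
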
  10. Main Properties of HDFS
     • Large
     • Replication
     • Failure
     • Fault tolerance
  11. Hadoop MapReduce
     • Handles large-scale web search applications
     • Popular programming model for data centers
     • Effective programming model for data mining, machine learning, and search applications in data centers
     • Abstracts away the issues of parallelization, scheduling, input partitioning, failover, and replication
  12. MapReduce Phases
  13. Properties of MapReduce Engine
     • Job Tracker is the master node (runs with the namenode)
     • Receives the user's job
     • Decides the number of mappers
     • Applies the concept of locality
     (diagram: blocks of a file spread across Node 1, Node 2, and Node 3)
     • This file has 5 blocks → run 5 map tasks
     • Where to run the task reading block "1"? Try to run it on Node 1 or Node 3
  14. Properties of MapReduce Engine (Cont'd)
     • Task Tracker is the slave node (runs on each datanode)
     • Receives the task from the Job Tracker
     • Runs the task until completion (either a map or a reduce task)
     • Always in communication with the Job Tracker, reporting progress
     (diagram: map → parse/hash → reduce dataflow; in this example, one MapReduce job consists of 4 map tasks and 3 reduce tasks)
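
     To make the job submission flow concrete, here is a minimal driver sketch (my own, not from the deck) using the org.apache.hadoop.mapreduce API. The client hands the job to the master, the framework launches one map task per input block (honoring locality), and waitForCompletion() prints the progress that the slave-side task runners report back. No mapper or reducer class is set, so Hadoop's default identity map and reduce are used and records pass through unchanged; input and output paths come from the command line.

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

     public class IdentityJobDriver {
         public static void main(String[] args) throws Exception {
             Job job = Job.getInstance(new Configuration(), "identity pass-through");
             job.setJarByClass(IdentityJobDriver.class);

             // The default TextInputFormat produces <byte offset, line> records; with
             // the default identity Mapper and Reducer they are copied to the output.
             job.setOutputKeyClass(LongWritable.class);
             job.setOutputValueClass(Text.class);

             FileInputFormat.addInputPath(job, new Path(args[0]));   // one map task per block
             FileOutputFormat.setOutputPath(job, new Path(args[1]));

             // Blocks here, printing map/reduce progress until the job completes.
             System.exit(job.waitForCompletion(true) ? 0 : 1);
         }
     }
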
  15. Key-Value Pairs
     • Mappers and Reducers are users' code (provided functions)
     • They just need to obey the key-value pair interface
     • Mappers:
       • Consume <key, value> pairs
       • Produce <key, value> pairs
     • Reducers:
       • Consume <key, <list of values>>
       • Produce <key, value>
     • Shuffling and sorting:
       • Hidden phase between mappers and reducers
       • Groups all similar keys from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
  16. Example 1: Word Count
     • Job: count the occurrences of each word in a data set
     (diagram: map tasks feeding reduce tasks)
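
     A sketch of the classic word-count mapper and reducer is given below (class and field names are my own). It follows the key-value interface from the previous slide: the mapper consumes <offset, line> pairs and produces <word, 1> pairs, the hidden shuffle/sort phase groups the 1s by word, and the reducer consumes <word, <list of 1s>> and produces <word, count>. Plugging these classes into a driver like the one sketched earlier, via job.setMapperClass(...) and job.setReducerClass(...) with Text/IntWritable output classes, completes the job.

     import java.io.IOException;
     import java.util.StringTokenizer;

     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Mapper;
     import org.apache.hadoop.mapreduce.Reducer;

     public class WordCount {

         // Map task: emit <word, 1> for every token in the input line.
         public static class TokenizerMapper
                 extends Mapper<LongWritable, Text, Text, IntWritable> {
             private static final IntWritable ONE = new IntWritable(1);
             private final Text word = new Text();

             @Override
             protected void map(LongWritable offset, Text line, Context context)
                     throws IOException, InterruptedException {
                 StringTokenizer tokens = new StringTokenizer(line.toString());
                 while (tokens.hasMoreTokens()) {
                     word.set(tokens.nextToken());
                     context.write(word, ONE);
                 }
             }
         }

         // Reduce task: sum the grouped 1s to get the total count per word.
         public static class SumReducer
                 extends Reducer<Text, IntWritable, Text, IntWritable> {
             @Override
             protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                     throws IOException, InterruptedException {
                 int sum = 0;
                 for (IntWritable count : counts) {
                     sum += count.get();
                 }
                 context.write(word, new IntWritable(sum));
             }
         }
     }
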
  17. (diagram only, no text on this slide)
  18. THANK YOU!