Ali Bahu10/17/2012APACHE HADOOP
INTRODUCTION Apache Hadoop is an open-source softwareframework that supports data-intensivedistributed applications, lice...
WHY HADOOP Need to process Multi Petabyte Datasets It is expensive to build reliability in each application Nodes failu...
WHO USES HADOOP Amazon/A9 Facebook Google IBM Joost Last.fm New York Times PowerSet Veoh Yahoo!
COMMODITY HARDWARE Typically in 2 level architecture Nodes are commodity PCs 30-40 nodes/rack Uplink from rack is 3-4 ...
HDFS ARCHITECTURE
HDFS (HADOOP DISTRIBUTED FILESYSTEM) Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Com...
NAME NODE METADATA Meta-data in Memory The entire metadata is in main memory No demand paging of meta-data Types of Me...
DATA NODEBlock Server Stores data in the local file system (e.g. ext3) Stores meta-data of a block (e.g. CRC) Serves d...
BLOCK PLACEMENT Block Placement Strategy One replica on local node Second replica on a remote rack Third replica on sa...
DATA CORRECTNESS Use Checksums to validate data Use CRC32, etc. File Creation Client computes checksum per 512 byte D...
NAMENODE FAILURE A single point of failure Transaction Log stored in multiple directories A directory on the local file...
DATA PIPELINING Client retrieves a list of DataNodes on which to place replicas of ablock Client writes block to the fir...
HADOOP MAP/REDUCE The Map-Reduce programming model Framework for distributed processing of large data sets Pluggable us...
Upcoming SlideShare
Loading in...5
×

Hadoop

440

Published on

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
440
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
0
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Transcript of "Hadoop"

  1. 1. Ali Bahu10/17/2012APACHE HADOOP
  2. 2. INTRODUCTION Apache Hadoop is an open-source softwareframework that supports data-intensivedistributed applications, licensed under theApache v2 license. It enables applications towork with thousands of computation-independent computers and petabytes ofdata. Hadoop was derived from GooglesMapReduce and Google File System (GFS)papers. Hadoop is implemented in Java.
  3. 3. WHY HADOOP Need to process Multi Petabyte Datasets It is expensive to build reliability in each application Nodes failure is expected and Hadoop can help Number of nodes is not constant Efficient, reliable, Open Source Apache License Workloads are IO bound and not CPU bound
  4. 4. WHO USES HADOOP Amazon/A9 Facebook Google IBM Joost Last.fm New York Times PowerSet Veoh Yahoo!
  5. 5. COMMODITY HARDWARE Typically in 2 level architecture Nodes are commodity PCs 30-40 nodes/rack Uplink from rack is 3-4 gigabit Rack-internal is one gigabit
  6. 6. HDFS ARCHITECTURE
  7. 7. HDFS (HADOOP DISTRIBUTED FILESYSTEM) Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle hardware failure Detect failures and recovers from them Optimized for Batch Processing Data locations exposed so that computations canmove to where data resides Provides very high aggregate bandwidth User Space, runs on heterogeneous OS Single Namespace for entire cluster Data Coherency Write-once-read-many access model Client can only append to existing files Files are broken up into blocks Typically 128 MB block size Each block replicated on multiple Data Nodes Intelligent Client Client can find location of blocks Client accesses data directly from Data Node
  8. 8. NAME NODE METADATA Meta-data in Memory The entire metadata is in main memory No demand paging of meta-data Types of Metadata List of files List of Blocks for each file List of Data Nodes for each block File attributes, e.g. creation time, replication factor, etc. Transaction Log Records file creations, file deletions, etc. is kept here
  9. 9. DATA NODEBlock Server Stores data in the local file system (e.g. ext3) Stores meta-data of a block (e.g. CRC) Serves data and meta-data to ClientsBlock Report Periodically sends a report of all existing blocks tothe Name NodeFacilitates Pipelining of Data Forwards data to other specified Data Nodes
  10. 10. BLOCK PLACEMENT Block Placement Strategy One replica on local node Second replica on a remote rack Third replica on same remote rack Additional replicas are randomly placed Clients read from nearest replica
  11. 11. DATA CORRECTNESS Use Checksums to validate data Use CRC32, etc. File Creation Client computes checksum per 512 byte Data Node stores the checksum File access Client retrieves the data and checksum from Data Node If Validation fails, Client tries other replicas
  12. 12. NAMENODE FAILURE A single point of failure Transaction Log stored in multiple directories A directory on the local file system A directory on a remote file system (NFS/CIFS)
  13. 13. DATA PIPELINING Client retrieves a list of DataNodes on which to place replicas of ablock Client writes block to the first DataNode The first DataNode forwards the data to the next DataNode in thePipeline When all replicas are written, the Client moves on to write the nextblock in file Rebalancer is used to ensure that the % disk full on DataNodes aresimilar Usually run when new DataNodes are added Cluster is online when Rebalancer is active Rebalancer is throttled to avoid network congestion Command line tool
  14. 14. HADOOP MAP/REDUCE The Map-Reduce programming model Framework for distributed processing of large data sets Pluggable user code runs in generic framework Common design pattern in data processing cat * | grep | sort | unique -c | cat > file input | map | shuffle | reduce | output Useful for: Log processing Web search indexing Ad-hoc queries, etc.

×