HDFS introduction

Published in: Technology

  1. Hadoop Basics Course
  2. Overview: HDFS, Impala, MapReduce, Cascading, Hive
  3. Big Data for What?
     - Service: CAP theorem, fast response, scale-out, schema-free ...
       - Distributor with RDBMS
       - NoSQL: MongoDB, HBase, CouchDB ...
     - Analysis: Hadoop (today's topic)
  4. What's Hadoop? It consists of HDFS (Hadoop Distributed File System) and MapReduce.
  5. HDFS Architecture
     - Master: the namenode
     - Slaves: a bunch of datanodes
     [Diagram: one NameNode and five DataNodes]
  6. Single master: strong points
     - Simple architecture
     - The master has global knowledge:
       - file and block namespace (memory and disk)
       - mapping from files to blocks (memory and disk)
       - location of each block's replicas (memory only)
     - The master can make sophisticated decisions.
  7. Single master: weak points
     - SPOF (single point of failure)
     - Bottleneck, so minimizing the master's involvement is important
  8. Fast Recovery for NameNode
     - The Secondary NameNode crawls the namenode's operation log and maintains a copy of the namenode's data.
     [Diagram: NameNode, five DataNodes, Secondary NameNode]
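The idea on slide 8 (keeping an up-to-date copy of the namenode's data by following its operation log) can be illustrated with a toy replay loop. This is only a sketch: the namespace is modeled as a plain set of paths, and the `create`/`delete` operation names and the `checkpoint` helper are hypothetical, not HDFS's actual edit-log format or API.

```python
def apply_op(namespace, op):
    """Apply one logged operation (hypothetical format: (kind, path))."""
    kind, path = op
    if kind == "create":
        namespace.add(path)
    elif kind == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {kind}")


def checkpoint(snapshot, op_log):
    """Replay the operation log on top of the last snapshot.

    Returns a new set so the old snapshot stays usable for recovery.
    """
    merged = set(snapshot)
    for op in op_log:
        apply_op(merged, op)
    return merged


snapshot = {"/foo/bar.txt"}
op_log = [("create", "/foo/baz.txt"), ("delete", "/foo/bar.txt")]
print(sorted(checkpoint(snapshot, op_log)))  # ['/foo/baz.txt']
```

After a checkpoint, a restarting namenode only needs to replay the operations logged since that snapshot, which is what makes recovery fast.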
  9. HA for NameNode
     - Active namenode: performs the normal namenode operations
     - Standby namenode: maintains the namenode's data and is ready to become the active namenode
     [Diagram: NameNode (active), five DataNodes, NameNode (standby)]
  10. Block
      - Each file consists of blocks.
      - Size: 64 MB by default (128 MB in Hadoop 2.x and later)
      - Replication: 3 by default
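As a small worked example of the defaults on slide 10, the following sketch computes how many blocks a file occupies and how much raw disk space its replicas use. The helper names `block_count` and `raw_storage` are made up for illustration, and the calculation assumes the last block only occupies the bytes it actually holds.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted on the slide
REPLICATION = 3                # default replication factor


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed for file_size bytes (ceiling division)."""
    return -(-file_size // block_size)


def raw_storage(file_size, replication=REPLICATION):
    """Total bytes stored across all replicas of the file."""
    return file_size * replication


size = 200 * 1024 * 1024  # a 200 MB file
print(block_count(size))                    # 4 (three full blocks plus one 8 MB block)
print(raw_storage(size) // (1024 * 1024))   # 600 (MB across all replicas)
```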
  11. Write operation
      1. The client sends a write request to the namenode.
      2. The namenode locks the file and selects the datanodes to write to.
      3. The namenode returns the datanode list to the client.
      4. The client sends the file content to a datanode.
      5. That datanode stores the data and relays it to the other datanodes.
      6. Finally, the client sends a close request to the namenode.
      7. The namenode releases the write lock.
      [Diagram: client, NameNode (write lock and datanode allocation), three DataNodes]
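The write steps on slide 11 can be traced with a toy in-memory model. Everything here is hypothetical (the `NameNode` and `DataNode` classes, `open_for_write`, `client_write`); it mirrors the sequence on the slide rather than the real HDFS client protocol, and it simply takes the first three datanodes instead of applying a placement policy.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, downstream):
        """Store the data locally, then relay it to the next node in the pipeline."""
        self.blocks[block_id] = data
        if downstream:
            downstream[0].write(block_id, data, downstream[1:])


class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.write_locks = set()

    def open_for_write(self, path):
        """Lock the file and return the datanodes that should hold it."""
        if path in self.write_locks:
            raise RuntimeError(f"{path} is already being written")
        self.write_locks.add(path)
        return self.datanodes[: self.replication]

    def close(self, path):
        """Release the write lock."""
        self.write_locks.discard(path)


def client_write(namenode, path, data):
    pipeline = namenode.open_for_write(path)            # steps 1-3
    try:
        pipeline[0].write(path, data, pipeline[1:])     # steps 4-5
    finally:
        namenode.close(path)                            # steps 6-7
    return pipeline


nodes = [DataNode(f"dn{i}") for i in range(1, 5)]
namenode = NameNode(nodes)
used = client_write(namenode, "/foo/bar.txt", b"hello")
print([dn.name for dn in used])  # ['dn1', 'dn2', 'dn3']
```

Note that the client talks to only one datanode; the relay between datanodes keeps the namenode out of the data path, which matches the earlier point about minimizing the master's involvement.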
  12. Read operation
      1. The client sends a read request to the namenode.
      2. The namenode locks the file and selects the datanodes to read from.
      3. The namenode returns the datanode list to the client.
      4. The client sends a read request to a datanode.
      5. The datanode sends the content to the client.
      6. Finally, the client sends a close request to the namenode.
      7. The namenode releases the read lock.
      [Diagram: client, NameNode (read lock), three DataNodes]
  13. Block (again): reasons to use a large block size
      - It reduces the client's need to interact with the namenode.
      - It reduces the size of the metadata stored on the namenode.
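The metadata argument on slide 13 is easy to check with arithmetic. The sketch below compares the number of block entries the namenode would track for 1 TiB of data at two block sizes. The figure of 150 bytes per block entry is an assumed rough rule of thumb used only for illustration, and `metadata_bytes` is a made-up helper.

```python
TIB = 1024 ** 4
BYTES_PER_BLOCK_ENTRY = 150  # assumption: rough in-memory cost per block entry


def metadata_bytes(total_data, block_size, per_entry=BYTES_PER_BLOCK_ENTRY):
    """Return (number of block entries, estimated metadata bytes)."""
    blocks = -(-total_data // block_size)  # ceiling division
    return blocks, blocks * per_entry


for label, block_size in (("64 MB", 64 * 1024 ** 2), ("4 KB", 4 * 1024)):
    blocks, meta = metadata_bytes(TIB, block_size)
    print(f"{label} blocks: {blocks} entries, about {meta / 1024 ** 2:.1f} MiB of metadata")
```

With 64 MB blocks the namenode tracks 16,384 entries for 1 TiB; with 4 KB blocks it would track 268,435,456, which would not fit comfortably in one master's memory.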
  14. Namenode's operations
      - Namespace management and locking
      - Replica placement
      - Creation, re-replication, rebalancing
      - Garbage collection
      - Stale replica detection
  15. Namespace management and locking
      - Goal: ensure proper serialization of operations
      - Mechanism: read locks and write locks
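The read lock/write lock rule on slide 15 (many readers or one writer per path) can be shown with a minimal bookkeeping sketch. `PathLock` is a hypothetical, single-threaded illustration of the compatibility rules only; it is not thread-safe and is not how the namenode implements its locks.

```python
class PathLock:
    """Tracks readers and a writer for one path (non-blocking try-acquire)."""

    def __init__(self):
        self.readers = 0
        self.writer = False

    def try_read(self):
        # Readers may share the path unless a writer holds it.
        if self.writer:
            return False
        self.readers += 1
        return True

    def try_write(self):
        # A writer needs exclusive access: no readers and no other writer.
        if self.writer or self.readers:
            return False
        self.writer = True
        return True

    def release_read(self):
        self.readers -= 1

    def release_write(self):
        self.writer = False


lock = PathLock()
print(lock.try_read())   # True
print(lock.try_read())   # True (readers can share)
print(lock.try_write())  # False (readers are active)
lock.release_read()
lock.release_read()
print(lock.try_write())  # True
print(lock.try_read())   # False (writer is active)
```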
  16. Block replica placement
      - Goals:
        - maximize data reliability and availability
        - maximize network bandwidth utilization
      - The default strategy is:
        - one replica on the same datanode (the writer's own node)
        - one replica on another datanode in the same rack
        - one replica on another datanode in a different rack
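The placement strategy as stated on slide 16 can be written as a short function. This sketch assumes a hypothetical `racks` mapping from rack name to datanode names, picks the first eligible node deterministically (a real policy would also weigh load and free space), and assumes the local rack has a second node and that at least one other rack exists.

```python
def place_replicas(writer, racks):
    """Pick three replica locations following the strategy on the slide.

    writer: name of the datanode where the write originates
    racks:  dict mapping rack name -> list of datanode names
    """
    local_rack = next(rack for rack, nodes in racks.items() if writer in nodes)
    # Second replica: a different node in the writer's rack.
    same_rack_node = next(n for n in racks[local_rack] if n != writer)
    # Third replica: any node in a different rack.
    other_rack_node = next(
        nodes[0] for rack, nodes in racks.items() if rack != local_rack and nodes
    )
    return [writer, same_rack_node, other_rack_node]


racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", racks))  # ['dn1', 'dn2', 'dn3']
```

Keeping two replicas in one rack limits cross-rack traffic during the write, while the third replica in another rack protects against losing a whole rack.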
  17. Creation, re-replication, rebalancing
      - Creation: a client creates new files. When choosing datanodes, consider:
        - disk space utilization
        - the number of recent creations on each datanode
        - spreading replicas
      - Re-replication: triggered when the number of available replicas falls below the target (datanode down, replica corruption ...)
      - Rebalancing: move replicas for better disk space usage and load balancing
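The re-replication trigger on slide 17 amounts to counting live replicas per block. The sketch below uses made-up data structures (`block_locations`, `live_nodes`) and a hypothetical `under_replicated` helper to find blocks that have fallen below the target and how many copies each is missing.

```python
def under_replicated(block_locations, live_nodes, target=3):
    """Return {block_id: missing_replica_count} for blocks below the target.

    block_locations: dict mapping block id -> list of datanodes holding a replica
    live_nodes:      set of datanodes currently considered alive
    """
    missing = {}
    for block_id, nodes in block_locations.items():
        live = [n for n in nodes if n in live_nodes]
        if len(live) < target:
            missing[block_id] = target - len(live)
    return missing


block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
live_nodes = {"dn1", "dn2", "dn3"}  # dn4 is down
print(under_replicated(block_locations, live_nodes))  # {'blk_2': 1}
```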
  18. Garbage collection
      - What is garbage? A block that is not in the namenode's metadata.
      - Mechanism:
        - When exchanging heartbeats with the namenode, a datanode reports a subset of the blocks it holds.
        - The master replies with the blocks that are garbage.
        - The datanode deletes the garbage blocks.
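The heartbeat exchange on slide 18 boils down to a set difference. In this sketch, `garbage_blocks` plays the master's side and `handle_reply` the datanode's side; both names and the block ids are hypothetical.

```python
def garbage_blocks(reported, known_blocks):
    """Master side: blocks the datanode reported that the metadata does not know."""
    return set(reported) - set(known_blocks)


def handle_reply(local_blocks, garbage):
    """Datanode side: delete the blocks the master flagged as garbage."""
    for block_id in garbage:
        local_blocks.pop(block_id, None)


known_blocks = {"blk_1", "blk_2"}                           # namenode metadata
local_blocks = {"blk_1": b"a", "blk_2": b"b", "blk_9": b"orphan"}

garbage = garbage_blocks(local_blocks.keys(), known_blocks)
handle_reply(local_blocks, garbage)
print(sorted(local_blocks))  # ['blk_1', 'blk_2']
```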
  19. Stale replica detection
      - Mechanism:
        - Each replica is stored with a generation timestamp.
        - When restarting, a datanode reports its set of blocks along with their generation timestamps.
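Slide 19 gives only the reporting half of the mechanism; a plausible reading is that the namenode compares each reported generation timestamp with the one it has on record and treats older ones as stale. The sketch below makes that assumption explicit. `stale_replicas` and the integer stamps are hypothetical.

```python
def stale_replicas(reported, current):
    """Return block ids whose reported generation stamp is older than the namenode's.

    reported: dict block id -> generation stamp held by the restarting datanode
    current:  dict block id -> latest generation stamp known to the namenode
    """
    return {
        block_id
        for block_id, stamp in reported.items()
        if block_id in current and stamp < current[block_id]
    }


current = {"blk_1": 5, "blk_2": 7}
reported = {"blk_1": 5, "blk_2": 6}  # blk_2 missed an update while the node was down
print(stale_replicas(reported, current))  # {'blk_2'}
```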
  20. Datanode's operation
      - Check data integrity: datanodes use checksums to detect corruption.
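Checksum-based corruption detection, as mentioned on slide 20, can be sketched with the standard library. This example uses `zlib.crc32` over 512-byte chunks purely for illustration; the chunk size, the checksum algorithm, and the helper names are assumptions and may differ from what a given HDFS version uses.

```python
import zlib

CHUNK = 512  # assumed chunk size for this sketch


def checksums(data, chunk=CHUNK):
    """Compute one CRC32 per chunk of the data."""
    return [zlib.crc32(data[i : i + chunk]) for i in range(0, len(data), chunk)]


def corrupt_chunks(data, expected, chunk=CHUNK):
    """Return the indexes of chunks whose checksum no longer matches.

    Assumes data and expected cover the same number of chunks.
    """
    actual = checksums(data, chunk)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


original = bytes(range(256)) * 8          # 2048 bytes, four chunks
stored_sums = checksums(original)         # saved when the block was written

damaged = bytearray(original)
damaged[600] ^= 0xFF                      # flip one byte inside the second chunk
print(corrupt_chunks(bytes(damaged), stored_sums))  # [1]
```

Checksumming per chunk rather than per block lets a reader pinpoint the damaged region and verify data as it streams.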
  21. Filesystem API: HDFS provides commands modeled on basic Linux utilities. Examples:
        hdfs dfs -mkdir -p /foo
        hdfs dfs -ls /foo
        hdfs dfs -cat /foo/bar.txt
        hdfs dfs -rm -r /foo
  22. Etc.: RAID? Native library?
  23. End. Thanks.