Hadoop architecture by ajay



This PPT Gives information about
Complete Hadoop Architecture and
information about
how user request is processed in Hadoop?
About Namenode
Hadoop installation Post Configurations

Published in: Education, Technology


  1. http://www.beinghadoop.com hadoopframework@gmail.com
  2. MASTER NODE, SLAVE NODES. Master-node daemons: NameNode, Secondary NameNode, JobTracker. Slave-node daemons: DataNodes, TaskTrackers. HDFS, the Hadoop Distributed File System, distributes data across the available DataNodes, so the actual data resides on the DataNodes. Data on a DataNode is stored as small fixed-size chunks; we call these chunks blocks. The default block size in HDFS is 64 MB. For example, to store 1 GB (1000 MB) of data in HDFS across 3 DataNodes: 1000 MB / 64 MB = 15.625, so the file is split into 16 blocks, distributed across the 3 DataNodes. The first 15 blocks are 64 MB each; the 16th holds the remaining 40 MB.
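The block-splitting arithmetic above can be sketched as a short Python model (sizes in whole megabytes for simplicity; real HDFS works in bytes):

```python
BLOCK_SIZE_MB = 64  # HDFS default block size in Hadoop 1.x

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the HDFS blocks for a file of the given size."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full
    if remainder:
        blocks.append(remainder)  # the last block only holds what is left
    return blocks

blocks = split_into_blocks(1000)
print(len(blocks), blocks[-1])  # 16 blocks, the last one 40 MB
```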
  3. These blocks are replicated according to the replication policy, which improves data reliability, availability, and network bandwidth utilization. The default replication factor is 3, so in the example above 3 x 16 = 48 block replicas are created and spread across the DataNodes. The default HDFS placement policy stores one replica on a node in the local rack, a second replica on a node in a different (remote) rack, and the third on a different node in that same remote rack. HDFS stores each file as a sequence of blocks; all blocks of a file except the last are the same size. Blocks are replicated for fault tolerance, and a user or an application can specify the replication factor per file.
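The default placement policy for replication factor 3 (first replica on the writer's node, the other two on a different rack, per the HDFS Architecture Guide) can be illustrated with a toy model; the rack map and node names here are hypothetical, and real HDFS also weighs node load and free space:

```python
def default_placement(writer, racks):
    """Sketch of the default HDFS placement for replication factor 3.

    `racks` maps rack name -> list of DataNodes; `writer` is (rack, node).
    """
    local_rack, local_node = writer
    remote_rack = next(r for r in racks if r != local_rack)
    first = local_node              # replica 1: the writer's own node
    second = racks[remote_rack][0]  # replica 2: a node on a different rack
    third = racks[remote_rack][1]   # replica 3: another node on that same remote rack
    return [first, second, third]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(default_placement(("rack1", "dn1"), racks))  # ['dn1', 'dn3', 'dn4']
```

Losing a whole rack therefore still leaves at least one live replica of every block.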
  4. The default replication factor for HDFS is specified in hdfs-site.xml (under HADOOP_HOME/conf) via the dfs.replication property. We can set the replication factor for an individual file with the hadoop fs -setrep command. Syntax: hadoop fs -setrep [-R] <replication> <path>, where -R applies the change recursively.
  5. In the diagram above, data1 consists of 3 blocks (1, 2, 3) and data2 of 2 blocks (4, 5), and there are 4 DataNodes. The 5 blocks x 3 replicas = 15 block replicas are distributed across the 4 DataNodes. While blocks are being replicated, common issues can arise: over-replicated blocks, under-replicated blocks, corrupted blocks, and mis-replicated blocks.
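A simplified sketch of how those 15 replicas could be spread over 4 DataNodes (round-robin here is a deliberate simplification; real HDFS follows the rack-aware policy described earlier):

```python
from collections import Counter

def distribute(num_blocks, replication, datanodes):
    """Assign block replicas to DataNodes round-robin (illustrative only;
    real HDFS never puts two replicas of one block on the same node)."""
    placements = Counter()
    for i in range(num_blocks * replication):
        placements[datanodes[i % len(datanodes)]] += 1
    return placements

print(distribute(5, 3, ["dn1", "dn2", "dn3", "dn4"]))
```

With 15 replicas over 4 nodes the counts come out 4, 4, 4, 3, which is why a balancer is needed when real clusters drift much further from even.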
  6. To rebalance blocks across the nodes, we run the start-balancer.sh script under HADOOP_HOME/bin. To check block information, storage usage, and the health of the HDFS file system, we use: hadoop fsck /, hadoop dfsadmin -report, hadoop dfsadmin -metasave metaloginfo.txt, and the DataNode block scanner report.
  7. (image slide; no text content)
  8. (image slide; no text content)
  9. We cannot read data from a DataNode directly, because a DataNode holds only distributed blocks, which are fragments (part files) of the original data. DataNodes are responsible for serving read and write requests from clients, and for block creation, deletion, and replication according to the replication factor.
  10. When a DataNode starts up, it performs a handshake with the NameNode. Each DataNode then sends a block report to the NameNode every hour, so the NameNode has an up-to-date view of where block replicas are located in the cluster. Each DataNode also sends a heartbeat to the NameNode every three seconds, which tells the NameNode which DataNodes are working properly. If the NameNode receives no heartbeat from a DataNode for ten minutes, it assumes that DataNode has failed and begins writing new replicas of its blocks on the remaining DataNodes.
  11. The NameNode manages the file system namespace. It stores metadata for the files whose data is kept on the DataNodes, maintained in the form of a file system tree. When a user writes data to a DataNode, the corresponding metadata is recorded on the NameNode. So when a user or an HDFS client wants to read that data, the client first contacts the NameNode, which returns a reference to the DataNodes where the blocks are located. When the NameNode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog transactions to the FsImage, and then writes the updated FsImage back to disk as a checkpoint. While the cluster runs, user-initiated write operations are recorded in the EditLog; at the next checkpoint the changes are folded into the FsImage.
  12. The NameNode stores the namespace as a hierarchical file system and maintains the file system tree. Any change to the file system metadata is recorded by the NameNode. An application can specify the number of replicas a file needs (its replication factor); this information is also stored on the NameNode. The NameNode uses a transaction log called the EditLog to record every change to the file system metadata, for example creating a new file or changing a file's replication factor. The EditLog is stored on the NameNode's local file system. The entire namespace, including the mapping of blocks to files and the file system properties, is stored in a file called FsImage, also on the NameNode's local file system.
  13. When the NameNode starts, it enters Safemode. During Safemode, no replication of data blocks occurs. Each DataNode checks in with a heartbeat and a block report, and the NameNode verifies that each block has an acceptable number of replicas. Once a configurable percentage of blocks are known to be safely replicated, the NameNode exits Safemode. It then builds the list of blocks that still have fewer replicas than required and replicates those blocks to other DataNodes.
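The Safemode exit condition can be modeled as a simple threshold check; the percentage is the dfs.safemode.threshold.pct property, whose default in Hadoop 1.x is 0.999:

```python
def can_leave_safemode(blocks_with_min_replicas, total_blocks, threshold_pct=0.999):
    """True once the fraction of blocks meeting their minimum replication
    reaches the configured threshold (dfs.safemode.threshold.pct)."""
    if total_blocks == 0:
        return True  # an empty namespace has nothing to wait for
    return blocks_with_min_replicas / total_blocks >= threshold_pct

print(can_leave_safemode(999, 1000))  # True: 99.9% reported in
print(can_leave_safemode(998, 1000))  # False: still waiting
```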
  14. Client applications submit jobs to the JobTracker. The JobTracker asks the NameNode for the location of the data, then locates TaskTracker nodes with available slots at or near that data and submits the work to them. The TaskTracker nodes are monitored: if they do not send heartbeat signals often enough, they are treated as failed and their work is scheduled onto a different TaskTracker. A TaskTracker notifies the JobTracker when a task fails. The JobTracker then decides what to do: it may resubmit the task elsewhere, mark that specific record as something to avoid, or even blacklist the TaskTracker as unreliable. When the work is completed, the JobTracker updates its status, and client applications can poll the JobTracker for that information.
  15. A TaskTracker is a slave daemon in the cluster that accepts tasks from the JobTracker: map, reduce, or shuffle operations. The TaskTracker runs in its own JVM process. Every TaskTracker is configured with a set of slots, which indicate the number of tasks it can accept. The TaskTracker starts a separate JVM process for each piece of actual work (called a task instance), so that a task failure does not take down the TaskTracker itself. The TaskTracker monitors these task instances, capturing their output and exit codes. When a task instance finishes, successfully or not, the TaskTracker notifies the JobTracker. TaskTrackers also send periodic heartbeat messages to the JobTracker (every few seconds by default) to reassure it that they are still alive. These messages include the number of available slots, so the JobTracker stays up to date on where in the cluster work can be delegated.
  16. The maximum number of map tasks that can run concurrently on a TaskTracker is controlled by mapred.tasktracker.map.tasks.maximum (default 2). The maximum number of reduce tasks is controlled by mapred.tasktracker.reduce.tasks.maximum (default 2). As a rule of thumb, if eight processor cores are available, at most seven should be assigned to map or reduce tasks, leaving a core for the DataNode and TaskTracker daemons.
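The slot arithmetic above as a quick sketch (the helper names are hypothetical, not Hadoop APIs):

```python
def max_concurrent_tasks(map_slots=2, reduce_slots=2):
    """Per-TaskTracker concurrency with the default slot settings."""
    return map_slots + reduce_slots

def suggested_slots(num_cores, reserved_for_daemons=1):
    # rule of thumb from the slide: leave a core for the DataNode/TaskTracker daemons
    return max(1, num_cores - reserved_for_daemons)

print(max_concurrent_tasks())  # 4 tasks with the defaults
print(suggested_slots(8))      # 7 slots on an 8-core machine
```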
  17. 1. A block is the minimum amount of data that HDFS can read or write. The default block size is 64 MB; it can be raised, for example to 128 MB, in hdfs-site.xml via dfs.block.size (the value is given in bytes). 2. The default HDFS buffer size is 4 KB. Depending on the system configuration, it can be tuned via io.file.buffer.size in core-site.xml.
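Since both properties take byte values, a quick way to compute the numbers to put in the config files:

```python
MB = 1024 * 1024

# value for dfs.block.size when raising the block size to 128 MB
BLOCK_128_MB = 128 * MB

# how many io.file.buffer.size-sized (4 KB) reads it takes to stream one default 64 MB block
READS_PER_BLOCK = (64 * MB) // (4 * 1024)

print(BLOCK_128_MB)     # 134217728
print(READS_PER_BLOCK)  # 16384
```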
  18. Adding a node to an existing cluster is called commissioning a node; removing a node is decommissioning. 3. To commission a node: add the node's network address to the include files referenced by dfs.hosts (hdfs-site.xml) and mapred.hosts (mapred-site.xml). Then update the NameNode with hadoop dfsadmin -refreshNodes and the JobTracker with hadoop mradmin -refreshNodes.
  19. To decommission a node: add the node's network address to the exclude files referenced by dfs.hosts.exclude (hdfs-site.xml) and mapred.hosts.exclude (mapred-site.xml). Then update the NameNode with hadoop dfsadmin -refreshNodes and the JobTracker with hadoop mradmin -refreshNodes. 4. Non-HDFS storage on a DataNode: on a 1 TB DataNode disk, to reserve 250 GB for non-HDFS storage, set dfs.datanode.du.reserved=268435456000 (the value is in bytes).
  20. 5. Recycle bin for HDFS: if trash is enabled, a hidden directory named .Trash is created. The command hadoop fs -expunge empties the HDFS trash. The minimum time a deleted file remains in the trash is set with fs.trash.interval, specified in minutes, for example fs.trash.interval=600 for ten hours. To disable the trash, set fs.trash.interval=0 in core-site.xml.
  21. 6. DataNode block scanner: periodically verifies all the blocks on the DataNode. The default interval is every 504 hours (three weeks), set in hdfs-site.xml via dfs.datanode.scan.period.hours=504. 7. Log file location: by default the log directory is under HADOOP_INSTALL/logs. We can assign a new location in hadoop-env.sh by adding the line export HADOOP_LOG_DIR=/var/log/hadoop.
  22. hadoop-env.sh: environment variables used to run Hadoop. core-site.xml: configuration settings common to HDFS and MapReduce, such as I/O settings. hdfs-site.xml: configuration for the HDFS daemons: NameNode, Secondary NameNode, and DataNodes. mapred-site.xml: configuration for the MapReduce daemons: JobTracker and TaskTrackers.
  23. To designate a particular node as the NameNode, specify its URI in core-site.xml under the property fs.default.name. The NameNode's hostname or IP address is given there in the form hdfs://host:port. In pseudo-distributed mode, the NameNode is configured on localhost, for example hdfs://localhost:9000.
  24. The dfs.name.dir property in hdfs-site.xml gives the locations where the NameNode stores its file system metadata. Multiple disk locations, including remote disks, can be specified as a comma-separated list of directory names, so the metadata can be recovered if the NameNode fails.
  25. (image slide; no text content)
  26. The dfs.data.dir property in hdfs-site.xml lists the directories where a DataNode stores its blocks. Multiple directories can be specified; the DataNode writes to them in round-robin fashion to spread I/O across disks (this is not replication, which always happens across DataNodes). The mapred.job.tracker property in mapred-site.xml specifies the host and port where the JobTracker runs. In fully distributed mode the JobTracker can be configured on a separate node.
  27. mapred.local.dir is a comma-separated list of directories where MapReduce stores intermediate data for jobs; the data is cleared when the job completes. mapred.system.dir specifies the location where shared files are stored while a job is running. mapred.tasktracker.map.tasks.maximum specifies the number of map tasks that can run on a TaskTracker at one time; mapred.tasktracker.reduce.tasks.maximum does the same for reduce tasks.
  28. Default HTTP server ports: 50070 NameNode (open http://localhost:50070 in a web browser), 50030 JobTracker, 50060 TaskTracker, 50075 DataNode, 50090 Secondary NameNode.
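The port assignments above, collected into a small lookup table (a hypothetical helper for reference, not a Hadoop API):

```python
# Default web UI ports of the Hadoop 1.x daemons
HTTP_PORTS = {
    "namenode": 50070,
    "jobtracker": 50030,
    "tasktracker": 50060,
    "datanode": 50075,
    "secondary_namenode": 50090,
}

def web_ui_url(daemon, host="localhost"):
    """Build the web UI URL for a daemon on the given host."""
    return f"http://{host}:{HTTP_PORTS[daemon]}"

print(web_ui_url("namenode"))  # http://localhost:50070
```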
  29. (image slide; no text content)
  30. These scripts are located under HADOOP_HOME/bin (or /usr/lib/hadoop/bin). Hadoop control scripts: the masters file is a plain-text file containing the address of the Secondary NameNode; the slaves file is a plain-text file containing the addresses of the DataNodes. start-dfs.sh starts the NameNode on the local machine, a DataNode on each machine listed in the slaves file, and the Secondary NameNode on each machine listed in the masters file. start-mapred.sh starts a JobTracker on the local machine and a TaskTracker on each machine listed in the slaves file. To stop these daemons we have the scripts stop-dfs.sh and stop-mapred.sh.
  31. (image slide; no text content)
  32. http://localhost:50075 (DataNode web UI)
  33. http://localhost:50030 (JobTracker web UI)
  34. http://localhost:50090 (Secondary NameNode web UI)
  35. http://localhost:50060 (TaskTracker web UI)