Hadoop HDFS by rohitkapa


A complete Hadoop Distributed File System concept overview and tutorial with real time examples.

Published in: Technology

  1. Hadoop Distributed File System (HDFS)
  2. Topics Covered
     – Big Data and Hadoop Introduction
     – HDFS Introduction
     – HDFS Definition
     – HDFS Components
     – Architecture of HDFS
     – Understanding the File System
     – Read and Write in HDFS
     – HDFS CLI
     – Summary
  3. What is Big Data?
     Big Data refers to datasets that grow so large that they are difficult to capture, store, manage, share, analyze and visualize with typical database software tools.
     Big Data typically comes in complex, unstructured formats, from web sites, social media and email to videos, data warehouses and scientific data.
     The four Vs that make data challenging enough to be classified as Big Data are:
     – Volume
     – Velocity
     – Variety
     – Value
  4. What is Hadoop?
     It is an Apache Software Foundation project:
     • Framework for running applications on large clusters
     • Modeled after Google's MapReduce / GFS framework
     • Implemented in Java
     A software platform that lets one easily write and run applications that process vast amounts of data. It includes:
     – MapReduce: offline computing engine
     – HDFS: Hadoop Distributed File System
     Here is what makes it especially useful:
     – Scalable: It can reliably store and process petabytes.
     – Economical: It distributes the data and processing across clusters of commonly available computers (numbering in the thousands).
     – Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
     – Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks after failures.
  5. Why Hadoop?
     Handle partial hardware failures without going down:
     – If a machine fails, we should be able to switch over to a standby machine
     – If a disk fails, use RAID or a mirror disk
     Fault tolerance:
     – Regular backups
     – Logging
     – Mirror database at a different site
     Elasticity of resources:
     – Increase capacity without restarting the whole system (PureScale)
     – More computing power should mean faster processing
     Result consistency:
     – Answers should be consistent (independent of something failing) and returned in a reasonable amount of time
  6. HDFS – Introduction
     ● Hadoop Distributed File System (HDFS)
     ● Based on Google File System
     ● Google File System was derived from the BigFiles paper authored by Larry Page and Sergey Brin at Stanford
     ● Hadoop provides a distributed filesystem and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm
     ● The interface to HDFS is patterned after the Unix filesystem
     ● Other distributed file system types are PVFS, Lustre, GFS, KDFS, FTP, Amazon S3
  7. HDFS – Goals and Assumptions
     ● Hardware Failure
     ● Streaming Data Access
     ● Large Data Sets
     ● Simple Coherency Model
     ● “Moving Computation is Cheaper than Moving Data”
     ● Portability Across Heterogeneous Hardware and Software Platforms
  8. HDFS Definition
     – The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
     – HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework.
     – HDFS is the primary storage system used by Hadoop applications.
     – HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
     – HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
     HDFS consists of the following components (daemons):
     • HDFS Master “NameNode”
     • HDFS Workers “DataNodes”
     • Secondary NameNode
  9. HDFS Components
     NameNode: a master server that manages the file system namespace and regulates access to files by clients.
     • Metadata in memory
       – The entire metadata is in main memory
     • Types of metadata
       – List of files
       – List of blocks for each file
       – List of DataNodes for each block
       – File attributes, e.g. creation time, replication factor
     • A transaction log
       – Records file creations, file deletions, etc.
     DataNode: DataNodes, one per node in the cluster, manage storage attached to the nodes that they run on.
     • A block server
       – Stores data in the local file system (e.g. ext3)
       – Stores metadata of a block (e.g. CRC)
       – Serves data and metadata to clients
     • Block report
       – Periodically sends a report of all existing blocks to the NameNode
     • Facilitates pipelining of data
       – Forwards data to other specified DataNodes
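The metadata listed on this slide can be pictured as a few in-memory maps. The following is a minimal sketch, not HDFS code: `ToyNameNode`, `FileEntry`, and every method and field name are hypothetical, and the block-report handling is simplified to "replace what we know about this DataNode".

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    """Per-file attributes kept in memory (hypothetical layout)."""
    replication: int
    creation_time: float
    block_ids: list = field(default_factory=list)


class ToyNameNode:
    """Illustrative stand-in for the NameNode's in-memory metadata."""

    def __init__(self):
        self.files = {}            # path -> FileEntry (list of files and their blocks)
        self.block_locations = {}  # block id -> set of DataNode names
        self.edit_log = []         # transaction log of namespace changes

    def create_file(self, path, block_ids, replication=3, creation_time=0.0):
        self.files[path] = FileEntry(replication, creation_time, list(block_ids))
        self.edit_log.append(("create", path))

    def delete_file(self, path):
        entry = self.files.pop(path)
        for block_id in entry.block_ids:
            self.block_locations.pop(block_id, None)
        self.edit_log.append(("delete", path))

    def block_report(self, datanode, block_ids):
        """Apply a DataNode's periodic report of all blocks it currently holds."""
        reported = set(block_ids)
        # Forget this DataNode for any block it no longer reports.
        for block_id, nodes in self.block_locations.items():
            if block_id not in reported:
                nodes.discard(datanode)
        for block_id in reported:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return (block id, sorted DataNode names) for each block of a file."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.files[path].block_ids
        ]
```

A client asking where `/input/tweets.txt` lives would, in this toy model, call `locate` and receive one entry per block with the DataNodes that last reported it.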
  10. HDFS Components
     Secondary NameNode:
     – Not used as a hot standby or mirror node. A failover node is planned for a future release.
     – Will be renamed to CheckNode in 0.21
     – The backup NameNode periodically wakes up, processes a checkpoint and updates the NameNode
     – Memory requirements are the same as the NameNode's (big)
     – Typically on a separate machine in a large cluster (> 10 nodes)
     – Its directory is the same as the NameNode's, except that it keeps the previous checkpoint version in addition to the current one
     – It can be used to restore a failed NameNode (just copy the current directory to the new NameNode)
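The checkpoint step described above can be sketched as replaying a transaction log onto the last saved namespace image while retaining the previous checkpoint. This is an illustration under assumptions: `apply_edits`, `ToyCheckpointer`, and the `(operation, path, value)` edit format are hypothetical and do not reflect Hadoop's on-disk formats.

```python
def apply_edits(image, edits):
    """Replay logged operations on a copy of the namespace image."""
    merged = dict(image)
    for operation, path, value in edits:
        if operation == "create":
            merged[path] = value
        elif operation == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return merged


class ToyCheckpointer:
    """Keeps the current checkpoint plus the one before it."""

    def __init__(self, image):
        self.current = dict(image)
        self.previous = None

    def checkpoint(self, edits):
        # The old checkpoint is retained so there is always a fallback copy.
        self.previous = self.current
        self.current = apply_edits(self.current, edits)
        return self.current
```

Restoring a failed NameNode from `current` corresponds, in this toy model, to the slide's "just copy the current directory" step.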
  11. HDFS Block
     – Large data sets are divided into small chunks for easy processing
     – Default block size is 64 MB
     – Can be increased to 128 MB
     – Reason for this default size and how it affects HDFS
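One way to see how the block size affects HDFS is to count the blocks a file occupies, since the NameNode tracks every block in memory. The helper below is a minimal sketch; `split_into_blocks` is a hypothetical name and the function only does the arithmetic.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes, in bytes, of the blocks a file of file_size bytes occupies."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left over
    return sizes
```

For example, a 200 MB file occupies four blocks at 64 MB (three full blocks plus one of 8 MB) but only two blocks at 128 MB, so the larger block size halves the number of block entries to track for that file.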
  12. HDFS Architecture
  13. HDFS Architecture
  14. Understanding the File System
     Block placement
     • Current strategy
       – One replica on the local node
       – Second replica on a remote rack
       – Third replica on the same remote rack
       – Additional replicas are randomly placed
     • Clients read from the nearest replica
     Data correctness
     • Use checksums to validate data
       – Use CRC32
     • File creation
       – Client computes a checksum per 512 bytes
       – DataNode stores the checksum
     • File access
       – Client retrieves the data and checksum from the DataNode
       – If validation fails, the client tries other replicas
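The data-correctness steps above can be sketched with Python's standard `zlib.crc32`. This is an illustration only: `chunk_checksums` and `read_block` are hypothetical names, and each replica is modeled as a `(data, stored_checksums)` pair standing in for what a DataNode would return.

```python
import zlib

BYTES_PER_CHECKSUM = 512


def chunk_checksums(data, chunk_size=BYTES_PER_CHECKSUM):
    """Compute one CRC32 value per chunk_size bytes of data."""
    return [
        zlib.crc32(data[offset:offset + chunk_size])
        for offset in range(0, len(data), chunk_size)
    ]


def read_block(replicas):
    """Return data from the first replica whose checksums validate.

    replicas is an ordered list of (data, stored_checksums) pairs,
    nearest replica first.
    """
    for data, stored_checksums in replicas:
        if chunk_checksums(data) == stored_checksums:
            return data
        # Validation failed: fall through and try the next replica.
    raise IOError("all replicas failed checksum validation")
```

At file creation the client would call `chunk_checksums` once and hand the result to the DataNode for storage; at read time `read_block` recomputes the checksums and skips any replica whose data no longer matches.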
  15. Understanding the File System
     Data pipelining
     – Client retrieves a list of DataNodes on which to place replicas of a block
     – Client writes the block to the first DataNode
     – The first DataNode forwards the data to the next DataNode in the pipeline
     – When all replicas are written, the client moves on to write the next block in the file
     Rebalancer
     – Goal: the percentage of disk occupied on DataNodes should be similar
     – Usually run when new DataNodes are added
     – Cluster stays online while the Rebalancer is active
     – Rebalancer is throttled to avoid network congestion
     – Command line tool
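The pipelining steps above can be sketched as a chain of nodes where the client only ever talks to the first one. This is a simplified model under assumptions: `ToyDataNode`, `write_file`, and `choose_pipeline` are hypothetical names, and whole blocks are forwarded in one call for brevity.

```python
class ToyDataNode:
    """Illustrative DataNode that stores blocks in a dict."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, downstream):
        """Store the block locally, then forward it to the next node in the pipeline."""
        self.blocks[block_id] = data
        if downstream:
            downstream[0].write(block_id, data, downstream[1:])


def write_file(blocks, choose_pipeline):
    """Write blocks one after another.

    blocks is a list of (block_id, data) pairs; choose_pipeline(block_id)
    returns the ordered list of DataNodes that should hold replicas.
    """
    for block_id, data in blocks:
        pipeline = choose_pipeline(block_id)
        # The client sends the block only to the first DataNode;
        # the remaining replicas are created by forwarding.
        pipeline[0].write(block_id, data, pipeline[1:])
```

Because `write` returns only after its downstream call returns, `write_file` moves on to the next block only once every replica of the current block has been stored, matching the last pipelining step on the slide.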
  16. Read and Write in HDFS (slides 16 to 33 are figures; no text was captured)
  34. Command Line Interface
     – HDFS has a UNIX-style command line interface (CLI), and we access HDFS through it.
     – HDFS can also be accessed through a web interface, but that is limited to viewing HDFS contents.
     – We will go through this part in detail in the practical sessions.
     – Below are a few examples of CLI-based operations:
         hadoop fs -mkdir /input
         hadoop fs -copyFromLocal input/docs/tweets.txt /input/tweets.txt
         hadoop fs -put input/docs/tweets.txt /input/tweets.txt
         hadoop fs -ls /input
         hadoop fs -rmr /input
  35. Resources
     ● Apache Hadoop Wiki
     ● Brad Hedlund's website (special thanks for making HDFS easy to understand in real time)
  36. THANK YOU, by Rohit Kapa