Hadoop Distributed File System (HDFS)
Topics Covered


Big Data and Hadoop Introduction
HDFS Introduction
HDFS Definition
HDFS Components
Architecture of HDFS
Understanding the File System
Read and Write in HDFS
HDFS CLI
Summary



What is Big Data ?
Big Data refers to datasets that grow so large that they become difficult to
capture, store, manage, share, analyze and visualize with
typical database software tools.

Big Data usually comes in complex, unstructured formats,
from web sites, social media and email to
videos, data warehouses and scientific data.

The four Vs that make data challenging enough to be classified as Big
Data are:
         Volume, Velocity, Variety, Value




What is Hadoop?
It is an Apache Software Foundation project
• Framework for running applications on large clusters
• Modeled after Google’s MapReduce / GFS framework
• Implemented in Java

A software platform that lets one easily write and run applications
that process vast amounts of data. It includes:
      – MapReduce – offline computing engine
      – HDFS – Hadoop distributed file system

Here's what makes it especially useful:
   Scalable: It can reliably store and process petabytes.
   Economical: It distributes the data and processing across clusters of
      commonly available computers (numbering in the thousands).
   Efficient: By distributing the data, it can process it in parallel on the nodes
      where the data is located.
   Reliable: It automatically maintains multiple copies of data and automatically
      redeploys computing tasks after failures.

Why Hadoop?
Handle partial hardware failures without going down:
  − If a machine fails, we should be able to switch over to a standby machine
  − If a disk fails, use RAID or a mirror disk

Fault Tolerance:
   Regular backups
   Logging
   Mirror the database at a different site

Elasticity of resources:
   Increase capacity without restarting the whole system (PureScale)
   More computing power should mean faster processing

Result consistency:
   Answers should be consistent (independent of something failing) and returned in
    a reasonable amount of time


HDFS – Introduction
●   Hadoop Distributed File System (HDFS)
●   Based on the Google File System (GFS)
●   The Google File System grew out of 'BigFiles', developed by
    Larry Page and Sergey Brin at Stanford
●   Hadoop provides a distributed filesystem and a framework for the
    analysis and transformation of very large data sets using the
    MapReduce paradigm
●   The interface to HDFS is patterned after the Unix filesystem
●   Other distributed file system types are
    PVFS, Lustre, GFS, KDFS, FTP, Amazon S3


HDFS – Goals and Assumptions

●   Hardware Failure
●   Streaming Data Access
●   Large Data Sets
●   Simple Coherency Model
●   “Moving Computation is Cheaper than Moving Data”
●   Portability Across Heterogeneous Hardware and Software
    Platforms



HDFS Definition
 – The Hadoop Distributed File System (HDFS) is a distributed file system
   designed to run on commodity hardware.
 – HDFS is a distributed, scalable, and portable filesystem written in Java
   for the Hadoop framework.
 – HDFS is the primary storage system used by Hadoop applications.
 – HDFS is highly fault-tolerant and is designed to be deployed on low-cost
   hardware.
 – HDFS provides high throughput access to application data and is
   suitable for applications that have large data sets

HDFS consists of the following components (daemons):
          • HDFS Master “NameNode”
          • HDFS Workers “DataNodes”
          • Secondary NameNode
HDFS Components
 NameNode:
 The NameNode, a master server, manages the file system namespace and regulates access to files by clients.
  Metadata in Memory
   – The entire metadata is in main memory
  Types of Metadata
   – List of files
   – List of blocks for each file
   – List of DataNodes for each block
   – File attributes, e.g. creation time, replication factor
  A Transaction Log
   – Records file creations, file deletions, etc.

 DataNode:
 DataNodes, one per node in the cluster, manage storage attached to the nodes that they run on.
  A Block Server
   – Stores data in the local file system (e.g. ext3)
   – Stores metadata of a block (e.g. CRC)
   – Serves data and metadata to clients
  Block Report
   – Periodically sends a report of all existing blocks to the NameNode
  Facilitates Pipelining of Data
   – Forwards data to other specified DataNodes
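
The metadata types listed above can be pictured as a few in-memory maps. The following is only a toy sketch, not the NameNode's actual data structures: the class, method and block names (NameNodeMetadata, process_block_report, blk_1, dn1) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    """Per-file attributes kept by the toy NameNode (illustrative only)."""
    replication: int
    block_ids: list = field(default_factory=list)


class NameNodeMetadata:
    """Toy in-memory namespace: files -> blocks -> DataNodes."""

    def __init__(self):
        self.files = {}            # path -> FileMeta
        self.block_locations = {}  # block id -> set of DataNode names
        self.transaction_log = []  # append-only record of namespace changes

    def create_file(self, path, block_ids, replication=3):
        self.files[path] = FileMeta(replication, list(block_ids))
        self.transaction_log.append(("create", path))

    def delete_file(self, path):
        del self.files[path]
        self.transaction_log.append(("delete", path))

    def process_block_report(self, datanode, block_ids):
        """Replace what is known about `datanode` with its reported blocks."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return [(block id, sorted DataNode names)] for a file."""
        return [(b, sorted(self.block_locations.get(b, ())))
                for b in self.files[path].block_ids]


nn = NameNodeMetadata()
nn.create_file("/input/tweets.txt", ["blk_1", "blk_2"])
nn.process_block_report("dn1", ["blk_1", "blk_2"])
nn.process_block_report("dn2", ["blk_1"])
print(nn.locate("/input/tweets.txt"))
# [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn1'])]
```

In this sketch block locations are learned only from block reports, while file creations and deletions go to the transaction log.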
HDFS Components

 Secondary NameNode


 – Not used as a hot standby or mirror node. A failover node is planned for a future release.
 – Will be renamed to Checkpoint Node in 0.21
 – This backup node periodically wakes up, processes a checkpoint and updates
 the NameNode
 – Memory requirements are the same as the NameNode's (big)
 – Typically runs on a separate machine in a large cluster (> 10 nodes)
 – Its directory is the same as the NameNode's, except that it keeps the previous checkpoint version in
 addition to the current one
 – It can be used to restore a failed NameNode (just copy the current directory to the new
 NameNode)
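
The checkpoint step described above can be sketched as replaying a log of edits onto the last saved image while keeping the previous version. This is a simplified illustration under assumed names (apply_edits, checkpoint, and a namespace modeled as a set of paths); it is not the on-disk format HDFS uses.

```python
def apply_edits(image, edits):
    """Return a new namespace image with the logged edits replayed in order."""
    merged = set(image)
    for op, path in edits:
        if op == "create":
            merged.add(path)
        elif op == "delete":
            merged.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


def checkpoint(state):
    """Keep the old checkpoint as 'previous', store the merged image as
    'current', and start a fresh edit log."""
    merged = apply_edits(state["current"], state["edits"])
    return {"previous": state["current"], "current": merged, "edits": []}


state = {
    "previous": set(),
    "current": {"/a", "/b"},
    "edits": [("create", "/c"), ("delete", "/a")],
}
state = checkpoint(state)
print(sorted(state["current"]))   # ['/b', '/c']
print(sorted(state["previous"]))  # ['/a', '/b']
```

Restoring a failed NameNode then corresponds to starting from the 'current' image of the last checkpoint.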




HDFS Block

– Large data sets are divided into smaller chunks (blocks) for easier processing.
– Default block size is 64 MB
– Can be increased, for example to 128 MB
– Reason for this default size and how it affects HDFS
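
To see how the block size affects the number of blocks, here is a small arithmetic sketch. The function name split_into_blocks is made up for illustration; the 64 MB and 128 MB figures come from the slide.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes occupies."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    # The last block only holds the bytes that are left over.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


blocks = split_into_blocks(200 * MB)
print(len(blocks), [b // MB for b in blocks])       # 4 [64, 64, 64, 8]
print(len(split_into_blocks(200 * MB, 128 * MB)))   # 2
```

A larger block size means fewer blocks per file, and therefore fewer block entries for the NameNode to keep in memory.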




HDFS Architecture

[Figure: HDFS architecture diagram]
Understanding the File system
Block placement
• Current Strategy
   −   One replica on local node
   −   Second replica on a remote rack
   −   Third replica on same remote rack
   −   Additional replicas are randomly placed
• Clients read from nearest replica
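
The placement strategy above can be sketched as follows. This is a minimal illustration, not the NameNode's implementation: the function place_replicas, the topology dictionary and the node names are all hypothetical, and the random choices stand in for whatever selection the real system makes.

```python
import random


def place_replicas(writer, topology, replication=3, rng=random):
    """Pick DataNodes for one block.

    topology maps a rack name to its DataNode names; writer is the
    DataNode local to the client. All names here are hypothetical.
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    chosen = [writer]  # first replica: local node
    remote_racks = [rack for rack in topology if rack != local_rack]
    if replication >= 2 and remote_racks:
        # second and third replicas: two nodes on one remote rack
        candidates = list(topology[rng.choice(remote_racks)])
        rng.shuffle(candidates)
        chosen.extend(candidates[:min(2, replication - 1)])
    # additional replicas: random nodes that hold no copy yet
    remaining = [n for nodes in topology.values() for n in nodes if n not in chosen]
    rng.shuffle(remaining)
    chosen.extend(remaining[:max(0, replication - len(chosen))])
    return chosen


topology = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}
replicas = place_replicas("dn1", topology, rng=random.Random(0))
print(replicas)  # first entry is dn1; the other two share one remote rack
```

Putting two of the three replicas on one remote rack means a block survives the loss of a whole rack while only one copy has to cross racks during the write.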

Data Correctness
• Use checksums to validate data
   − Use CRC32
• File creation
   − Client computes a checksum per 512 bytes
   − DataNode stores the checksum
• File access
   − Client retrieves the data and checksum from the DataNode
   − If validation fails, the client tries other replicas
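
The checksum scheme above can be sketched with Python's zlib.crc32. The helper names (compute_checksums, read_validated) and the replica tuples are assumptions made for this example; only the CRC32 and 512-byte figures come from the slide.

```python
import zlib

BYTES_PER_CHECKSUM = 512


def compute_checksums(data, chunk_size=BYTES_PER_CHECKSUM):
    """CRC32 of every chunk_size-byte chunk, as the client does on file creation."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def read_validated(replicas):
    """Return (DataNode name, data) from the first replica that passes validation.

    replicas is a list of (name, data, stored_checksums) tuples.
    """
    for name, data, stored in replicas:
        if compute_checksums(data) == stored:
            return name, data
    raise IOError("every replica failed checksum validation")


original = bytes(range(256)) * 5          # 1280 bytes -> 3 chunks
checksums = compute_checksums(original)
corrupted = b"\xff" + original[1:]        # change the first byte on one replica
replicas = [("dn1", corrupted, checksums), ("dn2", original, checksums)]
name, data = read_validated(replicas)
print(len(checksums), name)               # 3 dn2
```

The corrupted copy on dn1 fails validation, so the read falls through to the replica on dn2.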
Understanding the File system

 Data pipelining
   − Client retrieves a list of DataNodes on which to place replicas of a block
   − Client writes block to the first DataNode
   − The first DataNode forwards the data to the next DataNode in the Pipeline
   − When all replicas are written, the Client moves on to write the next block in file
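
The pipelining steps above can be sketched like this. It is a simplified, single-process illustration: write_block and write_file are hypothetical names, and a dictionary stands in for the DataNodes' local storage.

```python
def write_block(block, pipeline, storage):
    """Send one block down a chain of DataNodes; return the nodes that stored it."""
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(block)           # head stores the block
    return [head] + write_block(block, rest, storage)    # then forwards it on


def write_file(blocks, pipelines):
    """Write blocks one at a time, each through its own list of DataNodes."""
    storage = {}
    for block, pipeline in zip(blocks, pipelines):
        acked = write_block(block, pipeline, storage)
        if acked != list(pipeline):
            raise IOError("block was not written to every replica")
        # only now does the client move on to the next block
    return storage


storage = write_file(
    [b"block-0", b"block-1"],
    [["dn1", "dn3", "dn4"], ["dn2", "dn5", "dn6"]],
)
print(storage["dn3"])  # [b'block-0']
```

The client only talks to the first DataNode of each pipeline; every other copy is produced by forwarding between DataNodes.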



 Rebalancer
    −   Goal: the percentage of disk space used on each DataNode should be similar
    −   Usually run when new DataNodes are added
    −   Cluster stays online while the Rebalancer is active
    −   Rebalancer is throttled to avoid network congestion
    −   Command line tool
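
The Rebalancer's goal can be illustrated by comparing each DataNode's utilization with the cluster average. This sketch only classifies nodes and moves no data; the function classify_nodes and the 10% threshold are assumptions made for the example.

```python
def classify_nodes(used, capacity, threshold=0.10):
    """Compare each DataNode's disk utilization with the cluster average.

    used and capacity map DataNode name -> bytes; threshold is the
    tolerated deviation from the average (an assumed parameter).
    """
    average = sum(used.values()) / sum(capacity.values())
    over = sorted(n for n in used if used[n] / capacity[n] > average + threshold)
    under = sorted(n for n in used if used[n] / capacity[n] < average - threshold)
    return average, over, under


used = {"dn1": 80, "dn2": 70, "dn3": 0}        # dn3 was just added
capacity = {"dn1": 100, "dn2": 100, "dn3": 100}
average, over, under = classify_nodes(used, capacity)
print(average, over, under)                    # 0.5 ['dn1', 'dn2'] ['dn3']
```

A rebalancing run would then move blocks from the over-utilized nodes to the under-utilized ones until all of them fall within the threshold.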




Read and Write in HDFS

[Figures: sequence of diagrams illustrating reads and writes in HDFS (slides 16 to 33)]

Command Line interface

– HDFS has a UNIX-like command line interface (CLI), and we access HDFS
 through this CLI.
– HDFS can also be accessed through a web interface, but that is limited to viewing
 HDFS contents.
– We will go through this part in detail in the practical sessions.

– Below are a few examples of CLI-based operations

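# Create a directory in HDFS
hadoop fs -mkdir /input
# Copy a local file into HDFS
hadoop fs -copyFromLocal input/docs/tweets.txt /input/tweets.txt
# -put does the same copy; it fails if the destination file already exists
hadoop fs -put input/docs/tweets.txt /input/tweets.txt
# List the directory contents
hadoop fs -ls /input
# Recursively delete the directory (newer releases use: hadoop fs -rm -r /input)
hadoop fs -rmr /input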
Resources
●   Apache Hadoop Wiki
●   Brad Hedlund's website (special thanks for making HDFS easy to
    understand)
THANK YOU

      -by Rohit Kapa

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 

Hadoop HDFS by rohitkapa

  • 1. Hadoop Distributed File System (HDFS)
  • 2. Topics Covered: Big Data and Hadoop Introduction; HDFS Introduction; HDFS Definition; HDFS Components; Architecture of HDFS; Understanding the File System; Read and Write in HDFS; HDFS CLI; Summary
  • 3. What is Big Data? Big Data refers to datasets that grow so large that they are difficult to capture, store, manage, share, analyze, and visualize with typical database software tools. Big Data usually comes in complex, unstructured formats from many sources: web sites, social media, email, videos, data warehouses, and scientific applications. The four Vs that make data challenging enough to be classified as Big Data are Volume, Velocity, Variety, and Value.
  • 4. What is Hadoop? It is an Apache Software Foundation project: • Framework for running applications on large clusters • Modeled after Google’s MapReduce / GFS framework • Implemented in Java. A software platform that lets one easily write and run applications that process vast amounts of data. It includes: – MapReduce – offline computing engine – HDFS – Hadoop Distributed File System. Here's what makes it especially useful: Scalable: It can reliably store and process petabytes. Economical: It distributes the data and processing across clusters of commonly available computers (numbering in the thousands). Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located. Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks after failures.
  • 5. Why Hadoop? Handle partial hardware failures without going down: − If a machine fails, we should be able to switch over to a standby machine − If a disk fails, use RAID or a mirror disk. Fault tolerance: • Regular backups • Logging • Mirror database at a different site. Elasticity of resources: • Increase capacity without restarting the whole system (PureScale) • More computing power should equal faster processing. Result consistency: • Answers should be consistent (independent of something failing) and returned in a reasonable amount of time.
  • 6. HDFS – Introduction ● Hadoop Distributed File System (HDFS) ● Based on the Google File System ● The Google File System was derived from the 'bigfiles' paper authored by Larry and Sergey at Stanford ● Hadoop provides a distributed filesystem and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm ● The interface to HDFS is patterned after the Unix filesystem ● Other distributed file system types are PVFS, Lustre, GFS, KDFS, FTP, Amazon S3
  • 7. HDFS – Goals and Assumptions ● Hardware Failure ● Streaming Data Access ● Large Data Sets ● Simple Coherency Model ● “Moving Computation is Cheaper than Moving Data” ● Portability Across Heterogeneous Hardware and Software Platforms
  • 8. HDFS Definition – The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. – HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. – HDFS is the primary storage system used by Hadoop applications. – HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. – HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS consists of the following components (daemons): • HDFS master: “NameNode” • HDFS workers: “DataNodes” • Secondary NameNode
  • 9. HDFS Components NameNode: The NameNode, a master server, manages the file system namespace and regulates access to files by clients. • Metadata in memory – The entire metadata is held in main memory • Types of metadata – List of files – List of blocks for each file – List of DataNodes for each block – File attributes, e.g. creation time, replication factor • A transaction log – Records file creations, file deletions, etc. DataNode: DataNodes, one per node in the cluster, manage the storage attached to the nodes that they run on. • A block server – Stores data in the local file system (e.g. ext3) – Stores metadata of a block (e.g. CRC) – Serves data and metadata to clients – Block report – Periodically sends a report of all existing blocks to the NameNode • Facilitates pipelining of data – Forwards data to other specified DataNodes
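The metadata kinds listed on slide 9 can be pictured with a toy in-memory model. This is only a sketch: the class name `NameNodeMetadata`, its methods, and the DataNode names `dn1`/`dn2` are hypothetical and do not correspond to Hadoop's actual classes.

```python
from dataclasses import dataclass, field


@dataclass
class NameNodeMetadata:
    """Toy in-memory namespace (hypothetical, not Hadoop's real structures)."""

    # file path -> ordered list of block IDs ("list of blocks for each file")
    file_blocks: dict[str, list[int]] = field(default_factory=dict)
    # block ID -> DataNodes holding a replica ("list of DataNodes for each block")
    block_locations: dict[int, set[str]] = field(default_factory=dict)
    # append-only record of namespace changes ("transaction log")
    transaction_log: list[str] = field(default_factory=list)

    def create_file(self, path: str, block_ids: list[int]) -> None:
        self.file_blocks[path] = list(block_ids)
        self.transaction_log.append(f"CREATE {path}")

    def delete_file(self, path: str) -> None:
        del self.file_blocks[path]
        self.transaction_log.append(f"DELETE {path}")

    def process_block_report(self, datanode: str, block_ids: list[int]) -> None:
        """Replace what is known about one DataNode with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path: str) -> list[tuple[int, list[str]]]:
        """Return (block ID, sorted DataNode names) for each block of a file."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, set())))
            for block_id in self.file_blocks[path]
        ]


nn = NameNodeMetadata()
nn.create_file("/input/tweets.txt", [1, 2])
nn.process_block_report("dn1", [1, 2])
nn.process_block_report("dn2", [1])
print(nn.locate("/input/tweets.txt"))
```

In this model the block locations are rebuilt entirely from block reports, which mirrors the slide's point that DataNodes periodically tell the NameNode which blocks they hold.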
  • 10. HDFS Components Secondary NameNode – Not used as a hot standby or mirror node. A failover node is planned for a future release. – Will be renamed in 0.21 to CheckpointNode – The backup NameNode periodically wakes up, processes a checkpoint, and updates the NameNode – Memory requirements are the same as for the NameNode (big) – Typically on a separate machine in a large cluster (> 10 nodes) – Its directory is the same as the NameNode's, except that it keeps the previous checkpoint version in addition to the current one. – It can be used to restore a failed NameNode (just copy the current directory to the new NameNode)
  • 11. HDFS Block – Large data sets are divided into smaller chunks (blocks) for easy processing. – Default block size is 64 MB – Can be increased, e.g. to 128 MB – Reason for this default size and how it affects HDFS
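To make the block-size arithmetic concrete, the sketch below splits a file size into block sizes for the 64 MB and 128 MB settings from slide 11. The function name `split_into_blocks` is hypothetical; it is plain arithmetic, not a Hadoop API.

```python
MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = 64 * MB) -> list[int]:
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    # The last block holds only the leftover bytes, if there are any.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


print(len(split_into_blocks(1024 * MB)))                        # 16
print(len(split_into_blocks(1024 * MB, block_size=128 * MB)))   # 8
print([size // MB for size in split_into_blocks(200 * MB)])     # [64, 64, 64, 8]
```

Doubling the block size halves the number of blocks for the same file, which means fewer per-block entries in the NameNode's in-memory metadata described on slide 9.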
  • 14. Understanding the File system Block placement • Current Strategy − One replica on local node − Second replica on a remote rack − Third replica on same remote rack − Additional replicas are randomly placed • Clients read from nearest replica Data Correctness • Use Checksums to validate data − Use CRC32 • File Creation − Client computes checksum per 512 byte − DataNode stores the checksum • File access − Client retrieves the data and checksum from DataNode − If Validation fails, Client tries other replicas 14
  • 15. Understanding the File system Data pipelining − Client retrieves a list of DataNodes on which to place replicas of a block − Client writes block to the first DataNode − The first DataNode forwards the data to the next DataNode in the Pipeline − When all replicas are written, the Client moves on to write the next block in file Rebalancer – Goal: % of disk occupied on Datanodes should be similar − Usually run when new Datanodes are added − Cluster is online when Rebalancer is active − Rebalancer is throttled to avoid network congestion − Command line tool 15
  • 16–33. Read and Write in HDFS (diagram-only slides; no text in the transcript)
  • 34. Command Line Interface – HDFS has a UNIX-style command line interface, and we access HDFS through this CLI. – HDFS can also be accessed through a web interface, but that is limited to viewing HDFS contents. – We will go through this part in detail in the practical sessions. – Below are a few examples of CLI-based operations:
      hadoop fs -mkdir /input
      hadoop fs -copyFromLocal input/docs/tweets.txt /input/tweets.txt
      hadoop fs -put input/docs/tweets.txt /input/tweets.txt
      hadoop fs -ls /input
      hadoop fs -rmr /input
  • 35. Resources ● Apache Hadoop Wiki ● Brad Hedlund's website (special thanks for making HDFS easy to understand)
  • 36. THANK YOU -by Rohit Kapa