Introduction to Hadoop
Big Data Overview
• Big Data is a collection of data that is huge in volume yet grows exponentially with time.
• It is data of such size and complexity that none of the traditional data management tools can store or process it efficiently.
• It cannot be processed using traditional computing techniques.
• In short, big data is simply data, but of huge size.
The 4 V’s in Big Data
• Volume of Big Data
• The volume of data refers to the size of the data sets that need to be analyzed and processed, which are now frequently measured in terabytes and petabytes. In other words, the data sets in Big Data are too large to process on a regular laptop or desktop machine. An example of a high-volume data set would be all credit card transactions made in Europe on a single day.
• Velocity of Big Data
• Velocity refers to the speed at which data is generated. Examples of data generated at high velocity are Twitter messages and Facebook posts.
• Variety of Big Data
• Variety makes Big Data really big. Big Data comes from a great variety of sources and generally falls into one of three types: structured, semi-structured and unstructured data. An example of a high-variety data set would be the CCTV audio and video files generated at various locations across a city.
• Veracity of Big Data
• Veracity refers to the quality of the data being analyzed. High-veracity data has many records that are valuable to analyze and that contribute in a meaningful way to the overall results. Low-veracity data, on the other hand, contains a high percentage of meaningless data; the non-valuable portion of these data sets is referred to as noise. An example of a high-veracity data set would be data from a medical experiment or trial.
Types of Big Data
• Structured data:- Structured data is data that has been predefined and formatted to a set structure before being placed in data storage. The best example of structured data is the relational database: the data has been formatted into precisely defined fields, such as credit card numbers or addresses, so that it can be easily queried with SQL.
• Semi-structured data:- Semi-structured data is a form
of structured data that does not obey the tabular structure
of data models associated with relational databases or other
forms of data tables, but nonetheless contains tags or other
markers to separate semantic elements and enforce
hierarchies of records and fields within the data.
• Unstructured data:-Unstructured data is data stored in its
native format and not processed until it is used, which is
known as schema-on-read. It comes in a myriad of file
formats, including email, social media posts, presentations,
chats, IoT sensor data, and satellite imagery.
How Big Data is generated
• The bulk of big data comes from three primary sources: social data, machine data and transactional data.
Apache Hadoop Framework
• Apache Hadoop is a collection of
open-source software utilities that
facilitates using a network of
many computers to solve
problems involving massive
amounts of data and
computation. It provides a
software framework for distributed
storage and processing of big
data using the MapReduce
programming model.
Core components of Hadoop
Hadoop Distributed File System (HDFS): the storage system for Hadoop, spread out over multiple machines as a means to reduce cost and increase reliability.
MapReduce engine: the processing framework that filters and sorts the input data in parallel map tasks and then aggregates the results in reduce tasks.
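To make the MapReduce model concrete, here is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API; the class names and the reuse of the reducer as a combiner are illustrative choices, not something fixed by the slides above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, such a job would typically be launched with something like: hadoop jar wordcount.jar WordCount /input /output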
Some Hadoop Users
Difference Between Hadoop and RDBMS
Cluster Modes in Hadoop
• Hadoop mainly works in 3 different modes:
1. Standalone Mode
2. Pseudo-distributed Mode
3. Fully-distributed Mode
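The mode is selected by configuration rather than by separate binaries: in standalone mode fs.defaultFS keeps its default value of file:///, so everything runs locally in a single JVM with no daemons, while the pseudo- and fully-distributed modes point it at an hdfs:// URI. A small illustrative Java check (the fs.defaultFS property is standard; the hostnames and port in the comments are just common conventions):

import org.apache.hadoop.conf.Configuration;

public class ModeCheck {
  public static void main(String[] args) {
    // Loads core-default.xml and core-site.xml from the classpath.
    Configuration conf = new Configuration();
    // "file:///"              => standalone mode: local filesystem, no daemons
    // "hdfs://localhost:9000" => pseudo-distributed: all daemons on one host
    // "hdfs://namenode:9000"  => fully-distributed: daemons spread over a cluster
    System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));
  }
}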
Hadoop Ecosystem
Introduction: The Hadoop Ecosystem is a platform, or suite, that provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e., HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements, and all of them work collectively to provide services such as ingestion, analysis, storage and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
•HDFS: Hadoop Distributed File System
•YARN: Yet Another Resource Negotiator
•MapReduce: Programming based Data Processing
•Spark: In-Memory data processing
•PIG, HIVE: Query based processing of data services
•HBase: NoSQL Database
•Mahout, Spark MLlib: Machine Learning algorithm libraries
•Solr, Lucene: Searching and Indexing
•ZooKeeper: Cluster management
•Oozie: Job Scheduling
HDFS Daemons and MapReduce Daemons
Hadoop Cluster Architecture
• A Hadoop cluster architecture consists of a data centre, racks, and the nodes that actually execute the jobs. The data centre consists of racks, and racks consist of nodes. A medium to large cluster has a two- or three-level architecture built with rack-mounted servers. Every rack of servers is interconnected through 1 gigabit Ethernet (1 GigE). Each rack-level switch in a Hadoop cluster is connected to a cluster-level switch; the cluster-level switches are in turn connected to other cluster-level switches or uplink to other switching infrastructure.
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is the
primary data storage system used
by Hadoop applications. It employs a NameNode and
DataNode architecture to implement a distributed file
system that provides high-performance access to data
across highly scalable Hadoop clusters.
• With the Hadoop Distributed File System, data is written once on the server and subsequently read and re-used many times thereafter.
• The NameNode also manages access to the files, including reads, writes, creates, deletes and replication of data blocks across the different DataNodes.
HDFS Components
• A Hadoop cluster consists of three types of nodes –
• Master Node – The master node in a Hadoop cluster is responsible for storing data in HDFS and for coordinating parallel computation on the stored data using MapReduce. The master node runs 3 daemons – NameNode, Secondary NameNode and JobTracker. The JobTracker monitors the parallel processing of data using MapReduce, while the NameNode handles the data storage function of HDFS. The NameNode keeps track of all the metadata on files, such as when a file was last accessed, which user is accessing it at the current time, and where in the cluster each file is saved. The Secondary NameNode keeps periodic checkpoints of the NameNode's metadata; it is not a hot standby.
• Slave/Worker Node – This component in a Hadoop cluster is responsible for storing data and performing computations. Every slave/worker node runs both a TaskTracker and a DataNode service to communicate with the master node. The DataNode service is subordinate to the NameNode, and the TaskTracker service is subordinate to the JobTracker.
• Client Nodes – A client node has Hadoop installed with all the required cluster configuration settings and is responsible for loading data into the cluster. The client node submits MapReduce jobs describing how the data needs to be processed, and retrieves the output once job processing is completed.
HDFS Architecture
• Hadoop follows a master-slave architecture for data storage and distributed data processing, using HDFS and MapReduce respectively. The master node for data storage in Hadoop HDFS is the NameNode, and the master node for parallel processing of data using Hadoop MapReduce is the JobTracker. The slave nodes in the Hadoop architecture are the other machines in the cluster, which store data and perform complex computations. Every slave node runs a TaskTracker daemon and a DataNode that synchronize their processes with the JobTracker and NameNode respectively. In a Hadoop architectural implementation the master and slave systems can be set up in the cloud or on-premises.
HDFS Read File
• Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object (which for HDFS is an instance of DistributedFileSystem).
• Step 2: Distributed File System( DFS) calls the name node, using remote
procedure calls (RPCs), to determine the locations of the first few blocks in the
file. For each block, the name node returns the addresses of the data nodes that
have a copy of that block. The DFS returns an FSDataInputStream to the client
for it to read data from. FSDataInputStream in turn wraps a DFSInputStream,
which manages the data node and name node I/O.
• Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, connects to the first (closest) data node for the first block in the file.
• Step 4: Data is streamed from the data node back to the client, which calls
read() repeatedly on the stream.
• Step 5: When the end of a block is reached, DFSInputStream closes the connection to the data node and finds the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.
• Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
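Seen from the client side, this whole sequence collapses into a few calls on the FileSystem API. A minimal sketch, assuming a pseudo-distributed cluster at hdfs://localhost:9000 and the example path used elsewhere in these slides:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Step 1: obtain the FileSystem object (a DistributedFileSystem for hdfs:// URIs).
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
    InputStream in = null;
    try {
      // Steps 2-3: open() returns an FSDataInputStream wrapping a DFSInputStream.
      in = fs.open(new Path("/new_file/test"));
      // Steps 4-5: repeated read()s stream block after block; here copied to stdout.
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      // Step 6: close the stream.
      IOUtils.closeStream(in);
    }
  }
}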
HDFS Write File
• Step 1: The client creates the file by calling create() on DistributedFileSystem (DFS).
• Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create it. If these checks pass, the name node makes a record of the new file; otherwise, the file can’t be created and the client is thrown an IOException. The DFS returns an FSDataOutputStream for the client to start writing data to.
• Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
• Step 4: Similarly, the second data node stores the packet and forwards it to the
third (and last) data node in the pipeline.
• Step 5: The DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the “ack queue”.
• Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
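The corresponding client-side write is equally compact; again a minimal sketch under the same assumptions (hdfs://localhost:9000, illustrative path and payload):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
    // Steps 1-2: create() asks the name node to record the new file and
    // returns an FSDataOutputStream (or throws IOException if the checks fail).
    FSDataOutputStream out = fs.create(new Path("/new_file/test"));
    try {
      // Steps 3-5: written bytes are split into packets and pushed down
      // the data node replication pipeline behind the scenes.
      out.writeUTF("hello hdfs");
    } finally {
      // Step 6: close() flushes the remaining packets and signals completion.
      out.close();
    }
  }
}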
Some Basic Hadoop Commands
• cat
HDFS Command that reads a file on HDFS and prints the content of that file to the
standard output.
Command: hdfs dfs -cat /new_file/test
• text
HDFS Command that takes a source file and outputs the file in text format.
Command: hdfs dfs -text /new_file/test
• copyFromLocal
HDFS Command to copy the file from a Local file system to HDFS.
Command: hdfs dfs -copyFromLocal /home/file/test /new_file
• copyToLocal
HDFS Command to copy the file from HDFS to Local File System.
Command: hdfs dfs -copyToLocal /file/test /home/file
• put
HDFS Command to copy single source or multiple sources from local file system to the destination file system.
Command: hdfs dfs -put /home/file/test /user
• get
HDFS Command to copy files from hdfs to the local file system.
Command: hdfs dfs -get /user/test /home/file
• count
HDFS Command to count the number of directories, files, and bytes under the paths that match the specified file pattern.
Command: hdfs dfs -count /user
• rm
HDFS Command to remove the file from HDFS.
Command: hdfs dfs -rm /new_file/test
• rm -r
HDFS Command to remove the entire directory and all of its content from HDFS.
Command: hdfs dfs -rm -r /new_file
• cp
HDFS Command to copy files from source to destination. This command allows multiple
sources as well, in which case the destination must be a directory.
Command: hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
• mv
HDFS Command to move files from source to destination. This command allows multiple
sources as well, in which case the destination needs to be a directory.
Command: hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2
• rmdir
HDFS Command to remove the directory.
Command: hdfs dfs -rmdir /user/hadoop
• help
HDFS Command that displays help for a given command, or for all commands if none is
specified.
Command: hdfs dfs -help
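Most of these shell commands have direct equivalents on the Java FileSystem API. The following sketch mirrors a few of them, reusing the illustrative paths above and the same assumed cluster address:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"),
                                   new Configuration());
    // copyFromLocal / put
    fs.copyFromLocalFile(new Path("/home/file/test"), new Path("/new_file"));
    // copyToLocal / get
    fs.copyToLocalFile(new Path("/file/test"), new Path("/home/file"));
    // mv
    fs.rename(new Path("/user/hadoop/file1"), new Path("/user/hadoop/file2"));
    // rm (recursive = false) and rm -r (recursive = true)
    fs.delete(new Path("/new_file/test"), false);
    fs.delete(new Path("/new_file"), true);
  }
}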