Big Data Analytics With
Hadoop
Big Data & IoT
Umair Shafique (03246441789)
Scholar MS Information Technology - University of Gujrat
What is Big data?
• ‘Big Data’ is similar to ‘small data’, but bigger in size
• Having bigger data, however, requires different approaches: techniques, tools and architecture
• The aim is to solve new problems, or old problems in a better way
• Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques
Big data analytics
• Big data analytics is the often complex process of examining big
data to uncover information such as hidden patterns, correlations,
market trends and customer preferences that can help organizations
make informed business decisions.
• On a broad scale, data analytics technologies and techniques give
organizations a way to analyze data sets and gather new information.
Why is big data analytics important?
Big data analytics helps organizations harness their data and use it to
identify new opportunities. That, in turn, leads to smarter business
moves, more efficient operations, higher profits and happier
customers. Businesses that use big data with advanced analytics gain
value in many ways, such as:
Reducing cost. Big data technologies like cloud-based analytics can
significantly reduce costs when it comes to storing large amounts of
data (for example, a data lake). Plus, big data analytics helps
organizations find more efficient ways of doing business.
Cont…
Making faster, better decisions:
The speed of in-memory analytics – combined with the ability to analyze new sources of data, such as streaming data from IoT – helps businesses analyze information immediately and make fast, informed decisions.
Developing and marketing new products and services:
Being able to gauge customer needs and customer satisfaction through analytics empowers businesses to give customers what they want, when they want it. With big data analytics, more companies have an opportunity to develop innovative new products to meet customers’ changing needs.
Hadoop
• Hadoop is an open source framework that is used to efficiently store
and process large datasets ranging in size from gigabytes to petabytes
of data.
• Instead of using one large computer to store and process the data,
Hadoop allows clustering multiple computers to analyze massive
datasets in parallel more quickly.
• It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
Features of Hadoop
• Hadoop is Open Source
• Hadoop cluster is Highly Scalable
• Hadoop provides Fault Tolerance
• Hadoop provides High Availability
• Hadoop is very Cost-Effective
• Hadoop is Faster in Data Processing
• Hadoop provides Feasibility
Hadoop Architecture
• The Hadoop architecture mainly consists of four components: MapReduce, HDFS, YARN, and Hadoop Common.
1. Map Reduce
1. Processing/Computation layer (MapReduce)
A method for distributing a task across multiple nodes
Each node processes data stored on that node
Consists of two phases written by the developer:
i. Map
ii. Reduce
In between Map and Reduce come the shuffle and sort steps.
Map Reduce
Map:
• The Map function always runs first; it is typically used to filter, transform, or parse the data. The output from Map becomes the input to Reduce.
Reduce:
• The Reduce function is optional; it is normally used to summarize the data coming from the Map function.
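To make the two phases concrete, below is the classic WordCount job written against Hadoop’s Java MapReduce API. It is a minimal sketch, not part of the original slides: the Mapper emits (word, 1) pairs, the framework shuffles and sorts them by key, and the Reducer sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: parse each input line and emit a (word, 1) pair per token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }
  // Reduce phase: after shuffle and sort, sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that the Reducer is reused as a combiner here, so counts are pre-aggregated on each node before the shuffle, reducing network traffic.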
2. HDFS (Hadoop Distributed File System)
2. Storage layer (Hadoop Distributed File System)
• The Hadoop Distributed File System (HDFS) is Hadoop’s storage layer. Housed on multiple servers, data is divided into blocks based on file size. These blocks are then distributed and stored across the slave machines.
• HDFS in the Hadoop architecture divides large data into different blocks. Each block contains 128 MB of data and is replicated three times by default. Replication operates under two rules:
i. Two identical blocks cannot be placed on the same DataNode
ii. When a cluster is rack aware, all the replicas of a block cannot be placed on the same rack
Example (HDFS)
Components of HDFS
There are two components of HDFS:
1) NameNode (master node)
2) DataNode (slave node)
Components of HDFS
NameNode:
• The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode mainly stores the metadata, i.e. the data about the data. Metadata can be the transaction logs that keep track of user activity in the Hadoop cluster.
• Metadata can also be the name of a file, its size, and the information about the location (block number, block IDs) of the DataNodes, which the NameNode stores to find the closest DataNode for faster communication. The NameNode instructs the DataNodes with operations like delete, create, replicate, etc.
Components of HDFS
• DataNode: DataNodes work as slaves. DataNodes are mainly used for storing data in a Hadoop cluster; the number of DataNodes can range from one to 500 or even more. The more DataNodes the cluster has, the more data it can store, so it is advisable for each DataNode to have a high storage capacity to hold a large number of file blocks.
File Block In HDFS
Data in HDFS is always stored in terms of blocks: a single file is divided into multiple blocks of 128 MB each, which is the default size and can also be changed manually.
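As an illustration, the block size of a stored file can be inspected through Hadoop’s Java FileSystem API. This is a sketch, not from the original slides; the file path is hypothetical, and the cluster-wide default normally lives in hdfs-site.xml as the dfs.blocksize property.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeInfo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Override the default block size (128 MB = 134217728 bytes).
    // This only affects files created after the change.
    conf.setLong("dfs.blocksize", 134217728L);
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical path used for illustration.
    FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));
    System.out.println("Block size of file: " + status.getBlockSize() + " bytes");
  }
}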
Replication In HDFS
• Replication ensures the availability of the data. Replication means making a copy of something, and the number of copies made of that particular thing is its replication factor.
• As we have seen with file blocks, HDFS stores the data in the form of various blocks, and at the same time Hadoop is also configured to make copies of those file blocks.
• By default, the replication factor in Hadoop is set to 3, and it can be configured.
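As a sketch of how this looks from the client side (again with a hypothetical path), the replication factor can be read per file and even changed after the file is written; the cluster-wide default is the dfs.replication property in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/example.txt"); // hypothetical path
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Current replication factor: " + status.getReplication());
    // Ask HDFS to keep 4 copies of every block of this file from now on.
    fs.setReplication(file, (short) 4);
  }
}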
Read Operation In HDFS
• A data read request is served by HDFS, the NameNode, and the DataNodes: the NameNode supplies the block locations, and the DataNodes stream the actual data.
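A minimal read sketch using the Java client (the path is hypothetical): the client asks the NameNode where the blocks live and then streams them directly from the DataNodes.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // open() contacts the NameNode for block locations; the returned
    // stream then reads block data directly from the DataNodes.
    try (InputStream in = fs.open(new Path("/data/example.txt"))) { // hypothetical path
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}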
Write Operation In HDFS
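The corresponding write sketch, again with a hypothetical path: the NameNode allocates blocks, and the client writes each block through a pipeline of DataNodes, one per replica.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // create() asks the NameNode to allocate blocks; the data is then
    // pipelined to as many DataNodes as the replication factor requires.
    try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) { // hypothetical path
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }
  }
}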
3.Hadoop YARN
• Hadoop YARN (Yet Another Resource Negotiator) is the cluster resource management layer of Hadoop and is responsible for resource allocation and job scheduling.
• The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized.
• The job scheduler also keeps track of which job is important, which job has higher priority, dependencies between jobs, and other information such as job timing.
• The resource manager’s role is to manage all the resources made available for running the Hadoop cluster.
Elements of YARN
The elements of YARN include:
1. ResourceManager (one per cluster)
2. ApplicationMaster (one per application)
3. NodeManagers (one per node)
Elements of YARN
1. Resource Manager
• Resource Manager manages the resource allocation in the cluster and is
responsible for tracking how many resources are available in the cluster and each
node manager’s contribution. It has two main components:
i. Scheduler: Allocating resources to various running applications and scheduling
resources based on the requirements of the application; it doesn’t monitor or
track the status of the applications
ii. Application Manager: Accepting job submissions from the client, and monitoring and restarting ApplicationMasters in case of failure
Elements of YARN
2. Application Master
• Application Master manages the resource needs of individual applications and
interacts with the scheduler to acquire the required resources. It connects with
the node manager to execute and monitor tasks.
3. Node Manager
• Node Manager tracks running jobs and sends signals (or heartbeats) to the
resource manager to relay the status of a node. It also monitors each container’s
resource utilization.
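To see these elements from a client’s point of view, here is a sketch using the YarnClient API to ask the ResourceManager for a report on every NodeManager it is tracking; it is an illustration, not from the original slides.

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodes {
  public static void main(String[] args) throws Exception {
    // Connect to the ResourceManager configured in yarn-site.xml.
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();
    // One NodeReport per NodeManager: its id, state, and running containers.
    for (NodeReport node : yarn.getNodeReports()) {
      System.out.println(node.getNodeId() + "  state=" + node.getNodeState()
          + "  containers=" + node.getNumContainers());
    }
    yarn.stop();
  }
}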
4. Hadoop Common (Common Utilities)
• Hadoop Common, or the common utilities, is nothing but the Java libraries and Java files needed by all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce to run the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so failures need to be handled automatically, in software, by the Hadoop framework.
Advantages of Hadoop
1. Varied Data Sources
• Hadoop accepts a variety of data. Data can come from a range of sources, such as email conversations and social media, and can be structured or unstructured. Hadoop can derive value from diverse data and can accept data as text files, XML files, images, CSV files, etc.
2. Cost-effective
• Hadoop is an economical solution, as it uses a cluster of commodity hardware to store data. Commodity hardware consists of cheap machines, so the cost of adding nodes to the framework is not very high. Hadoop 3.0 has only 50% storage overhead, as opposed to 200% in Hadoop 2.x, so fewer machines are needed to store the data because the redundant data has decreased significantly.
3. Performance
• With its distributed processing and distributed storage architecture, Hadoop processes huge amounts of data at high speed. In 2008, Hadoop even beat supercomputers to become the fastest system to sort a terabyte of data. It divides the input data file into a number of blocks and stores the data in these blocks over several nodes. It also divides the task that the user submits into various sub-tasks, assigns them to worker nodes containing the required data, and runs these sub-tasks in parallel, thereby improving performance.
Advantages of Hadoop
4. Fault-Tolerant
• In Hadoop 3.0, fault tolerance is provided by erasure coding. For example, 6 data blocks produce 3 parity blocks under the erasure coding technique, so HDFS stores a total of these 9 blocks. If any node fails, the affected data block can be recovered using the parity blocks and the remaining data blocks.
5. Highly Available
• In Hadoop 2.x, the HDFS architecture has a single active NameNode and a single standby NameNode, so if a NameNode goes down we have a standby NameNode to count on. Hadoop 3.0 supports multiple standby NameNodes, making the system even more highly available, as it can continue functioning even if two or more NameNodes crash.
6. Low Network Traffic
• In Hadoop, each job submitted by the user is split into a number of independent sub-tasks, and these sub-tasks are assigned to the DataNodes, thereby moving a small amount of code to the data rather than moving huge data to the code, which leads to low network traffic.
7. High Throughput
• Throughput means work done per unit time. Hadoop stores data in a distributed fashion, which allows distributed processing to be used with ease. A given job gets divided into small jobs that work on chunks of data in parallel, thereby giving high throughput.
Advantages of Hadoop
8. Open Source
• Hadoop is an open source technology, i.e. its source code is freely available. We can modify the source code to suit a specific requirement.
9. Scalable
• Hadoop works on the principle of horizontal scalability: we add entire machines to the cluster of nodes rather than changing the configuration of a machine by adding RAM, disk, and so on (which is known as vertical scalability). Nodes can be added to a Hadoop cluster on the fly, making it a scalable framework.
10. Ease of use
• The Hadoop framework takes care of parallel processing; MapReduce programmers do not need to handle distributed processing themselves, as it is done automatically at the backend.
11. Compatibility
• Most of the emerging Big Data technologies, such as Spark and Flink, are compatible with Hadoop. They have processing engines that work over Hadoop as a backend, i.e. Hadoop serves as the data storage platform for them.
12. Multiple Languages Supported
• Developers can code on Hadoop using many languages, such as C, C++, Perl, Python, Ruby, and Groovy.
Disadvantages of Hadoop
1. Issue With Small Files
• Hadoop is suitable for a small number of large files, but it fails when an application deals with a large number of small files. A small file is nothing but a file significantly smaller than Hadoop’s block size, which is 128 MB by default (sometimes configured as 256 MB). A large number of small files overloads the NameNode, since it stores the namespace for the system, and makes it difficult for Hadoop to function.
2. Vulnerable By Nature
• Hadoop is written in Java, a widely used programming language that is easily exploited by cyber criminals, which makes Hadoop vulnerable to security breaches.
3. Processing Overhead
• In Hadoop, the data is read from disk and written to disk, which makes read/write operations very expensive when dealing with terabytes and petabytes of data. Hadoop cannot do in-memory calculations, hence it incurs processing overhead.
Disadvantages of Hadoop
4. Supports Only Batch Processing
• At its core, Hadoop has a batch processing engine, which is not efficient at stream processing. It cannot produce output in real time with low latency; it only works on data that is collected and stored in files in advance of processing.
5. Iterative Processing
• Hadoop cannot do iterative processing by itself. Machine learning and other iterative processing have a cyclic data flow, whereas Hadoop has data flowing in a chain of stages where the output of one stage becomes the input of the next.
6. Security
• For security, Hadoop uses Kerberos authentication, which is hard to manage. It is also missing encryption at the storage and network levels, which is a major point of concern.
References
• https://www.sas.com/en_us/insights/analytics/big-data-analytics.html
• https://www.simplilearn.com/tutorials/hadoop-tutorial/hadoop-architecture
• https://data-flair.training/blogs/features-of-hadoop-and-design-principles/
• https://www.geeksforgeeks.org/hadoop-architecture/
• https://www.guru99.com/learn-hdfs-a-beginners-guide.html
• https://data-flair.training/blogs/advantages-and-disadvantages-of-hadoop/