1. Big Data Analytics With
Hadoop
Big Data & IoT
Umair Shafique (03246441789)
Scholar MS Information Technology - University of Gujrat
2. What is Big Data?
• ‘Big Data’ is similar to ‘small data’, but bigger in size
• But being bigger, the data requires different approaches: new techniques,
tools, and architectures
• The aim is to solve new problems, or old problems in a better way
• Big Data generates value from the storage and processing of very
large quantities of digital information that cannot be analyzed with
traditional computing techniques
3. Big data analytics
• Big data analytics is the often complex process of examining big
data to uncover information such as hidden patterns, correlations,
market trends and customer preferences that can help organizations
make informed business decisions.
• On a broad scale, data analytics technologies and techniques give
organizations a way to analyze data sets and gather new information.
4. Why is big data analytics important?
Big data analytics helps organizations harness their data and use it to
identify new opportunities. That, in turn, leads to smarter business
moves, more efficient operations, higher profits and happier
customers. Businesses that use big data with advanced analytics gain
value in many ways, such as:
Reducing cost: Big data technologies like cloud-based analytics can
significantly reduce costs when it comes to storing large amounts of
data (for example, a data lake). Plus, big data analytics helps
organizations find more efficient ways of doing business.
5. Cont…
Making faster, better decisions:
The speed of in-memory analytics – combined with the ability to analyze new
sources of data, such as streaming data from IoT – helps businesses analyze
information immediately and make fast, informed decisions.
Developing and marketing new products and services:
Being able to gauge customer needs and customer satisfaction through
analytics empowers businesses to give customers what they want, when they
want it. With big data analytics, more companies have an opportunity to
develop innovative new products to meet customers’ changing needs.
6. Hadoop
• Hadoop is an open source framework that is used to efficiently store
and process large datasets ranging in size from gigabytes to petabytes
of data.
• Instead of using one large computer to store and process the data,
Hadoop allows clustering multiple computers to analyze massive
datasets in parallel more quickly.
• It is a flexible and highly available architecture for large-scale
computation and data processing on a network of commodity hardware.
8. Features of Hadoop
• Hadoop is Open Source
• Hadoop cluster is Highly Scalable
• Hadoop provides Fault Tolerance
• Hadoop provides High Availability
• Hadoop is very Cost-Effective
• Hadoop is Faster in Data Processing
• Hadoop provides Flexibility
10. 1. MapReduce
1. Processing/computation layer (MapReduce)
• A method for distributing a task across multiple nodes
• Each node processes the data stored on that node
• Consists of two phases that the developer writes:
i. Map
ii. Reduce
• Between Map and Reduce comes the shuffle-and-sort phase.
11. MapReduce
Map:
• The Map function always runs first and is typically used to
filter, transform, or parse the data. The output from Map
becomes the input to Reduce.
Reduce:
• The Reduce function is optional and is normally used to
summarize the data from the Map function.
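To make the two phases concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API; the class names and whitespace tokenization are illustrative assumptions, not something taken from these slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: parse each input line and emit a (word, 1) pair per occurrence.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        ctx.write(word, ONE);
      }
    }
  }
}

// Reduce phase: after shuffle-and-sort groups the pairs by word,
// sum the counts for each word.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    ctx.write(key, new IntWritable(sum));
  }
}

The shuffle-and-sort step between the two functions is what delivers all (word, 1) pairs for the same word to a single reduce() call.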
12. 2. HDFS (Hadoop Distributed File System)
2. Storage layer (Hadoop Distributed File System)
• The Hadoop Distributed File System (HDFS) is Hadoop’s storage layer. Housed on
multiple servers, data is divided into blocks based on file size. These blocks are
then randomly distributed and stored across the slave machines.
• HDFS in the Hadoop architecture divides large data into different blocks. Each
block holds up to 128 MB of data and is replicated three times by default.
Replication operates under two rules:
i. Two identical blocks cannot be placed on the same DataNode
ii. When a cluster is rack aware, all the replicas of a block cannot be placed on the same rack
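As a hedged illustration of how these blocks surface to a client, HDFS's Java API can report where each block of a file lives; a minimal sketch, with a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    // Configuration picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FileStatus st = fs.getFileStatus(new Path("/data/sample.txt")); // hypothetical path
    // One BlockLocation per block; each lists the DataNodes holding a replica.
    for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println("offset=" + loc.getOffset()
          + " length=" + loc.getLength()
          + " hosts=" + String.join(",", loc.getHosts()));
    }
  }
}

On a healthy cluster with the default replication factor, each printed block should list three distinct hosts, reflecting rule (i) above.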
15. Components of HDFS
NameNode:
• The NameNode works as the master in a Hadoop cluster and guides the
DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e. the
data about the data. The metadata can be the transaction logs that keep track of the
user’s activity in the Hadoop cluster.
• The metadata can also be the name of a file, its size, and the information about the
location (block number, block IDs) on the DataNodes, which the NameNode stores to find
the closest DataNode for faster communication. The NameNode instructs the DataNodes
with operations such as create, delete, replicate, etc.
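For example, a simple directory listing is answered entirely from this metadata, without contacting any DataNode; a minimal sketch, with a hypothetical /data directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListMetadata {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Each FileStatus comes from the NameNode's in-memory metadata.
    for (FileStatus st : fs.listStatus(new Path("/data"))) {
      System.out.println(st.getPath().getName()
          + " size=" + st.getLen()
          + " replication=" + st.getReplication()
          + " blockSize=" + st.getBlockSize());
    }
  }
}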
16. Components of HDFS
• DataNode: DataNodes work as slaves. DataNodes are mainly
utilized for storing the data in a Hadoop cluster; the number of
DataNodes can range from 1 to 500 or even more than that. The more
DataNodes there are, the more data the Hadoop cluster will be able to
store. It is therefore advised that each DataNode should have high storage
capacity, so it can store a large number of file blocks.
17. File Block In HDFS
Data in HDFS is always stored in terms of blocks: a single file is divided into
multiple blocks of size 128 MB, which is the default and can also be changed
manually. For example, a 400 MB file is stored as three 128 MB blocks plus one
16 MB block.
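Since the block size can be changed manually, here is a hedged sketch of doing so from the Java client side via the standard dfs.blocksize property; the output path is a hypothetical example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Override the default 128 MB block size for files created by this client.
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB
    FileSystem fs = FileSystem.get(conf);
    // Anything written to this stream is split into 256 MB blocks.
    FSDataOutputStream out = fs.create(new Path("/data/large-output.bin"));
    out.close();
  }
}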
18. Replication In HDFS
• Replication ensures the availability of the data. Replication means making a
copy of something, and the number of times you make a copy of that particular
thing is expressed as its Replication Factor.
• As we saw with file blocks, HDFS stores the data in the form of various
blocks, and at the same time Hadoop is also configured to make copies of
those file blocks.
• By default, the Replication Factor for Hadoop is set to 3, which can be
configured; with a factor of 3, for example, a 128 MB block occupies 384 MB
of cluster storage in total.
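Because the replication factor is configurable, a hedged sketch of adjusting it from Java follows; the file path is a hypothetical example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default replication factor for new files created by this client.
    conf.setInt("dfs.replication", 2);
    FileSystem fs = FileSystem.get(conf);
    // Change the replication factor of an existing file to 4.
    fs.setReplication(new Path("/data/sample.txt"), (short) 4);
  }
}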
19. Read Operation In HDFS
• A data read request is served by HDFS through the NameNode, which supplies
the block locations, and the DataNodes, which stream the block contents
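A hedged sketch of the client side of a read, again with a hypothetical path; open() consults the NameNode for block locations, and the returned stream reads each block from a DataNode:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line); // blocks are fetched from DataNodes as needed
      }
    }
  }
}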
21. 3. Hadoop YARN
• Hadoop YARN (Yet Another Resource Negotiator) is the cluster resource
management layer of Hadoop and is responsible for resource allocation and job
scheduling
• The purpose of the job scheduler is to divide a big task into small jobs so that
each job can be assigned to various slaves in a Hadoop cluster and processing can be
maximized
• The job scheduler also keeps track of which job is important, which job has higher
priority, the dependencies between jobs, and all the other information such as job
timing, etc.
• The ResourceManager manages all the resources that are made available for
running the Hadoop cluster
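To show where YARN enters the picture, here is a hedged sketch of a driver that submits the word-count classes from the earlier MapReduce sketch; on a YARN cluster, waitForCompletion() hands the job to the ResourceManager for scheduling:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);   // from the MapReduce sketch above
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    // Submits the job to the cluster; YARN's ResourceManager schedules it
    // and an ApplicationMaster (one per application) manages its tasks.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}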
22. Elements of YARN
The elements of YARN include:
1. ResourceManager (one per cluster)
2. ApplicationMaster (one per application)
3. NodeManagers (one per node)
23. Elements of YARN
1. ResourceManager
• The ResourceManager manages resource allocation in the cluster and is
responsible for tracking how many resources are available in the cluster and each
NodeManager’s contribution. It has two main components:
i. Scheduler: allocates resources to the various running applications and schedules
resources based on the requirements of each application; it does not monitor or
track the status of the applications
ii. Application Manager: accepts job submissions from the client and monitors
and restarts ApplicationMasters in case of failure
24. Elements of YARN
2. Application Master
• Application Master manages the resource needs of individual applications and
interacts with the scheduler to acquire the required resources. It connects with
the node manager to execute and monitor tasks.
3. Node Manager
• Node Manager tracks running jobs and sends signals (or heartbeats) to the
resource manager to relay the status of a node. It also monitors each container’s
resource utilization.
25. 4. Hadoop common or Common Utilities
• Hadoop Common, or the common utilities, is nothing but the Java libraries and
Java files that all the other components present in a Hadoop cluster need.
These utilities are used by HDFS, YARN, and MapReduce for running the cluster.
Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so
it needs to be handled automatically in software by the Hadoop framework.
27. Advantages of Hadoop
1. Varied Data Sources
• Hadoop accepts a variety of data. Data can come from a range of sources, such as email conversations and
social media, and can be structured or unstructured in form. Hadoop can derive value from diverse data and
can accept data in text files, XML files, images, CSV files, etc.
2. Cost-effective
• Hadoop is an economical solution, as it uses a cluster of commodity hardware to store data. Commodity
hardware consists of cheap machines, so the cost of adding nodes to the framework is not very high. In
Hadoop 3.0 we have only 50% storage overhead, as opposed to 200% in Hadoop 2.x; this requires fewer
machines to store data, as the redundant data is decreased significantly.
3. Performance
• Hadoop, with its distributed processing and distributed storage architecture, processes huge
amounts of data at high speed. In 2008, Hadoop even defeated supercomputers, ranking as the fastest
system to sort a terabyte of data. It divides the input data file into a number of blocks and stores
data in these blocks over several nodes. It also divides the task that the user submits into various
sub-tasks, which are assigned to worker nodes containing the required data, and these sub-tasks run
in parallel, thereby improving performance.
28. Advantages of Hadoop
4. Fault-Tolerant
• In Hadoop 3.0, fault tolerance is provided by erasure coding. For example, 6 data blocks produce 3 parity
blocks using the erasure coding technique, so HDFS stores a total of these 9 blocks; that is a 1.5x storage
overhead, versus 3x with plain replication. In the event of the failure of any node, the affected data
blocks can be recovered using these parity blocks and the remaining data blocks.
5. Highly Available
• In Hadoop 2.x, the HDFS architecture has a single active NameNode and a single standby NameNode, so if a
NameNode goes down, we have the standby NameNode to count on. But Hadoop 3.0 supports multiple standby
NameNodes, making the system even more highly available, as it can continue functioning even if two or more
NameNodes crash.
6. Low Network Traffic
• In Hadoop, each job submitted by the user is split into a number of independent sub-tasks, and these sub-
tasks are assigned to the data nodes, thereby moving a small amount of code to the data rather than moving
huge data to the code, which leads to low network traffic.
7. High Throughput
• Throughput means work done per unit time. Hadoop stores data in a distributed fashion, which allows
distributed processing to be used with ease. A given job gets divided into small jobs that work on chunks of
data in parallel, thereby giving high throughput.
29. Advantages of Hadoop
8. Open Source
• Hadoop is an open-source technology, i.e. its source code is freely available. We can modify the source
code to suit a specific requirement.
9. Scalable
• Hadoop works on the principle of horizontal scalability, i.e. we add entire machines to the cluster of
nodes rather than changing the configuration of a machine by adding RAM, disks, and so on (which is known
as vertical scalability). Nodes can be added to a Hadoop cluster on the fly, making it a scalable framework.
10. Ease of use
• The Hadoop framework takes care of parallel processing; MapReduce programmers do not need to worry about
achieving distributed processing, as it is done at the backend automatically.
11. Compatibility
• Most of the emerging big data technologies, such as Spark and Flink, are compatible with Hadoop. Their
processing engines work over Hadoop as a backend, i.e. we use Hadoop as the data storage platform for them.
12. Multiple Languages Supported
• Developers can code using many languages on Hadoop like C, C++, Perl, Python, Ruby, and Groovy.
30. Disadvantages of Hadoop
1. Issue With Small Files
• Hadoop is suitable for a small number of large files, but when it comes to applications
that deal with a large number of small files, Hadoop fails. A small file is nothing
but a file that is significantly smaller than Hadoop’s block size (128 MB by default,
commonly configured as 256 MB). Such a large number of small files overloads the
NameNode, which stores the namespace for the whole system, and makes it difficult for
Hadoop to function.
2. Vulnerable By Nature
• Hadoop is written in Java, which is a widely used programming language, hence it is more
easily exploited by cyber criminals, which makes Hadoop vulnerable to security breaches.
3. Processing Overhead
• In Hadoop, the data is read from the disk and written to the disk, which makes read/write
operations very expensive when we are dealing with terabytes and petabytes of data. Hadoop
cannot do in-memory calculations, hence it incurs processing overhead.
31. Disadvantages of Hadoop
4. Supports Only Batch Processing
• At its core, Hadoop has a batch processing engine, which is not efficient for stream
processing. It cannot produce output in real time with low latency; it works only on
data that is collected and stored in a file in advance of processing.
5. Iterative Processing
• Hadoop cannot do iterative processing by itself. Machine learning and other iterative
processing have a cyclic data flow, whereas Hadoop moves data through a chain of stages
where the output of one stage becomes the input of the next.
6. Security
• For security, Hadoop uses Kerberos authentication, which is hard to manage. It is
missing encryption at the storage and network levels, which is a major point of
concern.