Virtualization - Cloud Computing
SWETA KUMARI BARNWAL
Module: 4
CLOUD COMPUTING
(Topics: VIRTUALIZATION: Basics of Virtualization, Types of Virtualizations, Implementation
Levels of Virtualization, Virtualization Structures, Tools and Mechanisms, Virtualization of CPU,
Memory, I/O Devices, Virtual Clusters and Resource management, Virtualization for Data-center
Automation, Introduction to MapReduce, GFS, HDFS, Hadoop, Framework.)
Virtualization in Cloud Computing
Virtualization is the "creation of a virtual (rather than actual) version of something, such as a
server, a desktop, a storage device, an operating system or network resources".
In other words, virtualization is a technique that allows a single physical instance
of a resource or an application to be shared among multiple customers and organizations. It does this by
assigning a logical name to a physical resource and providing a pointer to that physical resource
when demanded.
What is the concept behind the Virtualization?
Creation of a virtual machine over existing operating system and hardware is known as
Hardware Virtualization. A virtual machine provides an environment that is logically
separated from the underlying hardware.
The machine on which the virtual machine is created is known as the Host Machine, and
that virtual machine is referred to as the Guest Machine.
BENEFITS OF VIRTUALIZATION
1. More flexible and efficient allocation of resources.
2. Enhance development productivity.
3. It lowers the cost of IT infrastructure.
4. Remote access and rapid scalability.
5. High availability and disaster recovery.
6. Pay-per-use of the IT infrastructure on demand.
7. Enables running multiple operating systems.
Types of Virtualizations:
1. Hardware Virtualization.
2. Operating system Virtualization.
3. Server Virtualization.
4. Storage Virtualization.
1) Hardware Virtualization:
When the virtual machine software or virtual machine manager (VMM) is installed directly on
the hardware system, it is known as hardware virtualization.
The main job of the hypervisor is to control and monitor the processor, memory and other
hardware resources.
After virtualization of the hardware system, we can install different operating systems on it and run
different applications on those OSes.
Usage:
Hardware virtualization is mainly done for the server platforms, because controlling virtual
machines is much easier than controlling a physical server.
2) Operating System Virtualization:
When the virtual machine software or virtual machine manager (VMM) is installed on the host
operating system instead of directly on the hardware system, it is known as operating system
virtualization.
Usage:
Operating System Virtualization is mainly used for testing the applications on different
platforms of OS.
3) Server Virtualization:
When the virtual machine software or virtual machine manager (VMM) is installed directly on
the server system, it is known as server virtualization.
Usage:
Server virtualization is done because a single physical server can be divided into multiple
servers on demand and for load balancing.
4) Storage Virtualization:
Storage virtualization is the process of grouping the physical storage from multiple network
storage devices so that it looks like a single storage device.
Storage virtualization is also implemented by using software applications.
Usage:
Storage virtualization is mainly done for back-up and recovery purposes.
How does virtualization work in cloud computing?
Virtualization plays a very important role in cloud computing technology. Normally in
cloud computing, users share the data present in the clouds, such as applications, but
with the help of virtualization users actually share the infrastructure.
The main use of virtualization technology is to provide applications in standard
versions to cloud users. When the next version of an application is released, the
cloud provider has to supply the latest version to its cloud users, which is practically
difficult because it is expensive.
To overcome this problem, we use virtualization technology. With virtualization,
all the servers and software applications required by cloud providers are
maintained by third parties, and the cloud providers pay on a monthly
or annual basis.
Conclusion
Virtualization mainly means running multiple operating systems on a single machine while
sharing all the hardware resources. It helps us provide a pool of IT resources that
we can share in order to get benefits in the business.
Other than these there are few more Virtualizations:
1. Application Virtualization:
Application virtualization helps a user to have remote access to an application from
a server. The server stores all personal information and other characteristics of the
application, but the application can still run on a local workstation through the internet. An example of
this would be a user who needs to run two different versions of the same software.
Technologies that use application virtualization are hosted applications and
packaged applications.
2. Network Virtualization:
The ability to run multiple virtual networks, each with a separate control and
data plane, co-existing on top of one physical network. Each can be managed by
individual parties that are potentially distrustful of each other.
Network virtualization provides a facility to create and provision virtual networks (logical switches, routers, firewalls, load balancers, virtual private networks (VPNs),
and workload security) in days or even weeks.
3. Desktop Virtualization:
Desktop virtualization allows the users’ OS to be remotely stored on a server in the
data centre. It allows the user to access their desktop virtually, from any location by
a different machine. Users who want specific operating systems other than Windows
Server will need to have a virtual desktop. Main benefits of desktop virtualization
are user mobility, portability, easy management of software installation, updates,
and patches.
4. Data Virtualization: Data virtualization is the process of retrieving data from various
resources without knowing its type or the physical location where it is stored. It collects
heterogeneous data from different resources and allows data users across the
organization to access this data according to their work requirements. This
heterogeneous data can be accessed using any application, such as web portals, web
services, e-commerce, Software as a Service (SaaS), and mobile applications.
We can use Data Virtualization in the field of data integration, business
intelligence, and cloud computing.
Advantages of Data Virtualization
There are the following advantages of data virtualization -
o It allows users to access the data without worrying about where it physically
resides.
o It offers better customer satisfaction, retention, and revenue growth.
o It provides various security mechanisms that allow users to safely store their personal
and professional information.
o It reduces costs by removing data replication.
o It provides a user-friendly interface to develop customized views.
o It provides various simple and fast deployment resources.
o It increases business user efficiency by providing data in real-time.
o It is used to perform tasks such as data integration, business integration, Service-
Oriented Architecture (SOA) data services, and enterprise search.
Disadvantages of Data Virtualization
o It creates availability and scalability issues, because availability is maintained by third-party providers.
o It requires a high implementation cost.
o Although it saves time during the implementation phase of virtualization, it
consumes more time to generate the appropriate result.
Uses of Data Virtualization
There are the following uses of Data Virtualization -
1. Analyze performance
Data virtualization is used to analyze the performance of the organization compared to previous
years.
2. Search and discover interrelated data
Data Virtualization (DV) provides a mechanism to easily search the data which is similar and
internally related to each other.
3. Agile Business Intelligence
It is one of the most common uses of Data Virtualization. It is used in agile reporting and real-time
dashboards that require timely aggregation, analysis, and presentation of the relevant data from multiple
resources. Both individuals and managers use this to monitor performance, which helps
daily operational decision processes in areas such as sales, support, finance, logistics, legal, and
compliance.
4. Data Management
Data virtualization provides a secure centralized layer to search, discover, and govern the
unified data and its relationships.
Data Virtualization Tools
There are the following Data Virtualization tools -
1. Red Hat JBoss data virtualization
Red Hat JBoss Data Virtualization is the best choice for developers and those who are using microservices
and containers. It is written in Java.
2. TIBCO data virtualization
TIBCO helps administrators and users to create a data virtualization platform for accessing
multiple data sources and data sets. It provides a built-in transformation engine to combine
non-relational and unstructured data sources.
3. Oracle data service integrator
It is a very popular and powerful data integration tool which mainly works with Oracle
products. It allows organizations to quickly develop and manage data services to access a single
view of data.
4. SAS Federation Server
SAS Federation Server provides scalable, multi-user, and standards-based data access to data from multiple data services. It mainly focuses on
securing data.
5. Denodo
Denodo is one of the best data virtualization tools which allows organizations to minimize the
network traffic load and improve response time for large data sets. It is suitable for both small
as well as large organizations.
Industries that use Data Virtualization
o Communication & Technology
In the Communication & Technology industry, data virtualization is used to increase
revenue per customer, create a real-time ODS for marketing, manage customers,
improve customer insights, and optimize customer care.
o Finance
In the field of finance, DV is used to improve trade reconciliation, empower data
democracy, address data complexity, and manage fixed-risk income.
o Government
In the government sector, DV is used for protecting the environment.
o Healthcare
Data virtualization plays a very important role in the field of healthcare. In healthcare,
DV helps to improve patient care, drive new product innovation, accelerate M&A
synergies, and provide a more efficient claims analysis.
o Manufacturing
In the manufacturing industry, data virtualization is used to optimize a global supply chain,
optimize factories, and improve IT asset utilization.
Implementation Levels of Virtualization
• Virtualization is a computer architecture technology by which multiple virtual machines (VMs) are multiplexed on the same hardware machine.
• After virtualization, different user applications managed by their own operating systems (Guest OS) can run on the same hardware, independent of the host OS.
• This is done by adding an additional software layer, called a virtualization layer.
• This virtualization layer is known as the hypervisor or virtual machine monitor (VMM).
• The function of this software layer is to virtualize the physical hardware of a host machine into virtual resources to be used by the VMs.
• Common virtualization layers include the instruction set architecture (ISA) level, hardware level, operating system level, library support level, and application level.
Instruction Set Architecture Level
➢ At the ISA level, virtualization is performed by emulating a given ISA by the ISA of
the host machine. For example, MIPS binary code can run on an x86-based host
machine with the help of ISA emulation. With this approach, it is possible to run a
large amount of legacy binary code written for various processors on any given new
hardware host machine.
➢ Instruction set emulation leads to virtual ISAs created on any hardware machine. The
basic emulation method is through code interpretation. An interpreter program
interprets the source instructions to target instructions one by one. One source
instruction may require tens or hundreds of native target instructions to perform its
function. Obviously, this process is relatively slow. For better performance, dynamic
binary translation is desired.
➢ This approach translates basic blocks of dynamic source instructions to target
instructions. The basic blocks can also be extended to program traces or super blocks
to increase translation efficiency.
➢ Instruction set emulation requires binary translation and optimization. A virtual
instruction set architecture (V-ISA) thus requires adding a processor-specific software
translation layer to the compiler.
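The interpretation method described above can be sketched as a toy loop. The three-instruction "source ISA" below is purely hypothetical (the opcodes and register names are invented for illustration): each source instruction is decoded and emulated one at a time by host code, which is exactly why plain interpretation is slow compared with dynamic binary translation.

```python
# Toy ISA interpreter: each source instruction is decoded and emulated
# by host (Python) code one at a time -- the simple but slow emulation
# method described above.

def interpret(program):
    regs = {"r0": 0, "r1": 0}    # emulated register file
    pc = 0                       # emulated program counter
    while pc < len(program):
        op, *args = program[pc]
        if op == "LOADI":        # LOADI reg, imm  -> reg = imm
            regs[args[0]] = args[1]
        elif op == "ADD":        # ADD dst, src    -> dst += src
            regs[args[0]] += regs[args[1]]
        elif op == "HALT":
            break
        pc += 1
    return regs

result = interpret([
    ("LOADI", "r0", 40),
    ("LOADI", "r1", 2),
    ("ADD", "r0", "r1"),
    ("HALT",),
])
print(result["r0"])  # -> 42
```

A dynamic binary translator would instead translate a whole basic block of such instructions into native code once and reuse the translation, avoiding the per-instruction decode cost.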
Hardware Abstraction Level
➢ Hardware-level virtualization is performed right on top of the bare hardware.
➢ This approach generates a virtual hardware environment for a VM.
➢ The process manages the underlying hardware through virtualization. The idea is to virtualize a computer’s resources, such as its processors, memory, and I/O devices.
➢ The intention is to upgrade the hardware utilization rate by multiple users concurrently. The idea was implemented in the IBM VM/370 in the 1960s.
➢ More recently, the Xen hypervisor has been applied to virtualize x86-based machines.
Operating System Level
➢ This refers to an abstraction layer between the traditional OS and user applications.
➢ OS-level virtualization creates isolated containers on a single physical server, and the OS instances utilize the hardware and software in data centres.
➢ The containers behave like real servers.
➢ OS-level virtualization is commonly used in creating virtual hosting environments to allocate hardware resources among a large number of mutually distrusting users.
➢ It is also used, to a lesser extent, in consolidating server hardware by moving services on separate hosts into containers or VMs on one server.
Library Support Level
➢ Most applications use APIs exported by user-level libraries rather than using lengthy
system calls by the OS.
➢ Since most systems provide well-documented APIs, such an interface becomes
another candidate for virtualization.
➢ Virtualization with library interfaces is possible by controlling the communication
link between applications and the rest of a system through API hooks.
User-Application Level
➢ Virtualization at the application level virtualizes an application as a VM.
➢ On a traditional OS, an application often runs as a process. Therefore, application-
level virtualization is also known as process-level virtualization.
➢ The most popular approach is to deploy high-level language (HLL) VMs. In this
scenario, the virtualization layer sits as an application program on top of the operating
system.
➢ The layer exports an abstraction of a VM that can run programs written and compiled
to a particular abstract machine definition.
➢ Any program written in the HLL and compiled for this VM will be able to run on it.
The Microsoft .NET CLR and Java Virtual Machine (JVM) are two good examples of
this class of VM.
MAP-REDUCE IN HADOOP
MapReduce is a software framework and programming model used for processing huge
amounts of data. MapReduce programs work in two phases, namely Map and Reduce. Map
tasks deal with splitting and mapping of data, while Reduce tasks shuffle and reduce the data.
Hadoop is capable of running MapReduce programs written in various languages: Java,
Ruby, Python, and C++. MapReduce programs in cloud computing are parallel in
nature and are thus very useful for performing large-scale data analysis using multiple machines
in the cluster.
The input to each phase is key-value pairs. In addition, every programmer needs to specify
two functions: map function and reduce function.
MapReduce Architecture in Big Data explained in detail
The whole process goes through four phases of execution namely, splitting, mapping,
shuffling, and reducing.
Now in this MapReduce tutorial, let’s understand with a MapReduce example.
Consider the following input data for your MapReduce in Big Data program:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
MapReduce Architecture
The final output of the MapReduce task is
bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1
The data goes through the following phases of MapReduce in Big Data
Input Splits:
An input to a MapReduce in Big Data job is divided into fixed-size pieces called input
splits. An input split is a chunk of the input that is consumed by a single map task.
Mapping
This is the very first phase in the execution of a map-reduce program. In this phase, data in each
split is passed to a mapping function to produce output values. In our example, the job of the
mapping phase is to count the number of occurrences of each word from the input splits
and prepare a list in the form of <word, frequency>
Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant
records from the Mapping phase output. In our example, the same words are clubbed together
along with their respective frequencies.
Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase combines
values from Shuffling phase and returns a single output value. In short, this phase
summarizes the complete dataset.
In our example, this phase aggregates the values from Shuffling phase i.e., calculates total
occurrences of each word.
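The four phases above can be simulated in plain Python on the sample input, reproducing the final output listed earlier (this sketches the data flow only, not the actual Hadoop API):

```python
from collections import defaultdict

# Input splits: one line of the sample input per split
lines = ["Welcome to Hadoop Class",
         "Hadoop is good",
         "Hadoop is bad"]

# Mapping: emit a <word, 1> pair for every word in each split
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffling: club together the pairs that share the same key
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# Reducing: aggregate the list of values for each key
reduced = {word: sum(counts) for word, counts in shuffled.items()}

for word in sorted(reduced, key=str.lower):
    print(word, reduced[word])
# bad 1 / Class 1 / good 1 / Hadoop 3 / is 2 / to 1 / Welcome 1
```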
MapReduce Architecture explained in detail
• One map task is created for each split which then executes map function for each
record in the split.
• It is always beneficial to have multiple splits because the time taken to process a split
is small compared to the time taken to process the whole input. When the
splits are smaller, the processing is better load-balanced since we are processing the
splits in parallel.
• However, it is also not desirable to have splits too small in size. When splits are too
small, the overhead of managing the splits and of map task creation begins to dominate
the total job execution time.
• For most jobs, it is better to make a split size equal to the size of an HDFS block
(which is 64 MB, by default).
• Execution of map tasks results into writing output to a local disk on the respective
node and not to HDFS.
• Reason for choosing local disk over HDFS is, to avoid replication which takes place
in case of HDFS store operation.
• Map output is intermediate output which is processed by reduce tasks to produce the
final output.
• Once the job is complete, the map output can be thrown away. So, storing it in HDFS
with replication becomes overkill.
• In the event of node failure, before the map output is consumed by the reduce task,
Hadoop reruns the map task on another node and re-creates the map output.
• Reduce task doesn’t work on the concept of data locality. An output of every map task
is fed to the reduce task. Map output is transferred to the machine where reduce task
is running.
• On this machine, the output is merged and then passed to the user-defined reduce
function.
• Unlike the map output, reduce output is stored in HDFS (the first replica is stored on
the local node and other replicas are stored on off-rack nodes), so writing the reduce
output consumes network bandwidth for replication.
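The split-size trade-off above boils down to simple arithmetic: the number of map tasks equals the number of input splits, i.e. the file size divided by the split size, rounded up. (`num_map_tasks` is a hypothetical helper written to show the arithmetic, not a Hadoop API.)

```python
import math

def num_map_tasks(file_size_mb, split_size_mb=64):
    """One map task is created per input split; the split size
    defaults to the 64 MB HDFS block size mentioned above."""
    return math.ceil(file_size_mb / split_size_mb)

print(num_map_tasks(1024))       # 1 GB file, 64 MB splits  -> 16 map tasks
print(num_map_tasks(1024, 128))  # larger splits -> fewer, coarser tasks
```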
How MapReduce Organizes Work?
Now in this MapReduce tutorial, we will learn how MapReduce works
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (Splits & Mapping)
2. Reduce tasks (Shuffling, Reducing)
as mentioned above.
The complete execution process (execution of both Map and Reduce tasks) is controlled by
two types of entities:
1. Jobtracker: Acts like a master (responsible for complete execution of submitted job)
2. Multiple Task Trackers: Acts like slaves, each of them performing the job
For every job submitted for execution in the system, there is one Jobtracker that resides
on Namenode and there are multiple tasktrackers which reside on Datanode.
How Hadoop MapReduce Works
• A job is divided into multiple tasks which are then run on multiple data nodes in a
cluster.
• It is the responsibility of job tracker to coordinate the activity by scheduling tasks to
run on different data nodes.
• Execution of an individual task is then looked after by the task tracker, which resides on
every data node executing part of the job.
• The task tracker’s responsibility is to send the progress report to the job tracker.
• In addition, the task tracker periodically sends a ‘heartbeat’ signal to the Jobtracker so as
to notify it of the current state of the system.
• Thus job tracker keeps track of the overall progress of each job. In the event of task
failure, the job tracker can reschedule it on a different task tracker.
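The heartbeat-and-reschedule behaviour described above can be sketched in-process. This is illustrative only: real Hadoop trackers are separate daemons communicating over RPC, and the class and method names here are invented.

```python
# Sketch of the JobTracker/TaskTracker relationship: trackers report
# progress via heartbeats, and a tracker that goes silent for longer
# than the timeout is presumed failed and its tasks are rescheduled.
class JobTracker:
    def __init__(self, timeout=3):
        self.last_beat = {}          # tracker id -> ticks since last heartbeat
        self.timeout = timeout

    def heartbeat(self, tracker_id):
        self.last_beat[tracker_id] = 0

    def tick(self):
        """Advance time one unit; return trackers presumed failed."""
        failed = []
        for tid in self.last_beat:
            self.last_beat[tid] += 1
            if self.last_beat[tid] > self.timeout:
                failed.append(tid)
        for tid in failed:
            del self.last_beat[tid]  # its tasks get rescheduled elsewhere
        return failed

jt = JobTracker()
jt.heartbeat("tracker-1")
jt.heartbeat("tracker-2")
jt.tick()                      # both trackers still healthy
jt.heartbeat("tracker-1")      # tracker-2 has gone silent
failed = []
for _ in range(3):
    failed += jt.tick()
print(failed)                  # -> ['tracker-2']
```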
HADOOP
Apache Hadoop is an open source software framework used to develop data processing
applications which are executed in a distributed computing environment.
Applications built using HADOOP are run on large data sets distributed across clusters of
commodity computers. Commodity computers are cheap and widely available. These are
mainly useful for achieving greater computational power at low cost.
Similar to data residing in a local file system of a personal computer, in Hadoop, data
resides in a distributed file system called the Hadoop Distributed File System. The
processing model is based on the ‘Data Locality’ concept, wherein computational logic is sent to
cluster nodes (servers) containing data. This computational logic is nothing but a compiled
version of a program written in a high-level language such as Java. Such a program
processes data stored in Hadoop HDFS.
Apache Hadoop consists of two sub-projects –
1. Hadoop MapReduce: MapReduce is a computational model and software framework
for writing applications which are run on Hadoop. These MapReduce programs are
capable of processing enormous data in parallel on large clusters of computation
nodes.
2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of
Hadoop applications. MapReduce applications consume data from HDFS. HDFS
creates multiple replicas of data blocks and distributes them on compute nodes in a
cluster. This distribution enables reliable and extremely rapid computations.
Although Hadoop is best known for MapReduce and its distributed file system (HDFS), the
term is also used for a family of related projects that fall under the umbrella of distributed
computing and large-scale data processing. Other Hadoop-related projects at Apache
include Hive, HBase, Mahout, Sqoop, Flume, and ZooKeeper.
HADOOP Architecture
Hadoop has a Master-Slave Architecture for data storage and distributed data processing
using MapReduce and HDFS methods.
NameNode:
The NameNode represents every file and directory used in the namespace.
DataNode:
A DataNode helps you to manage the state of an HDFS node and allows you to interact with
the blocks.
MasterNode:
The master node allows you to conduct parallel processing of data using Hadoop
MapReduce.
Slave node:
The slave nodes are the additional machines in the Hadoop cluster which allow you to store
data and conduct complex calculations. Moreover, every slave node comes with a Task Tracker
and a DataNode. These allow you to synchronize the processes with the Job Tracker and
NameNode respectively.
In Hadoop, the master and slave systems can be set up in the cloud or on-premises.
Features Of ‘Hadoop’
• Suitable for Big Data Analysis
As Big Data tends to be distributed and unstructured in nature, HADOOP clusters are best
suited for analysis of Big Data. Since it is processing logic (not the actual data) that flows to
the computing nodes, less network bandwidth is consumed. This concept is called the data
locality concept, and it helps increase the efficiency of Hadoop-based applications.
• Scalability
HADOOP clusters can easily be scaled to any extent by adding additional cluster nodes and
thus allows for the growth of Big Data. Also, scaling does not require modifications to
application logic.
• Fault Tolerance
HADOOP ecosystem has a provision to replicate the input data on to other cluster nodes.
That way, in the event of a cluster node failure, data processing can still proceed by using
data stored on another cluster node.
GOOGLE File System
Architecture
A single GFS master and multiple GFS chunk servers are accessed by multiple GFS clients.
GFS files are divided into fixed-size chunks (64 MB).
Each chunk is identified by a globally unique “chunk handle” (64 bits).
Chunk servers store chunks on local disks as Linux files.
For reliability, each chunk is replicated on multiple chunk servers (default = 3).
GFS master→
maintains all file system metadata
• namespace, access control, chunk mapping & locations (files → chunks → replicas)
periodically exchanges heartbeat messages with the chunk servers
• instructions + state monitoring
GFS client →
• Library code linked into each application
• Communicates with GFS master for metadata operations (control plane)
• Communicates with chunk servers for read/write operations (data plane)
Chunk Size →
• 64MB
• Stored as a plain Linux file on chunkserver
• Extended only as needed
• Lazy space allocation avoids wasting space (no fragmentation)
• Reasons for the large chunk size:
Advantages
• Reduces client interaction with the GFS master for chunk location information
• Reduces the size of metadata stored on the master (kept fully in memory)
• Reduces network overhead by keeping persistent TCP connections to the
chunk server over an extended period of time
Disadvantages
• Small files can create hot spots on chunk servers if many clients access the
same file
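Before contacting the master, a GFS client translates a (file, byte offset) pair into a chunk index using the fixed 64 MB chunk size, then asks the master for that chunk's handle and replica locations. The translation itself is just integer division; this sketches the client-side arithmetic, not the real library code.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # fixed-size 64 MB chunks, as described above

def chunk_index(byte_offset):
    """Translate a byte offset within a file into the index of the
    chunk that holds it; the client then asks the master which chunk
    handle and replicas correspond to that index."""
    return byte_offset // CHUNK_SIZE

print(chunk_index(0))                  # offset 0       -> chunk 0
print(chunk_index(64 * 1024 * 1024))   # exactly 64 MB  -> chunk 1
print(chunk_index(200 * 1024 * 1024))  # 200 MB offset  -> chunk 3
```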
Descriptions:
1. There is single master in the whole cluster which stores metadata.
2. Other nodes act as the chunk servers for storing data.
3. The file system namespace and locking facilities are managed by master.
4. The master periodically communicates with the chunk servers to collect management
information and gives instructions to chunk servers to do work such as load balancing
or failure recovery.
5. With a single master, many complicated distributed algorithms can be avoided and the
design of the system can be simplified.
6. The single GFS master could be the performance bottleneck and single point of
failure.
7. To mitigate this, Google uses a shadow master to replicate all the data on the master,
and the design guarantees that all data operations are transferred between the master
and the clients so that they can be cached for future use.
8. With the current quality of commodity servers, the single master can handle a cluster
of more than 1,000 nodes.
The features of Google file system are as follows:
1. GFS was designed for high fault tolerance.
2. Master and chunk servers can be restarted in a few seconds and with such a fast
recovery capability, the window of time in which data is unavailable can be greatly
reduced.
3. Each chunk is replicated in at least three places and can tolerate at least two data crashes
for a single chunk of data.
4. The shadow master handles the failure of the GFS master.
5. For data integrity, GFS makes checksums on every 64KB block in each chunk.
6. GFS can achieve the goals of high availability, high performance, and simple implementation.
7. It demonstrates how to support large-scale processing workloads on commodity
hardware, designed to tolerate frequent component failures and optimized for huge files
that are mostly appended to and then read.
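The 64 KB checksum scheme in point 5 can be sketched as follows. CRC32 stands in here as an illustrative 32-bit checksum; the sketch shows the per-block structure, not GFS's exact checksum algorithm.

```python
import zlib

BLOCK = 64 * 1024   # GFS keeps one checksum per 64 KB block within a chunk

def block_checksums(data):
    """Compute one 32-bit checksum per 64 KB block of data."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify(data, checksums):
    """On read, the server recomputes the checksums and compares them."""
    return block_checksums(data) == checksums

payload = bytes(200 * 1024)           # 200 KB -> 3 full blocks + 1 partial
sums = block_checksums(payload)
print(len(sums))                      # -> 4

corrupted = b"\x01" + payload[1:]     # flip one byte: verification fails
print(verify(payload, sums))          # -> True
print(verify(corrupted, sums))        # -> False
```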
HADOOP
Hadoop is an Apache open source framework, written in Java, that allows distributed processing
of large datasets across clusters of computers using simple programming models. A Hadoop
framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to scale
up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
At its core, Hadoop has two major layers namely −
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
MapReduce
MapReduce is a parallel programming model for writing distributed applications devised at
Google for efficient processing of large amounts of data (multi-terabyte data-sets), on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The
MapReduce program runs on Hadoop which is an Apache open-source framework.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on commodity hardware. It has many
similarities with existing distributed file systems. However, the differences from other
distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed
on low-cost hardware. It provides high throughput access to application data and is suitable
for applications having large datasets.
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −
• Hadoop Common − These are Java libraries and utilities required by other Hadoop
modules.
• Hadoop YARN − This is a framework for job scheduling and cluster resource
management.
How Does Hadoop Work?
It is quite expensive to build bigger servers with heavy configurations that handle large-scale
processing. As an alternative, you can tie together many single-CPU commodity computers
as a single functional distributed system; practically, the clustered machines can read
the dataset in parallel and provide much higher throughput. Moreover, it is cheaper than one
high-end server. This is the first motivational factor behind using Hadoop: it runs across
clustered, low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core
tasks that Hadoop performs −
• Data is initially divided into directories and files. Files are divided into uniform-sized
blocks of 128 MB or 64 MB (preferably 128 MB).
• These files are then distributed across various cluster nodes for further processing.
• HDFS, being on top of the local file system, supervises the processing.
• Blocks are replicated for handling hardware failure.
• Checking that the code was executed successfully.
• Performing the sort that takes place between the map and reduce stages.
• Sending the sorted data to a certain computer.
• Writing the debugging logs for each job.
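The first step above, dividing a file into uniform-sized blocks, is straightforward to sketch. (`split_into_blocks` is a hypothetical helper showing the arithmetic HDFS applies, not an HDFS API.)

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return (offset, length) pairs for the HDFS-style fixed-size
    blocks of a file; only the last block may be smaller."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

mb = 1024 * 1024
print(len(split_into_blocks(300 * mb)))  # 300 MB -> 3 blocks (128 + 128 + 44)
```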
Advantages of Hadoop
• The Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines and, in
turn, utilizes the underlying parallelism of the CPU cores.
• Hadoop does not rely on hardware to provide fault-tolerance and high availability
(FTHA), rather Hadoop library itself has been designed to detect and handle failures
at the application layer.
• Servers can be added or removed from the cluster dynamically and Hadoop continues
to operate without interruption.
• Another big advantage of Hadoop is that, apart from being open source, it is compatible
with all platforms since it is Java based.
Hadoop Distributed File System (HDFS)
As data grows in volume and velocity, it easily outgrows the storage limit of a single machine.
A solution is to store the data across a network of machines; such filesystems are
called distributed filesystems. Since data is stored across a network, all the complications
of a network come into play.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS
(Hadoop Distributed File System) is a unique design that provides storage for extremely
large files with a streaming data access pattern, and it runs on commodity hardware. Let's
elaborate on these terms:
• Extremely large files: Here we are talking about data in the range of
petabytes (1 PB = 1000 TB).
• Streaming data access pattern: HDFS is designed on the principle of write-once,
read-many-times. Once data is written, large portions of the dataset can be
processed any number of times.
• Commodity hardware: Hardware that is inexpensive and easily available in the
market. This is one of the features that particularly distinguishes HDFS from
other file systems.
Nodes: A master-slave arrangement of nodes typically forms the HDFS cluster.
1. MasterNode:
• Manages all the slave nodes and assigns work to them.
• Executes filesystem namespace operations like opening, closing, and
renaming files and directories.
• Should be deployed on reliable, high-specification hardware,
not on commodity hardware.
2. SlaveNode:
• These are the actual worker nodes, which do the actual work such as
reading, writing, and processing data.
• They also perform block creation, deletion, and replication upon
instruction from the master.
• They can be deployed on commodity hardware.
HDFS daemons: Daemons are processes running in the background.
• NameNode:
• Runs on the master node.
• Stores metadata (data about data) such as file paths, the number of
blocks, block IDs, etc.
• Requires a large amount of RAM.
• Stores metadata in RAM for fast retrieval, i.e. to reduce seek time;
a persistent copy is also kept on disk.
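The NameNode's in-memory metadata can be pictured as a simple map from file path to block IDs and replica locations. The sketch below is a hypothetical illustration (the structure, names like `blk_1`, and the `locate` helper are assumptions, not Hadoop's real classes); it shows why a block lookup is a pure RAM operation and why the NameNode needs a large amount of memory.

```python
# Hypothetical in-memory metadata map, as a namenode might keep it.
# Names and structure are illustrative only.
namenode_meta = {
    "/logs/app.log": {
        "block_ids": ["blk_1", "blk_2"],
        "replicas": {
            "blk_1": ["datanode1", "datanode2", "datanode3"],
            "blk_2": ["datanode2", "datanode3", "datanode4"],
        },
    }
}

def locate(path, block_id):
    """Look up which datanodes hold a given block -- a pure RAM lookup."""
    return namenode_meta[path]["replicas"][block_id]

print(locate("/logs/app.log", "blk_1"))
```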
• DataNodes:
• Run on slave nodes.
• Require a large amount of disk space, as the data is actually stored here.
Data storage in HDFS: Now let's see how data is stored in a distributed manner.
Assume a 100 TB file is inserted. The master node (NameNode) first divides the file into
blocks of the configured block size (the default is 128 MB in Hadoop 2.x and above). These
blocks are then stored across different DataNodes (slave nodes). The DataNodes replicate the
blocks among themselves, and information about which blocks they hold is sent to the
master. The default replication factor is 3, meaning 3 replicas of each block are created
(including the original). The replication factor can be increased or decreased by editing
the configuration in hdfs-site.xml.
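The replication factor is controlled by the `dfs.replication` property in `hdfs-site.xml`. A typical entry looks like this (the value 3 is the default; this fragment is a standard example, adapt it to your cluster):

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <!-- number of replicas kept for each block -->
  </property>
</configuration>
```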
Note: The MasterNode has a record of everything; it knows the location and details of each
and every DataNode and the blocks they contain, i.e. nothing is done without the
permission of the MasterNode.
Why divide the file into blocks?
Answer: Assume we don't divide the file. It is very difficult to store a 100 TB file on
a single machine, and even if we could, each read and write operation on that whole file
would incur a very high seek time. With multiple blocks of size 128 MB, it becomes easy
to perform various read and write operations on them compared to operating on the
whole file at once. So we divide the file for faster data access, i.e. to reduce seek time.
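The scale of the split is easy to verify with a little arithmetic: a 100 TB file at the default 128 MB block size yields hundreds of thousands of independently readable blocks.

```python
# How many 128 MB blocks does a 100 TB file need?
TB = 1024 ** 4
MB = 1024 ** 2
file_size = 100 * TB
block_size = 128 * MB
num_blocks = file_size // block_size
print(num_blocks)  # 819200 blocks
```

Each of those 819,200 blocks can be read or written in parallel by different DataNodes, which is where the throughput gain comes from.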
Why replicate the blocks in data nodes while storing?
Answer: Assume we don't replicate, and only one copy of a given block is present on
DataNode D1. If D1 crashes, we lose that block, which makes the overall data inconsistent
and faulty. So we replicate the blocks to achieve fault tolerance.
Terms related to HDFS:
• HeartBeat: The signal a DataNode continuously sends to the NameNode. If the
NameNode doesn't receive a heartbeat from a DataNode, it considers that node
dead.
• Balancing: If a DataNode crashes, the blocks present on it are gone, and those
blocks become under-replicated compared to the remaining blocks. The master
node (NameNode) then signals DataNodes containing replicas of the lost blocks
to re-replicate them, so the overall distribution of blocks stays balanced.
• Replication: Performed by the DataNodes.
Note: No two replicas of the same block are present on the same DataNode.
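The note above (no two replicas on the same DataNode) can be sketched as a placement rule: choose the required number of *distinct* DataNodes for each block. This is a simplified illustration with assumed names; real HDFS placement is additionally rack-aware.

```python
import random

# Simplified replica placement: pick `replication` distinct datanodes,
# so no node ever holds two copies of the same block.
def place_replicas(block_id, datanodes, replication=3):
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for requested replication")
    # random.sample draws without replacement, guaranteeing distinct nodes
    return random.sample(datanodes, replication)

nodes = ["d1", "d2", "d3", "d4", "d5"]
replicas = place_replicas("blk_7", nodes)
assert len(set(replicas)) == 3  # three distinct nodes, never a duplicate
```

If a node holding one of these replicas later crashes, the NameNode only has to re-replicate from one of the two surviving distinct copies, which is the balancing behaviour described above.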
Features:
• Distributed data storage.
• Blocks reduce seek time.
• The data is highly available, as the same block is present on multiple DataNodes.
• Even if several DataNodes are down, we can still do our work, making the
system highly reliable.
• High fault tolerance.
Limitations: Though HDFS provides many features, there are some areas where it doesn't
work well.
• Low-latency data access: Applications that require low-latency access to data,
i.e. in the range of milliseconds, will not work well with HDFS, because HDFS
is designed for high throughput of data even at the cost of latency.
• Small-file problem: Having lots of small files results in lots of seeks and
lots of movement from one DataNode to another to retrieve each small file,
which as a whole is a very inefficient data access pattern.