Cloud Computing
Sweta Kumari Barnwal
Module: 4
CLOUD COMPUTING
(Topics: VIRTUALIZATION: Basics of Virtualization, Types of Virtualization, Implementation
Levels of Virtualization, Virtualization Structures, Tools and Mechanisms, Virtualization of CPU,
Memory, I/O Devices, Virtual Clusters and Resource Management, Virtualization for Data-center
Automation, Introduction to MapReduce, GFS, HDFS, Hadoop Framework.)
Virtualization in Cloud Computing
Virtualization is the "creation of a virtual (rather than actual) version of something, such as a
server, a desktop, a storage device, an operating system or network resources".
In other words, virtualization is a technique which allows a single physical instance of a
resource or an application to be shared among multiple customers and organizations. It does
so by assigning a logical name to a physical resource and providing a pointer to that physical
resource when demanded.
What is the concept behind Virtualization?
Creation of a virtual machine over the existing operating system and hardware is known as
Hardware Virtualization. A virtual machine provides an environment that is logically
separated from the underlying hardware.
The machine on which the virtual machine is created is known as the Host Machine, and
the virtual machine itself is referred to as the Guest Machine.
BENEFITS OF VIRTUALIZATION
1. More flexible and efficient allocation of resources.
2. Enhanced development productivity.
3. Lower cost of IT infrastructure.
4. Remote access and rapid scalability.
5. High availability and disaster recovery.
6. Pay-per-use of the IT infrastructure on demand.
7. Ability to run multiple operating systems.
Types of Virtualization:
1. Hardware Virtualization.
2. Operating system Virtualization.
3. Server Virtualization.
4. Storage Virtualization.
1) Hardware Virtualization:
When the virtual machine software or virtual machine manager (VMM) is installed directly on
the hardware system, it is known as hardware virtualization.
The main job of the hypervisor is to control and monitor the processor, memory and other
hardware resources.
After virtualization of the hardware system, we can install different operating systems on it and
run different applications on those operating systems.
Usage:
Hardware virtualization is mainly done for server platforms, because controlling virtual
machines is much easier than controlling a physical server.
2) Operating System Virtualization:
When the virtual machine software or virtual machine manager (VMM) is installed on the host
operating system instead of directly on the hardware system, it is known as operating system
virtualization.
Usage:
Operating system virtualization is mainly used for testing applications on different OS
platforms.
3) Server Virtualization:
When the virtual machine software or virtual machine manager (VMM) is installed directly on
the server system, it is known as server virtualization.
Usage:
Server virtualization is done because a single physical server can be divided into multiple
servers on demand and for load balancing.
4) Storage Virtualization:
Storage virtualization is the process of grouping the physical storage from multiple network
storage devices so that it looks like a single storage device.
Storage virtualization is also implemented by using software applications.
Usage:
Storage virtualization is mainly done for back-up and recovery purposes.
How does virtualization work in cloud computing?
Virtualization plays a very important role in cloud computing technology. Normally in cloud
computing, users share the data present in the cloud, such as applications, but with
virtualization users actually share the infrastructure.
The main use of virtualization technology is to provide applications in their standard versions
to cloud users; when the next version of an application is released, the cloud provider has to
supply the latest version to its users, which is practically difficult because it is expensive.
To overcome this problem, we use virtualization technology. With virtualization, all the servers
and software applications required by the cloud providers are maintained by a third party, and
the cloud providers pay on a monthly or annual basis.
Conclusion
Mainly, virtualization means running multiple operating systems on a single machine while
sharing all the hardware resources. It helps provide a pool of IT resources that can be shared
in order to gain business benefits.
Other than these, there are a few more types of virtualization:
1. Application Virtualization:
Application virtualization helps a user to have remote access to an application from
a server. The server stores all personal information and other characteristics of the
application, but the application can still run on a local workstation through the internet.
An example of this would be a user who needs to run two different versions of the same
software. Technologies that use application virtualization are hosted applications and
packaged applications.
2. Network Virtualization:
The ability to run multiple virtual networks, each with a separate control and
data plane. They co-exist together on top of one physical network and can be managed by
individual parties that are potentially confidential to each other.
Network virtualization provides the facility to create and provision virtual networks:
logical switches, routers, firewalls, load balancers, virtual private networks (VPN),
and workload security, within days or even weeks.
3. Desktop Virtualization:
Desktop virtualization allows the user's OS to be remotely stored on a server in the
data centre. It allows the user to access their desktop virtually, from any location, on
a different machine. Users who want specific operating systems other than Windows
Server will need to have a virtual desktop. The main benefits of desktop virtualization
are user mobility, portability, and easy management of software installation, updates,
and patches.
4. Data Virtualization: Data virtualization is the process of retrieving data from various
sources without knowing its type or the physical location where it is stored. It collects
heterogeneous data from different sources and allows data users across the
organization to access this data according to their work requirements. This
heterogeneous data can be accessed using any application, such as web portals, web
services, e-commerce, Software as a Service (SaaS), and mobile applications.
We can use data virtualization in the fields of data integration, business
intelligence, and cloud computing.
Advantages of Data Virtualization
There are the following advantages of data virtualization -
o It allows users to access the data without worrying about where it physically resides.
o It offers better customer satisfaction, retention, and revenue growth.
o It provides various security mechanisms that allow users to safely store their personal
and professional information.
o It reduces costs by removing data replication.
o It provides a user-friendly interface to develop customized views.
o It provides various simple and fast deployment resources.
o It increases business user efficiency by providing data in real-time.
o It is used to perform tasks such as data integration, business integration, Service-
Oriented Architecture (SOA) data services, and enterprise search.
Disadvantages of Data Virtualization
o It creates availability issues, because availability is maintained by third-party providers.
o It requires a high implementation cost.
o It creates availability and scalability issues.
o Although it saves time during the implementation phase of virtualization, it
consumes more time to generate the appropriate result.
Uses of Data Virtualization
There are the following uses of Data Virtualization -
1. Analyze performance
Data virtualization is used to analyze the performance of the organization compared to previous
years.
2. Search and discover interrelated data
Data Virtualization (DV) provides a mechanism to easily search for data that is similar and
internally related.
3. Agile Business Intelligence
It is one of the most common uses of data virtualization. It is used in agile reporting and
real-time dashboards that require timely aggregation, analysis, and presentation of relevant
data from multiple sources. Both individuals and managers use this to monitor performance,
which helps daily operational decision-making in areas such as sales, support, finance,
logistics, legal, and compliance.
4. Data Management
Data virtualization provides a secure centralized layer to search, discover, and govern the
unified data and its relationships.
Data Virtualization Tools
There are the following Data Virtualization tools -
1. Red Hat JBoss data virtualization
Red Hat JBoss Data Virtualization is a good choice for developers and those who are using
microservices and containers. It is written in Java.
2. TIBCO data virtualization
TIBCO helps administrators and users to create a data virtualization platform for accessing
multiple data sources and data sets. It provides a built-in transformation engine to combine
non-relational and unstructured data sources.
3. Oracle data service integrator
It is a very popular and powerful data integration tool which mainly works with Oracle
products. It allows organizations to quickly develop and manage data services to access a single
view of data.
4. SAS Federation Server
SAS Federation Server provides scalable, multi-user, and standards-based data access to
data from multiple data services. It mainly focuses on securing data.
5. Denodo
Denodo is one of the best data virtualization tools; it allows organizations to minimize
network traffic load and improve response time for large data sets. It is suitable for both
small and large organizations.
Industries that use Data Virtualization
o Communication & Technology
In the communication & technology industry, data virtualization is used to increase
revenue per customer, create a real-time ODS for marketing, manage customers,
improve customer insights, and optimize customer care.
o Finance
In the field of finance, DV is used to improve trade reconciliation, empower data
democracy, address data complexity, and manage fixed-risk income.
o Government
In the government sector, DV is used for protecting the environment.
o Healthcare
Data virtualization plays a very important role in the field of healthcare. In healthcare,
DV helps to improve patient care, drive new product innovation, accelerate M&A
synergies, and provide more efficient claims analysis.
o Manufacturing
In the manufacturing industry, data virtualization is used to optimize the global supply
chain, optimize factories, and improve IT asset utilization.
Implementation Levels of Virtualization
• Virtualization is a computer architecture technology by which multiple virtual
machines (VMs) are multiplexed in the same hardware machine.
• After virtualization, different user applications managed by their own operating
systems (Guest OS) can run on the same hardware, independent of the host OS.
• This is done by adding additional software, called a virtualization layer.
• This virtualization layer is known as the hypervisor or virtual machine monitor (VMM).
• The function of this software layer is to virtualize the physical hardware
of a host machine into virtual resources to be used by the VMs.
• Common virtualization layers include the instruction set architecture (ISA) level,
hardware level, operating system level, library support level, and application level.
Instruction Set Architecture Level
➢ At the ISA level, virtualization is performed by emulating a given ISA by the ISA of
the host machine. For example, MIPS binary code can run on an x86-based host
machine with the help of ISA emulation. With this approach, it is possible to run a
large amount of legacy binary code written for various processors on any given new
hardware host machine.
➢ Instruction set emulation leads to virtual ISAs created on any hardware machine. The
basic emulation method is through code interpretation. An interpreter program
interprets the source instructions into target instructions one by one. One source
instruction may require tens or hundreds of native target instructions to perform its
function, so this process is relatively slow. For better performance, dynamic binary
translation is desired (a toy interpreter loop illustrating this slow path is sketched after
this list).
➢ This approach translates basic blocks of dynamic source instructions to target
instructions. The basic blocks can also be extended to program traces or super blocks
to increase translation efficiency.
➢ Instruction set emulation requires binary translation and optimization. A virtual
instruction set architecture (V-ISA) thus requires adding a processor-specific software
translation layer to the compiler.
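To make the cost of pure interpretation concrete, below is a minimal, hypothetical sketch of an interpreter loop for a toy three-register ISA (the opcodes and the guest program are invented purely for illustration; this is not a real MIPS or x86 emulator). Each guest instruction is fetched, decoded, and dispatched individually on the host, which is why one source instruction can cost many native instructions and why translating whole basic blocks is faster.

// Toy interpreter sketch: emulates a tiny hypothetical 3-register ISA on the host JVM.
// Each "source" instruction is decoded and executed one by one, illustrating why pure
// interpretation is slow compared to translating whole basic blocks.
public class ToyIsaInterpreter {

    // Hypothetical opcodes of the guest ISA (assumed for illustration only).
    static final int LOADI = 0, ADD = 1, SUB = 2, JNZ = 3, HALT = 4;

    public static void run(int[][] program) {
        int[] regs = new int[3];          // guest registers r0..r2
        int pc = 0;                       // guest program counter
        while (true) {
            int[] ins = program[pc];      // fetch
            switch (ins[0]) {             // decode + dispatch: host work per guest instruction
                case LOADI: regs[ins[1]] = ins[2]; pc++; break;        // r[a] = imm
                case ADD:   regs[ins[1]] += regs[ins[2]]; pc++; break; // r[a] += r[b]
                case SUB:   regs[ins[1]] -= regs[ins[2]]; pc++; break; // r[a] -= r[b]
                case JNZ:   pc = (regs[ins[1]] != 0) ? ins[2] : pc + 1; break; // branch
                case HALT:  System.out.println("r0 = " + regs[0]); return;
                default: throw new IllegalStateException("unknown opcode " + ins[0]);
            }
        }
    }

    public static void main(String[] args) {
        // Guest program: sum 5 + 4 + 3 + 2 + 1 into r0 using r1 as a counter.
        int[][] program = {
            {LOADI, 0, 0},   // r0 = 0
            {LOADI, 1, 5},   // r1 = 5
            {ADD, 0, 1},     // r0 += r1
            {LOADI, 2, 1},   // r2 = 1
            {SUB, 1, 2},     // r1 -= 1
            {JNZ, 1, 2},     // if r1 != 0 goto instruction 2
            {HALT, 0, 0}
        };
        run(program);        // prints r0 = 15
    }
}

A dynamic binary translator would instead translate the whole loop body (instructions 2 to 5) once into native code and reuse it on every iteration, avoiding the per-instruction decode and dispatch cost shown here.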
Hardware Abstraction Level
➢ Hardware-level virtualization is performed right on top of the bare hardware.
➢ This approach generates a virtual hardware environment for a VM.
➢ The process manages the underlying hardware through virtualization. The idea is to
virtualize a computer's resources, such as its processors, memory, and I/O devices.
➢ The intention is to upgrade the hardware utilization rate by allowing multiple users
to use it concurrently.
➢ The idea was implemented in the IBM VM/370 in the 1960s.
➢ More recently, the Xen hypervisor has been applied to virtualize x86-based machines.
Operating System Level
➢ This refers to an abstraction layer between traditional OS and user applications.
➢ OS-level virtualization creates isolated containers on a single physical server, and
the OS instances utilize the hardware and software in data centres.
➢ The containers behave like real servers.
➢ OS-level virtualization is commonly used in creating virtual hosting environments
to allocate hardware resources among a large number of mutually distrusting users.
➢ It is also used, to a lesser extent, in consolidating server hardware by moving
services on separate hosts into containers or VMs on one server.
Library Support Level
➢ Most applications use APIs exported by user-level libraries rather than using lengthy
system calls by the OS.
➢ Since most systems provide well-documented APIs, such an interface becomes
another candidate for virtualization.
➢ Virtualization with library interfaces is possible by controlling the communication
link between applications and the rest of a system through API hooks.
User-Application Level
➢ Virtualization at the application level virtualizes an application as a VM.
➢ On a traditional OS, an application often runs as a process. Therefore, application-
level virtualization is also known as process-level virtualization.
➢ The most popular approach is to deploy high-level language (HLL) VMs. In this
scenario, the virtualization layer sits as an application program on top of the operating
system.
➢ The layer exports an abstraction of a VM that can run programs written and compiled
to a particular abstract machine definition.
➢ Any program written in the HLL and compiled for this VM will be able to run on it.
The Microsoft .NET CLR and Java Virtual Machine (JVM) are two good examples of
this class of VM.
MAP-REDUCE IN HADOOP
MapReduce is a software framework and programming model used for processing huge
amounts of data. MapReduce programs work in two phases, namely Map and Reduce. Map
tasks deal with splitting and mapping of data, while Reduce tasks shuffle and reduce the data.
Hadoop is capable of running MapReduce programs written in various languages: Java,
Ruby, Python, and C++. MapReduce programs are parallel in nature and are thus very
useful for performing large-scale data analysis using multiple machines in the cluster.
The input to each phase is key-value pairs. In addition, every programmer needs to specify
two functions: the map function and the reduce function.
MapReduce Architecture in Big Data explained in detail
The whole process goes through four phases of execution, namely splitting, mapping,
shuffling, and reducing.
Let's understand this with an example. Consider the following input data for a MapReduce
program:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
MapReduce Architecture
The final output of the MapReduce task is
bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1
The data goes through the following phases of MapReduce in Big Data
Input Splits:
An input to a MapReduce job is divided into fixed-size pieces called input splits. An input
split is a chunk of the input that is consumed by a single map task.
Mapping
This is the very first phase in the execution of a MapReduce program. In this phase, the data in
each split is passed to a mapping function to produce output values. In our example, the job of
the mapping phase is to count the number of occurrences of each word from the input splits
(more details about input splits are given below) and prepare a list in the form of <word, frequency>.
Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant
records from the Mapping phase output. In our example, the same words are grouped together
along with their respective frequencies.
Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase combines
values from the Shuffling phase and returns a single output value. In short, this phase
summarizes the complete dataset.
In our example, it calculates the total occurrences of each word.
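The word-count walkthrough above maps directly onto Hadoop's Java MapReduce API. The following is a minimal sketch of such a job (the class names and the input/output paths are illustrative, not part of this module's material): the mapper emits <word, 1> pairs, the framework shuffles them by key, and the reducer sums the counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapping phase: for each line of the split, emit <word, 1>.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducing phase: after shuffling, all <word, 1> pairs for the same word
    // arrive together; sum them to get the total frequency.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures and submits the job to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it would be launched with something like "hadoop jar wordcount.jar WordCount /input /output" (jar name and paths assumed). For the three input lines shown earlier, the output matches the final result listed above.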
MapReduce Architecture explained in detail
• One map task is created for each split which then executes map function for each
record in the split.
• It is always beneficial to have multiple splits because the time taken to process a split
is small compared to the time taken for processing the whole input. When the
splits are smaller, the processing is better load-balanced since we are processing the
splits in parallel.
• However, it is also not desirable to have splits that are too small in size. When splits are too
small, the overhead of managing the splits and of map task creation begins to dominate
the total job execution time.
• For most jobs, it is better to make the split size equal to the size of an HDFS block
(64 MB by default in older Hadoop versions, 128 MB in Hadoop 2.x and above).
• Execution of map tasks results into writing output to a local disk on the respective
node and not to HDFS.
• Reason for choosing local disk over HDFS is, to avoid replication which takes place
in case of HDFS store operation.
• Map output is intermediate output which is processed by reduce tasks to produce the
final output.
• Once the job is complete, the map output can be thrown away. So, storing it in HDFS
with replication becomes overkill.
• In the event of node failure, before the map output is consumed by the reduce task,
Hadoop reruns the map task on another node and re-creates the map output.
• Reduce task doesn’t work on the concept of data locality. An output of every map task
is fed to the reduce task. Map output is transferred to the machine where reduce task
is running.
• On this machine, the output is merged and then passed to the user-defined reduce
function.
• Unlike the map output, the reduce output is stored in HDFS (the first replica is stored on
the local node and the other replicas are stored on off-rack nodes). So, writing the reduce
output does consume network bandwidth, but only as much as a normal HDFS write
pipeline consumes.
How MapReduce Organizes Work?
Now in this MapReduce tutorial, we will learn how MapReduce works
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (Splits & Mapping)
2. Reduce tasks (Shuffling, Reducing)
as mentioned above.
The complete execution process (execution of both Map and Reduce tasks) is controlled by
two types of entities:
1. Jobtracker: acts like a master (responsible for complete execution of the submitted job)
2. Multiple Task Trackers: act like slaves, each of them performing part of the job
For every job submitted for execution in the system, there is one Jobtracker that resides
on the NameNode, and there are multiple Tasktrackers which reside on the DataNodes.
How Hadoop MapReduce Works
• A job is divided into multiple tasks, which are then run on multiple data nodes in a
cluster.
• It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to
run on different data nodes.
• Execution of an individual task is then looked after by the task tracker, which resides on
every data node executing part of the job.
• The task tracker's responsibility is to send the progress report to the job tracker.
• In addition, the task tracker periodically sends a 'heartbeat' signal to the Jobtracker so as
to notify it of the current state of the system.
• Thus the job tracker keeps track of the overall progress of each job. In the event of task
failure, the job tracker can reschedule it on a different task tracker.
HADOOP
Apache Hadoop is an open source software framework used to develop data processing
applications which are executed in a distributed computing environment.
Applications built using HADOOP are run on large data sets distributed across clusters of
commodity computers. Commodity computers are cheap and widely available. These are
mainly useful for achieving greater computational power at low cost.
Similar to data residing in a local file system of a personal computer system, in Hadoop, data
resides in a distributed file system which is called the Hadoop Distributed File System. The
processing model is based on the 'Data Locality' concept, wherein computational logic is sent to
the cluster nodes (servers) containing the data. This computational logic is nothing but a compiled
version of a program written in a high-level language such as Java. Such a program
processes data stored in Hadoop HDFS.
Apache Hadoop consists of two sub-projects –
1. Hadoop MapReduce: MapReduce is a computational model and software framework
for writing applications which are run on Hadoop. These MapReduce programs are
capable of processing enormous data in parallel on large clusters of computation
nodes.
2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of
Hadoop applications. MapReduce applications consume data from HDFS. HDFS
creates multiple replicas of data blocks and distributes them on compute nodes in a
cluster. This distribution enables reliable and extremely rapid computations.
Although Hadoop is best known for MapReduce and its distributed file system (HDFS), the
term is also used for a family of related projects that fall under the umbrella of distributed
computing and large-scale data processing. Other Hadoop-related projects at Apache include
Hive, HBase, Mahout, Sqoop, Flume, and ZooKeeper.
HADOOP Architecture
Hadoop has a Master-Slave Architecture for data storage and distributed data processing
using MapReduce and HDFS methods.
NameNode:
The NameNode represents every file and directory used in the namespace.
DataNode:
A DataNode helps you to manage the state of an HDFS node and allows you to interact with
its blocks.
MasterNode:
The master node allows you to conduct parallel processing of data using Hadoop
MapReduce.
Slave node:
The slave nodes are the additional machines in the Hadoop cluster which allow you to store
data and conduct complex calculations. Moreover, every slave node comes with a Task Tracker
and a DataNode, which allow it to synchronize its processes with the Job Tracker and the
NameNode respectively.
In Hadoop, the master and slave systems can be set up in the cloud or on-premises.
Features Of ‘Hadoop’
• Suitable for Big Data Analysis
As Big Data tends to be distributed and unstructured in nature, HADOOP clusters are best
suited for analysis of Big Data. Since it is processing logic (not the actual data) that flows to
the computing nodes, less network bandwidth is consumed. This concept is called data
locality, which helps increase the efficiency of Hadoop-based applications.
• Scalability
HADOOP clusters can easily be scaled to any extent by adding additional cluster nodes, thus
allowing for the growth of Big Data. Also, scaling does not require modifications to
application logic.
• Fault Tolerance
HADOOP ecosystem has a provision to replicate the input data on to other cluster nodes.
That way, in the event of a cluster node failure, data processing can still proceed by using
data stored on another cluster node.
GOOGLE File System
Architecture
A single GFS master and multiple GFS chunk servers, accessed by multiple GFS clients.
GFS files are divided into fixed-size chunks (64 MB).
Each chunk is identified by a globally unique "chunk handle" (64 bits).
Chunk servers store chunks on local disks as Linux files.
For reliability, each chunk is replicated on multiple chunk servers (default = 3).
GFS master→
maintains all file system metadata
• namespace, access control, chunk mapping & locations (files → chunks → replicas)
periodically exchanges heartbeat messages with the chunk servers
• instructions + state monitoring
GFS client →
• Library code linked into each application
• Communicates with GFS master for metadata operations (control plane)
• Communicates with chunk servers for read/write operations (data plane)
Chunk Size →
• 64MB
• Stored as a plain Linux file on chunkserver
• Extended only as needed
• Lazy space allocation avoids wasting space (no fragmentation)
• Reasons for the large chunk size:
Advantages
• Reduces client interaction with the GFS master for chunk location information
• Reduces the size of metadata stored on the master (fully in memory); a small worked
example follows this list
• Reduces network overhead by keeping persistent TCP connections to the
chunk server over an extended period of time
Disadvantages
• Small files can create hot spots on chunk servers if many clients access the
same file
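As a rough worked example of the metadata advantage, the sketch below estimates the master's in-memory metadata for a large file. The figure of roughly 64 bytes of metadata per chunk is an assumption taken from the published GFS paper; the exact cost varies.

// Back-of-the-envelope sketch: how much master metadata does a large file need?
// Assumption (from the GFS paper): the master keeps less than ~64 bytes of metadata per chunk.
public class GfsChunkMath {
    public static void main(String[] args) {
        long fileSizeBytes = 1L << 40;          // a 1 TB file
        long chunkSizeBytes = 64L << 20;        // 64 MB chunks
        long bytesPerChunkMeta = 64;            // assumed metadata cost per chunk on the master

        long chunks = (fileSizeBytes + chunkSizeBytes - 1) / chunkSizeBytes; // ceiling division
        long metadataBytes = chunks * bytesPerChunkMeta;

        System.out.println("Chunks for a 1 TB file : " + chunks);                        // 16384
        System.out.println("Approx. master metadata: " + metadataBytes / 1024 + " KB");  // ~1024 KB
        // With a much smaller chunk size (e.g. 1 MB) the same file would need 64x more
        // chunks, 64x more master metadata, and far more client-to-master lookups.
    }
}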
Descriptions:
1. There is a single master in the whole cluster, which stores metadata.
2. The other nodes act as chunk servers for storing data.
3. The file system namespace and locking facilities are managed by the master.
4. The master periodically communicates with the chunk servers to collect management
information and give instructions to the chunk servers to do work such as load balancing
or failure recovery.
5. With a single master, many complicated distributed algorithms can be avoided and the
design of the system can be simplified.
6. The single GFS master could be the performance bottleneck and a single point of
failure.
7. To mitigate this, Google uses a shadow master to replicate all the data on the master,
and the design guarantees that only metadata is transferred between the master and the
clients, which can be cached for future use.
8. With the current quality of commodity servers, the single master can handle a cluster
of more than 1,000 nodes.
The features of the Google File System are as follows:
1. GFS was designed for high fault tolerance.
2. Master and chunk servers can be restarted in a few seconds; with such a fast
recovery capability, the window of time in which data is unavailable can be greatly
reduced.
3. Each chunk is replicated in at least three places and can tolerate at least two data crashes
for a single chunk of data.
4. The shadow master handles the failure of the GFS master.
5. For data integrity, GFS makes checksums on every 64 KB block in each chunk (a sketch
of per-block checksumming follows this list).
6. GFS achieves the goals of high availability and high performance.
7. It demonstrates how to support large-scale processing workloads on commodity
hardware designed to tolerate frequent component failures, optimized for huge files
that are mostly appended to and then read.
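To illustrate feature 5, the following hypothetical sketch computes one 32-bit checksum per 64 KB block of a chunk, the granularity GFS uses for data integrity. CRC32 is used here only for illustration; the exact checksum function inside GFS is not specified in these notes.

import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Sketch of per-block checksumming at 64 KB granularity, as GFS does for data integrity.
// A chunk server would store these checksums and verify them on every read.
public class BlockChecksums {
    static final int BLOCK_SIZE = 64 * 1024;  // 64 KB blocks within a chunk

    // Compute one 32-bit checksum per 64 KB block of the given chunk data.
    static List<Long> checksum(byte[] chunkData) {
        List<Long> sums = new ArrayList<>();
        for (int off = 0; off < chunkData.length; off += BLOCK_SIZE) {
            int len = Math.min(BLOCK_SIZE, chunkData.length - off);
            CRC32 crc = new CRC32();
            crc.update(chunkData, off, len);
            sums.add(crc.getValue());
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[256 * 1024];          // a small 256 KB example "chunk"
        List<Long> sums = checksum(chunk);
        System.out.println("Blocks checksummed: " + sums.size());  // 4 blocks of 64 KB

        chunk[70_000] ^= 0x1;                          // simulate a corrupted byte in block 1
        boolean corrupted = !checksum(chunk).equals(sums);
        System.out.println("Corruption detected: " + corrupted);   // true
    }
}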
HADOOP
Hadoop is an Apache open source framework written in Java that allows distributed processing
of large datasets across clusters of computers using simple programming models. A Hadoop
framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to scale
up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
At its core, Hadoop has two major layers namely −
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
MapReduce
MapReduce is a parallel programming model for writing distributed applications devised at
Google for efficient processing of large amounts of data (multi-terabyte data-sets), on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The
MapReduce program runs on Hadoop which is an Apache open-source framework.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on commodity hardware. It has many
similarities with existing distributed file systems. However, the differences from other
distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed
on low-cost hardware. It provides high throughput access to application data and is suitable
for applications having large datasets.
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −
• Hadoop Common − These are Java libraries and utilities required by other Hadoop
modules.
• Hadoop YARN − This is a framework for job scheduling and cluster resource
management.
How Does Hadoop Work?
It is quite expensive to build bigger servers with heavy configurations that handle large-scale
processing. As an alternative, you can tie together many commodity computers, each with a
single CPU, as a single functional distributed system; practically, the clustered machines can read
the dataset in parallel and provide much higher throughput. Moreover, it is cheaper than one
high-end server. So this is the first motivational factor behind using Hadoop: it runs across
clustered and low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core
tasks that Hadoop performs −
• Data is initially divided into directories and files. Files are divided into uniform sized
blocks of 128M and 64M (preferably 128M).
• These files are then distributed across various cluster nodes for further processing.
• HDFS, being on top of the local file system, supervises the processing.
• Blocks are replicated for handling hardware failure.
• Checking that the code was executed successfully.
• Performing the sort that takes place between the map and reduce stages.
• Sending the sorted data to a certain computer.
• Writing the debugging logs for each job.
Advantages of Hadoop
• Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines and, in
turn, utilizes the underlying parallelism of the CPU cores.
• Hadoop does not rely on hardware to provide fault-tolerance and high availability
(FTHA), rather Hadoop library itself has been designed to detect and handle failures
at the application layer.
• Servers can be added or removed from the cluster dynamically and Hadoop continues
to operate without interruption.
• Another big advantage of Hadoop is that, apart from being open source, it is compatible
with all platforms since it is Java-based.
Hadoop Distributed File System (HDFS)
With growing data velocity the data size easily outgrows the storage limit of a machine. A
solution would be to store the data across a network of machines. Such filesystems are
called distributed filesystems. Since data is stored across a network, all the complications of
a network come in.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS
(Hadoop Distributed File System) is a unique design that provides storage for extremely
large files with a streaming data access pattern, and it runs on commodity hardware. Let's
elaborate on these terms:
• Extremely large files: Here we are talking about data in the range of
petabytes (1000 TB).
• Streaming Data Access Pattern: HDFS is designed on the principle of write-once
and read-many-times. Once data is written, large portions of the dataset can be
processed any number of times.
• Commodity hardware: Hardware that is inexpensive and easily available in the
market. This is one of the features which especially distinguishes HDFS from other
file systems.
Nodes: Master and slave nodes typically form the HDFS cluster.
1. MasterNode:
• Manages all the slave nodes and assigns work to them.
• It executes filesystem namespace operations like opening, closing,
renaming files and directories.
• It should be deployed on reliable hardware with a high configuration,
not on commodity hardware.
2. SlaveNode:
• Actual worker nodes, which do the actual work like reading, writing,
processing, etc.
• They also perform creation, deletion, and replication upon instruction
from the master.
• They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in background.
• Namenodes:
• Run on the master node.
• Store metadata (data about data) like file paths, the number of blocks,
block IDs, etc.
• Require a high amount of RAM.
• Store metadata in RAM for fast retrieval, i.e., to reduce seek time,
though a persistent copy of it is kept on disk.
• DataNodes:
• Run on slave nodes.
• Require a large amount of disk storage, as the data is actually stored here.
Data storage in HDFS: Now let's see how the data is stored in a distributed manner.
Let's assume that a 100 TB file is inserted. The master node (NameNode) will first divide the
file into blocks (say, 10 TB each for illustration; the default block size is 128 MB in Hadoop 2.x
and above). These blocks are then stored across different DataNodes (slave nodes). The
DataNodes (slave nodes) replicate the blocks among themselves, and the information about
which blocks they contain is sent to the master. The default replication factor is 3, meaning
3 replicas are created for each block (including itself). In hdfs-site.xml we can increase or
decrease the replication factor, i.e., we can edit its configuration there (see the sketch after the
note below).
Note: The MasterNode has a record of everything; it knows the location and information of
each and every DataNode and the blocks they contain, i.e., nothing is done without the
permission of the MasterNode.
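The replication factor mentioned above can be set cluster-wide through the dfs.replication property in hdfs-site.xml, or adjusted per file through Hadoop's Java FileSystem API. The following is a minimal sketch, assuming the Hadoop client libraries are on the classpath and a cluster is reachable; the path /user/demo/data.txt is purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: write a file into HDFS, inspect its block size, and change its replication factor.
public class HdfsReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        conf.set("dfs.replication", "3");               // default replication for new files
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/data.txt");    // illustrative path
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Welcome to Hadoop Class");
        }

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size  : " + status.getBlockSize() / (1024 * 1024) + " MB");
        System.out.println("Replication : " + status.getReplication());

        // Increase the replication factor of this one file to 4; the NameNode will
        // schedule additional replicas on other DataNodes in the background.
        fs.setReplication(file, (short) 4);
    }
}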
Why divide the file into blocks?
Answer: Let's assume that we don't divide the file; now it is very difficult to store a 100 TB file
on a single machine. Even if we could store it, each read and write operation on that whole file
would take a very high seek time. But if we have multiple blocks of size 128 MB, then it
becomes easy to perform various read and write operations on them compared to doing it on the
whole file at once. So we divide the file to have faster data access, i.e., to reduce seek time.
Why replicate the blocks in data nodes while storing?
Answer: Let's assume we don't replicate and only one copy of a given block is present on
DataNode D1. Now if DataNode D1 crashes, we will lose the block, which will make the
overall data inconsistent and faulty. So we replicate the blocks to achieve fault tolerance.
Terms related to HDFS:
• HeartBeat: It is the signal that a DataNode continuously sends to the NameNode. If
the NameNode doesn't receive a heartbeat from a DataNode, it will consider it
dead.
• Balancing: If a DataNode crashes, the blocks present on it will be gone too,
and those blocks will be under-replicated compared to the remaining blocks. Here
the master node (NameNode) will give a signal to the DataNodes containing replicas of
the lost blocks to replicate them so that the overall distribution of blocks is balanced.
• Replication: It is done by the DataNodes.
Note: No two replicas of the same block are present on the same datanode.
Features:
• Distributed data storage.
• Blocks reduce seek time.
• The data is highly available as the same block is present at multiple datanodes.
• Even if multiple datanodes are down we can still do our work, thus making it
highly reliable.
• High fault tolerance.
Limitations: Though HDFS provides many features, there are some areas where it doesn't
work well.
• Low latency data access: Applications that require low-latency access to data,
i.e., in the range of milliseconds, will not work well with HDFS, because HDFS is
designed keeping in mind that we need high throughput of data, even at the cost
of latency.
• Small file problem: Having lots of small files results in lots of seeks and
lots of movement from one DataNode to another to retrieve each small
file; this whole process is a very inefficient data access pattern (see the worked
example below).
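To see the small file problem in numbers, the sketch below compares the NameNode metadata load of many small files versus a few large files holding roughly the same data. The figure of about 150 bytes of NameNode heap per namespace object is a commonly quoted rule of thumb, assumed here purely for illustration.

// Back-of-the-envelope sketch of the HDFS small-file problem.
// Assumption: roughly 150 bytes of NameNode heap per namespace object (file or block),
// a commonly quoted rule of thumb rather than an exact figure.
public class SmallFileProblem {
    static long namenodeBytes(long files, long blocksPerFile, long bytesPerObject) {
        long objects = files + files * blocksPerFile;   // one object per file + one per block
        return objects * bytesPerObject;
    }

    public static void main(String[] args) {
        long bytesPerObject = 150;

        // Case 1: about 10 GB of data as 10 million files of 1 KB each (1 block per file).
        long smallFiles = namenodeBytes(10_000_000L, 1, bytesPerObject);

        // Case 2: roughly the same data as 80 files of 128 MB each (1 block per file).
        long largeFiles = namenodeBytes(80L, 1, bytesPerObject);

        System.out.println("NameNode heap, 10M small files : " + smallFiles / (1024 * 1024) + " MB");
        System.out.println("NameNode heap, 80 large files  : " + largeFiles / 1024 + " KB");
        // The data size is similar, but millions of tiny files inflate NameNode memory
        // and force a separate seek and DataNode hop per file when reading.
    }
}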
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 

Virtualization plays a very important role in cloud computing. Users normally share the data held in the cloud, such as applications, but with virtualization they actually share the underlying infrastructure.

One key use of virtualization technology is to provide cloud users with standard versions of their applications. When a new version of an application is released, the provider would otherwise have to supply the latest version to every user individually, which is impractical because it is expensive. Virtualization overcomes this problem: the servers and the software applications required by the cloud providers are maintained by a third party, and the cloud providers pay for them on a monthly or annual basis.

Conclusion
Virtualization essentially means running multiple operating systems on a single machine while sharing all of its hardware resources. It provides a pool of IT resources that can be shared to deliver business benefits.

Apart from the types listed above, there are a few more kinds of virtualization:

1. Application Virtualization:
Application virtualization allows a user to access an application remotely from a server. The server stores all personal information and other characteristics of the application, yet the application can still run on a local workstation through the internet. An example would be a user who needs to run two different versions of the same software. Technologies that use application virtualization include hosted applications and packaged applications.

2. Network Virtualization:
Network virtualization is the ability to run multiple virtual networks, each with a separate control plane and data plane, co-existing on top of one physical network. Each virtual network can be managed by individual parties that are potentially confidential to each other.
Network virtualization provides a facility to create and provision virtual networks (logical switches, routers, firewalls, load balancers, virtual private networks (VPNs), and workload security) within days or even weeks.

3. Desktop Virtualization:
Desktop virtualization allows a user's OS to be stored remotely on a server in the data centre, so the user can access their desktop virtually, from any location, on a different machine. Users who want a specific operating system other than Windows Server will need a virtual desktop. The main benefits of desktop virtualization are user mobility, portability, and easy management of software installation, updates, and patches.

4. Data Virtualization:
Data virtualization is the process of retrieving data from various resources without knowing its type or the physical location where it is stored. It collects heterogeneous data from different resources and allows data users across the organization to access this data according to their work requirements. This heterogeneous data can be accessed through any application, such as web portals, web services, e-commerce, Software as a Service (SaaS), and mobile applications. Data virtualization is used in data integration, business intelligence, and cloud computing.

Advantages of Data Virtualization
There are the following advantages of data virtualization:
o It allows users to access the data without worrying about where it resides in memory.
o It offers better customer satisfaction, retention, and revenue growth.
o It provides various security mechanisms that allow users to safely store their personal and professional information.
o It reduces costs by removing data replication.
o It provides a user-friendly interface to develop customized views.
o It provides various simple and fast deployment resources.
o It increases business user efficiency by providing data in real time.
o It is used to perform tasks such as data integration, business integration, Service-Oriented Architecture (SOA) data services, and enterprise search.

Disadvantages of Data Virtualization
o It creates availability issues, because availability is maintained by third-party providers.
o It requires a high implementation cost.
o It creates availability and scalability issues.
o Although it saves time during the implementation phase, it consumes more time to generate the appropriate results.

Uses of Data Virtualization
There are the following uses of data virtualization:

1. Analyze performance
Data virtualization is used to analyze the performance of the organization compared to previous years.

2. Search and discover interrelated data
Data virtualization (DV) provides a mechanism to easily search data that is similar and internally related.

3. Agile Business Intelligence
This is one of the most common uses of data virtualization. It is used in agile reporting and real-time dashboards that require timely aggregation, analysis, and presentation of relevant data from multiple resources. Both individuals and managers use this to monitor performance, which helps in daily operational decision processes such as sales, support, finance, logistics, legal, and compliance.

4. Data Management
Data virtualization provides a secure centralized layer to search, discover, and govern the unified data and its relationships.

Data Virtualization Tools
There are the following data virtualization tools:

1. Red Hat JBoss Data Virtualization
Red Hat JBoss Data Virtualization is the best choice for developers and for those who are using microservices and containers. It is written in Java.

2. TIBCO Data Virtualization
TIBCO helps administrators and users to create a data virtualization platform for accessing multiple data sources and data sets. It provides a built-in transformation engine to combine non-relational and unstructured data sources.

3. Oracle Data Service Integrator
It is a very popular and powerful data integration tool that works mainly with Oracle products. It allows organizations to quickly develop and manage data services to access a single view of data.

4. SAS Federation Server
SAS Federation Server provides scalable, multi-user, and standards-based technologies for accessing data from multiple data services. It mainly focuses on securing data.

5. Denodo
Denodo is one of the best data virtualization tools; it allows organizations to minimize network traffic load and improve response time for large data sets. It is suitable for both small and large organizations.

Industries that use Data Virtualization
o Communication & Technology
In the communication and technology industry, data virtualization is used to increase revenue per customer, create a real-time ODS for marketing, manage customers, improve customer insights, and optimize customer care.
o Finance
In the field of finance, DV is used to improve trade reconciliation, empower data democracy, address data complexity, and manage fixed-risk income.
o Government
In the government sector, DV is used for protecting the environment.
o Healthcare
Data virtualization plays a very important role in healthcare. It helps to improve patient care, drive new product innovation, accelerate M&A synergies, and provide more efficient claims analysis.
o Manufacturing
In the manufacturing industry, data virtualization is used to optimize global supply chains, optimize factories, and improve the utilization of IT assets.

Implementation Levels of Virtualization
• Virtualization is a computer architecture technology by which multiple virtual machines (VMs) are multiplexed on the same hardware machine.
• After virtualization, different user applications managed by their own operating systems (guest OSes) can run on the same hardware, independent of the host OS.
• This is done by adding a layer of additional software, called a virtualization layer.
• The virtualization layer is known as the hypervisor or virtual machine monitor (VMM).
• The function of this software layer is to virtualize the physical hardware of a host machine into virtual resources to be used by the VMs.
• Common virtualization layers include the instruction set architecture (ISA) level, hardware level, operating system level, library support level, and application level.

Instruction Set Architecture Level
➢ At the ISA level, virtualization is performed by emulating a given ISA with the ISA of the host machine. For example, MIPS binary code can run on an x86-based host machine with the help of ISA emulation. With this approach, it is possible to run a large amount of legacy binary code written for various processors on any given new hardware host machine.
➢ Instruction set emulation leads to virtual ISAs created on any hardware machine. The basic emulation method is code interpretation: an interpreter program interprets the source instructions into target instructions one by one. One source instruction may require tens or hundreds of native target instructions to perform its function, so this process is relatively slow. For better performance, dynamic binary translation is desired.
➢ Dynamic binary translation translates basic blocks of dynamic source instructions to target instructions. The basic blocks can also be extended to program traces or super blocks to increase translation efficiency.
➢ Instruction set emulation requires binary translation and optimization. A virtual instruction set architecture (V-ISA) thus requires adding a processor-specific software translation layer to the compiler.

Hardware Abstraction Level
➢ Hardware-level virtualization is performed right on top of the bare hardware.
➢ This approach generates a virtual hardware environment for a VM.
➢ The process manages the underlying hardware through virtualization. The idea is to virtualize a computer's resources, such as its processors, memory, and I/O devices.
➢ The intention is to raise the hardware utilization rate by serving multiple users concurrently. The idea was implemented in the IBM VM/370 in the 1960s.
➢ More recently, the Xen hypervisor has been applied to virtualize x86-based machines to run Linux or other guest operating systems.

Operating System Level
➢ This refers to an abstraction layer between the traditional OS and user applications.
➢ OS-level virtualization creates isolated containers on a single physical server, and the OS instances utilize the hardware and software in data centres.
➢ The containers behave like real servers.
➢ OS-level virtualization is commonly used in creating virtual hosting environments to allocate hardware resources among a large number of mutually distrusting users.
➢ It is also used, to a lesser extent, to consolidate server hardware by moving services on separate hosts into containers or VMs on one server.

Library Support Level
➢ Most applications use APIs exported by user-level libraries rather than lengthy system calls to the OS.
➢ Since most systems provide well-documented APIs, such an interface becomes another candidate for virtualization.
➢ Virtualization with library interfaces is possible by controlling the communication link between applications and the rest of the system through API hooks.

User-Application Level
➢ Virtualization at the application level virtualizes an application as a VM.
➢ On a traditional OS, an application often runs as a process. Therefore, application-level virtualization is also known as process-level virtualization.
➢ The most popular approach is to deploy high-level language (HLL) VMs. In this scenario, the virtualization layer sits as an application program on top of the operating system.
➢ The layer exports an abstraction of a VM that can run programs written and compiled for a particular abstract machine definition.
➢ Any program written in the HLL and compiled for this VM will be able to run on it. The Microsoft .NET CLR and the Java Virtual Machine (JVM) are two good examples of this class of VM.

MAP-REDUCE IN HADOOP
MapReduce is a software framework and programming model used for processing huge amounts of data. A MapReduce program works in two phases, namely Map and Reduce. Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.

Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in a cluster.

The input to each phase is a set of key-value pairs. In addition, every programmer needs to specify two functions: a map function and a reduce function.

MapReduce Architecture in Big Data explained in detail
The whole process goes through four phases of execution, namely splitting, mapping, shuffling, and reducing. Let's understand this with an example. Consider the following input data for a MapReduce program:

Welcome to Hadoop Class
Hadoop is good
Hadoop is bad

(Figure: MapReduce architecture for the word-count example.)
The final output of the MapReduce task is:

bad      1
Class    1
good     1
Hadoop   3
is       2
to       1
Welcome  1

The data goes through the following phases of MapReduce:

Input Splits: The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is the chunk of the input that is consumed by a single map task.

Mapping: This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word in the input splits and prepare a list in the form of <word, frequency>.

Shuffling: This phase consumes the output of the mapping phase. Its task is to consolidate the relevant records from the mapping phase output. In our example, the same words are clubbed together along with their respective frequencies.

Reducing: In this phase, output values from the shuffling phase are aggregated. This phase combines the values from the shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset. In our example, it calculates the total occurrences of each word. (A minimal code sketch of these map and reduce functions is given after the architecture notes below.)

MapReduce Architecture explained in detail
• One map task is created for each split, and it executes the map function for each record in the split.
• It is usually beneficial to have multiple splits, because the time taken to process a split is small compared to the time taken to process the whole input. When the splits are smaller, the processing is easier to load-balance, since the splits are processed in parallel.
• However, it is also not desirable to have splits that are too small. When splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time.
• For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by default in Hadoop 1.x; 128 MB in Hadoop 2.x and later).
• Execution of a map task writes its output to a local disk on the respective node, not to HDFS.
• The reason for choosing the local disk over HDFS is to avoid the replication that takes place with an HDFS store operation.
• Map output is intermediate output, which is processed by reduce tasks to produce the final output.
• Once the job is complete, the map output can be thrown away, so storing it in HDFS with replication would be overkill.
• In the event of node failure, before the map output is consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.
• The reduce task does not work on the concept of data locality. The output of every map task is fed to the reduce task, so map output is transferred to the machine where the reduce task is running.
• On this machine, the outputs are merged and then passed to the user-defined reduce function.
• Unlike the map output, the reduce output is stored in HDFS (the first replica is stored on the local node and the other replicas are stored on off-rack nodes). So, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes.

How MapReduce Organizes Work?
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (splits and mapping)
2. Reduce tasks (shuffling and reducing)
as mentioned above.

The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:
1. Jobtracker: acts like a master (responsible for the complete execution of a submitted job)
2. Multiple Task Trackers: act like slaves, each of them performing part of the job

For every job submitted for execution in the system, there is one Jobtracker that resides on the Namenode, and there are multiple Tasktrackers which reside on the Datanodes.
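To make the mapping and reducing phases described above concrete, here is a minimal sketch of the word-count example written against the standard Hadoop MapReduce Java API. The class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative and not taken from the module text; only the Hadoop types (Mapper, Reducer, Text, IntWritable) belong to the framework itself.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: for every record (line) in an input split, emit <word, 1>.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);        // e.g. <Hadoop, 1>
      }
    }
  }

  // Reduce phase: after shuffling, all values for the same word arrive
  // together; summing them gives the total occurrence count.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);        // e.g. <Hadoop, 3>
    }
  }
}

For the sample input above, the map phase emits <Hadoop, 1> three times; shuffling groups those pairs together, and the reduce phase sums them to produce <Hadoop, 3>, matching the final output shown earlier.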
How Hadoop MapReduce Works
• A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.
• It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run on different data nodes.
• Execution of an individual task is then looked after by the task tracker, which resides on every data node executing part of the job.
• The task tracker's responsibility is to send progress reports to the job tracker.
• In addition, the task tracker periodically sends a 'heartbeat' signal to the Jobtracker to notify it of the current state of the system.
• Thus, the job tracker keeps track of the overall progress of each job. In the event of task failure, the job tracker can reschedule the task on a different task tracker.

HADOOP
Apache Hadoop is an open-source software framework used to develop data processing applications which are executed in a distributed computing environment. Applications built using Hadoop run on large data sets distributed across clusters of commodity computers. Commodity computers are cheap and widely available, and are mainly useful for achieving greater computational power at low cost.

Similar to data residing in the local file system of a personal computer, in Hadoop data resides in a distributed file system called the Hadoop Distributed File System. The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) containing the data. This computational logic is nothing but a compiled
version of a program written in a high-level language such as Java. Such a program processes data stored in Hadoop HDFS.

Apache Hadoop consists of two sub-projects:
1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications which are run on Hadoop. These MapReduce programs are capable of processing enormous amounts of data in parallel on large clusters of computation nodes.
2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them on compute nodes in a cluster. This distribution enables reliable and extremely rapid computation.

Although Hadoop is best known for MapReduce and its distributed file system (HDFS), the term is also used for a family of related projects that fall under the umbrella of distributed computing and large-scale data processing. Other Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume, and ZooKeeper.

HADOOP Architecture
Hadoop has a master-slave architecture for data storage and distributed data processing using the MapReduce and HDFS methods.

NameNode: The NameNode represents every file and directory used in the namespace.
DataNode: A DataNode helps you to manage the state of an HDFS node and allows you to interact with the blocks.
MasterNode: The master node allows you to conduct parallel processing of data using Hadoop MapReduce.
Slave node: The slave nodes are the additional machines in the Hadoop cluster which allow you to store data and conduct complex calculations. Moreover, every slave node comes with a Task Tracker and a DataNode, which allow it to synchronize with the Job Tracker and the NameNode respectively.
In Hadoop, the master and slave systems can be set up in the cloud or on-premises.

Features of Hadoop
• Suitable for Big Data analysis
As Big Data tends to be distributed and unstructured in nature, Hadoop clusters are well suited for the analysis of Big Data. Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This is called the data locality concept, and it helps increase the efficiency of Hadoop-based applications.
• Scalability
Hadoop clusters can easily be scaled to any extent by adding additional cluster nodes, and thus allow for the growth of Big Data. Also, scaling does not require modifications to the application logic.
• Fault tolerance
The Hadoop ecosystem has a provision to replicate the input data onto other cluster nodes. That way, in the event of a cluster node failure, data processing can still proceed by using the data stored on another cluster node.

GOOGLE File System Architecture
A single GFS master and multiple GFS chunk servers are accessed by multiple GFS clients.
GFS files are divided into fixed-size chunks (64 MB). Each chunk is identified by a globally unique "chunk handle" (64 bits). Chunk servers store chunks on local disks as Linux files. For reliability, each chunk is replicated on multiple chunk servers (default = 3).

GFS master → maintains all file system metadata
• namespace, access control, chunk mapping and locations (files → chunks → replicas)
• periodically exchanges heartbeat messages with chunk servers (instructions and state monitoring)

GFS client →
• Library code linked into each application
• Communicates with the GFS master for metadata operations (control plane)
• Communicates with chunk servers for read/write operations (data plane)

Chunk size →
• 64 MB
• Stored as a plain Linux file on a chunk server
• Extended only as needed
• Lazy space allocation avoids wasting space (no fragmentation)

Reasons for the large chunk size - Advantages
• Reduces client interaction with the GFS master for chunk location information
• Reduces the size of the metadata stored on the master (kept fully in memory)
• Reduces network overhead by keeping persistent TCP connections to the chunk server over an extended period of time
Disadvantages
• Small files can create hot spots on chunk servers if many clients access the same file
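As a small, purely illustrative sketch of how the fixed 64 MB chunk size is used (arithmetic based on the description above, not actual GFS client code): a client can translate a byte offset within a file into a chunk index, and then ask the master for the chunk handle and replica locations of that chunk.

public class ChunkMath {
  // Fixed GFS chunk size described above.
  static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB

  // The chunk that a given byte offset of a file falls into.
  static long chunkIndex(long byteOffset) {
    return byteOffset / CHUNK_SIZE;
  }

  public static void main(String[] args) {
    long offset = 200L * 1024 * 1024;                 // a read at the 200 MB mark
    System.out.println(chunkIndex(offset));           // prints 3 (the fourth chunk)
  }
}

Because chunks are large, a client reading sequentially needs to contact the master only once per 64 MB of data, which is exactly the first advantage listed above.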
Descriptions:
1. There is a single master in the whole cluster, which stores metadata.
2. The other nodes act as chunk servers for storing data.
3. The file system namespace and locking facilities are managed by the master.
4. The master periodically communicates with the chunk servers to collect management information and to give the chunk servers instructions for work such as load balancing or failure recovery.
5. With a single master, many complicated distributed algorithms can be avoided and the design of the system can be simplified.
6. However, the single GFS master could become a performance bottleneck and a single point of failure.
7. To mitigate this, Google uses a shadow master to replicate all the data on the master, and the design ensures that data operations are performed directly between clients and chunk servers; only control (metadata) messages are transferred between the master and the clients, and these can be cached for future use.
8. With the current quality of commodity servers, the single master can handle a cluster of more than 1,000 nodes.

The features of the Google File System are as follows:
1. GFS was designed for high fault tolerance.
2. The master and chunk servers can be restarted in a few seconds; with such fast recovery, the window of time in which data is unavailable can be greatly reduced.
3. Each chunk is replicated in at least three places and can tolerate at least two data crashes for a single chunk of data.
4. The shadow master handles failure of the GFS master.
5. For data integrity, GFS computes checksums on every 64 KB block in each chunk.
6. GFS can achieve the goals of high availability and high performance.
7. It demonstrates how to support large-scale processing workloads on commodity hardware that is designed to tolerate frequent component failures and is optimized for huge files that are mostly appended to and then read.

HADOOP
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop Architecture
At its core, Hadoop has two major layers, namely:
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
MapReduce
MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. MapReduce programs run on Hadoop, which is an Apache open-source framework.

Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications with large datasets.

Apart from the two core components mentioned above, the Hadoop framework also includes the following two modules:
• Hadoop Common − Java libraries and utilities required by other Hadoop modules.
• Hadoop YARN − A framework for job scheduling and cluster resource management.
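To show how these layers fit together from a programmer's point of view, here is a minimal driver sketch that submits the word-count mapper and reducer sketched earlier as a Hadoop job. The driver class name and the input/output paths are illustrative; job scheduling and resource management for the resulting map and reduce tasks are handled by YARN (or by the Jobtracker in older Hadoop versions).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // Wire in the map and reduce classes from the earlier sketch.
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);   // optional local aggregation
    job.setReducerClass(WordCount.IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input is read from HDFS; the output directory must not already exist.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Such a driver is typically packaged into a JAR and submitted with the hadoop jar command, passing an HDFS input path and a new output path as arguments.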
How Does Hadoop Work?
It is quite expensive to build bigger servers with heavy configurations to handle large-scale processing. As an alternative, you can tie together many commodity single-CPU computers as a single functional distributed system; in practice, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, this is cheaper than one high-end server. This is the first motivational factor behind using Hadoop: it runs across clustered, low-cost machines.

Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:
• Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB).
• These files are then distributed across various cluster nodes for further processing.
• HDFS, being on top of the local file system, supervises the processing.
• Blocks are replicated to handle hardware failure.
• Checking that the code was executed successfully.
• Performing the sort that takes place between the map and reduce stages.
• Sending the sorted data to a certain computer.
• Writing the debugging logs for each job.

Advantages of Hadoop
• The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, thereby utilizing the underlying parallelism of the CPU cores.
• Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.
• Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without interruption.
• Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms, since it is Java based.

Hadoop Distributed File System (HDFS)
With growing data velocity, the data size easily outgrows the storage limit of a single machine. A solution is to store the data across a network of machines; such filesystems are called distributed filesystems. Since data is stored across a network, all the complications of a network come in. This is where Hadoop comes in: it provides one of the most reliable filesystems. HDFS (Hadoop Distributed File System) is a unique design that provides storage for extremely
large files with a streaming data access pattern, and it runs on commodity hardware. Let's elaborate on these terms:
• Extremely large files: Here we are talking about data in the range of petabytes (1000 TB).
• Streaming data access pattern: HDFS is designed on the principle of write-once, read-many-times. Once data is written, large portions of the dataset can be processed any number of times.
• Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features which especially distinguishes HDFS from other file systems.

Nodes: Master and slave nodes typically form the HDFS cluster.
1. MasterNode:
• Manages all the slave nodes and assigns work to them.
• It executes filesystem namespace operations like opening, closing, and renaming files and directories.
• It should be deployed on reliable, high-specification hardware, not on commodity hardware.
2. SlaveNode:
• Slave nodes are the actual worker nodes, which do the actual work like reading, writing, and processing.
• They also perform creation, deletion, and replication upon instruction from the master.
• They can be deployed on commodity hardware.

HDFS daemons: Daemons are the processes running in the background.
• Namenodes:
• Run on the master node.
• Store metadata (data about data) like file paths, the number of blocks, block IDs, etc.
• Require a high amount of RAM.
• Store metadata in RAM for fast retrieval, i.e., to reduce seek time, though a persistent copy of it is kept on disk.
• DataNodes:
• Run on slave nodes.
• Require a large amount of storage, as the data is actually stored here.

Data storage in HDFS: Now let's see how the data is stored in a distributed manner. Suppose a 100 TB file is inserted. The master node (namenode) first divides the file into blocks (say 10 TB each for the sake of illustration; the actual default block size is 128 MB in Hadoop 2.x and above). These blocks are then stored across different datanodes (slave nodes). The datanodes replicate the blocks among themselves, and the information about which blocks they contain is sent to the master. The default replication factor is 3, which means that for each block 3 replicas are created (including the original). The replication factor can be increased or decreased by editing the configuration in hdfs-site.xml (see the example configuration fragment at the end of this section).
Note: The MasterNode has a record of everything; it knows the location and information of each and every datanode and the blocks they contain, i.e., nothing is done without the permission of the masternode.

Why divide the file into blocks?
Answer: Suppose we don't divide the file. It is very difficult to store a 100 TB file on a single machine, and even if we could, each read and write operation on that whole file would incur a very high seek time. With multiple blocks of size 128 MB, it becomes much easier to perform various read and write operations compared to doing them on the whole file at once. So we divide the file to achieve faster data access, i.e., to reduce seek time.

Why replicate the blocks on data nodes while storing?
Answer: Suppose we don't replicate, and only one copy of a block is present on datanode D1. If datanode D1 crashes, we lose that block, which makes the overall data inconsistent and faulty. So we replicate the blocks to achieve fault tolerance.

Terms related to HDFS:
• HeartBeat: The signal that a datanode continuously sends to the namenode. If the namenode doesn't receive a heartbeat from a datanode, it will consider that datanode dead.
• Balancing: If a datanode crashes, the blocks present on it are gone too, and those blocks become under-replicated compared to the remaining blocks. The master node (namenode) signals the datanodes containing replicas of the lost blocks to replicate them, so that the overall distribution of blocks is balanced.
• Replication: This is done by the datanodes. Note: No two replicas of the same block are placed on the same datanode.

Features:
• Distributed data storage.
• Blocks reduce seek time.
• The data is highly available, as the same block is present on multiple datanodes.
• Even if multiple datanodes are down, we can still do our work, making HDFS highly reliable.
• High fault tolerance.

Limitations: Though HDFS provides many features, there are some areas where it doesn't work well.
• Low-latency data access: Applications that require low-latency access to data, i.e., in the range of milliseconds, will not work well with HDFS, because HDFS is designed with high throughput of data in mind, even at the cost of latency.
• Small file problem: Having lots of small files results in lots of seeks and lots of movement from one datanode to another to retrieve each small file, which is a very inefficient data access pattern.
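For reference, here is an illustrative hdfs-site.xml fragment showing how the replication factor and block size discussed above are set; the values are examples, not defaults that apply to every cluster.

<configuration>
  <!-- Number of replicas created for each block (3 is the usual default). -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- HDFS block size in bytes (134217728 bytes = 128 MB). -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>

Changing these values affects files written after the change; existing blocks keep the replication factor and size they were written with unless re-replication is requested.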