Introduction to Data Storage and Cloud Computing

Unit II
Data Storage and Cloud Computing

Introduction to enterprise data storage
• There are two types of digital information input and output data
• Users provide the input data, computers provide the output data.
• But a computer’s CPU can't compute anything or produce output data without the
user’s input.
• Users can enter the input data directly into computer.
• With data storage space, users can save data onto a device and it is saved even if the
device is powered down
• Also instead of manually entering data into a computer you can construct the
computer to pull data from a storage devices
• Computers can read input data from successes as needed and it can then create and
save the output to the same sources or other storage location
• Organizations and users require data storage to meet today’s high-level
computational needs like big data projects, Ai, ML, IOT.
• Huge data storage is required to protect against data loss due to disaster, failure or
fraud.
• To avoid data loss, you can also employ data storage as backup solutions.

Direct Area/Attached Storage (DAS)
• Is often in the immediate area and directly connected to
the computing machine accessing it
• Often only one
• machine connects to it
• For example the
memory card on your
phone or a hard disk
attached to your laptop.
• DAS can provide
decent local backup services, too but sharing is limited.
• DAS devices include
floppy disk optical disk compact discs(CDs) and digital video
discs(DVDs) hard disk drives(HDD) flash drives and solid-
state drives(SSD).

Network Based Storage
• Allows more than one computer to access storage over a network,
making it better for data sharing and collaboration
• It's off-site storage capability also makes it better suited for
backups and data protection
• The storage can reside anywhere while the machines accessing it
could be somewhere else.
• For example when you store your data on Google Drive it
is stored on the storage owned and operated by Google
• You don't have control over the storage itself you can just use the
storage quota that you're eligible for
• For accessing Google Drive storage, you need to have a network
connection available
• Two common networks based-storage types are network attached
storage(NAS) and storage area network(SAN).

NAS(Network Attached Storage)
• NAS – are storage devices which connect to a network.
• NAS is often a single device made-up of redundant storage containers or a
redundant array of independent disks(RAID).
NAS typically has the following characteristics
• Single storage device
• File storage system
• TCP/ IP Ethernet network
• Limited users
• Limited speed
• Limited expansion options
• Lower cost and easy setup
• NAS systems are a type of the file service device
• A NAS is connected to the LAN just like a file server
• Rather than containing a full blown OS, it typically uses a slim microkernel
specialized for handling only I/O requests such as NFS( Unix), GIFS/8MB(
windows 2000/NT) and NCP( Netware)
• Adding or removing a NAS system is like adding or removing any network code.

Storage Area Network(SAN)
SAN is a computer network which provides access to consolidated, block-level data storage.
• SAN storage is a network of a multiple devices of various types, including SSD and flash storage,
hybrid storage, hybrid cloud storage, backup
software add appliances and cloud storage.
SAN typically has the following characteristics
• Network of multiple devices
• Block storage system
• Fibre Channel network
• Optimized for multiple users
• Faster performance
• Highly expendable
• Higher cost and complex setup
• In SAN, data is presented from storage devices to
machine such that the storage looks like it is locally attached.
• This is achieved through various types of data visualization techniques
• SAN storage provides a high speed network storage.
• In some cases SANS can be so large that they span multiple sites as well as internal data centers and
the cloud.

Data storage management
• It refers to the software and processes that improve the performance of data
storage resources
• It may include network virtualization, replication, mirroring, security, compression,
deduplication, traffic analysis, process automation, storage provisioning, and
memory management.
• These processes help businesses store more data on existing hardware, speed up data
retrieval, prevent data loss, mid data retention requirements, and reduce it expenses.
• Storage management makes it possible to reassign storage capacity quickly as business
needs change.
• Storage management techniques can be applied to primary, backup or archived storage.
• Primary storage holds effectively or frequently accessed data; backup storage holds copies
of primary storage data for use in disaster recovery; an archive storage holds outdated or
seldom used data that must be written for compliance or business continuity.
• Storage provisioning is management technique asset assigns storage capacity to servers,
computers, virtual machines and other devices.
• It may use automation to allocate storage space in a network environment.
• Intelligent storage management uses software policies and algorithms to automate the
provisioning and de provisioning of storage resources, continuously monitoring data
utilization and rebalancing data placement without human intervention

Cloud file system
• File system is an approach to manage an operate files and data on a
storage systems.
• There are various file systems, such as NTFS, FAT32,EXT4,etc that
are commonly used today in operating systems.
• File systems typically provide mechanism for reading, writing,
modifying, deleting or organizing files in folders and directories
• Cloud file systems are specifically designed to be distributed and
operated in the cloud based environment.
• Files are typically stored in chunks on various storage
servers( devices) such a distributed nature of file systems makes it
fault tolerant and also high performance due to the possible
parallelism on file operations.
Architecture for cloud systems fall into two categories
1) Client server architecture
2) Cluster based architecture hi

Client server architecture
• In client server architecture, the file server host the file
system that can be mounted( attached) by the clients.
• One file server can host multiple file shares and each file
share can be mounted and operated by multiple clients.
• All file operations are then synchronized back to the file
server so that the other clients that have mounted the same
file share can get the updates as well.
• One example of such a file system is network file system or
NFS.
• Client server based file system architecture could be limited
due to dependency on the availability of the file server and
the need to synchronize the file operations periodically.

• Client-Server Architecture
Client 1 Client 2 Client 3 Client 4
Share
1
Share
1
Share
2
Share
2
File Server

Cluster-based Architecture
• In a cluster based architecture, the file is broken into smaller
parts called chunks and each chunk is stored on the storage
server or devices.
• The chunks are redundantly stored on several
servers too withstand any fault and have high availability.
• This architecture does not depend upon a single server for
hosting the file system.
• The file system is distributed and provides parallelism that
significantly improves the scale and performance.
• This architecture is commuted to use today in the
cloud environment
for example Google file system, Amazon S3

• Cluster-Based architecture
Redundantly stored
File
Chunk
1
Chunk
2
Chunk
3
Storage
1
Storage
2
Storage
3

Google file systems
• The Google file system (GFS) is a distributed file system (DFS)
for data-centric applications with robustness, scalability, and
reliability.
• GFS can be implemented in commodity servers to support large-
scale file applications with high performance and high reliability.

Characteristics and features of GFS
1. fault tolerant: if a few disks are corrupted, the data stored on
them can still be restored and used.
2. big data size: the file system can manage several petabytes
of data without crashing.
3 high availability: the data is highly available( copied to
several disks) and is present across various clusters of disk.
4 performance: the file system provides very high performance
for read and write from the disks.
5 resource sharing: the file system allows sharing disk
resources across users.
6 Google cloud services: there are quite few Google cloud
services, such as big table that are built on GFS2 also other
Google Apps such as Gmail and maps use GFS2 as well.

• Cluster-Based architecture
Redundantly stored
Direct access to chunks
Chunk Mapping
File
Chunk
1
Chunk
2
Chunk
3
Chunk
Server
1
Chunk
Server
2
Chunk
Server
3
Master Server
Application

Hadoop Distributed File System(HDFS)
• Hadoop is an open source framework based on java that
manages the storage and processing of large amounts of data for
applications.
• Hadoop comes with a distributed file system called HDFS.
• In HDFS data is distributed over several machines and
replicated to ensure their durability to failure and high
availability to parallel application.
• It is cost effective as it uses commodity hardware.
• It involves the concept of blocks, data nodes and node name.
Where to use HDFS?
• Very Large Files: Files should be of hundreds of megabytes,
gigabytes or more.
• Streaming Data Access: The time to read whole data set is
more important than latency in reading the first.
HDFS is built on write-once and read-many-times pattern.
• Commodity Hardware: It works on low cost hardware.

Features of HDFS
• Highly Scalable - HDFS is highly scalable as it can scale hundreds
of nodes in a single cluster.
• Replication - Due to some unfavorable conditions, the node
containing the data may be loss.
• So, to overcome such problems, HDFS always maintains the copy of
data on a different machine.
• Fault tolerance - The HDFS is highly fault-tolerant that if any
machine fails, the other machine containing the copy of that data
automatically become active.
• Distributed data storage - This is one of the most important
features of HDFS that makes Hadoop very powerful.
• Here, data is divided into multiple blocks and stored into nodes.
• Portable - HDFS is designed in such a way that it can easily
portable from platform to another.

Goals of HDFS
• Handling the hardware failure - The HDFS
contains multiple server machines.
Anyhow, if any machine fails, the HDFS goal is to
recover it quickly.
• Streaming data access - The HDFS applications
usually run on the general-purpose file system.
This application requires streaming access to their
data sets.
• Coherence Model - The application that runs on
HDFS require to follow the write-once-ready-many
approach.
So, a file once created need not to be changed.
However, it can be appended and truncate.

Bigtable
• Cloud Bigtable is a sparsely populated table that can scale
to billions of rows and thousands of columns, enabling you
to store terabytes or even petabytes of data.
• A single value in each row is indexed; this value is known
as the row key.
• Bigtable is a fully managed wide-column and key-
value NoSQL database service for large analytical
and operational workloads as part of the Google
Cloud portfolio.

High-level Architecture of Bigtable

High Level architecture of BigTable
• A big table implementation has three major component
1 One master server
• The master server is responsible for assigning tablets to tablet servers, detecting
the addition and expiration of a tablet servers, balancing tablet server load, and
garbage collection of files in GFS.
• in addition it handle schema changes such as table and column family creations.
2 Many tablet server
• each tablet server manages a set of tablets( typically you can have somewhere
between 10 to 1000 tablets per tablet server)
• tablet servers can be dynamically added or removed from a cluster
to accommodate changes in workloads.
• the tablet server handles read and write requests to the table that it has loaded,
and also splits tablets that have grown too large
3 Chubby
Is a highly available and persistent distributed lock service that manages
leases for resources and stores configuration information.
The service runs with five replicas, one of which is elected as a master to
serve request.

Features and characteristics of a big table
• Massive scale: big table is designed to store and
process massive( petabytes and more) volumes of
data.
• High performance: bigtable is designed to provide
very high performance with under less than 10
millisecond latency
• Run on commodity hardware: bigtable is
distributed in nature that allows it to run in parallel
on commodity hardware. you do not require any
specialized hardware to run big table.
• Flexibility: big table schema parameters let users
dynamically control whether to serve data out of
memory or from the disk. data is indexed using row
and column names that can be arbitrary strings.

HBase
HBase is an open-source non-relational distributed
database modeled after Google's Bigtable and written in
Java.
• It is developed as part of Apache Software Foundation's
Apache Hadoop project and runs on top of HDFS.
• Apache HBase is known as the Hadoop database.
• It is a column oriented, distributed and scalable big data
store.
• It is also known as a type of NoSQL database that is not
a relational database management system.

• HBase applications are also written in Java, built on top of
Hadoop and runs on HDFS.
• HBase is used when you need real-time read/write and random
access to big data.
• HBase is modeled based on Google's BigTable concepts.
• HBase is a column-oriented non-relational database
management system that runs on top of Hadoop Distributed
File System (HDFS).
• HBase provides a fault-tolerant way of storing sparse data
sets, which are common in many big data use cases.

Characteristics and features of a Hbase
1 Highly scalable: at your base is highly scalable and is designed to
handle petabytes of data. it can run on thousands of servers in parallel
2 High performance: HBase provides low latency reads and writes
to data and thus allowing for fast processing of massive datasets
3 No SQL database: this is not a traditional relational database. it is
a no SQL database that allows storing arbitrary key value pairs.
4 Fault tolerant: it reads splits data stored in tables across
multiple machines in the cluster and is built to withstand individual
machine failures in a cluster
5 API support: enterprise provides Java APIs using which you can
perform several operations on HBase data is stored in it

High-level Architecture of HBase

1 HDFS: all HBase data is stored on HDFS.
2 Regions: Tables in HBase are divided horizontally by row key range in two regions. A
region contains all rows in the table between the regions start key and end key. regions are
assigned to the nodes in the cluster called region. Servers and these serve data for reads and
writes to the clients. A region server can serve around 1000 regions.
3 Master server( HMaster): The master server coordinates the cluster and performs
administrative operations such as assigning regions to the region servers and balancing the
load. it also performed other administrative operations such as creating and deleting the
tables.
4 Region Servers( HRegion): The region servers perform data processing. each region
server stores a subset of the data of each table. clients talk to region servers to access the data
in HBase.
5 Zookeeper: is the centralized service for maintaining configuration information, naming,
providing distributed synchronization and providing group services.
zookeeper maintains which region servers(HRegion) are alive and available and provides
server failure notification to the master server(HMaster) to coordinate administrative tasks
such as region assignment
• Establishing communication across the Hadoop cluster
• Maintaining configuration information
• Tracking Region Server and HMaster failure
• Maintaining Region Server information

DynamoDB
• DynamoDB is a fully managed NoSQL database service that
allows to create database tables that can store and retrieve any
amount of data.
• It automatically manages the data traffic of tables over
multiple servers and maintains performance.
• It also relieves the customers from the burden of operating and
scaling a distributed database.
• Hence, hardware provisioning, setup, configuration,
replication, software patching, cluster scaling, etc. is managed
by Amazon.
• With DynamoDB, you can create database tables that can store
and retrieve any amount of data and serve any level of request
traffic.
• It is one of the main components of Amazon.com, the
biggest e-commerce stores in the world.

Characteristics and Features of DynamoDB
• Scalable − Amazon DynamoDB is designed to scale. There is no need to
worry about predefined limits to the amount of data each table can store.
Any amount of data can be stored and retrieved. DynamoDB will spread
automatically with the amount of data stored as the table grows.
• Fast − Amazon DynamoDB provides high throughput at very low latency.
As datasets grow, latencies remain stable due to the distributed nature of
DynamoDB's data placement and request routing algorithms.
• Durable and highly available − Amazon DynamoDB replicates data over
at least 3 different data centers’ results. The system operates and serves data
even under various failure conditions.
• Flexible: Amazon DynamoDB allows creation of dynamic tables, i.e. the
table can have any number of attributes, including multi-valued attributes.
• Cost-effective: Payment is for what we use without any minimum charges.
Its pricing structure is simple and easy to calculate.

Architecture of Dynamo
Client Interface
Clients
Dynamo Node 1
Request Coordination
Membership and failure
detection
Local Persistence engine
Dynamo Node 2
Membership and failure detection
Dynamo Node 3
Membership and failure
detection

In Dynamo each storage node has three main software components that are implemented in Java
1 Request coordination
• The coordinator executes the read and write request on behalf of clients by collecting data
from one or more nodes( for reads) or storing data at one or more nodes( for writes).
• Each client requests result in the creation of a state machine on the node that received the
client request.
• The state machine contains all the logic for identifying the nodes responsible for a key,
sending the request, waiting for the responses, potentially doing retries, processing the replies
and packaging the response to the client. each state machine instance handles exactly 1
client request.
2 Membership and failure detection
• Failure detection in Dynamo is used to avoid attempts to communicate with unreachable peer
nodes.
• For the purpose of avoiding failed attempts at communication, a purely local mechanism of a
failure detection is used.
• For example node A may consider node B failed if node B does not respond to node A's
messages. Node A quickly discovers that node B is unresponsive when B fails to respond to
A’s message. Node A then uses alternate nodes to service request that map to B’s partitions.
Node A periodically retries node B to check for node B’s recovery. decentralized failure
detection protocols using simple gossip style protocol that enable each node in the system to
learn about the arrival of other nodes

3 A local persistence engine
• Dynamo provides the flexibility to choose the
underlying persistent storage based on application
requirements.
• The main reason for designing a pluggable persistent
component is to choose the storage engine best suited
for an applications access patterns.
• For instance some database can handle objects typically
in the order of 10s of kilobytes whereas some can
handle objects of larger sizes.
• Applications choose Dynamos local persistence engine
based on their object size distribution

Google cloud data store
• Cloud storage is a cloud computing model that stores data on the internet through a
cloud computing provider who manages and operates data storage as a service.
• In this fast-moving world it become necessary to store data on the cloud storage.
• The biggest advantage of cloud storage is that we can store any type of data in
digital form on the cloud.
• Another advantage of cloud storage is that we can access data from anywhere
anytime on any device.
• There are many cloud storage providers such as, Google Drive, Dropbox,
OneDrive, iCloud, etc.
• They provide free service for limited storage but if you want to store beyond
the limit, you have to pay.

Using grids for data storage( grid
oriented storage)

Cloud Storage
Cloud storage is a data deposit model in which digital information such as
documents, photos, videos and other forms of media are stored on virtual or
cloud servers hosted by third parties.
It allows you to transfer data on an offsite storage system and access them
whenever needed.
Cloud storage is a cloud computing model that allows users to save
important data or media files on remote, third-party servers.
Users can access these servers at any time over the internet. Also known as
utility storage, cloud storage is maintained and operated by a cloud-based
service provider.

Data Management in Cloud Storage
Cloud data management is the practice of storing a company’s
data at an offsite data center that is typically owned and overseen
by a vendor who specializes in public cloud infrastructure, such as
AWS or MicrosoftAzure.
Managing data in the cloud provides an automated backup
strategy, professional support, and ease of access from any
location.

Cloud Provisioning
Cloud provisioning means allocating a cloud service provider’s resources to
a customer.
It is a key feature of cloud computing.
It refers to how a client gets cloud services and resources from a provider.
The cloud services that customers can subscribe to include infrastructure-as-
a-service (IaaS), software-as-a-service (SaaS), and platform-as-a-service
(PaaS) in public or private environments.

Types of Cloud Provisioning
Network Provisioning: Network Provisioning in the telecom industry is a means of
referring to the provisions of telecommunications services to a client.
Server Provisioning: Datacenter’s physical infrastructure, installation, configuration of
the software, and linking it to middleware, networks, and storage.
User Provisioning: It is a method of identity management that helps us in keeping a
check on the access and privileges of authorization. Provisioning is featured by the
artifacts such as equipment, suppliers, etc.
Service Provisioning: It requires setting up a service and handling its related data.

Data Intensive Technology in Cloud Computing
Data Intensive Computing is a class of parallel computing which uses data parallelism
in order to process large volumes of data.
The size of this data is typically in terabytes or petabytes.
This large amount of data is generated each day and it is referred to Big Data.
Data intensive computing has some characteristics which are different from other
forms of computing. They are:
1. In order to achieve high performance in data intensive computing, it is necessary to
minimize the movement of data. This reduces system overhead and increases
performance by allowing the algorithms to execute on the node where the data
resides.
2. The data intensive computing system utilizes a machine independent approach
where the run time system controls the scheduling, execution, load balancing,
communications and the movement of programs.

Data Intensive Technology in Cloud Computing
Data intensive computing has some characteristics which are different from other
forms of computing. They are:
3. Data intensive computing hugely focuses on reliability and availability of data.
Traditional large scale systems may be susceptible to hardware failures,
communication errors and software bugs, and data intensive computing is designed
to overcome these challenges.
4. Data intensive computing is designed for scalability so it can accommodate any
amount of data and so it can meet the time critical requirements. Scalability of the
hardware as well as the software architecture is one of the biggest advantages of
data intensive computing.

Cloud Storage from LANs to WANs
Characteristics :
1. Computer power is elastic, when it can perform parallel operations. In
general, applications conceived to run on the peak of a shared-nothing
architecture are well matched for such an environment. Some cloud computing
goods, for example, Google’s App Engine, supply not only a cloud computing
infrastructure, but also an entire programs stack with a constrained API so that
software developers are compelled to compose programs that can run in a
shared-nothing natural environment and therefore help elastic scaling.

Characteristics :
2. Data is retained at an unknown host server. In general, letting go off data is
a threat to many security issues and thus suitable precautions should be taken.
The title ‘cloud computing’ implies that the computing and storage resources are
being operated from a celestial position.
The idea is that the data is physically stored in a specific host country and is
subject to localized laws and regulations. Since most cloud computing vendors
give their clientele little command over where data is stored, the clientele has no
alternative but to expect the least that the data is encrypted utilizing a key
unavailable with the owner, the data may be accessed by a third party without
the customer’s knowledge.

Characteristics :
3. Data is duplicated often over distant locations. Data accessibility and
durability is paramount for cloud storage providers, as data tampering can be
impairing for both the business and the
accessibility and durability are normally
organization’s reputation. Data
accomplished through hidden
replications. Large cloud computing providers with data hubs dispersed all
through the world have the proficiency to provide high levels of expected error
resistance by duplicating data at distant locations across continents. Amazon’s
S3 cloud storage service replicates data over ‘regions’ and ‘availability zones’ so
that data and applications can survive even when the whole location collapses.

Distributed Data Storage :
Distributed storage means are evolving from the existing practices of data
storage for the new generation of WWW applications through organizations like
Google, Amazon and Yahoo. There are some reasons for distributed storage
means to be favoured over traditional relational database systems encompassing
scalability, accessibility and performance. The new generation of applications
require processing of data to a tune of terabytes and even peta bytes. This is
accomplished by distributed services. Distributed services means distributed
data.

CouchDB
• CouchDB is a document-oriented database server.
• Couch is an acronym for ‘Cluster Of Unreliable Commodity Hardware’, emphasizing the
distributed environment of the database.
• CouchDB is designed for document-oriented applications for example, forums, bug following,
wiki, Internet note, etc. CouchDB is ad-hoc and schema-free with a flat address space.
• CouchDB aspires to persuade the Four Pillars of Data Management by these methods:
1. Save: ACID(Atomicity, Consistency, Isolation, and Durability). compliant, save efficiently
2. See: Easy retrieval, straightforward describing procedures, fulltext search
3. Secure: Strong compartmentalization, ACL, connections over SSL
4. Share: Distributed means
• A purchaser sees a snapshot of the data and works with it even if it is altered at the same time by a
distinct client.
• CouchDB actually has no apparent authentication scheme, i.e., it is in-built.
• The replication is distributed. A server can revise others once the server is made offline and data is
changed.
• If there are confrontations, CouchDB will choose a survivor and hold that as latest.
• Users can manually suspend this surviving alternative later.
• Importantly, the confrontation tenacity yields identical results comprehensively double-checking on
the offline revisions.

Introduction to Data Storage and Cloud Computing

Recommended

Recommended

More Related Content

Similar to Introduction to Data Storage and Cloud Computing

Similar to Introduction to Data Storage and Cloud Computing (20)

Recently uploaded

Recently uploaded (20)

Introduction to Data Storage and Cloud Computing