Cloud Storage Infrastructures
Prof. NIKHILKUMAR B SHARDOOR
Department of Computer Science Engineering.
School of Engineering, MIT ADT University, Pune.
Content
• Introduction to Cloud Storage Infrastructure.
• Direct-Attached Storage (DAS) architecture.
• Storage Area Network (SAN) attributes -components, topologies, connectivity
options and zoning.
• SANs: FC protocol stack, addressing, flow control.
• Networked Attached Storage (NAS) components, protocols.
• IP Storage Area Network (IP SAN) iSCSI, FCIP and FCoE architecture.
• Content Addressed Storage (CAS) elements, storage, and retrieval processes.
• Server architectures- Stand-alone, blades, stateless, clustering.
• Cloud file systems: GFS and HDFS, BigTable, HBase and Dynamo.
Introduction
Cloud Storage : Cloud storage is a service model in which data is maintained, managed, backed
up remotely and made available to users over a network (typically the Internet).
Cloud Storage Infrastructure : A cloud storage infrastructure is the hardware and software
framework that supports the computing requirements of a private or public cloud storage service.
Both public and private cloud storage infrastructures are known for their elasticity, scalability and
flexibility.
Cloud General Architecture:
Cloud storage architectures are primarily about delivering storage on demand in a highly scalable, multi-tenant way. They typically consist of a front end that exports an API through which the storage is accessed.
Cloud Storage Architecture
Characteristic: Description
• Manageability: the ability to manage the system with minimal resources.
• Access method: the protocol through which cloud storage is exposed.
• Performance: performance as measured by bandwidth and latency.
• Multi-tenancy: support for multiple users (tenants).
• Scalability: the ability to scale gracefully to meet higher demand or load.
• Data availability: a measure of the system's uptime.
• Control: the ability to control the system, in particular to configure it for cost, performance or other characteristics.
• Storage efficiency: a measure of how efficiently the raw storage is used.
• Cost: a measure of the cost of the storage (commonly in dollars per gigabyte).
Fig. General Cloud Architecture
Cloud Storage Types
• DAS – Direct Attached Storage.
• NAS – Network Attached Storage.
• SAN – Storage Area Network.
Which storage technology should I use for my business application?
Cloud Storage Infrastructure – Direct Attached Storage(DAS)
• DAS – Direct attached Storage
• DAS stands for Direct Attached Storage and as the name suggests,
it is an architecture where storage connects directly to hosts.
• Examples of DAS include hard drives, SSD, optical disc drives
and external storage drives.
• DAS is ideal for localized data access and sharing in environments with small servers, for instance small businesses and departments.
• Applications access data through block-level protocols, and DAS can also be used in combination with SAN and NAS.
Cloud Storage Infrastructure – Direct Attached Storage(DAS)
Based on the location of storage devices with respect to host, DAS can be classified as external or
internal.
Internal DAS: the storage device is connected inside the host by serial or parallel buses.
Most internal buses have distance limitations, so they support only short-distance connectivity and a limited number of devices. They also hamper maintenance because the devices occupy a large amount of space inside the server.
External DAS: the server connects directly to external storage devices, communicating over the SCSI or FC protocol.
External DAS overcomes the distance and device-count limitations of internal DAS and also allows central administration of the storage devices.
Cloud Storage Infrastructure – Direct Attached Storage(DAS)
Why and why not to go for DAS?
Why to go for DAS:
• It requires a lower investment than other storage networking architectures.
• Less hardware and software are needed to set it up and operate it.
• Configuration is simple and deployment is easy.
• Managing DAS is easy because host-based tools such as the host OS are used.
Why not to go for DAS:
• The major limitation of DAS is that it does not scale well: it restricts the number of hosts that can be directly connected to the storage.
• Limited bandwidth in DAS caps the available I/O processing capability; once that limit is reached, service availability may be compromised.
• It does not make optimal use of resources because front-end ports cannot be shared.
Cloud Storage Infrastructure –Network Attached Storage(NAS)
NAS is a file-level computer data storage server connected to a network, providing data access to a diverse group of clients.
NAS is specialized for its task by its hardware, software or both, and provides the advantage of server consolidation by removing the need for multiple file servers.
NAS runs its own operating system, which works with its own peripheral devices.
A NAS operating system is optimized for file I/O and therefore performs file I/O better than a general-purpose server.
NAS uses protocols such as TCP/IP, CIFS and NFS for data transfer and for accessing the remote file service.
Components of NAS
A NAS head, which is essentially a CPU and memory.
One or more network interface cards (NICs).
An optimized operating system.
File-sharing protocols (NFS or CIFS).
Protocols to connect and manage the storage devices, such as ATA, SCSI or FC.
Cloud Storage Infrastructure –Network Attached Storage(NAS)
• Centralized storage device for storing data on a
network.
• Will have multiple hard drives in RAID
configuration.
• Directly attaches to a switch or router on a
network.
• Are used in Small businesses.
Drawbacks
• Single point of Failure.
Fig: Network Attached Storage
Cloud Storage Infrastructure –Storage Area Network(SAN)
• A storage area network (SAN) provides access to consolidated, block-level data storage that is accessible by applications running on any of the networked servers.
• It carries data between servers (hosts) and storage devices through Fibre Channel switches.
• A SAN helps organizations connect geographically isolated hosts and provides robust communication between hosts and storage devices.
• In a SAN, each server and storage device is linked through a switch, and the network provides features such as storage virtualization, quality of service, security and remote mirroring.
Components of SAN: Cabling, Host Bus Adapters (HBA) and Switches.
• Cabling is the physical medium used to establish a link between SAN devices.
• An HBA, or host bus adapter, is an expansion card that fits into an expansion slot in a server.
• A switch handles and directs traffic between different network devices: it accepts traffic and transmits it to the desired endpoint device.
Cloud Storage Infrastructure –Storage Area Network(SAN)
• A special high-speed network that stores and provides access to large amounts of data.
• SANs are fault tolerant.
• Data is shared among several disk arrays.
• Servers access data as if it were on a local drive.
• iSCSI (cheaper) and FC (more expensive) protocols are used.
• SAN traffic is isolated from regular LAN traffic.
• Highly scalable, highly redundant and high speed (interconnected with Fibre Channel).
• Expensive.
Fig: Storage Area Network
Cloud Storage Infrastructure –Key Difference between DAS, NAS and SAN
• DAS–Directly Attached Storage.
-Usually disk or tape.
-Directly attached by a cable to the computer processor. (The hard disk drive inside a PC or a tape drive attached to a single server are simple types of DAS.)
-I/O requests (also called protocols or commands) access the devices directly.
• NAS–Network Attached Storage.
-A NAS device (“appliance”), usually an integrated processor plus disk storage, is attached to a TCP/IP-based
network (LAN or WAN), and accessed using specialized file access/file sharing protocols.
-File requests received by a NAS are translated by the internal processor to device requests.
• SAN-Storage Area Network.
-Storage resides on a dedicated network.
-I/O requests access devices directly.
-Uses Fiber Channel media, providing an any-to-any connection for processors and storage on that network.
-Ethernet media using an I/O protocol called iSCSI is also emerging.
DAS, NAS, SAN – Best-Case vs Worst-Case Scenarios

DAS
Best case: DAS is ideal for small businesses that only need to share data locally, have a defined, non-growth budget to work with, and have little to no IT support to maintain a complex system.
Worst case: DAS is not a good choice for businesses that are growing quickly, need to scale quickly, need to share across distance and collaborate, or support a lot of system users and activity at once.

NAS
Best case: NAS is perfect for SMBs and organizations that need a minimal-maintenance, reliable and flexible storage system that can quickly scale up as needed to accommodate new users or growing data.
Worst case: Server-class devices at enterprise organizations that need to transfer block-level data supported by a Fibre Channel connection may find that NAS can't deliver everything that's needed. Maximum data-transfer issues could be a problem with NAS.

SAN
Best case: SAN is best for block-level sharing of mission-critical files or applications at data centers or large-scale enterprise organizations.
Worst case: SAN can be a significant investment and is a sophisticated solution that's typically reserved for serious large-scale computing needs. A small-to-midsize organization with a limited budget and few IT staff or resources likely wouldn't need SAN.
Storage Networking (FC, iSCSi, FCoE)
Fibre Channel (FC) is a technology for transmitting data between computer devices, currently at data rates of up to 20 Gbps, with higher rates expected.
• Fibre Channel began in the late 1980s as part of the IPI (Intelligent Peripheral Interface) Enhanced Physical Project to increase the capabilities of the IPI protocol. That effort widened to investigate other interface protocols as candidates for augmentation. In 1998, Fibre Channel was approved as a project and has since become an industry standard.
iSCSI - Internet Small Computer System Interface, is a storage networking standard used to link different storage
facilities.
• iSCSI is used to transmit data over local area networks, wide area networks or the Internet. It enables location-independent data storage and retrieval and is one of the two main approaches to storage data transmission over IP networks.
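The encapsulation idea behind iSCSI can be sketched in a few lines of Python. This is a deliberately simplified illustration of wrapping a SCSI command descriptor block (CDB) in a message suitable for a TCP byte stream; the function names and framing are hypothetical, not the real iSCSI PDU layout (an actual iSCSI PDU begins with a 48-byte Basic Header Segment).

```python
import struct

def encapsulate_cdb(cdb: bytes, lun: int) -> bytes:
    """Wrap a SCSI CDB in a length-prefixed message for a TCP stream.
    Simplified illustration only, NOT the real iSCSI PDU format."""
    header = struct.pack("!HB", len(cdb), lun)  # 2-byte length, 1-byte LUN
    return header + cdb

def decapsulate(message: bytes):
    """Recover the LUN and CDB from an encapsulated message."""
    length, lun = struct.unpack("!HB", message[:3])
    return lun, message[3:3 + length]

# A 6-byte READ(6) CDB: opcode 0x08, LBA, transfer length, control byte
read6 = bytes([0x08, 0x00, 0x00, 0x10, 0x01, 0x00])
msg = encapsulate_cdb(read6, lun=0)
assert decapsulate(msg) == (0, read6)
```

The point is only that block-level SCSI commands travel unchanged inside ordinary TCP/IP traffic, which is what lets iSCSI run over any IP network.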
Fibre Channel over IP (FCIP) translates Fibre Channel control codes and data into IP packets for transmission between geographically distant Fibre Channel SANs.
FCoE Benefits
• Mapping of Fibre Channel frames over Ethernet
• Fibre Channel enabled to run on a lossless Ethernet
network
• Wire server only once
• Fewer cables and adapters
• Software provisioning of I/O
• Interoperates with existing Fibre Channel SANs
• No gateway; stateless
iSCSI Benefits
• SCSI transport protocol that operates over TCP
• Encapsulation of SCSI command descriptor blocks and data
in TCP/IP byte streams
• Wire server only once
• Fewer cables and adapters
• New operational model
• Broad industry support; OS vendors support their iSCSI
drivers, gateways (routers, bridges), and native iSCSI storage
arrays
Difference between FCIP and FCoE
• FCIP uses a tunnel to transfer data between networks: it encapsulates Fibre Channel frames in TCP/IP.
• FCoE was developed to simplify switches and consolidate I/O in comparison with FCIP. It replaces FC links with high-speed Ethernet links between the devices that support the network.
• iFCP is a newer standard that broadens the way data can be transferred over the Internet; it combines elements of the FCIP and iSCSI protocols.
For more details, refer to:
Link 1 https://www.cisco.com/c/en/us/products/collateral/switches/nexus-5000-series-switches/white_paper_c11-495142.html
link 2: http://www.provision.ro/storage-infrastructure/storage-networking-fc-iscsi-fcoe#pagei-1|pagep-1|
Summary:
• FCoE was not designed to make iSCSI obsolete. iSCSI has many applications that FCoE does not cover, in particular in low-
end systems and in small, remote branch offices, where IP connectivity is of paramount importance.
• Some customers have limited I/O requirements in the 100-Mbps range, and iSCSI is just the right solution for them. This is
why iSCSI has taken off and is so successful in the SMB market: it is cheap, and it gets the job done.
• Large enterprises are adopting virtualization, have much higher I/O requirements, and want to preserve their investments and
training in Fibre Channel. For them, FCoE is probably a better solution.
• FCoE will take a large share of the SAN market. It will not make iSCSI obsolete, but it will reduce its potential market.
Cloud File System
A cloud file system is a distributed file system that allows many clients to access data and supports operations on that data.
A cloud file system also ensures security in terms of confidentiality, availability and integrity.
Types of Cloud File System
• GFS - Google File System.
• HDFS- Hadoop Distributed File System.
• BigTable
• HBase
• Dynamo
Cloud File System: Google File System
Fig: Architecture of GFS
• GFS is a proprietary distributed file
system developed by Google for its own
use.
• GFS is used to store and process huge
volumes of data in a distributed
manner.
• GFS consists of a single master and
multiple chunk servers.
• Files are divided into fixed-size chunks of 64 MB each.
• Each chunk is replicated on multiple
chunk servers (3 by default). Even if any
chunk server crashes, the data file will
still be present in other chunk servers.
Cloud File System: Google File System
Files are divided into fixed-size chunks of 64 MB each.
Each chunk is replicated on multiple chunk servers (3 by default), so even if a chunk server crashes, the data is still present on the other chunk servers.
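The chunking-and-replication scheme described above can be sketched as follows. The helper names are hypothetical; a real GFS master also weighs disk utilisation and rack placement when choosing replica locations.

```python
def split_into_chunks(data: bytes, chunk_size: int = 64 * 1024 * 1024):
    """Split a file's bytes into fixed-size chunks (GFS default: 64 MB)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def assign_replicas(num_chunks: int, servers: list, replication: int = 3):
    """Place each chunk on `replication` distinct chunk servers, round-robin.
    (A real GFS master also considers disk utilisation and rack placement.)"""
    return {c: [servers[(c + r) % len(servers)] for r in range(replication)]
            for c in range(num_chunks)}

# Demo with a tiny chunk size so the example stays small
chunks = split_into_chunks(b"x" * 200, chunk_size=64)
assert [len(c) for c in chunks] == [64, 64, 64, 8]   # last chunk is partial
placement = assign_replicas(len(chunks), ["s1", "s2", "s3", "s4"])
assert all(len(set(reps)) == 3 for reps in placement.values())
```

Because every chunk lands on three distinct servers, any single chunk-server crash leaves two live copies of each chunk.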
Cloud File System: HDFS
• HDFS is an Apache project; systems at Yahoo, Facebook, IBM and others are based on it.
• HDFS is the storage unit of Hadoop that is used to store
and process huge volumes of data on multiple data
nodes.
• It is designed to run on low-cost hardware and distributes data across the nodes of Hadoop clusters.
• It has high fault tolerance and throughput.
• A large file is broken down into small blocks of data with a default block size of 128 MB, which can be increased as per requirement.
• Multiple copies of each block are stored in the cluster in
a distributed manner on different nodes.
Fig: Architecture of HDFS
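The fault-tolerance claim can be illustrated with a small sketch (hypothetical function and node names): with three replicas per block, any single DataNode failure still leaves every block readable.

```python
def readable_after_failure(placement, failed_node):
    """True if every block still has a replica on a live DataNode after
    `failed_node` crashes. Illustrative sketch with hypothetical names."""
    return all(any(n != failed_node for n in nodes)
               for nodes in placement.values())

# block id -> DataNodes holding a replica (3x replication, as in HDFS)
placement = {
    0: ["dn1", "dn2", "dn3"],
    1: ["dn2", "dn3", "dn4"],
    2: ["dn3", "dn4", "dn1"],
}
assert readable_after_failure(placement, "dn2")          # no block is lost
assert not readable_after_failure({0: ["dn2"]}, "dn2")   # one replica is fragile
```

In the real system the NameNode notices the failed DataNode via missed heartbeats and re-replicates the affected blocks to restore the replication factor.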
Cloud File System GFS Vs HDFS
GFS and HDFS are similar in many respects and both store very large data sets.
However, they differ in a few key aspects, summarized below:
Key aspect: Load division
GFS: comprises a single master node and multiple chunk servers.
HDFS: has a single NameNode and multiple DataNodes in the file system.

Key aspect: Block size
GFS: stores its data in blocks; the default block size is 64 MB.
HDFS: divides data into blocks; the default block size is 128 MB.

Key aspect: Data chunk storage location
GFS: polls all the chunk servers at startup and does not maintain a persistent record of the replication information for any particular data chunk.
HDFS: maintains the record of all DataNode information in the NameNode.
Cloud File System GFS Vs HDFS
Key aspect: Atomic record appends
GFS: provides an append operation together with an offset option: users can append to the same file at different offsets. This gives GFS a random read/write ability that HDFS lacks.
HDFS: can append to a file, but does not provide an offset option.

Key aspect: Data integrity
GFS: chunk servers use checksums to detect corruption of the stored data; corruption can also be detected by comparing a file's replicas.
HDFS: checks the contents of HDFS files using client software that applies checksum verification.

Key aspect: Deletion
GFS: the resources of deleted files are not reclaimed immediately as in HDFS; instead, the file is renamed to a hidden name and is permanently removed by garbage collection if it has not been restored within three days.
HDFS: deleted files are moved into a particular folder and are then removed by a garbage collector.

Key aspect: Snapshot
GFS: allows individual files and directories to be snapshotted.
HDFS: supports up to 65,536 snapshots per snapshottable directory.
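The per-block checksum approach both systems use for data integrity can be sketched in Python. The block size and helper names here are illustrative (HDFS actually checksums much smaller units, 512 bytes by default).

```python
import zlib

BLOCK = 4096  # checksummed unit, illustrative only

def checksums(data: bytes):
    """CRC32 checksum for each fixed-size block of the data."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def detect_corruption(data: bytes, stored):
    """Indices of blocks whose checksum no longer matches the stored one."""
    return [i for i, (a, b) in enumerate(zip(checksums(data), stored)) if a != b]

original = bytes(10000)             # three blocks of zero bytes
stored = checksums(original)        # computed at write time
corrupted = original[:5000] + b"\x01" + original[5001:]
assert detect_corruption(corrupted, stored) == [1]   # byte 5000 lies in block 1
```

On a mismatch, the storage node can discard the bad block and serve (or re-replicate) a healthy copy from another server.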
Cloud File System: BigTable
• Bigtable is a compressed, high-performance, proprietary data storage system built on the Google File System, developed by Google.
• Designed to scale to a very large size
• Petabytes of data across thousands of servers
• Used for many Google projects
• Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance
• Flexible, high-performance solution for all of Google’s products
Goals
• Want asynchronous processes to be continuously updating different pieces of data
• Want access to most current data at any time
• Need to support:
• Very high read/write rates (millions of ops per second)
• Efficient scans over all or interesting subsets of data
• Efficient joins of large one-to-one and one-to-many datasets
• Often want to examine data changes over time
• E.g. Contents of a web page over multiple crawls
Building Blocks
• Building blocks:
• Google File System (GFS): Raw storage
• Scheduler: schedules jobs onto machines
• Lock service: distributed lock manager
• MapReduce: simplified large-scale data processing
• BigTable uses of building blocks:
• GFS: stores persistent data (SSTable file format for storage of data)
• Scheduler: schedules jobs involved in BigTable serving
• Lock service: master election, location bootstrapping
• MapReduce: often used to read/write BigTable data
Basic Data Model
• A BigTable is a sparse, distributed persistent multi-dimensional sorted map
(row, column, timestamp) -> cell contents
• Good match for most Google applications
WebTable Example
• Want to keep copy of a large collection of web pages and related information
• Use URLs as row keys
• Various aspects of web page as column names
• Store contents of web pages in the contents: column under the timestamps when they were fetched.
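The (row, column, timestamp) → contents map can be sketched as a toy Python class. This is a hypothetical illustration of the data model, not the Bigtable API.

```python
class SparseTable:
    """Toy sketch of Bigtable's data model: a map from
    (row, column, timestamp) to cell contents."""
    def __init__(self):
        self.cells = {}

    def put(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column):
        """Most recent value stored for (row, column), or None."""
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if (r, c) == (row, column)]
        return max(versions)[1] if versions else None

# WebTable-style usage: URL row key, contents: column, fetch-time timestamps
t = SparseTable()
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 2, "<html>v2</html>")
assert t.get("com.cnn.www", "contents:") == "<html>v2</html>"
```

Keeping the timestamp in the key is what lets one cell hold several versions of a crawled page side by side.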
Rows
• Name is an arbitrary string
• Access to data in a row is atomic
• Row creation is implicit upon storing data
• Rows ordered lexicographically
• Rows close together lexicographically usually on one or a small number of machines
• Reads of short row ranges are efficient and typically require communication with a small number of
machines.
• Can exploit this property by selecting row keys so they get good locality for data access.
• Example:
math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu
VS
edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys
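The locality benefit of reversed-domain row keys can be demonstrated directly (the helper name is hypothetical):

```python
def reverse_domain(host: str) -> str:
    """math.gatech.edu -> edu.gatech.math, so pages from the same
    institution become lexicographic neighbours."""
    return ".".join(reversed(host.split(".")))

hosts = ["math.gatech.edu", "math.uga.edu", "phys.gatech.edu", "phys.uga.edu"]
# Plain keys interleave gatech and uga pages; reversed keys group them,
# so a scan over one institution touches few machines.
assert sorted(map(reverse_domain, hosts)) == [
    "edu.gatech.math", "edu.gatech.phys", "edu.uga.math", "edu.uga.phys"]
```

Since rows that sort near each other usually live on the same tablet server, this key choice turns a per-institution scan into a short contiguous range read.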
Columns
• Columns have two-level name structure:
• family:optional_qualifier
• Column family
• Unit of access control
• Has associated type information
• Qualifier gives unbounded columns
• Additional levels of indexing, if desired
Timestamps
• Used to store different versions of data in a cell
• New writes default to current time, but timestamps for writes can also be set explicitly by clients
• Lookup options:
• “Return most recent K values”
• “Return all values in timestamp range (or all values)”
• Column families can be marked w/ attributes:
• “Only retain most recent K values in a cell”
• “Keep values until they are older than K seconds”
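The "return most recent K values" lookup can be sketched as follows (hypothetical helper; in the real system this filtering happens server-side):

```python
def most_recent_k(versions, k):
    """versions: list of (timestamp, value) pairs for one cell.
    Return the k newest values, newest first."""
    return [v for _, v in sorted(versions, reverse=True)[:k]]

cell = [(10, "a"), (30, "c"), (20, "b")]   # written in arbitrary order
assert most_recent_k(cell, 2) == ["c", "b"]
```

The same sort-by-timestamp view also makes the retention attributes cheap to enforce: keep the first K entries, or drop entries older than a cutoff.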
Cloud File System :HBase and Dynamo
• HBase is a distributed column-oriented database built on top
of the Hadoop file system. It is an open-source project and is
horizontally scalable.
• HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).
• It is a part of the Hadoop ecosystem that provides random
real-time read/write access to data in the Hadoop File System.
• One can store the data in HDFS either directly or through
HBase. Data consumer reads/accesses the data in HDFS
randomly using HBase. HBase sits on top of the Hadoop File
System and provides read and write access
Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent read and writes.
• It integrates with Hadoop, both as a source
and a destination.
• It has an easy-to-use Java API for clients.
• It provides data replication across clusters.
Where to Use HBase
•Apache HBase is used to have random, real-time read/write
access to Big Data.
•It hosts very large tables on top of clusters of commodity
hardware.
•Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable runs on top of the Google File System, Apache HBase works on top of Hadoop and HDFS.
Applications of HBase
•It is used whenever there is a need for write-heavy applications.
•HBase is used whenever we need to provide fast random access to available data.
•Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
Architecture of HBase
• HBase has three major components: the client library, a master server, and region servers.
• Region servers can be added or removed as per requirement.
The master server -
• Assigns regions to the region servers and takes the
help of Apache ZooKeeper for this task.
• Handles load balancing of the regions across region
servers. It unloads the busy servers and shifts the
regions to less occupied servers.
• Maintains the state of the cluster by negotiating the
load balancing.
• Is responsible for schema changes and other metadata
operations such as creation of tables and column
families.
Regions
• Regions are nothing but tables that are split up and spread across the region servers.
Region server
• The region servers have regions that -
• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under it.
• Decide the size of the region by following the region size thresholds.
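The size-threshold behaviour can be sketched as follows (hypothetical helper; real HBase splits a region based on store-file size in bytes, not row count):

```python
def maybe_split(region_rows, threshold):
    """Split a region at its midpoint key once it exceeds the size
    threshold, mirroring how tables are divided across region servers."""
    if len(region_rows) <= threshold:
        return [region_rows]              # still one region
    keys = sorted(region_rows)
    mid = len(keys) // 2
    return [keys[:mid], keys[mid:]]       # two daughter regions

rows = [f"row{i:03d}" for i in range(10)]
parts = maybe_split(rows, threshold=8)
assert len(parts) == 2 and parts[0][-1] < parts[1][0]  # disjoint key ranges
```

After a split, the master can assign the two daughter regions to different region servers, which is how load spreads as a table grows.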
Zookeeper
• Zookeeper is an open-source project that provides services like maintaining configuration information,
naming, providing distributed synchronization, etc.
• Zookeeper has ephemeral nodes representing different region servers. Master servers use these nodes to
discover available servers.
• In addition to availability, the nodes are also used to track server failures or network partitions.
• Clients communicate with region servers via zookeeper.
• In pseudo and standalone modes, HBase itself will take care of zookeeper.
Dynamo
• Amazon DynamoDB is a fully managed NoSQL database service that lets you create database tables able to store and retrieve any amount of data.
• It automatically manages the data traffic of
tables over multiple servers and maintains
performance.
• It also relieves customers of the burden of operating and scaling a distributed database: hardware provisioning, setup, configuration, replication, software patching, cluster scaling and so on are managed by Amazon.
Benefits of DynamoDB
• Managed service − Amazon DynamoDB is a managed service. There is no need to hire experts to
manage NoSQL installation. Developers need not worry about setting up, configuring a distributed
database cluster, managing ongoing cluster operations, etc. It handles all the complexities of
scaling, partitions and re-partitions data over more machine resources to meet I/O performance
requirements.
• Scalable − Amazon DynamoDB is designed to scale. There is no need to worry about predefined
limits to the amount of data each table can store. Any amount of data can be stored and retrieved.
DynamoDB will spread automatically with the amount of data stored as the table grows.
• Fast − Amazon DynamoDB provides high throughput at very low latency. As datasets grow,
latencies remain stable due to the distributed nature of DynamoDB's data placement and request
routing algorithms.
• Durable and highly available − Amazon DynamoDB replicates data across at least three different data centers, so the system operates and serves data even under various failure conditions.
• Flexible − Amazon DynamoDB allows creation of dynamic tables: a table can have any number of attributes, including multi-valued attributes.
• Cost-effective − You pay only for what you use, with no minimum charges. The pricing structure is simple and easy to calculate.
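DynamoDB's schemaless item model can be illustrated with a toy in-memory table. The class and its methods are hypothetical stand-ins; the real service is accessed through the AWS SDK.

```python
class MiniTable:
    """Toy in-memory sketch of a schemaless key-value table in the
    style of DynamoDB (hypothetical class, not the AWS API)."""
    def __init__(self, key_attr):
        self.key_attr = key_attr   # name of the primary-key attribute
        self.items = {}

    def put_item(self, item: dict):
        self.items[item[self.key_attr]] = item

    def get_item(self, key):
        return self.items.get(key)

users = MiniTable(key_attr="UserId")
users.put_item({"UserId": "u1", "Name": "Ada"})
users.put_item({"UserId": "u2", "Tags": ["admin", "dev"]})  # different attributes
assert users.get_item("u2")["Tags"] == ["admin", "dev"]
```

Note how the two items carry entirely different attribute sets, including a multi-valued one: only the primary-key attribute is fixed per table.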
R&R Consult
 
Online blood donation management system project.pdf
Online blood donation management system project.pdfOnline blood donation management system project.pdf
Online blood donation management system project.pdf
Kamal Acharya
 

Recently uploaded (20)

Event Management System Vb Net Project Report.pdf
Event Management System Vb Net  Project Report.pdfEvent Management System Vb Net  Project Report.pdf
Event Management System Vb Net Project Report.pdf
 
Explosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdfExplosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdf
 
Digital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdfDigital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdf
 
Pharmacy management system project report..pdf
Pharmacy management system project report..pdfPharmacy management system project report..pdf
Pharmacy management system project report..pdf
 
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdfA CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
 
Construction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptxConstruction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptx
 
School management system project report.pdf
School management system project report.pdfSchool management system project report.pdf
School management system project report.pdf
 
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdfRESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf
 
ONLINE CAR SERVICING SYSTEM PROJECT REPORT.pdf
ONLINE CAR SERVICING SYSTEM PROJECT REPORT.pdfONLINE CAR SERVICING SYSTEM PROJECT REPORT.pdf
ONLINE CAR SERVICING SYSTEM PROJECT REPORT.pdf
 
Top 13 Famous Civil Engineering Scientist
Top 13 Famous Civil Engineering ScientistTop 13 Famous Civil Engineering Scientist
Top 13 Famous Civil Engineering Scientist
 
Hall booking system project report .pdf
Hall booking system project report  .pdfHall booking system project report  .pdf
Hall booking system project report .pdf
 
grop material handling.pdf and resarch ethics tth
grop material handling.pdf and resarch ethics tthgrop material handling.pdf and resarch ethics tth
grop material handling.pdf and resarch ethics tth
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
 
Supermarket billing system project report..pdf
Supermarket billing system project report..pdfSupermarket billing system project report..pdf
Supermarket billing system project report..pdf
 
Electrical shop management system project report.pdf
Electrical shop management system project report.pdfElectrical shop management system project report.pdf
Electrical shop management system project report.pdf
 
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWINGBRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
 
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
 
Online blood donation management system project.pdf
Online blood donation management system project.pdfOnline blood donation management system project.pdf
Online blood donation management system project.pdf
 

• 4. Cloud Storage Architecture
Characteristics of a cloud storage architecture:
• Manageability: the ability to manage the system with minimal resources.
• Access method: the protocol through which cloud storage is exposed.
• Performance: as measured by bandwidth and latency.
• Multi-tenancy: support for multiple users (or tenants).
• Scalability: the ability to scale gracefully to meet higher demand or load.
• Data availability: a measure of the system's uptime.
• Control: the ability to configure the system for cost, performance, or other characteristics.
• Storage efficiency: a measure of how efficiently the raw storage is used.
• Cost: a measure of the cost of the storage (commonly in dollars per gigabyte).
Fig. General Cloud Architecture
• 5. Cloud Storage Types
• DAS: Direct Attached Storage.
• NAS: Network Attached Storage.
• SAN: Storage Area Network.
Which storage technology should I use for my business application?
• 6. Cloud Storage Infrastructure: Direct Attached Storage (DAS)
• DAS stands for Direct Attached Storage and, as the name suggests, it is an architecture in which storage connects directly to hosts.
• Examples of DAS include hard drives, SSDs, optical disc drives and external storage drives.
• DAS is ideal for localized data access and sharing in environments with small servers, for instance small businesses and departments.
• Applications access data through block-level protocols, and DAS can also be used in combination with SAN and NAS.
• 7. Cloud Storage Infrastructure: Direct Attached Storage (DAS)
Based on the location of the storage devices with respect to the host, DAS can be classified as internal or external.
• Internal DAS: the storage device is connected inside the host by a serial or parallel bus. Most internal buses have distance limitations, support only short-distance connectivity and can connect only a limited number of devices. Internal devices also hamper maintenance because they occupy space inside the server.
• External DAS: the server connects directly to external storage devices, communicating over the SCSI or FC protocol. External DAS overcomes the distance and device-count limitations of internal DAS and also provides central administration of the storage devices.
• 8. Cloud Storage Infrastructure: Direct Attached Storage (DAS)
Why and why not to go for DAS?
Why to go for DAS:
• It requires a lower investment than other storage networking architectures.
• Less hardware and software are needed to set up and operate DAS.
• Configuration is simple and deployment is easy.
• Managing DAS is easy because host-based tools, such as the host OS, are used.
Why not to go for DAS:
• The major limitation of DAS is that it does not scale well; it restricts the number of hosts that can be directly connected to the storage.
• Limited bandwidth in DAS caps the available I/O processing capability, and when that limit is reached, service availability may be compromised.
• It does not make optimal use of resources because front-end ports cannot be shared.
• 9. Cloud Storage Infrastructure: Network Attached Storage (NAS)
• NAS is a file-level data storage server connected to a network, providing data access to a diverse group of clients.
• NAS is specialized for its task by its hardware, its software, or both, and provides the advantage of server consolidation by removing the need for multiple file servers.
• NAS runs its own operating system, which works with its own peripheral devices. A NAS operating system is optimized for file I/O and therefore performs file I/O better than a general-purpose server.
• It uses protocols such as TCP/IP, CIFS and NFS, which are used for data transfer and for accessing remote file services.
Components of NAS:
• A NAS head, which is essentially a CPU and memory.
• One or more Network Interface Cards (NICs).
• An optimized operating system.
• Protocols for file sharing (NFS or CIFS).
• Protocols to connect and manage storage devices, such as ATA, SCSI or FC.
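The file-level (NAS) vs block-level (DAS/SAN) distinction above can be sketched in a few lines of Python. This is a minimal illustration only: a temporary file stands in for the storage, whereas in a real deployment the file-level path would sit on an NFS/CIFS mount and the block-level reads would target a raw device such as /dev/sdb (both hypothetical here).

```python
# Illustrative contrast between file-level access (NAS) and block-level
# access (DAS/SAN). A temporary file stands in for the storage device.
import os
import tempfile

BLOCK_SIZE = 512  # classic disk sector size

# File-level access: the client names a file; the server manages the layout.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "report.txt")
with open(path, "w") as f:
    f.write("quarterly numbers")

with open(path) as f:
    data = f.read()

# Block-level access: the client addresses raw blocks by offset.
fd = os.open(path, os.O_RDONLY)
first_block = os.pread(fd, BLOCK_SIZE, 0)  # read block 0 (up to 512 bytes)
os.close(fd)
```

With NAS, only the file-level form is available to clients; SAN and DAS expose the block-level form and leave file-system layout to the host.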
• 10. Cloud Storage Infrastructure: Network Attached Storage (NAS)
• A centralized device for storing data on a network.
• Typically contains multiple hard drives in a RAID configuration.
• Attaches directly to a switch or router on the network.
• Commonly used in small businesses.
Drawback:
• Single point of failure.
Fig: Network Attached Storage
• 11. Cloud Storage Infrastructure: Storage Area Network (SAN)
• A storage area network (SAN) provides access to consolidated, block-level data storage that is accessible by applications running on any of the networked servers.
• It carries data between servers (hosts) and storage devices through Fibre Channel switches.
• A SAN helps organizations connect geographically isolated hosts and provides robust communication between hosts and storage devices.
• In a SAN, each server and storage device is linked through a switch, which provides SAN features such as storage virtualization, quality of service, security and remote replication.
Components of SAN: cabling, Host Bus Adapters (HBAs) and switches.
• Cabling is the physical medium used to establish a link between every SAN device.
• An HBA (Host Bus Adapter) is an expansion card that fits into an expansion slot in a server.
• A switch handles and directs traffic between network devices: it accepts traffic and then transmits it to the desired endpoint device.
• 12. Cloud Storage Infrastructure: Storage Area Network (SAN)
• A special high-speed network that stores and provides access to large amounts of data.
• SANs are fault tolerant.
• Data is shared among several disk arrays.
• Servers access the data as if it were on a local drive.
• Uses the iSCSI (cheaper) and FC (more expensive) protocols.
• SANs are not affected by general network traffic.
• Highly scalable, highly redundant and high speed (interconnected with Fibre Channel).
• Expensive.
Fig: Storage Area Network
• 13. Cloud Storage Infrastructure: Key Differences between DAS, NAS and SAN
• DAS (Direct Attached Storage): usually disk or tape, directly attached by a cable to the computer's processor (the hard disk drive inside a PC, or a tape drive attached to a single server, are simple forms of DAS). I/O requests (also called protocols or commands) access the devices directly.
• NAS (Network Attached Storage): a NAS device ("appliance"), usually an integrated processor plus disk storage, is attached to a TCP/IP-based network (LAN or WAN) and accessed using specialized file access/file sharing protocols. File requests received by the NAS are translated by its internal processor into device requests.
• SAN (Storage Area Network): storage resides on a dedicated network, and I/O requests access devices directly. SANs traditionally use Fibre Channel media, providing an any-to-any connection for processors and storage on that network; Ethernet media using an I/O protocol called iSCSI is emerging as an alternative.
• 14. DAS, NAS, SAN: Best Case Scenario vs Worst Case Scenario
DAS
• Best case: ideal for small businesses that only need to share data locally, have a defined, non-growth budget to work with, and have little to no IT support to maintain a complex system.
• Worst case: not a good choice for businesses that are growing quickly, need to scale quickly, need to share across distance and collaborate, or must support many system users and much activity at once.
NAS
• Best case: perfect for SMBs and organizations that need a minimal-maintenance, reliable and flexible storage system that can quickly scale up as needed to accommodate new users or growing data.
• Worst case: enterprise organizations running server-class devices that need to transfer block-level data over a Fibre Channel connection may find that NAS cannot deliver everything that is needed; maximum data transfer rates could be a problem with NAS.
SAN
• Best case: best for block-level sharing of mission-critical files or applications at data centers or large-scale enterprise organizations.
• Worst case: a SAN is a significant investment and a sophisticated solution that is typically reserved for serious large-scale computing needs; a small-to-midsize organization with a limited budget and few IT staff or resources likely would not need one.
• 15. Storage Networking (FC, iSCSI, FCoE)
• Fibre Channel (FC) is a technology for transmitting data between computer devices at data rates of up to 20 Gbps at present, with more expected in the near future.
• Fibre Channel began in the late 1980s as part of the IPI (Intelligent Peripheral Interface) Enhanced Physical Project to increase the capabilities of the IPI protocol. That effort widened to investigate other interface protocols as candidates for augmentation. In 1988, Fibre Channel was approved as a project, and it has since become an industry standard.
• iSCSI (Internet Small Computer System Interface) is a storage networking standard used to link different storage facilities. iSCSI transmits data over local area networks, wide area networks or the Internet; it enables location-independent data storage and retrieval and is one of the two main approaches to storage data transmission over IP networks.
• FCIP (Fibre Channel over IP) translates Fibre Channel control codes and data into IP packets for transmission between geographically distant Fibre Channel SANs.
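The core iSCSI idea above, SCSI commands carried inside TCP/IP byte streams, can be illustrated with a short Python sketch. The 10-byte READ(10) command descriptor block layout follows standard SCSI, but the 4-byte length header is a toy stand-in for framing, NOT the real iSCSI PDU format (RFC 7143 defines that).

```python
# Conceptual sketch of iSCSI-style encapsulation: a SCSI command descriptor
# block (CDB) wrapped for transport over TCP/IP.
import struct

READ_10 = 0x28  # SCSI READ(10) opcode

def build_cdb(lba, num_blocks):
    """Build a 10-byte SCSI READ(10) CDB:
    opcode, flags, 4-byte LBA, group number, 2-byte transfer length, control."""
    return struct.pack(">BBIBHB", READ_10, 0, lba, 0, num_blocks, 0)

def encapsulate(cdb):
    """Prefix the CDB with a toy 4-byte length header as a stand-in for a PDU."""
    return struct.pack(">I", len(cdb)) + cdb

cdb = build_cdb(lba=2048, num_blocks=8)
pdu = encapsulate(cdb)
# In a real initiator, `pdu` would be written to a TCP socket connected
# to the iSCSI target, which unwraps it and executes the SCSI command.
```

The point of the sketch is the layering: the block-storage command survives unchanged; only the transport around it differs between FC, FCIP and iSCSI.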
• 16. FCoE Benefits
• Mapping of Fibre Channel frames over Ethernet.
• Fibre Channel enabled to run on a lossless Ethernet network.
• Wire the server only once.
• Fewer cables and adapters.
• Software provisioning of I/O.
• Interoperates with existing Fibre Channel SANs.
• No gateway; stateless.
iSCSI Benefits
• A SCSI transport protocol that operates over TCP.
• Encapsulation of SCSI command descriptor blocks and data in TCP/IP byte streams.
• Wire the server only once.
• Fewer cables and adapters.
• A new operational model.
• Broad industry support: OS vendors support their iSCSI drivers, gateways (routers, bridges), and native iSCSI storage arrays.
• 18.
• FCIP uses a tunnel to transfer data between networks, carrying Fibre Channel traffic over TCP/IP.
• FCoE was developed to simplify switches and consolidate I/O in comparison with FCIP. It replaces FC links with high-speed Ethernet links between the devices that support the network.
• iFCP is a newer standard that broadens the way data can be transferred over the Internet. It combines ideas from the FCIP and iSCSI protocols.
For more details, refer to:
Link 1: https://www.cisco.com/c/en/us/products/collateral/switches/nexus-5000-series-switches/white_paper_c11-495142.html
Link 2: http://www.provision.ro/storage-infrastructure/storage-networking-fc-iscsi-fcoe#pagei-1|pagep-1|
• 19. Summary:
• FCoE was not designed to make iSCSI obsolete. iSCSI has many applications that FCoE does not cover, in particular in low-end systems and in small, remote branch offices, where IP connectivity is of paramount importance.
• Some customers have limited I/O requirements in the 100-Mbps range, and iSCSI is just the right solution for them. This is why iSCSI has taken off and is so successful in the SMB market: it is cheap, and it gets the job done.
• Large enterprises are adopting virtualization, have much higher I/O requirements, and want to preserve their investments and training in Fibre Channel. For them, FCoE is probably a better solution.
• FCoE will take a large share of the SAN market. It will not make iSCSI obsolete, but it will reduce its potential market.
• 20. Cloud File System
A cloud file system is a distributed file system that allows many clients to access data and supports operations on that data. A file system also ensures security in terms of confidentiality, availability and integrity.
Types of cloud file systems:
• GFS (Google File System)
• HDFS (Hadoop Distributed File System)
• BigTable
• HBase
• Dynamo
• 21. Cloud File System: Google File System
• GFS is a proprietary distributed file system developed by Google for its own use.
• GFS is used to store and process huge volumes of data in a distributed manner.
• GFS consists of a single master and multiple chunk servers.
• Files are divided into fixed-size chunks.
• Each chunk holds 64 MB of data.
• Each chunk is replicated on multiple chunk servers (3 by default), so even if a chunk server crashes, the file's data is still present on other chunk servers.
Fig: Architecture of GFS
• 22. Cloud File System: Google File System
Files are divided into fixed-size chunks of 64 MB each. Each chunk is replicated on multiple chunk servers (3 by default), so even if a chunk server crashes, the file's data is still present on other chunk servers.
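The chunking and replication scheme above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions (round-robin placement, invented server names); real GFS is proprietary and uses more sophisticated placement policies.

```python
# GFS-style chunking and replica placement, illustrative sketch only.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size
REPLICAS = 3                   # default replication factor

def split_into_chunks(file_size):
    """Return the number of fixed-size chunks needed for a file (ceiling division)."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

def place_replicas(num_chunks, chunk_servers):
    """Round-robin placement: map each chunk index to 3 distinct chunk servers."""
    n = len(chunk_servers)
    return {i: [chunk_servers[(i + r) % n] for r in range(REPLICAS)]
            for i in range(num_chunks)}

servers = ["cs1", "cs2", "cs3", "cs4", "cs5"]        # hypothetical chunk servers
chunks = split_into_chunks(200 * 1024 * 1024)        # a 200 MB file -> 4 chunks
layout = place_replicas(chunks, servers)
# Losing any single chunk server still leaves 2 replicas of every chunk.
```

Because every chunk lives on three distinct servers, a single server crash never makes any part of the file unavailable, which is the fault-tolerance property the slide describes.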
• 23. Cloud File System: HDFS
• HDFS is an Apache project; systems at Yahoo, Facebook, IBM and others are based on HDFS.
• HDFS is the storage unit of Hadoop, used to store and process huge volumes of data across multiple data nodes.
• It is designed for low-cost hardware and distributes data across multiple Hadoop clusters.
• It has high fault tolerance and high throughput.
• A large file is broken down into small blocks of data, with a default block size of 128 MB that can be increased as required.
• Multiple copies of each block are stored across the cluster in a distributed manner on different nodes.
Fig: Architecture of HDFS
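The block arithmetic implied by the 128 MB default is worth making concrete. This is plain math, not the HDFS client API; the replication factor of 3 is HDFS's default.

```python
# Block-count and raw-storage arithmetic for HDFS defaults.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB default block size
REPLICATION = 3                  # default replication factor

def hdfs_blocks(file_size):
    """Number of HDFS blocks a file of `file_size` bytes occupies."""
    full, rem = divmod(file_size, BLOCK_SIZE)
    return full + (1 if rem else 0)

def raw_storage(file_size):
    """Raw bytes consumed across the cluster, counting all replicas."""
    return file_size * REPLICATION

one_gb = 1024 ** 3
blocks = hdfs_blocks(one_gb)     # a 1 GB file occupies 8 blocks
```

So a 1 GB file splits into 8 blocks, and with 3-way replication it consumes 3 GB of raw cluster capacity, which is the trade-off HDFS makes for fault tolerance.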
• 24. Cloud File System: GFS vs HDFS
GFS and HDFS are similar in many respects and are both used for storing large data sets, but they differ in a few key aspects:
• Load division: GFS comprises a single master node and multiple chunk servers; HDFS has a single NameNode and multiple DataNodes in the file system.
• Block size: GFS stores its data in blocks with a default size of 64 MB; HDFS divides data into blocks with a default size of 128 MB.
• Data chunk storage location: GFS polls all chunk servers at startup and does not maintain a persistent record of any particular data chunk's replica locations; HDFS maintains the record of all DataNode information in the NameNode.
• 25. Cloud File System: GFS vs HDFS
• Atomic record appends: GFS provides an append operation along with an offset option, so users can append to the same file at different offsets; this supports the random read/write ability in GFS that HDFS lacks. HDFS can append to a file but does not provide an offset option.
• Data integrity: GFS chunk servers use checksums to detect corruption of stored data, and corruption can also be detected by comparing replicas. HDFS verifies the contents of HDFS files using client-side checksum checking.
• Deletion: in GFS, the resources of deleted files are not reclaimed immediately as they are in HDFS; instead, the files are renamed into a hidden namespace and are permanently removed if they have not been restored within three days. In HDFS, deleted files are moved into a particular folder and then removed by a garbage collector.
• Snapshots: GFS allows individual files and directories to be snapshotted; HDFS supports up to 65,536 snapshots per snapshottable directory.
• 26. Cloud File System: BigTable
• Bigtable is a compressed, high-performance, proprietary data storage system built on the Google File System, developed by Google.
• Designed to scale to a very large size: petabytes of data across thousands of servers.
• Used for many Google projects: web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance.
• A flexible, high-performance solution for all of Google's products.
Goals:
• Allow asynchronous processes to continuously update different pieces of data.
• Provide access to the most current data at any time.
• Support very high read/write rates (millions of operations per second).
• Support efficient scans over all data or interesting subsets of it.
• Support efficient joins of large one-to-one and one-to-many data sets.
• Allow examining data changes over time, e.g. the contents of a web page over multiple crawls.
• 27. Building Blocks
Building blocks:
• Google File System (GFS): raw storage.
• Scheduler: schedules jobs onto machines.
• Lock service: distributed lock manager.
• MapReduce: simplified large-scale data processing.
How BigTable uses these building blocks:
• GFS: stores persistent data (the SSTable file format).
• Scheduler: schedules the jobs involved in BigTable serving.
• Lock service: master election, location bootstrapping.
• MapReduce: often used to read/write BigTable data.
• 28. Basic Data Model
• A BigTable is a sparse, distributed, persistent multi-dimensional sorted map: (row, column, timestamp) -> cell contents.
• A good match for most Google applications.
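The (row, column, timestamp) -> cell contents map can be modeled directly in Python. This is a toy in-memory sketch of the data model only, not a Bigtable client; the row and column names follow the web-table style used in later slides.

```python
# Toy model of Bigtable's data model: a sparse map from
# (row, column, timestamp) to cell contents.
table = {}

def put(row, column, value, timestamp):
    """Write a versioned cell."""
    table[(row, column, timestamp)] = value

def get_latest(row, column):
    """Return the most recent value written to (row, column), or None."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    if not versions:
        return None
    return max(versions)[1]  # the highest timestamp wins

put("com.cnn.www", "contents:", "<html>v1</html>", timestamp=1)
put("com.cnn.www", "contents:", "<html>v2</html>", timestamp=2)
put("com.cnn.www", "anchor:cnnsi.com", "CNN", timestamp=1)
latest = get_latest("com.cnn.www", "contents:")
```

Note the sparsity: only the cells actually written exist in the map, and each (row, column) pair can hold multiple timestamped versions, which is exactly what the timestamp dimension is for.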
• 29. WebTable Example
• Goal: keep a copy of a large collection of web pages and related information.
• Use URLs as row keys.
• Use various aspects of a web page as column names.
• Store the contents of web pages in the contents: column under the timestamps at which they were fetched.
• 30. Rows
• A row name is an arbitrary string.
• Access to data in a row is atomic.
• Row creation is implicit upon storing data.
• Rows are ordered lexicographically, and rows close together lexicographically usually reside on one or a small number of machines.
• Reads of short row ranges are therefore efficient and typically require communication with only a small number of machines.
• Applications can exploit this property by selecting row keys that get good locality for data access.
• Example: math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu vs edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys.
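The reversed-domain example above is easy to demonstrate: with hostnames reversed, pages from the same domain sort next to each other, so a short row-range scan touches few machines.

```python
# Demonstrate the row-key locality trick: reverse the domain components
# so that rows from the same domain become lexicographic neighbors.
hosts = ["math.gatech.edu", "math.uga.edu", "phys.gatech.edu", "phys.uga.edu"]

def reverse_domain(host):
    """math.gatech.edu -> edu.gatech.math"""
    return ".".join(reversed(host.split(".")))

plain = sorted(hosts)
reversed_keys = sorted(reverse_domain(h) for h in hosts)
# plain interleaves the two universities:
#   math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu
# reversed keys group each university's pages together:
#   edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys
```

With the reversed keys, a range scan from "edu.gatech" to "edu.gatech\xff" retrieves all gatech.edu pages in one contiguous row range.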
• 31. Columns
• Columns have a two-level name structure: family:optional_qualifier.
• Column family: the unit of access control; has associated type information.
• The qualifier gives unbounded columns: additional levels of indexing, if desired.
• 32. Timestamps
• Used to store different versions of the data in a cell.
• New writes default to the current time, but timestamps for writes can also be set explicitly by clients.
• Lookup options: "return the most recent K values" or "return all values in a timestamp range (or all values)".
• Column families can be marked with attributes such as "only retain the most recent K values in a cell" or "keep values until they are older than K seconds".
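The "only retain the most recent K values in a cell" garbage-collection attribute above can be sketched as a one-liner over a cell's version list. A minimal sketch, assuming versions are (timestamp, value) pairs; real Bigtable applies this per column family during compaction.

```python
# Sketch of the "retain most recent K values" retention policy for a cell.
def retain_most_recent(versions, k):
    """Keep the k newest versions of a cell, newest first.

    `versions` is a list of (timestamp, value) pairs.
    """
    return sorted(versions, reverse=True)[:k]

versions = [(3, "v3"), (1, "v1"), (2, "v2"), (4, "v4")]
kept = retain_most_recent(versions, 2)   # the two newest versions survive
```

The time-based variant ("keep values until they are older than K seconds") would instead filter on `timestamp >= now - k` rather than truncating by count.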
• 33. Cloud File System: HBase and Dynamo
• HBase is a distributed, column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.
• HBase's data model is similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).
• It is part of the Hadoop ecosystem and provides random, real-time read/write access to data in the Hadoop file system.
• Data can be stored in HDFS either directly or through HBase. Data consumers read and access data in HDFS randomly using HBase, which sits on top of the Hadoop file system and provides read and write access.
• 34. Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients.
• It provides data replication across clusters.
Where to use HBase:
• Apache HBase is used for random, real-time read/write access to Big Data.
• It hosts very large tables on top of clusters of commodity hardware.
• Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable acts upon the Google File System, Apache HBase works on top of Hadoop and HDFS.
Applications of HBase:
• Used whenever there is a need for write-heavy applications.
• Used whenever fast random access to available data is needed.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
• 35. Architecture of HBase
• HBase has three major components: the client library, a master server, and region servers.
• 36. Architecture of HBase
• Region servers can be added or removed as required.
The master server:
• Assigns regions to the region servers, taking the help of Apache ZooKeeper for this task.
• Handles load balancing of the regions across region servers: it unloads busy servers and shifts regions to less occupied ones.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations, such as the creation of tables and column families.
• 37. Regions
• Regions are tables that are split up and spread across the region servers.
Region servers host regions and:
• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under them.
• Decide the size of a region by following the region size thresholds.
ZooKeeper:
• ZooKeeper is an open-source project that provides services such as maintaining configuration information, naming, and distributed synchronization.
• ZooKeeper has ephemeral nodes representing the different region servers; master servers use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures or network partitions.
• Clients communicate with region servers via ZooKeeper.
• In pseudo-distributed and standalone modes, HBase itself takes care of ZooKeeper.
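The region idea above, a table split by row-key ranges with each range served by one region server, can be sketched with a simple range lookup. The ranges and server names are invented for illustration; real HBase stores this mapping in a META table and discovers it via ZooKeeper.

```python
# Sketch of region lookup: each region covers a half-open row-key range
# [start, end) and is served by one region server (names are hypothetical).
regions = [
    ("",  "g", "regionserver-1"),  # "" start key = unbounded below
    ("g", "p", "regionserver-2"),
    ("p", "",  "regionserver-3"),  # "" end key = unbounded above
]

def find_region(row_key):
    """Return the server responsible for `row_key`."""
    for start, end, server in regions:
        if row_key >= start and (end == "" or row_key < end):
            return server
    raise KeyError(row_key)

server = find_region("kamal")   # falls in ["g", "p")
```

Because lookups are by key range, a client can route every read or write directly to the single region server that owns the row, with no central server on the data path.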
• 38. Dynamo
• Amazon DynamoDB is a fully managed NoSQL database service that lets you create database tables that can store and retrieve any amount of data.
• It automatically manages the data traffic of tables over multiple servers and maintains performance.
• It also relieves customers of the burden of operating and scaling a distributed database: hardware provisioning, setup, configuration, replication, software patching, cluster scaling, etc. are managed by Amazon.
• 39. Benefits of DynamoDB
• Managed service: Amazon DynamoDB is a managed service, so there is no need to hire experts to manage a NoSQL installation. Developers need not worry about setting up and configuring a distributed database cluster or managing ongoing cluster operations. DynamoDB handles all the complexities of scaling, and partitions and re-partitions data over more machine resources to meet I/O performance requirements.
• Scalable: Amazon DynamoDB is designed to scale. There is no need to worry about predefined limits to the amount of data each table can store; any amount of data can be stored and retrieved, and DynamoDB spreads the data automatically as the table grows.
• Fast: Amazon DynamoDB provides high throughput at very low latency. As data sets grow, latencies remain stable due to the distributed nature of DynamoDB's data placement and request-routing algorithms.
• 40.
• Durable and highly available: Amazon DynamoDB replicates data across at least 3 different data centers, so the system operates and serves data even under various failure conditions.
• Flexible: Amazon DynamoDB allows the creation of dynamic tables, i.e. a table can have any number of attributes, including multi-valued attributes.
• Cost-effective: payment is only for what you use, with no minimum charges. Its pricing structure is simple and easy to calculate.
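The data-placement scheme behind Dynamo's durability claims is consistent hashing: keys and nodes hash onto a ring, and each key is stored on the first N distinct nodes clockwise from its position. The sketch below is a simplified version of that partitioning idea (node names invented, no virtual nodes), not the DynamoDB service API.

```python
# Simplified consistent-hashing ring in the style of the Dynamo design:
# a key's "preference list" is the first `replicas` nodes clockwise
# from the key's hash position on the ring.
import hashlib
from bisect import bisect_right

def ring_pos(s):
    """Map a string to a position on the hash ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.ring = sorted((ring_pos(n), n) for n in nodes)

    def preference_list(self, key):
        """First `replicas` distinct nodes clockwise from the key."""
        positions = [p for p, _ in self.ring]
        i = bisect_right(positions, ring_pos(key))
        return [self.ring[(i + k) % len(self.ring)][1]
                for k in range(self.replicas)]

ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
owners = ring.preference_list("user#1001")  # 3 distinct nodes hold this key
```

Adding or removing a node only shifts ownership of the keys adjacent to it on the ring, which is what lets the system re-partition data incrementally as it scales.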