BIG DATA ANALYTICS
(21CS71)
STORAGE MODELS
Submitted
by:
Harsha A
(1RR21CS057)
Storage Models


A storage model is at the core of any big-data system. It affects the scalability,
data structures, and programming and computational models of the systems that are built
on top of it. Understanding the underlying storage model is also key to understanding the
entire spectrum of big-data frameworks. To address different considerations and areas of
focus, three main storage models have been developed over the past few decades, namely
Block-based Storage, File-based Storage and Object-based Storage.
1.1 Block-Based Storage




Block-level storage is one of the most classical storage models in computer science. A
traditional block-based storage system presents itself to servers using industry-standard
Fibre Channel and iSCSI connectivity mechanisms. Basically, block-level storage can be
considered a hard drive in a server, except that the hard drive might be installed in a
remote chassis and is accessed over Fibre Channel or iSCSI.
In block-based storage, data is stored as blocks which normally have a fixed size and carry
no additional information (metadata). A unique identifier is used to access each block.
Block-based storage focuses on performance and scalability for storing and accessing very
large-scale data.
As a result, block-based storage is usually used as a low-level storage paradigm on which
higher-level storage systems, such as file-based systems, object-based systems and
transactional databases, are built.
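The block-addressing idea can be sketched in a few lines of Python. The sketch below treats an ordinary file as a raw volume and reads and writes fixed-size blocks purely by block identifier; the 4 KiB block size and the file name volume.img are illustrative assumptions, not part of any particular product.

    import os

    BLOCK_SIZE = 4096  # fixed block size (illustrative assumption)

    def write_block(fd: int, block_id: int, data: bytes) -> None:
        """Write one fixed-size block at the offset derived from its identifier."""
        assert len(data) == BLOCK_SIZE
        os.pwrite(fd, data, block_id * BLOCK_SIZE)

    def read_block(fd: int, block_id: int) -> bytes:
        """Read a block back using only its identifier; no metadata is stored with it."""
        return os.pread(fd, BLOCK_SIZE, block_id * BLOCK_SIZE)

    fd = os.open("volume.img", os.O_RDWR | os.O_CREAT)
    write_block(fd, 3, b"x" * BLOCK_SIZE)
    print(len(read_block(fd, 3)))  # 4096
    os.close(fd)

Note that a block carries nothing but data: any higher-level meaning (file names, ownership, structure) has to be layered on top by the systems built above it.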
Architecture
In block-based storage, data is stored as blocks which normally have a fixed size and no
additional information (metadata). A unique identifier is used to access each block. The
identifier is mapped to the exact location of the actual data blocks through access
interfaces. Traditionally, block-based storage is bound to physical storage protocols
such as SCSI [4], iSCSI, ATA [5] and SATA [6].
With the development of distributed computing and big data, the block-based storage model
has also been extended to support distributed and cloud-based environments. As shown in
Fig. 3, the architecture of a distributed block-storage system is composed of a block server
and a group of block nodes. The block server is responsible for maintaining the mapping, or
index, from block IDs to the actual data blocks in the block nodes. The block nodes are
responsible for storing the actual data in fixed-size partitions, each of which is considered
a block.
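This split between the block server (the index) and the block nodes (the data) can be sketched with a minimal in-memory model. The class and method names below are hypothetical, the "nodes" are plain dictionaries standing in for real machines, and the round-robin placement policy is only an assumption for illustration.

    class BlockNode:
        """Stores fixed-size blocks in local partitions (modelled here as a dict)."""
        def __init__(self):
            self.blocks = {}

        def store(self, local_id: int, data: bytes) -> None:
            self.blocks[local_id] = data

    class BlockServer:
        """Maintains the mapping from global block IDs to their location on a block node."""
        def __init__(self, nodes):
            self.nodes = nodes
            self.index = {}      # global block ID -> (node index, local block ID)
            self.next_id = 0

        def put(self, data: bytes) -> int:
            block_id = self.next_id
            node_idx = block_id % len(self.nodes)   # trivial round-robin placement
            self.nodes[node_idx].store(block_id, data)
            self.index[block_id] = (node_idx, block_id)
            self.next_id += 1
            return block_id

        def get(self, block_id: int) -> bytes:
            node_idx, local_id = self.index[block_id]
            return self.nodes[node_idx].blocks[local_id]

    cluster = BlockServer([BlockNode(), BlockNode(), BlockNode()])
    bid = cluster.put(b"\x00" * 4096)
    print(len(cluster.get(bid)))  # 4096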
1.2 File-Based Storage
File-based storage inherits from the traditional file system architecture and considers data as
files maintained in a hierarchical structure. It is the most common storage model and is
relatively easy to implement and use. In a big data scenario, a file-based storage system can be
built on another low-level abstraction (such as the block-based or object-based model) to
improve its performance and scalability.
Architecture
The file-based storage paradigm is shown in Fig. 4. File paths are organized in a hierarchy and
are used as the entries for accessing data in the physical storage. For a big data scenario,
distributed file systems (DFS) are commonly used as the basic storage systems. Figure 5 shows a
typical architecture of a distributed file system, which normally contains one or several name
nodes and a number of data nodes. The name node is responsible for maintaining the file-entry
hierarchy for the entire system, while the data nodes are responsible for the persistence of file
data.
In a file-based system, a user needs to know the namespaces and paths in order to access the
stored files. For sharing files across systems, the path or namespace of a file includes three
main parts: the protocol, the domain name and the path of the file. For example, an HDFS [15]
file can be addressed as: "[hdfs://][ServerAddress:ServerPort]/[FilePath]" (Fig. 6).
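The three parts of such a path can be pulled apart with the standard URI parser in Python. The concrete server name, port and file path below are hypothetical examples that merely follow the "[hdfs://][ServerAddress:ServerPort]/[FilePath]" pattern.

    from urllib.parse import urlparse

    # Hypothetical HDFS path following the pattern above.
    uri = "hdfs://namenode.example.com:8020/user/data/events.csv"

    parsed = urlparse(uri)
    print(parsed.scheme)    # 'hdfs'                   -> the protocol
    print(parsed.hostname)  # 'namenode.example.com'   -> the domain name (name node address)
    print(parsed.port)      # 8020                     -> the server port
    print(parsed.path)      # '/user/data/events.csv'  -> the file path within the hierarchy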
The Network File System (NFS) is a distributed file system protocol originally developed by Sun
Microsystems. Basically, NFS allows remote hosts to mount file systems over a network and
interact with those file systems as though they were mounted locally. This enables system
administrators to consolidate resources onto centralized servers on the network. NFS is built
on the Open Network Computing Remote Procedure Call (ONC RPC) system. NFS has been widely used
in Unix and Linux-based operating systems and has also inspired the development of modern
distributed file systems. There have been three main generations of the NFS protocol (NFSv2,
NFSv3 and NFSv4), driven by the continuous development of storage technology and the growth of
user requirements. An NFS deployment consists of a few servers and many more clients. The
clients remotely access the data that is stored on the server machines. In order for this to
function properly, a few processes have to be configured and running. NFS is well suited for
sharing entire file systems with a large number of known hosts in a transparent manner.
However, with ease of use comes a variety of potential security problems.
Therefore, NFS also provides two basic options for access control of shared files:
→ First, the server restricts which hosts are allowed to mount which file systems, either by
IP address or by host name (see the example export entry below).
→ Second, the server enforces file system permissions for users on NFS clients in the same
way it does for local users.
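As an illustration of the first option, a typical entry in the server's /etc/exports file restricts a shared directory to the hosts allowed to mount it. The exported directory and subnet below are hypothetical; rw, sync and root_squash are standard NFS export options.

    # /etc/exports on the NFS server (hypothetical share and subnet)
    # Only hosts in 192.168.10.0/24 may mount /srv/shared, read-write, with root squashed.
    /srv/shared  192.168.10.0/24(rw,sync,root_squash)

File-level permissions (the second option) are then enforced on the mounted share exactly as they would be for local users.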
1.3 Object-Based Storage

The object-based storage model was first introduced on Network Attached Secure Devices
[17] to provide more flexible data containers called objects. Over the past decade,
object-based storage has been developed further, with investments made by both system
vendors, such as EMC, HP, IBM and Red Hat, and cloud providers, such as Amazon, Microsoft
and Google. In the object-based storage model, data is managed as objects.
As shown in Fig. 7, every object includes the data itself, some metadata, attributes and a
globally unique object identifier (OID). The object-based storage model abstracts the lower
layers of storage away from administrators and applications. Object storage systems can be
implemented at different levels, including the device level, system level and interface
level.
Data is exposed and managed as objects, which include additional descriptive metadata that
can be used for better indexing or management. Metadata can be anything from security,
privacy and authentication properties to any application-associated information.
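To make the structure of an object concrete, the short Python sketch below models an object as data plus metadata plus a generated OID. The class and field names are illustrative assumptions, not part of any standard object-storage API.

    import uuid
    from dataclasses import dataclass, field

    @dataclass
    class StorageObject:
        """Minimal model of an object: payload, descriptive metadata and a globally unique OID."""
        data: bytes
        metadata: dict = field(default_factory=dict)                # e.g. security or application properties
        oid: str = field(default_factory=lambda: uuid.uuid4().hex)  # globally unique identifier

    obj = StorageObject(b"sensor readings", {"owner": "analytics", "content-type": "text/csv"})
    print(obj.oid)  # flat, location-independent identifier used to address the object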


Architecture
The typical architecture of an object-based storage system is shown in Fig. 8. As we can see
from the figure, an object-based storage system normally uses a flat namespace, in which the
identifiers of data and their locations are usually maintained as key-value pairs in the object
server. In principle, the object server provides location-independent addressing and constant
lookup latency for reading every object. In addition, the metadata is separated from the data
and is also maintained as objects in a metadata server (which might be co-located with the
object server).
As a result, the model provides a standard and easier way of processing, analyzing and
manipulating the metadata without affecting the data itself. Due to the flat architecture, it
is very easy to scale out object-based storage systems by adding additional storage nodes.
Besides, the added storage automatically becomes capacity available to all users. Drawing on
the object container and the metadata maintained, the model can also provide much more
flexible and fine-grained data policies at different levels; for example, Amazon S3 [18]
provides bucket-level policies, Azure [19] provides storage-account-level policies, and Atmos
[20] provides per-object policies.
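The flat key-value lookup described above can be sketched as a small in-memory model. This is an illustrative assumption rather than a real object-server implementation; it reuses the hypothetical StorageObject class from the earlier sketch and keeps the metadata in a separate map to mimic a co-located metadata server.

    class ObjectServer:
        """Toy object server: a flat namespace mapping OIDs to objects, with metadata kept alongside."""
        def __init__(self):
            self._objects = {}    # OID -> StorageObject (flat namespace, no hierarchy)
            self._metadata = {}   # OID -> metadata, standing in for a co-located metadata server

        def put(self, obj: StorageObject) -> str:
            self._objects[obj.oid] = obj
            self._metadata[obj.oid] = obj.metadata
            return obj.oid                      # the caller addresses the object by OID from now on

        def get(self, oid: str) -> StorageObject:
            return self._objects[oid]           # constant-time, location-independent lookup

    store = ObjectServer()
    oid = store.put(obj)
    print(store.get(oid).data)  # b'sensor readings'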
1.4 Comparison of Storage Models



In practice, there is no perfect model that suits all possible scenarios. Therefore,
developers and users should choose a storage model according to their application
requirements and context. Basically, each of the storage models discussed in this section
has its own pros and cons.
Block-based storage is known for its flexibility, versatility and simplicity. In a block-level
storage system, raw storage volumes (composed of a set of blocks) are created, and a
server-based system then connects to these volumes and uses them as individual storage drives.
This makes block-based storage usable for almost any kind of application, including file
storage, database storage, virtual machine file system (VMFS) volumes, and more.
Block-based storage can also be used for data-sharing scenarios. After block-based volumes are
created, they can be logically connected to, or migrated between, different user spaces.
Therefore, users can use these shared block volumes to exchange data with each other.
Summary of Data Storage Models
As a result, the main features of each storage model can be summarized as shown in
Table 1. Generally, block-based storage has a fixed size for each storage unit, while
file-based and object-based models can have various storage-unit sizes based on
application requirements. In addition, file-based models use the file directory
hierarchy to locate the data, whilst block-based and object-based models both rely
on a global identifier for locating data. Furthermore, both block-based and
object-based models have flat scalability, while file-based storage may be limited by
its hierarchical indexing structure. Lastly, block-based storage can normally
guarantee strong consistency, while for file-based and object-based models the
consistency model is configurable for different scenarios.
