2. Contents
• Introduction
• DFS
• How it works
• DFS Concepts
• File service Model
• NoSQL
• Most popular DFSs
• NFS as an example
• Advantages/Challenges
• Conclusion
3. Introduction
A file system is a subsystem of the operating system that
performs file management activities such as organization,
storing, retrieval, naming, sharing, and protection of files.
Distributed file system (DFS)
• A method of storing and accessing files based on a
client/server architecture.
• A distributed implementation of the classical time-sharing
model of a file system, in which multiple users share files and
storage resources.
4. DFS
• In a distributed file system, one or more
central servers store files that can be
accessed, with proper authorization rights,
by any number of remote clients in the
network.
8. Distribution Concept
• Distribute blocks of data sets across multiple nodes.
• Each node has its own computing power,
which lets the DFS process data blocks in parallel.
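The distribution idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the block size, the node names, the round-robin placement, and the helper names (`split_into_blocks`, `distribute`, `parallel_count`) are all hypothetical, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 8  # bytes per block; deliberately tiny (an assumption for the demo)


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut the data set into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def distribute(blocks: list[bytes], node_names: list[str]) -> dict[str, list[bytes]]:
    """Assign blocks to nodes round-robin; returns node name -> blocks held."""
    placement = {name: [] for name in node_names}
    for index, block in enumerate(blocks):
        placement[node_names[index % len(node_names)]].append(block)
    return placement


def count_on_node(node_blocks: list[bytes], target: bytes) -> int:
    """Work done locally by one node: count a single byte in its own blocks."""
    return sum(block.count(target) for block in node_blocks)


def parallel_count(placement: dict[str, list[bytes]], target: bytes) -> int:
    """Run the per-node work concurrently and combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(placement)) as pool:
        partials = pool.map(lambda blocks: count_on_node(blocks, target),
                            placement.values())
        return sum(partials)


data = b"a distributed file system spreads data across many nodes"
placement = distribute(split_into_blocks(data), ["node-1", "node-2", "node-3"])
total = parallel_count(placement, b"a")
```

The target is a single byte on purpose: a one-byte match can never straddle a block boundary, so the sum of the per-node counts equals the count over the whole data set.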
9. Replication Concept
The DFS replicates data blocks by copying the same pieces of
information onto multiple clusters on different racks.
This helps to achieve fault tolerance and high concurrency.
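A rough sketch of rack-aware replica placement follows. The rack layout, the replication factor, and the functions `place_replicas` and `surviving_replicas` are assumptions made for illustration; real systems use their own placement policies.

```python
def place_replicas(block_id: int, racks: dict[str, list[str]],
                   replication: int = 2) -> list[str]:
    """Pick `replication` nodes for a block, each on a different rack."""
    rack_names = sorted(racks)
    if replication > len(rack_names):
        raise ValueError("not enough racks for one replica per rack")
    chosen = []
    for offset in range(replication):
        # Consecutive offsets modulo the rack count always hit distinct racks.
        rack = rack_names[(block_id + offset) % len(rack_names)]
        nodes = racks[rack]
        chosen.append(nodes[block_id % len(nodes)])
    return chosen


def surviving_replicas(replicas: list[str], racks: dict[str, list[str]],
                       failed_rack: str) -> list[str]:
    """Replicas that remain reachable after a whole rack goes down."""
    failed_nodes = set(racks[failed_rack])
    return [node for node in replicas if node not in failed_nodes]


# Hypothetical layout: three racks with two nodes each.
racks = {
    "rack-1": ["n1", "n2"],
    "rack-2": ["n3", "n4"],
    "rack-3": ["n5", "n6"],
}
replicas = place_replicas(block_id=0, racks=racks)
still_available = surviving_replicas(replicas, racks, failed_rack="rack-1")
```

Because every replica of a block sits on a different rack, losing any single rack removes at most one copy, which is the fault-tolerance property the slide refers to.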
12. File Service Models
Upload/download Model:
• files move between server and clients
• few operations (read file & write file)
• requires storage at client
• Good if whole file is accessed
Remote access Model:
• files stay at server
• rich interface with many operations
• less space at client
• Efficient for small accesses
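The difference between the two models can be illustrated with a toy in-memory server. The class `FileServer` and its method names are hypothetical and do not correspond to any real DFS API; the sketch only contrasts whole-file transfer with operations applied at the server.

```python
class FileServer:
    """Toy server exposing both file service models side by side."""

    def __init__(self) -> None:
        self._files: dict[str, bytearray] = {}

    # Upload/download model: only whole files cross the network.
    def download(self, name: str) -> bytes:
        return bytes(self._files[name])

    def upload(self, name: str, content: bytes) -> None:
        self._files[name] = bytearray(content)

    # Remote access model: the file stays at the server and the client
    # asks for individual operations on parts of it.
    def read(self, name: str, offset: int, length: int) -> bytes:
        return bytes(self._files[name][offset:offset + length])

    def write(self, name: str, offset: int, data: bytes) -> None:
        buffer = self._files[name]
        buffer[offset:offset + len(data)] = data


server = FileServer()
server.upload("notes.txt", b"hello world")

# Upload/download: fetch everything, edit a local copy, send everything back.
local_copy = bytearray(server.download("notes.txt"))
local_copy[0:5] = b"HELLO"
server.upload("notes.txt", bytes(local_copy))

# Remote access: change five bytes in place, nothing is stored at the client.
server.write("notes.txt", 6, b"WORLD")
tail = server.read("notes.txt", 6, 5)
```

In the first case the client needs room for the whole file but only two operations exist; in the second the client transfers just the bytes it touches, at the price of a richer server interface.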
13. NoSQL
• Non-SQL (non-relational) database management
• It does not use the relational (table-based) model
• Used for distributed transaction processing across multiple databases
18. The Advantages of DFS
• Scalability
• Fault Tolerance
• High Concurrency
19. Challenges
• Transparent access
User sees single, global file system regardless of location
• Scalable performance
Performance does not degrade as more clients are added
• Fault Tolerance
Client and server identify and respond appropriately when the other crashes
• Consistency
See same directory and file contents on different clients at same time
• Security
Secure communication and user authentication
• Tension across these goals
Example: Caching helps performance, but hurts consistency
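The caching example can be made concrete with a small sketch. The classes `Server` and `CachingClient`, the version counter, and the `validate` flag are assumptions introduced for illustration; they are not taken from a specific DFS.

```python
class Server:
    """Holds the authoritative copy of each file plus a version counter."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}
        self.versions: dict[str, int] = {}

    def write(self, name: str, data: str) -> None:
        self.files[name] = data
        self.versions[name] = self.versions.get(name, 0) + 1

    def read(self, name: str) -> tuple[str, int]:
        return self.files[name], self.versions[name]


class CachingClient:
    """Client-side cache; `validate=True` re-checks the version on each read."""

    def __init__(self, server: Server, validate: bool = False) -> None:
        self.server = server
        self.validate = validate
        self.cache: dict[str, tuple[str, int]] = {}
        self.server_reads = 0  # counts full file transfers from the server

    def read(self, name: str) -> str:
        if name in self.cache:
            data, version = self.cache[name]
            # Without validation the cached copy is trusted blindly.
            # With validation the check itself costs a (small) server round trip.
            if not self.validate or self.server.versions[name] == version:
                return data
        self.server_reads += 1
        data, version = self.server.read(name)
        self.cache[name] = (data, version)
        return data


server = Server()
server.write("report.txt", "v1")
fast_client = CachingClient(server)                 # favors performance
safe_client = CachingClient(server, validate=True)  # favors consistency
fast_client.read("report.txt")
safe_client.read("report.txt")

server.write("report.txt", "v2")                    # another client updates the file
stale = fast_client.read("report.txt")
fresh = safe_client.read("report.txt")
```

After the update, the non-validating client still returns the old contents without contacting the server, while the validating client pays an extra server interaction to see the new ones.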
20. Conclusion
• A distributed file system is an evolution of the classical
file system
• It can be advantageous because:
Distributing documents to multiple clients becomes easier
It provides a centralized storage system, so client machines do not
use their own resources to store files.
In Big Data, we often deal with multiple clusters (computers). One of the main advantages of Big Data is that it goes beyond the capabilities of a single, extremely powerful server. The whole idea of Big Data is to distribute data across multiple clusters and to use the computing power of each cluster (node) to process information.
A distributed file system is a system that handles access to data across multiple clusters (nodes).
Distributed file systems can be advantageous because they make it easier to distribute documents to multiple clients and they provide a centralized storage system so that client machines are not using their resources to store files.
How does a distributed file system (DFS) work?
A distributed file system works as follows:
Distribution: Distribute blocks of data sets across multiple nodes. Each node has its own computing power, which lets the DFS process data blocks in parallel.
Replication: The distributed file system also replicates data blocks by copying the same pieces of information onto multiple clusters on different racks. This helps to achieve the following:
Fault Tolerance: recover a data block in case of a cluster failure or rack failure.
High Concurrency: make the same piece of data available for processing by multiple clients at the same time. This is done by using the computing power of each node to process data blocks in parallel.
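As a minimal sketch of how replication supports both properties, the following toy store spreads reads over replicas and skips replicas that are down. The classes `Node` and `ReplicatedStore` are hypothetical, and copying every block to every node is a simplification made only to keep the example short.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.blocks: dict[int, bytes] = {}

    def read(self, block_id: int) -> bytes:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.blocks[block_id]


class ReplicatedStore:
    def __init__(self, nodes: list[Node]) -> None:
        self.nodes = nodes
        self._turn = itertools.count()  # rotates the replica tried first

    def write(self, block_id: int, data: bytes) -> None:
        for node in self.nodes:  # simplification: full replication
            node.blocks[block_id] = data

    def read(self, block_id: int) -> tuple[bytes, str]:
        """Return the block and the name of the node that served it."""
        start = next(self._turn)
        for step in range(len(self.nodes)):
            node = self.nodes[(start + step) % len(self.nodes)]
            try:
                return node.read(block_id), node.name
            except ConnectionError:
                continue  # fault tolerance: fall through to the next replica
        raise RuntimeError("all replicas are down")


store = ReplicatedStore([Node("n1"), Node("n2"), Node("n3")])
store.write(0, b"block-0")
store.nodes[0].alive = False  # simulate a failed node

# High concurrency: several clients read the same block at the same time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda _: store.read(0), range(8)))
```

Every read still succeeds with one node down, and the requests are served by the remaining replicas rather than by a single machine.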
upload/download: files move between server and clients, few operations (read file & write file), simple, requires storage at client, good if whole file is accessed
remote access: files stay at server, rich interface with many operations, less space at client, efficient for small accesses
Key/value: This is a persistent dictionary. It is best for when we know the key and we need to retrieve the associated value for the key.
Column, wide-column, or column-family: This organizes related data into columns instead of the typical organization in rows. It is best for when we need to query across specific columns in the database.
Document: This allows persisting JSON objects (documents), which can include nested objects or arrays of other objects.
Graph: This allows you to persist edges and nodes with their properties. It is best for when we need to store and navigate through complex relationships.
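The four NoSQL data models can be mimicked with plain Python structures. This is only a sketch of the shapes of the data; the sample records and the `neighbors` helper are made up for illustration and no real database is involved.

```python
# Key/value: a dictionary; we know the key and fetch its value.
key_value = {"session:42": "alice"}
session_owner = key_value["session:42"]

# Column-oriented: values are grouped by column, so a query over one
# column does not need to touch the others.
columns = {
    "name": {"u1": "Alice", "u2": "Bob"},
    "age": {"u1": 30, "u2": 25},
}
average_age = sum(columns["age"].values()) / len(columns["age"])

# Document: a JSON-like object with nested objects and arrays.
document = {
    "_id": "u1",
    "name": "Alice",
    "addresses": [{"city": "Paris"}, {"city": "Rome"}],
}
cities = [address["city"] for address in document["addresses"]]

# Graph: nodes with properties, plus edges describing relationships.
nodes = {"alice": {"kind": "user"}, "bob": {"kind": "user"}, "dfs": {"kind": "topic"}}
edges = [("alice", "FOLLOWS", "bob"), ("bob", "LIKES", "dfs"), ("alice", "LIKES", "dfs")]


def neighbors(node: str, relation: str) -> list[str]:
    """Follow all outgoing edges of one relation type from a node."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]


alice_likes = neighbors("alice", "LIKES")
```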
What are the Advantages of Distributed File System (DFS)?
Distributed file system provides the following main advantages:
Scalability: You can scale up your infrastructure by adding more racks or clusters to your system.
Fault Tolerance: Data replication will help to achieve fault tolerance in the following cases:
Cluster is down
Rack is down
Rack is disconnected from the network.
Job failure or restart.
High Concurrency: utilize the compute power of each node to handle multiple client requests (in a parallel way) at the same time.
The following figure illustrates the main concept of high concurrency and how it can be achieved by data replication on multiple clusters.
Access from multiple clients
Same user on different machines can access same files
Simplifies sharing
Different users on different machines can read/write to same files
Simplifies administration
One shared server to maintain (and backup)
Improve reliability
Add RAID storage to server