The document describes the Google File System (GFS), which Google developed to meet its large-scale distributed data and storage needs. GFS uses a single-master architecture in which the master manages metadata and chunk servers store file data in 64 MB chunks that are replicated across machines. It is designed for high reliability and scalability, handling failures through replication and fast recovery. Measurements show it can deliver high throughput to many concurrent readers and writers.
The Google File System (GFS) presented in 2003 is the inspiration for the Hadoop Distributed File System (HDFS). Let's take a deep dive into GFS to better understand Hadoop.
A Distributed File System (DFS) is simply the classical file-system model distributed across multiple machines; its purpose is to promote sharing of dispersed files.
GFS was designed by Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung of Google in 2002-03.
It provides fault tolerance while serving a large number of clients with high aggregate performance.
Google's work goes well beyond search.
Google stores its data on more than 15,000 commodity machines.
GFS handles Google's failure conditions and other Google-specific challenges in its distributed file system.
Google has designed and implemented a scalable distributed file system for its large, distributed, data-intensive applications: the Google File System (GFS).
Google File System
1. THE GOOGLE FILE SYSTEM
S. GHEMAWAT, H. GOBIOFF AND S. LEUNG
APRIL 7, 2015
CSI5311: Distributed Databases and Transaction Processing
Winter 2015
Prof. Iluju Kiringa
University of Ottawa
Presented By:
Ajaydeep Grewal
Roopesh Jhurani
3. Introduction
Google File System (GFS) is a distributed file system developed by Google for its own use.
It is a scalable file system for large distributed data-intensive applications.
It is widely used within Google as a storage platform for the generation and processing of data.
4. Inspirational factors
Multiple clusters distributed worldwide.
Thousands of queries served per second.
A single query can read hundreds of MB of data.
Google stores dozens of copies of the entire Web.
Conclusion:
Need a large, distributed, highly fault-tolerant file system.
Large-scale data processing needs performance, reliability, scalability and availability.
5. Design Assumptions
Component failures
The file system consists of hundreds of machines built from commodity parts.
The quantity and quality of the machines guarantee that some nodes are non-functional at any given time.
Huge file sizes
Workload
Large streaming reads.
Small random reads.
Large, sequential writes that append data to files.
Applications and API are co-designed
Increases flexibility.
The goal is a simple file system that places a light burden on applications.
7. GFS Architecture
Master
Contains the system metadata:
• Namespaces
• Access control information
• Mappings from files to chunks
• Current locations of chunks
Also responsible for:
◦ Garbage collection
◦ Syncing with chunk servers (heartbeat messages)
8. GFS Architecture
Chunk Servers
Machines storing the physical file data, divided into chunks.
Each master can have a number of associated chunk servers.
For reliability, each chunk is replicated on multiple chunk servers.
Chunk Handle
An immutable 64-bit chunk handle is assigned by the master at the time of chunk creation.
9. GFS Architecture
GFS Client code
Code at the client machine that interacts with GFS.
Interacts with the master for metadata operations.
Interacts with chunk servers for all read/write operations.
10. GFS Architecture
1. The GFS client code requests a particular file.
2. The master replies with the location of the chunk server.
3. The client caches this information and interacts directly with the chunk server.
4. Changes are periodically replicated across all the replicas.
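A minimal sketch of this read path in Python, with hypothetical Master and ChunkServer stubs standing in for GFS's actual RPC interfaces (the method names lookup and read_chunk are invented for illustration):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as described on the next slide

class GfsClientSketch:
    """Illustrative client-side read path: master for metadata, chunk server for data."""

    def __init__(self, master, chunk_servers):
        self.master = master                # hypothetical metadata-service stub
        self.chunk_servers = chunk_servers  # map: server address -> data-service stub
        self.location_cache = {}            # (filename, chunk index) -> (chunk handle, replica addresses)

    def read(self, filename, offset, length):
        chunk_index = offset // CHUNK_SIZE           # step 1: derived from the fixed chunk size
        key = (filename, chunk_index)
        if key not in self.location_cache:           # step 2: ask the master only on a cache miss
            handle, replicas = self.master.lookup(filename, chunk_index)
            self.location_cache[key] = (handle, replicas)
        handle, replicas = self.location_cache[key]
        server = self.chunk_servers[replicas[0]]     # step 3: talk to a chunk server directly
        return server.read_chunk(handle, offset % CHUNK_SIZE, length)
```

Because the location cache answers repeat lookups, the master stays out of the data path, which is the point of steps 2-3 above.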
11. Chunk Size
Having a large, uniform chunk size of 64 MB has the following advantages:
Reduced client-master interaction.
Reduced network overhead.
Reduced size of the metadata stored on the master.
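A back-of-the-envelope illustration of the metadata advantage. The 64 MB figure is from the slide; the assumption of roughly 64 bytes of master metadata per chunk is a working number taken from the GFS paper's discussion, used here only for rough scale:

```python
CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB chunks (from the slide)
BYTES_PER_CHUNK_METADATA = 64        # rough working assumption, not an exact GFS figure

def master_metadata_bytes(file_size_bytes: int) -> int:
    """Approximate master-side metadata needed for one file of the given size."""
    num_chunks = (file_size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE   # ceiling division
    return num_chunks * BYTES_PER_CHUNK_METADATA

# A 1 TB file maps to 16,384 chunks, i.e. about 1 MB of metadata on the master.
print(master_metadata_bytes(1 * 2**40))   # -> 1048576
```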
12. Metadata
The file and chunk namespaces.
The mappings from files to chunks.
The location of each chunk's replicas.
The first two are kept persistently in operation log files to ensure reliability and recoverability.
Chunk locations are held by the chunk servers; the master polls the chunk servers at start-up and periodically thereafter.
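A sketch of the three kinds of master state listed above, with Python containers standing in for the master's in-memory structures (field names are illustrative, not GFS's internal ones):

```python
from dataclasses import dataclass, field

@dataclass
class MasterMetadataSketch:
    # 1. File and chunk namespaces (kept persistently via the operation log).
    namespace: set = field(default_factory=set)          # e.g. {"/home/user/foo"}
    # 2. Mapping from files to chunk handles (also logged persistently).
    file_to_chunks: dict = field(default_factory=dict)   # path -> [chunk handle, ...]
    # 3. Chunk replica locations: NOT persisted; rebuilt by polling chunk servers
    #    at start-up and refreshed through heartbeat messages.
    chunk_locations: dict = field(default_factory=dict)  # chunk handle -> [server address, ...]

    def heartbeat(self, server: str, reported_handles: list):
        """Refresh replica locations from a chunk server's periodic report."""
        for handle in reported_handles:
            locations = self.chunk_locations.setdefault(handle, [])
            if server not in locations:
                locations.append(server)
```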
13. Operation Logs
The operation log contains a historical record of critical metadata changes.
Metadata updates are recorded as (old value, new value) pairs.
Since the operation logs are so important, they are replicated on remote machines.
Global snapshots (checkpoints):
A checkpoint is in a compact B-tree-like form and is mapped into memory.
New checkpoints can be created as updates accumulate.
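A sketch of logging metadata changes as (old value, new value) records plus periodic checkpointing. The file format and names here are made up for illustration; the real checkpoint is a compact B-tree image, not JSON:

```python
import json

class OperationLogSketch:
    """Append-only log of metadata mutations plus periodic checkpoints."""

    def __init__(self, log_path="oplog.jsonl", checkpoint_path="checkpoint.json"):
        self.log_path = log_path
        self.checkpoint_path = checkpoint_path
        self.state = {}   # current metadata, e.g. path -> list of chunk handles

    def apply(self, key, new_value):
        old_value = self.state.get(key)
        # Record the change durably as an (old, new) pair before applying it.
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "old": old_value, "new": new_value}) + "\n")
        self.state[key] = new_value

    def checkpoint(self):
        # Snapshot the whole state so recovery only replays log records written after it.
        with open(self.checkpoint_path, "w") as cp:
            json.dump(self.state, cp)
```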
14. System Interactions
Mutation
A mutation is an operation that changes the contents or metadata of a chunk, such as a write or an append operation.
Lease mechanism
Leases are used to maintain a consistent mutation order across replicas.
◦ First, the master grants a chunk lease to one replica and calls it the primary.
◦ The primary determines the order of updates to all the other replicas.
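A toy sketch of master-side lease bookkeeping. The replica-selection policy is simplified here; the 60-second figure matches the paper's initial lease timeout but is treated as a tunable:

```python
import time

LEASE_SECONDS = 60   # the paper's initial lease timeout; illustrative default here

class LeaseTableSketch:
    """Master-side bookkeeping: at most one primary replica per chunk at a time."""

    def __init__(self):
        self.leases = {}   # chunk handle -> (primary server, expiry timestamp)

    def grant(self, chunk_handle, replicas):
        primary, expiry = self.leases.get(chunk_handle, (None, 0))
        if time.time() < expiry:
            return primary                      # an unexpired lease is still valid
        primary = replicas[0]                   # pick one replica as primary (policy simplified)
        self.leases[chunk_handle] = (primary, time.time() + LEASE_SECONDS)
        return primary
```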
15. Write Control and Data Flow
1. The client requests a write operation.
2. The master replies with the locations of the chunk primary and the replicas.
3. The client caches this information and pushes the write data.
4. The primary and the replicas store the data in a buffer and send a confirmation.
5. The primary sends a mutation order to all the secondaries.
6. The secondaries commit the mutations and send a confirmation to the primary.
7. The primary sends a confirmation to the client.
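The seven steps above, compressed into a sketch. The push_data, apply and commit calls are hypothetical stand-ins for GFS's RPCs, and error handling is omitted:

```python
def write_sketch(client_cache, master, servers, chunk_handle, data):
    """Illustrative write path: data flows to all replicas, control flows through the primary."""
    # Steps 1-2: ask the master (or the client's cache) who holds the lease.
    if chunk_handle not in client_cache:
        client_cache[chunk_handle] = master.primary_and_secondaries(chunk_handle)
    primary, secondaries = client_cache[chunk_handle]

    # Steps 3-4: push the data to every replica; each buffers it and acknowledges.
    for server in [primary] + secondaries:
        servers[server].push_data(chunk_handle, data)

    # Step 5: the primary assigns a serial order and forwards the mutation.
    serial = servers[primary].apply(chunk_handle, data)
    for server in secondaries:
        servers[server].commit(chunk_handle, serial)   # step 6: secondaries apply in that order

    return "ok"                                         # step 7: the primary replies to the client
```

Separating the data flow (steps 3-4) from the control flow (steps 5-7) lets the bulk data be pipelined along the network while ordering decisions stay with the single lease holder.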
16. Consistency
Consistent: all the replicated chunks have the same data.
Inconsistent: a failed mutation makes the region inconsistent, i.e., different clients may see different data.
17. Master Operations
1. Namespace Management and Locking
2. Replica Placement
3. Creation, Re-replication and Rebalancing
4. Garbage Collection
5. Stale Replica Detection
18. Master Operations
Namespace Management and Locking
Separate locks on namespace regions ensure:
Serialization.
Multiple concurrent operations on the master, avoiding delays.
Each master operation acquires a set of locks before it runs.
To operate on /dir1/dir2/dir3/leaf it requires:
Read-locks on /dir1, /dir1/dir2, /dir1/dir2/dir3
A read-lock or write-lock on /dir1/dir2/dir3/leaf
File creation doesn't require a write-lock on the parent directory: a read-lock is enough to protect it from deletion, renaming, or snapshotting.
Write-locks on file names serialize attempts to create the same file twice.
19. Master Operations
Locking Mechanism (example)
Snapshotting /home/user into /save/user acquires:
Read-locks on /home and /save
Write-locks on /home/user and /save/user
Creating the file /home/user/foo acquires:
Read-locks on /home and /home/user
A write-lock on /home/user/foo
The two operations conflict on /home/user, so the file creation is serialized against the snapshot (see the sketch below).
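A sketch of the lock-acquisition rule from the last two slides: read-lock every ancestor path, then read- or write-lock the leaf. The lock sets are modelled as plain Python sets purely to show which operations conflict; GFS uses per-node read-write locks:

```python
def locks_needed(path: str, write: bool):
    """Return the (read-lock set, write-lock set) for one operation on `path`."""
    parts = path.strip("/").split("/")
    ancestors = {"/" + "/".join(parts[:i]) for i in range(1, len(parts))}
    leaf = {"/" + "/".join(parts)}
    return (ancestors | leaf, set()) if not write else (ancestors, leaf)

def conflicts(op_a, op_b):
    """Two operations conflict if either one write-locks a path the other touches."""
    reads_a, writes_a = op_a
    reads_b, writes_b = op_b
    return bool(writes_a & (reads_b | writes_b) or writes_b & (reads_a | writes_a))

# Snapshot: read-locks on /home and /save, write-locks on /home/user and /save/user.
snapshot = ({"/home", "/save"}, {"/home/user", "/save/user"})
# File creation: read-locks on ancestors, write-lock on the new file name.
create = locks_needed("/home/user/foo", write=True)
print(conflicts(snapshot, create))   # -> True: they serialize on /home/user
```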
20. Master Operations
Replica Placement
Serves two purposes:
Maximize data reliability and availability.
Maximize network bandwidth utilization.
Spread chunk replicas across racks:
To ensure chunk survivability.
To exploit the aggregate read bandwidth of multiple racks.
The trade-off is that write traffic has to flow through multiple racks.
21. Master Operations
Creation, Re-replication and Rebalancing
Creation: the master considers several factors:
Place new replicas on chunk servers with below-average disk utilization.
Limit the number of "recent" creations on each chunk server.
Spread replicas of a chunk across racks.
Re-replication:
The master re-replicates a chunk when the number of replicas falls below a goal level.
Re-replicated chunks are prioritized based on several factors.
The master limits the number of active clone operations, both for the cluster and for each chunk server.
Each chunk server limits the bandwidth it spends on each clone operation.
Rebalancing:
The master rebalances replicas periodically for better disk usage and load balancing.
The master gradually fills up a new chunk server rather than instantly filling it with new chunks.
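A sketch of the creation heuristic above: prefer chunk servers with below-average disk utilization, cap recent creations, and spread across racks. The selection rule is invented for illustration; the paper names the factors but not an exact formula:

```python
def choose_replica_targets(servers, num_replicas=3, recent_creation_cap=5):
    """servers: list of dicts like
    {"name": "cs7", "rack": "r2", "disk_util": 0.41, "recent_creations": 2}."""
    avg_util = sum(s["disk_util"] for s in servers) / len(servers)
    # Keep servers that are not over-utilized and not flooded with recent creations.
    candidates = [s for s in servers
                  if s["disk_util"] <= avg_util and s["recent_creations"] < recent_creation_cap]
    chosen, used_racks = [], set()
    # Prefer spreading replicas across distinct racks, least-utilized servers first.
    for s in sorted(candidates, key=lambda s: s["disk_util"]):
        if s["rack"] not in used_racks:
            chosen.append(s["name"])
            used_racks.add(s["rack"])
        if len(chosen) == num_replicas:
            break
    return chosen
```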
22. Master Operations
Garbage Collection
GFS performs lazy garbage collection for a deleted file.
Mechanism:
The master logs the deletion like other changes.
The file is renamed to a hidden name that includes the deletion timestamp.
The master removes any such hidden files during regular namespace scanning, erasing their in-memory metadata.
A similar scan of the chunk namespace identifies orphaned chunks and erases their metadata.
Chunk servers can then delete any chunks not identified in the master's metadata, learned during the regular heartbeat message exchange.
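A sketch of the lazy-deletion idea: deletion only renames the file to a hidden, timestamped name, and a later namespace scan reclaims anything old enough. The hidden-name format is made up; the roughly three-day grace period is the paper's default and is treated here as a tunable:

```python
import time

GRACE_SECONDS = 3 * 24 * 3600   # illustrative grace period (the paper's default is about three days)

def delete_file(namespace: dict, path: str):
    """Logical deletion: rename to a hidden name carrying the deletion timestamp."""
    hidden = f"{path}.__deleted__.{int(time.time())}"   # hypothetical naming scheme
    namespace[hidden] = namespace.pop(path)

def namespace_scan(namespace: dict, now=None):
    """Regular scan: erase hidden files whose grace period has expired."""
    now = now or time.time()
    for name in [n for n in namespace if ".__deleted__." in n]:
        deleted_at = int(name.rsplit(".", 1)[-1])
        if now - deleted_at > GRACE_SECONDS:
            del namespace[name]   # in-memory metadata erased; the file's chunks become orphaned
```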
23. Master Operations
Stale Replica Detection
Problem: a chunk replica may become stale if a chunk server fails and misses mutations.
Solution: for each chunk, the master maintains a version number.
Whenever the master grants a new lease on a chunk, it increases the version number and informs the up-to-date replicas (the version number is stored persistently on the master and the associated chunk servers).
The master detects that a chunk server has a stale replica when the chunk server restarts and reports its set of chunks and associated version numbers.
The master removes stale replicas in its regular garbage collection.
The master includes the chunk version number when it informs clients which chunk server holds a lease on a chunk, or when it instructs a chunk server to read the chunk from another chunk server in a cloning operation.
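A sketch of the version-number bookkeeping: the master bumps the version whenever it grants a new lease, and a restarting chunk server that reports an older version is marked stale (names and structures are illustrative):

```python
class VersionTrackerSketch:
    def __init__(self):
        self.current = {}    # chunk handle -> latest version number (persisted by the master)
        self.stale = set()   # (server, chunk handle) pairs scheduled for garbage collection

    def grant_lease(self, chunk_handle):
        """Bump the chunk version each time a new lease is granted."""
        self.current[chunk_handle] = self.current.get(chunk_handle, 0) + 1
        return self.current[chunk_handle]

    def chunk_server_report(self, server, reported):
        """reported: chunk handle -> version number, sent when a chunk server restarts."""
        for handle, version in reported.items():
            if version < self.current.get(handle, 0):
                self.stale.add((server, handle))   # removed later during regular garbage collection
```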
24. Fault Tolerance and Diagnosis
High Availability
Strategies: fast recovery and replication.
Fast recovery:
The master and chunk servers are designed to restore their state in seconds.
There is no distinction between normal and abnormal termination (servers are routinely shut down just by killing the process).
Clients and servers experience a minor timeout on outstanding requests, reconnect to the restarted server, and retry.
Chunk replication:
Chunks are replicated on multiple chunk servers on different racks (different parts of the file namespace can have different replication levels).
The master clones existing replicas as chunk servers go offline or as corrupted replicas are detected (checksum verification).
Master replication:
A shadow master provides read-only access to the file system even when the master is down.
The master's operation logs and checkpoints are replicated on multiple machines for reliability.
25. Fault Tolerance and Diagnosis
Data Integrity
Each chunk server uses checksumming to detect corruption of stored chunks.
A chunk is broken into 64 KB blocks, each with an associated 32-bit checksum.
Checksums are metadata kept in memory and stored persistently with logging, separate from user data.
For reads: the chunk server verifies the checksums of the data blocks that overlap the read range before returning any data.
For writes: the chunk server verifies the checksums of the first and last data blocks that overlap the write range before performing the write, and finally computes and records the new checksums.
26. Measurements
Micro-benchmarks: GFS cluster
One master, 2 master replicas, 16 chunk servers and 16 clients.
Dual 1.4 GHz PIII processors, 2 GB RAM, two 80 GB 5400 rpm disks, and a Fast Ethernet NIC connected to an HP 2524 switch (10/100 ports plus a gigabit uplink).
27. Measurements
Micro-benchmarks: READS
Each client reads a randomly selected 4 MB region 256 times (= 1 GB of data) from a 320 GB file set.
Aggregate chunk server memory is 32 GB, so at most a 10% hit rate in the Linux buffer cache is expected.
28. Measurements
Micro-benchmarks: WRITES
Each client writes 1 GB of data to a new file in a series of 1 MB writes.
The network stack does not interact very well with the pipelining scheme used for pushing data to the chunk replicas: network congestion is more likely for 16 writers than for 16 readers because each write involves 3 different replicas.
29. Measurements
Micro-benchmarks: RECORD APPENDS
Each client appends simultaneously to a single file.
Performance is limited by the network bandwidth of the 3 chunk servers that store the last chunk of the file, independent of the number of clients.
30. Conclusion
Google File System:
Supports large-scale data-processing workloads on COTS x86 servers.
Component failures are the norm rather than the exception.
Optimized for huge files that are mostly appended to and then read sequentially.
Fault tolerance through constant monitoring, replication of crucial data, and fast, automatic recovery.
Delivers high aggregate throughput to many concurrent readers and writers.
Future improvements:
Networking stack limit: write throughput can be improved in the future.
31. References
1. Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google File System." ACM SIGOPS Operating Systems Review 37(5): 29-43, 2003.
2. Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee. "Frangipani: A Scalable Distributed File System." In Proceedings of the 16th ACM Symposium on Operating System Principles, pages 224-237, Saint-Malo, France, October 1997.
3. http://en.wikipedia.org/wiki/Google_File_System
4. http://computer.howstuffworks.com/internet/basics/google-file-system.htm
5. http://en.wikiversity.org/wiki/Big_Data/Google_File_System
6. http://storagemojo.com/google-file-system-eval-part-i/
7. https://www.youtube.com/watch?v=d2SWUIP40Nw
Each chunk server uses checksumming to detect corruption of stored data. Given that a GFS cluster often has thousands of disks on hundreds of machines, it regularly experiences disk failures that cause data corruption or loss on both the read and write paths. (See Section 7 for one cause.) We can recover from corruption using other chunk replicas, but it would be impractical to detect corruption by comparing replicas across chunk servers. Moreover, divergent replicas may be legal: the semantics of GFS mutations, in particular atomic record append as discussed earlier, does not guarantee identical replicas. Therefore, each chunk server must independently verify the integrity of its own copy by maintaining checksums.
A chunk is broken up into 64 KB blocks. Each has a corresponding 32-bit checksum. Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data.
For reads, the chunk server verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another chunk server. Therefore chunk servers will not propagate corruptions to other machines. If a block does not match the recorded checksum, the chunk server returns an error to the requestor and reports the mismatch to the master. In response, the requestor will read from other replicas, while the master will clone the chunk from another replica. After a valid new replica is in place, the master instructs the chunk server that reported the mismatch to delete its replica.
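A sketch of the per-block verification just described, using 64 KB blocks and zlib.crc32 as a stand-in 32-bit checksum (the text specifies a 32-bit checksum per 64 KB block but not the exact algorithm, so CRC32 is an assumption):

```python
import zlib

BLOCK_SIZE = 64 * 1024   # 64 KB checksum blocks

def compute_checksums(chunk_data: bytes):
    """One 32-bit checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verified_read(chunk_data: bytes, checksums, offset: int, length: int) -> bytes:
    """Verify every block that overlaps the requested range before returning any data."""
    first, last = offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise IOError(f"checksum mismatch in block {b}: report to master, read another replica")
    return chunk_data[offset:offset + length]
```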
Checksumming has little effect on read performance for several reasons. Since most of our reads span at least a few blocks, we need to read and checksum only a relatively small amount of extra data for verification. GFS client code further reduces this overhead by trying to align reads at checksum block boundaries. Moreover, checksum lookups and comparison on the chunk server are done without any I/O, and checksum calculation can often be overlapped with I/Os.
Checksum computation is heavily optimized for writes that append to the end of a chunk (as opposed to writes that overwrite existing data) because they are dominant in our workloads. We just incrementally update the checksum for the last partial checksum block, and compute new checksums for any brand-new checksum blocks filled by the append. Even if the last partial checksum block is already corrupted and we fail to detect it now, the new checksum value will not match the stored data, and the corruption will be detected as usual when the block is next read.
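The incremental update for appends can be sketched with zlib.crc32, which accepts a running value and therefore extends the checksum of the last partial block without re-reading the bytes already on disk (again, CRC32 is only an assumed stand-in for the unspecified checksum):

```python
import zlib

BLOCK_SIZE = 64 * 1024

def append_update_checksums(checksums, old_length, appended: bytes):
    """Extend the last partial block's checksum incrementally; add fresh checksums for new blocks."""
    offset, pos = old_length % BLOCK_SIZE, 0
    if offset and appended:
        take = min(BLOCK_SIZE - offset, len(appended))
        # zlib.crc32(data, running) yields the CRC of the concatenation,
        # so the old partial-block bytes never need to be re-read.
        checksums[-1] = zlib.crc32(appended[:take], checksums[-1])
        pos = take
    while pos < len(appended):
        checksums.append(zlib.crc32(appended[pos:pos + BLOCK_SIZE]))
        pos += BLOCK_SIZE
    return checksums
```

Note that if the last partial block was already corrupted, the extended checksum simply will not match the stored bytes, which is exactly the "detected on the next read" behaviour described above.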
In contrast, if a write overwrites an existing range of the chunk, we must read and verify the first and last blocks of the range being overwritten, then perform the write, and finally compute and record the new checksums. If we do not verify the first and last blocks before overwriting them partially, the new checksums may hide corruption that exists in the regions not being overwritten.
During idle periods, chunk servers can scan and verify the contents of inactive chunks. This allows us to detect corruption in chunks that are rarely read. Once the corruption is detected, the master can create a new uncorrupted replica and delete the corrupted replica. This prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk.
In this section we present a few micro-benchmarks to illustrate the bottlenecks inherent in the GFS architecture and implementation, and also some numbers from real clusters in use at Google.
We measured performance on a GFS cluster consisting of one master, two master replicas, 16 chunk servers, and 16 clients. Note that this configuration was set up for ease of testing; typical clusters have hundreds of chunk servers and hundreds of clients. All the machines are configured with dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch. All 19 GFS server machines are connected to one switch, and all 16 client machines to the other. The two switches are connected with a 1 Gbps link.
N clients read simultaneously from the file system. Each client reads a randomly selected 4 MB region from a 320 GB file set. This is repeated 256 times so that each client ends up reading 1 GB of data. The chunk servers taken together have only 32 GB of memory, so we expect at most a 10% hit rate in the Linux buffer cache. Our results should be close to cold cache results.
Figure 3(a) shows the aggregate read rate for N clients and its theoretical limit. The limit peaks at an aggregate of 125 MB/s when the 1 Gbps link between the two switches is saturated, or 12.5 MB/s per client when its 100 Mbps network interface gets saturated, whichever applies. The observed read rate is 10 MB/s, or 80% of the per-client limit, when just one client is reading. The aggregate read rate reaches 94 MB/s, about 75% of the 125 MB/s link limit, for 16 readers, or 6 MB/s per client. The efficiency drops from 80% to 75% because as the number of readers increases, so does the probability that multiple readers simultaneously read from the same chunk server.
N clients write simultaneously to N distinct files. Each client writes 1 GB of data to a new file in a series of 1 MB writes. The aggregate write rate and its theoretical limit are shown in Figure 3(b). The limit plateaus at 67 MB/s because we need to write each byte to 3 of the 16 chunk servers, each with a 12.5 MB/s input connection. The write rate for one client is 6.3 MB/s, about half of the limit. The main culprit for this is our network stack. It does not interact very well with the pipelining scheme we use for pushing data to chunk replicas. Delays in propagating data from one replica to another reduce the overall write rate. The aggregate write rate reaches 35 MB/s for 16 clients (or 2.2 MB/s per client), about half the theoretical limit. As in the case of reads, it becomes more likely that multiple clients write concurrently to the same chunk server as the number of clients increases. Moreover, collision is more likely for 16 writers than for 16 readers because each write involves three different replicas.
Writes are slower than we would like. In practice this has not been a major problem because even though it increases the latencies as seen by individual clients, it does not significantly affect the aggregate write bandwidth delivered by the system to a large number of clients.
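The theoretical limits quoted above follow from simple link arithmetic; a quick recomputation using the testbed numbers from the cluster description:

```python
# Testbed numbers from the measurement setup above.
CLIENT_LINK_MBPS = 100       # each machine's Ethernet link, in megabits per second
SWITCH_LINK_MBPS = 1000      # the 1 Gbps link between the two switches
NUM_CHUNKSERVERS = 16
REPLICAS_PER_WRITE = 3

per_client_read_limit = CLIENT_LINK_MBPS / 8    # 12.5 MB/s per client NIC
aggregate_read_limit = SWITCH_LINK_MBPS / 8     # 125 MB/s across the inter-switch link
# Every byte written lands on 3 of the 16 chunk servers, each able to absorb 12.5 MB/s.
aggregate_write_limit = NUM_CHUNKSERVERS * (CLIENT_LINK_MBPS / 8) / REPLICAS_PER_WRITE

print(per_client_read_limit, aggregate_read_limit, round(aggregate_write_limit, 1))
# -> 12.5 125.0 66.7, matching the 12.5 MB/s, 125 MB/s and ~67 MB/s limits quoted above.
```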
Figure 3(c) shows record append performance. N clients append simultaneously to a single file. Performance is limited by the network bandwidth of the chunk servers that store the last chunk of the file, independent of the number of clients. It starts at 6.0 MB/s for one client and drops to 4.8 MB/s for 16 clients, mostly due to congestion and variances in network transfer rates seen by different clients. Our applications tend to produce multiple such files concurrently. In other words, N clients append to M shared files simultaneously where both N and M are in the dozens or hundreds. Therefore, the chunk server network congestion in our experiment is not a significant issue in practice because a client can make progress on writing one file while the chunk servers for another file are busy.