Google File System (GFS) is a distributed file system developed by Google to store and process large amounts of data across its infrastructure. It is highly scalable, fault-tolerant, and optimized for big data. GFS divides files into fixed-size chunks, replicates chunks across servers, and uses a master server to manage metadata and coordinate the system. It provides scalable storage and access to data.
1. Google File System
Presentation by
MBA 713 Group
Tennyson Sigauke M223098
Beauty Charamba M211405
Roselyn Moyana M223473
Sharon Zinyorewa M222266
Chipo Jekapu M222253
Ramadan Adadi M215900
2. Google File System (GFS)
• Google File System (GFS) is a distributed file system
developed by Google to store, manage, and process large
amounts of data across a massive infrastructure.
• It is designed to be highly scalable, fault-tolerant, and
optimized for handling big data workloads (Ghemawat
et al., 2003).
3. Distributed File System (DFS)
• A Distributed File System (DFS) is a system that enables files and directories
to be accessed and shared across multiple computers or nodes in a network.
It provides a unified and transparent view of distributed storage resources
by abstracting the underlying physical locations and complexities.
• A DFS typically offers features such as file replication, fault tolerance,
scalability, and distributed access control (Tanenbaum & Van Steen, 2006).
4. GFS was developed by Google to (Ghemawat et al., 2003):
• Store, manage, and process large amounts of data across a massive infrastructure.
• Be highly scalable, fault-tolerant, and optimized for handling big data workloads.
• GFS uses a chunk-based architecture, where files are divided into fixed-size chunks
and replicated across multiple servers for data redundancy.
• A master server maintains metadata about the file system and coordinates
operations across the distributed servers.
• GFS prioritizes high throughput for streaming reads and writes, and it aims to
minimize network overhead by placing computation near the data.
• It provides a simple file system interface for applications.
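The chunk-based layout described above can be illustrated with a minimal sketch. It assumes the 64 MB chunk size and three replicas mentioned later in this deck; the function name `plan_chunks`, the server names, and the round-robin placement are hypothetical and chosen only for illustration (the real placement policy is not described in these slides).

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunk size, as stated later in the deck
REPLICAS = 3                   # assumed replication factor


def plan_chunks(file_size, servers, chunk_size=CHUNK_SIZE, replicas=REPLICAS):
    """Return [(chunk_index, [servers holding a replica])] for a file.

    Placement is plain round-robin, which is an assumption made for this
    sketch and not the actual GFS placement policy.
    """
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    num_chunks = -(-file_size // chunk_size)  # ceiling division
    plan = []
    for index in range(num_chunks):
        start = index % len(servers)
        # Pick `replicas` distinct servers, starting at a rotating position.
        holders = [servers[(start + k) % len(servers)] for k in range(replicas)]
        plan.append((index, holders))
    return plan


if __name__ == "__main__":
    # A 200 MB file needs four 64 MB chunks (the last one is partly filled).
    for index, holders in plan_chunks(200 * 1024 * 1024, ["cs1", "cs2", "cs3", "cs4"]):
        print(index, holders)
```

Each chunk ends up on three different servers, so losing any single server still leaves two copies of every chunk.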
5. • GFS is built on top of commodity hardware, such as inexpensive servers and
disk arrays, and is designed to run on a cluster of servers that can scale
horizontally as the amount of data being stored grows.
• The system is designed to provide a single global namespace for all data
stored in the system, allowing applications to access and manipulate large
amounts of data in a consistent and reliable manner.
• GFS also uses a technique called "data replication" to ensure data is stored
redundantly across multiple servers, which helps protect against data loss in
the event of hardware failure or other types of system failures.
• Overall, GFS has been highly successful in scaling to support the massive
amounts of data that Google deals with on a daily basis, and has served as a
key inspiration for other distributed storage systems like Hadoop Distributed File
System (HDFS) and Amazon's Simple Storage Service (S3).
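To make the replication bullet above concrete, here is a small sketch of how a coordinator might spot chunks that have lost replicas after a server failure. The function name `under_replicated`, the data layout, and the target of three replicas are assumptions for illustration, not the actual GFS implementation.

```python
def under_replicated(locations, live_servers, target=3):
    """Return {chunk_handle: missing_replica_count} for chunks below target.

    locations:    dict mapping a chunk handle to the servers that hold it
    live_servers: set of servers currently known to be alive
    target:       desired number of replicas (assumed to be 3 here)
    """
    result = {}
    for handle, servers in locations.items():
        live = [s for s in servers if s in live_servers]
        if len(live) < target:
            result[handle] = target - len(live)
    return result


if __name__ == "__main__":
    locations = {"c1": ["a", "b", "c"], "c2": ["b", "c", "d"]}
    # Server "d" has failed, so only chunk c2 needs one new replica.
    print(under_replicated(locations, live_servers={"a", "b", "c"}))
```

A real system would then copy each listed chunk from a surviving replica to a healthy server until the target is met again.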
6. Main components in the GFS architecture
The File system is divided into three main components:
• Master server,
• Chunk servers and
• Client library.
The master server is the central part of the file system. It handles file metadata and
controls chunk server operations in the file system, while the chunk servers store
the chunks themselves (Ghemawat & Gobioff, 2006).
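The division of work between the three components can be sketched as a toy read path: the client asks the master where a chunk lives, then fetches the bytes directly from a chunk server. All class and method names here are hypothetical, handles are plain strings, and the demo uses a 4-byte chunk size so it stays readable; this is a sketch under those assumptions, not the GFS client API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, overridden with a tiny value in the demo


class ChunkServer:
    """Stores chunk data, keyed by chunk handle."""

    def __init__(self):
        self.chunks = {}  # chunk handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]


class Master:
    """Holds metadata only; file data never passes through it."""

    def __init__(self):
        self.file_chunks = {}  # file name -> ordered list of chunk handles
        self.locations = {}    # chunk handle -> list of ChunkServer objects

    def lookup(self, name, chunk_index):
        handle = self.file_chunks[name][chunk_index]
        return handle, self.locations[handle]


class Client:
    """Client library: translates file offsets and talks to both sides."""

    def __init__(self, master, chunk_size=CHUNK_SIZE):
        self.master = master
        self.chunk_size = chunk_size

    def read(self, name, offset, length):
        """Read bytes that lie within a single chunk."""
        chunk_index, chunk_offset = divmod(offset, self.chunk_size)
        handle, servers = self.master.lookup(name, chunk_index)  # metadata only
        return servers[0].read(handle, chunk_offset, length)     # data transfer


if __name__ == "__main__":
    server = ChunkServer()
    server.chunks = {"h0": b"abcd", "h1": b"efgh"}
    master = Master()
    master.file_chunks["log"] = ["h0", "h1"]
    master.locations = {"h0": [server], "h1": [server]}
    client = Client(master, chunk_size=4)
    print(client.read("log", 5, 2))  # bytes 5-6 of the file, from the second chunk
```

Keeping the master out of the data path is what lets a single master serve many clients: it answers small metadata queries while the chunk servers carry the bulk traffic.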
7. Features of GFS
Namespace management and locking.
Fault tolerance.
Reduced client and master interaction because of large chunk size.
High availability.
Critical data replication.
Automatic and efficient data recovery.
High aggregate throughput.
8. Use of Google File System (GFS) by Google:
• 1. Google Search: GFS is a critical component of Google's search infrastructure. It stores and
manages the vast index of web pages and documents that Google's search engine uses to
provide relevant search results to users.
• 2. Google Maps: GFS is used to store and serve the massive amount of geographical data that
powers Google Maps. This includes map tiles, satellite imagery, street view images, and other
location-related data.
• 3. YouTube: GFS plays a crucial role in storing and delivering the enormous amount of video
content on YouTube. It allows for efficient storage, replication, and distribution of video files to
ensure smooth playback for millions of users worldwide.
• 4. Gmail: GFS is utilized for storing and managing the immense volume of user data in Gmail,
Google's popular email service. It ensures reliable and efficient storage of emails, attachments,
and other user-related data.
• 5. Google Cloud Platform: GFS serves as the underlying storage system for various services and
products offered by Google Cloud Platform (GCP). It provides scalable and resilient storage for
applications, databases, analytics, and other data-intensive workloads on the cloud platform.
These are just a few examples of how GFS is used within Google's ecosystem. They demonstrate the
system's ability to handle large-scale data storage, replication, and retrieval requirements for a
variety of applications and services.
9. Advantages of Google File System (GFS)
• 1. Scalability: GFS has been designed from the ground up to handle large amounts of data, making it
incredibly scalable. It can easily scale up or down to meet the changing needs of an organization.
• 2. Fault Tolerance: GFS is designed to be highly fault-tolerant. It uses data replication and automatic data
recovery to ensure that data is always available, even in the event of hardware failures.
• 3. Consistency: GFS offers a relaxed consistency model suited to its workloads. File namespace changes are
atomic, and chunk checksums help to detect data corruption.
• 4. Manageability: GFS provides a single global namespace for all data stored in the system, making it easy
to manage and access data across geographically dispersed locations.
• 5. Performance: GFS is optimized for high-performance data access. It uses a technique called “Data
Chunking” to allow for faster data retrieval and also provides built-in support for data snapshotting.
• 6. Low cost: GFS runs on inexpensive commodity hardware, making it a low-cost alternative to other
enterprise-level file systems. GFS itself is proprietary to Google; HDFS is an open-source system modeled on it.
Overall, GFS is an effective and scalable file system that provides many benefits over
traditional file systems for very large data sets. It is highly fault-tolerant, manageable, and cost-effective,
which made it well suited to Google's large-scale workloads.
10. Disadvantages of GFS
1. Not the best fit for small files.
2. Master may act as a bottleneck.
3. Poor support for random writes.
4. Best suited to data that is written once and then only
read or appended to later.
12. Chunk
❖ Files are divided into fixed-size blocks called chunks
❖ Each chunk is 64 MB, much larger than a typical file system block size
❖ Each chunk is replicated 3 or more times
❖ Each chunk is identified by a 64-bit chunk handle
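Because chunks have a fixed size, a byte range in a file maps to chunk indexes by simple arithmetic. The sketch below shows that translation; the helper name `split_range` is hypothetical, and only the 64 MB chunk size comes from the slide.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as stated on the slide


def split_range(offset, length, chunk_size=CHUNK_SIZE):
    """Split a byte range into (chunk_index, start_in_chunk, nbytes) pieces."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    pieces = []
    end = offset + length
    while offset < end:
        index, start = divmod(offset, chunk_size)
        # Take bytes up to the end of this chunk or the end of the range.
        nbytes = min(chunk_size - start, end - offset)
        pieces.append((index, start, nbytes))
        offset += nbytes
    return pieces


if __name__ == "__main__":
    MB = 1024 * 1024
    # A 10 MB read starting at 60 MB crosses from chunk 0 into chunk 1.
    print(split_range(60 * MB, 10 * MB))
```

A read that crosses a chunk boundary turns into one request per chunk, each of which may be served by a different chunk server.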
13. Metadata
❖ Metadata is data about the stored data; for example, a picture's metadata records
the location where the picture was taken, the date, time, event, etc.
❖ Three major types of metadata
➢ The file and chunk namespaces
➢ The mapping from files to chunks
➢ Locations of each chunk’s replicas
❖ All the metadata is kept in the Master’s memory
❖ The master keeps less than 64 bytes of metadata for each 64 MB chunk
❖ Chunk locations are refreshed on every master restart and through periodic heartbeat messages
❖ Operation log contains a historical record of critical metadata changes.
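The figures above explain why the master can keep all chunk metadata in memory. A rough worked estimate, assuming every chunk is full and treating 64 bytes per chunk as an upper bound (the function name `chunk_metadata_bound` is hypothetical, and namespace metadata is not counted):

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB per chunk
METADATA_PER_CHUNK = 64   # bytes per chunk, used here as an upper bound


def chunk_metadata_bound(total_bytes):
    """Return (number_of_chunks, metadata_bytes) for the given amount of data."""
    chunks = -(-total_bytes // CHUNK_SIZE)  # ceiling division
    return chunks, chunks * METADATA_PER_CHUNK


if __name__ == "__main__":
    chunks, meta = chunk_metadata_bound(2**50)  # 1 PiB of file data
    print(chunks, "chunks,", meta / 2**30, "GiB of chunk metadata")
```

Under these assumptions, one pebibyte of data needs about 16.8 million chunks and at most 1 GiB of chunk metadata. Many small files break the assumption, since each one occupies its own chunk entry, which is one reason GFS is a poor fit for small files.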
14. • Google no longer uses GFS. The company moved its search to a new software
foundation based on a revamped file system known as Colossus, and Urs Hölzle
• Colossus now underpins virtually all of Google's web services, from Gmail,
Google Docs, and YouTube to the Google Cloud Storage service the company
offers to third-party developers.
• Whereas GFS was built for batch operations -- i.e., operations that happen in
the background before they're actually applied to a live website -- Colossus is
specifically built for "realtime" services.
15. References
• Ghemawat, S., Gobioff, H., & Leung, S.T. (2003). The Google File System. ACM SIGOPS Operating
Systems Review.
• Ghemawat, S., & Gobioff, H. (2005). The Google File System: Evolutionary Iteration vs. Clean-Slate
Design: A Case Study. In Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on
Computer Systems (EuroSys '06), 1-10.
• Ghemawat, S., & Gobioff, H. (2006). Understanding the Performance of a Large-Scale Distributed File
System. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer
Systems (EuroSys '07), 1-8.
• Ghemawat, S., Gobioff, H., & Leung, S.T. (2018). The Google File System. ACM Transactions on
Storage (TOS), 12(4), 1-37.
• Tanenbaum, A. S., & Van Steen, M. (2006). Distributed Systems: Principles and Paradigms. Pearson
Education.