THE GOOGLE FILE SYSTEM
S. GHEMAWAT, H. GOBIOFF AND S. LEUNG
APRIL 7, 2015
CSI5311: Distributed Databases and Transaction Processing
Winter 2015
Prof. Iluju Kiringa
University of Ottawa
Presented By:
Ajaydeep Grewal
Roopesh Jhurani
1
AGENDA
• Introduction
• Design Overview
• System Interactions
• Master Operations
• Fault Tolerance and Diagnosis
• Measurements
• Conclusion
• References
2
Introduction
 Google File System (GFS) is a distributed file
system developed by Google for its own use.
 It is a scalable file system for large distributed
data-intensive applications.
 It is widely used within Google as a storage
platform for generation and processing of data.
3
Inspirational factors
 Multiple clusters distributed worldwide.
 Thousands of queries served per second.
 A single query can read hundreds of megabytes of data.
 Google stores dozens of copies of the entire Web.
Conclusion
 Need a large, distributed, highly fault-tolerant file system.
 Large-scale data processing needs performance, reliability,
scalability, and availability.
4
Design Assumptions
 Component Failures
File System consists of hundreds of machines made from
commodity parts.
The quantity and quality of the machines virtually guarantee that
some nodes are non-functional at any given time.
 Huge File Sizes
 Workload
Large streaming reads.
Small random reads.
Large, sequential writes that append data to files.
 Applications & API are co-designed
Increases flexibility.
Goal: a simple file system that places a light burden on applications. 5
GFS Architecture
Master
Chunk Servers
GFS Client API
6
GFS Architecture
Master
Contains the system metadata like:
• Namespaces
• Access Control Information
• Mappings from files to chunks
• Current location of chunks
Also helps in:
◦ Garbage collection
◦ Syncing state with Chunk Servers (heartbeat messages)
7
GFS Architecture
Chunk Servers
 Machines containing physical files divided into chunks.
 Each Master server can have a number of associated chunk
servers.
 For reliability, each chunk is replicated on multiple chunk
servers.
Chunk Handle
 An immutable, globally unique 64-bit chunk handle assigned by the
master at the time of chunk creation.
8
GFS Architecture
GFS Client code
 Code linked into each client application that interacts with GFS.
 Interacts with the master for metadata operations.
 Interacts with Chunk Servers for all Read-Write operations.
9
GFS Architecture
1. The GFS client code requests a particular file (file name and chunk index) from the master.
2. The master replies with the chunk handle and the locations of the chunk servers holding it.
3. The client caches this information and interacts directly with the chunk server.
4. Changes are periodically replicated across all the replicas.
10
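The read path on this slide can be summarized in a short sketch. This is only a hedged illustration: the `Client`, `lookup`, and `read_chunk` names are invented, and the real GFS client library is not published at this level of detail.

```python
# Minimal sketch of the GFS client read path described above (hypothetical API).
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

class Client:
    def __init__(self, master):
        self.master = master
        self.location_cache = {}   # (filename, chunk_index) -> (handle, replicas)

    def read(self, filename, offset, length):
        # Assumes the read does not span a chunk boundary.
        chunk_index = offset // CHUNK_SIZE
        key = (filename, chunk_index)
        if key not in self.location_cache:
            # Steps 1-2: ask the master once per chunk, then cache the answer.
            handle, replicas = self.master.lookup(filename, chunk_index)
            self.location_cache[key] = (handle, replicas)
        handle, replicas = self.location_cache[key]
        # Step 3: talk to a chunk server directly; the master stays off the data path.
        chunkserver = replicas[0]
        return chunkserver.read_chunk(handle, offset % CHUNK_SIZE, length)
```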
Chunk Size
Having a large uniform chunk size of 64 MB has the
following advantages:
 Reduced client-master interaction.
 Reduced network overhead.
 Reduced size of the metadata stored on the master (see the sketch below).
11
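A quick back-of-the-envelope calculation shows why the 64 MB chunk size keeps master metadata small; the 1 TB file below is a made-up example, and the paper notes that the master keeps less than 64 bytes of metadata per chunk.

```python
# Why large chunks shrink metadata: count chunk entries for a 1 TB file.
CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB
file_size = 1 * 1024**4                # hypothetical 1 TB file

chunks_64mb = file_size // CHUNK_SIZE  # 16,384 chunk entries
chunks_4kb = file_size // (4 * 1024)   # ~268 million entries at a 4 KB block size
print(chunks_64mb, chunks_4kb)
```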
Metadata
 The file and chunk namespaces.
 The mappings from files to chunks.
 Locations of each chunk's replicas.
The first two are kept persistently in the operation log to
ensure reliability and recoverability.
Chunk locations are not persisted by the master; they are held by the chunk servers.
The master polls the chunk servers at start-up and
periodically thereafter (via heartbeats).
12
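The three kinds of metadata can be pictured as three maps on the master. The structure below is purely an illustrative sketch with invented handles and server names, not the real data layout:

```python
# Hypothetical sketch of the three kinds of master metadata listed above.
master_metadata = {
    # 1. File and chunk namespaces (kept persistently via the operation log).
    "namespace": {"/dir1/fileA", "/dir1/dir2/fileB"},
    # 2. File -> ordered list of chunk handles (also logged persistently).
    "file_to_chunks": {"/dir1/fileA": [0x1A2B, 0x1A2C]},
    # 3. Chunk handle -> current replica locations (NOT persisted; rebuilt by
    #    polling chunk servers at start-up and refreshed via heartbeats).
    "chunk_locations": {0x1A2B: ["cs-07", "cs-12", "cs-31"]},
}
print(master_metadata["chunk_locations"][0x1A2B])
```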
Operation Logs
 The operation log contains a historical record of critical
metadata changes.
 Metadata updates are recorded, e.g., as (old value, new value) pairs.
 Because the operation log is critical, it is replicated on
multiple remote machines.
 Global snapshots (checkpoints)
 A checkpoint is stored in a compact, B-tree-like form that can be
mapped directly into memory.
 The master creates a new checkpoint when the log grows beyond a
certain size, so recovery replays only the records after the latest checkpoint.
13
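A minimal sketch of the operation-log discipline described above, under invented class and method names; a real implementation batches and flushes log records to disk, which is omitted here.

```python
# Hedged sketch: log (old, new) records, replicate them, checkpoint when large.
class OperationLog:
    def __init__(self, remote_replicas, checkpoint_threshold=10_000):
        self.records = []                       # in-memory tail of the log
        self.remote_replicas = remote_replicas  # e.g. lists standing in for remote logs
        self.checkpoint_threshold = checkpoint_threshold

    def record_mutation(self, key, old_value, new_value):
        entry = (key, old_value, new_value)
        self.records.append(entry)
        # Replicate the record to remote machines before the change is
        # considered durable.
        for replica in self.remote_replicas:
            replica.append(entry)
        if len(self.records) >= self.checkpoint_threshold:
            self.checkpoint()

    def checkpoint(self):
        # A real checkpoint is a compact, B-tree-like snapshot that can be
        # mapped into memory; here we simply treat the logged tail as truncatable.
        self.records.clear()

log = OperationLog(remote_replicas=[[], []])
log.record_mutation("/dir1/fileA#chunks", None, [0x1A2B])
```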
System Interactions
 Mutation
A mutation is an operation that changes the contents or
metadata of a chunk such as a write or an append operation.
 Lease mechanism
Leases are used to maintain a consistent mutation order across
replicas.
◦ First, the master grants a chunk lease to one of the replicas,
which becomes the primary.
◦ The primary picks a serial order for all mutations to the chunk;
the other replicas follow this order.
14
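A sketch of the lease bookkeeping on the master side. The 60-second lease duration comes from the paper; the class and method names, and the rule for picking a primary, are invented for illustration.

```python
# Hypothetical lease table: one primary per chunk, leases expire and are renewed.
import time

LEASE_SECONDS = 60   # the paper's initial lease timeout

class LeaseTable:
    def __init__(self):
        self.leases = {}   # chunk_handle -> (primary_replica, expiry_time)

    def grant_or_refresh(self, chunk_handle, replicas):
        primary, expiry = self.leases.get(chunk_handle, (None, 0))
        if primary is None or time.time() > expiry:
            primary = replicas[0]            # pick one replica as primary
        self.leases[chunk_handle] = (primary, time.time() + LEASE_SECONDS)
        return primary                       # the primary decides mutation order

table = LeaseTable()
print(table.grant_or_refresh(0x1A2B, ["cs-07", "cs-12", "cs-31"]))  # 'cs-07'
```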
Write Control and Data Flow
15
1. The client asks the master which chunk server holds the lease (the primary) and where the replicas are.
2. The master replies with the locations of the primary and the secondary replicas.
3. The client caches this information and pushes the write data to all the replicas.
4. The primary and secondaries buffer the data and send a confirmation.
5. The primary assigns a serial order to the mutation and forwards this order to all the secondaries.
6. The secondaries apply the mutations in that order and send a confirmation to the primary.
7. The primary sends a confirmation to the client.
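The seven steps above, seen from the client, might look like the sketch below. The object and method names are invented, and for simplicity the data push is shown going from the client to each replica directly, whereas GFS actually pipelines the data along a chain of chunk servers.

```python
# Hedged sketch of the write protocol above from the client's point of view.
def write_chunk(master, chunk_handle, offset, data):
    # Steps 1-2: learn which replica holds the lease (primary) and where the
    # secondaries are; this answer is cacheable until the lease changes.
    primary, secondaries = master.find_lease_holder(chunk_handle)

    # Steps 3-4: push the data to every replica, where it sits in a buffer
    # until applied; each replica acknowledges receipt.
    for replica in [primary] + secondaries:
        replica.push_data(chunk_handle, data)

    # Step 5: send the write request to the primary, which assigns a serial
    # number to the mutation and forwards that order to all secondaries.
    # Step 6: secondaries apply the mutation in that order and ack the primary.
    status = primary.apply_write(chunk_handle, offset, secondaries)

    # Step 7: the primary reports success (or any replica errors) to the client.
    return status
```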
Consistency
 Consistent: all clients see the same data, regardless of which
replica they read from.
 Inconsistent: a failed mutation leaves the region
inconsistent, i.e., different clients may see different
data.
16
Master Operations
1. Namespace Management and Locking
2. Replica Placement
3. Creation, Re-replication and Rebalancing
4. Garbage Collection
5. Stale Replica Detection
17
Master Operations
Namespace Management and Locking
 Locks over regions of the namespace ensure:
 Proper serialization of conflicting operations.
 Multiple operations can run concurrently at the master, avoiding delays.
 Each master operation acquires a set of locks before it runs.
 An operation on /dir1/dir2/dir3/leaf requires:
 Read-locks on /dir1, /dir1/dir2, /dir1/dir2/dir3
 A read-lock or write-lock on /dir1/dir2/dir3/leaf
 File creation doesn’t require a write-lock on the parent directory: a read-lock is
enough to protect it from being deleted, renamed, or snapshotted.
 Write-locks on file names serialize attempts to create a file with the same name twice.
18
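The locking rule can be expressed compactly: read-lock every ancestor directory, then take a read- or write-lock on the leaf. The helper below is only an illustrative sketch, not the master's actual implementation.

```python
# Sketch of the locking rule above: read-locks on ancestors, leaf lock last.
def locks_for(path, write_leaf):
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    leaf = "/" + "/".join(parts)
    return ([(d, "read") for d in ancestors] +
            [(leaf, "write" if write_leaf else "read")])

# An operation that mutates /dir1/dir2/dir3/leaf:
print(locks_for("/dir1/dir2/dir3/leaf", write_leaf=True))
# [('/dir1', 'read'), ('/dir1/dir2', 'read'), ('/dir1/dir2/dir3', 'read'),
#  ('/dir1/dir2/dir3/leaf', 'write')]
```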
Master Operations
Locking Mechanism
 Example: /home/user is being snapshotted to /save/user while /home/user/foo is being created.
 The snapshot acquires:
 Read-locks on /home and /save
 Write-locks on /home/user and /save/user
 The file creation acquires:
 Read-locks on /home and /home/user
 A write-lock on /home/user/foo
 The two operations conflict on /home/user and are therefore serialized.
19
Master Operations
Replica Placement
 Serves two purposes:
 Maximize data reliability and availability
 Maximize Network Bandwidth utilization
 Spread Chunk replicas across racks:
 To ensure chunk survivability
 To exploit aggregate read bandwidth of multiple racks
 Trade-off: write traffic has to flow through multiple racks.
20
Master Operations
Creation, re-replication and rebalancing
 Creation: Master considers several factors
 Place new replicas on chunk servers with below-average disk utilization.
 Limit the number of “recent” creations on each chunk server.
 Spread replicas of a chunk across racks.
 Re-replication:
 The master re-replicates a chunk when the number of replicas falls below its goal level.
 Chunks to re-replicate are prioritized based on several factors.
 The master limits the number of active clone operations, both for the cluster and
for each chunk server.
 Each chunk server limits the bandwidth it spends on each clone operation.
 Balancing:
 The master rebalances replicas periodically for better disk utilization and load balancing.
 The master fills up a new chunk server gradually rather than swamping it instantly with
new chunks.
21
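The creation heuristics above can be approximated by ranking chunk servers and skipping racks that already hold a replica. The scoring below is an invented simplification for illustration, not the actual GFS policy.

```python
# Hypothetical placement sketch: prefer low disk utilization, few recent
# creations, and new racks; pick num_replicas servers.
def choose_servers(servers, num_replicas=3):
    # Sort so that low disk utilization and few recent creations come first.
    ranked = sorted(servers, key=lambda s: (s["disk_util"], s["recent_creations"]))
    chosen, used_racks = [], set()
    for s in ranked:
        if s["rack"] in used_racks:          # spread replicas across racks
            continue
        chosen.append(s["name"])
        used_racks.add(s["rack"])
        if len(chosen) == num_replicas:
            break
    return chosen

servers = [
    {"name": "cs-01", "rack": "r1", "disk_util": 0.40, "recent_creations": 2},
    {"name": "cs-02", "rack": "r1", "disk_util": 0.55, "recent_creations": 0},
    {"name": "cs-03", "rack": "r2", "disk_util": 0.48, "recent_creations": 1},
    {"name": "cs-04", "rack": "r3", "disk_util": 0.70, "recent_creations": 5},
]
print(choose_servers(servers))   # ['cs-01', 'cs-03', 'cs-04']
```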
Master Operations
Garbage Collection
 GFS reclaims the storage of deleted files lazily.
 Mechanism:
 The master logs the deletion like other changes.
 The file is renamed to a hidden name that includes the deletion timestamp.
 During its regular namespace scan, the master removes hidden files
older than a grace period, erasing their in-memory metadata.
 A similar scan of the chunk namespace identifies orphaned
chunks and erases their metadata.
 In the regular heartbeat exchange, each chunk server learns which of its
chunks no longer appear in the master's metadata and is free to delete them (see the sketch below).
22
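A sketch of the lazy-deletion mechanism, using an invented naming convention for hidden files; the three-day grace period is the paper's default.

```python
# Hypothetical lazy deletion: rename to a hidden name now, reclaim later.
import time

HIDDEN_PREFIX = ".deleted."
GRACE_PERIOD = 3 * 24 * 3600     # the paper's default grace period is 3 days

def delete_file(namespace, path):
    # Only log/rename: the file gets a hidden name carrying the deletion
    # timestamp, and no chunk is reclaimed yet.
    hidden = f"{HIDDEN_PREFIX}{int(time.time())}.{path.strip('/')}"
    namespace[hidden] = namespace.pop(path)

def namespace_scan(namespace, now=None):
    now = now or time.time()
    for name in list(namespace):
        if name.startswith(HIDDEN_PREFIX):
            ts = int(name.split(".")[2])
            if now - ts > GRACE_PERIOD:
                del namespace[name]      # in-memory metadata erased; orphaned
                                         # chunks are garbage-collected later

ns = {"/dir1/fileA": [0x1A2B]}
delete_file(ns, "/dir1/fileA")
namespace_scan(ns)                       # nothing removed yet: grace period not over
```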
Master Operations
Stale Replica Detection
 Problem: a chunk replica may become stale if its chunk server fails and
misses mutations.
 Solution: for each chunk, the master maintains a version number.
 Whenever the master grants a new lease on a chunk, it increments the
chunk version number and informs the up-to-date replicas (the version
number is stored persistently on the master and the associated chunk servers).
 The master detects that a chunk server has a stale replica when the chunk
server restarts and reports its set of chunks and their version
numbers.
 The master removes stale replicas during its regular garbage collection.
 The master includes the chunk version number when it tells clients
which chunk server holds the lease on a chunk, or when it instructs a
chunk server to read a chunk from another chunk server in a
cloning operation.
23
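The version-number scheme can be sketched as follows; the structures are hypothetical and persistence and error handling are omitted.

```python
# Sketch of chunk version bookkeeping for stale-replica detection.
class ChunkVersions:
    def __init__(self):
        self.current = {}   # chunk_handle -> latest version known to the master

    def grant_lease(self, handle, up_to_date_replicas):
        # Bump the version on every new lease; the master and the up-to-date
        # replicas both record the new number.
        self.current[handle] = self.current.get(handle, 0) + 1
        for replica_versions in up_to_date_replicas:
            replica_versions[handle] = self.current[handle]

    def stale_replicas(self, reported):
        # Called when a chunk server restarts and reports {handle: version}.
        return [h for h, v in reported.items() if v < self.current.get(h, 0)]

versions = ChunkVersions()
cs_a, cs_b = {}, {}                       # per-chunk-server version tables
versions.grant_lease(0x1A2B, [cs_a, cs_b])
versions.grant_lease(0x1A2B, [cs_a])      # cs_b was down and missed this lease
print(versions.stale_replicas(cs_b))      # [6699] (0x1A2B) -> stale on cs_b
```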
Fault Tolerance and Diagnosis
High Availability
 Strategies: Fast recovery and Replication.
 Fast Recovery:
 Master and chunk servers are designed to restore their state in seconds.
 There is no distinction between normal and abnormal
termination (servers are routinely shut down simply by killing the process).
 Clients and other servers experience a minor timeout on outstanding requests, reconnect to
the restarted server, and retry.
 Chunk Replication:
 Each chunk is replicated on multiple chunk servers on different racks (different parts of the
file namespace can have different replication levels).
 The master clones existing replicas as needed when chunk servers go offline or report corrupted
replicas (detected by checksum verification).
 Master Replication
 “Shadow” masters provide read-only access to the file system even when the primary master is
down.
 Master operation logs and checkpoints are replicated on multiple machines for
reliability.
24
Fault Tolerance and Diagnosis
Data Integrity
 Each chunk server uses checksumming to detect corruption of stored
chunks.
 Each chunk is broken into 64 KB blocks, each with an associated 32-bit checksum.
 Checksums are metadata: they are kept in memory and stored persistently with
logging, separate from user data.
 For reads: the chunk server verifies the checksums of the data blocks that
overlap the read range before returning any data.
 For writes that overwrite existing data: the chunk server verifies the checksums
of the first and last blocks that overlap the write range before performing the
write, then computes and records the new checksums (see the sketch below).
25
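A sketch of per-block checksumming on the read path. The paper specifies 64 KB blocks with 32-bit checksums but does not name the checksum function; CRC32 is used here purely for illustration.

```python
# Illustrative per-block checksumming: 64 KB blocks, 32-bit checksums (CRC32).
import zlib

BLOCK_SIZE = 64 * 1024

def block_checksums(chunk_data):
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_range(chunk_data, checksums, offset, length):
    # Verify every 64 KB block overlapping [offset, offset + length)
    # before returning any data, as on the read path above.
    first, last = offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise IOError(f"checksum mismatch in block {b}")
    return chunk_data[offset:offset + length]

data = bytes(200 * 1024)                 # a 200 KB chunk => 4 blocks
sums = block_checksums(data)
verify_range(data, sums, offset=70_000, length=10_000)
```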
Measurements
Micro-benchmarks: GFS cluster
 One master, two master replicas, and 16 chunk servers, with 16 clients.
 Each machine: dual 1.4 GHz PIII processors, 2 GB RAM, two 80 GB 5400 RPM
disks, and a 100 Mbps full-duplex Fast Ethernet NIC connected to an HP 2524
switch (10/100 ports with a 1 Gbps link between the two switches).
26
Measurements
Micro-benchmarks: READS
 Each client reads a randomly selected 4 MB region 256 times (= 1 GB of
data) from a 320 GB file set.
 Aggregate chunk server memory is 32 GB, so at most a ~10% hit rate in the Linux
buffer cache is expected (see the calculation below).
27
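Where these numbers come from: the 320 GB file set, the 100 Mbps client NICs, and the 1 Gbps inter-switch link are all taken from the speaker notes at the end of this deck.

```python
# Back-of-the-envelope limits behind the read benchmark.
file_set = 320               # GB read in aggregate by the 16 clients
chunkserver_ram = 32         # GB of combined chunk server memory
print(chunkserver_ram / file_set)    # 0.1 -> at most a ~10% buffer-cache hit rate

client_nic = 100 / 8         # 12.5 MB/s per client over 100 Mbps Fast Ethernet
switch_uplink = 1000 / 8     # 125 MB/s over the 1 Gbps inter-switch link
print(client_nic, switch_uplink)     # per-client and aggregate read limits
```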
Measurements
Micro-benchmarks: WRITE
 Each client writes 1 GB of data to a new file in a series of 1 MB writes.
 The network stack does not interact very well with the pipelining scheme
used for pushing data to the chunk replicas; network congestion is
more likely for 16 writers than for 16 readers because each write
involves three different replicas.
28
Measurements
Micro-benchmarks: RECORD APPENDS
 All clients append simultaneously to a single file.
 Performance is limited by the network bandwidth of the three chunk
servers that store the last chunk of the file, independent of the number
of clients.
29
Conclusion
Google File System
 Supports large-scale data-processing workloads on commodity (COTS) x86 servers.
 Component failures are the norm rather than the exception.
 Optimized for huge files that are mostly appended to and then read sequentially.
 Fault tolerance through constant monitoring, replication of crucial data, and
fast, automatic recovery.
 Delivers high aggregate throughput to many concurrent readers and
writers.
Future Improvements
 Networking stack limit: write throughput is currently constrained by the
network stack and can be improved in the future.
30
References
1. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google
File System. In Proceedings of the 19th ACM Symposium on Operating
Systems Principles, pages 29–43, Bolton Landing, NY, USA, October 2003.
2. Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee.
Frangipani: A scalable distributed file system. In Proceedings of the
16th ACM Symposium on Operating System Principles, pages 224–
237, Saint-Malo, France, October 1997.
3. http://en.wikipedia.org/wiki/Google_File_System
4. http://computer.howstuffworks.com/internet/basics/google-file-
system.htm
5. http://en.wikiversity.org/wiki/Big_Data/Google_File_System
6. http://storagemojo.com/google-file-system-eval-part-i/
7. https://www.youtube.com/watch?v=d2SWUIP40Nw
31
Thank You!!
32
Editor's Notes
  1. Each chunk server uses checksumming to detect corruption of stored data. Given that a GFS cluster often has thousands of disks on hundreds of machines, it regularly experiences disk failures that cause data corruption or loss on both the read and write paths. (See Section 7 of the paper for one cause.) We can recover from corruption using other chunk replicas, but it would be impractical to detect corruption by comparing replicas across chunk servers. Moreover, divergent replicas may be legal: the semantics of GFS mutations, in particular atomic record append as discussed earlier, do not guarantee identical replicas. Therefore, each chunk server must independently verify the integrity of its own copy by maintaining checksums. A chunk is broken up into 64 KB blocks. Each has a corresponding 32-bit checksum. Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data. For reads, the chunk server verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another chunk server. Therefore chunk servers will not propagate corruptions to other machines. If a block does not match the recorded checksum, the chunk server returns an error to the requester and reports the mismatch to the master. In response, the requester will read from other replicas, while the master will clone the chunk from another replica. After a valid new replica is in place, the master instructs the chunk server that reported the mismatch to delete its replica. Checksumming has little effect on read performance for several reasons. Since most reads span at least a few blocks, only a relatively small amount of extra data needs to be read and checksummed for verification. GFS client code further reduces this overhead by trying to align reads at checksum block boundaries. Moreover, checksum lookups and comparison on the chunk server are done without any I/O, and checksum calculation can often be overlapped with I/O. Checksum computation is heavily optimized for writes that append to the end of a chunk (as opposed to writes that overwrite existing data) because they are dominant in our workloads. We just incrementally update the checksum for the last partial checksum block, and compute new checksums for any brand-new checksum blocks filled by the append. Even if the last partial checksum block is already corrupted and we fail to detect it now, the new checksum value will not match the stored data, and the corruption will be detected as usual when the block is next read. In contrast, if a write overwrites an existing range of the chunk, we must read and verify the first and last blocks of the range being overwritten, then perform the write, and finally compute and record the new checksums. If we do not verify the first and last blocks before overwriting them partially, the new checksums may hide corruption that exists in the regions not being overwritten. During idle periods, chunk servers can scan and verify the contents of inactive chunks. This allows corruption to be detected in chunks that are rarely read. Once the corruption is detected, the master can create a new uncorrupted replica and delete the corrupted replica. This prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk.
  2. In this section we present a few micro-benchmarks to illustrate the bottlenecks inherent in the GFS architecture and implementation, and also some numbers from real clusters in use at Google. We measured performance on a GFS cluster consisting of one master, two master replicas, 16 chunk servers, and 16 clients. Note that this configuration was set up for ease of testing. Typical clusters have hundreds of chunk servers and hundreds of clients. All the machines are configured with dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch. All 19 GFS server machines are connected to one switch, and all 16 client machines to the other. The two switches are connected with a 1 Gbps link.
  3. N clients read simultaneously from the file system. Each client reads a randomly selected 4 MB region from a 320 GB file set. This is repeated 256 times so that each client ends up reading 1 GB of data. The chunk servers taken together have only 32 GB of memory, so we expect at most a 10% hit rate in the Linux buffer cache. Our results should be close to cold-cache results. Figure 3(a) shows the aggregate read rate for N clients and its theoretical limit. The limit peaks at an aggregate of 125 MB/s when the 1 Gbps link between the two switches is saturated, or 12.5 MB/s per client when its 100 Mbps network interface gets saturated, whichever applies. The observed read rate is 10 MB/s, or 80% of the per-client limit, when just one client is reading. The aggregate read rate reaches 94 MB/s, about 75% of the 125 MB/s link limit, for 16 readers, or 6 MB/s per client. The efficiency drops from 80% to 75% because as the number of readers increases, so does the probability that multiple readers simultaneously read from the same chunk server.
  4. N clients write simultaneously to N distinct files. Each client writes 1 GB of data to a new file in a series of 1 MB writes. The aggregate write rate and its theoretical limit are shown in Figure 3(b). The limit plateaus at 67 MB/s because we need to write each byte to 3 of the 16 chunk servers, each with a 12.5 MB/s input connection. The write rate for one client is 6.3 MB/s, about half of the limit. The main culprit for this is our network stack. It does not interact very well with the pipelining scheme we use for pushing data to chunk replicas. Delays in propagating data from one replica to another reduce the overall write rate. The aggregate write rate reaches 35 MB/s for 16 clients (or 2.2 MB/s per client), about half the theoretical limit. As in the case of reads, it becomes more likely that multiple clients write concurrently to the same chunk server as the number of clients increases. Moreover, collision is more likely for 16 writers than for 16 readers because each write involves three different replicas. Writes are slower than we would like. In practice this has not been a major problem because even though it increases the latencies as seen by individual clients, it does not significantly affect the aggregate write bandwidth delivered by the system to a large number of clients.
  5. Figure 3(c) shows record append performance. N clients append simultaneously to a single file. Performance is limited by the network bandwidth of the chunk servers that store the last chunk of the file, independent of the number of clients. It starts at 6.0 MB/s for one client and drops to 4.8 MB/s for 16 clients, mostly due to congestion and variances in network transfer rates seen by different clients. Our applications tend to produce multiple such files concurrently. In other words, N clients append to M shared files simultaneously where both N and M are in the dozens or hundreds. Therefore, the chunk server network congestion in our experiment is not a significant issue in practice because a client can make progress on writing one file while the chunk servers for another file are busy.