GOOGLE FILE SYSTEM 
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung 
Presented By – Ankit Thiranh
OVERVIEW 
• Introduction 
• Architecture 
• Characteristics 
• System Interaction 
• Master Operation, Fault Tolerance and Diagnosis 
• Measurements 
• Some Real world clusters and their performance
INTRODUCTION 
• Google – large amount of data 
• Needs a good distributed file system to process its data 
• Solution: Google File System 
• GFS is : 
• Large 
• Distributed 
• Highly fault tolerant system
ASSUMPTIONS 
• The system is built from many inexpensive commodity components that often fail. 
• The system stores a modest number of large files. 
• Reads are primarily of two kinds: large streaming reads and small random reads. 
• Many large sequential writes append data to files. 
• The system must efficiently implement well-defined semantics for multiple clients that 
concurrently append to the same file. 
• High sustained bandwidth is more important than low latency.
ARCHITECTURE
CHARACTERISTICS 
• Single master 
• Chunk size 
• Metadata 
• In-Memory Data structures 
• Chunk Locations 
• Operation Log 
• Consistency Model (figure) 
• Guarantees by GFS 
• Implications for Applications 
                      Write                      Record Append 
Serial success        defined                    defined interspersed with inconsistent 
Concurrent successes  consistent but undefined   defined interspersed with inconsistent 
Failure               inconsistent               inconsistent 
File Region State After Mutation
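The chunk-size bullet above is easier to follow with a small example. The editor's notes give a fixed chunk size of 64 MB, so a client can turn a byte offset within a file into a chunk index before it asks the master for that chunk's location. The following is a minimal sketch of that arithmetic only; the function names `locate` and `chunks_spanned` are hypothetical and not part of any GFS client API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size, as stated in the editor's notes


def locate(offset: int) -> tuple[int, int]:
    """Return (chunk_index, offset_within_chunk) for a byte offset in a file."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, CHUNK_SIZE)


def chunks_spanned(offset: int, length: int) -> range:
    """Return the chunk indexes touched by reading `length` bytes from `offset`."""
    if length <= 0:
        return range(0)
    first, _ = locate(offset)
    last, _ = locate(offset + length - 1)
    return range(first, last + 1)
```

Because the chunk size is large, a long sequential read touches few chunks, which is one reason the client needs to contact the master only rarely.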
SYSTEM INTERACTION 
• Leases and Mutation Order 
• Data flow 
• Atomic Record appends 
• Snapshot 
Figure 2: Write Control and Data Flow
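To make the data-flow bullet concrete: the original GFS paper estimates that, with pipelining and no network congestion, transferring B bytes to R replicas ideally takes B/T + R·L, where T is the network throughput and L is the latency between two machines. The sketch below simply evaluates that expression; the function name and the 1 ms per-hop latency used in the usage note are illustrative assumptions, not measured values.

```python
def ideal_transfer_seconds(num_bytes: int, replicas: int,
                           throughput_bps: float, hop_latency_s: float) -> float:
    """Evaluate B/T + R*L for a pipelined push along a chain of replicas.

    num_bytes      -- B, payload size in bytes
    replicas       -- R, number of replicas in the chain
    throughput_bps -- T, link throughput in bits per second
    hop_latency_s  -- L, latency per hop in seconds (assumed value)
    """
    return (num_bytes * 8) / throughput_bps + replicas * hop_latency_s
```

For example, `ideal_transfer_seconds(1_000_000, 3, 100e6, 0.001)` gives roughly 0.083 s, in line with the paper's estimate that 1 MB can be distributed in about 80 ms over 100 Mbps links.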
MASTER OPERATION 
• Namespace Management and Locking 
• Replica Placement 
• Creation, Re-replication, Rebalancing 
• Garbage Collection 
• Mechanism 
• Discussion 
• Stale Replica Detection
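Stale replica detection can be illustrated with a toy model. In GFS the master raises a chunk's version number whenever it grants a new lease and informs the replicas that are currently up, so a replica that was down during a mutation reports an older version when it returns. The sketch below assumes a much-simplified in-memory master; the class and method names (`ChunkRecord`, `Master`, `grant_lease`, `report`, `stale_replicas`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    version: int = 0                               # latest version known to the master
    replicas: dict = field(default_factory=dict)   # chunkserver name -> reported version


class Master:
    """Toy master that tracks chunk version numbers only (illustrative)."""

    def __init__(self):
        self.chunks = {}

    def grant_lease(self, handle, live_servers):
        """Bump the chunk version and record it for the replicas that are up."""
        rec = self.chunks.setdefault(handle, ChunkRecord())
        rec.version += 1
        for server in live_servers:
            rec.replicas[server] = rec.version
        return rec.version

    def report(self, handle, server, version):
        """Record the version a chunkserver reports, e.g. after a restart."""
        self.chunks[handle].replicas[server] = version

    def stale_replicas(self, handle):
        """Return the servers whose replica is older than the master's version."""
        rec = self.chunks[handle]
        return sorted(s for s, v in rec.replicas.items() if v < rec.version)
```

If `cs3` is down while a second lease is granted to `cs1` and `cs2`, `stale_replicas` returns `["cs3"]`, and a real master would exclude that replica from client replies and garbage-collect it.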
FAULT TOLERANCE AND DIAGNOSIS 
• High Availability 
• Fast Recovery 
• Chunk Replication 
• Master Replication 
• Data Integrity 
• Diagnostic Tools
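The data-integrity bullet refers to checksumming. The GFS paper describes each chunk as broken into 64 KB blocks, each with its own 32-bit checksum, which a chunkserver verifies before returning data. The sketch below follows that layout but uses CRC-32 from Python's `zlib` as a stand-in, which is an assumption: the paper does not name the checksum algorithm. The function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks, per the GFS paper


def block_checksums(chunk: bytes) -> list[int]:
    """Compute one 32-bit checksum per 64 KB block (CRC-32 is an assumed choice)."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]


def corrupted_blocks(chunk: bytes, expected: list[int]) -> list[int]:
    """Return indexes of blocks whose checksum no longer matches the stored one."""
    actual = block_checksums(chunk)
    if len(actual) != len(expected):
        raise ValueError("block count does not match stored checksums")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

A chunkserver that finds a mismatch would return an error to the requester and report it to the master, which can then re-replicate the chunk from a good replica.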
MEASUREMENTS 
Aggregate Throughputs. Top curves show theoretical limits imposed by the network topology. Bottom curves 
show measured throughputs. They have error bars that show 95% confidence intervals, which are illegible in 
some cases because of low variance in measurements.
REAL WORLD CLUSTERS 
• Two clusters were examined: 
• Cluster A is used for research and development by over a hundred users. 
• Cluster B is used for production data processing with only occasional human intervention. 
• Storage 
• Metadata 
Cluster                    A        B 
Chunkservers               342      227 
Available disk space       72 TB    180 TB 
Used disk space            55 TB    155 TB 
Number of files            735 k    737 k 
Number of dead files       22 k     232 k 
Number of chunks           992 k    1550 k 
Metadata at chunkservers   13 GB    21 GB 
Metadata at master         48 MB    60 MB 
Characteristics of two GFS clusters
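A few ratios derived from the table make the comparison between the two clusters easier to see. The sketch below only does arithmetic on the figures listed above; the dictionary keys and the `summarize` helper are hypothetical names chosen for illustration.

```python
# Figures copied from the "Characteristics of two GFS clusters" table.
clusters = {
    "A": {"available_tb": 72, "used_tb": 55,
          "files_k": 735, "dead_files_k": 22, "chunks_k": 992},
    "B": {"available_tb": 180, "used_tb": 155,
          "files_k": 737, "dead_files_k": 232, "chunks_k": 1550},
}


def summarize(c: dict) -> dict:
    """Derive simple ratios from one cluster's row of the table."""
    return {
        "disk_utilization_pct": round(100 * c["used_tb"] / c["available_tb"], 1),
        "dead_to_files_pct": round(100 * c["dead_files_k"] / c["files_k"], 1),
        "chunks_per_file": round(c["chunks_k"] / c["files_k"], 2),
    }


summary = {name: summarize(c) for name, c in clusters.items()}
```

Cluster B comes out with a far higher ratio of dead files to files and more chunks per file than cluster A, which matches the observation in the editor's notes that B's files tend to be larger.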
PERFORMANCE EVALUATION OF TWO 
CLUSTERS 
• Read and write rates and Master load 
Cluster A B 
Read Rate (last minute) 583 MB/s 380 MB/s 
Read Rate (last hour) 562 MB/s 384 MB/s 
Read Rate (since start) 589 MB/s 49 MB/s 
Write Rate (last minute) 1 MB/s 101 MB/s 
Write Rate (last hour) 2 MB/s 117 MB/s 
Write Rate (since start) 25 MB/s 13 MB/s 
Master ops (last minute) 325 Ops/s 533 Ops/s 
Master ops (last hour) 381 Ops/s 518 Ops/s 
Master ops (since start) 202 Ops/s 347 Ops/s 
Performance Metrics for Two GFS Clusters
WORKLOAD BREAKDOWN 
• Chunkserver Workload 
Operation     Read            Write           Record Append 
Cluster       X       Y       X       Y       X       Y 
0K            0.4     2.6     0       0       0       0 
1B..1K        0.1     4.1     6.6     4.9     0.2     9.2 
1K..8K        65.2    38.5    0.4     1.0     18.9    15.2 
8K..64K       29.9    45.1    17.8    43.0    78.0    2.8 
64K..128K     0.1     0.7     2.3     1.9     < 0.1   4.3 
128K..256K    0.2     0.3     31.6    0.4     < 0.1   10.6 
256K..512K    0.1     0.1     4.2     7.7     < 0.1   31.2 
512K..1M      3.9     6.9     35.5    28.7    2.2     25.5 
1M..inf       0.1     1.8     1.5     12.3    0.7     2.2 
Operations Breakdown by Size (%) 
Operation     Read            Write           Record Append 
Cluster       X       Y       X       Y       X       Y 
1B..1K        < 0.1   < 0.1   < 0.1   < 0.1   < 0.1   < 0.1 
1K..8K        13.8    3.9     < 0.1   < 0.1   < 0.1   0.1 
8K..64K       11.4    9.3     2.4     5.9     78.0    0.3 
64K..128K     0.3     0.7     0.3     0.3     < 0.1   1.2 
128K..256K    0.8     0.6     16.5    0.2     < 0.1   5.8 
256K..512K    1.4     0.3     3.4     7.7     < 0.1   38.4 
512K..1M      65.9    55.1    74.1    58.0    0.1     46.8 
1M..inf       6.4     28.0    3.3     28.0    53.9    7.4 
Bytes Transferred Breakdown by Operation Size (%)
WORKLOAD BREAKDOWN 
• Master Workload 
Cluster X Y 
Open 26.1 16.3 
Delete 0.7 1.5 
FindLocation 64.3 65.8 
FindLeaseHolder 7.8 13.4 
FindMatchingFiles 0.6 2.2 
All other combined 0.5 0.8 
Master Requests Breakdown by Type (%)
Editor's Notes

  1. GFS has a single master, multiple chunkservers, and multiple clients. Files are divided into chunks; each chunk is identified by an immutable, globally unique 64-bit chunk handle and is stored on multiple chunkservers. The master holds the metadata, which includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks.
  2. Single master: can make sophisticated chunk placement and replication decisions using global knowledge. (Read example.) Chunk size: 64 MB. Advantages: reduces client-master interaction, a client is more likely to perform many operations on a given chunk, and it reduces the size of the metadata. Metadata: the master stores the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas; metadata is kept in memory so that operations are fast. Chunk locations: the master does not keep a persistent record; it polls chunkservers at startup and monitors them with heartbeat messages. Operation log: contains a history of critical metadata changes. Guarantees: GFS applies mutations to a chunk in the same order on all its replicas and uses chunk version numbers to detect any stale replica. Consistent: all replicas have the same data. Defined: the region is consistent and clients can see what the mutation has written.
  3. Mutation: an operation that changes the contents or metadata of a chunk. Data flow: to use bandwidth fully, data is pushed linearly along a chain of chunkservers; to avoid bottlenecks and high-latency links, each machine forwards the data to the closest machine that has not yet received it; to minimize latency, the data transfer is pipelined over TCP connections. Record append: the client specifies only the data, and GFS appends it atomically at an offset of its own choosing; it follows the same control flow as a write. Snapshot: makes a copy of a file or a directory tree while minimizing any interruption of ongoing mutations.
  4. Master: executes all namespace operations and manages chunk replicas. Namespace: GFS logically represents its namespace as a lookup table mapping full path names to metadata. Replica placement: 1) maximize data reliability and availability, and 2) maximize network bandwidth utilization. Creation and re-replication: place replicas on chunkservers with below-average disk utilization, limit the number of recent creations on each chunkserver, and spread the replicas of a chunk across racks. Garbage collection: after deletion, the file is renamed to a hidden name and removed after 3 days; orphaned chunks are also identified and reclaimed. Stale replica detection: a chunkserver that fails misses mutations while it is down, so the master assigns chunk version numbers to distinguish up-to-date replicas from stale ones.
  5. Fast recovery: the master and the chunkservers are designed to restore their state and start in seconds. Chunk replication: discussed earlier. Master replication: the operation log and checkpoints are replicated on multiple machines; shadow masters provide read-only access. Data integrity: uses checksumming to detect corruption of stored data. We can recover from corruption using other replicas, but detecting corruption by comparing replicas across chunkservers would be impractical, so each chunkserver verifies its own copy. Diagnostic tools: generate diagnostic logs that record many significant events. The RPC logs include the exact requests and responses sent on the wire, except for the file data being read or written.
  6. The two clusters have similar numbers of files, though B has a larger proportion of dead files, namely files that were deleted or replaced by a new version but whose storage has not yet been reclaimed. B also has more chunks because its files tend to be larger.
  7. Reads return no data more often in Y because applications in the production system use files as producer-consumer queues. Cluster Y also sees a much higher percentage of large record appends than cluster X because the production systems, which use cluster Y, are more aggressively tuned for GFS.