Cluster-based Storage
Antonio Cesarano
Bonaventura Del Monte
Università degli Studi di Salerno
16th May 2014
Advanced Operating Systems
Prof. Giuseppe Cattaneo
Agenda
 Context
 Goals of design
 NASD
 NASD prototype
 Distributed file systems on NASD
 NASD parallel file system
 Conclusions
A Cost-Effective,
High-Bandwidth Storage Architecture
Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang,
Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, Jim Zelenka, 1997-2001
Agenda
The File System
 Motivations
 Architecture
 Benchmarks
 Comparisons and conclusions
[Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung,
2003]
Context - 1998
New drive attachment technology: Fibre Channel and new network standards
I/O-bound applications: streaming audio/video, data mining
Context - 1998
Cost-ineffective storage servers
Excess of on-drive transistors in the drive controller
Context - 1998
Big files are split across multiple storage nodes (Storage 1, Storage 2, …)
Goal
No traditional storage file server
Cost-effective bandwidth scaling
What is NASD?
Network-Attached Secure Disk
 direct transfer to clients
 secure interfaces via cryptographic support
 asynchronous oversight
 variable-size data objects (mapped to blocks by the drive)
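
The «secure interfaces» point is worth a sketch. Below is a minimal illustration of the capability idea, assuming an HMAC-signed token; the names and format are invented for illustration, not the paper's wire protocol. The file manager grants a client a capability once, and the drive verifies every request locally, which is what makes asynchronous oversight possible.

```python
# Hypothetical sketch of NASD-style capability checking (not the paper's
# actual format): the file manager signs a token with a key it shares with
# the drive, and the drive verifies each request without a manager round-trip.
import hashlib
import hmac

SHARED_KEY = b"drive-secret"  # provisioned on both file manager and drive

def issue_capability(object_id: int, rights: str) -> tuple[bytes, bytes]:
    """File manager: grant a client access to one object."""
    token = f"{object_id}:{rights}".encode()
    tag = hmac.new(SHARED_KEY, token, hashlib.sha256).digest()
    return token, tag

def drive_check(token: bytes, tag: bytes, object_id: int, op: str) -> bool:
    """Drive: verify the capability locally, with no file-manager contact."""
    expected = hmac.new(SHARED_KEY, token, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, tag):
        return False  # forged or tampered capability
    obj, rights = token.decode().split(":")
    return int(obj) == object_id and op in rights

token, tag = issue_capability(42, "r")   # client obtains this once
assert drive_check(token, tag, 42, "r")  # reads proceed drive-side only
assert not drive_check(token, tag, 42, "w")  # rights are enforced
```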
Network-Attached Secure Disk: Architecture [diagram]
NASD prototype
 Based on Unix inode interface
 Network with 13 NASD drives
 Each NASD runs on:
• DEC Alpha 3000, 133 MHz, 64 MB RAM
• 2 x Seagate Medallist on a 5 MB/s SCSI bus
• Connected to 10 clients by ATM (155 Mbps)
 Ad-hoc handling modules (16K LOC)
NASD prototype
Test results:
It scales!
DFS on NASD
Porting NFS and AFS to the NASD architecture:
o OK, no performance loss
o But there are concurrency limitations
Solution:
A new higher-level parallel file system
must be used…
NASD parallel file system
Scalable low-level I/O interface
Cheops as the storage management layer:
 Exports the same object interface as NASD devices
 Maps it to objects on the devices
 Maps striped objects
 Supports concurrency control for multi-disk accesses
(10K LOC)
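
To make the striping concrete, here is a minimal sketch of how a Cheops-like layer might map a logical byte offset to a drive. The stripe unit is an assumed value, not one from the paper; only the 13-drive count comes from the prototype above.

```python
# Round-robin striping sketch: logical object bytes -> (device, local offset).
STRIPE_UNIT = 64 * 1024   # bytes per stripe unit (assumed, illustrative)
N_DEVICES = 13            # NASD drives in the prototype network

def locate(offset: int) -> tuple[int, int]:
    """Map a logical byte offset to (device index, offset on that device)."""
    unit = offset // STRIPE_UNIT       # which stripe unit holds this byte
    device = unit % N_DEVICES          # round-robin across the drives
    stripe_row = unit // N_DEVICES     # full stripe rows before this unit
    return device, stripe_row * STRIPE_UNIT + offset % STRIPE_UNIT

print(locate(0))            # (0, 0)
print(locate(STRIPE_UNIT))  # (1, 0) -- the next unit lands on the next drive
```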
NASD parallel file system: Test
Clustering data-mining application
*Each NASD drive provides 6.2 MB/s
Conclusions
High Scalability
Direct transfer to clients
Working prototype
Usable with existing file systems
But... very high costs:
• Network adapters
• ASIC microcontroller
• Workstation
increasing the total cost by over 80%
Change
From here…
The Google File System
• Started with their search engine
• Then they provided new services like:
 Google Video
 Gmail
 Google Maps, Earth
 Google App Engine
 … and many more
Design overview
Observing common operations in Google applications leads
developers to make several assumptions:
 Multiple clusters distributed worldwide
 Fault tolerance and auto-recovery need to be built into the system, because component failures are frequent
 A modest number of large files (100+ MB or multi-GB)
 Workloads consist of either large streaming reads or small random reads, while writes are mostly sequential appends of large amounts of data to files
 Google applications and GFS should be co-designed
 Producer-consumer pattern
GFS Architecture
[Diagram] A single master keeps all metadata in RAM. A client sends a metadata request to the master and gets a metadata response back; read/write requests and responses then flow directly between the client and the chunkservers, each of which stores chunks as files on a local Unix file system.
GFS Architecture: Chunks
 Similar to standard File System blocks but much
larger
 Size: 64 MB (configurable)
 Advantages:
• Reduces the clients' need to contact the master
• A client may perform many operations on a single chunk
• Fewer chunks mean less metadata in the master
• No internal fragmentation, thanks to lazy space allocation
GFS Architecture: Chunks
 Disadvantages:
• Small files, made of a small number of chunks, may be accessed many times (hot spots)
• Not a major issue, since Google apps mostly read large multi-chunk files sequentially
• Moreover, this can be mitigated with a high replication factor
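
The offset arithmetic behind these trade-offs is simple enough to sketch. The helper below is illustrative, assuming the 64 MB chunk size above; it is what lets a client batch many operations per chunk and contact the master rarely.

```python
# Client-side offset arithmetic implied by 64 MB chunks.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def to_chunk(offset: int) -> tuple[int, int]:
    """Map a byte offset within a file to (chunk index, offset in chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# A 1 GB file spans only 16 chunks, so a sequential reader needs
# at most 16 metadata lookups from the master for the whole file.
print(to_chunk(0))                  # (0, 0)
print(to_chunk(200 * 1024 * 1024))  # (3, 8388608) -- 8 MB into chunk 3
```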
GFS Architecture: Master
 A single process running on a separate machine
 Stores all metadata in its RAM:
• File and chunk namespace
• Mapping from files to chunks
• Chunk locations
• Access control information and file locking
• Chunk versioning (snapshots handling)
• And so on…
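
A toy sketch of those in-RAM tables makes the lookup path concrete; the names and shapes below are invented for illustration, not GFS internals.

```python
# Toy model of the master's in-RAM metadata (illustrative names only).
from dataclasses import dataclass, field

@dataclass
class MasterState:
    namespace: dict = field(default_factory=dict)  # path -> [chunk handles]
    locations: dict = field(default_factory=dict)  # handle -> [chunkserver ids]
    version: dict = field(default_factory=dict)    # handle -> version number

    def lookup(self, path: str, chunk_index: int):
        """Resolve (file, chunk index) to (chunk handle, replica locations)."""
        handle = self.namespace[path][chunk_index]
        return handle, self.locations[handle]

m = MasterState()
m.namespace["/logs/web.0"] = ["h1", "h2"]
m.locations["h2"] = ["cs-4", "cs-9", "cs-17"]  # three replicas
print(m.lookup("/logs/web.0", 1))              # ('h2', ['cs-4', 'cs-9', 'cs-17'])
```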
GFS Architecture: Master
 Master has the following responsibilities:
 Chunk creation, re-replication,
rebalancing and deletion for:
 Balancing space utilization and access speed
 Spreading replicas across racks to reduce
correlated failures, usually 3 copies for each chunk
 Rebalancing data to smooth out storage and
request load
 Persistent and replicated logging of critical metadata updates
GFS Architecture: M - CS
Communication
 Master and chunkservers communicate regularly (heartbeat messages) so the master can track their state:
o Is a chunkserver down?
o Are there disk failures on any chunkserver?
o Are any replicas corrupted?
o Which chunk replicas does a given chunkserver store?
 Moreover, the master handles garbage collection and deletes "stale" replicas:
o The master logs the deletion, then renames the target file to a hidden name
o A lazy GC removes the hidden files after a given amount of time
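
A hedged sketch of that rename-then-collect idea follows; the hidden-name convention and grace period are illustrative assumptions, not GFS constants.

```python
# Lazy deletion sketch: delete() only renames; a later GC pass reclaims.
import time

HIDDEN_PREFIX = ".deleted."
GRACE_SECONDS = 3 * 24 * 3600  # assumed grace period before real removal

namespace = {"/logs/web.0": "h1"}
deleted_at: dict[str, float] = {}

def delete(path: str) -> None:
    """Log the deletion, then just rename the file to a hidden name."""
    hidden = HIDDEN_PREFIX + path
    namespace[hidden] = namespace.pop(path)
    deleted_at[hidden] = time.time()

def gc_pass(now: float) -> None:
    """Lazy GC: drop hidden files whose grace period has expired."""
    for hidden, t in list(deleted_at.items()):
        if now - t > GRACE_SECONDS:
            namespace.pop(hidden)     # chunks become garbage,
            deleted_at.pop(hidden)    # reclaimed by later heartbeats

delete("/logs/web.0")
gc_pass(time.time() + GRACE_SECONDS + 1)
assert not namespace  # the hidden file is gone after the grace period
```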
GFS Architecture: M - CS
Communication
 Server Requests
 The client retrieves metadata from the master for the requested file
 Read/write data flows between client and chunkservers, decoupled from the master's control flow
 The single master is not a bottleneck: its involvement in reads and writes is minimized:
 Clients communicate directly with chunkservers
 The master logs operations as soon as they are completed
 Less than 64 bytes of metadata per chunk
GFS Architecture: Reading
[Diagrams] The read path, reconstructed from the three diagram slides:
1. The application passes a file name and byte range to the GFS client, which converts the range into a chunk index and sends (name, chunk index) to the master.
2. The master consults its in-RAM metadata and replies with the chunk handle and the replica locations.
3. The client asks one chunkserver for (chunk handle, byte range); the chunkserver returns the data from the file, and the client hands it to the application.
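
A minimal sketch of that read path, reusing the toy MasterState lookup from above; the chunkserver objects and their read method are stand-ins, not the real GFS RPC interface.

```python
# Read-path sketch: one metadata RPC, then data flows past the master.
CHUNK_SIZE = 64 * 1024 * 1024

def gfs_read(master, chunkservers, path: str, offset: int, length: int) -> bytes:
    """Single-chunk read: translate the offset, ask the master once,
    then fetch the bytes directly from one replica."""
    index = offset // CHUNK_SIZE
    handle, replicas = master.lookup(path, index)  # metadata from master
    server = chunkservers[replicas[0]]             # e.g. pick the closest
    return server.read(handle, offset % CHUNK_SIZE, length)  # bypasses master
```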
GFS Architecture: Writing
[Diagrams] The write path, reconstructed from the five diagram slides:
1. The application passes a file name and data to the GFS client, which sends (name, chunk index) to the master; the master replies with the chunk handle and the locations of the primary and secondary replicas.
2. The client pushes the data to all replicas; the primary and both secondary chunkservers buffer it alongside the chunk.
3. The client sends the write command to the primary, which applies the mutation and forwards the command to the secondaries.
4. The secondaries ACK the primary; the primary ACKs the client.
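
The same flow as a hedged sketch: lookup_for_write, buffer and commit are invented stand-ins for the real RPCs, and lease management and error handling are omitted. Decoupling the bulk data push (step 2) from the control message (step 3) lets data travel along a network-friendly path while the primary still imposes a single mutation order.

```python
# Write-pipeline sketch: push data everywhere, then let the primary commit.
def gfs_write(master, chunkservers, path: str, index: int, data: bytes) -> bool:
    handle, primary_id, secondary_ids = master.lookup_for_write(path, index)

    # 1. Push data to all replicas; each buffers it (not yet applied).
    for cs_id in [primary_id, *secondary_ids]:
        chunkservers[cs_id].buffer(handle, data)

    # 2. Tell the primary to commit; it assigns a serial order and
    #    forwards the write command to the secondaries.
    acks = chunkservers[primary_id].commit(handle, forward_to=secondary_ids)

    # 3. The primary replies to the client only after all secondaries ACK.
    return all(acks)
```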
Fault Tolerance
 GFS has its own relaxed consistency
model
 Consistent: all replicas have the same value
 Defined: consistent, and each replica reflects the mutation in its entirety
 GFS is highly available:
 Fast recovery (machines are quickly rebootable)
 Chunks replicated at least 3 times (take that, RAID-6)
Benchmarking: small cluster
GFS tested on a small cluster:
 1 master
 16 chunkservers
 16 clients
 Server machines connected to a 100 Mbps central switch
 Same for client machines
 The two switches are connected to a 1 Gbps switch
Benchmarking: small cluster

              Read rate   Write rate
1 client      10 MB/s     6.3 MB/s
16 clients    6 MB/s      2.2 MB/s

Rates are per client; the 100 Mbps links cap each machine at 12.5 MB/s.
Benchmarking: real-world cluster
 Cluster A: 342 PCs
 Used for research and development
 Typical tasks last a few hours: reading TBs of data, processing it, and writing results back
 Cluster B: 227 PCs
 Continuously generates and processes multi-TB data sets
 Typical tasks run longer than cluster A's tasks
Benchmarking: real-world cluster

Cluster                    A                      B
Chunkservers               342                    227
Available disk space       72 TB                  180 TB
Used disk space            55 TB                  155 TB
Number of files            735,000                737,000
Number of chunks           992,000                1,550,000
Metadata at chunkservers   13 GB                  21 GB
Metadata at master         48 MB                  60 MB
Read rate                  580 MB/s (750 MB/s)    380 MB/s (1300 MB/s)
Write rate                 30 MB/s                100 MB/s * 3
Master ops                 202~380 ops/s          347~533 ops/s
Benchmarking: recovery time
 One chunkserver killed in cluster B:
o This chunkserver had 15000 chunks
containing 600GB of data
o All chunks were restored in 23.2 mins
with a replication rate of 440 MB/s
 Two chunkservers killed in cluster B:
o Each had 16000 chunks and 660 GB of data; 266 chunks were left with a single replica
o These 266 chunks were replicated at a higher priority within 2 mins
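
A small sketch of why those 266 chunks jumped the queue: re-replication can be prioritized by how close a chunk is to being lost. The scoring below is an assumption for illustration, not the paper's exact policy.

```python
# Re-replication priority sketch: fewest live replicas first.
GOAL = 3  # target replication factor

def priority(live_replicas: int) -> int:
    """Fewer surviving replicas -> higher urgency (lower sort key)."""
    return live_replicas

chunks = {"c1": 2, "c2": 1, "c3": 2, "c4": 3}  # handle -> live replicas
todo = sorted((h for h, n in chunks.items() if n < GOAL),
              key=lambda h: priority(chunks[h]))
print(todo)  # ['c2', 'c1', 'c3'] -- the single-replica chunk is cloned first
```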
Comparisons to other models (RAIDxFS, GPFS, AFS, NASD): GFS
 spreads file data across storage servers
 is simpler, using only replication for redundancy
 provides a location-independent namespace
 takes a centralized approach rather than distributed management
 uses commodity machines instead of network-attached disks
 uses lazily allocated fixed-size blocks rather than variable-length objects
Conclusion
 GFS demonstrates how to support large-scale
processing workloads on commodity hardware:
 designed to tolerate frequent component failures
 optimised for huge files that are mostly appended to and then read
 It has met Google's storage needs, so it is good enough for them
 GFS has massively influenced computer science in the last few years