Give or take
Block and inode allocation in Spectrum Scale
Tomer Perry
Spectrum Scale Development
<tomp@il.ibm.com>
Louise Bourgeois Give or Take 2002
Outline
● Relevant Scale Basics
● Allocation types
● Map based
● Share based
Some relevant Scale Basics
”The Cluster”
• Cluster – A group of operating system
instances, or nodes, on which an
instance of Spectrum Scale is deployed
• Network Shared Disk - Any block
storage (disk, partition, or LUN) given to
Spectrum Scale for its use
• NSD server – A node making an NSD
available to other nodes over the network
• GPFS filesystem – Combination of data
and metadata which is deployed over
NSDs and managed by the cluster
• Client – A node running Scale Software
accessing a filesystem
[Diagram: NSD servers presenting NSDs over the network to many client nodes]
Some relevant Scale Basics
MultiCluster
• A Cluster can share some or all of its
filesystems with other clusters
• Such a cluster can be referred to as a
“Storage Cluster”
• A cluster that doesn’t have any local
filesystems can be referred to as a “Client
Cluster”
• A Client cluster can connect to up to 31
clusters ( outbound)
• A Storage cluster can be mounted by a
practically unlimited number of client clusters
( inbound) – 16383, really
• Client cluster nodes “join” the storage
cluster upon mount
[Diagram: many client clusters mounting filesystems from a single storage cluster]
Some relevant Scale Basics
Node Roles
• While the general concept in Scale is “all nodes were made equal”, some have special roles

Config Servers
– Hold the cluster configuration files
– 4.1 onward supports CCR ( quorum-based)
– Not in the data path

Cluster Manager
– One of the quorum nodes
– Manages leases, failures and recoveries
– Selects filesystem managers
– mm*mgr -c

Filesystem Manager
– One of the “manager” nodes
– One per filesystem
– Manages filesystem configuration ( disks)
– Space allocation
– Quota management

Token Manager/s
– Multiple per filesystem
– Each manages a portion of the tokens for each filesystem, based on inode number
Some relevant Scale Basics
From disk to namespace
● Scale filesystem ( GPFS) can be described as loosely coupled layers
● Different layers have different properties
● Can be divided into “namespace virtualization” and “storage virtualization”

[Diagram: the layer stack, bottom to top, with each layer’s properties]
  Disk/LUN/vdisk – Size/MediaType/device
  NSD – Name/Servers
  FS disk – Type/Pool/FailureGroup
  Storage Pool
  Filesystem – BlockSize/Replication/LogSize/Inodes….
  Fileset – inodes/quotas/AFM/parent...
  ( Disk through Storage Pool form the storage virtualization; Filesystem and Fileset form the namespace virtualization)
Resource allocation
In distributed file systems
• One of the main challenges in distributed software in general, and distributed filesystems in
particular, is how to efficiently manage resources while maintaining scalability
• Spectrum Scale’s approach has always been “scalability through autonomy” ( tokens, metanode,
lease protocol etc.)
• In the context of this session, we will limit the discussion to space allocation and
management: subblocks and inodes ( essentially, inode allocation is actually “metadata
space management”)
• A common approach, taking the CAP theorem into account, is to use eventual consistency in
order to minimize the performance impact while still providing reasonable functionality.
Resource allocation
One size doesn’t fit all
• Although there are many similarities between subblock allocation and quotas, there are
substantial differences which eventually led us to use different mechanisms for each

  Type              Managed entity             Change rate        Allocation method
  Space Allocation  Storage Pools,             Adding disks,      Map based
                    Filesets ( inodes)         adding filesets
  Quotas            Users/Groups/Filesets,     Administrator      Share based
                    Users/Groups@Filesets      initiated
• In short: space allocation is essentially static, while quota usage is dynamic – which is
what drives the two different mechanisms
Allocation map based management
• Due to the relatively “static” nature of filesystem space management, it is possible to use
the allocation-map-based approach
• One of the main advantages of this approach is that, with proper planning, it is relatively
easy to allow each node to manage its own allocation independently
• The basic approach is to divide the managed resource into “regions”
• When a node needs resources to be allocated, the filesystem manager assigns a region
to that node
• From then on, the node can use all the resources in that region
• The number of regions is affected by the “-n” filesystem/pool creation parameter
( while one can change the -n parameter using mmchfs, it will only apply to pools created
after the change)
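As a rough illustration of the idea, here is a toy sketch of map-based allocation. All names ( Region, FilesystemManager, Node) and data structures are invented for illustration; Scale’s real allocation map is an on-disk bitmap, not Python objects.

```python
# Hypothetical sketch of map-based allocation: the filesystem manager
# hands out whole regions, and each node then allocates blocks inside
# its regions with no further coordination.

class Region:
    """A slice of the allocation map: one bit (here, one set entry) per block."""
    def __init__(self, region_id, num_blocks):
        self.region_id = region_id
        self.free = set(range(num_blocks))  # indices of free blocks

    def allocate(self):
        return self.free.pop() if self.free else None

class FilesystemManager:
    """Owns the unassigned regions and assigns them to requesting nodes."""
    def __init__(self, num_regions, blocks_per_region):
        self.unassigned = [Region(i, blocks_per_region)
                           for i in range(num_regions)]

    def assign_region(self, node):
        region = self.unassigned.pop()
        node.regions.append(region)
        return region

class Node:
    def __init__(self, name, mgr):
        self.name, self.mgr, self.regions = name, mgr, []

    def allocate_block(self):
        # Fully local fast path: no message to the manager needed.
        for region in self.regions:
            blk = region.allocate()
            if blk is not None:
                return (region.region_id, blk)
        # Local regions exhausted: ask the manager for another region.
        self.mgr.assign_region(self)
        return self.allocate_block()
```

The point of the sketch is the autonomy: the manager is contacted only once per region, not once per block.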
Block allocation map
• Divided into a fixed number n of equal-size “regions”
– Each region contains bits representing blocks on all disks
– Different nodes can use different regions and still stripe across all disks ( many more regions
than nodes) – based on the -n parameter on storage pool creation

[Diagram: the map as regions 0 .. n-1, each holding bits for blocks across Disk 1, Disk 2 and Disk 3]
Block allocation map
• When adding more disks:
– The alloc map file is extended to provide room for the additional bits
– Each allocation region now consists of two separately stored segments

[Diagram: after adding Disk 4, each region holds segment 0 ( bits for the original disks) and segment 1 ( bits for the new disk)]
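A toy continuation of the earlier sketch, showing the segment idea only; the round-robin striping and all names are assumptions, not Scale’s on-disk layout.

```python
# Illustrative sketch: when disks are added, every region gains a new
# segment holding the bits for the new blocks, so existing segments
# stay where they are on disk.

class Region:
    def __init__(self):
        self.segments = []            # each segment: set of free block ids

    def add_segment(self, block_ids):
        self.segments.append(set(block_ids))

    def allocate(self):
        # A node still allocates from the whole region,
        # old and new segments alike.
        for seg in self.segments:
            if seg:
                return seg.pop()
        return None

def extend_map(regions, new_block_ids):
    """Distribute the new disks' blocks across regions,
    one fresh segment per region (striping scheme is assumed)."""
    n = len(regions)
    for i, region in enumerate(regions):
        region.add_segment(new_block_ids[i::n])
```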
Inode Allocation map
• The inode allocation map follows the same principle as the block allocation map
• That said, there are differences due to the way inodes are used:
– inodes have different “allocation states” ( free/used/created/deleted)
– The use of multiple inode spaces ( independent filesets) might imply non-contiguous
allocation, allocation “holes” and other properties that make management
a bit more complex
– Inode use is usually more “latency sensitive”, and thus requires function-shipping
methods in order to maintain good performance ( file creates etc.)
– Inode expansion might be manual ( preallocating) or automatic ( upon file creation)
• Inode space expansion logic is outside the scope of this session
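To make the “allocation states” point concrete, here is a minimal sketch. The state names come from the slide; the transition rules and everything else are illustrative guesses, not Scale’s actual state machine.

```python
# Sketch of why inode maps need more bookkeeping than block maps:
# a block is free or in use, but an inode moves through several states.
from enum import Enum

class InodeState(Enum):
    FREE = 0      # slot not allocated
    CREATED = 1   # slot exists on disk but holds no live file yet
    USED = 2      # holds a live file
    DELETED = 3   # file removed, slot awaiting cleanup/reuse

# Assumed legal transitions, for illustration only.
VALID = {
    InodeState.FREE:    {InodeState.CREATED},
    InodeState.CREATED: {InodeState.USED},
    InodeState.USED:    {InodeState.DELETED},
    InodeState.DELETED: {InodeState.CREATED, InodeState.FREE},
}

class InodeMap:
    def __init__(self, n):
        self.state = [InodeState.FREE] * n

    def transition(self, ino, new):
        if new not in VALID[self.state[ino]]:
            raise ValueError(f"{self.state[ino]} -> {new} not allowed")
        self.state[ino] = new
```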
Allocation map
• So what should I use for -n on file system creation?
• Try to estimate the number of nodes that “might” use the filesystem in the future
• Don’t just “use a large number” – the overhead of managing many segments/regions might
be costly
• As long as the number is not “way off”, it should be OK
Quotas 101
• While the main goal of this session is to explain how quotas work in Scale, we should still
define some basic quota-related terminology
• “A disk quota is a limit set by a system administrator that restricts certain aspects of file
system usage on modern operating systems. The function of using disk quotas is to
allocate limited disk space in a reasonable way” ( source: Wikipedia)
• There are two main types of quota: disk quotas and file quotas. The former limits the
maximum size a “quota entity” can use, while the latter limits the number of file system
elements that a “quota entity” can use ( a “quota entity” might be a user, a group or a fileset)
• Other common terms are:
– Soft quota: a.k.a. warning level. Combined with a grace period, it implies a real limit
– Hard quota: the effective limit. A quota entity can’t use more than that limit
– Grace period: a time frame during which the soft quota can be exceeded before it
becomes a hard limit
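The generic terms above can be captured in a short sketch. This models only the textbook semantics just defined; the API and return behavior are invented for illustration and are not Scale’s implementation.

```python
# Minimal model of soft quota, hard quota and grace period.
import time

class Quota:
    def __init__(self, soft, hard, grace_seconds):
        self.soft, self.hard, self.grace = soft, hard, grace_seconds
        self.used = 0.0
        self.grace_started = None   # set when the soft limit is first exceeded

    def charge(self, amount, now=None):
        now = time.time() if now is None else now
        # The hard limit can never be exceeded.
        if self.used + amount > self.hard:
            raise OSError("hard quota exceeded")
        # Once the grace period expires, the soft limit acts like a hard one.
        if self.grace_started is not None and now - self.grace_started > self.grace:
            if self.used + amount > self.soft:
                raise OSError("grace period expired")
        self.used += amount
        # Crossing the soft limit (warning level) starts the grace clock.
        if self.used > self.soft and self.grace_started is None:
            self.grace_started = now
```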
Spectrum Scale quotas
• The dynamic nature of quotas makes it extremely difficult to balance flexibility
against performance impact
• Thus, an eventual consistency approach was chosen – using “quota shares”
• Basically, the quota manager ( part of the filesystem manager role) assigns shares to
requesting clients. A client can use its share until it is revoked ( more details to come)
• Since the quota manager doesn’t know how much of a given share is actually being used, a
new term was introduced: in_doubt
• The in_doubt value represents the shares that are “out there”. It is a function of the share size
and the number of nodes mounting the filesystem
• In order to get an exact quota usage, the quota manager needs to contact all mounting
nodes and revoke the current shares they have
Spectrum Scale quotas
”The share”
• Since we don’t want a client to contact the quota server for each request, we use the
concept of shares ( block and file shares)
• Every time a node wants to allocate a resource, it will contact the server and receive a
share of resources
• In the past, the share size was static ( 20) but configurable. So one would need to balance
performance impact against functionality ( what if in_doubt is bigger than the
quota…)
• Nowadays, dynamic shares are used. A share starts at qMinBlockShare/qMinFileShare (
* blocks/inodes) and might change based on usage patterns
Spectrum Scale quotas
”The share flow”
• The quota client node will calculate how many blocks/inodes it should request based
on historical data and will send the request to the quota server. It might be extremely
aggressive with that…
• The quota server will consider the request from a much wider perspective: number of
nodes, remaining quota ( minus in_doubt), grace period etc.
• The quota server will reply to the client request with the granted resources ( blocks or
inodes)
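The share flow above can be sketched as a client/server exchange. The sizing heuristic, method names and numbers are assumptions; only the in_doubt bookkeeping follows the text: it tracks granted-but-unreported space.

```python
# Toy model of the quota share protocol: grant shares optimistically,
# track them as in_doubt until a revoke/report reconciles them.

class QuotaServer:
    def __init__(self, limit):
        self.limit = limit
        self.used = 0        # confirmed usage reported by clients
        self.in_doubt = 0    # shares handed out, fate unknown

    def request_share(self, wanted):
        # Grant at most what remains once in_doubt is counted as used.
        remaining = self.limit - self.used - self.in_doubt
        granted = max(0, min(wanted, remaining))
        self.in_doubt += granted
        return granted

    def report(self, consumed, returned):
        # A revoke moves a share out of in_doubt: part becomes
        # confirmed usage, the rest goes back to the pool.
        self.in_doubt -= consumed + returned
        self.used += consumed

class QuotaClient:
    def __init__(self, server):
        self.server, self.local = server, 0

    def allocate(self, n, share_size=20):
        if self.local < n:                      # local share exhausted:
            self.local += self.server.request_share(max(n, share_size))
        if self.local < n:
            raise OSError("quota exceeded")
        self.local -= n                          # fully local otherwise
```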
Spectrum Scale quotas
”The share flow” - under pressure
• It is also important to understand what’s going on when we’re getting “close to the
limit”. The server always takes in_doubt into account as if it were used
– Soft limit exceeded:
● The grace counter kicks in
● The SoftQuotaExceeded callback is executed ( if configured)
– Grace period ends:
● Quota exceeded error
– Getting close to the hard limit:
● The quota server revokes from all clients on each request... performance impact
Spectrum Scale quotas
Tips and tricks
• Sometimes, balancing the impact of quotas can be challenging. For example,
imagine an application whose threads are all being “throttled” by quota revokes,
slowing down access to other quota objects which are not “close to the edge” ( who
said Ganesha?)
• In those cases, it might be better to “fail fast”
• There are two major ways to achieve that:
– Use soft quotas only, with a short grace period ( 1 sec for example). In this case,
shortly after the soft quota is exceeded, the app will get a quota exceeded
error, and will use only “a little more than the limit”
– Use the ( undocumented) qRevokeWatchThreshold parameter. Every time
we hit HL - (Usage + inDoubt) < qRevokeWatchThreshold * qMinShare, we need to do
a revoke. We add this revoke to the watch list, and won’t do another revoke
on that object for the next 60 sec. So, if another share request comes in – we
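As far as the slide describes it, the threshold-and-watch-list logic can be sketched as follows. The parameter names come from the slide; the function shapes, the watch-list mechanics and the 60-second window handling are my interpretation, not Scale’s code.

```python
# Guess at the qRevokeWatchThreshold behavior described above.
import time

WATCH_WINDOW = 60.0  # seconds an object stays on the watch list

def needs_revoke(hard_limit, usage, in_doubt,
                 q_revoke_watch_threshold, q_min_share):
    # Revoke when remaining headroom drops below the threshold,
    # expressed as a multiple of the minimum share size.
    return (hard_limit - (usage + in_doubt)
            < q_revoke_watch_threshold * q_min_share)

class RevokeWatch:
    def __init__(self):
        self.watched = {}    # quota object -> time of last revoke

    def maybe_revoke(self, obj, now=None):
        now = time.time() if now is None else now
        last = self.watched.get(obj)
        if last is not None and now - last < WATCH_WINDOW:
            return False     # recently revoked: skip, fail fast instead
        self.watched[obj] = now
        return True
```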
Give or take a block

More Related Content

What's hot

CloverETL + Hadoop
CloverETL + HadoopCloverETL + Hadoop
CloverETL + HadoopDavid Pavlis
 
Types of secondary storage devices ppt
Types of secondary storage devices pptTypes of secondary storage devices ppt
Types of secondary storage devices pptSurkhab Shelly
 
Swap Administration in linux platform
Swap Administration in linux platformSwap Administration in linux platform
Swap Administration in linux platformashutosh123gupta
 
An Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseBenjamin Bengfort
 
20. Parallel Databases in DBMS
20. Parallel Databases in DBMS20. Parallel Databases in DBMS
20. Parallel Databases in DBMSkoolkampus
 
Glusterfs and openstack
Glusterfs  and openstackGlusterfs  and openstack
Glusterfs and openstackopenstackindia
 
Mass storage device
Mass storage deviceMass storage device
Mass storage deviceRaza Umer
 
Storage as a Service with Gluster
Storage as a Service with GlusterStorage as a Service with Gluster
Storage as a Service with GlusterVijay Bellur
 
Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Versio...
Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Versio...Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Versio...
Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Versio...Maciek Jozwiak
 

What's hot (20)

Ch10
Ch10Ch10
Ch10
 
CloverETL + Hadoop
CloverETL + HadoopCloverETL + Hadoop
CloverETL + Hadoop
 
OSCh14
OSCh14OSCh14
OSCh14
 
Mass storage systemsos
Mass storage systemsosMass storage systemsos
Mass storage systemsos
 
Disk Management
Disk ManagementDisk Management
Disk Management
 
Massstorage
MassstorageMassstorage
Massstorage
 
Types of secondary storage devices ppt
Types of secondary storage devices pptTypes of secondary storage devices ppt
Types of secondary storage devices ppt
 
Operation System
Operation SystemOperation System
Operation System
 
Swap Administration in linux platform
Swap Administration in linux platformSwap Administration in linux platform
Swap Administration in linux platform
 
Thiru
ThiruThiru
Thiru
 
An Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed Database
 
Operating Systems
Operating SystemsOperating Systems
Operating Systems
 
Storage Technologies
Storage TechnologiesStorage Technologies
Storage Technologies
 
20. Parallel Databases in DBMS
20. Parallel Databases in DBMS20. Parallel Databases in DBMS
20. Parallel Databases in DBMS
 
Glusterfs and openstack
Glusterfs  and openstackGlusterfs  and openstack
Glusterfs and openstack
 
Mass storage structure
Mass storage structureMass storage structure
Mass storage structure
 
Mass storage device
Mass storage deviceMass storage device
Mass storage device
 
Storage as a Service with Gluster
Storage as a Service with GlusterStorage as a Service with Gluster
Storage as a Service with Gluster
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Versio...
Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Versio...Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Versio...
Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Versio...
 

Similar to Give or take a block

Similar to Give or take a block (20)

Module5 secondary storage
Module5 secondary storageModule5 secondary storage
Module5 secondary storage
 
Parallel databases
Parallel databasesParallel databases
Parallel databases
 
Os
OsOs
Os
 
DAS RAID NAS SAN
DAS RAID NAS SANDAS RAID NAS SAN
DAS RAID NAS SAN
 
OS_Ch14
OS_Ch14OS_Ch14
OS_Ch14
 
Ch14 OS
Ch14 OSCh14 OS
Ch14 OS
 
Os7
Os7Os7
Os7
 
Storage Management
Storage ManagementStorage Management
Storage Management
 
Operation System
Operation SystemOperation System
Operation System
 
Os
OsOs
Os
 
I/O System and Csae Study
I/O System and Csae StudyI/O System and Csae Study
I/O System and Csae Study
 
Palpandi
PalpandiPalpandi
Palpandi
 
File system and Deadlocks
File system and DeadlocksFile system and Deadlocks
File system and Deadlocks
 
Learn about log structured file system
Learn about log structured file systemLearn about log structured file system
Learn about log structured file system
 
Tutorial Haddop 2.3
Tutorial Haddop 2.3Tutorial Haddop 2.3
Tutorial Haddop 2.3
 
SUSE Storage: Sizing and Performance (Ceph)
SUSE Storage: Sizing and Performance (Ceph)SUSE Storage: Sizing and Performance (Ceph)
SUSE Storage: Sizing and Performance (Ceph)
 
Advancedrn
AdvancedrnAdvancedrn
Advancedrn
 
Advanced databases -client /server arch
Advanced databases -client /server archAdvanced databases -client /server arch
Advanced databases -client /server arch
 
Pandi
PandiPandi
Pandi
 
LCFS - Storage Driver for Docker
LCFS - Storage Driver for DockerLCFS - Storage Driver for Docker
LCFS - Storage Driver for Docker
 

Recently uploaded

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 

Recently uploaded (20)

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

Give or take a block

  • 1. Give or take Block and inode allocation in Spectrum Scale Tomer Perry Spectrum Scale Development <tomp@il.ibm.com> Louise Bourgeois Give or Take 2002
  • 2. Outline ● Relevant Scale Basics ● Allocation types ● Map based ● Share based
  • 3. Outline ● Relevant Scale Basics ● Allocation types ● Map based ● Share based
  • 4. Some relevant Scale Basics ”The Cluster” • Cluster – A group of operating system instances, or nodes, on which an instance of Spectrum Scale is deployed • Network Shared Disk - Any block storage (disk, partition, or LUN) given to Spectrum Scale for its use • NSD server – A node making an NSD available to other nodes over the network • GPFS filesystem – Combination of data and metadata which is deployed over NSDs and managed by the cluster • Client – A node running Scale Software accessing a filesystem nsd nsd srv nsd nsd nsd srv nsd srv nsd srv CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC
  • 5. Some relevant Scale Basics MultiCluster • A Cluster can share some or all of its filesystems with other clusters • Such cluster can be referred to as “Storage Cluster” • A cluster that don’t have any local filesystems can be referred to as “Client Cluster” • A Client cluster can connect to 31 clusters ( outbound) • A Storage cluster can be mounted by unlimited number of client clusters ( Inbound) – 16383 really. • Client cluster nodes “joins” the storage cluster upon mount CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCC
  • 6. Some relevant Scale Basics Node Roles • While the general concept in Scale is “all nodes were made equal” some has special roles Config Servers  Holds cluster configuration files  4.1 onward supports CCR - quorum  Not in the data path Cluster Manager  One of the quorum nodes  Manage leases, failures and recoveries  Selects filesystem managers  mm*mgr -c Filesystem Manager  One of the “manager” nodes  One per filesystem  Manage filesystem configurations ( disks)  Space allocation  Quota management Token Manager/s  Multiple per filesystem  Each manage portion of tokens for each filesystem based on inode number
  • 7. Some relevant Scale Basics From disk to namespace ● Scale filesystem ( GPFS) can be described as loosely coupled layers Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk NSD NSD NSD NSD NSD NSD NSD NSD FS disk FS disk FS disk FS disk FS disk FS disk FS disk FS disk Storage Pool Storage Pool Filesystem Fileset FilesetFilesetFileset
  • 8. Some relevant Scale Basics From disk to namespace ● Scale filesystem ( GPFS) can be described as loosely coupled layers ● Different layers have different properties Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk NSD NSD NSD NSD NSD NSD NSD NSD FS disk FS disk FS disk FS disk FS disk FS disk FS disk FS disk Storage Pool Storage Pool Filesystem Fileset FilesetFilesetFileset Size/MediaType/device Name/Servers Type/Pool/FailureGroup BlockSize/Replication/LogSize/Inodes…. inodes/quotas/AFM/parent...
  • 9. Some relevant Scale Basics From disk to namespace ● Scale filesystem ( GPFS) can be described as loosely coupled layers ● Different layers have different properties ● Can be divided into “name space virtualization” and “Storage virtualization” Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk Disk/ LUN/ vdisk NSD NSD NSD NSD NSD NSD NSD NSD FS disk FS disk FS disk FS disk FS disk FS disk FS disk FS disk Storage Pool Storage Pool Filesystem Fileset FilesetFilesetFileset Storage Virtualization Namespace Virtualization
  • 10. Outline ● Relevant Scale Basics ● Allocation types ● Map based ● Share based
  • 11. Resource allocation in distributed file systems • One of the main challenges in distributed software in general, and distributed filesystems in particular, is how to efficiently manage resources while maintaining scalability • Spectrum Scale’s approach has always been “scalability through autonomy” (tokens, metanode, lease protocol etc.) • In the context of this session, we will limit the discussion to space allocation and management: subblocks and inodes (essentially, inode allocation is really “metadata space management”) • A common approach, taking the CAP theorem into account, is to use eventual consistency in order to minimize the performance impact while still providing reasonable functionality.
  • 12. Resource allocation: One size doesn’t fit all • Although there are many similarities between subblock allocation and quotas, there are substantial differences which eventually led us to use different mechanisms for each:
  – Space allocation – managed entities: storage pools, filesets (inodes); change rate: adding disks, adding filesets; allocation method: map based
  – Quotas – managed entities: users/groups/filesets, users/groups@filesets; change rate: administrator initiated; allocation method: share based
  • 13. Resource allocation: One size doesn’t fit all • Although there are many similarities between subblock allocation and quotas, there are substantial differences which eventually led us to use different mechanisms for each:
  – Space allocation – managed entities: storage pools, filesets (inodes); change rate: adding disks, adding filesets (static); allocation method: map based
  – Quotas – managed entities: users/groups/filesets, users/groups@filesets; change rate: administrator initiated (dynamic); allocation method: share based
  • 14. Outline ● Relevant Scale Basics ● Allocation types ● Map based ● Share based
  • 15. Allocation map based management • Due to the relatively “static” nature of filesystem space management, it is possible to use the allocation-map-based approach • One of the main advantages of this approach is that, with proper planning, it is relatively easy to let each node manage its own allocation independently • The basic approach is to divide the managed resource into “regions” • When a node needs resources to allocate, the filesystem manager assigns a region to that node • From then on, the node can use all the resources in that region • The number of regions is affected by the “-n” filesystem/pool creation parameter (while one can change the -n parameter using mmchfs, it will only apply to pools created after the change)
  • 16. Block allocation map • Divided into a fixed number of n equal-size “regions” – Each region contains bits representing blocks on all disks – Different nodes can use different regions and still stripe across all disks (many more regions than nodes) – The number of regions is based on the -n parameter at storage pool creation [diagram: blocks 0…last of Disk 1, Disk 2 and Disk 3 interleaved across regions 0…n-1]
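The region scheme on this slide can be sketched in a few lines. This is an illustrative model only, not actual GPFS code: the class, method names and data layout are all assumptions made for exposition. The point it demonstrates is that once the filesystem manager hands a whole region to a node, the node can allocate blocks from any disk without further coordination, because every region spans all disks.

```python
# Minimal sketch (assumed names, not the real GPFS structures): the block
# space of every disk is split into num_regions equal regions; the manager
# hands whole regions to nodes; a node then allocates locally.

class AllocationMap:
    def __init__(self, disks, blocks_per_disk, num_regions):
        per_region = blocks_per_disk // num_regions
        # each region holds a free-block set for every disk, so a node
        # owning one region can still stripe across all disks
        self.regions = [
            {d: set(range(r * per_region, (r + 1) * per_region)) for d in disks}
            for r in range(num_regions)
        ]
        self.owner = [None] * num_regions  # node currently holding each region

    def assign_region(self, node):
        """Filesystem manager: hand the first unowned region to a node."""
        for r, owner in enumerate(self.owner):
            if owner is None:
                self.owner[r] = node
                return r
        raise RuntimeError("no free regions")

    def allocate(self, region, disk):
        """Node-local allocation: no round trip to the manager needed."""
        free = self.regions[region][disk]
        return free.pop() if free else None


amap = AllocationMap(["d1", "d2", "d3"], blocks_per_disk=120, num_regions=4)
r = amap.assign_region("nodeA")   # nodeA now owns region 0
blk = amap.allocate(r, "d2")      # a block from region 0 on disk d2
```

Note how allocation touches only local state; only `assign_region` involves the filesystem manager, which is the autonomy property the deck emphasizes.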
  • 17. Block allocation map • When adding more disks: – The alloc map file is extended to provide room for the additional bits – Each allocation region now consists of two separately stored segments [diagram: regions 0…n-1, each with segment 0 covering Disks 1–3 (blocks 0…n/2-1) and segment 1 covering Disk 4 (blocks n/2…last)]
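A sketch of the extension step, under the same illustrative model as above (names and structure are assumptions, not the real on-disk format): when disks are added, every region gains a new, separately stored segment that covers only the new disks, so the existing segments never need to be rewritten.

```python
# Each region is modeled as a list of segments; a segment maps a disk
# name to its set of free blocks. Adding disks appends a segment to
# every region instead of rebuilding the map.

def build_map(disks, blocks_per_disk, num_regions):
    """Initial map: each region starts with a single segment."""
    per_region = blocks_per_disk // num_regions
    return [
        [{d: set(range(r * per_region, (r + 1) * per_region)) for d in disks}]
        for r in range(num_regions)
    ]

def add_disks(regions, new_disks, blocks_per_disk):
    """Append a segment for the added disks to every existing region."""
    per_region = blocks_per_disk // len(regions)
    for r, region in enumerate(regions):
        region.append(
            {d: set(range(r * per_region, (r + 1) * per_region))
             for d in new_disks}
        )

regions = build_map(["d1", "d2", "d3"], blocks_per_disk=120, num_regions=4)
add_disks(regions, ["d4"], blocks_per_disk=120)
# each region now has two segments: old disks in segment 0, d4 in segment 1
```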
  • 18. Inode Allocation map • The inode allocation map follows the same principle as the block allocation map • That said, there are differences due to the way inodes are used: – inodes have different “allocation states” (free/used/created/deleted) – The use of multiple inode spaces (independent filesets) might imply non-contiguous allocation, allocation “holes” and other properties that make management a bit more complex – Inode use is usually more “latency sensitive”, thus requiring function-shipping methods in order to maintain good performance (file creates etc.) – Inode expansion might be manual (preallocation) or automatic (upon file creation) • Inode space expansion logic is outside the scope of this session
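The four allocation states mentioned above can be pictured as a small state machine. The state names come from the slide; the transition names and their ordering are illustrative assumptions, not the actual GPFS encoding.

```python
# Illustrative inode lifecycle (assumed transitions, not GPFS internals):
# free -> created -> used -> deleted -> free
from enum import Enum

class InodeState(Enum):
    FREE = "free"          # available in the allocation map
    CREATED = "created"    # reserved in the map, not yet initialized
    USED = "used"          # holds a live file
    DELETED = "deleted"    # unlinked, awaiting reclaim

TRANSITIONS = {
    (InodeState.FREE, "allocate"): InodeState.CREATED,
    (InodeState.CREATED, "initialize"): InodeState.USED,
    (InodeState.USED, "unlink"): InodeState.DELETED,
    (InodeState.DELETED, "reclaim"): InodeState.FREE,
}

def step(state, event):
    """Apply one event to an inode, rejecting illegal transitions."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.value} + {event}")

s = step(InodeState.FREE, "allocate")   # a file create reserves an inode
s = step(s, "initialize")               # inode is written and becomes USED
```

The extra states are one reason the inode map needs more bookkeeping than the block map, where a block is simply free or allocated.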
  • 19. Allocation map • So what should I use for -n on file system creation? • Try to estimate the number of nodes that “might” use the filesystem in the future • Don’t just “use a large number”: the overhead of managing many segments/regions might be costly • As long as the number is not “way off”, it should be OK.
  • 20. Outline ● Relevant Scale Basics ● Allocation types ● Map based ● Share based
  • 21. Quotas 101 • While the main goal of this session is to explain how quotas work in Scale, we should still define some basic quota-related terminology • “A disk quota is a limit set by a system administrator that restricts certain aspects of file system usage on modern operating systems. The function of using disk quotas is to allocate limited disk space in a reasonable way” (source: Wikipedia) • There are two main types of quota: disk quotas and file quotas. The former limits the maximum size a “quota entity” can use, while the latter limits the number of file system elements a “quota entity” can own (a “quota entity” might be a user, a group or a fileset) • Other common terms are: – Soft quota: a.k.a. the warning level. When combined with a grace period it implies a real limit – Hard quota: the effective limit. A quota entity can’t use more than this limit – Grace period: a time frame during which the soft quota can be exceeded before it becomes a hard limit
  • 22. Spectrum Scale quotas • The dynamic nature of quotas makes it extremely difficult to balance flexibility against performance impact • Thus, an eventual-consistency approach was chosen – using “quota shares” • Basically, the quota manager (part of the filesystem manager role) assigns shares to requesting clients. A client can use its share until it is revoked (more details to come) • Since the quota manager doesn’t know how much of a given share is actually being used, a new term was introduced: in_doubt • The in_doubt value represents the shares that are “out there”. It is a function of the share size and the number of nodes mounting the filesystem • To get an exact quota usage, the quota manager needs to contact all mounting nodes and revoke the shares they currently hold
  • 23. Spectrum Scale quotas: ”The share” • Since we don’t want a client to contact the quota server for each request, we use the concept of shares (block and file shares) • Every time a node wants to allocate a resource, it contacts the server and receives a share of the resources • In the past, the share size was static (20) but configurable, so one had to balance performance impact against functionality (what if in_doubt is bigger than the quota…) • Nowadays, dynamic shares are used: a share starts at qMinBlockShare/qMinFileShare (blocks/inodes respectively) and might change based on usage patterns
  • 24. Spectrum Scale quotas: ”The share flow” • The quota client node calculates how many blocks/inodes it should request based on historical data and sends the request to the quota server. It might be extremely aggressive about it… • The quota server considers the request from a much wider perspective: number of nodes, remaining quota (minus in_doubt), grace period etc. • The quota server replies to the client with the granted resources (blocks or inodes)
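The share flow above can be sketched as follows. This is a minimal illustration of the eventual-consistency idea, not the real quota protocol: the class and field names are assumptions, and the real server also weighs node count, grace periods and usage history. The key mechanic shown is that in_doubt is counted as if already used when granting new shares, and a revoke converts in_doubt back into exact usage.

```python
# Illustrative quota server (assumed names, not GPFS code): grants are
# limited by hard_limit - used - in_doubt, so outstanding shares can
# never let clients collectively exceed the hard limit.

class QuotaServer:
    def __init__(self, hard_limit):
        self.hard_limit = hard_limit
        self.used = 0       # usage confirmed via revokes
        self.in_doubt = 0   # granted shares whose real usage is unknown

    def request_share(self, wanted):
        """Grant at most what remains once in_doubt is treated as used."""
        remaining = self.hard_limit - self.used - self.in_doubt
        granted = max(0, min(wanted, remaining))
        self.in_doubt += granted
        return granted

    def revoke(self, granted, actually_used):
        """Reclaim a share from a client: usage becomes exact again."""
        self.in_doubt -= granted
        self.used += actually_used


srv = QuotaServer(hard_limit=100)
g1 = srv.request_share(30)        # granted 30, in_doubt = 30
g2 = srv.request_share(90)        # only 70 remain, so granted 70
srv.revoke(g1, actually_used=12)  # in_doubt drops to 70, used rises to 12
```

This also shows why a large in_doubt hurts near the limit: even modest requests force revokes, because the server must assume every outstanding share is fully consumed.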
  • 25. Spectrum Scale quotas: ”The share flow” - under pressure • It is also important to understand what’s going on when we’re getting “close to the limit”. The server always takes in_doubt into account as if it were used – Soft limit exceeded: ● The grace counter kicks in ● The SoftQuotaExceeded callback is executed (if configured) – Grace period ends: ● Quota exceeded error – Getting close to the hard limit: ● The quota server revokes shares from all clients for each request… performance impact
  • 26. Spectrum Scale quotas: tips and tricks • Sometimes, balancing the impact of quotas can be challenging. For example, imagine an application whose threads are all being “throttled” because quota revokes slow down access to other quota objects which are not “close to the edge” (who said Ganesha?) • In those cases, it might be better to “fail fast” • There are two major ways to achieve that: – Use soft quotas only, with a short grace period (1 sec for example). In this case, shortly after the soft quota is exceeded, the app will get a quota exceeded error. It will use “a little more than the limit” – Use the (undocumented) qRevokeWatchThreshold parameter. Every time we hit HL - (Usage + inDoubt) < qRevokeWatchThreshold * qMinShare we need to do a revoke. We add this revoke to the watch list and won’t do another revoke on that object for the next 60 sec. So, if another share request comes in – we