Give or take
Block and inode allocation in Spectrum Scale
Tomer Perry
Spectrum Scale Development
<tomp@il.ibm.com>
Louise Bourgeois, Give or Take, 2002
Outline
● Relevant Scale Basics
● Allocation types
● Map based
● Share based
Some relevant Scale Basics
”The Cluster”
• Cluster – A group of operating system instances, or nodes, on which an instance of Spectrum Scale is deployed
• Network Shared Disk (NSD) – Any block storage (disk, partition, or LUN) given to Spectrum Scale for its use
• NSD server – A node making an NSD available to other nodes over the network
• GPFS filesystem – A combination of data and metadata, deployed over NSDs and managed by the cluster
• Client – A node running Scale software that accesses a filesystem
[Diagram: NSD servers presenting NSDs over the network to a large number of client nodes]
Some relevant Scale Basics
MultiCluster
• A cluster can share some or all of its filesystems with other clusters
• Such a cluster can be referred to as a “Storage Cluster”
• A cluster that doesn’t have any local filesystems can be referred to as a “Client Cluster”
• A client cluster can connect to up to 31 clusters (outbound)
• A storage cluster can be mounted by an unlimited number of client clusters (inbound) – 16383, really
• Client cluster nodes “join” the storage cluster upon mount
[Diagram: many client-cluster nodes mounting filesystems from a storage cluster]
Some relevant Scale Basics
Node Roles
• While the general concept in Scale is “all nodes were made equal”, some have special roles

Config Servers
– Hold the cluster configuration files
– 4.1 onward supports CCR – quorum
– Not in the data path

Cluster Manager
– One of the quorum nodes
– Manages leases, failures and recoveries
– Selects filesystem managers
– mm*mgr -c

Filesystem Manager
– One of the “manager” nodes
– One per filesystem
– Manages filesystem configuration (disks)
– Space allocation
– Quota management

Token Manager(s)
– Multiple per filesystem
– Each manages a portion of the tokens for each filesystem, based on inode number
Some relevant Scale Basics
From disk to namespace
● A Scale filesystem (GPFS) can be described as loosely coupled layers
[Diagram: layer stack – Disk/LUN/vdisk → NSD → FS disk → Storage Pool → Filesystem → Fileset]
Some relevant Scale Basics
From disk to namespace
● A Scale filesystem (GPFS) can be described as loosely coupled layers
● Different layers have different properties
[Diagram: layer stack with per-layer properties – Disk/LUN/vdisk: Size/MediaType/device; NSD: Name/Servers; FS disk: Type/Pool/FailureGroup; Filesystem: BlockSize/Replication/LogSize/Inodes…; Fileset: inodes/quotas/AFM/parent…]
Some relevant Scale Basics
From disk to namespace
● A Scale filesystem (GPFS) can be described as loosely coupled layers
● Different layers have different properties
● Can be divided into “namespace virtualization” and “storage virtualization”
[Diagram: layer stack – the lower layers (Disk/LUN/vdisk → NSD → FS disk → Storage Pool) labeled Storage Virtualization, the upper layers (Filesystem, Fileset) labeled Namespace Virtualization]
Outline
● Relevant Scale Basics
● Allocation types
● Map based
● Share based
Resource allocation
In distributed file systems
• One of the main challenges in distributed software in general, and distributed filesystems in particular, is how to efficiently manage resources while maintaining scalability
• Spectrum Scale’s approach was always “scalability through autonomy” (tokens, metanode, lease protocol etc.)
• In the context of this session, we will limit the discussion to space allocation and management: subblocks and inodes (essentially, inode allocation is actually “metadata space management”)
• A common approach, taking the CAP theorem into account, is to use eventual consistency in order to minimize the performance impact while still providing reasonable functionality
Resource allocation
One size doesn’t fit all
• Although there are many similarities between subblock allocation and quotas, there are substantial differences, which eventually led us to use different mechanisms for each:

Type              Managed entity                                Change rate              Allocation method
Space Allocation  Storage Pools, Filesets (inodes)              Adding disks/filesets    Map based
Quotas            Users/Groups/Filesets, Users/Groups@Filesets  Administrator initiated  Share based
Resource allocation
One size doesn’t fit all
• Although there are many similarities between subblock allocation and quotas, there are substantial differences, which eventually led us to use different mechanisms for each:

Type              Managed entity                                Change rate                        Allocation method
Space Allocation  Storage Pools, Filesets (inodes)              Adding disks/filesets (static)     Map based
Quotas            Users/Groups/Filesets, Users/Groups@Filesets  Administrator initiated (dynamic)  Share based
Outline
● Relevant Scale Basics
● Allocation types
● Map based
● Share based
Allocation map based management
• Due to the relatively “static” nature of filesystem space management, it is possible to use the allocation map based approach
• One of the main advantages of this approach is that, with proper planning, it is relatively easy to allow each node to manage its own allocation independently
• The basic approach is to divide the managed resource into “regions”
• When a node needs resources to be allocated, the filesystem manager assigns a region to that node
• From then on, the node can use all the resources in that region
• The number of regions is affected by the “-n” filesystem/pool creation parameter
(while one can change the -n parameter using mmchfs, it will only apply to pools created after the change)
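The region hand-off above can be sketched as follows. This is a minimal illustration with made-up data structures, not the actual GPFS allocation code: the filesystem manager grants whole regions, after which a node allocates blocks with no further messages.

```python
# Hypothetical sketch of region-based allocation (illustrative, not GPFS internals).
# The manager divides the block space into regions; each node is handed a
# region and then allocates from it independently.

class AllocationMap:
    def __init__(self, total_blocks, num_regions):
        # Stripe block numbers across regions (real regions hold bitmap bits).
        self.regions = [list(range(i, total_blocks, num_regions))
                        for i in range(num_regions)]
        self.free_regions = list(range(num_regions))
        self.owner = {}  # region id -> owning node

    def assign_region(self, node):
        """Filesystem manager: hand an unowned region to a requesting node."""
        region_id = self.free_regions.pop(0)
        self.owner[region_id] = node
        return region_id

    def allocate(self, region_id):
        """Node-local: take a free block from an owned region, no messages."""
        return self.regions[region_id].pop()

amap = AllocationMap(total_blocks=1024, num_regions=8)
r = amap.assign_region("node1")                  # one round-trip to the manager
blocks = [amap.allocate(r) for _ in range(10)]   # purely local afterwards
```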
Block allocation map
• Divided into a fixed number of n equal-size “regions”
– Each region contains bits representing blocks on all disks
– Different nodes can use different regions and still stripe across all disks (many more regions than nodes) – based on the -n parameter at storage pool creation
[Diagram: allocation map for Disks 1–3, covering block 0 through the last block, divided into region 0, region 1, region 2 … region n-1]
Block allocation map
• When adding more disks:
– The alloc map file is extended to provide room for the additional bits
– Each allocation region now consists of two separately stored segments
[Diagram: Disks 1–4 – each region r holds segment 0 (blocks 0 … n/2-1) and segment 1 (blocks n/2 … last block, including the new disk)]
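The segment idea can be sketched like this (hypothetical structures, not the on-disk format): adding disks appends a new segment to each region instead of rewriting the whole map, and a region's free space is the union of its segments.

```python
# Hypothetical sketch: extending the allocation map when disks are added.
# A new segment is appended per region; allocation scans the segments in order.

class Region:
    def __init__(self, blocks):
        self.segments = [list(blocks)]  # segment 0: bits for the original disks

    def add_segment(self, new_blocks):
        """Append a separately stored segment covering the newly added disks."""
        self.segments.append(list(new_blocks))

    def allocate(self):
        for seg in self.segments:
            if seg:
                return seg.pop()
        raise MemoryError("region exhausted")

r = Region(blocks=[0, 1, 2])
r.add_segment(new_blocks=[100, 101])        # e.g. a fourth disk joins the pool
got = [r.allocate() for _ in range(5)]      # drains both segments
```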
Inode Allocation map
• The inode allocation map follows the same principle as the block allocation map
• That said, there are differences due to the way inodes are used:
– Inodes have different “allocation states” (free/used/created/deleted)
– The use of multiple inode spaces (independent filesets) might imply non-contiguous allocation, allocation “holes” and other properties that might make management a bit more complex
– Inode use is usually more “latency sensitive” – thus requiring function shipping methods in order to maintain good performance (file creates etc.)
– Inode expansion might be manual (preallocating) or automatic (upon file creation)
• Inode space expansion logic is outside the scope of this session
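The state-based difference can be illustrated as follows (names and layout are assumed for illustration; the real on-disk format differs): each inode entry carries a state rather than a single free/used bit, and independent filesets leave holes in the inode number space.

```python
# Hypothetical sketch of inode allocation states (illustrative only).

from enum import Enum

class InodeState(Enum):
    FREE = 0      # bits reserved, inode never initialized
    CREATED = 1   # initialized on disk, ready for fast allocation
    USED = 2      # in use by a file
    DELETED = 3   # freed, pending reuse

# A sparse dict models the holes left by independent filesets (inode spaces).
inode_map = {}                      # inode number -> InodeState
for ino in range(0, 8):             # root fileset's range
    inode_map[ino] = InodeState.USED if ino < 3 else InodeState.CREATED
for ino in range(1024, 1028):       # an independent fileset, non-contiguous
    inode_map[ino] = InodeState.FREE

def create_file(imap):
    """Prefer a pre-created inode (cheap); fall back to initializing a free one."""
    for ino, st in imap.items():
        if st is InodeState.CREATED:
            imap[ino] = InodeState.USED
            return ino
    for ino, st in imap.items():
        if st is InodeState.FREE:
            imap[ino] = InodeState.USED   # expansion/initialization elided
            return ino
    raise OSError("no inodes available")

ino = create_file(inode_map)
```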
Allocation map
• So what should I use for -n on filesystem creation?
• Try to estimate the number of nodes that “might” use the filesystem in the future
• Don’t just “use a large number”. The overhead of managing many segments/regions might be costly
• As long as the number is not “way off”, it should be OK
Outline
● Relevant Scale Basics
● Allocation types
● Map based
● Share based
Quotas 101
• While the main goal of this session is to explain how quotas work in Scale, we should still define some basic quota related terminology
• “A disk quota is a limit set by a system administrator that restricts certain aspects of file system usage on modern operating systems. The function of using disk quotas is to allocate limited disk space in a reasonable way” (source: Wikipedia)
• There are two main types of quota: disk quotas and file quotas. The former limits the maximum size a “quota entity” can use, while the latter defines the number of file system elements that a “quota entity” can use (a “quota entity” might be a user, a group or a fileset)
• Other common terms are:
– Soft quota: a.k.a. warning level. When used with a grace period, implies a real limit
– Hard quota: the effective limit. A quota entity can’t use more than that limit
– Grace period: a time frame during which the soft quota can be exceeded before it becomes a hard limit
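The soft/hard/grace interplay can be sketched generically (field names are illustrative, not Scale's implementation):

```python
# Hypothetical sketch of the classic soft/hard/grace quota check.

import time

class Quota:
    def __init__(self, soft, hard, grace_seconds):
        self.soft, self.hard, self.grace = soft, hard, grace_seconds
        self.used = 0
        self.grace_start = None   # set when the soft limit is first exceeded

    def charge(self, amount, now=None):
        now = time.monotonic() if now is None else now
        new_used = self.used + amount
        if new_used > self.hard:
            raise OSError("hard quota exceeded")          # e.g. EDQUOT
        if new_used > self.soft:
            if self.grace_start is None:
                self.grace_start = now                    # grace counter kicks in
            elif now - self.grace_start > self.grace:
                raise OSError("grace period expired")     # soft becomes hard
        else:
            self.grace_start = None                       # back under soft limit
        self.used = new_used

q = Quota(soft=100, hard=150, grace_seconds=7 * 24 * 3600)
q.charge(90)        # under the soft limit: fine
q.charge(20)        # over the soft limit: grace period starts
```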
Spectrum Scale quotas
• The dynamic nature of quotas makes it extremely difficult to balance flexibility against performance impact
• Thus, an eventual consistency approach was chosen – using “quota shares”
• Basically, the quota manager (part of the filesystem manager role) assigns shares to requesting clients. A client can use its share unless it is revoked (more details to come)
• Since the quota manager doesn’t know how much of a given share is actually being used, a new term was introduced: in_doubt
• The in_doubt value represents the shares that are “out there”. It is a factor of the share size and the number of nodes mounting the filesystem
• In order to get an exact quota usage, the quota manager needs to contact all mounting nodes and revoke the shares they currently hold
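The in_doubt bookkeeping can be sketched as follows (a simplified, hypothetical model: shares handed out but not yet reported back count against the limit pessimistically):

```python
# Hypothetical sketch of quota-manager accounting with in_doubt.

class QuotaManager:
    def __init__(self, limit, share_size=20):
        self.limit = limit
        self.used = 0          # confirmed usage, learned from revokes/reports
        self.in_doubt = 0      # shares "out there" on client nodes
        self.share_size = share_size

    def grant_share(self):
        # Pessimistic: treat in_doubt as if it were already used.
        if self.used + self.in_doubt + self.share_size > self.limit:
            return 0
        self.in_doubt += self.share_size
        return self.share_size

    def report_usage(self, share, actually_used):
        """Client returns a share (e.g. on revoke): the doubt is resolved."""
        self.in_doubt -= share
        self.used += actually_used

mgr = QuotaManager(limit=100)
s = mgr.grant_share()                   # grants 20; in_doubt = 20
mgr.report_usage(s, actually_used=5)    # used = 5, in_doubt = 0
```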
Spectrum Scale quotas
”The share”
• Since we don’t want a client to contact the quota server for each request, we use the concept of shares (block and file shares)
• Every time a node wants to allocate a resource, it will contact the server and receive a share of resources
• In the past, the share size was static (20) but configurable, so one had to balance performance impact against functionality (what if in_doubt is bigger than the quota…)
• Nowadays, dynamic shares are used. The size starts at qMinBlockShare/qMinFileShare (blocks/inodes) and might change based on usage patterns
Spectrum Scale quotas
”The share flow”
• The quota client node will calculate how many blocks/inodes it should request based on historical data and will send the request to the quota server. It might be extremely aggressive with that…
• The quota server will consider the request from a much wider perspective: number of nodes, remaining quota (minus in_doubt), grace period etc.
• The quota server will reply to the client request with the granted resources (blocks or inodes)
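The request/grant exchange can be sketched as follows (both heuristics are illustrative assumptions, not the actual quota server logic): the client asks aggressively based on its recent history; the server clips the grant to the headroom left after in_doubt, shared across nodes.

```python
# Hypothetical sketch of the quota share request/grant flow.

def client_request(recent_allocs, current_share):
    """Client side: ask for roughly what recent history suggests it will need."""
    predicted = 2 * sum(recent_allocs)      # aggressive growth heuristic
    return max(predicted, current_share)

def server_grant(requested, limit, used, in_doubt, num_nodes):
    """Server side: never hand out more than the remaining safe headroom."""
    headroom = limit - used - in_doubt
    if headroom <= 0:
        return 0
    fair_cap = max(1, headroom // num_nodes)  # leave room for the other nodes
    return min(requested, fair_cap)

req = client_request(recent_allocs=[10, 15, 25], current_share=20)
granted = server_grant(req, limit=1000, used=600, in_doubt=200, num_nodes=10)
```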
Spectrum Scale quotas
”The share flow” – under pressure
• It is also important to understand what’s going on when we’re getting “close to the limit”. The server always takes in_doubt into account as if it were used
– Soft limit exceeded:
● The grace counter kicks in
● The SoftQuotaExceeded callback is executed (if configured)
– Grace period ends:
● Quota exceeded error
– Getting close to the hard limit:
● The quota server revokes from all clients on each request… performance impact
Spectrum Scale quotas
Tips and tricks
• Sometimes, balancing the impact of quotas might be challenging. For example, imagine an application whose threads are all being “throttled” due to quota revokes, slowing down access to other quota objects which are not “close to the edge” (who said Ganesha?)
• In those cases, it might be better to “fail fast”
• There are two major ways to achieve that:
– Use soft quotas only, with a short grace period (1 sec, for example). In this case, shortly after the soft quota is exceeded, the app will get a quota exceeded error. It will use “a little more than the limit”
– Use the (undocumented) qRevokeWatchThreshold parameter. Every time we’re hitting HL - (Usage + inDoubt) < qRevokeWatchThreshold * qMinShare we need to do a revoke. We add this revoke to the watch list and won’t do another revoke to that object for the next 60 sec. So, if another share request comes in – we…
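The revoke-watch behavior in the last bullet can be sketched as follows. Only the parameter names and the threshold condition come from the slide; the threshold value, the helper function and the data structures are illustrative assumptions.

```python
# Hypothetical sketch of the qRevokeWatchThreshold idea: once a revoke happens
# near the hard limit, suppress further revokes on the same quota object for a
# cooldown window, trading accuracy for fail-fast behavior.

import time

Q_MIN_SHARE = 20
Q_REVOKE_WATCH_THRESHOLD = 2     # illustrative value for the undocumented knob
REVOKE_COOLDOWN = 60             # seconds

watch_list = {}                  # quota object -> time of its last revoke

def maybe_revoke(obj, hard_limit, usage, in_doubt, now=None):
    now = time.monotonic() if now is None else now
    near_limit = (hard_limit - (usage + in_doubt)
                  < Q_REVOKE_WATCH_THRESHOLD * Q_MIN_SHARE)
    if not near_limit:
        return False             # plenty of headroom: no revoke needed
    last = watch_list.get(obj)
    if last is not None and now - last < REVOKE_COOLDOWN:
        return False             # recently revoked: skip, fail fast instead
    watch_list[obj] = now
    return True                  # do the (expensive) cluster-wide revoke

first = maybe_revoke("fileset1", hard_limit=1000, usage=950, in_doubt=20, now=0.0)
second = maybe_revoke("fileset1", hard_limit=1000, usage=950, in_doubt=20, now=1.0)
```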
Give or take a block

  • 1. Give or take Block and inode allocation in Spectrum Scale Tomer Perry Spectrum Scale Development <tomp@il.ibm.com> Louise Bourgeois Give or Take 2002
  • 2. Outline ● Relevant Scale Basics ● Allocation types ● Map based ● Share based
  • 3. Outline ● Relevant Scale Basics ● Allocation types ● Map based ● Share based
  • 4. Some relevant Scale Basics ”The Cluster” • Cluster – A group of operating system instances, or nodes, on which an instance of Spectrum Scale is deployed • Network Shared Disk - Any block storage (disk, partition, or LUN) given to Spectrum Scale for its use • NSD server – A node making an NSD available to other nodes over the network • GPFS filesystem – Combination of data and metadata which is deployed over NSDs and managed by the cluster • Client – A node running Scale Software accessing a filesystem nsd nsd srv nsd nsd nsd srv nsd srv nsd srv CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC CCCCC
  • 5. Some relevant Scale Basics MultiCluster • A cluster can share some or all of its filesystems with other clusters • Such a cluster can be referred to as a “Storage Cluster” • A cluster that doesn’t have any local filesystems can be referred to as a “Client Cluster” • A client cluster can connect to up to 31 clusters ( outbound) • A storage cluster can be mounted by an unlimited number of client clusters ( inbound) – 16383, really • Client cluster nodes “join” the storage cluster upon mount
  • 6. Some relevant Scale Basics Node Roles • While the general concept in Scale is “all nodes were made equal”, some have special roles Config Servers  Hold the cluster configuration files  4.1 onward supports CCR - quorum  Not in the data path Cluster Manager  One of the quorum nodes  Manages leases, failures and recoveries  Selects filesystem managers  mm*mgr -c Filesystem Manager  One of the “manager” nodes  One per filesystem  Manages filesystem configuration ( disks)  Space allocation  Quota management Token Manager/s  Multiple per filesystem  Each manages a portion of the tokens for each filesystem, based on inode number
  • 7. Some relevant Scale Basics From disk to namespace ● A Scale filesystem ( GPFS) can be described as loosely coupled layers: Disk/LUN/vdisk → NSD → FS disk → Storage Pool → Filesystem → Fileset
  • 8. Some relevant Scale Basics From disk to namespace ● A Scale filesystem ( GPFS) can be described as loosely coupled layers ● Different layers have different properties: Disk/LUN/vdisk – Size/MediaType/device; NSD – Name/Servers; FS disk – Type/Pool/FailureGroup; Filesystem – BlockSize/Replication/LogSize/Inodes…; Fileset – inodes/quotas/AFM/parent…
  • 9. Some relevant Scale Basics From disk to namespace ● A Scale filesystem ( GPFS) can be described as loosely coupled layers ● Different layers have different properties ● The layers can be divided into “storage virtualization” ( Disk/LUN/vdisk → NSD → FS disk → Storage Pool) and “namespace virtualization” ( Filesystem → Fileset)
  • 10. Outline ● Relevant Scale Basics ● Allocation types ● Map based ● Share based
  • 11. Resource allocation In distributed file systems • One of the main challenges in distributed software in general, and distributed filesystems in particular, is how to efficiently manage resources while maintaining scalability • Spectrum Scale’s approach has always been “scalability through autonomy” ( tokens, metanode, lease protocol etc.) • In the context of this session, we will limit the discussion to space allocation and management: subblocks and inodes ( essentially, inode allocation is actually “metadata space management”) • A common approach, taking the CAP theorem into account, is to use eventual consistency in order to minimize the performance impact while still providing reasonable functionality
  • 12. Resource allocation One size doesn’t fit all • Although there are many similarities between subblock allocation and quotas, there are substantial differences which eventually led us to use different mechanisms for each: – Space allocation – managed entities: storage pools, filesets ( inodes); change rate: adding disks, adding filesets; allocation method: map based – Quotas – managed entities: users/groups/filesets, users/groups@filesets; change rate: administrator initiated; allocation method: share based
  • 13. Resource allocation One size doesn’t fit all • Although there are many similarities between subblock allocation and quotas, there are substantial differences which eventually led us to use different mechanisms for each: – Space allocation – managed entities: storage pools, filesets ( inodes); change rate: adding disks, adding filesets; allocation method: map based ( static) – Quotas – managed entities: users/groups/filesets, users/groups@filesets; change rate: administrator initiated; allocation method: share based ( dynamic)
  • 14. Outline ● Relevant Scale Basics ● Allocation types ● Map based ● Share based
  • 15. Allocation map based management • Due to the relatively “static” nature of filesystem space management, it is possible to use the allocation map based approach • One of the main advantages of this approach is that, with proper planning, it is relatively easy to allow each node to manage its own allocation independently • The basic approach is to divide the managed resource into “regions” • When a node needs resources to be allocated, the filesystem manager assigns a region to that node • From then on, the node can use all the resources in that region • The number of regions is affected by the “-n” filesystem/pool creation parameter ( while one can change the -n parameter using mmchfs, it will only apply to pools created after the change)
  • 16. Block allocation map • Divided into a fixed number of n equal size “regions” – Each region contains bits representing blocks on all disks – Different nodes can use different regions and still stripe across all disks ( many more regions than nodes) – Based on the -n parameter at storage pool creation
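The striping property described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Scale internals: the map is split into equal regions, each region holds free bits for blocks on *every* disk, so a node that owns a single region can still allocate round-robin across all disks. All names here (`AllocMap`, `allocate_striped`) are hypothetical.

```python
# Hypothetical sketch of a region-based block allocation map.
class AllocMap:
    def __init__(self, num_disks, blocks_per_disk, num_regions):
        assert blocks_per_disk % num_regions == 0
        self.num_disks = num_disks
        self.per_region = blocks_per_disk // num_regions
        # regions[r][d] is the free-bit list for region r on disk d
        self.regions = [
            [[True] * self.per_region for _ in range(num_disks)]
            for _ in range(num_regions)
        ]

    def allocate_striped(self, region, count):
        """Allocate `count` blocks from one region, round-robin over all disks."""
        out, disk, misses = [], 0, 0
        while len(out) < count:
            if misses >= self.num_disks:
                raise RuntimeError("region exhausted")
            bits = self.regions[region][disk]
            i = next((i for i, free in enumerate(bits) if free), None)
            if i is None:
                misses += 1
            else:
                misses = 0
                bits[i] = False
                out.append((disk, region * self.per_region + i))
            disk = (disk + 1) % self.num_disks
        return out

m = AllocMap(num_disks=3, blocks_per_disk=12, num_regions=4)
# A node that was assigned region 0 allocates 6 blocks without
# talking to anyone else, and they still stripe over disks 0,1,2:
blocks = m.allocate_striped(region=0, count=6)
print([d for d, _ in blocks])   # [0, 1, 2, 0, 1, 2]
```

The point of the sketch: because every region spans all disks, region ownership gives a node autonomy without sacrificing striping.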
  • 17. Block allocation map • When adding more disks: – The alloc map file is extended to provide room for the additional bits – Each allocation region now consists of two separately stored segments
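The segment mechanism can be illustrated with a toy model (the `Region`/`add_segment` names are made up for this sketch): when disks are added, the map file grows and each region gains an additional, separately stored segment holding the bits for the new disks.

```python
# Illustrative sketch: extending an allocation map when disks are added.
class Region:
    def __init__(self, bits_per_disk, num_disks):
        # segment 0: free bits for the original disks
        self.segments = [[[True] * bits_per_disk for _ in range(num_disks)]]

    def add_segment(self, bits_per_disk, num_new_disks):
        # each later disk addition appends a separately stored segment
        self.segments.append([[True] * bits_per_disk for _ in range(num_new_disks)])

regions = [Region(bits_per_disk=4, num_disks=3) for _ in range(8)]
for r in regions:               # expansion: one new disk joins the pool
    r.add_segment(bits_per_disk=4, num_new_disks=1)

print(len(regions[0].segments))                              # 2 segments
print(sum(len(seg) for seg in regions[0].segments))          # 4 disks covered
```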
  • 18. Inode Allocation map • The inode allocation map follows the same principle as the block allocation map • That said, there are differences due to the way inodes are used: – Inodes have different “allocation states” ( free/used/created/deleted) – The use of multiple inode spaces ( independent filesets) might imply non-contiguous allocation, allocation “holes” and other properties that make management a bit more complex – Inode use is usually more “latency sensitive” - thus it requires function shipping methods in order to maintain good performance ( file creates etc.) – Inode expansion might be manual ( preallocating) or automatic ( upon file creation) • Inode space expansion logic is outside the scope of this session
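The four allocation states named above can be modelled as a small state machine. Note the transition rules below are a plausible simplification for illustration only; the actual GPFS state logic is not documented here.

```python
from enum import Enum

class InodeState(Enum):
    FREE = 0
    CREATED = 1   # allocated on disk but not yet holding a live file
    USED = 2
    DELETED = 3   # file removed, inode not yet reclaimed

# Assumed transition rules (illustrative, not the real GPFS rules):
VALID = {
    InodeState.FREE: {InodeState.CREATED},
    InodeState.CREATED: {InodeState.USED},
    InodeState.USED: {InodeState.DELETED},
    InodeState.DELETED: {InodeState.FREE},
}

class InodeMap:
    def __init__(self, n):
        self.states = [InodeState.FREE] * n

    def transition(self, ino, new):
        if new not in VALID[self.states[ino]]:
            raise ValueError(f"bad transition {self.states[ino]} -> {new}")
        self.states[ino] = new

m = InodeMap(8)
m.transition(0, InodeState.CREATED)   # e.g. inode preallocation
m.transition(0, InodeState.USED)      # e.g. file create lands on it
print(m.states[0].name)               # USED
```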
  • 19. Allocation map • So what should I use for -n on file system creation? • Try to estimate the number of nodes that “might” use the filesystem in the future • Don’t just “use a large number” - the overhead of managing many segments/regions might be costly • As long as the number is not “way off”, it should be OK
  • 20. Outline ● Relevant Scale Basics ● Allocation types ● Map based ● Share based
  • 21. Quotas 101 • While the main goal of this session is to explain how quotas work in Scale, we should still define some basic quota related terminology • “A disk quota is a limit set by a system administrator that restricts certain aspects of file system usage on modern operating systems. The function of using disk quotas is to allocate limited disk space in a reasonable way” ( source: Wikipedia) • There are two main types of quota: disk quotas and file quotas. The former limits the maximum size a “quota entity” can use, while the latter defines the number of file system elements a “quota entity” can use ( a “quota entity” might be a user, a group or a fileset) • Other common terms are: – Soft quota: a.k.a. the warning level; when combined with a grace period it implies a real limit – Hard quota: the effective limit. A quota entity can’t use more than that limit – Grace period: a time frame during which the soft quota can be exceeded before it becomes a hard limit
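The soft/hard/grace interplay can be made concrete with a toy model. This is a generic illustration of the semantics defined above, not Spectrum Scale code; the class and method names are invented for the sketch.

```python
import time

# Toy model of soft limit + hard limit + grace period semantics.
class QuotaEntity:
    def __init__(self, soft, hard, grace_seconds):
        self.soft, self.hard, self.grace = soft, hard, grace_seconds
        self.used = 0
        self.grace_start = None          # set when soft limit first exceeded

    def charge(self, amount, now=None):
        now = time.time() if now is None else now
        limit = self.hard
        if self.grace_start is not None and now - self.grace_start > self.grace:
            limit = self.soft            # grace expired: soft limit is enforced
        if self.used + amount > limit:
            raise OSError("quota exceeded")
        self.used += amount
        if self.used > self.soft and self.grace_start is None:
            self.grace_start = now       # warning level crossed: clock starts

q = QuotaEntity(soft=100, hard=150, grace_seconds=3600)
q.charge(120, now=0)      # over the soft limit: the grace clock starts
q.charge(20, now=600)     # still inside grace and under the hard limit: ok
print(q.used)             # 140
```

After the grace period expires, any charge that would keep usage above the soft limit fails, which is exactly the "soft becomes hard" behaviour described in the last bullet.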
  • 22. Spectrum Scale quotas • The dynamic nature of quotas makes it extremely difficult to balance between flexibility and performance impact • Thus, an eventual consistency approach was chosen – using “quota shares” • Basically, the quota manager ( part of the filesystem manager role) assigns shares to requesting clients. A client can use its share until it is revoked ( more details to come) • Since the quota manager doesn’t know how much of a given share is actually being used, a new term was introduced: in_doubt • The in_doubt value represents the shares that are “out there”. It is a factor of the share size and the number of nodes mounting the filesystem • In order to get an exact quota usage, the quota manager needs to contact all mounting nodes in order to revoke the shares they currently hold
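A minimal sketch of the in_doubt bookkeeping described above (illustrative names throughout, not Scale internals): the manager tracks the shares handed out per node; in_doubt is everything "out there" and unreported, and exact usage is only known after a revoke round that touches every mounting node.

```python
# Toy quota manager demonstrating in_doubt accounting.
class QuotaManager:
    def __init__(self, limit):
        self.limit = limit
        self.confirmed_used = 0
        self.outstanding = {}            # node -> unreported share "out there"

    def grant(self, node, share):
        self.outstanding[node] = self.outstanding.get(node, 0) + share

    @property
    def in_doubt(self):
        return sum(self.outstanding.values())

    def revoke_all(self, actual_usage_by_node):
        """Contact every node: fold its real usage into the confirmed count."""
        for node, used in actual_usage_by_node.items():
            self.confirmed_used += used
            self.outstanding.pop(node, None)

mgr = QuotaManager(limit=1000)
mgr.grant("node1", 20)
mgr.grant("node2", 20)
print(mgr.in_doubt)                       # 40 blocks are "out there"
mgr.revoke_all({"node1": 5, "node2": 18}) # the expensive exact-usage path
print(mgr.confirmed_used, mgr.in_doubt)   # 23 0
```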
  • 23. Spectrum Scale quotas ”The share” • Since we don’t want a client to contact the quota server for each request, we use the concept of shares ( block and file shares) • Every time a node wants to allocate a resource, it will contact the server and receive a share of resources • In the past, the share size was static ( 20) but configurable, so one would need to balance between performance impact and functionality ( what if in_doubt is bigger than the quota…) • Nowadays, dynamic shares are used. The share starts at qMinBlockShare/qMinFileShare ( blocks/inodes respectively) and might change based on usage patterns
  • 24. Spectrum Scale quotas ”The share flow” • The quota client node will calculate how many blocks/inodes it should request based on historical data and will send the request to the quota server. It might be extremely aggressive with that… • The quota server will consider the request from a much wider perspective: number of nodes, remaining quota ( minus in_doubt), grace period etc. • The quota server will reply to the client request with the granted resources ( blocks or inodes)
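The request/grant exchange can be sketched as two functions. The sizing heuristics here (doubling the recent rate on the client, dividing headroom by node count on the server) are assumptions for illustration only; the slide states the inputs but not the exact formulas.

```python
# Hypothetical sketch of the share request/grant flow.
def client_request(recent_allocs, min_share=20):
    # Ask aggressively based on historical data, never below the minimum share.
    return max(min_share, 2 * recent_allocs)

def server_grant(requested, hard_limit, used, in_doubt, num_nodes):
    # The server takes the wider view: what is still safely available
    # once usage *and* everything in doubt are subtracted.
    available = hard_limit - used - in_doubt
    if available <= 0:
        return 0
    # Be conservative near the limit: leave headroom for other mounting nodes.
    return min(requested, max(1, available // num_nodes))

req = client_request(recent_allocs=50)      # client asks for 100 blocks
grant = server_grant(req, hard_limit=1000, used=600, in_doubt=300, num_nodes=4)
print(grant)                                # available=100, /4 nodes -> 25
```

Note how the grant shrinks as in_doubt grows: the more unreported shares are out there, the less the server is willing to hand out, which is what eventually forces the revokes described on the next slide.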
  • 25. Spectrum Scale quotas ”The share flow” - under pressure • It is also important to understand what’s going on when we’re getting “close to the limit”. The server always takes in_doubt into account as if it were used – Soft limit exceeded: ● The grace counter kicks in ● The SoftQuotaExceeded callback is executed ( if configured) – Grace period ends: ● Quota exceeded error – Getting close to the hard limit: ● The quota server revokes from all clients for each request…performance impact
  • 26. Spectrum Scale quotas tips and tricks • Sometimes, balancing the impact of quotas might be challenging. For example, imagine an application whose threads are all being “throttled” due to quota revokes, slowing down access to other quota objects which are not “close to the edge” ( who said Ganesha?) • In those cases, it might be better to “fail fast” • There are two major ways to achieve that: – Use soft quotas only, with a short grace period ( 1 sec for example). In this case, shortly after the soft quota is exceeded, the app will get a quota exceeded error. It will use “a little more than the limit” – Use the ( undocumented) qRevokeWatchThreshold parameter. Every time we hit HL - (Usage + inDoubt) < qRevokeWatchThreshold * qMinShare we need to do a revoke. We will add this revoke to the watch list and won’t do another revoke for that object for the next 60 sec. So, if another share request will come in – we