Object storage optimization in Swift
Alexandre LECUYER
DevOps / irc: alecuyer
Romain LE DISEZ
DevOps / irc: rledisez
What’s the problem?
• Performance is bad
• Disks 100% busy
• Replication/reconstruction is very (very) slow
Replica in Swift
/srv/node/<device>/objects/<partition>/<suffix>/<hash>/<timestamp>.data
[Diagram: the object 012345 is stored as full copies]
Erasure Coding in Swift
/srv/node/<device>/objects-1/<partition>/<suffix>/<hash>/<timestamp>#<fragment>#d.data
[Diagram: the object 012345 is split into fragments 03, 14 and 25, plus an extra fragment labelled a9]
Comparison
• Replica:
– Good performance
– High storage overhead
– 3 files per object (3 replicas)
• Erasure coding:
– Cost effective
– Slow-ish
– 15 files per object (12+3 fragments)
Where inodes join the party…
• XFS:
– one inode per file
– one inode per directory
• Inode:
– ctime/mtime/atime
– owner/group
– Permissions
Bad things happen
• One inode takes 300 bytes to 1k of memory
• Average: 2.4 inodes per fragment
– Data file: 1
– Object directory: 1
– Suffix directory + Partition directory: 0.4
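As a rough worked example of what these figures imply, the sketch below estimates the memory needed to keep every inode cached. The object count is an arbitrary illustration, not a number from the talk, and "1k" is taken as 1024 bytes.

```python
# Rough estimate of the memory needed to keep every inode in cache.
# From the slides: 2.4 inodes per fragment, 300 bytes to 1k per inode,
# 15 fragments per object with 12+3 erasure coding.
# Assumption: the number of objects passed in is purely illustrative.

INODES_PER_FRAGMENT = 2.4
FRAGMENTS_PER_OBJECT = 15
INODE_BYTES_LOW, INODE_BYTES_HIGH = 300, 1024


def inode_cache_bytes(num_objects):
    """Return (low, high) byte estimates for caching all inodes."""
    inodes = num_objects * FRAGMENTS_PER_OBJECT * INODES_PER_FRAGMENT
    return inodes * INODE_BYTES_LOW, inodes * INODE_BYTES_HIGH


low, high = inode_cache_bytes(1_000_000)
print(f"{low / 2**30:.1f} GiB to {high / 2**30:.1f} GiB per million objects")
```

The estimate is spread over all the servers holding fragments, but it grows linearly with the number of objects, which is why lots of small files hurt.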
Memory issues
• Inodes cannot fit in cache anymore
– But every inode of the path must be checked to open a data file
• Only top-level directories are cached
– Inode cache hit rate is only 20%
– Up to 50% of device activity is spent reading inodes
Stability issues
• More filesystem corruption
• Inability to run xfs_repair
– It needs 1K of memory per inode
• Need dedicated servers just to repair filesystems
– About 48 hours to repair one filesystem
Let’s fix it!
(a.k.a. inodes are useless, right?)
We tried crazy things
• Storing objects in a K/V store (RocksDB, LevelDB, …)
– Not suited to synchronous IO. Write amplification.
• Storing the file handle of data files in a K/V store
– Atomicity is a problem across two separate data structures
• Patching XFS to drop useless information
– It’s already well optimized, inodes may be compressed
• Storing in ZFS DMU
– Lots of very cool features, but performance issues when full, and low-level development
Store multiple objects in large files
[Diagram: a volume is one large file made of a volume header followed by (object header, object data) pairs]
• Dedicated to a partition
• No concurrent writes
• Append only
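To make the volume idea concrete, here is a minimal in-memory sketch of an append-only container. The header fields, their sizes and the magic value are hypothetical choices for illustration; the slides do not describe the real on-disk format.

```python
import struct

# Hypothetical layout, for illustration only:
#   volume header: magic (4 bytes) + partition number (uint32)
#   object header: name length (uint16) + data length (uint32)
VOLUME_HEADER = struct.Struct(">4sI")
OBJECT_HEADER = struct.Struct(">HI")
MAGIC = b"VOL1"


def new_volume(partition):
    """Return the bytes of an empty volume dedicated to one partition."""
    return bytearray(VOLUME_HEADER.pack(MAGIC, partition))


def append_object(volume, name, data):
    """Append header + name + data; return the offset of the new record."""
    offset = len(volume)
    encoded = name.encode()
    volume += OBJECT_HEADER.pack(len(encoded), len(data)) + encoded + data
    return offset


def read_object(volume, offset):
    """Read back (name, data) for the record starting at offset."""
    name_len, data_len = OBJECT_HEADER.unpack_from(volume, offset)
    start = offset + OBJECT_HEADER.size
    name = bytes(volume[start:start + name_len]).decode()
    data = bytes(volume[start + name_len:start + name_len + data_len])
    return name, data
```

Because records are only ever appended, the offset returned at write time stays valid for the life of the record, which is what lets an external index point into the file.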
Swift request path
[Diagram: PUT / GET requests reach the proxy servers, which talk to the object servers]
How does Swift organize data?
• PUT: "photo.jpg" -> MD5 hash: bc6a624f493bf3042662064285f355c4
• Partition: bc6a -> 48234
• Suffix: 5c4
• Timestamp: 1449519086.42102.data
• /srv/node/sda/objects/48234/5c4/bc6a624f493bf3042662064285f355c4/1449519086.42102.data
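The mapping from hash to path can be sketched as follows. It assumes the partition is the first 16 bits of the hash, which matches the slide (0xbc6a is 48234). The function name and the `part_power` parameter are illustrative. How the hash itself is derived from the object name is left out.

```python
# Derive the on-disk path from an object hash, following the slide's example.
# Assumption: the partition is the top `part_power` bits of the hash
# (16 here, since 0xbc6a == 48234); the suffix is the last 3 hex digits.

def object_path(device, obj_hash, filename, part_power=16):
    """Build /srv/node/<device>/objects/<partition>/<suffix>/<hash>/<file>."""
    partition = int(obj_hash[:8], 16) >> (32 - part_power)
    suffix = obj_hash[-3:]
    return f"/srv/node/{device}/objects/{partition}/{suffix}/{obj_hash}/{filename}"


print(object_path("sda", "bc6a624f493bf3042662064285f355c4",
                  "1449519086.42102.data"))
```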
Example: writing an object
[Diagram: a PUT goes from the proxy server to the object server, which works with the index server and the volumes]
• Obtain a write lock on a volume (fcntl)
• Write the object at the end of the volume
• Register the object
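The three write steps could look roughly like this. It is a sketch under assumptions: a plain dict stands in for the index server, the record is the raw object bytes with no header, and the function name is made up.

```python
import fcntl
import os

# Assumptions: `index` is a plain dict standing in for the index server,
# and the record is just the raw object bytes (no header).


def put_object(volume_path, index, key, data):
    """Append data to the volume under an exclusive lock, then register it."""
    fd = os.open(volume_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)           # fcntl-based write lock
        offset = os.lseek(fd, 0, os.SEEK_END)    # end of the volume
        os.write(fd, data)
        os.fdatasync(fd)                         # data is on disk before registering
        index[key] = (volume_path, offset, len(data))
    finally:
        os.close(fd)                             # closing also releases the lock
    return offset
```

Taking the lock before seeking to the end is what keeps two writers from appending at the same offset.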
Example: reading an object
[Diagram: a GET goes from the proxy server to the object server, which works with the index server and the volumes]
• Get the object location from the index server
• Open the volume
• Read the object at the given offset
Index server
• Stores data in a key/value store: LevelDB
• Communication with gRPC
• Key: hash + filename
• Value: volume index + offset
• Keys are sorted on disk for efficient seeks
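A possible encoding of such entries is sketched below. The key follows the slides (hash and file name concatenated). The value packing, a 32-bit volume index plus a 64-bit offset, is a guess made for illustration, as are the function names.

```python
import struct

# Assumption: the value is packed as uint32 volume index + uint64 offset.
# The real index server may use a different encoding.
VALUE = struct.Struct(">IQ")


def make_key(obj_hash, filename):
    """Key = object hash immediately followed by the file name."""
    return (obj_hash + filename).encode()


def make_value(volume_index, offset):
    """Pack the location of the object inside its volume."""
    return VALUE.pack(volume_index, offset)


def parse_value(raw):
    """Return (volume_index, offset) from a packed value."""
    return VALUE.unpack(raw)
```

Putting the hash first means all files of one object, and all objects of one partition, end up next to each other in a sorted store.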
Index server – keys example
• ……
• bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data
• bc6a624f493bf3042662064285f355c41449519086.42102.data
• bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data
• ……
What about directories?
• bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data
• bc6a624f493bf3042662064285f355c41449519086.42102.data
• bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data
[Diagram: the directory tree rebuilt from the keys]
48234/
    9e3/bc6a46b.../1475194591.74265.data
    5c4/bc6a624.../1449519086.42102.data
48235/
    7d1/bc6b78b…/1415965115.56792.data
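Rebuilding a directory listing from sorted keys can be sketched with a prefix scan. Here a sorted Python list stands in for LevelDB, and the partition is assumed to be the first 16 bits of the hash, as in the slide's example. The function names are illustrative.

```python
import bisect

# Assumptions: a sorted list stands in for the key-value store, keys are
# a 32-character hash followed by the file name, and the partition is the
# first 4 hex digits of the hash.
KEYS = sorted([
    "bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data",
    "bc6a624f493bf3042662064285f355c41449519086.42102.data",
    "bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data",
])


def keys_with_prefix(keys, prefix):
    """Return the contiguous run of sorted keys starting with prefix."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")
    return keys[lo:hi]


def list_partition(keys, partition):
    """Return {suffix: {hash: [file names]}} for one partition."""
    tree = {}
    for key in keys_with_prefix(keys, format(partition, "04x")):
        obj_hash, filename = key[:32], key[32:]
        tree.setdefault(obj_hash[-3:], {}).setdefault(obj_hash, []).append(filename)
    return tree


print(list_partition(KEYS, 48234))
```

Since the keys are sorted, one seek plus a sequential read replaces walking partition, suffix and hash directories.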
Deletion - Hole punching
[Figure: sparse file illustration, https://en.wikipedia.org/wiki/Sparse_file#/media/File:Sparse_file_(en).svg]
Deletion
• Hole-punching with fallocate()
• Reclaim space without changing the file size!
[Diagram: in the volume layout, the range of the deleted object is marked "Space reclaimed by the filesystem"]
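Python's standard library has no direct wrapper for hole punching, so this sketch calls the Linux `fallocate()` function through `ctypes`. It assumes 64-bit Linux and a filesystem that supports `FALLOC_FL_PUNCH_HOLE`. The helper name `punch_hole` is made up.

```python
import ctypes
import os

# Assumption: 64-bit Linux, where libc exposes
# fallocate(int fd, int mode, off_t offset, off_t len) with a 64-bit off_t.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def punch_hole(fd, offset, length):
    """Deallocate [offset, offset + length) without changing the file size."""
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if _libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

After a successful call the punched range reads back as zeros, and the offsets of the objects that follow it are unchanged, so the index stays valid.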
Implementation overview
• Swift code, patched: diskfile.py
• vfile.py module
• gRPC
• Index server, with LevelDB as the backing key-value store
vfile.py
• Provides a file-like interface
• f = vfile.open("/path/to/file")
• f.read()
• vfile.listdir("/srv/node/<disk>/<partition>/")
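To illustrate what a file-like view over a volume might involve, here is a hypothetical read-only wrapper. The class name, its constructor arguments and its behaviour are assumptions for this sketch and are not taken from the actual vfile.py.

```python
import io
import os

# Hypothetical stand-in for the vfile idea: the (volume_path, offset,
# length) triple is what an index lookup would return.


class VolumeSlice(io.RawIOBase):
    """Read-only file-like view of one object stored inside a volume."""

    def __init__(self, volume_path, offset, length):
        self._fd = os.open(volume_path, os.O_RDONLY)
        self._start, self._length, self._pos = offset, length, 0

    def readable(self):
        return True

    def readinto(self, buf):
        # Never read past the end of this object's slice of the volume.
        want = min(len(buf), self._length - self._pos)
        chunk = os.pread(self._fd, want, self._start + self._pos)
        buf[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)

    def close(self):
        if not self.closed:
            os.close(self._fd)
        super().close()
```

Callers can then use `read()` as they would on a regular file object, which keeps the change to the calling code small.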
Managing fragmentation
Dedicated volumes for short-lived files
[Diagram: separate groups of volumes for ".data" files and for ".ts" files]
Write performance
• We cannot afford two synchronous writes
• The large file write is synchronous (fdatasync)
• The large file is preallocated
• K/V writes are asynchronous
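The split between one synchronous data write and an asynchronous index update can be sketched like this. A dict and an in-process queue stand in for the key-value store and its writer, and the caller tracks the write offset because preallocation with `os.posix_fallocate` changes the file size. All names are illustrative.

```python
import os
import queue
import threading

# Assumptions: `index` stands in for the key-value store and `pending`
# for whatever mechanism defers the index writes.
index = {}
pending = queue.Queue()


def open_volume(path, size):
    """Create a volume and preallocate its space up front."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    os.posix_fallocate(fd, 0, size)
    return fd


def index_worker():
    """Apply queued index updates in the background until told to stop."""
    while True:
        item = pending.get()
        if item is not None:
            key, location = item
            index[key] = location
        pending.task_done()
        if item is None:
            break


def write_object(fd, offset, key, data):
    """Synchronous data write, asynchronous index update."""
    os.pwrite(fd, data, offset)
    os.fdatasync(fd)                          # the only synchronous write
    pending.put((key, (offset, len(data))))   # index update is deferred
    return offset + len(data)                 # next free offset
```

If the process dies before the queue is drained, the data is already on disk, which is what makes a later recovery scan of the volume possible.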
Recovery
• Scan the volumes backwards
• Add missing information to the key-value store
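One way to make a backward scan possible is to end every record with its own length. The record format below is hypothetical, and stopping at the first record that is already indexed is an assumption about why the scan runs backwards. Neither is confirmed by the slides.

```python
import struct

# Hypothetical record format, chosen so a volume can be walked backwards:
#   [key length: uint16][data length: uint32][key][data][record length: uint32]
HEADER = struct.Struct(">HI")
FOOTER = struct.Struct(">I")


def encode_record(key, data):
    """Serialize one record, ending with its total length."""
    body = HEADER.pack(len(key), len(data)) + key + data
    return body + FOOTER.pack(len(body) + FOOTER.size)


def recover(volume, index):
    """Walk records from the end; stop at the first one already indexed."""
    end, added = len(volume), 0
    while end > 0:
        (record_len,) = FOOTER.unpack_from(volume, end - FOOTER.size)
        start = end - record_len
        key_len, _ = HEADER.unpack_from(volume, start)
        key = bytes(volume[start + HEADER.size:start + HEADER.size + key_len])
        if key in index:
            break                  # assumed: everything older is registered
        index[key] = start         # add the missing entry
        added += 1
        end = start
    return added
```

Since index writes lag only slightly behind data writes, such a scan would normally touch just the last few records of each volume.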
How does it perform?
• K/V space per object: 42 bytes
• Latency: slightly worse when empty, much better when full
• REPLICATE: served from memory
• Saved space
• Room for improvement
Benchmarks
• PUT, single thread
– XFS: 17/s
– Volumes: 40/s
• PUT, 20 threads
– XFS: 4.7 s (99th percentile)
– Volumes: 615 ms (99th percentile)
• GET
– XFS: 39/s
– Volumes: 93/s
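For a quick sense of scale, the ratios implied by these figures can be computed directly. The reading of the 20-thread numbers as 99th-percentile latencies is an assumption.

```python
# Ratios computed from the figures on the slide.
put_single = 40 / 17        # Volumes vs XFS, PUT throughput, single thread
get = 93 / 39               # Volumes vs XFS, GET throughput
put_20_latency = 4.7 / 0.615  # XFS vs Volumes, PUT latency with 20 threads

print(f"PUT: {put_single:.2f}x, GET: {get:.2f}x, "
      f"PUT latency (20 threads): {put_20_latency:.1f}x lower")
```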
What’s next
• Upstream
• Store short-lived objects in dedicated volumes
• Replication of volumes
• Choose replica/erasure-coding on the fly
Credits
• Haystack (Facebook project)
• OpenStack Swift community
Thank you
Metadata storage
• (extra slide if time)
• Previously stored as extended attributes
• Now serialized with protobuf and stored in the volume

Openstack Swift - Lots of small files

  • 1. Object storage optimization in Swift Alexandre LECUYER DevOps / irc: alecuyer Romain LE DISEZ DevOps / irc: rledisez
  • 2. What’s the problem? • Performance is bad • Disks 100% busy • Replication/reconstruction is very (very) slow 2
  • 4. Erasure Coding in Swift 4 012345 03 14 25 /srv/node/<device>/objects-1/<partition>/<suffix>/<hash>/<timestamp>#<fragment>#d.data a9
  • 5. Comparison • Replica: – Performance – Overhead – 3 files per object (3 replicas) • Erasure coding – Cost effective – Slow-ish – 15 files per object (12+3 fragments) 5
  • 6. Where inodes join the party… • XFS: – one inode per file – one inode per directory • Inode: – ctime/mtime/atime – owner/group – Permissions 6
  • 7. Bad things happen • One inode takes 300 bytes to 1k of memory • Average: 2.4 inodes per fragment – Data file: 1 – Object directory: 1 – Suffix directory + Partition directory: 0.4 7
  • 8. Memory issues • Inodes cannot fit in cache anymore – But every inode of the path must be checked to open a data file • Only top level directories are cached – Only 20% of hit on inode cache – Up to 50% of devices activity to read inodes 8
  • 9. Stability issues • More filesystem corruptions • Inability to run xfs_repair – 1K of memory per inode • Need a dedicated servers just to repair filesystems – About 48 hours to repair one filesystem 9
  • 10. Let’s fix it! (a.k.a. inodes are useless, right?) 10
  • 11. We tried crazy things • Storing objects in a K/V (RocksDB, LevelDB, …) – Not suited to synchronous IO. Write amplification. • Storing in a K/V the file handle of datafiles – Atomicity on two separate data structures • Patching XFS to drop useless information – It’s already well optimized, inodes may be compressed • Storing in ZFS DMU – Lots of very cool features, but performance issues if full, low level development 11
  • 12. 12 Object Header Volume Header Object Data Object Header Object Data Store multiple objects in large files
  • 13. 13 Object Header Volume Header Object Data Object Header Object Data Dedicated to a partition No concurrent writes Append only
  • 14. Swift request path 14 Proxy server Proxy server Object server Object server Object server PUT / GET requests
  • 15. How does Swift organize data ? • PUT: « photo.jpg » -> MD5 hash: bc6a624f493bf3042662064285f355c4 • Partition : bc6a -> 48234 • Suffix : 5c4 • Timestamp : 1449519086.42102.data • /srv/node/sda/objects/48234/5c4/bc6a624f493b f3042662064285f355c4/1449519086.42102.data 15
  • 16. Example : writing an object 16 Proxy server Object server Index server Volume Volume Volume Obtain a write lock on a volume (fcntl) Write the object at the end of the volume Register the objectPUT
  • 17. Example : reading an object 17 Proxy server Object server Index server Volume Volume Volume Open the volume Read the object at the given offset Get object locationGET
  • 18. Index server • Stores data in a key/value store : LevelDB • Communication with gRPC • Key : hash + filename • Value : volume index + offset • Keys are sorted on-disk for efficient seeks 18
  • 19. Index server – keys example • …… • bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data • bc6a624f493bf3042662064285f355c41449519086.42102.data • bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data • …… 19
  • 20. What about directories ? 20 • bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data • bc6a624f493bf3042662064285f355c41449519086.42102.data • bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data 48234 48235 9e3 5c4 7d1 bc6a46b... 1475194591.74265.data bc6a624... bc6b78b… 1449519086.42102.data 1415965115.56792.data
  • 21. Deletion - Hole punching [figure: sparse file illustration, https://en.wikipedia.org/wiki/Sparse_file#/media/File:Sparse_file_(en).svg]
  • 22. Deletion • Hole-punching with fallocate() • Reclaim space without changing the file size! [diagram: volume layout in which the punched object is marked "Space reclaimed by the filesystem"]
  • 23. Implementation overview • Swift code, patched: diskfile.py • vfile.py module • gRPC • Index server, with LevelDB as the backing key-value store
  • 24. vfile.py • Provides a file-like interface • f = vfile.open("/path/to/file") • f.read() • vfile.listdir("/srv/node/<disk>/<partition>/")
  • 25. Managing fragmentation • Dedicated volumes for short-lived files [diagram: one group of volumes for ".data" files, another for ".ts" files]
  • 26. Write performance • We cannot afford two synchronous writes • The large file write is synchronous (fdatasync) • The large file is preallocated • K/V writes are asynchronous
  • 27. Recovery • Scan the volumes backwards • Add missing information to the key/value store
  • 28. How does it perform? • Bytes per object in the K/V store: 42 • Latency: slightly worse when empty, much better when full • REPLICATE: served from memory • Saved space • Room for improvement
  • 29. Benchmarks • PUT single thread – XFS: 17/s – Volumes: 40/s • PUT 20 threads – XFS: 4.7 s (99%) – Volumes: 615 ms (99%) • GET – XFS: 39/s – Volumes: 93/s
  • 30. What’s next • Upstream • Store short-lived objects in dedicated volumes • Replication of volumes • Choose replica/erasure-coding on the fly
  • 31. Credits • Haystack (Facebook project) • OpenStack Swift community
  • 33. Metadata storage • (extra slide if time) • Previously stored as extended attributes • Now serialized with protobuf and stored in the volume
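
The path layout on slide 15 can be sketched in a few lines of Python. This is an illustration only, not Swift's actual code. It assumes a partition power of 16 (which is what the slide's `bc6a -> 48234` mapping implies) and takes the MD5 hash as given instead of deriving it from the object name. The function name `object_path` is hypothetical.

```python
import struct


def object_path(device, obj_hash, timestamp, part_power=16):
    """Build the on-disk path of a replica from its MD5 hash (hex string)."""
    # The partition is the top `part_power` bits of the hash
    # (assumption: part_power=16, as implied by the slide).
    top32 = struct.unpack(">I", bytes.fromhex(obj_hash[:8]))[0]
    partition = top32 >> (32 - part_power)
    # The suffix is the last three hex characters of the hash.
    suffix = obj_hash[-3:]
    return "/srv/node/{}/objects/{}/{}/{}/{}.data".format(
        device, partition, suffix, obj_hash, timestamp)


path = object_path("sda", "bc6a624f493bf3042662064285f355c4",
                   "1449519086.42102")
print(path)
```

With the values from the slide, the partition comes out as 48234 and the suffix as 5c4, so the printed path matches the one shown on slide 15.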
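
Slides 18 to 20 explain that directories no longer exist on disk and are rebuilt from the sorted keys of the index. The sketch below illustrates that idea with a sorted Python list standing in for LevelDB. It is not the real index server: the helper names are hypothetical, and it assumes a partition power of 16 so that a partition maps to a four-character hexadecimal prefix.

```python
import bisect

HASH_LEN = 32  # length of an MD5 hash in hex characters

# Keys taken from the slides: object hash followed by the file name.
KEYS = sorted([
    "bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data",
    "bc6a624f493bf3042662064285f355c41449519086.42102.data",
    "bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data",
])


def partition_keys(keys, partition, part_power=16):
    """Return the keys of one partition by seeking to its hash prefix."""
    # Simplification: with a partition power that is a multiple of 4,
    # a partition is a fixed-width hexadecimal prefix of the hash.
    prefix = format(partition, "0{}x".format(part_power // 4))
    start = bisect.bisect_left(keys, prefix)  # the "seek"
    found = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break  # reached the next partition
        found.append(key)
    return found


def list_suffixes(keys, partition):
    """Emulate listdir() on a partition directory."""
    return sorted({key[HASH_LEN - 3:HASH_LEN]
                   for key in partition_keys(keys, partition)})


def list_files(keys, partition, suffix):
    """Emulate a walk below a suffix directory: (hash, file name) pairs."""
    return [(key[:HASH_LEN], key[HASH_LEN:])
            for key in partition_keys(keys, partition)
            if key[HASH_LEN - 3:HASH_LEN] == suffix]


print(list_suffixes(KEYS, 48234))
print(list_files(KEYS, 48234, "5c4"))
```

The suffix has to be extracted from the end of each hash, so listing a suffix means scanning the whole partition. This is the CPU-for-memory trade mentioned in the speaker notes.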
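
The write and read paths of slides 16, 17 and 26 can be sketched as follows. Everything about the on-disk format here is an assumption made for illustration: the object header is reduced to a 4-byte length, there is no volume header, a plain dict stands in for the index server, and the keys are placeholders. It relies on `fcntl` and `os.fdatasync`, so it targets Linux.

```python
import fcntl
import os
import struct
import tempfile

HEADER = struct.Struct(">I")  # hypothetical object header: data length only


def put_object(index, volume_path, volume_id, key, data):
    """Append an object to a volume, then register it in the index."""
    with open(volume_path, "ab") as volume:
        fcntl.lockf(volume, fcntl.LOCK_EX)   # write lock on the volume
        try:
            offset = volume.seek(0, os.SEEK_END)
            volume.write(HEADER.pack(len(data)) + data)
            volume.flush()
            os.fdatasync(volume.fileno())    # the only synchronous write
        finally:
            fcntl.lockf(volume, fcntl.LOCK_UN)
    index[key] = (volume_id, offset)         # stands in for the K/V write
    return offset


def get_object(index, volume_paths, key):
    """Look up the object location, then read it at the given offset."""
    volume_id, offset = index[key]
    with open(volume_paths[volume_id], "rb") as volume:
        volume.seek(offset)
        (length,) = HEADER.unpack(volume.read(HEADER.size))
        return volume.read(length)


with tempfile.TemporaryDirectory() as tmp:
    volumes = {0: os.path.join(tmp, "volume-0")}
    index = {}
    first_offset = put_object(index, volumes[0], 0, "key-a", b"hello")
    second_offset = put_object(index, volumes[0], 0, "key-b", b"world!")
    read_back = [get_object(index, volumes, k) for k in ("key-a", "key-b")]

print(first_offset, second_offset, read_back)
```

Because the index entry is written after the data is safely on disk, a crash can leave an object in a volume without an index entry. That is the case the recovery step on slide 27 handles by scanning the volumes.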

Editor's Notes

  1. I am going to talk about optimization work carried out on OpenStack Swift. OVH operates several Swift clusters, known commercially as Hubic and PCS. Our customers tend to store huge numbers of small files on these infrastructures, especially on Hubic. Look at the audience (laptop between me and the audience). Don’t repeat too much (replica / EC). Explain vfile = file on the implementation slide. Discuss afterwards at the booth.
  2. This is really the case on Hubic. There is no problem on PCS, because there are more spindles.
  3. I’m going to quickly recall some differences between replica and erasure coding in Swift. In a replica policy, each object is written several times, on different devices. The usual replication factor is 3, but this is configurable. The durability of the object depends on the replication factor. In this example, each object is written 3 times, which means that even if you lose 2 replicas, the object is still available. It is also a good way to increase download bandwidth by distributing the requests over the devices. The drawback of replication is the overhead: each byte is written N times. In this example, 6 bytes from the user become 18 bytes on the cluster. Each replica of an object is stored in a file; you can see the path on top. The important parts are the hash, which is computed from the URL of the object, and the partition and suffix, which are extracted from the hash. The timestamp is the date of the upload of the object; it is set by the cluster during the upload, and the user can’t set it. It is essential in the « eventual consistency » model of Swift: in case of an incident, by comparing the different timestamps of a single object, Swift can decide which one is the good one (the latest, actually).
  4. Erasure Coding is a bit different. I’m not going to go through all the theoretical explanation, with Reed-Solomon and so on; there is a good introduction in the Swift documentation. Each object is split into N fragments, and M parity fragments are added to provide redundancy, and therefore durability. In this example, the cluster is configured with 3 fragments of data and 1 fragment of parity. It means that if I lose 1 device, my object is still accessible. All the computation for fragmenting and calculating parity is done on the Swift proxies. The major interest of erasure coding is that you can balance overhead and durability in your cluster. In this example, the overhead is about 1.3 (4 fragments stored for 3 fragments of data), but durability is not that good (with 2 devices down, the object is unavailable). If you choose 10 fragments of data and 2 fragments of parity, you get the same level of durability as 3 replicas, but with an overhead of only 1.2. (Well, durability is not that simple, because the more devices, the more risk; it’s statistics, but I’m simplifying.) Compared to replica, you can’t scale the downloads: each fragment must be accessed to rebuild the object. Also, you have to anticipate the CPU consumption on the proxies. To summarize, you can think of replication as RAID-1, while Erasure Coding is like RAID-5 or RAID-6, but with more configuration possibilities. Looking at the path of the file, there is a new piece of information: the fragment number. As each fragment is unique, they must be accessed in the correct order to rebuild the object.
  5. It was even 30 files per object at the beginning, because of the durable file. Thankfully, that has been dropped since then. There is a factor of 5 in the number of files (15 versus 3), so the problem is most acute for erasure coding.
  6. 40M (to confirm?) inodes per device, 36 devices per server, for 64 GB of RAM => it would require 700+ GB of RAM to have everything in cache. Bad choice at first: too many partitions per device. Reducing the number of partitions would tend towards 2 inodes per fragment (a 17% improvement).
  7. K/V is not suited at all to synchronous IO, which is required before the proxy replies that the object is actually safe on disk. Explain write amplification. Persistent file handle: open a file without having to walk through all the inodes in the path. So what’s the solution? Too many inodes means we have too many files. Let’s have fewer files!
  8. Limiting inodes means limiting the number of files. Obvious! We call them « volumes ». What are their characteristics?
  9. Three important characteristics. Dedicated to a partition: not one large volume the size of the disk! Make each volume dedicated to a partition. It makes it easier to move a partition to another node (ring change). Append-only: we only append new objects at the end of the file. Nothing is ever overwritten. We don’t want to write a space allocator. No concurrent writes to a volume: we must still support concurrent writes to the same partition, so we create multiple volumes. Now, we need a way to locate the objects we write in those large files. Let’s take a step back first.
  10. Very simplified overview, for a replica configuration, not discussing authentication, the container server, etc. An object server may have multiple disks with multiple object server processes. Explain PUT, GET (one server only). The request arrives on one proxy server, which contacts specific object servers based on the ring. I won’t go into details about that; the point is that we are modifying the object server code only, nothing above. We are at the bottom of the stack. The problem we described is on the object server. This is where we are working, so let’s zoom in.
  11. Explain consistent hashing. We calculate an MD5 hash from the object name. Then the partition is extracted from the hash, given the cluster configuration. The ring tells us which object servers will store a partition. The suffix is used to limit the number of entries in a directory (XFS developers are unhappy about that). Timestamp: used to manage versions, for example when the user uploads a new version of photo.jpg. Now, let’s see in practice how this works with the new system.
  12. Take care to explain the request again: the object server receives something like PUT toto.jpg. Will calculate the object hash, and then PUT that to the object server.
  13. Explain the GET. Now let’s zoom in on the index server.
  14. A few details on the index server. It is written in Go. There is one instance per disk: 1 database + 1 process.
  15. Explain key and value. We are now able to find our files. What about directories? Files are stored below multiple directories: partition, suffix. These are necessary for the cluster (replicator, reconstructor).
  16. Give examples of the operations that happen per partition (placement through the ring configuration) and per suffix (replication). Explain the partition power and its relation to the partition. Explain how we seek to the prefix and scan until the next partition number. For suffixes, just take the end of the name. We trade CPU for memory. OK, we can write, read, and listdir. What about deletion?
  17. Explain the hole punching mechanism. Reclaim space without changing the file size. The extent count will increase.
  18. Explain the hole punching mechanism. Reclaim space without changing the file size.
  19. Explain the flow. One Go process and database per disk: avoids hanging or slowing down everyone if a disk is being slow. I left out a few details.
  20. Explain the flow. One Go process and database per disk: avoids hanging or slowing down everyone if a disk is being slow. I left out a few details.
  21. Hole punching is great, but there is still a small cost: more extents in the file. Tombstone volumes can be closed and deleted once all their files have been deleted. Also planned for files with an X-Delete-At header. Not a problem until you have lots of extents. Not expected to be needed often.
  22. Explain why we can’t sync the K/V. Describe the recovery procedure in case we crashed.
  23. Explain why we can’t sync the K/V. Describe the recovery procedure in case we crashed.
  24. For 10 million files: 400 MB, versus 3 to 8 GB with inodes. Explain REPLICATE (non-intuitive name). Improvement: smaller keys.
  25. Better performance expected now (fdatasync)
  26. Add hybrid access
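
Notes 17 and 21 refer to hole punching. Python's standard library does not expose the punch-hole flags, so the sketch below calls the libc `fallocate()` through `ctypes`. It is a minimal illustration that assumes 64-bit Linux and a filesystem that supports hole punching. The helper `punch_hole` is hypothetical, and it returns False instead of raising when the call is unavailable or refused.

```python
import ctypes
import os
import tempfile

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def punch_hole(fd, offset, length):
    """Deallocate a byte range of an open file without changing its size.

    Returns False when libc or the filesystem does not support it.
    """
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        fallocate = libc.fallocate
    except (OSError, AttributeError, TypeError):
        return False
    # Assumption: off_t is 64 bits wide (64-bit Linux).
    fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                          ctypes.c_int64, ctypes.c_int64]
    fallocate.restype = ctypes.c_int
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    return fallocate(fd, mode, offset, length) == 0


with tempfile.NamedTemporaryFile() as f:
    # Two fake 8 KiB "objects" stored back to back in one file.
    f.write(b"A" * 8192 + b"B" * 8192)
    f.flush()
    punched = punch_hole(f.fileno(), 0, 8192)  # "delete" the first object
    size_after = os.fstat(f.fileno()).st_size
    f.seek(0)
    head = f.read(8192)
    tail = f.read(8192)

print(punched, size_after)
```

When the call succeeds, the punched range reads back as zeros while the file size and the offsets of the remaining objects are unchanged, which is why the index entries of other objects stay valid.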
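
The figures quoted in notes 4, 6 and 24 can be checked with a little arithmetic. The sketch below only uses numbers that appear in the slides and notes (300 bytes to 1 KB per inode, 40M inodes per device, 36 devices per server, 42 bytes per object in the K/V store). The function name is hypothetical and decimal units are assumed.

```python
def ec_overhead(data_fragments, parity_fragments):
    """Bytes stored on the cluster per byte uploaded, for erasure coding."""
    return (data_fragments + parity_fragments) / data_fragments


GB = 10 ** 9
MB = 10 ** 6

# Storage overhead (note 4): 3 replicas cost 3x, erasure coding much less.
overhead_3_1 = ec_overhead(3, 1)
overhead_10_2 = ec_overhead(10, 2)

# Inode cache needed per server (note 6), for both ends of the
# 300 bytes to 1 KB per inode range given on the slides.
inodes_per_server = 40_000_000 * 36
inode_cache_low_gb = inodes_per_server * 300 / GB
inode_cache_high_gb = inodes_per_server * 1000 / GB

# Index size for 10 million files at 42 bytes per object (note 24).
kv_index_mb = 10_000_000 * 42 / MB

# Going from 2.4 to 2 inodes per fragment (note 6).
inode_saving = (2.4 - 2) / 2.4

print(round(overhead_3_1, 2), overhead_10_2)
print(inode_cache_low_gb, inode_cache_high_gb)
print(kv_index_mb, round(inode_saving * 100))
```

The 700+ GB quoted in note 6 falls inside the computed range, and 420 MB for the index is consistent with the "400 MB" in note 24.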