SlideShare a Scribd company logo
Issue
● 10-20 Millions object per devices
– 50 millions inodes per devices
● 36 devices per server
● 64 GB of RAM
– 1 inode is 1KB in RAM
– Would need 1.75TB of RAM for caching all inodes
● 75 % cache miss on inodes
– Up to 50 % of IO to get inodes from device
– (replicator/reconstructor constantly scan device...)
Solution
● Get rid of inodes
● Haystack-like solution
– Objects in volumes (a.k.a. big files, 5GB or 10GB)
– K/V store to map object to (volume id, position)
● K/V is an gRPC service
● Backed by LevelDB (for now...)
● Need to avoid compaction issue
– fallocate(PUNCH_HOLE)
– Smart selection of volumes
Benefits
● 42 bytes per object in K/V
– Compared to 1KB for an XFS inode
– Fit in memory (20GB vs 1.75TB)
– Should easily go down to 30 bytes per object
● Listdir happens in K/V (so in memory)
● Space efficiency vs Block aligned (!)
● Flat namespace for objects
– No part/sfx/ohash
– Increasing part power is just a ring thing
Adding an object
1.Select a volume
2.Append objet data
1.Object header (magic string, ohash, size, …)
2.Object metadata
3.Object data
3.fdatasync() volume
4.Insert new entry in K/V (no transaction)
● <o><policy><ohash><filename> => <volume id><offset>
=> If crash, the volume act as a journal to replay
Removing an object
1.Select a volume
2.Insert a tombstone
3.fdatasync() volume
4.Insert tombstone in K/V
5.Run cleanup_ondisk_files()
1.Punch_hole the object
2.Remove the old entry from K/V
Volume selection
● Avoid holes in volumes to reduce compaction
– Try to group objects by partition
● => rebalance is compaction
– Put short life objects in dedicated volumes
● tombstone
● x-delete-at soon
– Dedicated volumes for handoff?
Benchmarks
● Atom C2750 2.40Ghz
● 16GB RAM
● HGST HUS726040ALA610 (4TB)
● Directly connecting to objet servers
Benchmarks
● Single threaded PUT (100 bytes objects)
– From 0 to 4 millions objects
● XFS : 19.8/s
● Volumes : 26.2/s
– From 4 millions to 8 millions objects
● XFS : 17/s
● Volumes : 39.2/s (b/c of not creating more volumes?)
● What we see (need numbers!)
– XFS : memory is full ; Volumes : memory is free
– Disks is more busy with XFS
Benchmarks
● Single threaded random GET
– XFS : 39/s
– Volumes : 93/s
Benchmarks
● Concurrent PUT, 20 threads for 10 minutes
avg 50% 95% 99% max
XFS 641ms 67ms 3.5s 4.7s 5.9s
Volumes 82ms 50ms 261ms 615ms 1.24s
Status
● Done
– HEAD/GET/PUT/DELETE/POST (replica)
● Todo
– REPLICATE/SSYNC
– Erasure Code
– XFS read compatibility
– Smarter volumes selection
– Func tests on object servers (is there any?)
– Doc

More Related Content

What's hot

Fedora Virtualization Day: Linux Containers & CRIU
Fedora Virtualization Day: Linux Containers & CRIUFedora Virtualization Day: Linux Containers & CRIU
Fedora Virtualization Day: Linux Containers & CRIUAndrey Vagin
 
“Show Me the Garbage!”, Understanding Garbage Collection
“Show Me the Garbage!”, Understanding Garbage Collection“Show Me the Garbage!”, Understanding Garbage Collection
“Show Me the Garbage!”, Understanding Garbage Collection
Haim Yadid
 
Mongodb meetup
Mongodb meetupMongodb meetup
Mongodb meetup
Eytan Daniyalzade
 
C* Summit 2013: Time-Series Metrics with Cassandra by Mike Heffner
C* Summit 2013: Time-Series Metrics with Cassandra by Mike HeffnerC* Summit 2013: Time-Series Metrics with Cassandra by Mike Heffner
C* Summit 2013: Time-Series Metrics with Cassandra by Mike Heffner
DataStax Academy
 
Распределенные системы хранения данных, особенности реализации DHT в проекте ...
Распределенные системы хранения данных, особенности реализации DHT в проекте ...Распределенные системы хранения данных, особенности реализации DHT в проекте ...
Распределенные системы хранения данных, особенности реализации DHT в проекте ...
yaevents
 
Be a Zen monk, the Python way
Be a Zen monk, the Python wayBe a Zen monk, the Python way
Be a Zen monk, the Python way
Sriram Murali
 
CRIU: time and space travel for Linux containers -- Kir Kolyshkin
CRIU: time and space travel for Linux containers -- Kir KolyshkinCRIU: time and space travel for Linux containers -- Kir Kolyshkin
CRIU: time and space travel for Linux containers -- Kir Kolyshkin
OpenVZ
 
.NET Memory Primer
.NET Memory Primer.NET Memory Primer
.NET Memory Primer
Martin Kulov
 
Ceph Day NYC: Developing With Librados
Ceph Day NYC: Developing With LibradosCeph Day NYC: Developing With Librados
Ceph Day NYC: Developing With Librados
Ceph Community
 
Lab 01 03_16
Lab 01 03_16Lab 01 03_16
Lab 01 03_16
Hao Wu
 
Remora the another asdf.
Remora the another asdf.Remora the another asdf.
Remora the another asdf.
hyotang666
 
Memory management
Memory managementMemory management
Memory management
mitesh_sharma
 
Garbage collection
Garbage collectionGarbage collection
Garbage collection
Mudit Gupta
 
An Introduction to Priam
An Introduction to PriamAn Introduction to Priam
An Introduction to Priam
Jason Brown
 
SqliteToRealm
SqliteToRealmSqliteToRealm
SqliteToRealm
Pluu love
 
A survey on Heap Exploitation
A survey on Heap Exploitation A survey on Heap Exploitation
A survey on Heap Exploitation
Alireza Karimi
 
FOSDEM2015: Live migration for containers is around the corner
FOSDEM2015: Live migration for containers is around the cornerFOSDEM2015: Live migration for containers is around the corner
FOSDEM2015: Live migration for containers is around the corner
Andrey Vagin
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodb
Deep Kapadia
 

What's hot (20)

Fedora Virtualization Day: Linux Containers & CRIU
Fedora Virtualization Day: Linux Containers & CRIUFedora Virtualization Day: Linux Containers & CRIU
Fedora Virtualization Day: Linux Containers & CRIU
 
“Show Me the Garbage!”, Understanding Garbage Collection
“Show Me the Garbage!”, Understanding Garbage Collection“Show Me the Garbage!”, Understanding Garbage Collection
“Show Me the Garbage!”, Understanding Garbage Collection
 
Mongodb meetup
Mongodb meetupMongodb meetup
Mongodb meetup
 
nebulaconf
nebulaconfnebulaconf
nebulaconf
 
C* Summit 2013: Time-Series Metrics with Cassandra by Mike Heffner
C* Summit 2013: Time-Series Metrics with Cassandra by Mike HeffnerC* Summit 2013: Time-Series Metrics with Cassandra by Mike Heffner
C* Summit 2013: Time-Series Metrics with Cassandra by Mike Heffner
 
Распределенные системы хранения данных, особенности реализации DHT в проекте ...
Распределенные системы хранения данных, особенности реализации DHT в проекте ...Распределенные системы хранения данных, особенности реализации DHT в проекте ...
Распределенные системы хранения данных, особенности реализации DHT в проекте ...
 
Be a Zen monk, the Python way
Be a Zen monk, the Python wayBe a Zen monk, the Python way
Be a Zen monk, the Python way
 
CRIU: time and space travel for Linux containers -- Kir Kolyshkin
CRIU: time and space travel for Linux containers -- Kir KolyshkinCRIU: time and space travel for Linux containers -- Kir Kolyshkin
CRIU: time and space travel for Linux containers -- Kir Kolyshkin
 
tokyotalk
tokyotalktokyotalk
tokyotalk
 
.NET Memory Primer
.NET Memory Primer.NET Memory Primer
.NET Memory Primer
 
Ceph Day NYC: Developing With Librados
Ceph Day NYC: Developing With LibradosCeph Day NYC: Developing With Librados
Ceph Day NYC: Developing With Librados
 
Lab 01 03_16
Lab 01 03_16Lab 01 03_16
Lab 01 03_16
 
Remora the another asdf.
Remora the another asdf.Remora the another asdf.
Remora the another asdf.
 
Memory management
Memory managementMemory management
Memory management
 
Garbage collection
Garbage collectionGarbage collection
Garbage collection
 
An Introduction to Priam
An Introduction to PriamAn Introduction to Priam
An Introduction to Priam
 
SqliteToRealm
SqliteToRealmSqliteToRealm
SqliteToRealm
 
A survey on Heap Exploitation
A survey on Heap Exploitation A survey on Heap Exploitation
A survey on Heap Exploitation
 
FOSDEM2015: Live migration for containers is around the corner
FOSDEM2015: Live migration for containers is around the cornerFOSDEM2015: Live migration for containers is around the corner
FOSDEM2015: Live migration for containers is around the corner
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodb
 

Similar to Slide smallfiles

Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introduction
kanedafromparis
 
Bluestore
BluestoreBluestore
Bluestore
Patrick McGarry
 
Bluestore
BluestoreBluestore
Bluestore
Ceph Community
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion Objects
Karan Singh
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
Sage Weil
 
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
Haim Yadid
 
Let's talk about Garbage Collection
Let's talk about Garbage CollectionLet's talk about Garbage Collection
Let's talk about Garbage Collection
Haim Yadid
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
Sage Weil
 
Ceph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Tech Talk: Bluestore
Ceph Tech Talk: Bluestore
Ceph Community
 
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
Sage Weil
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year In
Sage Weil
 
An Efficient Backup and Replication of Storage
An Efficient Backup and Replication of StorageAn Efficient Backup and Replication of Storage
An Efficient Backup and Replication of Storage
Takashi Hoshino
 
Hoard_2022AIM1001.pptx.pdf
Hoard_2022AIM1001.pptx.pdfHoard_2022AIM1001.pptx.pdf
Hoard_2022AIM1001.pptx.pdf
AshutoshKumar437302
 
Computer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer ArchitectureComputer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer Architecture
Haris456
 
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
Fred de Villamil
 
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward
 
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
Alex Rasmussen
 
Save Java memory
Save Java memorySave Java memory
Save Java memory
JavaDayUA
 
.NET Core, ASP.NET Core Course, Session 4
.NET Core, ASP.NET Core Course, Session 4.NET Core, ASP.NET Core Course, Session 4
.NET Core, ASP.NET Core Course, Session 4
aminmesbahi
 

Similar to Slide smallfiles (20)

Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introduction
 
Bluestore
BluestoreBluestore
Bluestore
 
Bluestore
BluestoreBluestore
Bluestore
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion Objects
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
 
Let's talk about Garbage Collection
Let's talk about Garbage CollectionLet's talk about Garbage Collection
Let's talk about Garbage Collection
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
Ceph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Tech Talk: Bluestore
Ceph Tech Talk: Bluestore
 
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year In
 
An Efficient Backup and Replication of Storage
An Efficient Backup and Replication of StorageAn Efficient Backup and Replication of Storage
An Efficient Backup and Replication of Storage
 
Hoard_2022AIM1001.pptx.pdf
Hoard_2022AIM1001.pptx.pdfHoard_2022AIM1001.pptx.pdf
Hoard_2022AIM1001.pptx.pdf
 
Computer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer ArchitectureComputer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer Architecture
 
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
 
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
 
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
 
Save Java memory
Save Java memorySave Java memory
Save Java memory
 
.NET Core, ASP.NET Core Course, Session 4
.NET Core, ASP.NET Core Course, Session 4.NET Core, ASP.NET Core Course, Session 4
.NET Core, ASP.NET Core Course, Session 4
 

Recently uploaded

DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
gestioneergodomus
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
Kerry Sado
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
Steel & Timber Design according to British Standard
Steel & Timber Design according to British StandardSteel & Timber Design according to British Standard
Steel & Timber Design according to British Standard
AkolbilaEmmanuel1
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
ydteq
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
veerababupersonal22
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
heavyhaig
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
WENKENLI1
 
Water billing management system project report.pdf
Water billing management system project report.pdfWater billing management system project report.pdf
Water billing management system project report.pdf
Kamal Acharya
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
SUTEJAS
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
AmarGB2
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
ChristineTorrepenida1
 
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABSDESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
itech2017
 

Recently uploaded (20)

DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
Steel & Timber Design according to British Standard
Steel & Timber Design according to British StandardSteel & Timber Design according to British Standard
Steel & Timber Design according to British Standard
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
Water billing management system project report.pdf
Water billing management system project report.pdfWater billing management system project report.pdf
Water billing management system project report.pdf
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
 
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABSDESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
 

Slide smallfiles

  • 1. Issue ● 10-20 Millions object per devices – 50 millions inodes per devices ● 36 devices per server ● 64 GB of RAM – 1 inode is 1KB in RAM – Would need 1.75TB of RAM for caching all inodes ● 75 % cache miss on inodes – Up to 50 % of IO to get inodes from device – (replicator/reconstructor constantly scan device...)
  • 2. Solution ● Get rid of inodes ● Haystack-like solution – Objects in volumes (a.k.a. big files, 5GB or 10GB) – K/V store to map object to (volume id, position) ● K/V is an gRPC service ● Backed by LevelDB (for now...) ● Need to avoid compaction issue – fallocate(PUNCH_HOLE) – Smart selection of volumes
  • 3. Benefits ● 42 bytes per object in K/V – Compared to 1KB for an XFS inode – Fit in memory (20GB vs 1.75TB) – Should easily go down to 30 bytes per object ● Listdir happens in K/V (so in memory) ● Space efficiency vs Block aligned (!) ● Flat namespace for objects – No part/sfx/ohash – Increasing part power is just a ring thing
  • 4. Adding an object 1.Select a volume 2.Append objet data 1.Object header (magic string, ohash, size, …) 2.Object metadata 3.Object data 3.fdatasync() volume 4.Insert new entry in K/V (no transaction) ● <o><policy><ohash><filename> => <volume id><offset> => If crash, the volume act as a journal to replay
  • 5. Removing an object 1.Select a volume 2.Insert a tombstone 3.fdatasync() volume 4.Insert tombstone in K/V 5.Run cleanup_ondisk_files() 1.Punch_hole the object 2.Remove the old entry from K/V
  • 6. Volume selection ● Avoid holes in volumes to reduce compaction – Try to group objects by partition ● => rebalance is compaction – Put short life objects in dedicated volumes ● tombstone ● x-delete-at soon – Dedicated volumes for handoff?
  • 7. Benchmarks ● Atom C2750 2.40Ghz ● 16GB RAM ● HGST HUS726040ALA610 (4TB) ● Directly connecting to objet servers
  • 8. Benchmarks ● Single threaded PUT (100 bytes objects) – From 0 to 4 millions objects ● XFS : 19.8/s ● Volumes : 26.2/s – From 4 millions to 8 millions objects ● XFS : 17/s ● Volumes : 39.2/s (b/c of not creating more volumes?) ● What we see (need numbers!) – XFS : memory is full ; Volumes : memory is free – Disks is more busy with XFS
  • 9. Benchmarks ● Single threaded random GET – XFS : 39/s – Volumes : 93/s
  • 10. Benchmarks ● Concurrent PUT, 20 threads for 10 minutes avg 50% 95% 99% max XFS 641ms 67ms 3.5s 4.7s 5.9s Volumes 82ms 50ms 261ms 615ms 1.24s
  • 11. Status ● Done – HEAD/GET/PUT/DELETE/POST (replica) ● Todo – REPLICATE/SSYNC – Erasure Code – XFS read compatibility – Smarter volumes selection – Func tests on object servers (is there any?) – Doc