SlideShare a Scribd company logo
1 of 11
Download to read offline
Issue
● 10-20 Millions object per devices
– 50 millions inodes per devices
● 36 devices per server
● 64 GB of RAM
– 1 inode is 1KB in RAM
– Would need 1.75TB of RAM for caching all inodes
● 75 % cache miss on inodes
– Up to 50 % of IO to get inodes from device
– (replicator/reconstructor constantly scan device...)
Solution
● Get rid of inodes
● Haystack-like solution
– Objects in volumes (a.k.a. big files, 5GB or 10GB)
– K/V store to map object to (volume id, position)
● K/V is an gRPC service
● Backed by LevelDB (for now...)
● Need to avoid compaction issue
– fallocate(PUNCH_HOLE)
– Smart selection of volumes
Benefits
● 42 bytes per object in K/V
– Compared to 1KB for an XFS inode
– Fit in memory (20GB vs 1.75TB)
– Should easily go down to 30 bytes per object
● Listdir happens in K/V (so in memory)
● Space efficiency vs Block aligned (!)
● Flat namespace for objects
– No part/sfx/ohash
– Increasing part power is just a ring thing
Adding an object
1.Select a volume
2.Append objet data
1.Object header (magic string, ohash, size, …)
2.Object metadata
3.Object data
3.fdatasync() volume
4.Insert new entry in K/V (no transaction)
● <o><policy><ohash><filename> => <volume id><offset>
=> If crash, the volume act as a journal to replay
Removing an object
1.Select a volume
2.Insert a tombstone
3.fdatasync() volume
4.Insert tombstone in K/V
5.Run cleanup_ondisk_files()
1.Punch_hole the object
2.Remove the old entry from K/V
Volume selection
● Avoid holes in volumes to reduce compaction
– Try to group objects by partition
● => rebalance is compaction
– Put short life objects in dedicated volumes
● tombstone
● x-delete-at soon
– Dedicated volumes for handoff?
Benchmarks
● Atom C2750 2.40Ghz
● 16GB RAM
● HGST HUS726040ALA610 (4TB)
● Directly connecting to objet servers
Benchmarks
● Single threaded PUT (100 bytes objects)
– From 0 to 4 millions objects
● XFS : 19.8/s
● Volumes : 26.2/s
– From 4 millions to 8 millions objects
● XFS : 17/s
● Volumes : 39.2/s (b/c of not creating more volumes?)
● What we see (need numbers!)
– XFS : memory is full ; Volumes : memory is free
– Disks is more busy with XFS
Benchmarks
● Single threaded random GET
– XFS : 39/s
– Volumes : 93/s
Benchmarks
● Concurrent PUT, 20 threads for 10 minutes
avg 50% 95% 99% max
XFS 641ms 67ms 3.5s 4.7s 5.9s
Volumes 82ms 50ms 261ms 615ms 1.24s
Status
● Done
– HEAD/GET/PUT/DELETE/POST (replica)
● Todo
– REPLICATE/SSYNC
– Erasure Code
– XFS read compatibility
– Smarter volumes selection
– Func tests on object servers (is there any?)
– Doc

More Related Content

What's hot

Fedora Virtualization Day: Linux Containers & CRIU
Fedora Virtualization Day: Linux Containers & CRIUFedora Virtualization Day: Linux Containers & CRIU
Fedora Virtualization Day: Linux Containers & CRIUAndrey Vagin
 
“Show Me the Garbage!”, Understanding Garbage Collection
“Show Me the Garbage!”, Understanding Garbage Collection“Show Me the Garbage!”, Understanding Garbage Collection
“Show Me the Garbage!”, Understanding Garbage CollectionHaim Yadid
 
C* Summit 2013: Time-Series Metrics with Cassandra by Mike Heffner
C* Summit 2013: Time-Series Metrics with Cassandra by Mike HeffnerC* Summit 2013: Time-Series Metrics with Cassandra by Mike Heffner
C* Summit 2013: Time-Series Metrics with Cassandra by Mike HeffnerDataStax Academy
 
Распределенные системы хранения данных, особенности реализации DHT в проекте ...
Распределенные системы хранения данных, особенности реализации DHT в проекте ...Распределенные системы хранения данных, особенности реализации DHT в проекте ...
Распределенные системы хранения данных, особенности реализации DHT в проекте ...yaevents
 
Be a Zen monk, the Python way
Be a Zen monk, the Python wayBe a Zen monk, the Python way
Be a Zen monk, the Python waySriram Murali
 
CRIU: time and space travel for Linux containers -- Kir Kolyshkin
CRIU: time and space travel for Linux containers -- Kir KolyshkinCRIU: time and space travel for Linux containers -- Kir Kolyshkin
CRIU: time and space travel for Linux containers -- Kir KolyshkinOpenVZ
 
.NET Memory Primer
.NET Memory Primer.NET Memory Primer
.NET Memory PrimerMartin Kulov
 
Ceph Day NYC: Developing With Librados
Ceph Day NYC: Developing With LibradosCeph Day NYC: Developing With Librados
Ceph Day NYC: Developing With LibradosCeph Community
 
Lab 01 03_16
Lab 01 03_16Lab 01 03_16
Lab 01 03_16Hao Wu
 
Remora the another asdf.
Remora the another asdf.Remora the another asdf.
Remora the another asdf.hyotang666
 
Garbage collection
Garbage collectionGarbage collection
Garbage collectionMudit Gupta
 
An Introduction to Priam
An Introduction to PriamAn Introduction to Priam
An Introduction to PriamJason Brown
 
SqliteToRealm
SqliteToRealmSqliteToRealm
SqliteToRealmPluu love
 
A survey on Heap Exploitation
A survey on Heap Exploitation A survey on Heap Exploitation
A survey on Heap Exploitation Alireza Karimi
 
FOSDEM2015: Live migration for containers is around the corner
FOSDEM2015: Live migration for containers is around the cornerFOSDEM2015: Live migration for containers is around the corner
FOSDEM2015: Live migration for containers is around the cornerAndrey Vagin
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodbDeep Kapadia
 

What's hot (20)

Fedora Virtualization Day: Linux Containers & CRIU
Fedora Virtualization Day: Linux Containers & CRIUFedora Virtualization Day: Linux Containers & CRIU
Fedora Virtualization Day: Linux Containers & CRIU
 
“Show Me the Garbage!”, Understanding Garbage Collection
“Show Me the Garbage!”, Understanding Garbage Collection“Show Me the Garbage!”, Understanding Garbage Collection
“Show Me the Garbage!”, Understanding Garbage Collection
 
Mongodb meetup
Mongodb meetupMongodb meetup
Mongodb meetup
 
nebulaconf
nebulaconfnebulaconf
nebulaconf
 
C* Summit 2013: Time-Series Metrics with Cassandra by Mike Heffner
C* Summit 2013: Time-Series Metrics with Cassandra by Mike HeffnerC* Summit 2013: Time-Series Metrics with Cassandra by Mike Heffner
C* Summit 2013: Time-Series Metrics with Cassandra by Mike Heffner
 
Распределенные системы хранения данных, особенности реализации DHT в проекте ...
Распределенные системы хранения данных, особенности реализации DHT в проекте ...Распределенные системы хранения данных, особенности реализации DHT в проекте ...
Распределенные системы хранения данных, особенности реализации DHT в проекте ...
 
Be a Zen monk, the Python way
Be a Zen monk, the Python wayBe a Zen monk, the Python way
Be a Zen monk, the Python way
 
CRIU: time and space travel for Linux containers -- Kir Kolyshkin
CRIU: time and space travel for Linux containers -- Kir KolyshkinCRIU: time and space travel for Linux containers -- Kir Kolyshkin
CRIU: time and space travel for Linux containers -- Kir Kolyshkin
 
tokyotalk
tokyotalktokyotalk
tokyotalk
 
.NET Memory Primer
.NET Memory Primer.NET Memory Primer
.NET Memory Primer
 
Ceph Day NYC: Developing With Librados
Ceph Day NYC: Developing With LibradosCeph Day NYC: Developing With Librados
Ceph Day NYC: Developing With Librados
 
Lab 01 03_16
Lab 01 03_16Lab 01 03_16
Lab 01 03_16
 
Remora the another asdf.
Remora the another asdf.Remora the another asdf.
Remora the another asdf.
 
Memory management
Memory managementMemory management
Memory management
 
Garbage collection
Garbage collectionGarbage collection
Garbage collection
 
An Introduction to Priam
An Introduction to PriamAn Introduction to Priam
An Introduction to Priam
 
SqliteToRealm
SqliteToRealmSqliteToRealm
SqliteToRealm
 
A survey on Heap Exploitation
A survey on Heap Exploitation A survey on Heap Exploitation
A survey on Heap Exploitation
 
FOSDEM2015: Live migration for containers is around the corner
FOSDEM2015: Live migration for containers is around the cornerFOSDEM2015: Live migration for containers is around the corner
FOSDEM2015: Live migration for containers is around the corner
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodb
 

Similar to Slide smallfiles

Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introductionkanedafromparis
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsKaran Singh
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephSage Weil
 
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
“Show Me the Garbage!”, Garbage Collection a Friend or a FoeHaim Yadid
 
Let's talk about Garbage Collection
Let's talk about Garbage CollectionLet's talk about Garbage Collection
Let's talk about Garbage CollectionHaim Yadid
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephSage Weil
 
Ceph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Community
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDBSage Weil
 
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.The Hive
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InSage Weil
 
An Efficient Backup and Replication of Storage
An Efficient Backup and Replication of StorageAn Efficient Backup and Replication of Storage
An Efficient Backup and Replication of StorageTakashi Hoshino
 
Computer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer ArchitectureComputer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer ArchitectureHaris456
 
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...Fred de Villamil
 
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward
 
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)Alex Rasmussen
 
Save Java memory
Save Java memorySave Java memory
Save Java memoryJavaDayUA
 
.NET Core, ASP.NET Core Course, Session 4
.NET Core, ASP.NET Core Course, Session 4.NET Core, ASP.NET Core Course, Session 4
.NET Core, ASP.NET Core Course, Session 4aminmesbahi
 

Similar to Slide smallfiles (20)

Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introduction
 
Bluestore
BluestoreBluestore
Bluestore
 
Bluestore
BluestoreBluestore
Bluestore
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion Objects
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
 
Let's talk about Garbage Collection
Let's talk about Garbage CollectionLet's talk about Garbage Collection
Let's talk about Garbage Collection
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
Ceph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Tech Talk: Bluestore
Ceph Tech Talk: Bluestore
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
 
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year In
 
An Efficient Backup and Replication of Storage
An Efficient Backup and Replication of StorageAn Efficient Backup and Replication of Storage
An Efficient Backup and Replication of Storage
 
Hoard_2022AIM1001.pptx.pdf
Hoard_2022AIM1001.pptx.pdfHoard_2022AIM1001.pptx.pdf
Hoard_2022AIM1001.pptx.pdf
 
Computer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer ArchitectureComputer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer Architecture
 
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
 
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
 
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
 
Save Java memory
Save Java memorySave Java memory
Save Java memory
 
.NET Core, ASP.NET Core Course, Session 4
.NET Core, ASP.NET Core Course, Session 4.NET Core, ASP.NET Core Course, Session 4
.NET Core, ASP.NET Core Course, Session 4
 

Recently uploaded

An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
An introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxAn introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxPurva Nikam
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .Satyam Kumar
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 

Recently uploaded (20)

An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
An introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxAn introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptx
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 

Slide smallfiles

  • 1. Issue ● 10-20 Millions object per devices – 50 millions inodes per devices ● 36 devices per server ● 64 GB of RAM – 1 inode is 1KB in RAM – Would need 1.75TB of RAM for caching all inodes ● 75 % cache miss on inodes – Up to 50 % of IO to get inodes from device – (replicator/reconstructor constantly scan device...)
  • 2. Solution ● Get rid of inodes ● Haystack-like solution – Objects in volumes (a.k.a. big files, 5GB or 10GB) – K/V store to map object to (volume id, position) ● K/V is an gRPC service ● Backed by LevelDB (for now...) ● Need to avoid compaction issue – fallocate(PUNCH_HOLE) – Smart selection of volumes
  • 3. Benefits ● 42 bytes per object in K/V – Compared to 1KB for an XFS inode – Fit in memory (20GB vs 1.75TB) – Should easily go down to 30 bytes per object ● Listdir happens in K/V (so in memory) ● Space efficiency vs Block aligned (!) ● Flat namespace for objects – No part/sfx/ohash – Increasing part power is just a ring thing
  • 4. Adding an object 1.Select a volume 2.Append objet data 1.Object header (magic string, ohash, size, …) 2.Object metadata 3.Object data 3.fdatasync() volume 4.Insert new entry in K/V (no transaction) ● <o><policy><ohash><filename> => <volume id><offset> => If crash, the volume act as a journal to replay
  • 5. Removing an object 1.Select a volume 2.Insert a tombstone 3.fdatasync() volume 4.Insert tombstone in K/V 5.Run cleanup_ondisk_files() 1.Punch_hole the object 2.Remove the old entry from K/V
  • 6. Volume selection ● Avoid holes in volumes to reduce compaction – Try to group objects by partition ● => rebalance is compaction – Put short life objects in dedicated volumes ● tombstone ● x-delete-at soon – Dedicated volumes for handoff?
  • 7. Benchmarks ● Atom C2750 2.40Ghz ● 16GB RAM ● HGST HUS726040ALA610 (4TB) ● Directly connecting to objet servers
  • 8. Benchmarks ● Single threaded PUT (100 bytes objects) – From 0 to 4 millions objects ● XFS : 19.8/s ● Volumes : 26.2/s – From 4 millions to 8 millions objects ● XFS : 17/s ● Volumes : 39.2/s (b/c of not creating more volumes?) ● What we see (need numbers!) – XFS : memory is full ; Volumes : memory is free – Disks is more busy with XFS
  • 9. Benchmarks ● Single threaded random GET – XFS : 39/s – Volumes : 93/s
  • 10. Benchmarks ● Concurrent PUT, 20 threads for 10 minutes avg 50% 95% 99% max XFS 641ms 67ms 3.5s 4.7s 5.9s Volumes 82ms 50ms 261ms 615ms 1.24s
  • 11. Status ● Done – HEAD/GET/PUT/DELETE/POST (replica) ● Todo – REPLICATE/SSYNC – Erasure Code – XFS read compatibility – Smarter volumes selection – Func tests on object servers (is there any?) – Doc