Ceph on Intel: Intel Storage Components, Benchmarks, and Contributions (Colleen Corrice)
At Red Hat Storage Day Minneapolis on 4/12/16, Intel's Dan Ferber presented on Intel storage components, benchmarks, and contributions as they relate to Ceph.
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C... (Odinot Stanislas)
After a short introduction to distributed storage and a description of Ceph, Jian Zhang runs through some interesting benchmarks in this presentation: sequential tests, random tests, and above all a comparison of results before and after optimization. The configuration parameters touched and the optimizations applied (large page numbers, OMAP data on a separate disk, ...) bring at least a 2x performance gain.
Presentation from 2016 Austin OpenStack Summit.
The Ceph upstream community is declaring CephFS stable for the first time in the recent Jewel release, but that declaration comes with caveats: while we have filesystem repair tools and a horizontally scalable POSIX filesystem, we have default-disabled exciting features like horizontally-scalable metadata servers and snapshots. This talk will present exactly what features you can expect to see, what's blocking the inclusion of other features, and what you as a user can expect and can contribute by deploying or testing CephFS.
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution (Karan Singh)
In this presentation, I explain how Ceph object storage performance can be improved drastically, together with some object storage best practices, recommendations, and tips. I also cover the Ceph shared data lake, which is becoming very popular.
This presentation provides an overview of the Dell PowerEdge R730xd server performance results with Red Hat Ceph Storage. It covers the advantages of using Red Hat Ceph Storage on Dell servers with their proven hardware components that provide high scalability, enhanced ROI cost benefits, and support of unstructured data.
This presentation provides a basic overview of Ceph, upon which SUSE Storage is based. It discusses the various factors and trade-offs that affect the performance and other functional and non-functional properties of a software-defined storage (SDS) environment.
Accelerating HBase with NVMe and Bucket Cache (Nicolas Poggi)
The Non-Volatile Memory Express (NVMe) standard promises an order of magnitude faster storage than regular SSDs, while at the same time being more economical than regular RAM in TB/$. This talk evaluates the use cases and benefits of NVMe drives in Big Data clusters with HBase and Hadoop HDFS.
First, we benchmark the different drives using system level tools (FIO) to get maximum expected values for each different device type and set expectations. Second, we explore the different options and use cases of HBase storage and benchmark the different setups. And finally, we evaluate the speedups obtained by the NVMe technology for the different Big Data use cases from the YCSB benchmark.
In summary, while the NVMe drives show up to 8x speedup in best case scenarios, testing the cost-efficiency of new device technologies is not straightforward in Big Data, where we need to overcome system level caching to measure the maximum benefits.
Accelerating HBase with NVMe and bucket cache (David Grier)
This set of slides describes some initial experiments we designed to discover performance improvements in Hadoop technologies using NVMe technology.
Ceph at Work in Bloomberg: Object Store, RBD and OpenStack (Red_Hat_Storage)
Bloomberg's Chris Jones and Chris Morgan joined Red Hat Storage Day New York on 1/19/16 to explain how Red Hat Ceph Storage helps the financial giant tackle its data storage challenges.
This talk was given during Lucene Revolution 2017 and has two goals: first, to discuss the tradeoffs for running Solr on Docker. For example, you get dynamic allocation of operating system caches, but you also get some CPU overhead. We'll keep in mind that Solr nodes tend to be different than your average container: Solr is usually long running, takes quite some RSS and a lot of virtual memory. This will imply, for example, that it makes more sense to use Docker on big physical boxes than on configurable-size VMs (like Amazon EC2).
The second goal is to discuss issues with deploying Solr on Docker and how to work around them. For example, many older (and some of the newer) combinations of Docker, Linux Kernel and JVM have memory leaks. We'll go over Docker operations best practices, such as using container limits to cap memory usage and prevent the host OOM killer from terminating a memory-consuming process - usually a Solr node. Or running Docker in Swarm mode over multiple smaller boxes to limit the spread of a single issue.
Data deduplication is a hot topic in storage and saves significant disk space for many environments, with some trade offs. We’ll discuss what deduplication is and where the Open Source solutions are versus commercial offerings. Presentation will lean towards the practical – where attendees can use it in their real world projects (what works, what doesn’t, should you use in production, etcetera).
Seastore: Next Generation Backing Store for Ceph (ScyllaDB)
Ceph is an open source distributed file system addressing file, block, and object storage use cases. Next generation storage devices require a change in strategy, so the community has been developing crimson-osd, an eventual replacement for ceph-osd intended to minimize CPU overhead and improve throughput and latency. Seastore is a new backing store for crimson-osd targeted at emerging storage technologies including persistent memory and ZNS devices.
Today, a significant number of our customers have taken the plunge and use Oracle 12c for production applications. Among all these customers, we see a growing interest in the Multitenant option.
Even though moving to the Multitenant option is not difficult in itself, there are many pitfalls to avoid as well as a few points to clarify, particularly around performance.
Working through a concrete case with one of our customers, this presentation provides answers that will let you plan the rollout of this option with confidence; it represents an important and very interesting architectural change.
Best Practices & Performance Tuning - OpenStack Cloud Storage with Ceph - In this presentation, we discuss best practices and performance tuning for OpenStack cloud storage with Ceph to achieve high availability, durability, reliability, and scalability at any point in time. We also discuss best practices for failure domains, recovery, rebalancing, backfilling, scrubbing, deep-scrubbing, and operations.
2. OVERVIEW
What's been going on with Ceph performance since Hammer?
Answer: A lot!
● Memory Allocator Testing
● Bluestore Development
● RADOS Gateway Bucket Index Overhead
● Cache Tiering Probabilistic Promotion Throttling
First let's look at how we are testing all this stuff...
3. CBT
CBT is an open source tool for creating Ceph clusters and running benchmarks against them.
● Automatically builds clusters and runs through a variety of tests.
● Can launch various monitoring and profiling tools such as collectl.
● YAML based configuration file for cluster and test configuration.
● Open Source: https://github.com/ceph/cbt
4. MEMORY ALLOCATOR TESTING
We sat down at the 2015 Ceph Hackathon and tested a CBT configuration to replicate memory allocator results on SSD-based clusters pioneered by SanDisk and Intel.
Memory Allocator | Version | Notes
TCMalloc | 2.1 (default) | Thread cache cannot be changed due to a bug.
TCMalloc | 2.4 | Default 32MB thread cache
TCMalloc | 2.4 | 64MB thread cache
TCMalloc | 2.4 | 128MB thread cache
jemalloc | 3.6.0 | Default jemalloc configuration
Example CBT test:
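For readers who have not used CBT, here is a minimal sketch of a job file in the spirit of the example above. This is not the actual Hackathon configuration; the hostnames, pool profile, and fio parameters are illustrative, and the field names follow the example configs shipped in the CBT repository.

cat > memalloc-4k-randwrite.yaml <<'EOF'
# Hypothetical CBT job: rebuild the cluster and run 4K random writes via librbd+fio.
cluster:
  user: 'ceph'
  head: "node1"
  clients: ["node1"]
  osds: ["node1", "node2", "node3", "node4"]
  mons:
    node1:
      a: "192.168.1.1:6789"
  osds_per_node: 4
  fs: 'xfs'
  iterations: 1
  use_existing: False
  tmp_dir: "/tmp/cbt"
  pool_profiles:
    rbd:
      pg_size: 2048
      pgp_size: 2048
      replication: 3
benchmarks:
  librbdfio:
    time: 300
    vol_size: 16384
    mode: ['randwrite']
    op_size: [4096]
    concurrent_procs: [1]
    iodepth: [32]
    pool_profile: 'rbd'
EOF

# Run the job and archive results (from a CBT checkout; see the CBT README for options).
./cbt.py --archive=/tmp/cbt-archive memalloc-4k-randwrite.yaml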
5. MEMORY ALLOCATOR TESTING
The cluster is rebuilt the exact same way for every memory allocator tested. Tests are run across many different IO sizes and IO types. The most impressive change was in 4K random writes: over 4X faster with jemalloc!
[Chart: 4KB random write IOPS over time (0-350 seconds) for TCMalloc 2.1 (32MB TC), TCMalloc 2.4 (32MB, 64MB, and 128MB TC), and jemalloc. TCMalloc 2.4 (128MB TC) performance degrades over time; jemalloc delivers 4.1x faster writes than TCMalloc 2.1.]
6. WHAT NOW?
Does the Memory Allocator Affect Memory Usage?
We need to examine RSS memory usage during recovery to see what happens in a memory-intensive scenario. CBT can perform recovery tests during benchmarks with a small configuration change:
cluster:
  recovery_test:
    osds: [ 1, 2, 3, 4,
            5, 6, 7, 8,
            9,10,11,12,
           13,14,15,16]
Test procedure:
● Start the test.
● Wait 60 seconds.
● Mark OSDs on Node 1 down/out.
● Wait until the cluster heals.
● Mark OSDs on Node 1 up/in.
● Wait until the cluster heals.
● Wait 60 seconds.
● End the test.
7. MEMORY ALLOCATOR TESTING
Memory Usage during Recovery with Concurrent 4KB Random Writes
[Chart: Node 1 OSD RSS memory usage (MB) over time during recovery with concurrent 4KB random writes, for TCMalloc 2.4 (32MB TC), TCMalloc 2.4 (128MB TC), and jemalloc 3.6.0; OSDs are marked up/in after previously being marked down/out. jemalloc shows much higher RSS memory usage and the highest peak, but completes recovery faster than tcmalloc.]
8. MEMORY ALLOCATOR TESTING
General Conclusions
● Ceph is very hard on memory allocators. Opportunities for tuning.
● Huge performance gains and latency drops possible!
● Small IO on fast SSDs is CPU limited in these tests.
● Jemalloc provides higher performance but uses more memory.
● Memory allocator tuning is primarily necessary for SimpleMessenger; AsyncMessenger is not affected.
● We decided to keep TCMalloc as the default memory allocator in Jewel but increased the amount of thread cache to 128MB.
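As a practical note, the 128MB thread-cache setting can also be applied to an existing cluster through the environment the OSDs start with. A hedged sketch follows; the file location varies by distribution, and the jemalloc line is only an assumption about how one might preload it without rebuilding Ceph.

# 128MB = 134217728 bytes, matching the thread-cache default adopted for Jewel.
# On RHEL/CentOS the file is /etc/sysconfig/ceph; on Debian/Ubuntu, /etc/default/ceph.
echo 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728' >> /etc/sysconfig/ceph

# Illustrative only: preloading jemalloc for a single OSD instead of rebuilding Ceph
# against it (verify the library path on your system first).
# LD_PRELOAD=/usr/lib64/libjemalloc.so.1 ceph-osd -i 0 -f

# Restart OSDs during a maintenance window so the new environment takes effect.
systemctl restart ceph-osd.target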
9. FILESTORE DEFICIENCIES
Ceph already has Filestore. Why add a new OSD backend?
● 2X journal write penalty needs to go away!
● Filestore stores metadata in XATTRS. On XFS, any XATTR larger than 254B causes all XATTRS to be moved out of the inode.
● Filestore's PG directory hierarchy grows with the number of objects. This can be mitigated by favoring dentry cache with vfs_cache_pressure, but...
● OSDs regularly call syncfs to persist buffered writes. Syncfs does an O(n) search of the entire in-kernel inode cache and slows down as more inodes are cached!
● Pick your poison. Crank up vfs_cache_pressure to avoid syncfs penalties or turn it down to avoid extra dentry seeks caused by deep PG directory hierarchies?
● There must be a better way...
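The vfs_cache_pressure knob referred to above is a stock kernel sysctl; a small sketch of the "favor the dentry/inode cache" side of that trade-off (the value 10 is illustrative):

# Default is 100. Lower values make the kernel prefer keeping dentries/inodes cached
# (fewer metadata seeks into FileStore's deep PG directories), at the cost of a larger
# in-kernel inode cache for syncfs to walk.
sysctl vm.vfs_cache_pressure=10
echo 'vm.vfs_cache_pressure = 10' > /etc/sysctl.d/90-ceph-osd.conf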
10. BLUESTORE
How is BlueStore different?
[Diagram: BlueStore architecture. The ObjectStore interface is implemented by BlueStore, which writes data directly to raw block devices and keeps metadata in RocksDB; RocksDB runs on BlueRocksEnv, backed by the tiny BlueFS file system, and a pluggable Allocator manages the shared block devices.]
● BlueStore = Block + NewStore
● consume raw block device(s)
● key/value database (RocksDB) for metadata
● data written directly to block device
● pluggable block Allocator
● We must share the block device with RocksDB
● implement our own rocksdb::Env
● implement tiny “file system” BlueFS
● make BlueStore and BlueFS share the block device
11. BLUESTORE
BlueStore Advantages
● Large writes go directly to the block device and small writes to the RocksDB WAL.
● No more crazy PG directory hierarchy!
● metadata stored in RocksDB instead of FS XATTRS / LevelDB
● Less SSD wear due to journal writes, and SSDs used for more than journaling.
● Map BlueFS/RocksDB “directories” to different block devices
● db.wal/ – on NVRAM, NVMe, SSD
● db/ – level0 and hot SSTs on SSD
● db.slow/ – cold SSTs on HDD
Not production ready yet, but will be in Jewel as an experimental feature!
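For anyone who wants to try it on a throwaway Jewel cluster, here is a hedged ceph.conf sketch; the option names are the Jewel-era experimental ones and the device paths are purely illustrative (do not use this on data you care about):

cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
# Required in Jewel: BlueStore is explicitly marked experimental.
enable experimental unrecoverable data corrupting features = bluestore rocksdb

[osd]
osd objectstore = bluestore
# Optional split of data / RocksDB / WAL across devices (paths illustrative),
# mirroring the db.slow/, db/, db.wal/ mapping described above.
#bluestore block path = /dev/sdb
#bluestore block db path = /dev/nvme0n1p1
#bluestore block wal path = /dev/nvme0n1p2
EOF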
12. BLUESTORE HDD PERFORMANCE
How does BlueStore Perform?
[Chart: 10.1.0 BlueStore HDD performance vs Filestore, averaged over many IO sizes (4K, 8K, 16K, 32K ... 4096K), sequential and random, for read, 50% mixed, and write workloads. BlueStore's average sequential reads are a little worse than Filestore's; everything else is much faster.]
HDD performance looks good overall, though sequential reads are important to watch closely since BlueStore relies on client-side readahead.
13. BLUESTORE NVMe PERFORMANCE
NVMe results however are decidedly mixed.
[Chart: 10.1.0 BlueStore NVMe performance vs Filestore, averaged over many IO sizes (4K, 8K, 16K, 32K ... 4096K), sequential and random, for read, 50% mixed, and write workloads. BlueStore's average sequential reads are still a little worse, but mixed sequential workloads are better.]
What these averages don't show is dramatic performance variation at different IO sizes. Let's take a look at what's happening.
14. BLUESTORE NVMe PERFORMANCE
NVMe results are decidedly mixed, but why?
Performance is generally good at small and large IO sizes but is slower than Filestore at middle IO sizes. BlueStore is still experimental, so stay tuned while we tune!
[Chart: 10.1.0 BlueStore NVMe performance vs Filestore at different IO sizes, for sequential read, sequential write, random read, random write, sequential 50% mixed, and random 50% mixed workloads.]
15. RGW WRITE PERFORMANCE
A common question:
Why is there a difference in performance between RGW writes and pure RADOS writes?
There are several factors that play a part:
● S3/Swift protocol likely higher overhead than native RADOS.
● Writes translated through a gateway result in extra latency and potentially additional bottlenecks.
● Most importantly, RGW maintains bucket indices that have to be updated every time there is a write, while RADOS does not maintain indices.
16. RGW BUCKET INDEX OVERHEAD
How are RGW Bucket Indices Stored?
● Standard RADOS Objects with the same rules as other objects
● Can be sharded across multiple RADOS objects to improve parallelism
● Data stored in OMAP (i.e., XATTRS on the underlying object file when using FileStore)
What Happens during an RGW Write?
● Prepare Step: First stage of 2 stage bucket index update in preparation for write.
● Actual Write
● Commit Step: Asynchronous second stage of bucket index update to record that the write completed.
Note: Every time an object is accessed on FileStore-backed OSDs, multiple metadata seeks may be required depending on the kernel dentry/inode cache, the total number of objects, and external memory pressure.
17. RGW BUCKET INDEX OVERHEAD
A real example from a customer deployment:
Use GDB as a “poorman's profiler” to see what RGW threads are doing during a heavy 256K object write workload:
gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p <process>
Results:
● 200 threads doing IoCtx::operate
● 169 librados::ObjectWriteOperation
● 126 RGWRados::Bucket::UpdateIndex::prepare(RGWModifyOp) ()
● 31 librados::ObjectReadOperation
● 31 RGWRados::raw_obj_stat(…) () ← read head metadata
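The thread counts above come from eyeballing the backtraces; a small sketch of how the same dump can be aggregated automatically (the frame to group by and the process name are illustrative):

# Dump all radosgw thread backtraces and count the most common frame-#1 entries,
# a crude way to see where the ~200 worker threads are spending their time.
gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p "$(pidof radosgw)" \
  | grep '^#1 ' \
  | awk '{ $1=""; $2=""; print }' \
  | sort | uniq -c | sort -rn | head -20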
18. IMPROVING RGW WRITES
How to make RGW writes faster?
Bluestore gives us a lot
● No more journal penalty for large writes (helps everything, including RGW!)
● Much better allocator behavior and allocator metadata stored in RocksDB
● Bucket index updates in RocksDB instead of XATTRS (should be faster!)
● No need for separate SSD pool for Bucket Indices?
What about filestore?
● Put journals for rgw.buckets pool on SSD to avoid journal penalty
● Put rgw.buckets.index pools on SSD backed OSDs
● More OSDs for rgw.buckets.index means more PGs, higher memory usage, and potentially lower distribution quality.
● Are there alternatives?
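A hedged sketch of the two FileStore-era mitigations above; the shard count, gateway instance name, pool name, and CRUSH rule number are illustrative, the pool name assumes Jewel's default zone layout, and pre-Luminous releases use the crush_ruleset pool property rather than crush_rule:

# Shard new bucket indices over 16 RADOS objects (applies to newly created buckets only).
cat >> /etc/ceph/ceph.conf <<'EOF'
[client.rgw.gateway1]
rgw override bucket index max shards = 16
EOF

# Move the bucket index pool to SSD-backed OSDs by assigning a CRUSH rule
# (rule 1 here is assumed to already map only to SSD hosts).
ceph osd pool set default.rgw.buckets.index crush_ruleset 1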
19. RGW BUCKET INDEX OVERHEAD
Are there alternatives? Potentially yes!
Ben England from Red Hat's Performance Team is testing RGW with LVM Cache.
Initial (100% cached) performance looks promising. Will it scale though?
[Chart: RGW puts/second for 1M 64K objects with 16 index shards and a 128MB TCMalloc thread cache, LVM Cache OSDs vs native disk.]
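For context, this is roughly how an lvmcache (dm-cache) backed OSD data volume is assembled; the device names and extent splits are purely illustrative and this is not Ben England's exact test setup:

# One volume group containing the slow HDD and the fast NVMe device.
vgcreate ceph-osd0 /dev/sdb /dev/nvme0n1
lvcreate -n data -l 100%PVS ceph-osd0 /dev/sdb          # origin LV on the HDD
lvcreate -n cache -l 90%PVS ceph-osd0 /dev/nvme0n1      # cache data LV on NVMe
lvcreate -n cachemeta -l 5%PVS ceph-osd0 /dev/nvme0n1   # cache metadata LV
lvconvert --type cache-pool --poolmetadata ceph-osd0/cachemeta ceph-osd0/cache
lvconvert --type cache --cachepool ceph-osd0/cache ceph-osd0/data
# /dev/ceph-osd0/data is then formatted and used as the OSD's data device.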
20. RGW BUCKET INDEX OVERHEAD
What if you don't need bucket indices?
The customer we tested with didn't need Ceph to keep track of bucket contents, so for Jewel we introduce the concept of Blind Buckets that do not maintain bucket indices. For this customer the overall performance improvement was near 60%.
[Chart: RGW 256K object write performance, IOPS (thousands), blind buckets vs normal buckets.]
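A hedged sketch of how an indexless ("blind") placement target is switched on in Jewel's zone metadata; the zone name, placement id, and exact JSON field layout should be verified against radosgw-admin output on your own cluster before use:

# Export the zone, flip the placement target to indexless, and re-import it.
radosgw-admin zone get --rgw-zone=default > zone.json
# In zone.json, under placement_pools -> "default-placement", set "index_type": 1
# (1 = indexless/blind, 0 = normal). Then:
radosgw-admin zone set --rgw-zone=default --infile zone.json
radosgw-admin period update --commit   # needed on multisite configurations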
21. CACHE TIERING
The original cache tiering implementation in Firefly was very slow when reads forced objects to be promoted into the cache tier.
[Charts: 4K Zipf 1.2 read workload on a 256GB volume with a 128GB cache tier, Firefly vs Hammer. Left: client read throughput (MB/s), base-only vs cache tier. Right: write throughput (MB/s) into the cache tier during client reads.]
22. CACHE TIERING
There have been many improvements since then...
● Memory Allocator tuning helps SSD tier in general
● Read proxy support added in Hammer
● Write proxy support added in Infernalis
● Recency fixed: https://github.com/ceph/ceph/pull/6702
● Other misc improvements.
● Is it enough?
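All of the improvements above sit behind the same basic plumbing; for reference, a sketch of a writeback cache-tier setup (pool names and sizes are illustrative):

# Attach a fast pool as a writeback cache in front of a base pool.
ceph osd tier add rbd rbd-cache
ceph osd tier cache-mode rbd-cache writeback
ceph osd tier set-overlay rbd rbd-cache

# HitSets drive the promotion/eviction decisions discussed on the next slides.
ceph osd pool set rbd-cache hit_set_type bloom
ceph osd pool set rbd-cache hit_set_count 8
ceph osd pool set rbd-cache hit_set_period 60
ceph osd pool set rbd-cache target_max_bytes 128000000000   # ~128GB cache tier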
23. CACHE TIERING
Is it enough?
Zipf 1.2 distribution reads perform very similarly between the base-only and tiered configurations, but random reads are still slow at small IO sizes.
[Charts: percent improvement by IO size for NVMe cache-tiered (Rec 2) vs non-tiered. Left: RBD random read. Right: RBD Zipf 1.2 read.]
24. CACHE TIERING
Limit promotions even more with object and throughput throttling.
Performance improves dramatically and in this case even beats the base tier when promotions are throttled very aggressively.
[Chart: RBD random read probabilistic promotion improvement, NVMe cache-tiered vs non-tiered, percent improvement by IO size (4KB through 4096KB). Series: Non-Tiered, Tiered (Rec 2), Tiered (Rec 1) with very high, high, medium, low, and very low promotion throttles, and Tiered (Rec 2, 0 Promote).]
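A hedged sketch of the knobs behind these curves as they appear in Jewel; the values are illustrative, and "Rec 1 / Rec 2" in the chart corresponds to the pool's min_read_recency_for_promote setting:

# Only promote an object on read if it appears in at least N recent HitSets.
ceph osd pool set rbd-cache min_read_recency_for_promote 1

# Jewel's probabilistic promotion throttles: per-OSD caps on promotion traffic.
# Lower values throttle promotions more aggressively.
ceph tell osd.* injectargs '--osd_tier_promote_max_objects_sec 10'
ceph tell osd.* injectargs '--osd_tier_promote_max_bytes_sec 1048576'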
25. CACHE TIERING
Conclusions
● Performance very dependent on promotion and eviction rates.
● Limiting promotion can improve performance, but are we making the cache tier less adaptive to changing hot/cold distributions?
● Will need a lot more testing and user feedback to see if our default promotion throttles make sense!