Ceph Day NYC: Ceph Performance & Benchmarking

Ceph Community

Mark Nelson from Inktank discusses his performance and benchmarking efforts with Ceph.

1. That's Ceph, I use Ceph now, Ceph is Cool.

2. Who's the crazy guy speaking?

3. What about Ceph?

4. [Architecture diagram] Interfaces: RBD KO, QEMU RBD, RGW, CephFS FUSE. Libraries: librbd, libcephfs. Ceph Storage Cluster Protocol (librados). Daemons: OSDs, Monitors, MDSs.

5. DISTRIBUTED EVERYTHING

6. CRUSH: Hash-based; deterministic data placement; pseudo-random, weighted distribution; hierarchically defined failure domains.

7. ADVANTAGES: Avoids centralized data lookups; even data distribution; healing is distributed; abstracted storage backends.

8. CHALLENGES: Ceph loves homogeneity (per pool); Ceph loves concurrency; data integrity is expensive; data movement is unavoidable; distributed storage is hard!

9. BORING! How fast can we go? Let's test something Fun!

10. Supermicro SC847A 36-drive chassis; 2x Intel Xeon E5-2630L; 4x LSI SAS9207-8i controllers; 24x 1TB 7200rpm spinning disks; 8x Intel 520 SSDs; bonded 10GbE network. Total cost: ~$12k.

11. [Chart] Cuttlefish RADOS Bench 4M Object Throughput (4 processes, 128 concurrent operations). Series: BTRFS, EXT4, XFS. X-axis: Write, Read. Y-axis: Throughput (MB/s), 0 to 2500.

12. Yeah, yeah, the bonded 10GbE network is maxed out. Good for you Mark.

13. Who cares about RADOS Bench though? I've moved to the cloud and do lots of small writes on block storage.

14.

15. OK, if Ceph is so awesome why are you only testing 1 server? How does it scale?

16. Oak Ridge National Laboratory: 4 storage servers, 8 client nodes; DDN SFA10K storage chassis; QDR InfiniBand everywhere; a boatload of drives!

17. [Chart] ORNL Multi-Server RADOS Bench Throughput (4MB IOs, 8 client nodes). Series: Writes, Reads, Writes (Including Journals), Disk Fabric Max, Client Network Max. X-axis: Server Nodes (11 OSDs each), 1 to 4. Y-axis: Throughput (MB/s), 0 to 14000.

18. So RADOS is scaling nicely. How much does data replication hurt us?

19. [Chart] ORNL 4MB RADOS Bench Throughput. Series: Write, Read, Total Write (Including Journals). X-axis: Replication Level, 1 to 3. Y-axis: Throughput (MB/s), 0 to 12000.

20. This is an HPC site. What about CephFS? NOTE: CephFS is not production ready! (Marketing and sales can now sleep again)

21. [Chart] ORNL 4M CephFS (IOR) Throughput. Series: Max Write, Avg Write, Max Read, Avg Read. X-axis: Client Nodes (8 processes each), 1 to 8. Y-axis: Throughput (MiB/s), 0 to 7000.

22. Hundreds of cluster configurations; hundreds of tunable settings; hundreds of potential IO patterns. Too many permutations to test everything!

23. When performance is bad, how do you diagnose?

24. Ceph Admin Socket

25.

26. Collectl

27.

28. Blktrace & Seekwatcher

29.

30. perf

31.

32. Where are we going from here?

33. More testing and bug fixes! Erasure coding; cloning from journal writes (BTRFS); RSOCKETS/RDMA; tiering.

34. THANK YOU
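
The two ceilings that the benchmark slides keep running into (the network on slide 12, and journal plus replica writes on slides 17 and 19) can be estimated with a little arithmetic. The sketch below is only illustrative. It assumes the bond on slide 10 consists of two 10GbE links (the slide does not say), and that each replica is written once to a journal and once to the data store. The function names are hypothetical.

```python
def line_rate_mb_s(links, gbit_per_link):
    """Raw line rate of bonded links in MB/s (decimal units, no protocol overhead)."""
    return links * gbit_per_link * 1000 / 8


def backend_write_mb_s(client_mb_s, replicas, journal=True):
    """Write rate the storage devices must absorb for a given client write rate.

    Assumption: every replica is written in full, and with journaling each
    replica is written twice (journal first, then the data store).
    """
    copies = replicas * (2 if journal else 1)
    return client_mb_s * copies


# Assumed two-link 10GbE bond: the most the client network could carry.
print(line_rate_mb_s(2, 10))          # 2500.0

# Hypothetical 1000 MB/s of client writes at replication level 3.
print(backend_write_mb_s(1000, 3))    # 6000
```

Under these assumptions, client write throughput at replication level 3 can be at most one sixth of what the disks and journals sustain in aggregate, which is why the charts plot a separate series that includes journal writes.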

Editor's Notes

Yes, that is a Cephalopod attacking a police box. Likeness to any existing objects, characters, or ideas is purely coincidental.
My name is Mark and I work for a company called Inktank making an open source distributed storage system called Ceph. Before I started working for Inktank, I worked for the Minnesota Supercomputing Institute. My job was to figure out how to make our clusters run as efficiently as possible. A lot of researchers ran code on the clusters that didn't make good use of the expensive network fabrics those systems have. We worked to find better alternatives for these folks and ended up prototyping a high-performance OpenStack cloud for research computing. The one piece that was missing was the storage solution. That's how I discovered Ceph.
I had heard of Ceph through my work for the Institute. The original development was funded by a high performance computing research grant at Lawrence Livermore National Laboratory. It had been in development since 2004, but it was only around 2010 that I really started hearing about people deploying storage with it. Ceph itself is an amazing piece of software. It lets you take commodity servers and turn them into a high-performance, fault-tolerant, distributed storage solution. It was designed to scale from the beginning and is made up of many distinct components.
The primary building blocks of Ceph are the daemons that run on the nodes in the cluster. Ceph is composed of OSDs that store data and monitors that keep track of cluster health and state. When using Ceph as a distributed POSIX filesystem (CephFS), metadata servers may also be used. On top of these daemons are various APIs. The lowest-level API is librados, which can be used to interact with RADOS directly at the object level. Librbd and libcephfs provide file-like API access to RBD and CephFS respectively. Finally, we have the high-level block, object, and filesystem interfaces that make Ceph such a versatile storage solution.
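As a rough illustration of object-level access through librados, here is a minimal sketch using the Python `rados` binding. The pool name `rbd`, the object name `hello-object`, and the helper names are assumptions made for the example. `demo_against_cluster` only works on a machine with the binding installed and a reachable cluster.

```python
def store_and_fetch(ioctx, name, payload):
    """Write a whole object, then read it back through the same I/O context."""
    ioctx.write_full(name, payload)          # replace the object's contents
    return ioctx.read(name, len(payload))    # read back the same number of bytes


def demo_against_cluster(pool="rbd", conf="/etc/ceph/ceph.conf"):
    """Run the round trip against a real cluster (pool name is an assumption)."""
    import rados  # Python binding for librados; imported lazily on purpose

    cluster = rados.Rados(conffile=conf)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            print(store_and_fetch(ioctx, "hello-object", b"hello"))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

Nothing in the round trip asks a central server where the object lives. The client library works out the target OSDs itself and talks to them directly.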
If you take one thing away from this talk, it should be that Ceph is designed to be distributed. Any number of services can run on any number of nodes. You can have as many storage servers as you want. Cluster monitors are distributed and use a consensus algorithm called Paxos to avoid split-brain scenarios when servers fail. For S3- or Swift-compatible object storage, you can distribute requests across multiple gateways. When CephFS (Still Beta!) is used, the metadata servers can be distributed across multiple nodes and store metadata by distributing it across all of the OSD servers. When talking about data distribution specifically though, the crowning achievement in Ceph is CRUSH.
In many distributed storage systems there is some kind of centralized server that maintains an allocation table of where data is stored in the cluster. Not only is this a single point of failure, but it also can become a bottleneck as clients need to query this server to find out where data should be written or read. CRUSH does away with this. It is a hash based algorithm that allows any client to algorithmically determine where in the cluster data should be placed based on its name. Better yet, data is distributed across OSDs in the cluster pseudo-randomly. CRUSH also provides other benefits, like the ability to hierarchically define failure domains to ensure that replicated data ends up on different hardware.
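To make the idea of name-based placement concrete, here is a minimal sketch. It is not the real CRUSH algorithm: it uses weighted rendezvous hashing as a stand-in, and the OSD names, host names, and weights are made up for illustration. It shows the three properties described above: any client can compute placement from the object name alone, objects spread pseudo-randomly according to weight, and replicas land in different failure domains (here, hosts).

```python
import hashlib
import math


def _unit_hash(name: str, osd: str) -> float:
    """Map (object name, OSD id) to a deterministic float in (0, 1]."""
    digest = hashlib.sha256(f"{name}:{osd}".encode()).digest()
    value = int.from_bytes(digest[:8], "big")
    return (value + 1) / (2**64 + 1)


def place(name, osds, replicas=2):
    """Pick `replicas` OSDs for `name`, at most one per host.

    `osds` maps a (hypothetical) OSD id to {"host": ..., "weight": ...}.
    Every caller computes the same answer; no lookup table is consulted.
    """
    # Weighted rendezvous hashing: a higher weight pulls the (negative)
    # score toward zero, so heavier OSDs win proportionally more often.
    ranked = sorted(
        osds,
        key=lambda osd: math.log(_unit_hash(name, osd)) / osds[osd]["weight"],
        reverse=True,
    )
    chosen, used_hosts = [], set()
    for osd in ranked:
        host = osds[osd]["host"]
        if host in used_hosts:
            continue  # keep replicas in separate failure domains
        chosen.append(osd)
        used_hosts.add(host)
        if len(chosen) == replicas:
            break
    return chosen


# Hypothetical four-OSD, two-host cluster.
CLUSTER = {
    "osd.0": {"host": "node-a", "weight": 1.0},
    "osd.1": {"host": "node-a", "weight": 1.0},
    "osd.2": {"host": "node-b", "weight": 1.0},
    "osd.3": {"host": "node-b", "weight": 2.0},
}

if __name__ == "__main__":
    for obj in ("alpha", "beta", "gamma"):
        print(obj, place(obj, CLUSTER))
```

Real CRUSH walks a hierarchy of buckets and rules rather than a flat dictionary, but the key point carries over: placement is a pure function of the name and the cluster map.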
From a performance perspective Ceph has a lot of benefits over many other distributed storage systems. There is no centralized server that can become a bottleneck for data allocation lookups. Data is well distributed across OSDs due to the pseudo-random nature of CRUSH. On traditional storage solutions a RAID array is used on each server. When a disk fails, the RAID array needs to be rebuilt, which causes a hotspot in the cluster that drags performance down, and RAID rebuilds can last a long time. Because data in Ceph is pseudo-randomly distributed, healing happens cluster-wide, which dramatically speeds up the recovery process.
One of the challenges in any distributed storage system is what happens when you have hotspots in the cluster. If any one server is slow to fulfill requests, they will start backing up. Eventually a limit is reached where all outstanding requests will be concentrated on the slower server(s) starving the other faster servers and potentially degrading overall cluster performance significantly. Likewise distributed storage systems in general need a lot of concurrency to keep all of the servers and disks constantly working. From a performance perspective, another challenge regarding Ceph specifically is that Ceph works really hard to ensure data integrity. It does a full write of the data for every journal commit, does crc32 checksums for every data transfer, and regularly does background scrubs.
You guys have been patient so far but if you are anything like me then your attention span is starting to wear thin about now. Email or Slashdot could be looking appealing. So let's switch things up a little bit. We've talked about why Ceph is conceptually so amazing, but what can it really deliver as far as performance goes? That was the question we asked about a year ago after seeing some rather lacklustre results on some of our existing internal test hardware. Our director of engineering told me to go forth and build a system that would give us some insight into how Ceph could perform on (what in my opinion would be) an ideal setup.
One of the things that we noticed from our previous testing is that some systems are harder to get working well than others. One potential culprit appeared to be that some expander backplanes may not behave entirely properly. For this system, we decided to skip expanders entirely and directly connect each drive in the system to its own controller SAS lane. That means that with 24 spinning disks and 8 SSDs we needed 4 dual-port controllers to connect all of the drives. With so many disks in this system, we'd need a lot of external network connectivity and opted for a bonded 10GbE setup. We only have a single client, which could be a bottleneck, but at that point we were just hoping to break 1GB/s to start out with, which seemed feasible. So were we able to do it?
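The controller count follows from simple lane arithmetic. The sketch below assumes each controller exposes 8 internal SAS lanes (two 4-lane ports), which is what the "-8i" suffix of the SAS9207-8i indicates; the function name is just for illustration.

```python
import math

SPINNING_DISKS = 24
SSDS = 8
LANES_PER_CONTROLLER = 8  # assumption: two 4-lane ports per controller


def controllers_needed(drives: int, lanes_per_controller: int) -> int:
    """Controllers required when every drive gets a dedicated SAS lane."""
    return math.ceil(drives / lanes_per_controller)


total_drives = SPINNING_DISKS + SSDS
print(total_drives, controllers_needed(total_drives, LANES_PER_CONTROLLER))
# prints "32 4"
```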
What you are looking at is a chart showing the write and read throughput on our test platform using our RADOS bench tool. This tool directly utilizes librados to write objects out as fast as possible and after writes complete, read them back. We're doing syncs and flushes between the tests on both the client and server and have measured the underlying disk throughput to make sure the results are accurate. What you are seeing here is that for writes, we not only hit 1GB/s, but are in fact hitting 2GB/s and maxing out the bonded 10GbE link. Reads aren't quite saturating the network but are coming pretty close. This is really good news because it means that with the right hardware, you can build Ceph nodes that can perform extremely well.
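For anyone who wants to try something similar, here is a rough sketch of how the write and read passes could be expressed as `rados bench` command lines. The pool name, run time, and the choice of 32 concurrent operations per process are assumptions (the slide only says 4 processes and 128 concurrent operations), and the helper only builds the argument list; it does not run anything. Check `rados --help` on your version before relying on the flags.

```python
import shlex


def rados_bench_cmd(pool, seconds, mode, concurrent_ops,
                    object_bytes=None, keep_objects=False):
    """Build an argument list for one `rados bench` run.

    mode is "write", "seq" (sequential read) or "rand" (random read).
    """
    cmd = ["rados", "-p", pool, "bench", str(seconds), mode,
           "-t", str(concurrent_ops)]
    if object_bytes is not None:
        cmd += ["-b", str(object_bytes)]  # object size, used for writes
    if keep_objects:
        cmd.append("--no-cleanup")  # leave objects in place for the read pass
    return cmd


FOUR_MB = 4 * 1024 * 1024

# Hypothetical pool name and duration.
write_cmd = rados_bench_cmd("benchpool", 300, "write", 32,
                            object_bytes=FOUR_MB, keep_objects=True)
read_cmd = rados_bench_cmd("benchpool", 300, "seq", 32)

if __name__ == "__main__":
    print(shlex.join(write_cmd))
    print(shlex.join(read_cmd))
```

Running several copies of the write command in parallel, then syncing and dropping caches before the read pass, would approximate the procedure described above.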
I like to show the previous slide because it makes a big impact and I get to feel vindicated regarding spending a bunch of our startup money on new toys. And who doesn't like seeing a single server able to write out 2GB/s+ of data? The problem though is that reading and writing out 4MB objects directly via librados isn't necessarily a great representation of how Ceph will really perform once you layer block or S3 storage on top of it.
Say for instance that you are using Ceph to provide block storage via RBD for OpenStack and have an application that does lots of 4K writes. Testing RADOS Bench with 4K objects might give you some rough idea of how RBD performs with small IOs, but it's also misleading. One of the things that you might not know is that RBD stores each block in a 4MB object behind the scenes. Doing 4K writes to 4MB objects results in different behavior than writing out distinct 4K objects themselves. Throw in the writeback cache implementation in QEMU RBD and things get complicated very fast. Ultimately you really do have to do the tests directly on RBD to know what's going on.
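A small sketch of why the two cases differ, assuming the simple layout described above (an image cut into consecutive 4MB objects, no fancier striping). The function names are illustrative and not part of librbd.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4MB backing objects, as described in the talk
IO_SIZE = 4096                 # 4K writes


def locate(offset, object_size=OBJECT_SIZE):
    """Return (object index, byte offset inside that object) for an image offset."""
    return divmod(offset, object_size)


def objects_touched(offsets, object_size=OBJECT_SIZE):
    """Set of backing-object indexes hit by writes at the given image offsets."""
    return {offset // object_size for offset in offsets}


# 1024 sequential 4K writes cover exactly the first 4MB of the image.
sequential_offsets = [i * IO_SIZE for i in range(1024)]

print(locate(5 * 1024 * 1024))                   # (1, 1048576)
print(len(objects_touched(sequential_offsets)))  # 1
```

So 1024 sequential 4K writes through RBD keep updating a single 4MB object, whereas RADOS bench with 4K objects would create 1024 separate objects. The two workloads look very different to the OSDs.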
...And we've done that. In alarming detail. What you are seeing here is a comparison of sequential and random 4K writes using a really useful benchmarking tool called fio. We are testing Kernel RBD and QEMU RBD at differing concurrency and IO depth values. What you may notice is that the scaling behavior and throughput looks very different in each case. QEMU RBD with the writeback cache enabled is performing much better. Interestingly RBD cache not only helps sequential writes, but seems to help with random writes too (especially at low levels of concurrency). This is one example of the kind of testing we do, but we have thousands of graphs like this exploring different workloads and system configurations. It's a lot of work!
We've gotten a lot of interesting results from our high performance test node and published quite a few of those results in different articles on the web. Unfortunately we only have 1 of those nodes, and what our readers have started asking for are tests showing how performance scales across nodes. Do you remember what I said earlier about Ceph loving homogeneity and lots of concurrency? The truth is that the more consistently your hardware behaves, especially over time, the better your cluster is going to scale. Since I like to show good results instead of bad ones, let's take a look at an example of a cluster that's scaling really well.
About a year and a half ago Oak Ridge National Laboratory (ORNL) reached out to Inktank to investigate how Ceph performs on a high performance storage system they have in their lab. This platform is a bit different than the typical platform that we deploy Ceph on. It has a ton (over 400!) of drives configured in RAID LUNs that are in chassis connected to the servers via QDR Infiniband links. As a result, the back-end storage maxes out at about 12GB/s but has pretty consistent performance characteristics since there are so many drives behind the scenes. Initially the results ORNL was seeing were quite bad. We worked together with them to find optimal hardware settings and did a lot of tuning and testing. By the time we were done performance had improved quite a bit.
This chart shows performance on the ORNL cluster in its final configuration using RADOS bench. Notice that throughput is scaling fairly linearly as we add storage nodes to the cluster. The blue and red lines represent write and read performance respectively. If you just look at the write performance, the results might seem disappointing. Ceph, however, is doing full data writes to the journals to guarantee the atomicity of its write operations. Normally in high performance situations we get around this by putting journals on SSDs, but this solution unfortunately doesn't have any. Another limitation is that the client network is using IPoIB, and on this hardware that means the clients will never see more than about 10GB/s aggregate throughput. Despite these limitations, we are scaling well and throughput to the storage chassis is pretty good!
All of the results I've shown so far have been designed to showcase how much throughput we can push from the client and are not using any kind of replication. On the ORNL hardware this is probably justifiable because they are using RAID5 arrays behind the scenes and for some solutions like HPC scratch storage, running without replication may be acceptable. For a lot of folks Ceph's seamless support for replication is what makes it so compelling. So how much does replication hurt?
As you might expect, replication has a pretty profound impact on write performance. Between doing journal writes and 3x replication, we see that client write performance is over 6 times slower than the actual write speed to the DDN chassis. What is probably more interesting is how the total write throughput changes. Going from 1x to 2x replication lowers the overall write performance by about 15-20%. When Ceph writes data to an OSD, the data must be written to the journal (BTRFS is a special case), and to the replica OSD's journal before the acknowledgement can be sent to the client. This not only results in extra data being written, but extra latency for every write. Read performance remains high regardless of replication.
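The "over 6 times" figure lines up with a back-of-the-envelope estimate. The sketch below assumes every replica writes the data twice (once to the journal, once to the data store) and that journals share the same back-end storage, as on this cluster; BTRFS and SSD journals would change the picture. The ~12GB/s back-end figure is the chassis maximum mentioned earlier, and the function names are illustrative.

```python
def backend_bytes_per_client_byte(replicas, journal=True):
    """Bytes hitting the back-end storage for each byte the client writes."""
    copies_per_replica = 2 if journal else 1  # journal write + data write
    return replicas * copies_per_replica


def client_write_ceiling(backend_gb_per_s, replicas, journal=True):
    """Best-case client write throughput given a back-end limit."""
    return backend_gb_per_s / backend_bytes_per_client_byte(replicas, journal)


for replicas in (1, 2, 3):
    print(replicas,
          backend_bytes_per_client_byte(replicas),
          client_write_ceiling(12.0, replicas))
# prints:
# 1 2 6.0
# 2 4 3.0
# 3 6 2.0
```

A 6x amplification at 3x replication is only the floor; the extra latency of waiting for replica journals accounts for the rest of the gap.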
So again I've shown a bunch of RADOS bench results, but that's not what people ultimately care about. For high performance computing, customers really want CephFS, our distributed POSIX filesystem. Before we go on, let me say that our block and object layers are production ready, but CephFS is still in beta. It's probably the most complex interface we have on top of Ceph and there are still a number of known bugs. We've also done very little performance tuning, so when we started this testing we were pretty unsure about how it would perform.
To test CephFS, we settled on a tool that is commonly used in the HPC space called IOR. It coordinates IO on multiple client nodes using MPI and has many options that are useful for testing high performance storage systems. When we first started testing CephFS, the performance was lower than we hoped. Through a series of tests, profiling, and investigation we were able to tweak the configuration to produce the results you see here. With 8 client nodes, writes are nearly as high as what we saw with RADOS bench, but reads have topped out lower. We are still seeing some variability in the results and have more tuning to do, but are happy with the performance we've been able to get so far given CephFS's level of maturity.
The results that we've shown thus far are only a small sample of the barrage of tests that we run. We have hundreds, if not thousands of graphs and charts showcasing performance of Ceph on different hardware, and serving different kinds of IO. Given how open Ceph is, this is ultimately going to be a losing battle. There are too many platforms, too many applications, and too many ways performance can be impacted to capture them all.
So given that you can't catch everything ahead of time, what do you do when cluster performance is lower than you expected? First, make a pot of coffee because you may be in for a long night. It's going to take some blood, sweat and tears, but luckily some other folks have paved the way and developed some very useful tools that can make the job a lot easier.
Ha, ran out of time. Slide notes end here. :)
