This document summarizes Noah Watkins' presentation on building a distributed shared log using Ceph. The key points are:
1) Noah discusses how shared logs are challenging to scale due to the need to funnel all writes through a total ordering engine. This bottlenecks performance.
2) CORFU is introduced as a shared log design that decouples I/O from ordering by striping the log across flash devices and using a sequencer to assign positions.
3) Noah then explains how the components of CORFU can be mapped onto Ceph, using RADOS object classes, librados, and striping policies to implement the shared log without requiring custom hardware interfaces.
4) ZLog is presented as an open-source implementation of CORFU on Ceph, along with example applications such as PostgreSQL logical replication (pg_zlog) and the CruzDB key-value store.
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
After a short introduction to distributed storage and a description of Ceph, Jian Zhang presents some interesting benchmarks in this talk: sequential tests, random tests, and above all a comparison of results before and after optimization. The configuration parameters touched and the optimizations applied (large page numbers, omap data on a separate disk, ...) bring at least a 2x performance improvement.
Ceph at Work in Bloomberg: Object Store, RBD and OpenStackRed_Hat_Storage
Bloomberg's Chris Jones and Chris Morgan joined Red Hat Storage Day New York on 1/19/16 to explain how Red Hat Ceph Storage helps the financial giant tackle its data storage challenges.
Storage tiering and erasure coding in Ceph (SCaLE13x)Sage Weil
Ceph is designed around the assumption that all components of the system (disks, hosts, networks) can fail, and has traditionally leveraged replication to provide data durability and reliability. The CRUSH placement algorithm is used to allow failure domains to be defined across hosts, racks, rows, or datacenters, depending on the deployment scale and requirements.
Recent releases have added support for erasure coding, which can provide much higher data durability and lower storage overheads. However, in practice erasure codes have different performance characteristics than traditional replication and, under some workloads, come at some expense. At the same time, we have introduced a storage tiering infrastructure and cache pools that allow alternate hardware backends (like high-end flash) to be leveraged for active data sets while cold data are transparently migrated to slower backends. The combination of these two features enables a surprisingly broad range of new applications and deployment configurations.
This talk will cover a few Ceph fundamentals, discuss the new tiering and erasure coding features, and then discuss a variety of ways that the new capabilities can be leveraged.
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu ChaiCeph Community
Cephalocon APAC 2018
March 22-23, 2018 - Beijing, China
Greg Farnum,Red Hat RADOS Core Developer
Josh Durgin, Red Hat RADOS Lead
Kefu Chai, Red Hat Senior Software Engineer
Ceph is an open source distributed storage system designed for scalability and reliability. Ceph's block device, RADOS block device (RBD), is widely used to store virtual machines, and is the most popular block storage used with OpenStack.
In this session, you'll learn how RBD works, including how it:
* Uses RADOS classes to make access easier from user space and within the Linux kernel.
* Implements thin provisioning.
* Builds on RADOS self-managed snapshots for cloning and differential backups.
* Increases performance with caching of various kinds.
* Uses watch/notify RADOS primitives to handle online management operations.
* Integrates with QEMU, libvirt, and OpenStack.
Ceph on Intel: Intel Storage Components, Benchmarks, and ContributionsColleen Corrice
At Red Hat Storage Day Minneapolis on 4/12/16, Intel's Dan Ferber presented on Intel storage components, benchmarks, and contributions as they relate to Ceph.
HKG15-401: Ceph and Software Defined Storage on ARM serversLinaro
HKG15-401: Ceph and Software Defined Storage on ARM servers
---------------------------------------------------
Speakers: Yazen Ghannam, Steve Capper
Date: February 12, 2015
---------------------------------------------------
★ Session Summary ★
Running Ceph in a colocation facility, and ongoing optimizations
--------------------------------------------------
★ Resources ★
Pathable: https://hkg15.pathable.com/meetings/250828
Video: https://www.youtube.com/watch?v=RdZojLL7ttk
Etherpad: http://pad.linaro.org/p/hkg15-401
---------------------------------------------------
★ Event Details ★
Linaro Connect Hong Kong 2015 - #HKG15
February 9-13th, 2015
Regal Airport Hotel Hong Kong Airport
---------------------------------------------------
http://www.linaro.org
http://connect.linaro.org
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kevin Lynch
In this presentation I talk about our motivation for converting our microservices to run on Kubernetes. I discuss many of the technical challenges we encountered along the way, including networking issues, Java issues, monitoring and alerting, and managing all of our resources!
Testing Persistent Storage Performance in Kubernetes with SherlockScyllaDB
Understanding your Kubernetes storage capabilities is important for running a proper cluster in production. In this session I will demonstrate how to use Sherlock, an open source platform written to test persistent NVMe/TCP storage in Kubernetes, either via synthetic workloads or via a variety of databases, all easily done and summarized to give you an estimate of the IOPS, latency, and throughput your storage can provide to the Kubernetes cluster.
HKG18-411 - Introduction to OpenAMP which is an open source solution for hete...Linaro
Session ID: HKG18-411
Session Name: HKG18-411 - Introduction to OpenAMP which is an open source solution for heterogeneous system orchestration and communication
Speaker: Wendy Liang
Track: IoT, Embedded
★ Session Summary ★
Introduction to OpenAMP which is an open source solution for heterogeneous system orchestration and communication
---------------------------------------------------
★ Resources ★
Event Page: http://connect.linaro.org/resource/hkg18/hkg18-411/
Presentation: http://connect.linaro.org.s3.amazonaws.com/hkg18/presentations/hkg18-411.pdf
Video: http://connect.linaro.org.s3.amazonaws.com/hkg18/videos/hkg18-411.mp4
---------------------------------------------------
★ Event Details ★
Linaro Connect Hong Kong 2018 (HKG18)
19-23 March 2018
Regal Airport Hotel Hong Kong
---------------------------------------------------
Keyword: IoT, Embedded
http://www.linaro.org
http://connect.linaro.org
---------------------------------------------------
Follow us on Social Media
https://www.facebook.com/LinaroOrg
https://www.youtube.com/user/linaroorg?sub_confirmation=1
https://www.linkedin.com/company/1026961
Netflix Open Source Meetup Season 4 Episode 2aspyker
In this episode, we will take a close look at 2 different approaches to high-throughput/low-latency data stores, developed by Netflix.
The first, EVCache, is a battle-tested distributed memcached-backed data store, optimized for the cloud. You will also hear about the road ahead for EVCache as it evolves into an L1/L2 cache over RAM and SSDs.
The second, Dynomite, is a framework to make any non-distributed data-store, distributed. Netflix's first implementation of Dynomite is based on Redis.
Come learn about the products' features and hear from Thomson Reuters, Diego Pacheco from Ilegra, and other third-party speakers, internal and external to Netflix, on how these products fit in their stack and roadmap.
Internal presentation on Docker, lightweight virtualization, and Linux Containers at the Spotify NYC offices, featuring engineers from Yandex, LinkedIn, Criteo, and NASA!
Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...OpenShift Origin
Extending OpenShift Origin: Build Your Own Cartridge
Presenters: Bill DeCoste
Cartridges allow developers to provide services running on top of the Red Hat OpenShift Platform-as-a-Service (PaaS). OpenShift already provides cartridges for numerous web application frameworks and databases. Writing your own cartridges allows you to customize or enhance an existing service, or provide new services. In this session, the presenter will discuss best practices for cartridge development and the latest changes in the OpenShift cartridge support.
* Latest changes made in the platform to ease cartridge development
* OpenShift Cartridges vs. plugins
* Outline for development of a new cartridge
* Customization of existing cartridges
* Quickstarts: leveraging a cartridge or cartridges to provide a complete application
Benchmarking for postgresql workloads in kubernetesDoKC
ABSTRACT OF THE TALK
6 months have passed since our last DoK webinar about benchmarking PostgreSQL workloads in a Kubernetes environment. In the meantime, many things have happened at EDB, and we’re happy to share what we’ve learned in this timeframe. We’ll use cnp-bench and cnp-sandbox to help us describe some of the challenges we might face when running PostgreSQL workloads, how to spot them, and what actions to take to make your databases healthier and longer-lived.
cnp-bench is a collection of Helm charts that help run storage and database benchmarks, using popular open source tools like fio, pgbench, and HammerDB. cnp-sandbox is a Helm chart that sets up a Prometheus/Grafana stack, including basic metrics and dashboards for Cloud Native PostgreSQL, the Kubernetes operator developed by EDB. Both cnp-sandbox and cnp-bench are open source and recommended for development, testing, and pre-production environments only.
BIO
A long time open-source programmer and entrepreneur, Gabriele has a degree in Statistics from the University of Florence. After having consistently contributed to the growth of 2ndQuadrant and its members through nurturing a lean and devops culture, he is now leading the Cloud Native initiative at EDB. Gabriele lives in Prato, a small but vibrant city located in the northern part of Tuscany, Italy - famous for having hosted the first European PostgreSQL conferences. His second home is Melbourne, Australia, where he studied at Monash University and worked in the ICT sector. He loves playing the Blues with his Fender Stratocaster, but his major passions are called Elisabeth and Charlotte!
KEY TAKE-AWAYS FROM THE TALK
- A methodology for benchmarking a PostgreSQL database in Kubernetes
- Open source set of tools for benchmarking a PostgreSQL database in Kubernetes
- Reasons why benchmarking both the storage and the database is important
https://github.com/EnterpriseDB/cnp-sandbox
https://github.com/EnterpriseDB/cnp-bench
Customize and Secure the Runtime and Dependencies of Your Procedural Language...VMware Tanzu
Customize and Secure the Runtime and Dependencies of Your Procedural Languages Using PL/Container
Greenplum Summit at PostgresConf US 2018
Hubert Zhang and Jack Wu
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms, and analyses; how they set up and keep schemas in sync between Hive, Presto, Redshift, and Spark; and how they make access easy for their data scientists. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...Yandex
"Lightweight virtualization", also called "OS-level virtualization", is not new. On Linux it evolved from VServer to OpenVZ, and, more recently, to Linux Containers (LXC). It is not Linux-specific; on FreeBSD it's called "Jails", while on Solaris it's "Zones". Some of those have been available for a decade and are widely used to provide VPS (Virtual Private Servers), cheaper alternatives to virtual machines or physical servers. But containers have other purposes and are increasingly popular as the core components of public and private Platform-as-a-Service (PAAS), among others.
Just like a virtual machine, a Linux Container can run (almost) anywhere. But containers have many advantages over VMs: they are lightweight and easier to manage. After operating a large-scale PAAS for a few years, dotCloud realized that with those advantages, containers could become the perfect format for software delivery, since that is how dotCloud delivers from their build system to their hosts. To make it happen everywhere, dotCloud open-sourced Docker, the next generation of the containers engine powering its PAAS. Docker has been extremely successful so far, being adopted by many projects in various fields: PAAS, of course, but also continuous integration, testing, and more.
Kubernetes @ Squarespace: Kubernetes in the DatacenterKevin Lynch
This talk was presented at SRE NYC Meetup on August 16, 2017 at Squarespace HQ.
https://www.youtube.com/watch?v=UJ1QAKprVr4
As the engineering teams at Squarespace grow, we have been building more and more microservices. However, this has added operational strain as we try to shoehorn a growing, complex dynamic environment into our static data center infrastructure. We needed to rethink how we handle deployments, dependency management, resource allocation, monitoring, and alerting. Docker containerization and Kubernetes orchestration helps us tackle many of these problems, but the journey has been challenging. In this talk, we’ll discuss the challenges of running Kubernetes in a datacenter and how we switched to a more SLA-focused alert structure than per instance health with Prometheus and AlertManager.
Similar to Experiences building a distributed shared log on RADOS - Noah Watkins (20)
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We also had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring of JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Experiences building a distributed shared log on RADOS - Noah Watkins
1. Experiences building a distributed
shared-log on RADOS
Noah Watkins
UC Santa Cruz
@noahdesu
2. About me
● Graduate student
● UC Santa Cruz
● Data management, file systems, HPC, QoS
● Ceph as a prototyping platform
3. About me
● Graduate student
● UC Santa Cruz
● Data management
● Storage systems
● High-performance computing
● Quality of service
4. About me
● Graduate student
● UC Santa Cruz
● Data management
● Storage systems
● High-performance computing
● Quality of service
● Ceph shop for prototyping
8. The agenda
● Why should I care about logs?
● How to build a really, really fast log
● Mapping techniques for fast logs onto Ceph
● A few usage examples
8
10. A log is an ordered list of data elements
● General abstract form of a log
● Ordered list of elements
10
0 1 2 3 4 5 6 7 8 9 10 11 12
11. A log is an ordered list of data elements
● General abstract form of a log
● Ordered list of elements
● New entries are appended to the log
11
0 1 2 3 4 5 6 7 8 9 10 11 12
clients
append
12. A log is an ordered list of data elements
● General abstract form of a log
● Ordered list of elements
● New entries are appended to the log
● Log entries can be read directly
12
0 1 2 3 4 5 6 7 8 9 10 11 12
clients
append
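To make the abstraction concrete, a minimal shared-log interface might look like the sketch below; this is illustrative C++ only (the Append/Read names mirror the slides, not the ZLog API).

#include <cstdint>
#include <string>

// Minimal sketch of the shared-log abstraction: a totally ordered sequence
// that supports appends and random reads by position.
class SharedLog {
 public:
  virtual ~SharedLog() = default;
  // Append data and report the position it was assigned.
  virtual int Append(const std::string& data, uint64_t* pos) = 0;
  // Read the entry stored at a given position.
  virtual int Read(uint64_t pos, std::string* data) = 0;
};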
13. Everyone is going bananas for logs
13
● Very high throughput logs
● No global ordering
● Strongly consistent, global ordering
● Write batching
● Single writer
Apache BookKeeper
19. Shared-logs are challenging to scale
clients
0 1
2
Paxos
append
Writes are funneled through a total ordering engine
19
0 1 2 3 4 5 6 7 8 9 10 11 12
20. Shared-logs are challenging to scale
clients
0 1
2
Paxos
append
Writes are funneled through a total ordering engine
20
0 1 2 3 4 5 6 7 8 9 10 11 12
Parallel reads supported when log is distributed
Random read workloads perform poorly on HDDs
21. CORFU: A Shared Log Design for Flash Clusters
21
Balakrishnan et al., NSDI 2012
25. CORFU stripes log across a cluster of flash devices
25
clients
1, 2, 3, 4...
0 1 2 3 4 5 6 7 8 9 10 11 12
RPC: Tail, Next
std::atomic<u64> seq;
tail = seq;
next = seq++;
Sequencer
100K-1M pos/sec
parallel append / reads
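The sequencer shown above is essentially an atomic counter exposed over RPC. A minimal standalone sketch of that counter, with illustrative names rather than the real service interface:

#include <atomic>
#include <cstdint>

class Sequencer {
 public:
  // "Next": reserve the next log position for an append.
  uint64_t Next() { return seq_.fetch_add(1); }
  // "Tail": observe the current tail without reserving a position.
  uint64_t Tail() const { return seq_.load(); }
 private:
  std::atomic<uint64_t> seq_{0};
};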
26. You might be thinking...
clients
1, 2, 3, 4, 5, ….
RPC: Tail, Next
std::atomic<u64> seq;
tail = seq;
next = seq++;
Sequencer
read append
Decouple append I/O from assignment of ordering
● It can’t possibly be this simple....
26
100K-1M pos/sec
27. You might be thinking...
clients
1, 2, 3, 4, 5, ….
RPC: Tail, Next
std::atomic<u64> seq;
tail = seq;
next = seq++;
Sequencer
read append
Decouple append I/O from assignment of ordering
● It can’t possibly be this simple.... you’d be correct
27
100K-1M pos/sec
28. You might be thinking...
clients
1, 2, 3, 4, 5, ….
RPC: Tail, Next
std::atomic<u64> seq;
tail = seq;
next = seq++;
Sequencer
read append
Decouple append I/O from assignment of ordering
● It can’t possibly be this simple.... you’d be correct
● Isn’t this supposed to be a Ceph talk?
28
100K-1M pos/sec
29. The CORFU I/O path
Log Interface (e.g. Append)
29
Cluster of flash devices
Append( ) ??
30. CORFU uses a cluster map and striping function
Log Interface (e.g. Append)
30
Cluster of flash devices
Append( )
Striping and
Addressing
31. CORFU replicates log entries across the cluster
Log Interface (e.g. Append)
31
Cluster of flash devices
Append( )
Striping and
Addressing
Fault tolerance
Replication
32. The CORFU protocol relies on custom I/O interface
Log Interface (e.g. Append)
32
Cluster of flash devices
Append( )
Striping and
Addressing
Fault tolerance
Custom
hardware
interface
Replication
33. A dedicated CORFU cluster is expensive...
● Duplicated services
○ Fault-tolerance
○ Metadata management
● Dedicated / over-provisioned hardware
○ You need a full cluster of flash devices!
● Custom storage interfaces
○ Hardware or software
● Already have a Ceph cluster?
○ Too bad
● Can we use software-defined storage?
33
34. Mapping the components of CORFU onto Ceph
Log Interface (e.g. Append)
34
Cluster of flash devices
Append( )
Striping and
Addressing
Fault tolerance
Custom
hardware
interface
Replication
35. A cluster of OSDs rather than raw storage devices
Log Interface (e.g. Append)
35
Cluster of OSDs
Append( )
Striping and
Addressing
Fault tolerance
Custom
hardware
interface
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
38. Storage interface built using RADOS object classes
Log Interface (e.g. Append)
38
Append( )
Striping and
Addressing
Object naming
RADOS
object
class
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
Replication
Cluster of OSDs
39. ZLog is an implementation of CORFU on Ceph
osd osd osd
39
clients
1, 2, 3, 4...
0 1 2 3 4 5 6 7 8 9 10 11 12
RPC: Tail, Next
std::atomic<u64> seq;
tail = seq;
next = seq++;
Sequencer
100K-1M pos/sec
parallel append / reads
BlueStore, RocksDB, LMDB
40. ZLog is an implementation of CORFU on Ceph
osd osd osd
40
clients
1, 2, 3, 4...
0 1 2 3 4 5 6 7 8 9 10 11 12
RPC: Tail, Next
std::atomic<u64> seq;
tail = seq;
next = seq++;
Sequencer
100K-1M pos/sec
parallel append / reads
BlueStore, RocksDB, LMDB
Transparent changes to hardware, software, tuning
41. ZLog is an implementation of CORFU on Ceph
osd osd osd
41
clients
1, 2, 3, 4...
0 1 2 3 4 5 6 7 8 9 10 11 12
RPC: Tail, Next
std::atomic<u64> seq;
tail = seq;
next = seq++;
Sequencer
100K-1M pos/sec
parallel append / reads
BlueStore, RocksDB, LMDB
Transparent changes to hardware, software, tuning
Integration with features like tiering, erasure coding, and I/O hinting
42. The ZLog project is open-source on Github
● https://github.com/cruzdb/zlog
● Development backend (LMDB)
● Tools to build Ceph plugin
● High and low-level APIs
42
// build a backend instance
auto backend = std::unique_ptr<zlog::Backend>(
new zlog::storage::ceph::CephBackend(ioctx));
// open the log
zlog::Log log;
int ret = zlog::Log::CreateWithBackend(std::move(backend),
"mylog", &log);
// append to the log
uint64_t pos;
log.Append(Slice("foo"), &pos);
43. The ZLog project is open-source on Github
● https://github.com/cruzdb/zlog
● Development backend (LMDB)
● Tools to build Ceph plugin
● High and low-level APIs
43
// build a backend instance
auto backend = std::unique_ptr<zlog::Backend>(
new zlog::storage::ceph::CephBackend(ioctx));
// open the log
zlog::Log log;
int ret = zlog::Log::CreateWithBackend(std::move(backend),
"mylog", &log);
// append to the log
uint64_t pos;
log.Append(Slice("foo"), &pos);
44. The ZLog project is open-source on Github
● https://github.com/cruzdb/zlog
● Development backend (LMDB)
● Tools to build Ceph plugin
● High and low-level APIs
44
// build a backend instance
auto backend = std::unique_ptr<zlog::Backend>(
new zlog::storage::ceph::CephBackend(ioctx));
// open the log
zlog::Log log;
int ret = zlog::Log::CreateWithBackend(std::move(backend),
"mylog", &log);
// append to the log
uint64_t pos;
log.Append(Slice("foo"), &pos);
45. The ZLog project is open-source on Github
● https://github.com/cruzdb/zlog
● Development backend (LMDB)
● Tools to build Ceph plugin
● High and low-level APIs
45
// build a backend instance
auto backend = std::unique_ptr<zlog::Backend>(
new zlog::storage::ceph::CephBackend(ioctx));
// open the log
zlog::Log log;
int ret = zlog::Log::CreateWithBackend(std::move(backend),
"mylog", &log);
// append to the log
uint64_t pos;
log.Append(Slice("foo"), &pos);
50. Design decisions for the ZLog I/O path
50
RADOS object classes
CORFU storage interface
OSD
librados
ZLog client library
Striping policy
● Log entry → Object
● Stripe log across objects
● … but not too many objects
51. Design decisions for the ZLog I/O path
51
RADOS object classes
CORFU storage interface
OSD
librados
ZLog client library
Striping policy
● Log entry → Object
● Stripe log across objects
● … but not too many objects
● Changes to striping policy
52. Design decisions for the ZLog I/O path
52
RADOS object classes
CORFU storage interface
OSD
librados
ZLog client library
Striping policy
● Log entry → Object
● Stripe log across objects
● … but not too many objects
● Changes to striping policy
● Coordinated metadata update
53. Design decisions for the ZLog I/O path
53
RADOS object classes
CORFU storage interface
OSD
librados
ZLog client library
Striping policy
● Log entry → Object
● Stripe log across objects
● … but not too many objects
● Changes to striping policy
● Coordinated metadata update
● Treat an object like a
consensus API
54. Design decisions for the ZLog I/O path
54
RADOS object classes
CORFU storage interface
OSD
librados
ZLog client library
Striping policy
● Log entry → Object
● Stripe log across objects
● … but not too many objects
● Changes to striping policy
● Coordinated metadata update
● Treat an object like a
consensus API
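One way to picture the log-entry-to-object mapping above is a small striping function; the stripe width and naming scheme here are assumptions for illustration, not the actual ZLog policy:

#include <cstdint>
#include <string>

// Map a log position to one of `stripe_width` objects so appends spread
// across OSDs while each position still has a deterministic home. Changing
// the width later is the "change to striping policy" that requires a
// coordinated metadata update.
std::string object_for_position(const std::string& log_name,
                                uint64_t pos, uint32_t stripe_width) {
  const uint64_t stripe_index = pos % stripe_width;
  return log_name + "." + std::to_string(stripe_index);
}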
55. The CORFU storage interface
● write(obj, position, data)
● read(obj, position)
● invalidate(obj, position)
● trim(obj, position)
○ Mark for GC
● seal(obj, epoch)
55
56. The CORFU storage interface
● write(obj, position, data)
● read(obj, position)
● invalidate(obj, position)
● trim(obj, position)
○ Mark for GC
● seal(obj, epoch)
56
57. The CORFU storage interface
● write(obj, position, data)
○ Write-once: position must be open
● read(obj, position)
● invalidate(obj, position)
● trim(obj, position)
○ Mark for GC
● seal(obj, epoch)
57
58. The CORFU storage interface
● write(obj, position, data)
○ Write-once: position must be open
○ Request must be tagged with up-to-date epoch
● read(obj, position)
● invalidate(obj, position)
● trim(obj, position)
○ Mark for GC
● seal(obj, epoch)
58
59. The CORFU storage interface
● write(obj, position, data)
○ Write-once: position must be open
○ Request must be tagged with up-to-date epoch
○ Position must not fall within a GC’d range
● read(obj, position)
● invalidate(obj, position)
● trim(obj, position)
○ Mark for GC
● seal(obj, epoch)
59
60. The CORFU storage interface
● write(obj, position, data)
○ Write-once: position must be open
○ Request must be tagged with up-to-date epoch
○ Position must not fall within a GC’d range
○ Optionally update maximum position observed
● read(obj, position)
● invalidate(obj, position)
● trim(obj, position)
○ Mark for GC
● seal(obj, epoch)
60
61. The CORFU storage interface
● write(obj, position, data)
○ Write-once: position must be open
○ Request must be tagged with up-to-date epoch
○ Position must not fall within a GC’d range
○ Optionally update maximum position observed
○ Atomic update
● read(obj, position)
● invalidate(obj, position)
● trim(obj, position)
○ Mark for GC
● seal(obj, epoch)
61
62. librados won’t [currently] cut it for this job
● write(obj, position, data)
○ Write-once: position must be open
○ Request must be tagged with up-to-date epoch
○ Position must not fall within a GC’d range
○ Optionally update maximum position observed
○ Atomic update
● read(obj, position)
● invalidate(obj, position)
● trim(obj, position)
○ Mark for GC
● seal(obj, epoch)
62
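Summarizing the interface above as an abstract C++ class; the method names follow the slides, while the types and error conventions are illustrative assumptions:

#include <cstdint>
#include <string>

class CorfuStorage {
 public:
  virtual ~CorfuStorage() = default;
  // Write-once: fails if the position is already filled, trimmed, or
  // invalidated, or if `epoch` is stale.
  virtual int Write(uint64_t epoch, uint64_t pos, const std::string& data) = 0;
  virtual int Read(uint64_t epoch, uint64_t pos, std::string* data) = 0;
  // Mark a position as junk so readers skip it.
  virtual int Invalidate(uint64_t epoch, uint64_t pos) = 0;
  // Mark a position for garbage collection.
  virtual int Trim(uint64_t epoch, uint64_t pos) = 0;
  // Reject all future requests tagged with an older epoch.
  virtual int Seal(uint64_t epoch) = 0;
};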
63. CORFU write interface as an object class
63
RADOS object classes
CORFU storage interface
OSD
librados
ZLog client library
Striping policy
int write(context_t hctx, bufferlist *in, bufferlist *out) {
// ensure up-to-date client
if (epoch_guard(header, op.epoch(), false)) {
return -ESPIPE;
}
// read entry based on request position
ceph::bufferlist bl;
int ret = cls_cxx_map_get_val(hctx, key, &bl);
if (ret < 0)
return ret;
if (entry && !decode(bl, entry))
return -EIO;
return 0;
// write entry if position is not taken
if (ret == -ENOENT) {
entry_write_entry(hctx, key, entry);
if (!header.has_max_pos() || (op.pos() > header.max_pos()))
header.set_max_pos(op.pos());
ret = entry_write_header(hctx, header);
} else {
// handle errors associated with write-once semantics
}
}
64. CORFU write interface as an object class
64
RADOS object classes
CORFU storage interface
OSD
librados
ZLog client library
Striping policy
int write(context_t hctx, bufferlist *in, bufferlist *out) {
// ensure up-to-date client
if (epoch_guard(header, op.epoch(), false)) {
return -ESPIPE;
}
// read entry based on request position
ceph::bufferlist bl;
int ret = cls_cxx_map_get_val(hctx, key, &bl);
if (ret < 0)
return ret;
if (entry && !decode(bl, entry))
return -EIO;
return 0;
// write entry if position is not taken
if (ret == -ENOENT) {
entry_write_entry(hctx, key, entry);
if (!header.has_max_pos() || (op.pos() > header.max_pos()))
header.set_max_pos(op.pos());
ret = entry_write_header(hctx, header);
} else {
// handle errors associated with write-once semantics
}
}
connection to client
65. CORFU write interface as an object class
65
RADOS object classes
CORFU storage interface
OSD
librados
ZLog client library
Striping policy
int write(context_t hctx, bufferlist *in, bufferlist *out) {
// ensure up-to-date client
if (epoch_guard(header, op.epoch(), false)) {
return -ESPIPE;
}
// read entry based on request position
ceph::bufferlist bl;
int ret = cls_cxx_map_get_val(hctx, key, &bl);
if (ret < 0)
return ret;
if (entry && !decode(bl, entry))
return -EIO;
return 0;
// write entry if position is not taken
if (ret == -ENOENT) {
entry_write_entry(hctx, key, entry);
if (!header.has_max_pos() || (op.pos() > header.max_pos()))
header.set_max_pos(op.pos());
ret = entry_write_header(hctx, header);
} else {
// handle errors associated with write-once semantics
}
}
connection to client
66. CORFU write interface as an object class
66
RADOS object classes
CORFU storage interface
OSD
librados
ZLog client library
Striping policy
int write(context_t hctx, bufferlist *in, bufferlist *out) {
// ensure up-to-date client
if (epoch_guard(header, op.epoch(), false)) {
return -ESPIPE;
}
// read entry based on request position
ceph::bufferlist bl;
int ret = cls_cxx_map_get_val(hctx, key, &bl);
if (ret < 0)
return ret;
if (entry && !decode(bl, entry))
return -EIO;
return 0;
// write entry if position is not taken
if (ret == -ENOENT) {
entry_write_entry(hctx, key, entry);
if (!header.has_max_pos() || (op.pos() > header.max_pos()))
header.set_max_pos(op.pos());
ret = entry_write_header(hctx, header);
} else {
// handle errors associated with write-once semantics
}
}
connection to client
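The slide listing condenses several concerns into one excerpt. Below is a simplified, hypothetical sketch of just the write-once path using the public object class SDK; the request decoding, epoch guard, key scheme, and error codes are placeholders, not the real cls_zlog implementation:

#include <rados/objclass.h>
#include <cerrno>
#include <string>

CLS_VER(1, 0)
CLS_NAME(zlog_sketch)

static cls_handle_t h_class;
static cls_method_handle_t h_write_once;

static int write_once(cls_method_context_t hctx,
                      ceph::bufferlist *in, ceph::bufferlist *out)
{
  // Request decoding is omitted: assume `pos` and the payload were taken
  // from *in, and that the epoch guard against the stored header passed.
  uint64_t pos = 0;                 // placeholder for the decoded position
  ceph::bufferlist data = *in;      // placeholder for the decoded payload

  // Write-once check: the omap key for this position must not exist yet.
  const std::string key = "entry." + std::to_string(pos);
  ceph::bufferlist existing;
  int ret = cls_cxx_map_get_val(hctx, key, &existing);
  if (ret == 0)
    return -EROFS;                  // placeholder "already written" error
  if (ret != -ENOENT)
    return ret;                     // unexpected omap error

  // Store the entry; a real class would also bump max_pos in the header.
  // The whole handler executes atomically on the OSD.
  return cls_cxx_map_set_val(hctx, key, &data);
}

CLS_INIT(zlog_sketch)
{
  cls_register("zlog_sketch", &h_class);
  cls_register_cxx_method(h_class, "write_once",
                          CLS_METHOD_RD | CLS_METHOD_WR,
                          write_once, &h_write_once);
}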
67. Getting started with object classes
1. The hello world object class (in Ceph: src/cls/hello/cls_hello.cc)
a. Super well documented, easy examples
67
68. Getting started with object classes
1. The hello world object class (in Ceph: src/cls/hello/cls_hello.cc)
a. Super well documented, easy examples
2. Build and load the plugin into Ceph
68
69. Getting started with object classes
1. The hello world object class (in Ceph: src/cls/hello/cls_hello.cc)
a. Super well documented, easy examples
2. Build and load the plugin into Ceph
3. Old way to manage plugins
a. Manage your own Ceph fork
69
70. Getting started with object classes
1. The hello world object class (in Ceph: src/cls/hello/cls_hello.cc)
a. Super well documented, easy examples
2. Build and load the plugin into Ceph
3. Old way to manage plugins
a. Manage your own Ceph fork
4. New way to manage plugins with SDK (credit: Neha Ojha @ b7215b0)
a. dnf install rados-objclass-devel
b. #include <rados/objclass.h>
c. compile cls_hello.cc as object library
70
71. Getting started with object classes
1. The hello world object class (in Ceph: src/cls/hello/cls_hello.cc)
a. Super well documented, easy examples
2. Build and load the plugin into Ceph
3. Old way to manage plugins
a. Manage your own Ceph fork
4. New way to manage plugins with SDK (credit: Neha Ojha @ b7215b0)
a. dnf install rados-objclass-devel
b. #include <rados/objclass.h>
c. compile cls_hello.cc as object library
5. Distribute your plugin to OSDs at <libdir>/rados-classes/cls_hello.so
6. Start making requests: ioctx.exec("hello", …)
71
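Step 6 corresponds to a plain librados call. A hedged end-to-end sketch against the stock cls_hello class (the pool name and object id are placeholders):

#include <rados/librados.hpp>
#include <iostream>

int main() {
  librados::Rados cluster;
  cluster.init("admin");              // connect as client.admin
  cluster.conf_read_file(nullptr);    // default ceph.conf search path
  if (cluster.connect() < 0) return 1;

  librados::IoCtx ioctx;
  if (cluster.ioctx_create("mypool", ioctx) < 0) return 1;   // placeholder pool

  // Invoke method "say_hello" of the in-tree "hello" class on object "obj-0".
  librados::bufferlist in, out;
  int ret = ioctx.exec("obj-0", "hello", "say_hello", in, out);
  std::cout << "exec returned " << ret << ": " << out.to_str() << std::endl;

  ioctx.close();
  cluster.shutdown();
  return 0;
}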
72. The CORFU seal(obj, epoch) interface is tougher
● Semantics: do the following atomically
○ Verify and update the stored epoch
○ Return the largest position written
● RADOS object classes don’t support this mix of ops
○ Currently
● Solution
○ Split the operation
○ Verify
● Luckily this isn’t a fast path
72
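A hedged sketch of that split as two client-side calls; the class and method names ("zlog_sketch", "seal", "max_position") and the string encoding are hypothetical placeholders, not the real cls_zlog interface:

#include <rados/librados.hpp>
#include <cstdint>
#include <string>

int seal_and_find_max(librados::IoCtx& ioctx, const std::string& oid,
                      uint64_t epoch, uint64_t* max_pos) {
  // 1. Write path: verify and persist the new epoch (no data returned).
  librados::bufferlist in, out;
  in.append(std::to_string(epoch));
  int ret = ioctx.exec(oid, "zlog_sketch", "seal", in, out);
  if (ret < 0)
    return ret;

  // 2. Read path: ask for the maximum position written, now that requests
  //    tagged with older epochs are rejected.
  librados::bufferlist in2, out2;
  in2.append(std::to_string(epoch));
  ret = ioctx.exec(oid, "zlog_sketch", "max_position", in2, out2);
  if (ret < 0)
    return ret;
  *max_pos = std::stoull(out2.to_str());   // assumes a decimal-string reply
  return 0;
}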
74. PostgreSQL logical replication with ZLog
74
0 1 2 3 4 5 6 7 8 9 10 11 12
osd osd osd
INSERT UPDATE DELETE
Replay
Replay
Replay
Replay
Replay
C C C C
Queries
ZLog
https://github.com/cruzdb/pg_zlog
75. CruzDB is an immutable key-value store on ZLog
75
0 1 2 3 4 5 6 7 8 9 10 11 12
osd osd osd
GET PUT DELETE
Replay
Replay
Replay
Replay
Replay
C C C C
Key/Value Transactions
ZLog
https://github.com/cruzdb/cruzdb
76. CruzDB is an immutable key-value store on ZLog
76
0 1 2 3 4 5 6 7 8 9 10 11 12
osd osd osd
GET PUT DELETE
Replay
Replay
Replay
Replay
Replay
C C C C
Key/Value Transactions
ZLog
Database is a
copy-on-write
red-black tree
https://github.com/cruzdb/cruzdb
77. CruzDB is an immutable key-value store on ZLog
77
0 1 2 3 4 5 6 7 8 9 10 11 12
osd osd osd
GET PUT DELETE
Replay
Replay
Replay
Replay
Replay
C C C C
Key/Value Transactions
ZLog
Database is a
copy-on-write
red-black tree
Efficient read-only
snapshots
https://github.com/cruzdb/cruzdb
78. CruzDB is an immutable key-value store on ZLog
78
0 1 2 3 4 5 6 7 8 9 10 11 12
osd osd osd
GET PUT DELETE
Replay
Replay
Replay
Replay
Replay
C C C C
Key/Value Transactions
ZLog
Database is a
copy-on-write
red-black tree
Efficient read-only snapshots
Time travel capabilities
https://github.com/cruzdb/cruzdb
79. Wrapping up
● We use Ceph as a prototyping system
○ Application-specific interfaces
● There is broad interest in log-oriented interfaces
● We’ve built a Ceph-based implementation of a high-performance log
○ ZLog @ https://github.com/cruzdb/zlog
● Using it for several real-world use cases
○ PostgreSQL logical replication (https://github.com/cruzdb/pg_zlog)
○ CruzDB immutable database (https://github.com/cruzdb/cruzdb)
79
80. Thank you and questions
● Noah Watkins (@noahdesu)
● Learning resources
○ Object class SDK documentation
■ http://docs.ceph.com/docs/master/rados/api/objclass-sdk/
○ ZLog is the only user of the SDK I’m aware of
■ https://github.com/cruzdb/zlog/blob/master/src/storage/ceph/cls_zlog.cc
○ The Ceph tree contains a ton of object classes for reference
■ https://github.com/ceph/ceph/tree/master/src/cls
○ Writing object classes using Lua
■ https://ceph.com/geen-categorie/dynamic-object-interfaces-with-lua/
■ https://nwat.xyz/blog/2018/03/13/video-of-my-talk-at-lua-workshop-2017/
80