This document provides an overview and summary of BlueStore, a new storage backend for Ceph that delivers significantly better performance than the previous FileStore backend. BlueStore uses a key/value database (RocksDB) to store metadata and writes data directly to block devices. It addresses FileStore's awkward transactional-consistency mechanisms by avoiding double writes and using more efficient data structures. BlueStore aims to provide more natural transaction atomicity, efficient enumeration of objects, and optimal I/O patterns for different storage devices.
Understanding BlueStore, Ceph's New Storage Backend - Tim Serong, SUSE (OpenStack)
Audience Level
Intermediate
Synopsis
Ceph – the most popular storage solution for OpenStack – stores all data as a collection of objects. This object store was originally implemented on top of a POSIX filesystem, an approach that turned out to have a number of problems, notably with performance and complexity.
BlueStore, a new storage backend for Ceph, was created to solve these issues; the Ceph Jewel release included an early prototype. The code and on-disk format were declared stable (but experimental) for Ceph Kraken, and now in the upcoming Ceph Luminous release, BlueStore will be the recommended default storage backend.
With a 2-3x performance boost, you’ll want to look at migrating your Ceph clusters to BlueStore. This talk goes into detail about what BlueStore does, the problems it solves, and what you need to do to use it.
Speaker Bio:
Tim works for SUSE, hacking on Ceph and related technologies. He has spoken often about distributed storage and high availability at conferences such as linux.conf.au. In his spare time he wrangles pigs, chickens, sheep and ducks, and was declared by one colleague “teammate most likely to survive the zombie apocalypse”.
Ceph BlueStore - a new storage type in Ceph / Maksim Vorontsov (Redsys) - Ontico
- What SDS is (features common to (almost) all solutions: scaling, abstraction from hardware resources, policy-based management, clustered filesystems);
- Why we decided to use SDS (we needed object storage);
- Why we chose Ceph rather than other open source (GlusterFS, Swift...) or proprietary (IBM Elastic Storage, Huawei OceanStor) solutions;
- What else Ceph can do besides object storage (RBD, CephFS);
- How Ceph works (on the server side);
- What BlueStore offers compared to the classic backend (on top of a filesystem);
- Performance comparison (test metrics);
- BlueStore is still a tech preview;
- Conclusion. Links and references.
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution - Karan Singh
In this presentation, I explain how Ceph object storage performance can be improved drastically, together with some object storage best practices, recommendations, and tips. I also cover the Ceph shared data lake, which is getting very popular.
BlueStore, A New Storage Backend for Ceph, One Year In - Sage Weil
BlueStore is a new storage backend for Ceph OSDs that consumes block devices directly, bypassing the local XFS file system that is currently used today. Its design is motivated by everything we've learned about OSD workloads and interface requirements over the last decade, and everything that has worked well and not so well when storing objects as files in local file systems like XFS, btrfs, or ext4. BlueStore has been under development for a bit more than a year now, and has reached a state where it is becoming usable in production. This talk will cover the BlueStore design, how it has evolved over the last year, and what challenges remain before it can become the new default storage backend.
CRUSH is the powerful, highly configurable algorithm Red Hat Ceph Storage uses to determine how data is stored across the many servers in a cluster. A healthy Red Hat Ceph Storage deployment depends on a properly configured CRUSH map. In this session, we will review the Red Hat Ceph Storage architecture and explain the purpose of CRUSH. Using example CRUSH maps, we will show you what works and what does not, and explain why.
Presented at Red Hat Summit 2016-06-29.
HKG15-401: Ceph and Software Defined Storage on ARM servers - Linaro
HKG15-401: Ceph and Software Defined Storage on ARM servers
---------------------------------------------------
Speakers: Yazen Ghannam, Steve Capper
Date: February 12, 2015
---------------------------------------------------
★ Session Summary ★
Running Ceph in colocation, ongoing optimizations
--------------------------------------------------
★ Resources ★
Pathable: https://hkg15.pathable.com/meetings/250828
Video: https://www.youtube.com/watch?v=RdZojLL7ttk
Etherpad: http://pad.linaro.org/p/hkg15-401
---------------------------------------------------
★ Event Details ★
Linaro Connect Hong Kong 2015 - #HKG15
February 9-13th, 2015
Regal Airport Hotel Hong Kong Airport
---------------------------------------------------
http://www.linaro.org
http://connect.linaro.org
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage - Sage Weil
Ceph is a highly scalable open source distributed storage system that provides object, block, and file interfaces on a single platform. Although Ceph RBD block storage has dominated OpenStack deployments for several years, maturing object (S3, Swift, and librados) interfaces and stable CephFS (file) interfaces now make Ceph the only fully open source unified storage platform.
This talk will cover Ceph's architectural vision and project mission and how our approach differs from alternative approaches to storage in the OpenStack ecosystem. In particular, we will look at how our open development model dovetails well with OpenStack, how major contributors are advancing Ceph capabilities and performance at a rapid pace to adapt to new hardware types and deployment models, and what major features we are prioritizing for the next few years to meet the needs of expanding cloud workloads.
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C... - Odinot Stanislas
After a short introduction to distributed storage and a description of Ceph, Jian Zhang presents some interesting benchmarks: sequential tests, random tests, and above all a comparison of results before and after optimization. The configuration parameters touched and the optimizations (large page numbers, omap data on a separate disk, ...) deliver at least a 2x performance gain.
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's... - Glenn K. Lockwood
Comparing the burst buffers of today, such as the Cray DataWarp-based burst buffer implemented on NERSC Cori, to the proto-burst buffer deployed on SDSC's Gordon supercomputer in 2012.
BlueStore: a new, faster storage backend for Ceph - Sage Weil
Traditionally Ceph has made use of local file systems like XFS or btrfs to store its data. However, the mismatch between the OSD's requirements and the POSIX interface provided by kernel file systems has a huge performance cost and requires a lot of complexity. BlueStore, an entirely new OSD storage backend, utilizes block devices directly, doubling performance for most workloads. This talk will cover the motivation for a new backend, the design and implementation, the improved performance on HDDs, SSDs, and NVMe, and discuss some of the thornier issues we had to overcome when replacing tried and true kernel file systems with entirely new code running in userspace.
Ceph Day KL - BlueStore
1. BLUESTORE: A NEW, FASTER STORAGE BACKEND FOR CEPH
Patrick McGarry
Ceph Days APAC Roadshow
2016
2. OUTLINE
● Ceph background and context
– FileStore, and why POSIX failed us
– NewStore – a hybrid approach
● BlueStore – a new Ceph OSD backend
– Metadata
– Data
● Performance
● Status and availability
● Summary
4. CEPH
● Object, block, and file storage in a single cluster
● All components scale horizontally
● No single point of failure
● Hardware agnostic, commodity hardware
● Self-manage whenever possible
● Open source (LGPL)
● "A Scalable, High-Performance Distributed File System"
● "performance, reliability, and scalability"
5. CEPH COMPONENTS
[Diagram: the Ceph stack – OBJECT, BLOCK, and FILE interfaces layered over RADOS]
● RGW – a web services gateway for object storage, compatible with S3 and Swift
● RBD – a reliable, fully-distributed block device with cloud platform integration
● CEPHFS – a distributed file system with POSIX semantics and scale-out metadata management
● LIBRADOS – a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS – a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
6. OBJECT STORAGE DAEMONS (OSDS)
[Diagram: OSD daemons, each with a local file system (xfs, btrfs, ext4) on its own disk, alongside monitor (M) nodes]
7. OBJECT STORAGE DAEMONS (OSDS)
[Diagram: the same stack, with FileStore highlighted as the layer inside each OSD between the daemon and the local file system]
8. OBJECTSTORE AND DATA MODEL
● ObjectStore
– abstract interface for storing local data
– EBOFS, FileStore
● EBOFS
– a user-space extent-based object file system
– deprecated in favor of FileStore on btrfs in 2009
● Object – "file"
– data (file-like byte stream)
– attributes (small key/value)
– omap (unbounded key/value)
● Collection – "directory"
– placement group shard (slice of the RADOS pool)
– sharded by 32-bit hash value
● All writes are transactions
– Atomic + Consistent + Durable
– Isolation provided by OSD
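To make that data model concrete, here is a minimal, hypothetical sketch in C++ (the type and method names are illustrative only, not Ceph's actual ObjectStore API) of objects with data/attrs/omap, collections, and all-or-nothing transactions:

#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy model of the ObjectStore data model (names are illustrative only).
struct Object {
    std::vector<uint8_t> data;                 // file-like byte stream
    std::map<std::string, std::string> attrs;  // small key/value attributes
    std::map<std::string, std::string> omap;   // unbounded key/value data
};

struct Collection {                            // one PG shard of the pool
    std::map<std::string, Object> objects;
};

// All writes are transactions: a batch of mutations applied as a unit.
// A real store must make apply() atomic and durable; the OSD adds isolation.
struct Transaction {
    std::vector<std::function<void(Collection&)>> ops;
    void write(const std::string& oid, std::vector<uint8_t> bytes) {
        ops.push_back([oid, bytes](Collection& c) { c.objects[oid].data = bytes; });
    }
    void setattr(const std::string& oid, const std::string& k, const std::string& v) {
        ops.push_back([oid, k, v](Collection& c) { c.objects[oid].attrs[k] = v; });
    }
    void apply(Collection& c) { for (auto& op : ops) op(c); }
};

int main() {
    Collection pg;
    Transaction t;
    t.write("object1", {0x42});
    t.setattr("object1", "snapset", "...");
    t.apply(pg);  // commit point: either all ops land or none should
    return 0;
}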
9. FILESTORE
● FileStore
– PG = collection = directory
– object = file
● Leveldb
– large xattr spillover
– object omap (key/value) data
● Originally just for development...
– later, the only supported backend (on XFS)
● /var/lib/ceph/osd/ceph-123/
  – current/
    ● meta/
      – osdmap123
      – osdmap124
    ● 0.1_head/
      – object1
      – object12
    ● 0.7_head/
      – object3
      – object5
    ● 0.a_head/
      – object4
      – object6
    ● db/
      – <leveldb files>
10. POSIX FAILS: TRANSACTIONS
● OSD carefully manages consistency of its data
● All writes are transactions
– we need A+C+D; OSD provides I
● Most are simple
– write some bytes to object (file)
– update object attribute (file xattr)
– append to update log (leveldb insert)
● ...but others are arbitrarily large/complex:

[
  {
    "op_name": "write",
    "collection": "0.6_head",
    "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
    "length": 4194304,
    "offset": 0,
    "bufferlist length": 4194304
  },
  {
    "op_name": "setattrs",
    "collection": "0.6_head",
    "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
    "attr_lens": {
      "_": 269,
      "snapset": 31
    }
  },
  {
    "op_name": "omap_setkeys",
    "collection": "0.6_head",
    "oid": "#0:60000000::::head#",
    "attr_lens": {
      "0000000005.00000000000000000006": 178,
      "_info": 847
    }
  }
]
11. POSIX FAILS: TRANSACTIONS
● Btrfs transaction hooks

/* trans start and trans end are dangerous, and only for
 * use by applications that know how to avoid the
 * resulting deadlocks
 */
#define BTRFS_IOC_TRANS_START _IO(BTRFS_IOCTL_MAGIC, 6)
#define BTRFS_IOC_TRANS_END _IO(BTRFS_IOCTL_MAGIC, 7)

● Writeback ordering

#define BTRFS_MOUNT_FLUSHONCOMMIT (1 << 7)

● What if we hit an error? ceph-osd process dies?

#define BTRFS_MOUNT_WEDGEONTRANSABORT (1 << …)

– There is no rollback...
12. POSIX FAILS: TRANSACTIONS
● Write-ahead journal
– serialize and journal every ObjectStore::Transaction
– then write it to the file system
● Btrfs parallel journaling
– periodic sync takes a snapshot, then trim old journal entries
– on OSD restart: rollback and replay journal against last snapshot
● XFS/ext4 write-ahead journaling
– periodic sync, then trim old journal entries
– on restart, replay entire journal
– lots of ugly hackery to deal with events that aren't idempotent
  ● e.g., renames, collection delete + create, …
● full data journal → we double write everything → ~halve disk throughput
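As a quick sanity check on that last point: with full data journaling on the same device, every client byte is written once to the journal and once to its final location, so a disk that can stream, say, 200 MB/s can sustain at most about 100 MB/s of client writes.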
13. POSIX FAILS: ENUMERATION
● Ceph objects are distributed by a 32-bit hash
● Enumeration is in hash order
– scrubbing
– "backfill" (data rebalancing, recovery)
– enumeration via librados client API
● POSIX readdir is not well-ordered
● Need O(1) "split" for a given shard/range
● Build directory tree by hash-value prefix
– split any directory when size > ~100 files
– merge when size < ~50 files
– read entire directory, sort in-memory

…
DIR_A/
DIR_A/A03224D3_qwer
DIR_A/A247233E_zxcv
…
DIR_B/
DIR_B/DIR_8/
DIR_B/DIR_8/B823032D_foo
DIR_B/DIR_8/B8474342_bar
DIR_B/DIR_9/
DIR_B/DIR_9/B924273B_baz
DIR_B/DIR_A/
DIR_B/DIR_A/BA4328D2_asdf
…
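A minimal sketch of the hash-prefix scheme above (a hypothetical helper; FileStore's real hash-index logic is more involved, and the fixed nesting depth here is just for illustration):

#include <cstdint>
#include <cstdio>
#include <string>

// Map a 32-bit object hash to a nested directory path, one hex nibble of the
// hash per directory level; the leaf file name keeps the full hash prefix so
// a crowded directory can later be split one nibble deeper.
static std::string hash_to_path(uint32_t hash, int levels, const std::string& name) {
    char hex[9];
    std::snprintf(hex, sizeof(hex), "%08X", hash);
    std::string path;
    for (int i = 0; i < levels; ++i)
        path += std::string("DIR_") + hex[i] + "/";
    return path + hex + "_" + name;
}

int main() {
    std::printf("%s\n", hash_to_path(0xB823032D, 2, "foo").c_str());  // DIR_B/DIR_8/B823032D_foo
    std::printf("%s\n", hash_to_path(0xA03224D3, 1, "qwer").c_str()); // DIR_A/A03224D3_qwer
}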
15. NEW OBJECTSTORE GOALS
● More natural transaction atomicity
● Avoid double writes
● Efficient object enumeration
● Efficient clone operation
● Efficient splice ("move these bytes from object X to object Y")
● Efficient IO pattern for HDDs, SSDs, NVMe
● Minimal locking, maximum parallelism (between PGs)
● Full data and metadata checksums
● Inline compression
16. NEWSTORE – WE MANAGE NAMESPACE
● POSIX has the wrong metadata model for us
● Ordered key/value is perfect match
– well-defined object name sort order
– efficient enumeration and random lookup
● NewStore = rocksdb + object files
– /var/lib/ceph/osd/ceph-123/
  ● db/
    – <rocksdb, leveldb, whatever>
  ● blobs.1/
    – 0
    – 1
    – ...
  ● blobs.2/
    – 100000
    – 100001
    – ...
[Diagram: NewStore OSDs, each pairing RocksDB with object files on HDD or SSD devices]
17. NEWSTORE FAIL: CONSISTENCY OVERHEAD
● RocksDB has a write-ahead log "journal"
● XFS/ext4(/btrfs) have their own journal (tree-log)
● Journal-on-journal has high overhead
– each journal manages half of overall consistency, but incurs the same overhead
● write(2) + fsync(2) to new blobs.2/10302
– 1 write + flush to block device
– 1 write + flush to XFS/ext4 journal
● write(2) + fsync(2) on RocksDB log
– 1 write + flush to block device
– 1 write + flush to XFS/ext4 journal
18. NEWSTORE FAIL: ATOMICITY NEEDS WAL
● We can't overwrite a POSIX file as part of an atomic transaction
– (we must preserve old data until the transaction commits)
● Writing overwrite data to a new file means many files for each object
● Write-ahead logging
– put overwrite data in "WAL" records in RocksDB
– commit atomically with transaction
– then overwrite original file data
– ...but then we're back to a double-write for overwrites
● Performance sucks again
● Overwrites dominate RBD block workloads
20. BLUESTORE
● BlueStore = Block + NewStore
– consume raw block device(s)
– key/value database (RocksDB) for metadata
– data written directly to block device
– pluggable block Allocator (policy)
● We must share the block device with RocksDB
– implement our own rocksdb::Env
– implement tiny "file system" BlueFS
– make BlueStore and BlueFS share device(s)
[Diagram: ObjectStore → BlueStore; data goes straight to the BlockDevice via the Allocator, while metadata goes through RocksDB → BlueRocksEnv → BlueFS to a BlockDevice]
21. ROCKSDB: BLUEROCKSENV + BLUEFS
● class BlueRocksEnv : public rocksdb::EnvWrapper
– passes file IO operations to BlueFS
● BlueFS is a super-simple "file system"
– all metadata loaded in RAM on start/mount
– no need to store block free list
– coarse allocation unit (1 MB blocks)
– all metadata updates are written to a journal
– journal rewritten/compacted when it gets large
[Diagram: BlueFS on-disk layout – superblock, journal extents interleaved with data extents; journal records like "file 10, file 11, file 12, file 13, rm file 12, ..."]
● Map "directories" to different block devices
– db.wal/ – on NVRAM, NVMe, SSD
– db/ – level0 and hot SSTs on SSD
– db.slow/ – cold SSTs on HDD
● BlueStore periodically balances free space
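The interception point is RocksDB's Env abstraction. Here is a toy sketch of the idea (assumes the RocksDB development headers; this is not BlueRocksEnv itself, which routes these calls into BlueFS rather than logging and forwarding them):

#include <cstdio>
#include <memory>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/env.h>

// Subclass rocksdb::EnvWrapper to intercept file IO. Real BlueRocksEnv
// implements file creation/read/write against BlueFS extents; this toy
// version just logs the call and forwards to the default (POSIX) Env.
class LoggingEnv : public rocksdb::EnvWrapper {
 public:
  explicit LoggingEnv(rocksdb::Env* base) : rocksdb::EnvWrapper(base) {}
  rocksdb::Status NewWritableFile(const std::string& fname,
                                  std::unique_ptr<rocksdb::WritableFile>* result,
                                  const rocksdb::EnvOptions& options) override {
    std::fprintf(stderr, "create %s\n", fname.c_str());  // BlueFS would allocate extents here
    return rocksdb::EnvWrapper::NewWritableFile(fname, result, options);
  }
};

int main() {
  LoggingEnv env(rocksdb::Env::Default());
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.env = &env;  // plug the custom Env into RocksDB
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/bluefs-demo", &db);
  if (s.ok()) { db->Put(rocksdb::WriteOptions(), "k", "v"); delete db; }
  return 0;
}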
22. ROCKSDB: JOURNAL RECYCLING
● rocksdb LogReader only understands two modes
– read until end of file (need accurate file size)
– read all valid records, then ignore zeros at end (need zeroed tail)
● writing to "fresh" log "files" means > 1 IO for a log append
● modified upstream rocksdb to re-use previous log files
– now resembles "normal" journaling behavior over a circular buffer
● works with vanilla RocksDB on files and on BlueFS
23. MULTI-DEVICE SUPPORT
● Single device
– HDD or SSD
  ● rocksdb
  ● object data
● Two devices
– 128MB of SSD or NVRAM
  ● rocksdb WAL
– big device
  ● everything else
● Two devices
– a few GB of SSD
  ● rocksdb WAL
  ● rocksdb (warm data)
– big device
  ● rocksdb (cold data)
  ● object data
● Three devices
– 128MB NVRAM
  ● rocksdb WAL
– a few GB SSD
  ● rocksdb (warm data)
– big device
  ● rocksdb (cold data)
  ● object data
25. BLUESTORE METADATA
● Partition namespace for different metadata
– S* – "superblock" metadata for the entire store
– B* – block allocation metadata (free block bitmap)
– T* – stats (bytes used, compressed, etc.)
– C* – collection name → cnode_t
– O* – object name → onode_t or bnode_t
– L* – write-ahead log entries, promises of future IO
– M* – omap (user key/value data, stored in objects)
26. CNODE
● Collection metadata
– Interval of object namespace

  shard pool hash name bits
C<NOSHARD,12,3d3e0000> "12.e3d3" = <19>

  shard pool hash name snap gen
O<NOSHARD,12,3d3d880e,foo,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3d9223,bar,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3e02c2,baz,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3e125d,zip,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3e1d41,dee,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3e3832,dah,NOSNAP,NOGEN> = …

struct spg_t {
  uint64_t pool;
  uint32_t hash;
  shard_id_t shard;
};
struct bluestore_cnode_t {
  uint32_t bits;
};

● Nice properties
– Ordered enumeration of objects
– We can "split" collections by adjusting collection metadata only
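The ordering property depends on encoding the hash so that byte-wise key comparison matches numeric order: fixed-width, most-significant-digit-first. A minimal sketch (the key layout below is simplified and hypothetical, not BlueStore's exact encoding):

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Encode an object key so that lexicographic (byte-wise) comparison of keys
// matches numeric order of the 32-bit hash: fixed-width, big-endian hex.
static std::string object_key(uint64_t pool, uint32_t hash, const std::string& name) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "O:%016llx:%08x:",
                  (unsigned long long)pool, hash);
    return std::string(buf) + name;
}

int main() {
    std::vector<std::string> keys = {
        object_key(12, 0x3d3e3832, "dah"),
        object_key(12, 0x3d3d880e, "foo"),
        object_key(12, 0x3d3e02c2, "baz"),
    };
    std::sort(keys.begin(), keys.end());  // a plain byte-wise sort...
    for (auto& k : keys) std::printf("%s\n", k.c_str());  // ...comes out in hash order
}

Because keys sort by hash, a collection covering a hash interval can be split into two collections just by narrowing the intervals (the cnode's "bits"), with no key rewrites.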
27. ONODE
● Per-object metadata
– Lives directly in key/value pair
– Serializes to 100s of bytes
● Size in bytes
● Inline attributes (user attr data)
● Data pointers (user byte data)
– lextent_t → (blob, offset, length)
– blob → (disk extents, csums, ...)
● Omap prefix/ID (user k/v data)

struct bluestore_onode_t {
  uint64_t size;                                 // object size in bytes
  map<string,bufferptr> attrs;                   // inline attributes
  map<uint64_t,bluestore_lextent_t> extent_map;  // logical offset → (blob, offset, length)
  uint64_t omap_head;                            // prefix/ID for this object's omap keys
};
struct bluestore_blob_t {
  vector<bluestore_pextent_t> extents;  // physical disk extents
  uint32_t compressed_length;
  bluestore_extent_ref_map_t ref_map;   // refcounts for extents shared between objects
  uint8_t csum_type, csum_order;        // e.g. crc32c over 2^csum_order-byte blocks
  bufferptr csum_data;
};
struct bluestore_pextent_t {
  uint64_t offset;  // on-disk offset
  uint64_t length;
};
28. BNODE
● Blob metadata
– Usually blobs are stored in the onode
– Sometimes we share blocks between objects (usually clones/snaps)
– We need to reference count those extents
– We still want to split collections and repartition extent metadata by hash

  shard pool hash name snap gen
O<NOSHARD,12,3d3d9223,bar,NOSNAP,NOGEN> = onode
O<NOSHARD,12,3d3e02c2> = bnode
O<NOSHARD,12,3d3e02c2,baz,NOSNAP,NOGEN> = onode
O<NOSHARD,12,3d3e125d> = bnode
O<NOSHARD,12,3d3e125d,zip,NOSNAP,NOGEN> = onode
O<NOSHARD,12,3d3e1d41,dee,NOSNAP,NOGEN> = onode
O<NOSHARD,12,3d3e3832,dah,NOSNAP,NOGEN> = onode

● the onode value includes, and the bnode value is:
  map<int64_t,bluestore_blob_t> blob_map;
● lextent blob ids
– > 0 → blob in onode
– < 0 → blob in bnode
29. CHECKSUMS
● We scrub... periodically
– window before we detect error
– we may read bad data
– we may not be sure which copy is bad
● We want to validate checksum on every read
● Must store more metadata in the blobs
– 32-bit csum metadata for 4MB object and 4KB blocks = 4KB
– larger csum blocks
  ● csum_order > 12
– smaller csums
  ● crc32c_8 or 16
● IO hints
– seq read + write → big chunks
– compression → big chunks
● Per-pool policy
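A self-contained sketch of the per-block checksum scheme (a plain software CRC32C; BlueStore would use a hardware-accelerated implementation, and the blob layout here is simplified). csum_order = 12 means one 32-bit checksum per 2^12 = 4 KB block, which is where the "4 MB object / 4 KB blocks = 1024 checksums = 4 KB of metadata" figure above comes from:

#include <cstdint>
#include <cstdio>
#include <vector>

// Software CRC32C (Castagnoli, reflected polynomial 0x82F63B78), the
// checksum family referenced on the slide. Table-based, for illustration.
static uint32_t crc32c(const uint8_t* p, size_t n) {
    static uint32_t table[256];
    static bool init = false;
    if (!init) {
        for (uint32_t i = 0; i < 256; ++i) {
            uint32_t c = i;
            for (int k = 0; k < 8; ++k)
                c = (c & 1) ? 0x82F63B78u ^ (c >> 1) : (c >> 1);
            table[i] = c;
        }
        init = true;
    }
    uint32_t c = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; ++i)
        c = table[(c ^ p[i]) & 0xFF] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;
}

int main() {
    // Checksum a toy blob in 4 KB blocks (csum_order = 12), one 32-bit
    // csum per block; on every read, recompute and compare before
    // returning data, instead of waiting for the next scrub.
    const size_t block = size_t(1) << 12;
    std::vector<uint8_t> blob(4 * block, 0xAB);
    std::vector<uint32_t> csums;
    for (size_t off = 0; off < blob.size(); off += block)
        csums.push_back(crc32c(&blob[off], block));
    bool ok = crc32c(&blob[0], block) == csums[0];
    std::printf("block 0 csum=%08x verify=%s\n", csums[0], ok ? "ok" : "BAD");
}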
30. INLINE COMPRESSION
● 3x replication is expensive
– Any scale-out cluster is expensive
● Lots of stored data is (highly) compressible
● Need largish extents to get compression benefit (64KB, 128KB)
– may need to support small (over)writes
– overwrites occlude/obscure compressed blobs
– compacted (rewritten) when > N layers deep
[Diagram: an object's extents from start to end, showing allocated, written, written (compressed), and uncompressed blob regions, with small overwrites layered over a compressed blob]
32. DATA PATH BASICS
Terms
● Sequencer
– An independent, totally ordered queue of transactions
– One per PG
● TransContext
– State describing an executing transaction
Two ways to write
● New allocation
– Any write larger than min_alloc_size goes to a new, unused extent on disk
– Once that IO completes, we commit the transaction
● WAL (write-ahead-logged)
– Commit temporary promise to (over)write data with transaction
  ● includes data!
– Do async overwrite
– Then clean up the temporary k/v pair
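A toy decision function for the two paths (illustrative only; the real BlueStore logic also considers alignment, existing allocations, and compression):

#include <cstdint>
#include <cstdio>

enum class WritePath { NewAlloc, WAL };

// Sketch of the two write paths: big writes go to fresh extents and commit
// via a single k/v transaction; small (over)writes are logged with their
// data, acked at k/v commit, applied asynchronously, then the log record
// is cleaned up.
static WritePath choose_path(uint64_t len, uint64_t min_alloc_size) {
    return len > min_alloc_size ? WritePath::NewAlloc : WritePath::WAL;
}

int main() {
    const uint64_t min_alloc = 64 * 1024;  // illustrative threshold
    std::printf("4K  -> %s\n",
                choose_path(4096, min_alloc) == WritePath::WAL ? "WAL" : "new alloc");
    std::printf("1M  -> %s\n",
                choose_path(1 << 20, min_alloc) == WritePath::NewAlloc ? "new alloc" : "WAL");
}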
33. TRANSCONTEXT STATE MACHINE
[Diagram: TransContext states – PREPARE → AIO_WAIT → KV_QUEUED → KV_COMMITTING → FINISH for new-allocation writes; WAL writes continue through WAL_QUEUED → WAL_AIO_WAIT → WAL_CLEANUP (→ WAL_CLEANUP_COMMITTING) → FINISH. A transaction initiates some AIO, waits for the next TransContext(s) in its Sequencer to be ready, and k/v commits are batched.]
34. CACHING
● OnodeSpace per collection
– in-memory ghobject_t → Onode map of decoded onodes
● BufferSpace for in-memory blobs
– may contain cached on-disk data
● Both buffers and onodes have lifecycles linked to a Cache
– LRUCache – trivial LRU
– TwoQCache – implements 2Q cache replacement algorithm (default)
● Cache is sharded for parallelism
– Collection → shard mapping matches OSD's op_wq
– same CPU context that processes client requests will touch the LRU/2Q lists
– IO completion execution not yet sharded – TODO?
35. BLOCK FREE LIST
● FreelistManager
– persist list of free extents to key/value store
– prepare incremental updates for allocate or release
● Initial implementation
– extent-based: <offset> = <length>
– kept in-memory copy
– enforces an ordering on commits; freelist updates had to pass through a single thread/lock

  del 1600=100000
  put 1700=0fff00

– small initial memory footprint, very expensive when fragmented
● New bitmap-based approach
– <offset> = <region bitmap>, where region is N blocks
  ● 128 blocks = 8 bytes
– use k/v merge operator to XOR allocation or release

  merge 10=0000000011
  merge 20=1110000000

– RocksDB log-structured-merge tree coalesces keys during compaction
– no in-memory state
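A tiny in-memory model of the XOR merge idea (toy code, not RocksDB's actual merge-operator API): because allocate and release both XOR the same bits, updates can be applied blindly with no read-modify-write, and an LSM tree can coalesce the operands lazily during compaction:

#include <cstdint>
#include <cstdio>
#include <map>

int main() {
    // region offset → 1 bit per block (64 blocks per region in this toy).
    std::map<uint64_t, uint64_t> freelist_bitmap;

    auto merge = [&](uint64_t region, uint64_t bits) {
        freelist_bitmap[region] ^= bits;  // XOR is its own inverse
    };

    merge(10, 0b0000000011);  // allocate blocks 0-1 of region 10
    merge(20, 0b1110000000);  // allocate blocks 7-9 of region 20
    merge(10, 0b0000000011);  // release the same two blocks: bits flip back

    for (const auto& [region, bits] : freelist_bitmap)
        std::printf("region %llu bits %#llx\n",
                    (unsigned long long)region, (unsigned long long)bits);
}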
36. BLOCK ALLOCATOR
● Allocator
– abstract interface to allocate blocks
● StupidAllocator
– extent-based
– bin free extents by size (powers of 2)
– choose sufficiently large extent closest to hint
– highly variable memory usage
  ● btree of free extents
– implemented, works
– based on ancient ebofs policy
● BitmapAllocator
– hierarchy of indexes
  ● L1: 2 bits = 2^6 blocks
  ● L2: 2 bits = 2^12 blocks
  ● ...
  ● 00 = all free, 11 = all used, 01 = mix
– fixed memory consumption
  ● ~35MB RAM per TB
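That figure is easy to sanity-check: assuming 4 KB blocks and roughly one bit of leaf state per block, 1 TB / 4 KB ≈ 2.7 × 10^8 blocks ≈ 33 MB of bitmap, and the coarser two-bit index levels above it add only a few percent more, consistent with ~35 MB of RAM per TB.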
37. SMR HDD
● Let's support them natively!
● 256MB zones / bands
– must be written sequentially, but not all at once
– libzbc supports ZAC and ZBC HDDs
– host-managed or host-aware
● SMRAllocator
– write pointer per zone
– used + free counters per zone
– Bonus: almost no memory!
● IO ordering
– must ensure allocated writes reach disk in order
● Cleaning
– store k/v hints: zone offset → object hash
– pick emptiest closed zone, scan hints, move objects that are still there
– opportunistically rewrite objects we read if the zone is flagged for cleaning soon
40. HDD: RANDOM WRITE
[Charts: Ceph 10.1.0 BlueStore vs FileStore random writes on HDD (FS HDD vs BS HDD) – throughput (MB/s) and IOPS vs IO size]
41. HDD: SEQUENTIAL READ
[Chart: Ceph 10.1.0 BlueStore vs FileStore sequential reads on HDD (FS HDD vs BS HDD) – throughput (MB/s) vs IO size]
42. HDD: RANDOM READ
[Charts: Ceph 10.1.0 BlueStore vs FileStore random reads on HDD (FS HDD vs BS HDD) – throughput (MB/s) and IOPS vs IO size]
43. SSD AND NVME?
● NVMe journal
– random writes ~2x faster
– some testing anomalies (problem with test rig kernel?)
● SSD only
– similar to HDD result
– small write benefit is more pronounced
● NVMe only
– more testing anomalies on test rig... WIP
45. STATUS
● Done
– fully functional IO path with checksums and compression
– fsck
– bitmap-based allocator and freelist
● Current efforts
– optimize metadata encoding efficiency
– performance tuning
– ZetaScale key/value db as RocksDB alternative
– bounds on compressed blob occlusion
● Soon
– per-pool properties that map to compression, checksum, IO hints
– more performance optimization
– native SMR HDD support
– SPDK (kernel bypass for NVMe devices)
46. AVAILABILITY
● Experimental backend in Jewel v10.2.z (just released)
– enable experimental unrecoverable data corrupting features = bluestore rocksdb
– ceph-disk --bluestore DEV
  ● no multi-device magic provisioning just yet
– predates checksums and compression
● Current master
– new disk format
– checksums
– compression
● The goal...
– stable in Kraken (Fall '16)
– default in Luminous (Spring '17)
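Assembled from the two items above, enabling the experimental backend on a Jewel cluster looks roughly like this (an illustrative sketch based only on the slide's own option and command; check the Jewel release notes before pointing it at data you care about):

# ceph.conf (Jewel-era, experimental!)
[global]
enable experimental unrecoverable data corrupting features = bluestore rocksdb

# then provision an OSD on a raw device, per the slide's ceph-disk invocation:
#   ceph-disk prepare --bluestore DEV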
47. SUMMARY
● Ceph is great
● POSIX was a poor choice for storing objects
● RocksDB rocks and was easy to embed
● Our new BlueStore backend is awesome
● Full data checksums and inline compression!
A bit of background on what Ceph is
what Filestore is and why it doesn’t work anymore
what newstore is (first attempt)
Bluestore, current effort
High level, how it’s structured, data path, performance numbers
Current status of development, where we’re at and how to try it.
[basic stuff]
The original paper used the [last two bullets] but performance has been a challenge compared to raw hardware capabilities
The RADOS cluster is structured as a series of hosts
A collection of OSD daemons sitting in front of HDDs
An FS sitting on top of each disk
In reality there is a well-contained component of the OSD called FileStore
Responsible for writing that data to the filesystem on that disk
It's that piece that is getting replaced
FileStore implements an interface called ObjectStore
An abstract interface that describes how each OSD daemon stores data on its local disk (just the local disk)
The larger Ceph system is responsible for replicating across multiple OSDs
Originally there were EBOFS and FileStore (two implementations)
Built around two abstractions:
Objects (sort of files): data (bunch of bytes), attributes (extended attributes), omap – an unbounded key/value thing (less common)
Collections: directory (group of objects): pool of objects sharded into PGs, PGs map to collections
All writes are transactions – atomic, consistent, and durable. Don't worry about the I in ACID (provided by another layer)
EBOFS was first – a user-space, extent-based, copy-on-write btree FS (full control of the stack, most natural interface)
Got rid of it, switched to writing to btrfs in 2009 – had everything we needed and a growing community
FileStore – write objects as files
LevelDB holds xattrs (when they are too big) and omap data
Originally just for dev w/o having dedicated disks – morphed into prod
OSD dir – dir for each PG
DB dir – has level db
Meta dir – high level metadata objects for osd as a whole
Because this is built on existing FS, constrained by POSIX – has problems
1st – the interface wants to provide atomicity (b/c the OSD is managing consistency of the data it stores locally – if it fails it can recover and resync w/ other replicas) – we need that transactionality
In practice most transactions are pretty simple – write some bytes, attr = what version, log = what version... but we can't rely on that simplicity
On the right is an example of one of these transactions
Initially to support these we tied into btrfs
Had an ioctl that we'd use to bracket all of our work, to prevent btrfs from committing a transaction while we were in the middle of ours. Internal checkpoints
Got us most of the way there
The problem here is "what happens if the OSD daemon crashed and didn't finish writing the full transaction" – btrfs would see the write start and some writes, but no end... would never get the second half.
Got around that by adding a very horrible mount option to deliberately make btrfs wedge itself and crash. Internally there was no option for rollback. Btrfs was not meant to be transactional in that way... hard to shoehorn that in later.
Didn’t work, so instead we....
Did a write-ahead journal, serialize into a sequence of bytes
In btrfs we could be a little bit clever – snapshot == full checkpoint (after a checkpoint, we could trim the journal). If the OSD restarted, roll back to the snapshot and replay the journal (nice consistency model)
On non-btrfs filesystems it's not so elegant – still do periodic sync, but on restart we just replayed the journal blindly, so we might be repeating operations
Unfortunately the operations the interface supports aren't all idempotent – had things like renames/clones/etc – a whole bunch of hackery so we don't apply those operations twice (kinda nasty... but it works)
Write twice – journal + disk…this halves disk throughput
Another place POSIX gets in our way is enumeration
Objects are distributed in a pool based on a 32-bit hash – we do enumeration in hash order for scrub / backfill / and when you request a list of objects via API
POSIX readdir is totally random
Also need the ability to take a given collection and split it in half/quarter/etc – that's part of Ceph: we can repartition our data collections
Can't do that with POSIX... can't take a dir of a million files and split it into two dirs
In practice in FileStore, we build an ugly tree of directories + files where the dir names are based on the prefix of the hash for the file (deeply nested structure – looks similar to what other projects do)
Not terribly efficient because of the complicated dir structure – hit some bottlenecks
Time to do something different – POSIX is more trouble than it's worth
Objects aren’t files
Collections aren’t directories
Ordered k/v database
RocksDB (picked somewhat randomly)
Idea is you plug in your k/v db (rocksdb / leveldb / any kv db)
Actual data for an object is written to a simple file w/ a simple name (short name!) – nice big efficient directories
Didn't work very well
Main issue is RocksDB has a write-ahead journal to maintain its consistency
The FS also has a journal
Journal-on-journal is very inefficient (there are papers about it) – each journal manages half of the overall consistency of the system (you pay the overhead twice)
When writing a blob file in NewStore you write the file and do an fsync: one IO with the file data, another IO to the FS journal – flush the device twice
Then NewStore updates metadata: append a record to RocksDB, which appends to the RocksDB log file and fsyncs that (another 2 IOs: one w/ the RocksDB log, and again w/ the FS log file)
Pay 4 IOs when you only want to pay 2
Solution is to put everything in one big journal
Problem is the system still needs atomicity to do overwrites
In POSIX you can't overwrite part of a file that already exists as part of a larger transaction (POSIX doesn't understand transactions)
In Ceph we need these overwrites to be atomic (so they don't overwrite things unless they are ready to be committed)
Could have had NewStore write to a new file... but that leads to a big, complex mapping structure
End up where we were before with write-ahead logging
The allocator is something we used to get from XFS or w/e that we now do ourselves
We have to share the block device w/ RocksDB (it writes a bunch of files, like its log file)
Do that by implementing a RocksDB backend – a nice abstracted Env class that captures the platform-dependent stuff
Implement a very simple FS (just complicated enough to support RocksDB operations)
All metadata is stored in RAM
Idea is: write to the journal – write updates to fnodes (like inodes) as they happen
When you hit a threshold, you rewrite the whole thing in a more compact form
RocksDB writes big files only, so it keeps things simple.
BlueFS is smart about multiple devices (RocksDB writes types of data to different dirs, logs to SSD)
BlueStore and BlueFS communicate so that as BlueFS runs out of space, BlueStore gives it more, and vice versa
Did one tricky thing w/ rocksdb upstream
RocksDB was written to use log files (a journal) – it writes a new log file each time, which leads to a pretty inefficient IO pattern
Every file system / db that does data logging uses a circular buffer – so we implemented that
The two-device setup is like what people do now (SSD journal + multiple HDDs for data)
With a larger fast device, the two-device setup can do more
Three devices split it even further
Don't support BlueStore tiering object data – but are exploring it
Ordered enumeration of objects – carefully construct keys that sort in the order we want
B/c objects are in hash order we can take collection that represents a range and split it into two collection without rewriting any k/v pairs (just change collection metadata to arbitrarily carve into two pieces)
This is something that filestore had to work hard to do
ONODES stores per object metadata
Main things in here are:
Size of object in bytes
In-lines attributes like ver=2
Data pointers that indicate where the byte data is stored on disk
Structure “omap head” that is if you have user data stored as k/v data, where to find it
One other structure
Need to store metadata about the blob – the ONODE has a mapping from object space to logical extents which map to blobs, but it doesn't always contain the blobs themselves – usually a blob is stored next to the ONODE, but occasionally multiple ONODEs map to shared blobs
Map of an identifier to blob – the blob tells you where to find the data
Blobs let us do checksums (today we scrub: every day = metadata, every week = data)
With BlueStore we want to validate checksums on every read – that means BlueStore blobs have to store more metadata (to include the checksum)
Use industry standard crc32c
IOHints – (we control whole stack)
things like RGW = read/write sequentially (no small overwrites) -> large checksum block
If we compress a block, checksum for entire region
The idea is policies on a per-pool basis
3x is expensive
Bluestore implements in-line compression
Trick is when you need to support overwrites (hopefully diagram makes sense)
Figuring out performance is future work
How the code flows when we’re taking data from OSD to disk
Sequencer – independent stream fed to object store (1 per PG)
Each transaction is represented by a transcontext
New allocation – (most of the time) new region of disk, update metadata to point to the data
WAL – (sometimes small writes) temporary k/v pair in rocksdb – effectively data journaling like filestore, only do it with small writes
Complicated slide describes flow of transactions through this process
Bluestore implements its own cache in user space memory (not using the kernel for any caching)
Couple other things that happen
Freelist – keeps track of unused space on disk
A separate module (the Allocator) is responsible for deciding where we should allocate new data
Pluggable, has two implementations
StupidAllocator – not bad, highly variable memory usage
BitmapAllocator – new implementation from SanDisk
The Allocator is pluggable, so we also have a GSOC student who is adding support for SMR hard disks (annoying: they prevent overwrites, you have to write zones sequentially)
These graphs were produced a couple of months ago... preliminary and not super-detailed
Sequential writes on a spinning platter – large IO is twice as fast (as you would expect, removing the double writes)
Random writes are much better, also about twice as fast (left is streaming throughput, right is IOPS)
The kink between 32k and 64k writes is where we transition from WAL to writing to a new region of disk
Sequential reads are a little more interesting: at the high end we're a little better, at the low end we're the same... in the middle there is a dip pattern
NewStore is based on XFS, with its readahead
BlueStore isn't... b/c Ceph has its own readahead (CephFS, RBD, RadosGW all have their own)... this is faster when you look at the client level, but not at the OSD
Random reads: sort of what you'd expect. With small IO our metadata is more efficient