Lifecycle of a Gluster Volume
Shreyas Siravara
Production Engineer
Automating GlusterFS @ Facebook
Stages of a Gluster Volume
1. Creation
2. Maintenance
• Software Upgrades
• Hardware Repairs
3. Decommission
Creation
Validate Hardware
• Homogeneous hardware
• Bricks are the same size
• Exact same CPU and memory configuration
• Easy to debug problems
Creation
Layout Management
• Rack failure resilient layout
• Spread replicas across racks
• Automate entire process to avoid human error
• Layout of replicas supports large-scale maintenance
• Avoid data unavailability
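As a rough illustration of the layout step, here is a minimal sketch (not Facebook's actual tooling; the host and rack names are made up) that groups bricks into replica sets so that every set spans distinct racks:

```python
from collections import defaultdict

def build_replica_sets(bricks, replica=3):
    """bricks: list of (rack, "host:/brick/path") tuples."""
    by_rack = defaultdict(list)
    for rack, brick in bricks:
        by_rack[rack].append(brick)
    racks = sorted(by_rack)
    if len(racks) < replica:
        raise ValueError("need at least as many racks as replicas")
    # One brick per rack per replica set: losing a rack costs at most
    # one copy of any file, so the volume stays available.
    columns = [by_rack[r] for r in racks[:replica]]
    return [list(replica_set) for replica_set in zip(*columns)]

bricks = [("rack1", "h1:/data/brick"), ("rack2", "h2:/data/brick"),
          ("rack3", "h3:/data/brick"), ("rack1", "h4:/data/brick"),
          ("rack2", "h5:/data/brick"), ("rack3", "h6:/data/brick")]
print(build_replica_sets(bricks))
# [['h1:/data/brick', 'h2:/data/brick', 'h3:/data/brick'],
#  ['h4:/data/brick', 'h5:/data/brick', 'h6:/data/brick']]
```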
Maintenance
Hardware Repair
• What happens if a brick needs repair?
• Some manual effort for physical repairs
• This is done with the local Gluster daemons stopped
• What happens if a brick comes back empty?
• Multiple replaced drives in a RAID
• SHD automatically “discovers” that the brick is empty & heals it
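To make the "discovers and heals" step concrete, here is a hedged sketch that watches the heal backlog via `gluster volume heal <vol> info`; the helper name and the output parsing are assumptions about the CLI's text format, not the talk's automation:

```python
import re
import subprocess

def pending_heals(volume):
    """Total entries still awaiting self-heal across all bricks."""
    out = subprocess.run(["gluster", "volume", "heal", volume, "info"],
                         capture_output=True, text=True, check=True).stdout
    # heal info prints "Number of entries: N" under each brick.
    return sum(int(n) for n in re.findall(r"Number of entries:\s*(\d+)", out))

if __name__ == "__main__":
    print("entries still healing:", pending_heals("myvol"))
```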
Maintenance
Hardware Repair
• What happens if the root drive is replaced?
• Fresh OS install
• Automated “restore” flow
• Facebook automation installs the OS
• Install Gluster
• Restore the node's prior UUID & restore the peer list
• SHD cleans up the pending heals
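A minimal sketch of the restore step, assuming the node's identity was backed up beforehand; the backup location and helper name are hypothetical, while the /var/lib/glusterd paths are the stock glusterd defaults:

```python
import shutil
import subprocess
from pathlib import Path

GLUSTERD_DIR = Path("/var/lib/glusterd")

def restore_identity(backup_dir):
    backup = Path(backup_dir)
    subprocess.run(["systemctl", "stop", "glusterd"], check=True)
    # glusterd.info holds the node's UUID; peers/ holds the peer list.
    shutil.copy(backup / "glusterd.info", GLUSTERD_DIR / "glusterd.info")
    shutil.copytree(backup / "peers", GLUSTERD_DIR / "peers",
                    dirs_exist_ok=True)
    subprocess.run(["systemctl", "start", "glusterd"], check=True)
    # From here the self-heal daemon cleans up the pending heals.
```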
Maintenance
Software Upgrades: Goals
• Goals:
• Push quickly and safely
• Avoid quorum loss & split-brains
• The customer should not know we’re doing a push
• Halt the push if we find something critical
• Code changes should not result in incompatibility between servers & clients
Maintenance
Software Upgrades: Batching
• Create batches based on layout
• Every rack becomes a “batch”
• Batches are scheduled serially
• Concurrency within the batch
[Diagram: Batch 1 = Rack 1 (Bricks 1, 4, 7); Batch 2 = Rack 2 (Bricks 2, 5, 8); Batch 3 = Rack 3 (Bricks 3, 6, 9)]
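The batching itself is simple to automate; below is a sketch (a hypothetical helper, not the production code) that derives one batch per rack from the layout:

```python
from collections import defaultdict

def batches_by_rack(hosts):
    """hosts: list of (rack, hostname). Returns one batch per rack."""
    by_rack = defaultdict(list)
    for rack, host in hosts:
        by_rack[rack].append(host)
    return [by_rack[r] for r in sorted(by_rack)]

hosts = [("rack1", "h1"), ("rack2", "h2"), ("rack3", "h3"),
         ("rack1", "h4"), ("rack2", "h5"), ("rack3", "h6")]
print(batches_by_rack(hosts))   # [['h1', 'h4'], ['h2', 'h5'], ['h3', 'h6']]
```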
Maintenance
Software Upgrades: Host Procedure
• Single Host Procedure:
1. Check for quorum margin
2. Wait for pending heals to drop
3. Stop Gluster & install the new version
4. Start Gluster
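A hedged sketch of those four steps; the ssh transport, the quorum check and the package command stand in for Facebook-internal automation and will differ from site to site:

```python
import re
import subprocess
import time

def ssh(host, cmd):
    subprocess.run(["ssh", host, cmd], check=True)

def pending_heals(volume):
    out = subprocess.run(["gluster", "volume", "heal", volume, "info"],
                         capture_output=True, text=True, check=True).stdout
    return sum(int(n) for n in re.findall(r"Number of entries:\s*(\d+)", out))

def quorum_margin_ok(volume, host):
    """Crude placeholder: refuse if any brick on *another* host is offline."""
    out = subprocess.run(["gluster", "volume", "status", volume],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith("Brick") and host not in line:
            if line.split()[-2] != "Y":      # "Online" column
                return False
    return True

def upgrade_host(host, volume):
    # 1. Check for quorum margin before taking this host down.
    if not quorum_margin_ok(volume, host):
        raise RuntimeError(f"{host}: taking it down would risk quorum")
    # 2. Wait for pending heals to drop.
    while pending_heals(volume) > 0:
        time.sleep(30)
    # 3. Stop Gluster & install the new version (packager is site-specific;
    #    brick daemons may need a separate stop on some packagings).
    ssh(host, "systemctl stop glusterd")
    ssh(host, "yum -y update glusterfs-server")
    # 4. Start Gluster; SHD catches up anything written while it was down.
    ssh(host, "systemctl start glusterd")
```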
Maintenance
Software Upgrades: Volume Procedure
• Volume Procedure:
• Upgrade every host in the batch
• Health-check
• Run the next batch
[Diagram: Batch 1 = Rack 1 (Bricks 1, 4, 7), Batch 2 = Rack 2 (Bricks 2, 5, 8), Batch 3 = Rack 3 (Bricks 3, 6, 9); batches are marked Pending or Upgraded as the push progresses]
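The volume-level loop then just strings the pieces together; this sketch reuses upgrade_host() and pending_heals() from the single-host sketch above, and the health check is a stand-in for the real one:

```python
from concurrent.futures import ThreadPoolExecutor

def healthy(volume):
    # Stand-in health check: heal backlog fully drained after the batch.
    return pending_heals(volume) == 0

def upgrade_volume(volume, batches):
    """batches: [[hosts in rack 1], [hosts in rack 2], ...], run serially."""
    for batch in batches:
        # Concurrency within the batch: the whole rack goes down together,
        # which the rack-aware layout tolerates.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            list(pool.map(lambda h: upgrade_host(h, volume), batch))
        if not healthy(volume):
            raise RuntimeError("health check failed; halting the push")
```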
Maintenance
Software Upgrades: Advantages & Potential Improvements
• Advantages:
• Maintain quorum
• Clients don’t need to know that a volume is being upgraded
• We should:
• Correctly drain traffic when we stop Gluster daemons
• Stop listening for new requests
• Complete outstanding I/O
Decommission
Requirements & Challenges
• Requirement:
• Replace 100% of the hardware in a Gluster volume
• Challenges:
• Volume size
• Data Integrity
• No customer impact
• SLA: No errors, low latency
Decommission
Simple Strategy: Replace-brick
• Replace bricks one replica at a time and wait for rebuilds
• Use gluster volume replace-brick
• Good for smaller volumes, with low numbers of files
• Scales poorly with 10s of millions of files per brick
• Self-heal daemon is not yet fast enough
• Even with multi-threaded SHD
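In CLI terms the simple strategy looks roughly like this per brick (hostnames and paths are examples; pending_heals() is the helper sketched earlier):

```python
import subprocess
import time

def replace_and_rebuild(volume, old_brick, new_brick):
    subprocess.run(["gluster", "volume", "replace-brick", volume,
                    old_brick, new_brick, "commit", "force"], check=True)
    # SHD now rebuilds the new brick file by file; with tens of millions
    # of files per brick this wait dominates the whole decommission.
    while pending_heals(volume) > 0:
        time.sleep(60)

replace_and_rebuild("myvol", "oldhost:/data/brick", "newhost:/data/brick")
```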
Decommission
Improved Strategy: “Block” copy + Replace-brick
[Diagram: xfsdump streams the source brick to the destination brick, then gluster volume replace-brick swaps the destination brick into the volume]
Decommission
Improved Strategy: “Block” copy + Replace-brick
• Advantages:
• 100s of MB/s to run the first copy
• Self-heal daemon just has to “top-up” the node
• Heals only the data that changed while the node was offline
• Easy to automate
• Predictable, fixed procedure
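A hedged sketch of the improved flow; the xfsdump/xfsrestore pipeline and the host/path names are illustrative, and the real automation also handles labels, retries and verification:

```python
import subprocess

def block_copy_and_swap(volume, src_host, src_path, dst_host, dst_path):
    # 1. Bulk copy at disk speed while the source brick is offline, so
    #    xfsdump sees a quiescent filesystem (xattrs are preserved).
    copy = (f"xfsdump -l 0 -L brick -M brick - {src_path} "
            f"| ssh {dst_host} 'xfsrestore - {dst_path}'")
    subprocess.run(["ssh", src_host, copy], check=True)
    # 2. Swap the pre-populated brick into the volume; SHD only has to
    #    "top up" whatever changed while the copy was running.
    subprocess.run(["gluster", "volume", "replace-brick", volume,
                    f"{src_host}:{src_path}", f"{dst_host}:{dst_path}",
                    "commit", "force"], check=True)
```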
Final Thoughts
• Layout is important
• Data unavailability can be avoided
• Decompose into host-level & volume-level procedures
• Keep the procedures simple & predictable
• Avoid overly complex automation with many edge cases