You deployed automation, enabled automatic database master failover and tested it many times: great, you can now sleep at night without being paged by a failing server. However, when you wake up in the morning, things might not have gone the way you expect. This talk will be about such a surprise.
Once upon a time, a failure brought down a MySQL master database. Automation kicked in and fixed things. However, a fancy failure, combined with human errors, an edge-case recovery, and a lack of oversight in tooling and scripting lead to a split-brain and data corruption. This talk will go into details about the convoluted—but still real-world—sequence of events that lead to this disaster. I cover what could have avoided the split-brain and what could have made data reconciliation easier.
Almost Perfect Service Discovery and Failover with ProxySQL and OrchestratorJean-François Gagné
Of course there is no such thing as perfect service discovery, and we will see why in the talk. However, the way ProxySQL is deployed in this case minimizes the risk for split-brains, and this is why I qualify it as almost perfect. But let’s step back a little...
MySQL alone is not a high availability solution. To provide resilience to primary failure, other components need to be integrated with MySQL. At MessageBird, these additional components are ProxySQL and Orchestrator. In this talk, we describe how ProxySQL is architectured to provide close to perfect Service Discovery and how this, combined with Orchestrator, allows for automatic failover. The talk presents the details of the integration of MySQL, ProxySQL and Orchestrator in Google Cloud (and it would be easy to re-implement a similar architecture at other cloud vendors or on-premises). We will also cover lessons learned for the 2 years this architecture has been in production. Come to this talk to learn more about MySQL high availability, ProxySQL and Orchestrator.
Since 5.7.2, MySQL implements parallel replication in the same schema, also known as LOGICAL_CLOCK (DATABASE based parallel replication is also implemented in 5.6 but this is not covered in this talk). In early 5.7 versions, parallel replication was based on group commit (like MariaDB) and 5.7.6 changed that to intervals.
Intervals are more complicated but they are also more powerful. In this talk, I will explain in detail how they work and why intervals are better than group commit. I will also cover how to optimize parallel replication in MySQL 5.7 and what improvements are coming in MySQL 8.0.
MySQL Parallel Replication: All the 5.7 and 8.0 Details (LOGICAL_CLOCK)Jean-François Gagné
To get better replication speed and less lag, MySQL implements parallel replication in the same schema, also known as LOGICAL_CLOCK. But fully benefiting from this feature is not as simple as just enabling it.
In this talk, I explain in detail how this feature works. I also cover how to optimize parallel replication and the improvements made in MySQL 8.0 and back-ported in 5.7 (Write Sets), greatly improving the potential for parallel execution on replicas (but needing RBR).
Come to this talk to get all the details about MySQL 5.7 and 8.0 Parallel Replication.
How to set up orchestrator to manage thousands of MySQL serversSimon J Mudd
This document discusses how to scale Orchestrator to manage thousands of MySQL servers. It describes how Booking.com uses Orchestrator to manage their thousands of MySQL servers. As the number of monitored servers increases, integration with internal infrastructure is needed, Orchestrator performance must be optimized, and high availability and wider user access features are added. The document provides examples of configuration settings and special considerations needed to effectively use Orchestrator at large scale.
Up to MySQL 5.5, replication was not crash safe: after an unclean shutdown, it would fail with “duplicate key” or “row not found” error, or might generate silent data corruption. It looks like 5.6 is much better, right ? The short answer is maybe: in the simplest case, it is possible to achieve replication crash safety, but it is not the default setting. MySQL 5.7 is not much better, 8.0 has better defaults, but it is still not replication crash-safe by default, and it is still easy to get things wrong.
Crash safety is impacted by replication positioning (File+Position or GTID), type (single-threaded or MTS), MTS settings (Database or Logical Clock, and with or without slave preserve commit order), the sync-ing of relay logs, the presence of binary logs, log-slave-updates and the sync-ing of binary logs. This is very complicated stuff and even the manual is sometimes confused about it.
In this talk, I will explain the impact of the above and help you find the path to crash safety nirvana. I will also give details about replication internals, so you might learn a thing or two.
The presentation covers improvements made to the redo logs in MySQL 8.0 and their impact on the MySQL performance and Operations. This covers the MySQL version still MySQL 8.0.30.
MariaDB MaxScale: an Intelligent Database ProxyMarkus Mäkelä
MariaDB MaxScale is a database proxy that abstracts database clusters to simplify application development and management. It isolates complexity by providing a single logical view of the database while enabling high availability, scalability and performance. MaxScale intelligently routes queries by classifying them, load balancing across nodes, and handling failures transparently using monitors to track cluster state. It supports various cluster types including master-slave and synchronous replication. Filters can extend its functionality such as enforcing consistent reads. MaxScale abstracts different database clusters to behave like a single highly available database.
Almost Perfect Service Discovery and Failover with ProxySQL and OrchestratorJean-François Gagné
Of course there is no such thing as perfect service discovery, and we will see why in the talk. However, the way ProxySQL is deployed in this case minimizes the risk for split-brains, and this is why I qualify it as almost perfect. But let’s step back a little...
MySQL alone is not a high availability solution. To provide resilience to primary failure, other components need to be integrated with MySQL. At MessageBird, these additional components are ProxySQL and Orchestrator. In this talk, we describe how ProxySQL is architectured to provide close to perfect Service Discovery and how this, combined with Orchestrator, allows for automatic failover. The talk presents the details of the integration of MySQL, ProxySQL and Orchestrator in Google Cloud (and it would be easy to re-implement a similar architecture at other cloud vendors or on-premises). We will also cover lessons learned for the 2 years this architecture has been in production. Come to this talk to learn more about MySQL high availability, ProxySQL and Orchestrator.
Since 5.7.2, MySQL implements parallel replication in the same schema, also known as LOGICAL_CLOCK (DATABASE based parallel replication is also implemented in 5.6 but this is not covered in this talk). In early 5.7 versions, parallel replication was based on group commit (like MariaDB) and 5.7.6 changed that to intervals.
Intervals are more complicated but they are also more powerful. In this talk, I will explain in detail how they work and why intervals are better than group commit. I will also cover how to optimize parallel replication in MySQL 5.7 and what improvements are coming in MySQL 8.0.
MySQL Parallel Replication: All the 5.7 and 8.0 Details (LOGICAL_CLOCK)Jean-François Gagné
To get better replication speed and less lag, MySQL implements parallel replication in the same schema, also known as LOGICAL_CLOCK. But fully benefiting from this feature is not as simple as just enabling it.
In this talk, I explain in detail how this feature works. I also cover how to optimize parallel replication and the improvements made in MySQL 8.0 and back-ported in 5.7 (Write Sets), greatly improving the potential for parallel execution on replicas (but needing RBR).
Come to this talk to get all the details about MySQL 5.7 and 8.0 Parallel Replication.
How to set up orchestrator to manage thousands of MySQL serversSimon J Mudd
This document discusses how to scale Orchestrator to manage thousands of MySQL servers. It describes how Booking.com uses Orchestrator to manage their thousands of MySQL servers. As the number of monitored servers increases, integration with internal infrastructure is needed, Orchestrator performance must be optimized, and high availability and wider user access features are added. The document provides examples of configuration settings and special considerations needed to effectively use Orchestrator at large scale.
Up to MySQL 5.5, replication was not crash safe: after an unclean shutdown, it would fail with “duplicate key” or “row not found” error, or might generate silent data corruption. It looks like 5.6 is much better, right ? The short answer is maybe: in the simplest case, it is possible to achieve replication crash safety, but it is not the default setting. MySQL 5.7 is not much better, 8.0 has better defaults, but it is still not replication crash-safe by default, and it is still easy to get things wrong.
Crash safety is impacted by replication positioning (File+Position or GTID), type (single-threaded or MTS), MTS settings (Database or Logical Clock, and with or without slave preserve commit order), the sync-ing of relay logs, the presence of binary logs, log-slave-updates and the sync-ing of binary logs. This is very complicated stuff and even the manual is sometimes confused about it.
In this talk, I will explain the impact of the above and help you find the path to crash safety nirvana. I will also give details about replication internals, so you might learn a thing or two.
The presentation covers improvements made to the redo logs in MySQL 8.0 and their impact on the MySQL performance and Operations. This covers the MySQL version still MySQL 8.0.30.
MariaDB MaxScale: an Intelligent Database ProxyMarkus Mäkelä
MariaDB MaxScale is a database proxy that abstracts database clusters to simplify application development and management. It isolates complexity by providing a single logical view of the database while enabling high availability, scalability and performance. MaxScale intelligently routes queries by classifying them, load balancing across nodes, and handling failures transparently using monitors to track cluster state. It supports various cluster types including master-slave and synchronous replication. Filters can extend its functionality such as enforcing consistent reads. MaxScale abstracts different database clusters to behave like a single highly available database.
Replication Troubleshooting in Classic VS GTIDMydbops
This presentation talk will assist you in troubleshooting MySQL replication for the most common issues we might face with a simple comparison of how can we get them solved in the different replication methods (Classic VS GTID).
MySQL Database Architectures - InnoDB ReplicaSet & ClusterKenny Gryp
This document provides an overview and comparison of MySQL InnoDB Cluster and MySQL InnoDB ReplicaSet. It discusses the components, goals, and features of each solution. MySQL InnoDB Cluster uses Group Replication to provide high availability, automatic failover, and data consistency. MySQL InnoDB ReplicaSet uses asynchronous replication and provides availability and read scaling through manual primary/secondary configuration and failover. Both solutions integrate MySQL Shell, Router, and automatic member provisioning for easy management.
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...Jean-François Gagné
Since 5.7.2, MySQL implements parallel replication in the same schema, also known as LOGICAL_CLOCK (DATABASE based parallel replication is also implemented in 5.6 but this is not covered in this talk). In early 5.7 versions, parallel replication was based on group commit (like MariaDB) and 5.7.6 changed that to intervals.
Intervals are more complicated but they are also more powerful. In this talk, I will explain in detail how they work and why intervals are better than group commit. I will also cover how to optimize parallel replication in MySQL 5.7 and what improvements are coming in MySQL 8.0. I will also explain why Group Replication is replicating faster than standard asynchronous replication.
Come to this talk to get all the details about MySQL 5.7 Parallel Replication.
24시간 365일 서비스를 위한 MySQL DB 이중화.
MySQL 이중화 방안들에 대해 알아보고 운영하면서 겪은 고민들을 이야기해 봅니다.
목차
1. DB 이중화 필요성
2. 이중화 방안
- HW 이중화
- MySQL Replication 이중화
3. 이중화 운영 장애
4. DNS와 VIP
5. MySQL 이중화 솔루션 비교
대상
- MySQL을 서비스하고 있는 인프라 담당자
- MySQL 이중화에 관심 있는 개발자
MariaDB 10.0 introduces domain-based parallel replication which allows transactions in different domains to execute concurrently on replicas. This can result in out-of-order transaction commit. MariaDB 10.1 adds optimistic parallel replication which maintains commit order. The document discusses various parallel replication techniques in MySQL and MariaDB including schema-based replication in MySQL 5.6 and logical clock replication in MySQL 5.7. It provides performance benchmarks of these techniques from Booking.com's database environments.
This document summarizes information about scaling out MariaDB server and cluster with MaxScale. It discusses the ReadWriteSplit and ConnectionRoute routers for scaling out a MariaDB server replication topology. It also discusses using the Galera Monitor and ReadWriteSplit modules to scale out a MariaDB Cluster deployment with MaxScale. Configurations are provided for using these routers and monitors with MaxScale.
MySQL Administrator
Basic course
- MySQL 개요
- MySQL 설치 / 설정
- MySQL 아키텍처 - MySQL 스토리지 엔진
- MySQL 관리
- MySQL 백업 / 복구
- MySQL 모니터링
Advanced course
- MySQL Optimization
- MariaDB / Percona
- MySQL HA (High Availability)
- MySQL troubleshooting
네오클로바
http://neoclova.co.kr/
Orchestrator allows for easy MySQL failover by monitoring the cluster and promoting a new master when failures occur. Two test cases were demonstrated: 1) using a VIP and scripts to redirect connections during failover and 2) integrating with Proxysql to separate reads and writes and automatically redirect write transactions during failover while keeping read queries distributed. Both cases resulted in failover occurring within 16 seconds while maintaining application availability.
New features in ProxySQL 2.0 (updated to 2.0.9) by Rene Cannao (ProxySQL)Altinity Ltd
ProxySQL 2.0 includes several new features such as query cache improvements, GTID causal reads for consistency, native Galera cluster support, Amazon Aurora integration, LDAP authentication, improved SSL support, a new audit log, and performance enhancements. It also adds new monitoring tables, variables, and configuration options to support these features.
MySQL Scalability and Reliability for Replicated EnvironmentJean-François Gagné
This summary provides an overview of the key points from the document:
1. The document is a presentation on MySQL replication scalability and reliability given at dataops.barcelona in June 2019. It covers topics like introduction to replication, use cases for replication like read scaling and high availability, and best practices.
2. The presentation provides an overview of MySQL replication including what it is, why you would use it, and how it works at a high level. It also discusses tools for monitoring and visualizing replication topology.
3. Challenges like replication lag are discussed along with techniques to prevent and address lag, such as transaction design practices and throttling. Advanced topics like parallel replication are also mentioned.
MySQL performance can be improved by tuning queries, server options, and hardware. Traditionally it was an area of responsibility for three different roles: Development, DBA, and System Administrators. Now DevOps handle these all. But there is a gap. Knowledge gained by MySQL DBAs after years or focusing on a single product is hard to gain when you focus on more than one. This is why I am doing this session. I will show a minimal but most effective set of options to improve MySQL performance. For illustrations, I will use real user stories gained from my Support experience and Percona Kubernetes operators for PXC and MySQL.
MySQL 8.0.18 latest updates: Hash join and EXPLAIN ANALYZENorvald Ryeng
This presentation focuses on two of the new features in MySQL 8.0.18: hash joins and EXPLAIN ANALYZE. It covers how these features work, both on the surface and on the inside, and how you can use them to improve your queries and make them go faster.
Both features are the result of major refactoring of how the MySQL executor works. In addition to explaining and demonstrating the features themselves, the presentation looks at how the investment in a new iterator based executor prepares MySQL for a future with faster queries, greater plan flexibility and even more SQL features.
GTIDs were introduced to solve replication problems and improve database consistency in MySQL database replication.
When, accidentally, transactions occur on a replica, this introduces GTIDs on that replica that don't exist on the master. When, on a master failover, this replica becomes the new master, and the corresponding binlogs of the errant GTIDs are already purged, replication breaks on the replicas of this new master, because those missing GTIDs can't be retrieved from the binlogs of this new master.
This presentation will talk about GTIDs and how to detect errant GTIDs on a replica (before the corresponding binlogs are purged) and how to look at the corresponding transactions in the binlogs. I'll give some examples of transactions that could happen on a replica that didn't originate from a primary node, explain how this is possible and share some tips on how to avoid this.
Basic understanding of MySQL database replication is assumed.
This presentation was at Percona Live 2019 in Austin, Texas.
https://www.percona.com/live/19/sessions/errant-gtids-breaking-replication-how-to-detect-and-avoid-them
MMUG18 - MySQL Failover and OrchestratorSimon J Mudd
Description of recovery and failover with of MySQL and specifically how orchestrator is designed to make this simpler.
A presentation I gave at the Madrid MySQL Users Group on 17/05/2017.
This presentation covers MySQL data encryption at disk. How to encrypt all tablespaces and MySQL related files for the compliances ? The new releases in MySQL 8.0 take care of the encryption of the system tablespace and supporting tables unlike MySQL 5.7.
You’ve deployed automation, enabled automatic master failover and tested it many times: great, you can now sleep at night without being paged by a failing server. However, when you wake up in the morning, things might not have gone the way you expect. This talk will be about such surprise.
Once upon a time, a failure brought down a master. Automation kicked in and fixed things. However, a fancy failure, combined with human errors, with an edge-case recovery, and a lack of oversight in automation, lead to a split-brain. This talk will go into details about the convoluted - but still real world - sequence of events that lead to this disaster. I will cover what could have avoided the split-brain and what could have make things easier to fix it.
This short talk will be about an incident that kept DBAs working on a weekend. Two bugs, one in our application code and one in the database, joined force and almost brought down Booking.com. And this occurred at one of the worst possible times. Curious about what happened: come to this talk to learn more.
Replication Troubleshooting in Classic VS GTIDMydbops
This presentation talk will assist you in troubleshooting MySQL replication for the most common issues we might face with a simple comparison of how can we get them solved in the different replication methods (Classic VS GTID).
MySQL Database Architectures - InnoDB ReplicaSet & ClusterKenny Gryp
This document provides an overview and comparison of MySQL InnoDB Cluster and MySQL InnoDB ReplicaSet. It discusses the components, goals, and features of each solution. MySQL InnoDB Cluster uses Group Replication to provide high availability, automatic failover, and data consistency. MySQL InnoDB ReplicaSet uses asynchronous replication and provides availability and read scaling through manual primary/secondary configuration and failover. Both solutions integrate MySQL Shell, Router, and automatic member provisioning for easy management.
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...Jean-François Gagné
Since 5.7.2, MySQL implements parallel replication in the same schema, also known as LOGICAL_CLOCK (DATABASE based parallel replication is also implemented in 5.6 but this is not covered in this talk). In early 5.7 versions, parallel replication was based on group commit (like MariaDB) and 5.7.6 changed that to intervals.
Intervals are more complicated but they are also more powerful. In this talk, I will explain in detail how they work and why intervals are better than group commit. I will also cover how to optimize parallel replication in MySQL 5.7 and what improvements are coming in MySQL 8.0. I will also explain why Group Replication is replicating faster than standard asynchronous replication.
Come to this talk to get all the details about MySQL 5.7 Parallel Replication.
24시간 365일 서비스를 위한 MySQL DB 이중화.
MySQL 이중화 방안들에 대해 알아보고 운영하면서 겪은 고민들을 이야기해 봅니다.
목차
1. DB 이중화 필요성
2. 이중화 방안
- HW 이중화
- MySQL Replication 이중화
3. 이중화 운영 장애
4. DNS와 VIP
5. MySQL 이중화 솔루션 비교
대상
- MySQL을 서비스하고 있는 인프라 담당자
- MySQL 이중화에 관심 있는 개발자
MariaDB 10.0 introduces domain-based parallel replication which allows transactions in different domains to execute concurrently on replicas. This can result in out-of-order transaction commit. MariaDB 10.1 adds optimistic parallel replication which maintains commit order. The document discusses various parallel replication techniques in MySQL and MariaDB including schema-based replication in MySQL 5.6 and logical clock replication in MySQL 5.7. It provides performance benchmarks of these techniques from Booking.com's database environments.
This document summarizes information about scaling out MariaDB server and cluster with MaxScale. It discusses the ReadWriteSplit and ConnectionRoute routers for scaling out a MariaDB server replication topology. It also discusses using the Galera Monitor and ReadWriteSplit modules to scale out a MariaDB Cluster deployment with MaxScale. Configurations are provided for using these routers and monitors with MaxScale.
MySQL Administrator
Basic course
- MySQL 개요
- MySQL 설치 / 설정
- MySQL 아키텍처 - MySQL 스토리지 엔진
- MySQL 관리
- MySQL 백업 / 복구
- MySQL 모니터링
Advanced course
- MySQL Optimization
- MariaDB / Percona
- MySQL HA (High Availability)
- MySQL troubleshooting
네오클로바
http://neoclova.co.kr/
Orchestrator allows for easy MySQL failover by monitoring the cluster and promoting a new master when failures occur. Two test cases were demonstrated: 1) using a VIP and scripts to redirect connections during failover and 2) integrating with Proxysql to separate reads and writes and automatically redirect write transactions during failover while keeping read queries distributed. Both cases resulted in failover occurring within 16 seconds while maintaining application availability.
New features in ProxySQL 2.0 (updated to 2.0.9) by Rene Cannao (ProxySQL)Altinity Ltd
ProxySQL 2.0 includes several new features such as query cache improvements, GTID causal reads for consistency, native Galera cluster support, Amazon Aurora integration, LDAP authentication, improved SSL support, a new audit log, and performance enhancements. It also adds new monitoring tables, variables, and configuration options to support these features.
MySQL Scalability and Reliability for Replicated EnvironmentJean-François Gagné
This summary provides an overview of the key points from the document:
1. The document is a presentation on MySQL replication scalability and reliability given at dataops.barcelona in June 2019. It covers topics like introduction to replication, use cases for replication like read scaling and high availability, and best practices.
2. The presentation provides an overview of MySQL replication including what it is, why you would use it, and how it works at a high level. It also discusses tools for monitoring and visualizing replication topology.
3. Challenges like replication lag are discussed along with techniques to prevent and address lag, such as transaction design practices and throttling. Advanced topics like parallel replication are also mentioned.
MySQL performance can be improved by tuning queries, server options, and hardware. Traditionally it was an area of responsibility for three different roles: Development, DBA, and System Administrators. Now DevOps handle these all. But there is a gap. Knowledge gained by MySQL DBAs after years or focusing on a single product is hard to gain when you focus on more than one. This is why I am doing this session. I will show a minimal but most effective set of options to improve MySQL performance. For illustrations, I will use real user stories gained from my Support experience and Percona Kubernetes operators for PXC and MySQL.
MySQL 8.0.18 latest updates: Hash join and EXPLAIN ANALYZENorvald Ryeng
This presentation focuses on two of the new features in MySQL 8.0.18: hash joins and EXPLAIN ANALYZE. It covers how these features work, both on the surface and on the inside, and how you can use them to improve your queries and make them go faster.
Both features are the result of major refactoring of how the MySQL executor works. In addition to explaining and demonstrating the features themselves, the presentation looks at how the investment in a new iterator based executor prepares MySQL for a future with faster queries, greater plan flexibility and even more SQL features.
GTIDs were introduced to solve replication problems and improve database consistency in MySQL database replication.
When, accidentally, transactions occur on a replica, this introduces GTIDs on that replica that don't exist on the master. When, on a master failover, this replica becomes the new master, and the corresponding binlogs of the errant GTIDs are already purged, replication breaks on the replicas of this new master, because those missing GTIDs can't be retrieved from the binlogs of this new master.
This presentation will talk about GTIDs and how to detect errant GTIDs on a replica (before the corresponding binlogs are purged) and how to look at the corresponding transactions in the binlogs. I'll give some examples of transactions that could happen on a replica that didn't originate from a primary node, explain how this is possible and share some tips on how to avoid this.
Basic understanding of MySQL database replication is assumed.
This presentation was at Percona Live 2019 in Austin, Texas.
https://www.percona.com/live/19/sessions/errant-gtids-breaking-replication-how-to-detect-and-avoid-them
MMUG18 - MySQL Failover and OrchestratorSimon J Mudd
Description of recovery and failover with of MySQL and specifically how orchestrator is designed to make this simpler.
A presentation I gave at the Madrid MySQL Users Group on 17/05/2017.
This presentation covers MySQL data encryption at disk. How to encrypt all tablespaces and MySQL related files for the compliances ? The new releases in MySQL 8.0 take care of the encryption of the system tablespace and supporting tables unlike MySQL 5.7.
You’ve deployed automation, enabled automatic master failover and tested it many times: great, you can now sleep at night without being paged by a failing server. However, when you wake up in the morning, things might not have gone the way you expect. This talk will be about such surprise.
Once upon a time, a failure brought down a master. Automation kicked in and fixed things. However, a fancy failure, combined with human errors, with an edge-case recovery, and a lack of oversight in automation, lead to a split-brain. This talk will go into details about the convoluted - but still real world - sequence of events that lead to this disaster. I will cover what could have avoided the split-brain and what could have make things easier to fix it.
This short talk will be about an incident that kept DBAs working on a weekend. Two bugs, one in our application code and one in the database, joined force and almost brought down Booking.com. And this occurred at one of the worst possible times. Curious about what happened: come to this talk to learn more.
MySQLinsanity! This document provides an overview of Stanley Huang's MySQL performance tuning experience and expertise. It begins with introductions and background on Stanley Huang. It then discusses the typical phases of MySQL performance tuning projects, including SQL tuning and RDBMS tuning. Specific tips are provided around topics like slow query logging, index usage, partitioning, and server configuration. The document concludes with an invitation for questions.
MySQL 5.7 innodb_enhance_partii_20160527Saewoong Lee
Release Date : 2016.05.27
Version : MySQL 5.7
Index :
- Part I : InnoDB Performance
- Part I : InnoDB Buffer Pool Flushing
- Part I : InnoDB internal Transaction General
- Part I : InnoDB Improved adaptive flushing
- Part II : InnoDB Online DDL
- Part II : Tablespace management
- Part II : InnoDB Bulk Load for Create Index
- Part II : InnoDB Temporary Tables
- Part II : InnoDB Full-Text CJK Support
- Part II : Support Syslog on Linux / Unix OS
- Part II : Performance_schema
- Part II : Useful tips
Percona Live '18 Tutorial: The Accidental DBAJenni Snyder
The document is an agenda for a talk titled "The Accidental DBA" about MySQL database administration. The talk covers MySQL basics like installation, configuration, backups and replication. It is meant to introduce attendees to common MySQL administration tasks and help them get more comfortable with database administration even if they came to it by accident. The speaker is Jenni Snyder, an engineering manager at Yelp who became their first DBA in 2011.
Beyond php - it's not (just) about the codeWim Godden
Most PHP developers focus on writing code. But creating Web applications is about much more than just wrting PHP. Take a step outside the PHP cocoon and into the big PHP ecosphere to find out how small code changes can make a world of difference on servers and network. This talk is an eye-opener for developers who spend over 80% of their time coding, debugging and testing.
Gdb can be used by MySQL DBAs as a last resort tool to troubleshoot issues. It allows inspecting variable values, setting variables, calling functions, and getting stack traces from a running or crashed mysqld process. The presentation provides examples of using gdb to study InnoDB locks, metadata locks, and real bugs. While gdb can help in some cases, ideally DBAs should use profiling tools, implement missing features, and follow best practices to avoid needing gdb.
Cloudy with a Chance of Fireballs: Provisioning and Certificate Management in...Puppet
The document discusses managing trusted instances in the cloud. It outlines the problem of verifying instances provisioned in the cloud are legitimate. It then provides an overview of a solution where instances generate certificate signing requests with metadata upon launch, and a puppetmaster signs the requests after verifying the instance information with the cloud provider API. Signed certificates are returned to the instances containing the metadata, allowing the instances to be identified and classified in puppet configurations.
This document provides instructions for uploading a PHP shell to a vulnerable website using SQL injection. It begins by creating a vulnerable PHP page and MySQL database. SQL injection is then demonstrated by manipulating the 'id' parameter. The number of columns in the database table is determined. Hexadecimal encoding is used to insert a PHP uploader script into the database as a file. The uploaded file path is then manipulated to execute the PHP shell.
Working in Web Operations means dealing with production systems that in most cases needs to be operational 24×7x365.
To reach 99.99999% uptime, you must fail as little as possible.
This talk will go through a few real-world incidents and failures experienced by our small WebOps team, and outline what we are learning (the hard way), and how we’re trying to improve.
What could possibly go wrong? :-)
No matter how resilient your database infrastructure is, backups are still needed to defend against catastrophic failures. Be it the unlikely hardware failure of all data centers, or the more likely and all-too-human user error. Acknowledging the importance of good backup procedures, the Scylla Manager now natively supports backup and restore operations. In this talk, we will learn more about how that works and the guarantees provided, as well as how to set it up to guarantee maximum resiliency to your cluster.
This document discusses functional database strategies, specifically focusing on immutable rows, window functions, event-sourcing, and database interactions from a functional perspective. The key points covered include:
1. Making database rows immutable by storing them as a log of changes rather than allowing in-place updates. This allows reconstructing the full history.
2. Using window functions to partition and aggregate over row groups to enable querying the latest or historical state.
3. Storing a log of events rather than the current application state, in line with event-sourcing principles.
4. Techniques for interacting with databases in a functional way, including restricting mutations and optimizing data structures.
Case Study: MySQL migration from latin1 to UTF-8Olivier DASINI
This document summarizes Olivier Dasini's presentation on migrating a MySQL database from the latin1 character set to UTF-8. Some key points:
- The migration involved converting database tables, columns, and data to the UTF-8 character set and UTF-8 collations to support non-Latin characters from around the world.
- Challenges included minimizing downtime to avoid loss of income, dealing with legacy data issues, and handling errors due to differences between character sets.
- The solution involved a rolling upgrade approach, with slaves being migrated first to test the process before a master-slave switchover.
- Significant effort was required to clean legacy data issues and handle errors manually
Build a DataWarehouse for your logs with Python, AWS Athena and GlueMaxym Kharchenko
This document discusses using SQL to query logs stored in Amazon S3. It describes a pipeline for structuring unstructured log files, creating tables to represent the logs, transforming and compacting the data, and then querying the logs using AWS Athena. The pipeline involves using AWS Glue to schedule ETL jobs that convert logs to structured JSON, transform and compact the data into Parquet format, and build materialized views. AWS Athena then provides a SQL interface for interactively querying the logs stored in S3.
As the leap second approaches, there is no better time to reflect on our misconceptions about time and numerals, past catastrophes and possible mitigation techniques.
causos da linha de frente - #rsonrails 2011bzanchet
This document discusses:
1) The history and infrastructure of the company Linode, including their pricing and server configurations.
2) Details about scaling a Rails application on Softlayer servers to handle 600 million pageviews and 13 million images per month.
3) Examples of code optimizations made to the application including fragment caching and database optimizations like dependent destroys and counter caches.
MySQL 5.7 Tutorial Dutch PHP Conference 2015Dave Stokes
MySQL 5.7 is the latest version of the MySQL database. It includes new features such as support for JSON as a native data type with functions for manipulating JSON documents. Security has also been improved with secure defaults, password rotation/expiration controls, and SSL encryption enabled by default for the C client library. The release candidate for 5.7 was released in April 2015 and includes patches, contributions, and enhancements over previous versions.
Similar to Autopsy of a MySQL Automation Disaster (20)
Have you ever needed to get some additional write throughput from MySQL ? If yes, you probably found that setting sync_binlog to 0 (and trx_commit to 2) gives you an extra performance boost. As all such easy optimisation, it comes at a cost. This talk explains how this tuning works, presents its consequences and makes recommendations to avoid them. This will bring us to the details of how MySQL commits transactions and how those are replicated to slaves. Come to this talk to learn how to get the benefit of this tuning the right way and to learn some replication internals.
MySQL Scalability and Reliability for Replicated EnvironmentJean-François Gagné
You have a working application that is using MySQL: great! At the beginning, you are probably using a single database instance, and maybe – but not necessarily – you have replication for backups, but you are not reading from slaves yet. Scalability and reliability were not the main focus in the past, but they are starting to be a concern. Soon, you will have many databases and you will have to deal with replication lag. This talk will present how to tackle the transition.
We mostly cover standard/asynchronous replication, but we will also touch on Galera and Group Replication. We present how to adapt the application to become replication-friendly, which facilitate reading from and failing over to slaves. We also present solutions for managing read views at scale and enabling read-your-own-writes on slaves. We also touch on vertical and horizontal sharding for when deploying bigger servers is not possible anymore.
Are UNIQUE and FOREIGN KEYs still possible at scale, what are the downsides of AUTO_INCREMENTs, how to avoid overloading replication, what are the limits of archiving, … Come to this talk to get answers and to leave with tools for tackling the challenges of the future.
Up to MySQL 5.5, replication was not crash safe: it would fail with “dup.key” or “not found” error (or data corruption). So 5.6 is better, right? Maybe: it is possible, but not the default. MySQL 5.7 is not much better, 8.0 has safer defaults but it is still easy to get things wrong.
Crash safety is impacted by positioning (File+Pos or GTID), type (single/multi-threaded), MTS settings (Db/Logical Clock, and preserve commit order), the sync-ing of relay logs, the presence of binlogs, log-slave-updates and their sync-ing. This is complicated and even the manual is confused about it.
In this talk, I will explain above with details on replication internals, so you might learn a thing or two.
Up to MySQL 5.5, replication was not crash safe: after a crash, it would fail with "duplicate key" or "row not found" error, or might generate silent data corruption. It looks like 5.6 is much better, right? The short answer is maybe: in the simplest case, it is possible to achieve replication crash safety but it is not the default setting. MySQL 5.7 is not much better, 8.0 has safer defaults but it is still easy to get things wrong.
Crash safety is impacted by replication positioning (File+Pos or GTID), type (single-threaded or MTS), MTS settings (Database or Logical Clock, and with or without slave preserve commit order), the sync-ing of relay logs, the presence of binary logs, log-slave-updates and their sync-ing. This is very complicated stuff and even the manual is confused about it.
In this talk, I will explain the impact of above and help you finding the path to crash safety nirvana. I will also give details about replication internals, so you might learn a thing or two.
MySQL/MariaDB Parallel Replication: inventory, use-case and limitationsJean-François Gagné
- The document discusses various parallel replication technologies in MySQL/MariaDB including schema-based parallel replication in MySQL 5.6, group commit-based approaches in MariaDB 10.0 and MySQL 5.7, and optimistic parallel replication in MariaDB 10.1.
- It provides an overview of how each approach tags and dispatches transactions to worker threads on slaves and their limitations regarding transaction ordering and gaps.
- Examples from Booking.com show how parallel replication can scale to thousands of servers but also hit issues like long transactions blocking progress.
MySQL/MariaDB replication is asynchronous. You can make replication faster by using better hardware (faster CPU, more RAM, or quicker disks), or you can use parallel replication to remove it single-threaded limitation; but lag can still happen. This talk is not about making replication faster, it is how to deal with its asynchronous nature, including the (in-)famous lag.
We will start by explaining the consequences of asynchronous replication and how/when lag can happen. Then, we will present the solution used at Booking.com to avoid both creating lag and minimize the consequence of stale reads on slaves (hint: this solution does not mean reading from the master because this does not scale).
Once all above is well understood, we will discuss how Booking.com’s solution can be improved: this solution was designed years ago and we would do this differently if starting from scratch today. Finally, I will present an innovative way to avoid lag: the no-slave-left-behind MariaDB patch.
MySQL Parallel Replication: inventory, use-case and limitationsJean-François Gagné
Booking.com uses MySQL parallel replication extensively with thousands of servers replicating. The presentation summarized MySQL and MariaDB parallel replication features including: 1) MySQL 5.6 uses schema-based parallel replication but transactions commit out of order. 2) MariaDB 10.0 introduced out-of-order parallel replication using write domains that can cause gaps. 3) MariaDB 10.1 includes five parallel modes including optimistic replication to reduce deadlocks during parallel execution. Long transactions and intermediate masters can limit parallelism.
MySQL Parallel Replication: inventory, use-case and limitationsJean-François Gagné
In the last 24 months, MySQL replication speed has improved a lot thanks to implementing parallel replication. MySQL and MariaDB have different types of parallel replication; in this talk, I present in details the different implementations, with their limitations and the corresponding tuning parameters. I also present benchmark results from real Booking.com workloads. Finally, I discuss some deployments at Booking.com that benefits from parallel replication speed improvements.
MySQL Parallel Replication: inventory, use-cases and limitationsJean-François Gagné
In the last 24 months, MySQL replication speed has improved a lot thanks to implementing parallel replication. MySQL and MariaDB have different types of parallel replication; in this talk, I present in detail the different implementations, with their limitations and the corresponding tuning parameters (covering MySQL 5.6, MariaDB 10.0, MariaDB 10.1 and MySQL 5.7). I also present benchmark results from real Booking.com workloads. Finally, I discuss some deployments at Booking.com that benefits from parallel replication speed improvements.
Riding the Binlog: an in Deep Dissection of the Replication StreamJean-François Gagné
Binary Logs are the cornerstone of MySQL Replication, but is it fully understood ? To start apprehending this, we can think of the binary logs as a transport for a Stream of Transactions. Traveling from master to slave, sometimes via Intermediate Masters, this stream evolves: it can shrink by the application of filters, can grow by the addition of slave-local transactions, and two streams can merge by the usage of multi-source replication. After presenting the binary logs Stream Model, the different MySQL use-cases will be mapped to the model, which can serve as a validation of the model. After this validation, the model will be used to make prediction on new use-cases/features that could emerge in the future.
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for
seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Essentials of Automations: Exploring Attributes & Automation ParametersSafe Software
Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they?
Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality.
You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.
Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/how-axelera-ai-uses-digital-compute-in-memory-to-deliver-fast-and-energy-efficient-computer-vision-a-presentation-from-axelera-ai/
Bram Verhoef, Head of Machine Learning at Axelera AI, presents the “How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-efficient Computer Vision” tutorial at the May 2024 Embedded Vision Summit.
As artificial intelligence inference transitions from cloud environments to edge locations, computer vision applications achieve heightened responsiveness, reliability and privacy. This migration, however, introduces the challenge of operating within the stringent confines of resource constraints typical at the edge, including small form factors, low energy budgets and diminished memory and computational capacities. Axelera AI addresses these challenges through an innovative approach of performing digital computations within memory itself. This technique facilitates the realization of high-performance, energy-efficient and cost-effective computer vision capabilities at the thin and thick edge, extending the frontier of what is achievable with current technologies.
In this presentation, Verhoef unveils his company’s pioneering chip technology and demonstrates its capacity to deliver exceptional frames-per-second performance across a range of standard computer vision networks typical of applications in security, surveillance and the industrial sector. This shows that advanced computer vision can be accessible and efficient, even at the very edge of our technological ecosystem.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving
Manufacturing custom quality metal nameplates and badges involves several standard operations. Processes include sheet prep, lithography, screening, coating, punch press and inspection. All decoration is completed in the flat sheet with adhesive and tooling operations following. The possibilities for creating unique durable nameplates are endless. How will you create your brand identity? We can help!
Leveraging the Graph for Clinical Trials and Standards
Autopsy of a MySQL Automation Disaster
1. Autopsy of a MySQL automation
disaster
by Jean-François Gagné
presented at SRECon Dublin (October 2019)
Senior Infrastructure Engineer / System and MySQL Expert
jeanfrancois AT messagebird DOT com / @jfg956 #SREcon
2. To err is human
To really foul things up requires a computer[1]
(or a script)
[1]: http://quoteinvestigator.com/2010/12/07/foul-computer/
2
6. MySQL replication’’
● Orchestrator allows to:
● Visualize replication deployments
● Move slaves for planned maintenance of an intermediate master
● Automatically replace an intermediate master in case of its unexpected failure
(thanks to pseudo-GTIDs when we have not deployed GTIDs)
● Automatically replace a master in case of a failure (failing over to a slave)
● But Orchestrator cannot replace a master alone:
● Booking.com uses DNS for master discovery
● So Orchestrator calls a homemade script to repoint DNS (and to do other magic)
6
9. MySQL Master High Availability [1 of 4]
Failing-over the master to a slave is my favorite HA method
• But it is not as easy as it sounds, and it is hard to automate well
• An example of complete failover solution in production:
https://github.blog/2018-06-20-mysql-high-availability-at-github/
The five considerations of master high availability:
(https://jfg-mysql.blogspot.com/2019/02/mysql-master-high-availability-and-failover-more-thoughts.html)
• Plan how you are doing master high availability
• Decide when you apply your plan (Failure Detection – FD)
• Tell the application about the change (Service Discovery – SD)
• Protect against the limit of FD and SD for avoiding split-brains (Fencing)
• Fix your data if something goes wrong
9
10. 10
MySQL Master High Availability [2 of 4]
Failure detection (FD) is the 1st part (and 1st challenge) of failover
• It is a very hard problem: partial failure, unreliable network, partitions, …
• It is impossible to be 100% sure of a failure, and confidence needs time
à quick FD is unreliable, relatively reliable FD implies longer unavailability
Ø You need to accept that FD generates false positive (and/or false negative)
Repointing is the 2nd part of failover:
• Relatively easy with the right tools: MHA, GTID, Pseudo-GTID, Binlog Servers, …
• Complexity grows with the number of direct slaves of the master
(what if you cannot contact some of those slaves…)
• Some software for repointing:
• Orchestrator, Ripple Binlog Server,
Replication Manager, MHA, Cluster Control, MaxScale, …
11. 11
MySQL Master High Availability [3 of 4]
In this configuration and when the master fails,
one of the slave needs to be repointed to the new master:
12. 12
MySQL Master High Availability [4 of 4]
Service Discovery (SD) is the 3rd part (and 2nd challenge) of failover:
• If centralised, it is a SPOF; if distributed, impossible to update atomically
• SD will either introduce a bottleneck (including performance limits)
or will be unreliable in some way (pointing to the wrong master)
• Some ways to implement MySQL Master SD: DNS, ViP, Proxy, Zookeeper, …
http://code.openark.org/blog/mysql/mysql-master-discovery-methods-part-1-dns
Ø Unreliable FD and unreliable SD is a recipe for split-brains !
Protecting against split-brains (Fencing): Adv. Subject – not many solutions
(Proxies and semi-synchronous replication might help)
Fixing your data in case of a split-brain: only you can know how to do this !
(tip on this later in the talk)
13. MySQL Service Discovery @ MessageBird
MessageBird uses ProxySQL for MySQL Service Discovery
13
16. Our subject database
● Simple replication deployment (in two data centers):
Master RWs +---+
happen here --> | A |
+---+
|
+------------------------+
| |
Reads +---+ +---+
happen here --> | B | | X |
+---+ +---+
|
+---+ And reads
| Y | <-- happen here
+---+
16
17. Incident: 1st event
● A and B (two servers in same data center) fail at the same time:
Master RWs +-/+
happen here --> | A |
but now failing +/-+
Reads +-/+ +---+
happen here --> | B | | X |
but now failing +/-+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
17
(I will cover how/why this happened later.)
18. Incident: 1st event’
● Orchestrator fixes things:
+-/+
| A |
+/-+
Reads +-/+ +---+ Now, Master RWs
happen here --> | B | | X | <-- happen here
but now failing +/-+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
18
19. Split-brain: disaster
● A few things happen in that day and night, and I wake-up to this:
+-/+
| A |
+/-+
Master RWs +---+ +---+
happen here --> | B | | X |
+---+ +---+
|
+---+
| Y |
+---+
19
20. Split-brain: disaster’
● And to make things worse, reads are still happening on Y:
+-/+
| A |
+/-+
Master RWs +---+ +---+
happen here --> | B | | X |
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
20
21. Split-brain: disaster’’
● This is not good:
● When A and B failed, X was promoted as the new master
● Something made DNS point to B (we will see what later)
à writes are now happening on B
● But B is outdated: all writes to X (after the failure of A) did not reach B
● So we have data on X that cannot be read on B
● And we have new data on B
that is not read on Y
21
+-/+
| A |
+/-+
Master RWs +---+ +---+
happen here --> | B | | X |
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
22. Split-brain: analysis
● Digging more in the chain of events, we find that:
● After the 1st failure of A, a 2nd one was detected and Orchestrator failed over to B
● So after their failures, A and B came back and formed an isolated replication chain
22
+-/+
| A |
+/-+
+-/+ +---+ Master RWs
| B | | X | <-- happen here
+/-+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
23. Split-brain: analysis
● Digging more in the chain of events, we find that:
● After the 1st failure of A, a 2nd one was detected and Orchestrator failed over to B
● So after their failures, A and B came back and formed an isolated replication chain
● And something caused a failure of A
23
+---+
| A |
+---+
|
+---+ +---+ Master RWs
| B | | X | <-- happen here
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
24. Split-brain: analysis
● Digging more in the chain of events, we find that:
● After the 1st failure of A, a 2nd one was detected and Orchestrator failed over to B
● So after their failures, A and B came back and formed an isolated replication chain
● And something caused a failure of A
● But how did DNS end-up pointing to B ?
● The failover to B called the DNS repointing script
● The script stole the DNS entry
from X and pointed it to B
24
+-/+
| A |
+/-+
+---+ +---+ Master RWs
| B | | X | <-- happen here
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
25. Split-brain: analysis
● Digging more in the chain of events, we find that:
● After the 1st failure of A, a 2nd one was detected and Orchestrator failed over to B
● So after their failures, A and B came back and formed an isolated replication chain
● And something caused a failure of A
● But how did DNS end-up pointing to B ?
● The failover to B called the DNS repointing script
● The script stole the DNS entry
from X and pointed it to B
● But is that all: what made A fail ?
25
+-/+
| A |
+/-+
Master RWs +---+ +---+
happen here --> | B | | X |
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
26. Split-brain: analysis’
● What made A fail ?
● Once A and B came back up as a new replication chain, they had outdated data
● If B would have come back before A, it could have been re-slaved to X
26
+---+
| A |
+---+
|
+---+ +---+ Master RWs
| B | | X | <-- happen here
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
27. Split-brain: analysis’
● What made A fail ?
● Once A and B came back up as a new replication chain, they had outdated data
● If B would have come back before A, it could have been re-slaved to X
● But because A came back before re-slaving, it injected heartbeat and p-GTID to B
27
+-/+
| A |
+/-+
+---+ +---+ Master RWs
| B | | X | <-- happen here
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
28. Split-brain: analysis’
● What made A fail ?
● Once A and B came back up as a new replication chain, they had outdated data
● If B would have come back before A, it could have been re-slaved to X
● But because A came back before re-slaving, it injected heartbeat and p-GTID to B
● Then B could have been re-cloned without problems
28
+---+
| A |
+---+
|
+---+ +---+ Master RWs
| B | | X | <-- happen here
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
29. Split-brain: analysis’
● What made A fail ?
● Once A and B came back up as a new replication chain, they had outdated data
● If B would have come back before A, it could have been re-slaved to X
● But because A came back before re-slaving, it injected heartbeat and p-GTID to B
● Then B could have been re-cloned without problems
● But A was re-cloned instead (human error #1)
29
+---+
| A |
+---+
+-/+ +---+ Master RWs
| B | | X | <-- happen here
+/-+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
30. Split-brain: analysis’
● What made A fail ?
● Once A and B came back up as a new replication chain, they had outdated data
● If B would have come back before A, it could have been re-slaved to X
● But because A came back before re-slaving, it injected heartbeat and p-GTID to B
● Then B could have been re-cloned without problems
● But A was re-cloned instead (human error #1)
● Why did Orchestrator not fail-over right away ?
● B was promoted hours after A was brought down…
● Because A was downtimed only for 4 hours (human error #2)
30
+-/+
| A |
+/-+
+---+ +---+ Master RWs
| B | | X | <-- happen here
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
31. Orchestrator anti-flapping
● Orchestrator has a failover throttling/acknowledgment mechanism[1]:
● Automated recovery will happen
● for an instance in a cluster that has not recently been recovered
● unless such recent recoveries were acknowledged
● In our case:
● the recovery might have been acknowledged too early (human error #0 ?)
● with a too short “recently” timeout
● and maybe Orchestrator should not have failed over the second time
[1]: https://github.com/github/orchestrator/blob/master/docs/topology-recovery.md
#blocking-acknowledgments-anti-flapping
31
32. Split-brain: summary
● So in summary, this disaster was caused by:
1. A fancy failure: two servers failing at the same time
2. A debatable premature acknowledgment in Orchestrator
and probably too short a timeout for recent failover
3. Edge-case recovery: both servers forming a new replication topology
4. Restarting with the event-scheduler enabled (A injecting heartbeat and p-GTID)
5. Re-cloning wrong (A instead of B; should have created C and D and thrown away A
and B; too short downtime for the re-cloning)
6. Orchestrator failing over something that it should not have
(including taking an questionnable action when a downtime expired)
7. DNS repointing script not defensive enough
32
33. Fancy failure: more details
● Why did A and B fail at the same time ?
● Deployment error: the two servers in the same rack/failure domain ?
● And/or very unlucky ?
● Very unlucky because…
10 to 20 servers failed that day in the same data center
Because human operations and “sensitive” hardware
33
+-/+
| A |
+/-+
+-/+ +---+
| B | | X |
+/-+ +---+
|
+---+
| Y |
+---+
34. How to fix such situation ?
● Fixing “split-brain” data on B and X is hard
● Some solutions are:
● Kill B or X (and lose data)
● Replay writes from B on X or vice-versa (manually or with replication)
● But AUTO_INCREMENTs are in the way:
● up to i used on A before 1st failover
● i-n to j1 used on X after recovery
● i to j2 used on B after 2nd failover
34
+-/+
| A |
+/-+
Master RWs +---+ +---+
happen here --> | B | | X |
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
35. Takeaway
● Twisted situations happen
● Automation (including failover) is not simple:
à code automation scripts defensively
● Be mindful for premature acknowledgment, downtime more than less,
shutdown slaves first à understand complex interactions of tools in details
● Try something else than AUTO-INCREMENTs for Primary Key
(monotonically increasing UUID[1] [2] ?)
[1]: https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/
[2]: http://mysql.rjweb.org/doc.php/uuid
35
36. 36
Re Failover War Story
Master failover does not always go as planned
• because it is complicated
It is not a matter of “if” things go wrong
• but “when” things will go wrong
Please share your war stories
• so we can learn from each-others’ experience
• GitHub has a MySQL public Post-Mortem (great of them to share this):
https://blog.github.com/2018-10-30-oct21-post-incident-analysis/
38. Thanks !
by Jean-François Gagné
presented at SRECon Dublin (October 2019)
Senior Infrastructure Engineer / System and MySQL Expert
jeanfrancois AT messagebird DOT com / @jfg956 #SREcon