Almost Perfect
Service Discovery and Failover
with ProxySQL and Orchestrator
by Jean-François Gagné
and Art van Scheppingen
Presented at Percona Live Online, May 2021
Almost Perfect
Service Discovery and Failover
with ProxySQL and Orchestrator
Jean-François Gagné
System and MySQL Expert at HubSpot
jgagne AT hubspot DOT com / @jfg956
MySQL Service Discovery at MessageBird
Summary of part #1
• MySQL Primary High Availability
• Failover to a Replica War Story
• MySQL at MessageBird (Percona Server)
• MySQL Service Discovery at MessageBird (ProxySQL)
• Orchestrator integration and the Failover Process
MySQL Primary High Availability [1 of 4]
Failing-over the primary to a replica is my favorite high availability method
• But it is not as easy as it sounds, and it is hard to automate well
• An example of a complete failover solution in production:
https://github.blog/2018-06-20-mysql-high-availability-at-github/
The five considerations for primary high availability:
(https://jfg-mysql.blogspot.com/2019/02/mysql-master-high-availability-and-failover-more-thoughts.html)
• Plan how you are doing primary high availability
• Decide when you apply your plan (Failure Detection – FD)
• Tell the application about the primary change (Service Discovery – SD)
• Protect against the limits of FD and SD to avoid split-brains (Fencing)
• Fix your data if something goes wrong
MySQL Primary High Availability [2 of 4]
Failure detection (FD) is the 1st part (and 1st challenge) of failing-over
• It is a very hard problem: partial failure, unreliable network, partitions, …
• It is impossible to be 100% sure of a failure, and confidence needs time
→ quick FD is unreliable; relatively reliable FD implies longer downtime
➢ Quick FD for short downtime generates false positives
Repointing is the 2nd part of failing-over to a replica:
• Relatively easy with the right tools: GTID, Pseudo-GTID, Binlog Servers, …
• Complexity grows with the number of direct replicas of the primary
• Some software for repointing:
• Orchestrator, Ripple Binlog Server, Replication Manager, MHA, Cluster Control, MaxScale
MySQL Primary High Availability [3 of 4]
What I mean by repointing:
• In the configuration below, when the primary fails,
once one of the replicas has been chosen as the new primary,
the other replica needs to be re-sourced (re-slaved) to the new primary
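A minimal sketch of this repointing in MySQL 5.7 syntax (assuming GTID is enabled; the hostname is a placeholder):

```sql
-- On the remaining replica: with GTID auto-positioning, the replica
-- finds the matching binary-log position on the new primary by itself.
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'new-primary.example.net',
  MASTER_PORT = 3306,
  MASTER_AUTO_POSITION = 1;
START SLAVE;
```

Without GTID, the same step needs a binlog file and position matched across servers, which is what tools like Orchestrator (via Pseudo-GTID) automate.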
MySQL Primary High Availability [4 of 4]
Service Discovery (SD) is the 3rd part (and 2nd challenge) of failover:
• If centralized → SPOF; if distributed → impossible to update atomically
• SD will either introduce a bottleneck (including performance limits)
• Or it will be unreliable in some way (pointing to the wrong primary)
• Some ways to implement Service Discovery: DNS, VIP, Proxy, Zookeeper, …
http://code.openark.org/blog/mysql/mysql-master-discovery-methods-part-1-dns
➢ Unreliable FD and unreliable SD are a recipe for split-brains!
Protecting against split-brains (Fencing): an advanced subject with few solutions
(proxies and semi-synchronous replication might help)
Fixing your data in case of a split-brain: only you can know how to do this!
(tip on this in the war story)
Failover War Story [1 of 6]
Some infrastructure context:
• Service Discovery is DNS (and failure detector is Orchestrator)
• The databases are behind a firewall in two data centers
• Then we had a failure of the firewall in the zone of the primary
Failover War Story [2 of 6]
Things went as planned: failed-over from Zone1 to Zone2
• New primary in zone 2: stop replication, set it read-write, update DNS, …
• Everything was ok… until the firewall came back up
Failover War Story [3 of 6]
Once the firewall came back up, there were no detectable problems
• But some intuition made me check the binary logs of the old primary
• And I found new transactions with timestamps after the firewall recovery
(and obviously this is after the failover to zone 2)
Failover War Story [4 of 6]
These new transactions are problematic:
• They are in the databases in zone 1, but not in zone 2
• They share common auto-increments with data in zone 2
• Luckily, there were only a few transactions, so they were easy to fix, but what happened?
Failover War Story [5 of 6]
The infrastructure is a little more complicated than initially presented:
• There are web servers and local DNS behind the firewalls (fw)
• The DNS update of the failover did not reach zone 1 (because of the fw failure)
• When the firewall came back up, the web servers received traffic,
and because the DNS was not yet updated, they wrote to the old primary
• Once updated (a few seconds later), writes went to the new primary in zone 2
Failover War Story [6 of 6]
This war story shows decentralized Service Discovery causing problems
Remember that it is not a matter of “if” but “when” things will go wrong
Please share your war stories so we can learn from each other’s experience
• GitHub has a public MySQL post-mortem (great of them to share this):
https://blog.github.com/2018-10-30-oct21-post-incident-analysis/
• I also have another MySQL primary failover war story in another talk:
https://www.usenix.org/conference/srecon19emea/presentation/gagne
Tip for easier data reconciliation: use UUIDs instead of auto-increments
• But store UUIDs in an optimized way (in primary key order)
https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/
http://mysql.rjweb.org/doc.php/uuid
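A minimal sketch of the optimized storage (the table is hypothetical; this assumes UUIDv1 values, which is what MySQL’s UUID() generates): the timestamp fields are reordered so new values are roughly monotonic, then packed into BINARY(16):

```sql
-- Hypothetical table with the reordered UUID as primary key.
CREATE TABLE events (
  id   BINARY(16) NOT NULL PRIMARY KEY,
  body VARCHAR(255)
);

SET @uuid = UUID();  -- e.g. '58e0a7d7-eebc-11d8-9669-0800200c9a66'
INSERT INTO events (id, body) VALUES (
  UNHEX(CONCAT(SUBSTR(@uuid, 15, 4),   -- time_hi_and_version
               SUBSTR(@uuid, 10, 4),   -- time_mid
               SUBSTR(@uuid,  1, 8),   -- time_low
               SUBSTR(@uuid, 20, 4),   -- clock_seq
               SUBSTR(@uuid, 25))),    -- node
  'hello');
```

(MySQL 8.0 ships UUID_TO_BIN(uuid, 1), which performs the same swap natively.)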
MySQL at MessageBird [1 of 2]
MessageBird is using MySQL 5.7 (more precisely Percona Server)
These are hosted in many Google Cloud Regions
There are three types of MySQL deployments:
1. Multi-region primary:
replicas in many regions, and the primary can be in any of several regions
2. Single-region primary with replicas in many regions
3. Primary and replicas all in a single region
MySQL at MessageBird [2 of 2]
This is a multi-region primary (regions are color-coded)
MySQL Service Discovery at MessageBird [1 / 12]
Requirements for MySQL Service Discovery:
• Being able to route traffic to local replicas
• Embed some sort of fencing mechanism
This led to a multi-layer solution using ProxySQL:
• Three layers for multi-region primary:
1. Collect
2. Master-Gateway (mgw)
3. Fencing
For single-region, mgw and fencing are merged into local-fencing (locfen)
MySQL Service Discovery at MessageBird [2 / 12]
MySQL Service Discovery at MessageBird [3 / 12]
The collect layer is a standard entry-point design
The fencing layer is a natural HA way to route
traffic to the primary (a single node would not be HA)
The master-gateway layer is the glue between
collect and fencing (more about this later in the talk)
The local-fencing layer is the mgw and the fencing layers
merged for single-region primary databases because
routing to a single region does not need the mgw glue
(three nodes for N+2 HA, more about this later in the talk)
The collect layer is the entry point of MySQL Service Discovery:
• It starts with a load-balancer sending traffic to ProxySQL
• We have at least 3 instances of ProxySQL for N+2 high availability
• From here, read-only traffic is sent directly to replicas
• Primary traffic (read-write) is sent to mgw or locfen
MySQL Service Discovery at MessageBird [4 / 12]
Routing from collect to locfen is either local or crossing a region boundary
For mgw, routing is biased toward local when an mgw is in the same region as collect
• ProxySQL routing is weight-based (no easy fallback routing; see the sketch below)
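A minimal sketch of that bias on a collect node (hostgroup number, hostnames, and weights are hypothetical): the local mgw gets a much larger weight, so almost all primary traffic goes there, with no fallback logic beyond the weights:

```sql
-- Writer hostgroup 10 (hypothetical): ProxySQL picks a backend
-- proportionally to weight, so the local mgw gets ~99% of the traffic.
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
VALUES (10, 'mgw-local.example.net',  3306, 1000),
       (10, 'mgw-remote.example.net', 3306, 10);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```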
MySQL Service Discovery at MessageBird [5 / 12]
The master-gateway is deployed in all regions potentially hosting a primary
The same way fencing (or locfen) is designed to be as small as possible
to reduce the update-scope of failover (to a single region) …
… mgw bounds the update-scope of moving the primary to another region
to a continent (in this case, the three mgw regions are in Europe)
(it avoids a planet-scale reconfiguration of collect on failover)
… and the way mgw routes traffic to fencing protects against writing to the
primary in case of network partitions
MySQL Service Discovery at MessageBird [6 / 12]
MySQL Service Discovery at MessageBird [7 / 12]
The mgw routing is as follows:
• If the primary is in a remote region, traffic is routed to fencing in that region
• If the primary is in the same region, traffic is routed to the other mgw
➢ There is no path to the primary that does not cross a region boundary
• That might sound sub-optimal, but it is an interesting tradeoff
• It brings the best-case vs worst-case round-trip ratio to the primary closer together
Without crossing a region boundary:
• Best case (local access): ~1 ms round-trip to the primary
• Worst case (remote access): ~20 ms round-trip to the primary
➢ Ratio of 1 to 20
With crossing a region boundary in mgw:
• Best case (remote access): ~20 ms round-trip to the primary
• Worst case (local access): ~40 ms round-trip to the primary
• Even worse case (remote, routed via the mgw of the primary’s region): ~60 ms
➢ Ratio of 1 to 2 (3 in the worst case)
MySQL Service Discovery at MessageBird [8 / 12]
MySQL Service Discovery at MessageBird [9 / 12]
The worst case (which could be avoided with smarter collect routing)
Favoring low round-trip variance over an optimal best case:
• With region-remote primary accesses, high latency is unavoidable:
when already at 20 ms latency, 40 or 60 ms should not be problematic
• Moving the average closer to the median avoids problems
• And in the case where most writes are local,
it avoids surprises when the database becomes remote
• And it prevents writing to the primary in case of a network partition
MySQL Service Discovery at MessageBird [10 / 12]
MySQL Service Discovery at MessageBird [11 / 12]
And therefore, I claim this design has interesting tradeoffs
• But it might not fit everyone’s requirements
Why only two fencing nodes and three locfen nodes:
• I like N+2 high availability
• It allows a single failure to not need an immediate fix
• If a locfen node fails on Friday evening, it can wait until Monday
• A failure of one of the two fencing nodes looks problematic
• But we can fail over to another region that has two healthy nodes
• And updating two ProxySQL nodes on failover is easier than three
(needing three locfen nodes is something I dislike, as it makes failover more fragile)
MySQL Service Discovery at MessageBird [12 / 12]
Failover
Failing-over is a multi-step process:
1. Detecting a failure
2. Fencing the primary: setting it OFFLINE_HARD in ProxySQL (see the sketch after this list)
3. Regrouping the replicas under the new primary
4. Waiting for replication to catch up on the new primary
5. Making the new primary ready: stop replication, set it writable, start HB, …
6. Updating ProxySQL to point to the new primary
7. Reconfiguring fencing and master-gateway if needed
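A minimal sketch of steps 2 and 6 as run against a fencing (or locfen) ProxySQL admin interface; the hostgroup number and hostnames are hypothetical, and in this design the real commands are issued by the Orchestrator hooks described later:

```sql
-- Step 2 (fencing): hard-offline the failed primary; OFFLINE_HARD also
-- terminates established connections, so no write can sneak through.
UPDATE mysql_servers SET status = 'OFFLINE_HARD'
 WHERE hostgroup_id = 10 AND hostname = 'old-primary.example.net';
LOAD MYSQL SERVERS TO RUNTIME;

-- Step 6 (repointing): replace the old primary with the new one.
DELETE FROM mysql_servers
 WHERE hostgroup_id = 10 AND hostname = 'old-primary.example.net';
INSERT INTO mysql_servers (hostgroup_id, hostname, port)
VALUES (10, 'new-primary.example.net', 3306);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```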
Reconfiguring Fencing and MGW [1 of 6]
Reconfiguring Fencing and MGW [2 of 6]
Reconfiguring Fencing and MGW [3 of 6]
Reconfiguring Fencing and MGW [4 of 6]
Reconfiguring Fencing and MGW [5 of 6]
Reconfiguring Fencing and MGW [6 of 6]
Reacting to a network partition is a similar operation
Orchestrator integration
• Pre-failover hook: fence the primary in fencing (or locfen)
• Post-failover hook: update fencing (or locfen) to point to the new primary
• The list of ProxySQL nodes (fencing or locfen)
and the ProxySQL hostgroup are stored in each database,
which makes this information available to the Orchestrator hooks
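A minimal sketch of how such hooks are declared in the Orchestrator configuration (the script names are hypothetical; the keys and the {failedHost}/{successorHost} placeholders are Orchestrator’s standard hook mechanism):

```json
{
  "PreFailoverProcesses": [
    "/usr/local/bin/fence-primary.sh {failedHost} {failedPort}"
  ],
  "PostMasterFailoverProcesses": [
    "/usr/local/bin/update-proxysql.sh {failedHost} {successorHost} {successorPort}"
  ]
}
```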
MySQL Service Discovery at MessageBird
Almost Perfect
Service Discovery and Failover
with ProxySQL and Orchestrator
Art van Scheppingen
Senior Database Engineer at MessageBird
art.vanscheppingen AT messagebird DOT com
How does it work so far?
It has been running successfully in production for over two years now
• Most of our workload is on single-region primary (e.g. locfen)
• We have one cluster on multi-region primary
ProxySQL has been very stable for us
• No big issues on 1.4.x, 2.0.x, and 2.1.x
How does it work so far?
Easier for devs to set up connections to the primary
• Point the connection to an IP address
• No failover handling necessary
Easier for devs to scale out reads
• Point the connection to the same IP address
• Use a different user for read-only traffic (see the sketch below)
• We can differentiate reads (read-only, read-only replica-only, etc.)
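A minimal sketch of this user-based split on the ProxySQL admin interface (usernames, password, and hostgroup numbers are hypothetical): each user has a default hostgroup, so the same IP address serves both kinds of traffic:

```sql
-- 'app_rw' lands on the writer hostgroup (10), 'app_ro' on the readers (20).
INSERT INTO mysql_users (username, password, default_hostgroup)
VALUES ('app_rw', 'changeme', 10),
       ('app_ro', 'changeme', 20);
LOAD MYSQL USERS TO RUNTIME;
SAVE MYSQL USERS TO DISK;
```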
Mastering multiplexing [1 of 4]
Multiplexing in ProxySQL turned out to be tricky
Without multiplexing:
• A 1-to-1 number of connections between Collect and Locfen
• The application makes a connection for each shard (8 at the moment)
• The application only actively uses one
• A high number of connections means high load
With multiplexing:
• A high number of connections on Collect
• Theoretically only 1/8th of the connections on Locfen
• Even lower due to less overhead establishing connections
Mastering multiplexing [2 of 4]
Prior to multiplexing
Mastering multiplexing [3 of 4]
We enabled multiplexing in our configuration, but not much happened
ProxySQL disabled multiplexing due to an auto-increment
• There was a bug report on an ORM that didn’t handle multiplexing well (Hibernate)
• ProxySQL parses the OK packet for auto-increments
• Whenever one is encountered, multiplexing is affected
• mysql-auto_increment_delay_multiplex is set to 5 by default
• This means multiplexing is disabled for 5 consecutive queries
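A minimal sketch of tuning this on the ProxySQL admin interface; setting it to 0 (an assumption for illustration) disables the pinning entirely, which is only safe if no client reads LAST_INSERT_ID() in a later query on the same connection:

```sql
-- How many queries keep a connection pinned after an OK packet carrying
-- an auto-increment (default 5; 0 disables the pinning).
UPDATE global_variables SET variable_value = '0'
 WHERE variable_name = 'mysql-auto_increment_delay_multiplex';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
```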
Mastering multiplexing [4 of 4]
After mastering multiplexing
Separation of stacks
We tried reusing collect and locfen as much as possible
• Centralized configuration (hostgroups and users)
• Less complexity
Expansion was inevitable:
• Noisy neighbors (hostgroups)
• Reducing risk
• Better tuning for certain workloads
• Easier maintenance
• Cascading effect to other hostgroups
Currently we run 3 vertical stacks of collect and locfen
Separation of stacks: Noisy neighbors [1 of 2]
Above 50% CPU usage, ProxySQL will show increased latency
• Lower CPU usage with multiplexing
• Lower CPU usage with idle threads
When a certain hostgroup in Locfen is under stress (1000x the normal workload):
• CPU usage can get above 50%
• Latency will increase on all hostgroups
• Latency will cascade upstream to the collect layer
Separation of stacks: Noisy neighbors [2 of 2]
Above 50% CPU usage ProxySQL will show increased latency
Separation of stacks: Reducing risk [1 of 4]
For us, ProxySQL scales up to about 12K connections per host
• After this we hit the limits of TCP
Our Collect layer reached 7.8K connections
• If one collect host fails, two remain
• The two remaining hosts have to take an additional 3.9K connections
• Very close to our 12K limit
• Replacing a failed host then becomes an emergency operation
Separation of stacks: Reducing risk [2 of 4]
ProxySQL keeps count of connection errors (MySQL + TCP shared)
• It will shun a host if a backend becomes “less responsive” (e.g. under high load)
• This happens on hostgroups with any number of hosts
Hitting the limits of TCP:
• Errors to locfen or MGW will increase
• Locfen and MGW backends will be shunned
• Established connections will also be closed
Separation of stacks: Reducing risk [3 of 4]
Separation of stacks: Reducing risk [4 of 4]
Shunning a primary [1 of 4]
ProxySQL keeps count of connection errors (MySQL + TCP shared)
• It will shun a host if a backend becomes “less responsive” (e.g. under high load)
• This happens on hostgroups with any number of hosts
Shunning a primary for 1 second will cause another torrent of connections
• Clients get a timeout and reconnect immediately
• With no available backend, new connections are “paused” for up to 10 seconds
• After 1 second, the primary becomes available again
• ProxySQL has thousands of connections waiting
• Rinse and repeat…
Shunning a primary [2 of 4]
How do we detect shunned hosts?
• ProxySQL logs a shunned host in the proxysql log
• This includes the server name, error rate, and duration of the shun
2020-06-11 12:01:39 MySQL_HostGroups_Manager.cpp:311:connect_error(): [ERROR] Shunning server x.x.x.x:3306 with 10 errors/sec. Shunning for 10 seconds
2020-06-11 12:01:44 MySQL_HostGroups_Manager.cpp:311:connect_error(): [ERROR] Shunning server x.x.x.x:3306 with 18 errors/sec. Shunning for 10 seconds
2020-06-11 12:01:49 MySQL_HostGroups_Manager.cpp:311:connect_error(): [ERROR] Shunning server x.x.x.x:3306 with 10 errors/sec. Shunning for 10 seconds
2020-06-11 12:01:54 MySQL_HostGroups_Manager.cpp:311:connect_error(): [ERROR] Shunning server x.x.x.x:3306 with 10 errors/sec. Shunning for 10 seconds
How do we get them into our graphs?
• A ProxySQL log tailer
• It looks for connection timeouts, shunned hosts, and shunning due to replication lag
• It exports metrics to Prometheus every minute
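Next to tailing the log, a point-in-time view is available on the ProxySQL admin interface; a minimal sketch:

```sql
-- Backends currently shunned (including shuns due to replication lag).
SELECT hostgroup_id, hostname, port, status
  FROM runtime_mysql_servers
 WHERE status LIKE 'SHUNNED%';
```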
Shunning a primary [3 of 4]
Shunning a primary [4 of 4]
Shunning a primary for 1 second will cause an avalanche of connections
• Normal latency is 10 ms to 50 ms
• An added latency of 1 second will decrease application throughput
• Decreased application throughput makes k8s scale up workers
• k8s scaling up means more incoming connections
• Rinse and repeat…
How we dealt with this:
• During some incidents we throttle down workers
• Counter-intuitive: throttling down workers increases throughput
• Some application workers now have a fixed ceiling
Separation of stacks: Better tuning
Most ProxySQL tuning is done at a global level
Some examples (see the sketch below):
• mysql-connection_delay_multiplex_ms
• mysql-free_connections_pct
• mysql-wait_timeout
Having a separate stack allows:
• Fine-tuned multiplexing configuration
• Earlier or later closing of connections
• Separate handling of (end-)user connections
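For illustration, a minimal sketch of inspecting these knobs on one node of a stack via the admin interface:

```sql
-- Current values of the global tuning knobs mentioned above.
SELECT variable_name, variable_value
  FROM global_variables
 WHERE variable_name IN ('mysql-connection_delay_multiplex_ms',
                         'mysql-free_connections_pct',
                         'mysql-wait_timeout');
```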
Separation of stacks: Maintenance
Maintenance on Collect is scary
• Draining from GLB gracefully means “closing after X minutes”
• Being near capacity means we run a risk when performing maintenance
Maintenance on MGW/Fencing/Locfen is scary
• Draining a host takes ages
• Connections are aggressively reused by the connection pool
• The connection timeout (wait_timeout) is 8 hours
• Some applications don’t handle the closing of a connection well
Separation of stacks: Cascading effect [1 of 4]
War story: instability on one cluster wiped out many others
The unstable cluster:
• MySQL back_log was set too low
• TCP listen overflows on the primary
• ProxySQL started to shun the primary
The effect:
• Continuous shunning happened
• TCP listen overflows started to happen on ProxySQL
• This affected stability on other hostgroups
Separation of stacks: Cascading effect [2 of 4]
Shunning of the primary causes an endless shunning loop
Separation of stacks: Cascading effect [3 of 4]
Listen overflows on the ProxySQL hosts make it even worse
Separation of stacks: Cascading effect [4 of 4]
In effect, the Collect layer shuns the Locfen layer hosts
Other quirks: Connection contamination
After a client connection closes, the backend connection will be reused
• Collect → Locfen
• Locfen → database
ProxySQL resets the connection to its initial connection parameters
What if the new connection doesn’t match those settings (e.g. utf8mb4, or the CET timezone)?
Will the connection between Locfen → database also change?
Other quirks: Uneven distribution
Uneven distribution of connections/queries
• Weight influences the distribution of connections
• Reuse of existing connections is favored by ProxySQL
• This is influenced by mysql-free_connections_pct
The variable mysql-free_connections_pct is a global variable
• It is a percentage of the maximum allowed connections of a hostgroup
• Some hostgroups allow up to 3000 incoming connections
• 2% of 3000 is 60 connections, while actual usage is 10 to 15
• More connections are kept open in the connection pool than necessary
Thanks !
Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator
by Jean-François Gagné
and Art van Scheppingen
Presented at Percona Live Online, May 2021
More Related Content

What's hot

Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0Mydbops
 
Optimizing MariaDB for maximum performance
Optimizing MariaDB for maximum performanceOptimizing MariaDB for maximum performance
Optimizing MariaDB for maximum performanceMariaDB plc
 
MySQL GTID 시작하기
MySQL GTID 시작하기MySQL GTID 시작하기
MySQL GTID 시작하기I Goo Lee
 
ProxySQL & PXC(Query routing and Failover Test)
ProxySQL & PXC(Query routing and Failover Test)ProxySQL & PXC(Query routing and Failover Test)
ProxySQL & PXC(Query routing and Failover Test)YoungHeon (Roy) Kim
 
MySQL Administrator 2021 - 네오클로바
MySQL Administrator 2021 - 네오클로바MySQL Administrator 2021 - 네오클로바
MySQL Administrator 2021 - 네오클로바NeoClova
 
Wars of MySQL Cluster ( InnoDB Cluster VS Galera )
Wars of MySQL Cluster ( InnoDB Cluster VS Galera ) Wars of MySQL Cluster ( InnoDB Cluster VS Galera )
Wars of MySQL Cluster ( InnoDB Cluster VS Galera ) Mydbops
 
Load Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - SlidesLoad Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - SlidesSeveralnines
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyJean-François Gagné
 
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBDistributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBYugabyteDB
 
MariaDB Galera Cluster presentation
MariaDB Galera Cluster presentationMariaDB Galera Cluster presentation
MariaDB Galera Cluster presentationFrancisco Gonçalves
 
The Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScale
The Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScaleThe Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScale
The Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScaleColin Charles
 
My sql failover test using orchestrator
My sql failover test  using orchestratorMy sql failover test  using orchestrator
My sql failover test using orchestratorYoungHeon (Roy) Kim
 
MariaDB 10.5 binary install (바이너리 설치)
MariaDB 10.5 binary install (바이너리 설치)MariaDB 10.5 binary install (바이너리 설치)
MariaDB 10.5 binary install (바이너리 설치)NeoClova
 
MySQL InnoDB Cluster - A complete High Availability solution for MySQL
MySQL InnoDB Cluster - A complete High Availability solution for MySQLMySQL InnoDB Cluster - A complete High Availability solution for MySQL
MySQL InnoDB Cluster - A complete High Availability solution for MySQLOlivier DASINI
 
MySQL Timeout Variables Explained
MySQL Timeout Variables Explained MySQL Timeout Variables Explained
MySQL Timeout Variables Explained Mydbops
 
Disaster Recovery with MySQL InnoDB ClusterSet - What is it and how do I use it?
Disaster Recovery with MySQL InnoDB ClusterSet - What is it and how do I use it?Disaster Recovery with MySQL InnoDB ClusterSet - What is it and how do I use it?
Disaster Recovery with MySQL InnoDB ClusterSet - What is it and how do I use it?Miguel Araújo
 
How to set up orchestrator to manage thousands of MySQL servers
How to set up orchestrator to manage thousands of MySQL serversHow to set up orchestrator to manage thousands of MySQL servers
How to set up orchestrator to manage thousands of MySQL serversSimon J Mudd
 
MySQL8.0_performance_schema.pptx
MySQL8.0_performance_schema.pptxMySQL8.0_performance_schema.pptx
MySQL8.0_performance_schema.pptxNeoClova
 
MySQL Performance for DevOps
MySQL Performance for DevOpsMySQL Performance for DevOps
MySQL Performance for DevOpsSveta Smirnova
 

What's hot (20)

Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0
 
Optimizing MariaDB for maximum performance
Optimizing MariaDB for maximum performanceOptimizing MariaDB for maximum performance
Optimizing MariaDB for maximum performance
 
MySQL GTID 시작하기
MySQL GTID 시작하기MySQL GTID 시작하기
MySQL GTID 시작하기
 
ProxySQL & PXC(Query routing and Failover Test)
ProxySQL & PXC(Query routing and Failover Test)ProxySQL & PXC(Query routing and Failover Test)
ProxySQL & PXC(Query routing and Failover Test)
 
MySQL Administrator 2021 - 네오클로바
MySQL Administrator 2021 - 네오클로바MySQL Administrator 2021 - 네오클로바
MySQL Administrator 2021 - 네오클로바
 
Wars of MySQL Cluster ( InnoDB Cluster VS Galera )
Wars of MySQL Cluster ( InnoDB Cluster VS Galera ) Wars of MySQL Cluster ( InnoDB Cluster VS Galera )
Wars of MySQL Cluster ( InnoDB Cluster VS Galera )
 
Load Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - SlidesLoad Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - Slides
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash Safety
 
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBDistributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
 
MariaDB Galera Cluster presentation
MariaDB Galera Cluster presentationMariaDB Galera Cluster presentation
MariaDB Galera Cluster presentation
 
Automated master failover
Automated master failoverAutomated master failover
Automated master failover
 
The Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScale
The Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScaleThe Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScale
The Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScale
 
My sql failover test using orchestrator
My sql failover test  using orchestratorMy sql failover test  using orchestrator
My sql failover test using orchestrator
 
MariaDB 10.5 binary install (바이너리 설치)
MariaDB 10.5 binary install (바이너리 설치)MariaDB 10.5 binary install (바이너리 설치)
MariaDB 10.5 binary install (바이너리 설치)
 
MySQL InnoDB Cluster - A complete High Availability solution for MySQL
MySQL InnoDB Cluster - A complete High Availability solution for MySQLMySQL InnoDB Cluster - A complete High Availability solution for MySQL
MySQL InnoDB Cluster - A complete High Availability solution for MySQL
 
MySQL Timeout Variables Explained
MySQL Timeout Variables Explained MySQL Timeout Variables Explained
MySQL Timeout Variables Explained
 
Disaster Recovery with MySQL InnoDB ClusterSet - What is it and how do I use it?
Disaster Recovery with MySQL InnoDB ClusterSet - What is it and how do I use it?Disaster Recovery with MySQL InnoDB ClusterSet - What is it and how do I use it?
Disaster Recovery with MySQL InnoDB ClusterSet - What is it and how do I use it?
 
How to set up orchestrator to manage thousands of MySQL servers
How to set up orchestrator to manage thousands of MySQL serversHow to set up orchestrator to manage thousands of MySQL servers
How to set up orchestrator to manage thousands of MySQL servers
 
MySQL8.0_performance_schema.pptx
MySQL8.0_performance_schema.pptxMySQL8.0_performance_schema.pptx
MySQL8.0_performance_schema.pptx
 
MySQL Performance for DevOps
MySQL Performance for DevOpsMySQL Performance for DevOps
MySQL Performance for DevOps
 

Similar to Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator

Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...LarryZaman
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processingconfluent
 
"Stateful app as an efficient way to build dispatching for riders and drivers...
"Stateful app as an efficient way to build dispatching for riders and drivers..."Stateful app as an efficient way to build dispatching for riders and drivers...
"Stateful app as an efficient way to build dispatching for riders and drivers...Fwdays
 
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Toronto-Oracle-Users-Group
 
5 Levels of High Availability: From Multi-instance to Hybrid Cloud
5 Levels of High Availability: From Multi-instance to Hybrid Cloud5 Levels of High Availability: From Multi-instance to Hybrid Cloud
5 Levels of High Availability: From Multi-instance to Hybrid CloudRafał Leszko
 
5 levels of high availability from multi instance to hybrid cloud
5 levels of high availability  from multi instance to hybrid cloud5 levels of high availability  from multi instance to hybrid cloud
5 levels of high availability from multi instance to hybrid cloudRafał Leszko
 
Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...
Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...
Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...Microsoft Technet France
 
MySQL Options in OpenStack
MySQL Options in OpenStackMySQL Options in OpenStack
MySQL Options in OpenStackTesora
 
OpenStack Days East -- MySQL Options in OpenStack
OpenStack Days East -- MySQL Options in OpenStackOpenStack Days East -- MySQL Options in OpenStack
OpenStack Days East -- MySQL Options in OpenStackMatt Lord
 
HTTP/2 Comes to Java: Servlet 4.0 and what it means for the Java/Jakarta EE e...
HTTP/2 Comes to Java: Servlet 4.0 and what it means for the Java/Jakarta EE e...HTTP/2 Comes to Java: Servlet 4.0 and what it means for the Java/Jakarta EE e...
HTTP/2 Comes to Java: Servlet 4.0 and what it means for the Java/Jakarta EE e...Edward Burns
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
 
Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Eran Gampel
 
AAI-4847 Full Disclosure on the Performance Characteristics of WebSphere Appl...
AAI-4847 Full Disclosure on the Performance Characteristics of WebSphere Appl...AAI-4847 Full Disclosure on the Performance Characteristics of WebSphere Appl...
AAI-4847 Full Disclosure on the Performance Characteristics of WebSphere Appl...WASdev Community
 
Scaling OpenStack Networking Beyond 4000 Nodes with Dragonflow - Eshed Gal-Or...
Scaling OpenStack Networking Beyond 4000 Nodes with Dragonflow - Eshed Gal-Or...Scaling OpenStack Networking Beyond 4000 Nodes with Dragonflow - Eshed Gal-Or...
Scaling OpenStack Networking Beyond 4000 Nodes with Dragonflow - Eshed Gal-Or...Cloud Native Day Tel Aviv
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentJean-François Gagné
 
Leveraging Endpoint Flexibility in Data-Intensive Clusters
Leveraging Endpoint Flexibility in Data-Intensive ClustersLeveraging Endpoint Flexibility in Data-Intensive Clusters
Leveraging Endpoint Flexibility in Data-Intensive ClustersRan Ziv
 
Nordic infrastructure Conference 2017 - SQL Server on Linux Overview
Nordic infrastructure Conference 2017 - SQL Server on Linux OverviewNordic infrastructure Conference 2017 - SQL Server on Linux Overview
Nordic infrastructure Conference 2017 - SQL Server on Linux OverviewTravis Wright
 
Sfrac on oracle_vm_with_npiv_whitepaper_sol
Sfrac on oracle_vm_with_npiv_whitepaper_solSfrac on oracle_vm_with_npiv_whitepaper_sol
Sfrac on oracle_vm_with_npiv_whitepaper_solNovonil Choudhuri
 

Similar to Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator (20)

Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processing
 
"Stateful app as an efficient way to build dispatching for riders and drivers...
"Stateful app as an efficient way to build dispatching for riders and drivers..."Stateful app as an efficient way to build dispatching for riders and drivers...
"Stateful app as an efficient way to build dispatching for riders and drivers...
 
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
 
5 Levels of High Availability: From Multi-instance to Hybrid Cloud
5 Levels of High Availability: From Multi-instance to Hybrid Cloud5 Levels of High Availability: From Multi-instance to Hybrid Cloud
5 Levels of High Availability: From Multi-instance to Hybrid Cloud
 
5 levels of high availability from multi instance to hybrid cloud
5 levels of high availability  from multi instance to hybrid cloud5 levels of high availability  from multi instance to hybrid cloud
5 levels of high availability from multi instance to hybrid cloud
 
Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...
Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...
Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...
 
MySQL Options in OpenStack
MySQL Options in OpenStackMySQL Options in OpenStack
MySQL Options in OpenStack
 
OpenStack Days East -- MySQL Options in OpenStack
OpenStack Days East -- MySQL Options in OpenStackOpenStack Days East -- MySQL Options in OpenStack
OpenStack Days East -- MySQL Options in OpenStack
 
HTTP/2 Comes to Java: Servlet 4.0 and what it means for the Java/Jakarta EE e...
HTTP/2 Comes to Java: Servlet 4.0 and what it means for the Java/Jakarta EE e...HTTP/2 Comes to Java: Servlet 4.0 and what it means for the Java/Jakarta EE e...
HTTP/2 Comes to Java: Servlet 4.0 and what it means for the Java/Jakarta EE e...
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk
 
MySQL Cluster
MySQL ClusterMySQL Cluster
MySQL Cluster
 
Oss4b - pxc introduction
Oss4b   - pxc introductionOss4b   - pxc introduction
Oss4b - pxc introduction
 
AAI-4847 Full Disclosure on the Performance Characteristics of WebSphere Appl...
AAI-4847 Full Disclosure on the Performance Characteristics of WebSphere Appl...AAI-4847 Full Disclosure on the Performance Characteristics of WebSphere Appl...
AAI-4847 Full Disclosure on the Performance Characteristics of WebSphere Appl...
 
Scaling OpenStack Networking Beyond 4000 Nodes with Dragonflow - Eshed Gal-Or...
Scaling OpenStack Networking Beyond 4000 Nodes with Dragonflow - Eshed Gal-Or...Scaling OpenStack Networking Beyond 4000 Nodes with Dragonflow - Eshed Gal-Or...
Scaling OpenStack Networking Beyond 4000 Nodes with Dragonflow - Eshed Gal-Or...
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated Environment
 
Leveraging Endpoint Flexibility in Data-Intensive Clusters
Leveraging Endpoint Flexibility in Data-Intensive ClustersLeveraging Endpoint Flexibility in Data-Intensive Clusters
Leveraging Endpoint Flexibility in Data-Intensive Clusters
 
Nordic infrastructure Conference 2017 - SQL Server on Linux Overview
Nordic infrastructure Conference 2017 - SQL Server on Linux OverviewNordic infrastructure Conference 2017 - SQL Server on Linux Overview
Nordic infrastructure Conference 2017 - SQL Server on Linux Overview
 
Sfrac on oracle_vm_with_npiv_whitepaper_sol
Sfrac on oracle_vm_with_npiv_whitepaper_solSfrac on oracle_vm_with_npiv_whitepaper_sol
Sfrac on oracle_vm_with_npiv_whitepaper_sol
 

More from Jean-François Gagné

The consequences of sync_binlog != 1
The consequences of sync_binlog != 1The consequences of sync_binlog != 1
The consequences of sync_binlog != 1Jean-François Gagné
 
Autopsy of a MySQL Automation Disaster
Autopsy of a MySQL Automation DisasterAutopsy of a MySQL Automation Disaster
Autopsy of a MySQL Automation DisasterJean-François Gagné
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentJean-François Gagné
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyJean-François Gagné
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyJean-François Gagné
 
MySQL Parallel Replication by Booking.com
MySQL Parallel Replication by Booking.comMySQL Parallel Replication by Booking.com
MySQL Parallel Replication by Booking.comJean-François Gagné
 
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...Jean-François Gagné
 
MySQL/MariaDB Parallel Replication: inventory, use-case and limitations
MySQL/MariaDB Parallel Replication: inventory, use-case and limitationsMySQL/MariaDB Parallel Replication: inventory, use-case and limitations
MySQL/MariaDB Parallel Replication: inventory, use-case and limitationsJean-François Gagné
 
The two little bugs that almost brought down Booking.com
The two little bugs that almost brought down Booking.comThe two little bugs that almost brought down Booking.com
The two little bugs that almost brought down Booking.comJean-François Gagné
 
How Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lagHow Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lagJean-François Gagné
 
MySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsMySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsJean-François Gagné
 
MySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsMySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsJean-François Gagné
 
MySQL Parallel Replication: inventory, use-cases and limitations
MySQL Parallel Replication: inventory, use-cases and limitationsMySQL Parallel Replication: inventory, use-cases and limitations
MySQL Parallel Replication: inventory, use-cases and limitationsJean-François Gagné
 
Riding the Binlog: an in Deep Dissection of the Replication Stream
Riding the Binlog: an in Deep Dissection of the Replication StreamRiding the Binlog: an in Deep Dissection of the Replication Stream
Riding the Binlog: an in Deep Dissection of the Replication StreamJean-François Gagné
 

More from Jean-François Gagné (15)

The consequences of sync_binlog != 1
The consequences of sync_binlog != 1The consequences of sync_binlog != 1
The consequences of sync_binlog != 1
 
Autopsy of a MySQL Automation Disaster
Autopsy of a MySQL Automation DisasterAutopsy of a MySQL Automation Disaster
Autopsy of a MySQL Automation Disaster
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated Environment
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash Safety
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash Safety
 
MySQL Parallel Replication by Booking.com
MySQL Parallel Replication by Booking.comMySQL Parallel Replication by Booking.com
MySQL Parallel Replication by Booking.com
 
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
 
MySQL/MariaDB Parallel Replication: inventory, use-case and limitations
MySQL/MariaDB Parallel Replication: inventory, use-case and limitationsMySQL/MariaDB Parallel Replication: inventory, use-case and limitations
MySQL/MariaDB Parallel Replication: inventory, use-case and limitations
 
The two little bugs that almost brought down Booking.com
The two little bugs that almost brought down Booking.comThe two little bugs that almost brought down Booking.com
The two little bugs that almost brought down Booking.com
 
Autopsy of an automation disaster
Autopsy of an automation disasterAutopsy of an automation disaster
Autopsy of an automation disaster
 
How Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lagHow Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lag
 
MySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsMySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitations
 
MySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsMySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitations
 
MySQL Parallel Replication: inventory, use-cases and limitations
MySQL Parallel Replication: inventory, use-cases and limitationsMySQL Parallel Replication: inventory, use-cases and limitations
MySQL Parallel Replication: inventory, use-cases and limitations
 
Riding the Binlog: an in Deep Dissection of the Replication Stream
Riding the Binlog: an in Deep Dissection of the Replication StreamRiding the Binlog: an in Deep Dissection of the Replication Stream
Riding the Binlog: an in Deep Dissection of the Replication Stream
 

Recently uploaded

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup

Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator

  • 1. Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator by Jean-François Gagné and Art van Scheppingen Presented at Percona Live Online, May 2021
  • 2. Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator Jean-François Gagné System and MySQL Expert at HubSpot jgagne AT hubspot DOT com / @jfg956
  • 3. MySQL Service Discovery at MessageBird (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 3
  • 4. Summary of part #1 (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) • MySQL Primary High Availability • Failover to a Replica War Story • MySQL at MessageBird (Percona Server) • MySQL Service Discovery at MessageBird (ProxySQL) • Orchestrator integration and the Failover Process 4
  • 5. MySQL Primary High Availability [1 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) Failing-over the primary to a replica is my favorite high availability method • But it is not as easy as it sounds, and it is hard to automate well • An example of a complete failover solution in production: https://github.blog/2018-06-20-mysql-high-availability-at-github/ The five considerations for primary high availability: (https://jfg-mysql.blogspot.com/2019/02/mysql-master-high-availability-and-failover-more-thoughts.html) • Plan how you are doing primary high availability • Decide when you apply your plan (Failure Detection – FD) • Tell the application about the primary change (Service Discovery – SD) • Protect against the limits of FD and SD to avoid split-brains (Fencing) • Fix your data if something goes wrong 5
  • 6. MySQL Primary High Availability [2 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) Failure detection (FD) is the 1st part (and 1st challenge) of failing-over • It is a very hard problem: partial failure, unreliable network, partitions, … • It is impossible to be 100% sure of a failure, and confidence needs time → quick FD is unreliable, relatively reliable FD implies longer downtime ⇒ Quick FD for short downtime generates false positives Repointing is the 2nd part of failing-over to a replica: • Relatively easy with the right tools: GTID, Pseudo-GTID, Binlog Servers, … • Complexity grows with the number of direct replicas of the primary • Some software for repointing: • Orchestrator, Ripple Binlog Server, Replication Manager, MHA, Cluster Control, MaxScale 6
  • 7. MySQL Primary High Availability [3 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) What do I mean by repointing: • In the configuration below, when the primary fails, once one of the replicas has been chosen as the new primary, the other replica needs to be re-sourced (re-slaved) to the new primary 7
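To make the repointing step concrete, here is a minimal sketch of re-sourcing a replica to the new primary with GTID auto-positioning (standard MySQL 5.7 syntax; the hostname and replication user are placeholders, and the password is omitted for brevity):

  -- On the replica that must follow the new primary (GTID must be enabled):
  STOP SLAVE;
  CHANGE MASTER TO
    MASTER_HOST = 'new-primary.example.com',  -- placeholder hostname
    MASTER_USER = 'repl',                     -- placeholder replication user
    MASTER_AUTO_POSITION = 1;                 -- let GTID find the correct position
  START SLAVE;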
  • 8. MySQL Primary High Availability [4 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) Service Discovery (SD) is the 3rd part (and 2nd challenge) of failover: • If centralized → SPOF; if distributed → impossible to update atomically • SD will either introduce a bottleneck (including performance limits) • Or it will be unreliable in some way (pointing to the wrong primary) • Some ways to implement Service Discovery: DNS, VIP, Proxy, Zookeeper, … http://code.openark.org/blog/mysql/mysql-master-discovery-methods-part-1-dns ⇒ Unreliable FD and unreliable SD is a recipe for split-brains! Protecting against split-brains (Fencing): Adv. Subject – not many solutions (proxies and semi-synchronous replication might help) Fixing your data in case of a split-brain: only you can know how to do this! (tip on this in the war story) 8
  • 9. Failover War Story [1 of 6] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) Some infrastructure context: • Service Discovery is DNS (and failure detector is Orchestrator) • The databases are behind a firewall in two data centers 9
  • 10. Failover War Story [1 of 6] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) Some infrastructure context: • Service Discovery is DNS (and failure detector is Orchestrator) • The databases are behind a firewall in two data centers • And we have a failure of the firewall in the zone of the primary 10
  • 11. Failover War Story [2 of 6] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) Things went as planned: failed-over from Zone1 to Zone2 • New primary in zone 2: stop replication, set it read-write, update DNS, … • Everything was ok… 11
  • 12. Failover War Story [2 of 6] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) Things went as planned: failed-over from Zone1 to Zone2 • New primary in zone 2: stop replication, set it read-write, update DNS, … • Everything was ok… until the firewall came back up 12
  • 13. Failover War Story [3 of 6] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) Once the firewall came back up, no detectable problems • But some intuition made me check the binary logs of the old primary • And I found new transactions with timestamps after the firewall recovery (and obviously this is after the failover to zone 2) 13
  • 14. Failover War Story [4 of 6] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) These new transactions are problematic: • They are in the databases in zone 1, but not in zone 2 • They share common auto-increments with data in zone 2 • Luckily, there are only a few transactions, so easy to fix, but what happened? 14
  • 15. Failover War Story [5 of 6] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) The infrastructure is a little more complicated than initially presented: • There are web servers and local DNS behind the firewalls (fw) • The DNS update of the failover did not reach zone 1 (because of the fw failure) • When the firewall came back up, the web servers received traffic and, because the DNS was not yet updated, they wrote to the old primary • Once updated (a few seconds later), writes went to the new primary in zone 2 15
  • 16. Failover War Story [6 of 6] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) This war story was a decentralized Service Discovery causing problems Remember that it is not a matter of “if” but “when” things will go wrong Please share your war stories so we can learn from each other's experience • GitHub has a public MySQL post-mortem (great of them to share this): https://blog.github.com/2018-10-30-oct21-post-incident-analysis/ • I also have another MySQL Primary Failover war story in another talk: https://www.usenix.org/conference/srecon19emea/presentation/gagne Tip for easier data reconciliation: use UUIDs instead of auto-increments • But store UUIDs in an optimized way (in primary key order) https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/ http://mysql.rjweb.org/doc.php/uuid 16
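A minimal sketch of the optimized-UUID trick from the Percona post above: reorder the timestamp fields of a UUIDv1 so values are roughly insertion-ordered, and store the result as BINARY(16). On MySQL 8.0 the built-in UUID_TO_BIN(uuid, 1) performs the same byte swap; on 5.7 it can be done with string functions:

  -- Rearrange aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
  -- into     cccc + bbbb + aaaaaaaa + dddd + eeeeeeeeeeee (roughly time-ordered)
  SET @uuid = UUID();
  SELECT UNHEX(CONCAT(
           SUBSTR(@uuid, 15, 4),   -- time-high and version
           SUBSTR(@uuid, 10, 4),   -- time-mid
           SUBSTR(@uuid,  1, 8),   -- time-low
           SUBSTR(@uuid, 20, 4),   -- clock sequence
           SUBSTR(@uuid, 25)))     -- node
         AS ordered_uuid_bin;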
  • 17. MySQL at MessageBird [1 of 2] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) MessageBird is using MySQL 5.7 (more precisely Percona Server) These are hosted in many Google Cloud regions There are three types of MySQL deployments 1. Multi-region primary: replicas in many regions, and the primary can be in any of several regions 2. Single-region primary with replicas in many regions 3. Primary and replicas all in a single region 17
  • 18. MySQL at MessageBird [2 of 2] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) This is a multi-region primary (regions are color-coded)
  • 19. MySQL Service Discovery at MessageBird [1 / 12] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) Requirements for MySQL Service Discovery: • Route traffic to local replicas • Embed some sort of fencing mechanism This led to a multi-layer solution using ProxySQL: • Three layers for multi-region primary: 1. Collect 2. Master-Gateway (mgw) 3. Fencing For single-region, mgw and fencing are merged into local-fencing (locfen) 19
  • 20. MySQL Service Discovery at MessageBird [2 / 12] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 20
  • 21. MySQL Service Discovery at MessageBird [3 / 12] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 21 The collect layer is a standard entry-point design The fencing layer is a natural HA way to route traffic to the primary (a single node would not be HA) The master-gateway layer is the glue between collect and fencing (more about this later in the talk) The local-fencing layer merges the mgw and fencing layers for single-region primary databases, because routing to a single region does not need the mgw glue (three nodes for N+2 HA, more about this later in the talk)
  • 22. The collect is the entry-point of the MySQL Service Discovery: • It starts with a load-balancer sending traffic to ProxySQL • We have at least 3 instances of ProxySQL for N+2 high availability • From here, read-only traffic is sent directly to replicas • Primary traffic (read-write) is sent to mgw or locfen 22 MySQL Service Discovery at MessageBird [4 / 12] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021)
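As an illustration only (hostnames, hostgroup ids, weights and the rule patterns are placeholders, not MessageBird's actual configuration), a collect-style read/write split looks like this in the ProxySQL admin interface:

  -- Hostgroup 10 = primary path (towards mgw/locfen), hostgroup 20 = local replicas.
  INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
  VALUES (10, 'locfen-1.example', 6033, 100),   -- weight is what biases routing
         (20, 'replica-1.example', 3306, 100);
  -- Plain SELECTs go to replicas; SELECT ... FOR UPDATE and all writes go to the primary path.
  INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
  VALUES (1, 1, '^SELECT.*FOR UPDATE', 10, 1),
         (2, 1, '^SELECT', 20, 1);
  LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
  LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;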
  • 23. Routing from collect to locfen is either local or crossing a region boundary For mgw, it is biased to local when a mgw is in the same region as collect • ProxySQL routing is weight-based (no easy fallback routing) MySQL Service Discovery at MessageBird [5 / 12] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 23
  • 24. The master-gateway is deployed in all regions potentially hosting a primary Just as fencing (or locfen) is kept as small as possible to reduce the update-scope of a failover (to a single region), mgw bounds the update-scope of moving the primary to another region to a continent (in this case, the three mgw regions are in Europe), which avoids a planet-scale reconfiguration of collect on failover And the way mgw routes traffic to fencing protects against writing to the primary in case of network partitions 24 MySQL Service Discovery at MessageBird [6 / 12] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021)
  • 25. 25 MySQL Service Discovery at MessageBird [7 / 12] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) The mgw routing is as follows: • If the primary is in a remote region, traffic is routed to fencing in that region • If the primary is in the same region, traffic is routed to the other mgw ⇒ No path to the primary avoids crossing a region boundary
  • 26. No path to the primary avoids crossing a region boundary • That might sound sub-optimal, but it is an interesting tradeoff • It brings the best-case vs worst-case round-trip ratio to the primary closer together Without crossing a region boundary: • Best case (local access): ~1 ms round-trip to the primary • Worst case (remote access): ~20 ms round-trip to the primary ⇒ Ratio of 1 to 20 With crossing a region boundary in mgw: • Best case (remote access): ~20 ms round-trip to the primary • Worst case (local access): ~40 ms round-trip to the primary • Even worse case (remote, routed via the mgw of the primary's region): ~60 ms ⇒ Ratio of 1 to 2 (3 in the worst case) 26 MySQL Service Discovery at MessageBird [8 / 12] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021)
  • 27. 27 MySQL Service Discovery at MessageBird [9 / 12] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) The worst case (which could be avoided with smarter collect routing)
  • 28. Favoring low round-trip variance over an optimal best case • With region-remote primary accesses, a high latency is unavoidable: when already paying 20 ms of latency, paying 40 or 60 ms should not be problematic • Moving the average closer to the median avoids problems • And in the case where most writes are local, it avoids surprises when the database becomes remote • And it prevents writing to the primary in case of a network partition 28 MySQL Service Discovery at MessageBird [10 / 12] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021)
  • 29. 29 MySQL Service Discovery at MessageBird [11 / 12] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) And therefore, I claim this design has interesting tradeoffs • But it might not fit everyone's requirements
  • 30. Why only two fencing nodes and three locfen nodes: • I like N+2 high availability • It allows a single failure to not need an immediate fix • If a locfen node fails on Friday evening, it can wait until Monday • A failure of one of the two fencing nodes looks problematic • But we can failover to another region having two healthy nodes • And updating two ProxySQL instances in case of a failover is easier than three (needing three locfen nodes is something I dislike, as it makes failover more fragile) 30 MySQL Service Discovery at MessageBird [12 / 12] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021)
  • 31. Failover (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 31 Failing-over is a multi-step process: 1. Detecting a failure 2. Fencing the primary: setting it as OFFLINE_HARD in ProxySQL 3. Regrouping replicas under the new primary 4. Waiting for replication to catch up on the new primary 5. Making the new primary ready: stop replication, set writable, start HB (heartbeat), … 6. Updating ProxySQL to point to the new primary 7. Reconfiguring fencing and master-gateway if needed
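Step 2 (fencing) maps to a small ProxySQL admin operation on each fencing/locfen node; a sketch with placeholder names:

  -- OFFLINE_HARD drops existing connections and refuses new ones.
  UPDATE mysql_servers SET status = 'OFFLINE_HARD'
  WHERE hostgroup_id = 10 AND hostname = 'old-primary.example';
  LOAD MYSQL SERVERS TO RUNTIME;
  SAVE MYSQL SERVERS TO DISK;

Step 6 is the mirror image: point the writer hostgroup at the new primary, then LOAD/SAVE again.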
  • 32. Reconfiguring Fencing and MGW [1 of 6] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 32
  • 33. Reconfiguring Fencing and MGW [2 of 6] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 33
  • 34. Reconfiguring Fencing and MGW [3 of 6] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 34
  • 35. Reconfiguring Fencing and MGW [4 of 6] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 35
  • 36. Reconfiguring Fencing and MGW [5 of 6] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 36
  • 37. Reconfiguring Fencing and MGW [6 of 6] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 37 Reacting to a network partition is a similar operation
  • 38. Orchestrator integration (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) Orchestrator integration • Pre-failover hook: fence the primary in fencing (or locfen) • Post-failover hook: update fencing (or locfen) to the new primary • The list of ProxySQL nodes (fencing or locfen) and the ProxySQL hostgroup are stored in each database, which makes this information available to the Orchestrator hooks 38
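A hedged sketch of this wiring: Orchestrator's PreFailoverProcesses and PostMasterFailoverProcesses hooks (real configuration keys) can call a script that reads the ProxySQL node list from a metadata table in the database, then runs the fencing/repointing statements against each ProxySQL admin port. The schema, table and column names below are hypothetical, not the deck's actual ones:

  -- Hypothetical metadata table queried by the failover hooks:
  SELECT proxysql_host, proxysql_admin_port, hostgroup_id
  FROM dbops.proxysql_endpoints;
  -- The post-failover hook then runs, on each listed ProxySQL node:
  UPDATE mysql_servers SET hostname = 'new-primary.example'  -- placeholder
  WHERE hostgroup_id = 10;
  LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;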
  • 39. MySQL Service Discovery at MessageBird (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 39
  • 40. Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator Art van Scheppingen Senior Database Engineer at MessageBird art.vanscheppingen AT messagebird DOT com
  • 41. Has been running successfully in production for over two years now • Most of our workload is on single-region primary (i.e. locfen) • We have one cluster on multi-region primary ProxySQL has been very stable for us • No big issues on 1.4.x, 2.0.x and 2.1.x How does it work so far? (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 41
  • 42. Easier for devs to set up connections to the primary • Point the connection to an IP address • No failover handling necessary Easier for devs to scale out reads • Point the connection to the same IP address • Use a different user for RO • We can differentiate reads (read-only, read-only replica-only, etc) How does it work so far? (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 42
  • 43. Multiplexing in ProxySQL turned out to be tricky Without multiplexing: • 1-to-1 number of connections between Collect and Locfen • The application makes a connection for each shard (8 at the moment) • The application only actively uses one • A high number of connections means high load With multiplexing: • High number of connections on Collect • Theoretically only 1/8th of the connections on Locfen • Even lower, due to less overhead on establishing connections Mastering multiplexing [1 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 43
  • 44. Prior to multiplexing Mastering multiplexing [2 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 44
  • 45. We enabled multiplexing in our configuration but not much happened: ProxySQL disabled multiplexing due to auto-increments • Bug report on an ORM that didn't handle multiplexing well (Hibernate) • ProxySQL parses the OK packet for auto-increment values • Whenever one is encountered, multiplexing is affected • mysql-auto_increment_delay_multiplex is set to 5 by default • This means multiplexing is disabled for the next 5 queries on that connection Mastering multiplexing [3 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 45
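If an application never reads LAST_INSERT_ID() back from the session (unlike the Hibernate pattern above), one possible mitigation is lowering the delay; 0 disables it entirely. This is a sketch of the knob, not a general recommendation, since it can break clients that do rely on it:

  -- On the ProxySQL admin interface:
  UPDATE global_variables SET variable_value = '0'
  WHERE variable_name = 'mysql-auto_increment_delay_multiplex';
  LOAD MYSQL VARIABLES TO RUNTIME; SAVE MYSQL VARIABLES TO DISK;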
  • 46. After mastering multiplexing Mastering multiplexing [4 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 46
  • 47. We tried reusing collect and locfen as much as possible • Centralized configuration (hostgroups and users) • Less complexity Expansion was inevitable • Noisy neighbors (hostgroups) • Reducing risk • Better tuning for certain workloads • Easier for maintenance • Cascading effect to other hostgroups Currently running 3 vertical stacks of collect and locfen Separation of stacks (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 47
  • 48. Above 50% CPU usage, ProxySQL will show increased latency • Lower CPU usage by multiplexing • Lower CPU usage by idle-threads When a certain hostgroup in Locfen is under stress (1000x normal workload) • CPU usage can get above 50% • Latency will increase on all hostgroups • Latency will cascade upstream to the collect layer Separation of stacks: Noisy neighbors [1 of 2] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 48
  • 49. Above 50% CPU usage ProxySQL will show increased latency Separation of stacks: Noisy neighbors [2 of 2] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 49
  • 50. For us ProxySQL scales up to about 12K connections per host • Beyond this we hit the limits of TCP Our Collect layer reached 7.8K connections per host • If one collect host fails, two remain • The two remaining hosts each have to absorb an additional ~3.9K connections (half of the failed host's 7.8K) • That brings them very close to our 12K limit • Replacing a failed host now becomes an emergency operation Separation of stacks: Reducing risk [1 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 50
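This headroom can be watched from the admin interface; stats_mysql_global is a standard ProxySQL stats table (the 12K ceiling itself is MessageBird's observed limit, not a ProxySQL constant):

  SELECT Variable_Name, Variable_Value
  FROM stats_mysql_global
  WHERE Variable_Name IN ('Client_Connections_connected',
                          'Server_Connections_connected');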
  • 51. ProxySQL keeps count of connection errors (MySQL + TCP shared) • Will shun a host if a backend becomes “less responsive” (e.g. high load) • Happens on hostgroups with any number of hosts Hitting the limits of TCP • Errors to locfen or MGW will increase • Locfen and MGW backends will be shunned • Established connections will also be closed Separation of stacks: Reducing risk [2 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 51
  • 52. Separation of stacks: Reducing risk [3 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 52
  • 53. Separation of stacks: Reducing risk [4 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 53
  • 54. ProxySQL keeps count of connection errors (MySQL + TCP shared) • Will shun a host if a backend becomes “less responsive” (e.g. high load) • Happens on hostgroups with any number of hosts Shunning a primary for 1 second will cause another torrent of connections • The client gets a timeout and will reconnect immediately • No available backend: the new connection is “paused” for up to 10 seconds • After 1 second the primary becomes available again • ProxySQL has thousands of connections waiting • Rinse and repeat… Shunning a primary [1 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 54
  • 55. How to detect shunned hosts? • ProxySQL will log a shunned host in the proxysql log • This includes server name, error rate and duration of shun 2020-06-11 12:01:39 MySQL_HostGroups_Manager.cpp:311:connect_error(): [ERROR] Shunning server x.x.x.x:3306 with 10 errors/sec. Shunning for 10 seconds 2020-06-11 12:01:44 MySQL_HostGroups_Manager.cpp:311:connect_error(): [ERROR] Shunning server x.x.x.x:3306 with 18 errors/sec. Shunning for 10 seconds 2020-06-11 12:01:49 MySQL_HostGroups_Manager.cpp:311:connect_error(): [ERROR] Shunning server x.x.x.x:3306 with 10 errors/sec. Shunning for 10 seconds 2020-06-11 12:01:54 MySQL_HostGroups_Manager.cpp:311:connect_error(): [ERROR] Shunning server x.x.x.x:3306 with 10 errors/sec. Shunning for 10 seconds How do we get them in our graphs? • ProxySQL log tailer • Looks for: connection timeouts, shunned hosts, shunned due to replication lag • Exports metrics to Prometheus every minute Shunning a primary [2 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 55
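Besides tailing the log, shunned backends are also visible in the admin interface, which can be cheaper to poll (SHUNNED and SHUNNED_REPLICATION_LAG are standard status values):

  SELECT hostgroup_id, hostname, port, status
  FROM runtime_mysql_servers
  WHERE status LIKE 'SHUNNED%';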
  • 56. Shunning a primary [3 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 56
  • 57. Shunning a primary for 1 second will cause an avalanche of connections • Normal latency is 10 ms to 50 ms • An added latency of 1 second will decrease application throughput • Decreased application throughput means k8s scales up workers • k8s scaling up means more incoming connections • Rinse and repeat… How we dealt with this: • During some incidents we throttle down workers • Counter-intuitive: throttling down workers increases throughput • Some application workers now have a fixed ceiling Shunning a primary [4 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 57
  • 58. Most ProxySQL tuning is done on a global level Some examples: • mysql-connection_delay_multiplex_ms • mysql-free_connections_pct • mysql-wait_timeout Having a separate stack allows: • Fine-tuned multiplexing configuration • Earlier or later closing of connections • Separate handling of (end-)user connections Separation of stacks: Better tuning (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 58
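Because these variables are global to a ProxySQL process, per-stack tuning simply means setting different values on each stack's nodes. A sketch with illustrative values (not the deck's production numbers); note that mysql-wait_timeout is in milliseconds in ProxySQL:

  -- e.g. on the stack that serves short-lived end-user connections:
  UPDATE global_variables SET variable_value = '600000'   -- 10 minutes, in ms
  WHERE variable_name = 'mysql-wait_timeout';
  UPDATE global_variables SET variable_value = '50'       -- keep backend attached 50 ms
  WHERE variable_name = 'mysql-connection_delay_multiplex_ms';
  LOAD MYSQL VARIABLES TO RUNTIME; SAVE MYSQL VARIABLES TO DISK;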
  • 59. Maintenance on Collect is scary • Draining from GLB gracefully is “closing after X minutes” • Near capacity means we run a risk when performing maintenance Maintenance on MGW/Fencing/Locfen is scary • Draining a host takes ages • Aggressive reuse of connections by the connection pool • The connection timeout (wait_timeout) is 8 hours • Some applications don't handle the closing of a connection well Separation of stacks: maintenance (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 59
  • 60. War story: instability on one cluster wiped out many others The unstable cluster • MySQL back_log set too low • TCP listen overflows on the primary • ProxySQL started to shun the primary The effect • Continuous shunning happened • TCP listen overflows started to happen on ProxySQL • Affected stability of other hostgroups Separation of stacks: cascading effect [1 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 60
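Checking back_log is easy, but it is not a dynamic variable: raising it requires a my.cnf change and a restart. The listen-overflow counters themselves live at the OS level (ListenOverflows under TcpExt in /proc/net/netstat), not in MySQL:

  -- Read-only at runtime; set in my.cnf and restart to change it.
  SHOW GLOBAL VARIABLES LIKE 'back_log';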
  • 61. Shunning of primary causes endless shunning loop Separation of stacks: cascading effect [2 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 61
  • 62. Listen overflows on ProxySQL hosts make it even worse Separation of stacks: cascading effect [3 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 62
  • 63. In effect Collect layer shuns Locfen layer hosts Separation of stacks: cascading effect [4 of 4] (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 63
  • 64. After a client connection closes, the connection will be reused • Collect → Locfen • Locfen → database ProxySQL resets the connection to its initial connection parameters What if the new connection's settings don't match (e.g. utf8mb4 or a CET time zone)? Will the connection between Locfen → database also change? Other quirks: Connection contamination (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 64
  • 65. Uneven distribution of connections/queries • Weight influences the distribution of connections • Reuse of existing connections is favored by ProxySQL • Influenced by mysql-free_connections_pct The variable mysql-free_connections_pct is a global variable • Percentage of the maximum allowed connections of a hostgroup • Some hostgroups allow up to 3000 incoming connections • 2% of 3000 is 60 connections, while actual usage is 10 to 15 • More connections are kept open in the connection pool than necessary Other quirks: Uneven distribution (Service Discovery and Failover with ProxySQL and Orchestrator – PL May 2021) 65
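The effect is visible in the connection pool stats; stats_mysql_connection_pool is a standard ProxySQL table (the hostgroup size and percentage are the slide's example, not universal defaults):

  -- With mysql-free_connections_pct = 2 and a 3000-connection hostgroup,
  -- up to 60 idle connections are kept while only 10 to 15 are actually used.
  SELECT hostgroup, srv_host, ConnUsed, ConnFree
  FROM stats_mysql_connection_pool;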
  • 66. Thanks! Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator by Jean-François Gagné and Art van Scheppingen Presented at Percona Live Online, May 2021