High Availability in
GCE
By Allan and Carmen Mason
Who We Are
S  Allan Mason
Advance Auto Parts
“The largest retailer of automotive replacement parts and accessories in the United States.”
Who We Are
S  Carmen Mason
VitalSource Technologies
“VitalSource is a global leader in building, enhancing, and delivering e-learning content.”
Agenda
S  Current Infrastructure and Failover solution.
S  HA requirements
S  Something to manage our traffic
S  Topology Managers
S  Execution
S  Final Infrastructure and Failover solution.
Infrastructure & Failover
Solution
Current Data Center (LaVerne)
Local Data Center
Solution Provides
•  Saved Binary Logs
•  Differential Relay Logs
•  Notification
•  VIP Handling
What a Failover looks like
S  7 to 11 seconds
S  Automatically assigns most up to date replica as new master
S  Automatically updates VIP.
S  Replicas are slaved to the new master.
S  A CHANGE MASTER statement is logged to bring the old master in
line once it’s fixed.
S  Alerts by email, with the decisions that were made during the failover.
Herding Cats
(noisy neighbors)
“Firemen just ran into our datacenter
with a hose.”
- Bossman
Requirements & Issues
Why Google Cloud?
S  Amazon is our competition.
S  Our biggest reason.
S  Amazon seems to cost more?
S  It is hard to be sure.
S  Reserved Instances
S  Lost interest and capital opportunity cost though
S  Lost flexibility
S  Probably fine if you need a lot of instances, with a stable load.
Project Requirements
S  Move our existing applications and databases to GCE. 
S  Infrastructure (INF) team CANNOT be the bottleneck.
S  The initial request was to move to GCE in one year. 
S  Given one month to test and decide on a replacement for MHA
with the MHA Helper wrapper.
Boss’s Mandate
S  “Do things the right way.”
Second Generation High Availability
with CloudSQL!
Misconceptions
S  CloudSQL high availability only covers failure of the entire
zone, not failure of an individual instance.
S  This failover could take 5 minutes to even begin to trigger. 
S  Server options are MySQL and PostgreSQL (beta)
S  No Percona Server?
Why Not ‘Lift and Shift’
S  MHA with MHA Helper wrapper was not a viable option at
the time.
S  GCE instances are CentOS 7, which MHA Helper didn’t
support. 
S  No VIP support in Google Cloud.
Our Requirements &
‘nice to have’s
S  Try to at least reach our current uptime with a hands free
automatic failover.
S  Availability: Min 99.9% (<8.76 hours/yr application downtime).
S  Prefer 99.999% (<5.26 min. downtime/yr). Doesn’t everybody?
S  VIP handling or traffic control to direct the apps to the database.
S  All slaves are brought up to date with differential relay logs.
S  Slaves still reachable are issued CHANGE MASTER statements. 
S  Old master is fenced off until we can get to it. 
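As a rough check on the availability targets above, the yearly downtime budget follows directly from the percentage. The sketch below is illustrative only: the function name and the 365-day year are assumptions, not something from the deck. Five nines works out to roughly 315 seconds (about 5.26 minutes) per year, while a budget near 31 seconds corresponds to six nines (99.9999%).

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; leap years ignored (assumption)


def downtime_budget_seconds(availability_percent: float) -> float:
    """Return the maximum downtime per year, in seconds, for a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.999, 99.9999):
        seconds = downtime_budget_seconds(target)
        print(f"{target}% -> {seconds:.1f} s/yr ({seconds / 3600:.4f} h/yr)")
```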
Traffic Control & Topology
S  Two problems…
S  Traffic control : How do the applications find the Master?
S  Topology: Who’s the new Master, after current Master fails?
Traffic Control
The Players: Traffic Control
S  MaxScale
S  https://mariadb.com/products/mariadb-maxscale
S  HAProxy
S  http://www.haproxy.org/
S  ProxySQL
S  http://www.proxysql.com/
Why Not MaxScale?
S  MaxScale
S  New License: BSL
S  Research showed that it does not scale above 6 threads as well
as ProxySQL.
Why Not HAProxy?
S  HAProxy
S  Although a viable solution, we wanted something written
specifically for MySQL, that provides additional functionality.
Why ProxySQL?
S  ProxySQL
S  Written specifically for MySQL. 
S  Monitors topology and quickly recognizes changes, forwarding
queries accordingly. 
S  Provides functionality that we liked, such as:
S  Rewriting queries on the fly – bad developers!
S  Query routing – read / write splitting
S  Load balancing – balance reads between replicas
S  Query throttling – If you won’t throttle the API…
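To make the read/write splitting idea concrete, here is a minimal Python sketch of regex-based query routing. It is a toy model, not ProxySQL's implementation or its configuration syntax. The rule patterns, the function name, and the hostgroup numbers (0 for writers, 1 for readers, the convention this deck uses for its hostgroups) are assumptions chosen for illustration.

```python
import re

WRITER_HOSTGROUP = 0  # assumed numbering: writers
READER_HOSTGROUP = 1  # assumed numbering: readers

# Rules are checked in order; the first match wins.
ROUTING_RULES = [
    # Locking reads must go to the writer.
    (re.compile(r"^\s*SELECT\b.*\bFOR\s+UPDATE\b", re.IGNORECASE | re.DOTALL),
     WRITER_HOSTGROUP),
    # Any other SELECT can be served by a replica.
    (re.compile(r"^\s*SELECT\b", re.IGNORECASE), READER_HOSTGROUP),
]


def route_query(sql: str) -> int:
    """Return the hostgroup a statement should be sent to."""
    for pattern, hostgroup in ROUTING_RULES:
        if pattern.match(sql):
            return hostgroup
    # Everything that is not a plain SELECT defaults to the writer.
    return WRITER_HOSTGROUP
```

Defaulting unmatched statements to the writer is the conservative choice: a misrouted read costs a little master capacity, while a misrouted write fails on a read-only replica.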
ProxySQL Install
S  Can be found in the Percona repo
S  Each app is in its own Google Container Engine (GKE)
cluster.
S  Every app gets its own container cluster.
S  ProxySQL is in each of those containers (aka tiled)
S  No network latency
S  No single point of failure
ProxySQL Config
S  Do not run with the default config as is. Change the default
admin_credentials in
/etc/proxysql.cnf
S  admin_credentials = "admin:admin"
S  “SETEC ASTRONOMY” (Google it!)
S  Mount the config file as a Kubernetes Secret
ProxySQL Config
S  We use the remaining defaults in the configuration file for
now, with the following exceptions:
S  mysql-monitor_ping_interval = 5000 # default = 60000 (ms)
S  connect_retries_on_failure = 5 # default = 10
ProxySQL Failover
S  Create the monitor user in MySQL.
S  GRANT REPLICATION CLIENT ON *.* TO 'monitor'@'%' IDENTIFIED BY 'password';
S  ProxySQL monitors the read_only flag on the server to
determine hostgroup. In our configuration, if a server is in
hostgroup 0 it is writable. If it is in hostgroup 1 it is read-only.
S  hostgroup = 0 (writers)
S  hostgroup = 1 (readers)
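The rule described above (read_only decides the hostgroup) can be modeled in a few lines. This is a simplified sketch of the behavior, not ProxySQL code; the function name and the host names in the test data are hypothetical.

```python
from typing import Dict, List

WRITER_HOSTGROUP = 0  # writable servers (read_only = OFF)
READER_HOSTGROUP = 1  # read-only servers (read_only = ON)


def assign_hostgroups(read_only_by_host: Dict[str, bool]) -> Dict[int, List[str]]:
    """Group hosts by their observed read_only flag."""
    groups: Dict[int, List[str]] = {WRITER_HOSTGROUP: [], READER_HOSTGROUP: []}
    for host, read_only in sorted(read_only_by_host.items()):
        target = READER_HOSTGROUP if read_only else WRITER_HOSTGROUP
        groups[target].append(host)
    return groups
```

After a failover, the promoted replica has read_only switched off, so the next evaluation moves it into hostgroup 0 without any change on the application side.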
Replication Topology
The Players: Topology
S  Master High Availability Manager with tools for MySQL
(MHA)
S  https://github.com/yoshinorim/mha4mysql-manager/wiki
S  Orchestrator
S  https://github.com/github/orchestrator
Wooo Shiny!
Put a pin in it…
S  Orchestrator
S  Does not apply differential relay log events.
S  Lack of extensive in-house experience with it.
S  Set up in test environment to consider for near-term use.
What is Known
S  Why MHA?
S  Easy to automate the configuration and installation
S  Easy to use.
S  Very fast failovers: 7 to 11 seconds of downtime, proven several
times in production.
S  Brings replication servers current using the logs from the most
up to date slave, or the master (if accessible).
Setup MHA
S  One manager node, three MHA nodes.
S  Install a few perl dependencies.
S  Install MHA node on all DB servers.
S  Including on the manager node.
S  Install MHA Manager.
S  MHA assumes it runs as root.
S  Create /etc/sudoers.d/mha_sudo
Cmnd_Alias VIP_MGMT = /sbin/ip, /sbin/arping
Defaults:mha !requiretty
mha ALL=(root) NOPASSWD: VIP_MGMT
MHA Requirements
S  SSH with passphraseless authentication.
S  Set databases as read_only (all but Master).
S  log_bin must be enabled on candidate masters.
S  Replication filtering rules must be the same on all MySQL
servers.
S  Preserve relay logs: relay_log_purge = OFF;
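A pre-flight check for the settings listed above might look like the following sketch. It assumes the server variables have already been fetched into a dictionary of "ON"/"OFF" strings and that every server is a candidate master. The function name and message wording are hypothetical, and MHA's own checks cover considerably more than this.

```python
from typing import Dict, List


def check_mha_settings(variables: Dict[str, str], is_master: bool) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    # Binary logging is needed on any server that may be promoted.
    if variables.get("log_bin") != "ON":
        problems.append("log_bin must be enabled on candidate masters")
    if not is_master:
        # Replicas must refuse writes and keep their relay logs.
        if variables.get("read_only") != "ON":
            problems.append("replica should have read_only = ON")
        if variables.get("relay_log_purge") != "OFF":
            problems.append("replica should have relay_log_purge = OFF")
    return problems
```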
MHA Sanity Check
S  MHA Manager has built in Sanity Checks.
S  If it is all good? It just listens...
[info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
MHA Failover
[warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
S  If the MHA Manager can ssh to the Master, it will save the
binlogs to the MHA Manager’s directory.
S  Otherwise it copies the Relay Log from the most up to date slave.
S  Polls 4 more times to see if the Master is really really not there.
S  Polls the servers. Who’s dead? Who’s alive?
S  Begins Phase 1: Configuration Check
S  Phase 2: Dead Master Shutdown
S  Phase 3: Master Recovery Phase
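Picking the most up-to-date slave comes down to comparing binary log coordinates. The sketch below is a simplified illustration, not MHA's code: it assumes each replica reports the master log file and position it has read so far, and that log file names end in a numeric sequence (for example mysql-bin.000042). The class and function names are hypothetical.

```python
from typing import Dict, NamedTuple, Tuple


class ReplicaStatus(NamedTuple):
    master_log_file: str      # e.g. "mysql-bin.000042"
    read_master_log_pos: int  # byte offset read from that file


def binlog_coordinates(status: ReplicaStatus) -> Tuple[int, int]:
    """Turn a status into a comparable (file sequence, position) pair."""
    sequence = int(status.master_log_file.rsplit(".", 1)[1])
    return (sequence, status.read_master_log_pos)


def most_up_to_date(replicas: Dict[str, ReplicaStatus]) -> str:
    """Return the host that has received the most events from the master."""
    return max(replicas, key=lambda host: binlog_coordinates(replicas[host]))
```

Comparing the file sequence before the position matters: a replica at position 120 of file 000042 is ahead of one at position 9000 of file 000041.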
Execution of Our
Solution
“I have this TERRIBLE idea…”
[Diagram labels: DB Slave in Datacenter; Server in Old DR site; GCE DR instance]
Desperately trying to upload Terabytes of data
And then replicate it from the datacenter!
Moving the BIG guys
S  Tried to use the VPN from the datacenter to GCE.
S  Too unstable.
S  Tried direct to GCE.
S  Too slow
S  Aspera to the rescue!
S  http://www.asperasoft.com/
Xtrabackup
S  On Source Server
S  innobackupex
S  --compress --compress-threads
S  --stream=xbstream --parallel
S  --slave-info
S  On Destination Instance
S  Install xbstream and qpress
S  xbstream -x
S  innobackupex --decompress
Trickle
S  Trickle to throttle, and not kill our delicate network.
S  -s run standalone, without the trickled daemon
S  -u set the upload rate (KB/s)
S  GSUtil
S  -m Parallel copy (multithreaded and multiprocessing)
S  gs:// Google Storage bucket.
S  trickle -s -u 75000 gsutil -m cp 2017mmddDB.xbstream gs://mybucket/
Aspera
S  Aspera
S  -Q fair (not trickle) transfer policy
S  -T Max throughput, no encryption
S  -v Verbose
S  -m 400M minimum transfer rate
S  -l 800M target transfer rate
S  -i key file
S  ascp -Q -T -v -m 400M -l 800M -i /home/ascp/.ssh/id_rsa /path/to/bkup/2017mmddDB.xbstream ascp@192.168.0.1:/target/dir
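For a sense of scale, the rate limits quoted on the Trickle and Aspera slides can be turned into a best-case transfer time per terabyte. This is a back-of-the-envelope sketch under stated assumptions: trickle's KB is taken as 1024 bytes, the ascp "800M" target is taken as 800 megabits per second, and a terabyte is 10^12 bytes. These are ceilings, not measured throughput.

```python
def transfer_hours(size_bytes: float, bytes_per_second: float) -> float:
    """Best-case hours needed to move size_bytes at a constant rate."""
    return size_bytes / bytes_per_second / 3600


def trickle_bytes_per_second(kb_per_second: float) -> float:
    # Assumption: one trickle "KB" is 1024 bytes.
    return kb_per_second * 1024


def ascp_bytes_per_second(megabits_per_second: float) -> float:
    # Assumption: the ascp rate suffix "M" means megabits per second.
    return megabits_per_second * 1_000_000 / 8


if __name__ == "__main__":
    one_terabyte = 10**12
    trickle = transfer_hours(one_terabyte, trickle_bytes_per_second(75000))
    aspera = transfer_hours(one_terabyte, ascp_bytes_per_second(800))
    print(f"trickle cap: {trickle:.1f} h/TB, Aspera target: {aspera:.1f} h/TB")
```

Under these assumptions the caps work out to roughly 3.6 hours per terabyte for the trickle limit and 2.8 hours per terabyte at the Aspera target rate.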
Pt-heartbeat
CREATE DATABASE dbatools;
CREATE TABLE `heartbeat` (
`ts` varchar(26) NOT NULL,
`server_id` int(10) unsigned NOT NULL,
`file` varchar(255) DEFAULT NULL,
`position` bigint(20) unsigned DEFAULT NULL,
`relay_master_log_file` varchar(255) DEFAULT NULL,
`exec_master_log_pos` bigint(20) unsigned DEFAULT NULL,
PRIMARY KEY (`server_id`)
);
pt-heartbeat -D recovery --update -h localhost --daemonize
pt-heartbeat -D recovery --monitor h=masterIP --master-server-id masterID -u dbatools -p Password
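The idea behind the heartbeat table is simple: the master keeps rewriting a timestamp, and the replica's lag is the difference between its own clock and the last timestamp it has replicated. A minimal sketch of that arithmetic follows. It assumes the ts column holds an ISO-style timestamp with microseconds and that both servers' clocks are synchronized and in the same time zone; the function name is hypothetical.

```python
from datetime import datetime


def replication_lag_seconds(heartbeat_ts: str, now: datetime) -> float:
    """Seconds between the replicated heartbeat and the replica's clock."""
    written_at = datetime.fromisoformat(heartbeat_ts)
    return (now - written_at).total_seconds()


if __name__ == "__main__":
    # Example values only; a real check would read ts from the heartbeat table.
    lag = replication_lag_seconds(
        "2017-04-25T13:45:12.001230",
        datetime(2017, 4, 25, 13, 45, 14, 501230),
    )
    print(f"lag: {lag} s")
```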
Failing Prod over to GCE
S  Database slave in GCE production environment.
S  Run parallel builds, where possible, to test against.
S  Cut over in our time, during a maintenance window.
Switch to GCE Master
S  Stop writes to current master:
S  screen
S  FLUSH TABLES WITH READ LOCK;
S  SET GLOBAL READ_ONLY = 1;
S  Keep this window open.
S  Stop slave on new master:
S  SHOW SLAVE STATUS\G to check that the slave is waiting for the master to
send another event (not currently updating).
S  STOP SLAVE;
S  RESET SLAVE ALL;
S  Stop service on current master:
S  service mysql stop
S  DNS changes are made here in both Google and our local DNS servers.
S  Start MHA manager now that we are no longer replicating back to DC
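Before running STOP SLAVE on the new master, the "is it caught up?" check can be expressed as a small predicate over the SHOW SLAVE STATUS output. This sketch assumes the status row has been loaded into a dictionary keyed by column name, with values as strings. The function name is hypothetical, and a real cutover would still be verified by a human in the open session.

```python
from typing import Dict


def replica_is_caught_up(status: Dict[str, str]) -> bool:
    """True when the replica has applied everything it received and is idle."""
    read_pos = status.get("Read_Master_Log_Pos")
    if read_pos is None:
        return False  # incomplete status: do not assume it is safe
    return (
        status.get("Slave_IO_State") == "Waiting for master to send event"
        # The SQL thread is working on the same master log file the IO thread read.
        and status.get("Master_Log_File") == status.get("Relay_Master_Log_File")
        # Everything read from the master has been executed.
        and read_pos == status.get("Exec_Master_Log_Pos")
    )
```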
Final Infrastructure &
Failover solution
Google Cloud
Final Infrastructure
Solution Provides
•  Saved Binary Logs
•  Differential Relay Logs
•  Notification
•  No silos
•  No SPOF
•  Beautifully Scalable
What a Failover Looks Like
S  10-20 seconds
S  Automatically assigns most up to date replica as new master
S  ProxySQL directs traffic to master.
S  Replicas are slaved to the new master.
S  A CHANGE MASTER statement is logged to bring the old master in
line once it’s fixed.
S  Alerts by email and notifications in HipChat, with the decisions that
were made during the failover.
New Issues
S  ~10ms lag between regions, which means…
S  All database instances are in the same region as the application
GKE clusters.
S  Avoid allowing Central instance to be master to prevent lag, as
all GKE clusters are in an East region.
S  Set no_master=1 for that host in the MHA configuration.
Contact Info
S  Allan Mason
S  allan@digital-knight.com
S  Carmen Mason
S  Carmen.mason@ingramcontent.com
S  http://www.vitalsource.com
S  @CarmenMasonCIC
Thanks!
go away

 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Pl2017 High Availability in GCE

  • 1. High Availability in GCE, by Allan and Carmen Mason
  • 2. Who We Are: Allan Mason, Advance Auto Parts. “The largest retailer of automotive replacement parts and accessories in the United States.”
  • 3. Who We Are: Carmen Mason, VitalSource Technologies. “VitalSource is a global leader in building, enhancing, and delivering e-learning content.”
  • 4. Agenda
      • Current Infrastructure and Failover solution.
      • HA requirements
      • Something to manage our traffic
      • Topology Managers
      • Execution
      • Final Infrastructure and Failover solution.
  • 7. Local Data Center: Solution Provides
      • Saved Binary Logs
      • Differential Relay Logs
      • Notification
      • VIP Handling
  • 8. What a Failover looks like
      • 7 to 11 seconds
      • Automatically assigns most up to date replica as new master.
      • Automatically updates VIP.
      • Replicas are slaved to the new master.
      • A CHANGE MASTER statement is logged to bring the old master in line once it’s fixed.
      • Alerts by email, with the decisions that were made during the failover.
  • 10. “Firemen just ran into our datacenter with a hose.” - Bossman
  • 13. Why Google Cloud?
      • Amazon is our competition.
      • Our biggest reason.
      • Amazon seems to cost more?
      • It is hard to be sure.
      • Reserved Instances
      • Lost interest and capital opportunity cost though
      • Lost flexibility
      • Probably fine if you need a lot of instances, with a stable load.
  • 14. Project Requirements
      • Move our existing applications and databases to GCE.
      • Infrastructure (INF) team CANNOT be the bottleneck.
      • The initial request was to move to GCE in one year.
      • Given 1 month to test and decide on MHA w/ MHA Helper wrapper replacement.
  • 15. Boss’s Mandate: “Do things the right way.”
  • 16. Second Generation High Availability with CloudSQL! Misconceptions
      • CloudSQL high availability applies only if the entire zone fails, not just the instance.
      • This failover could take 5 minutes to even begin to trigger.
      • Server options are MySQL and PostgreSQL (beta)
      • No Percona Server?
  • 17. Why Not ‘Lift and Ship’
      • MHA with MHA Helper wrapper was not a viable option at the time.
      • GCE instances are CentOS 7, which MHA Helper didn’t support.
      • No VIP support in Google Cloud.
  • 18. Our Requirements & ‘nice to have’s
      • Try to at least reach our current uptime with a hands-free automatic failover.
      • Availability: min 99.9% (<8.76 hours/yr application downtime).
      • Prefer 99.999% (<5.26 min downtime/yr). Doesn’t everybody?
      • VIP handling or traffic control to direct the apps to the database.
      • All slaves are brought up to date with differential relay logs.
      • Slaves still reachable are issued CHANGE MASTER statements.
      • Old master is fenced off until we can get to it.
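The downtime budget behind an availability target is plain arithmetic: the fraction of a year the service may be unavailable. The sketch below assumes a 365-day year. It shows that 99.9% allows about 8.76 hours per year, 99.999% about 5.26 minutes, and that a budget of roughly 31 seconds corresponds to 99.9999%.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Seconds per year a service may be down at the given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.999, 99.9999):
        secs = allowed_downtime_seconds(target)
        print(f"{target}%: {secs / 3600:.2f} h = {secs / 60:.2f} min = {secs:.1f} s per year")
```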
  • 19. Traffic Control & Topology
      • Two problems…
      • Traffic control: How do the applications find the Master?
      • Topology: Who’s the new Master, after current Master fails?
  • 21. The Players: Traffic Control
      • MaxScale: https://mariadb.com/products/mariadb-maxscale
      • HAProxy: http://www.haproxy.org/
      • ProxySQL: http://www.proxysql.com/
  • 22. Why Not MaxScale?
      • New License: BSL
      • Research showed that it does not scale above 6 threads as well as ProxySQL.
  • 23. Why Not HAProxy?
      • Although a viable solution, we wanted something written specifically for MySQL, that provides additional functionality.
  • 24. Why ProxySQL?
      • Written specifically for MySQL.
      • Monitors topology and quickly recognizes changes, forwarding queries accordingly.
      • Provides functionality that we liked, such as:
          • Rewriting queries on the fly – bad developers!
          • Query routing – read / write splitting
          • Load balancing – balance reads between replicas
          • Query throttling – If you won’t throttle the API…
  • 25. ProxySQL Install
      • Can be found in the Percona repo
      • Each app is in its own Google Container Engine (GKE) cluster.
      • Every app gets its own container cluster.
      • ProxySQL is in each of those containers (aka tiled)
      • No network latency
      • No single point of failure
  • 26. ProxySQL Config
      • Do not run with the default config as is. Change the default admin_credentials in /etc/proxysql.cnf
      • admin_credentials = "admin:admin"
      • “SETEC ASTRONOMY” (Google it!)
      • Mount the config file as a Kubernetes Secret
  • 27. ProxySQL Config
      • We use the remaining defaults in the configuration file for now, with the following exceptions:
      • mysql-monitor_ping_interval = 5000 # default = 60000 (ms)
      • mysql-connect_retries_on_failure = 5 # default = 10
  • 28. ProxySQL Failover
      • Create the monitor user in MySQL.
      • GRANT REPLICATION CLIENT ON *.* TO 'monitor'@'%' IDENTIFIED BY 'password';
      • ProxySQL monitors the read_only flag on the server to determine hostgroup. In our configuration, if a server is in hostgroup 0 it is writable. If it is in hostgroup 1 it is read-only.
      • hostgroup = 0 (writers)
      • hostgroup = 1 (readers)
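The hostgroup rule above can be modelled in a few lines. This is an illustrative sketch of the idea only, not ProxySQL's implementation: in ProxySQL the writer/reader pair is declared in the `mysql_replication_hostgroups` table and routing is driven by query rules. The host names and the naive "starts with SELECT" test are assumptions made for the example.

```python
from __future__ import annotations

WRITER_HOSTGROUP = 0  # read_only = 0
READER_HOSTGROUP = 1  # read_only = 1


def assign_hostgroups(read_only_by_host: dict[str, bool]) -> dict[int, list[str]]:
    """Place each server in a hostgroup based on its read_only flag."""
    groups: dict[int, list[str]] = {WRITER_HOSTGROUP: [], READER_HOSTGROUP: []}
    for host, read_only in sorted(read_only_by_host.items()):
        target = READER_HOSTGROUP if read_only else WRITER_HOSTGROUP
        groups[target].append(host)
    return groups


def route(query: str, groups: dict[int, list[str]]) -> int:
    """Naive read/write split: SELECTs go to readers when any exist."""
    is_read = query.lstrip().upper().startswith("SELECT")
    if is_read and groups[READER_HOSTGROUP]:
        return READER_HOSTGROUP
    return WRITER_HOSTGROUP


if __name__ == "__main__":
    # Hypothetical topology: db1 is the master, db2 and db3 are replicas.
    before = assign_hostgroups({"db1": False, "db2": True, "db3": True})
    # After a failover, db2 has read_only switched off and db1 is fenced.
    after = assign_hostgroups({"db1": True, "db2": False, "db3": True})
    print(before, after, route("SELECT 1", after))
```

Because membership follows the flag, a failover that flips read_only on the promoted replica is enough to move write traffic without touching the applications.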
  • 30. The Players: Topology
      • Master High Availability Manager with tools for MySQL (MHA): https://github.com/yoshinorim/mha4mysql-manager/wiki
      • Orchestrator: https://github.com/github/orchestrator
  • 31. Wooo Shiny! Put a pin in it…
      • Orchestrator
      • Does not apply differential relay log events.
      • Lack of extensive in-house experience with it.
      • Set up in test environment to consider for near-term use.
  • 32. What is Known: Why MHA?
      • Easy to automate the configuration and installation.
      • Easy to use.
      • Very fast failovers: 7 to 11 seconds of downtime, proven several times in production.
      • Brings replication servers current using the logs from the most up to date slave, or the master (if accessible).
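Finding the "most up to date slave" comes down to comparing how far each replica has read from the dead master's binary log. The sketch below is a simplification of that idea, not MHA's code: it ranks replicas by the `Master_Log_File` and `Read_Master_Log_Pos` values that `SHOW SLAVE STATUS` reports. Host names and positions are made up.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ReplicaStatus:
    host: str
    master_log_file: str      # Master_Log_File, e.g. "mysql-bin.000042"
    read_master_log_pos: int  # Read_Master_Log_Pos


def binlog_index(log_file: str) -> int:
    """Numeric suffix of a binlog file name ("mysql-bin.000042" -> 42)."""
    return int(log_file.rsplit(".", 1)[1])


def most_up_to_date(replicas: Iterable[ReplicaStatus]) -> ReplicaStatus:
    """Replica that has received the most events from the old master."""
    return max(
        replicas,
        key=lambda r: (binlog_index(r.master_log_file), r.read_master_log_pos),
    )


if __name__ == "__main__":
    # Hypothetical replicas for illustration.
    replicas = [
        ReplicaStatus("db2", "mysql-bin.000042", 1200),
        ReplicaStatus("db3", "mysql-bin.000042", 4096),
        ReplicaStatus("db4", "mysql-bin.000041", 99999),
    ]
    print(most_up_to_date(replicas).host)
```

The relay log events the other replicas are missing relative to this one are the "differential relay logs" that bring them current.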
  • 34. Setup MHA
      • One manager node, three MHA nodes.
      • Install a few perl dependencies.
      • Install MHA node on all DB servers.
      • Including on the manager node.
      • Install MHA Manager.
      • MHA assumes it runs as root.
      • Create /etc/sudoers.d/mha_sudo
          Cmnd_Alias VIP_MGMT = /sbin/ip, /sbin/arping
          Defaults:mha !requiretty
          mha ALL=(root) NOPASSWD: VIP_MGMT
  • 35. MHA Requirements
      • SSH with passphraseless authentication.
      • Set databases as read_only (all but Master).
      • log_bin must be enabled on candidate masters.
      • Replication filtering rules must be the same on all MySQL servers.
      • Preserve relay logs: relay_log_purge = OFF;
  • 36. MHA Sanity Check
      • MHA Manager has built in Sanity Checks.
      • If it is all good? It just listens...
          [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
  • 37. MHA Failover
          [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
      • If the MHA Manager can ssh to the Master, it will save the binlogs to the MHA Manager’s directory.
      • Otherwise it copies the Relay Log from the most up to date slave.
      • Polls 4 more times to see if the Master is really really not there.
      • Polls the servers. Who’s dead? Who’s alive?
      • Begins Phase 1: Configuration Check
      • Phase 2: Dead Master Shutdown
      • Phase 3: Master Recovery Phase
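The "poll 4 more times" step guards against failing over on a single dropped ping. A minimal sketch of that confirm-before-acting pattern follows; it is not MHA's code, and the one-second interval and the injectable `ping` and `sleep` callables are assumptions made so the logic can be exercised without a database.

```python
import time
from typing import Callable


def master_is_dead(
    ping: Callable[[], bool],
    extra_checks: int = 4,
    interval_s: float = 1.0,  # assumption: not MHA's actual interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Declare the master dead only if the first ping and every re-check fail."""
    if ping():
        return False
    for _ in range(extra_checks):
        sleep(interval_s)
        if ping():
            return False  # master answered: treat the failure as transient
    return True


if __name__ == "__main__":
    # Simulated master that fails twice and then answers again.
    answers = iter([False, False, True])
    print(master_is_dead(lambda: next(answers), sleep=lambda _: None))
```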
  • 40. “I have this TERRIBLE idea…”
      • DB Slave in Datacenter
      • Server in Old DR site
      • GCE DR instance
      • Desperately trying to upload Terabytes of data
      • And then replicate it from the datacenter!
  • 41. Moving the BIG guys
      • Tried to use the VPN from the datacenter to GCE: too unstable.
      • Tried direct to GCE: too slow.
      • Aspera to the rescue! http://www.asperasoft.com/
  • 42. Xtrabackup
      • On Source Server: innobackupex
          • --compress --compress-threads
          • --stream=xbstream --parallel
          • --slave-info
      • On Destination Instance
          • Install xbstream and qpress
          • xbstream -x
          • innobackupex --decompress
  • 44. Trickle
      • Trickle to throttle, and not kill our delicate network.
          • -s run as a one-off vs trickled service
          • -u set the upload rate (KB/s)
      • GSUtil
          • -m Parallel copy (multithreaded and multiprocessing)
          • gs:// Google Storage bucket.
      • trickle -s -u 75000 gsutil -m cp 2017mmddDB.xbstream gs://mybucket/
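A throttle rate translates directly into how long an upload will take, which is worth estimating before starting a multi-terabyte copy. The sketch below is back-of-the-envelope arithmetic only. It assumes the trickle rate is in units of 1024 bytes per second and ignores protocol overhead. The 1 TB backup size is hypothetical.

```python
def transfer_hours(size_bytes: float, rate_kb_per_s: float, bytes_per_kb: int = 1024) -> float:
    """Hours needed to move size_bytes at a sustained rate given in KB/s."""
    return size_bytes / (rate_kb_per_s * bytes_per_kb) / 3600


if __name__ == "__main__":
    one_tb = 10**12  # hypothetical backup size
    print(f"1 TB at 75000 KB/s: about {transfer_hours(one_tb, 75000):.1f} hours")
```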
  • 45. Aspera
      • -Q fair (not trickle) transfer policy
      • -T Max throughput, no encryption
      • -v Verbose
      • -m 400M minimum transfer rate
      • -l 800M target transfer rate
      • -i key file
      • ascp -Q -T -v -m 400M -l 800M -i /home/ascp/.ssh/id_rsa /path/to/bkup/2017mmddDB.xbstream ascp@192.168.0.1:/target/dir
  • 46. Pt-heartbeat
          CREATE DATABASE dbatools;
          CREATE TABLE `heartbeat` (
            `ts` varchar(26) NOT NULL,
            `server_id` int(10) unsigned NOT NULL,
            `file` varchar(255) DEFAULT NULL,
            `position` bigint(20) unsigned DEFAULT NULL,
            `relay_master_log_file` varchar(255) DEFAULT NULL,
            `exec_master_log_pos` bigint(20) unsigned DEFAULT NULL,
            PRIMARY KEY (`server_id`)
          );
          pt-heartbeat -D recovery --update -h localhost --daemonize
          pt-heartbeat -D recovery --monitor h=masterIP --master-server-id masterID -u dbatools -p Password
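The heartbeat table measures lag independently of `Seconds_Behind_Master`: the master keeps rewriting `ts`, the row replicates, and the difference between the replica's clock and the replicated `ts` is the delay. The sketch below shows only that subtraction. It assumes `ts` is an ISO-8601 string with microseconds and that both servers' clocks are in sync. It does not reproduce pt-heartbeat itself.

```python
from datetime import datetime


def replication_lag_seconds(heartbeat_ts: str, now: datetime) -> float:
    """Seconds between the replicated heartbeat timestamp and the local clock."""
    written = datetime.fromisoformat(heartbeat_ts)
    # Clamp at zero so small clock skew never shows up as negative lag.
    return max(0.0, (now - written).total_seconds())


if __name__ == "__main__":
    # Hypothetical values for illustration.
    ts_from_table = "2017-04-01T12:00:00.000000"
    local_now = datetime(2017, 4, 1, 12, 0, 2, 500000)
    print(replication_lag_seconds(ts_from_table, local_now))
```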
  • 47. Failing Prod over to GCE
      • Database slave in GCE production environment.
      • Run parallel builds, where possible, to test against.
      • Cut over in our time, during a maintenance window.
  • 48. Switch to GCE Master
      • Stop writes to current master:
          • screen
          • FLUSH TABLES WITH READ LOCK;
          • SET GLOBAL READ_ONLY = 1;
          • Keep this window open.
      • Stop slave on new master:
          • SHOW SLAVE STATUS\G to check that the slave is waiting for the master to send another event. (Not currently updating)
          • STOP SLAVE;
          • RESET SLAVE ALL;
      • Stop service on current master:
          • service mysql stop
      • DNS changes are made here in both Google and our local DNS servers.
      • Start MHA manager now that we are no longer replicating back to DC
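The check before `STOP SLAVE` is the step where a mistake loses data, so it helps to spell out what "waiting and not currently updating" means in terms of `SHOW SLAVE STATUS` fields. The sketch below assumes the status row has already been fetched into a dict keyed by the standard column names. Fetching it is left out, and the sample values are invented.

```python
REQUIRED_FIELDS = (
    "Slave_IO_State",
    "Slave_SQL_Running",
    "Master_Log_File",
    "Relay_Master_Log_File",
    "Read_Master_Log_Pos",
    "Exec_Master_Log_Pos",
)


def replica_caught_up(status: dict) -> bool:
    """True when the replica is idle and has applied everything it has read."""
    if any(status.get(field) is None for field in REQUIRED_FIELDS):
        return False  # incomplete status row: do not cut over
    return (
        status["Slave_IO_State"] == "Waiting for master to send event"
        and status["Slave_SQL_Running"] == "Yes"
        and status["Master_Log_File"] == status["Relay_Master_Log_File"]
        and status["Read_Master_Log_Pos"] == status["Exec_Master_Log_Pos"]
    )


if __name__ == "__main__":
    # Hypothetical status row for illustration.
    row = {
        "Slave_IO_State": "Waiting for master to send event",
        "Slave_SQL_Running": "Yes",
        "Master_Log_File": "mysql-bin.000042",
        "Relay_Master_Log_File": "mysql-bin.000042",
        "Read_Master_Log_Pos": 4096,
        "Exec_Master_Log_Pos": 4096,
    }
    print(replica_caught_up(row))
```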
  • 49. Final Infrastructure & Failover solution: Google Cloud
  • 50. Final Infrastructure: Solution Provides
      • Saved Binary Logs
      • Differential Relay Logs
      • Notification
      • No silos
      • No SPOF
      • Beautifully Scalable
  • 51. What a Failover Looks Like
      • 10-20 seconds
      • Automatically assigns most up to date replica as new master.
      • ProxySQL directs traffic to master.
      • Replicas are slaved to the new master.
      • A CHANGE MASTER statement is logged to bring the old master in line once it’s fixed.
      • Alerts by email and notifications in HipChat, with the decisions that were made during the failover.
  • 52. New Issues
      • ~10ms lag between regions, which means…
      • All database instances are in the same region as the application GKE clusters.
      • Avoid allowing Central instance to be master to prevent lag, as all GKE clusters are in an East region.
      • no_master
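`no_master` is a per-server setting in the MHA configuration file that keeps a host from ever being promoted. The sketch below parses an MHA-style config with Python's `configparser` and lists the hosts that remain eligible. The host names and the config text are hypothetical, and this mirrors the intent of the setting, not MHA's own selection logic.

```python
from __future__ import annotations

import configparser

# Hypothetical MHA-style configuration: the Central instance is excluded.
EXAMPLE_CONF = """
[server default]
user=mha

[server1]
hostname=db-east-1
candidate_master=1

[server2]
hostname=db-east-2

[server3]
hostname=db-central-1
no_master=1
"""


def eligible_masters(conf_text: str) -> list[str]:
    """Hostnames that are not flagged no_master=1."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    hosts = []
    for name in parser.sections():
        if name == "server default":
            continue  # global settings block, not a server
        section = parser[name]
        if section.get("no_master", "0") != "1":
            hosts.append(section["hostname"])
    return hosts


if __name__ == "__main__":
    print(eligible_masters(EXAMPLE_CONF))
```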
  • 53. Contact Info
      • Allan Mason: allan@digital-knight.com
      • Carmen Mason: Carmen.mason@ingramcontent.com
      • http://www.vitalsource.com
      • @CarmenMasonCIC