
ContainerDays Boston 2016: "Autopilot: Running Real-world Applications in Containers" (Tim Gross)

  1. Applications on Autopilot Tim Gross @0x74696d (“tim”) github.com/autopilotpattern
  2. github.com/autopilotpattern What if your containers were self-aware and self-operating?
  3. github.com/autopilotpattern
  4. github.com/autopilotpattern How do we get from dev to prod? • Service Discovery • Load balancing • Automated failover • Config changes • Monitoring
  5. github.com/autopilotpattern App
  6. github.com/autopilotpattern Nginx Consul MySQL Primary MySQL Replica ES Master ES Data Kibana Prometheus Sales Logstash Customers
  7. github.com/autopilotpattern Nginx Consul /customers /sales /sales/data /customers/data read/write read-only MySQL Primary MySQL Replica ES Master ES Data Kibana Prometheus Sales Logstash Customers Load Balancing
  8. github.com/autopilotpattern Nginx Consul /customers /sales /sales/data /customers/data read/write read-only async replication MySQL Primary MySQL Replica ES Master ES Data Kibana Prometheus Sales Logstash Customers Replication & Fail-over
  9. github.com/autopilotpattern Nginx Consul /customers /sales /sales/data /customers/data read/write read-only async replication MySQL Primary MySQL Replica ES Master ES Data Kibana Prometheus Sales Logstash Customers Service discovery
  10. github.com/autopilotpattern Nginx Consul /customers /sales /sales/data /customers/data read/write read-only async replication MySQL Primary MySQL Replica ES Master ES Data Kibana Prometheus Sales Logstash Customers Logging
  11. github.com/autopilotpattern Nginx Consul /customers /sales /sales/data /customers/data read/write read-only async replication MySQL Primary MySQL Replica ES Master ES Data Kibana Prometheus Sales Logstash Customers Monitoring
  12. github.com/autopilotpattern Problem: Service Discovery
  13. github.com/autopilotpattern App Application
  14. github.com/autopilotpattern Database App Application w/ database
  15. github.com/autopilotpattern Database App Application w/ database How does the app find the DB? Can we just use DNS?
  16. github.com/autopilotpattern Couchbase App Couchbase Couchbase Couchbase Application w/ Couchbase
  17. github.com/autopilotpattern Couchbase App Couchbase Couchbase Couchbase Application w/ Couchbase Nodes coordinate shards via IP address Can’t use an A-record
  18. github.com/autopilotpattern Couchbase Couchbase Couchbase Couchbase Couchbase App Application w/ Couchbase What happens when we add a node?
  19. github.com/autopilotpattern Couchbase Couchbase Couchbase Couchbase Couchbase App Application w/ Couchbase Does app respect DNS TTL?
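
The DNS questions above come down to what a client can actually see. A minimal sketch (Python standard library only; the hostname and port are hypothetical) of the usual pattern and its blind spot:

    import socket

    def resolve_once(hostname, port):
        # getaddrinfo returns A/AAAA answers only: no ports beyond the one we
        # supplied, no health information, and no TTL is exposed to the caller.
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        return [sockaddr[:2] for _family, _type, _proto, _canon, sockaddr in infos]

    # Typical client behavior: resolve at startup and cache forever, so nodes
    # added or removed later (the Couchbase case above) are simply never seen.
    ADDRS = resolve_once("couchbase.example.internal", 8091)

Honoring TTLs would mean re-resolving on every use, which most database drivers and connection pools do not do.
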
  20. github.com/autopilotpattern Problem: Load Balancing
  21. github.com/autopilotpattern Nginx Customers Sales Microservices application /sales/data /customers/data /sales /customers
  22. github.com/autopilotpattern Nginx Sales Customers /sales/data /customers/data /sales /customers Microservices application
  23. github.com/autopilotpattern Nginx Sales Customers /sales/data /customers/data /sales /customers Microservices application How do apps update peers when we scale out?
  24. github.com/autopilotpattern Nginx Sales Customers /sales/data /customers/data Microservices application Route everything thru Nginx (or LB)?
  25. github.com/autopilotpattern Nginx Sales Customers /sales/data /customers/data Microservices application How do we update Nginx backends? Adds network path length and SPoF
  26. github.com/autopilotpattern Sales Sidecar/Proxy Customers http://localhost http://192.168.1.1 ex. Bamboo Compute node
  27. github.com/autopilotpattern Sales Sidecar/Proxy Customers http://localhost http://192.168.1.1 How do we update proxy config? Adds network path length Compute node
  28. github.com/autopilotpattern Nginx Sales Consul /customers /sales /sales/data /customers/data Customers Microservices application w/ discovery catalog
  29. github.com/autopilotpattern Nginx Sales Consul /customers /sales /sales/data /customers/data Customers How do we make existing applications use it? Microservices application w/ discovery catalog
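
Making an existing application "use it" means doing a lookup against the catalog's HTTP API instead of relying on a hard-coded address. A minimal sketch against Consul's health endpoint (standard library only; the consul hostname and the "customers" service name follow the diagrams above):

    import json
    import urllib.request

    def healthy_instances(service, consul="http://consul:8500"):
        # /v1/health/service/<name>?passing returns only instances whose
        # health checks are currently passing.
        url = "{}/v1/health/service/{}?passing".format(consul, service)
        with urllib.request.urlopen(url) as resp:
            entries = json.load(resp)
        return [(e["Service"]["Address"] or e["Node"]["Address"],
                 e["Service"]["Port"]) for e in entries]

    # Re-run this lookup before connecting, or whenever the list changes,
    # rather than baking an upstream address into config at deploy time.
    print(healthy_instances("customers"))

The rest of the talk is about pushing exactly this lookup, plus registration and health reporting, into the container itself so the application code doesn't have to change.
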
  30. github.com/autopilotpattern Problem: Automated Failover
  31. github.com/autopilotpattern read/write read-only async replication App Primary Replica MySQL with replication
  32. github.com/autopilotpattern read/write read-only async replication App Primary Replica MySQL with replication How does client find DB? How does replica find primary? How does primary tell replica where to start?
  33. github.com/autopilotpattern read/write read-only async replication App Primary Replica MySQL with replication How do we update client on failover? How do we promote a replica? How do we orchestrate backups?
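
For the client side of those questions, one answer is to resolve the primary through the discovery catalog on every (re)connect rather than configuring a fixed DB host. A sketch using PyMySQL, which this image already installs; the "mysql-primary" service name is the one used later in the talk:

    import json
    import urllib.request
    import pymysql

    def current_primary(consul="http://consul:8500"):
        # Whichever instance is registered (and passing) as mysql-primary
        # is the one accepting writes right now.
        url = consul + "/v1/health/service/mysql-primary?passing"
        with urllib.request.urlopen(url) as resp:
            entry = json.load(resp)[0]
        return (entry["Service"]["Address"] or entry["Node"]["Address"],
                entry["Service"]["Port"])

    def connect_rw(user, password, database):
        host, port = current_primary()
        return pymysql.connect(host=host, port=port, user=user,
                               password=password, db=database)

On failover the registration changes, so a client that reconnects through current_primary() lands on the newly promoted node without a config change.
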
  34. github.com/autopilotpattern Solutions that don’t work: Configuration Management (ex. Chef, Puppet, Ansible)
  35. github.com/autopilotpattern • No CM server in local development • No service discovery on change
  36. github.com/autopilotpattern Solutions that don’t work: *aaS (ex. PaaS, DBaaS)
  37. github.com/autopilotpattern • Vendor lock-in • Poor performance • Very expensive
  38. github.com/autopilotpattern Solutions that don’t work: Mega-orchestrator (ex. Kubernetes)
  39. github.com/autopilotpattern Shifts responsibility for app behavior away from app developers
  41. github.com/autopilotpattern What if your containers were self-aware and self-operating?
  42. github.com/autopilotpattern
  43. github.com/autopilotpattern “[a] pattern where containers autonomously adapt to changes in their environment and coordinate their actions thru a globally shared state” Lukasz Guminski, Container Solutions http://container-solutions.com/containerpilot-on-mantl/
  44. github.com/autopilotpattern Make applications responsible for: Startup Shutdown Scaling Discovery Recovery Telemetry
  45. github.com/autopilotpattern Empower application development teams
  46. github.com/autopilotpattern 3 requirements
  47. github.com/autopilotpattern #1: Ability to provision containers across multiple compute nodes
  48. github.com/autopilotpattern VM or physical hardware VM or physical hardware VM or physical hardware Nginx Consul MySQL Primary ES Master Prometheus Logstash Customers Nginx Consul MySQL Primary ES Master Prometheus Logstash Customers Customers Cluster management & provisioning
  49. github.com/autopilotpattern Options for cluster management and container placement:
  50. github.com/autopilotpattern #2: Network virtualization
  51. github.com/autopilotpattern IP inside the container == IP outside the container
  52. github.com/autopilotpattern NAT Sales Customers 192.168.1.101Compute Node 172.17.0.2:80 192.168.1.100:32380 Docker bridge networking Consul
  53. github.com/autopilotpattern NAT Sales Customers Consul Compute Node 192.168.1.100:32380 Docker bridge networking “I’m listening on 172.17.0.2:80” 172.17.0.2:80
  54. github.com/autopilotpattern NAT Sales Customers Consul Compute Node 192.168.1.100:32380 Docker bridge networking “Where is Customers?” “172.17.0.2:80” 172.17.0.2:80
  55. github.com/autopilotpattern NAT Sales Customers Consul Compute Node 192.168.1.100:32380 Docker bridge networking WTF???!!! 172.17.0.2:80 172.17.0.2:80 No route to host!
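
The failure in the bridge-networking slides is that the container can only report an address nobody else can route to. A small illustration (Python standard library; the addresses are the ones in the diagram):

    import socket

    def my_advertised_address(port=80):
        # Inside a bridge-networked container this resolves to the private
        # bridge address (172.17.0.2 in the diagram); the host-side mapping
        # 192.168.1.100:32380 is not visible from in here at all.
        return socket.gethostbyname(socket.gethostname()), port

    # Registering this tuple in Consul is exactly the failure above:
    # peers on other compute nodes get "no route to host".
    print(my_advertised_address())

This is why the pattern needs requirement #2: either host networking (with its port conflicts) or an overlay, so the address a container sees is an address its peers can reach, as the next slides show.
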
  56. github.com/autopilotpattern Sales Customers Compute Node Docker host networking Consul 192.168.1.101 192.168.1.100:80
  57. github.com/autopilotpattern Sales Customers Compute Node Docker host networking Consul 192.168.1.101 192.168.1.100:80 “I’m listening on 192.168.1.100:80”
  58. github.com/autopilotpattern Sales Customers Compute Node Docker host networking Consul 192.168.1.101 192.168.1.100:80 “I’m listening on 192.168.1.100:80” Customers 192.168.1.100:80
  59. github.com/autopilotpattern Sales Customers Compute Node Docker host networking Consul 192.168.1.101 192.168.1.100:80 “I’m listening on 192.168.1.100:80” Customers 192.168.1.100:80 Port conflicts!
  60. github.com/autopilotpattern Sales Customers Compute Node Overlay networking Consul 192.168.1.101 192.168.1.100:80 “I’m listening on 192.168.1.100:80” Customers 192.168.1.102:80
  61. github.com/autopilotpattern Sales Customers Compute Node Overlay networking Consul 192.168.1.101 192.168.1.100:80 “I’m listening on 192.168.1.102:80” Customers 192.168.1.102:80
  62. github.com/autopilotpattern Options for overlay networking:
  63. github.com/autopilotpattern #3: Infrastructure-backed service discovery
  64. github.com/autopilotpattern Nginx Sales Consul Customers Microservices app
  65. github.com/autopilotpattern Nginx Sales Consul Customers Microservices app How do we bootstrap service catalog HA? How do services find service catalog?
  66. github.com/autopilotpattern Options to bootstrap service catalog: infrastructure-backed DNS (Container Name Service (CNS)); run it on each node
  67. github.com/autopilotpattern #4: We might need some help
  68. github.com/autopilotpattern App-centric micro-orchestrator that runs inside the container. User-defined behaviors: • Lifecycle hooks (preStart, preStop, postStop) • Health checks w/ heartbeats • Watch discovery catalog for changes • Update config on upstream changes • Gather performance metrics
  69. github.com/autopilotpattern Sales Container Pilot Application Application container http://localhost http://192.168.1.1 Side car?
  70. github.com/autopilotpattern Sales Container Pilot Application Application container http://localhost http://192.168.1.1 Not a side car!
  71. github.com/autopilotpattern Sales Container Pilot Application Consul Where is Sales? Application container
  72. github.com/autopilotpattern Sales Container Pilot Application Consul Where is Sales? 192.168.1.100 192.168.1.101 192.168.1.102 Application container
  73. github.com/autopilotpattern Sales Container Pilot Application Consul Where is Sales? 192.168.1.100 192.168.1.101 192.168.1.102 Application container onChange event
  74. github.com/autopilotpattern Sales Container Pilot Application http://192.168.1.100 Consul Where is Sales? 192.168.1.100 192.168.1.101 192.168.1.102 Application container onChange event
  75. github.com/autopilotpattern Application onChange event User-defined behavior hooks: • preStart • preStop • postStop • health • onChange • sensor • task Application container
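
As a concrete example of the onChange hook, here is a hypothetical handler for the Nginx tier: re-read the healthy "customers" instances from Consul, rewrite an upstream include file, and tell Nginx to reload. The file path and template are illustrative, not the ones used by the published images:

    import json
    import subprocess
    import urllib.request

    UPSTREAM_TMPL = "upstream customers {{\n{servers}\n}}\n"
    CONF_PATH = "/etc/nginx/conf.d/customers-upstream.conf"   # hypothetical path

    def on_change(consul="http://consul:8500"):
        url = consul + "/v1/health/service/customers?passing"
        with urllib.request.urlopen(url) as resp:
            entries = json.load(resp)
        servers = "\n".join(
            "    server {}:{};".format(
                e["Service"]["Address"] or e["Node"]["Address"],
                e["Service"]["Port"])
            for e in entries)
        with open(CONF_PATH, "w") as f:
            f.write(UPSTREAM_TMPL.format(servers=servers))
        # Graceful reload: existing connections finish, new ones use the new list.
        subprocess.check_call(["nginx", "-s", "reload"])

ContainerPilot runs the hook only when the watched backend list actually changes, so the reload does not happen on a hot loop.
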
  76. github.com/autopilotpattern read/write read-only async replication App Primary Replica Consul MySQL with replication
  77. github.com/autopilotpattern ~ $ git clone git@github.com:autopilotpattern/mysql.git ~ $ cd mysql ~/mysql $ tree --dirsfirst . ├── bin │ └── manage.py ├── etc │ ├── containerpilot.json │ └── my.cnf.tmpl ├── tests ├── _env ├── Dockerfile ├── docker-compose.yml ├── local-compose.yml └── setup.sh
  78. github.com/autopilotpattern ~/mysql/docker-compose.yml mysql: image: autopilotpattern/mysql:latest mem_limit: 4g restart: always # expose for linking, but each container gets a private IP for # internal use as well expose: - 3306 labels: - triton.cns.services=mysql env_file: _env environment: - CONTAINERPILOT=file:///etc/containerpilot.json
  79. github.com/autopilotpattern ~/mysql/docker-compose.yml mysql: image: autopilotpattern/mysql:latest mem_limit: 4g restart: always # expose for linking, but each container gets a private IP for # internal use as well expose: - 3306 labels: - triton.cns.services=mysql env_file: _env environment: - CONTAINERPILOT=file:///etc/containerpilot.json Infrastructure-backed service discovery requirement
  80. github.com/autopilotpattern ~/mysql/docker-compose.yml mysql: image: autopilotpattern/mysql:latest mem_limit: 4g restart: always # expose for linking, but each container gets a private IP for # internal use as well expose: - 3306 labels: - triton.cns.services=mysql env_file: _env environment: - CONTAINERPILOT=file:///etc/containerpilot.json Credentials from environment
  81. github.com/autopilotpattern ~/workshop/mysql $ ./setup.sh /path/to/private/key.pem ~/workshop/mysql $ emacs _env MYSQL_USER=me MYSQL_PASSWORD=password1 MYSQL_REPL_USER=repl MYSQL_REPL_PASSWORD=password2 MYSQL_DATABASE=mydb MANTA_BUCKET=/<username>/stor/triton-mysql MANTA_USER=<username> MANTA_SUBUSER= MANTA_ROLE= MANTA_URL=https://us-east.manta.joyent.com MANTA_KEY_ID=1a:b8:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx MANTA_PRIVATE_KEY=-----BEGIN RSA PRIVATE KEY-----#… CONSUL=consul.svc.0f06a3e0-a0da-eb00-a7ae-989d4e44e2ad.us-east-1.cns.joyent.com
  82. github.com/autopilotpattern ~/mysql $ docker-compose -p my up -d Creating my_consul_1 Creating my_mysql_1 ~/mysql $ docker-compose -p my ps Name Command State Ports ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– my_consul_1 /bin/start -server -bootst... Up 53/tcp, 53/udp, 8300/tcp... my_mysql_1 containerpilot mysqld… Up 0.0.0.0:3600
  83. github.com/autopilotpattern ~/mysql $ docker-compose -p my scale mysql=2 Creating my_mysql_2 ~/mysql $ docker-compose -p my ps Name Command State Ports ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– my_consul_1 /bin/start -server -bootst... Up 53/tcp, 53/udp, 8300/tcp... my_mysql_1 containerpilot mysqld… Up 0.0.0.0:3600 my_mysql_2 containerpilot mysqld… Up 0.0.0.0:3600
  84. github.com/autopilotpattern FROM percona:5.6 RUN apt-get update && apt-get install -y python python-dev gcc curl percona-xtrabackup # get Python drivers for MySQL, Consul, and Manta RUN curl -Ls -o get-pip.py https://bootstrap.pypa.io/get-pip.py && python get-pip.py && pip install PyMySQL==0.6.7 python-consul==0.4.7 manta==2.5.0 mock==2.0.0 # get ContainerPilot release (see repo for checksum verification!) RUN curl -Lo /tmp/cp.tar.gz https://github.com/joyent/containerpilot/… && tar -xz -f /tmp/cp.tar.gz && mv /containerpilot /usr/local/bin/ # configure ContainerPilot and MySQL COPY etc/* /etc/ COPY bin/* /usr/local/bin/ # override the parent entrypoint ENTRYPOINT [] # use --console to get error logs to stderr CMD [ "containerpilot", "mysqld", "--console", "--log-bin=mysql-bin", "--log_slave_updates=ON", "--gtid-mode=ON", "--enforce-gtid-consistency=ON" ] ~/mysql/Dockerfile
  88. github.com/autopilotpattern ~/mysql/etc/containerpilot.json { "consul": "{{ .CONSUL }}:8500", "preStart": "python /usr/local/bin/manage.py", "services": [ { "name": "mysql", "port": 3306, "health": "python /usr/local/bin/manage.py health", "poll": 5, "ttl": 25 } ], "backends": [ { "name": "mysql-primary", "poll": 10, "onChange": "python /usr/local/bin/manage.py on_change" } ] }
  89. github.com/autopilotpattern ~/mysql/etc/containerpilot.json { "consul": "{{ .CONSUL }}:8500", "preStart": "python /usr/local/bin/manage.py", "services": [ { "name": "mysql", "port": 3306, "health": "python /usr/local/bin/manage.py health", "poll": 5, "ttl": 25 } ], "backends": [ { "name": "mysql-primary", "poll": 10, "onChange": "python /usr/local/bin/manage.py on_change" } ] } Environment variable interpolation
  90. github.com/autopilotpattern ~/mysql/etc/containerpilot.json { "consul": "{{ .CONSUL }}:8500", "preStart": "python /usr/local/bin/manage.py", "services": [ { "name": "mysql", "port": 3306, "health": "python /usr/local/bin/manage.py health", "poll": 5, "ttl": 25 } ], "backends": [ { "name": "mysql-primary", "poll": 10, "onChange": "python /usr/local/bin/manage.py on_change" } ] } Service definition
  91. github.com/autopilotpattern ~/mysql/etc/containerpilot.json { "consul": "{{ .CONSUL }}:8500", "preStart": "python /usr/local/bin/manage.py", "services": [ { "name": "mysql", "port": 3306, "health": "python /usr/local/bin/manage.py health", "poll": 5, "ttl": 25 } ], "backends": [ { "name": "mysql-primary", "poll": 10, "onChange": "python /usr/local/bin/manage.py on_change" } ] } Backend definition
  92. github.com/autopilotpattern ~/mysql/etc/containerpilot.json { "consul": "{{ .CONSUL }}:8500", "preStart": "python /usr/local/bin/manage.py", "services": [ { "name": "mysql", "port": 3306, "health": "python /usr/local/bin/manage.py health", "poll": 5, "ttl": 25 } ], "backends": [ { "name": "mysql-primary", "poll": 10, "onChange": "python /usr/local/bin/manage.py on_change" } ] } Huh? This isn’t in our docker-compose.yml
  93. github.com/autopilotpattern ~/mysql/etc/containerpilot.json { "consul": "{{ .CONSUL }}:8500", "preStart": "python /usr/local/bin/manage.py", "services": [ { "name": "mysql", "port": 3306, "health": "python /usr/local/bin/manage.py health", "poll": 5, "ttl": 25 } ], "backends": [ { "name": "mysql-primary", "poll": 10, "onChange": "python /usr/local/bin/manage.py on_change" } ] } Logic lives in manage.py
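
Note how the config above wires every hook to the same script: "python manage.py" for preStart, "manage.py health", "manage.py on_change". A minimal sketch of how such a script might dispatch on its first argument (the repo's real manage.py is considerably more involved):

    import sys

    def pre_start():
        pass   # bootstrap the data directory (shown on the next slides)

    def health():
        pass   # heartbeat plus primary bookkeeping

    def on_change():
        pass   # react to the mysql-primary backend changing

    if __name__ == '__main__':
        # ContainerPilot simply shells out to this script; the hook name
        # arrives as the first argument, and preStart passes no argument.
        commands = {'health': health, 'on_change': on_change}
        if len(sys.argv) > 1:
            commands[sys.argv[1]]()
        else:
            pre_start()
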
  94. github.com/autopilotpattern Container Pilot Consul Lifecycle: preStart MySQL container
  95. github.com/autopilotpattern Container Pilot Consul Lifecycle: preStart MySQL container PID1 Separate container
  96. github.com/autopilotpattern Container Pilot Consul Lifecycle: preStart Manta object store Store snapshots MySQL container
  97. github.com/autopilotpattern Container Pilot Consul preStart Lifecycle: preStart Manta object store MySQL container
  98. github.com/autopilotpattern Container Pilot Consul preStart Lifecycle: preStart Manta object store MySQL container Note: no main application running yet! If exit code of preStart != 0, ContainerPilot exits
  99. github.com/autopilotpattern Container Pilot Consul preStart Lifecycle: preStart Manta object store MySQL container “Has a snapshot been written to Manta?”
  100. github.com/autopilotpattern Container Pilot Consul MySQL container preStart Lifecycle: preStart Manta object store “Has a snapshot been written to Manta?” “Nope!”
  101. github.com/autopilotpattern Container Pilot Consul MySQL container preStart Lifecycle: preStart Manta object store “Has a snapshot been written to Manta?” “Nope!” initialize DB
  102. github.com/autopilotpattern ~/mysql/bin/manage.py def pre_start(): """ MySQL must be running in order to execute most of our setup behavior so we're just going to make sure the directory structures are in place and then let the first health check handler take it from there """ if not os.path.isdir(os.path.join(config.datadir, 'mysql')): last_backup = has_snapshot() if last_backup: get_snapshot(last_backup) restore_from_snapshot(last_backup) else: if not initialize_db(): log.info('Skipping database setup.') sys.exit(0)
  103. github.com/autopilotpattern ~/mysql/bin/manage.py def pre_start(): """ MySQL must be running in order to execute most of our setup behavior so we're just going to make sure the directory structures are in place and then let the first health check handler take it from there """ if not os.path.isdir(os.path.join(config.datadir, 'mysql')): last_backup = has_snapshot() if last_backup: get_snapshot(last_backup) restore_from_snapshot(last_backup) else: if not initialize_db(): log.info('Skipping database setup.') sys.exit(0) Check w/ Consul for snapshot
  104. github.com/autopilotpattern ~/mysql/bin/manage.py def pre_start(): """ MySQL must be running in order to execute most of our setup behavior so we're just going to make sure the directory structures are in place and then let the first health check handler take it from there """ if not os.path.isdir(os.path.join(config.datadir, 'mysql')): last_backup = has_snapshot() if last_backup: get_snapshot(last_backup) restore_from_snapshot(last_backup) else: if not initialize_db(): log.info('Skipping database setup.') sys.exit(0) calls /usr/bin/mysql_install_db
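
The callout above says initialize_db() is a wrapper around /usr/bin/mysql_install_db. A hedged sketch of such a wrapper (the data directory and flags are assumptions based on the Percona 5.6 base image; the repo's real function differs):

    import logging
    import subprocess

    log = logging.getLogger('manage')

    def initialize_db(datadir='/var/lib/mysql'):
        # Create the system tables for a brand-new, empty data directory.
        # pre_start() above treats a falsy return as "Skipping database setup."
        try:
            subprocess.check_call([
                '/usr/bin/mysql_install_db',
                '--user=mysql',
                '--datadir={}'.format(datadir)])
            return True
        except subprocess.CalledProcessError:
            log.exception('mysql_install_db failed')
            return False
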
  106. github.com/autopilotpattern Container Pilot Consul Lifecycle: run Manta object store MySQL container
  107. github.com/autopilotpattern Container Pilot mysqld Consul Lifecycle: run • Attach to stdout/ stderr • Return exit code of application to Docker runtime MySQL container Manta object store
  108. github.com/autopilotpattern Container Pilot Consul Lifecycle: health mysqld health Manta object store MySQL container
  109. github.com/autopilotpattern Manta object store Container Pilot Consul Lifecycle: health User-defined health check inside the container. Runs every poll seconds. mysqld MySQL container health
  110. github.com/autopilotpattern mysqld MySQL container Container Pilot Consul health Lifecycle: health Manta object store first time? finish initialization
  111. github.com/autopilotpattern ~/mysql/bin/manage.py def health(): """ Run a simple health check. Also acts as a check for whether the ContainerPilot configuration needs to be reloaded (if it's been changed externally), or if we need to make a backup because the backup TTL has expired. """ node = MySQLNode() cp = ContainerPilot(node) if cp.update(): cp.reload() return # Because we need MySQL up to finish initialization, we need to check # for each pass thru the health check that we've done so. The happy # path is to check a lock file against the node state (which has been # set above) and immediately return when we discover the lock exists. # Otherwise, we bootstrap the instance. was_ready = assert_initialized_for_state(node) ctx = dict(user=config.repl_user, password=config.repl_password, timeout=cp.config['services'][0]['ttl']) node.conn = wait_for_connection(**ctx) # Update our lock on being the primary/standby. if node.is_primary() or node.is_standby(): update_session_ttl() # Create a snapshot and send it to the object store if all((node.is_snapshot_node(), (not is_backup_running()), (is_binlog_stale(node.conn) or is_time_for_snapshot()))): write_snapshot(node.conn) mysql_query(node.conn, 'SELECT 1', ())
  112. github.com/autopilotpattern ~/mysql/bin/manage.py def run_as_primary(node): """ The overall workflow here is ported and reworked from the Oracle-provided Docker image: https://github.com/mysql/mysql-docker/blob/mysql-server/5.7/docker-entrypoint.sh """ node.state = PRIMARY mark_as_primary(node) node.conn = wait_for_connection() if node.conn: # if we can make a connection w/o a password then this is the # first pass set_timezone_info() setup_root_user(node.conn) create_db(node.conn) create_default_user(node.conn) create_repl_user(node.conn) run_external_scripts('/etc/initdb.d') expire_root_password(node.conn) else: ctx = dict(user=config.repl_user, password=config.repl_password, database=config.mysql_db) node.conn = wait_for_connection(**ctx) stop_replication(node.conn) # in case this is a newly-promoted primary if USE_STANDBY: # if we're using a standby instance then we need to first # snapshot the primary so that we can bootstrap the standby. write_snapshot(node.conn) Set up DB, user, replication user, and expire password, etc.
  113. github.com/autopilotpattern ~/mysql/bin/manage.py def run_as_replica(node): try: ctx = dict(user=config.repl_user, password=config.repl_password, database=config.mysql_db) node.conn = wait_for_connection(**ctx) set_primary_for_replica(node.conn) except Exception as ex: log.exception(ex) def set_primary_for_replica(conn): """ Set up GTID-based replication to the primary; once this is set the replica will automatically try to catch up with the primary's last transactions. """ primary = get_primary_host() sql = ('CHANGE MASTER TO ' 'MASTER_HOST = %s, ' 'MASTER_USER = %s, ' 'MASTER_PASSWORD = %s, ' 'MASTER_PORT = 3306, ' 'MASTER_CONNECT_RETRY = 60, ' 'MASTER_AUTO_POSITION = 1, ' 'MASTER_SSL = 0; ' 'START SLAVE;') mysql_exec(conn, sql, (primary, config.repl_user, config.repl_password,))
  114. github.com/autopilotpattern ~/mysql/bin/manage.py def run_as_replica(node): try: ctx = dict(user=config.repl_user, password=config.repl_password, database=config.mysql_db) node.conn = wait_for_connection(**ctx) set_primary_for_replica(node.conn) except Exception as ex: log.exception(ex) def set_primary_for_replica(conn): """ Set up GTID-based replication to the primary; once this is set the replica will automatically try to catch up with the primary's last transactions. """ primary = get_primary_host() sql = ('CHANGE MASTER TO ' 'MASTER_HOST = %s, ' 'MASTER_USER = %s, ' 'MASTER_PASSWORD = %s, ' 'MASTER_PORT = 3306, ' 'MASTER_CONNECT_RETRY = 60, ' 'MASTER_AUTO_POSITION = 1, ' 'MASTER_SSL = 0; ' 'START SLAVE;') mysql_exec(conn, sql, (primary, config.repl_user, config.repl_password,)) gets from Consul
  115. github.com/autopilotpattern ~/mysql/bin/manage.py def run_as_replica(node): try: ctx = dict(user=config.repl_user, password=config.repl_password, database=config.mysql_db) node.conn = wait_for_connection(**ctx) set_primary_for_replica(node.conn) except Exception as ex: log.exception(ex) def set_primary_for_replica(conn): """ Set up GTID-based replication to the primary; once this is set the replica will automatically try to catch up with the primary's last transactions. """ primary = get_primary_host() sql = ('CHANGE MASTER TO ' 'MASTER_HOST = %s, ' 'MASTER_USER = %s, ' 'MASTER_PASSWORD = %s, ' 'MASTER_PORT = 3306, ' 'MASTER_CONNECT_RETRY = 60, ' 'MASTER_AUTO_POSITION = 1, ' 'MASTER_SSL = 0; ' 'START SLAVE;') mysql_exec(conn, sql, (primary, config.repl_user, config.repl_password,)) Remember our preStart downloaded the snapshot
  116. github.com/autopilotpattern Wait a sec. How do we know which instance is primary!?
  117. github.com/autopilotpattern Container Pilot Consul Lifecycle: health Exit! mysqld MySQL container health
  118. github.com/autopilotpattern Container Pilot mysqld Consul health Lifecycle: health Exit code is 0? “I am mysql-12345. I am available at 192.168.100.2:4000. I am healthy for the next 10 seconds.” MySQL container
  119. github.com/autopilotpattern Container Pilot mysqld Consul MySQL container health Lifecycle: health If exit code != 0, do nothing (TTL expires)
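
The "healthy for the next N seconds" message is a Consul TTL check: when the health hook exits 0, ContainerPilot marks the check as passing; when it doesn't, nothing is sent and the TTL simply lapses. A sketch of that pass call against Consul's agent API (the check id is hypothetical; ContainerPilot makes this call for you):

    import urllib.request

    def heartbeat(check_id, consul="http://consul:8500"):
        # Mark a TTL check as passing. If no such call arrives before the
        # configured ttl runs out, Consul flags the instance critical and it
        # drops out of the ?passing discovery results.
        req = urllib.request.Request(
            "{}/v1/agent/check/pass/{}".format(consul, check_id), method="PUT")
        urllib.request.urlopen(req)

    heartbeat("service:mysql-12345")   # hypothetical check id
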
  120. github.com/autopilotpattern Ask Consul for Primary
  121. github.com/autopilotpattern I’m the primary! Ask Consul for Primary
  122. github.com/autopilotpattern I’m the primary! Ask Consul for Primary Update lock TTL w/ each health check
  123. github.com/autopilotpattern I’m the primary! Someone else is the primary! I’m a replica! Ask Consul for Primary
  124. github.com/autopilotpattern I’m the primary! Someone else is the primary! I’m a replica! Ask Consul for Primary Syncs up using snapshot and GTID
  125. github.com/autopilotpattern No Primary? I’m the Primary! I’m the primary! Someone else is the primary! I’m a replica! Ask Consul for Primary
  126. github.com/autopilotpattern No Primary? I’m the Primary? I’m the primary! Someone else is the primary! I’m a replica! Ask Consul for Primary Need to assert only 1 primary
  127. github.com/autopilotpattern No Primary? I’m the Primary? I’m the primary! Failed! Go back to start I’m the primary! Someone else is the primary! I’m a replica! Set lock in Consul w/ TTL Ask Consul for Primary
  128. github.com/autopilotpattern No Primary? I’m the Primary? I’m the primary! Failed! Go back to start I’m the primary! Someone else is the primary! I’m a replica! Set lock in Consul w/ TTL Ask Consul for Primary Update lock TTL w/ each health check. Rewrite ContainerPilot config and SIGHUP
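
"Set lock in Consul w/ TTL" maps to a Consul session plus an acquire on a key. A sketch with python-consul (already installed in this image); the key and session names are illustrative rather than the repo's exact ones:

    import consul   # python-consul

    def try_to_become_primary(hostname, ttl=25):
        c = consul.Consul(host='consul')
        # A session with a TTL: if this node stops renewing it (because its
        # health checks stop passing), the lock releases and another node
        # can win the next election.
        session_id = c.session.create(name='mysql-primary-lock', ttl=ttl)
        # Only one contender's acquire succeeds; the rest stay replicas.
        won = c.kv.put('mysql-primary', hostname, acquire=session_id)
        return won, session_id

    # The health handler then calls c.session.renew(session_id) on every pass,
    # which is the "Update lock TTL w/ each health check" step above.
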
  129. github.com/autopilotpattern ~/mysql/etc/containerpilot.json { "consul": "{{ .CONSUL }}:8500", "preStart": "python /usr/local/bin/manage.py", "services": [ { "name": "mysql", "port": 3306, "health": "python /usr/local/bin/manage.py health", "poll": 5, "ttl": 25 } ], "backends": [ { "name": "mysql-primary", "poll": 10, "onChange": "python /usr/local/bin/manage.py on_change" } ] }
  130. github.com/autopilotpattern ~/mysql/etc/containerpilot.json { "consul": "{{ .CONSUL }}:8500", "preStart": "python /usr/local/bin/manage.py", "services": [ { "name": "mysql-primary", "port": 3306, "health": "python /usr/local/bin/manage.py health", "poll": 5, "ttl": 25 } ], "backends": [ { "name": "mysql-primary", "poll": 10, "onChange": "python /usr/local/bin/manage.py on_change" } ] } Rewrite & reload config
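
What "Rewrite & reload config" amounts to, sketched under assumptions: the config path from the Dockerfile, the service rename shown above, and ContainerPilot running as PID 1 and re-reading its config on SIGHUP (the real logic lives in manage.py's ContainerPilot class):

    import json
    import os
    import signal

    CONFIG = '/etc/containerpilot.json'

    def advertise_as_primary():
        with open(CONFIG) as f:
            cfg = json.load(f)
        if cfg['services'][0]['name'] == 'mysql-primary':
            return False                     # already advertising as primary
        cfg['services'][0]['name'] = 'mysql-primary'
        with open(CONFIG, 'w') as f:
            json.dump(cfg, f, indent=2)
        # ContainerPilot re-registers its services under the new name when it
        # reloads; replicas watching the mysql-primary backend see the change.
        os.kill(1, signal.SIGHUP)
        return True
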
  131. github.com/autopilotpattern ~/mysql/bin/manage.py def health(): """ Run a simple health check. Also acts as a check for whether the ContainerPilot configuration needs to be reloaded (if it's been changed externally), or if we need to make a backup because the backup TTL has expired. """ node = MySQLNode() cp = ContainerPilot(node) if cp.update(): cp.reload() return was_ready = assert_initialized_for_state(node) # cp.reload() will exit early so no need to setup # connection until this point ctx = dict(user=config.repl_user, password=config.repl_password, timeout=cp.config['services'][0]['ttl']) node.conn = wait_for_connection(**ctx) # Update our lock on being the primary/standby. # If this lock is allowed to expire and the health check for the primary # fails, the `onChange` handlers for the replicas will try to self-elect # as primary by obtaining the lock. # If this node can update the lock but the DB fails its health check, # then the operator will need to manually intervene if they want to # force a failover. This architecture is a result of Consul not # permitting us to acquire a new lock on a health-checked session if the # health check is *currently* failing, but has the happy side-effect of # reducing the risk of flapping on a transient health check failure. if node.is_primary() or node.is_standby(): update_session_ttl() # Create a snapshot and send it to the object store. if all((node.is_snapshot_node(), (not is_backup_running()), (is_binlog_stale(node.conn) or is_time_for_snapshot()))): write_snapshot(node.conn) mysql_query(node.conn, 'SELECT 1', ())
  132. github.com/autopilotpattern Wait a sec. How do we fail over?
  133. github.com/autopilotpattern ~/mysql/etc/containerpilot.json { "consul": "{{ .CONSUL }}:8500", "preStart": "python /usr/local/bin/manage.py", "services": [ { "name": "mysql", "port": 3306, "health": "python /usr/local/bin/manage.py health", "poll": 5, "ttl": 25 } ], "backends": [ { "name": "mysql-primary", "poll": 10, "onChange": "python /usr/local/bin/manage.py on_change" } ] }
  134. github.com/autopilotpattern Container Pilot mysqld Consul Where is mysql-primary? 192.168.1.100 MySQL container Lifecycle: onChange
  135. github.com/autopilotpattern Container Pilot mysqld Consul Where is mysql-primary? 192.168.1.100 MySQL container Lifecycle: onChange Check Consul for services listed in backends. Runs every poll seconds.
  136. github.com/autopilotpattern replica primary Healthy! Healthy! Failed! Ask Consul for Primary no change Ask Consul for Primary no change Ask Consul for Primary fire onChange handler
  137. github.com/autopilotpattern ~/mysql/bin/manage.py def on_change(): node = MySQLNode() cp = ContainerPilot(node) ctx = dict(user=config.repl_user, password=config.repl_password, timeout=cp.config['services'][0]['ttl']) node.conn = wait_for_connection(**ctx) # need to stop replication whether we're the new primary or not stop_replication(node.conn) while True: try: # if there is no primary node, we'll try to obtain the lock. # if we get the lock we'll reload as the new primary, otherwise # someone else got the lock but we don't know who yet so loop primary = get_primary_node() if not primary: session_id = get_session(no_cache=True) if mark_with_session(PRIMARY_KEY, node.hostname, session_id): node.state = PRIMARY if cp.update(): cp.reload() return else: # we lost the race to lock the session for ourselves time.sleep(1) continue # we know who the primary is but not whether they're healthy. # if it's not healthy, we'll throw an exception and start over. ip = get_primary_host(primary=primary) if ip == node.ip: if cp.update(): cp.reload() return set_primary_for_replica(node.conn) return except Exception as ex: # This exception gets thrown if the session lock for `mysql-primary` # key has not expired yet (but there's no healthy primary either), # or if the replica's target primary isn't ready yet. log.debug(ex) time.sleep(1) # avoid hammering Consul continue
  138. github.com/autopilotpattern replica primary Healthy! Healthy! Failed! no change no change Ask Consul for Primary Ask Consul for Primary Ask Consul for Primary fire onChange handler
  139. github.com/autopilotpattern replica primary Healthy! Healthy! Failed! no change no change Ask Consul for Primary Ask Consul for Primary Ask Consul for Primary Ask Consul for Primary Ok, I’m primary Set lock in Consul Success! primary Healthy! fire onChange handler
  140. Applications on Autopilot Tim Gross @0x74696d (“tim”) github.com/autopilotpattern