
Database Defense In Depth

This was presented at Percona Live MySQL Conference and Expo 2015 at https://www.percona.com/live/mysql-conference-2015/sessions/database-defense-depth

"It looks like a database problem"
This is a pretty common phrase that comes up when triaging a production issue if MySQL is in the critical path. Often enough, the problems can be tracked down to new query patterns, edge cases in our company's applications, or even bugs in MySQL.

In this talk, I'll describe the tools built by the DBAs at Box to help proactively monitor and defend their MySQL infrastructure. I'll also discuss what we learned from running these defenses over the past 3 years and how they have influenced new procedures and software development patterns for MySQL at Box.

  1. Geoffrey Anderson • Percona Live 2015 • Database Defense in Depth
  2. Geoffrey Anderson • DBA @ Box, Inc. • MySQL, HBase, and more! • #DBHangOps
  3.
  4. OUR VISION: Share, manage and access your content from any device, anywhere
  5. 32MM+ USERS • 99% FORTUNE 500 • 275K+ BUSINESSES
  6. Global Enterprises Trust Box • HEALTHCARE • INDUSTRIAL • HIGH TECH • MEDIA • RETAIL • SERVICES
  7. Defenses? Huh? For a Database?
  8. Defenses • Query Tuning • Anemometer • Raingauge • Table Partitioning • User Management / Least Privilege • Graphs / Monitoring (OpenTSDB / Graphite / Kibana, statsd / tcollector) • user_statistics plugin • Percona Server • Proxying • Sharding • Vertically Scaling • Replication (multiple read replicas) • Backups (…and recovery!) • Binary Logs • Percona Toolkit (pt-query-digest, pt-slave-delay, pt-stalk, pt-table-checksum, pt-table-sync) • Killing connections • Killing bad queries • Killing bad transactions • Nagios alerts / PagerDuty • Innotop • Health checking
  9. Let’s take a journey
  10. You have your database
  11. Queries
  12. Queries • Yay business!
  13. Queries • Defense #1 – Query comments!
  14. Defense #2 – Alerting on the Database • Hi! I’m Nagios! How do you feel today?
  15. Defense #2 – Alerting on the Database • Hi! I’m Nagios! How do you feel today?
  16. Defense #3 – Dashboards • Current week • Previous week
  17. Defense #3 – Dashboards
  18. Defense #4 – User Statistics • Per-user connection counts • Per-user CPU time • Ahhh heavy queries!
  19. Defense #4 – User and Table statistics • Per-table and per-schema counts! • WAT
  20. Queries • Ahhh something terrible happened!
  21. Defense #5 – Box Raingauge • Ahhh Something terrible happened!
  22.
  23. Defense #6 – Monitor the queries • mysql-sys • pt-query-digest • Anemometer
  24. Queries
  25. Time to break up the data
  26. Defense #7 – Sharding
  27. Defense #7 – Sharding
  28. Defense #8 – innotop
  29. Defense #8 – innotop
      [general]
      mode=A
      [/general]
      [connections]
      mysql56Test=user= dsn=DBI:mysql:;host=127.0.0.1;port=3306;mysql_read_default_group=client
      anotherDb=user= dsn=DBI:mysql:;host=127.0.0.1;port=3306;mysql_read_default_group=client
      [/connections]
      [server_groups]
      all=mysql56Test anotherDb
      [/server_groups]
      [active_server_groups]
      A=all
      [/active_server_groups]
      [active_columns]
      health_dashboard=cxn uptime spark_qps qps max_query_time spark_run run connections open slave_running time_behind_master longest_sql
      [/active_columns]
      [visible_tables]
      A=health_dashboard
      [/visible_tables]
      [colors]
      health_dashboard=col='slave_running' op='eq' arg='No' color='black on_red'
      health_dashboard=col='time_behind_master' op='>' arg='30' color='cyan'
      health_dashboard=col='max_query_time' op='>' arg='60' color='red'
      health_dashboard=col='max_query_time' op='>' arg='30' color='yellow'
      [/colors]
      • Define connections (with custom names) • Assign connections to server groups • Set the default active group • Define views • Set the default view • Setup color highlighting
  31. Now for the fun stuff
  32. Bad queries happen
  33. Bad queries happen • I woke up at 3am? FOR THIS!? • Who added that crappy query…? • This should take care of itself… >:[
  34. Defense #9 – Query killer
      $ pt-kill --interval 10 \
          --match-user '^app_rw.*' \
          --busy-time 120s \
          --ignore-command '(Prepare|Execute)' \
          --match-info '^\s*\(?\s*(SELECT|select)' \
          --ignore-info '(.* SQL_NO_CACHE .*|.*&plz_no_kill=1.*)' \
          --idle-time 28800 \
          --victims all --print --kill \
          --log /var/log/qkkill.log \
          --execute-command '(mail -s "Killed queries on ${HOSTNAME}" "database-peeps@box.com" <<-EOF
      Last 10 killed queries in /var/log/qkkill.log:
      $(tail -10 /var/log/qkkill.log)
      EOF
      )' \
          --daemonize --pid /var/run/qkkill.pid
      • Kill queries over 2min! • Only kill SELECT statements • Don’t kill queries matching this… • Idle for 8 hours? You gotta go! • E-mail the last killed query!
  35. Defense #9 – Query killer
      $ pt-kill --interval 10 \
          --match-user '^app_rw.*' \
          --busy-time 120s \
          --ignore-command '(Prepare|Execute)' \
          --match-info '^\s*\(?\s*(SELECT|select)' \
          --ignore-info '(.* SQL_NO_CACHE .*|.*&plz_no_kill=1.*)' \
          --idle-time 28800 \
          --victims all --print --kill \
          --log /var/log/qkkill.log \
          --execute-command '(mail -s "Killed queries on ${HOSTNAME}" "database-peeps@box.com" <<-EOF
      Last 10 killed queries in /var/log/qkkill.log:
      $(tail -10 /var/log/qkkill.log)
      EOF
      )' \
          --daemonize --pid /var/run/qkkill.pid
      • Some sharp edges…
  36. Long transactions happen… • WTF!?
  37. Defense #10 – Transaction Monitor (and killer) • Find the connection for the longest transaction • Get the source host for that connection • Try to catch a query from the connection before killing it
  38. Defense #10 – Transaction Monitor (and killer)
  39. Too many connections!
  40. Defense #11 – Dynamic Query Killer Thresholds
      $ pt-kill --interval 10 \
          --match-user '^app_rw.*' \
          --busy-time 120s \
          --ignore-command '(Prepare|Execute)' \
          --match-info '^\s*\(?\s*(SELECT|select)' \
          --ignore-info '(.* SQL_NO_CACHE .*|.*&plz_no_kill=1.*)' \
          --idle-time 28800 \
          --victims all --print --kill \
          --log /var/log/qkkill.log \
          --execute-command '(mail -s "Killed queries on ${HOSTNAME}" "database-peeps@box.com" <<-EOF
      Last 10 killed queries in /var/log/qkkill.log:
      $(tail -10 /var/log/qkkill.log)
      EOF
      )' \
          --daemonize --pid /var/run/qkkill.pid \
          --dynamic-time 60 \
          --min-busy-time 10 \
          --min-idle-time 30
      • Percentage of max connections when dynamic killing should start • Don’t kill busy queries younger than this many seconds! • Don’t kill idle queries younger than this many seconds!
  41. Defense #11 – Dynamic Query Killer Thresholds
  42. Defense #11 – Dynamic Query Killer Thresholds
  43. The thundering herd.. • Connections
  44. The thundering herd.. • Connections
  45. Defense #12 – Healthchecking • { "names": { "global": { "node_age": 106, "state": "ON" }, "shard1": { "node_age": 276, "state": "ON" } }, "max_allowed_age": 6000 } • Connections • Are the DBs up? Yep! • Gimme Data! SELECT 1 • '1', duh...
  46. Defense #12 – Healthchecking • { "names": { "global": { "node_age": 106, "state": "OFF" }, "shard1": { "node_age": 276, "state": "ON" } }, "max_allowed_age": 6000 } • Connections • Are the DBs up? NO, PLZ LEAVE
  47. Defense #12 – Healthchecking • { "names": { "global": { "node_age": 106, "state": "OFF" }, "shard1": { "node_age": 276, "state": "ON" } }, "max_allowed_age": 6000 } • Connections
  48. Connections • Gimme Data! SELECT 1 • '1', duh... • Cache hit? • Rewrite Query? • Send to RO replica?
  49. Defense #13 – Data Layer as a Service • Connections • Cache hit? • Rewrite Query? • Send to RO replica? • Data • Here ya go.. • Get Data • LIMIT 1
  50. Defense #13 – Data Layer as a Service • Connections • Cache hit? • Rewrite Query? • Send to RO replica? • Here ya go..
  51. Locks happen… • AHHHH WHAT HAPPENED!?
  52. Defense #14 – Lock Monitor
      SELECT *, UNIX_TIMESTAMP() - UNIX_TIMESTAMP(t.TRX_STARTED) AS age
      FROM INFORMATION_SCHEMA.INNODB_LOCKS l
      JOIN INFORMATION_SCHEMA.INNODB_TRX t ON t.trx_id = l.lock_trx_id
      ORDER BY age DESC
      +---------------------+-----------+-------------------+-----+-----+
      | trx_mysql_thread_id | trx_state | lock_table        | ... | age |
      +---------------------+-----------+-------------------+-----+-----+
      |                 419 | RUNNING   | `sakila`.`rental` | ... | 544 |
      |                 581 | LOCK WAIT | `sakila`.`rental` | ... |  49 |
      |                 631 | LOCK WAIT | `sakila`.`rental` | ... |  49 |
      |                 573 | LOCK WAIT | `sakila`.`rental` | ... |  49 |
      ...
      500 rows in set (0.00 sec)
  53. Defense #14 – Lock Monitor (and killer)
      KILL 419 /* WITH FIRE */;
      +---------------------+-----------+-------------------+-----+-----+
      | trx_mysql_thread_id | trx_state | lock_table        | ... | age |
      +---------------------+-----------+-------------------+-----+-----+
      |                 419 | RUNNING   | `sakila`.`rental` | ... | 544 |
      |                 581 | LOCK WAIT | `sakila`.`rental` | ... |  49 |
      |                 631 | LOCK WAIT | `sakila`.`rental` | ... |  49 |
      |                 573 | LOCK WAIT | `sakila`.`rental` | ... |  49 |
      ...
      500 rows in set (0.00 sec)
  54. Idle connections…
  55. Idle connections…
  56. Idle connections…
  57. Defense #15 – strace errant connections
      $ db-strace-longest-trx mydb.example.com
      Longest transaction on mydb.example.com:3306
      MySQL thread id is 15, client is 10.0.0.5:51540, running for 08:00:07
      Warning: Permanently added '10.0.0.5' (RSA) to the list of known hosts.
      Process 19820 attached - interrupt to quit
      write(3, "...SELECT * FROM demo_test /* ORM_READ(732):...", 28) = 28
      read(3, "100100500023def4testtdemo_testt... ", 16384) = 309
      pipe([4, 5]) = 0
      [pid 19820] write(5, "+----+--------------+------+------+------+\n| c1 | c2 | c3 | c4 | c5 |\n+----+--------------+------+------+------+\n| AA | HELLO, HELLO | 8 | 0 | 0 |\n+----+--------------+------+------+------+\n", 215 <unfinished ...>
      [pid 19820] close(5) = 0
  58. Even idler connections!
      $ db-strace-longest-trx mydb.example.com
      Longest transaction on mydb.example.com:3306
      MySQL thread id is 19, client is 10.0.0.5:51740, running for 08:00:12
      Warning: Permanently added '10.0.0.5' (RSA) to the list of known hosts.
      Process 1833 attached - interrupt to quit
      • Where’s the thing!?
  59. Defense #16 – pt-pmp
      $ db-strace-longest-trx mydb.example.com
      Longest transaction on mydb.example.com:3306
      MySQL thread id is 19, client is 10.0.0.5:51740, running for 08:00:12
      Warning: Permanently added '10.0.0.5' (RSA) to the list of known hosts.
      Process 1833 attached - interrupt to quit
      • Where’s the thing!?
      gdb -ex "set pagination 0" -ex "bt" \
          -ex "source /path/to/php-src/.gdbinit" \
          -ex "dump_bt executor_globals.current_execute_data" \
          -batch -p "$pid" >> "$output_file"
  61. Defense #16 – pt-pmp php-pmp
      $ db-strace-longest-trx mydb.example.com
      Longest transaction on mydb.example.com:3306
      MySQL thread id is 19, client is 10.0.0.5:51740, running for 08:00:12
      Warning: Permanently added '10.0.0.5' (RSA) to the list of known hosts.
      Process 1833 attached - interrupt to quit
      • Where’s the thing!?
      gdb -ex "set pagination 0" -ex "bt" \
          -ex "source /path/to/php-src/.gdbinit" \
          -ex "dump_bt executor_globals.current_execute_data" \
          -batch -p "$pid" >> "$output_file"
      • But this is still too manual…
  62. Defense #16.1 – Autorun php-pmp
      $ pt-kill --interval 10 \
          --match-user '^app_rw.*' \
          --busy-time 120s \
          --ignore-command '(Prepare|Execute)' \
          --match-info '^\s*\(?\s*(SELECT|select)' \
          --ignore-info '(.* SQL_NO_CACHE .*|.*&plz_no_kill=1.*)' \
          --idle-time 28800 \
          --victims all --print --kill \
          --log /var/log/qkkill.log \
          --execute-command '(mail -s "Killed queries on ${HOSTNAME}" "database-peeps@box.com" <<-EOF
      Last 10 killed queries in /var/log/qkkill.log:
      $(tail -10 /var/log/qkkill.log)
      EOF
      )' \
          --daemonize --pid /var/run/qkkill.pid
      • Change this to…
  63. Defense #16.1 – Autorun php-pmp
      $ pt-kill --interval 10 \
          --match-user '^app_rw.*' \
          --busy-time 120s \
          --ignore-command '(Prepare|Execute)' \
          --match-info '^\s*\(?\s*(SELECT|select)' \
          --ignore-info '(.* SQL_NO_CACHE .*|.*&plz_no_kill=1.*)' \
          --idle-time 28800 \
          --victims all --print --kill \
          --log /var/log/qkkill.log \
          --execute-command '(mail -s "Killed queries on ${HOSTNAME}" "database-peeps@box.com" <<-EOF
      Last 10 killed queries in /var/log/qkkill.log:
      $(tail -10 /var/log/qkkill.log)
      EOF
      )' \
          --daemonize --pid /var/run/qkkill.pid
      • Extract source client info • Discover remote pid • Run php-pmp on client pid • Send to raingauge!
  65. pt-stalk --daemonize \
          --variable=Threads_running \
          --threshold=150 \
          --pid=/usr/var/run/ptstalk.pid \
          --cycles=2 --sleep=60 \
          --exec-after-sleep=/usr/bin/raingauge_package_and_send.sh \
          --dest=/tmp/raingauge
      • What about Threads_Created? • What about CPU?
  66. Defense #17 – Multivariable pt-stalk
      pt-stalk --daemonize \
          --function=/usr/bin/raingauge_triggers.sh \
          --variable=test_triggered \
          --threshold=0 \
          --pid=/usr/var/run/ptstalk.pid \
          --cycles=2 --sleep=60 \
          --exec-after-sleep=/usr/bin/raingauge_package_and_send.sh \
          --dest=/tmp/raingauge
  67. Defense #17 – Multivariable pt-stalk
      pt-stalk-raingauge --daemonize \
          --function=/usr/bin/raingauge_triggers.sh \
          --variable=test_triggered \
          --threshold=0 \
          --pid=/usr/var/run/ptstalk.pid \
          --cycles=2 --sleep=60 \
          --exec-after-sleep=/usr/bin/raingauge_package_and_send.sh \
          --dest=/tmp/raingauge
  68. $user = 'app'; $pass = 'meow'; • Time to rotate passwords!
  69. Defense #18 – Automated User Management • In all seriousness…why? • Security exposures • Accidental deletions by shared accounts • Setup quotas! • Also… • Don’t use your root account • Don’t give nagios the SUPER privilege…
  70. Defense #18 – Automated User Management • In all seriousness…why? • Security exposures • Accidental deletions by shared accounts • Setup quotas! • Also… • Don’t use your root account • Don’t give nagios the SUPER privilege…
  71. Backups • How many of you will take backups? • How many of you actually take backups?
  72. Defense #19 – Backups • mysqldump --all-the-things --seriously --all-of-them • CHANGE MASTER TO MASTER_DELAY = 43200;
  73. Defense #19 – Backups
  74. Defense #19 – Backups
  75. Defense #20 – Prevent burnout • Relax • Take vacations • Hire more DBAs
  76. Defense #20 – Prevent burnout • But I can’t find any DBAs to hire! • Things will fall apart without me!
  77. Defense #20 – Prevent burnout • But I can’t find any DBAs to hire! • Things will fall apart without me!
  78. Defenses • Defense #1 – Query Comments • Defense #2 – Alerting on the Database • Defense #3 – Dashboards • Defense #4 – User and Table Statistics • Defense #5 – Box Raingauge • Defense #6 – Monitor the Queries • Defense #7 – Sharding • Defense #8 – Innotop • Defense #9 – Query Killer • Defense #10 – Transaction Monitor (and killer) • Defense #11 – Dynamic Query Killer Thresholds • Defense #12 – Healthchecking • Defense #13 – Data Layer as a Service • Defense #14 – Lock Monitor (and killer) • Defense #15 – strace Errant Connections • Defense #16 – pt-pmp php-pmp • Defense #17 – Multivariable pt-stalk • Defense #18 – Automated User Management • Defense #19 – Backups • Defense #20 – Prevent Burnout
  79.
  80. Come to the #DBHangOps BoF @ 6pm in room 203! • Email: geoff@box.com • Twitter: @geodbz • Tech Blog: www.box.com/blog/engineering • GitHub: github.com/box • ?
  81. Links of Supreme Interest! • https://www.percona.com/live/mysql-conference-2014/sessions/box-weather-station-open-source-tools-performance-and-forensic-monitoring • https://www.percona.com/live/mysql-conference-2014/sessions/mysql-devops-outbrain • http://www.percona.com/live/mysql-conference-2012/sessions/etsy-shard-architecture-starts-s-and-ends-hard • http://www.infoq.com/presentations/box-mysql-sharding
  82. Image Credit / Source • http://ak-hdl.buzzfed.com/static/2014-06/19/12/enhanced/webdr06/anigif_enhanced-26665-1403194377-3.gif • http://31.media.tumblr.com/5f6ffc7bf7f456ba0442596d07faa925/tumblr_n4zmymPU191s0teago2_r1_400.gif • https://www.flickr.com/photos/patrick_verstappen/8421238929 • https://www.flickr.com/photos/apocalust/5092520561 • http://38.media.tumblr.com/e1a7a9881573597bbe97ade824f578a3/tumblr_ngcsl73nLt1u4i1dio4_1280.gif • http://stream1.gifsoup.com/view3/4891599/lion-king-stampede-o.gif • http://24.media.tumblr.com/tumblr_lv5jvwWTyT1r4hwmoo1_500.gif • http://cl.jroo.me/z3/2/R/v/e/a.baa-the-hacker-cat.jpg • https://www.flickr.com/photos/daryl_mitchell/1199598508 • https://www.flickr.com/photos/patrice-photographiste/14909768574
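
Defense #11 (slide 40) adds three settings to the query killer: a percentage of max connections at which dynamic killing starts, and minimum ages below which busy or idle queries are never killed. The slides do not say how the threshold moves between those points, so the Python sketch below assumes a simple linear interpolation. The function name and its parameters are hypothetical and only mirror the values shown on the slide (120 s busy time, 10 s minimum, 60%); this is not how the tool at Box is known to work.

```python
def dynamic_busy_threshold(connections, max_connections,
                           base_seconds=120.0, min_seconds=10.0,
                           start_pct=60.0):
    """Return the busy-time kill threshold for the current connection load.

    Assumption: below start_pct of max_connections the normal threshold
    applies; above it the threshold shrinks linearly, reaching min_seconds
    when the server is at (or beyond) 100% of max_connections.
    """
    if start_pct >= 100.0:
        # Dynamic killing can never start; keep the normal threshold.
        return float(base_seconds)

    usage_pct = 100.0 * connections / max_connections
    if usage_pct <= start_pct:
        return float(base_seconds)

    # Fraction of the way from start_pct to 100%, capped at 1.0.
    fraction = min((usage_pct - start_pct) / (100.0 - start_pct), 1.0)
    return base_seconds - fraction * (base_seconds - min_seconds)


if __name__ == "__main__":
    for used in (300, 800, 1000):
        print(used, dynamic_busy_threshold(used, 1000))
```

With these assumed values, a server at 30% of its connection limit keeps the 120 second threshold, while a server at 80% would kill matching queries after 65 seconds.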

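Defense #12 (slides 45 to 47) shows clients reading a health document before they open connections, and backing off when a name is marked "OFF". A minimal Python sketch of that client-side check follows. It assumes that `node_age` is the age of the health record and that a record older than `max_allowed_age` should be treated as unhealthy; the slides show the fields but do not define them, and the function name is hypothetical.

```python
import json


def usable_names(payload):
    """Map each database name in the health document to True/False.

    Assumption: a name is usable only when its state is "ON" and its
    node_age does not exceed max_allowed_age.
    """
    doc = json.loads(payload)
    max_age = doc["max_allowed_age"]
    return {
        name: node["state"] == "ON" and node["node_age"] <= max_age
        for name, node in doc["names"].items()
    }


if __name__ == "__main__":
    # Health document from slide 46, where "global" has been turned off.
    slide_46 = """
    {
      "names": {
        "global": {"node_age": 106, "state": "OFF"},
        "shard1": {"node_age": 276, "state": "ON"}
      },
      "max_allowed_age": 6000
    }
    """
    print(usable_names(slide_46))
```

Checking the document first means a client that sees `False` for a name can skip the connection attempt entirely, which is the behaviour the "NO, PLZ LEAVE" slide illustrates.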
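
Defense #14 (slides 52 and 53) lists transactions by age and then kills thread 419, the long-running transaction that the `LOCK WAIT` rows are queued behind. The Python sketch below shows one way the choice of victim could be automated from rows shaped like that output. The heuristic (oldest `RUNNING` transaction on a table that has waiters) and all names are assumptions for illustration, not the logic of the Box tool. Note also that `INFORMATION_SCHEMA.INNODB_LOCKS` was removed in MySQL 8.0, where lock data lives in `performance_schema.data_locks`.

```python
from typing import Iterable, NamedTuple, Optional


class TrxRow(NamedTuple):
    """One row of the lock monitor output (only the columns shown on the slide)."""
    thread_id: int
    state: str
    lock_table: str
    age: int


def pick_blocker(rows: Iterable[TrxRow], min_waiters: int = 1) -> Optional[TrxRow]:
    """Pick the oldest RUNNING transaction on a table with waiting transactions."""
    rows = list(rows)

    # Count LOCK WAIT transactions per table.
    waiters = {}
    for row in rows:
        if row.state == "LOCK WAIT":
            waiters[row.lock_table] = waiters.get(row.lock_table, 0) + 1

    # Candidates are RUNNING transactions on tables with enough waiters.
    candidates = [
        row for row in rows
        if row.state == "RUNNING" and waiters.get(row.lock_table, 0) >= min_waiters
    ]
    return max(candidates, key=lambda row: row.age, default=None)


if __name__ == "__main__":
    slide_rows = [
        TrxRow(419, "RUNNING", "`sakila`.`rental`", 544),
        TrxRow(581, "LOCK WAIT", "`sakila`.`rental`", 49),
        TrxRow(631, "LOCK WAIT", "`sakila`.`rental`", 49),
        TrxRow(573, "LOCK WAIT", "`sakila`.`rental`", 49),
    ]
    blocker = pick_blocker(slide_rows)
    if blocker is not None:
        print(f"KILL {blocker.thread_id};")
```

A real monitor would likely also log the blocker's source host and last query before killing it, as Defense #10 suggests for long transactions.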