2013-10-18 
MONITOR 
SOME OF THE THINGS
(Book cover: High Performance MySQL, 3rd Edition. Optimization, Backups, Replication, and more. Covers Version 5.5. Baron Schwartz, Peter Zaitsev & Vadim Tkachenko)
ME 
• Cofounder of @VividCortex 
• Author of High Performance MySQL 
• @xaprb on Twitter 
• baron@vividcortex.com 
• http://www.linkedin.com/in/xaprb
RANT, RECAPPED 
• The sky is falling 
• Tools drive processes, and we need better tools designed for methods 
• Pay attention to CAPS (Capacity, Availability, Performance, Scalability) 
• Monitoring tools need to be a lot smarter 
• Measure and monitor “work getting done”
HARD CAPACITY 
• Disk volume 
• CPU Cycles 
• max_connections 
• File descriptors, sockets, TCP port numbers, etc. 
• %used, absolute quantity available
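The two measurements named above, %used and absolute quantity available, apply to any hard limit. A minimal sketch, assuming the caller already has the current usage and the limit; the function name and the sample numbers are hypothetical:

```python
def capacity(used, limit):
    """Return (percent_used, absolute_remaining) for a hard limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    pct_used = 100.0 * used / limit
    return pct_used, limit - used


# Hypothetical values: 450 connections in use, max_connections = 500.
pct, remaining = capacity(used=450, limit=500)
print(f"{pct:.1f}% used, {remaining} connections available")
```

Reporting both numbers matters because 90% of a small limit and 90% of a large one leave very different headroom.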
SOFT CAPACITY 
• Neil Gunther’s Universal Scalability Law 
• %used, absolute quantity available 
• Throughput, concurrency, errors
AVAILABILITY 
• Availability is absence of downtime 
• %used, absolute quantity available 
• Throughput, concurrency, errors 
• MTBF, MTTR, MTTD, %availability
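One common way to turn MTBF and MTTR into %availability is MTBF / (MTBF + MTTR). A minimal sketch, assuming both are in the same unit and that MTTR here already includes the time to detect (MTTD); the sample numbers are hypothetical:

```python
def availability_pct(mtbf, mttr):
    """Percent availability from mean time between failures and mean time to recover."""
    if mtbf < 0 or mttr < 0 or mtbf + mttr == 0:
        raise ValueError("mtbf and mttr must be non-negative and not both zero")
    return 100.0 * mtbf / (mtbf + mttr)


# Hypothetical values: a failure every 99 hours on average, 1 hour to recover.
print(f"{availability_pct(99, 1):.2f}% available")
```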
TASK PERFORMANCE 
• Task performance is consistently fast response time. 
• Measure an SLA in percentile response time per task, over observation intervals 
• %used, absolute quantity available 
• Throughput, concurrency, errors 
• MTBF, MTTR, MTTD, %availability 
• Response time, 95% response time
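A 95% response time can be computed from the samples collected in one observation interval. A minimal sketch using the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values); the sample latencies are made up:

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples in this interval")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds for one task over one interval.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 14]
print("median:", percentile(latencies_ms, 50))
print("95th percentile:", percentile(latencies_ms, 95))
```

The single slow request barely moves the median but dominates the 95th percentile, which is why the SLA is stated as a percentile.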
RESOURCE PERFORMANCE 
• Resource performance is ability to run tasks consistently fast. 
• %used, absolute quantity available 
• Throughput, concurrency, errors 
• MTBF, MTTR, MTTD, %availability 
• Response time, 95% response time 
• Throughput, concurrency, busy time, total response time, backlog/queue
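The resource metrics listed above are related to each other, so a few raw counters per interval are enough to derive the rest. A minimal sketch, assuming the counters are completed tasks, busy time, and summed response time over an interval of known length; mean concurrency follows from Little's law (concurrency = throughput × mean response time). All names and numbers are hypothetical:

```python
def resource_metrics(completed, busy_time, total_response_time, interval):
    """Derive per-interval resource metrics from raw counters (times in seconds)."""
    if interval <= 0 or completed <= 0:
        raise ValueError("interval and completed must be positive")
    return {
        "throughput": completed / interval,                  # tasks per second
        "utilization": busy_time / interval,                 # fraction of time busy
        "mean_response": total_response_time / completed,    # seconds per task
        "concurrency": total_response_time / interval,       # Little's law
    }


# Hypothetical counters for a 10-second interval.
print(resource_metrics(completed=1000, busy_time=6.0,
                       total_response_time=20.0, interval=10.0))
```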
SCALABILITY 
• Universal Scalability Law again 
• %used, absolute quantity available 
• Throughput, concurrency, errors 
• MTBF, MTTR, MTTD, %availability 
• Response time, 95% response time 
• Throughput, concurrency, busy time, total response time, backlog/queue
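Gunther's Universal Scalability Law models throughput at concurrency N as X(N) = λN / (1 + σ(N − 1) + κN(N − 1)), where σ is the contention penalty and κ the coherency penalty. A minimal sketch; the coefficient values below are invented for illustration and would normally be fitted to measured throughput and concurrency:

```python
import math


def usl_throughput(n, lam, sigma, kappa):
    """Predicted throughput at concurrency n under the Universal Scalability Law."""
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_concurrency(sigma, kappa):
    """Concurrency at which predicted throughput peaks (requires kappa > 0)."""
    return math.sqrt((1 - sigma) / kappa)


# Hypothetical coefficients: 1000 queries/s at N=1, 5% contention, 0.01% coherency.
for n in (1, 8, 32, 128):
    print(n, round(usl_throughput(n, lam=1000, sigma=0.05, kappa=0.0001)))
print("peak near N =", round(usl_peak_concurrency(0.05, 0.0001)))
```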
STALL DETECTION 
• Overloaded or underperforming? 
• %used, absolute quantity available 
• Throughput, concurrency, errors 
• MTBF, MTTR, MTTD, %availability 
• Response time, 95% response time 
• Throughput, concurrency, busy time, total response time, backlog/queue 
• Utilization, saturation, errors, sources of load/demand
GIT ‘ER DONE 
MONITOR WORK AND 
RESOURCES
WHAT NOT TO DO 
• Don’t use top-N lists from Google 
• Don’t just do what’s included in some Nagios plugin
№1 
TOP 10 LIST 
1. MySQL availability 
2. Presence of insecure users and databases 
3. Aborted connects 
4. Error log 
5. Deadlocks 
6. Change in server configuration 
7. Slow query log 
8. Slave lag 
9. Percentage of maximum allowed connections 
10. Percentage of full table scans
№2 
TOP 10 LIST 
1. Threads_connected 
2. Created_tmp_disk_tables 
3. Handler_read_first 
4. Innodb_buffer_pool_wait_free 
5. Key_reads 
6. Max_used_connections 
7. Open_tables 
8. Select_full_join 
9. Slow_queries 
10. Uptime
№1 
PLUGIN 
1. threadcache-hitrate (Hit rate of the thread-cache) 
2. slave-io-running (Slave io running: Yes) 
3. slave-sql-running (Slave sql running: Yes) 
4. qcache-hitrate (Query cache hitrate) 
5. qcache-lowmem-prunes (Query cache entries pruned because of low memory) 
6. keycache-hitrate (MyISAM key cache hitrate) 
7. bufferpool-hitrate (InnoDB buffer pool hitrate) 
8. bufferpool-wait-free (InnoDB buffer pool waits for clean page available) 
9. log-waits (InnoDB log waits because of a too small log buffer) 
10. tablecache-hitrate (Table cache hitrate) 
11. table-lock-contention (Table lock contention) 
12. index-usage (Usage of indices) 
13. tmp-disk-tables (Percent of temp tables created on disk) 
14. long-running-procs (long running processes)
№2 
PLUGIN 
1. connection-time 
2. uptime 
3. threads-connected 
4. threadcache-hitrate 
5. q[uery]cache-hitrate 
6. q[uery]cache-lowmem-prunes 
7. [myisam-]keycache-hitrate 
8. [innodb-]bufferpool-hitrate 
9. [innodb-]bufferpool-wait-free 
10. [innodb-]log-waits 
11. tablecache-hitrate 
12. table-lock-contention 
13. index-usage 
14. tmp-disk-tables 
15. slow-queries 
16. long-running-procs 
17. slave-lag 
18. slave-io-running 
19. slave-sql-running 
20. sql 
21. open-files 
22. encode 
23. cluster-ndb-running
№3 
PLUGIN
HTTP://WWW.FLICKR.COM/PHOTOS/NASAMARSHALL/5926864640/ 
SURFACE AREA
DUPLICATE SIGNALS 
• Queries 
• Com_admin_commands 
• Com_assign_to_keycache 
• Com_alter_db 
• Com_alter_db_upgrade 
• Com_alter_event 
• Com_alter_function 
• Com_alter_procedure 
• Com_alter_server 
• Com_alter_table 
• Com_alter_tablespace 
• Com_alter_user 
• Com_analyze 
• Com_begin 
• Com_binlog 
• Com_ad_nauseum
DESIRABLE METRICS 
• %used, absolute quantity available 
• Throughput, concurrency, errors 
• MTBF, MTTR, MTTD, %availability 
• Response time, 95% response time 
• Throughput, concurrency, busy time, total response time, backlog/queue 
• Utilization, saturation, errors, sources of load/demand
Desirable Easy
IRRELEVANT 
EXAMPLE PLEASE?
RESOURCE LIMITS 
• Threads_connected near max_connections? 
• %table cache used? 
• Open file handles? 
• Long-running queries/transactions?
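The first question above, whether Threads_connected is near max_connections, maps naturally onto a threshold check. A minimal sketch, assuming the two values have already been read from the server; the 80% and 95% thresholds are arbitrary assumptions, and the return codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL):

```python
OK, WARNING, CRITICAL = 0, 1, 2


def check_connections(threads_connected, max_connections,
                      warn_pct=80.0, crit_pct=95.0):
    """Return (status, message) for connection usage against a hard limit."""
    pct = 100.0 * threads_connected / max_connections
    if pct >= crit_pct:
        status, label = CRITICAL, "CRIT"
    elif pct >= warn_pct:
        status, label = WARNING, "WARN"
    else:
        status, label = OK, "OK"
    message = (f"{label} Threads_connected {threads_connected}/"
               f"{max_connections} ({pct:.0f}%)")
    return status, message


# Hypothetical reading: 490 of 500 connections in use.
print(check_connections(490, 500))
```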
ERRORS 
• Deadlocks? 
• Aborted connects?
AVAILABILITY 
• Ability to connect and run a query? 
• Uptime is small? 
• Replication is running?
PERFORMANCE 
• You can get throughput (Queries) and concurrency (Threads_running) from MySQL 
• But in a Nagios check, no context to know whether they’re good or bad 
• You generally can’t get response time, busy time, utilization, backlog, etc 
• You can aggregate thread states, thread times, users, databases, query abstracts...
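Aggregating thread states, times, and users can be sketched as a simple roll-up. This assumes the rows have already been fetched as dictionaries keyed by the SHOW PROCESSLIST column names (User, Time, State); the sample rows are invented:

```python
from collections import Counter, defaultdict


def summarize_threads(rows):
    """Count threads and sum their Time per State, and count threads per User."""
    count_by_state = Counter()
    time_by_state = defaultdict(int)
    count_by_user = Counter()
    for row in rows:
        state = row["State"] or "(none)"
        count_by_state[state] += 1
        time_by_state[state] += row["Time"]
        count_by_user[row["User"]] += 1
    return count_by_state, dict(time_by_state), count_by_user


# Hypothetical rows shaped like SHOW PROCESSLIST output.
rows = [
    {"User": "app", "Time": 30, "State": "Waiting for table metadata lock"},
    {"User": "app", "Time": 45, "State": "Waiting for table metadata lock"},
    {"User": "report", "Time": 2, "State": "Sending data"},
    {"User": "app", "Time": 0, "State": ""},
]
states, times, users = summarize_threads(rows)
print(states.most_common(1), times, dict(users))
```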
NAGIOS IS BEST AT 
LIVING IN THE 
MOMENT
THOU SHALT NOT 
• Cache hit ratios 
• Thread cache hit ratio 
• Buffer pool cache hit ratio 
• Table cache hit ratio 
• Key cache hit ratio 
• Query cache hit ratio 
• Rates of “bad” queries 
• % temp tables on disk 
• % full table scans 
• % slow queries 
• Unfixable things 
• Replication delay
WHY NOT? 
• Those are properties of the workload and application 
• They are not conditions to alert/warn about 
• They are not fixable / actionable in the service
ALERTS ARE 
BETTER TOGETHER
QUESTION: 
WHAT IS BETTER?
№1 ALERT!!!!! 
Disk CRIT 100% /dev/sda2
№2 ALERT!!!!! 
Replication CRIT Slave I/O Thread No
№3 ALERT!!!!! 
Replication CRIT Slave SQL Thread No
№4 ALERT!!!!! 
Replication CRIT Seconds_Behind_Master NULL
№5 ALERT!!!!! 
MySQL CRIT oldest transaction: 86400 seconds
- OR -
№1 ALERT!!!!! 
CRIT 
* Disk /dev/sda2 full 
* Replication stopped 
* Oldest transaction 86400 seconds 
* 4999 threads in status “Waiting for table metadata lock”
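Grouping related findings into one alert, as in the second example above, can be sketched as a small aggregation step. This assumes each individual check yields a (status, message) pair with status 0 for OK, 1 for warning, and 2 for critical; the function name and the sample inputs are hypothetical:

```python
LABELS = {0: "OK", 1: "WARN", 2: "CRIT"}


def combine(results):
    """Merge (status, message) pairs into one alert carrying the worst status."""
    problems = [(status, msg) for status, msg in results if status != 0]
    if not problems:
        return 0, "OK"
    worst = max(status for status, _ in problems)
    lines = [LABELS[worst]] + [f"* {msg}" for _, msg in problems]
    return worst, "\n".join(lines)


# Hypothetical results from individual checks on one server.
status, text = combine([
    (2, "Disk /dev/sda2 full"),
    (2, "Replication stopped"),
    (0, "Connections OK"),
    (2, "Oldest transaction 86400 seconds"),
])
print(text)
```

One message with all the symptoms lets the person paged see that the failures are probably one incident rather than several.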
HOLLER AT ME 
QUESTIONS? 
@XAPRB / BARON@VIVIDCORTEX.COM
RESOURCES 
• Chapter 3 of High Performance MySQL, 3rd Edition 
• Percona White Papers 
• Causes of Downtime in Production MySQL Servers 
• Preventing MySQL Emergencies 
• Goal-Driven Performance Optimization 
• Forecasting MySQL Scalability with the Universal Scalability Law 
• Method R: Optimizing Oracle Performance, Cary Millsap 
• The Goal, Eli Goldratt 
• The USE Method (Brendan Gregg) & his new book 
• Guerrilla Capacity Planning, Neil J. Gunther 
• Fundamental Performance & Scalability Instrumentation

More Related Content

What's hot

5 things you didn't know nginx could do velocity
5 things you didn't know nginx could do   velocity5 things you didn't know nginx could do   velocity
5 things you didn't know nginx could do velocity
sarahnovotny
 
How to Fail at Kafka
How to Fail at KafkaHow to Fail at Kafka
How to Fail at Kafka
confluent
 
Nginx - Tips and Tricks.
Nginx - Tips and Tricks.Nginx - Tips and Tricks.
Nginx - Tips and Tricks.Harish S
 
Redis acl
Redis aclRedis acl
Redis acl
DaeMyung Kang
 
Puppet Development Workflow
Puppet Development WorkflowPuppet Development Workflow
Puppet Development Workflow
Jeffery Smith
 
Steamlining your puppet development workflow
Steamlining your puppet development workflowSteamlining your puppet development workflow
Steamlining your puppet development workflow
Tomas Doran
 
NGINX 101 - now with more Docker
NGINX 101 - now with more DockerNGINX 101 - now with more Docker
NGINX 101 - now with more Docker
Sarah Novotny
 
SaltConf14 - Oz Akan, Rackspace - Deploying OpenStack Marconi with SaltStack
SaltConf14 - Oz Akan, Rackspace - Deploying OpenStack Marconi with SaltStackSaltConf14 - Oz Akan, Rackspace - Deploying OpenStack Marconi with SaltStack
SaltConf14 - Oz Akan, Rackspace - Deploying OpenStack Marconi with SaltStack
SaltStack
 
Load Balancing with Nginx
Load Balancing with NginxLoad Balancing with Nginx
Load Balancing with Nginx
Marian Marinov
 
Steve Singer - Managing PostgreSQL with Puppet @ Postgres Open
Steve Singer - Managing PostgreSQL with Puppet @ Postgres OpenSteve Singer - Managing PostgreSQL with Puppet @ Postgres Open
Steve Singer - Managing PostgreSQL with Puppet @ Postgres OpenPostgresOpen
 
under the covers -- chef in 20 minutes or less
under the covers -- chef in 20 minutes or lessunder the covers -- chef in 20 minutes or less
under the covers -- chef in 20 minutes or less
sarahnovotny
 
London devops logging
London devops loggingLondon devops logging
London devops loggingTomas Doran
 
Integrated Cache on Netscaler
Integrated Cache on NetscalerIntegrated Cache on Netscaler
Integrated Cache on Netscaler
Mark Hillick
 
Sensu
SensuSensu
Extending functionality in nginx, with modules!
Extending functionality in nginx, with modules!Extending functionality in nginx, with modules!
Extending functionality in nginx, with modules!
Trygve Vea
 
Incrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern AutomationIncrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern Automation
Sean Chittenden
 
How To Set Up SQL Load Balancing with HAProxy - Slides
How To Set Up SQL Load Balancing with HAProxy - SlidesHow To Set Up SQL Load Balancing with HAProxy - Slides
How To Set Up SQL Load Balancing with HAProxy - Slides
Severalnines
 
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
Amazon Web Services
 
Load Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - SlidesLoad Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - Slides
Severalnines
 
Mitigating Security Threats with Fastly - Joe Williams at Fastly Altitude 2015
Mitigating Security Threats with Fastly - Joe Williams at Fastly Altitude 2015Mitigating Security Threats with Fastly - Joe Williams at Fastly Altitude 2015
Mitigating Security Threats with Fastly - Joe Williams at Fastly Altitude 2015
Fastly
 

What's hot (20)

5 things you didn't know nginx could do velocity
5 things you didn't know nginx could do   velocity5 things you didn't know nginx could do   velocity
5 things you didn't know nginx could do velocity
 
How to Fail at Kafka
How to Fail at KafkaHow to Fail at Kafka
How to Fail at Kafka
 
Nginx - Tips and Tricks.
Nginx - Tips and Tricks.Nginx - Tips and Tricks.
Nginx - Tips and Tricks.
 
Redis acl
Redis aclRedis acl
Redis acl
 
Puppet Development Workflow
Puppet Development WorkflowPuppet Development Workflow
Puppet Development Workflow
 
Steamlining your puppet development workflow
Steamlining your puppet development workflowSteamlining your puppet development workflow
Steamlining your puppet development workflow
 
NGINX 101 - now with more Docker
NGINX 101 - now with more DockerNGINX 101 - now with more Docker
NGINX 101 - now with more Docker
 
SaltConf14 - Oz Akan, Rackspace - Deploying OpenStack Marconi with SaltStack
SaltConf14 - Oz Akan, Rackspace - Deploying OpenStack Marconi with SaltStackSaltConf14 - Oz Akan, Rackspace - Deploying OpenStack Marconi with SaltStack
SaltConf14 - Oz Akan, Rackspace - Deploying OpenStack Marconi with SaltStack
 
Load Balancing with Nginx
Load Balancing with NginxLoad Balancing with Nginx
Load Balancing with Nginx
 
Steve Singer - Managing PostgreSQL with Puppet @ Postgres Open
Steve Singer - Managing PostgreSQL with Puppet @ Postgres OpenSteve Singer - Managing PostgreSQL with Puppet @ Postgres Open
Steve Singer - Managing PostgreSQL with Puppet @ Postgres Open
 
under the covers -- chef in 20 minutes or less
under the covers -- chef in 20 minutes or lessunder the covers -- chef in 20 minutes or less
under the covers -- chef in 20 minutes or less
 
London devops logging
London devops loggingLondon devops logging
London devops logging
 
Integrated Cache on Netscaler
Integrated Cache on NetscalerIntegrated Cache on Netscaler
Integrated Cache on Netscaler
 
Sensu
SensuSensu
Sensu
 
Extending functionality in nginx, with modules!
Extending functionality in nginx, with modules!Extending functionality in nginx, with modules!
Extending functionality in nginx, with modules!
 
Incrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern AutomationIncrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern Automation
 
How To Set Up SQL Load Balancing with HAProxy - Slides
How To Set Up SQL Load Balancing with HAProxy - SlidesHow To Set Up SQL Load Balancing with HAProxy - Slides
How To Set Up SQL Load Balancing with HAProxy - Slides
 
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
 
Load Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - SlidesLoad Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - Slides
 
Mitigating Security Threats with Fastly - Joe Williams at Fastly Altitude 2015
Mitigating Security Threats with Fastly - Joe Williams at Fastly Altitude 2015Mitigating Security Threats with Fastly - Joe Williams at Fastly Altitude 2015
Mitigating Security Threats with Fastly - Joe Williams at Fastly Altitude 2015
 

Viewers also liked

Individual fucas
Individual fucasIndividual fucas
Individual fucas
nathanpatalano123
 
REED 729 Seminar in Reading
REED 729 Seminar in ReadingREED 729 Seminar in Reading
REED 729 Seminar in Reading
Towson University
 
Umesh
UmeshUmesh
Umesh
Umesh Mali
 
The five most common causes of rejection of the Brazilian work permit procedure
The five most common causes of rejection of the Brazilian work permit procedureThe five most common causes of rejection of the Brazilian work permit procedure
The five most common causes of rejection of the Brazilian work permit procedure
Rui da Fonseca e Castro
 
Supporting Student Success: UDL and Your Library
Supporting Student Success: UDL and Your LibrarySupporting Student Success: UDL and Your Library
Supporting Student Success: UDL and Your Library
Towson University
 
You are-all-crazy-subjectivaly-speaking-uploaded-1224441527362216-8
You are-all-crazy-subjectivaly-speaking-uploaded-1224441527362216-8You are-all-crazy-subjectivaly-speaking-uploaded-1224441527362216-8
You are-all-crazy-subjectivaly-speaking-uploaded-1224441527362216-8
Manuela Pestana
 
Work Visa for Brazil, Brief Description of the Procedure
Work Visa for Brazil, Brief Description of the ProcedureWork Visa for Brazil, Brief Description of the Procedure
Work Visa for Brazil, Brief Description of the Procedure
Rui da Fonseca e Castro
 
Maximo Performance - A Best Practice Overview Webinar, August 27, 2014
Maximo Performance - A Best Practice Overview Webinar, August 27, 2014Maximo Performance - A Best Practice Overview Webinar, August 27, 2014
Maximo Performance - A Best Practice Overview Webinar, August 27, 2014
Reflective Solutions
 
Bài Tập Hóa
Bài Tập HóaBài Tập Hóa
Bài Tập Hóa
Tú Nguyễn
 
Dairy industry
Dairy industryDairy industry
Dairy industry
Abhi Varshney
 
I am a person of ...
I am a person of ...I am a person of ...
I am a person of ...
ms451711
 
Akif instraction
Akif instractionAkif instraction
Akif instraction
Akif Durna
 
Making big data small
Making big data smallMaking big data small
Making big data small
andertech
 
Ch12 pp
Ch12 ppCh12 pp
Ch12 pp
ms451711
 
Ch7 delivering speeches (modes of delivery)
Ch7 delivering speeches (modes of delivery)Ch7 delivering speeches (modes of delivery)
Ch7 delivering speeches (modes of delivery)
ms451711
 
Chapter 2 3
Chapter 2 3Chapter 2 3
Chapter 2 3
ms451711
 

Viewers also liked (16)

Individual fucas
Individual fucasIndividual fucas
Individual fucas
 
REED 729 Seminar in Reading
REED 729 Seminar in ReadingREED 729 Seminar in Reading
REED 729 Seminar in Reading
 
Umesh
UmeshUmesh
Umesh
 
The five most common causes of rejection of the Brazilian work permit procedure
The five most common causes of rejection of the Brazilian work permit procedureThe five most common causes of rejection of the Brazilian work permit procedure
The five most common causes of rejection of the Brazilian work permit procedure
 
Supporting Student Success: UDL and Your Library
Supporting Student Success: UDL and Your LibrarySupporting Student Success: UDL and Your Library
Supporting Student Success: UDL and Your Library
 
You are-all-crazy-subjectivaly-speaking-uploaded-1224441527362216-8
You are-all-crazy-subjectivaly-speaking-uploaded-1224441527362216-8You are-all-crazy-subjectivaly-speaking-uploaded-1224441527362216-8
You are-all-crazy-subjectivaly-speaking-uploaded-1224441527362216-8
 
Work Visa for Brazil, Brief Description of the Procedure
Work Visa for Brazil, Brief Description of the ProcedureWork Visa for Brazil, Brief Description of the Procedure
Work Visa for Brazil, Brief Description of the Procedure
 
Maximo Performance - A Best Practice Overview Webinar, August 27, 2014
Maximo Performance - A Best Practice Overview Webinar, August 27, 2014Maximo Performance - A Best Practice Overview Webinar, August 27, 2014
Maximo Performance - A Best Practice Overview Webinar, August 27, 2014
 
Bài Tập Hóa
Bài Tập HóaBài Tập Hóa
Bài Tập Hóa
 
Dairy industry
Dairy industryDairy industry
Dairy industry
 
I am a person of ...
I am a person of ...I am a person of ...
I am a person of ...
 
Akif instraction
Akif instractionAkif instraction
Akif instraction
 
Making big data small
Making big data smallMaking big data small
Making big data small
 
Ch12 pp
Ch12 ppCh12 pp
Ch12 pp
 
Ch7 delivering speeches (modes of delivery)
Ch7 delivering speeches (modes of delivery)Ch7 delivering speeches (modes of delivery)
Ch7 delivering speeches (modes of delivery)
 
Chapter 2 3
Chapter 2 3Chapter 2 3
Chapter 2 3
 

Similar to Monitor some of the things

Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
srisatish ambati
 
KoprowskiT - SQLBITS X - 2am a disaster just began
KoprowskiT - SQLBITS X - 2am a disaster just beganKoprowskiT - SQLBITS X - 2am a disaster just began
KoprowskiT - SQLBITS X - 2am a disaster just beganTobias Koprowski
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
DataStax Academy
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
DataStax Academy
 
JavaOne 2010: Top 10 Causes for Java Issues in Production and What to Do When...
JavaOne 2010: Top 10 Causes for Java Issues in Production and What to Do When...JavaOne 2010: Top 10 Causes for Java Issues in Production and What to Do When...
JavaOne 2010: Top 10 Causes for Java Issues in Production and What to Do When...
srisatish ambati
 
Alfresco tuning part1
Alfresco tuning part1Alfresco tuning part1
Alfresco tuning part1
Luis Cabaceira
 
Alfresco tuning part1
Alfresco tuning part1Alfresco tuning part1
Alfresco tuning part1
Luis Cabaceira
 
Right-Sizing your SQL Server Virtual Machine
Right-Sizing your SQL Server Virtual MachineRight-Sizing your SQL Server Virtual Machine
Right-Sizing your SQL Server Virtual Machine
heraflux
 
MySQL Performance Tuning at COSCUP 2014
MySQL Performance Tuning at COSCUP 2014MySQL Performance Tuning at COSCUP 2014
MySQL Performance Tuning at COSCUP 2014
Ryusuke Kajiyama
 
Ensuring Consistency in a Replicated World
Ensuring Consistency in a Replicated WorldEnsuring Consistency in a Replicated World
Ensuring Consistency in a Replicated World
Yelp Engineering
 
Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)
Jon Haddad
 
Advanced Operations
Advanced OperationsAdvanced Operations
Advanced Operations
DataStax Academy
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
DataStax Academy
 
Cassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionCassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in Production
DataStax Academy
 
Cassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in ProductionCassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in Production
DataStax Academy
 
Analyzing and Interpreting AWR
Analyzing and Interpreting AWRAnalyzing and Interpreting AWR
Analyzing and Interpreting AWR
pasalapudi
 
Building an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen TingBuilding an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen Ting
jaxconf
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - Cassandra
Jon Haddad
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Cloudera, Inc.
 

Similar to Monitor some of the things (20)

Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
 
KoprowskiT - SQLBITS X - 2am a disaster just began
KoprowskiT - SQLBITS X - 2am a disaster just beganKoprowskiT - SQLBITS X - 2am a disaster just began
KoprowskiT - SQLBITS X - 2am a disaster just began
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
 
JavaOne 2010: Top 10 Causes for Java Issues in Production and What to Do When...
JavaOne 2010: Top 10 Causes for Java Issues in Production and What to Do When...JavaOne 2010: Top 10 Causes for Java Issues in Production and What to Do When...
JavaOne 2010: Top 10 Causes for Java Issues in Production and What to Do When...
 
Alfresco tuning part1
Alfresco tuning part1Alfresco tuning part1
Alfresco tuning part1
 
Alfresco tuning part1
Alfresco tuning part1Alfresco tuning part1
Alfresco tuning part1
 
Right-Sizing your SQL Server Virtual Machine
Right-Sizing your SQL Server Virtual MachineRight-Sizing your SQL Server Virtual Machine
Right-Sizing your SQL Server Virtual Machine
 
MySQL Performance Tuning at COSCUP 2014
MySQL Performance Tuning at COSCUP 2014MySQL Performance Tuning at COSCUP 2014
MySQL Performance Tuning at COSCUP 2014
 
Ensuring Consistency in a Replicated World
Ensuring Consistency in a Replicated WorldEnsuring Consistency in a Replicated World
Ensuring Consistency in a Replicated World
 
Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)
 
Advanced Operations
Advanced OperationsAdvanced Operations
Advanced Operations
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
 
Cassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionCassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in Production
 
Cassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in ProductionCassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in Production
 
Analyzing and Interpreting AWR
Analyzing and Interpreting AWRAnalyzing and Interpreting AWR
Analyzing and Interpreting AWR
 
Building an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen TingBuilding an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen Ting
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - Cassandra
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
 

Recently uploaded

Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
Monitor some of the things

  • 1. 2013-10-18 MONITOR SOME OF THE THINGS
  • 2. ME • Cofounder of @VividCortex • Author of High Performance MySQL • @xaprb on Twitter • baron@vividcortex.com • http://www.linkedin.com/in/xaprb • [Book cover: High Performance MySQL, 3rd Edition, Covers Version 5.5. Optimization, Backups, Replication, and more. Baron Schwartz, Peter Zaitsev & Vadim Tkachenko]
  • 3. RANT, RECAPPED • The sky is falling • Tools drive processes, and we need better tools designed for methods • Pay attention to CAPS (Capacity, Availability, Performance, Scalability) • Monitoring tools need to be a lot smarter • Measure and monitor “work getting done”
  • 4. HARD CAPACITY • Disk volume • CPU Cycles • max_connections • File descriptors, sockets, TCP port numbers, etc • %used, absolute quantity available
  • 5. SOFT CAPACITY • Neil Gunther’s Universal Scalability Law • %used, absolute quantity available • Throughput, concurrency, errors
  • 6. AVAILABILITY • Availability is absence of downtime • %used, absolute quantity available • Throughput, concurrency, errors • MTBF, MTTR, MTTD, %availability
  • 7. TASK PERFORMANCE • Task performance is consistently fast response time. • Measure an SLA in percentile response time per task, over observation intervals • %used, absolute quantity available • Throughput, concurrency, errors • MTBF, MTTR, MTTD, %availability • Response time, 95% response time
  • 8. RESOURCE PERFORMANCE • Resource performance is ability to run tasks consistently fast. • %used, absolute quantity available • Throughput, concurrency, errors • MTBF, MTTR, MTTD, %availability • Response time, 95% response time • Throughput, concurrency, busy time, total response time, backlog/queue
  • 9. SCALABILITY • Universal Scalability Law again • %used, absolute quantity available • Throughput, concurrency, errors • MTBF, MTTR, MTTD, %availability • Response time, 95% response time • Throughput, concurrency, busy time, total response time, backlog/queue
  • 10. STALL DETECTION • Overloaded or underperforming? • %used, absolute quantity available • Throughput, concurrency, errors • MTBF, MTTR, MTTD, %availability • Response time, 95% response time • Throughput, concurrency, busy time, total response time, backlog/queue • Utilization, saturation, errors, sources of load/demand
  • 11. GIT ‘ER DONE MONITOR WORK AND RESOURCES
  • 12. WHAT NOT TO DO • Don’t use top-N lists from Google • Don’t just do what’s included in some Nagios plugin
  • 13. №1 TOP 10 LIST 1. MySQL availability 2. Presence of insecure users and databases 3. Aborted connects 4. Error log 5. Deadlocks 6. Change in server configuration 7. Slow query log 8. Slave lag 9. Percentage of maximum allowed connections 10. Percentage of full table scans
  • 14. №2 TOP 10 LIST 1. Threads_connected 2. Created_tmp_disk_tables 3. Handler_read_first 4. Innodb_buffer_pool_wait_free 5. Key_reads 6. Max_used_connections 7. Open_tables 8. Select_full_join 9. Slow_queries 10. Uptime
  • 15. №1 PLUGIN 1. threadcache-hitrate (Hit rate of the thread-cache) 2. slave-io-running (Slave io running: Yes) 3. slave-sql-running (Slave sql running: Yes) 4. qcache-hitrate (Query cache hitrate) 5. qcache-lowmem-prunes (Query cache entries pruned because of low memory) 6. keycache-hitrate (MyISAM key cache hitrate) 7. bufferpool-hitrate (InnoDB buffer pool hitrate) 8. bufferpool-wait-free (InnoDB buffer pool waits for clean page available) 9. log-waits (InnoDB log waits because of a too small log buffer) 10. tablecache-hitrate (Table cache hitrate) 11. table-lock-contention (Table lock contention) 12. index-usage (Usage of indices) 13. tmp-disk-tables (Percent of temp tables created on disk) 14. long-running-procs (long running processes)
  • 16. №2 PLUGIN 1. connection-time 2. uptime 3. threads-connected 4. threadcache-hitrate 5. q[uery]cache-hitrate 6. q[uery]cache-lowmem-prunes 7. [myisam-]keycache-hitrate 8. [innodb-]bufferpool-hitrate 9. [innodb-]bufferpool-wait-free 10. [innodb-]log-waits 11. tablecache-hitrate 12. table-lock-contention 13. index-usage 14. tmp-disk-tables 15. slow-queries 16. long-running-procs 17. slave-lag 18. slave-io-running 19. slave-sql-running 20. sql 21. open-files 22. encode 23. cluster-ndb-running
  • 19. DUPLICATE SIGNALS • Queries • Com_admin_commands • Com_assign_to_keycache • Com_alter_db • Com_alter_db_upgrade • Com_alter_event • Com_alter_function • Com_alter_procedure • Com_alter_server • Com_alter_table • Com_alter_tablespace • Com_alter_user • Com_analyze • Com_begin • Com_binlog • Com_ad_nauseum
  • 20. DESIRABLE METRICS • %used, absolute quantity available • Throughput, concurrency, errors • MTBF, MTTR, MTTD, %availability • Response time, 95% response time • Throughput, concurrency, busy time, total response time, backlog/queue • Utilization, saturation, errors, sources of load/demand
  • 24. RESOURCE LIMITS • Threads_connected near max_connections? • %table cache used? • Open file handles? • Long-running queries/transactions?
  • 25. ERRORS • Deadlocks? • Aborted connects?
  • 26. AVAILABILITY • Ability to connect and run a query? • Uptime is small? • Replication is running?
  • 27. PERFORMANCE • You can get throughput (Queries) and concurrency (Threads_running) from MySQL • But in a Nagios check, no context to know whether they’re good or bad • You generally can’t get response time, busy time, utilization, backlog, etc • You can aggregate thread states, thread times, users, databases, query abstracts...
  • 28. NAGIOS IS BEST AT LIVING IN THE MOMENT
  • 29. THOU SHALT NOT • Cache hit ratios • Thread cache hit ratio • Buffer pool cache hit ratio • Table cache hit ratio • Key cache hit ratio • Query cache hit ratio • Rates of “bad” queries • % temp tables on disk • % full table scans • % slow queries • Unfixable things • Replication delay
  • 30. WHY NOT? • Those are properties of the workload and application • They are not conditions to alert/warn about • They are not fixable / actionable in the service
  • 31. ALERTS ARE BETTER TOGETHER
  • 32. QUESTION: WHAT IS BETTER?
  • 33. №1 ALERT!!!!! Disk CRIT 100% /dev/sda2
  • 34. №2 ALERT!!!!! Replication CRIT Slave I/O Thread No
  • 35. №3 ALERT!!!!! Replication CRIT Slave SQL Thread No
  • 36. №4 ALERT!!!!! Replication CRIT Seconds_Behind_Master NULL
  • 37. №5 ALERT!!!!! MySQL CRIT oldest transaction: 86400 seconds
  • 39. №1 ALERT!!!!! CRIT * Disk /dev/sda2 full * Replication stopped * Oldest transaction 86400 seconds * 4999 threads in status “Waiting for table metadata lock”
  • 40. HOLLER AT ME QUESTIONS? @XAPRB / BARON@VIVIDCORTEX.COM
  • 41. RESOURCES • Chapter 3 of High Performance MySQL, 3rd Edition • Percona White Papers: Causes of Downtime in Production MySQL Servers; Preventing MySQL Emergencies; Goal-Driven Performance Optimization; Forecasting MySQL Scalability with the Universal Scalability Law • Method R: Optimizing Oracle Performance, Cary Millsap • The Goal, Eli Goldratt • The USE Method (Brendan Gregg) & his new book • Guerrilla Capacity Planning, Neil J. Gunther • Fundamental Performance & Scalability Instrumentation
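
The "alerts are better together" idea from slides 31 to 39 can be sketched in a few lines of Python. This is only an illustration, not anything shown in the talk: the `snapshot` dictionary, its key names, the two function names, and the thresholds (90% of `max_connections`, a one-hour transaction age) are all assumptions made up for the example. In practice the values would come from whatever already collects disk usage, replication status, and transaction age.

```python
# Sketch only: the snapshot keys, function names, and thresholds are
# hypothetical and chosen for illustration, not taken from the slides.

def collect_findings(snapshot, conn_threshold=0.9, max_trx_age=3600):
    """Return every problem found in one pass, instead of one alert each."""
    findings = []

    # Hard capacity: a full disk.
    if snapshot["disk_used_pct"] >= 100:
        device = snapshot["disk_device"]
        findings.append(f"Disk {device} full")

    # Availability: replication counts as stopped if either thread is down.
    if not (snapshot["slave_io_running"] and snapshot["slave_sql_running"]):
        findings.append("Replication stopped")

    # Resource limits: a long-running transaction.
    age = snapshot["oldest_trx_seconds"]
    if age > max_trx_age:
        findings.append(f"Oldest transaction {age} seconds")

    # Resource limits: Threads_connected near max_connections.
    connected = snapshot["threads_connected"]
    limit = snapshot["max_connections"]
    if connected >= conn_threshold * limit:
        findings.append(f"Threads_connected {connected} of max_connections {limit}")

    return findings


def format_alert(findings):
    """Fold all findings into a single CRIT message, or None if healthy."""
    if not findings:
        return None
    return "CRIT " + " ".join(f"* {item}" for item in findings)


if __name__ == "__main__":
    broken = {
        "disk_device": "/dev/sda2",
        "disk_used_pct": 100,
        "slave_io_running": False,
        "slave_sql_running": False,
        "oldest_trx_seconds": 86400,
        "threads_connected": 10,
        "max_connections": 5000,
    }
    print(format_alert(collect_findings(broken)))
```

The point of the design is that the on-call person receives one message carrying the related symptoms together, rather than five separate pages that have to be correlated by hand.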