Application performance doesn't come easy. How do you find the root cause of performance issues in modern, complex applications when all you have to start with is a complaining user?
In this presentation (mainly in German, but understandable for English speakers) I reprise the fundamentals of troubleshooting and give some new examples of how to tackle issues.
Follow-up presentation to "Performance Trouble Shooting 101 - Schweine, Schlangen und Papierschnitte"
8. Triage
• Determine who needs to fix it
• Starts with an overview and a comparison to "normal" performance
• First-level task (operators)
• First indication of the problem type
• Works best with transactional data
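The "comparison to normal" step of triage can be sketched as a simple threshold check. This is a minimal illustration, not a tool's API; all names and numbers are made up for the example:

```java
// Minimal triage sketch: compare current response times against a recorded
// baseline and flag transactions that deviate too far from "normal".
public class TriageBaseline {
    // Flags a transaction when its current average response time exceeds
    // the baseline by more than the given factor (e.g., 1.5 = 50% slower).
    static boolean deviates(double baselineMs, double currentMs, double factor) {
        return currentMs > baselineMs * factor;
    }

    public static void main(String[] args) {
        // Baseline averages recorded during "normal" operation (illustrative).
        double loginBaseline = 50.0;   // ms
        double searchBaseline = 145.0; // ms

        System.out.println(deviates(loginBaseline, 52.0, 1.5));   // within normal range
        System.out.println(deviates(searchBaseline, 450.0, 1.5)); // clearly degraded
    }
}
```

In practice the baseline comes from historical transactional data, which is why triage works best when that data exists.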
9. [Architecture diagram: business transactions (Login, Search Flight, View Flight Status, Make Reservation) flowing through a distributed landscape of Web 2.0 frontends (browser logic, AJAX, web frameworks), Tomcat, .NET, Weblogic and JBoss application servers, an ESB (Mule, Tibco, AG), MQ, Memcached, Coherence, SQL Server and Oracle databases, big data stores (Hadoop, Cassandra, MongoDB), portals (ATG, Vignette, Sharepoint) and cloud/virtualization platforms (Amazon EC2, Windows Azure, VMWare). Each tier is annotated with its response time (between 1 ms and 310 ms) and its release version.]
10. [The same architecture diagram, now with a "Problem" marker: somewhere in this distributed landscape the performance issue is hiding.]
12. Diagnose
• Determine the root of the problem
• Uses first-level information to narrow the scope
• Needs specialists
• Lots of data/information needed, both real-time and historical
• Usually needs iterations
• More than one tool used in the process
13. Root cause detection
• Confirm the root cause after you have diagnosed it
• Document it
• Recreate it in a test environment if possible
• Needs the same data as diagnostics
14. Solution finding
• Find a solution for the problem
• Architect a workaround or a fix
• Again needs the diagnostic data
• Run some test runs with different options and check them in real time
• Confirm the idea for the fix
• May be a different team than the troubleshooters
15. How to get the data?
• Intuition
• Experience
• Tools
• Logfiles
• Communication
17. 3 Key Things Impact Performance & Availability
• Concurrency
• Data Volume
• Resources
18. Why do things crash and slow down?
[Diagram: the mix of concurrency, data volume and resources differs between Development, QA/Test and Production.]
20. Logfiles
(Dev / Test / Prod)
Pros:
• Anything can be logged
• Easy to implement (if you have the source code)
Cons:
• Only what the developer thought was needed
• I/O heavy
• No chance of changing it if you don't own the source code
• Lots of files, usually no transaction context
• How to correlate in a distributed environment?
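The correlation problem is usually attacked with a correlation ID that is stamped on every log line and forwarded to downstream services. The sketch below hand-rolls the idea with a ThreadLocal, similar in spirit to the MDC of common logging frameworks; all names here are illustrative, not a real library API:

```java
import java.util.UUID;

// Sketch of a per-transaction correlation ID: every log line carries the ID,
// and the ID is forwarded to downstream services (e.g., as an HTTP header)
// so log files from different hosts can be stitched together per transaction.
public class CorrelationId {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    static void start() { CURRENT.set(UUID.randomUUID().toString()); }
    static String get() { return CURRENT.get(); }
    static void clear() { CURRENT.remove(); }

    // All log output is prefixed with the transaction's correlation ID.
    static String log(String message) {
        return "[" + get() + "] " + message;
    }

    public static void main(String[] args) {
        start(); // at the transaction entry point (servlet filter, etc.)
        System.out.println(log("Login started"));
        System.out.println(log("Calling backend, forwarding the ID as a header"));
        clear(); // when the transaction leaves this thread
    }
}
```

Grepping all hosts for one ID then yields the full path of a single transaction.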
21. Logfiles - 2
(Dev / Test / Prod)
Logging can itself be a source of problems, e.g. Log4Net:
• Synchronous local file system access
• The more you log, the longer it takes
• Can only be diagnosed with another tool
23. Logfiles - 4
[#|2013-04-16T16:04:44.319+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Data Store timer|#]
[#|2013-04-16T16:04:44.335+0200|INFO|sun-appserver2.1|com.appdynamics.TOP.SUMMARY.STATS.WRITE|_ThreadID=14;_ThreadName=pool-1-thread-9;|START TIME for timer service(TopSummaryStatsWriterTimerTaskBean) will be: Tue Apr 16 16:05:00 CEST 2013|#]
[#|2013-04-16T16:04:44.338+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Data Store timer|#]
[#|2013-04-16T16:04:44.338+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Data Purger timer|#]
[#|2013-04-16T16:04:44.369+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Data Purger timer|#]
[#|2013-04-16T16:04:44.369+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Detail String cache timer|#]
[#|2013-04-16T16:04:44.376+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Detail String cache timer|#]
[#|2013-04-16T16:04:44.376+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats rollup timer|#]
24. Profiler
(Dev / Test)
Pros:
• No configuration needed
• Lots of data, lots of detail
Cons:
• Lots of data, not suitable for production
• Needs experience
• No transactional concept / context
26. JMX (and similar)
(Dev / Test / Prod)
Pros:
• Built into most application servers
• JConsole is part of the JDK
• Easy to implement MBeans
Cons:
• No transaction context
• Not available for 3rd-party code
• No historical data
• Usually one JVM only
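"Built into most application servers" can be seen on any bare JVM: the platform MBean server already carries dozens of MBeans for memory, GC, threads and the OS. The sketch below enumerates them programmatically, the same way a JMX client such as JConsole would:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Lists every MBean registered with the JVM's platform MBean server.
// An application server would add its own beans (pools, sessions, ...) here.
public class ListMBeans {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // null/null matches every registered ObjectName.
        for (ObjectName name : server.queryNames(null, null)) {
            System.out.println(name);
        }
        System.out.println(server.getMBeanCount() + " MBeans registered");
    }
}
```

The output includes entries such as `java.lang:type=Memory` and one `java.lang:type=GarbageCollector,name=...` per collector; attaching JConsole shows the same tree.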
28. APM tools (free)
(Dev / Test / Prod)
Pros:
• They are free
• Transaction context (most of them)
• Quick setup (the commercial ones)
Cons:
• Usually functionally constrained (commercial)
• Hard to configure (open source)
• Usually no history
29. APM tools (commercial)
(Dev / Test / Prod)
Pros:
• Transactions, historical data
• Distributed monitoring
• Deep-dive diagnostics
• Production fit
Cons:
• Costly
• You have to choose the right one
34. The wonderful world of errors
• Sudden outage
• Always erroneous
• Sporadic error messages
• Silent death / bleeding to death
• Increasing error rates
• Wrong or meaningless error messages
35. Diagnosis – Rough Flow
• Look at the symptoms
• Eliminate definite non-causes
• Prioritize the suspicions
• Confirm or eliminate each suspicion:
• Compare with "normal"
• Gather more information
• Define the root cause and confirm it
• Redo from start
36. Possible Causes
(in no particular order)
• Bad coding
• Too much load
• Backend not reachable or slow
• Conflicting resources
• Memory leak
• Resource leak
• Network or hardware problem
39. [Chart: average response time (ms) between 10:01 and 10:29, climbing from near zero to about 11,000 ms and ending in a connection timeout.]
40. Connection Pool vs. Errors
[Chart: connection pool usage plotted against error counts from 10:00 to 10:30. The logged error: org.hibernate.util.JDBCExceptionReporter: Cannot get a connection, pool error: Timeout waiting for idle object]
41. 1st Diagnosis
• OK, we do have a problem
• Database connection pool depleted
• Waiting times stacking up
• 10 minutes until errors appear in the logs
• But WHY?
43. How to find data
• Check the logs for DB connection info
• Ask the architect which transactions use this pool
• Use JMX to check the pool metrics
• Check load info (if available)
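"Use JMX to check the pool metrics" follows one read pattern regardless of the bean. The sketch below reads the always-present `java.lang:type=Memory` bean, since a pool's actual ObjectName (e.g., a Tomcat DataSource) depends on the server and pool configuration; the pool case is noted in the comments:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

// Reading a metric over JMX: locate the bean by ObjectName, then read the
// attribute. For a connection pool you would read e.g. numActive/maxActive
// from the pool's own ObjectName in exactly the same way.
public class JmxRead {
    public static void main(String[] args) throws Exception {
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("java.lang:type=Memory");
        CompositeData heap = (CompositeData) conn.getAttribute(name, "HeapMemoryUsage");
        long used = (Long) heap.get("used");
        long max  = (Long) heap.get("max"); // may be -1 if undefined
        System.out.println("heap used = " + used + " of max = " + max);
    }
}
```

Against a remote server you would obtain the `MBeanServerConnection` through a `JMXConnector` instead of the platform server; the attribute reads stay identical.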
48. What did we find out?
• There were other TXs using the DB
• This TX had a specific DB connection pool
• It was just a single transaction with the problem
OK, now let's check the load
49. Load and DB connections
[Chart: Trx/min, average response time, pool limit, pool usage and transaction stalls over time]
Why the sudden load increase?
50. Root Cause
• The load balancer was not working correctly
• The DB connection pool size was not appropriate for this load
• Many different pools made this configuration necessary
51. The missing link
[The architecture diagram once more, extended by what was missing: browsers and a native mobile app driving the transactions (Login, Search Flight, View Flight Status, Make Reservation, Purchase) across the network into the backend landscape.]
55. Linear Memory Leak
Symptoms:
• OOM (OutOfMemoryError)
• Slow over time, with spikes
• Sawtooth heap pattern with an upward trend
Causes:
• Objects added to linear structures (e.g., linked lists) without being removed
• Other API misuse (addListener() without a corresponding removeListener(), etc.)
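The addListener()/removeListener() cause can be shown in miniature. This is an illustrative toy (the class and listener names are made up), but the mechanics are exactly those of a linear leak: a linear structure that only ever grows:

```java
import java.util.ArrayList;
import java.util.List;

// A linear leak in miniature: every "request" calls addListener() but
// nobody ever calls removeListener(), so the listener list grows forever
// and keeps every listener (and whatever it references) alive.
public class ListenerLeak {
    interface Listener { void onEvent(String event); }

    private final List<Listener> listeners = new ArrayList<>();

    void addListener(Listener l) { listeners.add(l); }
    void removeListener(Listener l) { listeners.remove(l); }
    int listenerCount() { return listeners.size(); }

    public static void main(String[] args) {
        ListenerLeak bus = new ListenerLeak();
        for (int request = 0; request < 1000; request++) {
            // Leak: registered per request, never removed.
            bus.addListener(event -> { /* handle event */ });
        }
        // Heap usage grows linearly with the number of requests.
        System.out.println("listeners still referenced: " + bus.listenerCount());
    }
}
```

On a heap dump, this shows up as the "many small objects referenced from one collection" pattern discussed below.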
56. Linear Memory Leak
Aggregate detection:
• Linear growth in heap utilization
• GC time growth
Specific detection:
• Figure out the object types being leaked
• Verbose GC
• Find the related APIs and search the code for misuse
57. Linear Memory Leak
Challenges:
• References: many small objects are referenced from one collection
• Death by 1000 cuts (Papierschnitte)
Specific detection:
• Figure out the object types being leaked
• Verbose GC
• Find the related APIs and search the code for misuse
58. Specific detection
Heap dump comparison:
• Needs at least 2 dumps
• Stops the JVM
• Can take several minutes each
• Creates tons of data
• Finds the object, not the code responsible for the leak
Profiler:
• High overhead, not for production
• Lots of data
APM solution:
• Collection-based algorithm, finds only collection leaks
• Instance counting
• Trade-off between low overhead and usefulness of data
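The "instance counting" idea above can be sketched by hand: keep a live count per class, and the leaking type is the counter that only ever climbs. Real tools do this via bytecode instrumentation; the tracker below is a hand-rolled illustration, not a tool API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Instance counting in miniature: constructors increment, close() decrements.
// A leaking type shows a live count that grows without bound.
public class InstanceTracker {
    static final Map<String, LongAdder> LIVE = new ConcurrentHashMap<>();

    static void created(Object o) {
        LIVE.computeIfAbsent(o.getClass().getName(), k -> new LongAdder()).increment();
    }
    static void destroyed(Object o) {
        LIVE.get(o.getClass().getName()).decrement();
    }
    static long liveCount(Class<?> c) {
        LongAdder a = LIVE.get(c.getName());
        return a == null ? 0 : a.sum();
    }

    // A suspect class calls the tracker from its constructor and close().
    static class Session {
        Session() { created(this); }
        void close() { destroyed(this); }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            Session s = new Session();
            if (i % 2 == 0) s.close(); // half are never closed, so the count climbs
        }
        System.out.println("live Sessions: " + liveCount(Session.class));
    }
}
```

This keeps overhead low (one counter bump per allocation) at the cost of detail, which is exactly the trade-off named above.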
59.-62. [Image-only slides]
63. Exponential Memory Leak
Causes:
• Objects added to most data structures (e.g., vectors, hashtables) without being removed
• Other API misuse (as with the linear leak)
Aggregate detection:
• Exponential growth in heap
Specific detection:
• Same as the linear leak
64. Resource Leak
Causes:
• API misuse of Java objects with a resource-style lifecycle (create -> use -> destroy)
Aggregate detection:
• Slow over time
• Growth in heap (if you're lucky)
Specific detection:
• Audit the code for API misuse
• Object instance tracking
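The create -> use -> destroy misuse has a standard Java cure: AutoCloseable plus try-with-resources makes the destroy step impossible to forget. A toy sketch with an illustrative Handle class and a visible open-handle counter:

```java
// Resource-style lifecycle (create -> use -> destroy): the leak is a
// missing destroy step.
public class ResourceLifecycle {
    static int openHandles = 0;

    static class Handle implements AutoCloseable {
        Handle() { openHandles++; }                      // create
        String read() { return "data"; }                 // use
        @Override public void close() { openHandles--; } // destroy
    }

    static String leaky() {
        Handle h = new Handle();
        return h.read(); // bug: close() is never called
    }

    static String safe() {
        try (Handle h = new Handle()) { // close() runs even on exceptions
            return h.read();
        }
    }

    public static void main(String[] args) {
        leaky();
        safe();
        System.out.println("leaked handles: " + openHandles); // prints 1
    }
}
```

Object instance tracking (as in the sketch a few slides back) finds exactly these forgotten Handles in a running system.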
68. Bad Coding: Infinite Loop
Causes:
• Infinite loop in the code
Aggregate detection:
• Stalled threads
• Permanently high CPU / thread usage
Specific detection:
• Thread dumps as needed
• Stack traces / graphs
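Besides jstack or kill -3, a JVM can dump its own threads, which is handy for "thread dumps as needed". A RUNNABLE thread that shows the same stack frames dump after dump is an infinite-loop suspect; the spinner below simulates one:

```java
import java.util.Map;

// Programmatic mini thread dump: print every thread's state and stack.
public class MiniThreadDump {
    public static void main(String[] args) throws Exception {
        // A suspect: a daemon thread spinning forever.
        Thread spinner = new Thread(() -> { while (true) { Thread.onSpinWait(); } }, "spinner");
        spinner.setDaemon(true); // don't keep the JVM alive
        spinner.start();
        Thread.sleep(100);

        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            System.out.println(e.getKey().getName() + " [" + e.getKey().getState() + "]");
            for (StackTraceElement frame : e.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```

Taking two such dumps a few seconds apart and diffing the RUNNABLE stacks is a poor man's version of what profilers and APM tools automate.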
69. Bad Coding: CPU-Bound Component
Causes:
• Idiot with a "Learn Java in 24 Hours" book
Aggregate Detection:
• Response time measurement
• Aggregate CPU utilization
Specific Detection:
• Detailed CPU utilization
Typical Cure:
• Caching of data or of performed calculations
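The typical cure can be sketched as a memoizing cache: the expensive calculation runs once per input, repeats are served from a map. The calculation here is a stand-in; the pattern is the point:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Cache of performed calculations via ConcurrentHashMap.computeIfAbsent.
public class CalcCache {
    static final AtomicInteger computations = new AtomicInteger();
    static final Map<Integer, Long> cache = new ConcurrentHashMap<>();

    // Stand-in for a CPU-heavy calculation.
    static long expensive(int n) {
        computations.incrementAndGet();
        long acc = 0;
        for (int i = 0; i < n; i++) acc += (long) i * i;
        return acc;
    }

    static long cached(int n) {
        return cache.computeIfAbsent(n, CalcCache::expensive);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) cached(10_000); // same input 1000 times
        System.out.println("computations actually run: " + computations.get()); // 1, not 1000
    }
}
```

In a real system the cache also needs a size bound and an invalidation strategy, which this sketch deliberately omits.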
70. Layer-itis
Causes:
• Poorly implemented data bridge layers, or simply too many of them
• DB -> XML -> XSLT -> more XML -> "Custom Data Management Layer" -> Consumer
Aggregate Detection:
• Response time measurements
Specific Detection:
• Call graphs / call traces (a stack trace is not enough)
• Ask for a design or architecture document
71. O/R Mapper misuse
Causes:
• "Hibernate fixes everything"
• Massive SQL statements (in length and number)
• Wrong data strategy
Aggregate Detection:
• Response time measurements
• DB time measurements
Specific Detection:
• Call stacks / snapshots
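The classic O/R-mapper failure mode is the N+1 select. The sketch below replaces the database with a stub that only counts queries; the naive path mimics what lazy loading does when it fires once per parent row, the batched path mimics an eager fetch or join. All repository names are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// N+1 selects in miniature: 1 query for the parents, then 1 per parent,
// versus 2 queries total with a batched/joined fetch.
public class NPlusOne {
    static final AtomicInteger queries = new AtomicInteger();

    static List<Integer> loadOrderIds() {               // 1 query
        queries.incrementAndGet();
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 100; i++) ids.add(i);
        return ids;
    }
    static String loadCustomerForOrder(int orderId) {   // 1 query per order
        queries.incrementAndGet();
        return "customer-" + orderId;
    }
    static List<String> loadAllCustomersForOrders(List<Integer> ids) { // 1 joined query
        queries.incrementAndGet();
        List<String> out = new ArrayList<>();
        for (int id : ids) out.add("customer-" + id);
        return out;
    }

    public static void main(String[] args) {
        List<Integer> ids = loadOrderIds();
        for (int id : ids) loadCustomerForOrder(id);     // lazy loading: 100 extra queries
        System.out.println("naive: " + queries.get() + " queries");   // 101

        queries.set(0);
        loadAllCustomersForOrders(loadOrderIds());       // eager fetch / join
        System.out.println("batched: " + queries.get() + " queries"); // 2
    }
}
```

On the DB time measurements mentioned above, the naive variant shows up as a flood of tiny, identical statements.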
76. Threading: Deadlock
Found one Java-level deadlock:
=============================
"Thread-2":
waiting to lock monitor 102054308 (object 7f3113800, a java.lang.Object),
which is held by "Thread-1"
"Thread-1":
waiting to lock monitor 1020348b8 (object 7f3113810, a java.lang.Object),
which is held by "Thread-2"
Java stack information for the threads listed above:
===================================================
"Thread-2":
at DeadlockTest$2.run(DeadlockTest.java:42)
- waiting to lock <7f3113800> (a java.lang.Object)
- locked <7f3113810> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:680)
"Thread-1":
at DeadlockTest$1.run(DeadlockTest.java:26)
- waiting to lock <7f3113810> (a java.lang.Object)
- locked <7f3113800> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:680)
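The dump above comes from the JVM's built-in deadlock detector; the same check is available programmatically via ThreadMXBean. This sketch provokes a two-lock deadlock on purpose (latches force each thread to hold its first lock before trying the second) and then asks the JVM to find it:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Deliberate deadlock plus programmatic detection via ThreadMXBean.
public class DeadlockDemo {
    public static void main(String[] args) throws Exception {
        final Object lockA = new Object();
        final Object lockB = new Object();
        final CountDownLatch bothHold = new CountDownLatch(2);

        Thread t1 = new Thread(() -> {
            synchronized (lockA) {
                bothHold.countDown();
                try { bothHold.await(); } catch (InterruptedException ignored) { }
                synchronized (lockB) { } // blocks forever: Thread-2 holds lockB
            }
        }, "Thread-1");
        Thread t2 = new Thread(() -> {
            synchronized (lockB) {
                bothHold.countDown();
                try { bothHold.await(); } catch (InterruptedException ignored) { }
                synchronized (lockA) { } // blocks forever: Thread-1 holds lockA
            }
        }, "Thread-2");
        t1.setDaemon(true); // deadlocked threads can't be stopped;
        t2.setDaemon(true); // the daemon flag lets the JVM exit anyway
        t1.start();
        t2.start();

        // Poll until the deadlock is established.
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] deadlocked = null;
        for (int i = 0; i < 100 && deadlocked == null; i++) {
            Thread.sleep(50);
            deadlocked = bean.findDeadlockedThreads();
        }
        System.out.println("deadlocked threads: "
                + (deadlocked == null ? 0 : deadlocked.length));
    }
}
```

A monitoring job can run findDeadlockedThreads() periodically and alert, instead of waiting for someone to read a kill -3 dump.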
77. Threading: Chokepoint
Causes:
• Many threads bottlenecked waiting for one lock
Aggregate Detection:
• Stalled threads / high concurrent usage
• Exponential slowness
• Low CPU usage
Specific Detection:
• Request response time monitoring
• CPU block / wait timing
79. Internal Resource Bottleneck
Causes:
• Overusage of an internal resource (threads, database connections, etc.)
• Underallocation of the same
Aggregate Detection:
• Stalled threads / high concurrent usage
• Call rate and average response time of the internal resource
Specific Detection:
• Also compare with the methods from Resource Leak, External Bottleneck, and Overusage of External System
80. External Bottleneck
Causes:
• The external system (database, authentication server) is slow
• Compare with Overusage of External System
Aggregate Detection:
• Response time on backend calls
• Exceptions
Specific Detection:
• Call graphs
• Specific monitoring on those backends
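"Response time on backend calls" needs per-backend timing at the call site. A thin wrapper suffices as a sketch: it times every call to a named backend, so a slow database or auth server shows up as a climbing average for that one name. The fake backends and all names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Per-backend response time accounting around every outbound call.
public class BackendTimer {
    // backend name -> {call count, total nanos}
    static final Map<String, long[]> stats = new ConcurrentHashMap<>();

    static <T> T timed(String backend, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            stats.merge(backend, new long[] { 1, elapsed },
                    (a, b) -> new long[] { a[0] + b[0], a[1] + b[1] });
        }
    }

    static double avgMillis(String backend) {
        long[] s = stats.get(backend);
        return s == null ? 0 : (s[1] / (double) s[0]) / 1_000_000.0;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            timed("database", () -> "row");       // fast backend
            timed("auth-server", () -> {          // slow backend (simulated)
                try { Thread.sleep(20); } catch (InterruptedException ignored) { }
                return "token";
            });
        }
        System.out.printf("database: %.2f ms, auth-server: %.2f ms%n",
                avgMillis("database"), avgMillis("auth-server"));
    }
}
```

This is what APM agents do automatically at JDBC/HTTP boundaries; the hand-rolled version is mainly useful where no agent can be installed.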
82. Production ground to a halt for 2 hours, and again the next day
[Chart: Trx/min, average response time, pool limit, pool usage and transaction stalls during the outage]
83. Overusage of External System
Causes:
• Poor design or tuning of the interaction with the backend system (e.g., a join between two million-row tables for each user logon)
• O/R mapper misconfiguration
Aggregate Detection:
• Response time measurement
Specific Detection:
• Timing on the backend systems
• Also needs tools for those backend systems
87.
• One interesting problem occurs when the size of transactions with backend systems needs to be tuned
• Can be intertwined with / exacerbated by Layer-itis and Overusage of External System
[Diagram: a spectrum of request sizes]
• Many small requests: the system constantly wastes resources dispatching / unmarshalling many transactions and results ("Death by a thousand cuts")
• "Just right": somewhere in between
• One HUGE request: the system periodically slows to a crawl as many resources get thrown at a large chunk of work ("Pig in a Python")
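The "just right" middle ground is usually reached by chunking: instead of one round trip per item or one giant request, items go out in fixed-size batches, and the batch size is the tuning knob. A minimal sketch with an illustrative helper:

```java
import java.util.ArrayList;
import java.util.List;

// Splits a work list into fixed-size batches: the knob between
// "death by a thousand cuts" and "pig in a python".
public class Batching {
    static <T> List<List<T>> chunks(List<T> items, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            out.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 1000; i++) items.add(i);

        // 1000 round trips vs. 1 huge request vs. 10 medium ones:
        System.out.println("per item : " + chunks(items, 1).size() + " requests");
        System.out.println("one blob : " + chunks(items, 1000).size() + " request");
        System.out.println("batched  : " + chunks(items, 100).size() + " requests");
    }
}
```

Note that subList returns views into the original list, so chunking itself costs almost nothing; the right batch size still has to be found by measurement against the backend in question.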