Application performance doesn't come easy. How do you find the root cause of performance issues in modern, complex applications when all you have to start with is a complaining user?
In this presentation (mainly in German, but understandable for English speakers) I reprise the fundamentals of troubleshooting and give some new examples of how to tackle issues.
Follow-up presentation to "Performance Trouble Shooting 101 - Schweine, Schlangen und Papierschnitte"
8. Triage
• Determine who needs to fix it
• Starts with an overview and a comparison to "normal" performance
• First-level task (operators)
• First indication of the problem type
• Works best with transactional data
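The "comparison to normal" step of triage can be sketched as a simple threshold check. This is a minimal illustration, not a tool's API; all names and numbers are made up for the example:

```java
// Minimal triage sketch: compare current response times against a recorded
// baseline and flag transactions that deviate too far from "normal".
public class TriageBaseline {
    // Flags a transaction when its current average response time exceeds
    // the baseline by more than the given factor (e.g., 1.5 = 50% slower).
    static boolean deviates(double baselineMs, double currentMs, double factor) {
        return currentMs > baselineMs * factor;
    }

    public static void main(String[] args) {
        // Baseline averages recorded during "normal" operation (illustrative).
        double loginBaseline = 50.0;   // ms
        double searchBaseline = 145.0; // ms

        System.out.println(deviates(loginBaseline, 52.0, 1.5));   // within normal range
        System.out.println(deviates(searchBaseline, 450.0, 1.5)); // clearly degraded
    }
}
```

In practice the baseline comes from historical transactional data, which is why triage works best when that data exists.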
9. [Architecture diagram: business transactions (Login, Search Flight, View Flight Status, Make Reservation) flowing through a distributed landscape of Web 2.0 frontends (browser logic, AJAX, web frameworks), Tomcat, .NET, Weblogic and JBoss application servers, an ESB (Mule, Tibco, AG), MQ, Memcached, Coherence, SQL Server and Oracle databases, big data stores (Hadoop, Cassandra, MongoDB), portals (ATG, Vignette, Sharepoint) and cloud/virtualization platforms (Amazon EC2, Windows Azure, VMWare). Each tier is annotated with its response time (between 1 ms and 310 ms) and its release version.]
10. [The same architecture diagram, now with a "Problem" marker: somewhere in this distributed landscape the performance issue is hiding.]
12. Diagnose
• Determine the root of the problem
• Uses first-level information to narrow the scope
• Needs specialists
• Lots of data/information needed, both real-time and historical
• Usually needs iterations
• More than one tool used in the process
13. Root cause detection
• Confirm the root cause after you have diagnosed it
• Document it
• Recreate it in a test environment if possible
• Needs the same data as diagnostics
14. Solution finding
• Find a solution for the problem
• Architect a workaround or a fix
• Again needs the diagnostic data
• Run some test runs with different options and check them in real time
• Confirm the idea for the fix
• May be a different team than the troubleshooters
15. How to get the data?
• Intuition
• Experience
• Tools
• Logfiles
• Communication
17. 3 Key Things Impact Performance & Availability
• Concurrency
• Data Volume
• Resources
18. Why do things crash and slow down?
[Diagram: the mix of concurrency, data volume and resources differs between Development, QA/Test and Production.]
20. Logfiles
(Dev / Test / Prod)
Pros:
• Anything can be logged
• Easy to implement (if you have the source code)
Cons:
• Only what the developer thought was needed
• I/O heavy
• No chance of changing it if you don't own the source code
• Lots of files, usually no transaction context
• How to correlate in a distributed environment?
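The correlation problem is usually attacked with a correlation ID that is stamped on every log line and forwarded to downstream services. The sketch below hand-rolls the idea with a ThreadLocal, similar in spirit to the MDC of common logging frameworks; all names here are illustrative, not a real library API:

```java
import java.util.UUID;

// Sketch of a per-transaction correlation ID: every log line carries the ID,
// and the ID is forwarded to downstream services (e.g., as an HTTP header)
// so log files from different hosts can be stitched together per transaction.
public class CorrelationId {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    static void start() { CURRENT.set(UUID.randomUUID().toString()); }
    static String get() { return CURRENT.get(); }
    static void clear() { CURRENT.remove(); }

    // All log output is prefixed with the transaction's correlation ID.
    static String log(String message) {
        return "[" + get() + "] " + message;
    }

    public static void main(String[] args) {
        start(); // at the transaction entry point (servlet filter, etc.)
        System.out.println(log("Login started"));
        System.out.println(log("Calling backend, forwarding the ID as a header"));
        clear(); // when the transaction leaves this thread
    }
}
```

Grepping all hosts for one ID then yields the full path of a single transaction.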
21. Logfiles - 2
(Dev / Test / Prod)
Logging can itself be a source of problems, e.g. Log4Net:
• Synchronous local file system access
• The more you log, the longer it takes
• Can only be diagnosed with another tool
23. Logfiles - 4
[#|2013-04-16T16:04:44.319+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Data Store timer|#]
[#|2013-04-16T16:04:44.335+0200|INFO|sun-appserver2.1|com.appdynamics.TOP.SUMMARY.STATS.WRITE|_ThreadID=14;_ThreadName=pool-1-thread-9;|START TIME for timer service(TopSummaryStatsWriterTimerTaskBean) will be: Tue Apr 16 16:05:00 CEST 2013|#]
[#|2013-04-16T16:04:44.338+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Data Store timer|#]
[#|2013-04-16T16:04:44.338+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Data Purger timer|#]
[#|2013-04-16T16:04:44.369+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Data Purger timer|#]
[#|2013-04-16T16:04:44.369+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Detail String cache timer|#]
[#|2013-04-16T16:04:44.376+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Detail String cache timer|#]
[#|2013-04-16T16:04:44.376+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats rollup timer|#]
24. Profiler
(Dev / Test)
Pros:
• No configuration needed
• Lots of data, lots of detail
Cons:
• Lots of data, not suitable for production
• Needs experience
• No transactional concept / context
26. JMX (and similar)
(Dev / Test / Prod)
Pros:
• Built into most application servers
• JConsole is part of the JDK
• Easy to implement MBeans
Cons:
• No transaction context
• Not available for 3rd-party code
• No historical data
• Usually one JVM only
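"Built into most application servers" can be seen on any bare JVM: the platform MBean server already carries dozens of MBeans for memory, GC, threads and the OS. The sketch below enumerates them programmatically, the same way a JMX client such as JConsole would:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Lists every MBean registered with the JVM's platform MBean server.
// An application server would add its own beans (pools, sessions, ...) here.
public class ListMBeans {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // null/null matches every registered ObjectName.
        for (ObjectName name : server.queryNames(null, null)) {
            System.out.println(name);
        }
        System.out.println(server.getMBeanCount() + " MBeans registered");
    }
}
```

The output includes entries such as `java.lang:type=Memory` and one `java.lang:type=GarbageCollector,name=...` per collector; attaching JConsole shows the same tree.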
28. APM tools (free)
(Dev / Test / Prod)
Pros:
• They are free
• Transaction context (most of them)
• Quick setup (the commercial ones)
Cons:
• Usually functionally constrained (commercial)
• Hard to configure (open source)
• Usually no history
29. APM tools (commercial)
(Dev / Test / Prod)
Pros:
• Transactions, historical data
• Distributed monitoring
• Deep-dive diagnostics
• Production fit
Cons:
• Costly
• You have to choose the right one
34. The wonderful world of errors
• Sudden outage
• Always erroneous
• Sporadic error messages
• Silent death / bleeding to death
• Increasing error rates
• Wrong or meaningless error messages
35. Diagnosis – Rough Flow
• Look at the symptoms
• Eliminate definite non-causes
• Prioritize the suspicions
• Confirm or eliminate each suspicion:
• Compare with "normal"
• Gather more information
• Define the root cause and confirm it
• Redo from start
36. Possible Causes
(in no particular order)
• Bad coding
• Too much load
• Backend not reachable or slow
• Conflicting resources
• Memory leak
• Resource leak
• Network or hardware problem
39. [Chart: average response time (ms) between 10:01 and 10:29, climbing from near zero to about 11,000 ms and ending in a connection timeout.]
40. Connection Pool vs. Errors
[Chart: connection pool usage plotted against error counts from 10:00 to 10:30. The logged error: org.hibernate.util.JDBCExceptionReporter: Cannot get a connection, pool error: Timeout waiting for idle object]
41. 1st Diagnosis
• OK, we do have a problem
• Database connection pool depleted
• Waiting times stacking up
• 10 minutes until errors appear in the logs
• But WHY?
43. How to find data
• Check the logs for DB connection info
• Ask the architect which transactions use this pool
• Use JMX to check the pool metrics
• Check load info (if available)
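"Use JMX to check the pool metrics" follows one read pattern regardless of the bean. The sketch below reads the always-present `java.lang:type=Memory` bean, since a pool's actual ObjectName (e.g., a Tomcat DataSource) depends on the server and pool configuration; the pool case is noted in the comments:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

// Reading a metric over JMX: locate the bean by ObjectName, then read the
// attribute. For a connection pool you would read e.g. numActive/maxActive
// from the pool's own ObjectName in exactly the same way.
public class JmxRead {
    public static void main(String[] args) throws Exception {
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("java.lang:type=Memory");
        CompositeData heap = (CompositeData) conn.getAttribute(name, "HeapMemoryUsage");
        long used = (Long) heap.get("used");
        long max  = (Long) heap.get("max"); // may be -1 if undefined
        System.out.println("heap used = " + used + " of max = " + max);
    }
}
```

Against a remote server you would obtain the `MBeanServerConnection` through a `JMXConnector` instead of the platform server; the attribute reads stay identical.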
48. What did we find out?
• There were other TXs using the DB
• This TX had a specific DB connection pool
• It was just a single transaction with the problem
OK, now let's check the load
49. Load and DB connections
[Chart: Trx/min, average response time, pool limit, pool usage and transaction stalls over time]
Why the sudden load increase?
50. Root Cause
• The load balancer was not working correctly
• The DB connection pool size was not appropriate for this load
• Many different pools made this configuration necessary
51. The missing link
[The architecture diagram once more, extended by what was missing: browsers and a native mobile app driving the transactions (Login, Search Flight, View Flight Status, Make Reservation, Purchase) across the network into the backend landscape.]
55. Linear Memory Leak
Symptoms:
• OOM (OutOfMemoryError)
• Slow over time, with spikes
• Sawtooth heap pattern with an upward trend
Causes:
• Objects added to linear structures (e.g., linked lists) without being removed
• Other API misuse (addListener() without a corresponding removeListener(), etc.)
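The addListener()/removeListener() cause can be shown in miniature. This is an illustrative toy (the class and listener names are made up), but the mechanics are exactly those of a linear leak: a linear structure that only ever grows:

```java
import java.util.ArrayList;
import java.util.List;

// A linear leak in miniature: every "request" calls addListener() but
// nobody ever calls removeListener(), so the listener list grows forever
// and keeps every listener (and whatever it references) alive.
public class ListenerLeak {
    interface Listener { void onEvent(String event); }

    private final List<Listener> listeners = new ArrayList<>();

    void addListener(Listener l) { listeners.add(l); }
    void removeListener(Listener l) { listeners.remove(l); }
    int listenerCount() { return listeners.size(); }

    public static void main(String[] args) {
        ListenerLeak bus = new ListenerLeak();
        for (int request = 0; request < 1000; request++) {
            // Leak: registered per request, never removed.
            bus.addListener(event -> { /* handle event */ });
        }
        // Heap usage grows linearly with the number of requests.
        System.out.println("listeners still referenced: " + bus.listenerCount());
    }
}
```

On a heap dump, this shows up as the "many small objects referenced from one collection" pattern discussed below.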
56. Linear Memory Leak
Aggregate detection:
• Linear growth in heap utilization
• GC time growth
Specific detection:
• Figure out the object types being leaked
• Verbose GC
• Find the related APIs and search the code for misuse
57. Linear Memory Leak
Challenges:
• References: many small objects are referenced from one collection
• Death by 1000 cuts (Papierschnitte)
Specific detection:
• Figure out the object types being leaked
• Verbose GC
• Find the related APIs and search the code for misuse
58. Specific detection
Heap dump comparison:
• Needs at least 2 dumps
• Stops the JVM
• Can take several minutes each
• Creates tons of data
• Finds the object, not the code responsible for the leak
Profiler:
• High overhead, not for production
• Lots of data
APM solution:
• Collection-based algorithm, finds only collection leaks
• Instance counting
• Trade-off between low overhead and usefulness of data
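The "instance counting" idea above can be sketched by hand: keep a live count per class, and the leaking type is the counter that only ever climbs. Real tools do this via bytecode instrumentation; the tracker below is a hand-rolled illustration, not a tool API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Instance counting in miniature: constructors increment, close() decrements.
// A leaking type shows a live count that grows without bound.
public class InstanceTracker {
    static final Map<String, LongAdder> LIVE = new ConcurrentHashMap<>();

    static void created(Object o) {
        LIVE.computeIfAbsent(o.getClass().getName(), k -> new LongAdder()).increment();
    }
    static void destroyed(Object o) {
        LIVE.get(o.getClass().getName()).decrement();
    }
    static long liveCount(Class<?> c) {
        LongAdder a = LIVE.get(c.getName());
        return a == null ? 0 : a.sum();
    }

    // A suspect class calls the tracker from its constructor and close().
    static class Session {
        Session() { created(this); }
        void close() { destroyed(this); }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            Session s = new Session();
            if (i % 2 == 0) s.close(); // half are never closed, so the count climbs
        }
        System.out.println("live Sessions: " + liveCount(Session.class));
    }
}
```

This keeps overhead low (one counter bump per allocation) at the cost of detail, which is exactly the trade-off named above.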
59.-62. [Image-only slides]
63. Exponential Memory Leak
Causes:
• Objects added to most data structures (e.g., vectors, hashtables) without being removed
• Other API misuse (as with the linear leak)
Aggregate detection:
• Exponential growth in heap
Specific detection:
• Same as the linear leak
64. Resource Leak
Causes:
• API misuse of Java objects with a resource-style lifecycle (create -> use -> destroy)
Aggregate detection:
• Slow over time
• Growth in heap (if you're lucky)
Specific detection:
• Audit the code for API misuse
• Object instance tracking
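The create -> use -> destroy misuse has a standard Java cure: AutoCloseable plus try-with-resources makes the destroy step impossible to forget. A toy sketch with an illustrative Handle class and a visible open-handle counter:

```java
// Resource-style lifecycle (create -> use -> destroy): the leak is a
// missing destroy step.
public class ResourceLifecycle {
    static int openHandles = 0;

    static class Handle implements AutoCloseable {
        Handle() { openHandles++; }                      // create
        String read() { return "data"; }                 // use
        @Override public void close() { openHandles--; } // destroy
    }

    static String leaky() {
        Handle h = new Handle();
        return h.read(); // bug: close() is never called
    }

    static String safe() {
        try (Handle h = new Handle()) { // close() runs even on exceptions
            return h.read();
        }
    }

    public static void main(String[] args) {
        leaky();
        safe();
        System.out.println("leaked handles: " + openHandles); // prints 1
    }
}
```

Object instance tracking (as in the sketch a few slides back) finds exactly these forgotten Handles in a running system.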
68. Bad Coding: Infinite Loop
Causes:
• Infinite loop in the code
Aggregate detection:
• Stalled threads
• Permanently high CPU / thread usage
Specific detection:
• Thread dumps as needed
• Stack traces / graphs
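Besides jstack or kill -3, a JVM can dump its own threads, which is handy for "thread dumps as needed". A RUNNABLE thread that shows the same stack frames dump after dump is an infinite-loop suspect; the spinner below simulates one:

```java
import java.util.Map;

// Programmatic mini thread dump: print every thread's state and stack.
public class MiniThreadDump {
    public static void main(String[] args) throws Exception {
        // A suspect: a daemon thread spinning forever.
        Thread spinner = new Thread(() -> { while (true) { Thread.onSpinWait(); } }, "spinner");
        spinner.setDaemon(true); // don't keep the JVM alive
        spinner.start();
        Thread.sleep(100);

        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            System.out.println(e.getKey().getName() + " [" + e.getKey().getState() + "]");
            for (StackTraceElement frame : e.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```

Taking two such dumps a few seconds apart and diffing the RUNNABLE stacks is a poor man's version of what profilers and APM tools automate.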
69. Bad Coding: CPU-Bound Component
Causes:
• Idiot with a "Learn Java in 24 Hours" book
Aggregate Detection:
• Response time measurement
• Aggregate CPU utilization
Specific Detection:
• Detailed CPU utilization
Typical Cure:
• Caching of data or of performed calculations
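The typical cure can be sketched as a memoizing cache: the expensive calculation runs once per input, repeats are served from a map. The calculation here is a stand-in; the pattern is the point:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Cache of performed calculations via ConcurrentHashMap.computeIfAbsent.
public class CalcCache {
    static final AtomicInteger computations = new AtomicInteger();
    static final Map<Integer, Long> cache = new ConcurrentHashMap<>();

    // Stand-in for a CPU-heavy calculation.
    static long expensive(int n) {
        computations.incrementAndGet();
        long acc = 0;
        for (int i = 0; i < n; i++) acc += (long) i * i;
        return acc;
    }

    static long cached(int n) {
        return cache.computeIfAbsent(n, CalcCache::expensive);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) cached(10_000); // same input 1000 times
        System.out.println("computations actually run: " + computations.get()); // 1, not 1000
    }
}
```

In a real system the cache also needs a size bound and an invalidation strategy, which this sketch deliberately omits.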
70. Layer-itis
Causes:
• Poorly implemented data bridge layers, or simply too many of them
• DB -> XML -> XSLT -> more XML -> "Custom Data Management Layer" -> Consumer
Aggregate Detection:
• Response time measurements
Specific Detection:
• Call graphs / call traces (a stack trace is not enough)
• Ask for a design or architecture document
71. O/R Mapper misuse
Causes:
• "Hibernate fixes everything"
• Massive SQL statements (in length and number)
• Wrong data strategy
Aggregate Detection:
• Response time measurements
• DB time measurements
Specific Detection:
• Call stacks / snapshots
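The classic O/R-mapper failure mode is the N+1 select. The sketch below replaces the database with a stub that only counts queries; the naive path mimics what lazy loading does when it fires once per parent row, the batched path mimics an eager fetch or join. All repository names are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// N+1 selects in miniature: 1 query for the parents, then 1 per parent,
// versus 2 queries total with a batched/joined fetch.
public class NPlusOne {
    static final AtomicInteger queries = new AtomicInteger();

    static List<Integer> loadOrderIds() {               // 1 query
        queries.incrementAndGet();
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 100; i++) ids.add(i);
        return ids;
    }
    static String loadCustomerForOrder(int orderId) {   // 1 query per order
        queries.incrementAndGet();
        return "customer-" + orderId;
    }
    static List<String> loadAllCustomersForOrders(List<Integer> ids) { // 1 joined query
        queries.incrementAndGet();
        List<String> out = new ArrayList<>();
        for (int id : ids) out.add("customer-" + id);
        return out;
    }

    public static void main(String[] args) {
        List<Integer> ids = loadOrderIds();
        for (int id : ids) loadCustomerForOrder(id);     // lazy loading: 100 extra queries
        System.out.println("naive: " + queries.get() + " queries");   // 101

        queries.set(0);
        loadAllCustomersForOrders(loadOrderIds());       // eager fetch / join
        System.out.println("batched: " + queries.get() + " queries"); // 2
    }
}
```

On the DB time measurements mentioned above, the naive variant shows up as a flood of tiny, identical statements.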
76. Threading: Deadlock
Found one Java-level deadlock:
=============================
"Thread-2":
waiting to lock monitor 102054308 (object 7f3113800, a java.lang.Object),
which is held by "Thread-1"
"Thread-1":
waiting to lock monitor 1020348b8 (object 7f3113810, a java.lang.Object),
which is held by "Thread-2"
Java stack information for the threads listed above:
===================================================
"Thread-2":
at DeadlockTest$2.run(DeadlockTest.java:42)
- waiting to lock <7f3113800> (a java.lang.Object)
- locked <7f3113810> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:680)
"Thread-1":
at DeadlockTest$1.run(DeadlockTest.java:26)
- waiting to lock <7f3113810> (a java.lang.Object)
- locked <7f3113800> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:680)
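The dump above comes from the JVM's built-in deadlock detector; the same check is available programmatically via ThreadMXBean. This sketch provokes a two-lock deadlock on purpose (latches force each thread to hold its first lock before trying the second) and then asks the JVM to find it:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Deliberate deadlock plus programmatic detection via ThreadMXBean.
public class DeadlockDemo {
    public static void main(String[] args) throws Exception {
        final Object lockA = new Object();
        final Object lockB = new Object();
        final CountDownLatch bothHold = new CountDownLatch(2);

        Thread t1 = new Thread(() -> {
            synchronized (lockA) {
                bothHold.countDown();
                try { bothHold.await(); } catch (InterruptedException ignored) { }
                synchronized (lockB) { } // blocks forever: Thread-2 holds lockB
            }
        }, "Thread-1");
        Thread t2 = new Thread(() -> {
            synchronized (lockB) {
                bothHold.countDown();
                try { bothHold.await(); } catch (InterruptedException ignored) { }
                synchronized (lockA) { } // blocks forever: Thread-1 holds lockA
            }
        }, "Thread-2");
        t1.setDaemon(true); // deadlocked threads can't be stopped;
        t2.setDaemon(true); // the daemon flag lets the JVM exit anyway
        t1.start();
        t2.start();

        // Poll until the deadlock is established.
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] deadlocked = null;
        for (int i = 0; i < 100 && deadlocked == null; i++) {
            Thread.sleep(50);
            deadlocked = bean.findDeadlockedThreads();
        }
        System.out.println("deadlocked threads: "
                + (deadlocked == null ? 0 : deadlocked.length));
    }
}
```

A monitoring job can run findDeadlockedThreads() periodically and alert, instead of waiting for someone to read a kill -3 dump.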
77. Threading: Chokepoint
Causes:
• Many threads bottlenecked waiting for one lock
Aggregate Detection:
• Stalled threads / high concurrent usage
• Exponential slowness
• Low CPU usage
Specific Detection:
• Request response time monitoring
• CPU block / wait timing
79. Internal Resource Bottleneck
Causes:
• Overusage of an internal resource (threads, database connections, etc.)
• Underallocation of the same
Aggregate Detection:
• Stalled threads / high concurrent usage
• Call rate and average response time of the internal resource
Specific Detection:
• Also compare with the methods from Resource Leak, External Bottleneck, and Overusage of External System
80. External Bottleneck
Causes:
• The external system (database, authentication server) is slow
• Compare with Overusage of External System
Aggregate Detection:
• Response time on backend calls
• Exceptions
Specific Detection:
• Call graphs
• Specific monitoring on those backends
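"Response time on backend calls" needs per-backend timing at the call site. A thin wrapper suffices as a sketch: it times every call to a named backend, so a slow database or auth server shows up as a climbing average for that one name. The fake backends and all names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Per-backend response time accounting around every outbound call.
public class BackendTimer {
    // backend name -> {call count, total nanos}
    static final Map<String, long[]> stats = new ConcurrentHashMap<>();

    static <T> T timed(String backend, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            stats.merge(backend, new long[] { 1, elapsed },
                    (a, b) -> new long[] { a[0] + b[0], a[1] + b[1] });
        }
    }

    static double avgMillis(String backend) {
        long[] s = stats.get(backend);
        return s == null ? 0 : (s[1] / (double) s[0]) / 1_000_000.0;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            timed("database", () -> "row");       // fast backend
            timed("auth-server", () -> {          // slow backend (simulated)
                try { Thread.sleep(20); } catch (InterruptedException ignored) { }
                return "token";
            });
        }
        System.out.printf("database: %.2f ms, auth-server: %.2f ms%n",
                avgMillis("database"), avgMillis("auth-server"));
    }
}
```

This is what APM agents do automatically at JDBC/HTTP boundaries; the hand-rolled version is mainly useful where no agent can be installed.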
82. Production ground to a halt for 2 hours, and again the next day
[Chart: Trx/min, average response time, pool limit, pool usage and transaction stalls during the outage]
83. Overusage of External System
Causes:
• Poor design or tuning of the interaction with the backend system (e.g., a join between two million-row tables for each user logon)
• O/R mapper misconfiguration
Aggregate Detection:
• Response time measurement
Specific Detection:
• Timing on the backend systems
• Also needs tools for those backend systems
87.
• One interesting problem occurs when the size of transactions with backend systems needs to be tuned
• Can be intertwined with / exacerbated by Layer-itis and Overusage of External System
[Diagram: a spectrum of request sizes]
• Many small requests: the system constantly wastes resources dispatching / unmarshalling many transactions and results ("Death by a thousand cuts")
• "Just right": somewhere in between
• One HUGE request: the system periodically slows to a crawl as many resources get thrown at a large chunk of work ("Pig in a Python")
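The "just right" middle ground is usually reached by chunking: instead of one round trip per item or one giant request, items go out in fixed-size batches, and the batch size is the tuning knob. A minimal sketch with an illustrative helper:

```java
import java.util.ArrayList;
import java.util.List;

// Splits a work list into fixed-size batches: the knob between
// "death by a thousand cuts" and "pig in a python".
public class Batching {
    static <T> List<List<T>> chunks(List<T> items, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            out.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 1000; i++) items.add(i);

        // 1000 round trips vs. 1 huge request vs. 10 medium ones:
        System.out.println("per item : " + chunks(items, 1).size() + " requests");
        System.out.println("one blob : " + chunks(items, 1000).size() + " request");
        System.out.println("batched  : " + chunks(items, 100).size() + " requests");
    }
}
```

Note that subList returns views into the original list, so chunking itself costs almost nothing; the right batch size still has to be found by measurement against the backend in question.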