LEARN
There are laws and principles that govern concurrency and performance.
Performance can be built, fueled, and/or tuned.
How do we measure performance and capacity in abstract terms?
Capacity (throughput) and load are often used interchangeably, but incorrectly.
What is the difference between resource utilization and saturation?
How are performance & capacity measured on a live system (CPU & memory)?
APPLY
Find out how your system is being used or abused.
Find out how your system is performing as a whole.
Find out how a particular process in the system is performing.
Find out how a particular thread in the process is performing.
Find the bottlenecks: what is scarce or missing?
4. What will we Discuss?
– LEARN
– There are laws and principles that govern concurrency and performance.
– Performance can be built, fueled, and/or tuned.
– How do we measure performance and capacity in abstract terms?
– Capacity (throughput) and load are often used interchangeably, but incorrectly.
– What is the difference between resource utilization and saturation?
– How are performance & capacity measured on a live system (CPU & memory)?
– APPLY
– Find out how your system is being used or abused.
– Find out how your system is performing as a whole.
– Find out how a particular process in the system is performing.
– Find out how a particular thread in the process is performing.
– Find the bottlenecks: what is scarce or missing?
5. Performance – Built, Fueled or Tuned
• Built (Implementation and Techniques)
– Binary search O(log n) is more efficient than linear search O(n) (see the sketch after this list)
– Caching can reduce disk I/O, significantly boosting performance
• Fueled (More Resources)
– Simply get a machine with more CPU(s) and memory if constrained
– Implement RAID to improve disk I/O
• Tuned (Settings and Configurations)
– Tune garbage collection to optimize Java processes
– Tune Oracle parameters to get optimum database performance
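To make the "built" point concrete, here is a minimal Java sketch (class and method names are illustrative, not from the deck) contrasting the two search strategies on a sorted array:

```java
import java.util.Arrays;

public class SearchDemo {
    // O(n): scans every element until a match is found.
    static int linearSearch(int[] sorted, int key) {
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i] == key) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] sorted = new int[1_000_000];
        for (int i = 0; i < sorted.length; i++) sorted[i] = i * 2;

        // O(log n): halves the search space on each probe (~20 probes here,
        // versus up to 1,000,000 comparisons for the linear scan).
        int viaBinary = Arrays.binarySearch(sorted, 1_999_998);
        int viaLinear = linearSearch(sorted, 1_999_998);
        System.out.println(viaBinary + " " + viaLinear); // same index, very different cost
    }
}
```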
6. Capacity and Load
• Load is an expectation placed on the system
– It is the rate of work that we put on the system.
– It is a factor external to the system.
– Load may vary with time and events.
– It has no upper cap; it can grow without bound.
• Capacity is the potential of the system
– It is the maximum rate of work the system supports efficiently, effectively & sustainably.
– It is a factor internal to the system.
– The maximum capacity of a system is finite and stays fairly constant.
– We often call throughput the system's capacity for load.
• Chemistry between Load & Capacity
– LOAD = CAPACITY? Good: expectation matches the potential. Hired.
– LOAD > CAPACITY? Bad: expectation exceeds the potential. Fired.
– LOAD < CAPACITY? Ugly: expectation is less than the potential. Find another one.
– If you can't be good, better to be ugly than bad.
7. Performance Measurement of a System
Measures of a System's Capacity
• Response Time or Latency
– Measures the time spent executing a request
• Round-trip time (RTT) for a transaction
– Good for understanding user experience
– Least scalable measure; developers focus on how much time each transaction takes
• Throughput
– Measures the number of transactions executed over a period of time
• Output transactions per second (TPS)
– A measure of the system's capacity for load
– Depending on the resource type, it could be a hit rate (for a cache)
• Resource Utilization
– Measures the use of a resource
• Memory, disk space, CPU, network bandwidth
– Helpful for system sizing; generally the easiest measurement to understand
– Throughput and response time can conflict, because resources are limited
• Locking, resource contention, container activity
A small measurement sketch follows.
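As a hedged illustration of the first two measures, this minimal Java sketch (doTransaction is a placeholder for real work, not anything from the deck) derives both average latency and throughput from one timed run:

```java
public class MeasureDemo {
    public static void main(String[] args) throws Exception {
        int transactions = 1_000;
        long start = System.nanoTime();
        for (int i = 0; i < transactions; i++) {
            doTransaction();
        }
        long elapsedNanos = System.nanoTime() - start;

        // Response time: average time spent executing one request.
        double avgLatencyMs = (elapsedNanos / 1e6) / transactions;
        // Throughput: transactions completed per second of wall time.
        double throughputTps = transactions / (elapsedNanos / 1e9);
        System.out.printf("latency=%.3f ms, throughput=%.1f tps%n",
                avgLatencyMs, throughputTps);
    }

    static void doTransaction() throws InterruptedException {
        Thread.sleep(1); // stand-in for real work
    }
}
```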
8. It is time for the System's Capacity to be Loaded with Work
(Throttling & Buffering Techniques)
• Nothing stops us from loading a system beyond its capacity (max throughput).
• Transactions per second is a misconception: real traffic may come in bursts.
– Receiving 3,600 transactions in an hour does not mean one was pumped in every second.
– We may have received them in bursts: all in the first 10 minutes and nothing in the last 50 minutes.
– So we really can't say at what TPS. We can regulate bursts with throttling and buffering.
• Throttling (implemented by the producer to smooth output)
– Spreads bursts over time to smooth the output from a process.
– We may add throttles to control the output rate from threads to each external interface.
– A throttle of 10 tps ensures the max output is 10 tps regardless of the load & capacity.
– Throttling is a scheme for producers (pace production to the rate the consumer can accept).
• Buffering (implemented by the consumer to smooth input)
– Spreads bursts over time to smooth the input from an external interface.
– We add buffering to control the input rate to threads from each external interface.
– The application processes input at 10 tps; load above that is buffered & processed later.
– Buffering is a scheme for consumers (take whatever is produced, consume at our own pace).
A sketch combining both techniques follows.
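A minimal Java sketch of both schemes, under illustrative assumptions (10 permits per second, a burst of 50 requests, queue capacity 1,000); none of these names come from the deck:

```java
import java.util.concurrent.*;

public class SmoothingDemo {
    public static void main(String[] args) {
        // BUFFERING (consumer side): bursts land in a queue and are
        // drained at the consumer's own pace.
        BlockingQueue<Runnable> buffer = new LinkedBlockingQueue<>(1_000);

        // THROTTLING (producer side): a semaphore refilled 10 times per
        // second caps output at ~10 tps regardless of incoming burst size.
        Semaphore permits = new Semaphore(0);
        ScheduledExecutorService refiller = Executors.newSingleThreadScheduledExecutor();
        refiller.scheduleAtFixedRate(permits::release, 0, 100, TimeUnit.MILLISECONDS);

        Thread producer = new Thread(() -> {
            for (int i = 0; i < 50; i++) {           // a burst of 50 requests
                final int id = i;
                try {
                    permits.acquire();               // throttle: wait for a permit
                    buffer.put(() -> System.out.println("processed " + id));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 50; i++) {
                    buffer.take().run();             // consume at our own pace
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                refiller.shutdown();
            }
        });

        producer.start();
        consumer.start();
    }
}
```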
9. Supply Chain Principle
(Apply it to define an optimum Thread Pool Size)
• The more throughput you want, the more resources you will consume.
• You may apply this principle to define the optimum thread-pool size for a system/application.
– To support a throughput of (t) transactions per second: (t) = 20 tps
– Where each transaction takes (d) seconds to complete: (d) = 5 seconds
– We need at least (d*t) threads (the minimum thread-pool size): (d*t) = 100 threads
• A thread is an abstract unit of CPU resource here. The sketch below turns this rule into code.
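A minimal sketch of the sizing rule using the slide's own numbers (the class name is illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        double targetTps = 20.0;      // t: required throughput
        double latencySeconds = 5.0;  // d: time one transaction occupies a thread
        // d*t = 100: fewer threads than this cannot sustain 20 tps
        // when each request holds a thread for 5 seconds.
        int minThreads = (int) Math.ceil(targetTps * latencySeconds);

        ExecutorService pool = Executors.newFixedThreadPool(minThreads);
        System.out.println("sized pool to " + minThreads + " threads");
        pool.shutdown();
    }
}
```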
11. Quantify Resource Consumption
Utilization & Saturation
• Resource Utilization
– Utilization measures how busy a resource is.
– It is usually represented as a percentage average over a time interval.
• Resource Saturation
– Saturation is often a measure of work that has queued waiting for the resource
– It can be measured as both
• As an average over time
• And at a particular point in time.
– For some resources that do not queue, saturation may be synthesized from error counts.
• For example, page faults reveal memory saturation.
• Load (the input rate of requests) is an independent/external variable.
• Resource consumption and throughput (the output rate of responses) are dependent/internal variables, a function of load.
12. How are Load, Resource Consumption and Throughput Related?
• As load increases, throughput increases, until maximum resource utilization on the
bottleneck device is reached. At this point, maximum possible throughput is
reached, Saturation occurs.
• Then, queuing (waiting for saturated resources) starts to occur.
• Queuing typically manifests itself by degradation in response times.
• This phenomenon is described by Little’s Law:
L=X*R
L (LOAD), X (THROUGHPUT) and R (RESPONSE TIME)
• As L increases, X increases (R also increases slightly, because there is always some
level of contention at the component level).
• At some point, X reaches Xmax – the maximum throughput of the system. At this
point, as L continues to increase, the response time R increases in proportion and
throughput may then start to decrease, both due to resource contention. A small worked example follows.
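A hedged worked example of Little's Law in Java, reusing the numbers from the thread-pool slide (20 tps, 5 s):

```java
public class LittlesLaw {
    public static void main(String[] args) {
        double x = 20.0;  // X: throughput, transactions per second
        double r = 5.0;   // R: response time, seconds
        double l = x * r; // L: average number of requests in the system
        System.out.println("L = X * R = " + l + " concurrent requests");
        // Once X hits Xmax, any further increase in L cannot raise X;
        // it can only show up as queuing, i.e. as growth in R.
    }
}
```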
14. Example
How are Throughput and Resource Consumption Related?
• Throughput & latency can have an inverse or a direct relationship
– Concurrent tasks (threads) often contend for resources (locking & contention)
• Single-threaded: higher throughput = lower latency
– Consistent throughput; does not increase with incoming load & resources
– Processes serially; good for batch jobs
– Response time varies linearly with request order
• Multi-threaded: higher throughput = higher latency (most of the time)
– Throughput may increase linearly with load, but starts to drop after a threshold
– Processes concurrently; good for interactive modules (web apps)
– Near-consistent response time; varies with load rather than request order
Single Threaded – 10 CPU(s):
Threads = 1, Latency = 0.1 s, Throughput = 1/0.1 = 10 tx/sec
Threads = 1, Latency = 0.001 s, Throughput = 1/0.001 = 1000 tx/sec
Multi Threaded – 10 CPU(s):
Threads = 10, Latency = 0.1 s, Throughput = (1/0.1) * 10 = 100 tx/sec
Threads = 100, Latency = 0.2 s, Throughput = (1/0.2) * 100 = 500 tx/sec
15. Producer Consumer Principle
Predicting Maximum Throughput
Identify Bottleneck Device/Resource
• The Utilization Law: Ui = T * Di
• Where Ui is the percentage of utilization of a device in the application, T is the application
throughput, and Di is the service demand of the application device.
• The maximum throughput of an application Tmax is limited by the maximum service demand of all
of the devices in the application.
• EXAMPLE - A load test reports 200 kb/sec average throughput:
CPUavg = 80% Dcpu = 0.8 / 200 kb/sec = 0.004 sec/kb
Memoryavg = 30% Dmemory = 0.3 / 200 kb/sec = 0.0015 sec/kb
Diskavg = 8% Ddisk = 0.08 / 200 kb/sec = 0.0004 sec/kb
Network I/Oavg = 40% Dnetwork I/O = 0.4 / 200 kb/sec = 0.002 sec/kb
• In this case, Dmax corresponds to the CPU. So, the CPU is the bottleneck device.
• We can use this to predict the maximum throughput of the application by setting the CPU utilization to
100% and dividing by Dcpu. In other words, for this example:
Tmax = 1 / Dcpu = 250 kb/sec
• In order to increase the capacity of this application, it would first be necessary to increase CPU capacity.
Increasing memory, network capacity or disk capacity would have little or no effect on performance until
after CPU capacity has been increased sufficiently.
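The slide's arithmetic, recast as a small hedged Java sketch (the data mirrors the example above; nothing here is a real measurement API):

```java
public class BottleneckPredictor {
    public static void main(String[] args) {
        double throughput = 200.0; // kb/sec measured in the load test
        // Measured average utilizations (as fractions) per device.
        double[] utilization = {0.80, 0.30, 0.08, 0.40};
        String[] device = {"cpu", "memory", "disk", "network"};

        double dmax = 0;
        String bottleneck = "";
        for (int i = 0; i < device.length; i++) {
            double demand = utilization[i] / throughput; // Di = Ui / T (sec/kb)
            System.out.printf("D%s = %.4f sec/kb%n", device[i], demand);
            if (demand > dmax) { dmax = demand; bottleneck = device[i]; }
        }
        // Tmax = 1 / Dmax: throughput at which the bottleneck hits 100%.
        System.out.printf("bottleneck=%s, Tmax=%.0f kb/sec%n", bottleneck, 1 / dmax);
    }
}
```

Run as written, this prints Dcpu = 0.0040 sec/kb and Tmax = 250 kb/sec, matching the slide.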
16. Work Pools & Thread Pools
Working Together
• Work Pools are queues of work to be performed by a software application or component (see the sketch after this list).
– If all threads in the thread pool are busy, incoming work can be queued in the work pool
– Threads from the thread pool, when freed, can execute them later
• Work Pools absorb congestion & smooth out bursts
– A queue consisting of units of work to be performed
– CONGESTION: by allowing the current (client) threads to submit work and return
– BURSTS: over-capacity transactions can be buffered in the work pool and executed later
– Allow for caching of units of work to reduce system-intensive calls
• Can perform a bulk fetch from a database instead of fetching one record at a time
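A minimal Java sketch of a work pool feeding a thread pool (the sizes are illustrative assumptions):

```java
import java.util.concurrent.*;

public class WorkPoolDemo {
    public static void main(String[] args) {
        // Work pool: a bounded queue of pending units of work.
        BlockingQueue<Runnable> workPool = new ArrayBlockingQueue<>(1_000);

        // Thread pool: 10 workers drain the work pool; while all 10 are
        // busy, newly submitted work simply waits in the queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                10, 10, 0L, TimeUnit.MILLISECONDS, workPool);

        for (int i = 0; i < 100; i++) {
            final int id = i;
            // Client threads submit and return immediately; congestion is
            // absorbed by the work pool rather than blocking the caller.
            pool.execute(() -> System.out.println("unit " + id));
        }
        pool.shutdown();
    }
}
```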
17. Queuing Tasks may be risky
• One task could lock up another that would be able to continue if the queued task were to run.
• Queuing can smooth incoming traffic bursts that are limited in time (depending on the traffic rate and queue size).
• It fails if traffic arrives, on average, faster than it can be processed.
• In general, work pools live in memory, so it is important to understand the impact of restarting a system, as in-memory elements will be lost.
– Is it acceptable to lose the queued work?
– Is the queue backed up on disk?
18. Bounded & Unbounded Pools
(Load Shedding)
• If not bounded, pools can grow freely but can cause the system to exhaust resources.
– Unbounded work pool / queue (may overload memory/heap & crash)
• Each work object in the queue holds its space until consumed
– Unbounded thread pool (may overload CPU / native space & crash)
• Each thread asks to be scheduled on a CPU and consumes native stack space
• If the queue size is bounded, incoming execute requests block when it is full. We can apply different policies to handle it, for example:
– Reject if there is no space (can have side effects)
– Remove based on priority (e.g., priority may be a function of time: timeouts)
• Thread pools can apply different policies when the work pool is full:
– Block until space is available, i.e., starve (VERY BAD, but sometimes needed)
– Run in the current thread (very dangerous!)
A Java sketch of a bounded pool with such policies follows.
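A hedged sketch using the JDK's ThreadPoolExecutor, whose built-in rejection handlers map onto the policies above (pool and queue sizes are illustrative):

```java
import java.util.concurrent.*;

public class SaturationPolicyDemo {
    public static void main(String[] args) {
        // Bounded work pool: at most 2 queued units before the policy kicks in.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(2),
                new ThreadPoolExecutor.AbortPolicy());    // reject when full
        // Alternatives:
        //   new ThreadPoolExecutor.DiscardOldestPolicy() // drop the oldest queued unit
        //   new ThreadPoolExecutor.CallerRunsPolicy()    // run in the current thread
        //     (the "very dangerous" option: the caller stalls doing the work)

        for (int i = 0; i < 10; i++) {
            try {
                pool.execute(() -> sleepQuietly(100));
            } catch (RejectedExecutionException e) {
                System.out.println("rejected: work pool is full");
            }
        }
        pool.shutdown();
    }

    static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```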
19. Work pool & thread pool sizes can
often be traded off for each other
Large work pool and small thread pool:
– Minimizes CPU usage, OS resources, and context-switching overhead.
– Can lead to artificially low throughput, especially if tasks frequently block (e.g., I/O-bound).
Small work pools generally require larger thread pools:
– Keeps the CPUs busier.
– May cause scheduling overhead (context switching) and may lessen throughput, especially if there are few CPUs.
21. CPU
• Many modern systems from Sun boast numerous CPUs or virtual CPUs
(which may be cores or hardware threads).
• The CPUs are shared by applications on the system, according to a policy
prescribed by the operating system and scheduler
• If the system becomes CPU resource limited, then application or kernel
threads have to wait on a queue to be scheduled on a processor,
potentially degrading system performance.
• The time spent on these queues, the length of these queues and the
utilization of the system processor are important metrics for quantifying
CPU-related performance bottlenecks.
22. Process – User and Kernel Level
Threads
• Process includes the set of executable programs, address
space, stack, and process control block. One or more threads
may execute the program(s).
• User-level threads (threads library)
– Invisible to the OS; maintained by a thread library
– The interface for application parallelism
• Kernel threads
– The unit that can be dispatched on a processor; its structures are maintained by the kernel
• Lightweight processes (LWP)
– Each LWP supports one or more user-level threads and maps to exactly one kernel-level thread. It maintains the state of a thread.
25. User Thread over a Solaris LWP
State of User Thread and LWP may be different
26. Solaris Threading Model
If you are in a thread, the thread library must schedule it on an LWP.
Each LWP has a kernel thread, which schedules it on a CPU.
Threading models are used between LWPs & Solaris Threads
28. JVM Memory Organization & Threads
• Method Area
– JVM loads the class file, their type info and binary data in this area
– This memory area is shared by all threads
• Heap Area
– JVM places all objects the program instantiates onto the heap
– This memory area is shared by all threads
– This memory can be adjusted by VM options -Xmx & -Xms as required
• Java Stack and Program Counter (PC) Register
– Each new thread that executes, gets its own pc register & Java stack.
– The value of the pc register indicates the next instruction to execute.
– A thread's Java stack stores the state of Java method invocations for the
thread. The state of a Java method invocation includes
• its local variables & the parameters with which it was invoked,
• its return value (if any), and intermediate calculations.
– This memory may be adjusted by VM option –Xss, typically 1m for RK Apps
– The state of native method (JVM method) invocations is stored in an
implementation-dependent way in native method stacks, as well as possibly in
registers or other implementation-dependent memory areas.
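The heap options above can also be observed from inside the process; a minimal hedged sketch using the standard Runtime API (run it with flags such as -Xms64m -Xmx256m to see the mapping):

```java
public class HeapBounds {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        // maxMemory() reflects -Xmx; totalMemory() is what the JVM has
        // currently reserved (grows from -Xms toward -Xmx).
        System.out.println("max heap   (-Xmx): " + rt.maxMemory() / mb + " MB");
        System.out.println("committed heap   : " + rt.totalMemory() / mb + " MB");
        System.out.println("free of committed: " + rt.freeMemory() / mb + " MB");
    }
}
```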
29. A Java thread’s Stack Memory
• The Java stack is composed of stack frames (or frames).
• A stack frame contains the state of one Java method invocation.
– When a thread invokes a method, the Java virtual
machine pushes a new frame onto that thread's
Java stack.
– When the method completes, the virtual machine
pops and discards the frame for that method.
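A minimal sketch that makes this push/pop behavior visible by exhausting the stack (class name illustrative; the final frame count depends on the -Xss setting):

```java
public class FrameDemo {
    static int depth = 0;

    // Each call pushes one new frame (locals + parameters + return state)
    // onto this thread's Java stack; the frame is popped on return.
    static void recurse() {
        depth++;
        recurse();
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            // The stack (sized by -Xss) filled up with frames.
            System.out.println("stack overflowed after " + depth + " frames");
        }
    }
}
```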
30. Thread Modes
Kernel & User Mode Privilege
• A LWP may either execute in kernel (sys) or user (usr) privilege mode.
• Operations like processing data in local memory and intra-process communication between threads of the same process do not require kernel-mode privilege for the thread executing the user program.
• However, inter-process communication and hardware access are done by kernel programs, so the executing thread requires kernel-mode privilege.
• User programs call kernel programs by making system calls.
• An LWP runs in user mode until it makes a system call that requires kernel-mode privilege. The mode switch then happens, which is costly.
32. Complete Process State Diagram
The state of a process is a superset of its thread states.
A process's state is defined by the states of its threads.
33. VMSTAT – Glimpse of CPU Behavior
The vmstat tool provides a glimpse of the system's behavior; one line indicates both CPU utilization and saturation.
The first line is the summary since boot, followed by samples every five seconds.
On the far right is cpu:id, the percent idle, which lets us determine how utilized the CPUs are.
In this example, the idle time for the 5-second samples was always 0, indicating 100% utilization.
On the far left is kthr:r, the total number of threads on the ready-to-run queues.
If the value is more than the number of CPUs, it indicates CPU saturation.
Here, kthr:r was mostly 2 and sustained, indicating modest saturation for this single-CPU server. A value of 4 would indicate high saturation.
34. More about VMSTAT
Count – Description
kthr:r – Total number of runnable threads on the dispatcher queues
faults:in – Number of interrupts per second
faults:sy – Number of system calls per second
faults:cs – Number of context switches per second, both voluntary and involuntary
cpu:us – Percent user time; time the CPUs spent processing user-mode threads
cpu:sy – Percent system time; time the CPUs spent processing system calls on behalf of user-mode threads, plus the time spent processing kernel threads
cpu:id – Percent idle; time the CPUs are waiting for runnable threads. This value can be used to determine CPU utilization
35. CPU Utilization
• You can calculate CPU utilization from vmstat by subtracting id from 100 or by adding us
and sy.
• 100% utilized may be fine—it can be the price of doing business.
• When a Solaris system hits 100% CPU utilization, there is no sudden dip in performance;
the performance degradation is gradual. Because of this, CPU saturation is often a
better indicator of performance issues than is CPU utilization.
• The measurement interval is important: 5% utilization sounds close to idle; however, for
a 60-minute sample it may mean 100% utilization for 3 minutes and 0% utilization for
57 minutes. It is useful to have both short- and long-duration measurements.
A server running at 10% CPU utilization sounds like 90% of the CPU is available for
"free," that is, it could be used without affecting the existing application. This isn't quite
true. When an application on a server with 10% CPU utilization wants the CPUs, they
will almost always be available immediately. On a server with 100% CPU utilization, the
same application will find that the CPUs are already busy—and will need to preempt
the currently running thread or wait to be scheduled. This can increase latency.
36. CPU Saturation
• The kthr:r metric from vmstat is useful as a measure for CPU saturation.
However, since this is the total across all the CPU run queues, divide kthr:r
by the CPU count for a value that can be compared with other servers.
• Any sustained non-zero value is likely to degrade performance. The
performance degradation is gradual (unlike the case with memory
saturation, where it is rapid).
• Interval time is still quite important. It is possible to see CPU saturation (kthr:r) while a CPU is idle (cpu:id). You may find that the run queue is quite long for a short period of time, followed by idle time. Averaging over the interval gives both a non-zero run queue length and idle time.
37. Solaris Performance Tools
Tool Uses Description
vmstat kstat For an initial view of overall CPU behavior
psrinfo kstat For physical CPU properties
uptime getloadavg() For the load averages, to gauge recent CPU activity
sar kstat, sadc For overall CPU behavior and dispatcher queue statistics; sar also allows historical data collection
mpstat kstat For per-CPU statistics
prstat procfs To identify process CPU consumption
dtrace DTrace For detailed analysis of CPU activity, including scheduling events and dispatcher analysis
38. uptime Command
Prints the uptime along with CPU load averages, which represent both utilization and saturation of the CPUs.
• The numbers are the 1-, 5-, and 15-minute load averages.
• The load average is often approximated as the average number of runnable and running threads, which is a reasonable description.
• A value equal to your CPU count usually means 100% utilization; less than your CPU count is proportionally less than 100% utilization; and greater than your CPU count is a measure of saturation.
• A consistent load average higher than your CPU count may cause degraded performance. Solaris handles CPU saturation very well, so load averages should not be used for anything more than an initial approximation of CPU load.
The sketch below reads the same numbers from inside a Java process.
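A minimal hedged sketch using the standard Java management API (the "saturated" label is an illustrative simplification of the rule above):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadAverage {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        int cpus = os.getAvailableProcessors();
        // 1-minute load average; returns -1 where unavailable.
        double load = os.getSystemLoadAverage();
        System.out.printf("load=%.2f over %d CPUs -> %s%n", load, cpus,
                load > cpus ? "saturated" : "headroom");
    }
}
```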
39. sar - The system activity reporter
Provides live statistics, or can be activated to record historical CPU statistics; prints the user (%usr), system (%sys), wait I/O (%wio), and idle (%idle) times.
It identifies long-term patterns that may be missed when taking a quick look at the system. Historical data also provides a reference for what is "normal" for your system.
The following example shows the default output of sar, which is
also the -u option to sar. An interval of 1 second and a count of
5 were specified.
40. sar –q - Statistics on the run queues
runq-sz (run queue size). Equivalent to the kthr:r field from vmstat; can be
used as a measure of CPU saturation
swpq-sz (swapped-out queue size). Number of swapped-out threads. Swapping
out threads is a last resort for relieving memory pressure, so this field will be
zero unless there was a dire memory shortage.
%runocc (run queue occupancy). Helps prevent a danger when intervals are
used, that is, short bursts of activity can be averaged down to unnoticeable
values. The run queue occupancy can identify whether short bursts of run queue
activity occurred
%swpocc (swapped-out occupancy). Percentage of time there were swapped-out threads. If one thread of a process is swapped out, all other threads of the process must be too.
41. Is my system performing well?
About the Individual Processors
The psrinfo -v command determines the number of processors in the system and their speed. In Solaris 10, -vp prints additional information.
The mpstat command summarizes the utilization statistics for each CPU. Following is an example of a four-CPU machine, being sampled every 1 second. Key columns:
syscl (system calls), csw (context switches), icsw (involuntary context switches), migr (migrations of threads between processors), intr (interrupts), ithr (interrupts as threads), smtx (kernel mutexes), srw (kernel reader/writer mutexes)
42. What are sampling and Clock tick
woes?
• While most counters you see in Solaris are highly accurate, sampling issues remain in a few minor places. In particular, the run queue length as seen from vmstat (kthr:r) is based on a sample that is taken every second. For example, a problem was once caused by a program that deliberately created numerous short-lived threads every second, such that the one-second run queue sample usually missed the activity.
• The runq-sz from sar -q suffers from the same problem, as does %runocc (short-interval measurements defeat its purpose).
• These are all minor issues, and a valid workaround is to use DTrace, with which statistics can be created at any accuracy desired.
43. Who Is Using the CPU?
The default output from the prstat command shows one line per process, including each process's recent CPU utilization.
The system load average indicates the demand and queuing for CPU resources averaged over 1-, 5-, and 15-minute periods; if it exceeds the number of CPUs, the system is overloaded.
44. How is the CPU being consumed?
• Use the options -m (show microstates) & -L (show per-thread) to observe per-thread microstates.
• Microstates represent a time-based summary, broken into percentages, for each thread.
• USR through LAT sum to 100% of the time spent by each thread during the prstat sample.
• USR (user time) and SYS (system time) are the time the thread spent running on the CPU.
• LAT (latency) is the amount of time the thread spent waiting for a CPU. A non-zero number means there was some queuing/saturation for CPU resources.
• SLP indicates the time the thread spent blocked, waiting for events like disk I/O.
• TFL & DTL determine if, and how much, the thread is waiting for memory paging.
• TRP indicates the time spent on software traps.
Each thread is waiting for a CPU about 0.2% of the time: CPU resources are not constrained.
Each thread is waiting for a CPU about 80% of the time: CPU resources are constrained.
45. How are threads inside the process performing?
The example shows that thread number two in the target process is using the most CPU, yet spending 83% of its time waiting for a CPU. We can look further at thread number two with the pstack <pid>/<LWPID> command; pstack <pid> alone shows all threads.
Take a Java thread dump and identify the thread with native thread id = 2. This is the one. This way you can relate the Java code to the native system call or library method it invoked on the system.
46. Process Stack on a Java Virtual
Machine: pstack
• Use the "C++ stack unmangler" with Java virtual machine (JVM) targets to see the Java function calls within the native C stack.
47. Tracing Processes
truss
truss traces system calls made on behalf of a process. It includes the user LWP
(thread) number, system call name, arguments and return codes for each system call.
The truss -c option counts system calls instead of tracing each one.
48. Why memory saturation brings a more rapid degradation in performance than CPU saturation
• Memory saturation may cause rapid degradation in performance. To overcome saturation, the OS resorts to paging in/out and swapping, which are themselves heavy tasks; with processes competing for memory, a race condition may occur.
• The available memory on a server may be artificially constrained, either
through pre-allocation of memory or through the use of a garbage
collection mechanism that doesn’t free up memory until some threshold is
reached.
49. Thread Dumps
• What exactly is a "thread dump"?
– A thread dump basically gives you information on what each thread in the VM is doing at any given point in time.
• If an application seems stuck, or is running out of resources, a thread dump will reveal
the state of the server. Java's thread dumps are a vital tool for server debugging. For
scenarios like
– PERFORMANCE RELATED ISSUES
– DEADLOCK (SYSTEM LOCKS UP)
– TIMEOUT ISSUES
– SYSTEM STOPS PROCESSING TRAFFIC
50. Thread dumps in Redknee Applications
• Java thread dumps are obtained by doing:
– Send kill -3 <pid> on Unix: see the thread dump in the ctl logs
– Press Ctrl + Break on Windows: see the thread dump on the xbuild console
– Run $JAVA_HOME/bin/jstack <pid>: see the thread dump on the shell console
• Java thread dumps list all of the threads in an application
• Threads are output in the order they were created, with the newest thread at the top
• Threads should be named with a useful name describing what they do or what they are responsible for (Open Tickets)
A dump can also be taken programmatically, as the sketch below shows.
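A minimal hedged sketch using the standard ThreadMXBean API, which exposes the same per-thread information as kill -3 or jstack:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class SelfDump {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Every live thread with its name, state, stack trace, and any
        // monitors/synchronizers held (the two boolean arguments).
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info);
        }
    }
}
```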
51. Common Threads in Redknee
• "Idle"
– CORBA threads to handle incoming requests, which are currently not doing any work
• "RMI TCP Connection(<port>)-<IP>"
– Outbound connection over RMI to a specific host and port
• "FileLogger"
– Framework thread for logging
• "JavaIDL Reader for <host>:<port>"
– CORBA thread reading requests from a server
• "TP-Processor8"
– Tomcat web thread
• "Thread-<#>"
– Thread that has not been named (BAD)
• "ChannelHome ForwardingThread"
– Thread used to cluster transactions over to a peer
– One of these threads per Home that is clustered (DB table)
• "Worker#1"
– Worker threads doing work
52. Thread Dump May Give you Clues
C:\learn\classes>java Test

Full thread dump Java HotSpot(TM) Client VM (1.4.2_04-b05 mixed mode):

"Signal Dispatcher" daemon prio=10 tid=0x0091db28 nid=0x744 waiting on condition [0..0]

"Finalizer" daemon prio=9 tid=0x0091ab78 nid=0x73c in Object.wait() [1816f000..1816fd88]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x10010498> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(Unknown Source)
    - locked <0x10010498> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(Unknown Source)
    at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)

"Reference Handler" daemon prio=10 tid=0x009196f0 nid=0x738 in Object.wait() [1812f000..1812fd88]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x10010388> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Unknown Source)
    at java.lang.ref.Reference$ReferenceHandler.run(Unknown Source)
    - locked <0x10010388> (a java.lang.ref.Reference$Lock)

"main" prio=5 tid=0x00234998 nid=0x4c8 runnable [6f000..6fc3c]
    at Test.findNewLine(Test.java:13)
    at Test.<init>(Test.java:4)
    at Test.main(Test.java:20)

"VM Thread" prio=5 tid=0x00959370 nid=0x6e8 runnable

"VM Periodic Task Thread" prio=10 tid=0x0023e718 nid=0x74c waiting on condition

"Suspend Checker Thread" prio=10 tid=0x0091cd58 nid=0x740 runnable
53. What is there in the Thread Dump?
• In this case we can see that, at the time we took the thread dump, there were seven threads:
– Signal Dispatcher
– Finalizer
– Reference Handler
– main
– VM Thread
– VM Periodic Task Thread
– Suspend Checker Thread
• Each thread name is followed by whether the thread is a daemon thread or not.
• Then comes prio, the priority of the thread [ex: prio=5].
• tid and nid are the Java thread id and the native thread id.
• Then follows the state of the thread, which is one of:
– Runnable [marked as R in some VMs]: the thread is either running currently or is ready to run the next time the OS
thread scheduler schedules it.
– Suspended [marked as S in some VMs]: the thread is not in a runnable state (for example, suspended by a debugger).
– Object.wait() [marked as CW in some VMs]: the thread is waiting on an object using Object.wait().
– Waiting for monitor entry [marked as MW in some VMs]: the thread is waiting to enter a synchronized block.
• What follows the thread description line is a regular stack trace.
54. Threads in a Dead-Lock
• A set of threads is said to be in a deadlock when there is a cyclic wait condition, i.e., each thread in the
deadlock is waiting on a resource locked by some other thread in the set of deadlocked threads. Newer
JDKs detect them automatically, as below; a minimal program producing such a deadlock is sketched after the dump.
Found one Java-level deadlock:
=============================
"Thread-1":
  waiting to lock monitor 0x0091a27c (object 0x140fa790, a java.lang.Class),
  which is held by "Thread-0"
"Thread-0":
  waiting to lock monitor 0x0091a25c (object 0x14026800, a java.lang.Class),
  which is held by "Thread-1"

Java stack information for the threads listed above:
===================================================
"Thread-1":
  at Deadlock$2.run(Deadlock.java:48)
  - waiting to lock <0x140fa790> (a java.lang.Class)
  - locked <0x14026800> (a java.lang.Class)
"Thread-0":
  at Deadlock$1.run(Deadlock.java:33)
  - waiting to lock <0x14026800> (a java.lang.Class)
  - locked <0x140fa790> (a java.lang.Class)

Found 1 deadlock.
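A minimal sketch (assumed, not the original source) of code that produces a report like the one above: two anonymous Runnables (compiled as Deadlock$1 and Deadlock$2) lock two Class objects in opposite order, so each usually ends up holding the lock the other needs:

    public class Deadlock {
        public static void main(String[] args) {
            new Thread(new Runnable() {               // compiled as Deadlock$1
                public void run() {
                    synchronized (String.class) {
                        pause();                      // let the other thread take Integer.class
                        synchronized (Integer.class) { }  // blocks: held by the other thread
                    }
                }
            }, "Thread-0").start();
            new Thread(new Runnable() {               // compiled as Deadlock$2
                public void run() {
                    synchronized (Integer.class) {
                        pause();
                        synchronized (String.class) { }   // blocks: cyclic wait => deadlock
                    }
                }
            }, "Thread-1").start();
        }

        private static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
        }
    }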
56. Memory
• Memory includes
physical memory (RAM)
swap space
• Swap space is a portion of storage acting as an extension of memory.
• Memory is a more complicated subject than CPU.
• Memory saturation triggers CPU saturation (page faults / GC).
57. Memory Utilization and Saturation
• To sustain a higher throughput, an application spawns more threads
and holds the request data.
• Each thread occupies memory for the data it operates on and for its own
stack.
• At the point where the memory demanded by a process can no longer be
met from available memory, saturation occurs.
• Sudden increases in utilization without accompanying increases in
throughput can also be used to detect degraded performance
modes caused by software 'aging' issues, such as memory leaks.
58. VMSTAT – Glimpse of Memory
Utilization
If the scan rate (sr) is continuously over 200 pages per second then there
is a memory shortage on the system.
Counter Description
swap Available swap space in Kbytes.
free Combined size of the cache list and free list.
re Page reclaims—The number of pages reclaimed from the cache list.
mf Minor faults—The number of pages attached to an address space.
fr Page-frees—Kilobytes that have been freed
pi and po Kilobytes Paged in and Paged out respectively
de Anticipated short-term memory shortfall, in kilobytes, to be freed ahead of demand.
sr The number of pages scanned by the page scanner per second.
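These counters come from running vmstat with a sampling interval; for example (Solaris):

    vmstat 5       # one summary line every 5 seconds (the first line is the since-boot average)
    vmstat -p 5    # per-type paging detail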
60. Relieving Memory Pressure
When free memory is exhausted, pages are first reclaimed from the cache list (filesystem, I/O, and
similar caches). Next the swapper swaps out entire threads, seriously degrading the
performance of the swapped-out applications. The page scanner selects pages to free,
and its activity is characterized by the scan rate (sr) in vmstat. Both use some form
of the Not Recently Used algorithm.
The swapper and the page scanner are only used when appropriate. Since
Solaris 8, the cyclic page cache, which maintains lists for Least Recently
Used selection, is preferred.
61. Heap and Non-Heap Memory
• Heap Memory
Storage for Java objects
-Xmx<size> & -Xms<size>
• Non-Heap Memory
Per-class structures such as the runtime constant pool, field and method data,
code for methods and constructors, as well as interned Strings
Stores loaded classes and other metadata
The JVM code itself, JVM internal structures, loaded profiler agent code and data, etc.
-XX:MaxPermSize=<size>
• Other
Space the system/OS takes for the process
Thread stacks (-Xss & -Xoss)
System & native space
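A hypothetical launch line combining the controls above (values illustrative; MaxPermSize applies to pre-JDK 8 VMs; MyApp is a placeholder):

    java -Xms512m -Xmx512m -XX:MaxPermSize=256m -Xss512k MyApp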
62. What is Garbage Collection?
Reclaiming memory from objects that are no longer accessible.
63. Stack Overflow or Out of Memory
• If you see OutOfMemoryError: unable to create native thread
– Your application is falling short of native memory space (C space)
– Either there is insufficient memory to allocate the stack and control structures for the new thread
– Or the application has crossed the JVM's memory limit (about 3.2 GB in a 32-bit environment)
– The JVM/application hangs with this error; a restart is needed
• See if you can reduce the number of active threads that ate away the system's memory
• Or see if you can decrease the stack size to reduce memory use per thread
• If you can't bring memory consumption down, you need more system memory
• If you see StackOverflowError
– It means the thread that threw this error ran short of stack memory
space
– A thread pushes the state of each method it invokes onto its stack memory
– The stack is too small for the number of nested invocations the thread
makes
– Only the thread dies with this error; the application doesn't hang
• See if you can bring down the number of nested invocations made by the thread
• Or else, increase the stack size with the VM option -Xss (typically 1 MB by default)
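A minimal sketch demonstrating the stack-limit behaviour described above (class name illustrative); run it with different -Xss values and the reported depth shrinks with the stack size:

    public class StackDepth {
        static int depth = 0;

        public static void main(String[] args) {
            try {
                recurse();
            } catch (StackOverflowError e) {
                // only this thread's stack is exhausted; the JVM itself keeps running
                System.out.println("StackOverflowError at depth " + depth);
            }
        }

        static void recurse() {
            depth++;
            recurse();  // unbounded nesting eventually exceeds the thread stack
        }
    }

For example, compare java -Xss256k StackDepth with java -Xss2m StackDepth.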
64. Pros and Cons of Garbage Collection?
Advantages:
• Increased reliability
• Easier to write complex apps
• No memory leaks or invalid pointers
Disadvantages:
• Unpredictable application pauses
• Increased CPU/memory utilization
• Brutally complex
65. GC Logging
• Java garbage collection activity may be recorded in a log
file. VM options:
– -verbose:gc (enable GC logging; outputs to stdout)
– -Xloggc:<file> (GC logging to a file)
– -XX:+PrintGCDetails (detailed GC records)
– -XX:+PrintGCDateStamps (absolute instead of relative timestamps)
– Note: from relative timestamps in a GC log we can find absolute times either by tracing forward from
the application/GC start or backwards from the application/GC stop
• Asynchronous garbage collection occurs automatically whenever
available memory is low.
• System.gc() does not force a synchronous garbage
collection; it just gives a hint to the VM. VM option:
– -XX:+DisableExplicitGC - disable explicit GC
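A typical invocation combining these flags might look like this (log file name and main class are illustrative):

    java -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps MyApp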
66. What to look for in GC Logs?
• Important information from GC logs
– The size of the heap after garbage collection
– The time taken to run the garbage collection
– The number of bytes reclaimed by garbage collection
• The heap size after GC gives a good idea of the live-data size, and hence the
memory requirement.
– 36690K->35325K(458752K), 4.3713348 secs – (1365K reclaimed)
• The other two help us assess the cost of GC to your
application.
• All of them together help us tune GC.
67. How to Calculate Impact of GC on your
Application?
• Run a test (60 sec; collect GC logs)
– 36690K->35325K(458752K), 4.3713348 secs – (1365K reclaimed)
– 42406K->41504K(458752K), 4.4044878 secs – (902K reclaimed)
– 48617K->47874K(458752K), 4.5652409 secs – (770K reclaimed)
• Measure
– Out of 60 sec, GC ran for about 13.3 sec, i.e. roughly 22% of the time.
– Considering relative CPU utilization, the cost of GC may be even higher.
– 3037K of memory was recycled in 60 secs, i.e. 51831 bytes/second.
• Analyze
– 22% of the time consumed by GC is too high (it should be between 5% and 15%).
– Is 51831 bytes/sec of recycled memory justifiable for the operations performed?
– At an average object size of 50 bytes, that is a churn of around 1036 objects/sec.
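The measurement above can be automated with a small parser; a minimal sketch, assuming the simple "-verbose:gc" line format shown above:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class GcOverhead {
        public static void main(String[] args) {
            String[] lines = {
                "36690K->35325K(458752K), 4.3713348 secs",
                "42406K->41504K(458752K), 4.4044878 secs",
                "48617K->47874K(458752K), 4.5652409 secs",
            };
            // before-K -> after-K (total-K), pause seconds
            Pattern p = Pattern.compile("(\\d+)K->(\\d+)K\\(\\d+K\\), ([\\d.]+) secs");
            double gcSeconds = 0;
            long reclaimedK = 0;
            for (String line : lines) {
                Matcher m = p.matcher(line);
                if (m.find()) {
                    reclaimedK += Long.parseLong(m.group(1)) - Long.parseLong(m.group(2));
                    gcSeconds += Double.parseDouble(m.group(3));
                }
            }
            double windowSeconds = 60.0;  // length of the test window
            System.out.printf("GC time %.1fs (%.0f%% of %.0fs), reclaimed %dK%n",
                    gcSeconds, 100 * gcSeconds / windowSeconds, windowSeconds, reclaimedK);
        }
    }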
68. Heap Ranges – Xms to Xmx
• The heap range can be defined
– VM args -Xmx & -Xms define the upper and lower bounds of the heap size
• What causes the VM to expand the heap?
– Expanding the heap is CPU intensive and can leave the heap fragmented.
– The VM tries GC, defragmentation, compaction, etc. to free up memory.
– If it is still unable to free the required memory, the VM decides to expand the heap.
– The VM may not wait until the brink; it keeps some free space for temporary objects.
– By default, Sun tries to keep the proportion of free space to living objects at each
garbage collection within the 40%-70% range.
• If less than 40% of the heap is free after GC, expand the heap
• If more than 70% of the heap is free after GC, contract the heap
– VM args that customize the default ratio:
• -XX:MinHeapFreeRatio
• -XX:MaxHeapFreeRatio
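For illustration, the ratio flags are passed like any other -XX option (the values shown are the defaults; heap sizes and MyApp are placeholders):

    java -Xms256m -Xmx1024m -XX:MinHeapFreeRatio=40 -XX:MaxHeapFreeRatio=70 MyApp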
69. Gross Heap Tuning
• Consequences of large heap sizes
– GC cycles occur less frequently, but each sweep takes longer.
– Long GC cycles may induce perceptible pauses in the system.
– If the heap grows larger than the available RAM, paging/swapping may occur.
• Consequences of small heap sizes
– GC runs too frequently, with less recovery in each cycle.
– The cost of GC increases.
– Since GC has to sweep less space each time, pauses are imperceptible.
• Max versus min heap sizes
– Contraction and expansion of the heap is costly and should be worth the cause.
– Frequent contraction and expansion also leads to a fragmented heap.
– Keep Xmx=Xms for a transaction-oriented system that frequently peaks.
– Keep Xms<Xmx if the application only infrequently operates at upper capacity.
70. We Just Learnt Gross Heap
Tuning
There might still be a need for fine tuning.
• We can fine-tune the GC considering the intricacies of the
GC algorithm and heap structure. We will learn this shortly.
• Gross heap tuning is quite simple yet effective and
empirically established.
• Gross techniques are fairly effective irrespective of the
variables and, most importantly, we can almost always afford to
apply them.
71. What is the advanced heap made of?
The one that works with Generational Garbage Collector in JVM
• The HEAP is made up of
– Old Space or Tenure Space
• Objects that grow old in the young space are promoted here.
– Young Space or Eden Space
• Young objects are held here.
– Scratch Space
• Working space for the algorithms
– New Space
• <Young Space> + <Scratch Space>
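Not from the slides, but for illustration: HotSpot exposes flags that size the generations described above (values and MyApp are illustrative):

    java -Xms1g -Xmx1g -XX:NewRatio=3 -XX:SurvivorRatio=8 MyApp

Here NewRatio=3 keeps the old space three times the size of the new space, and SurvivorRatio=8 sizes Eden at eight times each survivor (scratch-like) space.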
77. Heap Dump (Java)
A snapshot of the process's memory at a point in time.
VMs usually invoke a GC before dumping the heap.
It contains
• Objects (class, fields, primitive values and references)
• Classes (classloader, name, super class, static fields)
• GC roots (objects defined to be reachable by the JVM)
• Thread stacks (as of dump time, with per-frame information about local objects)
It does not contain
• Allocation information
Who created the objects, and where were they created?
• Live & stale
Used memory consists of both live and dead objects; since the JVM
usually runs a GC before generating a heap dump, most dead objects are already gone.
Tools may additionally remove objects unreachable from the GC roots when
loading the dump.
78. Heap Dump (Java)
How to take it?
• On demand
VM arg (> JDK 1.4.2_12): -XX:+HeapDumpOnCtrlBreak
Tools (JDK 6): JConsole, VisualVM, MAT
jmap -d64 -dump:format=b,file=<file-bin-hdump> <pid>
• Automatically on a crash
VM arg: -XX:+HeapDumpOnOutOfMemoryError
• Postmortem after a crash, from a core dump
jmap -d64 -dump:format=b,file=<file> <java-bin> <core-file>
79. Heap Dump (Java)
Shallow vs Retained Heap
Shallow heap
• The memory held by the object itself: its primitive fields and reference variables
• Excludes the referenced objects; only the references themselves (32/64 bits) are counted
Retained heap
• The object's shallow size plus the shallow sizes of all objects that are
accessible, directly or indirectly, only from this object
• The memory that would be freed by the GC when this object is collected
Garbage collection roots
• A garbage collection root is an object accessible from outside the heap.
• GC root objects will not be collected by the garbage collector at the time
of measuring; typical roots are locals (Java/native), threads, system classes, JNI
references, monitors, and objects pending finalization.
80. Shallow vs. Retained Heap
http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fconcepts%2Fshallowretainedheap.html
In general, the retained size of a GC root is an integral measure that helps in understanding
the memory consumed by object graphs.
81. Dominator Tree
(Object Dependencies)
• Identifies the chunks of retained memory and what keeps them alive.
• In the dominator tree, each object is the immediate dominator of its children, so
dependencies between the objects are easily identified.
• The edges in the dominator tree do not directly correspond to object references in
the object graph; the same object may actually be in the retained set of multiple roots.
• http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fconcepts%2Fshallowretainedheap.html
82. OQL (Object Query Language)
Heap Dump not just for
Troubleshooting
• OQL is an Object Query Language that lets us query the heap dump in SQL
fashion.
• This enables us to analyze the heap not only after problems occur but also to proactively search
for patterns. For example, a select to see whether there are more than two objects for Boolean;
ideally two, TRUE and FALSE (singletons, like enums), are sufficient:
select toHtml(a) + " = " + a.value from java.lang.Boolean a
where objectid(a.clazz.statics.TRUE) != objectid(a)
&& objectid(a.clazz.statics.FALSE) != objectid(a)
(Runs in VisualVM.)
• VisualVM and MAT both provide nice interfaces for OQL.
http://docs.oracle.com/javase/6/docs/technotes/tools/share/jhat.html
http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fwelcome.html
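Another query in the same spirit, adapted from the jhat OQL examples linked above (s.count is the internal length field of java.lang.String in pre-JDK 7u6 heaps), finds strings of 100 characters or more:

select s from java.lang.String s where s.count >= 100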