How are threads inside the process performing? The example shows us that thread number two in the target proces...
Process Stack on a Java Virtual Machine: pstack • Use the "C++ stack unmangler" with Java virtual machine (...
Tracing Processes: truss. truss traces system calls made on behalf of a process. It includes...
Why Memory Saturation brings a more rapid degradation in performance compared to CPU saturation • Memory saturation may...
Thread Dumps • What exactly is a "thread dump"? – A thread dump basically gives you information on what each of the...
Thread dumps in Redknee Applications • Java thread dumps are obtained by doing: – Send (kill -3 <pid>) – On Unix ...
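Besides `kill -3`, the same information a thread dump shows (each thread's name, daemon flag, state, and stack) is available in-process. A minimal sketch using the standard `Thread.getAllStackTraces()` API (the class name is illustrative):

```java
import java.util.Map;

// Hedged sketch: a programmatic mini thread dump, printing what
// kill -3 would show for each live thread in this JVM.
public class MiniThreadDump {
    public static void dump() {
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            // Header line, similar in spirit to a real thread dump entry
            System.out.printf("\"%s\" daemon=%s state=%s%n",
                    t.getName(), t.isDaemon(), t.getState());
            for (StackTraceElement frame : e.getValue())
                System.out.println("    at " + frame);
        }
    }

    public static void main(String[] args) {
        dump();   // always includes at least the "main" thread
    }
}
```

This does not replace a real thread dump (it lacks lock-ownership details), but it is handy for logging thread state from inside an application.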
Common Threads in Redknee • "Idle" – CORBA threads to handle incoming requests, which are currently not doing a...
Thread Dump May Give you Clues • C:\learn\classes> java Test • Full thread dump Java HotSpot(TM) Client VM (1.4.2_04-b05 mi...
What is there in the Thread Dump? • In this case we can see that, at the time we took the thread dump, there were seven t...
Threads in a Dead-Lock • A set of threads are said to be in a deadlock when there is a cyclic wait condition, i.e. each t...
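The cyclic wait condition can be reproduced and detected in a few lines. A hedged sketch (class and method names are illustrative): two daemon threads take the same two locks in opposite order, and the JVM's `ThreadMXBean` then reports the cycle – the same cycle a thread dump prints under "Found one Java-level deadlock":

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Sketch of a classic lock-ordering deadlock plus programmatic detection.
public class DeadlockDemo {
    static final Object A = new Object(), B = new Object();

    static void locker(Object first, Object second) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                try { Thread.sleep(100); } catch (InterruptedException ignored) {}
                synchronized (second) { /* never reached once deadlocked */ }
            }
        });
        t.setDaemon(true);   // daemon threads let the JVM exit despite the deadlock
        t.start();
    }

    /** Starts the cycle, then asks the JVM how many threads it sees deadlocked. */
    public static int deadlockedThreadCount() throws InterruptedException {
        locker(A, B);        // takes A, then wants B
        locker(B, A);        // takes B, then wants A  -> cyclic wait
        Thread.sleep(500);   // give both threads time to block on each other
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(deadlockedThreadCount());   // usually 2
    }
}
```

`findDeadlockedThreads()` is how monitoring agents detect exactly the condition this slide describes without parsing dump text.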
Memory Performance & Troubleshooting (Part 3)
Memory • Memory includes physical memory (RAM) and swap space • Swap space is a part of storage acting as memory • Memory is mor...
Memory Utilization and Saturation • To sustain a higher throughput, the application spawns more threads and holds the request
VMSTAT – Glimpse of Memory Utilization. If the scan rate (sr) is continuously over 200 pages per second then t...
Memory Consumption Model
Relieving Memory Pressure. After free memory is exhausted, pages are reclaimed from the cache list (FS, I/O, etc. caches). Next the swapper swaps out ent...
Heap and Non-Heap Memory • Heap Memory – storage for Java objects, sized by -Xmx<size> & -Xms<size> • Non-Heap Memory – per-class struc...
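The heap vs. non-heap split can be observed at runtime through the standard `MemoryMXBean` (the class name below is illustrative). On the heap side, `max` reflects -Xmx (or the platform default if it is not set):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Sketch: querying the JVM's own view of heap and non-heap memory.
public class HeapVsNonHeap {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();       // Java objects (-Xmx/-Xms)
        MemoryUsage nonHeap = mem.getNonHeapMemoryUsage(); // per-class structures, code, etc.
        System.out.printf("heap     used=%dK max=%dK%n",
                heap.getUsed() / 1024, heap.getMax() / 1024);
        System.out.printf("non-heap used=%dK%n", nonHeap.getUsed() / 1024);
    }
}
```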
What is Garbage Collection? Reclaiming memory from inaccessible objects.
Stack Overflow or Out of Memory • If you see OutOfMemoryError: unable to create native thread – This means your Appl...
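A back-of-envelope way to reason about "unable to create native thread": each Java thread reserves roughly one -Xss-sized native stack, so the native space left over after the heap caps the thread count. This is illustrative arithmetic, not a JVM API, and the numbers are assumptions:

```java
// Hedged sketch: estimating a rough thread ceiling from native space
// and per-thread stack size. Real limits also involve OS settings.
public class ThreadBudget {
    /** Approximate thread ceiling: free native bytes / per-thread stack bytes. */
    public static long maxThreads(long freeNativeBytes, long stackBytesPerThread) {
        return freeNativeBytes / stackBytesPerThread;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;   // assume ~1 GiB of native space remains (assumption)
        long oneMiB = 1L << 20;   // -Xss1m, the stack size the deck cites as typical
        System.out.println(maxThreads(oneGiB, oneMiB));   // prints 1024
    }
}
```

This is why shrinking -Xss (or the heap) can relieve this particular OutOfMemoryError even though the heap itself is not full.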
Pros and Cons of Garbage Collection? Advantages – Increased reliability ... Disadvantages – ...
GC Logging • Java garbage collection activity may be recorded in a log file. VM options: – -verbosegc (Enable GC Loggin...
What to look for in GC Logs? • Important information from GC logs: – The size of the heap after garbage collection – The...
How to Calculate Impact of GC on your Application? • Run test (60 sec, collect GC logs) – 36690K->35325K(458752K...
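The GC-impact arithmetic sketched on this slide amounts to summing the pause times from the log lines of a fixed-length run and expressing them as a percentage of wall-clock time. A hedged sketch (the pause values below are illustrative, not from a real run):

```java
// Sketch: GC overhead = total pause seconds / run seconds * 100.
public class GcOverhead {
    /** Percentage of the run the JVM spent paused in GC. */
    public static double overheadPercent(double[] pauseSeconds, double runSeconds) {
        double total = 0;
        for (double p : pauseSeconds) total += p;
        return total / runSeconds * 100.0;
    }

    public static void main(String[] args) {
        // e.g. pauses parsed from lines like "36690K->35325K(458752K), 0.0030 secs"
        double[] pauses = {0.0030, 0.0040, 0.0025, 0.0105};
        System.out.println(overheadPercent(pauses, 60.0) + "%");   // ≈ 0.033%
    }
}
```

A few percent of GC overhead is usually tolerable; sustained double-digit overhead is the signal that the heap or the collector needs tuning.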
Heap Ranges – Xms to Xmx • Heap range can be defined – VM args -Xmx & -Xms define the upper & lower bounds of heap size • What
Gross Heap Tuning • Consequences of large heap sizes – GC cycles occur less frequently, but each sweep takes longer –...
We Just Learnt Gross Heap Tuning. There might just be a need for Fine Tuning • We can fine...
What is the advanced heap made of? The one that works with the Generational Garbage Collector in the JVM • HEAP is...
jmap -heap
Generational Garbage Collector – Modern Heap
Fine Tuning the Heap
Are there better GC implementations to choose? JDK 1.4.x options. Generation | Low Pause Collectors | ...
jstat. Reference: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jstat.html
Heap Dump (Java) – a snapshot of the memory at a point in time. VMs usually invoke a GC before dumpi...
Heap Dump (Java) – How to take it? • On demand – VM arg (> JDK 1.4.2_12): -XX:+HeapDumpOnCtrlBreak – Tools #...
Heap Dump (Java) – Shallow vs Retained Heap. Shallow heap • Held by object's primitive fields and reference varia...
Shallow vs. Retained Heaphttp://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fconcepts%2Fshallowret...
Dominator Tree (Object Dependencies) • Identifies chunks of retained memory & the keep-aliv...
OQL (Object Query Language) – Heap Dump not just for Troubleshooting • OQL is an Object Query Lan...
References • Thread Dump Analyzer (Thread Dumps) – http://java.net/projects/tda/ • GC Viewer (GC logs) – http://www...
Feedback – Q&A  simar.singh@redknee.com     learn@ssimar.com
Performance Concurrency Troubleshooting Final
1. System Performance – Build, Fuel, Tune. Simar Singh, simar.singh@redknee.com, learn@ssimar.com
2. Learn and Apply – Topics Index (Click Links in Slide Show): Performance Concepts, Concurrency (Threads), Processing (CPU/Cores), Memory (System/Process), Thread Dumps, Garbage Collection, Heap Dumps, Core Dumps & Postmortem, Troubleshooting with Java tools (jstack, jmap, jstat, VisualVM) and Solaris tools (prstat, vmstat, mpstat, pstack)
3. Concepts – Concurrency and Performance (Part 1)
4. What will we Discuss?
   LEARN
   – There are laws and principles that govern concurrency and performance.
   – Performance can be built, fueled and/or tuned.
   – How do we measure performance and capacity in abstract terms?
   – Capacity (throughput) and load are often used interchangeably, but incorrectly.
   – What is the difference between resource utilization and saturation?
   – How are performance & capacity measured on a live system (CPU & memory)?
   APPLY
   – Find out how your system is being used or abused.
   – Find out how your system is performing as a whole.
   – Find out how a particular process in the system is performing.
   – Find out how a particular thread in the process is performing.
   – Find out the bottlenecks: what is scarce or missing?
5. Performance – Built, Fueled or Tuned
   • Built (Implementation and Techniques)
     – Binary search O(log n) is more efficient than linear search O(n).
     – Caching can improve disk I/O, significantly boosting performance.
   • Fueled (More Resources)
     – Simply get a machine with more CPU(s) and memory if constrained.
     – Implement RAID to improve disk I/O.
   • Tuned (Settings and Configurations)
     – Tune garbage collection to optimize Java processes.
     – Tune Oracle parameters to get optimum database performance.
6. Capacity and Load
   • Load is an expectation of the system.
     – It is the rate of work that we put on the system.
     – It is a factor external to the system.
     – Load may vary with time and events.
     – It has no upper cap; it can increase infinitely.
   • Capacity is a potential of the system.
     – It is the maximum rate of work the system supports efficiently, effectively & indefinitely.
     – It is a factor internal to the system. The maximum capacity of a system is finite and stays fairly constant. We often call throughput the system's capacity for load.
   • Chemistry between Load & Capacity
     – LOAD = CAPACITY? Good – expectation matches the potential. Hired.
     – LOAD > CAPACITY? Bad – expectation is more than the potential. Fired.
     – LOAD < CAPACITY? Ugly – expectation is less than the potential. Find another one.
     – If not good, better be ugly than bad.
7. Performance Measurement of a System – Measures of a System's Capacity
   • Response Time or Latency
     – Measures time spent executing a request, e.g. round-trip time (RTT) for a transaction.
     – Good for understanding user experience; least scalable. Developers focus on how much time each transaction takes.
   • Throughput
     – Measures the number of transactions executed over a period of time, e.g. output transactions per second (TPS).
     – A measure of the system's capacity for load.
     – Depending upon the resource type, it could be a hit rate (for a cache).
   • Resource Utilization
     – Measures the use of a resource: memory, disk space, CPU, network bandwidth.
     – Helpful for system sizing; generally the easiest measurement to understand.
     – Throughput and response time can conflict, because resources are limited: locking, resource contention, container activity.
8. It is time for System Capacity to be Loaded with work (Throttling & Buffering Techniques)
   • Nothing stops us from loading a system beyond its capacity (maximum throughput).
   • "Transactions per second" invites a misconception: real traffic may come in bursts.
     – We received 3600 transactions in an hour; we cannot be sure only 60 were pumped every second.
     – Probably we received them in bursts – all in the first 10 minutes and nothing for the last 50 minutes.
     – So we really can't say at what tps. We can regulate bursts with throttling and buffering.
   • Throttling (implemented by the producer to smoothen output)
     – Spreads bursts over time to smoothen output from a process.
     – We may add throttles to control the output rate from threads to each external interface.
     – A throttle of 10 tps ensures the maximum output is 10 tps regardless of the load & capacity.
     – Throttling is a scheme for producers (cap production to the rate the consumer can accept).
   • Buffering (implemented by the consumer to smoothen input)
     – Spreads bursts over time to smoothen input from an external interface.
     – We add buffering to control the input rate to threads from each external interface.
     – The application processes input at 10 tps; load above that will be buffered & processed later.
     – Buffering is a scheme for consumers (take whatever is produced, consume at our own pace).
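The throttling scheme above can be sketched as a small token-bucket throttle. This is a hedged sketch, not Redknee's actual implementation; the class name and parameters are illustrative:

```java
// Sketch: a token-bucket throttle. Output never exceeds ratePerSec in
// steady state; bursts beyond `burst` are shed (or could be buffered).
public class Throttle {
    private final double ratePerSec;   // steady-state permits per second
    private final double burst;        // max tokens that can accumulate
    private double tokens;
    private long lastNanos;

    public Throttle(double ratePerSec, double burst) {
        this.ratePerSec = ratePerSec;
        this.burst = burst;
        this.tokens = burst;
        this.lastNanos = System.nanoTime();
    }

    /** True if one unit of work may proceed now; false means shed or buffer it. */
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Refill tokens for the elapsed time, capped at the burst size.
        tokens = Math.min(burst, tokens + (now - lastNanos) / 1e9 * ratePerSec);
        lastNanos = now;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;
    }

    public static void main(String[] args) {
        Throttle t = new Throttle(10, 10);         // 10 tps, burst of 10
        int admitted = 0;
        for (int i = 0; i < 100; i++) if (t.tryAcquire()) admitted++;
        System.out.println(admitted);              // only the burst is admitted
    }
}
```

The producer calls `tryAcquire()` before emitting; a buffering consumer would instead enqueue the rejected work and drain it later at its own pace.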
9. Supply Chain Principle (apply it to define an optimum thread pool size)
   • The more throughput you want, the more resources will be consumed.
   • You may apply this principle to define the optimum thread-pool size for a system/application.
     – To support a throughput of (t) transactions per second: (t) = 20 tps.
     – Where each transaction takes (d) seconds to complete: (d) = 5 seconds.
     – We need at least (d*t) threads (the minimum size of the thread pool): (d*t) = 100 threads.
   • A thread is an abstract CPU unit resource here.
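The sizing rule above is one line of arithmetic; a minimal sketch (the class and method names are illustrative):

```java
// Sketch of the supply-chain sizing rule: to sustain t transactions/sec
// where each takes d seconds, at least d * t threads must be in flight.
public class ThreadPoolSizing {
    /** Minimum threads needed: ceil(throughputTps * latencySeconds). */
    public static int requiredThreads(double throughputTps, double latencySeconds) {
        return (int) Math.ceil(throughputTps * latencySeconds);
    }

    public static void main(String[] args) {
        // The slide's example: 20 tps at 5 s per transaction -> 100 threads.
        System.out.println(requiredThreads(20, 5));   // prints 100
    }
}
```

Rounding up matters for fractional results: 10 tps at 0.25 s each still needs 3 threads, not 2.5.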
10. To support a throughput (t) of 20 tps, where each transaction takes (d) 5 seconds, we need at least 100 (d*t) threads. [Timing diagram: 20 new transactions start each second; each runs for 5 seconds, so 100 transactions are in flight at any instant.]
11. Quantify Resource Consumption – Utilization & Saturation
    • Resource Utilization
      – Utilization measures how busy a resource is.
      – It is usually represented as a percentage averaged over a time interval.
    • Resource Saturation
      – Saturation is often a measure of work that has queued waiting for the resource.
      – It can be measured both as an average over time and at a particular point in time.
      – For some resources that do not queue, saturation may be synthesized from error counts. For example, page faults reveal memory saturation.
    • Load (the input rate of requests) is an independent/external variable.
    • Resource consumption and throughput (the output rate of responses) are dependent/internal variables, a function of load.
12. How are Load, Resource Consumption and Throughput related?
    • As load increases, throughput increases, until maximum resource utilization on the bottleneck device is reached. At this point, maximum possible throughput is reached and saturation occurs.
    • Then queuing (waiting for saturated resources) starts to occur.
    • Queuing typically manifests itself as degradation in response times.
    • This phenomenon is described by Little's Law: L = X * R, with L (load), X (throughput) and R (response time).
    • As L increases, X increases (R also increases slightly, because there is always some level of contention at the component level).
    • At some point X reaches Xmax, the maximum throughput of the system. At this point, as L continues to increase, the response time R increases in proportion and throughput may then start to decrease, both due to resource contention.
13. Performance Pattern of a Concurrent Process
14. Example – How are Throughput and Resource Consumption related?
    • Throughput & latency can have an inverse or direct relationship; concurrent tasks (threads) often contend for resources (locking & contention).
    • Single-threaded
      – Higher throughput = lower latency.
      – Consistent throughput; does not increase with incoming load & resources.
      – Processes serially; good for batch jobs.
      – Response time varies linearly with request order.
    • Multi-threaded
      – Higher throughput = higher latency (most of the time).
      – Throughput may increase linearly with load; it starts to drop after a threshold.
      – Processes concurrently; good for interactive modules (web apps).
      – Near-consistent response time; doesn't vary much with order, but with load.
    • Single-threaded, 10 CPU(s): Threads = 1, Latency = 0.1 s, Throughput = 1/0.1 = 10 tx/sec; Threads = 1, Latency = 0.001 s, Throughput = 1/0.001 = 1000 tx/sec.
    • Multi-threaded, 10 CPU(s): Threads = 10, Latency = 0.1 s, Throughput = (1/0.1) * 10 = 100 tx/sec; Threads = 100, Latency = 0.2 s, Throughput = (1/0.2) * 100 = 500 tx/sec.
15. Producer Consumer Principle – Predicting Maximum Throughput, Identifying the Bottleneck Device/Resource
    • The Utilization Law: Ui = T * Di, where Ui is the percentage utilization of a device in the application, T is the application throughput, and Di is the service demand of the application on that device.
    • The maximum throughput of an application, Tmax, is limited by the maximum service demand over all of the devices in the application.
    • EXAMPLE – a load test reports 200 kb/sec average throughput:
      – CPUavg = 80%, so Dcpu = 0.8 / 200 kb/sec = 0.004 sec/kb
      – Memoryavg = 30%, so Dmemory = 0.3 / 200 kb/sec = 0.0015 sec/kb
      – Diskavg = 8%, so Ddisk = 0.08 / 200 kb/sec = 0.0004 sec/kb
      – Network I/Oavg = 40%, so Dnetwork I/O = 0.4 / 200 kb/sec = 0.002 sec/kb
    • In this case Dmax corresponds to the CPU, so the CPU is the bottleneck device.
    • We can use this to predict the maximum throughput of the application by setting the CPU utilization to 100% and dividing by Dcpu. In other words, for this example: Tmax = 1 / Dcpu = 250 kb/sec.
    • To increase the capacity of this application, it would first be necessary to increase CPU capacity. Increasing memory, network or disk capacity would have little or no effect on performance until CPU capacity has been increased sufficiently.
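The Utilization Law arithmetic above can be sketched directly (class and method names are illustrative; the numbers are the slide's load-test example):

```java
// Sketch: Di = Ui / T, the bottleneck is the device with the largest Di,
// and Tmax = 1 / Dmax once that device reaches 100% utilization.
public class UtilizationLaw {
    /** Service demand (sec per unit of work) from utilization and throughput. */
    public static double serviceDemand(double utilization, double throughput) {
        return utilization / throughput;
    }

    /** Maximum throughput once the bottleneck device hits 100% utilization. */
    public static double maxThroughput(double maxServiceDemand) {
        return 1.0 / maxServiceDemand;
    }

    public static void main(String[] args) {
        double t = 200.0;                          // measured: 200 kb/sec
        double dCpu  = serviceDemand(0.80, t);     // 0.004  sec/kb  <- largest
        double dMem  = serviceDemand(0.30, t);     // 0.0015 sec/kb
        double dDisk = serviceDemand(0.08, t);     // 0.0004 sec/kb
        double dNet  = serviceDemand(0.40, t);     // 0.002  sec/kb
        System.out.println(maxThroughput(dCpu));   // ≈ 250 kb/sec: the CPU-bound ceiling
    }
}
```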
16. Work Pools & Thread Pools Working Together
    • Work pools are queues of work to be performed by a software application or component.
      – If all threads in the thread pool are busy, incoming work can be queued in the work pool.
      – Threads from the thread pool, when freed, can execute it later.
    • Work pools relieve congestion & smoothen bursts.
      – A queue consisting of units of work to be performed.
      – CONGESTION: allows the current (client) threads to submit work and return.
      – BURST: over-capacity transactions can be buffered in the work pool and executed later.
      – Allows caching of units of work to reduce system-intensive calls, e.g. a bulk fetch from a database instead of fetching one record at a time.
17. Queuing Tasks may be risky
    • One task could lock up another that would be able to continue if the queued task were to run.
    • Queuing can smoothen incoming traffic bursts limited in time (depending upon the rate of traffic and the queue size).
    • It fails if traffic arrives on average faster than it can be processed.
    • In general, work pools are in memory, so it is important to understand the impact of restarting a system, as in-memory elements will be lost.
      – Is it acceptable to lose the queued work?
      – Is the queue backed up on disk?
18. Bounded & Unbounded Pools (Load Shedding)
    • If not bounded, pools can grow freely but can cause the system to exhaust resources.
      – Work pool / queue unbounded: may overload memory/heap & crash. Each work object in the queue keeps holding space until consumed.
      – Thread pool unbounded: may overload CPU / native space and crash. Each thread asks to be scheduled on a CPU and consumes native stack space.
    • If the queue size is bounded, incoming execute requests block when it is full. We can apply different policies to handle it, for example:
      – Reject if there is no space (can have side effects).
      – Remove based on priority (e.g. priority may be a function of time – timeouts).
    • Thread pools can apply different policies when the work pool is full:
      – Block till there is available space.
      – Starve (VERY BAD – sometimes needed).
      – Run in the current thread (very dangerous!).
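The JDK's `ThreadPoolExecutor` expresses all three knobs above directly: pool size (thread pool), queue bound (work pool), and a saturation policy. A minimal sketch (class and parameter values are illustrative); here `CallerRunsPolicy` is the "run in the current thread" policy the slide warns about:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a bounded work pool + thread pool with an explicit overflow policy.
public class BoundedPools {
    /** Runs `tasks` short jobs on `threads` workers with a queue bounded to `queueCap`. */
    static int runAll(int tasks, int threads, int queueCap) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads,                      // thread pool: fixed size
                0, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCap),    // work pool: bounded queue
                new ThreadPoolExecutor.CallerRunsPolicy()); // overflow runs in caller
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try { Thread.sleep(20); } catch (InterruptedException ignored) {}
                done.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 10 tasks, 2 workers, queue of 4: overflow executes in the submitting
        // thread, which also throttles the submitter -- nothing is lost.
        System.out.println(runAll(10, 2, 4));
    }
}
```

Swapping in `AbortPolicy` (reject) or `DiscardOldestPolicy` (priority-style removal) gives the other policies the slide lists.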
19. Work pool & thread pool sizes can often be traded off for each other
    • Large work pool and small thread pool:
      – Minimizes CPU usage, OS resources, and context-switching overhead.
      – Can lead to artificially low throughput, especially if tasks frequently block (e.g. I/O bound).
    • Small work pools generally require larger thread pools:
      – Keeps CPUs busier.
      – May cause scheduling overhead (context switching) and may lessen throughput, especially if the number of CPUs is small.
20. Processing (CPU) Performance & Troubleshooting (Part 2)
21. CPU
    • Many modern systems from Sun boast numerous CPUs or virtual CPUs (which may be cores or hardware threads).
    • The CPUs are shared by applications on the system, according to a policy prescribed by the operating system and scheduler.
    • If the system becomes CPU-resource limited, application or kernel threads have to wait on a queue to be scheduled on a processor, potentially degrading system performance.
    • The time spent on these queues, the length of these queues and the utilization of the system processors are important metrics for quantifying CPU-related performance bottlenecks.
22. Process – User and Kernel Level Threads
    • A process includes the set of executable programs, address space, stack, and process control block. One or more threads may execute the program(s).
    • User-level threads (threads library)
      – Invisible to the OS and maintained by a thread library.
      – The interface for application parallelism.
    • Kernel threads
      – The unit that can be dispatched on a processor; its structures are maintained by the kernel.
    • Lightweight processes (LWP)
      – Each LWP supports one or more user-level threads and maps to exactly one kernel-level thread. It maintains the state of a thread.
23. CPU Consumption Model – by default Solaris 10 uses process model 4; the rest are obsolete.
24. Dispatcher and Run Queue at the CPU
25. User Thread over a Solaris LWP – the state of the user thread and the LWP may be different.
26. Solaris Threading Model
    • If you are in a thread, the thread library must schedule it on an LWP.
    • Each LWP has a kernel thread, which schedules it on a CPU.
    • Threading models are used between LWPs & Solaris threads.
27. JVM Organization
28. JVM Memory Organization & Threads
    • Method Area
      – The JVM loads the class files, their type info and binary data into this area.
      – This memory area is shared by all threads.
    • Heap Area
      – The JVM places all objects the program instantiates onto the heap.
      – This memory area is shared by all threads.
      – This memory can be adjusted by the VM options -Xmx & -Xms as required.
    • Java Stack and Program Counter (PC) Register
      – Each new thread that executes gets its own PC register & Java stack.
      – The value of the PC register indicates the next instruction to execute.
      – A thread's Java stack stores the state of Java method invocations for the thread. The state of a Java method invocation includes its local variables & the parameters with which it was invoked, its return value (if any), and intermediate calculations.
      – This memory may be adjusted by the VM option -Xss, typically 1m for RK apps.
      – The state of native method (JVM method) invocations is stored in an implementation-dependent way in native method stacks, as well as possibly in registers or other implementation-dependent memory areas.
29. A Java thread's Stack Memory
    • The Java stack is composed of stack frames (or frames).
    • A stack frame contains the state of one Java method invocation.
      – When a thread invokes a method, the Java virtual machine pushes a new frame onto that thread's Java stack.
      – When the method completes, the virtual machine pops and discards the frame for that method.
30. Thread Modes – Kernel & User Mode Privilege
    • An LWP may execute in either kernel (sys) or user (usr) privilege mode.
    • Operations like processing data in local memory and communication between threads of the same process do not require kernel mode privilege for the thread executing the user program.
    • However, inter-process communication or hardware access is done by kernel programs, so the executing thread requires kernel mode privilege.
    • User programs call kernel programs by making system calls.
    • An LWP runs in user mode until it makes a system call that requires kernel mode privilege. The mode switch then happens, which is costly.
31. LWP/Thread Modes – User Mode and Kernel Mode. Don't confuse the modes with the thread types (kernel and user).
32. Complete Process State Diagram – the state of a process is a superset of thread states; a process's state is defined by its threads.
33. VMSTAT – Glimpse of CPU Behavior
    • The vmstat tool provides a glimpse of the system's behavior: one line indicates both CPU utilization and saturation.
    • The first line is the summary since boot, followed by samples every five seconds.
    • On the far right is cpu:id, percent idle, which lets us determine how utilized the CPUs are. In this example the idle time for the 5-second samples was always 0, indicating 100% utilization.
    • On the far left is kthr:r, the total number of threads on the ready-to-run queues. If the value is more than the number of CPUs, it indicates CPU saturation.
    • Meanwhile, kthr:r was mostly 2 and sustained, indicating modest saturation for this single-CPU server. A value of 4 would indicate high saturation.
34. More about VMSTAT
    • kthr:r – total number of runnable threads on the dispatcher queues.
    • faults:in – number of interrupts per second.
    • faults:sy – number of system calls per second.
    • faults:cs – number of context switches per second, both voluntary and involuntary.
    • cpu:us – percent user time; time the CPUs spent processing user-mode threads.
    • cpu:sy – percent system time; time the CPUs spent processing system calls on behalf of user-mode threads, plus the time spent processing kernel threads.
    • cpu:id – percent idle; time the CPUs are waiting for runnable threads. This value can be used to determine CPU utilization.
35. CPU Utilization
    • You can calculate CPU utilization from vmstat by subtracting id from 100, or by adding us and sy.
    • 100% utilized may be fine – it can be the price of doing business.
    • When a Solaris system hits 100% CPU utilization, there is no sudden dip in performance; the degradation is gradual. Because of this, CPU saturation is often a better indicator of performance issues than CPU utilization.
    • The measurement interval is important: 5% utilization sounds close to idle; however, for a 60-minute sample it may mean 100% utilization for 3 minutes and 0% utilization for 57 minutes. It is useful to have both short- and long-duration measurements.
    • A server running at 10% CPU utilization sounds like 90% of the CPU is available for "free", that is, usable without affecting the existing application. This isn't quite true. When an application on a server with 10% CPU utilization wants the CPUs, they will almost always be available immediately. On a server with 100% CPU utilization, the same application will find the CPUs already busy, and will need to preempt the currently running thread or wait to be scheduled. This can increase latency.
36. CPU Saturation
    • The kthr:r metric from vmstat is useful as a measure of CPU saturation. However, since this is the total across all the CPU run queues, divide kthr:r by the CPU count for a value that can be compared with other servers.
    • Any sustained non-zero value is likely to degrade performance. The degradation is gradual (unlike the case with memory saturation, where it is rapid).
    • Interval time is still quite important. It is possible to see CPU saturation (kthr:r) while a CPU is idle (cpu:id). You may find that the run queue is quite long for a short period of time, followed by idle time. Averaging over the interval gives both a non-zero run queue length and idle time.
37. Solaris Performance Tools
    • vmstat (kstat) – for an initial view of overall CPU behavior.
    • psrinfo (kstat) – for physical CPU properties.
    • uptime (getloadavg()) – for the load averages, to gauge recent CPU activity.
    • sar (kstat, sadc) – for overall CPU behavior and dispatcher queue statistics; sar also allows historical data collection.
    • mpstat (kstat) – for per-CPU statistics.
    • prstat (procfs) – to identify process CPU consumption.
    • dtrace (DTrace) – for detailed analysis of CPU activity, including scheduling events and dispatcher analysis.
38. uptime Command
    • Prints uptime with CPU load averages. They represent both utilization and saturation of the CPUs.
    • The numbers are the 1-, 5-, and 15-minute load averages.
    • The load average is often approximated as the average number of runnable and running threads, which is a reasonable description.
    • A value equal to your CPU count usually means 100% utilization; less than your CPU count is proportionally less than 100% utilization; and greater than your CPU count is a measure of saturation.
    • A consistent load average higher than your CPU count may cause degraded performance. Solaris handles CPU saturation very well, so load averages should not be used for anything more than an initial approximation of CPU load.
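The same load-average-versus-CPU-count check can be done from inside the JVM via the standard `OperatingSystemMXBean` (the class name below is illustrative). A ratio near 1.0 means the CPUs are roughly fully utilized; above 1.0 suggests saturation:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Sketch: uptime's load-per-CPU check, done programmatically.
public class LoadCheck {
    /** 1-minute load average divided by CPU count (NaN if unavailable). */
    public static double loadPerCpu() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();   // -1 on platforms without it
        int cpus = os.getAvailableProcessors();
        return load < 0 ? Double.NaN : load / cpus;
    }

    public static void main(String[] args) {
        System.out.println("load per CPU: " + loadPerCpu());
    }
}
```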
  39. 39. sar - The system activity reporterProvide live statistics or can be activated to record historicalCPU statistics, prints the user (%usr), system (%sys), wait I/O(%wio), and idle times (%idle).Identifies long-term patterns that may be missed when taking aquick look at the system. Also, historical data provides areference for what is "normal" for your systemThe following example shows the default output of sar, which isalso the -u option to sar. An interval of 1 second and a count of5 were specified.
  40. 40. sar –q - Statistics on the run queuesrunq-sz (run queue size). Equivalent to the kthr:r field from vmstat; can beused as a measure of CPU saturationswpq-sz (swapped-out queue size). Number of swapped-out threads. Swappingout threads is a last resort for relieving memory pressure, so this field will bezero unless there was a dire memory shortage.%runocc (run queue occupancy). Helps prevent a danger when intervals areused, that is, short bursts of activity can be averaged down to unnoticeablevalues. The run queue occupancy can identify whether short bursts of run queueactivity occurred%swpocc (swapped out occupancy). Percentage of time there were swappedout threads. If one thread is swapped, all others of threads of the process must also be.
  41. 41. Is my system performing well? About the individual processors: the psrinfo -v command determines the number of processors in the system and their speed. In Solaris 10, -vp prints additional information. The mpstat command summarizes the utilization statistics for each CPU; the following is an example of a four-CPU machine, being sampled every 1 second. Key columns: syscl (system calls), csw (context switches), icsw (involuntary context switches), migr (migrations of threads between processors), intr (interrupts), ithr (interrupts as threads), smtx (kernel mutexes), srw (kernel reader/writer mutexes).
  42. 42. What are sampling and clock-tick woes?• While most counters you see in Solaris are highly accurate, sampling issues remain in a few minor places. In particular, the run queue length as seen from vmstat (kthr:r) is based on a sample that is taken every second. For example, a problem was caused by a program that deliberately created numerous short-lived threads every second, such that the one-second run queue sample usually missed the activity.• The runq-sz from sar -q suffers from the same problem, as does %runocc (which for short-interval measurements defeats the purpose of %runocc).• These are all minor issues, and a valid workaround is to use DTrace, with which statistics can be created at any accuracy desired.
  43. 43. Who Is Using the CPU? The default output from the prstat command shows one line of output per process, showing the CPU utilization value before the prstat command was executed. The system load average indicates the demand and queuing for CPU resources averaged over 1-, 5-, and 15-minute periods; if it exceeds the number of CPUs, the system is overloaded.
  44. 44. How is the CPU being consumed?• Use options -m (show microstates) & -L (show per-thread) to observe per-thread microstates.• Microstates represent a time-based summary broken into percentages for each thread.• USR through LAT sum to 100% of the time spent for each thread during the prstat sample.• USR (user time) and SYS (system time) are the time the thread spent running on the CPU.• LAT (latency) is the amount of time the thread spent waiting for CPU. A non-zero number means there was some queuing/saturation for CPU resources.• SLP indicates the time the thread spends blocked waiting for blocking events like disk I/O.• TFL & DTL determine if and how much the thread is waiting for memory paging.• TRP indicates the time spent on software traps. Each thread waiting for CPU about 0.2% of the time: CPU resources are not constrained. Each thread waiting for CPU about 80% of the time: CPU resources are constrained.
  45. 45. How are threads inside the process performing? The example shows us that thread number two in the target process is using the most CPU, and spending 83% of its time waiting for CPU. We can look further at thread number two with the pstack <pid>/<LWPID> command (just pstack <pid> shows all threads). Take a Java thread dump and identify the thread with native thread id = 2; this is the one. This way we can relate the Java code that called the native system call or library method on the system.
  46. 46. Process Stack on a Java Virtual Machine: pstack• Use the "C++ stack unmangler" with Java virtual machine (JVM) targets to see the Java function calls in the native C stack.
  47. 47. Tracing Processes: truss. truss traces system calls made on behalf of a process. It includes the user LWP (thread) number, system call name, arguments and return codes for each system call. The truss -c option prints system call counts.
  48. 48. Why memory saturation brings a more rapid degradation in performance than CPU saturation.• Memory saturation may cause rapid degradation in performance. To relieve saturation the OS resorts to page-in/out and swapping, which are themselves heavy tasks; with processes competing for memory, the system may start to thrash.• The available memory on a server may be artificially constrained, either through pre-allocation of memory or through the use of a garbage collection mechanism that doesn't free up memory until some threshold is reached.
  49. 49. Thread Dumps• What exactly is a "thread dump"? A thread dump gives you information on what each of the threads in the VM is doing at any given point in time.• If an application seems stuck, or is running out of resources, a thread dump will reveal the state of the server. Java thread dumps are a vital tool for server debugging, for scenarios like: – PERFORMANCE RELATED ISSUES – DEADLOCK (SYSTEM LOCKS UP) – TIMEOUT ISSUES – SYSTEM STOPS PROCESSING TRAFFIC
  50. 50. Thread dumps in Redknee Applications• Java thread dumps are obtained by: – Sending kill -3 <pid> on Unix -> see thread dump in ctl logs – Pressing Ctrl + Shift + Break on Windows -> see thread dumps on the xbuild console – $JAVA_HOME/bin/jstack <pid> -> see thread dumps on the shell console• Java thread dumps list all of the threads in an application• Threads are output in the order that they are created, with the newest thread at the top• Threads should be named with a useful name describing what they do or what they are responsible for (Open Tickets)
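Besides kill -3 and jstack, a process can produce a thread-dump-like listing of itself with the standard ThreadMXBean API; this is handy for logging dumps from inside the application. A minimal sketch (class and method names are illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ProgrammaticDump {
    // Build a thread-dump-like listing: one header line per thread
    // (name, id, state) followed by its stack trace.
    static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo ti : mx.dumpAllThreads(false, false)) {
            sb.append('"').append(ti.getThreadName()).append("\" ")
              .append("tid=").append(ti.getThreadId()).append(' ')
              .append(ti.getThreadState()).append('\n');
            for (StackTraceElement f : ti.getStackTrace())
                sb.append("\tat ").append(f).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```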
  51. 51. Common Threads in Redknee• "Idle" – CORBA threads to handle incoming requests that are currently not doing any work• "RMI TCP Connection(<port>)-<IP>" – Outbound connection over RMI to a specific host and port• "FileLogger" – Framework thread for logging• "JavaIDL Reader for <host>:<port>" – CORBA thread reading requests from a server• "TP-Processor8" – Tomcat web thread• "Thread-<#>" – Thread that has not been named (BAD)• "ChannelHome ForwardingThread" – Thread used to cluster transactions over to a peer; one of these threads per Home that is clustered (DB table)• "Worker#1" – Worker threads doing work
  52. 52. Thread Dump May Give you Clues• C:\learn\classes>java Test• Full thread dump Java HotSpot(TM) Client VM (1.4.2_04-b05 mixed mode):• "Signal Dispatcher" daemon prio=10 tid=0x0091db28 nid=0x744 waiting on condition [0..0]• "Finalizer" daemon prio=9 tid=0x0091ab78 nid=0x73c in Object.wait() [1816f000..1816fd88]• at java.lang.Object.wait(Native Method)• - waiting on <0x10010498> (a java.lang.ref.ReferenceQueue$Lock)• at java.lang.ref.ReferenceQueue.remove(Unknown Source)• - locked <0x10010498> (a java.lang.ref.ReferenceQueue$Lock)• at java.lang.ref.ReferenceQueue.remove(Unknown Source)• at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)• "Reference Handler" daemon prio=10 tid=0x009196f0 nid=0x738 in Object.wait() [1812f000..1812fd88]• at java.lang.Object.wait(Native Method)• - waiting on <0x10010388> (a java.lang.ref.Reference$Lock)• at java.lang.Object.wait(Unknown Source)• at java.lang.ref.Reference$ReferenceHandler.run(Unknown Source)• - locked <0x10010388> (a java.lang.ref.Reference$Lock)• "main" prio=5 tid=0x00234998 nid=0x4c8 runnable [6f000..6fc3c]• at Test.findNewLine(Test.java:13)• at Test.<init>(Test.java:4)• at Test.main(Test.java:20)• "VM Thread" prio=5 tid=0x00959370 nid=0x6e8 runnable• "VM Periodic Task Thread" prio=10 tid=0x0023e718 nid=0x74c waiting on condition• "Suspend Checker Thread" prio=10 tid=0x0091cd58 nid=0x740 runnable
  53. 53. What is there in the Thread Dump?• In this case we can see that, at the time we took the thread dump, there were seven threads: – Signal Dispatcher – Finalizer – Reference Handler – main – VM Thread – VM Periodic Task Thread – Suspend Checker Thread• Each thread name is followed by whether the thread is a daemon thread or not.• Then comes prio, the priority of the thread [ex: prio=5].• tid and nid are the Java thread id and the native thread id.• Then follows the state of the thread. It is either: – Runnable [marked as R in some VMs]: the thread is either running currently or is ready to run the next time the OS thread scheduler schedules it. – Suspended [marked as S in some VMs]: the thread is not in a runnable state. – Object.wait() [marked as CW in some VMs]: the thread is waiting on an object using Object.wait() – Waiting for monitor entry [marked as MW in some VMs]: the thread is waiting to enter a synchronized block• What follows the thread description line is a regular stack trace.
  54. 54. Threads in a Deadlock• A set of threads are said to be in a deadlock when there is a cyclic wait condition, i.e. each thread in the deadlock is waiting on a resource locked by some other thread in the set of deadlocked threads. In newer JDKs deadlocks are detected automatically – Found one Java-level deadlock: – ============================= – "Thread-1": – waiting to lock monitor 0x0091a27c (object 0x140fa790, a java.lang.Class), – which is held by "Thread-0" – "Thread-0": – waiting to lock monitor 0x0091a25c (object 0x14026800, a java.lang.Class), – which is held by "Thread-1" – Java stack information for the threads listed above: – =================================================== – "Thread-1": – at Deadlock$2.run(Deadlock.java:48) – - waiting to lock <0x140fa790> (a java.lang.Class) – - locked <0x14026800> (a java.lang.Class) – "Thread-0": – at Deadlock$1.run(Deadlock.java:33) – - waiting to lock <0x14026800> (a java.lang.Class) – - locked <0x140fa790> (a java.lang.Class) – Found 1 deadlock.
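The automatic detection mentioned above is also exposed to applications through ThreadMXBean.findDeadlockedThreads(). A sketch that manufactures the cyclic wait described on the slide (two threads taking the same pair of locks in opposite order) and then asks the JVM to find it; the class name and 200 ms timing are illustrative:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DeadlockDetector {
    // Create a cyclic wait, then use the JVM's own deadlock detector
    // (the same detection newer JDKs run when printing a thread dump).
    static long[] detect() throws InterruptedException {
        final Object a = new Object(), b = new Object();
        Thread t1 = new Thread(() -> { synchronized (a) { pause(); synchronized (b) { } } });
        Thread t2 = new Thread(() -> { synchronized (b) { pause(); synchronized (a) { } } });
        t1.setDaemon(true);                 // let the JVM exit despite the deadlock
        t2.setDaemon(true);
        t1.start();
        t2.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = null;
        for (int i = 0; i < 100 && ids == null; i++) {  // poll up to ~10 s
            Thread.sleep(100);
            ids = mx.findDeadlockedThreads();           // null if no deadlock
        }
        return ids;
    }

    static void pause() { try { Thread.sleep(200); } catch (InterruptedException e) { } }

    public static void main(String[] args) throws InterruptedException {
        long[] ids = detect();
        System.out.println(ids == null ? "no deadlock found"
                                       : "Found " + ids.length + " deadlocked threads");
    }
}
```

Each thread holds its first lock while sleeping, so by the time either tries for its second lock the other already owns it; findDeadlockedThreads() then returns both thread ids.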
  55. 55. Memory: Performance & Troubleshooting (Part 3)
  56. 56. Memory• Memory includes physical memory (RAM) and swap space.• Swap space is a part of storage acting as memory.• Memory is a more complicated subject than CPU.• Memory saturation triggers CPU saturation (page faults / GC)
  57. 57. Memory Utilization and Saturation• To sustain a higher throughput, an application spawns more threads and holds the request data.• Each thread occupies memory for the data it operates on and for its own stack.• When the memory demanded by a process can no longer be met from available memory, saturation occurs.• Sudden increases in utilization without accompanying increases in throughput can also be used to detect degraded performance modes caused by software 'aging' issues, such as memory leaks.
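A rough in-process view of heap utilization can be had from the standard Runtime API. A minimal sketch (the class name is illustrative; what threshold counts as "saturated" is application-specific):

```java
public class HeapUsage {
    // Percentage of the maximum heap currently in use; a sustained high
    // value here is a rough, in-process signal of approaching saturation.
    static double usedPercent(long used, long max) {
        return 100.0 * used / max;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory(); // bytes currently occupied
        long max  = rt.maxMemory();                     // the -Xmx ceiling
        System.out.printf("used=%dK max=%dK (%.1f%%)%n",
                used / 1024, max / 1024, usedPercent(used, max));
    }
}
```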
  58. 58. vmstat: Glimpse of Memory Utilization. If the scan rate (sr) is continuously over 200 pages per second then there is a memory shortage on the system. Counter descriptions: swap: available swap space in Kbytes. free: combined size of the cache list and free list. re: page reclaims, the number of pages reclaimed from the cache list. mf: minor faults, the number of pages attached to an address space. fr: page-frees, Kbytes that have been freed. pi and po: Kbytes paged in and paged out respectively. de: anticipated short-term memory shortfall in Kbytes, to free ahead. sr: the number of pages scanned by the page scanner per second.
  59. 59. Memory Consumption Model
  60. 60. Relieving Memory Pressure. After free memory is exhausted, pages are reclaimed from the cache list (FS, I/O, etc. caches). Next the swapper swaps out entire threads, seriously degrading the performance of swapped-out applications. The page scanner selects pages to free, and is characterized by the scan rate (sr) from vmstat. Both use some form of the Not Recently Used algorithm. The swapper and the page scanner are only used when appropriate. Since Solaris 8, the cyclic page cache, which maintains lists for Least Recently Used selection, is preferred.
  61. 61. Heap and Non-Heap Memory• Heap memory: storage for Java objects (-Xmx<size> & -Xms<size>)• Non-heap memory: per-class structures such as the runtime constant pool, field and method data, code for methods and constructors, as well as interned Strings; stores loaded classes and other metadata, the JVM code itself, JVM internal structures, loaded profiler agent code and data, etc. (-XX:MaxPermSize=<size>)• Other space the system/OS takes for the process: stacks of threads (-Xss & -Xoss), system & native space
  62. 62. What is Garbage Collection? Reclaiming memory from inaccessible objects.
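A tiny sketch of "inaccessible" in practice: once the last strong reference is dropped, a WeakReference lets us observe the collector reclaiming the object. System.gc() is only a hint (as a later slide notes), so the loop hedges against a hint being ignored; class and method names are illustrative:

```java
import java.lang.ref.WeakReference;

public class Unreachable {
    // Returns true if the weak referent was collected after GC hints --
    // evidence that the object became inaccessible and was reclaimed.
    static boolean collectAndCheck() {
        Object obj = new Object();
        WeakReference<Object> ref = new WeakReference<>(obj);
        obj = null;                          // drop the only strong reference
        for (int i = 0; i < 100 && ref.get() != null; i++) {
            System.gc();                     // a hint, not a guarantee
        }
        return ref.get() == null;
    }

    public static void main(String[] args) {
        System.out.println(collectAndCheck() ? "collected" : "still reachable");
    }
}
```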
  63. 63. Stack Overflow or Out of Memory• If you see OutOfMemoryError: unable to create native thread – This means your application is falling short of native memory space (C space) – Either there is insufficient memory to allocate the stack for the new thread – Or the application has crossed the JVM's memory limit (3.2 GB in a 32-bit environment) – The JVM/application hangs with this error; we need to restart. • See if you can reduce the active threads which ate away the system's memory • Or see if you can decrease the stack size to lower memory use per thread • If you can't bring memory consumption down, you need more system memory• If you see StackOverflowError – It means the thread that threw this error fell short of stack memory space – A thread stacks the states of the methods it invokes on to the stack memory – For the number of nested invocations the thread makes, memory is insufficient – Only the thread dies with this error; the application doesn't hang. • See if you can bring down the number of nested invocations by the thread • Or else, increase the stack size with VM option -Xss; by default it is 1m
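The difference between the two failures can be demonstrated: in this sketch one thread exhausts its stack and dies with StackOverflowError while the rest of the application keeps running. The class name is illustrative and the depth reached depends on the -Xss setting:

```java
public class StackDepth {
    static int depth = 0;

    // Recurse until the thread's stack is exhausted; each nested call
    // pushes another frame onto the stack memory the slide describes.
    static void recurse() {
        depth++;
        recurse();
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> {
            try {
                recurse();
            } catch (StackOverflowError e) {
                // Only this thread dies; the application keeps running.
                System.out.println("overflowed at depth " + depth);
            }
        });
        t.start();
        t.join();
        System.out.println("main thread still alive");
    }
}
```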
  64. 64. Pros and Cons of Garbage Collection? Advantages: increased reliability; easier to write complex apps; no memory leaks or invalid pointers. Disadvantages: unpredictable application pauses; increased CPU/memory utilization; brutally complex.
  65. 65. GC Logging• Java garbage collection activity may be recorded in a log file. VM options: – -verbose:gc (enable GC logging, outputs to std-err) – -Xloggc:<file> (GC logging to file) – -XX:+PrintGCDetails (detailed GC records) – -XX:+PrintGCDateStamps (absolute instead of relative timestamps) – Note: from relative timestamps in a GC log we can find absolute times by tracing either forward from application/GC start or backwards from application/GC stop• Asynchronous garbage collection occurs whenever available memory is low.• System.gc() does not force a synchronous garbage collection but just gives a hint to the VM. VM option: – -XX:+DisableExplicitGC (disable explicit GC)
  66. 66. What to look for in GC Logs?• Important information from GC logs: – The size of the heap after garbage collection – The time taken to run the garbage collection – The number of bytes reclaimed by garbage collection• The heap size after GC may give us a good idea of the memory requirement. – 36690K->35325K(458752K), 4.3713348 secs – (1365K reclaimed)• The other two help us assess the cost of GC to your application.• All of them together help us tune GC.
  67. 67. How to Calculate the Impact of GC on your Application?• Run a test (60 sec, collect GC logs) – 36690K->35325K(458752K), 4.3713348 secs – (1365K reclaimed) – 42406K->41504K(458752K), 4.4044878 secs – (902K reclaimed) – 48617K->47874K(458752K), 4.5652409 secs – (770K reclaimed)• Measure – Out of 60 sec, GC ran for 17.2 sec, i.e. 29% of the time. – Considering relative CPU utilization, the GC cost may be even higher. – 3037K of memory was recycled in 60 secs, i.e. 51831 bytes/second• Analyze – 29% of time being consumed by GC is too high (it should be between 5-15%) – Is 51831 bytes/sec of memory recycled justifiable for the work performed? – For average 50-byte objects, it churned around 1036 objects/sec
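The measurement above can be automated by parsing the GC log lines. A sketch assuming the log line shape shown on the slide (the class name and regex are illustrative; the slide's totals of 17.2 s and 3037K evidently cover more GC events than the three lines reproduced, so the sketch's sums over just those lines come out lower):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcImpact {
    // Matches the slide's log shape: "36690K->35325K(458752K), 4.3713348 secs"
    static final Pattern LINE =
            Pattern.compile("(\\d+)K->(\\d+)K\\((\\d+)K\\), ([\\d.]+) secs");

    // Returns {reclaimed Kbytes, GC seconds} summed over all matching lines.
    static double[] analyze(String log) {
        long reclaimedK = 0;
        double secs = 0;
        Matcher m = LINE.matcher(log);
        while (m.find()) {
            reclaimedK += Long.parseLong(m.group(1)) - Long.parseLong(m.group(2));
            secs += Double.parseDouble(m.group(4));
        }
        return new double[] { reclaimedK, secs };
    }

    public static void main(String[] args) {
        String log = "36690K->35325K(458752K), 4.3713348 secs\n"
                   + "42406K->41504K(458752K), 4.4044878 secs\n"
                   + "48617K->47874K(458752K), 4.5652409 secs\n";
        double[] r = analyze(log);
        double testSecs = 60;  // duration of the test run
        System.out.printf("reclaimed=%.0fK gcTime=%.2fs overhead=%.1f%% churn=%.0f bytes/sec%n",
                r[0], r[1], 100 * r[1] / testSecs, r[0] * 1024 / testSecs);
    }
}
```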
  68. 68. Heap Ranges: Xms to Xmx• The heap range can be defined – VM args -Xmx & -Xms define the upper & lower bounds of the heap size• What causes the VM to expand the heap? – Expansion of the heap is CPU intensive and leaves the heap fragmented – The VM tries GC, defragmentation, compaction, etc. to free up memory. – If still unable to free up the required memory, the VM decides to expand the heap – The VM may not wait till the brink; it keeps some free space for temporary objects – By default, Sun tries to keep the proportion of free space to living objects at each garbage collection within a 40%-70% range. • If less than 40% of the heap is free after GC, expand the heap • If more than 70% of the heap is free after GC, contract the heap – VM args that can customize the default ratio: • -XX:MinHeapFreeRatio • -XX:MaxHeapFreeRatio
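The 40%-70% rule above can be sketched as a simple decision function. This is a deliberate simplification of the real VM sizing policy, for illustration only; the class and method names are hypothetical:

```java
public class HeapResizePolicy {
    // Sketch of the default policy described above: keep free heap between
    // MinHeapFreeRatio (40%) and MaxHeapFreeRatio (70%) after each GC.
    static String decide(long freeAfterGc, long heapSize) {
        double freePct = 100.0 * freeAfterGc / heapSize;
        if (freePct < 40) return "expand";    // too little slack for temp objects
        if (freePct > 70) return "contract";  // heap larger than needed
        return "keep";
    }

    public static void main(String[] args) {
        System.out.println(decide(30, 100));  // 30% free
        System.out.println(decide(80, 100));  // 80% free
        System.out.println(decide(50, 100));  // 50% free
    }
}
```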
  69. 69. Gross Heap Tuning• Consequences of large heap sizes – GC cycles occur less frequently, but each sweep takes longer – Long GC cycles may induce perceptible pauses in the system. – If the heap grows to a size larger than available RAM, paging/swapping may occur.• Consequences of small heap sizes – GC runs too frequently, with less recovery in each cycle – The cost of GC increases – Since GC has to sweep less space each time, pauses are imperceptible.• Max versus min heap sizes – Contraction & expansion of the heap is costly and should be worth the cause. – Frequent contraction and expansion also leads to a fragmented heap. – Keep Xmx=Xms for a transaction-oriented system which frequently peaks. – Keep Xms<Xmx if the application infrequently operates at upper capacity.
  70. 70. We Just Learnt Gross Heap Tuning. There might still be a need for fine tuning.• We can fine tune the GC considering the intricacies of the GC algorithm & heap structure. We will learn this shortly.• Gross heap tuning is quite simple yet effective & empirically established.• Gross techniques are fairly effective irrespective of the variables and, most importantly, we can always afford to apply them.
  71. 71. What is the advanced heap made of? The one that works with the Generational Garbage Collector in the JVM• The HEAP is made up of – Old Space or Tenure Space • Objects that grow old in the young space are transferred here. – Young Space or Eden Space • Young objects are held here. – Scratch Space • Working space for algorithms – New Space • <Young Space> + <Scratch Space>
  72. 72. jmap -heap
  73. 73. Generational Garbage Collector Modern Heap
  74. 74. Fine Tuning the Heap
  75. 75. Are there better GC implementations to choose? JDK 1.4.x options. Young generation: Low-pause collectors: Copying Collector (default, 1 CPU); Parallel Copying Collector -XX:+UseParNewGC (2+ CPUs). Throughput collectors: Copying Collector (default, 1 CPU); Parallel Scavenge Collector -XX:+UseParallelGC, -XX:+UseAdaptiveSizePolicy, -XX:+AggressiveHeap (2+ CPUs). Heap sizes: -XX:NewSize, -XX:MaxNewSize, -XX:SurvivorRatio. Old generation: Low-pause collectors: Mark-Compact Collector (default, 1 CPU); Concurrent Collector -XX:+UseConcMarkSweepGC (2+ CPUs). Throughput collectors: Mark-Compact Collector (default). Heap sizes: -Xms, -Xmx. Permanent generation: can be turned off with -Xnoclassgc (use with care). Heap sizes: -XX:PermSize, -XX:MaxPermSize.
  76. 76. jstat. Reference: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jstat.html
  77. 77. Heap Dump (Java): a snapshot of the memory at a point in time. VMs usually invoke a GC before dumping the heap. It contains:• Objects (class, fields, primitive values and references)• Classes (classloader, name, super class, static fields)• GC roots (objects defined to be reachable by the JVM)• Thread stacks (with per-frame information about local objects). It does not contain:• Allocation information: who created the objects and where were they created?• Live & stale: used memory consists of both live and dead objects; since the JVM usually does a GC before generating a heap dump, and tools may attempt to remove objects unreachable from the GC roots when loading the dump, mostly live objects remain.
  78. 78. Heap Dump (Java): How to take it?• On demand: VM-arg (> JDK 1.4.2_12) # -XX:+HeapDumpOnCtrlBreak; Tools (# JDK 6): JConsole, VisualVM, MAT; jmap -d64 -dump:file=<file-ascii-hdump> <pid>; jmap -d64 -dump:format=b,file=<file-bin-hdump> <pid>• Automatic on crash: VM-arg # -XX:+HeapDumpOnOutOfMemoryError• Postmortem after crash, from core dump: jmap -d64 -dump:format=b,file=<file> <java-bin> <core-file>
  79. 79. Heap Dump (Java): Shallow vs Retained Heap. Shallow heap:• Memory held by an object's primitive fields and reference variables• Excludes referenced objects, counting just the references (32/64 bits). Retained heap:• The object's shallow size plus the shallow sizes of the objects that are accessible, directly or indirectly, only from this object.• The memory that's freed by the GC when this object is collected. Garbage collection roots:• A garbage collection root is an object accessible from outside the heap.• GC root objects will not be collected by the garbage collector at the time of measuring (Java/native locals, threads, system classes, JNI references, monitors, finalizers)
  80. 80. Shallow vs. Retained Heap http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fconcepts%2Fshallowretainedheap.html In general, the retained size of a GC root is an integral measure which helps to understand memory consumption by object graphs.
  81. 81. Dominator Tree (Object Dependencies)• Identifies chunks of retained memory & what keeps them alive• In the dominator tree each object is the immediate dominator of its children, so dependencies between the objects are easily identified.• The edges in the dominator tree do not directly correspond to object references from the object graph; the same object may actually be in the retained set of multiple roots.• http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fconcepts%2Fshallowretainedheap.html
  82. 82. OQL (Object Query Language): Heap dumps are not just for troubleshooting• OQL is an object query language that lets us query the heap dump in SQL fashion.• This enables us to analyze the heap not only after problems but to proactively search for patterns. Example: a select to see if there are more than two objects for Boolean; ideally the two singletons .TRUE and .FALSE (like enums) are sufficient – select toHtml(a) + " = " + a.value from java.lang.Boolean a where objectid(a.clazz.statics.TRUE) != objectid(a) && objectid(a.clazz.statics.FALSE) != objectid(a) (runs on VisualVM)• VisualVM and MAT both support nice interfaces for OQL http://docs.oracle.com/javase/6/docs/technotes/tools/share/jhat.html http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fwelcome.html
  83. 83. References• Thread Dump Analyzer (Thread Dumps)• (http://java.net/projects/tda/)• GC Viewer (GC logs)• (http://www.tagtraum.com/gcviewer.html)• Eclipse Memory Analyzer tool (Heap Dump, OQL) (http://help.eclipse.org/indigo/topic/org.eclipse.mat.ui.help/welcome.html )• Visual VM / J-Console /JMX – (Inspect Live Application, Snapshots, Dumps, OQL) Bundled with Java SDK
  84. 84. Feedback – Q&A simar.singh@redknee.com learn@ssimar.com
