1. Rohit Kelapure IBM Advisory Software Engineer 29 September 2011 Server Resiliency - Debugging Java deployments
2. Introduction to Speaker: Rohit Kelapure
- Responsible for the resiliency of WebSphere Application Server
- Team lead and architect of caching & data replication features in WebSphere
- Called upon to hose down fires & resolve critical situations
- Customer advocate for large banks
- Active blogger: All Things WebSphere
- Apache OpenWebBeans committer
- Java EE, OSGi & Spring developer
- kelapure@us.ibm.com, kelapure@gmail.com, LinkedIn, http://twitter.com/#!/rkela
5. Outline
- Server resiliency fundamentals
- Common JVM problems
- Protecting your JVM
  - Hung thread detection, thread interruption, thread hang recovery
  - Memory leak detection, protection & action
- Scenario-based problem resolution
- Tooling
  - Eclipse Memory Analyzer
  - Thread Dump Analyzer
  - Garbage Collection and Memory Visualizer
6. Resiliency: the property of a material that can absorb external energy when it is forced to deform elastically, and can then recover its original form and release that energy.
7. Server Resiliency Concepts (October 4, 2011)
- Explicit messaging
  - Distributed shared memory: better consistency
  - Message passing: loose coupling
- Uniform interface
  - E.g. the World Wide Web
  - Better scalability, reusability and reliability
  - Data, process and other forms of computation are identified by one mechanism
  - Semantics of the operations in messages that operate on the data are unified
- Self management
  - Composed of self-managing components
  - Managed element, managers, sensors & effectors
  - E.g. TCP/IP congestion control
- Redundancy (data and processing)
  - Create replicas
  - High cost of initialization and reconfiguration
  - Redundant elements need to be synchronized from time to time
- Partition
  - Splitting the data into smaller pieces and storing them in a distributed fashion
  - Allows for parallelization & divide and conquer
  - Partial failure isolation
- Virtualization
  - Functionalities of processing and data elements are virtualized as a service
  - Loose coupling between the system and consumed services
  - Integration by explicitly enforcing boundaries and schema-based interfaces
- Decentralized control
  - High communication overhead of centralized control for a system with heavy redundancy
  - Sometimes trapped in locally optimized solutions
  - Fixing issues requires shutting down the entire system, e.g. the AWS outage
9. Thread Hangs
- Threading and synchronization issues are among the top 5 application performance challenges
  - Being too aggressive with shared resources causes data inconsistencies
  - Being too conservative leads to too much contention between threads
- Application unresponsiveness
  - Adding users / threads / CPUs causes the app to slow down (less throughput, worse response)
  - High lock acquire times & contention
  - Race conditions, deadlock, I/O under lock
- Tooling is needed to rescue applications and the JVM from themselves
  - Identify these conditions
  - If possible, remedy them in the short term for server resiliency
10. JVM Hung Thread Detection
- Every X seconds an alarm thread wakes up and iterates over all managed thread pools
- It subtracts each thread's "start time" from the current time and passes the result to a monitor
- A detection policy then determines, based on the available data, whether the thread is hung
- The stack trace of the hung thread is printed
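The detection loop described on this slide can be sketched roughly as follows. This is a minimal illustration, not the WebSphere implementation: the class `HungThreadMonitor`, its method names and the fixed threshold policy are all hypothetical, and the caller is assumed to record when each pool thread starts and finishes a task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a hung-thread detector; not the WebSphere implementation. */
final class HungThreadMonitor {
    // Start time (ms) of the task each managed thread is currently running.
    private final Map<Thread, Long> startTimes = new ConcurrentHashMap<>();
    private final long thresholdMillis;

    HungThreadMonitor(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /** Called by a pool thread when it picks up a task. */
    void taskStarted(Thread t, long nowMillis) {
        startTimes.put(t, nowMillis);
    }

    /** Called by a pool thread when its task completes. */
    void taskFinished(Thread t) {
        startTimes.remove(t);
    }

    /** One pass of the alarm thread: returns threads active longer than the threshold. */
    List<Thread> findHungThreads(long nowMillis) {
        List<Thread> hung = new ArrayList<>();
        for (Map.Entry<Thread, Long> e : startTimes.entrySet()) {
            long activeFor = nowMillis - e.getValue();
            if (activeFor > thresholdMillis) {   // assumed policy: a simple fixed threshold
                hung.add(e.getKey());
            }
        }
        return hung;
    }

    /** Prints the current stack trace of each suspected hung thread. */
    static void report(List<Thread> hung) {
        for (Thread t : hung) {
            System.err.println("Possibly hung: " + t.getName());
            for (StackTraceElement frame : t.getStackTrace()) {
                System.err.println("    at " + frame);
            }
        }
    }
}
```

In a real server the alarm pass would likely run on a timer (for example a `ScheduledExecutorService` calling `findHungThreads(System.currentTimeMillis())` every X seconds), and the policy could adapt its threshold after false alarms.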
11. Thread Interruption 101
- Thread.stop() stops a thread by throwing a ThreadDeath exception (deprecated)
- Thread.interrupt(): a cooperative mechanism for a thread to signal another thread that it should, at its convenience and if it feels like it, stop what it is doing and do something else
- Interruption is usually the most sensible way to implement task cancellation
- Because each thread has its own interruption policy, you should not interrupt a thread unless you know what interruption means to that thread
- Any method sensing interruption should
  - Assume the current task is cancelled & perform some task-specific cleanup
  - Exit as quickly and cleanly as possible, ensuring that callers are aware of the cancellation
    - Propagate the exception, making your method an interruptible blocking method: throw new InterruptedException()
    - Or restore the interruption status so that code higher up on the call stack can deal with it: Thread.currentThread().interrupt()
- Only code that implements a thread's interruption policy may swallow an interruption request
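The two ways of responding to interruption listed on this slide can be sketched as below. The class and method names (`InterruptionExamples`, `takeNext`, `takeNextOrNull`) are made up for illustration; only `BlockingQueue.take()` and `Thread.currentThread().interrupt()` are standard JDK calls.

```java
import java.util.concurrent.BlockingQueue;

/** Illustrative only: two ways a method can respond to interruption. */
final class InterruptionExamples {

    /** Option 1: propagate, which makes this an interruptible blocking method. */
    static String takeNext(BlockingQueue<String> queue) throws InterruptedException {
        return queue.take();
    }

    /**
     * Option 2: the signature cannot declare InterruptedException
     * (for example inside Runnable.run), so restore the interrupted status
     * for code higher up the call stack.
     */
    static String takeNextOrNull(BlockingQueue<String> queue) {
        try {
            return queue.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // do not swallow the request
            return null;                        // exit quickly; caller sees the cancellation
        }
    }
}
```

Returning `null` on cancellation is just one possible convention; the point of the sketch is that the interrupted status survives the catch block.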
14. Dealing with Non-interruptible Blocking
- Many blocking library methods respond to interruption by returning early and throwing InterruptedException
  - This makes it easier to build tasks that are responsive to cancellation
  - Lock.lockInterruptibly, Thread.sleep, Object.wait, Thread.join
- Not all blocking methods or blocking mechanisms are responsive to interruption
  - If a thread is blocked performing synchronous socket I/O, interruption has no effect other than setting the thread's interrupted status
  - If a thread is blocked waiting for an intrinsic lock, there is nothing you can do to stop it short of ensuring that it eventually acquires the lock
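As a contrast with intrinsic locks, the following minimal sketch uses `ReentrantLock.lockInterruptibly()` so that a thread waiting for the lock can still be cancelled. The wrapper class `InterruptibleLocking` and its `runLocked` method are hypothetical names chosen for this example.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative wrapper: lock acquisition that stays responsive to interruption. */
final class InterruptibleLocking {
    private final ReentrantLock lock = new ReentrantLock();

    /** Runs the action under the lock; a caller waiting for the lock can be interrupted. */
    void runLocked(Runnable action) throws InterruptedException {
        lock.lockInterruptibly();   // unlike a synchronized block, this wait responds to interrupt()
        try {
            action.run();
        } finally {
            lock.unlock();
        }
    }
}
```

Code that waits on `synchronized` has no equivalent escape hatch, which is why the slide treats intrinsic-lock waits as non-interruptible.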
15. Thread Hang Recovery: Technique
- Application-specific hacks for thread hang recovery
- Byte code instrumentation
  - Transform the concrete subclasses of the abstract classes InputStream & OutputStream to make the socket I/O operations interruptible
  - Transform an application class so that every loop can be interrupted by invoking Interrupter.interrupt(Thread, boolean)
  - Transform a monitorenter instruction and a monitorexit instruction so that the wait on entering a monitor is interruptible
- http://www.ibm.com/developerworks/websphere/downloads/hungthread.html
16. Memory Leaks
- Leaks come in various types, such as
  - Memory leaks
  - Thread and ThreadLocal leaks
  - ClassLoader leaks
  - System resource leaks
  - Connection leaks
- Customers want to increase application uptime without cycling the server
  - Frequent application restarts without stopping the server
  - Frequent redeployments of the application result in OOM errors
- What do we have today?
  - Offline post-mortem analysis of a JVM heap with tools like JRockit Mission Control and MAT; IEMA are the IBM Extensions for Memory Analyzer
  - Runtime memory leak detection using JVMTI and PMI (Runtime Performance Advisor)
- We don't have application-level, i.e. top-down, memory leak detection and protection
  - Leak detection by looking at suspect patterns in application code
17. ClassLoader Leaks 101
- A class is uniquely identified by its name + the class loader that loaded it
- A class with the same name can be loaded multiple times in a single JVM, each time in a different class loader
- Web containers use this for isolating web applications
  - Each web application gets its own class loader
- Reference chain
  - An object retains a reference to the class it is an instance of
  - A class retains a reference to the class loader that loaded it
  - The class loader retains a reference to every class it loaded
- Retaining a reference to a single object from a web application pins every class loaded by the web application
- These references often remain after a web application reload
- With each reload more classes get pinned, ultimately leading to an OOM
18. Tomcat-pioneered approach: Leak Prevention
- JRE-triggered leak
  - Singleton / static initializer
    - Can be a Thread
    - Something that won't get garbage collected
  - Retains a reference to the context class loader when loaded
  - If web application code triggers the initialization
    - The context class loader will be the web application class loader
    - A reference is created to the web application class loader
    - This reference is never garbage collected
    - Pins the class loader (and hence all the classes it loaded) in memory
- Prevention with a DeployedObjectListener
  - Call the various parts of the Java API that are known to retain a reference to the current context class loader
  - Initialize these singletons when the Application Server's class loader is the context class loader
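The prevention idea on this slide (trigger the JRE singletons while the server's class loader, not the web application's, is the context class loader) can be sketched as follows. `ContextLoaderGuard` and `runWithContextLoader` are hypothetical names, and the `DriverManager.getDrivers()` call in `main` is only one illustrative example of an API that may capture the context class loader; this is not the actual DeployedObjectListener code.

```java
/** Hypothetical helper: run an initializer under a chosen context class loader. */
final class ContextLoaderGuard {

    /** Temporarily swaps the thread's context class loader, then restores it. */
    static void runWithContextLoader(ClassLoader loader, Runnable initializer) {
        Thread current = Thread.currentThread();
        ClassLoader original = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            initializer.run();
        } finally {
            current.setContextClassLoader(original); // always restore
        }
    }

    public static void main(String[] args) {
        // Assumption for this sketch: the system class loader stands in for
        // the application server's own class loader.
        ClassLoader serverLoader = ClassLoader.getSystemClassLoader();
        runWithContextLoader(serverLoader, () -> java.sql.DriverManager.getDrivers());
        System.out.println("JRE singletons initialized under " + serverLoader);
    }
}
```

Because the singletons are initialized early and under the server's loader, a later call from web application code finds them already created and no reference to the application's class loader is retained.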
19. Leak Detection
- Application-triggered leaks
  - ClassLoader
  - Threads
  - ThreadLocal
  - JDBC drivers
- Non-application
  - RMI targets
  - Resource bundle
  - Static final references
  - IntrospectionUtils
  - Loggers
- Prevention
  - Code executes when a web application is stopped, un-deployed or reloaded
  - Check, via a combination of standard API calls and some reflection tricks, for known causes of memory leaks
21. What is wrong with my application?
- Why does my application run slow every time I do X?
- Why does my application have erratic response times?
- Why am I getting Out of Memory errors?
- What is my application's memory footprint?
- Which parts of my application are CPU intensive?
- How did my JVM vanish without a trace?
- Why is my application unresponsive?
- What monitoring do I put in place for my app?
22. What is your JVM up to?
- Windows-style task manager for displaying thread status and allowing for their recovery & interruption
- Leverages the ThreadMXBean API in the JDK to display thread information
- https://github.com/kelapure/dynacache/blob/master/scripts/AllThreads.jsp
- https://github.com/kelapure/dynacache/blob/master/scripts/ViewThread.jsp
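A minimal sketch of the ThreadMXBean usage this slide refers to is shown below. It is not the code of the linked JSP scripts; the class `ThreadLister` and the output format are assumptions made for illustration, while `ManagementFactory.getThreadMXBean()` and `dumpAllThreads(boolean, boolean)` are standard JDK APIs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

/** Illustrative "task manager" listing built on ThreadMXBean. */
final class ThreadLister {

    /** Returns one summary line per live thread: id, name, state and top stack frame. */
    static List<String> describeThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> lines = new ArrayList<>();
        // false, false: skip locked monitor/synchronizer details to keep the dump cheap.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // defensive: thread may have died during the dump
            }
            StackTraceElement[] stack = info.getStackTrace();
            String top = stack.length > 0 ? stack[0].toString() : "(no stack)";
            lines.add(info.getThreadId() + " \"" + info.getThreadName() + "\" "
                    + info.getThreadState() + " at " + top);
        }
        return lines;
    }

    public static void main(String[] args) {
        describeThreads().forEach(System.out::println);
    }
}
```

A management page could render these lines in a table and offer an "interrupt" action per row, subject to the interruption caveats discussed earlier in the deck.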
23. Application runs slow when I do XXX?
- Understand the impact of the activity on components
  - Look at the thread & method profiles
  - IBM Java Health Center
  - VisualVM
  - JRockit Mission Control
- JVM method & dump trace: pinpoint performance problems
  - Shows entry & exit times of any Java method
  - Method trace to file for all methods in tests.mytest.package
  - Allows taking a javadump, heapdump, etc. when a method is hit
  - Dump a javacore when method testInnerMethod in an inner class TestInnerClass of a class TestClass is called
- Use BTrace, -Xtrace and -Xdump to trigger dumps on a range of events
  - gpf, user, abort, fullgc, slow, allocation, thrstop, throw ...
  - Stack traces, tool launching
24. Application has erratic response times?
- Verbose GC should be enabled by default
  - <2% impact on performance
- VisualGC, GCMV & PMAT: visualize GC output
  - In-use space after GC: a positive gradient over time indicates a memory leak
  - Increased load (use for capacity planning)
  - Memory leak (take heap dumps for problem determination)
- Choose the right GC policy
  - Optimized for "batch" type applications, consistent allocation profile
  - Tight responsiveness criteria, allocations of large objects
  - High rates of object "burn", large number of transitional objects
  - 12, 16 core SMP systems with allocation contention (AIX only)
- GC overhead > 10%: wrong policy or more tuning needed
- Enable compressed references for 64-bit JVMs
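The 10% GC overhead guideline above can be checked roughly from inside the JVM. The sketch below is an approximation under stated assumptions: `GcOverhead` is a hypothetical class, overhead is taken as cumulative collection time divided by JVM uptime, and `getCollectionTime()` reports approximate accumulated elapsed time, so for concurrent collectors it may not equal the time the application was actually paused. Verbose GC logs analyzed in a tool such as GCMV remain the more precise source.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Rough, illustrative GC overhead check based on the standard GC MXBeans. */
final class GcOverhead {

    /** Fraction of elapsed time spent in GC, e.g. 0.10 for 10%. */
    static double overhead(long gcMillis, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / elapsedMillis;
    }

    /** Cumulative collection time (ms) reported by all collectors since JVM start. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        double o = overhead(totalGcMillis(), uptime);
        System.out.printf("GC overhead since start: %.1f%%%n", o * 100);
        if (o > 0.10) {
            System.out.println("Above the 10% guideline: review GC policy or tuning");
        }
    }
}
```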
25. Out Of Memory Errors?
- JVM heap sized incorrectly
  - GC adapts the heap size to keep occupancy between 40% and 70%
  - Determine the heap occupancy of the app under load
  - Xmx = 43% larger than the maximum occupancy of the app
  - For 700MB occupancy, a 1000MB max heap is required (700 + 43% of 700)
- Analyze heapdumps & system dumps with tools like Eclipse Memory Analyzer
- Lack of Java heap or native heap
  - Eclipse Memory Analyzer and IBM extensions
  - Finding which methods allocated large objects
    - Prints a stack trace for all objects above 1K
  - Enable Java heap and native heap monitoring
    - JMX and metrics output by the JVM
- Classloader exhaustion
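The sizing rule on this slide follows from the 70% upper occupancy bound: dividing the peak occupancy by 0.70 is the same as adding roughly 43% (1 / 0.70 ≈ 1.43). A small sketch of that arithmetic, using a hypothetical `HeapSizing` helper:

```java
/** Illustrative arithmetic for the heap sizing rule of thumb. */
final class HeapSizing {

    /** Max heap (MB) at which the given peak occupancy equals the target fraction. */
    static long recommendedXmxMb(long peakOccupancyMb, double targetOccupancy) {
        if (targetOccupancy <= 0.0 || targetOccupancy > 1.0) {
            throw new IllegalArgumentException("targetOccupancy must be in (0, 1]");
        }
        return Math.round(peakOccupancyMb / targetOccupancy);
    }

    public static void main(String[] args) {
        // 700 MB peak occupancy at a 70% target: 700 / 0.70 = 1000 MB
        System.out.println("-Xmx" + recommendedXmxMb(700, 0.70) + "m");
    }
}
```

The peak occupancy itself should come from measurements under load (for example the in-use space after GC in verbose GC output), not from an idle system.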
26. Application's memory footprint?
- HPROF: profiler shipped with the JDK, uses JVMTI
  - Analysis of memory usage: -Xrunhprof:heap=all
- Performance Inspector tools: JPROF Java profiling agent
  - Captures the state of the Java heap, later processed by HDUMP
- Group a system dump by classloader
  - Since each app has its own classloader, you can get accurate information on how much heap each application is taking up
- Use MAT to investigate heapdumps & system dumps
  - Find large clumps, inspect those objects: what retains them?
  - Why is this object not being garbage collected? List Objects > incoming refs, Path to GC roots, Immediate dominators
  - Limit analysis to a single application in a JEE environment: Dominator tree grouped by Class Loader
  - Set of objects that can be reclaimed if we could delete X: Retained Size Graphs
  - Traditional memory hogs like HTTPSession, Cache: use Object Query Language (OQL)
27. Using Javacores for Troubleshooting
- Javacores are often the most critical piece of information for resolving a hang, high CPU, a crash and sometimes memory problems
- A Javacore is a text file that contains a lot of useful information
  - The date, time, Java version, full command path and arguments
  - All the threads in the JVM, including thread state, priority, thread ID and name
  - Thread call stacks
- Javacores can be generated automatically or on demand
  - Automatically when an OutOfMemoryError is thrown
  - On demand with "kill -3 <pid>"
- A message is written to SystemOut when a javacore is generated

"WebContainer : 537" (TID:0x088C7200, sys_thread_t:0x09C19F00, state:CW, native ID:0x000070E8) prio=5
    at java/net/SocketInputStream.socketRead0(Native Method)
    at java/net/SocketInputStream.read(SocketInputStream.java:155)
    at oracle/net/ns/Packet.receive(Bytecode PC:31)
    at oracle/net/ns/DataPacket.receive(Bytecode PC:1)
    at oracle/net/ns/NetInputStream.read(Bytecode PC:33)
    at oracle/jdbc/driver/T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1123)
    at oracle/jdbc/driver/T4C8Oall.receive(T4C8Oall.java:480)
    at oracle/jdbc/driver/T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:813)
    at oracle/jdbc/driver/OracleStatement.doExecuteWithTimeout(OracleStatement.java:1154)
    at oracle/jdbc/driver/OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3415)
    at com/ibm/commerce/user/objects/EJSJDBCPersisterCMPDemographicsBean_2bcaa7a2.load()
    at com/ibm/ejs/container/ContainerManagedBeanO.load(ContainerManagedBeanO.java:1018)
    at com/ibm/ejs/container/EJSHome.activateBean(EJSHome.java:1718)
28. CPU-intensive parts of the app?
- Thread dumps or Javacores: the poor man's profiler
  - Periodic javacores
  - Thread analysis using the Thread Monitor Dump Analyzer tool
- High CPU is typically diagnosed by comparing two key pieces of information
  - Using Javacores, determine what code the threads are executing
  - Gather CPU usage statistics by thread
- For each Javacore, compare the call stacks between threads
  - Focus first on request processing threads
  - Are all the threads doing similar work?
  - Are the threads moving?
- Collect CPU statistics per thread
  - Is there one thread consuming most of the CPU?
  - Are there many active threads each consuming a small percentage of CPU?
- High CPU due to excessive garbage collection?
- If this is a load/capacity problem, use the HPROF profiler: -Xrunhprof:cpu=samples, -Xrunhprof:cpu=time
29. Diagnosis - Hangs
- Often hangs are due to unresponsive synchronous requests
  - SMTP server, database, map service, store locator, inventory, order processing, etc.

3XMTHREADINFO "Servlet.Engine.Transports : 11" (TID:0x7DD38040, sys_thread_t:0x44618828, state:R, native ID:0x4A9F) prio=5
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.SQLExecute()
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.execute2(DB2PreparedStatement.java)
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.executeQuery(DB2PreparedStatement.java()
4XESTACKTRACE at ...
3XMTHREADINFO "Servlet.Engine.Transports : 12" (TID:0x7DD37FC0, sys_thread_t:0x4461BDA8, state:R, native ID:0x4BA0) prio=5
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.SQLExecute()
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.execute2(DB2PreparedStatement.java)
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.executeQuery(DB2PreparedStatement.java()
4XESTACKTRACE at ...
3XMTHREADINFO "Servlet.Engine.Transports : 13" (TID:0x7DD34C50, sys_thread_t:0x4465B028, state:R, native ID:0x4CCF) prio=5
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.SQLExecute()
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.execute2(DB2PreparedStatement.java)
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.executeQuery(DB2PreparedStatement.java()

- Not all hangs are waiting on an external resource
  - A JVM can hang due to a synchronization problem: one thread blocking several others

3XMTHREADINFO "Servlet.Engine.Transports : 11" (TID:0x7DD38040, sys_thread_t:0x44618828, state:R, native ID:0x4A9F) prio=5
3LKMONOBJECT com/ibm/ws/cache/Cache@0x65FB8788/0x65FB8794: owner "Default : DMN0" (0x355B4800)
3LKWAITERQ Waiting to enter:
3LKWAITER "WebContainer : 0" (0x3ACCD000)
3LKWAITER "WebContainer : 1" (0x3ACCCB00)
3LKWAITER "WebContainer : 2" (0x38D68300)
3LKWAITER "WebContainer : 3" (0x38D68800)
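The "one thread blocking several others" pattern visible in the monitor section of a javacore can also be approximated at runtime with ThreadMXBean. The sketch below groups blocked threads by the thread that owns the lock they are waiting for; `BlockedThreadReport` and `waitersByOwner` are hypothetical names, and this is an illustration rather than a replacement for javacore analysis.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative report: which thread owns a lock that other threads are blocked on. */
final class BlockedThreadReport {

    /** Maps each lock-owning thread name to the names of threads blocked behind it. */
    static Map<String, List<String>> waitersByOwner() {
        Map<String, List<String>> result = new TreeMap<>();
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: thread may have died during the dump
            }
            String owner = info.getLockOwnerName(); // null when not waiting on an owned lock
            if (info.getThreadState() == Thread.State.BLOCKED && owner != null) {
                result.computeIfAbsent(owner, k -> new ArrayList<>()).add(info.getThreadName());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        waitersByOwner().forEach((owner, waiters) ->
                System.out.println("\"" + owner + "\" blocks " + waiters));
    }
}
```

An owner with many waiters, such as the "Default : DMN0" thread in the javacore excerpt, is the first place to look when request threads stop making progress.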
30. How did my JVM vanish without a trace?
- JVM process crash: usual suspects
  - Bad JNI calls, segmentation violations, call stack overflow
  - Native memory leaks: object allocation fails with sufficient space in the JVM heap
  - Unexpected OS exceptions (out of disk space, file handles), JIT failures
- Monitor the OS process size
- Runtime check of JVM memory allocations: -Xcheck:memory
- Native memory usage: create a core dump on an OOM
- JNI code static analysis: -Xcheck:jni (errors, warnings, advice)
- GCMV provides scripts and graphing for native memory
  - Windows "perfmon", Linux "ps" & AIX "svmon"
- Find the last stack of native code executing on the thread during the crash
- The signal info (1TISIGINFO) will show that the Javacore was created due to a crash
  - Signal 11 (SIGSEGV) or GPF
36. Runtime Serviceability Aids
- Troubleshooting panels in the administration console
- Performance Monitoring Infrastructure metrics
- Diagnostic Provider MBeans
  - Dump configuration, state and run self-test
- Application Response Measurement / Request Metrics
  - Follow a transaction end-to-end and find bottlenecks
- Trace logs & First Failure Data Capture
- Runtime Performance Advisors
  - Memory leak detection, session size, ...
- Specialized tracing and runtime checks
  - Tomcat classloader leak detection
  - Session crossover, connection leak, ByteBuffer leak detection
  - Runaway CPU thread protection
Session ID: 22723
Status: Accepted
Title: JVM Flight Simulator: Debugging Java Deployments
Abstract: Troubleshooting issues such as instances of OutOfMemoryError, performance problems, and various exceptions is a common task for anyone developing or deploying an application. This deep dive session presents a hands-on demo of using open source IBM tools such as Monitoring and Diagnostic Tools for Java, Extended Memory Analyzer Tool, and the Support Assistant. Come learn how to diagnose these common problem types.
Speakers: Rohit Kelapure, IBM Advisory Software Engineer
Type: Conference Session
Length: 60 minutes
JavaOne Primary Track: Core Java Platform
JavaOne Optional Track: Java SE, Client Side Technologies, and Rich User Experiences
Challenge #5: Threading and Synchronization Issues
Of the many issues affecting the performance of Java applications, synchronization ranks near the top. There is no question that synchronization is necessary to protect shared data in an application. The fundamental need to synchronize lies with Java's support for concurrency, which allows code to be executed by separate threads within the same process. Using shared resources efficiently, such as connection pools and data caches, is very important for good application performance. Being too aggressive with shared resources causes data inconsistencies, and being too conservative leads to too much contention between threads (because resource locking is involved). This affects the performance of the application largely because most threads servicing users are affected and slowed down: they end up waiting for resources instead of doing real processing work.
If you want to resolve synchronization issues, application performance management tools can help; the right tool can enable you to monitor application execution under high loads (aka "in production") and quickly pinpoint the execution times. In doing so, your ability to identify thread synchronization issues will greatly increase, and the overall MTTR will drop dramatically.
- Length of time, in seconds, that a thread can be active before it is considered hung
- Number of times that false alarms can occur before the threshold is automatically increased
- Opportunity to implement in Apache Commons ThreadPool
- Does not include any spawned threads or unmanaged threads
- Calling interrupt does not necessarily stop the target thread from doing what it is doing; it merely delivers the message that interruption has been requested
- Thread.stop was deprecated in JDK 1.2 because it can corrupt object state
- General-purpose application task & application library code should never swallow interruption requests
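- The only way to stop a thread blocked on an intrinsic lock is to ensure that it eventually acquires the lock and makes enough progress that you can get its attention some other way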
- Path to GC roots: reference chain that prevents an object from being GCed
- Dominator tree grouped by Class Loader: limit analysis to a single application in a JEE environment
- Retained Size Graphs: set of objects that can be reclaimed if we could delete X

SELECT data as "MemorySessionData",
       data.@usedHeapSize as "Heap Size",
       data.@retainedHeapSize as "Retained Heap Size",
       mSwappableData,
       toString(data.mManager.scAppParms._J2EEName) as "J2EE Name",
       toString(data.appName) as "App Name",
       toString(data.mSessionId) as "Session ID"
FROM com.ibm.ws.webcontainer.httpsession.MemorySessionData data
- If the Javacores show that most threads are idle, it is possible that the requests are not making their way to the Application Server
- The following example shows multiple threads waiting for a DB2 database to respond. This indicates that the bottleneck is in the DB
- Connection Manager, Node Synchronization, Node agent, Deployment Manager, WebContainer Runtime Advisor
- Specialized tracing and runtime checks: connection leak, WsByteBuffer leak detection, session crossover, transaction ID, request ID
- The advisors provide a variety of advice on the following application server resources:
  - Object Request Broker service thread pools
  - Web container thread pools
  - Connection pool size
  - Persisted session size and time
  - Data source statement cache size
  - Session cache size
  - Dynamic cache size
  - Java virtual machine heap size
  - DB2 Performance Configuration wizard