Real Life Java EE Performance Tuning



  1. 1. Real Life Java EE Performance Tuning Matt Brasier Principal Consultant C2B2 Consulting LTD [email_address]
  2. 2. About Me <ul><li>Professional Services Consultant </li></ul><ul><li>Customers include </li></ul><ul><ul><li>Red Hat (JBoss) </li></ul></ul><ul><ul><li>BEA </li></ul></ul><ul><ul><li>Cape Clear </li></ul></ul><ul><ul><li>Government/Finance/Telecoms </li></ul></ul><ul><li>C2B2 Consulting </li></ul><ul><ul><li>SOA and Java EE consultancy </li></ul></ul><ul><ul><li>Fast, Reliable, Manageable, Secure </li></ul></ul>
  3. 3. What we will cover <ul><li>Philosophy </li></ul><ul><ul><li>How I approach a performance problem situation </li></ul></ul><ul><li>Enterprise Java Performance </li></ul><ul><ul><li>What kind of things affect performance of Enterprise Systems </li></ul></ul><ul><li>Case Study 1 </li></ul><ul><ul><li>A new version of the application runs slowly </li></ul></ul><ul><li>Case Study 2 </li></ul><ul><ul><li>Logging in takes a long time in the live environment </li></ul></ul><ul><li>Case Study 3 </li></ul><ul><ul><li>The application does not scale </li></ul></ul>
  4. 4. What we will learn <ul><li>Philosophy </li></ul><ul><ul><li>Suggestions to keep in mind when looking at a performance problem </li></ul></ul><ul><li>Tools </li></ul><ul><ul><li>Suggested tools for looking at a performance problem </li></ul></ul><ul><li>Techniques </li></ul><ul><ul><li>How to use the tools, knowledge and skills to solve your performance problem </li></ul></ul>
  5. 5. Philosophy <ul><li>‘A good understanding’ is the best performance tuning tool </li></ul><ul><li>Prefer common and open source tools </li></ul><ul><li>Observe, Hypothesize, Tweak, Test </li></ul><ul><li>‘Trust no-one’ </li></ul>
  6. 6. Classic Java performance problems <ul><li>Memory leaks </li></ul><ul><ul><li>Increased GC Time </li></ul></ul><ul><li>Poor GC or JVM Memory configuration </li></ul><ul><li>CPU bound code </li></ul><ul><li>IO bound code </li></ul><ul><li>Memory bound code </li></ul><ul><ul><li>Increased GC time </li></ul></ul>
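Several of the problems above show up as increased GC time. A minimal sketch of how that could be measured from inside the JVM, using the standard `java.lang.management` beans, is below. The class name `GcOverhead` and its helper methods are illustrative and not from the talk. With concurrent collectors the reported time is not necessarily pause time, so treat the figure as a rough indicator only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative helper (not from the talk): rough share of JVM uptime spent in GC.
final class GcOverhead {

    /** Fraction of elapsed time spent collecting, given cumulative counters in milliseconds. */
    static double gcFraction(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / uptimeMillis;
    }

    /** Sums the cumulative collection time of every collector; -1 means "not supported" and is skipped. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gc = totalGcMillis();
        System.out.println("GC time " + gc + " ms of " + uptime + " ms uptime, fraction "
                + gcFraction(gc, uptime));
    }
}
```

Sampling this value periodically and comparing it between releases or between servers gives a trend rather than a single number, which is usually more useful.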
  7. 7. Enterprise Java Performance <ul><li>CAVEAT: Consultancy Selection Bias </li></ul><ul><li>80/20: 80% of time finding, 20% fixing </li></ul><ul><li>Many ‘Enterprise’ Java performance problems turn out not to be ‘classic’ performance bottlenecks </li></ul><ul><ul><li>Infrastructure/Middleware performance </li></ul></ul><ul><li>There are many factors that can affect the performance of an enterprise system </li></ul><ul><ul><li>Not just code </li></ul></ul>
  8. 8. Enterprise Java Performance <ul><li>Not all Java EE performance problems are classic ‘Java performance problems’ </li></ul><ul><li>Common types of Java EE performance problem </li></ul><ul><ul><li>Resource starvation </li></ul></ul><ul><ul><li>Threading problems </li></ul></ul><ul><ul><li>‘Suboptimal configuration’ </li></ul></ul><ul><ul><li>Network-related problems </li></ul></ul><ul><ul><li>Scalability problems </li></ul></ul>
  9. 9. A Good Understanding <ul><li>Consider the system as a whole </li></ul><ul><li>Know how infrastructure components work </li></ul><ul><ul><li>Not just what they do, but how they do it </li></ul></ul><ul><li>How do the Java EE specifications say they should work? </li></ul>
  10. 10. Approach <ul><li>Understand the system </li></ul><ul><li>Understand the environment </li></ul><ul><li>Understand the situation </li></ul><ul><li>Talk to people who know </li></ul><ul><ul><li>But trust no-one </li></ul></ul><ul><li>Take a look for myself </li></ul><ul><li>Observe, Hypothesize, Tweak, Test </li></ul><ul><ul><li>Rinse and repeat </li></ul></ul>
  11. 11. Case Study 1
  12. 12. Case Study 1 <ul><li>Existing customer calls </li></ul><ul><ul><li>“We deployed a new version of the application, and it is running a lot slower” </li></ul></ul><ul><li>The Environment </li></ul><ul><ul><li>Sun Java 5 </li></ul></ul><ul><ul><li>WebLogic Server 9.2 Cluster (3 nodes) </li></ul></ul><ul><ul><li>WebLogic Integration 9.2 Cluster (3 nodes) </li></ul></ul><ul><ul><li>Documentum Document Management </li></ul></ul><ul><ul><li>Oracle Database </li></ul></ul><ul><ul><li>Solaris OS </li></ul></ul>
  13. 13. Case Study 1 <ul><li>The System </li></ul><ul><ul><li>Web Application </li></ul></ul><ul><ul><li>WLI based workflow system </li></ul></ul><ul><li>The situation </li></ul><ul><ul><li>New version deployed into the performance testing environment </li></ul></ul><ul><ul><li>Automated performance tests indicate the application is approximately 30% slower </li></ul></ul>
  14. 14. Case Study 1 <ul><li>Observe </li></ul><ul><ul><li>No monitoring in place </li></ul></ul><ul><ul><li>Some alerting, but no historical data </li></ul></ul><ul><li>Hypothesize </li></ul><ul><ul><li>If we had more monitoring, we would stand a better chance </li></ul></ul><ul><li>Tweak </li></ul><ul><ul><li>Put some monitoring in place </li></ul></ul><ul><ul><li>Hyperic HQ from SpringSource </li></ul></ul>
  15. 15. Case Study 1 <ul><li>Test </li></ul><ul><ul><li>Re-run tests </li></ul></ul><ul><li>Observe </li></ul><ul><ul><li>Monitoring indicates that one server is slower </li></ul></ul><ul><ul><ul><li>Handling fewer requests per second </li></ul></ul></ul><ul><ul><ul><li>Lots of transaction timeouts </li></ul></ul></ul><ul><ul><ul><li>Higher CPU </li></ul></ul></ul><ul><ul><ul><li>Less network traffic </li></ul></ul></ul><ul><li>Tweak </li></ul><ul><ul><li>Add more monitoring to the slow server </li></ul></ul><ul><ul><li>Examine log files </li></ul></ul><ul><ul><li>Thread dumps! </li></ul></ul>
  16. 16. Case Study 1 <ul><li>Hypothesize </li></ul><ul><ul><li>Thread dumps show lots of threads in logging code waiting to write to the log file </li></ul></ul><ul><ul><li>Log files for the slow server have DEBUG messages in them </li></ul></ul><ul><ul><ul><li>The other servers don’t </li></ul></ul></ul><ul><li>“The logging configurations are identical, the servers are configured with Maven” </li></ul><ul><ul><li>Trust no-one </li></ul></ul><ul><li>Test </li></ul><ul><ul><li>Log in to the server and manually check the logging configuration </li></ul></ul>
  17. 17. Case Study 1 <ul><li>Solution </li></ul><ul><ul><li>Debug logging was enabled on one server </li></ul></ul><ul><ul><li>Turned debug logging off - the system was then about the same speed as the old release </li></ul></ul>
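Checking the logging configuration by hand can be complemented by asking the running JVM what level is actually in effect. The talk does not say which logging framework was used, so the sketch below assumes `java.util.logging` purely for illustration; the class name `EffectiveLogLevel` and the logger names in the usage note are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative helper, assuming java.util.logging (the talk does not name the framework).
final class EffectiveLogLevel {

    /**
     * Walks up the logger hierarchy until a logger with an explicitly set level is found.
     * A logger whose own level is null inherits from its parent.
     */
    static Level effectiveLevel(Logger logger) {
        for (Logger current = logger; current != null; current = current.getParent()) {
            if (current.getLevel() != null) {
                return current.getLevel();
            }
        }
        // Only reached if even the root logger has no level; INFO is the JDK default.
        return Level.INFO;
    }

    public static void main(String[] args) {
        Logger logger = Logger.getLogger(args.length > 0 ? args[0] : "");
        System.out.println("Effective level: " + effectiveLevel(logger));
    }
}
```

Running the same check on each node and comparing the output is a quick way to test the claim that "the logging configurations are identical".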
  18. 18. Hyperic HQ
  19. 19. Hyperic HQ <ul><li>Monitoring tool </li></ul><ul><ul><li>Not a profiling tool </li></ul></ul><ul><li>Historical data </li></ul><ul><ul><li>Trends </li></ul></ul><ul><ul><li>Abnormal behaviour </li></ul></ul><ul><ul><li>‘Hot’ spots </li></ul></ul><ul><li>Wide variety of data </li></ul><ul><ul><li>JVM level statistics </li></ul></ul><ul><ul><li>JMX statistics </li></ul></ul><ul><ul><li>OS statistics </li></ul></ul>
  20. 20. Thread Dumps <ul><li>My Number 2 tool for finding performance problems </li></ul><ul><ul><li>CTRL-BREAK on Windows </li></ul></ul><ul><ul><li>kill -3 on Unix/Linux </li></ul></ul><ul><ul><li>jstack tool </li></ul></ul><ul><ul><li>Available from consoles of many application servers </li></ul></ul><ul><li>All threads in the VM and what they are doing at that moment </li></ul>
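Besides the signals and tools listed above, a dump can also be produced from inside the JVM. The sketch below uses the standard `ThreadMXBean` API; the class name `ThreadDumper` is illustrative, and the output format only approximates what `jstack` prints.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative helper: builds a jstack-like text dump of all live threads.
final class ThreadDumper {

    static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // Integer.MAX_VALUE asks for the full stack of every thread.
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds(), Integer.MAX_VALUE)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState());
            if (info.getLockName() != null) {
                out.append(" waiting on ").append(info.getLockName());
            }
            out.append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```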
  21. 21. Thread Dumps <ul><li>A number of thread dumps over time gives a good picture </li></ul><ul><ul><li>Any operation that appears a lot is a suspect </li></ul></ul><ul><ul><li>Understand what ‘normal’ thread dumps look like </li></ul></ul>
  22. 22. Thread Dump
  23. 23. Thread Dumps <ul><li>Look near the top of each stack </li></ul><ul><li>Look for stacks with your code in them </li></ul><ul><li>Look for long stacks </li></ul><ul><li>Look for deadlocks and other threading issues </li></ul>
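The advice to take several dumps over time and look near the top of each stack can be approximated in code. The following is a rough sketch, not a profiler: it samples every live thread a few times and counts which method is at the top of the stack. The class name `TopFrameSampler` and the sample counts are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sampler: counts how often each method appears at the top of a thread's stack.
final class TopFrameSampler {

    /** Adds the top frame of every live thread to the running counts. */
    static void sampleOnce(Map<String, Integer> counts) {
        for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
            if (stack.length == 0) {
                continue; // thread has no Java frames right now
            }
            String top = stack[0].getClassName() + "." + stack[0].getMethodName();
            counts.merge(top, 1, Integer::sum);
        }
    }

    /** Takes the given number of samples, pausing between them; stops early if interrupted. */
    static Map<String, Integer> sample(int samples, long intervalMillis) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            sampleOnce(counts);
            if (i < samples - 1) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        sample(10, 500).entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(10)
                .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

A method that tops the list across many samples is a suspect, in the same way that an operation appearing in many manual dumps is.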
  24. 24. The Understanding <ul><li>What does a normal WebLogic thread dump look like? </li></ul><ul><li>It is not normal to see logging code frequently in a thread dump </li></ul><ul><li>Lots of threads all waiting on a single lock object is a Bad Thing™ </li></ul><ul><li>If three servers are supposed to do the same thing, their thread dumps should look similar </li></ul><ul><ul><li>Over time </li></ul></ul>
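The "many threads waiting on one lock" pattern can also be detected programmatically. The sketch below groups BLOCKED threads by the monitor they are trying to enter, using the standard `ThreadMXBean` API. The class name `LockContention` and the `demo` method, which fakes a contended log-file lock, are illustrative only.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Illustrative helper: counts BLOCKED threads per monitor they are waiting to enter.
final class LockContention {

    static Map<String, Integer> blockedPerLock() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<String, Integer> counts = new HashMap<>();
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info == null || info.getThreadState() != Thread.State.BLOCKED) {
                continue;
            }
            String lock = info.getLockName();
            if (lock != null) {
                counts.merge(lock, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Demo: two workers queue up behind a monitor held by the caller. Returns the largest count seen. */
    static int demo() {
        final Object logFileLock = new Object(); // stands in for a lock inside logging code
        int max = 0;
        synchronized (logFileLock) {
            for (int i = 0; i < 2; i++) {
                Thread worker = new Thread(() -> {
                    synchronized (logFileLock) {
                        // nothing to do; the point is the wait to get in
                    }
                });
                worker.setDaemon(true);
                worker.start();
            }
            // Poll for up to two seconds until both workers are seen blocked.
            long deadline = System.currentTimeMillis() + 2000;
            while (max < 2 && System.currentTimeMillis() < deadline) {
                for (int n : blockedPerLock().values()) {
                    max = Math.max(max, n);
                }
                Thread.yield();
            }
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println("Most threads blocked on one lock: " + demo());
    }
}
```

On a healthy server the map returned by `blockedPerLock` is usually empty or small; a single lock with a large and persistent count is worth investigating.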
  25. 25. Lessons <ul><li>Thread dumps hold a lot of information </li></ul><ul><li>Infrastructure configuration faults are more common than infrastructure bugs </li></ul><ul><li>Automated/continuous build and deploy solutions are no silver bullet </li></ul><ul><ul><li>Check the results yourself </li></ul></ul><ul><li>Believe your ‘instincts’ </li></ul>
  26. 26. Case Study 2
  27. 27. Case Study 2 <ul><li>Customer Call </li></ul><ul><ul><li>“We deployed our application into the live environment and it takes several minutes for users to log in” </li></ul></ul><ul><li>Environment </li></ul><ul><ul><li>Apache web servers </li></ul></ul><ul><ul><li>WebLogic Portal 8.1 Cluster (2 nodes) </li></ul></ul><ul><ul><li>Oracle Database </li></ul></ul><ul><ul><li>Windows Server 2003 </li></ul></ul><ul><ul><li>Bespoke Single Sign On server </li></ul></ul>
  28. 28. Case Study 2 <ul><li>The System </li></ul><ul><ul><li>Web application based on WSRP portlets </li></ul></ul><ul><ul><li>Oracle database storing user data </li></ul></ul><ul><li>The Situation </li></ul><ul><ul><li>The first users to log in each morning find that it takes several minutes </li></ul></ul><ul><ul><li>After the first few log-ins, the application runs fine </li></ul></ul>
  29. 29. Case Study 2 <ul><li>Hypothesize </li></ul><ul><ul><li>The bespoke Single Sign On server makes me suspicious </li></ul></ul><ul><ul><ul><li>Bespoke code is tested less </li></ul></ul></ul><ul><li>Test </li></ul><ul><ul><li>Turn on debug logging for the SSO implementation </li></ul></ul><ul><ul><li>Observe timings of log messages </li></ul></ul>
  30. 30. Case Study 2 <ul><li>Observe </li></ul><ul><ul><li>The logs indicate that the SSO log-in is proceeding as expected </li></ul></ul><ul><ul><li>It appears that loading the user’s profile data from the database is taking a long time </li></ul></ul><ul><li>Hypothesize </li></ul><ul><ul><li>TCP timeouts when connecting to the database due to a firewall </li></ul></ul>
  31. 31. Case Study 2 <ul><li>Test </li></ul><ul><ul><li>Observe the connection pool statistics in the WebLogic console </li></ul></ul><ul><ul><li>The console indicates that a large number of connections have been opened during the time the application has been running </li></ul></ul><ul><ul><ul><li>Connections are not normally closed and re-opened </li></ul></ul></ul><ul><ul><li>See how long you need to leave the system before the problem occurs </li></ul></ul>
  32. 32. Case Study 2 <ul><li>Solution </li></ul><ul><ul><li>Discussions with the networking team indicated that there was a firewall, configured to silently terminate network connections that were idle for 60 minutes </li></ul></ul><ul><ul><li>Set WebLogic to test connections after they have been idle for 50 minutes </li></ul></ul>
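The idea behind the fix can be sketched as a pool that periodically tests any connection that has been idle for longer than a threshold set below the firewall's cut-off. This is not WebLogic's implementation or configuration; the class `IdleTestingPool`, its methods and the injected clock are all hypothetical. The sketch also assumes that a successful test counts as activity on the connection, so the firewall's idle timer starts again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical pool sketch: tests connections that have been idle too long and replaces dead ones.
final class IdleTestingPool<C> {

    private static final class Entry<C> {
        C conn;
        long lastActivityMillis;

        Entry(C conn, long lastActivityMillis) {
            this.conn = conn;
            this.lastActivityMillis = lastActivityMillis;
        }
    }

    private final List<Entry<C>> idle = new ArrayList<>();
    private final Supplier<C> factory;       // opens a new connection
    private final Predicate<C> isAlive;      // e.g. runs a cheap test query
    private final LongSupplier clockMillis;  // injected so the logic can be tested
    private final long testAfterIdleMillis;  // e.g. 50 minutes for a 60-minute firewall
    private int replaced;

    IdleTestingPool(Supplier<C> factory, Predicate<C> isAlive,
                    LongSupplier clockMillis, long testAfterIdleMillis) {
        this.factory = factory;
        this.isAlive = isAlive;
        this.clockMillis = clockMillis;
        this.testAfterIdleMillis = testAfterIdleMillis;
    }

    /** Returns a connection to the pool and records when it was last used. */
    void release(C conn) {
        idle.add(new Entry<>(conn, clockMillis.getAsLong()));
    }

    /** Tests every connection idle for at least the threshold; returns how many were tested. */
    int sweep() {
        long now = clockMillis.getAsLong();
        int tested = 0;
        for (Entry<C> entry : idle) {
            if (now - entry.lastActivityMillis < testAfterIdleMillis) {
                continue;
            }
            tested++;
            if (!isAlive.test(entry.conn)) {
                entry.conn = factory.get(); // replace the dead connection
                replaced++;
            }
            entry.lastActivityMillis = now; // the test (or reconnect) counts as activity
        }
        return tested;
    }

    int replacedCount() {
        return replaced;
    }
}
```

With JDBC, the liveness check might wrap `Connection.isValid` or a trivial query. The important design choice is that the threshold sits comfortably below the firewall timeout, so no pooled connection is ever idle long enough to be dropped silently.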
  33. 33. Lessons <ul><li>Consider the system as a whole </li></ul><ul><ul><li>Hardware </li></ul></ul><ul><ul><li>Networking </li></ul></ul><ul><ul><li>OS </li></ul></ul><ul><ul><li>Middleware </li></ul></ul><ul><ul><li>Application </li></ul></ul>
  34. 34. The Understanding <ul><li>Firewalls are often configured to silently terminate idle TCP connections </li></ul><ul><li>TCP requires that a connection is closed by both sides, or times out </li></ul><ul><ul><li>The time out is several minutes </li></ul></ul><ul><li>In a healthy WebLogic connection pool, the number of connections opened since the server started should equal the maximum number in the pool </li></ul>
  35. 35. Case Study 3
  36. 36. Case Study 3 <ul><li>Customer call </li></ul><ul><ul><li>“It takes about 20 seconds to render a page, and the performance does not scale” </li></ul></ul><ul><li>Environment </li></ul><ul><ul><li>WebLogic Portal 9.1 Cluster (2 nodes) </li></ul></ul><ul><ul><li>Oracle 10g Database </li></ul></ul><ul><ul><li>Red Hat Enterprise Linux </li></ul></ul>
  37. 37. Case Study 3 <ul><li>The System </li></ul><ul><ul><li>Online content delivery system </li></ul></ul><ul><ul><li>WebLogic Portal with a commercial set of portlets </li></ul></ul><ul><li>The Situation </li></ul><ul><ul><li>Two problems </li></ul></ul><ul><ul><ul><li>Running the performance tests with 20 threads in JMeter is twice as slow as running the tests with 10 threads </li></ul></ul></ul><ul><ul><ul><li>Viewing a content item takes around 20 seconds </li></ul></ul></ul>
  38. 38. Case Study 3 <ul><li>Handle the two problems separately </li></ul><ul><ul><li>They may be related, they may not be </li></ul></ul>
  39. 39. Case Study 3 <ul><li>Observe </li></ul><ul><ul><li>Viewing a content item takes around 16 seconds on my laptop </li></ul></ul><ul><li>Test </li></ul><ul><ul><li>Is the rendering speed dependent on the browser used? </li></ul></ul><ul><ul><li>Is the rendering speed dependent on the client machine? </li></ul></ul><ul><ul><li>What does the page source look like? </li></ul></ul>
  40. 40. Case Study 3 <ul><li>Observe </li></ul><ul><ul><li>In Opera the page renders quickly except for the table of contents on the left </li></ul></ul><ul><ul><li>In Firefox, the whole page renders at the same time </li></ul></ul><ul><ul><li>The page renders faster in IE and Opera than in Firefox </li></ul></ul><ul><ul><li>The page renders faster on faster machines </li></ul></ul><ul><ul><li>There is a lot of JavaScript, and AJAX is used to load the table of contents </li></ul></ul>
  41. 41. Case Study 3 <ul><li>Hypothesize </li></ul><ul><ul><li>The AJAX rendering of the TOC is taking a long time, and slowing down the whole page load </li></ul></ul><ul><li>Tweak </li></ul><ul><ul><li>Remove the TOC from the page </li></ul></ul><ul><ul><li>Disable JavaScript in the browser </li></ul></ul><ul><li>Test </li></ul><ul><ul><li>The page renders in less than 2 seconds </li></ul></ul>
  42. 42. Case Study 3 <ul><li>Hypothesize </li></ul><ul><ul><li>JMeter does not execute the JavaScript, so the poor performance measured with JMeter is not related to the slow page load in the browser </li></ul></ul>
  43. 43. Case Study 3 <ul><li>Solution 1 </li></ul><ul><ul><li>The portlet developers have used AJAX to render the table of contents for a content item; this is much slower than just constructing the table of contents on the server side </li></ul></ul><ul><ul><li>Rewrite the portlet to construct the table of contents on the server side </li></ul></ul><ul><ul><li>Developers sometimes select a technology to enhance their CVs, not to implement a business requirement </li></ul></ul>
  44. 44. Case Study 3 <ul><li>Problem 2 – Scalability </li></ul><ul><li>Observe </li></ul><ul><ul><li>Running the tests in JMeter with 10 users, each page response takes 5s </li></ul></ul><ul><ul><li>Running the tests with 20 users, each page response takes 12s </li></ul></ul><ul><ul><li>JMeter is being run on an old laptop, which is at 100% CPU in both cases </li></ul></ul>
  45. 45. Case Study 3 <ul><li>Hypothesize </li></ul><ul><ul><li>As the test machine is at 100% CPU, it is the performance of JMeter that is being measured, not the performance of WebLogic </li></ul></ul><ul><li>Observe </li></ul><ul><ul><li>WebLogic is running at around 2% CPU usage, with many idle threads </li></ul></ul>
  46. 46. Case Study 3 <ul><li>Tweak </li></ul><ul><ul><li>Run the test from a number of more modern machines, and make sure each one does not exceed 70% CPU </li></ul></ul><ul><li>Observe </li></ul><ul><ul><li>Four machines can each run 20 threads and get responses in 1.5 seconds, and WebLogic is still running at around 5% CPU and not struggling </li></ul></ul>
  47. 47. Case Study 3 <ul><li>Solution </li></ul><ul><ul><li>The problem was that the test client was not able to generate the loads requested, resulting in the performance of the test client being measured </li></ul></ul><ul><ul><li>Use a larger test client </li></ul></ul>
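The figures from this case study can be turned into rough throughput numbers, which makes the client-side bottleneck easier to see. The sketch below assumes a closed-loop test with no think time, where each JMeter thread sends its next request as soon as the previous response arrives, so throughput is simply threads divided by response time. That assumption is not stated in the talk, and the class name `ClosedLoopThroughput` is illustrative.

```java
// Illustrative arithmetic, assuming a closed-loop load test with no think time.
final class ClosedLoopThroughput {

    /** Requests per second when every thread always has exactly one request in flight. */
    static double requestsPerSecond(int threads, double responseSeconds) {
        return threads / responseSeconds;
    }

    public static void main(String[] args) {
        // Old laptop: 10 threads at 5 s, then 20 threads at 12 s.
        System.out.println("10 threads: " + requestsPerSecond(10, 5.0) + " req/s");
        System.out.println("20 threads: " + requestsPerSecond(20, 12.0) + " req/s");
        // Four newer machines, 20 threads each, at 1.5 s.
        System.out.println("80 threads: " + requestsPerSecond(4 * 20, 1.5) + " req/s");
    }
}
```

Under that assumption the old laptop delivered about 2 requests per second with 10 threads and slightly less with 20, so doubling the threads added no load at all. The four newer machines together delivered more than 50 requests per second while WebLogic stayed at around 5% CPU, which is consistent with the test client, not the server, having been the limit.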
  48. 48. Useful tools <ul><li>Ethereal/Wireshark </li></ul><ul><ul><li>Network traffic sniffer </li></ul></ul><ul><ul><li>See when requests/responses were sent/received </li></ul></ul><ul><li>Firebug + YSlow </li></ul><ul><ul><li>Firefox plugin for performance analysis </li></ul></ul>
  49. 49. Lessons <ul><li>Separate problems should initially be prioritised and investigated separately </li></ul><ul><ul><li>Keep in mind that they may be related </li></ul></ul><ul><li>Ensure the test system can generate the required load </li></ul><ul><ul><li>It should have plenty of free resources available </li></ul></ul>
  50. 50. Lessons <ul><li>The consultant effect </li></ul><ul><ul><li>Take a step back </li></ul></ul><ul><ul><li>Get a fresh perspective </li></ul></ul>
  51. 51. The Understanding <ul><li>A slow test client will give slow results </li></ul><ul><li>Client side rendering is usually less efficient than server side </li></ul><ul><li>WebLogic is normally fast! </li></ul>
  52. 52. What did we learn? <ul><li>Simple tools can provide a lot of information </li></ul><ul><li>Understanding how the system should behave will help highlight possible causes </li></ul><ul><li>Experience is vital </li></ul><ul><ul><li>Write a log of what you find </li></ul></ul><ul><li>Take a step back from the problem </li></ul><ul><ul><li>Use a second pair of eyes </li></ul></ul>
  53. 53. What did we learn? <ul><li>Philosophy </li></ul><ul><ul><li>Understand the system as a whole </li></ul></ul><ul><ul><li>A deep understanding of how it should work </li></ul></ul><ul><li>Tools </li></ul><ul><ul><li>Thread dumps </li></ul></ul><ul><ul><li>Monitoring tools </li></ul></ul><ul><ul><li>Packet sniffing </li></ul></ul><ul><li>Techniques </li></ul><ul><ul><li>Observe, Hypothesize, Tweak, Test </li></ul></ul>
  54. 54. Questions
  55. 55. Session Evaluation <ul><li>Please complete a session evaluation and turn it in to any conference staff member or at the registration desk. Thank you. </li></ul>