Real Life Java EE Performance Tuning

Presentation by
Matt Brasier
Head of Consultancy
C2B2 Consulting

Transcript

  • 1. Real Life Java EE Performance Tuning Matt Brasier Principal Consultant C2B2 Consulting LTD [email_address]
  • 2. About Me
    • Professional Services Consultant
    • Customers include
      • Red Hat (JBoss)
      • BEA
      • Cape Clear
      • Government/Finance/Telecoms
    • C2B2 Consulting
      • SOA and Java EE consultancy
      • Fast, Reliable, Manageable, Secure
  • 3. What we will cover
    • Philosophy
      • How I approach a performance problem situation
    • Enterprise Java Performance
      • What kind of things affect performance of Enterprise Systems
    • Case Study 1
      • A new version of the application runs slowly
    • Case Study 2
      • Logging in takes a long time in the live environment
    • Case Study 3
      • The application does not scale
  • 4. What we will learn
    • Philosophy
      • Suggestions to keep in mind when looking at a performance problem
    • Tools
      • Suggested tools for looking at a performance problem
    • Techniques
      • How to use the tools, knowledge and skills to solve your performance problem
  • 5. Philosophy
    • ‘A good understanding’ is the best performance tuning tool
    • Prefer common and open source tools
    • Observe, Hypothesize, Tweak, Test
    • ‘Trust no one’
  • 6. Classic Java performance problems
    • Memory leaks
      • Increased GC Time
    • Poor GC or JVM Memory configuration
    • CPU bound code
    • IO bound code
    • Memory bound code
      • Increased GC time
  • 7. Enterprise Java Performance
    • CAVEAT: Consultancy Selection Bias
    • 80/20: 80% of time finding, 20% fixing
    • Many ‘Enterprise’ Java performance problems turn out not to be ‘classic’ performance bottlenecks
      • Infrastructure/Middleware performance
    • There are many factors that can affect the performance of an enterprise system
      • Not just code
  • 8. Enterprise Java Performance
    • Not all Java EE performance problems are classical ‘Java performance problems’
    • Common types of Java EE performance problem
      • Resource starvation
      • Threading problems
      • ‘Suboptimal configuration’
      • Network related problems
      • Scalability problems
  • 9. A Good Understanding
    • Consider the system as a whole
    • Know how infrastructure components work
      • Not just what they do, but how they do it
    • How do the Java EE specifications say they should work?
  • 10. Approach
    • Understand the system
    • Understand the environment
    • Understand the situation
    • Talk to people who know
      • But trust no-one
    • Take a look for myself
    • Observe, Hypothesize, Tweak, Test
      • Rinse and repeat
  • 11. Case Study 1
  • 12. Case Study 1
    • Existing customer calls
      • “We deployed a new version of the application, and it is running a lot slower”
    • The Environment
      • Sun Java 5
      • WebLogic Server 9.2 Cluster (3 nodes)
      • WebLogic Integration 9.2 Cluster (3 nodes)
      • Documentum Document Management
      • Oracle Database
      • Solaris OS
  • 13. Case Study 1
    • The System
      • Web Application
      • WLI based workflow system
    • The situation
      • New version deployed into the performance testing environment
      • Automated performance tests indicate the application is approximately 30% slower
  • 14. Case Study 1
    • Observe
      • No monitoring in place
      • Some alerting, but no historical data
    • Hypothesize
      • If we had more monitoring, we would stand a better chance
    • Tweak
      • Put some monitoring in place
      • Hyperic HQ from SpringSource
  • 15. Case Study 1
    • Test
      • Re-run tests
    • Observe
      • Monitoring indicates that one server is slower
        • Handling fewer requests per second
        • Lots of transaction timeouts
        • Higher CPU
        • Less network traffic
    • Tweak
      • Add more monitoring to the slow server
      • Examine log files
      • Thread dumps!
  • 16. Case Study 1
    • Hypothesize
      • Thread dumps show lots of threads in logging code waiting to write to the log file
      • Log files for the slow server have DEBUG messages in them
        • The other servers don’t
    • “The logging configurations are identical, the servers are configured with Maven”
      • Trust no one
    • Test
      • Log in to the server and manually check the logging configuration
  • 17. Case Study 1
    • Solution
      • Debug logging was enabled on one server
      • Turned debug logging off - the system was then about the same speed as the old release
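Checking the configuration by hand can be backed up with a small comparison script. The following is a minimal sketch, assuming each server's logging configuration can be exported to a Java properties file; the class name `ConfigDiff` and the key names used with it are hypothetical and are not taken from the case study.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Objects;
import java.util.Properties;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: reports every key whose value differs between two servers.
final class ConfigDiff {

    // Returns key -> "reference value != other value" for each differing key.
    static Map<String, String> diff(Properties reference, Properties other) {
        Set<String> keys = new TreeSet<>(reference.stringPropertyNames());
        keys.addAll(other.stringPropertyNames());

        Map<String, String> differences = new TreeMap<>();
        for (String key : keys) {
            String a = reference.getProperty(key);
            String b = other.getProperty(key);
            if (!Objects.equals(a, b)) {
                differences.put(key, a + " != " + b);
            }
        }
        return differences;
    }

    // Usage: java ConfigDiff server1.properties server2.properties
    public static void main(String[] args) throws IOException {
        Properties first = new Properties();
        Properties second = new Properties();
        try (Reader a = Files.newBufferedReader(Paths.get(args[0]));
             Reader b = Files.newBufferedReader(Paths.get(args[1]))) {
            first.load(a);
            second.load(b);
        }
        diff(first, second).forEach((key, change) -> System.out.println(key + ": " + change));
    }
}
```

An empty result means the two files agree. Any output is a setting worth checking on the server itself.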
  • 18. Hyperic HQ
  • 19. Hyperic HQ
    • Monitoring tool
      • Not a profiling tool
    • Historical data
      • Trends
      • Abnormal behaviour
      • ‘Hot’ spots
    • Wide variety of data
      • JVM level statistics
      • JMX statistics
      • OS statistics
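To give a feel for the JVM-level statistics a monitoring tool collects, here is a minimal sketch that reads a few of them through the standard `java.lang.management` API. It assumes Java 8 or later. It is not how Hyperic HQ itself is implemented, and the class name and map keys are hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: takes one snapshot of a few JVM-level statistics.
final class JvmStatsSnapshot {

    static Map<String, Long> capture() {
        Map<String, Long> stats = new LinkedHashMap<>();

        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        stats.put("heap.used.bytes", heap.getUsed());
        stats.put("thread.count", (long) ManagementFactory.getThreadMXBean().getThreadCount());

        long gcCount = 0;
        long gcTimeMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Either value is -1 when the collector does not report it.
            gcCount += Math.max(0, gc.getCollectionCount());
            gcTimeMillis += Math.max(0, gc.getCollectionTime());
        }
        stats.put("gc.count", gcCount);
        stats.put("gc.time.millis", gcTimeMillis);
        return stats;
    }

    public static void main(String[] args) {
        capture().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

A single snapshot says little. The value of a monitoring tool comes from recording such numbers at regular intervals so that trends and abnormal behaviour become visible.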
  • 20. Thread Dumps
    • My Number 2 tool for finding performance problems
      • Ctrl+Break on Windows
      • kill -3 <pid> on Unix/Linux
      • jstack tool
      • Available from consoles of many application servers
    • All threads in the VM and what they are doing at that moment
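When none of the external options is convenient, a similar listing can be produced from inside the JVM. This is a minimal sketch using `Thread.getAllStackTraces()`; the class name is hypothetical, and the output format is simplified compared with a real `kill -3` or `jstack` dump (no lock or monitor details).

```java
import java.util.Map;

// Hypothetical helper: formats every live thread and its stack as text.
final class InProcessThreadDump {

    static String dump() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```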
  • 21. Thread Dumps
    • A number of thread dumps over time gives a good picture
      • Any operation that appears a lot is a suspect
      • Understand what ‘normal’ thread dumps look like
    • http://java.sun.com/developer/technicalArticles/Programming/Stacktrace/
  • 22. Thread Dump
  • 23. Thread Dumps
    • Look near the top of each stack
    • Look for stacks with your code in them
    • Look for long stacks
    • Look for deadlocks and other threading issues
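The advice to take several dumps over time and look near the top of each stack can be roughly automated. The sketch below samples all thread stacks a few times and counts how often each method is the top frame. The class and method names are hypothetical, and it assumes Java 8 or later.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: counts how often each method is at the top of a thread's stack.
final class TopFrameSampler {

    // Adds one sample: the top frame of every thread that currently has a stack.
    static void sampleInto(Map<String, Integer> counts) {
        for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
            if (stack.length > 0) {
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                counts.merge(top, 1, Integer::sum);
            }
        }
    }

    // Takes the given number of samples, pausing between them.
    static Map<String, Integer> sample(int samples, long intervalMillis) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            sampleInto(counts);
            if (i < samples - 1) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // stop sampling early
                    break;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        sample(5, 200).entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(10)
            .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

A method that tops the list on one server but not on its supposedly identical neighbours is a suspect, although idle threads waiting for work will always score highly and need to be recognised as normal.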
  • 24. The Understanding
    • What does a normal WebLogic thread dump look like?
    • It is not normal to see logging code frequently in a thread dump
    • Lots of threads all waiting on a single lock object is a Bad Thing™
    • If three servers are supposed to do the same thing, their thread dumps should look similar
      • Over time
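The "many threads waiting on a single lock" pattern can also be checked programmatically. This is a minimal sketch using the standard `ThreadMXBean`; it groups threads in the BLOCKED state by the monitor they are waiting for. The class name is hypothetical, and it assumes Java 8 or later.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: lists BLOCKED threads, grouped by the lock they wait for.
final class LockContention {

    static Map<String, List<String>> blockedThreadsByLock() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<String, List<String>> byLock = new TreeMap<>();
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            // An entry is null when the thread ended between the two calls.
            if (info != null
                    && info.getThreadState() == Thread.State.BLOCKED
                    && info.getLockName() != null) {
                byLock.computeIfAbsent(info.getLockName(), k -> new ArrayList<>())
                      .add(info.getThreadName());
            }
        }
        return byLock;
    }

    public static void main(String[] args) {
        blockedThreadsByLock().forEach((lock, waiters) ->
            System.out.println(waiters.size() + " thread(s) blocked on " + lock + ": " + waiters));
    }
}
```

One lock with a long list of waiters, seen repeatedly across samples, matches the logging contention described in Case Study 1.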
  • 25. Lessons
    • Thread dumps hold a lot of information
    • Infrastructure configuration faults are more common than infrastructure bugs
    • Automated/continuous build and deploy solutions are no silver bullet
      • Check the results yourself
    • Believe your ‘instincts’
  • 26. Case Study 2
  • 27. Case Study 2
    • Customer Call
      • “We deployed our application into the live environment and it takes several minutes for users to log in”
    • Environment
      • Apache web servers
      • WebLogic Portal 8.1 Cluster (2 nodes)
      • Oracle Database
      • Windows Server 2003
      • Bespoke Single Sign On server
  • 28. Case Study 2
    • The System
      • Web application based on WSRP portlets
      • Oracle database storing user data
    • The Situation
      • The first users to log-in in the morning find that it takes several minutes
      • After the first few log-ins, the application runs fine
  • 29. Case Study 2
    • Hypothesize
      • The bespoke Single Sign On server makes me suspicious
        • Bespoke code is tested less
    • Test
      • Turn on debug logging for the SSO implementation
      • Observe timings of log messages
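Observing the timings of log messages amounts to looking for the longest pause between consecutive entries. The sketch below does this under a loud assumption: every line starts with a `HH:mm:ss.SSS` timestamp and the log does not cross midnight. The class name is hypothetical, and the log format is illustrative only; it is not the format of the SSO server in the case study.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: finds where the longest pause in a log occurs.
final class LogGapFinder {

    // Returns the index of the line that follows the longest pause,
    // or -1 when there are fewer than two lines or no positive gap.
    static int lineAfterLongestGap(List<String> lines) {
        int worst = -1;
        Duration longest = Duration.ZERO;
        LocalTime previous = null;
        for (int i = 0; i < lines.size(); i++) {
            // Assumed format: the first 12 characters are HH:mm:ss.SSS.
            LocalTime current = LocalTime.parse(lines.get(i).substring(0, 12));
            if (previous != null) {
                Duration gap = Duration.between(previous, current);
                if (gap.compareTo(longest) > 0) {
                    longest = gap;
                    worst = i;
                }
            }
            previous = current;
        }
        return worst;
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        List<String> lines = Arrays.asList(
            "09:00:01.250 step A finished",
            "09:00:01.400 step B finished",
            "09:03:12.900 step C finished");
        int index = lineAfterLongestGap(lines);
        System.out.println("Longest pause ends at: " + lines.get(index));
    }
}
```

The line returned marks the end of the slow step, so the operation logged just before it is where the time went.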
  • 30. Case Study 2
    • Observe
      • The logs indicate that the SSO log-in is proceeding as expected
      • It appears that loading the user’s profile data from the database is taking a long time
    • Hypothesize
      • TCP timeouts when connecting to the database due to a firewall
  • 31. Case Study 2
    • Test
      • Observe the connection pool statistics in the WebLogic console
      • The console indicates that a large number of connections have been opened during the time the application has been running
        • Connections are not normally closed and re-opened
      • See how long you need to leave the system before the problem occurs
  • 32. Case Study 2
    • Solution
      • Discussions with the networking team revealed a firewall configured to silently terminate network connections that had been idle for 60 minutes
      • Set WebLogic to test connections after they have been idle for 50 minutes.
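The reasoning behind the fix is that a connection must be exercised before the firewall's idle timeout expires, so the test threshold has to be shorter than that timeout. The sketch below captures the rule in plain Java. It is not WebLogic's API or configuration; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical policy: decides when an idle pooled connection should be tested.
final class IdleConnectionPolicy {

    private final Duration testAfterIdle;

    IdleConnectionPolicy(Duration firewallIdleTimeout, Duration testAfterIdle) {
        // Testing at or after the firewall timeout would be too late.
        if (testAfterIdle.compareTo(firewallIdleTimeout) >= 0) {
            throw new IllegalArgumentException(
                "test threshold must be shorter than the firewall idle timeout");
        }
        this.testAfterIdle = testAfterIdle;
    }

    // True when the connection has been idle for at least the test threshold.
    boolean needsTest(Instant lastUsed, Instant now) {
        return Duration.between(lastUsed, now).compareTo(testAfterIdle) >= 0;
    }

    public static void main(String[] args) {
        IdleConnectionPolicy policy =
            new IdleConnectionPolicy(Duration.ofMinutes(60), Duration.ofMinutes(50));
        Instant lastUsed = Instant.now().minus(Duration.ofMinutes(55));
        System.out.println("Test now? " + policy.needsTest(lastUsed, Instant.now()));
    }
}
```

With the values from the case study (60-minute firewall timeout, 50-minute test threshold), a connection idle for 55 minutes is tested while the firewall still considers it alive, and the test traffic resets the firewall's idle timer.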
  • 33. Lessons
    • Consider the system as a whole
      • Hardware
      • Networking
      • OS
      • Middleware
      • Application
  • 34. The Understanding
    • Firewalls are often configured to silently terminate idle TCP connections
    • The TCP protocol requires that a connection is closed by both sides, or times out
      • The time out is several minutes
    • In a healthy WebLogic connection pool, the number of connections opened since the server started = the maximum number in the pool
  • 35. Case Study 3
  • 36. Case Study 3
    • Customer call
      • “It takes about 20 seconds to render a page, and the performance does not scale”
    • Environment
      • WebLogic Portal 9.1 Cluster (2 nodes)
      • Oracle 10g Database
      • Red Hat Enterprise Linux
  • 37. Case Study 3
    • The System
      • Online content delivery system
      • WebLogic Portal with a commercial set of portlets
    • The Situation
      • Two problems
        • Running the performance tests with 20 threads in JMeter is twice as slow as running the tests with 10 threads
        • Viewing a content item takes around 20 seconds
  • 38. Case Study 3
    • Handle the two problems separately
      • They may be related, they may not be
  • 39. Case Study 3
    • Observe
      • Viewing a content item takes around 16 seconds on my laptop
    • Test
      • Is the rendering speed dependent on the browser used?
      • Is the rendering speed dependent on the client machine?
      • What does the page source look like?
  • 40. Case Study 3
    • Observe
      • In Opera the page renders quickly except for the table of contents on the left
      • In Firefox, the whole page renders at the same time
      • The page renders faster in IE and Opera than Firefox
      • The page renders faster on faster machines
      • There is a lot of JavaScript, and AJAX is used to load the table of contents
  • 41. Case Study 3
    • Hypothesize
      • The AJAX rendering of the TOC is taking a long time, and slowing down the whole page load
    • Tweak
      • Remove the TOC from the page
      • Disable JavaScript in the browser
    • Test
      • The page renders in less than 2 seconds
  • 42. Case Study 3
    • Hypothesize
      • JMeter does not execute JavaScript, so the poor JMeter results are not related to the slow page load in the browser
  • 43. Case Study 3
    • Solution 1
      • The portlet developers used AJAX to render the table of contents for a content item; this is much slower than constructing the table of contents on the server side
      • Rewrite the portlet to construct the table of contents on the server side
      • Developers sometimes select a technology to enhance their CVs, not to implement a business requirement
  • 44. Case Study 3
    • Problem 2 – Scalability
    • Observe
      • Running the tests on JMeter with 10 users, each page response takes 5s
      • Running the test with 20 users each page response takes 12s
      • JMeter is being run on an old laptop, which is at 100% CPU in both cases
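The two measurements can be turned into throughput figures with Little's law. The sketch below assumes a closed test in which each user sends the next request as soon as the previous response arrives (no think time); that assumption is not stated in the slides. Under it, throughput is the number of users divided by the response time. The class name is hypothetical.

```java
// Hypothetical helper: throughput of a closed load test with no think time.
final class ClosedLoadThroughput {

    // Little's law rearranged: requests per second = concurrent users / response time.
    static double requestsPerSecond(int users, double responseSeconds) {
        return users / responseSeconds;
    }

    public static void main(String[] args) {
        System.out.printf("10 users at 5 s:  %.2f requests/s%n", requestsPerSecond(10, 5.0));
        System.out.printf("20 users at 12 s: %.2f requests/s%n", requestsPerSecond(20, 12.0));
    }
}
```

Under that assumption the figures are 2.00 requests per second with 10 users and about 1.67 with 20. Doubling the users lowered total throughput, which suggests that some component in the measured path was already saturated at 10 users.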
  • 45. Case Study 3
    • Hypothesize
      • As the test machine is at 100% CPU, it is the performance of JMeter that is being measured, not the performance of WebLogic
    • Observe
      • WebLogic is running at around 2% CPU usage, with many idle threads
  • 46. Case Study 3
    • Tweak
      • Run the test from a number of more modern machines, and make sure each one does not exceed 70% CPU
    • Observe
      • Four machines can each run 20 threads and get responses in 1.5 seconds, and WebLogic is still running at around 5% CPU and not struggling
  • 47. Case Study 3
    • Solution
      • The problem was that the test client was not able to generate the loads requested, resulting in the performance of the test client being measured
      • Use a larger test client
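A simple guard on the load generator can catch this situation before any results are trusted. The sketch below uses the operating system load average divided by the number of processors as a rough stand-in for CPU utilisation, which is an assumption; the two are not the same measure. The 0.70 limit mirrors the 70% figure used in this case study. The class name is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical guard: checks whether the load-generating machine has spare capacity.
final class LoadClientHeadroom {

    // Returns false when the load is unknown (negative) or above the allowed fraction.
    static boolean hasHeadroom(double loadAverage, int processors, double maxFraction) {
        if (loadAverage < 0 || processors < 1) {
            return false; // treat "unknown" as "do not trust the results"
        }
        return loadAverage / processors <= maxFraction;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // getSystemLoadAverage() returns a negative value where it is unavailable.
        boolean ok = hasHeadroom(os.getSystemLoadAverage(), os.getAvailableProcessors(), 0.70);
        System.out.println(ok
            ? "Client has headroom; results reflect the server."
            : "Client may be the bottleneck; do not trust these results.");
    }
}
```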
  • 48. Useful tools
    • Ethereal/Wireshark
      • Network traffic sniffer
      • See when requests/responses were sent/received
    • Firebug + YSlow
      • Firefox plugin for performance analysis
  • 49. Lessons
    • Separate problems should initially be prioritised and investigated separately
      • Keep in mind that they may be related
    • Ensure the test system can generate the required load
      • It should have plenty of free resources available
  • 50. Lessons
    • The consultant effect
      • Take a step back
      • Get a fresh perspective
  • 51. The Understanding
    • A slow test client will give slow results
    • Client side rendering is usually less efficient than server side
    • WebLogic is normally fast!
  • 52. What did we learn?
    • Simple tools can provide a lot of information
    • Understanding how the system should behave will help highlight possible causes
    • Experience is vital
      • Write a log of what you find
    • Take a step back from the problem
      • Use a second pair of eyes
  • 53. What did we learn?
    • Philosophy
      • Understand the system as a whole
      • A deep understanding of how it should work
    • Tools
      • Thread dumps
      • Monitoring tools
      • Packet sniffing
    • Techniques
      • Observe, Hypothesize, Tweak, Test
  • 54. Questions
  • 55. Session Evaluation
    • Please complete a session evaluation and turn it in to any conference staff member or at the registration desk. Thank you.
