Top Java Performance Problems and Metrics To Check in Your Pipeline

Why is performance important? What are the most common reasons applications don't scale and perform well? Which technical metrics should you look at? How can you check them automatically in the pipeline?

Published in: Software

  1. 1. Deep Dive Into Top Performance Mistakes: And other Tips & Tricks to make you a “Performance Expert”. Andreas Grabner - @grabnerandi. More @ http://blog.dynatrace.com – Tools @ http://bit.ly/dtpersonal
  2. 2. Why Performance? Confidential, Dynatrace, LLC
  3. 3. 700 deployments / YEAR 10 + deployments / DAY 50 – 60 deployments / DAY Every 11.6 SECONDS
  4. 4. Not only fast delivered but also delivering fast! -1000ms +2% Response Time Conversions -1000ms +10% +100ms -1%
  5. 5. #1: Which Geo has which “User Experience”? #2: Who are these users?
  6. 6. Daily Deployments + Mkt Push Increase # of unhappy users! Drop in Conversion Rate Overall increase of Users!
  7. 7. Satisfied Users Click more Content
  8. 8. Tolerating Users click less content
  9. 9. Frustrated Users mainly click on Support
  10. 10. Update of Dependency Injection Library impacts Memory & CPU
  11. 11. App with Regular Load supported by 10 Containers Twice the Load but 48 (=4.8x!) Containers! App doesn’t scale!! Does it really scale?
  12. 12. How to analyze perf?
  13. 13. Time: Wall Clock, CPU, I/O, Wait/Sync, Susp, Page Load. Throughput: # of Requests per Time Interval. Resources: CPU Cycles, Memory, I/O, Log Messages, ... Pools and Queues: Sizes, Utilization, Acquisition Time, # Publishers vs # Subscribers, Process Time. Interactions: # SQLs, # Messages, # Services, # Images, # CSS. Errors: Exceptions, HTTPs, TCP Packet Loss
  14. 14. AND MANY MORE
  15. 15. 0.02ms 0.01ms
  16. 16. https://dynatrace.github.io/ufo/ “In Your Face” Data!
  17. 17. Where do your Stories come from?
  18. 18. Share Your PurePath - http://bit.ly/sharepurepath
  19. 19. 3rd parties Akamai Cloudfront Synthetic Apache IIS Node.js nginx Java .NET PHP IBM WMQ ESBs MongoDB Hbase Cassandra CICs IMS ORACLE MSSQL MySQL DB2 Mobile Collector Plugins Dynatrace Server Hosts Session Storage Splunk Elasticsearch Solr Rich Client Web Interface Web
  20. 20. Dev/Arch Method Level Hotspots + Exceptions, Logs, Memory Allocation, Threads, Actual Code ...
  21. 21. Export & Share Share Your PurePath - http://bit.ly/sharepurepath
  22. 22. 20% 80%
  23. 23. Frontend Performance We are getting FATer!
  24. 24. Mobile landing page of Super Bowl ad 434 Resources in total on that page: 230 JPEGs, 75 PNGs, 50 GIFs, … Total size of ~ 20MB
  25. 25. FIFA.com during the World Cup. Source: http://apmblog.compuware.com/2014/05/21/is-the-fifa-world-cup-website-ready-for-the-tournament/
  26. 26. 8MB of background image for STPCon (WordPress)
  27. 27. Availability dropped to 0% Availability And Response Time
  28. 28. Tip for handling Spike Load: GO LEAN!! 1h before SuperBowl KickOff 1h after Game ended
  29. 29. Make F12 or Browser Agent your friend!
  30. 30. Key Metrics # of Resources Size of Resources Total Size of Content HTTP 3xx, 4xx, 5xx # of Domains
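
The key metrics on this slide can be derived from any list of resources a page loaded, for example one exported from the browser's F12 tools. The following is a minimal Java sketch under that assumption; `PageMetrics`, `Resource`, and the example URLs are hypothetical names for illustration and not part of any tool named in the slides.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical summary of one page load; all names are illustrative. */
final class PageMetrics {

    /** One downloaded resource: URL, HTTP status code and size in bytes. */
    static final class Resource {
        final String url;
        final int status;
        final long bytes;

        Resource(String url, int status, long bytes) {
            this.url = url;
            this.status = status;
            this.bytes = bytes;
        }
    }

    final int resourceCount;  // # of Resources
    final long totalBytes;    // Total Size of Content
    final int domainCount;    // # of Domains
    final int non2xxCount;    // HTTP 3xx, 4xx and 5xx responses

    PageMetrics(List<Resource> resources) {
        Set<String> domains = new HashSet<>();
        long bytes = 0;
        int non2xx = 0;
        for (Resource r : resources) {
            domains.add(URI.create(r.url).getHost());
            bytes += r.bytes;
            if (r.status >= 300) {
                non2xx++;
            }
        }
        this.resourceCount = resources.size();
        this.totalBytes = bytes;
        this.domainCount = domains.size();
        this.non2xxCount = non2xx;
    }

    public static void main(String[] args) {
        PageMetrics m = new PageMetrics(Arrays.asList(
                new Resource("https://example.com/index.html", 200, 5_000),
                new Resource("https://example.com/old.css", 301, 0),
                new Resource("https://cdn.example.net/hero.jpg", 200, 8_000_000)));
        System.out.println(m.resourceCount + " resources, " + m.totalBytes
                + " bytes, " + m.domainCount + " domains, "
                + m.non2xxCount + " non-2xx responses");
    }
}
```

Tracking these four numbers per build makes a sudden jump, such as a single 8MB image, visible before it reaches production.
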
  31. 31. Backend Performance The Usual Suspects
  32. 32. Project: Online Room Reservation System. Symptoms: HTML takes between 60 and 120s to render; High GC Time. Developer Assumptions: Bad GC Tuning; Probably bad Database Performance as rendering was simple. Result: 2 Years of finger-pointing between Dev and DBA
  33. 33. Developers built their own monitoring: void roomreservationReport(int officeId) { long startTime = System.currentTimeMillis(); Object data = loadDataForOffice(officeId); long dataLoadTime = System.currentTimeMillis() - startTime; generateReport(data, officeId); } Result: Avg. Data Load Time: 45s! DB Tool says: Avg. SQL Query: <1ms!
  34. 34. #1: Loading too much data. 24889! Calls to the Database API! Keeping all data in memory results in high memory usage and high GC time
  35. 35. #2: On individual connections. 12444! individual connections. Classical N+1 Query Problem. Individual SQL really <1ms
  36. 36. #3: Putting all data in temp Hashtable Lots of time spent in Hashtable.get Called from their Entity Objects
  37. 37. Lessons Learned – Don’t Assume … • … you know what the code you inherited is doing!! • … you are not making mistakes like this :-) • Explore the Right Tools • Built-In Database Analysis Tools • “Logging” options of Frameworks such as Hibernate, … • JMX, Perf Counters, … of your Application Servers • Performance Tracing Tools: Dynatrace, Ruxit, NewRelic, AppDynamics, Your Profiler of Choice …
  38. 38. Key Metrics # of SQL Calls # of same SQL Execs (1+N) # of Connections Rows/Data Transferred
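
The "# of SQL Calls" and "# of same SQL Execs (1+N)" metrics can be collected by counting statement texts per request. The following is a minimal Java sketch, not the mechanism of any tool named in the slides: `SqlCallCounter`, its threshold parameter, and the table names in the example are all hypothetical, and the code that would call `record` from a data access layer is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-request SQL counter; names are illustrative only. */
final class SqlCallCounter {

    private final Map<String, Integer> executions = new LinkedHashMap<>();
    private int total;

    /** Call once per executed statement, passing the SQL text with bind placeholders. */
    void record(String sql) {
        executions.merge(sql, 1, Integer::sum);
        total++;
    }

    /** Total number of SQL calls seen for this request. */
    int totalCalls() {
        return total;
    }

    /** Highest number of times one identical statement text was executed. */
    int maxSameStatement() {
        int max = 0;
        for (int count : executions.values()) {
            max = Math.max(max, count);
        }
        return max;
    }

    /** Flags a likely 1+N pattern: one statement repeated more often than the threshold. */
    boolean looksLikeNPlusOne(int threshold) {
        return maxSameStatement() > threshold;
    }

    public static void main(String[] args) {
        SqlCallCounter counter = new SqlCallCounter();
        counter.record("SELECT id FROM room WHERE office_id = ?");
        for (int i = 0; i < 100; i++) {
            // The same statement executed once per row of the first result
            counter.record("SELECT * FROM reservation WHERE room_id = ?");
        }
        System.out.println("total=" + counter.totalCalls()
                + " maxSame=" + counter.maxSameStatement()
                + " suspicious=" + counter.looksLikeNPlusOne(10));
    }
}
```

Counting by statement text rather than by execution time is the point: each query in the room reservation example took less than 1ms, so only the count revealed the problem.
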
  39. 39. Logging: WE CAN LOG THIS!! Or we just throw a lot of Exceptions :-) LOG
  40. 40. Log Hotspots in Frameworks! callAppenders clear CPU and I/O Hotspot Excessive logging through Spring Framework
  41. 41. Debug Log and outdated log4j library #1: Top Problem: log4j.callAppenders -> 71% Sync Time #2: Most of logging done from fillDetail method #3: Doing “DEBUG” log output: Is this necessary?
  42. 42. Overhead caused by Exceptions fillInStackTrace is Top 2 in CPU Hotspots All these Exceptions that never show up in a log file are consuming all CPU
  43. 43. Too Many Exceptions vs Log Messages 2-5 Log Messages per 5 Min Looking at the important (SEVERE, FATAL, …) log messages written Up to 20000 Custom Exceptions That’s about 4000x the number of Exceptions per Log Message
  44. 44. Key Metrics # of Log Entries Size of Logs per Use Case
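
Two common ways to reduce the logging and exception overhead described above are to guard debug output and to avoid capturing stack traces for exceptions that are only used as signals. The following is a minimal Java sketch of both. It uses `java.util.logging` from the JDK instead of the log4j library named in the slides so that it stays self-contained, and `CheapDiagnostics`, `StacklessException`, and `logDetail` are hypothetical names.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helpers; java.util.logging is swapped in for log4j here. */
final class CheapDiagnostics {

    private static final Logger LOG =
            Logger.getLogger(CheapDiagnostics.class.getName());

    /** Exception for expected, frequent cases; no stack trace is captured. */
    static final class StacklessException extends RuntimeException {
        StacklessException(String message) {
            // cause = null, enableSuppression = false, writableStackTrace = false;
            // with writableStackTrace = false the JDK does not call fillInStackTrace()
            super(message, null, false, false);
        }
    }

    /** Builds the message string only when FINE (debug-level) output is enabled. */
    static void logDetail(Object detail) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("detail: " + detail);
        }
    }

    public static void main(String[] args) {
        logDetail("skipped unless FINE logging is enabled");
        StacklessException e = new StacklessException("not found");
        System.out.println("stack frames captured: " + e.getStackTrace().length);
    }
}
```

A stackless exception is harder to debug, so this trade-off only fits exceptions that never reach a log file anyway, which is the case the slides describe.
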
  45. 45. Pools & Queues Proper Sizing!!
  46. 46. Wrong Pool Sizes Configured Do we have enough DB CONNECTIONS per pool?
  47. 47. Threading Issues
  48. 48. Threading Issues (Analysis) Tip: I like the Thread Column as it tells me where we spawn off async threads and where the “main threads” might be waiting
  49. 49. Sync / Wait: 1.63s in Object.wait means that this thread is put on hold, waiting on the next Connection to become available!
  50. 50. Key Metrics Pool and Queue Sizes Time in Sync & Wait
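
Pool utilization and acquisition time can be measured at the point where a caller asks for a pooled resource. The following is a minimal Java sketch that uses a `Semaphore` as a stand-in for a real connection pool; `MeasuredPool` and its methods are hypothetical, and a real pool would normally expose comparable numbers through its own metrics or JMX.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical bounded pool guard that records how long callers wait for a slot. */
final class MeasuredPool {

    private final int size;
    private final Semaphore permits;
    private final AtomicLong totalWaitNanos = new AtomicLong();
    private final AtomicLong attempts = new AtomicLong();

    MeasuredPool(int size) {
        this.size = size;
        this.permits = new Semaphore(size, true); // fair: waiters served in arrival order
    }

    /** Waits up to the timeout for a slot; returns false if none became available. */
    boolean acquire(long timeoutMillis) {
        long start = System.nanoTime();
        boolean acquired = false;
        try {
            acquired = permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        } finally {
            totalWaitNanos.addAndGet(System.nanoTime() - start);
            attempts.incrementAndGet();
        }
        return acquired;
    }

    /** Returns a slot; callers must only release what they acquired. */
    void release() {
        permits.release();
    }

    /** Fraction of slots currently in use, between 0.0 and 1.0. */
    double utilization() {
        return (size - permits.availablePermits()) / (double) size;
    }

    /** Average time spent in acquire(), including attempts that timed out. */
    double averageWaitMillis() {
        long n = attempts.get();
        return n == 0 ? 0.0 : totalWaitNanos.get() / (double) n / 1_000_000.0;
    }

    public static void main(String[] args) {
        MeasuredPool pool = new MeasuredPool(1);
        pool.acquire(10);                 // takes the only slot
        boolean second = pool.acquire(50); // has to wait and times out
        System.out.println("second acquired=" + second
                + " utilization=" + pool.utilization()
                + " avgWaitMs=" + pool.averageWaitMillis());
        pool.release();
    }
}
```

A rising average wait with utilization pinned at 1.0 is the signature of an undersized pool, the same situation as the 1.63s spent in Object.wait.
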
  51. 51. (Micro)Services: Architectural Mistakes with “Migrating” to (Micro)Services
  52. 52. Example #2: Online Sports Club Search Service. 1) Started as a small project 2) Slowly growing user base 3) Expanding to new markets – 1st performance degradation! 4) Adding more markets – performance becomes a business impact 5) Potentially start losing users. (Chart: Response Time and Users over time: 20xx, 2014, 2015, 2016+)
  53. 53. Early 2015: Monolithic App. Can’t scale vertically endlessly! 2.68s Load Time, 94.09% CPU Bound
  54. 54. Proposal: Service approach! Front End to Cloud Scale Backend in Containers!
  55. 55. 7:00 a.m. Low Load and Service running on minimum redundancy 12:00 p.m. Scaled up service during peak load with failover of problematic node 7:00 p.m. Scaled down again to lower load and move to different geo location Testing the Backend Service alone scales well …
  56. 56. Go live – 7:00 a.m.
  57. 57. Go live – 12:00 p.m.
  58. 58. What Went Wrong?
  59. 59. Single search query end-to-end: 26.7s Load Time, 5kB Payload, 33! Service Calls, 99kB - 3kB for each call!, 171! Total SQL Count. Architecture Violation: Direct access to DB from frontend service
  60. 60. The fixed end-to-end use case: “Re-architect” vs. “Migrate” to Service-Orientation. 2.5s (vs 26.7) Load Time, 5kB Payload, 1! (vs 33!) Service Call, 5kB (vs 99) Payload!, 3! (vs 171) Total SQL Count
  61. 61. You measure it! from Dev (to) Ops
  62. 62. Metrics from and for Dev(to)Ops. Scenario: Monolithic App with 2 Key Features, then Re-architecture into “Services” + Performance Fixes. Use Case Tests and Monitors (Use Case, Stat, # API Calls, # SQL, Payload, CPU): Build 17: testNewsAlert OK 1, 5, 2kb, 70ms; testSearch OK 1, 3, 5kb, 120ms. Build 25: testNewsAlert OK 1, 4, 1kb, 60ms; testSearch OK 34, 171, 104kb, 550ms. Build 26: testNewsAlert OK 1, 4, 1kb, 60ms; testSearch OK 2, 3, 10kb, 150ms. Build 35: testNewsAlert -, -, -, -; testSearch OK 2, 3, 10kb, 150ms. Service & App Metrics from Ops (# ServInst, Usage, RT): Build 17: testNewsAlert 1, 0.5%, 7.2s; testSearch 1, 63%, 5.2s. Build 26: testNewsAlert 1, 0.6%, 4.2s; testSearch 5, 75%, 2.5s. Build 35: testNewsAlert -, -, -; testSearch 8, 80%, 2.0s
  63. 63. Key Metrics # of Service Calls Payload of Service Calls # of Involved Threads 1+N Service Call Pattern!
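
Checking these metrics automatically in the pipeline comes down to comparing each build's use case metrics against a known-good baseline and failing the build on a regression. The following is a minimal Java sketch using the testSearch figures from the build table (1 API call, 3 SQL statements, 5kb at baseline; 34, 171, 104kb in the regressed build). `PerformanceGate`, `UseCaseMetrics`, and the allowed factor of 2.0 are assumptions for illustration, not how any particular tool implements it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pipeline gate comparing architectural metrics between builds. */
final class PerformanceGate {

    /** Metrics captured by one use case test run. */
    static final class UseCaseMetrics {
        final int apiCalls;
        final int sqlCount;
        final int payloadKb;

        UseCaseMetrics(int apiCalls, int sqlCount, int payloadKb) {
            this.apiCalls = apiCalls;
            this.sqlCount = sqlCount;
            this.payloadKb = payloadKb;
        }
    }

    /** Returns one message per metric that exceeds baseline * allowedFactor. */
    static List<String> check(UseCaseMetrics baseline, UseCaseMetrics current,
                              double allowedFactor) {
        List<String> violations = new ArrayList<>();
        compare("API calls", baseline.apiCalls, current.apiCalls, allowedFactor, violations);
        compare("SQL count", baseline.sqlCount, current.sqlCount, allowedFactor, violations);
        compare("payload kb", baseline.payloadKb, current.payloadKb, allowedFactor, violations);
        return violations;
    }

    private static void compare(String name, int base, int now, double factor,
                                List<String> out) {
        if (now > base * factor) {
            out.add(name + ": " + base + " -> " + now);
        }
    }

    public static void main(String[] args) {
        UseCaseMetrics baseline = new UseCaseMetrics(1, 3, 5);
        UseCaseMetrics regressed = new UseCaseMetrics(34, 171, 104);
        List<String> violations = check(baseline, regressed, 2.0);
        violations.forEach(System.out::println);
        if (!violations.isEmpty()) {
            System.exit(1); // a non-zero exit code fails the pipeline step
        }
    }
}
```

The functional tests in the regressed build still reported OK, so a gate of this kind catches what pass/fail status alone does not.
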
  64. 64. Tips & Tricks: And more Metrics of course :-)
  65. 65. Tip: Layer Breakdown over Time With increasing load: Which LAYER doesn’t SCALE?
  66. 66. Tip: Exceptions and Log Messages. How are # of EXCEPTIONS evolving over time? How many SEVERE LOG messages do we write in relation to Exceptions?
  67. 67. Tip: Failed Transactions Are more TRANSACTIONS FAILING (HTTP 5xx, 4xx, …) under heavier load?
  68. 68. Tip: Database Activity. Do we see an increase in AVG # of SQL Executions over Time? Do TOTAL # of SQL Executions increase with load? Shouldn’t it flatten due to CACHES?
  69. 69. Tip: Database History Dashboard. How many SQL Statements are PREPARED? What’s the overall Execution Time of different SQL Types (SELECT, INSERT, DELETE, …)?
  70. 70. For more Key Metrics http://blog.dynatrace.com http://blog.ruxit.com
  71. 71. Questions and/or Demo Slides: slideshare.net/grabnerandi Get Tools: bit.ly/dtpersonal YouTube Tutorials: bit.ly/dttutorials Contact Me: agrabner@dynatrace.com Follow Me: @grabnerandi Read More: blog.dynatrace.com
  72. 72. Andreas Grabner Dynatrace Developer Advocate @grabnerandi http://blog.dynatrace.com