Performance Analysis of Idle Programs

Matthew Arnold's ECOOP 2011 Summer School talk.


    1. 1. Performance Analysis of Idle Programs. Erik Altman, Matthew Arnold, Stephen Fink, Nick Mitchell, Peter Sweeney. IBM T.J. Watson Research Center. “WAIT Performance Tool”
    2. 2. Overview <ul><li>Application-driven scalability study </li></ul><ul><ul><li>Performance analysis of large enterprise app </li></ul></ul><ul><ul><li>Code changes resulting in 1.4X - 5X improvement </li></ul></ul><ul><li>WAIT performance tool </li></ul><ul><ul><li>So easy your mother/father could use it </li></ul></ul><ul><ul><li>Demo </li></ul></ul><ul><ul><li>Implementation details </li></ul></ul><ul><li>Lessons learned </li></ul><ul><ul><li>Performance tooling </li></ul></ul><ul><ul><li>The multicore controversy </li></ul></ul><ul><ul><ul><li>Are we working on the right things? </li></ul></ul></ul>
    3. 3. Application-Driven Scalability Study <ul><li>App-driven scalability study (2008/2009) </li></ul><ul><ul><li>Goal: Identify scaling problems and solutions for future multicore architectures </li></ul></ul><ul><ul><li>Approach: application-driven (data-driven) exploration </li></ul></ul><ul><ul><ul><li>Choose a real application </li></ul></ul></ul><ul><ul><ul><li>Let workload drive the research </li></ul></ul></ul><ul><ul><ul><li>Identify scalability problems </li></ul></ul></ul><ul><ul><ul><li>Restructure application to improve scalability </li></ul></ul></ul><ul><li>Assumption: application is already parallel </li></ul><ul><ul><li>Not attempting to automatically parallelize serial code </li></ul></ul><ul><ul><ul><li>Open to adding fine-grained parallelism within a transaction </li></ul></ul></ul><ul><li>Infrastructure </li></ul><ul><ul><li>Two POWER5 p595 machines (64-core, 128 hw threads) </li></ul></ul><ul><li>Team members </li></ul><ul><ul><ul><li>Erik Altman, Matthew Arnold, Rajesh Bordawekar, Robert Delmonico, Nick Mitchell, Peter Sweeney </li></ul></ul></ul>
    4. 4. Application <ul><li>“Zanzibar”: content management system </li></ul><ul><ul><li>Multitier: J2EE (Java) application server, DB2, LDAP, client(s) </li></ul></ul><ul><ul><li>Document ingestion and retrieval </li></ul></ul><ul><ul><ul><li>Used by hospitals, banks, etc. </li></ul></ul></ul><ul><ul><ul><li>Data + metadata </li></ul></ul></ul><ul><ul><li>Mature code </li></ul></ul><ul><ul><ul><li>In production several years, multiple major releases </li></ul></ul></ul><ul><ul><ul><li>Previous performance study in 2007 </li></ul></ul></ul><ul><li>Plan of attack </li></ul><ul><ul><li>First ensure it scales on smaller hardware (4-core / 8-core) </li></ul></ul><ul><ul><li>Then upgrade to large 64-core machine </li></ul></ul><ul><ul><ul><li>Find and fix bottlenecks until it scales </li></ul></ul></ul>
    5. 5. Initial Result: failure on almost all fronts <ul><li>Install and config took several weeks </li></ul><ul><ul><li>Real application, real workload, multi-tier configuration </li></ul></ul><ul><ul><li>Load driver </li></ul></ul><ul><li>Terrible scalability even on modest 4-way hardware </li></ul><ul><ul><li>Observed performance: 1 doc/second </li></ul></ul><ul><ul><ul><li>Target > 1000 docs/second </li></ul></ul></ul><ul><ul><li>App server machine < 10% CPU utilization </li></ul></ul><ul><li>Existing performance tools did not prove useful </li></ul><ul><ul><li>Struggled even to identify the primary bottleneck we faced </li></ul></ul><ul><ul><ul><li>Let alone its underlying cause </li></ul></ul></ul><ul><li>Advice we were given </li></ul><ul><ul><li>You need a massive hard disk array for that application </li></ul></ul><ul><ul><li>Gigabit ethernet? You need Infiniband or Hipersockets </li></ul></ul>
    6. 6. Stop! We’ve already learned several lessons <ul><li>Lesson 1: This “application-driven” research idea stinks </li></ul><ul><ul><li>“We aren’t the right people to be doing this. Someone else should get this deployed so we can focus on what we’re good at: Java performance analysis.” </li></ul></ul><ul><li>Lesson 2: Ignore lesson 1 </li></ul><ul><ul><li>Despite being frustrated, we learned a lot </li></ul></ul><ul><ul><ul><li>Whole point was to open our minds to new problems </li></ul></ul></ul><ul><ul><li>We are an important demographic </li></ul></ul><ul><ul><ul><li>Mostly-competent non-experts </li></ul></ul></ul><ul><ul><ul><li>“Why is the app I just installed an order of magnitude too slow?” </li></ul></ul></ul><ul><ul><ul><ul><li>Very common question </li></ul></ul></ul></ul><ul><li>Disclaimer: if you go down this road </li></ul><ul><ul><li>You will end up working on things you didn’t intend (or want?) to </li></ul></ul>
    7. 7. OK, so let’s find some bottlenecks <ul><li>Matt and Nick, see what you can find! </li></ul><ul><ul><li>Matt: I installed and ran Java Lock Analyzer. I don’t see any hot locks </li></ul></ul><ul><ul><li>Nick: Yeah, I did kill -3 to generate javacores and the thread stacks show we’re waiting on the database </li></ul></ul><ul><ul><li>Matt: I installed and ran tprof. Method toLowerCase() is the hottest method </li></ul></ul><ul><ul><li>Nick: Yeah, that was clear from the thread dumps too </li></ul></ul><ul><li>Observation 1 </li></ul><ul><ul><li>Seasoned performance experts often don’t use any fancy tools </li></ul></ul><ul><ul><ul><li>Start with simple utilities: top, ps, kill -3, oprofile, netperf </li></ul></ul></ul><ul><ul><ul><li>Top performance experts don’t use tools developed in research? </li></ul></ul></ul><ul><li>Observation 2 </li></ul><ul><ul><li>The tools we found were a mismatch for “performance triage” </li></ul></ul><ul><ul><li>Targeted focus: Hot locks, GC analyzer, DB query analyzer, etc. </li></ul></ul><ul><ul><ul><li>How do I know which tool to use first? </li></ul></ul></ul><ul><ul><ul><li>Once you fix one bottleneck, you start all over </li></ul></ul></ul><ul><ul><li>High installation and usage effort </li></ul></ul>
    8. 8. Constraints of a real-world production deployment <ul><ul><li>Instrument the application? NO! </li></ul></ul><ul><ul><li>Recompile the application? NON! </li></ul></ul><ul><ul><li>Deploy a fancy monitoring agent? NICHT! </li></ul></ul><ul><ul><li>Analyze the source? ノー! </li></ul></ul><ul><ul><li>Install more modern JVM? yIntagh! </li></ul></ul>
    9. 9. Let’s see what Javacores can do <ul><li>Zanzibar analysis done almost entirely using Javacores </li></ul><ul><li>Methodology used </li></ul><ul><ul><li>Trigger a few javacores from server under load </li></ul></ul><ul><ul><ul><li>Manually inspect </li></ul></ul></ul><ul><ul><ul><ul><li>Look for frequently occurring thread stacks </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Whether running, or blocked </li></ul></ul></ul></ul><ul><ul><ul><li>Fix problem </li></ul></ul></ul><ul><ul><ul><li>Repeat </li></ul></ul></ul>
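The manual inspection step above (look for frequently occurring thread stacks) can be approximated with a few lines of Python. This is a minimal sketch, not the authors' procedure: the function name `most_common_stacks` and the input format (each stack as a tuple of frame names, innermost frame first) are assumptions made for illustration, and parsing of real javacore files is left out.

```python
from collections import Counter

def most_common_stacks(dumps, top_n=3):
    """Count identical thread stacks across several dumps.

    `dumps` is a list of dumps; each dump is a list of stacks; each stack
    is a tuple of frame names, innermost frame first. This is an assumed
    in-memory format, not the javacore file format.
    """
    counts = Counter(stack for dump in dumps for stack in dump)
    return counts.most_common(top_n)

# Hypothetical data: two dumps taken a few seconds apart.
dumps = [
    [("socketRead0", "executeQuery", "handleRequest"),
     ("socketRead0", "executeQuery", "handleRequest"),
     ("toLowerCase", "lookup", "handleRequest")],
    [("socketRead0", "executeQuery", "handleRequest"),
     ("Object.wait", "getTask", "run")],
]

for stack, count in most_common_stacks(dumps):
    print(count, " <- ".join(stack))
```

A stack that shows up in most samples, whether running or blocked, is the first candidate to investigate.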
    10. 10. No single class of bottleneck dominated <ul><li>Bottlenecks found </li></ul><ul><ul><li>Exceeded disk throughput on the database machine </li></ul></ul><ul><ul><li>Exceeded disk throughput on the application server machine </li></ul></ul><ul><ul><li>Application overuse of filesystem metadata operations </li></ul></ul><ul><ul><li>Lock contention in application code </li></ul></ul><ul><ul><li>Saturating network </li></ul></ul><ul><ul><li>GC bottlenecks due to JVM implementation </li></ul></ul><ul><ul><li>GC bottlenecks due to application issues </li></ul></ul><ul><ul><li>Difficulties driving enough load due to bottlenecks in load generator </li></ul></ul>
    11. 11. 1) Lock Contention <ul><li>Found 11 hot locks </li></ul><ul><li>Replaced with </li></ul><ul><ul><li>Thread-local storage </li></ul></ul><ul><ul><ul><li>Fine-grained data replication </li></ul></ul></ul><ul><ul><ul><li>Good for lazily initialized, read-only data </li></ul></ul></ul><ul><ul><li>Concurrent collections </li></ul></ul><ul><ul><ul><li>Significantly more scalable </li></ul></ul></ul><ul><li>Alternative: App server cloning </li></ul><ul><ul><li>Coarse grained data replication </li></ul></ul>
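The thread-local replacement for a hot lock can be illustrated as follows. The original changes were made in Java; this is an analogous sketch in Python using `threading.local`, and the names `load_config` and `get_config` are hypothetical. Each thread lazily builds its own copy of read-only data, so no lock is taken on the hot path.

```python
import threading

_local = threading.local()
init_count = 0
_count_lock = threading.Lock()

def load_config():
    """Stand-in for an expensive, read-only initialization (hypothetical)."""
    global init_count
    with _count_lock:          # guards only the demo counter
        init_count += 1
    return {"encoding": "utf-8"}

def get_config():
    """Return this thread's private copy, creating it on first use."""
    cfg = getattr(_local, "config", None)
    if cfg is None:
        cfg = load_config()
        _local.config = cfg
    return cfg

def worker():
    for _ in range(1000):
        get_config()           # no shared lock after the first call

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("initializations:", init_count)
```

The cost is one copy of the data per thread, which is why the slide describes this as fine-grained data replication suited to lazily initialized, read-only data.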
    12. 12. 2) Contended Shared Resources <ul><li>Disk </li></ul><ul><ul><li>Database machine </li></ul></ul><ul><ul><ul><li>Heavy use of BLOBs (Binary Large OBjects) </li></ul></ul></ul><ul><ul><ul><ul><li>Non-buffered writes </li></ul></ul></ul></ul><ul><ul><li>App server machine </li></ul></ul><ul><ul><ul><li>Frequent filesystem calls (open, close, exists, rename) </li></ul></ul></ul><ul><li>OS/Filesystem </li></ul><ul><ul><li>Both of the above remain bottlenecks even with a RAM disk </li></ul></ul><ul><li>JVM </li></ul><ul><ul><li>Object allocation and GC </li></ul></ul><ul><ul><ul><li>Excessive temp object creation </li></ul></ul></ul><ul><ul><ul><li>Reduced object creation rate by 3X </li></ul></ul></ul><ul><ul><ul><ul><li>Objects per request: 2850 → 900 </li></ul></ul></ul></ul><ul><li>Network </li></ul><ul><ul><li>Bloated protocols </li></ul></ul>
    13. 13. 3) Single-threaded Performance <ul><li>For coarse-grained bottlenecks </li></ul><ul><ul><li>Identify contended resource X </li></ul></ul><ul><ul><li>Change code to put less pressure on X </li></ul></ul><ul><ul><li>Repeat </li></ul></ul><ul><li>Eventually you get to finer-granularity resources </li></ul><ul><ul><li>It became simpler to </li></ul></ul><ul><ul><ul><li>Give up hunting for individual contended resources </li></ul></ul></ul><ul><ul><ul><li>Focus on single-threaded performance instead </li></ul></ul></ul><ul><ul><li>Find grossly inefficient code and fix it </li></ul></ul><ul><ul><ul><li>Improves latency / response time </li></ul></ul></ul><ul><ul><ul><li>Also improves scalability </li></ul></ul></ul><ul><ul><ul><ul><li>If a program executes the minimal number of steps to accomplish a task it likely consumes fewer shared resources at all levels of the stack </li></ul></ul></ul></ul>
    14. 14. Examples of Single-threaded Improvements <ul><li>Redundant computations </li></ul><ul><ul><li>Redundant calls to file.exists() </li></ul></ul><ul><ul><li>Excessive use of toLowerCase() </li></ul></ul><ul><li>Over-general code </li></ul><ul><ul><li>Creating hashmap to pass 2 elements </li></ul></ul><ul><li>Unnecessary copies and conversion </li></ul><ul><ul><li>Stores same data in both Id and String form </li></ul></ul><ul><ul><ul><li>Converts back and forth frequently </li></ul></ul></ul><ul><ul><ul><li>Calls isId() frequently </li></ul></ul></ul><ul><ul><li>String operations to find prepared statement </li></ul></ul>
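As a small illustration of the "redundant computations" item, the sketch below contrasts repeated case conversion inside a loop with a version that converts the search key once. It is written in Python rather than the application's Java, and the function names are made up for this example; the actual Zanzibar code is not shown in the talk.

```python
def count_matches_slow(names, wanted):
    # Lowercases `wanted` again for every element.
    return sum(1 for n in names if n.lower() == wanted.lower())

def count_matches_fast(names, wanted):
    wanted_lower = wanted.lower()   # computed once, outside the loop
    return sum(1 for n in names if n.lower() == wanted_lower)

names = ["README", "readme", "Notes"]
print(count_matches_slow(names, "ReadMe"), count_matches_fast(names, "ReadMe"))
```

Both functions return the same result; the second simply does less work per element, which is the kind of small, local change the slide lists.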
    15. 15. Performance Results <ul><li>8-way x86 </li></ul>
    16. 16. Results: 56 core POWER p595, scaling with -checkin (first minute) [Chart: docs/second (axis 0 to 7000) versus number of cores (2, 4, 8, 16, 32, 56); series: Orig, Modified]
    17. 17. Results: 56 core POWER p595, without -checkin flag [Chart: docs/second (axis 0 to 7000) versus number of cores (2, 4, 8, 16, 32, 56); series: Orig, Modified]
    18. 18. Zanzibar Results: Interesting Points <ul><ul><li>Significant improvements from 27 small, targeted changes </li></ul></ul><ul><ul><li>Fixing 11 Java locks left no hot locks </li></ul></ul><ul><ul><ul><li>Even in a 64-core / 128-thread system </li></ul></ul></ul><ul><ul><li>Locks were only one of many obstacles to scalability </li></ul></ul><ul><ul><ul><li>Contended shared resources at many levels of the stack </li></ul></ul></ul><ul><ul><ul><li>JVM cloning would have helped with some but not all </li></ul></ul></ul><ul><ul><li>GC is what prevented scaling beyond 32 cores </li></ul></ul><ul><ul><li>Improvements were high-level Java code or config changes </li></ul></ul><ul><ul><ul><li>Nothing specific to multi-core microarchitecture </li></ul></ul></ul><ul><ul><li>Made more single-threaded changes than expected </li></ul></ul><ul><ul><ul><li>Accounted for roughly half the speedups (8-way, no “-checkin”) </li></ul></ul></ul><ul><ul><li>Did most of the work on 4-way and 8-way machines </li></ul></ul>Much of this was not what we expected to be working on
    19. 19. Overview <ul><li>Zanzibar scalability study </li></ul><ul><ul><li>Performance analysis of large enterprise app </li></ul></ul><ul><ul><li>Code changes resulting in 1.4X - 5X improvement </li></ul></ul><ul><li>WAIT performance tool </li></ul><ul><ul><li>Demo </li></ul></ul><ul><ul><li>Implementation details </li></ul></ul><ul><li>Lessons learned </li></ul><ul><ul><li>Performance tooling </li></ul></ul><ul><ul><li>The multicore controversy </li></ul></ul><ul><ul><ul><li>Are we working on the right things? </li></ul></ul></ul>
    20. 20. WAIT Performance Tool <ul><li>OOPSLA 2010 “Performance analysis of idle programs. Quickly identify performance and scalability inhibitors” - Altman, Arnold, Fink, Mitchell </li></ul><ul><li>Quickly identify primary bottleneck </li></ul><ul><li>Usable in large-scale, deployed production setting </li></ul><ul><ul><li>Intrusive monitoring not an option </li></ul></ul><ul><li>Learn from past experiences </li></ul><ul><ul><li>Rule based </li></ul></ul><ul><ul><li>“Taught” about common causes of bottlenecks. </li></ul></ul>
    21. 21. WAIT: Key Attributes <ul><li>WAIT focuses on primary bottlenecks </li></ul><ul><ul><li>Gives high-level, whole-system summary of performance inhibitors </li></ul></ul><ul><li>WAIT is zero install </li></ul><ul><ul><li>Leverages built-in data collectors </li></ul></ul><ul><ul><li>Reports results in a browser </li></ul></ul><ul><li>WAIT is non-disruptive </li></ul><ul><ul><li>No special flags, no restart </li></ul></ul><ul><ul><li>Use in any customer or development location </li></ul></ul><ul><li>WAIT is low-overhead </li></ul><ul><ul><li>Uses only infrequent samples of an already-running application </li></ul></ul><ul><li>WAIT is simple to use </li></ul><ul><ul><li>Novices to experts: Start at high level and drill down </li></ul></ul><ul><li>WAIT does not capture sensitive user data </li></ul><ul><ul><li>No Social Security numbers, credit card numbers, etc. </li></ul></ul><ul><li>WAIT uses centralized knowledge base </li></ul><ul><ul><li>Allows rules and knowledge base to grow over time </li></ul></ul>[Diagram labels: Customer A, Customer B, WAIT Cloud Server]
    22. 22. What information does WAIT use? <ul><li>Standard O/S tools </li></ul><ul><ul><li>CPU utilization: (vmstat) </li></ul></ul><ul><ul><li>Process utilization: (ps) </li></ul></ul><ul><ul><li>Memory stats: … </li></ul></ul>
    23. 23. What information does WAIT use? <ul><li>Standard O/S tools </li></ul><ul><ul><li>CPU utilization: (vmstat) </li></ul></ul><ul><ul><li>Process utilization: (ps) </li></ul></ul><ul><ul><li>Memory stats: … </li></ul></ul><ul><li>JVM dump files </li></ul><ul><ul><li>Most JVMs respond to SIGQUIT (kill -3) </li></ul></ul><ul><ul><ul><li>Dump state of JVM to stderr or file </li></ul></ul></ul><ul><ul><ul><li>IBM JVMs produce “javacore” </li></ul></ul></ul>Works with all IBM JVMs, Sun/Oracle JVMs
    24. 24. How to use WAIT <ul><li>Gather some javacores </li></ul><ul><ul><li>Manually execute kill -3 PID </li></ul></ul><ul><ul><li>Run WAIT data collection script </li></ul></ul><ul><li>Upload data to server </li></ul><ul><ul><li>http://wait.researchlabs.ibm.com </li></ul></ul><ul><li>View result in browser </li></ul>[Browsers shown: Firefox, Safari, Chrome, iPhone]
    25. 25. How to use WAIT <ul><li>Gather some javacores </li></ul><ul><ul><li>Manually execute kill -3 PID </li></ul></ul><ul><ul><li>Run WAIT data collection script </li></ul></ul><ul><li>Upload data to server </li></ul><ul><ul><li>http://wait.researchlabs.ibm.com </li></ul></ul><ul><li>View result in browser </li></ul>[Browsers shown: Firefox, Safari, Chrome, iPhone] [Callouts: Free, public server; Nothing to install; Sample-based (low overhead); Doesn’t require restarting app]
    26. 26. WAIT Data Collector <ul><li>Specify JVM process ID </li></ul><ul><ul><li>31136 </li></ul></ul><ul><li>Triggers periodic: </li></ul><ul><ul><li>javacores </li></ul></ul><ul><ul><li>vmstat (machine util) </li></ul></ul><ul><ul><li>ps (process util) </li></ul></ul><ul><li>Creates zip file </li></ul><ul><ul><li>Upload to WAIT server </li></ul></ul>
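The collection loop described on this slide can be sketched roughly as below. This is not the WAIT data collection script; it is a minimal illustration for a Unix-like system, and the names `request_thread_dump`, `run_tool`, and `collect`, the file names inside the archive, and the default sample count and interval are all assumptions. The dump files themselves are written by the JVM to a location that depends on the JVM, so picking them up is left out.

```python
import os
import signal
import subprocess
import time
import zipfile

def request_thread_dump(pid):
    """Send SIGQUIT (what `kill -3 PID` does) to ask a JVM for a dump."""
    os.kill(pid, signal.SIGQUIT)

def run_tool(argv):
    """Run an OS tool once and return its text output."""
    return subprocess.run(argv, capture_output=True, text=True).stdout

def collect(pid, out_zip, samples=3, interval_s=30.0,
            trigger=request_thread_dump, run=run_tool, sleep=time.sleep):
    """Take `samples` snapshots and store the OS tool output in a zip file.

    `trigger`, `run`, and `sleep` are injectable so the loop can be
    exercised without a real JVM.
    """
    with zipfile.ZipFile(out_zip, "w") as archive:
        for i in range(samples):
            trigger(pid)                                   # javacore request
            archive.writestr(f"vmstat_{i}.txt", run(["vmstat"]))
            archive.writestr(f"ps_{i}.txt", run(["ps", "-p", str(pid)]))
            if i < samples - 1:
                sleep(interval_s)
    return out_zip
```

For the process ID shown on the slide, a call might look like `collect(31136, "wait_data.zip")`. Because the samples are infrequent and use facilities the JVM and OS already provide, nothing has to be installed or restarted on the monitored machine.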
    27. 27. Upload Java Cores to WAIT Website
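    28. 28. View WAIT Report in a Browser <ul><li>What is the CPU doing? </li></ul><ul><li>What Java work is running? </li></ul><ul><li>What Java work cannot run? </li></ul><ul><ul><li>Not directly available from profiling tools </li></ul></ul><ul><li>What is memory consumption? </li></ul>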
    29. 29. WAIT Report: What is the main cause of delay? Drill down by clicking on legend item. Where are those delays coming from in the code? <ul><li>Example Rule: </li></ul><ul><ul><li>If socketRead at top of stack AND </li></ul></ul><ul><ul><li>If JDBC methods lower on stack </li></ul></ul><ul><li>→ Getting data from database </li></ul>
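The example rule on this slide translates almost directly into code. The sketch below is an illustration only: the function name `is_database_wait` and the use of simple substring matching are assumptions, and the real WAIT rule format is not shown in the talk.

```python
def is_database_wait(stack):
    """Apply the example rule; `stack` lists frames from the top (innermost) down."""
    if not stack:
        return False
    top_is_socket_read = "socketRead" in stack[0]
    jdbc_below = any("jdbc" in frame.lower() for frame in stack[1:])
    return top_is_socket_read and jdbc_below

# Hypothetical stack: a JDBC commit blocked in a socket read.
stack = [
    "java/net/SocketInputStream.socketRead0",
    "com/mysql/jdbc/ConnectionImpl.commit",
    "com/example/app/Servlet.doGet",
]
print(is_database_wait(stack))
```

A thread matching this rule is not consuming CPU on the app server, yet it is the reason a request is delayed, which is why a sampling CPU profiler alone would miss it.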
    30. 30. Filesystem Bottleneck
    31. 31. z/OS: Lock Contention
    32. 32. Lock Contention: Waiting vs Owning
    33. 33. Deadlock
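The slides show only report screenshots for lock contention and deadlock, so how WAIT detects them is not stated. One common approach, given here purely as an assumption, is to follow "waiting on lock, owned by thread" edges extracted from a thread dump and look for a cycle. The function name `find_deadlock` and both input dictionaries are hypothetical.

```python
def find_deadlock(waiting_for, owner_of):
    """Return the threads forming a wait cycle, or None if there is none.

    waiting_for: thread -> lock the thread is blocked on
    owner_of:    lock   -> thread that currently owns the lock
    """
    for start in waiting_for:
        seen = []
        thread = start
        # Follow "blocked on a lock owned by ..." until the chain ends or repeats.
        while thread in waiting_for and thread not in seen:
            seen.append(thread)
            thread = owner_of.get(waiting_for[thread])
        if thread in seen:
            return seen[seen.index(thread):]
    return None

# Hypothetical dump: T1 and T2 each hold the lock the other one wants.
waiting_for = {"T1": "lockB", "T2": "lockA", "T3": "lockA"}
owner_of = {"lockA": "T1", "lockB": "T2"}
print(find_deadlock(waiting_for, owner_of))
```

The same two maps also separate threads that wait on a lock from the thread that owns it, which is the distinction the "Waiting vs Owning" report draws.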
    34. 34. Some Frameworks Supported by WAIT Rules <ul><li>DB2 </li></ul><ul><li>JDBC </li></ul><ul><li>App Server </li></ul><ul><ul><li>WebSphere, WebLogic, JBoss </li></ul></ul><ul><li>WebSphere Commerce </li></ul><ul><li>Portal </li></ul><ul><li>MQ </li></ul><ul><li>Oracle </li></ul><ul><li>RMIs </li></ul><ul><li>Apache commons </li></ul><ul><li>Tomcat </li></ul>No finger pointing
    35. 35. Example Report: Memory Leak Disclaimer: Appearance and function of any offering may differ from this depiction.
    36. 36. Example Report: Memory Analysis Disclaimer: Appearance and function of any offering may differ from this depiction.
    37. 37. WAIT Implementation Details
    38. 38. Looks Simple! <ul><li>In some ways it is. In others, it’s not. </li></ul><ul><ul><li>Some information presented is nontrivial to compute </li></ul></ul>
    39. 39. Looks Simple! <ul><li>In some ways it is. In others, it’s not. </li></ul><ul><ul><li>Some information presented is nontrivial to compute </li></ul></ul><ul><li>Example: is thread running or sleeping? </li></ul><ul><ul><li>Difficult to determine given the input data </li></ul></ul><ul><ul><li>Thread states reported by JVM are useless </li></ul></ul><ul><ul><ul><li>JVM stops all threads before writing out thread stacks </li></ul></ul></ul>
    40. 40. Looks Simple! <ul><li>In some ways it is. In others, it’s not. </li></ul><ul><ul><li>Some information presented is nontrivial to compute </li></ul></ul><ul><li>Example: is thread running or sleeping? </li></ul><ul><ul><li>Difficult to determine given the input data </li></ul></ul><ul><ul><li>Thread states reported by JVM are useless </li></ul></ul><ul><ul><ul><li>JVM stops all threads before writing out thread stacks </li></ul></ul></ul><ul><ul><li>What is the “correct” thread state? </li></ul></ul><ul><ul><ul><li>Java language level </li></ul></ul></ul><ul><ul><ul><li>JVM level </li></ul></ul></ul><ul><ul><ul><li>O/S level </li></ul></ul></ul>
    41. 41. “WAIT State” <ul><li>Hierarchical abstraction of execution state </li></ul><ul><ul><li>Common cases of making forward progress (or lack thereof) </li></ul></ul><ul><li>Top level: Runnable vs Waiting </li></ul>
    42. 42. WAIT State: RUNNABLE
    43. 43. WAIT State: WAITING
    44. 44. “Category” <ul><li>Hierarchical abstraction of code activity being performed </li></ul><ul><ul><li>What is the code doing? </li></ul></ul>
    45. 45. Analysis engine <ul><li>Use rule-based knowledge to map: </li></ul><ul><ul><li>Thread stack → <Category, WAIT State> </li></ul></ul><ul><li>Category analysis </li></ul><ul><ul><li>Simple pattern matching based on known methods </li></ul></ul><ul><ul><ul><li>java/net/SocketInputStream.socketRead0 → Network </li></ul></ul></ul><ul><ul><ul><li>com/mysql/jdbc/ConnectionImpl.commit → DB Commit </li></ul></ul></ul><ul><ul><ul><li>com/sun/jndi/ldap/Connection.readReply → LDAP </li></ul></ul></ul><ul><ul><li>Algorithm </li></ul></ul><ul><ul><ul><li>Label every stack frame </li></ul></ul></ul><ul><ul><ul><ul><li>If no rule applies, use package name </li></ul></ul></ul></ul><ul><ul><ul><li>Stack assigned label of highest priority rule </li></ul></ul></ul>
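The category algorithm on this slide (label every frame, fall back to the package name, keep the label of the highest-priority rule) can be sketched as follows. The three patterns come from the slide; the numeric priorities, the `RULES` table layout, and the function names are assumptions, since the talk does not show how WAIT stores or ranks its rules.

```python
RULES = [
    # (method pattern, category, priority); the priorities are assumed values
    ("java/net/SocketInputStream.socketRead0", "Network", 1),
    ("com/sun/jndi/ldap/Connection.readReply", "LDAP", 2),
    ("com/mysql/jdbc/ConnectionImpl.commit", "DB Commit", 3),
]

def label_frame(frame):
    """Return (label, priority) for one stack frame."""
    for pattern, category, priority in RULES:
        if pattern in frame:
            return category, priority
    # No rule applies: fall back to the package name at the lowest priority.
    package = frame.rsplit("/", 1)[0] if "/" in frame else frame
    return package, 0

def categorize(stack):
    """Label every frame, then keep the label with the highest priority."""
    if not stack:
        return "Unknown"
    labels = [label_frame(frame) for frame in stack]
    return max(labels, key=lambda label: label[1])[0]

stack = [
    "java/net/SocketInputStream.socketRead0",
    "com/mysql/jdbc/ConnectionImpl.commit",
    "com/example/app/Servlet.doGet",
]
print(categorize(stack))
```

Ranking "DB Commit" above "Network" means a socket read issued by a database driver is reported as database time rather than generic network time, which is usually the more useful answer.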
    46. 46. WAIT State Analysis <ul><li>Uses several inputs to make best guess </li></ul><ul><ul><li>Category </li></ul></ul><ul><ul><li>Lock graph </li></ul></ul><ul><ul><li>Known methods </li></ul></ul><ul><ul><ul><li>Spinlock implementations </li></ul></ul></ul><ul><ul><ul><li>Native routines (blocking read, etc) </li></ul></ul></ul><ul><ul><li>Algorithm acts as a sieve </li></ul></ul><ul><ul><ul><li>Looking for states that can be assigned with most certainty </li></ul></ul></ul><ul><li>Unknown native methods are problematic </li></ul><ul><ul><li>Cannot be certain of execution state </li></ul></ul><ul><ul><li>Assign “native unknown” </li></ul></ul><ul><ul><ul><li>Combine with CPU utilization to have good guess </li></ul></ul></ul>
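The sieve idea can be illustrated with a short function that checks the most certain evidence first and only guesses at the end. This is a sketch under stated assumptions: the state strings below (apart from the top-level Runnable/Waiting split and "native unknown"), the two lookup sets, the `com/example/SpinLock.acquire` method name, and the 50% CPU threshold are all invented for the example.

```python
KNOWN_BLOCKING_NATIVES = {"java/net/SocketInputStream.socketRead0"}
KNOWN_SPINLOCKS = {"com/example/SpinLock.acquire"}   # hypothetical method name

def wait_state(top_frame, top_is_native, blocked_on_lock, cpu_utilization):
    """Assign a coarse execution state, most certain evidence first."""
    if blocked_on_lock:                       # the lock graph says it is blocked
        return "WAITING: lock"
    if top_frame in KNOWN_BLOCKING_NATIVES:   # known blocking native routine
        return "WAITING: blocking native"
    if top_frame in KNOWN_SPINLOCKS:          # spinning burns CPU while waiting
        return "RUNNABLE: spinning"
    if not top_is_native:
        return "RUNNABLE: Java code"
    # Unknown native method: the state cannot be known for certain, so use
    # machine CPU utilization (0.0 to 1.0) as a hint. The threshold is assumed.
    hint = "probably running" if cpu_utilization >= 0.5 else "probably waiting"
    return f"NATIVE UNKNOWN ({hint})"

print(wait_state("com/example/Native.call", True, False, 0.05))
```

Ordering the checks this way keeps the uncertain "native unknown" bucket as small as possible.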
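    47. 47. Rule Statistics (as of Mar 2010) <ul><li>Number of rules per category </li></ul><ul><ul><li>Database: 72 </li></ul></ul><ul><ul><li>Administrative: 59 </li></ul></ul><ul><ul><li>Client Communication: 41 </li></ul></ul><ul><ul><li>Disk, Network I/O: 46 </li></ul></ul><ul><ul><li>Waiting for Work: 30 </li></ul></ul><ul><ul><li>Marshalling: 30 </li></ul></ul><ul><ul><li>JEE: 22 </li></ul></ul><ul><ul><li>Classloader: 13 </li></ul></ul><ul><ul><li>Logging: 12 </li></ul></ul><ul><ul><li>LDAP: 6 </li></ul></ul><ul><ul><li>Annotation on slide: DB2, MySql, Oracle, Apache, SqlServer </li></ul></ul><ul><li>Rule coverage </li></ul><ul><ul><li># Reports: 830 </li></ul></ul><ul><ul><li># Thread stacks: 1,391,033 </li></ul></ul><ul><ul><li>Labeled by category rule: 77% </li></ul></ul><ul><ul><li>Labeled by package fallback: 23% </li></ul></ul>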
    48. 48. Internal WAIT Usage Statistics (as of June 2011) <ul><ul><li>Types of users </li></ul></ul><ul><ul><ul><li>Crit sit teams </li></ul></ul></ul><ul><ul><ul><li>L2/L3 Support </li></ul></ul></ul><ul><ul><ul><li>Developers </li></ul></ul></ul><ul><ul><ul><li>Testers </li></ul></ul></ul><ul><ul><ul><li>Performance analysts </li></ul></ul></ul>600+ Unique users 5200+ Total reports
    49. 49. Who is using WAIT?
    50. 50. Tool in Software Lifecycle <ul><li>The tool applies everywhere in the cycle. Key: lightweight and simple </li></ul><ul><li>Performance tuning cycle (diagram labels: Entry Point, Entry Point, Exit Point) </li></ul><ul><ul><li>Build: use latest compiler; turn on optimization; enable parallelization* </li></ul></ul><ul><ul><li>Analyze: static code analysis; find “hot spots”; identify performance bottlenecks; identify scalability bottlenecks* </li></ul></ul><ul><ul><li>Code & Tune: refine compiler options/directives; use optimized libraries; recode part of application; introduce/increase parallelism* </li></ul></ul><ul><ul><li>Test & Debug: run application; check correctness; check concurrency issues* </li></ul></ul><ul><ul><li>Monitor: measure performance; collect execution stats; validate performance gains; gather stats on scalability* </li></ul></ul>
    51. 51. Testimonial: Health Care System Crit-Sit <ul><li>April-May 2010: Tool team worked intensively on custom workload </li></ul><ul><ul><li>Batch, Java, Database </li></ul></ul><ul><ul><li>Major health care provider </li></ul></ul><ul><li>Approximately 10x gain during that period </li></ul><ul><ul><li>Metric: Transactions per hour </li></ul></ul><ul><li>Others continued to use the tool intensively through August, when the performance goal was achieved. </li></ul><ul><ul><li>400+ WAIT reports over 4 months as part of standard performance testing script </li></ul></ul><ul><ul><li>Result: 60x overall improvement </li></ul></ul><ul><ul><ul><li>30 small, localized changes </li></ul></ul></ul>“Tell your boss you have a tool I would buy.” "The [other] guys are still trying to figure out how to get the tracing installed on all the tiers." Crit-Sit = Critical Situation / Problem
    52. 52. Integration into Development and Test Process <ul><li>Integration of tool into system test process → Big Gains </li></ul><ul><li>Approach: </li></ul><ul><ul><li>Automate tool data collection during tests </li></ul></ul><ul><ul><li>Automate uploading of data to analysis and display server </li></ul></ul><ul><ul><li>Show reports to testers: </li></ul></ul><ul><ul><ul><li>Understand performance </li></ul></ul></ul><ul><ul><ul><li>Identify bugs </li></ul></ul></ul><ul><ul><ul><li>Track application health </li></ul></ul></ul><ul><ul><ul><li>Forward report URL to developers to speed defect resolution. </li></ul></ul></ul>
    53. 53. Overview <ul><li>Zanzibar scalability study </li></ul><ul><ul><li>Performance analysis of large enterprise app </li></ul></ul><ul><ul><li>Code changes resulting in 1.4X - 5X improvement </li></ul></ul><ul><li>WAIT performance tool </li></ul><ul><ul><li>Demo </li></ul></ul><ul><ul><li>Implementation details </li></ul></ul><ul><li>Lessons learned </li></ul><ul><ul><li>Performance tooling </li></ul></ul><ul><ul><li>The multicore controversy </li></ul></ul><ul><ul><ul><li>Are we working on the right things? </li></ul></ul></ul>
    54. 54. Is the WAIT tool “Research”? <ul><li>In some aspects, NO </li></ul><ul><ul><li>Lots of engineering </li></ul></ul><ul><ul><li>Focus on portability and robustness </li></ul></ul><ul><ul><li>Significant UI work </li></ul></ul><ul><li>In many ways, YES </li></ul><ul><ul><li>Philosophy: do more with less </li></ul></ul><ul><ul><ul><li>Opposite of predominant mindset in research community </li></ul></ul></ul><ul><ul><ul><ul><li>Gather more data → more precise result </li></ul></ul></ul></ul><ul><ul><li>Would be nice to see more of the less-is-more approach </li></ul></ul><ul><ul><ul><li>May be harder to publish </li></ul></ul></ul><ul><ul><ul><li>Concrete research statements still possible </li></ul></ul></ul><ul><ul><ul><li>“We achieved 90% of the accuracy of technique X without having to Y” </li></ul></ul></ul>
    55. 55. +1 for managed languages <ul><li>User-triggered snapshots (Javacores) beneficial </li></ul><ul><ul><li>Significant advantage of managed languages? </li></ul></ul><ul><ul><ul><li>Keep in mind when designing future languages and runtimes </li></ul></ul></ul><ul><ul><li>We should write down </li></ul></ul><ul><ul><ul><li>What we found useful </li></ul></ul></ul><ul><ul><ul><li>What we wish we had but don’t, e.g. </li></ul></ul></ul><ul><ul><ul><ul><li>OS thread states for all threads </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Window of CPU utilization in javacores </li></ul></ul></ul></ul><ul><li>Other applications? </li></ul><ul><ul><li>Web browsers </li></ul></ul><ul><ul><li>OS-level virtual machines (VMWare) </li></ul></ul><ul><ul><li>Hypervisors </li></ul></ul>
    56. 56. Ease of use matters for adoption <ul><li>“Zero-install” / view in browser very popular among users </li></ul><ul><ul><li>Reality is that people are lazy/busy </li></ul></ul><ul><ul><ul><li>Any barrier to entry significantly reduces likelihood of use </li></ul></ul></ul><ul><li>Cloud-based tooling </li></ul><ul><ul><li>Incremental growth of knowledge base </li></ul></ul><ul><ul><ul><li>Update rules and fix bugs as new submissions come in </li></ul></ul></ul><ul><ul><ul><li>Critical to early adoption of WAIT </li></ul></ul></ul><ul><ul><li>Enables collaboration </li></ul></ul><ul><ul><ul><li>Users pass around URLs </li></ul></ul></ul><ul><ul><li>Cross-report analysis </li></ul></ul><ul><ul><ul><li>Huge repository of data to mine </li></ul></ul></ul><ul><ul><li>Downside </li></ul></ul><ul><ul><ul><li>Requires network connection </li></ul></ul></ul><ul><ul><ul><li>Some users do have confidentiality concerns </li></ul></ul></ul>
    57. 57. Cloud and Testing <ul><li>Cloud model changes software development process </li></ul><ul><ul><li>You observe all executions of the program as they occur </li></ul></ul><ul><li>Huge advantages </li></ul><ul><ul><li>Rapid bug detection and turnaround for fixes </li></ul></ul><ul><ul><ul><li>Fix bug immediately and make changes live </li></ul></ul></ul><ul><ul><li>This agile model was key to success of WAIT </li></ul></ul><ul><li>But this creates new problems </li></ul><ul><ul><li>Common scenario </li></ul></ul><ul><ul><ul><li>Observe bug on user submission </li></ul></ul></ul><ul><ul><ul><li>Fix bug and release ASAP to avoid reoccurrence of bug </li></ul></ul></ul><ul><ul><ul><li>Discover that in our haste we broke several other things </li></ul></ul></ul>
    58. 58. The Good News: Lots of regression data <ul><li>We have all the input data ever seen </li></ul><ul><ul><li>Input data for 5000+ executions </li></ul></ul><ul><ul><li>All used for regression testing </li></ul></ul><ul><li>Problem: time </li></ul><ul><ul><li>Took several hours to run full regression on a single machine </li></ul></ul><ul><ul><li>We expect data to grow by 10x (100x?) in a few years </li></ul></ul><ul><li>Solution: parallelize </li></ul><ul><ul><li>Implemented with Hadoop using ~100 small-medium machines </li></ul></ul><ul><ul><ul><li>Full regression in 15-20 mins </li></ul></ul></ul><ul><ul><li>To do: automatic test prioritization </li></ul></ul>
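The regression runs are independent of each other, which is what makes them easy to spread over many workers. The sketch below shows that shape on a single machine with Python's `concurrent.futures` thread pool; it is a stand-in for, not a reproduction of, the Hadoop setup mentioned on the slide, and `analyze`, `run_regression`, and the case format are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(report_input):
    """Stand-in for the analysis under test (hypothetical)."""
    return report_input.upper()

def run_regression(cases, workers=8):
    """Re-run every stored input and return the inputs whose output changed.

    `cases` is a list of (input, expected_output) pairs saved from
    earlier runs.
    """
    def check(case):
        data, expected = case
        return None if analyze(data) == expected else data

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [failed for failed in pool.map(check, cases) if failed is not None]

cases = [("abc", "ABC"), ("x", "y"), ("wait", "WAIT")]
print(run_regression(cases))
```

Threads keep the example short; for CPU-bound analysis in Python a process pool or a cluster framework would be needed to get real speedup, which is the role Hadoop played for the authors.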
    59. 59. Summary <ul><li>Data-driven study of large application </li></ul><ul><ul><li>Painful but worth it. Learned a lot. </li></ul></ul><ul><ul><li>Didn’t rely on much existing tooling </li></ul></ul><ul><ul><ul><li>Mismatch for performance triage </li></ul></ul></ul><ul><ul><li>No single problem or magic fix </li></ul></ul><ul><ul><ul><li>Hot locks far from the only problem </li></ul></ul></ul><ul><ul><ul><li>Contended resources at many levels of the software stack </li></ul></ul></ul><ul><ul><ul><li>Very little was multi-core specific </li></ul></ul></ul><ul><li>WAIT Tool: Quickly identify primary performance inhibitors </li></ul><ul><ul><li>Real demand for this seemingly simple concept </li></ul></ul><ul><ul><ul><li>Scalability analysis: focus on what is not running </li></ul></ul></ul><ul><ul><li>Ease of use matters to adoption </li></ul></ul><ul><ul><ul><li>Do your best with easily available information </li></ul></ul></ul><ul><ul><ul><li>Why don’t we see more of this? </li></ul></ul></ul><ul><ul><li>Strategy that has proven quite powerful </li></ul></ul><ul><ul><ul><li>Expert system presents high-level guess at what the problem is </li></ul></ul></ul><ul><ul><ul><li>Allow drilling down to raw data as proof </li></ul></ul></ul><ul><li>WAIT available now: https://wait.researchlabs.ibm.com/ </li></ul><ul><ul><ul><li>Tool, Documentation: Demo, Examples, Manual </li></ul></ul></ul>
    60. 60. The End
