Performance Analysis of Idle Programs
Erik Altman, Matthew Arnold, Stephen Fink, Nick Mitchell, Peter Sweeney
IBM T.J. Watson Research Center
“WAIT Performance Tool”
Overview: Application-driven scalability study (performance analysis of a large enterprise app; code changes resulting in 1.4X - 5X improvement). WAIT performance tool (so easy your mother/father could use it; demo; implementation details). Lessons learned (performance tooling; the multicore controversy; are we working on the right things?)
Application-Driven Scalability Study. App-driven scalability study (2008/2009). Goal: identify scaling problems and solutions for future multicore architectures. Approach: application-driven (data-driven) exploration. Choose a real application; let the workload drive the research; identify scalability problems; restructure the application to improve scalability. Assumption: the application is already parallel. Not attempting to automatically parallelize serial code; open to adding fine-grained parallelism within a transaction. Infrastructure: two POWER5 p595 machines (64-core, 128 hw threads). Team members: Erik Altman, Matthew Arnold, Rajesh Bordewekar, Robert Delmonico, Nick Mitchell, Peter Sweeney
Application. “Zanzibar”: content management system. Multitier: J2EE (Java) application server, DB2, LDAP, client(s). Document ingestion and retrieval (data + metadata). Used by hospitals, banks, etc. Mature code: in production several years, multiple major releases; previous performance study in 2007. Plan of attack: first ensure it scales on smaller hardware (4-core / 8-core), then upgrade to the large 64-core machine; find and fix bottlenecks until it scales
Initial Result: failure on almost all fronts. Install and config took several weeks (real application, real workload, multi-tier configuration, load driver). Terrible scalability even on modest 4-way hardware: observed performance 1 doc/second, target > 1000 docs/second, app server machine < 10% CPU utilization. Existing performance tools did not prove useful: struggled even to identify the primary bottleneck we faced, let alone its underlying cause. Advice we were given: “You need a massive hard disk array for that application”; “Gigabit Ethernet? You need InfiniBand or HiperSockets”
Stop! We’ve already learned several lessons. Lesson 1: This “application-driven” research idea stinks. “We aren’t the right people to be doing this. Someone else should get this deployed so we can focus on what we’re good at: Java performance analysis.” Lesson 2: Ignore lesson 1. Despite being frustrated, we learned a lot; the whole point was to open our minds to new problems. We are an important demographic: mostly-competent non-experts. “Why is the app I just installed an order of magnitude too slow?” is a very common question. Disclaimer: if you go down this road, you will end up working on things you didn’t intend (or want?) to
OK, so let’s find some bottlenecks. Matt and Nick, see what you can find! Matt: I installed and ran Java Lock Analyzer. I don’t see any hot locks. Nick: Yeah, I did kill -3 to generate javacores and the thread stacks show we’re waiting on the database. Matt: I installed and ran tprof. Method toLowerCase() is the hottest method. Nick: Yeah, that was clear from the thread dumps too. Observation 1: Seasoned performance experts often don’t use any fancy tools; they start with simple utilities: top, ps, kill -3, oprofile, netperf. Top performance experts don’t use tools developed in research? Observation 2: The tools we found were a mismatch for “performance triage”: targeted focus (hot locks, GC analyzer, DB query analyzer, etc); how do I know which tool to use first?; once you fix one bottleneck, you start all over; high installation and usage effort
Constraints of a real-world production deployment. Instrument the application? NO! Recompile the application? NON! Deploy a fancy monitoring agent? NICHT! Analyze the source? ノー ! Install a more modern JVM? yIntagh !
Let’s see what Javacores can do Zanzibar analysis done almost entirely using Javacores Methodology used Trigger a few javacores from server under load Manually inspect  Look for frequently occurring thread stacks Whether running, or blocked Fix problem Repeat
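As a rough illustration of that methodology (not part of the original study), the sketch below approximates “look for frequently occurring thread stacks” in-process: it samples every thread’s stack a few times and counts repeats. In practice the javacores were produced externally with kill -3 and inspected by hand.

```java
import java.util.*;

// Hypothetical in-process approximation of the javacore methodology described
// above: sample all thread stacks a few times and count recurring stacks.
// Real javacores are produced externally (kill -3) and inspected by hand.
public class StackSampler {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> counts = new HashMap<>();
        for (int sample = 0; sample < 5; sample++) {          // "a few javacores"
            for (Map.Entry<Thread, StackTraceElement[]> e
                     : Thread.getAllStackTraces().entrySet()) {
                StackTraceElement[] stack = e.getValue();
                if (stack.length == 0) continue;
                String key = Arrays.toString(stack);           // whole stack as the key
                counts.merge(key, 1, Integer::sum);
            }
            Thread.sleep(1000);                                // space out the samples
        }
        // Print the most frequently observed stacks first.
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
              .limit(5)
              .forEach(entry -> System.out.println(entry.getValue() + "x  " + entry.getKey()));
    }
}
```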
No single class of bottleneck dominated Bottlenecks found Exceeded disk throughput on the database machine Exceeded disk throughput on the application server machine Application overuse of filesystem metadata operations Lock contention in application code Saturating network GC bottlenecks due to JVM implementation GC bottlenecks due to application issues Difficulties driving enough load due to bottlenecks in load generator
1) Lock Contention. Found 11 hot locks. Replaced with: thread-local storage (fine-grained data replication; good for lazily initialized, read-only data) and concurrent collections (significantly more scalable). Alternative: app server cloning (coarse-grained data replication). A sketch of both replacements follows.
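A minimal before/after sketch of the two replacements named above (hypothetical code, not the Zanzibar sources): a contended synchronized cache becomes a ConcurrentHashMap, and a lazily initialized helper that used to sit behind a global lock is replicated per thread with ThreadLocal.

```java
import java.text.SimpleDateFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical before/after illustrating the two fixes named above.
public class LockFixes {
    // Before: one global lock around a shared cache.
    // private static final Map<String, String> cache = new HashMap<>();
    // static synchronized String lookup(String k) { return cache.get(k); }

    // After (1): a concurrent collection, no global lock.
    private static final Map<String, String> cache = new ConcurrentHashMap<>();
    static String lookup(String k) { return cache.get(k); }

    // After (2): thread-local replication of a lazily initialized, effectively
    // read-only helper (SimpleDateFormat is not thread-safe, so older code
    // often shares one instance behind a lock).
    private static final ThreadLocal<SimpleDateFormat> FMT =
        ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    static String today() { return FMT.get().format(new java.util.Date()); }
}
```

Both changes remove a single JVM-wide monitor that every request otherwise has to pass through.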
2) Contended Shared Resources. Disk: database machine (heavy use of BLOBs (Binary Large OBjects), non-buffered writes); app server machine (frequent filesystem calls: open, close, exists, rename). OS/Filesystem: both of the above bottleneck even with a RAM disk. JVM: object allocation and GC; excessive temp object creation; reduced object creation rate by 3X (objects per request: 2850 → 900; see the sketch below). Network: bloated protocols
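For the GC item above, here is a hypothetical sketch (not the actual application code) of the kind of change that cuts per-request temporary objects: replacing an allocation-heavy formatting call with a reused per-thread buffer.

```java
// Hypothetical illustration (not the actual Zanzibar code) of reducing per-request
// temporary objects that feed GC pressure.
public class RequestKey {
    // Before: several short-lived objects per call (String.format allocates a
    // Formatter, an internal StringBuilder, boxed varargs, and the result).
    static String keyBefore(long docId, int version) {
        return String.format("%d:%d", docId, version);
    }

    // After: one reusable per-thread builder plus the one result String.
    private static final ThreadLocal<StringBuilder> BUF =
        ThreadLocal.withInitial(() -> new StringBuilder(32));

    static String keyAfter(long docId, int version) {
        StringBuilder sb = BUF.get();
        sb.setLength(0);                 // reuse the same backing array
        return sb.append(docId).append(':').append(version).toString();
    }
}
```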
3) Single-threaded Performance For coarse-grained bottlenecks Identify contended resource X Change code to put less pressure on X Repeat Eventually you get to finer-granularity resources It became simpler to  Give up hunting for individual contended resources Focus on single-threaded performance instead Find grossly inefficient code and fix it  Improves latency / response time Also improves  scalability If a program executes the minimal number of steps to accomplish a task it likely consumes fewer shared resources at all levels of the stack
Examples of Single-threaded Improvements Redundant computations Redundant calls to  file.exists() Excessive use of  toLowerCase() Over-general code Creating hashmap to pass 2 elements Unnecessary copies and conversion Stores same data in both  Id  and String form Converts back and forth frequently Calls  isId()  frequently String operations to find prepared statement
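Two of the items above, sketched as hypothetical Java (class and field names are made up, not taken from the application): caching a repeatedly checked file.exists() result, and keeping an id in a single canonical form instead of converting between Id and String on every use.

```java
import java.io.File;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketches of the kinds of fixes listed above.
public class SingleThreadedFixes {
    // Redundant computation: cache file.exists() for paths that are checked
    // repeatedly per request instead of hitting the filesystem every time.
    // (Only valid when the path's existence is stable for the cached period.)
    private static final Map<String, Boolean> existsCache = new ConcurrentHashMap<>();

    static boolean cachedExists(String path) {
        return existsCache.computeIfAbsent(path, p -> new File(p).exists());
    }

    // Unnecessary conversion: keep the id in one canonical form instead of
    // storing both an Id object and its String form and converting repeatedly.
    static final class DocId {
        private final long value;
        private final String text;           // computed once, not on every use
        DocId(long value) { this.value = value; this.text = Long.toString(value); }
        long value() { return value; }
        @Override public String toString() { return text; }
    }
}
```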
Performance Results 8-way x86
Results: 56-core POWER p595, scaling with -checkin (first minute). [Chart: docs/second vs. number of cores (2 to 56), series: Orig and Modified]
Results: 56-core POWER p595, without -checkin flag. [Chart: docs/sec vs. number of cores (2 to 56), series: Orig and Modified]
Zanzibar Results: Interesting Points. Significant improvements from 27 small, targeted changes. Fixing 11 Java locks left no hot locks, even in a 64-core / 128-thread system. Locks were only one of many obstacles to scalability: contended shared resources existed at many levels of the stack; JVM cloning would have helped with some, but not all; GC is what prevented scaling beyond 32 cores. Improvements were high-level Java code or config changes; nothing specific to multi-core microarchitecture. Made more single-threaded changes than expected; they accounted for roughly half the speedups (8-way, no “-checkin”). Did most of the work on 4-way and 8-way machines. Much of this was not what we expected to be working on
Overview Zanzibar scalability study Performance analysis of large enterprise app Code changes resulting in 1.4X - 5X improvement WAIT performance tool Demo Implementation details Lessons learned Performance tooling The multicore controversy Are we working on the right things?
WAIT Performance Tool OOPSLA 2010 “Performance analysis of idle programs. Quickly identify performance and scalability inhibitors”  - Altman, Arnold, Fink, Mitchell Quickly identify primary bottleneck Usable in large-scale, deployed production setting Intrusive monitoring not an option Learn from past experiences Rule based “Taught” about common causes of bottlenecks.
WAIT: Key Attributes. WAIT focuses on primary bottlenecks: gives a high-level, whole-system summary of performance inhibitors. WAIT is zero install: leverages built-in data collectors; reports results in a browser. WAIT is non-disruptive: no special flags, no restart; use in any customer or development location. WAIT is low-overhead: uses only infrequent samples of an already-running application. WAIT is simple to use: novices to experts start at a high level and drill down. WAIT does not capture sensitive user data: no Social Security numbers, credit card numbers, etc. WAIT uses a centralized knowledge base: allows rules and the knowledge base to grow over time. [Diagram: Customer A and Customer B upload to the WAIT cloud server]
What information does WAIT use? Standard O/S tools: CPU utilization (vmstat), process utilization (ps), memory stats, …
What information does WAIT use? Standard O/S tools: CPU utilization (vmstat), process utilization (ps), memory stats, … JVM dump files: most JVMs respond to SIGQUIT (kill -3) by dumping the state of the JVM to stderr or a file; IBM JVMs produce a “javacore”. Works with all IBM JVMs and Sun/Oracle JVMs
How to use WAIT. Gather some javacores (manually execute kill -3 PID, or run the WAIT data collection script). Upload the data to the server: http://wait.researchlabs.ibm.com. View the result in a browser: Firefox, Safari, Chrome, iPhone
How to use WAIT. Gather some javacores (manually execute kill -3 PID, or run the WAIT data collection script). Upload the data to the server: http://wait.researchlabs.ibm.com. View the result in a browser: Firefox, Safari, Chrome, iPhone. Free, public server; nothing to install; sample-based (low overhead); doesn’t require restarting the app
WAIT Data Collector. Specify the JVM process ID (e.g. 31136). Triggers periodic: javacores, vmstat (machine util), ps (process util). Creates a zip file to upload to the WAIT server (next slide). A sketch of the collection loop follows.
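The real collector is a script; the Java sketch below only illustrates its shape under stated assumptions (the exact commands, output files, sample counts, and intervals are guesses): ask the target JVM for javacores with kill -3 and record vmstat/ps output alongside, then zip and upload.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of what the data collector does on a Unix-like system;
// the real WAIT collector is a script, and the details here are assumptions.
public class WaitCollectorSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        String pid = args[0];                       // JVM process id, e.g. 31136
        for (int i = 0; i < 10; i++) {              // a handful of samples
            run("kill", "-3", pid);                 // ask the JVM for a javacore
            run("sh", "-c", "vmstat 1 2 >> vmstat.out");                    // machine utilization
            run("sh", "-c", "ps -p " + pid + " -o pid,pcpu,rss >> ps.out"); // process utilization
            TimeUnit.SECONDS.sleep(30);             // spacing between samples
        }
        // The collected javacores plus vmstat/ps output are then zipped and
        // uploaded to the WAIT server.
    }

    private static void run(String... cmd) throws IOException, InterruptedException {
        new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```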
Upload Java Cores to WAIT Website
What is the   CPU doing? What Java work is running? What Java work cannot run? View WAIT Report in a Browser What is memory  consumption? Not directly available from profiling tools
WAIT Report: What is the main cause of delay? Drill down by clicking on a legend item: where are those delays coming from in the code? Example rule: if socketRead is at the top of the stack AND JDBC methods are lower on the stack → getting data from the database
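A hedged sketch of how that one example rule could be encoded; the actual WAIT rule format and method lists are not shown in the talk, and the JDBC package prefixes below are illustrative.

```java
// Hypothetical encoding of the example rule above (not WAIT's rule syntax).
public class DatabaseReadRule {
    static boolean isDatabaseRead(StackTraceElement[] stack) {
        if (stack.length == 0) return false;
        // socketRead (a blocking native read) at the top of the stack...
        boolean socketReadOnTop = stack[0].getMethodName().startsWith("socketRead");
        // ...with JDBC driver frames somewhere below it.
        boolean jdbcBelow = false;
        for (int i = 1; i < stack.length; i++) {
            String cls = stack[i].getClassName();
            if (cls.startsWith("com.ibm.db2.jcc.") || cls.contains(".jdbc.")) {
                jdbcBelow = true;
                break;
            }
        }
        return socketReadOnTop && jdbcBelow;   // "getting data from the database"
    }
}
```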
Filesystem Bottleneck
Z/OS:  Lock Contention
Lock Contention: Waiting vs Owning
Deadlock
Some Frameworks Supported by WAIT Rules: DB2, JDBC, app servers (WebSphere, WebLogic, JBoss), WebSphere Commerce, Portal, MQ, Oracle, RMI, Apache Commons, Tomcat. No finger pointing
Example Report:  Memory Leak Disclaimer:   Appearance and function of any offering may differ from this depiction.
Example Report:  Memory Analysis Disclaimer:   Appearance and function of any offering may differ from this depiction.
WAIT Implementation Details
Looks Simple! In some ways it is.  In others, it’s not.  Some information presented is nontrivial to compute
Looks Simple! In some ways it is. In others, it’s not: some information presented is nontrivial to compute. Example: is a thread running or sleeping? Difficult to determine given the input data: thread states reported by the JVM are useless because the JVM stops all threads before writing out thread stacks
Looks Simple! In some ways it is. In others, it’s not: some information presented is nontrivial to compute. Example: is a thread running or sleeping? Difficult to determine given the input data: thread states reported by the JVM are useless because the JVM stops all threads before writing out thread stacks. What is the “correct” thread state: Java language level, JVM level, or O/S level?
“WAIT State” Hierarchical abstraction of execution state Common cases of making forward progress (or lack thereof) Top level: Runnable vs Waiting
WAIT State: RUNNABLE
WAIT State:  WAITING
“Category” Hierarchical abstraction of code activity being performed What is the code doing?
Analysis engine. Uses rule-based knowledge to map: thread stack → <Category, WAIT State>. Category analysis: simple pattern matching based on known methods, e.g. java/net/SocketInputStream.socketRead0 → Network; com/mysql/jdbc/ConnectionImpl.commit → DB Commit; com/sun/jndi/ldap/Connection.readReply → LDAP. Algorithm: label every stack frame; if no rule applies, use the package name; the stack is assigned the label of the highest-priority rule
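A minimal sketch of that labeling algorithm using the three example rules from the slide; the priority scheme and the package fallback details here are assumptions, not the real WAIT knowledge base.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative version of the category-analysis step described above.
public class CategoryAnalysis {
    // Method-prefix rules in priority order (highest priority first).
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("java/net/SocketInputStream.socketRead", "Network");
        RULES.put("com/mysql/jdbc/ConnectionImpl.commit",  "DB Commit");
        RULES.put("com/sun/jndi/ldap/Connection.readReply", "LDAP");
    }

    // frames are "package/Class.method" strings, innermost frame first.
    static String categorize(String[] frames) {
        if (frames.length == 0) return "Unknown";
        int bestPriority = Integer.MAX_VALUE;
        String best = null;
        for (String frame : frames) {                 // label every stack frame
            int priority = 0;
            for (Map.Entry<String, String> rule : RULES.entrySet()) {
                if (frame.startsWith(rule.getKey()) && priority < bestPriority) {
                    bestPriority = priority;          // keep the highest-priority match
                    best = rule.getValue();
                }
                priority++;
            }
        }
        if (best != null) return best;
        // No rule applied: fall back to the package name of the top frame.
        String top = frames[0];
        int slash = top.lastIndexOf('/');
        return slash > 0 ? top.substring(0, slash) : top;
    }
}
```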
WAIT State Analysis. Uses several inputs to make the best guess: the category, the lock graph, and known methods (spinlock implementations, native routines such as blocking reads). The algorithm acts as a sieve, looking for states that can be assigned with the most certainty. Unknown native methods are problematic: we cannot be certain of the execution state, so we assign “native unknown” and combine it with CPU utilization to make a good guess
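A toy version of that sieve, under heavy assumptions: the real analysis also consults the lock graph and per-interval CPU utilization, and the ordering and method lists below are illustrative only.

```java
// Toy sieve for assigning a WAIT state to a stack, in the spirit of the slide above.
enum WaitState { RUNNABLE, WAITING, NATIVE_UNKNOWN }

final class WaitStateSieve {
    static WaitState classify(StackTraceElement[] stack, boolean ownsContendedLock,
                              double machineCpuUtil) {
        if (stack.length == 0) return WaitState.NATIVE_UNKNOWN;
        String top = stack[0].getClassName() + "." + stack[0].getMethodName();

        // 1. Known blocking natives: definitely waiting.
        if (top.endsWith("Object.wait") || top.endsWith("Thread.sleep")
                || top.contains("socketRead")) {
            return WaitState.WAITING;
        }
        // 2. Known spin-lock implementations burn CPU, so treat them as runnable.
        if (top.contains("SpinLock")) return WaitState.RUNNABLE;

        // 3. A thread holding a contended lock is usually trying to make progress.
        if (ownsContendedLock) return WaitState.RUNNABLE;

        // 4. Unknown native frame: can't be sure; lean on machine CPU utilization.
        if (stack[0].isNativeMethod()) {
            return machineCpuUtil > 0.8 ? WaitState.RUNNABLE : WaitState.NATIVE_UNKNOWN;
        }
        return WaitState.RUNNABLE;   // plain Java code with no blocking evidence
    }
}
```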
Rule Statistics (as of Mar 2010). Number of rules by category: Database 72, Administrative 59, Disk/Network I/O 46, Client Communication 41, Waiting for Work 30, Marshalling 30, JEE 22, Classloader 13, Logging 12, LDAP 6. Products covered include DB2, MySQL, Oracle, Apache, SQL Server. Rule coverage: 77% of stacks matched a category rule, 23% fell back to the package name. Analyzed 1,391,033 thread stacks across 830 reports
Internal WAIT Usage Statistics (as of June 2011) Types of users Crit sit teams L2/L3 Support Developers Testers Performance analysts 600+ Unique users  5200+ Total reports
Who is using WAIT?
Tool in Software Lifecycle. The tool applies everywhere in the cycle; key: lightweight and simple. Build: use latest compiler; turn on optimization; enable parallelization*. Analyze: static code analysis; find “hot spots”; identify performance bottlenecks; identify scalability bottlenecks*. Code & Tune: refine compiler options/directives; use optimized libraries; recode part of the application; introduce/increase parallelism*. Test & Debug: run the application; check correctness; check concurrency issues*. Monitor: measure performance; collect execution stats; validate performance gains; gather stats on scalability*. [Diagram shows entry points, an exit point, and the performance-tuning loop]
Testimonial: Health Care System Crit-Sit, April-May 2010. The tool team worked intensively on a custom workload (batch, Java, database) for a major health care provider. Approximately 10x gain during that period (metric: transactions per hour). Others continued to use the tool intensively through August, when the performance goal was achieved: 400+ WAIT reports over 4 months as part of the standard performance testing script. Result: 60x overall improvement from 30 small, localized changes. “Tell your boss you have a tool I would buy.” “The [other] guys are still trying to figure out how to get the tracing installed on all the tiers.” (Crit-Sit = critical situation / problem)
Integration into Development and Test Process. Integration of the tool into the system test process → big gains. Approach: automate tool data collection during tests; automate uploading of data to the analysis and display server; show reports to testers (understand performance, identify bugs, track application health); forward the report URL to developers to speed defect resolution.
Overview Zanzibar scalability study Performance analysis of large enterprise app Code changes resulting in 1.4X - 5X improvement WAIT performance tool Demo Implementation details Lessons learned Performance tooling The multicore controversy Are we working on the right things?
Is the WAIT tool “Research”? In some aspects, NO: lots of engineering; focus on portability and robustness; significant UI work. In many ways, YES. Philosophy: do more with less, the opposite of the predominant mindset in the research community (gather more data → more precise result). It would be nice to see more of the less-is-more approach; it may be harder to publish, but concrete research statements are still possible: “We achieved 90% of the accuracy of technique X without having to Y”
+1 for managed languages. User-triggered snapshots (javacores) are beneficial; a significant advantage of managed languages? Keep this in mind when designing future languages and runtimes. We should write down what we found useful and what we wish we had but don’t, e.g. OS thread states for all threads, a window of CPU utilization in javacores. Other applications? Web browsers, OS-level virtual machines (VMware), hypervisors
Ease of use matters for adoption “Zero-install” / view in browser very popular among users Reality is that people are lazy/busy Any barrier to entry significantly reduces likelihood of use Cloud-based tooling Incremental growth of knowledge base Update rules and fix bugs as new submissions come in Critical to early adoption of WAIT Enables collaboration Users pass around URLs Cross-report analysis Huge repository of data to mine Downside Requires network connection Some users do have confidentiality concerns
Cloud and Testing Cloud model changes software development process You observe all executions of the program as they occur Huge advantages Rapid bug detection and turnaround for fixes Fix bug immediately and make changes live This agile model was key to success of WAIT But this creates new problems Common scenario Observe bug on user submission Fix bug and release ASAP to avoid reoccurrence of bug Discover that in our haste we broke several other things
The Good News: Lots of regression data. We have all the input data ever seen: input data for 5000+ executions, all used for regression testing. Problem: time. It took several hours to run the full regression on a single machine, and we expect the data to grow by 10x (100x?) in a few years. Solution: parallelize. Implemented with Hadoop using ~100 small-medium machines; full regression in 15-20 mins. To do: automatic test prioritization
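A sketch of what the Hadoop regression job might look like; the Analyzer and GoldenReports helpers and the key/value layout are invented for illustration, and the real WAIT regression harness is not shown in the talk.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: each map task replays one archived submission through the
// current analysis engine and emits PASS/FAIL against the stored golden report.
public class RegressionMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text submissionId, Text archivedJavacores, Context context)
            throws IOException, InterruptedException {
        String actual = Analyzer.analyze(archivedJavacores.toString());   // current engine
        String expected = GoldenReports.lookup(submissionId.toString());  // stored output
        String verdict = actual.equals(expected) ? "PASS" : "FAIL";
        context.write(submissionId, new Text(verdict));
    }

    // Stand-ins for the real analysis engine and golden-report store (assumptions).
    static final class Analyzer {
        static String analyze(String javacores) { return "report-" + javacores.hashCode(); }
    }
    static final class GoldenReports {
        static String lookup(String id) { return "report-" + id.hashCode(); }
    }
}
```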
Summary. Data-driven study of a large application: painful but worth it; learned a lot. Didn’t rely on much existing tooling: a mismatch for performance triage. No single problem or magic fix: hot locks were far from the only problem; contended resources at many levels of the software stack; very little was multi-core specific. WAIT Tool: quickly identify primary performance inhibitors; real demand for this seemingly simple concept; scalability analysis focuses on what is not running. Ease of use matters to adoption: do your best with easily available information; why don’t we see more of this? A strategy that has proven quite powerful: an expert system presents a high-level guess at what the problem is and allows drilling down to the raw data as proof. WAIT available now: https://wait.researchlabs.ibm.com/ (tool, documentation, demo, examples, manual)
The End
