Performance Analysis of Idle Programs

Matthew Arnold's ECOOP 2011 Summer School talk.

Presentation Transcript

  • Performance Analysis of Idle Programs. Erik Altman, Matthew Arnold, Stephen Fink, Nick Mitchell, Peter Sweeney. IBM T.J. Watson Research Center. “WAIT Performance Tool”
  • Overview
    • Application-driven scalability study
      • Performance analysis of large enterprise app
      • Code changes resulting in 1.4X - 5X improvement
    • WAIT performance tool
      • So easy your mother/father could use it
      • Demo
      • Implementation details
    • Lessons learned
      • Performance tooling
      • The multicore controversy
        • Are we working on the right things?
  • Application-Driven Scalability Study
    • App-driven scalability study (2008/2009)
      • Goal: Identify scaling problems and solutions for future multicore architectures
      • Approach: application-driven (data-driven) exploration
        • Choose a real application
        • Let workload drive the research
        • Identify scalability problems
        • Restructure application to improve scalability
    • Assumption: application is already parallel
      • Not attempting to automatically parallelize serial code
        • Open to adding fine-grained parallelism within transaction
    • Infrastructure
      • Two POWER5 p595 machines (64-core, 128 hw threads)
    • Team members
        • Erik Altman, Matthew Arnold, Rajesh Bordewekar, Robert Delmonico, Nick Mitchell, Peter Sweeney
  • Application
    • “Zanzibar”: content management system
      • Multitier: J2EE (Java) application server, DB2, LDAP, client(s)
      • Document ingestion and retrieval
        • Used by hospitals, banks, etc.
        • Data + metadata
      • Mature code
        • In production several years, multiple major releases
        • Previous performance study in 2007
    • Plan of attack
      • First ensure it scales on smaller hardware (4-core / 8-core)
      • Then upgrade to large 64-core machine
        • Find and fix bottlenecks until it scales
  • Initial Result: failure on almost all fronts
    • Install and config took several weeks
      • Real application, real workload, multi-tier configuration
      • Load driver
    • Terrible scalability even on modest 4-way hardware
      • Observed performance: 1 doc/second
        • Target > 1000 docs/second
      • App server machine < 10% CPU utilization
    • Existing performance tools did not prove useful
      • Struggled even to identify the primary bottleneck we faced
        • Let alone its underlying cause
    • Advice we were given
      • You need a massive hard disk array for that application
      • Gigabit Ethernet? You need InfiniBand or HiperSockets
  • Stop! We’ve already learned several lessons
    • Lesson 1: This “application-driven” research idea stinks
      • “We aren’t the right people to be doing this. Someone else should get this deployed so we can focus on what we’re good at: Java performance analysis.”
    • Lesson 2: Ignore lesson 1
      • Despite being frustrated, we learned a lot
        • Whole point was to open our mind to new problems
      • We are an important demographic
        • Mostly-competent non-experts
        • “Why is the app I just installed an order of magnitude too slow?”
          • Very common question
    • Disclaimer: if you go down this road
      • You will end up working on things you didn’t intend (or want?) to
  • OK so let’s find some bottlenecks
    • Matt and Nick, see what you can find!
      • Matt: I installed and ran Java Lock Analyzer. I don’t see any hot locks
      • Nick: Yeah, I did kill -3 to generate javacores and the thread stacks show we’re waiting on the database
      • Matt: I installed and ran tprof. Method toLowerCase() is the hottest method
      • Nick: Yeah, that was clear from the thread dumps too
    • Observation 1
      • Seasoned performance experts often don’t use any fancy tools
        • Start with simple utilities: top, ps, kill -3, oprofile, netperf
        • Top performance experts don’t use tools developed in research?
    • Observation 2
      • The tools we found were a mismatch for “performance triage”
      • Targeted focus: Hot locks, GC analyzer, DB query analyzer, etc
        • How do I know which tool to use first?
        • Once you fix one bottleneck, you start all over
      • High installation and usage effort
  • Constraints of a real-world production deployment
      • Instrument the application? NO!
      • Recompile the application? NON!
      • Deploy a fancy monitoring agent? NICHT!
      • Analyze the source? ノー!
      • Install a more modern JVM? yIntagh!
  • Let’s see what Javacores can do
    • Zanzibar analysis done almost entirely using Javacores
    • Methodology used (sketched after this slide)
      • Trigger a few javacores from server under load
        • Manually inspect
          • Look for frequently occurring thread stacks
          • Whether running, or blocked
        • Fix problem
        • Repeat
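    A minimal sketch of this manual inspection step, assuming IBM-format javacores collected into one directory (the 4XESTACKTRACE line tag and file layout are assumptions about the format): tally stack frames across all the dumps so frequently occurring frames, running or blocked, stand out.

      import java.io.IOException;
      import java.nio.file.*;
      import java.util.*;
      import java.util.stream.Stream;

      // Sketch: count how often each stack frame appears across javacore files.
      public class JavacoreFrameCount {
          public static void main(String[] args) throws IOException {
              Map<String, Integer> counts = new HashMap<>();
              try (Stream<Path> files = Files.list(Paths.get(args[0]))) {
                  for (Path p : (Iterable<Path>) files::iterator) {
                      for (String line : Files.readAllLines(p)) {
                          // In IBM javacores each stack frame line starts with 4XESTACKTRACE
                          if (line.startsWith("4XESTACKTRACE")) {
                              counts.merge(line.trim(), 1, Integer::sum);
                          }
                      }
                  }
              }
              counts.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(20)   // top 20 most frequent frames
                    .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
          }
      }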
  • No single class of bottleneck dominated
    • Bottlenecks found
      • Exceeded disk throughput on the database machine
      • Exceeded disk throughput on the application server machine
      • Application overuse of filesystem metadata operations
      • Lock contention in application code
      • Saturating network
      • GC bottlenecks due to JVM implementation
      • GC bottlenecks due to application issues
      • Difficulties driving enough load due to bottlenecks in load generator
  • 1) Lock Contention
    • Found 11 hot locks
    • Replaced with (see the sketch after this slide)
      • Thread-local storage
        • Fine-grained data replication
        • Good for lazily initialized, read-only data
      • Concurrent collections
        • Significantly more scalable
    • Alternative: App server cloning
      • Coarse grained data replication
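    A hypothetical before/after sketch of the two fixes above; class and field names are invented, and SimpleDateFormat stands in for the kind of lazily initialized, non-thread-safe helper that otherwise forces a global lock.

      import java.text.SimpleDateFormat;
      import java.util.Date;
      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      public class LockFixes {
          // Before: a single synchronized map serialized every lookup, e.g.
          //   Collections.synchronizedMap(new HashMap<>())
          // Fix 1: a concurrent collection -- significantly more scalable.
          private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

          // Fix 2: thread-local storage -- fine-grained data replication for
          // lazily initialized data that each thread only reads after setup.
          private static final ThreadLocal<SimpleDateFormat> FORMAT =
              ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

          static String format(Date d) {
              return FORMAT.get().format(d);  // one instance per thread, no contention
          }
      }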
  • 2) Contended Shared Resources
    • Disk
      • Database machine
        • Heavy use of BLOBs (Binary Large OBjects)
          • Non-buffered writes
      • App server machine
        • Frequent filesystem calls (open, close, exists, rename)
    • OS/Filesystem
      • Both of above bottleneck even with RAM disk
    • JVM
      • Object allocation and GC
        • Excessive temp object creation (see the sketch after this slide)
        • Reduced object creation rate by 3X
          • Objects per request: 2850 → 900
    • Network
      • Bloated protocols
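    A hypothetical sketch of the flavor of change behind the 2850 → 900 objects-per-request reduction (names invented): reuse a per-thread buffer instead of allocating fresh temporaries on every request.

      public class RequestBuffer {
          // Before: each request allocated and discarded its own StringBuilder,
          // inflating the allocation rate and GC pressure.
          // After: reuse one per-thread buffer across requests.
          private static final ThreadLocal<StringBuilder> BUF =
              ThreadLocal.withInitial(() -> new StringBuilder(1024));

          static String render(String id, String payload) {
              StringBuilder sb = BUF.get();
              sb.setLength(0);  // reset instead of reallocating
              return sb.append(id).append(':').append(payload).toString();
          }
      }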
  • 3) Single-threaded Performance
    • For coarse-grained bottlenecks
      • Identify contended resource X
      • Change code to put less pressure on X
      • Repeat
    • Eventually you get to finer-granularity resources
      • It became simpler to
        • Give up hunting for individual contended resources
        • Focus on single-threaded performance instead
      • Find grossly inefficient code and fix it
        • Improves latency / response time
        • Also improves scalability
          • If a program executes the minimal number of steps to accomplish a task, it likely consumes fewer shared resources at all levels of the stack
  • Examples of Single-threaded Improvements (sketched after this slide)
    • Redundant computations
      • Redundant calls to file.exists()
      • Excessive use of toLowerCase()
    • Over-general code
      • Creating a HashMap to pass 2 elements
    • Unnecessary copies and conversion
      • Stores same data in both Id and String form
        • Converts back and forth frequently
        • Calls isId() frequently
      • String operations to find prepared statement
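    Hypothetical before/after sketches of these kinds of fixes; all names are invented for illustration.

      import java.io.File;

      public class SingleThreadedFixes {
          // Redundant computation: ask the filesystem once, reuse the answer.
          static boolean process(File f) {
              boolean exists = f.exists();  // was: f.exists() re-called several times
              if (!exists) return false;
              // ... downstream code reuses 'exists' rather than re-checking
              return true;
          }

          // Excessive toLowerCase(): normalize once at the boundary, then
          // compare the stored normalized form everywhere else.
          static String normalizeKey(String key) {
              return key.toLowerCase();
          }

          // Over-general code: a small dedicated class instead of creating a
          // HashMap just to pass 2 elements.
          static final class IdName {   // was: Map<String,String> with keys "id", "name"
              final String id, name;
              IdName(String id, String name) { this.id = id; this.name = name; }
          }
      }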
  • Performance Results
    • [chart: throughput on 8-way x86]
  • Results: 56-core POWER p595, scaling with -checkin (first minute) [chart: Docs/second vs. num cores (2, 4, 8, 16, 32, 56), Orig vs. Modified]
  • Results: 56-core POWER p595, without -checkin flag [chart: Docs/sec vs. num cores (2, 4, 8, 16, 32, 56), Orig vs. Modified]
  • Zanzibar Results: Interesting Points
      • Significant improvements from 27 small, targeted changes
      • Fixing 11 Java locks left no hot locks
        • Even in a 64-core / 128-thread system
      • Locks were only one of many problems to scalability
        • Contended shared resources at many levels of the stack
        • JVM cloning would have helped with some, but not all
      • GC is what prevented scaling beyond 32 cores
      • Improvements were high-level Java code or config changes
        • Nothing specific to multi-core microarchitecture
      • Made more single-threaded changes than expected
        • Accounted for roughly half the speedups (8-way, no “-checkin”)
      • Did most of the work on 4-way and 8-way machines
    Much of this was not what we expected to be working on
  • Overview
    • Zanzibar scalability study
      • Performance analysis of large enterprise app
      • Code changes resulting in 1.4X - 5X improvement
    • WAIT performance tool
      • Demo
      • Implementation details
    • Lessons learned
      • Performance tooling
      • The multicore controversy
        • Are we working on the right things?
  • WAIT Performance Tool
    • OOPSLA 2010: “Performance analysis of idle programs” (Altman, Arnold, Fink, Mitchell). Quickly identify performance and scalability inhibitors
    • Quickly identify primary bottleneck
    • Usable in large-scale, deployed production setting
      • Intrusive monitoring not an option
    • Learn from past experiences
      • Rule based
      • “Taught” about common causes of bottlenecks.
  • WAIT: Key Attributes
    • WAIT focuses on primary bottlenecks
      • Gives a high-level, whole-system summary of performance inhibitors
    • WAIT is zero install
      • Leverages built-in data collectors
      • Reports results in a browser
    • WAIT is non-disruptive
      • No special flags, no restart
      • Use in any customer or development location
    • WAIT is low-overhead
      • Uses only infrequent samples of an already-running application
    • WAIT is simple to use
      • Novices to experts: Start at high level and drill down
    • WAIT does not capture sensitive user data
      • No Social Security numbers, credit card numbers, etc
    • WAIT uses centralized knowledge base
      • Allows rules and knowledge base to grow over time
    [diagram: Customer A and Customer B upload to the WAIT cloud server]
  • What information does WAIT use?
    • Standard O/S tools
      • CPU utilization: (vmstat)
      • Process utilization: (ps)
      • Memory stats: …
    • JVM dump files
      • Most JVMs respond to SIGQUIT (kill -3)
        • Dump state of JVM to stderr or file
        • IBM JVMs produce “javacore”
    Works with all IBM JVMs and Sun/Oracle JVMs
  • How to use WAIT
    • Gather some javacores
      • Manually execute kill -3 PID
      • Run WAIT data collection script
    • Upload data to server
      • http://wait.researchlabs.ibm.com
    • View result in browser
    Works in Firefox, Safari, Chrome, and on iPhone. Free, public server; nothing to install; sample-based (low overhead); doesn’t require restarting the app.
  • WAIT Data Collector (sketched after this slide)
    • Specify JVM process ID (e.g., 31136)
    • Triggers periodic:
      • javacores
      • vmstat (machine util)
      • ps (process util)
    • Creates zip file
      • Upload to WAIT server
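    A minimal sketch of what such a collector does, assuming a Linux/AIX-style system where kill -3 asks the JVM to write a javacore; the sample count, interval, and output file names are invented.

      import java.io.File;
      import java.io.IOException;

      public class CollectorSketch {
          public static void main(String[] args) throws IOException, InterruptedException {
              String pid = args[0];  // JVM process ID, e.g. 31136
              for (int i = 0; i < 4; i++) {
                  run(new String[]{"kill", "-3", pid}, null);                    // javacore
                  run(new String[]{"vmstat", "1", "2"}, "vmstat-" + i + ".txt"); // machine util
                  run(new String[]{"ps", "-p", pid, "-o", "pid,pcpu,pmem,nlwp"},
                      "ps-" + i + ".txt");                                       // process util
                  Thread.sleep(30_000);  // space the samples out
              }
              // The real collector then zips the javacores and snapshots for upload.
          }

          static void run(String[] cmd, String outFile)
                  throws IOException, InterruptedException {
              ProcessBuilder pb = new ProcessBuilder(cmd);
              if (outFile != null) pb.redirectOutput(new File(outFile));
              pb.start().waitFor();
          }
      }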
  • Upload Java Cores to WAIT Website
  • View WAIT Report in a Browser [screenshot with callouts: What is the CPU doing? What Java work is running? What Java work cannot run? What is memory consumption? Not directly available from profiling tools]
  • WAIT Report: What is the main cause of delay? [callouts: drill down by clicking on a legend item; where are those delays coming from in the code?]
    • Example Rule (sketched after this slide):
      • If socketRead at top of stack AND
      • If JDBC methods lower on stack
    • → Getting data from database
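    A minimal sketch of that rule, assuming a stack is given as a list of frame strings with the innermost frame first; the predicate shape is invented.

      import java.util.List;

      public class DatabaseReadRule {
          // socketRead on top + JDBC below => thread is getting data from the database
          static boolean gettingDataFromDatabase(List<String> frames) {
              if (frames.isEmpty()) return false;
              boolean socketReadOnTop =
                  frames.get(0).contains("SocketInputStream.socketRead");
              boolean jdbcBelow =
                  frames.stream().skip(1).anyMatch(f -> f.toLowerCase().contains("jdbc"));
              return socketReadOnTop && jdbcBelow;
          }
      }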
  • Filesystem Bottleneck
  • z/OS: Lock Contention
  • Lock Contention: Waiting vs Owning
  • Deadlock
  • Some Frameworks Supported by WAIT Rules (no finger pointing)
    • DB2
    • JDBC
    • App Server
      • WebSphere, WebLogic, JBoss
    • WebSphere Commerce
    • Portal
    • MQ
    • Oracle
    • RMI
    • Apache Commons
    • Tomcat
  • Example Report: Memory Leak [screenshot] (Disclaimer: Appearance and function of any offering may differ from this depiction.)
  • Example Report: Memory Analysis [screenshot] (Disclaimer: Appearance and function of any offering may differ from this depiction.)
  • WAIT Implementation Details
  • Looks Simple!
    • In some ways it is. In others, it’s not.
      • Some information presented is nontrivial to compute
    • Example: is thread running or sleeping?
      • Difficult to determine given the input data
      • Thread states reported by JVM are useless
        • JVM stops all threads before writing out thread stacks
      • What is the “correct” thread state?
        • Java language level
        • JVM level
        • O/S level
  • “WAIT State”
    • Hierarchical abstraction of execution state
      • Common cases of making forward progress (or lack thereof)
    • Top level: Runnable vs Waiting
  • WAIT State: RUNNABLE
  • WAIT State: WAITING
  • “Category”
    • Hierarchical abstraction of code activity being performed
      • What is the code doing?
  • Analysis engine
    • Use rule-based knowledge to map:
      • Thread stack → <Category, WAIT State>
    • Category analysis
      • Simple pattern matching based on known methods
        • java/net/SocketInputStream.socketRead0 → Network
        • com/mysql/jdbc/ConnectionImpl.commit → DB Commit
        • com/sun/jndi/ldap/Connection.readReply → LDAP
      • Algorithm (sketched after this slide)
        • Label every stack frame
          • If no rule applies, use package name
        • Stack assigned label of highest-priority rule
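    A sketch of this labeling algorithm under assumed shapes for the rules; the three example rules above are used, and the priorities are invented.

      import java.util.List;

      public class CategoryAnalysis {
          record Rule(String methodPrefix, String category, int priority) {}

          static final List<Rule> RULES = List.of(
              new Rule("java/net/SocketInputStream.socketRead", "Network", 10),
              new Rule("com/mysql/jdbc/ConnectionImpl.commit", "DB Commit", 20),
              new Rule("com/sun/jndi/ldap/Connection.readReply", "LDAP", 20));

          static String categorize(List<String> frames) {
              if (frames.isEmpty()) return "Unknown";
              Rule best = null;
              for (String frame : frames) {          // label every stack frame
                  for (Rule r : RULES) {
                      if (frame.startsWith(r.methodPrefix())
                              && (best == null || r.priority() > best.priority())) {
                          best = r;                  // stack takes highest-priority label
                      }
                  }
              }
              if (best != null) return best.category();
              // No rule applied: fall back to the package name of the top frame.
              String top = frames.get(0);
              int slash = top.lastIndexOf('/');
              return slash > 0 ? top.substring(0, slash) : top;
          }
      }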
  • WAIT State Analysis
    • Uses several inputs to make best guess
      • Category
      • Lock graph
      • Known methods
        • Spinlock implementations
        • Native routines (blocking read, etc)
      • Algorithm acts as a sieve (sketched after this slide)
        • Looking for states that can be assigned with most certainty
    • Unknown native methods are problematic
      • Cannot be certain of execution state
      • Assign “native unknown”
        • Combine with CPU utilization for a good guess
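    A sketch of the sieve idea with an invented, simplified state set; the real analysis works over the full hierarchical WAIT states and a richer lock graph.

      import java.util.List;

      public class WaitStateSieve {
          enum State { RUNNABLE, WAITING, NATIVE_UNKNOWN }

          static State classify(List<String> frames, boolean blockedInLockGraph) {
              if (frames.isEmpty()) return State.NATIVE_UNKNOWN;
              String top = frames.get(0);
              if (blockedInLockGraph) return State.WAITING;   // lock graph: most certain
              if (top.contains("socketRead") || top.contains("FileInputStream.read"))
                  return State.WAITING;                       // known blocking natives
              if (top.contains("SpinLock"))
                  return State.RUNNABLE;                      // known spinlocks burn CPU
              if (top.endsWith("(Native Method)"))
                  return State.NATIVE_UNKNOWN;                // combine with CPU util later
              return State.RUNNABLE;
          }
      }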
  • Rule Statistics (as of Mar 2010)
    • Number of rules per category:
      • LDAP: 6
      • Logging: 12
      • Classloader: 13
      • JEE: 22
      • Marshalling: 30
      • Waiting for Work: 30
      • Disk, Network I/O: 46
      • Client Communication: 41
      • Administrative: 59
      • Database: 72 (DB2, MySQL, Oracle, Apache, SQL Server)
    • Rule coverage: 77% category rule, 23% package fallback
    • Across 1,391,033 thread stacks from 830 reports
  • Internal WAIT Usage Statistics (as of June 2011)
      • Types of users
        • Crit sit teams
        • L2/L3 Support
        • Developers
        • Testers
        • Performance analysts
    600+ unique users, 5200+ total reports
  • Who is using WAIT?
  • Tool in Software Lifecycle [diagram: performance tuning cycle with entry and exit points. The tool applies everywhere in the cycle. Key: lightweight and simple]
    • Build: use latest compiler; turn on optimization; enable parallelization*
    • Analyze: static code analysis; find “hot spots”; identify performance bottlenecks; identify scalability bottlenecks*
    • Code & Tune: refine compiler options/directives; use optimized libraries; recode part of application; introduce/increase parallelism*
    • Test & Debug: run application; check correctness; check concurrency issues*
    • Monitor: measure performance; collect execution stats; validate performance gains; gather stats on scalability*
  • Testimonial: Health Care System Crit-Sit
    • April-May 2010: Tool team worked intensively on custom workload
      • Batch, Java, Database
      • Major health care provider
    • Approximately 10x gain during that period
      • Metric: Transactions per hour
    • Others continued to use the tool intensively through August, when the performance goal was achieved.
      • 400+ WAIT reports over 4 months as part of standard performance testing script
      • Result: 60x overall improvement
        • 30 small, localized changes
    “Tell your boss you have a tool I would buy.” “The [other] guys are still trying to figure out how to get the tracing installed on all the tiers.” (Crit-Sit = critical situation / problem)
  • Integration into Development and Test Process
    • Integration of tool into system test process → Big Gains
    • Approach:
      • Automate tool data collection during tests
      • Automate uploading of data to analysis and display server
      • Show reports to testers:
        • Understand performance
        • Identify bugs
        • Track application health
        • Forward report URL to developers to speed defect resolution.
  • Overview
    • Zanzibar scalability study
      • Performance analysis of large enterprise app
      • Code changes resulting in 1.4X - 5X improvement
    • WAIT performance tool
      • Demo
      • Implementation details
    • Lessons learned
      • Performance tooling
      • The multicore controversy
        • Are we working on the right things?
  • Is the WAIT tool “Research”?
    • In some aspects, NO
      • Lots of engineering
      • Focus on portability and robustness
      • Significant UI work
    • In many ways, YES
      • Philosophy: do more with less
        • Opposite of predominant mindset in research community
          • Gather more data → more precise result
      • Would be nice to see more of the less-is-more approach
        • May be harder to publish
        • Concrete research statements still possible
        • “We achieved 90% of the accuracy of technique X without having to Y”
  • +1 for managed languages
    • User-triggered snapshots (Javacores) beneficial
      • Significant advantage of managed languages?
        • Keep in mind when designing future languages and runtimes
      • We should write down
        • What we found useful
        • What we wish we had but don’t, e.g.
          • OS thread states for all threads
          • Window of CPU utilization in javacores
    • Other applications?
      • Web browsers
      • OS-level virtual machines (VMWare)
      • Hypervisors
  • Ease of use matters for adoption
    • “Zero-install” / view in browser very popular among users
      • Reality is that people are lazy/busy
        • Any barrier to entry significantly reduces likelihood of use
    • Cloud-based tooling
      • Incremental growth of knowledge base
        • Update rules and fix bugs as new submissions come in
        • Critical to early adoption of WAIT
      • Enables collaboration
        • Users pass around URLs
      • Cross-report analysis
        • Huge repository of data to mine
      • Downside
        • Requires network connection
        • Some users do have confidentiality concerns
  • Cloud and Testing
    • Cloud model changes software development process
      • You observe all executions of the program as they occur
    • Huge advantages
      • Rapid bug detection and turnaround for fixes
        • Fix bug immediately and make changes live
      • This agile model was key to success of WAIT
    • But this creates new problems
      • Common scenario
        • Observe bug on user submission
        • Fix bug and release ASAP to avoid recurrence of the bug
        • Discover that in our haste we broke several other things
  • The Good News: Lots of regression data
    • We have all the input data ever seen
      • Input data for 5000+ executions
      • All used for regression testing
    • Problem: time
      • Took several hours to run full regression on a single machine
      • We expect data to grow by 10x (100x?) in a few years
    • Solution: parallelize
      • Implemented with Hadoop using ~100 small-to-medium machines (a single-machine sketch of the idea follows this slide)
        • Full regression in 15-20 mins
      • To do: automatic test prioritization
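    The production system ran on Hadoop across ~100 machines; this is only a single-machine sketch of the same shape using Java parallel streams, where analyze() and expectedFor() are invented stand-ins for the WAIT engine and the stored golden reports.

      import java.nio.file.*;
      import java.util.stream.Stream;

      public class RegressionRunner {
          public static void main(String[] args) throws Exception {
              // Re-analyze every archived submission and diff against its
              // stored expected report; count mismatches as regressions.
              try (Stream<Path> submissions = Files.list(Paths.get(args[0]))) {
                  long failures = submissions.parallel()
                      .filter(p -> !analyze(p).equals(expectedFor(p)))
                      .count();
                  System.out.println(failures + " regressions");
              }
          }

          static String analyze(Path p) { return ""; }      // run the WAIT engine (stub)
          static String expectedFor(Path p) { return ""; }  // stored expected report (stub)
      }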
  • Summary
    • Data-driven study of large application
      • Painful but worth it. Learned a lot.
      • Didn’t rely on much existing tooling
        • Mismatch for performance triage
      • No single problem or magic fix
        • Hot locks far from the only problem
        • Contended resources at many levels of the software stack
        • Very little was multi-core specific
    • WAIT Tool: Quickly identify primary performance inhibitors
      • Real demand for this seemingly simple concept
        • Scalability analysis: focus on what is not running
      • Ease of use matters to adoption
        • Do your best with easily available information
        • Why don’t we see more of this?
      • Strategy that has proven quite powerful
        • Expert system presents high-level guess at what the problem is
        • Allow drilling down to raw data as proof
    • WAIT available now: https://wait.researchlabs.ibm.com/
        • Tool, documentation, demo, examples, manual
  • The End