LEARN
There are laws and principles that govern concurrency and performance.
Performance can be built, fueled, and/or tuned.
How do we measure performance and capacity in abstract terms?
Capacity (throughput) and load are often used interchangeably, but incorrectly.
What is the difference between resource utilization and saturation?
How are performance & capacity measured on a live system (CPU & memory)?
APPLY
Find out how your system is being used or abused.
Find out how your system is performing as a whole.
Find out how a particular process in the system is performing.
Find out how a particular thread in the process is performing.
Find the bottlenecks: what is scarce or missing?
4. What will we Discuss?
– LEARN
– There are laws and principles that govern concurrency and performance.
– Performance can be built, fueled, and/or tuned.
– How do we measure performance and capacity in abstract terms?
– Capacity (throughput) and load are often used interchangeably, but incorrectly.
– What is the difference between resource utilization and saturation?
– How are performance & capacity measured on a live system (CPU & memory)?
– APPLY
– Find out how your system is being used or abused.
– Find out how your system is performing as a whole.
– Find out how a particular process in the system is performing.
– Find out how a particular thread in the process is performing.
– Find the bottlenecks: what is scarce or missing?
5. Performance – Built, Fueled or Tuned
• Built (Implementation and Techniques)
– Binary search O(log n) is more efficient than linear search O(n) (see the sketch after this list)
– Caching can reduce disk I/O, significantly boosting performance
• Fueled (More Resources)
– Simply get a machine with more CPU(s) and memory if constrained
– Implement RAID to improve disk I/O
• Tuned (Settings and Configurations)
– Tune garbage collection to optimize Java processes
– Tune Oracle parameters to get optimum database performance
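To make the "built" point concrete, here is a minimal Java sketch (class and method names are illustrative, not from the deck) contrasting the two search strategies on a sorted array:

```java
import java.util.Arrays;

public class SearchDemo {
    // O(n): scans every element until a match is found.
    static int linearSearch(int[] sorted, int key) {
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i] == key) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] sorted = new int[1_000_000];
        for (int i = 0; i < sorted.length; i++) sorted[i] = i * 2;

        // O(log n): halves the search space on each probe (~20 probes here,
        // versus up to 1,000,000 comparisons for the linear scan).
        int viaBinary = Arrays.binarySearch(sorted, 1_999_998);
        int viaLinear = linearSearch(sorted, 1_999_998);
        System.out.println(viaBinary + " " + viaLinear); // same index, very different cost
    }
}
```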
6. Capacity and Load
• Load is an expectation placed on the system
– It is the rate of work that we put on the system.
– It is a factor external to the system.
– Load may vary with time and events.
– It has no upper cap; it can grow without bound.
• Capacity is the potential of the system
– It is the maximum rate of work the system supports efficiently, effectively & sustainably.
– It is a factor internal to the system.
– The maximum capacity of a system is finite and stays fairly constant.
– We often call throughput the system's capacity for load.
• Chemistry between Load & Capacity
– LOAD = CAPACITY? Good: expectation matches the potential. Hired.
– LOAD > CAPACITY? Bad: expectation exceeds the potential. Fired.
– LOAD < CAPACITY? Ugly: expectation is less than the potential. Find another one.
– If you can't be good, better to be ugly than bad.
7. Performance Measurement of a System
Measures of a System's Capacity
• Response Time or Latency
– Measures the time spent executing a request
• Round-trip time (RTT) for a transaction
– Good for understanding user experience
– Least scalable measure; developers focus on how much time each transaction takes
• Throughput
– Measures the number of transactions executed over a period of time
• Output transactions per second (TPS)
– A measure of the system's capacity for load
– Depending on the resource type, it could be a hit rate (for a cache)
• Resource Utilization
– Measures the use of a resource
• Memory, disk space, CPU, network bandwidth
– Helpful for system sizing; generally the easiest measurement to understand
– Throughput and response time can conflict, because resources are limited
• Locking, resource contention, container activity
A small measurement sketch follows.
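As a hedged illustration of the first two measures, this minimal Java sketch (doTransaction is a placeholder for real work, not anything from the deck) derives both average latency and throughput from one timed run:

```java
public class MeasureDemo {
    public static void main(String[] args) throws Exception {
        int transactions = 1_000;
        long start = System.nanoTime();
        for (int i = 0; i < transactions; i++) {
            doTransaction();
        }
        long elapsedNanos = System.nanoTime() - start;

        // Response time: average time spent executing one request.
        double avgLatencyMs = (elapsedNanos / 1e6) / transactions;
        // Throughput: transactions completed per second of wall time.
        double throughputTps = transactions / (elapsedNanos / 1e9);
        System.out.printf("latency=%.3f ms, throughput=%.1f tps%n",
                avgLatencyMs, throughputTps);
    }

    static void doTransaction() throws InterruptedException {
        Thread.sleep(1); // stand-in for real work
    }
}
```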
8. It is time for the System's Capacity to be Loaded with Work
(Throttling & Buffering Techniques)
• Nothing stops us from loading a system beyond its capacity (max throughput).
• Transactions per second is a misconception: real traffic may come in bursts.
– Receiving 3,600 transactions in an hour does not mean one was pumped in every second.
– We may have received them in bursts: all in the first 10 minutes and nothing in the last 50 minutes.
– So we really can't say at what TPS. We can regulate bursts with throttling and buffering.
• Throttling (implemented by the producer to smooth output)
– Spreads bursts over time to smooth the output from a process.
– We may add throttles to control the output rate from threads to each external interface.
– A throttle of 10 tps ensures the max output is 10 tps regardless of the load & capacity.
– Throttling is a scheme for producers (pace production to the rate the consumer can accept).
• Buffering (implemented by the consumer to smooth input)
– Spreads bursts over time to smooth the input from an external interface.
– We add buffering to control the input rate to threads from each external interface.
– The application processes input at 10 tps; load above that is buffered & processed later.
– Buffering is a scheme for consumers (take whatever is produced, consume at our own pace).
A sketch combining both techniques follows.
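A minimal Java sketch of both schemes, under illustrative assumptions (10 permits per second, a burst of 50 requests, queue capacity 1,000); none of these names come from the deck:

```java
import java.util.concurrent.*;

public class SmoothingDemo {
    public static void main(String[] args) {
        // BUFFERING (consumer side): bursts land in a queue and are
        // drained at the consumer's own pace.
        BlockingQueue<Runnable> buffer = new LinkedBlockingQueue<>(1_000);

        // THROTTLING (producer side): a semaphore refilled 10 times per
        // second caps output at ~10 tps regardless of incoming burst size.
        Semaphore permits = new Semaphore(0);
        ScheduledExecutorService refiller = Executors.newSingleThreadScheduledExecutor();
        refiller.scheduleAtFixedRate(permits::release, 0, 100, TimeUnit.MILLISECONDS);

        Thread producer = new Thread(() -> {
            for (int i = 0; i < 50; i++) {           // a burst of 50 requests
                final int id = i;
                try {
                    permits.acquire();               // throttle: wait for a permit
                    buffer.put(() -> System.out.println("processed " + id));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 50; i++) {
                    buffer.take().run();             // consume at our own pace
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                refiller.shutdown();
            }
        });

        producer.start();
        consumer.start();
    }
}
```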
9. Supply Chain Principle
(Apply it to define an optimum Thread Pool Size)
• The more throughput you want, the more resources you will consume.
• You may apply this principle to define the optimum thread-pool size for a system/application.
– To support a throughput of (t) transactions per second: (t) = 20 tps
– Where each transaction takes (d) seconds to complete: (d) = 5 seconds
– We need at least (d*t) threads (the minimum thread-pool size): (d*t) = 100 threads
• A thread is an abstract unit of CPU resource here. The sketch below turns this rule into code.
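A minimal sketch of the sizing rule using the slide's own numbers (the class name is illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        double targetTps = 20.0;      // t: required throughput
        double latencySeconds = 5.0;  // d: time one transaction occupies a thread
        // d*t = 100: fewer threads than this cannot sustain 20 tps
        // when each request holds a thread for 5 seconds.
        int minThreads = (int) Math.ceil(targetTps * latencySeconds);

        ExecutorService pool = Executors.newFixedThreadPool(minThreads);
        System.out.println("sized pool to " + minThreads + " threads");
        pool.shutdown();
    }
}
```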
11. Quantify Resource Consumption
Utilization & Saturation
• Resource Utilization
– Utilization measures how busy a resource is.
– It is usually represented as a percentage average over a time interval.
• Resource Saturation
– Saturation is often a measure of work that has queued waiting for the resource
– It can be measured as both
• As an average over time
• And at a particular point in time.
– For some resources that do not queue, saturation may be synthesized from error counts.
• For example, page faults reveal memory saturation.
• Load (the input rate of requests) is an independent/external variable.
• Resource consumption and throughput (the output rate of responses) are dependent/internal variables, a function of load.
12. How are Load, Resource Consumption and Throughput Related?
• As load increases, throughput increases, until maximum resource utilization on the
bottleneck device is reached. At this point, maximum possible throughput is
reached, Saturation occurs.
• Then, queuing (waiting for saturated resources) starts to occur.
• Queuing typically manifests itself by degradation in response times.
• This phenomenon is described by Little’s Law:
L=X*R
L (LOAD), X (THROUGHPUT) and R (RESPONSE TIME)
• As L increases, X increases (R also increases slightly, because there is always some
level of contention at the component level).
• At some point, X reaches Xmax – the maximum throughput of the system. At this
point, as L continues to increase, the response time R increases in proportion and
throughput may then start to decrease, both due to resource contention. A small worked example follows.
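A hedged worked example of Little's Law in Java, reusing the numbers from the thread-pool slide (20 tps, 5 s):

```java
public class LittlesLaw {
    public static void main(String[] args) {
        double x = 20.0;  // X: throughput, transactions per second
        double r = 5.0;   // R: response time, seconds
        double l = x * r; // L: average number of requests in the system
        System.out.println("L = X * R = " + l + " concurrent requests");
        // Once X hits Xmax, any further increase in L cannot raise X;
        // it can only show up as queuing, i.e. as growth in R.
    }
}
```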
14. Example
How are Throughput and Resource Consumption Related?
• Throughput & latency can have an inverse or a direct relationship
– Concurrent tasks (threads) often contend for resources (locking & contention)
• Single-threaded: higher throughput = lower latency
– Consistent throughput; does not increase with incoming load & resources
– Processes serially; good for batch jobs
– Response time varies linearly with request order
• Multi-threaded: higher throughput = higher latency (most of the time)
– Throughput may increase linearly with load, but starts to drop after a threshold
– Processes concurrently; good for interactive modules (web apps)
– Near-consistent response time; varies with load rather than request order
Single Threaded – 10 CPU(s):
Threads = 1, Latency = 0.1 s, Throughput = 1/0.1 = 10 tx/sec
Threads = 1, Latency = 0.001 s, Throughput = 1/0.001 = 1000 tx/sec
Multi Threaded – 10 CPU(s):
Threads = 10, Latency = 0.1 s, Throughput = (1/0.1) * 10 = 100 tx/sec
Threads = 100, Latency = 0.2 s, Throughput = (1/0.2) * 100 = 500 tx/sec
15. Producer Consumer Principle
Predicting Maximum Throughput
Identify Bottleneck Device/Resource
• The Utilization Law: Ui = T * Di
• Where Ui is the percentage of utilization of a device in the application, T is the application
throughput, and Di is the service demand of the application device.
• The maximum throughput of an application Tmax is limited by the maximum service demand of all
of the devices in the application.
• EXAMPLE - A load test reports 200 kb/sec average throughput:
CPUavg = 80% Dcpu = 0.8 / 200 kb/sec = 0.004 sec/kb
Memoryavg = 30% Dmemory = 0.3 / 200 kb/sec = 0.0015 sec/kb
Diskavg = 8% Ddisk = 0.08 / 200 kb/sec = 0.0004 sec/kb
Network I/Oavg = 40% Dnetwork I/O = 0.4 / 200 kb/sec = 0.002 sec/kb
• In this case, Dmax corresponds to the CPU. So, the CPU is the bottleneck device.
• We can use this to predict the maximum throughput of the application by setting the CPU utilization to
100% and dividing by Dcpu. In other words, for this example:
Tmax = 1 / Dcpu = 250 kb/sec
• In order to increase the capacity of this application, it would first be necessary to increase CPU capacity.
Increasing memory, network capacity or disk capacity would have little or no effect on performance until
after CPU capacity has been increased sufficiently.
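The slide's arithmetic, recast as a small hedged Java sketch (the data mirrors the example above; nothing here is a real measurement API):

```java
public class BottleneckPredictor {
    public static void main(String[] args) {
        double throughput = 200.0; // kb/sec measured in the load test
        // Measured average utilizations (as fractions) per device.
        double[] utilization = {0.80, 0.30, 0.08, 0.40};
        String[] device = {"cpu", "memory", "disk", "network"};

        double dmax = 0;
        String bottleneck = "";
        for (int i = 0; i < device.length; i++) {
            double demand = utilization[i] / throughput; // Di = Ui / T (sec/kb)
            System.out.printf("D%s = %.4f sec/kb%n", device[i], demand);
            if (demand > dmax) { dmax = demand; bottleneck = device[i]; }
        }
        // Tmax = 1 / Dmax: throughput at which the bottleneck hits 100%.
        System.out.printf("bottleneck=%s, Tmax=%.0f kb/sec%n", bottleneck, 1 / dmax);
    }
}
```

Run as written, this prints Dcpu = 0.0040 sec/kb and Tmax = 250 kb/sec, matching the slide.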
16. Work Pools & Thread Pools
Working Together
• Work Pools are queues of work to be performed by a software application or component (see the sketch after this list).
– If all threads in the thread pool are busy, incoming work can be queued in the work pool
– Threads from the thread pool, when freed, can execute them later
• Work Pools absorb congestion & smooth out bursts
– A queue consisting of units of work to be performed
– CONGESTION: by allowing the current (client) threads to submit work and return
– BURSTS: over-capacity transactions can be buffered in the work pool and executed later
– Allow for caching of units of work to reduce system-intensive calls
• Can perform a bulk fetch from a database instead of fetching one record at a time
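A minimal Java sketch of a work pool feeding a thread pool (the sizes are illustrative assumptions):

```java
import java.util.concurrent.*;

public class WorkPoolDemo {
    public static void main(String[] args) {
        // Work pool: a bounded queue of pending units of work.
        BlockingQueue<Runnable> workPool = new ArrayBlockingQueue<>(1_000);

        // Thread pool: 10 workers drain the work pool; while all 10 are
        // busy, newly submitted work simply waits in the queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                10, 10, 0L, TimeUnit.MILLISECONDS, workPool);

        for (int i = 0; i < 100; i++) {
            final int id = i;
            // Client threads submit and return immediately; congestion is
            // absorbed by the work pool rather than blocking the caller.
            pool.execute(() -> System.out.println("unit " + id));
        }
        pool.shutdown();
    }
}
```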
17. Queuing Tasks may be risky
• One task could lock up another that would be able to continue if the queued task were to run.
• Queuing can smooth incoming traffic bursts that are limited in time (depending on the traffic rate and queue size).
• It fails if traffic arrives, on average, faster than it can be processed.
• In general, work pools live in memory, so it is important to understand the impact of restarting a system, as in-memory elements will be lost.
– Is it acceptable to lose the queued work?
– Is the queue backed up on disk?
18. Bounded & Unbounded Pools
(Load Shedding)
• If not bounded, pools can grow freely but can cause the system to exhaust resources.
– Unbounded work pool / queue (may overload memory/heap & crash)
• Each work object in the queue holds its space until consumed
– Unbounded thread pool (may overload CPU / native space & crash)
• Each thread asks to be scheduled on a CPU and consumes native stack space
• If the queue size is bounded, incoming execute requests block when it is full. We can apply different policies to handle it, for example:
– Reject if there is no space (can have side effects)
– Remove based on priority (e.g., priority may be a function of time: timeouts)
• Thread pools can apply different policies when the work pool is full:
– Block until space is available, i.e., starve (VERY BAD, but sometimes needed)
– Run in the current thread (very dangerous!)
A Java sketch of a bounded pool with such policies follows.
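A hedged sketch using the JDK's ThreadPoolExecutor, whose built-in rejection handlers map onto the policies above (pool and queue sizes are illustrative):

```java
import java.util.concurrent.*;

public class SaturationPolicyDemo {
    public static void main(String[] args) {
        // Bounded work pool: at most 2 queued units before the policy kicks in.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(2),
                new ThreadPoolExecutor.AbortPolicy());    // reject when full
        // Alternatives:
        //   new ThreadPoolExecutor.DiscardOldestPolicy() // drop the oldest queued unit
        //   new ThreadPoolExecutor.CallerRunsPolicy()    // run in the current thread
        //     (the "very dangerous" option: the caller stalls doing the work)

        for (int i = 0; i < 10; i++) {
            try {
                pool.execute(() -> sleepQuietly(100));
            } catch (RejectedExecutionException e) {
                System.out.println("rejected: work pool is full");
            }
        }
        pool.shutdown();
    }

    static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```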
19. Work pool & thread pool sizes can
often be traded off for each other
Large work pool and small thread pool:
– Minimizes CPU usage, OS resources, and context-switching overhead.
– Can lead to artificially low throughput, especially if tasks frequently block (e.g., I/O-bound).
Small work pools generally require larger thread pools:
– Keeps the CPUs busier.
– May cause scheduling overhead (context switching) and may lessen throughput, especially if there are few CPUs.
21. CPU
• Many modern systems from Sun boast numerous CPUs or virtual CPUs
(which may be cores or hardware threads).
• The CPUs are shared by applications on the system, according to a policy
prescribed by the operating system and scheduler
• If the system becomes CPU resource limited, then application or kernel
threads have to wait on a queue to be scheduled on a processor,
potentially degrading system performance.
• The time spent on these queues, the length of these queues and the
utilization of the system processor are important metrics for quantifying
CPU-related performance bottlenecks.
22. Process – User and Kernel Level
Threads
• Process includes the set of executable programs, address
space, stack, and process control block. One or more threads
may execute the program(s).
• User-level threads (threads library)
– Invisible to the OS; maintained by a thread library
– The interface for application parallelism
• Kernel threads
– The unit that can be dispatched on a processor; its structures are maintained by the kernel
• Lightweight processes (LWP)
– Each LWP supports one or more user-level threads and maps to exactly one kernel-level thread. It maintains the state of a thread.
25. User Thread over a Solaris LWP
State of User Thread and LWP may be different
26. Solaris Threading Model
If you are in a thread, the thread library must schedule it on an LWP.
Each LWP has a kernel thread, which schedules it on a CPU.
Threading models are used between LWPs & Solaris Threads
28. JVM Memory Organization & Threads
• Method Area
– JVM loads the class file, their type info and binary data in this area
– This memory area is shared by all threads
• Heap Area
– JVM places all objects the program instantiates onto the heap
– This memory area is shared by all threads
– This memory can be adjusted by VM options -Xmx & -Xms as required
• Java Stack and Program Counter (PC) Register
– Each new thread that executes, gets its own pc register & Java stack.
– The value of the pc register indicates the next instruction to execute.
– A thread's Java stack stores the state of Java method invocations for the
thread. The state of a Java method invocation includes
• its local variables & the parameters with which it was invoked,
• its return value (if any), and intermediate calculations.
– This memory may be adjusted by VM option –Xss, typically 1m for RK Apps
– The state of native method (JVM method) invocations is stored in an
implementation-dependent way in native method stacks, as well as possibly in
registers or other implementation-dependent memory areas.
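The heap options above can also be observed from inside the process; a minimal hedged sketch using the standard Runtime API (run it with flags such as -Xms64m -Xmx256m to see the mapping):

```java
public class HeapBounds {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        // maxMemory() reflects -Xmx; totalMemory() is what the JVM has
        // currently reserved (grows from -Xms toward -Xmx).
        System.out.println("max heap   (-Xmx): " + rt.maxMemory() / mb + " MB");
        System.out.println("committed heap   : " + rt.totalMemory() / mb + " MB");
        System.out.println("free of committed: " + rt.freeMemory() / mb + " MB");
    }
}
```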
29. A Java thread’s Stack Memory
• The Java stack is composed of stack frames (or frames).
• A stack frame contains the state of one Java method invocation.
– When a thread invokes a method, the Java virtual
machine pushes a new frame onto that thread's
Java stack.
– When the method completes, the virtual machine
pops and discards the frame for that method.
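A minimal sketch that makes this push/pop behavior visible by exhausting the stack (class name illustrative; the final frame count depends on the -Xss setting):

```java
public class FrameDemo {
    static int depth = 0;

    // Each call pushes one new frame (locals + parameters + return state)
    // onto this thread's Java stack; the frame is popped on return.
    static void recurse() {
        depth++;
        recurse();
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            // The stack (sized by -Xss) filled up with frames.
            System.out.println("stack overflowed after " + depth + " frames");
        }
    }
}
```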
30. Thread Modes
Kernel & User Mode Privilege
• A LWP may either execute in kernel (sys) or user (usr) privilege mode.
• Operations like processing data in local memory and intra-process communication between threads of the same process do not require kernel-mode privilege for the thread executing the user program.
• However, inter-process communication and hardware access are done by kernel programs, so the executing thread requires kernel-mode privilege.
• User programs call kernel programs by making system calls.
• An LWP runs in user mode until it makes a system call that requires kernel-mode privilege. The mode switch then happens, which is costly.
32. Complete Process State Diagram
The state of a process is a superset of its thread states.
A process's state is defined by the states of its threads.
33. VMSTAT – Glimpse of CPU Behavior
The vmstat tool provides a glimpse of the system's behavior; one line indicates both CPU utilization and saturation.
The first line is the summary since boot, followed by samples every five seconds.
On the far right is cpu:id, the percent idle, which lets us determine how utilized the CPUs are.
In this example, the idle time for the 5-second samples was always 0, indicating 100% utilization.
On the far left is kthr:r, the total number of threads on the ready-to-run queues.
If the value is more than the number of CPUs, it indicates CPU saturation.
Here, kthr:r was mostly 2 and sustained, indicating modest saturation for this single-CPU server. A value of 4 would indicate high saturation.
34. More about VMSTAT
Count – Description
kthr:r – Total number of runnable threads on the dispatcher queues
faults:in – Number of interrupts per second
faults:sy – Number of system calls per second
faults:cs – Number of context switches per second, both voluntary and involuntary
cpu:us – Percent user time; time the CPUs spent processing user-mode threads
cpu:sy – Percent system time; time the CPUs spent processing system calls on behalf of user-mode threads, plus the time spent processing kernel threads
cpu:id – Percent idle; time the CPUs are waiting for runnable threads. This value can be used to determine CPU utilization
35. CPU Utilization
• You can calculate CPU utilization from vmstat by subtracting id from 100 or by adding us
and sy.
• 100% utilized may be fine—it can be the price of doing business.
• When a Solaris system hits 100% CPU utilization, there is no sudden dip in performance;
the performance degradation is gradual. Because of this, CPU saturation is often a
better indicator of performance issues than is CPU utilization.
• The measurement interval is important: 5% utilization sounds close to idle; however, for
a 60-minute sample it may mean 100% utilization for 3 minutes and 0% utilization for
57 minutes. It is useful to have both short- and long-duration measurements.
A server running at 10% CPU utilization sounds like 90% of the CPU is available for
"free," that is, it could be used without affecting the existing application. This isn't quite
true. When an application on a server with 10% CPU utilization wants the CPUs, they
will almost always be available immediately. On a server with 100% CPU utilization, the
same application will find that the CPUs are already busy—and will need to preempt
the currently running thread or wait to be scheduled. This can increase latency.
36. CPU Saturation
• The kthr:r metric from vmstat is useful as a measure for CPU saturation.
However, since this is the total across all the CPU run queues, divide kthr:r
by the CPU count for a value that can be compared with other servers.
• Any sustained non-zero value is likely to degrade performance. The
performance degradation is gradual (unlike the case with memory
saturation, where it is rapid).
• Interval time is still quite important. It is possible to see CPU saturation (kthr:r) while a CPU is idle (cpu:id). You may find that the run queue is quite long for a short period of time, followed by idle time. Averaging over the interval gives both a non-zero run queue length and idle time.
37. Solaris Performance Tools
Tool Uses Description
vmstat kstat For an initial view of overall CPU behavior
psrinfo kstat For physical CPU properties
uptime getloadavg() For the load averages, to gauge recent CPU activity
sar kstat, sadc For overall CPU behavior and dispatcher queue statistics; sar also allows historical data collection
mpstat kstat For per-CPU statistics
prstat procfs To identify process CPU consumption
dtrace DTrace For detailed analysis of CPU activity, including scheduling events and dispatcher analysis
38. uptime Command
Prints the uptime along with CPU load averages, which represent both utilization and saturation of the CPUs.
• The numbers are the 1-, 5-, and 15-minute load averages.
• The load average is often approximated as the average number of runnable and running threads, which is a reasonable description.
• A value equal to your CPU count usually means 100% utilization; less than your CPU count is proportionally less than 100% utilization; and greater than your CPU count is a measure of saturation.
• A consistent load average higher than your CPU count may cause degraded performance. Solaris handles CPU saturation very well, so load averages should not be used for anything more than an initial approximation of CPU load.
The sketch below reads the same numbers from inside a Java process.
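A minimal hedged sketch using the standard Java management API (the "saturated" label is an illustrative simplification of the rule above):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadAverage {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        int cpus = os.getAvailableProcessors();
        // 1-minute load average; returns -1 where unavailable.
        double load = os.getSystemLoadAverage();
        System.out.printf("load=%.2f over %d CPUs -> %s%n", load, cpus,
                load > cpus ? "saturated" : "headroom");
    }
}
```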
39. sar - The system activity reporter
Provides live statistics, or can be activated to record historical CPU statistics; prints the user (%usr), system (%sys), wait I/O (%wio), and idle (%idle) times.
It identifies long-term patterns that may be missed when taking a quick look at the system. Historical data also provides a reference for what is "normal" for your system.
The following example shows the default output of sar, which is
also the -u option to sar. An interval of 1 second and a count of
5 were specified.
40. sar –q - Statistics on the run queues
runq-sz (run queue size). Equivalent to the kthr:r field from vmstat; can be
used as a measure of CPU saturation
swpq-sz (swapped-out queue size). Number of swapped-out threads. Swapping
out threads is a last resort for relieving memory pressure, so this field will be
zero unless there was a dire memory shortage.
%runocc (run queue occupancy). Helps prevent a danger when intervals are
used, that is, short bursts of activity can be averaged down to unnoticeable
values. The run queue occupancy can identify whether short bursts of run queue
activity occurred
%swpocc (swapped-out occupancy). Percentage of time there were swapped-out threads. If one thread of a process is swapped out, all other threads of the process must be too.
41. Is my system performing well?
About the Individual Processors
The psrinfo -v command determines the number of processors in the system and their speed. In Solaris 10, -vp prints additional information.
The mpstat command summarizes the utilization statistics for each CPU. Following is an example of a four-CPU machine, being sampled every 1 second. Key columns:
syscl (system calls), csw (context switches), icsw (involuntary context switches), migr (migrations of threads between processors), intr (interrupts), ithr (interrupts as threads), smtx (kernel mutexes), srw (kernel reader/writer mutexes)
42. What are sampling and Clock tick
woes?
• While most counters you see in Solaris are highly accurate, sampling issues remain in a few minor places. In particular, the run queue length as seen from vmstat (kthr:r) is based on a sample that is taken every second. For example, a problem was once caused by a program that deliberately created numerous short-lived threads every second, such that the one-second run queue sample usually missed the activity.
• The runq-sz from sar -q suffers from the same problem, as does %runocc (short-interval measurements defeat its purpose).
• These are all minor issues, and a valid workaround is to use DTrace, with which statistics can be created at any accuracy desired.
43. Who Is Using the CPU?
The default output from the prstat command shows one line per process, including each process's recent CPU utilization.
The system load average indicates the demand and queuing for CPU resources averaged over 1-, 5-, and 15-minute periods; if it exceeds the number of CPUs, the system is overloaded.
44. How is the CPU being consumed?
• Use the options -m (show microstates) & -L (show per-thread) to observe per-thread microstates.
• Microstates represent a time-based summary, broken into percentages, for each thread.
• USR through LAT sum to 100% of the time spent by each thread during the prstat sample.
• USR (user time) and SYS (system time) are the time the thread spent running on the CPU.
• LAT (latency) is the amount of time the thread spent waiting for a CPU. A non-zero number means there was some queuing/saturation for CPU resources.
• SLP indicates the time the thread spent blocked, waiting for events like disk I/O.
• TFL & DTL determine if, and how much, the thread is waiting for memory paging.
• TRP indicates the time spent on software traps.
Each thread is waiting for a CPU about 0.2% of the time: CPU resources are not constrained.
Each thread is waiting for a CPU about 80% of the time: CPU resources are constrained.
45. How are threads inside the process performing?
The example shows that thread number two in the target process is using the most CPU, yet spending 83% of its time waiting for a CPU. We can look further at thread number two with the pstack <pid>/<LWPID> command; pstack <pid> alone shows all threads.
Take a Java thread dump and identify the thread with native thread id = 2. This is the one. This way you can relate the Java code to the native system call or library method it invoked on the system.
46. Process Stack on a Java Virtual
Machine: pstack
• Use the "C++ stack unmangler" with Java virtual machine (JVM) targets to see the Java function calls within the native C stack.
47. Tracing Processes
truss
truss traces system calls made on behalf of a process. It includes the user LWP
(thread) number, system call name, arguments and return codes for each system call.
The truss -c option counts system calls instead of tracing each one.
48. Why memory saturation brings a more rapid degradation in performance than CPU saturation
• Memory saturation may cause rapid degradation in performance. To overcome saturation, the OS resorts to paging in/out and swapping, which are themselves heavy tasks; with processes competing for memory, a race condition may occur.
• The available memory on a server may be artificially constrained, either
through pre-allocation of memory or through the use of a garbage
collection mechanism that doesn’t free up memory until some threshold is
reached.
49. Thread Dumps
• What exactly is a "thread dump"?
– A thread dump basically gives you information on what each thread in the VM is doing at any given point in time.
• If an application seems stuck, or is running out of resources, a thread dump will reveal
the state of the server. Java's thread dumps are a vital tool for server debugging. For
scenarios like
– PERFORMANCE RELATED ISSUES
– DEADLOCK (SYSTEM LOCKS UP)
– TIMEOUT ISSUES
– SYSTEM STOPS PROCESSING TRAFFIC
50. Thread dumps in Redknee Applications
• Java thread dumps are obtained by doing:
– Send kill -3 <pid> on Unix: see the thread dump in the ctl logs
– Press Ctrl + Break on Windows: see the thread dump on the xbuild console
– Run $JAVA_HOME/bin/jstack <pid>: see the thread dump on the shell console
• Java thread dumps list all of the threads in an application
• Threads are output in the order they were created, with the newest thread at the top
• Threads should be named with a useful name describing what they do or what they are responsible for (Open Tickets)
A dump can also be taken programmatically, as the sketch below shows.
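A minimal hedged sketch using the standard ThreadMXBean API, which exposes the same per-thread information as kill -3 or jstack:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class SelfDump {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Every live thread with its name, state, stack trace, and any
        // monitors/synchronizers held (the two boolean arguments).
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info);
        }
    }
}
```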
51. Common Threads in Redknee
• "Idle"
– CORBA threads to handle incoming requests, which are currently not doing any work
• "RMI TCP Connection(<port>)-<IP>"
– Outbound connection over RMI to a specific host and port
• "FileLogger"
– Framework thread for logging
• "JavaIDL Reader for <host>:<port>"
– CORBA thread reading requests from a server
• "TP-Processor8"
– Tomcat web thread
• "Thread-<#>"
– Thread that has not been named (BAD)
• "ChannelHome ForwardingThread"
– Thread used to cluster transactions over to a peer
– One of these threads per Home that is clustered (DB table)
• "Worker#1"
– Worker threads doing work
52. Thread Dump May Give you Clues
C:\learn\classes>java Test

Full thread dump Java HotSpot(TM) Client VM (1.4.2_04-b05 mixed mode):

"Signal Dispatcher" daemon prio=10 tid=0x0091db28 nid=0x744 waiting on condition [0..0]

"Finalizer" daemon prio=9 tid=0x0091ab78 nid=0x73c in Object.wait() [1816f000..1816fd88]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x10010498> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(Unknown Source)
    - locked <0x10010498> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(Unknown Source)
    at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)

"Reference Handler" daemon prio=10 tid=0x009196f0 nid=0x738 in Object.wait() [1812f000..1812fd88]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x10010388> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Unknown Source)
    at java.lang.ref.Reference$ReferenceHandler.run(Unknown Source)
    - locked <0x10010388> (a java.lang.ref.Reference$Lock)

"main" prio=5 tid=0x00234998 nid=0x4c8 runnable [6f000..6fc3c]
    at Test.findNewLine(Test.java:13)
    at Test.<init>(Test.java:4)
    at Test.main(Test.java:20)

"VM Thread" prio=5 tid=0x00959370 nid=0x6e8 runnable

"VM Periodic Task Thread" prio=10 tid=0x0023e718 nid=0x74c waiting on condition

"Suspend Checker Thread" prio=10 tid=0x0091cd58 nid=0x740 runnable
53. What is there in the Thread Dump?
• In this case we can see that, at the time we took the thread dump, there were seven threads:
– Signal Dispatcher
– Finalizer
– Reference Handler
– main
– VM Thread
– VM Periodic Task Thread
– Suspend Checker Thread
• Each thread name is followed by whether the thread is a daemon thread or not.
• Then comes prio, the priority of the thread [ex: prio=5].
• tid and nid are the Java thread id and the native thread id.
• Then follows the state of the thread, which is one of:
– Runnable [marked as R in some VMs]: the thread is either running currently or is ready to run the next time the OS
thread scheduler schedules it.
– Suspended [marked as S in some VMs]: the thread is not in a runnable state (for example, suspended by a debugger).
– Object.wait() [marked as CW in some VMs]: the thread is waiting on an object using Object.wait().
– Waiting for monitor entry [marked as MW in some VMs]: the thread is waiting to enter a synchronized block.
• What follows the thread description line is a regular stack trace.
54. Threads in a Dead-Lock
• A set of threads is said to be in a deadlock when there is a cyclic wait condition, i.e., each thread in the
deadlock is waiting on a resource locked by some other thread in the set of deadlocked threads. Newer
JDKs detect them automatically, as below; a minimal program producing such a deadlock is sketched after the dump.
Found one Java-level deadlock:
=============================
"Thread-1":
  waiting to lock monitor 0x0091a27c (object 0x140fa790, a java.lang.Class),
  which is held by "Thread-0"
"Thread-0":
  waiting to lock monitor 0x0091a25c (object 0x14026800, a java.lang.Class),
  which is held by "Thread-1"

Java stack information for the threads listed above:
===================================================
"Thread-1":
  at Deadlock$2.run(Deadlock.java:48)
  - waiting to lock <0x140fa790> (a java.lang.Class)
  - locked <0x14026800> (a java.lang.Class)
"Thread-0":
  at Deadlock$1.run(Deadlock.java:33)
  - waiting to lock <0x14026800> (a java.lang.Class)
  - locked <0x140fa790> (a java.lang.Class)

Found 1 deadlock.
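A minimal sketch (assumed, not the original source) of code that produces a report like the one above: two anonymous Runnables (compiled as Deadlock$1 and Deadlock$2) lock two Class objects in opposite order, so each usually ends up holding the lock the other needs:

    public class Deadlock {
        public static void main(String[] args) {
            new Thread(new Runnable() {               // compiled as Deadlock$1
                public void run() {
                    synchronized (String.class) {
                        pause();                      // let the other thread take Integer.class
                        synchronized (Integer.class) { }  // blocks: held by the other thread
                    }
                }
            }, "Thread-0").start();
            new Thread(new Runnable() {               // compiled as Deadlock$2
                public void run() {
                    synchronized (Integer.class) {
                        pause();
                        synchronized (String.class) { }   // blocks: cyclic wait => deadlock
                    }
                }
            }, "Thread-1").start();
        }

        private static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
        }
    }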
56. Memory
• Memory includes
physical memory (RAM)
swap space
• Swap space is a portion of storage acting as an extension of memory.
• Memory is a more complicated subject than CPU.
• Memory saturation triggers CPU saturation (page faults / GC).
57. Memory Utilization and Saturation
• To sustain a higher throughput, an application spawns more threads
and holds the request data.
• Each thread occupies memory for the data it operates on and for its own
stack.
• At the point where the memory demanded by a process can no longer be
met from available memory, saturation occurs.
• Sudden increases in utilization without accompanying increases in
throughput can also be used to detect degraded performance
modes caused by software 'aging' issues, such as memory leaks.
58. VMSTAT – Glimpse of Memory
Utilization
If the scan rate (sr) is continuously over 200 pages per second then there
is a memory shortage on the system.
Counter Description
swap Available swap space in Kbytes.
free Combined size of the cache list and free list.
re Page reclaims—The number of pages reclaimed from the cache list.
mf Minor faults—The number of pages attached to an address space.
fr Page-frees—Kilobytes that have been freed
pi and po Kilobytes Paged in and Paged out respectively
de Anticipated short-term memory shortfall, in kilobytes, to be freed ahead of demand.
sr The number of pages scanned by the page scanner per second.
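These counters come from running vmstat with a sampling interval; for example (Solaris):

    vmstat 5       # one summary line every 5 seconds (the first line is the since-boot average)
    vmstat -p 5    # per-type paging detail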
60. Relieving Memory Pressure
When free memory is exhausted, pages are first reclaimed from the cache list (filesystem, I/O, and
similar caches). Next the swapper swaps out entire threads, seriously degrading the
performance of the swapped-out applications. The page scanner selects pages to free,
and its activity is characterized by the scan rate (sr) in vmstat. Both use some form
of the Not Recently Used algorithm.
The swapper and the page scanner are only used when appropriate. Since
Solaris 8, the cyclic page cache, which maintains lists for Least Recently
Used selection, is preferred.
61. Heap and Non-Heap Memory
• Heap Memory
Storage for Java objects
-Xmx<size> & -Xms<size>
• Non-Heap Memory
Per-class structures such as the runtime constant pool, field and method data,
code for methods and constructors, as well as interned Strings
Stores loaded classes and other metadata
The JVM code itself, JVM internal structures, loaded profiler agent code and data, etc.
-XX:MaxPermSize=<size>
• Other
Space the system/OS takes for the process
Thread stacks (-Xss & -Xoss)
System & native space
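A hypothetical launch line combining the controls above (values illustrative; MaxPermSize applies to pre-JDK 8 VMs; MyApp is a placeholder):

    java -Xms512m -Xmx512m -XX:MaxPermSize=256m -Xss512k MyApp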
62. What is Garbage Collection?
Reclaiming memory from objects that are no longer accessible.
63. Stack Overflow or Out of Memory
• If you see OutOfMemoryError: unable to create native thread
– Your application is falling short of native memory space (C space)
– Either there is insufficient memory to allocate the stack and control structures for the new thread
– Or the application has crossed the JVM's memory limit (about 3.2 GB in a 32-bit environment)
– The JVM/application hangs with this error; a restart is needed
• See if you can reduce the number of active threads that ate away the system's memory
• Or see if you can decrease the stack size to reduce memory use per thread
• If you can't bring memory consumption down, you need more system memory
• If you see StackOverflowError
– It means the thread that threw this error ran short of stack memory
space
– A thread pushes the state of each method it invokes onto its stack memory
– The stack is too small for the number of nested invocations the thread
makes
– Only the thread dies with this error; the application doesn't hang
• See if you can bring down the number of nested invocations made by the thread
• Or else, increase the stack size with the VM option -Xss (typically 1 MB by default)
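A minimal sketch demonstrating the stack-limit behaviour described above (class name illustrative); run it with different -Xss values and the reported depth shrinks with the stack size:

    public class StackDepth {
        static int depth = 0;

        public static void main(String[] args) {
            try {
                recurse();
            } catch (StackOverflowError e) {
                // only this thread's stack is exhausted; the JVM itself keeps running
                System.out.println("StackOverflowError at depth " + depth);
            }
        }

        static void recurse() {
            depth++;
            recurse();  // unbounded nesting eventually exceeds the thread stack
        }
    }

For example, compare java -Xss256k StackDepth with java -Xss2m StackDepth.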
64. Pros and Cons of Garbage Collection?
Advantages:
• Increased reliability
• Easier to write complex apps
• No memory leaks or invalid pointers
Disadvantages:
• Unpredictable application pauses
• Increased CPU/memory utilization
• Brutally complex
65. GC Logging
• Java garbage collection activity may be recorded in a log
file. VM options:
– -verbose:gc (enable GC logging; outputs to stdout)
– -Xloggc:<file> (GC logging to a file)
– -XX:+PrintGCDetails (detailed GC records)
– -XX:+PrintGCDateStamps (absolute instead of relative timestamps)
– Note: from relative timestamps in a GC log we can find absolute times either by tracing forward from
the application/GC start or backwards from the application/GC stop
• Asynchronous garbage collection occurs automatically whenever
available memory is low.
• System.gc() does not force a synchronous garbage
collection; it just gives a hint to the VM. VM option:
– -XX:+DisableExplicitGC - disable explicit GC
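A typical invocation combining these flags might look like this (log file name and main class are illustrative):

    java -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps MyApp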
66. What to look for in GC Logs?
• Important information from GC logs
– The size of the heap after garbage collection
– The time taken to run the garbage collection
– The number of bytes reclaimed by garbage collection
• The heap size after GC gives a good idea of the live-data size, and hence the
memory requirement.
– 36690K->35325K(458752K), 4.3713348 secs – (1365K reclaimed)
• The other two help us assess the cost of GC to your
application.
• All of them together help us tune GC.
67. How to Calculate Impact of GC on your
Application?
• Run a test (60 sec; collect GC logs)
– 36690K->35325K(458752K), 4.3713348 secs – (1365K reclaimed)
– 42406K->41504K(458752K), 4.4044878 secs – (902K reclaimed)
– 48617K->47874K(458752K), 4.5652409 secs – (770K reclaimed)
• Measure
– Out of 60 sec, GC ran for about 13.3 sec, i.e. roughly 22% of the time.
– Considering relative CPU utilization, the cost of GC may be even higher.
– 3037K of memory was recycled in 60 secs, i.e. 51831 bytes/second.
• Analyze
– 22% of the time consumed by GC is too high (it should be between 5% and 15%).
– Is 51831 bytes/sec of recycled memory justifiable for the operations performed?
– At an average object size of 50 bytes, that is a churn of around 1036 objects/sec.
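The measurement above can be automated with a small parser; a minimal sketch, assuming the simple "-verbose:gc" line format shown above:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class GcOverhead {
        public static void main(String[] args) {
            String[] lines = {
                "36690K->35325K(458752K), 4.3713348 secs",
                "42406K->41504K(458752K), 4.4044878 secs",
                "48617K->47874K(458752K), 4.5652409 secs",
            };
            // before-K -> after-K (total-K), pause seconds
            Pattern p = Pattern.compile("(\\d+)K->(\\d+)K\\(\\d+K\\), ([\\d.]+) secs");
            double gcSeconds = 0;
            long reclaimedK = 0;
            for (String line : lines) {
                Matcher m = p.matcher(line);
                if (m.find()) {
                    reclaimedK += Long.parseLong(m.group(1)) - Long.parseLong(m.group(2));
                    gcSeconds += Double.parseDouble(m.group(3));
                }
            }
            double windowSeconds = 60.0;  // length of the test window
            System.out.printf("GC time %.1fs (%.0f%% of %.0fs), reclaimed %dK%n",
                    gcSeconds, 100 * gcSeconds / windowSeconds, windowSeconds, reclaimedK);
        }
    }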
68. Heap Ranges – Xms to Xmx
• The heap range can be defined
– VM args -Xmx & -Xms define the upper and lower bounds of the heap size
• What causes the VM to expand the heap?
– Expanding the heap is CPU intensive and can leave the heap fragmented.
– The VM tries GC, defragmentation, compaction, etc. to free up memory.
– If it is still unable to free the required memory, the VM decides to expand the heap.
– The VM may not wait until the brink; it keeps some free space for temporary objects.
– By default, Sun tries to keep the proportion of free space to living objects at each
garbage collection within the 40%-70% range.
• If less than 40% of the heap is free after GC, expand the heap
• If more than 70% of the heap is free after GC, contract the heap
– VM args that customize the default ratio:
• -XX:MinHeapFreeRatio
• -XX:MaxHeapFreeRatio
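For illustration, the ratio flags are passed like any other -XX option (the values shown are the defaults; heap sizes and MyApp are placeholders):

    java -Xms256m -Xmx1024m -XX:MinHeapFreeRatio=40 -XX:MaxHeapFreeRatio=70 MyApp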
69. Gross Heap Tuning
• Consequences of large heap sizes
– GC cycles occur less frequently, but each sweep takes longer.
– Long GC cycles may induce perceptible pauses in the system.
– If the heap grows larger than the available RAM, paging/swapping may occur.
• Consequences of small heap sizes
– GC runs too frequently, with less recovery in each cycle.
– The cost of GC increases.
– Since GC has to sweep less space each time, pauses are imperceptible.
• Max versus min heap sizes
– Contraction and expansion of the heap is costly and should be worth the cause.
– Frequent contraction and expansion also leads to a fragmented heap.
– Keep Xmx=Xms for a transaction-oriented system that frequently peaks.
– Keep Xms<Xmx if the application only infrequently operates at upper capacity.
70. We Just Learnt Gross Heap
Tuning
There might still be a need for fine tuning.
• We can fine-tune the GC considering the intricacies of the
GC algorithm and heap structure. We will learn this shortly.
• Gross heap tuning is quite simple yet effective and
empirically established.
• Gross techniques are fairly effective irrespective of the
variables and, most importantly, we can almost always afford to
apply them.
71. What is the advanced heap made of?
The one that works with Generational Garbage Collector in JVM
• The HEAP is made up of
– Old Space or Tenure Space
• Objects that grow old in the young space are promoted here.
– Young Space or Eden Space
• Young objects are held here.
– Scratch Space
• Working space for the algorithms
– New Space
• <Young Space> + <Scratch Space>
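Not from the slides, but for illustration: HotSpot exposes flags that size the generations described above (values and MyApp are illustrative):

    java -Xms1g -Xmx1g -XX:NewRatio=3 -XX:SurvivorRatio=8 MyApp

Here NewRatio=3 keeps the old space three times the size of the new space, and SurvivorRatio=8 sizes Eden at eight times each survivor (scratch-like) space.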
77. Heap Dump (Java)
A snapshot of the process's memory at a point in time.
VMs usually invoke a GC before dumping the heap.
It contains
• Objects (class, fields, primitive values and references)
• Classes (classloader, name, super class, static fields)
• GC roots (objects defined to be reachable by the JVM)
• Thread stacks (as of dump time, with per-frame information about local objects)
It does not contain
• Allocation information
Who created the objects, and where were they created?
• Live & stale
Used memory consists of both live and dead objects; since the JVM
usually runs a GC before generating a heap dump, most dead objects are already gone.
Tools may additionally remove objects unreachable from the GC roots when
loading the dump.
78. Heap Dump (Java)
How to take it?
• On demand
VM arg (> JDK 1.4.2_12): -XX:+HeapDumpOnCtrlBreak
Tools (JDK 6): JConsole, VisualVM, MAT
jmap -d64 -dump:format=b,file=<file-bin-hdump> <pid>
• Automatically on a crash
VM arg: -XX:+HeapDumpOnOutOfMemoryError
• Postmortem after a crash, from a core dump
jmap -d64 -dump:format=b,file=<file> <java-bin> <core-file>
79. Heap Dump (Java)
Shallow vs Retained Heap
Shallow heap
• The memory held by the object itself: its primitive fields and reference variables
• Excludes the referenced objects; only the references themselves (32/64 bits) are counted
Retained heap
• The object's shallow size plus the shallow sizes of all objects that are
accessible, directly or indirectly, only from this object
• The memory that would be freed by the GC when this object is collected
Garbage collection roots
• A garbage collection root is an object accessible from outside the heap.
• GC root objects will not be collected by the garbage collector at the time
of measuring; typical roots are locals (Java/native), threads, system classes, JNI
references, monitors, and objects pending finalization.
80. Shallow vs. Retained Heap
http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fconcepts%2Fshallowretainedheap.html
In general, the retained size of a GC root is an integral measure that helps in understanding
the memory consumed by object graphs.
81. Dominator Tree
(Object Dependencies)
• Identifies the chunks of retained memory and what keeps them alive.
• In the dominator tree, each object is the immediate dominator of its children, so
dependencies between the objects are easily identified.
• The edges in the dominator tree do not directly correspond to object references in
the object graph; the same object may actually be in the retained set of multiple roots.
• http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fconcepts%2Fshallowretainedheap.html
82. OQL (Object Query Language)
Heap Dump not just for
Troubleshooting
• OQL is an Object Query Language that lets us query the heap dump in SQL
fashion.
• This enables us to analyze the heap not only after problems occur but also to proactively search
for patterns. For example, a select to see whether there are more than two objects for Boolean;
ideally two, TRUE and FALSE (singletons, like enums), are sufficient:
select toHtml(a) + " = " + a.value from java.lang.Boolean a
where objectid(a.clazz.statics.TRUE) != objectid(a)
&& objectid(a.clazz.statics.FALSE) != objectid(a)
(Runs in VisualVM.)
• VisualVM and MAT both provide nice interfaces for OQL.
http://docs.oracle.com/javase/6/docs/technotes/tools/share/jhat.html
http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fwelcome.html
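Another query in the same spirit, adapted from the jhat OQL examples linked above (s.count is the internal length field of java.lang.String in pre-JDK 7u6 heaps), finds strings of 100 characters or more:

select s from java.lang.String s where s.count >= 100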