Kingshuk Dasgupta leads pLab, which helps institutionalize performance engineering across the enterprise. pLab aims to promote performance awareness, educate teams, benchmark technologies, and build a shared testing environment. Performance is defined as a system's ability to meet objectives for response time, stability, scalability, and efficiency. Issues caused by poor performance include increased costs and lost income/competitiveness. pLab monitors key metrics like response times and outages to ensure service level targets are met.
13. Dedicated, isolated environment in CyrusOne, Lewisville
• Hardware configuration for various applications is the same as in production
• Environment includes: F5; SAN and NAS capabilities; Fabric components (MOM, USG, SSG, NOFEP, BBIS, ICE Simulator); ATSE components (Intellisell, MIP, IS, ASv2, DSSv2, Pricing, Oracle, Instrumentation Database, App Console)
• Load Runner and custom-built load drivers
• Automated performance testing, monitoring and data collection framework
14. Performance and Reliability process (responsibilities shared between pLab and the application team):
• Goal Setting: document response time requirements; workload characterization; document availability requirements; document service level expectations (GC, error rate, etc.)
• Establishing Test Environment: hardware acquisition; platform setup (OS, infrastructure); application setup; database setup; possible use of Flex Lab
• Test Planning: load, stress and soak test requirements; test harness design
• Test Harness Preparation: load drivers; setup data; mocks / simulators (pLab mock framework); hardware, middleware and application monitoring (pLab monitoring framework); reporting / charting / data visualization
• Test Execution: run planned tests and collect results; analysis of test results; generate report
• Performance Optimization: tuning (some support on tuning available from pLab)
• Release Signoff: go / no-go decision
21. pLab service areas:
• Performance Testing: release testing; project-centered testing; ad hoc testing; Flex Lab
• Benchmarking: Cookie Cutter architecture; appliances
• Software Optimization/Tuning: code optimization; profiling; platform tuning; OS tuning; JVM tuning
• Performance Oriented Design/Consulting: PE planning and test harness design; architecture review; patterns; anti-patterns
22. Application portfolio prioritization:
• Category A: highest focus area; pLab takes ownership of “Reliability Growth” through performance testing
• Category B: closely monitored and supported by pLab
• Category C: lower-priority applications
[Matrix: applications plotted by Technical Quality (low → high) against Business Criticality (low → high) and binned into categories A, B and C. Applications shown: Air Crews, Movement Manager, Rev Accounting, Centiva, SSCI (Kiosk, Web), Schedule Manager, SSW2, Crew Control, Flightline, Load Manager.]
23. Solution engagement lifecycle:
• Review: engineering / capacity planning reviews; joint performance and E2E engineering risk assessment
• Planning: deliver Performance Engineering Test (PET) plan; plan for 4 rounds of performance testing in CERT; identify high-risk Ops products
• Execution: create a test harness for every injection point on the critical path; execute tests, gather metrics for all systems, analyze; certify release (Go/No-Go)
• Post-Cutover Support: monitor production performance; repeat performance tests if necessary
Editor's Notes
History: 00-2007 functional automation team -> Dolly hired SK -> Robert funded pLab -> Benz’s team merged -> built ATSE lab -> moved to engineering
Worked with many of Sabre’s most critical products from a performance testing standpoint
Have gained many hours of experience running reliable performance tests to uncover critical system performance bottlenecks
We are now uniquely poised to carry our learnings across the organization and to be the performance center of excellence: not merely for performance testing (although that remains the core), but also for performance best practices, preventive engineering, and building scalable systems.
pLab is synergistically placed to work with other arms of the EE organization.
Following the Pareto Principle, 80% of our business and technical opportunities and complexities reside with 20% of our systems. Some critical systems we are heavily engaged with are ATSE, Fabric, ePOS and CSS.
We can’t support everyone directly, but by evangelizing performance awareness, we can multiply our force. Some of the classes/brown bags coming out of engineering on performance engineering/multi core are part of this effort
Consulting model
Blade benchmarking, terracotta, ServiceMix, JDK 1.6, now Azul system
Flex Lab
Can we ban a technology/tool if the performance is really horrible
Note: performance testing process!!!!!
Describe what are the goals of the other arms and how pLab interoperates with these arms
- How good the results are
- How quickly the response shows up
A shopping transaction originated in Expedia -> web services-> ATSE -> Sabre or another GDS for availability
The customer should see the quality response in 3 seconds
The ATSE shopping servers should spend no more than XX CPU seconds on it
If expedia chooses to double their business with us, we should scale without getting dramatically more expensive
If systems are unstable expedia will take the business somewhere else
More performance-related Sev calls, involving many of your application staff as well as several other enterprise resources
Fixing performance related issues late in the development lifecycle is very expensive. Sometimes fundamental architecture tradeoffs need to be made (give an example)
If tuning / redesign isn’t sufficient to solve the problem, you need to throw more hardware at your problems
In some cases, this may lead to loss of confidence of stakeholders and cancelled projects
Can cause your organization’s image to suffer. – give example of WCT & Hotels
Lost Income – either because of bad performance and irate users abandoning; or lost income because of delayed projects
Tell the story of Travel Now and eHotels (last Sev 1 almost 2 years ago).
E.g. of Sev1s found and prevented:
1. pLab was called in to help investigate a memory issue on shopping MIP hex-core boxes in a limited pool in production. pLab tracked it to a data collector issue under Red Hat 5: when unable to write to the instrumentation database, the collector queued records, eventually eating up all memory.
2. A race condition was fixed in the MOM API. When a worker thread picked up a message at exactly 500 ms (the timeout value for the MOM sync thread), the dispatcher thread timed out at the same moment, leaving the worker thread with no dispatcher thread to hand the message off to; the message simply stayed there. The default timeout for these threads was quite large, so eventually every thread in the MOM thread pool ended up in this state and the pool ran out of threads. As a result, the BBIS MOM API could no longer pick up messages.
Where is the art? It is prior knowledge and experience of knowing failure modes due to performance; it is the intuition of knowing what to instrument to give insight into system performance
The science is in the engineering approach to performance – a consistent methodology, a set of measurement tools, standard reports, predictive models
Quantitative – it is important to measure, because anything measured, however inaccurately, (a) can usually be improved upon and (b) beats having no measurement at all.
Latency, Throughput and Utilization – these three are intricately related. Minimizing latency requires low load, while maximizing throughput requires high load; these are contradictory goals, so the optimal load for the system is the one that drives resources to some optimal utilization level.
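The latency/throughput/utilization relationship can be made concrete with Little's Law (N = X × R) and the utilization law (U = X × S); a minimal sketch, with purely illustrative numbers:

```python
# Little's Law: N = X * R relates average concurrency (N), throughput (X),
# and response time (R). The utilization law: U = X * S for service demand S.
def littles_law_concurrency(throughput_tps: float, response_time_s: float) -> float:
    """Average number of requests in flight in the system."""
    return throughput_tps * response_time_s

def utilization(throughput_tps: float, service_demand_s: float) -> float:
    """Fraction of time a resource is busy (must stay below 1.0)."""
    return throughput_tps * service_demand_s

# Example: 200 tps at 0.5 s average response time -> 100 requests in flight.
n = littles_law_concurrency(200, 0.5)   # 100.0
# A CPU spending 4 ms per request is 80% busy at 200 tps (saturates at 250 tps).
u = utilization(200, 0.004)
```

These two identities are why pushing load past the optimal utilization point raises latency without adding throughput.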
The success of any performance engineering initiative rests on the ability to run a good performance test.
What makes a good performance test? One that will be useful in predicting the performance of the system in production. However, we don’t have the true production load, nor the true production servers, nor the true production integration points.
Testing at the right load levels, with the right workload mix and measuring the right things; and having an engine to repeat this to compare against an established baseline as the product evolves; mocks / simulators
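Repeating tests against an established baseline as the product evolves amounts to a regression check; a minimal sketch, with hypothetical metric names and thresholds:

```python
def compare_to_baseline(current: dict, baseline: dict, tolerance: float = 0.10):
    """Flag any metric that regressed more than `tolerance` versus the baseline.
    Assumes lower is better for every metric (latency, CPU cost, error rate)."""
    regressions = {}
    for metric, base in baseline.items():
        cur = current.get(metric)
        if cur is not None and cur > base * (1 + tolerance):
            regressions[metric] = (base, cur)
    return regressions

# Illustrative numbers only: p95 response time slipped past the 10% tolerance.
baseline = {"p95_response_s": 2.0, "cpu_s_per_txn": 0.8}
current  = {"p95_response_s": 2.5, "cpu_s_per_txn": 0.75}
bad = compare_to_baseline(current, baseline)
```

A check like this is what makes the repeated test runs comparable from release to release.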
Load testing is conducted to verify that your application can meet your desired performance objectives; these performance objectives are often specified in a service level agreement (SLA). A load test enables you to measure response times, throughput rates, and resource-utilization levels, and to identify your application’s breaking point, assuming that the breaking point occurs below the peak load condition.
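The custom-built load drivers mentioned earlier follow this closed-loop pattern; a minimal sketch, where `fake_transaction` is a hypothetical stand-in for a real traffic injection point:

```python
import statistics
import threading
import time

def run_load_test(target, concurrency: int, requests_per_thread: int):
    """Closed-loop load driver: each virtual user issues requests
    back-to-back and records per-request latency."""
    latencies, lock = [], threading.Lock()

    def virtual_user():
        for _ in range(requests_per_thread):
            start = time.perf_counter()
            target()                        # the transaction under test
            elapsed = time.perf_counter() - start
            with lock:
                latencies.append(elapsed)

    threads = [threading.Thread(target=virtual_user) for _ in range(concurrency)]
    wall_start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.perf_counter() - wall_start
    return {
        "throughput_tps": len(latencies) / wall,
        "mean_s": statistics.mean(latencies),
        "p95_s": statistics.quantiles(latencies, n=20)[-1],  # 95th percentile
    }

# Hypothetical stub standing in for a real service call.
def fake_transaction():
    time.sleep(0.001)

report = run_load_test(fake_transaction, concurrency=4, requests_per_thread=25)
```

Real drivers add ramp-up, think time, and a workload mix, but the measurement loop is the same.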
Soak testing is a subset of load testing. An endurance test is a type of performance test focused on determining or validating the performance characteristics of the product under test when subjected to workload models and load volumes anticipated during production operations over an extended period of time.
The goal of stress testing is to reveal application bugs that surface only under high load conditions. These bugs can include such things as synchronization issues, race conditions, and memory leaks. Stress testing enables you to identify your application’s weak points, and shows how the application behaves under extreme load conditions. Spike testing is a subset of stress testing. A spike test is a type of performance test focused on determining or validating the performance characteristics of the product under test when subjected to workload models and load volumes that repeatedly increase beyond anticipated production operations for short periods of time.
Capacity testing is conducted in conjunction with capacity planning, which you use to plan for future growth, such as an increased user base or increased volume of data. For example, to accommodate future loads, you need to know how many additional resources (such as processor capacity, memory usage, disk capacity, or network bandwidth) are necessary to support future usage levels. Capacity testing helps you to identify a scaling strategy in order to determine whether you should scale up or scale out.
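The scale-out side of that planning reduces to simple arithmetic once per-server throughput is measured; a minimal sketch, with illustrative numbers and an assumed 60% target utilization headroom:

```python
import math

def servers_needed(peak_tps: float, per_server_tps: float,
                   target_utilization: float = 0.6) -> int:
    """Scale-out estimate: servers required to keep each node at or
    below the target utilization at peak load."""
    return math.ceil(peak_tps / (per_server_tps * target_utilization))

# Example: 120 tps peak today on nodes that sustain 50 tps each.
today = servers_needed(120, 50)      # 4 servers
# If the customer doubles their business (240 tps peak):
doubled = servers_needed(240, 50)    # 8 servers
```

Capacity testing supplies the measured `per_server_tps`; everything else is a business projection.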
1. Impulses may occur because of sudden changes in the environment – e.g. Slashdot effect or a snow-storm for Airline ops system. What we are looking for is graceful handling of such exception conditions – ideally, the ability to protect yourself and cause no harm to others. If there is a single customer out of many that can be throttled, that is ideal. Recovery when load decreases is important as well.
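Per-customer throttling of the kind described above is commonly done with a token bucket; a minimal sketch (the customer name and rates are illustrative, not production values):

```python
import time

class TokenBucket:
    """Per-customer throttle: admits at most `rate` requests/second,
    with bursts up to `burst`, shedding excess load gracefully."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # reject (or queue) rather than harm other customers

# One bucket per customer isolates an impulse from a single source.
buckets = {"some_customer": TokenBucket(rate=100, burst=10)}
admitted = sum(buckets["some_customer"].allow() for _ in range(50))
# An instantaneous 50-request spike admits roughly the burst allowance.
```

Because recovery matters too, the refill step means the customer is automatically served again as soon as their load drops back under the rate.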
WestJet – no performance testing per se.
Jet Blue story:
B6 E2E Performance Testing
PET plan creation
Working with capacity planning to lay out the workload for the tests
Test scripts for every traffic injection point
Individual product testing in pLab & E2E performance testing (Load, F5 load balancing, failover and soak tests) in CERT (coordinating the teams)
Data analysis and reporting
Volaris, SW SNAP and now AeroMexico