Introduction to pLab
Kingshuk Dasgupta
Tuli Nivas
pLab – Enterprise Engineering
To answer four questions:
 What is pLab?
 What is Performance?
 Approach / Strategy
 How to use pLab
EE (Enterprise Engineering)
 System Engineering
 Performance Lab
 Software Infrastructure Engineering
 Security Engineering
 Institutionalize performance engineering for critical products
 Promote performance awareness across the enterprise
 Educate, mentor and consult with product teams
 Benchmark new technologies and hardware
 Build shared performance testing environment, foster performance
Response time · Stability · Scalability · Efficiency
Performance is the degree to which a software system or software
component meets its objectives for response time, stability,
scalability and resource consumption.
 Increased operational cost
 Increased development cost
 Increased hardware cost
 Canceled projects
 Damaged customer relations
 Lost income
 Reduced competitiveness
Source: SPE (Lloyd G. Williams & Connie U. Smith)
Availability targets by domain (2009 vs. 2010):
 Air Services – target 99.66%
 Customer Access – target 99.83%
 Airline Operations – target 99.49%
 Online Booking – target 99.25%
The art and science of quantitatively measuring, understanding and
tuning the latency, throughput, and utilization of computer systems.
Key measurements: Response Time · Throughput · Resource Utilization · Failure Rate
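These quantities are not independent. A standard way to relate them (added here for context, not stated on the slide) is Little's Law together with the utilization law:

  N = X x R   (requests in flight = throughput x response time)
  U = X x S   (utilization = throughput x per-request service demand)

Worked example: at X = 200 requests/s and mean R = 0.5 s, the system holds N = 100 requests in flight; if each request needs S = 4 ms of CPU on one core, that core runs at U = 200 x 0.004 = 80%.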
Performance Testing
Plan · Assess Risk · Analyze · Optimize
• Environment Setup
• Test Harness Creation
• Test Execution
• Test Report Generation
Types of performance tests:
• Load test
• Soak test
• Destructive test
• Impulse test
• Resiliency test
• Capacity impact test
[Chart: load profile over time, stepping between 0.2X peak load, peak load, and 3X peak load]
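As an illustration of how such a stepped profile can be driven, here is a minimal load-driver sketch in Java. It is not pLab's actual Load Runner scripting; the class name, the 50 req/s peak rate and the simulated 20 ms transaction are assumptions for illustration only.

import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Minimal stepped-load driver sketch: each step fixes a request rate
// and records per-request latency.
public class SteppedLoadDriver {
    static final int PEAK_TPS = 50;                  // assumed peak load, requests/sec
    static final double[] STEPS = {0.2, 1.0, 3.0};   // 0.2X, 1X, 3X of peak
    static final int STEP_SECONDS = 10;              // duration of each step

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);
        ExecutorService workers = Executors.newCachedThreadPool();
        for (double step : STEPS) {
            int tps = (int) Math.round(PEAK_TPS * step);
            AtomicLong totalMicros = new AtomicLong();
            AtomicLong count = new AtomicLong();
            // Submit `tps` requests every second for STEP_SECONDS seconds.
            ScheduledFuture<?> tick = scheduler.scheduleAtFixedRate(() -> {
                for (int i = 0; i < tps; i++) {
                    workers.submit(() -> {
                        long start = System.nanoTime();
                        simulatedRequest();          // stand-in for the real transaction
                        totalMicros.addAndGet((System.nanoTime() - start) / 1000);
                        count.incrementAndGet();
                    });
                }
            }, 0, 1, TimeUnit.SECONDS);
            Thread.sleep(STEP_SECONDS * 1000L);
            tick.cancel(false);
            System.out.printf("step %.1fX: %d requests, mean latency %d us%n",
                    step, count.get(), count.get() == 0 ? 0 : totalMicros.get() / count.get());
        }
        scheduler.shutdown();
        workers.shutdown();
    }

    // Placeholder transaction; a real driver would call the system under test here.
    static void simulatedRequest() {
        try { Thread.sleep(20); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}

The same loop structure covers a soak test (one long step at 1X) or an impulse test (a short burst well above peak followed by a return to normal load).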
 Dedicated, isolated environment in CyrusOne, Lewisville
 Hardware configuration for the various applications is the same as in production
 Environment includes
 F5
 SAN and NAS capabilities
 Fabric components (MOM, USG, SSG, NOFEP, BBIS, ICE Simulator)
 ATSE components (Intellisell, MIP, IS, ASv2, DSSv2, Pricing, Oracle, Instrumentation Database, App Console)
 Load Runner and custom-built load drivers
 Automated performance testing, monitoring and data collection framework
Performance and Reliability Goal Setting
• Document response time requirements
• Workload characterization
• Document availability requirements
• Document service level expectations (GC, error rate etc.)
Establishing Test Environment
• Hardware acquisition
• Platform setup – OS, infrastructure
• Application setup
• Database setup
• Possible use of Flex Lab
Test Planning
• Load, stress and soak test requirements
• Test harness design
Test Harness Preparation
• Load drivers
• Setup data
• Mocks / simulators
• pLab monitoring framework
• pLab mock framework
Test Execution
• Run planned tests and collect results
• Hardware, middleware and application monitoring
• Reporting / charting / data visualization
Performance Optimization
• Analysis of test results
• Tuning
• Some support on tuning available
Release Signoff
• Go / no-go decision
• Generate report
pLab and Application Team test cycle:
1. Understand Business/Application Needs
2. Identify Acceptance Criteria
3. Identify Test Environment
4. Plan and Design Tests
5. Execute the Tests
6. Analyze Results, Report and Retest
Plan and DesignTests
Load Driver
Response Time
Throughput
ErrorTypes &
Percentages
Application
Application
Metrics
JVM
ESSM/ CLR
Metrics
Database
Connections
Sessions
Errors
Resource
Utilization
System
CPU
Memory
Network
Disk
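For the JVM and system rows above, many of these counters can be sampled in-process with the standard java.lang.management MXBeans. This is a minimal sketch, not pLab's actual collection framework:

import java.lang.management.*;

// Snapshot of JVM heap, GC, OS and thread metrics using standard MXBeans.
public class JvmMetricsSnapshot {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used/committed: %d / %d MB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20);

        // Cumulative GC counts and pause time per collector (e.g. young / old gen).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("gc %s: count=%d time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }

        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        System.out.printf("load average: %.2f, processors: %d%n",
                os.getSystemLoadAverage(), os.getAvailableProcessors());

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.printf("live threads: %d%n", threads.getThreadCount());
    }
}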
YaketyStats
Sample usage:
http://plabptl020.dev.sabre.com/yaketystats/jart/index.php?pl=IndividualServers/plab202
Collector – the client; it collects stats and sends them to the server.
Stuffer – the server; it accepts stats from the client, puts them in a file
system, and once every 5 minutes "stuffs" these stats into the RRD file.
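Purely as an illustration of the collector/stuffer split (this is not YaketyStats code; the host, port and line format are assumptions), a collector is essentially a loop that samples values and ships timestamped name/value lines to the stuffer:

import java.io.PrintWriter;
import java.net.Socket;

// Illustrative collector loop: sample a value periodically and send
// "<epoch-seconds> <metric-name> <value>" lines to a stuffer process.
public class MiniCollector {
    public static void main(String[] args) throws Exception {
        String stufferHost = "plab-stuffer.example";   // assumed host, for illustration only
        int stufferPort = 9999;                        // assumed port
        try (Socket socket = new Socket(stufferHost, stufferPort);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            while (true) {
                long now = System.currentTimeMillis() / 1000;
                double load = java.lang.management.ManagementFactory
                        .getOperatingSystemMXBean().getSystemLoadAverage();
                out.printf("%d host.loadavg %.2f%n", now, load);
                Thread.sleep(60_000);                  // one sample per minute
            }
        }
    }
}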
pLab and Application Team test cycle:
1. Understand Business/Application Needs
2. Identify Acceptance Criteria
3. Identify Test Environment
4. Plan and Design Tests
5. Execute the Tests
6. Analyze Results, Report and Retest
pLab PET Plans: http://wiki.sabre.com/confluence/display/EOP/Performance
 OS tuning
• System library tuning
• Kernel tuning
• TCP/IP tuning
 JVM tuning
• Garbage collection tuning
 Application tuning
• Profiling
• Memory allocation tuning
• Thread contention (see the sketch after this list)
• Algorithm optimization
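For the thread contention item, one low-overhead way to quantify contention during a test run is the JDK's ThreadMXBean. This is a minimal sketch; the class name, measurement window and 100 ms threshold are illustrative assumptions:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Report threads that spend significant time blocked on monitors,
// a common symptom of lock contention under load.
public class ContentionReport {
    public static void main(String[] args) throws Exception {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (bean.isThreadContentionMonitoringSupported()) {
            bean.setThreadContentionMonitoringEnabled(true);
        }
        Thread.sleep(10_000);   // let the workload run for a measurement window
        for (ThreadInfo info : bean.getThreadInfo(bean.getAllThreadIds())) {
            if (info != null && info.getBlockedTime() > 100) {   // > 100 ms blocked
                System.out.printf("%s blocked %d ms over %d contention events%n",
                        info.getThreadName(), info.getBlockedTime(), info.getBlockedCount());
            }
        }
    }
}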
Performance Testing
• Release testing
• Project-centered testing
• Ad hoc testing
• Flex Lab
Benchmarking
• Cookie-cutter architecture
• Appliances
• Software
Optimization / Tuning
• Code optimization
• Profiling
• Platform tuning
• OS tuning
• JVM tuning
Performance Oriented Design / Consulting
• PE planning and test harness design
• Architecture review
• Patterns
• Anti-patterns
 Category A: Highest focus area; pLab takes ownership for "Reliability
Growth" through performance testing
 Category B: Closely monitored and supported by pLab
 Category C: Lower priority applications
[Matrix: applications are placed on a grid of business criticality (low to
high) vs. technical quality (low to high) to assign categories A, B and C]
Applications on the grid include: Air Crews, Movement Manager, Rev
Accounting, Centiva, SSCI (Kiosk, Web), Schedule Manager, SSW2, Crew
Control, Flightline, Load Manager
Solution Review
• Engineering / capacity planning reviews
• Joint performance and E2E engineering risk assessment
Planning
• Deliver Performance Engineering Test (PET) plan
• Plan for 4 rounds of performance testing in CERT
• Identify high-risk Ops products
Execution
• Create test harness for every injection point on the critical path
• Execute tests, gather metrics for all systems, analyze
• Certify release (Go/No-Go)
Post Cutover Support
• Monitor production performance
• Repeat performance tests if necessary
pLab system owners meeting v2
Editor's Notes

  • #4 History: 2000-2007 functional automation team -> Dolly hired SK -> Robert funded pLab -> Benz's team merged -> built ATSE lab -> moved to Engineering. We have worked with many of Sabre's most critical products from a performance testing standpoint and gained many hours of experience running reliable performance tests that uncover critical system performance bottlenecks. We are now uniquely poised to carry our learnings across the organization and be the performance center of excellence, not merely for performance testing (although that remains the core) but also for performance best practices, preventive engineering and building scalable systems. pLab is synergistically placed to work with the other arms of the EE organization. Following the Pareto principle, 80% of our business and technical opportunities and complexity resides in 20% of our systems; some critical systems we are heavily engaged with are ATSE, Fabric, ePOS and CSS. We can't support everyone directly, but by evangelizing performance awareness we can multiply our force; some of the classes/brown bags coming out of Engineering on performance engineering and multi-core are part of this effort. Consulting model. Blade benchmarking, Terracotta, ServiceMix, JDK 1.6, now Azul Systems. Flex Lab. Can we ban a technology/tool if its performance is really horrible? Note: cover the performance testing process, describe the goals of the other arms and how pLab interoperates with them.
  • #5 How good are the results? How quickly does the response show up? A shopping transaction originates in Expedia -> Web Services -> ATSE -> Sabre or another GDS for availability. The customer should see a quality response in 3 seconds, and the ATSE shopping servers should spend no more than XX CPU seconds on it. If Expedia chooses to double their business with us, we should scale without getting dramatically more expensive. If systems are unstable, Expedia will take the business somewhere else.
  • #6 More performance-related Sev calls involve many of your application staff as well as several other enterprise resources. Fixing performance-related issues late in the development lifecycle is very expensive; sometimes fundamental architecture trade-offs need to be made (give an example). If tuning or redesign isn't sufficient to solve the problem, you need to throw more hardware at it. In some cases this leads to loss of stakeholder confidence and cancelled projects, and it can cause your organization's image to suffer (give the example of WCT & Hotels). Lost income comes either from bad performance and irate users abandoning, or from delayed projects. Tell the story of Travel Now and eHotels (last Sev 1 almost 2 years ago).
  • #8 Examples of Sev1s found and prevented: 1. pLab was called in to help investigate a memory issue on shopping MIP hex-core boxes in a limited pool in production. pLab tracked it to a data collector issue under Red Hat 5: when unable to write to the instrumentation database, it queued requests and eventually ate up all memory. 2. The MOM API was fixed for a race condition. When a worker thread picks up a message at exactly 500 ms (the timeout value for the MOM sync thread), the dispatcher thread times out at the same moment; the worker thread then has no dispatcher thread to hand the message off to, and the message just stays there. The default timeout for this thread was quite large, so eventually all threads from the MOM thread pool ended up in this situation and the pool ran out of threads. As a result, the BBIS MOM API was unable to pick up messages.
  • #9 Where is the art? It is the prior knowledge and experience of knowing failure modes due to performance; it is the intuition of knowing what to instrument to gain insight into system performance. The science is in the engineering approach to performance: a consistent methodology, a set of measurement tools, standard reports, predictive models. Quantitative: it is important to measure, because anything measured, however inaccurately, (a) can usually be improved upon and (b) is better than no measurement at all. Latency, throughput and utilization are intricately related: minimizing latency requires low load while maximizing throughput requires high load, two contradictory goals, so the optimal load for the system sits at some optimal utilization level of its resources.
  • #10 Latency, throughput and utilization are intricately related: minimizing latency requires low load while maximizing throughput requires high load, two contradictory goals, so the optimal load for the system sits at some optimal utilization level of its resources.
  • #11 The success of any performance engineering initiative rests on the ability to run a good performance test. What makes a good performance test? One that is useful in predicting the performance of the system in production. However, we have neither the true production load, nor the true production servers, nor the true production integration points. That means testing at the right load levels with the right workload mix, measuring the right things, and having an engine to repeat this and compare against an established baseline as the product evolves; mocks / simulators.
  • #12 Load testing is conducted to verify that your application can meet your desired performance objectives; these performance objectives are often specified in a service level agreement (SLA). A load test enables you to measure response times, throughput rates, and resource-utilization levels, and to identify your application’s breaking point, assuming that the breaking point occurs below the peak load condition. Soak testing is a subset of load testing. An endurance test is a type of performance test focused on determining or validating the performance characteristics of the product under test when subjected to workload models and load volumes anticipated during production operations over an extended period of time. The goal of stress testing is to reveal application bugs that surface only under high load conditions. These bugs can include such things as synchronization issues, race conditions, and memory leaks. Stress testing enables you to identify your application’s weak points, and shows how the application behaves under extreme load conditions. Spike testing is a subset of stress testing. A spike test is a type of performance test focused on determining or validating the performance characteristics of the product under test when subjected to workload models and load volumes that repeatedly increase beyond anticipated production operations for short periods of time. Capacity testing is conducted in conjunction with capacity planning, which you use to plan for future growth, such as an increased user base or increased volume of data. For example, to accommodate future loads, you need to know how many additional resources (such as processor capacity, memory usage, disk capacity, or network bandwidth) are necessary to support future usage levels. Capacity testing helps you to identify a scaling strategy in order to determine whether you should scale up or scale out.
  • #13 Impulses may occur because of sudden changes in the environment, e.g. the Slashdot effect, or a snowstorm for an airline ops system. What we are looking for is graceful handling of such exception conditions: ideally, the ability to protect yourself and cause no harm to others. If a single customer out of many can be throttled, that is ideal. Recovery when load decreases is important as well.
  • #24 WestJet: no performance testing per se. JetBlue story: B6 E2E performance testing; PET plan creation; working with capacity planning to lay out the workload for the tests; test scripts for every traffic injection point; individual product testing in pLab and E2E performance testing (load, F5 load balancing, failover and soak tests) in CERT, coordinating the teams; data analysis and reporting. Volaris, SW SNAP and now AeroMexico.