The presentation on Batch Workload Modelling and Performance Optimization was given at #ATAGTR2017, one of the largest global testing conferences. All copyright belongs to the author.
Author and presenter: Ashish Powar
2. Agile Testing Alliance Global Testing Retreat 2017
Objective
This presentation covers the following aspects:
• Importance of Batch Performance tuning
• Commonly faced Issues due to over running batches
• Batch Workload Modelling
• Analyzing Test data requirements for batches
• Parameters to be monitored for batch optimization
• Real example of batch tuning & benefits to the client
• Workload to move to cloud
Importance of Batch Performance Tuning
• Improperly tuned batch processes result in additional hardware upgrade costs.
• Inefficient batches do not make optimal use of available resources.
• Batches that miss their SLA delay the start of the online day.
• Online transactions are impacted by high CPU usage from inefficient batches running in parallel.
A leading UK bank suffered a major IT incident affecting the group's overnight batch processing system, which caused severe disruption to many of its IT systems. The incident left the Group unable to update customer account balances, process payments and participate fully in clearing within normal timeframes.
For a leading insurance provider, a delay in the reconciliation of data blocked customers from renewing their insurance policies. The incident caused many customers to lose confidence in the provider, which also had to deal with major financial implications.
One of the world's largest stock exchanges was brought down by a failure in the computer systems that feed stock prices to banks and brokerages. Further analysis identified that the incident was caused by a delay in the batch feed update.
Commonly faced Issues due to over running batches
• Delay in the start of the business day, which impacts online processing
• Data inconsistency issues - online services depend on the previous night's batch activity
• Database locking - online response time is impacted by table locks held by batches running in parallel
• CPU/Memory - high CPU/memory usage due to batches running in parallel (fig 2)
[Fig 1: Impact on processing time due to DB lock. Average Response Time (Sec) and Concurrent Sessions plotted against time of day, 11:00 to 22:00]
[Fig 2: CPU Utilization. Series: % CPU Batch+Online and % CPU Online only, plotted against time of day, 10:00 to 17:30]
Areas of focus for batch Performance Optimization
• Batch Workload Modelling
• Identifying workload to move to Cloud
• Test data Analysis/Setup
• Batch Execution
• Monitoring & Optimization
Batch Workload Modelling
Workload modelling for batch testing should simulate the following real-time production activities:
• Batches that run in isolation without any other activity in parallel
• Batches initiated as part of the online sync process that run in the background
• Batches that run in parallel with the online activity
The maximum number of batches (3) was executed at 24:00 hrs, with 300 records processed.
[Figure: Batches and Batch Window. Series: No of Batches and Batch window; y-axis: No of batches / duration of batch window; x-axis: Time of day in 2-hour steps]
[Figure: Batches and Records Processed. Series: No of Batches and Records processed; y-axis: No of batches / records processed; x-axis: Time of day in 2-hour steps]
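A workload model of this kind can be sketched as a schedule of batch jobs from which the number of batches running in each hour is derived. The snippet below is only an illustration: the job names, start hours and durations are hypothetical and are not taken from the presentation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchJob:
    name: str            # hypothetical job name
    start_hour: int      # hour of day the job starts (0-23)
    duration_hours: int  # batch window of the job, in whole hours
    with_online: bool    # True if the job overlaps online activity


def concurrency_by_hour(jobs):
    """Count how many batch jobs are running in each hour of the day."""
    running = Counter()
    for job in jobs:
        for offset in range(job.duration_hours):
            running[(job.start_hour + offset) % 24] += 1
    return running


def peak_hours(jobs):
    """Return (peak concurrency, sorted hours at which the peak occurs)."""
    running = concurrency_by_hour(jobs)
    peak = max(running.values(), default=0)
    return peak, sorted(h for h, n in running.items() if n == peak)


# Hypothetical schedule: an isolated overnight job, a background sync job
# and a job that runs alongside online activity.
SCHEDULE = [
    BatchJob("eod_settlement", start_hour=0, duration_hours=3, with_online=False),
    BatchJob("account_sync", start_hour=1, duration_hours=2, with_online=False),
    BatchJob("intraday_feed", start_hour=2, duration_hours=1, with_online=True),
]

peak, hours = peak_hours(SCHEDULE)
print(f"Peak of {peak} concurrent batches at hours {hours}")
```

The peak hours are the ones a performance test should reproduce first, since that is where contention for CPU and database locks is most likely.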
Workload to move to cloud
Criteria to move workload to cloud
• Unpredictable load or potential for growth
• Partial utilization
• Easy parallelization
• Auto scale as per requirement
Benefits of cloud
• Helps the organization pay based on need and usage
[Figure: cloud deployment with DNS, Load Balancer, App Servers, Queue and Cloud Files; the process auto-scales based on jobs in the queue]
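The idea of a process that auto-scales based on jobs in the queue can be sketched as a simple sizing rule. This is a minimal illustration, not a specific cloud provider's API: the function name, the jobs-per-worker ratio and the worker limits are all assumptions.

```python
import math


def desired_workers(queued_jobs, jobs_per_worker=10, min_workers=1, max_workers=20):
    """Return the worker count needed to drain the queue, within fixed bounds."""
    if queued_jobs < 0:
        raise ValueError("queued_jobs must not be negative")
    needed = math.ceil(queued_jobs / jobs_per_worker)
    # Clamp to the allowed range so the pool never scales to zero or without limit.
    return max(min_workers, min(max_workers, needed))


for depth in (0, 35, 500):
    print(depth, "queued ->", desired_workers(depth), "workers")
```

The upper bound is what keeps pay-per-use costs predictable when the queue spikes.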
Analyzing Test data requirements for batches
Batch test data analysis involves the following:
• Reference data required for testing
• Data to be copied on top of reference data
• Input data for execution depending on the functionality
• The composition of data (intersection of fields) is also important to simulate the actual production scenario
• Need for historical data in DB for replication of production scenario
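One way to check the composition of test data against production is to compare how often each combination of field values occurs. The sketch below assumes rows are available as dictionaries; the field names (`segment`, `status`) and the 5% tolerance are hypothetical.

```python
from collections import Counter


def composition(rows, fields):
    """Share of rows for each combination of values in the given fields."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    total = sum(counts.values())
    return {combo: n / total for combo, n in counts.items()}


def composition_gaps(production_rows, test_rows, fields, tolerance=0.05):
    """Combinations whose share in test data differs from production by more than tolerance."""
    prod = composition(production_rows, fields)
    test = composition(test_rows, fields)
    gaps = {}
    for combo in set(prod) | set(test):
        diff = test.get(combo, 0.0) - prod.get(combo, 0.0)
        if abs(diff) > tolerance:
            gaps[combo] = round(diff, 3)
    return gaps


# Hypothetical samples: production has three combinations, test data only one.
production = [
    {"segment": "retail", "status": "active"},
    {"segment": "retail", "status": "active"},
    {"segment": "retail", "status": "closed"},
    {"segment": "corporate", "status": "active"},
]
test_data = [{"segment": "retail", "status": "active"}] * 4

gaps = composition_gaps(production, test_data, fields=("segment", "status"))
print(gaps)
```

A positive value means the combination is over-represented in the test data, a negative value that it is under-represented or missing.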
Parameters to be monitored for batch optimization
[Figure: elapsed time of the batch measured across its jobs (Job 2, Job 3, Job 4)]
• Elapsed Time - The total time taken by the batch to complete
• Throughput - Throughput is critical as batches process large volumes of data
• CPU/Memory Utilization - Needs to stay within the limit the batch job is expected to consume
• I/O Operations - Use of MOM (message-oriented middleware), clustering and shared storage to spread the load
• DB Connection Pool/Thread Pool - Both parameters need to be monitored to find the optimal value for the batch
• Slow running Queries - Identify slow running queries in the DB
[Figure: Batch Parameters: Elapsed Time, Throughput, I/O, CPU/Memory Utilization, Slow running Queries, DB Connection Pool/Thread Pool]
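Elapsed time and throughput, the first two parameters above, can be captured per job with a small wrapper. This is a minimal sketch using only the Python standard library; the helper name `batch_monitor` and the job name are hypothetical, and the loop stands in for real record processing.

```python
import time
from contextlib import contextmanager


@contextmanager
def batch_monitor(job_name, metrics):
    """Record elapsed time and throughput for one batch job into `metrics`."""
    stats = {"records": 0}
    start = time.perf_counter()
    try:
        yield stats
    finally:
        # Runs even if the job fails, so partial progress is still recorded.
        elapsed = time.perf_counter() - start
        metrics[job_name] = {
            "elapsed_sec": elapsed,
            "records": stats["records"],
            "records_per_sec": stats["records"] / elapsed if elapsed > 0 else 0.0,
        }


metrics = {}
with batch_monitor("job2", metrics) as stats:
    for record in range(1000):  # stand-in for real record processing
        stats["records"] += 1
print(metrics["job2"])
```

Collecting the same figures for every job in the batch shows which job dominates the overall elapsed time.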
Validations that help to speed up batch processing
• Pooling - Retrieve small sets of batch items for processing at a time
• Locks - Monitor the locks acquired/sec
• Write Log - Monitor the total log flush (WRITELOG value)
• Index and Fragmentation - Rebuild indexes and defragment after the batch
• Disable constraints - To be considered if the data is known to be correct
• Triggers - Validate whether they are needed during batch execution
• Wait Statistics - Need to be monitored to detect the server slowing down
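The pooling point above, retrieving small sets of items at a time, can be sketched with an in-memory SQLite database. The table `batch_items`, its columns and the 1% uplift are hypothetical stand-ins; a real batch would run against its own database and business logic.

```python
import sqlite3


def process_in_chunks(conn, chunk_size=500):
    """Process pending items in small sets, committing after each set."""
    chunks = 0
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM batch_items WHERE processed = 0 "
            "ORDER BY id LIMIT ?",
            (chunk_size,),
        ).fetchall()
        if not rows:
            return chunks
        # Stand-in for the real business logic: add 1% to each amount.
        updates = [(amount * 1.01, item_id) for item_id, amount in rows]
        conn.executemany(
            "UPDATE batch_items SET amount = ?, processed = 1 WHERE id = ?",
            updates,
        )
        conn.commit()  # short transactions keep locks brief
        chunks += 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE batch_items ("
    "id INTEGER PRIMARY KEY, amount REAL, processed INTEGER DEFAULT 0)"
)
conn.executemany("INSERT INTO batch_items (amount) VALUES (?)", [(100.0,)] * 1200)
conn.commit()

chunks_done = process_in_chunks(conn, chunk_size=500)
print("chunks processed:", chunks_done)
```

Committing per chunk bounds both memory use and the time locks are held, at the cost of more round trips; the chunk size is something to tune by measurement.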
Real examples of batch tuning & benefits to the client
As part of an assignment with a leading UK bank, the following optimizations helped to reduce the batch elapsed time to meet the SLA:
• Optimization of the SQL queries
• Use of parallelism
• Making efficient use of CPU resources
• Use of quality metrics, such as job failure tracking, and test management tools
Gains due to the above optimizations:
• Batch elapsed time reduced to 75 mins on a normal day (vs. 180 mins across a similar period before)
• Reduction in production systems running costs of around £245k per annum
• Estimated reduction in development systems running costs of £122k per annum
• Batch system utilization reduced to permissible limits during the overnight batch window
Conclusion
• Inefficient batches have financial implications and impact online activity
• Proper workload modelling is key to the success of batch testing
• Proper analysis needs to be carried out to identify the workload to move to cloud
• Scope for optimization can be identified from the monitors set up during batch execution
• Monitoring of the key DB parameters is critical for efficient batch optimization
DB Connection Pool/Thread Pool - The DB connection pool is the number of DB connections available to the batch. A thread pool that is too large can cause performance problems: with too many concurrent threads, task-switching overhead becomes a serious bottleneck. The optimum value needs to be derived through multiple executions and benchmarking.
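Deriving the optimum pool size by repeated execution can be sketched as a small benchmark loop. This is an illustration under stated assumptions: `simulated_db_call` is a hypothetical stand-in with a fixed 5 ms latency, and the candidate pool sizes are arbitrary. A real benchmark would call the actual batch step against a production-like database.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_db_call(_item):
    """Stand-in for an I/O-bound DB call (hypothetical 5 ms latency)."""
    time.sleep(0.005)
    return 1


def benchmark_pool_sizes(items, pool_sizes, runs=3):
    """Return the best elapsed time in seconds observed for each pool size."""
    results = {}
    for size in pool_sizes:
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            with ThreadPoolExecutor(max_workers=size) as pool:
                processed = sum(pool.map(simulated_db_call, items))
            assert processed == len(items)
            timings.append(time.perf_counter() - start)
        results[size] = min(timings)
    return results


results = benchmark_pool_sizes(range(40), pool_sizes=(1, 4, 8), runs=2)
for size, elapsed in results.items():
    print(f"pool size {size}: {elapsed:.3f} s")
```

Taking the best of several runs reduces noise from other activity on the machine; the size beyond which elapsed time stops improving is a reasonable starting point for the pool setting.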
a.) Reduce the number of I/O operations: The number of I/O operations (or the I/O size) of the application was set so that it does not exceed the theoretical limit of the adapter/disk, which reduced the number of I/O operations and made execution more efficient.
b.) Reduce elapsed time: Parallelism reduced the elapsed time, helping the batch complete within the SLA.
c.) Make efficient use of CPU resources: Batches were split into smaller logical pieces so that they use the minimum CPU cycles while keeping CPU utilization efficient, close to 95%.
d.) Use of quality metrics, such as job failure tracking, auto restart and test management tools: Each job within the batch was monitored to improve efficiency and reduce the impact on the overall batch elapsed time.
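Point d.) above, tracking job failures with auto restart, can be sketched as a wrapper that reruns a failed job and records every attempt. The function name `run_with_restart`, the attempt limit and the flaky demo job are hypothetical; a production scheduler would normally provide this capability itself.

```python
import time


def run_with_restart(job_name, job, max_attempts=3, history=None):
    """Run `job`, restarting it on failure, and record each attempt."""
    history = history if history is not None else []
    for attempt in range(1, max_attempts + 1):
        start = time.perf_counter()
        try:
            result = job()
            status, error = "ok", None
        except Exception as exc:  # broad on purpose: any failure triggers a restart
            result, status, error = None, "failed", repr(exc)
        history.append({
            "job": job_name,
            "attempt": attempt,
            "status": status,
            "error": error,
            "elapsed_sec": time.perf_counter() - start,
        })
        if status == "ok":
            return result, history
    raise RuntimeError(f"{job_name} failed after {max_attempts} attempts")


calls = {"n": 0}


def flaky_job():
    """Hypothetical job that fails on its first run only."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("feed file not ready")
    return 250  # records processed


result, history = run_with_restart("job3", flaky_job)
print(result, [entry["status"] for entry in history])
```

The recorded history gives a failure count and per-attempt elapsed time for each job, which is the raw material for the quality metrics mentioned above.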