BRKINI-3287.pdf

#CLUS
Manish Agarwal,
Director of Product Management, HyperFlex
@flash4all
BRKINI-3287
End to end sizing and performance
Sizing and
Performance Tuning
for HyperFlex

© 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public

Agenda
• Basics / Terms
• Sizing Considerations
• End to End Sizing Tools
• Hx Bench – benchmarking best practices
• Hx Workload Profiler
• Hx Sizer
• Performance Tuning
• Summary

Common terms
• Workload: Read / Write Mix, IO Size (Distribution), Random / Seq
• Data Set Vs Working Set
• Block Size Vs IO Size
• Sustained Vs Instantaneous

Typical Storage Performance Curve
[Figure: Latency (ms), 0 to 9, plotted against IOPS, 0 to 70000. Annotations: 10,000 IOPS @ 2.1ms; 50,000 IOPS @ 6.3ms; knee of the curve]
This particular curve is for: 4 node LFF, Workload: 70/30 rd/wr, 8K 100% random, 50% compressible & 0% dedupe dataset, Run at 75% full filesystem, RF3

Typical Storage Performance Curve
[Figure: Latency (ms), 0 to 9, plotted against IOPS, 0 to 70000]
This particular curve is for: 4 node LFF, Workload: 70/30 rd/wr, 8K 100% random, 50% compressible & 0% dedupe dataset, Run at 75% full filesystem, RF3
Factors that affect the curve:
• Capacity utilization
• Working set size
• Workload compressibility
• Workload dedupe
• Larger block size
• % Write I/O mix
• % Read I/O mix
• Random I/O
• Sequential I/O

Little’s Law
Latency Vs Throughput (or IOPS)
Little’s Law: IOPS = (Queue Depth) / (Latency)
Implications:
• Take a system with a 1ms response time at 4K IO size:
• A single thread can’t get more than 1000 IOPS (4K IO size)
• The number of parallel threads the system can service is (mostly) independent of the 1ms response time
• If an application is getting 3000 IOPS (4K IO size), the thread count / outstanding I/Os (OIO) = 3
• Typically, as load increases, the response time will increase
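The Little's Law arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of the session material: the function names are made up here, and latency is assumed to be the average response time in milliseconds.

```python
def little_iops(queue_depth: float, latency_ms: float) -> float:
    """Little's Law: IOPS = queue depth / latency (latency given in ms)."""
    return queue_depth * 1000.0 / latency_ms


def little_queue_depth(iops: float, latency_ms: float) -> float:
    """Little's Law rearranged: queue depth (OIO) = IOPS * latency."""
    return iops * latency_ms / 1000.0


# One thread keeps one I/O outstanding; at 1 ms per I/O this is its ceiling.
single_thread_ceiling = little_iops(1, 1.0)

# An application observing 3000 IOPS at 1 ms must keep this many I/Os in flight.
implied_oio = little_queue_depth(3000, 1.0)

print(single_thread_ceiling, implied_oio)
```

The same two functions also show why latency matters for sizing: at a fixed queue depth, doubling the latency halves the achievable IOPS.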

What is Sizing?
Dictionary.com: “the act or process of applying size or preparing with size”
[Figure: Sizing. Expectation: workloads W1 to W6 shown against the total available resources. Reality: workloads W1 to W9 shown against the total available resources]

Sizing Workflow
• Detailed sizing (especially storage perf) can be a lot of work
• When to do detailed sizing?
• Large deployments (100s of VMs, very large VMs)
• High performance requirements (reads: many 100s of MB/s; writes: low 100s of MB/s)
• Performance sensitive environments
• Existing environment has performance issues or is hosted on a high performance solution
• Refresh of existing environment:
• Use profiling / monitoring tools wherever possible OR (if known) base it on the existing solution’s specs and perf capability, e.g. refresh of an older HDD-only infra is low risk
• Net new environments:
• Try to find the closest match in an existing environment OR use application templates
• HCI: Scale out helps reduce sizing risk

Know What You Are Sizing For
• Important to understand the key constraints
• Sizing is in the context of a specific architecture
• Key Questions to Ask:
• What is the infrastructure going to be used for?
• Which applications will run on the infrastructure?
• What is the performance sensitivity?
• Are there any goals / constraints / assumptions?
• Is this an existing environment or net new?
• (For existing) Is the environment running well today?
• Is it possible to monitor the existing environment?

Sizing in the Context of HCI
HCI Sizing Constraints: CPU, Memory, Network, Storage Capacity, Storage Performance
Scaling Consideration, by resource:
• Storage Perf: Depends on the storage architecture*
• Capacity: Depends on the storage architecture*
• Memory: Up to node max; clustered app beyond
• CPU: Up to node max; clustered app beyond
• Network: Scaling options; upstream implications
Key HCI Value Prop: Seamless Scaling
* Distributed Architecture Vs Data Locality

Timeseries to a Single Number
[Figure: Server 1 Read IOPS over Time, 2019-02 to 2019-03; y-axis 0 to 3500]
Server 1 IOPS:
• Peak: 3260.5
• Average: 24.1
• Median: 5.7
• 95th Percentile: 39.8
• 90th Percentile: 6.5
Which one to use? Variation of more than 100X
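To see how far apart these summary numbers can land, here is a small Python sketch. The trace is hypothetical (it is not the Server 1 data), and the nearest-rank percentile is one of several common definitions, chosen here only for simplicity.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct% of n, integer math
    return ordered[rank - 1]


def summarize(samples):
    """Collapse a time series into the candidate sizing numbers."""
    return {
        "peak": max(samples),
        "average": statistics.mean(samples),
        "median": statistics.median(samples),
        "p95": percentile(samples, 95),
        "p90": percentile(samples, 90),
    }


# Hypothetical 100-sample trace: mostly idle, some moderate activity, one burst.
read_iops = [5.0] * 90 + [40.0] * 9 + [3000.0]
summary = summarize(read_iops)

for name, value in summary.items():
    print(f"{name:>8}: {value:8.1f}")
```

With a bursty trace like this one, the peak is hundreds of times the median, so the choice of statistic dominates the sizing result.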

Knowing the Use Case Helps
Unless the implications are understood, go with Peak
A more extreme example:
• Single point at 16K IOPS
• 20 sec sampling interval
• What level of IOPS to use for sizing?
[Figure: Read IOPS over Time, 2019-02 to 2019-03; y-axis 0 to 18000]
Answer is .. It depends:
• What is happening at that point?
• What is the impact of not meeting that level of perf?

Granularity of Monitoring
[Figure: Server 1 Read IOPS (K) over Time, shown at a 20 sec interval and at a 60 sec interval; y-axis 0 to 3.5]
IOPS by sampling interval (20 sec vs 60 sec):
• Peak: 3260.5 vs 2769.6, a 15% difference (can be much more!)
• Average: 24.1 vs 24.1
• Median: 5.7 vs 5.7
• 90th Percentile: 6.5 vs 7.7
10-20 sec granularity works well for most scenarios
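A short Python sketch of why a coarser interval lowers the reported peak while leaving the average unchanged. It assumes the monitoring tool averages samples within each window (an assumption; some tools keep the maximum instead), and the sample values are hypothetical.

```python
def downsample_mean(samples, factor):
    """Average consecutive groups of `factor` samples (e.g. 3 x 20 sec -> 60 sec)."""
    return [
        sum(samples[i:i + factor]) / len(samples[i:i + factor])
        for i in range(0, len(samples), factor)
    ]


# Hypothetical read IOPS sampled every 20 sec, with one short burst.
iops_20s = [6.0, 6.0, 6.0, 3000.0, 6.0, 6.0, 6.0, 6.0, 6.0]
iops_60s = downsample_mean(iops_20s, 3)

print("peak @ 20 sec:", max(iops_20s))
print("peak @ 60 sec:", max(iops_60s))
print("average @ 20 sec:", sum(iops_20s) / len(iops_20s))
print("average @ 60 sec:", sum(iops_60s) / len(iops_60s))
```

The shorter the burst relative to the window, the more the peak is diluted, which is why the difference "can be much more" than the 15% seen on this slide.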

Example 1: Sizing Varies by Architectures
• A Canadian Securities Company
• Environment:
• VSI - 15 Servers, 266 VMs
• Monitored over 5 days, measuring sustained peaks
[Figure: Time Series (Total throughput across all servers), Total Read (MB/s), y-axis 0 to 900, from 2017-07-10T16:53:40 to 2017-07-15T05:49:21]
Sum of Peak Throughput across 15 servers:
• Read: 2879 MB/s
• Write: 922 MB/s
For HX Sizing (single pool across servers):
• Read: ~848MB/s
• Write: ~319MB/s
3X reduction due to consolidation
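The gap between the two numbers comes from when each server peaks. A minimal Python sketch with hypothetical per-server traces (three servers, four aligned samples each) shows the difference between summing per-server peaks and taking the peak of the summed series, which is what a single shared pool has to serve.

```python
def sum_of_peaks(per_server):
    """Size each server on its own peak, then add the results."""
    return sum(max(series) for series in per_server.values())


def peak_of_sum(per_server):
    """Add the servers up at each timestamp, then take the peak of the total."""
    totals = [sum(values) for values in zip(*per_server.values())]
    return max(totals)


# Hypothetical read throughput in MB/s; the servers peak at different times.
throughput = {
    "server-a": [100, 900, 100, 100],
    "server-b": [100, 100, 800, 100],
    "server-c": [700, 100, 100, 100],
}

print("sum of peaks:", sum_of_peaks(throughput))
print("peak of sum:", peak_of_sum(throughput))
```

The series must be sampled at the same timestamps for the per-timestamp sum to be meaningful.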

Example 2:
Know the Application
• A Telecom Provider
• Environment:
• 5 ESX hosts running an analytics workload (MongoDB)
• Monitored over 5 days, measuring sustained peaks
For HX Sizing:
Read: ~110MB/s
Write: ~1232MB/s
However:
• MongoDB is optimized to extract
the max throughput (by varying
thread count)
• If you are co-locating other
workloads they may get starved
• It helps to know the application stack
• Beware of perf hogs!

All Flash Vs Hybrid
• All Flash Vs Hybrid Trade-offs:
• For performance sensitive environments – always use All Flash
• Hybrid performance is highly dependent on cache friendly workloads
• Rule of thumb for hybrid:
working set = (10 – 15%) of dataset
[Not a guarantee]
• Keep in mind the sensitivity to cache misses – can the applications tolerate that?
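The hybrid rule of thumb can be turned into a quick check, sketched below in Python. The dataset size and cache capacity are hypothetical inputs, the function names are illustrative, and the 10 to 15% range is only the rule of thumb from this slide, not a guarantee.

```python
def working_set_range_gb(dataset_gb, low=0.10, high=0.15):
    """Rule-of-thumb working set estimate: 10 to 15% of the dataset."""
    return dataset_gb * low, dataset_gb * high


def cache_covers_working_set(dataset_gb, cache_gb):
    """Conservative check: compare the cache against the high estimate."""
    _, high_estimate = working_set_range_gb(dataset_gb)
    return cache_gb >= high_estimate


# Hypothetical 20 TB dataset on a hybrid cluster with 2.4 TB of cache.
low_gb, high_gb = working_set_range_gb(20_000)
print(f"estimated working set: {low_gb:.0f} to {high_gb:.0f} GB")
print("cache covers working set:", cache_covers_working_set(20_000, cache_gb=2_400))
```

If the check fails, or the application cannot tolerate cache misses, that points toward All Flash.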

Sizing for Failures
• Difference between “RF” & “N+x”
• RF (RF2 or RF3) controls how many simultaneous failures the cluster can survive
• N+x (N+1 or N+2) controls the performance in the face of failures
• N+x:
• CPU & Memory: For an asymmetric cluster, the worst case is the failure of the node with max CPU / Memory resources
• Storage Capacity: short-lived failures do not cause a change in capacity; compute-only nodes have no impact
• Storage Performance: even short-lived failures cause a loss of storage performance; compute-only nodes have no impact
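For CPU and memory, the N+x worst case can be sketched as "remove the x largest nodes and see what is left". The Python below is an illustration with a hypothetical asymmetric cluster; the function name is made up for this example.

```python
def capacity_after_failures(per_node, x):
    """Resources left after losing the x largest nodes (the worst case)."""
    if not 0 <= x < len(per_node):
        raise ValueError("x must be between 0 and len(per_node) - 1")
    survivors = sorted(per_node)[: len(per_node) - x]
    return sum(survivors)


# Hypothetical asymmetric cluster: three 48-core nodes and one 64-core node.
cores_per_node = [48, 48, 48, 64]

print("no failures:", capacity_after_failures(cores_per_node, 0))
print("N+1:", capacity_after_failures(cores_per_node, 1))
print("N+2:", capacity_after_failures(cores_per_node, 2))
```

The workload's CPU and memory demand should fit within the N+x figure, not the full-cluster total.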

A Few Other Things ..
• When profiling an existing environment for sizing purposes:
• The sampling period needs to capture the peak
• The environment should have no existing performance issues: the profiler can’t predict the amount of additional headroom required to make the environment healthy
• Pay attention to latency as well (Remember Little’s Law)
• Meeting the IOPS / throughput may not be enough
• Understand the behavior of the profiling tool being used
• Usage numbers of older CPU SKUs need to be normalized
• Consider storage configuration: RF2/RF3, Storage Efficiency, Data protection
• Depending on the resiliency required, this has performance & capacity implications
• Do not forget to add headroom (15-20% at least)
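As a small worked example of the headroom step, the Python sketch below applies 20% headroom to the HX sizing numbers from Example 1. It assumes headroom is applied as a multiplier on the measured peak, which is one possible reading of the guideline; the function name is illustrative.

```python
def with_headroom(measured_peak, headroom_fraction=0.20):
    """Sizing target = measured peak plus a headroom fraction on top."""
    return measured_peak * (1.0 + headroom_fraction)


# Measured sustained peaks from Example 1 (MB/s).
read_target = with_headroom(848.0)
write_target = with_headroom(319.0)

print(f"read target:  {read_target:.1f} MB/s")
print(f"write target: {write_target:.1f} MB/s")
```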

Summary (Recommendation: Details)
Context:
• Try to know as much as possible about the environment
End to End:
• All resources (compute, n/w, storage) are sized appropriately
• Ability to scale out reduces sizing risk, but understand the solution’s scaling properties
• Resiliency, data protection and features need to be accounted for
Time Series:
• Use peak values to summarize (unless you know more)
• Use no more than 15-20 sec granularity
• Know the impact of aggregation / summarization
Know the tools:
• What is the source of the counters?
• What is captured: peak, average, etc? How is it being compacted?
Failures:
• Sizing should account for failure mode behavior
Latency:
• Not enough to only hit the throughput / IOPS goal
• Latency can directly impact the time to completion

HyperFlex Simplifies the Management Lifecycle
• Plan: Eval, Profile, Size, Buy
• Deploy: Install, Configure
• Operate: Manage, Monitor, Grow
• Troubleshoot: Support, Upgrade
Cloud-Like Analytics Driven Unified Management

HyperFlex Tools
Free tools: available to customers and partners; available as is, at best effort support
Hx Workload Profiler: CPU, memory and storage profile
• Creates usage profile for existing environments
• 2 modes: Quick sizing and active profiling
• Available for ESX and HyperV environments
• Primary use case is sizing (can be used for granular monitoring)
Hx Sizer: End-to-End sizing
• Includes compute (CPU & Memory), storage performance and capacity
• Integration with the profiler for end to end sizing
• Application templates to aid application based sizing
• All HX configurations and features
HxBench: Storage benchmarking made easy
• Benchmarking workflow automation using vdbench
• Industry standard benchmarking practices
• Available for ESX and HyperV environments
• Ability to do local storage [NFS, SMB – coming soon]

Storage Performance Testing Best Practices
1. Initialize the disks
• More realistic: applications don’t read data they haven’t written
2. Longer Test Duration:
• SSDs / All flash systems show very high performance initially, which is not realistic
• Tests should be long running (several hours depending on write load)
3. Larger Working set size
• For smaller working set sizes, hybrid will perform similarly to all flash (as data is coming from flash)
4. Enable Dedupe and compression on the competing solution (On by default on HX)
• Especially if your capacity sizing assumes dedupe & compression
5. Workload should correspond to your production workload:
• Reasonable write percentage (30% - 35% at least) in the workload
• 95% - 100% read workloads are not realistic in most environments
6. Similar hardware configuration:
• Pay careful attention to the BOM: CPU, Memory and SSD (type and count for cache/WL and persistent data SSDs)
• Additional performance may come at a trade-off: cost or manageability
7. Similar software configuration:
• Pay careful attention to the resiliency setting, which needs to be identical
• Comparing Replication Factor = 2 performance with Replication Factor = 3 performance is not comparing apples to apples

Reading Storage Performance Results
1. What to look for in performance results:
• IOPS / Throughput
• Latency – not all workloads show sub millisecond latencies (even on all flash!)
• Consistent Latency / standard deviation of latency
2. Variance in performance across multiple VMs
[Figure: Write Latency as measured by esxtop for similar workloads]
Which one would you rather have?
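Why the standard deviation matters alongside the average can be shown with a short Python sketch. Both latency traces below are hypothetical and were picked to have the same mean; they are not the esxtop measurements from the slide.

```python
import statistics


def latency_profile(samples_ms):
    """Return (mean, population standard deviation) of latency samples in ms."""
    return statistics.mean(samples_ms), statistics.pstdev(samples_ms)


# Two hypothetical systems with the same average write latency.
steady_ms = [2.0, 2.1, 1.9, 2.0, 2.0]
spiky_ms = [0.5, 0.5, 0.5, 0.5, 8.0]

steady_mean, steady_sd = latency_profile(steady_ms)
spiky_mean, spiky_sd = latency_profile(spiky_ms)

print(f"steady: mean={steady_mean:.2f} ms, stdev={steady_sd:.2f} ms")
print(f"spiky:  mean={spiky_mean:.2f} ms, stdev={spiky_sd:.2f} ms")
```

An average alone would rate the two systems as equal; the standard deviation exposes the spikes an application would actually feel.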

Demo: HxBench
Storage benchmarking made simple

Things to Know About The Monitoring Tool
• Which resources are being monitored?
• Can gather usage stats for an extended period of time
• Especially important for storage performance stats
• Presents a complete picture wrt cyclical changes in load
• What is the granularity of stats? What all does it report – max, avg?
• Is there any compaction of data being done – what is preserved?

Demo 2: Hx Workload Profiler
Monitoring resource usage for sizing

Demo 3: HyperFlex Sizer
End to end sizing

End to End I/O Stack
[Figure: I/O stack layers: Application, Virtual Machine, Hypervisor, Datastore/Volume, HX Controller VM]
• Performance tuning:
• Identifying the bottleneck
• (if possible) scaling it
• And then repeating the process
• Performance Debugging:
• First establish that there is a problem
• Know what to expect from the system
• Beware of “Low IOPS” complaint!
• Establish a bar (if possible)
• Narrow the time window when the problem is seen
• Helps to view time axis aligned charts for different layers

End to End I/O Stack
[Figure: I/O stack layers: Application, Virtual Machine, Hypervisor, Datastore/Volume, HX Controller VM]
• Generally:
• Throughput & IOPS should be the same at all layers
• Higher latency at higher layers
• Exceptions:
• Caching: not all I/Os are sent down. Both throughput (higher) and latency (lower) may be different from lower layers
• Some layer is doing additional I/Os to satisfy the user IO
• Intermediate layers may decide to split the I/Os: IOPS won’t match, latency won’t either
• Quick Analysis: Rule out layers (and below) where latency is ok
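The quick analysis rule can be sketched as a top-down walk of the stack: the first layer with acceptable latency clears itself and everything below it, leaving the layers above as suspects. The Python below is an illustration only; the layer ordering, latency values, and threshold are all hypothetical.

```python
def suspect_layers(latency_by_layer, ok_threshold_ms):
    """Walk layers top to bottom; stop at the first one with acceptable latency.

    Layers above that point remain suspects; that layer and all below are ruled out.
    """
    suspects = []
    for layer, latency_ms in latency_by_layer:
        if latency_ms <= ok_threshold_ms:
            break
        suspects.append(layer)
    return suspects


# Hypothetical read latencies (ms), ordered from the top of the stack down.
measured = [
    ("application", 25.0),
    ("virtual machine", 24.0),
    ("hypervisor", 22.0),
    ("datastore", 1.5),
    ("hx cluster", 1.0),
]

print(suspect_layers(measured, ok_threshold_ms=5.0))
```

The exceptions listed on the slide (caching, extra I/Os, split I/Os) still apply, so a result like this narrows the search rather than proving a root cause.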

Performance Tools
[Figure: I/O stack layers: Application, Virtual Machine, Hypervisor, Datastore/Volume, HX Controller VM]
Tools by layer:
Application:
• Application specific counters, e.g. AWR
VM:
• Linux: iostat, mpstat, vmstat
• Windows: perfmon
Hypervisor:
• vCenter: CPU, memory and IO performance
• esxtop: lots of command line options
Datastore / vmdk:
• esxtop: lots of command line options (https://labs.vmware.com/flings/visualesxtop for visualizing it)
• vscsiStats: vmdk granular I/O counters and histograms (or use HX profiler)
Scvmclient:
• HxConnect performance dashboards (cluster, HX node and datastore level)
HXDP / Drives:
• Support bundle / ASUP

Customer Example: High Read Latency
• Fortune 500 Company
• Environment:
• 5 node cluster running an Oracle DB workload (and other non-critical workloads as well)
• All Flash, configured as RF3
• Very large Linux VM (256G memory, 32 vCPUs)
• Issue:
• Customer monitoring read throughput and seeing limited (500 – 800MB/s) read throughput
• High latency reported

Hx Level Performance
• Cluster Level: Very healthy read response time
• Datastore Level: Read latency is good!

Recommendations
• Disable SIOC
• Raise the pvscsi queue depth (QD) and the datastore QD
• Use multiple datastores and spread out the load
• Reduce the VM size (vCPU & Memory) to fit in 1 NUMA node
• Result: Significant improvement in overall latency & throughput

Customer Example
High Write Latency
• A European Telecom Provider
• Environment:
• 5 node cluster running an analytics workload (MongoDB)
• Issue:
• Recently added new drives to add capacity to the cluster
• Seeing the run time increase after drive addition

Cluster Storage Performance
[Figure: Storage Throughput (MB/s), 0 to 1400, series nfsWriteBytes and nfsReadBytes, from 2018-10-14-16:00:10 to 2018-10-16-23:54:30]
• Predominantly write workload
• Writes are an expensive operation on HCI solutions
• Very high load for a 5 node cluster

Cluster Storage Latency during rebalance
[Figure: Write Latency (ms), 0 to 7000, from 2018-10-14-16:57:00 to 2018-10-17-00:03:00]
• Very very high latencies (in seconds)
• Cluster is crawling!!

Rebalance Load
[Figure: ResyncReadBytes (MB/s), 0 to 3000, from 2018-10-14-16:00:10 to 2018-10-17-00:03:10]
• Smoking gun!
• Cluster rebalance load is tipping the cluster over the max throughput
• Solution: Resize the cluster to leave additional headroom

Customer Example
High Write Latency @ low load
• A handful of customers (mostly POCs)
• Environment:
• Issue:
• Latency spikes on a very lightly loaded cluster

Cluster Performance
• Less than 50MB/s load
• Latency in 20-30ms range, with spikes to 80ms

Solution to Customer Reported Issue
• Initial Workaround:
• Run some background synthetic load and the performance improves dramatically!!
• Eventual RCA / fix:
• Networking stack was buffering the IO in the hope of getting the benefits of coalescing
• Low traffic caused the window to increase, which caused very high latency
• Immediate fix: increase the Heartbeat frequency to keep the window from increasing

Other Best Practices
• Avoid VMs straddling multiple NUMA nodes
• Follow UCS memory DIMM best practices (see the server technical specs for guidelines)
• Pin performance sensitive VMs to a different NUMA node than the controller VM
• Ensure no disks are seeing errors and need replacement
• Enable jumbo frames at all levels

Overall Summary
• Sizing:
• The better you know the environment, the better the sizing you can do
• Understand the requirements, tools, assumptions in detail
• Hx Tools:
• HxBench, Hx Workload Profiler, Hx Sizer
• Free – available to all customers & partners
• Perf Debugging / Tuning:
• Know the overall stack – know which tools to use
• Identify the bottleneck, scale it, repeat!
