Performance Analysis and Tuning – Part 2 
D. John Shakshober (Shak) 
Sr Consulting Eng / Director Performance Engineering 
Larry Woodman 
Senior Consulting Engineer / Kernel VM 
Jeremy Eder 
Principal Software Engineer/ Performance Engineering
Agenda: Performance Analysis Tuning Part II 
• Part I 
• RHEL Evolution 5 -> 6 -> 7 – out-of-the-box tuned for Clouds – “tuned”
• Auto_NUMA_Balance – tuned for Non-Uniform Memory Access (NUMA)
• Cgroups / Containers 
• Scalability – Scheduler tunables
• Transparent Hugepages, Static Hugepages 4K/2MB/1GB 
• Part II 
• Disk and Filesystem IO – throughput-performance
• Network Performance and latency-performance
• System Performance/Tools – perf, tuna, systemtap, performance-co-pilot
• Q & A
Disk I/O in RHEL
RHEL “tuned” package 
Available profiles: 
- balanced 
- desktop 
- latency-performance 
- network-latency 
- network-throughput 
- throughput-performance 
- virtual-guest 
- virtual-host 
Current active profile: throughput-performance
Tuned: Profile throughput-performance 
throughput-performance 
governor=performance 
energy_perf_bias=performance 
min_perf_pct=100 
readahead=4096 
kernel.sched_min_granularity_ns = 10000000 
kernel.sched_wakeup_granularity_ns = 15000000 
vm.dirty_background_ratio = 10 
vm.swappiness=10
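A quick way to inspect and switch profiles is tuned-adm; a minimal sketch using the profile above:
# tuned-adm list
# tuned-adm profile throughput-performance
# tuned-adm active
Current active profile: throughput-performance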
I/O Tuning – Understanding I/O Elevators 
• Deadline – new RHEL7 default for all profiles 
•Two queues per device, one for reads and one for writes
• I/Os dispatched based on time spent in queue 
• CFQ – used for system disks off SATA/SAS controllers 
•Per process queue 
•Each process queue gets fixed time slice (based on process priority) 
• NOOP – used for high-end SSDs (Fusion IO etc) 
•FIFO 
•Simple I/O Merging 
• Lowest CPU Cost
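The active elevator can be checked and changed per device through sysfs; a sketch, assuming a device named sda (the bracketed entry is the one in use):
# cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
# echo cfq > /sys/block/sda/queue/scheduler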
Iozone Performance Effect of TUNED EXT4/XFS/GFS
[Chart: RHEL7 RC 3.10-111 File System In Cache Performance – ext3, ext4, xfs, gfs2; Intel I/O (iozone - geoM 1m-4g, 4k-1m), not tuned vs. tuned; Throughput in MB/Sec, scale 0-4500]
[Chart: RHEL7 3.10-111 File System Out of Cache Performance – ext3, ext4, xfs, gfs2; not tuned vs. tuned; Throughput in MB/Sec, scale 0-800]
SAS Application on Standalone Systems 
Picking a RHEL File System 
xfs most recommended 
• Max file system size 100TB 
• Max file size 100TB 
• Best performing 
ext4 recommended 
• Max file system size 16TB 
• Max file size 16TB 
ext3 not recommended 
• Max file system size 16TB 
• Max file size 2TB 
[Chart: SAS Mixed Analytics (RHEL6 vs RHEL7), perf 32 (2-socket Nehalem), 8 x 48GB – TOTAL Time and System Time for xfs, ext3, ext4, gfs2 on RHEL6 vs RHEL7; Time in seconds (lower is better); RHEL7 deltas labeled 6.18, -4.05, 4.94, 9.59]
Tuning Memory – Flushing Caches 
•Drop unused Cache – to control pagecache dynamically 
✔Frees most pagecache memory 
✔File cache 
✗If the DB uses the pagecache, it may notice a slowdown
•NOTE: use for benchmark environments.
●Free pagecache 
●# sync; echo 1 > /proc/sys/vm/drop_caches 
●Free slabcache 
●# sync; echo 2 > /proc/sys/vm/drop_caches 
●Free pagecache and slabcache 
●# sync; echo 3 > /proc/sys/vm/drop_caches
Virtual Memory Manager (VM) Tunables 
●Reclaim Ratios 
●/proc/sys/vm/swappiness 
●/proc/sys/vm/vfs_cache_pressure 
●/proc/sys/vm/min_free_kbytes 
●Writeback Parameters 
●/proc/sys/vm/dirty_background_ratio 
●/proc/sys/vm/dirty_ratio 
●Readahead parameters 
●/sys/block/<bdev>/queue/read_ahead_kb
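All of the /proc values above can be read in one shot with sysctl; a sketch, assuming a block device named sda for the readahead entry:
# sysctl vm.swappiness vm.vfs_cache_pressure vm.min_free_kbytes
# sysctl vm.dirty_background_ratio vm.dirty_ratio
# cat /sys/block/sda/queue/read_ahead_kb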
Per file system flush daemon
[Diagram: a Read()/Write() call copies data between a user-space buffer and the kernel pagecache; a per-file-system flush daemon writes dirty pagecache pages out to the file system]
swappiness
•Controls how aggressively the system reclaims anonymous memory:
•Anonymous memory – swapping
•Mapped file pages – writing if dirty, then freeing
•System V shared memory – swapping
•Decreasing: more aggressive reclaiming of pagecache memory
•Increasing: more aggressive swapping of anonymous memory
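To try a value at runtime and then persist it across reboots, something like this works on RHEL7 (the file name under /etc/sysctl.d is an arbitrary example):
# sysctl -w vm.swappiness=10
# echo 'vm.swappiness = 10' > /etc/sysctl.d/99-vm-tuning.conf
# sysctl -p /etc/sysctl.d/99-vm-tuning.conf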
vfs_cache_pressure
●Controls how aggressively the kernel reclaims memory in slab caches.
●Increasing causes the system to reclaim the inode cache and dentry cache.
●Decreasing causes the inode cache and dentry cache to grow.
min_free_kbytes 
Directly controls the page reclaim watermarks in KB 
Defaults are higher when THP is enabled 
# echo 1024 > /proc/sys/vm/min_free_kbytes 
----------------------------------------------------------- 
Node 0 DMA free:4420kB min:8kB low:8kB high:12kB 
Node 0 DMA32 free:14456kB min:1012kB low:1264kB high:1516kB 
----------------------------------------------------------- 
# echo 2048 > /proc/sys/vm/min_free_kbytes
----------------------------------------------------------- 
Node 0 DMA free:4420kB min:20kB low:24kB high:28kB 
Node 0 DMA32 free:14456kB min:2024kB low:2528kB high:3036kB 
-----------------------------------------------------------
dirty_background_ratio, dirty_background_bytes 
●Controls when dirty pagecache memory starts getting written. 
●Default is 10% 
●Lower 
●flushing starts earlier 
●less dirty pagecache and smaller IO streams 
●Higher 
●flushing starts later 
●more dirty pagecache and larger IO streams 
● dirty_background_bytes overrides when you want < 1%
dirty_ratio, dirty_bytes 
• Absolute limit to percentage of dirty pagecache memory 
• Default is 20% 
• Lower means clean pagecache and smaller IO streams 
• Higher means dirty pagecache and larger IO streams 
• dirty_bytes overrides when you want < 1%
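For example, to cap dirty pagecache at an absolute 256MB instead of a percentage (the kernel zeroes the corresponding _ratio when a _bytes value is set, and vice versa):
# echo 268435456 > /proc/sys/vm/dirty_bytes
# cat /proc/sys/vm/dirty_ratio
0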
dirty_ratio and dirty_background_ratio 
100% of pagecache RAM dirty 
flushd and write()'ng processes write dirty buffers 
dirty_ratio(20% of RAM dirty) – processes start synchronous writes 
flushd writes dirty buffers in background 
dirty_background_ratio(10% of RAM dirty) – wakeup flushd 
do_nothing 
0% of pagecache RAM dirty
Network Performance Tuning
RHEL7 Networks 
•IPv4 Routing Cache replaced with Forwarding Information Base 
•Better scalability, determinism and security 
•Socket BUSY_POLL (aka low latency sockets) 
•40G NIC support, bottleneck moves back to CPU :-) 
•VXLAN Offload (for OpenStack) 
•NetworkManager: nmcli and nmtui
Locality of Packets
[Diagram: four packet streams, each steered to its own core on NUMA node 1 – Stream from Customer 1 → Socket1/Core1, Customer 2 → Socket1/Core2, Customer 3 → Socket1/Core3, Customer 4 → Socket1/Core4]
Tuned: Profile Inheritance
Parents: throughput-performance, latency-performance
Children: network-throughput, network-latency, virtual-host, virtual-guest, balanced, desktop
Tuned: Profile Inheritance (throughput) 
throughput-performance 
governor=performance 
energy_perf_bias=performance 
min_perf_pct=100 
readahead=4096 
kernel.sched_min_granularity_ns = 10000000 
kernel.sched_wakeup_granularity_ns = 15000000 
vm.dirty_background_ratio = 10 
vm.swappiness=10 
network-throughput 
net.ipv4.tcp_rmem="4096 87380 16777216" 
net.ipv4.tcp_wmem="4096 16384 16777216" 
net.ipv4.udp_mem="3145728 4194304 16777216"
Tuned: Profile Inheritance (latency) 
latency-performance 
force_latency=1 
governor=performance 
energy_perf_bias=performance 
min_perf_pct=100 
kernel.sched_min_granularity_ns=10000000 
vm.dirty_ratio=10 
vm.dirty_background_ratio=3 
vm.swappiness=10 
kernel.sched_migration_cost_ns=5000000 
network-latency 
transparent_hugepages=never 
net.core.busy_read=50 
net.core.busy_poll=50 
net.ipv4.tcp_fastopen=3 
kernel.numa_balancing=0
Networking performance – System setup 
•Evaluate the 2 new tuned profiles for networking 
•Disable unnecessary services, runlevel 3 
• Follow vendor guidelines for BIOS Tuning 
• Logical cores? Power Management? Turbo? 
•In the OS, consider 
•Disabling filesystem journal 
•SSD/Memory Storage 
•Reducing writeback thresholds if your app does disk I/O 
•NIC Offloads favor throughput
Network Tuning: Buffer Bloat 
# ss |grep -v ssh 
State Recv-Q Send-Q Local Address:Port Peer Address:Port 
ESTAB 0 0 172.17.1.36:38462 172.17.1.34:12865 
ESTAB 0 3723128 172.17.1.36:58856 172.17.1.34:53491 
•NIC ring buffers: # ethtool -g p1p1 
● 10G line-rate 
● ~4MB queue depth 
● Matching servers 
[Chart: Kernel Buffer Queue Depth, 10Gbit TCP_STREAM – Send-Q depth in MB (scale 0-4.5) over time at 1-sec intervals]
Tuned: Network Throughput Boost
[Chart: R7 RC1 Tuned Profile Comparison - 40G Networking; Mbit/s for balanced, throughput-performance, network-latency, network-throughput under Incorrect Binding, Correct Binding, and Correct Binding (Jumbo); from 19.7Gb/s with incorrect binding up to 39.6Gb/s]
netsniff-ng: ifpps
•Aggregate network stats to one screen
•Can output to .csv
[Screenshot annotations: Pkts/sec, Drops/sec, Hard/Soft IRQs/sec]
Network Tuning: Low Latency TCP 
• set TCP_NODELAY (Nagle) 
• Experiment with ethtool offloads (see the sketch after this list)
• tcp_low_latency: tiny substantive benefit found
•Ensure kernel buffers are “right-sized” 
•Use ss (Recv-Q Send-Q) 
•Don't setsockopt unless you've really tested 
•Review old code to see if you're using setsockopt 
• Might be hurting performance
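For the offload experiments, ethtool can show and toggle features per interface; a sketch using the p1p1 NIC from the other examples (which offloads to toggle depends on the workload):
# ethtool -k p1p1 (show current offload settings)
# ethtool -K p1p1 gro off gso off tso off (disable selected offloads for a latency test)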
Network Tuning: Low Latency UDP 
•Mainly about managing bursts, avoiding drops 
• rmem_max/wmem_max 
•TX 
• netdev_max_backlog 
• txqueuelen 
•RX 
• netdev_max_backlog 
• ethtool -g 
• ethtool -c 
• netdev_budget 
•Dropwatch tool in RHEL (see the sketch below)
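A sketch of inspecting and adjusting the knobs above (the sizes are illustrative, not recommendations):
# sysctl -w net.core.rmem_max=16777216 net.core.wmem_max=16777216
# sysctl -w net.core.netdev_max_backlog=30000
# ip link set dev p1p1 txqueuelen 10000
# ethtool -g p1p1 (ring buffer sizes)
# ethtool -c p1p1 (interrupt coalescing)
# dropwatch -l kas (then type 'start' to watch for drops)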
Full DynTicks (nohz_full)
Full DynTicks Patchset
•Patchset Goal:
•Stop interrupting userspace tasks
•Move timekeeping to non-latency-sensitive cores
•If nr_running=1, then the scheduler/tick can avoid that core
•Default disabled... Opt-in via nohz_full cmdline option
Kernel Tick:
•timekeeping (gettimeofday)
•Scheduler load balancing
•Memory statistics (vmstat)
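Opting in means adding nohz_full to the kernel command line; a sketch, assuming CPUs 1-15 are the isolated set and CPU 0 keeps housekeeping duties:
nohz_full=1-15 (appended to the kernel boot line)
# cat /sys/devices/system/cpu/nohz_full
1-15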
RHEL6 and 7 Tickless
[Timeline: timer ticks occur while a userspace task runs; the tick stops only while the CPU is idle]
nohz_full
[Timeline: with nohz_full, a running userspace task gets no timer ticks at all – tickless doesn't require idle]
Busy Polling
SO_BUSY_POLL Socket Option 
•Socket-layer code polls receive queue of NIC 
•Replaces interrupts and NAPI 
•Retains full capabilities of kernel network stack
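The system-wide knobs behind this are the busy-poll sysctls that the network-latency profile also sets (values are in microseconds; 50 matches the profile shown earlier). Per-socket opt-in uses setsockopt(SO_BUSY_POLL) in the application.
# sysctl -w net.core.busy_read=50
# sysctl -w net.core.busy_poll=50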
BUSY_POLL Socket Option 
[Chart: netperf TCP_RR and UDP_RR Transactions/sec, Baseline vs SO_BUSY_POLL – series TCP_RR-RX, TCP_RR-TX, UDP_RR-RX, UDP_RR-TX; scale 0-80000 Trans/sec]
Power Management
Power Management: P-states and C-states 
•P-state: CPU Frequency 
•Governors, Frequency scaling 
•C-state: CPU Idle State 
• Idle drivers
Introducing intel_pstate P-state Driver 
•New default P-state driver in RHEL7: intel_pstate (not a module)
• CPU governors replaced with sysfs min_perf_pct and max_perf_pct 
• Moves Turbo knob into OS control (yay!) 
• Tuned handles most of this for you: 
• Sets min_perf_pct=100 for most profiles 
• Sets x86_energy_perf_policy=performance (same as RHEL6)
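A sketch of verifying the active driver and the Tuned-set floor via sysfs:
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
intel_pstate
# cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
100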
Impact of CPU Idle Drivers (watts per workload)
[Chart: % power saved (scale 0-35) for RHEL7 @ C1 across workloads: kernel build, disk read, disk write, unpack tar.gz, active idle]
Turbostat shows P/C-states on Intel CPUs 
Default 
pk cor CPU %c0 GHz TSC %c1 %c3 %c6 %c7 
0 0 0 0.24 2.93 2.88 5.72 1.32 0.00 92.72 
0 1 1 2.54 3.03 2.88 3.13 0.15 0.00 94.18 
0 2 2 2.29 3.08 2.88 1.47 0.00 0.00 96.25 
0 3 3 1.75 1.75 2.88 1.21 0.47 0.12 96.44 
latency-performance 
pk cor CPU %c0 GHz TSC %c1 %c3 %c6 %c7 
0 0 0 0.00 3.30 2.90 100.00 0.00 0.00 0.00 
0 1 1 0.00 3.30 2.90 100.00 0.00 0.00 0.00 
0 2 2 0.00 3.30 2.90 100.00 0.00 0.00 0.00 
0 3 3 0.00 3.30 2.90 100.00 0.00 0.00 0.00
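The output above comes from turbostat (in kernel-tools on RHEL7); it can sample for the duration of any command, for example:
# turbostat sleep 10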
Frequency Scaling (Turbo) Varying Load
[Chart: average GHz for Socket 0 and Socket 1 at 1 and 8 threads, Turbo ON vs Turbo OFF; data labels include 3.59, 3.31, 3.3, 3.13, 2.9, 2.69, 2.7]
Analysis Tools 
Performance Co-Pilot
Performance Co-Pilot (PCP) 
(Multi) system-level performance 
monitoring and management
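Getting started on RHEL7 is a package install plus the collector daemons; a sketch (pcp and pcp-gui are the package names assumed here):
# yum install pcp pcp-gui
# systemctl start pmcd pmlogger
# pmstat -t 1 (quick live sanity check)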
pmchart – graphical metric plotting tool 
•Can plot myriad performance statistics 
•Recording mode allows for replay 
• i.e. on a different system 
•Record in GUI, then 
# pmafm $recording.folio 
•Ships with many pre-cooked “views”...for example: 
•ApacheServers: CPU%/Net/Busy/Idle Apache Servers 
•Overview: CPU%/Load/IOPS/Net/Memory
Performance Co-Pilot Demo Script
•Tiny script to exercise 4 food groups...
CPU
# stress -t 5 -c 1
DISK
# dd if=/dev/zero of=/root/2GB count=2048 bs=1M oflag=direct
NETWORK
# netperf -H rhel7.lab -l 5
MEMORY
# stress -t 5 --vm 1 --vm-bytes 16G
[pmchart Overview view: CPU %, Load Avg, IOPS, Network, Memory Allocated]
pmcollectl mode
[Screenshot: collectl-style panels for CPU, IOPS, NET, MEM]
pmatop mode
Tuna
Tuna GUI Capabilities Updated for RHEL7 
•Run tuning experiments in realtime 
•Save settings to a conf file (then load with tuna cli)
Network Tuning: IRQ affinity 
•Use irqbalance for the common case 
•New irqbalance automates NUMA affinity for IRQs 
•Flow-Steering Technologies 
•Move 'p1p1*' IRQs to Socket 1: 
# tuna -q p1p1* -S1 -m -x 
# tuna -Q | grep p1p1 
•Manual IRQ pinning for the last X percent/determinism 
• Guide on Red Hat Customer Portal
Tuna – for IRQs
•Move 'p1p1*' IRQs to Socket 0:
# tuna -q p1p1* -S0 -m -x
# tuna -Q | grep p1p1
78 p1p1-0 0 sfc
79 p1p1-1 1 sfc
80 p1p1-2 2 sfc
81 p1p1-3 3 sfc
82 p1p1-4 4 sfc
(columns: IRQ number, name, CPU core, driver)
Tuna – for processes 
# tuna -t netserver -P 
thread ctxt_switches 
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd 
13488 OTHER 0 0xfff 1 0 netserver 
# tuna -c2 -t netserver -m 
# tuna -t netserver -P 
thread ctxt_switches 
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd 
13488 OTHER 0 2 1 0 netserver
Tuna – for core/socket isolation 
# tuna -S1 -i 
# grep Cpus_allowed_list /proc/`pgrep rsyslogd`/status 
Cpus_allowed_list: 0-15 
# tuna -S1 -i (tuna sets affinity of 'init' task as well) 
# grep Cpus_allowed_list /proc/`pgrep rsyslogd`/status 
Cpus_allowed_list: 0,1,2,3,4,5,6,7
Analysis Tools 
perf
perf 
Userspace tool to read CPU counters 
and kernel tracepoints
perf list 
List counters/tracepoints available on your system
perf list 
grep for something interesting, maybe to see what numabalance is doing?
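A sketch of such a filter (exact event names vary by kernel):
# perf list | grep -i numa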
perf top 
System-wide 'top' view of busy functions
perf record 
•Record system-wide (-a) 
• A single command 
• An existing process (-p) 
• Add call-chain recording (-g) 
• Only specific events (-e)
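A few sketches combining those flags (the pid and durations are illustrative; 13488 is the netserver pid from the tuna example):
# perf record -a sleep 10 (whole system for 10 seconds)
# perf record -g -p 13488 sleep 10 (existing process, with call chains)
# perf record -a -e sched:sched_switch sleep 5 (one specific tracepoint)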
perf report
[Screenshot annotations: /dev/zero, oflag=direct]
perf diff 
Compare 2 perf recordings
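A sketch of the workflow (with no arguments, perf diff compares perf.data.old against perf.data):
# perf record -a -o perf.data.old sleep 5
...apply a tuning change...
# perf record -a -o perf.data sleep 5
# perf diff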
perf probe (dynamic tracepoints)
Insert a tracepoint on any function...
Try 'perf probe -F' to list possibilities
[Screenshot annotation: My Probe Point]
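A sketch of the full probe lifecycle, using tcp_sendmsg as an example function:
# perf probe -F | grep tcp_sendmsg
# perf probe --add tcp_sendmsg
# perf record -e probe:tcp_sendmsg -a sleep 5
# perf probe --del tcp_sendmsg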
RHEL7 Performance Tuning Summary 
•Use “Tuned”, “NumaD” and “Tuna” in RHEL6 and RHEL7 
● Power savings mode (performance), locked (latency) 
● Transparent Hugepages for anon memory (monitor it) 
● numabalance – Multi-instance, consider “NumaD” 
● Virtualization – virtio drivers, consider SR-IOV 
•Manually Tune 
●NUMA – via numactl, monitor numastat -c pid 
● Huge Pages – static hugepages for pinned shared-memory 
● Managing VM, dirty ratio and swappiness tuning 
● Use cgroups for further resource management control
Upcoming Performance Talks 
•Performance tuning: Red Hat Enterprise Linux for databases 
•Sanjay Rao, Wednesday April 16, 2:30pm 
•Automatic NUMA balancing for bare-metal workloads & KVM 
virtualization 
•Rik van Riel, Wednesday April 16, 3:40pm 
•Red Hat Storage Server Performance 
•Ben England, Thursday April 17, 11:00am
Helpful Utilities 
Supportability 
• redhat-support-tool 
• sos 
• kdump 
• perf 
• psmisc 
• strace 
• sysstat 
• systemtap 
• trace-cmd 
• util-linux-ng 
NUMA 
• hwloc 
• Intel PCM 
• numactl 
• numad 
• numatop (01.org) 
Power/Tuning 
• cpupowerutils (R6) 
• kernel-tools (R7) 
• powertop 
• tuna 
• tuned 
Networking 
• dropwatch 
• ethtool 
• netsniff-ng (EPEL6) 
• tcpdump 
• wireshark/tshark 
Storage 
• blktrace 
• iotop 
• iostat
Helpful Links 
•Official Red Hat Documentation 
•Red Hat Low Latency Performance Tuning Guide 
•Optimizing RHEL Performance by Tuning IRQ Affinity 
•nohz_full 
•Performance Co-Pilot 
•Perf 
•How do I create my own tuned profile on RHEL7 ? 
•Busy Polling Whitepaper 
•Blog: http://www.breakage.org/ or @jeremyeder
Q & A
Tuned: Profile virtual-host 
throughput-performance 
governor=performance 
energy_perf_bias=performance 
min_perf_pct=100 
transparent_hugepages=always 
readahead=4096 
sched_min_granularity_ns = 10000000 
sched_wakeup_granularity_ns = 15000000 
vm.dirty_ratio = 40 
vm.dirty_background_ratio = 10 
vm.swappiness=10 
virtual-host 
vm.dirty_background_ratio = 5 
sched_migration_cost_ns = 5000000 
virtual-guest 
vm.dirty_ratio = 30 
vm.swappiness = 30
RHEL RHS Tuning w/ RHEV/RHEL OSP (tuned)
• gluster volume set <volume> group virt
• XFS: mkfs -n size=8192, mount with inode64, noatime
•RHS server: tuned-adm profile rhs-virtualization
• Increases readahead, lowers dirty ratios
• KVM host: tuned-adm profile virtual-host
• Better response time: shrink the guest block device queue
•/sys/block/vda/queue/nr_requests (16 or 8)
• Best sequential read throughput: raise VM read-ahead
•/sys/block/vda/queue/read_ahead_kb (4096/8192)
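Inside the guest, those two settings are plain sysfs writes; a sketch, assuming a virtio disk named vda as above:
# echo 8 > /sys/block/vda/queue/nr_requests
# echo 4096 > /sys/block/vda/queue/read_ahead_kb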
Iozone Performance Comparison RHS2.1/XFS w/ RHEV
[Chart: rnd-write, rnd-read, seq-write, seq-read throughput (scale 0-7000), Out-of-the-box vs tuned rhs-virtualization]
RHS Fuse vs libgfapi integration (RHEL6.5 and RHEL7)
[Chart: OSP 4.0 Large File Seq. I/O - FUSE vs. libgfapi; 4 RHS servers (repl2), 4 computes, 4G filesz, 64K recsz; Total Throughput in MB/Sec for Sequential Writes and Sequential Reads at 1 Instance and 64 Instances; data labels: 131, 520, 842, 956, 399, 481, 1201, 1321]
