Slide 3 · 難攻不落(난공불락) Open Source Infra Seminar
Limitations of this session
Everything about system tuning cannot be covered in one hour
=> Focus on key concepts and basic tuning
Prerequisites for tuning
Tuning requires an understanding of both the hardware and the software
It also requires an understanding of how the systems interact with each other
Considerations when tuning
User/administrator factors must also be taken into account
User mistakes? Misunderstood concepts?
Never assume that everyone understands tuning
Cautions when tuning
System tuning is not magic
Hardware upgrades and load balancing are often needed as well
Getting started
Slide 4
Two terms that must be kept distinct when explaining system tuning or setting improvement goals:
Low-latency – Latency is a measure of time delay experienced in a system, the
precise definition of which depends on the system and the time being
measured.[1]
High-throughput – The system throughput or aggregate throughput is the sum of
the data rates that are delivered to all terminals in a network or disk-drive.[1]
[1] : wikipedia.org
Getting started
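A toy calculation (all numbers assumed for illustration) shows why the two metrics must be kept separate: the same link can score well on one and poorly on the other.

```shell
# Illustrative numbers only: a satellite-style link moving bulk data.
bytes_moved=1073741824                      # 1 GiB transferred
elapsed_s=8                                 # in 8 seconds
throughput=$(( bytes_moved / elapsed_s ))   # high throughput: 128 MiB/s
rtt_ms=600                                  # yet each request waits ~600 ms
echo "throughput: ${throughput} B/s, per-request latency: ${rtt_ms} ms"
```

A bulk-transfer workload would call this link fast; an interactive one would call it slow.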
Slide 5
Tuning the hardware and firmware first
In many cases, it will bring much better results than software tuning
Refer to the hardware manual
As a trade-off, power-reduction features usually affect overall performance,
especially latency, more than we expect.
Disabling and removing unused services
Getting started
Slide 12
Turn off tickless kernel
Limit ACPI and Intel’s C-State
Turn off ‘Transparent Huge Page’
Turn off ‘CGroup’ feature
Check what services are running
Disable unused services
Basic tuning
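The items in this checklist are mostly boot-time settings; a hedged sketch of how to inspect them on a running system (the kernel parameter names in the comment, such as `nohz=off` and `transparent_hugepage=never`, follow the mainline kernel documentation):

```shell
# Inspect the parameters the kernel was booted with (the relevant flags are
# e.g. nohz=off, processor.max_cstate=1, transparent_hugepage=never,
# cgroup_disable=memory).
cat /proc/cmdline
# List services configured to start at boot (EL 6 style; chkconfig is
# assumed to be present on the target system):
#   chkconfig --list | grep ':on'
```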
Slide 13
If you don’t know what to do, then use tuned instead
Tuned is a daemon that monitors the use of system components and dynamically
tunes system settings based on that monitoring information.
It includes predefined profiles for specific use cases.
Basic tuning
# yum install tuned
# service tuned start
# chkconfig tuned on
# tuned-adm active            # the 'default' profile is active initially
# tuned-adm list
# tuned-adm profile [profile_name]
Slide 14
The predefined profiles (in EL 6)
It is possible to customize the profile
Basic tuning
# tuned-adm list
- laptop-ac-powersave
- desktop-powersave
- enterprise-storage
- default
- virtual-guest
- throughput-performance
- laptop-battery-powersave
- server-powersave
- latency-performance
- spindown-disk
- virtual-host
# tuned-adm profile latency-performance
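As a sketch of the customization step: in EL 6, each tuned profile is a directory of config files (layout assumed from EL 6 tuned, normally under /etc/tune-profiles), and a custom profile is simply a copy of a predefined one. Simulated here in a temp directory so nothing system-wide is touched:

```shell
# Stand-in for /etc/tune-profiles, so this sketch needs no root access.
profiles=$(mktemp -d)
mkdir -p "$profiles/latency-performance"
printf 'ELEVATOR="deadline"\n' > "$profiles/latency-performance/ktune.sysconfig"
# A custom profile is a copy of a predefined one, edited in place and then
# activated with 'tuned-adm profile myprofile'.
cp -a "$profiles/latency-performance" "$profiles/myprofile"
grep ELEVATOR "$profiles/myprofile/ktune.sysconfig"
```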
Slide 16
[Figure: address translation. The MMU (in the CPU) translates a linear (virtual) address by first consulting the TLB; on a hit ("Yes") the physical memory address is returned directly. On a TLB miss ("No") the page tables are walked (offset within PGD → offset within PMD → offset within PTE → offset within the data page), and a page fault is raised if no valid mapping exists.]
Memory addressing overview
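The walk in the figure can be made concrete by splitting an address into its table indices. This sketch uses the x86-64 layout (4 KiB pages, 9-bit indices per level); the sample address is made up, and the level names follow the slide:

```shell
addr=$(( 0x7f3a12345678 ))            # sample 48-bit virtual address
offset=$(( addr & 0xfff ))            # offset within the 4 KiB data page
pte=$(( (addr >> 12) & 0x1ff ))       # index into the PTE level
pmd=$(( (addr >> 21) & 0x1ff ))       # index into the PMD level
pgd=$(( (addr >> 30) & 0x1ff ))       # upper level(s) omitted for brevity
echo "pgd=$pgd pmd=$pmd pte=$pte offset=$offset"
```

Each 9-bit index selects one of 512 entries in its table; the final 12 bits address a byte within the page.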
Slide 17
Physical memory is divided into pages, and the default page size is 4 KiB.
If a system has a large amount of memory and the workload accesses a large,
contiguous memory space, TLB misses will increase rapidly.
The problem in large physical-memory environments
Translation Lookaside Buffer (TLB)
Translating linear addresses into physical addresses takes time, so most processors
have a small cache known as a TLB that stores the physical addresses associated
with the most recently accessed virtual addresses.
The TLB is a small cache, so large-memory applications can incur high TLB miss rates,
and TLB misses are extremely expensive on today’s very fast, pipelined CPUs.
Slide 18
The IA-32 architecture supports 4 KiB, 2 MiB, or 4 MiB pages.
The Linux kernel also supports larger pages – 2 MiB and 1 GiB – through the
HugePage mechanism.
Having fewer TLB entries that point to more memory means that a TLB hit is more
likely to occur.
Performance improvement for large-memory environments – HugePage
Standard HugePage (EL 4, 5, 6)
2 MiB per page
Reserve/free via /proc/sys/vm/nr_hugepages
Used via hugetlbfs
1 GiB HugePage (EL 6, 7)
1 GiB per page
Reserved at boot time / cannot be freed
Used via hugetlbfs
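Why fewer, larger entries help can be seen from the TLB reach, i.e. how much memory the TLB can cover at once (the 512-entry TLB size below is an assumed example, not a spec of any particular CPU):

```shell
tlb_entries=512
reach_4k=$(( tlb_entries * 4096 ))              # coverage with 4 KiB pages
reach_2m=$(( tlb_entries * 2 * 1024 * 1024 ))   # coverage with 2 MiB pages
echo "4 KiB pages: $(( reach_4k / 1048576 )) MiB reach"
echo "2 MiB pages: $(( reach_2m / 1048576 )) MiB reach"
# Reserving standard hugepages at runtime (as root):
#   sysctl vm.nr_hugepages=512
```

With 2 MiB pages the same TLB covers 512× more memory (1 GiB instead of 2 MiB), so hits become far more likely.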
Slide 19
Enabled by default in EL 6 for all applications.
The kernel attempts to allocate hugepages whenever possible, and any process
receives 2 MiB pages if its mmap region is naturally 2 MiB-aligned.
If no hugepages are available, the kernel falls back to regular 4 KiB pages.
THP is also swappable (unlike hugetlbfs): the huge page is broken back into
smaller 4 KiB pages, which are then swapped out normally.
No modification is required for applications.
Use with care when running Big Data or DBMS solutions.
Performance improvement for large-memory environments – Transparent Hugepage
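A quick way to check the THP state before deciding whether it suits a workload (sysfs paths as in mainline kernels; the runtime disable in the comment requires root):

```shell
# Current THP mode: the bracketed word is the active one
# ([always] / [madvise] / [never]).
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null || true
# Anonymous memory currently backed by transparent hugepages:
grep AnonHugePages /proc/meminfo
# To disable at runtime (as root), e.g. before starting a DBMS:
#   echo never > /sys/kernel/mm/transparent_hugepage/enabled
```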
Slide 21
Understanding swap
Swap space increases the amount of effective memory on a system. As free
memory drops, old pages can be paged out to disk to free memory for
other uses.
Inactive anonymous pages will be selected.
These days systems ship with large amounts of physical memory. Is swap space
obsolete?
Without swap space, anonymous pages can't be flushed; they have to
stay in memory until they're deleted, even if they're never used again.
Flushing pages to swap is actually a bit easier and quicker than flushing them
to disk: the code is much simpler, and there are no directory trees to update.
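The balance between swapping anonymous pages and dropping page cache is tunable; this sketch only reads the current value (the write in the comment requires root):

```shell
# vm.swappiness biases reclaim between anonymous pages (swap out) and page
# cache (drop); lower values keep anonymous pages in RAM longer.
cat /proc/sys/vm/swappiness
# Persist a lower value at runtime (as root):
#   sysctl -w vm.swappiness=10
```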
Slide 22
Understanding cache memory
To reduce service time for slower subsystems (I/O), the kernel uses different
types of caches:
Slab cache:
Stores the various data structures the kernel uses; these structures
do not fit neatly into single pages of memory.
Slabs are allocated from a pre-allocated memory area.
Swap cache:
Tracks pages that were previously swapped out and have since been swapped back in.
If the kernel needs to swap such a page out again and finds an entry in the
swap cache, the page does not need to be written to disk again.
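Both caches can be observed directly (field names as in /proc/meminfo on mainline kernels; the values vary per system):

```shell
# Slab: total slab-cache memory; SwapCached: pages present in the swap cache.
grep -E '^(Slab|SwapCached)' /proc/meminfo
# Per-allocator slab detail (root is usually required):
#   slabtop -o | head
```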
Slide 23
Understanding cache memory
Page cache (file-backed, not swappable):
To improve overall system performance, the kernel tends to use free
memory as a cache for data being read from or written to disk, as much
as possible.
This data can then be re-used from RAM without issuing I/O requests to the disk.
In some cases, the page cache causes problems:
The cache size keeps growing, and pages cannot be freed as fast as the
cache grows.
System performance drops because the kernel is busy seeking free pages or
swapping pages out to free space, in spite of the large page cache.
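The page cache and pending writeback are visible in /proc/meminfo, which helps diagnose the situation above (the drop_caches write in the comment requires root and is for testing only):

```shell
# Cached: page-cache size; Dirty/Writeback: data not yet safely on disk.
grep -E '^(Cached|Dirty|Writeback):' /proc/meminfo
# Drop clean page cache to observe its effect (as root; the cache refills):
#   sync; echo 1 > /proc/sys/vm/drop_caches
```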
Slide 25
Understanding the I/O subsystem
Read or write requests are transformed into block device requests that go into a
queue.
The I/O subsystem then batches similar requests that come within a specific
time window and processes them all at once.
Generally, the I/O subsystem does not operate in a true FIFO manner. It processes
queued read/write requests depending on the selected scheduler algorithms called
elevators because they operate in the same manner that real-life building elevators
do.
# cat /sys/block/<device>/queue/scheduler
noop anticipatory deadline [cfq]
Slide 26
Understanding the I/O subsystem
Think about how a hard disk drive works.
To improve overall I/O performance, the scheduler
re-arranges the requests, and
wisely chooses when each request is served.
[Figure: new I/O requests enter the I/O queue; seeking to the location of
each request individually would drag performance down, so the queue is
reordered before the requests are served.]
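The reordering idea can be shown with a toy example: serving block requests in sector order, rather than arrival order, shortens the total seek distance (the sector numbers are made up):

```shell
# Arrival order of block requests (illustrative sector numbers):
requests="71 10 95 13 62"
# An elevator-style scheduler serves them in sector order to cut seeking:
service_order=$(printf '%s\n' $requests | sort -n | tr '\n' ' ')
echo "service order: ${service_order% }"
```

Real schedulers also merge adjacent requests and weigh fairness and deadlines, but the sweep in one direction is the core of the elevator analogy.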
Slide 27
Improving I/O subsystem performance
Completely Fair Queuing – cfq
Default I/O scheduler in EL 5, 6, 7
Divides all available I/O bandwidth equally among all processes issuing I/O
requests.
Deadline – deadline
For large, sequential, read-mostly workloads
Guarantees a response time for each request; once a request reaches its
expiration time, it is serviced immediately
# echo deadline > /sys/block/<device>/queue/scheduler
Slide 28
Improving I/O subsystem performance
Anticipatory – anticipatory
Optimizes systems with small or slow disk subsystems.
Recommended for servers running data-processing applications that are not
regularly interrupted by external requests.
NOOP – noop
For systems with heavy CPU workloads
Puts all requests into a simple unordered queue
Recommended for virtualized guests
elevator=noop   # kernel boot parameter
Slide 29
Understanding journaling file systems
A journaling file system recovers quickly thanks to a log book kept for the file system.
Any change to the file system is written to the journal as a transaction
before being committed to the actual file system.
In the event of a system crash or power failure, the file system is quickly
recovered and is less likely to be corrupted.
This is a very important feature in the enterprise market.
ext3, ext4, and xfs are journaling file systems.
EL 6 uses ext4 and EL 7 uses xfs as the default file system.
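The write-ahead idea behind journaling can be shown in miniature: record the intended change in a log, apply it, then mark the transaction complete (plain temp files stand in for the journal and the file system):

```shell
journal=$(mktemp); fs=$(mktemp)
echo "append: hello" >> "$journal"   # 1. record the transaction in the journal
echo "hello" >> "$fs"                # 2. commit to the actual file system
: > "$journal"                       # 3. done - clear the journal entry
cat "$fs"
# After a crash between steps 1 and 2, replaying the journal would redo the
# write; recovery only needs to scan the (small) journal, not the whole disk.
```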
Slide 31
Packet loss caused by degraded network performance
Overruns: usually seen under heavy UDP traffic
Drops: seen under both heavy UDP and TCP traffic
bond1 Link encap:Ethernet HWaddr 00:AA:BB:CC:DD:EE
inet addr:192.168.10.33 Bcast:192.168.10.255 Mask:255.255.255.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500
RX packets:8344569671 errors:0 dropped:0 overruns:46295 frame:0
TX packets:53614 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:2952210470156 (2.6 TiB) TX bytes:5251386 (5.0 MiB)
eth0 Link encap:Ethernet HWaddr
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:27051811 errors:0 dropped:696311 overruns:0 frame:0
TX packets:110147381 errors:0 dropped:0 overruns:0 carrier:0
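The same counters can be read without ifconfig. Overruns usually mean the NIC RX ring filled before the kernel drained it; enlarging the ring may help (the interface name `eth0` in the comments is an assumption, and the ethtool commands require root):

```shell
# Per-interface RX/TX error counters ('lo' is used so this runs unprivileged;
# substitute the real interface name):
ip -s link show lo
# Ring buffer sizes, and enlarging the RX ring (as root, name assumed):
#   ethtool -g eth0
#   ethtool -G eth0 rx 4096
```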
Slide 32
Understanding the TCP window
Under TCP, the receiver must send an ACK for every packet it receives,
and the sender must wait for that ACK.
This affects network throughput and CPU utilization.
If the network is long and slow, like a satellite link, or has a large bandwidth,
more packets can be on the link between sender and receiver at a time.
The TCP window allows the sender to send more packets without waiting for ACKs.
The length of the TCP window varies with the size of the TCP socket buffer.
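How large the window needs to be follows from the bandwidth-delay product: the bytes that must be in flight to keep the link full (the 100 Mbit/s / 60 ms figures are examples):

```shell
# Kernel limits on TCP socket buffers: min, default, max in bytes.
cat /proc/sys/net/ipv4/tcp_rmem /proc/sys/net/ipv4/tcp_wmem 2>/dev/null || true
# Bandwidth-delay product for a 100 Mbit/s path with a 60 ms RTT:
bdp=$(( 100000000 / 8 * 60 / 1000 ))   # bytes/s * seconds of delay
echo "BDP: ${bdp} bytes"
```

If the socket buffer (and hence the window) is smaller than the BDP, the sender stalls waiting for ACKs and throughput drops below the link capacity.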
Slide 33
Understanding the TCP window
If an application fetches packets from its socket buffers too slowly, the buffers
fill up and packets start to be dropped.
Better performance can be obtained by increasing the TCP socket buffer size.
[Figure: the sender transmits up to the advertised receive window (e.g. 4
segments) without waiting for ACKs; as the receiver's buffer fills, its ACK
advertises a smaller window (e.g. 2), throttling the sender.]