[231] Database Performance Analysis and Optimization at the Operating System Level

김상욱 (Sangwook Kim)
Apposha

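김상욱 (Sangwook Kim)
• Co-founder and CEO @ Apposha
• Ph.D. student in Computer Engineering, Sungkyunkwan University
• Author of 4 SCI journal articles and 6 international conference papers
• Cloud/virtualization
  • Multicore scheduling [ASPLOS’13, VEE’14]
  • Group-based memory management [JPDC’14]
• Databases/storage
  • Non-volatile cache management [USENIX ATC’15, ApSys’16]
  • Request-centric I/O prioritization [FAST’17, HotStorage’17]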

5x the performance of MySQL
1/10 the cost of commercial products
for
Apposha?

Contents
• DB trends
• Why OS-level analysis and optimization matter
• Linux kernel optimization from a DB performance perspective

Growing adoption of diverse open-source DBs
[Figure: share (%) of open-source vs. commercial products, Jan. 2013 to Jan. 2017. Source: db-engines.com]
[Figure: DB-Engines score of MongoDB, PostgreSQL, Cassandra, Redis, SQLite, and Elasticsearch, Jan. 2013 to Jan. 2017. Source: db-engines.com]

The changing relationship between DBs and OSes
• Past
  • Little OS-level support for DBs [CACM’81]
  • DBs evolved toward minimizing OS interference (e.g., Oracle)
• Present
  • The OS provides a variety of interfaces
    • madvise(), fadvise(), ionice(), …
    • fallocate(), fdatasync(), sync_file_range(), …
  • Implementations that actively use OS features are now mainstream
  • “Nearly all modern databases run through the file system.” [OSDI’14]
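To make the interface list concrete, here is a minimal sketch that uses three of those calls through Python's `os` module on Linux. The file name `table.dat`, the 1 MiB reservation, and the helper names are assumptions for illustration only; nothing here is taken from a particular database.

```python
import os
import tempfile


def write_durably(path: str, payload: bytes, reserve: int = 1 << 20) -> int:
    """Preallocate, hint, write, and sync a file; return its final size."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # fallocate()-style preallocation: reserve space up front.
        os.posix_fallocate(fd, 0, reserve)
        # fadvise(): hint random access so the kernel can skip readahead here.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
        os.pwrite(fd, payload, 0)
        # fdatasync(): flush the data plus only the metadata needed to read it.
        os.fdatasync(fd)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


def demo() -> int:
    # Hypothetical file name inside a throwaway directory.
    with tempfile.TemporaryDirectory() as d:
        return write_durably(os.path.join(d, "table.dat"), b"row-1")
```

The point of the sketch is that each hint or sync is an explicit request to the OS, so the OS's handling of these requests directly shapes DB behavior.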

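Why OS-level optimization matters
- The operating system is what actually manages and allocates resources, so DB-level optimization alone can improve performance only so far.
[Diagram labels: high priority, low priority, database, hardware]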

Case studies
• MySQL “swap insanity”
• MongoDB readahead tuning
• PostgreSQL autovacuum settings
• Elasticsearch logging

Case 1: MySQL “swap insanity”
NUMA (Non-Uniform Memory Access) architecture
Default NUMA allocation
Solution: interleaving via the OS interface
# numactl --interleave=all <command>
https://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture
https://blog.jcole.us/2012/04/16/a-brief-update-on-numa-and-mysql
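A toy model of why interleaving helps, assuming the behavior described in the linked posts: the default policy prefers the local node, so one large allocation can fill a single node while the other still has free memory. The node sizes and page counts below are made-up units, and `place_pages` is a hypothetical helper, not a kernel or numactl API.

```python
from typing import List


def place_pages(pages: int, capacity: List[int], interleave: bool) -> List[int]:
    """Return pages used per NUMA node after placing `pages` pages."""
    used = [0] * len(capacity)
    nxt = 0  # next node in round-robin order (interleave mode only)
    for _ in range(pages):
        if interleave:
            # Round-robin across nodes, starting from the next one in turn.
            order = [(nxt + k) % len(capacity) for k in range(len(capacity))]
        else:
            # Default-like policy: node 0 is "local", so try it first.
            order = list(range(len(capacity)))
        node = next(n for n in order if used[n] < capacity[n])
        used[node] += 1
        nxt = (node + 1) % len(capacity)
    return used


default_layout = place_pages(48, [32, 32], interleave=False)
interleaved_layout = place_pages(48, [32, 32], interleave=True)
```

In this model the default layout exhausts node 0 completely, while the interleaved layout leaves headroom on both nodes.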

Case 2: MongoDB readahead tuning
“Set the readahead setting to 0 regardless of storage media type.”
[Diagram labels: CPU, Disk]

Case 2: MongoDB readahead tuning
[Figure: throughput (ops/sec), Default vs. Readahead 0. Callout: throughput 40%]
• Dell PowerEdge R730
• 32 cores
• 8GB DRAM
• 1 SAS SSD
• MongoDB v3.2.10
• YCSB workload
• Read-only
• 10GB dataset
• 10 min run

Case 3: PostgreSQL autovacuum settings
[Diagram: Free Space Map]
http://bstar36.tistory.com/308

[Figure: throughput (trx/sec) vs. number of clients (1 to 300), Default vs. Aggressive AV. Callout: throughput 40%]
• 32 cores
• 132GB DRAM
• 1 SAS SSD
• PostgreSQL v9.5.6
• TPC-C workload
• 50GB dataset
• 1 hour run

[Figure: maximum response latency (ms) over elapsed time (0 to 3000 s), Default vs. Aggressive AV. Callout: up to 2.5 s]

[Figure: maximum response latency over elapsed time (0 to 3000 s), Default vs. Aggressive AV vs. V12-P. Callouts: up to 2.5 s, up to 0.1 s]
V12-P: Aggressive AV + Apposha optimization engine

Case 4: Elasticsearch logging
https://www.elastic.co/guide/en/elasticsearch/guide/current/translog.html

[Figure: throughput (index/sec), Default (data safe) vs. Async (data loss possible). Callout: throughput 4.5X]
• AWS m4.2xlarge
• 8 vCPUs, 16GB RAM
• EBS 1000 IOPS
• Elasticsearch v5.2
• YCSB workload
• Insert-only
• 50 concurrent clients

[Diagram: Translog and Translog.ckp are written with write() + fdatasync(), passing through FS metadata to storage]
Step 1. write log record
Step 2. sync log record
Step 3. sync FS metadata
Step 4. write log metadata
Step 5. sync log metadata
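The five steps can be sketched as a toy log with a separate checkpoint file. This is only an illustration of the write-then-sync pattern under assumed file names (`translog`, `translog.ckp`) and an assumed checkpoint format (one 8-byte offset); it is not Elasticsearch's actual implementation.

```python
import os
import struct
import tempfile


class TinyTranslog:
    """Toy write-ahead log with a separate checkpoint (.ckp) file."""

    def __init__(self, directory: str) -> None:
        self.log_fd = os.open(os.path.join(directory, "translog"),
                              os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.ckp_fd = os.open(os.path.join(directory, "translog.ckp"),
                              os.O_WRONLY | os.O_CREAT, 0o600)
        self.offset = 0
        self.syncs = 0

    def append(self, record: bytes) -> None:
        # Step 1: write the log record.
        os.write(self.log_fd, record)
        self.offset += len(record)
        # Steps 2-3: sync it. The file grew, so the file system also has to
        # persist the metadata needed to find the new data.
        os.fdatasync(self.log_fd)
        # Step 4: write the log metadata (here: just the durable offset).
        os.pwrite(self.ckp_fd, struct.pack("<Q", self.offset), 0)
        # Step 5: sync the log metadata.
        os.fdatasync(self.ckp_fd)
        self.syncs += 2

    def close(self) -> None:
        os.close(self.log_fd)
        os.close(self.ckp_fd)


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as d:
        log = TinyTranslog(d)
        for i in range(3):
            log.append(b"op-%d\n" % i)
        result = (log.offset, log.syncs)
        log.close()
        return result
```

Every logged operation pays for two synchronous flushes in this model, which is the cost the Default vs. Async comparison exposes.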

[Diagram: Translog and Translog.ckp written to storage]

[Figure: throughput (axis 0 to 5000), Default vs. V12-E v0.1 vs. Async. Callout: throughput 2X]

[Diagram: Translog and Translog.ckp written to storage with write()]

[Figure: throughput (axis 0 to 5000), Default vs. V12-E v0.2 vs. Async. Callout: throughput 3X. Labels: data safe, data loss possible]

Contents
• DB trends
• Why OS-level analysis and optimization matter
• Linux kernel optimization from a DB performance perspective

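Linux kernel analysis from a DB performance perspective
• Storage I/O stack
  • Difficulty of applying I/O priorities
  • WAL write amplification
  • Scalability loss due to locks
• CPU scheduling
  • Multicore load balancing problems
[Diagram layers: database, operating system, hardware]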

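Difficulty of applying I/O priorities
• Typical DB structure
[Diagram: a client exchanges requests and responses with the database; database threads T1 to T4 each issue I/O through the operating system to the storage device. Label: database performance]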
* Example: MongoDB
- Client (foreground)
- Checkpointer
- Log writer
- Eviction worker
- …

• MongoDB experiment results
[Figure: operation throughput (ops/sec) over elapsed time (0 to 1800 s) under CFQ. Annotation: regular checkpoint task]
• 32 cores
• 132GB DRAM
• 1 SAS SSD
• MongoDB v3.2.10
• YCSB workload
• Read:update=50:50
• 12GB dataset
• 30 min run
• 200 clients

• MongoDB experiment results
[Figure: operation throughput (ops/sec) over elapsed time (0 to 1800 s), CFQ vs. CFQ-IDLE. Annotation: I/O priority does not help]
• 32 cores
• 132GB DRAM
• 1 SAS SSD
• MongoDB v3.2.10
• YCSB workload
• Read:update=50:50
• 12GB dataset
• 30 min run
• 200 clients

• Each layer handles I/O independently
[Diagram: I/O stack, top to bottom: Application, Caching Layer, File System Layer, Block Layer, Storage Device. Label: Abstraction]

[Diagram sequence over the same stack: read() and write() pass through the buffer cache, then the block-level queue, then the device-internal queue. The buffer cache, the block-level queue, and the device-internal queue each apply their own admission control, and foreground (FG) and background (BG) requests are reordered along the way.]

• I/O priority inversion
[Diagram sequence over the same stack: tasks in different layers synchronize through locks and condition variables. An FG task waits on a lock while the BG task involved is doing I/O, and an FG task waits on a condition variable until a BG task wakes it. The same waiting pattern also appears with a user-level condition variable.]

• I/O priority inversion (I/O dependency)
[Diagram: an FG task waits on outstanding I/Os already in flight in the stack, for ensuring consistency and/or durability]

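Request-centric I/O prioritization
• Solution v1 (reactive)
  • Apply priorities across all layers [FAST’17]
  • Dynamic priority inheritance [USENIX ATC’15, FAST’17]
    • Locks
    • Condition variables
[Diagram, lock case: an FG task waits on a lock held by a BG task; the BG task inherits FG priority while its I/O is submitted and completed.]
[Diagram, condition variable (CV) case: a BG task registers with the CV, inherits priority from the waiting FG task, submits and completes its I/O, then wakes the FG task.]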
[USENIX ATC’15] Request-Oriented Durable Write Caching for Application Performance
[FAST’17] Enlightening the I/O Path: A Holistic Approach for Application Performance
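A minimal user-space sketch of priority inheritance through a lock, to illustrate the idea only. The `Task` and `InheritingLock` names and the FG/BG encoding are assumptions for this example; the kernel changes described in the cited papers are far more involved, and this toy tracks a single lock.

```python
from dataclasses import dataclass, field
from typing import List, Optional

FG, BG = 0, 1  # assumed encoding: lower value = higher priority


@dataclass
class Task:
    name: str
    base_prio: int
    inherited: List[int] = field(default_factory=list)

    @property
    def prio(self) -> int:
        # Effective priority: best of the task's own and any inherited ones.
        return min([self.base_prio] + self.inherited)


class InheritingLock:
    def __init__(self) -> None:
        self.owner: Optional[Task] = None
        self.waiters: List[Task] = []

    def acquire(self, task: Task) -> bool:
        if self.owner is None:
            self.owner = task
            return True
        # Blocked: lend the waiter's priority to the current owner.
        self.waiters.append(task)
        self.owner.inherited.append(task.prio)
        return False

    def release(self) -> Optional[Task]:
        assert self.owner is not None
        # Drop inherited priorities, then hand over to the best waiter.
        self.owner.inherited.clear()
        self.owner = None
        if self.waiters:
            self.waiters.sort(key=lambda t: t.prio)
            self.owner = self.waiters.pop(0)
            for waiter in self.waiters:
                self.owner.inherited.append(waiter.prio)
        return self.owner


def demo() -> tuple:
    checkpointer = Task("checkpointer", BG)
    client = Task("client", FG)
    lock = InheritingLock()
    lock.acquire(checkpointer)
    before = checkpointer.prio
    lock.acquire(client)          # client blocks behind the checkpointer
    during = checkpointer.prio    # checkpointer now runs at FG priority
    new_owner = lock.release()
    after = checkpointer.prio
    return before, during, after, new_owner.name
```

While the foreground client is blocked, the background checkpointer's I/O would be issued at the inherited priority, which is what keeps the foreground request from stalling behind low-priority work.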

• Solution v1 (drawbacks)
[Diagram: the changes span every layer: Application, Caching Layer, File System Layer, Block Layer]
Synchronization
linux/include/linux/mutex.h
linux/include/linux/pagemap.h
linux/include/linux/rtmutex.h
linux/include/linux/rwsem.h
linux/include/uapi/linux/sem.h
linux/include/linux/wait.h
linux/kernel/sched/wait.c
linux/kernel/locking/rwsem.c
linux/kernel/futex.c
linux/kernel/locking/mutex.h
linux/kernel/locking/mutex.c
linux/kernel/locking/rtmutex.c
linux/kernel/locking/rwsem-xadd.c
linux/kernel/fork.c
linux/kernel/sys.c
linux/kernel/sysctl.c
linux/include/linux/sched.h
linux/include/linux/blk_types.h
linux/include/linux/blkdev.h
linux/include/linux/buffer_head.h
linux/include/linux/caq.h
linux/block/blk-core.c
linux/block/blk-flush.c
linux/block/blk-lib.c
linux/block/blk-mq.c
linux/block/caq-iosched.c
linux/block/cfq-iosched.c
linux/block/elevator.c
linux/fs/buffer.c
linux/fs/ext4/extents.c
linux/fs/ext4/inode.c
linux/include/linux/jbd2.h
linux/fs/jbd2/commit.c
linux/fs/jbd2/journal.c
linux/fs/jbd2/transaction.c
linux/include/linux/writeback.h
linux/mm/page-writeback.c
linux/include/linux/mm_types.h
linux/fs/buffer.c
Interface
mongo/util/net/message_server_port.cpp
third_party/wiredtiger/src/evict/evict_lru.c

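• Solution v2 (proactive)
[Diagram: Linux I/O stack, top to bottom: Apposha Front-End File System; VFS; Ext4, XFS, F2FS, etc., with the page cache; Block Layer with the Noop, CFQ, Deadline, and Apposha I/O schedulers; Device Driver]
- Priority-based I/O scheduling
- Device queue congestion control
- Priority-based write I/O control
- More efficient OS caching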

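• V12 performance optimization engine
[Diagram: MongoDB, PostgreSQL, and Elasticsearch each use a library (V12-M, V12-P, V12-E) on top of the Apposha optimization engine, which consists of the Front-End File System and the I/O Scheduler]
- Classifies task importance and file access patterns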

• MongoDB performance results
V12-M: the V12 engine for MongoDB
[Figure: Linux Default vs. Best Practice vs. V12-M over elapsed time (0 to 600 s), axis 0 to 6000. Callout: up to 5.2 s]
[Figure: plot over elapsed time (0 to 600 s), axis 0 to 2500. Callouts: up to 2.2 s, up to 0.1 s]
[Figure: throughput (ops/sec) vs. number of clients (1 to 300). Callout: throughput 30%]

• MongoDB performance analysis (LatencyTOP)
$ cat /proc/latency_stats | sort -k2rn
# blocked total wait max wait kernel call stack
14930242 2688576986 5000 sk_wait_data…SyS_recvfrom
1236442 250490867 40485 jbd2_log_wait_commit…SyS_fdatasync
503473 145763439 5000 futex_wait…SyS_futex
15330 72668185 37287 ext4_file_write_iter…SyS_pwrite64
1329954 57847733 4688 wait_on_page_writeback…SyS_fdatasync
93613 2392472 26378 blkdev_issue_flush…SyS_fdatasync
…
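The same ranking can be done programmatically. The sketch below parses lines in the four-column layout shown above and computes how much of the listed wait time ends in `SyS_fdatasync`. The sample is copied from the slide, including its elided call stacks; `parse_latency_stats` and `fdatasync_share` are hypothetical helper names.

```python
from typing import List, NamedTuple

SAMPLE = """\
14930242 2688576986 5000 sk_wait_data…SyS_recvfrom
1236442 250490867 40485 jbd2_log_wait_commit…SyS_fdatasync
503473 145763439 5000 futex_wait…SyS_futex
15330 72668185 37287 ext4_file_write_iter…SyS_pwrite64
1329954 57847733 4688 wait_on_page_writeback…SyS_fdatasync
93613 2392472 26378 blkdev_issue_flush…SyS_fdatasync
"""


class Entry(NamedTuple):
    blocked: int
    total_wait: int
    max_wait: int
    stack: str


def parse_latency_stats(text: str) -> List[Entry]:
    """Parse 'blocked total max stack' lines, sorted by total wait."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        # Skip headers or anything that is not a numeric record.
        if len(parts) < 4 or not parts[0].isdigit():
            continue
        entries.append(Entry(int(parts[0]), int(parts[1]),
                             int(parts[2]), parts[3]))
    return sorted(entries, key=lambda e: e.total_wait, reverse=True)


def fdatasync_share(entries: List[Entry]) -> float:
    """Fraction of listed wait time whose stack ends in SyS_fdatasync."""
    total = sum(e.total_wait for e in entries)
    synced = sum(e.total_wait for e in entries
                 if e.stack.endswith("SyS_fdatasync"))
    return synced / total


entries = parse_latency_stats(SAMPLE)
share = fdatasync_share(entries)
```

Grouping by the entry-point system call like this is one way to see which application-visible operation the kernel-side waits belong to.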

• MongoDB performance analysis (SystemTap)
$ vi full_backtrace.stp
# Print kernel and user-space backtraces whenever a task waits for writeback
probe kernel.function("filemap_fdatawait_range") {
    print_backtrace()
    print_ubacktrace()
}

• MongoDB performance analysis (SystemTap)
$ stap full_backtrace.stp -d /usr/local/bin/mongod --ldd --all-modules
handleIncomingMsgEPv
receivedCommand
runCommands
…
waitUntilDurable
__session_log_flush
__wt_log_flush
__wt_log_force_sync
__posix_sync
sys_fdatasync
do_fsync
vfs_fsync_range
ext4_sync_file
filemap_write_and_wait_range
filemap_fdatawait_range

WAL write amplification
• MongoDB case
[Diagram: the WTLog is written through FS metadata to storage]
Step 1. fallocate log file
fallocate()

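WAL write optimization
• Solution
[Diagram: Apposha OS components: front-end file system, library, I/O scheduler]
V12 engine v2
- Log preallocation and reuse
- Special-case handling of BG log writes
- RMW handling on log writes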
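One way to picture "log preallocation and reuse" is a log segment that is allocated and fully written once, then overwritten in place, so that a later fdatasync() never has to persist a size change. The sketch below is an assumption-laden illustration (the 64 KiB segment size, the `RecycledLog` class, and the wrap-around policy are all made up), not the V12 engine's design. A real WAL would only reuse a segment after a checkpoint makes its contents obsolete.

```python
import os
import tempfile

LOG_SIZE = 64 * 1024  # assumed segment size for this sketch


class RecycledLog:
    """Fixed-size log segment that is overwritten in place."""

    def __init__(self, path: str) -> None:
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        # Preallocate and zero-fill once, so later writes hit existing blocks.
        os.posix_fallocate(self.fd, 0, LOG_SIZE)
        os.pwrite(self.fd, bytes(LOG_SIZE), 0)
        os.fsync(self.fd)
        self.pos = 0

    def append(self, record: bytes) -> int:
        """Write one record durably and return the offset it was written at."""
        if self.pos + len(record) > LOG_SIZE:
            self.pos = 0  # reuse the segment instead of growing the file
        offset = self.pos
        os.pwrite(self.fd, record, offset)
        os.fdatasync(self.fd)  # file size is unchanged by this write
        self.pos += len(record)
        return offset

    def size(self) -> int:
        return os.fstat(self.fd).st_size

    def close(self) -> None:
        os.close(self.fd)


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as d:
        log = RecycledLog(os.path.join(d, "wal.seg"))
        offsets = [log.append(b"x" * 30000) for _ in range(3)]
        result = (offsets, log.size())
        log.close()
        return result
```

The third append wraps to offset 0, and the file stays at its preallocated size throughout.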

WAL write amplification
• MongoDB case
[Diagram: the WTLog is written to storage]
Step 1. fallocate log file
fallocate()

WAL write optimization
• Evaluation
[Figure: throughput (ops/sec) vs. number of clients (1 to 300), Linux Default vs. Best Practice vs. V12-M vs. V12-M v2. Callout: throughput 37% to 72%]

Scalability loss due to locks
• MongoDB performance results (60GB dataset)
[Figure: Linux Default vs. Best Practice vs. V12-M v2 over elapsed time (0 to 600 s), axis 0 to 7000]
[Figure: Linux Default vs. Best Practice vs. V12-M v2 over elapsed time (0 to 600 s), axis 0 to 2500. Callout: up to 0.9 s]

$ cat /proc/latency_stats | sort -k2rn
# blocked total_wait max_wait kernel_call_stack
6948352 15336712185 5000 futex_wait…SyS_futex
17116192 2950396149 5000 sk_wait_data…SYSC_recvfrom
132555 422559398 48725 ext4_file_write_iter…SyS_pwrite64
1998336 94252487 5848 wait_on_page_writeback…SyS_fdatasync
1997980 85578077 39318 blkdev_issue_flush…SyS_fdatasync
104922 29824666 35976 __lock_page_killable…SyS_pread64

• Problem
[Diagram: T1 writes to File A while T2 and T3 wait on their write() calls; T4 writes to File B while T5 and T6 wait on theirs]

Scalability optimization for locks
• Solution
[Diagram: front-end file system, library, I/O scheduler]
V12 engine v3
- Range lock for file writing

• Solution
[Diagram: T1, T2, and T3 write() to File A concurrently; T4, T5, and T6 write() to File B concurrently]
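A range lock can be sketched in user space as follows. This only illustrates the concept of locking byte ranges instead of the whole file; the `RangeLock` class and its half-open `[start, end)` convention are assumptions for this example and say nothing about how the V12 engine implements it.

```python
import threading
from typing import List, Tuple


class RangeLock:
    """Lets writers to disjoint byte ranges of one file proceed concurrently."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._held: List[Tuple[int, int]] = []  # half-open [start, end) ranges

    def _overlaps(self, start: int, end: int) -> bool:
        return any(s < end and start < e for s, e in self._held)

    def try_acquire(self, start: int, end: int) -> bool:
        """Take the range if nothing overlapping is held; never blocks."""
        with self._cond:
            if self._overlaps(start, end):
                return False
            self._held.append((start, end))
            return True

    def acquire(self, start: int, end: int) -> None:
        """Block until no overlapping range is held, then take the range."""
        with self._cond:
            while self._overlaps(start, end):
                self._cond.wait()
            self._held.append((start, end))

    def release(self, start: int, end: int) -> None:
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()


def demo() -> tuple:
    lock = RangeLock()
    a = lock.try_acquire(0, 4096)       # first writer
    b = lock.try_acquire(4096, 8192)    # disjoint range: allowed
    c = lock.try_acquire(2048, 6144)    # overlaps both: refused
    lock.release(0, 4096)
    lock.release(4096, 8192)
    d = lock.try_acquire(2048, 6144)    # free now
    return a, b, c, d
```

With a single per-file lock the second writer would have been refused as well; here only genuinely overlapping writes serialize.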

• Evaluation
[Figure: Linux Default vs. Best Practice vs. V12-M v2 vs. V12-M v3 over elapsed time (0 to 600 s), axis 0 to 2500. Callouts: up to 0.9 s, up to 0.13 s]

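Linux kernel analysis from a DB performance perspective
• Storage I/O stack
  • Difficulty of applying I/O priorities
  • WAL write amplification
  • Scalability loss due to locks
• CPU scheduling
  • Multicore load balancing problems
[Diagram layers: database, operating system, hardware]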

Multicore load balancing problems
Pre load balancing: runs at the moment a thread becomes runnable (wakeup, create, exec)
Periodic load balancing: evens out the overall load at every global load-balancing period

Kick load balancing: every period (in HZ units), finds an idle core and pushes a thread out to it
Drain load balancing: when a core becomes idle, pulls a thread from a busy core

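• Load balancing overhead
  • Load calculation
    • At every timer tick (HZ) or scheduler invocation
  • Target core selection
    • Drain, Pre, Periodic: use the sched-domain tree
    • Kick: uses the CPU mask
  • Migration
    • Carried out by a migration worker (takes several ms)
- Consumes CPU cycles
- Causes lock contention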

• Delay caused by migration
[Diagram: migration timeline: lock, unpark the migration worker, context switch from the current task to the migration worker, migration, unlock, park the migration worker, context switch from the migration worker back to the current task. Waiters on the locks are delayed until they are scheduled.]

• Impact on DB workloads
A DB workload is a set of I/O-intensive tasks.
[Diagram: DB task 1 and DB task 2 each alternate between network I/O, storage I/O, and short segments labeled C]

Frequent task state changes trigger Pre load balancing.
Each change in a core's IDLE state triggers load balancing (Drain and Kick).
The resulting migrations cause response latency.

Multicore load balancing optimization
• Solution
[Diagram: front-end file system, library, I/O scheduler, CPU scheduler]
V12 engine v4
- Anticipatory scheduling class
- Smart migration

Multicore load balancing optimization
• Evaluation (12GB dataset)
[Figure: throughput (ops/sec) vs. number of clients (1 to 300), Linux Default vs. Best Practice vs. V12-M v3 vs. V12-M v4. Callout: throughput 1.4X to 2.5X]

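Credits
김형준
• Ph.D. student, Sungkyunkwan University
• File system development
현병훈
• Integrated M.S./Ph.D. student, Sungkyunkwan University
• CPU scheduler development
Luis Cavazos
• File system development
Mr. K.
• M.S., Sungkyunkwan University
• S/W engineer, Samsung Electronics
• Web development and DB analysis
Pedram Khoshnevis
• DBaaS front-end development
Mr. J.
• S/W engineer, Samsung Electronics
• DBaaS back-end development
김환주
• Ph.D., KAIST
• Senior S/W Engineer, Dell EMC
• I/O prioritization design
정진규
• Ph.D., KAIST
• Assistant professor, Sungkyunkwan University
이준원
• Ph.D., Georgia Tech
• Professor, Sungkyunkwan University
김상훈
• Ph.D., KAIST
• Postdoc, Virginia Tech
• I/O inheritance design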

H http://apposha.io
F www.facebook.com/apposha
M sangwook@apposha.io
