PostgreSQL on EXT3/4, XFS,
BTRFS and ZFS
FOSDEM PgDay 2016, 28.1.2016, Brussels
Tomáš Vondra
tomas.vondra@2ndquadrant.com

Comparison of PostgreSQL performance on contemporary Linux file systems.

  1. PostgreSQL on EXT3/4, XFS, BTRFS and ZFS
     FOSDEM PgDay 2016, 28.1.2016, Brussels
     Tomáš Vondra, tomas.vondra@2ndquadrant.com, http://blog.pgaddict.com
  2. not a filesystem engineer, but a database engineer
  3. Which file system should we use for PostgreSQL on production systems?
  4. According to our benchmarks from 2003, the best file system is ...
  5. What does it actually mean when a file system is “stable” and “production ready”?
  6. 1) reliability
     2) consistent performance
     3) management & monitoring
  7. DISCLAIMER: I'm not a dedicated fan (or enemy) of any of the file systems discussed in the talk.
  8. SSD
  9. File systems
  10. EXT3, EXT4, XFS, ...
      ● EXT3/4, XFS, … (and others)
        – traditional design from the 1990s, with journaling and such
        – similar goals / concepts / implementations
        – continuous improvements
        – mature, reliable, proven by time and production deployments
      ● basic history
        – 2001 - EXT3
        – 2002 - XFS (1994 - SGI Irix 5.3, 2000 GPL, 2002 Linux)
        – 2008 - EXT4
  11. EXT3, EXT4, XFS, ...
      ● evolution, not revolution
        – new features (e.g. TRIM, write barriers, ...)
        – scalability improvements (metadata, ...)
        – bug fixes
      ● conceived in the era of rotational storage
        – mostly work on SSD drives
        – stop-gap for future storage types (NVRAM, ...)
      ● mostly no support for
        – volume management, multiple drives, snapshots
        – addressed by LVM and/or RAID (hw/sw)
        – sometimes issues
  12. BTRFS, ZFS
      ● basic idea
        – integrate all the layers (LVM + dm + ...)
        – designed for consumer-level hardware (expect failures)
        – designed for large data volumes
      ● that will (hopefully) give us ...
        – flexible management
        – built-in snapshotting
        – compression, deduplication
        – checksums
  13. BTRFS, ZFS
      ● BTRFS
        – merged in 2009, but still considered “experimental”
        – on-disk format marked as “stable” (1.0)
        – some say it's “stable” or even “production ready” ...
        – default file system in some distributions
      ● ZFS
        – originally Sun / Solaris, but “got Oracled” :-(
        – slightly fragmented development (Illumos, Oracle, ...)
        – available on BSD systems (FreeBSD)
        – “ZFS on Linux” project (but CDDL vs. GPL and the like)
  14. Generic “mount options”
  15. Generic “mount options”
      ● TRIM (discard)
        – enables TRIM commands (sent from kernel to SSD)
        – impacts internal cleanup (block erasure) / wear leveling
        – not entirely necessary, but may help SSD with “garbage collection”
      ● write barriers
        – prevent the controller from reordering writes (e.g. journal vs. data)
        – ensure consistency of the file system, do not prevent data loss
        – write cache + battery => write barriers may be disabled (really?)
      ● SSD alignment
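To check which of these generic options are actually in effect, the options column of `/proc/mounts` can be inspected. The following is a minimal sketch only: the sample text, device names and mount points are made up, and the helper names `mount_flags` and `summarize` are hypothetical. It looks only for the `discard` and `nobarrier` / `barrier=0` spellings; other file systems may report these settings differently.

```python
# Sample text in /proc/mounts format. Devices and mount points are made up
# for illustration; on a real system, read the file /proc/mounts instead.
SAMPLE_MOUNTS = """\
/dev/sda1 /mnt/pgdata ext4 rw,noatime,discard,nobarrier,data=ordered 0 0
/dev/sdb1 /mnt/backup xfs rw,relatime,attr2,inode64,noquota 0 0
"""


def mount_flags(mounts_text, mount_point):
    """Return the set of mount options for mount_point, or None if not mounted."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            return set(fields[3].split(","))
    return None


def summarize(options):
    """Report the two generic options discussed on the slide."""
    return {
        "trim_on_delete": "discard" in options,
        "barriers_disabled": "nobarrier" in options or "barrier=0" in options,
    }


if __name__ == "__main__":
    print(summarize(mount_flags(SAMPLE_MOUNTS, "/mnt/pgdata")))
```

Keeping the parsing separate from the interpretation makes it easy to feed in the real file contents later without changing the logic.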
  16. Specific “mount options”
  17. BTRFS
      ● nodatacow
        – disables “copy on write” (CoW), which is re-enabled when snapshotting
        – also disables checksums (they require “full” CoW)
        – probably also eliminates “torn-page resiliency” (full_page_writes=on)
      ● ssd
        – should enable SSD-related optimizations (but not sure which)
      ● compress=lzo/zlib
        – speculative compression
  18. ZFS
      ● recordsize=8kB
        – the default ZFS record is 128kB (PostgreSQL uses 8kB pages)
        – large records make the ARC cache inefficient (smaller number of “slots”)
      ● logbias=throughput [latency]
        – influences access to ZIL
        – prioritizes latency vs. throughput
      ● zfs_arc_max
        – limits size of ARC cache (50% RAM by default)
        – should be freed automatically, but external module ...
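The “slots” remark for `recordsize` can be illustrated with simple arithmetic. This is a rough sketch under stated assumptions: the ARC is taken to be 4GB (50% of the 8GB test machine described later in the deck), and metadata overhead and compression are ignored. The function names are hypothetical.

```python
KB = 1024
GB = 1024 ** 3


def arc_slots(arc_bytes, recordsize_bytes):
    """How many records of the given size fit into the ARC (overhead ignored)."""
    return arc_bytes // recordsize_bytes


def read_amplification(recordsize_bytes, page_bytes=8 * KB):
    """How many PostgreSQL pages worth of data one record read brings in."""
    return max(1, recordsize_bytes // page_bytes)


if __name__ == "__main__":
    arc = 4 * GB  # assumption: 50% of 8GB RAM
    for recordsize in (128 * KB, 8 * KB):
        print(
            f"recordsize={recordsize // KB}kB:",
            arc_slots(arc, recordsize), "slots,",
            read_amplification(recordsize), "page(s) per record",
        )
```

With 128kB records the same cache holds 16 times fewer independently cached records than with 8kB records, which matters when the workload touches random 8kB pages.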
  19. Benchmark
  20. pgbench (TPC-B)
      ● transactional benchmark (TPC-B) / stress-test
        – many tiny queries (access through PK, ...)
        – mix of different I/O types (read/write, random/sequential)
      ● two variants
        – read-only (SELECT)
        – read-write (SELECT + INSERT + UPDATE)
      ● three data volume categories
        – small (~200MB)
        – medium (~50% RAM)
        – large (~200% RAM)
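The three data volume categories translate into pgbench scale factors. The sketch below assumes roughly 16MB of data per scale unit, a figure derived from the “scale 5000 (~80GB)” note on a later slide rather than from any official number, and assumes the 8GB test machine. The helper name `scale_for_size` is hypothetical.

```python
import math

# Assumption: ~16 MB per scale unit, derived from "scale 5000 (~80GB)".
MB_PER_SCALE = 16.0


def scale_for_size(target_mb, mb_per_scale=MB_PER_SCALE):
    """Smallest pgbench scale factor whose data set reaches target_mb."""
    return max(1, math.ceil(target_mb / mb_per_scale))


if __name__ == "__main__":
    ram_mb = 8 * 1024  # the 8GB test machine
    categories = {
        "small": 200,          # ~200MB
        "medium": ram_mb // 2,  # ~50% RAM
        "large": ram_mb * 2,    # ~200% RAM
    }
    for name, size_mb in categories.items():
        # "pgbench -i -s N" initializes a database at scale factor N
        print(f"{name}: pgbench -i -s {scale_for_size(size_mb)}")
```

The actual size per scale unit varies with PostgreSQL version and fill factor, so the resulting database size is worth checking after initialization.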
  21. Hardware
      ● CPU: Intel i5-2500k
        – 4 cores @ 3.3 GHz (3.7GHz)
        – 6MB cache
        – 2011-2013
      ● 8GB RAM (DDR3 1333)
      ● SSD Intel S3700 100GB (SATA3)
      ● Gentoo + kernel 4.0.4
      ● PostgreSQL 9.4
  22. Hardware (Cosium)
      ● CPU: 2x Intel Xeon E5-2687W v3, 3.1GHz, 25MB cache, 9.60GT/s QPI, Turbo, HT, 10C/20T (160W)
      ● RAM: 256GB (16x dual in-line memory module, 16GB, 2133, 2RX4, 4G, DDR4, R)
      ● storage A: 2x Samsung XS1715 NVME SSD 1.6 TB
      ● storage B: 2x 300GB SAS 10k RPM drive
      ● storage C: 4x 1.2TB SAS 10k RPM drive
      ● RAID: Dell Perc H330 (no write cache)
  23. But that is not representative!
  24. pgbench read-only
  25. [chart] pgbench / small (150 MB) read-only
      x-axis: number of clients (0 to 18); y-axis: transactions per second (0 to 60000)
  26. [chart] pgbench / large (16GB) read-only
      x-axis: number of clients (0 to 18); y-axis: transactions per second (0 to 40000)
      series: ZFS, ZFS (recordsize=8k), BTRFS, BTRFS (nodatacow), F2FS, ReiserFS, EXT4, EXT3, XFS
  27. pgbench read-write
  28. [chart] pgbench / small (150MB) read-write
      x-axis: number of clients (0 to 18); y-axis: transactions per second (0 to 8000)
      series: BTRFS (ssd, nobarrier), BTRFS (ssd, nobarrier, discard, nodatacow), EXT3, EXT4 (nobarrier, discard), F2FS (nobarrier, discard), ReiserFS (nobarrier), XFS (nobarrier, discard), ZFS, ZFS (recordsize, logbias)
  29. [chart] pgbench / small (150MB) read-write
      x-axis: number of clients (0 to 18); y-axis: transactions per second (0 to 8000)
      series: BTRFS (ssd, nobarrier, discard, nodatacow), ZFS (recordsize, logbias), F2FS (nobarrier, discard), EXT4 (nobarrier, discard), ReiserFS (nobarrier), XFS (nobarrier, discard)
  30. [chart] pgbench / large (16GB) read-write
      x-axis: number of clients (0 to 18); y-axis: transactions per second (0 to 6000)
      series: ZFS, BTRFS (ssd), ZFS (recordsize), ZFS (recordsize, logbias), F2FS (nobarrier, discard), BTRFS (ssd, nobarrier, discard, nodatacow), EXT3, ReiserFS (nobarrier), XFS (nobarrier, discard), EXT4 (nobarrier, discard)
  31. [chart] pgbench / large (16GB) read-write
      x-axis: number of clients (0 to 18); y-axis: transactions per second (0 to 6000)
      series: ZFS (recordsize, logbias), F2FS (nobarrier, discard), BTRFS (ssd, nobarrier, discard, nodatacow), ReiserFS (nobarrier), XFS (nobarrier, discard), EXT4 (nobarrier, discard)
  32. [chart] Write barriers: ext4 and xfs (defaults, noatime)
      x-axis: time of benchmark in seconds (0 to 300); y-axis: transactions per second (0 to 7000)
      series: ext4 (barrier), ext4 (nobarrier), xfs (barrier), xfs (nobarrier)
  33. Performance variability
  34. [chart] pgbench / large (16GB) read-write: number of transactions per second over time
      x-axis: time of benchmark in seconds (0 to 300); y-axis: transactions per second (0 to 7000)
      series: btrfs (ssd, nobarrier, discard), btrfs (ssd, nobarrier, discard, nodatacow), ext4 (nobarrier, discard), xfs (nobarrier, discard), zfs (recordsize, logbias)
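Performance variability of this kind can be quantified from a per-second throughput series, for example with the coefficient of variation. The sketch below uses made-up sample values, not data from the talk, and the helper name `variability` is hypothetical.

```python
import statistics


def variability(tps_samples):
    """Summarize jitter of a per-second transactions-per-second series."""
    mean = statistics.mean(tps_samples)
    stdev = statistics.pstdev(tps_samples)
    return {
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean,  # coefficient of variation: lower means steadier
        "min": min(tps_samples),
        "max": max(tps_samples),
    }


if __name__ == "__main__":
    # Hypothetical series: same average throughput, very different jitter.
    steady = [4000, 4100, 3900, 4050, 3950]
    jittery = [6000, 2000, 5500, 1500, 5000]
    for name, series in (("steady", steady), ("jittery", jittery)):
        stats = variability(series)
        print(f"{name}: mean={stats['mean']:.0f} cv={stats['cv']:.3f}")
```

Two configurations with the same average can behave very differently in production, which is why the average alone is not enough to compare file systems.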
  35. NVME drives
  36. [chart] pgbench / large, 60 clients on NVME
      y-axis: throughput, transactions per second (0 to 20000)
      configurations: ext4-barrier, ext4-nobarrier, xfs-barrier, xfs-nobarrier, zfs-mirror, zfs-single, zfs-single-compression
  37. 4kB vs. 8kB
  38. [chart] PostgreSQL with 4kB and 8kB pages
      pgbench read-write, 16 clients, scale 5000 (~80GB); axis range 0 to 5000
      8 kB: 3115 (discard), 3128 (nodiscard)
      4 kB: 4052 (discard), 4111 (nodiscard)
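The bar labels on this slide can be turned into relative differences. The sketch assumes the values are transactions per second and that the label order maps to 8 kB = 3115 / 3128 and 4 kB = 4052 / 4111 for discard / nodiscard; the slide text does not state the mapping explicitly. The helper name `gain_percent` is hypothetical.

```python
# Assumed mapping of the bar labels (page size, mount option) -> throughput.
TPS = {
    ("8kB", "discard"): 3115,
    ("8kB", "nodiscard"): 3128,
    ("4kB", "discard"): 4052,
    ("4kB", "nodiscard"): 4111,
}


def gain_percent(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    for option in ("discard", "nodiscard"):
        gain = gain_percent(TPS[("4kB", option)], TPS[("8kB", option)])
        print(f"{option}: 4kB vs. 8kB pages: {gain:+.1f}%")
    for page in ("8kB", "4kB"):
        gain = gain_percent(TPS[(page, "nodiscard")], TPS[(page, "discard")])
        print(f"{page}: nodiscard vs. discard: {gain:+.1f}%")
```

Under that reading the page size accounts for a difference of roughly 30%, while the discard option changes throughput by less than 2%.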
  39. [chart] Host_Writes_32MB vs. 4kB/8kB pages
      amount of data written to SSD (4 hours), in GB (axis 0 to 1200)
      8 kB: 962; 4 kB: 805
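A counter such as Host_Writes_32MB is a SMART attribute counting units of 32MB, so the amount written during a run is the difference between two readings times the unit size. A minimal sketch follows; the counter readings are hypothetical values chosen only so that the result matches the 962 GB on the slide, and 1 GB is taken as 1024 MB.

```python
def host_writes_gb(counter_start, counter_end, unit_mb=32):
    """Convert a Host_Writes_32MB counter delta to gigabytes (1 GB = 1024 MB)."""
    return (counter_end - counter_start) * unit_mb / 1024.0


if __name__ == "__main__":
    # Hypothetical readings taken before and after a 4-hour benchmark run.
    before, after = 100000, 130784
    print(f"{host_writes_gb(before, after):.0f} GB written")
```

Reading the counter before and after the benchmark isolates the writes caused by the run from everything the drive has absorbed earlier.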
  40. [chart] Host_Writes_32MB vs. 4kB/8kB pages
      amount of data written to SSD (4 hours), in GB (axis 0 to 1200)
      raw: 8 kB 962, 4 kB 805
      compensated: 8 kB 962, 4 kB 612
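The slide does not say how the “compensated” value was derived. One plausible reading, offered here purely as an assumption, is that the raw amount is scaled to equal throughput, since the 4kB configuration completed more transactions in the same 4 hours. Scaling 805 GB by the nodiscard throughput ratio from the earlier slide gives a value close to the 612 GB shown. The function name is hypothetical.

```python
def compensated_writes(raw_gb, tps, reference_tps):
    """Scale the written volume to what it would be at reference_tps.

    Assumption: written volume grows linearly with the number of
    transactions. This is NOT confirmed by the slides.
    """
    return raw_gb * reference_tps / tps


if __name__ == "__main__":
    # 8kB pages are the reference (3128 tps, 962 GB written).
    print(f"8 kB: {compensated_writes(962, 3128, 3128):.1f} GB")
    # 4kB pages ran at 4111 tps and wrote 805 GB in the same time.
    print(f"4 kB: {compensated_writes(805, 4111, 3128):.1f} GB")
```

If this reading is right, the 4kB configuration writes noticeably less data per transaction than the raw totals suggest.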
  41. EXT / XFS
      ● similar behavior
        – mostly a compromise between throughput and latency
        – EXT4 – higher throughput, more jitter
        – XFS – lower throughput, less jitter
      ● significant impact of “write barriers”
        – disabling them requires reliable drives / RAID controller with BBU
      ● minimal TRIM impact
        – depends on SSD model (different over-provisioning etc.)
        – depends on how full the SSD is
        – benchmark does not delete (over-writes pages)
  42. BTRFS, ZFS
      ● significant price for features (based on CoW)
        – about 50% reduction of performance when writing data
      ● BTRFS
        – most of the problems I ran into were on BTRFS
        – good: no data corruption bugs (but not tested)
        – bad: unstable and inconsistent behavior, lockups
      ● ZFS
        – alien in the Linux world, separate ARC cache
        – much more mature than BTRFS, nice stable behavior
        – ZFSonLinux actively developed (current 0.6.5, tested 0.6.3)
  43. Conclusion
  44. Conclusion
      ● if a traditional file system is sufficient
        – use EXT4/XFS, depending on your distribution
        – no extreme differences in behavior / performance
        – worth spending some time on tuning
      ● if you need “advanced” features
        – e.g. snapshotting, multi-device support ...
        – ZFS is a good choice (maybe consider FreeBSD)
        – BTRFS (now) definitely not recommended
  45. Questions?
  46. BTRFS, ZFS
      Tasks: 215 total,   2 running, 213 sleeping,   0 stopped,   0 zombie
      Cpu(s):  0.0%us, 12.6%sy,  0.0%ni, 87.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Mem:  16432096k total, 16154512k used,   277584k free,     9712k buffers
      Swap:  2047996k total,    22228k used,  2025768k free, 15233824k cached

        PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
      24402 root      20   0     0    0    0 R 99.7  0.0   2:28.09 kworker/u16:2
      24051 root      20   0     0    0    0 S  0.3  0.0   0:02.91 kworker/5:0
          1 root      20   0 19416  608  508 S  0.0  0.0   0:01.02 init
          2 root      20   0     0    0    0 S  0.0  0.0   0:09.10 kthreadd
          ...

      Samples: 59K of event 'cpu-clock', Event count (approx.): 10269077465
      Overhead  Shared Object        Symbol
        37.47%  [kernel]             [k] btrfs_bitmap_cluster
        30.59%  [kernel]             [k] find_next_zero_bit
        26.74%  [kernel]             [k] find_next_bit
         1.59%  [kernel]             [k] _raw_spin_unlock_irqrestore
         0.41%  [kernel]             [k] rb_next
         0.33%  [kernel]             [k] tick_nohz_idle_exit
         ...
  47. BTRFS, ZFS
      $ df /mnt/ssd-s3700/
      Filesystem     1K-blocks     Used Available Use% Mounted on
      /dev/sda1       97684992 71625072  23391064  76% /mnt/ssd-s3700

      $ btrfs filesystem df /mnt/ssd-s3700
      Data: total=88.13GB, used=65.82GB
      System, DUP: total=8.00MB, used=16.00KB
      System: total=4.00MB, used=0.00
      Metadata, DUP: total=2.50GB, used=2.00GB    <= full (0.5GB for btrfs)
      Metadata: total=8.00MB, used=0.00
      : total=364.00MB, used=0.00

      $ btrfs balance start -dusage=10 /mnt/ssd-s3700

      https://btrfs.wiki.kernel.org/index.php/Balance_Filters
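The situation on this slide, where `df` reports 76% usage while the metadata allocation is effectively full, can be detected by parsing the `btrfs filesystem df` output. The following is a sketch only: it handles just the line format shown on the slide (newer btrfs tools may print different units), it omits the line with the missing label, and the 0.5GB reserve is taken from the slide's note rather than from btrfs documentation. The function names are hypothetical.

```python
import re

# Output copied from the slide (the line with the missing label is omitted).
SAMPLE = """\
Data: total=88.13GB, used=65.82GB
System, DUP: total=8.00MB, used=16.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=2.50GB, used=2.00GB
Metadata: total=8.00MB, used=0.00
"""

UNITS = {"": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}
LINE_RE = re.compile(
    r"^(?P<label>[^:]+): total=(?P<total>[\d.]+)(?P<tu>[KMGT]B)?, "
    r"used=(?P<used>[\d.]+)(?P<uu>[KMGT]B)?$"
)


def parse_btrfs_df(text):
    """Map each allocation label to a (used_bytes, total_bytes) tuple."""
    usage = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the expected format
        total = float(match["total"]) * UNITS[match["tu"] or ""]
        used = float(match["used"]) * UNITS[match["uu"] or ""]
        usage[match["label"]] = (used, total)
    return usage


def metadata_headroom_gb(usage, label="Metadata, DUP", reserve_gb=0.5):
    """Free metadata space in GB after subtracting the assumed reserve."""
    used, total = usage[label]
    return (total - used) / UNITS["GB"] - reserve_gb


if __name__ == "__main__":
    usage = parse_btrfs_df(SAMPLE)
    print(f"metadata headroom: {metadata_headroom_gb(usage):.2f} GB")
```

A headroom at or near zero is the condition the slide flags as “full”, and it is the point at which a balance such as the one shown above would be considered.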
  48. EXT3/4, XFS
      ● Linux Filesystems: Where did they come from? (Dave Chinner @ linux.conf.au 2014)
        https://www.youtube.com/watch?v=SMcVdZk7wV8
      ● Ted Ts'o on the ext4 Filesystem (Ted Ts'o, NYLUG, 2013)
        https://www.youtube.com/watch?v=2mYDFr5T4tY
      ● XFS: There and Back … and There Again? (Dave Chinner @ Vault 2015)
        https://lwn.net/Articles/638546/
      ● XFS: Recent and Future Adventures in Filesystem Scalability (Dave Chinner, linux.conf.au 2012)
        https://www.youtube.com/watch?v=FegjLbCnoBw
      ● XFS: the filesystem of the future? (Jonathan Corbet, Dave Chinner, LWN, 2012)
        http://lwn.net/Articles/476263/
