Exploiting Your File System to Build Robust & Efficient Workflows
Jason Johnson
jajohnson@softlayer.com
What is /dev/sdc, anyway?
The Hard Disk Drive
Basic Platter Geometry
Cylinder-Head-Sector (obsolete)
Logical Block Addressing (LBA)
What is /dev/sdc, anyway?
The Disk Array Controller
● Adaptec 5405Z
● PCIe x8
● 1.2 GHz Dual Core RAID on Chip (ROC)
● 128-1024 MB Battery-Backed DDR
● 1-4 GB NAND
● Up to 256 SATA or SAS HDDs
● arcconf
Write Caching
“...you *must* disable the individual hard disk write cache in order to ensure to keep the file system intact after a power failure.”
XFS.org FAQ
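The benchmarks below disable caching at the controller (wcache wt, rcache roff, shown later). As a minimal sketch of the same idea for a direct-attached SATA disk, using hdparm with an illustrative device name:

# Show the drive's current write-cache flag
hdparm -W /dev/sdc

# Turn the on-drive write cache off so acknowledged writes cannot sit in
# the drive's volatile cache across a power failure
hdparm -W 0 /dev/sdc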
Initial 1MB Sector Alignment
Sector Size    Starting Sector    Drive Type
512 B          2048               SATA & SAS
2 KB           512                SSD
4 KB           256                Advanced Format & SSD
(in every row the starting sector works out to a 1 MB offset)
blockdev --getpbsz /dev/sdc
blockdev --getss /dev/sdc
“Aligning IO on a hard disk RAID”
http://www.mysqlperformanceblog.com/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/
(s)gdisk
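As a sketch of the partitioning step, sgdisk aligns new partitions to 2048-sector (1 MiB) boundaries by default; the device name is illustrative:

# Create one GPT partition spanning the disk; the default alignment puts
# its first sector at 2048 (a 1 MiB offset on 512-byte-sector drives)
sgdisk -n 1:0:0 /dev/sdc

# Print the partition table and confirm the starting sector
sgdisk -p /dev/sdc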
Tuning the File System
● Disable Caching
● Tools: sysbench, iozone, iostat, vmstat
● Start Simple
● Apply Increasing Parallel I/O
● ext2, ext3, ext4, xfs, btrfs, zfs?
● Graph Everything
arcconf
arcconf create 1 logicaldrive \
    stripesize 256 \
    wcache wt \
    rcache roff \
    max \
    0 0 3 0 4 ... 0 18
sysbench, fileio
sysbench \
    --num-threads=[8-1024] \
    --test=fileio \
    --file-total-size=10G \
    --file-test-mode=rndwr \
    --file-fsync-all=on \
    --file-num=64 \
    --file-block-size=16384 \
    [prepare|run|cleanup]
EXT4, mkfs
mkfs.ext4 /dev/sdc1
mke2fs -b 4096 -O journal_dev /dev/sdb1 32768

mkfs.ext4 -b 4096 \
    -E stride=4,stripe_width=16 \
    -J device=/dev/sdb1 \
    /dev/sdc1

mount -o noatime,stripe=16 /dev/sdc1 /mnt/data
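A sanity check that is not on the slide: tune2fs (from e2fsprogs) reports whether the stride and stripe width actually landed in the superblock.

# Output should include "RAID stride: 4" and "RAID stripe width: 16"
tune2fs -l /dev/sdc1 | grep -i raid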
I/O Requests per Second, EXT4
Latency, EXT4
XFS, mkfs
mkfs.xfs /dev/sdc1
mkfs.xfs \
    -d sw=16,su=16k \
    -l logdev=/dev/sdb1,size=128m,su=256k \
    /dev/sdc1

mount -o noatime,logdev=/dev/sdb1,logbufs=8,logbsize=256k /dev/sdc1 /mnt/data
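The XFS equivalent of that check, also not on the slide: xfs_info on the mounted path reports the stripe geometry (sunit/swidth) and the external log device that mkfs.xfs recorded.

# Confirm sunit/swidth and the "log =external" section match mkfs.xfs
xfs_info /mnt/data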
I/O Requests per Second, XFS
Latency, XFS
What are we looking for?
I/O Requests per Second... per Drive
&
Reasonable Latency
XFS vs. EXT4, Latency
XFS vs. EXT4, per Drive
Scenario 1, Efficiency
MySQL
MySQL Write Pattern
MySQL Configuration
System Variable Value
innodb_io_capacity 5000
innodb_thread_concurrency 256
innodb_write_io_threads 192
innodb_read_io_threads 64
innodb_log_file_size 32M
innodb_log_files_in_group 32
innodb_buffer_pool_size 10GB
innodb_buffer_pool_instances 10
“MySQL System Variables”
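A quick way to confirm the values took effect on the running server; the client call below is illustrative (connection options omitted):

# List the InnoDB I/O variables as the server actually sees them
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_%io%'"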
sysbench, mysql
sysbench \
    --num-threads=[32|64|128|256] \
    --test=oltp \
    --oltp-test-mode=nontrx \
    --oltp-nontrx-mode=insert \
    --oltp-table-size=100000 \
    --max-requests=10000000 \
    [prepare|run|cleanup]
Transactions per Second
[Graph: percentage increases of +96.28%, +125.37%, +102.29%, +69.43%; top results 15,962.79 tx/s @ 16 ms and 9,421.29 tx/s @ 27 ms]
inotify
Event Mask Fired when...
IN_ACCESS File was accessed (read)
IN_ATTRIB Metadata changed
IN_CLOSE_WRITE File opened for writing was closed
IN_CLOSE_NOWRITE File not opened for writing was closed
IN_CREATE File/directory created in watched directory
IN_DELETE File/directory deleted from watched directory
IN_DELETE_SELF Watched file/directory was itself deleted
IN_MODIFY File was modified
IN_MOVE_SELF Watched file/directory was itself moved
IN_MOVED_FROM File moved out of watched directory
IN_MOVED_TO File moved into watched directory
IN_OPEN File was opened
inotify in [language]
Language Source
Python pip install pyinotify
PHP pecl install inotify
Go go's exp repository
Ruby gem install rb-inotify
C #include <sys/inotify.h>
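To watch the event stream without writing any code, the inotifywait utility from the inotify-tools package (not covered in the talk; shown here as an illustration, with a hypothetical inbox path) reports the two events this workflow keys off of, IN_CLOSE_WRITE and IN_MOVED_TO:

# Keep running (-m) and print one line per file that finishes being
# written to, or is moved into, the watched directory
inotifywait -m -e close_write -e moved_to /mnt/data/inbox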
Scenario 2, Robustness
A Custom Message Queue
Message Queue Architecture
I/O Serialization
hash(queue_id) % num_threads
Each queue_id hashes to one I/O thread, so writes within a single queue stay serialized while different queues run in parallel across threads.
Message Queue Architecture
Summary
● Caching
● File system choice
● Benchmarking w/ sysbench
● Efficiency through proper configuration
● Robustness through cooperation & decoupling
● Discovering & understanding your write pattern
● Benchmark & Graph everything
● Never Assume Anything (atime, stripe width, etc.)
Jason Johnson jajohnson@softlayer.com
https://github.com/jasonjohnson
http://www.slideshare.net/jasonajohnson
“A Case for Redundant Arrays of Inexpensive Disks (RAID)”
http://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf
“Practical File System Design”
http://www.nobius.org/~dbg/practical-file-system-design.pdf
“XFS Papers and Documentation”
http://xfs.org/index.php/XFS_Papers_and_Documentation
“Kernel Documentation on File Systems”
https://www.kernel.org/doc/Documentation/filesystems/
“MySQL Performance Blog”
http://www.mysqlperformanceblog.com/
“MySQL DBA”
http://mysqldba.blogspot.com/
“MySQL Server System Variables”
http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html


Editor's Notes

  • #2 Good afternoon! Title
  • #3 Begins Database server? Video Encoder? Where to go from here? Up or Down Understand the Abstraction Go Down to Physical
  • #4 Common Big Virtual Disk RAID Controller Couple Drives Not “set it and forget it”
  • #5 Platters Spindle Actuator Actuator Coil Actuator Arm Heads
  • #6 Sectors Cluster Track Cylinder 512, 2K, 4K CHS obsolete Giant String of Sectors
  • #7 Disk Array Controller Break it Down
  • #8 DDR Flushes to NAND Configuration Tool Stunned Discrete GPU for your File System
  • #9 Data Corruption? (hands) Fallible Disable All Caching Eliminate Class of Errors
  • #11 Sector Alignment 1MB Offset Room for Partition Table Use These Tools Verify Correctness
  • #12 Sectors Clearly Communicating in LBA Logical Size Offset Check. All Makes Sense Not Scary
  • #13 No Caching Sysbench 16 Data-Bearing Disks Hardware Controller XFS Designed for This! But... verify through testing.
  • #14 Stripe Size 256k Entire Width Cache Disabled Add Physical Drives All Controller Brands Different
  • #15 What are we comparing? EXT4 vs. XFS Naive Naive External Tuned Tuned External
  • #16 From Naive to Modestly Tuned Stripe Width Stride 128MB external journal Mount requires extra information
  • #17 Review Graph
  • #18 Review Graph
  • #19 Again, From Naive to Modestly Tuned Stripe Width Stripe Unit Size External Journal Device Additional Information Needed by Mount
  • #20 Review Graph
  • #21 Review Graph
  • #22 We want 330 IOPS Our $$$
  • #23 Neck and Neck Slight Advantage at 256 Threads
  • #24 Noticeable Advantage at 256 Threads 5,200 IOPS Reached Practical Limit Early Fully Tuned? sysctl for XFS?
  • #25 Predictable Write Pattern We Make It Efficient
  • #26 InnoDB Pages Linux Pages Unit of Work XFS Allocation Groups EXT4 Metadata Groupings
  • #27 Review The Configuration Plug-in Values from Sysbench Google “MySQL System Variables” Explain Values
  • #28 Small Benchmark 10 Million Inserts, One Transaction Each Ramp up Threads Test EXTERNALLY Deadlock Potential Spin-locks Contending with Benchmark
  • #29 Transactions per Second For the Percentage Increase Folks For the Real Figures Folks 125% increase (in some cases) High-End nearing 16,000
  • #30 Before Next Section ---------------------------------- Who has written code like this? (hands) Scanning Race Condition Creation Behind Us ----------------------------------- There is a better way!
  • #31 It Can Tell Us No Scanning or Polling No Races ------------------------------------------ IN_CLOSE_WRITE IN_MOVED_TO
  • #32 Event Stream No Races File System Obeys Rules Can't Move Files Being Written
  • #33 Go's extracted from stdlib FreeBSD's kqueue
  • #34 Internally RabbitMQ Every Datacenter Worldwide ----------------------------- Simpler File-Based RESTful 200,000 Concurrency Insane Burst-able Throughput
  • #35 Familiar? SMTP or Maildir ----------------------------------- Fall Over ----------------------------------- Inbox Partial Content Source & Victim Locked Fetching Serialized
  • #36 Request Must Know How to Respond But... ONE I/O THREAD?!!
  • #37 Predictable, Simple Hash Tenants CAN & WILL Clobber, Though Sticky
  • #38 Notification-based Movement Serialized I/O per-queue Parallel I/O per-server Highly Available Front-End Decoupled Delivery Basic UNIX Command Maintenance -------------------------------------- Learn From MySQL Random Writes & Random Size ZFS & ZIL kqueue vs. inotify Fix One