How Shit Works: Storage

Tomer Gabel, Consulting Engineer at Substrate Software Services
Tomer Gabel, Wix
@ GeeCON Kraków 2016
Like all good stories…
• We’ll start with a question.
• “What’s wrong with this picture?”
MY, OH, MY.
WHAT COULD IT BE?
Axioms
• Not a trick question
– Servers are properly configured
– System architecture makes sense
– No obvious bugs
– No scheduled jobs
• So what else goes bump in the night?
PROLOGUE: “A LAUGHABLE CLAIM”
I/O is simple
• Just open a file, write, flush, close
• Nothing to it, right?
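For reference, the "simple" sequence on this slide can be sketched in Python as follows. The helper name `write_durably` is hypothetical, and the extra `os.fsync` call is an addition not mentioned on the slide: `flush` only empties the application's own buffer, while `fsync` asks the kernel to push the data down towards the device.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Open a file, write, flush, fsync, close (hypothetical helper)."""
    with open(path, "wb") as f:   # open
        f.write(data)             # write into the application-level buffer
        f.flush()                 # hand the buffer to the kernel
        os.fsync(f.fileno())      # ask the kernel to persist it to storage
    # leaving the "with" block closes the file


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "demo.bin")
        write_durably(target, b"hello, storage")
        print(os.path.getsize(target), "bytes written")
```

Every one of those four calls passes through the layers listed on the following slides, which is where the "nothing to it" claim starts to fall apart.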
[Diagram: Application → File → HDD]
I/O is simple
• A little closer…
[Diagram: Application → File → HDD, now passing through the kernel: Virtual File System, file system (ext4), Logical Volume Manager, I/O scheduler, SCSI driver stack]
I/O is simple
• But really…
[Diagram: Application → File → HDD, layers shown: Kernel, Hardware, Storage Subsystem, System Bus Drivers, PCI Express Bus, SATA Controller]
THE ONION OF ABSTRACTION
ACT I: THESE BOOTS ARE MADE FOR WALKIN’
Everybody knows…
• Sequential access is fast
• Random access is slow
• … so what?
Everybody knows…
“Disk seeks are a huge performance bottleneck… When the amount of data starts to grow so large that effective caching becomes impossible… you need at least one disk seek to read and a couple of disk seeks to write things.”
-- MySQL Reference Manual (8.12.3)
But why?
Rotational Latency
Throughput
• So you understand latency…
• What about throughput?
• Depends on two factors:
– Areal density
– Newtonian physics
Areal Density
Interlude: Math
• Rotation is fixed
– Constant angular velocity (CAV)
• Newton tells us that…
v = ω ∙ r
• Throughput increases with radius!
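A small sketch of v = ω ∙ r, to make the "throughput increases with radius" point concrete. The two radii are illustrative assumptions for a 3.5" platter, not figures from the talk, and the function name is hypothetical.

```python
import math


def linear_velocity(rpm: float, radius_m: float) -> float:
    """Linear speed (m/s) of the platter surface at a given radius: v = ω · r."""
    omega = 2 * math.pi * rpm / 60  # angular velocity in rad/s
    return omega * radius_m


# Assumed radii (not from the slides): innermost and outermost usable track.
INNER_RADIUS_M = 0.020
OUTER_RADIUS_M = 0.045

if __name__ == "__main__":
    inner = linear_velocity(7200, INNER_RADIUS_M)
    outer = linear_velocity(7200, OUTER_RADIUS_M)
    print(f"inner track: {inner:.1f} m/s")
    print(f"outer track: {outer:.1f} m/s")
    print(f"outer/inner ratio: {outer / inner:.2f}")
```

With the rotation speed fixed, the ratio depends only on the radii, so if bits are packed roughly equally densely along every track, the outer tracks pass proportionally more bits under the head per second.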
Interlude: Math
• Commodity drives are available at:
– 5400-15000 RPM
– Usually 7200 RPM
• What does it mean for latency?
7200 / 60 = 120 revolutions / second
1 / 120 ≈ 0.00833 s ≈ 8.33 ms per revolution!
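The same arithmetic for the whole range of commodity spindle speeds, as a minimal sketch. The function name is hypothetical; the "average wait" column assumes the sector you want is, on average, half a revolution away.

```python
def revolution_time_ms(rpm: float) -> float:
    """Time for one full platter revolution, in milliseconds."""
    return 60_000 / rpm


if __name__ == "__main__":
    for rpm in (5400, 7200, 10000, 15000):
        full = revolution_time_ms(rpm)
        # On average the target sector is half a revolution away from the head.
        print(f"{rpm:>5} RPM: {full:.2f} ms per revolution, {full / 2:.2f} ms average wait")
```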
In practice?
• Modern drives give you:
– 200+ MB/s
– 300 IOPS
• Pure random access nets only 1.2MB/s!
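The 1.2MB/s figure is consistent with 300 IOPS at 4 KB per request. The request size is an assumption here, since the slide does not state it; the sketch below just multiplies the two.

```python
def random_throughput_mb_s(iops: int, request_bytes: int) -> float:
    """Throughput in MB/s when every request needs its own seek."""
    return iops * request_bytes / 1_000_000


if __name__ == "__main__":
    REQUEST_BYTES = 4096  # assumed request size, not given on the slide
    random_mb_s = random_throughput_mb_s(300, REQUEST_BYTES)
    print(f"random access: {random_mb_s:.2f} MB/s")
    print(f"sequential is roughly {200 / random_mb_s:.0f}x faster")
```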
RIGHT.
WHAT CAN WE DO ABOUT IT?
Fine-tuning
• Provision more RAM
• Careful index structure
– Represent IPs as UNSIGNED INT for 75% reduction
– Implement better UUIDs¹ for 30% reduction
¹ Store UUID in an optimized way, Percona blog
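A sketch of the two index-shrinking ideas, using only the Python standard library. The function names are hypothetical. Note that this shows only the text-to-binary conversion; it does not reproduce whatever additional layout changes the Percona post describes.

```python
import ipaddress
import uuid


def ip_to_uint(ip: str) -> int:
    """Dotted-quad IPv4 text (up to 15 characters) -> 32-bit unsigned integer."""
    return int(ipaddress.IPv4Address(ip))


def uint_to_ip(value: int) -> str:
    """Inverse of ip_to_uint."""
    return str(ipaddress.IPv4Address(value))


def uuid_to_bytes(text: str) -> bytes:
    """36-character UUID text -> 16 raw bytes."""
    return uuid.UUID(text).bytes


if __name__ == "__main__":
    print(ip_to_uint("192.168.1.1"))
    print(uint_to_ip(ip_to_uint("192.168.1.1")))
    print(len(uuid_to_bytes("123e4567-e89b-12d3-a456-426614174000")), "bytes")
```

Smaller keys mean more index entries per page, which in turn means more of the index fits in RAM and fewer lookups end in a disk seek.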
… or use a sledgehammer!
• RAID 0 (and variants) employ striping
• Data is distributed to multiple spindles
• If it sounds familiar…
– It is!
– We call it “sharding”
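Striping boils down to a modulo over the number of disks, which is why it resembles sharding. Below is a minimal sketch of the address mapping; the function and parameter names are hypothetical, and real RAID implementations add metadata, alignment and error handling on top.

```python
from typing import Tuple


def locate_block(logical_block: int, disks: int, stripe_blocks: int = 1) -> Tuple[int, int]:
    """Map a logical block number to (disk index, block number on that disk)."""
    stripe = logical_block // stripe_blocks   # which stripe unit the block falls in
    within = logical_block % stripe_blocks    # offset inside that stripe unit
    disk = stripe % disks                     # stripe units rotate across the disks
    block_on_disk = (stripe // disks) * stripe_blocks + within
    return disk, block_on_disk


if __name__ == "__main__":
    for block in range(8):
        print(block, "->", locate_block(block, disks=4))
```

Consecutive logical blocks land on different spindles, so a large sequential read keeps all of them busy at once.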
It’s turtles all the way down
• Don’t jump to conclusions!
– RAID 0 is impractical
– RAID 5 may be slow
– RAID 10 is expensive
– etc.
• Do your homework
• Benchmark!
ACT II: I’LL USE MY CREDIT CARD
Let’s talk SSDs
• Non-volatile RAM
• Lots of IOPS
• Expensive :-)
• Same caveats apply…
Let’s talk SSDs
• Value starts at “1”
• Electrons accrue in the floating gate
• After programming, value becomes “0”
• Electrons are drained to reset value to “1”
Surprise and Terror
• “Draining” is destructive!
• Limited erases
• Limited lifespan!
Wear Leveling
Caveats, remember?
• Addressing
– Cells (1 bit) – not
addressable
– Pages (0.5-8KB)
– Blocks (32-64 pages)
• Why do you care?
– Reads/writes on a page
– But erasure on a block
Write Amplification
[Figure: changing a single bit (Δ = 1 bit) rewrites an entire block (Δ = 1 block!)]
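A rough way to put a number on this, under the worst-case model the slide implies: any change forces every block it touches to be rewritten in full. The page and block sizes are assumptions picked from the ranges quoted earlier, and the function name is hypothetical. Real controllers typically redirect updated pages elsewhere and defer erases, so measured amplification is usually far lower.

```python
def write_amplification(logical_bytes: int,
                        page_bytes: int = 4096,
                        pages_per_block: int = 64) -> float:
    """Physical bytes written per logical byte, if whole blocks must be rewritten."""
    block_bytes = page_bytes * pages_per_block
    blocks_touched = -(-logical_bytes // block_bytes)  # ceiling division
    return blocks_touched * block_bytes / logical_bytes


if __name__ == "__main__":
    for size in (1, 512, 4096, 256 * 1024):
        print(f"{size:>7} logical bytes -> amplification x{write_amplification(size):,.0f}")
```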
Surprising Results
• Defragmentation
– Relocates blocks
– Contiguous files
– Lower LBAs
– Background job
• Bad, bad, bad!
– No benefit with SSDs
– Major write load!
Background GC
[Figure: before GC, live pages 1, 2, 5, 6, 7 are scattered across Blocks A, B, C and D; after GC, they are packed together (1 2 5 and 6 7)]
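A toy model of what background garbage collection achieves: live pages get packed into as few blocks as possible so the remaining blocks can be erased ahead of time. The block layout, names and page order below are invented for illustration and do not reproduce the figure exactly; a real controller copies pages into an already-erased block rather than compacting in place.

```python
from typing import Dict, List, Optional

# One block is a list of page slots: an int is a live page id, None is stale or free.
Block = List[Optional[int]]


def compact(blocks: Dict[str, Block], pages_per_block: int) -> Dict[str, Block]:
    """Pack all live pages into the first blocks; the rest come out fully empty."""
    names = sorted(blocks)
    live = [page for name in names for page in blocks[name] if page is not None]
    result: Dict[str, Block] = {}
    for i, name in enumerate(names):
        chunk: Block = list(live[i * pages_per_block:(i + 1) * pages_per_block])
        result[name] = chunk + [None] * (pages_per_block - len(chunk))
    return result


if __name__ == "__main__":
    # Hypothetical starting layout with three page slots per block.
    before = {"A": [7, None, None], "B": [5, 6, None], "C": [1, None, None], "D": [2, None, None]}
    for name, pages in compact(before, pages_per_block=3).items():
        print(name, pages)
```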
Surprising Results
• What happens when you delete a file?
– Not much
– Bit flip on file table
– Space is not reclaimed
• Result?
– SATA TRIM command
[Figure: pages 1, 2, 5, 6, 7 in Blocks A, B, C and D]
SSD Takeaways
• A moving target
– File systems
– Data structures
– Longevity
• As usual:
– Benchmark
– Monitor
EPILOGUE: “LET ME EMBRACE THEE, SOUR ADVERSITY, FOR WISE MEN SAY IT IS THE WISEST COURSE.”
WE’RE DONE HERE!
… AND YES, WE’RE HIRING :-)
Thank you for listening
tomer@tomergabel.com
@tomerg
http://il.linkedin.com/in/tomergabel
Wix Engineering blog:
http://engineering.wix.com