Who are we?
Premium photo & video sharing.
Bootstrapped in ’02.
$10M+ as of ’07.
Top 400 website.
Premium means “more” and “better”.
Big photos (48Mpix). 500M+ of them.
Big video (1920x1080p).
Lots of photos per page.
x86 (mostly AMD) on Linux (~300 4+ core hosts?)
4 datacenters: 2 x SV, 1 x VA, 1 x SEA
2 Ops guys. :)
Majority of boxes are diskless.
Consume lots of cloud services (S3, EC2, etc).
Binary data (photos, video, etc):
Stored in Amazon’s S3. PBs.
Akamai fronts for caching and acceleration.
Structured data (Database, etc):
MySQL (InnoDB mostly).
4+ cores, 64GB, >2TB storage
Memcached fronts for caching.
Photo & video processing / encoding:
Handled in Amazon EC2.
Totally autonomous scaling (SkyNet)
Diskless web boxes (PXE boot)
Scaled up *and* out MySQL
Secret Weapon: Akamai
Reads often already close to customer.
More than just a CDN:
HTML/AJAX/etc inspection for pre-fetch
Anticipate requests and get data to within low ms
Optimal data path to SmugMug
DNS latency reduction
$$$ but worth it. Get what you pay for.
Secret Weapon: memcached
~1TB of data stored.
>96% hit rate
Contains MySQL row data, avoid SELECTs
Misc other data cached, but MySQL biggest win
Fall back on MySQL for cold data
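A minimal sketch of the read path above: try the cache, fall back to the database on a miss, then warm the cache. Assumptions: a plain dict stands in for memcached, and `RowCache`, `load_row`, and the sample rows are hypothetical names, not SmugMug's code.

```python
class RowCache:
    """Cache-aside reads: try the cache first, fall back to the database on a miss."""

    def __init__(self, load_row):
        self.load_row = load_row   # hypothetical callable that fetches a row by pkey
        self.store = {}            # dict stands in for memcached
        self.hits = 0
        self.misses = 0

    def get(self, pkey):
        row = self.store.get(pkey)
        if row is not None:
            self.hits += 1
            return row
        self.misses += 1
        row = self.load_row(pkey)      # cold data: go to the database
        if row is not None:
            self.store[pkey] = row     # warm the cache for the next reader
        return row

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Fake "database" for the demo (made-up rows).
DATABASE = {1: {"id": 1, "title": "sunset"}, 2: {"id": 2, "title": "harbor"}}
cache = RowCache(DATABASE.get)

for pkey in [1, 1, 1, 2, 1]:
    cache.get(pkey)

print(f"hit rate: {cache.hit_rate():.0%}")
```

Tracking hits and misses in the client is one simple way to watch a number like the hit rate quoted above; memcached also reports its own counters.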
Secret Weapon: MySQL
Most important technology at SmugMug.
Super dependent on replication:
Reliability / High Availability
No MySQL data loss in >7 years.
No JOINs. (Or lots of 4.x+ features, either)
Vertically partitioned, not horizontally (no shards)
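A rough illustration of the difference, assuming a simple lookup table. Vertical partitioning routes by table name, so a whole table lives on one cluster. Sharding routes by row key. The table and cluster names here are invented for the example.

```python
# Hypothetical table-to-cluster map: each table lives whole on one cluster.
TABLE_TO_CLUSTER = {
    "users": "db-accounts",
    "albums": "db-albums",
    "images": "db-images",
    "comments": "db-community",
}


def cluster_for(table):
    """Vertical partitioning: route by table name only, never by row key."""
    return TABLE_TO_CLUSTER[table]


def shard_for(user_id, shard_count):
    """For contrast only: horizontal sharding routes by row key."""
    return f"db-shard-{user_id % shard_count}"


print(cluster_for("images"))   # every images row goes to the same cluster
print(shard_for(7, 4))         # a sharded design would split rows across hosts
```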
Secret Weapon: InnoDB
Most important technology at SmugMug.
Huge thanks to Heikki, Oracle, Percona and Google!
Running 1.0.3+patches in production.
Big performance gains with recent releases.
Secret Weapon: Percona
Crazy concentration of talent under one roof.
Best MySQL dollars we’ve ever spent.
Helped us out of a major bind
Have you heard of the ‘back_log’ mysqld setting?
Me neither. Hope you never do. Percona had.
Helped build, integrate, and test InnoDB patches.
We care about write latency above all.
Well, ok, maybe data integrity. ;)
Scaling reads “easy”: replication and memcached.
Replication needs to stay current (<1 sec).
MySQL concurrency problems. (Much improved!)
Parallel I/O - lots of cores.
Large storage (TBs).
Big RAM (64GB+) to keep indexes hot.
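One way an application might act on the "<1 sec" requirement: only read from replicas whose measured lag is under the limit, otherwise go to the master. This is a sketch under assumptions; the slides do not say how reads are routed, and the host names and lag values are made up.

```python
MAX_LAG_SECONDS = 1.0   # from the slide: replication needs to stay under 1 second


def pick_read_host(master, replica_lag):
    """Return a replica that is current enough, else fall back to the master.

    replica_lag maps host name -> measured lag in seconds (None = unknown or broken).
    """
    fresh = {host: lag for host, lag in replica_lag.items()
             if lag is not None and lag < MAX_LAG_SECONDS}
    if not fresh:
        return master
    return min(fresh, key=fresh.get)   # least-lagged replica


print(pick_read_host("master-1", {"slave-1": 0.2, "slave-2": 3.0}))
print(pick_read_host("master-1", {"slave-1": 5.0, "slave-2": None}))
```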
MySQL query details
Mostly SELECT pkey FROM table WHERE index;
On cache miss, SELECT * FROM table WHERE pkey;
UPDATEs/DELETEs mostly on single rows by pkey
Easy memcached expiration.
Easy slave-delay tracking.
No JOINs or complex SELECTs.
OLTP benchmark imperfect. Time for sysbench-web?
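The query pattern above, sketched end to end. Assumptions: SQLite stands in for MySQL, a dict stands in for memcached, and the `images` schema is invented. The index query returns primary keys only, full rows come from cache or a pkey SELECT, and a single-row UPDATE expires exactly one cache key.

```python
import sqlite3

db = sqlite3.connect(":memory:")   # SQLite stands in for MySQL
db.row_factory = sqlite3.Row
db.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, album_id INTEGER, title TEXT)")
db.execute("CREATE INDEX idx_album ON images (album_id)")
db.executemany("INSERT INTO images VALUES (?, ?, ?)",
               [(1, 10, "sunset"), (2, 10, "harbor"), (3, 11, "bridge")])
db.commit()

row_cache = {}                     # dict stands in for memcached
stats = {"full_row_selects": 0}    # counts SELECT * round trips


def image_ids_in_album(album_id):
    # Step 1: the index lookup returns primary keys only.
    rows = db.execute("SELECT id FROM images WHERE album_id = ? ORDER BY id", (album_id,))
    return [r["id"] for r in rows]


def image_row(pkey):
    # Step 2: fetch the full row by pkey, from cache when possible.
    if pkey not in row_cache:
        stats["full_row_selects"] += 1
        row = db.execute("SELECT * FROM images WHERE id = ?", (pkey,)).fetchone()
        if row is None:
            return None
        row_cache[pkey] = dict(row)
    return row_cache[pkey]


def rename_image(pkey, title):
    # Single-row UPDATE by pkey, then expire exactly one cache key.
    db.execute("UPDATE images SET title = ? WHERE id = ?", (title, pkey))
    db.commit()
    row_cache.pop(pkey, None)


first = [image_row(i)["title"] for i in image_ids_in_album(10)]   # two cache misses
again = [image_row(i)["title"] for i in image_ids_in_album(10)]   # served from cache
rename_image(2, "harbor at dusk")
after = [image_row(i)["title"] for i in image_ids_in_album(10)]   # one miss, for id 2
print(first, again, after, stats["full_row_selects"])
```

Because every write touches one row by pkey, the matching cache key is always known. That is what makes expiration "easy" here.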
MySQL Issues: Filesystems
CentOS Linux shop (lots of expertise).
MySQL is storage intensive (iops, size, etc).
ext3 old and busted. fsck, well, sucks.
ext4 already old and busted. :(
Want good volume management.
Serialized writes (non-parallel). Ugh.
The REAL Issue
We run Linux.
ZFS doesn’t run on Linux.
MySQL Issues: Replication
Unknown state on crash:
Did *.info get written at commit?
Or is it *2 months* out of date?
Bringing TB+ slaves online quickly.
Backups using LVM/ZFS a pain.
Keeping up with master.
Single thread for replication SQL.
Master promotion kludgy.
Transactional replication patches:
Slave always in known state.
Either ok to bring back up or CHANGE MASTER.
Safe to take snapshots anytime, no effort.
Safe to use innodb_flush_log_at_trx_commit=2
InnoDB only. Stopgap. Global trx IDs better.
Using in pre-production. Production next week?
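The general idea behind "slave always in known state", as a toy: record the replication position in the same transaction as the replicated change, so a crash rolls back both together. This is an illustration under assumptions, not the patches' actual mechanism. SQLite stands in for an InnoDB slave, and the `repl_position` table is hypothetical.

```python
import sqlite3

replica = sqlite3.connect(":memory:")   # SQLite stands in for an InnoDB slave
replica.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, title TEXT)")
replica.execute("CREATE TABLE repl_position (id INTEGER PRIMARY KEY CHECK (id = 1), pos INTEGER)")
replica.execute("INSERT INTO repl_position VALUES (1, 0)")
replica.commit()


def apply_event(pos, sql, params, crash_before_commit=False):
    """Apply one replicated change and record its position in the same transaction."""
    try:
        replica.execute(sql, params)
        replica.execute("UPDATE repl_position SET pos = ? WHERE id = 1", (pos,))
        if crash_before_commit:
            raise RuntimeError("simulated crash")
        replica.commit()
    except RuntimeError:
        replica.rollback()   # data change and position roll back together


def state():
    pos = replica.execute("SELECT pos FROM repl_position").fetchone()[0]
    count = replica.execute("SELECT COUNT(*) FROM images").fetchone()[0]
    return pos, count


apply_event(1, "INSERT INTO images VALUES (?, ?)", (1, "sunset"))
apply_event(2, "INSERT INTO images VALUES (?, ?)", (2, "harbor"), crash_before_commit=True)
after_crash = state()       # position and data still agree
apply_event(2, "INSERT INTO images VALUES (?, ?)", (2, "harbor"))
after_retry = state()
print(after_crash, after_retry)
```

With the position stored outside the transaction (as with the *.info files), the crash case could leave the data and the recorded position disagreeing.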
Secret Weapon: Sushi
Toro aka S7410.
NAS storage with a few twists.
2 x Quad-Core Opteron + 64GB RAM
100GB Readzilla SSD
2 x 18GB Writezilla SSD. 20K write iops.
22 x 1TB 7200rpm HDD
Clustered HA configuration.
Mmm, Toro tastes good.
ZFS on Linux!
SSD is here!
SSD performance is cheap!
Consume via NFS, iSCSI, CIFS, HTTP, FTP, etc.
Massive flexibility - no more DAS.
Fishworks interface is a dream.
Analytics is a game changer.
Sushi’s quite reasonable
Initial sticker shock - $80K?! $142K clustered?!
No one pays list price. Whew.
Startup Essentials. Double-whew.
Paradigm shift. Biggest whew!
DAS -> NAS
So much IO, in theory, can “stack” lots of clients.
In practice, can stack *lots* of clients.
We now have 5 clustered configs. :)
Sushi served fast
Crazy fast. 9.6K iops, 4.5K under 43us, 8K under 166us
So fast, we’re stacking like crazy.
5 different MySQL workloads on single clustered Toro.
8 slaves on single Toro.
Each used to have 15K disks + write cache.
Lots of excess io and space capacity still.
Compression “for free” (no client CPU usage)
~1.5X ratio across TBs of InnoDB
Backups a breeze.
Automatic snapshots every n minutes / hours / days.
No need to LOCK / shutdown / STOP SLAVE / etc
Rollback anytime. Skip bad SQL statements.
New slave? Click snapshot. Click clone. Done.
Slaves share unchanged data on disk and in RAM.
Future bright: clone + de-dupe = insanely efficient.
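Why clones are cheap, as a toy model. A clone copies the block map, not the blocks, so unchanged data stays shared until one side writes. This is a sketch of copy-on-write in general, assuming nothing about ZFS internals; `Volume` and `shared_blocks` are made-up names.

```python
class Volume:
    """Toy copy-on-write volume: block number -> immutable bytes object."""

    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})

    def write(self, block_no, data):
        self.blocks[block_no] = data      # replaces only this volume's reference

    def clone(self):
        # Copies the block map, not the data; block objects stay shared.
        return Volume(self.blocks)


def shared_blocks(a, b):
    """Count blocks that both volumes reference as the very same object."""
    return sum(1 for n, blk in a.blocks.items() if b.blocks.get(n) is blk)


master_copy = Volume({n: bytes([n]) * 16 for n in range(8)})
new_slave = master_copy.clone()           # "Click snapshot. Click clone. Done."
new_slave.write(3, b"replicated change")  # the new slave diverges on one block
print(shared_blocks(master_copy, new_slave))
```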
DTrace on Linux!
Never had analytics on storage before.
Vendor used to say: “Um, we dunno. Buy more spindles?”
Now I know all.
Vendor now says: “What does Analytics say?”
Drill down on everything. Correlate anything.
MySQL on Toro so far
NFSv3 (rather than v4)
16KB record size in ZFS (InnoDB)
Mirrored (RAID1+0) disks w/striped Logzilla
MySQL concurrency bound - can’t use all the I/O
If compressing, use LZJB.
In theory, can optimize InnoDB:
doublewrite = 0, checksums = 0. ZFS does these.
In practice, no big gain with our workload.
MySQL on Toro problems
Replication *.info files not sync’d over NFS
Found a slave with *2 month old* info files
Transactional replication to the rescue!
NFS locking and InnoDB
Warnings on the Net. No hard data.
Actively researching. What’s the problem?
10GbE for reduced latency?
Actively testing this.
Driver tuning required. Defaults for throughput.
Cards (Intel) & switches (Arista) cheap & fast
Less than $500/port.
Copper twinax SFP+ cables cheap. Optical XFP not.
$50 vs $1000+
Toro doesn’t support SFP+ cards yet. :(
Kitchen sink on Toro
Everything runs better on Toro. :)
Stateless Linux mounts.
Developer home directories.
Built-in, automatic replication for multi-site backups.
Photo and video serving?
Still too $$ for TB+ installs.
Even better InnoDB.
Community on fire. Oracle/MySQL accepting patches!
Preview release is out. Yes!
New storage engines
PBXT, Falcon, Maria, oh my!
MySQL is a crown jewel.
Not a gateway drug to Oracle. Different customers.
Kill btrfs. GPL ZFS.
MySQL and InnoDB under one roof = opportunity.
OpenStorage is game changer. Don’t kill it.
Listen to your new communities.
I’m busy. I’m up here because this is important.