• Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
3,186
On Slideshare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
117
Comments
0
Likes
15

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. The SmugMug Tale
  • 2. Who are we? Premium photo & video sharing. Bootstrapped in ’02. $10M+ as of ’07. Profitable. No debt. Top 400 website. Doubling yearly.
  • 3. Our challenge Premium means “more” and “better”. Unlimited storage. Unlimited bandwidth. Big photos (48Mpix). 500M+ of them. Big video (1920x180p). Lots of photos per page. Super fast.
  • 4. Architecture overview LAMP(hp). x86 (mostly AMD) on Linux (~300 4+ core hosts?) 4 datacenters: 2 x SV, 1 x VA, 1 x SEA 2 Ops guys. :) Majority of boxes are diskless. Consume lots of cloud services (S3, EC2, etc).
  • 5. Storage Binary data (photos, video, etc): Stored in Amazon’s S3. PBs. Akamai fronts for caching and acceleration. Structured data (Database, etc): MySQL (InnoDB mostly). 4+ cores, 64GB, >2TB storage Memcached fronts for caching.
  • 6. Compute Photo & video processing / encoding: Handled in Amazon EC2. Totally autonomous scaling (SkyNet) Customer facing: Diskless web boxes (PXE boot) Scaled up *and* out MySQL Memcached ~1TB
  • 7. Secret Weapon: Akamai Super-fast CDN: Reads often already close to customer. More than just a CDN: HTML/AJAX/etc inspection for pre-fetch Anticipate requests and get data to within low ms Optimal data path to SmugMug DNS latency reduction $$$ but worth it. Get what you pay for.
  • 8. Secret Weapon: memcached Screaming fast. ~1TB of data stored. >96% hit rate Contains MySQL row data, avoid SELECTs Misc other data cached, but MySQL biggest win Fall back on MySQL for cold data
  • 9. Secret Weapon: MySQL Most important technology at SmugMug. Super dependent on replication: Performance Reliability / High Availability No MySQL data loss in >7 years. No JOINs. (Or lots of 4.x+ features, either) Vertically partitioned, not horizontally (no shards)
  • 10. Secret Weapon: InnoDB Most important technology at SmugMug. Huge thanks to Heikki, Oracle, Percona and Google! Running 1.0.3+patches in production. Big performance gains with recent releases.
  • 11. Secret Weapon: Percona Crazy concentration of talent under one roof. Best MySQL dollars we’ve ever spent. Helped us out of a major bind Have you heard of the ‘back_log’ mysqld setting? Me neither. Hope you never do. Percona had. Helped build, integrate, and test InnoDB patches.
  • 12. MySQL details We care about write latency above all. Well, ok, maybe data integrity. ;) Scaling reads “easy”: replication and memcached. Replication needs to stay current (<1 sec). MySQL concurrency problems. (Much improved!) Parallel I/O - lots of cores. Large storage (TBs). Big RAM (64GB+) to keep indexes hot.
  • 13. MySQL query details Mostly SELECT pkey FROM table WHERE index; On cache miss, SELECT * FROM table WHERE pkey; UPDATEs/DELETEs mostly on single rows by pkey Easy memcached expiration. Easy slave-delay tracking. Very denormalized. No JOINs or complex SELECTs. OLTP benchmark imperfect. Time for sysbench-web?
  • 14. MySQL Issues: Filesystems Better filesystem: CentOS Linux shop (lots of expertise). MySQL is storage intensive (iops, size, etc). ext3 old and busted. fsck, well, sucks. ext4 already old and busted. :( Want good volume management. Serialized writes (non-parallel). Ugh.
  • 15. Filesystem Solution - ZFS! Transactional. Copy-on-write. End-to-end data integrity. On-the-fly corruption detection & repair. Integrated volume management. Snapshots & clones. Open source software.
  • 16. The REAL Issue We run Linux. ZFS doesn’t run on Linux. Crap.
  • 17. MySQL Issues: Replication Unknown state on crash: Did *.info get written at commit? Or is it *2 months* out of date? Bringing TB+ slaves online quickly. Backups using LVM/ZFS a pain. Keeping up with master. Single thread for replication SQL. Master promotion cludgy.
  • 18. Replication solutions Transactional replication patches: Slave always in known state. Either ok to bring back up or CHANGE MASTER. Safe to take snapshots anytime, no effort. Safe to use innodb_flush_log_at_trx_commit=2 InnoDB only. Stopgap. Global trx IDs better. Using in pre-production. Production next week?
  • 19. Secret Weapon: Sushi Toro aka S7410. NAS storage with a few twists. 2 x Quad-Core Opteron + 64GB RAM 100MB Readzilla SSD 2 x 18GB Writezilla SSD. 20K write iops. 22 x 1TB 7200rpm HDD Clustered HA configuration.
  • 20. Mmm, Toro tastes good. ZFS on Linux! SSD is here! SSD performance is cheap! Consume via NFS, iSCSI, CIFS, HTTP, FTP, etc. Massive flexibility - no more DAS. Fishworks interface is a dream. Analytics is a game changer.
  • 21. Sushi’s quite reasonable Initial sticker shock - $80K?! $142K clustered?! No one pays list price. Whew. Startup Essentials. Double-whew. Paradigm shift. Biggest whew! DAS -> NAS So much IO, in theory, can “stack” lots of clients. In practice, can stack *lots* of clients. We now have 5 clustered configs. :)
  • 22. Sushi served fast Crazy fast. 9.6K iops, 4.5K under 43us, 8K under 166us
  • 23. Sushi served fast Scalable. 15K 4k write iops w/16 threads. Low latency. ~250us @ 3K iops, ~700us @ 10K fio write benchmark 20000 15000 4K write iops 10000 5000 0 1 2 4 8 16 32 threads
  • 24. Sushi today So fast, we’re stacking like crazy. 5 different MySQL workloads on single clustered Toro. 8 slaves on single Toro. Each used to have 15K disks + write cache. Lots of excess io and space capacity still. Compression “for free” (no client CPU usage) Crazy fast ~1.5X ratio across TBs of InnoDB
  • 25. Sushi today Backups a breeze. Automatic snapshots every n minutes / hours / days. No need to LOCK / shutdown / STOP SLAVE / etc Rollback anytime. Skip bad SQL statements. New slave? Click snapshot. Click clone. Done. Slaves share unchanged data on disk and in RAM. Future bright: clone + de-dupe = insanely efficient.
  • 26. Analyzing sushi DTrace on Linux! Never had analytics on storage before. Vendor used to say: “Um, we dunno. Buy more spindles?” Now I know all. Vendor now says: “What does Analytics say?” Drill down on everything. Correlate anything. God-like power.
  • 27. MySQL on Toro so far NFSv3 (rather than v4) 16KB record size in ZFS (InnoDB) Mirrored (RAID1+0) disks w/striped Logzilla MySQL concurrency bound - can’t use all the I/O If compressing, use LZJB. In theory, can optimize InnoDB: doublewrite = 0, checksums = 0. ZFS does these. In practice, no big gain with our workload.
  • 28. MySQL on Toro problems Replication *.info files not sync’d over NFS Found a slave with *2 month old* info files Transactional replication to the rescue! NFS locking and InnoDB Warnings on the Net. No hard data. Actively researching. What’s the problem?
  • 29. Even faster? 10GbE for reduced latency? Actively testing this. Driver tuning required. Defaults for throughput. Cards (Intel) & switches (Arista) cheap & fast Less than $500/port. Copper twinax SFP+ cables cheap. Optical XFP not. $50 vs $1000+ Toro doesn’t support SFP+ cards yet. :(
  • 30. Kitchen sink on Toro Everything runs better on Toro. :) Revision control. Stateless Linux mounts. Email. Developer home directories. Built-in, automatic replication for multi-site backups. Photo and video serving?
  • 31. The future? 100% SSD. Still too $$ for TB+ installs. Even better InnoDB. Community on fire. Oracle/MySQL accepting patches! Multi-threaded replication. Preview release is out. Yes! New storage engines PBXT, Falcon, Maria, oh my!
  • 32. Oracle wishlist MySQL is a crown jewel. Not a gateway drug to Oracle. Different customers. Kill btrfs. GPL ZFS. MySQL and InnoDB under one roof = opportunity. OpenStorage is game changer. Don’t kill it. Listen to your new communities. I’m busy. I’m up here because this is important.
  • 33. Thanks! Blog: http://blogs.smugmug.com/don Twitter: DonMacAskill Email: don@smugmug.com Percona Conference: Upstairs :)