SlideShare a Scribd company logo
1 of 33
Download to read offline
The SmugMug Tale
Who are we?
  Premium photo & video sharing.
  Bootstrapped in ’02.
  $10M+ as of ’07.
  Profitable.
  No debt.
  Top 400 website.
  Doubling yearly.
Our challenge
  Premium means “more” and “better”.
  Unlimited storage.
  Unlimited bandwidth.
  Big photos (48Mpix). 500M+ of them.
  Big video (1920x180p).
  Lots of photos per page.
  Super fast.
Architecture overview
  LAMP(hp).
  x86 (mostly AMD) on Linux (~300 4+ core hosts?)
  4 datacenters: 2 x SV, 1 x VA, 1 x SEA
  2 Ops guys. :)
  Majority of boxes are diskless.
  Consume lots of cloud services (S3, EC2, etc).
Storage
  Binary data (photos, video, etc):
      Stored in Amazon’s S3. PBs.
     Akamai fronts for caching and acceleration.
  Structured data (Database, etc):
      MySQL (InnoDB mostly).
      4+ cores, 64GB, >2TB storage
      Memcached fronts for caching.
Compute
  Photo & video processing / encoding:
     Handled in Amazon EC2.
     Totally autonomous scaling (SkyNet)
  Customer facing:
     Diskless web boxes (PXE boot)
     Scaled up *and* out MySQL
     Memcached ~1TB
Secret Weapon: Akamai
  Super-fast CDN:
    Reads often already close to customer.
  More than just a CDN:
   HTML/AJAX/etc inspection for pre-fetch
   Anticipate requests and get data to within low ms
   Optimal data path to SmugMug
   DNS latency reduction
  $$$ but worth it. Get what you pay for.
Secret Weapon: memcached
  Screaming fast.
  ~1TB of data stored.
  >96% hit rate
  Contains MySQL row data, avoid SELECTs
  Misc other data cached, but MySQL biggest win
  Fall back on MySQL for cold data
Secret Weapon: MySQL
  Most important technology at SmugMug.
  Super dependent on replication:
   Performance
   Reliability / High Availability
  No MySQL data loss in >7 years.
  No JOINs. (Or lots of 4.x+ features, either)
  Vertically partitioned, not horizontally (no shards)
Secret Weapon: InnoDB
  Most important technology at SmugMug.
  Huge thanks to Heikki, Oracle, Percona and Google!
  Running 1.0.3+patches in production.
  Big performance gains with recent releases.
Secret Weapon: Percona
  Crazy concentration of talent under one roof.
  Best MySQL dollars we’ve ever spent.
  Helped us out of a major bind
   Have you heard of the ‘back_log’ mysqld setting?
   Me neither. Hope you never do. Percona had.
  Helped build, integrate, and test InnoDB patches.
MySQL details
  We care about write latency above all.
  Well, ok, maybe data integrity. ;)
  Scaling reads “easy”: replication and memcached.
  Replication needs to stay current (<1 sec).
  MySQL concurrency problems. (Much improved!)
  Parallel I/O - lots of cores.
  Large storage (TBs).
  Big RAM (64GB+) to keep indexes hot.
MySQL query details
  Mostly SELECT pkey FROM table WHERE index;
  On cache miss, SELECT * FROM table WHERE pkey;
  UPDATEs/DELETEs mostly on single rows by pkey
   Easy memcached expiration.
   Easy slave-delay tracking.
  Very denormalized.
  No JOINs or complex SELECTs.
  OLTP benchmark imperfect. Time for sysbench-web?
MySQL Issues: Filesystems
  Better filesystem:
   CentOS Linux shop (lots of expertise).
   MySQL is storage intensive (iops, size, etc).
   ext3 old and busted. fsck, well, sucks.
   ext4 already old and busted. :(
   Want good volume management.
   Serialized writes (non-parallel). Ugh.
Filesystem Solution - ZFS!
  Transactional.
  Copy-on-write.
  End-to-end data integrity.
  On-the-fly corruption detection & repair.
  Integrated volume management.
  Snapshots & clones.
  Open source software.
The REAL Issue
  We run Linux.
  ZFS doesn’t run on Linux.
  Crap.
MySQL Issues: Replication
  Unknown state on crash:
   Did *.info get written at commit?
   Or is it *2 months* out of date?
  Bringing TB+ slaves online quickly.
  Backups using LVM/ZFS a pain.
  Keeping up with master.
  Single thread for replication SQL.
  Master promotion cludgy.
Replication solutions
  Transactional replication patches:
   Slave always in known state.
   Either ok to bring back up or CHANGE MASTER.
   Safe to take snapshots anytime, no effort.
   Safe to use innodb_flush_log_at_trx_commit=2
   InnoDB only. Stopgap. Global trx IDs better.
   Using in pre-production. Production next week?
Secret Weapon: Sushi
  Toro aka S7410.
   NAS storage with a few twists.
   2 x Quad-Core Opteron + 64GB RAM
   100MB Readzilla SSD
   2 x 18GB Writezilla SSD. 20K write iops.
   22 x 1TB 7200rpm HDD
   Clustered HA configuration.
Mmm, Toro tastes good.
  ZFS on Linux!
  SSD is here!
  SSD performance is cheap!
  Consume via NFS, iSCSI, CIFS, HTTP, FTP, etc.
  Massive flexibility - no more DAS.
  Fishworks interface is a dream.
  Analytics is a game changer.
Sushi’s quite reasonable
   Initial sticker shock - $80K?! $142K clustered?!
    No one pays list price. Whew.
    Startup Essentials. Double-whew.
    Paradigm shift. Biggest whew!
     DAS -> NAS
     So much IO, in theory, can “stack” lots of clients.
     In practice, can stack *lots* of clients.
  We now have 5 clustered configs. :)
Sushi served fast
   Crazy fast. 9.6K iops, 4.5K under 43us, 8K under 166us
Sushi served fast
   Scalable. 15K 4k write iops w/16 threads.
   Low latency. ~250us @ 3K iops, ~700us @ 10K
                                       fio write benchmark



                       20000




                       15000
       4K write iops




                       10000




                        5000




                           0
                               1   2   4                    8   16   32
                                               threads
Sushi today
  So fast, we’re stacking like crazy.
  5 different MySQL workloads on single clustered Toro.
  8 slaves on single Toro.
   Each used to have 15K disks + write cache.
   Lots of excess io and space capacity still.
  Compression “for free” (no client CPU usage)
   Crazy fast
   ~1.5X ratio across TBs of InnoDB
Sushi today
  Backups a breeze.
   Automatic snapshots every n minutes / hours / days.
   No need to LOCK / shutdown / STOP SLAVE / etc
   Rollback anytime. Skip bad SQL statements.
  New slave? Click snapshot. Click clone. Done.
   Slaves share unchanged data on disk and in RAM.
   Future bright: clone + de-dupe = insanely efficient.
Analyzing sushi
  DTrace on Linux!
  Never had analytics on storage before.
  Vendor used to say: “Um, we dunno. Buy more spindles?”
  Now I know all.
  Vendor now says: “What does Analytics say?”
  Drill down on everything. Correlate anything.
  God-like power.
MySQL on Toro so far
  NFSv3 (rather than v4)
  16KB record size in ZFS (InnoDB)
  Mirrored (RAID1+0) disks w/striped Logzilla
  MySQL concurrency bound - can’t use all the I/O
  If compressing, use LZJB.
  In theory, can optimize InnoDB:
   doublewrite = 0, checksums = 0. ZFS does these.
   In practice, no big gain with our workload.
MySQL on Toro problems
  Replication *.info files not sync’d over NFS
   Found a slave with *2 month old* info files
   Transactional replication to the rescue!
  NFS locking and InnoDB
   Warnings on the Net. No hard data.
   Actively researching. What’s the problem?
Even faster?
  10GbE for reduced latency?
   Actively testing this.
     Driver tuning required. Defaults for throughput.
   Cards (Intel) & switches (Arista) cheap & fast
     Less than $500/port.
   Copper twinax SFP+ cables cheap. Optical XFP not.
     $50 vs $1000+
   Toro doesn’t support SFP+ cards yet. :(
Kitchen sink on Toro
  Everything runs better on Toro. :)
  Revision control.
  Stateless Linux mounts.
  Email.
  Developer home directories.
  Built-in, automatic replication for multi-site backups.
  Photo and video serving?
The future?
  100% SSD.
   Still too $$ for TB+ installs.
  Even better InnoDB.
   Community on fire. Oracle/MySQL accepting patches!
  Multi-threaded replication.
   Preview release is out. Yes!
  New storage engines
   PBXT, Falcon, Maria, oh my!
Oracle wishlist
  MySQL is a crown jewel.
   Not a gateway drug to Oracle. Different customers.
  Kill btrfs. GPL ZFS.
  MySQL and InnoDB under one roof = opportunity.
  OpenStorage is game changer. Don’t kill it.
  Listen to your new communities.
   I’m busy. I’m up here because this is important.
Thanks!
          Blog: http://blogs.smugmug.com/don
          Twitter: DonMacAskill
          Email: don@smugmug.com
          Percona Conference: Upstairs :)

More Related Content

What's hot

What's hot (19)

Ceph Day San Jose - Ceph at Salesforce
Ceph Day San Jose - Ceph at Salesforce Ceph Day San Jose - Ceph at Salesforce
Ceph Day San Jose - Ceph at Salesforce
 
Apache cassandra - survivre en production
Apache cassandra  - survivre en productionApache cassandra  - survivre en production
Apache cassandra - survivre en production
 
Sheepdog Status Report
Sheepdog Status ReportSheepdog Status Report
Sheepdog Status Report
 
Ceph Day San Jose - Object Storage for Big Data
Ceph Day San Jose - Object Storage for Big Data Ceph Day San Jose - Object Storage for Big Data
Ceph Day San Jose - Object Storage for Big Data
 
Performance analysis with_ceph
Performance analysis with_cephPerformance analysis with_ceph
Performance analysis with_ceph
 
Ceph Day San Jose - All-Flahs Ceph on NUMA-Balanced Server
Ceph Day San Jose - All-Flahs Ceph on NUMA-Balanced Server Ceph Day San Jose - All-Flahs Ceph on NUMA-Balanced Server
Ceph Day San Jose - All-Flahs Ceph on NUMA-Balanced Server
 
Ceph Day Bring Ceph To Enterprise
Ceph Day Bring Ceph To EnterpriseCeph Day Bring Ceph To Enterprise
Ceph Day Bring Ceph To Enterprise
 
Exploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient WorkflowsExploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient Workflows
 
openSUSE storage workshop 2016
openSUSE storage workshop 2016openSUSE storage workshop 2016
openSUSE storage workshop 2016
 
CephFS in Jewel: Stable at Last
CephFS in Jewel: Stable at LastCephFS in Jewel: Stable at Last
CephFS in Jewel: Stable at Last
 
LUG 2014
LUG 2014LUG 2014
LUG 2014
 
深入了解Redis
深入了解Redis深入了解Redis
深入了解Redis
 
Ceph Day San Jose - HA NAS with CephFS
Ceph Day San Jose - HA NAS with CephFSCeph Day San Jose - HA NAS with CephFS
Ceph Day San Jose - HA NAS with CephFS
 
PostgreSQL on EXT4, XFS, BTRFS and ZFS
PostgreSQL on EXT4, XFS, BTRFS and ZFSPostgreSQL on EXT4, XFS, BTRFS and ZFS
PostgreSQL on EXT4, XFS, BTRFS and ZFS
 
Build an affordable Cloud Stroage
Build an affordable Cloud StroageBuild an affordable Cloud Stroage
Build an affordable Cloud Stroage
 
SUSE Enterprise Storage on ThunderX
SUSE Enterprise Storage on ThunderXSUSE Enterprise Storage on ThunderX
SUSE Enterprise Storage on ThunderX
 
RAID, Replication, and You
RAID, Replication, and YouRAID, Replication, and You
RAID, Replication, and You
 
SSD для вашей базы данных, Петр Зайцев (Percona)
SSD для вашей базы данных, Петр Зайцев (Percona)SSD для вашей базы данных, Петр Зайцев (Percona)
SSD для вашей базы данных, Петр Зайцев (Percona)
 
What every data programmer needs to know about disks
What every data programmer needs to know about disksWhat every data programmer needs to know about disks
What every data programmer needs to know about disks
 

Similar to The Smug Mug Tale

Hug Hbase Presentation.
Hug Hbase Presentation.Hug Hbase Presentation.
Hug Hbase Presentation.
Jack Levin
 
Storage and performance, Whiptail
Storage and performance, Whiptail Storage and performance, Whiptail
Storage and performance, Whiptail
Internet World
 
Rocking mongo db on the cloud
Rocking mongo db on the cloudRocking mongo db on the cloud
Rocking mongo db on the cloud
MongoDB
 

Similar to The Smug Mug Tale (20)

Hug Hbase Presentation.
Hug Hbase Presentation.Hug Hbase Presentation.
Hug Hbase Presentation.
 
Mysql talk
Mysql talkMysql talk
Mysql talk
 
Linux and H/W optimizations for MySQL
Linux and H/W optimizations for MySQLLinux and H/W optimizations for MySQL
Linux and H/W optimizations for MySQL
 
Under The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database ArchitectureUnder The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database Architecture
 
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
 
Storage and performance, Whiptail
Storage and performance, Whiptail Storage and performance, Whiptail
Storage and performance, Whiptail
 
Galaxy Big Data with MariaDB
Galaxy Big Data with MariaDBGalaxy Big Data with MariaDB
Galaxy Big Data with MariaDB
 
CLFS 2010
CLFS 2010CLFS 2010
CLFS 2010
 
Pilot Hadoop Towards 2500 Nodes and Cluster Redundancy
Pilot Hadoop Towards 2500 Nodes and Cluster RedundancyPilot Hadoop Towards 2500 Nodes and Cluster Redundancy
Pilot Hadoop Towards 2500 Nodes and Cluster Redundancy
 
Shootout at the PAAS Corral
Shootout at the PAAS CorralShootout at the PAAS Corral
Shootout at the PAAS Corral
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
 
Designs, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed SystemsDesigns, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed Systems
 
Sanger HPC infrastructure Report (2007)
Sanger HPC infrastructure  Report (2007)Sanger HPC infrastructure  Report (2007)
Sanger HPC infrastructure Report (2007)
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data Deduplication
 
Measuring Database Performance on Bare Metal AWS Instances
Measuring Database Performance on Bare Metal AWS InstancesMeasuring Database Performance on Bare Metal AWS Instances
Measuring Database Performance on Bare Metal AWS Instances
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
 
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
 
Nimble Storage Series A presentation 2007
Nimble Storage Series A presentation 2007Nimble Storage Series A presentation 2007
Nimble Storage Series A presentation 2007
 
High Frequency Trading and NoSQL database
High Frequency Trading and NoSQL databaseHigh Frequency Trading and NoSQL database
High Frequency Trading and NoSQL database
 
Rocking mongo db on the cloud
Rocking mongo db on the cloudRocking mongo db on the cloud
Rocking mongo db on the cloud
 

More from MySQLConference

Memcached Functions For My Sql Seemless Caching In My Sql
Memcached Functions For My Sql Seemless Caching In My SqlMemcached Functions For My Sql Seemless Caching In My Sql
Memcached Functions For My Sql Seemless Caching In My Sql
MySQLConference
 
Using Open Source Bi In The Real World
Using Open Source Bi In The Real WorldUsing Open Source Bi In The Real World
Using Open Source Bi In The Real World
MySQLConference
 
Partitioning Under The Hood
Partitioning Under The HoodPartitioning Under The Hood
Partitioning Under The Hood
MySQLConference
 
Tricks And Tradeoffs Of Deploying My Sql Clusters In The Cloud
Tricks And Tradeoffs Of Deploying My Sql Clusters In The CloudTricks And Tradeoffs Of Deploying My Sql Clusters In The Cloud
Tricks And Tradeoffs Of Deploying My Sql Clusters In The Cloud
MySQLConference
 
D Trace Support In My Sql Guide To Solving Reallife Performance Problems
D Trace Support In My Sql Guide To Solving Reallife Performance ProblemsD Trace Support In My Sql Guide To Solving Reallife Performance Problems
D Trace Support In My Sql Guide To Solving Reallife Performance Problems
MySQLConference
 
Writing Efficient Java Applications For My Sql Cluster Using Ndbj
Writing Efficient Java Applications For My Sql Cluster Using NdbjWriting Efficient Java Applications For My Sql Cluster Using Ndbj
Writing Efficient Java Applications For My Sql Cluster Using Ndbj
MySQLConference
 
Inno Db Performance And Usability Patches
Inno Db Performance And Usability PatchesInno Db Performance And Usability Patches
Inno Db Performance And Usability Patches
MySQLConference
 
My Sql And Search At Craigslist
My Sql And Search At CraigslistMy Sql And Search At Craigslist
My Sql And Search At Craigslist
MySQLConference
 
Solving Common Sql Problems With The Seq Engine
Solving Common Sql Problems With The Seq EngineSolving Common Sql Problems With The Seq Engine
Solving Common Sql Problems With The Seq Engine
MySQLConference
 
Using Continuous Etl With Real Time Queries To Eliminate My Sql Bottlenecks
Using Continuous Etl With Real Time Queries To Eliminate My Sql BottlenecksUsing Continuous Etl With Real Time Queries To Eliminate My Sql Bottlenecks
Using Continuous Etl With Real Time Queries To Eliminate My Sql Bottlenecks
MySQLConference
 
Make Your Life Easier With Maatkit
Make Your Life Easier With MaatkitMake Your Life Easier With Maatkit
Make Your Life Easier With Maatkit
MySQLConference
 
Getting The Most Out Of My Sql Enterprise Monitor 20
Getting The Most Out Of My Sql Enterprise Monitor 20Getting The Most Out Of My Sql Enterprise Monitor 20
Getting The Most Out Of My Sql Enterprise Monitor 20
MySQLConference
 
Wide Open Spaces Using My Sql As A Web Mapping Service Backend
Wide Open Spaces Using My Sql As A Web Mapping Service BackendWide Open Spaces Using My Sql As A Web Mapping Service Backend
Wide Open Spaces Using My Sql As A Web Mapping Service Backend
MySQLConference
 
Unleash The Power Of Your Data Using Open Source Business Intelligence
Unleash The Power Of Your Data Using Open Source Business IntelligenceUnleash The Power Of Your Data Using Open Source Business Intelligence
Unleash The Power Of Your Data Using Open Source Business Intelligence
MySQLConference
 
Inno Db Internals Inno Db File Formats And Source Code Structure
Inno Db Internals Inno Db File Formats And Source Code StructureInno Db Internals Inno Db File Formats And Source Code Structure
Inno Db Internals Inno Db File Formats And Source Code Structure
MySQLConference
 
My Sql High Availability With A Punch Drbd 83 And Drbd For Dolphin Express
My Sql High Availability With A Punch Drbd 83 And Drbd For Dolphin ExpressMy Sql High Availability With A Punch Drbd 83 And Drbd For Dolphin Express
My Sql High Availability With A Punch Drbd 83 And Drbd For Dolphin Express
MySQLConference
 

More from MySQLConference (16)

Memcached Functions For My Sql Seemless Caching In My Sql
Memcached Functions For My Sql Seemless Caching In My SqlMemcached Functions For My Sql Seemless Caching In My Sql
Memcached Functions For My Sql Seemless Caching In My Sql
 
Using Open Source Bi In The Real World
Using Open Source Bi In The Real WorldUsing Open Source Bi In The Real World
Using Open Source Bi In The Real World
 
Partitioning Under The Hood
Partitioning Under The HoodPartitioning Under The Hood
Partitioning Under The Hood
 
Tricks And Tradeoffs Of Deploying My Sql Clusters In The Cloud
Tricks And Tradeoffs Of Deploying My Sql Clusters In The CloudTricks And Tradeoffs Of Deploying My Sql Clusters In The Cloud
Tricks And Tradeoffs Of Deploying My Sql Clusters In The Cloud
 
D Trace Support In My Sql Guide To Solving Reallife Performance Problems
D Trace Support In My Sql Guide To Solving Reallife Performance ProblemsD Trace Support In My Sql Guide To Solving Reallife Performance Problems
D Trace Support In My Sql Guide To Solving Reallife Performance Problems
 
Writing Efficient Java Applications For My Sql Cluster Using Ndbj
Writing Efficient Java Applications For My Sql Cluster Using NdbjWriting Efficient Java Applications For My Sql Cluster Using Ndbj
Writing Efficient Java Applications For My Sql Cluster Using Ndbj
 
Inno Db Performance And Usability Patches
Inno Db Performance And Usability PatchesInno Db Performance And Usability Patches
Inno Db Performance And Usability Patches
 
My Sql And Search At Craigslist
My Sql And Search At CraigslistMy Sql And Search At Craigslist
My Sql And Search At Craigslist
 
Solving Common Sql Problems With The Seq Engine
Solving Common Sql Problems With The Seq EngineSolving Common Sql Problems With The Seq Engine
Solving Common Sql Problems With The Seq Engine
 
Using Continuous Etl With Real Time Queries To Eliminate My Sql Bottlenecks
Using Continuous Etl With Real Time Queries To Eliminate My Sql BottlenecksUsing Continuous Etl With Real Time Queries To Eliminate My Sql Bottlenecks
Using Continuous Etl With Real Time Queries To Eliminate My Sql Bottlenecks
 
Make Your Life Easier With Maatkit
Make Your Life Easier With MaatkitMake Your Life Easier With Maatkit
Make Your Life Easier With Maatkit
 
Getting The Most Out Of My Sql Enterprise Monitor 20
Getting The Most Out Of My Sql Enterprise Monitor 20Getting The Most Out Of My Sql Enterprise Monitor 20
Getting The Most Out Of My Sql Enterprise Monitor 20
 
Wide Open Spaces Using My Sql As A Web Mapping Service Backend
Wide Open Spaces Using My Sql As A Web Mapping Service BackendWide Open Spaces Using My Sql As A Web Mapping Service Backend
Wide Open Spaces Using My Sql As A Web Mapping Service Backend
 
Unleash The Power Of Your Data Using Open Source Business Intelligence
Unleash The Power Of Your Data Using Open Source Business IntelligenceUnleash The Power Of Your Data Using Open Source Business Intelligence
Unleash The Power Of Your Data Using Open Source Business Intelligence
 
Inno Db Internals Inno Db File Formats And Source Code Structure
Inno Db Internals Inno Db File Formats And Source Code StructureInno Db Internals Inno Db File Formats And Source Code Structure
Inno Db Internals Inno Db File Formats And Source Code Structure
 
My Sql High Availability With A Punch Drbd 83 And Drbd For Dolphin Express
My Sql High Availability With A Punch Drbd 83 And Drbd For Dolphin ExpressMy Sql High Availability With A Punch Drbd 83 And Drbd For Dolphin Express
My Sql High Availability With A Punch Drbd 83 And Drbd For Dolphin Express
 

Recently uploaded

Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
UXDXConf
 

Recently uploaded (20)

ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & Ireland
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 

The Smug Mug Tale

  • 2. Who are we? Premium photo & video sharing. Bootstrapped in ’02. $10M+ as of ’07. Profitable. No debt. Top 400 website. Doubling yearly.
  • 3. Our challenge Premium means “more” and “better”. Unlimited storage. Unlimited bandwidth. Big photos (48Mpix). 500M+ of them. Big video (1920x180p). Lots of photos per page. Super fast.
  • 4. Architecture overview LAMP(hp). x86 (mostly AMD) on Linux (~300 4+ core hosts?) 4 datacenters: 2 x SV, 1 x VA, 1 x SEA 2 Ops guys. :) Majority of boxes are diskless. Consume lots of cloud services (S3, EC2, etc).
  • 5. Storage Binary data (photos, video, etc): Stored in Amazon’s S3. PBs. Akamai fronts for caching and acceleration. Structured data (Database, etc): MySQL (InnoDB mostly). 4+ cores, 64GB, >2TB storage Memcached fronts for caching.
  • 6. Compute Photo & video processing / encoding: Handled in Amazon EC2. Totally autonomous scaling (SkyNet) Customer facing: Diskless web boxes (PXE boot) Scaled up *and* out MySQL Memcached ~1TB
  • 7. Secret Weapon: Akamai Super-fast CDN: Reads often already close to customer. More than just a CDN: HTML/AJAX/etc inspection for pre-fetch Anticipate requests and get data to within low ms Optimal data path to SmugMug DNS latency reduction $$$ but worth it. Get what you pay for.
  • 8. Secret Weapon: memcached Screaming fast. ~1TB of data stored. >96% hit rate Contains MySQL row data, avoid SELECTs Misc other data cached, but MySQL biggest win Fall back on MySQL for cold data
  • 9. Secret Weapon: MySQL Most important technology at SmugMug. Super dependent on replication: Performance Reliability / High Availability No MySQL data loss in >7 years. No JOINs. (Or lots of 4.x+ features, either) Vertically partitioned, not horizontally (no shards)
  • 10. Secret Weapon: InnoDB Most important technology at SmugMug. Huge thanks to Heikki, Oracle, Percona and Google! Running 1.0.3+patches in production. Big performance gains with recent releases.
  • 11. Secret Weapon: Percona Crazy concentration of talent under one roof. Best MySQL dollars we’ve ever spent. Helped us out of a major bind Have you heard of the ‘back_log’ mysqld setting? Me neither. Hope you never do. Percona had. Helped build, integrate, and test InnoDB patches.
  • 12. MySQL details We care about write latency above all. Well, ok, maybe data integrity. ;) Scaling reads “easy”: replication and memcached. Replication needs to stay current (<1 sec). MySQL concurrency problems. (Much improved!) Parallel I/O - lots of cores. Large storage (TBs). Big RAM (64GB+) to keep indexes hot.
  • 13. MySQL query details Mostly SELECT pkey FROM table WHERE index; On cache miss, SELECT * FROM table WHERE pkey; UPDATEs/DELETEs mostly on single rows by pkey Easy memcached expiration. Easy slave-delay tracking. Very denormalized. No JOINs or complex SELECTs. OLTP benchmark imperfect. Time for sysbench-web?
  • 14. MySQL Issues: Filesystems Better filesystem: CentOS Linux shop (lots of expertise). MySQL is storage intensive (iops, size, etc). ext3 old and busted. fsck, well, sucks. ext4 already old and busted. :( Want good volume management. Serialized writes (non-parallel). Ugh.
  • 15. Filesystem Solution - ZFS! Transactional. Copy-on-write. End-to-end data integrity. On-the-fly corruption detection & repair. Integrated volume management. Snapshots & clones. Open source software.
  • 16. The REAL Issue We run Linux. ZFS doesn’t run on Linux. Crap.
  • 17. MySQL Issues: Replication Unknown state on crash: Did *.info get written at commit? Or is it *2 months* out of date? Bringing TB+ slaves online quickly. Backups using LVM/ZFS a pain. Keeping up with master. Single thread for replication SQL. Master promotion cludgy.
  • 18. Replication solutions Transactional replication patches: Slave always in known state. Either ok to bring back up or CHANGE MASTER. Safe to take snapshots anytime, no effort. Safe to use innodb_flush_log_at_trx_commit=2 InnoDB only. Stopgap. Global trx IDs better. Using in pre-production. Production next week?
  • 19. Secret Weapon: Sushi Toro aka S7410. NAS storage with a few twists. 2 x Quad-Core Opteron + 64GB RAM 100MB Readzilla SSD 2 x 18GB Writezilla SSD. 20K write iops. 22 x 1TB 7200rpm HDD Clustered HA configuration.
  • 20. Mmm, Toro tastes good. ZFS on Linux! SSD is here! SSD performance is cheap! Consume via NFS, iSCSI, CIFS, HTTP, FTP, etc. Massive flexibility - no more DAS. Fishworks interface is a dream. Analytics is a game changer.
  • 21. Sushi’s quite reasonable Initial sticker shock - $80K?! $142K clustered?! No one pays list price. Whew. Startup Essentials. Double-whew. Paradigm shift. Biggest whew! DAS -> NAS So much IO, in theory, can “stack” lots of clients. In practice, can stack *lots* of clients. We now have 5 clustered configs. :)
  • 22. Sushi served fast Crazy fast. 9.6K iops, 4.5K under 43us, 8K under 166us
  • 23. Sushi served fast Scalable. 15K 4k write iops w/16 threads. Low latency. ~250us @ 3K iops, ~700us @ 10K fio write benchmark 20000 15000 4K write iops 10000 5000 0 1 2 4 8 16 32 threads
  • 24. Sushi today So fast, we’re stacking like crazy. 5 different MySQL workloads on single clustered Toro. 8 slaves on single Toro. Each used to have 15K disks + write cache. Lots of excess io and space capacity still. Compression “for free” (no client CPU usage) Crazy fast ~1.5X ratio across TBs of InnoDB
  • 25. Sushi today Backups a breeze. Automatic snapshots every n minutes / hours / days. No need to LOCK / shutdown / STOP SLAVE / etc Rollback anytime. Skip bad SQL statements. New slave? Click snapshot. Click clone. Done. Slaves share unchanged data on disk and in RAM. Future bright: clone + de-dupe = insanely efficient.
  • 26. Analyzing sushi DTrace on Linux! Never had analytics on storage before. Vendor used to say: “Um, we dunno. Buy more spindles?” Now I know all. Vendor now says: “What does Analytics say?” Drill down on everything. Correlate anything. God-like power.
  • 27. MySQL on Toro so far NFSv3 (rather than v4) 16KB record size in ZFS (InnoDB) Mirrored (RAID1+0) disks w/striped Logzilla MySQL concurrency bound - can’t use all the I/O If compressing, use LZJB. In theory, can optimize InnoDB: doublewrite = 0, checksums = 0. ZFS does these. In practice, no big gain with our workload.
  • 28. MySQL on Toro problems Replication *.info files not sync’d over NFS Found a slave with *2 month old* info files Transactional replication to the rescue! NFS locking and InnoDB Warnings on the Net. No hard data. Actively researching. What’s the problem?
  • 29. Even faster? 10GbE for reduced latency? Actively testing this. Driver tuning required. Defaults for throughput. Cards (Intel) & switches (Arista) cheap & fast Less than $500/port. Copper twinax SFP+ cables cheap. Optical XFP not. $50 vs $1000+ Toro doesn’t support SFP+ cards yet. :(
  • 30. Kitchen sink on Toro Everything runs better on Toro. :) Revision control. Stateless Linux mounts. Email. Developer home directories. Built-in, automatic replication for multi-site backups. Photo and video serving?
  • 31. The future? 100% SSD. Still too $$ for TB+ installs. Even better InnoDB. Community on fire. Oracle/MySQL accepting patches! Multi-threaded replication. Preview release is out. Yes! New storage engines PBXT, Falcon, Maria, oh my!
  • 32. Oracle wishlist MySQL is a crown jewel. Not a gateway drug to Oracle. Different customers. Kill btrfs. GPL ZFS. MySQL and InnoDB under one roof = opportunity. OpenStorage is game changer. Don’t kill it. Listen to your new communities. I’m busy. I’m up here because this is important.
  • 33. Thanks! Blog: http://blogs.smugmug.com/don Twitter: DonMacAskill Email: don@smugmug.com Percona Conference: Upstairs :)