SmugMug’s Zero Downtime Migration to AWS
ARC312
Andrew Shieh, SmugMug Operations
shandrew @ smugmug.com
November 15, 2013

SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

SmugMug spent six years split between its datacenters and AWS. Find out how and why SmugMug went 100% AWS, migrating 30 TB of databases, hundreds of frontends, load balancing, and caches, across the US in one night with zero downtime. We show you specific techniques and processes that made our large-scale migration a resounding success: moving massive MySQL databases, testing and sizing a new AWS infrastructure, automating AWS operations, managing the risks involved in wholesale infrastructure change, and architecting for reliability in multiple AWS Availability Zones. We talk about the performance, scalability, operational, and business benefits and challenges we've seen since moving 100% to AWS. Finally, we share secrets about our favorite AWS products.


  1. SmugMug’s Zero Downtime Migration to AWS (ARC312)
     Andrew Shieh, SmugMug Operations
     shandrew @ smugmug.com
     November 15, 2013
     © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. SmugMug: Who are we?
  3. The early days of SmugMug
     • Gradual bootstrapped growth
     • Multiple self-managed datacenter cages
     • Too many servers of varying types
     • Too many disks
     • Tons of valuable skilled employee hours spent in cages
  4. Data Center Fantasy
  5. Data Center Reality
  6. Data Center Reality
  7. SmugMug <3 AWS
     • Early adopter of Amazon S3
     • Over the years, moved rendering, upload, archiving, payments, permissions, email, and more compute to AWS
     • Before mid-2012, no ultra-high performance I/O
  8. SmugMug Architecture ~2006
     • AWS: S3
     • SV: Web, DB, Image*
  9. SmugMug Architecture ~2011
     • AWS: S3, Image (upload, processing, render, video, …)
     • SV: Web, DB
  10. SmugMug Architecture - Transition
      • SV: Web, DB
      • AWS: S3, Image*, Web
      • DC: Replication DB, Direct Connect
  11. SmugMug Architecture Today
      • Ø
      • AWS: S3, Image*, Web, DB
  12. How did we get there?
  13. Our database I/O evolution: Always cutting edge
      • Started with MySQL on spinning disk RAID, max RAM
      • Moved to ZFS SSD + SSD cache + spinning disks
      • Moved to custom 24-SSD arrays
  14. hi1.4xlarge FTW
      • Our custom, obscure hardware => difficult to resolve problems, difficult to upgrade
      • hi1 overall DB I/O performance comparable to 8 x SSD RAID10
      • < 3%/yr hi1 instance failure rate!
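To put the "< 3%/yr" failure rate in context, here is a rough calculation. It assumes instance failures are independent and uses a hypothetical fleet of 20 hi1 instances; the slides do not state how many database instances were in use.

```python
ANNUAL_FAILURE_RATE = 0.03   # upper bound quoted on the slide ("< 3%/yr")
FLEET_SIZE = 20              # hypothetical; not stated in the talk

# Expected number of instance failures per year across the whole fleet.
expected_failures = FLEET_SIZE * ANNUAL_FAILURE_RATE

# Chance that at least one instance fails within a year,
# assuming failures are independent of each other.
p_any_failure = 1 - (1 - ANNUAL_FAILURE_RATE) ** FLEET_SIZE

print(f"expected failures per year: {expected_failures:.2f}")
print(f"P(at least one failure per year): {p_any_failure:.2f}")
```

Under these assumptions, a low per-instance rate still leaves a real chance of some failure each year across a fleet, so data kept on ephemeral storage needs replicas and a tested failover path.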
  15. Amazon VPC - also a big win
      • Easy mapping of internal / external network security model to AWS
  16. Zero downtime move?
  17. [image-only slide]
  18. [image-only slide]
  19. Zero Downtime Move
      • Flexibility of the AWS cloud makes a zero downtime move inexpensive. Pay for only what you use. Provision fast.
      • Plan
      • Test
      • Plan and test again
  20. Major changes post-move
      • Database storage goes from SSD to hi1.4xlarge ephemeral
      • Hardware load balancers become Elastic Load Balancing load balancers
  21. Major changes post-move
      • Database storage goes from SSD to hi1.4xlarge ephemeral
      • Hardware load balancers become ELB
      • haproxy layer 7 load/traffic directing goes from static to dynamic config
      • Web servers autoscale for each cluster
      • Membase to ElastiCache (later to Amazon EC2)
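One way to picture the move from static to dynamic haproxy config is to regenerate a backend stanza from whatever web servers are currently running. The sketch below is an illustration only, not SmugMug's tooling: the cluster name, server names, and addresses are hypothetical, and the talk does not describe how instances were discovered or how haproxy was reloaded.

```python
from typing import Iterable, Tuple


def render_backend(name: str, servers: Iterable[Tuple[str, str, int]]) -> str:
    """Render one haproxy backend stanza from (server_name, ip, port) tuples."""
    lines = [f"backend {name}", "    balance roundrobin"]
    # Sort so the generated file is stable and diffs stay readable.
    for server_name, ip, port in sorted(servers):
        lines.append(f"    server {server_name} {ip}:{port} check")
    return "\n".join(lines) + "\n"


# Hypothetical instances currently registered in one autoscaled web cluster.
current = [("web-b", "10.0.1.12", 80), ("web-a", "10.0.1.11", 80)]
config_text = render_backend("web_cluster", current)
print(config_text)
```

Rerunning the renderer whenever the autoscaled cluster changes keeps the proxy layer in step with the fleet, which a hand-edited static file cannot do.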
  22. Zero Downtime Move Requirements
      • Read-only site mode
      • Traffic control: shadow load
      • Cross country MySQL replication + sufficient bandwidth
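A read-only site mode can be as simple as a switch that every write path checks before touching data. The following is a minimal sketch under that assumption; the `Site` class, its methods, and the in-memory storage are hypothetical and stand in for the real application, which the talk does not show.

```python
class ReadOnlyError(Exception):
    """Raised when a write is attempted while the site is read-only."""


class Site:
    def __init__(self):
        self.read_only = False
        self.captions = {}  # stand-in for the real database

    def get_caption(self, photo_id):
        # Reads are always allowed, so visitors can keep browsing.
        return self.captions.get(photo_id)

    def set_caption(self, photo_id, caption):
        # Every write path checks the flag before changing anything.
        if self.read_only:
            raise ReadOnlyError("site is in read-only mode")
        self.captions[photo_id] = caption


site = Site()
site.set_caption(1, "Sunset")
site.read_only = True          # flip the switch for the migration window
print(site.get_caption(1))     # reads still work
try:
    site.set_caption(1, "Sunrise")
except ReadOnlyError as err:
    print("write rejected:", err)
```

With writes paused this way, the databases stop changing while masters are reassigned, and the site remains available for browsing.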
  23. Zero Downtime Move Requirements
      • Read-only site mode
      • Traffic control: shadow load
      • Cross country MySQL replication + sufficient bandwidth
      • Bot testing
      • Read-only live site testing w/ QA
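Shadow load can be sketched as mirroring a sample of live requests to the new stack and discarding its responses. The code below is an assumption about how such a control might look, not a description of SmugMug's implementation; the function names and the percentage knob are hypothetical. Mirroring only read requests is also an assumption here, since replayed writes would be applied twice.

```python
import zlib


def should_shadow(request_id: str, percent: int) -> bool:
    """Deterministically pick roughly `percent`% of requests to mirror."""
    return zlib.crc32(request_id.encode("utf-8")) % 100 < percent


def handle(request_id, primary, shadow, percent):
    """Serve from the primary stack; mirror a sample to the shadow stack."""
    if should_shadow(request_id, percent):
        try:
            shadow(request_id)  # response is ignored; only the load matters
        except Exception:
            pass  # a failing shadow stack must never affect real users
    return primary(request_id)


mirrored = []
for i in range(1000):
    handle(f"req-{i}", lambda r: "ok", mirrored.append, percent=10)
print(f"mirrored {len(mirrored)} of 1000 requests")
```

Raising the percentage in steps lets the new infrastructure be sized under realistic traffic before any user depends on it.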
  24. More on moving
      • Full scale read-write testing is difficult
      • Be aware of AWS limits
      • Talk to support for big growth
      • Roll back plan - manage risky change
  25. Flipping the switch to AWS
      • “The biggest, scariest engineering change we've made in the company's history” - Don, SmugMug Chief Geek
      • Go read-only (1 min)
      • Pre-scale up big
      • MHA to reassign MySQL masters and their replication (30 min)
      • Point DNS+CDN to Elastic Load Balancing (5-30 min)
  26. Flipping the switch to AWS
      • Test! (60 min)
      • When read-only is all good, go to read-write (5 min)
      • Test! Inevitable bugs at this step (hours)
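Adding up the durations quoted for the cutover steps gives a rough length for the read-only window. This is a back-of-the-envelope sketch that assumes the steps run strictly one after another; the slides do not say whether any overlapped, and the pre-scaling step has no stated duration, so it is left out.

```python
# (step, minimum minutes, maximum minutes), taken from the slide text.
steps = [
    ("Go read-only", 1, 1),
    ("MHA reassigns MySQL masters and replication", 30, 30),
    ("Point DNS+CDN to Elastic Load Balancing", 5, 30),
    ("Test in read-only mode", 60, 60),
    ("Go read-write", 5, 5),
]

low = sum(minimum for _, minimum, _ in steps)
high = sum(maximum for _, _, maximum in steps)
print(f"read-only window: {low} to {high} minutes")
```

Under that assumption the site spends roughly 101 to 126 minutes in read-only mode. It stays up throughout, which is what makes the move zero downtime even though writes are paused.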
  27. MHA?
      • Facebook, DeNA
      • Helps to reliably reassign MySQL masters and replication, maintaining consistency
  28. MHA?
      • Manual failover in MySQL 5.5 and earlier is painful, time-consuming
      • Be careful with automation for rare events; it can bite
  29. Problems?
      • Completely redundant network links can fail
      • Bugs related to IP address change
      • ElastiCache performance
      • New Relic! Use it or a similar APM product
  30. Results
  31. Results
  32. Results
      • Data Center - performance fluctuated through day
      • AWS w/scaling - flat performance throughout the day - significant scalability limits removed
      • Networking was a key improvement
      • Success!
  33. Lessons Learned
      • We love AWS even more than before
      • Automate everything
      • Understand Amazon EBS, and understand underlying details of AWS services
      • Unpredictable Ops schedules vs. large projects
  34. Lessons Learned
      • Job #1: Making business happen
  35. We made more changes, because we could
      • As long as we’re moving our infrastructure, why not rebuild most of it too?
      • Linux, MySQL, package versions upgraded
      • New monitoring tools
      • NFS dependencies eliminated, moved to Amazon S3 or DynamoDB
      • Code pushes managed by nice distributed tools utilizing Amazon S3 + internal torrent
  36. One last thing...
      • Go Multi-availability-zone!
      • Load balancers send traffic to multiple haproxy per AZ with AZ-specific web clusters, DB replicas
      • Backed up w/ cross AZ
      • Keep SPOFs in one AZ
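The AZ-specific routing with cross-AZ backup can be sketched as "prefer healthy servers in the local zone, otherwise use the other zones". The function, zone names, and server records below are hypothetical illustrations; the talk does not show the actual haproxy rules.

```python
from typing import Dict, List


def pick_backends(local_az: str, clusters: Dict[str, List[dict]]) -> List[dict]:
    """Prefer healthy web servers in the local AZ; fall back to other AZs."""
    local = [s for s in clusters.get(local_az, []) if s["healthy"]]
    if local:
        return local
    # No healthy local server: back up with healthy servers from other AZs.
    return [
        s
        for az in sorted(clusters)
        if az != local_az
        for s in clusters[az]
        if s["healthy"]
    ]


# Hypothetical zone names and servers.
clusters = {
    "az-a": [{"name": "web-a1", "healthy": True},
             {"name": "web-a2", "healthy": False}],
    "az-b": [{"name": "web-b1", "healthy": True}],
}
print([s["name"] for s in pick_backends("az-a", clusters)])
```

Keeping traffic inside one zone in the normal case avoids cross-AZ hops, while the fallback keeps the site serving if a zone's web cluster is lost.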
  37. Questions?
      Andrew Shieh, Sunnyvale, CA
      shandrew@smugmug.com
      @shandrew
      http://www.smugmug.com/
      http://pics.shieh.info/
      Thank you!