by Chris Proto, DevOps Engineer, Craftsy
Craftsy is the leading online destination for passionate makers to learn, create, and share. With online classes, popular supplies, and indie patterns, over ten million creative enthusiasts are taking their skills to new heights. By working with AWS data transfer and storage services, Craftsy was able to bounce back from a massive storage outage that impacted numerous teams, minimizing downtime and speeding up the data restore roughly three-fold. Learn more by attending this session.
3. About Craftsy
● eLearning & Entertainment
● Passionate Hobbyists
● Video, Video, Video
…with a dash of eCommerce
4. AWS @ Craftsy
● EC2: Craftsy.com, API, Unlimited
● SQS/SNS: API Support - Microservice messages
● VPC: Networking + Routing - very little Public IP space
● RDS + Redshift: Application Data, Data Warehouse
10. File Creation
● ~ 400GB of media files created per day
● 2 locations - production studio (Taxi) and Main office
● 3x daily media file transfer from Taxi storage (local SAN) to Main SAN through the Digital Asset Management system
● [Diagram: Taxi studio connected to the Main office over a 1Gb/s point-to-point fiber link]
11. Main SAN
● 233TB storage for media files
● Dual 8Gb Fiber Channel to clients for high-speed video editing
● Restoration of cold projects (AWS Glacier) to hot storage for repurposing of content
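For context, a minimal sketch of what such a cold-to-hot restore can look like with the AWS CLI, assuming the archives sit in S3 under a Glacier storage class (the deck doesn't say whether it was S3 lifecycle-to-Glacier or Glacier vaults); the key name here is a placeholder:

# Ask S3 to temporarily restore a Glacier-archived object (placeholder key)
aws s3api restore-object \
  --bucket backupBucket \
  --key 2016/03/cold_project_master.mov \
  --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}}'

# Once the restore completes (hours for the Standard tier), copy it back to hot storage
aws s3 cp s3://backupBucket/2016/03/cold_project_master.mov \
  /mnt/sanVolume/appFolder/production/media/2016/03/cold_project_master.mov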
12. Backup Solution
● Daily (cron) AWS S3 sync from a Linux client
● Email on failures
● Replication across regions from us-east-1 → us-west-1
● “Archives” → AWS Glacier

aws s3 sync --exclude "*.pek" --exclude "*.ims" --exclude "*.qtindex" \
  /mnt/sanVolume/ s3://backupBucket/
#scrappy
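A minimal sketch of what the cron wrapper behind this could look like, assuming a mail/mailx command is available for the failure notifications; the log path and alert address are placeholders, not values from the deck:

#!/usr/bin/env bash
# Hypothetical daily backup wrapper: sync the SAN to S3 and email on failure
set -euo pipefail

LOG=/var/log/san-backup.log          # placeholder log path
ALERT_EMAIL=devops@example.com       # placeholder address

if ! aws s3 sync \
      --exclude "*.pek" --exclude "*.ims" --exclude "*.qtindex" \
      /mnt/sanVolume/ s3://backupBucket/ >"$LOG" 2>&1; then
  mail -s "SAN -> S3 backup FAILED on $(hostname)" "$ALERT_EMAIL" < "$LOG"
fi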
13. Backup Solution
● “I deleted something. Can you get it back?”

aws s3 cp s3://backupBucket/cable_knitting_01.mp4 \
  /mnt/sanVolume/cable_knitting_01.mp4
14. September 14th: A Day Which Will Live In Infamy
● External vendor application has RW access to SAN files through an AD service account
● Manages the media file lifecycle from asset tagging, ingesting, and archiving
● Different folder location on the SAN for STAGE vs PROD
● Same permissions across environments, for speed and ease of implementation (Mistake #1, me)
● Application upgrade process is very manual (Mistake #2, them)
15. How It Went Down
● STAGE needed to be downgraded, to match production and test the upgrade path
● Downgrading STAGE required ‘data sanitation’, which I was unaware of due to other constraints; the PM was handling the upgrade and we miscommunicated
● ‘Data sanitation’ was a database change as well as media file removal from the filesystem. “Starting out clean” was the idea...
● What Linux Command Do We All Make Fun Of?
16.
17. ● They who shall not be named LITERALLY ran the command and walked away for the day, to let the sanitation happen overnight
● Editor complains of files going offline in his editing program
● VNC into an OSX client and begin watching files disappear before my eyes, month folders at a time (01, 02, 03, etc.)
● SSH to the application server, notice the service account logged in, view history:

  975  psql someCommand
  976  cd /mnt/sanVolume/appFolder/production/media
  977  rm -rf *
18. ● I immediately shut down the file system to prevent further deletion, but it was too late
● 127TB of media files deleted (but I saved 611MB!!)
● All content from Aug 2015 - Sept 2017
● Hot, cold and warm files
● RAW and edited mp4, mov, psd, jpg, CR2
● Workforce affected --
○ 12 video editors and motion graphics artists can’t work
○ 8 producers can’t review edits
○ 7 marketers can’t find high-res marketing images
19. Where To Even Start?? AWS CLI
● Sync most recent ingests, from multiple client machines to maximize the 1Gb/s pipe, utilizing multi-part downloading (throughput-tuning sketch after this slide)

Linux client:
aws s3 sync s3://backupBucket/2017/09/ \
  /mnt/sanVolume/appFolder/production/media/2017/09/

Linux client:
aws s3 sync s3://backupBucket/2017/08/ \
  /mnt/sanVolume/appFolder/production/media/2017/08/

OSX client:
aws s3 sync s3://backupBucket/2017/07/ \
  /Volumes/sanVolume/appFolder/production/media/2017/07/
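One knob worth noting for a bulk restore like this is the AWS CLI's S3 transfer configuration, which controls the multi-part parallelism per command; a hedged sketch of per-client tuning (the values here are illustrative, not the ones used during the incident):

# Raise per-command parallelism and chunk size for large media files (illustrative values)
aws configure set default.s3.max_concurrent_requests 20
aws configure set default.s3.multipart_chunksize 64MB
aws configure set default.s3.max_queue_size 10000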
20. Order Some Snowballs!
● Monday, Sept 18th ordered (4) 50TB Snowballs
● Quantity was due to quickest data load time & folder structure
● First Snowball arrived September 22nd
● Setup of the 1st Snowball and the start of the copy took < 1 hour

snowball cp --recursive s3://backupBucket/2017/ \
  /mnt/sanVolume/appFolder/production/media/2017/

snowball cp --recursive s3://backupBucket/2016/ \
  /mnt/sanVolume/appFolder/production/media/2016/

(run from the Linux and OSX clients)
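For context, getting a Snowball copying that quickly roughly follows the flow below with the legacy Snowball client; the IP address, manifest path, and unlock code are placeholders for the job-specific values pulled from the AWS console:

# Unlock the appliance with the job manifest and unlock code from the console (placeholders)
snowball start -i 192.168.1.100 -m ~/job-manifest.bin -u 01234-abcde-01234-abcde-01234

# Then copy data off the export Snowball onto the SAN
snowball cp --recursive s3://backupBucket/2017/ \
  /mnt/sanVolume/appFolder/production/media/2017/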
21. Restoration approach
● Multi-machine transfer
○ 1 Linux client, 2 OSX clients
● Most-needed footage first - meet with editors to find the month folders that are most active for their upcoming projects (sketch of a prioritized restore after this slide)
● Didn’t have the time/foresight at the beginning of the process to come up with the quickest ordering of the restoration, because of the year-by-year folder structure
● Full restoration completed on Monday, October 2nd
● All 127TB accounted for!
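A minimal sketch of what that prioritized restore could look like on one client; the priority list below is purely illustrative, the real one came out of the meetings with the editors:

#!/usr/bin/env bash
# Hypothetical prioritized restore: pull the editors' most-needed month folders first
PRIORITY_MONTHS="2017/09 2017/08 2017/06 2016/11"   # illustrative order

for month in $PRIORITY_MONTHS; do
  aws s3 sync "s3://backupBucket/${month}/" \
    "/mnt/sanVolume/appFolder/production/media/${month}/"
done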
22. Learnings
● In DR situations -- order Snowballs right away.
● Special-character file names caused a couple of hiccups that required manual intervention (pre-flight check sketch after this slide)
● Duplicate files -- come up with a good structure before ordering.
○ 2015 -- 4 months of files
○ 2016 -- 12 months of files
○ 2017 -- 9 months of files
■ Created a duplicate issue where we had similar files on multiple Snowballs. Easy to fix, but confusing when calculating expected completion times
● The obvious ones - separate service accounts per environment with appropriately scoped permissions, better communication around the upgrade path, etc.
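A hedged sketch of the kind of pre-flight check that would have flagged the problem filenames before the transfers; the "safe" character set screened for here is an assumption, not the one that actually tripped us up:

# Flag file names containing characters that tend to trip up bulk transfer tools
# (the allowed character set below is an assumption)
find /mnt/sanVolume/appFolder/production/media/ -name '*[!A-Za-z0-9._ -]*' -print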
23. Conclusion
● Estimated time with only AWS sync restoration - ~23 days
● Actual time with AWS sync + Snowball - 15 days (4-day delay BEFORE we decided to order Snowballs, could’ve been FASTER)
● Video production time lost - 5 hours (does not include DevOps + PM time to prioritize footage)
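As a rough back-of-envelope check on the ~23-day estimate (my arithmetic, not from the deck, and assuming the 1Gb/s link is the bottleneck):

127 TB ≈ 127 × 8 × 10^12 bits ≈ 1.02 × 10^15 bits
1.02 × 10^15 bits ÷ 10^9 bits/s ≈ 1.02 × 10^6 s ≈ 11.8 days at full line rate
≈ 23 days at roughly 50% effective throughput over a shared office link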
24.
25. Thank You
Chris Proto - @cproto
Justin Lang - @FilmHooligan (absent)
Craftsy - @beCraftsy