by Chris Proto, DevOps Engineer, Craftsy
Craftsy is the leading online destination for passionate makers to learn, create, and share. With online classes, popular supplies, and indie patterns, over ten million creative enthusiasts are taking their skills to new heights. By working with AWS data transfer and storage services, Craftsy was able to bounce back from a massive storage outage that impacted numerous teams, minimizing downtime and speeding up the data restore roughly three-fold. Learn more by attending this session.
3. About Craftsy
● eLearning & Entertainment
● Passionate Hobbyists
● Video, Video, Video
…with a dash of eCommerce
4. AWS @ Craftsy
● EC2: Craftsy.com, API, Unlimited
● SQS/SNS: API Support - Microservice messages
● VPC: Networking + Routing - very little Public IP space
● RDS + Redshift: Application Data, Data Warehouse
10. File Creation
● ~ 400GB of media files created per day
● 2 locations - production studio (Taxi) and Main office
● 3x daily media file transfer from Taxi storage (local SAN) to Main SAN through the Digital Asset Management system
● [Diagram: Taxi studio connected to the Main office over a 1Gb/s point-to-point fiber link]
11. Main SAN
● 233TB storage for media files
● Dual 8Gb Fiber Channel to clients for high-speed video editing
● Restoration of cold projects (AWS Glacier) to hot storage for repurposing of content
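For context, a minimal sketch of what such a cold-to-hot restore can look like with the AWS CLI, assuming the archives sit in S3 under a Glacier storage class (the deck doesn't say whether it was S3 lifecycle-to-Glacier or Glacier vaults); the key name here is a placeholder:

# Ask S3 to temporarily restore a Glacier-archived object (placeholder key)
aws s3api restore-object \
  --bucket backupBucket \
  --key 2016/03/cold_project_master.mov \
  --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}}'

# Once the restore completes (hours for the Standard tier), copy it back to hot storage
aws s3 cp s3://backupBucket/2016/03/cold_project_master.mov \
  /mnt/sanVolume/appFolder/production/media/2016/03/cold_project_master.mov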
12. Backup Solution
● Daily (cron) AWS S3 sync from a Linux client
● Email on failures
● Replication across regions from us-east-1 → us-west-1
● “Archives” → AWS Glacier

aws s3 sync --exclude "*.pek" --exclude "*.ims" --exclude "*.qtindex" \
  /mnt/sanVolume/ s3://backupBucket/
#scrappy
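A minimal sketch of what the cron wrapper behind this could look like, assuming a mail/mailx command is available for the failure notifications; the log path and alert address are placeholders, not values from the deck:

#!/usr/bin/env bash
# Hypothetical daily backup wrapper: sync the SAN to S3 and email on failure
set -euo pipefail

LOG=/var/log/san-backup.log          # placeholder log path
ALERT_EMAIL=devops@example.com       # placeholder address

if ! aws s3 sync \
      --exclude "*.pek" --exclude "*.ims" --exclude "*.qtindex" \
      /mnt/sanVolume/ s3://backupBucket/ >"$LOG" 2>&1; then
  mail -s "SAN -> S3 backup FAILED on $(hostname)" "$ALERT_EMAIL" < "$LOG"
fi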
13. Backup Solution
● “I deleted something. Can you get it back?”

aws s3 cp s3://backupBucket/cable_knitting_01.mp4 \
  /mnt/sanVolume/cable_knitting_01.mp4
14. September 14th: A Day Which Will Live In Infamy
● External vendor application has RW access to SAN files through an AD service account
● Manages the media file lifecycle from asset tagging, ingesting, and archiving
● Different folder location on the SAN for STAGE vs PROD
● Same permissions across environments, for speed and ease of implementation (Mistake #1, me)
● Application upgrade process is very manual (Mistake #2, them)
15. How It Went Down
● STAGE needed to be downgraded, to match production and test the upgrade path
● Downgrading STAGE required ‘data sanitation’, which I was unaware of due to other constraints; the PM was handling the upgrade and we miscommunicated
● ‘Data sanitation’ was a database change as well as media file removal from the filesystem. “Starting out clean” was the idea...
● What Linux Command Do We All Make Fun Of?
16.
17. ● They who shall not be named LITERALLY ran the command and walked away for the day, to let the sanitation happen overnight
● Editor complains of files going offline in his editing program
● VNC into an OSX client and begin watching files disappear before my eyes, month folders at a time (01, 02, 03, etc.)
● SSH to the application server, notice the service account logged in, view history:

  975  psql someCommand
  976  cd /mnt/sanVolume/appFolder/production/media
  977  rm -rf *
18. ● I immediately shut down the file system to prevent further deletion, but it was too late
● 127TB of media files deleted (but I saved 611MB!!)
● All content from Aug 2015 - Sept 2017
● Hot, cold and warm files
● RAW and edited mp4, mov, psd, jpg, CR2
● Workforce affected --
○ 12 video editors and motion graphics artists can’t work
○ 8 producers can’t review edits
○ 7 marketers can’t find high-res marketing images
19. Where To Even Start?? AWS CLI
● Sync most recent ingests, from multiple client machines to maximize the 1Gb/s pipe, utilizing multi-part downloading (throughput-tuning sketch after this slide)

Linux client:
aws s3 sync s3://backupBucket/2017/09/ \
  /mnt/sanVolume/appFolder/production/media/2017/09/

Linux client:
aws s3 sync s3://backupBucket/2017/08/ \
  /mnt/sanVolume/appFolder/production/media/2017/08/

OSX client:
aws s3 sync s3://backupBucket/2017/07/ \
  /Volumes/sanVolume/appFolder/production/media/2017/07/
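One knob worth noting for a bulk restore like this is the AWS CLI's S3 transfer configuration, which controls the multi-part parallelism per command; a hedged sketch of per-client tuning (the values here are illustrative, not the ones used during the incident):

# Raise per-command parallelism and chunk size for large media files (illustrative values)
aws configure set default.s3.max_concurrent_requests 20
aws configure set default.s3.multipart_chunksize 64MB
aws configure set default.s3.max_queue_size 10000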
20. Order Some Snowballs!
● Monday, Sept 18th ordered (4) 50TB Snowballs
● Quantity was due to quickest data load time & folder structure
● First Snowball arrived September 22nd
● Setup of the 1st Snowball and the start of the copy took < 1 hour

snowball cp --recursive s3://backupBucket/2017/ \
  /mnt/sanVolume/appFolder/production/media/2017/

snowball cp --recursive s3://backupBucket/2016/ \
  /mnt/sanVolume/appFolder/production/media/2016/

(run from the Linux and OSX clients)
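For context, getting a Snowball copying that quickly roughly follows the flow below with the legacy Snowball client; the IP address, manifest path, and unlock code are placeholders for the job-specific values pulled from the AWS console:

# Unlock the appliance with the job manifest and unlock code from the console (placeholders)
snowball start -i 192.168.1.100 -m ~/job-manifest.bin -u 01234-abcde-01234-abcde-01234

# Then copy data off the export Snowball onto the SAN
snowball cp --recursive s3://backupBucket/2017/ \
  /mnt/sanVolume/appFolder/production/media/2017/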
21. Restoration approach
● Multi-machine transfer
○ 1 Linux client, 2 OSX clients
● Most-needed footage first - meet with editors to find the month folders that are most active for their upcoming projects (sketch of a prioritized restore after this slide)
● Didn’t have the time/foresight at the beginning of the process to come up with the quickest ordering of the restoration, because of the year-by-year folder structure
● Full restoration completed on Monday, October 2nd
● All 127TB accounted for!
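A minimal sketch of what that prioritized restore could look like on one client; the priority list below is purely illustrative, the real one came out of the meetings with the editors:

#!/usr/bin/env bash
# Hypothetical prioritized restore: pull the editors' most-needed month folders first
PRIORITY_MONTHS="2017/09 2017/08 2017/06 2016/11"   # illustrative order

for month in $PRIORITY_MONTHS; do
  aws s3 sync "s3://backupBucket/${month}/" \
    "/mnt/sanVolume/appFolder/production/media/${month}/"
done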
22. Learnings
● In DR situations -- order Snowballs right away.
● Special-character file names caused a couple of hiccups that required manual intervention (pre-flight check sketch after this slide)
● Duplicate files -- come up with a good structure before ordering.
○ 2015 -- 4 months of files
○ 2016 -- 12 months of files
○ 2017 -- 9 months of files
■ Created a duplicate issue where we had similar files on multiple Snowballs. Easy to fix, but confusing when calculating expected completion times
● The obvious ones - separate service accounts per environment with appropriately scoped permissions, better communication around the upgrade path, etc.
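A hedged sketch of the kind of pre-flight check that would have flagged the problem filenames before the transfers; the "safe" character set screened for here is an assumption, not the one that actually tripped us up:

# Flag file names containing characters that tend to trip up bulk transfer tools
# (the allowed character set below is an assumption)
find /mnt/sanVolume/appFolder/production/media/ -name '*[!A-Za-z0-9._ -]*' -print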
23. Conclusion
● Estimated time with only AWS sync restoration - ~23 days
● Actual time with AWS sync + Snowball - 15 days (4-day delay BEFORE we decided to order Snowballs, could’ve been FASTER)
● Video production time lost - 5 hours (does not include DevOps + PM time to prioritize footage)
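As a rough back-of-envelope check on the ~23-day estimate (my arithmetic, not from the deck, and assuming the 1Gb/s link is the bottleneck):

127 TB ≈ 127 × 8 × 10^12 bits ≈ 1.02 × 10^15 bits
1.02 × 10^15 bits ÷ 10^9 bits/s ≈ 1.02 × 10^6 s ≈ 11.8 days at full line rate
≈ 23 days at roughly 50% effective throughput over a shared office link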
24.
25. Thank You
Chris Proto - @cproto
Justin Lang - @FilmHooligan (absent)
Craftsy - @beCraftsy