AWS Data Migration case study: from tapes to Glacier
1. DevOps, Done Right
AWS Data Migration
A Case Study
Presented by: Farley
farley@olindata.com
2. AWS Glacier Data Migration
Client Overview
• Dutch museum of natural history
• Biodiversity research center
• Located in Leiden
• Millions of biological samples
• Open-data policy
• Researchers can request originals
3. Data Migration: Stages
Requirements
Initial contact, gathering all the information necessary for the project. May include a proposal or write-up before moving on, with fairly accurate price estimates and rough time estimates.
Implementation
Setting up whatever resources (network, technical, physical, or virtual) are necessary to perform the data migration.
Testing
Validating the setup and configuration, performing test migrations, and validating the results against the data. This step should yield a more accurate estimate of the time and cost of project completion.
Execution
Performing the data migration, with monitoring metrics available to provide health insights into the system.
Validation
During and after the migration, providing some "proof" of intact delivery, ideally using a checksum.
4. Project Overview (requirements)
• 280 TB of media data at an external tape provider
• Timeline: urgent / immediate. Aka, "yesterday"
• All data is stored as "tar" files at the tape provider
• This data is requested one media item at a time (a file within a tar)
• Some data is duplicated across these tar files (est. 10%)
• Mapping data (files to tar file) is in a MySQL database
5. Additional Information (requirements)
• We are working and coordinating with not one but two clients: the data owner and the current data provider (technically three, if you count AWS as the future data provider)
• Information gathered from the data provider: reading from their tape drives tops out at ~10 megabit/sec
• The original files have no checksum verification, but we do have file-size verification on some files
• The data provider has tar checksums on their side
• Discussed various "migration plan" scenarios with both clients
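Those two numbers (280 TB of data, ~10 megabit/sec per tape read) set the schedule. A back-of-envelope sketch, assuming decimal terabytes and a single sustained stream; the `parallel_streams` knob is hypothetical, for exploring multi-drive scenarios:

```python
def transfer_days(total_bytes, rate_bits_per_sec, parallel_streams=1):
    """Days needed to move total_bytes at a sustained line rate."""
    bytes_per_sec = (rate_bits_per_sec / 8) * parallel_streams
    return total_bytes / bytes_per_sec / 86400

# 280 TB read from tape at ~10 megabit/sec, one stream:
days = transfer_days(280e12, 10e6)  # roughly 2,600 days
```

A single serial read clearly cannot meet an "urgent" timeline at that rate, which is why ingesting as fast as the tape can read, and parallelizing wherever the provider allows, matters so much.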
6. Migration Plans
(diagram; each plan ends with the final / verified data in an S3 bucket)
• Plan #1: an EC2 instance pulls from an FTP server hosted by the current data provider, then handles data processing & deduplication
• Plan #2: the current data provider pushes to an FTP server hosted by the future data provider; an EC2 instance handles data polling, pulling, processing & deduplication
• Plan #3: an EC2 instance handles data processing & deduplication
7. Implementation Requirements (with their constraints)
• Ingest data as fast as the tape can read, meaning:
• Receiving of tar data (disk/network)
• Tar-file verification (disk)
• Extracting data to individual files (disk/cpu)
• De-duplication and file verification (disk/cpu)
• Pushing data to the S3 bucket (disk/network)
• Removal of files and tar
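The extract and de-duplicate steps above can be sketched as follows. This is a minimal illustration, assuming duplicates are detected by content hash; the helper names (`sha256_file`, `extract_and_dedupe`) and the exact dedupe rule are mine, not from the deck:

```python
import hashlib
import os
import tarfile

def sha256_file(path, chunk=1 << 20):
    """Stream a file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def extract_and_dedupe(tar_path, out_dir, seen_hashes):
    """Extract a tar, dropping files whose content was already ingested.

    Returns the paths of newly extracted (unique) files; seen_hashes is
    mutated in place so the index carries over across tars.
    """
    with tarfile.open(tar_path) as tf:
        tf.extractall(out_dir)
        members = [m.name for m in tf.getmembers() if m.isfile()]
    new_files = []
    for name in members:
        path = os.path.join(out_dir, name)
        digest = sha256_file(path)
        if digest in seen_hashes:
            os.remove(path)  # duplicate content (~10% expected), discard
        else:
            seen_hashes.add(digest)
            new_files.append(path)
    return new_files
```

Hashing every extracted file also gives each object a checksum for free, which feeds directly into the later verification story.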
8. Implementation (overview)
• An EC2 instance with enough capacity to handle all aspects of the migration
• Running an FTP server for data ingestion
• Running custom "ingestion" software to do verification, extraction, data de-duplication, and final delivery of data into S3
• Monitoring/metrics/alarms set up and configured
9. Ingestion Workflow
Streaming tape data to FTP server → data gets picked up by ingestion engine → data gets pushed to S3 bucket
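The "data gets picked up" step is essentially a polling loop over the FTP drop directory. A sketch, with one invented convention: the sender writes `<name>.tar.done` once an upload finishes so a partially written tar is never processed (the real handshake isn't described in the deck):

```python
import glob
import os

def poll_once(ftp_dir, process):
    """One polling pass: hand each fully uploaded tar to the ingestion engine.

    process(tar_path) is the ingestion callback; returns the tars handled.
    """
    handled = []
    for tar_path in sorted(glob.glob(os.path.join(ftp_dir, "*.tar"))):
        if os.path.exists(tar_path + ".done"):  # upload-complete marker (assumed)
            process(tar_path)
            handled.append(tar_path)
    return handled
```

In production a loop like this runs continuously under a daemon alongside the FTP server, which is why both show up later under "Daemons running".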
10. Implementation (details)
• Implemented the ingestion engine in Python because…
• Reliable and up-to-date AWS module (boto3)
• My knowledge of and experience with Python
• Simple and re-usable
• Works with files, databases, S3, and can run external shell scripts if necessary
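For the final delivery step, boto3 keeps the S3 side small. A sketch, assuming credentials and the bucket already exist; the key layout in `s3_key_for` is an invented convention, not the project's actual one:

```python
def s3_key_for(tar_name, member_name):
    """Map a tar member to an object key, grouped by source tar (illustrative)."""
    return "{}/{}".format(tar_name.rsplit(".tar", 1)[0], member_name.lstrip("/"))

def push_to_s3(path, bucket, key):
    """Upload one verified file. boto3 is imported lazily so the module can
    be loaded (and the key mapping tested) without AWS installed."""
    import boto3
    boto3.client("s3").upload_file(path, bucket, key)
```

Keeping the tar name in the key preserves the files-to-tar mapping from the MySQL database in the bucket layout itself.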
11. Testing
• After the server was set up and the ingestion service was running, performed a few test migrations
• Debugged and dialed in the ingestion workflow
• Dialed in which instance type to use
• Because of the extremely heavy demand for I/O, ended up using an i3.xlarge EC2 instance with 4 vCPUs, 30 GB RAM, and 1 TB of NVMe instance store
• This server is effectively only a "buffer" anyway
12. Execution
• Coordinate with all teams/clients
• If your ingestion workload may exceed some AWS service limits (API limits, service limits, bucket limits, etc.), contact AWS ahead of time to have them increase your limits. E.g., if using an HA setup via ELB, ask AWS to pre-warm it
• Have monitoring in place to keep an eye on it, especially if it is running 24/7
13. Monitoring
• Disk usage (root and instance store)
• Memory usage
• CPU usage
• Network usage
• Daemons running (FTP & ingestion)
• Interface / visualization (DEMO coming…)
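As one illustration of the disk metric, the gather-and-ship path might look like this; `shutil.disk_usage` is standard library, while the CloudWatch namespace, metric name, and mount path are all assumptions (the deck doesn't name the actual tooling behind the visualization):

```python
import shutil

def buffer_metrics(mount):
    """Used/free space on the buffer volume we alarm on."""
    usage = shutil.disk_usage(mount)
    return {
        "disk_used_pct": 100.0 * usage.used / usage.total,
        "disk_free_bytes": usage.free,
    }

def publish(metrics, namespace="Ingestion"):
    """Ship the metric to CloudWatch (names here are illustrative)."""
    import boto3  # lazy import: metrics gathering works without AWS
    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace,
        MetricData=[{
            "MetricName": "DiskUsedPercent",
            "Value": metrics["disk_used_pct"],
            "Unit": "Percent",
        }],
    )
```

Disk is the metric that matters most here: because the instance is only a buffer, a full NVMe volume stalls the whole pipeline.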
14. Data Verification
• If you recall, they had no checksums, only file sizes on some files
• Had to think outside the box…
• Came up with a solution: image-comparison analysis against the thumbnails in their reference library. Demo…
• Additionally, after the migration was complete, had logs of every file placed in S3
• As an extra verification step, performed a HeadObject call on every file we expected to be in Glacier, and delivered that as part of the completion report
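The HeadObject pass can be structured so the loop is testable without AWS: wrap the boto3 call in a `head(key)` callable that raises `KeyError` for a missing object. The wrapper shape is my choice; the deck only says a HeadObject was performed per expected file:

```python
def verify_objects(expected_keys, head):
    """Call head(key) for every expected key; return the keys that are missing."""
    missing = []
    for key in expected_keys:
        try:
            head(key)
        except KeyError:
            missing.append(key)
    return missing

def s3_head(bucket):
    """Build a head() callable backed by boto3's HeadObject."""
    import boto3
    import botocore.exceptions
    client = boto3.client("s3")

    def head(key):
        try:
            client.head_object(Bucket=bucket, Key=key)
        except botocore.exceptions.ClientError as err:
            if err.response["Error"]["Code"] == "404":
                raise KeyError(key)
            raise
    return head
```

An empty `missing` list is exactly the "every expected file is present" statement that went into the completion report.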
16. DevOps, Done Right
Thanks!
Questions?
Ask them now, or…
farley@olindata.com
All trademarks, service marks, trade names, trade dress, product names and logos appearing
on this presentation are the property of their respective owners. All rights reserved.