DevOps, Done Right
AWS Data Migration

A Case Study
Presented by: Farley
farley@olindata.com

AWS Glacier Data Migration

Client Overview
• Dutch museum of natural history
• Biodiversity research center
• Located in Leiden
• Millions of biological samples
• Open-data policy
• Researchers can request originals

Data Migration - Stages
Requirements
Initial contact and gathering of all the information necessary for the project. May include a proposal or write-up before moving on, with a fairly accurate price estimate and a rough time estimate.
Implementation
Setting up whatever resources (network, technical, physical, virtual) are necessary to perform the data migration.
Testing
Validating the setup and configuration, performing test migrations, and validating the results against the data. This step should yield a more accurate estimate of the time and cost to complete the project.
Execution
Performing the data migration, with monitoring metrics available to provide insight into the health of the system.
Validation
During and after the migration, providing some “proof” of intact delivery, ideally using a checksum.
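The checksum-based “proof” from the Validation stage can be sketched as follows. This is a minimal illustration, not the tooling used in the project: SHA-256 is an assumed choice of algorithm, and `sha256_of_file` and `verify_delivery` are hypothetical helper names.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks so that
    large media files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_delivery(path, expected_hex):
    """True when the file at `path` matches the checksum recorded at the source."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Recording the digest before the transfer and calling `verify_delivery` after it gives a per-file pass/fail result that can go into a completion report.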

Project Overview
(requirements)
• 280TB of media data held at an external tape provider
• Timeline: urgent / immediate. Aka “yesterday”
• All data is stored in “tar” files at the tape provider
• The data is requested one media file at a time (a file within a tar)
• Some data is duplicated across these tar files (est. 10%)
• Mapping data (files to tar file) is in a MySQL database

Additional Information
(requirements)
• We are working and coordinating with not one but two clients: the data owner and the current data provider (technically three, if you count AWS as the future data provider)
• Per information gathered from the data provider, reading from their tape drives runs at up to ~10 megabit/sec
• We have no checksum verification for the original files, but we do have file-size verification for some files
• The data provider has tar checksums on their side
• Discussed various “migration plan” scenarios with both clients
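The read rate and the data volume together bound how long the migration can take. A rough back-of-the-envelope sketch, assuming decimal units (1 TB = 10^12 bytes, 1 megabit = 10^6 bits) and a hypothetical `streams` parameter for tape reads running in parallel (the slides do not say how many there were):

```python
def estimate_transfer_days(size_tb, rate_mbit_per_s, streams=1):
    """Estimate transfer time in days for `size_tb` terabytes read at
    `rate_mbit_per_s` megabit/sec per stream over `streams` parallel streams."""
    if size_tb < 0 or rate_mbit_per_s <= 0 or streams <= 0:
        raise ValueError("size must be non-negative; rate and streams must be positive")
    total_bits = size_tb * 1e12 * 8                    # terabytes -> bits
    bits_per_second = rate_mbit_per_s * 1e6 * streams  # aggregate read rate
    return total_bits / bits_per_second / 86400        # seconds -> days
```

With the figures quoted above and a single stream, `estimate_transfer_days(280, 10)` comes to roughly 2,593 days, so the effective read rate and the number of parallel streams dominate the schedule.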

Migration Plans
[Diagram of three candidate plans (Plan #1, Plan #2, Plan #3). Recoverable labels:]
• ftp server hosted by current data provider
• data pulling, processing & deduplication
• data polling, pulling, processing & deduplication
• ftp server hosted by future data provider, pushed to by current data provider
• data processing & deduplication
• ec2 (label appears three times)
• s3: final / verified data in s3 bucket

Implementation Requirements
(with their constraints)
• Ingest data as fast as the tape can be read, meaning…
• Receiving tar data (disk/network)
• Tar-file verification (disk)
• Extracting data to individual files (disk/cpu)
• De-duplication and file verification (disk/cpu)
• Pushing data to the S3 bucket (disk/network)
• Removal of files and tar
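The verification, extraction and de-duplication steps listed above can be sketched with the Python standard library. This is an illustration under assumptions, not the project's ingestion engine: de-duplicating by SHA-256 of the file contents is an assumed technique (the slides do not say how duplicates were detected), and `process_tar`, `seen_hashes` and `expected_tar_sha256` are hypothetical names.

```python
import hashlib
import tarfile
from pathlib import Path

CHUNK = 1024 * 1024


def sha256_of_file(path):
    """SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def process_tar(tar_path, work_dir, seen_hashes, expected_tar_sha256=None):
    """Verify one tar file, extract its regular files into work_dir, and
    drop any file whose content hash is already in seen_hashes.
    Returns the paths of the files that are new."""
    tar_path, work_dir = Path(tar_path), Path(work_dir).resolve()
    # Tar-file verification against the provider's checksum, when available.
    if expected_tar_sha256 is not None:
        if sha256_of_file(tar_path) != expected_tar_sha256:
            raise ValueError(f"checksum mismatch for {tar_path.name}")

    new_files = []
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            if not member.isfile():
                continue
            target = (work_dir / member.name).resolve()
            # Refuse member names that would escape the working directory.
            if work_dir not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
            target.parent.mkdir(parents=True, exist_ok=True)
            digest = hashlib.sha256()
            source = archive.extractfile(member)
            # Hash while extracting so each file is read only once.
            with source, open(target, "wb") as out:
                for chunk in iter(lambda: source.read(CHUNK), b""):
                    digest.update(chunk)
                    out.write(chunk)
            content_hash = digest.hexdigest()
            if content_hash in seen_hashes:
                target.unlink()  # duplicate content: discard the copy
            else:
                seen_hashes.add(content_hash)
                new_files.append(target)
    return new_files
```

Hashing during extraction keeps disk reads to one pass per file, which matters when disk I/O is the constraint on most steps.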

Implementation
(overview)
• EC2 instance with enough capacity to handle all aspects of the migration
• Running an FTP server for data ingestion
• Running custom “ingestion” software to do verification, extraction, data de-duplication, and final delivery of data into S3
• Monitoring/metrics/alarms set up and configured

Ingestion Workflow
1. Streaming tape data to FTP server
2. Data gets picked up by ingestion engine
3. Data gets pushed to S3 bucket
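The “picked up by ingestion engine” step needs some way to tell a finished upload from one still in progress. One common approach, sketched here as an assumption rather than the method used in the project, is to treat a tar file as ready once it has not been modified for a while. `find_ready_tars`, the `*.tar` pattern and `min_idle_seconds` are all hypothetical.

```python
import time
from pathlib import Path


def find_ready_tars(incoming_dir, min_idle_seconds=60, now=None):
    """Return tar files in incoming_dir whose last modification is at least
    min_idle_seconds old, i.e. the FTP upload has most likely finished."""
    now = time.time() if now is None else now
    ready = []
    for path in sorted(Path(incoming_dir).glob("*.tar")):
        if now - path.stat().st_mtime >= min_idle_seconds:
            ready.append(path)
    return ready
```

An ingestion loop would call this periodically and hand each returned path to the verification and extraction steps.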

Implementation
(details)
• Implemented the ingestion engine in Python because…
• Reliable and up-to-date AWS module (boto3)
• My knowledge of and experience with Python
• Simple and re-usable
• Can work with files, databases and S3, and run external shell scripts if necessary
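A minimal sketch of the final delivery step with boto3, under stated assumptions. The function takes an S3 client as a parameter (for example one created with `boto3.client("s3")`) and calls its `upload_file` method. `push_to_s3`, `storage_class` and `remove_after` are hypothetical names, and the slides do not say whether objects were written directly to a Glacier storage class or moved there later.

```python
from pathlib import Path


def push_to_s3(s3_client, local_path, bucket, key,
               storage_class=None, remove_after=False):
    """Upload one file to S3 and optionally delete the local copy.

    storage_class, when given, is passed through ExtraArgs (for example
    "GLACIER"); leaving it as None uses the bucket's default class."""
    extra_args = {"StorageClass": storage_class} if storage_class else None
    s3_client.upload_file(str(local_path), bucket, key, ExtraArgs=extra_args)
    if remove_after:
        # The server is only a buffer, so free the disk once the upload succeeds.
        Path(local_path).unlink()
    return f"s3://{bucket}/{key}"
```

Passing the client in keeps the function easy to exercise with a stub during test migrations, and `upload_file` raises on failure, so the local copy is only removed after a successful upload.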

Testing
• After the server was set up and the ingestion service was running, performed a few test migrations
• Debugged and dialed in the ingestion workflow
• Dialed in which instance type to use
• Because of the extremely heavy demand for I/O, ended up using an i3.xlarge EC2 instance with 4 vCPUs, 30GB RAM, and 1TB of instance-store NVMe
• This server is effectively only a “buffer” anyway

Execution
• Coordinate with all teams/clients
• If your ingestion workload may exceed any AWS limits (API limits, service limits, bucket limits, etc.), contact AWS ahead of time to have them raised. E.g., if using an HA setup via ELB, ask AWS to pre-warm it
• Have monitoring in place to keep an eye on the migration, especially if it is running 24/7

Monitoring
• Disk Usage (root and instance store)
• Memory Usage
• CPU Usage
• Network Usage
• Daemons Running (FTP & Ingestion)
• Interface / Visualization (DEMO coming…)
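As an illustration of the disk-usage check only, here is a small standard-library sketch. The slides do not name the monitoring stack that was used, so `collect_alarms`, the mount-point list and the 85% threshold are assumptions.

```python
import shutil


def disk_used_percent(path):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def collect_alarms(mount_points, disk_limit_percent=85.0):
    """Return one alarm message per mount point at or above the limit."""
    alarms = []
    for mount in mount_points:
        used = disk_used_percent(mount)
        if used >= disk_limit_percent:
            alarms.append(f"{mount}: disk {used:.1f}% used")
    return alarms
```

On a buffer server, a full instance store stalls ingestion, so a check like `collect_alarms(["/", "/mnt/nvme"])` (paths hypothetical) would be run on a short interval.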

Data Verification
• As mentioned earlier, the original files had no checksums, only file sizes for some files
• Had to think outside the box…
• Came up with a solution that runs image-comparison analysis against the thumbnails in their reference library. Demo…
• Additionally, after the migration was complete, we had logs of every file placed in S3
• As an extra verification step, performed a headObject on every file we expected to be in Glacier, and delivered the results as part of the completion report
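The headObject pass can be sketched as below. It assumes a boto3 S3 client is passed in and that the expected keys, with file sizes where known, come from the ingestion logs. `verify_objects` and the report layout are hypothetical, not the format of the actual completion report.

```python
def verify_objects(s3_client, bucket, expected_sizes):
    """Check every expected key with head_object.

    expected_sizes maps object key -> size in bytes, or None when the
    original size is unknown. Returns one report row per key."""
    report = []
    for key, expected_size in sorted(expected_sizes.items()):
        try:
            head = s3_client.head_object(Bucket=bucket, Key=key)
        except Exception as error:
            # botocore's ClientError carries the error code in `.response`.
            code = getattr(error, "response", {}).get("Error", {}).get("Code")
            if code in ("404", "NoSuchKey", "NotFound"):
                report.append({"key": key, "status": "missing"})
                continue
            raise
        actual_size = head["ContentLength"]
        if expected_size is not None and actual_size != expected_size:
            status = "size_mismatch"
        else:
            status = "ok"
        report.append({
            "key": key,
            "status": status,
            "size": actual_size,
            # S3 omits StorageClass in the response for STANDARD objects.
            "storage_class": head.get("StorageClass", "STANDARD"),
        })
    return report
```

Rows with a status other than `ok` are the ones to investigate before sign-off; the `storage_class` column shows whether each object is in the expected class.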
Demo(s)
DevOps, Done Right
Thanks!
Questions?
Ask them now, or…
farley@olindata.com
All trademarks, service marks, trade names, trade dress, product names and logos appearing
on this presentation are the property of their respective owners. All rights reserved.

Bluetooth Controlled Car with Arduino.pdfngoud9212
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Recently uploaded (20)

Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

AWS Data Migration case study: from tapes to Glacier

  • 1. DevOps, Done Right. AWS Data Migration: A Case Study. Presented by: Farley (farley@olindata.com)
  • 2. AWS Glacier Data Migration: Client Overview
      • Dutch museum of natural history
      • Biodiversity research center
      • Located in Leiden
      • Millions of biological samples
      • Open-data policy
      • Researchers can request originals
  • 3. Data Migration - Stages
      • Requirements: Initial contact, gathering all information necessary for the project. May include a proposal or write-up before moving on, with fairly accurate estimates of price and rough time estimates.
      • Implementation: Setting up whatever resources (network, technical, physical, virtual) are necessary to perform the data migration.
      • Testing: Validating the setup and configuration, performing test migrations, validating the results with data. This step should result in a more accurate estimate of the time and cost of project completion.
      • Execution: Performing the data migration, with monitoring metrics available to provide health/insights into the system.
      • Validation: During/after the migration, provide some “proof” of intact delivery, ideally using a checksum.
  • 4. Project Overview (requirements)
      • 280TB of media data at an external tape provider
      • Timeline: Urgent / Immediate. Aka, “Yesterday”
      • All data is in “tar” files at the tape provider
      • This data is requested one media file at a time (a file within a tar)
      • Some data is duplicated in these tar files (est. 10%)
      • Mapping data (files to tar file) is in a MySQL database
  • 5. Additional Information (requirements)
      • We are working and coordinating with not one, but two clients: the data owner and the current data provider (technically three, if you count AWS as the future data provider)
      • According to the data provider, reading from their tape drives runs at up to ~10 megabit/sec
      • We have no checksums for the original files, but we do have file sizes for some of them
      • The data provider has tar checksums on their side
      • Discussed various “migration plan” scenarios with both clients
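A rough transfer-time estimate follows from the two figures quoted in the requirements (280TB of data, tape reads of up to ~10 megabit/sec). The sketch below is only back-of-the-envelope arithmetic: the slides do not say whether the quoted rate is per drive or aggregate, nor how many drives can read in parallel, so the `streams` parameter is a hypothetical knob, and decimal terabytes are assumed.

```python
def transfer_days(total_bytes: float, rate_bits_per_s: float, streams: int = 1) -> float:
    """Days needed to move total_bytes when each stream sustains rate_bits_per_s."""
    if rate_bits_per_s <= 0 or streams < 1:
        raise ValueError("rate must be positive and streams must be at least 1")
    seconds = total_bytes * 8 / (rate_bits_per_s * streams)
    return seconds / 86_400  # seconds per day


# Figures from the slides; units are assumptions (decimal TB, decimal megabit).
TOTAL_BYTES = 280 * 10**12
RATE_BITS_PER_S = 10 * 10**6

# The number of parallel streams is hypothetical, not stated in the source.
for streams in (1, 10, 100):
    days = transfer_days(TOTAL_BYTES, RATE_BITS_PER_S, streams)
    print(f"{streams:>3} stream(s): {days:,.1f} days")
```

Running the estimate for a few stream counts shows how strongly the schedule depends on read parallelism, which is one reason the Testing stage is expected to refine the time estimate.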
  • 6. Migration Plans (diagram of Plan #1, Plan #2 and Plan #3; labels as extracted)
      • ftp server hosted by current data provider
      • data pulling, processing & deduplication
      • data polling, pulling, processing & deduplication
      • ftp server hosted by future data provider, pushed to by current data provider
      • data processing & deduplication
      • ec2 (shown three times)
      • s3: final / verified data in s3 bucket
  • 7. Implementation Requirements (with their constraint)
      • Ingest data as fast as the tape can read, meaning…
      • Receiving of tar data (disk/network)
      • Tar-file verification (disk)
      • Extracting of data to individual files (disk/cpu)
      • De-duplication and file verification (disk/cpu)
      • Pushing data to S3 bucket (disk/network)
      • Removal of files and tar
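The steps on this slide (verify the tar, extract files, de-duplicate, push, clean up) can be sketched roughly as below. This is not the presenter's ingestion engine: the function names, the use of SHA-256 content hashes for de-duplication, the in-memory `seen` dictionary, and the `deliver` callback standing in for the S3 upload are all assumptions made for illustration.

```python
import hashlib
import shutil
import tarfile
from pathlib import Path
from typing import Callable, Dict, List, Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large media files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest_tar(tar_path: Path, work_dir: Path, seen: Dict[str, str],
               deliver: Callable[[Path, str], None],
               expected_sha256: Optional[str] = None) -> List[str]:
    """Verify, extract, de-duplicate and deliver one tar, then remove it."""
    # Tar-file verification (only possible when a checksum was supplied).
    if expected_sha256 is not None and sha256_of(tar_path) != expected_sha256:
        raise ValueError(f"checksum mismatch for {tar_path}")
    delivered: List[str] = []
    with tarfile.open(tar_path) as archive:
        for index, member in enumerate(archive):
            if not member.isfile():
                continue
            # Copy to a name we choose, so member paths cannot escape work_dir.
            local = work_dir / f"member-{index}.tmp"
            with archive.extractfile(member) as source, local.open("wb") as target:
                shutil.copyfileobj(source, target)
            digest = sha256_of(local)
            if digest not in seen:  # de-duplication by content hash
                deliver(local, member.name)
                seen[digest] = member.name
                delivered.append(member.name)
            local.unlink()  # removal of the extracted file
    tar_path.unlink()  # removal of the tar itself
    return delivered
```

Passing the delivery step in as a callback keeps the disk-bound work testable without AWS access. In a real run `seen` would need to persist across tars, for example in a database.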
  • 8. Implementation (overview)
      • EC2 instance with enough capacity to handle all aspects of the migration
      • Running an FTP server for data ingestion
      • Running custom “ingestion” software to do verification, extraction, data de-duplication, and final delivery of data into S3
      • Monitoring/metrics/alarms set up and configured
  • 9. Ingestion Workflow
      • Streaming tape data to FTP server
      • Data gets picked up by ingestion engine
      • Data gets pushed to S3 bucket
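The slides do not say how the ingestion engine decides that a tar has finished arriving over FTP. One common approach, shown here purely as an assumption, is to poll the upload directory and treat a file as ready once its size is unchanged between two consecutive scans. The function name, the `*.tar` pattern, and the directory layout are hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def stable_files(directory: Path, seen_sizes: Dict[Path, int]) -> List[Path]:
    """Return tar files whose size has not changed since the previous scan.

    seen_sizes carries state between scans and is updated in place.
    """
    ready: List[Path] = []
    for path in sorted(Path(directory).glob("*.tar")):
        size = path.stat().st_size
        if seen_sizes.get(path) == size:
            ready.append(path)  # same size twice in a row: assume upload finished
        seen_sizes[path] = size
    return ready
```

A caller would invoke this on a timer (for example every few minutes, a value that is also an assumption) and hand each returned path to the processing step.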
  • 10. Implementation (details)
      • Implemented ingestion engine in Python because…
      • Reliable and up-to-date AWS module (boto3)
      • My knowledge and experience in Python
      • Simple and re-usable
      • Work with files, databases, S3, and run external shell scripts if necessary
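As an illustration of why boto3 keeps the delivery step short, here is a minimal sketch of pushing one extracted file to a bucket with the S3 client's `upload_file` method. The wrapper name, bucket, and key are hypothetical, and the optional `storage_class` argument is an assumption: the slides do not say whether objects were written directly with a Glacier storage class or moved there later.

```python
from pathlib import Path
from typing import Optional


def push_to_s3(client, local_path: Path, bucket: str, key: str,
               storage_class: Optional[str] = None) -> None:
    """Upload one file using a boto3 S3 client passed in by the caller."""
    # ExtraArgs is how upload_file accepts object settings such as StorageClass.
    extra_args = {"StorageClass": storage_class} if storage_class else None
    client.upload_file(str(local_path), bucket, key, ExtraArgs=extra_args)


# Example use (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   push_to_s3(boto3.client("s3"), Path("specimen.jpg"),
#              "example-migration-bucket", "media/specimen.jpg")
```

Taking the client as a parameter means the same function can be exercised with a stub during test migrations.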
  • 11. Testing
      • Once the server was set up and the ingestion service was running, performed a few test migrations
      • Debugged and dialed in the ingestion workflow
      • Dialed in what instance type to use
      • Because of the extremely heavy demand for I/O, ended up using an i3.xlarge EC2 instance with 4 vCPUs, 30GB RAM, 1TB Instance Store NVMe
      • This server is effectively only a “buffer” anyway
  • 12. Execution
      • Coordinate with all teams/clients
      • If your ingestion workload may exceed AWS service limits (API limits, service limits, bucket limits, etc.), contact AWS ahead of time to have them raised. E.g., if using an HA setup via ELB, ask AWS to pre-warm it
      • Have monitoring in place to keep an eye on it, especially if it is running 24/7
  • 13. Monitoring
      • Disk Usage (root and instance store)
      • Memory Usage
      • CPU Usage
      • Network Usage
      • Daemons Running (FTP & Ingestion)
      • Interface / Visualization (DEMO coming…)
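The slides do not name the monitoring tool used, so the following is only a standard-library sketch of two of the listed metrics (disk usage and CPU load) plus a simple threshold check. Memory, network, and daemon checks are left out, the metric names and thresholds are hypothetical, and `os.getloadavg` is available on Unix-like systems only.

```python
import os
import shutil
from typing import Dict, List


def collect_metrics(paths: List[str]) -> Dict[str, float]:
    """Gather disk usage per mount point and the 1-minute load per CPU."""
    metrics: Dict[str, float] = {}
    for path in paths:
        usage = shutil.disk_usage(path)
        metrics[f"disk_used_percent:{path}"] = 100.0 * usage.used / usage.total
    # Load average divided by CPU count gives a rough saturation figure.
    metrics["load_1min_per_cpu"] = os.getloadavg()[0] / (os.cpu_count() or 1)
    return metrics


def breached(metrics: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Names of metrics whose value exceeds the configured limit."""
    return sorted(name for name, limit in limits.items()
                  if metrics.get(name, 0.0) > limit)
```

On the buffer server, `paths` would list the root volume and the instance-store mount point, and any name returned by `breached` would feed an alarm.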
  • 14. Data Verification
      • As mentioned earlier, the client had no checksums, only file sizes for some files
      • Had to think outside the box…
      • Came up with a solution: image comparison analysis against the thumbnails in their reference library. Demo…
      • Additionally, after the migration was complete, had logs of every file placed in S3
      • As an extra verification step, performed a headObject call on every file we expected to be in Glacier, and delivered that as part of the completion report
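The final headObject pass can be sketched with the boto3 S3 client's `head_object` call, which returns object metadata (including `ContentLength`) without downloading the body. The function name, the shape of the `expected` mapping, and the report format are assumptions. The size comparison mirrors the note that file sizes were known for only some files, so an expected size of `None` checks existence only.

```python
from typing import Dict, Optional


def verify_objects(client, bucket: str,
                   expected: Dict[str, Optional[int]]) -> Dict[str, str]:
    """Return a mapping of object key to problem description; empty means all good."""
    problems: Dict[str, str] = {}
    for key, expected_size in expected.items():
        try:
            response = client.head_object(Bucket=bucket, Key=key)
        except client.exceptions.ClientError as error:
            # Typically a 404 for a missing key, or a 403 without permission.
            problems[key] = f"head_object failed: {error}"
            continue
        actual = response["ContentLength"]
        if expected_size is not None and actual != expected_size:
            problems[key] = f"size mismatch: expected {expected_size}, got {actual}"
    return problems
```

The list of expected keys would come from the ingestion logs, and an empty result is the kind of evidence that could go into a completion report.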
  • 15. Demo(s)
  • 16. DevOps, Done Right. Thanks! Questions? Ask them now, or… farley@olindata.com
      All trademarks, service marks, trade names, trade dress, product names and logos appearing on this presentation are the property of their respective owners. All rights reserved.