Are Your Researchers Paying Too Much for Their Cloud-Based Data Backups?
Dirk Petersen, Scientific Computing Director,
Fred Hutchinson Cancer Research Center (FHCRC)
Bio-IT World 2015 1
Who are we and what do we do
What is Fred Hutch?
• Cancer & HIV research
• 3 Nobel Laureates
• $430M budget / 85% NIH funding
• Seattle campus with 13 buildings on 15 acres,
1.5+ million sq ft of facility space
Research at “The Hutch”
• 2,700 employees
• 220 Faculty, many with custom requirements
• 13 research programs
• 14 core facilities
• Conservative use of information technology
IT at “The Hutch”
• Multiple data centers with >1,000 kW capacity
• 100 staff in Center IT plus divisional IT
• Team of 3 Sysadmins to support storage
• IT funded by indirects (F&A)
• Storage Chargebacks started Nov 2014
How did we get here:
Economy File project in production in 2014
• Chargebacks drove the Hutch to embrace more economical storage
• Selected Swift object storage managed by SwiftStack
• Go-live in 2014, strong interest and expansion in 2015
• Researchers do not want to pay the price for standard enterprise storage
• Additional use cases:
– In production: Swift as a backend for Galaxy
– In progress: Swift replaces standard disk deduplication devices for backup
– Planning: Swift as backend for endpoint backup (Druva)
– Planning: Swift as backend for virtual machines (openvstorage)
– Future option: Swift as backend for Enterprise file sharing / NAS
• File System Gateway for CIFS/NFS access phased out
Phasing out of Filesystem Gateway
• Initial deployment used the SwiftStack Gateway (CIFS/NFS)
– User survey: strong preference for traditional file access
– Gateway was the easiest option to integrate into the existing authentication and
authorization process
• However, the Gateway was up to 10x slower than direct access to the API
– Users had accepted low performance in exchange for low cost
– But low performance still caused frustration and increased Ops cost
• Now we have alternatives, and better AD integration of Swift
– Gateway was non-HA, higher Ops costs
– Removing gateway allows rolling updates during business hours
– Gateway didn’t allow for full auditing of file access, but Swift does
• Users finally saw benefit of removing gateway and were willing to try
alternative tools
How chargebacks were implemented
• Custom SharePoint site for storage
chargeback processing and
allocation to grants
– Each PI can allocate a certain % of
charges to up to 3 grant budgets
– The allocation becomes the default
for the next month
– User comments positive:
“very easy to use”
• Don’t make chargebacks worse by
offering bad tools!
Chargebacks spike Swift utilization
• Started storage chargebacks
on Nov 1st
– Triggered strong growth in October
– Users sought to avoid high cost of
enterprise NAS and put as much as
possible into lower cost Swift
• Underestimated success of Swift
– Needed to stop migration to buy
more hardware
– Can migrate 30+ TB per day today
Chargebacks spike Swift utilization, cont.
• High Aggregate throughput
• Current network architecture is
an (anticipated) bottleneck
• Many parallel streams required to
max out throughput
• Ideal for HPC cluster architecture
Silicon Mechanics – Expert included.
• Commodity hardware selection
• Open source software identification
• Quality assembly process with zero defects
• On-time installation and deployment
• Design consultation for the right solution
• Focused on your real world problems
• Real people behind the product
• Support staff who know your system
Silicon Mechanics: The value of highly customizable hardware
Silicon Mechanics Storform Storage Servers
• Flexible, Configurable, Reliable
• 144TB raw capacity; 130TB usable
• No RAID controllers; no storage lost to RAID
• 36 x 4TB 3.5” Seagate SATA drives
• 2 x 120GB Intel S3700 SSDs; OS + metadata
• 10Gb Base-T connectivity
• (2) Intel Xeon E5 CPUs
• 64GB RAM
Supermicro SC847 4U chassis
Learn more at Booth #361
@ExpertIncluded
Management of OpenStack Swift using SwiftStack
• SwiftStack provides control & visibility
– Deployment automation
• Lets us roll out Swift nodes in
10 minutes
• Upgrades Swift across clusters
with 1 click
– Monitoring and stats at cluster, node,
and drive levels
– Authentication & Authorization
– Capacity & Utilization Management
via Quotas and Rate Limits
– Alerting & Diagnostics
SwiftStack Architecture Overview
SwiftStack Controller
• Deployment Automation
• Authentication Services
• Ring & Cluster Management
• Client Support
• Capacity & Utilization Mgmt.
• Monitoring, Alerting & Diagnostics
• Rolling Upgrades & 24x7 Support
SwiftStack Nodes (2 to 1000s)
• Integrations & Interfaces: end-user web UI, legacy interfaces,
authentication, utilization API, etc.
• Swift Runtime: integrated storage engine with all node components
• OpenStack Swift: released and supported by SwiftStack, 100% open source
• Standard Linux Distribution: off-the-shelf Ubuntu, Red Hat, CentOS
• Standard Hardware: Silicon Mechanics, Supermicro, etc.
How much does it cost?
• Only small changes vs 2014
– Is Kryder’s law obsolete at <15% per year?
– Swift now down to Glacier cost
(hardware down to $3 / TB / month)
– No price reductions in the cloud
• 4TB (~$120) and 6TB (~$250)
drives cost the same
– Do you want a fault domain of 144TB
or 216TB in your storage servers?
– Don’t skimp on CPU: erasure coding is
coming!
[Bar chart: cost comparison by storage option. SwiftStack 11, Google 26, Amazon S3 28, NAS 40 (axis 0 to 45; units not given in the extracted text)]
Object storage systems and traditional file systems –
totally different, right?
• No traditional file system hierarchy: we just have buckets (S3 lingo) or containers
(Swift lingo) that can contain millions of objects (aka files)
• Huh, no sub-directories? But how the heck can I upload my uber-complex
bioinformatics file system with 11 levels of folders to Swift?
– Answer: we simulate the hierarchical structure by simply putting forward slashes (/) in the object name (or file name)
– Source /dir1/dir2/dir3/dir4/file5 can simply be copied to /container1/many/fake/dirs/file5
• So, how do you actually copy / migrate data over to Swift if you don’t want to use the API?
– The standard tool is the OpenStack Swift client. Let’s assume I want to copy /my/local/folder to
/Swiftcontainer/pseudo/folder; here is the command you have to type:
swift upload --changed --segment-size=2G --use-slo --object-name="pseudo/folder" "container" "/my/local/folder"
– Really? Can’t we make this a little easier?
– There are a handful of open source tools available, some of them easier to use (e.g. rclone)
– However, the Swift client is frequently used, well supported, maintained and really fast!
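To make the pseudo-directory idea concrete, here is a minimal Python sketch of how a local folder tree could map to flat object names. It is an illustration only, not part of the Swift client; the function name `object_names` and the sample paths are assumptions made for this example.

```python
import os
import tempfile


def object_names(local_root, prefix):
    """Map every file under local_root to a Swift-style object name.

    Object stores have no real directories, so the hierarchy survives
    only as forward slashes inside each object name.
    """
    names = []
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            # Path relative to the upload root, always with "/" separators.
            rel = os.path.relpath(full_path, local_root).replace(os.sep, "/")
            parts = [part for part in (prefix.strip("/"), rel) if part]
            names.append("/".join(parts))
    return sorted(names)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "dir1", "dir2"))
        for rel in ("top.txt", os.path.join("dir1", "dir2", "file5")):
            with open(os.path.join(root, rel), "w") as handle:
                handle.write("x")
        # Every file becomes one object; the "folders" are just name prefixes.
        for name in object_names(root, "many/fake/dirs"):
            print(name)
```

Because the hierarchy is only a naming convention, listing a "folder" amounts to listing all objects that share a prefix.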
Object storage systems and traditional file systems –
totally different, right?
• OK, so let’s get this over with and do what HPC shops do all the time: write a
wrapper and verify that people who don’t have a lot of patience find it usable.
• Swift Commander, a simple shell wrapper for the Swift client, curl and some other
tools, makes working with Swift very easy:
• Sub-commands such as swc ls, swc cd, swc rm, swc more give you a feel that is quite similar to
a Unix file system (idea stolen from Google’s gsutil)
• Actively maintained and available at https://github.com/FredHutch/Swift-commander/
$ swc upload /my/posix/folder /my/Swift/folder
$ swc compare /my/posix/folder /my/Swift/folder
$ swc download /my/Swift/folder /my/scratch/fs
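The slides do not show how swc translates its arguments, so the following Python sketch is a hypothetical illustration of what such a wrapper has to do: split a Swift path into container and pseudo-folder, then assemble the long `swift upload` command quoted earlier. The function names are assumptions; the flags are the ones from the slide.

```python
import shlex


def split_swift_path(swift_path):
    """Split '/container/pseudo/folder' into ('container', 'pseudo/folder')."""
    parts = swift_path.strip("/").split("/", 1)
    container = parts[0]
    prefix = parts[1] if len(parts) > 1 else ""
    if not container:
        raise ValueError("Swift path must start with a container name")
    return container, prefix


def build_upload_command(local_folder, swift_path, segment_size="2G"):
    """Assemble the argument list for the swift client (hypothetical wrapper logic)."""
    container, prefix = split_swift_path(swift_path)
    command = ["swift", "upload", "--changed",
               "--segment-size=" + segment_size, "--use-slo"]
    if prefix:
        command.append("--object-name=" + prefix)
    command += [container, local_folder]
    return command


if __name__ == "__main__":
    argv = build_upload_command("/my/local/folder", "/container/pseudo/folder")
    # Print a shell-safe rendering of the command instead of running it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Returning an argument list rather than a single string avoids quoting problems when folder names contain spaces.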
Object storage systems and traditional file systems –
totally different, right?
• Didn’t someone say that object storage systems were great at using metadata?
• Yes, and you can just add a few key:value pairs as upload arguments:
• Query the metadata via swc, or use an external search engine such as Elasticsearch
$ swc upload /my/posix/folder /my/Swift/folder project:grant-xyz collaborators:jill,joe,jim cancer:breast
$ swc meta /my/Swift/folder
Meta Cancer: breast
Meta Collaborators: jill,joe,jim
Meta Project: grant-xyz
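Swift stores custom object metadata as `X-Object-Meta-<Name>` headers. As a rough sketch of how key:value arguments like the ones above could be turned into such headers, consider the Python snippet below. The function `metadata_headers` is a hypothetical helper; swc's own implementation is not shown in the slides and may differ.

```python
def metadata_headers(pairs):
    """Turn ['project:grant-xyz', ...] into Swift object metadata headers."""
    headers = {}
    for pair in pairs:
        # Split on the first colon only, so values may contain colons.
        key, separator, value = pair.partition(":")
        if not separator or not key.strip():
            raise ValueError("expected key:value, got %r" % pair)
        headers["X-Object-Meta-" + key.strip().capitalize()] = value.strip()
    return headers


if __name__ == "__main__":
    arguments = ["project:grant-xyz", "collaborators:jill,joe,jim", "cancer:breast"]
    for name, value in sorted(metadata_headers(arguments).items()):
        print(name + ": " + value)
```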
Object storage systems and traditional file systems –
totally different, right?
• Users tend to prefer working with a POSIX file system with all files in one place, but integrating
Swift into your workflows is not really hard
• Example: running samtools using persistent scratch space
(files deleted if not accessed for 30 days)
• A complex 50-line HPC submission script prepping a GATK workflow requires
just 3 more lines!
• Read the file from persistent scratch space and, if it is not there, simply pull it again from Swift
• If you don’t have scratch space you can pipe the download from Swift directly to samtools
if ! [[ -f /fh/scratch/delete30/pi/raw/genome.bam ]]; then
    swc download /Swiftfolder/genome.bam /fh/scratch/delete30/pi/raw/genome.bam
fi
samtools view -F 0xD04 -c /fh/scratch/delete30/pi/raw/genome.bam > otherfile
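The same "use scratch as a cache, fall back to Swift" pattern can be sketched in Python for workflows driven from a script. This is a minimal illustration under assumptions: `ensure_local` is a hypothetical helper, and `fake_fetch` stands in for the real download step (for example a call to swc), which is not executed here.

```python
import os
import tempfile


def ensure_local(path, fetch):
    """Return path, calling fetch(path) first only if the file is missing."""
    if not os.path.isfile(path):
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        fetch(path)
    return path


if __name__ == "__main__":
    calls = []

    def fake_fetch(target):
        # Stand-in for the real download; writes a placeholder file.
        calls.append(target)
        with open(target, "w") as handle:
            handle.write("placeholder")

    with tempfile.TemporaryDirectory() as scratch:
        bam = os.path.join(scratch, "pi", "raw", "genome.bam")
        ensure_local(bam, fake_fetch)  # file missing: fetch runs
        ensure_local(bam, fake_fetch)  # file present: fetch is skipped
        print(len(calls))  # prints 1
```

Scratch space then behaves like a cache: a purged file costs one extra download, not a failed job.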
Object storage systems and traditional file systems –
totally different, right?
• Use the HPC system to download lots of BAM files in parallel
• 30 cluster jobs run in parallel on 30 1G nodes (which is my HPC limit)
• My scratch file system says it loads data at 1.4 GB/s
• This means that each BAM file is downloaded at 47 MB/s on average, and downloading this 1.2 TB
dataset takes 14 min
$ swc ls /Ext/seq_20150112/ > bamfiles.txt
$ while read FILE; do
>   sbatch -N1 -c4 --wrap="swc download /Ext/seq_20150112/$FILE ."
> done < bamfiles.txt
$ squeue -u petersen
JOBID PARTITION NAME USER ST TIME NODES NODELIST
17249368 campus sbatch petersen R 15:15 1 gizmof120
17249371 campus sbatch petersen R 15:15 1 gizmof123
17249378 campus sbatch petersen R 15:15 1 gizmof130
$ fhgfs-ctl --userstats --names --interval=5 --nodetype=storage
====== 10 s ======
Sum: 13803 [sum] 13803 [ops-wr] 1380.300 [MiB-wr/s]
petersen 13803 [sum] 13803 [ops-wr] 1380.300 [MiB-wr/s]
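The per-stream rate and total transfer time quoted on this slide follow directly from the aggregate throughput. A small Python check of that arithmetic, assuming decimal units (1 GB = 1000 MB, 1 TB = 1000 GB) and using illustrative function names:

```python
def per_stream_mb_s(aggregate_mb_s, streams):
    """Average rate of each parallel download in MB/s."""
    return aggregate_mb_s / streams


def transfer_minutes(dataset_tb, aggregate_gb_s):
    """Minutes needed to move dataset_tb at aggregate_gb_s (decimal units)."""
    seconds = dataset_tb * 1000 / aggregate_gb_s
    return seconds / 60


if __name__ == "__main__":
    # 1.4 GB/s shared by 30 parallel jobs
    print(round(per_stream_mb_s(1400, 30)))      # prints 47
    # 1.2 TB at 1.4 GB/s
    print(round(transfer_minutes(1.2, 1.4), 1))  # prints 14.3
```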
Scientific file systems are a mixture of small files & large files
• How does Swift handle copying lots of small files?
• Answer: not so fast, but to be honest your NFS NAS does not handle this too well either
• Example: (ab)using filenames as a database:
dirk@rhino04:# ls metapop_results/corrected/release_test/evo/ | head
global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.05_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000
global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.15_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000
global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.1_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000
global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.25_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000
• So, we could tar up this entire directory structure, but then we have one giant 1 TB tarball that
becomes really hard to handle
• But what if we had a tool that would not tar up all sub-directories into one file but create a tarball for each level?
/folder1/folder2/folder3 could turn into:
/folder1.tar.gz
/folder1/folder2.tar.gz
/folder1/folder2/folder3.tar.gz
• So to restore folder2 and below we just need folder2.tar.gz + folder3.tar.gz
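The one-tarball-per-directory-level scheme can be sketched in a few lines of Python with the standard `tarfile` module. This is an illustration of the idea only, not the swbundler.py implementation; the function names and the demo layout are assumptions.

```python
import os
import tarfile
import tempfile


def bundle_per_level(source_root, archive_root):
    """Write one .tar.gz per directory, each holding only that directory's own files.

    archive_root must live outside source_root so the walk does not
    pick up the archives it is writing.
    """
    parent = os.path.dirname(os.path.abspath(source_root))
    created = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        rel_dir = os.path.relpath(os.path.abspath(dirpath), parent)
        archive_path = os.path.join(archive_root, rel_dir + ".tar.gz")
        os.makedirs(os.path.dirname(archive_path), exist_ok=True)
        with tarfile.open(archive_path, "w:gz") as archive:
            for filename in sorted(filenames):
                # Only the bare file name is stored: sub-directories get their own archive.
                archive.add(os.path.join(dirpath, filename), arcname=filename)
        created.append(os.path.relpath(archive_path, archive_root).replace(os.sep, "/"))
    return sorted(created)


def archives_for_restore(all_archives, folder):
    """Pick the archives needed to restore folder and everything below it."""
    folder = folder.strip("/")
    return [name for name in all_archives
            if name == folder + ".tar.gz" or name.startswith(folder + "/")]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work:
        deepest = os.path.join(work, "src", "folder1", "folder2", "folder3")
        os.makedirs(deepest)
        with open(os.path.join(deepest, "result.txt"), "w") as handle:
            handle.write("data")
        archives = bundle_per_level(os.path.join(work, "src", "folder1"),
                                    os.path.join(work, "out"))
        print(archives)
        print(archives_for_restore(archives, "folder1/folder2"))
```

Each archive stays small and independent, so a partial restore only touches the archives under the requested folder.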
Scientific file systems are a mixture of small files & large files
• Solution: Swift Commander contains an archiving module
• Written by the author of the PostMark file system benchmark, who has some
experience with handling small files
• It’s easy:
archive: $ swc arch /my/posix/folder /my/Swift/folder
restore: $ swc unarch /my/Swift/folder /my/scratch/fs
• It’s fast:
– Archiving uses multiple processes, measured at up to 400 MB/s from one Linux box.
– Each process uses pigz multithreaded gzip compression (example: compressing a 1 GB DNA string down to
272 MB takes 111 seconds with gzip, 5 seconds with pigz)
– Restore can use standard gzip
• It’s simple & free: https://github.com/FredHutch/Swift-commander/blob/master/bin/swbundler.py
Scientific file systems are a mixture of small files & large files
• Special case: sometimes we have large NGS files mixed with many small files; we
want to copy the large files without tarring them and archive the small files as tar.gz
• The default bundle option in Swift Commander copies files > 64 MB straight and bundles
files < 64 MB into tar.gz archives
archive: $ swc bundle /my/posix/folder /my/Swift/folder
restore: $ swc unbundle /my/Swift/folder /my/scratch/fs
• Can change the default to other sizes:
$ swc bundle /my/posix/folder /my/Swift/folder 512M
• Benefit: archives small files effectively and still allows you to open large files directly
with other tools, e.g. BAM files in a public folder in Swift can be opened by the IGV
browser
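The size-based split behind the bundle option can be illustrated with a short Python sketch. It makes two assumptions the slides do not settle: size suffixes are treated as binary units (64M = 64 × 1024² bytes), and a file exactly at the threshold is copied straight. The names `parse_size` and `plan_bundle` are hypothetical.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '64M' or '512M' into bytes (binary units assumed)."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def plan_bundle(file_sizes, threshold="64M"):
    """Split {name: size_in_bytes} into (copy_straight, bundle_into_tar)."""
    limit = parse_size(threshold)
    copy_straight = sorted(name for name, size in file_sizes.items() if size >= limit)
    bundle = sorted(name for name, size in file_sizes.items() if size < limit)
    return copy_straight, bundle


if __name__ == "__main__":
    sizes = {"sample.bam": 200 * 1024 ** 2, "notes.txt": 2048, "run.log": 10 * 1024}
    straight, bundled = plan_bundle(sizes)
    print("copy as-is:", straight)
    print("tar.gz bundle:", bundled)
```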
Access with GUI tools is required for collaboration
• Reality: every archive requires access via GUI tools,
even if only infrequently
• Needs to work with Windows and Mac
• Tools such as Cyberduck are standard but not
perfectly convenient, we need tools that
– Are very easy to use and
– do not create any proprietary data structures in
Swift that cannot be read by other tools and
– Simply replace a shared drive
Access with GUI tools is required for collaboration
• Another example: ExpanDrive and Storage Made Easy
– Work with Windows and Mac
– Integrate into Mac Finder and are mountable as a drive in Windows
rclone: mass copy, backup, data migration - better than rsync
• rclone is a multithreaded data
copy / mirror tool
• Consistent performance on
Linux, Mac and Windows
• E.g. keep a mirror of a Synology
workgroup NAS (QNAP has a
built-in Swift mirror option)
• Data remains accessible by
swc and desktop clients
• Mirror protected by Swift
undelete (currently 60 days
retention)
Galaxy integration with OpenStack Swift in production
• Galaxy web-based high-throughput
computing at the Hutch uses Swift as
primary storage in production today
• SwiftStack patches contributed to the Galaxy
Project
• Swift allows delegating “root” access to
bioinformaticians
• Integrated with the Slurm HPC scheduler:
automatically assigns the default PI account
for each user
Q & A
Bio-IT World 2015 25

More Related Content

What's hot

Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Spark Summit
 
How to Protect Big Data in a Containerized Environment
How to Protect Big Data in a Containerized EnvironmentHow to Protect Big Data in a Containerized Environment
How to Protect Big Data in a Containerized EnvironmentBlueData, Inc.
 
Embracing hybrid cloud for data-intensive analytic workloads
Embracing hybrid cloud for data-intensive analytic workloadsEmbracing hybrid cloud for data-intensive analytic workloads
Embracing hybrid cloud for data-intensive analytic workloadsAlluxio, Inc.
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practicesHadoop User Group
 
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員MeetupDatacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員MeetupPaco Nathan
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownDataWorks Summit
 
[Pixar] Big Data, Big Depots
[Pixar] Big Data, Big Depots[Pixar] Big Data, Big Depots
[Pixar] Big Data, Big DepotsPerforce
 
The DuraCloud Workshop - Open Repositories 2015
The DuraCloud Workshop - Open Repositories 2015The DuraCloud Workshop - Open Repositories 2015
The DuraCloud Workshop - Open Repositories 2015DuraSpace
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataHakka Labs
 
Ozone: Evolution of HDFS scalability & built-in GDPR compliance
Ozone: Evolution of HDFS scalability & built-in GDPR complianceOzone: Evolution of HDFS scalability & built-in GDPR compliance
Ozone: Evolution of HDFS scalability & built-in GDPR complianceDinesh Chitlangia
 
Introduction to Data Analyst Training
Introduction to Data Analyst TrainingIntroduction to Data Analyst Training
Introduction to Data Analyst TrainingCloudera, Inc.
 
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...Spark Summit
 
Data as a Service
Data as a Service Data as a Service
Data as a Service Kyle Hailey
 
HBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesHBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesCloudera, Inc.
 
The hadoop ecosystem table
The hadoop ecosystem tableThe hadoop ecosystem table
The hadoop ecosystem tableMohamed Magdy
 

What's hot (20)

Intro To Hadoop
Intro To HadoopIntro To Hadoop
Intro To Hadoop
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
 
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
 
How to Protect Big Data in a Containerized Environment
How to Protect Big Data in a Containerized EnvironmentHow to Protect Big Data in a Containerized Environment
How to Protect Big Data in a Containerized Environment
 
HDFS Tiered Storage
HDFS Tiered StorageHDFS Tiered Storage
HDFS Tiered Storage
 
Embracing hybrid cloud for data-intensive analytic workloads
Embracing hybrid cloud for data-intensive analytic workloadsEmbracing hybrid cloud for data-intensive analytic workloads
Embracing hybrid cloud for data-intensive analytic workloads
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practices
 
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員MeetupDatacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside Down
 
[Pixar] Big Data, Big Depots
[Pixar] Big Data, Big Depots[Pixar] Big Data, Big Depots
[Pixar] Big Data, Big Depots
 
The DuraCloud Workshop - Open Repositories 2015
The DuraCloud Workshop - Open Repositories 2015The DuraCloud Workshop - Open Repositories 2015
The DuraCloud Workshop - Open Repositories 2015
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
Ozone: Evolution of HDFS scalability & built-in GDPR compliance
Ozone: Evolution of HDFS scalability & built-in GDPR complianceOzone: Evolution of HDFS scalability & built-in GDPR compliance
Ozone: Evolution of HDFS scalability & built-in GDPR compliance
 
Nov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.HNov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.H
 
Introduction to Data Analyst Training
Introduction to Data Analyst TrainingIntroduction to Data Analyst Training
Introduction to Data Analyst Training
 
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
 
Data as a Service
Data as a Service Data as a Service
Data as a Service
 
HBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesHBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 Minutes
 
The hadoop ecosystem table
The hadoop ecosystem tableThe hadoop ecosystem table
The hadoop ecosystem table
 
Migrating from OCLC's Digital Archive to DuraCloud
Migrating from OCLC's Digital Archive to DuraCloudMigrating from OCLC's Digital Archive to DuraCloud
Migrating from OCLC's Digital Archive to DuraCloud
 

Viewers also liked

Viewers also liked (20)

Training report nyakyera sacco ltd final report
Training report nyakyera sacco ltd final reportTraining report nyakyera sacco ltd final report
Training report nyakyera sacco ltd final report
 
6. hafta sunu
6. hafta sunu6. hafta sunu
6. hafta sunu
 
Question 6
Question 6Question 6
Question 6
 
Frieling better burger
Frieling better burgerFrieling better burger
Frieling better burger
 
Joyce willaert presentatie
Joyce willaert presentatieJoyce willaert presentatie
Joyce willaert presentatie
 
Rattler better burger
Rattler better burgerRattler better burger
Rattler better burger
 
Sunu2
Sunu2Sunu2
Sunu2
 
Acessibilidade
AcessibilidadeAcessibilidade
Acessibilidade
 
Taal
TaalTaal
Taal
 
Credit card
Credit cardCredit card
Credit card
 
A Paisagem Urbana - de Gordon Cullen
A Paisagem Urbana - de Gordon Cullen A Paisagem Urbana - de Gordon Cullen
A Paisagem Urbana - de Gordon Cullen
 
Andrew McStay slides Empathic Media #datapowerconf
Andrew McStay slides Empathic Media #datapowerconfAndrew McStay slides Empathic Media #datapowerconf
Andrew McStay slides Empathic Media #datapowerconf
 
Overview of Sensors project
Overview of Sensors projectOverview of Sensors project
Overview of Sensors project
 
7. hafta sunu
7. hafta sunu7. hafta sunu
7. hafta sunu
 
Pictures
PicturesPictures
Pictures
 
Planning the blog
Planning the blogPlanning the blog
Planning the blog
 
Question 6
Question 6Question 6
Question 6
 
Recomendação de Nairobi 1976
Recomendação de Nairobi 1976Recomendação de Nairobi 1976
Recomendação de Nairobi 1976
 
Shaw empresario
Shaw empresarioShaw empresario
Shaw empresario
 
роял сафари
роял сафарироял сафари
роял сафари
 

Similar to BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data Backups

Open stack summit-2015-dp
Open stack summit-2015-dpOpen stack summit-2015-dp
Open stack summit-2015-dpDirk Petersen
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio, Inc.
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansPeter Clapham
 
Denver devops : enabling DevOps with data virtualization
Denver devops : enabling DevOps with data virtualizationDenver devops : enabling DevOps with data virtualization
Denver devops : enabling DevOps with data virtualizationKyle Hailey
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsDataWorks Summit
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications OpenEBS
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...Alluxio, Inc.
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio, Inc.
 
start_your_datacenter_sds_v3
start_your_datacenter_sds_v3start_your_datacenter_sds_v3
start_your_datacenter_sds_v3David Byte
 
CI_CONF 2012: Scaling
CI_CONF 2012: ScalingCI_CONF 2012: Scaling
CI_CONF 2012: ScalingChris Miller
 
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebula Project
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017AWS Chicago
 
How to Build a Compute Cluster
How to Build a Compute ClusterHow to Build a Compute Cluster
How to Build a Compute ClusterRamsay Key
 
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...Cloudian
 
Geek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure EnvironmentsGeek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure EnvironmentsIDERA Software
 
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Alluxio, Inc.
 

Similar to BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data Backups (20)

Open stack summit-2015-dp
Open stack summit-2015-dpOpen stack summit-2015-dp
Open stack summit-2015-dp
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticians
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
 
Denver devops : enabling DevOps with data virtualization
Denver devops : enabling DevOps with data virtualizationDenver devops : enabling DevOps with data virtualization
Denver devops : enabling DevOps with data virtualization
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
 
start_your_datacenter_sds_v3
start_your_datacenter_sds_v3start_your_datacenter_sds_v3
start_your_datacenter_sds_v3
 
CI_CONF 2012: Scaling - Chris Miller
CI_CONF 2012: Scaling - Chris MillerCI_CONF 2012: Scaling - Chris Miller
CI_CONF 2012: Scaling - Chris Miller
 
CI_CONF 2012: Scaling
CI_CONF 2012: ScalingCI_CONF 2012: Scaling
CI_CONF 2012: Scaling
 
Mini-Training: To cache or not to cache
Mini-Training: To cache or not to cacheMini-Training: To cache or not to cache
Mini-Training: To cache or not to cache
 
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
 
How to Build a Compute Cluster
How to Build a Compute ClusterHow to Build a Compute Cluster
How to Build a Compute Cluster
 
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
 
Geek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure EnvironmentsGeek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure Environments
 
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
 



BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data Backups

  • 1. Are Your Researchers Paying Too Much for Their Cloud-Based Data Backups? Dirk Petersen, Scientific Computing Director, Fred Hutchinson Cancer Research Center (FHCRC), Bio-IT World 2015
  • 2. Who are we and what do we do? What is Fred Hutch? • Cancer & HIV research • 3 Nobel Laureates • $430M budget / 85% NIH funding • Seattle campus with 13 buildings on 15 acres, 1.5+ million sq ft of facility space. Research at “The Hutch”: • 2,700 employees • 220 faculty, many with custom requirements • 13 research programs • 14 core facilities • Conservative use of information technology. IT at “The Hutch”: • Multiple data centers with >1,000 kW capacity • 100 staff in Center IT plus divisional IT • Team of 3 sysadmins to support storage • IT funded by indirects (F&A) • Storage chargebacks started Nov 2014
  • 3. How did we get here: Economy File project in production in 2014 • Chargebacks drove the Hutch to embrace more economical storage • Selected Swift object storage managed by SwiftStack • Go-live in 2014, strong interest and expansion in 2015 • Researchers do not want to pay the price for standard enterprise storage • Additional use cases: – In production: Swift as a backend for Galaxy – In progress: Swift replaces standard disk deduplication devices for backup – Planning: Swift as backend for endpoint backup (Druva) – Planning: Swift as backend for virtual machines (openvstorage) – Future option: Swift as backend for enterprise file sharing / NAS • File System Gateway for CIFS/NFS access phased out
  • 4. Phasing out of Filesystem Gateway • Initial deployment used the SwiftStack Gateway (CIFS/NFS) – User survey: strong preference for traditional file access – Gateway was the easiest integration option for the existing authentication and authorization process • However – Gateway was up to 10x slower than direct access to the API – Users had agreed to low performance because of low costs – But low performance still causes frustration and increases Ops cost • Now we have alternatives, and better AD integration of Swift – Gateway was non-HA, with higher Ops costs – Removing the gateway allows rolling updates during business hours – Gateway didn’t allow for full auditing of file access, but Swift does • Users finally saw the benefit of removing the gateway and were willing to try alternative tools
  • 5. How chargebacks were implemented • Custom SharePoint site for storage chargeback processing and allocation to grants – Each PI can allocate a certain % of charges to up to 3 grant budgets – The allocation becomes the default setup for the next month – User comments positive: “very easy to use” • Don’t make chargeback worse by offering bad tools!
  • 6. Chargebacks spike Swift utilization • Started storage chargebacks on Nov 1st – Triggered strong growth in October – Users sought to avoid the high cost of enterprise NAS and put as much as possible into lower-cost Swift • Underestimated success of Swift – Needed to stop migration to buy more hardware – Can migrate 30+ TB per day today
  • 7. Chargebacks spike Swift utilization, cont. • High aggregate throughput • Current network architecture is an (anticipated) bottleneck • Many parallel streams required to max out throughput • Ideal for HPC cluster architecture
  • 8. Silicon Mechanics – Expert included. • Commodity hardware selection • Open source software identification • Quality assembly process with zero defects • On-time installation and deployment • Design consultation for the right solution • Focused on your real world problems • Real people behind the product • Support staff who know your system
  • 9. Silicon Mechanics: The value of highly customizable hardware. Silicon Mechanics Storform Storage Servers • Flexible, Configurable, Reliable • 144TB raw capacity; 130TB usable • No RAID controllers; no storage lost to RAID • 36 x 4TB 3.5” Seagate SATA drives • 2 x 120GB Intel S3700 SSDs; OS + metadata • 10Gb Base-T connectivity • (2) Intel Xeon E5 CPUs • 64GB RAM • Supermicro SC847 4U chassis • Learn more at Booth #361 @ExpertIncluded
  • 10. Management of OpenStack Swift using SwiftStack • SwiftStack provides control & visibility – Deployment automation • Lets us roll out Swift nodes in 10 minutes • Upgrading Swift across clusters with 1 click – Monitoring and stats at cluster, node, and drive levels – Authentication & Authorization – Capacity & Utilization Management via Quotas and Rate Limits – Alerting & Diagnostics
  • 11. SwiftStack Architecture Overview • SwiftStack Controller: Deployment Automation; Authentication Services; Ring & Cluster Management; Client Support; Capacity & Utilization Mgmt.; Monitoring, Alerting & Diagnostics; Rolling Upgrades & 24x7 Support • SwiftStack Nodes (2 to 1000s): Integrations & Interfaces (end-user web UI, legacy interfaces, authentication, utilization API, etc.); OpenStack Swift (released and supported by SwiftStack, 100% open source); Swift Runtime (integrated storage engine with all node components); Standard Linux Distribution (off-the-shelf Ubuntu, Red Hat, CentOS); Standard Hardware (Silicon Mechanics, Supermicro, etc.)
  • 12. How much does it cost? • Only small changes vs 2014 – Kryder’s law obsolete at <15% per year? – Swift now down to Glacier cost (hardware down to $3 / TB / month) – No price reductions in the cloud • 4TB (~$120) and 6TB (~$250) drives cost the same – Do you want a fault domain of 144TB or 216TB in your storage servers? – Don’t save on CPU / Erasure Code is coming! • [Bar chart, values as labeled on the slide: SwiftStack 11, Google 26, Amazon S3 28, NAS 40]
  • 13. Object storage systems and traditional file systems – totally different, right? • No traditional file system hierarchy, we just have buckets (S3 lingo) or containers (Swift lingo) that can contain millions of objects (aka files) • Huh, no sub-directories? But how the heck can I upload my uber-complex bioinformatics file system with 11 folder hierarchies to Swift? – Answer: we simulate the hierarchical structure by simply putting forward slashes (/) in the object name (or file name) – source /dir1/dir2/dir3/dir4/file5 can simply be copied to /container1/many/fake/dirs/file5 • So, how do you actually copy / migrate data over to Swift if you don’t want to use the API? – The standard tool is the OpenStack Swift client. Let’s assume I want to copy /my/local/folder to /Swiftcontainer/pseudo/folder; here is the command you have to type: swift upload --changed --segment-size=2G --use-slo --object-name="pseudo/folder" "container" "/my/local/folder" – Really? Can’t we get this a little easier? – There are a handful of open source tools available, some of them easier to use (e.g. rclone) – However, the Swift client is frequently used, well supported, maintained and really fast!
  • 14. Object storage systems and traditional file systems – totally different, right? • OK, so let’s get this over with and do what HPC shops do all the time: write a wrapper and verify that people who don’t have a lot of patience find it usable. • Swift Commander, a simple shell wrapper for the Swift client, curl and some other tools, makes working with Swift very easy: $ swc upload /my/posix/folder /my/Swift/folder $ swc compare /my/posix/folder /my/Swift/folder $ swc download /my/Swift/folder /my/scratch/fs • Sub commands such as swc ls, swc cd, swc rm, swc more give you a feel that is quite similar to a Unix file system, an idea stolen from Google’s gsutil • Actively maintained and available at https://github.com/FredHutch/Swift-commander/
  • 15. Object storage systems and traditional file systems – totally different, right? • Didn’t someone say that object storage systems were great at using metadata? • Yes, and you can just add a few key:value pairs as upload arguments: $ swc upload /my/posix/folder /my/Swift/folder project:grant-xyz collaborators:jill,joe,jim cancer:breast • Query the metadata via swc, or use an external search engine such as Elasticsearch: $ swc meta /my/Swift/folder Meta Cancer: breast Meta Collaborators: jill,joe,jim Meta Project: grant-xyz
  • 16. Object storage systems and traditional file systems – totally different, right? • Users tend to prefer to work with a POSIX file system with all files in one place, but integrating Swift in your workflows is not really hard • Example: running samtools using persistent scratch space (files deleted if not accessed for 30 days) • A complex 50 line HPC submission script prepping a GATK workflow requires just 3 more lines! • Read the file from persistent scratch space and, if it is not there, simply pull it again from Swift: if ! [[ -f /fh/scratch/delete30/pi/raw/genome.bam ]]; then swc download /Swiftfolder/genome.bam /fh/scratch/delete30/pi/raw/genome.bam; fi; samtools view -F 0xD04 -c /fh/scratch/delete30/pi/raw/genome.bam > otherfile • If you don’t have scratch space you can pipe the download from Swift directly to samtools
  • 17. Object storage systems and traditional file systems – totally different, right? • Use the HPC system to download lots of bam files in parallel • 30 cluster jobs run in parallel on 30 1G nodes (which is my HPC limit): $ swc ls /Ext/seq_20150112/ > bamfiles.txt $ while read FILE; do sbatch -N1 -c4 --wrap="swc download /Ext/seq_20150112/$FILE ."; done < bamfiles.txt $ squeue -u petersen JOBID PARTITION NAME USER ST TIME NODES NODELIST 17249368 campus sbatch petersen R 15:15 1 gizmof120 17249371 campus sbatch petersen R 15:15 1 gizmof123 17249378 campus sbatch petersen R 15:15 1 gizmof130 • My scratch file system says it loads data at 1.4 GB/s: $ fhgfs-ctl --userstats --names --interval=5 --nodetype=storage ====== 10 s ====== Sum: 13803 [sum] 13803 [ops-wr] 1380.300 [MiB-wr/s] petersen 13803 [sum] 13803 [ops-wr] 1380.300 [MiB-wr/s] • This means that each bam file is downloaded at 47 MB/s on average, and downloading this dataset of 1.2 TB takes 14 min
  • 18. Scientific file systems are a mixture of small files & large files • How does Swift handle copying lots of small files? • Answer: not so fast, but to be honest your NFS NAS does not handle this too well either • Example: (ab)using filenames as database: dirk@rhino04:# ls metapop_results/corrected/release_test/evo/ | head global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.05_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000 global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.15_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000 global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.1_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000 global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.25_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000 • So, we could tar up this entire directory structure, but then we have one giant tar ball of 1 TB that becomes really hard to handle • But what if we had a tool that would not tar up sub dirs in one file but create a tar ball for each level? /folder1/folder2/folder3 could turn into: /folder1.tar.gz /folder1/folder2.tar.gz /folder1/folder2/folder3.tar.gz • So to restore folder2 and below we just need folder2.tar.gz + folder3.tar.gz
  • 19. Scientific file systems are a mixture of small files & large files • Solution: Swift commander contains an archiving module • Written by the author of the postmark file system benchmark, who has some experience with handling small files • It’s easy: archive: $ swc arch /my/posix/folder /my/Swift/folder restore: $ swc unarch /my/Swift/folder /my/scratch/fs • It’s fast: – Archiving uses multiple processes, measured up to 400 MB/s from one Linux box – Each process uses pigz multithreaded gzip compression (Example: compressing a 1GB DNA string down to 272MB: 111 seconds using gzip, 5 seconds using pigz) – Restore can use standard gzip • It’s simple & free: https://github.com/FredHutch/Swift-commander/blob/master/bin/swbundler.py
  • 20. Scientific file systems are a mixture of small files & large files • Special case: sometimes we have large NGS files mixed with many small files; we want to copy the large files without tarring them and archive the small files as tar.gz • Default bundle option in Swift commander copies files >64MB straight and bundles files <64MB into tar.gz archives: archive: $ swc bundle /my/posix/folder /my/Swift/folder • Can change the default to other sizes: $ swc bundle /my/posix/folder /my/Swift/folder 512M restore: $ swc unbundle /my/Swift/folder /my/scratch/fs • Benefit: archives small files effectively and still allows you to open large files directly with other tools, e.g. bam files in a public folder in Swift can be opened by the IGV browser
  • 21. Access with GUI tools is required for collaboration • Reality: even if infrequent, every archive requires access via GUI tools • Needs to work with Windows and Mac • Tools such as Cyberduck are standard but not perfectly convenient; we need tools that – are very easy to use and – do not create any proprietary data structures in Swift that cannot be read by other tools and – simply replace a shared drive
  • 22. Access with GUI tools is required for collaboration • Another example: ExpanDrive and Storage Made Easy – Works with Windows and Mac – Integrates in Mac Finder and is mountable as a drive in Windows
  • 23. rclone: mass copy, backup, data migration - better than rsync • rclone is a multithreaded data copy / mirror tool • Consistent performance on Linux, Mac and Windows • E.g. keep a mirror of a Synology workgroup NAS (QNAP has a built-in Swift mirror option) • Data remains accessible by swc and desktop clients • Mirror protected by Swift undelete (currently 60 days retention)
  • 24. Galaxy integration with OpenStack Swift in production • Galaxy web-based high throughput computing at the Hutch uses Swift as primary storage in production today • SwiftStack patches contributed to the Galaxy Project • Swift allows delegating “root” access to bioinformaticians • Integrated with the Slurm HPC scheduler: automatically assigns the default PI account for each user
  • 25. Q & A
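
Slide 13 says that Swift has no real sub-directories and that hierarchy is simulated by putting forward slashes in the object name. A minimal Python sketch of that mapping follows. The function name `object_name` and its parameters are hypothetical, chosen only for illustration; this is not how the Swift client or Swift Commander implements it.

```python
from pathlib import PurePosixPath


def object_name(local_root: str, local_file: str, prefix: str = "") -> str:
    """Build a slash-separated object name for a file below local_root.

    Hypothetical helper: the "folders" exist only as text inside the name.
    """
    # Path of the file relative to the folder being uploaded.
    relative = PurePosixPath(local_file).relative_to(PurePosixPath(local_root))
    # Prepend the pseudo-folder prefix, if one was given.
    return str(PurePosixPath(prefix) / relative) if prefix else str(relative)


if __name__ == "__main__":
    print(object_name("/my/local/folder",
                      "/my/local/folder/sub/dir/file5",
                      "pseudo/folder"))
```

The container itself stays flat; listing tools only present the slashes as folders.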
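
The three extra lines on slide 16 amount to a "use the scratch copy if present, otherwise fetch it again" pattern. The same idea can be sketched in Python as below. The helper `ensure_cached` and the `fetch` callback are assumptions made for illustration; in a real job the callback would presumably run something like `swc download`, which is not invoked here.

```python
import os
from typing import Callable


def ensure_cached(scratch_path: str, fetch: Callable[[str], None]) -> bool:
    """Make sure scratch_path exists, calling fetch(scratch_path) if it is missing.

    Returns True when a fetch was needed, False when the scratch copy was reused.
    """
    if os.path.isfile(scratch_path):
        return False  # still in scratch, nothing to do
    # Recreate the directory in case the scratch purge removed it as well.
    os.makedirs(os.path.dirname(scratch_path), exist_ok=True)
    fetch(scratch_path)  # hypothetical: e.g. a wrapper around "swc download"
    return True
```

Passing the download step in as a callback keeps the sketch independent of any particular client.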
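
The figures on slide 17 (47 MB/s per file, 14 minutes for 1.2 TB) follow from the aggregate rate and the number of parallel jobs. A short Python check, using the slide's rounded 1.4 GB/s and assuming decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes):

```python
def per_stream_mb_s(aggregate_bytes_per_s: float, streams: int) -> float:
    """Average rate of one stream in MB/s when streams share the aggregate evenly."""
    return aggregate_bytes_per_s / streams / 1e6


def transfer_minutes(total_bytes: float, aggregate_bytes_per_s: float) -> float:
    """Time in minutes to move total_bytes at the given aggregate rate."""
    return total_bytes / aggregate_bytes_per_s / 60


AGGREGATE = 1.4e9   # 1.4 GB/s, as reported for the scratch file system
JOBS = 30           # parallel cluster jobs
DATASET = 1.2e12    # 1.2 TB

if __name__ == "__main__":
    print(round(per_stream_mb_s(AGGREGATE, JOBS)))      # about 47 MB/s per job
    print(round(transfer_minutes(DATASET, AGGREGATE)))  # about 14 minutes
```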
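
Slide 18 proposes one tar ball per directory level instead of a single giant archive. The sketch below illustrates that idea with Python's standard `tarfile` module. It is not the swbundler.py implementation; the function name `bundle_per_level` is hypothetical, and the destination is assumed to lie outside the source tree.

```python
import os
import tarfile
import tempfile


def bundle_per_level(src_root: str, dest_root: str) -> list:
    """Write one .tar.gz per directory, holding only that directory's own files.

    Returns the archive paths relative to dest_root, sorted.
    """
    src_root = os.path.abspath(src_root)
    parent = os.path.dirname(src_root)
    created = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, parent)  # e.g. folder1/folder2
        archive = os.path.join(dest_root, rel_dir + ".tar.gz")
        os.makedirs(os.path.dirname(archive), exist_ok=True)
        with tarfile.open(archive, "w:gz") as tar:
            for name in sorted(filenames):
                # Sub-directories are skipped; they get their own archive.
                tar.add(os.path.join(dirpath, name), arcname=name)
        created.append(os.path.relpath(archive, dest_root))
    return sorted(created)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "folder1", "folder2", "folder3"))
        print(bundle_per_level(os.path.join(tmp, "folder1"),
                               os.path.join(tmp, "out")))
```

Restoring a subtree then only needs the archives at and below that level, as the slide notes.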
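
The bundle option on slide 20 splits files by size: large files are copied as they are, small ones go into tar.gz archives. A rough Python sketch of that decision follows. The names `parse_size` and `plan_bundle` are hypothetical, the suffixes are assumed to be binary multiples (64M = 64 x 1024^2 bytes), and a file exactly at the threshold is treated as large, which the slide does not specify.

```python
def parse_size(text: str) -> int:
    """Convert a size such as '512M' into bytes (assumes K/M/G are powers of 1024)."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


def plan_bundle(sizes: dict, threshold: int = 64 * 1024 ** 2):
    """Split {filename: size_in_bytes} into (copy_directly, bundle_into_tar)."""
    direct = sorted(name for name, size in sizes.items() if size >= threshold)
    bundled = sorted(name for name, size in sizes.items() if size < threshold)
    return direct, bundled


if __name__ == "__main__":
    example = {"sample.bam": 3 * 1024 ** 3, "run.log": 2048, "notes.txt": 100}
    print(plan_bundle(example, parse_size("64M")))
```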