Dealing with the Challenges of
Large Life Science Data Sets from
Acquisition to Archive
Dean Flanders
Friedrich Miescher Institute
Basel, Switzerland
DDN User Group Meeting @ ISC 2017
June 20, 2017
“Moving Sucks”
OR
“Biologists are not Physicists”
OR
“Help!”
Alternate Titles Considered…
Legend for Slides
PRODUCTION
MIGRATION
UNDER CONSTRUCTION
• Internationally recognized as a center of
excellence in biomedical research with a
strong record of innovation in the molecular
biology of disease.
• Devoted to fundamental biomedical research.
• Current research focuses on the study of
genetics, cancer, and neurobiology with 360
researchers.
• A wide variety of research technologies is
used, generating terabytes of data per year.
• Funded by the Novartis Research
Foundation.
Founded in 1970
Located in Basel, Switzerland
Introduction – Friedrich Miescher Institute
data center 1 (primary)
data center 2 (DR)
data center 3 (backup)
CIFS / NFS / GPFS (1024 TB usable initially, growing to 4 PB in ~3 years)
~10 × ~300 MB/s
~800 × ~50 MB/s
~30 × ~500 MB/s
~20 × ~500 MB/s
Initial 2017
- NAS: 1024 TB usable
Expansion 2017
- NAS: additional 512 TB
Expansion 2018
- NAS: additional 512 TB
Architecture – Overview (DDN GS14KX and DDN GS7KX)
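The capacity plan on this slide can be checked with a few lines of arithmetic. This is only a sketch of the numbers quoted above; it assumes 1 PB = 1024 TB (consistent with the 1024 TB initial size), and the variable names are illustrative.

```python
# Capacity figures taken from the slide; units are TB.
initial_tb = 1024
expansions_tb = {"2017": 512, "2018": 512}

# Assumption: 1 PB = 1024 TB, so the 4 PB goal is 4096 TB.
target_tb = 4 * 1024

planned_tb = initial_tb + sum(expansions_tb.values())
remaining_tb = target_tb - planned_tb

print(f"planned by end of 2018: {planned_tb} TB")
print(f"still needed to reach 4 PB: {remaining_tb} TB")
```

Under that assumption, the listed expansions cover half of the 4 PB goal.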
Raw data, or data that is no longer
actively used, is archived to tape
to reduce costs
DDN GS14K NAS filer with 1024 TB of usable storage
• 24 GB/s aggregate performance
• Scalable to 10 PB in a single system
• Supports data acquisition and analysis
~20 instruments with 10 Gb/s
(~10 max concurrently, ~2 TB per run at 150 MB/s each, or 1.5 GB/s combined)
~30 servers/VMs/analysis computers with 10 Gb/s NFS/CIFS
(~20 max at one time, running at 200-300 MB/s each, or 4-6 GB/s combined)
~10 analysis computers with 10 Gb/s NFS/CIFS
(~10 max at one time, running at 200-300 MB/s each, or 2-3 GB/s combined)
IBM V7000 Unified storage system
• 80 TB of block storage
• Mission-critical storage such as home directories
and group shares
• Performance requirements are much lower than for the
high-performance storage system
~800 client computers that rely on
mapped home directories and
group shares for less intensive
workflows
Workflow – Overview
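The combined load figures on this slide can be reproduced, and compared with the 24 GB/s aggregate rating of the NAS filer, with a short calculation. This is a rough sketch using only the numbers quoted above; decimal units (1 GB/s = 1000 MB/s) are an assumption, and the labels are illustrative.

```python
# Each entry: (label, max concurrent clients, (low, high) MB/s per client).
# All figures are taken from the slide.
WORKLOADS = [
    ("instruments", 10, (150, 150)),
    ("servers/VMs/analysis computers", 20, (200, 300)),
    ("analysis computers", 10, (200, 300)),
]

# 24 GB/s aggregate performance; decimal units are an assumption.
AGGREGATE_MB_S = 24_000


def combined_load(workloads):
    """Return the (low, high) combined throughput in MB/s."""
    low = sum(n * lo for _, n, (lo, _) in workloads)
    high = sum(n * hi for _, n, (_, hi) in workloads)
    return low, high


low, high = combined_load(WORKLOADS)
peak_share = high / AGGREGATE_MB_S

print(f"combined load: {low / 1000:.1f}-{high / 1000:.1f} GB/s")
print(f"peak share of aggregate performance: {peak_share:.0%}")
```

If all three client groups peak at once, the estimate stays below half of the rated aggregate performance.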
Workflow – Data Type Breakdown
Workflow – Typical Life Science Data Workflow @ FMI
Workflow – Data Movement
Yokogawa Cell Voyager CV7000S instrument
• High throughput screening microscope
• Live cell imaging
• Supports 6, 24, 96, 384 and 1536 well plates
• External robotics allows automated plate changing
Image acquisition plan for 384 well plates:
• 16 positions in a well, 10 depth positions, 4 channels / lasers
→ 245,760 images of 10 MB each → 2,400 GB per plate
• Time courses are planned too
Goal: Reduce the time for data transfer to the analysis
server(s) by copying data in parallel
with image acquisition
Acquisition – Instrument Example
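The per-plate data volume follows directly from the acquisition plan. The sketch below reproduces it; the 2,400 GB figure only comes out exactly when 1 GB is taken as 1024 MB, which is assumed here. The 150 MB/s rate is borrowed from the workflow overview slide and is used only to illustrate why copying after acquisition would be slow.

```python
# Acquisition plan for one 384 well plate (figures from the slide).
wells = 384
positions_per_well = 16
depth_positions = 10
channels = 4
image_size_mb = 10

images = wells * positions_per_well * depth_positions * channels
total_mb = images * image_size_mb

# Assumption: 1 GB = 1024 MB, which matches the slide's 2,400 GB.
total_gb = total_mb / 1024

# Assumption: ~150 MB/s per instrument, as quoted in the workflow overview.
transfer_hours = total_mb / 150 / 3600

print(f"{images} images, {total_gb:.0f} GB per plate")
print(f"sequential copy at 150 MB/s: about {transfer_hours:.1f} h")
```

A copy of several hours per plate is the delay that copying during acquisition is meant to hide.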
• Copying in parallel to data acquisition
• Copying to multiple destinations (VM / CLUSTER, NAS / ARCHIVING)
• User dependent destinations
• Removal of source data after copy completion (after some delay, to free
up SSD space)
[Diagram labels: Instrument computer; Internal SSD; External, local storage; VM / CLUSTER; NAS / ARCHIVING]
Acquisition – Data Copying
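A minimal sketch of two of the points above, copying to multiple destinations and removing the source only after the copies are complete, might look as follows. It is not the tool described in the talk: the function names, the file-count check, and the thread-based concurrency are assumptions made for illustration, and detecting finished files while acquisition is still running is not shown.

```python
import shutil
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_to_targets(source, targets):
    """Copy the folder `source` into every target root concurrently."""
    def copy_one(target_root):
        destination = target_root / source.name
        shutil.copytree(source, destination, dirs_exist_ok=True)
        return destination

    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        # list() re-raises the first copy error, if there was one.
        return list(pool.map(copy_one, targets))


def file_count(folder):
    """Count regular files below `folder`."""
    return sum(1 for p in folder.rglob("*") if p.is_file())


def copy_then_remove(source, targets, removal_delay_s=0.0):
    """Remove the source only if every copy holds the same number of files."""
    expected = file_count(source)
    copies = copy_to_targets(source, targets)
    if all(file_count(c) == expected for c in copies):
        time.sleep(removal_delay_s)  # "after some delay"
        shutil.rmtree(source)
    return copies


# Demonstration in a throwaway directory (hypothetical folder names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    run = root / "ssd" / "User1"
    run.mkdir(parents=True)
    (run / "img_0001.tif").write_bytes(b"\0" * 16)

    copies = copy_then_remove(run, [root / "vm", root / "nas"])
    all_copied = all((c / "img_0001.tif").is_file() for c in copies)
    source_removed = not run.exists()

print(f"copies complete: {all_copied}, source removed: {source_removed}")
```

A production tool would likely compare checksums rather than file counts before deleting anything from the instrument's SSD.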
• GUI
• Highly configurable
• Default configuration in section [main]
• Customizable settings dependent on the data source path
• Supports copying of source data to up to five target locations
• Copies to local and / or remote locations
• Connects to remote shares via username and password
• Removes source data (optional)
• Logging
Source Area e.g.: E:\Acquisition or E:\
Data source e.g.: E:\User1, E:\User2\folder_20160330 (if source removal is on, this location will be deleted)
Relative data path e.g.: User1, User2\folder_20160330
Copied data will result in folder structure {target_path}\{relative data path}
Target1_Path_default e.g.: F:\Backup → F:\Backup\User1, F:\Backup\User2\folder_20160330
Target2_Path_default e.g.: \\SERVER1\cv7000s$ → \\SERVER1\cv7000s$\User1,
\\SERVER1\cv7000s$\User2\folder_20160330
Acquisition – Procedure
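The mapping from a data source to its target folders can be sketched with Python's standard `configparser` and `pathlib` modules. The `[main]` section and the `Target1_Path_default` / `Target2_Path_default` keys come from the slide; the `Source_Area` key name, the backslash path separators, and the `target_paths` helper are assumptions for illustration and not the tool's real configuration format.

```python
import configparser
from pathlib import PureWindowsPath

# Hypothetical configuration text modelled on the slide's examples.
CONFIG_TEXT = r"""
[main]
Source_Area = E:\
Target1_Path_default = F:\Backup
Target2_Path_default = \\SERVER1\cv7000s$
"""

# interpolation=None keeps characters such as '$' and '%' literal.
config = configparser.ConfigParser(interpolation=None)
config.read_string(CONFIG_TEXT)
main = config["main"]


def target_paths(data_source, section):
    """Return {target_path}\\{relative data path} for each configured target."""
    source_area = PureWindowsPath(section["Source_Area"])
    relative = PureWindowsPath(data_source).relative_to(source_area)
    targets = []
    for i in range(1, 6):  # the slide mentions up to five target locations
        key = f"Target{i}_Path_default"
        if key in section:
            targets.append(str(PureWindowsPath(section[key]) / relative))
    return targets


for source in (r"E:\User1", r"E:\User2\folder_20160330"):
    print(source, "->", target_paths(source, main))
```

`PureWindowsPath` handles drive letters and UNC shares without touching the file system, so the sketch also runs on non-Windows machines.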
Business need:
• Exchange of large amounts of data with external collaborators (10 to >100 GB),
e.g. single large files, archives (zip, tar.gz), or many variable-sized files
Complementary to
• File exchange
  • Typically used for fewer, smaller files (≤ 2 GB)
• FMI Dropbox account (ask Dean for details)
  • Restricted accessibility (currently)
SFTP (SSH File Transfer Protocol or Secure File Transfer Protocol)
• Secure, i.e. encrypted
• Command-line and easy-to-use GUI clients exist for all operating systems commonly used at FMI
Remote Transfer – SFTP Temporary Accounts
• First version was manual, scripted (Linux, bash), initiated by email or ticket request
• Now: web-based request form with automated account creation
1. Intranet → Services → Informatics → Quicklinks section: Tools → SFTP account request
2. From within FileExchange
3. URL: https://webapp.fmi.ch/it/manageshared/sftp.aspx (requires login)
Remote Transfer – SFTP Account Creation (I)
Account details are sent by email:
Dear Grzybek, Stefan,
A temporary account on sftp.fmi.ch has been created for your data exchange:
account xferfmi-yucoba created with password "IftOs:quocs4", account expires on 2015-12-04
The account and all of its data will be automatically deleted at the indicated date.
Please communicate the server and account name, as well as the password (use it without
the quotes) to your collaborator(s) for the data exchange.
Use an sftp client of your choice (e.g. Filezilla or WinSCP for Windows) to transfer the data
to the server sftp.fmi.ch. If your sftp client requires specification of a port, use port 22
(standard ssh / sftp port) for the communication.
Kind regards,
Your FMI IT
• Server name
• Account name (“xferfmi-” + 6 random characters)
• Password
• Account expiration time
Remote Transfer – SFTP Account Creation (II)
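The account details listed above could be generated along the following lines. Only the server name and the “xferfmi-” prefix with six random characters come from the slides; the lowercase-letter alphabet, the password scheme, the 14-day lifetime, and the function name are assumptions, and creating the actual system account is not shown.

```python
import secrets
import string
from datetime import date, timedelta

SERVER = "sftp.fmi.ch"
ACCOUNT_PREFIX = "xferfmi-"


def new_temporary_account(today, lifetime_days=14):
    """Return the details that would be mailed to the requester.

    Assumptions: the suffix uses lowercase letters only, and accounts
    live for `lifetime_days` days; neither is stated in the talk.
    """
    suffix = "".join(secrets.choice(string.ascii_lowercase) for _ in range(6))
    return {
        "server": SERVER,
        "account": ACCOUNT_PREFIX + suffix,
        "password": secrets.token_urlsafe(9),  # 12 URL-safe characters
        "expires": today + timedelta(days=lifetime_days),
    }


acct = new_temporary_account(date(2015, 11, 20))
print(f"account {acct['account']} on {acct['server']} "
      f"expires on {acct['expires'].isoformat()}")
```

The `secrets` module is used instead of `random` because account names and passwords should not be predictable.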
• Expiration notification three days in advance
• Account removal notification
Dear Grzybek, Stefan,
Your temporary account xferfmi-yucoba on sftp.fmi.ch
expires on 2015-12-04. All its data will be automatically
removed.
Please ensure that you have secured any valuable data
before the expiration date.
Reminder: the sftp-account password is "IftOs:quocs4".
Kind regards,
Your FMI IT
Dear Grzybek, Stefan,
Your temporary sftp account xferfmi-yucoba on sftp.fmi.ch
and all associated data has been removed.
Kind regards,
Your FMI IT
Remote Transfer – SFTP Account Expiration & Removal
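The expiration logic, a reminder three days in advance and removal at the expiration date, can be sketched as a decision function for a job that is assumed to run once per day. The function name and return values are hypothetical, and sending the emails and deleting the account are not shown.

```python
from datetime import date, timedelta

NOTICE_DAYS = 3  # "three days in advance", from the slide


def account_action(expires, today):
    """Decide what a once-per-day job should do for one account."""
    if today >= expires:
        return "remove"  # delete the account and all of its data
    if today == expires - timedelta(days=NOTICE_DAYS):
        return "notify"  # send the expiration reminder
    return "none"


expires = date(2015, 12, 4)
for day in (date(2015, 11, 30), date(2015, 12, 1), date(2015, 12, 4)):
    print(day.isoformat(), account_action(expires, day))
```

Because the reminder is tied to a single day, a skipped run would miss it; a real job would more likely record whether the reminder has already been sent.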
Disk Utilization – Reporting (du2rrd)
Quotas – Management
Quotas – Reporting to users/groups (email)
Quotas – Reporting to users/groups (web)
Archiving – Generic Archiving Process Overview
Archiving – Generic Archiving Workflow
Backup – Architecture
rsync currently, but what about AFM or object?
TSM with tape or object?
Backup – HSM snapshots / versioning / backup with SAMFS
Long-running full backup
blocking the differential.
Backup – Monitoring (daily status reports)
Conclusion
• Moving sucks… It is a pity there is so much
wheel re-invention in this space.
• Biologists are not physicists… We need to
improve the tools for life science research.
• Help… We are happy to learn from and work with
others to get these tools in place.
• It would be nice to have these kinds of tools
generally available for large data management…
