This document summarizes the data management challenges and solutions at the Friedrich Miescher Institute in Basel, Switzerland. It discusses how the Institute generates terabytes of data per year from various life science research technologies. It then outlines the Institute's storage architecture using DDN storage systems, data workflows from acquisition to analysis to archiving, tools for data transfer and sharing, and systems for storage management including quotas, reporting, backups and archiving. The conclusion expresses a desire to collaborate with others to improve data management tools for life science research.
1. Dealing with the Challenges of
Large Life Science Data Sets from
Acquisition to Archive
Dean Flanders
Friedrich Miescher Institute
Basel, Switzerland
DDN User Group Meeting @ ISC 2017
June 20, 2017
4. • Internationally recognized as a center of
excellence in biomedical research with a
strong record of innovation in the molecular
biology of disease.
• Devoted to fundamental biomedical research.
• Current research focuses on the study of
genetics, cancer, and neurobiology with 360
researchers.
• A wide variety of research technologies is used,
generating terabytes of data per year.
• Funded by the Novartis Research
Foundation.
Founded in 1970
Located in Basel, Switzerland
Introduction – Friedrich Miescher Institute
5. Data center 1 (primary), data center 2 (DR), data center 3 (backup)
CIFS / NFS / GPFS (1024 TB usable initially, growing to 4 PB in ~3 years)
Client groups (count × per-client throughput): ~10 × ~300 MB/s, ~800 × ~50 MB/s, ~30 × ~500 MB/s, ~20 × ~500 MB/s
Initial 2017
- NAS: 1024 TB usable
Expansion 2017
- NAS: additional 512 TB
Expansion 2018
- NAS: additional 512 TB
Architecture – Overview (DDN GS14KX and DDN GS7KX)
6. Raw data, or data that is no longer actively used,
is archived to tape to reduce costs
DDN GS14K NAS filer, 1024 TB of usable storage
• 24 GB/s aggregate performance
• Scalable to 10 PB in a single system
• Supports data acquisition and analysis
~20 instruments with 10 Gb/s
(~10 max concurrently, ~2 TB per run
at ~150 MB/s each, or 1.5 GB/s combined)
~30 servers/VMs/analysis computers
with 10 Gb/s NFS/CIFS
(~20 max at one time, running at 200-300 MB/s
each, or 4-6 GB/s combined)
~10 analysis computers with 10 Gb/s NFS/CIFS
(~10 max at one time, running at 200-300 MB/s
each, or 2-3 GB/s combined)
IBM V7000 Unified storage system
• 80 TB of block storage
• Mission-critical storage such as home directories
and group shares.
• Performance requirements are much lower than for the
high-performance storage system.
~800 client computers that rely on
mapped home directories and
group shares for less intensive
workflows.
Workflow – Overview
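The combined figures on this slide can be re-derived with a few lines of arithmetic. The sketch below is illustrative only: the dictionary layout and names are assumptions, the client counts and per-client rates are taken from the slide, and 1 GB is treated as 1000 MB. The ~800 desktop clients are left out because the slide places them on the IBM V7000 rather than the DDN filer.

```python
# (max concurrent clients, low MB/s per client, high MB/s per client)
# Numbers come from the slide; the grouping into a dict is an assumption.
WORKLOADS = {
    "instruments": (10, 150, 150),
    "servers_vms": (20, 200, 300),
    "analysis_computers": (10, 200, 300),
}
FILER_AGGREGATE_MB_S = 24_000  # 24 GB/s aggregate, assuming 1 GB = 1000 MB


def combined_load(workloads):
    """Return (low, high) combined peak load in MB/s across all groups."""
    low = sum(count * lo for count, lo, _ in workloads.values())
    high = sum(count * hi for count, _, hi in workloads.values())
    return low, high


low, high = combined_load(WORKLOADS)
print(f"combined peak load: {low / 1000:.1f}-{high / 1000:.1f} GB/s")
print(f"share of filer aggregate at the high end: {high / FILER_AGGREGATE_MB_S:.0%}")
```

Under these assumptions the three groups together peak at 7.5 to 10.5 GB/s, which is below the 24 GB/s aggregate performance quoted for the filer.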
10. Yokogawa Cell Voyager CV7000S instrument
• High throughput screening microscope
• Live cell imaging
• Supports 6, 24, 96, 384 and 1536 well plates
• External robotics allows automated plate changing
Image acquisition plan for 384 well plates:
• 16 positions per well, 10 depth positions, 4 channels / lasers
→ 245,760 images at 10 MB each → 2400 GB per plate
• Time courses are also planned
Goal: Reduce the time needed to transfer data to the analysis
server(s) by copying data in parallel
with image acquisition
Acquisition – Instrument Example
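The acquisition plan above is a straightforward product, sketched here as a check. The function name is hypothetical, the 2400 GB figure is reproduced by dividing by 1024, and the ~150 MB/s transfer rate is borrowed from the per-instrument figure on the workflow slide as an assumption about a single sequential copy.

```python
def plate_data_volume(wells, positions_per_well, depth_positions, channels, mb_per_image):
    """Return (image count, total size in MB) for one plate."""
    images = wells * positions_per_well * depth_positions * channels
    return images, images * mb_per_image


images, total_mb = plate_data_volume(
    wells=384, positions_per_well=16, depth_positions=10, channels=4, mb_per_image=10
)
print(images)            # 245760
print(total_mb / 1024)   # 2400.0 (GB per plate, using 1 GB = 1024 MB)

# Time to copy one plate after acquisition at an assumed ~150 MB/s.
hours = total_mb / 150 / 3600
print(f"{hours:.1f} h")  # 4.6 h
```

A sequential copy of several hours per plate is the delay that copying in parallel with acquisition is meant to avoid.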
11. • Copying in parallel to data acquisition
• Copying to multiple destinations (VM / CLUSTER, NAS / ARCHIVING)
• User dependent destinations
• Removal of source data after copy completion (with some delay, to free
up SSD space)
Diagram: instrument computer (internal SSD) copies to external local storage, VM / CLUSTER, and NAS / ARCHIVING
Acquisition – Data Copying
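The copy behaviour listed on this slide can be sketched as follows. This is not the FMI tool; it is a minimal illustration under stated assumptions: the function name is hypothetical, the size comparison stands in for whatever verification the real tool performs, and the delay before source removal is omitted.

```python
import shutil
from pathlib import Path


def copy_to_targets(source: Path, targets: list[Path], remove_source: bool = False) -> list[Path]:
    """Copy one file to every target directory, then optionally delete the source."""
    copies = []
    for target_dir in targets:
        target_dir.mkdir(parents=True, exist_ok=True)
        destination = target_dir / source.name
        shutil.copy2(source, destination)  # copy2 also preserves timestamps
        # Cheap sanity check (an assumption); a production tool would likely compare checksums.
        if destination.stat().st_size != source.stat().st_size:
            raise IOError(f"size mismatch for {destination}")
        copies.append(destination)
    # The source is removed only after every copy has succeeded.
    if remove_source:
        source.unlink()
    return copies
```

Deleting the source only once all destinations are written means a failed copy never leaves the instrument's SSD as the sole missing link.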
12. • GUI
• Highly configurable
• Default configuration in section [main]
• Customizable settings dependent on data source path
• Supports copying of source data to up to five target locations
• Copies to local and / or remote locations
• Connect to remote shares via username and password
• Remove source data (optional)
• Logging
Source area, e.g.: E:\Acquisition\ or E:\
Data source, e.g.: E:\User1, E:\User2\folder_20160330 (if source removal is on, this location will be deleted)
Relative data path, e.g.: User1, User2\folder_20160330
Copied data will result in the folder structure {target_path}{relative data path}
Target1_Path_default, e.g.: F:\Backup\ → F:\Backup\User1, F:\Backup\User2\folder_20160330
Target2_Path_default, e.g.: \\SERVER1\cv7000s$\ → \\SERVER1\cv7000s$\User1, \\SERVER1\cv7000s$\User2\folder_20160330
Acquisition – Procedure
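The mapping from source area, data source and target path to the final folder structure can be illustrated with Python's `pathlib`. This is a sketch, not the tool's code: the function name is hypothetical, and the example paths assume Windows-style backslash separators and a UNC share for the second target.

```python
from pathlib import PureWindowsPath


def target_paths(source_area: str, data_source: str, targets: list[str]) -> list[str]:
    """Append the data source's path relative to the source area onto each target path."""
    relative = PureWindowsPath(data_source).relative_to(PureWindowsPath(source_area))
    return [str(PureWindowsPath(target) / relative) for target in targets]


for path in target_paths(
    source_area="E:\\",
    data_source="E:\\User2\\folder_20160330",
    targets=["F:\\Backup", "\\\\SERVER1\\cv7000s$"],
):
    print(path)
```

`PureWindowsPath` only manipulates path strings, so the example runs on any operating system without touching the file system.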
13. Business need:
• Exchange of large amounts of data with external collaborators (10 to >100 GB),
e.g. single large files, archives (zip, tar.gz), or many variable-sized files
Complementary to
• File exchange
• Typical use with smaller and few files (≤ 2 GB)
• FMI Dropbox account (ask Dean for details)
• Restricted accessibility (currently)
SFTP (SSH File Transfer Protocol or Secure File Transfer Protocol)
• Is safe, i.e. encrypted
• Command-line and easy-to-use GUI clients exist for all operating systems commonly used at FMI
Remote Transfer – SFTP Temporary Accounts
14. • First version was manual, scripted (Linux, bash), initiated by email or ticket request
• Now: web-based request form with automated account creation
1. Intranet → Services → Informatics → Quicklinks section: Tools → SFTP account request
2. From within FileExchange
3. URL: https://webapp.fmi.ch/it/manageshared/sftp.aspx (requires login)
Remote Transfer – SFTP Account Creation (I)
15. Account details are sent by email:
Dear Grzybek, Stefan,
A temporary account on sftp.fmi.ch has been created for your data exchange:
account xferfmi-yucoba created with password "IftOs:quocs4", account expires on 2015-12-04
The account and all of its data will be automatically deleted at the indicated date.
Please communicate the server and account name, as well as the password (use it without
the quotes) to your collaborator(s) for the data exchange.
Use an sftp client of your choice (e.g. Filezilla or WinSCP for Windows) to transfer the data
to the server sftp.fmi.ch. If your sftp client requires specification of a port, use port 22
(standard ssh / sftp port) for the communication.
Kind regards,
Your FMI IT
Callouts on the email: server name; account name (“xferfmi-” + 6 random characters); password; account expiration time
Remote Transfer – SFTP Account Creation (II)
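The account naming scheme shown in the callouts ("xferfmi-" plus 6 random characters) could be generated along these lines. The alphabet is an assumption based on the example name xferfmi-yucoba, which uses lowercase letters only; the function name is hypothetical and the actual FMI script is not shown in the slides.

```python
import secrets
import string

PREFIX = "xferfmi-"  # prefix taken from the slide


def new_account_name(length: int = 6) -> str:
    """Return the prefix followed by `length` random lowercase letters."""
    # Lowercase-only alphabet is an assumption inferred from the example account name.
    suffix = "".join(secrets.choice(string.ascii_lowercase) for _ in range(length))
    return PREFIX + suffix


print(new_account_name())
```

`secrets` is used instead of `random` so that account names are not predictable from earlier ones.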
16. o Expiration notification three days in advance
o Account removal notification
Dear Grzybek, Stefan,
Your temporary account xferfmi-yucoba on sftp.fmi.ch
expires on 2015-12-04. All its data will be automatically
removed.
Please ensure that you have secured any valuable data
before the expiration date.
Reminder: the sftp-account password is "IftOs:quocs4".
Kind regards,
Your FMI IT
Dear Grzybek, Stefan,
Your temporary sftp account xferfmi-yucoba on sftp.fmi.ch
and all associated data has been removed.
Kind regards,
Your FMI IT
Remote Transfer – SFTP Account Expiration & Removal
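The expiration logic on this slide amounts to a daily date check. The sketch below assumes a job that runs once per day and that removal happens on the expiration date itself; both are assumptions, as are the function and action names. Only the three-day notice period comes from the slide.

```python
from datetime import date, timedelta

NOTICE_DAYS = 3  # from the slide: expiration notification three days in advance


def lifecycle_action(expires_on: date, today: date) -> str:
    """Decide what a daily housekeeping job should do for one temporary account."""
    if today >= expires_on:
        return "remove_account_and_notify"
    if today == expires_on - timedelta(days=NOTICE_DAYS):
        return "send_expiration_notice"
    return "none"


expires = date(2015, 12, 4)  # expiration date from the example email
print(lifecycle_action(expires, date(2015, 12, 1)))  # send_expiration_notice
print(lifecycle_action(expires, date(2015, 12, 4)))  # remove_account_and_notify
```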
26. Conclusion
• Moving sucks… It is a pity there is so much
wheel re-invention in this space.
• Biologists are not physicists… We need to
improve the tools for life science research.
• Help… We are happy to learn from and work with
others to get these tools in place.
• It would be nice to have these kinds of tools
generally available for large data management…