Use-cases for the ARCHIVER project
The European Bioinformatics Institute
What is EMBL-EBI?
• Europe’s home for biological data services, research and training
• A trusted data provider for the life sciences
• Part of the European Molecular Biology Laboratory, an intergovernmental research organisation
• International: 650 members of staff from 66 nations
• Home of the ELIXIR Technical hub.
The European Molecular Biology Laboratory
(Map: EMBL's 6 sites in Europe; EMBL-EBI is based at Hinxton, Cambridge, UK)
• Our collaborative community facilitates social and scientific interactions
• This image shows internal interactions between data resources, as determined by the exchange of data
• The width of each internal arc is weighted according to the number of different data types exchanged
Increasing Data, Increasing Analysis
Storage growth at EBI
• ~40-50% per year, i.e. doubling every two years (see the projection sketch below)
• No reason to expect that to slow down
EGA and ENA account for the bulk of the data
• DNA sequences
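A quick illustration of the growth rate quoted above; the 45% figure is the midpoint of the 40-50% range and the starting volume is purely hypothetical:

```python
# Illustrative only: compound growth at ~45%/year roughly doubles the volume every two years.
annual_growth = 0.45      # midpoint of the 40-50% range above
start_pb = 100.0          # hypothetical starting volume in PB

for year in range(6):
    print(f"year {year}: {start_pb * (1 + annual_growth) ** year:7.1f} PB")

# (1 + 0.45) ** 2 ~= 2.1, i.e. a doubling roughly every two years
```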
See the live map at www.ebi.ac.uk/about/our-impact
Who uses EMBL-EBI services?
DNA sequence data
○ The bulk of our data, files from a few MB up to many tens of GB
○ With ‘long-read’ sequencing technology, we can expect file sizes to increase
○ EBI has custodial responsibility, most of our data is stored ‘forever’
○ Data is immutable (but may be versioned)
○ Assembly: stream/index whole file, then random access string matching
○ Query: byte-range lookup (see the sketch after this list)
○ POSIX, FTP, HTTP, S3…
○ Data discovery by portal lookup, dedicated portals with cross-references
○ Available without authorisation or identification – anonymous FTP
○ Apply to a committee for access, individually encrypted copy provided if granted
○ Team of people with access, varying degrees (R/O, R/W), fluctuating membership
○ Public after analysis/publication, or after time window expires
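A minimal sketch of the byte-range query pattern above, issuing an HTTP Range request; the URL and offsets are placeholders, not a real archive file:

```python
import requests

# Placeholder URL and offsets -- in practice an index built while streaming the
# whole file would supply the byte range to fetch.
url = "https://example.org/data/sequence-file.fasta"
start, end = 1_000_000, 1_004_095   # a 4 KiB slice from the middle of the file

resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
resp.raise_for_status()

if resp.status_code == 206:   # 206 Partial Content: the server honoured the range
    print(f"received {len(resp.content)} bytes")
else:                         # 200 means the server ignored the Range header
    print("server returned the whole file instead of a range")
```

The same pattern maps onto S3-style object stores, which accept an equivalent Range parameter on object reads.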
“EMBL on FIRE” - Background
The FIle REplication (FIRE) project started in the Systems and Networking team in 2008
○ Provide efficient, reliable, scalable and replicated data storage (for disaster recovery)
○ Provide a cost-effective and vendor-independent solution
○ Different storage technologies on Replica A and Replica B to mitigate possible data loss (see the verification sketch after this slide)
Projects using FIRE include:
○ 1000 Genomes (G1K)
○ European Nucleotide Archive (ENA)
○ European Genome-phenome Archive (EGA)
○ Human Induced Pluripotent Stem Cells Initiative (HIPSCI)
○ Functional Annotation of Animal Genomes (FAANG)
○ BioImaging Data Archive
Roadmap: 5-year plan to become an S3-like cloud service; later stages not yet defined
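A minimal sketch of the two-replica integrity idea behind FIRE; the paths and the choice of checksum are assumptions for illustration, not FIRE's actual implementation:

```python
import hashlib
from pathlib import Path

def checksum(path: Path, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Stream a file in chunks and return its MD5 hex digest."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical mount points for the two replicas on different storage back-ends.
replica_a = Path("/fire/replica-a/archive/object-0001")
replica_b = Path("/fire/replica-b/archive/object-0001")

if checksum(replica_a) == checksum(replica_b):
    print("replicas agree")
else:
    print("mismatch: repair from the healthy copy")
```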
“EMBL on FIRE” - Challenges
○ Can cloud-based storage offer a cost-efficient approach?
○ How do ingest rates affect this model?
○ Current use is ~1 PB download, 2 billion requests, per month (see the cost sketch after this slide)
○ As the data-volume grows, we expect users to switch to cloud-based analysis platforms.
How can we effectively distribute/present the data for analysis?
○ Need a hybrid/multi-cloud model that blurs the boundaries between on-premises and cloud resources
○ Long tail of analysis, effectively no ‘cold data’ -> tiered storage not a panacea
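A back-of-envelope sketch of the cost question above, using the usage figures on this slide; the per-GB and per-request prices are assumed placeholders, not quotes from any provider:

```python
# All prices are illustrative assumptions, not real cloud pricing.
egress_pb_per_month = 1.0             # ~1 PB downloaded per month (figure above)
requests_per_month = 2_000_000_000    # ~2 billion requests per month (figure above)

price_per_gb_egress = 0.05            # assumed $/GB egress
price_per_million_requests = 0.40     # assumed $/1M GET requests

egress_cost = egress_pb_per_month * 1_000_000 * price_per_gb_egress
request_cost = requests_per_month / 1_000_000 * price_per_million_requests

print(f"egress:   ${egress_cost:,.0f}/month")    # $50,000/month at these assumptions
print(f"requests: ${request_cost:,.0f}/month")   # $800/month at these assumptions

# At these assumed prices egress dominates, which is one reason to move analysis
# to where the data already lives rather than repeatedly downloading it.
```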
Caching in the cloud
○ Increasing data volumes strain our in-house compute resources
○ Many of our data products have regular release cycles, e.g. quarterly
○ Downstream processing becoming a bottleneck, unable to keep up
○ Bandwidth for access to data
○ Some workflows require specialized hardware, e.g. >> 1 TB RAM
○ Prefer to move to the cloud as soon as is cost-effective
○ Hybrid-cloud model, extend on-premises resources transparently into multiple clouds
Caching in the cloud
(Network diagram: JANET – UK Academic Network)
Caching in the cloud
Which data do we cache?
○ Which data is most likely to be used in the future? When?
○ Half our data is less than 2 years old
○ Long tail of analysis, not use-once-and-forget
○ Need monitoring of access patterns and knowledge of file relationships
○ Some knowledge of a priori requirements, but not complete
How much data to cache?
○ Trade-off long-term caching vs. cost of upload/download of data and available bandwidth (see the sketch after this list)
○ Instrument workflows with caching hints?
○ Process-mining to determine which files are used in what manner for a given workflow?
○ How much can we automate this vs. requiring the user to tell us?
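A minimal sketch of the cache-vs-re-download trade-off above; all costs and access-rate figures are hypothetical assumptions:

```python
def keep_in_cache(size_gb: float,
                  expected_accesses_per_month: float,
                  storage_cost_per_gb_month: float = 0.02,   # assumed cache storage price
                  refetch_cost_per_gb: float = 0.05) -> bool:  # assumed transfer price
    """Keep a file cached if storing it is cheaper than re-fetching it on demand."""
    cost_if_cached = size_gb * storage_cost_per_gb_month
    cost_if_refetched = size_gb * refetch_cost_per_gb * expected_accesses_per_month
    return cost_if_cached < cost_if_refetched

# A 50 GB reference file read twice a month is worth keeping in the cache:
print(keep_in_cache(50, expected_accesses_per_month=2))       # True
# The same file read roughly once a year is cheaper to re-fetch:
print(keep_in_cache(50, expected_accesses_per_month=1 / 12))  # False
```

In practice the expected access rate would come from the access-pattern monitoring and workflow caching hints mentioned above.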
Caching in the cloud and FIRE?
Cache vs. archive:
○ Cache lifetime goes to infinity -> archive
○ Need a process that can evolve over time, over many orders of magnitude
○ Tools & technologies may change, must be fluid
○ Ingest + download with multiple clients, rate ~PB/month
○ Clients distributed around several clouds, several locations
○ Byte-range download for subsets of large files
○ Sustained functionality over long periods – days, not minutes
○ Test RBAC functionality, reliability, usability, latency (e.g. if eventually consistent; a read-after-write probe is sketched below)
○ Ability to get near-realtime ‘cost’ reports, predictions, alerts, breakdowns…
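A minimal sketch of one such latency test, measuring read-after-write delay against an S3-compatible endpoint; the endpoint, bucket and key are placeholders, and boto3 would need to be configured with real credentials:

```python
import time
import boto3
from botocore.exceptions import ClientError

# Placeholders: any S3-compatible endpoint and test bucket would do.
s3 = boto3.client("s3", endpoint_url="https://s3.example.org")
bucket, key = "archiver-test-bucket", "consistency-probe.txt"

s3.put_object(Bucket=bucket, Key=key, Body=b"probe")
written_at = time.monotonic()

# Poll until the freshly written object becomes readable; on an eventually
# consistent store this interval is the read-after-write latency we record.
while True:
    try:
        s3.get_object(Bucket=bucket, Key=key)
        break
    except ClientError:
        time.sleep(0.1)

print(f"read-after-write latency: {time.monotonic() - written_at:.2f} s")
```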
○ Data growing fast, ~doubling every two years
○ Don’t expect this to slow down anytime soon
○ Cloud-migration for user community just beginning
○ Actively pushing to accelerate this
○ Need a hybrid/multi-cloud storage solution
○ Flexible, performant, cost-effective