
Embl ebi use-cases_-_t.wildish



Published in: Data & Analytics

  1. 1. Use-cases for the ARCHIVER project. The European Bioinformatics Institute. Tony Wildish
  2. 2. What is EMBL-EBI? • Europe’s home for biological data services, research and training • A trusted data provider for the life sciences • Part of the European Molecular Biology Laboratory, an intergovernmental research organisation • International: 650 members of staff from 66 nations • Home of the ELIXIR Technical hub.
  3. 3. Our mission • Deliver excellent research • Train the next generation of scientists • Engage with industry • Coordinate bioinformatics in Europe • Deliver scientific services
  4. 4. The European Molecular Biology Laboratory • 6 sites in Europe • >1700 personnel • 80+ nationalities • Heidelberg, Germany: Main Laboratory • Barcelona, Spain: Tissue Biology, Disease Modeling • Hinxton, Cambridge, UK: Bioinformatics • Rome, Italy: Mouse Biology • Grenoble, France: Structural Biology • Hamburg, Germany: Structural Biology
  5. 5. Data resources at EMBL-EBI
  6. 6. Database interactions • Our collaborative community facilitates social, scientific and technical interactions • This image shows internal interactions between data resources, as determined by the exchange of data. • The width of each internal arc is weighted according to the number of different data types exchanged.
  7. 7. Increasing Data, Increasing Analysis
      Storage growth at EBI
        • ~40-50% per year
        • i.e. doubling every two years
        • No reason to expect that to slow down
      EGA and ENA account for the bulk of the data
        • DNA sequences
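The link between the quoted annual growth and the doubling time can be checked with a few lines of arithmetic. This is a minimal sketch that assumes constant compound growth; the function name is illustrative and not from the slides.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


# The slide quotes ~40-50% growth per year.
for rate in (0.40, 0.50):
    print(f"{rate:.0%} per year -> doubles in {doubling_time_years(rate):.2f} years")
```

Under this assumption, 40% per year gives a doubling time of about 2.06 years and 50% gives about 1.71 years, so "doubling every two years" corresponds to the lower end of the quoted range.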
  8. 8. See the live map at Who uses EMBL-EBI services?
  9. 9. Where does our data come from?
  10. 10. Data characteristics
      DNA sequence data
        ○ The bulk of our data, with files from a few MB up to many tens of GB
        ○ With ‘long-read’ sequencing technology, can we expect file sizes to increase?
      Lifetime
        ○ EBI has custodial responsibility; most of our data is stored ‘forever’
        ○ Data is immutable (but may be versioned)
      Analyses
        ○ Assembly: stream/index the whole file, then random-access string matching
        ○ Query: byte-range lookup
      Access
        ○ POSIX, FTP, HTTP, S3…
        ○ Data discovery by portal lookup, dedicated portals with cross-references
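The byte-range lookup mentioned under "Query" maps naturally onto an HTTP Range request when the data is served over HTTP or an S3-style interface. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and whether a given server honours Range requests is an assumption that the code checks rather than takes for granted.

```python
import urllib.request


def range_header(start: int, length: int) -> str:
    """Build an HTTP Range header value for `length` bytes starting at `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # The end offset in a Range header is inclusive.
    return f"bytes={start}-{start + length - 1}"


def fetch_range(url: str, start: int, length: int) -> bytes:
    """Download only the requested slice of a remote file."""
    request = urllib.request.Request(
        url, headers={"Range": range_header(start, length)}
    )
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content means the range was honoured;
        # 200 would mean the server sent the whole file instead.
        if response.status != 206:
            raise RuntimeError(
                f"server did not honour the Range header (status {response.status})"
            )
        return response.read()


# Hypothetical usage (the URL is a placeholder, not a real EMBL-EBI endpoint):
# chunk = fetch_range("https://example.org/path/to/large-file", 1024, 4096)
```

For files of many tens of GB, a query that needs only a small region transfers a few kilobytes this way instead of the whole object.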
  11. 11. Privacy, security
      Public
        ○ Available without authorisation or identification – anonymous FTP
      Private, secure
        ○ Apply to a committee for access, individually encrypted copy provided if granted
      Collaboration
        ○ Team of people with access, varying degrees (R/O, R/W), fluctuating membership
      Embargo
        ○ Public after analysis/publication, or after time window expires
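The embargo rule on this slide is simple enough to state as code. This is a sketch only: the `Dataset` class, its field names, and the convention that data becomes public on the `embargo_until` date itself are assumptions made for illustration, not a description of any EMBL-EBI system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Dataset:
    """Hypothetical record for an embargoed dataset."""
    name: str
    published: bool = False               # analysis/publication has happened
    embargo_until: Optional[date] = None  # first day the data may be public


def embargo_lifted(dataset: Dataset, today: date) -> bool:
    """Public after publication, or once the time window has expired."""
    if dataset.published:
        return True
    return dataset.embargo_until is not None and today >= dataset.embargo_until
```

A dataset with neither a publication flag nor an expiry date stays embargoed under this sketch, which is the conservative default.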
  12. 12. “EMBL on FIRE” - Background
      The FIle REplication (FIRE) project started in the Systems and Networking team in 2008
        ○ Provide efficient, reliable, scalable and replicated data storage (for disaster recovery)
        ○ Provide a cost-effective and vendor-independent solution
        ○ Use different storage technologies on Replica A and Replica B to mitigate possible data loss
      Projects using FIRE include:
        ○ 1000 Genomes (G1K)
        ○ European Nucleotide Archive (ENA)
        ○ European Genome-phenome Archive (EGA)
        ○ Human Induced Pluripotent Stem Cells Initiative (HIPSCI)
        ○ Functional Annotation of Animal Genomes (FAANG)
        ○ BioImaging Data Archive
  13. 13. 5-year plan
      2018: Stability with 1 PB/month ingress
      2019: Become an S3-like cloud with metadata features
      2020: Ingress 2 PB/month, egress 60 PB/month
      2021: Metadata explorer, ingress 3 PB/month
      2022: Not yet defined
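To put the monthly figures in this plan in network terms, they can be converted to an average sustained bandwidth. This is a back-of-the-envelope sketch that assumes decimal petabytes (10^15 bytes), a 30-day month, and perfectly even traffic; real peaks would be higher.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def sustained_gbit_per_s(petabytes_per_month: float) -> float:
    """Average bandwidth needed to move a monthly volume, in Gbit/s."""
    bits = petabytes_per_month * 1e15 * 8  # assumption: 1 PB = 10**15 bytes
    return bits / SECONDS_PER_MONTH / 1e9


# Figures as listed in the plan.
plan = [("2018 ingress", 1), ("2020 ingress", 2), ("2020 egress", 60), ("2021 ingress", 3)]
for label, pb in plan:
    print(f"{label}: {pb} PB/month ~ {sustained_gbit_per_s(pb):.1f} Gbit/s")
```

Under these assumptions, 1 PB/month is roughly 3 Gbit/s sustained, and the 60 PB/month egress target is on the order of 185 Gbit/s.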
  14. 14. “EMBL on FIRE” - Challenges
      Cost-effective scaling:
        ○ Can cloud-based storage offer a cost-efficient approach?
        ○ How do ingest rates affect this model?
        ○ Current use is ~1 PB downloaded and 2 billion requests per month
      Cost-effective analysis:
        ○ As the data volume grows, we expect users to switch to cloud-based analysis platforms. How can we effectively distribute/present the data for analysis?
        ○ Need a hybrid/multi-cloud model that blurs the boundaries between on-premises and public cloud
        ○ Long tail of analysis, effectively no ‘cold data’ -> tiered storage not a panacea
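The current-use figures (~1 PB downloaded and 2 billion requests per month) imply an average request size and request rate, both of which matter for cloud pricing models that charge per request as well as per byte. A rough sketch, assuming decimal units and a 30-day month:

```python
def request_profile(bytes_per_month: float, requests_per_month: float, days: int = 30) -> dict:
    """Average request size and request rate implied by monthly totals."""
    seconds = days * 24 * 3600
    return {
        "avg_request_bytes": bytes_per_month / requests_per_month,
        "requests_per_second": requests_per_month / seconds,
    }


# ~1 PB (taken as 10**15 bytes) and 2 billion requests per month, from the slide.
profile = request_profile(1e15, 2e9)
print(f"average request: {profile['avg_request_bytes'] / 1e3:.0f} kB")
print(f"request rate:    {profile['requests_per_second']:.0f} requests/s")
```

The average works out to about 500 kB per request at several hundred requests per second, which suggests that many requests are small or partial reads rather than whole-file downloads. That reading is an inference from the averages, not something the slides state.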
  15. 15. Caching in the cloud
      Why?
        ○ Increasing data volumes strain our in-house compute resources
        ○ Many of our data products have regular release cycles, e.g. quarterly
        ○ Downstream processing becoming a bottleneck, unable to keep up
        ○ Bandwidth for access to data
        ○ Some workflows require specialized hardware, e.g. >> 1 TB RAM
        ○ Prefer to move to the cloud as soon as is cost-effective
      How?
        ○ Hybrid-cloud model, extend on-premises resources transparently into multiple clouds
  16. 16. Caching in the cloud [Architecture diagram; labels: EMBL-EBI Data Centre Space, JANET (UK Academic Network), Public Clouds, Clusters, NFS, Object Store, Cache, Research Team, Service Team, Public Service, Users]
  17. 17. Caching in the cloud
      Which data do we cache?
        ○ Which data is most likely to be used in the future? When?
        ○ Half our data is less than 2 years old
        ○ Long tail of analysis, not use-once-and-forget
        ○ Need monitoring of access patterns and knowledge of file relationships
        ○ Some knowledge of a-priori requirements, but not complete
      How much data to cache?
        ○ Trade-off long-term caching vs. cost of upload/download of data, available bandwidth
      Cache lifetime?
        ○ Instrument workflows with caching hints?
        ○ Process-mining to determine which files are used in what manner for a given workflow?
        ○ How much can we automate this vs. requiring the user to tell us?
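The trade-off between long-term caching and repeated transfer can be framed as a break-even calculation. The sketch below is deliberately simplified: it assumes a flat monthly storage price, a flat per-GB transfer price, one full transfer per read when the file is not cached, and it ignores the initial upload and bandwidth limits. All names and prices are hypothetical placeholders, not quotes from any provider.

```python
def break_even_reads_per_month(storage_cost_per_gb_month: float,
                               transfer_cost_per_gb: float) -> float:
    """Reads per month above which keeping a file cached is cheaper."""
    return storage_cost_per_gb_month / transfer_cost_per_gb


def cheaper_to_cache(size_gb: float,
                     expected_reads_per_month: float,
                     storage_cost_per_gb_month: float,
                     transfer_cost_per_gb: float) -> bool:
    """Compare the monthly cost of caching a file with re-transferring it per read."""
    cost_cached = size_gb * storage_cost_per_gb_month
    cost_uncached = size_gb * transfer_cost_per_gb * expected_reads_per_month
    return cost_cached < cost_uncached


# Hypothetical prices, for illustration only.
STORAGE = 0.02   # currency units per GB per month
TRANSFER = 0.08  # currency units per GB transferred

print(break_even_reads_per_month(STORAGE, TRANSFER))
print(cheaper_to_cache(100, expected_reads_per_month=1,
                       storage_cost_per_gb_month=STORAGE,
                       transfer_cost_per_gb=TRANSFER))
```

The hard part, as the slide notes, is estimating the expected reads per month, which is where access-pattern monitoring, workflow hints, or process-mining would feed in.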
  18. 18. Caching in the cloud and FIRE?
      Cache vs. archive:
        ○ Cache lifetime goes to infinity -> archive
      Moving target
        ○ Need a process that can evolve over time, over many orders of magnitude
        ○ Tools & technologies may change, must be fluid
  19. 19. Testing plans
      Functionality
        ○ Ingest + download with multiple clients, rate ~PB/month
        ○ Clients distributed around several clouds, several locations
        ○ Byte-range download for subsets of large files
      Performance
        ○ Sustained functionality over long periods – days, not minutes
      Security
        ○ Test RBAC functionality, reliability, usability, latency (e.g. if eventually consistent)
      Accounting, billing
        ○ Ability to get near-realtime ‘cost’ reports, predictions, alerts, breakdowns…
  20. 20. Summary
        ○ Data growing fast, ~doubling every two years
        ○ Don’t expect this to slow down anytime soon
        ○ Cloud-migration for user community just beginning
        ○ Actively pushing to accelerate this
        ○ Need a hybrid/multi-cloud storage solution
        ○ Flexible, performant, cost-effective