GlobusWorld 2021: Managing Genomics Data at the DOE Joint Genome Institute

Describes how the Joint Genome Institute (JGI) is addressing the challenges it faces in storing and managing the rapidly growing volume of -omics data. Presented at the GlobusWorld 2021 conference by Kjiersten Fagnan.

  1. Managing the genomics data deluge at the DOE Joint Genome Institute. Kjiersten Fagnan, CIO, JGI
  2. The DOE Joint Genome Institute at a glance
     JGI mission: to provide the global research community with free access to the most advanced integrative genome science capabilities in support of the DOE energy and environmental research mission.
     Integrative Genomics Building (IGB). U.S. Department of Energy Office of Science User Facility.
     ● JGI established in 1997; user facility since 2004
     ● Located at Lawrence Berkeley National Laboratory
     ● ~285 staff; ~$80M annual funding
     ● 2,038 global primary users in FY20; >10,000 data users
  3. JGI History
  4. Environmental genomics will enable the Bioeconomy. (Figure labels: Genetic “Circuit”, Gene, Enzyme, Microbial Factory, DNA; chemical labels as extracted: 2 NH4 2+ CO3 2-)
  5. FY 2020 Users: 2,038 Worldwide
     Users on the map: 2,038
     Sector                     Users   Share
     Academic                   1,504   74%
     Government                   183    9%
     DOE (national labs only)     161    8%
     Industry                      29    1%
     Other                        161    8%
  6. Projects Completed/Scientific Publications. (Chart panels: Cumulative Number of Projects Completed; Cumulative Number of Scientific Publications)
  7. Sequence Output. (Chart panels: Massively Parallel Short Read Sequencing, Basepairs (GB); Single Molecule Long Read Sequencing, Basepairs (GB))
  8. DOE Office of Science Public Reusable Research Data (PuRe Data): Data/Resources-at-a-Glance
  9. Deluge of Large, Complex Data Sets. JGI manages a 10+ PB data repository.
  10. Mega – Giga – Tera – Peta – Exa – Zetta – Yotta. The cost to store 1 yottabyte of data: $100 trillion*. This is just genomics data; we also want metabolomes, transcriptomes, proteomes, and image data.
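As a rough worked example of the storage-cost figure on this slide: taking the quoted $100 trillion per yottabyte at face value, and assuming decimal (SI) prefixes and linear scaling (both assumptions, since the slide's footnote is not reproduced in the transcript), the implied unit price works out as in the sketch below. The function name is illustrative only.

```python
# SI prefixes named on the slide, as powers of ten (bytes).
PREFIX_EXPONENT = {
    "mega": 6, "giga": 9, "tera": 12, "peta": 15,
    "exa": 18, "zetta": 21, "yotta": 24,
}

# The slide's figure: $100 trillion to store one yottabyte.
COST_PER_YOTTABYTE_USD = 100 * 10**12


def storage_cost_usd(amount, prefix):
    """Cost of `amount` <prefix>bytes, scaling the slide's figure linearly.

    Linear scaling is an assumption made for illustration; real storage
    pricing depends on media, redundancy, and retention.
    """
    n_bytes = amount * 10 ** PREFIX_EXPONENT[prefix]
    return n_bytes * COST_PER_YOTTABYTE_USD / 10**24


per_terabyte = storage_cost_usd(1, "tera")      # implied price of one terabyte
ten_petabytes = storage_cost_usd(10, "peta")    # a 10 PB repository at that price
```

Under these assumptions the slide's figure corresponds to $100 per terabyte, so a 10 PB repository would cost about $1 million at the same rate.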
  11. The Immense Scale of Omics Data. Advances in sequencing and omics technologies have far outpaced data infrastructure. How do we remove the barriers to data access and analysis at scale?
  12. Data Management is Critical. In 2013, JGI deployed a hierarchical data management system to deal with the exponential growth in sequence data and analysis products. (Diagram labels: PMO, SDM, QAQC/RQC, GAAG, Plant, MEP, RnD, Fungal, Genome Portal, IMG, MGM, External Collaborators, Web Services (Mycocosm, Phytozome, IMG/M ER))
  13. JGI Archive and Metadata Organizer (JAMO). (Diagram labels: GAAG, Plant, MEP, RnD, Fungal, IMG, MGM, SDM, QAQC/RQC, Web Services (Mycocosm, Phytozome, IMG/M ER), Genome Portal, External Collaborators, PMO)
  14. JAMO’s Back-end Infrastructure
  15. JAMO Enabled Increased Automation Between Groups
      • JGI’s core pipelines connect with JAMO and provide metadata through templates
      • Once data is available for processing, the workflows are triggered automatically
      • Data that fails QC is flagged for review
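The automation pattern on this slide (metadata supplied through a template, workflows triggered when data arrives, QC failures flagged) can be sketched as follows. This is a toy illustration, not JAMO's actual interface: the `Repository` class, the `REQUIRED_KEYS` template, and the field names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical template: the metadata keys a pipeline must supply with a file.
REQUIRED_KEYS = {"sample_id", "file_type", "qc_passed"}


@dataclass
class Repository:
    """Toy stand-in for a metadata store; not JAMO's real interface."""

    triggered: list = field(default_factory=list)  # files handed to a workflow
    flagged: list = field(default_factory=list)    # files awaiting manual review

    def register(self, path, metadata):
        # Reject submissions that do not fill in the whole template.
        missing = REQUIRED_KEYS - set(metadata)
        if missing:
            raise ValueError(f"metadata missing keys: {sorted(missing)}")
        if metadata["qc_passed"]:
            self.triggered.append(path)  # data is available: start the workflow
        else:
            self.flagged.append(path)    # failed QC: hold for review


repo = Repository()
repo.register("run1.fastq.gz", {"sample_id": "S1", "file_type": "fastq", "qc_passed": True})
repo.register("run2.fastq.gz", {"sample_id": "S2", "file_type": "fastq", "qc_passed": False})
```

After these two calls, `run1.fastq.gz` sits in `repo.triggered` and `run2.fastq.gz` in `repo.flagged`. Enforcing the template at registration time is what lets downstream steps run without a person checking each submission.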
  16. JAMO is the Backbone of JGI’s Data Portal. All the metadata used to populate the Data Portal comes from JAMO’s MongoDB.
  17. Code for America Summit Talk on JGI’s New Data Portal: “Aligning Data Across Siloed Departments”. Many government sectors have been collecting data digitally for decades, often in uncoordinated ways. In this talk we’ll explore how Truss and Joint Genome Institute partnered to break down data silos and start conversations across departments to align metadata across the organization. From establishing baseline agreements, to finding common outcomes everyone could agree upon, to bringing old data sets into the present, this talk will provide useful tools for practitioners facing challenges of data misalignment across multiple departments. The talk is on Thursday, 2:00-3:00 pm PST.
  18. Improving Search Across JGI. Metadata in one place makes search across all JGI programs possible. (Diagram labels: User Query; JGI-KBase RESTful Service; JGI Data and Metadata system including LIMS, GOLD, sequence, assemblies, annotations; Metadata and file types; Response; Data sets)
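To illustrate why centralized metadata makes cross-program search straightforward, here is a minimal in-memory sketch. The records, field names, and `search` function are hypothetical; the actual service described on the slide is RESTful, and its schema is not shown in the transcript.

```python
# Hypothetical metadata records; field names are illustrative only.
RECORDS = [
    {"file": "a.fastq.gz", "program": "Fungal", "file_type": "fastq"},
    {"file": "b.fasta", "program": "Plant", "file_type": "assembly"},
    {"file": "c.fasta", "program": "Fungal", "file_type": "assembly"},
]


def search(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


# One query spans all programs because every record shares the same fields.
assemblies = search(RECORDS, file_type="assembly")
fungal_assemblies = search(RECORDS, program="Fungal", file_type="assembly")
```

Because every program's records use the same keys, a single filter covers all of them; with siloed metadata, each program would need its own query logic.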
  19. Most of JGI’s Infrastructure is @NERSC
  20. Berkeley Lab is on a Major Fault Line. (Map label: NERSC is here!) Most samples used to generate data at JGI are unique and irreplaceable.
  21. Backing up Irreplaceable Data
      • Moved 1 PB of data to ORNL for safekeeping
      • Data migration completed in 5 days using Globus
      • Enables access to the data, but the copy is only useful with the right metadata
      (Diagram labels: Main JGI Data Repository, API, HPSS Archive, JAMO light, DTN, DTN, SUMMIT, API)
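For a sense of scale, the average rate implied by moving 1 PB in 5 days can be worked out directly. The sketch below assumes a decimal petabyte (10^15 bytes) and a transfer that runs continuously for the full 5 days; neither detail is stated on the slide.

```python
def sustained_rate(n_bytes, days):
    """Average rate in bytes per second needed to move n_bytes in `days` days."""
    return n_bytes / (days * 24 * 60 * 60)


rate = sustained_rate(10**15, 5)     # 1 PB (decimal) in 5 days
gigabytes_per_s = rate / 10**9       # roughly 2.3 GB/s
gigabits_per_s = rate * 8 / 10**9    # roughly 18.5 Gb/s
```

That is an average of about 2.3 GB/s, or roughly 18.5 Gb/s, sustained for five days.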
  22. What can you do with all that data and a supercomputer? A Gordon Bell Prize (Supercomputing) winner in 2018 used all the well-characterized publicly available data to look at the genetic underpinnings of opioid addiction. Access to large amounts of ‘omics data enables scientists to explore a broad range of hypotheses! Reference: Wayne Joubert, et al. 2018. Attacking the opioid epidemic: determining the epistatic and pleiotropic genetic architectures for chronic pain and opioid addiction. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’18). IEEE Press, Article 57, 1–14.
  23. CA has Earthquakes and Fires! We need to distribute data and analysis to maintain scientific productivity.
  24. JGI’s Centralized Workflow System
      ● JGI Analysis Workflow Service (JAWS)
      ● Need to be able to compute at multiple centers: NERSC, LBL IT, others
      ● Need to have more readily reusable and modifiable bioinformatics pipelines
      ● Need workflows to support FAIR* guidelines
      ● Objective: portable, reusable, traceable workflows on a robust platform
      *Findable, Accessible, Interoperable, Reusable
  25. Distributed Computing is Hard
      • Managing multiple user accounts
      • Different facilities have different policies
        – Batch schedulers
        – File system availability and data retention
      • Different architectures
        – CPU vs GPU
        – Local disk vs parallel file systems
        – Memory size and footprint
      • Portability is a lot of work
  26. JGI is Running Analyses Across the West Coast. (Diagram labels: JGI Centralized Workflow System; Workflow Description Language; Cromwell Workflow Manager; Common interface to access resources; Additional resources (cloud, ORNL, ANL, etc.); legend: initial testing, future)
  27. JGI is Running Analyses Across the West Coast (JGI Centralized Workflow System, Workflow Description Language)
      1. Find the data for analysis in the data management system
      2. Authenticate with Globus and transfer the data to the remote computing resource
      3. Work is executed, results are generated
      4. Transfer data back to the home repository with Globus
      5. Register the data and metadata with JAMO
      Application tokens are accepted by the facilities we are using, making it possible to transfer data on behalf of the user.
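The five steps on this slide can be sketched as a small orchestration routine. This is a hypothetical outline, not JAWS code: the `RemoteRun` class and its callables are stand-ins, authentication is assumed to happen inside whatever `transfer` function is supplied, and no real Globus or JAMO calls are made.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RemoteRun:
    """Hypothetical orchestration of the five steps; every callable is a stand-in."""

    find: Callable[[str], List[str]]                       # step 1: query the data management system
    transfer: Callable[[List[str], str], List[str]]        # steps 2 and 4: move files to a named site
    execute: Callable[[List[str]], List[str]]              # step 3: run the work, return result files
    register: Callable[[List[str], Dict[str, str]], None]  # step 5: record results and metadata

    def run(self, query: str, remote_site: str, home_site: str) -> List[str]:
        inputs = self.find(query)
        staged = self.transfer(inputs, remote_site)
        results = self.execute(staged)
        returned = self.transfer(results, home_site)
        self.register(returned, {"query": query, "ran_at": remote_site})
        return returned


# Usage with in-memory stand-ins that only record what happened.
log: List[str] = []
catalog: Dict[str, Dict[str, str]] = {}


def fake_transfer(files, site):
    log.append(f"transfer to {site}")
    return [f"{site}/{name.rsplit('/', 1)[-1]}" for name in files]


def fake_register(files, metadata):
    for name in files:
        catalog[name] = metadata


job = RemoteRun(
    find=lambda query: [f"home/{query}.fastq"],
    transfer=fake_transfer,
    execute=lambda files: [name + ".report" for name in files],
    register=fake_register,
)
outputs = job.run("sample42", remote_site="remote", home_site="home")
```

Passing the steps in as callables keeps the sequence independent of any one facility, which matches the slide's point that the same workflow should run wherever the data can be staged.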
  28. Data Movement Between Resources: Globus!
      • JGI has been using Globus since ~2012 to move data around
        – One time we broke the service by trying to move millions of tiny files that were all in the same directory :D
      • Globus enables JGI collaborators to download large amounts of data
        – Biggest customers are the Bioenergy Research Centers, DOE-funded facilities investigating biofuels
        – Some JGI users are still willing to wait 9+ days for a download to complete via the browser: an education opportunity!
      • Globus is an integral part of JAWS
        – Enables the application to move data between computing resources on behalf of the user
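One common way to avoid transferring millions of tiny files from a single directory is to pack them into a smaller number of archives first. The slide does not say how JGI resolved its incident, so the sketch below is an assumption about general practice, using only Python's standard `tarfile` module; the function name and batch size are illustrative.

```python
import tarfile
from pathlib import Path


def bundle_small_files(src_dir, out_dir, files_per_archive=1000):
    """Pack the regular files in src_dir into numbered .tar archives in out_dir."""
    files = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    archives = []
    for start in range(0, len(files), files_per_archive):
        archive = out / f"bundle_{start // files_per_archive:05d}.tar"
        with tarfile.open(archive, "w") as tar:
            for path in files[start:start + files_per_archive]:
                # Store each file under its bare name so archives unpack flat.
                tar.add(path, arcname=path.name)
        archives.append(archive)
    return archives
```

Calling `bundle_small_files("reads/", "bundles/")` would leave one archive per 1,000 input files in `bundles/`, so the transfer handles a handful of large objects instead of millions of small ones.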
  29. Summary
      • JGI is a DOE User Facility that produces a lot of complex, unique data for the scientific community
      • As instruments improve, the data is of higher quality, though metadata can still be problematic
      • We’d be lost without a good data management system
      • JGI is turning to distributed computing for processing and large-scale analyses
      • Data movement is made much easier and faster with Globus
  30. Upcoming Virtual Annual Meeting/Resource Calls
      ● Aug 30 – Sept 1: 3 x 6-hour days, 2 sessions/day
        – Exploring the Universe of Specialized Metabolites
        – From Microbial Sequence to Environmental Function
        – The Many Facets of Plant-Microbial Interactions
        – Machine Learning and Artificial Intelligence for Biology
        – Integrative Omics-Inspired Plant and Microbe Engineering
        – Technology Innovations
      ● Community Science Program (CSP) Functional Genomics proposal deadline: July 31
        – Genes/Pathway synthesis
        – Strain engineering
        – Data mining
        – Metabolomics
        – RNA-seq
      ● New Investigator Call proposal deadline: Sept 15
        – Bacterial and archaeal isolates and single cell draft genomes
        – Metagenomes/metatranscriptomes
        – DNA synthesis- and Metabolomics-based functional analysis