CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned from the Human Cell Atlas and other federated data projects

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 825775
Data Gravity in the Life Sciences: Lessons learned from the Human Cell Atlas and other federated data projects
Presenter: Tony Burdett (EMBL-EBI)
Host: Marta Lloret Llinares (EMBL-EBI)

This webinar is being recorded

Audience Q&A Session
Please write your questions in the questions window of the GoToWebinar application

The challenges:
Stay informed
@CinecaProject
www.cineca-project.eu
Common Infrastructure for National Cohorts in Europe, Canada and Africa
The vision:
Accelerating disease research and improving health by facilitating transcontinental human data exchange
This project has received funding from the Canadian Institutes of Health Research under grant agreement #404896

Today’s presenter
Tony Burdett leads the Archival Infrastructure and Technology team,
which develops services and provides technology to support the
activities of EMBL-EBI’s molecular archives, including data submission,
storage, validation, coordination and presentation.
Tony joined EMBL-EBI in 2005 and has personally built and led
development teams for many resources such as the GWAS Catalog,
ArrayExpress, the Expression Atlas and BioSamples. His team now
develops the ingestion service for the Human Cell Atlas Data
Coordination Platform, EMBL-EBI’s Unified Submission Interface, and the
BioSamples database.

Data Gravity in the Life Sciences
Lessons learned from the Human Cell Atlas and other federated data projects
Tony Burdett, EMBL-EBI
12th November, 2020

A bit about me…
• I joined EBI in 2005
• I have a biological and medical background
• My career has been heavily focused on service engineering in bioinformatics
• I’ve built, helped develop, or run the development teams for…
    • ArrayExpress
    • Expression Atlas
    • BioSamples
    • Ontology tooling
    • GWAS Catalog
    • Human Cell Atlas DCP

Data Gravity
I didn’t coin the term...
https://datagravitas.com/2010/12/07/data-gravity-in-the-clouds/

Data Gravity
$$G = \frac{V R}{B C}$$
“Let data gravity of a given dataset, G, be the product of data volume, V and the regulatory restrictions of the region in which the data was generated, R, over the bandwidth at the location of the data, B, and the cost of compute in that location, C”
Background photo created by rawpixel.com - www.freepik.com
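
The definition of G can be written as a small function to compare datasets. This is only an illustrative sketch: the slides give no units or scales for V, R, B and C, so the function name `data_gravity` and all input values below are hypothetical, unitless scores chosen for illustration.

```python
def data_gravity(volume, regulation, bandwidth, compute_cost):
    """Return G = (V * R) / (B * C), following the quoted definition.

    All arguments are hypothetical, unitless scores; the slides do not
    define units or scales for any of the four factors.
    """
    if bandwidth <= 0 or compute_cost <= 0:
        raise ValueError("bandwidth and compute_cost must be positive")
    return (volume * regulation) / (bandwidth * compute_cost)


# Hypothetical comparison: a small, lightly regulated dataset versus a
# large, tightly regulated one at a site with the same bandwidth and
# compute cost.
small_open = data_gravity(volume=1, regulation=1, bandwidth=10, compute_cost=1)
large_restricted = data_gravity(volume=100, regulation=5, bandwidth=10, compute_cost=1)
print(small_open, large_restricted)
```

Under this definition, G grows with data volume and regulatory restriction, and shrinks as bandwidth or the cost of compute at the data's location increases.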

Why does “data gravity” matter?

Changing Genomic Data Generation Landscape
Percentage of whole genomes and exomes that are funded solely by healthcare systems:
• 2012: ~1%
• 2017: ~20%
• 2022: >80%

Big Data in Digital Biology: EMBL-EBI 2015-2019
Public Web Infrastructure
• Web Requests: 27M → 40M/day
• Unique Host IPs: 1.1M → 2.4M/month
• Web Jobs: 138M → 145M/year
• Search Requests: 272M → 551M/year
6.2 PB → 22.7 PB
1600 VMs → 3100 VMs
450 TB → 973 TB
Slide acknowledgment: Steven Newhouse

Collating Data for Analysis
Data being analysed
Cohort datasets
Reference annotation datasets
Proprietary, firewalled datasets

Bottlenecks and Barriers
FEDERATED DATA
FEDERATED WORKFLOW EXECUTION
GLOBAL FEDERATED RESEARCH PLATFORM

Bottleneck: FAIR Data
● Data and data science are core elements of health research and innovation, and of all areas of biopharma research
● The impact and reuse of data are growing rapidly, but nearly 80% of investment is spent assembling and harmonizing data
Source: Forbes article on 2016 Data Scientist Report

Cost of not having FAIR research data:
€26bn/yr in Europe
https://dx.doi.org/10.2777/02999
Impact on innovation

Bottleneck: Data Federation
• National genomics initiatives in most European countries
• Primary goal: healthcare diagnostics and personalised medicine
• Federated EGA is a harmonised platform for human data discovery, access and distribution, coordinated via the ELIXIR human data community
• Central EGA: international submissions and helpdesk
• Local EGA: host data locally, share metadata, national node for submissions and/or helpdesk
• EGA community: host data locally, share metadata

Bottleneck: Reproducible Research and Analysis
Figure courtesy of: https://esciencelab.org.uk/projects/eosclife/

CINECA - Federated Analysis
Data sources: EGA, Biobanks, CHILD, H3ABioNet, ..
WP1: Federated data discovery
- Phenotype
- Genotype
- Data use
WP2: AAI
- Europe, Canada, Africa interoperability
WP3: Cohort Level Metadata Representation
WP4: Federated research
- Federated GWAS
- Federated Genomic Analyses

Sending Compute to Data… Globally?
• Global data storage and analysis infrastructures required
• Generating truly portable analysis workflows is complex - and we don’t have good solutions yet
• Some high-powered spacecraft still need building!

Overcoming Data Gravity
DEPENDS ON...
• Costs of compute
• Network bandwidth
• Data sharing regulations
• Data volumes
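
As a rough illustration of how two of these factors (data volume and network bandwidth) interact, the sketch below estimates how long it would take to move a dataset over a link. The function name `transfer_days`, the 1 PB dataset size and the link speeds are all hypothetical assumptions rather than figures from the talk, and the estimate ignores protocol overhead, contention and storage throughput.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(volume_bytes, bandwidth_bits_per_s):
    """Idealised time, in days, to move volume_bytes over a sustained link."""
    if bandwidth_bits_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    seconds = volume_bytes * 8 / bandwidth_bits_per_s
    return seconds / SECONDS_PER_DAY


# Hypothetical example: a 1 PB dataset (10**15 bytes) over three link speeds.
one_petabyte = 1e15
for label, bits_per_s in [("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9), ("100 Gbit/s", 100e9)]:
    print(f"{label}: {transfer_days(one_petabyte, bits_per_s):.1f} days")
```

Even under these idealised assumptions, moving petabyte-scale data takes days to months, which is one reason to consider sending compute to the data instead.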

“Cloud native” is the answer!

Human Cell Atlas - profiling millions of human cells
Global effort requiring:
• Hundreds of labs
• Organ-specific data
• Disparate experimental techniques and data types
Integrating data at this scale requires next generation technology and infrastructure

Human Cell Atlas Data Coordination Platform
To bridge disparate data, tools and research from all over the world, we must bring them together in a public platform (the “HCA DCP”) that is:
• Comprehensive
• Inclusive
• Organized
• Dynamic
• Accessible
Tom Deerinck, NIGMS, NIH

How it works: the DCP data flow
• Labs contribute single-cell data
• DCP pipelines upload and process authors’ data
• Researchers access data on the portal
• Researchers find community tools to work with the data

HCA DCP Data Browser statistics from Q3 2020, from a total of 2671 data access requests
Outcomes:
• Downloads (Metadata)
• Downloads (Raw and Analysed Data)
• Checkout to Terra (to work on in analysis platform)

Lessons Learned
“Cloud native” engineering is not enough to change behaviour
• The DCP adopted a heavily “cloud native” engineering approach
• Services are somewhat traditional
    • Data archive (both raw and summary results)
    • Analysis pipeline
• Engineered with cloud technology (has no impact on users)
• All the data lives in AWS or GCP, in US-East (expensive to download)
• Analysis platform available (but underused)

Strategic Implications
Data Gravity in the life sciences tells us we need a culture change

Federating data and analysis requires:
1. Standards
2. Data provider adoption
3. Data consumer adoption
4. Understanding and considering data gravity

1. Standards
data gravity
SKILLS
INCENTIVES
COSTS

FAIR as enabler for the digital transformation
● Data providers improve their own returns by implementing the FAIR Principles, which are gaining traction in big pharma
● FAIR enables powerful new AI analytics to access data for machine learning and prediction
● Requirements
○ financial, technical, training
● Challenges
○ change the culture, show business value, achieve ‘FAIR enough’
○ sustain FAIR solutions and activities
Credit to: Ian Harrow, FAIR & OM projects
Slide credit: Susanna Sansone

COVID-19 Data Portals
https://www.covid19dataportal.org/
https://covidhub.psnc.pl/
https://covid19dataportal.se/sv/
https://covid19dataportal.jp/

Top Tips: Driving Data Consumer Adoption
1. Identify good measures of value
    • What can I do faster, cheaper, better?
    • How many people are using your cloud platform vs downloading data?
2. Start small and expand
    • Big re-engineering efforts are costly, risky, and too slow to keep up with the rate of change in the field
3. Find some exemplars
    • Are there smaller sets of data that are high value?
    • Can you pilot approaches within communities?
4. Invest in training and outreach
    • Even if data is federated and the cloud platform exists, many bioinformaticians do not have the skills to exploit them
Data Gravity
$$G = \frac{V R}{B C}$$
“Let data gravity of a given dataset, G, be the product of data volume, V and the regulatory restrictions of the region in which the data was generated, R, over the bandwidth at the location of the data, B, and the cost of compute in that location, C”

Acknowledgements
The AIT Team at EMBL-EBI

Questions?
Title: Data Gravity in the Life Sciences: Lessons learned from the Human Cell Atlas and other federated data projects
Presenter: Tony Burdett
Please write your questions in the questions window of the GoToWebinar application
