Genome sharing projects around the world, Nijmegen, Oct 29, 2015

Genome sharing projects around the world, and how you find data for your research
Fiona Nielsen, October 2015
Find me on Twitter: @glyn_dk

• In case my talk is boring…
First the take home messages…

Do not forget:
By 2025 genome research will produce as much data as Twitter/YouTube.
You do not have enough statistical power to interpret your data.
But you can improve your study design.
And you can access more data from public genome data repositories.

Data output is going up
[Chart: human genomes sequenced, 2006 to 2015. Estimated number of human genomes sequenced in 2015: 400K]
The output of human genome sequencing data is growing at exponential rates.

Population scale genome sequencing projects
Population scale genome sequencing projects have been launched all over the world.
Soon every research lab and every genetic clinic will have a DNA sequencer.

How much data do you need to publish a paper?
2001: 1 human genome
2012: 1000 Genomes (1092 genomes, since increased to ~2500)
2015:
• UK10K
• Icelandic population (2,636 + 100k imputed)
• The Cancer Genome Atlas: ~11,000 genomes
• ExAC consortium: 65,000 exomes
?

Statistically speaking, you still need tens of thousands of samples for validation.
The more severe the phenotype and the more complete the penetrance, the easier it will be for you to find your variant, but
“As the genetic complexity of the disease increases (for example, reduced penetrance and increased locus heterogeneity), issues of statistical power quickly become paramount.”
http://www.nature.com/nrg/journal/v15/n5/full/nrg3706.html
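To get a feel for why sample sizes run into the tens of thousands, here is a minimal sketch of a textbook sample-size estimate for comparing carrier frequencies between cases and controls (two-sided two-proportion z-test). The function name, the example frequencies and the defaults are illustrative assumptions, not taken from the talk or the cited paper; alpha = 5e-8 is the commonly used genome-wide significance threshold. A real study design needs more than this (see the PRO TIP about involving a statistician).

```python
from math import ceil, sqrt
from statistics import NormalDist


def samples_per_group(p_controls: float, p_cases: float,
                      alpha: float = 5e-8, power: float = 0.8) -> int:
    """Approximate individuals needed per group (cases and controls each)
    to detect a difference in carrier frequency with a two-sided
    two-proportion z-test. Illustrative helper, not from the talk."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # significance quantile
    z_beta = standard_normal.inv_cdf(power)           # power quantile
    p_bar = (p_controls + p_cases) / 2                # pooled frequency
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_controls * (1 - p_controls)
                        + p_cases * (1 - p_cases))
    ) ** 2
    return ceil(numerator / (p_cases - p_controls) ** 2)


if __name__ == "__main__":
    # Hypothetical carrier frequencies: 1% in controls vs 1.5% or 3% in cases.
    for p_cases in (0.015, 0.03):
        n = samples_per_group(0.01, p_cases)
        print(f"controls 1.0%, cases {p_cases:.1%}: {n} per group")
```

The smaller the difference between cases and controls, the faster the required sample size grows, which is the point the quote above makes about reduced penetrance and locus heterogeneity.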
But I am just looking at this one disease…

What can I do?
PRO TIP: involve a statistician early on in your study design!

How can I determine significance?
• “One potentially powerful approach is to assess conservation across and within multiple species as whole-genome sequence data become more abundant.”
• Look at extreme phenotypes: “Sampling cases or controls from the extremes of an appropriate quantitative distribution can often increase power”
• Look at non-SNP variants; they are more likely to have functional effects
• “how to account for the technical features of sequencing, such as incomplete sequencing and biased coverage over the genome?”

How to account for bias?
Think of how you can provide evidence that your result is not just a local technical variation or sampling bias, e.g. data from the same cell type, same sequencing technology, same alignment…
PRO TIP: include more reference data in your analysis

How can I access more data for my research?
• Know what data is available in your lab, your dept, your org
• A survey from Qiagen showed that one of the main reasons researchers collaborate is to get access to data!

How can I find collaborators?
PRO TIP: Search for collaborators who have the data you need
PRO TIP: Tell your colleagues and peers what type of data you have in your lab

Where can I access data?
Public repositories:
• For some you apply for access, especially if data contains clinical info or whole genome PID
• Some are open access: GEO, SRA, PGP, OpenSNP, GigaDB, …
• Some are consented for general research use, some have specific consent
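For the open access repositories hosted at NCBI (GEO, SRA), searching can be scripted. The sketch below builds a query against the NCBI E-utilities `esearch` endpoint; the helper names and the example search term are illustrative assumptions, and heavy use should follow NCBI's usage guidelines (rate limits, API key). Other repositories in the list have their own interfaces, which are not covered here.

```python
import json
import urllib.request
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term: str, db: str = "sra", retmax: int = 20) -> str:
    """Build an esearch URL. db="sra" searches the Sequence Read Archive,
    db="gds" searches GEO DataSets. Illustrative helper name."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def search_ids(term: str, db: str = "sra", retmax: int = 20) -> list:
    """Run the search (needs network access) and return the matching record IDs."""
    with urllib.request.urlopen(build_search_url(term, db, retmax), timeout=30) as resp:
        payload = json.load(resp)
    return payload["esearchresult"]["idlist"]


if __name__ == "__main__":
    # Hypothetical query: human whole-genome sequencing records in SRA.
    print(build_search_url("Homo sapiens[Organism] AND WGS[Strategy]"))
```

Only the URL builder runs offline; `search_ids` performs the actual request and returns internal record IDs that can be passed on to other E-utilities calls.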

And it takes time
Bottlenecks:
• Finding relevant and usable data
• Getting authorisation to access data
• Formatting data
• Storing and moving data
We studied the problem by qualitative interviews followed by a survey of researchers in human genetics.

And it takes time
Researchers spend months to find and access genomic data, and often choose to not access data at all.
T. A. van Schaik et al., “The need to redefine genomic data sharing: a focus on data accessibility”, Applied & Translational Genomics, 2014, doi:10.1016/j.atg.2014.09.013

Barriers to access
dbGaP Application Process (flowchart):
1. NIH / eRA Commons login? (Yes / No)
2. Organisation registered with eRA? (Yes / No)
3. Organisation has DUNS number? (Yes / No)
4. Write research proposal
5. Submit proposal
6. Access granted
7. Find/Download/Decrypt data
8. Science…
Delay labels on the chart, in order of appearance: + 2-3 days, + 1-2 weeks, + 1 week, + days to weeks, variable: from weeks to months, + 1-2 days
Why the barrier?
• Benefits: strict governance, review of consent, applicant signs for full responsibility for governance
• Disadvantages: no control of data once access is given, high barrier for access (too high?)

How can I save time?
• Start planning your data needs early in your project
• When you find the data you need, start the application
• Use Open Access data
PRO TIP: If you use human genomic data, apply for the GRU datasets in dbGaP: one application gives access to all the GRU datasets

Not all data is restricted
• Some data is Open Access; this requires specific consent
• OpenSNP.org (Bastian)
• Personal Genome Projects
• Individuals who put their genomes online, e.g. Manuel Corpas and his family, “the Corpasome”
• http://manuelcorpas.com/about/

Personal Genome Project
Projects compared: PGP Harvard, PGP Canada, PGP UK, Genom Austria
• Host institution: Harvard Medical School, Boston (PGP Harvard); SickKids, Toronto (PGP Canada); University College London (PGP UK); CeMM Research Center for Molecular Medicine (Genom Austria)
• Principal Investigator: George Church (PGP Harvard); Steven Scherer (PGP Canada); Stephan Beck (PGP UK); Christoph Bock & Giulio Superti-Furga (Genom Austria)
• Launch year: 2005 (PGP Harvard); 2012 (PGP Canada); 2013 (PGP UK); 2014 (Genom Austria)
• Geographic scope: USA, mainly Boston (PGP Harvard); Canada (PGP Canada); United Kingdom (PGP UK); mainly Austria (Genom Austria)
• Enrollment eligibility (all projects): at least 18 years old, able to make an informed decision, perfect score in the PGP enrollment exam, certain vulnerable groups excluded
• Data generated: whole genome sequencing, upload of additional data possible (PGP Harvard); mainly whole genome sequencing (PGP Canada); whole genome sequencing, DNA methylome sequencing, RNA transcriptome sequencing (PGP UK); mainly whole genome sequencing (Genom Austria)
• Number of genomes: 100s (PGP Harvard); 10s (PGP Canada); 10s (PGP UK); 10s (Genom Austria)
• Data access: http://personalgenomes.org/harvard/data (PGP Harvard); http://genomaustria.at/unser-genom/#genome-der-pionierinnen (Genom Austria)
• Project funding: discretional funds and corporate sponsoring (PGP Harvard); institutional startup funds (PGP Canada); discretional funds and corporate sponsoring (PGP UK); institutional startup funds (Genom Austria)
• Areas of emphasis: integration with phenotypic data, collaboration with other personal omics initiatives; genome donations, synergy with massive-scale clinical genome sequencing projects; genomes and society, genetic literacy, school projects, education
• Website: http://personalgenomes.org/harvard/ (PGP Harvard); http://personalgenomes.org/canada/ (PGP Canada); http://personalgenomes.org/uk/ (PGP UK); http://genomaustria.at/ (Genom Austria)

Summary of data access barriers
• Data is uploaded to repository
• Data is discovered by potential user
• Data is accessed by potential user

Where is the data?
[Chart, 2006 to 2015: ≈ 5K genomes available versus 400K genomes sequenced]
Only a fraction of the data is findable or available through public repositories.
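As a quick back-of-the-envelope check of how small that fraction is, the two figures from the chart can be compared directly. This is only a sketch using the rounded estimates shown on the slide (≈ 5K available, 400K sequenced); the function name is an illustrative assumption.

```python
def share_available(genomes_available: int, genomes_sequenced: int) -> float:
    """Fraction of sequenced genomes that are available in public repositories."""
    return genomes_available / genomes_sequenced


# Rounded estimates from the chart.
fraction = share_available(5_000, 400_000)
label = f"{fraction:.2%}"
print(label)  # prints 1.25%
```

In other words, roughly 1 in 80 sequenced genomes is available through public repositories, going by these estimates.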

The barrier of making data available
But I do not want to share my data
• “even when researchers are authorised to share data they report reluctance to do so because of the amount of effort required” http://www.sciencedirect.com/science/article/pii/S2212066114000386
• “Clinical geneticists cited a lack of time because their main priority is diagnosing patients. Industrial researchers cited a lack of time because of the pressure to meet the deadlines in their job. Researchers in academia cited both a concern about the potential loss of future publications once unpublished data is shared, and the lack of time and incentive to share data as this does not contribute to their publication record. Researchers from all categories felt that they lacked sufficient resources to make their data available.”

Best practices
• If you expect data to be available to you, you have to make your data available too!
• Encourage collaborations: power by numbers
1. Get credit: publish and make your data available
2. Give credit: cite data sources
3. Understand consent: for all uses of clinical data

Best practices: use the tools
• Use all available tools to make your life easier:
• Data publications for visibility and citations for your data, e.g. GigaScience
• Figshare, Zenodo, Dryad for sharing open access data
• PhenomeCentral, Matchmaker Exchange for rare disease research
• Repositive for finding data across repositories and making your own data discoverable

Does #OpenScience matter at proposal evaluation?
Based on: Winning Horizon 2020 with Open Science, http://dx.doi.org/10.5281/zenodo.12247

Excellence, Impact, Implementation
“Weakness: Involvement of non-academic beneficiaries is limited”
“Weakness: highly focused on academic activities, and lacks an advanced communication strategy”
“Weakness: limited exposure to non-academic partners & infrastructures”
“data accessibility is unclear!”
“data storage & access not considered”

Impact:
“Strengths: extensive dissemination of data to the scientific community (open access, databases)”
“outreach activities to a broad audience”
“research software is freely available”

Best practices
Make the (research) world a better place by sharing in return

What the future holds
• Digital consent: towards automatic processing of applications
• Dynamic consent and power to the patient, e.g. PatientsKnowBest
• Privacy-preserving access to datasets: preserving control and governance with data custodian, lower barrier for access

In the meantime: It is a jungle out there!
What if finding data was as easy as finding a book on Amazon or booking a hotel on Expedia?

The Repositive vision
• Enabling efficient data access
• Incentivising best practices
• Trusted broker for data exchange

Repositive is a web platform
Discover new data sources
EASY SEARCH
We are indexing all the public sources of data, so users have an easy portal for searching through data descriptions.

Repositive is a web platform
Make your data visible
SHARE KNOWLEDGE
Because Repositive is a two-sided marketplace, users can also make their own data findable.

Active Repositive users increase benefits
Build a data community
BUILD TRUST
Users can interact to find relevant collaborators for their research, either to analyse their data or to combine data sources.

Active Repositive users increase benefits
Find data collaborators
SAVE TIME
Feedback from other users through ratings and comments helps users evaluate data quality.

Benefit for both sides
Data consumers:
• Find relevant data faster
• Feedback from other users through ratings and comments to evaluate data quality
• Find collaborators with data
Data producers:
• Make your data visible
• Build credibility as a trusted provider of quality data
• Find collaborators to analyse your data

Live demo
Sign up as beta tester: http://repositive.io

Best practices - recap
• Get credit – publish data
• Give credit – cite data
• Understand consent
