Find Genome Data for Your Research

Genome sharing projects
around the world
– and how you find data for
your research
Cambridge, April 26 2016
Slides will be made available online 
Tweets welcome #CamFindData

We are on Twitter:
@glyn_dk
@repositiveio
@DNAdigest
@CamOpenData

Workshop outline
1. What data are you looking for? And why?
2. Data resources from around the world
3. Tips on how to find and access data
4. Hands-on using Repositive
5. Summary and feedback

1. What data are you looking for?
This workshop will focus on finding
and accessing human genomic data.
… And why would you be looking for
genomic data for your research?
Are you researching cancer or
genetic diseases?

How much data do you need to publish a paper?
2001: 1 human genome
2012: 1000 Genomes (1092 genomes, since increased to ~2500)
2015:
UK10K, Icelandic population (2,636 + 100k imputed),
The Cancer Genome Atlas (TCGA) ~11,000 genomes
ExAC consortium 65,000 exomes
?

Statistically speaking, you still need tens of thousands of samples for
validation
The more severe the phenotype and the more complete the penetrance, the
easier it will be for you to find your variant, but
“As the genetic complexity of the disease increases (for example,
reduced penetrance and increased locus heterogeneity), issues of
statistical power quickly become paramount.”
http://www.nature.com/nrg/journal/v15/n5/full/nrg3706.html
But I am just looking at this one disease…
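
To get a feel for why sample size matters so much, here is a minimal sketch of a power calculation. It assumes a simple case-control comparison of allele frequencies with a two-sided z-test (normal approximation) and the commonly used genome-wide significance threshold of 5e-8. The function name, the allele frequencies and the group sizes are illustrative assumptions, not values taken from the paper quoted above.

```python
from math import sqrt
from statistics import NormalDist


def allele_test_power(p_case, p_ctrl, n_per_group, alpha=5e-8):
    """Approximate power of a two-sided z-test on allele frequencies.

    Illustrative assumptions: equal numbers of cases and controls,
    two alleles per person, normal approximation, unpooled variance.
    """
    std_normal = NormalDist()
    alleles = 2 * n_per_group  # diploid genome: two alleles per individual
    se = sqrt(p_case * (1 - p_case) / alleles
              + p_ctrl * (1 - p_ctrl) / alleles)
    z_crit = -std_normal.inv_cdf(alpha / 2)   # two-sided critical value
    shift = abs(p_case - p_ctrl) / se         # expected z-score of the effect
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Made-up example: allele frequency 20% in controls, 22% in cases.
    for n in (1_000, 10_000, 50_000):
        power = allele_test_power(p_case=0.22, p_ctrl=0.20, n_per_group=n)
        print(f"{n:>6} cases + {n:>6} controls: power = {power:.3f}")
```

With these made-up frequencies, a thousand cases and controls give almost no power, while tens of thousands give high power. A statistician would refine this with the actual study design.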

What can I do?
PRO TIP: involve a statistician early on in your study design!

How can I determine significance?
“One potentially powerful approach is to assess conservation across and within
multiple species as whole-genome sequence data become more abundant.”
Look at extreme phenotypes: “Sampling cases or controls from the extremes of an
appropriate quantitative distribution can often increase power”
Look at non-SNP variants: they are more likely to have functional effects
- “how to account for the technical features of sequencing, such as incomplete
sequencing and biased coverage over the genome?”

Think of how you can provide evidence that your result is not just a local
technical variation or sampling bias
e.g. data from same cell type, same seq technology, same alignment…
How to account for bias?
PRO TIP: include more reference data in your analysis

• Know what data is available in your lab,
your dept, your org
• Survey from Qiagen showed that one of
the main reasons researchers collaborate
is to get access to data!
How can I access more data for my research?

How can I find collaborators?
PRO TIP: Search for collaborators who have the data you need
PRO TIP: Tell your colleagues and peers what type of data you
have in your lab

2. Data resources from around the world
Public repositories
• some you apply for access,
especially if data contains
clinical info or whole genome
PID
• some are open access: GEO,
SRA, PGP, OpenSNP, GigaDB, …
• some are consented for
general research use, some
have specific consent

Large amounts of data, but not accessible
[Chart: WGS data available in public repos compared with data sequenced, 2006 to 2015]
• ≈ 0.5 PB sequence available (WGS data in public repos)
• 80+ PB sequenced every year
• Exponential growth rate
• Under-utilised data has huge potential for medical research
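
The two figures on this slide can be put side by side with quick arithmetic. This is only a rough sketch: it compares a stock (≈ 0.5 PB available) with a yearly flow (80+ PB sequenced), and it uses 80 PB as a lower bound, so the real share is probably smaller. The function and constant names are illustrative.

```python
PB_AVAILABLE = 0.5           # "≈ .5PB sequence available" (from the slide)
PB_SEQUENCED_PER_YEAR = 80   # "80+PB sequenced every year" (lower bound)


def available_share(available_pb, sequenced_pb_per_year):
    """Available data as a fraction of one year's sequencing output."""
    return available_pb / sequenced_pb_per_year


share = available_share(PB_AVAILABLE, PB_SEQUENCED_PER_YEAR)
print(f"Publicly available: roughly {share:.1%} of one year's output")
```

In other words, the publicly available sequence amounts to well under 1% of what is produced in a single year.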

Hundreds of data sources
…but they aren’t easy to find!
[Chart: number of data sources over time, Jan-15 to Mar-16; plotted values 10, 25, 33, 35, 102, 163]
First 30 data sources listed here:
http://dx.doi.org/10.1371/journal.pbio.1002418

Data source content
Assay Types
Dedicated to…

Number of samples in Data sources
[Chart: number of samples per data source, log10 scale from 1 to 1,000,000]
Top 5:
GEO (1.8M)
PMI Cohort Program (1M)
Auria Biopankki (1M)
EGA (~0.6M)
SRA (~0.5M)

Data accessibility
• Can download the data straight away or after logging in.
• Need to apply for access to the data.
• Has both Open and Restricted access data within one repository.

Online Data source ‘types’
University – Affiliated to a university. Often only members of that university can upload/download to/from it.
Catalogue – Doesn’t have raw data but lists studies/datasets.
Initiative/Consortium – Has a specific purpose/aim. Often focussed on a question or disease.
Repository – Can download from, has data from multiple institutions. Often can also upload your own data there.
Company – For-profit organisation. Listing data is not their main purpose.
Biobank – Many have sequence data of their biological samples.

Sequenced ethnicities
Aboriginals
African Americans
Africans
Australians
Chinese
Malays
Indians
Danish
Dutch
Estonian
Russian
European Ancestry
Finnish
Icelandic
Japanese
Korean
Latin Americans
Saudi
Swedish

Machines & Data sources
[Map figure; extracted values, labels lost: 947, 5600, 88, 660, 26, 68, 50, 62, 3, 25, 0, 0; 23 International]
Interesting site to look at:
http://omicsmaps.com/stats

Main Repository funders
BGI = 4
EBI = 9
NIH = 10
NCBI = 9
The Broad = 8
Wellcome = 4
EBI total 104 services, 19 repositories http://www.ebi.ac.uk/services/all
NCBI total 67 databases http://www.ncbi.nlm.nih.gov/guide/all/#databases_

Biobanks as data sources
- Biobanks are potential sources of genomic data
- Most biobanks contain large collections of samples (thousands)
- Some biobanks also contain data related to these samples
- A fraction of this data is genomic data (usually genotyping)
- Several biobanks (e.g. ToMMo Biobank in Japan, UK Biobank) have sequencing programs
- Many biobanks do not consider sequencing as their priority but are willing to give their samples to
researchers who would like to sequence them
- Most biobanks are supposed to share their samples with bona fide researchers (exception –
commercial biobanks, e.g. Abcodia)
- In most cases, the best thing is to ask them directly whether they have samples/data that you
need!

Name: UK Biobank
Type of data: genotyping
URL: http://biobank.ctsu.ox.ac.uk/crystal/gsearch.cgi
Name: ToMMo Biobank
Type of data: genotyping, WGS
URL: https://ijgvd.megabank.tohoku.ac.jp/
Name: Diabetes Biobank Brussels
Type of data: data (including genomic; not specified) and
clinical samples on >20,000 diabetic patients and their
first-degree relatives.
URL: http://www.diabetesbiobank.org/
Name: Dutch biobanks (dozens of them!)
Type of data: multiple
URL: http://bit.ly/1XxPA6W
Name: Auria Biobank Finland
Type of data: There are roughly one million human biological samples
stored in Auria Biobank, a considerable proportion of which are cancer
samples. At the moment, there is only the catalogue of samples, no
catalogue of data. In case a researcher needs to know what kind of data
we have, he/she needs to contact us.
URL: https://www.auriabiopankki.fi/?lang=en

More information about data sources
… in our recent paper:
http://tinyurl.com/plos-biology-repositive

3. Tips to find and access data
• Case study: DNA data on Cancer

Case Study – DNA data on Cancer

Repositories you have heard of:
Repository     Data Type                         Access
ArrayExpress   Expression                        Open
GEO            Expression                        Open
EGA            Mixed                             Restricted
dbGaP          Mixed                             Restricted
ENCODE         Healthy Reference                 Open
1000 Genomes   Healthy Reference                 Open

Ask around (word of mouth):
Repository     Data Type                         Access
COSMIC         Somatic mutations & WGS           Open
ClinVar        Variant information               Open
ExAC           Allele Freq. but not raw data     Open
SRA            Individual sequences              Open
TCGA           Clinical & high level data        Open
CGHub          Low level data (DNA data)         Restricted
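
Once you have a list like this, it helps to keep it in a form you can filter. The sketch below records the repositories from this slide and selects them by access type. The `Repository` record and the `by_access` helper are illustrative names, not part of any repository's API, and the access labels are simply those given on the slide.

```python
from typing import NamedTuple


class Repository(NamedTuple):
    name: str
    data_type: str
    access: str  # "Open" or "Restricted", as listed on the slide


REPOSITORIES = [
    Repository("ArrayExpress", "Expression", "Open"),
    Repository("GEO", "Expression", "Open"),
    Repository("EGA", "Mixed", "Restricted"),
    Repository("dbGaP", "Mixed", "Restricted"),
    Repository("ENCODE", "Healthy Reference", "Open"),
    Repository("1000 Genomes", "Healthy Reference", "Open"),
    Repository("COSMIC", "Somatic mutations & WGS", "Open"),
    Repository("ClinVar", "Variant information", "Open"),
    Repository("ExAC", "Allele Freq. but not raw data", "Open"),
    Repository("SRA", "Individual sequences", "Open"),
    Repository("TCGA", "Clinical & high level data", "Open"),
    Repository("CGHub", "Low level data (DNA data)", "Restricted"),
]


def by_access(repos, access):
    """Return the names of repositories with the given access type."""
    wanted = access.lower()
    return [repo.name for repo in repos if repo.access.lower() == wanted]


if __name__ == "__main__":
    print("Start here (open):", ", ".join(by_access(REPOSITORIES, "Open")))
    print("Apply early (restricted):",
          ", ".join(by_access(REPOSITORIES, "Restricted")))
```

Splitting the list this way makes it obvious which sources you can start with today and which need an application submitted early.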

Case Study – DNA data on Cancer
We have identified the first 27 cancer-specific data sources,
and many more that contain cancer data alongside other data
types.
Abcodia
AmbryShare
BRCA Exchange
Breast Cancer Now Tissue Bank
Broad Cancer programme datasets
Cancer Moonshot 2020
CanGEM
CGCI
CGHub
Chinese cancer genome consortium
Chinese national human genome centre
Follicular Lymphoma Genome Data
G-DOC
GenoMel
ICGC
National Mesothelioma Virtual Bank
NCIP Hub
Project GENIE
TARGET
TCGA
Texas Cancer Research Biobank
NCI-60
CCLE
COSMIC
FANTOM
Cancer Methylome System
Cancer Therapeutics Response Portal

Registering for CGHub
https://cghub.ucsc.edu/keyfile/newuser.html
1. Register for eRA account
‘Principal signing official’ registers → Email to verify → Email to confirm/deny access to website → Email with temporary password → Change password → Electronic signature
2. Request access to specific dataset of interest
Login → Fill in contact info, complete ‘424’ form (research application form) → Request reviewed by DAC → Email to confirm/deny access to data
3. Download data
Login → Retrieve personal access token → Download!

Often a long process
Bottlenecks:
• Finding relevant and usable
data
• Getting authorisation to
access data
• Formatting data
• Storing and moving data
We studied the problem by
qualitative interviews followed
by a survey of researchers in
human genetics

Often a long process
T. A. van Schaik et al
The need to redefine genomic
data sharing: a focus on data
accessibility, Applied &
Translational Genomics, 2014
10.1016/j.atg.2014.09.013
Researchers spend months to
find and access genomic data,
and often choose to not access
data at all

Why the barrier?
• Benefits: strict governance, review of consent, applicant signs for full
responsibility for governance
• Disadvantages: No control of data once access is given, high barrier for
access – too high?

• Start planning your data needs early in your project
• When you find the data you need, start application
• Use Open Access data
How can I save time?
PRO TIP: If you use human genomic data, apply for the GRU
(general research use) datasets in dbGaP: one application gives
access to all the GRU datasets
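
As a rough sketch of why this tip saves time, the example below groups a catalogue of datasets by consent label and picks out those that a single general-research-use application would cover. Everything here is hypothetical: the study names are made up, the consent labels are modelled loosely on consent-group abbreviations (GRU for general research use, HMB for health/medical/biomedical, DS for disease-specific), and the exact-match rule is an assumption, since labels with extra modifiers may need separate handling.

```python
# Hypothetical catalogue: study names and consent labels are made up.
CATALOGUE = [
    {"study": "study-A", "consent": "GRU"},
    {"study": "study-B", "consent": "DS"},
    {"study": "study-C", "consent": "GRU"},
    {"study": "study-D", "consent": "HMB"},
]


def covered_by_gru_application(catalogue):
    """Studies whose consent label is exactly GRU (assumed rule)."""
    return [entry["study"] for entry in catalogue
            if entry["consent"] == "GRU"]


def needing_separate_application(catalogue):
    """Everything else would need its own access request."""
    return [entry["study"] for entry in catalogue
            if entry["consent"] != "GRU"]


if __name__ == "__main__":
    print("One application:", covered_by_gru_application(CATALOGUE))
    print("Separate requests:", needing_separate_application(CATALOGUE))
```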

• Some data is Open Access → requires specific consent
• OpenSNP.org (Bastian)
• Personal Genomes Projects
• Individuals who put their genomes online, e.g. Manuel Corpas
and his family “the Corpasome”
• http://manuelcorpas.com/about/
Not all data is restricted

Personal Genome Project
Columns: PGP Harvard | PGP Canada | PGP UK | Genom Austria
Host institution: Harvard Medical School, Boston | SickKids, Toronto | University College London | CeMM Research Center for Molecular Medicine
Principal Investigator: George Church | Stephen Scherer | Stephan Beck | Christoph Bock & Giulio Superti-Furga
Launch year: 2005 | 2012 | 2013 | 2014
Geographic scope: USA, mainly Boston | Canada | United Kingdom | Mainly Austria
Enrollment eligibility (all projects): At least 18 years old, able to make an informed decision, perfect score in the PGP enrollment exam, certain vulnerable groups excluded
Data generated: Whole genome sequencing, upload of additional data possible | Mainly whole genome sequencing | Whole genome sequencing, DNA methylome sequencing, RNA transcriptome sequencing | Mainly whole genome sequencing
Number of genomes: 100s | 10s | 10s | 10s
Data access: PGP Harvard: http://personalgenomes.org/harvard/data ; Genom Austria: http://genomaustria.at/unser-genom/#genome-der-pionierinnen
Project funding: Discretional funds and corporate sponsoring | Institutional startup funds | Discretional funds and corporate sponsoring | Institutional startup funds
Areas of emphasis (in slide order): Integration with phenotypic data, collaboration with other personal omics initiatives; Genome donations, synergy with massive-scale clinical genome sequencing projects; Genomes and society, genetic literacy, school projects, education
Website: http://personalgenomes.org/harvard/ | http://personalgenomes.org/canada/ | http://personalgenomes.org/uk/ | http://genomaustria.at/

Summary of data access barriers
Data is uploaded
to repository
Data is discovered
by potential user
Data is accessed
by potential user

• “even when researchers are authorised to share data they
report reluctance to do so because of the amount of effort
required”
http://www.sciencedirect.com/science/article/pii/S2212066114000386
• “Clinical geneticists cited a lack of time because their main priority is
diagnosing patients. Industrial researchers cited a lack of time because of
the pressure to meet the deadlines in their job. Researchers in academia
cited both a concern about the potential loss of future publications once
unpublished data is shared, and the lack of time and incentive to share
data as this does not contribute to their publication record. Researchers
from all categories felt that they lacked sufficient resources to make their
data available.”
The barrier of making data available
But I do not want to share my data

• If you expect data to be available to you
– you have to make your data available too!
• Encourage collaborations: power by numbers
1. Get credit – publish and make your data available
2. Give credit – cite data sources
3. Understand consent – for all uses of clinical data
Best practices
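
Citing a data source is easier when the citation text can be generated from its DOI. The sketch below builds an HTTP request that asks the DOI resolver for a formatted citation through content negotiation. It assumes the DOI's registration agency supports the `text/x-bibliography` media type; the function names and the default citation style are illustrative choices, and the network call is left to the reader.

```python
import urllib.request


def citation_request(doi, style="apa"):
    """Build a request asking doi.org for a formatted citation.

    Assumes the registration agency behind the DOI supports
    content negotiation with the text/x-bibliography media type.
    """
    url = f"https://doi.org/{doi}"
    headers = {"Accept": f"text/x-bibliography; style={style}"}
    return urllib.request.Request(url, headers=headers)


def fetch_citation(doi, style="apa", timeout=10):
    """Send the request and return the citation text (needs network)."""
    request = citation_request(doi, style)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


# Example (not run here because it needs network access):
# print(fetch_citation("10.1371/journal.pbio.1002418"))
```

The same approach should work for dataset DOIs, such as those issued by Zenodo or Figshare, as long as the issuing agency supports this media type.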

• Use all available tools to make your life easier:
• Data publications → visibility and citations for your data, e.g.
GigaScience and Scientific Data
• Figshare, Zenodo, Dryad for sharing open access data
• PhenomeCentral, Matchmaker Exchange for rare disease research
• Repositive for finding data across repositories and making your own
data discoverable
Best practices: use the tools

Does data sharing
matter at
grant proposal evaluation?
Based on: Winning Horizon 2020 with Open Science,
http://dx.doi.org/10.5281/zenodo.12247
Best practices: Plan into your grant proposals

“Weakness: Involvement of non-
academic beneficiaries is limited”
“Weakness: highly focused on academic activities, and
lacks an advanced communication strategy”
“Weakness: limited exposure to
non-academic partners & infrastructures”
Excellence
Impact
Implementation
“data accessibility is unclear!”
“data storage & access not considered”

“Strengths: extensive dissemination of data to the
scientific community (open access, databases)”
“outreach activities to a broad audience”
“research software is freely available”
Impact:

Make the (research) world a better place by sharing in return 
Best practices: Share in return!

• Digital consent: towards automatic processing of applications
• Dynamic consent and power to the patient, e.g.
PatientsKnowBest
• Privacy-preserving access to datasets: preserving control and
governance with data custodian, lower barrier for access
What the future holds

4. Hands-on session using Repositive
What if finding data was as easy as finding a book on
Amazon or booking a hotel on Expedia?

Repositive promotes best practices
Discover new data sources
EASY
SEARCH

Make your data visible
SHARE
KNOWLEDGE

Build a data community
BUILD
TRUST

Benefit for both sides of data collaboration
Data consumers:
• Find relevant data faster
• Feedback from other users through ratings and comments to evaluate data quality
• Find collaborators with data
Data producers:
• Make your data visible
• Build credibility as a trusted provider of quality data
• Find collaborators to analyse your data

Live demo
http://discover.repositive.io
Use activation code: CamFindData

5. Summary and feedback
• Get credit – publish data
• Give credit – cite data
• Understand consent

Tell us your thoughts:
@repositiveio
@glyn_dk
And read more on http://repositive.io
Bugs and feedback to: Charlotte at Repositive.io
