Workshop presentation on finding and accessing human genomics data for research.
Includes statistics on publicly available data sources and tips on how to save time in your data access workflow.
Organised in collaboration between DNAdigest and Open Data Cambridge.
Read more about our work:
http://DNAdigest.org
http://repositive.io
https://uk.linkedin.com/in/fionanielsen
http://www.data.cam.ac.uk
1. Genome sharing projects
around the world
– and how you find data for
your research
Cambridge, April 26 2016
Slides will be made available online
Tweets welcome #CamFindData
2. We are on twitter:
@glyn_dk
@repositiveio
@DNAdigest
@CamOpenData
3. 1. What data are you looking for? And Why?
2. Data resources from around the world
3. Tips on how to find and access data
4. Hands-on using Repositive
5. Summary and feedback
Workshop outline
4. 1. What data are you looking for?
This workshop will focus on finding
and accessing human genomic data.
… And why would you be looking for
genomic data for your research?
Are you researching cancer or
genetic diseases?
5. How much data do you need to publish a paper?
2001: 1 human genome
2012: 1000 Genomes (1092 genomes, since increased to ~2500)
2015:
UK10K, Icelandic population (2,636 + 100k imputed),
The Cancer Genome Atlas ~11,000 genomes
ExAC consortium 65,000 exomes
?
6. Statistically speaking, you still need tens of thousands of samples for
validation
The more severe the phenotype and the more complete penetrance, the
easier it will be for you to find your variant, but
“As the genetic complexity of the disease increases (for example,
reduced penetrance and increased locus heterogeneity), issues of
statistical power quickly become paramount.”
http://www.nature.com/nrg/journal/v15/n5/full/nrg3706.html
But I am just looking at this one disease…
7. What can I do?
PRO TIP: involve a statistician early on in your study design!
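To get a feel for the numbers, here is a minimal sketch (not from the original slides) of a textbook sample-size approximation for comparing a variant's carrier frequency between cases and controls with a two-sided two-proportion z-test. The function name, the example frequencies, the significance threshold of 5e-8 and the 80% power target are all assumptions chosen for illustration; a real study design needs a statistician.

```python
import math
from statistics import NormalDist


def samples_per_group(p_case: float, p_control: float,
                      alpha: float = 5e-8, power: float = 0.8) -> int:
    """Approximate number of cases (and of controls) for a two-proportion z-test.

    p_case and p_control are assumed carrier frequencies; alpha is an assumed
    genome-wide significance threshold; power is the assumed target power.
    """
    if p_case == p_control:
        raise ValueError("frequencies must differ to be detectable")
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = normal.inv_cdf(power)            # quantile for the target power
    p_bar = (p_case + p_control) / 2          # pooled frequency under the null
    null_sd = math.sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = math.sqrt(p_case * (1 - p_case) + p_control * (1 - p_control))
    n = (z_alpha * null_sd + z_beta * alt_sd) ** 2 / (p_case - p_control) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical scenario: 2% of cases vs 1% of controls carry the variant.
    print(samples_per_group(0.02, 0.01))
```

With these assumed inputs, doubling a 1% carrier frequency needs on the order of ten thousand cases and as many controls; smaller differences need far more.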
8. How can I determine significance?
“One potentially powerful approach is to assess conservation across and within
multiple species as whole-genome sequence data become more abundant.”
Look at extreme phenotypes “Sampling cases or controls from the extremes of an
appropriate quantitative distribution can often increase power”
Look at non-SNP variants: they are more likely to have functional effects
- “how to account for the technical features of sequencing, such as incomplete
sequencing and biased coverage over the genome?”
9. Think of how you can provide evidence that your result is not just a local
technical variation or sampling bias
e.g. data from same cell type, same seq technology, same alignment…
How to account for bias?
PRO TIP: include more reference data in your analysis
10. • Know what data is available in your lab,
your dept, your org
• Survey from Qiagen showed that one of
the main reasons researchers collaborate
is to get access to data!
How can I access more data for my research?
11. How can I find collaborators?
PRO TIP: Search for collaborators who have the data you need
PRO TIP: Tell your colleagues and peers what type of data you
have in your lab
12. 2. Data resources from around the world
Public repositories
• some you apply for access,
especially if data contains
clinical info or whole genome
PID
• some are open access: GEO,
SRA, PGP, OpenSNP, GigaDB, …
• some are consented for
general research use, some
have specific consent
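Open access repositories such as GEO and SRA can also be searched programmatically through NCBI's E-utilities. The sketch below (not part of the original slides) builds an `esearch` request URL with the Python standard library. The helper names, the example query and the `retmax` value are illustrative assumptions, and NCBI's usage guidelines (rate limits, API keys) apply to real use.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities esearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database: str, query: str, max_records: int = 20) -> str:
    """Return an esearch URL; database is an Entrez name such as "sra" or "gds"."""
    params = {
        "db": database,
        "term": query,
        "retmax": max_records,
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def search_ids(database: str, query: str, max_records: int = 20) -> list:
    """Fetch matching record IDs (performs a network request)."""
    url = build_esearch_url(database, query, max_records)
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]


if __name__ == "__main__":
    # Hypothetical query; only the URL is printed, no request is sent.
    print(build_esearch_url("sra", "breast cancer AND Homo sapiens[Organism]"))
```

Calling `search_ids("gds", ...)` would search GEO DataSets instead of SRA. Restricted-access repositories such as EGA and dbGaP still require an access application before any data can be downloaded.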
13. Large amounts of data, but not accessible
[Figure: WGS data available in public repos, 2006 to 2015; exponential growth rate]
≈ 0.5 PB sequence available
80+ PB sequenced every year
Under-utilised data has huge potential for medical research
18. Number of samples in data sources
[Figure: sample counts per data source; y-axis: number of samples (log10 scale, 1 to 1,000,000)]
Top 5:
GEO (1.8M)
PMI Cohort Program (1M)
Auria Biopankki (1M)
EGA (~0.6M)
SRA (~0.5M)
19. Data accessibility
Can download the
data straight away
or after logging in.
Need to apply for
access to the data.
Has both Open and Restricted
access data within one repository.
20. Online data source ‘types’
University – Affiliated to a university. Often only members of that university can upload/download to/from it.
Catalogue – Doesn’t have raw data but lists studies/datasets.
Initiative/Consortium – Has a specific purpose/aim. Often focussed on a question or disease.
Repository – Can download from, has data from multiple institutions. Often can also upload your own data there.
Company – For-profit organisation. Listing data is not their main purpose.
Biobank – Many have sequence data of their biological samples.
22. Machines & Data sources
[Figure: world map of machines and data sources; values shown: 947, 5600, 88, 660, 26, 68, 50, 62, 3, 25, 0, 0]
23 International
Interesting site to look at:
http://omicsmaps.com/stats
23. Main Repository funders
BGI = 4
EBI = 9
NIH = 10
NCBI = 9
The Broad = 8
Wellcome = 4
EBI total 104 services, 19 repositories http://www.ebi.ac.uk/services/all
NCBI total 67 databases http://www.ncbi.nlm.nih.gov/guide/all/#databases_
24. Biobanks as data sources
- Biobanks are potential sources of genomic data
- Most biobanks contain large collections of samples (thousands)
- Some biobanks also contain data related to these samples
- A fraction of this data is genomic data (usually genotyping)
- Several biobanks (e.g. ToMMo Biobank in Japan, UK Biobank) have sequencing programs
- Many biobanks do not consider sequencing as their priority but are willing to give their samples to
researchers who would like to sequence them
- Most biobanks are supposed to share their samples with bona fide researchers (exception –
commercial biobanks, e.g. Abcodia)
- In most cases, the best thing is to ask them directly whether they have samples/data that you
need!
25. Name: UK Biobank
Type of data: genotyping
URL: http://biobank.ctsu.ox.ac.uk/crystal/gsearch.cgi
Name: ToMMo Biobank
Type of data: genotyping, WGS
URL: https://ijgvd.megabank.tohoku.ac.jp/
Name: Diabetes Biobank Brussels
Type of data: data (including genomic; not specified) and
clinical samples on >20,000 diabetic patients and their
first-degree relatives.
URL: http://www.diabetesbiobank.org/
Name: Dutch biobanks (dozens of them!)
Type of data: multiple
URL: http://bit.ly/1XxPA6W
Name: Auria Biobank Finland
Type of data: There are roughly one million human biological samples
stored in Auria Biobank, a considerable proportion of which are cancer
samples. At the moment, there is only the catalogue of samples, no
catalogue of data. In case a researcher needs to know what kind of data
we have, he/she needs to contact us.
URL: https://www.auriabiopankki.fi/?lang=en
26. More information about data sources
… in our recent paper:
http://tinyurl.com/plos-biology-repositive
27. • Case study: DNA data on Cancer
3. Tips to find and access data
28. Case Study – DNA data on Cancer
Repositories you have heard of:
Repository      Data Type            Access
ArrayExpress    Expression           Open
GEO             Expression           Open
EGA             Mixed                Restricted
dbGaP           Mixed                Restricted
ENCODE          Healthy Reference    Open
1000 Genomes    Healthy Reference    Open
Ask around (word of mouth):
Repository      Data Type                        Access
COSMIC          Somatic mutations & WGS          Open
ClinVar         Variant information              Open
ExAC            Allele Freq. but not raw data    Open
SRA             Individual sequences             Open
TCGA            Clinical & high level data       Open
CGHub           Low level data (DNA data)        Restricted
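The repositories listed in this case study can be kept as a small lookup structure, which makes it easy to see what is usable straight away and what needs an access application. This is a minimal sketch: the class and function names are made up for illustration, and the entries simply restate the slide.

```python
from typing import NamedTuple


class Repository(NamedTuple):
    name: str
    data_type: str
    access: str  # "Open" or "Restricted", as listed on the slide


REPOSITORIES = [
    Repository("ArrayExpress", "Expression", "Open"),
    Repository("GEO", "Expression", "Open"),
    Repository("EGA", "Mixed", "Restricted"),
    Repository("dbGaP", "Mixed", "Restricted"),
    Repository("ENCODE", "Healthy Reference", "Open"),
    Repository("1000 Genomes", "Healthy Reference", "Open"),
    Repository("COSMIC", "Somatic mutations & WGS", "Open"),
    Repository("ClinVar", "Variant information", "Open"),
    Repository("ExAC", "Allele Freq. but not raw data", "Open"),
    Repository("SRA", "Individual sequences", "Open"),
    Repository("TCGA", "Clinical & high level data", "Open"),
    Repository("CGHub", "Low level data (DNA data)", "Restricted"),
]


def names_by_access(repositories, access):
    """Return the names of repositories with the given access level."""
    wanted = access.strip().lower()
    return [repo.name for repo in repositories if repo.access.lower() == wanted]


if __name__ == "__main__":
    print("Apply for access:", names_by_access(REPOSITORIES, "Restricted"))
    print("Download directly:", names_by_access(REPOSITORIES, "Open"))
```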
29. Case Study – DNA data on Cancer
We have identified the first 27 cancer-specific data sources
and many more that contain cancer data alongside other data
types.
Abcodia
AmbryShare
BRCA Exchange
Breast Cancer Now Tissue Bank
Broad Cancer programme datasets
Cancer Moonshot 2020
CanGEM
CGCI
CGHub
Chinese cancer genome consortium
Chinese national human genome centre
Follicular Lymphoma Genome Data
G-DOC
GenoMel
ICGC
National Mesothelioma Virtual Bank
NCIP Hub
Project GENIE
TARGET
TCGA
Texas Cancer Research Biobank
NCI-60
CCLE
COSMIC
FANTOM
Cancer Methylome System
Cancer Therapeutics Response Portal
30. 1. Register for eRA account
2. Request access to specific dataset of interest
3. Download data
Registering for CGHub
https://cghub.ucsc.edu/keyfile/newuser.html
‘Principal signing official’ registers
Email to verify
Email to confirm/deny access to website
Email with temporary password
Change password
Electronic signature
Login
Fill in contact info, complete ‘424’ form (research application form)
Request reviewed by DAC
Email to confirm/deny access to data
Login
Retrieve personal access token
Download!
31. Often a long process
Bottlenecks:
• Finding relevant and usable
data
• Getting authorisation to
access data
• Formatting data
• Storing and moving data
We studied the problem by
qualitative interviews followed
by a survey of researchers in
human genetics
32. Often a long process
T. A. van Schaik et al., The need to redefine genomic data sharing: a focus on data accessibility, Applied & Translational Genomics, 2014, doi:10.1016/j.atg.2014.09.013
Researchers spend months to
find and access genomic data,
and often choose to not access
data at all
34. Why the barrier?
• Benefits: strict governance, review of consent, applicant signs for full
responsibility for governance
• Disadvantages: No control of data once access is given, high barrier for
access – too high?
35. • Start planning your data needs early in your project
• When you find the data you need, start application
• Use Open Access data
How can I save time?
PRO TIP: If you use human genomic data, apply for the GRU (general
research use) datasets in dbGaP: one application gives access to all
the GRU datasets
36. • Some data is Open Access (requires specific consent)
• OpenSNP.org (Bastian)
• Personal Genomes Projects
• Individuals who put their genomes online, e.g. Manuel Corpas
and his family “the Corpasome”
• http://manuelcorpas.com/about/
Not all data is restricted
38. Personal Genome Project
Projects (in column order): PGP Harvard | PGP Canada | PGP UK | Genom Austria
Host institution: Harvard Medical School, Boston | SickKids, Toronto | University College London | CeMM Research Center for Molecular Medicine
Principal investigator: George Church | Stephen Scherer | Stephan Beck | Christoph Bock & Giulio Superti-Furga
Launch year: 2005 | 2012 | 2013 | 2014
Geographic scope: USA, mainly Boston | Canada | United Kingdom | Mainly Austria
Enrollment eligibility (all projects): At least 18 years old, able to make an informed decision, perfect score in the PGP enrollment exam, certain vulnerable groups excluded
Data generated: Whole genome sequencing, upload of additional data possible | Mainly whole genome sequencing | Whole genome sequencing, DNA methylome sequencing, RNA transcriptome sequencing | Mainly whole genome sequencing
Number of genomes: 100s | 10s | 10s | 10s
Data access: http://personalgenomes.org/harvard/data (PGP Harvard) | http://genomaustria.at/unser-genom/#genome-der-pionierinnen (Genom Austria)
Project funding: Discretional funds and corporate sponsoring | Institutional startup funds | Discretional funds and corporate sponsoring | Institutional startup funds
Areas of emphasis: Integration with phenotypic data, collaboration with other personal omics initiatives; genome donations, synergy with massive-scale clinical genome sequencing projects; genomes and society, genetic literacy, school projects, education
Website: http://personalgenomes.org/harvard/ | http://personalgenomes.org/canada/ | http://personalgenomes.org/uk/ | http://genomaustria.at/
39. Summary of data access barriers
Data is uploaded
to repository
Data is discovered
by potential user
Data is accessed
by potential user
40. • “even when researchers are authorised to share data they
report reluctance to do so because of the amount of effort
required” http://www.sciencedirect.com/science/article/pii/S2212066114000386
• “Clinical geneticists cited a lack of time because their main priority is
diagnosing patients. Industrial researchers cited a lack of time because of
the pressure to meet the deadlines in their job. Researchers in academia
cited both a concern about the potential loss of future publications once
unpublished data is shared, and the lack of time and incentive to share
data as this does not contribute to their publication record. Researchers
from all categories felt that they lacked sufficient resources to make their
data available.”
The barrier of making data available
But I do not want to share my data
41. • If you expect data to be available to you
– you have to make your data available too!
• Encourage collaborations: power by numbers
1. Get credit – publish and make your data available
2. Give credit – cite data sources
3. Understand consent – for all uses of clinical data
Best practices
42. • Use all available tools to make your life easier:
• Data publications: visibility and citations for your data, e.g.
GigaScience and Scientific Data
• Figshare, Zenodo, Dryad for sharing open access data
• PhenomeCentral, Matchmaker Exchange for rare disease research
• Repositive for finding data across repositories and making your own
data discoverable
Best practices: use the tools
43. Does data sharing matter at grant proposal evaluation?
Based on: Winning Horizon 2020 with Open Science,
http://dx.doi.org/10.5281/zenodo.12247
Best practices: Plan into your grant proposals
44. “Weakness: Involvement of non-
academic beneficiaries is limited”
“Weakness: highly focused on academic activities, and
lacks an advanced communication strategy”
“Weakness: limited exposure to
non-academic partners & infrastructures”
Excellence
Impact
Implementation
“data accessibility is unclear!”
“data storage & access not considered”
Best practices: Plan into your grant proposals
45. “Strengths: extensive dissemination of data to the
scientific community (open access, databases)”
“outreach activities to a broad audience”
“research software is freely available”
Impact:
Best practices: Plan into your grant proposals
47. Make the (research) world a better place by sharing in return
Best practices: Share in return!
48. • Digital consent: towards automatic processing of applications
• Dynamic consent and power to the patient, e.g.
PatientsKnowBest
• Privacy-preserving access to datasets: preserving control and
governance with data custodian, lower barrier for access
What the future holds
49. 4. Hands-on session using Repositive
What if finding data was as easy as finding a book on
Amazon or booking a hotel on Expedia?
53. Benefit for both sides of data collaboration
Data consumers:
• Find relevant data faster
• Feedback from other users through ratings and comments to evaluate data quality
• Find collaborators with data
Data producers:
• Make your data visible
• Build credibility as a trusted provider of quality data
• Find collaborators to analyse your data
Because interpretation requires LOTS of data
And although data exists around the world, it is siloed, and even if available, it is not accessible
This is Jenn, a genetic researcher (our target customer), seeking to interpret data from genetic diseases and cancer
She needs data from other patients to compare and interpret Mabel's DNA
She also has data available in her own lab, but she cannot share it because of concerns about how to handle secure access to sensitive data and data governance, e.g. vetting of users
It has been shown that the combination of summary single-variant statistics from multiple data sets, rather than the joint analysis of a combined data set, does not result in an appreciable loss of information, and that taking into account heterogeneity in effect size across studies can improve statistical power
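As a minimal sketch of combining summary statistics across data sets (an illustration, not the method of the paper quoted in these notes), the following uses a standard fixed-effect, inverse-variance-weighted meta-analysis. The function name and the example effect sizes are made up for illustration, and this simple form does not model heterogeneity in effect size across studies.

```python
from math import sqrt
from statistics import NormalDist


def fixed_effect_meta(betas, standard_errors):
    """Combine per-study effect estimates for one variant.

    betas: per-study effect sizes (e.g. log odds ratios).
    standard_errors: matching per-study standard errors (all > 0).
    Returns (pooled_beta, pooled_se, two_sided_p_value).
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect size")
    if any(se <= 0 for se in standard_errors):
        raise ValueError("standard errors must be positive")
    # Each study is weighted by the inverse of its variance.
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = sqrt(1 / total_weight)
    z_score = pooled_beta / pooled_se
    p_value = 2 * (1 - NormalDist().cdf(abs(z_score)))
    return pooled_beta, pooled_se, p_value


if __name__ == "__main__":
    # Hypothetical summary statistics from three studies of the same variant.
    print(fixed_effect_meta([0.21, 0.18, 0.25], [0.10, 0.08, 0.12]))
```

Only the per-study effect size and standard error are needed, so no individual-level genomic data has to leave the contributing institutions.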
“Although they are harder to call and annotate, insertion or deletions, multinucleotide variants and structural variants (including copy-number variants, translocations and inversions) constitute a smaller set of variation (in terms of the number of discrete events an individual is expected to carry) relative to all SNVs and are more likely to have functional effects.”
Population scale genome sequencing projects have been launched all over the world
More than 80 PB of human genomic data is being sequenced every year
BUT
To date, only around 0.5 PB of data is available in public repositories
This is further confounded by the data being highly fragmented,
siloed in repositories and institutions around the world.
There are many public repositories, but it can be hugely confusing to know where to look for the right kind of data
Public repositories: default is apply for access -> full access
Benefits: strict governance, review of consent, applicant signs for full responsibility for governance
Disadvantages: No control of data once access is given, high barrier for access – too high? (researchers giving up, even patients can’t get access to their own data)
ODP trained, EURO-BASIN manager, – a boring title, for a diverse job, in an exciting research domain.
DIP into EACH step of the research cycle, from proposal formulation to providing the best return-on-investment to the funders.
So I'd like to share with you some experiences from the last few years of OS advocacy in the Marine Science Community
Excellence at your Research Subject is … excellent, but is it ENOUGH ?
To be successful, a candidate will be judged on being complete.
MESSAGE: a FOCUS only on IF could expose you to risk
Our mission is to speed up research and diagnostics for genetic diseases by enabling efficient and ethical access to genomic research data