The slides of the workshop I gave on May 29th 2017 at the European Society of Human Genetics in Copenhagen, Denmark. Here I present the Repositive platform, a tool that helps scientists find and access human genome data.
17. Data sources across the globe
[Figure: bar charts by European country (GB, FI, NL, FR, DE, CH, EE, BE, DK, ES, SI, IE, SE) and by US state (CA, MD, MA, WA, NY, TX, AZ, DC, NJ, NC, PA, UT, TN, CO, IN, FL, LA, VA, IL, ME, OH, MO, MI, SC, OR)]
Geolocation of the 278 data sources analysed,
found by tracking the IP address of each source.
These include:
Public Repositories
Universities
Companies
BioBanks
Research consortia
19. International Initiatives
2001: 1 human genome
2005: Personal Genome Project
Human Genome Diversity Project
HapMap
2008: 1000 Genomes (1092 genomes, since increased to ~2500)
Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE)
2011: H3Africa
2012: International Cancer Genome Consortium
2016: 2M AstraZeneca - HLI
29. Large amounts of data, but not accessible
[Chart: x-axis 2006 to 2015]
≈ 0.5 PB open access
80+ PB sequenced
Genome data available in public repos
Exponential growth rate
Under-utilised data has huge potential for medical research
31. Public Repositories
• Required by funders
• Cannot publish unless accession number given
• Specialised
  • ENA
  • EGA
  • dbGaP
  • dbSNP…
• Generalist
  • Dryad
  • figshare
32. Open vs Managed Access
Open Access
75,000,000 per month
Managed Access
150 per month
500,000-fold difference
Stephan Beck
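The fold difference quoted on this slide can be checked with a one-line calculation. A minimal sketch in Python; the variable names are illustrative, and the slide does not state what is being counted per month, so the figures are treated as plain counts:

```python
# Monthly figures as quoted on the slide (the unit is not given there).
open_access_per_month = 75_000_000
managed_access_per_month = 150

# Ratio between the two monthly figures.
fold_difference = open_access_per_month / managed_access_per_month

print(f"{fold_difference:,.0f}-fold difference")
```
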
34. Hundreds of data sources
…but they aren’t easy to find!
First 30 data sources listed here: http://tinyurl.com/plos-biology-repositive
[Chart: number of data sources over time]
Jan-15  Mar-15  Jun-15  Sep-15  Dec-15  Mar-16  Jun-16
10      25      33      35      102     174     239
35. Case studies
Raquel, PhD Student, London, UK.
Researching genes associated with rare eye disorders.
Problems:
- Doesn’t know where to look for data.
- Doesn't know if data even exists.
“I gave up on finding the data - it was very time consuming and not
proving fruitful – so I started focusing more on generating my own
data.”
36. Access to Managed Data
Benefits:
• Strict governance
• Individuals are protected
• Review of consent
• Applicant signs for full
responsibility for governance
Disadvantages:
• No control of data once access
is given
• High barrier for access – too
high?
37. dbGaP application process
NIH / eRA Commons login
No
Yes
Organisation registered with eRA
Organisation has DUNS number
No
No
Write research proposal
Yes
+ 2-3 days
+ 1-2 weeks
+ 1 week
Yes
Submit proposal
+ 1-2 days
Access granted
Find/Download/Decrypt data
+ 1-4 weeks
Science…
+ 1-2 days
PRO Tip: If you use human genomic data, apply for the GRU datasets in dbGaP: one application gives access to all the GRU datasets.
Blog Post:
http://blog.repositive.io/how-to-successfully-apply-for-access-to-dbgap/
38. EGA application process
Sanger eDAM Account
No
Write research proposal
+ 1 hour
Yes
Submit proposal
+ 1-2 days
Access granted
Find/Download/Decrypt data
+ 2-7 days
Science…
+ 1-2 days
Blog Post:
http://blog.repositive.io/how-to-successfully-apply-for-access-to-ega/
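To compare the two application processes, the "+" delays on the dbGaP and EGA slides can be added up. This is a rough sketch rather than a statement of how long either process takes. It assumes the worst-case path where every delay on the slide applies and that one week is seven calendar days. The flattened flowcharts no longer show which step each delay belongs to, so the delays are listed in slide order without step names. The names `DBGAP_DELAYS`, `EGA_DELAYS` and `total_delay` are illustrative.

```python
from datetime import timedelta

# (shortest, longest) for each "+ ..." label, in the order shown on the slides.
DBGAP_DELAYS = [
    (timedelta(days=2), timedelta(days=3)),    # + 2-3 days
    (timedelta(weeks=1), timedelta(weeks=2)),  # + 1-2 weeks
    (timedelta(weeks=1), timedelta(weeks=1)),  # + 1 week
    (timedelta(days=1), timedelta(days=2)),    # + 1-2 days
    (timedelta(weeks=1), timedelta(weeks=4)),  # + 1-4 weeks
    (timedelta(days=1), timedelta(days=2)),    # + 1-2 days
]
EGA_DELAYS = [
    (timedelta(hours=1), timedelta(hours=1)),  # + 1 hour
    (timedelta(days=1), timedelta(days=2)),    # + 1-2 days
    (timedelta(days=2), timedelta(days=7)),    # + 2-7 days
    (timedelta(days=1), timedelta(days=2)),    # + 1-2 days
]


def total_delay(delays):
    """Sum the shortest and longest estimates, assuming every delay applies."""
    shortest = sum((low for low, _ in delays), timedelta())
    longest = sum((high for _, high in delays), timedelta())
    return shortest, longest


for name, delays in (("dbGaP", DBGAP_DELAYS), ("EGA", EGA_DELAYS)):
    shortest, longest = total_delay(delays)
    print(f"{name}: between {shortest} and {longest}")
```

Applicants who already have the required accounts and registrations would skip some of these delays, so the totals are an upper bound on the path through each flowchart.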
43. The meaning of #dataeureka
1. We have actively helped a user F/A/S genomic data
2. An employee recognises a feature that makes a difference in
F/A/S
3. Product-market fit that justifies Repositive's existence
49. 1-click to human genomic data access
to make finding data as easy as finding a book
on Amazon or booking a hotel on Expedia!
50. Simpler workflow
for data access
Discover and
access
Search, see
related results
Find colleagues &
their data interests
Co-annotate data &
community feedback
55. 1000 Genomes samples used
Population      Subpopulation  Description                        Number
African (207)   YRI            Yoruba in Ibadan, Nigeria          108
                LWK            Luhya in Webuye, Kenya             99
Asian (300)     CHB            Han Chinese in Beijing, China      103
                JPT            Japanese in Tokyo, Japan           104
                CHS            Southern Han Chinese               93
European (336)  TSI            Toscani in Italia                  107
                GBR            British in England and Scotland    87
                FIN            Finnish in Finland                 96
                IBS            Iberian Population in Spain        46
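The per-population totals in brackets can be cross-checked against the subpopulation counts. A small sketch, with the counts copied from the slide; the dictionary layout and the `check_totals` helper are illustrative and not part of the original analysis:

```python
# Counts copied from the slide; the structure is an illustrative choice.
SAMPLES = {
    "African": {"stated_total": 207, "subpopulations": {"YRI": 108, "LWK": 99}},
    "Asian": {"stated_total": 300, "subpopulations": {"CHB": 103, "JPT": 104, "CHS": 93}},
    "European": {
        "stated_total": 336,
        "subpopulations": {"TSI": 107, "GBR": 87, "FIN": 96, "IBS": 46},
    },
}


def check_totals(samples):
    """Return the overall sample count, raising if a stated total disagrees."""
    overall = 0
    for population, group in samples.items():
        subtotal = sum(group["subpopulations"].values())
        if subtotal != group["stated_total"]:
            raise ValueError(
                f"{population}: subpopulations sum to {subtotal}, "
                f"slide states {group['stated_total']}"
            )
        overall += subtotal
    return overall


print(check_totals(SAMPLES))
```
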
58. 2,402 23andMe
African cluster
50 samples
43 OpenSNP
Of which 11 reported ethnicity:
10 African/Black
1 Many inc African
Asian cluster
58 samples
46 OpenSNP
Of which 5 reported ethnicity:
4 East Asian
1 Korean
Surnames suggestive in 13 other cases
European (2098) ancestry cluster?
(by default)
Because interpretation requires LOTS of data
And although data exists around the world, it is siloed, and even if available, it is not accessible
This is Jenn, a genetic researcher (our target customer) seeking to interpret data from genetic diseases and cancer
She needs data from other patients to compare and interpret Mabel's DNA
She also has data available in her own lab, but she cannot share it because of concerns about how to deal with secure access to sensitive data and with data governance, e.g. vetting of users
Population scale genome sequencing projects have been launched all over the world
More than 80 PB of human genomic data is being sequenced every year
BUT
To date only around 0.5 PB of data is available in public repositories
Every data source
Mixed data dedicated to anything
Sources dedicated to specific
March 2016
239 total sources
June 2016
Examples of researchers looking for genomics data. All have problems, even though they are in different parts of the world, work in different industries and have different research questions.
Our mission is to speed up research and diagnostics for genetic diseases by enabling efficient and ethical access to genomic research data
Our vision is to make genomic data access as easy as finding a book on Amazon or booking a hotel on Expedia
KEY POINTS:
Repositive builds tools for genomics data search & access.
We’re really good at it. We have the expertise in-house. It’s what we do.
Aside from building a highly functional tool, we’ve taken the time to prioritise User Experience, streamlining of user workflows & presentation.
Within a month of our formal platform launch we have over 600 registered users.
The Repositive platform is an online community and marketplace connecting data consumers with data providers.
On Repositive, Jenn has:
Easy, interactive search
Faster data access workflow
Easy access to new data collaborators
Feedback on data from the community and colleagues, to assess data quality and utility
The Repositive platform and technology will remove barriers to data sharing and will incentivise users to explore, contribute and collaborate in alignment with best practices