Workshop finding and accessing data - fiona - lunteren april 18 2016

Workshop presentation on finding and accessing human genomics data for research.

Including statistics of publicly available data sources and tips on how to save time in your workflow of data access.

Presented at BioSB2016, pre-conference PhD retreat for young researchers in bioinformatics and systems biology at Congrescentrum De Werelt in Lunteren. #BioSB2016 #BioSB16

Link to event:
http://www.youngcb.nl/events/biosb-phd-retreat-2016/

Read more about my work:
http://DNAdigest.org
http://repositive.io
https://uk.linkedin.com/in/fionanielsen

  1. Genome sharing projects around the world – and how you find data for your research. Fiona Nielsen, Lunteren, April 18 2016. Slides will be made available online.
  2. Follow us on Twitter: @repositiveio. Fiona Nielsen, April 18 2016. Find me on Twitter: @glyn_dk
  3. Workshop outline: (1) What data are you looking for? And why? (2) Data resources from around the world (3) Tips on how to find and access data (4) Hands-on using Repositive (5) Summary and feedback
  4. Part 1: What data are you looking for? This workshop will focus on finding and accessing human genomic data. … And why would you be looking for genomic data for your research? Are you researching cancer or genetic diseases?
  5. How much data do you need to publish a paper?
     2001: 1 human genome
     2012: 1000 Genomes (1092 genomes, since increased to ~2500)
     2015: UK10K; Icelandic population (2,636 + 100k imputed); Cancer Genome Atlas (~11,000 genomes); ExAC consortium (65,000 exomes)
     ?
  6. But I am just looking at this one disease… Statistically speaking, you still need tens of thousands of samples for validation. The more severe the phenotype and the more complete the penetrance, the easier it will be for you to find your variant, but “As the genetic complexity of the disease increases (for example, reduced penetrance and increased locus heterogeneity), issues of statistical power quickly become paramount.” http://www.nature.com/nrg/journal/v15/n5/full/nrg3706.html
  7. What can I do? PRO TIP: involve a statistician early on in your study design!
  8. How can I determine significance?
     “One potentially powerful approach is to assess conservation across and within multiple species as whole-genome sequence data become more abundant.”
     Look at extreme phenotypes: “Sampling cases or controls from the extremes of an appropriate quantitative distribution can often increase power”
     Look at non-SNP variants, which are more likely to have functional effects: “how to account for the technical features of sequencing, such as incomplete sequencing and biased coverage over the genome?”
  9. How to account for bias? Think of how you can provide evidence that your result is not just a local technical variation or sampling bias, e.g. data from the same cell type, same sequencing technology, same alignment… PRO TIP: include more reference data in your analysis
  10. How can I access more data for my research?
      • Know what data is available in your lab, your dept, your org
      • A survey from Qiagen showed that one of the main reasons researchers collaborate is to get access to data!
  11. How can I find collaborators? PRO TIP: Search for collaborators who have the data you need. PRO TIP: Tell your colleagues and peers what type of data you have in your lab
  12. Part 2: Data resources from around the world. Public repositories:
      • for some you apply for access, especially if data contains clinical info or whole genome PID
      • some are open access: GEO, SRA, PGP, OpenSNP, GigaDB, …
      • some are consented for general research use, some have specific consent
  13. Large amounts of data, but not accessible. [Chart, 2006 to 2015, exponential growth rate: 80+ PB sequenced every year; ≈ 0.5 PB of sequence (WGS data) available in public repos] Under-utilised data has huge potential for medical research
  14. DATA is fragmented
  15. It may be confusing
  16. Hundreds of data sources …but they aren’t easy to find! [Chart: bar values 10, 25, 33, 35, 102, 163; x-axis Jan-15, Mar-15, Jun-15, Sep-15, Dec-15, Mar-16; y-axis 0 to 200] First 30 data sources listed here: http://dx.doi.org/10.1371/journal.pbio.1002418
  17. Data source content: Assay Types; Dedicated to…
  18. Number of samples in data sources [Chart: sample number, log10 scale from 1 to 1,000,000] Top 5: GEO (1.8M), PMI Cohort Program (1M), Auria Biopankki (1M), EGA (~0.6M), SRA (~0.5M)
  19. Data accessibility
      Can download the data straight away or after logging in.
      Need to apply for access to the data.
      Has both Open and Restricted access data within one repository.
  20. Online data source ‘types’
      University – Affiliated to a university. Often only members of that university can upload/download to/from it.
      Catalogue – Doesn’t have raw data but lists studies/datasets.
      Initiative/Consortium – Has a specific purpose/aim. Often focussed on a question or disease.
      Repository – Can download from, has data from multiple institutions. Often can also upload your own data there.
      Company – For-profit organisation. Listing data is not their main purpose.
      Biobank – Many have sequence data of their biological samples.
  21. Sequenced ethnicities: Aboriginals, African Americans, Africans, Australians, Chinese, Malays, Indians, Danish, Dutch, Estonian, Russian, European Ancestry, Finnish, Icelandic, Japanese, Korean, Latin Americans, Saudi, Swedish
  22. Machines & data sources [Map figure: 947, 5600, 88, 660, 26, 68, 50, 62, 3, 25, 0, 0; 23 International] Interesting site to look at: http://omicsmaps.com/stats
  23. Main repository funders: BGI = 4, EBI = 9, NIH = 10, NCBI = 9, The Broad = 8, Wellcome = 4. EBI total: 104 services, 19 repositories (http://www.ebi.ac.uk/services/all). NCBI total: 67 databases (http://www.ncbi.nlm.nih.gov/guide/all/#databases_)
  24. Part 3: Tips to find and access data • Case study: DNA data on Cancer
  25. Case Study – DNA data on Cancer
      Repositories you have heard of:
      Repository | Data Type | Access
      ArrayExpress | Expression | Open
      GEO | Expression | Open
      EGA | Mixed | Restricted
      dbGaP | Mixed | Restricted
      Encode | Healthy Reference | Open
      1000 Genomes | Healthy Reference | Open
      Ask around (word of mouth):
      Repository | Data Type | Access
      COSMIC | Somatic mutations & WGS | Open
      ClinVar | Variant information | Open
      ExAC | Allele Freq. but not raw data | Open
      SRA | Individual sequences | Open
      TCGA | Clinical & high level data | Open
      CGHub | Low level data (DNA data) | Restricted
  26. Case Study – DNA data on Cancer. We have identified the first 27 cancer-specific data sources, and many more that contain cancer data alongside other data types: Abcodia, AmbryShare, BRCA Exchange, Breast Cancer Now Tissue Bank, Broad Cancer programme datasets, Cancer Moonshot 2020, CanGEM, CGCI, CGHub, Chinese cancer genome consortium, Chinese national human genome centre, Follicular Lymphoma Genome Data, G-DOC, GenoMel, ICGC, National Mesothelioma Virtual Bank, NCIP Hub, Project GENIE, Target, TCGA, Texas cancer research biobank, NCI-60, CCLE, COSMIC, Fantom, Cancer methylome system, Cancer therapeutics response portal
  27. Registering for CGHub: https://cghub.ucsc.edu/keyfile/newuser.html
      1. Register for eRA account
      2. Request access to specific dataset of interest
      3. Download data
      Flow: ‘Principal signing official’ registers → Email to verify → Email to confirm/deny access to website → Email with temporary password → Change password → Electronic signature → Login → Fill in contact info, complete ‘424’ form (research application form) → Request reviewed by DAC → Email to confirm/deny access to data → Login → Retrieve personal access token → Download!
  28. Often a long process. Bottlenecks:
      • Finding relevant and usable data
      • Getting authorisation to access data
      • Formatting data
      • Storing and moving data
      We studied the problem through qualitative interviews followed by a survey of researchers in human genetics
  29. Often a long process. Researchers spend months finding and accessing genomic data, and often choose not to access data at all. T. A. van Schaik et al., The need to redefine genomic data sharing: a focus on data accessibility, Applied & Translational Genomics, 2014, doi:10.1016/j.atg.2014.09.013
  30. Why the barrier?
  31. Why the barrier?
      • Benefits: strict governance, review of consent, applicant signs for full responsibility for governance
      • Disadvantages: no control of data once access is given, high barrier for access – too high?
  32. How can I save time?
      • Start planning your data needs early in your project
      • When you find the data you need, start the application
      • Use Open Access data
      PRO TIP: If you use human genomic data, apply for the GRU datasets in dbGaP: one application gives access to all the GRU datasets
  33. Not all data is restricted
      • Some data is Open Access → requires specific consent
      • OpenSNP.org (Bastian)
      • Personal Genome Projects
      • Individuals who put their genomes online, e.g. Manuel Corpas and his family, “the Corpasome”
      • http://manuelcorpas.com/about/
  34. Not all data is restricted
      • Some data is Open Access → requires specific consent
      • Individuals who put their genomes online, e.g. Manuel Corpas and his family, “the Corpasome”
      • http://manuelcorpas.com/about/
      • OpenSNP.org
      • Personal Genome Projects
  35. Personal Genome Project
      (columns: PGP Harvard | PGP Canada | PGP UK | Genom Austria)
      Host institution: Harvard Medical School, Boston | SickKids, Toronto | University College London | CeMM Research Center for Molecular Medicine
      Principal Investigator: George Church | Stephen Scherer | Stephan Beck | Christoph Bock & Giulio Superti-Furga
      Launch year: 2005 | 2012 | 2013 | 2014
      Geographic scope: USA, mainly Boston | Canada | United Kingdom | Mainly Austria
      Enrollment eligibility: At least 18 years old, able to make an informed decision, perfect score in the PGP enrollment exam, certain vulnerable groups excluded
      Data generated: Whole genome sequencing, upload of additional data possible | Mainly whole genome sequencing | Whole genome sequencing, DNA methylome sequencing, RNA transcriptome sequencing | Mainly whole genome sequencing
      Number of genomes: 100s | 10s | 10s | 10s
      Data access: http://personalgenomes.org/harvard/data (PGP Harvard); http://genomaustria.at/unser-genom/#genome-der-pionierinnen (Genom Austria)
      Project funding: Discretional funds and corporate sponsoring | Institutional startup funds | Discretional funds and corporate sponsoring | Institutional startup funds
      Areas of emphasis: Integration with phenotypic data, collaboration with other personal omics initiatives; Genome donations, synergy with massive-scale clinical genome sequencing projects; Genomes and society, genetic literacy, school projects, education
      Website: http://personalgenomes.org/harvard/ | http://personalgenomes.org/canada/ | http://personalgenomes.org/uk/ | http://genomaustria.at/
  36. Summary of data access barriers: Data is uploaded to repository → Data is discovered by potential user → Data is accessed by potential user
  37. But I do not want to share my data: the barrier of making data available
      • “even when researchers are authorised to share data they report reluctance to do so because of the amount of effort required” http://www.sciencedirect.com/science/article/pii/S2212066114000386
      • “Clinical geneticists cited a lack of time because their main priority is diagnosing patients. Industrial researchers cited a lack of time because of the pressure to meet the deadlines in their job. Researchers in academia cited both a concern about the potential loss of future publications once unpublished data is shared, and the lack of time and incentive to share data as this does not contribute to their publication record. Researchers from all categories felt that they lacked sufficient resources to make their data available.”
  38. Best practices
      • If you expect data to be available to you, you have to make your data available too!
      • Encourage collaborations: power by numbers
      1. Get credit – publish and make your data available
      2. Give credit – cite data sources
      3. Understand consent – for all uses of clinical data
  39. Best practices: use the tools
      • Use all available tools to make your life easier:
      • Data publications → visibility and citations for your data, e.g. GigaScience and Scientific Data
      • Figshare, Zenodo, Dryad for sharing open access data
      • PhenomeCentral, Matchmaker Exchange for rare disease research
      • Repositive for finding data across repositories and making your own data discoverable
  40. Best practices: Plan into your grant proposals. Does data sharing matter at grant proposal evaluation? Based on: Winning Horizon 2020 with Open Science, http://dx.doi.org/10.5281/zenodo.12247
  41. Best practices: Plan into your grant proposals. Excellence, Impact, Implementation: “Weakness: Involvement of non-academic beneficiaries is limited” “Weakness: highly focused on academic activities, and lacks an advanced communication strategy” “Weakness: limited exposure to non-academic partners & infrastructures” “data accessibility is unclear!” “data storage & access not considered”
  42. Best practices: Plan into your grant proposals. Impact: “Strengths: extensive dissemination of data to the scientific community (open access, databases)” “outreach activities to a broad audience” “research software is freely available”
  43. Best practices: Plan into your grant proposals
  44. Best practices: Share in return! Make the (research) world a better place by sharing in return
  45. What the future holds
      • Digital consent: towards automatic processing of applications
      • Dynamic consent and power to the patient, e.g. PatientsKnowBest
      • Privacy-preserving access to datasets: preserving control and governance with data custodian, lower barrier for access
  46. Part 4: Hands-on session using Repositive. What if finding data was as easy as finding a book on Amazon or booking a hotel on Expedia?
  47. Repositive promotes best practices: Discover new data sources (EASY SEARCH)
  48. Repositive promotes best practices: Make your data visible (SHARE KNOWLEDGE)
  49. Repositive promotes best practices: Build a data community (BUILD TRUST)
  50. Benefit for both sides of data collaboration
      Data consumers: Find relevant data faster; Feedback from other users through ratings and comments to evaluate data quality; Find collaborators with data
      Data producers: Make your data visible; Build credibility as a trusted provider of quality data; Find collaborators to analyse your data
  51. Live demo: http://discover.repositive.io Use activation code: BioBS16
  52. Part 5: Summary and feedback • Get credit – publish data • Give credit – cite data • Understand consent
  53. Tell us your thoughts: @repositiveio @glyn_dk And read more on http://repositive.io
  54. Thank you!
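
As a rough illustration of the growth in catalogued data sources shown on slide 16, the sketch below tabulates the six chart values against the six axis dates. Pairing the values with the dates in order is an assumption, since the chart itself is not reproduced in this transcript; `counts` and `growth_table` are hypothetical names chosen for the example.

```python
# Data-source counts as read off slide 16.
# ASSUMPTION: the six bar values map onto the six axis dates in order.
counts = {
    "Jan-15": 10,
    "Mar-15": 25,
    "Jun-15": 33,
    "Sep-15": 35,
    "Dec-15": 102,
    "Mar-16": 163,
}


def growth_table(series):
    """Return (label, count, change since previous point) rows."""
    rows = []
    previous = None
    for label, count in series.items():
        delta = None if previous is None else count - previous
        rows.append((label, count, delta))
        previous = count
    return rows


rows = growth_table(counts)
for label, count, delta in rows:
    change = "" if delta is None else f"{delta:+d}"
    print(f"{label:>7} {count:>4} {change:>5}")

# Ratio between the last and first data points.
overall_factor = counts["Mar-16"] / counts["Jan-15"]
print(f"Overall growth, Jan-15 to Mar-16: {overall_factor:.1f}x")
```

Under that pairing, most of the increase falls in the last two intervals, which fits the slide's point that many sources exist but take effort to find.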
