A FAIR Data Sharing Framework for Large-Scale Human Cancer Proteogenomics
Brett Tully
Islam M1,2, Christiansen J3, Mahboob S4, Valova V4, Baker M4, Capes-Davis D4, Hains P4, Balleine R1,4, Zhong Q1,4, Reddel R1,4, Robinson P1,4, Tully B4
1 The University of Sydney, Camperdown, Sydney, NSW, 2050, Australia
2 Intersect, Level 13/50 Carrington St, Sydney, NSW, 2000, Australia
3 Queensland Cyber Infrastructure Foundation Ltd, Axon Building 47, University of Queensland, St Lucia, Brisbane, QLD, 4072, Australia
4 Children’s Medical Research Institute, Westmead, NSW, 2145, Australia
Background
The ACRF International Centre for the Proteome of Cancer (ProCan) at Children’s Medical Research Institute (CMRI) is an “industrial scale” program specialising in small-sample proteomics analysis from human cancer tissue.
ProCan seeks to generate both a wide and deep analytics pipeline and requires an enabling data framework. The framework must accommodate initial analysis and proteomic profiling of a large number of tumor samples, along with the clinical and demographic information, subsequent multi-omics studies, and any previously recorded responses to treatment. The curated datasets will provide a valuable resource beyond their primary use and ProCan is committed to making its data accessible to collaborators and the wider scientific community.
Objectives
The objective of the project is to establish an efficient, reliable, secure and ethical data sharing and publication framework based on best-practice data sharing principles, such as the FAIR principles. The framework must address the challenges that stem from the scale and complexity of the program and from ProCan’s focus on human-derived data, which must be shared while maintaining the privacy of research participants.
Method
The project adopted a requirements-driven methodology and engaged with a wide range of ProCan stakeholders nationally and internationally. Together, various industrial-scale proteomics data management and sharing scenarios were explored such that robust and ethical sharing of the data would be achieved.
Results
The project developed a data sharing framework based on the FAIR principles, which currently forms the basis of ongoing implementation work within the ProCan program.
1. We are always looking for data
Finding & Accessing Human Genomic Data for research
BioSB 2017
Tweets welcome
#dataeureka
@repositiveio
2. Genomic data is important for research
Pre-clinical drug discovery
Diagnostics and treatments of genetic diseases
3. “Consensus among researchers, clinicians, politicians & the public that genomics will transform biomedical research, healthcare and lifestyle choices”
Stephan Beck, UCL
OPPORTUNITY
4. Genome Technology Evolution
2001: 1 human genome
2005: Personal Genome Project
Human Genome Diversity Project
HapMap
2008: 1000 Genomes (1092 genomes, since increased to ~2500)
Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE)
2011: H3Africa
2012: International Cancer Genome Consortium
2016: 2M AstraZeneca - HLI
5. Large amounts of data, but not accessible
≈ 0.5 PB of sequence data available (WGS data available in public repos)
80+ PB sequenced every year
[Chart: exponential growth rate, x-axis 2006 to 2015]
Under-utilised data has huge potential for medical research
6. How many data sources?
How many sources of human
genomic data do you know about?
7. Hundreds of data sources
…but they aren’t easy to find!
First 30 data sources listed here: http://tinyurl.com/plos-biology-repositive
[Chart: Data Sources Identified, y-axis 0 to 700. Values in source order: 10, 25, 33, 35, 102, 174, 239, 314, 506, 582. Date labels: Jan-15, Mar-15, Jun-15, Sep-15, Dec-15, Mar-16, Jun-16, Sep-16, Dec-16, Mar-17.]
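The growth shown in the "Data Sources Identified" chart can be summarised with a few lines of arithmetic. The sketch below is illustrative only: it assumes the ten counts in the transcript line up one-to-one, in order, with the ten date labels, which the flattened transcript does not confirm. The function name `growth_summary` is hypothetical.

```python
# Counts and date labels as they appear in the slide transcript.
# Assumption: the i-th count belongs to the i-th date label.
dates = ["Jan-15", "Mar-15", "Jun-15", "Sep-15", "Dec-15",
         "Mar-16", "Jun-16", "Sep-16", "Dec-16", "Mar-17"]
counts = [10, 25, 33, 35, 102, 174, 239, 314, 506, 582]


def growth_summary(dates, counts):
    """Return the increase per interval and the overall growth factor."""
    steps = [
        (dates[i], dates[i + 1], counts[i + 1] - counts[i])
        for i in range(len(counts) - 1)
    ]
    factor = counts[-1] / counts[0]
    return steps, factor


steps, factor = growth_summary(dates, counts)
largest = max(steps, key=lambda step: step[2])

print(f"Overall growth factor: {factor:.1f}x")
print(f"Largest single increase: {largest[2]} sources ({largest[0]} to {largest[1]})")
```

Under that pairing assumption, the number of identified sources grows by well over an order of magnitude across the period, and the increments are uneven rather than steady.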
9. Data sources across the globe
GEO location of 278 data sources analysed. Found by tracking IP address of the source.
These include:
Public Repositories
Universities
Companies
BioBanks
Research consortiums
[Charts: number of data sources per European country (GB, FI, NL, FR, DE, CH, EE, BE, DK, ES, SI, IE, SE) and per US state (CA, MD, MA, WA, NY, TX, AZ, DC, NJ, NC, PA, UT, TN, CO, IN, FL, LA, VA, IL, ME, OH, MO, MI, SC, OR); bar values and map labels are not recoverable from the transcript]
11. Public Repositories
• Required by funders
• Cannot publish unless accession number given
• Specialised for genomics
  • ArrayExpress
  • EGA
  • dbGaP
  • GEO…
• Generalist
  • Dryad
  • Figshare…
See http://discover.repositive.io for more
12. The researchers’ pain points
FRAGMENTED
No holistic approach to discover new data
HIDDEN
13. The researchers’ pain points
FRAGMENTED
No holistic approach to discover new data
ADMIN BURDEN
14. GOVERNANCE Models
Open Access
• E.g. PGP, CC0
• Bermuda Accord
Managed (Restricted or Controlled Access)
• Data Access Committee
• No effective agreement (policy vacuum)
15. Data accessibility
• Can download the data straight away or after logging in.
• Need to apply for access to the data.
• Has both Open and Restricted access data within one repository.
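The three accessibility categories on this slide can be expressed as a small data model. This is a minimal sketch, not part of the original presentation: the names `Access`, `Dataset` and `classify_repository`, the labels "open", "restricted" and "mixed", and the example dataset names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    OPEN = "open"              # download straight away or after logging in
    RESTRICTED = "restricted"  # need to apply for access
    MIXED = "mixed"            # both kinds of data within one repository


@dataclass(frozen=True)
class Dataset:
    name: str                   # hypothetical dataset identifier
    requires_application: bool  # True if access must be applied for


def classify_repository(datasets):
    """Classify a repository by the access requirements of its datasets."""
    if not datasets:
        raise ValueError("repository has no datasets to classify")
    flags = {d.requires_application for d in datasets}
    if flags == {False}:
        return Access.OPEN
    if flags == {True}:
        return Access.RESTRICTED
    return Access.MIXED


# Hypothetical example repositories.
open_repo = [Dataset("demo-open-1", False), Dataset("demo-open-2", False)]
mixed_repo = [Dataset("demo-open-1", False), Dataset("demo-controlled-1", True)]

print(classify_repository(open_repo).value)
print(classify_repository(mixed_repo).value)
```

Classifying at the repository level mirrors the slide, which describes whole repositories rather than individual datasets; a real catalogue would likely record access conditions per dataset as well.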
16. Access to Restricted Data
Benefits:
• Strict governance
• Individuals are protected
• Review of consent
• Applicant signs for full responsibility for governance
Disadvantages:
• No control of data once access is given
• High barrier for access – too high?
17. Often a long process
Bottlenecks:
• Finding relevant and usable data
• Getting authorisation to access data
• Formatting data
• Storing and moving data
We studied the problem with qualitative interviews followed by a survey of researchers in human genetics.
T. A. van Schaik et al., The need to redefine genomic data sharing: a focus on data accessibility, Applied & Translational Genomics, 2014
http://tinyurl.com/schaik-dnadigest
18. dbGaP application process
[Flowchart; the pairing of timings with individual steps is not recoverable from the transcript]
Decision points (Yes/No): NIH / eRA Commons login; Organisation registered with eRA; Organisation has DUNS number
Steps: Write research proposal; Submit proposal; Access granted; Find/Download/Decrypt data; Science…
Timings shown, in source order: + 2-3 days, + 1-2 weeks, + 1 week, + 1-2 days, + 1-4 weeks, + 1-2 days
PRO Tip: If you use human genomic data, apply for the GRU datasets in dbGaP: one application gives access to all the GRU datasets.
Blog Post:
http://blog.repositive.io/how-to-successfully-apply-for-access-to-dbgap/
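19. EGA application process
[Flowchart; the pairing of timings with individual steps is not recoverable from the transcript]
Decision point (Yes/No): Sanger eDAM Account
Steps: Write research proposal; Submit proposal; Access granted; Find/Download/Decrypt data; Science…
Timings shown, in source order: + 1 hour, + 1-2 days, + 2-7 days, + 1-2 days
Blog Post:
http://blog.repositive.io/how-to-successfully-apply-for-access-to-ega/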
22. We are enabling best practices
MAKE DATA DISCOVERABLE
SIMPLIFY WORKFLOWS
CONTRIBUTE TO COMMUNITY
A platform to make human genomic data accessible for research
23. 1-click to human genomic data access
to make finding data as easy as finding a book on Amazon or booking a hotel on Expedia!
Repositive
24. Simpler workflow for data access
Our expertise is data search platforms
Discover and access
Search, see related results
Find colleagues & their data interests
Co-annotate data & community feedback
Genomics data is needed for research and drug discovery
It enables researchers to develop diagnostics and treatments for genetic diseases
Because interpretation requires LOTS of data
And although data exists around the world, it is siloed, and even if available, it is not accessible
This is Jenn, a genetic researcher (our target customer) seeking to interpret data from genetic diseases and cancer
She needs data from other patients to compare and interpret Mabel's DNA
She also has data available in her own lab, but she cannot share it because of concerns about how to deal with secure access to sensitive data and data governance, e.g. vetting of users
Population scale genome sequencing projects have been launched all over the world
More than 80 PB of human genomic data is being sequenced every year
BUT
To date, only around 0.5 PB of data is available in public repositories
Data is fragmented in unconnected silos, which makes it very difficult to discover data
There are many public repositories, but it can be hugely confusing to know where to look for the right kind of data
Data privacy is a concern and controlled access is a requirement for many clinical datasets
Accessing data is a time-consuming and bureaucratic exercise
Just like Liz, a researcher struggling to get hold of the genomics data she needed for her research.
So… she quit her job at Illumina and decided to try to do something about that problem.
Our mission is to speed up research and diagnostics for genetic diseases by enabling efficient and ethical access to genomic research data
Our vision is to make genomic data access as easy as finding a book on Amazon or booking a hotel on Expedia
KEY POINTS:
Repositive builds tools for genomics data search & access.
We’re really good at it. We have the expertise in-house. It’s what we do.
Aside from building a highly functional tool, we’ve taken the time to prioritise User Experience, streamlining of user workflows & presentation.
Within a month of our formal platform launch we have over 600 registered users.
The Repositive platform is an online community and marketplace connecting data consumers with data providers.
On Repositive, Jenn has:
Easy, interactive search
Faster data access workflow
Easy access to new data collaborators
Feedback on data from the community and colleagues, to help assess data quality and utility
The Repositive platform and technology will remove barriers to data sharing and will incentivise users to explore, contribute and collaborate in alignment with best practices
DNA.land
OpenSNP
Personal Genome Project
Direct to consumer genetic tests & microbiome