You are free to:
Copy, share, adapt, or re-mix;
Photograph, film, or broadcast;

Blog, live-blog, or post video of;

This presentation. Provided that:
You attribute the work to its author and respect the rights
and licenses associated with its components.

Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.
Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at;
http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
Disclaimer
• I do not (and will not) profit in any way, shape
or form, from any of the brands, products or
companies I may mention in this
presentation.
Data availability and re‐usability in the
transition from microarray to next‐generation
sequencing: can we do better?
B.F. Francis Ouellette
• Senior Scientist & Associate Director, Informatics and
Biocomputing, Ontario Institute for Cancer Research,
Toronto, ON
• Associate Professor, Department of Cell and Systems
Biology, University of Toronto, Toronto, ON.

@bffo on
• Gabriella Rustici, Eleanor Williams, B.F. Francis Ouellette,
Alvis Brazma and the Functional Genomics Data Society
http://fged.org

• Alvis Brazma - EBI
• Roger Bumgarner - U of Washington
• Cesare Furlanello - FBK - MPBA
• Michael Miller - ISB
• Francis Ouellette - OICR
• John Quackenbush - Dana-Farber
• Michael Reich - Broad
• Gabriella Rustici - EBI
• Chris Stoeckert - U Penn
• Ronald Taylor - PNNL
• Steve Chervitz Trutane - Personalis
• Jennifer Weller - UNC
• Brian Wilhelm - IRIC
• Neil Winegarden - UHN
FGED’s mission:

To be a positive agent of
change in the effective
sharing and reproducibility
of functional genomic data
Poster # 142 (Friday)
fged.org
I come here wearing many hats!
• Officer of FGED
• Data submitter to a large international cancer
genomics initiative
• Receiving and curating data from that same
initiative from 67 cancer genome projects.
• Editor in an #openaccess journal where we are just
now rewriting the data submission policy to ensure
reproducibility
• Associate Editor of an #OA DATABASE journal
• Also on the SAB of Galaxy and Genomespace
What do we do with this?
FGED
(Functional Genomics Data Society)
was
MGED
(Microarray Gene Expression
Data Society)
we evaluated the replication of data analyses in 18 articles on
microarray-based gene expression profiling. (…) We reproduced
two analyses in principle and six partially or with some
discrepancies; ten could not be reproduced. The main reason
for failure to reproduce was data unavailability, and discrepancies
were mostly due to incomplete data annotation or specification of
data processing and analysis. Repeatability of published
microarray studies is apparently limited. More strict publication
rules enforcing public data availability and explicit description of
data processing and analysis should be considered.
Does it matter?
• Ioannidis et al. (2009) were not saying that
the papers were wrong.

• But there were problems
– missing data (38%)
– missing software, hardware details (50%)
– missing method, processing details (66%)
… forensic bioinformatics [was needed] to infer what
was done to obtain the results
- Keith Baggerly
Does it matter?
• In both cases the supporting data WERE deposited
in GEO or ArrayExpress
• Forensic bioinformatics was needed and more
often than not failed
• Maybe just depositing is not quite enough?
What was in MIAME?
1. The raw data
2. The final processed (normalised) data
3. The essential sample annotation and experimental
variables
4. Sample data relationships
5. Array annotation (e.g., probe oligonucleotide
sequences)
6. The laboratory and data processing protocols
Did it work? The glass half empty…
• Where were the hiccups? MIAME was asking too
much!
• However, some now say that MIAME is much too
little to ask! (e.g., publishing fully documented code
with instructions how to run it)
• What does ‘sufficient data processing
protocols’ mean?
• Even when data and protocols were deposited,
would the reviewers check these? Probably not
• So does it help at all?
Did it work? The glass half full …
• ArrayExpress and GEO have data from well
over 6 million high throughput assays from
some 30,000 functional genomics studies
• The MIAME compliance has been increasing
over time
• Many studies have shown the reusability of
these data
• We can have an informed discussion about the
reproducibility rather than forensics
Standards for content vs
standards for format
• Developing a usable format is challenging
– If it’s too ‘flexible’, too much free text, it’s no longer a
standard, no software can reasonably parse it
– If it’s too rigid, too granular, it can’t handle new types of
data, and people end up putting things in fields that don’t
work

• Human readable formats are useful, but machine
readability is essential!
A simple human readable format for Functional
genomics experiment metadata
• Sample-Data Relationship File (SDRF)
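To make the point about machine readability concrete, here is a minimal sketch of reading a tab-delimited, SDRF-like table with the Python standard library. The column names follow the MAGE-TAB style, but the sample names, file names and the helper functions `read_sdrf` and `files_by_factor` are hypothetical and are not taken from the talk.

```python
import csv
import io

# Hypothetical SDRF-like content: one row per sample, tab-delimited.
# The samples and file names are invented for illustration only.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[treatment]\tArray Data File\n"
    "sample1\tHomo sapiens\tcontrol\tsample1.CEL\n"
    "sample2\tHomo sapiens\tdrug\tsample2.CEL\n"
)


def read_sdrf(text):
    """Return one dict per row of a tab-delimited SDRF-like table."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def files_by_factor(rows, factor_column, file_column):
    """Group data file names by the value of one experimental factor."""
    groups = {}
    for row in rows:
        groups.setdefault(row[factor_column], []).append(row[file_column])
    return groups


rows = read_sdrf(SDRF_TEXT)
groups = files_by_factor(rows, "Factor Value[treatment]", "Array Data File")
print(groups)
```

Because each row ties a sample to its factors and its data file, the same table stays readable in a spreadsheet and parseable by software without any free-text interpretation.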
Lessons learned
• Keep it simple, keep it simple, keep it simple!
• Perils of designing standards by a committee vs
advantages of community agreement
• Successful formats are mostly defined by
successful software, e.g., GFF in UCSC GB or
Bioconductor’s gene_set
• The attraction and perils of perfection – the last few
steps of full automation cost most effort
– A human may be a cheap broker between two
pieces of software (again – Bioconductor example)
What does it mean for HTS?
• (RNASeq – ChIPSeq)
• The metadata for functional genomics HTS
experiments are not so different from microarray
experiments – replace CEL files with BAM files
MINSEQE - Minimum Information about a high-throughput Nucleotide SeQuencing Experiment
1. A general description of the aim of the experiment;
2. The submitter contact details;
3. Essential sample annotation and the experimental
factors;
4. An ‘experiment’ or ‘run’ date, which may be
important for identifying batch effects;
5. Sufficient information to correctly identify bio &
tech reps;
6. Experimental and data processing protocols
7. Raw sequencing reads location; and processed
data.
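The seven items above can be treated as a checklist that software applies to a submission record. The sketch below is only an illustration: the field names in `MINSEQE_ITEMS`, the record layout and the function `missing_items` are hypothetical labels, not part of the MINSEQE text or of any archive's submission format.

```python
# Hypothetical field names, one per MINSEQE item listed above.
MINSEQE_ITEMS = [
    "experiment_description",
    "submitter_contact",
    "sample_annotation_and_factors",
    "run_date",
    "replicate_information",
    "protocols",
    "raw_reads_and_processed_data",
]


def missing_items(record):
    """Return the checklist items that are absent or empty in a record."""
    return [item for item in MINSEQE_ITEMS if not record.get(item)]


# Invented example record: the run date is empty and no data location is given.
example_record = {
    "experiment_description": "RNA-Seq of treated vs untreated cells",
    "submitter_contact": "submitter@example.org",
    "sample_annotation_and_factors": {"treatment": ["control", "drug"]},
    "run_date": "",
    "replicate_information": {"biological": 3, "technical": 1},
    "protocols": "library preparation and alignment protocol text",
}

missing = missing_items(example_record)
print(missing)
```

A check like this only confirms that a field is filled in; whether a protocol description is actually sufficient still needs a human reader.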
Percentage of publications from 2012
containing new gene expression data
Data type     Number of PMIDs with new data    % of data in SRA/ArrayExpress/GEO
Microarray    347                              49
RNA-Seq       334                              61
Percentage of RNA-Seq studies
providing metadata (1/2)
Original database           ArrayExpress    GEO    SRA
Experimental description    95              100    100
Contact                     100             100    0
Sample & factor info        100             100    60
Experiment or run date      0               0      60
Percentage of RNA-Seq studies
providing metadata (2/2)
Original database                   ArrayExpress    GEO          SRA
Biological and tech replicates      Yes             Sometimes    Yes
Exp and data processing protocol    60              100          0
Raw reads                           100             100          100
Processed data                      35              90           0
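As a small worked example, the percentages on the two metadata slides above can be put into a dictionary to list, for each database, the MINSEQE items that fewer than 100% of the surveyed RNA-Seq studies provided. The numbers are copied from the slides; the row labels are shortened, and the names `COVERAGE` and `gaps` are hypothetical. The replicate row is left out because it is reported as Yes/Sometimes rather than as a percentage.

```python
# Percentages copied from the two "providing metadata" slides.
COVERAGE = {
    "Experimental description": {"ArrayExpress": 95, "GEO": 100, "SRA": 100},
    "Contact": {"ArrayExpress": 100, "GEO": 100, "SRA": 0},
    "Sample & factor info": {"ArrayExpress": 100, "GEO": 100, "SRA": 60},
    "Experiment or run date": {"ArrayExpress": 0, "GEO": 0, "SRA": 60},
    "Protocols": {"ArrayExpress": 60, "GEO": 100, "SRA": 0},
    "Raw reads": {"ArrayExpress": 100, "GEO": 100, "SRA": 100},
    "Processed data": {"ArrayExpress": 35, "GEO": 90, "SRA": 0},
}


def gaps(coverage, threshold=100):
    """Map each database to the items provided by fewer than `threshold` percent of studies."""
    result = {}
    for item, by_database in coverage.items():
        for database, percent in by_database.items():
            if percent < threshold:
                result.setdefault(database, []).append(item)
    return result


incomplete = gaps(COVERAGE)
for database, items in incomplete.items():
    print(database, "->", ", ".join(items))
```

Read this way, the run date is incomplete in all three databases, and raw reads are the only item that every study provided everywhere.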
Things we still need to do:
• Involve folks from NCBI
• Compare methods and metrics over time (2009-2012)
• Compare methods with ENCODE, ICGC, EGA and
the databases we presented here.
• Look for shared meta data and seek to mate what
is best and core to all.
• Make sure it aligns with large funders’ current
requirements.
• Share and publish this information
Take home messages
• Archiving just something is not the same as
making data available and useful – metadata,
analysis code, usable format, …
– Storing metadata doesn’t cost too much, extracting them
from data generators does!

• Minimising the human mediation in moving data
between the LIMS, archives and analysis tools is a
more realistic goal than eliminating it – the need for
brokerage
• The main source of variability in RNA-Seq
interpretation seems to be the alignments – we
don’t know how to do this well yet. Getting the
short reads for RNA-Seq is a beginning.
• FGED: The Functional Genomics Data Society is a
very open society, and we welcome feedback and
input!

– http://fged.org
– Twitter: @fged
Acknowledgements:
• Gabriella Rustici, Eleanor Williams, Alvis Brazma and
the Functional Genomics Data Society http://fged.org
• Alvis Brazma - EBI
• Roger Bumgarner - U of Washington
• Cesare Furlanello - FBK - MPBA
• Michael Miller - ISB
• Francis Ouellette - OICR
• John Quackenbush - Dana-Farber
• Michael Reich - Broad
• Gabriella Rustici - EBI
• Chris Stoeckert - U Penn
• Ronald Taylor - PNNL
• Steve Chervitz Trutane - Personalis
• Jennifer Weller - UNC
• Brian Wilhelm - IRIC
• Neil Winegarden - UHN

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Cshl minseqe 2013_ouellette

  • 1. You are free to: Copy, share, adapt, or re-mix; Photograph, film, or broadcast; Blog, live-blog, or post video of; This presentation. Provided that: You attribute the work to its author and respect the rights and licenses associated with its components. Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero. Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at; http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
  • 3. Disclaimer • I do not (and will not) profit in any way, shape or form, from any of the brands, products or companies I may mention in this presentation.
  • 4. Data availability and re‐usability in the transition from microarray to next‐generation sequencing: can we do better? B.F. Francis Ouellette • Senior Scientist & Associate Director, Informatics and Biocomputing, Ontario Institute for Cancer Research, Toronto, ON • Associate Professor, Department of Cell and Systems Biology, University of Toronto, Toronto, ON. @bffo on
  • 5. Gabriella Rustici, Eleanor Williams, B.F. Francis Ouellette, Alvis Brazma and the Functional Genomics Data Society (http://fged.org): Alvis Brazma - EBI; Roger Bumgarner - U of Washington; Cesare Furlanello - FBK – MPBA; Michael Miller - ISB; Francis Ouellette - OICR; John Quackenbush – Dana-Farber; Michael Reich - Broad; Gabriella Rustici - EBI; Chris Stoeckert – U Penn; Ronald Taylor - PNNL; Steve Chervitz Trutane - Personalis; Jennifer Weller - UNC; Brian Wilhelm - IRIC; Neil Winegarden - UHN
  • 6. FGED’s mission: To be a positive agent of change in the effective sharing and reproducibility of functional genomic data Poster # 142 (Friday) fged.org
  • 7. I come here wearing many hats! • Officer of FGED • Data submitter to a large international cancer genomics initiative • Receiving and curating data from that same initiative from 67 cancer genome projects. • Editor in an #openaccess journal where we are just now rewriting the data submission policy to ensure reproducibility • Associate Editor of an #OA DATABASE journal • Also on the SAB of Galaxy and Genomespace
  • 8. What do we do with this? FGED (Functional Genomics Data Society) was MGED (Microarray Gene Expression Data Society)
  • 9. we evaluated the replication of data analyses in 18 articles on microarray-based gene expression profiling. (…) We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced. The main reason for failure to reproduce was data unavailability, and discrepancies were mostly due to incomplete data annotation or specification of data processing and analysis. Repeatability of published microarray studies is apparently limited. More strict publication rules enforcing public data availability and explicit description of data processing and analysis should be considered.
  • 10. Does it matter? • In Ioannidis et al (2009), they were not saying that the papers were wrong. • But there were problems – missing data (38%) – missing software, hardware details (50%) – missing method, processing details (66%)
  • 11. … forensic bioinformatics [was needed] to infer what was done to obtain the results - Keith Baggerly
  • 12. Does it matter? • In both cases the supporting data WERE deposited in GEO or ArrayExpress • Forensic bioinformatics was needed and more often than not failed • Maybe just depositing is not quite enough?
  • 14. What was in MIAME? 1. The raw data 2. The final processed (normalised) data 3. The essential sample annotation and experimental variables 4. Sample data relationships 5. Array annotation (e.g., probe oligonucleotide sequences) 6. The laboratory and data processing protocols
  • 15. Did it work? The glass half empty… • Where were the hiccups? MIAME was asking too much! • However, some now say that MIAME is much too little to ask! (e.g., publishing fully documented code with instructions on how to run it) • What does ‘sufficient data processing protocols’ mean? • Even when data and protocols were deposited, would the reviewers check these? Probably not • So does it help at all?
  • 16. Did it work? The glass half full … • ArrayExpress and GEO have data from well over 6 million high throughput assays from some 30,000 functional genomics studies • The MIAME compliance has been increasing over time • Many studies have shown the reusability of these data • We can have an informed discussion about the reproducibility rather than forensics
  • 17. Standards for content vs standards for format • Developing a usable format is challenging – If it’s too ‘flexible’, with too much free text, it’s no longer a standard and no software can reasonably parse it – If it’s too rigid, too granular, it can’t handle new types of data, and people end up putting things in fields that don’t fit • Human-readable formats are useful, but machine readability is essential!
  • 18. A simple human readable format for Functional genomics experiment metadata • Sample-Data Relationship File (SDRF)
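A minimal sketch of why a tab-delimited layout such as SDRF is both human-readable and machine-readable. The column names and file names below are illustrative assumptions, not taken from the slides; the point is only that one row per sample lets generic software group data files by an experimental factor.

```python
import csv
import io

# Hypothetical SDRF-like content: tab-delimited, one row per sample.
# Column and file names are made up for illustration.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[treatment]\tDerived Data File\n"
    "sample1\tHomo sapiens\tcontrol\tsample1.bam\n"
    "sample2\tHomo sapiens\ttreated\tsample2.bam\n"
)


def files_by_factor(sdrf_text, factor_column, file_column):
    """Group data file names by the value of one experimental factor."""
    groups = {}
    for row in csv.DictReader(io.StringIO(sdrf_text), delimiter="\t"):
        groups.setdefault(row[factor_column], []).append(row[file_column])
    return groups


groups = files_by_factor(SDRF_TEXT, "Factor Value[treatment]", "Derived Data File")
print(groups)
```

The same table opens unchanged in a spreadsheet, which is the "human readable" half of the argument.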
  • 19. Lessons learned • Keep it simple, keep it simple, keep it simple! • Perils of designing standards by a committee vs advantages of community agreement • Successful formats are mostly defined by successful software, e.g., GFF in UCSC GB or Bioconductor's gene_set • The attraction and perils of perfection – the last few steps of full automation cost the most effort – A human may be a cheap broker between two pieces of software (again – Bioconductor example)
  • 20. What does it mean for HTS? • (RNASeq – ChIPSeq) • The metadata for functional genomics HTS experiments are not so different from those for microarray experiments – replace CEL files with BAM files
  • 21. MINSEQE - Minimum Information about a high-throughput Nucleotide SeQuencing Experiment 1. A general description of the aim of the experiment; 2. The submitter contact details; 3. Essential sample annotation and the experimental factors; 4. An ‘experiment’ or ‘run’ date, which may be important for identifying batch effects; 5. Sufficient information to correctly identify biological and technical replicates; 6. Experimental and data processing protocols; 7. Raw sequencing reads location; and processed data.
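As a rough illustration, the seven MINSEQE items above can be treated as a checklist that software runs against a submission. This is a sketch only: the key names and the example record are hypothetical, and no archive's actual validation logic is implied.

```python
# Hypothetical labels for the seven MINSEQE items listed on the slide.
MINSEQE_ITEMS = [
    "experiment_description",
    "submitter_contact",
    "sample_annotation_and_factors",
    "run_date",
    "replicate_identification",
    "protocols",
    "raw_reads_and_processed_data",
]


def missing_items(record):
    """Return the checklist items that are absent or empty in a record."""
    return [item for item in MINSEQE_ITEMS if not record.get(item)]


# Made-up submission: the run date is blank and replicate info is absent.
submission = {
    "experiment_description": "RNA-Seq of treated vs control cells",
    "submitter_contact": "submitter@example.org",
    "sample_annotation_and_factors": {"treatment": ["control", "treated"]},
    "run_date": "",
    "protocols": "library preparation and alignment described in methods",
    "raw_reads_and_processed_data": "archive location of reads and counts",
}

print(missing_items(submission))
```

A check like this only confirms that a field is filled in; whether a protocol description is "sufficient" still needs a human reader.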
  • 22. Percentage of publications from 2012 containing new gene expression data
      Data type     Number of PMIDs with new data     % of data in SRA/ArrayExpress/GEO
      Microarray    347                               49
      RNA-SEQ       334                               61
  • 23. Percentage of RNA-Seq studies providing metadata (1/2)
      Original Database           ArrayExpress    GEO    SRA
      Experimental description    95              100    100
      Contact                     100             100    0
      Sample & Factor info        100             100    60
      Experimental or Run date    0               0      60
  • 24. Percentage of RNA-Seq studies providing metadata (2/2)
      Original Database                   ArrayExpress    GEO          SRA
      Biological and Tech replicates      Yes             Sometimes    Yes
      Exp and data processing protocol    60              100          0
      Raw reads                           100             100          100
      Processed data                      35              90           0
  • 25. Things we still need to do: • Involve folks from NCBI • Compare methods and metrics over time (2009-2012) • Compare methods with ENCODE, ICGC, EGA and the databases we presented here. • Look for shared metadata and seek to mate what is best and core to all. • Make sure it aligns with large funders’ current requirements. • Share and publish this information
  • 26. Take home messages • Archiving just something is not the same as making data available and useful – metadata, analysis code, usable format, … – Storing metadata doesn’t cost too much, extracting them from data generators does! • Minimising the human mediation in moving data between the LIMS, archives and analysis tools is a more realistic goal than eliminating it – the need for brokerage • The main source of variability in RNA-Seq interpretation seems to be the alignments – we don’t know how to do this well yet. Getting the short reads for RNA-Seq is a beginning.
  • 27. • FGED: The Functional Genomics Data Society is a very open society, and we welcome feedback and input! – http://fged.org – Twitter: @fged
  • 28. Acknowledgements: Gabriella Rustici, Eleanor Williams, Alvis Brazma and the Functional Genomics Data Society (http://fged.org): Alvis Brazma - EBI; Roger Bumgarner - U of Washington; Cesare Furlanello - FBK – MPBA; Michael Miller - ISB; Francis Ouellette - OICR; John Quackenbush – Dana-Farber; Michael Reich - Broad; Gabriella Rustici - EBI; Chris Stoeckert – U Penn; Ronald Taylor - PNNL; Steve Chervitz Trutane - Personalis; Jennifer Weller - UNC; Brian Wilhelm - IRIC; Neil Winegarden - UHN