On the importance (and absence)
of annotation in Next
Generation Sequencing Data
Hugh Shanahan & Jamie Alnasir
Hugh.Shanahan@rhul.ac.uk
@hughshanahan
Results to be published in GigaScience
It was the best of times
• Many exciting experiments based on gathering huge amounts of data.
• 100,000 Genomes in the UK, many others
• Elixir - Exabytes of biomedical data in the next decade
• Large experiments - SKA, LHC
• Opening up of Government data
• Up ahead - Sensor networks and Monitoring Cities
• Machine Learning is now a widely accepted tool in analysing data and
in making decisions.
• Evidence-based policy becoming the norm.
It was the worst of times
• Leaks appearing in the Scientific process.
• In domains with many possible relationships, most
published results are wrong (Ioannidis, PLoS
Medicine, 2005).
• 1/4 of 67 published experiments on drug targets
reproduced (Prinz et al., Nat. Rev. Drug Disc., 2011)
• 39% of key Psychology experiments could be
reproduced (Nature News, 2015).
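The first bullet can be made concrete with the positive predictive value (PPV) from Ioannidis (2005): PPV = (1 - β)R / (R - βR + α), where R is the pre-study odds that a tested relationship is true, α is the false-positive rate and 1 - β is the power. The sketch below just evaluates that formula; the default α and power, and the example odds, are illustrative assumptions rather than figures from the talk.

```python
def ppv(prior_odds, alpha=0.05, power=0.8):
    """Positive predictive value of a 'significant' finding (Ioannidis, 2005).

    prior_odds: R, ratio of true to false relationships among those tested.
    alpha:      false-positive rate (assumed default, not from the slides).
    power:      1 - beta (assumed default, not from the slides).
    """
    beta = 1.0 - power
    return (power * prior_odds) / (prior_odds - beta * prior_odds + alpha)


if __name__ == "__main__":
    # Few true relationships among many candidates versus even odds.
    for odds in (0.01, 0.1, 1.0):
        print(f"R = {odds:<5} PPV = {ppv(odds):.2f}")
```

When R is small, as in domains with many candidate relationships, the PPV falls below one half even with good power, which is the sense in which most published results are wrong.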
Poor statistics?
• Naive use of p-value
calculations across fields.
• Banning use of Null
Hypothesis Significance Test
Procedure in Basic and
Applied Social Psychology
(Trafimow and Marks, BASP,
2015)
• Not the end of the story…more
like the tip of the iceberg
(Leek and Peng, Nature 2015)
Lessons learnt
• Results from individual experiments are probably
wrong.
• Bias in your data means your conclusions are
even more likely to be wrong.
• Meta-analyses help.
• Understand how you got the data you have.
Sequence Read Archive
• Central repository of sequence data.
• Nearly 30,000 genomic and transcriptomics
experiments stored and freely available.
• 2 x 10^15 nucleotides stored
• Based on Next Generation Sequencing
• Step reduction in cost of sequencing
• ~$thousands for a human genome
• Potentially an enormous resource
• But how do you get that data?
Good news
• SRA data is open
• Stored in a sensible way (uses SQL)
• API and documentation to access it
Mucky business
• Data stored in SRA are short reads.
• ~100 nucleotide-long fragments which are then
assembled.
• Very long pipeline to get from a sample to this
step.
• Pipeline (Protocol in their lingo) is VARIABLE
Obvious question
• Is there any evidence of bias in the data due to
varying the protocol?
Even More Obvious
Question
• Where is the metadata on the pipeline
(protocol)?
4% of experiments describe all of the
steps
What’s more…
• Metadata are stored as text fields.
• Hugely difficult task to parse.
• Submitters are not obliged to fill this data in.
• Confusion about what level to enter data in.
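To illustrate why free-text metadata is hard to parse, here is a minimal sketch that scans a protocol description for mentions of preparation steps. The step names and regular expressions are hypothetical, chosen only for illustration; they are not the SRA schema and not the method behind the 4% figure.

```python
import re

# Hypothetical step names and patterns, for illustration only.
STEP_PATTERNS = {
    "fragmentation": re.compile(r"fragment|shear|sonicat", re.IGNORECASE),
    "amplification": re.compile(r"\bPCR\b|amplif", re.IGNORECASE),
    "size_selection": re.compile(r"size[- ]select|gel[- ]purif", re.IGNORECASE),
}


def steps_mentioned(description):
    """Return the set of step names whose pattern matches the free text."""
    return {name for name, pattern in STEP_PATTERNS.items()
            if pattern.search(description)}


def describes_all_steps(description):
    """True only if every step in STEP_PATTERNS is mentioned."""
    return steps_mentioned(description) == set(STEP_PATTERNS)


if __name__ == "__main__":
    print(steps_mentioned("DNA was sheared by sonication, then PCR amplified."))
    # An instrument name alone is missed entirely by these patterns.
    print(steps_mentioned("Covaris treatment followed by library prep"))
```

A description that names only an instrument or a kit slips through such patterns, so any count built this way depends heavily on how submitters happened to phrase things.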
Bottom line
• For much of the SRA data, there is a “known
unknown” about biases due to preparation.
• It’s very unlikely we’ll ever be able to figure that
out.
Why should you be paying
attention?
• As a member of the public - it’s your money
down the drain ($10^8-$10^9)
• As a researcher - all of this undermines
confidence in Science as a whole.
• If you work with big (and more particularly)
complex data - the same issues will crop up for
you.
Answers?
• Understand how you got your data - even if it’s a step
for modelling.
• Metadata is crucial.
• Organising your data is crucial.
• Use Ontologies
• Use discrete keywords
• Get people to use it
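A minimal sketch of what "discrete keywords" could look like at submission time: a field that accepts only values from a controlled vocabulary and rejects anything else. The field and its allowed values are hypothetical and not taken from any existing ontology or from the SRA.

```python
from enum import Enum


class Fragmentation(Enum):
    """Hypothetical controlled vocabulary for one protocol step."""
    SONICATION = "sonication"
    ENZYMATIC = "enzymatic"
    NEBULIZATION = "nebulization"


def parse_fragmentation(value):
    """Accept only an exact keyword; raise ValueError on anything else."""
    return Fragmentation(value.strip().lower())


if __name__ == "__main__":
    print(parse_fragmentation(" Sonication "))
    try:
        parse_fragmentation("sheared using standard methods")
    except ValueError as err:
        print("rejected:", err)
```

Rejecting free text at the point of entry moves the parsing cost from every downstream user to the single submitter who actually knows the answer.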
In summary :-
We want to do all the clever stuff...
Most of the time we need to deal with
a ton of pitchblende to find the milligram
of radium.

Deep Software Variability and Frictionless Reproducibility
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
 

On the importance (and absence) of annotation in Next Generation Sequencing Data

  • 1. The importance (and absence) of annotation in the Next Generation Sequence Data Hugh Shanahan & Jamie Alnasir Hugh.Shanahan@rhul.ac.uk @hughshanahan Results to be published in GigaScience
  • 2. It was the best of times • Many exciting experiments based on gathering huge amounts of data. • 100,000 Genomes in the UK, many others • Elixir - Exabytes of biomedical data in the next decade • Large experiments - SKA, LHC • Opening up of Government data • Up ahead - Sensor networks and Monitoring Cities • Machine Learning is now a widely accepted tool in analysing data and in making decisions. • Evidence-based policy becoming the norm.
  • 3. It was the worst of times • Leaks appearing in the Scientific process. • In domains with many possible relationships, most published results are wrong (Ioannidis, PLoS Medicine, 2005). • 1/4 of 67 published experiments on drug targets reproduced (Prinz et al., Nat. Rev. Drug Disc., 2011) • 39% of key Psychology experiments could be reproduced (Nature News, 2015).
  • 4. Poor statistics? • Naive use of p-value calculations across fields. • Banning use of Null Hypothesis Significance Test Procedure in Basic and Applied Social Psychology (Trafimow and Marks, BASP, 2015) • Not the end of the story…more like the tip of the iceberg (Leek and Peng, Nature 2015)
  • 5. Lessons learnt • Results from individual experiments are probably wrong. • Bias in your data means your conclusions are even more likely to be wrong. • Meta-analyses help. • Understand how you got the data you have.
  • 6. Sequence Read Archive • Central repository of sequence data. • Nearly 30,000 genomic and transcriptomics experiments stored and freely available. • 2 x 10^15 nucleotides stored
  • 7.
  • 8. • Based on Next Generation Sequencing • Step reduction in cost of sequencing • ~$thousands for a human genome • Potentially an enormous resource • But how do you get that data?
  • 9. Good news • SRA data is open • Stored in a sensible way (uses SQL) • API and documentation to access it
  • 10. Mucky business • Data stored in SRA are short reads. • ~100 nucleotide-long fragments which are then assembled. • Very long pipeline to get from a sample to this step. • Pipeline (Protocol in their lingo) is VARIABLE
  • 11.
  • 12. Obvious question • Is there any evidence of bias in the data due to varying the protocol?
  • 13. Even More Obvious Question • Where is the metadata on the pipeline (protocol)?
  • 14. 4% of experiments describe all of the steps
  • 15. What’s more… • Metadata are stored as text fields. • Hugely difficult task to parse. • Submitters are not obliged to fill this data in. • Confusion about what level to enter data in.
  • 16. Bottom line • For much of the SRA data, there is a “known unknown” about biases due to preparation. • It’s very unlikely we’ll ever be able to figure that out.
  • 17. Why should you be paying attention? • As a member of the public - it’s your money down the drain ($10^8-$10^9) • As a researcher - all of this undermines confidence in Science as a whole. • If you work with big (and more particularly) complex data - the same issues will crop up for you.
  • 18. Answers? • Understand how you got your data - even if it’s a step for modelling. • Metadata is crucial. • Organising your data is crucial. • Use Ontologies • Use discrete keywords • Get people to use it
  • 19. In summary :- We want to do all the clever stuff….
  • 20. Most of the time we need to deal with a ton of pitchblende to find the milligram of Radium ..
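
Slides 15 and 18 contrast free-text metadata fields with discrete keywords. A minimal sketch of that difference follows, in Python. The vocabulary and both function names are hypothetical illustrations; they are not taken from the SRA schema, its API, or any real ontology.

```python
import re

# Hypothetical controlled vocabulary for one protocol step (illustrative only,
# not taken from the SRA schema or any real ontology).
FRAGMENTATION_METHODS = {"sonication", "nebulization", "enzymatic"}


def validate_keyword(value):
    """Accept a fragmentation method only if it is a known discrete keyword."""
    key = value.strip().lower()
    if key not in FRAGMENTATION_METHODS:
        raise ValueError(f"unknown fragmentation method: {value!r}")
    return key


def guess_from_free_text(text):
    """Best-effort search of a free-text protocol description for known keywords."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(words & FRAGMENTATION_METHODS)


if __name__ == "__main__":
    # A discrete keyword either validates or is rejected at submission time.
    print(validate_keyword(" Sonication "))
    # Free text only matches when the submitter happens to use the exact word.
    print(guess_from_free_text("DNA was sheared by sonication (Covaris)."))
    print(guess_from_free_text("DNA was sonicated for 30 s"))
```

With a discrete keyword, a missing or unknown value is caught when the record is entered. With free text, the third call finds nothing because the submitter wrote "sonicated", which is the kind of parsing difficulty slide 15 describes.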