SlideShare a Scribd company logo
Linking literature to data in the life sciences
OpenAIREplus workshop, Copenhagen, 11 June 2012
Overview
•
•
•
•

What literature? What data?
How we make literature-data connections
Case study
Challenges and future directions
What literature? What data?
Data Landscape and Definitions
Research
articles

Funder mandates
Journal requirements
Metadata
Standards

Big Data:
Deposition
Primary

Unstructured
Data

*reuse

Big Data:
Curated
Annotation
PMC336623

Extended to several other biological data types
40000

300

European Nucleotide Archive

Ensembl and Ensembl Genomes
250

35000
30000

Genomes

• Big data
• Thematic data
• Public data
• Archived data

Nucleotides (millions)

45000

25000
20000
15000

200
150
100

10000
50
5000
0

0
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

Year
14000000
12000000

25000

UniProt

Year
InterPro

Entries

10000000
8000000
6000000

15000
10000

4000000
5000

2000000
0

0
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

Year

Year

500000
450000

70000

ArrayExpress

60000

400000

Structures

Hybridisations

• Two petabytes of data
• Scales to 7 pbs raw disk
• Majority is DNA

Entries

20000

350000
300000
250000
200000
150000

PDBe

50000
40000
30000
20000

100000
10000

50000
0

0
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

Year

2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

Year
Two core literature databases

•

26 million abstracts

PubMed, Patents, Agricola
•

Website and web services

•
•
•

Citation networks
Database links
Whatizit textmining

• over 1.1 million new records per year

•

2.2 million full text articles
(217K articles with suppl data)

•

Website

•
•

Supplemented by CiteXplore
Additional text mining

• over 150K new articles per year
UK PubMed Central Overview
• Built in collaboration with PubMed Central USA (+ PMC Canada) since
2006
• Led by the European Bioinformatics Institute since 2011, with the
British Library, and the University of Manchester
• Supported by 16 UK and 2 European Funders, led by the Wellcome
Trust. Research spend: ~ 2 billion GBP

• A life-science web-based repository
• Manuscript submission service (self archiving by grant holders)
• Database of grant information – with details of about 18000 PIs
• Grant reporting and funder analysis tool
• 250K requests, 40K IPs, 7K direct interactive searches per day
How many articles?

Overall: 20% OA (~ 450K OA articles out of 2.2 million total)
How we make literature-data connections
Links
• by the author - on submission, as metadata (primary databases)
• by database curators - information and links from the
literature

• expensive, slow, but high quality

Text mining
• by algorithms that use terminologies (can be subject to lag)
• post publication – can find new associations
• variable quality, but high throughput
Links from Literature to Databases
•
•
•
•
•
•
•
•
•

800 K

370 K

110 K

Proteins
Nucleotides
OMIM
Chemicals
Structure
Clinical reviews
Protein families
Protein-protein interactions
Gene expression experiments …
Text Mining in UKPMC (2.2 million articles)
Semantic Type
Gene/Protein

Unique Terms

Articles

Annotations

225,905

1,288,809

15,021,502

GO Terms

32,486

1,806,539

15,016,957

Organism

178,847

1,689,251

12,322,782

Disease

170,592

1,743,212

16,201,198

Accession No.

232,950

65,640

331,329

76,350

1,669,500

22,438,980

Chemical
Case study
3.9 billion years ago
E. Coli meets humans

Human colon cancer

DNA repair
07/21/10

17
Protein structure in PDBe
Link to the literature from the PDBe record
Algorithms that find similar structures
Text mine full text for 1ewq
Towards understanding DNA repair mechanisms
Challenges and future directions
Data-driven science
Data re-use: biology is
post publication
Linking: citing papers
and data (provenance
and integration)
Metrics and attribution
Hard decisions about
value of keeping
complete data sets
Data landscape - possibilities
analysis

Research
articles

Unstructured
Data
Structured links

Big Data:
Deposition
Primary

Big Data:
Curated
Annotation

reuse?
PDF
HTML

GIF

JPG

TIF
MOV
DOC

Analysis supplied by Mimas, University of Manchester

XSL
Solutions that make sense to scientists
http://ukpmc.ac.uk

More Related Content

What's hot

Clustering the royal society of chemistry chemical repository to enable enhan...
Clustering the royal society of chemistry chemical repository to enable enhan...Clustering the royal society of chemistry chemical repository to enable enhan...
Clustering the royal society of chemistry chemical repository to enable enhan...
Valery Tkachenko
 
Building a semantic chemistry platform with the royal society of chemistry
Building a semantic chemistry platform with the royal society of chemistryBuilding a semantic chemistry platform with the royal society of chemistry
Building a semantic chemistry platform with the royal society of chemistry
Valery Tkachenko
 
FAIR Data and Model Management for Systems Biology (and SOPs too!)
FAIR Data and Model Management for Systems Biology(and SOPs too!)FAIR Data and Model Management for Systems Biology(and SOPs too!)
FAIR Data and Model Management for Systems Biology (and SOPs too!)
Carole Goble
 
CrossRef DOIs for African Journal Partnership Journals
CrossRef DOIs for African Journal Partnership JournalsCrossRef DOIs for African Journal Partnership Journals
CrossRef DOIs for African Journal Partnership Journals
Crossref
 
The royal society of chemistry and its adoption of semantic web technologies ...
The royal society of chemistry and its adoption of semantic web technologies ...The royal society of chemistry and its adoption of semantic web technologies ...
The royal society of chemistry and its adoption of semantic web technologies ...
Valery Tkachenko
 
20140327 rda plazi_final
20140327 rda plazi_final20140327 rda plazi_final
20140327 rda plazi_final
agosti
 
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Juan Antonio Vizcaino
 
Semantics as a service at EMBL-EBI
Semantics as a service at EMBL-EBISemantics as a service at EMBL-EBI
Semantics as a service at EMBL-EBI
Simon Jupp
 
China: Journal Publishing, DOI and CrossCheck (2011 CrossRef Workshops)
China: Journal Publishing, DOI and CrossCheck (2011 CrossRef Workshops)China: Journal Publishing, DOI and CrossCheck (2011 CrossRef Workshops)
China: Journal Publishing, DOI and CrossCheck (2011 CrossRef Workshops)
Crossref
 
The UK National Chemical Database Service – an integration of commercial and ...
The UK National Chemical Database Service – an integration of commercial and ...The UK National Chemical Database Service – an integration of commercial and ...
The UK National Chemical Database Service – an integration of commercial and ...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
BioDBCore: Current Status and Next Developments
BioDBCore: Current Status and Next DevelopmentsBioDBCore: Current Status and Next Developments
BioDBCore: Current Status and Next Developments
Pascale Gaudet
 
Tools and approaches for data deposition into nanomaterial databases
Tools and approaches for data deposition into nanomaterial databasesTools and approaches for data deposition into nanomaterial databases
Tools and approaches for data deposition into nanomaterial databases
Valery Tkachenko
 
OSFair2017 Workshop | How FAIR friendly is the FAIRDOM Hub? Exposing metadata...
OSFair2017 Workshop | How FAIR friendly is the FAIRDOM Hub? Exposing metadata...OSFair2017 Workshop | How FAIR friendly is the FAIRDOM Hub? Exposing metadata...
OSFair2017 Workshop | How FAIR friendly is the FAIRDOM Hub? Exposing metadata...
Open Science Fair
 
Proteomics public data resources: enabling "big data" analysis in proteomics
Proteomics public data resources: enabling "big data" analysis in proteomicsProteomics public data resources: enabling "big data" analysis in proteomics
Proteomics public data resources: enabling "big data" analysis in proteomics
Juan Antonio Vizcaino
 
FAIR data and model management for systems biology.
FAIR data and model management for systems biology.FAIR data and model management for systems biology.
FAIR data and model management for systems biology.
FAIRDOM
 
Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
A chemistry data repository to serve them all
A chemistry data repository to serve them allA chemistry data repository to serve them all
Big data challenges associated with building a national data repository for c...
Big data challenges associated with building a national data repository for c...Big data challenges associated with building a national data repository for c...
Big data challenges associated with building a national data repository for c...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
CEDAR: Center for Expanded Data Annotation and Retrieval
 
Dealing with the complex challenge of managing diverse analytical chemistry d...
Dealing with the complex challenge of managing diverse analytical chemistry d...Dealing with the complex challenge of managing diverse analytical chemistry d...
Dealing with the complex challenge of managing diverse analytical chemistry d...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 

What's hot (20)

Clustering the royal society of chemistry chemical repository to enable enhan...
Clustering the royal society of chemistry chemical repository to enable enhan...Clustering the royal society of chemistry chemical repository to enable enhan...
Clustering the royal society of chemistry chemical repository to enable enhan...
 
Building a semantic chemistry platform with the royal society of chemistry
Building a semantic chemistry platform with the royal society of chemistryBuilding a semantic chemistry platform with the royal society of chemistry
Building a semantic chemistry platform with the royal society of chemistry
 
FAIR Data and Model Management for Systems Biology (and SOPs too!)
FAIR Data and Model Management for Systems Biology(and SOPs too!)FAIR Data and Model Management for Systems Biology(and SOPs too!)
FAIR Data and Model Management for Systems Biology (and SOPs too!)
 
CrossRef DOIs for African Journal Partnership Journals
CrossRef DOIs for African Journal Partnership JournalsCrossRef DOIs for African Journal Partnership Journals
CrossRef DOIs for African Journal Partnership Journals
 
The royal society of chemistry and its adoption of semantic web technologies ...
The royal society of chemistry and its adoption of semantic web technologies ...The royal society of chemistry and its adoption of semantic web technologies ...
The royal society of chemistry and its adoption of semantic web technologies ...
 
20140327 rda plazi_final
20140327 rda plazi_final20140327 rda plazi_final
20140327 rda plazi_final
 
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
 
Semantics as a service at EMBL-EBI
Semantics as a service at EMBL-EBISemantics as a service at EMBL-EBI
Semantics as a service at EMBL-EBI
 
China: Journal Publishing, DOI and CrossCheck (2011 CrossRef Workshops)
China: Journal Publishing, DOI and CrossCheck (2011 CrossRef Workshops)China: Journal Publishing, DOI and CrossCheck (2011 CrossRef Workshops)
China: Journal Publishing, DOI and CrossCheck (2011 CrossRef Workshops)
 
The UK National Chemical Database Service – an integration of commercial and ...
The UK National Chemical Database Service – an integration of commercial and ...The UK National Chemical Database Service – an integration of commercial and ...
The UK National Chemical Database Service – an integration of commercial and ...
 
BioDBCore: Current Status and Next Developments
BioDBCore: Current Status and Next DevelopmentsBioDBCore: Current Status and Next Developments
BioDBCore: Current Status and Next Developments
 
Tools and approaches for data deposition into nanomaterial databases
Tools and approaches for data deposition into nanomaterial databasesTools and approaches for data deposition into nanomaterial databases
Tools and approaches for data deposition into nanomaterial databases
 
OSFair2017 Workshop | How FAIR friendly is the FAIRDOM Hub? Exposing metadata...
OSFair2017 Workshop | How FAIR friendly is the FAIRDOM Hub? Exposing metadata...OSFair2017 Workshop | How FAIR friendly is the FAIRDOM Hub? Exposing metadata...
OSFair2017 Workshop | How FAIR friendly is the FAIRDOM Hub? Exposing metadata...
 
Proteomics public data resources: enabling "big data" analysis in proteomics
Proteomics public data resources: enabling "big data" analysis in proteomicsProteomics public data resources: enabling "big data" analysis in proteomics
Proteomics public data resources: enabling "big data" analysis in proteomics
 
FAIR data and model management for systems biology.
FAIR data and model management for systems biology.FAIR data and model management for systems biology.
FAIR data and model management for systems biology.
 
Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...
 
A chemistry data repository to serve them all
A chemistry data repository to serve them allA chemistry data repository to serve them all
A chemistry data repository to serve them all
 
Big data challenges associated with building a national data repository for c...
Big data challenges associated with building a national data repository for c...Big data challenges associated with building a national data repository for c...
Big data challenges associated with building a national data repository for c...
 
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
 
Dealing with the complex challenge of managing diverse analytical chemistry d...
Dealing with the complex challenge of managing diverse analytical chemistry d...Dealing with the complex challenge of managing diverse analytical chemistry d...
Dealing with the complex challenge of managing diverse analytical chemistry d...
 

Similar to Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

ICIC 2013 Conference Proceedings Antony Williams Royal Society of Chemistry
ICIC 2013 Conference Proceedings Antony Williams Royal Society of ChemistryICIC 2013 Conference Proceedings Antony Williams Royal Society of Chemistry
ICIC 2013 Conference Proceedings Antony Williams Royal Society of Chemistry
Dr. Haxel Consult
 
Mcentyre dryad-orcid_may2013
Mcentyre dryad-orcid_may2013Mcentyre dryad-orcid_may2013
Mcentyre dryad-orcid_may2013
ORCID, Inc
 
Activities at the Royal Society of Chemistry to gather, extract and analyze b...
Activities at the Royal Society of Chemistry to gather, extract and analyze b...Activities at the Royal Society of Chemistry to gather, extract and analyze b...
Activities at the Royal Society of Chemistry to gather, extract and analyze b...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Dealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data onlineDealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data online
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Dealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data onlineDealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data online
Ken Karapetyan
 
The application of cloud computing to royal society of chemistry data platforms
The application of cloud computing to royal society of chemistry data platformsThe application of cloud computing to royal society of chemistry data platforms
The application of cloud computing to royal society of chemistry data platforms
Valery Tkachenko
 
Globus in European Life Science
Globus in European Life ScienceGlobus in European Life Science
Globus in European Life Science
Globus
 
Data integration
Data integrationData integration
Data integration
Rafael C. Jimenez
 
NEBASE Hour - August 2008 - What's New At OCLC?
NEBASE Hour - August 2008 - What's New At OCLC?NEBASE Hour - August 2008 - What's New At OCLC?
NEBASE Hour - August 2008 - What's New At OCLC?
Nebraska Library Commission
 
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
Ken Karapetyan
 
Ontology Web Services for Semantic Applications
Ontology Web Services for Semantic Applications Ontology Web Services for Semantic Applications
Ontology Web Services for Semantic Applications
Trish Whetzel
 
Prototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and CeremonyPrototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and Ceremony
Archiver
 
2013 CrossRef Annual Meeting Flash Update CrossCheck and CrossMark Rachael La...
2013 CrossRef Annual Meeting Flash Update CrossCheck and CrossMark Rachael La...2013 CrossRef Annual Meeting Flash Update CrossCheck and CrossMark Rachael La...
2013 CrossRef Annual Meeting Flash Update CrossCheck and CrossMark Rachael La...
Crossref
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
Dr. Haxel Consult
 
iMicrobe_ASLO_2015
iMicrobe_ASLO_2015iMicrobe_ASLO_2015
iMicrobe_ASLO_2015
Bonnie Hurwitz
 
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...
Amit Sheth
 
Connecting life sciences data at the European Bioinformatics Institute
Connecting life sciences data at the European Bioinformatics InstituteConnecting life sciences data at the European Bioinformatics Institute
Connecting life sciences data at the European Bioinformatics Institute
Connected Data World
 
Biological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdfBiological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdf
BioinformaticsCentre
 
Data integration
Data integrationData integration
Data integration
Rafael C. Jimenez
 
Health Datapalooza 2013: Open Government Data - Natasha Noy
Health Datapalooza 2013: Open Government Data - Natasha NoyHealth Datapalooza 2013: Open Government Data - Natasha Noy
Health Datapalooza 2013: Open Government Data - Natasha Noy
Health Data Consortium
 

Similar to Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI (20)

ICIC 2013 Conference Proceedings Antony Williams Royal Society of Chemistry
ICIC 2013 Conference Proceedings Antony Williams Royal Society of ChemistryICIC 2013 Conference Proceedings Antony Williams Royal Society of Chemistry
ICIC 2013 Conference Proceedings Antony Williams Royal Society of Chemistry
 
Mcentyre dryad-orcid_may2013
Mcentyre dryad-orcid_may2013Mcentyre dryad-orcid_may2013
Mcentyre dryad-orcid_may2013
 
Activities at the Royal Society of Chemistry to gather, extract and analyze b...
Activities at the Royal Society of Chemistry to gather, extract and analyze b...Activities at the Royal Society of Chemistry to gather, extract and analyze b...
Activities at the Royal Society of Chemistry to gather, extract and analyze b...
 
Dealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data onlineDealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data online
 
Dealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data onlineDealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data online
 
The application of cloud computing to royal society of chemistry data platforms
The application of cloud computing to royal society of chemistry data platformsThe application of cloud computing to royal society of chemistry data platforms
The application of cloud computing to royal society of chemistry data platforms
 
Globus in European Life Science
Globus in European Life ScienceGlobus in European Life Science
Globus in European Life Science
 
Data integration
Data integrationData integration
Data integration
 
NEBASE Hour - August 2008 - What's New At OCLC?
NEBASE Hour - August 2008 - What's New At OCLC?NEBASE Hour - August 2008 - What's New At OCLC?
NEBASE Hour - August 2008 - What's New At OCLC?
 
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
 
Ontology Web Services for Semantic Applications
Ontology Web Services for Semantic Applications Ontology Web Services for Semantic Applications
Ontology Web Services for Semantic Applications
 
Prototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and CeremonyPrototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and Ceremony
 
2013 CrossRef Annual Meeting Flash Update CrossCheck and CrossMark Rachael La...
2013 CrossRef Annual Meeting Flash Update CrossCheck and CrossMark Rachael La...2013 CrossRef Annual Meeting Flash Update CrossCheck and CrossMark Rachael La...
2013 CrossRef Annual Meeting Flash Update CrossCheck and CrossMark Rachael La...
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
 
iMicrobe_ASLO_2015
iMicrobe_ASLO_2015iMicrobe_ASLO_2015
iMicrobe_ASLO_2015
 
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...
 
Connecting life sciences data at the European Bioinformatics Institute
Connecting life sciences data at the European Bioinformatics InstituteConnecting life sciences data at the European Bioinformatics Institute
Connecting life sciences data at the European Bioinformatics Institute
 
Biological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdfBiological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdf
 
Data integration
Data integrationData integration
Data integration
 
Health Datapalooza 2013: Open Government Data - Natasha Noy
Health Datapalooza 2013: Open Government Data - Natasha NoyHealth Datapalooza 2013: Open Government Data - Natasha Noy
Health Datapalooza 2013: Open Government Data - Natasha Noy
 

More from OpenAIRE

10th OpenAIRE Content Providers Community Call
10th OpenAIRE Content Providers Community Call10th OpenAIRE Content Providers Community Call
10th OpenAIRE Content Providers Community Call
OpenAIRE
 
9th Content Providers Community Call\
9th Content Providers Community Call\9th Content Providers Community Call\
9th Content Providers Community Call\
OpenAIRE
 
OpenAIRE in the European Open Science Cloud (EOSC)
OpenAIRE in the European Open Science Cloud (EOSC)OpenAIRE in the European Open Science Cloud (EOSC)
OpenAIRE in the European Open Science Cloud (EOSC)
OpenAIRE
 
8th Content Providers Community Call
8th Content Providers Community Call8th Content Providers Community Call
8th Content Providers Community Call
OpenAIRE
 
7th Content Providers Community Call
7th Content Providers Community Call7th Content Providers Community Call
7th Content Providers Community Call
OpenAIRE
 
OpenAIRE PROVIDE Dashboard for Turkish repository managers
OpenAIRE PROVIDE Dashboard for Turkish repository managersOpenAIRE PROVIDE Dashboard for Turkish repository managers
OpenAIRE PROVIDE Dashboard for Turkish repository managers
OpenAIRE
 
What will it cost to manage and share my data?
What will it cost to manage and share my data?What will it cost to manage and share my data?
What will it cost to manage and share my data?
OpenAIRE
 
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 3)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 3)Open Research Gateway for the ELIXIR-GR Infrastructure (Part 3)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 3)
OpenAIRE
 
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)
OpenAIRE
 
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 1)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 1)Open Research Gateway for the ELIXIR-GR Infrastructure (Part 1)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 1)
OpenAIRE
 
6th Content Providers Community Call
6th Content Providers Community Call6th Content Providers Community Call
6th Content Providers Community Call
OpenAIRE
 
20200504_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
20200504_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data20200504_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
20200504_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
OpenAIRE
 
20200504_Research Data & the GDPR: How Open is Open?
20200504_Research Data & the GDPR: How Open is Open?20200504_Research Data & the GDPR: How Open is Open?
20200504_Research Data & the GDPR: How Open is Open?
OpenAIRE
 
20200504_Data, Data Ownership and Open Science
20200504_Data, Data Ownership and Open Science20200504_Data, Data Ownership and Open Science
20200504_Data, Data Ownership and Open Science
OpenAIRE
 
20200429_Research Data & the GDPR: How Open is Open? (updated version)
20200429_Research Data & the GDPR: How Open is Open? (updated version)20200429_Research Data & the GDPR: How Open is Open? (updated version)
20200429_Research Data & the GDPR: How Open is Open? (updated version)
OpenAIRE
 
20200429_Data, Data Ownership and Open Science
20200429_Data, Data Ownership and Open Science20200429_Data, Data Ownership and Open Science
20200429_Data, Data Ownership and Open Science
OpenAIRE
 
20200429_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
20200429_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data20200429_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
20200429_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
OpenAIRE
 
COVID-19: Activities, tools, best practice and contact points in Greece
 COVID-19: Activities, tools, best practice and contact points in Greece COVID-19: Activities, tools, best practice and contact points in Greece
COVID-19: Activities, tools, best practice and contact points in Greece
OpenAIRE
 
5th Content Providers Community Call
5th Content Providers Community Call5th Content Providers Community Call
5th Content Providers Community Call
OpenAIRE
 
4th Content Providers Community Call
4th Content Providers Community Call4th Content Providers Community Call
4th Content Providers Community Call
OpenAIRE
 

More from OpenAIRE (20)

10th OpenAIRE Content Providers Community Call
10th OpenAIRE Content Providers Community Call10th OpenAIRE Content Providers Community Call
10th OpenAIRE Content Providers Community Call
 
9th Content Providers Community Call\
9th Content Providers Community Call\9th Content Providers Community Call\
9th Content Providers Community Call\
 
OpenAIRE in the European Open Science Cloud (EOSC)
OpenAIRE in the European Open Science Cloud (EOSC)OpenAIRE in the European Open Science Cloud (EOSC)
OpenAIRE in the European Open Science Cloud (EOSC)
 
8th Content Providers Community Call
8th Content Providers Community Call8th Content Providers Community Call
8th Content Providers Community Call
 
7th Content Providers Community Call
7th Content Providers Community Call7th Content Providers Community Call
7th Content Providers Community Call
 
OpenAIRE PROVIDE Dashboard for Turkish repository managers
OpenAIRE PROVIDE Dashboard for Turkish repository managersOpenAIRE PROVIDE Dashboard for Turkish repository managers
OpenAIRE PROVIDE Dashboard for Turkish repository managers
 
What will it cost to manage and share my data?
What will it cost to manage and share my data?What will it cost to manage and share my data?
What will it cost to manage and share my data?
 
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 3)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 3)Open Research Gateway for the ELIXIR-GR Infrastructure (Part 3)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 3)
 
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)
 
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 1)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 1)Open Research Gateway for the ELIXIR-GR Infrastructure (Part 1)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 1)
 
6th Content Providers Community Call
6th Content Providers Community Call6th Content Providers Community Call
6th Content Providers Community Call
 
20200504_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
20200504_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data20200504_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
20200504_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
 
20200504_Research Data & the GDPR: How Open is Open?
20200504_Research Data & the GDPR: How Open is Open?20200504_Research Data & the GDPR: How Open is Open?
20200504_Research Data & the GDPR: How Open is Open?
 
20200504_Data, Data Ownership and Open Science
20200504_Data, Data Ownership and Open Science20200504_Data, Data Ownership and Open Science
20200504_Data, Data Ownership and Open Science
 
20200429_Research Data & the GDPR: How Open is Open? (updated version)
20200429_Research Data & the GDPR: How Open is Open? (updated version)20200429_Research Data & the GDPR: How Open is Open? (updated version)
20200429_Research Data & the GDPR: How Open is Open? (updated version)
 
20200429_Data, Data Ownership and Open Science
20200429_Data, Data Ownership and Open Science20200429_Data, Data Ownership and Open Science
20200429_Data, Data Ownership and Open Science
 
20200429_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
20200429_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data20200429_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
20200429_OpenAIRE Legal Policy Webinar: GDPR and Sharing Data
 
COVID-19: Activities, tools, best practice and contact points in Greece
 COVID-19: Activities, tools, best practice and contact points in Greece COVID-19: Activities, tools, best practice and contact points in Greece
COVID-19: Activities, tools, best practice and contact points in Greece
 
5th Content Providers Community Call
5th Content Providers Community Call5th Content Providers Community Call
5th Content Providers Community Call
 
4th Content Providers Community Call
4th Content Providers Community Call4th Content Providers Community Call
4th Content Providers Community Call
 

Recently uploaded

HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 

Recently uploaded (20)

HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 

Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

  • 1. Linking literature to data in the life sciences OpenAIREplus workshop, Copenhagen, 11 June 2012
  • 2. Overview • • • • What literature? What data? How we make literature-data connections Case study Challenges and future directions
  • 4. Data Landscape and Definitions Research articles Funder mandates Journal requirements Metadata Standards Big Data: Deposition Primary Unstructured Data *reuse Big Data: Curated Annotation
  • 5. PMC336623 Extended to several other biological data types
  • 6. 40000 300 European Nucleotide Archive Ensembl and Ensembl Genomes 250 35000 30000 Genomes • Big data • Thematic data • Public data • Archived data Nucleotides (millions) 45000 25000 20000 15000 200 150 100 10000 50 5000 0 0 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 Year 14000000 12000000 25000 UniProt Year InterPro Entries 10000000 8000000 6000000 15000 10000 4000000 5000 2000000 0 0 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 Year Year 500000 450000 70000 ArrayExpress 60000 400000 Structures Hybridisations • Two petabytes of data • Scales to 7 pbs raw disk • Majority is DNA Entries 20000 350000 300000 250000 200000 150000 PDBe 50000 40000 30000 20000 100000 10000 50000 0 0 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 Year 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 Year
  • 7. Two core literature databases • 26 million abstracts PubMed, Patents, Agricola • Website and web services • • • Citation networks Database links Whatizit textmining • over 1.1 million new records per year • 2.2 million full text articles (217K articles with suppl data) • Website • • Supplemented by CiteXplore Additional text mining • over 150K new articles per year
  • 8. UK PubMed Central Overview • Built in collaboration with PubMed Central USA (+ PMC Canada) since 2006 • Led by the European Bioinformatics Institute since 2011, with the British Library, and the University of Manchester • Supported by 16 UK and 2 European Funders, led by the Wellcome Trust. Research spend: ~ 2 billion GBP • A life-science web-based repository • Manuscript submission service (self archiving by grant holders) • Database of grant information – with details of about 18000 PIs • Grant reporting and funder analysis tool • 250K requests, 40K IPs, 7K direct interactive searches per day
  • 9. How many articles? Overall: 20% OA (~ 450K OA articles out of 2.2 million total)
  • 10. How we make literature-data connections
  • 11. Links • by the author - on submission, as metadata (primary databases) • by database curators - information and links from the literature • expensive, slow, but high quality Text mining • by algorithms that use terminologies (can be subject to lag) • post publication – can find new associations • variable quality, but high throughput
  • 12. Links from Literature to Databases • • • • • • • • • 800 K 370 K 110 K Proteins Nucleotides OMIM Chemicals Structure Clinical reviews Protein families Protein-protein interactions Gene expression experiments …
  • 13. Text Mining in UKPMC (2.2 million articles) Semantic Type Gene/Protein Unique Terms Articles Annotations 225,905 1,288,809 15,021,502 GO Terms 32,486 1,806,539 15,016,957 Organism 178,847 1,689,251 12,322,782 Disease 170,592 1,743,212 16,201,198 Accession No. 232,950 65,640 331,329 76,350 1,669,500 22,438,980 Chemical
  • 16. E. Coli meets humans Human colon cancer DNA repair
  • 18.
  • 20. Link to the literature from the PDBe record
  • 21. Algorithms that find similar structures
  • 22. Text mine full text for 1ewq
  • 23. Towards understanding DNA repair mechanisms
  • 24. Challenges and future directions
  • 25. Data-driven science Data re-use: biology is post publication Linking: citing papers and data (provenance and integration) Metrics and attribution Hard decisions about value of keeping complete data sets
  • 26. Data landscape - possibilities analysis Research articles Unstructured Data Structured links Big Data: Deposition Primary Big Data: Curated Annotation reuse?
  • 27. PDF HTML GIF JPG TIF MOV DOC Analysis supplied by Mimas, University of Manchester XSL
  • 28. Solutions that make sense to scientists