Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Linking literature to data in the life sciences
OpenAIREplus workshop, Copenhagen, 11 June 2012

Overview
• What literature? What data?
• How we make literature-data connections
• Case study
• Challenges and future directions

Data Landscape and Definitions
[Diagram. Labels: Research articles; Unstructured Data; Big Data: Deposition, Primary; Big Data: Curated, Annotation; *reuse]
• Funder mandates
• Journal requirements
• Metadata
• Standards

PMC336623

Extended to several other biological data types

• Big data
• Thematic data
• Public data
• Archived data

• Two petabytes of data
• Scales to 7 PB of raw disk
• Majority is DNA

[Figure: growth charts by year, 2001 to 2010. Panels: European Nucleotide Archive; Ensembl and Ensembl Genomes; UniProt; InterPro; ArrayExpress; PDBe. Axis labels: Nucleotides (millions), Genomes, Entries, Hybridisations, Structures.]

Two core literature databases

• 26 million abstracts (PubMed, Patents, Agricola)
• Website and web services
• Citation networks
• Database links
• Whatizit text mining
• over 1.1 million new records per year

• 2.2 million full text articles (217K articles with supplementary data)
• Website
• Supplemented by CiteXplore
• Additional text mining
• over 150K new articles per year

UK PubMed Central Overview
• Built in collaboration with PubMed Central USA (+ PMC Canada) since 2006
• Led by the European Bioinformatics Institute since 2011, with the British Library and the University of Manchester
• Supported by 16 UK and 2 European funders, led by the Wellcome Trust. Research spend: ~2 billion GBP

• A life-science web-based repository
• Manuscript submission service (self archiving by grant holders)
• Database of grant information – with details of about 18000 PIs
• Grant reporting and funder analysis tool
• 250K requests, 40K IPs, 7K direct interactive searches per day

How many articles?

Overall: 20% OA (~ 450K OA articles out of 2.2 million total)

How we make literature-data connections

Links
• by the author - on submission, as metadata (primary databases)
• by database curators - information and links from the literature

• expensive, slow, but high quality

Text mining
• by algorithms that use terminologies (can be subject to lag)
• post publication – can find new associations
• variable quality, but high throughput
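
The terminology-driven approach can be illustrated with a minimal sketch. It assumes a tiny hand-made dictionary and plain string matching; the `TERMINOLOGY` entries, the `TERM:...` identifiers, and the `annotate` function are hypothetical names made up for this example and are not how Whatizit or UKPMC are implemented.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-terminology: surface form -> (semantic type, identifier).
# The identifiers are placeholders for illustration, not real database records.
TERMINOLOGY: Dict[str, Tuple[str, str]] = {
    "dna repair": ("GO Term", "TERM:0001"),
    "colon cancer": ("Disease", "TERM:0002"),
    "escherichia coli": ("Organism", "TERM:0003"),
}


def annotate(text: str, terminology: Dict[str, Tuple[str, str]]) -> List[dict]:
    """Return one annotation per dictionary term occurrence found in text."""
    annotations = []
    for term, (semantic_type, identifier) in terminology.items():
        # Word boundaries avoid hits inside longer words; matching ignores case.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            annotations.append({
                "start": match.start(),
                "end": match.end(),
                "text": match.group(0),
                "type": semantic_type,
                "id": identifier,
            })
    # Sort by position so annotations follow reading order.
    return sorted(annotations, key=lambda a: a["start"])


sentence = "Defects in DNA repair are linked to human colon cancer."
for ann in annotate(sentence, TERMINOLOGY):
    print(ann["type"], ann["text"], ann["start"], ann["end"])
```

A tagger of this kind can only find terms already in its dictionary, which is one way to read the "subject to lag" caveat: new names appear in papers before they reach the terminology.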

Links from Literature to Databases
• Proteins
• Nucleotides
• OMIM
• Chemicals
• Structure
• Clinical reviews
• Protein families
• Protein-protein interactions
• Gene expression experiments …

[Counts shown on slide: 800 K, 370 K, 110 K]
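
One way to picture a literature-to-database link is as a small record that keeps the article, the target resource, the accession, and who asserted the link. The sketch below is an assumption for illustration only: the `LiteratureLink` class, its field names, and the placeholder identifiers are hypothetical and do not describe the actual UKPMC data model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LiteratureLink:
    article_id: str  # article identifier (placeholder values below)
    database: str    # target resource type, e.g. "Proteins" or "Structure"
    accession: str   # record identifier in the target database
    source: str      # who made the link: "author", "curator" or "text-mining"


# Placeholder records for illustration only; these are not real links.
links = [
    LiteratureLink("ARTICLE-1", "Proteins", "ACC-A", "curator"),
    LiteratureLink("ARTICLE-1", "Structure", "ACC-B", "author"),
    LiteratureLink("ARTICLE-2", "Proteins", "ACC-C", "text-mining"),
]


def links_per_database(records):
    """Count how many links point at each target database."""
    return Counter(link.database for link in records)


def articles_for(records, database, accession):
    """Return the sorted article identifiers that link to one database record."""
    return sorted({link.article_id for link in records
                   if link.database == database and link.accession == accession})


print(links_per_database(links))
print(articles_for(links, "Proteins", "ACC-A"))
```

Keeping the `source` field alongside each link preserves provenance, so curated links and text-mined links can be counted or filtered separately.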

Text Mining in UKPMC (2.2 million articles)

Semantic Type    Unique Terms    Articles     Annotations
Gene/Protein     225,905         1,288,809    15,021,502
GO Terms         32,486          1,806,539    15,016,957
Organism         178,847         1,689,251    12,322,782
Disease          170,592         1,743,212    16,201,198
Accession No.    232,950         65,640       331,329
Chemical         76,350          1,669,500    22,438,980
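
Accession numbers differ from the other semantic types in that they follow fixed formats, so pattern matching is a natural first step. The sketch below is a rough illustration under stated assumptions: it uses the published UniProt accession format, and for PDB it requires a nearby "PDB" cue because four-character identifiers otherwise collide with years and other tokens. The `find_accessions` function and the sample sentence are made up for this example and are not the UKPMC pipeline.

```python
import re

# UniProt accession format as documented by UniProt (6 or 10 characters).
UNIPROT = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)

# PDB identifiers: a digit followed by three alphanumerics. This sketch only
# accepts them after a "PDB" cue (an assumption made to reduce false positives).
PDB = re.compile(r"\bPDB(?:\s+(?:ID|code|entry))?[:\s]+([1-9][A-Za-z0-9]{3})\b")


def find_accessions(text):
    """Return (database, accession) pairs found in text, UniProt hits first."""
    hits = [("UniProt", m.group(0)) for m in UNIPROT.finditer(text)]
    hits += [("PDB", m.group(1)) for m in PDB.finditer(text)]
    return hits


# Made-up sentence; the identifiers only illustrate the expected formats.
text = "The structure (PDB ID: 1ABC) of protein P12345 was solved in 2012."
print(find_accessions(text))
```

A format match alone does not prove that a record exists, so candidates found this way would still need checking against the target database.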

E. coli meets humans

Human colon cancer

DNA repair

Link to the literature from the PDBe record

Algorithms that find similar structures

Towards understanding DNA repair mechanisms

Challenges and future directions

• Data-driven science
• Data re-use: biology is post publication
• Linking: citing papers and data (provenance and integration)
• Metrics and attribution
• Hard decisions about value of keeping complete data sets

Data landscape - possibilities
[Diagram. Labels: Research articles; Unstructured Data; Structured links; analysis; Big Data: Deposition, Primary; Big Data: Curated, Annotation; reuse?]

File formats shown: PDF, HTML, GIF, JPG, TIF, MOV, DOC, XSL
Analysis supplied by Mimas, University of Manchester

Solutions that make sense to scientists
