Text and Data Mining for Biomedical Research Insights

Text and data mining for Biomedical Research
Dr. Jean-Fred Fontaine
Max Delbrück Center for Molecular Medicine, Berlin

Scientific project and biomedical literature

Project design
• State of the art
• Innovative ideas

Communication
• State of the art
• Explanations
• Open hypotheses
• Perspectives

Experiments
• Technologies

Analysis
• Methods
• Explanations
• New hypotheses

Data growth
Literature growth

Molecular data growth

Accessibility

18 M (all)

9.7 M – TEXT MINING OF ABSTRACTS
8.6 M

2.4 M – (freely readable)
1.8 M
0.2 M – TEXT MINING OF FULL TEXTS*

Krallinger et al. (2010) Methods Mol Biol.
* PMC Open Access subset (2012): 249,108 full texts (Ortuno et al., 2013)

Document retrieval

Alzheimer’s disease?
[Figure: Citations in PubMed® by year; y-axis from 0 to 25,000,000]

By date

Medline Ranker


By relevance

Fontaine et al. (2009) Nucleic Acids Res.
http://cbdm.mdc-berlin.de/tools/medlineranker/
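
The difference between listing results by date and ranking them by relevance can be illustrated with a small sketch. This is not the Medline Ranker implementation: it assumes a simple naive Bayes style log-odds score per word, learned from a few abstracts labelled as relevant or background. The function names (`train_word_weights`, `rank_by_relevance`) and the toy abstracts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def train_word_weights(relevant, background):
    """Return a log-odds weight per word, using add-one smoothing.

    relevant / background: lists of abstracts (strings); toy inputs here.
    """
    rel_counts = Counter(w for doc in relevant for w in tokenize(doc))
    bg_counts = Counter(w for doc in background for w in tokenize(doc))
    weights = {}
    for word in set(rel_counts) | set(bg_counts):
        p_rel = (rel_counts[word] + 1) / (len(relevant) + 2)
        p_bg = (bg_counts[word] + 1) / (len(background) + 2)
        weights[word] = math.log(p_rel / p_bg)
    return weights


def rank_by_relevance(candidates, weights):
    """Sort candidate abstracts by the sum of their word weights, best first."""
    def score(doc):
        return sum(weights.get(w, 0.0) for w in tokenize(doc))
    return sorted(candidates, key=score, reverse=True)


# Hypothetical training abstracts for an "Alzheimer's disease" query.
relevant = [
    "amyloid beta plaques in alzheimer disease",
    "tau protein and alzheimer dementia",
]
background = [
    "wheat yield under drought stress",
    "protein folding in yeast",
]
weights = train_word_weights(relevant, background)

candidates = [
    "drought tolerance in wheat",
    "amyloid and tau in dementia",
]
ranked = rank_by_relevance(candidates, weights)
print(ranked[0])
```

Words seen mostly in the relevant set get positive weights, so an unseen abstract that shares them rises to the top regardless of its publication date.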

Discovery of gene-disease associations
Database mining

Medline Ranker / Génie


Rank 20 000 genes

Fontaine et al. (2011) Nucleic Acids Res.
http://cbdm.mdc-berlin.de/tools/genie

Discovery of gene- and drug-disease associations

Before 2007

After 2007
Frijters et al. (2010) PLoS Comput Biol.

Semantic analysis



Knowledge bases
Van Landeghem et al. (2013) PLoS One.

Network construction

Modelling Plant Defence Response

Miljkovic et al. (2012) PLoS One.

Trends

Palidwor & Andrade-Navarro (2010) J Biomed Discov Collab.
http://www.ogic.ca/mltrends/

Surveillance of Surgical Site Infections
• University Hospital of Rennes, France
• SSI secondary to neurosurgery
• Electronic Patient Records
• ICD10 codes
• Free text

2008-2009 relevant records
2010 medical reports
Classification

                 Conventional surveillance   ICD10 codes   Full-text medical reports
TRUE positive    3                           11            12
FALSE positive   0                           219           18
FALSE negative   10                          2             1
TRUE negative    1212                        993           1194

Campillo-Gimenez et al. (2013) Stud Health Technol Inform.
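
The counts on this slide can be turned into the usual detection metrics. The sketch below assumes the flattened numbers read as three columns (conventional surveillance, ICD10 codes, full-text medical reports); that assignment is an inference, supported by each column then summing to the same 1225 reports. The helper name `metrics` is hypothetical.

```python
def metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity and precision for one 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
    }


# Counts as read from the slide; the column assignment is an assumption.
counts = {
    "Conventional surveillance": dict(tp=3, fp=0, fn=10, tn=1212),
    "ICD10 codes": dict(tp=11, fp=219, fn=2, tn=993),
    "Full-text medical reports": dict(tp=12, fp=18, fn=1, tn=1194),
}

results = {name: metrics(**c) for name, c in counts.items()}
for name, m in results.items():
    print(f"{name}: " + ", ".join(f"{k}={v:.3f}" for k, v in m.items()))
```

Under this reading, full-text classification detects 12 of the 13 infections with 18 false positives, whereas ICD10 codes detect 11 with 219 false positives.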

Disease Correlations from Electronic Patient Records

ICD10 codes from patient records: Manual vs. Text Mining

• Avg. ICD10 codes
  • Manual: 2.7
  • Text Mining: 9.5
• Co-morbidity
  • 93 / 802 unexpected
  • Ex. Alopecia and Migraine

[Figure labels: Alopecia, HR, THRA, ESR1, Migraine]

Roque et al. (2011) PLoS Comput Biol.
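
One simple way to look for co-morbidities once each patient record carries a set of ICD10 codes is to compare how often two codes co-occur with how often they would co-occur by chance. The sketch below is only an illustration of that idea, not the method of Roque et al.: the function name `comorbidity_ratios`, the independence assumption and the toy patient data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def comorbidity_ratios(patients):
    """Return the observed/expected co-occurrence ratio for each code pair.

    patients: list of sets of ICD10 codes, one set per patient (toy input).
    Expected counts assume the two codes occur independently.
    """
    n = len(patients)
    single = Counter(code for codes in patients for code in codes)
    pairs = Counter(
        pair for codes in patients for pair in combinations(sorted(codes), 2)
    )
    return {
        (a, b): observed / (single[a] * single[b] / n)
        for (a, b), observed in pairs.items()
    }


# Toy data: G43 (migraine) and L63 (alopecia areata) co-occur twice.
patients = [
    {"G43", "L63"},
    {"G43", "L63"},
    {"G43"},
    {"I10"},
    {"I10", "E11"},
    {"E11"},
]
ratios = comorbidity_ratios(patients)
for pair, ratio in sorted(ratios.items()):
    print(pair, round(ratio, 2))
```

Ratios well above 1 flag pairs that co-occur more often than chance would suggest. With codes extracted by text mining, more codes per patient are available, so more such pairs can be tested.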

Summary


• Computers and biomedical literature and data
  • Generation
  • Storage
  • Analysis
• Text and data mining
  • Useful from project start to finish
  • Broad and critical applications
    • Information extraction
    • Knowledge databases
    • Information retrieval
    • Knowledge discovery
• Limited by text availability

Challenges


• Accuracy in some applications
  • Ambiguity, complex sentences, document context, novelty
• From abstracts to full texts
  • “Protein A and its partners”
  • Current methods optimized for short texts (abstracts)
  • Figures and tables
  • Supplementary information

File format


The PDF problem

XML: structured format


Abstract, Introduction, Results, Methods, Discussion, References, ...
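
What a structured format buys a text miner can be shown with a short sketch: when sections are explicit elements, each one can be pulled out by name instead of being guessed from page layout. The sample document and its `<sec>`/`<title>`/`<p>` layout are assumptions, loosely modelled on JATS-style article XML rather than on any specific publisher schema. The names `paragraph_text` and `extract_sections` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical article; element names are an assumption (JATS-like layout).
ARTICLE_XML = """
<article>
  <abstract><p>We mine abstracts and full texts.</p></abstract>
  <body>
    <sec><title>Introduction</title><p>Literature grows quickly.</p></sec>
    <sec><title>Results</title><p>Protein A binds protein B.</p></sec>
    <sec><title>Methods</title><p>Texts were tokenized.</p></sec>
  </body>
</article>
"""


def paragraph_text(element):
    """Join the plain text of all <p> descendants of an element."""
    return " ".join("".join(p.itertext()).strip() for p in element.iter("p"))


def extract_sections(xml_text):
    """Map each section title to the plain text of its paragraphs."""
    root = ET.fromstring(xml_text)
    sections = {}
    abstract = root.find("abstract")
    if abstract is not None:
        sections["Abstract"] = paragraph_text(abstract)
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="(untitled)").strip()
        sections[title] = paragraph_text(sec)
    return sections


sections = extract_sections(ARTICLE_XML)
for title, text in sections.items():
    print(f"{title}: {text}")
```

A mining pipeline could then, for example, restrict fact extraction to Results and skip Methods or References, which is hard to do reliably from a PDF.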

Needs


• Copyright
  • Teach scientists
  • Unify licenses
• Availability
  • All significant documents
    • Articles, reviews, case reports, letters
  • The main structured text (XML)
    • No figures (or optional)
    • Supplements: optional
  • No fancy user interface or webservice
    • texts mostly useless for readers
  • FTP/P2P + Compressed XML

Communicating Research results

• Open Access
• As text
• As data
  • standardized list of facts
  • standardized figures data and tables

# articles   Compressed file size*
1            13 KB
1M           12 GB
20M          250 GB

* Projections based on PMC Open Access 2012
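
The projected file sizes follow from scaling the per-article figure linearly. The sketch below redoes that arithmetic under two assumptions that the slide does not state: decimal units (1 GB = 10^9 bytes) and a constant 13 KB per compressed article. The function name `projected_size_gb` is hypothetical.

```python
KB = 1000          # decimal units are an assumption; the slide does not say
GB = 1000 ** 3


def projected_size_gb(n_articles, kb_per_article=13):
    """Scale the per-article compressed size linearly to n articles."""
    return n_articles * kb_per_article * KB / GB


for n in (1_000_000, 20_000_000):
    print(f"{n:>10,} articles -> about {projected_size_gb(n):.0f} GB")
```

This gives about 13 GB and 260 GB, close to the 12 GB and 250 GB on the slide. The small gap is presumably due to rounding of the per-article average or to a different unit convention.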
