This document discusses text and data mining for biomedical research. It describes how the literature and molecular data have grown exponentially and the challenges of accessing full texts. It outlines applications of text mining, including document retrieval, discovery of gene-disease associations, and surveillance of surgical site infections. Challenges include accuracy, the move from abstracts to full texts, and file formats.
Text and Data Mining for Biomedical Research Insights
1. Text and data mining for Biomedical Research
Dr. Jean-Fred Fontaine
Max Delbrück Center for Molecular Medicine, Berlin
2. Scientific project and biomedical literature
Project design
• State of the art
• Innovative ideas
Communication
Experiments
• Technologies
• State of the art
• Explanations
• Open hypotheses
• Perspectives
Analysis
• Methods
• Explanations
• New hypotheses
4. Accessibility
18 M (all)
9.7 M – TEXT MINING OF ABSTRACTS
8.6 M
2.4 M – (freely readable)
1.8 M
0.2 M – TEXT MINING OF FULL TEXTS*
Krallinger et al. (2010) Methods Mol Biol.
* PMC Open Access subset (2012): 249,108 full texts (Ortuno et al., 2013)
11. Surveillance of Surgical Site Infections
University Hospital of Rennes, France
SSI secondary to neurosurgery
Electronic Patient Records
ICD10 codes
Free text
2008-2009 relevant records
Conventional ICD10 codes surveillance
Full-text medical reports
TRUE positive
Classification
11
12
FALSE positive
0
219
18
FALSE negative
10
2
1
TRUE negative
2010 medical reports
3
1212
993
1194
Campillo-Gimenez et al. (2013) Stud Health Technol Inform.
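The true/false positive and negative counts on this slide are the cells of a confusion matrix. As a minimal sketch of how such counts translate into detection performance, the function below computes sensitivity, specificity and positive predictive value. The function name and the counts in the usage lines are illustrative assumptions, not values taken from Campillo-Gimenez et al.

```python
def detection_metrics(tp, fp, fn, tn):
    """Summarize a 2x2 confusion matrix for a detection task."""
    nan = float("nan")
    return {
        # Share of real infections that were flagged.
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Share of infection-free records that were left unflagged.
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        # Share of flagged records that were real infections.
        "ppv": tp / (tp + fp) if (tp + fp) else nan,
    }


# Hypothetical counts, for illustration only (not the study's results).
metrics = detection_metrics(tp=8, fp=2, fn=2, tn=88)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

For surveillance, a high false-positive count mainly costs reviewer time, while false negatives are missed infections, so sensitivity and positive predictive value are usually read together.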
12. Disease Correlations from Electronic Patient Records
Patient records: ICD10 codes assigned manually vs. by text mining
Avg. ICD10 codes: Manual 2.7, Text Mining 9.5
Co-morbidity: 93 / 802 unexpected
Ex. Alopecia and Migraine
Genes shown: HR, THRA, ESR1
Roque et al. (2011) PLoS Comput Biol.
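Once each patient record carries a set of ICD10 codes, co-morbidity candidates can be found by counting how often two codes occur in the same record. The sketch below shows only that raw counting step on toy data; it is a simplification, not the statistical method of Roque et al. The function name and the records are hypothetical, with L63 (alopecia areata), G43 (migraine) and I10 (essential hypertension) chosen purely for illustration.

```python
from collections import Counter
from itertools import combinations


def comorbidity_counts(records):
    """Count how many records contain each unordered pair of codes."""
    pair_counts = Counter()
    for codes in records:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(codes)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Toy records, one set of ICD10 codes per patient (illustrative only).
toy_records = [
    {"L63", "G43"},
    {"L63", "G43", "I10"},
    {"I10"},
    {"G43", "I10"},
]

counts = comorbidity_counts(toy_records)
for pair, n in counts.most_common():
    print(pair, n)
```

In practice the raw counts would be compared against the frequency expected from each code's prevalence before a pair is called unexpected.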
13. Summary
Computers and biomedical literature and data
Generation
Storage
Analysis
Text and data mining
Useful from project start to finish
Broad and critical applications
Information extraction
Knowledge databases
Information retrieval
Knowledge discovery
Limited by text availability
14. Challenges
Accuracy in some applications
Ambiguity, complex sentences, document context, novelty
From abstracts to full texts
“Protein A and its partners”
Current methods optimized for short texts (abstracts)
Figures and tables
Supplementary information
File format
The PDF problem
XML: structured format
Abstract, Introduction, Results, Methods, Discussion, References, ...
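To illustrate why a structured format helps, the sketch below pulls named sections out of a small XML article using Python's standard library. The sample document and its element names (`article`, `body`, `sec`, `title`, `p`) are assumptions loosely modeled on journal article markup; a real corpus may use different names and namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal article used only for this example.
SAMPLE_XML = """<article>
  <front><abstract><p>Short summary.</p></abstract></front>
  <body>
    <sec><title>Introduction</title><p>Background text.</p></sec>
    <sec><title>Results</title><p>Protein A binds its partners.</p><p>Second paragraph.</p></sec>
  </body>
</article>"""


def sections_by_title(xml_text):
    """Map each section title to the concatenated text of its paragraphs."""
    root = ET.fromstring(xml_text)
    sections = {}
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="").strip()
        # itertext() also collects text nested inside inline markup.
        paragraphs = ["".join(p.itertext()).strip() for p in sec.findall("p")]
        sections[title] = " ".join(paragraphs)
    return sections


sections = sections_by_title(SAMPLE_XML)
print(sections["Results"])
```

With PDF input, the same step would first require recovering reading order and section boundaries from page layout, which is where errors tend to creep in.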
15. Needs
Copyright
Teach scientists
Unify licenses
Availability
All significant documents
Articles, reviews, case reports, letters
The main structured text (XML)
No figures (or optional)
Supplements: optional
No fancy user interface or web service
(raw texts are mostly useless for human readers)
FTP/P2P + Compressed XML
Communicating Research results
# articles    Compressed file size*
1             13 KB
1M            12 GB
20M           250 GB
Open Access
As text
As data
standardized list of facts
standardized figures data and tables
* Projections based on PMC Open Access 2012
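The compressed file size projections can be checked with simple arithmetic: scale the per-article size linearly with the number of articles. The sketch below assumes that 13 KB per article holds on average and that the slide uses binary units (1 KB = 1024 bytes); both are assumptions, and the function name is hypothetical.

```python
KIB = 1024
GIB = 1024 ** 3


def projected_size_gib(n_articles, kib_per_article=13):
    """Linear projection of total compressed corpus size, in GiB."""
    return n_articles * kib_per_article * KIB / GIB


for n in (1_000_000, 20_000_000):
    print(f"{n:>10,} articles: about {projected_size_gib(n):.0f} GiB")
```

Under these assumptions the results come out at roughly 12 GiB for 1M articles and 248 GiB for 20M, in line with the 12 GB and 250 GB on the slide.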