The researcher perspective, Jean-Fred Fontaine, MDC Berlin


Presentation by Jean-Fred Fontaine (MDC Berlin) at the 'Perfect Swell' workshop on text and data mining, held on 27 September 2013.

Published in: Technology, Health & Medicine

  • MEDLINE®/PubMed® statistics: http://www.nlm.nih.gov/bsd/pmresources.html#statistics
    GenBank Release Notes (August 15, 2013) (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
  • WHEN: statistics from 2008
    Abstract length: 250-400 words / 10,000 chars
  • senile dementia of the Alzheimer type (SDAT)
  • Graves’ disease and Programmed cell death 1 (PDCD1)
    Milnacipran (antidepressants) and obsessive-compulsive disorder
  • Activation (A), Binding (B) and Inhibition (I).
    Ethylene (ET), Jasmonic Acid (JA) and Salicylic Acid (SA)
  • ICD10 codes: billing and social purpose
    DRG: Diagnosis-Related Group.
    Studies in health technology and informatics
  • Alopecia: hair loss
    HR: Protein Hairless
    THRA: Thyroid hormone receptor
    ESR1: Estrogen receptor
  • 56.6 KB / XML article
    13.3 KB / compressed XML article
    19,607,566,706 B (19.6 GB) / 346,448 XML articles
    4,608,658,061 B (4.6 GB, or 4.3 GiB) / 346,448 compressed XML articles
    1.1 TB / 20M XML articles
    248 GiB (about 266 GB) / 20M compressed XML articles

    1. Text and data mining for Biomedical Research
       Dr. Jean-Fred Fontaine, Max Delbrück Center for Molecular Medicine, Berlin
    2. Scientific project and biomedical literature
       Project design • State of the art • Innovative ideas
       Communication
       Experiments • Technologies • State of the art • Explanations • Open hypotheses • Perspectives
       Analysis • Methods • Explanations • New hypotheses
    3. Data growth
       [Figures: Literature growth; Molecular data growth]
    4. Accessibility
       18 M (all)
       9.7 M – TEXT MINING OF ABSTRACTS
       8.6 M
       2.4 M – (freely readable)
       1.8 M
       0.2 M – TEXT MINING OF FULL TEXTS*
       Krallinger et al. (2010) Methods Mol Biol.
       * PMC Open Access subset (2012): 249,108 full texts (Ortuno et al., 2013)
    5. Document retrieval
       Alzheimer’s disease?
       [Figure: Citations in PubMed® over time, y-axis from 0 to 25,000,000]
       Results by date; with Medline Ranker, by relevance
       Fontaine et al. (2009) Nucleic Acids Res. http://cbdm.mdc-berlin.de/tools/medlineranker/
    6. Discovery of gene-disease associations
       Database mining
       Medline Ranker / Génie
       Rank 20 000 genes
       Fontaine et al. (2011) Nucleic Acids Res. http://cbdm.mdc-berlin.de/tools/genie
    7. Discovery of gene- and drug-disease associations
       [Figure panels: Before 2007; After 2007]
       Frijters et al. (2010) PLoS Comput Biol.
    8. Semantic analysis
       • Knowledge bases
       Van Landeghem et al. (2013) PLoS One.
    9. Network construction
       Modelling Plant Defence Response
       Miljkovic et al. (2012) PLoS One.
    10. Trends
        Palidwor & Andrade-Navarro (2010) J Biomed Discov Collab. http://www.ogic.ca/mltrends/
    11. Surveillance of Surgical Site Infections (SSI)
        • University Hospital of Rennes, France
        • SSI secondary to neurosurgery
        • Electronic Patient Records: ICD10 codes, free text
        • 2008-2009: relevant records; 2010: medical reports (classification)

                          Conventional    ICD10    Full-text
                          surveillance    codes    medical reports
        TRUE positive                3       11                 12
        FALSE positive               0      219                 18
        FALSE negative              10        2                  1
        TRUE negative             1212      993               1194

        Campillo-Gimenez et al. (2013) Stud Health Technol Inform.
    12. Disease Correlations from Electronic Patient Records
        • Patient records to ICD10 codes: manual or text mining
        • Avg. ICD10 codes: Manual 2.7; Text Mining 9.5
        • Co-morbidity: 93 / 802 unexpected
        • Ex. Alopecia and Migraine
        [Figure: Alopecia, HR, THRA, ESR1, Migraine]
        Roque et al. (2011) PLoS Comput Biol.
    13. Summary
        • Computers and biomedical literature and data
            • Generation
            • Storage
            • Analysis
        • Text and data mining
            • Useful from project start to finish
            • Broad and critical applications
                • Information extraction
                • Knowledge databases
                • Information retrieval
                • Knowledge discovery
            • Limited by text availability
    14. Challenges
        • Accuracy in some applications
            • Ambiguity, complex sentences, document context, novelty
            • “Protein A and its partners”
        • From abstracts to full texts
            • Current methods optimized for short texts (abstracts)
            • Figures and tables
            • Supplementary information
        • File format
            • The PDF problem
            • XML: structured format (Abstract, Introduction, Results, Methods, Discussion, References, ...)
    15. Needs
        • Copyright
            • Teach scientists
            • Unify licenses
        • Availability
            • All significant documents: articles, reviews, case reports, letters
            • The main structured text (XML)
                • No figures (or optional)
                • Supplements: optional
            • No fancy user interface or webservice
                • Texts mostly useless for readers
                • FTP/P2P + Compressed XML
        • Communicating research results
            • Open Access
            • As text
            • As data: standardized list of facts; standardized figures data and tables

        # articles    Compressed file size*
        1             13 KB
        1M            12 GB
        20M           250 GB

        * Projections based on PMC Open Access 2012
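
The compressed-size projections on the last slide can be reproduced from the sample totals in the slide notes. The sketch below assumes a simple linear scaling from the 346,448-article sample; the function names are illustrative and not from the talk. It prints both decimal (GB) and binary (GiB) units, since the slide's rounded figures of 12 GB and 250 GB match the binary values more closely.

```python
# Sample totals as given in the slide notes (PMC Open Access subset, 2012).
SAMPLE_ARTICLES = 346_448
SAMPLE_COMPRESSED_BYTES = 4_608_658_061


def project_bytes(sample_bytes, sample_articles, target_articles):
    """Scale the sample's total size linearly to a corpus of another size."""
    return sample_bytes / sample_articles * target_articles


def human(n_bytes, binary=False):
    """Format a byte count with decimal (KB, GB) or binary (KiB, GiB) units."""
    base = 1024 if binary else 1000
    units = (["B", "KiB", "MiB", "GiB", "TiB"] if binary
             else ["B", "KB", "MB", "GB", "TB"])
    value = float(n_bytes)
    for unit in units:
        if value < base or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= base


if __name__ == "__main__":
    for n_articles in (1, 1_000_000, 20_000_000):
        size = project_bytes(SAMPLE_COMPRESSED_BYTES, SAMPLE_ARTICLES, n_articles)
        print(f"{n_articles:>10,} articles: {human(size)} / {human(size, binary=True)}")
```

Under these assumptions a single compressed article comes to about 13.3 KB, and 20 million articles to roughly 266 GB, or 248 GiB.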

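
The Challenges slide contrasts PDF with XML as a structured format with named sections. As a minimal sketch of why that matters for text mining, the code below pulls sections out of a small, made-up article using Python's standard `xml.etree.ElementTree`. The element names (`article`, `abstract`, `sec`, `title`, `p`) and the sample text are assumptions chosen for illustration, loosely modelled on common article markup, and are not taken from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written article; real publisher XML is far richer.
SAMPLE = """<article>
  <front><abstract><p>Short summary.</p></abstract></front>
  <body>
    <sec><title>Introduction</title><p>Background text.</p></sec>
    <sec><title>Results</title><p>Protein A binds protein B.</p><p>More.</p></sec>
  </body>
</article>"""


def paragraph_text(parent, paragraphs):
    """Join the plain text of the given <p> elements with single spaces."""
    return " ".join("".join(p.itertext()).strip() for p in paragraphs)


def sections(xml_text):
    """Return a dict mapping section name to its plain text."""
    root = ET.fromstring(xml_text)
    out = {}
    abstract = root.find(".//abstract")
    if abstract is not None:
        out["Abstract"] = paragraph_text(abstract, abstract.iter("p"))
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="(untitled)").strip()
        out[title] = paragraph_text(sec, sec.findall("p"))
    return out


if __name__ == "__main__":
    for name, text in sections(SAMPLE).items():
        print(f"{name}: {text}")
```

With section boundaries explicit in the markup, a mining tool can restrict itself to, say, Results without guessing at layout, which is the contrast with PDF that the slide draws.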
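
The counts on the Surgical Site Infections slide can be turned into standard screening metrics. The sketch below assumes the flattened numbers read as true/false positives and negatives for three methods, in the column order conventional surveillance, ICD10 codes, full-text medical reports. That reading is an assumption about the slide's layout, supported by the fact that each column then sums to the same 1,225 reports. The class and variable names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion-matrix counts for one surveillance method."""
    tp: int
    fp: int
    fn: int
    tn: int

    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    def sensitivity(self) -> float:
        # Share of true infections that the method flags.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Share of non-infections that the method leaves unflagged.
        return self.tn / (self.tn + self.fp)

    def precision(self) -> float:
        # Share of flagged reports that are true infections.
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else float("nan")


# Assumed column assignment of the slide's numbers (see lead-in).
METHODS = {
    "Conventional surveillance": Counts(tp=3, fp=0, fn=10, tn=1212),
    "ICD10 codes": Counts(tp=11, fp=219, fn=2, tn=993),
    "Full-text medical reports": Counts(tp=12, fp=18, fn=1, tn=1194),
}

if __name__ == "__main__":
    for name, c in METHODS.items():
        print(f"{name:27s} n={c.total()} "
              f"sensitivity={c.sensitivity():.2f} "
              f"specificity={c.specificity():.2f} "
              f"precision={c.precision():.2f}")
```

On this reading, classification of the full-text reports finds 12 of the 13 infections with 18 false alarms, while ICD10 codes find 11 at the cost of 219 false alarms.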