
The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Presentation by Jean-Fred Fontaine (MDC Berlin) from the 'Perfect Swell' workshop on text and data mining, held on 27 September 2013.

Published in: Technology, Health & Medicine

  1. Text and data mining for Biomedical Research. Dr. Jean-Fred Fontaine, Max Delbrück Center for Molecular Medicine, Berlin
  2. Scientific project and biomedical literature. Project design: state of the art; innovative ideas. Communication / Experiments: technologies; state of the art; explanations; open hypotheses; perspectives. Analysis: methods; explanations; new hypotheses.
  3. Data growth: literature growth; molecular data growth
  4. Accessibility. 18 M (all); 9.7 M – text mining of abstracts; 8.6 M; 2.4 M – (freely readable); 1.8 M; 0.2 M – text mining of full texts*. Krallinger et al. (2010) Methods Mol Biol. * PMC Open Access subset (2012): 249,108 full texts (Ortuno et al., 2013)
  5. Document retrieval: Alzheimer’s disease? [Figure: citations in PubMed®, axis from 0 to 25,000,000.] Results listed by date versus by relevance (Medline Ranker). Fontaine et al. (2009) Nucleic Acids Res. http://cbdm.mdc-berlin.de/tools/medlineranker/
  6. Discovery of gene-disease associations. Database mining; Medline Ranker / Génie; rank 20 000 genes. Fontaine et al. (2011) Nucleic Acids Res. http://cbdm.mdc-berlin.de/tools/genie
  7. Discovery of gene- and drug-disease associations? Before 2007; after 2007. Frijters et al. (2010) PLoS Comput Biol.
  8. Semantic analysis → Knowledge bases. Van Landeghem et al. (2013) PLoS One.
  9. Network construction: modelling plant defence response. Miljkovic et al. (2012) PLoS One.
  10. Trends. Palidwor & Andrade-Navarro (2010) J Biomed Discov Collab. http://www.ogic.ca/mltrends/
  11. Surveillance of Surgical Site Infections (SSI). University Hospital of Rennes, France; SSI secondary to neurosurgery; Electronic Patient Records: ICD10 codes, free text. 2008-2009 relevant records; Classification; 2010 medical reports.
                        Conventional surveillance   ICD10 codes   Full-text medical reports
      TRUE positive                             3            11                          12
      FALSE positive                            0           219                          18
      FALSE negative                           10             2                           1
      TRUE negative                          1212           993                        1194
      Campillo-Gimenez et al. (2013) Stud Health Technol Inform.
  12. Disease Correlations from Electronic Patient Records. ICD10 codes from patient records: manual versus text mining. Avg. ICD10 codes: manual 2.7; text mining 9.5. Co-morbidity: 93 / 802 unexpected, e.g. alopecia and migraine. [Figure: Alopecia, HR, THRA, ESR1, Migraine.] Roque et al. (2011) PLoS Comput Biol.
  13. Summary. Computers and biomedical literature and data: generation; storage; analysis. Text and data mining: useful from project start to finish; broad and critical applications (information extraction → knowledge databases; information retrieval; knowledge discovery); limited by text availability.
  14. Challenges. Accuracy in some applications: ambiguity, complex sentences, document context, novelty. From abstracts to full texts: “Protein A and its partners”; current methods optimized for short texts (abstracts); figures and tables; supplementary information. File format → the PDF problem [figure: PDF page layout]; XML: structured format (Abstract, Introduction, Results, Methods, Discussion, References, ...).
  15. Needs. Copyright: teach scientists; unify licenses. Availability: all significant documents (articles, reviews, case reports, letters); the main structured text (XML), with no figures (or optional) and supplements optional; no fancy user interface or webservice (texts mostly useless for readers): FTP/P2P + compressed XML. Communicating research results: Open Access; as text; as data (standardized list of facts; standardized figures data and tables).
      # articles   Compressed file size*
      1            13 KB
      1M           12 GB
      20M          250 GB
      * Projections based on PMC Open Access 2012
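
The counts on slide 11 can be turned into the usual screening metrics. The sketch below is not from the presentation. It assumes the flattened numbers read as true/false positives and negatives for three detection methods (conventional surveillance, ICD10 codes, full-text medical reports), a reading supported by each column then summing to the same 1,225 records. The class and variable names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion-matrix counts for one detection method."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    def sensitivity(self) -> float:
        # Share of real infections that the method detected.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Share of infection-free records correctly left unflagged.
        return self.tn / (self.tn + self.fp)

    def precision(self) -> float:
        # Share of flagged records that were real infections.
        return self.tp / (self.tp + self.fp)


# ASSUMPTION: assignment of the slide's numbers to methods and cells,
# inferred from the flattened slide text, not stated explicitly in it.
METHODS = {
    "conventional": Counts(tp=3, fp=0, fn=10, tn=1212),
    "icd10": Counts(tp=11, fp=219, fn=2, tn=993),
    "full_text": Counts(tp=12, fp=18, fn=1, tn=1194),
}

if __name__ == "__main__":
    for name, c in METHODS.items():
        print(
            f"{name:>12}: n={c.total} "
            f"sensitivity={c.sensitivity():.3f} "
            f"specificity={c.specificity():.3f} "
            f"precision={c.precision():.3f}"
        )
```

Under this reading, full-text classification finds 12 of the 13 infections, against 3 of 13 for conventional surveillance, and raises far fewer false alarms than ICD10 codes (18 versus 219).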
