Text and data mining for Biomedical Research
Dr. Jean-Fred Fontaine
Max Delbrück Center for Molecular Medicine, Berlin
Scientific project and biomedical literature

Project design
• State of the art
• Innovative ideas

Communication

Experiments
• Technologies

• State of the art
• Explanations
• Open hypotheses
• Perspectives

Analysis
• Methods
• Explanations
• New hypotheses
Data growth
Literature growth

Molecular data growth
Accessibility

18 M (all)

9.7 M – TEXT MINING OF ABSTRACTS
8.6 M

2.4 M – (freely readable)
1.8 M
0.2 M – TEXT MINING OF FULL TEXTS*

Krallinger et al. (2010) Methods Mol Biol.
* PMC Open Access subset (2012): 249,108 full texts (Ortuno et al., 2013)
Document retrieval

Alzheimer’s disease?
[Figure: Citations in PubMed®, vertical axis from 0 to 25,000,000; horizontal-axis year labels not recoverable]

By date

Medline Ranker

By relevance

Fontaine et al. (2009) Nucleic Acids Res.
http://cbdm.mdc-berlin.de/tools/medlineranker/
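
The re-ordering from "by date" to "by relevance" can be illustrated with a small word-scoring sketch. This is not the Medline Ranker implementation: it assumes a simple smoothed log-odds weight per word, in the spirit of a naive Bayes classifier, learned from a set of relevant abstracts and a set of background abstracts. The function names `tokenize`, `train` and `rank` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def train(relevant, background):
    """Learn one log-odds weight per word (add-one smoothing, assumed here)."""
    rel_df = Counter(w for doc in relevant for w in tokenize(doc))
    bg_df = Counter(w for doc in background for w in tokenize(doc))
    weights = {}
    for word in set(rel_df) | set(bg_df):
        p_rel = (rel_df[word] + 1) / (len(relevant) + 2)
        p_bg = (bg_df[word] + 1) / (len(background) + 2)
        weights[word] = math.log(p_rel / p_bg)
    return weights


def rank(abstracts, weights):
    """Score each abstract by summing its word weights; highest score first."""
    scored = [
        (sum(weights.get(w, 0.0) for w in tokenize(a)), a) for a in abstracts
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Usage would be to call `train` once with abstracts known to be on topic (for example, about Alzheimer's disease) and a random background sample, then pass any list of candidate abstracts to `rank`.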
Discovery of gene-disease associations
Database mining

Medline Ranker / Génie

Rank 20 000 genes

Fontaine et al. (2011) Nucleic Acids Res.
http://cbdm.mdc-berlin.de/tools/genie
Discovery of gene- and drug-disease associations

?
Before 2007


After 2007

Frijters et al. (2010) PLoS Comput Biol.
Semantic analysis
• Knowledge bases
Van Landeghem et al. (2013) PLoS One.
Network construction

Modelling Plant Defence Response

Miljkovic et al. (2012) PLoS One.
Trends

Palidwor & Andrade-Navarro (2010) J Biomed Discov Collab.
http://www.ogic.ca/mltrends/
Surveillance of Surgical Site Infections
• University Hospital of Rennes, France
• SSI secondary to neurosurgery
• Electronic Patient Records
• ICD10 codes
• Free text

2008-2009 relevant records

Conventional ICD10 codes surveillance

Full-text medical reports

TRUE positive

Classification

11

12

FALSE positive

0

219

18

FALSE negative

10

2

1

TRUE negative

2010 medical reports

3

1212

993

1194


Campillo-Gimenez et al. (2013) Stud Health Technol Inform.
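
The four labels on this slide (true/false positives and negatives) are the cells of a confusion matrix, from which the usual detection metrics follow. The sketch below shows that arithmetic only. The `Confusion` class is hypothetical, and the counts in the usage example are illustrative placeholders, not the study's figures.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts of a binary classifier's outcomes (hypothetical helper)."""
    tp: int  # infections correctly flagged
    fp: int  # records flagged without infection
    fn: int  # infections missed
    tn: int  # records correctly left unflagged

    def sensitivity(self):
        """Share of true infections that were detected (recall)."""
        return self.tp / (self.tp + self.fn)

    def specificity(self):
        """Share of infection-free records that were left unflagged."""
        return self.tn / (self.tn + self.fp)

    def precision(self):
        """Share of flagged records that were true infections."""
        return self.tp / (self.tp + self.fp)


# Illustrative counts only; they do not come from the slide.
example = Confusion(tp=8, fp=2, fn=2, tn=88)
print(example.sensitivity(), example.specificity(), example.precision())
```

For surveillance, sensitivity is usually the metric to watch, since a missed infection costs more than a record reviewed unnecessarily.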
Disease Correlations from Electronic Patient Records

ICD10 codes

Avg. ICD10 codes
• Manual: 2.7
• Text Mining: 9.5

Manual
Patient records

Text Mining

Co-morbidity
• 93 / 802 unexpected
• Ex. Alopecia and Migraine

Alopecia
HR

THRA
ESR1

Migraine

Roque et al. (2011) PLoS Comput Biol.
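
One simple way to look for co-morbidities once each patient record carries a set of ICD10 codes is to compare how often two codes co-occur with how often they would co-occur by chance. The sketch below uses an observed/expected ratio. That score is an assumption chosen for illustration and is not the measure used in the cited study. The function name and the toy patient sets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def comorbidity_scores(patients):
    """Map each co-occurring code pair to observed / expected co-occurrence.

    `patients` is a list of sets of ICD10 codes, one set per patient.
    Expected counts assume the two codes occur independently.
    """
    n = len(patients)
    single = Counter(code for codes in patients for code in codes)
    pairs = Counter(
        pair for codes in patients for pair in combinations(sorted(codes), 2)
    )
    return {
        (a, b): observed / (single[a] * single[b] / n)
        for (a, b), observed in pairs.items()
    }


# Toy data: G43 = migraine, L63 = alopecia areata, I10 = hypertension.
toy = [{"G43", "L63"}, {"G43", "L63"}, {"G43"}, {"I10"}]
print(comorbidity_scores(toy))
```

A ratio above 1 means the pair is seen together more often than independence would predict. With real data, a significance test and a minimum patient count would be needed before calling a pair unexpected.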
Summary
• Computers and biomedical literature and data
  • Generation
  • Storage
  • Analysis
• Text and data mining
  • Useful from project start to finish
  • Broad and critical applications
    • Information extraction
    • Knowledge databases
    • Information retrieval
    • Knowledge discovery
  • Limited by text availability
Challenges
• Accuracy in some applications
  • Ambiguity, complex sentences, document context, novelty
  • “Protein A and its partners”
• From abstracts to full texts
  • Current methods optimized for short texts (abstracts)
  • Figures and tables
  • Supplementary information
• File format
  • The PDF problem
  • XML: structured format
    • Abstract, Introduction, Results, Methods, Discussion, References, ...
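
The advantage of a structured format is that each section can be addressed directly, which plain text recovered from a PDF does not allow. The sketch below splits a small article into named sections using Python's standard XML parser. The sample document and its element names (`abstract`, `body`, `sec`, `title`, `p`) are a simplified assumption loosely modelled on journal-article XML, and `sections` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified article used only for illustration.
SAMPLE = """<article>
  <front><article-meta><abstract><p>Short summary.</p></abstract></article-meta></front>
  <body>
    <sec><title>Introduction</title><p>Background text.</p></sec>
    <sec><title>Results</title><p>Main findings.</p><p>More findings.</p></sec>
  </body>
</article>"""


def paragraph_text(parent, paragraphs):
    """Join the text of the given <p> elements with single spaces."""
    return " ".join("".join(p.itertext()).strip() for p in paragraphs)


def sections(xml_text):
    """Return {section title: text} for the abstract and top-level sections."""
    root = ET.fromstring(xml_text)
    result = {}
    abstract = root.find(".//abstract")
    if abstract is not None:
        result["Abstract"] = paragraph_text(abstract, abstract.iter("p"))
    for sec in root.findall("./body/sec"):
        title = sec.findtext("title", default="(untitled)")
        result[title] = paragraph_text(sec, sec.findall("p"))
    return result


print(sections(SAMPLE))
```

A mining pipeline can then restrict itself to, say, Results sections, or weight sentences differently depending on where they appear.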
Needs
• Copyright
  • Teach scientists
  • Unify licenses
• Availability
  • All significant documents
    • Articles, reviews, case reports, letters
  • The main structured text (XML)
    • No figures (or optional)
    • Supplements: optional
  • No fancy user interface or webservice
    • texts mostly useless for readers
  • FTP/P2P + Compressed XML

Communicating Research results

# articles    Compressed file size*
1             13 KB
1M            12 GB
20M           250 GB

Open Access
• As text
• As data
  • standardized list of facts
  • standardized figures data and tables

* Projections based on PMC Open Access 2012
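
The projected sizes follow from scaling the single-article figure linearly. The sketch below redoes that arithmetic under two assumptions that the slide does not state: every article compresses to the same 13 KB, and 1 GB = 1024 × 1024 KB. The function name is hypothetical.

```python
KB_PER_ARTICLE = 13  # from the slide: one compressed article is about 13 KB


def projected_size_gb(n_articles, kb_per_article=KB_PER_ARTICLE):
    """Linear projection of corpus size in GB (1 GB = 1024 * 1024 KB assumed)."""
    return n_articles * kb_per_article / 1024 ** 2


for n in (1_000_000, 20_000_000):
    print(f"{n:>10,} articles ≈ {projected_size_gb(n):.0f} GB")
```

Under these assumptions the projection gives roughly 12 GB for one million articles and about 248 GB for twenty million, in line with the rounded values in the table.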
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Recently uploaded (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Text and Data Mining for Biomedical Research Insights

  • 1. Text and data mining for Biomedical Research Dr. Jean-Fred Fontaine Max Delbrück Center for Molecular Medicine, Berlin
  • 2. Scientific project and biomedical literature. Project design: state of the art, innovative ideas. Communication. Experiments: technologies, state of the art, explanations, open hypotheses, perspectives. Analysis: methods, explanations, new hypotheses.
  • 4. Accessibility 18 M (all) 9.7 M – TEXT MINING OF ABSTRACTS 8.6 M 2.4 M – (freely readable) 1.8 M 0.2 M - TEXT MINING OF FULL TEXTS* Krallinger et al. (2010) Methods Mol Biol. * PMC Open Access subset (2012): 249,108 full texts (Ortuno et al., 2013)
  • 5. Document retrieval. Example query: Alzheimer’s disease? [Figure: citations in PubMed® by year, y-axis from 0 to 25,000,000.] Results sorted by date; Medline Ranker: results sorted by relevance. Fontaine et al. (2009) Nucleic Acids Res. http://cbdm.mdc-berlin.de/tools/medlineranker/
  • 6. Discovery of gene-disease associations. Database mining; Medline Ranker / Génie: rank 20 000 genes. Fontaine et al. (2011) Nucleic Acids Res. http://cbdm.mdc-berlin.de/tools/genie
  • 7. Discovery of gene- and drug-disease associations? Before 2007; after 2007. Frijters et al. (2010) PLoS Comput Biol.
  • 8. Semantic analysis. Knowledge bases. Van Landeghem et al. (2013) PLoS One.
  • 9. Network construction Modelling Plant Defence Response Miljkovic et al. (2012) PLoS One.
  • 10. Trends. Palidwor & Andrade-Navarro (2010) J Biomed Discov Collab. http://www.ogic.ca/mltrends/
  • 11. Surveillance of Surgical Site Infections (SSI). University Hospital of Rennes, France; SSI secondary to neurosurgery. Electronic Patient Records: ICD10 codes and free text. Classification: relevant records from 2008-2009; 2010 medical reports. Counts (true positive / false positive / false negative / true negative): conventional surveillance 3 / 0 / 10 / 1212; ICD10 codes 11 / 219 / 2 / 993; full-text medical reports 12 / 18 / 1 / 1194. Campillo-Gimenez et al. (2013) Stud Health Technol Inform.
  • 12. Disease Correlations from Electronic Patient Records. Patient records coded with ICD10 codes, manually and by text mining. Avg. ICD10 codes: manual 2.7; text mining 9.5. Co-morbidity: 93 / 802 unexpected. Ex. Alopecia and Migraine (HR, THRA, ESR1). Roque et al. (2011) PLoS Comput Biol.
  • 13. Summary. Computers and biomedical literature and data: generation, storage, analysis. Text and data mining: useful from project start to finish; broad and critical applications (information extraction, knowledge databases, information retrieval, knowledge discovery). Limited by text availability.
  • 14. Challenges. Accuracy in some applications: ambiguity, complex sentences, document context, novelty. From abstracts to full texts: “Protein A and its partners”; current methods optimized for short texts (abstracts); figures and tables; supplementary information. File format: the PDF problem. XML: structured format (Abstract, Introduction, Results, Methods, Discussion, References, ...).
  • 15. Needs. Copyright: teach scientists; unify licenses. Availability: all significant documents (articles, reviews, case reports, letters); the main structured text (XML); no figures (or optional); supplements: optional. No fancy user interface or webservice: texts mostly useless for readers; FTP/P2P + compressed XML. Compressed file size* by number of articles: 1 article, 13 KB; 1M articles, 12 GB; 20M articles, 250 GB. Communicating research results: Open Access; as text; as data (standardized list of facts; standardized figures data and tables). * Projections based on PMC Open Access 2012
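
The counts on slide 11 can be turned into precision and recall for each surveillance method. The sketch below assumes the flattened table reads as conventional surveillance 3 / 0 / 10 / 1212, ICD10 codes 11 / 219 / 2 / 993 and full-text medical reports 12 / 18 / 1 / 1194 (true positive / false positive / false negative / true negative); that reading is an inference, supported only by the fact that each method then sums to 1,225 reports with 13 infections. The names `Confusion`, `SLIDE_11` and `summarize` are hypothetical and not taken from the cited study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Outcome counts of a binary classifier (hypothetical helper)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    @property
    def precision(self) -> float:
        # Share of flagged reports that are real infections.
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else float("nan")

    @property
    def recall(self) -> float:
        # Share of real infections that were flagged.
        actual = self.tp + self.fn
        return self.tp / actual if actual else float("nan")


# ASSUMPTION: reading of the flattened slide-11 table, not confirmed by the source.
SLIDE_11 = {
    "conventional surveillance": Confusion(tp=3, fp=0, fn=10, tn=1212),
    "ICD10 codes": Confusion(tp=11, fp=219, fn=2, tn=993),
    "full-text medical reports": Confusion(tp=12, fp=18, fn=1, tn=1194),
}


def summarize(results: dict[str, Confusion]) -> dict[str, tuple[float, float]]:
    """Map each method name to its (precision, recall) pair."""
    return {name: (c.precision, c.recall) for name, c in results.items()}


if __name__ == "__main__":
    for name, (precision, recall) in summarize(SLIDE_11).items():
        print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

Under this reading, the ICD10 codes and the full-text classification both recover most infections, but the full-text classification raises far fewer false alarms.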

Editor's Notes

  1. MEDLINE®/PubMed® statistics: http://www.nlm.nih.gov/bsd/pmresources.html#statistics GenBank Release Notes (August 15, 2013) (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
  2. WHEN: statistics from 2008. Abstract length: 250-400 words / 10,000 chars.
  3. senile dementia of the Alzheimer type (SDAT)
  4. Graves’ disease and Programmed cell death 1 (PDCD1). Milnacipran (antidepressants) and obsessive-compulsive disorder.
  5. Activation (A), Binding (B) and Inhibition (I). Ethylene (ET), Jasmonic Acid (JA) and Salicylic Acid (SA)
  6. ICD10 codes: billing and social purpose. DRG: Diagnosis-Related Group. Studies in Health Technology and Informatics.
  7. Alopecia: hair loss. HR: Protein Hairless. THRA: Thyroid hormone receptor. ESR1: Estrogen receptor.
  8. 56.6 KB / XML article; 13.3 KB / compressed XML article. 19,607,566,706 B (19.6 GB) / 346,448 XML articles; 4,608,658,061 B (4.6 GB, or 4.3 GiB) / 346,448 compressed XML articles. 1.1 TB / 20M XML articles; 248 GB / 20M compressed XML articles.
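
The projection in note 8 is a linear extrapolation from the average article size, and can be sketched as follows. The byte totals and article count are the ones given in the note; the function names are hypothetical. One assumption to flag: the per-article sizes and the 1.1 TB figure match decimal units, whereas the 12 GB and 248 GB figures appear to match binary gigabytes (2^30 bytes), so the sketch exposes both.

```python
# Totals reported in note 8 for the PMC Open Access subset (2012).
XML_BYTES = 19_607_566_706         # uncompressed XML
COMPRESSED_BYTES = 4_608_658_061   # compressed XML
N_ARTICLES = 346_448

GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gigabyte (assumed unit of the 12 GB / 248 GB figures)


def bytes_per_article(total_bytes: int, n_articles: int = N_ARTICLES) -> float:
    """Average size of one article in bytes."""
    return total_bytes / n_articles


def projected_bytes(total_bytes: int, target_articles: int,
                    n_articles: int = N_ARTICLES) -> float:
    """Linear projection: average article size times the target corpus size."""
    return bytes_per_article(total_bytes, n_articles) * target_articles


if __name__ == "__main__":
    for label, total in (("XML", XML_BYTES), ("compressed XML", COMPRESSED_BYTES)):
        avg_kb = bytes_per_article(total) / 1000
        print(f"{label}: {avg_kb:.1f} KB per article")
        for target in (1_000_000, 20_000_000):
            size = projected_bytes(total, target)
            print(f"  {target:>10,} articles: {size / GB:.0f} GB = {size / GIB:.0f} GiB")
```

A linear projection assumes that the 20M articles have the same average length as the Open Access subset, which the source does not verify.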