Medical Information Retrieval and its Evaluation: an Overview of CLEF eHealth Evaluation Task

Medical Information Retrieval and
its Evaluation: an Overview of CLEF
eHealth Evaluation Task
Lorraine Goeuriot
LIG – Université Grenoble Alpes (France)
lorraine.goeuriot@imag.fr

Presentation Overview
• Medical IR and its Evaluation
• CLEF eHealth
– Context and tasks
– IR tasks description
– Datasets
– Evaluation
– Participation
• Conclusion
2

• CLEF eHealth
– Datasets
– Evaluation
– Participation
• Conclusion
3

4
Medical Professionals – Web
Search and Data
• Online information search on a regular basis
• Search failure for 2 patients out of 3
• PubMed search: very long (30+ minutes against 5
available)
• Knowledge production
constantly growing
• More and more publications
• Varying web access

5
Medical Professionals – Web
Search and Data

6
Patients and general public
• Change in the patient-physician relationship
• Patients more committed - cybercondria
• How can information quality be guaranteed?

7
Patients – Web Search and Data

8

9

Medical Information Retrieval
• How different is medical IR from general IR?
– Domain-specific search: narrowing down the
applications to improve results for categories of users
– Consequences of bad performances of a medical search
system
• Characteristics of medical IR:
– Data: medical/clinical reports, research papers, medical
websites…
– Information need: decision support,
technology/progress watch, education, daily care…
– Evaluation: relevance, readability, trustworthiness,
time
10

Evaluating Information Retrieval?
Did the user find the information she needed?
How many relevant documents did she get back?
What is a relevant document?
How many unrelevant document did she get back?
How long before she found the information?
Is she satisfied with the results?
…
Did the user find the information she needed?
How many relevant documents did she get back?
What is a relevant document?
How many unrelevant document did she get back?
How long before she found the information?
Is she satisfied with the results?
…
• Creation of (artificial) datasets representing a specific search
task, in order to compare various systems efficiency
• Involving human rating
• Shared with the community to improve IR
11

Typical IR Evaluation Dataset
Document Collection
Topic Set Relevance
Assessment
...
...
12

Existing Medical IR evaluation tasks
• Existing medical IR evaluation tasks:
 TREC Medical Records 2011, 2012
 TREC 2000 filtering track (corpus OHSUMED)
 TREC genomics 2003-2007
 ImageCLEFMed 2005-2013
 TREC clinical decision support 2014, 2015
No patient-centered evaluation task
13

• CLEF eHealth
– Datasets
– Evaluation
– Participation
• Conclusion
14

CLEF eHealth
AP: 72 yo w/ ESRD on HD,
CAD, HTN, asthma, p/w
significant hyperkalemia &
associated arrythmias.
15

CLEF eHealth Tasks
2013
• Task 1: Named entity
recognition in clinical
text
• Task 2: acronym
normalization in clinical
text
• Task 3: User-centred
health IR
2014
• Task 1: Visual-Interactive
Search and Exploration
of eHealth Data
• Task 2: Information
extraction from clinical
text
• Task 3: User-centred
health IR
2015
• Task 1a: Clinical speech recognition from nurses handover
• Task 1b: Clinical named entity recognition in French
• Task 2: User-centred health IR 16

• CLEF eHealth
– Datasets
– Evaluation
– Participation
• Conclusion
17

2013-2014
IR Evaluation Task Scenario
2015
18

IR Evaluation Task over the years
2013 2014 2015
Goal Help laypersons better
understand medical reports
Layperson checking
their symptoms
Topics 55 EN topics
built from
discharge
summaries
55 EN topics +
translation in
CZ, DE, FR
67 EN topics built from
images + translation in
AR, CZ, DE, FA, FR, IT,
PT
Documents Medical document collection provided by Khresmoi project
Relevance
assessment
Manual evaluation of relevance
of documents
Manual evaluation of
relevance and
readability of
documents
19

• CLEF eHealth
– Datasets
– Evaluation
– Participation
• Conclusion
20

Document Collection
• Web crawl of health-related documents (~ 1M)
• Made available through the Khresmoi project
(khresmoi.eu)
• Target: general public and medical professionals
• Broad range of medical topics covered
• Content:
• Health On the Net (HON) Foundation certified
websites (~60%)
• Various well-known medical websites: DrugBank,
Diagnosia, TRIP answers, etc. (~40%)
21

Topics &
context
Topics
2013
Manual creation
from randomly
selected annotation
of disorder in the
DS (context)
2014
Manual creation
from manually
identified main
disorders in the DS
(context)
2015
Manual creation from images describing
a medical problem (context)
22

Topics - Examples
<topic> <id>qtest3</id>
<discharge_summary>02115-010823-
DISCHARGE_SUMMARY.txt</discharge_summary>
<title>Asystolic arrest</title>
<desc>what is asystolic arrest</desc>
<narr>asystolic arrest and why does it cause death</narr>
<profile>A 87 year old woman with a stroke and asystolic arrest dies and
the daughter wants to know about asystolic arrest and what it
means.</profile>
</topic>
2013-2014
<topic> <id>clef2015.test.15</id>
<query>weird brown patches on skin</query>
</topic>
2015
23

Datasets - Summary
• Provided to the participants:
• Document collection
• Discharge summaries (optional) [2013-2014]
• Training set:
– 5 queries + qrels [2013]
– 5 queries (+ translation) + qrels [2014-2015]
• Test set:
– 50 queries [2013]
– 50 queries (+ translation) [2014]
– 62 queries (+ translation) [2015]
24

• CLEF eHealth
– Datasets
– Evaluation
– Participation
• Conclusion
25

Guidelines for Submissions
26
Submission of up to 7 runs (per language):
Run 1 (mandatory) - team baseline: only title and
description fields, no external resources.
Runs 2-4 (optional) any experiment WITH the DS.
Runs 5-7 (optional) any experiment WITHOUT the DS.
2013 - 2014
Submission of up to 10 ranked runs (per language):
Run 1 (mandatory): baseline run
Runs 2-10: any experiment with any external resource
2015

Relevance Assessment
 Manual relevance assessment conducted by medical
professionals and IR experts
 4-point scale assessment mapped to a binary scale
– {0: non relevant, 1: on topic but unreliable} → non
relevant
– {2: somewhat relevant, 3: relevant} → relevant
 4-point scale for NDCG and 2-point scale for precision
 [2015] Manual assessment of the readability of the
documents conducted by the same assessors on a 4-
point scale
27

Relevance Assessment - Pools
Training set Test set
2013 Merged top 30 ranked
documents from Vector
Space Model and Okapi
BM25
Merged top 10 documents
from participants baseline
run, the highest two priority
runs with DS and highest
two without DS
2014
2015 Merged top 10 documents
from participants three
highest priority runs
28

Evaluation Metrics
• Classical TREC evaluation: P@5, P@10,
NDCG@5, NDCG@10, MAP
• Ranking based on P@10
29

• CLEF eHealth
– Datasets
– Evaluation
– Participation
• Conclusion
30

Participants and Runs
Monolingual IR Multilingual IR
# teams # runs # teams # runs
2013 9 48 -- --
2014 14 62 2 24
2015 12 92 1 35
31

Baselines
2013:
• JSoup
• Okapi stop words & Porter stemmer
• Lucene BM25
2014:
• Indri HTML parser
• Okapi stop words & Krovetz stemmer
• Indri BM25, tf.idf, LM
32

33
2013 Participants P@10 (best run)
Team-Mayo (2)
Team-AEHRC (5)
Team-MEDINFO (1)
Team-UOG (5)
Team-THCIB (5)
Team-KC (1)
Team-UTHealth (1)
Team-QUT (2)
Team-OHSU (5)
0
0.1
0.2
0.3
0.4
0.5
0.6
BM25 BM25 +
PRF

2014 Task 3a P@10 (best run)GRIUM_EN_Run.5
SNUMEDINFO_EN_Run.2
KISTI_EN_Run.2
IRLabDAIICT_EN_Run.1
UIOWA_EN_Run.1
baseline.dir
DEMIR_EN_Run.6
RePaLi_EN_Run.5
NIJM_EN_Run.2
YORKU_EN_Run.5
UHU_EN_Run.5
COMPL_EN_Run.5
ERIAS_EN_Run.6
miracl_en_run.1
CUNI_EN_RUN.5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
34

35
Participants P@10 (2013 and 2014)
P@10
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
2013
2014
BM25 2013
LM Dirichlet smoothing 2014
35

Team-Mayo Team-AEHRCTeam-MEDINFO Team-UOG Team-THCIB Team-KC Team-UTHealth Team-QUT Team-OHSU
0
0.1
0.2
0.3
0.4
0.5
0.6
Baseline
Best run
36
2013 Participants' Results
Baseline vs best run

What Worked Well?
Team-Mayo:
• Markov Model Random Field to model query term
dependency
• QE using external collections
• Combination of indexing techniques + re-ranking
Team-AEHRC:
• Language Models with Dirichlet smoothing
• QE with spelling correction and acronym expansion
Team-MEDINFO: Query Likelihood Model
BM25 Baseline
37

COMPL
CUNI
DEMIR
ERIAS
GRIUM
IRLabDAIICT
KISTI
miracl_en_run.1
NIJM
RePaLi
SNUMEDINFO
UHU
UIOWA
YORKU
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Baseline
Best run
2014 Participant's Results
Baseline vs best run
38

What Worked Well?
Team-GRIUM:
• Hybrid IR approach (text-based and concept-based)`
• Language models
• Query expansion based on mutual information
Team-SNUMEDINFO:
• Language Models with Dirichlet smoothing
• QE with medical concepts
• Google translate
Team-KISTI:
• Language models
• Various QE approaches
39

Task 3b Results
CS DE FR
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
CUNI
SNUMEDINFO
40

41
2013 - Use of Discharge Summaries
Team-Mayo Team-Medinfo Team-THCIB Team-KC Team-QUT
0
0.1
0.2
0.3
0.4
0.5
0.6
With DS
Without DS
Baseline

42
How were DS used?
- Result re-ranking based on concepts extracted from
queries, relevant documents and DS (Team-Mayo)
- Query expansion:
* Filtering of non-relevant expansion terms/concepts
(Team-MEDINFO)
* Expansion with all concepts from query and DS (Team-
THCIB)
* Expansion with concepts identified in relevant passages
of the DS (Team-KC)
* Query refinement (Team-TOPSIG)

2014 - Use of Discharge
Summaries
IRLabDAIICT KISTI NIJM
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
DS
No DS
43

How Were DS Used?
●Query expansion:
● Expansion using Metamap, with expansion
candidates filtered using the DS (Team-
SNUMEDINFO)
● Expansion with abbreviations and DS combined
with pseudo-relevance feedback (Team-KISTI)
● Expansion with MeSH terminology and DS (Team-
IRLABDAIICT)
● Expansion with terms from the DS (Team-
Nijmegen)
44

• CLEF eHealth
– Datasets
– Evaluation
– Participation
– Further analysis
• Conclusion
45

46
Medical Queries Complexity
 Query complexity = number of medical
concepts/entities it contains
 radial neck fracture and healing time
 facial cuts and scar tissue
 nausea and vomiting and hematemesis
 Dataset:
 50 queries from CLEF eHealth 2013 (patients
queries)
 Runs from 9 teams
 Impact of the complexity on the systems
performances

• CLEF eHealth
– Datasets
– Evaluation
– Participation
– Further analysis
• Conclusion
48

Conclusion
• 3 successful years running CLEF eHealth
• Datasets are publicly available for research
purpose
• Used for research by organizers, participants,
and other groups
• Building a community – evaluation tasks,
workshop@SIGIR, special edition of JIR
49

For More Details
CLEF eHealth Lab overview:
Suominen et al. (2013). Overview of the ShARe/CLEF eHealth
Evaluation Lab 2013. In CLEF 2013 Proceedings.
Kelly et al. (2014). Overview of the ShARe/CLEF eHealth
Evaluation Lab 2014. In CLEF 2014 Proceedings.
CLEF eHealth IR task overview:
Goeuriot et al. (2013). ShAReCLEF eHealth Evaluation Lab
2013, Task 3: Information Retrieval to Address Patients’
Questions when Reading Clinical Reports. In CLEF 2013
Working notes.
Goeuriot et al. (2014). ShARe/CLEF eHealth Evaluation Lab
2014, Task 3: User-centred health information retrieval. In
CLEF 2013 Working notes.
50

Follow us!
http://sites.google.com/site/clefehealth2015
clef-ehealth-evaluation-lab-information
On Google groups
@clefehealth
Join the party in Toulouse: http://clef2015.clef-
initiative.eu/CLEF2015/conferenceRegistration.php
51

Consortium
• Lab chairs: Lorraine Goeuriot, Liadh Kelly
• Task 1: Hanna Suominen, Leif Hanlen, Gareth
Jones, Liyuan Zhou, Aurélie Névéol, Cyril
Grouin, Thierry Hamon, Pierre Zweigenbaum
• Task 2: Joao Palotti, Guido Zuccon, Allan
Hanbury, Mihai Lupu, Pavel Pecina
52

Task 3a - Topic Generation Process (1)
Discharge Medications:
1. Aspirin 81 mg Tablet, Delayed Release (E.C.) Sig: One (1) Tablet, Delayed Release (E.C.) PO
DAILY (Daily). Disp:*30 Tablet, Delayed Release (E.C.)(s)* Refills:*0*
2. Docusate Sodium 100 mg Capsule Sig: One (1) Capsule PO BID (2 times a day). Disp:*60
Capsule(s)* Refills:*0*
3. Levothyroxine Sodium 200 mcg Tablet Sig: One (1) Tablet PO DAILY (Daily).
Discharge Disposition:
Extended Care
Facility:
[**Hospital 5805**] Manor - [**Location (un) 348**]
Discharge Diagnosis:
Coronary artery disease.
s/p CABG
post op atrial fibrillation
54

Extended Care
Facility:
s/p CABG
55

Extended Care
Facility:
s/p CABG
What is coronary heart disease?
56

Medical Information Retrieval and its Evaluation: an Overview of CLEF eHealth Evaluation Task

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

Similar to Medical Information Retrieval and its Evaluation: an Overview of CLEF eHealth Evaluation Task

Similar to Medical Information Retrieval and its Evaluation: an Overview of CLEF eHealth Evaluation Task (20)

Recently uploaded

Recently uploaded (20)

Medical Information Retrieval and its Evaluation: an Overview of CLEF eHealth Evaluation Task