In this study, we put forward two claims: 1) it is possible to design a dynamic and extensible corpus without running into scalability problems; 2) it is possible to devise noise-resistant Language Technology applications without hurting performance. To support these claims, we describe the design, construction and limitations of a highly specialized medical web corpus, called eCare_Sv_01, and we present two experiments on lay-specialized text classification. eCare_Sv_01 is a small corpus of web documents written in Swedish. The corpus contains documents about chronic diseases. The sublanguage used in each document has been labelled as "lay" or "specialized" by a lay annotator. The corpus is designed as a flexible text resource, to which additional medical documents will be appended over time. Experiments show that the lay-specialized labels assigned by the lay annotator are reliably learned by standard classifiers. More specifically, Experiment 1 shows that scalability is not an issue when the size of the datasets to be learned is increased from 156 to 801 documents. Experiment 2 shows that lay-specialized labels can be learned despite the many disturbing factors, such as machine-translated documents and low-quality texts, that are numerous in the corpus.
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
1. A Web Corpus for eCare
Collection, Lay Annotation and Learning
Marina Santini
(marina.santini@ri.se)
RISE SICS East
Sweden
3 September 2017, LTA'17 - FedCSIS 2017 - Prague
2. RISE & SICS
• RI.SE = Research Institutes of Sweden
• SICS = Swedish Institute of Computer Science
• SICS East Linköping
• Group: Language Technology and Intelligent Interaction Design
• Group Leader: Professor Arne Jönsson (arne.jonsson@ri.se)
Citation: Santini M., Jönsson A., Nyström M. and Alirezai M. (2017) Web Corpus for eCare: Collection, Lay Annotation and Learning. First Results. Proceedings of LTA'17, FedCSIS 2017, Prague.
3. Outline
• The eCare@home project
• The Language Technology contribution
• The eCare Swedish corpus
• Lay-Specialized Annotation
• The experiments: lay-specialized text classification
• Conclusions
4. eCare@home
• Website: http://ecareathome.se/
• Interdisciplinary project
• Funded by Swedish Knowledge Foundation
1. Measure attributes about people and the environment.
2. Infer beyond that which we cannot measure.
3. Automatically configure devices and "things" to achieve a task.
4. Represent information in a human consumable way.
5. Language Technology and eCare@home
Generally speaking:
• The Internet of Things: sensor data = numbers
• Language Technology is used to present this information in a “human consumable way”
6. How to Build a Medical Corpus for eCare?
Kardiologi vs hjärtspecialist (“cardiologist” vs “heart specialist”)
specialized vs lay
7. Starting point: a Web Corpus for eCare
• Creation of a concept-specific medical web corpus useful for eCare (and eHealth) LT applications, such as:
• the automatic extraction of lay synonyms (or paraphrases) of medical terms,
• text simplification
• etc.
8. Web Corpus
• Which textual sources?
• Using medical journals?
• Crawling user-generated texts?
• Relying on a specific web genre like blogs?
• Downloading medical websites?
• Web corpus!
9. Designing a corpus for eCare
1. having a publicly-available medical corpus annotated with lay-specialized labels that can be easily shared;
2. having a corpus with a design and a structure that allow for expansion with additional documents over time;
3. accounting for very specific medical terms.
10. Potential Issues
1. expanding the corpus over time may cause scalability issues
2. the web is noisy, so the corpus will be noisy: noise disturbs LT applications

We made two assumptions:
1. Scalability: increasing the size of the corpus does not necessarily affect the performance of LT applications negatively
2. Noise: noisy texts do not necessarily affect the performance of LT applications negatively
11. In the next slides...
1. Lay vs Specialized sublanguage
2. The construction and annotation of the eCare Swedish web corpus
3. Experiments 1 and 2:
• Experiment 1: robustness to scalability issues (increasing the size of the corpus does not necessarily affect the performance of LT applications negatively)
• Experiment 2: resilience to noise (noisy texts do not necessarily affect the performance of LT applications negatively)
12. Lay vs Specialized Sublanguage
• Definition of sublanguage:
• A language variety used by a specific user group in certain communicative/situational contexts or domains
• Ex: the jargon used by the police, the military, politicians, etc.
• Medical domain: not only jargon but also medical terminology!
• Two user groups in contact:
• patients → lay sublanguage (ex: heart specialist)
• professional staff → specialized sublanguage (ex: cardiologist)
The Guardian, 2014
13. eCare Web Corpus: Construction
• Seed terms from SNOMED CT
• Chronic diseases
• The corpus has been bootstrapped with 228 seed terms, and 801 documents were bootcatted (BootCat, Baroni & Bernardini, 2004); see the sketch after the table below
            Initial seeds   Retrieved seeds   Downloaded documents   Number of words
Unigrams    13              13                112                    91,118
Bigrams     215             142               689                    618,491
Total       228             155               801                    709,609
Example of a long medical term (source: SNOMED CT)
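As a rough illustration of the bootstrapping step, here is a minimal sketch of the BootCat-style idea of combining seed terms into random search tuples. It is not the authors' pipeline: the file name snomed_seeds_sv.txt, the tuple size and the number of tuples are assumptions, and the search-and-download step that BootCat performs is left as a placeholder.

```python
# Minimal sketch of BootCat-style seed-tuple generation (not the authors' code).
# Assumes a plain-text file of SNOMED CT seed terms, one per line.
import random

def load_seeds(path):
    """Read seed terms (unigrams or bigrams), one per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def make_tuples(seeds, tuple_size=3, n_tuples=20, seed=0):
    """Randomly combine seed terms into search tuples, BootCat-style."""
    rng = random.Random(seed)
    tuples = set()
    while len(tuples) < n_tuples:
        tuples.add(tuple(sorted(rng.sample(seeds, tuple_size))))
    return [" ".join(t) for t in tuples]

if __name__ == "__main__":
    seeds = load_seeds("snomed_seeds_sv.txt")  # hypothetical file name
    for query in make_tuples(seeds):
        # Placeholder: BootCat would send each tuple to a search engine
        # and download the returned pages.
        print(query)
```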
14. Small Data (vs Big Data)
• Small data: data whose size is small enough for human comprehension.
• Use: for many problems and questions, small data is in itself enough.
• Small data (Wikipedia) = a new buzzword
15. eCare Web Corpus: Lay Annotation
• Annotation by a native speaker who participates in the project and who has little knowledge of medical terminology.
• The lay-specialized text classification experiments described later on are based on this lay annotation.
16. Inter-rater agreement: Lay vs Expert (sample)
• Inter-rater agreement
To be taken with a grain of salt, but normally:
• Poor agreement = 0.20 or less
• Fair agreement = 0.20 to 0.40
• Moderate agreement = 0.40 to 0.60
• Good agreement = 0.60 to 0.80
• Very good agreement = 0.80 to 1.00
User group bias:
the annotation may be biased by the annotator’s domain expertise.
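For reference, agreement figures like these can be computed with Cohen's kappa. The snippet below is a minimal sketch (not the authors' scripts) using scikit-learn, with made-up labels for illustration.

```python
# Minimal sketch of computing inter-rater agreement with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

# Hypothetical lay/specialized labels assigned to the same sample of documents.
lay_annotator    = ["lay", "lay", "specialized", "lay", "specialized", "specialized"]
expert_annotator = ["lay", "specialized", "specialized", "lay", "specialized", "lay"]

kappa = cohen_kappa_score(lay_annotator, expert_annotator)
print(f"Cohen's kappa: {kappa:.2f}")  # interpret against the scale above
```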
17. Noise
• Apparently the web is full of automatically translated documents! Do we need to sort them out? It depends!
• Out of 801 web documents, 339 have received comments by the lay annotator, e.g. "Machine Translated" or "it is about animals and not about humans".
• Is it important to remove this noise for some LT tasks? Maybe not always.
• Computationally, the presence of noise might be irrelevant, so it might not be worth investing resources in cleaning a corpus (a hypothetical filtering sketch follows below).
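Below is a hypothetical sketch of how a "cleaned" subset could be derived from the full noisy corpus, assuming each document record carries the lay annotator's free-text comment (empty when nothing was flagged); the field names are invented for illustration.

```python
# Hypothetical sketch: keep the full noisy corpus and derive a cleaned subset
# by dropping every document the lay annotator flagged with a comment
# (e.g. "Machine Translated"). Field names are assumptions.
def split_noisy_cleaned(records):
    """records: iterable of dicts with 'text', 'label' and 'comment' keys."""
    noisy = list(records)                                 # all documents
    cleaned = [r for r in noisy if not r.get("comment")]  # drop flagged documents
    return noisy, cleaned

# Hypothetical usage: noisy, cleaned = split_noisy_cleaned(corpus_records)
```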
18. Lay/Specialized Text Classification
Based on the annotation made by the lay annotator
• Experiment 1: focus on scalability
• Experiment 2: focus on the impact of noise on performance
19. Features
No text pre-processing
The texts were defined as “string”
Filter converting strings to word vectors
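The slide presumably refers to Weka-style filtering of the raw document strings into word vectors. The sketch below shows an analogous step with scikit-learn; it is an assumption for illustration, not the authors' setup, and the example sentences are made up.

```python
# Analogous sketch of "strings -> word vectors" with scikit-learn
# (the paper uses a Weka filter; this is not the authors' setup).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Kardiologi är läran om hjärtats sjukdomar.",       # made-up specialized text ("cardiology is the study of heart diseases")
    "En hjärtspecialist hjälper dig med ditt hjärta.",  # made-up lay text ("a heart specialist helps you with your heart")
]

vectorizer = CountVectorizer()      # no other pre-processing of the raw strings
X = vectorizer.fit_transform(docs)  # sparse document-term matrix
print(X.shape)                      # (n_documents, vocabulary_size)
```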
21. Experiment 2: Resilience to Noise
• Noisy vs Cleaned
22. New Experiment: Lay vs Specialized Annotation
• SVM (Weka implementation: SMO)

801 documents labelled by the lay annotator:
       k      Acc.   Avg. P   Avg. R   Avg. F   ROC A.   Avg. TP   Avg. FP
SMO    0.49   78.6   0.78     0.78     0.78     0.74     0.78      0.29

778 documents labelled by the expert annotator:
       k      Acc.   Avg. P   Avg. R   Avg. F   ROC A.   Avg. TP   Avg. FP
SMO    0.54   79.5   0.79     0.79     0.79     0.77     0.79      0.25
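As a hedged sketch of how comparable figures could be reproduced, the snippet below runs an SMO-like linear SVM with 10-fold cross-validation. The paper uses Weka's SMO; here scikit-learn's LinearSVC stands in, and the data-loading variables are assumptions.

```python
# Hedged sketch (not the authors' Weka setup): raw strings -> word vectors ->
# linear SVM, evaluated with 10-fold cross-validation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, cohen_kappa_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def evaluate(texts, labels):
    """texts: list of document strings; labels: 'lay'/'specialized' per document."""
    model = make_pipeline(CountVectorizer(), LinearSVC())
    preds = cross_val_predict(model, texts, labels, cv=10)
    print("Accuracy:", accuracy_score(labels, preds))
    print("Kappa:   ", cohen_kappa_score(labels, preds))
    print(classification_report(labels, preds))

# Hypothetical usage, assuming the corpus has been loaded elsewhere:
# evaluate(lay_texts, lay_labels)        # 801 documents, lay annotation
# evaluate(expert_texts, expert_labels)  # 778 documents, expert annotation
```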
23. Conclusions: Findings
• We presented the construction of the eCare corpus
• We made two claims and supported them with two experiments:
1. We can design an extensible/dynamic corpus without fearing scalability issues
2. We can use a noisy corpus to build (certain) LT applications without fearing bad performance
24. Replicability
• We encourage the replication of the results presented in this paper, and welcome improvements and further discussion.
• Corpus, scripts and the output of the classification models are available from the project website: http://ecareathome.se.
25. Thanks for your attention!
Any Questions?