A Web Corpus for eCare
Collection, Lay Annotation and Learning
Marina Santini
(marina.santini@ri.se)
RISE SICS East
Sweden
3 September 2017 - LTA'17 - FedCSIS 2017 - Prague
RISE & SICS
• RI.SE = Research Institutes of Sweden
• SICS = Swedish Institute of Computer Science
• SICS East Linköping
• Group: Language Technology and Intelligent Interaction Design
• Group Leader: Professor Arne Jönsson (arne.jonsson@ri.se)
Citation: Santini M., Jönsson A., Nyström M. and Alirezai M. (2017) Web Corpus for eCare:
Collection, Lay Annotation and Learning. First Results. Proceedings of LTA'17, FedCSIS 2017, Prague.
Outline
• The eCare@home project
• The Language Technology contribution
• The eCare Swedish corpus
• Lay-Specialized Annotation
• The experiments: lay-specialized text classification
• Conclusions
eCare@home
• Website: http://ecareathome.se/
• Interdisciplinary project
• Funded by the Swedish Knowledge Foundation
1. Measure attributes about people and the environment.
2. Infer beyond that which we cannot measure.
3. Automatically configure devices and "things" to achieve a task.
4. Represent information in a human consumable way.
Language Technology and eCare@home
Generally speaking:
• The Internet of Things → Sensor Data = Numbers
• Language Technology → to present information in a “human consumable way”
How to Build a Medical Corpus for eCare?
Kardiologi vs hjärtspecialist
specialized vs lay
Starting point:
a Web Corpus for eCare
• Creation of a concept-specific medical web corpus useful for eCare (and eHealth) LT applications, such as:
• the automatic extraction of lay synonyms (or paraphrases) of medical terms
• text simplification
• etc.
Web Corpus
• Which textual sources?
• Using medical journals?
• Crawling user-generated texts?
• Relying on a specific web genre like blogs?
• Downloading medical websites?
• Web corpus!
Designing a corpus for eCare
1. having a publicly available medical corpus annotated with lay-specialized labels that can be easily shared;
2. having a corpus with a design and a structure that allow for expansion with additional documents over time;
3. accounting for very specific medical terms.
Potential Issues
1. expanding the corpus over time may cause scalability issues
2. the web is noisy → the corpus will be noisy: noise disturbs LT applications
We made two assumptions:
1. Scalability → increasing the size of the corpus does not necessarily affect the performance of LT applications negatively
2. Noise → noisy texts do not necessarily affect the performance of LT applications negatively
In the next slides...
1. Lay vs Specialized sublanguage
2. The construction and annotation of the eCare Swedish web corpus
3. Experiments 1 and 2:
• Experiment 1: robustness to scalability issues (increasing the size of the corpus does not necessarily affect the performance of LT applications negatively)
• Experiment 2: resilience to noise (noisy texts do not necessarily affect the performance of LT applications negatively)
Lay vs Specialized Sublanguage
• Definition of sublanguage:
• A language variety used by a specific user group in certain communicative/situational contexts or domains
• Ex: the jargon used by the police, the military, politicians, etc.
• Medical domain: not only jargon but also medical terminology!
• Two user groups in contact:
• patients → lay sublanguage (ex: heart specialist)
• professional staff → specialized sublanguage (ex: cardiologist)
Image source: The Guardian, 2014
eCare Web Corpus: Construction
• Seed terms from SNOMED CT
• Chronic diseases
• The corpus was bootstrapped with 228 seed terms, and 801 documents were downloaded with BootCaT (Baroni & Bernardini, 2004)
            Initial seeds   Retrieved seeds   Downloaded documents   Number of words
Unigrams               13                13                    112            91,118
Bigrams               215               142                    689           618,491
Total                 228               155                    801           709,609
Example of a long medical term (source: SNOMED CT)
Small Data (vs Big Data)
• Small data: data that is small enough in size for human comprehension.
• Use: for many problems and questions, small data in itself is enough.
• Small data (Wikipedia) = a new buzzword
eCare Web Corpus: Lay Annotation
• Annotation by a native speaker who has little knowledge of medical terminology.
• The lay-specialized text classification experiments described later on are based on this lay annotation.
Inter-rater agreement: Lay vs Expert (sample)
• Inter-rater agreement
To be taken with a grain of salt, but normally:
• Poor agreement = 0.20 or less
• Fair agreement = 0.20 to 0.40
• Moderate agreement = 0.40 to 0.60
• Good agreement = 0.60 to 0.80
• Very good agreement = 0.80 to 1.00
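A minimal sketch of how such an agreement score could be computed and mapped onto the scale above. It assumes the statistic meant is Cohen's kappa, for which this scale is commonly used; the function names and the toy labels are hypothetical and not taken from the eCare scripts.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same documents."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    # Observed agreement: share of documents given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


def interpret(kappa):
    """Map a kappa value onto the rule-of-thumb scale listed above."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


# Toy example (hypothetical labels, not corpus data).
lay = ["lay", "lay", "specialized", "specialized"]
expert = ["lay", "specialized", "specialized", "specialized"]
score = cohen_kappa(lay, expert)
print(score, interpret(score))
```

The boundaries of the scale overlap (e.g. 0.40 closes one band and opens the next); the sketch assigns a boundary value to the lower band, which is a choice made here and not stated on the slide.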
User group bias: annotation may be biased by the annotator’s domain expertise.
Noise
• Apparently the web is full of automatically translated documents! Do we need to sort them out? It depends!
• Out of 801 web documents, 339 have received comments by the lay annotator, e.g. "Machine Translated" or "it is about animals and not about humans".
• Is it important to remove this noise for some LT tasks? Maybe not always.
• Computationally, the presence of noise might be irrelevant, so it might not be worth investing resources in cleaning a corpus.
Lay/Specialized Text Classification
Based on the annotation made by the lay annotator
• Experiment 1: focus on scalability
• Experiment 2: focus on the impact of noise on performance
Features
No text pre-processing
The texts were defined as “string”
Filter converting strings to word vectors
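A minimal sketch of what converting raw strings to word vectors can look like, with no pre-processing beyond whitespace splitting. The experiments presumably used a filter such as Weka's StringToWordVector; the Python below is only an illustration, and the function names, the binary presence weighting and the toy texts are assumptions.

```python
def build_vocabulary(texts):
    """Assign an index to every whitespace-separated token, in order of first appearance."""
    vocabulary = {}
    for text in texts:
        for token in text.split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def to_word_vector(text, vocabulary):
    """Binary word vector: 1 if the token occurs in the text, 0 otherwise."""
    vector = [0] * len(vocabulary)
    for token in text.split():
        index = vocabulary.get(token)
        if index is not None:  # tokens outside the vocabulary are ignored
            vector[index] = 1
    return vector


# Toy documents (hypothetical, not taken from the corpus).
texts = ["kardiologi hjärta", "hjärtspecialist hjärta"]
vocabulary = build_vocabulary(texts)
vectors = [to_word_vector(text, vocabulary) for text in texts]
print(vocabulary)
print(vectors)
```

Each document becomes one fixed-length vector, which is the representation a classifier such as an SVM takes as input.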
Experiment 1: Scalability
Experiment 2: Resilience to Noise
• Noisy vs Cleaned
New Experiment: Lay vs Specialized Annotation
• SVM (Weka implementation: SMO)
801 documents labelled by the lay annotator:
       k      Acc.   Avg. P   Avg. R   Avg. F   ROC A.   Avg. TP   Avg. FP
SMO    0.49   78.6   0.78     0.78     0.78     0.74     0.78      0.29
778 documents labelled by the expert annotator:
       k      Acc.   Avg. P   Avg. R   Avg. F   ROC A.   Avg. TP   Avg. FP
SMO    0.54   79.5   0.79     0.79     0.79     0.77     0.79      0.25
Conclusions: Findings
• We presented the construction of the eCare corpus
• We made two claims and supported them with two experiments:
1. We can design an extensible/dynamic corpus without fearing scalability issues
2. We can use a noisy corpus to build (certain) LT applications without fearing bad performance
Replicability
• We encourage the replication of the results presented in this paper, and welcome improvements and further discussion.
• Corpus, scripts and the output of the classification models are available from the project website: http://ecareathome.se.
Thanks for your attention!
Any questions?
