SlideShare a Scribd company logo
A Web Corpus for eCare
Collection, Lay Annotation and Learning
Marina Santini
(marina.santini@ri.se)
RISE SICS East
Sweden
3 September 2017LTA'17 - FedCSIS 2017 - Prague
1
RISE & SICS
• RI.SE = Research Institutes of Sweden
• SICS = Swedish Institute of Computer Science
• SICS East Linköping
• Group: Language Technology and Intelligent Interaction Design
• Group Leader: Professor Arne Jönsson (arne.jonsson@ri.se)
Citation: Santini M., Jönsson A., Nyström M. and Alirezai M. (2017) Web Corpus for eCare:
Collection, Lay Annotation and Learning. First Results. Proceedings of LTA'17, FedCSIS 2017, Prague.
3 September 2017LTA'17 - FedCSIS 2017 - Prague 2
Outline
• The eCare@home project
• The Language Technology contribution
• The eCare Swedish corpus
• Lay-Specialized Annotation
• The experiments: lay-specialized text classification
• Conclusions
3 September 2017LTA'17 - FedCSIS 2017 - Prague 3
eCare@home
• Website: http://ecareathome.se/
• Interdisciplinary project
• Funded by Swedish Knowledge Foundation
1. Measure attributes about people and the environment.
2. Infer beyond that which we cannot measure.
3. Automatically configure devices and "things" to achieve a task.
4. Represent information in a human consumable way.
3 September 2017LTA'17 - FedCSIS 2017 - Prague 4
Language Technology and eCare@home
3 September 2017LTA'17 - FedCSIS 2017 - Prague 5
Generally speaking:
• The Internet of Things  Sensor Data = Numbers
• Language Technology  to present information in a “human consumable way”
How to Build a Medical Corpus for eCare?
3 September 2017LTA'17 - FedCSIS 2017 - Prague 6
Kardiologi vs hjärtspecialist
specialized vs lay
Starting point:
a Web Corpus for eCare
3 September 2017LTA'17 - FedCSIS 2017 - Prague 7
• Creation of a concept-specific medical web corpus
useful for eCare (and eHealth) LT applications, such as:
• the automatic extraction of lay synonyms (or paraphrases) of
medical terms,
• text simplification
• etc.
Web Corpus
• Which textual sources?
3 September 2017 LTA'17 - FedCSIS 2017 - Prague 8
• Using medical journals?
• Crawling user-generated texts?
• Relying on a specific web genre like blogs?
• Downloading medical websites?
• Web corpus!
Designing a corpus for eCare
3 September 2017LTA'17 - FedCSIS 2017 - Prague 9
1. having a publicly-available medical corpus annotated with lay-
specialized labels that can be easily shared;
2. having a corpus with a design and a structure that allow for
expansion with additional documents over time;
3. accounting for very specific medical terms.
Potential Issues
3 September 2017LTA'17 - FedCSIS 2017 - Prague 10
1. expanding the corpus over time may cause scalability issues
2. the web is noisy  the corpus will be noisy: noise disturbs LT
applications
We made two assumptions:
1. scalability  increasing the size of the corpus does not necessarily affect the
performance of LT applications negatively
2. Noise  noisy texts do not necessarily affect the performance of LT applications
negatively
In the next slides...
3 September 2017LTA'17 - FedCSIS 2017 - Prague 11
1. Lay vs Specialized sublanguage
2. The construction and annotation of the eCare Swedish web
corpus
3. Experiments 1 and 2:
• Experiment 1: robustness to scalability issues ( increasing the size of
the corpus does not necessarily affect the performance of LT application
negatively )
• Experiment 2: resilience to noise ( noisy texts do not necessarily affect
the performance of LT application negatively)
Lay vs Specialized Sublanguage
3 September 2017LTA'17 - FedCSIS 2017 - Prague 12
• Definition of sublanguage:
• A language variety used by a specific user group in
certain communicative/situational contexts or domains
• Ex: the jargon used by police, or by the military, or by
the politicians, etc.
• Medical domain not only jargon but also medical
terminology!
• Two user groups in contact:
• patients  lay sublanguage (ex: heart specialist)
• professional staff  specialized sublanguage (ex:
cardiologist)
The Guardian, 2014 
eCare Web Corpus:
Construction
3 September 2017LTA'17 - FedCSIS 2017 - Prague 13
• Seed terms from SNOMED CT
• Chronic diseases
• The corpus has been bootstrapped with 228
seed terms and 801 documents were bootcat-
ted (BootCat, Baroni & Bernardini, 2004)
Initial seeds Retrieved seeds Downloaded
documents
Number of
words
Unigrams 13 13 112 91 118
Bigrams 215 142 689 618 491
Total 228 155 801 709 609
Example of long medical term (source SNOMED CT
Small Data (vs Big Data)
3 September 2017LTA'17 - FedCSIS 2017 - Prague 14
• Small data: data that has small enough size for human comprehension.
• Use: for many problems and questions, small data in itself is enough.
• Small data (wikipedia) = a new buzz word
eCare Web Corpus: Lay Annotation
3 September 2017LTA'17 - FedCSIS 2017 - Prague 15
• Annotation by a native speaker who participates
who has little knowledge of medical
terminology.
• The lay-specialized text classification
experiments described later on are based on this
lay annotation.
Inter-rater agreement: Lay vs Expert (sample)
3 September 2017LTA'17 - FedCSIS 2017 - Prague 16
• Interrater agreement
To be taken with a grain of salt , but normally:
• Poor agreement = 0.20 or less
• Fair agreement = 0.20 to 0.40
• Moderate agreement = 0.40 to 0.60
• Good agreement = 0.60 to 0.80
• Very good agreement = 0.80 to 1.00
User group bias:
annotation maby be biassed by the
annotator’s domain expertise.
Noise
3 September 2017LTA'17 - FedCSIS 2017 - Prague 17
• Apparently the web is full of automatically translated
documents ! Do we need to sort them out? It depends!
• Out of 801 web documents, 339 have received comments
by the lay annotator, e.g. "Machine Translated" or "it is
about animals and not about humans".
• Is it important to remove this noise-ness for some LT tasks?
Maybe not always
• Computationally, the presence of noise might be irrelevant,
so it might not be worth investing resource for cleaning a
corpus
Lay/Specialized Text Classification
3 September 2017LTA'17 - FedCSIS 2017 - Prague 18
Based on the annotation made by the lay annotator
• Experiment 1: focus on scalability
• Experiment 2: focus on the impact of noise on performance
Features
3 September 2017LTA'17 - FedCSIS 2017 - Prague 19
No text pre-processing
The texts were defined as “string”
Filter converting strings to word vectors
Experiment 1: Scalability
3 September 2017LTA'17 - FedCSIS 2017 - Prague 20
Experiment 2: Resilience to Noise
3 September 2017LTA'17 - FedCSIS 2017 - Prague 21
• Noisy vs Cleaned
New Experiment: Lay vs Specialized Annotation
3 September 2017LTA'17 - FedCSIS 2017 - Prague 22
• SVM (weka implementation: SMO)
k Acc. Avg. P Avg. R Avg. F ROC A. Avg. TP Avg. FP
SMO 0.49 78.6 0.78 0.78 0.78 0.74 0.78 0.29
801 documents labelled by the lay annotator
k Acc. Avg. P Avg. R Avg. F ROC A. Avg. TP Avg. FP
SMO 0.54 79.5 0,79 0,79 0,79 0,77 0,79 0,25
778 documents labelled by the expert annotator
Conclusions: Findings
3 September 2017LTA'17 - FedCSIS 2017 - Prague 23
• We presented the construction of the the eCare corpus
• We made two claims and we supported our claims with
two experiments:
1. We can design an extensible/dynamic corpus without fearing
scalability issues
2. We can use a noisy corpus to build (certain) LT applications
without fearing bad performance
Replicability
3 September 2017LTA'17 - FedCSIS 2017 - Prague 24
• We encourage the replication of the results presented in this
paper, and welcome improvements and further discussion.
• Corpus, scripts and the output of the classification models
are available from the project website:
http://ecareathome.se.
Thanks for your attention !
3 September 2017 LTA'17 - FedCSIS 2017 - Prague 25
Any Questions ?

More Related Content

Similar to A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-

Research software susainability
Research software susainabilityResearch software susainability
Research software susainability
Daniel S. Katz
 
Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval GESIS
 
Providing Research Graph data in JSON-LD using Schema.org
Providing Research Graph data in JSON-LD using Schema.orgProviding Research Graph data in JSON-LD using Schema.org
Providing Research Graph data in JSON-LD using Schema.org
Jingbo Wang
 
Scientific Knowledge Graphs: an Overview
Scientific Knowledge Graphs: an OverviewScientific Knowledge Graphs: an Overview
Scientific Knowledge Graphs: an Overview
Angelo Salatino
 
Hybrid machine translation by combining multiple machine translation systems
Hybrid machine translation by combining multiple machine translation systemsHybrid machine translation by combining multiple machine translation systems
Hybrid machine translation by combining multiple machine translation systems
Matīss ‎‎‎‎‎‎‎  
 
Metonymy Labs ICAIL-ASAIL-Workshop (June16, 2017)
Metonymy Labs ICAIL-ASAIL-Workshop (June16, 2017)Metonymy Labs ICAIL-ASAIL-Workshop (June16, 2017)
Metonymy Labs ICAIL-ASAIL-Workshop (June16, 2017)
Kripa (कृपा) Rajshekhar
 
Search term recommendation and non-textual ranking evaluated
 Search term recommendation and non-textual ranking evaluated Search term recommendation and non-textual ranking evaluated
Search term recommendation and non-textual ranking evaluatedGESIS
 
Publishing conference proceedings internationally: Tips and tricks
Publishing conference proceedings internationally: Tips and tricksPublishing conference proceedings internationally: Tips and tricks
Publishing conference proceedings internationally: Tips and tricks
Aliaksandr Birukou
 
Multilingual qa
Multilingual qaMultilingual qa
Multilingual qa
shakimov
 
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...
Menzo Windhouwer
 
e-Legal Deposit Survey 2017
e-Legal Deposit Survey 2017e-Legal Deposit Survey 2017
e-Legal Deposit Survey 2017
Frederick Zarndt
 
Sources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization SystemsSources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization Systems
Paul Groth
 
Using Knowledge Graph for Promoting Cognitive Computing
Using Knowledge Graph for Promoting Cognitive ComputingUsing Knowledge Graph for Promoting Cognitive Computing
Using Knowledge Graph for Promoting Cognitive Computing
Artificial Intelligence Institute at UofSC
 
ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...
ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...
ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...
Dr. Haxel Consult
 
Aligning Nuclear Physics Computing Techniques with Non-Research Physics Careers
Aligning Nuclear Physics Computing Techniques with Non-Research Physics CareersAligning Nuclear Physics Computing Techniques with Non-Research Physics Careers
Aligning Nuclear Physics Computing Techniques with Non-Research Physics Careers
Wouter Deconinck
 
RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)
Vladimir Alexiev, PhD, PMP
 
eNanoMapper database, search tools and templates
eNanoMapper database, search tools and templateseNanoMapper database, search tools and templates
eNanoMapper database, search tools and templates
Nina Jeliazkova
 
Cork AI Meetup Number 3
Cork AI Meetup Number 3Cork AI Meetup Number 3
Cork AI Meetup Number 3
Nick Grattan
 
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...
Advanced-Concepts-Team
 
ProteomeXchange update
ProteomeXchange updateProteomeXchange update
ProteomeXchange update
Juan Antonio Vizcaino
 

Similar to A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results- (20)

Research software susainability
Research software susainabilityResearch software susainability
Research software susainability
 
Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval
 
Providing Research Graph data in JSON-LD using Schema.org
Providing Research Graph data in JSON-LD using Schema.orgProviding Research Graph data in JSON-LD using Schema.org
Providing Research Graph data in JSON-LD using Schema.org
 
Scientific Knowledge Graphs: an Overview
Scientific Knowledge Graphs: an OverviewScientific Knowledge Graphs: an Overview
Scientific Knowledge Graphs: an Overview
 
Hybrid machine translation by combining multiple machine translation systems
Hybrid machine translation by combining multiple machine translation systemsHybrid machine translation by combining multiple machine translation systems
Hybrid machine translation by combining multiple machine translation systems
 
Metonymy Labs ICAIL-ASAIL-Workshop (June16, 2017)
Metonymy Labs ICAIL-ASAIL-Workshop (June16, 2017)Metonymy Labs ICAIL-ASAIL-Workshop (June16, 2017)
Metonymy Labs ICAIL-ASAIL-Workshop (June16, 2017)
 
Search term recommendation and non-textual ranking evaluated
 Search term recommendation and non-textual ranking evaluated Search term recommendation and non-textual ranking evaluated
Search term recommendation and non-textual ranking evaluated
 
Publishing conference proceedings internationally: Tips and tricks
Publishing conference proceedings internationally: Tips and tricksPublishing conference proceedings internationally: Tips and tricks
Publishing conference proceedings internationally: Tips and tricks
 
Multilingual qa
Multilingual qaMultilingual qa
Multilingual qa
 
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...
 
e-Legal Deposit Survey 2017
e-Legal Deposit Survey 2017e-Legal Deposit Survey 2017
e-Legal Deposit Survey 2017
 
Sources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization SystemsSources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization Systems
 
Using Knowledge Graph for Promoting Cognitive Computing
Using Knowledge Graph for Promoting Cognitive ComputingUsing Knowledge Graph for Promoting Cognitive Computing
Using Knowledge Graph for Promoting Cognitive Computing
 
ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...
ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...
ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...
 
Aligning Nuclear Physics Computing Techniques with Non-Research Physics Careers
Aligning Nuclear Physics Computing Techniques with Non-Research Physics CareersAligning Nuclear Physics Computing Techniques with Non-Research Physics Careers
Aligning Nuclear Physics Computing Techniques with Non-Research Physics Careers
 
RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)
 
eNanoMapper database, search tools and templates
eNanoMapper database, search tools and templateseNanoMapper database, search tools and templates
eNanoMapper database, search tools and templates
 
Cork AI Meetup Number 3
Cork AI Meetup Number 3Cork AI Meetup Number 3
Cork AI Meetup Number 3
 
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...
 
ProteomeXchange update
ProteomeXchange updateProteomeXchange update
ProteomeXchange update
 

More from Marina Santini

An Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesAn Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability Features
Marina Santini
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
Marina Santini
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic Web
Marina Santini
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
Marina Santini
 
Relation Extraction
Relation ExtractionRelation Extraction
Relation Extraction
Marina Santini
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question Answering
Marina Santini
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
Marina Santini
 
Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)
Marina Santini
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
Marina Santini
 
Lecture: Word Senses
Lecture: Word SensesLecture: Word Senses
Lecture: Word Senses
Marina Santini
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
Marina Santini
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role Labeling
Marina Santini
 
Semantics and Computational Semantics
Semantics and Computational SemanticsSemantics and Computational Semantics
Semantics and Computational Semantics
Marina Santini
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)
Marina Santini
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1)
Marina Santini
 
Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Lecture 5: Interval Estimation
Lecture 5: Interval Estimation
Marina Santini
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Marina Santini
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)
Marina Santini
 
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Marina Santini
 
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Marina Santini
 

More from Marina Santini (20)

An Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesAn Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability Features
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic Web
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
Relation Extraction
Relation ExtractionRelation Extraction
Relation Extraction
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question Answering
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
Lecture: Word Senses
Lecture: Word SensesLecture: Word Senses
Lecture: Word Senses
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role Labeling
 
Semantics and Computational Semantics
Semantics and Computational SemanticsSemantics and Computational Semantics
Semantics and Computational Semantics
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1)
 
Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Lecture 5: Interval Estimation
Lecture 5: Interval Estimation
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)
 
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
 
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)
 

Recently uploaded

JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 

Recently uploaded (20)

JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 

A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-

  • 1. A Web Corpus for eCare Collection, Lay Annotation and Learning Marina Santini (marina.santini@ri.se) RISE SICS East Sweden 3 September 2017LTA'17 - FedCSIS 2017 - Prague 1
  • 2. RISE & SICS • RI.SE = Research Institutes of Sweden • SICS = Swedish Institute of Computer Science • SICS East Linköping • Group: Language Technology and Intelligent Interaction Design • Group Leader: Professor Arne Jönsson (arne.jonsson@ri.se) Citation: Santini M., Jönsson A., Nyström M. and Alirezai M. (2017) Web Corpus for eCare: Collection, Lay Annotation and Learning. First Results. Proceedings of LTA'17, FedCSIS 2017, Prague. 3 September 2017LTA'17 - FedCSIS 2017 - Prague 2
  • 3. Outline • The eCare@home project • The Language Technology contribution • The eCare Swedish corpus • Lay-Specialized Annotation • The experiments: lay-specialized text classification • Conclusions 3 September 2017LTA'17 - FedCSIS 2017 - Prague 3
  • 4. eCare@home • Website: http://ecareathome.se/ • Interdisciplinary project • Funded by Swedish Knowledge Foundation 1. Measure attributes about people and the environment. 2. Infer beyond that which we cannot measure. 3. Automatically configure devices and "things" to achieve a task. 4. Represent information in a human consumable way. 3 September 2017LTA'17 - FedCSIS 2017 - Prague 4
  • 5. Language Technology and eCare@home 3 September 2017LTA'17 - FedCSIS 2017 - Prague 5 Generally speaking: • The Internet of Things  Sensor Data = Numbers • Language Technology  to present information in a “human consumable way”
  • 6. How to Build a Medical Corpus for eCare? 3 September 2017LTA'17 - FedCSIS 2017 - Prague 6 Kardiologi vs hjärtspecialist specialized vs lay
  • 7. Starting point: a Web Corpus for eCare 3 September 2017LTA'17 - FedCSIS 2017 - Prague 7 • Creation of a concept-specific medical web corpus useful for eCare (and eHealth) LT applications, such as: • the automatic extraction of lay synonyms (or paraphrases) of medical terms, • text simplification • etc.
  • 8. Web Corpus • Which textual sources? 3 September 2017 LTA'17 - FedCSIS 2017 - Prague 8 • Using medical journals? • Crawling user-generated texts? • Relying on a specific web genre like blogs? • Downloading medical websites? • Web corpus!
  • 9. Designing a corpus for eCare 3 September 2017LTA'17 - FedCSIS 2017 - Prague 9 1. having a publicly-available medical corpus annotated with lay- specialized labels that can be easily shared; 2. having a corpus with a design and a structure that allow for expansion with additional documents over time; 3. accounting for very specific medical terms.
  • 10. Potential Issues 3 September 2017LTA'17 - FedCSIS 2017 - Prague 10 1. expanding the corpus over time may cause scalability issues 2. the web is noisy  the corpus will be noisy: noise disturbs LT applications We made two assumptions: 1. scalability  increasing the size of the corpus does not necessarily affect the performance of LT applications negatively 2. Noise  noisy texts do not necessarily affect the performance of LT applications negatively
  • 11. In the next slides... 3 September 2017LTA'17 - FedCSIS 2017 - Prague 11 1. Lay vs Specialized sublanguage 2. The construction and annotation of the eCare Swedish web corpus 3. Experiments 1 and 2: • Experiment 1: robustness to scalability issues ( increasing the size of the corpus does not necessarily affect the performance of LT application negatively ) • Experiment 2: resilience to noise ( noisy texts do not necessarily affect the performance of LT application negatively)
  • 12. Lay vs Specialized Sublanguage 3 September 2017LTA'17 - FedCSIS 2017 - Prague 12 • Definition of sublanguage: • A language variety used by a specific user group in certain communicative/situational contexts or domains • Ex: the jargon used by police, or by the military, or by the politicians, etc. • Medical domain not only jargon but also medical terminology! • Two user groups in contact: • patients  lay sublanguage (ex: heart specialist) • professional staff  specialized sublanguage (ex: cardiologist) The Guardian, 2014 
  • 13. eCare Web Corpus: Construction 3 September 2017LTA'17 - FedCSIS 2017 - Prague 13 • Seed terms from SNOMED CT • Chronic diseases • The corpus has been bootstrapped with 228 seed terms and 801 documents were bootcat- ted (BootCat, Baroni & Bernardini, 2004) Initial seeds Retrieved seeds Downloaded documents Number of words Unigrams 13 13 112 91 118 Bigrams 215 142 689 618 491 Total 228 155 801 709 609 Example of long medical term (source SNOMED CT
  • 14. Small Data (vs Big Data) 3 September 2017LTA'17 - FedCSIS 2017 - Prague 14 • Small data: data that has small enough size for human comprehension. • Use: for many problems and questions, small data in itself is enough. • Small data (wikipedia) = a new buzz word
  • 15. eCare Web Corpus: Lay Annotation 3 September 2017LTA'17 - FedCSIS 2017 - Prague 15 • Annotation by a native speaker who participates who has little knowledge of medical terminology. • The lay-specialized text classification experiments described later on are based on this lay annotation.
  • 16. Inter-rater agreement: Lay vs Expert (sample) 3 September 2017LTA'17 - FedCSIS 2017 - Prague 16 • Interrater agreement To be taken with a grain of salt , but normally: • Poor agreement = 0.20 or less • Fair agreement = 0.20 to 0.40 • Moderate agreement = 0.40 to 0.60 • Good agreement = 0.60 to 0.80 • Very good agreement = 0.80 to 1.00 User group bias: annotation maby be biassed by the annotator’s domain expertise.
  • 17. Noise 3 September 2017LTA'17 - FedCSIS 2017 - Prague 17 • Apparently the web is full of automatically translated documents ! Do we need to sort them out? It depends! • Out of 801 web documents, 339 have received comments by the lay annotator, e.g. "Machine Translated" or "it is about animals and not about humans". • Is it important to remove this noise-ness for some LT tasks? Maybe not always • Computationally, the presence of noise might be irrelevant, so it might not be worth investing resource for cleaning a corpus
  • 18. Lay/Specialized Text Classification 3 September 2017LTA'17 - FedCSIS 2017 - Prague 18 Based on the annotation made by the lay annotator • Experiment 1: focus on scalability • Experiment 2: focus on the impact of noise on performance
  • 19. Features 3 September 2017LTA'17 - FedCSIS 2017 - Prague 19 No text pre-processing The texts were defined as “string” Filter converting strings to word vectors
  • 20. Experiment 1: Scalability 3 September 2017LTA'17 - FedCSIS 2017 - Prague 20
  • 21. Experiment 2: Resilience to Noise 3 September 2017LTA'17 - FedCSIS 2017 - Prague 21 • Noisy vs Cleaned
  • 22. New Experiment: Lay vs Specialized Annotation 3 September 2017LTA'17 - FedCSIS 2017 - Prague 22 • SVM (weka implementation: SMO) k Acc. Avg. P Avg. R Avg. F ROC A. Avg. TP Avg. FP SMO 0.49 78.6 0.78 0.78 0.78 0.74 0.78 0.29 801 documents labelled by the lay annotator k Acc. Avg. P Avg. R Avg. F ROC A. Avg. TP Avg. FP SMO 0.54 79.5 0,79 0,79 0,79 0,77 0,79 0,25 778 documents labelled by the expert annotator
  • 23. Conclusions: Findings 3 September 2017LTA'17 - FedCSIS 2017 - Prague 23 • We presented the construction of the the eCare corpus • We made two claims and we supported our claims with two experiments: 1. We can design an extensible/dynamic corpus without fearing scalability issues 2. We can use a noisy corpus to build (certain) LT applications without fearing bad performance
  • 24. Replicability 3 September 2017LTA'17 - FedCSIS 2017 - Prague 24 • We encourage the replication of the results presented in this paper, and welcome improvements and further discussion. • Corpus, scripts and the output of the classification models are available from the project website: http://ecareathome.se.
  • 25. Thanks for your attention ! 3 September 2017 LTA'17 - FedCSIS 2017 - Prague 25 Any Questions ?