SlideShare a Scribd company logo
1 of 11
Download to read offline
2018-06-08
1
Towards a quality assessment of web corpora
for language technology applications
Wiktor Strandqvist
RISE SICS East and Linköping University, Linköping,
Sweden
wikst813@student.liu.se
(co-authors: M. Santini, L. Lind, A. Jönsson)
Outline
• Introduction
• Purpose
• Domain-specific web corpora
• Two-step approach:
1. Extraction of term seeds from use cases, personas and scenarios
• Results
2. Bootstrapping and evaluation of domain-specific web corpora
• Results
• Conclusion and Future Work
2018-06-08
2
Introduction
• Research exists to assess the representativeness of general purpose web corpora
by comparing them to traditional corpora.
• Our focus and purpose :
• Creation and evaluation of domain-specific web corpora to be used to build Language
Technology real-world applications.
• We propose a two-step approach:
1. Automatic extraction and evaluation of term seeds from use cases, personas and scenarios;
2. Creation and validation of specialized and domain-specific web corpora bootstrapped with
term seeds automatically extracted in step 1.
Step 1: Term extraction from use cases, personas and scenarios
• Why using use cases, personas and scenarios, when available?
• They are based on numerous interviews and observations of real situations;
• Checked by domain experts who know how to correctly use terms in their own domain.
• Use cases, personas and scenario are a good starting point to automatize the manual process
(often arbitrary and tedious) to identify term seeds to bootstrap domain-specific web corpora.
• We focus on the medical terms that occur in use cases, personas and scenarios written in
English for the E-care@home project.
• Challenge: accurate term extractor from a relatively short text (a few dozen pages)
2018-06-08
3
Step 2: Corpus boostrapping and evaluation
• Bootstrap a web corpus using term seeds automatically extracted from use cases,
personas and scenarios.
• Automatically evaluate the ”quality” ot the bootstrapped domain-specific web
corpora.
Open issues and proposed answers
Q1: What is meant by “quality” of a web corpus?
A1: here “quality” means high density of medical terms (lay or specialized ) related to certain illnesses.
Q2: How can we assess the quality of a corpus automatically bootstrapped from the web?
A2: by using metrics that are well-established and easily replicable.
Q3: What if a bootstrapped web corpus contains documents that are NOT relevant to the target
domain?
A3: It depends. We can measure the domain-specificity of a corpus and assess whether it is
satisfactorily domain-specific or whether the corpus needs some amends before being used.
Q4: Can we measure the domain-specificity of a corpus?
A4: Yes, we use word frequency lists (without stopwords) and apply some statistical measures, see part
2 of this presentation.
2018-06-08
4
Word frequency lists: a compact corpus representation
• Our assumptions:
• ”Words are not selected at random” (Adam Kilgarriff)
• Word frequency lists (aka unigram lists) are a “compact representation of a corpus, lacking
much of the information in the corpus but small and easily tractable” (Adam Kilgarriff)
• We use frequency list of content words (i.e. after having applied stopword
removal) to evaluate the “quality” of the web corpora.
Part 1:
Term-Extraction from Use Cases, Personas, Scenarios
• Term candidate extraction
• Part-of-speech tagging (Standford tagger)
• Syntactic patterns
• Term validation
• Partial matching against a medical databse (Snomed CT)
• Ranking the terms based on DF/IDF
• Cutoff
• Seed generation
• Triples sampled from the same context
2018-06-08
5
Part 1:
Term-Extraction results
• Term candidate extraction
• Extraction recall: 81%
• Term validation
• Precision: 34.2%
• Recall: 71%
• F1: 46.2%
Part 2:
Evaluating domain-specific web corpora
• In this part of the presentation:
• We show that a corpus bootstrapped with automatically extracted term seeds from use
cases, personas and scenarios (Auto corpus) has the same ”quality” of a corpus boostrapped
with hand-picked seeds (Gold corpus).
• We show that both the Gold corpus and the Auto corpus have similar domain-specificity
(domainhood), and do not share any similarity with a general language web corpus, like
ukWac.
2018-06-08
6
The Web Corpora used in our experiments
• ukWaCsample (872 565 words): a random subset of ukWaC (general language
corpus)
• Gold (544 677 words): a web corpus collected with hand-picked seeds
• Auto (492 479 words) : a web corpus collected with automatically extracted seeds
Plotting normalized frequencies (wpm)
• ukWaCsample (872 565 words), Gold (544 677 words), Auto (492 479 words)
2018-06-08
7
Plotting ranks (top 1000 words)
• The ranks are based on the normalized frequencies (wpm)
Rank Correlation: Kendall
• Non-parametric Kendall Tau
2018-06-08
8
Rank Correlation: Spearman
• Non parametric Spearman Rho
Smoothing: 0.01
• We apply smoothing before calculating KL divergence and log-likelihood (LL-G2).
2018-06-08
9
KL divergence (aka relative entropy)
• R: entropy package, function KL.empirical()
• KL: ukWacSample vs Gold = 7.544118
• KL: ukWacSample vs Auto = 6.519677
• KL: Gold vs Auto = 1.843863
Log-likelihood (LL-G2)
• Corpus profiling: the larger the LL-G2 scores, the more significant the difference
between two corpora.
• The total LL-G2 scores for the three web corpora (top 1000-ranked words) are
• LL-G2 : ukWaCsample vs Gold = 453 441.6
• LL-G2 : ukWaCsample vs Auto = 393 705.9
• LL-G2 : Gold vs Auto: 114 694.2
2018-06-08
10
List of LL- G2 scores
From left to right: ukWaCsample vs Gold; ukWaCsample vs Auto; Gold vs Auto
For the individual LL scores, a G2score of 3.8415 or higher is significant at the level
of p < 0.05 and a G2 score of 10.8276 is significant at the level of p < 0.001
Discussion
• These simple measures based on word frequency lists give a clear indication of
the ”quality” of a bootstrapped domain-specific werb corpus:
• Rank correlation
• KL divergence
• Log-likelihood (LL-G2)
• These measures can be used to assess the corpus quality BEFORE the corpus is
used to build LT applications, thus avoiding bad surprises.
• If the values returned by the metrics are not satisfactory, a corpus can be
amended accordingly.
2018-06-08
11
Conclusion and Future Work
• It is possible to create a fairly accurate term extractor from a relatively short text
written by domain experts.
• It is possible to assess the quality and domain-specificity of web corpora by using
well-established metrics.
• Future work: expanding word frequency list (including bigram and trigrams) &
identifying more metrics that can help in the evaluation of the quality of the
corpora, such as burstiness ad perplexity.
Thank you for your attention!

More Related Content

What's hot

Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...Databricks
 
Programming with Semantic Broad Data
Programming with Semantic Broad DataProgramming with Semantic Broad Data
Programming with Semantic Broad DataSteffen Staab
 
(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contents(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contentsSteffen Staab
 
How to valuate and determine standard essential patents
How to valuate and determine standard essential patentsHow to valuate and determine standard essential patents
How to valuate and determine standard essential patentsMIPLM
 
II-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeII-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeDr. Haxel Consult
 
Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2
Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2
Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2TigerGraph
 
Hattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesHattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesJason Hattrick-Simpers
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Anubhav Jain
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...Angelo Salatino
 
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Databricks
 
II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...
II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...
II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...Dr. Haxel Consult
 
FAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the FutureFAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the Futuredgarijo
 
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...MOVING Project
 
EDF2012 Peter Boncz - LOD benchmarking SRbench
EDF2012   Peter Boncz - LOD benchmarking SRbenchEDF2012   Peter Boncz - LOD benchmarking SRbench
EDF2012 Peter Boncz - LOD benchmarking SRbenchEuropean Data Forum
 
Graph Gurus Episode 6: Community Detection
Graph Gurus Episode 6: Community DetectionGraph Gurus Episode 6: Community Detection
Graph Gurus Episode 6: Community DetectionTigerGraph
 
GLOBE Metadata Analysis
GLOBE Metadata AnalysisGLOBE Metadata Analysis
GLOBE Metadata AnalysisXavier Ochoa
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with ContextSteffen Staab
 
II-PIC 201: Product Presentation CAS / STN
II-PIC 201: Product Presentation CAS / STN II-PIC 201: Product Presentation CAS / STN
II-PIC 201: Product Presentation CAS / STN Dr. Haxel Consult
 

What's hot (20)

Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
 
Programming with Semantic Broad Data
Programming with Semantic Broad DataProgramming with Semantic Broad Data
Programming with Semantic Broad Data
 
(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contents(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contents
 
NAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITIONNAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITION
 
How to valuate and determine standard essential patents
How to valuate and determine standard essential patentsHow to valuate and determine standard essential patents
How to valuate and determine standard essential patents
 
II-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeII-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent Office
 
ML in materials discovery
ML in materials discovery ML in materials discovery
ML in materials discovery
 
Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2
Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2
Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2
 
Hattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesHattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop Slides
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
 
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
 
II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...
II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...
II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...
 
FAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the FutureFAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the Future
 
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
 
EDF2012 Peter Boncz - LOD benchmarking SRbench
EDF2012   Peter Boncz - LOD benchmarking SRbenchEDF2012   Peter Boncz - LOD benchmarking SRbench
EDF2012 Peter Boncz - LOD benchmarking SRbench
 
Graph Gurus Episode 6: Community Detection
Graph Gurus Episode 6: Community DetectionGraph Gurus Episode 6: Community Detection
Graph Gurus Episode 6: Community Detection
 
GLOBE Metadata Analysis
GLOBE Metadata AnalysisGLOBE Metadata Analysis
GLOBE Metadata Analysis
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with Context
 
II-PIC 201: Product Presentation CAS / STN
II-PIC 201: Product Presentation CAS / STN II-PIC 201: Product Presentation CAS / STN
II-PIC 201: Product Presentation CAS / STN
 

Similar to Towards a Quality Assessment of Web Corpora for Language Technology Applications

ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...eswcsummerschool
 
Made to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using ElasticsearchMade to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using ElasticsearchDaniel Schneiter
 
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...lucenerevolution
 
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformLucidworks (Archived)
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 CareerBuilder.com
 
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Spark Summit
 
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Hady Elsahar
 
Web-Based System for Software Requirements Quality Analysis Using Case-Based ...
Web-Based System for Software Requirements Quality Analysis Using Case-Based ...Web-Based System for Software Requirements Quality Analysis Using Case-Based ...
Web-Based System for Software Requirements Quality Analysis Using Case-Based ...IOSR Journals
 
Production Readiness Strategies in an Automated World
Production Readiness Strategies in an Automated WorldProduction Readiness Strategies in an Automated World
Production Readiness Strategies in an Automated WorldSean Chittenden
 
How experiments drive product growth at Viki
How experiments drive product growth at VikiHow experiments drive product growth at Viki
How experiments drive product growth at Vikiishanagrawal90
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Trey Grainger
 
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...Baden Hughes
 
Incremental Model Queries for Model-Dirven Software Engineering
Incremental Model Queries for Model-Dirven Software EngineeringIncremental Model Queries for Model-Dirven Software Engineering
Incremental Model Queries for Model-Dirven Software EngineeringÁkos Horváth
 
Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...HPCC Systems
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...Robert Grossman
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014nkabra
 
Where Should You Deliver Database Services From?
Where Should You Deliver Database Services From?Where Should You Deliver Database Services From?
Where Should You Deliver Database Services From?EDB
 
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformTrey Grainger
 
Webinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewWebinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewLucidworks
 

Similar to Towards a Quality Assessment of Web Corpora for Language Technology Applications (20)

ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
 
Made to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using ElasticsearchMade to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using Elasticsearch
 
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
 
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
 
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
 
Web-Based System for Software Requirements Quality Analysis Using Case-Based ...
Web-Based System for Software Requirements Quality Analysis Using Case-Based ...Web-Based System for Software Requirements Quality Analysis Using Case-Based ...
Web-Based System for Software Requirements Quality Analysis Using Case-Based ...
 
Production Readiness Strategies in an Automated World
Production Readiness Strategies in an Automated WorldProduction Readiness Strategies in an Automated World
Production Readiness Strategies in an Automated World
 
Yuvaraj
YuvarajYuvaraj
Yuvaraj
 
How experiments drive product growth at Viki
How experiments drive product growth at VikiHow experiments drive product growth at Viki
How experiments drive product growth at Viki
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...
 
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
 
Incremental Model Queries for Model-Dirven Software Engineering
Incremental Model Queries for Model-Dirven Software EngineeringIncremental Model Queries for Model-Dirven Software Engineering
Incremental Model Queries for Model-Dirven Software Engineering
 
Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014
 
Where Should You Deliver Database Services From?
Where Should You Deliver Database Services From?Where Should You Deliver Database Services From?
Where Should You Deliver Database Services From?
 
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
 
Webinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewWebinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's New
 

More from Marina Santini

An Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesAn Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesMarina Santini
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word CloudsMarina Santini
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebMarina Santini
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: SummarizationMarina Santini
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question AnsweringMarina Santini
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)Marina Santini
 
Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Marina Santini
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationMarina Santini
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role LabelingMarina Santini
 
Semantics and Computational Semantics
Semantics and Computational SemanticsSemantics and Computational Semantics
Semantics and Computational SemanticsMarina Santini
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Marina Santini
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Marina Santini
 
Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Marina Santini
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioMarina Santini
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part) Marina Santini
 
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationMarina Santini
 
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Marina Santini
 

More from Marina Santini (20)

An Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesAn Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability Features
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic Web
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
Relation Extraction
Relation ExtractionRelation Extraction
Relation Extraction
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question Answering
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
Lecture: Word Senses
Lecture: Word SensesLecture: Word Senses
Lecture: Word Senses
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role Labeling
 
Semantics and Computational Semantics
Semantics and Computational SemanticsSemantics and Computational Semantics
Semantics and Computational Semantics
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1)
 
Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Lecture 5: Interval Estimation
Lecture 5: Interval Estimation
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)
 
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
 
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)
 

Recently uploaded

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaWSO2
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringWSO2
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformWSO2
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data SciencePaolo Missier
 

Recently uploaded (20)

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software Engineering
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 

Towards a Quality Assessment of Web Corpora for Language Technology Applications

  • 1. 2018-06-08 1 Towards a quality assessment of web corpora for language technology applications Wiktor Strandqvist RISE SICS East and Linköping University, Linköping, Sweden wikst813@student.liu.se (co-authors: M. Santini, L. Lind, A. Jönsson) Outline • Introduction • Purpose • Domain-specific web corpora • Two-step approach: 1. Extraction of term seeds from use cases, personas and scenarios • Results 2. Bootstrapping and evaluation of domain-specific web corpora • Results • Conclusion and Future Work
  • 2. 2018-06-08 2 Introduction • Research exists to assess the representativeness of general purpose web corpora by comparing them to traditional corpora. • Our focus and purpose : • Creation and evaluation of domain-specific web corpora to be used to build Language Technology real-world applications. • We propose a two-step approach: 1. Automatic extraction and evaluation of term seeds from use cases, personas and scenarios; 2. Creation and validation of specialized and domain-specific web corpora bootstrapped with term seeds automatically extracted in step 1. Step 1: Term extraction from use cases, personas and scenarios • Why using use cases, personas and scenarios, when available? • They are based on numerous interviews and observations of real situations; • Checked by domain experts who know how to correctly use terms in their own domain. • Use cases, personas and scenario are a good starting point to automatize the manual process (often arbitrary and tedious) to identify term seeds to bootstrap domain-specific web corpora. • We focus on the medical terms that occur in use cases, personas and scenarios written in English for the E-care@home project. • Challenge: accurate term extractor from a relatively short text (a few dozen pages)
  • 3. 2018-06-08 3 Step 2: Corpus boostrapping and evaluation • Bootstrap a web corpus using term seeds automatically extracted from use cases, personas and scenarios. • Automatically evaluate the ”quality” ot the bootstrapped domain-specific web corpora. Open issues and proposed answers Q1: What is meant by “quality” of a web corpus? A1: here “quality” means high density of medical terms (lay or specialized ) related to certain illnesses. Q2: How can we assess the quality of a corpus automatically bootstrapped from the web? A2: by using metrics that are well-established and easily replicable. Q3: What if a bootstrapped web corpus contains documents that are NOT relevant to the target domain? A3: It depends. We can measure the domain-specificity of a corpus and assess whether it is satisfactorily domain-specific or whether the corpus needs some amends before being used. Q4: Can we measure the domain-specificity of a corpus? A4: Yes, we use word frequency lists (without stopwords) and apply some statistical measures, see part 2 of this presentation.
  • 4. 2018-06-08 4 Word frequency lists: a compact corpus representation • Our assumptions: • ”Words are not selected at random” (Adam Kilgarriff) • Word frequency lists (aka unigram lists) are a “compact representation of a corpus, lacking much of the information in the corpus but small and easily tractable” (Adam Kilgarriff) • We use frequency list of content words (i.e. after having applied stopword removal) to evaluate the “quality” of the web corpora. Part 1: Term-Extraction from Use Cases, Personas, Scenarios • Term candidate extraction • Part-of-speech tagging (Standford tagger) • Syntactic patterns • Term validation • Partial matching against a medical databse (Snomed CT) • Ranking the terms based on DF/IDF • Cutoff • Seed generation • Triples sampled from the same context
  • 5. 2018-06-08 5 Part 1: Term-Extraction results • Term candidate extraction • Extraction recall: 81% • Term validation • Precision: 34.2% • Recall: 71% • F1: 46.2% Part 2: Evaluating domain-specific web corpora • In this part of the presentation: • We show that a corpus bootstrapped with automatically extracted term seeds from use cases, personas and scenarios (Auto corpus) has the same ”quality” of a corpus boostrapped with hand-picked seeds (Gold corpus). • We show that both the Gold corpus and the Auto corpus have similar domain-specificity (domainhood), and do not share any similarity with a general language web corpus, like ukWac.
  • 6. 2018-06-08 6 The Web Corpora used in our experiments • ukWaCsample (872 565 words): a random subset of ukWaC (general language corpus) • Gold (544 677 words): a web corpus collected with hand-picked seeds • Auto (492 479 words) : a web corpus collected with automatically extracted seeds Plotting normalized frequencies (wpm) • ukWaCsample (872 565 words), Gold (544 677 words), Auto (492 479 words)
  • 7. 2018-06-08 7 Plotting ranks (top 1000 words) • The ranks are based on the normalized frequencies (wpm) Rank Correlation: Kendall • Non-parametric Kendall Tau
  • 8. 2018-06-08 8 Rank Correlation: Spearman • Non parametric Spearman Rho Smoothing: 0.01 • We apply smoothing before calculating KL divergence and log-likelihood (LL-G2).
  • 9. 2018-06-08 9 KL divergence (aka relative entropy) • R: entropy package, function KL.empirical() • KL: ukWacSample vs Gold = 7.544118 • KL: ukWacSample vs Auto = 6.519677 • KL: Gold vs Auto = 1.843863 Log-likelihood (LL-G2) • Corpus profiling: the larger the LL-G2 scores, the more significant the difference between two corpora. • The total LL-G2 scores for the three web corpora (top 1000-ranked words) are • LL-G2 : ukWaCsample vs Gold = 453 441.6 • LL-G2 : ukWaCsample vs Auto = 393 705.9 • LL-G2 : Gold vs Auto: 114 694.2
  • 10. 2018-06-08 10 List of LL- G2 scores From left to right: ukWaCsample vs Gold; ukWaCsample vs Auto; Gold vs Auto For the individual LL scores, a G2score of 3.8415 or higher is significant at the level of p < 0.05 and a G2 score of 10.8276 is significant at the level of p < 0.001 Discussion • These simple measures based on word frequency lists give a clear indication of the ”quality” of a bootstrapped domain-specific werb corpus: • Rank correlation • KL divergence • Log-likelihood (LL-G2) • These measures can be used to assess the corpus quality BEFORE the corpus is used to build LT applications, thus avoiding bad surprises. • If the values returned by the metrics are not satisfactory, a corpus can be amended accordingly.
  • 11. 2018-06-08 11 Conclusion and Future Work • It is possible to create a fairly accurate term extractor from a relatively short text written by domain experts. • It is possible to assess the quality and domain-specificity of web corpora by using well-established metrics. • Future work: expanding word frequency list (including bigram and trigrams) & identifying more metrics that can help in the evaluation of the quality of the corpora, such as burstiness ad perplexity. Thank you for your attention!