Domain Classification
WSD systems
Rubén Izquierdo
Outline
● Domain classification
– System
– Evaluation
● Word Sense Disambiguation
– Systems
● timbl-DSC
● svm-DSC
● ukb-DSC
– Evaluation
● Fold-cross
● Random
● All-words
Domain classifier
● Automatic system to assign domain labels to texts
● 37 domains created by grouping WordNet Domains
– Biol -> anatomy, biology, botany, ecology,
entomology, genetics, zoology and physiology
● Support vector machines (SVMLight, Joachims 1998)
– One binary classifier per domain
● Features:
– Bag-of-words approach (binary features)
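A minimal sketch of the binary bag-of-words representation mentioned above, in plain Python. The whitespace tokenizer, the toy texts and the function names are assumptions for illustration; the slides do not specify how texts were tokenized or how the vocabulary was built.

```python
def build_vocabulary(texts):
    """Map each distinct token to a feature index (toy whitespace tokenizer)."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def binary_bow(text, vocab):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the text."""
    present = set(text.lower().split())
    # dicts keep insertion order, so positions match the feature indices
    return [1 if word in present else 0 for word in vocab]


# Hypothetical training texts, only used to fix the vocabulary.
train_texts = ["the cell divides", "the ball hits the net"]
vocab = build_vocabulary(train_texts)
vector = binary_bow("the net the net", vocab)
```

Repeated words do not change the vector: a feature is 1 when the word is present, whatever its count. One such vector per text would then be fed to each per-domain binary classifier.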
Domain classifier
● Training data:
– Synonyms and definitions from Cornetto synsets
tagged with domains
● Evaluation test sets:
– Random_set: 143 paragraphs randomly selected
from the 1st and 2nd release of SONAR
– Random_genre_set: 170 paragraphs, where we took
randomly a few from each genre in the 1st and
2nd release of SONAR
– Manually annotated with domains
Domain classifier
● A paragraph is counted as correct (ok) if:
– At least one of its annotated domains is returned by the
classifier among the top 5 scoring domains
● Accuracy = ok / (ok + wrong)
– Random_set 84.62 %
– Random_genre_set 79.88 %
● All SONAR paragraphs have been automatically labelled
with their domains
– 9.4 M paragraphs annotated in SONAR
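The top-5 scoring rule and the accuracy formula above can be sketched as follows. The data layout (a set of gold domains plus a dict of classifier scores per paragraph), the function names and the toy scores are assumptions for illustration, not the project's evaluation script.

```python
def top_k_hit(gold_domains, scored_domains, k=5):
    """True if any gold domain is among the k highest-scoring domains."""
    ranked = sorted(scored_domains, key=scored_domains.get, reverse=True)
    return any(domain in gold_domains for domain in ranked[:k])


def accuracy(paragraphs, k=5):
    """ok / (ok + wrong) over (gold_domains, scored_domains) pairs, in percent."""
    ok = sum(top_k_hit(gold, scores, k) for gold, scores in paragraphs)
    return 100.0 * ok / len(paragraphs)


# Hypothetical classifier scores for two manually annotated paragraphs.
paragraphs = [
    ({"sport"}, {"sport": 1.2, "biol": -0.3, "law": -0.9}),
    ({"law"}, {"sport": 0.8, "biol": 0.1, "law": -0.5}),
]
acc_top1 = accuracy(paragraphs, k=1)
acc_top5 = accuracy(paragraphs, k=5)
```

With only the single best domain (k=1) the second paragraph is a miss; widening the window to the top 5 turns it into a hit, which shows how lenient the top-5 criterion is.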
WSD systems: timbl-DSC
● Based on TiMBL, a supervised k-nearest neighbor classifier
(Daelemans et al., 2007)
● One classifier per word (multi-class classification)
● Memory-based learning
– All training instances are stored along with their
associated senses
– To tag a new example:
● Find the 'k' most similar examples in the
stored model
● Return the majority sense of these 'k'
examples
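The tagging steps above can be sketched in a few lines of plain Python. This is not TiMBL's API: a simple overlap distance (number of mismatching feature values) stands in for the metrics and weighting schemes TiMBL offers, and the stored instances for the Dutch noun "paard" are invented for illustration.

```python
from collections import Counter


def overlap_distance(a, b):
    """Count the feature positions where two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_sense(memory, features, k=3):
    """Return the majority sense among the k stored instances closest to `features`."""
    neighbours = sorted(memory, key=lambda item: overlap_distance(item[0], features))[:k]
    votes = Counter(sense for _, sense in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical stored instances: (context features, sense).
memory = [
    (("schaken", "bord", "zet"), "chess_piece"),
    (("schaken", "toren", "zet"), "chess_piece"),
    (("wei", "gras", "ruiter"), "animal"),
    (("stal", "gras", "ruiter"), "animal"),
    (("stal", "hooi", "ruiter"), "animal"),
]
predicted = knn_sense(memory, ("schaken", "bord", "loper"), k=3)
```

Nothing is learned at training time beyond storing the instances; all the work happens when a new example is compared against the memory.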
WSD systems: timbl-DSC
● Features
– Local context: words, lemmas, PoS in context
– Global context: filtered bag-of-words (min 5, 0.8
relative frequency)
– Domain information:
● SONAR category
● Domain labels
● TiMBL parameters
– Value for 'k', algorithm and feature metric, weighting
scheme...
– Optimization per classifier: leave-one-out
WSD systems: svm-DSC
● Based on Support Vector Machines (SVMLight, Joachims
1998)
● Supervised binary linear classifier
– Represent all training instances in an n-dimensional
space (simplest case: 2D)
– Learn a line that separates both sets of examples
– Maximize the margin of separation of the line with
the two groups of examples
– To classify a new instance:
● Represent it in the 2D space and see on
which side of the line it falls
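The classification step above amounts to checking the sign of a linear function. The sketch below assumes a hand-picked 2D line (weights and bias are illustrative, not learned by SVMLight) and leaves out training, i.e. the margin maximization itself.

```python
def decision_value(weights, bias, x):
    """Signed score w.x + b; its sign says on which side of the line x lies."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(weights, bias, x):
    """Return +1 or -1 depending on the side of the separating line."""
    return 1 if decision_value(weights, bias, x) >= 0 else -1


# Hypothetical 2D model: the line x1 + x2 - 3 = 0.
weights, bias = (1.0, 1.0), -3.0
side_a = classify(weights, bias, (3.0, 2.0))  # point above the line
side_b = classify(weights, bias, (0.5, 1.0))  # point below the line
```

The magnitude of the decision value grows with the distance to the line, so it can also serve as a rough confidence for the prediction.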
WSD systems: svm-DSC
● One classifier per word
– SVMLight is binary in principle
– One-vs-all: one binary classifier per word sense
● Positive examples of the sense
● Negative examples of the rest of the senses
● Features:
– Bag of words
– Filtering by relative frequency per classifier
● Default svm parameters mostly used in WSD systems
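A minimal sketch of the one-vs-all setup described above. The slides do not say how the outputs of the per-sense binary classifiers are combined; picking the sense with the highest decision value is a common choice and is an assumption here, as are the toy linear models and sense names.

```python
def one_vs_all_predict(models, x):
    """Pick the sense whose binary classifier gives the highest decision value."""
    def score(model):
        weights, bias = model
        return sum(w * xi for w, xi in zip(weights, x)) + bias

    return max(models, key=lambda sense: score(models[sense]))


# Hypothetical per-sense linear models (weights, bias) over 3 bag-of-words features.
models = {
    "sense_1": ((1.0, -1.0, 0.0), 0.0),
    "sense_2": ((-1.0, 1.0, 0.5), 0.0),
    "sense_3": ((0.0, 0.0, -1.0), 0.2),
}
best = one_vs_all_predict(models, (0, 1, 1))
```

Each model only has to separate its own sense from all the others, which is how a binary learner can cover a word with more than two senses.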
WSD systems: ukb-DSC
● Knowledge-based system (unsupervised) (Agirre and Soroa
2009)
● WordNet (Cornetto) is considered as a graph where:
– Synsets: nodes
– Relations: edges
● Personalized PageRank algorithm
– Modification of PageRank
– Context words act as source nodes injecting mass
into word senses
– Assign stronger probabilities to certain nodes
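The idea above can be sketched as a power iteration in which the teleport mass goes only to the source nodes. This is a simplified illustration, not UKB's implementation: the toy graph, the node names and the choice of treating the synsets of the context words directly as sources are assumptions, and every node is assumed to have at least one neighbour.

```python
def personalized_pagerank(graph, sources, damping=0.85, iterations=50):
    """Power iteration where teleport mass is spread over the source nodes only.

    graph: dict node -> list of neighbours (undirected edges listed both ways).
    sources: nodes that inject mass, standing in for the context words.
    """
    nodes = list(graph)
    teleport = {n: (1.0 / len(sources) if n in sources else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            # each node passes its damped rank on in equal shares
            share = damping * rank[node] / len(graph[node])
            for neighbour in graph[node]:
                new_rank[neighbour] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: two senses of "paard" and a few related synsets.
graph = {
    "paard#animal": ["stal", "ruiter"],
    "paard#chess": ["schaakbord", "loper"],
    "stal": ["paard#animal"],
    "ruiter": ["paard#animal"],
    "schaakbord": ["paard#chess", "loper"],
    "loper": ["paard#chess", "schaakbord"],
}
rank = personalized_pagerank(graph, sources={"schaakbord", "loper"})
best_sense = max(["paard#animal", "paard#chess"], key=rank.get)
```

Because the mass is injected only at the chess-related nodes, the chess sense ends up with a higher rank than the animal sense, which is the disambiguation decision.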
WSD systems: ukb-DSC
● Dutch WordNet
● English WordNet
● Dutch WordNet ==> English WordNet
● WordNet Domain
– tennis player, tennis ball => tennis => SPORT
– football player, football => soccer => SPORT
● Annotation co-occurrence relations
– Polysemous => monosemous
– Polysemous => polysemous
WSD Systems
● Three systems
– 2 supervised systems
● timbl-DSC
● svm-DSC
– 1 unsupervised system
● ukb-DSC
● One super-system combining the 3 systems
– Majority voting
– We have tried different weights for each system
(to decide in case of a tie)
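A minimal sketch of weighted majority voting as described above. The weights 1 - 1.5 - 1 are one of the configurations reported in the evaluation slides; the function name and the sense labels are assumptions for illustration.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Sum each system's weight on the sense it predicts; return the top sense."""
    totals = defaultdict(float)
    for system, sense in predictions.items():
        totals[sense] += weights[system]
    return max(totals, key=totals.get)


weights = {"timbl-DSC": 1.0, "svm-DSC": 1.5, "ukb-DSC": 1.0}

# Two systems agree: the majority wins despite the extra weight on svm-DSC.
majority = weighted_vote(
    {"timbl-DSC": "sense_a", "svm-DSC": "sense_b", "ukb-DSC": "sense_a"}, weights
)
# All three disagree: the heavier system breaks the tie.
tie_break = weighted_vote(
    {"timbl-DSC": "sense_a", "svm-DSC": "sense_b", "ukb-DSC": "sense_c"}, weights
)
```

With a weight of 1.5 against 1 + 1, the favoured system never overrules two agreeing systems; it only decides when all three disagree.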
WSD. Evaluation
● We have a huge amount of evaluation results
– Three systems (and combination) with different
configurations for each
– Three types of evaluation
– Separate results for nouns, verbs and adjectives
– Results for systems, lemmas and word meanings
– For senses (lexical units), sense-groups and base
concepts
● All the results and evaluation data are available on the
website
● In this presentation: best overall results for senses
WSD. Evaluation
● Three different evaluations (each one with a specific goal)
– Fold cross validation
● To get the best sense tagger on SONAR, to
fulfill the main goal of the project
– Random evaluation on SONAR
● To estimate the accuracy of the sense tagger
over the rest of SONAR
– All words evaluation
● To analyze the performance of our SONAR-oriented
WSD system on fully independent texts
WSD. FC Evaluation
● Token accuracy for systems. Senses
– Using manually annotated data of the AL process
System     Configuration                                   Nouns  Verbs  Adjectives
timbl-DSC  No domain features                              83.97  83.44  78.64
timbl-DSC  Domain features                                 81.60  81.21  76.28
svm-DSC    No domain features                              81.17  84.19  77.88
svm-DSC    Domain features                                 82.69  84.93  79.03
ukb-DSC    UKB4f (all relations, 1.7M relations)           73.04  55.84  56.36
ukb-DSC    UKB5d (no singletons, 1.1M relations)           51.29  37.52  37.78
ukb-DSC    UKB1 (Cornetto + domain relations, 138,427)     47.03  30.61  35.36
WSD. FC Evaluation
● Token accuracy for the combination of the systems. Senses
– Weight per system; "-" means the system is not used
PoS         timbl-DSC  svm-DSC  ukb-DSC  Token accuracy
Nouns       1          -        -        83.97
Nouns       1.5        1        1        88.53
Nouns       1          1        1.5      88.65
Verbs       -          1        -        84.93
Verbs       1          1.5      1        87.60
Adjectives  -          1        -        79.03
Adjectives  1          1.5      1        82.97
Adjectives  1.5        1        1        83.06
WSD. Random Evaluation
● Token accuracy for the random evaluation. Senses
– 5 nouns, 5 verbs and 3 adjectives selected from these accuracy ranges:
● Between 90 and 100 (not with acc=100)
● Between 80 and 90
● Between 70 and 80
● Between 60 and 70
System                             Nouns  Verbs  Adjectives
timbl-DSC                          54.25  48.25  46.50
svm-DSC                            64.10  52.20  52.00
ukb-DSC                            49.37  44.15  38.13
Combination (weights 1 - 1.5 - 1)  66.92  60.55  55.11
WSD. All words
● Token_id not available as a feature
– Decrease of around 8 points if it is not used in FC validation
<w xml:id="WR-P-P-G-0000148955.p.28.s.3.w.8"><t>paard</t>...
--> Sense of chess piece
....
<w xml:id="WR-P-P-G-0000148955.p.30.s.3.w.5"><t>paard</t>
--> Sense ??? Maybe chess piece??
System                             Nouns  Verbs  Adjectives
timbl-DSC                          55.76  37.96  49.09
svm-DSC                            64.58  45.81  55.70
ukb-DSC                            56.81  31.37  35.93
Combination (weights 1 - 1.5 - 1)  66.09  45.68  52.24
WSD. Overall Evaluation
Evaluation             Nouns  Verbs  Adjectives
Fold cross validation  88.65  87.60  83.06
Random evaluation      66.92  60.55  55.11
All words evaluation   66.09  45.68  52.24
Overall results for systems, considering all the lemmas
WSD. Overall Evaluation
● Distribution of words by performance for the combined
system in the fold-cross validation
– Performance for each lemma
Range         Nouns    Verbs    Adjectives  All
P >= 80       87.91 %  91.03 %  69.55 %     82.83 %
70 <= P < 80   9.86 %   7.35 %  22.69 %     13.30 %
60 <= P < 70   2.08 %   1.18 %   6.57 %      3.28 %
P < 60         0.15 %   0.45 %   1.19 %      0.60 %
Sense tagging of all SONAR
● We apply our three systems to all unannotated SONAR (for
the DSC selected words)
● All automatic annotations come with a confidence value
– You can select the best ones
Number of tokens automatically annotated:
System     Nouns   Verbs   Adjectives
timbl-DSC  18.5 M  23.9 M  5.3 M
svm-DSC    18.5 M  23.9 M  5.3 M
ukb-DSC    18.9 M  24.1 M  5.4 M
Sense tagging of all SONAR
Number of tokens automatically annotated with a
confidence >= 0.8
Around 29 M tokens with a confidence >= 0.8
System                  Nouns         Verbs         Adjectives
timbl-DSC, all          18.5 M        23.9 M        5.3 M
timbl-DSC, conf >= 0.8  10.8 M (58%)  15.6 M (64%)  2.6 M (50%)
Dank je wel !
Thanks !
Gracias !

DutchSemCor workshop: Domain classification and WSD systems

  • 2. Outline ● Domain classification – System – Evaluation ● Word Sense Disambiguation – Systems ● timbl-DSC ● svm-DSC ● ukb-DSC – Evaluation ● Fold-cross ● Random ● All-words
  • 3. Domain classifier ● Automatic system to assign domain labels to texts ● 37 domains created by grouping WordNet Domains – Biol -> anatomy, biology, botany, ecology, entomology, genetics, zoology and physiology ● Support vector machines (SVMLight, Joachims 1998) – One binary classifier per domain ● Features: – Bag-of-words approach (binary features)
  • 4. Domain classifier ● Training data: – Synonyms and definitions from Cornetto synsets tagged with domains ● Evaluation test sets: – Random_set: 143 paragraphs randomly selected from the 1st and 2nd release of SONAR – Random_genre_set: 170 paragraphs, a few taken at random from each genre in the 1st and 2nd release of SONAR – Manually annotated with domains
  • 5. Domain classifier ● A paragraph is considered correct if: – At least one of its related domains is returned by the classifier within the top 5 scoring domains ● Accuracy, ok / (ok+wrong) – Random_set 84.62 % – Random_genre_set 79.88 % ● All SONAR paragraphs have been automatically assigned their domains – 9.4 M paragraphs in SONAR annotated
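The scoring rule on this slide can be sketched as follows. This is a minimal illustration, assuming each paragraph comes with one classifier score per domain and a set of manually assigned domains; the function names, domain labels and scores are hypothetical, not taken from the DutchSemCor code.

```python
def top_k_hit(scores, gold_domains, k=5):
    """True if any gold domain is among the k highest-scoring domains."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return any(domain in gold_domains for domain in ranked)


def accuracy(examples, k=5):
    """Accuracy as ok / (ok + wrong) over (scores, gold_domains) pairs."""
    ok = sum(top_k_hit(scores, gold, k) for scores, gold in examples)
    return ok / len(examples)


# Toy data (invented): one score per domain, as produced by
# one binary classifier per domain.
examples = [
    ({"sport": 0.9, "biol": 0.1, "law": -0.3}, {"sport"}),
    ({"sport": -0.5, "biol": 0.2, "law": 0.7}, {"sport"}),
]

acc_top2 = accuracy(examples, k=2)  # second paragraph misses: 0.5
acc_top5 = accuracy(examples, k=5)  # only 3 domains exist, so both hit: 1.0
```

A larger `k` makes the criterion more lenient, which is why the cut-off matters when reading the reported accuracies.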
  • 6. WSD systems: timbl-DSC ● Based on TiMBL, supervised k-nearest neighbor classifier (Daelemans et al., 2007) ● One classifier per word (multi-class classification) ● Memory-based learning – All training instances are stored along with their associated senses – To tag a new example: ● Find the 'k' most similar examples in the stored model ● Return the majority sense of these 'k' examples
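The memory-based tagging step can be sketched in a few lines. This is a simplified illustration, not TiMBL itself: it assumes a plain feature-overlap similarity and an unweighted majority vote, whereas TiMBL offers richer metrics and weighting schemes. The feature layout (previous word, next word, PoS) and the sense labels are invented for the example.

```python
from collections import Counter


def overlap(a, b):
    """Number of feature positions on which two instances agree."""
    return sum(x == y for x, y in zip(a, b))


def knn_sense(memory, features, k=3):
    """Return the majority sense among the k most similar stored instances."""
    neighbours = sorted(
        memory, key=lambda inst: overlap(inst[0], features), reverse=True
    )[:k]
    votes = Counter(sense for _, sense in neighbours)
    return votes.most_common(1)[0][0]


# Stored training instances for the word "paard": (features, sense).
memory = [
    (("zwarte", "slaat", "N"), "chess_piece"),
    (("witte", "slaat", "N"), "chess_piece"),
    (("bruine", "galoppeert", "N"), "animal"),
    (("jonge", "draaft", "N"), "animal"),
]

predicted = knn_sense(memory, ("witte", "slaat", "N"), k=3)
```

With `k=3` the two chess instances outvote the single animal instance that makes it into the neighbourhood.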
  • 7. WSD systems: timbl-DSC ● Features – Local context: words, lemmas, PoS in context – Global context: filtered bag-of-words (min 5, 0.8 relative frequency) – Domain information: ● SONAR category ● Domain labels ● TiMBL parameters – Value for 'k', algorithm and feature metric, weighting scheme... – Optimization per classifier: leave-one-out
  • 8. WSD systems: svm-DSC ● Based on Support Vector Machines (SVMLight, Joachims 1998) ● Supervised binary linear classifier – Represent all training instances in an n-dimensional space (simplest case: 2D) – Learn a line that separates both sets of examples – Maximize the margin of separation between the line and the two groups of examples – To classify a new instance: ● Represent it in the 2D space and see on which side of the line it falls
  • 9. WSD systems: svm-DSC ● One classifier per word – SVMLight is binary in principle – One-vs-all: one binary classifier per word sense ● Positive examples of the sense ● Negative examples of the rest of senses ● Features: – Bag of words – Filtering by relative frequency per classifier ● Default SVM parameters mostly used in WSD systems
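The one-vs-all setup can be sketched as below. The sketch only shows how the binary training labels are built per sense and how the per-sense linear scores (w·x + b over binary bag-of-words features) are compared at prediction time. The weights are hand-picked placeholders, not values learned by SVMLight, and all names are hypothetical.

```python
def one_vs_all_labels(senses, target):
    """+1 for examples of the target sense, -1 for all other senses."""
    return [1 if sense == target else -1 for sense in senses]


def linear_score(weights, bias, features):
    """Signed score w.x + b for a set of binary bag-of-words features."""
    return sum(weights.get(f, 0.0) for f in features) + bias


def predict_sense(models, features):
    """Pick the sense whose binary classifier gives the highest score."""
    return max(models, key=lambda s: linear_score(*models[s], features))


# One (weights, bias) pair per sense of "paard"; values are invented.
models = {
    "animal": ({"stal": 1.2, "ruiter": 0.8}, -0.5),
    "chess_piece": ({"schaakbord": 1.5, "loper": 0.9}, -0.5),
}

labels = one_vs_all_labels(["animal", "chess_piece", "animal"], "animal")
predicted = predict_sense(models, {"schaakbord", "ruiter"})
```

Taking the highest-scoring classifier is one common way to resolve one-vs-all outputs; the slides do not say which rule svm-DSC uses.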
  • 10. WSD systems: ukb-DSC ● Knowledge-based system (unsupervised) (Agirre and Soroa 2009) ● WordNet (Cornetto) is considered as a graph where: – Synsets: nodes – Relations: edges ● Personalized PageRank algorithm – Modification of PageRank – Context words act as source nodes injecting mass into word senses – Assign stronger probabilities to certain nodes
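The idea of context words injecting mass into the graph can be illustrated with a small power-iteration version of Personalized PageRank. This is a toy sketch under stated assumptions, not UKB's implementation: the graph, node names and damping value are invented, and every node is assumed to have at least one outgoing edge.

```python
def personalized_pagerank(graph, sources, damping=0.85, iterations=50):
    """Power iteration where the teleport mass goes only to source nodes."""
    nodes = list(graph)
    teleport = {n: (1.0 / len(sources) if n in sources else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # Each node first receives its share of the teleport mass...
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        # ...then the remaining mass flows along the edges.
        for n in nodes:
            share = damping * rank[n] / len(graph[n])
            for neighbour in graph[n]:
                new_rank[neighbour] += share
        rank = new_rank
    return rank


# Toy graph: "w:" nodes are context words, the rest are synsets.
graph = {
    "w:schaakbord": ["chessboard"],
    "chessboard": ["w:schaakbord", "paard#chess", "chess"],
    "chess": ["chessboard", "paard#chess"],
    "paard#chess": ["chessboard", "chess"],
    "paard#animal": ["stable"],
    "stable": ["paard#animal"],
}

rank = personalized_pagerank(graph, sources={"w:schaakbord"})
best = max(["paard#chess", "paard#animal"], key=rank.get)
```

Because all mass enters through the context word, the sense connected to it accumulates rank while the unrelated sense stays at zero.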
  • 11. WSD systems: ukb-DSC ● Dutch WordNet ● English WordNet ● Dutch WordNet ==> English WordNet ● WordNet Domain – tennis player, tennis ball => tennis => SPORT – football player, football => soccer => SPORT ● Annotation co-occurrence relations – Polysemous => monosemous – Polysemous => polysemous
  • 12. WSD Systems ● Three systems – 2 supervised systems ● timbl-DSC ● svm-DSC – 1 unsupervised system ● ukb-DSC ● One super-system combining the 3 systems – Majority voting – We have tried different weights for each system (decide in case of tie)
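Weighted majority voting over the three systems can be sketched as follows. The function name is hypothetical, and the weights (1, 1.5, 1) are simply one of the settings reported later in the deck, used here as an example.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Sum the weight of every system behind each sense; return the heaviest."""
    totals = defaultdict(float)
    for system, sense in predictions.items():
        totals[sense] += weights[system]
    return max(totals, key=totals.get)


weights = {"timbl-DSC": 1.0, "svm-DSC": 1.5, "ukb-DSC": 1.0}

# All three systems disagree: the extra weight of svm-DSC breaks the tie.
tie_case = weighted_vote(
    {"timbl-DSC": "s1", "svm-DSC": "s2", "ukb-DSC": "s3"}, weights
)

# Two systems agree: their combined weight (2.0) beats svm-DSC alone (1.5).
majority_case = weighted_vote(
    {"timbl-DSC": "s1", "svm-DSC": "s2", "ukb-DSC": "s1"}, weights
)
```

With these weights the extra 0.5 only matters when the three systems all disagree, which matches the "decide in case of tie" note on the slide.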
  • 13. WSD. Evaluation ● We have a huge amount of evaluation results – Three systems (and combination) with different configurations for each – Three types of evaluation – Separate results for nouns, verbs and adjectives – Results for systems, lemmas and word meanings – For senses (lexical units), sense-groups and base concepts ● All the results and evaluation data are available on the website ● In this presentation: best overall results for senses
  • 14. WSD. Evaluation ● Three different evaluations (each one with a specific goal) – Fold cross validation ● To get the best sense tagger on SONAR, to fulfill the main goal of the project – Random evaluation on SONAR ● To estimate the accuracy of the sense tagger over the rest of SONAR – All words evaluation ● To analyze the performance of our SONAR-oriented WSD system on totally independent texts
  • 15. WSD. FC Evaluation ● Token accuracy for systems. Senses – Using manually annotated data of the AL process
      System     Configuration                                  Nouns  Verbs  Adjectives
      timbl-DSC  No domain feats.                               83.97  83.44  78.64
      timbl-DSC  Domain features                                81.60  81.21  76.28
      svm-DSC    No domain feats.                               81.17  84.19  77.88
      svm-DSC    Domain features                                82.69  84.93  79.03
      ukb-DSC    UKB4f (all relations, 1.7 M relations)         73.04  55.84  56.36
      ukb-DSC    UKB5d (no singletons, 1.1 M relations)         51.29  37.52  37.78
      ukb-DSC    UKB1 (Cornetto + domain relations, 138,427)    47.03  30.61  35.36
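  • 16. WSD. FC Evaluation ● Token accuracy for the combination of the systems. Senses (system columns give the voting weight; a row with a single weight is the best single system)
      PoS         timbl-DSC  svm-DSC  ukb-DSC  Token accuracy
      Nouns       1          -        -        83.97
      Nouns       1.5        1        1        88.53
      Nouns       1          1        1.5      88.65
      Verbs       -          1        -        84.93
      Verbs       1          1.5      1        87.60
      Adjectives  -          1        -        79.03
      Adjectives  1          1.5      1        82.97
      Adjectives  1.5        1        1        83.06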
  • 17. WSD. Random Evaluation ● Token accuracy for the random evaluation. Senses – 5 nouns, 5 verbs and 3 adjectives selected from the following accuracy ranges: ● Between 90 and 100 (not with acc=100) ● Between 80 and 90 ● Between 70 and 80 ● Between 60 and 70
      System                            Nouns  Verbs  Adjectives
      timbl-DSC                         54.25  48.25  46.50
      svm-DSC                           64.10  52.20  52.00
      ukb-DSC                           49.37  44.15  38.13
      Combination (weights 1 / 1.5 / 1) 66.92  60.55  55.11
  • 18. WSD. All words ● Token_id not available as feature – Around 8 points of decrease if not used in FC validation
      <w xml:id="WR-P-P-G-0000148955.p.28.s.3.w.8"><t>paard</t>... --> Sense of chess piece
      ....
      <w xml:id="WR-P-P-G-0000148955.p.30.s.3.w.5"><t>paard</t> --> Sense ??? Maybe chess piece??
      System                            Nouns  Verbs  Adjectives
      timbl-DSC                         55.76  37.96  49.09
      svm-DSC                           64.58  45.81  55.70
      ukb-DSC                           56.81  31.37  35.93
      Combination (weights 1 / 1.5 / 1) 66.09  45.68  52.24
  • 19. WSD. Overall Evaluation ● Overall results for systems, considering all the lemmas
      Evaluation             Nouns  Verbs  Adjectives
      Fold cross validation  88.65  87.60  83.06
      Random evaluation      66.92  60.55  55.11
      All words evaluation   66.09  45.68  52.24
  • 20. WSD. Overall Evaluation ● Distribution of words by performance for the combined system in the fold-cross validation – Performance (P) for each lemma
      Range         Nouns    Verbs    Adjectives  All
      P >= 80       87.91 %  91.03 %  69.55 %     82.83 %
      70 <= P < 80   9.86 %   7.35 %  22.69 %     13.30 %
      60 <= P < 70   2.08 %   1.18 %   6.57 %      3.28 %
      P < 60         0.15 %   0.45 %   1.19 %      0.60 %
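A distribution like the one on this slide can be derived from per-lemma accuracies as sketched below. The lemmas and accuracy values are invented for illustration; only the four ranges come from the slide.

```python
def bin_lemmas(accuracies):
    """Percentage of lemmas falling in each performance range."""
    bins = {"P >= 80": 0, "70 <= P < 80": 0, "60 <= P < 70": 0, "P < 60": 0}
    for p in accuracies.values():
        if p >= 80:
            bins["P >= 80"] += 1
        elif p >= 70:
            bins["70 <= P < 80"] += 1
        elif p >= 60:
            bins["60 <= P < 70"] += 1
        else:
            bins["P < 60"] += 1
    total = len(accuracies)
    return {label: 100.0 * count / total for label, count in bins.items()}


# Hypothetical per-lemma token accuracies (in %).
per_lemma = {"paard": 91.0, "bank": 75.5, "lopen": 82.3, "mooi": 58.0}
distribution = bin_lemmas(per_lemma)
```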
  • 21. Sense tagging of all SONAR ● We apply our three systems to all unannotated SONAR (for the DSC selected words) ● All automatic annotations with confidence value – You can select the best ones ● Number of tokens automatically annotated:
      System     Nouns   Verbs   Adjectives
      timbl-DSC  18.5 M  23.9 M  5.3 M
      svm-DSC    18.5 M  23.9 M  5.3 M
      ukb-DSC    18.9 M  24.1 M  5.4 M
  • 22. Sense tagging of all SONAR ● Number of tokens automatically annotated with a confidence >= 0.8 ● Around 29 M tokens with a confidence >= 0.8
      System                 Nouns         Verbs         Adjectives
      timbl-DSC all          18.5 M        23.9 M        5.3 M
      timbl-DSC conf >= 0.8  10.8 M (58%)  15.6 M (64%)  2.6 M (50%)
  • 23. Dank je wel ! Thanks ! Gracias !