A Study of Association Measures and their Combination for Arabic MWT Extraction

Introduction
The state of MWT extraction
Proposed Method
Evaluation and Results
Conclusion and perspectives
Bibliography


10th International Conference on Terminology and Artiﬁcial
Intelligence (TIA’2013)
Abdelkader El Mahdaouy, Said El Alaoui Ouatik and Eric Gaussier

October 28th, 2013


Table of contents

1 Introduction
    Terminology Extraction
    Motivation
2 The state of MWT extraction
    Standard Approaches
    Statistical Measures
3 Proposed Method
    Linguistic Filter
    Statistical Filter
4 Evaluation and Results
    Corpus
    Evaluation Method
    Obtained results
5 Conclusion and perspectives
6 Bibliography



Terminology Extraction

Terminology
Set of terms representing the system of concepts of a particular
subject field.

Term
Lexical unit that has an unambiguous meaning when used in a
text of a specific domain.
Refers to a defined concept ... (ISO 704).

Terminology extraction is a subtask of information extraction:
automatically extract relevant terms from a given corpus.



Motivation

The bag-of-words model (based on single-word terms) is a
simplifying representation used in natural language processing
and information retrieval (IR).
Multi-word terms (MWTs) are less ambiguous and less
polysemous than single-word terms.
Using MWTs instead of single-word terms yields a better
representation of document content.




Standard Approaches

Linguistic Approaches
Based on linguistic pre-processing and POS tagging.
Extract candidate terms using syntactic patterns.

Statistical Approaches
Rank candidate terms with a measure that gives higher scores
to "good" candidate terms.
Frequent expressions are assumed to represent important concepts.

Hybrid Approaches
Combine linguistic and statistical techniques to extract MWTs, in order to avoid the
weaknesses of each approach used alone.




Characteristics of MWTs
Defined by Kageura and Umino (1996):

Unithood
The degree of strength or stability of syntagmatic
combinations or collocations.
Measured by: Log-Likelihood Ratio (LLR), T-Score, MI, etc.

Termhood
The degree to which a linguistic unit is related to a specific
domain concept.
Measured by: C/NC-value.
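
To make the unithood side concrete, here is a minimal sketch of the log-likelihood ratio for a word pair, computed from a 2x2 contingency table in the way described by Dunning (see Bibliography). The function names and the way counts are passed in are assumptions made for illustration; the slides do not show how the authors implemented it.

```python
import math


def _xlogx(k):
    """Return k * ln(k), with the usual convention 0 * ln(0) = 0."""
    return k * math.log(k) if k > 0 else 0.0


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: occurrences of the pair (w1, w2)
    k12: occurrences of w1 followed by a word other than w2
    k21: occurrences of w2 preceded by a word other than w1
    k22: pairs containing neither w1 nor w2
    """
    n = k11 + k12 + k21 + k22
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    # G^2 = 2 * sum(observed * ln(observed / expected)), expanded algebraically.
    return 2.0 * (cells + _xlogx(n) - rows - cols)
```

A table whose cells match what independence predicts scores zero, and the score grows as the two words co-occur more often than chance would suggest.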




Proposed Method
The hybrid method consists of two filters:

Linguistic Filter
Use AMIRA 2.0 (POS tagging toolkit).
Extract MWT candidates based on syntactic patterns.
Handle the problem of MWT variation.

Statistical Filter
Propose a novel statistical measure (NLC-value) that combines
context information with termhood and unithood.
Evaluate state-of-the-art statistical measures.



Linguistic Filter
The proposed linguistic filter extracts
candidate MWTs using two core
components: the POS tagger and the
sequence identifier.

Syntactic patterns
(Noun + (Noun|Adj) + | (Noun|Adj) + | (Noun|Adj)).
Noun Prep Noun.

Figure 1: The global schema of the linguistic filter
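
A sequence identifier of this kind can be sketched as a regular expression over the POS tag sequence. The sketch below is only an illustration under stated assumptions: the one-letter tags (N, A, P, O) are hypothetical and are not the AMIRA tag set, and the pattern is a simplified reading of the slide (a noun followed by one or more nouns or adjectives, or Noun Prep Noun). The tokens are English glosses standing in for Arabic words.

```python
import re

# Hypothetical one-letter tags: N = noun, A = adjective,
# P = preposition, O = anything else. Each tag must be exactly one
# character so that string offsets line up with token positions.
PATTERN = re.compile(r"N[NA]+|NPN")


def extract_candidates(tagged):
    """Return candidate MWTs from a list of (word, tag) pairs.

    Matches are non-overlapping and found left to right, so a span
    consumed by one match is not reused by a later one.
    """
    tags = "".join(tag for _, tag in tagged)
    return [
        tuple(word for word, _ in tagged[m.start():m.end()])
        for m in PATTERN.finditer(tags)
    ]


sentence = [
    ("pollution", "N"), ("water", "N"), ("is", "O"),
    ("barrel", "N"), ("of", "P"), ("oil", "N"),
]
candidates = extract_candidates(sentence)
```

Working on the tag string rather than on the words keeps the patterns short and makes it easy to add or change a pattern without touching the extraction loop.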



Term variation
Four types of variation are handled: graphical variants,
inflectional variants, morpho-syntactic variants and syntactic
variants.

Graphical Variants
Concern orthographic errors that occur when writing certain
letters ([Arabic letter], [Arabic letter] and [Arabic letter]).

Example
[Arabic term] which leads to [Arabic term], meaning "Biodiversity".
.



Term variation

Inflectional Variants
These variants are due to the use of different forms of the words
constituting an MWT:
The gender and the number.
The presence/absence of a definite article.

Examples
1 [Arabic term] (ocean pollution) which leads to [Arabic term] (pollution of the oceans).
2 [Arabic term] (water pollution) which leads to [Arabic term] (the water pollution).


Term variation

Morpho-syntactic Variants
These variants affect the internal structure of the term, as the words it
contains are related through derivational morphology:
Noun1 Noun2 ⇔ Noun1 Adj.
Noun1 Adj ⇔ Noun1 Prep Noun.

Examples
1 [Arabic term] and [Arabic term] (air pollution).
2 [Arabic term] which leads to [Arabic term] (barrel of oil).

.



Term variation

Syntactic Variants
These variants modify the structure of the MWT candidate by
adding one or more words (such as adjectives) but do not affect the
grammatical categories:
Noun1 Noun2 ⇔ Noun1 Noun2 Adj.
Noun1 Adj1 ⇔ Noun1 Adj1 Adj2.

Examples
1 [Arabic term] (Water stocks) and [Arabic term] (Groundwater stocks).
2 [Arabic term] (Health Organization) and [Arabic term] (World Health Organization).



Statistical Filter
The NLC-value

\[
\text{NLC-value}(a) = 0.8 \cdot \text{LC-value}(a) + 0.2 \cdot \text{N-value}(a) \tag{1}
\]

with

\[
\text{LC-value}(a) =
\begin{cases}
\log_2(|a|) \cdot FL(a) & \text{if } a \text{ is not nested,} \\
\log_2(|a|) \cdot \left( FL(a) - \dfrac{1}{|T(a)|} \sum_{b \in T(a)} FL(b) \right) & \text{otherwise,}
\end{cases}
\]

\[
FL(a) = f(a) \cdot \ln\bigl(2 + \min(LLR(a))\bigr),
\qquad
\text{N-value}(a) = \sum_{b \in C_a} f_a(b) \cdot \frac{|T(b)|}{n}
\]

where:
1 |a| denotes the length in words of candidate term a.
2 f(a) is the number of occurrences of a.
3 T(a) denotes the set of longer candidate terms in which a appears.
4 |T(a)| is the cardinality of the set T(a).
5 C_a denotes the set of distinct context words of a.
6 f_a(b) corresponds to the number of times b occurs in the context of a.
7 n is the total number of terms considered.
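
The formulas above can be turned into a short sketch. It is not the authors' implementation: the function names and argument layout are assumptions, and min(LLR(a)) is taken as a precomputed input because the slide does not spell out how the minimum is formed (one plausible reading is the minimum LLR over the adjacent word pairs of a).

```python
import math


def fl(freq, min_llr):
    """FL(a) = f(a) * ln(2 + min(LLR(a)))."""
    return freq * math.log(2.0 + min_llr)


def lc_value(length, fl_a, fl_longer):
    """LC-value(a).

    length:    |a|, the number of words in a
    fl_a:      FL(a)
    fl_longer: FL(b) for every longer candidate b in T(a);
               an empty list means a is not nested
    """
    if fl_longer:
        # Nested case: subtract the mean FL of the longer candidates.
        fl_a = fl_a - sum(fl_longer) / len(fl_longer)
    return math.log2(length) * fl_a


def n_value(context_counts, t_sizes, n):
    """N-value(a) = sum over context words b of f_a(b) * |T(b)| / n.

    context_counts: {b: f_a(b)} for each distinct context word b of a
    t_sizes:        {b: |T(b)|}
    n:              total number of terms considered
    """
    return sum(f_ab * t_sizes[b] / n for b, f_ab in context_counts.items())


def nlc_value(lc, nv):
    """Equation (1): weighted sum of LC-value and N-value."""
    return 0.8 * lc + 0.2 * nv
```

For a two-word candidate, log2(|a|) equals 1, so LC-value reduces to FL(a) minus the mean FL of the longer candidates that contain it.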




The Corpus

Arabic specialized domain corpora are lacking.
The corpus we built contains 1666 files comprising 53569
distinct tokens (without stop words) extracted from the Web
site "Al-Khat Alakhdar".
The corpus covers various environmental topics such as
pollution, water purification, soil degradation, forest
preservation, climate change and natural disasters.




The Evaluation
1 We computed the association scores (LLR, C-value, NC-value,
NTC-value, LLR+C-value, NLC-value) for the MWT
candidates.
2 From the ranking produced by each statistical measure, we
retained the k best candidates, with k ranging from 100 to
300 in steps of 100.
3 We automatically built a reference list of all Arabic
MWTs available in the latest version of the AGROVOC thesaurus.
4 We used translations of the MWTs and the European terminological
database IATE.
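
The k-best evaluation described in the steps above can be sketched as a precision-at-k computation against a reference list. This is a generic illustration; the function name and the toy data are hypothetical, and the slides do not state exactly how candidates were matched against AGROVOC or IATE entries.

```python
def precision_at_k(ranked, reference, k):
    """Fraction of the k best-ranked candidates found in the reference set.

    ranked:    candidates sorted by decreasing score
    reference: set of reference terms
    k:         number of top candidates to keep
    """
    top = ranked[:k]
    if not top:
        return 0.0
    hits = sum(1 for term in top if term in reference)
    return hits / len(top)


# Toy data: four ranked candidates, two of which are reference terms.
ranked_terms = ["a", "b", "c", "d"]
reference_terms = {"a", "c"}
scores = {k: precision_at_k(ranked_terms, reference_terms, k) for k in (1, 2, 4)}
```

Running the same function on each measure's ranking, for each value of k, gives one row of a results table per measure.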




Obtained results

Table 1: Results obtained for different statistical measures

Statistical measure    Top MWTs considered
                       100       200       300
LLR                    75.0%     70.5%     64.3%
C-value                71.0%     69.0%     67.3%
NC-value               74.0%     70.0%     68.3%
NTC-value              80.0%     71.5%     69.7%
LLR+C-value            73.0%     72.0%     68.3%
NLC-value              82.0%     75.5%     73.0%

Table 2: Number of terms found in AGROVOC for each measure

Statistical measure    Top MWTs considered
                       100       200       300
LLR                    35        60        80
C-value                27        59        82
NC-value               32        62        82
NTC-value              35        60        83
LLR+C-value            34        60        84
NLC-value              41        65        86

Table 3: Number of terms found in IATE for each measure

Statistical measure    Top MWTs considered
                       100       200       300
LLR                    40        81        113
C-value                44        79        120
NC-value               42        78        123
NTC-value              45        83        126
LLR+C-value            39        84        121
NLC-value              41        86        133

Figure 2: Precision obtained for different statistical measures that combine termhood and unithood

Figure 3: Precision obtained for the C/NC-value and the NTC-value

Figure 4: Precision obtained for the LLR and the C/NC-value


Conclusion and perspectives

Conclusion
1 A hybrid method for Arabic MWT acquisition that takes advantage of existing
linguistic and statistical approaches.
2 A novel statistical measure, NLC-value, for ranking MWT candidates.
3 Experiments were performed on bi-grams and tri-grams from an Arabic
environmental corpus.

Perspectives
1 Validate the proposed statistical measure on other languages.
2 Use the extracted MWTs for document indexing and retrieval in IR systems.

We thank the reviewers for their useful comments (the results presented here are
based on their remarks).
.



Bibliography
Boulaknadel S., Daille B., and Aboutajdine D. 2008a. Multi-word term indexing
for Arabic document retrieval. In Proceedings of the IEEE Symposium on
Computers and Communications, pp. 869-873.
Dunning T. 1994. Accurate Methods for the Statistics of Surprise and
Coincidence. Computational Linguistics, volume 19, pp. 61-74.
Frantzi K. T., Ananiadou S., and Tsujii T. 1998. The C-Value/NC-Value Method
of Automatic Recognition for Multi-word Terms. Journal on Research and
Advanced Technology for Digital Libraries, pp. 115-130.
Kageura K. and Umino B. 1996. Methods of Automatic Term Recognition: A
Review. Terminology, volume 3.
Vu T., Aw A. Ti, and Zhang M. 2008. Term Extraction Through Unithood and
Termhood Unification. In Proceedings of IJCNLP.


