VIETNAM NATIONAL UNIVERSITY HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
THI-THANH-TAM DO
TAGSET EVALUATION
AND AUTOMATIC ERROR VERIFICATION
IN POS TAGGED CORPUS
MASTER THESIS
(Natural language processing)
Ha Noi - 2012
VIETNAM NATIONAL UNIVERSITY HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
THI-THANH-TAM DO
TAGSET EVALUATION
AND AUTOMATIC ERROR VERIFICATION
IN POS TAGGED CORPUS
Branch of knowledge: Information technology
Major: Computer science
Code: 60 48 01
MASTER THESIS
Supervisor: Dr. Nguyen Phuong Thai
Ha Noi - 2012
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
NOTATIONS/ABBREVIATIONS
ORIGINALITY STATEMENT
ABSTRACT
CHAPTER 1: INTRODUCTION AND MOTIVATION
1.1. Characteristics of Vietnamese language
1.2. Vietnamese part of speech
1.2.1. Criteria to classify
1.2.2. The ways to build up tagset
1.3. Corpora
1.3.1. VietTreeBank
1.3.2. VnQtag
1.4. Motivation
1.5. Organization of the thesis
CHAPTER 2: EVALUATING DISTRIBUTIONAL PROPERTIES - CONVERSION POSSIBILITY OF TAGSETS IN VIETNAMESE
2.1. Tagset evaluation
2.1.1. Introduction
2.1.2. Tagset
2.1.3. A method for evaluating distributional properties of tagsets
2.1.3.1. Internal criterion
2.1.3.2. External benchmark
2.1.3.3. Algorithm
2.1.4. Result of tagset evaluation
2.2. Possibility of Tagsets convertibility
Result of tagset convertibility
CHAPTER 3: AUTOMATIC ERROR VERIFICATION OF POS - TAGGED CORPUS
3.1. Concepts related to variation n-gram method
3.2. Types of Vietnamese tagging error
3.3. An algorithm for detecting errors
3.4. Classifying variations
3.5. Result of detecting errors in POS tagging
3.6. Word segmentation
3.6.1. Word in Vietnamese
3.6.2. N-gram in word segmentation
3.6.3. Result of detecting errors in word segmentation
CHAPTER 4: CONCLUSION AND SUMMARY
BIBLIOGRAPHY
APPENDIX
A.1. The Vietnamese treebank tagset
A.2. Vietnamese Tagset (VietTreeBank)
A.3. Tagset 3 (25 tags)
A.4. Tagset 4 (40 tags)
A.5. Syntax function tags in VTB
A.6. Adverbial classification tag of verb in VTB
A.7. Phrase tagset in VTB
A.8. Clause tagset in VTB
LIST OF FIGURES
Figure 1. The features of Vietnamese type
Figure 2. Purity as external evaluation criterion for cluster quality. Majority class and number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ⋄, 3 (cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71
Figure 3. N-gram and variation nuclei in VTB corpus with n up to 29
LIST OF TABLES
Table 1. The expression of grammatical meaning in Vietnamese
Table 2. Corpus with VnQtag tagset annotation
Table 3. Principal differences between Vietnamese and English
Table 4. Some frames found in the corpus
Table 5. Result of the tagset evaluation method
Table 6. Some properties of the tagset convertibility method in Hoang tu be
Table 7. Statistics of ambiguous word types in the VnQtag corpus
Table 8. Statistics of ambiguous tokens in the VnQtag corpus
Table 9. Detailed statistics of ambiguous word types in the VnQtag corpus
Table 10. Statistics of errors in the corpus
Table 11. Details of n-grams in the tagged corpus
Table 12. Error and ambiguity statistics for the word segmentation algorithm
Table 13. Details of contexts and variations in the VTB corpus
CHAPTER 1
INTRODUCTION AND MOTIVATION
1.1. Characteristics of Vietnamese language
Every language has its own features, and Vietnamese is no exception. To understand Vietnamese better, we list some of its prominent features and compare it with other languages such as Chinese and English.
According to M. Ferlus and other domestic and international researchers, Vietnamese is a language of native origin that belongs to the Mon-Khmer branch of the Austroasiatic language family and is closely related to the Muong language. Vietnamese is also an isolating language, with three prominent features. First, the syllable is the basic unit from which words and sentences are formed. A syllable may be a single word, or a component of a complex word, a compound word, or a reduplicated word. Second, Vietnamese words are not inflected. In particular, there is no difference between singular and plural nouns; for example, “hai cuốn sách” (two books) and “một cuốn sách” (one book). Third, grammatical meaning is expressed mainly through word order and function words (expletives). Given function words such as “sẽ”, “đã”, “không” and the sentence “Tôi ra ngoài” (I go out), we can form three sentences with different meanings: “Tôi sẽ ra ngoài” (I will go out); “Tôi đã ra ngoài” (I went out); “Tôi không ra ngoài” (I do not go out).
[Figure: The characteristics of Vietnamese: (1) the syllable is the basic unit for forming words and sentences; (2) Vietnamese words are not inflected; (3) grammatical meaning is expressed mainly through word order and function words (expletives).]
Figure 1. The features of Vietnamese type
Some other languages, such as Chinese and Thai, are also isolating languages, whereas English, French, and Russian are inflectional languages. The two types therefore differ in several ways, as a comparison of Vietnamese, Chinese, and English sentences shows.
Table 1. The expression of grammatical meaning in Vietnamese
             Vietnamese               Chinese        English
Word order   Tôi yêu anh ấy           Wo ai ta       I love him
             → Anh ấy yêu tôi         → Ta ai wo     → He loves me
Expletive    Tôi không yêu anh ấy     Wo bu ai ta    I do not love him
Unlike in Vietnamese and Chinese, in the English sentences above a change in word order turns the object pronoun into a subject pronoun (him → he).
1.2. Vietnamese part of speech
1.2.1. Criteria to classify
In European languages, the notion of POS is tied to morphological categories such as gender, number, mood, and so on. In Vietnam, there are two views:
• First, POS does not exist in Vietnamese because Vietnamese has no morphological modification (Le Quang Trinh, Nguyen Hien Le, Ho Huu Tung).
• Second, like European languages, Vietnamese also has POS, but classifying words into tags, or defining the POS of a word, must be based on certain criteria.
So far, Vietnamese linguists have largely agreed on the following criteria (Diep Quang Ban, Hoang Van Thung, 2010):
a. General meaning: “The meaning of a POS is the general meaning of a group of words, generalized from their vocabulary meaning to form a common grammatical category (lexical-grammatical category)”. POSs thus fit the definition of a classification category: they are groups containing very large numbers of words, and each group has a classifying feature such as object, quality, action, or state. Therefore, nhà, bàn, chim, học sinh, con, quyển, sự, and so on are classified as nouns because their lexical meaning is generalized and abstracted as objects; their grammatical category is noun.
b. Combination ability: Given their general meaning, words can take part in a meaningful combination: some words can replace each other in a certain position of the combination, while the rest of the combination provides the setting in which the replacement is possible. For example, nhà, bàn, chim, cát, and so on can appear in and replace each other in combinations such as nhà này, chim này, cát này, etc., and are therefore classified as nouns.
c. Syntactic function: When taking part in sentence composition, words that can stand in one or several particular positions in a sentence, or can replace each other in those positions, and that express the same syntactic-function relation with the other parts of the sentence, can be classified into one POS. For instance, words such as nhà, bàn, chim, cát are nouns: they can be subjects of sentences, and the subject function is a syntactic function used to classify them as nouns.
1.2.2. The ways to build up tagset
Currently, two kinds of POS tagsets have been developed, of which the first has received much more attention from linguistic researchers.
The first kind is based on 8 basic POS tags that are widely used in dictionaries and linguistic materials: noun, verb, adjective, pronoun, adverb, conjunction, interjection, emotive word. From these 8 basic tags, finer POS tagsets are built. Each researcher relies on certain criteria to refine the tagset (the criteria are discussed in section 1.2.1). Notably, the VnPOS tagset of Tran Thi Oanh contains 14 tags; VietTreeBank consists of 17 tags; VnQtag has 59 tags (see appendix).
The second kind is built by mapping a tagset from another language onto Vietnamese, based on the correspondence between words of the two languages (Dinh Dien and Hoang Kiem, 2003).
1.3. Corpora
Annotated corpora are large bodies of text with linguistically informative mark-up. They play an important role in current work in computational linguistics, so great attention has gone into developing such corpora. Many countries have their own corpora. Some well-known corpora are the British National Corpus (Leech et al., 1994), the Penn Treebank (Marcus et al., 1993), the German NEGRA Treebank (Skut et al., 1997), and the Lancaster Corpus of Mandarin Chinese (Tony McEnery and Richard Xiao, 2005). In Vietnam, the notable corpora are VnQtag, VnPos, and VTB.
To build a corpus, some obligatory criteria must be met (McEnery and Wilson, 2001, p.29):
• Sampling and representativeness: the elements of a corpus must be general, diverse, and plentiful. A sample is representative if what we find for the sample also holds for the general population.
• Finite size: the bigger a corpus is, the more highly it is valued, but its size is still finite.
• Machine-readable form
• Standard reference
Building a large corpus manually takes a great deal of time because it requires extensive linguistic knowledge. Even then, the quality of a large manually built corpus is not guaranteed. Therefore, our thesis examines corpus quality and tries to improve it.
The two corpora used in our experiments are VietTreeBank and VnQtag. Below, we discuss in more detail how these corpora were built.
1.3.1. VietTreeBank
VietTreeBank is the result of the national project VLSP and was developed by the VTB group (Nguyen Phuong Thai, Vu Luong, Nguyen Thi Minh Huyen and annotators). The corpus includes 142 documents on political and social topics from the Youth newspaper, corresponding to 10,000 Vietnamese sentences with syntactic annotation (word segmentation, POS tagging, syntactic structure). The group used MEM and CRF machine learning models to assign POS tags; the accuracy of the models is over 93%. VTB was developed to support building tools for word segmentation, POS tagging, syntactic parsing, and so on. The VTB group chose two criteria to classify POS: combination ability and syntactic function. For instance, a noun can act as subject or object in a sentence. In addition, a noun can combine with numerals (three, four) and attributes (each, every).
A POS tag can contain information about the basic word class (noun, verb, adjective, and so on), morphological information (countable or uncountable), subcategory (verb taking a noun, verb taking a clause, etc.), semantic information, or other syntactic information. The VTB group built the tagset based only on the basic word class, without other information such as morphological information or subcategory (see the tagset in the appendix).
In addition to POS information, the group describes basic syntactic elements such as phrases and clauses. Syntactic tags are the most fundamental information in a syntax tree; they form the spine of the tree. A7 and A8 in the appendix list the phrase and clause tagsets, respectively.
The function tag of a syntactic element expresses its role within the higher-level syntactic element. These tags are assigned to the main elements of the sentence, such as subject, predicative, and object. They provide information that helps us identify basic grammatical relationships, as follows:
• Subject – Predicative
• Predicative
• Combination
• Complement
• ……
The tagging process for each sentence in the corpus consists of three steps: word segmentation, POS tagging, and syntactic parsing.
1.3.2. VnQtag
The VnQtag tagset was built as part of the KC01 national project by a development group including Nguyen Thi Minh Huyen, Vu Xuan Luong, and Le Hong Phuong. The group based their work on a printed dictionary (the Vietnamese dictionary of the Institute of Linguistics, 2000). First, they segmented sentences into words using a syllable automaton and a lexical automaton. Then, they used the Qtag tagger to assign POS labels to Vietnamese words. There are 59 POS labels (see the appendix). In addition to grammatical information, the group added semantic information (the general meaning of a word) to classify words into the 59 word-class labels. For example, words are considered verbs if they express the general meaning of a process. Process meaning is expressed directly in the action of an object; this is action meaning. State meaning is generalized in relation to the action of an object in time and space (Vietnamese grammar of Diep Quang Ban and Hoang Van Thung). The automatic tagging experiment was carried out on the 7 documents listed in Table 2. The annotated corpus plays an important role in NLP: it is a database of high-quality linguistic resources, and it follows international standards for data representation.
The resulting corpus has the following format: each lexical unit and its POS are on one line, with a space between the syllables of a word and a tab separating the word from its POS. Punctuation marks and other symbols in the text are treated as lexical units labeled with the corresponding punctuation tag. The corpus includes 7 documents of different types, such as story, novel, science, and press. It gathers common words used in daily life and in the press, as well as words typically found in literary works and scientific and technical terms.
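The line format described above can be read with a few lines of code. The following is a minimal sketch, assuming one `word<TAB>tag` pair per line with spaces between the syllables of a multi-syllable word. The function name `parse_tagged_lines` and the sample tags are illustrative only and are not taken from the VnQtag distribution.

```python
from typing import Iterable, List, Tuple


def parse_tagged_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'word<TAB>tag' lines into (word, tag) pairs.

    Assumes the last tab on a line separates the lexical unit from its tag.
    A non-blank line without a tab raises ValueError.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        word, tag = line.rsplit("\t", 1)
        pairs.append((word.strip(), tag.strip()))
    return pairs


# Hypothetical sample: the tags "N", "V", "PUNCT" are placeholders,
# not labels from the 59-tag VnQtag tagset.
sample = ["học sinh\tN\n", "đọc\tV\n", "sách\tN\n", ".\tPUNCT\n"]
pairs = parse_tagged_lines(sample)
print(pairs)
```

Splitting on the last tab keeps the spaces inside a multi-syllable word such as "học sinh" intact, which matters because those spaces mark syllable boundaries rather than word boundaries.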
Table 2. Corpus with VnQtag tagset annotation
Id   Document                                          Type      Number of lexical units   Number of processing units (including punctuation)
1.   Hoang tu be                                       Story     15532                     18663
2.   Chuyen tinh ke truoc luc rang dong - part I       Novel     14277                     16787
3.   Chuyen tinh ke truoc luc rang dong - part II      Novel     12499                     14698
4.   Luoc su thoi gian                                 Science   10598                     11626
5.   Muoi cua rung                                     Story     3117                      3573
6.   Nhung bai hoc nong thon                           Story     6682                      8244
7.   Cong nghe va he thong phong thu quoc gia          Press     1028                      1162
1.4. Motivation
So far, it may not be clear which problems this thesis addresses or why we chose them. This section discusses both.
Linguistic theories were first developed to describe Indo-European languages, and they have produced many significant achievements. In our country, NLP research began in 1990; however, the results achieved are still limited. Vietnamese language processing is the responsibility of Vietnamese researchers; we cannot expect foreign researchers to solve it (Ho Tu Bao, 2001). Therefore, this thesis aims to contribute to improving Vietnamese processing by concentrating on improving tagsets and detecting tagging errors.
Natural language processing is done in five stages:
• Morphological and lexical analysis: The lexicon of a language is its vocabulary, which includes its words and expressions. Morphology is the identification, analysis, and description of the structure of words. Words are generally accepted as the smallest units of syntax. Syntax refers to the rules and principles that govern the sentence structure of any individual language.
Lexical analysis: The aim is to divide the text into paragraphs, sentences, and words. Lexical analysis cannot be performed in isolation from morphological and syntactic analysis.
• Syntactic analysis: The words in a sentence are analyzed to determine the grammatical structure of the sentence. The words are transformed into structures that show how the words relate to each other. Some word sequences may be rejected if they violate the rules of the language for how words may be combined.
• Semantic analysis: It derives an absolute meaning from context; that is, it determines the possible meanings of a sentence in a context.
• Discourse integration: The meaning of an individual sentence may depend on the sentences that precede it and may influence the meaning of the sentences that follow it.
• Pragmatic analysis: It derives knowledge from external commonsense information; that is, it means understanding the purposeful use of language in situations, particularly those aspects of language which require world knowledge. For example, the sentence "Do you know what time it is?" should be interpreted as a request.
Our thesis concentrates on the first stage (i.e., morphological analysis) of natural language processing. It is a very important preprocessing step for the following stages, such as syntactic analysis and semantic analysis.
Our thesis addresses two major problems and two minor problems. The major problems are evaluating tagsets and detecting tagging errors automatically; the minor problems are checking the convertibility of tagsets and detecting segmentation errors automatically.
a. Evaluating tagsets and their convertibility
In the previous sections, we mentioned several tagsets: VietTreeBank (17 tags), VNPOS (15 tags), and VNQTag (59 tags). Such inconsistent tagsets raise several questions. Which tagset is better? What methods can evaluate these tagsets? How can we choose the right set of POS tags for a given application? The first part of this thesis focuses on answering these questions.
Another aspect we discuss is the convertibility of tagsets. The choice of tagset strongly affects the difficulty of POS tagging. In particular, a large tagset increases the difficulty, whereas a smaller one may not be sufficient for a given purpose. Therefore, it is necessary to balance quality and quantity in a tagset, which means:
• Clearer information (i.e., classifying words into more parts of speech based on concrete meaning)
• Ease of tagging (i.e., as few POS tags as possible)
We try to find a method that balances these two requirements. We carry out experiments on a source tagset (ST) and a target tagset (TT), count the number of ambiguous words after the conversion, and draw our conclusions from the result.
b. Detecting POS tagging and word segmentation errors
• If each word belonged to only one label, a finite dictionary of words and their corresponding labels would solve the POS tagging problem completely. In fact, however, one word can belong to more than one label, which leads to ambiguity and errors in POS tagging. Fixing these errors manually costs much time and money. We want to find a method that detects errors automatically in order to reduce this cost.
• In addition, Vietnamese word segmentation is acknowledged to be a thorny issue. One sentence may have several different segmentations. For example, chiếc xe đạp nặng quá. Way 1: chiếc/ xe/ đạp/ nặng/ quá. Way 2: chiếc/ xe đạp/ nặng/ quá. Here, we use “/” to separate words. Both ways are acceptable because each segmentation yields a meaningful sentence.
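The segmentation ambiguity in this example can be made concrete by enumerating every way to split the syllable sequence into words from a lexicon. The sketch below is only an illustration: the toy lexicon and the function name `segmentations` are assumptions, and it is not the segmentation method used for the corpora in this thesis.

```python
from typing import List, Set


def segmentations(syllables: List[str], lexicon: Set[str]) -> List[List[str]]:
    """Return every split of `syllables` into words found in `lexicon`."""
    if not syllables:
        return [[]]  # one way to segment nothing: the empty segmentation
    results = []
    for end in range(1, len(syllables) + 1):
        word = " ".join(syllables[:end])
        if word in lexicon:
            for rest in segmentations(syllables[end:], lexicon):
                results.append([word] + rest)
    return results


# Toy lexicon (assumption): single syllables plus the compound "xe đạp".
lexicon = {"chiếc", "xe", "đạp", "xe đạp", "nặng", "quá"}
ways = segmentations("chiếc xe đạp nặng quá".split(), lexicon)
for way in ways:
    print("/ ".join(way))
```

With this lexicon the function yields exactly the two segmentations listed above, which differ only in whether "xe đạp" is treated as one word or two.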
One of the reasons for this difference is shown in the following table. The last problem in our thesis is word segmentation:
Table 3. Principal differences between Vietnamese and English
Characteristic      Vietnamese                                         English
Foundation unit     Syllable                                           Word
Prefix or suffix    No                                                 Yes
Part of speech      No agreement                                       Defined clearly
Boundary of word    Context-meaningful combination of syllables        Blank or delimiters
All of the above reasons are the motivation that drives me to find the final answer.
1.5. Organization of the thesis
The thesis is organized into four main chapters, with the following content:
Chapter 1: Introduction and motivation.
Chapter 1 provides a general picture of Vietnamese, including the features of the language and its parts of speech. It also discusses the reasons for choosing the topic of the thesis.
Chapter 2: Evaluating distributional properties and conversion possibility of tagsets in Vietnamese.
Chapter 2 examines tagsets in more depth, for instance how to build a tagset and how to merge labels, and introduces the basic notions needed to evaluate the properties of tagsets.
Chapter 3: Automatic error verification of POS-tagged corpus
This chapter introduces the notions related to the error detection method, then presents the algorithm and discusses how variations are classified as errors or ambiguities.
Chapter 4: Summary and conclusion
This chapter discusses three issues: the theoretical contributions of the thesis, its experimental contributions, and directions for future work. It sums up the results achieved and discusses work that remains to be done.
CHAPTER 2:
EVALUATING DISTRIBUTIONAL PROPERTIES -
CONVERSION POSSIBILITY OF TAGSETS
IN VIETNAMESE
2.1. Tagset evaluation
2.1.1. Introduction
Tagset evaluation has received considerable attention from NLP researchers for more than 20 years. Tagset evaluation allows us to test and assess the impact of tagset modifications on results, by using different versions of a given tagset on the same texts (Martin Volk and Gerold Schneider, 1998). In 2000, Saso Dzeroski, Tomaz Erjavec, and Jakub Zavrel compared the accuracy of tagsets designed by decreasing the cardinality of the tagset, either by omitting certain attributes of the tagset or by omitting almost all attributes except certain ones. Accuracies were computed using a black-box combiner (Halteren, Dzeroski). In the same year, Hervé Déjean presented two kinds of tagset evaluation: a global one and a local one. The first consists of evaluating the initial grammar generated by ALLiS. The second uses the notion of reliability, where the reliability of an element is the ratio of its frequency in the structure to its total frequency in the corpus. In addition, for Indian languages, Madhav Gopal, Diwakar Mishra, and Devi Priyanka Singh (2010) discussed the evaluation of several tagsets: the ILMT tagset, the JNU-Sanskrit tagset, the LDCIL tagset, and the Sanskrit consortium tagset.
Vietnamese is an isolating language, and word order is an important source of syntactic information. To evaluate Vietnamese tagsets, this chapter introduces a simple method that uses an internal criterion and an external criterion. Frequent frames and purity are used in the internal criterion to check whether tags are assigned accurately. The external criterion examines reductions in the cardinality of the tagset to check whether information quality is retained. Indeed, a number of evaluations have shown that many tagging errors are caused by overly fine distinctions within major categories (Eugenie Giesbrecht, 2008).
2.1.2. Tagset
A POS is a set of words with some grammatical characteristic(s) in common, and each POS differs in grammatical characteristics from every other POS. For example, nouns have different properties from verbs, which have different properties from adjectives, and so on.
A tagset is a set of POS tags built according to the criteria in section 1.2. Tagsets therefore vary in their number of tags and are used in various applications.
Properties of a tagset: a tagset needs to guarantee the following properties: retaining linguistic features, reflecting syntactic structure, allowing accurate tagging, and reducing the number of ambiguous words during tagging.
2.1.3. A method for evaluating distributional properties of tagsets
2.1.3.1. Internal criterion
Among the properties of tagsets, we place the most value on the possibility that tags are assigned accurately in the corpus. This is the internal criterion. We examine this criterion through the notion of a frame and the purity formula. A frame represents the local context under consideration and indicates which tags can appear in it. The purity formula then measures how strongly the tags converge in that local context.
Discussion about purity
As mentioned previously, we use the purity formula, an external evaluation criterion for cluster quality, as an evaluation criterion for tagsets (Stanford natural language processing, 357). Purity is a widely used measure of cluster quality. It is a simple and transparent evaluation measure. To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N.
Formally:
\[ \mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \lvert \omega_k \cap c_j \rvert \qquad (1) \]
Where:
$\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is the set of clusters
$C = \{c_1, c_2, \ldots, c_J\}$ is the set of classes.
We interpret $\omega_k$ as the set of documents in $\omega_k$ and $c_j$ as the set of documents in $c_j$ in equation (1).
High purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each document gets its own cluster.
For example:
[Figure: three clusters containing 17 points in total, marked x, o, and ⋄]
Figure 2. Purity as external evaluation criterion for cluster quality. Majority class and number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ⋄, 3 (cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.
Frame notion
The notion of a frame was introduced by Mintz in 2006. Then, in 2010, Dickinson and Jochim redefined it as follows: in a local context, a frame consists of three words, in which the two words surrounding a target word lead to the target's categorization. We use frames to test the quality of distributional mappings. In English, for example, the frame “you_it” generally predicts a verbal category for the target (i.e., the target word may be hit, beat, eat, or kiss). In Vietnamese, the frame “mẹ_là” leads to target words belonging to the pronoun category (Pp), e.g., “tôi, anh, chị, bác”. To obtain more exact results, we use the notions of frequency and frequent frame. Frequent frames supply category information in child language corpora. That is, not all frames play the same role in a corpus: the more often a frame appears, the more linguistic information it carries. We set the frequency threshold at 0.03% of the total number of frames. For example, if a corpus has 10000 frames, the threshold is 3 (10000 × 0.03%), so a frame that appears more than 3 times is considered a frequent frame.
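Frame collection and the frequency threshold can be sketched as follows. This is a minimal illustration under stated assumptions: the corpus is a flat list of (word, tag) pairs, sentence boundaries are ignored, "frame total" is read as the number of frame occurrences, and "above" is read as strictly greater than the threshold. The names `collect_frames` and `frequent_frames` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def collect_frames(tagged):
    """Map each frame (left word, right word) to a Counter of target tags."""
    frames = defaultdict(Counter)
    for i in range(1, len(tagged) - 1):
        left, right = tagged[i - 1][0], tagged[i + 1][0]
        frames[(left, right)][tagged[i][1]] += 1
    return frames


def frequent_frames(frames, ratio=Fraction(3, 10000)):
    """Keep frames that occur more often than `ratio` of all frame occurrences.

    The default ratio is 0.03%; a Fraction keeps the threshold exact.
    """
    total = sum(sum(tags.values()) for tags in frames.values())
    threshold = total * ratio
    return {f: tags for f, tags in frames.items() if sum(tags.values()) > threshold}


# Toy data (assumption): the tags are placeholders, not VnQtag labels.
tagged = [("mẹ", "N"), ("tôi", "Pp"), ("là", "V"),
          ("mẹ", "N"), ("anh", "Pp"), ("là", "V")]
frames = collect_frames(tagged)
# A large ratio is used here only because the toy corpus is tiny.
frequent = frequent_frames(frames, ratio=Fraction(1, 4))
print(frequent)
```

Using an exact fraction rather than a float avoids rounding surprises when a frame count lands exactly on the threshold, as in the 10000-frame example where the threshold is exactly 3.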
Next, the purity formula is applied to measure how tags are distributed within a frame, since each tag accounts for a different percentage of a frame's occurrences. To calculate the purity value, we consider only the highest tag frequency in each frame, sum these maxima over all frames, and divide by the total number of occurrences of all tags. The higher the purity value, the more likely it is that words can be tagged accurately. For instance, suppose we have two frames: Tôi_ở and mẹ_bảo. The first frame appears 4 times in a corpus, with the target word belonging to two tags, Vits and Vitn (1 time and 3 times, respectively). The second frame appears 8 times; the target word's POS is Np 7 times and Pp once. We can calculate the purity value as (3 + 7) / (4 + 8) = 10/12 ≈ 0.83.
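The two-frame example can be checked with a short computation. This is a sketch only: the function name `frame_purity` and the dictionary layout are assumptions, while the counts are the ones given in the example.

```python
def frame_purity(frames):
    """Sum of the majority-tag counts over all frames, divided by all tag occurrences."""
    majority = sum(max(tags.values()) for tags in frames.values())
    total = sum(sum(tags.values()) for tags in frames.values())
    return majority / total


# Counts taken from the example: frame -> {tag: number of occurrences}
frames = {
    ("Tôi", "ở"): {"Vits": 1, "Vitn": 3},
    ("mẹ", "bảo"): {"Np": 7, "Pp": 1},
}
value = frame_purity(frames)  # (3 + 7) / (4 + 8)
print(round(value, 2))
```

The majority tag contributes 3 of the 4 occurrences in the first frame and 7 of the 8 in the second, so the result is 10/12.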
2.1.3.2. External benchmark
Normally, to evaluate a tagset, linguists map it into a reduced one, because this helps
check which linguistic features are retained. Of course, a reduced tagset is built by
merging tags; however, how should the tags be merged? This is a difficult question
that we need to solve.
Hervé Déjean (Universität Tübingen, 2000) discussed the theoretically minimal
tagset. He affirmed that the quality of a tagset does not depend on the number of
tags, and built the minimum tagset necessary to parse a sentence whatever the domain
is. Originally, he used a tagset with one tag per structure (NP-VP). Then, he estimated
that a tagset of about 20 tags is enough to parse a sentence into PS and clause
structures.
Indeed, there are many ways to merge labels, so tagsets with different numbers of tags
still coexist. English is a morphological language, so it is rather easy to identify tags
that can be merged, such as conflating the base-form verb (VB) with the present-tense
verb (non-third person singular, VBP); Vietnamese, however, is not. The tagsets used
in our thesis are of two kinds:
Firstly, we used tagsets built by preceding NLP researchers, for instance, VnQtag and
VietTreeBank.
Secondly, we conflated some labels ourselves based on Vietnamese features. VnQtag
has the largest number of tags, so we use it as the source tagset to generate the other
tagsets.
2.1.3.3. Algorithm
To make the theory above concrete, we introduce an algorithm of 5 steps applied to a
tagged corpus, as follows.
1. Identify all the words and their POS tags in the corpus, and store them together
with their positions.
2. Count the frames in the corpus, then calculate the frequency threshold from the
total number of frames.
3. Find the frequent frames and calculate the purity value.
4. Map the original tagset to the new reduced tagsets.
5. Finally, calculate the new purity value for the new tagsets and count the lost
ambiguous words.
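Steps 4 and 5 can be illustrated with a short Python sketch that merges tags inside each frame and recomputes purity. It is only a sketch under stated assumptions: the helper names are hypothetical, the merge of Vits and Vitn into a single tag "V" is an invented mapping rather than one of the thesis tagsets, and the two frames are the ones from the purity example.

```python
from collections import Counter


def purity(frame_tag_counts):
    """Sum of per-frame majority counts divided by all tag occurrences."""
    majority = sum(max(counts.values()) for counts in frame_tag_counts.values())
    total = sum(sum(counts.values()) for counts in frame_tag_counts.values())
    return majority / total


def remap(frame_tag_counts, mapping):
    """Rewrite every tag through `mapping`; tags missing from it are kept as-is."""
    merged = {}
    for frame, counts in frame_tag_counts.items():
        new_counts = Counter()
        for tag, n in counts.items():
            new_counts[mapping.get(tag, tag)] += n
        merged[frame] = new_counts
    return merged


# Hypothetical reduced tagset: both intransitive-verb tags collapse into "V".
mapping = {"Vits": "V", "Vitn": "V"}
original = {
    "Tôi_ở": Counter(Vits=1, Vitn=3),
    "mẹ_bảo": Counter(Np=7, Pp=1),
}
reduced = remap(original, mapping)

print(f"{purity(original):.2%}")  # 83.33%
print(f"{purity(reduced):.2%}")   # 91.67%
```

Merging tags can only keep or raise the majority count of a frame, which is why purity tends to grow as the tagset shrinks and why it has to be read together with the number of lost ambiguous words.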
Preparing data:
We carried out this method on a corpus annotated with the VnQtag tagset.
2.1.4. Result of tagset evaluation
The experiments were performed on the VnQtag corpus, which includes four documents
annotated with VnQtag. We then merged some VnQtag tags to form new tagsets:
tagset 3 and tagset 4. Therefore, we have: VietTreeBank (18 tags), basic tagset 2 (8
tags), tagset 3 (25 tags) and tagset 4 (40 tags) (see the appendix). To merge tags, we
relied on the book Ngữ pháp tiếng Việt by Diệp Quang Ban, which organizes the
Vietnamese POS system into two groups:
• Group 1:
o Noun, Verb, Adjective
o Numeral
o Pronoun
• Group 2:
o Adjunct (Determiner, Adverb)
o Conjunction
o Particle
He classified each major category more finely; for example, nouns have two main
kinds: proper nouns and common nouns. Common nouns contain synthetic nouns and
non-synthetic nouns, both of which are further classified into countable and
uncountable nouns, and so on.
To obtain the 25-tag and 40-tag tagsets (tagset 3 and tagset 4), we merged some noun
and verb tags. Nouns and verbs are basic categories and cover the largest number of
words in Vietnamese. In the VnQtag tagset, nouns are finely classified into 8 tags and
verbs into 10 tags. We employed the four annotated documents in VnQtag and four
tagsets to obtain the results in Table 4 and Table 5.
Table 4. Some frames found in the corpus
Frame (frequency) | POS of target word (frequency)
mẹ_là (4) | Pp (4)
Tôi_ở (4) | Vits (1), Vitn (3)
chẳng_gì (3) | Vte (1), Na (2)
Tôi_nông dân (3) | Vla (3)
nhà_ở (3) | Np (1), Pp (2)
cái_tre (2) | No (2)
Còn_sinh (3) | Pp (3)
ba_Phúc (2) | Nh (2)
với_đứa (2) | Nn (2)
sinh_nông thôn (3) | Cm (3)
Con_nhỏ (2) | No (2)
dăm_trẻ (2) | Nu (2)
đứa_dâng (2) | Nh (2)
trẻ_đào (2) | Vta (2)
có_người (3) | Aa (2), Vtf (1)
bố_Lâm (3) | Nh (3)
tôi_có (3) | An (1), Vtf (1), Ja (1)
Lâm_thích (2) | Pp (1), Ja (1)
tôi_lắm (2) | Vtf (2)
có_thì (4) | Vta (2), Nn (1), Np (1)
thì_đỡ (2) | Jd (1), Vitm (1)
mẹ_bảo (8) | Pp (1), Np (7)
đây_lần (2) | Vla (2)
là_đầu (2) | Nt (2)
lần_tôi (2) | Nl (2)
tôi_Lâm (4) | Vtd (1), Vtn (1), Cc (2)
bố_bảo (9) | Np (8), Pp (1)
… | …
Table 4 shows some frames found in the corpus. Each entry carries four kinds of
information: the content of the frame, the number of times the frame occurs, the POS
of the target word, and the number of times that POS occurs. For example, the first
entry contains the frame “mẹ_là”. This frame occurs 4 times in the corpus, and in all
of them the target word is tagged Pp (pronoun).
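Table 5. Result of tagset evaluation method
Document | Words | Mapping | Tags | Frequent frames | Total frames | Threshold | Purity | Lost ambiguous words
Chuyện tình kể trước lúc rạng đông | 31520 | VnQtag | 59 | 128 | 15706 | 5 | 60.69% | 0
Chuyện tình kể trước lúc rạng đông | 31520 | VietTreeBank | 18 | 128 | 15706 | 5 | 82.86% | 331
Chuyện tình kể trước lúc rạng đông | 31520 | Basic tagset | 8 | 128 | 15706 | 5 | 82.06% | 397
Chuyện tình kể trước lúc rạng đông | 31520 | Tagset 3 | 25 | 128 | 15706 | 5 | 69.71% | 55
Chuyện tình kể trước lúc rạng đông | 31520 | Tagset 4 | 40 | 128 | 15706 | 5 | 79.09% | 137
Hoàng tử bé | 18666 | VnQtag | 59 | 453 | 7951 | 3 | 76.66% | 0
Hoàng tử bé | 18666 | VietTreeBank | 18 | 453 | 7951 | 3 | 88.37% | 590
Hoàng tử bé | 18666 | Basic tagset | 8 | 453 | 7951 | 3 | 88.60% | 892
Hoàng tử bé | 18666 | Tagset 3 | 25 | 453 | 7951 | 3 | 82.05% | 141
Hoàng tử bé | 18666 | Tagset 4 | 40 | 453 | 7951 | 3 | 82.49% | 261
Lược sử thời gian | 11677 | VnQtag | 59 | 407 | 6738 | 3 | 71.81% | 0
Lược sử thời gian | 11677 | VietTreeBank | 18 | 407 | 6738 | 3 | 86.74% | 720
Lược sử thời gian | 11677 | Basic tagset | 8 | 407 | 6738 | 3 | 87.35% | 826
Lược sử thời gian | 11677 | Tagset 3 | 25 | 407 | 6738 | 3 | 75.15% | 184
Lược sử thời gian | 11677 | Tagset 4 | 40 | 407 | 6738 | 3 | 75.60% | 241
Những bài học nông thôn | 8247 | VnQtag | 59 | 242 | 3845 | 2 | 79.47% | 0
Những bài học nông thôn | 8247 | VietTreeBank | 18 | 242 | 3845 | 2 | 88.34% | 172
Những bài học nông thôn | 8247 | Basic tagset | 8 | 242 | 3845 | 2 | 88.83% | 362
Những bài học nông thôn | 8247 | Tagset 3 | 25 | 242 | 3845 | 2 | 82.76% | 43
Những bài học nông thôn | 8247 | Tagset 4 | 40 | 242 | 3845 | 2 | 83.25% | 126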
Table 5 shows that the first document has 31520 words and a total of 15706 frames.
The threshold is calculated as 0.03% of the total number of frames, so the threshold
here is approximately 5. When we ran the experiment on the first document with the
different tagsets, we obtained the purity values 60.69%, 82.86%, 82.06%, 69.71% and
79.09%, and the numbers of lost ambiguous words 0, 331, 397, 55 and 137,
respectively. The share of ambiguous words in the total number of words in the
document is small: 0%, 1.05%, 1.26%, 0.17% and 0.43%. The other documents can
be read in the same way.
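The percentages quoted for the first document follow directly from the word count and the lost-ambiguity counts in Table 5. The short Python check below reproduces them; the variable names are chosen for illustration only.

```python
# Figures for the first document, taken from Table 5.
words = 31520
lost_ambiguous = {
    "VnQtag": 0,
    "VietTreeBank": 331,
    "Basic tagset": 397,
    "Tagset 3": 55,
    "Tagset 4": 137,
}

# Share of lost ambiguous words in the document, in percent.
shares = {name: round(100 * n / words, 2) for name, n in lost_ambiguous.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
# VnQtag: 0.00%
# VietTreeBank: 1.05%
# Basic tagset: 1.26%
# Tagset 3: 0.17%
# Tagset 4: 0.43%
```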
In conclusion, three tagsets, namely VietTreeBank, the basic tagset and tagset 4, are
rated higher than the other tagsets, because their purity values are high and the
number of lost ambiguous words is quite small.
2.2. Possibility of Tagsets convertibility
The existence of different tagsets for the same language gives linguists more tagset
options. For English, there are tagsets such as the Brown tagset in 1967 (87 tags), the
Susanne tagset in 1987 (353 wordtags), the Penn Tree Bank tagset in 1991 (36 tags),
and the IBM Lancaster tagset in 1993 (132 tags). To make the right choice,
researchers have studied the relationships between tagsets as well as their specific
applications.
For Vietnamese, there are three notable tagsets: VnQtag (59 tags), VnPos (15 tags)
and VietTreeBank (18 tags). Some Vietnamese linguistic researchers advocate a
minimal tagset, that is, they are interested in smaller tagsets. With a small tagset,
tagging is easier and costs less time and money. Therefore, we want to test conversion
from a large tagset into a small one. Of course, the reverse direction always holds. In
the first direction, some words can lose a tag distinction. This is not a good sign.
However, if their number is small, we can simply add some contextual or syntactic
information to resolve them. For instance, Daniel Zeman (2008) used Interset (tagset
drivers) to convert a source tagset into a target one, and Bartosz Zaborowski and
Adam Przepiórkowski (2012) used a set of rules to convert particular tags.
In our thesis, we emphasize the ability to convert from one tagset to another. What we
wish to find out is whether any large tagset can always be converted easily into a
small tagset with a minimal number of ambiguous words. Ambiguous words are words
that lose a distinction made by finer tags once they are mapped to the target tagset. In
particular, we proceeded as follows:
• Identify the tagsets that we want to check.
• Identify the annotated corpus as well as the tagger.
• Count the number of words belonging to each POS tag of the tagset.
• Collect statistics:
o The number of ambiguous tokens in the corpus (when we convert a large
tagset into a small one, some tags of the large tagset are merged to
correspond to tags of the small tagset).
o The number of ambiguous word types in the corpus.
• Compute the percentage of ambiguous tokens and word types.
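The counting steps above might be sketched in Python as follows. This is an interpretation rather than the thesis code: a word type is treated as ambiguous when at least two of its source tags map to the same target tag, every token of such a type is counted as an ambiguous token, and the function name, the toy tokens and the mapping are all hypothetical.

```python
from collections import defaultdict


def ambiguity_stats(tagged_tokens, mapping):
    """Count word types and tokens that lose a tag distinction under `mapping`.

    `tagged_tokens` is a non-empty list of (word, source_tag) pairs.
    """
    source_tags = defaultdict(set)
    for word, tag in tagged_tokens:
        source_tags[word].add(tag)

    ambiguous_types = set()
    for word, tags in source_tags.items():
        targets = [mapping.get(tag, tag) for tag in tags]
        # Two distinct source tags ending up on one target tag: distinction lost.
        if len(set(targets)) < len(targets):
            ambiguous_types.add(word)

    ambiguous_tokens = sum(1 for word, _ in tagged_tokens if word in ambiguous_types)
    return {
        "ambiguous_tokens": ambiguous_tokens,
        "ambiguous_types": len(ambiguous_types),
        "token_pct": 100 * ambiguous_tokens / len(tagged_tokens),
        "type_pct": 100 * len(ambiguous_types) / len(source_tags),
    }


# Toy data: "ở" carries two fine-grained verb tags that the mapping merges.
tokens = [("ở", "Vits"), ("ở", "Vitn"), ("ở", "Vitn"), ("mẹ", "Np"), ("tôi", "Pp")]
stats = ambiguity_stats(tokens, {"Vits": "V", "Vitn": "V"})
print(stats["ambiguous_tokens"], stats["ambiguous_types"])  # 3 1
```

Reporting both token and type percentages separates the case of a few very frequent ambiguous words from that of many rare ones.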
Result of tagset convertibility
The input of this method is two tagsets: VnQtag and VietTreeBank. We used the
probabilistic Qtag tagger and Vn Tagger to tag a folder containing 7 documents
(Hoàng tử bé, Chuyện tình, Lược sử thời gian, Những bài học nông thôn, Chiến tranh
cục bộ, Muối của rừng, An Dương Vương) with the two tagsets, respectively.
We then compared the outputs to reach the final conclusion.
Table 6. Some properties in tagset convertibility method in Hoangtube
Note that “word” in the table above means word type, not token; that is, each word is
counted only once. Besides, the experiment was performed on one document
(hoangtube), so the percentage of ambiguous words is quite small. The number of
ambiguous words is sometimes large, so the table lists only some cases, not all of
them.
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 

Tagset evaluation and automatical error verrification in pos tagged corpus.pdf

  • 1. VIETNAM NATIONAL UNIVERSITY HANOI UNIVERSITY OF ENGINEERING AND TECHNOLOGY THI-THANH-TAM DO TAGSET EVALUATION AND AUTOMATICAL ERROR VERRIFICATION IN POS TAGGED CORPUS MASTER THESIS (Natural language processing) Ha Noi - 2012
  • 2. ii VIETNAM NATIONAL UNIVERSITY HANOI UNIVERSITY OF ENGINEERING AND TECHNOLOGY THI-THANH-TAM DO TAGSET EVALUATION AND AUTOMATICAL ERROR VERRIFICATION IN POS TAGGED CORPUS Branch of knowledge: Information technology Major: Computer science Code: 60 48 01 MASTER THESIS Supervisor: Dr. Nguyen Phuong Thai Ha Noi - 2012
  • 3. TABLE OF CONTENTS
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
NOTATIONS/ABBREVIATIONS
ORIGINALITY STATEMENT
ABSTRACT
CHAPTER 1: INTRODUCTION AND MOTIVATION
1.1. Characteristics of Vietnamese language
1.2. Vietnamese part of speech
1.2.1. Criteria to classify
1.2.2. The ways to build up tagset
1.3. Corpora
1.3.1. VietTreeBank
1.3.2. VnQtag
1.4. Motivation
1.5. Organization of the thesis
CHAPTER 2: EVALUATING DISTRIBUTIONAL PROPERTIES - CONVERSION POSSIBILITY OF TAGSETS IN VIETNAMESE
2.1. Tagset evaluation
2.1.1. Introduction
2.1.2. Tagset
2.1.3. A method for evaluating distributional properties of tagsets
2.1.3.1. Internal criterion
2.1.3.2. External benchmark
2.1.3.3. Algorithm
2.1.4. Result of tagset evaluation
  • 4. 2.2. Possibility of tagset convertibility
Result of tagset convertibility
CHAPTER 3: AUTOMATIC ERROR VERIFICATION OF POS-TAGGED CORPUS
3.1. Concept related to variation n-gram method
3.2. Types of Vietnamese tagging error
3.3. An algorithm for detecting errors
3.4. Classifying variations
3.5. Result of detecting errors in POS tagging
3.6. Word segmentation
3.6.1. Word in Vietnamese
3.6.2. N-gram in word segmentation
3.6.3. Result of detecting errors in word segmentation
CHAPTER 4: CONCLUSION AND SUMMARY
BIBLIOGRAPHY
APPENDIX
A.1. The Vietnamese treebank tagset
A.2. Vietnamese Tagset (VietTreeBank)
A.3. Tagset 3 (25 tags)
A.4. Tagset 4 (40 tags)
A.5. Syntax function tags in VTB
A.6. Adverbial classification tag of verb in VTB
A.7. Phrase tagset in VTB
A.8. Clause tagset in VTB
  • 5. LIST OF FIGURES
Figure 1. The features of Vietnamese type
Figure 2. Purity as external evaluation criterion for cluster quality. Majority class and number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ⋄, 3 (cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.
Figure 3. N-gram and variation nuclei in VTB corpus with n up to 29
  • 6. LIST OF TABLES
Table 1. The expression of grammatical meaning in Vietnamese
Table 2. Corpus with VnQtag tagset annotation
Table 3. Principal differences between Vietnamese and English
Table 4. Some frames found in the corpus
Table 5. Result of the tagset evaluation method
Table 6. Some properties of the tagset convertibility method in Hoang tu be
Table 7. Statistics of ambiguous word types in the VnQtag corpus
Table 8. Statistics of ambiguous tokens in the VnQtag corpus
Table 9. Detailed statistics of ambiguous word types in the VnQtag corpus
Table 10. Statistics of errors in the corpus
Table 11. Details of n-grams in the tagged corpus
Table 12. Statistics of errors and ambiguities in the word segmentation algorithm
Table 13. Details of contexts and variations in the VTB corpus
  • 7. CHAPTER 1: INTRODUCTION AND MOTIVATION
1.1. Characteristics of Vietnamese language
Every language has its own features, and Vietnamese is no exception. To understand Vietnamese better, we list some of its prominent features and compare it with other languages such as Chinese and English. According to M. Ferlus and other domestic and international researchers, Vietnamese is an indigenous language that belongs to the Austroasiatic (South Asian) language family, Mon-Khmer branch, and is closely related to the Muong language. Vietnamese is also an isolating language, with three prominent features. First, the syllable is the basic unit from which words and sentences are formed. A syllable may be a word by itself, or an element of a complex word, a compound word, or a reduplicated word. Second, Vietnamese words are not inflected. In particular, there is no difference between singular and plural nouns; for example, “hai cuốn sách” (two books) and “một cuốn sách” (one book). Third, grammatical meaning is expressed mainly through word order and expletives (function words). Given expletives such as “sẽ, đã, không” and the sentence “Tôi ra ngoài” (I go out), we can form three sentences with different meanings: “Tôi sẽ ra ngoài” (I will go out); “Tôi đã ra ngoài” (I went out); “Tôi không ra ngoài” (I do not go out).
Figure 1. The features of Vietnamese type: (1) the syllable is the basic unit for forming words and sentences; (2) Vietnamese words are not inflected; (3) grammatical meaning is expressed mainly through word order and expletives.
Some other languages, such as Chinese and Thai, are also isolating languages, whereas English, French, and Russian are inflectional languages. The two types therefore differ in several respects, as a comparison of Vietnamese, English, and Chinese sentences shows.
  • 8. Table 1. The expression of grammatical meaning in Vietnamese
Method | Vietnamese | Chinese | English
Word order | Tôi yêu anh ấy → Anh ấy yêu tôi | Wo ai ta → Ta ai wo | I love him → He loves me
Expletive | Tôi không yêu anh ấy | Wo bu ai ta | I do not love him
Unlike in Vietnamese and Chinese, when the word order changes in the English sentence above, the object pronoun turns into a subject pronoun (him → he).
1.2. Vietnamese part of speech
1.2.1. Criteria to classify
In European languages, the notion of POS is tied to morphological categories such as gender, number, mood, and so on. In Vietnam, there are two views:
- First, POS does not exist in Vietnamese because Vietnamese has no morphological inflection (Le Quang Trinh, Nguyen Hien Le, Ho Huu Tung).
- Second, like European languages, Vietnamese does have POS, but classifying words into tags, or defining the POS of a word, must be based on certain criteria.
So far, Vietnamese linguists have largely agreed on the following criteria (Diep Quang Ban, Hoang Van Thung, 2010):
a. General meaning: “The meaning of a POS is the general meaning of a group of words, based on generalization over the vocabulary to form a common grammatical category (lexical-grammatical category)”. POSs fit the definition of a classification category: they are groups containing very large numbers of words, and each group has a classifying feature such as object, quality, action, or state. Therefore nhà, bàn, chim, học sinh, con, quyển, sự, and so on are classified as nouns, because their lexical meaning is generalized and abstracted as objects; their grammatical category is the noun.
b. Combination ability: Given their general meaning, words can take part in a meaningful combination: some words can replace each other in a certain position of the combination, while the rest of the combination provides the context in which this replacement is possible. For example, nhà, bàn, chim, cát, and so on can appear and replace each other in combinations of the type nhà này, chim này, cát này, etc., and are therefore classified as nouns.
  • 9. c. Syntax function: When words take part in building a sentence, they can occupy one or several fixed positions, can replace each other in those positions, and express the same syntactic relation to the other parts of the sentence; such words can be classified into one POS. For instance, words such as nhà, bàn, chim, cát are nouns: they may serve as subjects of sentences, and the subject function is a syntax function used to classify them as nouns.
1.2.2. The ways to build up tagset
At present, two kinds of POS tagsets have been developed, of which the first has received much more attention from linguistic researchers. The first kind is based on 8 basic POS tags that are widely used in dictionaries and linguistic materials: noun, verb, adjective, pronoun, adverb, conjunction, interjection, and emotive word. From these 8 basic tags, finer tagsets are built. Each researcher relies on certain criteria to refine the tagset (the criteria are discussed in section 1.2.1). Notably, the VnPOS tagset of Tran Thi Oanh contains 14 tags; VietTreeBank consists of 17 tags; and VnQtag has 59 tags (see appendix). The second kind is built by mapping a tagset from another language onto Vietnamese, based on the association between words of the two languages (Dinh Dien and Hoang Kiem, 2003).
1.3. Corpora
Annotated corpora are large bodies of text with linguistically informative mark-up. They play an important role in current work in computational linguistics, so great attention has gone into developing them. Many countries have their own corpora. Some well-known corpora are the British National Corpus (Leech et al., 1994), the Penn Treebank (Marcus et al., 1993), the German NEGRA Treebank (Skut et al., 1997), and the Lancaster Corpus of Mandarin Chinese (Tony McEnery and Richard Xiao, 2005). In Vietnam, the notable corpora are VnQtag, VnPos, and VTB. To build a corpus, some obligatory criteria must be met (McEnery and Wilson, 2001, p. 29):
- Sampling and representativeness: the elements of a corpus must be general, diverse, and plentiful. A sample is representative if what we find for the sample also holds for the general population.
  • 10. - Finite size: the bigger a corpus is, the more highly it is valued, but its size is still finite.
- Machine-readable form
- Standard reference
Building a large corpus by hand takes much time and requires extensive linguistic knowledge, and even then the quality of a manually built corpus is not guaranteed. This thesis therefore investigates and improves corpus quality. The two corpora used in our experiments are VietTreeBank and VnQtag; we now discuss how they were built.
1.3.1. VietTreeBank
VietTreeBank is the result of the national project VLSP and was developed by the VTB group (Nguyen Phuong Thai, Vu Luong, Nguyen Thi Minh Huyen, and annotators). The corpus includes 142 documents on political and social topics from the Youth newspaper, corresponding to 10,000 Vietnamese sentences with syntactic annotation (word segmentation, POS tagging, syntactic structure). The group used MEM and CRF machine learning models to assign POS tags; the accuracy of the models is over 93%. VTB was developed to support the building of tools for word segmentation, POS tagging, syntactic parsing, and so on. The VTB group chose two criteria to classify POS: combination ability and syntactic function. For instance, a noun can serve as subject or object in a sentence, and it can combine with numerals (three, four) and attributes (each, every). A POS tag can contain information about the basic word class (noun, verb, adjective, and so on), morphological information (countable or uncountable), subcategory (verb taking a noun, verb taking a clause, etc.), semantic information, or other syntactic information. The VTB group built its tagset from the basic word classes only, without other information such as morphology or subcategory (see the tagset in the appendix). In addition to POS information, the group describes basic syntactic elements such as phrases and clauses. Syntax tags are the most fundamental information in a syntax tree; they form the spine of the tree. A7 and A8 in the appendix list the phrase and clause tagsets, respectively.
  • 11. The function tag of a syntactic element expresses its role within the higher-level syntactic element. Function tags are assigned to the main elements of a sentence, such as subject, predicate, and object. They provide information that helps identify basic grammatical relations such as the following:
- Subject – Predicate
- Predicate
- Combination
- Complement
- ……
The tagging process for each sentence in the corpus consists of three steps: word segmentation, POS tagging, and syntactic parsing.
1.3.2. VnQtag
The VnQtag tagset was built within the KC01 national project by a development group including Nguyen Thi Minh Huyen, Vu Xuan Luong, and Le Hong Phuong. The group based its work on a printed dictionary (the Vietnamese dictionary of the Institute of Linguistics, 2000). First, they segmented sentences into words using a syllable automaton and a lexical automaton. Then they used the Qtag tagger to assign POS labels to Vietnamese words. There are 59 POS labels (see appendix). In addition to grammatical information, the group used semantic information (the general meaning of a word) to classify words into the 59 word-class labels. For example, words are considered verbs if they express the general meaning of a process. Process meaning is expressed directly in the action of an object; this is action meaning. State meaning is generalized from the relation with the action of an object in time and space (Vietnamese grammar of Diep Quang Ban and Hoang Van Thung). The automatic tagging experiment was carried out on the 7 documents listed in Table 2. The annotated corpus plays an important role in NLP: it is a database of high-quality linguistic resources that follows international standards for data representation. The resulting corpus has the following format: each lexical unit and its POS stand on one line; syllables within a lexical unit are separated by spaces, and a tab separates the word from its POS. Punctuation marks and other symbols in the text are treated as lexical units and labelled with the corresponding punctuation tag. The corpus includes 7 documents of different types, such as story, novel, science, and press.
  • 12. It gathers common words used in daily life and in the press, as well as words usually seen in literary works and scientific and technical terms.
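The one-unit-per-line format described above can be read with a few lines of Python. This is only a minimal sketch under stated assumptions: the function name `parse_tagged_lines`, the sample text, and the tag names `N`, `V`, and `PUNCT` are placeholders chosen for illustration and are not the actual VnQtag labels or the authors' tooling.

```python
from collections import Counter


def parse_tagged_lines(text):
    """Parse lines of the form 'lexical unit<TAB>tag' into (word, tag) pairs.

    Syllables inside a lexical unit are separated by spaces, so only the
    last tab on each line is treated as the word/tag separator.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, tag = line.rsplit("\t", 1)
        pairs.append((word, tag))
    return pairs


# Hypothetical sample; the tag names are placeholders, not VnQtag labels.
sample = "học sinh\tN\nđọc\tV\nsách\tN\n.\tPUNCT\n"

pairs = parse_tagged_lines(sample)
tag_counts = Counter(tag for _, tag in pairs)

# Processing units include punctuation; lexical units exclude it,
# mirroring the two count columns reported for the corpus.
processing_units = len(pairs)
lexical_units = sum(1 for _, tag in pairs if tag != "PUNCT")
```

Splitting on the last tab keeps multi-syllable units such as "học sinh" intact as a single lexical unit.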
  • 13. Table 2. Corpus with VnQtag tagset annotation
Id | Document | Type | Number of lexical units | Number of processing units (including punctuation)
1 | Hoang tu be | Story | 15532 | 18663
2 | Chuyen tinh ke truoc luc rang dong - part I | Novel | 14277 | 16787
3 | Chuyen tinh ke truoc luc rang dong - part II | Novel | 12499 | 14698
4 | Luoc su thoi gian | Science | 10598 | 11626
5 | Muoi cua rung | Story | 3117 | 3573
6 | Nhung bai hoc nong thon | Story | 6682 | 8244
7 | Cong nghe va he thong phong thu quoc gia | Press | 1028 | 1162
1.4. Motivation
So far, the reader may not yet see which problems this thesis addresses or why I chose them; this section discusses both. Linguistic theories were first developed to describe Indo-European languages, and they have produced many significant achievements. In our country, NLP research began in 1990, but the results achieved are still limited. Moreover, Vietnamese language processing is the responsibility of the Vietnamese themselves; we cannot expect foreign researchers to solve it (Ho Tu Bao, 2001). This thesis therefore aims to contribute to Vietnamese processing by concentrating on improving tagsets and detecting tagging errors. Natural language processing is done in five stages:
- Morphological and lexical analysis: The lexicon of a language is its vocabulary, which includes its words and expressions. Morphology is the identification, analysis, and description of the structure of words. Words are generally accepted as the smallest units of syntax. Syntax refers to the rules and principles that govern the sentence structure of any individual language.
  • 14. Lexical analysis: The aim is to divide the text into paragraphs, sentences, and words. Lexical analysis cannot be performed in isolation from morphological and syntactic analysis.
- Syntactic analysis: The words in a sentence are analysed to determine the grammatical structure of the sentence. The words are transformed into structures that show how they relate to each other. Some word sequences may be rejected if they violate the rules of the language for how words may be combined.
- Semantic analysis: It derives an absolute meaning from context; it determines the possible meanings of a sentence in a context.
- Discourse integration: The meaning of an individual sentence may depend on the sentences that precede it and may influence the meaning of the sentences that follow it.
- Pragmatic analysis: It derives knowledge from external commonsense information; it means understanding the purposeful use of language in situations, particularly those aspects of language that require world knowledge. For example, the sentence “Do you know what time it is?” should be interpreted as a request.
Our thesis concentrates on the first stage (i.e. morphological analysis). It is a very important preprocessing step for later stages such as syntactic and semantic analysis. The thesis addresses two major problems and two minor ones. The major problems are evaluating tagsets and detecting tagging errors automatically; the minor ones are checking the convertibility of tagsets and detecting segmentation errors automatically.
a. Evaluating tagsets and their convertibility
In the previous section, we mentioned several tagsets: VietTreeBank (17 tags), VNPOS (15 tags), and VNQTag (59 tags). Such inconsistent tagsets raise questions: which tagset is better? What methods can evaluate these tagsets, and how can we choose the right set of POS tags for a given application? The first part of this thesis focuses on answering these questions.
  • 15. Another aspect we discuss is the convertibility of tagsets. The choice of tagset strongly affects the difficulty of POS tagging: a large tagset makes tagging harder, while a small one may not carry enough information for a given purpose. It is therefore necessary to balance quality and quantity in a tagset, that is:
- Information quality: the tagset should be as informative as possible (i.e. words are classified into more parts of speech based on concrete meaning).
- Possibility of tagging: the number of POS tags should be as small as possible.
To balance these two requirements, we carry out experiments on a source tagset (ST) and a target tagset (TT), count the number of ambiguous words that arise in the conversion, and draw conclusions from the result.
b. Detecting POS tagging and word segmentation errors
- If each word belonged to only one label, a finite dictionary of words and their labels would solve the POS tagging problem completely. In practice, however, a word can belong to more than one label, which leads to ambiguity and errors in POS tagging. Fixing these errors by hand costs much time and money, so we want a method that detects errors automatically.
- Vietnamese word segmentation is also a thorny issue. A sentence may be segmented in several ways. For example, “chiếc xe đạp nặng quá” can be segmented as follows. Way 1: chiếc/ xe/ đạp/ nặng/ quá. Way 2: chiếc/ xe đạp/ nặng/ quá. Here “/” separates words. Both segmentations are acceptable because each yields a meaningful sentence.
Word segmentation is therefore the last problem addressed in this thesis; some of the reasons for its difficulty are listed in the following table:
  • 16. Table 3. Principal differences between Vietnamese and English
Characteristic | Vietnamese | English
Basic unit | Syllable | Word
Prefix or suffix | No | Yes
Part of speech | No agreement | Clearly defined
Word boundary | Context-dependent meaningful combination of syllables | Blank or delimiters
All of the reasons above motivated me to look for answers to these problems.
1.5. Organization of the thesis
The thesis is organized into four main chapters:
Chapter 1: Introduction and motivation. This chapter gives a general picture of Vietnamese, including its features and its parts of speech, and explains why I chose the topic of the thesis.
Chapter 2: Evaluating distributional properties and conversion possibility of tagsets in Vietnamese. This chapter looks more closely at tagsets, for instance how to build a tagset and how to merge labels, and introduces the basic notions needed to evaluate the properties of tagsets.
Chapter 3: Automatic error verification of POS-tagged corpus. This chapter introduces the notions related to the error detection method, then presents the algorithm and discusses how to classify variations into errors or ambiguities.
Chapter 4: Summary and conclusion. This chapter discusses three issues: the theoretical contributions of the thesis, its experimental contributions, and new directions. It sums up what we have achieved and discusses further work to be addressed in the future.
  • 17. CHAPTER 2: EVALUATING DISTRIBUTIONAL PROPERTIES - CONVERSION POSSIBILITY OF TAGSETS IN VIETNAMESE
2.1. Tagset evaluation
2.1.1. Introduction
Tagset evaluation has received much attention from NLP researchers for more than 20 years. It allows us to test and assess the impact of tagset modifications on results by using different versions of a given tagset on the same texts (Martin Volk and Gerold Schneider, 1998). In 2000, Saso Dzeroski, Tomaz Erjavec, and Jakub Zavrel compared the accuracy of tagsets derived by decreasing the cardinality of the original tagset: omitting certain attributes, or omitting almost all attributes except certain ones. Accuracies were computed using a black-box combiner (Halteren, Dzeroski). In the same year, Hervé Déjean presented two kinds of tagset evaluation: a global one and a local one. The first evaluates the initial grammar generated by ALLiS. The second uses the notion of reliability, where the reliability of an element is the ratio between its frequency in the structure and its total frequency in the corpus. For Indian languages, Madhav Gopal, Diwakar Mishra, and Devi Priyanka Singh (2010) discussed several evaluated tagsets: the ILMT tagset, the JNU-Sanskrit tagset, the LDCIL tagset, and the Sanskrit consortium tagset. Vietnamese is an isolating language, and word order is an important source of syntactic information. To evaluate Vietnamese tagsets, this chapter introduces a simple method that uses internal and external criteria. The internal criteria use frequent frames and purity to check whether tags are assigned accurately. The external criteria examine whether information quality is retained when the cardinality of the tagset is reduced. Indeed, a number of evaluations have shown that many tagging errors are caused by overly fine distinctions within major categories (Eugenie Giesbrecht, 2008).
  • 18. 2.1.2. Tagset
    A POS is a set of words that share some grammatical characteristics, and each POS differs in its grammatical characteristics from every other POS. For example, nouns have different properties from verbs, which in turn have different properties from adjectives, and so on. A tagset is a set of POS tags built according to the criteria described in 1.2. Tagsets therefore vary in the number of tags and are used in various applications.
    Properties of a tagset: a tagset needs to guarantee the following properties: it retains linguistic features, reflects syntactic structure, can be assigned accurately, and reduces the number of ambiguous words during tagging.
    2.1.3. A method for evaluating distributional properties of tagsets
    2.1.3.1. Internal criterion
    Among the properties of tagsets, we give the most weight to how accurately tags can be assigned in a corpus. This is the internal criterion. We assess it through the notion of a frame and a purity formula. The frame represents the local context under review and indicates which tags can appear in it. The purity formula then measures how strongly the tags converge within that local context.
    Discussion about purity
    As mentioned previously, we use the purity formula, known in the clustering literature as an external evaluation criterion for cluster quality, to evaluate the tagset (Stanford natural language processing, 357). Purity is a widely used measure of cluster quality. It is simple and transparent. To compute purity, each cluster is assigned to the class that is most frequent in the cluster, and the accuracy of this assignment is then measured by counting the number of correctly assigned documents and dividing by N. Formally:
    $$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j| \tag{1}$$
    where $\Omega = \{\omega_1, \ldots, \omega_K\}$ is the set of clusters and $C = \{c_1, \ldots, c_J\}$ is the set of classes.
  • 19. We interpret $\omega_k$ as the set of documents in cluster $\omega_k$ and $c_j$ as the set of documents in class $c_j$ in equation (1). High purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each document gets its own cluster. For example:
    [Figure 2. Purity as external evaluation criterion for cluster quality. Three clusters containing 17 items of classes x, o, and ⋄.]
    The majority class and the number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ⋄, 3 (cluster 3). Purity is $\frac{1}{17}(5 + 4 + 3) \approx 0.71$.
    Frame notion
    The frame notion was mentioned in 2006 by Mintz. In 2010, Dickinson and Jochim redefined it as follows: in a local context, a frame consists of three words, in which the two words surrounding a target word lead to the target's categorization. We use frames to test the quality of distributional mappings. In English, for example, the frame "you_it" generally predicts a verbal category for the target (i.e., the target word may be hit, beat, eat, or kiss). In Vietnamese, the frame "mẹ_là" leads to a target word that is a pronoun (Pp), e.g., "tôi, anh, chị, bác".
    To obtain more exact results, we use a frequency threshold and the notion of a frequent frame. Frequent frames supply category information in child language corpora. Not all frames play the same role in a corpus: the more often a frame appears, the more linguistic information it concentrates. We set the frequency threshold at 0.03% of the total number of frames. For example, if a corpus has 10000 frames, the threshold is 3 (10000 * 0.03%), so a frame that appears at least 3 times is considered a frequent frame.
    Next, the purity formula is applied to measure how tags are distributed within a frame, since each tag accounts for a different share of the occurrences of a frame. To calculate the purity value, we take only the highest tag frequency in each frame, add these values over all frames, and divide by the total number of occurrences of all tags. If the
  • 20. purity value is higher, then words are more likely to be tagged accurately. For instance, suppose we have two frames: Tôi_ở and mẹ_bảo. The first frame appears 4 times in a corpus, with the target word tagged Vits once and Vitn 3 times. The second frame appears 8 times, with the target word tagged Np 7 times and Pp once. The purity value is then $(3 + 7) / (4 + 8) = 10/12 \approx 83.3\%$.
    2.1.3.2. External benchmark
    To evaluate a tagset, linguists usually map it onto a reduced tagset, because this shows which linguistic features are retained. The reduced tagset is built by merging tags, but how should the tags be merged? This is a difficult question that we need to answer. Hervé Déjean (2000) discussed the theoretically minimal tagset and argued that the quality of a tagset does not depend on the number of tags. The goal was the minimum tagset necessary to parse a sentence, whatever the domain. The starting point was a tagset with one tag per structure (NP-VP), and the estimate was that a tagset of about 20 tags is enough to parse a sentence into PS and clause structures.
    There are many ways to merge labels, so tagsets with different numbers of tags continue to exist. English is a morphologically marked language, so it is fairly easy to identify tags that can be merged, such as the base form verb (VB) and the present tense verb (non-third person singular, VBP). Vietnamese offers no such cues. The tagsets used in this thesis are of two kinds. First, we use tagsets built by earlier NLP researchers, for instance VnQtag and VietTreeBank. Second, we merge some labels ourselves, based on features of Vietnamese. VnQtag has the largest number of tags, so we use it as the source tagset from which the other tagsets are generated.
    2.1.3.3. Algorithm
    To make the theory above concrete, we introduce an algorithm of 5 steps that runs on a tagged corpus, as follows.
    1.
Identify every word and its POS tag in the corpus, and store them together with their positions.
  • 21. 2. Count the frames in the corpus, then compute the frequency threshold from the total number of frames.
    3. Find the frequent frames and compute their purity value.
    4. Map the original tagset to the new, reduced tagsets.
    5. Finally, compute the new purity value under each new tagset and count the lost ambiguous words.
    Preparing data: we ran this method on a corpus annotated with the VnQtag tagset.
    2.1.4. Result of tagset evaluation
    The experiments were performed on the VnQtag corpus, which includes four documents annotated with VnQtag. We then merged some VnQtag tags to form two new tagsets: tagset 3 and tagset 4. In total we have: VietTreeBank (18 tags), basic tagset 2 (8 tags), tagset 3 (25 tags), and tagset 4 (40 tags) (see the appendix). To merge tags we relied on the book Ngữ pháp tiếng Việt by Diệp Quang Ban, which organizes the Vietnamese POS system into two groups:
    - Group 1: Noun, Verb, Adjective, Numeral, Pronoun
    - Group 2: Adjunct (determiner, adverb), Conjunction, Particle
    Each major category is further divided into finer classes. For example, nouns are of two main kinds: proper nouns and common nouns. Common nouns contain synthetic and non-synthetic nouns, and both are further divided into countable and uncountable nouns, and so on. To obtain the 25-POS and 30-POS tagsets, we merged some noun and verb tags. These are basic categories and contain the largest number of words in Vietnamese. In the VnQtag tagset, nouns are divided into 8 tags and verbs into 10 finer tags. Using the four VnQtag annotated documents and the four tagsets, we obtained the results in table 4 and table 5.
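Steps 1 to 3 of the algorithm can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the implementation used in the thesis: the function names are hypothetical, sentence boundaries are ignored, the total is taken to be the number of distinct frames, a frame counts as frequent when its count reaches the threshold, and the threshold is rounded up (which happens to reproduce the thresholds 5, 3, 3, and 2 reported in table 5).

```python
from collections import Counter, defaultdict


def extract_frames(tagged):
    """Step 1-2: map each frame (left word, right word) to the tag counts
    of the target word found between them. `tagged` is a list of (word, tag)."""
    frames = defaultdict(Counter)
    for i in range(1, len(tagged) - 1):
        left, right = tagged[i - 1][0], tagged[i + 1][0]
        frames[(left, right)][tagged[i][1]] += 1
    return frames


def frequency_threshold(total_frames):
    """0.03% of the total number of frames, rounded up (assumption).
    Integer arithmetic avoids floating-point rounding surprises."""
    return (total_frames * 3 + 9999) // 10000


def frequent_frames(frames, threshold):
    """Step 3a: keep frames whose occurrence count reaches the threshold."""
    return {f: tags for f, tags in frames.items() if sum(tags.values()) >= threshold}


def purity(frames):
    """Step 3b: sum of the majority-tag counts divided by all occurrences."""
    majority = sum(max(tags.values()) for tags in frames.values())
    total = sum(sum(tags.values()) for tags in frames.values())
    return majority / total if total else 0.0


# The two-frame example from section 2.1.3.1.
example = {
    ("Tôi", "ở"): Counter({"Vits": 1, "Vitn": 3}),
    ("mẹ", "bảo"): Counter({"Np": 7, "Pp": 1}),
}
print(f"{purity(example):.2%}")  # prints 83.33%
```

Keeping the frames as a dictionary of tag counters means the same structure can be reused unchanged when the tags are later mapped to a reduced tagset.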
  • 22. Table 4. Some frames found in the corpus
    Frame (Frequency) | POS (Frequency) | Frame (Frequency) | POS (Frequency) | Frame (Frequency) | POS (Frequency)
    mẹ_là (4) | Pp (4) | Tôi_ở (4) | Vits (1), Vitn (3) | chẳng_gì (3) | Vte (1), Na (2)
    Tôi_nông dân (3) | Vla (3) | nhà_ở (3) | Np (1), Pp (2) | cái_tre (2) | No (2)
    Còn_sinh (3) | Pp (3) | ba_Phúc (2) | Nh (2) | với_đứa (2) | Nn (2)
    sinh_nông thôn (3) | Cm (3) | Con_nhỏ (2) | No (2) | dăm_trẻ (2) | Nu (2)
    đứa_dâng (2) | Nh (2) | trẻ_đào (2) | Vta (2) | có_người (3) | Aa (2), Vtf (1)
    bố_Lâm (3) | Nh (3) | tôi_có (3) | An (1), Vtf (1), Ja (1) | Lâm_thích (2) | Pp (1), Ja (1)
    tôi_lắm (2) | Vtf (2) | có_thì (4) | Vta (2), Nn (1), Np (1) | thì_đỡ (2) | Jd (1), Vitm (1)
    mẹ_bảo (8) | Pp (1), Np (7) | đây_lần (2) | Vla (2) | là_đầu (2) | Nt (2)
    lần_tôi (2) | Nl (2) | tôi_Lâm (4) | Vtd (1), Vtn (1), Cc (2) | bố_bảo (9) | Np (8), Pp (1)
    … | … | … | … | … | …
    Table 4 shows some of the frames found in the corpus. Each entry carries four kinds of information: the content of the frame, the number of times the frame appears, the POS of the target word, and the number of times that POS appears. For example, the first cell contains the frame "mẹ_là". This frame occurs 4 times in the corpus, and in all of them the target word is tagged Pp (pronoun).
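Steps 4 and 5 of the algorithm can be illustrated on three rows of table 4. This is only a sketch: the `coarse` mapping, which keeps the first letter of each VnQtag tag, is a hypothetical stand-in for the real mappings to VietTreeBank, the basic tagset, tagset 3, and tagset 4, which are given in the appendix and not reproduced here.

```python
from collections import Counter

# Three rows copied from table 4: frame -> tag counts of the target word.
TABLE4_ROWS = {
    "Tôi_ở": Counter({"Vits": 1, "Vitn": 3}),
    "mẹ_bảo": Counter({"Pp": 1, "Np": 7}),
    "có_người": Counter({"Aa": 2, "Vtf": 1}),
}


def coarse(tag):
    """Hypothetical reduced tagset: keep only the first letter of the tag."""
    return tag[0]


def map_frames(frames, mapping):
    """Step 4: re-label every tag and merge the counts that now coincide."""
    mapped = {}
    for frame, tags in frames.items():
        merged = Counter()
        for tag, count in tags.items():
            merged[mapping(tag)] += count
        mapped[frame] = merged
    return mapped


def purity(frames):
    """Step 5: majority-tag counts summed over frames, divided by all occurrences."""
    majority = sum(max(tags.values()) for tags in frames.values())
    total = sum(sum(tags.values()) for tags in frames.values())
    return majority / total if total else 0.0


print(f"{purity(TABLE4_ROWS):.2%}")                      # prints 80.00%
print(f"{purity(map_frames(TABLE4_ROWS, coarse)):.2%}")  # prints 86.67%
```

Because merging tags can only keep or enlarge the majority count of a frame, purity computed this way never decreases under a reduced tagset. The count of lost ambiguous words is therefore the figure that separates a good reduction from a poor one.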
  • 23. Table 5. Result of tagset evaluation method
    Document | Words | Mapping | Tags | Frequent frames, total frames, threshold | Purity | Lost ambs
    Chuyện tình kể trước lúc rạng đông | 31520 | VnQtag | 59 | 128, 15706, 5 | 60.69% | 0
     | | VietTreeBank | 18 | | 82.86% | 331
     | | Basic tagset | 8 | | 82.06% | 397
     | | Tagset 3 | 25 | | 69.71% | 55
     | | Tagset 4 | 40 | | 79.09% | 137
    Hoàng tử bé | 18666 | VnQtag | 59 | 453, 7951, 3 | 76.66% | 0
     | | VietTreeBank | 18 | | 88.37% | 590
     | | Basic tagset | 8 | | 88.60% | 892
     | | Tagset 3 | 25 | | 82.05% | 141
     | | Tagset 4 | 40 | | 82.49% | 261
    Lược sử thời gian | 11677 | VnQtag | 59 | 407, 6738, 3 | 71.81% | 0
     | | VietTreeBank | 18 | | 86.74% | 720
     | | Basic tagset | 8 | | 87.35% | 826
     | | Tagset 3 | 25 | | 75.15% | 184
     | | Tagset 4 | 40 | | 75.60% | 241
    Những bài học nông thôn | 8247 | VnQtag | 59 | 242, 3845, 2 | 79.47% | 0
     | | VietTreeBank | 18 | | 88.34% | 172
     | | Basic tagset | 8 | | 88.83% | 362
     | | Tagset 3 | 25 | | 82.76% | 43
     | | Tagset 4 | 40 | | 83.25% | 126
    Table 5 shows that the first document has 31520 words and a total of 15706 frames. The threshold is calculated by the formula 0.03% * total number of frames, so here it is approximately 5 (15706 * 0.03% ≈ 4.7). Running the experiment on the first document with the different tagsets gave purity values of 60.69%, 82.86%, 82.06%, 69.71%, and 79.09%, with 0, 331, 397, 55, and 137 lost ambiguous words respectively. The share of lost ambiguous words in the total number of words in the document is small: 0%, 1.05%, 1.26%, 0.17%, and 0.43%. The other documents can be read in the same way.
    In conclusion, three tagsets, namely VietTreeBank, the basic tagset, and tagset 4, are rated higher than the others, because their purity is high and the number of lost ambiguous words is quite small.
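The percentages quoted for the first document can be re-derived from table 5 with a few lines of Python. The figures are copied from the table; the names `WORDS`, `LOST`, and `lost_percentage` are introduced here only for illustration.

```python
# Word count and lost ambiguous words for "Chuyện tình kể trước lúc rạng đông" (table 5).
WORDS = 31520
LOST = {
    "VnQtag": 0,
    "VietTreeBank": 331,
    "Basic tagset": 397,
    "Tagset 3": 55,
    "Tagset 4": 137,
}


def lost_percentage(lost, words):
    """Share of lost ambiguous words in the document, in percent, to 2 decimals."""
    return round(100 * lost / words, 2)


percentages = {name: lost_percentage(n, WORDS) for name, n in LOST.items()}
print(percentages)
# {'VnQtag': 0.0, 'VietTreeBank': 1.05, 'Basic tagset': 1.26, 'Tagset 3': 0.17, 'Tagset 4': 0.43}
```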
  • 24. 2.2. Possibility of tagset convertibility
    The existence of different tagsets for the same language gives linguists more options. For English there are, among others, the Brown tagset from 1967 (87 tags), the Susanne tagset from 1987 (353 wordtags), the Penn Tree Bank tagset from 1991 (36 tags), and the IBM Lancaster tagset from 1993 (132 tags). To choose well among them, researchers have studied the relationships between tagsets as well as their specific applications. For Vietnamese there are three notable tagsets: VnQtag (59 tags), VnPos (15 tags), and VietTreeBank (18 tags). Some Vietnamese linguists advocate a minimal tagset, that is, they are interested in smaller tagsets. With a small tagset, tagging is easier to perform and costs less time and money. We therefore want to test conversion from a large tagset into a small one. Of course, we assume the reverse direction always holds. In the first direction, some words lose their tag ambiguity, which is not a good sign. However, if the number of such words is small, we can simply add some contextual or syntactic information to recover them. For instance, Daniel Zeman (2008) used Interset (tagset drivers) to convert a source tagset into a target one. Bartosz Zaborowski and Adam Przepiórkowski (2012) used a set of rules that convert particular tags.
    In this thesis we focus on the ability to convert from one tagset to another. What we wish to establish is whether any large tagset can be converted easily into a small tagset with a minimal number of ambiguous words. Ambiguous words are words that lose a distinction between finer tags in the target tagset.
In particular, we proceeded as follows:
    - Identify the tagsets that we want to check.
    - Identify the annotated corpus and the tagger.
    - Count the number of words belonging to each POS tag of the tagset.
    - Collect statistics:
        o The number of ambiguous tokens in the corpus (when we convert a large tagset into a small one, some tags in the large tagset are merged to correspond to tags of the small tagset).
        o The number of ambiguous word types in the corpus.
    - Compute the percentage of ambiguous tokens and word types.
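The counting steps above can be sketched as follows. The sketch rests on one reading of the definition, stated here as an assumption: a word type is ambiguous when it occurs with two or more source tags that the conversion merges into a single target tag, and every token of that word carrying one of the merged tags is an ambiguous token. The function name `ambiguity_stats` and the toy mapping are hypothetical, and the input is assumed to be non-empty.

```python
from collections import defaultdict


def ambiguity_stats(tagged, mapping):
    """tagged: non-empty list of (word, fine_tag); mapping: fine_tag -> coarse tag."""
    # Fine tags seen for each (word, coarse tag) pair.
    fine_tags = defaultdict(set)
    for word, tag in tagged:
        fine_tags[(word, mapping[tag])].add(tag)

    # A pair is ambiguous when the coarse tag hides two or more fine tags.
    ambiguous_pairs = {pair for pair, fine in fine_tags.items() if len(fine) > 1}
    ambiguous_types = {word for word, _ in ambiguous_pairs}
    ambiguous_tokens = sum(
        1 for word, tag in tagged if (word, mapping[tag]) in ambiguous_pairs
    )
    n_types = len({word for word, _ in tagged})
    return {
        "ambiguous_tokens": ambiguous_tokens,
        "ambiguous_types": len(ambiguous_types),
        "token_pct": 100 * ambiguous_tokens / len(tagged),
        "type_pct": 100 * len(ambiguous_types) / n_types,
    }


# Toy data: the word "ở" carries two fine verb tags that collapse into "V".
toy = [("ở", "Vits"), ("ở", "Vitn"), ("ở", "Vitn"), ("mẹ", "Np"), ("tôi", "Pp")]
toy_mapping = {"Vits": "V", "Vitn": "V", "Np": "N", "Pp": "P"}
stats = ambiguity_stats(toy, toy_mapping)
print(stats["ambiguous_tokens"], stats["ambiguous_types"])  # prints 3 1
```

Counting tokens and types separately matters because a single frequent word can account for many ambiguous tokens while adding only one ambiguous type.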
  • 25. Result of tagset convertibility
    The input to this method is two tagsets: VnQtag and VietTreeBank. We used Qtag probability and Vn Tagger to tag a folder containing 7 documents (Hoàng tử bé, Chuyện tình, Lược sử thời gian, Những bài học nông thôn, Chiến tranh cục bộ, Muối của rừng, An Dương Vương) with the two tagsets respectively. We then compared the outputs to reach a final conclusion.
    Table 6. Some properties in tagset convertibility method in Hoangtube
    Note that a word in the table above is a word type, not a token: each word is counted only once. The experiment was performed on one document (Hoangtube), so the percentage of ambiguous words is quite small. The number of ambiguous words is sometimes large, so the table lists only some of the cases, not all of them.