Final Report of Internship Program
at Kyoto Institute of Technology
Ho Xuan Vinh,
Faculty of Information Technology - University of Science
Ho Chi Minh city, Vietnam
Email: hovinh39@gmail.com
Abstract—Developing faster than ever, the Machine Learning approach now dominates the solving of problems ranging from Image Processing to Natural Language Processing. To exploit these advantages, however, scientists must provide clean, well-prepared material from which machines can automatically extract features and learn. That is the goal of this internship: creating a bilingual Vietnamese-English annotated corpus for training further advanced tasks. I met many challenges in finding an appropriate tagset and implementing the method; all are presented below.
Index Terms—LLOCE, UCREL, WordNet, sense tag, semantic annotation
I. INTRODUCTION
Beginning with Alan Turing's famous Enigma-decoding machine, the field of Natural Language Processing has seen many ups and downs in its research and development history. Human-machine interaction has taken a great leap since the 1980s, with encouraging achievements supported by Machine Learning techniques. Unavoidably, the need for large, well-annotated corpora has become one of the most important factors in building a good training system. The more detailed the annotation, the more information can be extracted and the higher the accuracy of the resulting system. Specifically, annotation labels can be divided into three levels:
• Morphological (noun, verb, ...)
• Grammatical (pronoun, ...)
• Semantic (distinguishing different entities that share the same morphology)
Semantic labels are the most distinctive: they not only provide rich, meaningful information about documents, but also help resolve complicated ambiguities through semantic relationships. Accordingly, many efforts have been made to create a universal semantic tagset, such as WordNet, CoreLex, LLOCE, and UCREL.
This would be great news, except that corpora annotated with these tagsets are extremely rare and do not provide enough data for training and testing Machine Learning methods, especially for low-resource languages. This motivates my aim of building an annotated bilingual English-Vietnamese corpus. The following sections trace the timeline of my research during this internship. Extra explanation and discussion of results are added to clarify the changes in each approach. Concluding remarks, together with the work I will continue after returning to my home country, are given at the end.
II. THE HOBBIT - WORDNET
After a short survey on “semantic tagsets”, WordNet turns out to be, without a doubt, the biggest star on stage. Since the mid-1980s, linguists at Princeton University have been building WordNet, an English ontology with more than 117,000 synsets. Synsets are connected to each other by many relationships: hypernymy, hyponymy, meronymy, troponymy, and so on. Each synset, represented as a node or leaf in a tree, contains a list of synonymous words. This structure is a great advantage yet also a disadvantage: the tagset is so fine-grained (Figure 1) that even people cannot tell two labels apart from their synonyms or short definitions alone. For example, three of the synsets the word ‘bank’ belongs to are:
• sloping land (especially the slope beside a body of
water)
• a long ridge or pile
• a slope in the turn of a road or track; the outside is
higher than the inside in order to reduce the effects
of centrifugal force
Understanding these properties, I knew this goal was too hard to achieve: no WordNet dictionary is available for Vietnamese, and the English corpora annotated with WordNet senses, which come from annual contests such as SemEval on finding the best Word Sense Disambiguation method, are small. Since nothing seemed promising from the beginning, I decided to keep looking for another tagset.
Fig. 1. WordNet visualization of Noun and Adjective.
III. THE FELLOWSHIP OF THE RING - LLOCE AND
MAPPING
This is probably the least popular and least attractive tagset you can find via Google Scholar, yet the concept behind the dictionary is easy to grasp and human-friendly. An ordinary dictionary places ‘animal’ and ‘zoo’, or ‘uncle’ and ‘aunt’, far apart because of alphabetical order; in everyday thinking, however, such pairs naturally belong together. The constructors of the Longman Lexicon of Contemporary English (LLOCE) used this idea to cluster words sharing common characteristics into groups. LLOCE is organized into 14 topics of daily life called branches, which are divided into 128 subjects and nearly 2,500 groups (Figure 2).
For example, Branch C (‘People and the family’) consists of:
• Subject ‘People’: Groups from C1 to C19
• Subject ‘Courting, sex, and marriage’: Groups from
C20 to C39
• Subject ‘Friendship and enmity’: Groups from C40
to C49
• ...
Fig. 2. Structure of LLOCE dictionary.
Each group has its own short definition and a set of member words. Statistically, 705 groups contain verbs, 1,482 contain nouns, 429 contain adjectives, and 63 contain adverbs. LLOCE also defines external relations between subjects; however, for lack of evidence about how these relations were created, I left them out of consideration.
I chose LLOCE because of the material I had at hand:
• LLOCE dictionary of both English and Vietnamese.
• Bilingual corpus extracted from examples of
LLOCE dictionary.
The hard part is that the corpus was not annotated, so I did not know how to evaluate the result after annotating, and I could not trace which sentence was the example for which word. Fortunately, during this period I came across the idea of using Latent Dirichlet Allocation (LDA) from the topic modeling branch, which sounded like a promising unsupervised approach.
Fig. 3. Graphical Model of Latent Dirichlet Allocation.
LDA is a generative probabilistic model of a corpus.
The basic idea is that documents are represented as
random mixtures over latent topics, where each topic
is characterized by a distribution over words.
LDA assumes the following generative process for each document w in a corpus D:
1) Choose N ∼ Poisson(ξ).
2) Choose θ ∼ Dir(α).
3) For each of the N words wn:
   • Choose a topic zn ∼ Multinomial(θ).
   • Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn.
Figure 3 shows the factors involved in this technique. In summary, when you feed the algorithm example sentences, you obtain a topic mixture for each of them. This proportion is a clue for determining which group (label) is the most suitable for the word under consideration.
For easy understanding, consider the example illustrated in Figure 4. First, LDA is trained on the dictionary so that it can learn the distribution of words in each topic (we choose 2,500 topics, matching the 2,500 groups of LLOCE); then, using the trained model, you can find the topic mixture of any sentence. For example, the sentence ”He hewed out an important position for himself in the company” has the topic mixture 30% N43, 15% N89, 45% C52, 12.5% B41, with the rest spread over other groups. If the word ”important” has two candidate groups (N43 and N89), we choose the one with the higher ratio, here N43.
Contrary to my expectations, the first step ran into trouble: the model could not learn the word distributions as I wanted. Further discussion with Araki sensei suggested several possible reasons: the number of topics is too large; a document is usually a collection of paragraphs rather than a single sentence; word co-occurrence in such short documents is limited; and the training setup did not accurately simulate the intended use.
In an attempt to fix this, I tried several approaches but still did not reach the goal. Increasing word co-occurrence with DBpedia seemed possible at first, but LLOCE's coverage is too small and the domains are too different. Decreasing the number of topics to the number of branches or subjects did not help either. To obtain an annotated corpus, I tried mapping from LLOCE to WordNet based on a similarity metric over WordNet's tree structure, so that SEMCOR (the WordNet-annotated corpus used in SemEval) could become an LLOCE-annotated corpus. However, despite similar distributions of words and node levels, no deeper insight emerged from this experiment.
During this time, I also implemented a preprocessing and LLOCE package in Python for later use. It offers basic functions such as checking which groups a word belongs to, listing the unique words in each group, and baseline annotation with left-to-right maximum matching.
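The left-to-right maximum matching baseline can be sketched as follows; the lexicon entries and group IDs are hypothetical stand-ins for the LLOCE data:

```python
# Left-to-right maximum matching: at each position, greedily take the
# longest span that appears in the lexicon and attach its group labels.
LEXICON = {
    'important position': ['N43'],
    'important': ['N43', 'N89'],
    'company': ['C52'],
}
MAX_LEN = max(len(k.split()) for k in LEXICON)

def annotate(tokens):
    """Return (span, groups) pairs, preferring the longest match."""
    result, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = ' '.join(tokens[i:i + n])
            if span in LEXICON:
                result.append((span, LEXICON[span]))
                i += n
                break
        else:  # no lexicon entry covers this token
            result.append((tokens[i], []))
            i += 1
    return result

print(annotate('an important position in the company'.split()))
```

Note that ‘important position’ wins over the single-word entry ‘important’, which is exactly why the baseline is “maximum” matching.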
IV. THE TWO TOWERS - UCREL
The UCREL semantic analysis system is a framework
for undertaking the automatic semantic analysis of text.
The framework has been designed and used across a
number of research projects since 1990. The semantic
tagset used by USAS was originally loosely based on
Tom McArthur’s Longman Lexicon of Contemporary
English (LLOCE, 1981). It has a multi-tier structure with 21 major discourse fields (shown in Figure 5), subdivided, with the possibility of further fine-grained subdivision in certain cases.
In fact, most of the research on this tagset was my labmate's task, but when I got stuck with LLOCE, I spent two weeks working with it. This time the idea is simpler. UCREL provides a website where you upload your corpus and receive it annotated in English and five other languages (Dutch, Chinese, Italian, Portuguese, Spanish). You can then run word alignment on your sentence-aligned bilingual corpus and project the labels from English to Vietnamese. The problem is therefore no longer annotating Vietnamese, but projecting labels from English to Vietnamese correctly.
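The projection step itself is simple once alignments exist; a minimal sketch, in which the tags and alignment pairs are hypothetical illustrations (a real pipeline would get alignments from a word aligner such as GIZA++ or fast_align):

```python
# Project per-token semantic tags from an English sentence to its
# Vietnamese translation through word-alignment links.
def project_labels(en_tags, alignment, vi_len):
    """en_tags: one tag per English token.
    alignment: (en_index, vi_index) pairs from a word aligner.
    vi_len: number of Vietnamese tokens."""
    vi_tags = [None] * vi_len  # unaligned tokens stay untagged
    for en_idx, vi_idx in alignment:
        # Copy each English token's tag to its aligned Vietnamese token.
        vi_tags[vi_idx] = en_tags[en_idx]
    return vi_tags

en_tags = ['Z8', 'A11.1', 'Z5']       # illustrative USAS-style tags
alignment = [(0, 0), (1, 2), (2, 1)]  # illustrative alignment links
print(project_labels(en_tags, alignment, 3))  # ['Z8', 'Z5', 'A11.1']
```

The hard cases are exactly the ones the report mentions next: one-to-many alignments and tokens whose part of speech differs across the two languages.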
However, since we had no Vietnamese dictionary to support the projection, things came back to the starting point.
V. THE RETURN OF THE KING - WORDNET
Fig. 4. Idea about implementing LDA for annotating.
Fig. 5. UCREL category system.
My mentor had a quick discussion with me about coming back to WordNet. He now has an English-Vietnamese version of WordNet, and a part of WordNet's SEMCOR corpus translated into Vietnamese. The problem stays the same: making the alignment correct, and finding heuristics for the exceptions across the two languages, such as words that are adjectives in English but nouns in Vietnamese. This is my current work.
VI. CONCLUSION
So far, my progress is not as good as I wished. The time I spent on the initial survey was too short, which left me confused when running experiments. Even after realizing that, I still struggled to fully understand the papers, which contain many complex formulas. Many challenges surely await me before the deadline of my thesis, but I know I have learned how to handle them, and the memories of this place will remind me that whenever there is a hard time, I can always get over it.
Fig. 6. My lab.
ACKNOWLEDGEMENT
Thank you, Araki sensei, for the warm discussions about my work, for the support when I felt confused and lost in my research, and for teaching me to ”learn new things by experimenting with them”. Thank you, Nakano-san and the other students in the lab, for giving me memorable experiences, not only at work but at play as well.

More Related Content

What's hot

AMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITYAMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITY
ijnlc
 
An Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding SystemAn Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding System
inscit2006
 
Improvement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A ReviewImprovement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A Review
inscit2006
 
Cs599 Fall2005 Lecture 01
Cs599 Fall2005 Lecture 01Cs599 Fall2005 Lecture 01
Cs599 Fall2005 Lecture 01Dr. Cupid Lucid
 
Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...
inscit2006
 
NLP
NLPNLP
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Saurabh Kaushik
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLP
Satyam Saxena
 
Portuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and HowPortuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and How
Valeria de Paiva
 
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEYA DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
ijaia
 
A N H YBRID A PPROACH TO W ORD S ENSE D ISAMBIGUATION W ITH A ND W ITH...
A N H YBRID  A PPROACH TO  W ORD  S ENSE  D ISAMBIGUATION  W ITH  A ND  W ITH...A N H YBRID  A PPROACH TO  W ORD  S ENSE  D ISAMBIGUATION  W ITH  A ND  W ITH...
A N H YBRID A PPROACH TO W ORD S ENSE D ISAMBIGUATION W ITH A ND W ITH...
ijnlc
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Saurabh Kaushik
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingMariana Soffer
 
Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
DigiGurukul
 
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic AmbiguityEffective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity
IDES Editor
 
11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)
ThennarasuSakkan
 
Networks and Natural Language Processing
Networks and Natural Language ProcessingNetworks and Natural Language Processing
Networks and Natural Language Processing
Ahmed Magdy Ezzeldin, MSc.
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
Hemantha Kulathilake
 

What's hot (20)

AMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITYAMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITY
 
An Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding SystemAn Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding System
 
Improvement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A ReviewImprovement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A Review
 
Cs599 Fall2005 Lecture 01
Cs599 Fall2005 Lecture 01Cs599 Fall2005 Lecture 01
Cs599 Fall2005 Lecture 01
 
Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...
 
NLP
NLPNLP
NLP
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLP
 
Portuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and HowPortuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and How
 
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEYA DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
 
A N H YBRID A PPROACH TO W ORD S ENSE D ISAMBIGUATION W ITH A ND W ITH...
A N H YBRID  A PPROACH TO  W ORD  S ENSE  D ISAMBIGUATION  W ITH  A ND  W ITH...A N H YBRID  A PPROACH TO  W ORD  S ENSE  D ISAMBIGUATION  W ITH  A ND  W ITH...
A N H YBRID A PPROACH TO W ORD S ENSE D ISAMBIGUATION W ITH A ND W ITH...
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Using ontology for natural language processing
Using ontology for natural language processingUsing ontology for natural language processing
Using ontology for natural language processing
 
Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
 
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic AmbiguityEffective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity
 
11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)
 
NLPinAAC
NLPinAACNLPinAAC
NLPinAAC
 
Networks and Natural Language Processing
Networks and Natural Language ProcessingNetworks and Natural Language Processing
Networks and Natural Language Processing
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
 

Similar to FinalReport

Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language Processing
Michel Bruley
 
Pal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnetsPal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnetsMustafa Jarrar
 
The Four Principles Of Object Oriented Programming
The Four Principles Of Object Oriented ProgrammingThe Four Principles Of Object Oriented Programming
The Four Principles Of Object Oriented Programming
Diane Allen
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
Marina Santini
 
NLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptNLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.ppt
OlusolaTop
 
Using construction grammar in conversational systems
Using construction grammar in conversational systemsUsing construction grammar in conversational systems
Using construction grammar in conversational systems
CJ Jenkins
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
Ali Kabbadj
 
Gadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLGadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLLawrie Hunter
 
Word2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsWord2vec on the italian language: first experiments
Word2vec on the italian language: first experiments
Vincenzo Lomonaco
 
Foundations of ICT In ELT
Foundations of ICT In ELTFoundations of ICT In ELT
Foundations of ICT In ELTjaedth
 
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
Waqas Tariq
 
How to learn vocabulary in english
How to learn vocabulary in englishHow to learn vocabulary in english
How to learn vocabulary in english
Independant Teacher
 
Domain Specific Terminology Extraction (ICICT 2006)
Domain Specific Terminology Extraction (ICICT 2006)Domain Specific Terminology Extraction (ICICT 2006)
Domain Specific Terminology Extraction (ICICT 2006)
IT Industry
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
Roelof Pieters
 
Generative grammar
Generative grammarGenerative grammar
Generative grammar
Iri Win Imenza
 
Pal gov.tutorial4.session8 2.stepwisemethodologies
Pal gov.tutorial4.session8 2.stepwisemethodologiesPal gov.tutorial4.session8 2.stepwisemethodologies
Pal gov.tutorial4.session8 2.stepwisemethodologiesMustafa Jarrar
 
Putting the science in computer science
Putting the science in computer sciencePutting the science in computer science
Putting the science in computer science
Felienne Hermans
 
Resources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic EnglishResources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic EnglishThe Open Education Consortium
 
Resources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic EnglishResources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic English
The Open Education Consortium
 
Dictionary based concept mining an application for turkish
Dictionary based concept mining  an application for turkishDictionary based concept mining  an application for turkish
Dictionary based concept mining an application for turkish
csandit
 

Similar to FinalReport (20)

Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language Processing
 
Pal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnetsPal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnets
 
The Four Principles Of Object Oriented Programming
The Four Principles Of Object Oriented ProgrammingThe Four Principles Of Object Oriented Programming
The Four Principles Of Object Oriented Programming
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
 
NLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptNLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.ppt
 
Using construction grammar in conversational systems
Using construction grammar in conversational systemsUsing construction grammar in conversational systems
Using construction grammar in conversational systems
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
 
Gadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLGadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALL
 
Word2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsWord2vec on the italian language: first experiments
Word2vec on the italian language: first experiments
 
Foundations of ICT In ELT
Foundations of ICT In ELTFoundations of ICT In ELT
Foundations of ICT In ELT
 
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
 
How to learn vocabulary in english
How to learn vocabulary in englishHow to learn vocabulary in english
How to learn vocabulary in english
 
Domain Specific Terminology Extraction (ICICT 2006)
Domain Specific Terminology Extraction (ICICT 2006)Domain Specific Terminology Extraction (ICICT 2006)
Domain Specific Terminology Extraction (ICICT 2006)
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Generative grammar
Generative grammarGenerative grammar
Generative grammar
 
Pal gov.tutorial4.session8 2.stepwisemethodologies
Pal gov.tutorial4.session8 2.stepwisemethodologiesPal gov.tutorial4.session8 2.stepwisemethodologies
Pal gov.tutorial4.session8 2.stepwisemethodologies
 
Putting the science in computer science
Putting the science in computer sciencePutting the science in computer science
Putting the science in computer science
 
Resources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic EnglishResources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic English
 
Resources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic EnglishResources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic English
 
Dictionary based concept mining an application for turkish
Dictionary based concept mining  an application for turkishDictionary based concept mining  an application for turkish
Dictionary based concept mining an application for turkish
 

FinalReport

  • 1. Final Report of Internship Program at Kyoto Institute of Technology Ho Xuan Vinh, Faculty of Information Technology - University of Science Ho Chi Minh city, Vietnam Email: hovinh39@gmail.com Abstract—With the fastest development speed than ever, Machine Learning approach shows its dominant in solv- ing almost every problems from Image Processing to Natural Language Processing. However, to utilize all of this advantages, scientist must provide clean and clear material for machines automatically extract features and learn from then. This is the goal of this internship in creating a bilingual annotated corpus Vietnamese-English for training purpose of further advanced tasks. I have met a lot of challenges in finding appropriate tagset and implementing method, all are represented below . Index Terms—LLOCE, UCREL, WordNet, sense tag, semantic annotation, ... I. INTRODUCTION Beginning from Alan Turing’s famous Enigma de- coder machine, Natural Language Processing field has many up and down moment in its research and develop- ment history. This interactive human-machine has taken a great leap since 1980 with encouraging achievements supported by Machine Learning techniques. Unavoid- ably, the thirsty for well-annotated and large corpus becomes one of the most important factors to have a great training system. We also see that the more detail in annotated corpus, the more information you can extract and the higher the accuracy of the system. In specific, we can divide the label for annotating to 3 levels: • Morphological (noun, verb...) • Grammar (pronoun...) • Semantic (distinguish different entities with same morphological) The most distinctive is Semantic label that can not only provide many meaningful information for docu- ments, but also shows ability in solving complicated ambiguity with semantic relationship. Not standing out of the big game, many efforts have been made to create a universal semantic tagset such as WordNet, CoreLex, LLOCE, UCREL... 
This would be a great news unless the corpus an- notated with these tagsets are extremely rare, which do not provide enough data for training and testing Machine Learning methods, especially in low-resource language. From this motivation, I aim to make an annotated bilingual corpus for English - Vietnamese. The order of following section will be my timeline in doing research of this internship. Extra explanation and discussion about result also been added to make clearer about my changes in each approach. Concluding remark will be given and my continue work when be back to home country also provided. II. THE HOBBIT - WORDNET Taking a little time to conduct a survey about ”se- mantic tagset”, WordNet turns out the biggest star on stage with no doubt. Since the mid of 1980s, Linguistics in Princeton University have conducted project building the English Ontology WordNet with more than 117.000 synsets. These synsets connected to each other with many relationships: hypernyms, hyponyms, meronym, troponym... Each synset representing as a node or leaf in tree contains a list of synonym words. This structure is great advantage yet also disadvantage when the tagset is too fine-grained - Figure 1, which is so hard that even people can not figure out the difference between 2 labels only by synonym or short definition. For example with the word ‘bank’, we have 3 synsets it belong to: • sloping land (especially the slope beside a body of water) • a long ridge or pile • a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force Understand its properties, I know this goal is too hard to achieve: no available WordNet dictionary for Vietnamese, and English corpus annotated WordNet is small which get from annual contests like SemEval to find the best Word Sense Disambiguation method. Ev- erything seems not promising at all from the beginning, so I decide to keep looking for other tagset.
  • 2. Fig. 1. WordNet visualization of Noun and Adjective. III. THE FELLOWSHIP OF THE RING - LLOCE AND MAPPING This is the most unpopular and less attractive tagset you can find via Google Scholar, yet the concept behind this dictionary is easy to capture and friendly to human. Ordinary dictionary puts ‘animal’ and ‘zoo’ or ‘uncle’ and ‘aunt’ in far positions due to alphabet order. How- ever, in daily thinking process, those pairs are popular in particular way. Constructor of Longman Lexicon of Contemporary English(LLOCE)use this concept to clus- ter words sharing common characteristics into groups. LLOCE is organized as 14 topics called branches of daily life, then divided into 128 subjects and nearly 2,500 groups(Figure 2). For example: Branch C: People and the family consists of: • Subject ‘People’: Groups from C1 to C19 • Subject ‘Courting, sex, and marriage’: Groups from C20 to C39 • Subject ‘Friendship and enmity’: Groups from C40 to C49 • ... Each group has its own short definition and set of words belong to. Statistically, 705 groups contain verb, 1,482 groups contain noun, 429 for adjective, and 63 for Fig. 2. Structure of LLOCE dictionary. adverb. LLOCE also has external relation between sub- jects, however due to lack of evidence about how these relationship created, I left them out of consideration. The reason I choose LLOCE is because of the mate- rial I have in hand: • LLOCE dictionary of both English and Vietnamese. • Bilingual corpus extracted from examples of LLOCE dictionary. The hard thing is that the corpus was not annotated, so actually, I don’t know how to evaluate the result after annotating. Also I can not trace back which sentence is the example of which word. Fortunately, in this period, I catch the idea of using Latent Dirichlet Allocation[LDA] in topic modeling branch, which sounds like promising unsupervised approach.
Fig. 3. Graphical Model of Latent Dirichlet Allocation.

LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for each document w in a corpus D:
1) Choose N ∼ Poisson(ξ).
2) Choose θ ∼ Dir(α).
3) For each of the N words wn:
• Choose a topic zn ∼ Multinomial(θ).
• Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn.
Figure 3 shows the factors involved in this technique. In summary, when you feed the algorithm example sentences, you obtain a topic mixture for each of them. This proportion is a clue for deciding which group (label) is the most suitable for the word under consideration. For an easy understanding, consider the example illustrated in Figure 4. First, you train LDA on the dictionary so that it learns the distribution of words in each topic (we choose 2,500 topics, matching the 2,500 Groups of LLOCE); then, using the resulting model, you find the topic mixture of any sentence. For example, the sentence "He hewed out an important position for himself in the company" has the mixture 30% N43, 15% N89, 45% C52, 12.5% B41, with the rest spread over other Groups. If the word "important" has two candidate Groups (N43 and N89), we choose the one with the higher ratio, namely N43.

Contrary to my expectations, the first step already ran into trouble: the model could not learn the word distributions as I wanted. Further discussion with Araki sensei showed that this could happen for many reasons: the number of topics is too large; documents are usually collections of paragraphs rather than single sentences; the co-occurrence of words is too sparse; and the training setup does not accurately simulate the target environment. In an attempt to fix this, I tried different ways yet still did not reach the goal.
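The generative process above can be simulated directly. The NumPy sketch below samples one document exactly as steps 1)-3) describe; the topic count K, vocabulary size V, and the hyperparameters α, β, ξ are small made-up values for illustration (in the experiment I used an off-the-shelf LDA implementation, not this toy).

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 4, 10                              # topics, vocabulary size (toy values)
alpha = np.full(K, 0.5)                   # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.ones(V), size=K)  # per-topic word distributions
xi = 8                                    # Poisson mean for document length

# Generative process for one document, following steps 1)-3):
N = max(1, rng.poisson(xi))               # 1) length N ~ Poisson(xi)
theta = rng.dirichlet(alpha)              # 2) mixture theta ~ Dir(alpha)
words, topics = [], []
for _ in range(N):                        # 3) for each of the N words:
    z = rng.choice(K, p=theta)            #    topic z_n ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])          #    word w_n ~ p(w | z_n, beta)
    topics.append(int(z))
    words.append(int(w))

print(N, theta.round(2), words)
```

Inference runs this story backwards: given only the words, LDA estimates θ for each sentence, and it is that estimated mixture which serves as the clue for choosing between candidate Groups.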
Increasing the co-occurrence of words with DBpedia seemed possible at first, but the coverage of LLOCE is too small and the domains are too different. Decreasing the number of topics to the number of Topics or Branches did not help either. To obtain an annotated corpus, I tried mapping from LLOCE to WordNet based on a similarity metric over WordNet's tree structure, so as to turn SEMCOR (the WordNet-annotated corpus used in SemEval) into an LLOCE-annotated corpus. However, despite a similar distribution of words and node levels, no deeper intuition could be drawn from this experiment. During this time, I also implemented a Pre-processing and LLOCE package in Python for easy use later, which offers basic functions such as checking which Groups a word belongs to, listing the unique words in each Group, and baseline annotation with Left-Right Maximum Matching.

IV. THE TWO TOWERS - UCREL

The UCREL semantic analysis system (USAS) is a framework for the automatic semantic analysis of text. The framework has been designed and used across a number of research projects since 1990. The semantic tagset used by USAS was originally loosely based on Tom McArthur's Longman Lexicon of Contemporary English (LLOCE, 1981). It has a multi-tier structure with 21 major discourse fields (shown in Figure 5), subdivided, and with the possibility of further fine-grained subdivision in certain cases. In fact, most of the research on this tagset is my friend's task, but when I got stuck with LLOCE, I spent two weeks working with it. This time the idea is simpler: they provide a website where you upload your corpus and receive it back annotated in English and five other languages (Dutch, Chinese, Italian, Portuguese, Spanish). You can then run word alignment on your sentence-aligned bilingual corpus and project the labels from English to Vietnamese. So the problem is no longer annotating Vietnamese, but how to project the labels from English to Vietnamese correctly.
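The projection step this problem reduces to can be sketched as follows: given a sentence pair and a word alignment (pairs of English/Vietnamese token indices, as produced by an aligner such as GIZA++ or fast_align), copy each English tag onto the aligned Vietnamese tokens. The sentence pair, the USAS-style tag codes, and the alignment links below are all invented for illustration.

```python
def project_labels(en_tags, vi_tokens, alignment):
    """Project semantic tags from English to Vietnamese via a word alignment.

    en_tags   : list of (english_token, tag) pairs
    vi_tokens : list of Vietnamese tokens
    alignment : iterable of (en_index, vi_index) links
    """
    vi_tags = [None] * len(vi_tokens)      # None = unaligned, left untagged
    for en_i, vi_i in alignment:
        vi_tags[vi_i] = en_tags[en_i][1]   # copy the tag across the link
    return list(zip(vi_tokens, vi_tags))

# Invented example: "I love books" / "Tôi yêu sách", monotone 1-1 alignment.
en = [("I", "Z8"), ("love", "E2+"), ("books", "Q4.1")]
vi = ["Tôi", "yêu", "sách"]
links = [(0, 0), (1, 1), (2, 2)]

print(project_labels(en, vi, links))
# → [('Tôi', 'Z8'), ('yêu', 'E2+'), ('sách', 'Q4.1')]
```

The hard cases are precisely the links this naive copy gets wrong: one-to-many alignments, unaligned tokens, and part-of-speech mismatches between the two languages, which is where the heuristics discussed next come in.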
However, since we have no Vietnamese dictionary to support this, things came back to the starting point.

V. THE RETURN OF THE KING - WORDNET

My mentor had a quick discussion with me about coming back to WordNet. He now has an English-Vietnamese version of WordNet, and a part of WordNet's SEMCOR translated into Vietnamese. The problem stays the same: making the alignment correct, and finding heuristics to handle exceptions between the two languages, such as words that are adjectives in English but nouns in Vietnamese. This is my current work.

Fig. 4. Idea about implementing LDA for annotating.

Fig. 5. UCREL category system.

VI. CONCLUSION

So far, my progress is not as good as I wished. The time I spent on the survey was too short, which left me confused when doing the experiments. Even after realizing that, I still struggled to fully understand the papers (which contain a lot of complex formulas). Many challenges certainly await me before the deadline of my thesis. But I know that I have learned how to handle them, and the memories here will remind me that whenever there is a hard time, I can always get over it.

Fig. 6. My lab.

ACKNOWLEDGEMENT

Thank you, Araki sensei, for the warm discussions about my work, for the support when I felt confused and lost in my research, and for teaching me to "learn new things by experimenting with them". Thank you, Nakano-san and the other students in the lab, for giving me memorable experiences, not only in work but also in play.