Final Report of Internship Program
at Kyoto Institute of Technology
Ho Xuan Vinh,
Faculty of Information Technology - University of Science
Ho Chi Minh city, Vietnam
Email: hovinh39@gmail.com
Abstract—Developing faster than ever, the Machine Learning approach now dominates the solving of problems ranging from Image Processing to Natural Language Processing. To exploit these advantages, however, scientists must provide clean, well-prepared material from which machines can automatically extract features and learn. That is the goal of this internship: creating a bilingual Vietnamese-English annotated corpus for training further advanced tasks. I met many challenges in finding an appropriate tagset and implementing the method; all are presented below.
Index Terms—LLOCE, UCREL, WordNet, sense tag, semantic annotation
I. INTRODUCTION
Beginning with Alan Turing's famous Enigma-decoding machine, the field of Natural Language Processing has seen many ups and downs in its research and development history. Human-machine interaction has taken a great leap since the 1980s, with encouraging achievements supported by Machine Learning techniques. Unavoidably, the need for large, well-annotated corpora has become one of the most important factors in building a good training system. The more detailed the annotation, the more information can be extracted and the higher the accuracy of the resulting system. Specifically, annotation labels can be divided into three levels:
• Morphological (noun, verb, ...)
• Grammatical (pronoun, ...)
• Semantic (distinguishing different entities that share the same morphology)
Semantic labels are the most distinctive: they not only provide rich, meaningful information about documents, but also help resolve complicated ambiguities through semantic relationships. Accordingly, many efforts have been made to create a universal semantic tagset, such as WordNet, CoreLex, LLOCE, and UCREL.
This would be great news, except that corpora annotated with these tagsets are extremely rare and do not provide enough data for training and testing Machine Learning methods, especially for low-resource languages. This motivates my aim of building an annotated bilingual English-Vietnamese corpus. The following sections trace the timeline of my research during this internship. Extra explanation and discussion of results are added to clarify the changes in each approach. Concluding remarks, together with the work I will continue after returning to my home country, are given at the end.
II. THE HOBBIT - WORDNET
After a short survey on “semantic tagsets”, WordNet turns out to be, without a doubt, the biggest star on stage. Since the mid-1980s, linguists at Princeton University have been building WordNet, an English ontology with more than 117,000 synsets. Synsets are connected to each other by many relationships: hypernymy, hyponymy, meronymy, troponymy, and so on. Each synset, represented as a node or leaf in a tree, contains a list of synonymous words. This structure is a great advantage yet also a disadvantage: the tagset is so fine-grained (Figure 1) that even people cannot tell two labels apart from their synonyms or short definitions alone. For example, three of the synsets the word ‘bank’ belongs to are:
• sloping land (especially the slope beside a body of
water)
• a long ridge or pile
• a slope in the turn of a road or track; the outside is
higher than the inside in order to reduce the effects
of centrifugal force
Understanding these properties, I knew this goal was too hard to achieve: no WordNet dictionary is available for Vietnamese, and the English corpora annotated with WordNet senses, which come from annual contests such as SemEval on finding the best Word Sense Disambiguation method, are small. Since nothing seemed promising from the beginning, I decided to keep looking for another tagset.
Fig. 1. WordNet visualization of Noun and Adjective.
III. THE FELLOWSHIP OF THE RING - LLOCE AND
MAPPING
This is probably the least popular and least attractive tagset you can find via Google Scholar, yet the concept behind the dictionary is easy to grasp and human-friendly. An ordinary dictionary places ‘animal’ and ‘zoo’, or ‘uncle’ and ‘aunt’, far apart because of alphabetical order; in everyday thinking, however, such pairs naturally belong together. The constructors of the Longman Lexicon of Contemporary English (LLOCE) used this idea to cluster words sharing common characteristics into groups. LLOCE is organized into 14 topics of daily life called branches, which are divided into 128 subjects and nearly 2,500 groups (Figure 2).
For example, Branch C (‘People and the family’) consists of:
• Subject ‘People’: Groups from C1 to C19
• Subject ‘Courting, sex, and marriage’: Groups from
C20 to C39
• Subject ‘Friendship and enmity’: Groups from C40
to C49
• ...
Fig. 2. Structure of LLOCE dictionary.
Each group has its own short definition and a set of member words. Statistically, 705 groups contain verbs, 1,482 contain nouns, 429 contain adjectives, and 63 contain adverbs. LLOCE also defines external relations between subjects; however, for lack of evidence about how these relations were created, I left them out of consideration.
I chose LLOCE because of the material I had at hand:
• LLOCE dictionary of both English and Vietnamese.
• Bilingual corpus extracted from examples of
LLOCE dictionary.
The hard part is that the corpus was not annotated, so I did not know how to evaluate the result after annotating, and I could not trace which sentence was the example for which word. Fortunately, during this period I came across the idea of using Latent Dirichlet Allocation (LDA) from the topic modeling branch, which sounded like a promising unsupervised approach.
Fig. 3. Graphical Model of Latent Dirichlet Allocation.
LDA is a generative probabilistic model of a corpus.
The basic idea is that documents are represented as
random mixtures over latent topics, where each topic
is characterized by a distribution over words.
LDA assumes the following generative process for each document w in a corpus D:
1) Choose N ∼ Poisson(ξ).
2) Choose θ ∼ Dir(α).
3) For each of the N words wn:
   • Choose a topic zn ∼ Multinomial(θ).
   • Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn.
Figure 3 shows the factors involved in this technique. In summary, when you feed the algorithm example sentences, you obtain a topic mixture for each of them. This proportion is a clue for determining which group (label) is the most suitable for the word under consideration.
For easy understanding, consider the example illustrated in Figure 4. First, LDA is trained on the dictionary so that it can learn the distribution of words in each topic (we choose 2,500 topics, matching the 2,500 groups of LLOCE); then, using the trained model, you can find the topic mixture of any sentence. For example, the sentence ”He hewed out an important position for himself in the company” has the topic mixture 30% N43, 15% N89, 45% C52, 12.5% B41, with the rest spread over other groups. If the word ”important” has two candidate groups (N43 and N89), we choose the one with the higher ratio, here N43.
Contrary to my expectations, the first step ran into trouble: the model could not learn the word distributions as I wanted. Further discussion with Araki sensei suggested several possible reasons: the number of topics is too large; a document is usually a collection of paragraphs rather than a single sentence; word co-occurrence in such short documents is limited; and the training setup did not accurately simulate the intended use.
In an attempt to fix this, I tried several approaches but still did not reach the goal. Increasing word co-occurrence with DBpedia seemed possible at first, but LLOCE's coverage is too small and the domains are too different. Decreasing the number of topics to the number of branches or subjects did not help either. To obtain an annotated corpus, I tried mapping from LLOCE to WordNet based on a similarity metric over WordNet's tree structure, so that SEMCOR (the WordNet-annotated corpus used in SemEval) could become an LLOCE-annotated corpus. However, despite similar distributions of words and node levels, no deeper insight emerged from this experiment.
During this time, I also implemented a preprocessing and LLOCE package in Python for later use. It offers basic functions such as checking which groups a word belongs to, listing the unique words in each group, and baseline annotation with left-to-right maximum matching.
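The left-to-right maximum matching baseline can be sketched as follows; the lexicon entries and group IDs are hypothetical stand-ins for the LLOCE data:

```python
# Left-to-right maximum matching: at each position, greedily take the
# longest span that appears in the lexicon and attach its group labels.
LEXICON = {
    'important position': ['N43'],
    'important': ['N43', 'N89'],
    'company': ['C52'],
}
MAX_LEN = max(len(k.split()) for k in LEXICON)

def annotate(tokens):
    """Return (span, groups) pairs, preferring the longest match."""
    result, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = ' '.join(tokens[i:i + n])
            if span in LEXICON:
                result.append((span, LEXICON[span]))
                i += n
                break
        else:  # no lexicon entry covers this token
            result.append((tokens[i], []))
            i += 1
    return result

print(annotate('an important position in the company'.split()))
```

Note that ‘important position’ wins over the single-word entry ‘important’, which is exactly why the baseline is “maximum” matching.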
IV. THE TWO TOWERS - UCREL
The UCREL semantic analysis system is a framework
for undertaking the automatic semantic analysis of text.
The framework has been designed and used across a
number of research projects since 1990. The semantic
tagset used by USAS was originally loosely based on
Tom McArthur’s Longman Lexicon of Contemporary
English (LLOCE, 1981). It has a multi-tier structure with 21 major discourse fields (shown in Figure 5), subdivided, with the possibility of further fine-grained subdivision in certain cases.
In fact, most of the research on this tagset was my labmate's task, but when I got stuck with LLOCE, I spent two weeks working with it. This time the idea is simpler. UCREL provides a website where you upload your corpus and receive it annotated in English and five other languages (Dutch, Chinese, Italian, Portuguese, Spanish). You can then run word alignment on your sentence-aligned bilingual corpus and project the labels from English to Vietnamese. The problem is therefore no longer annotating Vietnamese, but projecting labels from English to Vietnamese correctly.
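The projection step itself is simple once alignments exist; a minimal sketch, in which the tags and alignment pairs are hypothetical illustrations (a real pipeline would get alignments from a word aligner such as GIZA++ or fast_align):

```python
# Project per-token semantic tags from an English sentence to its
# Vietnamese translation through word-alignment links.
def project_labels(en_tags, alignment, vi_len):
    """en_tags: one tag per English token.
    alignment: (en_index, vi_index) pairs from a word aligner.
    vi_len: number of Vietnamese tokens."""
    vi_tags = [None] * vi_len  # unaligned tokens stay untagged
    for en_idx, vi_idx in alignment:
        # Copy each English token's tag to its aligned Vietnamese token.
        vi_tags[vi_idx] = en_tags[en_idx]
    return vi_tags

en_tags = ['Z8', 'A11.1', 'Z5']       # illustrative USAS-style tags
alignment = [(0, 0), (1, 2), (2, 1)]  # illustrative alignment links
print(project_labels(en_tags, alignment, 3))  # ['Z8', 'Z5', 'A11.1']
```

The hard cases are exactly the ones the report mentions next: one-to-many alignments and tokens whose part of speech differs across the two languages.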
However, since we had no Vietnamese dictionary to support the projection, things came back to the starting point.
V. THE RETURN OF THE KING - WORDNET
Fig. 4. Idea about implementing LDA for annotating.
Fig. 5. UCREL category system.
My mentor had a quick discussion with me about coming back to WordNet. He now has an English-Vietnamese version of WordNet, and a part of WordNet's SEMCOR corpus translated into Vietnamese. The problem stays the same: making the alignment correct, and finding heuristics for the exceptions across the two languages, such as words that are adjectives in English but nouns in Vietnamese. This is my current work.
VI. CONCLUSION
So far, my progress is not as good as I wished. The time I spent on the initial survey was too short, which left me confused when running experiments. Even after realizing that, I still struggled to fully understand the papers, which contain many complex formulas. Many challenges surely await me before the deadline of my thesis, but I know I have learned how to handle them, and the memories of this place will remind me that whenever there is a hard time, I can always get over it.
Fig. 6. My lab.
ACKNOWLEDGEMENT
Thank you, Araki sensei, for the warm discussions about my work, for the support when I felt confused and lost in my research, and for teaching me to ”learn new things by experimenting with them”. Thank you, Nakano-san and the other students in the lab, for giving me memorable experiences, not only at work but at play as well.

More Related Content

What's hot

AMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITYAMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITY
ijnlc
 
An Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding SystemAn Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding System
inscit2006
 
Improvement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A ReviewImprovement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A Review
inscit2006
 
Cs599 Fall2005 Lecture 01
Cs599 Fall2005 Lecture 01Cs599 Fall2005 Lecture 01
Cs599 Fall2005 Lecture 01Dr. Cupid Lucid
 
Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...
inscit2006
 
NLP
NLPNLP
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Saurabh Kaushik
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLP
Satyam Saxena
 
Portuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and HowPortuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and How
Valeria de Paiva
 
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEYA DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
ijaia
 
A N H YBRID A PPROACH TO W ORD S ENSE D ISAMBIGUATION W ITH A ND W ITH...
A N H YBRID  A PPROACH TO  W ORD  S ENSE  D ISAMBIGUATION  W ITH  A ND  W ITH...A N H YBRID  A PPROACH TO  W ORD  S ENSE  D ISAMBIGUATION  W ITH  A ND  W ITH...
A N H YBRID A PPROACH TO W ORD S ENSE D ISAMBIGUATION W ITH A ND W ITH...
ijnlc
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Saurabh Kaushik
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingMariana Soffer
 
Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
DigiGurukul
 
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic AmbiguityEffective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity
IDES Editor
 
11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)
ThennarasuSakkan
 
Networks and Natural Language Processing
Networks and Natural Language ProcessingNetworks and Natural Language Processing
Networks and Natural Language Processing
Ahmed Magdy Ezzeldin, MSc.
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
Hemantha Kulathilake
 

What's hot (20)

AMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITYAMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITY
 
An Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding SystemAn Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding System
 
Improvement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A ReviewImprovement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A Review
 
Cs599 Fall2005 Lecture 01
Cs599 Fall2005 Lecture 01Cs599 Fall2005 Lecture 01
Cs599 Fall2005 Lecture 01
 
Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...
 
NLP
NLPNLP
NLP
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLP
 
Portuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and HowPortuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and How
 
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEYA DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
 
A N H YBRID A PPROACH TO W ORD S ENSE D ISAMBIGUATION W ITH A ND W ITH...
A N H YBRID  A PPROACH TO  W ORD  S ENSE  D ISAMBIGUATION  W ITH  A ND  W ITH...A N H YBRID  A PPROACH TO  W ORD  S ENSE  D ISAMBIGUATION  W ITH  A ND  W ITH...
A N H YBRID A PPROACH TO W ORD S ENSE D ISAMBIGUATION W ITH A ND W ITH...
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Using ontology for natural language processing
Using ontology for natural language processingUsing ontology for natural language processing
Using ontology for natural language processing
 
Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
 
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic AmbiguityEffective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity
 
11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)
 
NLPinAAC
NLPinAACNLPinAAC
NLPinAAC
 
Networks and Natural Language Processing
Networks and Natural Language ProcessingNetworks and Natural Language Processing
Networks and Natural Language Processing
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
 

Similar to FinalReport

Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language Processing
Michel Bruley
 
Pal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnetsPal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnetsMustafa Jarrar
 
The Four Principles Of Object Oriented Programming
The Four Principles Of Object Oriented ProgrammingThe Four Principles Of Object Oriented Programming
The Four Principles Of Object Oriented Programming
Diane Allen
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
Marina Santini
 
NLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptNLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.ppt
OlusolaTop
 
Using construction grammar in conversational systems
Using construction grammar in conversational systemsUsing construction grammar in conversational systems
Using construction grammar in conversational systems
CJ Jenkins
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
Ali Kabbadj
 
Gadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLGadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLLawrie Hunter
 
Word2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsWord2vec on the italian language: first experiments
Word2vec on the italian language: first experiments
Vincenzo Lomonaco
 
Foundations of ICT In ELT
Foundations of ICT In ELTFoundations of ICT In ELT
Foundations of ICT In ELTjaedth
 
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
Waqas Tariq
 
How to learn vocabulary in english
How to learn vocabulary in englishHow to learn vocabulary in english
How to learn vocabulary in english
Independant Teacher
 
Domain Specific Terminology Extraction (ICICT 2006)
Domain Specific Terminology Extraction (ICICT 2006)Domain Specific Terminology Extraction (ICICT 2006)
Domain Specific Terminology Extraction (ICICT 2006)
IT Industry
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
Roelof Pieters
 
Generative grammar
Generative grammarGenerative grammar
Generative grammar
Iri Win Imenza
 
Pal gov.tutorial4.session8 2.stepwisemethodologies
Pal gov.tutorial4.session8 2.stepwisemethodologiesPal gov.tutorial4.session8 2.stepwisemethodologies
Pal gov.tutorial4.session8 2.stepwisemethodologiesMustafa Jarrar
 
Putting the science in computer science
Putting the science in computer sciencePutting the science in computer science
Putting the science in computer science
Felienne Hermans
 
Resources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic EnglishResources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic EnglishThe Open Education Consortium
 
Resources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic EnglishResources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic English
The Open Education Consortium
 
Dictionary based concept mining an application for turkish
Dictionary based concept mining  an application for turkishDictionary based concept mining  an application for turkish
Dictionary based concept mining an application for turkish
csandit
 

Similar to FinalReport (20)

Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language Processing
 
Pal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnetsPal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnets
 
The Four Principles Of Object Oriented Programming
The Four Principles Of Object Oriented ProgrammingThe Four Principles Of Object Oriented Programming
The Four Principles Of Object Oriented Programming
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
 
NLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptNLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.ppt
 
Using construction grammar in conversational systems
Using construction grammar in conversational systemsUsing construction grammar in conversational systems
Using construction grammar in conversational systems
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
 
Gadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLGadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALL
 
Word2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsWord2vec on the italian language: first experiments
Word2vec on the italian language: first experiments
 
Foundations of ICT In ELT
Foundations of ICT In ELTFoundations of ICT In ELT
Foundations of ICT In ELT
 
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
 
How to learn vocabulary in english
How to learn vocabulary in englishHow to learn vocabulary in english
How to learn vocabulary in english
 
Domain Specific Terminology Extraction (ICICT 2006)
Domain Specific Terminology Extraction (ICICT 2006)Domain Specific Terminology Extraction (ICICT 2006)
Domain Specific Terminology Extraction (ICICT 2006)
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Generative grammar
Generative grammarGenerative grammar
Generative grammar
 
Pal gov.tutorial4.session8 2.stepwisemethodologies
Pal gov.tutorial4.session8 2.stepwisemethodologiesPal gov.tutorial4.session8 2.stepwisemethodologies
Pal gov.tutorial4.session8 2.stepwisemethodologies
 
Putting the science in computer science
Putting the science in computer sciencePutting the science in computer science
Putting the science in computer science
 
Resources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic EnglishResources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic English
 
Resources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic EnglishResources at the Interface of Openness for Academic English
Resources at the Interface of Openness for Academic English
 
Dictionary based concept mining an application for turkish
Dictionary based concept mining  an application for turkishDictionary based concept mining  an application for turkish
Dictionary based concept mining an application for turkish
 

FinalReport

  • 1. Final Report of Internship Program at Kyoto Institute of Technology Ho Xuan Vinh, Faculty of Information Technology - University of Science Ho Chi Minh city, Vietnam Email: hovinh39@gmail.com Abstract—With the fastest development speed than ever, Machine Learning approach shows its dominant in solv- ing almost every problems from Image Processing to Natural Language Processing. However, to utilize all of this advantages, scientist must provide clean and clear material for machines automatically extract features and learn from then. This is the goal of this internship in creating a bilingual annotated corpus Vietnamese-English for training purpose of further advanced tasks. I have met a lot of challenges in finding appropriate tagset and implementing method, all are represented below . Index Terms—LLOCE, UCREL, WordNet, sense tag, semantic annotation, ... I. INTRODUCTION Beginning from Alan Turing’s famous Enigma de- coder machine, Natural Language Processing field has many up and down moment in its research and develop- ment history. This interactive human-machine has taken a great leap since 1980 with encouraging achievements supported by Machine Learning techniques. Unavoid- ably, the thirsty for well-annotated and large corpus becomes one of the most important factors to have a great training system. We also see that the more detail in annotated corpus, the more information you can extract and the higher the accuracy of the system. In specific, we can divide the label for annotating to 3 levels: • Morphological (noun, verb...) • Grammar (pronoun...) • Semantic (distinguish different entities with same morphological) The most distinctive is Semantic label that can not only provide many meaningful information for docu- ments, but also shows ability in solving complicated ambiguity with semantic relationship. Not standing out of the big game, many efforts have been made to create a universal semantic tagset such as WordNet, CoreLex, LLOCE, UCREL... 
This would be a great news unless the corpus an- notated with these tagsets are extremely rare, which do not provide enough data for training and testing Machine Learning methods, especially in low-resource language. From this motivation, I aim to make an annotated bilingual corpus for English - Vietnamese. The order of following section will be my timeline in doing research of this internship. Extra explanation and discussion about result also been added to make clearer about my changes in each approach. Concluding remark will be given and my continue work when be back to home country also provided. II. THE HOBBIT - WORDNET Taking a little time to conduct a survey about ”se- mantic tagset”, WordNet turns out the biggest star on stage with no doubt. Since the mid of 1980s, Linguistics in Princeton University have conducted project building the English Ontology WordNet with more than 117.000 synsets. These synsets connected to each other with many relationships: hypernyms, hyponyms, meronym, troponym... Each synset representing as a node or leaf in tree contains a list of synonym words. This structure is great advantage yet also disadvantage when the tagset is too fine-grained - Figure 1, which is so hard that even people can not figure out the difference between 2 labels only by synonym or short definition. For example with the word ‘bank’, we have 3 synsets it belong to: • sloping land (especially the slope beside a body of water) • a long ridge or pile • a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force Understand its properties, I know this goal is too hard to achieve: no available WordNet dictionary for Vietnamese, and English corpus annotated WordNet is small which get from annual contests like SemEval to find the best Word Sense Disambiguation method. Ev- erything seems not promising at all from the beginning, so I decide to keep looking for other tagset.
  • 2. Fig. 1. WordNet visualization of Noun and Adjective. III. THE FELLOWSHIP OF THE RING - LLOCE AND MAPPING This is the most unpopular and less attractive tagset you can find via Google Scholar, yet the concept behind this dictionary is easy to capture and friendly to human. Ordinary dictionary puts ‘animal’ and ‘zoo’ or ‘uncle’ and ‘aunt’ in far positions due to alphabet order. How- ever, in daily thinking process, those pairs are popular in particular way. Constructor of Longman Lexicon of Contemporary English(LLOCE)use this concept to clus- ter words sharing common characteristics into groups. LLOCE is organized as 14 topics called branches of daily life, then divided into 128 subjects and nearly 2,500 groups(Figure 2). For example: Branch C: People and the family consists of: • Subject ‘People’: Groups from C1 to C19 • Subject ‘Courting, sex, and marriage’: Groups from C20 to C39 • Subject ‘Friendship and enmity’: Groups from C40 to C49 • ... Each group has its own short definition and set of words belong to. Statistically, 705 groups contain verb, 1,482 groups contain noun, 429 for adjective, and 63 for Fig. 2. Structure of LLOCE dictionary. adverb. LLOCE also has external relation between sub- jects, however due to lack of evidence about how these relationship created, I left them out of consideration. The reason I choose LLOCE is because of the mate- rial I have in hand: • LLOCE dictionary of both English and Vietnamese. • Bilingual corpus extracted from examples of LLOCE dictionary. The hard thing is that the corpus was not annotated, so actually, I don’t know how to evaluate the result after annotating. Also I can not trace back which sentence is the example of which word. Fortunately, in this period, I catch the idea of using Latent Dirichlet Allocation[LDA] in topic modeling branch, which sounds like promising unsupervised approach.
Fig. 3. Graphical Model of Latent Dirichlet Allocation.

LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for each document w in a corpus D:
1) Choose N ∼ Poisson(ξ).
2) Choose θ ∼ Dir(α).
3) For each of the N words wn:
• Choose a topic zn ∼ Multinomial(θ).
• Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn.
Figure 3 shows the factors involved in this technique. In summary, when you feed the algorithm example sentences, you obtain a topic mixture for each of them. This proportion is a clue for deciding which group (label) is the most suitable for the word under consideration. For an easy understanding, consider the example illustrated in Figure 4. First, you train LDA on the dictionary so that it learns the distribution of words in each topic (we choose 2,500 topics, matching the 2,500 Groups of LLOCE); then, using the resulting model, you find the topic mixture of any sentence. For example, the sentence "He hewed out an important position for himself in the company" has the mixture 30% N43, 15% N89, 45% C52, 12.5% B41, with the rest spread over other Groups. If the word "important" has two candidate Groups (N43 and N89), we choose the one with the higher ratio, namely N43.

Contrary to my expectations, the first step already ran into trouble: the model could not learn the word distributions as I wanted. Further discussion with Araki sensei showed that this could happen for many reasons: the number of topics is too large; documents are usually collections of paragraphs rather than single sentences; the co-occurrence of words is too sparse; and the training setup does not accurately simulate the target environment. In an attempt to fix this, I tried different ways yet still did not reach the goal.
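The generative process above can be simulated directly. The NumPy sketch below samples one document exactly as steps 1)-3) describe; the topic count K, vocabulary size V, and the hyperparameters α, β, ξ are small made-up values for illustration (in the experiment I used an off-the-shelf LDA implementation, not this toy).

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 4, 10                              # topics, vocabulary size (toy values)
alpha = np.full(K, 0.5)                   # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.ones(V), size=K)  # per-topic word distributions
xi = 8                                    # Poisson mean for document length

# Generative process for one document, following steps 1)-3):
N = max(1, rng.poisson(xi))               # 1) length N ~ Poisson(xi)
theta = rng.dirichlet(alpha)              # 2) mixture theta ~ Dir(alpha)
words, topics = [], []
for _ in range(N):                        # 3) for each of the N words:
    z = rng.choice(K, p=theta)            #    topic z_n ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])          #    word w_n ~ p(w | z_n, beta)
    topics.append(int(z))
    words.append(int(w))

print(N, theta.round(2), words)
```

Inference runs this story backwards: given only the words, LDA estimates θ for each sentence, and it is that estimated mixture which serves as the clue for choosing between candidate Groups.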
Increasing the co-occurrence of words with DBpedia seemed possible at first, but the coverage of LLOCE is too small and the domains are too different. Decreasing the number of topics to the number of Topics or Branches did not help either. To obtain an annotated corpus, I tried mapping from LLOCE to WordNet based on a similarity metric over WordNet's tree structure, so as to turn SEMCOR (the WordNet-annotated corpus used in SemEval) into an LLOCE-annotated corpus. However, despite a similar distribution of words and node levels, no deeper intuition could be drawn from this experiment. During this time, I also implemented a Pre-processing and LLOCE package in Python for easy use later, which offers basic functions such as checking which Groups a word belongs to, listing the unique words in each Group, and baseline annotation with Left-Right Maximum Matching.

IV. THE TWO TOWERS - UCREL

The UCREL semantic analysis system (USAS) is a framework for the automatic semantic analysis of text. The framework has been designed and used across a number of research projects since 1990. The semantic tagset used by USAS was originally loosely based on Tom McArthur's Longman Lexicon of Contemporary English (LLOCE, 1981). It has a multi-tier structure with 21 major discourse fields (shown in Figure 5), subdivided, and with the possibility of further fine-grained subdivision in certain cases. In fact, most of the research on this tagset is my friend's task, but when I got stuck with LLOCE, I spent two weeks working with it. This time the idea is simpler: they provide a website where you upload your corpus and receive it back annotated in English and five other languages (Dutch, Chinese, Italian, Portuguese, Spanish). You can then run word alignment on your sentence-aligned bilingual corpus and project the labels from English to Vietnamese. So the problem is no longer annotating Vietnamese, but how to project the labels from English to Vietnamese correctly.
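The projection step this problem reduces to can be sketched as follows: given a sentence pair and a word alignment (pairs of English/Vietnamese token indices, as produced by an aligner such as GIZA++ or fast_align), copy each English tag onto the aligned Vietnamese tokens. The sentence pair, the USAS-style tag codes, and the alignment links below are all invented for illustration.

```python
def project_labels(en_tags, vi_tokens, alignment):
    """Project semantic tags from English to Vietnamese via a word alignment.

    en_tags   : list of (english_token, tag) pairs
    vi_tokens : list of Vietnamese tokens
    alignment : iterable of (en_index, vi_index) links
    """
    vi_tags = [None] * len(vi_tokens)      # None = unaligned, left untagged
    for en_i, vi_i in alignment:
        vi_tags[vi_i] = en_tags[en_i][1]   # copy the tag across the link
    return list(zip(vi_tokens, vi_tags))

# Invented example: "I love books" / "Tôi yêu sách", monotone 1-1 alignment.
en = [("I", "Z8"), ("love", "E2+"), ("books", "Q4.1")]
vi = ["Tôi", "yêu", "sách"]
links = [(0, 0), (1, 1), (2, 2)]

print(project_labels(en, vi, links))
# → [('Tôi', 'Z8'), ('yêu', 'E2+'), ('sách', 'Q4.1')]
```

The hard cases are precisely the links this naive copy gets wrong: one-to-many alignments, unaligned tokens, and part-of-speech mismatches between the two languages, which is where the heuristics discussed next come in.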
However, since we have no Vietnamese dictionary to support this, things came back to the starting point.

V. THE RETURN OF THE KING - WORDNET

My mentor had a quick discussion with me about coming back to WordNet. He now has an English-Vietnamese version of WordNet, and a part of WordNet's SEMCOR translated into Vietnamese. The problem stays the same: making the alignment correct, and finding heuristics to handle exceptions between the two languages, such as words that are adjectives in English but nouns in Vietnamese. This is my current work.

Fig. 4. Idea about implementing LDA for annotating.

Fig. 5. UCREL category system.

VI. CONCLUSION

So far, my progress is not as good as I wished. The time I spent on the survey was too short, which left me confused when doing the experiments. Even after realizing that, I still struggled to fully understand the papers (which contain a lot of complex formulas). Many challenges certainly await me before the deadline of my thesis. But I know that I have learned how to handle them, and the memories here will remind me that whenever there is a hard time, I can always get over it.

Fig. 6. My lab.

ACKNOWLEDGEMENT

Thank you, Araki sensei, for the warm discussions about my work, for the support when I felt confused and lost in my research, and for teaching me to "learn new things by experimenting with them". Thank you, Nakano-san and the other students in the lab, for giving me memorable experiences, not only in work but also in play.