Formal and Computational Representations
The Semantics of First-Order Logic
Event Representations
Description Logics & the Web Ontology Language
Compositionality
Lambda calculus
Corpus-based approaches:
Latent Semantic Analysis
Topic models
Distributional Semantics
Evolution of minds and languages: What evolved first and develops first in ch... - Aaron Sloman
Note: SlideShare no longer allows uploaded slides to be updated. The latest version of these slides is at http://www.cs.bham.ac.uk/research/projects/cogaff//talks/#talk111
The version posted here was last updated on 16 March 2015; there have been several changes since then on the alternative site.
A theory is presented according to which "languages" with structural variability and compositional semantics evolved in several species for *internal* use (e.g. in perception, planning, learning, forming goals, deciding, etc.) before *external* languages evolved for communication. The theory implies that such internal languages develop in young humans before a language for communication.
It is also noted that the standard notion of 'compositional semantics' must allow the propagation of semantic content from parts to wholes to be potentially context sensitive at every stage: current context, speaker intentions, user knowledge, and shared goals can all affect how the semantics of larger parts are derived from the semantics of smaller parts plus syntactic structure. This applies as much to non-verbal languages as to verbal ones.
This theory of how human languages evolved from earlier 'internal languages' (GLs) is inconsistent with the best-known published theories of the evolution or development of language. But that does not make it wrong. Moreover, the theory is supported by empirical evidence, including the example of deaf children in Nicaragua: http://en.wikipedia.org/wiki/Nicaraguan_Sign_Language
Deep misconceptions and the myth of data-driven NLU - Walid Saba
Early efforts to find theoretically elegant formal models for various linguistic phenomena did not result in any noticeable progress, despite nearly three decades of intensive research (late 1950s through the late 1980s). As the various formal (and in most cases mere symbol-manipulation) systems seemed to reach a deadlock, disillusionment with the brittle logical approach to language processing grew, and a number of researchers and practitioners in natural language processing (NLP) started to abandon theoretical elegance in favor of attaining quick results using empirical (data-driven) approaches.
All seemed natural and expected. In the absence of theoretically elegant models that could explain a number of NL phenomena, it was quite reasonable to find researchers shifting their efforts to finding practical solutions for urgent problems using empirical methods. By the mid-1990s, a data-driven statistical revolution that was already brewing took the field of NLP by storm, putting aside efforts rooted in over 200 years of work in logic, metaphysics, grammar, and formal semantics.
We believe, however, that this trend has overstepped the noble cause of using empirical methods to find reasonably working solutions for practical problems. In fact, the data-driven approach to NLP is now believed by many to be a plausible approach to building systems that can truly understand ordinary spoken language. This is not only a misguided trend but a very damaging development that will hinder significant progress in the field. In this regard, we hope this study will help start a sane, and overdue, semantic (counter) revolution.
The main thesis here is this: (i) the data-driven approach to NLU is utterly fallacious; (ii) logical semantics has been seriously misguided; and (iii) logical semantics can be rectified, and here we suggest how this can be done and how to go forward, again.
Towards a Universal Wordnet by Learning from Combined Evidence - Gerard de Melo
Lexical databases are invaluable sources of knowledge about words and their meanings, with numerous applications in areas like NLP, IR, and AI. We propose a methodology for the automatic construction of a large-scale multilingual lexical database where words of many languages are hierarchically organized in terms of their meanings and their semantic relations to other words. This resource is bootstrapped from WordNet, a well-known English-language resource. Our approach extends WordNet with around 1.5 million meaning links for 800,000 words in over 200 languages, drawing on evidence extracted from a variety of resources including existing (monolingual) wordnets, (mostly bilingual) translation dictionaries, and parallel corpora. Graph-based scoring functions and statistical learning techniques are used to iteratively integrate this information and build an output graph. Experiments show that this wordnet has a high level of precision and coverage, and that it can be useful in applied tasks such as cross-lingual text classification.
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word sense disambiguation systems that, given a word appearing in a certain context, can identify the sense of that word. In this paper we consider the problem of deciding whether identical words contained in different documents refer to the same meaning or are homonyms. Our goal is to improve the estimate of the similarity of documents in which some words may be used with different meanings. We present three new strategies for solving this problem, which are used to filter out homonyms from the similarity computation. Two of them are intrinsically non-semantic, whereas the third has a semantic flavor and can also be applied to word sense disambiguation. The three strategies have been embedded in an article recommendation system that one of the most important Italian ad-serving companies offers to its customers.
Improvement in Quality of Speech associated with Braille codes - A Review - inscit2006
J. Anurag, P. Nupur and Agrawal, S.S.
School of Information Technology, Guru Gobind Singh Indraprastha University, Delhi, India
Centre for Development of Advanced Computing, Noida, India
Taking into account communities of practice’s specific vocabularies in inform... - inscit2006
L. Damas and C. Million-Rousseau
Condillac Group, LISTIC, Université de Savoie. 73370 Le Bourget du Lac, France
Ontologos Corp. 6, route de Nanfray, 74000 Cran-Gevrier, France
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY - ijaia
In today's world of digital media, connecting millions of users, large amounts of information are being generated. These are potential mines of knowledge and could give deep insights into trends of both social and scientific value. However, because most of this material is highly unstructured, we cannot make sense of it. Natural language processing (NLP) is a serious attempt to organise textual matter that is in a human-understandable form (natural language) in a meaningful and insightful way. Here, text entailment can be considered a key component in verifying the correctness or efficiency of this organisation. This paper surveys various proposed text entailment methods, giving a comparative picture based on criteria such as robustness and semantic precision.
A HYBRID APPROACH TO WORD SENSE DISAMBIGUATION WITH AND WITH... - ijnlc
Word Sense Disambiguation is the classification of the meaning of a word in a precise context, a tricky task in Natural Language Processing that is used in applications like machine translation, information extraction and retrieval, and automatic or closed-domain question answering systems, by reason of its semantic perspective. Researchers have tried unsupervised and knowledge-based learning approaches, but such approaches have not proved very helpful. Various supervised learning algorithms have been developed, but in vain, as creating the training corpus, a tagged sense-marked corpus, is tricky. This paper presents a hybrid approach for resolving ambiguity in a sentence based on integrating lexical knowledge and world knowledge. The English WordNet developed at Princeton University, the SemCor corpus, and the JAWS library (Java API for WordNet Searching) have been used for this purpose.
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity - IDES Editor
One of the difficult tasks in Natural Language Processing (NLP) is resolving the sense ambiguity of characters or words in text, such as polyphones, homonymy, and homography. This paper addresses the ambiguity issue of Chinese character polyphones and a disambiguation approach for such issues. Three methods, dictionary matching, language models, and a voting scheme, are used to disambiguate the prediction of polyphones. Compared with the well-known MS Word 2007 and language models (LMs), our approach is superior to these two methods for this issue. The final precision rate is enhanced up to 92.75%. Based on the proposed approaches, we have constructed an e-learning system in which several related functions of Chinese transliteration are integrated.
This presentation is a briefing of a paper about Networks and Natural Language Processing. It describes many graph-based methods and algorithms that help in syntactic parsing, lexical semantics, and other applications.
Big Data and Natural Language Processing - Michel Bruley
Natural Language Processing (NLP) is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.
French machine reading for question answering - Ali Kabbadj
This paper proposes to unlock the main barrier to machine reading and comprehension of French natural-language texts. This opens the way for machines to find a precise answer to a question buried in a mass of unstructured French text, or to create a universal French chatbot. Deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering, and language translation. But to be effective, deep learning methods need very large training datasets. Until now these techniques could not actually be used for French question answering (Q&A) applications, since there was no large Q&A training dataset. We produced a large (100,000+) French training dataset for Q&A by translating and adapting the English SQuAD v1.1 dataset, together with GloVe French word and character embedding vectors from the French Wikipedia dump. We trained and evaluated three different Q&A neural network architectures in French and obtained French Q&A models with F1 scores around 70%.
Word2vec on the Italian language: first experiments - Vincenzo Lomonaco
The word2vec model and applications by Mikolov et al. have attracted a great amount of attention in recent years. The vector representations of words learned by word2vec models have been shown to carry semantic meaning and to be useful in various NLP tasks. In this work I try to reproduce the previously obtained results for the English language and to explore the possibility of doing the same for the Italian language.
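By way of illustration, here is a minimal word2vec training sketch with gensim. This is an assumption on my part: the slides do not say which implementation was used, and the gensim 4.x API, the toy Italian sentences, and all hyperparameters below are placeholders, not the talk's actual setup.

# Minimal word2vec training sketch using gensim (4.x API assumed).
# The corpus and hyperparameters are illustrative placeholders.
from gensim.models import Word2Vec

# Toy corpus: in practice this would be a large tokenized Italian
# corpus, e.g. a Wikipedia dump split into sentences.
sentences = [
    ["il", "gatto", "dorme", "sul", "divano"],
    ["il", "cane", "dorme", "sul", "tappeto"],
    ["la", "gatta", "gioca", "in", "giardino"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep every word in this toy example
    sg=1,             # skip-gram, as in Mikolov et al.
)

# Nearest neighbours in the learned vector space.
print(model.wv.most_similar("gatto", topn=3))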
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C... - Waqas Tariq
A "sentence pattern" in modern Natural Language Processing is often treated as a contiguous sequence of words (an n-gram). However, in many branches of linguistics, such as Pragmatics and Corpus Linguistics, it has been noticed that simple n-gram patterns are not sufficient to reveal the whole sophistication of grammar patterns. We present a language-independent architecture for extracting from sentences more sophisticated patterns than n-grams. In this architecture a "sentence pattern" is considered an n-element ordered combination of sentence elements. Experiments showed that the method extracts significantly more frequent patterns than the usual n-gram approach.
Domain Specific Terminology Extraction (ICICT 2006) - IT Industry
Imran Sarwar Bajwa, M. Imran Siddique, M. Abbas Choudhary, [2006], "Automatic Domain Specific Terminology Extraction using a Decision Support System", in IEEE 4th International Conference on Information and Communication Technology (ICICT 2006), Cairo, Egypt. pp:651-659
Programmers love science! At least, so they say. Yet when it comes to the 'science' of developing code, the most-used tool is brutal debate: Vim versus emacs, static versus dynamic typing, Java versus C#; this can go on for hours on end. In this session, software engineering professor Felienne Hermans will present the latest research in software engineering that tries to understand and explain which programming methods, languages, and tools are best suited for different types of development.
Dictionary-based concept mining: an application for Turkish - csandit
In this study, a dictionary-based method is used to extract expressive concepts from documents. So far, there have been many studies concerning concept mining in English, but this area of study for Turkish, an agglutinative language, is still immature. We used a dictionary instead of WordNet, a lexical database grouping words into synsets that is widely used for concept extraction. Dictionaries are rarely used in the domain of concept mining, but taking into account that dictionary entries have synonyms, hypernyms, hyponyms, and other relationships in their meaning texts, the success rate has been high for determining concepts. This concept extraction method is applied to documents collected from different corpora.
Final Report
Final Report of Internship Program
at Kyoto Institute of Technology
Ho Xuan Vinh,
Faculty of Information Technology - University of Science
Ho Chi Minh City, Vietnam
Email: hovinh39@gmail.com
Abstract—With development faster than ever, the Machine Learning approach is showing its dominance in solving almost every problem, from Image Processing to Natural Language Processing. However, to exploit these advantages, scientists must provide clean and clear material from which machines can automatically extract features and learn. That is the goal of this internship: creating a bilingual Vietnamese-English annotated corpus for training further advanced tasks. I met many challenges in finding an appropriate tagset and an implementation method, all of which are described below.
Index Terms—LLOCE, UCREL, WordNet, sense tag, semantic annotation, ...
I. INTRODUCTION
Beginning with Alan Turing's famous Enigma-decoding machine, the field of Natural Language Processing has had many ups and downs in its research and development history. Human-machine interaction has taken a great leap since 1980, with encouraging achievements supported by Machine Learning techniques. Unavoidably, the thirst for well-annotated, large corpora has become one of the most important factors in building a great training system. We also see that the more detailed the annotated corpus, the more information you can extract and the higher the accuracy of the system. Specifically, we can divide the labels used for annotation into 3 levels:
• Morphological (noun, verb...)
• Grammar (pronoun...)
• Semantic (distinguishing different entities with the same morphological form)
The most distinctive is the Semantic level, which not only provides much meaningful information about documents, but also shows the ability to resolve complicated ambiguity through semantic relationships. Not standing outside the big game, many efforts have been made to create a universal semantic tagset, such as WordNet, CoreLex, LLOCE, UCREL...
This would be great news, except that corpora annotated with these tagsets are extremely rare, which does not provide enough data for training and testing Machine Learning methods, especially for a low-resource language. From this motivation, I aim to build an annotated bilingual English-Vietnamese corpus. The following sections trace my timeline of research during this internship. Extra explanation and discussion of results have also been added to clarify the changes in each approach. Concluding remarks are given, along with the work I will continue when back in my home country.
II. THE HOBBIT - WORDNET
Taking a little time to conduct a survey on "semantic tagsets", WordNet turns out to be, without doubt, the biggest star on stage. Since the mid-1980s, linguists at Princeton University have been building the English ontology WordNet, now with more than 117,000 synsets. These synsets are connected to each other by many relationships: hypernymy, hyponymy, meronymy, troponymy... Each synset, represented as a node or leaf in a tree, contains a list of synonymous words. This structure is a great advantage yet also a disadvantage, since the tagset is too fine-grained (Figure 1), so much so that even people cannot figure out the difference between 2 labels from only a synonym or a short definition. For example, the word 'bank' belongs to 3 noun synsets (a small inspection sketch follows the list):
• sloping land (especially the slope beside a body of water)
• a long ridge or pile
• a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
Understanding its properties, I knew this goal was too hard to achieve: there is no WordNet dictionary available for Vietnamese, and the English corpora annotated with WordNet senses are small, coming from annual contests like SemEval that seek the best Word Sense Disambiguation method. Nothing seemed promising from the beginning, so I decided to keep looking for another tagset.
Fig. 1. WordNet visualization of Noun and Adjective.
III. THE FELLOWSHIP OF THE RING - LLOCE AND MAPPING
This is the most unpopular and least attractive tagset you can find via Google Scholar, yet the concept behind the dictionary is easy to grasp and friendly to humans. An ordinary dictionary puts 'animal' and 'zoo', or 'uncle' and 'aunt', in distant positions because of alphabetical order. However, in daily thinking those pairs go together in a particular way. The constructors of the Longman Lexicon of Contemporary English (LLOCE) use this idea to cluster words sharing common characteristics into groups. LLOCE is organized as 14 topics, called branches of daily life, which are divided into 128 subjects and nearly 2,500 groups (Figure 2).
For example, Branch C: People and the family consists of:
• Subject 'People': Groups C1 to C19
• Subject 'Courting, sex, and marriage': Groups C20 to C39
• Subject 'Friendship and enmity': Groups C40 to C49
• ...
Each group has its own short definition and a set of words belonging to it. Statistically, 705 groups contain verbs, 1,482 groups contain nouns, 429 contain adjectives, and 63 contain adverbs.
Fig. 2. Structure of LLOCE dictionary.
LLOCE also has external relations between subjects; however, due to the lack of evidence about how these relationships were created, I left them out of consideration.
The reason I chose LLOCE is the material I had at hand:
• The LLOCE dictionary in both English and Vietnamese.
• A bilingual corpus extracted from the examples of the LLOCE dictionary.
The hard part is that the corpus was not annotated, so I did not actually know how to evaluate the result after annotating. I also could not trace back which sentence is the example of which word. Fortunately, in this period I came across the idea of using Latent Dirichlet Allocation (LDA) from the topic-modeling branch, which sounded like a promising unsupervised approach.
Fig. 3. Graphical Model of Latent Dirichlet Allocation.
LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.
LDA assumes the following generative process for each document w in a corpus D:
1) Choose N ∼ Poisson(ξ).
2) Choose θ ∼ Dir(α).
3) For each of the N words w_n:
• Choose a topic z_n ∼ Multinomial(θ).
• Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
Figure 3 shows the factors involved in this technique. In summary, when you feed the algorithm example sentences, you obtain the topic mixture for each of them. This proportion is a clue that helps determine which group (label) is the most suitable for the word under consideration.
For ease of understanding, let's try the example illustrated in Figure 4. First, you train LDA on the dictionary so that it learns the distribution of words in each topic (we choose 2,500 topics according to the 2,500 Groups of LLOCE); then, using the trained model, you find the topic mixture of any sentence. For example, the sentence "He hewed out an important position for himself in the company" has the mixture 30% N43, 15% N89, 45% C52, 12.5% B41, with the rest spread over other Groups. If the word "important" has 2 candidate Groups (N43 and N89), we choose the one with the higher ratio, which means N43.
Fig. 4. Idea about implementing LDA for annotating.
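A minimal sketch of this labeling idea, using gensim's LdaModel. Gensim is an assumption (the report does not name its implementation), and the toy group texts, topic count, and candidate-group mapping below are illustrative placeholders:

# Sketch of the LDA-based group-labeling idea, using gensim (assumed).
# Toy data only: a real run would train on the LLOCE dictionary with
# ~2,500 topics, one per Group.
from gensim import corpora, models

# Pretend each "document" is the word list of one LLOCE group entry.
group_texts = [
    ["important", "position", "rank", "status"],   # stands in for N43
    ["important", "urgent", "serious", "grave"],   # stands in for N89
    ["company", "firm", "business", "employer"],   # stands in for C52
]
dictionary = corpora.Dictionary(group_texts)
corpus = [dictionary.doc2bow(t) for t in group_texts]

# Train LDA; num_topics would be ~2500 in the real setting.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=3,
                      passes=50, random_state=0)

# Infer the topic mixture of a new sentence...
sentence = ["he", "hewed", "out", "an", "important", "position",
            "for", "himself", "in", "the", "company"]
bow = dictionary.doc2bow(sentence)
mixture = dict(lda.get_document_topics(bow, minimum_probability=0.0))

# ...and pick, among the candidate groups for "important", the topic
# with the higher share of the sentence's mixture.
candidates = {0: "N43", 1: "N89"}   # hypothetical topic-to-group map
best = max(candidates, key=lambda t: mixture[t])
print("chosen group:", candidates[best])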
Contrary to my expectations, the first step ran into trouble: the model could not learn the word distributions as I wanted. Further discussion with Araki sensei showed that this could happen for many reasons: the number of topics is too large; a document is usually a collection of paragraphs rather than a single sentence; the co-occurrence of words; and an inaccurately simulated environment during training.
In an attempt to fix this, I tried different ways, yet still did not reach the goal. Increasing the co-occurrence of words using DBpedia seemed possible at first, but the coverage of LLOCE is too small and the domains are too different. Decreasing the number of topics to the number of Topics or Branches did not help either. To obtain an annotated corpus, I tried mapping from LLOCE to WordNet based on a similarity metric over WordNet's tree structure, so as to turn SemCor (the WordNet-annotated corpus used in SemEval) into an LLOCE-annotated corpus; a sketch of this mapping idea follows. However, despite a similar distribution of words and node levels, no deeper intuition was drawn from this experiment.
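As one possible concretization (an assumption: the report does not specify the metric, and NLTK's WordNet path similarity plus the LLOCE group word list below are stand-ins):

# Sketch: score how well an LLOCE group matches a WordNet synset by
# averaging path similarity between the synset and the group's words.
# NLTK's WordNet interface is an assumption; the actual metric used
# during the internship is not specified in the report.
from nltk.corpus import wordnet as wn

def group_synset_score(group_words, synset):
    """Average the best path similarity between `synset` and any
    synset of each word in the LLOCE group."""
    scores = []
    for word in group_words:
        sims = [synset.path_similarity(s) or 0.0
                for s in wn.synsets(word)]
        if sims:
            scores.append(max(sims))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical LLOCE group word list versus senses of 'bank'.
group = ["slope", "ridge", "shore", "hill"]
for syn in wn.synsets("bank", pos=wn.NOUN)[:3]:
    print(syn.name(), round(group_synset_score(group, syn), 3))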
During this time, I also implemented a pre-processing and LLOCE package in Python for easier use later, offering basic functions such as checking which Groups a word belongs to, listing the unique words in each Group, and baseline annotation with Left-Right Maximum Matching.
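The package itself is not included in this report, so the following is only a rough sketch of what such an interface might look like; the class name, method names, and the tiny lexicon are all hypothetical:

# Hypothetical sketch of the LLOCE helper package's interface; the
# actual internship code is not included in the report.
from collections import defaultdict

class LLOCE:
    def __init__(self, lexicon):
        # lexicon: mapping from group code (e.g. "C1") to its word list.
        self.lexicon = lexicon
        self.word2groups = defaultdict(set)
        for group, words in lexicon.items():
            for w in words:
                self.word2groups[w].add(group)

    def groups_of(self, word):
        """Which Groups a word belongs to."""
        return sorted(self.word2groups.get(word, set()))

    def unique_words(self, group):
        """Words appearing in this Group and nowhere else."""
        return sorted(w for w in self.lexicon.get(group, [])
                      if len(self.word2groups[w]) == 1)

    def annotate(self, tokens):
        """Baseline left-right maximum matching: at each position take
        the longest lexicon entry, else fall back to one token."""
        entries = {tuple(w.split()) for ws in self.lexicon.values()
                   for w in ws}
        out, i = [], 0
        while i < len(tokens):
            for n in range(min(4, len(tokens) - i), 0, -1):
                span = tuple(tokens[i:i + n])
                if n == 1 or span in entries:
                    phrase = " ".join(span)
                    out.append((phrase, self.groups_of(phrase)))
                    i += n
                    break
        return out

lex = {"C1": ["person", "human being"], "C20": ["marriage"]}
print(LLOCE(lex).annotate(["a", "human", "being", "in", "marriage"]))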
IV. THE TWO TOWERS - UCREL
The UCREL semantic analysis system (USAS) is a framework for undertaking the automatic semantic analysis of text. The framework has been designed and used across a number of research projects since 1990. The semantic tagset used by USAS was originally loosely based on Tom McArthur's Longman Lexicon of Contemporary English (LLOCE, 1981). It has a multi-tier structure with 21 major discourse fields (shown in Figure 5), subdivided, and with the possibility of further fine-grained subdivision in certain cases.
Fig. 5. UCREL category system.
In fact, most of the research on this tagset is my friend's task, but after getting stuck with LLOCE I spent 2 weeks working with it. This time the idea is simpler. They provide a website for uploading your corpus, from which you receive the corpus annotated in English and 5 other languages (Dutch, Chinese, Italian, Portuguese, Spanish). You can then run word alignment on your available sentence-aligned bilingual corpus and project the labels from English to Vietnamese. So the problem is no longer annotating Vietnamese, but how to project labels from English to Vietnamese correctly; a sketch of this projection step follows.
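A minimal sketch of the label-projection step, assuming token-level alignments are already available. The alignment pairs and the USAS-style tags below are invented placeholders; real alignments would come from a tool such as GIZA++ or fast_align:

# Sketch of projecting semantic tags across a word alignment.
# The alignments and tags here are invented placeholders; real ones
# would come from an aligner such as fast_align or GIZA++.

def project_tags(en_tokens, en_tags, vi_tokens, alignment):
    """alignment: list of (en_index, vi_index) pairs.
    Returns one tag per Vietnamese token ('UNK' if unaligned)."""
    vi_tags = ["UNK"] * len(vi_tokens)
    for en_i, vi_i in alignment:
        vi_tags[vi_i] = en_tags[en_i]
    return list(zip(vi_tokens, vi_tags))

en = ["I", "love", "my", "family"]
tags = ["Z8", "E2+", "Z8", "S4"]          # invented USAS-like tags
vi = ["tôi", "yêu", "gia_đình", "tôi"]
align = [(0, 0), (1, 1), (3, 2), (2, 3)]  # invented alignment pairs
print(project_tags(en, tags, vi, align))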
However, we have no Vietnamese dictionary to support this, so things came back to the starting point.
V. THE RETURN OF THE KING - WORDNET
My mentor had a quick discussion with me about coming back to WordNet. He now has an English-Vietnamese version of WordNet, and a part of WordNet's SemCor translated into Vietnamese. The problem stays the same: make the alignment correct, finding heuristics to handle exceptions in both languages (words that are adjectives in English but nouns in Vietnamese, and so on...). This is my current work up to now.
VI. CONCLUSION
So far, my progress has not been as good as I wished. The time I spent conducting the survey was too short, which left me confused when doing experiments. Even after realizing that, I still struggled to fully understand the concepts in the papers (which contain a lot of complex formulas). This guarantees many challenges waiting for me before the deadline of my thesis. But I know that I have learned how to handle them, and the memories from here will remind me that whenever there is a hard time, I can always get over it.
Fig. 6. My lab.
ACKNOWLEDGEMENT
Thank you, Araki sensei, for the warm discussions about my work, for the support when I felt confused and lost in my research, and for teaching me to "learn new things by doing experiments with them". Thank you, Nakano-san and the other students in the lab, for giving me memorable experiences, not only in work but also in play.