This document describes a probabilistic definite clause grammar (PDCG) parser for Vietnamese sentences that incorporates Chomsky's theory of subcategorization. The researchers built a treebank of 1000 Vietnamese training sentences that were syntactically analyzed and tagged by hand. They defined subcategorization tags for nouns, verbs, adjectives, adverbs, prepositions, and conjunctions in Vietnamese. They also defined phrasal tags and developed syntactic rules for phrases based on the subcategorization tags. Experimental results showed that the precision, recall and F-measure of the subcategorized PDCG parser were all over 98%, demonstrating the effectiveness of the approach.
Processing vietnamese news titles to answer relative questions in vnewsqa ict...ijnlc
This paper introduces two important elements of our VNewsQA/ICT system: its semantic models
of simple Vietnamese sentences and its semantic processing mechanism. The VNewsQA/ICT is a
Vietnamese based Question Answering system which has the ability to gather information from
some Vietnamese news title forms on the ICTnews websites (http://www.ictnews.vn), instead of
using a database or a knowledge base, to answer the related Vietnamese questions in the domain
of information and communications technology.
SEMANTIC PROCESSING MECHANISM FOR LISTENING AND COMPREHENSION IN VNSCALENDAR ...ijnlc
This paper presents an overview of the VNSCalendar system, a tool that understands users’
voice commands and helps users manage and query their personal calendar by Vietnamese
speech. The main feature of this system is that it is equipped with a mechanism for
analyzing the syntax and semantics of Vietnamese commands and questions. The syntactic and semantic
processing of Vietnamese sentences is handled using DCG (Definite Clause Grammar) and the methods of
formal semantics. This is the first system in this field of voice applications that is equipped with an effective
semantic processing mechanism for the Vietnamese language. Built and tested in a PC environment,
our system attains an accuracy of more than 91%.
Taxonomy extraction from automotive natural language requirements using unsup...ijnlc
In this paper we present a novel approach to semi-automatically learn concept hierarchies from natural
language requirements of the automotive industry. The approach is based on the distributional hypothesis
and the special characteristics of domain-specific German compounds. We extract taxonomies by using
clustering techniques in combination with general thesauri. Such a taxonomy can be used to support
requirements engineering in early stages by providing a common system understanding and an agreed-upon
terminology. This work is part of an ontology-driven requirements engineering process, which builds
on top of the taxonomy. Evaluation shows that this taxonomy extraction approach outperforms common
hierarchical clustering techniques.
Word sense disambiguation using wsd specific wordnet of polysemy wordsijnlc
This paper presents a new model of WordNet that is used to determine the correct sense of a polysemous
word based on clue words. The words related to each sense of a polysemous word, as well as to each single-sense
word, are referred to as the clue words. The conventional WordNet organizes nouns, verbs, adjectives and
adverbs together into sets of synonyms called synsets, each expressing a different concept. In contrast to the
structure of WordNet, we developed a new model of WordNet that organizes the different senses of
polysemous words, as well as the single-sense words, based on the clue words. These clue words
are used to disambiguate the meaning of the
polysemous word in the given context using knowledge-based Word Sense Disambiguation (WSD) algorithms.
A clue word can be a noun, verb, adjective or adverb.
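As a rough illustration of clue-word disambiguation, the sketch below picks the sense whose clue words overlap most with the context. It is a minimal overlap count, not the algorithm from the paper; the `CLUE_WORDS` inventory, the sense labels and the `disambiguate` function are hypothetical names introduced only for this example.

```python
from typing import Dict, Set

# Hypothetical clue-word inventory: each sense of a polysemous word maps to a
# set of clue words. The entries are illustrative and not taken from the paper.
CLUE_WORDS: Dict[str, Dict[str, Set[str]]] = {
    "bank": {
        "financial_institution": {"money", "loan", "deposit", "account"},
        "river_side": {"river", "water", "shore", "fishing"},
    }
}


def disambiguate(word: str, context: str) -> str:
    """Return the sense of `word` whose clue words overlap most with the context."""
    tokens = set(context.lower().split())
    senses = CLUE_WORDS[word]
    # Count shared words per sense; on a tie the first listed sense wins.
    return max(senses, key=lambda sense: len(senses[sense] & tokens))


print(disambiguate("bank", "she opened an account to deposit money at the bank"))
print(disambiguate("bank", "they went fishing from the river bank"))
```

A real system would also need tokenization, lemmatization and a much larger clue-word inventory; the overlap count only shows the basic idea.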
Emotion expression is an essential function of daily life that can be severely affected by some psychological disorders. In this paper we identify seven emotional states: anger, surprise, sadness, happiness, fear, disgust and neutral. The definition of parameters is a crucial step in the development of a system for emotion analysis. The 15 explored features are energy intensity, pitch, standard deviation, jitter, shimmer, autocorrelation, noise-to-harmonic ratio, harmonic-to-noise ratio, energy entropy block, short-term energy, zero crossing rate, spectral roll-off, spectral centroid, spectral flux, and formants. The database used in this work is SAVEE (Surrey Audio-Visual Expressed Emotion). Results obtained with different learning methods, and estimation using a confidence interval for the identified parameters, are compared and explained. The overall experimental results reveal that Model 2 and Model 3 give better results than Model 1 using learning methods, and estimation shows that most emotions are correctly estimated using energy intensity and pitch.
Machine translation evaluation is a very important activity in machine translation development. Automatic evaluation metrics proposed in the literature are inadequate, as they require one or more human reference translations to compare with the output produced by machine translation. This does not always give accurate results, as a text can have several different translations. Human evaluation metrics, on the other hand, lack inter-annotator agreement and repeatability. In this paper we propose a new human evaluation metric which addresses these issues. Moreover, this metric also provides solid grounds for making sound assumptions about the quality of the text produced by a machine translation.
UNL-ization of Numbers and Ordinals in Punjabi with IANijnlc
In the field of Natural Language Processing, Universal Networking Language (UNL) has been an area of
immense interest among researchers during the last couple of years. UNL is
an artificial language used for representing information in a natural-language-independent format. This
paper presents the UNL-ization of Punjabi sentences, with the help of different examples containing numbers
and ordinals written in words, using the IAN (Interactive Analyzer) tool. In the UNL approach, UNL-ization is the
process of converting a natural language resource to UNL, and NL-ization is the process of generating a
natural language resource out of a UNL graph. IAN processes input sentences with the help of TRules and
dictionary entries. The proposed system performs the UNL-ization of numbers of up to fourteen digits and
ordinals, written in words in the Punjabi language, with the help of 104 dictionary entries and 67 TRules. The
system is tested on a sample of 150 random Punjabi numbers and ordinals, written in words, and its F-measure
comes out to be 1.000 (on a scale of 0 to 1).
Event detection and summarization based on social networks and semantic query...ijnlc
Events can be characterized by a set of descriptive, collocated keywords extracted from documents. Intuitively, documents describing the same event will contain similar sets of keywords, and the graph for a document collection will contain clusters corresponding to individual events. Helping users understand events is an acute problem nowadays, as users struggle to keep up with the tremendous amount of information published every day on the Internet. Detecting events from online web resources is a challenging task that is receiving increasing attention. A Web search log is an important data source for event detection because the information it contains reflects users’ activities and interest in various real-world events. Three major issues play a role in event detection from web search logs: effectiveness, efficiency of detected events. We focus on modeling the content of events by their semantic relations with other events and generating structured summarization. Event mining is a useful way to understand computer system behaviors. The focus of recent work on event mining has shifted from discovering frequent patterns to event summarization. Event summarization provides a comprehensible explanation of the event sequence based on certain aspects.
Developing links of compound sentences for parsing through marathi link gramm...ijnlc
Marathi is a verb-final language with a relatively free word order. Complex sentences are one of the major types of sentences commonly used in any language. This paper studies the complex sentence structure of the Marathi language. It proposes various links for complex sentence clauses and models complex sentences using the proposed links in the Link Grammar framework for parsing.
Kridantas play a vital role in understanding the Sanskrit language. Kridantas include nouns, adjectives and
indeclinable words called avyayas. Kridantas are formed from a root and certain suffixes called Krits.
Sometimes Kridantas may occur with certain prefixes. Many morphological analyzers lack a complete
analysis of Kridantas. This paper describes a novel approach to deal completely with Kridantas.
Current research is focusing on opinion mining, also called sentiment analysis, because of the
sheer volume of opinion-rich web resources, such as discussion forums, review sites and blogs, available
in digital form. One important problem in sentiment analysis of product reviews is producing a summary of
opinions based on product features. In this paper we survey and analyze various techniques that
have been developed for the key tasks of opinion mining. On the basis of this survey and analysis, we provide
an overall picture of what is involved in developing a software system for opinion mining.
G2 pil a grapheme to-phoneme conversion tool for the italian languageijnlc
This paper presents a knowledge-based approach for the grapheme-to-phoneme conversion (G2P) of isolated words of the Italian language. With more than 7,000 languages in the world, the biggest challenge today is to rapidly port speech processing systems to new languages with low human effort and at reasonable cost. This includes the creation of qualified pronunciation dictionaries. The dictionaries provide the mapping from the orthographic form of a word to its pronunciation, which is useful in both speech synthesis and automatic speech recognition (ASR) systems. For training the acoustic models we need an automatic routine that maps the spelling of the training set to a string of phonetic symbols representing the pronunciation.
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATIONijnlc
Named Entity Recognition and Classification (NERC) is the process of identifying proper nouns in text and classifying them into predefined categories such as person name, location, organization, date, and time. NERC in Kannada is an essential and challenging task. The aim of this work is to develop a novel model for NERC based on a Multinomial Naïve Bayes (MNB) classifier. The methodology adopted in this paper is based on feature extraction from the training corpus, using term frequency and inverse document frequency and fitting them to a tf-idf vectorizer. The paper discusses the various issues in developing the proposed model, along with the details of implementation and performance evaluation. The experiments are conducted on a training corpus of 95,170 tokens and a test corpus of 5,000 tokens. The model achieves Precision, Recall and F1-measure of 83%, 79% and 81% respectively.
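The three reported figures are mutually consistent: F1 is the harmonic mean of precision and recall, so 83% and 79% should give roughly 81%. The short check below uses only the standard definition; the function name `f1_score` is chosen for this example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract above.
f1 = f1_score(0.83, 0.79)
print(round(f1 * 100))  # prints 81
```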
Text can be analysed by splitting it and extracting keywords. These may be represented as summaries, tabular representations, graphical forms, and images. The large amount of information present in textual format has led to research on extracting text and transforming it from an unstructured form to a structured format. The paper presents the importance of Natural Language Processing (NLP) and two interesting applications of it in the Python language: 1. Automatic text summarization [Domain: Newspaper Articles] 2. Text to Graph Conversion [Domain: Stock news]. The main challenge in NLP is natural language understanding, i.e. deriving meaning from human or natural language input, which is done using regular expressions, artificial intelligence and database concepts. The Automatic Summarization tool converts newspaper articles into a summary on the basis of the frequency of words in the text. The Text to Graph Converter takes a stock article as input, tokenizes it on various indices (points and percent) and time, and then maps the tokens to a graph. This paper proposes a business solution for users for effective time management.
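A frequency-based summarizer of the kind described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the tool from the paper: the stop-word list, the regular expressions and the `summarize` function are all introduced here for illustration.

```python
import re
from collections import Counter
from typing import List

# Assumed, deliberately tiny stop-word list; a real tool would use a fuller one.
STOPWORDS = {"the", "a", "an", "as", "and", "was", "of", "in", "to", "is"}


def summarize(text: str, num_sentences: int = 1) -> List[str]:
    """Return the highest-scoring sentences, kept in their original order."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Count every non-stop-word across the whole text.
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS)

    def score(sentence: str) -> int:
        # A sentence scores the summed frequency of its words (stop-words count 0).
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    chosen = set(sorted(sentences, key=score, reverse=True)[:num_sentences])
    return [s for s in sentences if s in chosen]


article = (
    "Stocks rose today. "
    "The market rallied as stocks gained and stocks closed higher. "
    "Weather was mild."
)
print(summarize(article, 1))
```

Summing raw frequencies favours long sentences; dividing the score by sentence length is a common variation if shorter summaries are preferred.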
AN HYBRID APPROACH TO WORD SENSE DISAMBIGUATION WITH AND WITH...ijnlc
Word Sense Disambiguation is the classification of the meaning of a word in a precise context. It is a tricky task in Natural Language Processing, used in applications such as machine translation, information extraction and retrieval, and automatic or closed-domain question answering systems, owing to its semantic perspective. Researchers have tried unsupervised and knowledge-based learning approaches; however, such approaches have not proved very helpful. Various supervised learning algorithms have been developed, but in vain, as creating the training corpus, a tagged, sense-marked corpus, is tricky. This paper presents a hybrid approach for resolving ambiguity in a sentence, based on integrating lexical knowledge and world knowledge. The English WordNet developed at Princeton University, the SemCor corpus and the JAWS library (Java API for WordNet Searching) have been used for this purpose.
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...ijnlc
Building dialogue systems for interaction has recently gained considerable attention, but most of the resources and systems built so far are tailored to English and other Indo-European languages. The need to design systems for other languages, such as Arabic, is increasing. For this reason, there is growing interest in the Arabic dialogue acts classification task, because it is a key player in Arabic language understanding for building these systems. This paper surveys different techniques for dialogue acts classification for Arabic. We describe the main existing techniques for utterance segmentation and classification, annotation schemas, and test corpora for Arabic dialogue understanding that have been introduced in the literature.
Contextual Analysis for Middle Eastern Languages with Hidden Markov Modelsijnlc
Displaying a document in Middle Eastern languages requires contextual analysis due to different presentational forms for each character of the alphabet. The words of the document will be formed by the joining of the correct positional glyphs representing corresponding presentational forms of the
characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by the software developers.
A SIGNATURE BASED DRAVIDIAN SIGN LANGUAGE RECOGNITION BY SPARSE REPRESENTATIONijnlc
Sign language is a visual-gestural language used by deaf-dumb people for communication. As normal people are unfamiliar with sign language, hearing-impaired people find it difficult to communicate with them. The communication gap between the normal and the deaf-dumb people can be bridged by means of Human–Computer Interaction. The objective of this paper is to convert the Dravidian (Tamil) sign language into text. The proposed method recognizes 12 vowels, 18 consonants and a special character “Aytham” of the Tamil language by a vision-based approach. In this work, static images of the hand signs are obtained from a web/digital camera. The hand region is segmented by a threshold applied to the hue channel of the input image. Then the region of interest (i.e. from wrist to fingers) is segmented using the reversed horizontal projection profile, and the Discrete Cosine transformed signature is extracted from the boundary of the hand sign. These features are invariant to translation, scale and rotation. A sparse representation classifier is incorporated to recognize the 31 hand signs. The proposed method has attained a maximum recognition accuracy of 71% against a uniform background.
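The first two steps of the pipeline, hue thresholding and a horizontal projection profile, can be sketched with the Python standard library. This is only an illustration: the hue bounds, the sample pixel values and the function names are assumptions, and the paper's "reversed" profile and its actual thresholds are not specified in the abstract.

```python
import colorsys
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def hue_mask(image: List[List[Pixel]], lo: float, hi: float) -> List[List[int]]:
    """Binary mask: 1 where the pixel hue (scaled 0..1) lies in [lo, hi], else 0."""
    mask = []
    for row in image:
        mask_row = []
        for r, g, b in row:
            hue, _, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            mask_row.append(1 if lo <= hue <= hi else 0)
        mask.append(mask_row)
    return mask


def horizontal_projection(mask: List[List[int]]) -> List[int]:
    """Number of foreground pixels in each image row."""
    return [sum(row) for row in mask]


# Toy 3x2 image with assumed colours: a skin-like tone and a blue background.
SKIN, BLUE = (224, 172, 105), (0, 0, 255)
image = [[SKIN, BLUE], [SKIN, SKIN], [BLUE, BLUE]]
mask = hue_mask(image, lo=0.02, hi=0.14)  # hypothetical hue range
print(horizontal_projection(mask))  # prints [1, 2, 0]
```

Rows where the profile drops towards zero mark where the hand region ends, which is the kind of cue a projection profile gives for cropping a region of interest.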
The noun phrase introducers of npChapter 4the noun phr.docxarnoldmeredith47041
The noun phrase: introducers of np
Chapter 4
the noun phrase:
introducers of NP
Determiners
Numerals
Quantifiers
Quantity without Q
Possessive NPs
WH- words
The noun phrase:
Introducers of np
Determiners
Encode:
Definiteness
Indefiniteness
Number
Proximity (closeness)
(Questions: see 6: WH- determiners)
determiners
Definiteness:
A definite noun (phrase) is known to both speaker and hearer
Determiners
Definiteness
Example 1:
Context: Ann walks in and says to Bob:
“The student is outside.”
Bob assumes from Ann’s phrasing that she is referring to someone specific, and that he should know which student she means. (He has to use non-linguistic sources to figure out which student it is.)
Determiners
Definiteness
Example 2:
Same context: Ann walks in and says to Bob:
“The President is on TV right now.”
Bob assumes from Ann’s phrasing that she is referring to someone specific, and that he should know which person she means. (He has to use non-linguistic sources to figure out who it is—in this case, it’s probably not difficult.)
Determiners
Indefiniteness
An indefinite noun (phrase) is NOT assumed to be known to speaker and hearer.
Determiners
Indefiniteness
Example 1:
Context: Ann walks in and begins to talk to Bob:
“A student is outside.”
Bob assumes she will explain which student is outside.
Determiners
Indefiniteness
Example 2:
Context: Ann walks in and begins to talk to Bob:
“A president is outside.”
Bob assumes she will explain which president is outside. Since there aren’t usually lots of Presidents to choose from, this sentence is odd.
determiners
Number
Distinguish singular/plural
Examples:
A letter
Some letters / some writing
This letter
These letters
determiners
Proximity
Distinguish closeness to speaker or someone else; demonstratives
Examples:
This letter (close to speaker)
That letter (close to someone else)
These letters
Those letters
determiners
Summary
Encode:
Definiteness/indefiniteness
Number: singular/plural
Proximity to speaker/other
numerals
Encode:
Number
Indefiniteness
Sequence (order)
numerals
Number
Examples:
One frog jumped in the pond.
Ten frogs jumped in the pond.
numerals
Indefiniteness
Example:
Two frogs jumped in the pond.
The speaker and hearer are not assumed to know which particular frogs jumped in the pond, just how many did it.
numerals
Indefiniteness
Compare:
Two frogs jumped in the pond.
Those two frogs jumped in the pond.
numerals
Sequence (order)
Example:
The first frog jumped in the pond.
The second frog jumped in the pond.
Tells which frog based on its order relative to others:
Called ordinal numbers
Numerals:
Phrase structure rule
NP → (Det) (Num) N
[NP [Det the] [Num second] [N frog]]
[NP [Det a] [N frog]]
[NP [N frogs]]
numerals
Summary:
Numerals encode number
Numerals can encode indefiniteness
Numerals can encode order
Phrase Structure Rule:
NP → (Det) (Num) N
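The rule NP → (Det) (Num) N says that a noun phrase is an optional determiner, then an optional numeral, then an obligatory noun. A small recognizer makes the optionality concrete. The lexicon and the `parse_np` function below are hypothetical and cover only the words used in the slide examples.

```python
from typing import List, Optional, Tuple

# Hypothetical mini-lexicon built from the slide examples.
LEXICON = {
    "the": "Det", "a": "Det", "those": "Det",
    "first": "Num", "second": "Num", "two": "Num",
    "frog": "N", "frogs": "N",
}


def parse_np(words: List[str]) -> Optional[List[Tuple[str, str]]]:
    """Return (category, word) pairs if the words match NP -> (Det) (Num) N, else None."""
    lowered = [w.lower() for w in words]
    tags = [LEXICON.get(w) for w in lowered]
    i = 0
    # Optional determiner, then optional numeral, in that order.
    if i < len(tags) and tags[i] == "Det":
        i += 1
    if i < len(tags) and tags[i] == "Num":
        i += 1
    # Exactly one word must remain, and it must be a noun.
    if i == len(tags) - 1 and tags[i] == "N":
        return list(zip(tags, lowered))
    return None


print(parse_np(["the", "second", "frog"]))
print(parse_np(["second", "the", "frog"]))
```

The second call returns None because the numeral precedes the determiner, an order the rule does not allow.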
quantifiers
What quantifiers “do” (in terms of meaning):
Pick out members of a set in ways other .
Event detection and summarization based on social networks and semantic query...ijnlc
Events can be characterized by a set of descriptive, collocated keywords extracted documents. Intuitively,
documents describing the same event will contain similar sets of keywords, and the graph for a document collection will contain clusters individual events. Helping users to understand the event is an acute problem nowadays as the users are struggling to keep up with tremendous amount of information published every day in the Internet. The challenging task is to detect the events from online web resources, it is getting more attentions. The important data source for event detection is a Web search log because the information it contains reflects users’ activities and interestingness to various real world events. There are three major issues playing role for event detection from web search logs: effectiveness, efficiency of
detected events. We focus on modeling the content of events by their semantic relations with other events
and generating structured summarization. Event mining is a useful way to understand computer system behaviors. The focus of recent works on event mining has been shifted to event summarization from discovering frequent patterns. Event summarization provides a comprehensible explanation of the event sequence based on certain aspects.
Developing links of compound sentences for parsing through marathi link gramm...ijnlc
Marathi is a verb-final language with a relatively free word order. Complex Sentences is one of the major types of sentences which are used commonly in any language. This paper explores the study of complex sentence structure of Marathi language. The paper proposes various links of complex sentence clauses and modelling of the complex sentences using proposed links in the Link Grammar Framework for parsing purpose.
Kridantas play a vital role in understanding Sanskrit language. Kridantas includes nouns, adjectives and
indeclinable words called avyayas. Kridantas are formed with root and certain suffixes called Krits. Some
times Kridantas may occur with certain prefixes. Many morphological analyzers are lacking the complete
analysis of Kridantas. This paper describes a novel approach to deal completely with Kridantas.
The current research is focusing on the area of Opinion Mining also called as sentiment analysis due to
sheer volume of opinion rich web resources such as discussion forums, review sites and blogs are available
in digital form. One important problem in sentiment analysis of product reviews is to produce summary of
opinions based on product features. We have surveyed and analyzed in this paper, various techniques that
have been developed for the key tasks of opinion mining. We have provided an overall picture of what is
involved in developing a software system for opinion mining on the basis of our survey and analysis.
G2 pil a grapheme to-phoneme conversion tool for the italian languageijnlc
This paper presents a knowledge-based approach for the grapheme to-phoneme conversion (G2P) of isolated words of the Italian language. With more than 7,000 languages in the world, the biggest challenge today is to rapidly port speech processing systems to new languages with low human effort and at reasonable cost. This includes the creation of qualified pronunciation dictionaries. The dictionaries provide the mapping from the orthographic form of a word to its pronunciation, which is useful in both speech synthesis and automatic speech recognition (ASR) systems. For training the acoustic models we need an automatic routine that maps the spelling of training set to a string of phonetic symbols representing the pronunciation.
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATIONijnlc
Named Entity Recognition and Classification (NERC) is a process of identification of proper nouns in the text and classification of those nouns into certain predefined categories like person name, location,organization, date, and time etc. NERC in Kannada is an essential and challenging task. The aim of this work is to develop a novel model for NERC, based on Multinomial Naïve Bayes (MNB) Classifier. The Methodology adopted in this paper is based on feature extraction of training corpus, by using term frequency, inverse document frequency and fitting them to a tf-idf-vectorizer. The paper discusses the
various issues in developing the proposed model. The details of implementation and performance evaluation are discussed. The experiments are conducted on a training corpus of size 95,170 tokens and test corpus of 5,000 tokens. It is observed that the model works with Precision, Recall and F1-measure of
83%, 79% and 81% respectively.
Text can be analysed by splitting it and extracting the keywords. These may be represented as summaries, tabular representations, graphical forms, and images. The large amount of information present in textual format has led to research on extracting text and transforming it from an unstructured form to a structured format. The paper presents the importance of Natural Language Processing (NLP) and two of its applications in the Python language: 1. Automatic text summarization [Domain: Newspaper Articles] 2. Text to Graph Conversion [Domain: Stock news]. The main challenge in NLP is natural language understanding, i.e. deriving meaning from human or natural
language input, which is done using regular expressions, artificial intelligence and database concepts. The Automatic Summarization tool converts newspaper articles into a summary on the basis of the frequency of words in the text. The Text to Graph Converter takes a stock article as input, tokenizes it on various indices (points and percent) and time, and then maps the tokens to a graph. This paper proposes a business solution for users for effective time management.
AN HYBRID APPROACH TO WORD SENSE DISAMBIGUATION WITH AND WITH...ijnlc
Word Sense Disambiguation is the classification of the meaning of a word in a precise context. It is a tricky task in Natural Language Processing and is used in applications such as machine translation, information extraction and retrieval, and automatic or closed-domain question answering systems because of its semantic perspective. Researchers have tried unsupervised and knowledge-based learning approaches, but such approaches have not proved very helpful. Various supervised learning algorithms have been developed, but in vain, as creating the training corpus, which is a tagged, sense-marked corpus, is tricky. This paper presents a hybrid approach for resolving ambiguity in a sentence, based on integrating lexical knowledge and world knowledge. The English WordNet developed at Princeton University, the SemCor corpus and the JAWS library (Java API for WordNet searching) have been used for this purpose.
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...ijnlc
Building dialogues systems interaction has recently gained considerable attention, but most of the resources and systems built so far are tailored to English and other Indo-European languages. The need for designing systems for other languages is increasing such as Arabic language. For this reasons, there are more interest for Arabic dialogue acts classification task because it a key player in Arabic language understanding to building this systems. This paper surveys different techniques for dialogue acts classification for Arabic. We describe the main existing techniques for utterances segmentations and classification, annotation schemas, and test corpora for Arabic dialogues understanding that have introduced in the literature
Contextual Analysis for Middle Eastern Languages with Hidden Markov Modelsijnlc
Displaying a document in Middle Eastern languages requires contextual analysis due to different presentational forms for each character of the alphabet. The words of the document will be formed by the joining of the correct positional glyphs representing corresponding presentational forms of the
characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by the software developers.
A SIGNATURE BASED DRAVIDIAN SIGN LANGUAGE RECOGNITION BY SPARSE REPRESENTATIONijnlc
Sign language is a visual-gestural language used by deaf-dumb people for communication. As normal people are unfamiliar with sign language, the hearing-impaired people find it difficult to communicate with them. The communication gap between the normal and the deaf-dumb people can be bridged by means of Human–Computer Interaction. The objective of this paper is to convert the Dravidian (Tamil) sign language into text. The proposed method recognizes 12 vowels, 18 consonants and a special character “Aytham” of Tamil language by a vision based approach. In this work, the static images of the hand signs are obtained with a web/digital camera. The hand region is segmented by a threshold applied to the hue channel of the input image. Then the region of interest (i.e. from wrist to fingers) is segmented using the reversed horizontal projection profile and the Discrete Cosine transformed signature is extracted from the boundary of hand sign. These features are invariant to translation, scale and rotation. Sparse representation classifier is incorporated to recognize 31 hand signs. The proposed method has attained a maximum recognition accuracy of 71% in a uniform background.
The noun phrase introducers of npChapter 4the noun phr.docxarnoldmeredith47041
The noun phrase: introducers of np
Chapter 4
the noun phrase:
introducers of NP
Determiners
Numerals
Quantifiers
Quantity without Q
Possessive NPs
WH- words
The noun phrase:
Introducers of np
Determiners
Encode:
Definiteness
Indefiniteness
Number
Proximity (closeness)
(Questions: see 6: WH- determiners)
determiners
Definiteness:
A definite noun (phrase) is known to both speaker and hearer
Determiners
Definiteness
Example 1:
Context: Ann walks in and says to Bob:
“The student is outside.”
Bob assumes from Ann’s phrasing that she is referring to someone specific, and that he should know which student she means. (He has to use non-linguistic sources to figure out which student it is.)
Determiners
Definiteness
Example 2:
Same context: Ann walks in and says to Bob:
“The President is on TV right now.”
Bob assumes from Ann’s phrasing that she is referring to someone specific, and that he should know which person she means. (He has to use non-linguistic sources to figure out who it is—in this case, it’s probably not difficult.)
Determiners
Indefiniteness
An indefinite noun (phrase) is NOT assumed to be known to speaker and hearer.
Determiners
Indefiniteness
Example 1:
Context: Ann walks in and begins to talk to Bob:
“A student is outside.”
Bob assumes she will explain which student is outside.
Determiners
Indefiniteness
Example 1:
Context: Ann walks in and begins to talk to Bob:
“A president is outside.”
Bob assumes she will explain which president is outside. Since there aren’t usually lots of Presidents to choose from, this sentence is odd.
determiners
Number
Distinguish singular/plural
Examples:
A letter
Some letters / some writing
This letter
These letters
determiners
Proximity
Distinguish closeness to speaker or someone else; demonstratives
Examples:
This letter (close to speaker)
That letter (close to someone else)
These letters
Those letters
determiners
Summary
Encode:
Definiteness/indefiniteness
Number: singular/plural
Proximity to speaker/other
numerals
Encode:
Number
Indefiniteness
Sequence (order)
numerals
Number
Examples:
One frog jumped in the pond.
Ten frogs jumped in the pond.
numerals
Indefiniteness
Example:
Two frogs jumped in the pond.
The speaker and hearer are not assumed to know which particular frogs jumped in the pond, just how many did it.
numerals
Indefiniteness
Compare:
Two frogs jumped in the pond.
Those two frogs jumped in the pond.
numerals
Sequence (order)
Example:
The first frog jumped in the pond.
The second frog jumped in the pond.
Tells which frog based on its order relative to others:
Called ordinal numbers
Numerals:
Phrase structure rule
NP → (Det) (Num) N
[NP [Det the] [Num second] [N frog]]
[NP [Det a] [N frog]]
[NP [N frogs]]
numerals
Summary:
Numerals encode number
Numerals can encode indefiniteness
Numerals can encode order
Phrase Structure Rule:
NP → (Det) (Num) N
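The optional slots in the rule NP → (Det) (Num) N can be made concrete with a small Python sketch. It is only an illustration: the names `NP_RULE`, `LEXICON`, `expansions` and `is_np` are hypothetical, and the toy lexicon covers nothing beyond the slide examples.

```python
from itertools import product

# Phrase structure rule from the slides: NP -> (Det) (Num) N.
# Each slot is (category, optional); parentheses in the rule mark optional slots.
NP_RULE = [("Det", True), ("Num", True), ("N", False)]

# Hypothetical toy lexicon covering only the slide examples.
LEXICON = {
    "the": "Det", "a": "Det", "those": "Det",
    "first": "Num", "second": "Num", "two": "Num",
    "frog": "N", "frogs": "N",
}


def expansions(rule):
    """Return every category sequence that the rule licenses."""
    # An optional slot contributes either its category or nothing.
    choices = [((cat,), ()) if optional else ((cat,),) for cat, optional in rule]
    return {sum(combo, ()) for combo in product(*choices)}


def is_np(words):
    """True if the words' categories match one expansion of NP_RULE."""
    try:
        cats = tuple(LEXICON[w.lower()] for w in words)
    except KeyError:
        return False  # unknown word: outside the toy lexicon
    return cats in expansions(NP_RULE)
```

With two optional slots the rule licenses four category sequences, which is why "the second frog", "a frog" and "frogs" are all accepted as noun phrases while a reordered string such as "frog the" is not.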
quantifiers
What quantifiers “do” (in terms of meaning):
Pick out members of a set in ways other .
IMPLEMENTING A SUBCATEGORIZED PROBABILISTIC DEFINITE CLAUSE GRAMMAR FOR VIETNAMESE SENTENCE PARSING
1. International Journal on Natural Language Computing (IJNLC) Vol. 2, No.4, August 2013
DOI : 10.5121/ijnlc.2013.2401 01
IMPLEMENTING A SUBCATEGORIZED
PROBABILISTIC DEFINITE CLAUSE GRAMMAR FOR
VIETNAMESE SENTENCE PARSING
Dang Tuan Nguyen, Kiet Van Nguyen, Tin Trung Pham
Faculty of Computer Science, University of Information Technology,
Vietnam National University – Ho Chi Minh City,
Ho Chi Minh City, Vietnam
{ntdang, nvkiet, pttin}@nlke-group.net
ABSTRACT
In this paper, we present experimental results for a Vietnamese sentence parser built using
Chomsky’s subcategorization theory and PDCG (Probabilistic Definite Clause Grammar). We evaluated
this subcategorized PDCG parser experimentally: we built by hand a Treebank with 1000 syntactic
structures of Vietnamese training sentences and used different testing datasets to evaluate the
results. The precision, recall and F-measure in these experiments are all over 98%.
KEYWORDS
Probabilistic Context-Free Grammar, Probabilistic Definite Clause Grammar, Parsing, Subcategorization
1. INTRODUCTION
The PCFG (Probabilistic Context-Free Grammar) [1], [2], [3], [4] has been applied to develop
several Vietnamese parsers, such as [5], [6], and [7]. All of these Vietnamese parsers use
un-subcategorized PCFG.
In this research, we apply Chomsky’s subcategorization theory [8] and PDCG (Probabilistic
Definite Clause Grammar) [9], [10], [11], [12] to implement a subcategorized PDCG parser that
effectively analyzes simple Vietnamese sentences.
To implement this parser, we define our Vietnamese subcategorized PDCG grammar based on
our set of sub-categorical and phrasal tags, and on syntactic rules defined over these tags.
We also develop a Treebank for training this subcategorized PDCG parser. The Vietnamese
sentences in our Treebank are syntactically analyzed and tagged by hand.
2. SUB-CATEGORICAL AND PHRASAL TAGS
2.1. Definition of Subcategorical Tags
• Nominal Tags
Nouns are divided into 9 sub-categories as presented in Table 1.
Table 1. Sub-categories of nouns
No. Nominal groups Non-terminals Nominal tags
1 Common nouns n N
2 Abbreviation nouns n_abbr N_ABBR
3 Currency nouns n_currency N_CURRENCY
4 English nouns n_eng N_ENG
5 Idiomatic nouns n_idiom N_IDIOM
6 Proper nouns n_prop N_PROP
7 Temporal nouns n_time N_TIME
8 Title nouns n_title N_TITLE
9 Unit nouns n_unit N_UNIT
• Verbal tags
Verbs are divided into 8 sub-categories as presented in Table 2.
Table 2. Sub-categories of verbs
No. Verbal groups Non-terminals Verbal tags
1 Ordinary verbs v V
2 Passive verbs v_bi V_BI
3 Motion verbs v_di V_DI
4 Acquisition verbs v_duoc V_DUOC
5 English verbs v_eng V_ENG
6 Idiomatic verbs v_idiom V_IDIOM
7 Là (to be) v_la V_LA
8 Modal verbs v_modal V_MODAL
• Adjectival tags
Adjectives are divided into 10 sub-categories as presented in Table 3.
Table 3. Sub-categories of adjectives
No. Adjectival groups Non-terminals Adjectival tags
1 Qualitative adjective adj ADJ
2 English adjective adj_eng ADJ_ENG
3 Idiomatic adjective adj_idiom ADJ_IDIOM
4 Measurement adjective adj_measure ADJ_MEASURE
5 Numeric adjective adj_num ADJ_NUM
6 Ordinal adjective adj_order ADJ_ORDER
7 Percentage adjective adj_percent ADJ_PERCENT
8 Quantitative adjective adj_quant ADJ_QUANT
9 Year adjective adj_year ADJ_YEAR
10 Definite adjective adj_dem ADJ_DEM
• Adverbial tags
Adverbs are divided into 7 sub-categories as presented in Table 4.
Table 4. Sub-categories of adverbs
No. Adverbial groups Non-terminals Adverbial tags
1 Ordinary adverb adv ADV
2 Estimative adverb adv_est ADV_EST
3 Adverb of frequency adv_freq ADV_FREQ
4 Negative adverb adv_neg ADV_NEG
5 Adverb of order adv_order ADV_ORDER
6 Adverb of time adv_tense ADV_TENSE
7 Special adverb adv_sp ADV_SP
• Prepositional tags
Prepositions are divided into 3 groups, plus 17 ungrouped prepositions, as presented in Table 5.
Table 5. Groups of prepositions
No. Prepositional groups Non-terminals Prepositional tags
1 Preposition of cause prep_cause PREP_CAUSE
2 Preposition of direction prep_direct PREP_DIRECT
3 Preposition of location prep_location PREP_LOCATION
4 Bằng prep_bang PREP_BANG
5 Cho prep_cho PREP_CHO
6 Của prep_cua PREP_CUA
7 Cùng prep_cung PREP_CUNG
8 Để prep_de PREP_DE
9 Khi prep_khi PREP_KHI
10 Khỏi prep_khoi PREP_KHOI
11 Không prep_khong PREP_KHONG
12 Nếu prep_neu PREP_NEU
13 Qua prep_qua PREP_QUA
14 Sau prep_sau PREP_SAU
15 Trong prep_trong PREP_TRONG
16 Trước prep_truoc PREP_TRUOC
17 Từ prep_tu PREP_TU
18 Vào prep_vao PREP_VAO
19 Về prep_ve PREP_VE
20 Với prep_voi PREP_VOI
• Conjunctional tags
Table 6. Groups of conjunctions
No. Conjunctional groups Non-terminals Conjunctional tags
1 và, “-”, “,” conj CONJ
• Special tags
In Vietnamese, some words always precede a noun or an adjective and modify it,
e.g. “cố”, “cựu”, “phó”, “siêu”, “tân”, … We arrange these special words in one group.
Table 7. Groups of special words
No. Special groups Non-terminals Special tags
1 Special words sp_word SP_WORD
2.2. Phrasal tags
• Verbal phrase tags
Verbal phrases are divided into 5 groups, as shown in Table 8.
Table 8. Groups of verbal phrases
No. Verbal phrase groups Non-terminals Verbal phrase tags
1 Verbal phrase having an intransitive verb vp1 VP1
2 Verbal phrase having a transitive verb and its direct object vp2 VP2
3 Verbal phrase having a transitive verb and its indirect object vp3 VP3
4 Verbal phrase having a transitive verb and its direct and indirect object vp4 VP4
5 General verbal phrase vp VP
• Nominal phrase tags
Noun phrases are divided into 8 groups, as presented in Table 9.
Table 9. Groups of nominal phrases
No. Nominal phrase groups Non-terminals Nominal phrase tags
1 General noun phrase np NP
2 Noun phrases with two components linked together by connecting words or a hyphen np_conj NP_CONJ
3 Currency noun phrases np_currency NP_CURRENCY
4 Noun phrases containing one, two, three, four, or five noun(s) np_n, np_nn, np_nnn, np_nnnn, np_nnnnn NP_N, NP_NN, NP_NNN, NP_NNNN, NP_NNNNN
5 Noun phrases containing one, two, or three noun(s) that precede a preposition np_npp, np_nnpp, np_nnnpp NP_NPP, NP_NNPP, NP_NNNPP
6 Noun phrases of pronoun np_pn NP_PN
7 Noun phrases of proper name np_prop NP_PROP
8 Noun phrases of time np_time NP_TIME
• Prepositional phrase tags
Prepositional phrases are divided into 15 groups based on preposition, as presented in Table 10.
Table 10. Groups of prepositional phrases
No. Prepositional groups Non-terminals Prepositional tags
1 Prepositional phrases containing the preposition “bằng” pp_bang PP_BANG
2 Prepositional phrases of cause pp_cause PP_CAUSE
3 Prepositional phrases containing the preposition “cho” pp_cho PP_CHO
4 Prepositional phrases containing the preposition “của” pp_cua PP_CUA
5 Prepositional phrases containing the preposition “cùng” pp_cung PP_CUNG
6 Prepositional phrases containing a preposition of direction pp_direct PP_DIRECT
7 Prepositional phrases containing the preposition “khi” pp_khi PP_KHI
8 Prepositional phrases containing the preposition “không” pp_khong PP_KHONG
9 Prepositional phrases of location pp_location PP_LOCATION
10 Prepositional phrases containing the preposition “qua” pp_qua PP_QUA
11 Prepositional phrases containing the preposition “sau” pp_sau PP_SAU
12 Prepositional phrases containing the preposition “trong” pp_trong PP_TRONG
13 Prepositional phrases containing the preposition “trước” pp_truoc PP_TRUOC
14 Prepositional phrases containing the preposition “vào” pp_vao PP_VAO
15 Prepositional phrases containing the preposition “về” pp_ve PP_VE
• Adjectival phrase tags
Adjectival phrases are divided into 3 groups and presented in Table 11.
Table 11. Groups of adjectival phrases
No. Adjectival phrase groups Non-terminals Adjectival phrase tags
1 Adjectival phrase of quality adjp ADJP
2 Adjectival phrase of number adjp_num ADJP_NUM
3 Adjectival phrase of measurement adjp_measure ADJP_MEASURE
3. SYNTACTIC RULES OF PHRASES
The probabilities of phrasal structure rules are calculated with 1000 Vietnamese training
sentences in our TreeBank described in the experiments.
3.1. Nominal phrase
• Nominal phrase of one noun
A nominal phrase NP_N is formed by a noun, optionally accompanied by an adjective or adjectival phrase that modifies the noun.
Table 12. Syntactic rules of NP_N phrases
No. Rules Probabilities
1 NP_N → ADJ_PRE N 0.005714
2 NP_N → N 0.809524
3 NP_N → N ADJP 0.179048
4 NP_N → N ADJP_MEASURE 0.001905
5 NP_N → N ADJP_NUM 0.003810
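The rules in Table 12 share the left-hand side NP_N, so under the usual probabilistic-grammar convention their probabilities should sum to 1 up to rounding. The Python sketch below checks this and picks the most likely expansion. The probabilities are copied from Table 12; the dictionary layout and the function names `is_normalized` and `most_probable_rhs` are illustrative assumptions, not part of the authors' Prolog implementation.

```python
import math

# Rule probabilities copied from Table 12 (left-hand side: NP_N).
# Keys are right-hand sides written as tuples of tags.
NP_N_RULES = {
    ("ADJ_PRE", "N"): 0.005714,
    ("N",): 0.809524,
    ("N", "ADJP"): 0.179048,
    ("N", "ADJP_MEASURE"): 0.001905,
    ("N", "ADJP_NUM"): 0.003810,
}


def is_normalized(rules, tol=1e-5):
    """Check that the probabilities of one left-hand side sum to 1.

    The tolerance absorbs the six-decimal rounding used in the table.
    """
    return math.isclose(sum(rules.values()), 1.0, abs_tol=tol)


def most_probable_rhs(rules):
    """Return the right-hand side with the highest probability."""
    return max(rules, key=rules.get)
```

Here `is_normalized(NP_N_RULES)` holds, and the bare-noun expansion NP_N → N is by far the most probable of the five rules.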
• Nominal phrase of two nouns
Nominal phrase NP_NN is formed by two nouns without adjuncts.
Table 13. Syntactic rules of NP_NN phrases
No. Rules Probabilities
1 NP_NN → N N 1.0
• Nominal phrase of three nouns
Nominal phrase NP_NNN is formed by three nouns without complements.
Table 14. Syntactic rules of NP_NNN phrases
No. Rules Probabilities
1 NP_NNN → N N N 1.0
• Nominal phrase of four nouns
Nominal phrase NP_NNNN is formed by four nouns without complements.
Table 15. Syntactic rules of NP_NNNN phrases
No. Rules Probabilities
1 NP_NNNN → N N N N 1.0
• Nominal phrase of five nouns
Nominal phrase NP_NNNNN is formed by five nouns without complements.
Table 16. Syntactic rules of NP_NNNNN phrases
No. Rules Probabilities
1 NP_NNNNN → N N N N N 1.0
5. EXPERIMENTS
To train our subcategorized PDCG parser, we built a Treebank of 1000 simple Vietnamese
training sentences, which were manually analyzed and tagged using our Vietnamese
subcategorized PDCG grammar together with our subcategorization and phrasal tags. All of these
sentences are titles of international news items (from January 2011 to May 2013) collected
and selected from the VnExpress web site [13]. Our Treebank is modeled on the Penn Treebank [14],
and all of its rules are represented in Sandiway Fong’s Prolog format [15].
From this Treebank, the parser extracts 36 sentential rules, 309 phrasal rules and 3248
lexical rules. The probabilities of these rules are calculated following Probabilistic Definite
Clause Grammar (PDCG) [9], [10], [11], [12].
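In a probabilistic grammar of this kind, the probability of a parse is usually taken to be the product of the probabilities of the rules used in its derivation. The sketch below illustrates that computation under stated assumptions. The tree encoding, the function name `tree_probability`, the placeholder words `word_a` and `word_b`, and both lexical probabilities are hypothetical. Only the rule NP_NN → N N with probability 1.0 comes from Table 13.

```python
# Rule probabilities keyed by (left-hand side, right-hand side).
# NP_NN -> N N is taken from Table 13; the lexical entries are HYPOTHETICAL.
rule_prob = {
    ("NP_NN", ("N", "N")): 1.0,
    ("N", ("word_a",)): 0.02,   # hypothetical lexical rule
    ("N", ("word_b",)): 0.01,   # hypothetical lexical rule
}


def tree_probability(tree):
    """Multiply the probabilities of all rules used in a parse tree.

    A tree is (label, children); each child is either a subtree or a
    plain string standing for a word.
    """
    label, children = tree
    # The right-hand side is the sequence of child labels (or words).
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_probability(child)
    return p


tree = ("NP_NN", [("N", ["word_a"]), ("N", ["word_b"])])
print(tree_probability(tree))
```

When a sentence has several candidate parses, a probabilistic parser can rank them by this product and return the most probable one.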
The experimental results for the parser are presented in Table 38. We apply the evaluation
method proposed in [16].
Table 38. Results of testing the subcategorized PDCG parser
Testing datasets Number of sentences Precision (%) Recall (%) F-measure (%)
Dataset 1 250 98.46 98.50 98.48
Dataset 2 500 98.42 98.44 98.43
Dataset 3 750 98.78 98.80 98.79
Dataset 4 1000 98.78 98.82 98.80
In all experiments, precision, recall and F-measure are above 98%.
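The paper does not state which formula is used for the F-measure. Assuming the balanced F-measure (the harmonic mean of precision and recall), the F column of Table 38 can be recomputed from the precision and recall columns. The sketch below does this; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (assumed formula)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F %) copied from Table 38
table_38 = {
    "Dataset 1": (98.46, 98.50, 98.48),
    "Dataset 2": (98.42, 98.44, 98.43),
    "Dataset 3": (98.78, 98.80, 98.79),
    "Dataset 4": (98.78, 98.82, 98.80),
}

for name, (p, r, reported) in table_38.items():
    computed = round(f_measure(p, r), 2)
    print(f"{name}: computed F = {computed:.2f}, reported F = {reported:.2f}")
```

Under this assumption, the recomputed values agree with the reported F column to two decimal places for all four datasets.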
6. CONCLUSIONS
Applying Chomsky’s principle of subcategorization [8] together with PDCG (Probabilistic
Definite Clause Grammar) [9], [10], [11], [12] improves the precision, recall and
F-measure of parsing on all of the Vietnamese sentences tested. However, building a
subcategorized PDCG for Vietnamese takes considerable time and involves substantial linguistic
complexity in defining the tagset and the syntactic rules, as well as in building a Treebank.
In future work, we plan to standardize the subcategorization, the tagset and the syntactic rules for
Vietnamese. At the same time, the parser's lexicon will be extended. A stronger
subcategorized PDCG parser will allow more precise syntactic analysis.
REFERENCES
[1] Michael Collins, “Three Generative, Lexicalised Models for Statistical Parsing”,
Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and
Eighth Conference of the European Chapter of the Association for Computational Linguistics, pp.
16-23, 1997.
[2] Michael Collins, “Head-Driven Statistical Models for Natural Language Parsing”,
Computational Linguistics, MIT Press, Vol. 29, No. 4, pp. 589-637, 2003.
[3] Eugene Charniak, “Statistical techniques for natural language parsing”, AI Magazine, 1997.
[4] Christopher D. Manning, Hinrich Schütze, Foundations of Statistical Natural Language Processing,
MIT Press, 1999.
[5] Nguyen Quoc The, Le Thanh Huong, “Phân tích cú pháp tiếng Việt sử dụng văn phạm phi ngữ
cảnh từ vựng hóa kết hợp xác suất” [Vietnamese parsing using a lexicalized probabilistic
context-free grammar], Proceedings of the FAIR conference, Nha Trang, Vietnam,
Aug. 9-10, 2007.
[6] Hoang Anh Viet, Dinh Thi Phuong Thu, Huynh Quyet Thang, “Vietnamese Parsing Applying The
PCFG Model”, Proceedings of the Second Asia Pacific International Conference on Information
Science and Technology, Vietnam, 2007.
[7] Đề tài VLSP [VLSP project]. Available at: http://vlsp.vietlp.org:8080
[8] Noam Chomsky, Aspects of the theory of syntax, The M.I.T. Press, 1965.
[9] Qaiser Abbas, Nayyara Karamat, Sadia Niazi, “Development of Tree-bank Based Probabilistic
Grammar for Urdu Language”, International Journal of Electrical & Computer Sciences (IJECS),
Vol. 9, No. 9, 2009.
[10] Parsing PCFG in Prolog. Available at: http://w3.msi.vxu.se/~nivre/teaching/statnlp/pdcg.html
[11] Assignment for PCFG Parsing. Available at:
http://stp.lingfil.uu.se/~nivre/5LN437/statmet_ass2.html
[12] Gerald Gazdar, 1999. Available at:
http://www.informatics.sussex.ac.uk/research/groups/nlp/gazdar/teach/nlp/
[13] VnExpress. Available at: http://www.vnexpress.net
[14] Mitchell P. Marcus, Mary Ann Marcinkiewicz, Beatrice Santorini, “Building a Large Annotated
Corpus of English: The Penn Treebank”, Computational Linguistics, Special issue on
using large corpora: II, MIT Press, Vol. 19, No. 2, pp. 313-330, 1993.
[15] Sandiway Fong, Treebank Viewer. Available at:
http://dingo.sbs.arizona.edu/~sandiway/treebankviewer/index.html
[16] E. Black et al., “A Procedure for Quantitatively Comparing the Syntactic Coverage of English
Grammars”, Proceedings of the DARPA Speech and Natural Language Workshop, Pacific Grove,
Morgan Kaufmann, 1991.