The document discusses logical-statistical methods for knowledge acquisition from texts, including distribution-statistical analysis, component analysis, and frequency-semantic analysis. Distribution-statistical analysis uses the frequency of words occurring together to determine their semantic relationship. Component analysis examines word definitions for common elements. Frequency-semantic analysis considers both the similarity and frequency of elements in word definitions. These methods are used to build semantic fields by grouping words into descriptive categories.
1. Topic No 1.
NATURAL LANGUAGE SIGN SYSTEMS
MAIN SECTIONS
1.1. Models and methods of knowledge representation and organization — lectures 1-2.
1.2. Quantitative specification of natural language systems — lectures 3-4, 8.
1.3. Logical-statistical methods of knowledge retrieval — lectures 5-7.
OPTIONAL SECTIONS FOR SELF-STUDY
1.4. Technology of automated thesaurus construction.
1.5. An example of studying a natural language resource.
3. References
Lecture materials can be found in:
Yu.N. Filippovich, A.V. Prohorov. Semantics of information technologies: practices of dictionary-thesaurus description. / Computer Linguistics series. Introductory article by A.I. Novikov. M.: MGUP, 2002. — CD-ROM included in the package — pp. 46–54.
4. DISTRIBUTION-STATISTICAL METHOD
Basic hypothesis:
Meaningful language elements (words) that occur together within a text interval are semantically connected to each other.
The method moves from the quantitative (frequency) characteristics of the sole or joint occurrence of meaningful language elements, through a 'connection strength' coefficient formula, to a semantic classification of the meaningful language elements.
5. FREQUENCY CHARACTERISTICS OF CONTEXTS
Context C_i(T) — a piece of text, a sequence (chain) of syntagmas.
T = C_1(T) + ... + C_q(T), where C_i(T) ∩ C_j(T) = ∅ for all i, j ∈ [1, q], i ≠ j.
If the syntagma is a meaningful language element (word), then:
N_A, f_A = N_A/N — the number and frequency of contexts in which only word A occurred;
N_B, f_B = N_B/N — the number and frequency of contexts in which only word B occurred;
N_AB, f_AB = N_AB/N — the number and frequency of contexts in which words A and B occurred jointly;
N — the total number of contexts.
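The context counts defined above can be sketched in code. This is a minimal illustration, assuming contexts are given as lists of words; the function name is hypothetical:

```python
def context_frequencies(contexts, word_a, word_b):
    """Compute N and the frequencies f_A, f_B, f_AB over a list of
    contexts (each context a list of words), following the slide's
    definitions: f_A and f_B count contexts containing only one of
    the two words, f_AB contexts containing both."""
    n = len(contexts)  # N — total number of contexts
    n_a = sum(1 for c in contexts if word_a in c and word_b not in c)
    n_b = sum(1 for c in contexts if word_b in c and word_a not in c)
    n_ab = sum(1 for c in contexts if word_a in c and word_b in c)
    return {"N": n, "f_A": n_a / n, "f_B": n_b / n, "f_AB": n_ab / n}
```

For example, over five contexts in which A and B co-occur twice, f_AB = 2/5 = 0.4.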
6. FORMULAE OF THE 'CONNECTION STRENGTH' COEFFICIENT (1)
K_AB = f_AB = N_AB / N — T.T. Tanimoto, L.B. Doyle.
K_AB = N_AB − f_A·f_B·N — M.E. Maron, J. Kuhns.
7. FORMULAE OF THE 'CONNECTION STRENGTH' COEFFICIENT (2)
K_AB = f_AB·N / (f_A·f_B) — A.Ya. Shaikevich, G. Salton, R.M. Curtice.
Coefficients of the same family were also proposed by S. Dennis and H.E. Stiles.
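Since the formulae on these two slides suffered in extraction, the following sketch implements common readings of the named measures: the raw joint frequency, the observed-minus-expected count, and the observed-to-expected ratio. The exact forms here are an assumption, not a transcription of the slides:

```python
def k_tanimoto_doyle(n_ab, n):
    # K_AB = N_AB / N = f_AB: the raw joint-occurrence frequency
    return n_ab / n

def k_maron_kuhns(n_ab, f_a, f_b, n):
    # K_AB = N_AB - f_A*f_B*N: observed joint occurrences minus the
    # count expected if A and B occurred independently
    return n_ab - f_a * f_b * n

def k_shaikevich_salton(f_ab, f_a, f_b, n):
    # K_AB = f_AB*N / (f_A*f_B): ratio of observed to expected
    # co-occurrence; values above 1 indicate positive association
    return f_ab * n / (f_a * f_b)
```

All three take the counts and frequencies defined on the contexts slide; they differ only in how they relate the observed co-occurrence to the independence baseline.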
8. ANALYSIS OF THE 'CONNECTION STRENGTH' COEFFICIENT FORMULAE (1)
All the 'connection strength' formulae share one premise: the occurrences of words A and B are treated as a system of random events.
The procedure rests on the fact that if A and B are independent events, then P(AB) = P(A)·P(B); a coefficient measures how far the observed joint occurrence deviates from this independence baseline.
The computed value of the 'connection strength' coefficient still needs interpretation (explanation).
The size of the context (the number of surrounding words) largely determines what kind of connection is captured:
a) 1–2 words — contact syntagmatic connections of word combinations;
b) 5–10 words — distant syntagmatic connections and paradigmatic relations;
c) 50–100 words — thematic connections between words.
9. ANALYSIS OF FORMULAE
OF ‘CONNECTION STRENGTH’ COEFFICIENT (2)
Cohesion matrix of language units (words) and associative matrix:

word | frequency | ... | a_i  | ...
b_j  | f_b       | ... | f_ab | ...

Directions of method implementation:
• formation of the core of thematically connected texts;
• automated construction of a thesaurus;
• information search and indexing;
• automated abstracting.
10. METHODOLOGY FOR THESAURUS
CONSTRUCTION BASED ON DISTRIBUTION-
STATISTICAL METHOD
Compilation of frequency glossaries and concordances.
Analysis of the joint occurrence of words (language units) and compilation of the associative matrix on that basis.
Subjective interpretation of the associative matrix and formation of classes of typical connections (relations).
Grouping (segregation) of specific relation types (genus-species, causal, etc.).
Interpretation of individual word connections.
Grouping into semantic fields.
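The first two steps of the methodology are mechanical and can be sketched in code; the later, interpretive steps remain manual. The toy contexts below are illustrative.

```python
# Minimal sketch of steps 1-2 above: compile word frequencies and an
# associative (joint-occurrence) matrix from a list of contexts.
from collections import Counter
from itertools import combinations

def associative_matrix(contexts):
    """Return (word frequency counter, joint-occurrence counter)."""
    freq = Counter()
    joint = Counter()
    for ctx in contexts:
        words = sorted(set(ctx))          # each word counted once per context
        freq.update(words)
        joint.update(combinations(words, 2))  # unordered word pairs
    return freq, joint

contexts = [["river", "water", "bank"],
            ["river", "boat"],
            ["water", "boat", "river"]]
freq, joint = associative_matrix(contexts)
print(freq["river"], joint[("river", "water")])
```

The `joint` counter is exactly the f_ab cell of the associative matrix shown on the previous slide; the interpretation and grouping steps would then be applied to its strongest entries.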
11. COMPONENT ANALYSIS
The method of component analysis traces the connection between two notions based on the analysis of their definitions.
[Diagram: notion A and its definition, connected to notion B and its definition with strength f_AB.]
Main method modifications:
• Quantitative specification of connection.
• Hypertext link.
12. QUANTITATIVE SPECIFICATION OF
CONNECTION
Two words A and B are considered connected with connection strength f_ab = k if their definitions contain k common words.
X_AB = {x_i^AB} — the set of words common to the definitions of words A and B;
k = |X_AB| — the number of common words, k ≥ 1.
This yields clusters of words connected with connection strength f = k, k = 1, 2, 3, ..., K.
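The quantitative specification above amounts to a set intersection over definition words. A minimal sketch, with illustrative definitions and an assumed stop-word list:

```python
# Sketch of the component-analysis connection strength: f_ab = k is the
# number of common words in the two definitions.

def connection_strength(def_a, def_b, stop_words=frozenset()):
    """k = |X_AB|, the number of words shared by the two definitions."""
    common = (set(def_a.lower().split())
              & set(def_b.lower().split())) - stop_words
    return len(common)

def_a = "a large natural stream of water"
def_b = "a body of water surrounded by land"
stop = frozenset({"a", "of", "the", "by"})
k = connection_strength(def_a, def_b, stop)
print(k)
```

Sorting word pairs by k then gives the clusters of connection strength f = k described above.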
13. HYPERTEXT LINK
Two words A and B are considered connected if the definition of each contains a common word, i.e. f_ab = k = 1.
Hypertext links usage:
• lexicographical systems
(e-dictionaries and encyclopedias),
• e-texts,
• information and reference systems etc.
Possible usage for knowledge analysis:
• analysis of definition system or definition dictionary;
• examination of quality of dictionary articles (by number of
connections with other dictionary articles, by length of chain);
• examination of extracts in definition dictionaries;
• analysis of text dictionaries;
• examination of help-systems.
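The hypertext-link idea can be sketched as a graph over dictionary entries: two entries are linked when their definitions share at least one word. The toy dictionary and function names are illustrative.

```python
# Sketch of hypertext linking in a definition dictionary: entries A and B
# are linked (f_ab = k = 1) when their definitions share a common word.

def hypertext_links(dictionary, stop_words=frozenset()):
    """Return the list of linked entry pairs."""
    words = list(dictionary)
    links = []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            da = set(dictionary[a].lower().split()) - stop_words
            db = set(dictionary[b].lower().split()) - stop_words
            if da & db:  # at least one common definition word
                links.append((a, b))
    return links

dictionary = {
    "river": "large natural stream of water",
    "lake": "body of water surrounded by land",
    "chair": "seat with four legs",
}
print(hypertext_links(dictionary))
```

Counting links per entry and following chains of links correspond directly to the dictionary-quality examinations listed above.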
14. FREQUENCY-SEMANTIC METHOD
The frequency-semantic method uses two characteristics of word definitions as the criterion for estimating connection strength: the similarity of their elements and the frequency of those elements.
Method idea:
«...imagine the forces of semantic cohesion as a field that pervades the whole language and in which bodies, the lexical units of the language, are embedded. Different units interact just as atoms, molecules, macroscopic bodies, planets and space objects do: on one level, i.e. between homogeneous units, as well as across levels.»
Basic data:
• ideographic dictionaries;
• a concise definition dictionary of Russian for foreigners;
• the definition dictionaries of S.I. Ozhegov and D.N. Ushakov.
15. References
Karaulov Yu.N. Frequency dictionary of semantic multipliers of the Russian language. M.: Nauka, 1980.
Karaulov Yu.N., Molchanov V.I., Afanasiev V.A., Mihalev N.V. Analysis of dictionary metalanguage using a computer. M.: Nauka, 1982. 96 p.
16. FORMATION OF SEMANTIC FIELDS (1)
If a_ij(w, d) ∈ A_k, then w_i ∈ D_j, where:
a_ij(w, d) — the value of the semantic connection strength between word w_i and descriptor d_j;
A_k — the set of acceptable values of the semantic connection strength between descriptors and words;
D_j = {w_ij} — the set of words of a descriptor;
w_i — a word, i = 1...|W|, W = {w_i} — the set of words;
d_j — a descriptor, j = 1...|D|, D = {d_j} — the set of descriptors.
Practical task:
divide 9000 words among 1600 descriptors.
17. FORMATION OF SEMANTIC FIELDS (2)
ISSUES OF PRACTICAL TASK SOLUTION
1. Determine how words are to be compared.
• Choose the way to obtain (indicate) the semantic multiplier (lemmatization, folding, root indication, indication of the word stem or quasi-stem).
• Develop a methodology for obtaining the semantic code of a word.
2. Determine the frequency characteristics of semantic multipliers.
3. Identify the criterion for the semantic connection of words and descriptors.
• Phenomenological model of unit connectivity.
• Phenomenological model of K-connectivity.
• Connectivity model taking multiplier frequency into account.
18. DETERMINE THE WAY TO COMPARE WORDS
A word/descriptor definition is ~10 word forms;
the total number in the experiment is ~110,000 word forms.
Semantic multiplier — an elementary unit of the content plane.
Basic presumptions:
a) the semantic partition of the language is discrete;
b) the range of its elements is finite and observable;
c) the number of their combinations is practically infinite;
d) the semantic partition is elementary, i.e. it consists of indecomposable elements;
e) the semantic elements are homogeneous, i.e. they belong to the content plane (they are elements of perception and thinking);
f) the semantic elements form a universal set, i.e. they are of a general character, and their number and range are similar across languages.
19. WAYS TO OBTAIN (INDICATE) THE SEMANTIC
MULTIPLIER
Lemmatization — obtaining the canonical form of the word.
Folding — compressing the word, i.e. deleting all vowels except those of the first syllable.
Root indication — representing the word by its root morpheme.
Word stem indication — representing the word by several morphemes, for example prefix and root.
Quasi-stem indication — representing the word by an arbitrary initial part, based on the fact that the word's meaning (content) gravitates toward its beginning.
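Two of the reduction schemes above can be sketched briefly. The sketch works on English words for readability; the vowel set and the fixed quasi-stem length are illustrative assumptions.

```python
# Sketch of 'folding' (delete vowels after the first syllable) and
# 'quasi-stem indication' (keep a fixed-length initial part).
VOWELS = set("aeiouy")

def fold(word):
    """Delete all vowels except those of the first syllable."""
    out = []
    i = 0
    while i < len(word) and word[i] not in VOWELS:  # leading consonants
        out.append(word[i]); i += 1
    while i < len(word) and word[i] in VOWELS:      # first vowel run
        out.append(word[i]); i += 1
    out.extend(ch for ch in word[i:] if ch not in VOWELS)  # drop the rest
    return "".join(out)

def quasi_stem(word, length=4):
    """Keep an initial part of fixed length (meaning gravitates to the start)."""
    return word[:length]

print(fold("semantics"), quasi_stem("semantics"))
```

Either reduction maps many word forms onto one string, which is what lets the reduced form serve as a crude semantic multiplier.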
20. METHOD OF OBTAINING SEMANTIC CODE OF
THE WORD
METHOD PROCEDURES
1. Inclusion of the coded word itself in its code.
2. Exclusion of repeated semantic multipliers.
3. Filtration (deletion) of:
• 'zero' semantic multipliers;
• grammatical words: prepositions, conjunctions, etc.
4. Lexicalisation of collocations.
5. Formation of quasi-stems.
RESULTS OF METHOD IMPLEMENTATION
а) descriptor — d_j = {s_x^(d_j)}; б) word — w_i = {s_x^(w_i)}.
21. DETERMINATION OF FREQUENCY
CHARACTERISTICS OF SEMANTIC MULTIPLIERS
Two frequency characteristics are associated with a semantic multiplier x:
f_x^D — the frequency of the multiplier's occurrence in descriptor definitions;
f_x^W — the frequency of the multiplier's occurrence in word definitions.
Methodology of the frequency analysis of semantic multipliers:
а) computing the frequencies;
b) ranking and grading the multipliers in definitions in order of increasing rank.
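Steps (a) and (b) above reduce to counting and sorting. A minimal sketch, where semantic codes are modelled as sets of multipliers and all data are illustrative:

```python
# Sketch of the frequency analysis of semantic multipliers: count each
# multiplier over descriptor codes and over word codes, then rank.
from collections import Counter

def multiplier_frequencies(codes):
    """codes: iterable of semantic codes, each a set of multipliers."""
    freq = Counter()
    for code in codes:
        freq.update(code)
    return freq

descriptor_codes = [{"move", "fast"}, {"move", "water"}, {"move"}]
word_codes = [{"fast", "run"}, {"water"}]
f_d = multiplier_frequencies(descriptor_codes)  # f_x^D over descriptors
f_w = multiplier_frequencies(word_codes)        # f_x^W over words
ranked = sorted(f_d, key=f_d.get, reverse=True)
print(f_d["move"], ranked[0])
```

The counter `f_d` is the quantity that the selective criterion on the next slides compares against a threshold.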
22. CRITERION OF SEMANTIC CONNECTIVITY
BETWEEN WORDS AND DESCRIPTORS
Stages of development of the criterion:
1. Phenomenological model of unit connectivity:
the definitions of the word and of the descriptor share at least one common multiplier:
|d_j ∩ w_i| = 1;
2. Phenomenological model of K-connectivity:
the definitions of the word and of the descriptor share K common semantic multipliers:
|d_j ∩ w_i| = K;
3. Connectivity model taking multiplier frequency into account
(selective criterion of Karaulov): K > 2 or f_x^D > 6.
23. SELECTIVE CRITERION OF KARAULOV
A word and a descriptor are semantically connected if their definitions share more than two common semantic multipliers, or if they share at least one common semantic multiplier whose frequency over the set of descriptors is more than six.
Semantic field construction procedure:
1. Construction of the field according to the unit connectivity model.
2. Narrowing of the field by the number of coinciding multipliers.
3. Narrowing of the field taking the frequency of the semantic multipliers into account.
If the criterion holds, then w_i ∈ D_j.
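The selective criterion stated above translates directly into a small predicate. Semantic codes are modelled as sets of multipliers, f_d is the multiplier frequency over descriptor definitions, and all data below are illustrative.

```python
# Sketch of Karaulov's selective criterion: more than two common
# multipliers, OR at least one common multiplier with descriptor
# frequency greater than six.

def karaulov_connected(word_code, descriptor_code, f_d):
    """True if the word belongs to the descriptor's semantic field."""
    common = word_code & descriptor_code
    if len(common) > 2:
        return True
    return len(common) >= 1 and any(f_d.get(x, 0) > 6 for x in common)

f_d = {"move": 9, "fast": 2, "water": 1}
print(karaulov_connected({"move", "run"}, {"move", "water"}, f_d))
```

Applying this predicate to every (word, descriptor) pair implements the three-step field construction procedure: the unit-connectivity field is the pairs with a non-empty intersection, and the two narrowing steps are the two branches of the criterion.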
24. QUESTIONS FOR SELF-CHECK
Name the logical-statistical methods of knowledge retrieval from texts.
Describe the distribution-statistical methodology of text analysis.
Describe the frequency-semantic methodology of text analysis.
Describe component text analysis.