1. Topic No 1.
NATURAL LANGUAGE SIGN
SYSTEMS
MAIN SECTIONS
1.1. Models and methods of representation and organization of
knowledge – lectures 1-2.
1.2. Quantitative specification of natural language systems –
lectures 3-4, 8.
1.3. Logical-statistical methods of knowledge retrieval –
lectures 5-7.
OPTIONAL SECTIONS FOR SELF-STUDY
1.4. Technology of automated formation of thesaurus.
1.5. Example of natural language resource studying.
3. References
Lecture materials can be found in:
Yu.N. Filippovich, A.V. Prohorov. Semantics of information technologies: practices of dictionary-thesaurus description. Computer linguistics series. Introductory article by A.I. Novikov. M.: MGUP, 2002. With CD-ROM. pp. 46–54.
4. DISTRIBUTION-STATISTICAL METHOD
Basic hypothesis:
Meaningful language elements (words) that occur
together within a text interval are semantically connected
to each other
Quantitative (frequency) characteristics of the
separate or joint occurrence of
meaningful language elements
Formula for the ‘connection strength’ coefficient
Semantic classification of
meaningful language elements
5. FREQUENCY CHARACTERISTICS OF
CONTEXTS
Context $C_i(T)$: a piece of text, a sequence (chain) of syntagmas.
$T = C_1(T) + \dots + C_q(T)$, where $C_i(T) \cap C_j(T) = \varnothing$ for all $i, j \in [1, q]$, $i \neq j$.
If the syntagma is a meaningful language element (word), then:
$N_A$, $f_A = N_A/N$: number and frequency of contexts in which only
word A occurred;
$N_B$, $f_B = N_B/N$: number and frequency of contexts in which only
word B occurred;
$N_{AB}$, $f_{AB} = N_{AB}/N$: number and frequency of contexts in which
words A and B occurred jointly;
$N$: total number of contexts.
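The counts defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each context is modelled as a set of words, the function name `context_frequencies` and the toy corpus are hypothetical, and N_A counts every context that contains A (including those that also contain B). If "only word A" is meant literally, N_AB would have to be subtracted from N_A and N_B.

```python
def context_frequencies(contexts, a, b):
    """Count contexts containing a, b, and both; return counts and relative frequencies."""
    n = len(contexts)
    n_a = sum(1 for c in contexts if a in c)            # assumption: A with or without B
    n_b = sum(1 for c in contexts if b in c)
    n_ab = sum(1 for c in contexts if a in c and b in c)
    return {"N": n, "N_A": n_a, "N_B": n_b, "N_AB": n_ab,
            "f_A": n_a / n, "f_B": n_b / n, "f_AB": n_ab / n}

# Hypothetical toy corpus: each context is the set of words of one sentence.
contexts = [
    {"semantic", "field", "word"},
    {"semantic", "word"},
    {"field", "descriptor"},
    {"thesaurus", "word"},
]
stats = context_frequencies(contexts, "semantic", "word")
```

Here `stats["N_AB"]` is 2 out of N = 4 contexts, so f_AB = 0.5.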
6. FORMULAE OF ‘CONNECTION
STRENGTH’ COEFFICIENT (1)
$K_{AB} = f_{AB} = \dfrac{N_{AB}}{N}$ – T.T. Tanimoto, L.B. Doyle.
[Formula expressed through $N_{AB}$, $f_A$, $f_B$ and $N$] – M.E. Maron, J. Kuhns.
7. FORMULAE OF ‘CONNECTION
STRENGTH’ COEFFICIENT (2)
$K_{AB} = \dfrac{f_{AB}\, N}{f_A f_B}$ – A.Ya. Shaikevich, G. Salton, R.M. Curtice.
– S. Dennis.
– H.E. Stiles.
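As a minimal sketch, the ratio form of the coefficient (the one attributed to Shaikevich, Salton and Curtice) can be computed from raw context counts. The sketch assumes that f in that formula denotes absolute counts, so that K = N_AB * N / (N_A * N_B). With relative frequencies this is the same quantity as f_AB / (f_A * f_B), and it equals 1 when A and B co-occur exactly as often as independence predicts. The function name `connection_strength` is hypothetical.

```python
def connection_strength(n_ab, n_a, n_b, n):
    """K_AB = N_AB * N / (N_A * N_B); equals 1.0 when A and B are independent."""
    if n_a == 0 or n_b == 0:
        raise ValueError("both words must occur in at least one context")
    return n_ab * n / (n_a * n_b)

# A and B each occur in 2 of 4 contexts and co-occur in 1: exactly chance level.
k_chance = connection_strength(n_ab=1, n_a=2, n_b=2, n=4)

# A occurs in 2 contexts, B in 3, and they co-occur in 2: above chance level.
k_linked = connection_strength(n_ab=2, n_a=2, n_b=3, n=4)
```

Values above 1 suggest attraction between the two words and values below 1 suggest repulsion. As the slides stress, the number still needs interpretation.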
8. ANALYSIS OF FORMULAE
OF ‘CONNECTION STRENGTH’ COEFFICIENT (1)
All formulae of the ‘connection strength’ coefficient share one idea:
events related to the occurrence of words A and B are treated as a system
of random events.
The method relies on the following fact:
if A and B are independent events, then P(AB) = P(A)P(B).
The computed value of the ‘connection strength’ coefficient needs
interpretation (explanation).
The size of the context (number of surrounding words) most likely
determines the type of connection detected:
a) 1–2 words: contact syntagmatic connections within
word combinations;
b) 5–10 words: distant syntagmatic connections
and paradigmatic relations;
c) 50–100 words: thematic connections between words.
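The effect of context size can be illustrated with a sliding window over a token list. This is a sketch only: the function name `window_cooccurrences` and the sample sentence are hypothetical, and the window is counted in word positions to the right of each token, which is one of several possible conventions.

```python
from collections import Counter

def window_cooccurrences(tokens, window):
    """Count unordered word pairs that occur within `window` positions of each other."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[frozenset((left, right))] += 1
    return pairs

tokens = "the red car passed the old red bus".split()
near = window_cooccurrences(tokens, window=1)  # contact connections only
far = window_cooccurrences(tokens, window=5)   # also more distant connections
```

With `window=1` the pair (car, bus) is never counted. With `window=5` it is, which mirrors the shift from contact connections to more distant ones as the context grows.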
9. ANALYSIS OF FORMULAE
OF ‘CONNECTION STRENGTH’ COEFFICIENT (2)
Matrix of cohesion of language units (words), or
associative matrix:

                        word    ...   a_i    ...
word    frequency               ...   f_a    ...
...
b_j     f_b                     ...   f_ab   ...
...

Directions of method implementation:
• formation of the core of thematically connected texts;
• automated construction of thesaurus;
• information search and indexing;
• automated abstracting.
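The associative matrix described on this slide can be sketched as two counters: one for single-word frequencies (f_a, f_b) and one for joint frequencies (f_ab). The function name `associative_matrix` and the toy contexts are hypothetical, and storing the matrix sparsely as a counter of sorted word pairs is an implementation choice, not something the slides prescribe.

```python
from collections import Counter
from itertools import combinations

def associative_matrix(contexts):
    """Return (word frequencies, joint frequencies) counted over contexts."""
    word_freq = Counter()
    joint_freq = Counter()
    for context in contexts:
        words = sorted(set(context))              # each word counts once per context
        word_freq.update(words)
        joint_freq.update(combinations(words, 2))  # keys are (a, b) with a < b
    return word_freq, joint_freq

contexts = [
    ["thesaurus", "word", "descriptor"],
    ["word", "descriptor"],
    ["thesaurus", "index"],
]
f, f_ab = associative_matrix(contexts)
```

Here `f["thesaurus"]` is 2 and `f_ab[("descriptor", "word")]` is 2. Pairs that never co-occur simply have no entry.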
10. METHODOLOGY FOR THESAURUS
CONSTRUCTION BASED ON DISTRIBUTION-
STATISTICAL METHOD
Compilation of frequency glossaries and concordances.
Analysis of the joint occurrence of words (language units) and, on
that basis, compilation of the associative matrix.
Subjective interpretation of the associative matrix and formation
of classes of typical connections (relations).
Separation of specific relation types (genus-species,
causal, etc.).
Interpretation of individual word connections.
Grouping into semantic fields.
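The first step of the methodology, compiling a frequency glossary and a concordance, might look as follows. This is a hedged sketch: the function names, the bracket notation for the keyword and the default context width of two words are all illustrative assumptions.

```python
from collections import Counter

def frequency_glossary(tokens):
    """Words with their counts, most frequent first."""
    return Counter(tokens).most_common()

def concordance(tokens, keyword, width=2):
    """Every occurrence of `keyword` with `width` words of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [f"[{token}]"] + right))
    return lines

tokens = "a word is a sign and a sign is a unit".split()
glossary = frequency_glossary(tokens)
kwic = concordance(tokens, "sign")
```

For this sample, `kwic` holds the two lines `is a [sign] and a` and `and a [sign] is a`. The later steps (interpreting the matrix, grouping relation types) are described on the slide as subjective and are not automated here.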
11. COMPONENT ANALYSIS
The method of component analysis makes it possible to trace the
connection between two notions by
analysing their definitions.
[Diagram: Notion A (with the definition of notion A) is linked to Notion B (with the definition of notion B) by the connection strength f_AB]
Main method modifications:
• Quantitative specification of connection.
• Hypertext link.
12. QUANTITATIVE SPECIFICATION OF
CONNECTION
Two words A and B are considered connected with
connection strength $f_{AB} = k$
if their definitions contain k common words:
$\{x_i^{AB}\}$ – set of identical words
used in the definitions of words A and B;
$k = |\{x_i^{AB}\}|$ – number of identical words, where $k > 1$.
Clusters of words connected by connection strength
$f = k$, $k = 1, 2, 3, \dots, K$.
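A minimal sketch of this measure: the connection strength of two headwords is the size of the intersection of their definitions. The mini-dictionary and the function name `definition_link` are hypothetical, and the definitions are assumed to be already tokenized and stripped of grammatical words.

```python
def definition_link(definitions, a, b):
    """Return the words shared by two definitions and their number k."""
    common = set(definitions[a]) & set(definitions[b])
    return common, len(common)

# Hypothetical mini-dictionary: headword -> words of its definition.
definitions = {
    "glossary": ["list", "words", "explanations"],
    "thesaurus": ["list", "words", "semantic", "relations"],
    "index": ["list", "references"],
}
common, k = definition_link(definitions, "glossary", "thesaurus")
_, k_weak = definition_link(definitions, "glossary", "index")
```

Here "glossary" and "thesaurus" are connected with k = 2, while "glossary" and "index" share a single word (k = 1).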
13. HYPERTEXT LINK
Two words A and B are considered connected
if their definitions share a common word,
$f_{AB} = k = 1$.
Uses of hypertext links:
• lexicographical systems
(e-dictionaries and encyclopedias),
• e-texts,
• information and reference systems, etc.
Possible uses for knowledge analysis:
• analysis of a definition system or definition dictionary;
• examination of the quality of dictionary articles (by number of
connections with other dictionary articles, by length of chain);
• examination of extracts from definition dictionaries;
• analysis of text dictionaries;
• examination of help systems.
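The quality check mentioned above (number of connections of a dictionary article with other articles) can be sketched as a small link graph. This is an illustration under assumptions: the function name `hypertext_links` and the sample entries are hypothetical, and two entries are linked whenever their definitions share at least one word, which includes the k = 1 case.

```python
from itertools import combinations

def hypertext_links(definitions):
    """Map each headword to the headwords whose definitions share a word with it."""
    links = {word: set() for word in definitions}
    for a, b in combinations(definitions, 2):
        if set(definitions[a]) & set(definitions[b]):
            links[a].add(b)
            links[b].add(a)
    return links

definitions = {
    "glossary": ["list", "words"],
    "thesaurus": ["words", "relations"],
    "graph": ["nodes", "relations"],
    "poem": ["verse"],
}
links = hypertext_links(definitions)
isolated = [w for w, linked in links.items() if not linked]
```

An entry with no links, such as "poem" here, is a candidate for review, since its definition uses no vocabulary shared with the rest of the dictionary.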
14. FREQUENCY-SEMANTIC METHOD
The frequency-semantic method uses two
characteristics of word definitions as criteria for
estimating connection strength:
similarity of elements and frequency.
Method idea:
«...imagine the forces of semantic cohesion as a field that pervades the
language and contains bodies: the lexical units of the language. Different units interact the
same way as atoms, molecules, macroscopic bodies, planets and space objects interact: on
one level, i.e. between homogeneous units, as well as across levels.»
Basic data:
• ideographic dictionaries;
• concise definition dictionary of Russian for foreigners;
• definition dictionaries of S.I. Ozhegov and D.N. Ushakov.
15. References
Yu.N. Karaulov. Frequency dictionary of semantic multipliers of the Russian language. M.: Nauka, 1980.
Yu.N. Karaulov, V.I. Molchanov, V.A. Afanasiev, N.V. Mihalev. Analysis of dictionary metalanguage using a computer. M.: Nauka, 1982. 96 p.
16. FORMATION OF SEMANTIC FIELDS (1)
if $a_{w_i d_j} \in A^{k}_{DW}$, then $w_i \in D_j$, where:
$a_{w_i d_j}$ – value of the semantic connection strength between
word $w_i$ and descriptor $d_j$;
$A^{k}_{DW}$ – set of acceptable values of the semantic
connection strength between descriptors and words;
$D_j = \{w_{ij}\}$ – set of words of a descriptor;
$w_i$ – word, $i = 1 \dots |W|$, $W = \{w_i\}$ – set of words;
$d_j$ – descriptor, $j = 1 \dots |D|$, $D = \{d_j\}$ – set of descriptors.
Practical task:
distribute 9,000 words among 1,600 descriptors
17. FORMATION OF SEMANTIC FIELDS (2)
ISSUES OF PRACTICAL TASK SOLUTION
1. Determine the way of comparing words
• Choose the way to obtain (indicate) the semantic multiplier
(lemmatization, folding, root indication, indication of the word stem or
quasi stem)
• Develop a methodology for obtaining the semantic code of a word.
2. Determine the frequency characteristics of semantic multipliers.
3. Identify the criterion for the semantic connection of words
and descriptors.
• Phenomenological model of unit connectivity
• Phenomenological model of K connectivity
• Connectivity model taking into account the frequency of multipliers
18. DETERMINE THE WAY TO COMPARE WORDS
Definition of a word or descriptor: ~10 word forms.
Total number in the experiment: ~110,000 word forms.
Semantic multiplier: elementary unit of the content plane.
Basic assumptions:
a) the semantic decomposition of language is discrete;
b) the set of decomposition elements is finite and observable;
c) the number of combinations is practically infinite;
d) the semantic decomposition is elementary, i.e. consists of indecomposable
elements;
e) semantic elements are homogeneous, i.e. they refer to content (they are
elements of perception and thinking);
f) semantic elements form a universal set, i.e. they are of general character
and their number and range are similar for different languages.
19. WAYS TO OBTAIN (INDICATE) SEMANTIC
MULTIPLIER
Lemmatization: obtaining the canonical word form.
Folding: deletion of all vowels in the word except the vowel
of the first syllable.
Root indication: representation of the word by its root morpheme.
Word stem indication: representation of the word by several
morphemes, for example, prefix and root.
Quasi-stem indication: representation of the word by an arbitrary initial part,
based on the fact that the meaning of a word (its content) is shifted toward its
beginning.
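Two of the techniques listed above, folding and quasi-stem indication, are simple enough to sketch directly. The sketch rests on assumptions: it uses English vowels although the source works with Russian, it treats "the vowel of the first syllable" as the first vowel of the word, and the quasi-stem length of four letters is an arbitrary illustrative choice. The function names are hypothetical.

```python
VOWELS = set("aeiouy")  # assumption: English vowels; the source material is Russian

def fold(word):
    """Drop every vowel except the first one."""
    out, seen_vowel = [], False
    for ch in word.lower():
        if ch in VOWELS:
            if seen_vowel:
                continue          # later vowels are deleted
            seen_vowel = True     # the first vowel is kept
        out.append(ch)
    return "".join(out)

def quasi_stem(word, length=4):
    """Keep an initial part of the word; the length is an arbitrary choice."""
    return word.lower()[:length]

folded = fold("semantic")
stem = quasi_stem("semantics")
```

Both reductions map related word forms to the same multiplier, for example `quasi_stem("semantic")` and `quasi_stem("semantics")` coincide, at the cost of occasionally merging unrelated words.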
20. METHOD OF OBTAINING SEMANTIC CODE OF
THE WORD
METHOD PROCEDURES
1. Inclusion of the coded word in its own code.
2. Exclusion of repeated semantic multipliers.
3. Filtration (deletion) of:
«zero» semantic multipliers;
grammatical words
(prepositions, conjunctions, etc.).
4. Lexicalisation of collocations.
5. Formation of quasi word stems.
RESULTS OF METHOD IMPLEMENTATION
a) descriptor: $d_j = \{x_s^{d_j}\}$; b) word: $w_i = \{x_s^{w_i}\}$
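The procedure above can be sketched as a small pipeline that turns a headword and its definition into a set of multipliers. This is an illustration, not the original implementation: the stop-word list, the punctuation handling and the four-letter quasi stems are assumptions, and step 4 (lexicalisation of collocations) is left out because it needs a collocation dictionary.

```python
# Illustrative filter list standing in for "grammatical words".
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "in", "to", "for"}

def semantic_code(headword, definition, stem_length=4):
    """Build a set of quasi-stem multipliers from a headword and its definition."""
    tokens = [headword.lower()] + definition.lower().split()   # step 1: word joins its code
    tokens = [t.strip(".,;:()") for t in tokens]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]  # step 3: filtration
    return {t[:stem_length] for t in tokens}                   # steps 2 and 5: set of stems

code = semantic_code("thesaurus", "A dictionary of words and the relations of words.")
```

For this sample the code is the set `{"thes", "dict", "word", "rela"}`. Using a set removes repeated multipliers automatically.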
21. DETERMINATION OF FREQUENCY
CHARACTERISTICS OF SEMANTIC MULTIPLIERS
Two frequency characteristics are associated
with a semantic multiplier X:
– frequency of occurrence of the multiplier
in descriptor definitions;
– frequency of occurrence of the multiplier
in word definitions.
Methodology of frequency analysis of semantic multipliers:
a) computing the frequencies;
b) ranking the multipliers and ordering them within definitions
by increasing rank.
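Both steps of the frequency analysis can be sketched with a counter. The sketch assumes that a multiplier's frequency is the number of definitions (semantic codes) in which it occurs, and that rank 1 goes to the most frequent multiplier, so ordering by increasing rank means decreasing frequency. The function names and sample codes are hypothetical.

```python
from collections import Counter

def multiplier_frequencies(codes):
    """Count in how many semantic codes each multiplier occurs."""
    return Counter(x for code in codes.values() for x in code)

def order_by_rank(code, freq):
    """Sort one code's multipliers from most to least frequent (rank 1 first)."""
    return sorted(code, key=lambda x: (-freq[x], x))

descriptor_codes = {
    "d1": {"word", "sign"},
    "d2": {"word", "text"},
    "d3": {"word", "sign", "code"},
}
f_d = multiplier_frequencies(descriptor_codes)
ordered = order_by_rank(descriptor_codes["d3"], f_d)
```

The same function applied to the word codes would give the second frequency characteristic, the frequency in word definitions.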
22. CRITERION OF SEMANTIC CONNECTIVITY
BETWEEN WORDS AND DESCRIPTORS
Stages of development of the criterion:
1. Phenomenological model of unit connectivity:
the definitions of the word and the descriptor share at least one
common multiplier:
$|d_j \cap w_i| = 1$;
2. Phenomenological model of K connectivity:
the definitions of the word and the descriptor share K common
semantic multipliers:
$|d_j \cap w_i| = K$;
3. Connectivity model taking into account the frequency of multipliers
(selective criterion of Karaulov):
$K > 2$; $f_x^D > 6$, where $f_x^D$ is the frequency of multiplier x in descriptor definitions.
23. SELECTIVE CRITERION OF KARAULOV
A word and a descriptor are semantically connected if their definitions
share more than two semantic multipliers, or if their definitions
share at least one semantic multiplier whose frequency in
the set of descriptors is more than six.
Semantic fields construction procedure
1. Construction of the field according to the unit connectivity model.
2. Narrowing of the field by the number of coinciding multipliers.
3. Narrowing of the field taking into account the frequency of semantic multipliers.
If the criterion is satisfied, then $w_i \in D_j$.
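The selective criterion stated on this slide can be sketched as a predicate over two semantic codes. This is a hedged illustration: the function names, the representation of codes as Python sets and the sample frequencies are assumptions, and the thresholds (more than two common multipliers, frequency above six) are taken from the wording of the slide.

```python
def karaulov_connected(word_code, descriptor_code, f_d, min_common=2, min_freq=6):
    """True if the codes share more than `min_common` multipliers, or share at least
    one multiplier whose frequency among descriptors exceeds `min_freq`."""
    common = word_code & descriptor_code
    if len(common) > min_common:
        return True
    return any(f_d.get(x, 0) > min_freq for x in common)

def semantic_field(descriptor_code, word_codes, f_d):
    """All words connected to one descriptor under the criterion."""
    return {w for w, code in word_codes.items()
            if karaulov_connected(code, descriptor_code, f_d)}

# Hypothetical data: multiplier frequencies among descriptors and four word codes.
f_d = {"word": 9, "sign": 2, "text": 1, "code": 1}
descriptor = {"word", "sign", "text", "code"}
word_codes = {
    "w1": {"sign", "text", "code"},  # three common multipliers
    "w2": {"word", "tree"},          # one common multiplier, frequency 9
    "w3": {"sign", "tree"},          # one common multiplier, frequency 2
    "w4": {"leaf"},                  # nothing in common
}
field = semantic_field(descriptor, word_codes, f_d)
```

Here the field contains "w1" and "w2" only. Applying the predicate after a looser first pass corresponds to the narrowing steps of the construction procedure.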
24. QUESTIONS FOR SELF-CHECK
Name the logical-statistical methods of knowledge retrieval from
texts.
Describe the distribution-statistical method of text analysis.
Describe the frequency-semantic method of text analysis.
Describe component analysis of text.