Topic No 1.
NATURAL LANGUAGE SIGN
SYSTEMS
MAIN SECTIONS
1.1. Models and methods of representation and organization of
knowledge — lectures 1–2.
1.2. Quantitative specification of natural language systems —
lectures 3–4, 8.
1.3. Logical-statistical methods of knowledge retrieval —
lectures 5–7.
OPTIONAL SECTIONS FOR SELF-STUDY
1.4. Technology of automated thesaurus formation.
1.5. An example of studying a natural language resource.
Lecture 5.
LOGICAL-STATISTICAL METHODS OF
KNOWLEDGE ACQUISITION
• Distribution-statistical method
• Componential analysis
• Frequency-semantic method
References
Lecture materials can be found in:
Yu.N. Filippovich, A.V. Prohorov.
Semantics of Information Technologies:
Practices of Dictionary-Thesaurus Description.
Computer Linguistics series; introductory article by A.I. Novikov.
M.: MGUP, 2002. pp. 46–54.
(CD-ROM included in the package.)
DISTRIBUTION-STATISTICAL METHOD
Basic hypothesis:
Meaningful language elements (words) that occur
together within a text interval are semantically connected
to each other
↓
Quantitative (frequency) characteristics of the
sole or joint occurrence of
meaningful language elements
↓
‘Connection strength’ coefficient formula
↓
Semantic classification of
meaningful language elements
FREQUENCY CHARACTERISTICS OF
CONTEXTS
Context Ci(T) — a piece of text, a sequence (chain) of syntagmas.
T = C1(T) + ... + Cq(T), where Ci(T) ∩ Cj(T) = ∅ for all i, j (i ≠ j) ∈ [1, q].
If the syntagma is a meaningful language element (a word), then:
NA, fA=NA/N — quantity and frequency of contexts, where only
word A occurred;
NB , fB=NB/N — quantity and frequency of contexts, where only
word B occurred;
NAB , fAB=NAB/N — quantity and frequency of contexts, where joint
occurrence of words A and B took place;
N — total number of contexts.
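As an illustration of these definitions, the following hedged Python sketch (not part of the lecture) splits a token sequence into disjoint contexts and computes NA, NB, NAB and the frequencies fA, fB, fAB for a chosen pair of words; the context size and the sample text are arbitrary assumptions.

```python
# Illustrative sketch only: counts the contexts that contain word A alone, word B alone,
# and both A and B, following the definitions fA = NA/N, fB = NB/N, fAB = NAB/N.

def split_into_contexts(tokens, size):
    """Split a token sequence into disjoint contexts C1..Cq of `size` tokens each."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def context_counts(text, word_a, word_b, size=10):
    tokens = text.lower().split()
    contexts = split_into_contexts(tokens, size)
    n = len(contexts)                      # N   -- total number of contexts
    n_a = n_b = n_ab = 0
    for c in contexts:
        has_a, has_b = word_a in c, word_b in c
        if has_a and has_b:
            n_ab += 1                      # NAB -- joint occurrence of A and B
        elif has_a:
            n_a += 1                       # NA  -- only word A occurred
        elif has_b:
            n_b += 1                       # NB  -- only word B occurred
    return {"N": n, "NA": n_a, "NB": n_b, "NAB": n_ab,
            "fA": n_a / n, "fB": n_b / n, "fAB": n_ab / n}

if __name__ == "__main__":
    sample = "the cat sat on the mat while the dog watched the cat from the yard"
    print(context_counts(sample, "cat", "dog", size=5))
```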
FORMULAE OF ‘CONNECTION
STRENGTH’ COEFFICIENT (1)
K f
N
NAB AB
AB
 
— T.T.Tаnimоtо,
L.B.Dоуlе.
N
ffN
K BAAB
AB

 — M.E.Mаrоn,
J.Kuhns.
FORMULAE OF ‘CONNECTION STRENGTH’ COEFFICIENT (2)
KAB = fAB·N / (fA·fB) — A.Ya. Shaikevich, G. Salton, R.M. Curtice.
Further variants of the coefficient: S. Dennis; H.E. Stiles.
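To make the computation concrete, here is a minimal Python sketch that evaluates such coefficients from the context counts defined on the previous slide. It is an illustration only: the algebraic forms follow the slide formulae as reconstructed above and should not be taken as verified reproductions of the cited authors' original measures.

```python
def connection_strength(n, n_a, n_b, n_ab):
    """'Connection strength' coefficients computed from the context counts.

    n    -- total number of contexts (N)
    n_a  -- contexts where only word A occurred (NA)
    n_b  -- contexts where only word B occurred (NB)
    n_ab -- contexts where A and B occurred jointly (NAB)
    """
    f_a, f_b, f_ab = n_a / n, n_b / n, n_ab / n
    k_tanimoto_doyle = f_ab                          # KAB = fAB = NAB / N
    k_maron_kuhns = f_ab - f_a * f_b                 # KAB = (NAB - N*fA*fB) / N
    # KAB = fAB*N / (fA*fB), guarded against zero marginal frequencies
    k_shaikevich = (f_ab * n) / (f_a * f_b) if f_a and f_b else 0.0
    return {"Tanimoto/Doyle": k_tanimoto_doyle,
            "Maron/Kuhns": k_maron_kuhns,
            "Shaikevich/Salton/Curtice": k_shaikevich}

print(connection_strength(n=100, n_a=20, n_b=10, n_ab=5))
```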
ANALYSIS OF FORMULAE
OF ‘CONNECTION STRENGTH’ COEFFICIENT (1)
All formulae of the ‘connection strength’ coefficient share one view:
the events related to the occurrence of words A and B are treated as a system
of random events.
The method procedure makes it possible to test whether this holds:
if A and B are independent events, then P(AB) = P(A)·P(B).
The estimated value of the ‘connection strength’ coefficient requires
interpretation (explanation).
The size of the context (number of surrounding words) largely determines
which kind of connection is detected:
a) 1–2 words — contact syntagmatic connections within
word combinations;
b) 5–10 words — distant syntagmatic connections
and paradigmatic relations;
c) 50–100 words — thematic connections between the words.
ANALYSIS OF FORMULAE
OF ‘CONNECTION STRENGTH’ COEFFICIENT (2)
Cohesion matrix of language units (words), i.e. the associative matrix:
rows hold words bj with their frequencies fb, columns hold words ai with their
frequencies fa, and each cell holds the joint-occurrence frequency fab.
Directions of method implementation:
• formation of the core of thematically connected texts;
• automated construction of a thesaurus;
• information search and indexing;
• automated abstracting.
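A minimal sketch (an assumption-laden illustration, not the lecture's algorithm) of how such an associative matrix could be accumulated from disjoint contexts:

```python
from collections import Counter, defaultdict
from itertools import combinations

def associative_matrix(contexts):
    """Accumulate word frequencies fa and joint frequencies fab over a list of contexts."""
    n = len(contexts)
    freq = Counter()                      # how many contexts each word occurs in
    joint = defaultdict(Counter)          # how many contexts each pair co-occurs in
    for c in contexts:
        words = set(c)
        freq.update(words)
        for a, b in combinations(sorted(words), 2):
            joint[a][b] += 1
            joint[b][a] += 1
    f = {w: freq[w] / n for w in freq}
    f_joint = {a: {b: joint[a][b] / n for b in joint[a]} for a in joint}
    return f, f_joint

contexts = [["data", "text", "word"], ["word", "text"], ["data", "word"]]
f, f_joint = associative_matrix(contexts)
print(f["word"], f_joint["word"]["text"])   # 1.0 0.666...
```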
METHODOLOGY FOR THESAURUS
CONSTRUCTION BASED ON DISTRIBUTION-
STATISTICAL METHOD
• Compilation of frequency glossaries and concordances (a toy sketch follows this list).
• Analysis of the joint occurrence of words (language units) and, on that
basis, compilation of an associative matrix.
• Subjective interpretation of the associative matrix and formation
of classes of typical connections (relations).
• Grouping (segregation) of specific relation types (genus–species,
causal, etc.).
• Interpretation of individual word connections.
• Grouping of semantic fields.
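Here is a hedged Python sketch of the first step (a frequency glossary and a keyword concordance); the window size and the sample sentence are arbitrary choices, not prescribed by the methodology.

```python
from collections import Counter

def frequency_glossary(tokens):
    """Frequency glossary: (word, number of occurrences) pairs, most frequent first."""
    return Counter(tokens).most_common()

def concordance(tokens, keyword, window=3):
    """Concordance lines: every occurrence of `keyword` with `window` words of context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{keyword}] {right}")
    return lines

tokens = "the cat sat on the mat and the cat slept on the mat".split()
print(frequency_glossary(tokens)[:3])
print(concordance(tokens, "cat"))
```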
COMPONENT ANALYSIS
The component analysis method makes it possible to trace the
connection between two notions based on an analysis of their definitions:
Definition of notion A → Notion A —(fAB)— Notion B ← Definition of notion B
Main modifications of the method:
• Quantitative specification of the connection.
• Hypertext link.
QUANTITATIVE SPECIFICATION OF
CONNECTION
Two words A and B are considered connected with
connection strength fab = k
if their definitions contain k common words:
{xi^AB} — the set of words used in both the definition of A
and the definition of B;
k = |{xi^AB}| — the number of such common words, k ≥ 1.
This yields clusters of words connected with connection strength
f = k, k = 1, 2, 3, ..., K (a toy sketch follows below).
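As a toy illustration of this measure, the sketch below counts the words shared by two definitions from an invented mini-dictionary; in a real setting grammatical words would normally be filtered out first.

```python
def connection_k(definitions, a, b):
    """k = number of words common to the definitions of a and b (fab = k)."""
    common = set(definitions[a].lower().split()) & set(definitions[b].lower().split())
    return len(common), common

definitions = {
    "thesaurus": "a dictionary of words grouped by meaning",
    "glossary": "a dictionary of special words with short explanations",
    "atlas": "a book of maps",
}
k, common = connection_k(definitions, "thesaurus", "glossary")
print(k, common)   # the two words are connected with strength fab = k when k >= 1
```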
HYPERTEXT LINK
Two words A and B are considered connected
if the definition of each contains a word common to both,
i.e. fab = k = 1.
Hypertext links usage:
• lexicographical systems
(e-dictionaries and encyclopedias),
• e-texts,
• information and reference systems etc.
Possible usage for knowledge analysis:
• analysis of definition system or definition dictionary;
• examination of quality of dictionary articles (by number of
connections with other dictionary articles, by length of chain);
• examination of extracts in definition dictionaries;
• analysis of text dictionaries;
• examination of help systems.
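A hedged sketch of building such hypertext links over a definition dictionary: two entries are linked when their definitions share at least one non-grammatical word. The stop-word list and the mini-dictionary are invented for illustration.

```python
from itertools import combinations

STOP_WORDS = {"a", "an", "the", "of", "with", "by", "in"}   # assumed grammatical filter

def hypertext_links(definitions):
    """Link every pair of headwords whose definitions share at least one content word."""
    codes = {w: set(d.lower().split()) - STOP_WORDS for w, d in definitions.items()}
    links = []
    for a, b in combinations(definitions, 2):
        if codes[a] & codes[b]:            # fab = k = 1: at least one common word
            links.append((a, b))
    return links

definitions = {
    "thesaurus": "a dictionary of words grouped by meaning",
    "glossary": "a dictionary of special words with short explanations",
    "atlas": "a book of maps",
}
print(hypertext_links(definitions))        # [('thesaurus', 'glossary')]
```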
FREQUENCY-SEMANTIC METHOD
The frequency-semantic method uses two
characteristics of word definitions as the criterion for
estimating connection strength:
similarity of their elements, and the frequency of those elements.
Method idea:
«...imagine the forces of semantic cohesion as an all-pervading field spread throughout the
language, in which the lexical units of the language are the bodies. Different units interact in the
same way as atoms, molecules, macroscopic bodies, planets and cosmic objects interact: both on
one level, i.e. between homogeneous units, and across levels.»
Basic data:
• ideographic dictionaries.
• concise definition dictionary of Russian for foreigners.
• definition dictionaries of S.I. Ozhegov and D.N. Ushakov.
References
Yu.N. Karaulov.
Frequency Dictionary of Semantic
Multipliers of the Russian Language.
M.: Nauka, 1980.
Yu.N. Karaulov, V.I. Molchanov,
V.A. Afanasiev, N.V. Mihalev.
Analysis of Dictionary Metalanguage
by Computer.
M.: Nauka, 1982. 96 p.
FORMATION OF SEMANTIC FIELDS (1)
If a(wi, dj) ∈ Ak(DW), then wi ∈ Dj, where:
a(wi, dj) — value of the semantic connection strength between
word wi and descriptor dj;
Ak(DW) — the set of acceptable values of semantic
connection strength between descriptors and words;
Dj = {wij} — the set of words of a descriptor;
wi — word, i = 1...|W|, W = {wi} — the set of words;
dj — descriptor, j = 1...|D|, D = {dj} — the set of descriptors.
Practical task:
distribute 9,000 words among 1,600 descriptors.
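A rough Python sketch of this assignment rule, in which the connection strength a(wi, dj) is approximated by the number of shared definition words and the acceptable set Ak(DW) by a simple threshold; both simplifications, along with the toy dictionaries, are assumptions for illustration.

```python
def build_fields(word_defs, descriptor_defs, acceptable=lambda a: a >= 2):
    """Place word wi into the field of descriptor dj when a(wi, dj) is acceptable."""
    fields = {d: [] for d in descriptor_defs}
    for w, w_def in word_defs.items():
        w_set = set(w_def.lower().split())
        for d, d_def in descriptor_defs.items():
            a = len(w_set & set(d_def.lower().split()))   # proxy for a(wi, dj)
            if acceptable(a):                             # a(wi, dj) in Ak(DW)
                fields[d].append(w)                       # then wi belongs to Dj
    return fields

descriptor_defs = {"KNOWLEDGE": "facts information and skills acquired by a person"}
word_defs = {"science": "systematic study producing facts and information",
             "shoe": "a covering worn on the foot"}
print(build_fields(word_defs, descriptor_defs))           # {'KNOWLEDGE': ['science']}
```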
FORMATION OF SEMANTIC FIELDS (2)
ISSUES OF PRACTICAL TASK SOLUTION
1. Determine the way to compare words:
• choose the way to obtain (indicate) the semantic multiplier
(lemmatization, folding, root indication, indication of the word stem or
quasi-stem of the word);
• develop a methodology for obtaining the semantic code of a word.
2. Determine the frequency characteristics of semantic multipliers.
3. Identify the criterion for semantic connection between words
and descriptors:
• phenomenological model of unit connectivity;
• phenomenological model of K-connectivity;
• connectivity model taking multiplier frequency into account.
DETERMINE THE WAY TO COMPARE WORDS
A word or descriptor definition is ~10 word forms;
the total in the experiment is ~110,000 word forms.
Semantic multiplier — an elementary unit of the plane of content.
Basic assumptions:
a) the semantic decomposition of language is discrete;
b) the range of elements of the decomposition is finite and observable;
c) the number of their combinations is practically unlimited;
d) the semantic decomposition is elementary, i.e. it consists of indecomposable
elements;
e) the semantic elements are of one kind, i.e. they belong to the plane of content
(they are elements of perception and thinking);
f) the semantic elements form a universal set, i.e. they are of a general character,
and their number and range are similar across different languages.
WAYS TO OBTAIN (INDICATE) THE SEMANTIC
MULTIPLIER
Lemmatization — reduction of a word to its canonical form.
Folding — contraction of the word, i.e. deletion of all vowels except the vowel
of the first syllable.
Root indication — representation of the word by its root morpheme.
Word stem indication — representation of the word by several
morphemes, for example prefix and root.
Quasi-stem indication — representation of the word by an arbitrary initial part,
based on the observation that the meaning (content) of a word tends to be
concentrated at its beginning (folding and quasi-stem indication are sketched below).
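The two purely mechanical reductions, folding and quasi-stem indication, can be sketched in a few lines of Python; the vowel inventory and the quasi-stem length are assumptions, and the example uses Latin-script words rather than Russian.

```python
VOWELS = set("aeiouy")   # assumed vowel inventory for Latin-script examples

def fold(word):
    """Folding: keep all consonants but only the vowel of the first syllable."""
    out, first_vowel_kept = [], False
    for ch in word.lower():
        if ch in VOWELS:
            if not first_vowel_kept:
                out.append(ch)
                first_vowel_kept = True
        else:
            out.append(ch)
    return "".join(out)

def quasi_stem(word, length=5):
    """Quasi-stem indication: an arbitrary initial part of the word."""
    return word.lower()[:length]

print(fold("information"))        # 'infrmtn'
print(quasi_stem("information"))  # 'infor'
```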
METHOD OF OBTAINING SEMANTIC CODE OF
THE WORD
METHOD PROCEDURES
1. Inclusion of the word being coded in its own code.
2. Exclusion of repeated semantic multipliers.
3. Filtration (deletion) of:
«zero» semantic multipliers,
grammatical words,
prepositions, conjunctions, etc.
4. Lexicalisation of collocations.
5. Formation of quasi word stems.
RESULTS OF METHOD IMPLEMENTATION
a) descriptor — dj = {sx(dj)}, the set of semantic multipliers of its definition;
b) word — wi = {sx(wi)}, the set of semantic multipliers of its definition.
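A hedged end-to-end sketch of procedures 1-5 above; lexicalisation of collocations is reduced here to merging phrases from a fixed list, and the stop-word list, phrase list and stem length are invented assumptions.

```python
GRAMMATICAL = {"a", "an", "the", "of", "to", "and", "or", "in", "by", "with", "about"}
COLLOCATIONS = {("point", "of", "view"): "point_of_view"}   # assumed phrase list

def lexicalize(tokens):
    """Procedure 4: merge known multi-word collocations into single units."""
    out, i = [], 0
    while i < len(tokens):
        for phrase, joined in COLLOCATIONS.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.append(joined)
                i += len(phrase)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

def semantic_code(headword, definition, stem_len=5):
    tokens = lexicalize(definition.lower().split())
    tokens = [headword.lower()] + tokens                   # 1. include the coded word itself
    tokens = [t for t in tokens if t not in GRAMMATICAL]   # 3. filter grammatical words
    return {t[:stem_len] for t in tokens}                  # 5. quasi-stems; the set drops repeats (2)

print(semantic_code("opinion", "a point of view held about a question"))
```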
DETERMINATION OF FREQUENCY
CHARACTERISTICS OF SEMANTIC MULTIPLIERS
Two frequency characteristics are associated
with a semantic multiplier x:
fxD — frequency of occurrence of the multiplier
in descriptor definitions;
fxW — frequency of occurrence of the multiplier
in word definitions.
Methodology of the frequency analysis of semantic multipliers:
a) computation of the frequencies;
b) ranking of the multipliers and ordering them within definitions
by increasing rank.
CRITERION OF SEMANTIC CONNECTIVITY
BETWEEN WORDS AND DESCRIPTORS
Stages of development of the criterion:
1. Phenomenological model of unit connectivity:
the definitions of the word and the descriptor share at least one common
multiplier,
|dj ∩ wi| ≥ 1;
2. Phenomenological model of K-connectivity:
the definitions of the word and the descriptor share K common semantic
multipliers,
|dj ∩ wi| = K;
3. Connectivity model taking multiplier frequency into account
(the selective criterion of Karaulov):
K > 2, or K ≥ 1 and fxD > 6.
SELECTIVE CRITERION OF KARAULOV
A word and a descriptor are semantically connected if their definitions
share more than two common semantic multipliers, or if they share
at least one common semantic multiplier whose frequency in the set of
descriptors is more than six.
Semantic field construction procedure:
1. Construction of the field according to the unit connectivity model.
2. Narrowing of the field by the number of coinciding multipliers.
3. Narrowing of the field with account of semantic multiplier frequency.
If the selective criterion is satisfied, then wi ∈ Dj.
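A small Python sketch of the selective criterion as worded above (the thresholds K > 2 and fxD > 6 are taken from that wording); the multiplier sets and the descriptor-frequency table are invented for illustration.

```python
from collections import Counter

def karaulov_connected(word_code, descriptor_code, f_d):
    """Selective criterion: K > 2, or K >= 1 with a shared multiplier of frequency fxD > 6."""
    common = word_code & descriptor_code
    k = len(common)
    return k > 2 or any(f_d[m] > 6 for m in common)

# Invented multiplier sets and descriptor-frequency table, for illustration only.
f_d = Counter({"move": 9, "fast": 3, "object": 12, "small": 2})
descriptor_code = {"move", "fast", "object"}
word_code = {"move", "small"}
print(karaulov_connected(word_code, descriptor_code, f_d))   # True: 'move' has fxD = 9 > 6
```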
QUESTIONS FOR SELF-CHECK
• Name the logical-statistical methods of knowledge retrieval from texts.
• Describe the distribution-statistical methodology of text analysis.
• Describe the frequency-semantic methodology of text analysis.
• Describe the componential methodology of text analysis.