Clusterization of text documents
using WordNet and semantic
similarity measures
We propose a way to improve the quality of clusterization of text documents using a
hierarchical thesaurus (WordNet, YARN) and semantic similarity metrics between WordNet
synsets. We also present the ECOSA algorithm, which reduces the dimensionality of
documents' characteristic vectors. With ECOSA, the clusterization process consists of
two steps: training and actual use.
Nikolai Blenda
Chelyabinsk State University, Russia
bna@csu.ru
Aggregation of concepts from WordNet and the
semantic similarity measure used
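The slide does not name the similarity measure, so the following is only a minimal sketch under stated assumptions: it uses the Wu-Palmer measure as one common choice, and a tiny hand-made hypernym tree (`PARENT`, hypothetical) in place of WordNet. The `aggregate` helper illustrates concept aggregation by replacing a concept with its ancestor at a chosen depth; it is not taken from ECOSA.

```python
# Toy hypernym tree (child -> parent). A hypothetical stand-in for WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def path_to_root(concept):
    """Return the concepts from `concept` up to the root, inclusive."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def lowest_common_subsumer(a, b):
    """Deepest concept that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    # Walking upward from b, the first shared concept is the deepest one.
    for concept in path_to_root(b):
        if concept in ancestors_of_a:
            return concept
    raise ValueError("concepts share no ancestor")


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def aggregate(concept, max_depth):
    """Replace a concept by its ancestor at `max_depth` if it lies deeper."""
    path = path_to_root(concept)  # ordered from concept to root
    if len(path) <= max_depth:
        return concept
    return path[len(path) - max_depth]
```

In this toy tree, `wu_palmer("dog", "cat")` is higher than `wu_palmer("dog", "car")` because the first pair shares the deeper ancestor `animal`, and `aggregate("dog", 2)` maps `dog` to `animal`.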
Clusterization process according to ECOSA
Percentage of errors → 0
Silhouette coefficient → max
Silhouette coefficient: $SC(C) = \frac{1}{|D|} \sum_{p \in D} s(p, C)$, where
D is the set of documents
C is the set of clusters
s(p, C) is the similarity measure of document p in relation to the other
clusters
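As an illustration of the quality criterion, here is a minimal sketch of the standard silhouette coefficient: for each document, a is its mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, s = (b - a) / max(a, b), and the coefficient is the mean of s over all documents. The Euclidean distance and the function names are assumptions made for the example; the variant used with ECOSA may differ in detail.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def silhouette_coefficient(vectors, labels, dist=euclidean):
    """Mean silhouette value over all documents (standard definition)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: s = 0 by convention
        # a: mean distance to the other documents in the same cluster
        a = sum(dist(vectors[i], vectors[j]) for j in own) / len(own)
        # b: smallest mean distance to the documents of any other cluster
        b = min(
            sum(dist(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(vectors)


docs = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_coefficient(docs, [0, 0, 1, 1])  # compact, well separated
bad = silhouette_coefficient(docs, [0, 1, 0, 1])   # clusters mixed together
```

A clustering that groups nearby documents together (`good`) scores close to 1, while one that mixes distant documents (`bad`) scores below 0, which is why the coefficient is maximized.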
Technical realization
Programming language: Python
Tokenization: NLTK
Morphological analysis: Pymorphy2 (Russian), NLTK (English)
Translation from Russian into English:
- Google Translate API (temporary)
WordNet: nltk.corpus (English)
Clustering and visualization:
- RapidMiner (temporary)
- R (planned)

Expected results
Lower error rates than COSA (S. Staab), since COSA does not include
a measure of semantic similarity between concepts
A smaller characteristic vector of documents than with GLSA
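To show why aggregating terms into concepts shrinks the characteristic vector, here is a minimal sketch that builds count vectors first over raw terms and then over concepts. The `TERM_TO_CONCEPT` mapping is hypothetical and hand-made for the example; it is neither ECOSA nor GLSA, only an illustration of the dimensionality effect.

```python
from collections import Counter

# Hypothetical term -> concept mapping, standing in for a thesaurus lookup.
TERM_TO_CONCEPT = {
    "dog": "animal",
    "cat": "animal",
    "puppy": "animal",
    "car": "vehicle",
    "truck": "vehicle",
    "bus": "vehicle",
}


def to_features(tokens, mapping=None):
    """Map tokens to concepts; unmapped tokens are kept as they are."""
    if mapping is None:
        return list(tokens)
    return [mapping.get(token, token) for token in tokens]


def characteristic_vectors(docs, mapping=None):
    """Return (vocabulary, count vectors) for tokenized documents."""
    counts = [Counter(to_features(doc, mapping)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[feature] for feature in vocabulary] for c in counts]
    return vocabulary, vectors


docs = [["dog", "cat", "car"], ["puppy", "truck", "bus"]]
term_vocab, term_vectors = characteristic_vectors(docs)
concept_vocab, concept_vectors = characteristic_vectors(docs, TERM_TO_CONCEPT)
```

Here the term-based vectors have six dimensions and the two documents share no terms, while the concept-based vectors have two dimensions (`animal`, `vehicle`) and the documents overlap in both.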
