SlideShare a Scribd company logo
1 of 7
Download to read offline
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072
© 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 169
SEMANTIC BASED DOCUMENT CLUSTERING USING LEXICAL CHAINS
SHABANA AFREEN1, DR. B. SRINIVASU2
1M.tech Scholar, Dept. of Computer Science and Engineering, Stanley College of Engineering and Technology for
Women, Telangana-Hyderabad, India.
2Associate Professor, Dept. of Computer Science and Engineering, Stanley College of Engineering and
Technology for Women, Telangana -Hyderabad, India.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract – Traditional clustering algorithms do not
consider the semantic relationships among documentssothat
cannot accurately represent cluster of the documents. To
overcome these problems, introducing semantic information
from ontology such as WordNet has been widely used to
improve the quality of text clustering. However, there exist
several challenges such as extracting core semantics from
texts, assigning appropriate description for the generated
clusters and diversity of vocabulary.
In this project we report our attempt towards integrating
WordNet with lexical chains to alleviate these problems. The
proposed approach exploits the way we can identifythetheme
of the document based on disambiguated core semantic
features extracted and exploits the characteristics of lexical
chain based on WordNet. In our approach the main
contributions are preprocessing of document which identifies
the noun as a feature by performing tagging and
lemmatization, performing word sense disambiguation to
obtained candidate words based on the modified similarity
approach and finally the generation of clusterbasedonlexical
chains.
We observed better performance of lexical chain based on the
chain evaluation heuristic whose threshold is set to 50%. In
future we can demonstrate the lexical chains can lead to
improvements in performance of text clustering using
ontologies.
Key Words: Text clustering, WordNet, Lexical chains,
Core semantic features, cluster, semantic, candidate
words.
1. INTRODUCTION
This section provides detail about semantic web mining and
a more explained introduction to text clustering followed by
few examined problems facing for text clustering and their
possible solutions.
1.1 Text Clustering
With the increasing informationonInternet, Webmining has
been the focus of information retrieval. Now a day’sInternet
is being used so widely that it leads to a large repository of
documents. Text clustering is a useful technique thataimsat
organizing large document collections into smaller
meaningful and manageable groups, which plays an
important role in information retrieval, browsing and
comprehension [1].
1.2 Problem Description
Feature vectors generated using Bow results in very large
dimensional vectors. The feature selected is directly
proportional to dimension.
Extract core semantics from texts selecting the feature will
reduced number of terms which depict high semantic conre
content.
The quality of extracted lexical chains is highly depends on
the quality and quantity of the concepts within a document.
i.e., larger the chain clearer the concepts.
Several other challenges for the clustering results:
synonym and polysemy problems. There has been much
work done on the use of ontology to replace the original
word in a document with the most appropriate word called
word sense disambiguation(WSD).
Assign distinguished and meaningful description for the
generated clusters
1.3 Basics and background knowledge
1.3.1WordNet
WordNet is the product of a research project as Princeton
University. it classify the word in two categories i.e.,content
word and function word, content words dealt with noun,
verb, adverb and adjective which form a set of synonyms
called synsets. The synsets are organized into senses, giving
thus the synonyms of each word, and also into a
hyponym/hypernym (i.e., is-A) and meronym/holonym(i.e.,
Part-of) relationships[2].
1.3.2 Semantic similarity
Semantic similarity plays an important role in natural
language processing, . In general, all the measures can be
grouped into four classes: path length based measures,
information content based measures, features based
measures and hybrid measures [3].
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072
© 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 170
1.3.2.1 Path-based Measures
The main idea of path-based measures is that the similarity
between two concepts is a function of the length of the path
linking the concepts and the position of the concepts in the
taxonomy.
The various path based measure include Shortest path
measure, wu and palmer’s measure, leakcock and
chodorow’s measure and li’s measure.
The information based measure are resnik’s measure,lin’s
measure, jiang’s measure
Feature based measure is independent on thetaxonomyand
the subsumers of the concepts.Thefinal approachi.e.,hybrid
measure combined the ideas presented above[4,5,6,7].
1.4 Word sense disambiguation
We adopt the WSD procedure which is given by [8], aim to
identify the most appropriate sense associated with each
noun in a given document based on the assumption that one
sense per discourse. The WSD approach can be described as
follows. Let N= {n1, n2, …..np} denote the set of all senses
associated with the noun ni according to the WordNet
ontology. We determine the most appropriate sense of a
noun n, by computing the sum of its similarity to other noun
senses in d as follows.
Where s(cik,cjm) is the similarity between two senses. We
restrict to the first three senses for eachsynsettoparticipate
in this computation for several reasons as given by [9].
First, the senses of a given noun in the WordNet hierarchy
are arranged in descendingorderaccordingtotheircommon
usage. Furthermore we compare the clustering results on
using only the top three senses against using all senses of a
noun, the former yields similar clustering results at a
reduced computation cost to the latter.
1.5 Lexical chains
Lexical chains are groups of words which exhibit lexical
cohesion. Cohesion is a way of getting text together as a
whole. Lexical cohesion is exhibited through cohesive
relations. They have classified these relations as:
1. reiteration with identity of reference
2. reiteration without identity of reference
3. reiteration by means of super ordinate
4. systematic semantic relation
5. non systematic semantic relation
6. The first three relations involve reiteration which
includes repetition of the same word in the same
sense, the use of a synonym for a word and the use
of hypernyms for a word respectively. The last two
relations involve collocations i.e., semantic
relationships between words that often co-occur.
Lexical chains in a text are identified by the
presence of strong semantic relations between the
words in the text[10].
2. LITERATURE SURVEY
2.1 Introduction
Most of the existing text clustering methods use the bag of
words model knownfrominformationretrieval wheresingle
terms are used as features for representing the documents
and they are treated independently. Some researches
recently put their focus on the conceptual featuresextracted
from text using ontologies and have shown that ontologies
could improve the performance of text mining.
2.2 Semantic ontology for text clustering
Current approaches for semanticontologyfortextclustering
can be divided into two major categories, namely, concept
mapping and embedded methods[11].
Concept mapping methods simply replace each term in a
document by its corresponding concepts(s) extracted from
an ontology before applying the clustering algorithm. These
methods are appealing because they can be applied to any
clusteringalgorithm[12].Furthermore,themappingofterms
into concepts incurs only a one-time cost, thus allowing the
clustering algorithm to be invoked multiple times (for
different cluster initialization, parameter settings etc.,)
without the additional overhead of re-creating the concepts.
However, their main limitation is that the quality of the
clusters is highly dependent on the correctness of the WSD
procedure used. Embedded methods, on the other hand,
integrate the ontological background knowledge directly
into the clustering algorithm. This would require
modifications to the existing clustering algorithm, which
often leads to substantial increase in its both runtime and
memory requirements.InsteadofperformingWSDexplicitly,
these methods assume the availability of a distance/
similarity matrix for all pairs of words in a text corpus
computed based ontheWordNet’sconcepthierarchy.Sincea
word can be mapped to several synsets, multiple pairs may
exist between any two words and the algorithm has to
decide which path to use when computing the distance
measure. Thus embedded methods are still susceptible to
incorrect mapping issues related to WSD.
2.3 Proposed approach
The proposed approach exploits the relations to provide a
more accurate assessment of the similarity between terms
for word sense disambiguation.Furthermore,weimplement
lexical chains to extract a set of semantically related words
from texts, which can represent the semantic content of the
texts. Although lexical chains have been extensively used in
text summarization, their potential impactontextclustering
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072
© 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 171
problem has not been fully investigated. Our integrated way
can identify the theme of documents based on the
disambiguated core features extracted, and in parallel
downsize the dimensions of feature space.
The work exploits this characteristic of lexical chains - each
lexical chain is an indicator of a topical strand and the
segments connected by it belong to the same topic. Wework
on the premise that, instead of just computing the lexical
chains with respect to a single document, if we compute the
chains across documents, we are in effect chaining together
those documents belonging to a topical strand.
3. ARCHITECTURE
This section deals with the method used, architecture and
statistics of semantics for the construction of lexical chains
based on semantic knowledge database such as WordNet.
3.1 System architecture
Text document clustering can greatly simplify browsing
large collections of documents by reorganizing them into a
smaller number of manageable clusters.
Preprocessing the documents is probably at least as
important as the choice of an algorithm, since an algorithm
can only be as good as the data it works on. While there are a
number of preprocessing steps, that are almost standard
now, the effects of adding background knowledge are still
not very extensively researched.
We preprocess the document by running it through a
tokenizer; we then filter out all non-nounwordsidentifiedin
the WSD stage. (The WSD here referred as the process of
word sense disambiguation where original word is being
replaced by the most appropriate sense based on the
similarity sense from WordNet).The result is a sequence of
nouns which appear in the text along withitssense. Werefer
to these as ‘candidate words’. We base our algorithm on the
WordNet lexical Database. WordNet is used to identify the
relations among the words. We use identity, synonym,
hypernym, meronym relations to compute the chains. Our
algorithm works by maintaining a global setoflexical chains,
each of which represents a topic.
We now compute the lexical chains correspondingto eachof
the candidate words by looking up the synsets for the word
from WordNet. We then traverse the global list of lexical
chains to identify those chains with which it has a identity,
synonym, hypernym and meronym relations. We refer to
these identified lexical chains as potential chains.wereferto
these identified lexical chains as potential chains. If the
candidate word has no relation with any of the chains in the
global list, a new potential chain is created.
A chain is selected from the global set based on the score of
representative i.e., a threshold value. We have selected the
threshold in such a way that the chain which is greater than
the threshold value is selected as the potential chain for the
document.
Figure 3.1 Architecture
3.2 Candidate word generation
It implements a method to obtain description of a synsets,
Description of synsets: Let C={c1, c2, c3, c4…..ck}bethesetof
synsets in a document, ci € C. let lemma(ci) be the set of
words that constitute a synsets of ci. let gloss(ci) be the
definition and examples of usages of ci. let related(ci) be the
union of the synonym, hypernym, hyponym.
Scoring mechanism:
The scoring mechanism assigns an n word overlapthescore
of n2. This gives an n- word overlap a score that is greater
than the sum of the scores assigned to those n words if they
had occurred in two or more phrases, each lessthannwords
long.
Modified measure: In order to take the full advantage of
both explicit andimplicitsemanticrelationsbetweensynsets
such as is-a and has-par links, we define a new similarity
measure that combines both measures as below.
Where S= log(score(DES(cp),DES(cq))+1), this method not
only reflects structure information of synsets such as
distance, but also incorporatescontentmeaningofsynsetsin
the ontology. It integrates well with explicit and implicit
semantic between synsets in ontology.
Text
Documents
Preprocess WordNet
Candidate words
Lexical chains
Filtered lexical chains
Cluster with labels
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072
© 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 172
3.3 Lexical chain
Lexical chain represents semantic relations among the
selected word senses of the words appearing in that lexical
chain. Each node in a lexical chain is a word sense of a word,
and each link can be identity, synonym,hypernym,meronym
relation between two word senses.
In order to extract the core semantics, the semantic
importance of word senses within a given document should
be evaluated first, generally, let N= {n1, n2,n3 …..nk} bethe set
of nouns in a document d and let f={f 1,f 2,f3 ……fp} be the
corresponding frequency of occurrence of nouns in d. let
C={c1, c2, c3, …..ck} be the set of disambiguation conceptsthat
corresponding to N, given a document d, a set of nouns N, a
set of frequencies F and a set of concepts C, let W={w1,w2, w3
……wn} as the set of corresponding weight of disambiguated
concepts in C, if ci(ci€C) is mapped fromnkandnm(nk,nm€N).
then the weight of ci us computed by based on the weighted
concepts, we give following definition.
Wk=fk+fn
Score of concept:
Let C={c1, c2, c3 …..cn} be the set of disambiguated
concepts(word senses), and let W={ w1, w2, w3 ……wq } be
the set of corresponding weight of disambiguated concepts
in C. let RN={identity, synonym,hypernym,meronym} bethe
set of semantic relations, and let R ={r1 r2=r1, r3, r4 } be the
set of the corresponding weight of relation in RN. Then the
score of a concept ci(ci€C) in a lexical chain is computed by
A large value of S(ci) indicates that ci is a semantically
important concept in a document. The relation weight r(r
€R) depending on the kind of semantic relationship and it is
in the order listed: identity, synonym, hypernym, meronym
(thus r1=r2>r3>r4).
Score of lexical chain:
Let L= {L1, L2, L3………. Lm} be a set of lexical chain of a given
document. Li€L, let ci = { ci1, ci2, ci3 …..ciq } be a set of
disambiguated concepts in Li let S(ci) be the score ofconcept
cil(cil € ci ). Then , the score S(ci1) of lexical chain in a
document is define as
Score of representative lexical chain:
Let Let L= {L1, L2, L3………. Lm} be a set of lexical chains of a
given document. Let LR ={L1
R , L2
R ……Ln
R } (n≤m) be a set of
representative lexical chains that satisfy the following
criterion:
Where ∞ is a weighting coefficient that is used to control the
number of the representativelexical chainstobeconsidered.
We extract the weighted concepts in the lexical chain LR
composing set of core semantic features for the given
document. It is these concepts canbethenusedtoclusterthe
document.
4. IMPLEMENTATION
4.1Dataset Description
We have run our experiments on a small dataset of 30
documents retrieved from 20Newsgroup dataset. This was
because we were unable to get a pre-clustered dataset and
comparing the results would have been difficult. Hence, we
limited our experiments to a small dataset to keep our
experiments humanly tractable.
We keep the number of documents small in order to be able
to do a qualitative analysis of the clustersformedasopposed
to a quantitative one.
4.1 Pre-processing
Documents are process by passing through tokenizer by
setting space as delimiter, and then we obtain the tokens.
These tokens are further passed to a Tagger to assign the
category for each token, in our case it is noun finally we pass
to a lemmatizer to obtain the root form of the tagged word.
Here we are using wordNet lemmatizer for a part of speech
tagging output. Lemmatizer return the same word if it is not
found in WordNet or else returns the lemma.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072
© 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 173
Figure 4.1: preprocessing
4.2 Performing Word Sense Disambiguation
Here we determine the most appropriatesenseassociatedto
each lemma by computing the similarity among three sense,
the sense which assigned the highest score is considerasthe
probable sense. Finally we replace our lemma with the
synset of that sense which is termed as candidate
words.
Figure 4.2 Calculating word sense disambiguation
4.3 Cluster Generation
The candidate wordsareassignedtheweightsandgenerated
the score of concept by considering the relations. Then
summing up all the relation score will givetheconceptscore.
This concept score further used to compute the score of
lexical chain. we then evaluate the filtered chains,inorder to
select a subset of chains to which the document isadded. We
select all those lexical chains whose length exceedsheuristic
value.
Our algorithm works on the assumption that lexical chains
represent the theme of the document. And grouping
documents based on these lexical chains results in the
documents being clustered based on its theme. This work
explores if and how the two following methods can improve
the effectiveness of clustering through semantic similarity
approach.
Figure 4.3 Generation of cluster
5. EXPERIMENT RESULTS
The results are analyzed after pre-processing step is
illustrated in the below table 5.1. the first columnspecifythe
document id, second column specify thetokensandthethird
column specify the number of nouns obtained.
Table 5.1: sample for pos tagging.
Document id Number of
tokens
Number of
Nouns
9 106 26
10 129 19
29 49 26
30 70 19
The below table shows the chains obtained for the
documents. The first columns specify the document ids and
their respective chains.
Table 5.2: Sample for lexical chains.
Document id Chains
Documents
Tokenization
Pos tagging
Lemmatization
Lemma
Lemma
WordNet
Calculate sense
Similarity:
1. wu and
palmer
2. banarjee &
pedersen
Candidate words
1. Score of concept
2. Score of lexical chain
Representative of lexical chain
Cluster with a Label
Candidate words
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072
© 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 174
29 1. Bent, hang, knack
2. Design, designing
3. Station
4. Wagon, wagon
5. Tercel, tercelet, tierce
6. Corolla
7. Architect, designer
8. Sense, signified
10 1. Earn
2. Livestock, carcass
3. Argentina, Philippines
4. Bond, American, Canada
30 1. Earn
2. Nyse, take, income,
stock, rate
Table 5.3 Number of Lexical chains generated.
Document id Number of chains
29 8
30 2
10 4
6. CONCLUSIONS
Documents contain multiple topics and clustering them
using hard clustering is very unnatural. Documents should
ideally be clustered using soft clustering algorithms.
Unfortunately, these algorithms work only for very small
dimensions. The natureoflexical chainsmakesthemsuitable
for clustering documents. Each lexical chain is considered as
a topical strand and if two documents share a chain, then
they contain the same topics.
This work presents a methodology for clustering using
disambiguated concepts and lexical chains.Amodifiedterm-
based semantic similarity measure is proposed for word
sense disambiguation, and lexical chains are employed to
extract core semantic features that express the topic of
documents, which determining the number of clusters, and
assigning appropriate descriptionforthegeneratedclusters.
More importantly, we show that the lexical chain features
(core semantics) can improvethe qualitysignificantlywith a
reduced number of features in the document clustering
process. Although lexical chains have been widely used in
many application domains, this study is one of the few
researches which try to investigate the potential impact of
lexical chains on text clustering.
In future work, we would like to perform our method on a
larger knowledge base, such as Wikipedia. Moreover, since
we have demonstrated that the lexical chains can lead to
improvements in text clustering, the next work we plan to
explore the feasibility of lexical chains in the text mining
task.
ACKNOWLEDGEMENT
To Ammi
Affu is nothing without you
All that I am, all that I do, all that I ever had, I owe it all
to you.
To pappa
I’m glad how your words have always given me life’s real
view,
“Winning is great, sure but if you want to do something in
life run a marathon of life. I am there.”
It is with a sense of gratitude and appreciation that I feel to
acknowledge any well wishers for their king support and
encouragement during the completion of the project.
I would like to express my heartfelt gratitude to my Project
guide Dr. Srinivasu Badugu, Coordinator of M.Tech in the
Computer Science Engineering department, for encouraging
and guiding me through the project. His extreme energy,
creativity and excellentdomainknowledgehavealwaysbeena
constant source of motivation and orientation for me to gain
more hands-on experience and hence get edge over. I am
highly indebted to him for his guidance and constant
supervision as well as for providing necessary information
regarding the project & also for his support in completing the
project.
REFERENCES
1. Wei, Tingting, et al. "A semantic approach for text
clustering using WordNet and lexical chains." Expert
Systems with Applications 42.4 (2015): 2264-2275.
2. Miller, George A., and Walter G. Charles. "Contextual
correlates of semantic ++
3. similarity." Language and cognitive processes 6.1
(1991): 1-28.
4. Resnik, Philip. "Semantic similarity in a taxonomy:
An information-based measure and its application
to problems of ambiguity in natural language." J.
Artif. Intell. Res.(JAIR) 11 (1999): 95-130.
5. Sánchez, David, et al. "Ontology-based semantic
similarity: A new feature-based approach." Expert
Systems with Applications 39.9 (2012): 7718-7728.
6. Budanitsky, Alexander, and Graeme Hirst.
"Evaluating wordnet-based measures of lexical
semantic relatedness." Computational
Linguistics 32.1 (2006): 13-47.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072
© 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 175
7. Budanitsky, Alexander, and Graeme Hirst.
"Evaluating wordnet-based measures of lexical
semantic relatedness." Computational
Linguistics 32.1 (2006): 13-47.
8. Amine, Abdelmalek, Zakaria Elberrichi, and Michel
Simonet. "Evaluation of textclusteringmethodsusing
wordnet." Int. Arab J. Inf. Technol. 7.4 (2010): 349-
357.
9. Jayarajan, Dinakar, Dipti Deodhare, and Balaraman
Ravindran. "Lexical chains as document features."
(2008): 111-117.
10. Fodeh, Samah, Bill Punch, and Pang-Ning Tan. "On
ontology-driven document clustering using core
semantic features." Knowledge and information
systems28.2 (2011): 395-421.
11. Termier, Alexandre, Michèle Sebag, and Marie-
Christine Rousset. "Combining Statistics and
Semantics for Word and Document
Clustering." workshop on ontology learning. 2001.
12. Chen, Chun-Ling, Frank SC Tseng, and Tyne
Liang. "An integration of WordNet and fuzzy
association rule mining for multi-label document
clustering." Data & Knowledge Engineering 69.11
(2010): 1208-1226.
BIOGRAPHIES
SHABANA AFREEN received
her B.E and M.tech degree in
computer science from Osmania
university. Her research interests
are natural language processing,
semantic web.

More Related Content

What's hot

Text Document categorization using support vector machine
Text Document categorization using support vector machineText Document categorization using support vector machine
Text Document categorization using support vector machineIRJET Journal
 
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALijaia
 
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...IRJET Journal
 
Context Sensitive Relatedness Measure of Word Pairs
Context Sensitive Relatedness Measure of Word PairsContext Sensitive Relatedness Measure of Word Pairs
Context Sensitive Relatedness Measure of Word PairsIJCSIS Research Publications
 
IRJET- A Survey Paper on Text Summarization Methods
IRJET- A Survey Paper on Text Summarization MethodsIRJET- A Survey Paper on Text Summarization Methods
IRJET- A Survey Paper on Text Summarization MethodsIRJET Journal
 
Performance Evaluation of Query Processing Techniques in Information Retrieval
Performance Evaluation of Query Processing Techniques in Information RetrievalPerformance Evaluation of Query Processing Techniques in Information Retrieval
Performance Evaluation of Query Processing Techniques in Information Retrievalidescitation
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONIJDKP
 
IRJET - Automatic Text Summarization of News Articles
IRJET -  	  Automatic Text Summarization of News ArticlesIRJET -  	  Automatic Text Summarization of News Articles
IRJET - Automatic Text Summarization of News ArticlesIRJET Journal
 
Named Entity Recognition using Tweet Segmentation
Named Entity Recognition using Tweet SegmentationNamed Entity Recognition using Tweet Segmentation
Named Entity Recognition using Tweet SegmentationIRJET Journal
 
Review of Various Text Categorization Methods
Review of Various Text Categorization MethodsReview of Various Text Categorization Methods
Review of Various Text Categorization Methodsiosrjce
 
Single document keywords extraction in Bahasa Indonesia using phrase chunking
Single document keywords extraction in Bahasa Indonesia using phrase chunkingSingle document keywords extraction in Bahasa Indonesia using phrase chunking
Single document keywords extraction in Bahasa Indonesia using phrase chunkingTELKOMNIKA JOURNAL
 
A survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classificationA survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classificationijnlc
 
Taxonomy extraction from automotive natural language requirements using unsup...
Taxonomy extraction from automotive natural language requirements using unsup...Taxonomy extraction from automotive natural language requirements using unsup...
Taxonomy extraction from automotive natural language requirements using unsup...ijnlc
 
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...IJERA Editor
 
Indonesian language email spam detection using N-gram and Naïve Bayes algorithm
Indonesian language email spam detection using N-gram and Naïve Bayes algorithmIndonesian language email spam detection using N-gram and Naïve Bayes algorithm
Indonesian language email spam detection using N-gram and Naïve Bayes algorithmjournalBEEI
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringIJRES Journal
 

What's hot (18)

Text Document categorization using support vector machine
Text Document categorization using support vector machineText Document categorization using support vector machine
Text Document categorization using support vector machine
 
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
 
K0936266
K0936266K0936266
K0936266
 
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
 
Context Sensitive Relatedness Measure of Word Pairs
Context Sensitive Relatedness Measure of Word PairsContext Sensitive Relatedness Measure of Word Pairs
Context Sensitive Relatedness Measure of Word Pairs
 
IRJET- A Survey Paper on Text Summarization Methods
IRJET- A Survey Paper on Text Summarization MethodsIRJET- A Survey Paper on Text Summarization Methods
IRJET- A Survey Paper on Text Summarization Methods
 
Performance Evaluation of Query Processing Techniques in Information Retrieval
Performance Evaluation of Query Processing Techniques in Information RetrievalPerformance Evaluation of Query Processing Techniques in Information Retrieval
Performance Evaluation of Query Processing Techniques in Information Retrieval
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
 
IRJET - Automatic Text Summarization of News Articles
IRJET -  	  Automatic Text Summarization of News ArticlesIRJET -  	  Automatic Text Summarization of News Articles
IRJET - Automatic Text Summarization of News Articles
 
Named Entity Recognition using Tweet Segmentation
Named Entity Recognition using Tweet SegmentationNamed Entity Recognition using Tweet Segmentation
Named Entity Recognition using Tweet Segmentation
 
Wsd final paper
Wsd final paperWsd final paper
Wsd final paper
 
Review of Various Text Categorization Methods
Review of Various Text Categorization MethodsReview of Various Text Categorization Methods
Review of Various Text Categorization Methods
 
Single document keywords extraction in Bahasa Indonesia using phrase chunking
Single document keywords extraction in Bahasa Indonesia using phrase chunkingSingle document keywords extraction in Bahasa Indonesia using phrase chunking
Single document keywords extraction in Bahasa Indonesia using phrase chunking
 
A survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classificationA survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classification
 
Taxonomy extraction from automotive natural language requirements using unsup...
Taxonomy extraction from automotive natural language requirements using unsup...Taxonomy extraction from automotive natural language requirements using unsup...
Taxonomy extraction from automotive natural language requirements using unsup...
 
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
 
Indonesian language email spam detection using N-gram and Naïve Bayes algorithm
Indonesian language email spam detection using N-gram and Naïve Bayes algorithmIndonesian language email spam detection using N-gram and Naïve Bayes algorithm
Indonesian language email spam detection using N-gram and Naïve Bayes algorithm
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
 

Similar to Semantic Based Document Clustering Using Lexical Chains

Semantic Similarity Between Sentences
Semantic Similarity Between SentencesSemantic Similarity Between Sentences
Semantic Similarity Between SentencesIRJET Journal
 
Semantic similarity measurement- A theoretical study of various approaches
Semantic similarity measurement- A theoretical study of various approachesSemantic similarity measurement- A theoretical study of various approaches
Semantic similarity measurement- A theoretical study of various approachesIRJET Journal
 
SEMANTIC NETWORK BASED MECHANISMS FOR KNOWLEDGE ACQUISITION
SEMANTIC NETWORK BASED MECHANISMS FOR KNOWLEDGE ACQUISITIONSEMANTIC NETWORK BASED MECHANISMS FOR KNOWLEDGE ACQUISITION
SEMANTIC NETWORK BASED MECHANISMS FOR KNOWLEDGE ACQUISITIONcscpconf
 
G04124041046
G04124041046G04124041046
G04124041046IOSR-JEN
 
IRJET- A Novel Approch Automatically Categorizing Software Technologies
IRJET- A Novel Approch Automatically Categorizing Software TechnologiesIRJET- A Novel Approch Automatically Categorizing Software Technologies
IRJET- A Novel Approch Automatically Categorizing Software TechnologiesIRJET Journal
 
P036401020107
P036401020107P036401020107
P036401020107theijes
 
8 efficient multi-document summary generation using neural network
8 efficient multi-document summary generation using neural network8 efficient multi-document summary generation using neural network
8 efficient multi-document summary generation using neural networkINFOGAIN PUBLICATION
 
An in-depth review on News Classification through NLP
An in-depth review on News Classification through NLPAn in-depth review on News Classification through NLP
An in-depth review on News Classification through NLPIRJET Journal
 
Context Driven Technique for Document Classification
Context Driven Technique for Document ClassificationContext Driven Technique for Document Classification
Context Driven Technique for Document ClassificationIDES Editor
 
IRJET- Multi-Document Summarization using Fuzzy and Hierarchical Approach
IRJET-  	  Multi-Document Summarization using Fuzzy and Hierarchical ApproachIRJET-  	  Multi-Document Summarization using Fuzzy and Hierarchical Approach
IRJET- Multi-Document Summarization using Fuzzy and Hierarchical ApproachIRJET Journal
 
IRJET- Automatic Recapitulation of Text Document
IRJET- Automatic Recapitulation of Text DocumentIRJET- Automatic Recapitulation of Text Document
IRJET- Automatic Recapitulation of Text DocumentIRJET Journal
 
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System
Knowledge Graph and Similarity Based Retrieval Method for Query Answering SystemKnowledge Graph and Similarity Based Retrieval Method for Query Answering System
Knowledge Graph and Similarity Based Retrieval Method for Query Answering SystemIRJET Journal
 
Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach IJCSIS Research Publications
 
A hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzerA hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzerIAESIJAI
 
IRJET - Deep Collaborrative Filtering with Aspect Information
IRJET - Deep Collaborrative Filtering with Aspect InformationIRJET - Deep Collaborrative Filtering with Aspect Information
IRJET - Deep Collaborrative Filtering with Aspect InformationIRJET Journal
 
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATIONEXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATIONIJNSA Journal
 
Syntactic Indexes for Text Retrieval
Syntactic Indexes for Text RetrievalSyntactic Indexes for Text Retrieval
Syntactic Indexes for Text RetrievalITIIIndustries
 
Semantic Search of E-Learning Documents Using Ontology Based System
Semantic Search of E-Learning Documents Using Ontology Based SystemSemantic Search of E-Learning Documents Using Ontology Based System
Semantic Search of E-Learning Documents Using Ontology Based Systemijcnes
 
Automatic Text Summarization using Natural Language Processing
Automatic Text Summarization using Natural Language ProcessingAutomatic Text Summarization using Natural Language Processing
Automatic Text Summarization using Natural Language ProcessingIRJET Journal
 
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word EmbeddingIRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word EmbeddingIRJET Journal
 

Similar to Semantic Based Document Clustering Using Lexical Chains (20)

Semantic Similarity Between Sentences
Semantic Similarity Between SentencesSemantic Similarity Between Sentences
Semantic Similarity Between Sentences
 
Semantic similarity measurement- A theoretical study of various approaches
Semantic similarity measurement- A theoretical study of various approachesSemantic similarity measurement- A theoretical study of various approaches
Semantic similarity measurement- A theoretical study of various approaches
 
SEMANTIC NETWORK BASED MECHANISMS FOR KNOWLEDGE ACQUISITION
SEMANTIC NETWORK BASED MECHANISMS FOR KNOWLEDGE ACQUISITIONSEMANTIC NETWORK BASED MECHANISMS FOR KNOWLEDGE ACQUISITION
SEMANTIC NETWORK BASED MECHANISMS FOR KNOWLEDGE ACQUISITION
 
G04124041046
G04124041046G04124041046
G04124041046
 
IRJET- A Novel Approch Automatically Categorizing Software Technologies
IRJET- A Novel Approch Automatically Categorizing Software TechnologiesIRJET- A Novel Approch Automatically Categorizing Software Technologies
IRJET- A Novel Approch Automatically Categorizing Software Technologies
 
P036401020107
P036401020107P036401020107
P036401020107
 
8 efficient multi-document summary generation using neural network
8 efficient multi-document summary generation using neural network8 efficient multi-document summary generation using neural network
8 efficient multi-document summary generation using neural network
 
An in-depth review on News Classification through NLP
An in-depth review on News Classification through NLPAn in-depth review on News Classification through NLP
An in-depth review on News Classification through NLP
 
Context Driven Technique for Document Classification
Context Driven Technique for Document ClassificationContext Driven Technique for Document Classification
Context Driven Technique for Document Classification
 
IRJET- Multi-Document Summarization using Fuzzy and Hierarchical Approach
IRJET-  	  Multi-Document Summarization using Fuzzy and Hierarchical ApproachIRJET-  	  Multi-Document Summarization using Fuzzy and Hierarchical Approach
IRJET- Multi-Document Summarization using Fuzzy and Hierarchical Approach
 
IRJET- Automatic Recapitulation of Text Document
IRJET- Automatic Recapitulation of Text DocumentIRJET- Automatic Recapitulation of Text Document
IRJET- Automatic Recapitulation of Text Document
 
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System
Knowledge Graph and Similarity Based Retrieval Method for Query Answering SystemKnowledge Graph and Similarity Based Retrieval Method for Query Answering System
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System
 
Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach
 
A hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzerA hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzer
 
IRJET - Deep Collaborrative Filtering with Aspect Information
IRJET - Deep Collaborrative Filtering with Aspect InformationIRJET - Deep Collaborrative Filtering with Aspect Information
IRJET - Deep Collaborrative Filtering with Aspect Information
 
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATIONEXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION
 
Syntactic Indexes for Text Retrieval
Syntactic Indexes for Text RetrievalSyntactic Indexes for Text Retrieval
Syntactic Indexes for Text Retrieval
 
Semantic Search of E-Learning Documents Using Ontology Based System
Semantic Search of E-Learning Documents Using Ontology Based SystemSemantic Search of E-Learning Documents Using Ontology Based System
Semantic Search of E-Learning Documents Using Ontology Based System
 
Automatic Text Summarization using Natural Language Processing
Automatic Text Summarization using Natural Language ProcessingAutomatic Text Summarization using Natural Language Processing
Automatic Text Summarization using Natural Language Processing
 
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word EmbeddingIRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
 

More from IRJET Journal

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...IRJET Journal
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTUREIRJET Journal
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...IRJET Journal
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsIRJET Journal
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...IRJET Journal
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...IRJET Journal
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...IRJET Journal
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...IRJET Journal
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASIRJET Journal
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...IRJET Journal
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProIRJET Journal
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...IRJET Journal
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemIRJET Journal
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesIRJET Journal
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web applicationIRJET Journal
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...IRJET Journal
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.IRJET Journal
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...IRJET Journal
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignIRJET Journal
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...IRJET Journal
 

More from IRJET Journal (20)

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil Characteristics
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADAS
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare System
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridges
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web application
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
 

Recently uploaded

Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 

Recently uploaded (20)

Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 

Semantic Based Document Clustering Using Lexical Chains

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056 Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072 © 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 169 SEMANTIC BASED DOCUMENT CLUSTERING USING LEXICAL CHAINS SHABANA AFREEN1, DR. B. SRINIVASU2 1M.tech Scholar, Dept. of Computer Science and Engineering, Stanley College of Engineering and Technology for Women, Telangana-Hyderabad, India. 2Associate Professor, Dept. of Computer Science and Engineering, Stanley College of Engineering and Technology for Women, Telangana -Hyderabad, India. ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract – Traditional clustering algorithms do not consider the semantic relationships among documentssothat cannot accurately represent cluster of the documents. To overcome these problems, introducing semantic information from ontology such as WordNet has been widely used to improve the quality of text clustering. However, there exist several challenges such as extracting core semantics from texts, assigning appropriate description for the generated clusters and diversity of vocabulary. In this project we report our attempt towards integrating WordNet with lexical chains to alleviate these problems. The proposed approach exploits the way we can identifythetheme of the document based on disambiguated core semantic features extracted and exploits the characteristics of lexical chain based on WordNet. In our approach the main contributions are preprocessing of document which identifies the noun as a feature by performing tagging and lemmatization, performing word sense disambiguation to obtained candidate words based on the modified similarity approach and finally the generation of clusterbasedonlexical chains. We observed better performance of lexical chain based on the chain evaluation heuristic whose threshold is set to 50%. In future we can demonstrate the lexical chains can lead to improvements in performance of text clustering using ontologies. Key Words: Text clustering, WordNet, Lexical chains, Core semantic features, cluster, semantic, candidate words. 1. INTRODUCTION This section provides detail about semantic web mining and a more explained introduction to text clustering followed by few examined problems facing for text clustering and their possible solutions. 1.1 Text Clustering With the increasing informationonInternet, Webmining has been the focus of information retrieval. Now a day’sInternet is being used so widely that it leads to a large repository of documents. Text clustering is a useful technique thataimsat organizing large document collections into smaller meaningful and manageable groups, which plays an important role in information retrieval, browsing and comprehension [1]. 1.2 Problem Description Feature vectors generated using Bow results in very large dimensional vectors. The feature selected is directly proportional to dimension. Extract core semantics from texts selecting the feature will reduced number of terms which depict high semantic conre content. The quality of extracted lexical chains is highly depends on the quality and quantity of the concepts within a document. i.e., larger the chain clearer the concepts. Several other challenges for the clustering results: synonym and polysemy problems. There has been much work done on the use of ontology to replace the original word in a document with the most appropriate word called word sense disambiguation(WSD). Assign distinguished and meaningful description for the generated clusters 1.3 Basics and background knowledge 1.3.1WordNet WordNet is the product of a research project as Princeton University. it classify the word in two categories i.e.,content word and function word, content words dealt with noun, verb, adverb and adjective which form a set of synonyms called synsets. The synsets are organized into senses, giving thus the synonyms of each word, and also into a hyponym/hypernym (i.e., is-A) and meronym/holonym(i.e., Part-of) relationships[2]. 1.3.2 Semantic similarity Semantic similarity plays an important role in natural language processing, . In general, all the measures can be grouped into four classes: path length based measures, information content based measures, features based measures and hybrid measures [3].
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056 Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072 © 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 170 1.3.2.1 Path-based Measures The main idea of path-based measures is that the similarity between two concepts is a function of the length of the path linking the concepts and the position of the concepts in the taxonomy. The various path based measure include Shortest path measure, wu and palmer’s measure, leakcock and chodorow’s measure and li’s measure. The information based measure are resnik’s measure,lin’s measure, jiang’s measure Feature based measure is independent on thetaxonomyand the subsumers of the concepts.Thefinal approachi.e.,hybrid measure combined the ideas presented above[4,5,6,7]. 1.4 Word sense disambiguation We adopt the WSD procedure which is given by [8], aim to identify the most appropriate sense associated with each noun in a given document based on the assumption that one sense per discourse. The WSD approach can be described as follows. Let N= {n1, n2, …..np} denote the set of all senses associated with the noun ni according to the WordNet ontology. We determine the most appropriate sense of a noun n, by computing the sum of its similarity to other noun senses in d as follows. Where s(cik,cjm) is the similarity between two senses. We restrict to the first three senses for eachsynsettoparticipate in this computation for several reasons as given by [9]. First, the senses of a given noun in the WordNet hierarchy are arranged in descendingorderaccordingtotheircommon usage. Furthermore we compare the clustering results on using only the top three senses against using all senses of a noun, the former yields similar clustering results at a reduced computation cost to the latter. 1.5 Lexical chains Lexical chains are groups of words which exhibit lexical cohesion. Cohesion is a way of getting text together as a whole. Lexical cohesion is exhibited through cohesive relations. They have classified these relations as: 1. reiteration with identity of reference 2. reiteration without identity of reference 3. reiteration by means of super ordinate 4. systematic semantic relation 5. non systematic semantic relation 6. The first three relations involve reiteration which includes repetition of the same word in the same sense, the use of a synonym for a word and the use of hypernyms for a word respectively. The last two relations involve collocations i.e., semantic relationships between words that often co-occur. Lexical chains in a text are identified by the presence of strong semantic relations between the words in the text[10]. 2. LITERATURE SURVEY 2.1 Introduction Most of the existing text clustering methods use the bag of words model knownfrominformationretrieval wheresingle terms are used as features for representing the documents and they are treated independently. Some researches recently put their focus on the conceptual featuresextracted from text using ontologies and have shown that ontologies could improve the performance of text mining. 2.2 Semantic ontology for text clustering Current approaches for semanticontologyfortextclustering can be divided into two major categories, namely, concept mapping and embedded methods[11]. Concept mapping methods simply replace each term in a document by its corresponding concepts(s) extracted from an ontology before applying the clustering algorithm. These methods are appealing because they can be applied to any clusteringalgorithm[12].Furthermore,themappingofterms into concepts incurs only a one-time cost, thus allowing the clustering algorithm to be invoked multiple times (for different cluster initialization, parameter settings etc.,) without the additional overhead of re-creating the concepts. However, their main limitation is that the quality of the clusters is highly dependent on the correctness of the WSD procedure used. Embedded methods, on the other hand, integrate the ontological background knowledge directly into the clustering algorithm. This would require modifications to the existing clustering algorithm, which often leads to substantial increase in its both runtime and memory requirements.InsteadofperformingWSDexplicitly, these methods assume the availability of a distance/ similarity matrix for all pairs of words in a text corpus computed based ontheWordNet’sconcepthierarchy.Sincea word can be mapped to several synsets, multiple pairs may exist between any two words and the algorithm has to decide which path to use when computing the distance measure. Thus embedded methods are still susceptible to incorrect mapping issues related to WSD. 2.3 Proposed approach The proposed approach exploits the relations to provide a more accurate assessment of the similarity between terms for word sense disambiguation.Furthermore,weimplement lexical chains to extract a set of semantically related words from texts, which can represent the semantic content of the texts. Although lexical chains have been extensively used in text summarization, their potential impactontextclustering
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056 Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072 © 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 171 problem has not been fully investigated. Our integrated way can identify the theme of documents based on the disambiguated core features extracted, and in parallel downsize the dimensions of feature space. The work exploits this characteristic of lexical chains - each lexical chain is an indicator of a topical strand and the segments connected by it belong to the same topic. Wework on the premise that, instead of just computing the lexical chains with respect to a single document, if we compute the chains across documents, we are in effect chaining together those documents belonging to a topical strand. 3. ARCHITECTURE This section deals with the method used, architecture and statistics of semantics for the construction of lexical chains based on semantic knowledge database such as WordNet. 3.1 System architecture Text document clustering can greatly simplify browsing large collections of documents by reorganizing them into a smaller number of manageable clusters. Preprocessing the documents is probably at least as important as the choice of an algorithm, since an algorithm can only be as good as the data it works on. While there are a number of preprocessing steps, that are almost standard now, the effects of adding background knowledge are still not very extensively researched. We preprocess the document by running it through a tokenizer; we then filter out all non-nounwordsidentifiedin the WSD stage. (The WSD here referred as the process of word sense disambiguation where original word is being replaced by the most appropriate sense based on the similarity sense from WordNet).The result is a sequence of nouns which appear in the text along withitssense. Werefer to these as ‘candidate words’. We base our algorithm on the WordNet lexical Database. WordNet is used to identify the relations among the words. We use identity, synonym, hypernym, meronym relations to compute the chains. Our algorithm works by maintaining a global setoflexical chains, each of which represents a topic. We now compute the lexical chains correspondingto eachof the candidate words by looking up the synsets for the word from WordNet. We then traverse the global list of lexical chains to identify those chains with which it has a identity, synonym, hypernym and meronym relations. We refer to these identified lexical chains as potential chains.wereferto these identified lexical chains as potential chains. If the candidate word has no relation with any of the chains in the global list, a new potential chain is created. A chain is selected from the global set based on the score of representative i.e., a threshold value. We have selected the threshold in such a way that the chain which is greater than the threshold value is selected as the potential chain for the document. Figure 3.1 Architecture 3.2 Candidate word generation It implements a method to obtain description of a synsets, Description of synsets: Let C={c1, c2, c3, c4…..ck}bethesetof synsets in a document, ci € C. let lemma(ci) be the set of words that constitute a synsets of ci. let gloss(ci) be the definition and examples of usages of ci. let related(ci) be the union of the synonym, hypernym, hyponym. Scoring mechanism: The scoring mechanism assigns an n word overlapthescore of n2. This gives an n- word overlap a score that is greater than the sum of the scores assigned to those n words if they had occurred in two or more phrases, each lessthannwords long. Modified measure: In order to take the full advantage of both explicit andimplicitsemanticrelationsbetweensynsets such as is-a and has-par links, we define a new similarity measure that combines both measures as below. Where S= log(score(DES(cp),DES(cq))+1), this method not only reflects structure information of synsets such as distance, but also incorporatescontentmeaningofsynsetsin the ontology. It integrates well with explicit and implicit semantic between synsets in ontology. Text Documents Preprocess WordNet Candidate words Lexical chains Filtered lexical chains Cluster with labels
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056 Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072 © 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 172 3.3 Lexical chain Lexical chain represents semantic relations among the selected word senses of the words appearing in that lexical chain. Each node in a lexical chain is a word sense of a word, and each link can be identity, synonym,hypernym,meronym relation between two word senses. In order to extract the core semantics, the semantic importance of word senses within a given document should be evaluated first, generally, let N= {n1, n2,n3 …..nk} bethe set of nouns in a document d and let f={f 1,f 2,f3 ……fp} be the corresponding frequency of occurrence of nouns in d. let C={c1, c2, c3, …..ck} be the set of disambiguation conceptsthat corresponding to N, given a document d, a set of nouns N, a set of frequencies F and a set of concepts C, let W={w1,w2, w3 ……wn} as the set of corresponding weight of disambiguated concepts in C, if ci(ci€C) is mapped fromnkandnm(nk,nm€N). then the weight of ci us computed by based on the weighted concepts, we give following definition. Wk=fk+fn Score of concept: Let C={c1, c2, c3 …..cn} be the set of disambiguated concepts(word senses), and let W={ w1, w2, w3 ……wq } be the set of corresponding weight of disambiguated concepts in C. let RN={identity, synonym,hypernym,meronym} bethe set of semantic relations, and let R ={r1 r2=r1, r3, r4 } be the set of the corresponding weight of relation in RN. Then the score of a concept ci(ci€C) in a lexical chain is computed by A large value of S(ci) indicates that ci is a semantically important concept in a document. The relation weight r(r €R) depending on the kind of semantic relationship and it is in the order listed: identity, synonym, hypernym, meronym (thus r1=r2>r3>r4). Score of lexical chain: Let L= {L1, L2, L3………. Lm} be a set of lexical chain of a given document. Li€L, let ci = { ci1, ci2, ci3 …..ciq } be a set of disambiguated concepts in Li let S(ci) be the score ofconcept cil(cil € ci ). Then , the score S(ci1) of lexical chain in a document is define as Score of representative lexical chain: Let Let L= {L1, L2, L3………. Lm} be a set of lexical chains of a given document. Let LR ={L1 R , L2 R ……Ln R } (n≤m) be a set of representative lexical chains that satisfy the following criterion: Where ∞ is a weighting coefficient that is used to control the number of the representativelexical chainstobeconsidered. We extract the weighted concepts in the lexical chain LR composing set of core semantic features for the given document. It is these concepts canbethenusedtoclusterthe document. 4. IMPLEMENTATION 4.1Dataset Description We have run our experiments on a small dataset of 30 documents retrieved from 20Newsgroup dataset. This was because we were unable to get a pre-clustered dataset and comparing the results would have been difficult. Hence, we limited our experiments to a small dataset to keep our experiments humanly tractable. We keep the number of documents small in order to be able to do a qualitative analysis of the clustersformedasopposed to a quantitative one. 4.1 Pre-processing Documents are process by passing through tokenizer by setting space as delimiter, and then we obtain the tokens. These tokens are further passed to a Tagger to assign the category for each token, in our case it is noun finally we pass to a lemmatizer to obtain the root form of the tagged word. Here we are using wordNet lemmatizer for a part of speech tagging output. Lemmatizer return the same word if it is not found in WordNet or else returns the lemma.
  • 5. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056 Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072 © 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 173 Figure 4.1: preprocessing 4.2 Performing Word Sense Disambiguation Here we determine the most appropriatesenseassociatedto each lemma by computing the similarity among three sense, the sense which assigned the highest score is considerasthe probable sense. Finally we replace our lemma with the synset of that sense which is termed as candidate words. Figure 4.2 Calculating word sense disambiguation 4.3 Cluster Generation The candidate wordsareassignedtheweightsandgenerated the score of concept by considering the relations. Then summing up all the relation score will givetheconceptscore. This concept score further used to compute the score of lexical chain. we then evaluate the filtered chains,inorder to select a subset of chains to which the document isadded. We select all those lexical chains whose length exceedsheuristic value. Our algorithm works on the assumption that lexical chains represent the theme of the document. And grouping documents based on these lexical chains results in the documents being clustered based on its theme. This work explores if and how the two following methods can improve the effectiveness of clustering through semantic similarity approach. Figure 4.3 Generation of cluster 5. EXPERIMENT RESULTS The results are analyzed after pre-processing step is illustrated in the below table 5.1. the first columnspecifythe document id, second column specify thetokensandthethird column specify the number of nouns obtained. Table 5.1: sample for pos tagging. Document id Number of tokens Number of Nouns 9 106 26 10 129 19 29 49 26 30 70 19 The below table shows the chains obtained for the documents. The first columns specify the document ids and their respective chains. Table 5.2: Sample for lexical chains. Document id Chains Documents Tokenization Pos tagging Lemmatization Lemma Lemma WordNet Calculate sense Similarity: 1. wu and palmer 2. banarjee & pedersen Candidate words 1. Score of concept 2. Score of lexical chain Representative of lexical chain Cluster with a Label Candidate words
  • 6. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056 Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072 © 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 174 29 1. Bent, hang, knack 2. Design, designing 3. Station 4. Wagon, wagon 5. Tercel, tercelet, tierce 6. Corolla 7. Architect, designer 8. Sense, signified 10 1. Earn 2. Livestock, carcass 3. Argentina, Philippines 4. Bond, American, Canada 30 1. Earn 2. Nyse, take, income, stock, rate Table 5.3 Number of Lexical chains generated. Document id Number of chains 29 8 30 2 10 4 6. CONCLUSIONS Documents contain multiple topics and clustering them using hard clustering is very unnatural. Documents should ideally be clustered using soft clustering algorithms. Unfortunately, these algorithms work only for very small dimensions. The natureoflexical chainsmakesthemsuitable for clustering documents. Each lexical chain is considered as a topical strand and if two documents share a chain, then they contain the same topics. This work presents a methodology for clustering using disambiguated concepts and lexical chains.Amodifiedterm- based semantic similarity measure is proposed for word sense disambiguation, and lexical chains are employed to extract core semantic features that express the topic of documents, which determining the number of clusters, and assigning appropriate descriptionforthegeneratedclusters. More importantly, we show that the lexical chain features (core semantics) can improvethe qualitysignificantlywith a reduced number of features in the document clustering process. Although lexical chains have been widely used in many application domains, this study is one of the few researches which try to investigate the potential impact of lexical chains on text clustering. In future work, we would like to perform our method on a larger knowledge base, such as Wikipedia. Moreover, since we have demonstrated that the lexical chains can lead to improvements in text clustering, the next work we plan to explore the feasibility of lexical chains in the text mining task. ACKNOWLEDGEMENT To Ammi Affu is nothing without you All that I am, all that I do, all that I ever had, I owe it all to you. To pappa I’m glad how your words have always given me life’s real view, “Winning is great, sure but if you want to do something in life run a marathon of life. I am there.” It is with a sense of gratitude and appreciation that I feel to acknowledge any well wishers for their king support and encouragement during the completion of the project. I would like to express my heartfelt gratitude to my Project guide Dr. Srinivasu Badugu, Coordinator of M.Tech in the Computer Science Engineering department, for encouraging and guiding me through the project. His extreme energy, creativity and excellentdomainknowledgehavealwaysbeena constant source of motivation and orientation for me to gain more hands-on experience and hence get edge over. I am highly indebted to him for his guidance and constant supervision as well as for providing necessary information regarding the project & also for his support in completing the project. REFERENCES 1. Wei, Tingting, et al. "A semantic approach for text clustering using WordNet and lexical chains." Expert Systems with Applications 42.4 (2015): 2264-2275. 2. Miller, George A., and Walter G. Charles. "Contextual correlates of semantic ++ 3. similarity." Language and cognitive processes 6.1 (1991): 1-28. 4. Resnik, Philip. "Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language." J. Artif. Intell. Res.(JAIR) 11 (1999): 95-130. 5. Sánchez, David, et al. "Ontology-based semantic similarity: A new feature-based approach." Expert Systems with Applications 39.9 (2012): 7718-7728. 6. Budanitsky, Alexander, and Graeme Hirst. "Evaluating wordnet-based measures of lexical semantic relatedness." Computational Linguistics 32.1 (2006): 13-47.
  • 7. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056 Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072 © 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 175 7. Budanitsky, Alexander, and Graeme Hirst. "Evaluating wordnet-based measures of lexical semantic relatedness." Computational Linguistics 32.1 (2006): 13-47. 8. Amine, Abdelmalek, Zakaria Elberrichi, and Michel Simonet. "Evaluation of textclusteringmethodsusing wordnet." Int. Arab J. Inf. Technol. 7.4 (2010): 349- 357. 9. Jayarajan, Dinakar, Dipti Deodhare, and Balaraman Ravindran. "Lexical chains as document features." (2008): 111-117. 10. Fodeh, Samah, Bill Punch, and Pang-Ning Tan. "On ontology-driven document clustering using core semantic features." Knowledge and information systems28.2 (2011): 395-421. 11. Termier, Alexandre, Michèle Sebag, and Marie- Christine Rousset. "Combining Statistics and Semantics for Word and Document Clustering." workshop on ontology learning. 2001. 12. Chen, Chun-Ling, Frank SC Tseng, and Tyne Liang. "An integration of WordNet and fuzzy association rule mining for multi-label document clustering." Data & Knowledge Engineering 69.11 (2010): 1208-1226. BIOGRAPHIES SHABANA AFREEN received her B.E and M.tech degree in computer science from Osmania university. Her research interests are natural language processing, semantic web.