MODEL OF SEMANTIC
TEXTUAL DOCUMENT
CLUSTERING
Welcome to the Viva Presentation
Supervised By,
Assoc. Prof. Dr. Wael Yafooz
Dean, Faculty of Computer and Information Technology,
Al-Madinah International University, Shah Alam, Malaysia
Submitted By,
SK Ahammad Fahad
Matric No: MIT153BL308
Master of Science in Information and Communication Technology
Faculty of Computer and Information Technology
Al-Madinah International University, Shah Alam, Malaysia
Contents of Presentation
• Introduction
• Problem Statement
• Research Question
• Research Objective
• Related Studies
• Research Methodology
• Proposed Model
• Experiment Setting
• Testing
• Results and Discussion
• Conclusion
• Future Research
Introduction
Text documents are increasing over the internet: e-mails, articles, e-books, reports, and web pages, all stored in electronic format.
All of this text is unstructured or semi-structured, so it is very difficult to find information in such a huge collection.
These documents should be maintained with appropriate clustering to retrieve the valuable information they contain.
Introduction
• Document clustering is an extremely useful tool in today's world, where a great many textual records are stored and retrieved electronically [Publié le lundi – 2016].
• It makes document browsing easier, friendlier, and more economical.
• Traditional clustering methods are not effective for textual clustering [Charu & Zha-2012].
• Pre-processing and choosing an appropriate clustering method are the most important steps for accurate document clustering.
Problem Statement
• It is also a challenge to find useful data in large document collections [Charu & Zha-2012].
• Traditional document clusters are high-dimensional with respect to text [Hemant & Cappe-2009].
• Logical structure clues within the document, scientific criteria, and statistical similarity measures are chiefly used to compute thematically coherent, contiguous text blocks in unstructured documents [Qi Sun & Wu-2008].
• Recent segmentation techniques have taken advantage of advances in generative topic modeling algorithms, which were specifically designed to spot topics within text and to compute word–topic distributions [J.G. Lee & Whang-2007].
Research Questions
1. What is semantic textual document clustering, and what is special about semantic relations in textual document clustering?
2. How can a proper semantic document clustering method be modeled, analyzed, and developed?
3. What are the results of testing and analyzing the proposed clustering method?
Research Objectives
• To study the existing tools and techniques of semantic textual document clustering.
• To propose and develop a model for semantic textual document clustering.
• To test the proposed model.
Related Studies (Textual Document Clustering)
Document clustering is the method of grouping a set of records into clusters [Amanpreet Kaur & Amarpreet Singh – 2014].
Documents within each group are similar to each other; in other words, they belong to the same topic or subtopic.
A document clustering algorithm typically depends on the use of a pair-wise distance measure between the individual documents to be clustered.
Most of the techniques used in document clustering treat a document as a bag of words.
Related Studies (Semantic Document Clustering)
• Semantic document clustering parses the material in two ways: syntactically and semantically.
• Syntactic parsing can discard the less significant data from documents.
• Semantic parsing can then be applied to the parsed syntactic data, which clusters the documents properly and provides responses to the user that traditional methods cannot deliver accurately.
Related Studies (COBWEB Conceptual Clustering)
• COBWEB is a conceptual clustering algorithm developed by Fisher for the analysis of categorical data that cannot be ordered.
• The COBWEB algorithm is an incremental clustering algorithm that clusters one tuple at a time in a top-down manner.
• The algorithm uses four operators to evaluate and improve the quality of the tree. The quality measure in COBWEB is category utility.
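As a concrete illustration of category utility, here is a minimal sketch in Python; the toy tuples and function names are ours for illustration, not Fisher's or the thesis's implementation:

```python
# Category utility (CU) rewards clusters whose attribute values are
# predictable within the cluster and distinctive across clusters.
from collections import Counter

def attr_value_probs(tuples):
    """P(attribute = value) over a list of attribute dicts."""
    counts = Counter()
    for t in tuples:
        for av in t.items():
            counts[av] += 1
    n = len(tuples)
    return {av: c / n for av, c in counts.items()}

def category_utility(clusters):
    """CU = (1/K) * sum_k P(C_k) * sum_{a,v} [P(a=v|C_k)^2 - P(a=v)^2]."""
    all_tuples = [t for c in clusters for t in c]
    base = attr_value_probs(all_tuples)
    n, k = len(all_tuples), len(clusters)
    cu = 0.0
    for c in clusters:
        p_c = len(c) / n
        cond = attr_value_probs(c)
        gain = sum(p * p for p in cond.values()) - sum(p * p for p in base.values())
        cu += p_c * gain
    return cu / k

# Two pure clusters score higher than one mixed cluster:
docs = [{"topic": "cluster"}, {"topic": "cluster"},
        {"topic": "wordnet"}, {"topic": "wordnet"}]
split = category_utility([docs[:2], docs[2:]])
merged = category_utility([docs])
print(split, merged)  # 0.25 0.0
```

COBWEB's four operators (insert, create, merge, split) each propose a tree change, and the change with the highest category utility is kept.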
Related Studies (WordNet)
• A lexical database is an organized description of the lexemes of a language.
• Every language has at least two major lexical categories: Noun & Verb.
• Many languages also have two other major categories: Adjective & Adverb.
• Many languages have minor lexical categories such as Conjunctions, Particles & Adpositions.
• WordNet® is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets).
• Synsets are interlinked using conceptual-semantic and lexical relations.
• WordNet serves as background knowledge to enhance document clustering by offering relations between vocabulary terms, which makes it helpful for the clustering process.
Related Studies (WordNet)
Research Methodology
Phases, Activities and Deliverables
Phase: Feasibility Study
• Activities: Book, Journal, Paper, Encyclopedia, Information Source
• Deliverables: Textual Document Clustering, Semantic Document Clustering, Dataset, Natural Language Processing
Phase: Requirement Analysis
• Activities: PyCharm (IDE), SQLite Relational Database, WordNet, DB Viewer, COBWEB Algorithm, Sample Text, Full Text Search, Dataset Schema, SQL Query
• Deliverables: Stopword Removal, Lemmatization, Frequency, Semantic Document Clustering, WordNet
Research Methodology
Phases, Activities and Deliverables
Phase: Modeling
• Activities: Development Platform, Natural Language Tools, Accuracy Tools, WordNet
• Deliverables: COBWEB Concept Formation, NLTK, Synset
Phase: Model Development
• Activities: Coding, Experiment Design, Standards Maintenance, Sample Text
• Deliverables: Semantic Document Clustering Model, Hardware and Software Preparation, Similarity Measure
Phase: Testing and Analysis
• Activities: Pre-Processing, Clustering, Accuracy Measure
• Deliverables: Highly Accurate Clusters, Semantic Relations Between Words
Proposed Model (Steps)
• Sample Text Files
• Remove Tags from Text
• Tokenize Documents
• Remove Stopwords
• Synset Replacement (WordNet)
• Lemmatization (WordNet)
• Clustering (COBWEB Algorithm)
• Measure Cluster Accuracy
• Highly Accurate Clusters
Proposed Model(Flow-Chart)
Remove tags from input text Removing unwanted Noise from
Tokens.
Proposed Model (Flow-Chart)
Steps to remove stopwords from tokens.
Lemmatization and stemming process flow-chart.
Experiment Setting
Hardware
• HP TouchSmart 320 Desktop PC
• Display: 50.80 cm (20 inch); Resolution: 1600 x 900 (16:9 aspect ratio)
• Motherboard: Angelino2-UB
• Processor: AMD A6-3600, 4 MB Cache
• Memory: 8 GB, PC3-10600
• Hard Drive: 1 TB, 7200 rpm rotational speed
Software
• PyCharm: Version 2016.3.2; Build 163.10154.50; released December 30, 2016; one-year Education (Student) license.
• Scientific tools: Python Notebook, an interactive Python console, Matplotlib, and NumPy.
• NLTK: WordNet, classification, tokenization, stemming, tagging, parsing, semantic reasoning, wrappers.
• SQLite: SQLite3 is Python's built-in database engine; it is self-contained, serverless, zero-configuration, and transactional.
Sample Collection
• Papers' abstracts are used as samples.
• 20 abstracts from 20 papers.
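As one way the listed tools could fit together, the sample abstracts might be stored and searched with Python's built-in sqlite3 and its FTS5 full-text index; the table and column names here are hypothetical, not taken from the thesis:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# FTS5 virtual table: a full-text index over the abstract bodies.
con.execute("CREATE VIRTUAL TABLE samples USING fts5(name, body)")
con.executemany("INSERT INTO samples VALUES (?, ?)", [
    ("Sample 1", "document clustering with wordnet"),
    ("Sample 2", "semantic parsing of text"),
])
rows = con.execute(
    "SELECT name FROM samples WHERE samples MATCH 'clustering'"
).fetchall()
print(rows)  # [('Sample 1',)]
```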
Results And Discussion
• 20 samples from 20 different sources.
• A total of 3,292 tokens came from the 20 samples.
• 1,524 tokens were removed by stopword matching: 46.29% of the total tokens.
• 1,748 tokens were left.
• After the WordNet operations (synset replacement and lemmatization), 672 tokens remained: 20.41% of the total tokens.
• Of those 672 tokens, only 144 are unique.
Results And Discussion
• The most frequent word occurs 22 times.
• Completing the clustering process with the COBWEB algorithm yields 35 clusters.
• All sample documents were assigned to clusters except Sample 3 and Sample 16; those two inputs did not have enough maturity to be assigned a cluster.
• The F-Measure was applied to the 35 clusters.
• Several clusters are 100% accurate.
• We take the minimum cluster accuracy as the overall accuracy, which was 79.60%.
Experiment Setting (F-Measure)
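For reference, the standard F-Measure scores a cluster against a reference class by combining precision and recall; the counts in this sketch are invented, not the thesis's data:

```python
def f_measure(matches, cluster_size, class_size):
    """F = 2PR / (P + R) for one cluster scored against one reference class."""
    precision = matches / cluster_size  # fraction of the cluster that is relevant
    recall = matches / class_size       # fraction of the class the cluster found
    return 2 * precision * recall / (precision + recall)

print(f_measure(4, 5, 5))  # 0.8: 4 of 5 members match a 5-document class
```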
Testing (Pre-Processing)
Name of Source File | Number of Tokens in File
“Sample 1” 213
“Sample 2” 257
“Sample 3” 127
“Sample 4” 204
“Sample 5” 451
“Sample 6” 216
“Sample 7” 108
“Sample 8” 259
“Sample 9” 151
“Sample 10” 79
“Sample 11” 149
“Sample 12” 86
“Sample 13” 154
“Sample 14” 100
“Sample 15” 132
“Sample 16” 84
“Sample 17” 152
“Sample 18” 139
“Sample 19” 114
“Sample 20” 117
Sample text file report after tokenization
Name of Source File | Total Removed Tokens | Tokens after Stopword Removal
“Sample 1” 86 127
“Sample 2” 105 152
“Sample 3” 71 56
“Sample 4” 86 118
“Sample 5” 213 218
“Sample 6” 103 113
“Sample 7” 61 47
“Sample 8” 128 131
“Sample 9” 62 89
“Sample 10” 34 45
“Sample 11” 60 89
“Sample 12” 49 37
“Sample 13” 74 80
“Sample 14” 47 53
“Sample 15” 66 66
“Sample 16” 37 47
“Sample 17” 59 93
“Sample 18” 60 79
“Sample 19” 52 62
“Sample 20” 71 46
Tokens left for processing after stopword removal
Testing (Clusters)
Cluster Name | Members of Cluster (Source File)
Algorithm Sample 5, Sample 8, Sample 11, Sample 13, Sample 19
Approach Sample 2, Sample 18
Citat Sample 5
Classif Sample 8
Cliqu Sample 4
Cluster Sample 2, Sample 4, Sample 5, Sample 6, Sample 7, Sample 9, Sample 10, Sample 11, Sample 12, Sample 13, Sample 14
Cobweb Sample 8
Concept Sample 10, Sample 17
Data Sample 9, Sample 13, Sample 14
Document Sample 1, Sample 2, Sample 4, Sample 5
f-measur Sample 20
Function Sample 8
Inform Sample 2
Insert Sample 8
Language Sample 2
Measure Sample 5, Sample 15
Clusters with their members
Testing (Clusters)
Model Sample 5
Multilingu Sample 2
Node Sample 8
Object Sample 8
Ontolog Sample 5, Sample 17
Oper Sample 8
Pass Sample 19
Probabl Sample 15
Select Sample 5
Semant Sample 4, Sample 5, Sample 17
Separ Sample 8
Similar Sample 17
Singl Sample 19
Technique Sample 17
Term Sample 1
Tree Sample 8
Valu Sample 8
Version Sample 19
Word Sample 1
Conclusion
• Our framework performs valuable clustering of textual documents to extract the hidden information in unsupervised, unclassified text.
• We proposed and developed a full system with the capability to work with the semantic meaning of textual data.
• We use WordNet to ensure the semantic value of the data and to maintain relations semantically.
• We aim to deliver high-quality, accurate clustering; F-Measure evaluation and testing confirm that our clusters are accurate.
• Semantic clustering with WordNet gives us successful semantic-relation clustering, and the F-Measure ensures its quality.
Future Research
• Use newer versions of conceptual clustering such as COBWEB/3, ITERATE, or LABYRINTH.
• We designed for word tokens; in the future there is scope to work with sentence tokens.
• We used only the synset feature of WordNet. WordNet offers many more tools, such as types and semantic meanings, which can be used in future research.
Reference
• Amanpreet Kaur Toor and Amarpreet Singh (Amritsar College of Engineering & Technology, Punjab, India). An Advanced Clustering Algorithm (ACA) for Clustering Large Data Set to Achieve High Dimensionality. J Comput Sci Syst Biol, 7:4, 2014. URL: http://dx.doi.org/10.4172/jcsb.1000146
• C. Aggarwal and C. Zhai. A survey of text clustering algorithms. Mining Text Data, Springer, 2012.
• Charu C. Aggarwal & ChengXiang Zhai. Mining Text Data. Kluwer Academic Publishers, Boston, Dordrecht, London, 2012.
• G. Qi, C. Aggarwal, and T. Huang. Community detection with edge content in social media networks. ICDE Conference, 2013.
• G. Qi, C. Aggarwal, and T. Huang. Online community detection in social sensing. WSDM Conference, 2013.
• Hemant Misra, François Yvon, Joemon M. Jose, and Olivier Cappé. Text segmentation via topic modeling: an analytical study. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), pages 1553–1556, New York, USA. ACM, 2009.
• J.G. Lee, J. Han, and K.-Y. Whang. Trajectory clustering: a partition-and-group framework. SIGMOD Conference, 593–604, 2007.
Reference
• M. Karthikeyan, P. Aruna. "Probability Based Document Clustering and Image Clustering using Content-Based Image Retrieval". Elsevier Journal of Applied Soft Computing, pp. 959–966, 2012.
• MacLellan, C.J., Harpstead, E., Aleven, V., Koedinger, K.R. TRESTLE: Incremental Learning in Structured Domains using Partial Matching and Categorization. The Third Annual Conference on Advances in Cognitive Systems, Atlanta, GA, May 28–31, 2015.
• Pritam C. Gaigole, L. H. Patil, P.M. Chaudhari. "Preprocessing Techniques in Text Categorization". National Conference on Innovative Paradigms in Engineering & Technology (NCIPET-2013), 2013.
• Publié le lundi. Machine Learning, Semantics, Unstructured Data, 2016.
• Qi Sun, Runxin Li, Dingsheng Luo, and Xihong Wu. Text segmentation with LDA-based Fisher kernel. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers (HLT-Short '08), pages 269–272, Stroudsburg, PA, USA. Association for Computational Linguistics, 2008.
• Y. Sun, C. Aggarwal, and J. Han. Relation-strength aware clustering of heterogeneous information networks with incomplete attributes. Proceedings of the VLDB Endowment, 5(5):394–405, 2012.