MODEL OF SEMANTIC
TEXTUAL DOCUMENT
CLUSTERING
Welcome to the Viva Presentation
Supervised By,
Assoc. Prof. Dr. Wael Yafooz
Dean, Faculty of Computer and Information Technology,
Al-Madinah International University, Shah Alam, Malaysia
Submitted By,
SK Ahammad Fahad
Matric No: MIT153BL308
Master of Science in Information and Communication Technology
Faculty of Computer and Information Technology
Al-Madinah International University, Shah Alam, Malaysia
Contents of Presentation
• Introduction
• Problem Statement
• Research Question
• Research Objective
• Related Studies
• Research Methodology
• Proposed Model
• Experiment Setting
• Testing
• Results and Discussion
• Conclusion
• Future Research
Introduction
Text documents are increasing over the internet: e-mails, articles, e-books, reports, and web pages, all stored in electronic format.
All of this text is unstructured or semi-structured, so it is very difficult to find information in such a huge collection.
These documents should be maintained with appropriate clustering to retrieve the valuable information they contain.
Introduction
• Document clustering is an extremely useful tool in today's world, where a great many textual records are stored and retrieved electronically [Publié le lundi – 2016].
• It makes document browsing easier, friendlier, and more economical.
• Traditional clustering methods are not effective for textual clustering [Charu & Zha-2012].
• Pre-processing and choosing an appropriate clustering method are the most important steps for accurate document clustering.
Problem Statement
• It is also a challenge to find useful data in large document collections [Charu & Zha-2012].
• Traditional document clusters are high-dimensional with respect to text [Hemant & Cappe-2009].
• Logical structure clues within the document, scientific criteria, and statistical similarity measures are chiefly used to compute thematically coherent, contiguous text blocks in unstructured documents [Qi Sun & Wu-2008].
• Recent segmentation techniques have taken advantage of advances in generative topic modeling algorithms, which were specifically designed to spot topics within text and to compute word–topic distributions [J.G. Lee & Whang-2007].
Research Questions
1. What is semantic textual document clustering, and what is special about semantic relations in textual document clustering?
2. How can a proper semantic document clustering method be modeled, analyzed, and developed?
3. What are the results of testing and analyzing the proposed clustering method?
Research Objectives
• To study the existing tools and techniques of semantic textual document clustering.
• To propose and develop a model for semantic textual document clustering.
• To test the proposed model.
Related Studies (Textual Document Clustering)
Document clustering is the method of grouping a set of records into clusters [Amanpreet Kaur & Amarpreet Singh – 2014].
Documents within each group are similar to each other; in other words, they belong to the same topic or subtopic.
A document clustering algorithm typically depends on the use of a pair-wise distance measure between the individual documents to be clustered.
Most of the techniques used in document clustering treat a document as a bag of words.
Related Studies (Semantic Document Clustering)
• Semantic document clustering parses the material in two ways: syntactically and semantically.
• Syntactic parsing can discard the less significant data from documents.
• Semantic parsing can then be applied to the parsed syntactic data, which clusters the documents properly and provides responses to the user that traditional methods cannot deliver accurately.
Related Studies (COBWEB Conceptual Clustering)
• COBWEB is a conceptual clustering algorithm developed by Fisher for the analysis of categorical data that cannot be ordered.
• The COBWEB algorithm is an incremental clustering algorithm that clusters one tuple at a time in a top-down manner.
• The algorithm uses four operators to evaluate and improve the quality of the tree. The quality measure in COBWEB is category utility.
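As a concrete illustration of category utility, here is a minimal sketch in Python; the toy tuples and function names are ours for illustration, not Fisher's or the thesis's implementation:

```python
# Category utility (CU) rewards clusters whose attribute values are
# predictable within the cluster and distinctive across clusters.
from collections import Counter

def attr_value_probs(tuples):
    """P(attribute = value) over a list of attribute dicts."""
    counts = Counter()
    for t in tuples:
        for av in t.items():
            counts[av] += 1
    n = len(tuples)
    return {av: c / n for av, c in counts.items()}

def category_utility(clusters):
    """CU = (1/K) * sum_k P(C_k) * sum_{a,v} [P(a=v|C_k)^2 - P(a=v)^2]."""
    all_tuples = [t for c in clusters for t in c]
    base = attr_value_probs(all_tuples)
    n, k = len(all_tuples), len(clusters)
    cu = 0.0
    for c in clusters:
        p_c = len(c) / n
        cond = attr_value_probs(c)
        gain = sum(p * p for p in cond.values()) - sum(p * p for p in base.values())
        cu += p_c * gain
    return cu / k

# Two pure clusters score higher than one mixed cluster:
docs = [{"topic": "cluster"}, {"topic": "cluster"},
        {"topic": "wordnet"}, {"topic": "wordnet"}]
split = category_utility([docs[:2], docs[2:]])
merged = category_utility([docs])
print(split, merged)  # 0.25 0.0
```

COBWEB's four operators (insert, create, merge, split) each propose a tree change, and the change with the highest category utility is kept.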
Related Studies (WordNet)
• A lexical database is an organized description of the lexemes of a language.
• Every language has at least two major lexical categories: Noun & Verb.
• Many languages also have two other major categories: Adjective & Adverb.
• Many languages have minor lexical categories such as Conjunctions, Particles & Adpositions.
• WordNet® is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets).
• Synsets are interlinked using conceptual-semantic and lexical relations.
• WordNet serves as background knowledge to enhance document clustering by offering relations between vocabulary terms, which makes it helpful for the clustering process.
Related Studies (WordNet)
Research Methodology
Phases, Activities and Deliverables
Phase: Feasibility Study
• Activities: Book, Journal, Paper, Encyclopedia, Information Source
• Deliverables: Textual Document Clustering, Semantic Document Clustering, Dataset, Natural Language Processing
Phase: Requirement Analysis
• Activities: PyCharm (IDE), SQLite Relational Database, WordNet, DB Viewer, COBWEB Algorithm, Sample Text, Full Text Search, Dataset Schema, SQL Query
• Deliverables: Stopword Removal, Lemmatization, Frequency, Semantic Document Clustering, WordNet
Research Methodology
Phases, Activities and Deliverables
Phase: Modeling
• Activities: Development Platform, Natural Language Tools, Accuracy Tools, WordNet
• Deliverables: COBWEB Concept Formation, NLTK, Synset
Phase: Model Development
• Activities: Coding, Experiment Design, Standards Maintenance, Sample Text
• Deliverables: Semantic Document Clustering Model, Hardware and Software Preparation, Similarity Measure
Phase: Testing and Analysis
• Activities: Pre-Processing, Clustering, Accuracy Measure
• Deliverables: Highly Accurate Clusters, Semantic Relations Between Words
Proposed Model (Steps)
• Sample Text Files
• Remove Tags from Text
• Tokenize Documents
• Remove Stopwords
• Synset Replacement (WordNet)
• Lemmatization (WordNet)
• Clustering (COBWEB Algorithm)
• Measure Cluster Accuracy
• Highly Accurate Clusters
Proposed Model(Flow-Chart)
Remove tags from input text Removing unwanted Noise from
Tokens.
Proposed Model (Flow-Chart)
Steps to remove stopwords from tokens.
Lemmatization and stemming process flow-chart.
Experiment Setting
Hardware
• HP TouchSmart 320 Desktop PC
• Display: 50.80 cm (20 inch); Resolution: 1600 x 900 (16:9 aspect ratio)
• Motherboard: Angelino2-UB
• Processor: AMD A6-3600, 4 MB Cache
• Memory: 8 GB, PC3-10600
• Hard Drive: 1 TB, 7200 rpm rotational speed
Software
• PyCharm: Version 2016.3.2; Build 163.10154.50; released December 30, 2016; one-year Education (Student) license.
• Scientific tools: Python Notebook, an interactive Python console, Matplotlib, and NumPy.
• NLTK: WordNet, classification, tokenization, stemming, tagging, parsing, semantic reasoning, wrappers.
• SQLite: SQLite3 is Python's built-in database engine; it is self-contained, serverless, zero-configuration, and transactional.
Sample Collection
• Papers' abstracts are used as samples.
• 20 abstracts from 20 papers.
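As one way the listed tools could fit together, the sample abstracts might be stored and searched with Python's built-in sqlite3 and its FTS5 full-text index; the table and column names here are hypothetical, not taken from the thesis:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# FTS5 virtual table: a full-text index over the abstract bodies.
con.execute("CREATE VIRTUAL TABLE samples USING fts5(name, body)")
con.executemany("INSERT INTO samples VALUES (?, ?)", [
    ("Sample 1", "document clustering with wordnet"),
    ("Sample 2", "semantic parsing of text"),
])
rows = con.execute(
    "SELECT name FROM samples WHERE samples MATCH 'clustering'"
).fetchall()
print(rows)  # [('Sample 1',)]
```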
Results And Discussion
• 20 samples from 20 different sources.
• A total of 3,292 tokens came from the 20 samples.
• 1,524 tokens were removed by stopword matching: 46.29% of the total tokens.
• 1,748 tokens were left.
• After the WordNet operations (synset replacement and lemmatization), 672 tokens remained: 20.41% of the total tokens.
• Of those 672 tokens, only 144 are unique.
Results And Discussion
• The most frequent word occurs 22 times.
• Completing the clustering process with the COBWEB algorithm yields 35 clusters.
• All sample documents were assigned to clusters except Sample 3 and Sample 16; those two inputs did not have enough maturity to be assigned a cluster.
• The F-Measure was applied to the 35 clusters.
• Several clusters are 100% accurate.
• We take the minimum cluster accuracy as the overall accuracy, which was 79.60%.
Experiment Setting (F-Measure)
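For reference, the standard F-Measure scores a cluster against a reference class by combining precision and recall; the counts in this sketch are invented, not the thesis's data:

```python
def f_measure(matches, cluster_size, class_size):
    """F = 2PR / (P + R) for one cluster scored against one reference class."""
    precision = matches / cluster_size  # fraction of the cluster that is relevant
    recall = matches / class_size       # fraction of the class the cluster found
    return 2 * precision * recall / (precision + recall)

print(f_measure(4, 5, 5))  # 0.8: 4 of 5 members match a 5-document class
```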
Testing (Pre-Processing)
Name of Source File | Number of Tokens in File
“Sample 1” 213
“Sample 2” 257
“Sample 3” 127
“Sample 4” 204
“Sample 5” 451
“Sample 6” 216
“Sample 7” 108
“Sample 8” 259
“Sample 9” 151
“Sample 10” 79
“Sample 11” 149
“Sample 12” 86
“Sample 13” 154
“Sample 14” 100
“Sample 15” 132
“Sample 16” 84
“Sample 17” 152
“Sample 18” 139
“Sample 19” 114
“Sample 20” 117
Sample text file report after tokenization
Name of Source File | Total Removed Tokens | Tokens after Stopword Removal
“Sample 1” 86 127
“Sample 2” 105 152
“Sample 3” 71 56
“Sample 4” 86 118
“Sample 5” 213 218
“Sample 6” 103 113
“Sample 7” 61 47
“Sample 8” 128 131
“Sample 9” 62 89
“Sample 10” 34 45
“Sample 11” 60 89
“Sample 12” 49 37
“Sample 13” 74 80
“Sample 14” 47 53
“Sample 15” 66 66
“Sample 16” 37 47
“Sample 17” 59 93
“Sample 18” 60 79
“Sample 19” 52 62
“Sample 20” 71 46
Tokens left for processing after stopword removal
Testing (Clusters)
Cluster Name | Members of Cluster (Source File)
Algorithm Sample 5, Sample 8, Sample 11, Sample 13, Sample 19
Approach Sample 2, Sample 18
Citat Sample 5
Classif Sample 8
Cliqu Sample 4
Cluster Sample 2, Sample 4, Sample 5, Sample 6, Sample 7, Sample 9, Sample 10, Sample 11, Sample 12, Sample 13, Sample 14
Cobweb Sample 8
Concept Sample 10, Sample 17
Data Sample 9, Sample 13, Sample 14
Document Sample 1, Sample 2, Sample 4, Sample 5
f-measur Sample 20
Function Sample 8
Inform Sample 2
Insert Sample 8
Language Sample 2
Measure Sample 5, Sample 15
Clusters with their members
Testing (Clusters)
Model Sample 5
Multilingu Sample 2
Node Sample 8
Object Sample 8
Ontolog Sample 5, Sample 17
Oper Sample 8
Pass Sample 19
Probabl Sample 15
Select Sample 5
Semant Sample 4, Sample 5, Sample 17
Separ Sample 8
Similar Sample 17
Singl Sample 19
Technique Sample 17
Term Sample 1
Tree Sample 8
Valu Sample 8
Version Sample 19
Word Sample 1
Conclusion
• Our framework performs valuable clustering of textual documents to extract the hidden information in unsupervised, unclassified text.
• We proposed and developed a full system with the capability to work with the semantic meaning of textual data.
• We use WordNet to ensure the semantic value of the data and to maintain relations semantically.
• We aim to deliver high-quality, accurate clustering; F-Measure evaluation and testing confirm that our clusters are accurate.
• Semantic clustering with WordNet gives us successful semantic-relation clustering, and the F-Measure ensures its quality.
Future Research
• Use newer versions of conceptual clustering such as COBWEB/3, ITERATE, or LABYRINTH.
• We designed for word tokens; in the future there is scope to work with sentence tokens.
• We used only the synset feature of WordNet. WordNet offers many more tools, such as types and semantic meanings, which can be used in future research.
Reference
• Amanpreet Kaur Toor and Amarpreet Singh (Amritsar College of Engineering & Technology, Punjab, India). An Advanced Clustering Algorithm (ACA) for Clustering Large Data Set to Achieve High Dimensionality. J Comput Sci Syst Biol, 7:4, 2014. URL: http://dx.doi.org/10.4172/jcsb.1000146
• C. Aggarwal and C. Zhai. A survey of text clustering algorithms. Mining Text Data, Springer, 2012.
• Charu C. Aggarwal & ChengXiang Zhai. Mining Text Data. Kluwer Academic Publishers, Boston, Dordrecht, London, 2012.
• G. Qi, C. Aggarwal, and T. Huang. Community detection with edge content in social media networks. ICDE Conference, 2013.
• G. Qi, C. Aggarwal, and T. Huang. Online community detection in social sensing. WSDM Conference, 2013.
• Hemant Misra, François Yvon, Joemon M. Jose, and Olivier Cappé. Text segmentation via topic modeling: an analytical study. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), pages 1553–1556, New York, USA. ACM, 2009.
• J.G. Lee, J. Han, and K.-Y. Whang. Trajectory clustering: a partition-and-group framework. SIGMOD Conference, 593–604, 2007.
Reference
• M. Karthikeyan, P. Aruna. "Probability Based Document Clustering and Image Clustering using Content-Based Image Retrieval". Elsevier Journal of Applied Soft Computing, pp. 959–966, 2012.
• MacLellan, C.J., Harpstead, E., Aleven, V., Koedinger, K.R. TRESTLE: Incremental Learning in Structured Domains using Partial Matching and Categorization. The Third Annual Conference on Advances in Cognitive Systems, Atlanta, GA, May 28–31, 2015.
• Pritam C. Gaigole, L. H. Patil, P.M. Chaudhari. "Preprocessing Techniques in Text Categorization". National Conference on Innovative Paradigms in Engineering & Technology (NCIPET-2013), 2013.
• Publié le lundi. Machine Learning, Semantics, Unstructured Data, 2016.
• Qi Sun, Runxin Li, Dingsheng Luo, and Xihong Wu. Text segmentation with LDA-based Fisher kernel. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers (HLT-Short '08), pages 269–272, Stroudsburg, PA, USA. Association for Computational Linguistics, 2008.
• Y. Sun, C. Aggarwal, and J. Han. Relation-strength aware clustering of heterogeneous information networks with incomplete attributes. Proceedings of the VLDB Endowment, 5(5):394–405, 2012.