Text Categorization as a Graph Classification Problem
Outline
Section 1 Introduction
Section 2 Review of the related work
Section 3 Preliminary concepts
Section 4 Proposed approaches
Section 5 Experimental evaluation
Section 6 Conclusion
References
Introduction
1. What is text mining?
2. Bag-of-words and its issues
3. Graph-of-words - A new approach
What is text mining?
Search engines
● Understand users’ queries, e.g. “What is Google?”
● Find and rank matching websites or documents.
Product recommendation
● Understand product descriptions.
● Understand product reviews.
Bag-of-words and its issues
Definition
A text (such as a sentence or a document) is represented as the bag (multiset) of its words.
Example
“He likes watching action movies, she likes watching romantic movies”
⇒ [ “He”, “likes”, “watching”, “action”, “movies”, “she”, “likes”, “watching”, “romantic”, “movies” ].
The sentence has 10 tokens but only 7 distinct words. Indexing the distinct words in order of first appearance (“He”, “likes”, “watching”, “action”, “movies”, “she”, “romantic”), it can be represented by a 7-entry count vector: [ 1, 2, 2, 1, 2, 1, 1 ]
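The bag-of-words representation of the example sentence can be sketched in a few lines of Python. This is only an illustration: the tokenizer (a simple `\w+` regular expression that keeps case) and the function name `bag_of_words` are assumptions, not something specified in the slides.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a multiset (word -> count) of the words in `text`."""
    # Assumption: words are maximal runs of word characters, case is kept.
    tokens = re.findall(r"\w+", text)
    return Counter(tokens)

sentence = "He likes watching action movies, she likes watching romantic movies"
bag = bag_of_words(sentence)

# Counter keeps insertion order, so the vocabulary is listed
# in order of first appearance in the sentence.
vocabulary = list(bag)
count_vector = [bag[word] for word in vocabulary]
```

Here `vocabulary` holds each distinct word once and `count_vector` holds how many times each of them occurs; word order is lost, which is exactly the weakness discussed next.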
Problems
● With thousands of news articles there are millions of n-gram features, but only a few hundred are actually present in each article, and there are only tens of class labels.
● N-grams fail to capture word inversion and subset matching (e.g., “article about news” vs. “news article”).
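The inversion problem is easy to reproduce. The sketch below extracts word bigrams from the two phrases of the example and shows that they share none, even though both phrases relate the same two content words. The helper name `ngrams` is illustrative.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams (as tuples) of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

bigrams_a = ngrams("article about news".split(), 2)
bigrams_b = ngrams("news article".split(), 2)

# Bigrams are ordered and contiguous, so the two phrases have no feature in common.
shared_bigrams = bigrams_a & bigrams_b
```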
Graph-of-words - A new approach
● Consider the task of text categorization as a graph classification problem.
● Represent textual documents as graphs-of-words instead of traditional n-gram bags-of-words.
● Extract more discriminative features that correspond to long-distance n-grams through frequent subgraph mining.
Summary:
1. Construct a graph-of-words for each document in the collection.
2. For each graph from step 1, extract its main core (to reduce the cost).
3. Find all frequent subgraphs of size n in the set of graphs obtained in step 2.
4. Remove isomorphic subgraphs to reduce the total number of features.
5. Finally, extract n-gram features from the remaining text.
Review of the related work
● Markov et al. (2007) proposed subgraph feature mining on graph-of-words representations.
● Kudo and Matsumoto (2004), Matsumoto et al. (2005), Jiang et al. (2010) and Arora et al. (2010) suggested using parse and dependency tree representations for text categorization, but the support value (which determines the total number of features) was not discussed and can potentially lead to millions of subgraphs on standard datasets.
Preliminary Concepts
1. Graph-of-words model
2. Subgraph isomorphism
3. K-core and main core
Graph-of-words model
Definition
A graph-of-words is an undirected graph G = (V, E), where
● V is the set of vertices, which represent the unique terms of the document
● E is the set of edges, which represent co-occurrences between the terms within a fixed-size sliding window
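A minimal sketch of this construction, assuming a window of `window` consecutive tokens (so each term is linked to the `window - 1` terms that follow it) and no preprocessing such as stop-word removal or stemming. The function name and the default window size are illustrative choices, not values taken from the slides.

```python
def graph_of_words(tokens, window=3):
    """Build an undirected graph-of-words as (vertices, edges).

    Vertices are the unique terms; an edge links two distinct terms that
    co-occur within `window` consecutive tokens. Edges are frozensets so
    that (u, v) and (v, u) are the same edge.
    """
    vertices = set(tokens)
    edges = set()
    for i, term in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != term:
                edges.add(frozenset((term, other)))
    return vertices, edges

_, edges_a = graph_of_words("article about news".split())
_, edges_b = graph_of_words("news article".split())

# Unlike bigrams, the two phrases now share the undirected edge {article, news}.
shared_edges = edges_a & edges_b
```

Because edges are undirected and span the whole window, word inversion and intervening words no longer prevent a match.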
Subgraph isomorphism
Definition
Given two graphs G and H, an isomorphism of G and H is a bijection f between the vertex sets of G and H such that any two vertices u and v of G are adjacent in G if and only if f(u) and f(v) are adjacent in H.
A subgraph isomorphism from G to H is an isomorphism between G and some subgraph of H.
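The definition of isomorphism can be checked directly by trying every bijection between the two vertex sets. The sketch below does exactly that for simple undirected graphs without self-loops; it is factorial in the number of vertices and only meant to make the definition concrete. The function name `are_isomorphic` is illustrative.

```python
from itertools import permutations

def are_isomorphic(vertices_g, edges_g, vertices_h, edges_h):
    """Brute-force isomorphism test for small simple undirected graphs."""
    vertices_g, vertices_h = list(vertices_g), list(vertices_h)
    source = {frozenset(e) for e in edges_g}
    target = {frozenset(e) for e in edges_h}
    if len(vertices_g) != len(vertices_h) or len(source) != len(target):
        return False
    for image in permutations(vertices_h):
        f = dict(zip(vertices_g, image))  # candidate bijection
        # With equal edge counts and an injective f, mapping every edge of G
        # onto an edge of H is equivalent to the "if and only if" condition.
        if all(frozenset(f[v] for v in e) in target for e in source):
            return True
    return False

path_1 = (["a", "b", "c"], [("a", "b"), ("b", "c")])
path_2 = (["x", "y", "z"], [("x", "z"), ("z", "y")])
star = (["a", "b", "c", "d"], [("a", "b"), ("a", "c"), ("a", "d")])
path_4 = (["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])

same_shape = are_isomorphic(*path_1, *path_2)
different_shape = are_isomorphic(*star, *path_4)
```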
K-core and main core
Definition
A subgraph H = (V’, E’) of a graph G = (V, E), induced by the subset of vertices V’ ⊆ V (with edges E’ ⊆ E), is called a k-core, where k is an integer, if and only if H is the maximal subgraph satisfying ∀ v ∈ V’, deg(v) >= k, where the degree is measured within H.
● k-core: a maximal connected subgraph whose vertices all have degree at least k within that subgraph.
● main core: the k-core with the largest k.
Example
Fig. Two 3-cores of a graph
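A k-core can be obtained by repeatedly deleting vertices whose degree is below k. The sketch below uses this simple peeling procedure, not an optimized linear-time implementation, and returns the vertices of all k-core components together (it does not split them into connected components). The adjacency-dictionary input format and the function names are assumptions made for the example.

```python
def k_core(adjacency, k):
    """Return the vertices that survive peeling all vertices of degree < k."""
    adj = {v: set(neighbours) for v, neighbours in adjacency.items()}
    to_remove = [v for v, neighbours in adj.items() if len(neighbours) < k]
    while to_remove:
        v = to_remove.pop()
        if v not in adj:
            continue  # already peeled
        for u in adj.pop(v):
            adj[u].discard(v)
            if len(adj[u]) < k:
                to_remove.append(u)
    return set(adj)

def main_core(adjacency):
    """Return (k, vertices) for the non-empty k-core with the largest k."""
    k, core = 0, set(adjacency)
    while True:
        next_core = k_core(adjacency, k + 1)
        if not next_core:
            return k, core
        k, core = k + 1, next_core

# A triangle a-b-c with one pendant vertex d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
core_number, core_vertices = main_core(graph)
```

In this toy graph the pendant vertex is peeled away and the triangle remains as the main core.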
Proposed approaches
1. Unsupervised feature mining using gSpan
2. Find frequent subgraphs using gSpan
3. Unsupervised support selection
4. Considered classifiers
5. Multiclass scenario
6. Main core mining using gSpan
Unsupervised feature mining using gSpan
Idea
● Consider the task of text categorization as a graph classification problem.
● Represent textual documents as graphs-of-words, then extract subgraph features to train a graph classifier.
● Each document is a separate graph-of-words, so the collection of documents corresponds to a set of graphs.
Find frequent subgraphs using gSpan
Given
● D = {G0, G1, G2, ..., GN}, a graph dataset
● support(g), the number of graphs in D in which g is a subgraph
● minSup, the minimum support threshold
Problem
Find every subgraph g such that support(g) >= minSup
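To make support(g) concrete, here is a small sketch under a simplifying assumption: in a graph-of-words every term labels exactly one vertex, so a pattern is contained in a graph exactly when all of its labeled edges are. Containment then reduces to a set-inclusion test on edges. For general graphs this shortcut does not hold and a subgraph isomorphism test is needed. The toy dataset and the helper names are hypothetical.

```python
def edge(u, v):
    """Undirected labeled edge between terms u and v."""
    return frozenset((u, v))

def support(pattern, dataset):
    """Number of graphs (edge sets) in `dataset` that contain every edge of `pattern`."""
    return sum(1 for graph in dataset if pattern <= graph)

# Three toy graphs-of-words, each given as a set of edges.
dataset = [
    {edge("graph", "mining"), edge("mining", "text")},
    {edge("graph", "mining"), edge("graph", "text")},
    {edge("text", "mining")},
]

min_sup = 2
single_edge = {edge("graph", "mining")}
two_edges = {edge("graph", "mining"), edge("mining", "text")}

single_edge_is_frequent = support(single_edge, dataset) >= min_sup
two_edges_is_frequent = support(two_edges, dataset) >= min_sup
```

The larger pattern occurs in fewer graphs than the single edge, which illustrates why support can only stay equal or decrease as a pattern grows.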
Frequent subgraph: a subgraph that occurs in at least minSup graphs of D
Baseline solution
● Enumerate all the subgraphs and test for isomorphism throughout the collection => very expensive
Proposed solution
● Use gSpan (graph-based Substructure pattern mining)
gSpan idea:
1. For each graph, build a lexicographic order of all the edges using a depth-first search (DFS) traversal.
2. Assign to each graph a unique minimum DFS code.
3. Based on all these DFS codes, construct a hierarchical search tree at the collection level.
4. By pre-order traversal of this tree, gSpan discovers all frequent subgraphs with the required support.
Note:
● Given two graphs G and G’, G is isomorphic to G’ if and only if minDFS(G) = minDFS(G’).
● The lower the support, the:
1. more features
2. longer the mining
3. longer the feature vector generation
4. longer the learning
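The minimum DFS code acts as a canonical form: two graphs receive the same code exactly when they are isomorphic. The sketch below illustrates that idea with a different and much simpler canonical form (the smallest sorted edge list over all vertex numberings) for small unlabeled graphs. It is a stand-in chosen for brevity, not gSpan's DFS code, and it is factorial in the number of vertices.

```python
from itertools import permutations

def canonical_code(vertices, edges):
    """Return a code that is identical for isomorphic simple undirected graphs."""
    vertices = list(vertices)
    best = None
    for order in permutations(range(len(vertices))):
        index = dict(zip(vertices, order))  # one possible vertex numbering
        code = tuple(sorted(tuple(sorted((index[u], index[v]))) for u, v in edges))
        if best is None or code < best:
            best = code
    # The vertex count is included so that isolated vertices are not ignored.
    return len(vertices), best

path_1 = canonical_code(["a", "b", "c"], [("a", "b"), ("b", "c")])
path_2 = canonical_code(["x", "y", "z"], [("x", "z"), ("z", "y")])
triangle = canonical_code(["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")])
```

Comparing canonical codes replaces pairwise isomorphism tests with simple equality checks, which is what makes duplicate patterns cheap to detect during mining.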
Unsupervised support selection (Select best minSup)
Given
● D = {G0, G1, G2, ..., GN}, a graph dataset
● support(g), the number of graphs in D in which g is a subgraph
● minSup, the minimum support threshold
Situation
● The classifier can only improve its goodness of fit with more features
=> It is likely that the lowest support will lead to the best test accuracy.
● As the support decreases, the number of features increases slowly up to a point beyond which it increases exponentially
=> This makes both the feature vector generation and the learning expensive, especially with multiple classes.
Problem
Select the best minSup
Solution
Use the elbow method
Elbow method
Example: selecting the number of clusters in k-means clustering
Choose a number of clusters such that adding another cluster does not give a much better model of the data.
In our case:
Choose a minSup such that decreasing it by one unit would
● not give much better accuracy
● but significantly increase the number of features
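The slides do not say how the elbow is located in practice. One common heuristic, used here purely as an assumption, is to take the point of the (support, number of features) curve that lies farthest from the straight line joining its two end points, after rescaling both axes to [0, 1]. The support values and feature counts below are made-up numbers for illustration only.

```python
def normalise(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def elbow_support(supports, feature_counts):
    """Return the support whose point is farthest from the end-to-end chord."""
    xs, ys = normalise(supports), normalise(feature_counts)
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]

    def distance_to_chord(i):
        # Proportional to the perpendicular distance from point i to the chord.
        return abs(dy * (xs[i] - xs[0]) - dx * (ys[i] - ys[0]))

    best = max(range(len(xs)), key=distance_to_chord)
    return supports[best]

# Hypothetical curve: supports in %, from high to low, with the feature counts mined.
supports = [10, 8, 6, 4, 2, 1]
feature_counts = [100, 150, 250, 500, 20000, 40000]

selected_support = elbow_support(supports, feature_counts)
```

Only the number of mined features is used, so the selection needs no labels and no cross-validation.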
Considered classifiers
Standard baseline classifiers
● K-nearest neighbors (kNN) (Larkey and Croft, 1996)
● Naive Bayes (NB) (McCallum and Nigam, 1998)
● Linear Support Vector Machines (SVM) (Joachims, 1998)
Multiclass scenario
Problem
A single support value might lead to some classes generating a tremendous number of features (hundreds of thousands) and others only a few (a few hundred subgraphs)
⇒ An extremely low support is needed to include discriminative features for these minority classes
⇒ This results in an exponential number of features because of the majority classes.
Solution
Mine frequent subgraphs per class using the same relative support (in %), then aggregate the per-class feature sets into a global one, at the cost of a supervised process (but one that still avoids cross-validation).
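The effect of per-class mining with a relative support can be sketched with single edges standing in for subgraph patterns (a simplification; the real features are mined with gSpan). All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from math import ceil

def frequent_edges(graphs, relative_support):
    """Edges present in at least `relative_support` (a fraction) of the graphs."""
    min_count = ceil(relative_support * len(graphs))
    counts = Counter(e for graph in graphs for e in graph)
    return {e for e, count in counts.items() if count >= min_count}

def per_class_features(labelled_graphs, relative_support):
    """Mine each class separately, then merge the per-class feature sets."""
    by_class = defaultdict(list)
    for label, graph in labelled_graphs:
        by_class[label].append(graph)
    features = set()
    for graphs in by_class.values():
        features |= frequent_edges(graphs, relative_support)
    return features

# Majority class "course" (4 graphs) and minority class "staff" (2 graphs);
# each graph is a set of edges, each edge a pair of terms.
labelled_graphs = [
    ("course", {("lecture", "notes")}),
    ("course", {("lecture", "notes"), ("exam", "date")}),
    ("course", set()),
    ("course", set()),
    ("staff", {("office", "hours")}),
    ("staff", set()),
]

merged = per_class_features(labelled_graphs, 0.5)
mined_globally = frequent_edges([g for _, g in labelled_graphs], 0.5)
```

With the same 50% threshold, mining the whole collection at once finds nothing, while per-class mining keeps one feature for the majority class and one for the minority class.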
Main core mining using gSpan
Problem
The number of features (subgraphs) to extract is very large when mining frequent subgraphs directly.
How can discriminative features be extracted while maintaining word dependence and retaining as much classification information as possible?
Solution
Reduce the graphs’ size by keeping only the densest subgraphs.
Implementation
Main cores are extracted with the Batagelj-Zaveršnik algorithm, a k-core decomposition that runs in time linear in the number of edges; frequent subgraphs are then mined with gSpan (C++ implementation).
Experimental evaluation
1. Datasets
2. Results
3. Unsupervised support selection
4. Distributions of mined n-grams
Datasets
● WebKB: the 4 most frequent categories among labeled web pages from various CS departments (2,803 for training and 1,396 for test)
● R8: the 8 most frequent categories of Reuters-21578, a set of labeled news articles from the 1987 Reuters newswire (5,485 for training and 2,189 for test)
● LingSpam: 2,893 emails classified as spam or legitimate messages (10 sets for 10-fold cross-validation)
● Amazon: 8,000 product reviews over four different sub-collections (books, DVDs, electronics and kitchen appliances) classified as positive or negative (1,600 for training and 400 for test per sub-collection)
The datasets were chosen so as to cover all the main subtasks of text categorization:
● Multi-class document categorization (WebKB and R8)
● Spam detection (LingSpam)
● Opinion mining (Amazon)
Results
Table 1: Total number of features (n-grams or subgraphs) vs. number of features present only in main cores, along with the reduction of the dimension of the feature space, on all four datasets.
Table 2: Test accuracy and macro-average F1-score on four standard datasets. Bold font marks the best performance in a column. * indicates statistical significance at p < 0.05 using the micro sign test with regard to the SVM baseline of the same column. MC corresponds to unsupervised feature selection using the main core of each graph-of-words to extract n-gram and subgraph features. gSpan mining support values are 1.6% (WebKB), 7% (R8), 4% (LingSpam) and 0.5% (Amazon).
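For reference, the two metrics reported in Table 2 can be computed as in the sketch below. Macro-average F1 is the unweighted mean of the per-class F1-scores, so minority classes weigh as much as majority ones. The labels and predictions are invented for the example and are unrelated to the reported results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

# Hypothetical spam-detection outcome on four messages.
y_true = ["spam", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```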
Figure 2: Distribution of non-zero n-gram feature values before and after unsupervised feature selection (main core retention) on the R8 dataset.
Unsupervised support selection
Figure 3: Number of subgraph features and test accuracy per support (%) on the WebKB (left) and R8 (right) datasets. In black, the support value selected via the elbow method; in red, the test accuracy of the SVM baseline.
Distribution of mined n-grams
Figure 4: Distribution of n-grams (standard and long-distance ones) among all the features on the WebKB dataset.
Figure 5: Distribution of n-grams (standard and long-distance ones) among the top 5% most discriminative features for SVM on the WebKB dataset.
Conclusion
● A new graph-of-words approach to text mining: text categorization is treated as a graph classification problem.
● Achieved: extraction of more discriminative features, corresponding to long-distance n-grams, through frequent subgraph mining.
References
Text Categorization as a Graph Classification Problem (François Rousseau, Emmanouil Kiagias, Michalis Vazirgiannis)
http://www.aclweb.org/anthology/P15-1164
gSpan: Graph-Based Substructure Pattern Mining (Xifeng Yan and Jiawei Han)
http://cs.ucsb.edu/~xyan/papers/gSpan-short.pdf
Determining the number of clusters in a data set - The Elbow Method
https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
Graph isomorphism
https://en.wikipedia.org/wiki/Graph_isomorphism
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 

Recently uploaded (20)

Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

Graph Classification Approach to Text Categorization Using Subgraph Features

  • 1. Text Categorization as a Graph Classification Problem
  • 2. Outline: Section 1, Introduction; Section 2, Review of the related work; Section 3, Preliminary concepts; Section 4, Proposed approaches; Section 5, Experimental evaluation; Section 6, Conclusion; References.
  • 3. Introduction: (1) What is text mining? (2) Bag-of-words and its issues. (3) Graph-of-words, a new approach.
  • 4. Introduction: What is text mining? Search engines: understand users’ queries (e.g., “What is Google?”) and find matching websites or documents (ranking). Product recommendation: understand product descriptions and product reviews.
  • 5. Introduction: Bag-of-words and its issues. Definition: a text (such as a sentence or a document) is represented as the bag (multiset) of its words.
  • 6. Introduction: Bag-of-words and its issues. Example: “He likes watching action movies, she likes watching romantic movies” ⇒ [ “He”, “likes”, “watching”, “action”, “movies”, “she”, “likes”, “watching”, “romantic”, “movies” ]. The sentence has 10 word occurrences (7 distinct words). Keeping the list order, it can be represented by a 10-entry vector in which each entry is the number of times the word at that position occurs in the sentence: [ 1, 2, 2, 1, 2, 1, 2, 2, 1, 2 ].
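The bag-of-words example can be reproduced with a few lines of Python. This is only a minimal sketch: the tokenizer (a `\w+` regular expression that keeps case) and the helper name `bag_of_words` are assumptions made for illustration, not something the slides specify.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Split text into word tokens and count them as a multiset."""
    # Assumption: a token is any run of word characters; case is kept as-is.
    tokens = re.findall(r"\w+", text)
    return tokens, Counter(tokens)

sentence = "He likes watching action movies, she likes watching romantic movies"
tokens, counts = bag_of_words(sentence)

# One entry per token position: how often that token occurs in the sentence.
vector = [counts[t] for t in tokens]

print(len(tokens), len(counts))
print(vector)
```

The `Counter` is the bag itself; the per-position vector is derived from it only to match the layout used on the slide.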
  • 7. Introduction: Bag-of-words and its issues. Problems: With thousands of news articles there are millions of n-gram features, but only a few hundred are actually present in each article, and there are only tens of class labels. N-grams fail to capture word inversion and subset matching (e.g., “article about news” vs. “news article”).
  • 8. Introduction: Graph-of-words, a new approach. Consider the task of text categorization as a graph classification problem. Represent textual documents as graphs-of-words instead of traditional n-gram bags-of-words. Extract more discriminative features that correspond to long-distance n-grams through frequent subgraph mining.
  • 9. Introduction: Graph-of-words, a new approach. Summary: (1) Construct a graph-of-words for each document in the set. (2) For each graph from step 1, extract its main core (to reduce cost). (3) Find all frequent subgraphs of size n in the set of graphs obtained from step 2. (4) Remove isomorphic subgraphs to reduce the total number of features. (5) Finally, extract n-gram features on the remaining text.
  • 10. Review of the related work. Markov et al. (2007) explored subgraph feature mining on graph-of-words representations. Kudo and Matsumoto (2004), Matsumoto et al. (2005), Jiang et al. (2010) and Arora et al. (2010) suggested using parse and dependency tree representations for text categorization. In these works the choice of the support value, which determines the total number of features, was not discussed, and it can potentially lead to millions of subgraphs on standard datasets.
  • 11. Preliminary concepts: (1) Graph-of-words model. (2) Subgraph isomorphism. (3) K-core and main core.
  • 12. Preliminary concepts: Graph-of-words model. Definition: an undirected graph G = (V, E), where V is the set of vertices, representing the unique terms of the document, and E is the set of edges, representing co-occurrences between terms within a fixed-size sliding window.
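The graph-of-words definition translates fairly directly into code. The sketch below is illustrative only: the function name `graph_of_words`, the edge representation (a `frozenset` of two terms) and the default window size of 4 are assumptions, since the slide does not fix any of them.

```python
def graph_of_words(tokens, window=4):
    """Build an undirected graph-of-words from a token list.

    Returns (vertices, edges). Each edge is a frozenset of two distinct
    terms that co-occur within a sliding window of `window` tokens.
    """
    vertices = set(tokens)
    edges = set()
    for i, term in enumerate(tokens):
        # Link the term to the next (window - 1) tokens that follow it.
        for other in tokens[i + 1 : i + window]:
            if other != term:  # no self-loops
                edges.add(frozenset((term, other)))
    return vertices, edges

vertices, edges = graph_of_words(["a", "b", "c", "a"], window=2)
print(sorted(vertices))
print(sorted(sorted(e) for e in edges))
```

Because vertices are unique terms, a repeated word adds edges to the existing vertex rather than creating a new one, which is what lets the graph capture word inversion.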
  • 13. Preliminary concepts: Subgraph isomorphism. Definition: given two graphs G and H, an isomorphism of G and H is a bijection f between the vertex sets of G and H such that any two vertices u and v of G are adjacent in G if and only if f(u) and f(v) are adjacent in H. A subgraph isomorphism from H to G is an isomorphism between H and a subgraph of G. Example: [figure]
  • 14. Preliminary concepts: K-core and main core. Definition: a subgraph H = (V’, E’) of a graph G = (V, E), induced by the subset of vertices V’ ⊆ V (with E’ ⊆ E), is called a k-core, where k is an integer, if and only if H is the maximal subgraph such that ∀ v ∈ V’, deg_H(v) ≥ k. In other words, a k-core is a maximal connected subgraph whose vertices all have degree at least k within that subgraph, and the main core is the k-core with the largest k.
  • 15. Preliminary concepts: K-core and main core. Example: [Figure: two 3-cores of a graph]
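A k-core can be computed by repeatedly peeling off vertices whose degree falls below k. The sketch below uses that simple peeling loop rather than a linear-time cores decomposition, and the names `k_core` and `main_core` and the adjacency-dict input are assumptions for illustration. Note that peeling returns all vertices that belong to some k-core, so if the graph has several disconnected k-cores (as in the figure), they are returned together.

```python
def k_core(adj, k):
    """Vertices that survive repeated removal of vertices with degree < k.

    `adj` maps each vertex to the set of its neighbours (undirected graph).
    """
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            degree = sum(1 for u in adj[v] if u in alive)
            if degree < k:
                alive.discard(v)
                changed = True
    return alive

def main_core(adj):
    """Return (k, vertices) for the largest k whose k-core is non-empty."""
    k, core = 0, set(adj)
    while True:
        nxt = k_core(adj, k + 1)
        if not nxt:
            return k, core
        k, core = k + 1, nxt

# A triangle a-b-c with a pendant vertex d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(main_core(adj))
```

In the example the pendant vertex is dropped and the triangle remains as the main core, which is the size reduction the approach relies on.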
  • 16. Proposed approaches: (1) Unsupervised feature mining using gSpan. (2) Find frequent subgraphs using gSpan. (3) Unsupervised support selection. (4) Considered classifiers. (5) Multiclass scenario. (6) Main core mining using gSpan.
  • 17. Proposed approaches: Unsupervised feature mining using gSpan. Idea: ● Treat the task of text categorization as a graph classification problem. ● Represent textual documents as graphs-of-words, then extract subgraph features to train a graph classifier. ● Each document is a separate graph-of-words, so the collection of documents corresponds to a set of graphs.
  • 18. Proposed approaches: Find frequent subgraphs using gSpan. Given: ● D = {G0, G1, G2, ..., GN}, a graph dataset ● support(g), the number of graphs in D of which g is a subgraph ● minSup, the minimum support threshold. Problem: find every subgraph g such that support(g) ≥ minSup.
  • 19. Proposed approaches: Find frequent subgraphs using gSpan. Frequent subgraph: a subgraph shared by multiple graphs in D (at least minSup of them).
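The notion of support can be illustrated without a full subgraph-isomorphism test. The sketch below relies on an assumption that holds for graphs-of-words as defined earlier: every vertex is a unique term, so a term-labeled pattern is a subgraph of a document graph exactly when all of its edges appear in that graph. The function names and the toy dataset are hypothetical, and only single-edge patterns are enumerated; this is not gSpan.

```python
from collections import Counter

def support(pattern_edges, dataset):
    """Number of graphs in `dataset` that contain every edge of the pattern.

    Each graph is a set of edges; each edge is a frozenset of two terms.
    """
    pattern = set(pattern_edges)
    return sum(1 for edges in dataset if pattern <= edges)

def frequent_edges(dataset, min_sup):
    """Single-edge patterns whose support is at least `min_sup`."""
    # Each graph is a set, so an edge is counted at most once per graph.
    counts = Counter(e for edges in dataset for e in edges)
    return {e for e, c in counts.items() if c >= min_sup}

def edge(a, b):
    return frozenset((a, b))

D = [
    {edge("news", "article"), edge("article", "about")},
    {edge("news", "article")},
    {edge("spam", "mail")},
]
print(support([edge("news", "article")], D))
print(frequent_edges(D, min_sup=2))
```

Larger frequent subgraphs would be grown from these frequent edges, which is the search that gSpan organizes efficiently.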
  • 20. Proposed approaches: Find frequent subgraphs using gSpan. Baseline solution: ● Enumerate all the subgraphs and test for isomorphism throughout the collection, which is very expensive. Proposed solution: ● Use gSpan (graph-based Substructure pattern mining).
  • 21. Proposed approaches: Find frequent subgraphs using gSpan. gSpan idea: (1) For each graph, build a lexicographic order of all the edges using depth-first search (DFS) traversal. (2) Assign to each graph a unique minimum DFS code. (3) Based on all these DFS codes, construct a hierarchical search tree at the collection level. (4) By pre-order traversal of this tree, gSpan discovers all frequent subgraphs with the required support.
  • 22. Proposed approaches: Find frequent subgraphs using gSpan. Note: given two graphs G and G’, G is isomorphic to G’ if and only if minDFS(G) = minDFS(G’). A lower support results in (1) more features, (2) longer mining, (3) longer feature vector generation, and (4) longer learning.
  • 23. Proposed approaches: Unsupervised support selection (selecting the best minSup). Given: D = {G0, G1, G2, ..., GN}, a graph dataset; support(g), the number of graphs in D of which g is a subgraph; minSup, the minimum support threshold.
  • 24. Proposed approaches: Unsupervised support selection (selecting the best minSup). Situation: The classifier can only improve its goodness of fit with more features, so the lowest support is likely to lead to the best test accuracy. However, as the support decreases, the number of features increases slowly up to a point beyond which it increases exponentially. This makes both feature vector generation and learning expensive, especially with multiple classes.
  • 25. Proposed approaches: Unsupervised support selection (selecting the best minSup). Problem: select the best minSup. Solution: use the elbow method.
  • 26. Proposed approaches: Unsupervised support selection (selecting the best minSup). Elbow method. Example: selecting the number of clusters in k-means clustering. Choose the number of clusters such that adding another cluster does not model the data much better.
  • 27. Proposed approaches: Unsupervised support selection (selecting the best minSup). Elbow method, in our case: choose minSup such that decreasing it by one unit would not improve accuracy much but would significantly increase the number of features.
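One way to locate an elbow automatically is to pick the point of the curve that lies farthest from the straight line joining its two endpoints, after rescaling both axes to [0, 1]. This is a common heuristic, not necessarily the procedure used in the slides, and the support values and feature counts below are made-up numbers chosen only to show the shape of the curve.

```python
def normalize(values):
    """Rescale values to [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def elbow_index(xs, ys):
    """Index of the point farthest from the line through the end points."""
    xs, ys = normalize(xs), normalize(ys)
    x0, y0 = xs[0], ys[0]
    dx, dy = xs[-1] - x0, ys[-1] - y0

    def distance(i):
        # Proportional to the perpendicular distance to the end-point line.
        return abs(dy * (xs[i] - x0) - dx * (ys[i] - y0))

    return max(range(len(xs)), key=distance)

# Hypothetical curve: number of subgraph features per support value (%).
supports = [10, 8, 6, 4, 2, 1]
n_features = [100, 150, 250, 600, 5000, 40000]

i = elbow_index(supports, n_features)
print(supports[i], n_features[i])
```

Because only the feature-count curve is used, the selection stays unsupervised: no test accuracy or cross-validation is needed to choose the support.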
  • 28. Proposed approaches: Considered classifiers. Standard baseline classifiers: k-nearest neighbors (kNN) (Larkey and Croft, 1996), Naive Bayes (NB) (McCallum and Nigam, 1998), and linear Support Vector Machines (SVM) (Joachims, 1998).
  • 29. Proposed approaches: Multiclass scenario. Problem: a single support value might lead some classes to generate a tremendous number of features (hundreds of thousands) and others only a few (a few hundred subgraphs). ⇒ An extremely low support is needed to include discriminative features for these minority classes, ⇒ which results in an exponential number of features because of the majority classes.
  • 30. Proposed approaches: Multiclass scenario. Solution: mine frequent subgraphs per class using the same relative support (in %), then aggregate the per-class feature sets into a global one. This makes the process supervised, but it still avoids cross-validation.
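The per-class mining step can be sketched as follows. To keep the example short it mines only single-edge patterns instead of running gSpan, and the function name, the edge representation (a `frozenset` of two terms) and the toy data are all assumptions for illustration. The point it shows is how one relative support turns into a different absolute threshold for each class.

```python
import math
from collections import Counter, defaultdict

def per_class_frequent_edges(graphs, labels, relative_support):
    """Union of the edge patterns that are frequent within each class.

    `graphs` is a list of edge sets, `labels` the class of each graph and
    `relative_support` a fraction in (0, 1] applied to every class.
    """
    by_class = defaultdict(list)
    for edges, label in zip(graphs, labels):
        by_class[label].append(edges)

    features = set()
    for class_graphs in by_class.values():
        # Same relative support, class-specific absolute threshold.
        min_sup = max(1, math.ceil(relative_support * len(class_graphs)))
        counts = Counter(e for edges in class_graphs for e in edges)
        features |= {e for e, c in counts.items() if c >= min_sup}
    return features

def edge(a, b):
    return frozenset((a, b))

graphs = [
    {edge("a", "b")}, {edge("a", "b")}, {edge("c", "d")}, {edge("e", "f")},
    {edge("x", "y")}, {edge("x", "z")},
]
labels = ["big", "big", "big", "big", "small", "small"]

features = per_class_frequent_edges(graphs, labels, relative_support=0.5)
print(sorted(sorted(e) for e in features))
```

With a single absolute threshold of 2 over the whole collection, the two patterns from the small class would be lost; the relative threshold keeps them without flooding the feature set with rare patterns from the large class.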
  • 31. Proposed approaches: Main core mining using gSpan. Problem: the number of features (subgraphs) to extract is very large when mining frequent subgraphs directly. How can discriminative features be extracted while maintaining word dependence and retaining as much classification information as possible? Solution: reduce the size of the graphs by keeping only their densest subgraphs.
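  • 32. Proposed approaches: Main core mining using gSpan. Implementation: main cores are extracted with the Batagelj-Zaveršnik algorithm, which runs in time linear in the number of edges; frequent subgraphs are then mined with a C++ implementation of gSpan.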
  • 33. Experimental evaluation: (1) Datasets. (2) Results. (3) Unsupervised support selection. (4) Distributions of mined n-grams.
  • 34. Experimental evaluation: Datasets. ● WebKB: the 4 most frequent categories among labeled web pages from various CS departments (2,803 for training and 1,396 for test) ● R8: the 8 most frequent categories of Reuters-21578, a set of labeled news articles from the 1987 Reuters newswire (5,485 for training and 2,189 for test) ● LingSpam: 2,893 emails classified as spam or legitimate messages (10 sets for 10-fold cross-validation) ● Amazon: 8,000 product reviews over four different sub-collections (books, DVDs, electronics and kitchen appliances) classified as positive or negative (1,600 for training and 400 for test per sub-collection)
  • 35. Experimental evaluation: Datasets. ● Multi-class document categorization (WebKB and R8) ● Spam detection (LingSpam) ● Opinion mining (Amazon). Together these cover the main subtasks of text categorization.
  • 36. Experimental evaluation: Results. Table 1: Total number of features (n-grams or subgraphs) vs. number of features present only in main cores, along with the reduction of the dimension of the feature space, on all four datasets.
  • 37. Experimental evaluation: Results. Table 2: Test accuracy and macro-average F1-score on four standard datasets. Bold font marks the best performance in a column; * indicates statistical significance at p < 0.05 using the micro sign test with regard to the SVM baseline of the same column. MC corresponds to unsupervised feature selection using the main core of each graph-of-words to extract n-gram and subgraph features. gSpan mining support values are 1.6% (WebKB), 7% (R8), 4% (LingSpam) and 0.5% (Amazon).
  • 38. Experimental evaluation: Results. Figure 2: Distribution of non-zero n-gram feature values before and after unsupervised feature selection (main core retention) on the R8 dataset.
  • 39. Experimental evaluation: Unsupervised support selection. Figure 3: Number of subgraph features and test accuracy per support (%) on the WebKB (left) and R8 (right) datasets; in black, the support value selected via the elbow method, and in red, the test accuracy of the SVM baseline.
  • 40. Experimental evaluation: Distribution of mined n-grams. Figure 4: Distribution of n-grams (standard and long-distance ones) among all the features on the WebKB dataset.
  • 41. Experimental evaluation: Distribution of mined n-grams. Figure 5: Distribution of n-grams (standard and long-distance ones) among the top 5% most discriminative features for SVM on the WebKB dataset.
  • 42. Conclusion. A new graph-of-words approach to text mining that treats text categorization as a graph classification problem. Achieved: extraction of more discriminative features, corresponding to long-distance n-grams, through frequent subgraph mining.
  • 43. References. ● Text Categorization as a Graph Classification Problem (François Rousseau, Emmanouil Kiagias, Michalis Vazirgiannis) http://www.aclweb.org/anthology/P15-1164 ● gSpan: Graph-Based Substructure Pattern Mining (Xifeng Yan and Jiawei Han) http://cs.ucsb.edu/~xyan/papers/gSpan-short.pdf ● Determining the number of clusters in a data set, the elbow method https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set ● Graph isomorphism https://en.wikipedia.org/wiki/Graph_isomorphism