Web Page Clustering Using a Fuzzy Logic Based Representation and Self-Organizing Maps

Alberto P. García-Plaza, Víctor Fresno, Raquel Martínez
NLP & IR Group, UNED

December 12, 2008


Table of Contents

1 Objectives
2 Our Approach: Extended Fuzzy Combination of Criteria (EFCC)
3 Experiment Description
4 Results
5 Conclusion


Objectives

Group HTML documents by content similarity.
Use Self-Organizing Maps (SOM) to organize, visualize, and
navigate through the collection.
Use a term weighting function that takes advantage of HTML tags,
combining, by means of fuzzy logic, heuristic criteria based on
the inherent semantics of some HTML tags and on word positions
in the document.

Hypothesis
An improvement in document representation will lead to an
increase in map quality.



Fuzzy logic

Capturing human expert knowledge.
Close to natural language.
Knowledge base: defined by a set of IF-THEN rules.
Linguistic variables
Defined using natural language words and fuzzy sets.
These sets describe the degree to which an object belongs to
a particular class.
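
The ideas on this slide can be illustrated with a minimal sketch. It assumes triangular fuzzy sets named "low", "medium" and "high" over a normalized [0, 1] scale, and one hypothetical IF-THEN rule that combines two criteria with the minimum operator for AND. The set shapes, the criterion names and the rule itself are illustrative assumptions, not the actual EFCC knowledge base.

```python
def triangular(x, a, b, c):
    """Membership degree of x in a triangular fuzzy set with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy sets for a linguistic variable on a normalized [0, 1] scale.
FUZZY_SETS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}


def fuzzify(x):
    """Return the membership degree of x in each fuzzy set."""
    return {name: triangular(x, *params) for name, params in FUZZY_SETS.items()}


def rule_strength(title_value, emphasis_value):
    """Hypothetical rule: IF title is high AND emphasis is medium THEN importance is high.

    AND is modeled with the minimum operator, a common choice in fuzzy systems.
    """
    return min(fuzzify(title_value)["high"], fuzzify(emphasis_value)["medium"])
```

For instance, a term with a normalized title value of 0.75 belongs to "medium" and "high" with degree 0.5 each, so the rule above fires with strength 0.5 when the emphasis value is exactly 0.5.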



Extended Fuzzy Combination of Criteria


Linguistic Variables



Knowledge Base



Dimensionality Reduction

Input vector dimensions ranging from 100 to 5000.
Stopwords, punctuation marks, suffixes, and words occurring
fewer than 50 times in the whole corpus were removed.
Two well known methods:
Document frequency reduction.
Random projection method.
Three proposed rank-based methods:
Most Valued Terms.
Fixed reduction method.
More Frequent Terms until n level.
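
As an illustration of the first well-known method, the following sketch reduces a vocabulary by document frequency. It assumes that the reduction keeps the terms occurring in the largest number of documents; the slide does not state the exact selection criterion, and the function names are hypothetical.

```python
from collections import Counter


def document_frequency_reduction(docs, n_terms):
    """Keep the n_terms terms that occur in the most documents (assumed criterion).

    docs is a list of token lists, one per document.
    """
    df = Counter()
    for tokens in docs:
        # Count each term once per document, regardless of repetitions.
        df.update(set(tokens))
    # Sort by decreasing document frequency; break ties alphabetically.
    ranked = sorted(df, key=lambda term: (-df[term], term))
    return ranked[:n_terms]


def to_vector(tokens, vocabulary):
    """Represent one document as raw term counts over the reduced vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```

A document is then projected onto the reduced vocabulary with `to_vector`, which yields input vectors of the chosen dimension.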



Document Map Construction

Benchmark dataset for clustering: Banksearch [1]
10,000 documents
10 classes
SOM size was set equal to the number of classes of input
documents, i.e. 5x2, in order to compare clustering results.

[1] M. P. Sinka and D. W. Corne. A large benchmark dataset for web document clustering. Soft Computing
Systems: Design, Management, and Applications, 2002.
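
To make the map construction concrete, here is a minimal sketch of training a 5x2 SOM in plain Python. The learning-rate and neighborhood schedules, the Gaussian neighborhood, the sample-based initialization and all parameter values are assumptions chosen for illustration; the slides do not specify the training settings used in the experiments.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to x (squared Euclidean)."""
    return min(
        range(len(weights)),
        key=lambda u: sum((w - v) ** 2 for w, v in zip(weights[u], x)),
    )


def train_som(vectors, rows=5, cols=2, epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM and return one weight vector per unit."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Initialize each unit with a copy of a randomly chosen training vector.
    weights = [list(rng.choice(vectors)) for _ in range(rows * cols)]
    # Grid position of each unit, used by the neighborhood function.
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        order = list(vectors)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                   # linearly decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.1)   # shrinking neighborhood radius
            bmu = best_matching_unit(weights, x)
            for u, w in enumerate(weights):
                d2 = (coords[u][0] - coords[bmu][0]) ** 2 + (coords[u][1] - coords[bmu][1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))  # Gaussian neighborhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights
```

After training, each document is mapped to the unit returned by `best_matching_unit`, which gives one cluster per map unit and makes the ten units directly comparable with the ten classes.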


Evaluation Methods

Weighted average of the F-measure for each class.
After mapping the collection onto the trained map, each unit is
labeled with the class that has the most documents mapped to it.
Every document vector in a neuron whose class differs from the
neuron label is counted as an error.
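
The evaluation procedure above can be sketched as follows. The sketch assumes that each class's F-measure is weighted by the fraction of documents belonging to that class; the slide does not state the weights, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def weighted_f_measure(assignments):
    """Weighted average F-measure from (neuron_id, true_class) pairs, one per document."""
    by_neuron = defaultdict(Counter)
    for neuron, cls in assignments:
        by_neuron[neuron][cls] += 1
    # Label each neuron with the class that has the most documents mapped to it.
    labels = {n: counts.most_common(1)[0][0] for n, counts in by_neuron.items()}

    class_sizes = Counter(cls for _, cls in assignments)
    predicted = Counter()  # documents assigned to each label
    correct = Counter()    # documents whose class matches their neuron label
    for neuron, cls in assignments:
        label = labels[neuron]
        predicted[label] += 1
        if label == cls:
            correct[cls] += 1  # any other document counts as an error

    total = len(assignments)
    score = 0.0
    for cls, size in class_sizes.items():
        precision = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        recall = correct[cls] / size
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (size / total) * f  # weight by class size (assumption)
    return score
```

With two neurons, where neuron 0 holds two documents of class "a" and one of class "b" and neuron 1 holds two documents of class "b", both classes reach an F-measure of 0.8, so the weighted average is 0.8.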



Best reduction for each term weighting function



MFTn reduction provides stability



EFCC+MFTn obtains its best results with the
smallest number of features



Conclusion

Unsupervised document representation method, based on
fuzzy logic, focused on clustering HTML documents by means
of self-organizing maps.
MFTn is the most stable reduction in all cases.
The EFCC representation obtains better results with a
smaller vocabulary.
Fewer features are needed to represent the input
documents and SOM unit vectors, which reduces
computational cost.



Thank You!



Related Work

Work                                                    VSM  Topic Information  Document Type  Weighting Function  Modifies SOM
Self organization of a Massive Document Collection [2]  Yes  Yes                Text           Shannon's Entropy   No
Document Clustering using Phrases [3]                   Yes  No                 Text           Binary, TF, TF-IDF  No
Document Clustering using WordNet [4]                   Yes  Yes                Text           ESVM, HSVM, HyM     No
Conceptional SOM [5]                                    Yes  No                 Text           TF                  Yes

[2] T. Kohonen, S. Kaski, K. Lagus, J. Salojarvi, J. Honkela, V. Paatero, and A. Saarela. Self organization of a
massive document collection. IEEE Trans. on Neural Networks, 2000.
[3] J. Bakus, M. Hussin, and M. Kamel. A SOM-based document clustering using phrases. In ICONIP, 2002.
[4] C. Hung and S. Wermter. Neural network based document clustering using WordNet ontologies. Int. J.
Hybrid Intell. Syst., 2004.
[5] Y. Liu, X. Wang, and C. Wu. ConSOM: A conceptional SOM model for text clustering. Neurocomputing,
2008.