1. An Improved Fuzzy System for
Representing Web Pages in
Clustering Tasks
PhD Thesis
Alberto Pérez García-Plaza
UNED – NLP & IR group
October 23, 2012
Advisors:
Raquel Martínez Unanue
Víctor Fresno Fernández
2. Table of Contents
1. Introduction
2. Web Page Representation and Fuzzy Logic
3. Adjusting the representation
4. Test Scenario: Taxonomy Learning
5. Conclusions & Outlook
6. Publications
2/83
3. Table of Contents
1. Introduction
1. Motivation
2. Objectives
2. Web Page Representation and Fuzzy Logic
3. Adjusting the representation
4. Test Scenario: Taxonomy Learning
5. Conclusions & Outlook
6. Publications
3/83
5. Motivation
• Document representation plays a key role in clustering: representation comes first.
• We focus on Document Representation:
  • the characteristics employed…
  • …and the way of using them.
5/83
6. State of the Art
• TF-IDF is a de facto standard.
• Combination of criteria:
  • Linear approaches.
  • Algorithms.
• Hyperlinks.
• Datasets for evaluation differ from one work to another.
6/83
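TF-IDF, named above as the de facto standard, weights a term by its frequency within a document times its rarity across the collection. A minimal sketch in plain Python (unsmoothed IDF; normalization choices vary between implementations):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Term frequency is normalized by document length; IDF is
    log(N / df(t)). Returns one {term: weight} dict per document.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency of each term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                        for t in tf})
    return weights

docs = [["bank", "loan", "rate"], ["bank", "river"], ["loan", "rate", "rate"]]
w = tf_idf(docs)
# "river" occurs in a single document, so it gets a higher weight
# there than the collection-wide term "bank".
```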
11. Web Page Example
• Our criteria: Position (Standard, Preferential)
11/83
12. Combining Criteria
• Linear Combination of Criteria:

  I_k = t_k · i_t + e_k · i_e + f_k · i_f + p_k · i_p

• Each criterion value (t_k, e_k, f_k, p_k) is multiplied by a constant.
• The constants (i_t, i_e, i_f, i_p) try to establish the importance of each criterion.
12/83
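The linear combination above can be sketched as follows; the constant values used here are illustrative placeholders, not the settings from the thesis:

```python
def linear_importance(t_k, e_k, f_k, p_k,
                      i_t=0.2, i_e=0.2, i_f=0.4, i_p=0.2):
    """Linear combination of criteria for a term k.

    t_k, e_k, f_k, p_k: title, emphasis, frequency and position
    scores of the term. The constants i_t..i_p encode the assumed
    importance of each criterion (values here are illustrative).
    """
    return t_k * i_t + e_k * i_e + f_k * i_f + p_k * i_p

# A term that is frequent and emphasized but absent from the title:
score = linear_importance(t_k=0.0, e_k=0.8, f_k=0.9, p_k=0.5)
```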
14. Combining Criteria
Title terms not related to the theme of the document.
14/83
15. Combining Criteria
• Need:
  • Related conditions to establish word importance.
• Fuzzy Logic, because:
  • It declares the knowledge without specifying the calculation.
  • Rules are close to natural language (IF - THEN).
  • It captures relations among criteria.
  • It eases the task of expressing heuristic knowledge.
  • Other kinds of systems require an additional effort to understand how they work before they can be modified.
15/83
16. Problem Statement
To study and improve a web page(1) representation based on fuzzy logic(2) applied to clustering tasks.

(1) HTML documents.
(2) FCC, Víctor Fresno, PhD Thesis (2006).
16/83
17. Objectives
1. Compare the fuzzy system with TF-IDF as standard method
and different dimension reduction methods.
2. Analyze an existing fuzzy combination of criteria (FCC).
3. Assess the possibility of adding new criteria beyond document contents.
4. Adjust the representation to concrete datasets.
5. Evaluate our proposals in hierarchical clustering.
6. Evaluate our methods in more than one language.
17/83
18. Table of Contents
1. Introduction
2. Web Page Representation and Fuzzy Logic
1. Dimension Reduction
2. Criteria Analysis
3. New Criteria
3. Adjusting the representation
4. Test Scenario: Taxonomy Learning
5. Conclusions & Outlook
6. Publications
18/83
22. Datasets
Dataset     # Documents  # Categories              Language           Hierarchical
Banksearch  9,897        10*                       English            No
WebKB       4,518        6                         English            No
SODP        12,148       17                        English            No
WAD         166          4 (1st level), 17 (2nd)   English & Spanish  Yes

(*) approx. same number of documents within each category.
22/83
23. Basic Clustering Settings
• Stop words removal & Stemming (Porter).
• Cluto-rbr with default parameters.
• Initial Weighting Functions: TF-IDF and FCC.
• Dimension Reduction Methods (100, 500, 1000, 2000, 5000
features): DF, LSI, RP, MFT.
• F-measure to evaluate clustering quality:

  F(i, j) = 2 · R(i, j) · P(i, j) / (R(i, j) + P(i, j))

  F = Σ_j (n_j / n) · max_i { F(i, j) }
23/83
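The two formulas above can be sketched as follows, treating clusters and classes as sets of document ids (a minimal implementation of the slide's definitions, not Cluto's internal one):

```python
def clustering_f_measure(clusters, classes):
    """Overall F-measure of a clustering, as on the slide:
    F = sum_j (n_j / n) * max_i F(i, j), where F(i, j) is the
    harmonic mean of precision P(i, j) and recall R(i, j) of
    cluster i with respect to class j.

    clusters, classes: lists of sets of document ids.
    """
    n = sum(len(c) for c in classes)
    total = 0.0
    for cls in classes:
        best = 0.0
        for clu in clusters:
            overlap = len(cls & clu)
            if overlap == 0:
                continue
            p = overlap / len(clu)   # precision P(i, j)
            r = overlap / len(cls)   # recall R(i, j)
            best = max(best, 2 * p * r / (p + r))
        total += (len(cls) / n) * best
    return total

# A perfect clustering scores 1.0:
f = clustering_f_measure([{1, 2}, {3}], [{1, 2}, {3}])
```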
24. MFT Reduction
Our proposal for dimension reduction: rank terms within each document, then select the top-ranked terms across documents until the desired number of terms is reached.

Example (per-document rankings):
Doc 1: soccer, goal, referee, …
Doc 2: music, show, band, …
Doc 3: goal, ball, soccer, …
Doc 4: music, band, album, …
Selected terms: music, soccer, goal, show, band, ball, …
24/83
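The ranking-and-selection idea above can be sketched as a round-robin pass over the per-document frequency rankings; the exact selection order used in the thesis may differ:

```python
from collections import Counter

def mft_reduction(docs, k):
    """Most Frequent Terms (MFT) reduction, as sketched on the slide:
    rank terms within each document by frequency, then take terms
    rank by rank across documents until k distinct terms are chosen.
    """
    rankings = [[t for t, _ in Counter(d).most_common()] for d in docs]
    selected, rank = [], 0
    while len(selected) < k and any(rank < len(r) for r in rankings):
        for r in rankings:
            if rank < len(r) and r[rank] not in selected:
                selected.append(r[rank])
                if len(selected) == k:
                    break
        rank += 1
    return selected

docs = [["soccer", "soccer", "goal"], ["music", "music", "band"]]
vocab = mft_reduction(docs, 3)  # ["soccer", "music", "goal"]
```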
27. Dimension Reduction Experiments
Comparison: MFT vs. LSI (both methods over TF-IDF and FCC)

Representation  Avg. F  S. D.
Banksearch
TF-IDF MFT      0.748   0.028
TF-IDF LSI      0.756   0.005
FCC MFT         0.756   0.019
FCC LSI         0.769   0.011
WebKB
TF-IDF MFT      0.460   0.051
TF-IDF LSI      0.507   0.006
FCC MFT         0.469   0.009
FCC LSI         0.466   0.011

• LSI outperforms MFT.
• FCC and TF-IDF are not working as well as they could.
• FCC in WebKB obtains bad results, even with LSI.
27/83
28. Analysis of the Combination
Banksearch

Rep. \ Dim.  100    500    1000   2000   5000
FCC MFT      0.723  0.757  0.768  0.765  0.768
title        0.626  0.646  0.632  0.634  0.639
emphasis     0.586  0.671  0.674  0.685  0.693
frequency    0.689  0.715  0.720  0.724  0.731
position     0.310  0.525  0.538  0.599  0.608

The combination always outperforms the individual criteria.
Frequency seems to be the best among the individual criteria.
28/83
31. Analysis of the Combination
WebKB

Rep. \ Dim.  100    500    1000   2000   5000
FCC MFT      0.453  0.472  0.475  0.468  0.475
title        0.432  0.433  0.404  0.488  0.479
emphasis     0.415  0.431  0.433  0.465  0.489
frequency    0.441  0.460  0.460  0.468  0.446
position     0.301  0.283  0.317  0.281  0.286

The combination does not always outperform the others.
Frequency is not always the best among the individual criteria.
When title and emphasis could lead to a better clustering, the combination gets worse.
31/83
35. Analysis of the Combination
• Position is considered more decisive than the other criteria.
• But position empirically got the worst results.
• Its heuristics are based on printed texts, not on web pages.
• Sample rule:
  IF title IS low
  AND frequency IS medium
  AND emphasis IS high
  AND position IS preferential
  THEN importance IS very high
35/83
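A Mamdani-style reading of such a rule can be sketched as follows, interpreting AND as the minimum t-norm. The membership functions below are illustrative shapes only, not the actual fuzzy sets of the FCC system:

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function: support [a, d], core [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Hypothetical membership functions for two criteria (the real sets
# live in the thesis FRBS; these parameter values are assumptions).
def title_low(x):
    return trapezoid(x, -0.1, 0.0, 0.2, 0.4)

def freq_medium(x):
    return trapezoid(x, 0.2, 0.4, 0.6, 0.8)

def rule_firing(title, freq):
    """Firing strength of 'IF title IS low AND frequency IS medium',
    with AND interpreted as the minimum t-norm."""
    return min(title_low(title), freq_medium(freq))

strength = rule_firing(title=0.1, freq=0.5)  # both antecedents fully met
```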
41. EFCC Experiments
Comparison: EFCC vs. FCC & TF-IDF

Representation  Avg. F  S. D.
Banksearch
TF-IDF LSI      0.756   0.005
FCC LSI         0.769   0.011
EFCC MFT        0.760   0.014
EFCC LSI        0.758   0.013
WebKB
TF-IDF LSI      0.507   0.006
FCC LSI         0.469   0.011
EFCC MFT        0.532   0.032
EFCC LSI        0.483   0.000

• Banksearch: with EFCC both reductions get similar results.
• WebKB: EFCC seems to solve the problems of FCC.
• EFCC with MFT seems to be a good alternative to TF-IDF with LSI, and MFT is cheaper than LSI.
41/83
42. Criteria Beyond the Document
• Add IDF to EFCC:

  EFCC-IDF(t, d, D) = EFCC(t, d) × IDF(t, D)

Comparison: EFCC vs. EFCC-IDF

Representation  Avg. F  S. D.
Banksearch
EFCC MFT        0.760   0.014
EFCC-IDF MFT    0.749   0.129
WebKB
EFCC MFT        0.532   0.032
EFCC-IDF MFT    0.350   0.070

• EFCC-IDF does not work in WebKB.
• IDF strongly affects EFCC.
• WebKB has unbalanced categories.
42/83
43. Criteria Beyond the Document
• Add information from Anchor Texts:
• We collect up to 300 unique inlinks for each SODP page (~ 1M).
• Two experiments.
• Three alternatives for each experiment.
(a) Anchors as plain text.
(b) Anchors as titles.
(1) Just adding anchors.
(2) Removing outlinks.
(3) Removing stopwords.
43/83
46. Criteria Beyond the Document
EFCC vs. EFCC + anchor texts

Representation  Avg. F  S. D.
SODP
FCC MFT         0.242   0.028
EFCC MFT        0.275   0.025
EFCC a-1 MFT    0.268   0.027
EFCC a-2 MFT    0.267   0.024
EFCC a-3 MFT    0.276   0.022
EFCC b-1 MFT    0.277   0.015
EFCC b-2 MFT    0.270   0.016
EFCC b-3 MFT    0.267   0.012

(a) Anchors as plain text. (b) Anchors as titles.
(1) Just adding anchors. (2) Removing outlinks. (3) Removing stopwords.

• The best case using anchor texts gets results similar to EFCC MFT.
• Computational cost.
• FCC also performs worse than EFCC in this collection.
46/83
47. Table of Contents
1. Introduction
2. Web Page Representation and Fuzzy Logic
3. Adjusting the representation
1. Analyze Data Distributions
2. Tune Membership Functions
4. Test Scenario: Taxonomy Learning
5. Conclusions & Outlook
6. Publications
47/83
48. Adjusting the Representation
• Some dataset characteristics could influence the way of
defining the Fuzzy Rule Based System.
• Document information is captured by means of membership
functions.
• Should these functions be modified depending on the
dataset?
Example of membership functions associated with the Frequency linguistic variable.
48/83
52. Adjusting the Representation
• Long tails could lead to considering only the maximum values as High (with the original fuzzy sets).
• Low values are compressed at the left side.
• We believe that High or Low should be relative values: they should depend on the distribution.
• Symmetrical sets are appropriate for uniformly distributed values.
• Input data patterns are not always the same, so the input capture process should not always be the same.
52/83
59. Adjusting the Representation
• Titles have a small number of possible values.
• We try to establish the sets to allow at least one value in each
interval when it is possible.
• We use the lowest value of the distribution for the Low set and divide the rest into equidistant percentiles.
59/83
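The percentile-based placement described above can be sketched as follows. The function names and the nearest-rank percentile are assumptions for illustration, and only the cut points between consecutive sets are computed; the set shapes themselves are defined elsewhere in the thesis FRBS:

```python
def percentile(values, q):
    """Simple nearest-rank percentile of a list of values."""
    s = sorted(values)
    idx = min(len(s) - 1, int(q / 100.0 * len(s)))
    return s[idx]

def tune_sets(values, n_sets=3):
    """Sketch of the slide's idea for small-valued criteria like title:
    anchor the Low set at the lowest observed value, then split the
    rest of the distribution at equidistant percentiles so that each
    interval covers at least one value when possible. Returns the
    breakpoints between consecutive fuzzy sets.
    """
    lo = min(values)
    step = 100.0 / n_sets
    return [lo] + [percentile(values, step * i) for i in range(1, n_sets)]

# Title-occurrence counts are small integers with few distinct values:
cuts = tune_sets([0, 1, 1, 2, 2, 2, 3, 4, 5], n_sets=3)
```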
62. Adjusting the Representation
AFCC vs. EFCC, FCC & TF-IDF

Banksearch                        SODP
Representation  Avg. F  S. D.     Representation  Avg. F  S. D.
TF-IDF MFT      0.748   0.028     TF-IDF MFT      0.293   0.030
FCC MFT         0.756   0.019     FCC MFT         0.242   0.028
EFCC MFT        0.760   0.014     EFCC MFT        0.275   0.024
AFCC MFT        0.770   0.016     AFCC MFT        0.272   0.023

WebKB
Representation  Avg. F  S. D.
TF-IDF MFT      0.460   0.051
FCC MFT         0.469   0.009
EFCC MFT        0.532   0.032
AFCC MFT        0.565   0.025

• AFCC gets the best results in two datasets.
• AFCC always gets comparable or better results than FCC and EFCC.
62/83
63. Table of Contents
1. Introduction
2. Web Page Representation and Fuzzy Logic
3. Adjusting the representation
4. Test Scenario: Taxonomy Learning
1. Hierarchical Clustering
2. Two Languages
5. Conclusions & Outlook
6. Publications
63/83
64. Test Scenario
• To try to build a taxonomy from a set of text documents from
Wikipedia.
64/83
65. Test Scenario
• Input: Comparable corpora in English and Spanish (documents
about animals).
65/83
68. Test Scenario
• Taxonomic F-measure.
• Labeling process:
  • Infer concept names from the majority of child nodes.
  • When more than one node is selected to receive the same label, they are merged if they are siblings…
  • …otherwise the smaller one remains unclassified.
68/83
71. Table of Contents
1. Introduction
2. Web Page Representation and Fuzzy Logic
3. Adjusting the representation
4. Test Scenario: Taxonomy Learning
5. Conclusions & Outlook
6. Publications
71/83
72. Conclusions
• To study a fuzzy model to represent HTML documents for clustering:
  • To propose a lightweight dimension reduction method focused on the weighting function.
  • To propose alternatives to improve the system (EFCC, AFCC).
  • To explore new criteria (IDF, anchor texts).
  • To compare our results with the previous FRBSs and TF-IDF.
• MFT obtained results comparable to LSI when used with EFCC.
• EFCC improved the system by changing the way the rules were defined, simplifying the system and avoiding rules that did not work.
• IDF and anchor texts did not contribute to improving results.
• EFCC achieved good performance in all datasets.
72/83
73. Conclusions
• To adjust the system to concrete datasets:
  • To analyze the frequency distributions of terms within each criterion.
  • To propose a way of tuning the basic parameters of the membership functions in an automated way.
  • To evaluate the results compared to the previous FRBS and TF-IDF.
• We found different term distributions among datasets: tuning the information capture process seems to make sense.
• Cases that do not follow a power law seem to be better candidates for improving results by FRBS tuning.
• The tuned system is based on dataset statistics only.
• Tuning the system is a feasible way of improving the representation.
73/83
74. Conclusions
• Evaluation of our proposals in a test scenario.
• Taxonomy learning problem through hierarchical clustering.
• Different algorithm.
• Different evaluation method.
• Comparable corpora written in English and Spanish.
• Fuzzy-logic-based alternatives improved on TF-IDF in English.
• For Spanish, the results were closer; the stemming process probably affects the behavior of the representation.
• Our results validate the usefulness of FRBSs for representing
documents in clustering tasks.
74/83
75. Conclusions
• Globally in this thesis:
• Fuzzy logic proved appropriate as a tool for declaring knowledge in an easy, understandable way in order to represent web pages.
• Some contexts where our proposals could achieve good results
have been identified.
75/83
76. Future Directions
• To study the effect of non-linear scaling factors over the fuzzy
sets.
• To explore whether partial clustering solutions could be used
for tuning the system.
• To study new criteria to include in the combination.
• Would it be possible to learn the rule set from examples?
• To apply this kind of fuzzy approaches to combine profiles in
company name filtering on Twitter.
76/83
78. Table of Contents
1. Introduction
2. Web Page Representation and Fuzzy Logic
3. Adjusting the representation
4. Test Scenario: Taxonomy Learning
5. Conclusions & Outlook
6. Publications
78/83
79. Publications
• Peer-reviewed Conferences (I):
• Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2008. Web
Page Clustering Using a Fuzzy Logic Based Representation and Self-
Organizing Maps. In Proceedings of Web Intelligence 2008, International
Conference on Web Intelligence and Intelligent Agent Technology
(IEEE/WIC/ACM). Volume 1, Page(s): 851 - 854. Sydney, Australia.
Acceptance Rate: 20%
[8 citations]
• Mari-Sanna Paukkeri, Alberto Pérez García-Plaza, Sini Pessala, and Timo Honkela. 2010. Learning taxonomic relations from a set of text documents. In Proceedings of AAIA'10, the 5th International Symposium Advances in Artificial Intelligence and Applications. Page(s): 105 - 112. Wisla, Poland.
International Fuzzy Systems Association Award for Young Scientist.
[2 citations]
79/83
80. Publications
• Peer-reviewed Conferences (II):
• Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2012. Fuzzy
Combinations of Criteria: An Application to Web Page Representation for
Clustering. In Proceedings of CICLing 2012, the 13th International
Conference on Intelligent Text Processing and Computational Linguistics.
Pages(s): 157 - 168. New Delhi, India.
Acceptance Rate: 28.6%
• Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2012. Fitting
Document Representation to Specific Datasets by Adjusting Membership
Functions. In Proceedings of FUZZ-IEEE 2012, the IEEE International
Conference on Fuzzy Systems. Brisbane, Australia.
ERA A
80/83
81. Publications
• Journals:
• Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2009. Una
Representación Basada en Lógica Borrosa para el Clustering de páginas
web con Mapas Auto-Organizativos. Procesamiento del Lenguaje
Natural, vol. 42, Pages 79 - 86.
FECYT Quality Seal for Scientific Spanish Journals. Spanish Foundation
for Science and Technology.
• Mari-Sanna Paukkeri, Alberto Pérez García-Plaza, Víctor Fresno, Raquel
Martínez and Timo Honkela. 2012. Learning a taxonomy from a set of
text documents. Applied Soft Computing. Volume 12, Issue 3, Pages
1138 - 1148, March 2012.
2011 JCR Impact Factor = 2.612.
[6 CITATIONS]
Ranked Q1 in Computer Science, Artificial Intelligence and Computer Science, Interdisciplinary Applications.
81/83
82. Publications
• Workshops:
• Agustín D. Delgado Muñoz, Raquel Martínez, Alberto Pérez García-Plaza
and Víctor Fresno. 2012. Unsupervised Real-Time Company Name
Disambiguation in Twitter. In Proceedings of the ICWSM-12 Workshop
on Real-Time Analysis and Mining of Social Streams, 6th International
AAAI Conference on Weblogs and Social Media. Page(s): 25 - 28. Dublin,
Ireland.
82/83
83. IF People IS Here AND Talk IS Done THEN
Slide IS
Thank You!
83/83