An Improved Fuzzy System for
 Representing Web Pages in
      Clustering Tasks
               PhD Thesis
        Alberto Pérez García-Plaza
         UNED – NLP & IR group
            October 23, 2012

                 Advisors:
         Raquel Martínez Unanue
         Víctor Fresno Fernández
Table of Contents
1.   Introduction
2.   Web Page Representation and Fuzzy Logic
3.   Adjusting the representation
4.   Test Scenario: Taxonomy Learning
5.   Conclusions & Outlook
6.   Publications




                                               2/83
Table of Contents
1. Introduction
     1.   Motivation
     2.   Objectives
2.    Web Page Representation and Fuzzy Logic
3.    Adjusting the representation
4.    Test Scenario: Taxonomy Learning
5.    Conclusions & Outlook
6.    Publications



                                                3/83
Motivation
• Document clustering is grouping documents based only on the
  documents themselves.




                                                                1. Introduction
                                                                4/83
Motivation
• Document representation plays a key role in clustering.
• Representation comes first.
• We focus on Document Representation:
  • Characteristics employed…
  • …and the way of using them.




                                                            5/83
State of the Art
• TF-IDF is a de facto standard.
• Combination of criteria:
  • Linear approaches.
  • Algorithm.
• Hyperlinks.
• Datasets for evaluation differ from one work to another.




                                                             6/83
Web Page Example
• Our criteria:
  • Title
  • Emphasis
  • Frequency
  • Position (Standard, Preferential)

                                                    11/83
Combining Criteria
• Linear Combination of Criteria:

           I_k = t_k i_t + e_k i_e + f_k i_f + p_k i_p

• Each criterion is multiplied by a constant.
• Constants try to establish the importance of each criterion.
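
The linear combination can be sketched as a weighted sum. This is only an illustration: the weight values, the function name `linear_importance`, and the assumption that every criterion is already normalized to [0, 1] are hypothetical and not taken from the slides.

```python
# Hypothetical constants expressing the importance of each criterion.
WEIGHTS = {"title": 0.3, "emphasis": 0.2, "frequency": 0.4, "position": 0.1}


def linear_importance(criteria, weights=WEIGHTS):
    """Weighted sum of per-term criterion values (each assumed in [0, 1])."""
    return sum(weights[name] * criteria[name] for name in weights)


# A term that appears in the title and in a preferential position,
# with medium frequency and no emphasis (made-up values).
term = {"title": 1.0, "emphasis": 0.0, "frequency": 0.5, "position": 1.0}
print(round(linear_importance(term), 3))  # 0.3 + 0.0 + 0.2 + 0.1 = 0.6
```

Because the constants are fixed, each criterion contributes independently of the others, which is the limitation the following slides discuss.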




                                                                 12/83
Combining Criteria

 Title terms not related to the theme of the document.

                                                         14/83
Combining Criteria
• Need:
  • Related conditions to establish word importance.
• Fuzzy Logic because it lets us:
  •   Declare the knowledge without specifying the calculation.
  •   Write rules close to natural language (IF - THEN).
  •   Express relations among criteria.
  •   Ease the task of expressing heuristic knowledge.
• Other kinds of systems require additional effort to understand
  how they work before they can be modified.


                                                                  15/83
Problem Statement


To study and improve a web page (1) representation based on
fuzzy logic (2) applied to clustering tasks.

(1) HTML documents.
(2) FCC, Víctor Fresno, PhD Thesis (2006).


                                                           16/83
Objectives
1. Compare the fuzzy system with TF-IDF as standard method
   and different dimension reduction methods.
2. Analyze an existing fuzzy combination of criteria (FCC).
3. Assess the possibility of adding new criteria beyond
   document contents.
4. Adjust the representation to concrete datasets.
5. Evaluate our proposals in hierarchical clustering.
6. Evaluate our methods in more than one language.



                                                              17/83
Table of Contents
1. Introduction
2. Web Page Representation and Fuzzy Logic
     1.   Dimension Reduction
     2.   Criteria Analysis
     3.   New Criteria
3.    Adjusting the representation
4.    Test Scenario: Taxonomy Learning
5.    Conclusions & Outlook
6.    Publications


                                             18/83
Web Page Representation
• Stages: Term Weighting, Dimension Reduction, Clustering, Evaluation.

                                              19/83
Overview of the Fuzzy System
(Figure: diagram of the fuzzy system, including its knowledge base.)

                                      21/83
Datasets
Dataset       # Documents   # Categories                    Language            Hierarchical
Banksearch    9,897         10 (*)                          English             No
WebKB         4,518         6                               English             No
SODP          12,148        17                              English             No
WAD           166           4 (1st level), 17 (2nd level)   English & Spanish   Yes

(*) Approximately the same number of documents within each category.




                                                                             22/83
Basic Clustering Settings
• Stop words removal & Stemming (Porter).
• Cluto-rbr with default parameters.
• Initial Weighting Functions: TF-IDF and FCC.
• Dimension Reduction Methods (100, 500, 1000, 2000, 5000
  features): DF, LSI, RP, MFT.
• F-measure to evaluate clustering quality.
    F(i, j) = \frac{2 \times R(i, j) \times P(i, j)}{R(i, j) + P(i, j)}

    F = \sum_{j} \frac{n_j}{n} \times \max_{i} \{ F(i, j) \}
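
A minimal sketch of this evaluation measure follows. It assumes that j indexes the reference categories (so n_j is a category size and n the number of documents) and i the clusters, with R(i, j) = n_ij / n_j and P(i, j) = n_ij / n_i; the slide does not spell out these roles, and the function name is hypothetical.

```python
from collections import Counter


def clustering_f_measure(classes, clusters):
    """Overall F: for each class, take the best F(i, j) over all clusters
    and weight it by the relative class size n_j / n (assumed index roles)."""
    n = len(classes)
    class_sizes = Counter(classes)              # n_j
    cluster_sizes = Counter(clusters)           # n_i
    overlap = Counter(zip(clusters, classes))   # n_ij
    total = 0.0
    for j, n_j in class_sizes.items():
        best = 0.0
        for i, n_i in cluster_sizes.items():
            n_ij = overlap[(i, j)]
            if n_ij == 0:
                continue
            recall = n_ij / n_j
            precision = n_ij / n_i
            best = max(best, 2 * recall * precision / (recall + precision))
        total += n_j / n * best
    return total


gold = ["sport", "sport", "music", "music"]
print(clustering_f_measure(gold, [0, 0, 1, 1]))  # perfect clustering -> 1.0
```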

                                                            23/83
MFT Reduction
Our proposal for dimension reduction:
• Rank terms within each document:

   1. Soccer    1. Music    1. Goal      1. Music
   2. Goal      2. Show     2. Ball      2. Band
   3. Referee   3. Band     3. Soccer    3. Album
   4. …         4. …        4. …         4. …

• Selected terms: Music, Soccer, Goal, Show, Band, Ball…
  …until the desired number of terms is reached.

                                                    24/83
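
One possible reading of this procedure is sketched below: walk the per-document rankings level by level and collect unseen terms until the target vocabulary size is reached. Ordering terms within a level by how many documents share them, and breaking ties by first appearance, are assumptions; the function name `mft_select` is hypothetical.

```python
from collections import Counter


def mft_select(ranked_docs, max_terms):
    """Pick terms rank level by rank level until max_terms are selected."""
    selected = []
    depth = max(len(doc) for doc in ranked_docs)
    for level in range(depth):
        # Terms found at this rank position, most shared first (assumed order).
        at_level = Counter(doc[level] for doc in ranked_docs if level < len(doc))
        for term, _ in at_level.most_common():
            if term not in selected:
                selected.append(term)
                if len(selected) == max_terms:
                    return selected
    return selected


# The four ranked documents from the slide.
docs = [
    ["Soccer", "Goal", "Referee"],
    ["Music", "Show", "Band"],
    ["Goal", "Ball", "Soccer"],
    ["Music", "Band", "Album"],
]
print(mft_select(docs, 6))
```

With a target of six terms this selects the same six terms listed on the slide, although the order among tied terms may differ.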
Dimension Reduction
Experiments
Comparison: MFT Vs. LSI
Both methods over TF-IDF and FCC.

Representation   Avg. F   S. D.
Banksearch
TF-IDF MFT       0.748    0.028
TF-IDF LSI       0.756    0.005
FCC MFT          0.756    0.019
FCC LSI          0.769    0.011
WebKB
TF-IDF MFT       0.460    0.051
TF-IDF LSI       0.507    0.006
FCC MFT          0.469    0.009
FCC LSI          0.466    0.011

• LSI outperforms MFT.
  • FCC and TF-IDF are not working as well as they could.
• FCC in WebKB obtains bad results, even with LSI.

                                                                            27/83
Analysis of the Combination
Banksearch
Rep.Dim.        100          500          1000           2000      5000
FCC MFT            0.723        0.757         0.768         0.765     0.768
title              0.626        0.646         0.632         0.634     0.639
emphasis           0.586        0.671         0.674         0.685     0.693
frequency          0.689        0.715         0.720         0.724     0.731
position           0.310        0.525         0.538         0.599     0.608


The combination always outperforms individual criteria.

Frequency seems to be the best among individual criteria.
                                                                              28/83
Analysis of the Combination
WebKB
Rep.Dim.        100           500          1000          2000         5000
FCC MFT            0.453         0.472         0.475          0.468      0.475
title              0.432         0.433         0.404          0.488      0.479
emphasis           0.415         0.431         0.433          0.465      0.489
frequency          0.441         0.460         0.460          0.468      0.446
position           0.301         0.283         0.317          0.281      0.286


The combination does not always outperform the others.

Frequency is not always the best among individual criteria.

When title and emphasis could lead to a better clustering, the combination
gets worse.
                                                                                 31/83
Analysis of the Combination
•   Position is considered more decisive than the other criteria.
•   But position empirically got the worst results.
•   Its heuristics are based on written texts, not on web pages.
•   Sample rule:
    • IF title IS low
         AND frequency IS medium
         AND emphasis IS high
         AND position IS preferential
         THEN importance IS very high
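
As a rough illustration, the firing strength of a rule like this one could be computed as below. Using min as the fuzzy AND is a common choice but an assumption here, and all membership degrees are made-up values for a single term.

```python
def rule_strength(memberships, antecedents):
    """Firing strength of an IF-THEN rule, using min as the fuzzy AND (assumed)."""
    return min(memberships[var][label] for var, label in antecedents)


# Hypothetical membership degrees of one term in each fuzzy set.
term = {
    "title": {"low": 0.9, "high": 0.1},
    "frequency": {"low": 0.0, "medium": 0.7, "high": 0.3},
    "emphasis": {"low": 0.2, "high": 0.8},
    "position": {"standard": 0.0, "preferential": 1.0},
}

# Antecedents of the sample rule; its consequent is "importance IS very high".
sample_rule = [
    ("title", "low"),
    ("frequency", "medium"),
    ("emphasis", "high"),
    ("position", "preferential"),
]

print(rule_strength(term, sample_rule))  # min(0.9, 0.7, 0.8, 1.0) = 0.7
```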


                                                                      35/83
EFCC Rule Base
(Figure: EFCC rule base. Caption: Title, Emphasis and Position + Frequency.)

                                        37/83
EFCC Experiments
Comparison: EFCC Vs. FCC & TF-IDF

Representation   Avg. F   S. D.
Banksearch
TF-IDF LSI       0.756    0.005
FCC LSI          0.769    0.011
EFCC MFT         0.760    0.014
EFCC LSI         0.758    0.013
WebKB
TF-IDF LSI       0.507    0.006
FCC LSI          0.469    0.011
EFCC MFT         0.532    0.032
EFCC LSI         0.483    0.000

• Banksearch: with EFCC both reductions get similar results.
• WebKB: EFCC seems to solve the problems of FCC.
• EFCC with MFT seems to be a good alternative to TF-IDF with LSI.
  • MFT is cheaper than LSI.

                                                                     41/83
Criteria Beyond the Document
• Add IDF to EFCC:
        EFCC-IDF(t, d, D) = EFCC(t, d) × IDF(t, D)

Comparison: EFCC Vs. EFCC-IDF

Representation   Avg. F   S. D.
Banksearch
EFCC MFT         0.760    0.014
EFCC-IDF MFT     0.749    0.129
WebKB
EFCC MFT         0.532    0.032
EFCC-IDF MFT     0.350    0.070

• EFCC-IDF does not work in WebKB.
• IDF strongly affects EFCC.
• WebKB has unbalanced categories.

                                                                           42/83
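
The product above can be sketched as follows. The IDF variant used here, log(N / df), is the textbook form and an assumption, since the slides do not state which variant was used; the EFCC weight of 0.8 and the toy corpus are made up.

```python
import math


def idf(term, documents):
    """Textbook IDF, log(N / df); the variant used in the thesis is not stated."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0


def efcc_idf(efcc_weight, term, documents):
    """EFCC-IDF(t, d, D) = EFCC(t, d) * IDF(t, D)."""
    return efcc_weight * idf(term, documents)


# Toy corpus: each document is a set of terms.
corpus = [{"bank", "loan"}, {"bank", "music"}, {"soccer", "goal"}, {"bank"}]

# 0.8 stands in for an EFCC(t, d) value; it is made up for illustration.
print(efcc_idf(0.8, "bank", corpus))    # frequent in the corpus: weight shrinks
print(efcc_idf(0.8, "soccer", corpus))  # rare in the corpus: weight grows
```

The example shows how strongly the corpus-level factor can rescale a document-level weight, which is consistent with the observation that IDF strongly affects EFCC.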
Criteria Beyond the Document
• Add information from Anchor Texts:
  • We collect up to 300 unique inlinks for each SODP page (~ 1M).
• Two experiments:
     (a) Anchors as plain text.
     (b) Anchors as titles.
• Three alternatives for each experiment:
     (1) Just adding anchors.
     (2) Removing outlinks.
     (3) Removing stopwords.
                                                                     43/83
Criteria Beyond the Document
EFCC Vs. EFCC + Anchor texts

Representation   Avg. F   S. D.
SODP
FCC MFT          0.242    0.028
EFCC MFT         0.275    0.025
EFCC a-1 MFT     0.268    0.027
EFCC a-2 MFT     0.267    0.024
EFCC a-3 MFT     0.276    0.022
EFCC b-1 MFT     0.277    0.015
EFCC b-2 MFT     0.270    0.016
EFCC b-3 MFT     0.267    0.012

(a) Anchors as plain text.   (b) Anchors as titles.
(1) Just adding anchors.   (2) Removing outlinks.   (3) Removing stopwords.

• The best case using anchor texts gets results similar to EFCC MFT.
• Computational cost.
• FCC also performs worse than EFCC in this collection.

                                                                          46/83
Table of Contents
1. Introduction
2. Web Page Representation and Fuzzy Logic
3. Adjusting the representation
  1.   Analyze Data Distributions
  2.   Tune Membership Functions
4. Test Scenario: Taxonomy Learning
5. Conclusions & Outlook
6. Publications



                                             47/83
Adjusting the Representation
• Some dataset characteristics could influence the way of
  defining the Fuzzy Rule Based System.
• Document information is captured by means of membership
  functions.
• Should these functions be modified depending on the
  dataset?


                               Example of membership functions
                               associated to Frequency Linguistic
                               Variable.

                                                                    48/83
Adjusting the Representation
• The inputs are frequency values in different criteria.
  (Figures, one per criterion: Frequency, Emphasis, Titles.)

                                                           51/83
Adjusting the Representation
• With the original fuzzy sets, long tails could lead to considering
  only the maximum values as High.
  • Low values are compressed at the left side.

• We believe that High or Low should be relative values.
  • High or Low should depend on the distribution.
  • Symmetrical sets are appropriate for uniformly distributed
    values.

• Input data patterns are not always the same, so the input capture
  process should not always be the same.
                                                                 52/83
Adjusting the Representation

                  • Grant at least 1 value for each
                    interval: 1/5
                  • Equidistant percentiles for the
                    rest of the intervals.

                                                      55/83
Adjusting the Representation
• Titles have a small number of possible values.
• We try to establish the sets so that each interval contains at least
  one value, when possible.
• We use the lowest value of the distribution for the low set and
  divide the rest into equidistant percentiles.
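
The idea can be sketched as a function that derives anchor points for the fuzzy sets from the observed values. The function name, the nearest-rank percentile computation, and the choice of three sets are assumptions for illustration; the slides do not give the exact procedure.

```python
def set_breakpoints(values, n_sets):
    """Anchor the lowest set at the minimum value and place the remaining
    n_sets - 1 anchors at equidistant percentiles of the other values.

    Assumes n_sets >= 2 and at least one value above the minimum.
    """
    ordered = sorted(values)
    lowest = ordered[0]
    rest = [v for v in ordered if v > lowest]
    points = [lowest]
    for k in range(1, n_sets):
        # Nearest-rank position at fraction k / (n_sets - 1) of the rest.
        idx = k * (len(rest) - 1) // (n_sets - 1)
        points.append(rest[idx])
    return points


# Made-up title frequencies: few distinct values, most of them equal to 1.
title_counts = [1, 1, 1, 1, 1, 2, 2, 3, 5]
print(set_breakpoints(title_counts, 3))  # [1, 2, 5]
```

Because the anchors come from the data, a long-tailed distribution no longer pushes every set except the highest one into the leftmost part of the range.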




                                                                     59/83
Adjusting the Representation
AFCC Vs. EFCC, FCC & TF-IDF

Representation   Avg. F   S. D.
Banksearch
TF-IDF MFT       0.748    0.028
FCC MFT          0.756    0.019
EFCC MFT         0.760    0.014
AFCC MFT         0.770    0.016
WebKB
TF-IDF MFT       0.460    0.051
FCC MFT          0.469    0.009
EFCC MFT         0.532    0.032
AFCC MFT         0.565    0.025
SODP
TF-IDF MFT       0.293    0.030
FCC MFT          0.242    0.028
EFCC MFT         0.275    0.024
AFCC MFT         0.272    0.023

• AFCC gets the best results in two datasets.
• AFCC always gets results comparable to or better than those of
  FCC and EFCC.

                                                                      62/83
Table of Contents
1.    Introduction
2.    Web Page Representation and Fuzzy Logic
3.    Adjusting the representation
4.    Test Scenario: Taxonomy Learning
     1.   Hierarchical Clustering
     2.   Two Languages
5. Conclusions & Outlook
6. Publications



                                                63/83
Test Scenario
• To try to build a taxonomy from a set of text documents from
  Wikipedia.




                                                                 64/83
Test Scenario
• Input: Comparable corpora in English and Spanish (documents
  about animals).




                                                                65/83
Test Scenario
• Algorithm: SOM




                   66/83
Test Scenario
• Taxonomic F-measure.
• Labeling process:
  • Infer concept names from majority of child nodes.
  • When more than one node is selected to receive the same label,
    they are merged if they are siblings…
  • …otherwise, the smaller one remains unclassified.
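
The first labeling step, naming a node after the majority of its children, could look like the sketch below. The function name and the example labels are hypothetical, and tie handling is not specified on the slide; here `Counter` keeps the label seen first.

```python
from collections import Counter


def majority_label(child_labels):
    """Name a parent node after the most common label among its children."""
    label, _ = Counter(child_labels).most_common(1)[0]
    return label


print(majority_label(["bird", "bird", "mammal"]))  # bird
```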




                                                                  68/83
Test Scenario
• English results:
  (Figure: Taxonomic F-measure plotted against Dimensions.)

                                                   69/83
Test Scenario
• Spanish results:
  (Figure: Taxonomic F-measure plotted against Dimensions.)

                                                   70/83
Table of Contents
1.   Introduction
2.   Web Page Representation and Fuzzy Logic
3.   Adjusting the representation
4.   Test Scenario: Taxonomy Learning
5.   Conclusions & Outlook
6.   Publications




                                               71/83
Conclusions
• To study a fuzzy model to represent HTML documents for
  clustering:
  • To propose a lightweight dimension reduction method focused on
    the weighting function.
  • To propose alternatives to improve the system (EFCC, addFCC).
  • To explore new criteria to be used (IDF, anchor texts).
  • To compare our results with the previous FRBSs and TF-IDF.

• MFT obtained results comparable to LSI when used with EFCC.
• EFCC improved the system by changing the way in which the
  rules were defined to simplify the system and avoid rules that
  did not work.
• IDF and Anchor Texts did not help improve results.
• EFCC achieved good performance in all datasets.
                                                                     72/83
Conclusions
• To adjust the system to concrete datasets:
  • To analyze the frequency distributions of terms within each
    criterion.
  • To propose a way of tuning the basic parameters of the
    membership functions in an automated way.
  • To evaluate the results compared to previous FRBS and TF-IDF.

• We found different term distributions among datasets: tuning
  the information capture process seems to make sense.
• Cases that do not follow a power law seem to be better
  candidates to improve results by FRBS tuning.
• The tuned system is based on dataset statistics only.
• Tuning the system is a feasible way of improving the
  representation.
                                                                    73/83
Conclusions
• Evaluation of our proposals in a test scenario.
  •   Taxonomy learning problem through hierarchical clustering.
  •   Different algorithm.
  •   Different evaluation method.
  •   Comparable corpora written in English and Spanish.


• Fuzzy logic based alternatives improved TF-IDF in English.
• For Spanish, the results were closer. Probably the stemming
  process affects the behavior of the representation.
• Our results validate the usefulness of FRBSs for representing
  documents in clustering tasks.
                                                                   74/83
Conclusions

• Globally in this thesis:
  • Fuzzy logic showed its appropriateness to be used as a tool to
    declare the knowledge in an easy and understandable way to
    represent web pages.
  • Some contexts where our proposals could achieve good results
    have been identified.




                                                                     75/83
Future Directions
• To study the effect of non-linear scaling factors over the fuzzy
  sets.
• To explore whether partial clustering solutions could be used
  for tuning the system.
• To study new criteria to include in the combination.
• Would it be possible to learn the rule set from examples?
• To apply this kind of fuzzy approach to combine profiles in
  company name filtering on Twitter.



                                                                     76/83
Table of Contents
1.   Introduction
2.   Web Page Representation and Fuzzy Logic
3.   Adjusting the representation
4.   Test Scenario: Taxonomy Learning
5.   Conclusions & Outlook
6.   Publications




                                               78/83
Publications
• Peer-reviewed Conferences (I):
  • Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2008. Web
    Page Clustering Using a Fuzzy Logic Based Representation and Self-
    Organizing Maps. In Proceedings of Web Intelligence 2008, International
    Conference on Web Intelligence and Intelligent Agent Technology
    (IEEE/WIC/ACM). Volume 1, Page(s): 851 - 854. Sydney, Australia.
    Acceptance Rate: 20%
    [8 citations]
  • Mari-Sanna Paukkeri, Alberto Pérez García-Plaza, Sini Pessala, and Timo
    Honkela. 2010. Learning taxonomic relations from a set of text
    documents. In Proceedings of AAIA’10, the 5th International Symposium
    Advances in Artificial Intelligence and Applications. Page(s): 105 - 112.
    Wisla, Poland.
    International Fuzzy Systems Association Award for Young Scientist.
    [2 citations]
                                                                                 79/83
Publications
• Peer-reviewed Conferences (II):
  • Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2012. Fuzzy
    Combinations of Criteria: An Application to Web Page Representation for
    Clustering. In Proceedings of CICLing 2012, the 13th International
    Conference on Intelligent Text Processing and Computational Linguistics.
    Page(s): 157 - 168. New Delhi, India.
    Acceptance Rate: 28.6%
  • Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2012. Fitting
    Document Representation to Specific Datasets by Adjusting Membership
    Functions. In Proceedings of FUZZ-IEEE 2012, the IEEE International
    Conference on Fuzzy Systems. Brisbane, Australia.
    ERA A


                                                                                80/83
Publications
• Journals:
  • Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2009. Una
    Representación Basada en Lógica Borrosa para el Clustering de páginas
    web con Mapas Auto-Organizativos. Procesamiento del Lenguaje
    Natural, vol. 42, Pages 79 - 86.
    FECYT Quality Seal for Scientific Spanish Journals. Spanish Foundation
    for Science and Technology.
  • Mari-Sanna Paukkeri, Alberto Pérez García-Plaza, Víctor Fresno, Raquel
    Martínez and Timo Honkela. 2012. Learning a taxonomy from a set of
    text documents. Applied Soft Computing. Volume 12, Issue 3, Pages
    1138 - 1148, March 2012.
    2011 JCR Impact Factor = 2.612.
    Ranked Q1 in Computer Science, Artificial Intelligence and Computer
    Science, Interdisciplinary Applications.
    [6 citations]
                                                                                81/83
Publications
• Workshops:
  • Agustín D. Delgado Muñoz, Raquel Martínez, Alberto Pérez García-Plaza
    and Víctor Fresno. 2012. Unsupervised Real-Time Company Name
    Disambiguation in Twitter. In Proceedings of the ICWSM-12 Workshop
    on Real-Time Analysis and Mining of Social Streams, 6th International
    AAAI Conference on Weblogs and Social Media. Page(s): 25 - 28. Dublin,
    Ireland.




                                                                             82/83
IF People IS Here AND Talk IS Done THEN
      Slide IS


     Thank You!

                                          83/83

An improved fuzzy system for representing web pages in Clustering Tasks

  • 1. An Improved Fuzzy System for Representing Web Pages in Clustering Tasks PhD Thesis Alberto Pérez García-Plaza UNED – NLP & IR group October 23, 2012 Advisors: Raquel Martínez Unanue Víctor Fresno Fernández
  • 2. Table of Contents 1. Introduction 2. Web Page Representation and Fuzzy Logic 3. Adjusting the representation 4. Test Scenario: Taxonomy Learning 5. Conclusions & Outlook 6. Publications 2/83
  • 3. Table of Contents 1. Introduction 1. Motivation 2. Objectives 2. Web Page Representation and Fuzzy Logic 3. Adjusting the representation 4. Test Scenario: Taxonomy Learning 5. Conclusions & Outlook 6. Publications 3/83
  • 4. Motivation • Document clustering is grouping documents based only in the documents themselves. 1. Introduction 4/83
  • 5. Motivation • Document representation plays a key role in clustering. • Representation comes first. • We focus on Document Representation • Characteristics employed… 1. Introduction • …and the way of using them. 5/83
  • 6. State of the Art • TF-IDF a de facto standard. • Combination of criteria: • Linear approaches. • Algorithm. 1. Introduction • Hyperlinks. • Datasets for evaluation differ from one work to another. 6/83
  • 7. Web Page Example • Our criteria: 1. Introduction 7/83
  • 8. Web Page Example • Our criteria: Title 1. Introduction 8/83
  • 9. Web Page Example • Our criteria: Emphasis 1. Introduction 9/83
  • 10. Web Page Example • Our criteria: Frequency 1. Introduction 10/83
  • 11. Web Page Example • Our criteria: Position (Standard, Preferential) 1. Introduction 11/83
  • 12. Combining Criteria • Linear Combination of Criteria: I k = tkit + ekie + fki f + pkip 1. Introduction • Each criterion is multiplied by a constant. • Constants try to establish the importance of each criterion. 12/83
  • 13. Combining Criteria 1. Introduction 13/83
  • 14. Combining Criteria 1. Introduction Title terms not related to the theme of the document. 14/83
  • 15. Combining Criteria • Need • Related conditions to establish word importance. • Fuzzy Logic because: • Declare the knowledge without specifying the calculation. 1. Introduction • Rules close to natural language (IF - THEN). • Relations among criteria. • Ease the task of expressing heuristic knowledge. • Other kind of systems requires an additional effort to understand how the system works to be able to modify them. 15/83
  • 16. Problem Statement To study and improve a web page1 representation based on 1. Introduction fuzzy logic2 applied to clustering tasks. (1) HTML documents. (2) FCC, Víctor Fresno, PhD Thesis (2006). 16/83
  • 17. Objectives 1. Compare the fuzzy system with TF-IDF as standard method and different dimension reduction methods. 2. Analyze an existing fuzzy combination of criteria (FCC). 3. Assess the possibility of adding new criteria beyond 1. Introduction document contents. 4. Adjust the representation to concrete datasets. 5. Evaluate our proposals in hierarchical clustering. 6. Evaluate our methods in more than one language. 17/83
  • 18. Table of Contents 1. Introduction 2. Web Page Representation and Fuzzy Logic 1. Dimension Reduction 2. Criteria Analysis 3. New Criteria 3. Adjusting the representation 4. Test Scenario: Taxonomy Learning 5. Conclusions & Outlook 6. Publications 18/83
  • 19. Web Page Representation Term Dimension Weighting Reduction Evaluation Clustering 19/83
  • 20. Overview of the Fuzzy System Knowledge base 20/83
  • 21. Overview of the Fuzzy System Knowledge base 21/83
  • 22. Datasets Dataset # Documents # Categories Language Hierarchical Banksearch 9,897 10* English No WebKB 4,518 6 English No SODP 12,148 17 English No WAD 166 4 – 1st level English & Yes 17 – 2nd level Spanish (*) approx. same number of documents within each category. 22/83
  • 23. Basic Clustering Settings • Stop words removal & Stemming (Porter). • Cluto-rbr with default parameters. • Initial Weighting Functions: TF-IDF and FCC. • Dimension Reduction Methods (100, 500, 1000, 2000, 5000 features): DF, LSI, RP, MFT. • F-measure to evaluate clustering quality. 2 × R(i, j)× P(i, j) F(i, j) = R(i, j) + P(i, j) nj F = å × max{F(i, j)} j n j 23/83
  • 24. MFT Reduction Our proposal for dimension reduction: Rank terms within each document 1. Soccer 1. Music 1. Goal 1. Music 2. Goal 2. Show 2. Ball 2. Band 3. Referee 3. Band 3. Soccer 3. Album 4. … 4. … 4. … 4. … Music Soccer Goal Show Band Ball 24/83 …until the desired number of terms is reached.
  • 25. Dimension Reduction Experiments Comparison: MFT Vs. LSI Representation Avg. F S. D. Both methods over TF-IDF and FCC Banksearch TF-IDF MFT 0.748 0.028 TF-IDF LSI 0.756 0.005 FCC MFT 0.756 0.019 FCC LSI 0.769 0.011 WebKB TF-IDF MFT 0.460 0.051 TF-IDF LSI 0.507 0.006 FCC MFT 0.469 0.009 25/83 FCC LSI 0.466 0.011
  • 26. Dimension Reduction Experiments Comparison: MFT Vs. LSI Representation Avg. F S. D. • LSI outperforms MFT. Banksearch • FCC and TF-IDF are not working as well as they could. TF-IDF MFT 0.748 0.028 TF-IDF LSI 0.756 0.005 FCC MFT 0.756 0.019 FCC LSI 0.769 0.011 WebKB TF-IDF MFT 0.460 0.051 TF-IDF LSI 0.507 0.006 FCC MFT 0.469 0.009 26/83 FCC LSI 0.466 0.011
  • 27. Dimension Reduction Experiments Comparison: MFT Vs. LSI Representation Avg. F S. D. • LSI outperforms MFT. Banksearch • FCC and TF-IDF are not working as well as they could. TF-IDF MFT 0.748 0.028 TF-IDF LSI 0.756 0.005 • FCC in WebKB obtains bad results, FCC MFT 0.756 0.019 even with LSI. FCC LSI 0.769 0.011 WebKB TF-IDF MFT 0.460 0.051 TF-IDF LSI 0.507 0.006 FCC MFT 0.469 0.009 27/83 FCC LSI 0.466 0.011
  • 28. Analysis of the Combination Banksearch Rep.Dim. 100 500 1000 2000 5000 FCC MFT 0.723 0.757 0.768 0.765 0.768 title 0.626 0.646 0.632 0.634 0.639 emphasis 0.586 0.671 0.674 0.685 0.693 frequency 0.689 0.715 0.720 0.724 0.731 position 0.310 0.525 0.538 0.599 0.608 The combination always outperforms individual criteria. Frequency seems to the be the best among individual criteria. 28/83
  • 29. Analysis of the Combination Banksearch Rep.Dim. 100 500 1000 2000 5000 FCC MFT 0.723 0.757 0.768 0.765 0.768 title 0.626 0.646 0.632 0.634 0.639 emphasis 0.586 0.671 0.674 0.685 0.693 frequency 0.689 0.715 0.720 0.724 0.731 position 0.310 0.525 0.538 0.599 0.608 The combination always outperforms individual criteria. Frequency seems to the be the best among individual criteria. 29/83
  • 30. Analysis of the Combination Banksearch Rep.Dim. 100 500 1000 2000 5000 FCC MFT 0.723 0.757 0.768 0.765 0.768 title 0.626 0.646 0.632 0.634 0.639 emphasis 0.586 0.671 0.674 0.685 0.693 frequency 0.689 0.715 0.720 0.724 0.731 position 0.310 0.525 0.538 0.599 0.608 The combination always outperforms individual criteria. Frequency seems to the be the best among individual criteria. 30/83
  • 31. Analysis of the Combination WebKB Rep.Dim. 100 500 1000 2000 5000 FCC MFT 0.453 0.472 0.475 0.468 0.475 title 0.432 0.433 0.404 0.488 0.479 emphasis 0.415 0.431 0.433 0.465 0.489 frequency 0.441 0.460 0.460 0.468 0.446 position 0.301 0.283 0.317 0.281 0.286 The combination does not always outperform the others. Frequency is not always the best among individual criteria. When title and emphasis could lead to a better clustering, the combination get 31/83 worse.
  • 32. Analysis of the Combination WebKB Rep.Dim. 100 500 1000 2000 5000 FCC MFT 0.453 0.472 0.475 0.468 0.475 title 0.432 0.433 0.404 0.488 0.479 emphasis 0.415 0.431 0.433 0.465 0.489 frequency 0.441 0.460 0.460 0.468 0.446 position 0.301 0.283 0.317 0.281 0.286 The combination does not always outperform the others. Frequency is not always the best among individual criteria. When title and emphasis could lead to a better clustering, the combination get 32/83 worse.
  • 33. Analysis of the Combination WebKB Rep.Dim. 100 500 1000 2000 5000 FCC MFT 0.453 0.472 0.475 0.468 0.475 title 0.432 0.433 0.404 0.488 0.479 emphasis 0.415 0.431 0.433 0.465 0.489 frequency 0.441 0.460 0.460 0.468 0.446 position 0.301 0.283 0.317 0.281 0.286 The combination does not always outperform the others. Frequency is not always the best among individual criteria. When title and emphasis could lead to a better clustering, the combination get 33/83 worse.
  • 34. Analysis of the Combination WebKB Rep.Dim. 100 500 1000 2000 5000 FCC MFT 0.453 0.472 0.475 0.468 0.475 title 0.432 0.433 0.404 0.488 0.479 emphasis 0.415 0.431 0.433 0.465 0.489 frequency 0.441 0.460 0.460 0.468 0.446 position 0.301 0.283 0.317 0.281 0.286 The combination does not always outperform the others. Frequency is not always the best among individual criteria. When title and emphasis could lead to a better clustering, the combination 34/83 gets worse.
  • 35. Analysis of the Combination • Position is considered more decisive than others. • But position empirically got the worst results. • Its heuristics are based on written texts and not in web pages. • Sample rule: • IF title IS low AND frequency IS medium AND emphasis IS high AND position IS preferential THEN importance IS very high 35/83
  • 36. EFCC Rule Base 36/83
  • 37. EFCC Rule Base 37/83 Title, Emphasis and Position + Frequency
  • 38. EFCC Experiments Comparison: EFCC Vs. FCC & TF-IDF Representation Avg. F S. D. Banksearch TF-IDF LSI 0.756 0.005 FCC LSI 0.769 0.011 EFCC MFT 0.760 0.014 EFCC LSI 0.758 0.013 WebKB TF-IDF LSI 0.507 0.006 FCC LSI 0.469 0.011 EFCC MFT 0.532 0.032 EFCC LSI 0.483 0.000 38/83
  • 39. EFCC Experiments Comparison: EFCC Vs. FCC & TF-IDF Representation Avg. F S. D. Banksearch • Banksearch: with EFCC both TF-IDF LSI 0.756 0.005 reductions get similar results. FCC LSI 0.769 0.011 EFCC MFT 0.760 0.014 EFCC LSI 0.758 0.013 WebKB TF-IDF LSI 0.507 0.006 FCC LSI 0.469 0.011 EFCC MFT 0.532 0.032 EFCC LSI 0.483 0.000 39/83
  • 40. EFCC Experiments Comparison: EFCC Vs. FCC & TF-IDF Representation Avg. F S. D. Banksearch • Banksearch: with EFCC both TF-IDF LSI 0.756 0.005 reductions get similar results. FCC LSI 0.769 0.011 • WebKB: EFCC seems to solve the EFCC MFT 0.760 0.014 problems of FCC. EFCC LSI 0.758 0.013 WebKB TF-IDF LSI 0.507 0.006 FCC LSI 0.469 0.011 EFCC MFT 0.532 0.032 EFCC LSI 0.483 0.000 40/83
  • 41. EFCC Experiments Comparison: EFCC Vs. FCC & TF-IDF Representation Avg. F S. D. Banksearch • Banksearch: with EFCC both TF-IDF LSI 0.756 0.005 reductions get similar results. FCC LSI 0.769 0.011 • WebKB: EFCC seems to solve the EFCC MFT 0.760 0.014 problems of FCC. EFCC LSI 0.758 0.013 • EFCC with MFT seems to be a good WebKB alternative to TF-IDF with LSI. TF-IDF LSI 0.507 0.006 FCC LSI 0.469 0.011 MFT is cheaper than LSI EFCC MFT 0.532 0.032 EFCC LSI 0.483 0.000 41/83
  • 42. Criteria Beyond the Document • Add IDF to EFCC: EFCC - IDF(t, d, D) = EFCC(t, d)× IDF(t, D) Comparison: EFCC Vs. EFCC-IDF Representation Avg. F S. D. • EFCC-IDF does not work in WebKB. Banksearch • IDF strongly affects EFCC. EFCC MFT 0.760 0.014 EFCC-IDF MFT 0.749 0.129 • WebKB unbalanced categories. WebKB EFCC MFT 0.532 0.032 EFCC-IDF MFT 0.350 0.070 42/83
  • 43. Criteria Beyond the Document
    • Add information from anchor texts:
      • We collect up to 300 unique inlinks for each SODP page (~1M).
      • Two experiments:
        (a) Anchors as plain text.
        (b) Anchors as titles.
      • Three alternatives for each experiment:
        (1) Just adding anchors.
        (2) Removing outlinks.
        (3) Removing stopwords.
  • 44. Criteria Beyond the Document
    EFCC vs. EFCC + Anchor texts
    Dataset  Representation  Avg. F  S. D.
    SODP     FCC MFT         0.242   0.028
    SODP     EFCC MFT        0.275   0.025
    SODP     EFCC a-1 MFT    0.268   0.027
    SODP     EFCC a-2 MFT    0.267   0.024
    SODP     EFCC a-3 MFT    0.276   0.022
    SODP     EFCC b-1 MFT    0.277   0.015
    SODP     EFCC b-2 MFT    0.270   0.016
    SODP     EFCC b-3 MFT    0.267   0.012
    (a) Anchors as plain text. (b) Anchors as titles.
    (1) Just adding anchors. (2) Removing outlinks. (3) Removing stopwords.
  • 46. Criteria Beyond the Document
    EFCC vs. EFCC + Anchor texts
    Dataset  Representation  Avg. F  S. D.
    SODP     FCC MFT         0.242   0.028
    SODP     EFCC MFT        0.275   0.025
    SODP     EFCC a-1 MFT    0.268   0.027
    SODP     EFCC a-2 MFT    0.267   0.024
    SODP     EFCC a-3 MFT    0.276   0.022
    SODP     EFCC b-1 MFT    0.277   0.015
    SODP     EFCC b-2 MFT    0.270   0.016
    SODP     EFCC b-3 MFT    0.267   0.012
    • The best case using anchor texts gets results similar to EFCC MFT.
    • Computational cost.
    • FCC also performs worse than EFCC in this collection.
  • 47. Table of Contents
    1. Introduction
    2. Web Page Representation and Fuzzy Logic
    3. Adjusting the representation
       1. Analyze Data Distributions
       2. Tune Membership Functions
    4. Test Scenario: Taxonomy Learning
    5. Conclusions & Outlook
    6. Publications
  • 48. Adjusting the Representation
    • Some dataset characteristics could influence the way the Fuzzy Rule Based System is defined.
    • Document information is captured by means of membership functions.
    • Should these functions be modified depending on the dataset?
    [Figure: example of membership functions associated with the Frequency linguistic variable.]
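As an illustration of how membership functions capture an input value, here is a minimal sketch. It assumes triangular sets placed symmetrically over a frequency normalized to [0, 1]; the actual shapes and breakpoints are not given in the slide text, so `FREQUENCY_SETS` and both function names are hypothetical.

```python
def triangular(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical symmetric sets for the Frequency linguistic variable.
FREQUENCY_SETS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}


def fuzzify(x, sets=FREQUENCY_SETS):
    """Map a crisp value to its membership degree in every fuzzy set."""
    return {label: triangular(x, *abc) for label, abc in sets.items()}


memberships = fuzzify(0.25)
```

A value of 0.25 belongs partly to "low" and partly to "medium" here; changing the breakpoints changes how the same input is interpreted, which is the question the slide raises.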
  • 49. Adjusting the Representation
    • The inputs are frequency values for the different criteria.
    [Figure: Frequency]
  • 50. Adjusting the Representation
    • The inputs are frequency values for the different criteria.
    [Figure: Emphasis]
  • 51. Adjusting the Representation
    • The inputs are frequency values for the different criteria.
    [Figure: Titles]
  • 52. Adjusting the Representation
    • With the original fuzzy sets, long tails could lead to considering only the maximum values as High.
    • Low values are compressed at the left side.
    • We believe that High or Low should be relative values.
    • High or Low should depend on the distribution.
    • Symmetrical sets are appropriate for uniformly distributed values.
    • Input data patterns are not always the same, so the input capture process should not always be the same.
  • 55. Adjusting the Representation
    • Guarantee at least 1 value for each interval: 1/5
    • Equidistant percentiles for the rest of the intervals.
  • 59. Adjusting the Representation
    • Titles have a small number of possible values.
    • We try to establish the sets so that each interval contains at least one value whenever possible.
    • We use the lowest value of the distribution for the low set and divide the rest into equidistant percentiles.
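The tuning idea described above (lowest value for the low set, equidistant percentiles for the rest) can be sketched as follows. This is only an illustration under stated assumptions: a nearest-rank percentile is used, three sets are assumed, and the way the resulting breakpoints are turned into set shapes is not specified in the slides. The function names and the sample data are hypothetical.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile (p in (0, 100]) of an ascending list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def tuned_breakpoints(values, n_sets=3):
    """Lowest value anchors the first set; the rest use equidistant percentiles."""
    ordered = sorted(values)
    step = 100 / (n_sets - 1)
    points = [ordered[0]]
    points += [percentile(ordered, step * i) for i in range(1, n_sets)]
    return points


# Hypothetical long-tailed frequencies: most terms are rare, one is very frequent.
freqs = [1, 1, 1, 1, 2, 2, 3, 5, 8, 40]
points = tuned_breakpoints(freqs)
```

For this sample the middle breakpoint falls at 2, where the data actually lie. Symmetric sets over the range 1 to 40 would instead centre the middle set near 20, a region with no observed values.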
  • 62. Adjusting the Representation
    AFCC vs. EFCC, FCC & TF-IDF
    Dataset     Representation  Avg. F  S. D.
    Banksearch  TF-IDF MFT      0.748   0.028
    Banksearch  FCC MFT         0.756   0.019
    Banksearch  EFCC MFT        0.760   0.014
    Banksearch  AFCC MFT        0.770   0.016
    WebKB       TF-IDF MFT      0.460   0.051
    WebKB       FCC MFT         0.469   0.009
    WebKB       EFCC MFT        0.532   0.032
    WebKB       AFCC MFT        0.565   0.025
    SODP        TF-IDF MFT      0.293   0.030
    SODP        FCC MFT         0.242   0.028
    SODP        EFCC MFT        0.275   0.024
    SODP        AFCC MFT        0.272   0.023
    • AFCC gets the best results in two datasets.
    • AFCC always gets results comparable to or better than those of FCC and EFCC.
  • 63. Table of Contents
    1. Introduction
    2. Web Page Representation and Fuzzy Logic
    3. Adjusting the representation
    4. Test Scenario: Taxonomy Learning
       1. Hierarchical Clustering
       2. Two Languages
    5. Conclusions & Outlook
    6. Publications
  • 64. Test Scenario
    • To try to build a taxonomy from a set of text documents from Wikipedia.
  • 65. Test Scenario
    • Input: comparable corpora in English and Spanish (documents about animals).
  • 68. Test Scenario
    • Taxonomic F-measure.
    • Labeling process:
      • Infer concept names from the majority of child nodes.
      • When more than one node is selected to receive the same label, the nodes are merged if they are siblings;
      • otherwise, the smaller one remains unclassified.
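The labeling process above can be sketched roughly as follows. This is one possible reading, not the method used in the experiments: the node fields (`label`, `parent`, `size`) and both function names are hypothetical, and only the case of two nodes competing for one label is handled.

```python
from collections import Counter


def majority_label(child_labels):
    """Most common label among child nodes; ties go to the first label seen."""
    return Counter(child_labels).most_common(1)[0][0]


def resolve_pair(a, b):
    """Resolve two nodes that were assigned the same label.

    Siblings are merged into one node; otherwise the smaller node
    loses the label and remains unclassified (label None).
    """
    if a["parent"] == b["parent"]:
        merged_size = a["size"] + b["size"]
        return [{"label": a["label"], "parent": a["parent"], "size": merged_size}]
    big, small = (a, b) if a["size"] >= b["size"] else (b, a)
    return [big, {**small, "label": None}]


label = majority_label(["bird", "bird", "fish"])

siblings = resolve_pair(
    {"label": "bird", "parent": "animal", "size": 12},
    {"label": "bird", "parent": "animal", "size": 5},
)
strangers = resolve_pair(
    {"label": "bird", "parent": "animal", "size": 12},
    {"label": "bird", "parent": "plant", "size": 5},
)
```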
  • 69. Test Scenario
    • English results.
    [Figure: Taxonomic F-measure vs. Dimensions]
  • 70. Test Scenario
    • Spanish results.
    [Figure: Taxonomic F-measure vs. Dimensions]
  • 71. Table of Contents
    1. Introduction
    2. Web Page Representation and Fuzzy Logic
    3. Adjusting the representation
    4. Test Scenario: Taxonomy Learning
    5. Conclusions & Outlook
    6. Publications
  • 72. Conclusions
    • To study a fuzzy model to represent HTML documents for clustering:
      • To propose a lightweight dimension reduction method focused on the weighting function.
      • To propose alternatives to improve the system (EFCC, addFCC).
      • To explore new criteria to be used (IDF, anchor texts).
      • To compare our results with the previous FRBSs and TF-IDF.
    • MFT obtained results comparable to LSI when used with EFCC.
    • EFCC improved the system by changing the way in which the rules were defined, simplifying the system and avoiding rules that did not work.
    • IDF and anchor texts did not help to improve the results.
    • EFCC achieved good performance in all datasets.
  • 73. Conclusions
    • To adjust the system to concrete datasets:
      • To analyze the frequency distributions of terms within each criterion.
      • To propose a way of tuning the basic parameters of the membership functions in an automated way.
      • To evaluate the results compared to previous FRBS and TF-IDF.
    • We found different term distributions among datasets: tuning the information capture process seems to make sense.
    • Cases that do not follow a power law seem to be better candidates for improving results by FRBS tuning.
    • The tuned system is based on dataset statistics only.
    • Tuning the system is a feasible way of improving the representation.
  • 74. Conclusions
    • Evaluation of our proposals in a test scenario:
      • Taxonomy learning problem through hierarchical clustering.
      • Different algorithm.
      • Different evaluation method.
      • Comparable corpora written in English and Spanish.
    • Fuzzy logic based alternatives improved on TF-IDF in English.
    • For Spanish, the results were closer. The stemming process probably affects the behavior of the representation.
    • Our results validate the usefulness of FRBSs for representing documents in clustering tasks.
  • 75. Conclusions
    • Globally in this thesis:
      • Fuzzy logic proved appropriate as a tool for declaring knowledge in an easy and understandable way to represent web pages.
      • Some contexts where our proposals could achieve good results have been identified.
  • 76. Future Directions
    • To study the effect of non-linear scaling factors over the fuzzy sets.
    • To explore whether partial clustering solutions could be used for tuning the system.
    • To study new criteria to include in the combination.
    • Would it be possible to learn the rule set from examples?
    • To apply this kind of fuzzy approach to combine profiles in company name filtering on Twitter.
  • 78. Table of Contents
    1. Introduction
    2. Web Page Representation and Fuzzy Logic
    3. Adjusting the representation
    4. Test Scenario: Taxonomy Learning
    5. Conclusions & Outlook
    6. Publications
  • 79. Publications
    • Peer-reviewed Conferences (I):
      • Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2008. Web Page Clustering Using a Fuzzy Logic Based Representation and Self-Organizing Maps. In Proceedings of Web Intelligence 2008, International Conference on Web Intelligence and Intelligent Agent Technology (IEEE/WIC/ACM). Volume 1, Page(s): 851 - 854. Sydney, Australia. Acceptance Rate: 20%. [8 citations]
      • Mari-Sanna Paukkeri, Alberto Pérez García-Plaza, Sini Pessala, and Timo Honkela. 2010. Learning taxonomic relations from a set of text documents. In Proceedings of AAIA’10, the 5th International Symposium Advances in Artificial Intelligence and Applications. Page(s): 105 - 112. Wisla, Poland. International Fuzzy Systems Association Award for Young Scientist. [2 citations]
  • 80. Publications
    • Peer-reviewed Conferences (II):
      • Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2012. Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering. In Proceedings of CICLing 2012, the 13th International Conference on Intelligent Text Processing and Computational Linguistics. Page(s): 157 - 168. New Delhi, India. Acceptance Rate: 28.6%.
      • Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2012. Fitting Document Representation to Specific Datasets by Adjusting Membership Functions. In Proceedings of FUZZ-IEEE 2012, the IEEE International Conference on Fuzzy Systems. Brisbane, Australia. ERA A.
  • 81. Publications
    • Journals:
      • Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2009. Una Representación Basada en Lógica Borrosa para el Clustering de páginas web con Mapas Auto-Organizativos. Procesamiento del Lenguaje Natural, vol. 42, Pages 79 - 86. FECYT Quality Seal for Scientific Spanish Journals. Spanish Foundation for Science and Technology.
      • Mari-Sanna Paukkeri, Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez and Timo Honkela. 2012. Learning a taxonomy from a set of text documents. Applied Soft Computing. Volume 12, Issue 3, Pages 1138 - 1148, March 2012. 2011 JCR Impact Factor = 2.612. Ranked Q1 in Computer Science, Artificial Intelligence and Computer Science, Interdisciplinary Applications. [6 citations]
  • 82. Publications
    • Workshops:
      • Agustín D. Delgado Muñoz, Raquel Martínez, Alberto Pérez García-Plaza and Víctor Fresno. 2012. Unsupervised Real-Time Company Name Disambiguation in Twitter. In Proceedings of the ICWSM-12 Workshop on Real-Time Analysis and Mining of Social Streams, 6th International AAAI Conference on Weblogs and Social Media. Page(s): 25 - 28. Dublin, Ireland.
  • 83. IF People IS Here AND Talk IS Done THEN Slide IS Thank You! 83/83