Clustering of Similar Values, in Spanish,
for the Improvement of Search Systems
Sergio Luján-Mora & Manuel Palomar
Published in:
Proceedings International Joint Conference, 7th Ibero-American Conference, 15th Brazilian Symposium on AI, IBERAMIA-SBIA 2000, Open Discussion Track Proceedings, p. 217-226, Atibaia - Sao Paulo (Brasil), November 19-22 2000. ISBN: 85-87837-03-6.

Download:
http://gplsi.dlsi.ua.es/almacenes/ver.php?pdf=1

  • Speaker notes, by slide number

    Slides 1 and 2: Perhaps we should begin. Good morning everyone. Thanks for coming. I am a member of the Department of Software and Computing Systems at the University of Alicante in Spain. The work I am going to present is Clustering of Similar Values, in Spanish, for the Improvement of Search Systems.

    Slide 3: First of all, I would like to outline the main points of my talk. I have structured my talk into six sections. Firstly, I will give an introduction to my research and make a few observations about the inconsistency problem. Then I will go on to explain the main causes of the problem. Next, I will talk about our proposed method for reducing inconsistency in databases. And then I will present our clustering algorithm and the distance metrics it uses. Finally, I will highlight the main results of our method and the conclusions of our research. Let's move on to the first part of the presentation.

    Slide 4: Existing information systems provide rapid and precise access to information stored in databases. One of the main uses of databases is to find information. If a database is badly designed, it is very likely to suffer from inconsistency: the very same term may have different values, due to misspelling, a permuted word order, spelling variants and so on.

    Slide 5: For example, if we consult a database that stores information about university research, we may easily find different values for the same university:
    - Universidad Alicante (without the preposition)
    - Unibersidad de Alicante (with a misspelling)
    - Universitat d'Alacant (in Catalan)
    - and even University of Alicante (in English)
    If a database suffers from inconsistency, a search using a given value will not return all the available information about the term.

    Slide 6: The problem of inconsistency in the values stored in databases may have two origins:
    - Different people may insert the same term with different values in a database.
    - When we try to integrate different databases, they may use different values to represent the same term.

    Slide 7: We present an automatic method for reducing the inconsistency found in existing databases and thus improving data quality. All the values that refer to the same term are clustered by measuring their degree of similarity. The clustered values can be assigned to a common value which, in principle, could be substituted for the original values.

    Slide 8: After analysing several databases with information in Spanish, we have noticed that the different values that appear for a given term are due to a combination of the following causes.

    Slide 9:
    - The omission or inclusion of the written accent.
    - The use of lower-case and upper-case letters.

    Slide 10:
    - The use of abbreviations and acronyms.
    - Different word order.

    Slide 11:
    - Different denominations.
    - Punctuation marks: hyphens, commas, semi-colons, brackets, exclamation marks and so on.

    Slide 12:
    - Errors: misspelling, typing or printing errors.
    - Use of different languages.

    Slide 13: The method we propose can be divided into six steps.

    Slide 14:
    1. Preparation. It may be necessary to prepare the strings before applying the clustering algorithm.
    2. Reading. The following process is repeated for each of the strings contained in the input file: read a string, expand abbreviations and acronyms, remove accents, shift the string to lower case, store the string.
    3. Sorting. The strings are sorted, in descending order, by frequency of appearance.

    Slide 15: The standard method of detecting exact duplicates in a table is to sort the table and then check whether neighbouring tuples are identical. This approach can be extended to detect approximate duplicates.

    Slide 16: The similarity between two strings must be evaluated. We use the edit distance and the invariant distance from word position. The edit distance of two strings is a measure of similarity given by the minimal number of simple edit operations needed to transform one string into the other. The simple edit operations considered are the insertion of a character, the deletion of a character and the substitution of one character with another.

    Slide 17: These distances speed up the clustering. The expensive computation of LD and IDWP can be avoided.

    Slide 21: We have used two files for evaluating our method. They contain data from two databases with inconsistency problems. We have developed a coefficient named the consistency index that permits the evaluation of the complexity of a cluster: the greater the value of the coefficient, the more different the strings that form the cluster are. The file consistency index is defined as the average of the consistency indexes of all the existing clusters in the file.

    Slide 22: As we can see, the clusters of file B are more complex than those of file A. In both files, the FCI is reduced when the abbreviations are expanded.

    Slide 23: We have evaluated the clusters obtained by using four measures that are obtained by comparing the clusters produced with the optimal (handcrafted) clusters.

    Slide 24: From the previous measures, we obtain Precision, NCC divided by ONC, and Error, NIC divided by ONC.

    Slide 25: In both files, the expansion of abbreviations produces improvements: it increases the precision and reduces the error. For file A, a maximum precision of 70.7% and 84.8% is obtained without and with expansion of abbreviations, respectively. For file B, a maximum precision of 67.4% and 72.8% is obtained without and with expansion of abbreviations, respectively.

    Slide 27: This method achieves good results in all the experiments done, but it does not eliminate the need to review the clusters obtained. The expansion of abbreviations improves the results.
  • Clustering of Similar Values, in Spanish, for the Improvement of Search Systems

    1. Clustering of Similar Values, in Spanish, for the Improvement of Search Systems
       Sergio Luján-Mora & Manuel Palomar (sergio.lujan@ua.es / @sergiolujanmora)
       Department of Software and Computing Systems, University of Alicante, Spain
       Published in: Proceedings International Joint Conference, 7th Ibero-American Conference, 15th Brazilian Symposium on AI, IBERAMIA-SBIA 2000, Open Discussion Track Proceedings, p. 217-226, Atibaia - Sao Paulo (Brasil), November 19-22 2000. ISBN: 85-87837-03-6.
       Download: http://gplsi.dlsi.ua.es/almacenes/ver.php?pdf=1
    2. Clustering of Similar Values, in Spanish, for the Improvement of Search Systems
       Sergio Luján-Mora & Manuel Palomar
       Department of Software and Computing Systems, University of Alicante, Spain
       Dpto. de Lenguajes y Sistemas Informáticos, Universidad de Alicante (España)
    3. Contents • Introduction • Taxonomy of different values • The solution • The clustering algorithm • Results • Conclusions
    4. Introduction
       • Information systems → Rapid and precise access
       • Databases → Find information
       • Inconsistency: a term represented by different values
    5. Introduction
       • Term
         – Universidad de Alicante
       • Different values found in databases:
         – Universidad Alicante
         – Unibersidad de Alicante
         – Universitat d’Alacant
         – University of Alicante
    6. Introduction
       • The problem:
         – Data redundancy → Inconsistency
         – Integration of different databases into a common repository (e.g. data warehouses):
           • different criteria → data redundancy → inconsistency
    7. Introduction
       • We use clustering within an automatic method for reducing inconsistency
         1. Values that refer to the same term are clustered
         2. All values are replaced by the cluster sample
    8. Contents • Introduction • Taxonomy of different values • The solution • The clustering algorithm • Results • Conclusions
    9. Taxonomy of different values
       • Omission or inclusion of the written accent:
         – Asociación Astronómica
         – Asociacion Astronomica
       • Lower-case / upper-case:
         – Departamento de Lenguajes y Sistemas
         – Departamento de lenguajes y sistemas
    10. Taxonomy of different values
       • Abbreviations and acronyms:
         – Dpto. de Derecho Civil
         – Departamento de Derecho Civil
       • Word order:
         – Miguel de Cervantes Saavedra
         – Cervantes Saavedra, Miguel de
    11. Taxonomy of different values
       • Different denominations:
         – Unidad de Registro Sismológico
         – Unidad de Registro Sísmico
       • Punctuation marks:
         – Laboratorio Multimedia (mmlab)
         – Laboratorio Multimedia - mmlab
    12. Taxonomy of different values
       • Errors (misspelling, typing or printing errors):
         – Gabinete de imagen
         – Gavinete de imagen
       • Different languages:
         – Universidad de Alicante
         – University of Alicante
    13. Contents • Introduction • Taxonomy of different values • The solution • The clustering algorithm • Results • Conclusions
    14. The solution
       1. Preparation
       2. Reading
       3. Sorting
       4. Clustering (main step)
       5. Checking
       6. Updating
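
The Reading and Sorting steps, as the speaker notes describe them (expand abbreviations and acronyms, remove accents, shift to lower case, then sort by descending frequency), could be sketched roughly as follows. This is only an illustration: the function names and the one-entry abbreviation table are hypothetical, and stripping only the acute accent (so that ñ is kept) is an assumption about what "remove accents" means.

```python
import unicodedata
from collections import Counter

# Hypothetical abbreviation table; the real one is not given in the slides.
ABBREVIATIONS = {"dpto.": "departamento"}


def normalize(value, abbreviations=ABBREVIATIONS):
    """Expand abbreviations, remove written accents, shift to lower case."""
    words = [abbreviations.get(w.lower(), w) for w in value.split()]
    decomposed = unicodedata.normalize("NFD", " ".join(words))
    # Assumption: drop only the combining acute accent (U+0301) so ñ survives.
    without_accents = decomposed.replace("\u0301", "")
    return unicodedata.normalize("NFC", without_accents).lower()


def read_and_sort(values):
    """Normalize every value, then sort by descending frequency of appearance."""
    counts = Counter(normalize(v) for v in values)
    return [value for value, _ in counts.most_common()]


if __name__ == "__main__":
    raw = ["Universidad Alicante", "Universidad de Alicante", "universidad de alicante"]
    print(read_and_sort(raw))
```

Normalizing before counting means that values differing only in case or accents already collapse into one string, so the clustering step has fewer candidates to compare.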
    15. Contents • Introduction • Taxonomy of different values • The solution • The clustering algorithm • Results • Conclusions
    16. The clustering algorithm
       • Similarity:
         – Edit distance or Levenshtein distance (LD)
         – Invariant distance from word position (IDWP)
           Universidad de Alicante
           Alicante, Universidad de
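
A minimal sketch of the two similarity measures may help here. The edit distance below is the standard dynamic-programming formulation with insertion, deletion and substitution. The slides do not define IDWP, so the version here is an assumption: words are extracted ignoring punctuation, and the result is the smallest total edit distance over all pairings of the words of the two strings (brute force, so only suitable for short strings).

```python
import re
from itertools import permutations


def levenshtein(a, b):
    """Minimal number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def idwp(a, b):
    """Assumed IDWP: best word-to-word pairing, whatever the word order."""
    wa, wb = re.findall(r"\w+", a), re.findall(r"\w+", b)
    n = max(len(wa), len(wb))
    wa += [""] * (n - len(wa))  # pad so unmatched words cost their length
    wb += [""] * (n - len(wb))
    return min(sum(levenshtein(x, y) for x, y in zip(wa, p))
               for p in permutations(wb))


if __name__ == "__main__":
    print(levenshtein("gabinete de imagen", "gavinete de imagen"))
    print(idwp("universidad de alicante", "alicante, universidad de"))
```

With this reading, the pair shown on the slide (Universidad de Alicante / Alicante, Universidad de) has a large plain edit distance but a word-position-invariant distance of zero.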
    17. The clustering algorithm
       • Filtering:
         – Length distance (LEND)
         – Transposition-invariant distance (TID)
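
The slides name the two filters without defining them, so the following is a sketch under stated assumptions: LEND is taken to be the absolute difference in length, and TID is taken to be a distance on character counts that ignores character order (a "bag" distance). Both are cheap to compute and can never exceed the edit distance, which is why they can be used to skip the expensive LD and IDWP computations.

```python
from collections import Counter


def lend(a, b):
    """Assumed length distance: absolute difference of the string lengths."""
    return abs(len(a) - len(b))


def tid(a, b):
    """Assumed transposition-invariant distance on character counts.

    Characters left over on either side after cancelling common ones;
    reordering the characters of a string does not change the result.
    """
    ca, cb = Counter(a), Counter(b)
    return max(sum((ca - cb).values()), sum((cb - ca).values()))


if __name__ == "__main__":
    print(lend("dpto", "departamento"))
    print(tid("gabinete", "gavinete"))
```

If either filter already exceeds the threshold, the pair cannot be within the threshold for the edit distance either, so it can be discarded without further work.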
    18. The clustering algorithm
       Input: C: Sorted strings in descending order by frequency (c1…cm)
       Output: G: Set of clusters (g1…gn)
       STEPS
       1. Select ci, the first string in C, and insert it into the new cluster gk
       2. Remove ci from C
    19. The clustering algorithm
       3. For each string cj in C
            If LEND(ci, cj) < α LEND(ci, cj) then
              If TID(ci, cj) < α TID(ci, cj) then
                If LD(ci, cj) < α LD(ci, cj) then
                  Insert cj into cluster gk
                  Remove cj from C
                Else If IDWP(ci, cj) < α IDWP(ci, cj) then
                  Insert cj into cluster gk
                  Remove cj from C
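
Steps 1 to 3 can be sketched in Python as below. Several things here are assumptions rather than part of the slides: a single threshold `alpha` is applied to raw (unnormalized) distances, the steps are repeated until C is empty, and the distance functions are passed in as parameters. The demo uses only a length filter and the edit distance to stay short.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def cluster(sorted_strings, filters, measures, alpha):
    """Group strings around the most frequent remaining one.

    filters:  cheap distances (e.g. LEND, TID); all must be below alpha.
    measures: expensive distances (e.g. LD, then IDWP); one below alpha suffices.
    """
    remaining = list(sorted_strings)
    clusters = []
    while remaining:                  # assumption: repeat until C is empty
        ci = remaining.pop(0)         # steps 1 and 2: select ci, remove it from C
        gk, rest = [ci], []
        for cj in remaining:          # step 3
            passes_filters = all(f(ci, cj) < alpha for f in filters)
            # any() evaluates lazily, so IDWP would run only if LD fails.
            if passes_filters and any(m(ci, cj) < alpha for m in measures):
                gk.append(cj)
            else:
                rest.append(cj)
        remaining = rest
        clusters.append(gk)
    return clusters


if __name__ == "__main__":
    strings = ["gabinete de imagen", "gavinete de imagen",
               "unidad de registro sismico"]
    length_filter = lambda a, b: abs(len(a) - len(b))
    print(cluster(strings, [length_filter], [levenshtein], alpha=3))
```

Because the input is sorted by frequency, the first string of each cluster is its most frequent value, which makes it a natural candidate for the cluster sample that later replaces the other values.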
    20. Contents • Introduction • Taxonomy of different values • The solution • The clustering algorithm • Results • Conclusions
    21. Results
       Indexes for measuring the cluster complexity
       • CI: Consistency Index
         $CI = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} LD(x_i, x_j)}{\sum_{i=1}^{n} x_i}$
       • FCI: File Consistency Index
         $FCI = \frac{\sum_{i=1}^{m} CI_i}{m}$
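
As a worked illustration of the two indexes, take a cluster with the two values from the taxonomy, "Gabinete de imagen" and "Gavinete de imagen" (18 characters each, edit distance 1). The computation below assumes that the denominator of CI sums the lengths of the strings in the cluster, written $|x_i|$ here, and that the double sum runs over all ordered pairs, including each string with itself.

```latex
\[
\sum_{i=1}^{2}\sum_{j=1}^{2} LD(x_i, x_j) = 0 + 1 + 1 + 0 = 2,
\qquad
\sum_{i=1}^{2} |x_i| = 18 + 18 = 36,
\qquad
CI = \frac{2}{36} \approx 0.056
\]
\[
\text{With a second cluster holding a single string } (CI_2 = 0):\quad
FCI = \frac{0.056 + 0}{2} \approx 0.028
\]
```

Under this reading, a cluster of identical strings has CI = 0, and the index grows as the strings in the cluster diverge.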
    22. Results
       • File A
         – Without expansion of abbreviations
           • FCI: 0.31
         – With expansion of abbreviations
           • FCI: 0.12
       • File B
         – Without expansion of abbreviations
           • FCI: 1.72
         – With expansion of abbreviations
           • FCI: 1.11
    23. Results
       • Evaluation measures:
         – ONC: optimal number of clusters
         – NC: number of clusters generated
         – NCC: number of completely correct clusters
         – NIC: number of incorrect clusters
         – NES: number of erroneous strings
    24. Results
       • Precision: NCC / ONC
       • Error: NIC / ONC
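
The two ratios are simple to compute once the clusters have been compared with the handcrafted ones. The sketch below uses made-up counts purely for illustration; they are not the counts behind the reported results, and the function name is hypothetical.

```python
def precision_and_error(onc, ncc, nic):
    """Return (precision, error) as fractions of the optimal number of clusters.

    onc: optimal number of clusters (handcrafted)
    ncc: number of completely correct clusters
    nic: number of incorrect clusters
    """
    if onc <= 0:
        raise ValueError("onc must be positive")
    return ncc / onc, nic / onc


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    precision, error = precision_and_error(onc=50, ncc=40, nic=5)
    print(f"Precision: {precision:.1%}, Error: {error:.1%}")
```

Precision and error do not have to add up to 100%, since both are measured against ONC while the clusters themselves are counted among the NC clusters actually generated.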
    25. Results
       • File A
         – Without expansion of abbreviations
           • Precision: 70.7%
           • Error: 7.6%
         – With expansion of abbreviations
           • Precision: 84.8%
           • Error: 0%
       • File B
         – Without expansion of abbreviations
           • Precision: 67.4%
           • Error: 8.7%
         – With expansion of abbreviations
           • Precision: 72.8%
           • Error: 6.5%
    26. Contents • Introduction • Taxonomy of different values • The solution • The clustering algorithm • Results • Conclusions
    27. Conclusions
       • Achieves good results: improves data quality
       • Review obtained clusters
       • Expansion of abbreviations
       • Parameters
