Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Clustering of Similar Values, in Spanish,
for the Improvement of Search Systems
Sergio Luján-Mora & Manuel Palomar
(sergio...
Clustering of Similar Values,
in Spanish,
for the Improvement of
Search Systems
Sergio Luján-Mora & Manuel Palomar
Departm...
Contents
• Introduction
• Taxonomy of different values
• The solution
• The clustering algorithm
• Results
• Conclusions
D...
Introduction
• Information systems  Rapid and
precise access
• Databases  Find information
• Inconsistency: a term repre...
Introduction
• Term
– Universidad de Alicante

• Different values found in databases:
– Universidad Alicante
– Unibersidad...
Introduction
• The problem:
– Data redundancy  Inconsistency
– Integration of different databases into a
common repositor...
Introduction
• We use clustering within an automatic
method for reducing on inconsistency
1. Values that refer to a same t...
Contents
• Introduction
• Taxonomy of different values
• The solution
• The clustering algorithm
• Results
• Conclusions
D...
Taxonomy of different values
• Omission or inclusion of the written
accent:
Asociación Astronómica
Asociacion Astronomica
...
Taxonomy of different values
• Abbreviations and acronyms:
Dpto. de Derecho Civil
Departamento de Derecho Civil

• Word or...
Taxonomy of different values
• Different denominations:
Unidad de Registro Sismológico
Unidad de Registro Sísmico

• Punct...
Taxonomy of different values
• Errors (misspelling, typing or printing
errors):

Gabinete de imagen
Gavinete de imagen

• ...
Contents
• Introduction
• Taxonomy of different values
• The solution
• The clustering algorithm
• Results
• Conclusions
D...
The solution
1. Preparation
Main
step

2. Reading
3. Sorting
4. Clustering
5. Checking
6. Updating

Dpto. de Lenguajes y S...
Contents
• Introduction
• Taxonomy of different values
• The solution
• The clustering algorithm
• Results
• Conclusions
D...
The clustering algorithm
• Similarity:
– Edit distance or Levenshtein distance (LD)
– Invariant distance from word positio...
The clustering algorithm
• Filtering:
– Length distance (LEND)
– Transposition-invariant distance (TID)

Dpto. de Lenguaje...
The clustering algorithm
Input:
C: Sorted strings in descending order by frequency (c1…cm)
Output:
G: Set of clusters (g1…...
The clustering algorithm
3. For each string cj in C
If LEND(ci, cj) < α LEND(ci, cj) then
If TID(ci, cj) < α TID(ci, cj) t...
Contents
• Introduction
• Taxonomy of different values
• The solution
• The clustering algorithm
• Results
• Conclusions
D...
Results

Indexes for measuring the cluster complexity

CI: Consistency Index
FCI: File Consistency Index

∑∑ LD( x , x )
n...
Results
• File A

• File B

– Without
• FCI: 0.31

– With
• FCI: 0.12

Dpto. de Lenguajes y Sistemas Informáticos
Universi...
Results
• Evaluation measures:
– ONC: optimal number of clusters
– NC: number of clusters generated
– NCC: number of compl...
Results
• Precision: NCC / ONC
• Error: NIC / ONC

Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (Esp...
Results
• File A

• File B

– Without

– Without

• Precision: 70.7%

• Precision: 67.4%

• Error: 7.6%

• Error: 8.7%

– ...
Contents
• Introduction
• The problem: causes
• The solution
• The clustering algorithm
• Results
• Conclusions
Dpto. de L...
Conclusions
• Achieves good results: improves on
data quality
• Review obtained clusters
• Expansion of abbreviations
• Pa...
Upcoming SlideShare
Loading in …5
×

Clustering of Similar Values, in Spanish, for the Improvement of Search Systems

324 views

Published on

Published in:
Proceedings International Joint Conference, 7th Ibero-American Conference, 15th Brazilian Symposium on AI, IBERAMIA-SBIA 2000, Open Discussion Track Proceedings, p. 217-226, Atibaia - Sao Paulo (Brasil), November 19-22 2000. ISBN: 85-87837-03-6.

Download:
http://gplsi.dlsi.ua.es/almacenes/ver.php?pdf=1

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Clustering of Similar Values, in Spanish, for the Improvement of Search Systems

  1. 1. Clustering of Similar Values, in Spanish, for the Improvement of Search Systems Sergio Luján-Mora & Manuel Palomar (sergio.lujan@ua.es / @sergiolujanmora) Department of Software and Computing Systems University of Alicante, Spain Published in: Proceedings International Joint Conference, 7th IberoAmerican Conference, 15th Brazilian Symposium on AI, IBERAMIA-SBIA 2000, Open Discussion Track Proceedings, p. 217-226, Atibaia - Sao Paulo (Brasil), November 19-22 2000. ISBN: 85-87837-03-6. Download: http://gplsi.dlsi.ua.es/almacenes/ver.php?pdf=1 1
  2. 2. Clustering of Similar Values, in Spanish, for the Improvement of Search Systems Sergio Luján-Mora & Manuel Palomar Department of Software and Computing Systems University of Alicante, Spain Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 2
  3. 3. Contents • Introduction • Taxonomy of different values • The solution • The clustering algorithm • Results • Conclusions Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 3
  4. 4. Introduction • Information systems  Rapid and precise access • Databases  Find information • Inconsistency: a term represented by different values Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 4
  5. 5. Introduction • Term – Universidad de Alicante • Different values found in databases: – Universidad Alicante – Unibersidad de Alicante – Universitat d’Alacant – University of Alicante Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 5
  6. 6. Introduction • The problem: – Data redundancy  Inconsistency – Integration of different databases into a common repository (e.g. data warehouses): • different criteria  data redundancy  Dpto. de Lenguajes y Sistemas Informáticos Inconsistency Universidad de Alicante (España) 6
  7. 7. Introduction • We use clustering within an automatic method for reducing on inconsistency 1. Values that refer to a same term are clustered 2. All values are replaced by the cluster sample Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 7
  8. 8. Contents • Introduction • Taxonomy of different values • The solution • The clustering algorithm • Results • Conclusions Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 8
  9. 9. Taxonomy of different values • Omission or inclusion of the written accent: Asociación Astronómica Asociacion Astronomica • Lower-case / upper-case: Departamento de Lenguajes y Sistemas Departamento de lenguajes y sistemas Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 9
  10. 10. Taxonomy of different values • Abbreviations and acronyms: Dpto. de Derecho Civil Departamento de Derecho Civil • Word order: Miguel de Cervantes Saavedra Cervantes Saavedra, Miguel de Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 10
  11. 11. Taxonomy of different values • Different denominations: Unidad de Registro Sismológico Unidad de Registro Sísmico • Punctuation marks: Laboratorio Multimedia (mmlab) Laboratorio Multimedia - mmlab Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 11
  12. 12. Taxonomy of different values • Errors (misspelling, typing or printing errors): Gabinete de imagen Gavinete de imagen • Different languages: Universidad de Alicante University of Alicante Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 12
  13. 13. Contents • Introduction • Taxonomy of different values • The solution • The clustering algorithm • Results • Conclusions Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 13
  14. 14. The solution 1. Preparation Main step 2. Reading 3. Sorting 4. Clustering 5. Checking 6. Updating Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 14
  15. 15. Contents • Introduction • Taxonomy of different values • The solution • The clustering algorithm • Results • Conclusions Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 15
  16. 16. The clustering algorithm • Similarity: – Edit distance or Levenshtein distance (LD) – Invariant distance from word position (IDWP) Universidad de Alicante Alicante, Universidad de Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 16
  17. 17. The clustering algorithm • Filtering: – Length distance (LEND) – Transposition-invariant distance (TID) Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 17
  18. 18. The clustering algorithm Input: C: Sorted strings in descending order by frequency (c1…cm) Output: G: Set of clusters (g1…gn) STEPS 1 Select ci, the first string in C, and insert it into the new cluster gk 2 Remove ci from C Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 18
  19. 19. The clustering algorithm 3. For each string cj in C If LEND(ci, cj) < α LEND(ci, cj) then If TID(ci, cj) < α TID(ci, cj) then If LD(ci, cj) < α LD(ci, cj) then Insert cj into cluster gk Remove cj from C Else If IDWP(ci, cj) < α IDWP(ci, cj) then Insert cj into cluster gk Dpto. de Lenguajes y Sistemas Informáticos Remove c from C Universidad de Alicante j (España) 19
  20. 20. Contents • Introduction • Taxonomy of different values • The solution • The clustering algorithm • Results • Conclusions Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 20
  21. 21. Results Indexes for measuring the cluster complexity CI: Consistency Index FCI: File Consistency Index ∑∑ LD( x , x ) n CI = n i i =1 j =1 j n ∑x i =1 m i Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) FCI = ∑ CI i =1 i m 21
  22. 22. Results • File A • File B – Without • FCI: 0.31 – With • FCI: 0.12 Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) – Without • FCI: 1.72 – With • FCI: 1.11 22
  23. 23. Results • Evaluation measures: – ONC: optimal number of clusters – NC: number of clusters generated – NCC: number of completely correct clusters – NIC: number of incorrect clusters – NES: number of erroneous strings Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 23
  24. 24. Results • Precision: NCC / ONC • Error: NIC / ONC Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 24
  25. 25. Results • File A • File B – Without – Without • Precision: 70.7% • Precision: 67.4% • Error: 7.6% • Error: 8.7% – With – With • Precision: 84.8% • Precision: 72.8% • Error: 0% • Error: 6.5% Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 25
  26. 26. Contents • Introduction • The problem: causes • The solution • The clustering algorithm • Results • Conclusions Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 26
  27. 27. Conclusions • Achieves good results: improves on data quality • Review obtained clusters • Expansion of abbreviations • Parameters Dpto. de Lenguajes y Sistemas Informáticos Universidad de Alicante (España) 27

×