# Text Comparison Using Soft Cardinality

Classical set theory provides a method for comparing objects using cardinality and intersection, in combination with well-known resemblance coefficients such as Dice, Jaccard, and cosine. However, set operations are intrinsically crisp: they disregard similarities between elements. We propose a new general-purpose method for object comparison using a soft cardinality function that assesses set cardinality via an auxiliary affinity (similarity) measure. Our experiments with 12 text matching datasets show that the soft cardinality method is superior to known approximate string comparison methods in the text comparison task.
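To illustrate what "crisp" means here, the following is a minimal sketch of the three classical resemblance coefficients computed on plain Python sets. The function names are chosen for this example only, and the two name sets are borrowed from slide 10 of the transcript. Because no token is shared exactly, every coefficient is zero even though the names are near-duplicates.

```python
from math import sqrt

# Classical (crisp) resemblance coefficients on non-empty sets.
# Function names are illustrative, not taken from the slides.

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b)

def dice(a: set, b: set) -> float:
    """2 |A ∩ B| / (|A| + |B|)"""
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(a: set, b: set) -> float:
    """|A ∩ B| / sqrt(|A| |B|)"""
    return len(a & b) / sqrt(len(a) * len(b))

s1 = {"Sergio", "Jimenes", "Vargaz"}
s2 = {"Cergio", "Gimenez", "Vargas"}

# No token matches exactly, so all crisp coefficients are zero.
print(jaccard(s1, s2), dice(s1, s2), cosine(s1, s2))  # 0.0 0.0 0.0
```

A single-character misspelling is treated the same as a completely different word, which is the limitation the soft cardinality method targets.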

Published in: Technology, Sports

### Text Comparison Using Soft Cardinality

1. Text Comparison Using Soft Cardinality. Sergio Jimenez¹, Fabio González¹, Alexander Gelbukh². ¹Universidad Nacional de Colombia, Bogotá; ²Centro de Investigaciones en Computación (CIC), IPN, Mexico. SPIRE’10
2. Can you make a better comparison between two “bags” by removing redundancy?
3. Classic vs. Soft Cardinality
   - Classic set cardinality: $|S| = 4$. Repeated elements are counted only once.
   - Soft cardinality: $|S|_\alpha = 2.??$. Similar elements count less.
4. Soft Cardinality Definition
   1. Consider each element as a subset: $S = \{s_1, s_2, \ldots, s_n\}$ becomes $S' = \{\{s_1\}, \{s_2\}, \ldots, \{s_n\}\}$.
   2. Set the cardinality of each subset to 1: $|\{s_i\}| = 1$.
   3. Treat the similarity between pairs of elements in $S$ as the intersection between their corresponding subsets: $|\{s_i\} \cap \{s_j\}| = \alpha(s_i, s_j)$.
   4. Define the soft cardinality of $S$ as $|S|_\alpha = |\{s_1\} \cup \{s_2\} \cup \cdots \cup \{s_n\}|$.
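5. Soft Cardinality. [Figure: each element of $S$ is drawn in $S'$ as a subset with area = 1; overlaps between subsets are given by $\alpha(\cdot,\cdot)$; $|S|_\alpha$ is the cardinality of their union.]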
6. Cardinality of the union of sets

   $$|A_1 \cup A_2| = |A_1| + |A_2| - |A_1 \cap A_2|$$

   $$|A_1 \cup A_2 \cup A_3| = |A_1| + |A_2| + |A_3| - |A_1 \cap A_2| - |A_2 \cap A_3| - |A_1 \cap A_3| + |A_1 \cap A_2 \cap A_3|$$

   - 1st problem: binary similarity measures $\alpha(\cdot,\cdot)$ are common, but n-ary similarity functions are not.
   - 2nd problem: the number of terms for $n$ sets is $2^n - 1$.
   - 3rd problem: if pairwise intersections are large (close to 1), n-wise intersections are also large, so they cannot be ignored.
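7. An Approximation to Soft Cardinality …

   The affinity (similarity) function $\alpha(\cdot,\cdot)$ gives pairwise affinities of 0.70, 0.20 and 0.15 among the three elements of $S$. The resulting affinity matrix is

   $$\begin{pmatrix} 1.00 & 0.70 & 0.15 \\ 0.70 & 1.00 & 0.20 \\ 0.15 & 0.20 & 1.00 \end{pmatrix}$$

   and the soft cardinality is bounded by $1.0 < |S|_\alpha < 3.0$.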
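8. … An Approximation to Soft Cardinality

   For $S = \{s_1, s_2, \ldots, s_n\}$ with the affinity matrix

   $$\begin{pmatrix} 1.00 & 0.70 & 0.15 \\ 0.70 & 1.00 & 0.20 \\ 0.15 & 0.20 & 1.00 \end{pmatrix}$$

   the approximation gives $|S|_\alpha = 1.81$.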
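9. Weighted Soft Cardinality

   The three elements of $S$ weigh 500 kg, 0.1 kg and 0.001 kg, with pairwise affinities of 0.001, 0.007 and 0.001.

   Affinity matrix:

   $$\begin{pmatrix} 1.000 & 0.007 & 0.001 \\ 0.007 & 1.000 & 0.001 \\ 0.001 & 0.001 & 1.000 \end{pmatrix}$$

   Importance $w_i$: 500, 0.01, 0.001.

   - Non-weighted: $|S|_\alpha = 2.98$ (elements)
   - Weighted: $|S|_\alpha = 496$ (kg)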
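10. Soft Cardinality for Text Comparison

    $S_1$ = {"Sergio", "Jimenes", "Vargaz"}, $S_2$ = {"Cergio", "Gimenez", "Vargas"}. Here $\alpha(x, y)$ is a normalized edit distance converted to a similarity.

    Affinity matrix of $S_1$:

    |         | Sergio | Jimenes | Vargaz |
    |---------|--------|---------|--------|
    | Sergio  | 1.000  | 0.000   | 0.333  |
    | Jimenes | 0.000  | 1.000   | 0.000  |
    | Vargaz  | 0.333  | 0.000   | 1.000  |

    Affinity matrix of $S_2$:

    |         | Cergio | Gimenez | Vargas |
    |---------|--------|---------|--------|
    | Cergio  | 1.000  | 0.000   | 0.333  |
    | Gimenez | 0.000  | 1.000   | 0.000  |
    | Vargas  | 0.333  | 0.000   | 1.000  |

    Affinity matrix of $S_1 \cup S_2$ = {"Sergio", "Jimenes", "Vargaz", "Cergio", "Gimenez", "Vargas"}:

    |         | Sergio | Jimenes | Vargaz | Cergio | Gimenez | Vargas |
    |---------|--------|---------|--------|--------|---------|--------|
    | Sergio  | 1.000  | 0.000   | 0.333  | 0.833  | 0.000   | 0.333  |
    | Jimenes | 0.000  | 1.000   | 0.000  | 0.000  | 0.714   | 0.143  |
    | Vargaz  | 0.333  | 0.000   | 1.000  | 0.333  | 0.143   | 0.833  |
    | Cergio  | 0.833  | 0.000   | 0.333  | 1.000  | 0.000   | 0.333  |
    | Gimenez | 0.000  | 0.714   | 0.143  | 0.000  | 1.000   | 0.000  |
    | Vargas  | 0.333  | 0.143   | 0.833  | 0.333  | 0.000   | 1.000  |

    Results: $|S_1|_\alpha = 2.50$, $|S_2|_\alpha = 2.50$, $|S_1 \cup S_2|_\alpha = 2.63$.

    $$|S_1 \cap S_2|_\alpha = |S_1|_\alpha + |S_2|_\alpha - |S_1 \cup S_2|_\alpha = 2.37$$

    $$\mathrm{Jaccard}(S_1, S_2) = \frac{|S_1 \cap S_2|_\alpha}{|S_1 \cup S_2|_\alpha} = 0.90$$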
11. The Name Matching Problem. [Figure: records of Relation #1 linked to records of Relation #2; 5 matches.]
12. 12 Name Matching Datasets
13. IAP (interpolated average precision) results. Methods compared:
    - SoftTF-IDF [Cohen et al. 2003]: state of the art; adaptive approach, Similarity(A, B, corpus).
    - Soft Cardinality: static approach, Similarity(A, B).
    - Soft Cardinality weighted by IDF: adaptive approach, Similarity(A, B, corpus).
14. Conclusions
    - Soft Cardinality provides a “nice” method to compare bags of words that considers both similarity among terms and term weighting.
    - Experimental evidence shows that the Soft Cardinality method is comparable to (slightly better than) soft versions of the word-space model.
15. Questions?
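The numbers on slides 8 to 10 can be reproduced with a short sketch. Two assumptions are made here that the transcript does not state: the approximation is taken to be the sum, over elements, of each element's weight divided by its row sum in the affinity matrix, and α is taken to be one minus the Levenshtein distance divided by the length of the longer string. All function names are hypothetical. Under these assumptions the code yields the slide values 1.81, 2.98, 496, 2.50, 2.63 and 0.90.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def affinity(a: str, b: str) -> float:
    """Assumed alpha: 1 - edit distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def soft_cardinality_from_matrix(matrix, weights=None):
    """Assumed approximation: sum of w_i / (sum of row i of the affinity matrix)."""
    if weights is None:
        weights = [1.0] * len(matrix)
    return sum(w / sum(row) for w, row in zip(weights, matrix))

def soft_cardinality(elements, alpha=affinity):
    """Build the affinity matrix for a list of elements, then approximate."""
    matrix = [[alpha(x, y) for y in elements] for x in elements]
    return soft_cardinality_from_matrix(matrix)

def soft_jaccard(s1, s2, alpha=affinity):
    """Jaccard coefficient with soft cardinalities; inputs must be non-empty lists."""
    union = list(dict.fromkeys(s1 + s2))  # ordered union without exact duplicates
    card_union = soft_cardinality(union, alpha)
    card_inter = soft_cardinality(s1, alpha) + soft_cardinality(s2, alpha) - card_union
    return card_inter / card_union

# Slide 8: affinity matrix copied from the transcript.
slide8 = [[1.00, 0.70, 0.15], [0.70, 1.00, 0.20], [0.15, 0.20, 1.00]]
print(f"{soft_cardinality_from_matrix(slide8):.2f}")  # 1.81

# Slide 9: element weights taken as 500 kg, 0.1 kg and 0.001 kg.
slide9 = [[1.000, 0.007, 0.001], [0.007, 1.000, 0.001], [0.001, 0.001, 1.000]]
print(f"{soft_cardinality_from_matrix(slide9):.2f}")  # 2.98
print(f"{soft_cardinality_from_matrix(slide9, [500, 0.1, 0.001]):.0f}")  # 496

# Slide 10: name sets compared with the assumed edit-distance affinity.
s1 = ["Sergio", "Jimenes", "Vargaz"]
s2 = ["Cergio", "Gimenez", "Vargas"]
print(f"{soft_cardinality(s1):.2f}")       # 2.50
print(f"{soft_cardinality(s1 + s2):.2f}")  # 2.63
print(f"{soft_jaccard(s1, s2):.2f}")       # 0.90
```

Each row sum is at least 1 because every element has affinity 1 with itself, so the divisions are always defined. The cost is quadratic in the number of elements, compared with the $2^n - 1$ terms of the exact inclusion-exclusion expansion.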