Fuzzy Hash Map

This is a presentation of Fuzzy Hash Map (FHM). FHM is an extension of the regular Java HashMap data structure that allows efficient fuzzy search on string keys. Customizable algorithms and settings make this new data structure flexible and adaptable to each specific use case. Fuzzy string search with Fuzzy Hash Map is compared against the regular HashMap in terms of both accuracy and running time. The results show very good performance for Fuzzy Hash Map compared to the regular HashMap.

  • Sorry for the late response. I'm glad you liked this. As for generic types, yes, FuzzyHashMaps could extend the generic HashMap. However, this implementation only has support for String keys. Any other key type would need its custom FuzzyKey, with the 'fuzzy' logic defined for that specific type. It's impossible to define a generic FuzzyKey with a generic fuzzy logic.

Fuzzy Hash Map

  1. Efficient Fuzzy Search Enabled Hash Map
     Vasile Topac
     PhD Student
     Department of Information Technology and Computer Science
     “Politehnica” University of Timisoara
     Email: vasile.topac@aut.upt.ro
     4th International Workshop on Soft Computing Applications, SOFA2010 – Arad, ROMANIA
  2. How it all started
     SOFA2010 – Arad, ROMANIA - Efficient Fuzzy Search Enabled Hash Map - Vasile Topac
  3. Java HashMap
     - widely used Java data structure
     - stores (key, value) pairs
     - search by key
     - very fast
     - a hash function generates a hash code used for indexing
     - uses the equals method to compare keys
     - only values for existing keys can be retrieved
  4. Java HashMap
     phone book example
  5. Java HashMap
     Collision
  6. Java HashMap
     Search for “Lisa Smith”
     hashMap.get("Lisa Smith");
     Result: “521-8976”
  7. Problem
     - only values for existing keys can be retrieved
     Search for “Lissa Smith”
     hashMap.get("Lissa Smith");
     Result: null
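The exact-match behaviour described on the two slides above can be reproduced with a plain java.util.HashMap. This is a minimal sketch; the class name ExactLookupDemo and the helper phoneBook() are made up for illustration, and only the "Lisa Smith" entry comes from the slides.

```java
import java.util.HashMap;
import java.util.Map;

class ExactLookupDemo {
    // Builds a one-entry phone book holding the entry shown on the slides.
    static Map<String, String> phoneBook() {
        Map<String, String> book = new HashMap<>();
        book.put("Lisa Smith", "521-8976");
        return book;
    }

    public static void main(String[] args) {
        Map<String, String> book = phoneBook();
        // Exact key: hashCode() and equals() both match, so the value is found.
        System.out.println(book.get("Lisa Smith"));   // prints 521-8976
        // One extra letter: no stored key is equal to the query, so get() returns null.
        System.out.println(book.get("Lissa Smith"));  // prints null
    }
}
```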
  8. Problem
     - search for “Lissa Smith”
     Brute force solution:
     - iterate through the set of entries and search for approximate matches
     Works, but is expensive in time
     Fuzzy data structures: currently available for databases
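The brute force solution from the slide above could look roughly like the sketch below. It assumes Levenshtein distance as the approximate-match test (the slide does not say which test is meant), and the names BruteForceFuzzySearch, findClosest and samplePhoneBook, as well as the "John Smith" entry, are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class BruteForceFuzzySearch {
    // Levenshtein distance computed with two rolling rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;  // the finished row becomes prev
        }
        return prev[b.length()];
    }

    // Visits every entry and returns the value of the closest key
    // within maxDistance, or null if no key is close enough.
    static <V> V findClosest(Map<String, V> map, String query, int maxDistance) {
        V best = null;
        int bestDistance = maxDistance + 1;
        for (Map.Entry<String, V> e : map.entrySet()) {
            int d = levenshtein(query, e.getKey());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getValue();
            }
        }
        return best;
    }

    // Sample data: "Lisa Smith" is from the slides, "John Smith" is invented.
    static Map<String, String> samplePhoneBook() {
        Map<String, String> book = new HashMap<>();
        book.put("Lisa Smith", "521-8976");
        book.put("John Smith", "555-0100");
        return book;
    }

    public static void main(String[] args) {
        System.out.println(findClosest(samplePhoneBook(), "Lissa Smith", 2)); // prints 521-8976
    }
}
```

Every lookup computes one distance per stored key, so the cost grows linearly with the size of the map. That is the time expense the slide refers to.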
  9. Fuzzy Hash Map
     “Soft computing (SC) is a collection of methodologies that are trying to cope with the main disadvantage of the conventional (hard) computing: the poor performances when working in uncertain conditions.”
  10. Fuzzy Hash Map
      UML Class Diagram
  11. Fuzzy Hash Map
      How it works
      FuzzyKey overridden methods
      - hashCode()
        - prehashing: create collisions to cluster data
          - substring: substring(“Fuzzy Search”, 0, 4) = “Fuzz”
          - soundex: soundex(“Fuzzy Search”) = F226
      - equals(Object o)
        - string metrics
          - Levenshtein Distance: LD(computing, computation) = 4
          - Hamming Distance: HD(computing, computers) = 3
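To make the prehashing idea concrete, here is a small sketch of a substring prehash and of the Hamming distance named on the slide. It is not the FuzzyKey code from the project; the class and method names are invented for illustration.

```java
class PrehashSketch {
    // Substring prehash: keep only the first `length` characters of the key.
    static String substringPrehash(String key, int length) {
        return key.substring(0, Math.min(length, key.length()));
    }

    // Hash code derived from the prehash only, so all keys sharing a prefix
    // collide on purpose and end up clustered in the same bucket.
    static int prehashCode(String key, int length) {
        return substringPrehash(key, length).hashCode();
    }

    // Hamming distance: number of positions at which two equal-length strings differ.
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) distance++;
        }
        return distance;
    }

    public static void main(String[] args) {
        System.out.println(substringPrehash("Fuzzy Search", 4));   // prints Fuzz
        System.out.println(hamming("computing", "computers"));     // prints 3
        // Same 4-character prefix, therefore the same prehash code.
        System.out.println(prehashCode("adjudicating", 4) == prehashCode("adjudication", 4)); // prints true
    }
}
```

Hamming distance is only defined for strings of equal length, which is why the sketch rejects other inputs. Levenshtein distance has no such restriction.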
  12. Fuzzy Hash Map
      Example (law terminology dictionary)
      - hashCode()
        - prehashing
          - substring 4
      - equals(Object o)
        - Levenshtein Distance
  13. Fuzzy Hash Map
      “the judge has the option of either adjudicating you as guilty or..”
      fuzzyHashMap.get("adjudicating") = null
      fuzzyHashMap.getFuzzy("adjudicating", 2) = “a decision or sentence imposed by a judge…”
      - hashCode()
        substring 4 = “adju”
      - equals(Object o)
        LD(adjudicating, adjudication) = 2
  14. Fuzzy Hash Map
      fuzzyHashMap.getFuzzy("violent") = “violence”
      LD(violent, violence) = 2
      LD(violent, violation) = 5
      “violence” is returned
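The two lookups above can be imitated with the sketch below. It is an assumption-laden stand-in, not the project's implementation: instead of a FuzzyKey that overrides hashCode() and equals(), it stores explicit buckets keyed by the 4-character prefix and compares only within one bucket. The class name BucketedFuzzyMap and the placeholder definitions are hypothetical; get and getFuzzy mirror the method names on the slides.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BucketedFuzzyMap<V> {
    private static final int PREFIX_LENGTH = 4;  // "substring 4" prehash from the slides
    private final Map<String, List<Map.Entry<String, V>>> buckets = new HashMap<>();

    private static String prehash(String key) {
        return key.substring(0, Math.min(PREFIX_LENGTH, key.length()));
    }

    void put(String key, V value) {
        List<Map.Entry<String, V>> bucket =
                buckets.computeIfAbsent(prehash(key), k -> new ArrayList<>());
        bucket.removeIf(e -> e.getKey().equals(key));  // overwrite an existing exact key
        bucket.add(new AbstractMap.SimpleEntry<>(key, value));
    }

    // Exact lookup is a fuzzy lookup with threshold 0.
    V get(String key) {
        return getFuzzy(key, 0);
    }

    // Returns the value of the closest key in the query's bucket whose
    // Levenshtein distance is at most `threshold`, or null if there is none.
    V getFuzzy(String key, int threshold) {
        V best = null;
        int bestDistance = threshold + 1;
        List<Map.Entry<String, V>> bucket =
                buckets.getOrDefault(prehash(key), Collections.emptyList());
        for (Map.Entry<String, V> e : bucket) {
            int d = levenshtein(key, e.getKey());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getValue();
            }
        }
        return best;
    }

    // Levenshtein distance computed with two rolling rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        BucketedFuzzyMap<String> dictionary = new BucketedFuzzyMap<>();
        dictionary.put("adjudication", "definition of adjudication");  // placeholder text
        dictionary.put("violence", "definition of violence");          // placeholder text
        dictionary.put("violation", "definition of violation");        // placeholder text
        System.out.println(dictionary.get("adjudicating"));            // prints null
        System.out.println(dictionary.getFuzzy("adjudicating", 2));    // prints definition of adjudication
        System.out.println(dictionary.getFuzzy("violent", 2));         // prints definition of violence
    }
}
```

Only the keys that share the query's prefix are compared, which is where the speed-up over a full scan comes from. The price is that a typo within the first four characters lands in a different bucket and is not found.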
  15. Fuzzy Hash Map
      Example (phone book)
      - hashCode()
        - prehashing
          - soundex
      - equals(Object o)
        - Levenshtein Distance
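For the phone book configuration, a Soundex prehash could be sketched as follows. This follows the common American Soundex rules (first letter kept, three digits, H and W do not separate equal codes) and skips non-letters such as spaces; whether the project's own soundex handles spaces the same way is an assumption. The class name SoundexSketch is hypothetical.

```java
import java.util.Locale;

class SoundexSketch {
    // Soundex digit for each letter A..Z; '0' means the letter carries no code.
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String text) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : text.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') continue;     // skip spaces, digits, punctuation
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw);                      // the first letter is kept as a letter
                last = code;
                continue;
            }
            if (raw == 'H' || raw == 'W') continue;   // H and W do not separate equal codes
            if (code != '0' && code != last) out.append(code);
            last = code;                              // a vowel resets last to '0'
            if (out.length() == 4) break;
        }
        while (out.length() < 4) out.append('0');     // pad short codes with zeros
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Fuzzy Search"));  // prints F226
        System.out.println(soundex("Lisa Smith"));    // prints L225
        System.out.println(soundex("Lissa Smith"));   // prints L225
    }
}
```

Both spellings of the name map to the same code, so they would land in the same bucket. A 4-character substring prehash would separate them (“Lisa” versus “Liss”), which suggests why a phonetic prehash suits names better.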
  16. Results
      Accuracy Test
      Test conditions
      - Substring(0,4) hashing function
      - Levenshtein Distance fuzzy matching algorithm
      - distance threshold value 2
      - medical terminology dictionary populated with 1030 English medical terms
      Test results
      - Parse text from American Family Physicians Journal
        - text of 568 words
        - 43 words identified as medical terms
        - 9 were incorrect matches
        - 80% accuracy
      - Parse text from eMedicine web site
        - text of 2730 words
        - 260 were recognized
        - 7 were incorrect matches
        - 97% accuracy
  17. Results
      Speed Test
      - Exact matches only
  18. Results
      Speed Test
      - Fuzzy matches only
  19. Results
      Speed Test
      - Exact & fuzzy matches
  20. Conclusion
      - FuzzyHashMap data structures proved to perform very well when working with uncertain data
      - flexible (different pre-hashing functions and string metrics can be chosen)
      - available as open source: http://fuzzyhashmap.sourceforge.net/
      - the community can extend the functionality
      - Future work:
        - adding more string metrics
        - improving performance
        - implementing Fuzzy TreeMap
  21. Thank you!
      Sources at: http://fuzzyhashmap.sourceforge.net
      Contact: vasile.topac@aut.upt.ro
