Fuzzy Hash Map

This presentation introduces the Fuzzy Hash Map (FHM), an extension of the regular Java HashMap data structure that allows efficient fuzzy search on string keys. Customizable algorithms and settings make the data structure adaptable to each specific use case. Fuzzy string search with Fuzzy Hash Map and with the regular HashMap is compared for both accuracy and time consumption. Results show very good performance for Fuzzy Hash Map compared to the regular HashMap.

  • Sorry for the late response. I'm glad you liked this. As for generic types, yes, FuzzyHashMaps could extend the generic HashMap. However, this implementation only has support for String keys. Any other key type would need its custom FuzzyKey, with the 'fuzzy' logic defined for that specific type. It's impossible to define a generic FuzzyKey with a generic fuzzy logic.

Transcript

  • 1. Efficient Fuzzy Search Enabled Hash Map
    Vasile Topac
    PhD Student
    Department of Information Technology and Computer Science
    “Politehnica” University of Timisoara
    Email: vasile.topac@aut.upt.ro
    4th International Workshop On Soft Computing Applications SOFA2010 – Arad, ROMANIA
  • 2. How it all started
    &
    SOFA2010 – Arad, ROMANIA - Efficient Fuzzy Search Enabled Hash Map - Vasile Topac
  • 3. Java HashMap
    • widely used Java data structure
    • stores (key, value) pairs
    • search by key
    • very fast
    • a hash function generates a hash code for indexing
    • uses the equals method to compare keys
    • only values for existing keys can be retrieved
  • 10. Java HashMap
    phone book example
  • 11. Java HashMap
    Collision
  • 12. Java HashMap
    Search for “Lisa Smith”
    hashMap.get("Lisa Smith");
    Result: “521-8976”
  • 13. Problem
    • only values for existing keys can be retrieved
    Search for “Lissa Smith”
    hashMap.get("Lissa Smith");
    Result: null
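
The two lookups above can be reproduced with a plain java.util.HashMap. This is a minimal sketch that assumes a phone book holding only the entry shown on the slide; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the exact-match behaviour described on the slides.
// Only the "Lisa Smith" entry comes from the slides; the names are illustrative.
final class PhoneBookExactLookup {

    static Map<String, String> buildPhoneBook() {
        Map<String, String> phoneBook = new HashMap<>();
        phoneBook.put("Lisa Smith", "521-8976");
        return phoneBook;
    }

    public static void main(String[] args) {
        Map<String, String> phoneBook = buildPhoneBook();
        // Exact key: hashCode() selects the bucket, equals() confirms the key.
        System.out.println(phoneBook.get("Lisa Smith"));
        // One extra letter: different hash code and no equal key, so get() returns null.
        System.out.println(phoneBook.get("Lissa Smith"));
    }
}
```

The misspelled key fails because String.equals requires every character to match.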
  • 14. Problem
    • search for “Lissa Smith”
    Brute force solution:
    - iterate through the set of entries and search for approximate matches
    Works, but is expensive in time
    Fuzzy data structures – currently available for databases
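
The brute force approach can be sketched as a full scan that computes a string distance against every key. The sketch assumes Levenshtein distance as the metric and a distance threshold chosen by the caller; the class name, method names and the two extra phone book entries are illustrative, not taken from the slides.

```java
import java.util.HashMap;
import java.util.Map;

// Brute-force fuzzy lookup: every key is compared with the query, so the cost
// grows linearly with the number of entries.
final class BruteForceFuzzyLookup {

    // Dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(substitution, Math.min(prev[j], curr[j - 1]) + 1);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    // Returns the value of the closest key within maxDistance, or null if none qualifies.
    static <V> V bruteForceGet(Map<String, V> map, String query, int maxDistance) {
        V best = null;
        int bestDistance = maxDistance + 1;
        for (Map.Entry<String, V> entry : map.entrySet()) {
            int distance = levenshtein(query, entry.getKey());
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> phoneBook = new HashMap<>();
        phoneBook.put("Lisa Smith", "521-8976");  // from the slides
        phoneBook.put("John Smith", "521-1234");  // illustrative entry
        phoneBook.put("Sandra Dee", "521-9655");  // illustrative entry
        System.out.println(bruteForceGet(phoneBook, "Lissa Smith", 2));
    }
}
```

Every lookup touches every entry, which is the time cost the slide refers to.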
  • 15. Fuzzy Hash Map
    “ Soft computing (SC) is a collection of methodologies that are trying to cope with the main disadvantage of the conventional (hard) computing: the poor performances when working in uncertain conditions. ”
  • 16. Fuzzy Hash Map
    UML Class Diagram
  • 17. Fuzzy Hash Map
    How it works
    FuzzyKey overridden methods
    • hashCode()
      • pre-hashing: create collisions to cluster data
      • substring: substring("Fuzzy Search", 0, 4) = "Fuzz"
      • soundex: soundex("Fuzzy Search") = F226
    • equals(Object o)
      • string metrics
      • Levenshtein Distance: LD(computing, computation) = 4
      • Hamming Distance: HD(computing, computers) = 3
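
The pre-hashing functions and the Hamming distance listed above can be sketched in a few lines of Java. This is an illustration under assumptions, not the project's code: the soundex method is a compact version of the standard American Soundex rules that skips non-letters, and all names are hypothetical.

```java
import java.util.Locale;

// Pre-hash functions map similar strings to the same value so that they collide
// on purpose; the string metric then decides whether two keys really match.
final class PreHashAndMetricsSketch {

    // Prefix pre-hash: the first `length` characters of the key.
    static String prefixPreHash(String s, int length) {
        return s.substring(0, Math.min(length, s.length()));
    }

    // Compact American Soundex: first letter plus three digits, zero padded.
    static String soundex(String s) {
        final String codes = "01230120022455012623010202"; // digit for each letter A..Z
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char ch : s.toUpperCase(Locale.ROOT).toCharArray()) {
            if (ch < 'A' || ch > 'Z') {
                continue; // skip spaces and punctuation
            }
            char code = codes.charAt(ch - 'A');
            if (out.length() == 0) {
                out.append(ch); // keep the first letter itself
                last = code;
                continue;
            }
            if (ch == 'H' || ch == 'W') {
                continue; // H and W do not separate letters with the same code
            }
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // vowels (code 0) reset the duplicate check
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    // Hamming distance: number of differing positions; defined for equal lengths only.
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    public static void main(String[] args) {
        System.out.println(prefixPreHash("Fuzzy Search", 4)); // Fuzz
        System.out.println(soundex("Fuzzy Search"));          // F226
        System.out.println(hamming("computing", "computers")); // 3
    }
}
```

Hamming distance only applies to strings of equal length, which is why Levenshtein distance is the more general choice for misspellings that add or drop letters.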
  • 25. Fuzzy Hash Map
    Example
    (law terminology dictionary)
  • 30. Fuzzy Hash Map
    “the judge has the option of either adjudicating you as guilty or..”
    fuzzyHashMap.get("adjudicating") = null
    fuzzyHashMap.getFuzzy("adjudicating", 2) = "a decision or sentence imposed by a judge…"
    • hashCode()
    substring("adjudicating", 0, 4) = "adju"
    • equals(Object o)
    LD(adjudicating, adjudication) = 2
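
One way to get this behaviour from an ordinary HashMap is a key wrapper whose hashCode() uses the 4-character prefix and whose equals() applies a Levenshtein threshold. The sketch below is hypothetical and is not the FuzzyHashMap source: the per-key maxDistance field stands in for the second argument of getFuzzy, and the definition string is truncated exactly as on the slide.

```java
import java.util.HashMap;
import java.util.Map;

final class FuzzyKeySketch {

    static final int PREFIX_LENGTH = 4; // substring(0, 4) pre-hash, as on the slide

    static final class FuzzyKey {
        final String text;
        final int maxDistance; // 0 means exact match only

        FuzzyKey(String text, int maxDistance) {
            this.text = text;
            this.maxDistance = maxDistance;
        }

        @Override
        public int hashCode() {
            // Deliberate collisions: keys sharing a prefix land in the same bucket.
            // Misspellings inside the prefix are therefore never found.
            return text.substring(0, Math.min(PREFIX_LENGTH, text.length())).hashCode();
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof FuzzyKey)) {
                return false;
            }
            FuzzyKey other = (FuzzyKey) o;
            // Symmetric, but not transitive: this bends the usual equals() contract.
            int threshold = Math.max(maxDistance, other.maxDistance);
            return levenshtein(text, other.text) <= threshold;
        }
    }

    // Dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(substitution, Math.min(prev[j], curr[j - 1]) + 1);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        Map<FuzzyKey, String> dictionary = new HashMap<>();
        dictionary.put(new FuzzyKey("adjudication", 0),
                "a decision or sentence imposed by a judge..."); // truncated on the slide
        System.out.println(dictionary.get(new FuzzyKey("adjudicating", 0))); // exact: null
        System.out.println(dictionary.get(new FuzzyKey("adjudicating", 2))); // fuzzy: definition
    }
}
```

Both words share the prefix "adju", so the fuzzy lookup only compares keys in one bucket instead of scanning the whole map.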
  • 31. Fuzzy Hash Map
    fuzzyHashMap.getFuzzy("violent") = "violence"
    LD(violent, violence) = 2
    LD(violent, violation) = 5
    “violence” is returned
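
Choosing the closest of several candidates can be sketched with explicit prefix buckets: only keys sharing the query's prefix are compared, and the one with the smallest Levenshtein distance wins. This is an illustrative structure, not the project's implementation; the class, the bucket layout and the maxDistance parameter (the slide's call shows no threshold) are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ClosestMatchSketch {

    static final int PREFIX_LENGTH = 4;

    // Keys grouped by their 4-character prefix.
    private final Map<String, List<String>> buckets = new HashMap<>();

    static String prefix(String s) {
        return s.substring(0, Math.min(PREFIX_LENGTH, s.length()));
    }

    void add(String key) {
        buckets.computeIfAbsent(prefix(key), p -> new ArrayList<>()).add(key);
    }

    // Returns the stored key closest to the query within maxDistance, or null.
    String closestKey(String query, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        List<String> candidates = buckets.getOrDefault(prefix(query), Collections.emptyList());
        for (String candidate : candidates) {
            int distance = levenshtein(query, candidate);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }

    // Dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(substitution, Math.min(prev[j], curr[j - 1]) + 1);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        ClosestMatchSketch dictionary = new ClosestMatchSketch();
        dictionary.add("violence");
        dictionary.add("violation");
        // Both candidates share the prefix "viol"; "violence" is nearer (2 versus 5).
        System.out.println(dictionary.closestKey("violent", 2));
    }
}
```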
  • 32. Fuzzy Hash Map
    Example
    (phone book)
  • 37. Results
    Accuracy Test
    Test conditions
    • Substring(0,4) hashing function
    • Levenshtein Distance fuzzy matching algorithm
      • distance threshold value 2
    • medical terminology dictionary populated with 1030 English medical terms
    Test results
    • Parsed text from the American Family Physician journal
      • text of 568 words
      • 43 words identified as medical terms
      • 9 were incorrect matches
      • 80% accuracy
    • Parsed text from the eMedicine web site
      • text of 2730 words
      • 260 were recognized
      • 7 were incorrect matches
      • 97% accuracy
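
The reported percentages can be checked from the counts above, assuming accuracy means the share of identified terms that were correct matches (the slides do not define it):

```latex
\[
\text{accuracy}_{\text{journal}} = \frac{43 - 9}{43} = \frac{34}{43} \approx 0.79,
\qquad
\text{accuracy}_{\text{eMedicine}} = \frac{260 - 7}{260} = \frac{253}{260} \approx 0.97
\]
```

Under that assumption the first figure is about 79%, which the slide gives as 80%.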
  • 48. Results
    Speed Test
    • Exact matches only
  • 49. Results
    Speed Test
    • Fuzzy matches only
  • 50. Results
    Speed Test
    • Exact & fuzzy matches
  • 51. Conclusion
    • FuzzyHashMap data structures proved to have very good performance when working with uncertain data
    • flexible (can choose different pre-hashing functions and string metrics)
    • available as open source: http://fuzzyhashmap.sourceforge.net/
    • community can extend the functionality
    • Future work:
      • adding more string metrics
      • improving performance
      • implementing a Fuzzy TreeMap
  • 59. Thank you!
    sources at:
    http://fuzzyhashmap.sourceforge.net
    contact
    vasile.topac@aut.upt.ro