Name conflict resolution

A project on detecting proposed company names that conflict with, or closely resemble, company names already registered in the database.

Published in: Education, Travel, News & Politics

  1. Name Conflict Resolution for Company Registration (8/30/2013)
  2. Introduction
     • System to automate the company registration process
     • Compares company names using string matching algorithms
     • Ranks names according to their similarity percentage
     • Rejects a name if its similarity score is 100%
  3. Objectives
     • To develop a system to resolve naming conflicts
     • To find names similar to the name proposed by the user
     • To rank the existing names that match the proposed name
  4. Methodology
     • Building Base Dictionary
     • Keyword Generation
     • Finding Possible Matches
     • Finding Duplicates
     • Finding Ranks
  5. System Design
  7. User Input: “Centre Nepal Metals Industries”
  9. Preprocessing Engine
  10. Downcasting
      • Converts uppercase letters in the input to lowercase
      • “Centre Nepal Metals Industries” → “centre nepal metals industries”
  11. Transformation
      • Converts British English words to American English words
      • “centre nepal metals industries” → “center nepal metals industries”
  12. Stopword Removal
      • Removes predefined stopwords from the string
      • “center nepal metals industries” → “center nepal metals”
  13. Tokenization
      • Reduces a large string to a set of tokens
      • “center nepal metals” → center, nepal, metals
  14. Stemming
      • Reduces a word to a root, or simpler form
      • Tokens: center, nepal, metals
      • Stemmed tokens: center, nepal, metal
  16. Translation
      • Conveys the meaning of a source-language text by means of an equivalent target-language text
      • Stemmed tokens: metal, center, nepal
  17. Transliteration
      • Converts a text from one script to another
      • Transliterated tokens: dhatu, kendra, nepal
  19. Final Token List: dhatu, metal, nepal, center, kendra
  21. Database Query using Final Token List
      • nepal medical centre pvt. ltd.
      • nepal dhatu company
      • metal nepal pvt. ltd.
      • enter nepal
      • nepal metal industries
      • dhatu sankalan kendra
  22. From Database Query Result: Nepal Medical Centre pvt. ltd.
  24. Database Generated Keywords: center, nepal, medical
  26. Permutation
      • kendra nepal dhatu
      • kendra nepal metal
      • center nepal dhatu
      • center nepal metal
  28. Comparison (3 steps)
      1. Levenshtein Distance Calculation
      2. Optimized Maximal Similarity using Hungarian Algorithm
      3. Sørensen Index to Calculate Similarity %
  29. Levenshtein Distance Calculation: distances from “center” to nepal, medical, center are 5, 6, 0
  30. Levenshtein Distance Calculation: distances from “nepal” to nepal, medical, center are 0, 4, 5
  31. Levenshtein Distance Calculation: distances from “metal” to nepal, medical, center are 2, 3, 4, which completes the matrix:

                  nepal  medical  center
        center      5       6       0
        nepal       0       4       5
        metal       2       3       4

  32. Similarity Weight Calculation: weights from “center” to nepal, medical, center are 1, 1, 6
  33. Similarity Weight Calculation: weights from “nepal” to nepal, medical, center are 5, 3, 1
  34. Similarity Weight Calculation: weights from “metal” to nepal, medical, center are 3, 4, 2, which completes the matrix:

                  nepal  medical  center
        center      1       1       6
        nepal       5       3       1
        metal       3       4       2

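  35. Bipartite Graph: database keywords (nepal, medical, center) on one side and proposed tokens (center, nepal, metal) on the other, with the similarity weights as edge weights. The weights 6, 5 and 4 belong to the pairs center/center, nepal/nepal and metal/medical.
  36. Sørensen Similarity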
  38. Final Ranked List
  39. Dataset
      • 111160 registered company names
      • 106299 unique reg. ID / company names
      • 16326 words in the English-Nepali dictionary
      • 144 British-American words for transformation
  40. Result Analysis: Number of Tokens vs Computation Time

        Tokens   i5 CPU (sec)   Dual Core CPU (sec)
        1        1.664          8.959
        2        2.204          13.315
        3        11.952         39.994
        4        37.743         107.498

  41. Limitations
      • Stemming sometimes produces incorrect results if the input contains a Nepali word
      • The English-Nepali dictionary does not contain enough words
      • Tokenization is based on whitespace and hyphens only
      • Comparison is not phonetic
  42. Future Enhancements
      • Use of a taxonomy for classifying the tokens
      • Use of weighting measures to assign weights to tokens
      • Implementation of faster searching methods
      • Integration of phonetic similarity measures
  43. Thank You
      • Gaurav Kumar Goyal (16214)
      • Janardan Chaudhary (16216)
      • Nimesh Mishra (16221)
      • Sanat Maharjan (16230)
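
The preprocessing steps on slides 10 to 14 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup tables `BRITISH_TO_AMERICAN` and `STOPWORDS` are hypothetical stand-ins for the project's own lists, `naive_stem` is a toy plural stripper rather than the stemmer the authors used, and the sketch splits the string before removing stopwords for simplicity. Splitting on whitespace and hyphens follows the limitation noted on slide 41.

```python
import re

# Hypothetical lookup tables (assumptions for illustration only).
# The slides mention 144 British-American pairs and a predefined stopword list.
BRITISH_TO_AMERICAN = {"centre": "center", "colour": "color"}
STOPWORDS = {"industries", "company", "pvt", "ltd"}


def naive_stem(token):
    """Toy stemmer: strip a single plural 's'. Stands in for a real stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(name):
    """Downcase, split, transform, drop stopwords and stem a company name."""
    lowered = name.lower()                                  # downcasting
    raw_tokens = [t for t in re.split(r"[\s\-]+", lowered) if t]
    tokens = []
    for token in raw_tokens:
        token = token.strip(".")                            # "pvt." -> "pvt"
        token = BRITISH_TO_AMERICAN.get(token, token)       # transformation
        if token and token not in STOPWORDS:                # stopword removal
            tokens.append(naive_stem(token))                # stemming
    return tokens


if __name__ == "__main__":
    print(preprocess("Centre Nepal Metals Industries"))
```

With these assumed tables, the slide's example input yields the tokens center, nepal and metal, matching the stemmed tokens on slide 14.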

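The three comparison steps on slide 28 can be sketched as follows, using the tokens from slides 29 to 35. Several parts are assumptions rather than the authors' method: the similarity weight is taken as the longer token's length minus the Levenshtein distance (this reproduces every value on slides 32 to 34, but the slides do not state the formula), a brute-force search over permutations stands in for the Hungarian algorithm (same optimum, but only practical for a handful of tokens), and the Sørensen-style score divides twice the matched weight by the total character count of both token lists.

```python
from itertools import permutations


def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity_weight(a, b):
    """Assumed weight: longer length minus distance (fits slides 32-34)."""
    return max(len(a), len(b)) - levenshtein(a, b)


def best_matching(proposed, existing):
    """Maximum-weight one-to-one pairing, found by brute force.

    Stand-in for the Hungarian algorithm. If `proposed` is the longer
    list, the two lists are swapped and so are the returned pairs.
    """
    if len(proposed) > len(existing):
        proposed, existing = existing, proposed
    best_total, best_pairs = -1, []
    for perm in permutations(existing, len(proposed)):
        pairs = list(zip(proposed, perm))
        total = sum(similarity_weight(a, b) for a, b in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs


def sorensen_score(proposed, existing):
    """Assumed Sørensen-style score: 2 * matched weight / total characters."""
    total, _ = best_matching(proposed, existing)
    size = sum(map(len, proposed)) + sum(map(len, existing))
    return 2 * total / size if size else 0.0


if __name__ == "__main__":
    proposed = ["center", "nepal", "metal"]
    existing = ["nepal", "medical", "center"]
    print(best_matching(proposed, existing))
    print(round(100 * sorensen_score(proposed, existing), 1))
```

Under these assumptions, two identical token lists score exactly 1.0, which lines up with the rule on slide 2 that a name is rejected at 100% similarity. For larger token counts, a polynomial-time assignment solver would replace the permutation search.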