Name conflict resolution

A project on detecting and resolving conflicts between a proposed company name and identical or similar company names already in the database.

  • Downcasting, also referred to here as type refinement, is the act of converting the input from uppercase letters to lowercase. It ensures that differences in letter case alone cannot make two otherwise identical company names look like distinct names.
  • Transformation is the conversion of British English words to their American English equivalents. It is done to avoid generating unwanted or conflicting keywords. Our dictionary consists of around 130 commonly used words that are converted from British English to American English when found.
  • Stopword removal removes the words that are considered similar or unimportant according to the Office of the Company Registrar.
  • Stemming is the process of reducing a word, such as a plural form, to its root or a simpler form. It is often used in text-processing applications. There are many different approaches to stemming, each with its own design goals; some are aggressive, reducing words to the smallest root possible. Here, stemming is done with the help of a morphological analyzer so that the results are English dictionary words. For example, words like “services” and “metals” are reduced to the simpler singular forms “service” and “metal”. We used stemming to obtain dictionary-based root words, which simplified the matching process.
  • Translation is the rendering of the meaning of a source-language text by means of an equivalent target-language text. In this process, each keyword is first matched against the English dictionary, and the matched words are then mapped through the English-Nepali dictionary provided by Madan Puraskar Pustakalaya to obtain the equivalent Nepali text. Unmatched words are simply placed alongside the translated tokens. For example, the words “nepal” and “metal” are mapped onto the dictionary to get “नेपाल” and “धातु”.
  • Transliteration is the conversion of text from one script to another. To transliterate a Nepali word into English letters, we used dictionary mapping to map each Nepali syllable to its English spelling. In the translation example above, the words “नेपाल” and “धातु” are transliterated to “Nepal” and “dhatu” and then added to the pool of keywords for further processing.
  • The final token list is obtained from the stemming and transliteration steps. Unique tokens are taken after double metaphone comparison.
  • A database query is constructed from the final token list and matched against the database using MySQL's built-in LIKE operator with patterns of the form %token%.
  • Random Selection
  • Spat(Steel)  SpitPaperMill  Paper Mill16326
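
The preprocessing steps described in the notes above (downcasting, transformation, stopword removal, tokenization, stemming) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two small word lists are hypothetical stand-ins for the real transformation dictionary and the registrar's stopword list, the `stem` function is a naive plural stripper standing in for the morphological analyzer, and the sketch tokenizes before the dictionary steps for simplicity.

```python
from __future__ import annotations

import re

# Hypothetical sample data: the real British/American dictionary and the
# registrar's stopword list are not reproduced in the slides.
BRITISH_TO_AMERICAN = {"centre": "center", "colour": "color"}
STOPWORDS = {"industries", "pvt", "ltd", "company"}


def stem(token: str) -> str:
    """Naive plural stripper (an assumption, not a morphological analyzer)."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def preprocess(name: str) -> list[str]:
    lowered = name.lower()  # downcasting
    # tokenization on whitespace and hyphens, dropping stray punctuation
    tokens = [t.strip(".,") for t in re.split(r"[\s\-]+", lowered)]
    tokens = [t for t in tokens if t]
    # transformation: British spelling to American spelling
    tokens = [BRITISH_TO_AMERICAN.get(t, t) for t in tokens]
    # stopword removal
    tokens = [t for t in tokens if t not in STOPWORDS]
    # stemming
    return [stem(t) for t in tokens]


print(preprocess("Centre Nepal Metals Industries"))
```

With the sample lists above, the input from the slides reduces to the tokens center, nepal and metal.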

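One way the LIKE-based lookup mentioned in the notes might be assembled is sketched below. The table name `companies` and column name `name` are hypothetical, and joining the token patterns with OR is an assumption (it is consistent with the sample results in the slides, where each returned name contains at least one token). The `%s` placeholders follow the parameter style used by common Python MySQL drivers, so the token values are passed separately rather than pasted into the SQL string.

```python
def build_like_query(tokens, table="companies", column="name"):
    """Build a parameterized query matching names that contain any token.

    `table` and `column` are hypothetical names used for illustration.
    Returns (sql, params) suitable for cursor.execute(sql, params).
    """
    if not tokens:
        raise ValueError("at least one token is required")

    def escape(token):
        # Escape LIKE wildcards so that tokens are matched literally.
        return (token.replace("\\", "\\\\")
                     .replace("%", "\\%")
                     .replace("_", "\\_"))

    where = " OR ".join(f"{column} LIKE %s" for _ in tokens)
    sql = f"SELECT {column} FROM {table} WHERE {where}"
    params = [f"%{escape(t)}%" for t in tokens]
    return sql, params


sql, params = build_like_query(["dhatu", "metal", "nepal", "center", "kendra"])
print(sql)
print(params)
```
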
Transcript

  • 1. Name Conflict Resolution for Company Registration 8/30/2013
  • 2. Introduction: •System to automate the company registration process •Compares the company names using string matching algorithms •Names are ranked according to their similarity percentage •A name is rejected if the similarity score is 100%
  • 3. Objectives: •To develop a system to resolve naming conflicts •To find names similar to the name proposed by the user •To rank the existing names matched against the proposed name
  • 4. Methodology: •Building Base Dictionary •Keyword Generation •Finding Possible Matches •Finding Duplicates •Finding Ranks
  • 5. System Design
  • 7. User Input: “Centre Nepal Metals Industries”
  • 9. Preprocessing Engine
  • 10. Downcasting: the act of converting the input from uppercase letters to lowercase. “Centre Nepal Metals Industries” → “centre nepal metals industries”
  • 11. Transformation: conversion of British English words to American English words. “centre nepal metals industries” → “center nepal metals industries”
  • 12. Stopword Removal: the process of removing predefined stopwords from the string. “center nepal metals industries” → “center nepal metals”
  • 13. Tokenization: the process of reducing a large string to a set of tokens. “center nepal metals” → center, nepal, metals
  • 14. Stemming: the process of reducing a word to a root, or simpler form. Tokens: metals, center, nepal → Stemmed Tokens: metal, center, nepal
  • 16. Translation: conversion of the meaning of a source-language text by means of an equivalent target-language text. Stemmed Tokens (metal, center, nepal) → Translated Tokens
  • 17. Transliteration: conversion of a text from one script to another. Translated Tokens → Transliterated Tokens: dhatu, kendra, nepal
  • 19. Final Token List: dhatu, metal, nepal, center, kendra
  • 21. Database Query using Final Token List: •nepal medical centre pvt. ltd. •nepal dhatu company •metal nepal pvt. ltd. •enter nepal •nepal metal industries •dhatu sankalan kendra
  • 22. From Database Query Result: Nepal Medical Centre pvt. ltd.
  • 24. Database Generated Keywords: center, nepal, medical
  • 26. Permutation: •kendra nepal dhatu •kendra nepal metal •center nepal dhatu •center nepal metal
  • 28. Comparison in 3 steps: 1. Levenshtein Distance Calculation 2. Optimized Maximal Similarity using Hungarian Algorithm 3. Sørensen Index to Calculate Similarity %
  • 29. Levenshtein Distance Calculation: center vs. (nepal, medical, center) = 5, 6, 0
  • 30. Levenshtein Distance Calculation: nepal vs. (nepal, medical, center) = 0, 4, 5; previous row: center = 5, 6, 0
  • 31. Levenshtein Distance Calculation: metal vs. (nepal, medical, center) = 2, 3, 4; previous rows: center = 5, 6, 0; nepal = 0, 4, 5
  • 32. Similarity Weight Calculation: center vs. (nepal, medical, center) = 1, 1, 6
  • 33. Similarity Weight Calculation: nepal vs. (nepal, medical, center) = 5, 3, 1; previous row: center = 1, 1, 6
  • 34. Similarity Weight Calculation: metal vs. (nepal, medical, center) = 3, 4, 2; previous rows: center = 1, 1, 6; nepal = 5, 3, 1
  • 35. Bipartite Graph: tokens nepal, medical, center on one side and center, nepal, metal on the other; edge labels 5, 6, 3, 2, 1, 1, 4 and 6, 5, 4
  • 36. Sørensen Similarity
  • 38. Final Ranked List
  • 39. Dataset: •111160 Registered Company Names •106299 Unique Reg. ID / Company Names •16326 Words in English-Nepali Dictionary •144 British-American Words for Transformation
  • 40. Result Analysis: Number of Tokens vs. Computation Time. Time to compute (sec) on I5 CPU: 1.664 (1 token), 2.204 (2 tokens), 11.952 (3 tokens), 37.743 (4 tokens). Time to compute (sec) on Dual Core CPU: 8.959 (1 token), 13.315 (2 tokens), 39.994 (3 tokens), 107.498 (4 tokens).
  • 41. Limitations: •Stemming sometimes produces incorrect results if the input contains a Nepali word •The English-Nepali dictionary does not contain enough words •Tokenization is based on whitespace and hyphens only •Comparison is not phonetic-based
  • 42. Future Enhancements: •Use of a taxonomy for classifying the tokens •Use of weighting measures to assign weights to tokens •Implementation of faster searching methods •Integration of phonetic-based similarity measures
  • 43. Thank You. Gaurav Kumar Goyal (16214), Janardan Chaudhary (16216), Nimesh Mishra (16221), Sanat Maharjan (16230)
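
The permutation and three-step comparison from the slides can be sketched as below. Several points are assumptions rather than facts from the source: the similarity weight is taken as max(len(a), len(b)) minus the Levenshtein distance, which reproduces the weight values shown on the slides; the optimal matching is found by brute force over permutations instead of the Hungarian algorithm (same optimum, practical only for a handful of tokens); and the Sørensen score is computed as twice the matched weight divided by the total number of characters in both token lists, since the slides do not show the exact formula. All function names are hypothetical.

```python
from itertools import permutations, product


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def weight(a, b):
    # Assumed weight formula; it matches the numbers on the slides.
    return max(len(a), len(b)) - levenshtein(a, b)


def best_assignment_weight(proposed, existing):
    """Maximum total weight of a one-to-one token matching (brute force)."""
    short, long_ = sorted((proposed, existing), key=len)
    return max(sum(weight(s, l) for s, l in zip(short, perm))
               for perm in permutations(long_, len(short)))


def sorensen_similarity(proposed, existing):
    # Assumed Sorensen-style ratio: 2 * matched weight / total characters.
    total = sum(map(len, proposed)) + sum(map(len, existing))
    if total == 0:
        return 0.0
    return 2 * best_assignment_weight(proposed, existing) / total


def candidate_names(alternatives):
    """Expand per-position alternatives into every candidate token list."""
    return [list(combo) for combo in product(*alternatives)]


existing = ["nepal", "medical", "center"]
candidates = candidate_names([["kendra", "center"], ["nepal"], ["dhatu", "metal"]])
for tokens in candidates:
    print(tokens, round(100 * sorensen_similarity(tokens, existing), 1))
```

Under these assumptions, identical token lists score exactly 100%, which fits the rule in the introduction that a name is rejected when the similarity score is 100%.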