Name conflict resolution

Name Conflict Resolution for Company Registration
8/30/2013

Introduction
•System to automate the company registration process
•Compares company names using string matching algorithms
•Names are ranked according to their similarity percentage
•A name is rejected if the similarity score is 100%

Objectives
•To develop a system that resolves naming conflicts.
•To find names similar to the name proposed by the user.
•To rank the proposed name against the existing names it matches.

Methodology
•Building Base Dictionary
•Keyword Generation
•Finding Possible Matches
•Finding Duplicates
•Finding Ranks

User Input
“Centre Nepal Metals Industries”

Preprocessing Engine

Downcasting
• converting uppercase letters in the input to lowercase
“Centre Nepal Metals Industries” → “centre nepal metals industries”

Transformation
• conversion of British English words to American English words
“centre nepal metals industries” → “center nepal metals industries”
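
Downcasting and transformation can be sketched together as below. This is only an illustration: the table BRITISH_TO_AMERICAN and the function names are assumptions, and only the pair "centre" → "center" comes from the slides (the real table holds 144 pairs whose contents are not shown).

```python
# Hypothetical excerpt of the British-to-American lookup table.
# Only "centre" -> "center" appears in the slides; the other pairs
# are illustrative assumptions.
BRITISH_TO_AMERICAN = {
    "centre": "center",
    "colour": "color",
    "labour": "labor",
}


def downcast(name: str) -> str:
    """Convert every uppercase letter in the input to lowercase."""
    return name.lower()


def transform(name: str) -> str:
    """Replace each British spelling found in the table with its American form."""
    return " ".join(BRITISH_TO_AMERICAN.get(word, word) for word in name.split())


print(transform(downcast("Centre Nepal Metals Industries")))
```

Downcasting runs first so that the lookup table only needs lowercase keys.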

Stopword Removal
• process of removing predefined stopwords from the string
“center nepal metals industries” → “center nepal metals”
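
A minimal sketch of stopword removal, assuming a small predefined set. The slides show only "industries" being dropped; "pvt." and "ltd." are inferred from the database keyword example, and the actual stopword list is not given.

```python
# Assumed stopword set: "industries" is shown in the slides; "pvt." and
# "ltd." are inferred, and the real list is likely longer.
STOPWORDS = {"industries", "pvt.", "ltd."}


def remove_stopwords(name: str) -> str:
    """Drop every whitespace-separated word that is a predefined stopword."""
    return " ".join(word for word in name.split() if word not in STOPWORDS)


print(remove_stopwords("center nepal metals industries"))
```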

Tokenization
• process of splitting a string into a set of tokens
“center nepal metals” → center, nepal, metals
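
Tokenization on whitespace and hyphens (the two delimiters named under Limitations) could look like the sketch below. The function name and the second sample string are assumptions for illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split on runs of whitespace or hyphens and discard empty pieces."""
    return [token for token in re.split(r"[\s-]+", text) if token]


print(tokenize("center nepal metals"))
print(tokenize("trans-himalayan traders"))  # made-up sample with a hyphen
```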

Stemming
• process of reducing a word to its root or a simpler form
Tokens → Stemmed Tokens
metals → metal
center → center
nepal → nepal
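
The slides do not name the stemmer used. As a stand-in, the sketch below strips plural suffixes only; it is a deliberately naive assumption, not the project's algorithm, and a full stemmer such as Porter's would apply many more rules.

```python
def naive_stem(token: str) -> str:
    """Very rough plural stripping, used here only to illustrate stemming."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"          # industries -> industry
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]                # metals -> metal
    return token                         # center, nepal stay unchanged


print([naive_stem(t) for t in ["metals", "center", "nepal"]])
```

Rules written for English suffixes can mangle Nepali words that happen to end the same way, which is the failure mode the Limitations slide mentions.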

Translation
• conveying the meaning of a source-language text by means of an equivalent target-language text
Stemmed Tokens: metal, center, nepal

Transliteration
• conversion of a text from one script to another
Transliterated Tokens: dhatu, kendra, nepal
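
Translation followed by transliteration can be sketched as two dictionary lookups. Everything here is an assumption for illustration: the Devanagari spellings, the word-level romanization table, and the function names. The project's 16326-word English-Nepali dictionary is not shown in the slides.

```python
# Assumed Devanagari forms of the three example words.
DHATU, KENDRA, NEPAL = "धातु", "केन्द्र", "नेपाल"

# Toy English -> Nepali dictionary (hypothetical excerpt).
EN_TO_NE = {"metal": DHATU, "center": KENDRA, "nepal": NEPAL}

# Toy word-level romanization table (hypothetical; a real system would
# more likely transliterate character by character).
NE_TO_ROMAN = {DHATU: "dhatu", KENDRA: "kendra", NEPAL: "nepal"}


def expand(tokens: list[str]) -> list[str]:
    """Keep each token and add its romanized Nepali translation if it is new."""
    final: list[str] = []
    for token in tokens:
        if token not in final:
            final.append(token)
        nepali = EN_TO_NE.get(token)
        roman = NE_TO_ROMAN.get(nepali) if nepali is not None else None
        if roman is not None and roman not in final:
            final.append(roman)
    return final


print(expand(["metal", "center", "nepal"]))
```

Keeping both the original token and its romanized translation lets a later lookup match names registered in either language.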

Final Token List
metal, dhatu, center, kendra, nepal

Database Query using Final Token List
•nepal medical centre pvt. ltd.
•nepal dhatu company
•metal nepal pvt. ltd.
•enter nepal
•nepal metal industries
•dhatu sankalan kendra
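
One way to picture the query is as a lookup of every registered name whose stored keywords share at least one token with the final token list. The sketch below uses an in-memory dictionary instead of a database; the structure keywords_by_name, its keyword lists, and the third company name are assumptions made for illustration.

```python
def find_candidates(final_tokens, keywords_by_name):
    """Return the names whose stored keywords overlap the final token list."""
    wanted = set(final_tokens)
    return [name for name, keywords in keywords_by_name.items()
            if wanted & set(keywords)]


# Hypothetical pre-computed keywords per registered name.
keywords_by_name = {
    "nepal medical centre pvt. ltd.": ["nepal", "medical", "center"],
    "dhatu sankalan kendra": ["dhatu", "sankalan", "kendra"],
    "everest trading house": ["everest", "trading", "house"],  # made-up name
}

final_tokens = ["metal", "dhatu", "center", "kendra", "nepal"]
print(find_candidates(final_tokens, keywords_by_name))
```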

Query Result from Database
Nepal Medical Centre pvt. ltd.

Database Generated Keywords
nepal, medical, center

Permutation
•kendra nepal dhatu
•kendra nepal metal
•center nepal dhatu
•center nepal metal
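
The four variants above are every combination of one alternative per token position. A minimal sketch using itertools.product follows; the grouping of alternatives per position is an assumption about how the list was produced.

```python
from itertools import product

# One list of alternatives per token position of the proposed name.
alternatives = [
    ["kendra", "center"],
    ["nepal"],
    ["dhatu", "metal"],
]

# Cartesian product: pick one alternative for each position.
variants = [" ".join(choice) for choice in product(*alternatives)]
for variant in variants:
    print(variant)
```

The number of variants is the product of the alternatives per position, so it grows quickly as more tokens have translations.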

Comparison
3 steps
1. Levenshtein Distance Calculation
2. Optimized Maximal Similarity using Hungarian Algorithm
3. Sørensen Index to Calculate Similarity %

Levenshtein Distance Calculation
(rows: tokens of the proposed name; columns: database keywords)
          nepal   medical   center
center      5        6         0
nepal       0        4         5
metal       2        3         4
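
The distances in this matrix can be reproduced with the standard dynamic-programming form of Levenshtein distance, sketched below. The function name is an assumption; the slides do not show the implementation used.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


user_tokens = ["center", "nepal", "metal"]
db_keywords = ["nepal", "medical", "center"]
distances = [[levenshtein(u, k) for k in db_keywords] for u in user_tokens]
for row in distances:
    print(row)
```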

Similarity Weight Calculation
(rows: tokens of the proposed name; columns: database keywords)
          nepal   medical   center
center      1        1         6
nepal       5        3         1
metal       3        4         2
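
The slides do not state how a similarity weight is derived from a distance. The values shown are consistent with weight = length of the longer token minus the Levenshtein distance, so the sketch below assumes that formula; the function name is also an assumption.

```python
user_tokens = ["center", "nepal", "metal"]
db_keywords = ["nepal", "medical", "center"]

# Levenshtein distances taken from the distance matrix in the slides.
distances = [
    [5, 6, 0],
    [0, 4, 5],
    [2, 3, 4],
]


def similarity_weights(rows, cols, dist):
    """Assumed formula: max(len(row token), len(column token)) - distance."""
    return [
        [max(len(r), len(c)) - dist[i][j] for j, c in enumerate(cols)]
        for i, r in enumerate(rows)
    ]


weights = similarity_weights(user_tokens, db_keywords, distances)
for row in weights:
    print(row)
```

Under this formula an exact match scores the full token length, and completely different tokens of equal length score 0.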

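Bipartite Graph
[Figure: bipartite graph connecting the database keywords (nepal, medical, center) to the tokens of the proposed name (center, nepal, metal); each edge is labeled with its similarity weight]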
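
Steps 2 and 3 of the comparison can be sketched as follows. Two caveats: the assignment below is found by brute force over permutations, a simple stand-in for the Hungarian algorithm that returns the same optimum on small square matrices, and the way the matched weight is turned into a percentage is an assumption (a Sørensen-Dice style ratio over total token lengths), since the slides do not give the formula.

```python
from itertools import permutations


def best_assignment(weights):
    """Pair each row with a distinct column so the summed weight is maximal.
    Brute force over all permutations; assumes a square matrix."""
    best_total, best_columns = -1, None
    for columns in permutations(range(len(weights))):
        total = sum(weights[row][col] for row, col in enumerate(columns))
        if total > best_total:
            best_total, best_columns = total, columns
    return best_total, best_columns


def sorensen_percent(matched_weight, size_a, size_b):
    """Sørensen-Dice style ratio 2*|A∩B| / (|A| + |B|), as a percentage.
    Using matched weight and total token lengths is an assumption."""
    return 100.0 * 2 * matched_weight / (size_a + size_b)


user_tokens = ["center", "nepal", "metal"]
db_keywords = ["nepal", "medical", "center"]
weights = [[1, 1, 6], [5, 3, 1], [3, 4, 2]]  # similarity weights from the slides

total, columns = best_assignment(weights)
for row, col in enumerate(columns):
    print(user_tokens[row], "->", db_keywords[col], weights[row][col])

size_user = sum(len(t) for t in user_tokens)
size_db = sum(len(k) for k in db_keywords)
print(round(sorensen_percent(total, size_user, size_db), 1))
```

With this ratio, two identical names score exactly 100%, which matches the rejection rule in the introduction. Brute force grows factorially with the number of tokens, which is why a polynomial-time method such as the Hungarian algorithm is the better choice in practice.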

 111160 Registered Company Names
 106299 Unique Reg. ID / Company Names
 16326 Words in English- Nepali Dictionary
 144 British-American Words for
Transformation
Dataset
Dataset
8/30/2013

Result Analysis
Number of Tokens vs. Computation Time (sec)
Number of Tokens    i5 CPU    Dual Core CPU
1                   1.664     8.959
2                   2.204     13.315
3                   11.952    39.994
4                   37.743    107.498

Limitations
•Stemming sometimes produces incorrect results if the input contains a Nepali word
•The English-Nepali dictionary does not contain enough words
•Tokenization is based on whitespace and hyphens only
•Comparison is not phonetic-based

Future Enhancements
•Use of a taxonomy for classifying the tokens
•Use of weighting measures to assign weights to tokens
•Implementation of faster searching methods
•Integration of phonetic-based similarity measures

Thank You
Gaurav Kumar Goyal 16214
Janardan Chaudhary 16216
Nimesh Mishra 16221
Sanat Maharjan 16230
