Building similarentityrecognizerv1
- 2. Agenda
Similar Entity Detection Scenarios
Challenges, Techniques and Algorithms
Semantic applicability
Big Data Challenges and Solution
Sample Results
2 © 2012 WIPRO LTD | WWW.WIPRO.COM | INTERNAL
- 3. Scenario 1 - Fraud Detection in Insurance Claims
P1: Tom Harold makes a claim on Policy 1
P2: Tom H makes a claim on Policy 2
P3: T Harold makes a claim on Policy 3
Are P1, P2 and P3 the same?
Is there fraud?
- 4. Scenario 2 – Cross Sell Potential Detection in Insurance
Does Tom Harold hold a policy in any other system? Which policies does he hold? Is there potential for cross-sell?
Tom Harold holds Policy 1 in System 1
He is a high net-worth customer
- 5. Example features for different people
Person 1
• First Name – Tom
• Middle Name – (blank)
• Last Name – Harold
• Date of Birth – 20/10/1987
• Address – 1, MG Road, Bangalore – 56
Person 2
• First Name – Tom
• Middle Name – Harry
• Last Name – Harold
• Date of Birth – 20/10/1987
• Address – 1, Mahatma Gandhi Rd, Bangalore - 560056
Person 3
• First Name – Tom
• Last Name – Harold
• Date of Birth – 20/10/1988
• Address – 1, Mahatma Gandhi Rd, Bangalore - 560056
Questions :
• Is Person 1 the same as Person 2?
• Is Person 2 the same as Person 3?
• Is Person 1 the same as Person 3?
- 6. Similar Entity Detection Challenges
A quick manual inspection of the Person 1 and Person 2 feature data is enough to conclude that Person 1 is the same as Person 2
Not so trivial for a machine
Weights must be arrived at for the different features
Code is needed to decide whether the values of a feature for Person 1 and Person 2 are similar or different
A simple string comparison is not sufficient: is MG Road the same as Mahatma Gandhi Rd?
Actual data will have spelling mistakes, missing data and wrongly entered data. For example, 20/10/1987 could be entered as 20/10/1988, or the field itself could be empty
Hence other techniques, such as machine learning and semantic techniques, are needed
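The per-feature comparison described above can be sketched roughly as follows. This is only an illustration, not the presenters' code: the abbreviation table, the function names and the choice of `difflib.SequenceMatcher` are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real system would need a far larger one.
ABBREVIATIONS = {"mg": "mahatma gandhi", "rd": "road"}

def normalize_address(text):
    """Lower-case, drop punctuation and expand the known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)

def string_similarity(a, b):
    """Return a ratio in [0, 1], or None when either value is missing."""
    if not a or not b:
        return None
    return SequenceMatcher(None, a, b).ratio()

def date_similarity(a, b):
    """Fraction of matching characters in two DD/MM/YYYY strings."""
    if not a or not b:
        return None
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

addr_1 = "1, MG Road, Bangalore"
addr_2 = "1, Mahatma Gandhi Rd, Bangalore"

# A plain string comparison sees two different addresses ...
raw_score = string_similarity(addr_1, addr_2)
# ... while comparing the normalized forms treats them as identical.
normalized_score = string_similarity(normalize_address(addr_1),
                                     normalize_address(addr_2))

# A one-digit typo in the year still yields a high, but not perfect, score.
dob_score = date_similarity("20/10/1987", "20/10/1988")
# An empty field is reported as None rather than as a mismatch.
missing_score = date_similarity("", "20/10/1987")
```

Returning `None` for a missing value, rather than 0, keeps "unknown" distinct from "different"; how a model should treat that case is a separate design decision.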
- 7. Similar Entity Detection Methodology
Given two entities, how can we say whether they are the same?
Step 1
• Identify relevant features
Step 2
• Extract values for the features
Step 3
• Create a model which can classify the two entities as same or different
Step 4
• Use the model to classify future customer pairs
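Steps 1 and 2 can be illustrated by turning a pair of records into a vector with one similarity value per feature, which a model (Steps 3 and 4) could then consume. This is a minimal sketch: the feature names, the `MISSING` sentinel and the use of `difflib.SequenceMatcher` are assumptions, not part of the original slides.

```python
from difflib import SequenceMatcher

# Step 1: the features judged relevant (taken from the example records).
FEATURES = ["first_name", "last_name", "date_of_birth", "address"]

# Assumed sentinel meaning "at least one of the two values is absent".
MISSING = -1.0

def pair_features(rec_a, rec_b):
    """Step 2: one similarity value per feature for a pair of records."""
    vector = []
    for name in FEATURES:
        a, b = rec_a.get(name), rec_b.get(name)
        if not a or not b:
            vector.append(MISSING)
        else:
            vector.append(SequenceMatcher(None, a.lower(), b.lower()).ratio())
    return vector

person_2 = {
    "first_name": "Tom",
    "last_name": "Harold",
    "date_of_birth": "20/10/1987",
    "address": "1, Mahatma Gandhi Rd, Bangalore - 560056",
}
person_3 = {
    "first_name": "Tom",
    "last_name": "Harold",
    "date_of_birth": "20/10/1988",
    "address": "1, Mahatma Gandhi Rd, Bangalore - 560056",
}

# Identical names and addresses score 1.0; the date of birth scores lower.
vector = pair_features(person_2, person_3)

# A record that lacks a feature produces the MISSING sentinel for it.
partial = pair_features({"first_name": "Tom"}, person_2)
```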
- 8. Supervised Learning model
Inputs are labeled, pre-identified customer pairs
Values of the different features are provided for each customer pair
Each customer pair is tagged as Same, Probably Same or Different
A supervised algorithm is chosen (the actual algorithm depends on the data characteristics)
The tagged data is fed as input
Output is the model
The model will classify a new customer pair into one of the identified categories
Model quality can be measured using precision, recall, accuracy and F-score
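As a rough illustration of these measures, the sketch below computes accuracy plus per-class precision, recall and F1 for the three tags. The label lists are made-up illustrative data, not results from the presentation, and the function names are hypothetical.

```python
def accuracy(actual, predicted):
    """Share of pairs whose predicted tag equals the actual tag."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def per_class_metrics(actual, predicted, label):
    """Precision, recall and F1 for one tag, treating it as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)
    fp = sum(a != label and p == label for a, p in pairs)
    fn = sum(a == label and p != label for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up tags for six customer pairs.
actual = ["Same", "Same", "Same", "Probably Same", "Different", "Different"]
predicted = ["Same", "Same", "Probably Same", "Probably Same", "Different", "Same"]

overall = accuracy(actual, predicted)
precision, recall, f1 = per_class_metrics(actual, predicted, "Same")
```

Reporting the measures per tag matters in a fraud setting, because a model can reach high overall accuracy while still missing most of the rare "Same" pairs.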
- 9. Supervised Learning model example
• Live example of how to classify a given set of customer records using
Supervised Learning
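The live example itself is not part of this transcript. As a stand-in, here is a minimal sketch of supervised classification using a 1-nearest-neighbour rule over hypothetical similarity vectors. The algorithm choice, the training values and the feature order (name, date of birth, address) are assumptions for illustration only.

```python
import math

# Hypothetical labeled pairs: (name_sim, dob_sim, address_sim) -> tag.
TRAINING = [
    ((1.00, 1.00, 0.95), "Same"),
    ((0.95, 1.00, 0.90), "Same"),
    ((0.90, 0.90, 0.60), "Probably Same"),
    ((0.85, 0.90, 0.70), "Probably Same"),
    ((0.30, 0.20, 0.10), "Different"),
    ((0.40, 0.10, 0.20), "Different"),
]

def classify(vector, training=TRAINING):
    """Return the tag of the closest labeled pair (Euclidean distance)."""
    _, label = min(training, key=lambda item: math.dist(item[0], vector))
    return label

# New, unlabeled customer pairs expressed as similarity vectors.
close_match = classify((0.98, 1.00, 0.92))
partial_match = classify((0.88, 0.90, 0.65))
no_match = classify((0.35, 0.15, 0.15))
```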
- 10. Un-Supervised Learning model
In many cases there is no pre-labeled data
In this case we need to choose an unsupervised learning model
The model will automatically detect patterns in the data and group the data points into clusters
Any newly added customer pair is placed in the right cluster
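One way to picture this is a small k-means clustering over an overall similarity score per customer pair. The slides do not name a clustering algorithm, so k-means, the one-dimensional score and the sample values below are all assumptions.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Plain k-means on scalars (k >= 2); centroids start evenly spread."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

def assign(value, centroids):
    """Index of the cluster whose centroid is closest to the value."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))

# Made-up overall similarity scores for six unlabeled customer pairs.
scores = [0.05, 0.10, 0.15, 0.85, 0.90, 0.95]
centroids = kmeans_1d(scores)

# A newly added pair is placed in the cluster with the nearest centroid.
new_pair_cluster = assign(0.80, centroids)
other_pair_cluster = assign(0.20, centroids)
```

The clusters carry no names of their own; deciding that the high-score cluster means "same" is still a human interpretation step.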
- 11. Un-Supervised Learning model example
• Live example of how to classify a given set of customer records using
Un-Supervised Learning
- 12. Continuous Learning
In many cases there will be a small set of labeled data and a very large set of unlabeled data
An initial model will be created using the small labeled data set
As more labeled data becomes available, the model will evolve through continuous learning and become better at the classification
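The idea of a model that evolves as labels arrive can be sketched with an incremental nearest-centroid classifier, which updates running per-tag averages instead of retraining from scratch. The class, its method names and the sample vectors are hypothetical; the slides do not say which incremental technique was used.

```python
class IncrementalCentroidClassifier:
    """Nearest-centroid classifier that can absorb one labeled pair at a time."""

    def __init__(self):
        self.sums = {}    # tag -> per-feature running sums
        self.counts = {}  # tag -> number of labeled pairs seen

    def learn(self, vector, label):
        """Fold a newly labeled pair into the running totals for its tag."""
        if label not in self.sums:
            self.sums[label] = [0.0] * len(vector)
            self.counts[label] = 0
        self.sums[label] = [s + v for s, v in zip(self.sums[label], vector)]
        self.counts[label] += 1

    def predict(self, vector):
        """Return the tag whose centroid is closest to the vector."""
        def squared_distance(label):
            centroid = [s / self.counts[label] for s in self.sums[label]]
            return sum((c - v) ** 2 for c, v in zip(centroid, vector))
        return min(self.sums, key=squared_distance)

model = IncrementalCentroidClassifier()

# Initial model built from a very small labeled set.
model.learn((1.0, 1.0), "Same")
model.learn((0.2, 0.1), "Different")
before = model.predict((0.7, 0.6))

# Later, a newly labeled pair introduces a tag the model had not seen.
model.learn((0.7, 0.65), "Probably Same")
after = model.predict((0.7, 0.6))
```

Here the same customer pair is first tagged "Same" and, once the extra label has been learned, "Probably Same".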
- 13. Semantic Techniques applicability
Semantic similarity scoring for features
• Feature - List of Games played
• Person 1 plays – Racquet Sports
• Person 2 plays – Lawn Tennis
• Using semantic comparison we can see that there is a high similarity between
person 1 and person 2 on the List of Games played feature
Extraction of features from different data sources
• Similar features named differently
Associating customers in different data sources as same or different
• Flexible and easy addition of new relationships
Ease of adding additional data sources
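The games example can be sketched with a tiny hand-written taxonomy and a Wu-Palmer style score, 2 * depth(common ancestor) / (depth(a) + depth(b)). Both the taxonomy and the choice of measure are assumptions made for illustration; the slides do not say which ontology or scoring method was used.

```python
# Hypothetical "is a kind of" taxonomy: child -> parent.
PARENT = {
    "Lawn Tennis": "Racquet Sports",
    "Badminton": "Racquet Sports",
    "Racquet Sports": "Games",
    "Chess": "Board Games",
    "Board Games": "Games",
}

def ancestors(concept):
    """Path from the concept up to the root, starting with the concept itself."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(concept):
    """Number of nodes on the path to the root; the root has depth 1."""
    return len(ancestors(concept))

def semantic_similarity(a, b):
    """Wu-Palmer style score in (0, 1]; assumes both concepts share a root."""
    path_b = set(ancestors(b))
    common = next(node for node in ancestors(a) if node in path_b)
    return 2 * depth(common) / (depth(a) + depth(b))

# Person 1 plays Racquet Sports, Person 2 plays Lawn Tennis.
related = semantic_similarity("Racquet Sports", "Lawn Tennis")
# For contrast, two games that only share the root concept.
unrelated = semantic_similarity("Lawn Tennis", "Chess")
```

A plain string comparison of "Racquet Sports" and "Lawn Tennis" would find almost nothing in common, whereas the taxonomy-based score is high because one concept is the direct parent of the other.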
- 14. Large Data handling challenges
Entity similarity is a pairwise operation
If there are n entities, then there are n*(n-1) ordered pairs to compare (n*(n-1)/2 if each unordered pair is compared only once)
Also, within each comparison, every feature pair has to be compared
These are highly time-consuming operations
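To make the growth concrete, the sketch below counts both ordered pairs, n*(n-1), and unordered pairs, n*(n-1)/2, and multiplies by the number of features compared per pair. The function name and the sample sizes are illustrative assumptions.

```python
from itertools import combinations

def comparison_counts(n, features=1):
    """Return (ordered pairs, unordered pairs, feature comparisons)."""
    ordered = n * (n - 1)
    unordered = ordered // 2
    # Each unordered pair needs one comparison per feature.
    return ordered, unordered, unordered * features

# Four entities: enumerate the unordered pairs explicitly.
entities = ["P1", "P2", "P3", "P4"]
pairs = list(combinations(entities, 2))
small = comparison_counts(len(entities), features=5)

# One million customers with five features each.
large = comparison_counts(1_000_000, features=5)
```

For one million customers this already means roughly half a trillion pairs, which is why the work is quadratic rather than linear in the number of entities.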
- 15. Large Data handling ideas
Use of Apache Mahout
• Split the comparison across m different machines
• Each machine now handles n/m customers
• Nearly an m-times speed-up
Batch time tagging
Incremental comparisons and addition of new customer pairs
• Reduce run-time response to find similar entities
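The splitting idea can be sketched without any cluster framework. The code below does not use Apache Mahout; it only shows, under the assumption that the unit of work is a customer pair, how the pairs could be dealt out into m roughly equal work lists. The function name is hypothetical.

```python
from itertools import combinations

def partition_pairs(entity_ids, m):
    """Deal the unordered pairs round-robin into m roughly equal work lists."""
    buckets = [[] for _ in range(m)]
    for index, pair in enumerate(combinations(entity_ids, 2)):
        buckets[index % m].append(pair)
    return buckets

# Ten customers give 45 pairs; spread them over four workers.
buckets = partition_pairs(range(10), 4)
sizes = [len(bucket) for bucket in buckets]
```

Because the work lists differ in size by at most one pair, each worker does about 1/m of the comparisons, which is where the near m-times speed-up comes from when per-pair cost is uniform.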
- 16. Sample Metrics from our experiments
• Discussion on the sample metrics from our experiments
• Learnings from these experiments
– Which algorithm and method was more apt under different circumstances
- 17. Thank You