Building Similar Entity Recognizers

By Arthi Venkataraman




1   © 2012 WIPRO LTD | WWW.WIPRO.COM | INTERNAL
Agenda
      Similar Entity Detection Scenarios

      Challenges, Techniques and Algorithms

      Semantic applicability

      Big Data Challenges and Solution

      Sample Results


Scenario 1 - Fraud Detection in Insurance Claims




    P1: Tom Harold makes a claim on Policy 1
    P2: Tom H makes a claim on Policy 2
    P3: T Harold makes a claim on Policy 3

    Are P1, P2 and P3 the same?
    Is there FRAUD?




Scenario 2 – Cross Sell Potential Detection in Insurance




    Does Tom Harold hold a policy in any other system?
    Which policies does he hold?
    Is there potential for cross-sell?

    Tom Harold holds Policy 1 in System 1
    He is a high net-worth customer




Example features for different people

    Feature          Person 1                     Person 2                                   Person 3
    First Name       Tom                          Tom                                        Tom
    Middle Name      (blank)                      Harry                                      (not listed)
    Last Name        Harold                       Harold                                     Harold
    Date of Birth    20/10/1987                   20/10/1987                                 20/10/1988
    Address          1, MG Road, Bangalore – 56   1, Mahatma Gandhi Rd, Bangalore - 560056   1, Mahatma Gandhi Rd, Bangalore - 560056

             Questions:
             • Is Person 1 the same as Person 2?
             • Is Person 2 the same as Person 3?
             • Is Person 1 the same as Person 3?
Similar Entity Detection Challenges


    A quick manual inspection of the feature data for Person 1 and Person 2 is enough for a human to conclude
    that Person 1 is the same as Person 2


    This is not so trivial for a machine


    Weightages must be arrived at for the different features


    Code is needed to decide whether the values of a feature for Person 1 and Person 2 are similar or different
    A simple string comparison is not sufficient: is MG Road the same as Mahatma Gandhi Rd?

    Actual data will have spelling mistakes, missing data and wrongly entered data. For example,
    20/10/1987 could be entered as 20/10/1988, or the field itself could be empty


    Hence other techniques are needed, such as machine learning and semantic techniques
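
A minimal sketch of a per-feature comparison, assuming a small hand-made abbreviation table and Python's standard difflib for fuzzy matching. The names ABBREVIATIONS, normalize and feature_similarity are illustrative and do not come from the slides.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real system would need a far larger one.
ABBREVIATIONS = {"mg": "mahatma gandhi", "rd": "road"}


def normalize(text):
    """Lower-case, drop commas and hyphens, and expand known abbreviations."""
    tokens = text.lower().replace(",", " ").replace("-", " ").split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def feature_similarity(a, b):
    """Return a score in [0, 1], or None when either value is missing."""
    if not a or not b:
        return None  # missing data: leave the decision to the caller
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(feature_similarity("MG Road", "Mahatma Gandhi Rd"))   # expanded forms match
print(feature_similarity("20/10/1987", "20/10/1988"))       # one wrong digit
print(feature_similarity("20/10/1987", ""))                 # empty field
```

Returning None for a missing value, rather than 0, keeps "no evidence" separate from "evidence of a mismatch"; how to weight that case is left open here.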




Similar Entity Detection Methodology
        Given two entities, how can we say whether they are the same?


    Step 1   • Identify relevant features


    Step 2   • Extract values for the features


    Step 3   • Create a model which can classify the two entities as same or different

    Step 4   • Use the model to classify future customer pairs
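
The four steps can be sketched roughly as below. This is an illustration only: the feature names, the averaging rule and the 0.8 threshold are assumptions standing in for a trained model, not the method used in the talk.

```python
from difflib import SequenceMatcher

# Step 1: relevant features (names are assumed for this sketch)
FEATURES = ["first_name", "last_name", "date_of_birth", "address"]


def similarity(a, b):
    """Fuzzy string similarity in [0, 1]; missing values score 0 (an assumption)."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(rec1, rec2):
    """Step 2: turn a pair of records into one similarity value per feature."""
    return [similarity(rec1.get(f), rec2.get(f)) for f in FEATURES]


def classify(vector, threshold=0.8):
    """Steps 3 and 4: a fixed-threshold stand-in for a learned model."""
    score = sum(vector) / len(vector)
    return "same" if score >= threshold else "different"


person1 = {"first_name": "Tom", "last_name": "Harold",
           "date_of_birth": "20/10/1987", "address": "1, MG Road, Bangalore"}
person2 = {"first_name": "Tom", "last_name": "Harold",
           "date_of_birth": "20/10/1987", "address": "1, Mahatma Gandhi Rd, Bangalore"}

vector = pair_features(person1, person2)
print(vector, classify(vector))
```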

Supervised Learning model

    Labeled, pre-identified customer pairs are the input data

    Each customer pair carries values for the different features

    Each customer pair is tagged as Same, Probably Same or Different

    A supervised algorithm is chosen (the actual algorithm depends on the data characteristics)

    The tagged data is fed to the algorithm as input

    The output is the model

    The model will classify a new customer pair into one of the identified categories

    Model quality can be measured using Precision, Recall, Accuracy and F-score
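
As a small illustration of the evaluation measures, the sketch below computes precision, recall and F1 for one category, plus overall accuracy, from tagged and predicted labels. The label lists are made-up sample data, not results from the experiments.

```python
def precision_recall_f1(actual, predicted, positive="Same"):
    """Precision, recall and F1 for one category, treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy(actual, predicted):
    """Fraction of customer pairs whose predicted tag equals the actual tag."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Made-up tags for six customer pairs
actual = ["Same", "Same", "Different", "Probably Same", "Different", "Same"]
predicted = ["Same", "Different", "Different", "Probably Same", "Same", "Same"]

print(precision_recall_f1(actual, predicted))
print(accuracy(actual, predicted))
```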




Supervised Learning model example

    • Live example of how to classify a given set of customer records using
      Supervised Learning




Un-Supervised Learning model



     In many cases there is no pre-labeled data

     In this case we would need to choose an
     Un-supervised learning model

     The model will automatically detect patterns in the
     data and cluster the data points into different clusters

     Any newly added customer pair would be placed in
     the right cluster
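
A minimal sketch of the clustering idea, assuming each customer pair has already been reduced to a single similarity score. It uses a simple one-dimensional two-cluster k-means written in plain Python; the talk does not say which clustering algorithm was used, and the scores are made up.

```python
def two_means(scores, iterations=20):
    """Cluster 1-D similarity scores into a low and a high group."""
    low, high = min(scores), max(scores)  # deterministic starting centroids
    for _ in range(iterations):
        low_group = [s for s in scores if abs(s - low) <= abs(s - high)]
        high_group = [s for s in scores if abs(s - low) > abs(s - high)]
        if not low_group or not high_group:
            break  # all scores fell into one group; nothing more to refine
        low = sum(low_group) / len(low_group)
        high = sum(high_group) / len(high_group)
    return low, high


def assign(score, low, high):
    """Place a new pair's score into the nearer of the two clusters."""
    if abs(score - low) <= abs(score - high):
        return "likely different"
    return "likely same"


# Made-up similarity scores for six unlabeled customer pairs
scores = [0.10, 0.15, 0.20, 0.85, 0.90, 0.95]
low, high = two_means(scores)
print(low, high, assign(0.88, low, high))
```

The cluster names are attached by a person after inspecting the clusters; the algorithm itself only separates the scores.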


Un-Supervised Learning model example

     • Live example of how to classify a given set of customer records using
       Un-Supervised Learning




Continuous Learning

     In many cases there will be a small set of labeled
     data and a very large set of un-labeled data

     An initial model will be created using the
     small labeled data set

     As more labeled data becomes available the model
     will evolve through continuous learning and
     become better at the classification
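
One way to picture a model that evolves as labels arrive is an incrementally updated nearest-centroid classifier, sketched below. The class name, the single-score input and the sample values are all assumptions for illustration; the slides do not describe the actual learning scheme.

```python
class IncrementalCentroidModel:
    """Keeps a running mean similarity score per label and updates it online."""

    def __init__(self):
        self.sums = {}
        self.counts = {}

    def learn(self, score, label):
        """Fold one newly labeled customer pair into the model."""
        self.sums[label] = self.sums.get(label, 0.0) + score
        self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, score):
        """Return the label whose running mean is closest to the score."""
        centroids = {l: self.sums[l] / self.counts[l] for l in self.sums}
        return min(centroids, key=lambda l: abs(score - centroids[l]))


model = IncrementalCentroidModel()

# Initial model from a very small labeled set
model.learn(0.90, "Same")
model.learn(0.20, "Different")
before = model.predict(0.60)

# More labeled data arrives later and shifts the centroids
model.learn(0.95, "Same")
model.learn(0.50, "Different")
model.learn(0.60, "Different")
after = model.predict(0.60)

print(before, after)
```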

Semantic Techniques applicability


     Semantic similarity scoring for features
      •   Feature - List of Games played
      •   Person 1 plays – Racquet Sports
      •   Person 2 plays – Lawn Tennis
      •   Using semantic comparison we can see that there is a high similarity between
          person 1 and person 2 on the List of Games played feature

     Extraction of features from different data sources
      • Similar features named differently

     Associating customers in different data sources as same or different
      • Flexible and easy addition of new relationships

     Ease of adding additional data sources
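
A rough sketch of semantic similarity scoring for the games example, assuming a tiny hand-built taxonomy and a path-length score. The taxonomy entries and the 1 / (1 + distance) formula are illustrative assumptions; a real system would draw on an ontology or a lexical resource.

```python
# Hypothetical toy taxonomy, stored as child -> parent
PARENT = {
    "lawn tennis": "racquet sports",
    "badminton": "racquet sports",
    "racquet sports": "sports",
    "football": "team sports",
    "team sports": "sports",
}


def ancestors(term):
    """Path from the term up to the taxonomy root, starting with the term itself."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def semantic_similarity(a, b):
    """Score in (0, 1] based on path length via the nearest shared ancestor."""
    path_a, path_b = ancestors(a.lower()), ancestors(b.lower())
    common = [t for t in path_a if t in path_b]
    if not common:
        return 0.0
    distance = path_a.index(common[0]) + path_b.index(common[0])
    return 1.0 / (1.0 + distance)


print(semantic_similarity("Racquet Sports", "Lawn Tennis"))
print(semantic_similarity("Lawn Tennis", "Football"))
```

A plain string comparison would score "Racquet Sports" and "Lawn Tennis" as unrelated, while the taxonomy places them one step apart.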



Large Data handling challenges


                 Entity similarity is a pairwise operation



                If there are n entities then there are n*(n-1)/2
                   distinct pairwise comparisons to be done


                 Also, within each comparison every
                    feature pair has to be compared



                   Highly time-consuming operations
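
To make the growth concrete: counting unordered pairs gives n*(n-1)/2 comparisons (n*(n-1) if each ordered pair is counted separately). The short sketch below enumerates the pairs for a tiny customer list and evaluates the formula for a larger n; the customer names are placeholders.

```python
from itertools import combinations


def pair_count(n):
    """Number of distinct unordered pairs among n entities."""
    return n * (n - 1) // 2


customers = ["P1", "P2", "P3", "P4"]
pairs = list(combinations(customers, 2))  # every unordered pair exactly once

print(len(pairs), pair_count(len(customers)))
print(pair_count(1_000_000))  # comparisons needed for one million entities
```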


Large Data handling ideas


                      Use of Apache Mahout
                      • Split the comparisons across m
                        different machines
                      • Each machine now handles
                        about n/m customers
                      • Nearly an m-times speed-up

                      Incremental addition of new customer pairs

                      Batch-time comparisons and tagging
                      • Reduces run-time response when finding
                        similar entities
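
The idea of dividing the work can be sketched in plain Python as below. This is not Apache Mahout's API; it only illustrates dealing the pairwise comparisons out to m workers so that each gets a near-equal share. The function name and the round-robin rule are assumptions.

```python
from itertools import combinations


def partition_pairs(entities, m):
    """Deal all unordered pairs out to m buckets in round-robin order."""
    buckets = [[] for _ in range(m)]
    for i, pair in enumerate(combinations(entities, 2)):
        buckets[i % m].append(pair)
    return buckets


entities = list(range(10))            # ten placeholder customer ids
buckets = partition_pairs(entities, 4)
sizes = [len(b) for b in buckets]
print(sizes)                          # per-worker share of the 45 comparisons
```

With evenly sized buckets and no coordination overhead the speed-up approaches m; in practice data movement and stragglers reduce it.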




Sample Metrics from our experiments

     • Discussion of the sample metrics from our experiments
     • Learnings from the experiments
        – Which algorithm and method was more apt under different
          circumstances




Thank You




