Large Scale Entity Resolution, LexisNexis

Large Scale Entity Resolution
Tools for finding the important needle in the haystack

Global Directions Conference 2013

Published in: Business


  1. Large Scale Entity Resolution: Tools for Finding the Important Needle in the Haystack. Mary Galvin, Technical Consultant, LexisNexis. Kodak Global Directions ‘13
  2. Strategies for Entity Resolution to Reveal Hidden Connections
  3. Semantics
     1. ‘Entity’: A thing with a distinct and independent existence, with enough attributes to set it apart uniquely from anything else.
     2. ‘Entity Resolution’: The processes and methodologies used to uncover instances where the same ‘entity’ is referred to across disparate sources of digital information (e.g., records, news stories, blogs/microblogs).
  4. Large Scale Entity Resolution Use Case #1
  5. Large Scale Entity Resolution Use Case #2
     Scenario: Healthcare insurers need better analytics to identify drug-seeking behavior and schemes that recruit members to use their membership fraudulently. Groups of people collude to source scheduled drugs through multiple members to avoid detection by rules-based systems. Providers recruit members to provide and escalate services that are not rendered.
     Result: The analysis detected social groups that are sourcing Vicodin and other scheduled drugs. It identifies the prescribers and pharmacies involved, which helps the insurer focus investigations and intervene strategically to mitigate risk.
     Chart labels: Non-Social: almost every prescription is in social isolation (> 96%). Social: large % of prescriptions show socialization (long tail).
  6. Large Scale Entity Resolution Challenges
     1. Permanence/Persistence
     2. Transparency
     3. Spatial and Temporal Considerations
     4. Source Credibility Considerations
     5. Completeness
  7. Entity Resolution Methodologies
     Rules-Based:
     − Based on logic (IF/ELSE or SWITCH statements)
     − Example: If the values of fields 1, 2 and 5 from source ‘a’ are equal to the values of fields 3, 6 and 7 in source ‘b’, respectively, then declare a match.
     Statistics-Based:
     − Based on computation of weights and thresholds; a match is declared only when the sum of all weights surpasses a certain threshold
     − Example: Threshold = 29, compared against the sum of individual field scores (based on specificity values) for Fields 1 to 4 of Source A and Source B.
  8. Choosing the Right Methodology
     Rules-Based
       Pros: High Precision; Optimal for Small Datasets
       Cons: Heavy Maintenance Required; Performance Degradation as Rule Set and Datasets Increase; Re-writing of Rules Required as Additional Languages are Present
     Statistics-Based
       Pros: Language Agnostic; Entity Agnostic; Optimal for Large Datasets
       Cons: Overkill for Small Datasets
  9. Why Statistically-Based Systems Excel
     “The advantage of this [statistical] approach over hand-coded rules is that the models develop probabilistic rules of which human experts are often not aware. We noticed that many of the rules that the system had automatically learned from the data differed in subtle but important ways from the rules established by human experts.”
     - Ray Kurzweil, How to Create a Mind (in reference to using statistical approaches for speech recognition technology)
  10. Consideration #1: “Dirty” Data
      Figure panels:
      • US Consumer Data: Frequent Zip Code Patterns
      • US Consumer Data: Frequent Phone Number Values
      • International Cargo Shipping Data – Shipper Names
  11. Consideration #2: Incomplete Data
      • Null Field Value Scenarios
      • Partial Field Value Scenarios

      Cluster #   F Name   M Name   L Name
      1           Sardar   Khan     Niazi
      2           S.       K.       Niazi
  12. Consideration #3: Semi-Structured Data
      International Postal Addresses:
      • 101 ZUBAIDA GARDEN NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI
      • 101 BLOCK E FIRST FLOOR ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FA,KARACHI
      • E-101 ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI
  13. Consideration #4: Semi-Structured Data
      US Postal Addresses (sample values per field):
      • Street Name: 939 JEFFERSON ST; 110 E ELM ST; 426 NEW YORK AVE; 212 E MAIN ST; 1900 EAGLE DR
      • City Name: Bakersfield; Ashland; Newton; Brookfield; Middletown
      • State Name: California; North Carolina; Ohio; Connecticut; Maryland
      Average Specificity (values in slide order): 19.63, 11.12, 5.03
      Location: 14.03
  14. Entity Resolution Benefits: Which Scenario is More Optimal for Your Business?
  15. Entity Resolution Vision
      • Across industry and government, many initiatives and missions boil down to 4 primary entity types:
        • People
        • Businesses/Organizations
        • Locations
        • Assets
      • A deeper understanding of entities and their interconnections translates to:
        • More success in uncovering fraud, waste and abuse
        • Better matching of people to people across social networks
        • Stronger indicators of supply chain risk for the enterprise
  16. Entity Resolution Vision
      From a technical implementation standpoint, can scientific findings pertaining to the neocortex help us further revolutionize entity resolution technology as it stands today?
      • Our statistical approach has us heading in the right direction
      • We are continuously finding new ways to represent the hierarchical nature of entities
      • We should take heed of the brain’s innate ability to “prune”, and look at ways to emulate “pruning” so that retaining data of little to no value doesn’t continue to bog the enterprise down
  17. Q&A
      Mary Galvin, Technical Consultant
      LexisNexis Special Services, Inc. (LNSSI)
      LexisNexis | Risk Solutions
      202.595.4043 Mobile
      mary.galvin@lnssi.com
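
The two methodologies on slide 7 can be contrasted in a short Python sketch. It is not the presenter's implementation. The function names (`normalize`, `rules_match`, `match_score`, `stats_match`) are hypothetical, the threshold of 29 is the value quoted on slide 7, and the field weights reuse the average specificity figures from slide 13 under the assumption that they map to street, city and state in that order. The sample records combine values from slide 13 purely for illustration.

```python
from typing import Dict, Optional, Sequence, Tuple

Record = Dict[str, Optional[str]]

THRESHOLD = 29.0  # threshold value quoted on slide 7

# Assumption: slide 13's average specificity figures, mapped to fields by guess.
FIELD_WEIGHTS: Dict[str, float] = {"street": 19.63, "city": 11.12, "state": 5.03}


def normalize(value: Optional[str]) -> Optional[str]:
    """Uppercase and collapse whitespace; missing or empty values become None."""
    if value is None:
        return None
    cleaned = " ".join(value.upper().split())
    return cleaned or None


def rules_match(a: Record, b: Record, field_pairs: Sequence[Tuple[str, str]]) -> bool:
    """Rules-based: a match only if every listed field pair is present and equal."""
    for field_a, field_b in field_pairs:
        va, vb = normalize(a.get(field_a)), normalize(b.get(field_b))
        if va is None or va != vb:
            return False
    return True


def match_score(a: Record, b: Record, weights: Dict[str, float]) -> float:
    """Statistics-based: sum the weights of the fields on which both records agree."""
    score = 0.0
    for field, weight in weights.items():
        va, vb = normalize(a.get(field)), normalize(b.get(field))
        if va is not None and va == vb:
            score += weight
    return score


def stats_match(a: Record, b: Record,
                weights: Dict[str, float] = FIELD_WEIGHTS,
                threshold: float = THRESHOLD) -> bool:
    """Declare a match only when the summed weights surpass the threshold."""
    return match_score(a, b, weights) > threshold


# Illustrative records only; the street/city/state pairing is not from the slides.
rec_a = {"street": "939 Jefferson St", "city": "Bakersfield", "state": "California"}
rec_b = {"street": "939 JEFFERSON ST", "city": "bakersfield", "state": None}
pairs = [("street", "street"), ("city", "city"), ("state", "state")]

print("rules-based match:", rules_match(rec_a, rec_b, pairs))
print("statistics-based match:", stats_match(rec_a, rec_b))
```

In this sketch the strict rule rejects the pair because one record has a null state, while the weighted sum still clears the threshold on street and city alone. That is one way the incomplete-data consideration on slide 11 interacts with the choice of methodology.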
