Large Scale Entity Resolution
Tools for Finding the Important Needle in the Haystack
Mary Galvin, Technical Consultant, LexisNexis

Large Scale Entity Resolution, LexisNexis


Global Directions Conference 2013

Published in: Business

  1. Large Scale Entity Resolution: Tools for Finding the Important Needle in the Haystack
     Mary Galvin, Technical Consultant, LexisNexis
     Kodak Global Directions ‘13
  2. Strategies for Entity Resolution to Reveal Hidden Connections
  3. Semantics
     1. ‘Entity’: A thing with distinct and independent existence containing enough attributes to uniquely set it apart from something else.
     2. ‘Entity Resolution’: The processes and methodologies used to uncover instances where the same ‘entity’ is referred to across disparate sources of digital information (e.g., records, news stories, blogs/microblogs).
  4. Large Scale Entity Resolution Use Case #1
  5. Large Scale Entity Resolution Use Case #2
     Scenario: Healthcare insurers need better analytics to identify drug-seeking behavior and schemes that recruit members to use their membership fraudulently. Groups of people collude to source schedule drugs through multiple members to avoid being detected by rules-based systems. Providers recruit members to provide and escalate services that are not rendered.
     Result: The analysis detected social groups that are sourcing Vicodin and other schedule drugs. It identifies the prescribers and pharmacies involved, helping the insurer focus investigations and intervene strategically to mitigate risk.
     Chart labels:
     • Non-Social: Almost every prescription is in social isolation (> 96%)
     • Social: Large % of prescriptions show socialization (long tail)
  6. Large Scale Entity Resolution Challenges
     1. Permanence/Persistence
     2. Transparency
     3. Spatial and Temporal Considerations
     4. Source Credibility Considerations
     5. Completeness
  7. Entity Resolution Methodologies
     Rules-Based:
     − Based on logic (IF/ELSE or SWITCH statements)
     − Example: If field values 1, 2 and 5 from source ‘a’ are equivalent to values 3, 6 and 7 in source ‘b’, respectively, then declare a match.
     Statistics-Based:
     − Based on computation of weights and thresholds; a match is declared only when the sum of all weights surpasses a certain threshold
     − Example: Threshold = 29; sum of individual field scores (based on specificity values), with a score for each of Fields 1 to 4 in Source A and in Source B
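The two methodologies on this slide can be sketched side by side. This is a minimal illustration, not the presenter's implementation: the field names, the `WEIGHTS` values, and the function names are all hypothetical, and only the threshold of 29 comes from the slide.

```python
from typing import Mapping, Sequence, Tuple

Record = Mapping[str, str]

# Hypothetical per-field weights; the slide does not give the real scores.
WEIGHTS = {"last_name": 14.0, "street": 12.0, "city": 6.0, "state": 3.0}
THRESHOLD = 29.0  # threshold value taken from the slide's example


def rules_match(a: Record, b: Record,
                field_pairs: Sequence[Tuple[str, str]]) -> bool:
    """Rules-based: match only if every paired field is present and equal."""
    return all(
        a.get(fa) is not None and a.get(fa) == b.get(fb)
        for fa, fb in field_pairs
    )


def weighted_score(a: Record, b: Record,
                   weights: Mapping[str, float]) -> float:
    """Sum the weights of the fields on which both records agree.

    Empty or missing values contribute nothing to the score.
    """
    return sum(
        w for field, w in weights.items()
        if a.get(field) and a.get(field) == b.get(field)
    )


def statistical_match(a: Record, b: Record,
                      weights: Mapping[str, float] = WEIGHTS,
                      threshold: float = THRESHOLD) -> bool:
    """Statistics-based: match only when the summed score surpasses the threshold."""
    return weighted_score(a, b, weights) > threshold
```

With these assumed weights, two records that agree on last name, street and city score 32 and are declared a match, while agreement on last name, city and state alone scores 23 and is not. The rule-based check, by contrast, fails as soon as any one of its paired fields differs.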
  8. Choosing the Right Methodology
     Rules-Based
     Pros:
     • High Precision
     • Optimal for Small Datasets
     Cons:
     • Heavy Maintenance Required
     • Performance Degradation as Rule Set and Datasets Increase
     • Re-writing of Rules Required as Additional Languages are Present
     Statistics-Based
     Pros:
     • Language Agnostic
     • Entity Agnostic
     • Optimal for Large Datasets
     Cons:
     • Overkill for Small Datasets
  9. Why Statistically-Based Systems Excel
     “The advantage of this [statistical] approach over hand-coded rules is that the models develop probabilistic rules of which human experts are often not aware. We noticed that many of the rules that the system had automatically learned from the data differed in subtle but important ways from the rules established by human experts”
     - Ray Kurzweil, How To Create A Mind (in reference to using statistical approaches for speech recognition technology)
  10. Consideration #1: “Dirty” Data
      • US Consumer Data: Frequent Zip Code Patterns
      • US Consumer Data: Frequent Phone Number Values
      • International Cargo Shipping Data: Shipper Names
  11. Consideration #2: Incomplete Data
      • Null Field Value Scenarios
      • Partial Field Value Scenarios

      Cluster #   F Name   M Name   L Name
      1           Sardar   Khan     Niazi
      2           S.       K.       Niazi
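One way to reason about the partial-value example above (Sardar Khan Niazi versus S. K. Niazi) is a compatibility check that treats an initial as consistent with any name starting with that letter, and a null as consistent with anything. The sketch below is an assumption for illustration only; the function names are hypothetical and the slide does not say how the presenter's system handles these cases.

```python
from typing import Sequence


def name_part_compatible(x: str, y: str) -> bool:
    """Return True if two name parts do not contradict each other.

    A null (empty) part gives no evidence either way, and a single-letter
    initial is compatible with any name beginning with that letter.
    """
    x = x.strip().rstrip(".").lower()
    y = y.strip().rstrip(".").lower()
    if not x or not y:
        return True
    if len(x) == 1 or len(y) == 1:
        return x[0] == y[0]
    return x == y


def names_compatible(a: Sequence[str], b: Sequence[str]) -> bool:
    """Compare (first, middle, last) tuples part by part."""
    return len(a) == len(b) and all(
        name_part_compatible(x, y) for x, y in zip(a, b)
    )
```

Compatibility is weaker than a match: it only says the two records do not contradict each other, so in a weighted scheme a compatible initial would plausibly earn a smaller score than a full-name agreement.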
  12. Consideration #3: Semi-Structured Data
      International Postal Addresses
      • 101 ZUBAIDA GARDEN NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI
      • 101 BLOCK E FIRST FLOOR ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FA,KARACHI
      • E-101 ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI
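The three address variants above share most of their tokens even though no field lines up exactly. A simple stand-in technique for measuring that overlap is token-set Jaccard similarity; this is an illustrative sketch with hypothetical function names, not the method the presentation describes.

```python
import re
from typing import Set


def tokens(address: str) -> Set[str]:
    """Split an address into a set of upper-case alphanumeric tokens."""
    return set(re.findall(r"[A-Z0-9]+", address.upper()))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity: |intersection| / |union|."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return len(ta & tb) / len(union)


A1 = "101 ZUBAIDA GARDEN NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI"
A2 = ("101 BLOCK E FIRST FLOOR ZUBAIDA GARDENS NEAR AWAMI MARKAZ "
      "SHAHRAH-E-FA,KARACHI")
A3 = "E-101 ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI"
```

Exact token matching still misses near-duplicates such as GARDEN versus GARDENS or the truncated SHAHRAH-E-FA, so a practical system would likely add normalization or fuzzy token comparison on top of this.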
  13. Consideration #4: Semi-Structured Data
      US Postal Addresses
      • Street Name: 939 JEFFERSON ST; 110 E ELM ST; 426 NEW YORK AVE; 212 E MAIN ST; 1900 EAGLE DR
      • City Name: Bakersfield; Ashland; Newton; Brookfield; Middletown
      • State Name: California; North Carolina; Ohio; Connecticut; Maryland
      Average Specificity: 19.63, 11.12, 5.03
      Location: 14.03
  14. Entity Resolution Benefits
      Which Scenario is More Optimal for Your Business?
  15. Entity Resolution Vision
      • Across industry and government, many initiatives and missions boil down to 4 primary entity types:
        • People
        • Businesses/Organizations
        • Locations
        • Assets
      • A deeper understanding of entities and their interconnections translates to:
        • Increased successes in cracking fraud, waste and abuse
        • Better matching of people to people across social networks
        • Stronger indicators of supply chain risk for the enterprise
  16. Entity Resolution Vision
      From a technical implementation standpoint, can scientific findings pertaining to the neocortex help us further revolutionize entity resolution technology as it stands today?
      • Our statistical approach has us heading in the right direction
      • We are continuously finding new ways to represent the hierarchical nature of entities
      • We should take heed of the brain’s innate ability to “prune”, while possibly looking at ways to emulate “pruning” so that unnecessary retention of data with little to no value doesn’t continue to bog the enterprise down
  17. Q&A
      Mary Galvin
      Technical Consultant
      LexisNexis Special Services, Inc. (LNSSI)
      LexisNexis | Risk Solutions
      202.595.4043 Mobile
      mary.galvin@lnssi.com