Large Scale Entity Resolution, LexisNexis

Large Scale Entity Resolution
Tools for finding the important needle in the haystack

Global Directions Conference 2013


Transcript

  • 1. Large Scale Entity Resolution: Tools for Finding the Important Needle in the Haystack. Mary Galvin, Technical Consultant, LexisNexis. Kodak Global Directions ’13
  • 2. Strategies for Entity Resolution to Reveal Hidden Connections
  • 3. Semantics
    1. ‘Entity’: A thing with a distinct and independent existence, carrying enough attributes to set it uniquely apart from anything else.
    2. ‘Entity Resolution’: The processes and methodologies used to uncover instances where the same ‘entity’ is referred to across disparate sources of digital information (e.g., records, news stories, blogs/microblogs).
  • 4. Large Scale Entity Resolution Use Case #1
  • 5. Large Scale Entity Resolution Use Case #2
    Scenario: Healthcare insurers need better analytics to identify drug-seeking behavior and schemes that recruit members to use their membership fraudulently. Groups of people collude to source scheduled drugs through multiple members to avoid detection by rules-based systems. Providers recruit members to provide and escalate services that are not rendered.
    Result: The analysis detected social groups that are sourcing Vicodin and other scheduled drugs. It also identifies the prescribers and pharmacies involved, helping the insurer focus investigations and intervene strategically to mitigate risk.
    Non-Social: Almost every prescription is in social isolation (> 96%).
    Social: Large % of prescriptions show socialization (long tail).
  • 6. Large Scale Entity Resolution Challenges
    1. Permanence/Persistence
    2. Transparency
    3. Spatial and Temporal Considerations
    4. Source Credibility Considerations
    5. Completeness
  • 7. Entity Resolution Methodologies
    Rules-Based:
    − Based on logic (IF/ELSE or SWITCH statements)
    − Example: If field values 1, 2 and 5 from source ‘a’ are equivalent to values 3, 6 and 7 in source ‘b’, respectively, then declare a match.
    Statistics-Based:
    − Based on computation of weights and thresholds; a match is declared only when the sum of all weights surpasses a certain threshold
    − Example: Threshold = 29, compared against the sum of individual field scores (based on specificity values) for Fields 1 to 4 of Source A and Source B.
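The two methodologies on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system described in the talk: the field names (`field_1`, `name`, and so on), the per-field weights, and both function names are hypothetical. Only the field pairing (1, 2, 5 against 3, 6, 7) and the threshold of 29 come from the slide. A production system would more plausibly derive weights per value from the data rather than fix them per field.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]

# Hypothetical mapping taken from the slide's example:
# fields 1, 2 and 5 of source 'a' pair with fields 3, 6 and 7 of source 'b'.
FIELD_MAP = {"field_1": "field_3", "field_2": "field_6", "field_5": "field_7"}


def rules_based_match(a: Record, b: Record) -> bool:
    """Declare a match only if every mapped field pair is present and equal."""
    return all(
        a.get(fa) is not None and a.get(fa) == b.get(fb)
        for fa, fb in FIELD_MAP.items()
    )


def statistics_based_match(
    a: Record, b: Record, weights: Dict[str, float], threshold: float = 29.0
) -> bool:
    """Sum the weights of agreeing fields; match only if the sum surpasses the threshold."""
    score = sum(
        weight
        for field, weight in weights.items()
        if a.get(field) is not None and a.get(field) == b.get(field)
    )
    return score > threshold


# Illustrative weights only (assumption): rarer, more identifying fields score higher.
EXAMPLE_WEIGHTS = {"name": 15.0, "street": 12.0, "city": 5.0, "state": 2.0}
```

With these example weights, two records that agree on name, street and city score 32 and match, while agreement on name and street alone scores 27 and does not.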
  • 8. Choosing the Right Methodology
    Rules-Based
    − Pros: High Precision; Optimal for Small Datasets
    − Cons: Heavy Maintenance Required; Performance Degradation as Rule Set and Datasets Increase; Re-writing of Rules Required as Additional Languages are Present
    Statistics-Based
    − Pros: Language Agnostic; Entity Agnostic; Optimal for Large Datasets
    − Cons: Overkill for Small Datasets
  • 9. Why Statistically-Based Systems Excel
    “The advantage of this [statistical] approach over hand-coded rules is that the models develop probabilistic rules of which human experts are often not aware. We noticed that many of the rules that the system had automatically learned from the data differed in subtle but important ways from the rules established by human experts.” - Ray Kurzweil, How To Create A Mind (in reference to using statistical approaches for speech recognition technology)
  • 10. Consideration #1: “Dirty” Data
    − US Consumer Data: Frequent Zip Code Patterns
    − US Consumer Data: Frequent Phone Number Values
    − International Cargo Shipping Data: Shipper Names
  • 11. Consideration #2: Incomplete Data
    − Null Field Value Scenarios
    − Partial Field Value Scenarios
    Cluster #   F Name   M Name   L Name
    1           Sardar   Khan     Niazi
    2           S.       K.       Niazi
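One way to reason about the null and partial field scenarios above is to test whether two name records are compatible, meaning nothing in them contradicts the other. The sketch below is an assumption about how such a check could look, not the method used in the talk; the function names are hypothetical. It treats a missing part as compatible with anything and an initial as compatible with any name starting with that letter. Compatibility alone is weak evidence and would normally feed into a score rather than decide a match.

```python
from typing import Optional, Sequence


def _normalize(part: Optional[str]) -> str:
    """Lower-case a name part and drop surrounding spaces and a trailing period."""
    return (part or "").strip().rstrip(".").lower()


def name_part_compatible(x: Optional[str], y: Optional[str]) -> bool:
    """True if the two parts do not contradict each other."""
    nx, ny = _normalize(x), _normalize(y)
    if not nx or not ny:
        return True  # null value: no evidence against a match
    if nx == ny:
        return True
    shorter, longer = sorted((nx, ny), key=len)
    # A single letter is treated as an initial of the longer name.
    return len(shorter) == 1 and longer.startswith(shorter)


def names_compatible(a: Sequence[Optional[str]], b: Sequence[Optional[str]]) -> bool:
    """Compare (first, middle, last) tuples part by part."""
    return len(a) == len(b) and all(
        name_part_compatible(x, y) for x, y in zip(a, b)
    )
```

Applied to the slide's example, `names_compatible(("Sardar", "Khan", "Niazi"), ("S.", "K.", "Niazi"))` returns `True`, which is consistent with the two rows referring to the same person without proving it.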
  • 12. Consideration #3: Semi-Structured Data
    International Postal Addresses
    − 101 ZUBAIDA GARDEN NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI
    − 101 BLOCK E FIRST FLOOR ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FA,KARACHI
    − E-101 ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI
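The three addresses above differ in token order, truncation and pluralization, so exact string comparison fails even though they plausibly describe the same place. A simple, order-insensitive comparison is Jaccard similarity over address tokens. This is a swapped-in illustrative technique, not something the slides name, and the function names are hypothetical.

```python
import re
from typing import Set


def address_tokens(address: str) -> Set[str]:
    """Upper-case the address and split it into alphanumeric tokens."""
    return set(re.findall(r"[A-Z0-9]+", address.upper()))


def jaccard_similarity(a: str, b: str) -> float:
    """Size of the shared token set divided by the size of the combined token set."""
    ta, tb = address_tokens(a), address_tokens(b)
    if not ta and not tb:
        return 0.0  # two empty addresses carry no evidence either way
    return len(ta & tb) / len(ta | tb)


ADDR_1 = "101 ZUBAIDA GARDEN NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI"
ADDR_2 = "101 BLOCK E FIRST FLOOR ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FA,KARACHI"
ADDR_3 = "E-101 ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI"
```

On these inputs the first and third addresses share most of their tokens, while the second scores lower because of the extra "BLOCK", "FIRST", "FLOOR" tokens and the truncated "FA". Stemming or prefix matching would be natural refinements.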
  • 13. Consideration #4: Semi-Structured Data
    US Postal Addresses
    − Street Name: 939 JEFFERSON ST; 110 E ELM ST; 426 NEW YORK AVE; 212 E MAIN ST; 1900 EAGLE DR
    − City Name: Bakersfield; Ashland; Newton; Brookfield; Middletown
    − State Name: California; North Carolina; Ohio; Connecticut; Maryland
    Average Specificity: Street Name 19.63; City Name 11.12; State Name 5.03; Location 14.03
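The slide reports average specificity per field but does not define the measure. One plausible reading, and it is only an assumption here, is information content: a value that occurs with relative frequency p in its field has specificity -log2(p), so rare street names score high and common state names score low. The sketch below uses that definition with hypothetical function names and does not attempt to reproduce the slide's figures.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def field_specificity(values: Iterable[str]) -> Dict[str, float]:
    """Map each distinct value to -log2 of its relative frequency (assumed definition)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: -math.log2(count / total) for value, count in counts.items()}


def average_specificity(values: Iterable[str]) -> float:
    """Average the per-value specificity over all records in the field."""
    records = list(values)
    if not records:
        return 0.0
    spec = field_specificity(records)
    return sum(spec[v] for v in records) / len(records)
```

Under this reading, a field with few distinct values (such as state) has low average specificity, and a field with many distinct values (such as street name) has a high one, which matches the ordering on the slide.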
  • 14. Entity Resolution Benefits
    Which Scenario is More Optimal for Your Business?
  • 15. Entity Resolution Vision
    • Across industry and government, many initiatives and missions boil down to 4 primary entity types:
      • People
      • Businesses/Organizations
      • Locations
      • Assets
    • A deeper understanding of entities and their interconnections translates to:
      • Increased successes in cracking fraud, waste and abuse
      • Better matching of people to people across social networks
      • Stronger indicators of supply chain risk for the enterprise
  • 16. Entity Resolution Vision
    From a technical implementation standpoint, can scientific findings pertaining to the neocortex help us further revolutionize entity resolution technology as it stands today?
    • Our statistical approach has us heading in the right direction
    • We are continuously finding new ways to represent the hierarchical nature of entities
    • We should take heed of the brain’s innate ability to “prune”, while possibly looking at ways to emulate “pruning” so that unnecessary retention of data with little to no value doesn’t continue to bog the enterprise down
  • 17. Q&A
    Mary Galvin, Technical Consultant, LexisNexis Special Services, Inc. (LNSSI), LexisNexis | Risk Solutions
    202.595.4043 (Mobile), mary.galvin@lnssi.com