Linguistic Considerations of Identity Resolution (2008)


Identity resolution systems indicate whether two individuals really are the same person. Identity retrieval systems help you find the individual you’re after. These systems appear everywhere from analysts’ desks to border crossings. But how can you tell whether a system is any good before it’s deployed? You need to understand the problems it should tackle and how to measure how well it’s doing.

This talk considers metrics and data for evaluating identity resolution and retrieval systems. It also explores the linguistic challenges these systems face.


  1. GOVERNMENT USERS Conference, “Navigating the Human Terrain,” College Park, MD, May 20-21, 2008
     Linguistic Considerations of Identity Resolution
     David Murgatroyd, Software Architect, Basis Technology
  2. Outline
      Introduction
      Linguistic Challenges: Variation (Intentional & Unintentional); Composition; Frequency; Under-specification; Multilinguality
      Integration Challenges: Inputs & Outputs; Properties
      Evaluation Challenges: Corpora: Find or Build? Metrics: Adopt or Create?
      Conclusion
  3. Introduction: An Exercise
      Four references: Jim Killeen; Kileen, J. D.; Jaime Kilin; جمس كلين
      Is there a >50% chance these refer to the same person? What if they are US citizens? On a ferry to Spain? In a documentary?
  4. What is Identity Resolution?
      Identity Resolution (aka Entity Resolution): determining whether two or more given references refer to the same entity
      Different from name matching: it concerns the identity of entities, not the similarity of names
      See also: Murgatroyd, D. (2008). Some Linguistic Considerations of Entity Resolution and Retrieval. In Proceedings of the LREC 2008 Workshop on Resources and Evaluation for Identity Matching, Entity Resolution and Entity Management.
  5. What sorts of references?
      Non-linguistic reference examples:
        — Numerical identifiers: SSN; some portions of an address (street number, ZIP code)
        — Visual identifiers (e.g., pictures, symbols)
        — Biometrics (e.g., DNA, iris, signature, voice)
      Linguistic reference examples:
        — Nouns or pronouns in documents (e.g., “the CEO of Basis”)
        — Names of associated/related entities: locations (e.g., street or city name), organizations, individuals
        — The name of the entity itself ← the focus of this talk
  6. Let’s focus on names of people
      Common and familiar
      Often a fairly identifying piece of personal information
      Demonstrate the typical challenges of resolution with linguistic data
  7. Outline (transition to Linguistic Challenges)
  8. Variation (Intentional)
      Variation may be intentional; references may draw on a large set of names, varying by:
        — Formality (e.g., nicknames)
        — Transparency (e.g., aliases)
        — Location (e.g., toponyms)
        — Life status: vocation (e.g., titles); marital status (e.g., marriage/divorce/widowhood); parenthood (e.g., patronymics); faith (e.g., christening, pilgrimage); death (e.g., posthumous names)
        — Dialect (e.g., adolescent girls preferring “Jenni” over “Jenny”)
        — Style of text (e.g., “Sollermun” for “Solomon” in Huck Finn)
      (Example: Jim Killeen)
  9. Variation (Unintentional)
      Variation may be unintentional, arising from:
        — Typos (e.g., “Killeen” vs. “Kileen”)
        — Guessing spelling based on pronunciation (e.g., “Caliin”)
        — Ambiguities inherent in the encoding (e.g., Unicode):
           Characters with the same glyph (e.g., Latin and Cyrillic small “i”)
           Characters with similar glyphs (e.g., Latin “K” and Greenlandic “ĸ”)
           Characters with composed/combined forms (e.g., ņ as a single code point vs. n + combining cedilla)
      (Example: Kileen, J. D.)
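[Editor's note] A minimal Python sketch of how encoding-level variation might be canonicalized before names are compared. Unicode normalization resolves the composed-vs.-combining ambiguity on the slide above; it does not fold cross-script look-alikes, which would need a separate confusables table:

    import unicodedata

    def normalize_name(name: str) -> str:
        # NFKC folds composed vs. combining forms (U+0146 "ņ" vs.
        # "n" + U+0327) into one canonical form; casefold() removes
        # case distinctions. Cross-script confusables (Latin vs.
        # Cyrillic "i") are NOT handled and need a separate mapping.
        return unicodedata.normalize("NFKC", name).casefold()

    # The composed and combining spellings compare equal afterwards.
    assert normalize_name("\u0146") == normalize_name("n\u0327")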
  10. Composition
      Names have differing orders: given vs. surname (“Killeen, Jim” vs. “Jim Killeen”); the order varies by culture
      Name references may be partial: “Jim” vs. “Jim Killeen”
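[Editor's note] A small sketch of one way to make comparison insensitive to component order and tolerant of partial references; the helper name is illustrative, not from the talk:

    def component_set(name: str) -> frozenset:
        # An order-insensitive view of a name reference:
        # "Killeen, Jim" and "Jim Killeen" yield the same set,
        # and a partial reference like "Jim" yields a subset.
        return frozenset(name.replace(",", " ").split())

    full = component_set("Killeen, Jim")
    assert full == component_set("Jim Killeen")
    assert component_set("Jim") <= full  # partial reference is consistent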
  11. Under-specification
      Name components may be abbreviated: initials (e.g., “J. D.”); abbreviations (e.g., “Jas.”)
      Name references may have incomplete orthography (e.g., Semitic languages, where short vowels are typically unwritten), incomplete segmentation (e.g., Asian languages), or incomplete phonology (e.g., ideographic languages)
      (Examples: Kileen, J. D.; جمس كلين)
  12. Frequency
      Any person can make up a name: names are an open class
      A few names are common; most are very uncommon (a Zipfian distribution)
      Lessons: it is valuable to know the common names, and valuable to have a strategy for unknown names
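[Editor's note] One way these lessons are often operationalized, in the spirit of the Fellegi-Sunter record-linkage model cited in the bibliography, is to weight a name match by the name's rarity. A toy sketch with hypothetical counts:

    import math

    # Hypothetical counts; a real system would estimate these from large
    # corpora and back off gracefully for names it has never seen.
    COUNTS = {"james": 50_000, "killeen": 120}
    TOTAL = 10_000_000
    UNSEEN = 1  # floor count for unknown names (names are an open class)

    def surprisal(name: str) -> float:
        # Rarer names carry more identifying information: -log p(name).
        return -math.log(COUNTS.get(name.lower(), UNSEEN) / TOTAL)

    # A match on the rare surname is stronger evidence of co-identity
    # than a match on the common given name.
    assert surprisal("Killeen") > surprisal("James")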
  13. Multilinguality
      Names may appear in many languages-of-use, which leads to variation at many linguistic levels
      Orthographic: transliteration confronts skew both in the orthographic-to-phonetic mappings of the source and target languages-of-use and in the sound systems of the two languages (e.g., جمس كلين <-> James Klein)
  14. Multilinguality (cont’d)
      Syntactic: different languages-of-use may imply different name word order
      Semantic: name words that communicate meaning (e.g., titles) may vary (e.g., “Jr.” vs. “الأصغر”, which means “the younger”)
      Pragmatic: different languages-of-use may use different names depending on the audience (e.g., “Mr. Laden” vs. “الأمير”, which means “the prince”)
  15. Outline (transition to Integration Challenges)
  16. Inputs & Outputs
      Input options include:
        — Pair-wise: simple integration, but no effort shared across comparisons
        — Set-based: harder integration, but able to optimize across the whole set
      Output options include:
        — Feature-based: with weights/tuning
        — Probability-based: a more principled combination (note: similarity is not probability)
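[Editor's note] Hypothetical interfaces sketching the two input styles; the names and signatures are illustrative only:

    from typing import Iterable, Protocol

    class PairwiseResolver(Protocol):
        # Pair-wise input: score one pair of references at a time.
        # Simple to integrate, but each comparison starts from scratch.
        def resolve(self, a: str, b: str) -> float: ...

    class SetBasedResolver(Protocol):
        # Set-based input: sees all references at once, so it can
        # optimize globally; output is a clustering, not pair scores.
        def resolve_all(self, references: Iterable[str]) -> list[set[str]]: ...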
  17. Integration Properties
      Certain properties help make implementations efficient:
        — Reflexivity: Resolve(a, a) is always true (note: this does not imply Resolve(a, a′) where a ~ a′)
        — Commutativity: Resolve(a, b) <=> Resolve(b, a)
        — Transitivity: Resolve(a, b) & Resolve(b, c) => Resolve(a, c)
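[Editor's note] When all three properties hold, resolved identities partition into disjoint sets, which a union-find structure maintains efficiently. A minimal sketch, assuming the pairwise decisions have already been made:

    class UnionFind:
        def __init__(self):
            self.parent = {}

        def find(self, x):
            # Follow parent links to the cluster representative,
            # compressing the path along the way.
            self.parent.setdefault(x, x)
            if self.parent[x] != x:
                self.parent[x] = self.find(self.parent[x])
            return self.parent[x]

        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

    uf = UnionFind()
    uf.union("Jim Killeen", "Kileen, J. D.")
    uf.union("Kileen, J. D.", "Jaime Kilin")
    # Transitivity: the first and third references share a cluster.
    assert uf.find("Jim Killeen") == uf.find("Jaime Kilin")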
  18. Outline (transition to Evaluation Challenges)
  19. Corpora: Find or Build?
      Requirements: annotated for ground truth; representative of the linguistic challenges; scalable/practical
      Option: adapt public “database” corpora:
        — Wikipedia (annotated: yes; representative: somewhat; scalable: yes)
        — Citation DBs (annotated: no; representative: somewhat; scalable: yes)
  20. Corpora: Find or Build? (cont’d)
      Option: adapt public “document” corpora:
        — Co-reference documents (annotated: yes; representative: less so, as often a single document/language-of-use; scalable: yes)
      Option: create corpora by hand:
        — From scratch via “parrot sessions,” auditory or visual (annotated: yes; representative: largely; scalable: no)
        — From un-annotated databases (annotated: no; representative: yes; scalable/practical: no, since databases may be private)
        — Synthesized from a generative model (annotated: yes; representative: no, tied to the generating model; scalable: yes)
  21. Metrics
      Back to our initial example: the references (Jim Killeen; Kileen, J. D.; Jaime Kilin; جمس كلين) are shown clustered three ways: the ground-truth Reference grouping, System A’s grouping, and System B’s grouping. [Clustering diagram not recoverable from the transcript.]
  22. Metrics: Adopt or Create?
      How do we quantify the quality of the system’s resolutions against the reference?
      Goals: discriminative (separates good from bad systems for users’ needs); interpretable (the number aligns with intuition)
      Considerations: assume transitive closure (TC) of the output? apply weights to be more discriminative?
      Common concepts:
        — Precision: the percentage of what is in the answer that is right
        — Recall: the percentage of the right stuff that is in the answer
        — F-score: the harmonic mean of the two, F = 2*P*R/(P+R)
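[Editor's note] A short sketch of pairwise precision/recall/F over clusterings, the simplest of the candidate metrics on the next slide; the function name and data layout are illustrative:

    from itertools import combinations

    def pairwise_prf(truth, system):
        # Clusterings are lists of sets of references; a pair counts
        # as positive when both references share a cluster.
        def pairs(clusters):
            return {frozenset(p) for c in clusters
                    for p in combinations(sorted(c), 2)}
        t, s = pairs(truth), pairs(system)
        p = len(t & s) / len(s) if s else 0.0
        r = len(t & s) / len(t) if t else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    truth = [{"Jim Killeen", "Kileen, J. D.", "Jaime Kilin"}]
    system = [{"Jim Killeen", "Kileen, J. D."}, {"Jaime Kilin"}]
    print(pairwise_prf(truth, system))  # (1.0, 0.333..., 0.5)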
  23. Candidate Metrics
      Pair-wise % correct: over all N*(N-1)/2 node pairs
      Pair-wise P&R: based on the links drawn
      Edit distance: the number of links to add or remove to reach the correct clustering
      Metrics used in document co-reference resolution:
        — MUC-6: entity-based P&R on links missing from the graph
        — B-CUBED: average per-reference P&R of links
        — CEAF (Constrained Entity-Alignment F): entities are aligned using some similarity measure; P&R are the percentage of the possible similarity achieved
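[Editor's note] For concreteness, a sketch of B-CUBED as described by Bagga & Baldwin (1998) in the bibliography, assuming every reference appears in exactly one cluster of each clustering:

    def b_cubed(truth, system):
        def cluster_of(ref, clusters):
            return next(c for c in clusters if ref in c)
        refs = {r for c in truth for r in c}
        p = r = 0.0
        for ref in refs:
            t, s = cluster_of(ref, truth), cluster_of(ref, system)
            overlap = len(t & s)
            p += overlap / len(s)  # per-reference precision
            r += overlap / len(t)  # per-reference recall
        return p / len(refs), r / len(refs)

    truth = [{"a", "b", "c"}, {"d"}]
    system = [{"a", "b"}, {"c", "d"}]
    print(b_cubed(truth, system))  # (0.75, 0.666...)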
  24. Comparing Metrics
      [Table not fully recoverable from the transcript.] The slide scores Systems A and B from the earlier example under each candidate metric: % correct, pairwise F, MUC-6, B-CUBED, and CEAF, each with and without transitive closure (TC), plus edit distance; the presenter’s preferred metric is marked.
  25. Conclusion
      Identity resolution systems face linguistic challenges
      They need to be carefully integrated to meet these challenges
      Evaluation corpora should reflect these challenges
      Evaluation metrics should align with qualitative judgements
  26. Bibliography
      Bagga, A., Baldwin, B. (1998). Algorithms for scoring coreference chains. In Proceedings of the First International Conference on Language Resources and Evaluation Workshop on Linguistic Coreference.
      Fellegi, I. P., Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, Vol. 64, No. 328, pp. 1183-1210.
      Luo, X. (2005). On coreference resolution performance metrics. In Proceedings of HLT-EMNLP, pp. 25-32.
      Menestrina, D., Benjelloun, O., Garcia-Molina, H. (2006). Generic entity resolution with data confidences. In First International VLDB Workshop on Clean Databases, Seoul, Korea.
      Murgatroyd, D. (2008). Some Linguistic Considerations of Entity Resolution and Retrieval. In Proceedings of the LREC 2008 Workshop on Resources and Evaluation for Identity Matching, Entity Resolution and Entity Management.
      Spock Team (2008). The Spock Challenge. http://challenge.spock.com/ (retrieved February 5, 2008).
      Vilain, M., Burger, J., Aberdeen, J., Connolly, D., Hirschman, L. (1995). A model-theoretic coreference scoring scheme. In Proceedings of the 6th Message Understanding Conference (MUC-6). Morgan Kaufmann, pp. 45-52.
  27. Questions?
      More information: http://www.basistech.com
