Learning Blocking Schemes for Record Linkage_PVERConf_May2011
May 2011 Personal Validation and Entity Resolution Conference Presentation. Presenters: Matthew Michelson, Craig A. Knoblock

  • Beam search adds backtracking: keep searching so that RR is maximized while a minimum PC is still covered.
  • A 1% change in RR and a 1% change in PC mean different things. E.g., in the Marlin dataset the 50% test split has ~50 matches, so one match is a 2% difference in PC. With 90% training data, our drop in PC represents ~6.5 matches on average. For RR on Cars, going from RR = 98.346 to RR = 98.337 adds 402 candidates. Red box = not statistically significant under a two-tailed t-test, alpha = 0.05.

Learning Blocking Schemes for Record Linkage: Presentation Transcript

  • Learning Blocking Schemes for Record Linkage Matthew Michelson & Craig A. Knoblock University of Southern California / Information Sciences Institute Person Validation and Entity Resolution Conference May 23rd, 2011, Washington D.C.
  • Record Linkage: Census Data vs. A.I. Researchers
    Census Data:
      First Name | Last Name | Phone    | Zip
      Matt       | Michelson | 555-5555 | 12345
      Jane       | Jones     | 555-1111 | 12345
      Joe        | Smith     | 555-0011 | 12345
    A.I. Researchers:
      First Name | Last Name | Phone    | Zip
      Matthew    | Michelson | 555-5555 | 12345
      Jim        | Jones     | 555-1111 | 12345
      Joe        | Smeth     | 555-0011 | 12345
    Matches: Matt Michelson ↔ Matthew Michelson, Joe Smith ↔ Joe Smeth
  • Blocking: Generating Candidates
    (token, last name) AND (1st letter, first name) = block key
      Candidates share a last-name token and the first letter of the first name:
      Matt Michelson ↔ Matthew Michelson, Jane Jones ↔ Jim Jones
    (token, zip) = block key
      Every record pair sharing zip 12345 becomes a candidate, e.g. Matt Michelson
      is paired with Matthew Michelson, Jim Jones, and Joe Smeth.
  • Blocking: Intuition
    Zip = '12345' selects one block of records sharing that zip; an incoming record
    is compared only against the records in that block.
      First Name | Last Name | Zip
      Matt       | Michelson | 12345
      Jane       | Jones     | 12345
      Joe        | Smith     | 12345
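The intuition above amounts to an inverted index on the block key: records that share a key value land in the same block, and only within-block pairs become candidates. A minimal sketch, with illustrative records and a hypothetical helper name (`block_candidates` is not BSL's API):

```python
from collections import defaultdict

def block_candidates(records_s, records_t, key_field):
    """Pair every S record with every T record sharing a value on key_field."""
    index = defaultdict(list)
    for i, rec in enumerate(records_s):
        index[rec[key_field]].append(i)
    return [(i, j)
            for j, rec in enumerate(records_t)
            for i in index.get(rec[key_field], [])]

S = [{"first": "Matt", "last": "Michelson", "zip": "12345"},
     {"first": "Jane", "last": "Jones", "zip": "12345"},
     {"first": "Joe", "last": "Smith", "zip": "12345"}]
T = [{"first": "Matthew", "last": "Michelson", "zip": "12345"},
     {"first": "Jim", "last": "Jones", "zip": "12345"},
     {"first": "Joe", "last": "Smeth", "zip": "12345"}]

# Blocking on zip puts everything in one block: all 9 cross pairs are candidates.
print(len(block_candidates(S, T, "zip")))  # 9
```

Blocking on a more selective field (e.g. last name) shrinks the candidate set, which is exactly the trade-off the RR/PC metrics on the next slide quantify.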
  • Blocking Effectiveness
    Reduction Ratio (RR) = 1 − ||C|| / (||S|| * ||T||)
      S, T are the data sets; C is the set of candidates
    Pairs Completeness (PC) = S_m / N_m
      S_m = # true matches among the candidates; N_m = # true matches between S and T
    Examples:
      (token, last name) AND (1st letter, first name): RR = 1 − 2/9 ≈ 0.78, PC = 1/2 = 0.50
      (token, zip): RR = 1 − 9/9 = 0.0, PC = 2/2 = 1.0
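Both metrics reduce to one-line formulas. A minimal sketch that plugs in the slide's two example schemes (||S|| = ||T|| = 3 records, 2 true matches):

```python
def reduction_ratio(num_candidates, size_s, size_t):
    """RR = 1 - ||C|| / (||S|| * ||T||): fraction of pairs blocking avoids."""
    return 1 - num_candidates / (size_s * size_t)

def pairs_completeness(matches_in_candidates, total_true_matches):
    """PC = S_m / N_m: fraction of true matches the candidates retain."""
    return matches_in_candidates / total_true_matches

# (token, last name) AND (1st letter, first name): 2 candidates, 1 of 2 matches kept
print(round(reduction_ratio(2, 3, 3), 2), pairs_completeness(1, 2))  # 0.78 0.5
# (token, zip): all 9 pairs are candidates, both matches kept
print(reduction_ratio(9, 3, 3), pairs_completeness(2, 2))  # 0.0 1.0
```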
  • How to choose methods and attributes?
    • Blocking goals:
      • Small number of candidates (high RR)
      • Don't leave any true matches behind! (high PC)
    • Previous approaches:
      • Ad hoc, by researchers or domain experts
    • New approach:
      • Blocking Scheme Learner (BSL): a modified Sequential Covering Algorithm
  • Learning Schemes: Intuition
    • Learn restrictive conjunctions
      • Partition the space → minimize false positives
    • Union the restrictive conjunctions
      • Cover all training matches
      • Since each conjunction minimizes false positives, the conjunctions should not contribute many false positives to the disjunction
  • Example to clear things up!
    [Figure: space of training examples, marked as match / not match]
    Rule 1 :- (zip|token) & (first|token)
    Rule 2 :- (last|1st letter) & (first|1st letter)
    Final rule :- [(zip|token) & (first|token)] UNION [(last|1st letter) & (first|1st letter)]
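A learned scheme like the final rule above is just a disjunction of conjunctions applied to record pairs. A sketch that hard-codes that rule; the helper predicates `token_match` and `first_letter` are hypothetical stand-ins for the (method, attribute) machinery, not BSL's actual implementation:

```python
def token_match(a, b):
    """True if the two strings share at least one whitespace token."""
    return bool(set(a.lower().split()) & set(b.lower().split()))

def first_letter(a, b):
    """True if the two strings start with the same letter."""
    return bool(a and b) and a[0].lower() == b[0].lower()

def is_candidate(r, s):
    """[(zip|token) & (first|token)] UNION [(last|1st letter) & (first|1st letter)]"""
    return ((token_match(r["zip"], s["zip"]) and token_match(r["first"], s["first"]))
            or (first_letter(r["last"], s["last"]) and first_letter(r["first"], s["first"])))

r = {"first": "Matt", "last": "Michelson", "zip": "12345"}
s = {"first": "Matthew", "last": "Michelson", "zip": "12345"}
print(is_candidate(r, s))  # True: the 1st-letter conjunction fires
```

Note that the pair survives even though "Matt" and "Matthew" fail the token test; covering such matches is exactly why the union of conjunctions is needed.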
  • SCA: propositional rules
    • Multi-pass blocking = disjunction of conjunctions
    • Learn conjunctions and union them together!
    • Cover all training matches to maximize PC
    SEQUENTIAL-COVERING(class, attributes, examples, threshold)
      LearnedRules ← {}
      Rule ← LEARN-ONE-RULE(class, attributes, examples)
      While there are examples left to cover, do:
        LearnedRules ← LearnedRules ∪ {Rule}
        Examples ← Examples − {examples covered by Rule}
        Rule ← LEARN-ONE-RULE(class, attributes, examples)
        If Rule contains any previously learned rules, remove them
      Return LearnedRules
  • SCA: propositional rules
    • LEARN-ONE-RULE is greedy
      • Check rule containment as you go, instead of comparing afterward
      • Example rule: (token|zip) & (token|first)
      • (token|zip) CONTAINS (token|zip) & (token|first)
      • Guarantees each later rule is less restrictive
        • If it were not, how would there be examples left to cover?
  • Learn-One-Rule
    • Learn the conjunction that maximizes RR
    • General-to-specific beam search
      • Keep adding/intersecting (method, attribute) pairs
        • Until RR can't be improved
        • Must satisfy a minimum PC
    [Figure: search tree over (method, attribute) pairs, e.g. (token, zip), (token, last name), (1st letter, last name), (token, first name), …]
  • Experiments
    HFM = ({token, make} ∩ {token, year} ∩ {token, trim})
          U ({1st letter, make} ∩ {1st letter, year} ∩ {1st letter, trim})
          U ({synonym, trim})
    BSL = ({token, model} ∩ {token, year} ∩ {token, trim})
          U ({token, model} ∩ {token, year} ∩ {synonym, trim})

    Restaurants:           RR      PC
      Marlin               55.35   100.00
      BSL                  99.26   98.16
      BSL (10%)            99.57   93.48
    Cars:                  RR      PC
      HFM                  47.92   99.97
      BSL                  99.86   99.92
      BSL (10%)            99.87   99.88
    Census:                RR      PC
      Best 5 Winkler       99.52   99.16
      Adaptive Filtering   99.9    92.7
      BSL                  98.12   99.85
      BSL (10%)            99.50   99.13
  • Conclusions and Future Work
    • Automatic blocking schemes using machine learning
      • Better than non-expert, ad hoc methods
      • Comparable to domain experts' rules
    • Future work
      • Capture/recapture to estimate lost matches → unsupervised
    • Software for BSL is available!
      • www.isi.edu/integration/people/knoblock/software.html