Effective string processing and matching for author entity

Effective String Processing and Matching for Author Disambiguation
Source: Journal of Machine Learning Research '14
Speaker: LIN, CI-JIE

Outline
Introduction
Method
Experiment
Conclusion

Introduction
• Track 2 of KDD Cup 2013 aims to identify duplicated authors in a data set from Microsoft Academic Search
• It is therefore a name disambiguation task

Introduction
1. Author.csv
2. Paper.csv
3. PaperAuthor.csv
4. Conference.csv
5. Journal.csv

Overview of Data sets
1. Abbreviation
2. Unusual Name
3. Inconsistent Information
4. Typo
5. Incomplete Name
6. Empty Entry
7. Missing Value
8. Nickname
9. Wrong matching between authors and papers
10. Non-English characters

Main Strategies
• First strategy: identify duplicates mainly by string matching
• Second strategy: if an author in Author.csv has no publication records in PaperAuthor.csv, assume that this author has no duplicates
• Third strategy: classify each author as Chinese or non-Chinese before any string matching
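
The second strategy can be sketched as a simple set lookup. This is only an illustration, not the team's code: the column names "Id" and "AuthorId" and the function name split_by_publication are assumptions, and rows are taken to be dictionaries such as those produced by csv.DictReader.

```python
def split_by_publication(author_rows, paper_author_rows):
    """Separate authors with publication records from those without.

    Assumed (hypothetical) columns: "Id" in Author.csv rows and
    "AuthorId" in PaperAuthor.csv rows.
    """
    published = {row["AuthorId"] for row in paper_author_rows}
    candidates, no_duplicates = [], []
    for row in author_rows:
        if row["Id"] in published:
            candidates.append(row["Id"])      # goes on to string matching
        else:
            no_duplicates.append(row["Id"])   # assumed to have no duplicates
    return candidates, no_duplicates
```

Only the first list would then be passed to the string-matching stages.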

Implementation
Two implementations of the framework were
finished by two groups within the team

The Framework
1. Chinese-or-not
2. Cleaning
3. Selection
4. Identification
5. Splitting
6. Linking

The First Implementation
1. Chinese-or-not

2. Cleaning
• Split two consecutive uppercase characters, e.g., replace "CJ" with "C J"
• Remove English honorifics (e.g., "Mr." and "Dr.")
• Transform uppercase to lowercase
• Remove apostrophes and replace punctuation. For example, "o'relly" becomes "orelly"

2. Cleaning (continued)
• Replace accented European letters with similar English letters
• Replace common English nicknames, e.g., replace "bill" with "william"
3. Selection
• Build a dictionary of (key, value) pairs. Each key is a set of words, and each value is the set of authors whose names contain the key
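
One way to picture this dictionary is an inverted index from word sets to author identifiers. The slides do not say how keys are generated, so the sketch below assumes every combination of key_size words from a cleaned name is a key; build_index and key_size are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations

def build_index(names, key_size=2):
    """Map each word-set key to the set of author ids whose name contains it.

    `names` maps author id -> cleaned name. Using all `key_size`-word
    combinations as keys is an assumption made for illustration.
    """
    index = defaultdict(set)
    for author_id, name in names.items():
        words = sorted(set(name.split()))
        size = min(key_size, len(words))
        if size == 0:
            continue  # empty name: nothing to index
        for combo in combinations(words, size):
            index[frozenset(combo)].add(author_id)
    return index
```

Authors that share a key become candidate pairs for the identification stage, which avoids comparing every pair of names.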

4. Identification
• Matching functions
1. Two names share the same set of words
• "Chih Jen Lin" and "Lin Chih Jen"
2. A shortened name
• "Ch. J. Lin" and "Chih Jen Lin"
3. A partially shortened name
• "C. J. Lin" and "C. J. Lint"
4. Alias
• Dry-run procedure
• "C J Lin" and "Chih Jen Lion" are loosely identical, while "Chih Jen Lin" and "Chen Ju Lin" are not
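
The first two matching functions can be illustrated on cleaned names (lowercase, no punctuation). This is a sketch only: the slides do not give exact definitions, so the position-wise prefix rule used for "shortened name" is an assumed reading, and both function names are hypothetical.

```python
def same_word_set(a: str, b: str) -> bool:
    """Matching function 1: both names consist of the same set of words."""
    return set(a.split()) == set(b.split())

def is_shortened(short: str, full: str) -> bool:
    """Matching function 2 (assumed reading): every word of `short` is a
    prefix of the word at the same position in `full`."""
    s, f = short.split(), full.split()
    return (
        len(s) == len(f)
        and s != f  # identical names are not "shortened"
        and all(fw.startswith(sw) for sw, fw in zip(s, f))
    )
```

Under this reading, "ch j lin" is a shortened form of "chih jen lin", while "chih jen lin" and "chen ju lin" do not match in either direction.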

5. Splitting
• Each author name is a full name with two words
• Neither author name is a partially shortened name of the other
• "kazuo kobayashi" and "kunikazu kobayashi"

The Second Implementation
1. Chinese-or-not
2. Cleaning
3. Selection
4. Identification

5. Splitting
• Common extended names (CEN)
• Longest common extended name (LCEN)

6. Linking
• The previous stages group duplicated names rather than identifiers
• However, the competition task is to group duplicated identifiers
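
Turning groups of names into groups of identifiers could be as simple as the sketch below. It assumes a lookup from each cleaned name to the identifiers that carry it; link_ids and ids_by_name are hypothetical names, and the real linking stage may apply further rules.

```python
def link_ids(name_groups, ids_by_name):
    """Convert groups of duplicated names into groups of author identifiers.

    `ids_by_name` maps a cleaned name to the set of identifiers that use it
    (an assumed data structure, not taken from the slides).
    """
    id_groups = []
    for group in name_groups:
        ids = set()
        for name in group:
            ids |= ids_by_name.get(name, set())
        id_groups.append(ids)
    return id_groups
```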

Ensemble
• Because the two implementations detect different sets of duplicates, an ensemble of their results may improve performance
• The filter checks whether two authors have a similar background
• Affiliation, field of study

Typo Correction
Handle typos in author names in Author.csv
Two author names are considered duplicates if
their word sets are the same after each typo is replaced by its correction, and
their affiliations share at least one common word
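
The two conditions can be written as a small predicate. This is a sketch under assumptions: the typo-to-correction table is supplied by the caller (the slides do not say how it is built), affiliations are compared as lowercase whitespace-separated words, and typo_duplicates is a hypothetical name.

```python
def typo_duplicates(name_a, aff_a, name_b, aff_b, corrections):
    """Return True if two (name, affiliation) pairs meet both conditions.

    `corrections` maps a misspelled word to its correction; how this table
    is obtained is not covered here.
    """
    def corrected_words(name):
        return {corrections.get(w, w) for w in name.split()}

    same_words = corrected_words(name_a) == corrected_words(name_b)
    shared = set(aff_a.lower().split()) & set(aff_b.lower().split())
    return same_words and bool(shared)
```

Requiring a shared affiliation word keeps a typo correction from merging two unrelated people whose names differ by one letter.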

Conclusion
Keep all information and delay modifications to the data set as long as possible, because the provided data set is noisy and incomplete
An important advantage of rule-based approaches is that the change in results can be traced easily after a rule is added
