Qiaoling Liu, Lead Data Scientist, CareerBuilder at MLconf ATL 2017
CompanyDepot: Employer Name Normalization in the Online Recruitment Industry
In the recruitment domain, the employer name normalization task, which links employer names in job postings or resumes to entities in an employer knowledge base (KB), is important to many business applications. It has several unique challenges: handling employer names from both job postings and resumes, leveraging the corresponding location and URL context, as well as handling name variations, irrelevant input data, and noise in the KB. In this talk, we present a system called CompanyDepot which uses machine learning techniques to address these challenges. The proposed system achieves 2.5% to 21.4% higher coverage at the same precision level compared to a legacy system used at CareerBuilder over multiple real-world datasets. After applying it to several applications at CareerBuilder, we faced a new challenge: how to avoid duplicate normalization results when the KB is noisy and contains many duplicate entities. To address this challenge, we extend the CompanyDepot system to normalize employer names not only at the entity level, but also at the cluster level, by mapping a query to a cluster in the KB that best matches the query. The proposed system performs efficient graph-based clustering based on external knowledge from five mapping sources. We also propose a new metric based on success rate and diversity reduction ratio for evaluating cluster-level normalization. Through experiments and applications, we demonstrate a large improvement in normalization quality from entity-level to cluster-level normalization.

  1. CompanyDepot: Employer Name Normalization in the Online Recruitment Industry
     Qiaoling Liu, Sep. 2017
  2. The Employer Name Normalization Task
     links employer names in job postings or resumes to entities in an employer knowledge base (KB)
  3. A domain-specific case of entity linking
     • Traditional entity linking: links entity mentions in text to entities in (often) a global KB.
     • Employer name normalization: links employer names in jobs/resumes to entities in an employer KB.
  4. Key Challenges
     1. Handle name variations
        • legacy names, nicknames, acronyms, typos
     2. Handle irrelevant or unlinkable input data
        • e.g., "self-employed", "not specified"
     3. Handle employer names from both job postings and resumes
        • different semi-structured formats
     4. Leverage the location/URL context
        • e.g., (Macys.com, San Francisco)
     5. Handle duplicates in the KB
        • e.g., {"Enterprise Rent A Car", "Enterprise Rentacar", "Enterprise Rent-A-Car Company"}
     (A mix of challenges common to entity linking and challenges unique to this domain.)
  5. Two Levels of Employer Name Normalization
     Entity-level normalization:
     • Handle name variations
     • Handle irrelevant or unlinkable input data
     • Handle employer names from both job postings & resumes
     • Leverage the location/URL context
     Cluster-level normalization:
     • Handle duplicates in the KB
  6. Entity-Level Normalization -- mapping a query to an entity
     Example: each query ("walmart pharmacy", "walmart", "target.com") is mapped to a single matching entity among {Walmart Pharmacy, Target Pharmacy, Walmart Supercenter, Walmart, Wal-Mart Stores, Inc., target.com, Target Corporation}.
  7. Cluster-Level Normalization -- mapping a query to a cluster of entities
     Example: the same queries ("walmart pharmacy", "walmart", "target.com") are each mapped to a cluster of entities, e.g., a Walmart cluster (Walmart Pharmacy, Walmart Supercenter, Walmart, Wal-Mart Stores, Inc.) or a Target cluster (Target Pharmacy, target.com, Target Corporation).
  8. Architecture of CompanyDepot
     Offline: the indexing step builds the KB index and mapping index from the employer knowledge base and the five mapping sources; the clustering step builds the clusters and the cluster index.
     Online: once the index is ready, the system can take normalization requests. Each request consists of an employer name and its location context (part of the location information could be empty). The retrieval step uses the searcher to retrieve a list of N candidate employer entities. The candidate entities are then sent to the reranking step, which generates a feature vector for each entity and uses a machine-learned (learning-to-rank) model to rank them. Finally, the top-ranked entity goes to the validation step, where a binary classifier decides whether it is a correct result for the query. If it says yes, the system returns this entity (and, via cluster lookup, a cluster result) to the user; otherwise, it outputs NIL.
     The mapping sources are used in both entity-level normalization (for query expansion) and cluster-level normalization (for graph-based clustering). Each source contains a set of mappings from surface forms to normalized forms.
  9. Entity-Level Normalization
  10. Query Expansion using External Knowledge from 5 Mapping Sources
      Table 1: Statistics and examples for mapping sources.
      Source    | Size | Example
      Wikipedia | 135K | IBM Corp. → International Business Machines Corporation
      Stock     | 6K   | MSFT → Microsoft Corporation
      Hierarchy | 272K | Amazon Web Services, Inc. → Amazon.com, Inc.
      Legacy    | 26M  | bankofamerica → Bank of America Corporation
      Provider  | 10M  | pricewaterhouse coopers → PwC
  11. Indexing Step
      • Using Lucene indexer
      Table 2: Index structure.
      (a) A document in the KB index:
          id: 15
          normalized form: International Business Machines Corporation
          calibrated name: internationalbusinessmachines
          domain: ibm.com
          json: {"id": "15", "normalized form": "International Business Machines Corporation", …}
      (b) A document in the mapping index:
          surface form: IBM
          normalized form: International Business Machines Corporation
          mapping source: wikipedia
      (c) A document in the cluster index:
          cluster member key: internationalbusinessmachines
          cluster representative: International Business Machines Corporation
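The three document shapes in Table 2 can be illustrated with plain Python records, plus a toy lookup that mimics mapping-based query expansion. This is only a sketch: the field names follow the slide, but the lookup logic is an assumption, not the actual Lucene implementation.

```python
# Mock in-memory stand-ins for the three indexes in Table 2 (IBM example).
kb_index = [{
    "id": "15",
    "normalized_form": "International Business Machines Corporation",
    "calibrated_name": "internationalbusinessmachines",
    "domain": "ibm.com",
}]
mapping_index = [{
    "surface_form": "IBM",
    "normalized_form": "International Business Machines Corporation",
    "mapping_source": "wikipedia",
}]
cluster_index = [{
    "cluster_member_key": "internationalbusinessmachines",
    "cluster_representative": "International Business Machines Corporation",
}]

def expand_query(q):
    """Expand a raw query with normalized forms found in the mapping index."""
    return [q] + [d["normalized_form"] for d in mapping_index
                  if d["surface_form"].lower() == q.lower()]
```

For example, `expand_query("ibm")` adds the normalized form from the Wikipedia mapping to the original query term.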
  12. Retrieval Step
      1. Get top 1000 entities using Lucene aggregated search combining (1) keyword searches; (2) fuzzy searches; (3) phrase searches.
         • Query expansion based on mappings, e.g., MSFT → Microsoft Corporation
      2. From these results, get top N1 entities by Lucene score, top N2 entities by Levenshtein distance, top N3 entities by mapping table, and top N4 entities by URL matching.
      3. Return the pool of N = N1 + N2 + N3 + N4 entities (N is about 10~20).
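The candidate-pooling idea in steps 2-3 can be sketched as follows. This is a simplified two-list version with hypothetical helper names; the real system also pools candidates by mapping table and URL match, and gets its scored candidates from Lucene.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def pool_candidates(query, scored_entities, n1=5, n2=5):
    """Union of top-n1 by retrieval score and top-n2 by edit distance."""
    by_score = [e for e, _ in sorted(scored_entities, key=lambda x: -x[1])][:n1]
    by_dist = sorted((e for e, _ in scored_entities),
                     key=lambda e: levenshtein(query, e.lower()))[:n2]
    seen, pool = set(), []
    for e in by_score + by_dist:          # deduplicate, keep order
        if e not in seen:
            seen.add(e)
            pool.append(e)
    return pool
```

A usage sketch: `pool_candidates("walmart", [("Walmart", 9.0), ("Wal-Mart Stores, Inc.", 8.0), ("Target Corporation", 1.0)], n1=2, n2=2)` keeps "Walmart" at the top of the pool.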
  13. Reranking Step
      1. Generate features for each entity:
         • query features: query length, whether the query location/URL is specified, etc.
         • query-entity features: Lucene score, string similarity, location/URL match, etc.
         • entity features: entity popularity, # locations, legal word presence, etc.
      2. Learn to rank the entities using coordinate ascent in RankLib,
         • a list-wise method that can directly optimize any user-specified ranking measure (e.g., P@1).
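The per-candidate feature vector can be mocked as a small function. This is an illustrative sketch: the feature names follow the slide, but difflib's `SequenceMatcher` stands in for whatever string-similarity measure the system actually uses, and the dict shapes for query and entity are assumptions.

```python
from difflib import SequenceMatcher

def features(query, entity, lucene_score):
    """Build a feature dict for one (query, entity) candidate pair."""
    return {
        # query features
        "query_length": len(query["name"]),
        "query_has_url": int(bool(query.get("url"))),
        # query-entity features
        "lucene_score": lucene_score,
        "string_similarity": SequenceMatcher(
            None, query["name"].lower(), entity["name"].lower()).ratio(),
        "url_match": int(query.get("url", "") == entity.get("domain", "")),
        # entity features
        "entity_popularity": entity.get("popularity", 0),
    }
```

In the real system, vectors like this are fed to RankLib's coordinate-ascent ranker, which searches feature weights one coordinate at a time to directly optimize a metric such as P@1.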
  14. Validation Step
      1. Generate features for the top-ranked entity:
         • all features from the previous step;
         • score of the learning-to-rank method.
      2. Classify the top-ranked entity into CORRECT or WRONG
         • binary classification using LibSVM.
  15. Cluster-Level Normalization
  16. Graph-Based Clustering using External Knowledge from 5 Mapping Sources
      (Revisits Table 1, the statistics and examples for the five mapping sources shown on slide 10.)
  17. Create an Undirected Graph
      [Figure: a weighted undirected graph over calibrated names such as walmart, walmartstores, walmartsupercenter, walmartcanada, wamart, target, targetpharmacy, targets, and targetstore, with edge weights between 1 and 4.]
  18. Remove Low-Quality Edges
      [Figure: the same graph after dropping low-weight edges, leaving the walmart names and the target names connected only within their own groups.]
  19. Find All Connected Components as Clusters
      [Figure: two connected components, one containing the walmart names (walmart, walmartstores, walmartsupercenter, walmartcanada, wamart) and one containing the target names (target, targetpharmacy, targets, targetstore).]
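The three slides above describe a simple pipeline: build a graph over calibrated names, drop weak edges, then take connected components as clusters. A minimal union-find sketch of that pipeline follows; the edge weights and the threshold value here are illustrative assumptions, not the system's actual parameters.

```python
def connected_components(nodes, edges, min_weight=2):
    """Cluster nodes by connected components after dropping weak edges.

    edges: iterable of (node_a, node_b, weight) tuples.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # path-halving union-find lookup
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, w in edges:
        if w >= min_weight:              # remove low-quality edges
            parent[find(a)] = find(b)    # union the two components

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())
```

On a toy version of the slide's graph, the walmart names and target names fall into two separate clusters once the weak cross-edge is removed.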
  20. Select Cluster Representative Entity
      [Figure: the walmart cluster is represented by Wal-Mart Stores, Inc., and the target cluster by Target Corporation.]
  21. Experiments
  22. Entity-Level Datasets
      Table 4: Statistics about the entity-level datasets. %Country (%State, %URL) means the percentage of queries with country (state, URL) specified. %US means the percentage of queries with country=US when country is specified.
      Dataset | #Queries | %Country | %US   | %State | %URL
      RDB     | 1098     | 58.5%    | 96.4% | 50.9%  | 0%
      EDGE    | 1093     | 97.3%    | 45.3% | 20.8%  | 0%
      JOB1    | 1100     | 100%     | 100%  | 99.7%  | 0%
      JOB2    | 500      | 100%     | 98.4% | 100%   | 0%
      JOBFEED | 453      | 87.5%    | 100%  | 87.5%  | 100%
  23. Metrics for Entity-Level Normalization
      • Ic: correct results; Iw: wrong results; In: null results
      • Precision = Ic / (Ic + Iw): percentage of correct results out of all non-null results.
      • Coverage = (Ic + Iw) / (Ic + Iw + In): percentage of queries for which a non-null result is returned.
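The two definitions above translate directly into code; a minimal sketch:

```python
def precision_coverage(ic, iw, in_):
    """Precision and coverage from counts of correct, wrong, and null results."""
    precision = ic / (ic + iw) if (ic + iw) else 0.0
    coverage = (ic + iw) / (ic + iw + in_) if (ic + iw + in_) else 0.0
    return precision, coverage
```

For example, 80 correct, 20 wrong, and 100 null results give precision 0.8 and coverage 0.5, which is the trade-off the precision-coverage curves on the next slide visualize.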
  24. Results on Entity-Level Normalization Datasets
      [Figure 4: precision-coverage curves comparing CD-V2-E, CD-V1, Legacy, and WService on the RDB, EDGE, JOB1, and JOB2 datasets.]
      [Figure 5: Results on JOBFEED (entity-level normalization), comparing CD-V2-E using the query URL, CD-V2-E ignoring the query URL, and WService.]
  25. Cluster-Level Datasets
      • Resume dataset
        • Search for resumes by the 98 most frequent search queries about companies
        • Get the 20 most frequent raw employer names from these resumes
        • Collect 817 unique raw employer names from resumes
      • Job dataset
        • Get the top 182 employer entities with the most jobs by a baseline normalizer
        • Get the raw employer names in the jobs posted by these entities
        • Collect 6515 unique raw employer names from job postings
  26. Metrics for Cluster-Level Normalization
      • Success Rate (SR): how likely the system returns a correct result.
      • Diversity Reduction Ratio (DRR): how much result diversity the system reduces correctly via clustering.
      • Light-weight labeling: for each query, label whether the result returned by the system is correct.
      Definitions (from the paper): for each query q in Q, we label whether the result fC(q) returned by the system is correct. Let QS be the set of successful queries, i.e., QS = {q ∈ Q | fC(q) is a correct result for q}. We define the Success Rate (SR) of the system as

          SR = |QS| / |Q|                                        (1)

      To measure the diversity in results returned by a system, we adapted the true diversity metric [14], which is defined based on entropy. As it does not matter how diverse the wrong results are, we only compute the diversity in the correct results. Let QS|r be the set of successful queries that are mapped to the cluster of r, i.e., QS|r = {q ∈ QS | fC(q) = r}. We first compute the entropy of the correct results as

          H = - Σ_{r ∈ R} (|QS|r| / |QS|) · ln(|QS|r| / |QS|)    (2)

      The above entropy H ∈ [0, ln|QS|] is not linear in |QS|, which makes it a little hard to understand and interpret, so True Diversity [14] is used: TD = exp(H). It gives the effective number of correct clusters returned by the system, and is linear in |QS|. Based on True Diversity, we compute how much result diversity the system reduces correctly, i.e., the Diversity Reduction Ratio (DRR), which is in range [0, 1]:

          DRR = 1 - (exp(H) - 1) / (|QS| - 1)                    (3)

      Finally, we compute the F-score (the harmonic mean) of Success Rate and Diversity Reduction Ratio to measure the normalization quality:

          F-score = 2 · SR · DRR / (SR + DRR)                    (4)

      The proposed metric is intuitive, showing the correctness and diversity of the results of a cluster-level normalization system, and it only requires light labeling effort: for each (query, result) pair, label whether the result is correct for the query or not.
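The metric definitions above can be computed with a short function. A sketch, assuming `results` maps each query to a (returned cluster representative, is-correct label) pair:

```python
import math

def cluster_metrics(results):
    """Return (SR, DRR, F-score) per equations (1)-(4) on the slide."""
    n = len(results)
    successes = [r for r, ok in results.values() if ok]
    sr = len(successes) / n                       # (1) success rate
    qs = len(successes)
    counts = {}
    for r in successes:                           # cluster sizes among correct results
        counts[r] = counts.get(r, 0) + 1
    h = -sum((c / qs) * math.log(c / qs)          # (2) entropy of correct results
             for c in counts.values())
    drr = (1 - (math.exp(h) - 1) / (qs - 1)       # (3) diversity reduction ratio
           if qs > 1 else 1.0)
    f = 2 * sr * drr / (sr + drr) if sr + drr else 0.0   # (4) F-score
    return sr, drr, f
```

For instance, four queries of which three are correct and all map to the same cluster give SR = 0.75, DRR = 1.0 (maximal reduction, since H = 0), and F-score = 6/7.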
  27. Results on Cluster-Level Normalization Datasets
      Table 7: Results on cluster-level normalization datasets.
      (a) Resume dataset.
      System    | SuccessRate | DiversityReductionRatio | F-score
      CD-V2-C   | 0.963       | 0.704                   | 0.814
      CD-V1.5-C | 0.897       | 0.688                   | 0.779
      CD-V2-E   | 0.958       | 0.416                   | 0.580
      (b) Job dataset.
      System    | SuccessRate | DiversityReductionRatio | F-score
      CD-V2-C   | 0.904       | 0.979                   | 0.940
      CD-V1.5-C | 0.778       | 0.981                   | 0.868
      CD-V2-E   | 0.905       | 0.926                   | 0.915
      CD-V2-C has a much higher diversity reduction ratio than CD-V2-E at a similar success rate.
  28. Application: Candidate Search Results Facets
      From entity-level normalization to cluster-level normalization:
      ✓ Correctness remained
      ✓ Diversity reduced
  29. Conclusion and Future Work
      ✓ Presented CompanyDepot: supporting employer name normalization at both entity and cluster level
      ✓ Proposed new metrics for cluster-level normalization
      Future work:
      • Improve clustering, e.g., merge and split
      • Develop more features for entity quality and query segmentation
      • Improve the quality and coverage of the employer KB
  30. Thank you! Any questions? qiaoling.liu@careerbuilder.com
  31. Backup
  32. Calibrating Employer Names
      1. Convert the name to lowercase, and replace 's with s;
      2. Convert all the non-alphanumeric characters to spaces;
      3. Remove stop-phrases (e.g., "pvt ltd" and "l l c") and stop-words (e.g., "inc", "corporation", "incorporated", and "the");
      4. Expand commonly used abbreviations, e.g., "ctr" → "center", "svc" → "services";
      5. Remove all spaces in the name.
      Employer name                               | After calibration
      International Business Machines Corporation | internationalbusinessmachines
      Sherman Howard L.L.C.                       | shermanhoward
      Oxnard Police Dept                          | oxnardpolicedepartment
      Macy's, Inc.                                | macys
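The five calibration steps above can be sketched as a single function. This is a minimal sketch: the real stop-word, stop-phrase, and abbreviation lists are larger than these samples, and the "dept" → "department" entry is an assumed addition needed for the Oxnard Police example.

```python
import re

STOP_PHRASES = ["pvt ltd", "l l c"]                       # sample list
STOP_WORDS = {"inc", "corporation", "incorporated", "the"}  # sample list
ABBREVIATIONS = {"ctr": "center", "svc": "services",
                 "dept": "department"}                    # "dept" is assumed

def calibrate(name):
    s = name.lower().replace("'s", "s").replace("’s", "s")   # step 1
    s = re.sub(r"[^a-z0-9]+", " ", s).strip()                # step 2
    for phrase in STOP_PHRASES:                              # step 3: stop-phrases
        s = s.replace(phrase, " ")
    tokens = [ABBREVIATIONS.get(t, t)                        # step 4: abbreviations
              for t in s.split() if t not in STOP_WORDS]     # step 3: stop-words
    return "".join(tokens)                                   # step 5: drop spaces
```

Running it on the slide's examples reproduces the table, e.g. `calibrate("Macy's, Inc.")` yields `"macys"`.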
  33. Related Work
      • Entity Linking with a Knowledge Base
      • Domain-Specific Name Normalization
      • Deduplicating Domain-Specific KBs
      • Clustering Methods and Evaluation Metrics
