Data Science @ The Search Party
Jan Luts
Outline
• About myself
• The Search Party
• What is data science @ The Search Party?
• Deduplication of candidates
• Visualization of career paths
• Technology - Software
• Conclusion
About myself
• Master in Information Sciences, Universiteit Hasselt, Belgium
• Master in Bioinformatics, Katholieke Universiteit Leuven, Belgium
• Master in Statistics, Katholieke Universiteit Leuven, Belgium
• PhD and Postdoc in Engineering, Department of Electrical Engineering,
Katholieke Universiteit Leuven (Sabine Van Huffel, Johan Suykens)
“Predictive computer models, machine learning, decision support systems”
• Postdoc, School of Mathematical Sciences, University of Technology Sydney,
Australia (Matt Wand) “Mean field variational Bayes,
semiparametric regression, streaming data, real-time analysis”
• October 2013: Data Scientist, The Search Party, Sydney
The Search Party
There are major forces acting on Recruitment as an industry…
• Traditional recruitment model under pressure from technology
• Pressure on pricing damaging agency profitability
• Bulk of agency costs are people who drive revenue
• Global economic uncertainty
• Corp. investment in internal talent sourcing teams
?
We allow potential employers to search a vast ocean of the world's
best candidates
We connect employers with the agencies who represent those candidates to agree
a fee and arrange an introduction
Supporting this evolution is the world's first marketplace for talent…
http://thesearchparty.com/
(Figure labels: Employer, Recruiter)
Data
• 2 million candidates
• 46 million skills
• 14 million employment history records, e.g.:
o Concrete Formworker, Doran Contractors, 1999-2012
o Site Supervisor, Allied Gold, 1997-2000
o Java Developer, IBM, 2010-2011
• 40000 vacancies
• 29 industries, 384 subsectors, e.g.:
o Engineering
o Accounting
o Administration & Office Support
o Advertising, Arts & Media
o Banking & Financial Services
o Call Centre & Customer Services
o Community Services & Development
o Construction
o Consulting & Strategy
o Design & Architecture
o Education & Training
• 75 GB marketplace logs, with events such as:
o Create Candidate
o Publish Candidate
o Forgot Password
o Submit Candidate
o Vote Up
o Vote Down
o Request Candidate
o Appeared In Search Results
o Account Login
o Upload CV
• 100 recruitment agencies
Data science @ The Search Party!
• Testing hypotheses
• Design of experiments
• Cross-validation
• Training data vs. test data
• Performance measure
• Building a prediction model
• Regression
• Support vector machines
• Variable selection
• Sensitivity, specificity
• Cost and benefit
• Clustering
• Topic modeling
• Distributed computing
• Programming
• Software engineering
• Data structures
• Term frequency - inverse document frequency
• Entity resolution
• Sentence detection
• Tokenization
• Sentiment analysis
• Part-of-speech tagging
→ statistics
→ machine learning
→ data mining
→ computer science
→ information retrieval
→ natural language processing
Deduplication of candidates
(Figure labels: Recruiter 1, Recruiter 2, Recruiter 3, The Search Party Database, Employer)
Deduplication of candidates
(Figure from Lise Getoor)
Clustering
• Entity resolution does not happen independently for each
pair of candidates
• Number of clusters is unknown
• Many, many small (possibly singleton) clusters
Correlation clustering
• Take a pair‐wise similarity graph as input
• Edge x_ij ∈ {0,1} with x_ij = 1 if candidates i and j are assigned to the
same cluster. p_ij is the ‘belief’ that candidates i and j are
the same
• Optimize:
Define:
Correlation clustering
Micha Elsner and Warren Schudy. 2009. Bounding and comparing methods for
correlation clustering beyond ILP. In Proceedings of the Workshop on Integer Linear
Programming for Natural Language Processing (ILP '09). Association for
Computational Linguistics, Stroudsburg, PA, USA, 19-27.
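
As a minimal sketch of the idea (not the exact formulation on the slide, nor one of the solvers compared by Elsner and Schudy), the Python below scores a clustering with a log-likelihood objective, the sum over pairs of x_ij·log(p_ij) + (1 − x_ij)·log(1 − p_ij), and builds clusters with a simple randomized pivot heuristic. The objective form, the 0.5 cut-off, the function names and the toy beliefs are all assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def log_likelihood(labels, belief):
    """Score a clustering: sum of x_ij*log(p_ij) + (1 - x_ij)*log(1 - p_ij)."""
    total = 0.0
    for i, j in combinations(range(len(labels)), 2):
        p = min(max(belief[(i, j)], 1e-9), 1 - 1e-9)  # keep log() finite
        same = labels[i] == labels[j]                 # this is x_ij
        total += math.log(p) if same else math.log(1.0 - p)
    return total


def pivot_clustering(n, belief, cutoff=0.5, seed=0):
    """Pick a pivot, absorb every remaining item whose belief of matching
    the pivot exceeds the cutoff, then repeat on what is left."""
    rng = random.Random(seed)
    remaining = list(range(n))
    rng.shuffle(remaining)
    labels = [None] * n
    cluster_id = 0
    while remaining:
        pivot = remaining.pop(0)
        labels[pivot] = cluster_id
        keep = []
        for other in remaining:
            key = (min(pivot, other), max(pivot, other))
            if belief[key] > cutoff:
                labels[other] = cluster_id
            else:
                keep.append(other)
        remaining = keep
        cluster_id += 1
    return labels


# Toy beliefs (hypothetical): items 0-2 look like one person, 3-4 like another.
belief = {(i, j): 0.1 for i, j in combinations(range(5), 2)}
for pair in [(0, 1), (0, 2), (1, 2), (3, 4)]:
    belief[pair] = 0.9

labels = pivot_clustering(5, belief)
print(labels, round(log_likelihood(labels, belief), 3))
```

Note that the number of clusters is never specified: it falls out of the pairwise beliefs, which matches the "number of clusters is unknown" setting.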
Pairwise similarity matrix
• We need a measure that quantifies the similarity between
candidates:
• Candidate 1: Jan Luts, jan.m.luts@gmail.com, KULeuven, UTS
• Candidate 2: Jan Luts, jan.m.luts@gmail.com, KULeuven, UTS
• Candidate 3: Jam Lutf, jan.m.luts@gmail.com
• Candidate 4: J Luts, KULeuven
• Candidate 5: Ian Luts, jan.m.luts@gmail.com, KULeuven, UTS, TSP
• Candidate 6: Jan Luts, john@staffrecruitment.com, UTS, TSP
Term frequency - inverse document frequency
Terms:       jan.  an.m  n.m.  luts  uts@  mail  gmai  .com  @hot  jan_
Candidate 1  1     1     1     1     1     1     1     1     0     0
Candidate 2  1     1     1     1     1     1     1     1     0     0
Candidate 3  1     1     1     1     1     1     1     1     0     0
Candidate 4  0     0     0     0     0     0     0     0     0     0
Candidate 5  1     1     1     1     1     0     1     1     0     0
Candidate 6  0     0     0     1     1     1     0     1     1     1
→ These are called ‘term frequencies’
→ Inverse document frequency for ‘.com’: log(6/5) (natural logarithm)
→ TF-IDF for ‘.com’ for candidate 6: 1 * log(6/5) = 0.18
→ TF-IDF for ‘jan_’ for candidate 6: 1 * log(6/1) = 1.79
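
The two TF-IDF values on this slide can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the terms look like character 4-grams of the email address (the helper `char_ngrams` is a hypothetical name), the term-frequency matrix is copied from the table above, and IDF is taken as the natural log of N divided by the document frequency.

```python
import math


def char_ngrams(text, n=4):
    """All character n-grams of a string (assumed tokenisation)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


terms = ["jan.", "an.m", "n.m.", "luts", "uts@", "mail", "gmai", ".com", "@hot", "jan_"]

# Binary term frequencies copied from the slide (rows = candidates 1..6).
tf = [
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
]
n_docs = len(tf)


def idf(term_index):
    """log(N / number of candidates whose record contains the term)."""
    df = sum(1 for row in tf if row[term_index] > 0)
    return math.log(n_docs / df)  # every term in this table has df >= 1


tfidf = [[row[k] * idf(k) for k in range(len(terms))] for row in tf]

com, jan_ = terms.index(".com"), terms.index("jan_")
print(round(tfidf[5][com], 2), round(tfidf[5][jan_], 2))  # 0.18 1.79
```

A rare term such as ‘jan_’ gets roughly ten times the weight of ‘.com’, which appears in almost every record and therefore says little about identity.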

Pairwise similarity matrix
• Combine cosine similarity values for name, email
address, phone number, mobile number, skills,
employment history, …
         Cand 1  Cand 2  Cand 3  Cand 4  Cand 5  Cand 6
Cand 1   1       1       0.8     0.9     0.95    0.75
Cand 2           1       0.8     0.9     0.95    0.75
Cand 3                   1       0.6     0.87    0.7
Cand 4                           1       0.75    0.7
Cand 5                                   1       0.8
Cand 6                                           1
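
A sketch of how one entry of such a matrix could be computed: cosine similarity per field on TF-IDF vectors, then a combination across fields. The slide does not say how the per-field values are combined, so the weighted average below, the weights and the toy vectors are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty field gives no evidence of a match
    return dot / (norm_u * norm_v)


def combined_similarity(fields_a, fields_b, weights):
    """Weighted average of per-field cosine similarities (assumed rule)."""
    total_weight = sum(weights.values())
    score = sum(
        w * cosine_similarity(fields_a[name], fields_b[name])
        for name, w in weights.items()
    )
    return score / total_weight


# Hypothetical TF-IDF vectors per field for three candidates.
cand_a = {"name": [1.0, 1.0, 0.0], "email": [0.5, 0.0, 2.0]}
cand_b = {"name": [1.0, 1.0, 0.0], "email": [0.5, 0.0, 2.0]}
cand_c = {"name": [1.0, 0.0, 1.0], "email": [0.0, 0.0, 0.0]}
weights = {"name": 1.0, "email": 2.0}

sim_ab = combined_similarity(cand_a, cand_b, weights)
sim_ac = combined_similarity(cand_a, cand_c, weights)
print(sim_ab, sim_ac)
```

Only the upper triangle needs to be computed, since the similarity is symmetric.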
Correlation clustering
(Elsner and Schudy, 2009)
• O(n²) pairwise comparisons
• Does not scale with increasing number of candidates!
‘Big Data’
• ‘Big Data’ criticism:
• ‘You May Not Need Big Data After All’, HBR, December 2013
• ‘Google Flu Trends: The Limits of Big Data’, NYT, March 2014
• ‘Big data: are we making a big mistake?’, FT Magazine, March 2014
• ‘The backlash against big data’, The Economist, April 2014
• @ The Search Party:
• Sampling can help sometimes, but not always …
• We have a lot of data, this creates new problems …
• … and we just have to deal with it
• We need the right tools and algorithms to process millions of data
points
Deduplication of candidates
• So how can we do correlation clustering on millions of
candidates?
o Blocking: e.g. split the data set into separate blocks based on
gender, geographical location, …
o Canopy clustering:
→ Pre-clustering algorithm used as a preprocessing
step: use a cheap distance measure to partition the
data into overlapping subsets (i.e. canopies)
→ Run expensive clustering on each canopy
(Figure label: All candidates)
Canopy clustering
Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of
high-dimensional data sets with application to reference matching. In Proceedings of the sixth
ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '00).
ACM, New York, NY, USA, 169-178.
• Start with a list of the candidates in any order, and with
two distance thresholds, T1 and T2, where T1 > T2.
• Pick a candidate from the list, make it a canopy center and
approximately measure its distance to all other candidates.
• Put all candidates that are within distance threshold T1
into a canopy. Remove from the list all candidates that are
within distance threshold T2. Repeat until the list is empty.
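
The three steps above can be sketched in a few lines of single-machine Python. The function name, the 1-D toy points and the absolute-difference distance are assumptions for illustration; in the dedupe setting the cheap distance would be derived from TF-IDF vectors.

```python
def canopy_clustering(points, distance, t1, t2):
    """Return a list of (center_index, member_indices) canopies."""
    if not 0 < t2 < t1:
        raise ValueError("thresholds must satisfy 0 < T2 < T1")
    remaining = list(range(len(points)))  # candidates that can still become centers
    canopies = []
    while remaining:
        center = remaining[0]
        # Loose threshold T1: everything nearby joins the canopy (overlap allowed).
        members = [
            i for i in range(len(points))
            if distance(points[center], points[i]) < t1
        ]
        # Tight threshold T2: nearby candidates can no longer become centers.
        remaining = [
            i for i in remaining
            if i != center and distance(points[center], points[i]) >= t2
        ]
        canopies.append((center, members))
    return canopies


# Toy data (hypothetical): six candidates placed on a line.
points = [0.0, 0.5, 1.2, 5.0, 5.4, 9.0]
canopies = canopy_clustering(points, lambda a, b: abs(a - b), t1=2.0, t2=1.0)
for center, members in canopies:
    print(center, members)
```

Because T1 > T2, a candidate can sit in several canopies, so a true duplicate pair is unlikely to be split by the cheap pre-clustering step.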
Canopy clustering
Five canopies found
Do correlation clustering on each canopy
Deduplication of candidates
Strategy outline:
• Do canopy clustering using TF-IDFs
• Do expensive correlation clustering for each canopy using a
similarity matrix based on all available candidate information
(e.g. name, email, phone, mobile, employment history,
publications, certificates, …)
• We need to do only a fraction < 0.005 of all possible pairwise comparisons
Optimization:
• Parallelization of TF-IDF computation, canopy clustering
• Run correlation clustering in parallel for each canopy
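
To put the numbers in perspective, the arithmetic below counts all unordered pairs among 2 million candidates and the budget that a 0.005 fraction corresponds to. The canopy sizes are hypothetical (400,000 non-overlapping canopies of 5 candidates each), chosen only to show how restricting expensive comparisons to within-canopy pairs shrinks the workload; the cheap canopy distance computations are not counted here.

```python
def n_pairs(n):
    """Number of unordered pairs among n items: n(n-1)/2."""
    return n * (n - 1) // 2


all_pairs = n_pairs(2_000_000)   # 1_999_999_000_000
budget = all_pairs * 5 // 1000   # a 0.005 fraction: 9_999_995_000

# Hypothetical canopy sizes, not figures from the talk.
canopy_sizes = [5] * 400_000
within_canopy_pairs = sum(n_pairs(size) for size in canopy_sizes)  # 4_000_000

print(all_pairs, budget, within_canopy_pairs)
print(within_canopy_pairs / all_pairs)
```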
Large-scale data processing with Hadoop:
• Open-source software framework for distributed computing
• MapReduce programming model
• Resilient to failure
How to do canopy clustering on Hadoop?
• Two steps:
• Canopy generation: identify the canopy centers
• Canopy filling: assign candidates to canopies
Canopy generation on Hadoop
Input: Candidates Batch 1, Batch 2, Batch 3, Batch 4
Map:
Initialize: centers1 = {} centers2 = {} centers3 = {} centers4 = {}
For each batch in parallel: if ∀i, distance(candidate x, center i) > T2,
add candidate x to the centers of that batch and
output the pair (‘intermediateCenter’, candidate x)
Output: Intermediate Centers
Reduce:
Initialize: finalCenters = {}
If ∀i, distance(intermediateCenter x, finalCenter i) > T2,
add intermediateCenter x to finalCenters and
output the pair (‘finalCenter’, intermediateCenter x)
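
The map and reduce steps can be imitated in plain Python to check the logic before moving to a cluster. This is a sketch only: it runs the "mappers" in a loop rather than in parallel, it does not use any Hadoop API, and the function names and toy data are assumptions.

```python
def generate_centers(items, distance, t2):
    """Keep an item as a center only if it is farther than T2 from all centers so far."""
    centers = []
    for x in items:
        if all(distance(x, c) > t2 for c in centers):
            centers.append(x)
    return centers


def map_phase(batches, distance, t2):
    """Each mapper scans its own batch and emits ('intermediateCenter', x) pairs."""
    emitted = []
    for batch in batches:  # on a cluster these batches are processed in parallel
        for x in generate_centers(batch, distance, t2):
            emitted.append(("intermediateCenter", x))
    return emitted


def reduce_phase(emitted, distance, t2):
    """A single reducer thins the intermediate centers with the same rule."""
    intermediate = [x for key, x in emitted if key == "intermediateCenter"]
    return [("finalCenter", x) for x in generate_centers(intermediate, distance, t2)]


def dist(a, b):
    return abs(a - b)


# Toy data (hypothetical): two batches of candidates placed on a line.
batches = [[0.0, 0.3, 5.0], [0.2, 5.1, 9.0]]
intermediate = map_phase(batches, dist, t2=1.0)
final = reduce_phase(intermediate, dist, t2=1.0)
print(final)
```

The reducer is needed because two mappers can each pick a center for the same region (0.0 and 0.2 here); the second pass removes such near-duplicates.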
Canopy filling on Hadoop
Input: Candidates Batch 1, Batch 2, Batch 3, Batch 4
Map:
Retrieve canopyCenters from canopy generation job
For each batch in parallel: ∀i, if distance(candidate x, center i)
< T1, output the pair (center i, candidate x)
Output: Center-Candidate Batch 1, Batch 2, Batch 3
Reduce:
For each batch: output the list of all candidates belonging to the
same canopy with center i
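
The filling step can be imitated the same way. Again this is a plain-Python sketch with no Hadoop API involved; the centers are assumed to come from the generation job, and the names and toy data are hypothetical.

```python
from collections import defaultdict


def fill_map(batches, centers, distance, t1):
    """Emit (center index, candidate) for every center closer than T1."""
    emitted = []
    for batch in batches:  # on a cluster these batches are processed in parallel
        for x in batch:
            for i, c in enumerate(centers):
                if distance(x, c) < t1:
                    emitted.append((i, x))
    return emitted


def fill_reduce(emitted):
    """Group candidates by center: one list per canopy."""
    canopies = defaultdict(list)
    for center_index, x in emitted:
        canopies[center_index].append(x)
    return dict(canopies)


def dist(a, b):
    return abs(a - b)


# Toy data (hypothetical); centers as produced by a canopy generation run.
centers = [0.0, 5.0, 9.0]
batches = [[0.0, 0.3, 5.0], [0.2, 5.1, 9.0, 7.0]]
canopies = fill_reduce(fill_map(batches, centers, dist, t1=2.5))
print(canopies)
```

The candidate at 7.0 lands in two canopies, which is the intended overlap: the expensive clustering later decides where it really belongs.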
Deduplication of candidates - Summary
• Our dedupe pipeline is a blend of concepts from information
retrieval (TF-IDF), statistics and machine learning
(correlation clustering)
• Applying it to large data sets causes new problems and
requires redesigning/adjusting the algorithms (canopy
clustering, distributed computing, Hadoop)
• Integration in the existing platform:
o How do data get in and out of the dedupe pipeline?
o Making it work in a ‘production environment’: fail-safe
code - in case of failure, handle it in a safe way
Visualization of career paths
• 14 million employment history records:
• Longitudinal data: transitions between different jobs
• Available data: job titles, employer, full description, skills,
start dates, end dates, different versions of CV…
Visualization of career paths
• Visualize transition between jobs based on job title:
(Figure: graph of job-title transitions between network consultant,
senior network consultant, technical project manager, senior network
engineer, technical consultant, network analyst, network manager,
consultant, network engineer, network architect, project manager and
IT manager; edges are labelled with values between .04 and .18)
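
One way such edge values could be obtained is as relative transition frequencies between consecutive job titles. The sketch below assumes exactly that, plus careers already sorted by start date; the function name and the toy careers are hypothetical and the output is not the data behind the figure.

```python
from collections import Counter, defaultdict


def transition_frequencies(careers):
    """For each job title, the share of moves that go to each next title."""
    counts = defaultdict(Counter)
    for titles in careers:
        for current, following in zip(titles, titles[1:]):
            if current != following:  # ignore repeats of the same title
                counts[current][following] += 1
    freqs = {}
    for title, nexts in counts.items():
        total = sum(nexts.values())
        freqs[title] = {nxt: n / total for nxt, n in nexts.items()}
    return freqs


# Toy careers (hypothetical), each ordered from oldest to most recent job.
careers = [
    ["network engineer", "senior network engineer", "network architect"],
    ["network engineer", "network consultant"],
    ["network engineer", "senior network engineer"],
    ["network consultant", "senior network consultant"],
]
freqs = transition_frequencies(careers)
print(freqs["network engineer"])
```

With real data, job titles would first need normalising (spelling variants, seniority prefixes) so that equivalent titles map to the same node.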
Visualization of career paths
Demo
Technology - Software
Conclusion
• Innovative work in a challenging environment
• Variety: understanding business problems, literature
review, algorithm design, prototyping, evaluation,
implementation, optimization
• Data science: statistics has a very important role to play
• Software engineering skills
• Big data: large data sets cause new problems
• Team work
• Passion!
Thanks!

Data Science @ The Search Party (Dr. Jan Luts)

  • 1. Data Science @ The Search Party Jan Luts
  • 2. Outline • About myself • The Search Party • What is data science @ The Search Party? • Deduplication of candidates • Visualization of career paths • Technology - Software • Conclusion
  • 3. Outline • About myself • The Search Party • What is data science @ The Search Party? • Deduplication of candidates • Visualization of career paths • Technology - Software • Conclusion
  • 4. About myself • Master in Information Sciences, Universiteit Hasselt, Belgium • Master in Bioinformatics, Katholieke Universiteit Leuven, Belgium • Master in Statistics, Katholieke Universiteit Leuven, Belgium • PhD and Postdoc in Engineering, Department of Electrical Engineering, Katholieke Universiteit Leuven (Sabine Van Huffel, Johan Suykens) “Predictive computer models, machine learning, decision support systems” • Postdoc, School of Mathematical Sciences, University of Technology Sydney, Australia (Matt Wand) “Mean field variational Bayes, semiparametric regression, streaming data, real-time analysis” • October 2013: Data Scientist, The Search Party, Sydney
  • 5. Outline • About myself • The Search Party • What is data science @ The Search Party? • Deduplication of candidates • Visualization of career paths • Technology - Software • Conclusion
  • 6. The Search Party There are major forces acting on Recruitment as an industry… Traditional recruitment model under pressure from technology Pressure on pricing damaging agency profitability Bulk of agency costs are people who drive revenue Global economic uncertainty Corp. investment in internal talent sourcing teams ?
  • 7. We allow potential employers to search a vast ocean of the worlds best candidates We connect employers with the Agencies who represent them to agree a fee and arrange an introduction Supporting this evolution is the world’s first marketplace for talent………..
  • 14. Outline • About myself • The Search Party • What is data science @ The Search Party? • Deduplication of candidates • Visualization of career paths • Technology - Software • Conclusion
  • 15. Data • 2 million candidates
  • 16. Data • 2 million candidates • 46 million skills
  • 17. Data • 2 million candidates • 46 million skills • 14 million employment history records Concrete Formworker Doran Contractors 1999-2012 Site Supervisor Allied Gold 1997-2000 Java Developer IBM 2010-2011
  • 18. Data • 2 million candidates • 46 million skills • 14 million employment history records • 40000 vacancies
  • 19. Data • 2 million candidates • 46 million skills • 14 million employment history records • 40000 vacancies • 29 industries, 384 subsectors Engineerin g Accounting Administration & Office Support Advertising, Arts & Media Banking & Financial Services Call Centre & Customer Services Community Services & Development Construction Consulting & Strategy Design & Architecture Education & Training
  • 20. Data • 2 million candidates • 46 million skills • 14 million employment history records • 40000 vacancies • 29 industries, 384 subsectors • 75 GB marketplace logs Create Candidate Publish Candidate Forgot Password Submit CandidateVote Up Vote Down Request Candidate Appeared In Search Results Account Login Upload CV
  • 21. Data • 2 million candidates • 46 million skills • 14 million employment history records • 40000 vacancies • 29 industries, 384 subsectors • 75 GB marketplace logs • 100 recruitment agencies
  • 22. Data science @ The Search Party! • Testing hypotheses • Design of experiments • Cross-validation • Training data vs. test data • Performance measure • Building a prediction model • Regression • Support vector machines • Variable selection • Sensitivity, specificity • Cost and benefit • Clustering • Topic modeling • Distributed computing • Programming • Software engineering • Data structures • Term frequency - inverse document frequency • Entity resolution • Sentence detection • Tokenization • Sentiment analysis • Part-of-speech tagging → statistics, machine learning, data mining, computer science, information retrieval, natural language processing
  • 24. Deduplication of candidates [Figure: Recruiter 1, Recruiter 2, Recruiter 3 → The Search Party Database]
  • 29. Clustering • Entity resolution does not happen independently for each pair of candidates • Number of clusters is unknown • Many, many small (possibly singleton) clusters
  • 30. Correlation clustering • Take a pairwise similarity graph as input • Edge x_ij ∈ {0,1}, with x_ij = 1 if candidates i and j are assigned to the same cluster. p_ij is the ‘belief’ that candidates i and j are the same • Optimize: [formula not recovered] • Define: [formula not recovered]
  • 31. Correlation clustering Micha Elsner and Warren Schudy. 2009. Bounding and comparing methods for correlation clustering beyond ILP. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing (ILP '09). Association for Computational Linguistics, Stroudsburg, PA, USA, 19-27.
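The objective on this slide did not survive extraction. As a rough illustration only, the sketch below scores a clustering with one common correlation clustering cost, the expected number of disagreements: a pair placed in the same cluster costs 1 - p_ij, a pair split across clusters costs p_ij. This cost, the function name `disagreement_cost` and the toy beliefs are assumptions, not necessarily the formulation the slide defined.

```python
from itertools import combinations


def disagreement_cost(p, labels):
    """Assumed cost: sum over pairs of x_ij * (1 - p_ij) + (1 - x_ij) * p_ij."""
    cost = 0.0
    for i, j in combinations(range(len(labels)), 2):
        # x_ij = 1 when candidates i and j share a cluster label.
        x_ij = 1 if labels[i] == labels[j] else 0
        cost += x_ij * (1 - p[i][j]) + (1 - x_ij) * p[i][j]
    return cost


# Toy beliefs (made up): candidates 0 and 1 are probably the same person.
p = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

good = disagreement_cost(p, [0, 0, 1])  # groups 0 and 1 together
bad = disagreement_cost(p, [0, 1, 1])   # groups 1 and 2 together
```

A lower cost means the clustering agrees better with the pairwise beliefs, so `good` should come out below `bad` here.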
  • 32. Pairwise similarity matrix • We need a measure that quantifies the similarity between candidates: • Candidate 1: Jan Luts, jan.m.luts@gmail.com, KULeuven, UTS • Candidate 2: Jan Luts, jan.m.luts@gmail.com, KULeuven, UTS • Candidate 3: Jam Lutf, jan.m.luts@gmail.com • Candidate 4: J Luts, KULeuven • Candidate 5: Ian Luts, jan.m.luts@gmail.com, KULeuven, UTS, TSP • Candidate 6: Jan Luts, john@staffrecruitment.com, UTS, TSP
  • 33. Term frequency - inverse document frequency
        Terms       jan.  an.m  n.m.  luts  uts@  mail  gmai  .com  @hot  jan_
        Candidate1  1     1     1     1     1     1     1     1     0     0
        Candidate2  1     1     1     1     1     1     1     1     0     0
        Candidate3  1     1     1     1     1     1     1     1     0     0
        Candidate4  0     0     0     0     0     0     0     0     0     0
        Candidate5  1     1     1     1     1     0     1     1     0     0
        Candidate6  0     0     0     1     1     1     0     1     1     1
      • These are called ‘term frequencies’
      • Inverse document frequency for ‘.com’: log(6/5)
      • TF-IDF for ‘.com’ for candidate 6: 1 * log(6/5) = 0.18
      • TF-IDF for ‘jan_’ for candidate 6: 1 * log(6/1) = 1.79
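The arithmetic on this slide can be reproduced with a few lines of Python. The sketch below copies the term frequencies from the slide, uses the natural logarithm (which matches the slide's 0.18 and 1.79), and adds a small cosine similarity helper. The names `inverse_document_frequency` and `cosine` are illustrative, not taken from the talk.

```python
import math

terms = ["jan.", "an.m", "n.m.", "luts", "uts@", "mail", "gmai", ".com", "@hot", "jan_"]

# Term frequencies as shown on the slide (rows: candidates, columns: terms).
tf = {
    "Candidate1": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "Candidate2": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "Candidate3": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "Candidate4": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "Candidate5": [1, 1, 1, 1, 1, 0, 1, 1, 0, 0],
    "Candidate6": [0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
}


def inverse_document_frequency(tf, n_terms):
    """IDF per term: log(number of documents / documents containing the term)."""
    n_docs = len(tf)
    idf = []
    for k in range(n_terms):
        df = sum(1 for row in tf.values() if row[k] > 0)
        idf.append(math.log(n_docs / df) if df else 0.0)
    return idf


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


idf = inverse_document_frequency(tf, len(terms))
tfidf = {name: [f * w for f, w in zip(row, idf)] for name, row in tf.items()}

com_c6 = tfidf["Candidate6"][terms.index(".com")]   # 1 * log(6/5)
jan_c6 = tfidf["Candidate6"][terms.index("jan_")]   # 1 * log(6/1)
sim_12 = cosine(tfidf["Candidate1"], tfidf["Candidate2"])
```

Rare terms such as ‘jan_’ get a much larger weight than terms shared by almost every candidate, which is why TF-IDF vectors are a reasonable basis for comparing candidates.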
  • 34. Pairwise similarity matrix • Combine cosine similarity values for name, email address, phone number, mobile number, skills, employment history, …
                Cand 1  Cand 2  Cand 3  Cand 4  Cand 5  Cand 6
        Cand 1  1       1       0.8     0.9     0.95    0.75
        Cand 2          1       0.8     0.9     0.95    0.75
        Cand 3                  1       0.6     0.87    0.7
        Cand 4                          1       0.75    0.7
        Cand 5                                  1       0.8
        Cand 6                                          1
      → Correlation clustering
  • 35. Correlation clustering Micha Elsner and Warren Schudy. 2009. Bounding and comparing methods for correlation clustering beyond ILP. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing (ILP '09). Association for Computational Linguistics, Stroudsburg, PA, USA, 19-27. O(n²): does not scale with increasing number of candidates!
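To make the step from similarity matrix to clusters concrete, here is a minimal sketch of a pivot-style greedy heuristic applied to the matrix from this slide. It is not necessarily the method used at The Search Party: the heuristic, the fixed pivot order (instead of a random one), the `threshold` parameter and the function name `pivot_clustering` are all assumptions for illustration.

```python
def pivot_clustering(similarity, threshold=0.5):
    """Greedy heuristic: take the first unassigned item as pivot and
    cluster it with every unassigned item whose similarity exceeds threshold."""
    unassigned = list(range(len(similarity)))
    clusters = []
    while unassigned:
        pivot = unassigned[0]
        cluster = [pivot] + [j for j in unassigned[1:] if similarity[pivot][j] > threshold]
        clusters.append(cluster)
        unassigned = [j for j in unassigned if j not in cluster]
    return clusters


# Upper triangle of the similarity matrix from the slide (Cand 1 .. Cand 6).
upper = [
    [1, 1, 0.8, 0.9, 0.95, 0.75],
    [1, 0.8, 0.9, 0.95, 0.75],
    [1, 0.6, 0.87, 0.7],
    [1, 0.75, 0.7],
    [1, 0.8],
    [1],
]

# Expand to a full symmetric matrix.
n = len(upper)
similarity = [[0.0] * n for _ in range(n)]
for i, row in enumerate(upper):
    for offset, value in enumerate(row):
        similarity[i][i + offset] = similarity[i + offset][i] = value

loose = pivot_clustering(similarity, threshold=0.5)
strict = pivot_clustering(similarity, threshold=0.85)
```

With the loose threshold all six records end up in one cluster; with the strict threshold candidates 3 and 6 (indices 2 and 5) are left as singletons.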
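A quick calculation shows why a quadratic number of comparisons is a problem at this scale. The sketch below counts the unordered candidate pairs for the 2 million candidates mentioned earlier in the talk; the helper name `pair_count` is illustrative.

```python
def pair_count(n):
    """Number of unordered pairs among n candidates: n * (n - 1) / 2."""
    return n * (n - 1) // 2


pairs = pair_count(2_000_000)
print(f"{pairs:,} candidate pairs")
```

That is roughly two trillion similarity evaluations before any clustering has started.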
  • 36. ‘Big Data’ • ‘Big Data’ criticism: • ‘You May Not Need Big Data After All’, HBR, December 2013 • ‘Google Flu Trends: The Limits of Big Data’, NYT, March 2014 • ‘Big data: are we making a big mistake?’, FT Magazine, March 2014 • ‘The backlash against big data’, The Economist, April 2014 • @ The Search Party: • Sampling can help sometimes, but not always … • We have a lot of data, this creates new problems … • … and we just have to deal with it • We need the right tools and algorithms to process millions of data points
  • 37. Deduplication of candidates • So how can we do correlation clustering on millions of candidates? ◦ Blocking: e.g. split the data set into separate blocks based on gender, geographical location, … ◦ Canopy clustering: a pre-clustering algorithm used as a preprocessing step. Use a cheap distance measure to partition the data into overlapping subsets (i.e. canopies), then run expensive clustering on each canopy
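Blocking can be sketched in a few lines: group candidates by a cheap key and only compare candidates within the same group. The candidate records, the `location` field and the function name `block_by` below are made up for illustration.

```python
from collections import defaultdict


def block_by(candidates, key):
    """Split candidates into disjoint blocks that share the same key value."""
    blocks = defaultdict(list)
    for candidate in candidates:
        blocks[key(candidate)].append(candidate)
    return dict(blocks)


# Hypothetical records; only the blocking key matters here.
candidates = [
    {"name": "Jan Luts", "location": "Sydney"},
    {"name": "J Luts", "location": "Sydney"},
    {"name": "Ian Luts", "location": "Leuven"},
]

blocks = block_by(candidates, key=lambda c: c["location"])
```

Unlike canopies, blocks do not overlap, so a duplicate whose key differs (the third record above) can never be matched. That is one reason to prefer overlapping canopies.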
  • 38. Canopy clustering Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '00). ACM, New York, NY, USA, 169-178. • Start with a list of the candidates in any order, and with two distance thresholds, T1 and T2, where T1 > T2. • Pick a candidate from the list, make it a canopy center and approximately measure its distance to all other candidates. • Put all candidates that are within distance threshold T1 into a canopy. Remove from the list all candidates that are within distance threshold T2. Repeat until the list is empty.
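The three steps above can be sketched as a single-machine Python function. This is a minimal sketch: the numbers stand in for candidates, the absolute difference stands in for the cheap distance measure, and the function name `canopy_clustering` is illustrative.

```python
def canopy_clustering(points, distance, t1, t2):
    """Return a list of (center index, member indices) canopies.

    t1 is the loose threshold (canopy membership), t2 the tight one
    (removal from the list of potential centers); t1 must exceed t2.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]
        # Cheap distance from the new center to every point.
        dists = [distance(points[center], p) for p in points]
        members = [i for i, d in enumerate(dists) if d <= t1]
        canopies.append((center, members))
        # Points within t2 can no longer become centers.
        remaining = [i for i in remaining if dists[i] > t2]
    return canopies


points = [0, 1, 2, 10, 11]
canopies = canopy_clustering(points, lambda a, b: abs(a - b), t1=3, t2=1.5)
```

Canopies may overlap: in this toy run the first two canopies contain the same three points, while points 10 and 11 form a separate canopy.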
  • 39. Canopy clustering Five canopies found Do correlation clustering on each canopy
  • 40. Deduplication of candidates Strategy outline: • Do canopy clustering using TF-IDFs • Do expensive correlation clustering for each canopy using a similarity matrix based on all available candidate information (e.g. name, email, phone, mobile, employment history, publications, certificates, …) • We need to do < 0.005 of all possible pairwise comparisons Optimization: • Parallelization of TF-IDF computation, canopy clustering • Run correlation clustering in parallel for each canopy
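The saving from canopies can be measured as the share of candidate pairs that fall inside at least one canopy. The sketch below computes that fraction for a toy set of canopies; the function name `comparison_fraction` and the example canopies are assumptions, and the < 0.005 figure on the slide comes from the real data, not from this code.

```python
from itertools import combinations


def comparison_fraction(canopies, n_candidates):
    """Fraction of all candidate pairs that share at least one canopy."""
    compared = set()
    for members in canopies:
        # A pair that appears in several canopies is only counted once.
        compared.update(combinations(sorted(members), 2))
    total = n_candidates * (n_candidates - 1) // 2
    return len(compared) / total


# Toy canopies over five candidates (indices 0..4).
toy_canopies = [[0, 1, 2], [0, 1, 2], [3, 4]]
fraction = comparison_fraction(toy_canopies, n_candidates=5)
```

On millions of candidates with many small canopies this fraction becomes tiny, which is what makes the expensive pairwise step affordable.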
  • 41. Large-scale data processing: Hadoop • Open-source software framework for distributed computing • MapReduce programming model • Resilient to failure
  • 42. How to do canopy clustering on Hadoop? • Two steps: • Canopy generation: identify the canopy centers • Canopy filling: assign candidates to canopies
  • 43. Canopy generation on Hadoop
      Input: Candidates Batch 1, Batch 2, Batch 3, Batch 4
      Map:
        Initialize: centers1 = {}, centers2 = {}, centers3 = {}, centers4 = {}
        For each batch in parallel:
          if ∀i, distance(candidate x, center i) > T2:
            add candidate x to the batch's centers and output the pair (‘intermediateCenter’, candidate x)
      Reduce (over all intermediate centers):
        Initialize: finalCenters = {}
        if ∀i, distance(intermediateCenter x, finalCenter i) > T2:
          add intermediateCenter x to finalCenters and output the pair (‘finalCenter’, intermediateCenter x)
  • 44. Canopy filling on Hadoop
      Retrieve canopyCenters from the canopy generation job
      Input: Candidates Batch 1, Batch 2, Batch 3, Batch 4
      Map:
        For each batch in parallel:
          ∀i, if distance(candidate x, center i) < T1:
            output the pair (center i, candidate x)
      Intermediate: Center-Candidate Batch 1, Batch 2, Batch 3
      Reduce:
        For each batch: output the list of all candidates belonging to the same canopy with center i
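The generation job can be simulated in plain Python without any Hadoop API. In this sketch each call to `map_generate` plays the role of one mapper working on its own batch, and `reduce_generate` plays the single reducer. The function names, the toy batches and the absolute-difference distance are assumptions for illustration.

```python
def select_centers(items, distance, t2):
    """Keep an item as a center only if it is farther than t2 from all centers so far."""
    centers = []
    for item in items:
        if all(distance(item, c) > t2 for c in centers):
            centers.append(item)
    return centers


def map_generate(batch, distance, t2):
    """Mapper: emit the local canopy centers of one batch."""
    return [("intermediateCenter", c) for c in select_centers(batch, distance, t2)]


def reduce_generate(pairs, distance, t2):
    """Reducer: thin out the intermediate centers from all mappers."""
    intermediate = [value for _, value in pairs]
    return [("finalCenter", c) for c in select_centers(intermediate, distance, t2)]


def dist(a, b):
    return abs(a - b)


batches = [[0, 1, 10], [2, 11, 20]]
mapped = [pair for batch in batches for pair in map_generate(batch, dist, t2=1.5)]
final_centers = [c for _, c in reduce_generate(mapped, dist, t2=1.5)]
```

The reducer is needed because two mappers can each pick a center for the same region (10 and 11 here); the second pass drops the redundant one.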
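The filling job can be simulated the same way, again without any Hadoop API. Mappers emit one (center, candidate) pair per canopy a candidate falls into, and the reducer groups the pairs by center. The function names, toy centers and batches are assumptions for illustration.

```python
from collections import defaultdict


def map_fill(batch, centers, distance, t1):
    """Mapper: emit (center, candidate) for every center closer than t1."""
    return [(c, x) for x in batch for c in centers if distance(x, c) < t1]


def reduce_fill(pairs):
    """Reducer: collect all candidates that share a canopy center."""
    canopies = defaultdict(list)
    for center, candidate in pairs:
        canopies[center].append(candidate)
    return dict(canopies)


def dist(a, b):
    return abs(a - b)


centers = [0, 10, 2, 20]             # assumed output of a generation job
batches = [[0, 1, 10], [2, 11, 20]]
pairs = [pair for batch in batches for pair in map_fill(batch, centers, dist, t1=3)]
filled = reduce_fill(pairs)
```

A candidate can be emitted more than once, which is how overlapping canopies arise; each resulting canopy can then be handed to its own correlation clustering task.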
  • 45. Deduplication of candidates - Summary • Our dedupe pipeline is a blend of concepts from information retrieval (TF-IDF), statistics and machine learning (correlation clustering) • Applying it to large data sets causes new problems and requires redesigning/adjusting the algorithms (canopy clustering, distributed computing, Hadoop) • Integration in the existing platform: ◦ How do data get in and out of the dedupe pipeline? ◦ Making it work in a ‘production environment’: fail-safe code that handles failures in a safe way
  • 47. Visualization of career paths • 14 million employment history records: • Longitudinal data: transitions between different jobs • Available data: job titles, employer, full description, skills, start dates, end dates, different versions of CV…
  • 48. Visualization of career paths • Visualize transition between jobs based on job title [Figure: directed graph of job titles (network consultant, senior network consultant, technical project manager, senior network engineer, technical consultant, network analyst, network manager, consultant, network engineer, network architect, project manager, IT manager) with edge weights between .04 and .18]
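The slide does not say how the edge weights were computed. One plausible reading, sketched below, is the relative frequency of moving from one job title to the next in a candidate's employment history. The normalization per source title, the function name `transition_probabilities` and the example careers are assumptions.

```python
from collections import Counter, defaultdict


def transition_probabilities(careers):
    """Estimate P(next title | current title) from ordered job-title sequences."""
    counts = defaultdict(Counter)
    for titles in careers:
        # Consecutive titles in one career form one observed transition.
        for current, following in zip(titles, titles[1:]):
            counts[current][following] += 1
    return {
        source: {target: n / sum(targets.values()) for target, n in targets.items()}
        for source, targets in counts.items()
    }


# Made-up careers using job titles from the slide, oldest job first.
careers = [
    ["network engineer", "senior network engineer", "network architect"],
    ["network engineer", "senior network engineer", "network manager"],
    ["network engineer", "network consultant"],
]

probabilities = transition_probabilities(careers)
```

The resulting nested dictionary maps directly onto a weighted directed graph, with one node per job title and one edge per observed transition.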
  • 53. Conclusion • Innovative work in a challenging environment • Variety: understanding business problems, literature review, algorithm design, prototyping, evaluation, implementation, optimization • Data science: statistics has a very important role to play • Software engineering skills • Big data: large data sets cause new problems • Team work • Passion!