Data Science @ The Search Party
Jan Luts

Data Science @ The Search Party (Dr. Jan Luts)

University of Technology Sydney
Seminar 24 June 2014.
http://www.statsoc.org.au/events/ssai-events/data-science-search-party/

The Search Party is a Sydney-based technology platform that is a positive disruptor for the recruitment industry. We have created the first online marketplace for talent, which makes it quicker and easier to hire better people, whilst giving recruitment agencies a sustainable and profitable revenue stream. In this presentation I will give an overview of the challenges we face in extracting value from the large amounts of data that circulate daily through our software platform. These data originate from job seekers, employers and recruiters, and processing them requires interdisciplinary work at the intersection of statistics, machine learning, data mining, computer science, information retrieval and natural language processing. The ultimate goal is to accurately match vacancies with job seekers and automate the recruitment service.


  1. 1. Data Science @ The Search Party Jan Luts
  2. 2. Outline • About myself • The Search Party • What is data science @ The Search Party? • Deduplication of candidates • Visualization of career paths • Technology - Software • Conclusion
  3. 3. Outline • About myself • The Search Party • What is data science @ The Search Party? • Deduplication of candidates • Visualization of career paths • Technology - Software • Conclusion
  4. 4. About myself • Master in Information Sciences, Universiteit Hasselt, Belgium • Master in Bioinformatics, Katholieke Universiteit Leuven, Belgium • Master in Statistics, Katholieke Universiteit Leuven, Belgium • PhD and Postdoc in Engineering, Department of Electrical Engineering, Katholieke Universiteit Leuven (Sabine Van Huffel, Johan Suykens) “Predictive computer models, machine learning, decision support systems” • Postdoc, School of Mathematical Sciences, University of Technology Sydney, Australia (Matt Wand) “Mean field variational Bayes, semiparametric regression, streaming data, real-time analysis” • October 2013: Data Scientist, The Search Party, Sydney
  5. 5. Outline • About myself • The Search Party • What is data science @ The Search Party? • Deduplication of candidates • Visualization of career paths • Technology - Software • Conclusion
  6. 6. The Search Party There are major forces acting on recruitment as an industry… • Traditional recruitment model under pressure from technology • Pressure on pricing damaging agency profitability • Bulk of agency costs are people who drive revenue • Global economic uncertainty • Corporate investment in internal talent sourcing teams
  7. 7. • We allow potential employers to search a vast ocean of the world’s best candidates • We connect employers with the agencies who represent them to agree a fee and arrange an introduction • Supporting this evolution is the world’s first marketplace for talent…
  8. 8. http://thesearchparty.com/
  9. 9. Employer
  10. 10. Employer
  11. 11. Recruiter
  12. 12. Recruiter
  13. 13. Employer
  14. 14. Outline • About myself • The Search Party • What is data science @ The Search Party? • Deduplication of candidates • Visualization of career paths • Technology - Software • Conclusion
  15. 15. Data • 2 million candidates
  16. 16. Data • 2 million candidates • 46 million skills
  17. 17. Data • 2 million candidates • 46 million skills • 14 million employment history records (e.g. Concrete Formworker, Doran Contractors, 1999-2012; Site Supervisor, Allied Gold, 1997-2000; Java Developer, IBM, 2010-2011)
  18. 18. Data • 2 million candidates • 46 million skills • 14 million employment history records • 40000 vacancies
  19. 19. Data • 2 million candidates • 46 million skills • 14 million employment history records • 40000 vacancies • 29 industries, 384 subsectors (e.g. Engineering; Accounting; Administration & Office Support; Advertising, Arts & Media; Banking & Financial Services; Call Centre & Customer Services; Community Services & Development; Construction; Consulting & Strategy; Design & Architecture; Education & Training)
  20. 20. Data • 2 million candidates • 46 million skills • 14 million employment history records • 40000 vacancies • 29 industries, 384 subsectors • 75 GB marketplace logs (events such as Create Candidate; Publish Candidate; Forgot Password; Submit Candidate; Vote Up; Vote Down; Request Candidate; Appeared In Search Results; Account Login; Upload CV)
  21. 21. Data • 2 million candidates • 46 million skills • 14 million employment history records • 40000 vacancies • 29 industries, 384 subsectors • 75 GB marketplace logs • 100 recruitment agencies
  22. 22. Data science @ The Search Party! • Testing hypotheses • Design of experiments • Cross-validation • Training data vs. test data • Performance measure • Building a prediction model • Regression • Support vector machines • Variable selection • Sensitivity, specificity • Cost and benefit • Clustering • Topic modeling • Distributed computing • Programming • Software engineering • Data structures • Term frequency - inverse document frequency • Entity resolution • Sentence detection • Tokenization • Sentiment analysis • Part-of-speech tagging → statistics → machine learning → data mining → computer science → information retrieval → natural language processing
  23. 23. Outline • About myself • The Search Party • What is data science @ The Search Party? • Deduplication of candidates • Visualization of career paths • Technology - Software • Conclusion
  24. 24. Deduplication of candidates Recruiter 1 Recruiter 2 Recruiter 3 The Search Party Database
  25. 25. Employer
  26. 26. Deduplication of candidates (Figure from Lise Getoor)
  27. 27. Deduplication of candidates (Figure from Lise Getoor)
  28. 28. Deduplication of candidates (Figure from Lise Getoor)
  29. 29. Clustering • Entity resolution does not happen independently for each pair of candidates separately • Number of clusters is unknown • Many, many small (possibly singleton) clusters
  30. 30. Correlation clustering • Take a pair‐wise similarity graph as input • Edge 𝑥𝑖𝑗 ∈ {0,1} with 𝑥𝑖𝑗 = 1 if candidates i and j are assigned to the same cluster. 𝑝𝑖𝑗 is the ‘belief’ that candidates i and j are the same • Optimize: Define:
  31. 31. Correlation clustering Micha Elsner and Warren Schudy. 2009. Bounding and comparing methods for correlation clustering beyond ILP. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing (ILP '09). Association for Computational Linguistics, Stroudsburg, PA, USA, 19-27.
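The formulas for "Optimize" and "Define" are not reproduced in this transcript. As a hedged illustration only, and not necessarily the definition used on the slide, one common way to write a correlation clustering objective in terms of the pairwise beliefs 𝑝𝑖𝑗 is to maximize the log-likelihood of the cluster assignment subject to transitivity constraints:

```latex
\max_{x}\; \sum_{i<j} \Bigl[ x_{ij}\,\log p_{ij} + (1 - x_{ij})\,\log\bigl(1 - p_{ij}\bigr) \Bigr]
\quad \text{subject to} \quad
x_{ij} \in \{0,1\}, \qquad
x_{ij} + x_{jk} - x_{ik} \le 1 \;\; \text{for all distinct } i, j, k .
```

The constraint says that if i and j share a cluster and j and k share a cluster, then i and k must as well, which is why the decisions cannot be made for each pair independently.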
  32. 32. Pairwise similarity matrix • We need a measure that quantifies the similarity between candidates: • Candidate 1: Jan Luts, jan.m.luts@gmail.com, KULeuven, UTS • Candidate 2: Jan Luts, jan.m.luts@gmail.com, KULeuven, UTS • Candidate 3: Jam Lutf, jan.m.luts@gmail.com • Candidate 4: J Luts, KULeuven • Candidate 5: Ian Luts, jan.m.luts@gmail.com, KULeuven, UTS, TSP • Candidate 6: Jan Luts, john@staffrecruitment.com, UTS, TSP
  33. 33. Term frequency - inverse document frequency
       Terms        jan.  an.m  n.m.  luts  uts@  mail  gmai  .com  @hot  jan_
       Candidate 1  1     1     1     1     1     1     1     1     0     0
       Candidate 2  1     1     1     1     1     1     1     1     0     0
       Candidate 3  1     1     1     1     1     1     1     1     0     0
       Candidate 4  0     0     0     0     0     0     0     0     0     0
       Candidate 5  1     1     1     1     1     0     1     1     0     0
       Candidate 6  0     0     0     1     1     1     0     1     1     1
       • These are called ‘term frequencies’
       • Inverse document frequency for ‘.com’: log(6/5)
       • TF-IDF for ‘.com’ for candidate 6: 1 * log(6/5) = 0.18
       • TF-IDF for ‘jan_’ for candidate 6: 1 * log(6/1) = 1.79
  34. 34. Pairwise similarity matrix • Combine cosine similarity values for name, email address, phone number, mobile number, skills, employment history, … • The resulting matrix is the input to correlation clustering:
               Cand 1  Cand 2  Cand 3  Cand 4  Cand 5  Cand 6
       Cand 1  1       1       0.8     0.9     0.95    0.75
       Cand 2          1       0.8     0.9     0.95    0.75
       Cand 3                  1       0.6     0.87    0.7
       Cand 4                          1       0.75    0.7
       Cand 5                                  1       0.8
       Cand 6                                          1
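The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch that copies the term-frequency table from the slide and uses the natural logarithm, which is what reproduces the slide's 0.18 and 1.79; the names `TERMS`, `TF`, `inverse_document_frequency` and `tf_idf` are illustrative and not taken from the talk.

```python
import math

# Term-frequency table copied from the slide (one row per candidate).
TERMS = ["jan.", "an.m", "n.m.", "luts", "uts@", "mail", "gmai", ".com", "@hot", "jan_"]
TF = {
    "Candidate 1": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "Candidate 2": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "Candidate 3": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "Candidate 4": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "Candidate 5": [1, 1, 1, 1, 1, 0, 1, 1, 0, 0],
    "Candidate 6": [0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
}


def inverse_document_frequency(tf_table, terms):
    """Return log(N / number of candidates containing the term) per term."""
    n_docs = len(tf_table)
    idf = {}
    for col, term in enumerate(terms):
        doc_count = sum(1 for row in tf_table.values() if row[col] > 0)
        # Terms that occur nowhere get an IDF of 0 to avoid dividing by zero.
        idf[term] = math.log(n_docs / doc_count) if doc_count else 0.0
    return idf


def tf_idf(tf_table, terms, candidate, term):
    """TF-IDF weight of one term for one candidate."""
    idf = inverse_document_frequency(tf_table, terms)
    return tf_table[candidate][terms.index(term)] * idf[term]


print(round(tf_idf(TF, TERMS, "Candidate 6", ".com"), 2))
print(round(tf_idf(TF, TERMS, "Candidate 6", "jan_"), 2))
```

Rare terms such as ‘jan_’ receive a much larger weight than terms shared by almost every candidate, such as ‘.com’.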
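How the per-field cosine similarities are combined is not stated on the slide. The sketch below assumes a weighted average over fields and uses raw character 4-gram counts instead of TF-IDF weights to stay short; `char_ngrams`, `combined_similarity` and the weights are hypothetical names and values chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=4):
    """Count the overlapping character n-grams of a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_similarity(cand_a, cand_b, weights):
    """Weighted average of per-field cosine similarities (an assumption)."""
    total = sum(weights.values())
    score = 0.0
    for field, weight in weights.items():
        vec_a = char_ngrams(cand_a.get(field, ""))
        vec_b = char_ngrams(cand_b.get(field, ""))
        score += weight * cosine_similarity(vec_a, vec_b)
    return score / total


candidate_1 = {"name": "Jan Luts", "email": "jan.m.luts@gmail.com"}
candidate_3 = {"name": "Jam Lutf", "email": "jan.m.luts@gmail.com"}
weights = {"name": 1.0, "email": 1.0}  # hypothetical field weights
print(combined_similarity(candidate_1, candidate_3, weights))
```

With these inputs the misspelled name shares only one of five 4-grams with the correct one, while the identical email address keeps the combined score well above the name similarity alone.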
  35. 35. Correlation clustering Micha Elsner and Warren Schudy. 2009. Bounding and comparing methods for correlation clustering beyond ILP. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing (ILP '09). Association for Computational Linguistics, Stroudsburg, PA, USA, 19-27. • O(𝑛²) pairwise similarities: does not scale with an increasing number of candidates!
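The talk does not say which correlation clustering heuristic was used. As a stand-in, the sketch below applies a simple pivot heuristic (a deterministic variant of a well-known approximation scheme in which pivots are normally chosen at random) to the six-candidate similarity matrix from the earlier slide. The threshold of 0.85 and the function names are assumptions made for illustration.

```python
# Upper triangle of the six-candidate similarity matrix from the slide.
UPPER = [
    [1, 1, 0.8, 0.9, 0.95, 0.75],
    [1, 0.8, 0.9, 0.95, 0.75],
    [1, 0.6, 0.87, 0.7],
    [1, 0.75, 0.7],
    [1, 0.8],
    [1],
]


def to_symmetric(upper):
    """Expand an upper-triangular list of rows into a full symmetric matrix."""
    n = len(upper)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(upper):
        for offset, value in enumerate(row):
            j = i + offset
            full[i][j] = full[j][i] = value
    return full


def pivot_clustering(sim, threshold):
    """Take the first unassigned item as pivot and group it with every
    unassigned item whose similarity to the pivot reaches the threshold."""
    unassigned = list(range(len(sim)))
    clusters = []
    while unassigned:
        pivot = unassigned[0]
        cluster = [pivot] + [j for j in unassigned[1:] if sim[pivot][j] >= threshold]
        clusters.append(cluster)
        unassigned = [j for j in unassigned if j not in cluster]
    return clusters


similarity = to_symmetric(UPPER)
# Indices are zero-based: 0 is Cand 1, 5 is Cand 6.
print(pivot_clustering(similarity, threshold=0.85))
```

Even this cheap heuristic reads from a full n × n matrix, which is the quadratic cost the slide points at.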
  36. 36. ‘Big Data’ • ‘Big Data’ criticism: • ‘You May Not Need Big Data After All’, HBR, December 2013 • ‘Google Flu Trends: The Limits of Big Data’, NYT, March 2014 • ‘Big data: are we making a big mistake?’, FT Magazine, March 2014 • ‘The backlash against big data’, The Economist, April, 2014 • @ The Search Party: • Sampling can help sometimes, but not always … • We have a lot of data, this creates new problems … • … and we just have to deal with it • We need the right tools and algorithms to process millions of data points
  37. 37. Deduplication of candidates • So how can we do correlation clustering on millions of candidates? • Blocking: e.g. split the data set into separate blocks based on gender, geographical location, … • Canopy clustering: a pre-clustering algorithm used as a preprocessing step. Use a cheap distance measure to partition the data into overlapping subsets (i.e. canopies), then run the expensive clustering on each canopy
  38. 38. Canopy clustering Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '00). ACM, New York, NY, USA, 169-178. • Start with a list of the candidates in any order, and with two distance thresholds, T1 and T2, where T1 > T2. • Pick a candidate from the list, make it a canopy center and approximately measure its distance to all other candidates. • Put all candidates that are within distance threshold T1 into a canopy. Remove from the list all candidates that are within distance threshold T2. Repeat until the list is empty.
  39. 39. Canopy clustering Five canopies found Do correlation clustering on each canopy
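The canopy procedure described above can be sketched in a few lines. This is an illustrative single-machine version under stated assumptions: points are plain numbers, the cheap distance is the absolute difference, "within" is read as strictly less than the threshold, and canopy membership is measured against all points (so canopies can overlap). The function name and the thresholds are hypothetical.

```python
def canopy_clustering(points, distance, t1, t2):
    """Return a list of (center, members) canopies.

    t1 is the loose threshold (canopy membership) and t2 the tight one
    (removal from the list of candidate centers), with t1 > t2 > 0.
    """
    if not t1 > t2 > 0:
        raise ValueError("thresholds must satisfy t1 > t2 > 0")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]
        # Membership is checked against all points, so canopies may overlap.
        members = [p for p in points if distance(p, center) < t1]
        canopies.append((center, members))
        # Points close to the center can no longer become centers themselves.
        remaining = [p for p in remaining if distance(p, center) >= t2]
    return canopies


def absolute_difference(a, b):
    return abs(a - b)


points = [0.0, 0.5, 1.5, 5.0, 5.2, 9.0]
for center, members in canopy_clustering(points, absolute_difference, t1=2.0, t2=1.0):
    print(center, members)
```

The expensive clustering would then be run separately inside each canopy, so pairs of points that never share a canopy are never compared.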
  40. 40. Deduplication of candidates Strategy outline: • Do canopy clustering using TF-IDFs • Do expensive correlation clustering for each canopy using a similarity matrix based on all available candidate information (e.g. name, email, phone, mobile, employment history, publications, certificates, …) • This requires fewer than 0.5% (< 0.005) of all possible pairwise comparisons Optimization: • Parallelization of TF-IDF computation, canopy clustering • Run correlation clustering in parallel for each canopy
  41. 41. Large-scale data processing: Hadoop • Open-source software framework for distributed computing • MapReduce programming model • Resilient to failure
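To put the 0.005 figure in perspective, the number of candidate pairs can be worked out directly. The sketch below assumes exactly 2,000,000 candidates, which is a rounding of the "2 million" quoted earlier in the deck; the function name is illustrative.

```python
def pairwise_comparisons(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n_candidates = 2_000_000  # assumed round figure
all_pairs = pairwise_comparisons(n_candidates)
reduced_pairs = 0.005 * all_pairs

print(f"all pairs:        {all_pairs:,}")
print(f"0.005 of all pairs: {reduced_pairs:,.0f}")
```

Under this assumption the full problem has roughly two trillion pairs, and the canopy-based strategy brings that down to fewer than ten billion.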
  42. 42. How to do canopy clustering on Hadoop? • Two steps: • Canopy generation: identify the canopy centers • Canopy filling: assign candidates to canopies
  43. 43. Canopy generation on Hadoop • Input: candidates split into batches 1 to 4 • Map: initialize centers1 = {}, centers2 = {}, centers3 = {}, centers4 = {}; for each batch in parallel, if ∀𝑖, distance(candidate x, center i) > T2, output the pair (‘intermediateCenter’, candidate x) • Reduce (over the intermediate centers): initialize finalCenters = {}; if ∀𝑖, distance(intermediateCenter x, finalCenter i) > T2, output the pair (‘finalCenter’, intermediateCenter x)
  44. 44. Canopy filling on Hadoop • Retrieve canopyCenters from the canopy generation job • Map: for each batch of candidates (batches 1 to 4) in parallel, ∀𝑖, if distance(candidate x, center i) < T1, output the pair (center i, candidate x) • Reduce: for each batch of center-candidate pairs, output the list of all candidates belonging to the same canopy with center i
  45. 45. Deduplication of candidates - Summary • Our dedupe pipeline is a blend of concepts from information retrieval (TF-IDF), statistics and machine learning (correlation clustering) • Applying it to large data sets causes new problems and requires redesigning/adjusting the algorithms (canopy clustering, distributed computing, Hadoop) • Integration in the existing platform: • How do data get in and out of the dedupe pipeline? • Making it work in a ‘production environment’: fail-safe code, i.e. in case of failure, handle it in a safe way
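The map and reduce steps of canopy generation can be simulated in plain Python without a cluster. This is a sketch under assumptions: candidates are numbers, distance is the absolute difference, and a candidate that passes the T2 test is also added to the running set of centers (the slide's pseudocode only shows the output step). Function names are hypothetical and no Hadoop API is used.

```python
def select_centers(candidates, distance, t2):
    """Keep a candidate as a center if it is farther than t2 from every
    center selected so far."""
    centers = []
    for candidate in candidates:
        if all(distance(candidate, center) > t2 for center in centers):
            centers.append(candidate)
    return centers


def map_phase(batch, distance, t2):
    """One mapper: emit intermediate centers found within a single batch."""
    return [("intermediateCenter", c) for c in select_centers(batch, distance, t2)]


def reduce_phase(pairs, distance, t2):
    """Single reducer: thin out the intermediate centers from all batches."""
    intermediate = [value for _key, value in pairs]
    return [("finalCenter", c) for c in select_centers(intermediate, distance, t2)]


def canopy_generation(batches, distance, t2):
    mapped = []
    for batch in batches:  # on a cluster, each batch would go to its own mapper
        mapped.extend(map_phase(batch, distance, t2))
    return [center for _key, center in reduce_phase(mapped, distance, t2)]


def absolute_difference(a, b):
    return abs(a - b)


batches = [[0.0, 0.5, 5.0], [0.2, 5.1, 9.0]]
print(canopy_generation(batches, absolute_difference, t2=1.0))
```

The reduce step is needed because mappers work independently: batch 2 proposes 0.2 and 5.1 as centers without knowing that batch 1 already proposed the nearby 0.0 and 5.0.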
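Canopy filling can be simulated the same way. The sketch below assumes the canopy centers are already known, uses numbers with absolute difference as the cheap distance, and groups the mapper output by center in a single in-process reduce step; the names and the threshold are illustrative and no Hadoop API is involved.

```python
from collections import defaultdict


def map_phase(batch, centers, distance, t1):
    """One mapper: emit (center, candidate) for every center within t1."""
    return [
        (center, candidate)
        for candidate in batch
        for center in centers
        if distance(candidate, center) < t1
    ]


def reduce_phase(pairs):
    """Group candidates by canopy center."""
    canopies = defaultdict(list)
    for center, candidate in pairs:
        canopies[center].append(candidate)
    return dict(canopies)


def canopy_filling(batches, centers, distance, t1):
    pairs = []
    for batch in batches:  # on a cluster, each batch would go to its own mapper
        pairs.extend(map_phase(batch, centers, distance, t1))
    return reduce_phase(pairs)


def absolute_difference(a, b):
    return abs(a - b)


centers = [0.0, 5.0, 9.0]  # assumed output of a canopy generation step
batches = [[0.0, 0.5, 5.0], [0.2, 5.1, 9.0, 7.0]]
print(canopy_filling(batches, centers, absolute_difference, t1=2.5))
```

The candidate 7.0 lies within T1 of two centers and therefore lands in both canopies, which is the overlap that distinguishes canopies from a hard partition.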
  46. 46. Outline • About myself • The Search Party • What is data science @ The Search Party? • Deduplication of candidates • Visualization of career paths • Technology - Software • Conclusion
  47. 47. Visualization of career paths • 14 million employment history records: • Longitudinal data: transitions between different jobs • Available data: job titles, employer, full description, skills, start dates, end dates, different versions of CV…
  48. 48. Visualization of career paths • Visualize transition between jobs based on job title: [Figure: transition graph between the job titles network consultant, senior network consultant, technical project manager, senior network engineer, technical consultant, network analyst, network manager, consultant, network engineer, network architect, project manager and IT manager, with edge weights ranging from .04 to .18]
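How the edge weights in the transition graph were computed is not stated in the deck. One plausible reading, offered here as an assumption, is the share of people leaving a job title who move to each next title. The sketch below computes such proportions from made-up employment histories; the data and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def transition_proportions(histories):
    """For each job title, the fraction of observed moves to each next title.

    histories: list of job-title lists, each in chronological order.
    """
    counts = defaultdict(Counter)
    for titles in histories:
        for current, following in zip(titles, titles[1:]):
            if current != following:  # ignore repeats of the same title
                counts[current][following] += 1
    return {
        source: {target: n / sum(targets.values()) for target, n in targets.items()}
        for source, targets in counts.items()
    }


# Hypothetical employment histories, not taken from the talk's data.
histories = [
    ["network engineer", "senior network engineer", "network architect"],
    ["network engineer", "network manager"],
    ["network engineer", "senior network engineer"],
]
for source, targets in transition_proportions(histories).items():
    print(source, targets)
```

Each source title and its outgoing proportions correspond to one node and its weighted outgoing edges in a graph like the one on the slide.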
  49. 49. Visualization of career paths Demo
  50. 50. Outline • About myself • The Search Party • What is data science @ The Search Party? • Deduplication of candidates • Visualization of career paths • Technology - Software • Conclusion
  51. 51. Technology - Software
  52. 52. Outline • About myself • The Search Party • What is data science @ The Search Party? • Deduplication of candidates • Visualization of career paths • Technology - Software • Conclusion
  53. 53. Conclusion • Innovative work in a challenging environment • Variety: understanding business problems, literature review, algorithm design, prototyping, evaluation, implementation, optimization • Data science: statistics has a very important role to play • Software engineering skills • Big data: large data sets cause new problems • Team work • Passion!
  54. 54. Thanks!
