TitleCategoriesLI
1. TITLE CATEGORIZATION 2.1
MENTOR: RAMESH SUBRAMONIAN
TEAM: DATA ANALYTICS
LEADER: DANIEL TUNKELANG
ACKNOWLEDGEMENTS:
SIMLA CEYHAN
DANIEL TUNKELANG
MONICA ROGATTI
LAUREN OLERICH
RON BEKKERMAN
CHRISTIAN POSSE
2. Motivation: CURRENT STATUS
25 JOB FUNCTIONS:
• TOO FEW: no Field Sales
• TOO NON-SPECIFIC: reporting is difficult
• TOO BIG
25000 CLEAN JOB TITLES:
• TOO MANY
• TOO BIG (“Owner” ~ 5M)
• TOO SMALL (~ 500)
• TOO SPECIFIC (“Human Resources Info. Sys. Mgr.”)
• TOO NON-SPECIFIC (“Specialist”)
3. CONSTRAINTS
• INPUT
Clean titles with “IMPRESSIONS”:
CLEAN TITLE            “IMPRESSIONS”
…                      …
facilities manager     95674
…                      …
Pairwise title similarity:
Title 1    Title 2    Cosine (1,0)
…          …          …
• OUTPUT
Clean title            Category
…                      …
Blonde hair stylist    Hair stylist
Chair stylist          Furniture maker
…                      …
Owner                  VAGUE (UNCATEGORIZABLE)
barista                Independent: not vague. Doesn’t fit in any existing category, too small to form a category
…                      …
4. CONSTRAINTS (CONTD)
• ~ 200 categories (from Sales: can be dealt with
on human scale)
• Title maps to Unique category
• Precision over coverage
• Coverage ~ 80% of categorizable titles
• 2-3 nearest categories for each category
• 2 alternate categories for each title
5. Machine solution V00
User Domain Expert Feedback (Ester/Lauren in Sales)
Less than 1.5% change in coverage!
Illustrates “goodness” of computational solution!
12. Status
• Handed over to Ester/Lauren in Sales
• Iteratively incorporate human feedback
• Solution is public; code is documented and
with Ramesh; working on final report
• ~2-3 new technical innovations
• Developed a proposal for “titles” based on
current understanding of LinkedIn needs
13. Feedback Functionality: Implemented
• Title:
1. Delete from Category (Independent)
2. Move to vague
3. Move to another category
4. Define new category
• Category:
1. Delete if empty
2. Rename
3. Merge with another
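The feedback operations listed above can be sketched as a small in-memory structure. This is only an illustration under stated assumptions: the class name `Categorization`, its method names, and the `VAGUE` marker are hypothetical and are not taken from the implemented tool.

```python
VAGUE = "VAGUE"  # hypothetical marker for uncategorizable titles


class Categorization:
    """Toy store mapping each title to a category, VAGUE, or None (independent)."""

    def __init__(self):
        self.categories = set()
        self.category_of = {}

    # Title operations
    def make_independent(self, title):
        """Delete the title from its category (it becomes independent)."""
        self.category_of[title] = None

    def move_to_vague(self, title):
        self.category_of[title] = VAGUE

    def move_title(self, title, category):
        if category not in self.categories:
            raise KeyError(category)
        self.category_of[title] = category

    def define_category(self, name):
        self.categories.add(name)

    # Category operations
    def titles_in(self, category):
        return sorted(t for t, c in self.category_of.items() if c == category)

    def delete_if_empty(self, category):
        """Remove the category only when no title maps to it."""
        if self.titles_in(category):
            return False
        self.categories.discard(category)
        return True

    def merge_categories(self, source, target):
        """Move every title of `source` into `target`, then drop `source`."""
        if target not in self.categories:
            raise KeyError(target)
        for title in self.titles_in(source):
            self.category_of[title] = target
        self.categories.discard(source)

    def rename_category(self, old, new):
        """Rename by creating `new` and merging `old` into it."""
        if new in self.categories:
            raise ValueError("category already exists: " + new)
        self.categories.add(new)
        self.merge_categories(old, new)
```

Keeping one category per title in a single dictionary mirrors the "title maps to unique category" constraint from the earlier slide.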
14. Cool Technical stuff
• Distribution of membership over titles
– How used
• Geometry of Title Word vector space
– How used and should be
– Lack of hyperstructure/scale
• How to cluster stars and “Local Dimension”
– How used
– Lack of asymptotic behaviour or transition point
during clustering
16. Membership Distribution in Titles
• 90% members in 6000 titles
• 10% members in 19000 titles
• Cut-off Rank ~ 6000: slope of curve nearly -1 (slope drops to within some % of -1)
• Diminishing marginal returns: cut-off should be based on marginal increase in potential earnings minus marginal increase in overhead costs
[Figure: impsminustitles (%ile of titles by impressions - %ile of titles by rank, 0.0 to 0.6) VS. Rank_decr_imps (rank of title, 0 to 20000), with reference line Slope = -1]
7/13/11 Grp Mtg RSTate, LinkedIn 16
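One possible reading of the cut-off rule on this slide, as a sketch only: plot (cumulative share of impressions minus share of titles) against normalized rank, and stop at the first rank where the slope is within a tolerance of -1. The function name `cutoff_rank`, the `tolerance` parameter, and the normalization of rank to the range 0 to 1 are assumptions, not details confirmed by the slides.

```python
def cutoff_rank(impressions, tolerance=0.05):
    """Return the first 1-based rank whose slope is within `tolerance` of -1.

    The curve is y(x) = cumulative share of impressions - x, where
    x = rank / number_of_titles and titles are sorted by decreasing
    impressions. Its slope at a given rank is n * imps / total - 1,
    which approaches -1 as the titles become negligibly small.
    Returns None if the slope never gets that close to -1.
    """
    ranked = sorted(impressions, reverse=True)
    n = len(ranked)
    total = sum(ranked)
    for rank, imps in enumerate(ranked, start=1):
        slope = n * imps / total - 1.0
        if slope <= -1.0 + tolerance:
            return rank
    return None


# Made-up toy counts, for illustration only (not LinkedIn data).
toy = [1000, 500, 100, 10, 1, 1, 1, 1, 1, 1]
print(cutoff_rank(toy))
```

A purely geometric tolerance ignores the economic criterion the slide recommends (marginal earnings minus marginal overhead), so it is at best a stand-in for that decision.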
17. Projective Word-vector space
• Weighted point set embedded in Euclidean space, with induced metric based on Cosine Sim.
• 25008 points in 50,000 D! Recall that n points define only n-1 D
• DIMENSIONALLY SPARSE!, not just in density
• Most angles are nearly 90 degrees
[Figure: titles Ti and Tj (bin Ti of size ni) separated by angle ϑij, drawn against XYZ, UVW and ABC axes, with the boundary of nearest neighbour polyhedra of bins]
18. GEOMETRY OF DATA SPACE:
How should be used:
1. Project Title Word vectors onto N-1 simplex: Σ components = 1
2. Calculate Mean Word Vector
3. Drop Titles (KLPDS)
4. Recalculate the Mean Word Vector and MOVE there (increases discrimination)
5. Project vectors onto unit sphere
6. Angle is geodesic measure (distances, density etc.): sin(θ/2) = |Ti-Tj|/2
[Figure: panels labelled 1., 2-3. and 4-5. showing titles Ti and Tj and the angle θ between them]
As opposed to?
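The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the presented implementation: the function names are hypothetical, step 3 (dropping titles) is omitted because the slide does not define the criterion, and titles are assumed to be non-negative word-count vectors.

```python
import math


def to_simplex(vector):
    """Step 1: scale non-negative word counts so the components sum to 1."""
    total = sum(vector)
    if total <= 0:
        raise ValueError("vector needs a positive component sum")
    return [x / total for x in vector]


def mean_vector(vectors):
    """Steps 2 and 4: component-wise mean of equally long vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def recenter_on_sphere(vectors):
    """Steps 1, 2, 4, 5: simplex projection, origin moved to the mean, unit length."""
    simplex = [to_simplex(v) for v in vectors]
    mean = mean_vector(simplex)
    result = []
    for v in simplex:
        shifted = [x - m for x, m in zip(v, mean)]
        norm = math.sqrt(sum(x * x for x in shifted))
        if norm == 0.0:
            raise ValueError("a title coincides with the mean vector")
        result.append([x / norm for x in shifted])
    return result


def geodesic_angle(a, b):
    """Step 6: angle in radians between unit vectors, from sin(theta/2) = |a-b|/2."""
    chord = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 2.0 * math.asin(min(1.0, chord / 2.0))


# Three toy titles with no words in common: 90 degrees apart before recentering.
titles = [[2, 0, 0], [0, 2, 0], [0, 0, 2]]
unit = recenter_on_sphere(titles)
print(math.degrees(geodesic_angle(unit[0], unit[1])))
```

In this toy case the recentered titles end up about 120 degrees apart instead of 90, which is one way to see how moving the origin to the mean can increase discrimination.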
19. Radial distribution function of Titles
Almost all angles are > 45
[Figure: count (0e+00 to 1e+07) vs. theta (10 to 70)]
No SCALE OR higher order structure (for hierarchical taxonomy)
20. Log(count) vs. Theta
[Figure: log of count (0 to 7) vs. theta (10 to 70)]
No scale or higher order structure (for hierarchical taxonomy)
23. LOCAL DIMENSION
Radius mass
1 1
2 8
3 27
4 64
Exponent (coefficient of the linear term in the log-log plot)
= Dimension (above, it is 3)
Each point (title) has a local dimension Di,
which is used to calculate the density of the cluster:
Imps/r^Di
These densities are then compared
and the highest are selected for categories
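As a worked check of the radius/mass table above, the exponent can be estimated with an ordinary least-squares fit in log-log space. This is a sketch only: the function names `local_dimension` and `cluster_density` are hypothetical, and the slides do not say which fitting method was actually used.

```python
import math


def local_dimension(radii, masses):
    """Slope of log10(mass) against log10(radius), fitted by least squares."""
    xs = [math.log10(r) for r in radii]
    ys = [math.log10(m) for m in masses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def cluster_density(impressions, radius, dimension):
    """Density as defined on the slide: Imps / r^Di."""
    return impressions / radius ** dimension


# The table from the slide: mass grows as radius cubed.
dim = local_dimension([1, 2, 3, 4], [1, 8, 27, 64])
print(round(dim, 6))
```

Because each title has its own Di, densities computed this way are compared across clusters whose volumes scale differently with radius.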
24. Aggregate Radial Distribution of Titles
[Figure: log10(Number of Titles) (0 to 8) vs. log10(Theta) (1 to 2); series logcount with linear fit y = 6.5687x - 5.3293]
Average cluster dimension ~ 6.6
25. Log(count) vs Dim.
What does “dimension of cluster” mean?
[Figure: log of count (0 to 10) vs. Dim (0 to 120)]
26. Power law evolution of clustering?
No natural break points.
[Figure: log(AvgDens, 10) (2.4 to 3.6) vs. log(Cats, 10) (2.2 to 3.4); Exponent = -1]
27. FLOW
Big Picture: Taxonomy
Use case 2: Search, Recruiter, Advertiser, Recc.
CLIENT: Sales Team or Search
[Diagram: Top Level choices (Management, Manufacturing, Marketing, Software) leading down to sales nodes (Marketing Sales, Sales, VP Sales, Dir. Sales, Sales Rep); sidebar labels: Title categorization, Semantic network, Relational]
28. FLOW
Big Picture: Relational
Use case 1: FIELD SALES
[Diagram: three layers, Categories (Sales), Titles (Sales Rep, Sales Assoc., Sales Mgr, Reg. Sales Mgr) and Members, with links labelled Prob Defn 1 and Prob Defn 2; sidebar labels: Taxonomy, Title categorization, Semantic network]
29. Inadequacy of Cosine Similarity
• Bit vectors differing in 1/3 of their 1-bits ~ 70% Cosine Similarity and 70% Sine Dissimilarity
• PROOF of maintaining preference order does NOT account for Computational fragility: at θ=6.3°, +/- 0.005 in Cosine => 2.6° – 8.5° in angle
• Vectors at 30 degs have Cosine Sim ~ 85%
• NOT a distance – NO geometry
• DOES NOT provide good discrimination between close neighbours
Even as intermediate means of calculating angle, computationally fragile:
• Poor choice, prone to error in region of interest
• 0 < angle < pi/2 (Maximally dissimilar only 90 degs away!)
• Inadequate notion of “maximally dissimilar”
FLOW:
• Obtaining Clean titles 2.0
• V2.1 LEANER DATA
• Deconstruct V2.0 and V2.1
• V2.2 Data Space
• Title categorization
• Semantic network
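The numbers quoted on this slide can be checked with a short calculation. The sketch below is illustrative only; the helper names `bit_vector_cosine` and `angle_range_deg` are hypothetical, and the bit vectors are assumed to have the same number of 1-bits.

```python
import math


def bit_vector_cosine(shared_ones, ones_per_vector):
    """Cosine of two bit vectors with equal 1-bit counts and `shared_ones` in common."""
    return shared_ones / ones_per_vector


def angle_range_deg(theta_deg, cosine_error):
    """Range of angles (degrees) consistent with cos(theta) +/- cosine_error."""
    c = math.cos(math.radians(theta_deg))
    low = math.degrees(math.acos(min(1.0, c + cosine_error)))
    high = math.degrees(math.acos(max(-1.0, c - cosine_error)))
    return low, high


# Vectors differing in 1/3 of their 1-bits share 2 out of every 3.
cosine = bit_vector_cosine(2, 3)
sine = math.sqrt(1.0 - cosine ** 2)
print(round(cosine, 3), round(sine, 3))

# A small error in the cosine near 6.3 degrees spreads over several degrees.
low, high = angle_range_deg(6.3, 0.005)
print(round(low, 1), round(high, 1))

# Cosine of vectors 30 degrees apart.
print(round(math.cos(math.radians(30.0)), 3))
```

The cosine is flat near zero angle, so the same absolute error in the cosine corresponds to a much larger spread in angle for close neighbours than for distant ones, which is the fragility the slide points at.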
30. What does LinkedIn want from Titles?
1. Navigational ease for Sales, Search, Recommendation
2. Robust and maintainable structure
3. Dynamic response to labor mkt changes
4. Structure based on Domain expertise, NOT on member
information
5. Assignment of members based on profile and inferred info
6. “Universal” acceptability
7. Free and available? Has somebody else done the work?
8. Expand use of LinkedIn as point of entry for
recruiters, based on how they define jobs and use titles in
searches