go.indeed.com/IndeedEngTalks
Tokens and Millicents
Technical challenges in launching Indeed
around the world
Engineering Director
Dan Heller
We help people get jobs.
[Screenshots: the Indeed search form (“what” / “where”) localized for four markets]
English: “software” in “austin”
German: “produktionshelfer” (production assistant) in “münchen” (Munich)
Japanese: “登録栄養士” (registered dietitian) in “大阪” (Osaka)
Greek: “βοηθός λογιστή” (assistant accountant) in “Αθήνα” (Athens)
Software Engineer
Preetha Appan
Precision and Recall
ALL JOBS
Relevant
Jobs
Returned
Jobs
Precision: Positive Predictive Value
Precision = # Returned and relevant / # Returned
Precision
Job seeker searches for “architect”
usually means “building architect”
Precision
Job seeker searches for “architect”
10 jobs returned:
8 building architect jobs Relevant
2 software architect jobs Not Relevant
Precision: 8 / 10
Recall: Sensitivity
Recall = # Returned and relevant / # Relevant
Recall
Job seeker searches for “hr”
Jobs that mention “hr” or “human resources”
are both relevant to the job seeker.
Recall
Job seeker searches for “hr”
10 jobs are relevant:
7 hr jobs Returned
3 human resources jobs Not Returned
Recall: 7 / 10
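The two definitions above can be checked with a few lines of Python. This is only a sketch: the job ids are made up to reproduce the “architect” and “hr” counts from the slides.

```python
def precision(returned, relevant):
    """Share of returned jobs that are relevant."""
    return len(returned & relevant) / len(returned)


def recall(returned, relevant):
    """Share of relevant jobs that were returned."""
    return len(returned & relevant) / len(relevant)


# "architect": 10 jobs returned, 8 of them (building architect) relevant.
architect_returned = {f"job{i}" for i in range(10)}
architect_relevant = {f"job{i}" for i in range(8)}

# "hr": 10 relevant jobs, only the 7 that literally say "hr" are returned.
hr_relevant = {f"job{i}" for i in range(10)}
hr_returned = {f"job{i}" for i in range(7)}

print(precision(architect_returned, architect_relevant))  # 0.8
print(recall(hr_returned, hr_relevant))                   # 0.7
```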
Improving Recall
in Job Search
Senior Software Engineer - Search
Indeed - Austin, TX
Indeed.com is seeking a Senior Software Engineer
responsible for the information retrieval system that
powers Indeed’s job search website.
If you are an engineer who's passionate about building
innovative products...
Job Description - English
Tokenization
Inverted Index
● Like index in the back of a book
● words = tokens, page numbers = doc ids
Inverted Index
Token Job A Job B Job C
assistant ✔
developer ✔
engineer ✔
lawyer ✔ ✔
paralegal ✔ ✔
retrieval ✔
Inverted Indexes
Allow you to:
● Quickly find all documents containing a
token
● Perform boolean queries, e.g. “java AND developer”
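A toy inverted index makes both points concrete. This is a minimal in-memory sketch, not how Lucene stores its index; the three job texts and the function names are invented for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each token to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_and(index, *tokens):
    """Boolean AND: doc ids that contain every token."""
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings) if postings else set()


docs = {
    "A": "java developer",
    "B": "java architect",
    "C": "paralegal assistant",
}
index = build_index(docs)
print(sorted(index["java"]))                 # ['A', 'B']
print(search_and(index, "java", "developer"))  # {'A'}
```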
Apache Lucene
Open source inverted index implementation
Fast, widely used
Tokenization with Lucene
StandardAnalyzer
● Uses space and punctuation to determine
token boundaries
StandardAnalyzer - problems
● C++, C# → C
● O’Reilly → O, Reilly
Tokenization with Lucene
JobAnalyzer
● forked StandardAnalyzer
● Modified it to make it work for jobs
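The problem and one possible fix can be sketched with regular expressions. Neither pattern is Lucene's StandardAnalyzer or Indeed's JobAnalyzer: the first just imitates the splitting behaviour described on the slide, and the second is a hypothetical job-aware variant that keeps “c++”, “c#” and apostrophes intact.

```python
import re

# Splits on anything that is not a word character (imitates the slide's problem).
NAIVE = re.compile(r"\w+")
# Hypothetical job-aware pattern: try special tokens first, then words
# that may contain an internal apostrophe.
JOB_AWARE = re.compile(r"c\+\+|c#|\w+(?:['’]\w+)*")


def naive_tokens(text):
    return NAIVE.findall(text.lower())


def job_aware_tokens(text):
    return JOB_AWARE.findall(text.lower())


text = "C++, C# and O’Reilly"
print(naive_tokens(text))      # ['c', 'c', 'and', 'o', 'reilly']
print(job_aware_tokens(text))  # ['c++', 'c#', 'and', 'o’reilly']
```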
Secrétaire
Saclay
Au sein de la direction de la Qualité et de l'Environnement (DQE)
vous seconderez la secrétaire-assistante. Vos principales
missions seront :
- organisation de réunions
- l'accueil téléphonique
- la gestion des missions ..
Job Description - French
Chinese
Japanese
Korean
(CJK)
Job Description - Chinese
岗位描述:
1、全厂电气设备的日常检查、记录,在操作工或
主操的指导下进行工艺操作.
2、现场液体充装,现场充装安全的管理.
3、负责现场工作环境的整洁.
...
Job Description - Japanese
ちょっと想像してみてください。
ご近所のサーティワンにあなたが企画開発
Kanji Hiragana Katakana
Chinese using JobAnalyzer
全厂电气设备的日常检查、记录,
在操作工或主操的指导下进行工艺操作.
“Daily inspection of electrical equipment plant-wide”
JobAnalyzer in CJK = Poor recall
CJKAnalyzer - bigrams
医療事務兼検査助手
Bigrams: 医療 (medical), 療事 (????), 事務 (affairs), ...
Use bigram tokenizer
on query
“東京都” (Tokyo prefecture)
Bigrams: 東京 (Tokyo), 京都 (Kyoto)
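A bigram tokenizer is only a sliding window of two characters, which is why a query for Tokyo prefecture also produces the token for Kyoto. A minimal sketch (the function name is an assumption, and this is not Lucene's CJKAnalyzer):

```python
def bigrams(text):
    """Return every overlapping pair of adjacent characters."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(bigrams("東京都"))                # ['東京', '京都']
print(bigrams("医療事務兼検査助手")[:3])  # ['医療', '療事', '事務']
```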
Bigram tokenizer Drawbacks
● Poor precision
● Too many terms
Properly tokenize CJK
Accent and gender normalization
● secrétaire, secretaire
● vendeur, vendeuse
● promotor@s
Language Detection
Language Detection options
● HTTP Content-Language response header
○ Most sites don’t provide this header
○ May not be accurate
Language Detection - ICU4J
● ICU4J’s CharsetDetector
○ Works well for languages with single byte
encoded characters
○ Detect that language is one of
Danish, Dutch, English, French,
German, Italian, Portuguese, Swedish
Naive Bayesian classifier
● Features - words
● Strong independence assumption
● Class label - language
Naive Bayesian Language detector
Hand labelled training data in each language
Naive Bayesian Language detector
For each language, calculate P(w_i ϵ L_j)
● P(“experience” ϵ en) = 0.85
Naive Bayesian Language detector
P(w_1 ϵ L_j) * P(w_2 ϵ L_j) * P(w_3 ϵ L_j) * ...
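The product of per-word probabilities can be sketched as a small word-level classifier. This is a textbook multinomial Naive Bayes with add-one smoothing and log-probabilities (to avoid underflow), not Indeed's detector; the four training sentences and all names are made up for illustration.

```python
import math
from collections import Counter


def train(labelled_docs):
    """labelled_docs: iterable of (language, text). Returns word counts per language."""
    counts = {}
    for language, text in labelled_docs:
        counts.setdefault(language, Counter()).update(text.lower().split())
    return counts


def detect(text, counts):
    """Pick the language with the highest summed log-probability."""
    vocabulary = set().union(*counts.values())
    best_language, best_score = None, -math.inf
    for language, words in counts.items():
        total = sum(words.values())
        # Independence assumption: the score is a sum of per-word log terms.
        score = sum(
            math.log((words[word] + 1) / (total + len(vocabulary)))
            for word in text.lower().split()
        )
        if score > best_score:
            best_language, best_score = language, score
    return best_language


counts = train([
    ("en", "experience required for this job"),
    ("en", "we are looking for a driver"),
    ("fr", "expérience requise pour ce poste"),
    ("fr", "nous recherchons un chauffeur"),
])
print(detect("experience required", counts))        # en
print(detect("nous recherchons un poste", counts))  # fr
```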
Using Unicode Blocks
[Figure: each script occupies a contiguous Unicode range (min to max); shown for Thai and Greek]
● 100% accurate
● Used in:
○ Thai
○ Greek
○ Korean
○ Hebrew
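Because each of these scripts is used by essentially one language, a range check on code points is enough. The sketch below uses the standard Unicode block ranges for Thai, Greek, Hebrew and Hangul; the “more than half of the letters” threshold and the function name are assumptions, not details from the talk.

```python
# Unicode block ranges (inclusive) per language code.
SCRIPT_RANGES = {
    "th": [(0x0E00, 0x0E7F)],                    # Thai
    "el": [(0x0370, 0x03FF), (0x1F00, 0x1FFF)],  # Greek, Greek Extended
    "he": [(0x0590, 0x05FF)],                    # Hebrew
    "ko": [(0x1100, 0x11FF), (0xAC00, 0xD7AF)],  # Hangul Jamo, Hangul Syllables
}


def detect_by_unicode_block(text, min_share=0.5):
    """Return a language code if most letters fall in its block, else None."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return None
    for language, ranges in SCRIPT_RANGES.items():
        hits = sum(
            1 for ch in letters
            if any(low <= ord(ch) <= high for low, high in ranges)
        )
        if hits / len(letters) > min_share:
            return language
    return None


print(detect_by_unicode_block("Αθήνα"))              # el
print(detect_by_unicode_block("서울 소프트웨어 엔지니어"))  # ko
print(detect_by_unicode_block("software engineer"))  # None
```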
CJ language detection
● Strongly weight Hiragana and Katakana
● Some characters (Kanji) common between
Chinese and Japanese
● p(卒 ϵ ja) = 0.99 p(卒 ϵ zh) = 0.000001
Language Results
● Cross-validated on hand-labeled test data
● 99% accurate for text > 30 characters
○ Average job description is 200 characters
● Fast - 0.6ms per job
Other language detectors
Google - https://code.google.com/p/cld2/
CJK Tokenization
CJK tokenizers
● Dictionary-based
● Statistical model & dictionary
Dictionary-based tokenizers
● Dictionary of words in language
● Scan input sentence, return all possible
tokenizations
Context matters
北京大学生前来应聘
北京 大学生 前来 应聘
Beijing college students come to apply for jobs
北京大学 生前 来应聘
Peking University before death come to apply for jobs
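The ambiguity shows up as soon as a dictionary-based tokenizer enumerates every way to cover the sentence with dictionary words. A minimal sketch follows; the seven-word dictionary is invented for this example, and it splits 来 and 应聘 into separate entries.

```python
def segmentations(text, dictionary):
    """Return every way to split text into words from the dictionary."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            for rest in segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results


# Hypothetical mini-dictionary covering the example sentence.
dictionary = {"北京", "北京大学", "大学生", "生前", "前来", "来", "应聘"}

for words in segmentations("北京大学生前来应聘", dictionary):
    print(" ".join(words))
# 北京 大学生 前来 应聘
# 北京大学 生前 来 应聘
```

Both outputs are valid dictionary covers, so the dictionary alone cannot choose between them; something has to score the candidates.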
Hidden Markov Model
北京大学生前来应聘
[Diagram: candidate paths through the sentence. 北京 (Beijing) → 大学生 (college student) is accepted ✔; 北京大学 (Peking University) → 生前 (before death) is rejected ✘. Example dictionary entries shown: 北京大学 (Peking University), 中国 (China), 生前 (before death).]
CJK tokenizers
● Chinese - Imdict
● Japanese - Sen
● Korean - LuceneKorean
Chinese tokenization
http://nlp.stanford.edu/projects/chinese-nlp.shtml
More recall challenges
● Different rules per language around
○ Gender
○ Plurals
○ Collation
Apply language specific rules to transform words
to canonical form
Use detected language
Stemming
What is stemming?
the process of turning multiple variations of a
word into a single equivalent root
Stemming examples
● driver, drivers → driver
● secretaire, secrétaire → secretaire
● vendeur, vendeuse → vendeur
Why stemming matters
● Return all possible relevant jobs given the
user’s query, not just exact matches
Stemming - Lucene Analyzers
● Do stemming before adding to inverted
index
● Examples
○ PorterStemFilter
○ SnowballAnalyzer
○ EnglishMinimalStemmer
Inverted Index
Job A: Directrice de Documentaires
Job B: Directeur de production
Token          Job A  Job B
de             ✔      ✔
directeur      ✔      ✔
documentaires  ✔
production            ✔
Search with stemming tokenizers
● At search time, use the same analyzer on
the query
○ “directrice” → “directeur”
● Search for “directrice” returns both jobs
Modifying stem rules requires a full index rebuild
● If roots have changed, all jobs need to be reprocessed
Drawbacks
● Loss of precise information
○ “Directrice” search should return exact match only
Decouple stemming
from indexing
Term Expansion Maps
Term Expansion Maps
● Map from String->List<String>
● Key is root, values are tokens that stem
to that root
● driver → driver, drivers
● vendeur → vendeur, vendeuse
Stemmer interface
● One method
● String stem(String token)
● Many implementations
● EnglishStemmer
● FrenchStemmer
● GermanStemmer
● SpanishStemmer
Building term expansion map
for each language
    for each term in language
        root = Stemmer.stem(term)
        termMap[root].append(term)
● Takes ~1.5 minutes on index with 2
million tokens and 18 languages
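The loop above could look like this in runnable form. It is a sketch only: `toy_french_stem` is a hypothetical two-rule stand-in for a real FrenchStemmer, and the term list is invented.

```python
from collections import defaultdict


def toy_french_stem(token):
    """Hypothetical stand-in for a real French stemmer (two suffix rules only)."""
    for suffix, replacement in (("trice", "teur"), ("euse", "eur")):
        if token.endswith(suffix):
            return token[:-len(suffix)] + replacement
    return token


def build_term_expansion_map(terms_by_language, stemmers):
    """Map each root to the list of index terms that stem to it."""
    term_map = defaultdict(list)
    for language, terms in terms_by_language.items():
        stem = stemmers[language]
        for term in sorted(terms):
            term_map[stem(term)].append(term)
    return dict(term_map)


term_map = build_term_expansion_map(
    {"fr": {"de", "directeur", "directrice", "vendeur", "vendeuse"}},
    {"fr": toy_french_stem},
)
print(term_map["directeur"])  # ['directeur', 'directrice']
print(term_map["vendeur"])    # ['vendeur', 'vendeuse']
```

The map is derived entirely from terms already in the index, so changing a stemmer only means rerunning this loop, not reindexing the jobs.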
Using term
expansion map
Job A: Directrice de documentaires
Job B: Directeur de production
Token          Job A  Job B
de             ✔      ✔
directeur             ✔
directrice     ✔
documentaires  ✔
production            ✔
Search Service
“directrice” → French Stemmer → “directeur”
“directeur” → Term Expansion Map → “directrice”, “directeur”
Query Rewriter → “directrice” OR “directeur”
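At search time the flow is: stem the query term to its root, look the root up in the term expansion map, and OR the variants together. A minimal sketch, assuming a plain string as the rewritten query (the real Query Rewriter presumably builds Lucene query objects); the stemmer and map below are hypothetical stand-ins.

```python
def rewrite_query(token, stem, term_map):
    """Expand one query token into an OR over every indexed variant of its root."""
    root = stem(token)
    variants = term_map.get(root, [token])
    return " OR ".join(f'"{variant}"' for variant in variants)


# Hypothetical stemmer and map for the slide's example.
def french_stem(token):
    return "directeur" if token == "directrice" else token


term_map = {"directeur": ["directrice", "directeur"]}

print(rewrite_query("directrice", french_stem, term_map))
# "directrice" OR "directeur"
print(rewrite_query("production", french_stem, term_map))
# "production"
```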
Benefits
● Modifying stem rules doesn’t require index
rebuilds
○ Takes minutes on index with millions of jobs
○ Gave us the flexibility to implement stemming
rules iteratively as we came across different use cases
Benefits
● Precise information
○ “directrice” search query returns exact match only
Code deploy to change
rules or add languages
49 team members
26 nationalities
18 languages
Scale Stemming
● Indeed continued international expansion
● Needed stemming to scale without code
deploys and coordination between
developers and country managers.
Goal
● Efficient
○ Store term expansion maps efficiently
○ Search time as fast as possible
Goal
● Generic
○ Identify patterns common to all languages
■ ies→y in English, se→r in French
Goal
● Comprehensive
○ Support all use cases we care about:
■ plurals
■ synonyms
■ abbreviations
■ accent collation
■ gender suffixes
Goal
● Scalable
○ Adding a new language shouldn’t need a code
deploy
Rule driven stemming
one stemmer. all languages.
What is a stemming rule
● Rules transform tokens into their root form
Rule attributes
● Rules have “from” (origin) and “to”
(replacement)
Rule attributes
● Rules have a type
○ Types define exactly how the text transformation
happens
Rule type - exact
● Change origin to replacement when it’s an
exact match
Exact rule
English
sr→senior
attorney→lawyer
Italian
colf→domestica
Dutch
leraar→docent
Rule type - substring
● Change all occurrences of origin to
replacement
Substring rule
English - é→e
résumé → resume
café → cafe
German - ä → a
verkäufer → verkaufer
French - ô→o
hôtesse → hotesse
Rule type - suffix
● Change origin to replacement if it matches
at the end of token
Suffix Rule - English
● ies→y
○ families → family
○ policies → policy
● s→’’
○ nurses → nurse
○ drivers → driver
Suffix Rule - French
● euse→eur
○ serveuse→serveur
● ienne→ien
○ gardienne→ gardien
Rules are ordered
Order matters
Stem “families”
Rules
● s→’’
● ies→y
Apply s→’’
● families → familie ✘
With ies→y listed first
● families → family ✔
Rules can be marked as terminal
● No more rules applied after terminal rule
Prevent over-stemming
● s → ‘’ can cause this → thi
● Min Length - special terminal rule
● Usually set to anywhere from 3 to 5
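The three rule types, rule ordering, terminal rules and the minimum length can be combined into one small generic stemmer. This is a sketch of the idea, not Indeed's implementation: the `Rule` class, the field names, and the reading of “min length” as “leave tokens of that length or shorter untouched” are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    kind: str            # "exact", "substring" or "suffix"
    origin: str          # the "from" text
    replacement: str     # the "to" text
    terminal: bool = False


def apply_rule(rule, token):
    if rule.kind == "exact":
        return rule.replacement if token == rule.origin else token
    if rule.kind == "substring":
        return token.replace(rule.origin, rule.replacement)
    if rule.kind == "suffix":
        if token.endswith(rule.origin):
            return token[:len(token) - len(rule.origin)] + rule.replacement
        return token
    raise ValueError(f"unknown rule type: {rule.kind}")


def stem(token, rules, min_length=4):
    """Apply rules in order; stop after a terminal rule that changed the token."""
    if len(token) <= min_length:   # assumed meaning of the min-length rule
        return token
    for rule in rules:
        stemmed = apply_rule(rule, token)
        changed = stemmed != token
        token = stemmed
        if changed and rule.terminal:
            break
    return token


ENGLISH = [
    Rule("substring", "é", "e"),
    Rule("suffix", "ies", "y", terminal=True),
    Rule("suffix", "s", ""),
]
FRENCH = [
    Rule("suffix", "s", ""),
    Rule("suffix", "trice", "teur"),
]

print(stem("families", ENGLISH))    # family
print(stem("this", ENGLISH))        # this
print(stem("directrices", FRENCH))  # directeur
```

Because the rules are plain data, adding a language means adding another list of rules rather than another stemmer class.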
Babelfish: Stem rule editor
● Webapp to edit and publish rules
● Rules interpreted by generic stemmer
● 27 languages
Stem rule editor
Ability to audit rules
directrices
directrice suffix rule “s” → “”
directeur suffix rule “trice” → “teur”
ingénieur
ingenieur substring rule “é” → “e”
[Diagram: stemming data flow]
Country Managers → Stem Rule Editor
EN s → ‘’, ies → y, …
FR é → e, ù → u, …
Stem Rule Editor → Jobs Index Builder → Term Expansion Map
sale → sale, sales
policy → policy, policies
Job Seekers → query → Search Service (uses Term Expansion Map) → results
Term expansion map storage
● Custom serialization format
○ Store string array as UTF8 bytes and offsets
○ Front encoding for additional compression
● 2X smaller than using Java native
serialization
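Front encoding (also called front coding) stores each term in a sorted list as “how many leading characters it shares with the previous term” plus the remaining suffix. The sketch below works on Python strings for readability; the talk's format stores UTF-8 bytes and offsets, and its exact layout is not described, so treat this purely as an illustration of the technique.

```python
def front_encode(sorted_terms):
    """Encode each term as (shared prefix length with previous term, suffix)."""
    encoded, previous = [], ""
    for term in sorted_terms:
        shared = 0
        limit = min(len(term), len(previous))
        while shared < limit and term[shared] == previous[shared]:
            shared += 1
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


def front_decode(encoded):
    """Rebuild the original term list from (shared length, suffix) pairs."""
    terms, previous = [], ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        terms.append(previous)
    return terms


terms = ["policies", "policy", "sale", "sales"]
encoded = front_encode(terms)
print(encoded)
# [(0, 'policies'), (5, 'y'), (0, 'sale'), (4, 's')]
print(front_decode(encoded) == terms)  # True
```

Stemmed variants of the same root tend to sort next to each other and share long prefixes, which is why this kind of encoding suits a term expansion map.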
Comprehensive
● Gender
● Accents
● Plurals
● Synonyms
Scalable
27 languages use stemming rules
Re-used language detection and stemming
libraries in resume search
Efficient
● Term expansion map in Europe index has 2
million terms in 18 languages - 60MB on
disk
● Building term expansion maps takes ~ 1.5
minutes
● Doing boolean query for stemming adds
~5ms to median search time (~35ms)
Stemming helps job seekers
Searches that return no jobs reduced by 60%
with stemming
3% to 5% more clicks
Multi-currency
Sponsored Jobs
Sponsored Jobs at Indeed
Real-time auction used to determine
Sponsored Job impressions
Auction winner based on expected value
Expected Value = Bid x eCTR
eCTR: Expected Click-Through Rate
Job Bid x eCTR = Value → Rank
A $3.00 5% $0.15 2
B $2.00 10% $0.20 1
C $1.00 8% $0.08 3
Job Bid x eCTR = Value → Rank
B $2.00 10% $0.20 1
A $3.00 5% $0.15 2
B could win the auction with a lower bid...
…only charge what’s needed to win!
$1.50 x 10% = $0.15
Cost = $1.51
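The ranking and the “only charge what’s needed to win” price can be reproduced in a few lines. This sketch prices only the top slot and reads the pricing rule off the slide's example (the bid that would tie the runner-up, plus one cent, capped at the winner's own bid); the function name and data layout are assumptions. Amounts are integer cents and eCTR is an exact fraction to avoid floating-point rounding.

```python
import math
from fractions import Fraction


def run_auction(bids):
    """bids: list of (job, bid_cents, ectr). Needs at least two bidders.

    Returns (jobs ranked by expected value, winner's cost in cents).
    """
    ranked = sorted(bids, key=lambda bid: bid[1] * bid[2], reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    runner_up_value = runner_up[1] * runner_up[2]
    # Bid at which the winner's expected value would only tie the runner-up.
    tie_bid = runner_up_value / winner[2]
    cost = min(winner[1], math.floor(tie_bid) + 1)
    return [job for job, _, _ in ranked], cost


bids = [
    ("A", 300, Fraction(5, 100)),
    ("B", 200, Fraction(10, 100)),
    ("C", 100, Fraction(8, 100)),
]
ranking, cost_cents = run_auction(bids)
print(ranking)     # ['B', 'A', 'C']
print(cost_cents)  # 151
```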
Sponsored Jobs at Indeed
“Generalized Second Price Auction”
● Fair for employers
● Ensures sponsored results are relevant and
useful for job seekers
Sponsored Jobs at Indeed
Employers set their bid & budget
employer_id int(10) unsigned,
bid decimal(10,2) unsigned,
daily_budget decimal(10,2) unsigned,
Sponsored Jobs at Indeed
A builder process creates read-optimized data
structures for the auction system
On search results page, execute auction to
determine sponsored impressions
Sponsored Jobs at Indeed
When job seeker clicks on sponsored result,
log information from the auction
employerId
jobId
bid
cost
…
Sponsored Jobs at Indeed
Process click logs to update budgets and
charge employers
Apply business rules during click processing:
● Fraud detection
● Duplicate click detection
SJ outside the US
Non-US employers wanted their jobs in
sponsored results...
...but they don’t have US Dollars
SJ outside the US
v1: Use credit cards
Credit card companies convert charges to the
employer’s currency
SJ outside the US
Credit Cards
+ No changes needed
- Bad UX for employers
- Disadvantageous exchange rates
- Employers bear currency risk
Credit Cards: Currency Risk
Desired Daily Budget: CA $100.00
Exchange rate on Jan 1: 0.9351
Set Daily Budget to: $93.51
Exchange rate on Jan 31: 0.8970
Effective Daily Budget: CA $104.25 (+4.25%)
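The drift can be reproduced with the slide's two exchange rates. A small worked example using Python's `decimal` module; the variable names and the half-up rounding to cents are assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

desired_cad = Decimal("100.00")
rate_jan_1 = Decimal("0.9351")    # USD per CAD on Jan 1
rate_jan_31 = Decimal("0.8970")   # USD per CAD on Jan 31

# The employer fixes a USD budget using the Jan 1 rate...
usd_budget = (desired_cad * rate_jan_1).quantize(CENT, rounding=ROUND_HALF_UP)
# ...but by Jan 31 the same USD budget costs more Canadian dollars.
effective_cad = (usd_budget / rate_jan_31).quantize(CENT, rounding=ROUND_HALF_UP)
drift_percent = (effective_cad - desired_cad) / desired_cad * 100

print(usd_budget)     # 93.51
print(effective_cad)  # 104.25
print(drift_percent)  # 4.25
```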
Multi-currency
Sponsored Jobs
Auction
Multi-currency SJ
Employers can set bids and budgets in
preferred currency
Canadian Dollars CAD
Australian Dollars AUD
Japanese Yen JPY
Euro EUR
British Pounds GBP
Swiss Francs CHF
Multi-currency SJ
Single auction for all employers using any
currency
Multi-currency SJ
Fair exchange rates for employers
Multi-currency SJ
Transparent and repeatable calculations
Multi-currency SJ
Create a new “pseudo-currency” for use within
the auction:
millicent
Millicents
Exchange rate between USD and millicents
is fixed:
$0.01 == 1000 millicents
$1.00 == 10^5 millicents
Millicents
Exchange rates between other currencies and
millicents can vary over time:
€1.00 == 136,170 millicents
¥100 == 98,350 millicents
Millicents
Provide enough granularity to differentiate
similar values in different currencies
All of these are about $1.00 (USD):
£0.60 (GBP)
€0.73 (EUR)
¥102 (JPY)
Which is larger?
Millicents
Converting to USD doesn’t help
USD: $1.00 → $1.00
GBP: £0.60 → $1.00
EUR: €0.73 → $1.00
JPY: ¥102 → $1.00
Millicents
Millicents provide granularity to rank values
USD: $1.00 → 100000 mc
GBP: £0.60 → 100450 mc
EUR: €0.73 → 99519 mc
JPY: ¥102 → 100317 mc
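Converting to millicents is a single multiplication per currency. The sketch below uses the fixed USD rate and the example EUR and JPY rates quoted earlier in the deck (€1.00 == 136,170 and ¥100 == 98,350 millicents), expressed per minor unit; the table name, function name and half-up rounding are assumptions. The EUR and GBP figures in the ranking table above imply slightly different rates, so only the USD and JPY rows are reproduced here.

```python
from decimal import Decimal, ROUND_HALF_UP

# Millicents per minor unit (cent, euro cent, yen).
MILLICENTS_PER_MINOR_UNIT = {
    "USD": Decimal("1000"),     # fixed: $0.01 == 1000 millicents
    "EUR": Decimal("1361.70"),  # from €1.00 == 136,170 millicents
    "JPY": Decimal("983.5"),    # from ¥100 == 98,350 millicents
}


def to_millicents(currency, minor_units):
    """Convert an integer amount of minor units into whole millicents."""
    rate = MILLICENTS_PER_MINOR_UNIT[currency]
    return int((rate * minor_units).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


print(to_millicents("USD", 100))  # 100000
print(to_millicents("JPY", 102))  # 100317
print(to_millicents("EUR", 100))  # 136170
```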
Millicents
32 bit signed values
$21,474 USD equivalent
64 bit signed values
$92 trillion USD equivalent
Local Currency Values
Values in specific currency are represented
with currency code and an integer
Integer represents “minor unit”, depends on
the currency type:
(USD, 543) == $5.43
(EUR, 543) == €5.43
(JPY, 543) == ¥543
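The (currency code, integer) pair only becomes a display amount once the number of minor-unit digits is known. A minimal sketch for the three currencies on the slide; the lookup tables and function name are assumptions, and amounts are taken to be non-negative.

```python
# Decimal places of the minor unit per currency (USD and EUR: cents, JPY: none).
MINOR_UNIT_DIGITS = {"USD": 2, "EUR": 2, "JPY": 0}
SYMBOLS = {"USD": "$", "EUR": "€", "JPY": "¥"}


def format_amount(currency, minor_units):
    """Render an integer number of minor units as a display string."""
    digits = MINOR_UNIT_DIGITS[currency]
    symbol = SYMBOLS[currency]
    if digits == 0:
        return f"{symbol}{minor_units}"
    major, minor = divmod(minor_units, 10 ** digits)
    return f"{symbol}{major}.{minor:0{digits}d}"


print(format_amount("USD", 543))  # $5.43
print(format_amount("EUR", 543))  # €5.43
print(format_amount("JPY", 543))  # ¥543
```

Keeping the stored value an integer avoids floating-point rounding in bids, budgets and charges.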
Local Currency Values
For each currency, preferable that the “minor
unit” is roughly equal to $0.01 USD
● Exchange rate representation
● Fairness in auction competition
Local Currency Values
32 bit signed values
$21 million USD (and others)
¥2.1 billion JPY
64 bit signed values
$90 quadrillion USD (and others)
¥9 quintillion JPY
Multi-currency SJ
Change bid and budget representations to use
[currency, integer]
Multi-currency SJ
Create process to retrieve and record
exchange rates every day
Multi-currency SJ
Auction builder process converts bids to
millicents, saves the exchange rate used
Multi-currency SJ
Execute auction in millicents
Multi-currency SJ
Record results in millicents & local currency
Multi-currency SJ
Add multi-currency data to click logs:
Before:
employerId
jobId
bid
cost
...
After:
employerId
jobId
currency
exchangeRate
bidInCurrency
bidMillicents
costMillicents
...
Multi-currency SJ
During click processing, convert auction cost
(in millicents) back to employer’s currency
using same exchange rate
costInMillicents
currency
exchangeRate
→ costInCurrency
“How much revenue did we make today?”
$1,000
$548 USD
€273 EUR
¥8,253 JPY
100,000,000 mc
Revenue Reporting
If the auction millicent cost is used, there could
be errors!
Millicent Cost: 53,826 millicents
Euro Cost: €0.39483, charged as €0.39
Actual Millicent Cost: 53,168 millicents
1.2% difference!
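The gap comes from rounding to whole euro cents when the auction cost is converted back to the employer's currency. The sketch below reproduces the slide's numbers using an exchange rate of 136,328 millicents per euro; that rate is not stated in the talk and is back-derived here from the figures, and the function names and rounding mode are likewise assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP

MC_PER_EURO = Decimal("136328")  # assumed rate, inferred from the slide's numbers


def millicents_to_minor_units(cost_mc, mc_per_major, minor_per_major=100):
    """Convert an auction cost in millicents to whole minor units (e.g. euro cents)."""
    minor = Decimal(cost_mc) * minor_per_major / mc_per_major
    return int(minor.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def minor_units_to_millicents(minor_units, mc_per_major, minor_per_major=100):
    """Convert a charged amount in minor units back to millicents."""
    millicents = Decimal(minor_units) * mc_per_major / minor_per_major
    return int(millicents.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


auction_cost_mc = 53826
charged_cents = millicents_to_minor_units(auction_cost_mc, MC_PER_EURO)
actual_cost_mc = minor_units_to_millicents(charged_cents, MC_PER_EURO)
difference = (auction_cost_mc - actual_cost_mc) / auction_cost_mc

print(charged_cents)        # 39
print(actual_cost_mc)       # 53168
print(f"{difference:.1%}")  # 1.2%
```

Revenue reports therefore need to be based on the amount actually charged in the employer's currency, not on the millicent value that came out of the auction.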
Active Non-US Employers
Great Britain
Japan
Canada
30% of Sponsored clicks non-USD
2004
2014
International Success
United Kingdom 1.) Indeed 2.) Reed 3.) Totaljobs
France 1.) Indeed 2.) Cadremploi 3.) Monster
Netherlands 1.) Indeed 2.) NVB 3.) Monsterboard
Canada 1.) Indeed 2.) Workopolis 3.) Monster
Italy 1.) Indeed 2.) Infojobs 3.) Jobrapido
Brazil 1.) Indeed 2.) Catho 3.) Infojobs
Japan 1.) Rikunabi 2.) Indeed 3.) Rikunabi Next
Australia 1.) Seek 2.) Indeed 3.) Careerone
India 1.) Naukri 2.) Timesjobs 3.) Indeed
Next @IndeedEng Talk
August 27th, 2014
http://engineering.indeed.com/talks
https://twitter.com/IndeedEng
