Spell Checking in Deezer Search Engine

Marion Baranes (Search-Scientist)
WiMLDS Paris
April 19th, 2018

WiMLDS 2018/04 : Spell Checking in DEEZER Search Engine
/01
Deezer Search Engine

What is Deezer’s Search Engine?
Main features:
- Search across multiple types (artist, album, track, playlist, podcast, ...)
- Localized and personalized ranking

What do we do in Deezer Search Team?
Extra features:
- Top result
- Trends prediction
- Related queries
- Advanced search
- Search by tags
- Spell checking

Some numbers about Deezer Search
2.5M daily users
9M requests/day
Large catalog: 53M tracks, 7M albums, 2M artists, 9M playlists, ...
≈ 100 milliseconds to find a result
25% of stream sessions come from Search

/02
Our Spell Checking System

Why do we need spelling correction?
misunderstanding
disengagement
unsubscription
….

Error prediction
Originally, we used a fuzzy matching approach to handle misspelled queries.
In a search engine, doing so:
● introduces noise in search results
● increases the number of attempted requests
● increases search engine response time
→ We chose to predict users' future misspelled queries instead.

Prediction system
/A Spell checking system
/B Learn users' misspelled queries
/C Generate new misspelled queries

A. Search Engine with spell checking
The user's query goes through the Spell Checking module before reaching the Search Engine:
1. Is this a frequent query?
   ○ yes → search this query as a frequent query
   ○ no → go to step 2
2. Is this a known misspelled query?
   ○ yes → search the frequent query associated with this mistake
   ○ no → search the user's query as typed
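The decision flow of the Spell Checking module can be sketched as a pair of lookups. This is only an illustration: the function name `resolve_query`, the in-memory set and dictionary, and the lowercase normalization are assumptions made for the example, not details given in the talk.

```python
from typing import Dict, Set


def resolve_query(query: str,
                  frequent_queries: Set[str],
                  known_misspellings: Dict[str, str]) -> str:
    """Return the query that should be sent to the search engine.

    frequent_queries and known_misspellings are hypothetical in-memory
    stand-ins for whatever storage the real system uses.
    """
    normalized = query.strip().lower()  # assumption: simple normalization

    # 1. Frequent query: search it directly.
    if normalized in frequent_queries:
        return normalized

    # 2. Known misspelled query: search the associated frequent query.
    if normalized in known_misspellings:
        return known_misspellings[normalized]

    # 3. Otherwise: search the user's query as typed.
    return normalized


frequent = {"daft punk"}
misspellings = {"daft pink": "daft punk"}
print(resolve_query("Daft Pink", frequent, misspellings))
```

Both checks are constant-time lookups, which fits the stated goal of not increasing the search engine's response time.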

B. Learn users' misspelled queries
From data
→ Link misspelled queries with frequent queries using behavioral similarity.
● Group the queries that express the same user need, using temporal and textual features.
● Flag reformulated queries in each group.
q0: daft
q1: daft p (insertion at end)
q2: daft pink (insertion at end)
q3: daft punk (substitution)
E.g. here q3, flagged as reformulated, is a frequent query; the misspelled query is q2.
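The labelling of the "daft punk" session can be sketched as follows. The sketch assumes the queries have already been grouped into one user need (the temporal grouping is not shown), and it treats a single-character substitution that lands on a frequent query as the reformulation signal. Both choices, and all function names, are assumptions for illustration only.

```python
from typing import List, Set, Tuple


def edit_type(prev: str, curr: str) -> str:
    """Label the change between two consecutive queries of a session."""
    if curr != prev and curr.startswith(prev):
        return "insertion at end"
    same_length = len(prev) == len(curr)
    if same_length and sum(a != b for a, b in zip(prev, curr)) == 1:
        return "substitution"
    return "other"


def label_session(queries: List[str]) -> List[Tuple[str, str, str]]:
    """Return (previous query, current query, edit type) for each step."""
    return [(p, c, edit_type(p, c)) for p, c in zip(queries, queries[1:])]


def misspelling_candidates(queries: List[str],
                           frequent: Set[str]) -> List[Tuple[str, str]]:
    """Return (misspelled, frequent) pairs found in one session."""
    return [(p, c) for p, c, kind in label_session(queries)
            if kind == "substitution" and c in frequent]


session = ["daft", "daft p", "daft pink", "daft punk"]
print(label_session(session))
print(misspelling_candidates(session, {"daft punk"}))
```

Steps labelled "insertion at end" are treated as the user still typing, so only the final substitution yields a candidate pair.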

Validation of pairs
a) Validation by graphical similarity
b) Validation by phonetic similarity
Examples of candidate pairs:
daft punk - daft pink
lacrim - lace
pierpoljak - pierre paul jacque
polo & pan - pollo
reseaux - resa
pharrell williams - farel williams
havan - havana
...

Validation of pairs - graphical similarity
● Damerau-Levenshtein distance
counts the minimum number of operations (insertion, deletion, transposition of two adjacent characters, substitution) needed to convert one string into another.
● Jaro-Winkler score
is based on the number of matching characters and the number of transpositions between the two strings. It favours strings that share the same prefix by boosting the score according to the length of the common prefix.
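Both graphical measures can be written in a few lines of plain Python. The sketch below uses the "optimal string alignment" variant of Damerau-Levenshtein and the usual Jaro-Winkler defaults (prefix weight 0.1, prefix capped at 4 characters). Which variants and thresholds the Deezer system uses is not stated in the talk, so these are assumptions.

```python
def osa_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == ch:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the common prefix (up to 4 characters)."""
    score = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


print(osa_distance("daft punk", "daft pink"))
print(round(jaro_winkler("daft punk", "daft pink"), 3))
```

A pair such as "daft punk - daft pink" is one substitution apart and shares a long prefix, so both measures rate it as graphically close.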

Validation of pairs - phonetic similarity
The pronunciation of a word depends on the speaker's mother tongue.
E.g. for the name Schubert:
● English: /ʃubət/ (≈ chubet)
● French: /ʃubεʁ/ (≈ chuber)
A phonetic version is generated for each pair of frequent query and misspelled query.
E.g. billy gin - billie jean ≈ /bilidjin/
[Figure: European language families: Romanic, Baltic, Hellenic, Germanic, Slavic, Uralic. © http://www.listlanguage.com/european-languages-tree.html]
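To make the idea of comparing phonetic versions concrete, here is a deliberately tiny sketch. The rewrite rules in `TOY_RULES` are hypothetical, hand-picked so that the slide's example works. They are not a real grapheme-to-phoneme model and not the rules used at Deezer, where the phonetic version also depends on the speaker's language.

```python
# Hypothetical, ordered rewrite rules (grapheme sequence -> rough phonetic form).
# They only cover the "billie jean" example.
TOY_RULES = [
    ("ll", "l"),
    ("ie", "i"),
    ("ea", "i"),
    ("y", "i"),
    ("j", "dj"),
    ("gi", "dji"),
]


def toy_phonetic_key(text: str) -> str:
    """Apply the toy rules in order to get a crude phonetic key."""
    key = text.lower().replace(" ", "")
    for source, target in TOY_RULES:
        key = key.replace(source, target)
    return key


def phonetically_equal(frequent: str, misspelled: str) -> bool:
    """Validate a pair when both queries share the same phonetic key."""
    return toy_phonetic_key(frequent) == toy_phonetic_key(misspelled)


print(toy_phonetic_key("billy gin"))    # bilidjin
print(toy_phonetic_key("billie jean"))  # bilidjin
```

A production system would likely need one rule set or model per language, since the same spelling is pronounced differently by speakers of different mother tongues.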

C. Generate new misspelled queries
How to predict a spelling error?
● Formal analogy
● Analogy for spell checking
● Extraction of spell checking rules

Formal analogy
In a formal analogy, the relation between the four objects has to be graphemic, i.e. based on their written form.
complicated : complication :: created : creation
x : y :: z : t

Stroppa and Yvon (2005, 2006) define formal analogy with two notions:
(1) An object can be split into sub-parts called factors.
(2) Two pairs of objects share a relation of analogy if all factors can be exchanged:
○ inside each pair of objects,
○ between the two pairs of objects.
x = x1 x2 = complicat + ed
y = y1 y2 = complicat + ion
z = z1 z2 = creat + ed
t = t1 t2 = creat + ion
with x1 = y1, z1 = t1, x2 = z2, y2 = t2
To resolve the analogy [x : y :: z : ?], we can predict the expected form t, which is composed of factors of y and z.
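The "complicated : complication :: created : ?" example can be solved with a short sketch. It only handles the special case where each object splits into two factors, a shared stem followed by an ending. The general definition of Stroppa and Yvon allows arbitrary factorizations, which this sketch does not attempt. The function name is an assumption.

```python
import os
from typing import Optional


def solve_suffix_analogy(x: str, y: str, z: str) -> Optional[str]:
    """Solve [x : y :: z : ?] when x = x1 + x2, y = x1 + y2, z = z1 + x2.

    Returns t = z1 + y2, or None if z does not end with the factor x2.
    """
    # x1 = y1: longest common prefix of x and y.
    stem = os.path.commonprefix([x, y])
    x2, y2 = x[len(stem):], y[len(stem):]

    # x2 = z2: z must end with the same factor as x.
    if not z.endswith(x2):
        return None

    # z1 = t1 and y2 = t2: build t from a factor of z and a factor of y.
    z1 = z[:len(z) - len(x2)]
    return z1 + y2


print(solve_suffix_analogy("complicated", "complication", "created"))  # creation
```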

Analogy for spell checking


Extraction of spell checking rules
1. Create a training corpus with pairs of frequent and misspelled queries.
2. Detect the common factors and extract the remaining factors:
S y n Cole
S i n Cole
3. Extract the relevant information and create weighted spell checking rules:
syn Cole : sin Cole → mistake: y → i, previous context: [s], next context: [n]
lykke Li : likke Li → mistake: y → i, previous context: [l], next context: [k]
Resulting rule: [sl] y → i [nk]
E.g. Marilyn Manson → Marilin Manson
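The three steps can be sketched on the two training pairs from the slide. The sketch is restricted to single-character substitutions, uses the number of supporting pairs as the rule weight, and merges contexts into character sets as in the rule [sl] y → i [nk]. These simplifications and all names are assumptions, since the talk does not describe the actual weighting or rule format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # (correct character, mistaken character)


def extract_rules(pairs: List[Tuple[str, str]]) -> Dict[Rule, dict]:
    """Learn substitution rules with their contexts from (frequent, misspelled) pairs."""
    rules = defaultdict(lambda: {"prev": set(), "next": set(), "weight": 0})
    for frequent, misspelled in pairs:
        f, m = frequent.lower(), misspelled.lower()
        if len(f) != len(m):
            continue  # this sketch only handles substitutions
        diffs = [i for i in range(len(f)) if f[i] != m[i]]
        if len(diffs) != 1:
            continue
        i = diffs[0]
        rule = rules[(f[i], m[i])]
        rule["prev"].add(f[i - 1] if i > 0 else "^")          # "^" = start
        rule["next"].add(f[i + 1] if i + 1 < len(f) else "$")  # "$" = end
        rule["weight"] += 1
    return dict(rules)


def generate_misspellings(query: str, rules: Dict[Rule, dict]) -> List[str]:
    """Apply every rule wherever its previous and next contexts both match."""
    q = query.lower()
    generated = []
    for (correct, mistake), rule in rules.items():
        for i, ch in enumerate(q):
            prev = q[i - 1] if i > 0 else "^"
            nxt = q[i + 1] if i + 1 < len(q) else "$"
            if ch == correct and prev in rule["prev"] and nxt in rule["next"]:
                generated.append(q[:i] + mistake + q[i + 1:])
    return generated


training = [("Syn Cole", "Sin Cole"), ("Lykke Li", "Likke Li")]
rules = extract_rules(training)
print(rules[("y", "i")])
print(generate_misspellings("Marilyn Manson", rules))  # ['marilin manson']
```

In "Marilyn", the "y" is preceded by "l" and followed by "n", both of which belong to the learned contexts, so the rule predicts the misspelling before any user has typed it.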

/03
Evaluation and conclusion

Results in Search
We either suggest or force a correction, depending on our confidence and on the frequency of the request.

Evaluation
The quality of our system can only be evaluated through user feedback:
→ out of ≈ 500 000 queries extracted from desktop search, ≈ 10 000 are affected by our spell checking system.

                        Force correction   Suggest correction   Total
Accepted by the user    84%                10%                  94%
Rejected by the user    3%                 3%                   6%
Total                   87%                13%                  100%
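The table gives shares of all corrected queries. A short calculation derives the acceptance rate within each correction mode. Only the four percentages come from the slide. The variable and function names are illustrative.

```python
# Shares of corrected queries, in percent, as reported in the table.
shares = {
    "force": {"accepted": 84, "rejected": 3},
    "suggest": {"accepted": 10, "rejected": 3},
}


def acceptance_rate(mode: str) -> float:
    """Percentage of corrections of a given mode that users accepted."""
    accepted = shares[mode]["accepted"]
    total = accepted + shares[mode]["rejected"]
    return 100 * accepted / total


for mode in shares:
    print(mode, round(acceptance_rate(mode), 1))
```

Forced corrections are accepted about 96.6% of the time (84 out of 87) and suggested corrections about 76.9% of the time (10 out of 13).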

Conclusion
Around 1 query in 50 is misspelled and correctly fixed
(per day and per distinct user on desktop search).
Next steps for spell checking in Deezer Search Engine:
- Improve the current system
- Personalize the current system
- Localize the current system
