PO Department
PEOPLE OPERATION’S
MONTHLY UPDATE
09/2019
1
CPU and memory efficient
spellchecker implementation in TIKI
2
Results for “iphone”
3
Results for “ipohne” without spellchecker
4
Results for “ipohne” with spellchecker
5
General approach
words, result = tokenize(query), []
for w in words:
    candidates = generate_candidates(w)
    # Fall back to the original word if no candidate scores above zero.
    best_c, best_score = w, 0.0
    for c in candidates:
        score = spellchecker_score(w, c)
        if score > best_score:
            best_c, best_score = c, score
    result.append(best_c)
6
Generate candidates
Generate all possible similar words:
- Need to define a measure of similarity - we use Damerau-Levenshtein distance
- It allows insertions, deletions, substitutions and transpositions of symbols
- We limit maximum allowed distance depending on the length of the word
- Then just generate all edits of the 4 possible types (CPU-intensive)
- We will optimize this approach later
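A minimal sketch of the brute-force generation for distance 1 only. The function name `edits1` and the `alphabet` argument are assumptions for illustration; a real Vietnamese symbol set is much larger than the ASCII letters used in the usage note.

```python
def edits1(word, alphabet):
    """Return all strings exactly one edit (of the 4 types) away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = [left + right[1:] for left, right in splits if right]
    transpositions = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    substitutions = [
        left + symbol + right[1:]
        for left, right in splits
        if right
        for symbol in alphabet
    ]
    insertions = [left + symbol + right for left, right in splits for symbol in alphabet]
    result = set(deletions + transpositions + substitutions + insertions)
    # Substituting a symbol with itself reproduces the input; it is not a candidate.
    result.discard(word)
    return result
```

For example, `edits1("ipohne", "abcdefghijklmnopqrstuvwxyz")` contains "iphone" (one transposition). The number of generated strings grows with both the word length and the alphabet size, which is why this step is expensive.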
Examples of Damerau-Levenshtein distance:
- distance(nguyễn, nguyên) = 1 (one substitution)
- distance(nguyễn, nguyeenx) = 3 (one substitution, two insertions)
- distance(behaivour, behaviour) = 1 (one transposition)
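The distances above can be checked with a short sketch. It implements the optimal string alignment variant of Damerau-Levenshtein distance, which is an assumption: the talk does not say which variant is used. Normalizing to NFC first is also an assumption, made so that a Vietnamese letter with diacritics counts as one symbol.

```python
import unicodedata


def damerau_levenshtein(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i symbols of a
    for j in range(cols):
        d[0][j] = j  # insert all j symbols of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent symbols.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```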
7
Spellchecker score
“Noisy channel” model:
- Bayes' formula: P(c|w) = P(w|c) * P(c) / P(w)
- We need to find the candidate c that maximizes P(c|w)
- This simplifies to maximizing P(w|c) * P(c), because P(w) is constant for all candidates
Probabilities used:
- P(c|w) - probability that c was intended when w was observed
- P(w|c) - probability that w is typed as a misspelling of c (error model)
- P(c) - probability of observing c (language model)
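Written out, the simplification looks as follows, where \hat{c} is an assumed symbol for the selected candidate (it does not appear in the slides):

```latex
\hat{c} = \arg\max_{c} P(c \mid w)
        = \arg\max_{c} \frac{P(w \mid c)\,P(c)}{P(w)}
        = \arg\max_{c} P(w \mid c)\,P(c)
```

The last step holds because P(w) does not depend on c, so dividing by it does not change which candidate scores highest.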
8
Building the language model
N-gram model:
- Building a 2-gram dictionary
- Remove 2-grams below a certain threshold
Used data:
- All product contents on Tiki
- All Tiki search queries for a year
- Some randomly crawled texts from the Vietnamese Web
- Total: 5.5 GB gzipped
9
Building the language model (example)
Data (queries on Tiki):
máy rửa mặt
máy rửa mắt
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy xay sinh tố
máy sấy tóc
...
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy rửa mắt
máy xay sinh tố
máy sấy tóc
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
10
Building the language model (example)
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
Language model:
410 <
410 >
410 máy
410 < máy
205 máy rửa
100 máy sấy
105 máy xay
105 tóc >
100 sấy tóc
5 xay tóc
105 tóc
...
We simply count all single words and word pairs
in the counted queries and write them to the
language model.
This lets us calculate the probability of
observing a word without context, or with one
word of context before or after it.
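The counting step can be sketched as follows. The names `COUNTED_QUERIES` and `build_language_model` are assumptions for illustration, and the sketch skips the threshold-based pruning of rare 2-grams.

```python
from collections import Counter

# Counted queries copied from the example.
COUNTED_QUERIES = {
    "máy rửa mặt": 200,
    "máy rửa mắt": 5,
    "máy sấy tóc": 100,
    "máy xay tóc": 5,
    "máy xay sinh tố": 100,
}


def build_language_model(counted_queries):
    """Count single words and adjacent word pairs, weighted by query count."""
    unigrams, bigrams = Counter(), Counter()
    for query, count in counted_queries.items():
        # "<" and ">" mark the start and end of a query.
        tokens = ["<"] + query.split() + [">"]
        for token in tokens:
            unigrams[token] += count
        for pair in zip(tokens, tokens[1:]):
            bigrams[pair] += count
    return unigrams, bigrams


unigrams, bigrams = build_language_model(COUNTED_QUERIES)
```

With this input, `unigrams["máy"]` is 410 and `bigrams[("máy", "xay")]` is 105, matching the language model listed on the slide.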
11
Building the language model (example)
Language model:
410 <
410 >
410 máy
410 < máy
205 máy rửa
100 máy sấy
105 máy xay
105 tóc >
100 sấy tóc
5 xay tóc
105 tóc
...
Query: máy => “< máy >”
P(máy) = 0.5 * (P(< máy) + P(máy >))
= 0.5 * (410/410 + 0/410) = 0.5
Query: máy xay tóc
P(xay) = 0.5 * (P(máy xay) + P(xay tóc))
= 0.5 * (105/410 + 5/105) ~ 0.15
P(sấy) = 0.5 * (P(máy sấy) + P(sấy tóc))
= 0.5 * (100/410 + 100/105) ~ 0.60
The language model here suggests that “sấy”
is more likely than “xay” in this context.
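The averaging formula above can be sketched as a small function. The counts are copied from the example language model; the names `UNIGRAMS`, `BIGRAMS` and `context_probability` are assumptions for illustration, and the sketch assumes both context words are present in the model.

```python
UNIGRAMS = {"<": 410, ">": 410, "máy": 410, "tóc": 105}
BIGRAMS = {
    ("<", "máy"): 410,
    ("máy", "rửa"): 205,
    ("máy", "sấy"): 100,
    ("máy", "xay"): 105,
    ("tóc", ">"): 105,
    ("sấy", "tóc"): 100,
    ("xay", "tóc"): 5,
}


def context_probability(prev_word, word, next_word):
    """Average of the left-context and right-context bigram estimates."""
    # count(prev word) / count(prev): how often `word` follows the left context.
    left = BIGRAMS.get((prev_word, word), 0) / UNIGRAMS[prev_word]
    # count(word next) / count(next): how often `word` precedes the right context.
    right = BIGRAMS.get((word, next_word), 0) / UNIGRAMS[next_word]
    return 0.5 * (left + right)


p_xay = context_probability("máy", "xay", "tóc")
p_say = context_probability("máy", "sấy", "tóc")
```

Unseen word pairs contribute 0 here; a production model would presumably smooth these estimates, which the slides do not cover.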
12
Building the error model
Automatic extraction of P(w|c):
- Extract triplets (w1, w2, w3) from our text corpus
- Group triplets by (w1, *, w3) and sort by descending popularity
- Remove groupings below a certain threshold
- Remove samples where the w2 words are too far from each other (using
Damerau-Levenshtein distance)
- Remove samples with popularity comparable to the most popular sample in the
grouping (presumably legitimate alternatives rather than misspellings)
- Write the w2 words from all remaining samples into the error model as triplets of
(observed word, intended word, count)
Used data:
- Same as for the language model
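The extraction steps can be sketched as below. This is an illustration under stated assumptions, not the implementation from the talk: the function names, the three thresholds (`min_group_count`, `max_distance`, `max_ratio`) and their default values are all hypothetical, and the most popular w2 in each group is treated as the intended word.

```python
import unicodedata
from collections import Counter, defaultdict


def osa_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    prev2, prev = None, list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[-1]


def build_error_model(counted_queries, min_group_count=50, max_distance=2, max_ratio=0.5):
    """Map (observed word, intended word) to a count."""
    # Group the middle words of all triplets by their (w1, w3) context.
    groups = defaultdict(Counter)
    for query, count in counted_queries.items():
        tokens = ["<"] + unicodedata.normalize("NFC", query).split() + [">"]
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            groups[(w1, w3)][w2] += count

    model = Counter()
    for variants in groups.values():
        if sum(variants.values()) < min_group_count:
            continue  # grouping is too rare to trust
        intended, top_count = variants.most_common(1)[0]
        model[(intended, intended)] += top_count
        for observed, count in variants.items():
            if observed == intended:
                continue
            if osa_distance(observed, intended) > max_distance:
                continue  # too different to be a misspelling
            if count > max_ratio * top_count:
                continue  # comparably popular, so not treated as a misspelling
            model[(observed, intended)] += count
    return model
```

Run on the five counted queries from the example, this yields the entries shown on the following slides, such as a count of 5 for observed "mắt" with intended "mặt".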
13
Building the error model (example)
Data (queries on Tiki):
máy rửa mặt
máy rửa mắt
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy xay sinh tố
máy sấy tóc
...
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy rửa mắt
máy xay sinh tố
máy sấy tóc
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
14
Building the error model (example)
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
Triplets:
205 < máy rửa
200 rửa mặt >
5 rửa mắt >
100 máy sấy tóc
5 máy xay tóc
200 máy rửa mặt
5 máy rửa mắt
105 < máy xay
100 sinh tố >
...
We count all possible triplets from our counted
queries data.
15
Building the error model (example)
Triplets (grouped):
rửa * >
200 rửa mặt >
5 rửa mắt >
máy * tóc
100 máy sấy tóc
5 máy xay tóc
máy * sinh
100 máy xay sinh
sinh * >
100 sinh tố >
...
Error model:
200 mặt mặt
5 mắt mặt
100 sấy sấy
5 xay sấy
100 xay xay
100 tố tố
...
Format: count observed_word intended_word
16
Building the error model (example)
Query: kem rửa mắt
P(mắt|mắt) = 0/5 = 0.0 - we divide the number of
times “mắt” was intended when “mắt” was
observed in the error model by the total number of
times “mắt” was observed in the error model.
P(mắt|mặt) = 5/5 = 1.0 - again, we divide the
number of times “mặt” was intended when “mắt”
was observed in the error model by the total
number of times “mắt” was observed in the error
model.
This means that, according to the error model built
on our data, “mắt” is extremely likely to
be a misspelling of “mặt”.
Error model:
200 mặt mặt
5 mắt mặt
100 sấy sấy
5 xay sấy
100 xay xay
100 tố tố
...
Format: count observed_word intended_word
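The two probabilities above can be reproduced with a short sketch. The entries are copied from the example error model; the names `ERROR_MODEL` and `error_probability` are assumptions for illustration. As on the slide, the count is normalized by the total number of times the observed word appears in the error model.

```python
# (observed_word, intended_word) -> count, copied from the example.
ERROR_MODEL = {
    ("mặt", "mặt"): 200,
    ("mắt", "mặt"): 5,
    ("sấy", "sấy"): 100,
    ("xay", "sấy"): 5,
    ("xay", "xay"): 100,
    ("tố", "tố"): 100,
}


def error_probability(observed, intended, model=ERROR_MODEL):
    """Share of the observations of `observed` where `intended` was meant."""
    total = sum(count for (obs, _), count in model.items() if obs == observed)
    if total == 0:
        return 0.0  # the word never appears as an observation
    return model.get((observed, intended), 0) / total
```

For the observed word "xay", the same function gives 100/105 for the intended word "xay" and 5/105 for "sấy".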
17
Quality optimizations
Idea:
- The language model matters more when more context is available
- Instead of P(w|c)*P(c) use P(w|c)*pow(P(c),lambda)
- Lambda depends on the length of the available context
Results:
- Using a bigger lambda for longer context => better test results (idea works!)
- For bigger N-grams, machine learning is needed to optimize the lambdas
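A minimal sketch of the weighted score. The talk does not give lambda values, so `context_lambda` is an invented schedule, and the probabilities in `CANDIDATES` are toy numbers rounded from the earlier examples (observed word "xay" in the query "máy xay tóc").

```python
def weighted_score(p_error, p_language, lam):
    """P(w|c) * pow(P(c), lambda)."""
    return p_error * p_language ** lam


def context_lambda(context_words):
    """Hypothetical schedule: lambda grows with the number of context words."""
    return 1.0 + context_words


# candidate -> (P(w|c), P(c)); toy values for illustration only.
CANDIDATES = {
    "xay": (0.952, 0.152),
    "sấy": (0.048, 0.598),
}


def best_candidate(candidates, lam):
    """Pick the candidate with the highest weighted score."""
    return max(candidates, key=lambda c: weighted_score(*candidates[c], lam))
```

With these toy numbers, lambda = 1 keeps "xay", while lambda = 3 (two context words under the assumed schedule) switches the choice to "sấy", because a larger lambda gives the language model more weight.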
18
Performance optimizations
Important fact:
It can be proven that if the Damerau-Levenshtein distance(w, c) = N, then w and c
can be reduced to the same string by deleting no more than N single characters
from each of them. Examples below:
distance(iphone, iphobee) = 2 (one insertion, one substitution)
iphone -> iphoe VS iphobee -> iphoee -> iphoe (match!)
distance(iphone, pihoone) = 2 (one transposition, one insertion)
iphone -> ihone VS pihoone -> ihoone -> ihone (match!)
Let’s use it to optimize candidate generation!
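The two examples can be checked with a short sketch that enumerates every string reachable by deleting up to N characters. The function name `deletes` is an assumption for illustration.

```python
from itertools import combinations


def deletes(word, max_deletes):
    """All strings obtained from `word` by deleting 0..max_deletes characters."""
    variants = {word}
    for k in range(1, min(max_deletes, len(word)) + 1):
        # Choose which positions to keep; the order of characters is preserved.
        for kept in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in kept))
    return variants


shared_1 = deletes("iphone", 2) & deletes("iphobee", 2)
shared_2 = deletes("iphone", 2) & deletes("pihoone", 2)
```

Here `shared_1` contains "iphoe" and `shared_2` contains "ihone", the meeting points shown in the examples.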
19
Performance optimizations
Problem 1 - generating candidates is CPU-intensive:
- Precompute a “deletes” dictionary
- Use only delete operations on both sides
- Need to double-check the distance (two words matched through deletes can be up to 2N apart, but we need at most N)
- Fast, but requires RAM
Problem 2 - the “deletes” dictionary requires RAM:
- Use different data compression techniques
- From what we’ve tried, Judy dynamic arrays work best
- We decreased RAM requirements from 10.5 GB to 2.3 GB
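A minimal sketch of the precomputed “deletes” dictionary with the distance double-check. It uses plain Python dicts and sets instead of Judy arrays, so it illustrates the lookup logic only, not the memory optimization. All function names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def deletes(word, max_deletes):
    """All strings obtained from `word` by deleting 0..max_deletes characters."""
    variants = {word}
    for k in range(1, min(max_deletes, len(word)) + 1):
        for kept in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in kept))
    return variants


def osa_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    prev2, prev = None, list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[-1]


def build_delete_index(dictionary, max_distance):
    """Precompute: delete variant -> dictionary words that produce it."""
    index = defaultdict(set)
    for word in dictionary:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index


def lookup(word, index, max_distance):
    """Candidates sharing a delete variant, filtered by the true distance."""
    matches = set()
    for variant in deletes(word, max_distance):
        matches |= index.get(variant, set())
    # A shared variant only guarantees distance <= 2 * max_distance.
    return {c for c in matches if osa_distance(word, c) <= max_distance}
```

For example, with N = 1 the words "abc" and "bcd" share the variant "bc" but are at distance 2, so the final filter drops the match.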
20
Testing results
Testing set:
- 5,000 random queries, 10,000 misspelled queries
- Suggestions collected through Google API and then manually checked
- Only one marker per query
Results:
- Slightly (10-12%) worse than Google (acceptable for such RAM requirements)
- The A/B test shows a 3-9% increase in purchases
21
Future plans
Implementation:
- Use 3-gram data (still trying to keep it RAM-optimal)
Testing:
- Use multi-marker test set
- Properly handle cases when spellchecker returns multiple variants
Thank you!
22

Grokking TechTalk #35: Efficient spellchecking
