Fuzzy Text Search in Ruby
How to find and replace when you
don't really know what to look for
Paweł Strzałkowski
Task: Replace company name with "The Company" in a job offer description
Take
- Job offer description
- Company name
Simple?
> description.gsub(company_name, 'The Company')
Not so much: misspellings and custom usages of the company name are never matched by an exact gsub
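A minimal sketch of why the exact replacement falls short. The sample description and the helper name `anonymize_exact` are made up for illustration; only the `gsub` call comes from the slide.

```ruby
# Exact, case-sensitive replacement, as on the slide.
def anonymize_exact(description, company_name)
  description.gsub(company_name, "The Company")
end

# Hypothetical job offer text with one misspelling and one custom usage.
offer = "Join Visuality! At Visualty we build apps. VISUALITY is hiring."

puts anonymize_exact(offer, "Visuality")
# Only the first occurrence is replaced; "Visualty" and "VISUALITY" survive.
```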
Fuzzy Text Comparison: Definition
Fuzzy - difficult to perceive clearly or understand and explain precisely; indistinct or vague
Fuzzy text comparison doesn't answer whether phrases match; it measures how similar they are
Example: a Jaro–Winkler similarity of:
- 1 means the phrases are equal
- 0 means the phrases are completely different
- anything in between means some level of difference
Fuzzy Text Comparison: Examples
> require 'fuzzy-string-match'
> # (... create a wrapper as fuzzy_similarity method ...)
> fuzzy_similarity "Lorem Ipsum Dorem", "a"
=> 0.0
> fuzzy_similarity "Lorem Ipsum Dorem", "Visuality Test"
=> 0.4393162393162393
> fuzzy_similarity "Visuality", "Visuality"
=> 1.0
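The slides leave the `fuzzy_similarity` wrapper out. As far as the gem's README shows, it would probably delegate to something like `FuzzyStringMatch::JaroWinkler.create(:pure).getDistance(a, b)`. The sketch below is not that wrapper: it is a plain-Ruby textbook Jaro–Winkler with no dependencies, written only to show what the score is made of. The gem's implementation differs in some details, so its numbers (the ones on the slides) can differ slightly from what this sketch returns.

```ruby
# Jaro similarity: share of matching characters, penalised for transpositions.
def jaro_similarity(a, b)
  return 1.0 if a == b
  return 0.0 if a.empty? || b.empty?

  # Characters match only if they are equal and not too far apart.
  window = [[a.length, b.length].max / 2 - 1, 0].max
  a_matched = Array.new(a.length, false)
  b_matched = Array.new(b.length, false)
  matches = 0

  a.each_char.with_index do |char, i|
    lo = [i - window, 0].max
    hi = [i + window, b.length - 1].min
    (lo..hi).each do |j|
      next if b_matched[j] || b[j] != char
      a_matched[i] = b_matched[j] = true
      matches += 1
      break
    end
  end
  return 0.0 if matches.zero?

  # Matched characters that appear in a different order count as transpositions.
  a_seq = a.chars.select.with_index { |_, i| a_matched[i] }
  b_seq = b.chars.select.with_index { |_, j| b_matched[j] }
  transpositions = a_seq.zip(b_seq).count { |x, y| x != y } / 2

  m = matches.to_f
  (m / a.length + m / b.length + (m - transpositions) / m) / 3.0
end

# Winkler's adjustment: boost the score for a shared prefix of up to 4 characters.
def fuzzy_similarity(a, b, prefix_scale: 0.1)
  jaro = jaro_similarity(a, b)
  prefix = a.chars.zip(b.chars).take(4).take_while { |x, y| x == y }.length
  jaro + prefix * prefix_scale * (1.0 - jaro)
end

puts fuzzy_similarity("Visuality", "Visuality")   # identical phrases
puts fuzzy_similarity("Lorem Ipsum Dorem", "a")   # no characters in common
```

The prefix boost is why phrases that start the same way as the searched term score noticeably higher than phrases that merely contain it.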
How to find and replace fuzzy matched phrases in a longer text?
Library?
- Not found...
Best practice?
- Not found...
Custom solution?
- Let's do it!
Fuzzy Text Comparison: Assumption #1
- Assumed that some level of similarity is enough for phrases to be considered "equal"
- Didn't know whether it should be 0.7 or 0.99, so went with something in between and tweaked it
- It ended up being 0.9
- Whenever it is said that "phrases are equal in a fuzzy-matching context", it
means that their Jaro–Winkler similarity is more than 0.9
Fuzzy Text Comparison: Assumption #2
- Searched phrase may be longer or shorter than the given term
- 1 word longer
- 2 words longer
- 3 words longer
> fuzzy_similarity "Lorem Ipsum Dorem", "Lorem Ipsum Dorem Carem"
=> 0.9791666666666666
> fuzzy_similarity "Lorem Ipsum Dorem", "Lorem Ipsum Dorem Carem Visuarem"
=> 0.9281462585034014
> fuzzy_similarity "Lorem Ipsum Dorem",
"Lorem Ipsum Dorem Carem Visuarem Odolanem"
=> 0.8865740740740742
Fuzzy Text Comparison: Assumption Summary
- Phrases are equal if their similarity level is more than 0.9
- Matching phrase might be up to 2 words longer than the searched one (with 3 extra words the similarity in the example dropped to about 0.89)
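The two assumptions can be written down as a pair of constants and helpers. The names below are illustrative and not taken from the talk; the values 0.9 and 2 are the ones stated above.

```ruby
SIMILARITY_THRESHOLD = 0.9  # phrases scoring above this count as "equal"
MAX_EXTRA_WORDS      = 2    # a match may be up to 2 words longer than the term

# True when a similarity score is high enough to treat the phrases as equal.
def fuzzy_equal?(score)
  score > SIMILARITY_THRESHOLD
end

# Longest phrase (in words) worth comparing against the searched term.
def max_phrase_words(term)
  term.split.length + MAX_EXTRA_WORDS
end

puts fuzzy_equal?(0.9281462585034014)   # score for 2 extra words, from the slide
puts fuzzy_equal?(0.8865740740740742)   # score for 3 extra words, from the slide
puts max_phrase_words("Lorem Ipsum Dorem")
```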
Algorithm (at last)
Assignment: Take a story, find the name "Mary Joan" and replace it
with "The Girl" to make it anonymous. We'll start (and finish) with the story title.
Title: Mary S Jane of house Visuality
Problem: Somebody has
- taken a wrong Mary
- used an additional middle initial
We now have "Mary S Jane" instead of "Mary Joan" in the story.
Algorithm: Step 1 - Break Into Pieces
- Searched phrase (Mary Joan) is 2 words long
- Maximum matching phrase length is 4 words (2 + up to 2 extra)
"Mary S Jane of house Visuality".break_into_pieces_of(4) =>
(N == 1) => ["Mary", "S", "Jane", "of", "house", "Visuality"]
(N == 2) => ["Mary S", "S Jane", "Jane of", "of house", "house Visuality"]
(N == 3) => ["Mary S Jane", "S Jane of", "Jane of house", "of house Visuality"]
(N == 4) => ["Mary S Jane of", "S Jane of house", "Jane of house Visuality"]
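The slide calls a `break_into_pieces_of` method on String without showing it. One way to get the same lists, sketched here as a standalone method with an assumed name rather than a String extension, is `Enumerable#each_cons`, which yields every run of N consecutive words.

```ruby
# Returns a hash mapping N to all phrases made of N consecutive words,
# for N from 1 up to max_words.
def break_into_pieces(text, max_words)
  words = text.split
  (1..max_words).to_h do |n|
    [n, words.each_cons(n).map { |piece| piece.join(" ") }]
  end
end

pieces = break_into_pieces("Mary S Jane of house Visuality", 4)
pieces.each { |n, phrases| puts "(N == #{n}) => #{phrases.inspect}" }
```

For a title of W words this produces W + (W - 1) + (W - 2) + (W - 3) phrases, so the number of comparisons grows roughly linearly with the text length for a fixed maximum N.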
Algorithm: Step 1 - Break Into Pieces (Exercise)
- Take some of the created phrases
- Check how similar they are to the searched term
> fuzzy_similarity "Mary Joan", "of house Visuality"
=> 0.4583333333333333
> fuzzy_similarity "Mary Joan", "Mary S"
=> 0.86
> fuzzy_similarity "Mary Joan", "Mary S Jane of"
=> 0.9156336088154271 << Equal
> fuzzy_similarity "Mary Joan", "Mary S Jane"
=> 0.9305555555555555 << Equal
Algorithm: Step 2 - Choose best results
Mary S Jane of   0.92
Mary S Jane      0.93
S Jane of        0.51
Mary S           0.86
S Jane           0.38
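The slide does not spell out how the best result is chosen. One plausible reading, sketched below as an assumption, is a greedy pass: drop everything at or below the 0.9 threshold, then take candidates from the highest score down, skipping any that overlap an already chosen phrase. The `Candidate` struct and `choose_best` are illustrative names; the phrases and scores are the ones on this slide, with word positions counted from the start of the title.

```ruby
# A candidate phrase: text, index of its first word, length in words, score.
Candidate = Struct.new(:phrase, :start, :words, :score) do
  # Two candidates overlap when they share at least one word position.
  def overlaps?(other)
    start < other.start + other.words && other.start < start + words
  end
end

# Keep the highest-scoring candidates above the threshold that do not overlap.
def choose_best(candidates, threshold: 0.9)
  accepted = candidates.select { |c| c.score > threshold }
  chosen = []
  accepted.sort_by { |c| -c.score }.each do |candidate|
    chosen << candidate unless chosen.any? { |kept| kept.overlaps?(candidate) }
  end
  chosen.sort_by(&:start).map(&:phrase)
end

SLIDE_CANDIDATES = [
  Candidate.new("Mary S Jane of", 0, 4, 0.92),
  Candidate.new("Mary S Jane",    0, 3, 0.93),
  Candidate.new("S Jane of",      1, 3, 0.51),
  Candidate.new("Mary S",         0, 2, 0.86),
  Candidate.new("S Jane",         1, 2, 0.38)
].freeze

p choose_best(SLIDE_CANDIDATES)
```

Both "Mary S Jane of" and "Mary S Jane" pass the threshold, but they overlap, so only the higher-scoring one is kept.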
Algorithm: Ta - Daaaaa
> FuzzyText.new(
"Mary S Jane of house Visuality", "Mary Joan"
).matches
=> ["Mary S Jane"]
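With the matches in hand, the original task (replace the name) reduces to plain `gsub` calls. The helper below is a sketch with an assumed name; it replaces longer matches first so that a short match cannot cut a longer one in half.

```ruby
# Replaces every matched phrase in the text, longest phrase first.
def replace_matches(text, matches, replacement)
  matches.sort_by { |match| -match.length }
         .reduce(text) { |result, match| result.gsub(match, replacement) }
end

title = "Mary S Jane of house Visuality"
puts replace_matches(title, ["Mary S Jane"], "The Girl")
# prints "The Girl of house Visuality"
```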
Algorithm: Tweaks and Gotchas
- Skip anything that makes the fuzzy comparison blurry
  - Punctuation
  - Whitespace
  - Saxon genitives ('s)
  - Common words (when applicable, e.g. "is", "are", "was", "were")
- Compare only once
  - Use a hash table containing the results of every comparison
  - Cache the results
- TDD
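A rough sketch of the first two tweaks, with assumed names throughout. `normalize` strips the noise listed above before phrases are compared, and `CachedSimilarity` wraps any scoring block in a hash so that each pair of phrases is scored only once. The stop-word list is just the four words from the slide.

```ruby
STOP_WORDS = %w[is are was were].freeze

# Removes Saxon genitives, punctuation, extra whitespace and common words.
def normalize(phrase)
  phrase
    .gsub(/'s\b/, "")            # Saxon genitive: "Joan's" -> "Joan"
    .gsub(/[[:punct:]]/, " ")    # punctuation becomes a separator
    .split                       # also collapses runs of whitespace
    .reject { |word| STOP_WORDS.include?(word.downcase) }
    .join(" ")
end

# Memoizes a similarity block in a hash keyed by the pair of phrases.
class CachedSimilarity
  attr_reader :calls

  def initialize(&scorer)
    @scorer = scorer
    @calls = 0
    @cache = Hash.new do |cache, pair|
      @calls += 1                 # counts real comparisons only
      cache[pair] = @scorer.call(*pair)
    end
  end

  def call(term, phrase)
    @cache[[term, phrase]]
  end
end

puts normalize("Mary Joan's  house, was big!")

# The block stands in for a real similarity function.
similarity = CachedSimilarity.new { |a, b| a == b ? 1.0 : 0.0 }
similarity.call("Mary Joan", "Mary S Jane")
similarity.call("Mary Joan", "Mary S Jane")
puts similarity.calls
```

Caching pays off because the same phrases recur many times in a longer text, and each Jaro–Winkler comparison is relatively expensive.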
English is too easy, let's take a slightly more challenging language (Polish, where names change form depending on grammatical case)
> hero = "Kapitan Ameryka"
> description = "Wbrew powszechnej opinii o Kapitanie z Ameryki nie był on
pierwszym superbohaterem wykorzystującym tematykę flagi Stanów
Zjednoczonych. Poprzedzał go superbohater o pseudonimie Shield czternaście
miesięcy wcześniej niż Kapitan Ameryka. Shield nosił na sobie pancerny
kostium z kuloodporną tarczą na piersi (zwraca się uwagę na uderzające
podobieństwo do wczesnej tarczy Kapitana Ameryki, która w późniejszych
wydaniach była już okrągła). Na dodatek obie postacie łączyła podobna
geneza: Shield tak jak bohater Marvela zyskał nadludzkie moce po zażyciu
serum superżołnierza, którego twórca został zabity przez hitlerowców.
Postać Kapitana Ameryki pojawiała się w wielu różnych adaptacjach."
> FuzzyText.new(description, hero).matches
=> ["Kapitanie z Ameryki", "Kapitan Ameryka", "Kapitana Ameryki",
"Kapitana", "Kapitanie", "Kapitan"]
Thank you for your attention
Questions?
