SlideShare a Scribd company logo
1 of 22
About us
• Founded in 2007
• B2B Startup
• Semantic Services started in
2011
• Big Data projects in 2012
• Training, Consultancy and
Support on Solr and Hadoop
About me
• Leo Oliveira
• 15 years of experience with
websites and search engines
• Specialized in Relevancy and Semantics
• Graduated in Business Management & IT
Innovation
The Importance of Relevancy
 The game changer in Search
 Google became what it is due to relevancy algorithms and PageRank
 Can be achieved through many different types of data
 Cross-reference, log analysis, social media, market research, new ideas etc
 It’s about the user
 And not about what we think of the user
 If you don’t understand your data, find what is relevant through
research
 Research user behavior and needs and you’ll find a way of finding relevant
information or you will find a way of discovering what is relevant automatically.
 If the user can’t find what they want using your search…
 … they’ll leave.
The Importance of having a resultset
 Sometimes you have little data and users get a lot of zero-result
queries.
 Then it’s time to find the right words for the job
 Synonym discovery, stemming
 Analysis: understand what is the user searching for
 Suggest other options
 Sometimes a zero-result query is just typo. Use suggestions to find the right
results.
 Create an interface that makes it easy for the user to try different keywords to get
results
 Analyze words that are searched together by the same users to create special
suggestions or second searches that brings some results
Relevancy and Semantics
 THE RELEVANCY RULES (the most forgotten ones)
 Phrases are more relevant the single words
 “Pure” words, that is, words the way they were typed, are more relevant than
stemmed words or even synonyms
 In general, newest docs are more relevant than old docs, but this rule could be
reversed.
 Closer venues and locations are more relevant than distant ones.
 Give the right weights to the right fields (seems simple, but it’s not)
 When you have a mix of relevant things…
 for instance, in an e-commerce, that you must consider the freshest docs, plus the
smaller prices and the best sellers, sometimes you need to create a math formula to
cope with all these items for relevancy. And it’s tricky.
 You can boost each parameter separately, but then you must test your formulas
very well not to prioritize one parameter more than the other.
Relevancy and Weighting
 USING DISMAX (an example of relevancy configuration)
 By using Dismax, you have two main options: qf and pf.
 Phrase Fields (pf) should have bigger weights than all of your Query Fields (qf)
 Field weights must be decided carefully. You can use a scale, such as Fibonacci
numbers, so as to not lose control of what is more relevant and what is not that relevant in
your “search formula”
 EXAMPLE OF pf for a movies website:
 MovieName^34 MovieActors^21 MovieDirector^13 MovieYear^8
 EXAMPLE OF qf:
 MovieName^21 MovieActors^13 MovieDirector^8 MovieYear^5
Synonyms Theory Vs Synonyms in Search
Synonym Theory Vs Synonym in Search
 In search, synonyms can get complicated
 You don’t need to use only “real” synonyms
 You must focus on user needs rather than “thesaurus”
 You can use it for other useful stuff, such as the most common typos to be
corrected or even some unrelated words that could do the job for you
 Sometimes you just need to get creative.
1-way transformations
 SynonymFilter
 E. g. Compter => computer
 Very useful for common typos or misconceptions
 Sometimes requires expansion.
 In our setup, we’ll set a fieldtype that will do the job.
2-way transformations
 SynonymFilter
 E. g. computer, pc, mac, notebook, server
 Expansion of meanings
 Sometimes you don’t use exact synonyms. Brand names and other “real world” terms can
be used.
 It’s hard to configure and maintain 1-way and 2-way in a single synonyms.txt file.
Index Time Vs Query Time
 Most synonyms should work with Query Time only
 But there are exceptions. E. g.: photoalbum, photo album
 When using a phrase synonym, this might require an index time synonym to work
correctly
 In our fieldtype, we’ll discuss how to achieve this.
 This index time synonym file is also useful for search when you need symbols in some
terms that could be droped by the tokenizer. E. g. C++ => cpp or .NET => dotnet etc
 Keep in mind that this index time file will have the exceptions that don’t work well in query-
time. To figure this out, sometimes it requires testing, sometimes it is easy to identify, such
as phrase synonyms.
Some examples
 Query time
 Enginer => Engineer (1-way)
 Engineer, engineering (2-way with expand=true)
 Computer, pc, mac
 Index time
 C# => csharp
 .NET => dotnet
 Human resources, HR
 Business intelligence, BI
 CRM, Costumer Relationship Management
Creating a fieldtype with semantic relevancy
<fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/>
<filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1"
generateWordParts="1" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.BrazilianStemFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1"
generateWordParts="1" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.BrazilianStemFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
Creating a fieldtype with semantic relevancy
<fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true"
words="stopwords_pt.txt"/>
2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt"
ignoreCase="true" expand="true"/>
3. <filter catenateAll="0" catenateNumbers="1" catenateWords="1"
class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1"
splitOnCaseChange="1"/>
4. <filter class="solr.LowerCaseFilterFactory"/>
5. <filter class="solr.BrazilianStemFilterFactory"/>
6. <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
Creating a fieldtype with semantic relevancy
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true"
words="stopwords_pt.txt"/>
2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt"
ignoreCase="true" expand="false"/>
3. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt"
ignoreCase="true" expand="true"/>
4. <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0"
catenateWords="0" generateNumberParts="1" generateWordParts="1"
splitOnCaseChange="1"/>
5. <filter class="solr.LowerCaseFilterFactory"/>
6. <filter class="solr.BrazilianStemFilterFactory"/>
7. <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
Creating a fieldtype with semantic relevancy
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true"
words="stopwords_pt.txt"/>
2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt"
ignoreCase="true" expand="false"/>
3. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt"
ignoreCase="true" expand="true"/>
4. <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0"
catenateWords="0" generateNumberParts="1" generateWordParts="1"
splitOnCaseChange="1"/>
5. <filter class="solr.LowerCaseFilterFactory"/>
6. <filter class="solr.BrazilianStemFilterFactory"/>
7. <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
Creating a fieldtype with semantic relevancy
<fieldType class="solr.TextField" name="text_pt_pure" positionIncrementGap="100”>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true"
words="stopwords_pt.txt"/>
<filter catenateAll="0" catenateNumbers="1" catenateWords="1"
class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
Creating a relevant formula for search
 Imagine you have a field named “Product_name”
 To keep search result semantically relevant, you’ll need a “Product_name” field and also a
“Product_name_pure”
 Keep in mind that phrases are also more relevant than individual words
 In that case, here’s how Solrconfig would be:
 QF (query fields, single words)
 Product_name_pure^5 Product_name^3
 PF (phrase fields)
 Product_name_pure^8 Product_nameˆ5
What will you get?
 Relevant results semantically
 Users will understand the results
 Synonyms or stemmed tokens won’t disturb or create noise
 Similar to what main search websites are doing for many different things, such as
document search, e-commerce, intranet search, document search, library search etc
 Be able to find relevant results even in a scenario with millions or billions of documents.
I N F I N I T E P O S S I B I L I T I ES
THANK YOU!
Get in touch!
Email: loliveira@semantix.com.br
Twitter: @SemantixBR

More Related Content

Viewers also liked

Suicides and suicide attempts during long term treatment with antidepressants...
Suicides and suicide attempts during long term treatment with antidepressants...Suicides and suicide attempts during long term treatment with antidepressants...
Suicides and suicide attempts during long term treatment with antidepressants...
Daryl Chow
 
Situaciones deseadas y no deseadas
Situaciones deseadas y no deseadasSituaciones deseadas y no deseadas
Situaciones deseadas y no deseadas
Emerson Salcedo
 
pl-visit Shop
pl-visit Shoppl-visit Shop
pl-visit Shop
plvisit
 
Wifi sip ip phone sc 6600 catalog march 22
Wifi sip ip phone sc 6600 catalog march 22Wifi sip ip phone sc 6600 catalog march 22
Wifi sip ip phone sc 6600 catalog march 22
solomonmin
 
Edu,jharvey,biel i marc sala experiments
Edu,jharvey,biel i marc sala experimentsEdu,jharvey,biel i marc sala experiments
Edu,jharvey,biel i marc sala experiments
6primariaparecollvic
 
Lean six sigma training and certification from TUV SUD ( Globally recognized)
Lean six sigma training and certification from TUV SUD ( Globally recognized)Lean six sigma training and certification from TUV SUD ( Globally recognized)
Lean six sigma training and certification from TUV SUD ( Globally recognized)
PriSkills Knowledge Solutions
 
La práctica del atletismo en personas con discapacidad (1) Cristian Zamora
La práctica del atletismo en personas con discapacidad (1) Cristian ZamoraLa práctica del atletismo en personas con discapacidad (1) Cristian Zamora
La práctica del atletismo en personas con discapacidad (1) Cristian Zamora
cristiansamisan
 
Trignometria 5º primera parte
Trignometria 5º   primera parteTrignometria 5º   primera parte
Trignometria 5º primera parte
cjperu
 
DHIFAF BAGHDAD company presenation
DHIFAF BAGHDAD company presenationDHIFAF BAGHDAD company presenation
DHIFAF BAGHDAD company presenation
DHIFAF BAGDAH
 

Viewers also liked (17)

Suicides and suicide attempts during long term treatment with antidepressants...
Suicides and suicide attempts during long term treatment with antidepressants...Suicides and suicide attempts during long term treatment with antidepressants...
Suicides and suicide attempts during long term treatment with antidepressants...
 
Seleccion de trabajos publicados en La Voz de Asturias
Seleccion de trabajos publicados en La Voz de AsturiasSeleccion de trabajos publicados en La Voz de Asturias
Seleccion de trabajos publicados en La Voz de Asturias
 
Situaciones deseadas y no deseadas
Situaciones deseadas y no deseadasSituaciones deseadas y no deseadas
Situaciones deseadas y no deseadas
 
pl-visit Shop
pl-visit Shoppl-visit Shop
pl-visit Shop
 
ReuniÓn De Padres
ReuniÓn De PadresReuniÓn De Padres
ReuniÓn De Padres
 
Wifi sip ip phone sc 6600 catalog march 22
Wifi sip ip phone sc 6600 catalog march 22Wifi sip ip phone sc 6600 catalog march 22
Wifi sip ip phone sc 6600 catalog march 22
 
Yoel Kenan - AFRICORI - World Intellectual Property Day Workshops Presentation
Yoel Kenan - AFRICORI - World Intellectual Property Day Workshops PresentationYoel Kenan - AFRICORI - World Intellectual Property Day Workshops Presentation
Yoel Kenan - AFRICORI - World Intellectual Property Day Workshops Presentation
 
Food systems transformation: what is the role of pulses in the sustainability...
Food systems transformation: what is the role of pulses in the sustainability...Food systems transformation: what is the role of pulses in the sustainability...
Food systems transformation: what is the role of pulses in the sustainability...
 
Blusens Networks IPTV 2013
Blusens Networks IPTV 2013Blusens Networks IPTV 2013
Blusens Networks IPTV 2013
 
Regulatory Information Management
Regulatory Information ManagementRegulatory Information Management
Regulatory Information Management
 
Edu,jharvey,biel i marc sala experiments
Edu,jharvey,biel i marc sala experimentsEdu,jharvey,biel i marc sala experiments
Edu,jharvey,biel i marc sala experiments
 
Presentación de Internet TV (IPTV)
Presentación de Internet TV (IPTV)Presentación de Internet TV (IPTV)
Presentación de Internet TV (IPTV)
 
Lean six sigma training and certification from TUV SUD ( Globally recognized)
Lean six sigma training and certification from TUV SUD ( Globally recognized)Lean six sigma training and certification from TUV SUD ( Globally recognized)
Lean six sigma training and certification from TUV SUD ( Globally recognized)
 
La práctica del atletismo en personas con discapacidad (1) Cristian Zamora
La práctica del atletismo en personas con discapacidad (1) Cristian ZamoraLa práctica del atletismo en personas con discapacidad (1) Cristian Zamora
La práctica del atletismo en personas con discapacidad (1) Cristian Zamora
 
GUILLERMO SIMÓN.
GUILLERMO SIMÓN.GUILLERMO SIMÓN.
GUILLERMO SIMÓN.
 
Trignometria 5º primera parte
Trignometria 5º   primera parteTrignometria 5º   primera parte
Trignometria 5º primera parte
 
DHIFAF BAGHDAD company presenation
DHIFAF BAGHDAD company presenationDHIFAF BAGHDAD company presenation
DHIFAF BAGHDAD company presenation
 

Similar to Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA

Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Gabriel Moreira
 
Full text search
Full text searchFull text search
Full text search
deleteman
 
Discovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender SystemsDiscovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender Systems
Gabriel Moreira
 

Similar to Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA (20)

How publishing works in the digital era
How publishing works in the digital eraHow publishing works in the digital era
How publishing works in the digital era
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
 
Oops Concepts
Oops ConceptsOops Concepts
Oops Concepts
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
Moyez Dreamforce 2017 presentation on Large Data Volumes in Salesforce
Moyez Dreamforce 2017 presentation on Large Data Volumes in SalesforceMoyez Dreamforce 2017 presentation on Large Data Volumes in Salesforce
Moyez Dreamforce 2017 presentation on Large Data Volumes in Salesforce
 
The need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementationsThe need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementations
 
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
 
Why Are Taxonomies Necessary?
Why Are Taxonomies Necessary?Why Are Taxonomies Necessary?
Why Are Taxonomies Necessary?
 
Full text search
Full text searchFull text search
Full text search
 
How To Write The Conclusion Of A. Online assignment writing service.
How To Write The Conclusion Of A. Online assignment writing service.How To Write The Conclusion Of A. Online assignment writing service.
How To Write The Conclusion Of A. Online assignment writing service.
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
E1f14172a91215931ed787d97dee1301fe7d
E1f14172a91215931ed787d97dee1301fe7dE1f14172a91215931ed787d97dee1301fe7d
E1f14172a91215931ed787d97dee1301fe7d
 
Discovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender SystemsDiscovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender Systems
 
Classification, Tagging & Search
Classification, Tagging & SearchClassification, Tagging & Search
Classification, Tagging & Search
 
You Don't Know SEO
You Don't Know SEOYou Don't Know SEO
You Don't Know SEO
 
Essential Tools Of An Xml Workflow2003comp
Essential Tools Of An Xml Workflow2003compEssential Tools Of An Xml Workflow2003comp
Essential Tools Of An Xml Workflow2003comp
 
11.pptx
11.pptx11.pptx
11.pptx
 
Automating With Excel An Object Oriented Approach
Automating  With  Excel    An  Object  Oriented  ApproachAutomating  With  Excel    An  Object  Oriented  Approach
Automating With Excel An Object Oriented Approach
 
How To Write A Compare And Contrast Essay Example
How To Write A Compare And Contrast Essay ExampleHow To Write A Compare And Contrast Essay Example
How To Write A Compare And Contrast Essay Example
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA

  • 1.
  • 2. About us • Founded in 2007 • B2B Startup • Semantic Services started in 2011 • Big Data projects in 2012 • Training, Consultancy and Support on Solr and Hadoop
  • 3. About me • Leo Oliveira • 15 years of experience with websites and search engines • Specialized in Relevancy and Semantics • Graduated in Business Management & IT Innovation
  • 4. The Importance of Relevancy  The game changer in Search  Google became what it is due to relevancy algorithms and PageRank  Can be achieved through many different types of data  Cross-reference, log analysis, social media, market research, new ideas etc  It’s about the user  And not about what we think of the user  If you don’t understand your data, find what is relevant through research  Research user behavior and needs and you’ll find a way of finding relevant information or you will find a way of discovering what is relevant automatically.  If the user can’t find what they want using your search…  … they’ll leave.
  • 5. The Importance of having a resultset  Sometimes you have little data and users get a lot of zero-result queries.  Then it’s time to find the right words for the job  Synonym discovery, stemming  Analysis: understand what is the user searching for  Suggest other options  Sometimes a zero-result query is just typo. Use suggestions to find the right results.  Create an interface that makes it easy for the user to try different keywords to get results  Analyze words that are searched together by the same users to create special suggestions or second searches that brings some results
  • 6. Relevancy and Semantics  THE RELEVANCY RULES (the most forgotten ones)  Phrases are more relevant the single words  “Pure” words, that is, words the way they were typed, are more relevant than stemmed words or even synonyms  In general, newest docs are more relevant than old docs, but this rule could be reversed.  Closer venues and locations are more relevant than distant ones.  Give the right weights to the right fields (seems simple, but it’s not)  When you have a mix of relevant things…  for instance, in an e-commerce, that you must consider the freshest docs, plus the smaller prices and the best sellers, sometimes you need to create a math formula to cope with all these items for relevancy. And it’s tricky.  You can boost each parameter separately, but then you must test your formulas very well not to prioritize one parameter more than the other.
  • 7. Relevancy and Weighting  USING DISMAX (an example of relevancy configuration)  By using Dismax, you have two main options: qf and pf.  Phrase Fields (pf) should have bigger weights than all of your Query Fields (qf)  Field weights must be decided carefully. You can use a scale, such as Fibonacci numbers, so as to not lose control of what is more relevant and what is not that relevant in your “search formula”  EXAMPLE OF pf for a movies website:  MovieName^34 MovieActors^21 MovieDirector^13 MovieYear^8  EXAMPLE OF qf:  MovieName^21 MovieActors^13 MovieDirector^8 MovieYear^5
  • 8. Synonyms Theory Vs Synonyms in Search
  • 9. Synonym Theory Vs Synonym in Search  In search, synonyms can get complicated  You don’t need to use only “real” synonyms  You must focus on user needs rather than “thesaurus”  You can use it for other useful stuff, such as the most common typos to be corrected or even some unrelated words that could do the job for you  Sometimes you just need to get creative.
  • 10. 1-way transformations  SynonymFilter  E. g. Compter => computer  Very useful for common typos or misconceptions  Sometimes requires expansion.  In our setup, we’ll set a fieldtype that will do the job.
  • 11. 2-way transformations  SynonymFilter  E. g. computer, pc, mac, notebook, server  Expansion of meanings  Sometimes you don’t use exact synonyms. Brand names and other “real world” terms can be used.  It’s hard to configure and maintain 1-way and 2-way in a single synonyms.txt file.
  • 12. Index Time Vs Query Time  Most synonyms should work with Query Time only  But there are exceptions. E. g.: photoalbum, photo album  When using a phrase synonym, this might require an index time synonym to work correctly  In our fieldtype, we’ll discuss how to achieve this.  This index time synonym file is also useful for search when you need symbols in some terms that could be droped by the tokenizer. E. g. C++ => cpp or .NET => dotnet etc  Keep in mind that this index time file will have the exceptions that don’t work well in query- time. To figure this out, sometimes it requires testing, sometimes it is easy to identify, such as phrase synonyms.
  • 13. Some examples  Query time  Enginer => Engineer (1-way)  Engineer, engineering (2-way with expand=true)  Computer, pc, mac  Index time  C# => csharp  .NET => dotnet  Human resources, HR  Business intelligence, BI  CRM, Costumer Relationship Management
  • 14. Creating a fieldtype with semantic relevancy <fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/> <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.BrazilianStemFilterFactory"/> <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/> <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.BrazilianStemFilterFactory"/> <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer> </fieldType>
  • 15. Creating a fieldtype with semantic relevancy <fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> 1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> 2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/> 3. <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> 4. <filter class="solr.LowerCaseFilterFactory"/> 5. <filter class="solr.BrazilianStemFilterFactory"/> 6. <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer>
  • 16. Creating a fieldtype with semantic relevancy <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> 1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> 2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/> 3. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/> 4. <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> 5. <filter class="solr.LowerCaseFilterFactory"/> 6. <filter class="solr.BrazilianStemFilterFactory"/> 7. <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer> </fieldType>
  • 17. Creating a fieldtype with semantic relevancy <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> 1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> 2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/> 3. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/> 4. <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> 5. <filter class="solr.LowerCaseFilterFactory"/> 6. <filter class="solr.BrazilianStemFilterFactory"/> 7. <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer> </fieldType>
  • 18. Creating a fieldtype with semantic relevancy <fieldType class="solr.TextField" name="text_pt_pure" positionIncrementGap="100”> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer>
  • 19. Creating a relevant formula for search  Imagine you have a field named “Product_name”  To keep search result semantically relevant, you’ll need a “Product_name” field and also a “Product_name_pure”  Keep in mind that phrases are also more relevant than individual words  In that case, here’s how Solrconfig would be:  QF (query fields, single words)  Product_name_pure^5 Product_name^3  PF (phrase fields)  Product_name_pure^8 Product_nameˆ5
  • 20. What will you get?  Relevant results semantically  Users will understand the results  Synonyms or stemmed tokens won’t disturb or create noise  Similar to what main search websites are doing for many different things, such as document search, e-commerce, intranet search, document search, library search etc  Be able to find relevant results even in a scenario with millions or billions of documents.
  • 21. I N F I N I T E P O S S I B I L I T I ES
  • 22. THANK YOU! Get in touch! Email: loliveira@semantix.com.br Twitter: @SemantixBR