SlideShare a Scribd company logo
About us
• Founded in 2007
• B2B Startup
• Semantic Services started in
2011
• Big Data projects in 2012
• Training, Consultancy and
Support on Solr and Hadoop
About me
• Leo Oliveira
• 15 years of experience with
websites and search engines
• Specialized in Relevancy and Semantics
• Graduated in Business Management & IT
Innovation
The Importance of Relevancy
 The game changer in Search
 Google became what it is due to relevancy algorithms and PageRank
 Can be achieved through many different types of data
 Cross-reference, log analysis, social media, market research, new ideas etc
 It’s about the user
 And not about what we think of the user
 If you don’t understand your data, find what is relevant through
research
 Research user behavior and needs and you’ll find a way of finding relevant
information or you will find a way of discovering what is relevant automatically.
 If the user can’t find what they want using your search…
 … they’ll leave.
The Importance of having a resultset
 Sometimes you have little data and users get a lot of zero-result
queries.
 Then it’s time to find the right words for the job
 Synonym discovery, stemming
 Analysis: understand what is the user searching for
 Suggest other options
 Sometimes a zero-result query is just typo. Use suggestions to find the right
results.
 Create an interface that makes it easy for the user to try different keywords to get
results
 Analyze words that are searched together by the same users to create special
suggestions or second searches that brings some results
Relevancy and Semantics
 THE RELEVANCY RULES (the most forgotten ones)
 Phrases are more relevant the single words
 “Pure” words, that is, words the way they were typed, are more relevant than
stemmed words or even synonyms
 In general, newest docs are more relevant than old docs, but this rule could be
reversed.
 Closer venues and locations are more relevant than distant ones.
 Give the right weights to the right fields (seems simple, but it’s not)
 When you have a mix of relevant things…
 for instance, in an e-commerce, that you must consider the freshest docs, plus the
smaller prices and the best sellers, sometimes you need to create a math formula to
cope with all these items for relevancy. And it’s tricky.
 You can boost each parameter separately, but then you must test your formulas
very well not to prioritize one parameter more than the other.
Relevancy and Weighting
 USING DISMAX (an example of relevancy configuration)
 By using Dismax, you have two main options: qf and pf.
 Phrase Fields (pf) should have bigger weights than all of your Query Fields (qf)
 Field weights must be decided carefully. You can use a scale, such as Fibonacci
numbers, so as to not lose control of what is more relevant and what is not that relevant in
your “search formula”
 EXAMPLE OF pf for a movies website:
 MovieName^34 MovieActors^21 MovieDirector^13 MovieYear^8
 EXAMPLE OF qf:
 MovieName^21 MovieActors^13 MovieDirector^8 MovieYear^5
Synonyms Theory Vs Synonyms in Search
Synonym Theory Vs Synonym in Search
 In search, synonyms can get complicated
 You don’t need to use only “real” synonyms
 You must focus on user needs rather than “thesaurus”
 You can use it for other useful stuff, such as the most common typos to be
corrected or even some unrelated words that could do the job for you
 Sometimes you just need to get creative.
1-way transformations
 SynonymFilter
 E. g. Compter => computer
 Very useful for common typos or misconceptions
 Sometimes requires expansion.
 In our setup, we’ll set a fieldtype that will do the job.
2-way transformations
 SynonymFilter
 E. g. computer, pc, mac, notebook, server
 Expansion of meanings
 Sometimes you don’t use exact synonyms. Brand names and other “real world” terms can
be used.
 It’s hard to configure and maintain 1-way and 2-way in a single synonyms.txt file.
Index Time Vs Query Time
 Most synonyms should work with Query Time only
 But there are exceptions. E. g.: photoalbum, photo album
 When using a phrase synonym, this might require an index time synonym to work
correctly
 In our fieldtype, we’ll discuss how to achieve this.
 This index time synonym file is also useful for search when you need symbols in some
terms that could be droped by the tokenizer. E. g. C++ => cpp or .NET => dotnet etc
 Keep in mind that this index time file will have the exceptions that don’t work well in query-
time. To figure this out, sometimes it requires testing, sometimes it is easy to identify, such
as phrase synonyms.
Some examples
 Query time
 Enginer => Engineer (1-way)
 Engineer, engineering (2-way with expand=true)
 Computer, pc, mac
 Index time
 C# => csharp
 .NET => dotnet
 Human resources, HR
 Business intelligence, BI
 CRM, Costumer Relationship Management
Creating a fieldtype with semantic relevancy
<fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/>
<filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1"
generateWordParts="1" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.BrazilianStemFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1"
generateWordParts="1" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.BrazilianStemFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
Creating a fieldtype with semantic relevancy
<fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true"
words="stopwords_pt.txt"/>
2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt"
ignoreCase="true" expand="true"/>
3. <filter catenateAll="0" catenateNumbers="1" catenateWords="1"
class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1"
splitOnCaseChange="1"/>
4. <filter class="solr.LowerCaseFilterFactory"/>
5. <filter class="solr.BrazilianStemFilterFactory"/>
6. <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
Creating a fieldtype with semantic relevancy
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true"
words="stopwords_pt.txt"/>
2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt"
ignoreCase="true" expand="false"/>
3. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt"
ignoreCase="true" expand="true"/>
4. <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0"
catenateWords="0" generateNumberParts="1" generateWordParts="1"
splitOnCaseChange="1"/>
5. <filter class="solr.LowerCaseFilterFactory"/>
6. <filter class="solr.BrazilianStemFilterFactory"/>
7. <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
Creating a fieldtype with semantic relevancy
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true"
words="stopwords_pt.txt"/>
2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt"
ignoreCase="true" expand="false"/>
3. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt"
ignoreCase="true" expand="true"/>
4. <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0"
catenateWords="0" generateNumberParts="1" generateWordParts="1"
splitOnCaseChange="1"/>
5. <filter class="solr.LowerCaseFilterFactory"/>
6. <filter class="solr.BrazilianStemFilterFactory"/>
7. <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
Creating a fieldtype with semantic relevancy
<fieldType class="solr.TextField" name="text_pt_pure" positionIncrementGap="100”>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true"
words="stopwords_pt.txt"/>
<filter catenateAll="0" catenateNumbers="1" catenateWords="1"
class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
Creating a relevant formula for search
 Imagine you have a field named “Product_name”
 To keep search result semantically relevant, you’ll need a “Product_name” field and also a
“Product_name_pure”
 Keep in mind that phrases are also more relevant than individual words
 In that case, here’s how Solrconfig would be:
 QF (query fields, single words)
 Product_name_pure^5 Product_name^3
 PF (phrase fields)
 Product_name_pure^8 Product_nameˆ5
What will you get?
 Relevant results semantically
 Users will understand the results
 Synonyms or stemmed tokens won’t disturb or create noise
 Similar to what main search websites are doing for many different things, such as
document search, e-commerce, intranet search, document search, library search etc
 Be able to find relevant results even in a scenario with millions or billions of documents.
I N F I N I T E P O S S I B I L I T I ES
THANK YOU!
Get in touch!
Email: loliveira@semantix.com.br
Twitter: @SemantixBR

More Related Content

Viewers also liked

Suicides and suicide attempts during long term treatment with antidepressants...
Suicides and suicide attempts during long term treatment with antidepressants...Suicides and suicide attempts during long term treatment with antidepressants...
Suicides and suicide attempts during long term treatment with antidepressants...
Daryl Chow
 
Seleccion de trabajos publicados en La Voz de Asturias
Seleccion de trabajos publicados en La Voz de AsturiasSeleccion de trabajos publicados en La Voz de Asturias
Seleccion de trabajos publicados en La Voz de Asturias
Javier Cuevas
 
Situaciones deseadas y no deseadas
Situaciones deseadas y no deseadasSituaciones deseadas y no deseadas
Situaciones deseadas y no deseadasEmerson Salcedo
 
pl-visit Shop
pl-visit Shoppl-visit Shop
pl-visit Shopplvisit
 
Wifi sip ip phone sc 6600 catalog march 22
Wifi sip ip phone sc 6600 catalog march 22Wifi sip ip phone sc 6600 catalog march 22
Wifi sip ip phone sc 6600 catalog march 22solomonmin
 
Yoel Kenan - AFRICORI - World Intellectual Property Day Workshops Presentation
Yoel Kenan - AFRICORI - World Intellectual Property Day Workshops PresentationYoel Kenan - AFRICORI - World Intellectual Property Day Workshops Presentation
Yoel Kenan - AFRICORI - World Intellectual Property Day Workshops Presentation
PHATCreative
 
Food systems transformation: what is the role of pulses in the sustainability...
Food systems transformation: what is the role of pulses in the sustainability...Food systems transformation: what is the role of pulses in the sustainability...
Food systems transformation: what is the role of pulses in the sustainability...
ExternalEvents
 
Regulatory Information Management
Regulatory Information ManagementRegulatory Information Management
Regulatory Information Management
JBScott44
 
Edu,jharvey,biel i marc sala experiments
Edu,jharvey,biel i marc sala experimentsEdu,jharvey,biel i marc sala experiments
Edu,jharvey,biel i marc sala experiments6primariaparecollvic
 
Presentación de Internet TV (IPTV)
Presentación de Internet TV (IPTV)Presentación de Internet TV (IPTV)
Presentación de Internet TV (IPTV)
trademarketingHE
 
Lean six sigma training and certification from TUV SUD ( Globally recognized)
Lean six sigma training and certification from TUV SUD ( Globally recognized)Lean six sigma training and certification from TUV SUD ( Globally recognized)
Lean six sigma training and certification from TUV SUD ( Globally recognized)PriSkills Knowledge Solutions
 
La práctica del atletismo en personas con discapacidad (1) Cristian Zamora
La práctica del atletismo en personas con discapacidad (1) Cristian ZamoraLa práctica del atletismo en personas con discapacidad (1) Cristian Zamora
La práctica del atletismo en personas con discapacidad (1) Cristian Zamoracristiansamisan
 
GUILLERMO SIMÓN.
GUILLERMO SIMÓN.GUILLERMO SIMÓN.
GUILLERMO SIMÓN.
GUILLERMO SIMÓN
 
Trignometria 5º primera parte
Trignometria 5º   primera parteTrignometria 5º   primera parte
Trignometria 5º primera partecjperu
 
DHIFAF BAGHDAD company presenation
DHIFAF BAGHDAD company presenationDHIFAF BAGHDAD company presenation
DHIFAF BAGHDAD company presenationDHIFAF BAGDAH
 

Viewers also liked (17)

Suicides and suicide attempts during long term treatment with antidepressants...
Suicides and suicide attempts during long term treatment with antidepressants...Suicides and suicide attempts during long term treatment with antidepressants...
Suicides and suicide attempts during long term treatment with antidepressants...
 
Seleccion de trabajos publicados en La Voz de Asturias
Seleccion de trabajos publicados en La Voz de AsturiasSeleccion de trabajos publicados en La Voz de Asturias
Seleccion de trabajos publicados en La Voz de Asturias
 
Situaciones deseadas y no deseadas
Situaciones deseadas y no deseadasSituaciones deseadas y no deseadas
Situaciones deseadas y no deseadas
 
pl-visit Shop
pl-visit Shoppl-visit Shop
pl-visit Shop
 
ReuniÓn De Padres
ReuniÓn De PadresReuniÓn De Padres
ReuniÓn De Padres
 
Wifi sip ip phone sc 6600 catalog march 22
Wifi sip ip phone sc 6600 catalog march 22Wifi sip ip phone sc 6600 catalog march 22
Wifi sip ip phone sc 6600 catalog march 22
 
Yoel Kenan - AFRICORI - World Intellectual Property Day Workshops Presentation
Yoel Kenan - AFRICORI - World Intellectual Property Day Workshops PresentationYoel Kenan - AFRICORI - World Intellectual Property Day Workshops Presentation
Yoel Kenan - AFRICORI - World Intellectual Property Day Workshops Presentation
 
Food systems transformation: what is the role of pulses in the sustainability...
Food systems transformation: what is the role of pulses in the sustainability...Food systems transformation: what is the role of pulses in the sustainability...
Food systems transformation: what is the role of pulses in the sustainability...
 
Blusens Networks IPTV 2013
Blusens Networks IPTV 2013Blusens Networks IPTV 2013
Blusens Networks IPTV 2013
 
Regulatory Information Management
Regulatory Information ManagementRegulatory Information Management
Regulatory Information Management
 
Edu,jharvey,biel i marc sala experiments
Edu,jharvey,biel i marc sala experimentsEdu,jharvey,biel i marc sala experiments
Edu,jharvey,biel i marc sala experiments
 
Presentación de Internet TV (IPTV)
Presentación de Internet TV (IPTV)Presentación de Internet TV (IPTV)
Presentación de Internet TV (IPTV)
 
Lean six sigma training and certification from TUV SUD ( Globally recognized)
Lean six sigma training and certification from TUV SUD ( Globally recognized)Lean six sigma training and certification from TUV SUD ( Globally recognized)
Lean six sigma training and certification from TUV SUD ( Globally recognized)
 
La práctica del atletismo en personas con discapacidad (1) Cristian Zamora
La práctica del atletismo en personas con discapacidad (1) Cristian ZamoraLa práctica del atletismo en personas con discapacidad (1) Cristian Zamora
La práctica del atletismo en personas con discapacidad (1) Cristian Zamora
 
GUILLERMO SIMÓN.
GUILLERMO SIMÓN.GUILLERMO SIMÓN.
GUILLERMO SIMÓN.
 
Trignometria 5º primera parte
Trignometria 5º   primera parteTrignometria 5º   primera parte
Trignometria 5º primera parte
 
DHIFAF BAGHDAD company presenation
DHIFAF BAGHDAD company presenationDHIFAF BAGHDAD company presenation
DHIFAF BAGHDAD company presenation
 

Similar to Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA

How publishing works in the digital era
How publishing works in the digital eraHow publishing works in the digital era
How publishing works in the digital era
Apex CoVantage
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
voginip
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
VOGIN-academie
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
Lucidworks
 
Moyez Dreamforce 2017 presentation on Large Data Volumes in Salesforce
Moyez Dreamforce 2017 presentation on Large Data Volumes in SalesforceMoyez Dreamforce 2017 presentation on Large Data Volumes in Salesforce
Moyez Dreamforce 2017 presentation on Large Data Volumes in Salesforce
Moyez Thanawalla
 
The need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementationsThe need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementations
Ben DeMott
 
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Gabriel Moreira
 
Why Are Taxonomies Necessary?
Why Are Taxonomies Necessary?Why Are Taxonomies Necessary?
Why Are Taxonomies Necessary?
Fred Leise
 
Full text search
Full text searchFull text search
Full text searchdeleteman
 
How To Write The Conclusion Of A. Online assignment writing service.
How To Write The Conclusion Of A. Online assignment writing service.How To Write The Conclusion Of A. Online assignment writing service.
How To Write The Conclusion Of A. Online assignment writing service.
Peggy Johnson
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
Derek Kane
 
E1f14172a91215931ed787d97dee1301fe7d
E1f14172a91215931ed787d97dee1301fe7dE1f14172a91215931ed787d97dee1301fe7d
E1f14172a91215931ed787d97dee1301fe7d
Klaus Lieblich
 
Discovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender SystemsDiscovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender Systems
Gabriel Moreira
 
Classification, Tagging & Search
Classification, Tagging & SearchClassification, Tagging & Search
Classification, Tagging & Search
James Melzer
 
You Don't Know SEO
You Don't Know SEOYou Don't Know SEO
You Don't Know SEO
Michael King
 
Essential Tools Of An Xml Workflow2003comp
Essential Tools Of An Xml Workflow2003compEssential Tools Of An Xml Workflow2003comp
Essential Tools Of An Xml Workflow2003comp
ljnd
 
Automating With Excel An Object Oriented Approach
Automating  With  Excel    An  Object  Oriented  ApproachAutomating  With  Excel    An  Object  Oriented  Approach
Automating With Excel An Object Oriented Approach
Razorleaf Corporation
 
How To Write A Compare And Contrast Essay Example
How To Write A Compare And Contrast Essay ExampleHow To Write A Compare And Contrast Essay Example
How To Write A Compare And Contrast Essay Example
Jennifer Hellmuth
 

Similar to Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA (20)

How publishing works in the digital era
How publishing works in the digital eraHow publishing works in the digital era
How publishing works in the digital era
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
 
Oops Concepts
Oops ConceptsOops Concepts
Oops Concepts
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
Moyez Dreamforce 2017 presentation on Large Data Volumes in Salesforce
Moyez Dreamforce 2017 presentation on Large Data Volumes in SalesforceMoyez Dreamforce 2017 presentation on Large Data Volumes in Salesforce
Moyez Dreamforce 2017 presentation on Large Data Volumes in Salesforce
 
The need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementationsThe need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementations
 
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
 
Why Are Taxonomies Necessary?
Why Are Taxonomies Necessary?Why Are Taxonomies Necessary?
Why Are Taxonomies Necessary?
 
Full text search
Full text searchFull text search
Full text search
 
How To Write The Conclusion Of A. Online assignment writing service.
How To Write The Conclusion Of A. Online assignment writing service.How To Write The Conclusion Of A. Online assignment writing service.
How To Write The Conclusion Of A. Online assignment writing service.
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
E1f14172a91215931ed787d97dee1301fe7d
E1f14172a91215931ed787d97dee1301fe7dE1f14172a91215931ed787d97dee1301fe7d
E1f14172a91215931ed787d97dee1301fe7d
 
Discovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender SystemsDiscovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender Systems
 
Classification, Tagging & Search
Classification, Tagging & SearchClassification, Tagging & Search
Classification, Tagging & Search
 
You Don't Know SEO
You Don't Know SEOYou Don't Know SEO
You Don't Know SEO
 
Essential Tools Of An Xml Workflow2003comp
Essential Tools Of An Xml Workflow2003compEssential Tools Of An Xml Workflow2003comp
Essential Tools Of An Xml Workflow2003comp
 
11.pptx
11.pptx11.pptx
11.pptx
 
Automating With Excel An Object Oriented Approach
Automating  With  Excel    An  Object  Oriented  ApproachAutomating  With  Excel    An  Object  Oriented  Approach
Automating With Excel An Object Oriented Approach
 
How To Write A Compare And Contrast Essay Example
How To Write A Compare And Contrast Essay ExampleHow To Write A Compare And Contrast Essay Example
How To Write A Compare And Contrast Essay Example
 

Recently uploaded

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 

Recently uploaded (20)

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 

Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA

  • 1.
  • 2. About us • Founded in 2007 • B2B Startup • Semantic Services started in 2011 • Big Data projects in 2012 • Training, Consultancy and Support on Solr and Hadoop
  • 3. About me • Leo Oliveira • 15 years of experience with websites and search engines • Specialized in Relevancy and Semantics • Graduated in Business Management & IT Innovation
  • 4. The Importance of Relevancy  The game changer in Search  Google became what it is due to relevancy algorithms and PageRank  Can be achieved through many different types of data  Cross-reference, log analysis, social media, market research, new ideas etc  It’s about the user  And not about what we think of the user  If you don’t understand your data, find what is relevant through research  Research user behavior and needs and you’ll find a way of finding relevant information or you will find a way of discovering what is relevant automatically.  If the user can’t find what they want using your search…  … they’ll leave.
  • 5. The Importance of having a resultset  Sometimes you have little data and users get a lot of zero-result queries.  Then it’s time to find the right words for the job  Synonym discovery, stemming  Analysis: understand what is the user searching for  Suggest other options  Sometimes a zero-result query is just typo. Use suggestions to find the right results.  Create an interface that makes it easy for the user to try different keywords to get results  Analyze words that are searched together by the same users to create special suggestions or second searches that brings some results
  • 6. Relevancy and Semantics  THE RELEVANCY RULES (the most forgotten ones)  Phrases are more relevant the single words  “Pure” words, that is, words the way they were typed, are more relevant than stemmed words or even synonyms  In general, newest docs are more relevant than old docs, but this rule could be reversed.  Closer venues and locations are more relevant than distant ones.  Give the right weights to the right fields (seems simple, but it’s not)  When you have a mix of relevant things…  for instance, in an e-commerce, that you must consider the freshest docs, plus the smaller prices and the best sellers, sometimes you need to create a math formula to cope with all these items for relevancy. And it’s tricky.  You can boost each parameter separately, but then you must test your formulas very well not to prioritize one parameter more than the other.
  • 7. Relevancy and Weighting  USING DISMAX (an example of relevancy configuration)  By using Dismax, you have two main options: qf and pf.  Phrase Fields (pf) should have bigger weights than all of your Query Fields (qf)  Field weights must be decided carefully. You can use a scale, such as Fibonacci numbers, so as to not lose control of what is more relevant and what is not that relevant in your “search formula”  EXAMPLE OF pf for a movies website:  MovieName^34 MovieActors^21 MovieDirector^13 MovieYear^8  EXAMPLE OF qf:  MovieName^21 MovieActors^13 MovieDirector^8 MovieYear^5
  • 8. Synonyms Theory Vs Synonyms in Search
  • 9. Synonym Theory Vs Synonym in Search  In search, synonyms can get complicated  You don’t need to use only “real” synonyms  You must focus on user needs rather than “thesaurus”  You can use it for other useful stuff, such as the most common typos to be corrected or even some unrelated words that could do the job for you  Sometimes you just need to get creative.
  • 10. 1-way transformations  SynonymFilter  E. g. Compter => computer  Very useful for common typos or misconceptions  Sometimes requires expansion.  In our setup, we’ll set a fieldtype that will do the job.
  • 11. 2-way transformations  SynonymFilter  E. g. computer, pc, mac, notebook, server  Expansion of meanings  Sometimes you don’t use exact synonyms. Brand names and other “real world” terms can be used.  It’s hard to configure and maintain 1-way and 2-way in a single synonyms.txt file.
  • 12. Index Time Vs Query Time  Most synonyms should work with Query Time only  But there are exceptions. E. g.: photoalbum, photo album  When using a phrase synonym, this might require an index time synonym to work correctly  In our fieldtype, we’ll discuss how to achieve this.  This index time synonym file is also useful for search when you need symbols in some terms that could be droped by the tokenizer. E. g. C++ => cpp or .NET => dotnet etc  Keep in mind that this index time file will have the exceptions that don’t work well in query- time. To figure this out, sometimes it requires testing, sometimes it is easy to identify, such as phrase synonyms.
  • 13. Some examples  Query time  Enginer => Engineer (1-way)  Engineer, engineering (2-way with expand=true)  Computer, pc, mac  Index time  C# => csharp  .NET => dotnet  Human resources, HR  Business intelligence, BI  CRM, Costumer Relationship Management
  • 14. Creating a fieldtype with semantic relevancy <fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/> <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.BrazilianStemFilterFactory"/> <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/> <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.BrazilianStemFilterFactory"/> <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer> </fieldType>
  • 15. Creating a fieldtype with semantic relevancy <fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> 1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> 2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/> 3. <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> 4. <filter class="solr.LowerCaseFilterFactory"/> 5. <filter class="solr.BrazilianStemFilterFactory"/> 6. <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer>
  • 16. Creating a fieldtype with semantic relevancy <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> 1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> 2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/> 3. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/> 4. <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> 5. <filter class="solr.LowerCaseFilterFactory"/> 6. <filter class="solr.BrazilianStemFilterFactory"/> 7. <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer> </fieldType>
  • 17. Creating a fieldtype with semantic relevancy <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> 1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> 2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/> 3. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/> 4. <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> 5. <filter class="solr.LowerCaseFilterFactory"/> 6. <filter class="solr.BrazilianStemFilterFactory"/> 7. <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer> </fieldType>
  • 18. Creating a fieldtype with semantic relevancy <fieldType class="solr.TextField" name="text_pt_pure" positionIncrementGap="100”> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer>
  • 19. Creating a relevant formula for search  Imagine you have a field named “Product_name”  To keep search result semantically relevant, you’ll need a “Product_name” field and also a “Product_name_pure”  Keep in mind that phrases are also more relevant than individual words  In that case, here’s how Solrconfig would be:  QF (query fields, single words)  Product_name_pure^5 Product_name^3  PF (phrase fields)  Product_name_pure^8 Product_nameˆ5
  • 20. What will you get?  Relevant results semantically  Users will understand the results  Synonyms or stemmed tokens won’t disturb or create noise  Similar to what main search websites are doing for many different things, such as document search, e-commerce, intranet search, document search, library search etc  Be able to find relevant results even in a scenario with millions or billions of documents.
  • 21. I N F I N I T E P O S S I B I L I T I ES
  • 22. THANK YOU! Get in touch! Email: loliveira@semantix.com.br Twitter: @SemantixBR