E-commerce Search Engine with Apache Lucene/Solr
A sightseeing tour with a few practical examples
By Vincenzo D’Amore
v.damore@gmail.com
@VincenzoDAmore
eCommerce Search
- Why not an RDBMS
- Characteristics
- Scenarios
- Applications
- Text Retrieval Basics
- Lucene vs Solr
- Hands on
Why not an RDBMS
AKA eCommerce and Search? Start with why
Use the right tool for the job.
(NoSQL) Full Text Search vs RDBMS
NoSQL
- Documents
- Denormalization
- Cluster-friendly
- Optimistic locking
- Schema-less (almost)
- Text analysis/stemming
- Scored full-text search
- Faceting/categorization
RDBMS
- Tables/Records
- Normalization
- Scale vertically
- ACID transactions
- Non-text data manipulation
When to use a full text search engine
1. High volume of documents to be searched and/or faceted/categorized
2. High volume of interactive text-based queries
3. Demand for very flexible full text search querying
4. Demand for highly relevant search results
When to use an RDBMS
1. Demand for many different record types
2. Non-text data manipulation
3. Secure transaction processing
eCommerce Search characteristics
Scalable and fast in answering user requests
- Thousands of concurrent users
- Millions of queries per day (with peaks during Christmas and Black Friday)
- Average response time under a few milliseconds
Flexible marketing requirements
- Promoted products should appear first
- Best-selling products should appear first
- Fresh products should appear first (freshness?)
- Hot keywords and curated search
- Everything should appear first (OMG & WTH)
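One way to reason about these competing requirements is to fold them into a single score. The sketch below is only an illustration and not how Solr ranks results: the `Product` fields, the weights (0.5, 0.3, 0.2), and the 30-day freshness scale are all hypothetical, and `textScore` stands in for whatever relevance score the search engine returns.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class MarketingRanking {
    // Hypothetical product hit; textScore stands in for the engine's relevance score.
    static final class Product {
        final String name;
        final double textScore;
        final boolean promoted;
        final int unitsSold;
        final int daysSinceAdded;

        Product(String name, double textScore, boolean promoted, int unitsSold, int daysSinceAdded) {
            this.name = name;
            this.textScore = textScore;
            this.promoted = promoted;
            this.unitsSold = unitsSold;
            this.daysSinceAdded = daysSinceAdded;
        }
    }

    // Combines relevance with marketing signals; all weights are made-up values.
    static double finalScore(Product p) {
        double boost = 1.0;
        if (p.promoted) {
            boost += 0.5;                                 // promoted products
        }
        boost += 0.3 * Math.log10(1 + p.unitsSold);       // best sellers, with diminishing returns
        boost += 0.2 / (1.0 + p.daysSinceAdded / 30.0);   // freshness fades as the product ages
        return p.textScore * boost;
    }

    // Returns a new list sorted by final score, highest first.
    static List<Product> rank(List<Product> hits) {
        List<Product> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.<Product>comparingDouble(MarketingRanking::finalScore).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Product> hits = List.of(
            new Product("plain speaker", 1.0, false, 0, 365),
            new Product("promoted speaker", 1.0, true, 0, 365),
            new Product("best-selling speaker", 1.0, false, 999, 365));
        for (Product p : rank(hits)) {
            System.out.printf("%s %.3f%n", p.name, finalScore(p));
        }
    }
}
```

Because every boost multiplies the text score, a promoted product with a weak text match can still rank below a strong organic match. Choosing the weights is where "everything should appear first" turns into an explicit trade-off.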
Users can't buy if they can't find it
- Search and discovery are mission critical
- Product descriptions and metadata are poorly written and often don't fit user requests
- Users don't know how to spell
  - bluettoth, blu tooh, blutooh, bluetoot, bluetooh, blue toot, blue tooh, blue tooth => bluetooth
  - hawey, uawei, huwaei, huwei, wawei, hawuei, huawai, hawei, huwawei, huwavei, huwawei, huawey, hauwei, hawuei, hawei, hawawei, huawe => huawei
  - tapi rulan, tapisrulant, tapis rulant, tapi roulant, tapiroulant, tapisroulant => tapis roulant (treadmill)
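Misspellings like these are usually only a few character edits away from the intended term, which is the idea behind edit-distance-based typo tolerance. The following is a minimal sketch using plain Levenshtein distance over a tiny dictionary; the class name `SpellingSketch`, the dictionary, and the `maxEdits` threshold are assumptions for illustration, and this is not the algorithm Lucene or Solr actually ship.

```java
import java.util.List;

final class SpellingSketch {
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions, and substitutions that turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                       // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;                  // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest dictionary term within maxEdits, or the query itself if none is close enough.
    static String correct(String query, List<String> dictionary, int maxEdits) {
        String normalized = query.toLowerCase();
        String best = query;
        int bestDistance = maxEdits + 1;
        for (String term : dictionary) {
            int d = distance(normalized, term);
            if (d < bestDistance) {
                bestDistance = d;
                best = term;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("bluetooth", "huawei", "tapis roulant");
        for (String q : List.of("blutooh", "huwei", "tapisrulant")) {
            System.out.println(q + " => " + correct(q, dictionary, 3));
        }
    }
}
```

A linear scan over the dictionary is fine for a demo but would not scale to a real catalog; production engines index terms so that candidates within a small edit distance can be found without comparing against every term.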
[Figure: common search documents vs eCommerce documents]
eCommerce Search Scenarios
What an online store typically looks like
- Thousands, millions, and even billions of products
- Lots of metadata in text form
- Tricky product names and manufacturer names
  - star trek, star wars (with or without a space?)
  - ÖKOKombi
- Word-level ambiguities in product names
  - Gulliver
  - Portatile
  - Sacco
  - WD Desktop
  - Reflex
eCommerce Search Applications
A list of the most popular eCommerce search applications
- Search suggest drop-down list (aka autocomplete)
- Typo tolerance, aka spellchecker, aka "forse cercavi" ("did you mean")
- Instant search, aka search as you type, aka incremental search
- Filters and facets (refiners)
- More like this: related products and articles
- Spatial search
- Zero results page
Text Retrieval Basics
A few basic concepts and principles of text retrieval
What is information retrieval (IR)
- Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for metadata that describes data, and for databases of texts, images, or sounds.
What is text retrieval (TR)
- A collection of documents exists
- The user submits a query to express an information need
- The search engine returns documents relevant to the user's query
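The loop described above (documents are collected, a query arrives, relevant documents come back) can be sketched with a toy inverted index. This is not Lucene's implementation: the class name `TinyIndex`, the simple tokenizer, and the raw term-frequency scoring are simplifying assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TinyIndex {
    // term -> (document id -> number of occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    // Lowercases the text and splits it on anything that is not a letter or digit.
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^\\p{L}\\p{N}]+");
    }

    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            if (term.isEmpty()) {
                continue;                      // split() can yield a leading empty token
            }
            postings.computeIfAbsent(term, t -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    // Returns matching document ids, highest summed term frequency first (ties by id).
    List<Integer> search(String query) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (String term : tokenize(query)) {
            Map<Integer, Integer> docs = postings.get(term);
            if (docs == null) {
                continue;                      // term not in the index
            }
            docs.forEach((doc, tf) -> scores.merge(doc, tf, Integer::sum));
        }
        List<Integer> result = new ArrayList<>(scores.keySet());
        result.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return result;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Bluetooth speaker");
        index.add(2, "Bluetooth headphones with Bluetooth adapter");
        index.add(3, "USB cable");
        System.out.println(index.search("bluetooth"));
    }
}
```

Real engines replace the raw counts with a ranking function and add analysis steps such as stemming and synonyms, but the index-then-lookup structure is the same.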
What is relevance
- The quality of the results returned from a query, encompassing both which documents are found and their relative ranking (the order in which they are returned to the user)
- A measure of the effectiveness of communication
- In eCommerce, it also has to satisfy marketing requests
Access mode: push vs pull
Pull mode: querying vs browsing (for example, an Internet directory)
Measure of relevance: what is precision/recall?
[Figure: relevant elements and selected (retrieved) elements, partitioned into true positives, false positives, false negatives, and true negatives]
- Precision = true positives / (true positives + false positives): how many selected items are relevant?
- Recall = true positives / (true positives + false negatives): how many relevant items are selected?
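Both measures can be computed from two sets: the items the engine retrieved and the items a judge marked as relevant. The sketch below assumes document ids are available as sets; the class name `RelevanceMetrics` and the convention of returning 0.0 for an empty denominator are choices made for this example.

```java
import java.util.HashSet;
import java.util.Set;

final class RelevanceMetrics {
    // True positives: items that are both retrieved and relevant.
    static <T> int truePositives(Set<T> retrieved, Set<T> relevant) {
        Set<T> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return hits.size();
    }

    // Fraction of retrieved items that are relevant (0.0 when nothing was retrieved).
    static <T> double precision(Set<T> retrieved, Set<T> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(retrieved, relevant) / retrieved.size();
    }

    // Fraction of relevant items that were retrieved (0.0 when nothing is relevant).
    static <T> double recall(Set<T> retrieved, Set<T> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<Integer> retrieved = Set.of(1, 2, 3, 4);
        Set<Integer> relevant = Set.of(2, 4, 6, 8, 10);
        System.out.println("precision = " + precision(retrieved, relevant));
        System.out.println("recall = " + recall(retrieved, relevant));
    }
}
```

In this example two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).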
Measures
- Effectiveness or accuracy
  - System centered
  - User centered
- Efficiency
  - Retrieval time
  - Indexing time
  - Index size
- Usability
Understanding and improving search relevancy can often feel like a never-ending journey.
Lucene vs Solr
Apache Lucene and Solr
What is Apache Lucene
- A Java-based indexing and search library that also provides spellchecking, hit highlighting, and advanced analysis/tokenization capabilities.
- Many projects build on Lucene: Solr, Elasticsearch, Nutch, etc. (Hadoop, in turn, grew out of Nutch.)
What is Apache Solr
- Solr (pronounced "solar") is an open source enterprise search platform built on Lucene. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance.
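Lucene vs Solr - Create SynonymGraphFilterFactory
// Needs java.nio.file.Paths, java.util.HashMap, java.util.Map, and the Lucene
// classes SynonymGraphFilterFactory, ResourceLoader, and FilesystemResourceLoader
// (their packages differ between Lucene versions).
Map<String, String> args = new HashMap<>();
args.put("synonyms", "synonyms.txt");            // synonym rules file, resolved by the loader below
args.put("ignoreCase", Boolean.toString(true));
args.put("expand", Boolean.toString(true));
SynonymGraphFilterFactory syf = new SynonymGraphFilterFactory(args);
// Look for synonyms.txt in the current directory first, then on the classpath.
ResourceLoader rl = new FilesystemResourceLoader(Paths.get("."), getClass().getClassLoader());
syf.inform(rl);                                  // loads and parses the synonym rules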
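Lucene vs Solr - Apply SynonymGraphFilterFactory
// input: the text to analyze; localSyf: the SynonymGraphFilterFactory configured in the previous step
StringBuilder sb = new StringBuilder();
try (Tokenizer wt = new WhitespaceTokenizer()) {
    wt.setReader(new StringReader(input));
    try (TokenStream syn = localSyf.create(wt)) {
        CharTermAttribute term = syn.addAttribute(CharTermAttribute.class);
        syn.reset();                          // required before the first incrementToken()
        if (syn.incrementToken()) {
            sb.append(term.toString());
            while (syn.incrementToken()) {    // join the remaining tokens with single spaces
                sb.append(" ");
                sb.append(term.toString());
            }
        }
        syn.end();                            // completes the TokenStream contract
    }
}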
Lucene vs Solr - Solr SynonymGraphFilterFactory
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Should I use Lucene or Solr?
Are you Twitter?
- YES: Cool!
- NO: Go ahead with Solr
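[Figure: Simple Solr Architecture. Docs pass through a tokenizer to the indexer, which builds the index; a query passes through a tokenizer to the searcher, which reads the index and returns results.]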
Hands on Lucene & Solr
Practical examples of what Lucene and Solr are
Examples from the presentation:
https://github.com/freedev/lucene-example
Apache Lucene/Solr Books
- Apache Solr Reference Guide
- Lucene in Action
- Relevant Search
- Apache Solr Search Patterns
