An introduction to the world of search, with a particular focus on e-commerce, by way of Apache Lucene and Solr. It explains how and why to use a search engine, how full text search differs from an RDBMS, and how general-purpose search differs from search applied to e-commerce. It also covers the main differences between Lucene and Solr.
5. (NoSQL) Full Text Search vs RDBMS
NoSQL
- Documents
- Denormalization
RDBMS
- Tables/Records
- Normalization
6. (NoSQL) Full Text Search vs RDBMS
NoSQL
- Cluster-friendly
- Optimistic locking
- Schema-less (almost)
RDBMS
- Scale vertically
- ACID transactions
7. (NoSQL) Full Text Search vs RDBMS
NoSQL
- Text analysis/stemming
- Scored full text search
- Faceting/categorization
RDBMS
- Non-text data manipulation
8. When to use a full text search engine
1. High volume of documents to be searched and/or faceted/categorized
2. High volume of interactive text-based queries
3. Demand for very flexible full text search querying
4. Demand for highly relevant search results
9. When to use a RDBMS
1. Demand for many different record types
2. Non-text data manipulation
3. Secure transaction processing
11. eCommerce Search characteristics
Scalable and fast in responding to user requests
- Thousands of concurrent users
- Millions of queries per day (with peaks during Xmas and Black Friday)
- Average response time of just a few ms
13. eCommerce Search characteristics
Flexible marketing requirements
- Promoted products should appear first
- Best-selling products should appear first
- Fresh products should appear first (freshness?)
- Hot keywords and curated search
- Everything should appear first (OMG & WTH)
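One common way to reconcile these competing requests is to keep the text relevance score and multiply it by a business boost. The sketch below is only an illustration: the `Product` fields, the weights and the formula are all assumptions made for this example, not something prescribed by Lucene or Solr.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MarketingRanking {

    // Hypothetical search hit; the field names are illustrative only.
    static class Product {
        final String name;
        final double textScore;      // relevance score from the search engine
        final boolean promoted;      // marketing wants this one on top
        final int unitsSold;         // best-seller signal
        final int daysSinceRelease;  // freshness signal

        Product(String name, double textScore, boolean promoted,
                int unitsSold, int daysSinceRelease) {
            this.name = name;
            this.textScore = textScore;
            this.promoted = promoted;
            this.unitsSold = unitsSold;
            this.daysSinceRelease = daysSinceRelease;
        }
    }

    // Assumed weights: text relevance is multiplied by (1 + boosts),
    // so an irrelevant product cannot be boosted to the top by marketing alone.
    static double finalScore(Product p) {
        double boost = 1.0;
        if (p.promoted) {
            boost += 1.0;                                   // promotion
        }
        boost += 0.5 * Math.log10(1 + p.unitsSold);         // best sellers, damped
        boost += 0.5 / (1.0 + p.daysSinceRelease / 30.0);   // freshness decay
        return p.textScore * boost;
    }

    // Returns a new list sorted by descending final score.
    static List<Product> rank(List<Product> hits) {
        List<Product> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Double.compare(finalScore(b), finalScore(a)));
        return sorted;
    }

    public static void main(String[] args) {
        List<Product> hits = Arrays.asList(
            new Product("old headset", 1.0, false, 0, 365),
            new Product("promoted headset", 1.0, true, 0, 365),
            new Product("best-selling headset", 1.0, false, 10000, 365));
        for (Product p : rank(hits)) {
            System.out.println(p.name + " " + finalScore(p));
        }
    }
}
```

Because every boost competes for the same first position, the weights have to be agreed with marketing and tuned against real queries.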
15. eCommerce Search characteristics
Users can't buy if they can't find it
- Search and discovery is mission critical
- Product descriptions and metadata are poorly written and often don't match user requests
16. eCommerce Search characteristics
Users can't buy if they can't find it
- Users don't know how to spell
bluettoth, blu tooh, blutooh, bluetoot,
bluetooh, blue toot, blue tooh, blue tooth
=> bluetooth
17. eCommerce Search characteristics
Users can't buy if they can't find it
- Users don't know how to spell
hawey, uawei, huwaei, huwei, wawei,
hawuei, huawai, hawei, huwawei,
huwavei, huwawei, huawey, hauwei,
hawuei, hawei, hawawei, huawe
=> huawei
18. eCommerce Search characteristics
Users can't buy if they can't find it
- Users don't know how to spell
tapi rulan, tapisrulant, tapis rulant,
tapi roulant, tapiroulant, tapisroulant
=> tapis roulant
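The misspellings on the last three slides are all within a few character edits of the intended term, which is why edit distance is a common basis for "did you mean" suggestions. The following is a minimal sketch using plain Levenshtein distance; the class name, the tiny dictionary and the `maxDistance` threshold are assumptions made for this example, and a real engine would use an indexed structure rather than a linear scan.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SpellSuggest {

    // Classic Levenshtein distance: minimum number of single-character
    // insertions, deletions and substitutions that turn a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Lowercase and drop spaces so that "blue tooth" and "bluetooth" compare equal.
    static String key(String s) {
        return s.toLowerCase(Locale.ROOT).replace(" ", "");
    }

    // Returns the closest dictionary term within maxDistance edits, or null if none.
    static String suggest(String query, List<String> dictionary, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String term : dictionary) {
            int d = editDistance(key(query), key(term));
            if (d < bestDistance) {
                bestDistance = d;
                best = term;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("bluetooth", "huawei", "tapis roulant");
        for (String q : Arrays.asList("bluettoth", "blu tooh", "huwaei", "tapi rulan")) {
            System.out.println(q + " => " + suggest(q, dictionary, 3));
        }
    }
}
```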
21. eCommerce Search Scenarios
What an online store typically looks like
- Thousands, millions and even billions of products
- Lots of meta-data in text form
22. eCommerce Search Scenarios
What an online store typically looks like
- Tricky product names & manufacturer names
- star trek, star wars (w/ or w/o space?)
- ÖKOKombi
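Names like these are usually handled by normalizing both the indexed text and the query to the same form. The sketch below uses only the JDK: it folds case and accents and builds a space-insensitive key. The class and method names are made up for this example, and stripping the umlaut is only one possible policy (German text is often folded to "oe" instead).

```java
import java.text.Normalizer;
import java.util.Locale;

public class NameNormalizer {

    // Lowercases and removes diacritics: decompose to NFD, then drop combining marks.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    // Key that also ignores whitespace, so "star wars" and "starwars" collide.
    static String matchKey(String s) {
        return fold(s).replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        // \u00D6 is the letter O with diaeresis, written as an escape to avoid encoding issues.
        System.out.println(fold("\u00D6KOKombi"));
        System.out.println(matchKey("Star Wars").equals(matchKey("starwars")));
    }
}
```

Collapsing whitespace is deliberately aggressive here; it makes "star wars" match "starwars" but can also glue unrelated words together, so it is typically applied to a dedicated field rather than to all text.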
23. eCommerce Search Scenarios
What an online store typically looks like
- Word-level ambiguities in product names
- Gulliver
- Portatile
- Sacco
- WD Desktop
- Reflex
34. Text Retrieval Basics
What is Information retrieval (IR)
- Information retrieval is the science of searching for
information in a document, searching for documents
themselves, and also searching for metadata that
describe data, and for databases of texts, images or
sounds.
35. Text Retrieval Basics
What is text retrieval (TR)
- A collection of documents exists
- The user submits a query to express an information need
- The search engine returns the documents relevant to the user's query
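The three steps above can be sketched with a toy inverted index, the data structure at the heart of engines such as Lucene. This is a deliberately simplified illustration: the class name and tokenization rule are assumptions for this example, queries are treated as AND of all terms, and there is no scoring or ranking at all.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TinyIndex {

    // term -> sorted ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // Splits on anything that is not a letter or a digit, after lowercasing.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 1: a collection of documents exists. Returns the new document id.
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Steps 2 and 3: take a query, return ids of documents containing every query term.
    List<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : tokenize(query)) {
            Set<Integer> ids = postings.getOrDefault(token, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("Bluetooth headset, black");
        index.add("Wired headset");
        index.add("Bluetooth speaker");
        System.out.println(index.search("bluetooth headset"));
    }
}
```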
36. Text Retrieval Basics
What is Relevance
- The quality of the results returned from a query, encompassing both which documents are found and their relative ranking (the order in which they are returned to the user)
- A measure of the effectiveness of communication
- It also has to satisfy the marketing requests
39. Text Retrieval Basics
Measure of relevance: what is Precision/Recall?
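[Figure: precision and recall. The relevant elements and the selected (retrieved) elements overlap, splitting the collection into true positives, false positives, false negatives and true negatives.]
Precision = true positives / selected elements
(How many selected items are relevant?)
Recall = true positives / relevant elements
(How many relevant items are selected?)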
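Precision and recall are straightforward to compute once the set of retrieved documents and the set of relevant documents are known. The following is a minimal sketch of the standard definitions; the class name and the document ids are made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PrecisionRecall {

    // True positives: documents that are both retrieved and relevant.
    static int truePositives(Set<String> retrieved, Set<String> relevant) {
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return hits.size();
    }

    // Fraction of the retrieved documents that are relevant (0 if nothing was retrieved).
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(retrieved, relevant) / retrieved.size();
    }

    // Fraction of the relevant documents that were retrieved (0 if nothing is relevant).
    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> retrieved = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d2", "d5"));
        // 2 of the 4 retrieved documents are relevant; 2 of the 3 relevant ones were retrieved.
        System.out.println("precision = " + precision(retrieved, relevant));
        System.out.println("recall    = " + recall(retrieved, relevant));
    }
}
```

The two measures pull in opposite directions: returning every document gives perfect recall with poor precision, while returning a single sure hit gives perfect precision with poor recall.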
45. What is Apache Lucene
- Java-based indexing and search technology, as well as spellchecking,
hit highlighting and advanced analysis/tokenization capabilities.
- Many projects build on Lucene or grew out of it: Solr, Elasticsearch, Nutch, Hadoop (which began inside Nutch), etc.
Lucene vs Solr
46. What is Apache Solr
- Solr (pronounced "solar") is an open source enterprise search platform.
Its major features include full-text search, hit highlighting, faceted
search, real-time indexing, dynamic clustering, database integration,
NoSQL features and rich document (e.g., Word, PDF) handling.
Providing distributed search and index replication, Solr is designed for
scalability and fault tolerance.
47. Lucene vs Solr - Create SynonymGraphFilterFactory
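// Factory arguments, the same ones a Solr schema passes as <filter> attributes
Map<String, String> args = new HashMap<>();
args.put("synonyms", "synonyms.txt");   // synonym rules file, resolved by the ResourceLoader
args.put("ignoreCase", Boolean.toString(true));
args.put("expand", Boolean.toString(true));
SynonymGraphFilterFactory syf = new SynonymGraphFilterFactory(args);
// Look for synonyms.txt in the working directory first, then on the classpath
ResourceLoader rl = new FilesystemResourceLoader(Paths.get("."), getClass().getClassLoader());
syf.inform(rl);   // loads and parses the synonym rules; may throw IOException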
48. Lucene vs Solr - Apply SynonymGraphFilterFactory
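// syf is the SynonymGraphFilterFactory created on the previous slide; input is the text to analyze
StringBuilder sb = new StringBuilder();
try (Tokenizer wt = new WhitespaceTokenizer()) {
    wt.setReader(new StringReader(input));
    try (TokenStream syn = syf.create(wt)) {
        CharTermAttribute term = syn.addAttribute(CharTermAttribute.class);
        syn.reset();   // must be called before the first incrementToken()
        if (syn.incrementToken()) {
            sb.append(term.toString());
            while (syn.incrementToken()) {
                sb.append(" ");
                sb.append(term.toString());
            }
        }
        syn.end();   // completes the TokenStream contract after the last token
    }
}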