Search explained T3DD15
The talk at TYPO3 DevDays 2015 in Nuremberg which explains the deep insights of how search works. TF-IDF algorithm, vector space model and how that is used in Lucene and therefore Solr and Elasticsearch.


  1. Search explained
  2. My name is Hans Höchtl. Technical director @ Onedrop Solutions. PHP, Java, Ruby developer. Contributor to TYPO3 Solr.
  3. SELECT * FROM mytable WHERE field LIKE '%searchword%' vs. SELECT * FROM mytable WHERE field SOUNDS LIKE 'searchword'
  4. Whether a word appears inside a text can be determined easily. But is it relevant?
  5. Relevance is subjective and depends on the judgement of users. We use "scoring" to predict relevance.
  6. Scoring is computed by a function applied to our indexed documents, using the search term as an input parameter.
  7. TF-IDF: term frequency-inverse document frequency. BM25: Okapi BM25 (Best Matching). DFR: divergence from randomness. And many more.
  8. All those scoring calculations should fulfill these two requirements: 1. Precision: are the results relevant to the user? 2. Recall: have we found all relevant content in the index?
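The two requirements can be made concrete with a small sketch; the document IDs and result sets below are invented purely for illustration:

```python
# Hypothetical result sets, purely to illustrate the two metrics.
relevant = {"doc1", "doc2", "doc3", "doc4"}   # all relevant documents in the index
retrieved = {"doc1", "doc2", "doc5"}          # what the search actually returned

true_positives = relevant & retrieved
precision = len(true_positives) / len(retrieved)  # share of results that are relevant
recall = len(true_positives) / len(relevant)      # share of relevant docs that were found

print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here precision is 2/3 (one returned document is noise) while recall is only 2/4 (two relevant documents were missed); tuning a search usually means trading one against the other.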
  9. How to store documents for efficient computing of scoring?
  10. Vector Space Model: default in Solr and Elasticsearch. Document: a vector of terms. Term: a "word" inside a document. Each unique term is a dimension.
  11. Vector Space Model: the best match is the narrowest angle between query and document.
  12. Document 1: "unique unique bag". Document 2: "unique bag bag". Query: "unique bag". (Diagram: vectors v(d1), v(q), v(d2) over the dimensions "unique" and "bag".)
  13. The calculation of the cosine of the angle between the vectors is much easier than the calculation of the angle itself (CPU cycles).
  14. cos(θ) = (d2 · q) / (||d2|| · ||q||), where d2 · q is the dot product of the document and query vectors, and ||q|| is the norm (length) of q.
  15. A cosine value of zero means that the query and document vectors are orthogonal and have no match.
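The cosine comparison can be sketched directly for the example from slide 12; the vectors live in the two dimensions "unique" and "bag", with each component counting term occurrences:

```python
import math

# Term-count vectors over the dimensions (unique, bag), as in the slide example.
d1 = [2, 1]  # Document 1: "unique unique bag"
d2 = [1, 2]  # Document 2: "unique bag bag"
q  = [1, 1]  # Query: "unique bag"

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(q, d1), cosine(q, d2))
```

With pure term counts both documents score identically against the query (the angle to v(q) is the same on either side), which is exactly why the weighting scheme on the following slides matters; and two vectors sharing no term at all give a cosine of zero.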
  16. TF-IDF: regarding the vector space model (VSM), the weight of a term t in the vector of a document d is the product of its term frequency and its inverse document frequency: w(t, d) = tf(t, d) × idf(t).
  17. TF-IDF: now we have everything together to calculate the similarity between documents, by taking the cosine of the angle between their TF-IDF-weighted vectors.
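Putting the two pieces together, here is a toy end-to-end sketch. It uses the textbook weighting tf × log(N/df), not Lucene's exact formula, and the three-document corpus is invented:

```python
import math
from collections import Counter

# Invented toy corpus; tf-idf here is the textbook variant (tf * log(N/df)).
docs = {
    "d1": "the car drives on the road".split(),
    "d2": "the truck drives on the highway".split(),
    "d3": "the road to the city".split(),
}
N = len(docs)
# document frequency: in how many documents does each term occur?
df = Counter(term for words in docs.values() for term in set(words))

def tfidf_vector(words):
    tf = Counter(words)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

v1, v2 = tfidf_vector(docs["d1"]), tfidf_vector(docs["d2"])
print(cosine(v1, v2))
```

Note how "the", which occurs in every document, gets idf = log(1) = 0 and so contributes nothing to the similarity: frequent-everywhere terms are automatically discounted.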
  18. TF-IDF
      PROs:
      - Simple model based on linear algebra
      - Term weights are not binary
      - Allows computing a continuous degree of similarity between queries and documents
      - Allows ranking of documents according to their possible relevance
      - Allows partial matching
      CONs:
      - Long documents have poor similarity values (small scalar product and large dimensionality)
      - Search keywords must precisely match terms
      - Missing semantic sensitivity
      - Order of terms in the document is not taken into account
      - Terms are usually not statistically independent (as this model assumes)
  19. TF-IDF, the Lucene way. Coord: boosts documents that match more of the search terms (multiple words), e.g. 3/4 vs. 4/4. Norm: length normalization boosts fields that are shorter.
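The two Lucene factors named on the slide can be sketched as follows, along the lines of Lucene's classic DefaultSimilarity (exact behavior varies between Lucene versions, so treat this as an approximation):

```python
import math

def coord(overlap, max_overlap):
    # fraction of the query terms found in the document, e.g. 3/4 vs. 4/4
    return overlap / max_overlap

def length_norm(num_terms):
    # shorter fields get a higher norm: 1 / sqrt(number of terms in the field)
    return 1.0 / math.sqrt(num_terms)

# a document matching 3 of 4 query terms is damped relative to a full match
print(coord(3, 4), coord(4, 4))
# a 4-term field outscores a 16-term field, all else being equal
print(length_norm(4), length_norm(16))
```

Both factors are multiplied into the raw tf-idf score, which is why a short title field that contains all query words can beat a long body field that contains only some of them.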
  20. TF-IDF, multiple fields: TF-IDF expects a document to be just one field containing text. But in reality we have semi-structured documents containing fields like author, subtitle, etc.
  21. TF-IDF, multiple fields: TF-IDF expects a document to be just one field containing text. But in reality we have semi-structured documents containing fields like author, subtitle, etc.
  22. TF-IDF, multiple fields. Solr solution: DisMax Query Parser (Maximum Disjunction). Search term: "my funny house". Documents matching the query in field title, in field subtitle, in field content. TF-IDF is calculated for every field independently; the score of a document is the highest of the per-field scores.
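A DisMax request along these lines might look as follows (the core name and the boost values are invented; `qf` lists the fields to search with their boosts):

```
/solr/mycore/select
    ?defType=dismax
    &q=my funny house
    &qf=title^5 subtitle^2 content
    &tie=0.0
```

With `tie=0.0` the document score is the pure maximum of the per-field scores, matching the slide; a small positive `tie` would blend in a fraction of the other fields' scores as a tiebreaker.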
  23. Natural languages: adjectives, adverbs, nouns, verbs, conjunctions, prepositions, predicates, compounds, plurals, past tense, declension, semantics, etc.
  24. Language families: Indo-European languages, Sino-Tibetan languages.
  25. TF-IDF problem: only exact term matches are considered a hit. "Car" is not the same term as "Cars".
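A deliberately naive stemmer illustrates why this matters; real setups use language-specific stemmers (e.g. a Porter stemmer) rather than anything this crude:

```python
# Deliberately naive stemmer, only to show why analyzers normalize terms
# before matching; do not use this in production.
def naive_stem(token):
    token = token.lower()
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]  # "cars" -> "car"
    return token

print(naive_stem("Car"), naive_stem("Cars"))
```

After normalization "Car" and "Cars" collapse to the same term, so a query for one finds documents containing the other.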
  26. Handling human languages (analyzers). Tokenizers: split a stream of characters into a series of tokens. Filters: the generated tokens are passed through a series of filters that add, change, or remove tokens.
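The tokenizer-plus-filters pipeline can be sketched in a few lines; the regex tokenizer and the tiny stopword list are illustrative stand-ins for the configurable components Solr and Elasticsearch ship:

```python
import re

# Sketch of an analyzer chain: one tokenizer, then a series of filters.
def tokenize(text):
    # split the character stream into tokens
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    # a filter may change tokens ...
    return [t.lower() for t in tokens]

def stopword_filter(tokens, stopwords=frozenset({"the", "a", "an"})):
    # ... or remove them entirely
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    return stopword_filter(lowercase_filter(tokenize(text)))

print(analyze("The Car drives"))
```

The order of the chain matters: lowercasing before stopword removal is what lets "The" be dropped by a lowercase stopword list.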
  27. Index analyzers vs. query analyzers. Index analyzers: perform their analysis chain on the token stream during indexing; the generated tokens are indexed. Query analyzers: perform their analysis chain on the entered search query during query execution; otherwise the query would only hit an exact match. Beware of synonyms!
  28. Available analyzers: Solr (https://goo.gl/TXEjZK), language best practices (https://goo.gl/11O2Qz); Elasticsearch (https://goo.gl/QR1IYb), language best practices (https://goo.gl/6FQt7A).
  29. FieldTypes: Solr and Elasticsearch use fieldTypes assigned to fields for defining the analyzer chain that should be performed.
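In Solr's schema.xml such a fieldType might look like this (the type name, field name, and resource files are examples, and the exact filter set is a common baseline, not a recommendation from the talk); note the separate index and query analyzer chains, with synonyms applied only at query time:

```xml
<!-- Illustrative Solr schema.xml fragment; names and files are examples -->
<fieldType name="text_general" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
<field name="content" type="text_general" indexed="true" stored="true"/>
```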
  30. Let's take a look at the configuration of TYPO3 Solr and Neos Elasticsearch.
  31. Let's test the analyzer chain in Solr and Elasticsearch.
  32. Display the score calculation. Solr: /solr/core_de/select?q=test&debugQuery=1. Elasticsearch: use /_explain instead of /_search.
  33. Let's take a look at an explain output:
      0.51602894 = (MATCH) sum of:
        0.51602894 = (MATCH) max of:
          0.51602894 = (MATCH) weight(content:sony^40.0 in 5) [DefaultSimilarity], result of:
            0.51602894 = fieldWeight in 5, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.3025851 = idf(docFreq=4, maxDocs=50)
              0.078125 = fieldNorm(doc=5)
          0.16512926 = (MATCH) weight(keywords:sony^2.0 in 5) [DefaultSimilarity], result of:
            0.16512926 = score(doc=5,freq=1.0 = termFreq=1.0), product of:
              0.05 = queryWeight, product of:
                2.0 = boost
                3.3025851 = idf(docFreq=4, maxDocs=50)
                0.0075698276 = queryNorm
              3.3025851 = fieldWeight in 5, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3025851 = idf(docFreq=4, maxDocs=50)
                1.0 = fieldNorm(doc=5)
  34. Product codes: "AS1134-B", "131555813", "EOS 500D", "13 S24 36-G".
  35. Product codes: index the code in multiple fields to have different analyzers, and boost them from strict to fuzzy. Make use of N-grams, EdgeNGrams, WordDelimiter, Trim, etc.
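What an EdgeNGram filter contributes can be sketched as follows (the length bounds are example parameters, mirroring the min/max gram settings such a filter typically takes):

```python
def edge_ngrams(token, min_len=2, max_len=6):
    # the prefixes an EdgeNGram-style filter would emit for one token
    return [token[:n] for n in range(min_len, min(max_len, len(token)) + 1)]

print(edge_ngrams("AS1134", 2, 4))
```

Indexing those prefixes lets a partially typed code like "AS11" match the document containing "AS1134-B", while the stricter, unanalyzed copy of the field (boosted higher) still ranks exact matches first.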
  36. Use the knowledge you gain from your customers to improve your search, … like Google does.
  37. - Use Google Analytics during index time (preAddModifyDocuments hook)
      - Use recency of news (boost function)
      - Analyze the search behavior of your customers (popularity of pages)
      - Track search result clicks
  38. Some more interesting things: facets, spellchecking, phonetics, spatial search.
  39. Thank you. Mail: hhoechtl@1drop.de or jhoechtl@gmail.com. Twitter: @hhoechtl. Blog: http://blog.1drop.de
