This presentation discusses getting the right results from elasticsearch. It introduces different query types that can be used, how analyzers impact results, and how to use the explain functionality to determine why one document matches better than others. The presentation also covers elasticsearch and Lucene concepts like indexes, shards, scoring, and analyzers and how they affect search results.
Returning the right results - Jettro Coenradie
1. Explain explained 05/11/14 11:56
Summary
1: Getting the right results
Jettro Coenradie
This presentation is about getting the right results from elasticsearch. There
are a lot of things that you can do to improve the results you get back from
elasticsearch. You will get an introduction to the different kinds of queries that
you can use, the impact of analyzers on results, and we take a deep dive into
the explain functionality. Using the explain functionality you can find out why
one document is matching better than another.
http://localhost:9200/_plugin/preso/#/print Page 1 of 42
Returning the right results
@jettroCoenradie
2: About me
how to contact me
My name is Jettro Coenradie, I am a fellow at Luminis Amsterdam. My
specialty is search solutions and specifically elasticsearch. You can follow
me on twitter and linkedin, and my code is on github.
email jettro.coenradie@luminis.eu
twitter @jettroCoenradie
linkedin https://www.linkedin.com/in/jettro
Github https://github.com/jettro
Blog http://www.gridshore.nl
3: The right results
what are they?
How would you define the right results? Of course a lot depends on the
context of the question asked. As always, you look up the term on
wikipedia to get a first explanation.
How would you explain the right results?
4: The right results
according to wikipedia
Every presentation has to start with wikipedia. Too bad there is no page for
the right results, but there is an interesting link to be found. This shows an
excerpt from The Toyota Way: the right process will produce the right results.
This is also true for returning the right results using elasticsearch. During this
presentation the right process will become clear.
5: What is elasticsearch?
more than search
Before we can start explaining why elasticsearch returns the results that it
does, you first need to know more about what elasticsearch is, what it can
do for you and some terminology used throughout the remainder of the
presentation. You will learn about structured and unstructured data, data
sources and how we use the data.
6: Lucene
what we need it for
Introduce lucene, explain we use analyzers to create terms, the terms are
stored in an inverted index and the inverted index is used to search the
terms.
Create terms,
Store terms,
Search terms.
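The three steps can be sketched in a few lines of Python. This is only a toy model of an inverted index, not how lucene implements it; the function names and the simple lowercase/split analysis are assumptions made for the illustration.

```python
import re
from collections import defaultdict

def analyze(text):
    # Create terms: lowercase the text and split on non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_index(docs):
    # Store terms: map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # Search terms: analyze the query the same way and collect matching ids.
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)

docs = {1: "one two three", 2: "two three", 3: "three"}
index = build_index(docs)
print(search(index, "Two"))
```

Because the query goes through the same analysis as the documents, searching for "Two" finds the documents that were indexed with the term "two".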
7: Elasticsearch and lucene
cluster, index, shards, lucene
In here I want to explain the different components of an elasticsearch cluster.
I am showing images containing the structure of these components. A
cluster contains multiple nodes. Each node contains shards of multiple
indices. Each shard is a lucene index.
8: Executing a query
calling all shards
In this slide I am explaining what happens when you execute a query. You
will learn that the client first executes a query that is sent to all shards.
The results are gathered and merged, and once the right set of documents has
been determined, the documents that are actually required are fetched.
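The two phases can be sketched as follows. This is a simplified model, not elasticsearch code: each shard is a plain dictionary, the scores are made-up values, and the function names are hypothetical.

```python
import heapq

def query_phase(shard, term, size):
    # Each shard returns only (score, doc_id) pairs for its own top hits.
    hits = [(score, doc_id)
            for doc_id, (text, score) in shard.items()
            if term in text.split()]
    return heapq.nlargest(size, hits)

def search(shards, term, size):
    # Query phase: send the query to every shard and merge the partial lists.
    merged = []
    for shard in shards:
        merged.extend(query_phase(shard, term, size))
    top = heapq.nlargest(size, merged)
    # Fetch phase: only now load the documents that made the global top list.
    lookup = {doc_id: text for shard in shards for doc_id, (text, _) in shard.items()}
    return [(doc_id, lookup[doc_id]) for _, doc_id in top]

# Hypothetical shards: doc_id -> (text, precomputed score).
shards = [
    {"a": ("basic search", 0.9), "b": ("basic query", 0.4)},
    {"c": ("basic elasticsearch", 0.7), "d": ("other text", 0.1)},
]
print(search(shards, "basic", 2))
```

Only ids and scores travel during the first phase, which is why the full documents are fetched in a separate step for the final top hits.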
9: Executing a query
basic concepts
This slide shows both APIs that elasticsearch provides. You can
execute queries using the java api or the rest api through one of the available
drivers. No matter what mechanism you choose you can use a lot of
different queries.
10: Example with curl
find all docs
In this slide we present the most basic match all docs query using curl.
11: Other query tools
there are a lot
Some examples of other query tools that are available.
12: Execute query
basic match query
This is the most basic variant of executing a query.
GET /slides/_search
{
  "query": {
    "match": {
      "description": "What you type!"
    }
  }
}
Results only in live presentation
13: The calculated score
use the explain api
Here we are going to discuss the most basic explain you can get.
GET /slides/_search?explain
{
  "query": {
    "match": {
      "description": "What you type!"
    }
  }
}
Results only in live presentation
14: Explain query explained
the basics
In this slide I am going to show details about the explain basics. It is
important to notice the pattern that every explain output follows for each
term that is matched.
15: Calculating score
the theory
In this slide we are going to explain the theory behind calculating the score
using similarity algorithms.
Score is calculated for matching documents (Boolean Model),
Score represents similarity between search and document terms,
Lucene uses enhanced TF/IDF (coordination factor and field length),
Other algorithms can be used: Okapi BM25
16: Lucene similarity
formula
Shows the formula used by lucene to calculate the score.
https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/
17: Calculating score
the terms
This slide gives an overview of the most important definitions for calculating
the score.
queryNorm                    Attempt to make different queries comparable
coord                        Factor for the total score based on the number of queried and found terms
Term frequency               Number of times a term is matched in the field
Inverse document frequency   Based on the number of documents that have the term
fieldNorm                    Based on the length of the field the term was found in
boost                        Boost a field score
18: An explain example
using match query
In this slide we are going to use a very simple match query with an index
containing only three documents. The goal is to show the effect on term
frequency, inverse document frequency and the fieldNorm with very few
documents in the index.
Show tf/idf/fieldNorm and score
         Doc 1             Doc 2             Doc 3
         one two three     two three         three
one      1 / 1 / 0.5
         0.702
two      1 / 2 / 0.5       1 / 2 / 0.625
         0.5               0.625
three    1 / 3 / 0.5       1 / 3 / 0.625     1 / 3 / 1
         0.356             0.445             0.712
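The scores in the table can be reproduced with a short calculation. This is a sketch that assumes lucene's classic TF/IDF similarity (tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1))) and a single-term query, where the query weight normalizes to 1. The fieldNorm values are copied from the slide rather than computed, since lucene stores the norm with reduced precision. Read this way, the middle number in each cell is the document frequency of the term.

```python
import math

docs = {"Doc 1": "one two three", "Doc 2": "two three", "Doc 3": "three"}
# fieldNorm values as shown on the slide. The norm is stored with reduced
# precision, so 1/sqrt(3) and 1/sqrt(2) show up as 0.5 and 0.625.
field_norm = {"Doc 1": 0.5, "Doc 2": 0.625, "Doc 3": 1.0}

def idf(term):
    # Inverse document frequency: rarer terms get a higher value.
    doc_freq = sum(term in text.split() for text in docs.values())
    return 1 + math.log(len(docs) / (doc_freq + 1))

def score(term, doc):
    # Score of a single-term query: tf * idf * fieldNorm.
    tf = math.sqrt(docs[doc].split().count(term))
    return tf * idf(term) * field_norm[doc]

for term in ("one", "two", "three"):
    for doc in docs:
        if term in docs[doc].split():
            print(term, doc, round(score(term, doc), 4))
```

The term "three" occurs in every document, so it has the lowest idf, and the shortest document (Doc 3) gets the highest fieldNorm and therefore the highest score for it.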
19: Explain multiple terms
with a trick
Here we are going to show what happens to the results when using capital
letters and multiple terms, and introduce the camel case analyzer.
GET /onetwothree/_search?explain
Results only in live presentation
20: What is an analyzer
the parts
Explain what the different components of an analyzer are.
Character filters   Tidy up the string before tokenising
Tokeniser           Splits the string into a number of tokens
Token filters       Do something with the tokens
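The three parts can be mimicked with plain Python functions. This is only an illustration of the pipeline: the regular expressions are assumptions and are not the settings of the camel analyzer used in the demo.

```python
import re

def char_filter(text):
    # Character filter: tidy up the string, here by replacing HTML-like tags.
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    # Tokeniser: split into tokens on camel-case boundaries and non-letters.
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+", text)

def token_filter(tokens):
    # Token filter: do something with the tokens, here lowercase them.
    return [t.lower() for t in tokens]

def analyze(text):
    return token_filter(tokenize(char_filter(text)))

print(analyze("<b>OneTwoThree</b> basic"))
```

With this pipeline a value such as OneTwoThree is indexed as three separate terms, so a search for one of them can match.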
21: One Two Three Analyzer
settings
Show the settings part of the analyzer as used in the onetwothree sample
with the camel case.
GET /onetwothree/_settings
Results only in live presentation
22: One Two Three Analyzer
mappings
Show the mappings part of the analyzer as used in the onetwothree sample
with the camel case.
GET /onetwothree/_mappings
Results only in live presentation
23: One Two Three Analyzer
analyze api
Test the analyzer using the analyze api.
GET /onetwothree/_analyze?analyzer=camel&text=OneTwoThree
Results only in live presentation
24: Back to explain
recap single term
In this slide we get back to the explain api. From the image with the real
explain output we introduce a short notation.
GET /slides/_search?explain
25: Validate query
using validate api
In this slide we are going to demonstrate the validate api for a query using
multiple terms.
POST /slides/_validate/query?explain
26: Validate query
using validate api
In this slide we are going to demonstrate the validate api for a query using
multiple terms using the and operator.
POST /slides/_validate/query?explain
27: Bool query
the base for all queries
Introduce the bool query as the base query for all other queries. In the end all
queries can be written as a bool query. Explain the difference between the
operators AND and OR.
{
  "query": {
    "bool": {
      "must": [
        {}
      ],
      "must_not": [
        {}
      ],
      "should": [
        {}
      ]
    }
  }
}
28: Boolean model
must, must_not and should
In this slide we are going to explain the transformation of all queries into a
bool query.
{
  "query": {
    "match": {
      "description": "basic search elasticsearch"
    }
  }
}
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "description": {
              "value": "basic"
            }
          }
        },
        {
          "term": {
            "description": {
              "value": "search"
            }
          }
        },
        {
          "term": {
            "description": {
              "value": "elasticsearch"
            }
          }
        }
      ]
    }
  }
}
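The transformation can be sketched as a small Python function that builds the bool query from the match text. The function name and the whitespace split (standing in for the standard analyzer) are assumptions for the illustration; elasticsearch does this rewrite internally.

```python
def match_to_bool(field, text, operator="or"):
    # Stand-in for the standard analyzer: lowercase and split on whitespace.
    terms = text.lower().split()
    clauses = [{"term": {field: {"value": term}}} for term in terms]
    # OR means any term may match (should), AND means every term must match.
    occur = "must" if operator == "and" else "should"
    return {"query": {"bool": {occur: clauses}}}

query = match_to_bool("description", "basic search elasticsearch")
print(query)
```

Calling it with operator="and" puts the same term clauses under must instead of should, which is the difference between the two operators.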
29: Explain 2 terms
match with 2
Now we are going to show the short notation for the explanation of a query
with two terms produced by the standard analyzer.
GET /slides/_search?explain
{
  "query": {
    "match": {
      "description": "basic search"
    }
  }
}
structure of the score calculation
[*]
  [+]
    description:basic
  coord (1/2)
30: Explain 3 terms
match with 3
Now we are going to show the short notation for the explanation of a query
with three terms produced by the standard analyzer.
GET /slides/_search?explain
{
  "query": {
    "match": {
      "description": "basic search elasticsearch"
    }
  }
}
structure of the score calculation
[*]
  [+]
    description:basic
    description:elasticsearch
  coord (2/3)
31: Validate query
using validate api
In this slide we are going to demonstrate the validate api for a query using
multiple terms and multiple fields with the default best_fields type query.
POST /slides/_validate/query?explain
32: Validate query
using validate api
In this slide we are going to demonstrate the validate api for a query using
multiple terms and multiple fields with the most_fields type query.
POST /slides/_validate/query?explain
33: Validate query
using validate api
In this slide we are going to demonstrate the validate api for a query using
multiple terms and multiple fields with the cross_fields type query.
POST /slides/_validate/query?explain
34: Explain multi_field
best_fields
Show the effect of a multi_match query using the default best_fields type.
GET /slides/_search?explain
{
  "query": {
    "multi_match": {
      "query": "basic query",
      "fields": [
        "title",
        "description"
      ],
      "type": "best_fields"
    }
  }
}
structure of the score calculation
[max_of]
  [+]
    description:basic
    description:query
  [*]
    [+]
      title:query
    coord (1/2)
35: Explain multi_field
most_fields
Show the effect of a multi_match query using the most_fields type.
GET /slides/_search?explain
{
  "query": {
    "multi_match": {
      "query": "basic query",
      "fields": [
        "title",
        "description"
      ],
      "type": "most_fields"
    }
  }
}
structure of the score calculation
[sum_of]
  [+]
    description:basic
    description:query
  [*]
    [+]
      title:query
    coord (1/2)
36: Explain multi_field
cross_fields
Show the effect of a multi_match query using the cross_fields type.
GET /slides/_search?explain
{
  "query": {
    "multi_match": {
      "query": "basic query",
      "fields": [
        "title",
        "description"
      ],
      "type": "cross_fields"
    }
  }
}
structure of the score calculation
[sum_of]
  [max_of]
    description:basic
  [max_of]
    description:query
    title:query
37: Explain dis_max query
use tie breaker
Show the effect of a dis_max query which is a balance between cross_fields
and best matching field.
GET /slides/_search?explain
Explain the different options we have for multi field queries and explain the
differences when calculating the score.
Best fields takes the score of the field with the highest score,
Most fields adds up the scores of the different fields,
Cross fields treats all fields as one big field and adds the maximum score
per term,
Dis max takes the best field and adds a part of the score of the other fields
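The four ways of combining field scores can be compared side by side in a small sketch. The per-term scores are made-up values, the function names are hypothetical, and coord and query norm are left out to keep the comparison readable.

```python
def best_fields(scores):
    # scores: {field: {term: score}} for the terms that matched in each field.
    return max(sum(terms.values()) for terms in scores.values())

def most_fields(scores):
    # Add up the scores of all fields.
    return sum(sum(terms.values()) for terms in scores.values())

def cross_fields(scores):
    # Treat the fields as one big field: per term, take the best field score.
    all_terms = {t for terms in scores.values() for t in terms}
    return sum(max(terms.get(t, 0.0) for terms in scores.values()) for t in all_terms)

def dis_max(scores, tie_breaker):
    # Best field plus a fraction of the scores of the other fields.
    per_field = sorted((sum(terms.values()) for terms in scores.values()), reverse=True)
    return per_field[0] + tie_breaker * sum(per_field[1:])

# Hypothetical scores for the query "basic query".
scores = {
    "description": {"basic": 0.5, "query": 0.2},
    "title": {"query": 0.6},
}
print(best_fields(scores), most_fields(scores), cross_fields(scores), dis_max(scores, 0.3))
```

With a tie_breaker of 0 the dis_max result equals best fields, and with a tie_breaker of 1 it equals most fields, which is why it can be seen as a balance between the two.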
39: Boosting
the basics
In match queries you can apply a boost to a certain field. Note that the
structure of the explain output does not change with this kind of boost. Only
the score changes; the boost is reflected in the query norm of the
explain.
{
  "query": {
    "multi_match": {
      "query": "basic query",
      "fields": [
        "title^5",
        "description"
      ]
    }
  }
}
                          No boost   Title boost
_score                    0.729      0.312
description:basic         0.533      0.107
description:query         0.195      0.0391
description query norm    0.197      0.0394
title:query               0.624      0.624
coord (1/2)               0.5        0.5
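The _score row can be checked against the other rows. This sketch assumes the best_fields structure (the maximum of the description sum and the title score times coord); the numbers are the rounded values from the table, so the result only matches to within rounding.

```python
def best_fields_score(description_terms, title_term, coord):
    # description matched both terms, title matched one of two (coord 1/2).
    description = sum(description_terms)
    title = title_term * coord
    return max(description, title)

no_boost = best_fields_score([0.533, 0.195], 0.624, 0.5)
title_boost = best_fields_score([0.107, 0.0391], 0.624, 0.5)
print(round(no_boost, 3), round(title_boost, 3))
```

Without the boost the description field wins. With the title boost the description scores shrink through the query norm, so the title field becomes the best field.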
40: Boosting query
match with negative impact
The most basic boosting is boosting on a field basis. Sometimes you have
other boosting requirements. One could be to give a negative boost to
some term. Of course you can use must_not in a bool query, but this is
different: with must_not the document does not match at all, whereas here we
want a match, just with a lower score if a certain term is present. Here we
show that the negative term query adds no score but does apply a penalty to
the complete score.
{
  "query": {
    "boosting": {
      "positive": {
        "term": {
          "description": {
            "value": "basic"
          }
        }
      },
      "negative": {
        "term": {
          "description": {
            "value": "query"
          }
        }
      },
      "negative_boost": 0.2
    }
  }
}
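The effect on the score can be sketched as follows. The function name and the positive score of 0.5 are made up for the illustration.

```python
def boosting_score(positive_score, negative_matches, negative_boost):
    # The negative query contributes no score of its own; it only scales
    # down the positive score when it matches.
    if negative_matches:
        return positive_score * negative_boost
    return positive_score

print(boosting_score(0.5, False, 0.2))
print(boosting_score(0.5, True, 0.2))
```

A document that contains both basic and query still matches, but with a negative_boost of 0.2 it keeps only a fifth of its score.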
41: Sorting results
by score and ...
In this slide I want to discuss the options you have for sorting results.
Sort by score (the default),
Sort by date,
Sort by analyzed fields,
GET /slides/_search
{
  "query": {
    "match": {
      "description": "What you type!"
    }
  },
  "sort": [
    {
      "title.raw": {
        "order": "asc"
      }
    }
  ]
}
Results only in live presentation
42: Fuzzy query
taking care of typos
Here we are going to demonstrate the effect of fuzzy searching on the
score. We are going to use the term basik, which is a misspelling on all slides
except this one. Show what happens with the boost factor for documents
that match due to the fuzzy matching.
43: Fuzzy query
explain score for match
Explain why the score for the document with the fuzzy match is higher than
the score for the exact match.
Total score is a product of field and query weight.
found term               query weight   field weight
description:basic^0.8    0.36849        0.992109
description:basik        0.59356        0.51138
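Multiplying the two columns gives the total score per found term. The numbers are taken from the table; only the variable names are made up.

```python
# found term -> (query weight, field weight), as listed in the table.
rows = {
    "description:basic^0.8": (0.36849, 0.992109),
    "description:basik": (0.59356, 0.51138),
}
scores = {term: query_weight * field_weight
          for term, (query_weight, field_weight) in rows.items()}
for term, total in scores.items():
    print(term, round(total, 4))
```

The document found through the fuzzy term basic ends up with about 0.366, above the roughly 0.304 of the document that contains basik, because its much higher field weight outweighs the lower query weight.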
44: Fuzzy query
enhance result with a signal
Since we got the wrong document on top with the previous fuzzy query we
now want to help improve the results with a Signal. A signal can help to
change the score in a way you prefer. In this case we make the score higher
if there is an exact match.
45: Fuzzy query
explain score for match
Explain why adding a signal query as a should query with a match query
does change the order of the results.
No match means a coord penalty.
found term               must (fuzzy)   should (match)
description:basik        0.26101        0.26101
description:basic^0.8    0.31438        (no match) * 0.5 (coord 1/2)
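The table can be turned into totals with a small calculation. It assumes the bool query adds the scores of the matching clauses and multiplies by coord (matched clauses divided by total clauses); the function name is hypothetical.

```python
def must_should_score(clause_scores):
    # clause_scores: one score per clause, None when the clause did not match.
    matched = [s for s in clause_scores if s is not None]
    coord = len(matched) / len(clause_scores)
    return sum(matched) * coord

basik_doc = must_should_score([0.26101, 0.26101])  # fuzzy and exact match
basic_doc = must_should_score([0.31438, None])     # fuzzy match only
print(round(basik_doc, 5), round(basic_doc, 5))
```

The document with the exact match now scores about 0.522 against about 0.157 for the document that only matches through fuzziness, so the order of the results is reversed.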
46: Function score query
using popularity
One query that is used a lot on news sites is the function_score query. With
this query you can change the score based on another field like the
popularity or recency. In this slide we discuss the effect on the explain
output for such a query.
GET /blogging/_search?explain
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "description": "elasticsearch"
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity",
            "factor": 1.2,
            "modifier": "ln"
          }
        }
      ]
    }
  }
}
structure of the score calculation
function score
  [*]
    description:elasticsearch
    Math.min
      ln(doc[popularity].value * factor=1.2)
      maxBoost
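The structure of the calculation can be written out as a small function. This is a sketch of the arithmetic only, with a made-up query score and popularity value; the parameter names are assumptions, and the default of no cap for max_boost is chosen for the illustration.

```python
import math

def function_score(query_score, popularity, factor=1.2, max_boost=float("inf")):
    # field_value_factor with modifier "ln": natural log of factor * field value.
    boost = math.log(factor * popularity)
    # The function result is capped by max_boost and multiplied with the query score.
    return query_score * min(boost, max_boost)

print(function_score(0.5, 10))
print(function_score(0.5, 10, max_boost=2.0))
```

Note that the logarithm becomes negative when factor times popularity is below 1, so this modifier only makes sense for fields with sufficiently large values.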
47: Summarizing
the take away
Explain the right process to produce the right results.
The right process to produce the right results.
Use the correct analyzer,
Construct the right query,
Analyze the results with your users,
Explain the results using explain/validate and improve.
48: Questions
I am here the whole day
Placeholder slide that can be shown during the questions.
jettro.coenradie@luminis.eu
@jettroCoenradie
https://github.com/jettro/preso-explain