Returning the right results - Jettro Coenradie

Explain explained 05/11/14 11:56
Summary
1: Getting the right results
Jettro Coenradie
This presentation is about getting the right results from elasticsearch. There
are a lot of things that you can do to improve the results you get back from
elasticsearch. You will get an introduction to the different kinds of queries
that you can use and the impact of analyzers on results, and we take a deep
dive into the explain functionality. Using the explain functionality you can
find out why one document matches better than another.

Returning the right results
@jettroCoenradie

2: About me
how to contact me
My name is Jettro Coenradie, I am a fellow at Luminis Amsterdam. My
specialty is search solutions and specifically elasticsearch. You can follow
me on twitter and linkedin, and my code is on github.
email jettro.coenradie@luminis.eu
twitter @jettroCoenradie
linkedin https://www.linkedin.com/in/jettro
Github https://github.com/jettro
Blog http://www.gridshore.nl
3: The right results
what are they?
How would you define the right results? Of course a lot depends on the
context of the question being asked. As always, you look up the term on
wikipedia to get your first explanation.
How would you explain The
right results?
4: The right results
according to wikipedia
Every presentation has to start with wikipedia. Too bad there is no page for
the right results, but there is an interesting link to be found. This shows an
excerpt from The Toyota Way: the right process will produce the right results.

This is also true for returning the right results using elasticsearch. During this
presentation the right process will become clear.
5: What is elasticsearch?
more than search
Before we can start explaining why elasticsearch returns the results that it
does, you first need to know more about what elasticsearch is, what it can
do for you and some terminology used throughout the remainder of the
presentation. You will learn about structured and unstructured data, data
sources and how we use the data.

6: Lucene
what we need it for

Introduce lucene, explain we use analyzers to create terms, the terms are
stored in an inverted index and the inverted index is used to search the
terms.
Create terms,
Store terms,
Search terms.
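To make the three steps concrete, here is a minimal sketch in Python. It is not how lucene implements it: the whitespace analyzer, the function names and the sample documents are assumptions made for illustration only.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Store terms: map every term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):  # create terms
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, text):
    """Search terms: ids of documents containing at least one query term."""
    hits = set()
    for term in analyze(text):
        hits.update(index.get(term, []))
    return sorted(hits)


# Hypothetical sample documents, not taken from the slides index.
docs = {1: "Create terms", 2: "Store terms", 3: "Search terms"}
index = build_inverted_index(docs)
print(index["terms"])
print(search(index, "STORE"))
```

Note that the query text goes through the same analyzer as the documents, which is why the upper case query still finds document 2.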
7: Elasticsearch and lucene
cluster, index, shards, lucene
Here I want to explain the different components of an elasticsearch cluster.
I am showing images containing the structure of these components. A
cluster contains multiple nodes. Each node contains shards of multiple
indices. Each shard is a lucene index.

8: Executing a query
calling all shards
In this slide I am explaining what happens when you execute a query. You
will learn that the client first sends the query to all shards. The results are
gathered and merged, and once the right set of documents is determined the
actual required documents are fetched.
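The two phases can be sketched in a few lines of Python. This is a simplified model under my own assumptions: shards are plain dictionaries of doc id to score, and the function names are hypothetical, not part of any elasticsearch api.

```python
import heapq


def query_phase(shard, size):
    """Each shard returns only (score, doc_id) pairs for its own top hits."""
    return heapq.nlargest(size, ((score, doc_id) for doc_id, score in shard.items()))


def search(shards, size):
    """Merge the per-shard top hits and keep the overall best `size` ids."""
    candidates = []
    for shard in shards:
        candidates.extend(query_phase(shard, size))
    top = heapq.nlargest(size, candidates)
    # Fetch phase: only now would the documents for the winning ids be retrieved.
    return [doc_id for _score, doc_id in top]


# Hypothetical shards: doc id -> score for the current query.
shards = [{"a": 0.9, "b": 0.2}, {"c": 0.7, "d": 0.4}, {"e": 0.1}]
print(search(shards, 2))
```

Every shard has to return its own top hits, because before the merge nobody knows which shard holds the best documents.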

9: Executing a query
basic concepts

This slide shows both apis that elasticsearch provides. You can
execute queries using the java api, or the rest api through one of the
available drivers. No matter what mechanism you choose, you can use a lot of
different queries.

10: Example with curl
find all docs
In this slide we present the most basic match all docs query using curl.

11: Other query tools
there are a lot
Some examples of other query tools that are available.

12: Execute query
basic match query
This is the most basic variant of executing a query.
GET /slides/_search
{
  "query": {
    "match": {
      "description": "What you type!"
    }
  }
}
Results only in live presentation
13: The calculated score
use the explain api
Here we are going to discuss the most basic explain you can get.
GET /slides/_search?explain

{
  "query": {
    "match": {
    }
  }
}
14: Explain query explained
the basics
In this slide I am going to show details about the explain basics. It is
important to notice the pattern that every explain output shows for each
term that is matched.

15: Calculating score
the theory
In this slide we are going to explain the theory behind calculating the score
using similarity algorithms.
Score is calculated for matching documents (Boolean Model),
Score represents similarity between search and document terms,
Lucene uses enhanced TF/IDF (coordination factor and field length),
Other algorithms can be used: Okapi BM25
16: Lucene similarity
formula
Shows the formula used by lucene to calculate the score.

https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/
17: Calculating score
the terms
This slide gives an overview of the most important definitions for calculating
the score.
queryNorm                   Attempt to make the scores of different queries comparable
coord                       Factor for the total score based on the number of queried and found terms
Term frequency              Number of times a term is matched in the field
Inverse document frequency  Based on the number of documents that have the term (rare terms score higher)
fieldNorm                   Based on the length of the field the term was found in (shorter fields score higher)
boost                       Boost a field score
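To show how these definitions fit together, here is a minimal Python sketch of the classic lucene TF/IDF scoring as I understand it. Boosts are left out to keep it short, and all function names are my own. Treat it as an approximation of the formula, not as the lucene implementation.

```python
import math


def tf(freq):
    """Term frequency: square root of the times the term occurs in the field."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a higher value."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def coord(matched_terms, query_terms):
    """Fraction of the query terms that was found in the document."""
    return matched_terms / query_terms


def query_norm(query_idfs):
    """Normalisation factor that depends on the query terms only."""
    return 1.0 / math.sqrt(sum(value ** 2 for value in query_idfs))


def score(matches, query_idfs):
    """matches: (freq, idf, field_norm) for each query term found in the document."""
    total = sum(tf(freq) * term_idf ** 2 * norm for freq, term_idf, norm in matches)
    return coord(len(matches), len(query_idfs)) * query_norm(query_idfs) * total


# Hypothetical index with 3 documents; "one" is in 1 document, "two" in 2.
idf_one, idf_two = idf(1, 3), idf(2, 3)
# A document that contains only "two" (fieldNorm 0.625) for the query "one two".
example = score([(1, idf_two, 0.625)], [idf_one, idf_two])
print(f"{example:.4f}")
```

The queryNorm does not change the order of the results for one query. It only tries to make scores of different queries comparable.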
18: An explain example
using match query
In this slide we are going to use a very simple match query with an index
containing only three documents. The goal is to show the effect on term
frequency, inverse document frequency and the fieldNorm with very few
documents in the index.
Show tf / idf / fieldNorm and score

        Doc 1            Doc 2            Doc 3
        one two three    two three        three
one     1 / 1 / 0.5
        0.702
two     1 / 2 / 0.5      1 / 2 / 0.625
        0.5              0.625
three   1 / 3 / 0.5      1 / 3 / 0.625    1 / 3 / 1
        0.356            0.445            0.712
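The scores in the table can be recalculated with a small Python sketch. I assume here that the middle number in each cell is the number of documents containing the term (not the idf value itself), and that for a one-term query the score reduces to sqrt(tf) * idf * fieldNorm with the classic lucene idf. Both are my assumptions, the slide does not state them.

```python
import math

NUM_DOCS = 3

# term -> {doc: (tf, doc_freq, field_norm)}, copied from the table above.
TABLE = {
    "one": {"Doc 1": (1, 1, 0.5)},
    "two": {"Doc 1": (1, 2, 0.5), "Doc 2": (1, 2, 0.625)},
    "three": {"Doc 1": (1, 3, 0.5), "Doc 2": (1, 3, 0.625), "Doc 3": (1, 3, 1.0)},
}


def single_term_score(tf, doc_freq, field_norm, num_docs=NUM_DOCS):
    """Score of a one-term query: sqrt(tf) * idf * fieldNorm."""
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    return math.sqrt(tf) * idf * field_norm


scores = {
    term: {doc: single_term_score(*cell) for doc, cell in cells.items()}
    for term, cells in TABLE.items()
}
for term, per_doc in scores.items():
    print(term, {doc: f"{value:.4f}" for doc, value in per_doc.items()})
```

The term three is in every document, so its idf is the lowest. Still, Doc 3 gets the highest score of the table because its field contains only that one term.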
19: Explain multiple terms
with a trick
Here we are going to show what happens to the results when using capital
letters and multiple terms, and introduce the camel case analyzer.
GET /onetwothree/_search?explain
20: What is an analyzer
the parts
Explain what the different components of an analyzer are.
Character filters Tidy up the string before tokenising.
Tokeniser Splits the string into a number of tokens
Token filters Do something with the tokens
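The three parts can be chained as in the following Python sketch. It only mimics the idea of an analyzer: the html stripping, the camel case regular expression and the function names are my assumptions, not the actual settings of the camel analyzer in the onetwothree sample.

```python
import re


def char_filter(text):
    """Character filter: tidy up the raw string (here: strip simple html tags)."""
    return re.sub(r"<[^>]+>", "", text)


def camel_tokenizer(text):
    """Tokeniser: split on whitespace and in front of every capital letter."""
    return re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", text)


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the three parts in order: char filter, tokeniser, token filter."""
    return lowercase_filter(camel_tokenizer(char_filter(text)))


print(analyze("<b>OneTwoThree</b>"))  # ['one', 'two', 'three']
```

With such an analyzer a search for two can match a document that contains OneTwoThree, because both sides end up as the same lowercase terms.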
21: One Two Three Analyzer
settings

Show the settings part of the analyzer as used in the onetwothree sample
with the camel case.
GET /onetwothree/_settings
22: One Two Three Analyzer
mappings
Show the mappings part of the analyzer as used in the onetwothree sample
with the camel case.
GET /onetwothree/_mappings
23: One Two Three Analyzer
analyze api
Test the analyzer using the analyze api.
GET /onetwothree/_analyze?analyzer=camel&text=OneTwoThree
24: Back to explain
recap single term
In this slide we get back to the explain api, from the image with the real
explain output we introduce a short notation.

{
  "query": {
    "match": {
      "description": "basic"
    }
  }
}
structure of the score calculation
description:basic
[*]
tf / idf / fieldNorm
25: Validate query

using validate api
In this slide we are going to demonstrate the validate api for a query using
multiple terms.
POST /slides/_validate/query?explain
26: Validate query
using validate api
Multiple terms using the and operator.
27: Bool query
the base for all queries
Introduce the bool query as the base for all other queries. In the end all
queries can be written as a bool query. Explain the difference between
the operators AND and OR.
{
  "query": {
    "bool": {
      "must": [
        {}
      ],
      "must_not": [
        {}
      ],
      "should": [
        {}
      ]
    }
  }
}
28: Boolean model

must, must_not and should
In this slide we are going to explain the transformation of all queries into a
bool query.
{
  "query": {
    "match": {
      "description": "basic search elasticsearch"
    }
  }
}
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "description": {
              "value": "basic"
            }
          }
        },
        {
          "term": {
            "description": {
              "value": "search"
            }
          }
        },
        {
          "term": {
            "description": {
              "value": "elasticsearch"
            }
          }
        }
      ]
    }
  }
}
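The transformation from a match query to a bool query can be sketched in Python. The analyzer here is a rough stand-in (lowercase and split on whitespace) and the function names are hypothetical. Elasticsearch does this rewrite internally, this sketch only mirrors the idea.

```python
def standard_like_analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def match_to_bool(field, text, operator="or"):
    """Rewrite a match query into a bool query with one term query per token."""
    clauses = [
        {"term": {field: {"value": term}}} for term in standard_like_analyze(text)
    ]
    # OR puts the terms in should clauses, AND makes every term a must clause.
    occur = "must" if operator == "and" else "should"
    return {"query": {"bool": {occur: clauses}}}


rewritten = match_to_bool("description", "basic search elasticsearch")
print(rewritten)
```

With the and operator all terms have to be present in a document, with the default or operator one matching term is enough and more matching terms give a higher score.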

29: Explain 2 terms
match with 2
Now we are going to show the short notation for the explanation of a query
that the standard analyzer splits into two terms.
{
  "query": {
    "match": {
      "description": "basic search"
    }
  }
}
[*]
[+]
description:basic
coord (1/2)
30: Explain 3 terms
match with 3
Now we are going to show the short notation for the explanation of a query
that the standard analyzer splits into three terms.

{
  "query": {
    "match": {
      "description": "basic search elasticsearch"
    }
  }
}
[*]
[+]
description:basic
description:elasticsearch
coord (2/3)
31: Validate query
using validate api
Multiple terms and multiple fields with the default best_fields type query.
32: Validate query
using validate api
Multiple terms and multiple fields with the most_fields type query.
33: Validate query

using validate api
Multiple terms and multiple fields with the cross_fields type query.
34: Explain multi_field
best_fields
Show the effect of a multi_match query using the default best_fields type.
{
  "query": {
    "multi_match": {
      "query": "basic query",
      "fields": [
        "title",
        "description"
      ],
      "type": "best_fields"
    }
  }
}
[max_of]
[+]
description:basic
description:query
[*]
[+]
title:query

coord (1/2)
35: Explain multi_field
most_fields
Show the effect of a multi_match query using the most_fields type.
{
  "query": {
    "multi_match": {
      "fields": [
        "title",
        "description"
      ],
      "type": "most_fields"
    }
  }
}
[sum_of]
[+]
description:basic
description:query
[*]
[+]
title:query
coord (1/2)

36: Explain multi_field
cross_fields
Show the effect of a multi_match query using the cross_fields type.
{
  "query": {
    "multi_match": {
      "fields": [
        "title",
        "description"
      ],
      "type": "cross_fields"
    }
  }
}
[sum_of]
[max_of]
description:basic
[max_of]
description:query
title:query
37: Explain dis_max query
use tie breaker
Show the effect of a dis_max query, which with a tie breaker is a balance
between using only the best matching field and adding up the scores of all
fields.

{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.2,
      "queries": [
        {
          "match": {
            "description": "basic query"
          }
        },
        {
          "match": {
            "title": "basic query"
          }
        }
      ]
    }
  }
}
[max_of + 0.7 [*] others]
[+]
description:basic
description:query
[*]
[+]
title:query
coord (1/2)
38: Multiple terms and fields
summary

Explain the different options we have for multi field queries and explain the
differences when calculating the score.
Best fields uses the score of the field with the highest score,
Most fields adds the scores of the different fields,
Cross fields treats all fields as one big field and adds the maximum score per
term,
Dis max takes the best field and adds a part of the score of the other fields
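The four options differ only in how the scores are combined, which a short Python sketch can show. It deliberately ignores coord and query norm, and the function names and sample scores are hypothetical.

```python
def best_fields(field_scores):
    """Score of the single best matching field."""
    return max(field_scores.values())


def most_fields(field_scores):
    """Sum of the scores of all matching fields."""
    return sum(field_scores.values())


def dis_max(field_scores, tie_breaker):
    """Best field plus tie_breaker times the scores of the other fields."""
    best = max(field_scores.values())
    return best + tie_breaker * (sum(field_scores.values()) - best)


def cross_fields(term_scores):
    """Per term take the best scoring field, then add the terms up."""
    return sum(max(per_field.values()) for per_field in term_scores.values())


# Hypothetical scores, only meant to compare the options.
fields = {"title": 0.3, "description": 0.7}
terms = {"basic": {"description": 0.5}, "query": {"title": 0.3, "description": 0.2}}
print(best_fields(fields), most_fields(fields), dis_max(fields, 0.7), cross_fields(terms))
```

A tie breaker of 0 makes dis_max behave like best fields, a tie breaker of 1 makes it add up all fields.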
39: Boosting
the basics
In match queries you can apply a boost to a certain field. It is important to
notice that the structure of the explain output does not change when using
this kind of boost. Only the score changes; the boost is reflected in the
query norm of the explain.
{
  "query": {
    "multi_match": {
      "fields": [
        "title^5",
        "description"
      ]
    }
  }
}
                         No boost   Title boost
_score                   0.729      0.312
description:basic        0.533      0.107
description:query        0.195      0.0391
description query norm   0.197      0.0394
title:query              0.624      0.624
coord (1/2)              0.5        0.5
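The numbers in the table can be checked with a few lines of Python. I assume the default best_fields type here, so the score is the maximum of the description score and the title score multiplied by its coord. Because the table values are rounded, the last digit can differ slightly.

```python
def best_fields_score(description_terms, title_terms, title_coord):
    """max_of(sum of description terms, sum of title terms * coord)."""
    description = sum(description_terms)
    title = sum(title_terms) * title_coord
    return max(description, title)


# Values copied from the table above.
no_boost = best_fields_score([0.533, 0.195], [0.624], 0.5)
title_boost = best_fields_score([0.107, 0.0391], [0.624], 0.5)
# Ratio between the two description query norms.
norm_ratio = 0.197 / 0.0394
print(f"{no_boost:.3f} {title_boost:.3f} {norm_ratio:.1f}")
```

Without the boost the description field wins, with the boost the title field wins. The title score itself stays the same, it is the query norm that becomes five times smaller and pulls the description score down.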
40: Boosting query
match with negative impact
The most basic boosting is boosting on a field basis. Sometimes you have
other boosting requirements. One thing could be to give a negative boost to
some term. Of course you can use must_not in a bool query, but this is
different: in that situation the document does not match at all, while we want
a match, just with a lower score if a certain term is present. Here we show
that the negative term query adds no score but does give a penalty to the
complete score.
{
  "query": {
    "boosting": {
      "positive": {
        "term": {
          "description": {
            "value": "basic"
          }
        }
      },
      "negative": {
        "term": {
          "description": {
            "value": "query"
          }
        }
      },
      "negative_boost": 0.2
    }
  }
}
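The effect of the negative part can be written down as a tiny Python sketch. The function name and the sample score are hypothetical, the sketch only shows that the negative query multiplies the score instead of removing the document.

```python
def boosting_score(positive_score, negative_matches, negative_boost):
    """Keep the positive score, multiply by negative_boost on a negative match."""
    if negative_matches:
        return positive_score * negative_boost
    return positive_score


# Hypothetical positive score of 0.5 with the negative_boost of 0.2 from the query.
print(boosting_score(0.5, negative_matches=False, negative_boost=0.2))
print(boosting_score(0.5, negative_matches=True, negative_boost=0.2))
```

A document with both basic and query is still returned, only lower in the list than a document with basic alone.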

41: Sorting results
by score and ...
In this slide I want to discuss the options you have for sorting results.
Sort by score (the default),
Sort by date,
Sort by analyzed fields,
GET /slides/_search

{
  "query": {
    "match": {
    }
  },
  "sort": [
    {
      "title.raw": {
        "order": "asc"
      }
    }
  ]
}
42: Fuzzy query
taking care of typos
Here we are going to demonstrate the effect of fuzzy searching on the
score. We are going to use the term basik, which is wrong for all slides
except this slide. Show what happens with the boost factor for documents
that match due to the fuzzy matching.
43: Fuzzy query
explain score for match
Explain why the score for the document with the fuzzy match is higher than
the score for the exact match.
Total score is a product of field and query weight.
found term               query weight   field weight
description:basic^0.8    0.36849        0.992109
description:basik        0.59356        0.51138
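Multiplying the two weights per term shows the problem. The Python below only redoes the arithmetic with the values from the table, the function name is my own.

```python
def term_score(query_weight, field_weight):
    """Total score of a term: product of query weight and field weight."""
    return query_weight * field_weight


fuzzy_hit = term_score(0.36849, 0.992109)  # description:basic^0.8
exact_hit = term_score(0.59356, 0.51138)   # description:basik
print(f"{fuzzy_hit:.4f} {exact_hit:.4f}")
```

Even with the lower query weight caused by the 0.8 boost, the higher field weight puts the document with basic above the document that contains the typed term basik.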
44: Fuzzy query

enhance result with a signal
Since we got the wrong document on top with the previous fuzzy query, we
now want to improve the results with a signal. A signal can help to
change the score in a way you prefer. In this case we make the score higher
if there is an exact match.
45: Fuzzy query
explain score for match
Explain why adding a signal query as a should query with a match query
does change the order of the results.
No match means a coord penalty.
found term               must (fuzzy)   should (match)
description:basik        0.26101        0.26101
description:basic^0.8    0.31438 * 0.5 (coord 1/2)
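The same arithmetic for the query with the signal, as a Python sketch. I assume the bool query adds up the matching clauses and multiplies the sum with coord, as the short notation on the earlier slides suggests. The function name is hypothetical.

```python
def bool_score(clause_scores):
    """Sum of the matching clauses multiplied by coord (matching / total)."""
    matched = [score for score in clause_scores if score is not None]
    return sum(matched) * len(matched) / len(clause_scores)


# Values from the table above; None means the should clause did not match.
with_basik = bool_score([0.26101, 0.26101])
with_basic = bool_score([0.31438, None])
print(f"{with_basik:.5f} {with_basic:.5f}")
```

The document with the exact term now scores on both clauses, while the document with basic only matches the fuzzy clause and gets the coord penalty on top of that.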
46: Function score query
using popularity
One query that is used a lot on news sites is the function_score query. With
this query you can change the score based on another field like the
popularity or recency. In this slide we discuss the effect on the explain
output for such a query.
GET /blogging/_search?explain

{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "description": "elasticsearch"
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity",
            "factor": 1.2,
            "modifier": "ln"
          }
        }
      ]
    }
  }
}
function score
[*]
description:elasticsearch
Math.min
ln(doc[popularity].value *
factor=1.2)
maxBoost
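The short notation can be translated into a Python sketch. I assume the function value is multiplied with the query score and capped by maxBoost, as the notation suggests. Only the ln modifier from the query and the no-modifier case are covered, and the popularity and score values are hypothetical.

```python
import math


def field_value_factor(value, factor=1.0, modifier="none"):
    """Apply the factor to the field value first, then the modifier."""
    scaled = factor * value
    if modifier == "ln":
        return math.log(scaled)
    if modifier == "none":
        return scaled
    raise ValueError(f"modifier not covered by this sketch: {modifier}")


def function_score(query_score, function_value, max_boost=float("inf")):
    """Multiply the query score with the function value, capped by max_boost."""
    return query_score * min(function_value, max_boost)


# Hypothetical document: query score 0.5 and popularity 10.
boost = field_value_factor(10, factor=1.2, modifier="ln")
print(f"{function_score(0.5, boost):.4f}")
```

Be careful with low popularity values: the natural logarithm of a number at or below 1 is zero or negative, which removes or inverts the score.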
47: Summarizing
the take away
Explain the right process to produce the right results.

The right process to produce the right results.
Use the correct analyzer,
Construct the right query,
Analyze the results with your users,
Explain the results using explain/validate and improve.
48: Questions
I am here the whole day
Placeholder slide that can be used during the question round.
jettro.coenradie@luminis.eu
@jettroCoenradie
https://github.com/jettro/preso-explain
