In this session I'll give you a summary of what machine learning is, but more importantly, how you can use it for a very common problem: the relevance of your internal site search.
Recently, a client of ours shared with us their frustration that their website’s internal site search didn’t display the most relevant items when searching for certain keywords. They had done their homework and provided us with a list of over 150 keywords and the expected corresponding search results. I'll take you on a tour of how complex search is, why we all came to rely on Google, and why we've come to expect similar quality from the other searches we use online.
You'll leave the session with a general understanding of not only machine learning concepts but also of how search works and how you can use the Solr/Lucene toolkit to improve your site search with minimal impact on your Drupal site. I'll try to keep it understandable for all audiences, but do expect a high level of technical content and concepts.
https://drupalcampkyiv.org/node/80
3. I hope you saw Oleg Bogut’s session?
“SEARCH API: TIPS AND TRICKS - FROM BEGINNING TO CUSTOM SOLUTIONS”
See the recording if you haven’t!
Search API
9. Is it Artificial Intelligence?
A mathematical model, built from sample data (training data), that makes predictions or decisions without being explicitly programmed to perform the task.
Machine Learning
10. Traditional ML solves a prediction problem (classification or regression) on a single instance at a time.
If you are doing spam detection on email, you look at all the features associated with that email and classify it as spam or not.
The aim of traditional ML is to come up with a class (spam or no-spam) or a single numerical score for that instance.
Traditional ML
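To make this concrete, here is a minimal sketch of such a single-instance classifier, assuming scikit-learn is available; the example emails and labels are made up:

# A tiny traditional-ML example: classify one email at a time as spam or not.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting notes for tomorrow",
          "free money, click here", "lunch at noon?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Turn each email into a feature vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train a classifier, then predict a class for a single new instance.
model = MultinomialNB().fit(X, labels)
print(model.predict(vectorizer.transform(["free prize money"])))  # [1] = spam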
11. Learning to Rank (LTR) is a class of techniques that apply supervised machine learning (ML) to solve ranking problems.
Learning to Rank
12. RankNet was originally developed using neural nets, but the underlying model can be different and is not constrained to just neural nets. The cost function for RankNet aims to minimize the number of inversions in ranking.
RankNet
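In formulas (from Burges et al.): given model scores s_i and s_j for documents i and j, RankNet models the probability that i should rank above j and applies a cross-entropy cost, where \bar{P}_{ij} is the target probability from the training labels and \sigma is a scaling constant:

P_{ij} = \frac{1}{1 + e^{-\sigma (s_i - s_j)}}, \qquad C = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log (1 - P_{ij})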
13. Burges et al. found that during the RankNet training procedure, you don’t need the costs themselves, only the gradients (λ) of the cost with respect to the model score. You can think of these gradients as little arrows attached to each document in the ranked list, indicating the direction we’d like those documents to move.
LambdaRank
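Concretely, for a pair where document i is more relevant than document j, LambdaRank takes the RankNet gradient and scales it by how much swapping the two documents would change the ranking metric (typically NDCG):

\lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma (s_i - s_j)}} \, \bigl| \Delta \mathrm{NDCG}_{ij} \bigr|

Each document’s arrow is then the sum of the \lambda_{ij} over all pairs it appears in.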
14. While MART uses gradient boosted decision trees for prediction tasks, LambdaMART uses gradient boosted decision trees with a cost function derived from LambdaRank to solve a ranking task. On experimental datasets, LambdaMART has shown better results than LambdaRank and the original RankNet.
LambdaMART
15. • A Document
  - contains Fields
    - which are made of Types
      - which are normalized by Processors
• Documents are stored in an index
• Each document is by default stand-alone*; it does not need other documents for full scope
• An index is created so that, based on tokens (read: words), we find references to the documents.
Solr/Elasticsearch (Hint: it’s actually both Lucene)
* There are ways to have references to other documents, but this is not the 99% case as it is used today in Drupal. It becomes handy when it comes to personalized search.
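As an illustration, a single indexed document might look like this; the values are made up, but the field names are the ones used in the Solr queries on the next slides:

{
  "id": "umami_search_index-en-42",
  "index_id": "umami_search_index",
  "hash": "g8deii",
  "tm_title": ["Chocolate chip cookies"],
  "sm_url": ["/en/recipes/chocolate-chip-cookies"],
  "tm_aggregated_field": ["Chocolate chip cookies. A timeless classic for young and old."]
}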
18. // Return all documents where
// Our Search Query is cookies
q=cookies&
Solr
q=cookies&qf=tm_aggregated_field^1.0&qf=tm_title^5.0&fl=id,score,tm_title,sm_url&fq=index_id:umami_search_index&fq=hash:g8deii&rows=10
19. // and limit search to these fields
qf=tm_aggregated_field^1.0&
qf=tm_title^5.0&
Solr
20. // Return these fields
fl=id,score,tm_title,sm_url&
Solr
21. // Limit documents to this index & site
fq=index_id:umami_search_index&
fq=hash:g8deii&
Solr
24. Let’s take restaurants as an example. When searching for a restaurant, what are the criteria that you would use to mark the restaurant as high quality?
Can we do the same for our site with Articles, Recipes and Pages?
Feature Definition
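In Solr’s Learning to Rank module, such criteria become features in a feature store, uploaded as JSON via a PUT to the collection’s /schema/feature-store endpoint. A minimal sketch with two features - the feature names are hypothetical, and tm_title is the title field from our queries:

[
  {
    "name": "originalScore",
    "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params": {}
  },
  {
    "name": "titleMatch",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!dismax qf=tm_title}${user_query}" }
  }
]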
28. Database best practices can be found at https://drupalsear.ch/
Code can be found at https://github.com/nickveenhof/drupal8-umami-search
Back to Drupal
32. What about hidden fields such as metatags? We are exposing them to search engines like Google, so why not to our internal search?
Right now, this is blocked by a patch:
https://www.drupal.org/project/metatag/issues/2901039
Back to Drupal
33. All other fields are only useful for Facets or for our Machine Learning Model
Back to Drupal
45. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question that precision answers is the following: of all the results labeled as relevant, how many actually surfaced to the top? High precision relates to a low false positive rate. The higher, the better, with a maximum of 1.
Src: https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/
Precision
46. Recall is the ratio of correctly predicted positive observations to all the observations in the actual class. The question recall answers is: of all the documents labeled as relevant, how many actually came back in the results? The higher, the better, with a maximum of 1.
Src: https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/
Recall
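In symbols, with TP, FP and FN counting true positives, false positives and false negatives:

\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}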
47. Using the RankLib library, we can train our model and import it into Apache Solr. There are a couple of different models that you can pick to train - for example Linear or LambdaMART - and you can further refine the model with the number of trees and the metric to optimize for.
https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html
RankLib
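For example, training a LambdaMART model could look like this; the jar and file names are placeholders, and -ranker 6 is RankLib’s LambdaMART:

# Train LambdaMART (-ranker 6) on labeled judgments, optimizing NDCG@10,
# and save the model so it can be uploaded to Solr.
java -jar RankLib.jar -train training_data.txt -ranker 6 -metric2t NDCG@10 -save lambdamart_model.txt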
52. If we look at the actual result, it shows us that the search results that we’ve marked as relevant are suddenly surfacing to the top. Our model assessed each property that we defined and it learned from the feedback! Hurray!
Results
54. Have you all given feedback that can be used as training data?
# Let’s delete all models
python3 ./manage.py delete-model --all-models
# Train a new one
python3 ./manage.py train lambdamart
# Let’s look at the results
http://localhost:5000/stats
Demo
56. We compiled a learning dataset, trained our model, and uploaded the result to Apache Solr.
Next, we used this model during our queries to re-rank the top 100 results based on the trained model.
It is still important to have a good data model, which means getting all the basics of Search covered first.
Conclusion
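In query terms, that re-ranking step uses Solr’s rq parameter from the Learning to Rank module. A sketch, assuming the trained model was uploaded under the hypothetical name myLambdaMARTModel:

// Run the normal query, then re-rank its top 100 results with the model.
q=cookies&qf=tm_title^5.0&rq={!ltr model=myLambdaMARTModel reRankDocs=100 efi.user_query=cookies}&fl=id,score,tm_title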
58. Did this talk make you interested in working with us?
Come see me during the day if you think you’re up for challenges like this, or send me an email at cto@dropsolid.com
We’re hiring
59. Reach out to Dropsolid.
Find us at https://dropsolid.com
Email me.
Are you a client, or working for a client, that needs this?