In this session I'll give you a summary of what machine learning is, but more importantly, how you can use it for a very common problem: the relevance of your internal site search.
Recently, a client of ours shared with us their frustration that their website’s internal site search didn’t display the most relevant items when searching for certain keywords. They had done their homework and provided us with a list of over 150 keywords and the expected corresponding search results. I'll take you on a tour of how complex search is, why we all came to rely on Google, and why we've come to expect similar quality from the other searches we use online.
You'll leave the session with a general understanding of not only machine learning concepts but also of how search works and how you can use the Solr/Lucene toolkit to improve your site search with minimal impact on your Drupal site. I'll try to keep it understandable for all audiences, but do expect a high level of technical content and concepts.
https://drupalcampkyiv.org/node/80
3. I hope you saw Oleg Bogut’s session?
“SEARCH API: TIPS AND TRICKS - FROM BEGINNING TO CUSTOM SOLUTIONS”
See the recording if you haven’t!
Search API
9. Is it Artificial Intelligence?
A mathematical model, built from sample data (training data), that makes predictions or decisions without being explicitly programmed to perform the task.
Machine Learning
10. Traditional ML solves a prediction problem (classification or regression) on a single instance at a time.
If you are doing spam detection on email, you look at all the features associated with that email and classify it as spam or not.
The aim of traditional ML is to come up with a class (spam or no-spam) or a single numerical score for that instance.
Traditional ML
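To make this concrete, here is a minimal sketch of such a single-instance classifier, assuming scikit-learn is available; the example emails and labels are made up:

# A tiny traditional-ML example: classify one email at a time as spam or not.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting notes for tomorrow",
          "free money, click here", "lunch at noon?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Turn each email into a feature vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train a classifier, then predict a class for a single new instance.
model = MultinomialNB().fit(X, labels)
print(model.predict(vectorizer.transform(["free prize money"])))  # [1] = spam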
11. Learning to Rank (LTR) is a class of techniques that apply supervised machine learning (ML) to solve ranking problems.
Learning to Rank
12. RankNet was originally developed using neural nets, but the underlying model can be different and is not constrained to just neural nets. The cost function for RankNet aims to minimize the number of inversions in ranking.
RankNet
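In formulas (from Burges et al.): given model scores s_i and s_j for documents i and j, RankNet models the probability that i should rank above j and applies a cross-entropy cost, where \bar{P}_{ij} is the target probability from the training labels and \sigma is a scaling constant:

P_{ij} = \frac{1}{1 + e^{-\sigma (s_i - s_j)}}, \qquad C = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log (1 - P_{ij})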
13. Burges et al. found that during the RankNet training procedure, you don’t need the costs themselves, only the gradients (λ) of the cost with respect to the model score. You can think of these gradients as little arrows attached to each document in the ranked list, indicating the direction we’d like those documents to move.
LambdaRank
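Concretely, for a pair where document i is more relevant than document j, LambdaRank takes the RankNet gradient and scales it by how much swapping the two documents would change the ranking metric (typically NDCG):

\lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma (s_i - s_j)}} \, \bigl| \Delta \mathrm{NDCG}_{ij} \bigr|

Each document’s arrow is then the sum of the \lambda_{ij} over all pairs it appears in.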
14. While MART uses gradient boosted decision trees for prediction tasks, LambdaMART uses gradient boosted decision trees with a cost function derived from LambdaRank to solve a ranking task. On experimental datasets, LambdaMART has shown better results than LambdaRank and the original RankNet.
LambdaMART
15. • A Document
  - contains Fields
    - which are made of Types
      - which are normalized by Processors
• Documents are stored in an index
• Each document is by default stand-alone*; it does not need other documents for full scope
• An index is created so that, based on tokens (read: words), we find references to the documents.
Solr/Elasticsearch (Hint: it’s actually both Lucene)
* There are ways to have references to other documents, but this is not the 99% case as it is used today in Drupal. It becomes handy when it comes to personalized search.
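As an illustration, a single indexed document might look like this; the values are made up, but the field names are the ones used in the Solr queries on the next slides:

{
  "id": "umami_search_index-en-42",
  "index_id": "umami_search_index",
  "hash": "g8deii",
  "tm_title": ["Chocolate chip cookies"],
  "sm_url": ["/en/recipes/chocolate-chip-cookies"],
  "tm_aggregated_field": ["Chocolate chip cookies. A timeless classic for young and old."]
}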
18. // Return all documents where
// Our Search Query is cookies
q=cookies&
Solr
q=cookies&qf=tm_aggregated_field^1.0&qf=tm_title^5.0&fl=id,score,tm_title,sm_url&fq=index_id:umami_search_index&fq=hash:g8deii&rows=10
19. // and limit search to these fields
qf=tm_aggregated_field^1.0&
qf=tm_title^5.0&
Solr
20. // Return these fields
fl=id,score,tm_title,sm_url&
Solr
21. // Limit documents to this index & site
fq=index_id:umami_search_index&
fq=hash:g8deii&
Solr
24. Let’s take restaurants as an example. When searching for a restaurant, what are the criteria that you would use to mark the restaurant as high quality?
Can we do the same for our site with Articles, Recipes and Pages?
Feature Definition
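In Solr’s Learning to Rank module, such criteria become features in a feature store, uploaded as JSON via a PUT to the collection’s /schema/feature-store endpoint. A minimal sketch with two features - the feature names are hypothetical, and tm_title is the title field from our queries:

[
  {
    "name": "originalScore",
    "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params": {}
  },
  {
    "name": "titleMatch",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!dismax qf=tm_title}${user_query}" }
  }
]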
28. Database best practices can be found at https://drupalsear.ch/
Code can be found at https://github.com/nickveenhof/drupal8-umami-search
Back to Drupal
32. What about hidden fields such as metatags? We are exposing them to search engines like Google, so why not to our internal search?
Right now, this is blocked by a patch:
https://www.drupal.org/project/metatag/issues/2901039
Back to Drupal
33. All other fields are only useful for Facets or for our Machine Learning Model
Back to Drupal
45. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question that precision answers is the following: of all the results labeled as relevant, how many actually surfaced to the top? High precision relates to a low false positive rate. The higher, the better, with a maximum of 1.
Src: https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/
Precision
46. Recall is the ratio of correctly predicted positive observations to all the observations in the actual class. The question recall answers is: of all the documents labeled as relevant, how many actually came back in the results? The higher, the better, with a maximum of 1.
Src: https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/
Recall
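In symbols, with TP, FP and FN counting true positives, false positives and false negatives:

\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}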
47. Using the RankLib library, we can train our model and import it into Apache Solr. There are a couple of different models that you can pick to train - for example Linear or LambdaMART - and you can further refine the model with the number of trees and the metric to optimize for.
https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html
RankLib
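For example, training a LambdaMART model could look like this; the jar and file names are placeholders, and -ranker 6 is RankLib’s LambdaMART:

# Train LambdaMART (-ranker 6) on labeled judgments, optimizing NDCG@10,
# and save the model so it can be uploaded to Solr.
java -jar RankLib.jar -train training_data.txt -ranker 6 -metric2t NDCG@10 -save lambdamart_model.txt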
52. If we look at the actual result, it shows us that the search results that we’ve marked as relevant are suddenly surfacing to the top. Our model assessed each property that we defined and it learned from the feedback! Hurray!
Results
54. Have you all given feedback that can be used as training data?
# Let’s delete all models
python3 ./manage.py delete-model --all-models
# Train a new one
python3 ./manage.py train lambdamart
# Let’s look at the results
http://localhost:5000/stats
Demo
56. We compiled a learning dataset, trained our model, and uploaded the result to Apache Solr.
Next, we used this model during our queries to re-rank the top 100 results based on the trained model.
It is still important to have a good data model, which means getting all the basics of Search covered first.
Conclusion
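In query terms, that re-ranking step uses Solr’s rq parameter from the Learning to Rank module. A sketch, assuming the trained model was uploaded under the hypothetical name myLambdaMARTModel:

// Run the normal query, then re-rank its top 100 results with the model.
q=cookies&qf=tm_title^5.0&rq={!ltr model=myLambdaMARTModel reRankDocs=100 efi.user_query=cookies}&fl=id,score,tm_title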
58. Did this talk make you interested in working with us?
Come see me during the day if you think you’re up for challenges like this, or send me an email at cto@dropsolid.com
We’re hiring
59. Reach out to Dropsolid.
Find us at https://dropsolid.com
Email me.
Are you a client, or working for a client, that needs this?