Rated Ranking Evaluator:
Hands-on Relevance Testing
Alessandro Benedetti, Software Engineer
Andrea Gazzarini, Software Engineer
26th January 2021
Who We Are
Andrea Gazzarini
! Born in Vignanello
! Hermit Software Engineer
! Master's in Economics
! RRE Creator
! Apache Lucene/Solr Expert
! Apache Qpid Committer
! Father, Husband
! Bass player; aspiring (currently just frustrated) Chapman Stick player
Alessandro Benedetti
! Born in Tarquinia (ancient Etruscan city)
! R&D Software Engineer
! Director
! RRE Contributor
! Master in Computer Science
! Apache Lucene/Solr Committer
! Passionate about Semantic, NLP and Machine Learning technologies
! Beach Volleyball player and Snowboarder
Who We Are
● Headquartered in London / distributed team
● Open Source Enthusiasts
● Apache Lucene/Solr/Elasticsearch experts
● Community Contributors
● Active Researchers
● Hot Trends :
Search Quality Evaluation
Learning To Rank,
Document Similarity,
Relevancy Tuning
www.sease.io
Search Services
Clients
Sease
● Website: www.sease.io
● Blog: https://sease.io/blog
● Github: https://github.com/SeaseLtd
● Twitter: https://twitter.com/SeaseLtd
✓Relevance Testing
‣ Context overview
‣ Search System Status
‣ Information Need and Relevance Ratings
‣ Evaluation Measures
➢ Rated Ranking Evaluator (RRE)
➢ Future Works
Agenda
Search Quality Evaluation is the activity of
assessing how good a search system is.
Defining what good means depends on the interests
of whoever (stakeholder, developer, etc.) is doing the
evaluation.
So it is necessary to measure multiple metrics to
cover all the aspects of the perceived quality and
understand how the system is behaving.
Context Overview
Search Quality Evaluation
Software Quality
Internal Factors
External Factors
Correctness
Robustness
Extendibility
Reusability
Efficiency
Timeliness
Modularity
Readability
Maintainability
Testability
Maintainability
Understandability
Reusability
….
Focused on
Primarily focused on
Relevance Testing
In Information Retrieval, effectiveness is the ability
of the system to retrieve relevant documents while
suppressing the retrieval of non-relevant ones.
For each internal (gray) and external (red) iteration it
is vital to measure retrieval effectiveness variations.
Evaluation measures are used to assess how well the
search results satisfy the user's query intent.
Retrieval
Effectiveness
New system Existing system
Here are the
requirements
V1.0 has been released
Cool!
a month later…
We have a change
request.
We found a bug
We need to improve our
search system, users are
complaining about junk in
search results.
v0.1
…
v0.9
v1.1
v1.2
v1.3
…
v2.0
v2.0
In terms of retrieval effectiveness, how can we know
how the system performs across the various versions?
! Find anomalies in the data, such as weird feature
distributions, strange collected values, …
! Check how the ranking function performs before using
it in production: implement improvements, fix bugs, tune
parameters, …
! Save time and money: putting a bad ranking function in
production can worsen the user experience on the website.
Advantages:
[Offline] A Business Perspective
Relevance Testing: Ratings
A key concept in the calculation
of offline search quality
metrics is the relevance of a
document given a user
information need (query).
Before assessing the
correctness of the system it is
necessary to associate a
relevance rating to each pair
<query, document> involved in
our evaluation.
Assign Ratings
Ratings
Set
Explicit
Feedback
Implicit
Feedback
Judgements Collector
Interactions Logger
Example search results for the query "Queen music": Bohemian Rhapsody, Dancing Queen, Queen Albums
Relevance Testing: Measures
Evaluation Measures
Offline Measures: Average Precision, Mean Reciprocal Rank, Recall, NDCG, Precision, F-Measure, …
Online Measures: Click-through rate, Zero result rate, Session abandonment rate, Session success rate, …
We are focused on the offline ones
Evaluation measures for an information retrieval
system try to formalise how well a search system
satisfies its user information needs.
Measures are generally split into two categories:
online and offline measures.
In this context we will focus on offline measures.
Evaluation Measures
Relevance Testing: Evaluate a System
Input
Information Need with Ratings
e.g.
Set of queries with expected
resulting documents annotated
Metric
e.g.
Precision
Evaluation
System
Corpus of Information
Evaluate Results
Metric Score
0..1
Reproducibility
Keeping these factors locked
I am expecting the same Metric Score
➢ Relevance Testing
✓Rated Ranking Evaluator
‣ Rated Ranking Evaluator
‣ Information Need and Relevance Ratings
‣ Evaluation Measures
‣ Apache Solr/ES
‣ Search System Status
‣ Evaluation and Output
➢ Future Works
Agenda
Rated Ranking Evaluator: What is it?
• A set of Relevance Testing tools
• A search quality evaluation framework
• Multi (search) platform
• Written in Java
• It can also be used in non-Java projects
• Licensed under Apache 2.0
• Open to contributions
• Extremely dynamic!
RRE: What is it?
https://github.com/SeaseLtd/rated-ranking-evaluator
RRE: Ecosystem
The picture illustrates the main modules composing
the RRE ecosystem.
All modules with a dashed border are planned for a
future release.
RRE CLI has a double border because, although the
rre-cli module hasn't been developed yet, you can run
RRE from the command line using the RRE Maven
archetype, which is part of the current release.
As you can see, the current implementation includes
two target search platforms: Apache Solr and
Elasticsearch.
The Search Platform API module provides a search
platform abstraction for plugging in additional
search systems.
RRE Ecosystem
CORE
Plugin
Plugin
Reporting Plugin
Search
Platform
API
RequestHandler
RRE Server
RRE CLI
Plugin
Plugin
Plugin
Archetypes
Relevance Testing: Evaluate a System
INPUT
RRE Ratings
e.g.
Json representation of
Information Need with
related annotated
documents
Metric
e.g.
Precision
Evaluation
Apache Solr/ ES
Index (Corpus Of Information)
- Data
- Index Time Configuration
- Query Building(Search API)
- Query Time Configuration
Evaluate Results
Metric Score
0..1
Reproducibility
Running RRE with the same status
I am expecting the same metric score
RRE: Information Need Domain Model
• Rooted tree (the root is the Evaluation)
• Each level enriches the details of the information
need
• The corpus identifies the data collection
• The topic assigns a human-readable meaning
• Query groups expect the same results from their
child queries
The benefit of having a composite structure is clear:
we can see a metric value at different levels (e.g. a
query, all queries belonging to a query group, all
queries belonging to a topic or at corpus level)
RRE Domain Model
Evaluation
Corpus
1..*
Topic
QueryGroup
Query
1..*
1..*
1..*
Top level domain entity
dataset / collection to evaluate
High Level Information need
Query variants
Queries
RRE: Define Information Need and Ratings
Although the domain model structure is able to
capture complex scenarios, sometimes we want to
model simpler contexts.
In order to avoid verbose and redundant ratings
definitions it's possible to omit some levels.
Combinations accepted for each corpus are:
• only queries
• query groups and queries
• topics, query groups and queries
RRE Domain Model
Evaluation
Corpus
1..*
Doc 2 Doc 3 Doc N
Topic
QueryGroup
Query
1..*
1..*
1..*
…
= Optional
= Required
Doc 1
RRE: Json Ratings
Ratings files associate the RRE domain model
entities with relevance judgments. A ratings file
provides the association between queries and
relevant documents.
There must be at least one ratings file (otherwise no
evaluation happens). Usually there's a 1:1
relationship between a ratings file and a dataset.
Judgments, the most important part of this file,
consist of a list of all relevant documents for a
query group.
Each listed document has a corresponding “gain”
which is the relevancy judgment we want to assign
to that document.
Ratings
OR
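A minimal ratings sketch may help. It assumes the field names documented in the RRE wiki linked on the next slide; the index name, document ids and gains are made up for illustration:

  {
    "index": "chorus",
    "corpora_file": "chorus_products.json",
    "id_field": "id",
    "topics": [
      {
        "description": "Queen discography",
        "query_groups": [
          {
            "name": "queen albums",
            "queries": [
              { "template": "only_q.json", "placeholders": { "$query": "queen albums" } }
            ],
            "relevant_documents": {
              "doc-1": { "gain": 3 },
              "doc-2": { "gain": 2 }
            }
          }
        ]
      }
    ]
  }

The "OR" above refers to the alternative layout, which groups document ids by gain value instead of attaching a gain object to each id; see the wiki page for the exact syntax of both.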
Hands On: Relevance Ratings
https://github.com/SeaseLtd/rated-ranking-evaluator/wiki/What%20We%20Need%20To%20Provide#ratings
https://github.com/querqy/chorus/blob/master/rre/src/etc/ratings/chorus_demo_rre.json
RRE: Available metrics
These are the RRE built-in metrics which can be
used out of the box.
Most of them are computed at query level and then
aggregated at the upper levels.
However, compound metrics (e.g. MAP or GMAP)
are not explicitly declared or defined, because their
computation doesn't happen at query level: the
aggregation executed at the upper levels
automatically produces these metrics.
e.g.
the Average Precision computed for Q1, Q2, Q3, Qn
becomes the Mean Average Precision at Query
Group or Topic levels.
Available Metrics
Precision
Recall
Precision at 1 (P@1)
Precision at 2 (P@2)
Precision at 3 (P@3)
Precision at 10 (P@10)
Average Precision (AP)
Reciprocal Rank
Mean Reciprocal Rank
Mean Average Precision (MAP)
Normalised Discounted Cumulative Gain (NDCG)
F-Measure Compound Metric
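For reference, the aggregation described above can be written out explicitly (standard notation, not RRE-specific symbols):

  AP(q) = \frac{1}{|R_q|} \sum_{k=1}^{n} P@k(q) \cdot rel_q(k)
  \qquad
  MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q)

where rel_q(k) is 1 if the document at rank k is relevant for query q (0 otherwise), R_q is the set of documents rated relevant for q, and Q is the set of queries in the query group, topic or corpus being aggregated.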
Hands On: Available Metrics
https://github.com/SeaseLtd/rated-ranking-evaluator/wiki/Evaluation%20Measures
https://github.com/SeaseLtd/rated-ranking-evaluator/wiki/Maven%20Plugin#parameterized-metrics
https://github.com/SeaseLtd/rated-ranking-evaluator/tree/master/rre-core/src/main/java/io/sease/rre/core/domain/metrics/impl
Relevance Testing: Evaluate a System
INPUT
RRE Ratings
e.g.
Json representation of
Information Need with
related annotated
documents
Metric
e.g.
Precision
Evaluation
Apache Solr/ ES
Index (Corpus Of Information)
- Data
- Index Time Configuration
- Query Building(Search API)
- Query Time Configuration
Evaluate Results
Metric Score
0..1
Reproducibility
Running RRE with the same status
I am expecting the same metric score
Open Source Search Engines
Solr is the popular, blazing-fast, open
source enterprise search platform built on
Apache Lucene™
Elasticsearch is a distributed, RESTful search and
analytics engine capable of solving a growing
number of use cases.
System Status: Init Search Engine
Data
Configuration
Spins up an embedded
Search Platform
INPUT LAYER
EVALUATION LAYER
- an instance of ES/Solr is instantiated from
the input configurations
- data is populated from the input
- the instance is ready to respond to queries
and be evaluated
N.B. an alternative approach we are working on
is to target a QA instance that is already populated.
In that scenario it is vital to keep the configuration
and the data under version control.
Embedded
Search System Status: Index
- Data
Documents in input
- Index Time Configurations
Indexing Application Pipeline
Update Processing Chain
Text Analysis Configuration
Index
(Corpus of Information)
System Status: Configuration Sets
- Configurations evolve with time
Reproducibility: track it with version control
systems!
- RRE can take various versions of the configurations
in input and compare them
- The evaluation process allows you to define
inclusion / exclusion rules (e.g. include only
versions 1.0 and 2.0)
Index/Query Time
Configuration
System Status: Feed the Data
An evaluation execution can involve more than one
dataset targeting a given search platform.
A dataset consists of representative domain
data; although a compressed dataset can be
provided, it generally has a small/medium size.
Within RRE, corpus, dataset and collection are
synonyms.
Datasets must be located under a configurable
folder. Each dataset is then referenced in one or
more ratings files.
Corpus Of Information
(Data)
Hands On: Corpus of Information and Configurations
https://github.com/SeaseLtd/rated-ranking-evaluator/wiki/What%20We%20Need%20To%20Provide#corpus
https://github.com/SeaseLtd/chorus/tree/rre-workshop/rre/src/etc/configuration_sets
System Status: Query
- Search-API
Build the client query
- Query Time Configurations
Query Parser
Query Building
(Information Need)
Search-API
Query Parser
QUERY: The White Tiger
QUERY: ?q=the white tiger&qf=title,content^10&bf=popularity
QUERY: title:the white tiger OR
content:the white tiger …
System Status: Build the Queries
For each query (or query group) it's possible to
define a template, which is a kind of query shape
containing one or more placeholders.
Then, in the ratings file you can reference one of
those defined templates and you can provide a value
for each placeholder.
Templates have been introduced in order to:
• allow common query management across
search platforms
• define complex queries
• define runtime parameters that cannot be
statically determined (e.g. filters)
Query templates
only_q.json
filter_by_language.json
https://github.com/SeaseLtd/chorus/tree/rre-workshop/rre/src/etc/templates
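As a sketch of what such templates can look like for Solr (the parameter-map form and the $lang placeholder below are assumptions for illustration; the real files live in the Chorus folder linked above), only_q.json could be little more than a query parameter with a placeholder, and filter_by_language.json could add a filter:

  only_q.json
  {
    "q": "$query"
  }

  filter_by_language.json
  {
    "q": "$query",
    "fq": "language:$lang"
  }

Each query in the ratings file then references a template by file name and supplies a value for every placeholder it contains.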
Hands On: Query Templates
Relevance Testing: Evaluate a System
INPUT
RRE Ratings
e.g.
Json representation of
Information Need with
related annotated
documents
Metric
e.g.
Precision
Evaluation
Apache Solr/ ES
Index (Corpus Of Information)
- Data
- Index Time Configuration
- Query Building(Search API)
- Query Time Configuration
Evaluate Results
Metric Score
0..1
Reproducibility
Running RRE with the same status
I am expecting the same metric score
RRE: Evaluation process overview (1/2)
Data
Configuration
Ratings
Search Platform
uses a
produces
Evaluation Data
INPUT LAYER EVALUATION LAYER OUTPUT LAYER
JSON
RRE Console
…
used for generating
RRE: Evaluation process overview (2/2)
Runtime Container
RRE Core
Rating Files
Datasets
Queries
Starts the search
platform
Stops the search
platform
Creates & configures the index
Indexes
Executes
Computes
outputs the evaluation data
Init System
Set Status
Set Status
Stop System
RRE: Evaluation Output
The RRE Core itself is a library, so it outputs its
result as a plain Java object that must be used
programmatically.
However, when wrapped within a runtime container,
like the Maven plugin, the evaluation object tree is
marshalled in JSON format.
Being interoperable, the JSON format can be used by
some other component for producing a different kind
of output.
An example of such usage is the RRE Apache
Maven Reporting Plugin which can
• output a spreadsheet
• send the evaluation data to a running RRE Server
Evaluation output
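Purely as an illustration of that object tree (this is not the exact RRE JSON schema — inspect the JSON produced by an actual build for that), the marshalled output mirrors the domain model, with metric values per configuration version at each level:

  {
    "corpus": "chorus_demo",
    "topics": [
      {
        "name": "queen discography",
        "query_groups": [
          {
            "name": "queen albums",
            "queries": [
              {
                "query": "queen albums",
                "metrics": { "P@10": { "v1.0": 0.6, "v1.1": 0.8 } }
              }
            ]
          }
        ]
      }
    ]
  }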
RRE: Workbook
The RRE domain model (topics, groups and queries)
is on the left and each metric (on the right section)
has a value for each version / entity pair.
In case the evaluation process includes multiple
datasets, there will be a spreadsheet for each of
them.
This output format is useful when
• you want to have (or keep somewhere) a
snapshot of how the system performed at a
given moment
• the comparison includes a lot of versions
• you want to include all available metrics
Workbook
RRE: RRE Console
• SpringBoot/AngularJS app that shows real-time
information about evaluation results.
• Each time a build happens, the RRE reporting
plugin sends the evaluation result to a RESTful
endpoint provided by the RRE Console.
• The received data immediately updates the web
dashboard.
• Useful during development / tuning iterations
(you don't have to reopen the Excel report
again and again)
RRE Console
RRE: Iterative development & tuning
Dev, tune & Build
Check evaluation results
We are thinking about how
to fill a third monitor
• precision = ratio of relevant results among all the search results
returned
• precision@K = ratio of relevant results among the top-K search
results returned
What happens if precision@K changes?
• precision@K goes up: more relevant results in the top-K
• precision@K goes down: fewer relevant results in the top-K
Precision
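In formula form (standard definitions, with a made-up example):

  precision = \frac{|relevant \cap retrieved|}{|retrieved|}
  \qquad
  precision@K = \frac{|relevant \cap topK|}{K}

e.g. if 3 of the first 5 results are rated relevant, precision@5 = 3/5 = 0.6, regardless of how many relevant documents exist overall.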
Focus On : Precision@K
Hands On: Evaluation Process
• DCG@K = Discounted Cumulative Gain@K
It measures the usefulness, or gain, of a document based
on its position in the result list.
Normalised Discounted Cumulative Gain
• NDCG@K = DCG@K/ Ideal DCG@K
• It is in the range [0,1]
Relevance weight per result position (rows = positions 1–7):
Position  Model1  Model2  Model3  Ideal
1         1       2       2       4
2         2       3       4       3
3         3       2       3       2
4         4       4       2       2
5         2       1       1       1
6         0       0       0       0
7         0       0       0       0
DCG       14.01   15.76   17.64   22.60
NDCG      0.62    0.70    0.78    1.0
Normalized Discounted Cumulative Gain
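The table is consistent with the exponential-gain formulation of DCG, spelled out below and checked against the Ideal column:

  DCG@K = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i + 1)}
  \qquad
  NDCG@K = \frac{DCG@K}{Ideal\,DCG@K}

For the Ideal ranking (4, 3, 2, 2, 1): 15/1 + 7/1.585 + 3/2 + 3/2.322 + 1/2.585 ≈ 22.60, which matches the last column.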
• DCG@K = Discounted Cumulative Gain@K
Normalised Discounted Cumulative Gain
• NDCG@K = DCG@K/ Ideal DCG@K
• NDCG@K goes down: fewer relevant results, in worse positions, with lower relevance
• NDCG@K goes up: more relevant results, in better positions, with higher relevance
Normalized Discounted Cumulative Gain
Focus On : NDCG@K
Hands On: Evaluation Process
• ERR@K = Expected Reciprocal Rank@K
It measures the expected reciprocal of the rank at which the user
stops: the probability of stopping at document k, discounted by the
reciprocal of its position.
• It is in the range [0,1]
Expected Reciprocal Rank
The metric combines, for each rank r, the probability that the relevance grade at rank r satisfies the user with the probability that the grades at ranks i < r did not:

  ERR@K = \sum_{r=1}^{K} \frac{1}{r} \, R_r \prod_{i=1}^{r-1} (1 - R_i)

where the relevance grade g of a document is mapped to a satisfaction probability as

  R(g) := \frac{2^g - 1}{2^{g_{max}}}, \quad g \in \{0, \dots, g_{max}\}

so a non-relevant document (g = 0) has probability 0 of satisfying the user, while an extremely relevant one (g = g_max) has probability close to 1.

The underlying cascade user model (from Chapelle et al., "Expected Reciprocal Rank for Graded Relevance"): the user examines results top-down, is satisfied by the document at position i with probability R_i and stops; otherwise moves on to position i + 1. Once satisfied, the documents below the stopping position are never examined. The reciprocal 1/r plays the role of the utility (discount) function; a log_2(r + 1) discount, as in DCG, could be used instead.
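A small worked example with made-up grades: with g_max = 3 and the top three results graded 3, 0, 1, the satisfaction probabilities are R = (7/8, 0, 1/8), so

  ERR@3 = 1 \cdot \frac{7}{8} + \frac{1}{2}\left(1 - \frac{7}{8}\right) \cdot 0 + \frac{1}{3}\left(1 - \frac{7}{8}\right)(1 - 0)\cdot\frac{1}{8} \approx 0.875 + 0 + 0.005 = 0.88

The score is dominated by the highly relevant document in first position.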
Focus On : ERR@K
Hands On: Evaluation Process
• recall = ratio of relevant results found among all the relevant results
i.e. out of all the relevant results, how many have you found?
• recall@K = ratio of all the relevant results that you found in the top-K
i.e. out of all the relevant results, how many have you found in the
top-K?
It tells you how much of the whole relevant set the top-K is covering.
It can be useful to understand whether you need to increase the page
size, for recall-intensive applications.
What happens if recall@K changes?
• recall@K goes up: your top-K covers a bigger share of all the relevant results
• recall@K goes down: your top-K covers a smaller share of all the relevant results
Recall
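In formula form (standard definitions, with a made-up example):

  recall = \frac{|relevant \cap retrieved|}{|relevant|}
  \qquad
  recall@K = \frac{|relevant \cap topK|}{|relevant|}

e.g. if 10 documents are rated relevant for a query and 4 of them appear in the top 10, recall@10 = 4/10 = 0.4 (and precision@10 is also 0.4).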
• recall@K = ratio of all the relevant results that you found in the top-K
If you have a high precision@K but a very low recall@K, your search
result page is probably not big enough to satisfy recall-intensive
users.
Recall - further considerations
Focus On : Recall@K
Hands On: Evaluation Process
Focus On : Continuous Integration
Hands On: Evaluation Process Jenkins
RRE: Github Repository and Resources
Chorus Project Section
https://github.com/SeaseLtd/chorus/tree/rre-
workshop
https://github.com/SeaseLtd/
rated-ranking-evaluator
Github Repo
January 2021
Latest Blog
https://groups.google.com/g/rre-user
Mailing List
https://github.com/querqy/chorus
➢ Relevance Testing
➢ Rated Ranking Evaluator
✓Future Works
‣ RRE-Enterprise
‣ Judgement Collector Browser Plugin
Agenda
https://sease.io/2019/12/road-to-rated-ranking-evaluator-enterprise.html
● Implicit/Explicit Ratings generation
● Configurations from git repo/file system
● Query Discovery
● Search Engine Id Discovery
● Search platform support through Docker
● Advanced Results dashboard
● Advanced exploration of results
● C.I. integration tools
What’s coming next?
● First prototype from the
Sease
Machine Learning
Front End Division
● Explicit ratings collection
● Browser plugin
● Configurable for your Search Engine UI
(and more will come later)
● Integrated with RRE-Enterprise
https://videos.files.wordpress.com/yvZ0MQBh/judgementcollector_hd.mp4
N.B. the project is under development, the final version may change slightly
What’s coming next?
Thanks!