
Rated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation


Every team working on Information Retrieval software struggles with the task of evaluating how well their system performs in terms of search quality (at a specific point in time and historically).

Evaluating search quality is important both to understand and size the improvement or regression of your search application across the development cycles, and to communicate such progress to relevant stakeholders.

To satisfy these requirements, a helpful tool must be:

flexible and highly configurable for a technical user
immediate, visual and concise for an optimal business utilization
In the industry, and especially in the open source community, the landscape is quite fragmented: such requirements are often met with ad-hoc partial solutions that each time require a considerable amount of development and customization effort.

To provide a standard, unified and approachable technology, we developed the Rated Ranking Evaluator (RRE), an open source tool for evaluating and measuring the search quality of a given search infrastructure. RRE is modular, compatible with multiple search technologies and easy to extend. It is composed of a core library and a set of modules and plugins that give it the flexibility to be integrated into automated evaluation processes and continuous integration flows.

This talk will introduce RRE, describe its latest developments, and demonstrate how it can be integrated into a project to measure and assess the search quality of your search application.

The focus of the presentation will be on a live demo showing an example project with a set of initial relevancy issues that we will solve iteration after iteration, using RRE's output feedback to gradually drive the improvement process until we reach an optimal balance between the quality evaluation measures.



  1. 1. Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation Alessandro Benedetti, Software Engineer
 Andrea Gazzarini, Software Engineer 24th April 2019
  2. 2. Who I am ▪ Search Consultant ▪ R&D Software Engineer ▪ Master in Computer Science ▪ Apache Lucene/Solr Enthusiast ▪ Passionate about Semantics, NLP and Machine Learning technologies ▪ Beach Volleyball Player & Snowboarder Alessandro Benedetti
  3. 3. Who we are ▪ Software Engineer (1999-) ▪ “Hermit” Software Engineer (2010-) ▪ Passionate about Java & Information Retrieval ▪ Apache Qpid (past) Committer ▪ Husband & Father ▪ Bass Player Andrea Gazzarini, “Gazza”
  4. 4. Sease Search Services ● Open Source Enthusiasts ● Apache Lucene/Solr Experts ● Community Contributors ● Active Researchers ● Hot Trends: Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
  5. 5. ✓ Search Quality Evaluation ‣ Context overview ‣ Search System Status ‣ Information Need and Relevancy Ratings ‣ Evaluation Measures ➢ Rated Ranking Evaluator (RRE) ➢ Future Works Agenda
  6. 6. Search Quality Evaluation is the activity of assessing how good a search system is.
 Defining what good means depends on the interests of who (stakeholder, developer, etc.) is doing the evaluation.
 So it is necessary to measure multiple metrics to cover all the aspects of the perceived quality and understand how the system is behaving.
 Context Overview: search quality splits into external factors (correctness, robustness, extendibility, reusability, efficiency, timeliness) and internal factors (modularity, readability, maintainability, testability, understandability, reusability, …); here we are primarily focused on correctness.
  7. 7. Search Quality: Correctness In Information Retrieval, correctness is the ability of a system to meet the information needs of its users. For each internal (gray) and external (red) iteration it is vital to measure correctness variations. Evaluation measures are used to assert how well the search results satisfy the user's query intent. [Storyboard: a new system goes from "Here are the requirements" through v0.1 … v0.9 to a v1.0 release; an existing system moves through v1.1, v1.2, v1.3 … v2.0 driven by change requests, bug reports and users complaining about junk in search results.] In terms of correctness, how can we know the system performance across various versions?
  8. 8. Search Quality: Relevancy Ratings A key concept in the calculation of offline search quality metrics is the relevance of a document given a user information need (query). Before assessing the correctness of the system it is necessary to associate a relevancy rating with each <query, document> pair involved in our evaluation. Ratings can be assigned through explicit feedback (a judgements collector) or implicit feedback (an interactions logger), producing the ratings set. [Diagram: the query "Queen music" rated against documents such as "Bohemian Rhapsody", "Dancing Queen" and "Queen Albums".]
  9. 9. Search Quality: Measures Evaluation measures for an information retrieval system try to formalise how well a search system satisfies its user information needs. Measures are generally split into two categories: online measures (click-through rate, zero result rate, session abandonment rate, session success rate, …) and offline measures (Precision, Recall, F-Measure, Average Precision, Mean Reciprocal Rank, NDCG, …). In this context we are mainly focused on offline measures.
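To make the offline measures above concrete, here is a minimal sketch (illustrative only, not RRE code) of two of them, Precision@k and Reciprocal Rank, computed from a ranked result list and a set of relevant document ids:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant_ids) / len(top_k)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant result, 0.0 if none is found."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

results = ["doc3", "doc1", "doc7", "doc2"]   # ranked output of the system
relevant = {"doc1", "doc2"}                  # rated as relevant for the query
print(precision_at_k(results, relevant, 2))  # 0.5 (one of the top 2 is relevant)
print(reciprocal_rank(results, relevant))    # 0.5 (first relevant hit at rank 2)
```

Each query in an evaluation yields such per-query scores, which can then be averaged across queries, as the later slides on aggregation show.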
  10. 10. Search Quality: Evaluate a System Input: an information need with ratings, i.e. a set of queries annotated with the expected resulting documents, plus a corpus of information indexed by the system under evaluation. Evaluating the results produces a metric score between 0 and 1. Reproducibility: keeping these factors locked, I am expecting the same metric score.
  11. 11. ➢ Search Quality Evaluation ✓ An Open Source Approach (RRE) ‣ Apache Solr/ES ‣ Search System Status ‣ Rated Ranking Evaluator ‣ Information Need and Relevancy Ratings ‣ Evaluation Measures ‣ Evaluation and Output ➢ Future Works Agenda
  12. 12. Open Source Search Engines Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™ Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases.
  13. 13. Search System Status: Index The index (the corpus of information) is determined by the documents in input and the index-time configurations: the indexing application pipeline, the update processing chain and the text analysis configuration.
  14. 14. System Status: Query The information need is turned into a query by the Search-API (which builds the client query) and the query-time configurations: query parsing and query building. QUERY: The White Tiger → QUERY: ?q=the white tiger&qf=title,content^10&bf=popularity → QUERY: title:the white tiger OR content:the white tiger …
  15. 15. RRE: What is it? • A set of search quality evaluation tools • A search quality evaluation framework • Multi (search) platform • Written in Java • It can also be used in non-Java projects • Licensed under Apache 2.0 • Open to contributions • Extremely dynamic!
  16. 16. RRE: Ecosystem The picture illustrates the main modules composing the RRE ecosystem. All modules with a dashed border are planned for a future release. RRE CLI has a double border because, although the rre-cli module hasn't been developed, you can run RRE from the command line using the RRE Maven archetype, which is part of the current release. As you can see, the current implementation includes two target search platforms: Apache Solr and Elasticsearch. The Search Platform API module provides a search platform abstraction for plugging in additional search systems. [Diagram: RRE Core surrounded by Reporting, Search Platform API, RequestHandler, RRE Server, RRE CLI, archetypes and various plugins.]
  17. 17. RRE: Reproducibility in Evaluating a System Input: RRE ratings (a JSON representation of the information need with related annotated documents) plus an Apache Solr/ES index (the corpus of information): data, index-time configuration, query building (Search API) and query-time configuration. Evaluating the results produces a metric score between 0 and 1. Reproducibility: running RRE with the same status, I am expecting the same metric score.
  18. 18. RRE: Information Need Domain Model • Rooted tree (the root is the evaluation) • Each level enriches the details of the information need • The corpus identifies the data collection • The topic assigns a human-readable semantic • Query groups expect the same results from their child queries The benefit of having a composite structure is clear: we can see a metric value at different levels (e.g. a query, all queries belonging to a query group, all queries belonging to a topic, or at corpus level). [Diagram: Evaluation → Corpus (1..*) → Topic (1..*) → Query Group (1..*) → Query (1..*), from the top-level domain entity down through dataset/collection, high-level information need and query variants to the queries themselves.]
  19. 19. RRE: Define Information Need and Ratings Although the domain model structure is able to capture complex scenarios, sometimes we want to model simpler contexts. In order to avoid verbose and redundant ratings definitions it's possible to omit some levels. The combinations accepted for each corpus are: • only queries • query groups and queries • topics, query groups and queries [Diagram: the domain model tree with optional and required levels marked.]
  20. 20. RRE: Json Ratings Ratings files associate the RRE domain model entities with relevance judgments: a ratings file provides the association between queries and relevant documents. There must be at least one ratings file (otherwise no evaluation happens), and usually there's a 1:1 relationship between a ratings file and a dataset. Judgments, the most important part of this file, consist of a list of all relevant documents for a query group. Each listed document has a corresponding “gain”, which is the relevancy judgment we want to assign to that document.
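As a rough illustration of what such a file looks like, here is a sketch of a ratings file following the structure described above (corpus → topic → query group → queries, with per-document gains). The field names and values below are illustrative rather than the exact RRE schema; consult the RRE documentation for the real format:

```json
{
  "index": "albums",
  "corpora_file": "albums.json",
  "topics": [
    {
      "description": "Queen music",
      "query_groups": [
        {
          "name": "bohemian rhapsody",
          "queries": [
            { "template": "only_q.json", "placeholders": { "$query": "bohemian rhapsody" } }
          ],
          "relevant_documents": {
            "doc_1": { "gain": 3 },
            "doc_2": { "gain": 2 }
          }
        }
      ]
    }
  ]
}
```

The judgments block maps each relevant document id to its gain, which the evaluation uses when computing graded metrics such as NDCG.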
  21. 21. RRE: Available Metrics These are the RRE built-in metrics which can be used out of the box. Most of them are computed at query level and then aggregated at the upper levels. Compound metrics (e.g. MAP or GMAP), however, are not explicitly declared or defined, because their computation doesn't happen at query level: the aggregation executed on the upper levels automatically produces them. E.g. the Average Precision computed for Q1, Q2, Q3, … Qn becomes the Mean Average Precision at the Query Group or Topic level. Available metrics: Precision, Recall, Precision at 1 (P@1), Precision at 2 (P@2), Precision at 3 (P@3), Precision at 10 (P@10), Average Precision (AP), Reciprocal Rank, Mean Reciprocal Rank, Mean Average Precision (MAP, compound), Normalised Discounted Cumulative Gain (NDCG), F-Measure.
  22. 22. RRE: Reproducibility in Evaluating a System Input: RRE ratings (a JSON representation of the information need with related annotated documents) plus an Apache Solr/ES index (the corpus of information): data, index-time configuration, query building (Search API) and query-time configuration. Evaluating the results produces a metric score between 0 and 1. Reproducibility: running RRE with the same status, I am expecting the same metric score.
  23. 23. System Status: Init Search Engine From the input layer (data and configuration), RRE spins up an embedded search platform in the evaluation layer: an instance of ES/Solr is instantiated from the input configurations, data is populated from the input, and the instance is ready to respond to queries and be evaluated. N.B. an alternative approach we are working on is to target an already-populated QA instance; in that scenario it is vital to keep the configuration and data under version control.
  24. 24. System Status: Configuration Sets Configurations (index/query time) evolve with time; for reproducibility, track them with version control systems! RRE can take various versions of the configurations in input to compare them, and the evaluation process allows you to define inclusion/exclusion rules (i.e. include only versions 1.0 and 2.0).
  25. 25. System Status: Feed the Data An evaluation execution can involve more than one dataset targeting a given search platform. A dataset consists of representative domain data; although a compressed dataset can be provided, it generally has a small/medium size. Within RRE, corpus, dataset and collection are synonyms. Datasets must be located under a configurable folder; each dataset is then referenced in one or more ratings files.
  26. 26. System Status: Build the Queries For each query (or query group) it's possible to define a template, which is a kind of query shape containing one or more placeholders. Then, in the ratings file, you can reference one of those defined templates and provide a value for each placeholder. Templates have been introduced in order to: • allow a common query management between search platforms • define complex queries • define runtime parameters that cannot be statically determined (e.g. filters) Example query templates: only_q.json, filter_by_language.json
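As an illustration of the idea, a template like the only_q.json mentioned above might contain an Elasticsearch-style query body with a placeholder that the ratings file fills in per query. This is a hedged sketch assuming an Elasticsearch target and an illustrative field name, not the literal content of the demo's template:

```json
{
  "query": {
    "match": {
      "title": "$query"
    }
  }
}
```

At evaluation time, each query entry in the ratings file references the template and binds a value (e.g. "bohemian rhapsody") to the $query placeholder, so the same information need can be replayed unchanged against different configurations or platforms.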
  27. 27. RRE: Reproducibility in Evaluating a System Input: RRE ratings (a JSON representation of the information need with related annotated documents) plus an Apache Solr/ES index (the corpus of information): data, index-time configuration, query building (Search API) and query-time configuration. Evaluating the results produces a metric score between 0 and 1. Reproducibility: running RRE with the same status, I am expecting the same metric score.
  28. 28. RRE: Evaluation Process Overview (1/2) The input layer (data, configuration, ratings) is used by the evaluation layer, where RRE drives the search platform and produces the evaluation data as JSON; the output layer then uses that JSON for generating reports (e.g. the RRE Console).
  29. 29. RRE: Evaluation Process Overview (2/2) A runtime container hosts the RRE Core, which consumes rating files, datasets and queries. Lifecycle: init system (start the search platform), set status (create & configure the index, index data), execute the queries and compute the metrics, output the evaluation data, then stop the search platform.
  30. 30. RRE: Evaluation Output The RRE Core itself is a library, so it outputs its result as a plain Java object that must be used programmatically. However, when wrapped within a runtime container like the Maven plugin, the evaluation object tree is marshalled into JSON. Being interoperable, the JSON format can be used by other components for producing different kinds of output. An example of such usage is the RRE Apache Maven Reporting Plugin, which can • output a spreadsheet • send the evaluation data to a running RRE Server
  31. 31. RRE: Workbook The RRE domain model (topics, groups and queries) is on the left, and each metric (in the right section) has a value for each version/entity pair. In case the evaluation process includes multiple datasets, there will be a spreadsheet for each of them. This output format is useful when • you want to have (or maintain somewhere) a snapshot of how the system performed at a given moment • the comparison includes a lot of versions • you want to include all available metrics
  32. 32. RRE: RRE Console • SpringBoot/AngularJS app that shows real-time information about evaluation results. • Each time a build happens, the RRE reporting plugin sends the evaluation result to a RESTful endpoint provided by the RRE Console. • The received data immediately updates the web dashboard. • Useful during the development/tuning phase iterations (you don't have to open the Excel report again and again).
  33. 33. RRE: Iterative development & tuning Dev, tune & Build Check evaluation results We are thinking about how to fill a third monitor
  34. 34. RRE: We are working on… “I think if we could create a simplified pass/fail report for the business team, that would be ideal. So they could understand the tradeoffs of the new search.” “Many search engines process the user query heavily before it's submitted to the search engine in whatever DSL is required, and if you don't retain some idea of the original query in the system how can you relate the test results back to user behaviour?” Do I have to write all judgments manually? How can I use RRE if I have a custom search platform? Java is not in my stack. Can I persist the evaluation data?
  35. 35. RRE: Github Repository and Resources Demo project (rre-demo-walkthrough): • A sample RRE-enabled project • No Java code, only configuration • Search platform: Elasticsearch 6.3.2 • Seven example iterations • Index shapes & queries from Relevant Search [1] • Dataset: TMDB (extract) Resources: rated-ranking-evaluator (Github repo), rated-ranking-evaluator.html (blog article)
  36. 36. Thanks!