Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Search Quality Evaluation to Help Reproducibility: An Open-source Approach


Published on

Every information retrieval practitioner ordinarily struggles with the task of evaluating how well a search engine is performing and to reproduce the performance achieved in a specific point in time.
Improving the correctness and effectiveness of a search system requires a set of tools which help measuring the direction where the system is going.
Additionally it is extremely important to track the evolution of the search system in time and to be able to reproduce and measure the same performance (through metrics of interest such as precison@k, recall, NDCG@k...).

The talk will describe the Rated Ranking Evaluator from a researcher and software engineer perspective.
RRE is an open source search quality evaluation tool, that can be used to produce a set of reports about the quality of a system, iteration after iteration and that could be integrated within a continuous integration infrastructure to monitor quality metrics after each release .
Focus of the talk will be to raise public awareness of the topic of search quality evaluation and reproducibility describing how RRE could help the industry.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Search Quality Evaluation to Help Reproducibility: An Open-source Approach

  1. 1. Search Quality Evaluation to Help Reproducibility:
 An Open-source Approach Alessandro Benedetti, Software Engineer 18th April 2019
  2. 2. Who I am ▪ Search Consultant ▪ R&D Software Engineer ▪ Master in Computer Science ▪ Apache Lucene/Solr Enthusiast ▪ Semantic, NLP, Machine Learning Technologies passionate ▪ Beach Volleyball Player & Snowboarder Alessandro Benedetti
  3. 3. Sease Search Services ● Open Source Enthusiasts ● Apache Lucene/Solr experts ! Community Contributors ● Active Researchers ● Hot Trends : Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
  4. 4. ✓ Search Quality Evaluation ‣ Context overview ‣ Search System Status ‣ Information Need and Relevancy Ratings ‣ Evaluation Measures ➢ Rated Ranking Evaluator (RRE) ➢ Future Works Agenda
  5. 5. Search Quality Evaluation is the activity of assessing how good a search system is.
 Defining what good means depends on the interests of who(stakeholder, developer, ect…) is doing the evaluation.
 So it is necessary to measure multiple metrics to cover all the aspects of the perceived quality and understand how the system is behaving.
 Context Overview Search Quality Evaluation Search Quality Internal Factors External Factors Correctness RobustnessExtendibility Reusability Efficiency Timeliness Modularity Readability Maintainability Testability Maintainability Understandability Reusability …. Focused on Primarily focused on
  6. 6. Search Quality: Correctness In Information Retrieval Correctness is the ability of a system to meet the information needs of its users. For each internal (gray) and external (red) iteration it is vital to measure correctness variations. Evaluation measures are used to assert how well the search results satisfies the user's query intent. Correctness New system Existing system Here are the requirements V1.0 has been released Cool! a month later… We have a change request. We found a bug We need to improve our search system, users are complaining about junk in search results. v0.1 … v0.9 v1.1 v1.2 v1.3 … v2.0 v2.0 In terms of correctness, how can we know the system performance across various versions?
  7. 7. Search Quality: Relevancy Ratings A key concept in the calculation of offline search quality metrics is the relevance of a document given a user information need(query). Before assessing the correctness of the system it is necessary to associate a relevancy rating to each pair <query, document> involved in our evaluation. 
 Assign Ratings Ratings Set Explicit Feedback Implicit Feedback Judgements Collector Interactions Logger Queen music Bohemian
 Rhapsody D a n c i n g Queen Queen 
 Albums Bohemian
 Rhapsody D a n c i n g Queen Queen 
  8. 8. Search Quality: Measures Evaluation Measures Online Measures Offline Measures Average Precision Mean Reciprocal Rank Recall NDCG Precision Click-through rate F-Measure Zero result rate Session abandonment rate Session success rate …. …. We are mainly focused here Evaluation measures for an information retrieval system try to formalise how well a search system satisfies its user information needs. Measures are generally split into two categories: online and offline measures. In this context we will focus on offline measures. Evaluation Measures
  9. 9. Search Quality: Evaluate a System Input
 Information Need with Ratings
 Set of queries with expected resulting documents annotated Metric
 Evaluation System Corpus of Information
 Evaluate Results Metric Score 0..1 Reproducibility Keeping these factors locked
 I am expecting the same Metric Score
  10. 10. ➢ Search Quality Evaluation ✓An Open Source Approach(RRE) ‣ Apache Solr/ES ‣ Search System Status ‣ Rated Ranking Evaluator ‣ Information Need and Relevancy Ratings ‣ Evaluation Measures ‣ Evaluation and Output ➢ Future Works Agenda
  11. 11. Open Source Search Engines Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™ Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases.
  12. 12. Search System Status: Index - Data 
 Documents in input - Index Time Configurations 
 Indexing Application Pipeline
 Update Processing Chain
 Text Analysis Configuration Index
 (Corpus of Information)
  13. 13. System Status: Query - Search-API 
 Build the client query - Query Time Configurations 
 Query Parser Query Building (Information Need) Search-API Query Parser QUERY: The White Tiger QUERY: ?q=the white tiger&qf=title,content^10&bf=popularity QUERY: title:the white tiger OR content:the white tiger …
  14. 14. RRE: What is it? • A set of search quality evaluation tools • A search quality evaluation framework • Multi (search) platform • Written in Java • It can be used also in non-Java projects • Licensed under Apache 2.0 • Open to contributions • Extremely dynamic! RRE: What is it?
  15. 15. RRE: Ecosystem The picture illustrates the main modules composing the RRE ecosystem. All modules with a dashed border are planned for a future release. RRE CLI has a double border because although the rre-cli module hasn’t been developed, you can run RRE from a command line using RRE Maven archetype, which is part of the current release. As you can see, the current implementation includes two target search platforms: Apache Solr and Elasticsearch. The Search Platform API module provide a search platform abstraction for plugging-in additional search systems. RRE Ecosystem CORE Plugin Plugin Reporting Plugin Search Platform API RequestHandler RRE Server RRE CLI Plugin Plugin Plugin Archetypes
  16. 16. RRE: Reproducibility in Evaluating a System INPUT
 RRE Ratings
 Json representation of Information Need with related annotated documents Metric
 Evaluation Apache Solr/ ES Index (Corpus Of Information)
 - Data
 - Index Time Configuration - Query Building(Search API) - Query Time Configuration Evaluate Results Metric Score 0..1 Reproducibility Running RRE with the same status 
 I am expecting the same metric score
  17. 17. RRE: Information Need Domain Model • Rooted Tree (the Root is the Evaluation) • each level enriches the details of the information need • The corpus identify the data collection • The topic assign a human readable semantic • Query groups expect the same results from the children The benefit of having a composite structure is clear: we can see a metric value at different levels (e.g. a query, all queries belonging to a query group, all queries belonging to a topic or at corpus level) RRE Domain ModelEvaluation Corpus 1..* Topic Query Group Query 1..* 1..* 1..* Top level domain entity dataset / collection to evaluate High Level Information need Query variants Queries
  18. 18. RRE: Define Information Need and Ratings Although the domain model structure is able to capture complex scenarios, sometimes we want to model simpler contexts. In order to avoid verbose and redundant ratings definitions it’s possible to omit some level. 
 Combinations accepted for each corpus are: • only queries • query groups and queries • topics, query groups and queries RRE Domain ModelEvaluation Corpus 1..* Doc 2 Doc 3 Doc N Topic Query Group Query 1..* 1..* 1..* … = Optional = Required Doc 1
  19. 19. RRE: Json Ratings Ratings files associate the RRE domain model entities with relevance judgments. A ratings file provides the association between queries and relevant documents. There must be at least one ratings file (otherwise no evaluation happens). Usually there’s a 1:1 relationship between a rating file and a dataset. Judgments, the most important part of this file, consist of a list of all relevant documents for a query group. Each listed document has a corresponding “gain” which is the relevancy judgment we want to assign to that document. Ratings OR
  20. 20. RRE: Available metrics These are the RRE built-in metrics which can be used out of the box. The most part of them are computed at query level and then aggregated at upper levels. However, compound metrics (e.g. MAP, or GMAP) are not explicitly declared or defined, because the computation doesn’t happen at query level. The result of the aggregation executed on the upper levels will automatically produce these metric. e.g.
 the Average Precision computed for Q1, Q2, Q3, Qn becomes the Mean Average Precision at Query Group or Topic levels. Available Metrics Precision Recall Precision at 1 (P@1) Precision at 2 (P@2) Precision at 3 (P@3) Precision at 10 (P@10) Average Precision (AP) Reciprocal Rank Mean Reciprocal Rank Mean Average Precision (MAP) Normalised Discounted Cumulative Gain (NDCG) F-Measure Compound Metric
  21. 21. RRE: Reproducibility in Evaluating a System INPUT
 RRE Ratings
 Json representation of Information Need with r e l a t e d a n n o t a t e d documents Metric
 Evaluation Apache Solr/ ES Index (Corpus Of Information)
 - Data
 - Index Time Configuration - Query Building(Search API) - Query Time Configuration Evaluate Results Metric Score 0..1 Reproducibility Running RRE with the same status 
 I am expecting the same metric score
  22. 22. System Status: Init Search Engine Data Configuration Spins up an embedded
 Search Platform INPUT LAYER EVALUATION LAYER - an instance of ES/ Solr is instantiated from 
 the input configurations - Data is populated from the input - The Instance is ready to respond to queries
 and be evaluated N.B. an alternative approach we are working on
 is to target a QA instance already populated
 In that scenario is vital to keep version controlled the configuration and data Embedded
  23. 23. System Status: Configuration Sets - Configurations evolve with time
 Reproducibility: track it with version control systems! - RRE can take various version of configurations in input to compare them - The evaluation process allows you to define inclusion / exclusion rules (i.e. include only version 1.0 and 2.0) Index/Query Time
  24. 24. System Status: Feed the Data An evaluation execution can involve more than one datasets targeting a given search platform. A dataset consists consists of representative domain data; although a compressed dataset can be provided, generally it has a small/medium size. Within RRE, corpus, dataset, collection are synonyms. Datasets must be located under a configurable folder. Each dataset is then referenced in one or more ratings file. Corpus Of Information
  25. 25. System Status: Build the Queries For each query or query group) it’s possible to define a template, which is a kind of query shape containing one or more placeholders. Then, in the ratings file you can reference one of those defined templates and you can provide a value for each placeholder. Templates have been introduced in order to: • allow a common query management between search platforms • define complex queries • define runtime parameters that cannot be statically determined (e.g. filters) Query templates only_q.json filter_by_language.json
  26. 26. RRE: Reproducibility in Evaluating a System INPUT
 RRE Ratings
 Json representation of Information Need with r e l a t e d a n n o t a t e d documents Metric
 Evaluation Apache Solr/ ES Index (Corpus Of Information)
 - Data
 - Index Time Configuration - Query Building(Search API) - Query Time Configuration Evaluate Results Metric Score 0..1 Reproducibility Running RRE with the same status 
 I am expecting the same metric score
  27. 27. RRE: Evaluation process overview (1/2) Data Configuration Ratings Search Platform uses a produces Evaluation Data INPUT LAYER EVALUATION LAYER OUTPUT LAYER JSON RRE Console … used for generating
  28. 28. RRE: Evaluation process overview (2/2) Runtime Container RRE Core Rating Files Datasets Queries Starts the search platform Stops the search platform Creates & configure the index Indexes data Executes query Computes metric outputs the evaluation data Init System Set Status Set Status Stop System
  29. 29. RRE: Evaluation Output The RRE Core itself is a library, so it outputs its result as a Plain Java object that must be programmatically used. However when wrapped within a runtime container, like the Maven Plugin, the evaluation object tree is marshalled in JSON format. Being interoperable, the JSON format can be used by some other component for producing a different kind of output. An example of such usage is the RRE Apache Maven Reporting Plugin which can • output a spreadsheet • send the evaluation data to a running RRE Server Evaluation output
  30. 30. RRE: Workbook The RRE domain model (topics, groups and queries) is on the left and each metric (on the right section) has a value for each version / entity pair. In case the evaluation process includes multiple datasets, there will be a spreadsheet for each of them. This output format is useful when • you want to have (or maintain somewhere) a snapshot about how the system performed in a given moment • the comparison includes a lot of versions • you want to include all available metrics Workbook
  31. 31. RRE: RRE Console • SpringBoot/AngularJS app that shows real-time information about evaluation results. • Each time a build happens, the RRE reporting plugin sends the evaluation result to a RESTFul endpoint provided by RRE Console. • The received data immediately updates the web dashboard with fresh data. • Useful during the development / tuning phase iterations (you don’t have to open again and again the excel report) RRE Console
  32. 32. RRE: Iterative development & tuning Dev, tune & Build Check evaluation results We are thinking about how to fill a third monitor
  33. 33. RRE: We are working on… “I think if we could create a simplified pass/fail report for the business team, that would be ideal. So they could understand the tradeoffs of the new search.” “Many search engines process the user query heavily before it's submitted to the search engine in whatever DSL is required, and if you don't retain some idea of the original query in the system how can you” relate the test results back to user behaviour? Do I have to write all judgments manually?? How can I use RRE if I have a custom search platform? Java is not in my stack Can I persist the evaluation data?
  34. 34. RRE: Github Repository and Resources • A sample RRE-enabled project • No Java code, only configuration • Search Platform: Elasticsearch 6.3.2 • Seven example iterations • Index shapes & queries from Relevant Search [1] • Dataset: TMBD (extract) Demo Project rre-demo-walkthrough [1] rated-ranking-evaluator Github Repo rated-ranking-evaluator.html Blog article
  35. 35. Thanks!