Let's Build an Inverted Index: Introduction to Apache Lucene/Solr


The University Seminar series aims to provide a basic understanding of Open Source Information Retrieval and its application in the real world through the Apache Lucene/Solr technologies.

Published in: Engineering

  1. Seminars: Let’s Build an Inverted Index: Introduction to Apache Lucene/Solr
     Alessandro Benedetti, Software Engineer
     Andrea Gazzarini, Software Engineer
     28th November 2019
  2. Alessandro Benedetti
     ▪ R&D Software Engineer
     ▪ Search Consultant
     ▪ Director
     ▪ Master's Degree in Computer Science
     ▪ Apache Lucene/Solr Enthusiast
     ▪ Passionate about Semantic, NLP and Machine Learning Technologies
     ▪ Conference Speaker
     ▪ Beach Volleyball Player & Snowboarder
  3. Andrea Gazzarini, “Gazza”
     ▪ Software Engineer (1999-)
     ▪ “Hermit” Software Engineer (2010-)
     ▪ Passionate about Java & Information Retrieval
     ▪ Apache Qpid (past) Committer
     ▪ Husband & Father
     ▪ Bass Player
  4. Search Services
     ● London Based - Italian made :)
     ● Open Source Enthusiasts
     ● Apache Lucene/Solr experts
     ● Community Contributors
     ● Active Researchers
     ● Hot Trends: Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
  5. Who we are
  6. Why should you use Open Source?
     • State of the Art / very valid technologies
     • Community Support
     • Vast Documentation
     • Code is accessible!
     • Customisable
     • Mostly free licensing
  7. Why should you contribute to Open Source?
     • Share knowledge and ideas
     • Improve established technologies
     • Become part of a Community
     • Not only code - all your skills are relevant!
     • Be useful to the world
  8. We only deal with Open Source Information Retrieval … Revenue?
     ● Training - Beginner/Intermediate/Advanced/Ad Hoc for Information Retrieval, Apache Lucene/Solr, Search Relevance, Learning To Rank…
     ● Consulting - Open Source Software is ubiquitous. Expertise? Not really
     ● R&D Projects - Cheaper and more flexible for companies using Open Source
     ● IR Projects - From the collection of client requirements to software delivery
  9. Clients
  10. Information Retrieval
      “Information retrieval (IR) is the activity of obtaining information system resources relevant to an information need from a collection of information resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for metadata that describe data, and for databases of texts, images or sounds.” (Wikipedia)
      [Diagram labels: Information Need, Corpus]
  11. Apache Lucene
      • High-performance, scalable information retrieval software *library*
      • Adds search capabilities to your applications
      • Cohesive and simple interface, which hides a really complex world
      • Open Source: Apache Top Level Project
  12. Apache Lucene - Brief History
      2019: Lucene 8.3.0 (November)
  13. Apache Solr
      • Highly reliable, scalable and fault tolerant search *server*
      • A Lucene “serverization” with a lot of additional features
      • All services are exposed through an HTTP (REST-like) interface
      • Written in Java
      • Rich ecosystem for building enterprise-level applications (Plugins, Integrations, Clients)
      • Open Source: Apache Top Level Project
      “Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.”
  15. Apache Solr - Brief History
      2019: Version 8.3.0 (November)
  16. The Inverted Index
      The Inverted Index is the basic data structure used by Lucene to provide search over a corpus of documents.
      From Wikipedia: “In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents.”
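The mapping from content to documents described in the definition above can be sketched in a few lines of Python. This is only a toy illustration under simple assumptions (lowercasing and whitespace splitting stand in for real text analysis, and only document ids are stored, not positions); the function name `build_inverted_index` and the sample documents are made up and are not part of Lucene.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis (an assumption): lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical three-document corpus.
docs = {
    1: "Lucene is a search library",
    2: "Solr is a search server",
    3: "Solr is built on Lucene",
}
index = build_inverted_index(docs)
print(index["lucene"])  # [1, 3]
print(index["search"])  # [1, 2]
```

Looking up a term is then a single dictionary access, which is why search does not need to scan the documents themselves.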
  17. The Lucene Document
      [Diagram labels: Document, Field Name, Field Value]
      • Documents are the unit of information for indexing and search.
      • A Document is a set of fields.
      • Each field has a name and a value.
  18. The Lucene Inverted Index
  19. The Lucene Inverted Index
      • Lucene directory (in memory, on disk, memory mapped)
      • Collection of immutable segments (each one a fully working index)
      • Each segment is composed of a set of binary files [1]
      Indexes evolve by:
      1. Creating new segments for newly added documents.
      2. Merging existing segments.
      [1] Lucene File Format Documentation
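The two ways an index evolves (new segments for new documents, merging of existing segments) can be illustrated with a small Python sketch. It is a conceptual model only: real Lucene segments are sets of binary files, while here each segment is a read-only dictionary. The names `new_segment`, `merge` and `search` are hypothetical helpers, not Lucene APIs.

```python
from types import MappingProxyType


def new_segment(docs):
    """Build a read-only segment: term -> tuple of ids of documents containing it."""
    postings = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            postings.setdefault(term, []).append(doc_id)
    return MappingProxyType({t: tuple(sorted(ids)) for t, ids in postings.items()})


def merge(segments):
    """Combine several segments into one new segment; the inputs are left untouched."""
    merged = {}
    for segment in segments:
        for term, ids in segment.items():
            merged.setdefault(term, []).extend(ids)
    return MappingProxyType({t: tuple(sorted(ids)) for t, ids in merged.items()})


def search(segments, term):
    """A search consults every segment and combines the hits."""
    hits = []
    for segment in segments:
        hits.extend(segment.get(term, ()))
    return sorted(hits)


# Adding documents creates new segments instead of modifying existing ones.
segments = [new_segment({1: "solr search"}), new_segment({2: "lucene search"})]
print(search(segments, "search"))  # [1, 2]

# Merging replaces many small segments with one larger segment.
segments = [merge(segments)]
print(search(segments, "search"))  # [1, 2]
```

Because segments are never modified after creation, readers can keep using old segments while new ones are being written or merged.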
  20. Schema Configuration
      • Per collection/index
      • XML file
      • Defines how the inverted index will be built
      • Field/Field Type definitions
  21. Schema Configuration
      • Define flexible expressions for groups of fields
      • Shared attributes for each field instance
      • Copy the source content to a destination field
      • Allows running multiple analysis chains on the same content
  22. Field Type
      • Defines how the individual terms (in the inverted index) are generated from the content
      Index Time: analysis chain executed when building the index
      Query Time: analysis chain executed when building the query
  23. Text Analysis
      • Only text field types (e.g. solr.TextField or subclasses) have a text analysis chain associated
      An analyzer can define:
      • Zero or more CharFilters
      • One and only one Tokenizer
      • Zero or more TokenFilters
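The shape of an analysis chain (char filters, then exactly one tokenizer, then token filters) can be sketched in Python as plain function composition. This is a simplified model, not Solr code: the component functions below are illustrative stand-ins, the stopword list is an assumption, and character offsets are not tracked.

```python
import re


def strip_html(text):
    """Char filter: pre-process raw characters (here, replace simple HTML tags with spaces)."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer: break the character stream into tokens."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: transform each token."""
    return [token.lower() for token in tokens]


def stop_filter(tokens, stopwords=frozenset({"to", "be", "or"})):
    """Token filter: discard tokens found in a (hypothetical) stopword list."""
    return [token for token in tokens if token not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more char filters, one tokenizer, then zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


terms = analyze(
    "<b>To be</b> or Lucene",
    char_filters=[strip_html],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter, stop_filter],
)
print(terms)  # ['lucene']
```

The order of the token filters matters: here lowercasing runs before stopword removal, so "To" is matched against the lowercase stopword list.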
  24. Char Filters
      • A CharFilter is a component that pre-processes input characters.
      • CharFilters can be chained like Token Filters and placed in front of a Tokenizer.
      • CharFilters can add, change, or remove characters while preserving the original character offsets to support features like highlighting.
  25. Tokenizers
      Tokenizers are responsible for breaking field data into lexical units, or tokens. [1]
  26. Token Filters
      Filters [1] examine a stream of tokens and keep them, transform them or discard them, depending on the filter type being used.
  27. Word Delimiter Filter
      • Improve recall
      • Dedicated Filters: solr.WordDelimiterGraphFilterFactory [1]
      Example: Default behavior. The whitespace tokenizer is used here to preserve non-alphanumeric characters.
        <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.WordDelimiterGraphFilterFactory"/>
          <filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.WordDelimiterGraphFilterFactory"/>
        </analyzer>
      In: "hot-spot RoboBlaster/9000 100XL"
      Tokenizer to Filter: "hot-spot", "RoboBlaster/9000", "100XL"
      Out: "hot", "spot", "Robo", "Blaster", "9000", "100", "XL"
  28. Stopword Filters
      • Reduce index size
      • Can improve precision (removing terms with low semantic value)
      • Can improve recall
      • Dedicated Filters: solr.StopFilterFactory, solr.ManagedStopFilterFactory [1]
      Example:
        <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        </analyzer>
      In: "To be or what?"
      Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
      Out: "what"(4)
  29. Stemmers
      • Improve Recall
      • Reduce index size
      • Dedicated Filters: solr.EnglishMinimalStemFilterFactory, solr.HunspellStemFilterFactory, solr.KStemFilterFactory, solr.PorterStemFilterFactory, solr.SnowballPorterFilterFactory [1]
      Example:
        <analyzer type="index">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.EnglishMinimalStemFilterFactory"/>
        </analyzer>
      In: "dogs cats"
      Tokenizer to Filter: "dogs", "cats"
      Out: "dog", "cat"
  30. Synonym Filters [1/2]
      • Improve Recall
      • Dedicated Filters: solr.SynonymGraphFilterFactory [1]
      • Index Time -> affects term distributions, needs re-indexing
      • Query Time -> more flexible
      Example synonyms file:
        couch,sofa,divan
        teh => the
        huge,ginormous,humungous => large
        small => tiny,teeny,weeny
  31. Synonym Filters [2/2]
      • Improve Recall
      • Dedicated Filters: solr.SynonymGraphFilterFactory [1]
      Example:
        <analyzer type="index">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
          <filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
        </analyzer>
      In: "teh small couch"
      Tokenizer to Filter: "teh"(1), "small"(2), "couch"(3)
      Out: "the"(1), "tiny"(2), "teeny"(2), "weeny"(2), "couch"(3), "sofa"(3), "divan"(3)
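To make the expansion above concrete, here is a minimal Python sketch of single-token synonym expansion using the four rules from the synonyms example. It is not how solr.SynonymGraphFilterFactory is implemented (there is no multi-word synonym support and no token graph); the function names `parse_synonyms` and `expand` are invented for illustration.

```python
def parse_synonyms(lines):
    """Parse rules of the form 'a,b,c' (equivalent terms) or 'a,b => x,y' (explicit mapping)."""
    mapping = {}
    for line in lines:
        if "=>" in line:
            left, right = line.split("=>")
            targets = [term.strip() for term in right.split(",")]
            for source in left.split(","):
                mapping[source.strip()] = targets
        else:
            group = [term.strip() for term in line.split(",")]
            for term in group:
                mapping[term] = group
    return mapping


def expand(tokens, mapping):
    """Replace each token by its synonyms; all synonyms share the original token position."""
    output = []
    for position, token in enumerate(tokens, start=1):
        for term in mapping.get(token, [token]):
            output.append((term, position))
    return output


rules = [
    "couch,sofa,divan",
    "teh => the",
    "huge,ginormous,humungous => large",
    "small => tiny,teeny,weeny",
]
mapping = parse_synonyms(rules)
result = expand(["teh", "small", "couch"], mapping)
print(result)
# [('the', 1), ('tiny', 2), ('teeny', 2), ('weeny', 2), ('couch', 3), ('sofa', 3), ('divan', 3)]
```

Note how the explicit mapping `small => tiny,teeny,weeny` drops the original term, while the equivalence list `couch,sofa,divan` keeps it.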
  32. Keep Word Filter
      • Help in Entity tagging
      • Dedicated Filters: solr.KeepWordFilterFactory [1]
      Example:
        <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
        </analyzer>
      In: "Happy, sad or funny"
      Tokenizer to Filter: "Happy", "sad", "or", "funny"
      Out: "Happy", "funny"
  33. N-Gram Filtering
      • Improve Recall
      • Ideal for autocompletion
      • Dedicated Filters: solr.EdgeNGramFilterFactory, solr.NGramFilterFactory [1]
        <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="4"/>
        </analyzer>
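As an illustration of why edge n-grams suit autocompletion, the following Python sketch generates the prefixes of a token for the same sizes used in the configuration above (minGramSize=1, maxGramSize=4). The function `edge_ngrams` is a hypothetical helper written for this example, not part of Solr.

```python
def edge_ngrams(token, min_gram=1, max_gram=4):
    """Return the prefixes of `token` whose length is between min_gram and max_gram."""
    upper = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, upper + 1)]


print(edge_ngrams("apache"))  # ['a', 'ap', 'apa', 'apac']
print(edge_ngrams("of"))      # ['o', 'of']
```

If these prefixes are indexed as terms, a partially typed query such as "apa" matches the document as an ordinary term lookup.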
  34. Phonetic Matching
      • Improve Recall
      • Dedicated Filters: solr.BeiderMorseFilterFactory, solr.DaitchMokotoffSoundexFilterFactory, solr.DoubleMetaphoneFilterFactory, solr.PhoneticFilterFactory [1]
        <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto">
          </filter>
        </analyzer>
  35. Common Grams Filter
      • Improve Precision
      • Useful for phrase queries [1]
      Example:
        <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        </analyzer>
      In: "the Cat"
      Tokenizer to Filter: "the", "Cat"
      Out: "the_cat"
  36. Field Attributes
  37. Field Attributes
  38. Solr Text Analysis - Hands On!
      • Analysis Screen from Solr Admin
      • Let’s explore the schema.xml
  39. Indexing
      • Using the Solr Cell framework, built on Apache Tika, for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats. (Recommended for prototyping and exercises)
      • Uploading XML files by sending HTTP requests to the Solr server from any environment where such requests can be generated. (Recommended for prototyping and exercises)
      • Writing a custom Java application to ingest data through Solr’s Java Client API (which is described in more detail in Client APIs). Using the Java API may be the best choice if you’re working with an application, such as a Content Management System (CMS), that offers a Java API.
  40. Indexing
      • Indexing is the procedure of building an index from the input documents
      • Transaction Log (rotating on hard commits)
      • Index built in memory
      • Soft commits (visibility)
      • Hard commits (durability)
      • openSearcher=true (visibility)
      • Auto commit
      • Merge policy
  41. Lucene Score
      In order to measure the relevancy of a given result, Solr (Lucene) assigns it a “score”.
      The formula behind the score computation is beyond the scope of this course; however, important things that contribute to that formula are:
      • Term Frequency (TF): how many times a given term occurs within a single document
      • Document Frequency (DF): how many documents in the dataset contain a given term
      • TF-IDF: the product of the term frequency and the inverse document frequency (in its simplest form, 1/DF)
      • Field length: how many terms compose a field
      • Boosting: functions or, in general, things that boost the score computed for a given match. Boosting can be applied at index time (deprecated now) or at query time
      Score values cannot be compared across queries, or even with the same query but with a different index.
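The interplay between term frequency and document frequency can be illustrated with a small Python sketch. It uses a textbook TF-IDF variant (raw term count multiplied by the logarithm of N/DF), chosen here as an assumption for readability; it is not the formula Lucene actually uses, and the corpus and function names are made up.

```python
import math


def term_frequency(term, doc_tokens):
    """TF: how many times the term occurs within a single document."""
    return doc_tokens.count(term)


def document_frequency(term, corpus):
    """DF: how many documents in the corpus contain the term."""
    return sum(1 for tokens in corpus if term in tokens)


def tf_idf(term, doc_tokens, corpus):
    """Textbook TF-IDF: TF multiplied by log(N / DF); 0.0 if the term is unknown."""
    doc_freq = document_frequency(term, corpus)
    if doc_freq == 0:
        return 0.0
    return term_frequency(term, doc_tokens) * math.log(len(corpus) / doc_freq)


# Hypothetical, already tokenized corpus.
corpus = [
    "solr is a search server".split(),
    "lucene is a search library".split(),
    "solr is built on lucene".split(),
]

for term in ("is", "solr", "server"):
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

A term that appears in every document ("is") contributes nothing, while a term found in a single document ("server") contributes the most, which is the intuition behind combining TF with an inverse document frequency.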
  42. BM25 Term Scorer
      • Originates from Probabilistic Information Retrieval
      • Default Similarity since Lucene 6.0 [1]
      • 25th iteration in improving TF-IDF
      • TF
      • IDF
      • Document (Field) Length
      • Configuration parameters
      [1] LUCENE-6789
  43. BM25 Term Scorer - Inverse Document Frequency
      • IDF Score: has very similar behavior
  44. BM25 Term Scorer - Term Frequency
      • TF Score: approaches (k+1) asymptotically
      • k=1.2 in this example
  45. BM25 Term Scorer - Document Length
      • The ratio Document Length / Avg Document Length affects how fast the TF score saturates
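The saturation behaviour described in the BM25 slides can be checked numerically with a short Python sketch of the textbook BM25 term-frequency component. This is an assumption-laden illustration: the formula is the classical one from the literature, the length-normalization parameter b (with the common default 0.75) does not appear in the slides, and Lucene's own implementation may differ in constant factors.

```python
def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 TF component: freq * (k1 + 1) / (freq + k1 * (1 - b + b * dl / avgdl))."""
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1) / (freq + length_norm)


# The score grows with the term frequency but never exceeds k1 + 1 = 2.2.
for freq in (1, 2, 5, 50, 1000):
    print(freq, round(bm25_tf(freq, doc_len=100, avg_doc_len=100), 3))

# The same frequency saturates faster in a short document than in a long one.
print(round(bm25_tf(2, doc_len=50, avg_doc_len=100), 3))
print(round(bm25_tf(2, doc_len=200, avg_doc_len=100), 3))
```

Unlike raw term frequency, repeating a term many times yields diminishing returns, and documents longer than average need more occurrences to reach the same TF score.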
  46. Basic Search
      The list is not exhaustive and is not statically defined, because it depends on the query parser.
      Some parameters (e.g. filters) accept more than one value:
  47. Queries
      Query (q=field:value)
      • Regulated by Query Parsers
      • Calculates scores
      • Cached with results order preserved
      Filter Query (fq=field:value)
      • Regulated by Query Parsers
      • Does not calculate scores
      • Cached independently
      • Reusable
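A request combining one scoring query with several filter queries can be assembled as in the Python sketch below, using only the standard library. The host, port and the collection name `books` are hypothetical, as are the field names in the example; `build_select_url` is a helper invented for this illustration, and the request is only built, not sent.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint: adjust host, port and collection name to your setup.
BASE_URL = "http://localhost:8983/solr/books/select"


def build_select_url(query, filters=(), **extra):
    """Build a search URL with one q parameter and zero or more fq parameters."""
    params = [("q", query)]
    # fq can be repeated: each filter is sent as its own parameter.
    params += [("fq", filter_query) for filter_query in filters]
    params += sorted(extra.items())
    return BASE_URL + "?" + urlencode(params)


url = build_select_url(
    'title:"tale cities"~2',
    filters=["downloads:[1000 TO 2000]", "language:en"],
    debug="results",
)
print(url)
```

Keeping restrictions such as ranges or categories in fq rather than in q means they do not influence the score and can be cached and reused independently of the main query.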
  48. Query Parsers
      • The main responsibility of the query parser is to understand the input query syntax and build a Lucene query
      • This is the first component involved in the query execution chain
      • If it is not specified, then a default parser is used (Solr Standard Query Parser)
      • Solr comes with several available and ready-to-use query parsers
      • The query parameter “defType” defines the query parser that will be used in a request
  49. Standard Query Parser
      Parameter: Description
      • q: Defines a query using standard query syntax. This parameter is mandatory.
      • q.op: Specifies the default operator for query expressions, overriding the default operator specified in the Schema. Possible values are "AND" or "OR".
      • df: Specifies a default field, overriding the definition of a default field in the Schema.
      • sow: Split on whitespace. If set to false, whitespace-separated term sequences will be provided to text analysis in one shot, enabling proper function of analysis filters that operate over term sequences, e.g. multi-word synonyms and shingles. If set to true, text analysis is invoked separately for each individual whitespace-separated term. (The default was true up to Solr 6.x and is false since Solr 7.0.)
  50. Standard Query Parser
      • Phrase Search: q=title:"a tale of two cities"
      • Wildcard Search
      • Fuzzy Search: q=title:cties~1
      • Proximity Search: q=title:"tale cities"~2
      • Range Search: downloads:[1000 TO 2000], author:{Ada TO Carmen}
      • Boosted Search: q=tale of two cities^100 bunny
      • Constant Score Search: AND subjects:(war stories)^=4
      • Boolean Search: (field1:term1) AND (field2:term1)
  51. Date Queries
      Queries against fields using the TrieDateField type (typically range queries) should use the appropriate date syntax [1]:
      • timestamp:[* TO NOW]
      • createdate:[1976-03-06T23:59:59.999Z TO *]
      • createdate:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]
      • pubdate:[NOW-1YEAR/DAY TO NOW/DAY+1DAY]
      • createdate:[1976-03-06T23:59:59.999Z TO 1976-03-06T23:59:59.999Z+1YEAR]
      • createdate:[1976-03-06T23:59:59.999Z/YEAR TO 1976-03-06T23:59:59.999Z]
      Timezone
      By default, all date math expressions are evaluated relative to the UTC TimeZone, but the TZ parameter can be specified to override this behaviour.
      N.B. Regardless of the locale Solr runs in, only ISO-8601 dates are supported in requests.
  52. Solr Query Debug - Hands On!
      • debug=query: return debug information about the query only.
      • debug=timing: return debug information about how long the query took to process.
      • debug=results: return debug information about the score results (also known as "explain").
  53. Master Thesis: Click Models to Estimate Relevancy Ratings from User Interactions
      The main responsibilities of the candidate will be to:
      • learn basic concepts of Agile methodologies for software engineering
      • learn details of Search Quality Evaluation
      • grasp the fundamentals of click modelling, implicit and explicit relevancy feedback
      • design and implement the module in an existing Spring Boot REST service application
      • benchmark the solution(s) through a careful quality/performance (time/space) analysis
  54. Master Thesis: Search Quality Evaluation for Continuous Integration Tools
      The main responsibilities of the candidate will be to:
      • learn basic concepts of Agile methodologies for software engineering
      • get familiar with Apache Lucene based search engines (Apache Solr/Elasticsearch)
      • learn details of Search Quality Evaluation
      • grasp the fundamentals of Continuous Integration and Continuous Deployment through well-established, industry-level technologies
      • design and implement plugins for Jenkins, Atlassian Bamboo and JetBrains