THE TYPED INDEX
Christoph Goller
christoph.goller@intrafind.de

Chief Scientist at IntraFind Software AG
Presented by Christoph Goller, Chief Scientist, IntraFind Software AG

If you want to search in a multilingual environment with high-quality, language-specific word normalization, handle mixed-language documents, add phonetic search for names, or provide a semantic search that distinguishes between the color "brown" and a person with the last name "Brown", you have to deal with different types of terms. I will show why it makes much more sense to attach types (prefixes) to Lucene terms than to rely on different fields or even different indexes for different kinds of terms. Furthermore, I will show what queries against such a typed index look like and why, for example, SpanQueries are needed to treat compound words and phrases correctly or to realize a reasonable phonetic search. The Analyzers and the QueryParser described are available as plugins for Lucene, Solr, and Elasticsearch.

Published in: Technology
The Typed Index

  1. THE TYPED INDEX
     Christoph Goller (christoph.goller@intrafind.de), Chief Scientist at IntraFind Software AG
  2. Outline
     • IntraFind Software AG
     • Analyzers, Inverted File Index
     • Different Types of Terms
     • Why do we need them in one field?
     • The Typed Index
     • Multilingual Search / Mixed Language Documents
  3. A few words about me and about IntraFind
  4. IntraFind Software AG
     • Specialist for Information Retrieval and Enterprise Search
     • Founding of the company: October 2000
     • More than 850 customers, mainly in Germany, Austria, and Switzerland
     • Employees: 30
     • Lucene Committers: B. Messer, C. Goller
     • Independent Software Vendor, entirely self-financed
     • Products are a combination of Open Source Components and in-house Development
     • Support (up to 7x24), Services, Training
     • Focus on Quality / Text Analytics / SOA Architecture
       – Linguistic Analyzers for most European Languages
       – Semantic Search
       – Named Entity Recognition
       – Text Classification
       – Clustering
  5. Selected Customers
  6. Analyzers and the Inverted File Index
  7. Analysis / Tokenization
     Break stream of characters into tokens / terms
     • Normalization (e.g. case)
     • Stop Words
     • Stemming
     • Lemmatizer / Decomposer
     • Part of Speech Tagger
     • Information Extraction
  8. Inverted File Index
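The analysis chain and the inverted file index from slides 7 and 8 can be illustrated with a minimal sketch. It is not Lucene code: the stop list, the whitespace tokenizer, and the names `analyze` and `build_index` are assumptions made for this example only.

```python
from collections import defaultdict

# Illustrative stop list; the slides do not specify one.
STOP_WORDS = {"the", "a", "of"}


def analyze(text):
    """Yield (position, term) pairs: whitespace split, lowercase, stop-word removal."""
    for position, token in enumerate(text.split()):
        term = token.lower().strip(".,;:!?")
        if term and term not in STOP_WORDS:
            yield position, term


def build_index(docs):
    """Build a positional inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in analyze(text):
            index[term].setdefault(doc_id, []).append(position)
    return index


docs = {1: "The book of names", 2: "A book about Munich"}
index = build_index(docs)
```

Positions are counted before stop words are dropped, so the gaps left by removed words are preserved. Phrase and near queries depend on exactly this kind of position information.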
  9. Different Term Normalizations, Different Types of Terms
  10. Morphological Analyzer vs. Stemming
      • Lemmatizer: maps words to their base forms
        – English: going -> go (Verb); bought -> buy (Verb); bags -> bag (Noun); bacteria -> bacterium (Noun)
        – German: lief -> laufen (Verb); rannte -> rennen (Verb); Bücher -> Buch (Noun); Taschen -> Tasche (Noun)
      • Decomposer: decomposes words into their compounds
        – Kinderbuch (children's book) -> Kind (Noun) | Buch (Noun)
        – Versicherungsvertrag (insurance contract) -> Versicherung (Noun) | Vertrag (Noun)
      • Stemmer: usually simple algorithm (huge collection of stemmers available in Lucene contributions)
        – going -> go
        – decoder, decoding, decodes -> decod
        – Overstemming: Messer -> mess ? king -> k ? several, server -> server ?
        – Understemming: spoke -> speak
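The contrast on slide 10 can be reproduced with a toy example. The suffix stripper below is deliberately naive and is not one of the Lucene stemmers; the lemma table is a hand-written stand-in for a real morphological lexicon. Both, along with the names `naive_stem` and `lemmatize`, are assumptions for illustration.

```python
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "er", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, with no check that a sensible stem remains."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix):
            return w[: -len(suffix)]
    return w


# Tiny hand-written lexicon: surface form -> (base form, part of speech).
LEMMAS = {
    "going": ("go", "Verb"),
    "bought": ("buy", "Verb"),
    "bags": ("bag", "Noun"),
    "bacteria": ("bacterium", "Noun"),
    "king": ("king", "Noun"),
    "messer": ("Messer", "Noun"),
    "spoke": ("speak", "Verb"),
}


def lemmatize(word):
    """Look the word up in the lexicon; unknown words are returned unchanged."""
    return LEMMAS.get(word.lower(), (word, None))
```

With these rules, `naive_stem` conflates decoder, decoding, and decodes as intended, but it also reduces king to k and Messer to mess, while spoke and speak stay apart. The lexicon lookup avoids all three problems, at the cost of needing a dictionary per language.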
  11. Bad Precision with Algorithmic Stemmer
  12. High Recall and High Precision with Morphological Analyzers
  13. High Recall and High Precision with Morphological Analyzers
  14. Word Decomposition and Search: Federal Ministry for Family Affairs
  15. Why do we need other Normalizations?
      • Stemmers / Lemmatizers are language-specific
      • MultiTermQueries: WildcardQuery, FuzzyQuery
        – no stemming, no lemmatization
        – should work on original terms generated from Tokenizer
        – only very simple normalizations such as: Citroën -> Citroen
        – in Solr: <analyzer type="multiterm">
      • Case-Sensitive
        – Stemmers / Lemmatizers map everything to lowercase
        – sometimes case matters: MAN vs. man
      • Phonetic Search (Double Metaphone):
        – Mazlum -> MSLM; Muslim -> MSLM
        – book -> PK; books -> PKS
        – Kaother Tabai -> K0R TP, Kouther Tapei -> K0R TP
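The "very simple normalization" mentioned for MultiTermQueries (Citroën -> Citroen) can be sketched with Python's standard library. This only illustrates the idea: it is not the filter used in Lucene or Solr, and the function name `fold_diacritics` is an assumption. Double Metaphone is not implemented here.

```python
import unicodedata


def fold_diacritics(term):
    """Remove combining marks after canonical decomposition; case is left untouched."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


folded = fold_diacritics("Citroën")
```

Because the function does not lowercase, MAN and man remain distinct terms, which is what the case-sensitive normalization on the slide requires.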
  16. Named Entity Recognition (NER)
      Automated extraction of information from unstructured data
      • People names
      • Company names
      • Brands from product lists
      • Technical key figures from technical data (raw materials, product types, order IDs, process numbers, eClass categories)
      • Names of streets and locations
      • Currency and accounting values
      • Dates
      • Phone numbers, email addresses, hyperlinks
  17. Why do we need these different types of terms in one field?
  18. Why do we need them in one field?
      • Query: "MAN sagt" (PhraseQuery / NearQuery!)
        Matching Document: "MAN sagte", not "man sagte"
      • Query: "book of Kouther Tapei" (PhraseQuery / NearQuery!)
        Matching Document: books of Kaouther Tabai
        – For book to match books we need a stemmer or a lemmatizer
        – For the names to match we need phonetics
      • Query: Mazlum
        – It leads to matches for the very frequent word Muslim
        – Users want: Give me phonetic matches for Mazlim but not Muslim
        – Mazlum=P AND NOT Muslim=E doesn't do the job!
          No match for "Mazlum is a member of the Muslim society in Munich"
        – spanNot(spanOr([body:V_mazzlim, body:F_MSLM]), body:V_muslim)
        – New Syntax: <Mazlim=P BUTNOT Muslim=E>
      • Query: Persons near synonyms of founding and Microsoft
        "E_Person found Microsoft" (PhraseQuery / NearQuery)
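Why a document-level AND NOT fails where a span-level exclusion succeeds can be shown with a position-based sketch. It is not Lucene's SpanNotQuery: every span here is a single position, so exclusion reduces to a set difference. The phonetic codes are the ones quoted on slide 15, hard-coded instead of computed, and terms keep their original casing. All function names are assumptions.

```python
# Phonetic codes as quoted on the slides; a real system would compute them.
PHONETIC = {"Mazlum": "MSLM", "Muslim": "MSLM"}


def typed_terms(token):
    """Return all typed terms indexed at one position of the field."""
    terms = {"V_" + token}
    if token in PHONETIC:
        terms.add("F_" + PHONETIC[token])
    return terms


def positions(doc_tokens, term):
    """Positions in the document at which the typed term occurs."""
    return {i for i, token in enumerate(doc_tokens) if term in typed_terms(token)}


def span_or(doc_tokens, terms):
    """Union of the positions of all given terms."""
    result = set()
    for term in terms:
        result |= positions(doc_tokens, term)
    return result


def span_not(include, exclude):
    """Keep included positions that are not also excluded positions."""
    return include - exclude


doc = "Mazlum is a member of the Muslim society in Munich".split()
include = span_or(doc, ["V_Mazlum", "F_MSLM"])
exclude = positions(doc, "V_Muslim")

span_hits = span_not(include, exclude)
document_level_match = bool(include) and not exclude
```

The phonetic term matches at both name positions, but only the occurrence of Muslim is removed, so the document is still found through Mazlum. The document-level check rejects the document outright because Muslim occurs somewhere in it.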
  19. Semantic Search
      Question: Wer hat Microsoft gegründet? (Who founded Microsoft?)
  20. Semantic Search
      Question: Wo liegen Werke von Audi? (Where are Audi's plants located?)
  21. The Typed Index: Multilingual Search, Mixed Language Documents
  22. The Typed Index
      • We need different types of terms in one field
      • Types are term properties: payloads are not a good option
      • Use prefixes to distinguish them:
        – V_ for fullforms (case sensitive)
        – N_ for diacritics normalizations
        – F_ for phonetic normal forms
        – E_ for entities
          • E_Person, E_Location, E_Organization
          • E_PersonName_Brown, E_Location_Munich
        – B_ for baseforms: B_Noun_book, B_Verb_fly, …
      • Multilingual Search is handled in the same way: B_EN_NOUN_book, B_DE_NOUN_buch
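The prefix scheme can be made concrete with a small sketch in which every token contributes several typed terms at the same position of a single field. The baseform table is a hypothetical stand-in for a lemmatizer, and `typed_terms`, `index_field`, and `phrase_match` are names invented for this example. The query mirrors the "MAN sagt" case from slide 18.

```python
from collections import defaultdict

# Hypothetical lemmatizer output: lowercase surface form -> typed baseform term.
BASEFORMS = {"sagt": "B_Verb_sagen", "sagte": "B_Verb_sagen"}


def typed_terms(token):
    """Case-sensitive fullform plus, where known, the typed baseform."""
    terms = ["V_" + token]
    baseform = BASEFORMS.get(token.lower())
    if baseform is not None:
        terms.append(baseform)
    return terms


def index_field(text):
    """Index one field: typed term -> set of positions."""
    postings = defaultdict(set)
    for position, token in enumerate(text.split()):
        for term in typed_terms(token):
            postings[term].add(position)
    return postings


def phrase_match(postings, terms):
    """True if the terms occur at consecutive positions, in order."""
    starts = postings.get(terms[0], set())
    return any(
        all(start + offset in postings.get(term, set())
            for offset, term in enumerate(terms))
        for start in starts
    )


query = ["V_MAN", "B_Verb_sagen"]
matches_company = phrase_match(index_field("MAN sagte"), query)
matches_pronoun = phrase_match(index_field("man sagte"), query)
```

Because fullforms and baseforms share one position space, an ordinary phrase check can combine a case-sensitive term with a lemmatized one. With separate fields, the positions would have to be aligned across fields first.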
  23. Multilingual Search: Standard Approach
      Generate a language-specific copy of every content-field:
      – configure language-specific analyzers for the language-specific fields
      – Indexing: Adapt indexing chain to determine document language, generate new language-specific fields
      – Search: Use MultiFieldQueryParser to expand query to every language-specific field
      – Highlighting: depending on document-language call Highlighter for language-specific fields with the respective analyzer
      – no solution for mixed-language documents
  24. Multilingual Search and the Typed Index
      Choose analyzer depending on language but do not use different fields:
      – Analyzers generate terms typed with language: B_EN_NOUN_book, B_DE_NOUN_buch
      – Indexing: choose analyzer in indexing chain based on language
      – Search: Use a special MultiAnalyzerQueryParser to expand query to every language
      – Highlighting: choose analyzer based on language and apply it to content-field
      – Advantage: you could implement a multi-language analyzer for handling mixed-language documents, which switches language even within paragraphs.
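A rough sketch of language-typed terms and query expansion in a single field follows. The two miniature lexicons and the functions `analyze` and `expand_query` are assumptions; the sketch does not reproduce the MultiAnalyzerQueryParser named on the slide, only the idea of running a query token through every language's analyzer.

```python
# Hypothetical per-language lexicons: lowercase surface form -> (part of speech, base form).
LEMMATIZERS = {
    "EN": {"book": ("NOUN", "book"), "books": ("NOUN", "book"), "hand": ("NOUN", "hand")},
    "DE": {"buch": ("NOUN", "buch"), "bücher": ("NOUN", "buch"), "hand": ("NOUN", "hand")},
}


def analyze(token, language):
    """Return language-typed baseform terms for a token, or [] if the word is unknown."""
    entry = LEMMATIZERS[language].get(token.lower())
    if entry is None:
        return []
    pos_tag, lemma = entry
    return [f"B_{language}_{pos_tag}_{lemma}"]


def expand_query(token):
    """Run the token through every language's analyzer and collect all typed terms."""
    terms = []
    for language in sorted(LEMMATIZERS):
        terms.extend(analyze(token, language))
    return terms


english_terms = analyze("books", "EN")
german_terms = analyze("Bücher", "DE")
expanded = expand_query("Hand")
```

All terms land in the same field, so a query for a word that exists in several languages becomes a disjunction of typed terms instead of a query across several language-specific fields.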
  25. Summary: Advantages of the Typed Index over a Multi-Field Index
      • Keep positions aligned in an easier way
      • Only tokenize once: Performance!
      • Reuse existing Queries like PhraseQueries, MultiPhraseQueries
      • Treatment for Mixed-Language Documents: Use Lemmatizer Results to switch between languages
  26. Thanks for listening. Questions?
      By the way: Our Analyzers are available as Plugins for Lucene / Solr / ElasticSearch
      Dr. Christoph Goller
      Phone: +49 89 3090446-0
      Fax: +49 89 3090446-29
      Email: christoph.goller@intrafind.de
      Web: www.intrafind.de
      IntraFind Software AG, Landsberger Straße 368, 80687 München, Germany