Searching for AI
Leveraging Solr for classic
Artificial Intelligence tasks
Alexandre Rafalovitch
July 2018
What are we discussing today
1. Search IS AI (NLP/Information Retrieval)
2. NGrams (on letters and terms)
Example: Count-based Named Entity Recognition
3. OpenNLP (Statistical methods/ML)
Example: ML-based Named Entity Recognition
4. Gazetteer
Example: Lookup-based Named Entity Recognition
5. Significant Terms (query parser) example
6. Semantic Knowledge Graph (facets) example
Basic text processing pipeline - English
<fieldType name="text_en_splitting_tight" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="0"
            generateNumberParts="0" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query"> ....
Basic text processing pipeline - Farsi
<fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- for ZeroWidthNonJoiner -->
    <charFilter class="solr.PersianCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.PersianNormalizationFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_fa.txt"/>
  </analyzer>
</fieldType>
Complex text processing pipeline - Thai
<!--
  1) tokenize Thai text with built-in rules and dictionary
  2) map it to Latin characters (with special accents indicating tones)
  3) get rid of tone marks, as nobody uses them
  4) do some phonetic (Beider-Morse) broadening to match possible alternative spellings in English
-->
<fieldType name="thai_english" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Thai-Latin" />
    <filter class="solr.ICUTransformFilterFactory"
            id="NFD; [:Nonspacing Mark:] Remove; NFC"/>
    <filter class="solr.BeiderMorseFilterFactory"/>
  </analyzer>
  <analyzer type="query">...
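Step 3 of the Thai pipeline (the "NFD; [:Nonspacing Mark:] Remove; NFC" transform) can be illustrated outside Solr. The sketch below is only an approximation using Python's standard unicodedata module rather than ICU; the function name strip_marks and the sample strings are assumptions made for illustration.

```python
import unicodedata


def strip_marks(text):
    """Decompose, drop nonspacing marks (category Mn), then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


# Accents that would carry tone information in a romanized form are removed.
print(strip_marks("mǎi"))
print(strip_marks("café"))
```

Removing the marks before the phonetic filter means that accented and unaccented spellings of the same romanized word end up as the same token.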
Resources
● Intro:
● Solr Reference Guide:
○ Understanding Analyzers, Tokenizers, and Filters
○ Analyzers
○ About Tokenizers
○ About Filters
○ Tokenizers
○ Filter Descriptions
○ CharFilterFactories
○ Language Analysis
○ Phonetic Matching (Beider-Morse Phonetic Matching (BMPM), Daitch-Mokotoff Soundex,
Double Metaphone, Metaphone, Soundex, Refined Soundex, Caverphone, Kölner Phonetik, NYSIIS)
○ Running Your Analyzer
○ Automatically-generated list of Analyzers, CharFilters, Tokenizers, and TokenFilters
N-grams
● Character-level (ngram {2,3} abcd -> ab, abc, bc, bcd, cd)
○ EdgeNGramTokenizerFactory and EdgeNGramFilterFactory (prefixes)
○ NGramTokenizerFactory and NGramFilterFactory (anywhere in the text)
○ CJKBigramFilterFactory (Chinese-Japanese-Korean)
● Token (word) level (the rain in spain falls mainly -> the rain, rain in, ...)
○ ShingleFilterFactory (token n-grams)
○ CommonGramsFilterFactory and CommonGramsQueryFilterFactory
■ shingles but only with common (stop) words
● Can be used for named entity identification
○ Shingle the normalized tokens (e.g. lowercased)
○ Facet on the results
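The two n-gram levels above can be sketched in a few lines of Python. This is an illustration of the idea only, not how the Solr factories are implemented; the function names char_ngrams and shingles are made up for this example.

```python
def char_ngrams(text, min_size=2, max_size=3):
    """Character n-grams, emitted position by position."""
    grams = []
    for start in range(len(text)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def shingles(text, size=2):
    """Token (word) n-grams over lowercased, whitespace-split tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


print(char_ngrams("abcd"))
print(shingles("the rain in spain falls mainly"))
```

The first call reproduces the slide's example (ab, abc, bc, bcd, cd); the second yields "the rain", "rain in", and so on.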
N-grams - example - define
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="2"
            outputUnigrams="false" outputUnigramsIfNoShingles="false"
            tokenSeparator=" " fillerToken="_"/>
  </analyzer>
</fieldType>
<field name="shingles" type="text_shingle" indexed="true" stored="true"/>
N-grams - example - use
● Index
○ The rain in Spain falls gently on the plane.
○ The rain is quite heavy in Spain
○ Heavy rain could be dangerous
○ The weather in Spain could be quite nice
● Query .../select?q=*:*&facet=on&facet.field=shingles
● Result (top entries)
○ in spain,3
○ could be,2
○ the rain,2
○ be dangerous,1
○ ...
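The counts above can be reproduced without Solr. The sketch below mimics the whitespace tokenizer, lowercasing, and two-word shingles, then counts in how many documents each shingle occurs, which is roughly what a facet count reports. It is a simplified stand-in, not Solr's implementation, and the tie-breaking order is an assumption.

```python
from collections import Counter

docs = [
    "The rain in Spain falls gently on the plane.",
    "The rain is quite heavy in Spain",
    "Heavy rain could be dangerous",
    "The weather in Spain could be quite nice",
]


def shingles(text, size=2):
    """Two-word shingles over lowercased, whitespace-split tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


counts = Counter()
for doc in docs:
    # set(): count each shingle once per document, like a facet count
    counts.update(set(shingles(doc)))

# Highest count first; ties broken alphabetically (an assumption)
for shingle, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:4]:
    print(f"{shingle},{count}")
```

Frequent shingles such as "in spain" are candidate entities or fixed phrases; rare ones are mostly noise.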
OpenNLP integration
● The Apache OpenNLP library is a machine learning based toolkit
for the processing of natural language text.
● Reference:
● OpenNLP in Solr
○ OpenNLPTokenizerFactory (including sentence chunking)
○ OpenNLPLemmatizerFilterFactory (as opposed to stemming)
○ OpenNLPPOSFilterFactory (part of speech: noun, verb, adjective, etc)
○ OpenNLPChunkerFilterFactory (e.g. Noun Phrase)
○ OpenNLPExtractNamedEntitiesUpdateProcessorFactory (NER URP!)
○ OpenNLPLangDetectUpdateProcessorFactory (detect language)
■ one of 3 language detectors in Solr
● Challenge
○ All require models
○ Solr does not include models
○ OpenNLP provides only some models; for the rest you need to train your own
OpenNLP - NER - managed-schema
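<fieldType name="opennlp-en-tokenization" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="conf/models/en-sent.bin"
               tokenizerModel="conf/models/en-token.bin"/>
  </analyzer>
</fieldType>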
OpenNLP - NER - example - solrconfig.xml
● Chain definition (not default):
<updateRequestProcessorChain name="opennlp-extract">
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">conf/models/en-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <str name="source">text_s</str>
    <str name="dest">people_ss</str>
  </processor>
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">conf/models/en-ner-organization.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <str name="source">text_s</str>
    <str name="dest">organizations_ss</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
OpenNLP - NER - example - cont
● In solrconfig.xml, add extra libraries:
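<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs/"
     regex="lucene-analyzers-opennlp-.*.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/"
     regex="solr-analysis-extras-.*.jar" />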
● Download (4) models from OpenNLP site:
● Put them into <core>/conf/models (for non-Cloud setup)
● Reference (one line):
OpenNLP - NER - example - index and query
● Index (one long line):
bin/post -c test -params update.chain="opennlp-extract"
-type text/csv -out yes -d
$'id,text_s\n1,"When I was working at IBM, I ate a lot of apples. Now that I
work at Apple, I eat a lot of pears. I have no idea what fruit I will like if I
ever work with John Scott for General Motors."'
● Query http://localhost:8983/solr/test/select?q=*:* returns:
{ id:1,
  text_s:When I was working at IBM, I ate a lot of apples. Now that I work at Apple, I eat a lot of
  pears. I have no idea what fruit I will like if I ever work with John Scott for General Motors.,
  people_ss:[John Scott],
  organizations_ss:[IBM, Apple, General Motors],
  _version_:1606739364120887296 }
Gazetteer (reverse lookup)
● Gazetteer: A dictionary, listing, or index of geographic names
● NLP Gazetteer: A closed list of names (entities) to match in the text
● Solr implementation: Tagger handler aka SolrTextTagger
○ Given a dictionary (a Solr index) with a name-like field, you can post text to this request
handler and it will return every occurrence of one of those names with offsets and other
document metadata desired. Tagger does Lucene text analysis.
● Reference:
○ Includes full working tutorial
○ Not going to repeat it here
● Let's use the name field of the films example as the gazetteer
○ Create the films example as per example/films/README.txt (but don't index the data yet)
○ Add the Tagger schema changes (skip the text field definition) and the request handler definition
○ Index the films into the updated definition (or reindex, if you have indexed already)
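The "reverse lookup" idea can be shown with a toy gazetteer in Python. This sketch is far simpler than the Tagger (no Lucene text analysis, no index): it does a case-insensitive, whole-word scan for each name and reports character offsets. The function name tag and the sample sentence are assumptions made for illustration.

```python
import re


def tag(text, names):
    """Return (start, end, matched_text, dictionary_name) for every hit."""
    hits = []
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), match.group(0), name))
    return sorted(hits)


gazetteer = ["A Beautiful Mind", "Black Hawk Down"]
text = "We watched black hawk down and then A Beautiful Mind"

for start, end, matched, name in tag(text, gazetteer):
    print(start, end, matched, "->", name)
```

Unlike this sketch, the Tagger runs the field's analyzer on both the dictionary and the input text, which is how it can match variants that differ by more than letter case.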
Reminder - Film Example
● Recently added example in example/films
○ 1100 records about real movies
○ available in XML, JSON, and CSV format to demonstrate indexing
○ uses basic schema and also shows how to work around "schemaless mode" limitations
○ gives full instructions to get it working
○ good toy dataset with text and facetable fields
● Sample record:
{
  "id": "/en/black_hawk_down",
  "directed_by": [ "Ridley Scott"],
  "initial_release_date": "2001-12-18",
  "name": "Black Hawk Down",
  "genre": ["War film", "Action/Adventure", "Action Film",
            "History", "Combat Films", "Drama"]
}
Gazetteer (reverse lookup) - calling
● The tagger is a separate request handler (/tag)
● We send it text (and parameters) and get back matches with desired fields
● curl -X POST \
    'http://localhost:8983/solr/films/tag?fl="id,name,directed_by"&matchText="true"' \
    -H 'Content-Type:text/plain' \
    -d 'I loved the movie A beautiful mind but was not too keen on a Knight Tale'
Gazetteer (reverse lookup) - result (reformatted)
{
  "responseHeader":{ "status":0, "QTime":0},
  "tagsCount":2,
  "tags":[
    [ "startOffset",19, "endOffset",35,
      "matchText","A beautiful mind", "ids",["/en/a_beautiful_mind"]],
    [ "startOffset",61, "endOffset",74,
      "matchText","a Knight Tale", "ids",["/en/a_knights_tale"]]],
  "response":{"numFound":2,"start":0,"docs":[
    { "id":"/en/a_beautiful_mind",
      "directed_by":["Ron Howard"],
      "name":"A Beautiful Mind" },
    { "id":"/en/a_knights_tale",
      "directed_by":["Brian Helgeland"],
      "name":"A Knight's Tale" } ]}}
Significant terms
● Significant terms - returns terms, scored on how frequently they appear in the
result set and how rarely they appear in the entire corpus.
● Uses TF-IDF to calculate score - not just appearance count
● Currently documented for Streams at:
● Is also available as a Query Parser, but in 7.4 it lacked documentation, was
misspelt, and had local-params issues.
● Here, we use the 7.5 syntax; see SOLR-12395 and SOLR-12553 for details
● Syntax:
○ fq={!significantTerms field=name numTerms=3 minTermLength=5}
○ Has to be in fq as it does not affect documents, just outputs additional info
○ Has to be against text field (so genre, not genre_str in this specific example)
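The idea of scoring terms by "frequent in the result set, rare in the corpus" can be sketched as follows. This is not Solr's actual scoring: the formula (foreground document frequency times the log of inverse corpus frequency), the function names, and the toy titles are all assumptions chosen to illustrate the principle, and minimum term length is assumed to mean "at least that many characters".

```python
import math
from collections import Counter


def doc_freq(docs):
    """Number of documents each lowercased term appears in."""
    freq = Counter()
    for doc in docs:
        freq.update(set(doc.lower().split()))
    return freq


def significant_terms(foreground, corpus, num_terms=3, min_term_length=5):
    """Rank foreground terms; foreground must be a subset of corpus."""
    fg, bg = doc_freq(foreground), doc_freq(corpus)
    scored = []
    for term, fg_count in fg.items():
        if len(term) < min_term_length:
            continue
        # frequent in the result set, rare in the whole corpus
        scored.append((fg_count * math.log(len(corpus) / bg[term]), term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:num_terms]]


corpus = [
    "godzilla returns", "godzilla against the city", "the quiet house",
    "the house on the hill", "story of the city", "the final story",
]
result_set = corpus[:2]  # pretend these matched the main query
print(significant_terms(result_set, corpus))
```

A plain frequency count over the same result set would rank "the" as highly as "godzilla"; weighting by rarity in the corpus (and the length filter) is what pushes stop words down.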
Significant terms in the film example
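● Query (7.5 syntax):
.../films/select?rows=0&q=*:*&facet=on&facet.field=name
&fq={!significantTerms field=name minTermLength=5 numTerms=10}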
● Compare pure frequency (facet) with significant terms
Significant terms in the film example - result
● q=*:*
○ Significant Terms (shown here in decreasing order of significance; normally returned in increasing order):
american, movie, black, ghost, final, death, story, godzilla, blood
○ Facets (shown here in decreasing order of count):
the, of, a, and, in, 2, to, american, movie, amp, black, dead, love, all, big, blue, for, an, from,
bad, dark, final, ghost, ii, with, 3, boys, day, death
● q=genre:Drama
○ Significant Terms: black, american, don't, about, ghost, dirty, story, blood, death
● q=genre:Romantic
○ Significant Terms: movie, story, house, dirty, brother, black
● q=genre:Japanese
○ Significant Terms (only 2): godzilla, death
Semantic Knowledge Graph
● Score relevance against background
○ Part of "new" JSON Facets API
○ Flexible about foreground/background/global queries
○ Context-aware if used in nested facets
○ Solr "Inception" (aka "Not sure I fully grok it yet")
● Reference (and hobbies vs age example):
● Our case: Compare Genres for works of Ridley Scott and Steven Soderbergh
● Baseline statistics query (one line):
http://...../films/select?q=directed_by_str:"Ridley Scott"&facet=on&facet.field=genre_str&rows=0&facet.mincount=1
Baseline Genre statistics
Steven Soderbergh
Drama 5
Romance Film 3
Biographical film 2
Comedy-drama 2
Indie film 2
--- the rest is 1 each ---
Comedy, Crime Fiction, Docudrama,
Drama film, Ensemble Film, Erotica,
Feminist Film, Historical drama, Legal drama,
Mystery, Romantic comedy, Thriller, Trial drama,
War film
Ridley Scott
Drama 5
Action Film 3
Crime Thriller 2
War film 2
--- the rest is 1 each ---
Action/Adventure, Adventure Film,
Biographical film, Combat Films, Comedy,
Comedy of manners, Comedy-drama,
Crime Drama, Crime Fiction, Epic film,
Film adaptation, Gangster Film,
Historical drama, Historical period drama,
History, Horror, Mystery, Psychological thriller,
Romance Film, Romantic comedy, Slice of life,
Thriller, True crime
Basic JSON Facet query
POST http://.../films/select
{
  params: {q:"*:*", rows: 0},
  facet: {
    genre: {
      type: terms,
      field: genre_str,
      limit: 5
}}}
1. Find all documents
2. Return none of them (to reduce output)
3. Calculate facets on field genre_str
4. Return top 5 buckets
Response:
....
"facets": {
  "count": 1100,
  "genre": {
    "buckets": [
      {"val": "Drama", "count": 552},
      {"val": "Comedy", "count": 389},
      {"val": "Romance Film", "count": 270},
      {"val": "Thriller", "count": 259},
      {"val": "Action Film", "count": 196}
    ]
....
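For readers who prefer to drive this from code, here is a small Python sketch that builds the same facet request as strict JSON and reads the buckets out of a response. Nothing is sent to Solr here: the response string is the excerpt from this slide, completed with closing brackets so that it parses, and the variable names are assumptions.

```python
import json

# The request body from the slide, written as strict JSON.
request = {
    "params": {"q": "*:*", "rows": 0},
    "facet": {"genre": {"type": "terms", "field": "genre_str", "limit": 5}},
}
body = json.dumps(request, indent=2)
print(body)

# Excerpt of the response shown on the slide (closing brackets added).
response_text = """
{"facets": {"count": 1100, "genre": {"buckets": [
  {"val": "Drama", "count": 552},
  {"val": "Comedy", "count": 389},
  {"val": "Romance Film", "count": 270},
  {"val": "Thriller", "count": 259},
  {"val": "Action Film", "count": 196}]}}}
"""
facets = json.loads(response_text)["facets"]
total = facets["count"]
for bucket in facets["genre"]["buckets"]:
    share = bucket["count"] / total
    print(f'{bucket["val"]}: {bucket["count"]} ({share:.1%} of all films)')
```

Printing the share next to the raw count makes it obvious why "Drama" is a poor discriminator: about half of all films carry it.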
Comparative Relatedness
POST http://.../films/select
{
  params: {
    q:"*:*", rows: 0,
    fore_SS:"directed_by:\"Steven Soderbergh\"",
    fore_RS:"directed_by:\"Ridley Scott\"",
    back: "*:*"
  },
  facet: {
    genre: {
      type: terms,
      field: genre_str,
      limit: 5,
      sort: "rl-SS desc",
      facet: {
        rl-SS: "relatedness($fore_SS, $back)",
        rl-RS: "relatedness($fore_RS, $back)"
      }
}}}
Response (top bucket):
....
{ "val": "Drama film", "count": 1,
  "rl-SS": {
    "relatedness": 0.13611,
    "foreground_popularity": 0.00091,
    "background_popularity": 0.00091
  },
  "rl-RS": {
    "relatedness": -0.00075,
    "foreground_popularity": 0,
    "background_popularity": 0.00091
  }
},
....
The top bucket is not "Drama": that genre is too popular overall to be distinctive.
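The foreground queries contain quoted phrases inside quoted parameter values, which is easy to get wrong by hand. One way to avoid that, sketched here in Python, is to build the request as a dictionary and let the JSON encoder do the escaping. The helper name director_query is an assumption; the field and parameter names are those used on this slide.

```python
import json


def director_query(name):
    """Phrase query on the directed_by field, as used on the slide."""
    return f'directed_by:"{name}"'


relatedness_facets = {
    "rl-SS": "relatedness($fore_SS, $back)",
    "rl-RS": "relatedness($fore_RS, $back)",
}
request = {
    "params": {
        "q": "*:*",
        "rows": 0,
        "fore_SS": director_query("Steven Soderbergh"),
        "fore_RS": director_query("Ridley Scott"),
        "back": "*:*",
    },
    "facet": {
        "genre": {
            "type": "terms",
            "field": "genre_str",
            "limit": 5,
            "sort": "rl-SS desc",
            "facet": relatedness_facets,
        }
    },
}
body = json.dumps(request, indent=2)
print(body)  # inner quotes appear escaped as \"
```

The resulting string can be sent as the POST body with any HTTP client; changing "back" to a narrower query gives the nested comparison shown on the "Inception" slide.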
Compare relatedness values - samples
Genre           Global    Steven Soderbergh       Ridley Scott
                count     Count   Relatedness     Count   Relatedness
Drama film          1       1       0.13611         0      -0.00075
Legal drama         1       1       0.13611         0      -0.00075
Feminist Film       2       1       0.09963         0      -0.00107
Comedy-drama       58       2       0.0365          1       0.01602
Drama             552       5       0.00455         5       0.00455
"Inception" Relatedness
POST http://.../films/select
{
  params: {
    q:"*:*", rows: 0,
    fore_SS:"directed_by:\"Steven Soderbergh\"",
    fore_RS:"directed_by:\"Ridley Scott\"",
    back:"directed_by:(\"Steven Soderbergh\" OR \"Ridley Scott\")"
  },
  facet: {
    genre: {
      type: terms, field: genre_str, limit: 5,
      sort: "rl-SS desc",
      facet: {
        rl-SS: "relatedness($fore_SS, $back)",
        rl-RS: "relatedness($fore_RS, $back)",
        year2000-2004: {
          type: query,
          q: "initial_release_date:[2000-01-01T00:00:00Z TO 2005-01-01T00:00:00Z]",
          facet: {
            rl-SS: "relatedness($fore_SS, $back)",
            rl-RS: "relatedness($fore_RS, $back)"
          }
        }
      }
}}}
Response (one bucket):
{"val": "Romance Film", "count": 270,
  "rl-SS": {
    "relatedness": 0.01003,
    "foreground_popularity": 0.3,
    "background_popularity": 0.4
  },
  "rl-RS": {
    "relatedness": -0.01003,
    "foreground_popularity": 0.1,
    "background_popularity": 0.4
  },
  "year2000-2004": {"count": 156,
    "rl-SS": {
      "relatedness": 0.01539,
      "foreground_popularity": 0.3,
      "background_popularity": 0.6
    },
    "rl-RS": {
      "relatedness": -0.01338,
      "foreground_popularity": 0,
      "background_popularity": 0.6}
  }
},
More awesomeness - another time
● Learning to Rank
○ Machine-learned ranking models
■ Learning-to-Rank with Apache Solr & Bees - Christine Poerschke, Bloomberg
■ Lucene/Solr Revolution 2017
● Graph traversal
● Streaming (including Map/Reduce)
● Result clustering
● Commercial solutions (e.g. Basis Technology)
● Searching images by auto-generated captions
Activate - The Search and AI conference
● Used to be called Lucene/Solr Revolution
● This year in Montreal, October 17-18 (with training beforehand)
● New direction with focus on AI
● Samples:
○ Making Search at Reddit Relevant
○ The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep Learning
○ The Neural Search Frontier
○ How to Build a Semantic Search System
○ Query-time Nonparametric Regression with Temporally Bounded Models
○ Building Analytics Applications with Streaming Expressions in Apache Solr