Leveraging Solr for AI tasks

Searching for AI
Leveraging Solr for classic Artificial Intelligence tasks
Alexandre Rafalovitch
July 2018

What are we discussing today
1. Search IS AI (NLP/Information Retrieval)
2. NGrams (on letters and terms)
Example: Count-based Named Entity Recognition
3. OpenNLP (Statistical methods/ML)
Example: ML-based Named Entity Recognition
4. Gazetteer
Example: Lookup-based Named Entity Recognition
5. Significant Terms (query parser) example
6. Semantic Knowledge Graph (facets) example

Basic text processing pipeline - English
<fieldType name="text_en_splitting_tight" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="0"
generateNumberParts="0" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.FlattenGraphFilterFactory"/>
</analyzer>
<analyzer type="query"> ....
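To make the order of operations in the index-time chain above more concrete, here is a minimal Python sketch. It is not Solr code: the stop-word set, the plural-stripping rule, and the function names are toy stand-ins (assumptions) for StopFilterFactory, EnglishMinimalStemFilterFactory, and the analyzer itself. Synonym, word-delimiter, and protected-word handling are left out.

```python
# Toy stand-in for part of the index-time chain; not Solr's implementation.
STOPWORDS = {"the", "and", "a", "of"}  # assumed sample, not lang/stopwords_en.txt


def minimal_stem(token):
    """Very rough plural stripping; the real stemmer has more rules."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    tokens = text.split()                                       # whitespace tokenizer
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]  # stop filter (ignoreCase)
    tokens = [t.lower() for t in tokens]                        # lowercase filter
    return [minimal_stem(t) for t in tokens]                    # minimal stemming


print(analyze("The Cats and the Dogs"))
```

The point is only that each filter sees the output of the previous one, so reordering the filters can change the indexed tokens.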

Basic text processing pipeline - Farsi
<fieldType name="text_fa" class="solr.TextField"
positionIncrementGap="100">
<analyzer>

<charFilter class="solr.PersianCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.PersianNormalizationFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_fa.txt"/>
</analyzer>
</fieldType>

Complex text processing pipeline - Thai

<fieldType name="thai_english" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUTransformFilterFactory" id="Thai-Latin" />
<filter class="solr.ICUTransformFilterFactory"
id="NFD; [:Nonspacing Mark:] Remove; NFC"/>
<filter class="solr.BeiderMorseFilterFactory"/>
</analyzer>
<analyzer type="query">...
Source: https://github.com/arafalov/solr-thai-test/
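The second ICU transform above ("NFD; [:Nonspacing Mark:] Remove; NFC") strips accents and tone marks left over after transliteration. A rough Python equivalent, assuming the standard `unicodedata` module instead of ICU (the two may ship different Unicode versions), could look like this. The function name is made up for illustration.

```python
import unicodedata


def strip_nonspacing_marks(text):
    # NFD: split characters into base letters plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Drop everything in general category Mn (Nonspacing Mark)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # NFC: recompose whatever is left
    return unicodedata.normalize("NFC", kept)


print(strip_nonspacing_marks("Kölner café"))
```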

Resources
● Intro: https://www.slideshare.net/lucenerevolution/language-support-and-linguistics-in-lucene-solr-its-eco-system
● Solr Reference Guide:
○ https://lucene.apache.org/solr/guide/7_4/understanding-analyzers-tokenizers-and-filters.html
○ Understanding Analyzers, Tokenizers, and Filters
○ Analyzers
○ About Tokenizers
○ About Filters
○ Tokenizers
○ Filter Descriptions
○ CharFilterFactories
○ Language Analysis
○ Phonetic Matching (Beider-Morse Phonetic Matching (BMPM), Daitch-Mokotoff Soundex,
Double Metaphone, Metaphone, Soundex, Refined Soundex, Caverphone, Kölner Phonetik,
NYSIIS)
○ Running Your Analyzer
● http://www.solr-start.com/info/analyzers/
○ Automatically-generated list of Analyzers, CharFilters, Tokenizers, and TokenFilters

N-grams
● Character-level (ngram {2,3} abcd -> ab, abc, bc, bcd, cd)
○ EdgeNGramTokenizerFactory and EdgeNGramFilterFactory (prefixes)
○ NGramTokenizerFactory and NGramFilterFactory (anywhere in the text)
○ CJKBigramFilterFactory (Chinese-Japanese-Korean)
● Token (word) level (the rain in spain falls mainly -> the rain, rain in, ...)
○ ShingleFilterFactory (token n-grams)
○ CommonGramsFilterFactory and CommonGramsQueryFilterFactory
■ shingles but only with common (stop) words
● Can be used for named entity identification
○ Shingle the normalized tokens (e.g. lowercased)
○ Facet on the results
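As a minimal sketch of the character-level variants listed above, the following Python functions reproduce the `abcd` example. The function names are hypothetical, and the order in which Solr's own factories emit the grams may differ.

```python
def char_ngrams(text, min_size, max_size):
    """All substrings of length min_size..max_size, grouped by start position."""
    grams = []
    for start in range(len(text)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def edge_ngrams(text, min_size, max_size):
    """Only the n-grams anchored at the start of the text (prefixes)."""
    return [text[:size] for size in range(min_size, min(max_size, len(text)) + 1)]


print(char_ngrams("abcd", 2, 3))
print(edge_ngrams("abcd", 1, 3))
```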

N-grams - example - define
<fieldType name="text_shingle" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory"
minShingleSize="2" maxShingleSize="2"
outputUnigrams="false"
outputUnigramsIfNoShingles="false"
tokenSeparator=" " fillerToken="_"/>
</analyzer>
</fieldType>
<field name="shingles" type="text_shingle" indexed="true" stored="true"/>

N-grams - example - use
● Index
○ The rain in Spain falls gently on the plane.
○ The rain is quite heavy in Spain
○ Heavy rain could be dangerous
○ The weather in Spain could be quite nice
● Query .../select?
q=*:*
&facet=on
&facet.field=shingles
● Result (top entries)
○ in spain,3
○ could be,2
○ the rain,2
○ be dangerous,1
○ ...
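The facet counts above can be reproduced outside Solr with a short Python sketch. It assumes lowercased, whitespace-split tokens and counts each two-word shingle once per document, which is how field facets count. The helper names are made up for illustration.

```python
from collections import Counter

DOCS = [
    "The rain in Spain falls gently on the plane.",
    "The rain is quite heavy in Spain",
    "Heavy rain could be dangerous",
    "The weather in Spain could be quite nice",
]


def bigram_shingles(text):
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def facet_counts(docs):
    counts = Counter()
    for doc in docs:
        # set(): a shingle is counted once per document
        counts.update(set(bigram_shingles(doc)))
    # Highest count first, ties broken alphabetically
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


for shingle, count in facet_counts(DOCS)[:3]:
    print(shingle, count)
```

Frequent shingles such as "in spain" are candidates for named entities or fixed phrases. Stop-word pairs such as "could be" show why the counts still need filtering.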

OpenNLP integration
● The Apache OpenNLP library is a machine learning based toolkit
for the processing of natural language text.
● Reference: https://lucene.apache.org/solr/guide/7_4/language-analysis.html
● OpenNLP in Solr
○ OpenNLPTokenizerFactory (including sentence chunking)
○ OpenNLPLemmatizerFilterFactory (as opposed to stemming)
○ OpenNLPPOSFilterFactory (part of speech: noun, verb, adjective, etc.)
○ OpenNLPChunkerFilterFactory (e.g. Noun Phrase)
○ OpenNLPExtractNamedEntitiesUpdateProcessorFactory (NER URP!)
○ OpenNLPLangDetectUpdateProcessorFactory (detect language)
■ one of 3 language detectors in Solr
● Challenge
○ All require models
○ Solr does not include models
○ OpenNLP only provides some models - need to train your own

OpenNLP - NER - managed-schema
<fieldType name="opennlp-en-tokenization"
class="solr.TextField">
<analyzer>
<tokenizer class="solr.OpenNLPTokenizerFactory"
sentenceModel="conf/models/en-sent.bin"
tokenizerModel="conf/models/en-token.bin"/>
</analyzer>
</fieldType>

OpenNLP - NER - example - solrconfig.xml
● Chain definition (not default):
<updateRequestProcessorChain name="opennlp-extract">
<processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
<str name="modelFile">conf/models/en-ner-person.bin</str>
<str name="analyzerFieldType">opennlp-en-tokenization</str>
<str name="source">text_s</str>
<str name="dest">people_ss</str>
</processor>
<processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
<str name="modelFile">conf/models/en-ner-organization.bin</str>
<str name="analyzerFieldType">opennlp-en-tokenization</str>
<str name="source">text_s</str>
<str name="dest">organizations_ss</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

OpenNLP - NER - example - cont
● In solrconfig.xml, add extra libraries:
<lib
dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs/"
regex="lucene-analyzers-opennlp-.*.jar" />
<lib
dir="${solr.install.dir:../../../..}/dist/"
regex="solr-analysis-extras-.*.jar" />
● Download (4) models from OpenNLP site:
http://opennlp.sourceforge.net/models-1.5/
● Put them into <core>/conf/models (for non-Cloud setup)
● Reference (one line):
○ https://lucene.apache.org/solr/guide/7_4/update-request-processors.html
#update-processor-factories-that-can-be-loaded-as-plugins

OpenNLP - NER - example - index and query
● Index (one long line):
bin/post -c test -params update.chain="opennlp-extract"
-type text/csv -out yes -d
$'id,text_s\n1,"When I was working at IBM, I ate a lot of apples. Now that I
work at Apple, I eat a lot of pears. I have no idea what fruit I will like if I
ever work with John Scott for General Motors."'
● Query http://localhost:8983/solr/test/select?q=*:* :
{
id:1,
text_s:When I was working at IBM, I ate a lot of apples. Now that I work at Apple, I eat a lot of
pears. I have no idea what fruit I will like if I ever work with John Scott for General Motors.,
people_ss:[John Scott],
organizations_ss:[IBM, Apple, General Motors],
_version_:1606739364120887296
}

Gazetteer (reverse lookup)
● Gazetteer: A dictionary, listing, or index of geographic names
● NLP Gazetteer: A closed list of names (entities) to match in the text
● Solr implementation: Tagger handler aka SolrTextTagger
○ Given a dictionary (a Solr index) with a name-like field, you can post text to this request
handler and it will return every occurrence of one of those names with offsets and other
document metadata desired. Tagger does Lucene text analysis.
● Reference: https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html
○ Includes full working tutorial
○ Not going to repeat it here
● Let's use films example's name field as gazetteer
○ Create films example as per example/films/README.txt (but don't index text yet)
○ Add Tagger schema changes (skip text field definition) and request handler definition
○ Index films into updated definition (or reindex, if you indexed already)

Reminder - Film Example
● Recently added example in example/films
○ 1100 records about real movies
○ available in XML, JSON, and CSV format to demonstrate indexing
○ uses basic schema and also shows how to work around "schemaless mode" limitations
○ gives full instructions to get it working
○ good toy dataset with text and facetable fields
● Sample record:
{
"id": "/en/black_hawk_down",
"directed_by": [ "Ridley Scott"],
"initial_release_date": "2001-12-18",
"name": "Black Hawk Down",
"genre": ["War film", "Action/Adventure", "Action Film",
"History", "Combat Films", "Drama"]
}

Gazetteer (reverse lookup) - calling
● The tagger is a separate request handler (/tag)
● We send it text (and parameters) and get back matches with desired fields
● Example call:
curl -X POST \
'http://localhost:8983/solr/films/tag?fl=id,name,directed_by&matchText=true' \
-H 'Content-Type:text/plain' \
-d 'I loved the movie A beautiful mind but was not too keen on a Knight Tale'

Gazetteer (reverse lookup) - result (reformatted)
{
"responseHeader":{ "status":0, "QTime":0},
"tagsCount":2,
"tags":[
[ "startOffset",19, "endOffset",35,
"matchText","A beautiful mind", "ids",["/en/a_beautiful_mind"]],
[ "startOffset",61, "endOffset",74,
"matchText","a Knight Tale", "ids",["/en/a_knights_tale"]]],
"response":{"numFound":2,"start":0,"docs":[
{
"id":"/en/a_beautiful_mind",
"directed_by":["Ron Howard"],
"name":"A Beautiful Mind"
},
{
"id":"/en/a_knights_tale",
"directed_by":["Brian Helgeland"],
"name":"A Knight's Tale"
}
]}}
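To illustrate the idea of a gazetteer lookup (not how the Tagger handler is implemented), here is a naive Python sketch that finds dictionary names in free text and reports character offsets. The function names and the simple lowercase word matching are assumptions. The real handler runs Lucene text analysis, which is why it can also match variants such as "a Knight Tale".

```python
import re


def build_gazetteer(names):
    """Map a lowercased token tuple to the original name."""
    return {tuple(re.findall(r"\w+", name.lower())): name for name in names}


def tag(text, gazetteer):
    """Return (start_offset, end_offset, name) for each longest match."""
    tokens = [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    longest = max((len(key) for key in gazetteer), default=0)
    tags = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones
        for size in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + size])
            if key in gazetteer:
                tags.append((tokens[i][1], tokens[i + size - 1][2], gazetteer[key]))
                i += size
                break
        else:
            i += 1  # no name starts at this token
    return tags


films = build_gazetteer(["A Beautiful Mind", "Black Hawk Down"])
text = "I loved the movie A beautiful mind but not Black Hawk Down"
for start, end, name in tag(text, films):
    print(start, end, text[start:end], "->", name)
```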

Significant terms
● Significant terms - returns terms, scored on how frequently they appear in the
result set and how rarely they appear in the entire corpus.
● Uses TF-IDF to calculate score - not just appearance count
● Currently documented for Streams at:
https://lucene.apache.org/solr/guide/7_4/stream-source-reference.html#significantterms
● Also available as a Query Parser, but in 7.4 it was undocumented, misspelt,
and had local-params issues.
● Here, we use the 7.5 syntax; see SOLR-12395 and SOLR-12553 for details
● Syntax:
○ fq={!significantTerms field=name numTerms=3 minTermLength=5}
○ Has to be in fq as it does not affect documents, just outputs additional info
○ Has to be against text field (so genre, not genre_str in this specific example)
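The scoring idea (frequent in the result set, rare in the whole corpus) can be sketched in a few lines of Python. This is a simplified TF-IDF-style illustration under stated assumptions, not Solr's exact formula. The function name, the toy titles, and the choice of foreground documents are all made up.

```python
import math


def significant_terms(corpus, foreground_ids, min_term_length=5, num_terms=3):
    docs = [set(text.lower().split()) for text in corpus]
    scores = {}
    for term in set().union(*(docs[i] for i in foreground_ids)):
        if len(term) < min_term_length:
            continue
        fg_count = sum(1 for i in foreground_ids if term in docs[i])  # frequency in result set
        doc_freq = sum(1 for doc in docs if term in doc)              # frequency in corpus
        scores[term] = fg_count * math.log(len(docs) / doc_freq)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:num_terms]


# Toy corpus; the first two titles play the role of the query's result set
TITLES = [
    "godzilla vs mothra",
    "godzilla final wars",
    "the final cut",
    "the story of love",
    "the american movie",
    "the black ghost",
]
print(significant_terms(TITLES, foreground_ids=[0, 1]))
```

Here "final" appears in the result set but is also found elsewhere in the corpus, so it ranks below terms that are concentrated in the result set.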

Significant terms in the film example
● Query (7.5 syntax):
.../films/select?rows=0
&q=*:*
&facet=on&facet.field=name
&fq={!significantTerms
field=name
minTermLength=5
numTerms=10
}
● Compare pure frequency (facet) with significant terms

Significant terms in the film example - result
● q=*:*
○ Significant Terms (shown in decreasing order of significance; normally returned in increasing order):
american, movie, black, ghost, final, death, story, godzilla, blood
○ Facets (in decreasing count order):
the, of, a, and, in, 2, to, american, movie, amp, black, dead, love, all, big, blue, for, an, from,
bad, dark, final, ghost, ii, with, 3, boys, day, death
● q=genre:Drama
○ Significant Terms: black, american, don't, about, ghost, dirty, story, blood, death
● q=genre:Romantic
○ Significant Terms: movie, story, house, dirty, brother, black
● q=genre:Japanese
○ Significant Terms (only 2): godzilla, death

Semantic Knowledge Graph
● Score relevance against background
○ Part of "new" JSON Facets API
○ Flexible about foreground/background/global queries
○ Context-aware if used in nested facets
○ Solr "Inception" (aka "Not sure I fully grok it yet")
● Reference (and hobbies vs age example):
https://lucene.apache.org/solr/guide/7_4/json-facet-api.html#semantic-knowledge-graphs
● Our case: Compare Genres for works of Ridley Scott and Steven Soderbergh
● Baseline statistics query (one line):
http://...../films/select?q=directed_by_str:"Ridley Scott"
&facet=on&facet.field=genre_str
&rows=0&facet.mincount=1

Baseline Genre statistics
● Steven Soderbergh
○ Drama 5
○ Romance Film 3
○ Biographical film 2
○ Comedy-drama 2
○ Indie film 2
○ The rest are 1 each: Comedy, Crime Fiction, Docudrama, Drama film, Ensemble Film, Erotica,
Feminist Film, Historical drama, Legal drama, Mystery, Romantic comedy, Thriller, Trial drama,
War film
● Ridley Scott
○ Drama 5
○ Action Film 3
○ Crime Thriller 2
○ War film 2
○ The rest are 1 each: Action/Adventure, Adventure Film, Biographical film, Combat Films, Comedy,
Comedy of manners, Comedy-drama, Crime Drama, Crime Fiction, Epic film, Film adaptation,
Gangster Film, Historical drama, Historical period drama, History, Horror, Mystery,
Psychological thriller, Romance Film, Romantic comedy, Slice of life, Thriller, True crime

Basic JSON Facet query
● Request:
POST http://.../films/select
{
params: {q:"*:*",rows: 0},
facet: {
genre: {
type: terms,
field: genre_str,
limit: 5
}}}
● What it does:
1. Find all documents
2. Return none of them (to reduce output)
3. Calculate facets on field genre_str
4. Return top 5 buckets
● Response:
....
"facets": {
"count": 1100,
"genre": {
"buckets": [
{"val": "Drama", "count": 552},
{"val": "Comedy", "count": 389},
{"val": "Romance Film", "count": 270},
{"val": "Thriller", "count": 259},
{"val": "Action Film", "count": 196}
]
....
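The basic facet request on this slide can also be built programmatically. The sketch below uses only the Python standard library and writes the body as strict JSON. The host, port, and core name in the URL are assumptions, and the request is only constructed here, not sent.

```python
import json
from urllib.request import Request

SOLR_SELECT_URL = "http://localhost:8983/solr/films/select"  # assumed host and core


def build_facet_request(field, limit):
    body = {
        "params": {"q": "*:*", "rows": 0},  # match everything, return no documents
        "facet": {"genre": {"type": "terms", "field": field, "limit": limit}},
    }
    return Request(
        SOLR_SELECT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


request = build_facet_request("genre_str", 5)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it to a running Solr instance: urllib.request.urlopen(request)
```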

Comparative Relatedness
● Request:
{
params: {
q:"*:*",rows: 0,
fore_SS:"directed_by:\"Steven Soderbergh\"",
fore_RS:"directed_by:\"Ridley Scott\"",
back: "*:*"
},
facet: {
genre: {
type: terms,
field: genre_str,
limit: 5,
sort: "rl-SS desc",
facet: {
rl-SS: "relatedness($fore_SS, $back)",
rl-RS: "relatedness($fore_RS, $back)"
}
}}}
● Response (the top bucket is "Drama film", not "Drama"; that one is too popular):
....
{
"val": "Drama film", "count": 1,
"rl-SS": {
"relatedness": 0.13611,
"foreground_popularity": 0.00091,
"background_popularity": 0.00091
},
"rl-RS": {
"relatedness": -0.00075,
"foreground_popularity": 0
}
},
....

Compare relatedness values - samples
                             Steven Soderbergh     Ridley Scott
Genre          Global Count  Count  Relatedness    Count  Relatedness
Drama film     1             1      0.13611        0      -0.00075
Legal drama    1             1      0.13611        0      -0.00075
Feminist Film  2             1      0.09963        0      -0.00107
Comedy-drama   58            2      0.0365         1      0.01602
Drama          552           5      0.00455        5      0.00455
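The two popularity figures in the relatedness responses appear to be plain ratios against the size of the background set. That reading is an assumption, but it matches the numbers shown for "Drama film" (1 of 1100 documents is about 0.00091). The small Python sketch below reproduces them. The function name is hypothetical, and the relatedness score itself, which Solr computes with a more involved statistic, is not reproduced here.

```python
def popularity(fg_and_bucket, bg_and_bucket, bg_size):
    """Return (foreground_popularity, background_popularity), rounded to 5 places.

    fg_and_bucket: documents in the bucket that also match the foreground query
    bg_and_bucket: documents in the bucket that match the background query
    bg_size:       documents matching the background query
    """
    return round(fg_and_bucket / bg_size, 5), round(bg_and_bucket / bg_size, 5)


# "Drama film": 1 document overall, directed by Steven Soderbergh
print(popularity(1, 1, 1100))  # foreground = Steven Soderbergh
print(popularity(0, 1, 1100))  # foreground = Ridley Scott
```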

"Inception" Relatedness
● Request:
{
params: {
q:"*:*",rows: 0,
fore_SS:"directed_by:\"Steven Soderbergh\"",
fore_RS:"directed_by:\"Ridley Scott\"",
back:"directed_by:(\"Steven Soderbergh\" OR \"Ridley Scott\")"
},
facet: {
genre: {
type: terms, field: genre_str, limit: 5,
sort: "rl-SS desc",
facet: {
year2000-2004: {
type: query,
q: "initial_release_date:[2000-01-01T00:00:00Z TO 2005-01-01T00:00:00Z]",
facet: {
}}
}
}}}
● Response:
{"val": "Romance Film", "count": 270,
"rl-SS": {
},
"rl-RS": {
},
"year2000-2004": {"count": 156,
"rl-SS": {
},
"rl-RS": {
"foreground_popularity": 0,
"background_popularity": 0.6}
}
},

More awesomeness - another time
● Learning to Rank
○ Machine-learned ranking models
○ https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html
○ https://www.youtube.com/watch?v=OJJe-OWHjfI
■ Learning-to-Rank with Apache Solr & Bees - Christine Poerschke, Bloomberg
■ Lucene/Solr Revolution 2017
● Graph traversal
○ https://lucene.apache.org/solr/guide/7_4/graph-traversal.html
● Streaming (including Map/Reduce)
○ https://lucene.apache.org/solr/guide/7_4/streaming-expressions.html
● Result clustering
○ https://lucene.apache.org/solr/guide/7_4/result-clustering.html
● Commercial solutions (e.g. Basis Technology)
● Searching images by auto-generated captions

Activate - The Search and AI conference
● Used to be called Lucene/Solr Revolution
● This year in Montreal, October 17-18 (with training beforehand)
● New direction with focus on AI
● https://activate-conf.com/agenda/
● Samples:
○ Making Search at Reddit Relevant
○ The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep Learning
○ The Neural Search Frontier
○ How to Build a Semantic Search System
○ Query-time Nonparametric Regression with Temporally Bounded Models
○ Building Analytics Applications with Streaming Expressions in Apache Solr
