Searching for AI
Leveraging Solr for classic
Artificial Intelligence tasks
Alexandre Rafalovitch
July 2018
What are we discussing today
1. Search IS AI (NLP/Information Retrieval)
2. NGrams (on letters and terms)
Example: Count-based Named Entity Recognition
3. OpenNLP (Statistical methods/ML)
Example: ML-based Named Entity Recognition
4. Gazetteer
Example: Lookup-based Named Entity Recognition
5. Significant Terms (query parser) example
6. Semantic Knowledge Graph (facets) example
Basic text processing pipeline - English
<fieldType name="text_en_splitting_tight" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="0"
generateNumberParts="0" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.FlattenGraphFilterFactory"/>
</analyzer>
<analyzer type="query"> ....
Basic text processing pipeline - Farsi
<fieldType name="text_fa" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<!-- for ZeroWidthNonJoiner -->
<charFilter class="solr.PersianCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.PersianNormalizationFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_fa.txt"/>
</analyzer>
</fieldType>
Complex text processing pipeline - Thai
<!--
1) tokenize Thai text with built-in rules+dictionary
2) map it to Latin characters (with special accents indicating tones)
3) get rid of tone marks, as nobody uses them
4) do some phonetic (Beider-Morse) broadening to match possible alternative spellings in English
-->
<fieldType name="thai_english" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUTransformFilterFactory" id="Thai-Latin" />
<filter class="solr.ICUTransformFilterFactory"
id="NFD; [:Nonspacing Mark:] Remove; NFC"/>
<filter class="solr.BeiderMorseFilterFactory"/>
</analyzer>
<analyzer type="query">...
Source: https://github.com/arafalov/solr-thai-test/
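The tone-mark removal step (NFD; [:Nonspacing Mark:] Remove; NFC) can be tried outside Solr. Below is a minimal Python sketch using only the standard library; it assumes that the Unicode category "Mn" corresponds to ICU's Nonspacing Mark set, and the function name strip_nonspacing_marks is hypothetical. It does not reproduce the Thai-Latin transliteration or the Beider-Morse step.

```python
import unicodedata


def strip_nonspacing_marks(text):
    """Decompose, drop combining marks (category Mn), then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


# Accented Latin letters lose their accents; plain letters are untouched.
print(strip_nonspacing_marks("khráp"))
print(strip_nonspacing_marks("Kölner Phonetik"))
```

In the Solr pipeline this runs after transliteration, so the accents being removed are the tone indicators that the Thai-Latin transform produced.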
Resources
● Intro: https://www.slideshare.net/lucenerevolution/language-support-and-linguistics-in-lucene-solr-its-eco-system
● Solr Reference Guide:
○ https://lucene.apache.org/solr/guide/7_4/understanding-analyzers-tokenizers-and-filters.html
○ Understanding Analyzers, Tokenizers, and Filters
○ Analyzers
○ About Tokenizers
○ About Filters
○ Tokenizers
○ Filter Descriptions
○ CharFilterFactories
○ Language Analysis
○ Phonetic Matching (Beider-Morse Phonetic Matching (BMPM), Daitch-Mokotoff Soundex,
Double Metaphone, Metaphone, Soundex, Refined Soundex, Caverphone, Kölner Phonetik,
NYSIIS)
○ Running Your Analyzer
● http://www.solr-start.com/info/analyzers/
○ Automatically-generated list of Analyzers, CharFilters, Tokenizers, and TokenFilters
N-grams
● Character-level (ngram {2,3} abcd -> ab, abc, bc, bcd, cd)
○ EdgeNGramTokenizerFactory and EdgeNGramFilterFactory (prefixes)
○ NGramTokenizerFactory and NGramFilterFactory (anywhere in the text)
○ CJKBigramFilterFactory (Chinese-Japanese-Korean)
● Token (word) level (the rain in spain falls mainly -> the rain, rain in, ...)
○ ShingleFilterFactory (token n-grams)
○ CommonGramsFilterFactory and CommonGramsQueryFilterFactory
■ shingles but only with common (stop) words
● Can be used for named entity identification
○ Shingle the normalized tokens (e.g. lowercased)
○ Facet on the results
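The character-level behaviour can be illustrated in a few lines of Python. This is only a sketch of the idea, not the Lucene implementation; the function names char_ngrams and edge_ngrams are hypothetical.

```python
def char_ngrams(text, min_size, max_size):
    """All substrings of length min_size..max_size, anywhere in the text."""
    return [
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_size, max_size + 1)
        if i + n <= len(text)
    ]


def edge_ngrams(text, min_size, max_size):
    """Only the substrings that start at the first character (prefixes)."""
    return [text[:n] for n in range(min_size, min(max_size, len(text)) + 1)]


print(char_ngrams("abcd", 2, 3))  # ['ab', 'abc', 'bc', 'bcd', 'cd']
print(edge_ngrams("abcd", 2, 3))  # ['ab', 'abc']
```

Edge n-grams are the usual choice for prefix matching (autocomplete), while full n-grams allow matching anywhere inside a token at the cost of a larger index.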
N-grams - example - define
<fieldType name="text_shingle" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory"
minShingleSize="2" maxShingleSize="2"
outputUnigrams="false"
outputUnigramsIfNoShingles="false"
tokenSeparator=" " fillerToken="_"/>
</analyzer>
</fieldType>
<field name="shingles" type="text_shingle" indexed="true" stored="true"/>
N-grams - example - use
● Index
○ The rain in Spain falls gently on the plane.
○ The rain is quite heavy in Spain
○ Heavy rain could be dangerous
○ The weather in Spain could be quite nice
● Query .../select?
q=*:*
&facet=on
&facet.field=shingles
● Result (top entries)
○ in spain,3
○ could be,2
○ the rain,2
○ be dangerous,1
○ ...
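The facet counts above can be reproduced without Solr. The Python sketch below mimics the text_shingle field (whitespace tokenizer, lowercasing, two-word shingles) and counts in how many documents each shingle occurs; the helper names are hypothetical, and ties are assumed to be broken alphabetically.

```python
from collections import Counter

DOCS = [
    "The rain in Spain falls gently on the plane.",
    "The rain is quite heavy in Spain",
    "Heavy rain could be dangerous",
    "The weather in Spain could be quite nice",
]


def shingles(text, size=2):
    """Lowercase, split on whitespace, and join each run of `size` tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def facet_counts(docs):
    """Count the documents containing each shingle, highest count first."""
    counts = Counter()
    for doc in docs:
        counts.update(set(shingles(doc)))  # a facet counts documents, not occurrences
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


print(facet_counts(DOCS)[:3])  # [('in spain', 3), ('could be', 2), ('the rain', 2)]
```

Frequent shingles such as "in spain" are candidates for named entities; function-word pairs such as "could be" show why a further filtering step is usually needed.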
OpenNLP integration
● The Apache OpenNLP library is a machine learning based toolkit
for the processing of natural language text.
● Reference: https://lucene.apache.org/solr/guide/7_4/language-analysis.html
● OpenNLP in Solr
○ OpenNLPTokenizerFactory (including sentence chunking)
○ OpenNLPLemmatizerFilterFactory (as opposed to stemming)
○ OpenNLPPOSFilterFactory (part of speech: noun, verb, adjective, etc)
○ OpenNLPChunkerFilterFactory (e.g. Noun Phrase)
○ OpenNLPExtractNamedEntitiesUpdateProcessorFactory (NER URP!)
○ OpenNLPLangDetectUpdateProcessorFactory (detect language)
■ one of 3 language detectors in Solr
● Challenge
○ All require models
○ Solr does not include models
○ OpenNLP only provides some models - need to train your own
OpenNLP - NER - managed-schema
<fieldType name="opennlp-en-tokenization"
class="solr.TextField">
<analyzer>
<tokenizer class="solr.OpenNLPTokenizerFactory"
sentenceModel="conf/models/en-sent.bin"
tokenizerModel="conf/models/en-token.bin"/>
</analyzer>
</fieldType>
OpenNLP - NER - example - solrconfig.xml
● Chain definition (not default):
<updateRequestProcessorChain name="opennlp-extract">
<processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
<str name="modelFile">conf/models/en-ner-person.bin</str>
<str name="analyzerFieldType">opennlp-en-tokenization</str>
<str name="source">text_s</str>
<str name="dest">people_ss</str>
</processor>
<processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
<str name="modelFile">conf/models/en-ner-organization.bin</str>
<str name="analyzerFieldType">opennlp-en-tokenization</str>
<str name="source">text_s</str>
<str name="dest">organizations_ss</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
OpenNLP - NER - example - cont
● In solrconfig.xml, add extra libraries:
<lib
dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs/"
regex="lucene-analyzers-opennlp-.*\.jar" />
<lib
dir="${solr.install.dir:../../../..}/dist/"
regex="solr-analysis-extras-.*\.jar" />
● Download (4) models from OpenNLP site:
http://opennlp.sourceforge.net/models-1.5/
● Put them into <core>/conf/models (for non-Cloud setup)
● Reference (one line):
○ https://lucene.apache.org/solr/guide/7_4/update-request-processors.html
#update-processor-factories-that-can-be-loaded-as-plugins
OpenNLP - NER - example - index and query
● Index (one long line):
bin/post -c test -params update.chain="opennlp-extract"
-type text/csv -out yes -d
$'id,text_s\n1,"When I was working at IBM, I ate a lot of apples. Now that I
work at Apple, I eat a lot of pears. I have no idea what fruit I will like if I
ever work with John Scott for General Motors."'
● Query http://localhost:8983/solr/test/select?q=*:* :
{
id:1,
text_s:When I was working at IBM, I ate a lot of apples. Now that I work at Apple, I eat a lot of
pears. I have no idea what fruit I will like if I ever work with John Scott for General Motors.,
people_ss:[John Scott],
organizations_ss:[IBM, Apple, General Motors],
_version_:1606739364120887296
}
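The indexed text contains commas, so the value has to be quoted for the CSV payload to parse as two columns. A minimal Python sketch of building that payload with the standard csv module follows; build_csv is a hypothetical helper, and sending the body to Solr is left out.

```python
import csv
import io

TEXT = (
    "When I was working at IBM, I ate a lot of apples. Now that I work at Apple, "
    "I eat a lot of pears. I have no idea what fruit I will like if I ever work "
    "with John Scott for General Motors."
)


def build_csv(rows, header=("id", "text_s")):
    """Serialise rows as CSV; values containing commas are quoted automatically."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


body = build_csv([("1", TEXT)])
print(body)
```

The printed body has a header line followed by one record whose second field is wrapped in double quotes, which is the shape the bin/post command expects.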
Gazetteer (reverse lookup)
● Gazetteer: A dictionary, listing, or index of geographic names
● NLP Gazetteer: A closed list of names (entities) to match in the text
● Solr implementation: Tagger handler aka SolrTextTagger
○ Given a dictionary (a Solr index) with a name-like field, you can post text to this request
handler and it will return every occurrence of one of those names with offsets and other
document metadata desired. Tagger does Lucene text analysis.
● Reference: https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html
○ Includes full working tutorial
○ Not going to repeat it here
● Let's use films example's name field as gazetteer
○ Create films example as per example/films/README.txt (but don't index text yet)
○ Add Tagger schema changes (skip text field definition) and request handler definition
○ Index films into updated definition (or reindex, if you indexed already)
Reminder - Film Example
● Recently added example in example/films
○ 1100 records about real movies
○ available in XML, JSON, and CSV format to demonstrate indexing
○ uses basic schema and also shows how to work around "schemaless mode" limitations
○ gives full instructions to get it working
○ good toy dataset with text and facetable fields
● Sample record:
{
"id": "/en/black_hawk_down",
"directed_by": [ "Ridley Scott"],
"initial_release_date": "2001-12-18",
"name": "Black Hawk Down",
"genre": ["War film", "Action/Adventure", "Action Film",
"History", "Combat Films", "Drama"]
}
Gazetteer (reverse lookup) - calling
● The tagger is a separate request handler (/tag)
● We send it text (and parameters) and get back matches with desired fields
● curl -X POST 'http://localhost:8983/solr/films/tag?fl=id,name,directed_by&matchText=true'
-H 'Content-Type:text/plain' -d 'I loved the movie A beautiful mind but was not too keen on a Knight Tale'
Gazetteer (reverse lookup) - result (reformatted)
{
"responseHeader":{ "status":0, "QTime":0},
"tagsCount":2,
"tags":[
[ "startOffset",19, "endOffset",35,
"matchText","A beautiful mind", "ids",["/en/a_beautiful_mind"]],
[ "startOffset",61, "endOffset",74,
"matchText","a Knight Tale", "ids",["/en/a_knights_tale"]]],
"response":{"numFound":2,"start":0,"docs":[
{
"id":"/en/a_beautiful_mind",
"directed_by":["Ron Howard"],
"name":"A Beautiful Mind"
},
{
"id":"/en/a_knights_tale",
"directed_by":["Brian Helgeland"],
"name":"A Knight's Tale"
}
]}}
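To make the reverse lookup concrete, here is a naive Python sketch of a gazetteer tagger: it lowercases whitespace-separated tokens and reports the longest dictionary match at each position with character offsets. It is an illustration only and not how the Solr Tagger works internally; it does no real text analysis, so it would not match "a Knight Tale" against "A Knight's Tale". The function names and the sample sentence are hypothetical.

```python
import re


def build_gazetteer(entries):
    """Map each name, as a tuple of lowercased tokens, to its document id."""
    return {tuple(name.lower().split()): doc_id for doc_id, name in entries}


def tag(text, gazetteer):
    """Return longest-match tags with start/end character offsets."""
    tokens = [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    longest = max((len(key) for key in gazetteer), default=0)
    tags, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + size])
            if key in gazetteer:
                start, end = tokens[i][1], tokens[i + size - 1][2]
                tags.append({"startOffset": start, "endOffset": end,
                             "matchText": text[start:end], "ids": [gazetteer[key]]})
                i += size  # skip past the matched phrase
                break
        else:
            i += 1  # no name starts at this token
    return tags


FILMS = [("/en/a_beautiful_mind", "A Beautiful Mind"),
         ("/en/black_hawk_down", "Black Hawk Down")]
for found in tag("Tonight we watch A Beautiful Mind and Black Hawk Down", build_gazetteer(FILMS)):
    print(found)
```

The output mirrors the shape of the "tags" section of the Tagger response: offsets, the matched text, and the ids of the dictionary documents.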
Significant terms
● Significant terms - returns terms, scored on how frequently they appear in the
result set and how rarely they appear in the entire corpus.
● Uses TF-IDF to calculate score - not just appearance count
● Currently documented for Streams at:
https://lucene.apache.org/solr/guide/7_4/stream-source-reference.html#significantterms
● Is also available as a Query Parser, but in 7.4 it is undocumented, its name is
misspelt, and it has local-params issues.
● Here, we use 7.5 syntax, see SOLR-12395 and SOLR-12553 for details
● Syntax:
○ fq={!significantTerms field=name numTerms=3 minTermLength=5}
○ Has to be in fq as it does not affect documents, just outputs additional info
○ Has to be against text field (so genre, not genre_str in this specific example)
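The idea of scoring terms by "frequent in the result set, rare in the corpus" can be sketched in Python as below. The scoring formula here is a simple TF-IDF-style stand-in chosen for illustration, not Solr's exact computation; the function name, the toy documents, and the tie-breaking rule are all assumptions. The num_terms and min_term_length arguments play the role of numTerms and minTermLength.

```python
import math
from collections import Counter


def significant_terms(docs, result_ids, num_terms=3, min_term_length=5):
    """Rank terms by result-set frequency weighted by corpus-wide rarity."""
    tokenised = [set(doc.lower().split()) for doc in docs]
    background = Counter(term for tokens in tokenised for term in tokens)
    foreground = Counter(term for i in result_ids for term in tokenised[i])
    scores = {
        # Illustrative weighting only: foreground doc count times a rarity factor.
        term: count * math.log(1 + len(docs) / background[term])
        for term, count in foreground.items()
        if len(term) >= min_term_length
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:num_terms]]


TITLES = [
    "godzilla returns",
    "godzilla versus death",
    "the story of love",
    "the death of love",
    "the final story",
    "the movie",
]
# Pretend the first two titles are the result set of some query.
print(significant_terms(TITLES, result_ids=[0, 1]))
```

A term such as "death" appears in the result set but also elsewhere in the corpus, so it ranks below "godzilla", which appears only in the result set.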
Significant terms in the film example
● Query (7.5 syntax):
.../films/select?rows=0
&q=*:*
&facet=on&facet.field=name
&fq={!significantTerms
field=name
minTermLength=5
numTerms=10
}
● Compare pure frequency (facet) with significant terms
Significant terms in the film example - result
● q=*:*
○ Significant Terms (in decreasing order of significance here; normally increasing):
american, movie, black, ghost, final, death, story, godzilla, blood
○ Facets (in decreasing order of count here):
the, of, a, and, in, 2, to, american, movie, amp, black, dead, love, all, big, blue, for, an, from,
bad, dark, final, ghost, ii, with, 3, boys, day, death
● q=genre:Drama
○ Significant Terms: black, american, don't, about, ghost, dirty, story, blood, death
● q=genre:Romantic
○ Significant Terms: movie, story, house, dirty, brother, black
● q=genre:Japanese
○ Significant Terms (only 2): godzilla, death
Semantic Knowledge Graph
● Score relevance against background
○ Part of "new" JSON Facets API
○ Flexible about foreground/background/global queries
○ Context-aware if used in nested facets
○ Solr "Inception" (aka "Not sure I fully grok it yet")
● Reference (and hobbies vs age example):
https://lucene.apache.org/solr/guide/7_4/json-facet-api.html#semantic-knowledge-graphs
● Our case: Compare Genres for works of Ridley Scott and Steven Soderbergh
● Baseline statistics query (one line):
http://...../films/select?q=directed_by_str:"Ridley Scott"
&facet=on&facet.field=genre_str
&rows=0&facet.mincount=1
Baseline Genre statistics
Steven Soderbergh
Drama 5
Romance Film 3
Biographical film 2
Comedy-drama 2
Indie film 2
--- the rest is 1 each ---
Comedy, Crime Fiction, Docudrama,
Drama film, Ensemble Film, Erotica,
Feminist Film, Historical drama, Legal drama,
Mystery, Romantic comedy, Thriller, Trial drama,
War film
Ridley Scott
Drama 5
Action Film 3
Crime Thriller 2
War film 2
--- the rest is 1 each ---
Action/Adventure, Adventure Film,
Biographical film, Combat Films, Comedy,
Comedy of manners, Comedy-drama,
Crime Drama, Crime Fiction, Epic film,
Film adaptation, Gangster Film,
Historical drama, Historical period drama,
History, Horror, Mystery, Psychological thriller,
Romance Film, Romantic comedy, Slice of life,
Thriller, True crime
Basic JSON Facet query
POST http://.../films/select
{
params: {q:"*:*",rows: 0},
facet: {
genre: {
type: terms,
field: genre_str,
limit: 5
}}}
1. Find all documents
2. Return none of them (to reduce output)
3. Calculate facets on field genre_str
4. Return top 5 buckets
Result:
....
"facets": {
"count": 1100,
"genre": {
"buckets": [
{"val": "Drama", "count": 552},
{"val": "Comedy", "count": 389},
{"val": "Romance Film", "count": 270},
{"val": "Thriller", "count": 259},
{"val": "Action Film", "count": 196}
]
....
Comparative Relatedness
POST http://.../films/select
{
params: {
q:"*:*",rows: 0,
fore_SS:"directed_by:\"Steven Soderbergh\"",
fore_RS:"directed_by:\"Ridley Scott\"",
back: "*:*"
},
facet: {
genre: {
type: terms,
field: genre_str,
limit: 5,
sort: "rl-SS desc",
facet: {
rl-SS: "relatedness($fore_SS, $back)",
rl-RS: "relatedness($fore_RS, $back)"
}
}}}
Result:
....
{
"val": "Drama film", "count": 1,
"rl-SS": {
"relatedness": 0.13611,
"foreground_popularity": 0.00091,
"background_popularity": 0.00091
},
"rl-RS": {
"relatedness": -0.00075,
"foreground_popularity": 0,
"background_popularity": 0.00091
}
},
....
Compare relatedness values - samples
Not "Drama": that one is too popular
Genre          Global  Steven Soderbergh     Ridley Scott
               Count   Count  Relatedness    Count  Relatedness
Drama film     1       1      0.13611        0      -0.00075
Legal drama    1       1      0.13611        0      -0.00075
Feminist Film  2       1      0.09963        0      -0.00107
Comedy-drama   58      2      0.0365         1      0.01602
Drama          552     5      0.00455        5      0.00455
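For readers wondering where such numbers come from, the Python sketch below computes a relatedness score as a z-score of the observed foreground count against the count expected from the background, squashed through a blend of sigmoid curves. This follows my understanding of Solr's relatedness() implementation, but the exact constants are an assumption and may differ between versions. The foreground size of 5 films per director is also an assumption: it is not stated on the slides, it is simply the value that reproduces the "Drama film" row. Popularity is shown as count divided by the background size, which matches the 0.00091 (1 out of 1100) in the sample output.

```python
import math

# (offset, scale) pairs of the blended sigmoids; assumed constants.
_SIGMOIDS = ((-80, 50), (-30, 30), (0, 30), (30, 30), (80, 50))


def _sigmoid(x, offset, scale):
    """Map x smoothly into the open interval (-1, 1), centred at -offset."""
    shifted = x + offset
    return shifted / (scale + abs(shifted))


def relatedness(fg_count, fg_size, bg_count, bg_size):
    """Score how over- or under-represented a bucket is in the foreground set."""
    bg_prob = bg_count / bg_size
    expected = fg_size * bg_prob
    deviation = math.sqrt(fg_size * bg_prob * (1 - bg_prob)) or 1e-10
    z = (fg_count - expected) / deviation
    return round(0.2 * sum(_sigmoid(z, off, sc) for off, sc in _SIGMOIDS), 5)


def popularity(count, bg_size):
    """Share of the background set covered by `count` documents."""
    return round(count / bg_size, 5)


# "Drama film": 1 film in 1100 overall; 1 of 5 foreground films vs 0 of 5.
print(relatedness(1, 5, 1, 1100), popularity(1, 1100))
print(relatedness(0, 5, 1, 1100), popularity(0, 1100))
```

A bucket that matches exactly as often as the background predicts scores close to zero, which is why very popular genres are poor discriminators between the two directors.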
"Inception" Relatedness
POST http://.../films/select
{
params: {
q:"*:*",rows: 0,
fore_SS:"directed_by:\"Steven Soderbergh\"",
fore_RS:"directed_by:\"Ridley Scott\"",
back:"directed_by:(\"Steven Soderbergh\" OR \"Ridley Scott\")"
},
facet: {
genre: {
type: terms, field: genre_str, limit: 5,
sort: "rl-SS desc",
facet: {
rl-SS: "relatedness($fore_SS, $back)",
rl-RS: "relatedness($fore_RS, $back)",
year2000-2004: {
type: query,
q: "initial_release_date:[2000-01-01T00:00:00Z TO 2005-01-01T00:00:00Z]",
facet: {
rl-SS: "relatedness($fore_SS, $back)",
rl-RS: "relatedness($fore_RS, $back)"
}}
}
}}}
Result:
{"val": "Romance Film", "count": 270,
"rl-SS": {
"relatedness": 0.01003,
"foreground_popularity": 0.3,
"background_popularity": 0.4
},
"rl-RS": {
"relatedness": -0.01003,
"foreground_popularity": 0.1,
"background_popularity": 0.4
},
"year2000-2004": {"count": 156,
"rl-SS": {
"relatedness": 0.01539,
"foreground_popularity": 0.3,
"background_popularity": 0.6
},
"rl-RS": {
"relatedness": -0.01338,
"foreground_popularity": 0,
"background_popularity": 0.6}
}
},
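The phrase queries inside fore_SS, fore_RS and back contain double quotes, which must be escaped once the whole request is written as JSON. One way to avoid doing that by hand is to build the request as a Python dictionary and let json.dumps handle the escaping. This is a sketch only: relatedness_request is a hypothetical helper, and posting the body to the /select endpoint is left out.

```python
import json


def relatedness_request(director_a, director_b, field="directed_by"):
    """Build the nested relatedness request comparing two directors."""
    stats = {
        "rl-SS": "relatedness($fore_SS, $back)",
        "rl-RS": "relatedness($fore_RS, $back)",
    }
    return {
        "params": {
            "q": "*:*",
            "rows": 0,
            "fore_SS": f'{field}:"{director_a}"',
            "fore_RS": f'{field}:"{director_b}"',
            "back": f'{field}:("{director_a}" OR "{director_b}")',
        },
        "facet": {
            "genre": {
                "type": "terms", "field": "genre_str", "limit": 5,
                "sort": "rl-SS desc",
                "facet": {
                    **stats,
                    # Nested query facet: the same statistics, narrowed to 2000-2004.
                    "year2000-2004": {
                        "type": "query",
                        "q": "initial_release_date:[2000-01-01T00:00:00Z TO 2005-01-01T00:00:00Z]",
                        "facet": dict(stats),
                    },
                },
            }
        },
    }


body = json.dumps(relatedness_request("Steven Soderbergh", "Ridley Scott"), indent=1)
print(body)
```

The printed body is strict JSON with quoted keys, which Solr accepts in the same way as the relaxed syntax used on the slides.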
More awesomeness - another time
● Learning to Rank
○ Machine-learned ranking models
○ https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html
○ https://www.youtube.com/watch?v=OJJe-OWHjfI
■ Learning-to-Rank with Apache Solr & Bees - Christine Poerschke, Bloomberg
■ Lucene/Solr Revolution 2017
● Graph traversal
○ https://lucene.apache.org/solr/guide/7_4/graph-traversal.html
● Streaming (including Map/Reduce)
○ https://lucene.apache.org/solr/guide/7_4/streaming-expressions.html
● Result clustering
○ https://lucene.apache.org/solr/guide/7_4/result-clustering.html
● Commercial solutions (e.g. Basis Technology)
● Searching images by auto-generated captions
Activate - The Search and AI conference
● Used to be called Lucene/Solr Revolution
● This year in Montreal, October 17-18 (with training beforehand)
● New direction with focus on AI
● https://activate-conf.com/agenda/
● Samples:
○ Making Search at Reddit Relevant
○ The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep Learning
○ The Neural Search Frontier
○ How to Build a Semantic Search System
○ Query-time Nonparametric Regression with Temporally Bounded Models
○ Building Analytics Applications with Streaming Expressions in Apache Solr
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 

Searching for AI - Leveraging Solr for classic Artificial Intelligence tasks

  • 1. Searching for AI Leveraging Solr for classic Artificial Intelligence tasks Alexandre Rafalovitch July 2018
  • 2. What are we discussing today
    1. Search IS AI (NLP/Information Retrieval)
    2. NGrams (on letters and terms)
       Example: Count-based Named Entity Recognition
    3. OpenNLP (Statistical methods/ML)
       Example: ML-based Named Entity Recognition
    4. Gazetteer
       Example: Lookup-based Named Entity Recognition
    5. Significant Terms (query parser) example
    6. Semantic Knowledge Graph (facets) example
  • 3. Basic text processing pipeline - English
    <fieldType name="text_en_splitting_tight" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="lang/stopwords_en.txt"/>
        <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="0"
                generateNumberParts="0" catenateWords="1"
                catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
      </analyzer>
      <analyzer type="query"> ....
  • 4. Basic text processing pipeline - Farsi
    <fieldType name="text_fa" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <!-- for ZeroWidthNonJoiner -->
        <charFilter class="solr.PersianCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ArabicNormalizationFilterFactory"/>
        <filter class="solr.PersianNormalizationFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="lang/stopwords_fa.txt"/>
      </analyzer>
    </fieldType>
  • 5. Complex text processing pipeline - Thai
    <!--
      1) tokenize Thai text with built-in rules+dictionary
      2) map it to Latin characters (with special accents indicating tones)
      3) get rid of tone marks, as nobody uses them
      4) do some phonetic (BMF) broadening to match possible alternative spellings in English
    -->
    <fieldType name="thai_english" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.ICUTransformFilterFactory" id="Thai-Latin" />
        <filter class="solr.ICUTransformFilterFactory" id="NFD; [:Nonspacing Mark:] Remove; NFC"/>
        <filter class="solr.BeiderMorseFilterFactory"/>
      </analyzer>
      <analyzer type="query">...
    Source: https://github.com/arafalov/solr-thai-test/
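
Step 3 of the Thai pipeline (the "NFD; [:Nonspacing Mark:] Remove; NFC" transform) can be approximated outside Solr. The following is a minimal Python sketch using only the standard library; it is not the ICU implementation, and the sample string is a made-up romanization used purely for illustration, not actual output of the Thai-Latin transform.

```python
import unicodedata


def strip_nonspacing_marks(text):
    """Decompose, drop nonspacing marks (Unicode category Mn), recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    # Hypothetical romanized input with tone accents (assumption, not ICU output).
    print(strip_nonspacing_marks("sawàtdii khráp"))
```

The decompose-filter-recompose order matters: accents only become separate code points, and therefore removable, after NFD.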
  • 6. Resources
    ● Intro: https://www.slideshare.net/lucenerevolution/language-support-and-linguistics-in-lucene-solr-its-eco-system
    ● Solr Reference Guide:
      ○ https://lucene.apache.org/solr/guide/7_4/understanding-analyzers-tokenizers-and-filters.html
      ○ Understanding Analyzers, Tokenizers, and Filters
      ○ Analyzers
      ○ About Tokenizers
      ○ About Filters
      ○ Tokenizers
      ○ Filter Descriptions
      ○ CharFilterFactories
      ○ Language Analysis
      ○ Phonetic Matching (Beider-Morse Phonetic Matching (BMPM), Daitch-Mokotoff Soundex, Double Metaphone, Metaphone, Soundex, Refined Soundex, Caverphone, Kölner Phonetik, NYSIIS)
      ○ Running Your Analyzer
    ● http://www.solr-start.com/info/analyzers/
      ○ Automatically-generated list of Analyzers, CharFilters, Tokenizers, and TokenFilters
  • 7. N-grams
    ● Character-level (ngram {2,3}: abcd -> ab, abc, bc, bcd, cd)
      ○ EdgeNGramTokenizerFactory and EdgeNGramFilterFactory (prefixes)
      ○ NGramTokenizerFactory and NGramFilterFactory (anywhere in the text)
      ○ CJKBigramFilterFactory (Chinese-Japanese-Korean)
    ● Token (word) level (the rain in spain falls mainly -> the rain, rain in, ...)
      ○ ShingleFilterFactory (token n-grams)
      ○ CommonGramsFilterFactory and CommonGramsQueryFilterFactory
        ■ shingles, but only with common (stop) words
    ● Can be used for named entity identification
      ○ Shingle the normalized tokens (e.g. lowercased)
      ○ Facet on the results
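
To make the two levels concrete, here is a small Python sketch of character n-grams, edge (prefix) n-grams, and word shingles. The function names are illustrative only; they mimic what the Solr factories above produce for simple input and are not their implementation.

```python
def char_ngrams(text, min_n, max_n):
    """All substrings of length min_n..max_n, ordered by start position."""
    return [
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_n, max_n + 1)
        if i + n <= len(text)
    ]


def edge_ngrams(text, min_n, max_n):
    """Prefixes only, as used for type-ahead style matching."""
    return [text[:n] for n in range(min_n, min(max_n, len(text)) + 1)]


def shingles(tokens, size):
    """Token-level n-grams: adjacent words joined with a single space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


if __name__ == "__main__":
    print(char_ngrams("abcd", 2, 3))
    print(edge_ngrams("solr", 1, 3))
    print(shingles("the rain in spain falls mainly".split(), 2))
```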
  • 8. N-grams - example - define
    <fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory"
                minShingleSize="2" maxShingleSize="2"
                outputUnigrams="false" outputUnigramsIfNoShingles="false"
                tokenSeparator=" " fillerToken="_"/>
      </analyzer>
    </fieldType>
    <field name="shingles" type="text_shingle" indexed="true" stored="true"/>
  • 9. N-grams - example - use
    ● Index
      ○ The rain in Spain falls gently on the plane.
      ○ The rain is quite heavy in Spain
      ○ Heavy rain could be dangerous
      ○ The weather in Spain could be quite nice
    ● Query
      .../select?
        q=*:*
        &facet=on
        &facet.field=shingles
    ● Result (top entries)
      ○ in spain,3
      ○ could be,2
      ○ the rain,2
      ○ be dangerous,1
      ○ ...
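
The facet result above can be reproduced without Solr. This Python sketch assumes whitespace tokenization plus lowercasing (as in the text_shingle field type), counts each bigram once per document (facet counts are document counts), and breaks ties alphabetically; the helper names are illustrative.

```python
from collections import Counter

DOCS = [
    "The rain in Spain falls gently on the plane.",
    "The rain is quite heavy in Spain",
    "Heavy rain could be dangerous",
    "The weather in Spain could be quite nice",
]


def bigram_shingles(text):
    """Lowercase, split on whitespace, join adjacent token pairs."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def facet_counts(docs):
    """Count the documents containing each shingle, highest count first."""
    counts = Counter()
    for doc in docs:
        counts.update(set(bigram_shingles(doc)))  # set: once per document
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    for shingle, count in facet_counts(DOCS)[:4]:
        print(f"{shingle},{count}")
```

Frequent bigrams such as "in spain" surface first, which is the count-based signal used here as a rough stand-in for named entity detection.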
  • 10. OpenNLP integration
    ● The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
    ● Reference: https://lucene.apache.org/solr/guide/7_4/language-analysis.html
    ● OpenNLP in Solr
      ○ OpenNLPTokenizerFactory (including sentence chunking)
      ○ OpenNLPLemmatizerFilterFactory (as opposed to stemming)
      ○ OpenNLPPOSFilterFactory (part of speech: noun, verb, adjective, etc.)
      ○ OpenNLPChunkerFilterFactory (e.g. Noun Phrase)
      ○ OpenNLPExtractNamedEntitiesUpdateProcessorFactory (NER URP!)
      ○ OpenNLPLangDetectUpdateProcessorFactory (detect language)
        ■ one of 3 language detectors in Solr
    ● Challenge
      ○ All require models
      ○ Solr does not include models
      ○ OpenNLP only provides some models - you need to train your own
  • 11. OpenNLP - NER - managed-schema
    <fieldType name="opennlp-en-tokenization" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.OpenNLPTokenizerFactory"
                   sentenceModel="conf/models/en-sent.bin"
                   tokenizerModel="conf/models/en-token.bin"/>
      </analyzer>
    </fieldType>
  • 12. OpenNLP - NER - example - solrconfig.xml
    ● Chain definition (not default):
    <updateRequestProcessorChain name="opennlp-extract">
      <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
        <str name="modelFile">conf/models/en-ner-person.bin</str>
        <str name="analyzerFieldType">opennlp-en-tokenization</str>
        <str name="source">text_s</str>
        <str name="dest">people_ss</str>
      </processor>
      <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
        <str name="modelFile">conf/models/en-ner-organization.bin</str>
        <str name="analyzerFieldType">opennlp-en-tokenization</str>
        <str name="source">text_s</str>
        <str name="dest">organizations_ss</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>
  • 13. OpenNLP - NER - example - cont
    ● In solrconfig.xml, add extra libraries:
      <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs/" regex="lucene-analyzers-opennlp-.*.jar" />
      <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-analysis-extras-.*.jar" />
    ● Download (4) models from OpenNLP site: http://opennlp.sourceforge.net/models-1.5/
    ● Put them into <core>/conf/models (for non-Cloud setup)
    ● Reference (one line):
      ○ https://lucene.apache.org/solr/guide/7_4/update-request-processors.html#update-processor-factories-that-can-be-loaded-as-plugins
  • 14. OpenNLP - NER - example - index and query
    ● Index (one long line):
      bin/post -c test -params update.chain="opennlp-extract" -type text/csv -out yes -d $'id,text_s\n1,When I was working at IBM, I ate a lot of apples. Now that I work at Apple, I eat a lot of pears. I have no idea what fruit I will like if I ever work with John Scott for General Motors.'
    ● Query: http://localhost:8983/solr/test/select?q=*:*
    ● Result:
      { id:1,
        text_s:When I was working at IBM, I ate a lot of apples. Now that I work at Apple, I eat a lot of pears. I have no idea what fruit I will like if I ever work with John Scott for General Motors.,
        people_ss:[John Scott],
        organizations_ss:[IBM, Apple, General Motors],
        _version_:1606739364120887296 }
  • 15. Gazetteer (reverse lookup)
    ● Gazetteer: A dictionary, listing, or index of geographic names
    ● NLP Gazetteer: A closed list of names (entities) to match in the text
    ● Solr implementation: Tagger handler aka SolrTextTagger
      ○ Given a dictionary (a Solr index) with a name-like field, you can post text to this request handler and it will return every occurrence of one of those names with offsets and other document metadata desired. Tagger does Lucene text analysis.
    ● Reference: https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html
      ○ Includes full working tutorial
      ○ Not going to repeat it here
    ● Let's use the films example's name field as a gazetteer
      ○ Create films example as per example/films/README.txt (but don't index text yet)
      ○ Add Tagger schema changes (skip text field definition) and request handler definition
      ○ Index films into updated definition (or reindex, if you indexed already)
  • 16. Reminder - Film Example
    ● Recently added example in example/films
      ○ 1100 records about real movies
      ○ available in XML, JSON, and CSV formats to demonstrate indexing
      ○ uses basic schema and also shows how to work around "schemaless mode" limitations
      ○ gives full instructions to get it working
      ○ good toy dataset with text and facetable fields
    ● Sample record:
      {
        "id": "/en/black_hawk_down",
        "directed_by": [ "Ridley Scott" ],
        "initial_release_date": "2001-12-18",
        "name": "Black Hawk Down",
        "genre": ["War film", "Action/Adventure", "Action Film", "History", "Combat Films", "Drama"]
      }
  • 17. Gazetteer (reverse lookup) - calling
    ● The tagger is a separate request handler (/tag)
    ● We send it text (and parameters) and get back matches with desired fields
    ● curl -X POST 'http://localhost:8983/solr/films/tag?fl=id,name,directed_by&matchText=true' -H 'Content-Type:text/plain' -d 'I loved the movie A beautiful mind but was not too keen on a Knight Tale'
  • 18. Gazetteer (reverse lookup) - result (reformatted)
    {
      "responseHeader":{ "status":0, "QTime":0},
      "tagsCount":2,
      "tags":[
        [ "startOffset",19, "endOffset",35,
          "matchText","A beautiful mind",
          "ids",["/en/a_beautiful_mind"]],
        [ "startOffset",61, "endOffset",74,
          "matchText","a Knight Tale",
          "ids",["/en/a_knights_tale"]]],
      "response":{"numFound":2,"start":0,"docs":[
        { "id":"/en/a_beautiful_mind",
          "directed_by":["Ron Howard"],
          "name":"A Beautiful Mind" },
        { "id":"/en/a_knights_tale",
          "directed_by":["Brian Helgeland"],
          "name":"A Knight's Tale" }
      ]}}
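
The idea behind the lookup can be illustrated with a toy longest-match dictionary tagger in Python. This is only a sketch of the concept, not how Solr's Tagger works internally; the normalization (lowercasing and dropping a trailing possessive 's) is an assumed stand-in for the Lucene analysis chain, and the offsets it reports are plain Python string indexes, not meant to reproduce the values in the response above.

```python
import re


def normalize(token):
    """Assumed analysis: lowercase and strip a trailing possessive 's."""
    token = token.lower()
    return token[:-2] if token.endswith("'s") else token


def build_gazetteer(names):
    """Map each name's normalized token tuple back to the original name."""
    return {tuple(normalize(t) for t in name.split()): name for name in names}


def tag(text, gazetteer):
    """Return (start, end, name) for every longest dictionary match in text."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    longest = max(len(key) for key in gazetteer)
    tags, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(normalize(tok) for tok, _, _ in tokens[i:i + n])
            if key in gazetteer:
                tags.append((tokens[i][1], tokens[i + n - 1][2], gazetteer[key]))
                i += n
                break
        else:
            i += 1  # no phrase starts at this token
    return tags


if __name__ == "__main__":
    films = ["A Beautiful Mind", "A Knight's Tale", "Black Hawk Down"]
    text = "I loved the movie A beautiful mind but was not too keen on a Knight Tale"
    for start, end, name in tag(text, build_gazetteer(films)):
        print(start, end, repr(text[start:end]), "->", name)
```

Applying the same normalization to both the dictionary and the input is what lets "a Knight Tale" match "A Knight's Tale".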
  • 19. Significant terms
    ● Significant terms - returns terms, scored on how frequently they appear in the result set and how rarely they appear in the entire corpus.
    ● Uses TF-IDF to calculate the score - not just appearance count
    ● Currently documented for Streams at: https://lucene.apache.org/solr/guide/7_4/stream-source-reference.html#significantterms
    ● Is also available as a Query Parser, but in 7.4 it lacks documentation, was misspelt, and had local-params issues.
    ● Here, we use 7.5 syntax; see SOLR-12395 and SOLR-12553 for details
    ● Syntax:
      ○ fq={!significantTerms field=name numTerms=3 minTermLength=5}
      ○ Has to be in fq as it does not affect documents, just outputs additional info
      ○ Has to be against a text field (so genre, not genre_str in this specific example)
  • 20. Significant terms in the film example
    ● Query (7.5 syntax):
      .../films/select?rows=0
        &q=*:*
        &facet=on&facet.field=name
        &fq={!significantTerms field=name minTermLength=5 numTerms=10}
    ● Compare pure frequency (facet) with significant terms
  • 21. Significant terms in the film example - result
    ● q=*:*
      ○ Significant Terms (in decreasing significance order here; normally increasing): american, movie, black, ghost, final, death, story, godzilla, blood
      ○ Facets (in decreasing count order here): the, of, a, and, in, 2, to, american, movie, amp, black, dead, love, all, big, blue, for, an, from, bad, dark, final, ghost, ii, with, 3, boys, day, death
    ● q=genre:Drama
      ○ Significant Terms: black, american, don't, about, ghost, dirty, story, blood, death
    ● q=genre:Romantic
      ○ Significant Terms: movie, story, house, dirty, brother, black
    ● q=genre:Japanese
      ○ Significant Terms (only 2): godzilla, death
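
The difference between plain counts and significance can be sketched in a few lines of Python. The score below (document frequency in the result set times a log inverse document frequency over the whole corpus) is a simplified TF-IDF-style assumption, not Solr's exact formula, and the film titles are invented for illustration.

```python
import math
from collections import Counter


def doc_freq(docs):
    """Number of documents each lowercased term appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def significant_terms(foreground, background, num_terms=3, min_term_length=5):
    """Rank foreground terms: frequent in the result set, rare in the corpus."""
    fg_df = doc_freq(foreground)
    bg_df = doc_freq(background)
    total = len(background)
    scored = []
    for term, fg in fg_df.items():
        if len(term) < min_term_length:
            continue
        # Assumed scoring: foreground frequency times smoothed IDF.
        score = fg * math.log((total + 1) / (bg_df[term] + 1))
        scored.append((term, score))
    scored.sort(key=lambda ts: (-ts[1], ts[0]))
    return scored[:num_terms]


if __name__ == "__main__":
    # Hypothetical titles, not taken from the films dataset.
    titles = [
        "the black ghost", "the american movie", "the final death",
        "godzilla the return", "the ghost of godzilla", "the love story",
    ]
    result_set = [t for t in titles if "godzilla" in t]
    for term, score in significant_terms(result_set, titles):
        print(term, round(score, 3))
```

A term like "the" appears in every title, so its inverse document frequency factor is zero here even before the length filter removes it; that is why it dominates the facet list but not the significant terms.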
  • 22. Semantic Knowledge Graph
    ● Score relevance against background
      ○ Part of "new" JSON Facets API
      ○ Flexible about foreground/background/global queries
      ○ Context-aware if used in nested facets
      ○ Solr "Inception" (aka "Not sure I fully grok it yet")
    ● Reference (and hobbies vs age example): https://lucene.apache.org/solr/guide/7_4/json-facet-api.html#semantic-knowledge-graphs
    ● Our case: Compare Genres for works of Ridley Scott and Steven Soderbergh
    ● Baseline statistics query (one line):
      http://...../films/select?q=directed_by_str:"Ridley Scott"&facet=on&facet.field=genre_str&rows=0&facet.mincount=1
  • 23. Baseline Genre statistics
    Steven Soderbergh
      Drama 5
      Romance Film 3
      Biographical film 2
      Comedy-drama 2
      Indie film 2
      --- the rest is 1 each ---
      Comedy, Crime Fiction, Docudrama, Drama film, Ensemble Film, Erotica, Feminist Film, Historical drama, Legal drama, Mystery, Romantic comedy, Thriller, Trial drama, War film
    Ridley Scott
      Drama 5
      Action Film 3
      Crime Thriller 2
      War film 2
      --- the rest is 1 each ---
      Action/Adventure, Adventure Film, Biographical film, Combat Films, Comedy, Comedy of manners, Comedy-drama, Crime Drama, Crime Fiction, Epic film, Film adaptation, Gangster Film, Historical drama, Historical period drama, History, Horror, Mystery, Psychological thriller, Romance Film, Romantic comedy, Slice of life, Thriller, True crime
  • 24. Basic JSON Facet query
    POST http://.../films/select
    {
      params: { q: "*:*", rows: 0 },
      facet: {
        genre: {
          type: terms,
          field: genre_str,
          limit: 5
        }
      }
    }
    1. Find all documents
    2. Return none of them (to reduce output)
    3. Calculate facets on field genre_str
    4. Return top 5 buckets
    Response:
    ....
    "facets": {
      "count": 1100,
      "genre": {
        "buckets": [
          {"val": "Drama", "count": 552},
          {"val": "Comedy", "count": 389},
          {"val": "Romance Film", "count": 270},
          {"val": "Thriller", "count": 259},
          {"val": "Action Film", "count": 196}
        ]
    ....
  • 25. Comparative Relatedness
    POST http://.../films/select
    {
      params: {
        q: "*:*", rows: 0,
        fore_SS: "directed_by:\"Steven Soderbergh\"",
        fore_RS: "directed_by:\"Ridley Scott\"",
        back: "*:*"
      },
      facet: {
        genre: {
          type: terms,
          field: genre_str,
          limit: 5,
          sort: "rl-SS desc",
          facet: {
            rl-SS: "relatedness($fore_SS, $back)",
            rl-RS: "relatedness($fore_RS, $back)",
          }
        }
      }
    }
    Response (one bucket):
    ....
    { "val": "Drama film",
      "count": 1,
      "rl-SS": {
        "relatedness": 0.13611,
        "foreground_popularity": 0.00091,
        "background_popularity": 0.00091 },
      "rl-RS": {
        "relatedness": -0.00075,
        "foreground_popularity": 0,
        "background_popularity": 0.00091 }
    },
    ....
    Note: Not "Drama"; that one is too popular
  • 26. Compare relatedness values - samples

                                   Steven Soderbergh       Ridley Scott
    Genre          Global Count    Count   Relatedness     Count   Relatedness
    Drama film     1               1        0.13611        0       -0.00075
    Legal drama    1               1        0.13611        0       -0.00075
    Feminist Film  2               1        0.09963        0       -0.00107
    Comedy-drama   58              2        0.0365         1        0.01602
    Drama          552             5        0.00455        5        0.00455
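
For readers who want to see where such numbers can come from, the Python sketch below computes a z-score of the foreground count against the background rate and squashes it through a blend of sigmoids. The constants follow what Solr's relatedness() aggregation appears to do, but treat them as an assumption to check against the Solr source. The foreground size of 5 films per director is also an assumption, inferred because it makes several rows of the table come out right; it is not stated on the slides, and not every row (for example Drama) reproduces under it.

```python
import math


def sigmoid(x, offset, scale):
    """Bounded squashing function: (x + offset) / (scale + |x + offset|)."""
    shifted = x + offset
    return shifted / (scale + abs(shifted))


def relatedness(fg_count, fg_size, bg_count, bg_size):
    """z-score of the foreground count vs. the background rate, normalized."""
    bg_prob = bg_count / bg_size
    numerator = fg_count - fg_size * bg_prob
    denominator = math.sqrt(fg_size * bg_prob * (1 - bg_prob)) or 1e-10
    z = numerator / denominator
    # Assumed blend of five sigmoids, keeping the result within (-1, 1).
    blended = 0.2 * (
        sigmoid(z, -80, 50) + sigmoid(z, -30, 30) + sigmoid(z, 0, 30)
        + sigmoid(z, 30, 30) + sigmoid(z, 80, 50)
    )
    return round(blended, 5)


def popularity(count, bg_size):
    """Share of the background set; matches the *_popularity values shown."""
    return round(count / bg_size, 5)


if __name__ == "__main__":
    FG_SIZE = 5      # assumption: films per director in the foreground set
    BG_SIZE = 1100   # all films (background *:*)
    print(relatedness(1, FG_SIZE, 1, BG_SIZE))   # "Drama film", Soderbergh
    print(relatedness(0, FG_SIZE, 1, BG_SIZE))   # "Drama film", Scott
    print(relatedness(2, FG_SIZE, 58, BG_SIZE))  # "Comedy-drama", Soderbergh
    print(popularity(1, BG_SIZE))
```

A genre that occurs once in the whole corpus and once in a director's films gets a high score, while a genre absent from the foreground goes slightly negative.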
  • 27. "Inception" Relatedness
    POST http://.../films/select
    {
      params: {
        q: "*:*", rows: 0,
        fore_SS: "directed_by:\"Steven Soderbergh\"",
        fore_RS: "directed_by:\"Ridley Scott\"",
        back: "directed_by:(\"Steven Soderbergh\" OR \"Ridley Scott\")"
      },
      facet: {
        genre: {
          type: terms,
          field: genre_str,
          limit: 5,
          sort: "rl-SS desc",
          facet: {
            rl-SS: "relatedness($fore_SS, $back)",
            rl-RS: "relatedness($fore_RS, $back)",
            year2000-2004: {
              type: query,
              q: "initial_release_date:[2000-01-01T00:00:00Z TO 2005-01-01T00:00:00Z]",
              facet: {
                rl-SS: "relatedness($fore_SS, $back)",
                rl-RS: "relatedness($fore_RS, $back)",
              }
            }
          }
        }
      }
    }
    Response (one bucket):
    {"val": "Romance Film",
     "count": 270,
     "rl-SS": {
       "relatedness": 0.01003,
       "foreground_popularity": 0.3,
       "background_popularity": 0.4 },
     "rl-RS": {
       "relatedness": -0.01003,
       "foreground_popularity": 0.1,
       "background_popularity": 0.4 },
     "year2000-2004": {
       "count": 156,
       "rl-SS": {
         "relatedness": 0.01539,
         "foreground_popularity": 0.3,
         "background_popularity": 0.6 },
       "rl-RS": {
         "relatedness": -0.01338,
         "foreground_popularity": 0,
         "background_popularity": 0.6 }
     }
    },
  • 28. More awesomeness - another time
    ● Learning to Rank
      ○ Machine-learned ranking models
      ○ https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html
      ○ https://www.youtube.com/watch?v=OJJe-OWHjfI
        ■ Learning-to-Rank with Apache Solr & Bees - Christine Poerschke, Bloomberg
        ■ Lucene/Solr Revolution 2017
    ● Graph traversal
      ○ https://lucene.apache.org/solr/guide/7_4/graph-traversal.html
    ● Streaming (including Map/Reduce)
      ○ https://lucene.apache.org/solr/guide/7_4/streaming-expressions.html
    ● Result clustering
      ○ https://lucene.apache.org/solr/guide/7_4/result-clustering.html
    ● Commercial solutions (e.g. Basis Technology)
    ● Searching images by auto-generated captions
  • 29. Activate - The Search and AI conference
    ● Used to be called Lucene/Solr Revolution
    ● This year in Montreal, October 17-18 (with training beforehand)
    ● New direction with focus on AI
    ● https://activate-conf.com/agenda/
    ● Samples:
      ○ Making Search at Reddit Relevant
      ○ The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep Learning
      ○ The Neural Search Frontier
      ○ How to Build a Semantic Search System
      ○ Query-time Nonparametric Regression with Temporally Bounded Models
      ○ Building Analytics Applications with Streaming Expressions in Apache Solr
  • 30. Searching for AI Leveraging Solr for classic Artificial Intelligence tasks Alexandre Rafalovitch July 2018