Searching for AI
Leveraging Solr for classic
Artificial Intelligence tasks
Alexandre Rafalovitch
July 2018
What are we discussing today
1. Search IS AI (NLP/Information Retrieval)
2. NGrams (on letters and terms)
Example: Count-based Named Entity Recognition
3. OpenNLP (Statistical methods/ML)
Example: ML-based Named Entity Recognition
4. Gazetteer
Example: Lookup-based Named Entity Recognition
5. Significant Terms (query parser) example
6. Semantic Knowledge Graph (facets) example
Basic text processing pipeline - English
<fieldType name="text_en_splitting_tight" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="0"
generateNumberParts="0" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.FlattenGraphFilterFactory"/>
</analyzer>
<analyzer type="query"> ....
Basic text processing pipeline - Farsi
<fieldType name="text_fa" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<!-- for ZeroWidthNonJoiner -->
<charFilter class="solr.PersianCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.PersianNormalizationFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_fa.txt"/>
</analyzer>
</fieldType>
Complex text processing pipeline - Thai
<!--
1) tokenize Thai text with built-in rules+dictionary
2) map it to Latin characters (with special accents indicating tones)
3) get rid of tone marks, as nobody uses them
4) do some phonetic (BMF) broadening to match possible alternative spellings in English
-->
<fieldType name="thai_english" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUTransformFilterFactory" id="Thai-Latin" />
<filter class="solr.ICUTransformFilterFactory"
id="NFD; [:Nonspacing Mark:] Remove; NFC"/>
<filter class="solr.BeiderMorseFilterFactory"/>
</analyzer>
<analyzer type="query">...
Source: https://github.com/arafalov/solr-thai-test/
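Step 3 of the pipeline (the `NFD; [:Nonspacing Mark:] Remove; NFC` rule) can be approximated outside Solr. The sketch below uses Python's standard `unicodedata` module, not ICU, and the sample string is a made-up romanization rather than real Thai-Latin transliterator output.

```python
import unicodedata


def strip_nonspacing_marks(text: str) -> str:
    """Decompose (NFD), drop nonspacing marks (category Mn), recompose (NFC)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


# Hypothetical romanized input with tone accents (assumption, for illustration only).
print(strip_nonspacing_marks("sàwàtdii khráp"))
```

Decomposing first is what makes the removal work: an accented letter becomes a base letter plus a separate combining mark, and only the mark is dropped.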
Resources
● Intro: https://www.slideshare.net/lucenerevolution/language-support-and-linguistics-in-lucene-solr-its-eco-system
● Solr Reference Guide:
○ https://lucene.apache.org/solr/guide/7_4/understanding-analyzers-tokenizers-and-filters.html
○ Understanding Analyzers, Tokenizers, and Filters
○ Analyzers
○ About Tokenizers
○ About Filters
○ Tokenizers
○ Filter Descriptions
○ CharFilterFactories
○ Language Analysis
○ Phonetic Matching (Beider-Morse Phonetic Matching (BMPM), Daitch-Mokotoff Soundex,
Double Metaphone, Metaphone, Soundex, Refined Soundex, Caverphone, Kölner Phonetik,
NYSIIS)
○ Running Your Analyzer
● http://www.solr-start.com/info/analyzers/
○ Automatically-generated list of Analyzers, CharFilters, Tokenizers, and TokenFilters
N-grams
● Character-level (ngram {2,3} abcd -> ab, abc, bc, bcd, cd)
○ EdgeNGramTokenizerFactory and EdgeNGramFilterFactory (prefixes)
○ NGramTokenizerFactory and NGramFilterFactory (anywhere in the text)
○ CJKBigramFilterFactory (Chinese-Japanese-Korean)
● Token (word) level (the rain in spain falls mainly -> the rain, rain in, ...)
○ ShingleFilterFactory (token n-grams)
○ CommonGramsFilterFactory and CommonGramsQueryFilterFactory
■ shingles but only with common (stop) words
● Can be used for named entity identification
○ Shingle the normalized tokens (e.g. lowercased)
○ Facet on the results
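The character-level case can be sketched in a few lines of Python. This is an illustration of the idea only, not the Lucene tokenizer code; the function names and defaults are assumptions chosen to mirror the `{2,3}` example above.

```python
def char_ngrams(text, min_gram=2, max_gram=3):
    """All substrings of length min_gram..max_gram, ordered by start position."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def edge_ngrams(text, min_gram=2, max_gram=3):
    """Prefix-only variant: n-grams anchored at the start of the text."""
    return [text[:size] for size in range(min_gram, max_gram + 1) if size <= len(text)]


print(char_ngrams("abcd"))  # ['ab', 'abc', 'bc', 'bcd', 'cd']
print(edge_ngrams("abcd"))  # ['ab', 'abc']
```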
N-grams - example - define
<fieldType name="text_shingle" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory"
minShingleSize="2" maxShingleSize="2"
outputUnigrams="false"
outputUnigramsIfNoShingles="false"
tokenSeparator=" " fillerToken="_"/>
</analyzer>
</fieldType>
<field name="shingles" type="text_shingle" indexed="true" stored="true"/>
N-grams - example - use
● Index
○ The rain in Spain falls gently on the plane.
○ The rain is quite heavy in Spain
○ Heavy rain could be dangerous
○ The weather in Spain could be quite nice
● Query .../select?
q=*:*
&facet=on
&facet.field=shingles
● Result (top entries)
○ in spain,3
○ could be,2
○ the rain,2
○ be dangerous,1
○ ...
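The facet counts above can be reproduced without Solr. The sketch below mimics the `text_shingle` chain (whitespace split, lowercase, 2-token shingles) and counts each shingle once per document, as a field facet does. The tie-break by term is an assumption that happens to match the order shown on the slide.

```python
from collections import Counter

DOCS = [
    "The rain in Spain falls gently on the plane.",
    "The rain is quite heavy in Spain",
    "Heavy rain could be dangerous",
    "The weather in Spain could be quite nice",
]


def shingles(text, size=2):
    """Whitespace-tokenize, lowercase, and join adjacent tokens into shingles."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def facet_counts(docs):
    """Number of documents containing each shingle (document frequency)."""
    counts = Counter()
    for doc in docs:
        counts.update(set(shingles(doc)))
    return counts


counts = facet_counts(DOCS)
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:4]
for term, count in top:
    print(f"{term},{count}")
```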
OpenNLP integration
● The Apache OpenNLP library is a machine learning based toolkit
for the processing of natural language text.
● Reference: https://lucene.apache.org/solr/guide/7_4/language-analysis.html
● OpenNLP in Solr
○ OpenNLPTokenizerFactory (including sentence chunking)
○ OpenNLPLemmatizerFilterFactory (as opposed to stemming)
○ OpenNLPPOSFilterFactory (part of speech: noun, verb, adjective, etc)
○ OpenNLPChunkerFilterFactory (e.g. Noun Phrase)
○ OpenNLPExtractNamedEntitiesUpdateProcessorFactory (NER URP!)
○ OpenNLPLangDetectUpdateProcessorFactory (detect language)
■ one of 3 language detectors in Solr
● Challenge
○ All require models
○ Solr does not include models
○ OpenNLP only provides some models - need to train your own
OpenNLP - NER - managed-schema
<fieldType name="opennlp-en-tokenization"
class="solr.TextField">
<analyzer>
<tokenizer class="solr.OpenNLPTokenizerFactory"
sentenceModel="conf/models/en-sent.bin"
tokenizerModel="conf/models/en-token.bin"/>
</analyzer>
</fieldType>
OpenNLP - NER - example - solrconfig.xml
● Chain definition (not default):
<updateRequestProcessorChain name="opennlp-extract">
<processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
<str name="modelFile">conf/models/en-ner-person.bin</str>
<str name="analyzerFieldType">opennlp-en-tokenization</str>
<str name="source">text_s</str>
<str name="dest">people_ss</str>
</processor>
<processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
<str name="modelFile">conf/models/en-ner-organization.bin</str>
<str name="analyzerFieldType">opennlp-en-tokenization</str>
<str name="source">text_s</str>
<str name="dest">organizations_ss</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
OpenNLP - NER - example - cont
● In solrconfig.xml, add extra libraries:
<lib
dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs/"
regex="lucene-analyzers-opennlp-.*.jar" />
<lib
dir="${solr.install.dir:../../../..}/dist/"
regex="solr-analysis-extras-.*.jar" />
● Download (4) models from OpenNLP site:
http://opennlp.sourceforge.net/models-1.5/
● Put them into <core>/conf/models (for non-Cloud setup)
● Reference (one line):
○ https://lucene.apache.org/solr/guide/7_4/update-request-processors.html
#update-processor-factories-that-can-be-loaded-as-plugins
OpenNLP - NER - example - index and query
● Index (one long line):
bin/post -c test -params update.chain="opennlp-extract"
-type text/csv -out yes -d
$'id,text_s\n1,"When I was working at IBM, I ate a lot of apples. Now that I
work at Apple, I eat a lot of pears. I have no idea what fruit I will like if I
ever work with John Scott for General Motors."'
● Query http://localhost:8983/solr/test/select?q=*:* :
{
"id":"1",
"text_s":"When I was working at IBM, I ate a lot of apples. Now that I work at Apple, I eat a lot of
pears. I have no idea what fruit I will like if I ever work with John Scott for General Motors.",
"people_ss":["John Scott"],
"organizations_ss":["IBM", "Apple", "General Motors"],
"_version_":1606739364120887296
}
Gazetteer (reverse lookup)
● Gazetteer: A dictionary, listing, or index of geographic names
● NLP Gazetteer: A closed list of names (entities) to match in the text
● Solr implementation: Tagger handler aka SolrTextTagger
○ Given a dictionary (a Solr index) with a name-like field, you can post text to this request
handler and it will return every occurrence of one of those names with offsets and other
document metadata desired. Tagger does Lucene text analysis.
● Reference: https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html
○ Includes full working tutorial
○ Not going to repeat it here
● Let's use films example's name field as gazetteer
○ Create films example as per example/films/README.txt (but don't index text yet)
○ Add Tagger schema changes (skip text field definition) and request handler definition
○ Index films into updated definition (or reindex, if you indexed already)
Reminder - Film Example
● Recently added example in example/films
○ 1100 records about real movies
○ available in XML, JSON, and CSV format to demonstrate indexing
○ uses basic schema and also shows how to work around "schemaless mode" limitations
○ gives full instructions to get it working
○ good toy dataset with text and facetable fields
● Sample record:
{
"id": "/en/black_hawk_down",
"directed_by": [ "Ridley Scott"],
"initial_release_date": "2001-12-18",
"name": "Black Hawk Down",
"genre": ["War film", "Action/Adventure", "Action Film",
"History", "Combat Films", "Drama"]
}
Gazetteer (reverse lookup) - calling
● The tagger is a separate request handler (/tag)
● We send it text (and parameters) and get back matches with desired fields
● curl -X POST 'http://localhost:8983/solr/films/tag?
fl=id,name,directed_by&matchText=true' -H 'Content-Type:text/plain' -d 'I
loved the movie A beautiful mind but was not too keen on a Knight Tale'
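To show what reverse lookup does conceptually, here is a toy gazetteer in plain Python. It is not the Tagger implementation: the three-film dictionary, the regex tokenizer, and the crude possessive stripping are all assumptions standing in for the Solr index and its text analysis.

```python
import re

# Tiny stand-in for the films index (id -> name); an assumption for illustration.
FILMS = {
    "/en/a_beautiful_mind": "A Beautiful Mind",
    "/en/a_knights_tale": "A Knight's Tale",
    "/en/black_hawk_down": "Black Hawk Down",
}

TOKEN = re.compile(r"[\w']+")


def normalize(token):
    """Lowercase and drop a possessive 's (a crude stand-in for Solr analysis)."""
    token = token.lower()
    return token[:-2] if token.endswith("'s") else token


def analyze(text):
    """Return (normalized token, start offset, end offset) triples."""
    return [(normalize(m.group()), m.start(), m.end()) for m in TOKEN.finditer(text)]


def build_gazetteer(films):
    """Map each analyzed name (as a token tuple) to its document id."""
    return {tuple(tok for tok, _, _ in analyze(name)): doc_id
            for doc_id, name in films.items()}


def tag(text, gazetteer):
    """Scan the text left to right, preferring the longest dictionary match."""
    tokens = analyze(text)
    max_len = max(len(key) for key in gazetteer)
    tags, i = [], 0
    while i < len(tokens):
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + size])
            if key in gazetteer:
                start, end = tokens[i][1], tokens[i + size - 1][2]
                tags.append({"startOffset": start, "endOffset": end,
                             "matchText": text[start:end], "ids": [gazetteer[key]]})
                i += size
                break
        else:
            i += 1  # no name starts at this token
    return tags


text = "I loved the movie A beautiful mind but was not too keen on a Knight Tale"
for match in tag(text, build_gazetteer(FILMS)):
    print(match)
```

Because both the dictionary names and the input text go through the same normalization, "a Knight Tale" still matches "A Knight's Tale", which is the point of running text analysis on both sides. Offsets depend on the exact text posted.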
Gazetteer (reverse lookup) - result (reformatted)
{
"responseHeader":{ "status":0, "QTime":0},
"tagsCount":2,
"tags":[
[ "startOffset",19, "endOffset",35,
"matchText","A beautiful mind", "ids",["/en/a_beautiful_mind"]],
[ "startOffset",61, "endOffset",74,
"matchText","a Knight Tale", "ids",["/en/a_knights_tale"]]],
"response":{"numFound":2,"start":0,"docs":[
{
"id":"/en/a_beautiful_mind",
"directed_by":["Ron Howard"],
"name":"A Beautiful Mind"
},
{
"id":"/en/a_knights_tale",
"directed_by":["Brian Helgeland"],
"name":"A Knight's Tale"
}
]}}
Significant terms
● Significant terms - returns terms, scored on how frequently they appear in the
result set and how rarely they appear in the entire corpus.
● Uses TF-IDF to calculate score - not just appearance count
● Currently documented for Streams at:
https://lucene.apache.org/solr/guide/7_4/stream-source-reference.html#significantterms
● Is also available as a Query Parser, but in 7.4 it lacks documentation, is
misspelt, and has local-params issues.
● Here, we use 7.5 syntax, see SOLR-12395 and SOLR-12553 for details
● Syntax:
○ fq={!significantTerms field=name numTerms=3 minTermLength=5}
○ Has to be in fq as it does not affect documents, just outputs additional info
○ Has to be against text field (so genre, not genre_str in this specific example)
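The scoring idea (frequent in the result set, rare in the corpus) can be sketched as follows. This is a TF-IDF-style illustration, not Solr's exact formula; the function names mirror the `numTerms` and `minTermLength` parameters, and the titles are invented.

```python
import math
from collections import Counter


def doc_freq(docs):
    """Number of documents each lowercased term appears in."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return counts


def significant_terms(foreground, corpus, num_terms=3, min_term_length=5):
    """Rank terms by result-set frequency times log inverse corpus frequency.

    Assumes every foreground document is also part of the corpus.
    """
    fg = doc_freq(foreground)
    bg = doc_freq(corpus)
    scores = {
        term: count * math.log(len(corpus) / bg[term])
        for term, count in fg.items()
        if len(term) >= min_term_length
    }
    return sorted(scores, key=lambda t: (-scores[t], t))[:num_terms]


# Invented titles, for illustration only.
corpus = [
    "godzilla returns", "godzilla against mothra", "american story",
    "american movie", "american ghost", "final death",
]
result_set = corpus[:2]
print(significant_terms(result_set, corpus, num_terms=1))
```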
Significant terms in the film example
● Query (7.5 syntax):
.../films/select?rows=0
&q=*:*
&facet=on&facet.field=name
&fq={!significantTerms
field=name
minTermLength=5
numTerms=10
}
● Compare pure frequency (facet) with significant terms
Significant terms in the film example - result
● q=*:*
○ Significant Terms (in decreasing significance order here; normally increasing):
american, movie, black, ghost, final, death, story, godzilla, blood
○ Facets (in decreasing count order here):
the, of, a, and, in, 2, to, american, movie, amp, black, dead, love, all, big, blue, for, an, from,
bad, dark, final, ghost, ii, with, 3, boys, day, death
● q=genre:Drama
○ Significant Terms: black, american, don't, about, ghost, dirty, story, blood, death
● q=genre:Romantic
○ Significant Terms: movie, story, house, dirty, brother, black
● q=genre:Japanese
○ Significant Terms (only 2): godzilla, death
Semantic Knowledge Graph
● Score relevance against background
○ Part of "new" JSON Facets API
○ Flexible about foreground/background/global queries
○ Context-aware if used in nested facets
○ Solr "Inception" (aka "Not sure I fully grok it yet")
● Reference (and hobbies vs age example):
https://lucene.apache.org/solr/guide/7_4/json-facet-api.html#semantic-knowledge-graphs
● Our case: Compare Genres for works of Ridley Scott and Steven Soderbergh
● Baseline statistics query (one line):
http://...../films/select?q=directed_by_str:"Ridley Scott"
&facet=on&facet.field=genre_str
&rows=0&facet.mincount=1
Baseline Genre statistics
● Steven Soderbergh
○ Drama 5
○ Romance Film 3
○ Biographical film 2
○ Comedy-drama 2
○ Indie film 2
○ The rest is 1 each: Comedy, Crime Fiction, Docudrama, Drama film, Ensemble Film, Erotica,
Feminist Film, Historical drama, Legal drama, Mystery, Romantic comedy, Thriller, Trial drama,
War film
● Ridley Scott
○ Drama 5
○ Action Film 3
○ Crime Thriller 2
○ War film 2
○ The rest is 1 each: Action/Adventure, Adventure Film, Biographical film, Combat Films, Comedy,
Comedy of manners, Comedy-drama, Crime Drama, Crime Fiction, Epic film, Film adaptation,
Gangster Film, Historical drama, Historical period drama, History, Horror, Mystery,
Psychological thriller, Romance Film, Romantic comedy, Slice of life, Thriller, True crime
Basic JSON Facet query
POST http://.../films/select
{
params: {q:"*:*",rows: 0},
facet: {
genre: {
type: terms,
field: genre_str,
limit: 5
}}}
1. Find all documents
2. Return none of them (to reduce output)
3. Calculate facets on field genre_str
4. Return top 5 buckets
● Response (excerpt):
....
"facets": {
"count": 1100,
"genre": {
"buckets": [
{"val": "Drama", "count": 552},
{"val": "Comedy", "count": 389},
{"val": "Romance Film", "count": 270},
{"val": "Thriller", "count": 259},
{"val": "Action Film", "count": 196}
]
....
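The basic JSON Facet request can also be built programmatically. The sketch below writes the same body as strict JSON using only the Python standard library. The helper names are assumptions, and `post_json` is defined but not called because the slides elide the real host.

```python
import json
import urllib.request


def basic_facet_request(field="genre_str", limit=5):
    """Match all documents, return no rows, facet on one field."""
    return {
        "params": {"q": "*:*", "rows": 0},
        "facet": {"genre": {"type": "terms", "field": field, "limit": limit}},
    }


def post_json(url, payload):
    """POST a JSON body and decode the JSON response (URL supplied by the caller)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


body = json.dumps(basic_facet_request(), indent=2)
print(body)
```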
Comparative Relatedness
POST http://.../films/select
{
params: {
q:"*:*",rows: 0,
fore_SS:"directed_by:\"Steven Soderbergh\"",
fore_RS:"directed_by:\"Ridley Scott\"",
back: "*:*"
},
facet: {
genre: {
type: terms,
field: genre_str,
limit: 5,
sort: "rl-SS desc",
facet: {
rl-SS: "relatedness($fore_SS, $back)",
rl-RS: "relatedness($fore_RS, $back)",
}
}}}
● Response (excerpt):
....
{
"val": "Drama film", "count": 1,
"rl-SS": {
"relatedness": 0.13611,
"foreground_popularity": 0.00091,
"background_popularity": 0.00091
},
"rl-RS": {
"relatedness": -0.00075,
"foreground_popularity": 0,
"background_popularity": 0.00091
}
},
....
Compare relatedness values - samples
● Not "Drama": that one is too popular
Genre           Global count   SS count   SS relatedness   RS count   RS relatedness
Drama film      1              1          0.13611          0          -0.00075
Legal drama     1              1          0.13611          0          -0.00075
Feminist Film   2              1          0.09963          0          -0.00107
Comedy-drama    58             2          0.0365           1          0.01602
Drama           552            5          0.00455          5          0.00455
(SS = Steven Soderbergh, RS = Ridley Scott)
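The popularity values reported next to each relatedness score look consistent with a simple ratio: the number of matching documents divided by the size of the background set (1100 films when the background is `*:*`). That reading is inferred from the numbers on these slides, not taken from the Solr source, so treat the sketch below as an assumption.

```python
def popularity(matching_docs, background_size):
    """Share of the background set that matches, rounded to 5 digits."""
    return round(matching_docs / background_size, 5)


# "Drama film": 1 Soderbergh film, 0 Scott films, 1 film overall, 1100 in the corpus.
print(popularity(1, 1100))  # 0.00091
print(popularity(0, 1100))  # 0.0
```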
"Inception" Relatedness
POST http://.../films/select
{
params: {
q:"*:*",rows: 0,
fore_SS:"directed_by:\"Steven Soderbergh\"",
fore_RS:"directed_by:\"Ridley Scott\"",
back:"directed_by:(\"Steven Soderbergh\" OR \"Ridley Scott\")"
},
facet: {
genre: {
type: terms, field: genre_str, limit: 5,
sort: "rl-SS desc",
facet: {
rl-SS: "relatedness($fore_SS, $back)",
rl-RS: "relatedness($fore_RS, $back)",
year2000-2004: {
type: query,
q: "initial_release_date:[2000-01-01T00:00:00Z TO 2005-01-01T00:00:00Z]",
facet: {
rl-SS: "relatedness($fore_SS, $back)",
rl-RS: "relatedness($fore_RS, $back)",
}}
}
}}}
● Response (excerpt):
{"val": "Romance Film", "count": 270,
"rl-SS": {
"relatedness": 0.01003,
"foreground_popularity": 0.3,
"background_popularity": 0.4
},
"rl-RS": {
"relatedness": -0.01003,
"foreground_popularity": 0.1,
"background_popularity": 0.4
},
"year2000-2004": {"count": 156,
"rl-SS": {
"relatedness": 0.01539,
"foreground_popularity": 0.3,
"background_popularity": 0.6
},
"rl-RS": {
"relatedness": -0.01338,
"foreground_popularity": 0,
"background_popularity": 0.6}
}
},
More awesomeness - another time
● Learning to Rank
○ Machine-learned ranking models
○ https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html
○ https://www.youtube.com/watch?v=OJJe-OWHjfI
■ Learning-to-Rank with Apache Solr & Bees - Christine Poerschke, Bloomberg
■ Lucene/Solr Revolution 2017
● Graph traversal
○ https://lucene.apache.org/solr/guide/7_4/graph-traversal.html
● Streaming (including Map/Reduce)
○ https://lucene.apache.org/solr/guide/7_4/streaming-expressions.html
● Result clustering
○ https://lucene.apache.org/solr/guide/7_4/result-clustering.html
● Commercial solutions (e.g. Basis Technology)
● Searching images by auto-generated captions
Activate - The Search and AI conference
● Used to be called Lucene/Solr Revolution
● This year in Montreal, October 17-18 (with training beforehand)
● New direction with focus on AI
● https://activate-conf.com/agenda/
● Samples:
○ Making Search at Reddit Relevant
○ The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep Learning
○ The Neural Search Frontier
○ How to Build a Semantic Search System
○ Query-time Nonparametric Regression with Temporally Bounded Models
○ Building Analytics Applications with Streaming Expressions in Apache Solr

More Related Content

What's hot

Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Mongodb basics and architecture
Mongodb basics and architectureMongodb basics and architecture
Mongodb basics and architectureBishal Khanal
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep DiveAmazon Web Services
 
A Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerA Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerMongoDB
 
Redshift performance tuning
Redshift performance tuningRedshift performance tuning
Redshift performance tuningCarlos del Cacho
 
Physical architecture of sql server
Physical architecture of sql serverPhysical architecture of sql server
Physical architecture of sql serverDivya Sharma
 
Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Databricks
 
Introduction To RDF and RDFS
Introduction To RDF and RDFSIntroduction To RDF and RDFS
Introduction To RDF and RDFSNilesh Wagmare
 
How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...HostedbyConfluent
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...OpenSource Connections
 
How to build a data dictionary
How to build a data dictionaryHow to build a data dictionary
How to build a data dictionaryPiotr Kononow
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAPEDB
 
Introduction to RDF
Introduction to RDFIntroduction to RDF
Introduction to RDFNarni Rajesh
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL DatabasesDerek Stainer
 

What's hot (20)

Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Avro
AvroAvro
Avro
 
Mongodb basics and architecture
Mongodb basics and architectureMongodb basics and architecture
Mongodb basics and architecture
 
Redis introduction
Redis introductionRedis introduction
Redis introduction
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
 
Securefile LOBs
Securefile LOBsSecurefile LOBs
Securefile LOBs
 
A Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerA Technical Introduction to WiredTiger
A Technical Introduction to WiredTiger
 
Redshift performance tuning
Redshift performance tuningRedshift performance tuning
Redshift performance tuning
 
Physical architecture of sql server
Physical architecture of sql serverPhysical architecture of sql server
Physical architecture of sql server
 
Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...
 
Introduction To RDF and RDFS
Introduction To RDF and RDFSIntroduction To RDF and RDFS
Introduction To RDF and RDFS
 
How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
 
How to build a data dictionary
How to build a data dictionaryHow to build a data dictionary
How to build a data dictionary
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
 
Introduction to RDF
Introduction to RDFIntroduction to RDF
Introduction to RDF
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL Databases
 
Oracle DBA
Oracle DBAOracle DBA
Oracle DBA
 

Similar to Leveraging Solr for AI tasks

Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHPPaul Borgermans
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Kai Chan
 
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자Donghyeok Kang
 
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)Alexandre Rafalovitch
 
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Kai Chan
 
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Kai Chan
 
Rego University: Hidden Automation & Gel Scripting, CA PPM (CA Clarity PPM)
Rego University: Hidden Automation & Gel Scripting, CA PPM (CA Clarity PPM)Rego University: Hidden Automation & Gel Scripting, CA PPM (CA Clarity PPM)
Rego University: Hidden Automation & Gel Scripting, CA PPM (CA Clarity PPM)Rego Consulting
 
Hidden automation-the-power-of-gel-scripting
Hidden automation-the-power-of-gel-scriptingHidden automation-the-power-of-gel-scripting
Hidden automation-the-power-of-gel-scriptingPrashank Singh
 
Scala laboratory: Globus. iteration #2
Scala laboratory: Globus. iteration #2Scala laboratory: Globus. iteration #2
Scala laboratory: Globus. iteration #2Vasil Remeniuk
 
Syntax Reuse: XSLT as a Metalanguage for Knowledge Representation Languages
Syntax Reuse: XSLT as a Metalanguage for Knowledge Representation LanguagesSyntax Reuse: XSLT as a Metalanguage for Knowledge Representation Languages
Syntax Reuse: XSLT as a Metalanguage for Knowledge Representation LanguagesTara Athan
 
Solr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseSolr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseAlexandre Rafalovitch
 
PyGrunn 2017 - Django Performance Unchained - slides
PyGrunn 2017 - Django Performance Unchained - slidesPyGrunn 2017 - Django Performance Unchained - slides
PyGrunn 2017 - Django Performance Unchained - slidesArtur Barseghyan
 
Information Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampInformation Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampKais Hassan, PhD
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django applicationbangaloredjangousergroup
 
jstl ( jsp standard tag library )
jstl ( jsp standard tag library )jstl ( jsp standard tag library )
jstl ( jsp standard tag library )Adarsh Patel
 

Similar to Leveraging Solr for AI tasks (20)

Solr workshop
Solr workshopSolr workshop
Solr workshop
 
Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHP
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
 
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
 
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
 
Meta Object Protocols
Meta Object ProtocolsMeta Object Protocols
Meta Object Protocols
 
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
 
Reversing JavaScript
Reversing JavaScriptReversing JavaScript
Reversing JavaScript
 
OpenCms Days 2014 - Using the SOLR collector
OpenCms Days 2014 - Using the SOLR collectorOpenCms Days 2014 - Using the SOLR collector
OpenCms Days 2014 - Using the SOLR collector
 
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
 
Rego University: Hidden Automation & Gel Scripting, CA PPM (CA Clarity PPM)
Rego University: Hidden Automation & Gel Scripting, CA PPM (CA Clarity PPM)Rego University: Hidden Automation & Gel Scripting, CA PPM (CA Clarity PPM)
Rego University: Hidden Automation & Gel Scripting, CA PPM (CA Clarity PPM)
 
Hidden automation-the-power-of-gel-scripting
Hidden automation-the-power-of-gel-scriptingHidden automation-the-power-of-gel-scripting
Hidden automation-the-power-of-gel-scripting
 
Scala laboratory: Globus. iteration #2
Scala laboratory: Globus. iteration #2Scala laboratory: Globus. iteration #2
Scala laboratory: Globus. iteration #2
 
Syntax Reuse: XSLT as a Metalanguage for Knowledge Representation Languages
Syntax Reuse: XSLT as a Metalanguage for Knowledge Representation LanguagesSyntax Reuse: XSLT as a Metalanguage for Knowledge Representation Languages
Syntax Reuse: XSLT as a Metalanguage for Knowledge Representation Languages
 
Apache solr liferay
Apache solr liferayApache solr liferay
Apache solr liferay
 
Solr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseSolr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by Case
 
PyGrunn 2017 - Django Performance Unchained - slides
PyGrunn 2017 - Django Performance Unchained - slidesPyGrunn 2017 - Django Performance Unchained - slides
PyGrunn 2017 - Django Performance Unchained - slides
 
Information Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampInformation Retrieval - Data Science Bootcamp
Information Retrieval - Data Science Bootcamp
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django application
 
jstl ( jsp standard tag library )
jstl ( jsp standard tag library )jstl ( jsp standard tag library )
jstl ( jsp standard tag library )
 
Leveraging Solr for AI tasks

  • 1. Searching for AI Leveraging Solr for classic Artificial Intelligence tasks Alexandre Rafalovitch July 2018
  • 2. What are we discussing today 1. Search IS AI (NLP/Information Retrieval) 2. NGrams (on letters and terms) Example: Count-based Named Entity Recognition 3. OpenNLP (Statistical methods/ML) Example: ML-based Named Entity Recognition 4. Gazetteer Example: Lookup-based Named Entity Recognition 5. Significant Terms (query parser) example 6. Semantic Knowledge Graph (facets) example 2
  • 3. Basic text processing pipeline - English <fieldType name="text_en_splitting_tight" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/> <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.EnglishMinimalStemFilterFactory"/> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> <filter class="solr.FlattenGraphFilterFactory"/> </analyzer> <analyzer type="query"> .... 3
  • 4. Basic text processing pipeline - Farsi <fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100"> <analyzer> <!-- for ZeroWidthNonJoiner --> <charFilter class="solr.PersianCharFilterFactory"/> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.ArabicNormalizationFilterFactory"/> <filter class="solr.PersianNormalizationFilterFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fa.txt"/> </analyzer> </fieldType> 4
  • 5. Complex text processing pipeline - Thai <!-- 1) tokenize Thai text with built-in rules+dictionary 2) map it to Latin characters (with special accents indicating tones) 3) get rid of tone marks, as nobody uses them 4) do some phonetic (BMF) broadening to match possible alternative spellings in English --> <fieldType name="thai_english" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.ICUTokenizerFactory"/> <filter class="solr.ICUTransformFilterFactory" id="Thai-Latin" /> <filter class="solr.ICUTransformFilterFactory" id="NFD; [:Nonspacing Mark:] Remove; NFC"/> <filter class="solr.BeiderMorseFilterFactory"/> </analyzer> <analyzer type="query">... Source: https://github.com/arafalov/solr-thai-test/ 5
  • 6. Resources ● Intro: https://www.slideshare.net/lucenerevolution/language-support-and-linguistics-in-lucene-solr-its-eco-system ● Solr Reference Guide: ○ https://lucene.apache.org/solr/guide/7_4/understanding-analyzers-tokenizers-and-filters.html ○ Understanding Analyzers, Tokenizers, and Filters ○ Analyzers ○ About Tokenizers ○ About Filters ○ Tokenizers ○ Filter Descriptions ○ CharFilterFactories ○ Language Analysis ○ Phonetic Matching (Beider-Morse Phonetic Matching (BMPM), Daitch-Mokotoff Soundex, Double Metaphone, Metaphone, Soundex, Refined Soundex, Caverphone, Kölner Phonetik, NYSIIS) ○ Running Your Analyzer ● http://www.solr-start.com/info/analyzers/ ○ Automatically-generated list of Analyzers, CharFilters, Tokenizers, and TokenFilters 6
  • 7. N-grams ● Character-level (ngram {2,3} abcd -> ab, abc, bc, bcd, cd) ○ EdgeNGramTokenizerFactory and EdgeNGramFilterFactory (prefixes) ○ NGramTokenizerFactory and NGramFilterFactory (anywhere in the text) ○ CJKBigramFilterFactory (Chinese-Japanese-Korean) ● Token (word) level (the rain in spain falls mainly -> the rain, rain in, ...) ○ ShingleFilterFactory (token n-grams) ○ CommonGramsFilterFactory and CommonGramsQueryFilterFactory ■ shingles but only with common (stop) words ● Can be used for named entities identification ○ Shingle the normalized tokens (e.g. lowercased) ○ Facet on the results 7
  • 8. N-grams - example - define <fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2" outputUnigrams="false" outputUnigramsIfNoShingles="false" tokenSeparator=" " fillerToken="_"/> </analyzer> </fieldType> <field name="shingles" type="text_shingle" indexed="true" stored="true"/> 8
  • 9. N-grams - example - use ● Index ○ The rain in Spain falls gently on the plane. ○ The rain is quite heavy in Spain ○ Heavy rain could be dangerous ○ The weather in Spain could be quite nice ● Query .../select? q=*:* &facet=on &facet.field=shingles ● Result (top entries) ○ in spain,3 ○ could be,2 ○ the rain,2 ○ be dangerous,1 ○ ... 9
  • 10. OpenNLP integration ● The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. ● Reference: https://lucene.apache.org/solr/guide/7_4/language-analysis.html ● OpenNLP in Solr ○ OpenNLPTokenizerFactory (including sentence chunking) ○ OpenNLPLemmatizerFilterFactory (as opposed to stemming) ○ OpenNLPPOSFilterFactory (part of speech: noun, verb, adjective, etc) ○ OpenNLPChunkerFilterFactory (e.g. Noun Phrase) ○ OpenNLPExtractNamedEntitiesUpdateProcessorFactory (NER URP!) ○ OpenNLPLangDetectUpdateProcessorFactory (detect language) ■ one of 3 language detectors in Solr ● Challenge ○ All require models ○ Solr does not include models ○ OpenNLP only provides some models - need to train your own 10
  • 11. OpenNLP - NER - managed-schema <fieldType name="opennlp-en-tokenization" class="solr.TextField"> <analyzer> <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="conf/models/en-sent.bin" tokenizerModel="conf/models/en-token.bin"/> </analyzer> </fieldType> 11
  • 12. OpenNLP - NER - example - solrconfig.xml ● Chain definition (not default): <updateRequestProcessorChain name="opennlp-extract"> <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory"> <str name="modelFile">conf/models/en-ner-person.bin</str> <str name="analyzerFieldType">opennlp-en-tokenization</str> <str name="source">text_s</str> <str name="dest">people_ss</str> </processor> <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory"> <str name="modelFile">conf/models/en-ner-organization.bin</str> <str name="analyzerFieldType">opennlp-en-tokenization</str> <str name="source">text_s</str> <str name="dest">organizations_ss</str> </processor> <processor class="solr.LogUpdateProcessorFactory"/> <processor class="solr.RunUpdateProcessorFactory"/> </updateRequestProcessorChain> 12
  • 13. OpenNLP - NER - example - cont ● In solrconfig.xml, add extra libraries: <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs/" regex="lucene-analyzers-opennlp-.*\.jar" /> <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-analysis-extras-.*\.jar" /> ● Download (4) models from OpenNLP site: http://opennlp.sourceforge.net/models-1.5/ ● Put them into <core>/conf/models (for non-Cloud setup) ● Reference: ○ https://lucene.apache.org/solr/guide/7_4/update-request-processors.html#update-processor-factories-that-can-be-loaded-as-plugins 13
  • 14. OpenNLP - NER - example - index and query ● Index (one long line): bin/post -c test -params update.chain="opennlp-extract" -type text/csv -out yes -d $'id,text_s\n1,When I was working at IBM, I ate a lot of apples. Now that I work at Apple, I eat a lot of pears. I have no idea what fruit I will like if I ever work with John Scott for General Motors.' ● Query: http://localhost:8983/solr/test/select?q=*:* ● Result: { id:1, text_s:When I was working at IBM, I ate a lot of apples. Now that I work at Apple, I eat a lot of pears. I have no idea what fruit I will like if I ever work with John Scott for General Motors., people_ss:[John Scott], organizations_ss:[IBM, Apple, General Motors], _version_:1606739364120887296} 14
  • 15. Gazetteer (reverse lookup) ● Gazetteer: A dictionary, listing, or index of geographic names ● NLP Gazetteer: A closed list of names (entities) to match in the text ● Solr implementation: Tagger handler aka SolrTextTagger ○ Given a dictionary (a Solr index) with a name-like field, you can post text to this request handler and it will return every occurrence of one of those names with offsets and other document metadata desired. Tagger does Lucene text analysis. ● Reference: https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html ○ Includes full working tutorial ○ Not going to repeat it here ● Let's use films example's name field as gazetteer ○ Create films example as per example/films/README.txt (but don't index text yet) ○ Add Tagger schema changes (skip text field definition) and request handler definition ○ Index films into updated definition (or reindex, if you indexed already) 15
  • 16. Reminder - Film Example 16 ● Recently added example in example/films ○ 1100 records about the real movies ○ available in XML, JSON, and CSV format to demonstrate indexing ○ uses basic schema and also shows how to work around "schemaless mode" limitations ○ gives full instructions to get it working ○ good toy dataset with text and facetable fields ● Sample record: { "id": "/en/black_hawk_down", "directed_by": [ "Ridley Scott"], "initial_release_date": "2001-12-18", "name": "Black Hawk Down", "genre": ["War film", "Action/Adventure", "Action Film", "History", "Combat Films", "Drama"] }
  • 17. Gazetteer (reverse lookup) - calling ● The tagger is a separate request handler (/tag) ● We send it text (and parameters) and get back matches with desired fields ● curl -X POST 'http://localhost:8983/solr/films/tag? fl="id,name,directed_by"&matchText="true"' -H 'Content-Type:text/plain' -d 'I loved the movie A beautiful mind but was not too keen on a Knight Tale' 17
  • 18. Gazetteer (reverse lookup) - result (reformatted) { "responseHeader":{ "status":0, "QTime":0}, "tagsCount":2, "tags":[ [ "startOffset",19, "endOffset",35, "matchText","A beautiful mind", "ids",["/en/a_beautiful_mind"]], [ "startOffset",61, "endOffset",74, "matchText","a Knight Tale", "ids",["/en/a_knights_tale"]]], "response":{"numFound":2,"start":0,"docs":[ { "id":"/en/a_beautiful_mind", "directed_by":["Ron Howard"], "name":"A Beautiful Mind" }, { "id":"/en/a_knights_tale", "directed_by":["Brian Helgeland"], "name":"A Knight's Tale" } ]}} 18
  • 19. Significant terms ● Significant terms - returns terms, scored on how frequently they appear in the result set and how rarely they appear in the entire corpus. ● Uses TF-IDF to calculate score - not just appearance count ● Currently documented for Streams at: https://lucene.apache.org/solr/guide/7_4/stream-source-reference.html#significantterms ● Is also available as a Query Parser, but (in 7.4) misses documentation, was misspelt, had local-params issues. ● Here, we use 7.5 syntax, see SOLR-12395 and SOLR-12553 for details ● Syntax: ○ fq={!significantTerms field=name numTerms=3 minTermLength=5} ○ Has to be in fq as it does not affect documents, just outputs additional info ○ Has to be against text field (so genre, not genre_str in this specific example) 19
  • 20. Significant terms in the film example ● Query (7.5 syntax): .../films/select?rows=0 &q=*:* &facet=on&facet.field=name &fq={!significantTerms field=name minTermLength=5 numTerms=10} ● Compare pure frequency (facet) with significant terms 20
  • 21. Significant terms in the film example - result ● q=*:* ○ Significant Terms (in decreasing significance order here, normally increasing): american, movie, black, ghost, final, death, story, godzilla, blood ○ Facets (in decreasing count order here): the, of, a, and, in, 2, to, american, movie, amp, black, dead, love, all, big, blue, for, an, from, bad, dark, final, ghost, ii, with, 3, boys, day, death ● q=genre:Drama ○ Significant Terms: black, american, don't, about, ghost, dirty, story, blood, death ● q=genre:Romantic ○ Significant Terms: movie, story, house, dirty, brother, black ● q=genre:Japanese ○ Significant Terms (only 2): godzilla, death 21
  • 22. Semantic Knowledge Graph ● Score relevance against background ○ Part of "new" JSON Facets API ○ Flexible about foreground/background/global queries ○ Context-aware if used in nested facets ○ Solr "Inception" (aka "Not sure I fully grok it yet") ● Reference (and hobbies vs age example): https://lucene.apache.org/solr/guide/7_4/json-facet-api.html#semantic-knowledge-graphs ● Our case: Compare Genres for works of Ridley Scott and Steven Soderbergh ● Baseline statistics query (one line): http://...../films/select?q=directed_by_str:"Ridley Scott" &facet=on&facet.field=genre_str &rows=0&facet.mincount=1 22
  • 23. Steven Soderbergh Drama 5 Romance Film 3 Biographical film 2 Comedy-drama 2 Indie film 2 --- the rest is 1 each --- Comedy, Crime Fiction, Docudrama, Drama film, Ensemble Film, Erotica, Feminist Film, Historical drama, Legal drama, Mystery, Romantic comedy, Thriller, Trial drama, War film Ridley Scott Drama 5 Action Film 3 Crime Thriller 2 War film 2 --- the rest is 1 each --- Action/Adventure, Adventure Film, Biographical film, Combat Films, Comedy, Comedy of manners, Comedy-drama, Crime Drama, Crime Fiction, Epic film, Film adaptation, Gangster Film, Historical drama, Historical period drama, History, Horror, Mystery, Psychological thriller, Romance Film, Romantic comedy, Slice of life, Thriller, True crime Baseline Genre statistics 23
  • 24. .... "facets": { "count": 1100, "genre": { "buckets": [ {"val": "Drama", "count": 552}, {"val": "Comedy", "count": 389}, {"val": "Romance Film", "count": 270}, {"val": "Thriller", "count": 259}, {"val": "Action Film", "count": 196} ] .... POST http://.../films/select { params: {q:"*:*",rows: 0}, facet: { genre: { type: terms, field: genre_str, limit: 5 }}} 1. Find all documents 2. Return none of them (to reduce output) 3. Calculate facets on field genre_str 4. Return top 5 buckets Basic JSON Facet query 24
  • 25. Comparative Relatedness ● Request: POST http://.../films/select { params: { q:"*:*",rows: 0, fore_SS:"directed_by:\"Steven Soderbergh\"", fore_RS:"directed_by:\"Ridley Scott\"", back: "*:*" }, facet: { genre: { type: terms, field: genre_str, limit: 5, sort: "rl-SS desc", facet: { rl-SS: "relatedness($fore_SS, $back)", rl-RS: "relatedness($fore_RS, $back)", } }}} ● Response (excerpt): .... { "val": "Drama film", "count": 1, "rl-SS": { "relatedness": 0.13611, "foreground_popularity": 0.00091, "background_popularity": 0.00091 }, "rl-RS": { "relatedness": -0.00075, "foreground_popularity": 0, "background_popularity": 0.00091 } }, .... ● Note: not "Drama"; that one is too popular 25
  • 26. Compare relatedness values - samples (SS = Steven Soderbergh, RS = Ridley Scott)
      Genre | Global count | SS count | SS relatedness | RS count | RS relatedness
      Drama film | 1 | 1 | 0.13611 | 0 | -0.00075
      Legal drama | 1 | 1 | 0.13611 | 0 | -0.00075
      Feminist Film | 2 | 1 | 0.09963 | 0 | -0.00107
      Comedy-drama | 58 | 2 | 0.0365 | 1 | 0.01602
      Drama | 552 | 5 | 0.00455 | 5 | 0.00455
  • 27. "Inception" Relatedness ● Request: POST http://.../films/select { params: { q:"*:*",rows: 0, fore_SS:"directed_by:\"Steven Soderbergh\"", fore_RS:"directed_by:\"Ridley Scott\"", back:"directed_by:(\"Steven Soderbergh\" OR \"Ridley Scott\")" }, facet: { genre: { type: terms, field: genre_str, limit: 5, sort: "rl-SS desc", facet: { rl-SS: "relatedness($fore_SS, $back)", rl-RS: "relatedness($fore_RS, $back)", year2000-2004: { type: query, q: "initial_release_date:[2000-01-01T00:00:00Z TO 2005-01-01T00:00:00Z]", facet: { rl-SS: "relatedness($fore_SS, $back)", rl-RS: "relatedness($fore_RS, $back)", }} } }}} ● Response (excerpt): {"val": "Romance Film", "count": 270, "rl-SS": { "relatedness": 0.01003, "foreground_popularity": 0.3, "background_popularity": 0.4 }, "rl-RS": { "relatedness": -0.01003, "foreground_popularity": 0.1, "background_popularity": 0.4 }, "year2000-2004": {"count": 156, "rl-SS": { "relatedness": 0.01539, "foreground_popularity": 0.3, "background_popularity": 0.6 }, "rl-RS": { "relatedness": -0.01338, "foreground_popularity": 0, "background_popularity": 0.6} } }, 27
  • 28. More awesomeness - another time ● Learning to Rank ○ Machine-learned ranking models ○ https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html ○ https://www.youtube.com/watch?v=OJJe-OWHjfI ■ Learning-to-Rank with Apache Solr & Bees - Christine Poerschke, Bloomberg ■ Lucene/Solr Revolution 2017 ● Graph traversal ○ https://lucene.apache.org/solr/guide/7_4/graph-traversal.html ● Streaming (including Map/Reduce) ○ https://lucene.apache.org/solr/guide/7_4/streaming-expressions.html ● Result clustering ○ https://lucene.apache.org/solr/guide/7_4/result-clustering.html ● Commercial solutions (e.g. Basis Technology) ● Searching images by auto-generated captions 28
  • 29. Activate - The Search and AI conference ● Used to be called Lucene/Solr Revolution ● This year in Montreal, October 17-18 (with training beforehand) ● New direction with focus on AI ● https://activate-conf.com/agenda/ ● Samples: ○ Making Search at Reddit Relevant ○ The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep Learning ○ The Neural Search Frontier ○ How to Build a Semantic Search System ○ Query-time Nonparametric Regression with Temporally Bounded Models ○ Building Analytics Applications with Streaming Expressions in Apache Solr 29
  • 30. Searching for AI Leveraging Solr for classic Artificial Intelligence tasks Alexandre Rafalovitch July 2018
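
The count-based entity spotting from slides 8 and 9 (shingle the normalized tokens, then facet on the result) can be approximated outside Solr to see why "in spain" rises to the top. The sketch below is plain Python, not Solr code. It assumes whitespace tokenization, lowercasing and two-word shingles, mirroring the text_shingle field type, and it counts each shingle once per document, which is how facet counts behave. The function name `shingles` and the variable names are illustrative and do not come from the slides.

```python
from collections import Counter


def shingles(text, size=2):
    """Return word n-grams (shingles) of lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# The four sample documents from slide 9.
docs = [
    "The rain in Spain falls gently on the plane.",
    "The rain is quite heavy in Spain",
    "Heavy rain could be dangerous",
    "The weather in Spain could be quite nice",
]

# Facet-style counting: number of documents containing each shingle.
counts = Counter()
for doc in docs:
    counts.update(set(shingles(doc)))

# Sort by count descending, then alphabetically to break ties.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
for term, count in top:
    print(f"{term},{count}")
# prints:
# in spain,3
# could be,2
# the rain,2
```

Frequent shingles such as "in spain" are candidate named entities, while "could be" shows why a stop-word filter or a manual review step is still needed after counting.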