Searching for AI
Leveraging Solr for classic
Artificial Intelligence tasks
Alexandre Rafalovitch
July 2018
What are we discussing today
1. Search IS AI (NLP/Information Retrieval)
2. NGrams (on letters and terms)
Example: Count-based Named Entity Recognition
3. OpenNLP (Statistical methods/ML)
Example: ML-based Named Entity Recognition
4. Gazetteer
Example: Lookup-based Named Entity Recognition
5. Significant Terms (query parser) example
6. Semantic Knowledge Graph (facets) example
Basic text processing pipeline - English
<fieldType name="text_en_splitting_tight" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="0"
generateNumberParts="0" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.FlattenGraphFilterFactory"/>
</analyzer>
<analyzer type="query"> ....
Basic text processing pipeline - Farsi
<fieldType name="text_fa" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<!-- for ZeroWidthNonJoiner -->
<charFilter class="solr.PersianCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.PersianNormalizationFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_fa.txt"/>
</analyzer>
</fieldType>
Complex text processing pipeline - Thai
<!--
1) tokenize Thai text with built-in rules+dictionary
2) map it to Latin characters (with special accents indicating tones)
3) get rid of tone marks, as nobody uses them
4) do some phonetic (Beider-Morse) broadening to match possible alternative spellings in English
-->
<fieldType name="thai_english" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUTransformFilterFactory" id="Thai-Latin" />
<filter class="solr.ICUTransformFilterFactory"
id="NFD; [:Nonspacing Mark:] Remove; NFC"/>
<filter class="solr.BeiderMorseFilterFactory"/>
</analyzer>
<analyzer type="query">...
Source: https://github.com/arafalov/solr-thai-test/
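Step 3 of the Thai pipeline (the "NFD; [:Nonspacing Mark:] Remove; NFC" transform) can be approximated outside Solr. The following is a minimal Python sketch, assuming the standard unicodedata module is an acceptable stand-in for ICU; the function name and the sample string are hypothetical and not taken from the slides.

```python
import unicodedata


def strip_nonspacing_marks(text):
    """Decompose, drop nonspacing marks (Unicode category Mn), recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


# Hypothetical romanized input carrying accent marks.
print(strip_nonspacing_marks("phāsā thai"))
```

This only mirrors the accent-stripping step; tokenization, the Thai-Latin transliteration, and the phonetic filter still need ICU and Lucene.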
Resources
● Intro: https://www.slideshare.net/lucenerevolution/language-support-and-linguistics-in-lucene-solr-its-eco-system
● Solr Reference Guide:
○ https://lucene.apache.org/solr/guide/7_4/understanding-analyzers-tokenizers-and-filters.html
○ Understanding Analyzers, Tokenizers, and Filters
○ Analyzers
○ About Tokenizers
○ About Filters
○ Tokenizers
○ Filter Descriptions
○ CharFilterFactories
○ Language Analysis
○ Phonetic Matching (Beider-Morse Phonetic Matching (BMPM), Daitch-Mokotoff Soundex,
Double Metaphone, Metaphone, Soundex, Refined Soundex, Caverphone, Kölner Phonetik,
NYSIIS)
○ Running Your Analyzer
● http://www.solr-start.com/info/analyzers/
○ Automatically-generated list of Analyzers, CharFilters, Tokenizers, and TokenFilters
N-grams
● Character-level (ngram {2,3} abcd -> ab, abc, bc, bcd, cd)
○ EdgeNGramTokenizerFactory and EdgeNGramFilterFactory (prefixes)
○ NGramTokenizerFactory and NGramFilterFactory (anywhere in the text)
○ CJKBigramFilterFactory (Chinese-Japanese-Korean)
● Token (word) level (the rain in spain falls mainly -> the rain, rain in, ...)
○ ShingleFilterFactory (token n-grams)
○ CommonGramsFilterFactory and CommonGramsQueryFilterFactory
■ shingles but only with common (stop) words
● Can be used for named-entity identification
○ Shingle the normalized tokens (e.g. lowercased)
○ Facet on the results
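To make the two n-gram levels concrete, here is a small Python sketch of both. It is an illustration of the idea only, not the Lucene implementation; the function names are hypothetical.

```python
def char_ngrams(text, min_size, max_size):
    """Character n-grams, ordered by start offset and then by length."""
    grams = []
    for start in range(len(text)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def shingles(tokens, size=2, separator=" "):
    """Token n-grams (shingles) of exactly `size` tokens."""
    return [separator.join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1)]


print(char_ngrams("abcd", 2, 3))
print(shingles("the rain in spain falls mainly".split()))
```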
N-grams - example - define
<fieldType name="text_shingle" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory"
minShingleSize="2" maxShingleSize="2"
outputUnigrams="false"
outputUnigramsIfNoShingles="false"
tokenSeparator=" " fillerToken="_"/>
</analyzer>
</fieldType>
<field name="shingles" type="text_shingle" indexed="true" stored="true"/>
N-grams - example - use
● Index
○ The rain in Spain falls gently on the plane.
○ The rain is quite heavy in Spain
○ Heavy rain could be dangerous
○ The weather in Spain could be quite nice
● Query .../select?
q=*:*
&facet=on
&facet.field=shingles
● Result (top entries)
○ in spain,3
○ could be,2
○ the rain,2
○ be dangerous,1
○ ...
OpenNLP integration
● The Apache OpenNLP library is a machine learning based toolkit
for the processing of natural language text.
● Reference: https://lucene.apache.org/solr/guide/7_4/language-analysis.html
● OpenNLP in Solr
○ OpenNLPTokenizerFactory (including sentence chunking)
○ OpenNLPLemmatizerFilterFactory (as opposed to stemming)
○ OpenNLPPOSFilterFactory (part of speech: noun, verb, adjective, etc)
○ OpenNLPChunkerFilterFactory (e.g. Noun Phrase)
○ OpenNLPExtractNamedEntitiesUpdateProcessorFactory (NER URP!)
○ OpenNLPLangDetectUpdateProcessorFactory (detect language)
■ one of 3 language detectors in Solr
● Challenge
○ All require models
○ Solr does not include models
○ OpenNLP only provides some models - need to train your own
OpenNLP - NER - managed-schema
<fieldType name="opennlp-en-tokenization"
class="solr.TextField">
<analyzer>
<tokenizer class="solr.OpenNLPTokenizerFactory"
sentenceModel="conf/models/en-sent.bin"
tokenizerModel="conf/models/en-token.bin"/>
</analyzer>
</fieldType>
OpenNLP - NER - example - solrconfig.xml
● Chain definition (not default):
<updateRequestProcessorChain name="opennlp-extract">
<processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
<str name="modelFile">conf/models/en-ner-person.bin</str>
<str name="analyzerFieldType">opennlp-en-tokenization</str>
<str name="source">text_s</str>
<str name="dest">people_ss</str>
</processor>
<processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
<str name="modelFile">conf/models/en-ner-organization.bin</str>
<str name="analyzerFieldType">opennlp-en-tokenization</str>
<str name="source">text_s</str>
<str name="dest">organizations_ss</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
OpenNLP - NER - example - cont
● In solrconfig.xml, add extra libraries:
<lib
dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs/"
regex="lucene-analyzers-opennlp-.*.jar" />
<lib
dir="${solr.install.dir:../../../..}/dist/"
regex="solr-analysis-extras-.*.jar" />
● Download (4) models from OpenNLP site:
http://opennlp.sourceforge.net/models-1.5/
● Put them into <core>/conf/models (for non-Cloud setup)
● Reference:
○ https://lucene.apache.org/solr/guide/7_4/update-request-processors.html#update-processor-factories-that-can-be-loaded-as-plugins
OpenNLP - NER - example - index and query
● Index (one long line):
bin/post -c test -params update.chain="opennlp-extract"
-type text/csv -out yes -d
$'id,text_s\n1,"When I was working at IBM, I ate a lot of apples. Now that I
work at Apple, I eat a lot of pears. I have no idea what fruit I will like if I
ever work with John Scott for General Motors."'
● Query http://localhost:8983/solr/test/select?q=*:* :
{
id:1,
text_s:When I was working at IBM, I ate a lot of apples. Now that I work at Apple, I eat a lot of
pears. I have no idea what fruit I will like if I ever work with John Scott for General Motors.,
people_ss:[John Scott],
organizations_ss:[IBM, Apple, General Motors],
_version_:1606739364120887296
}
Gazetteer (reverse lookup)
● Gazetteer: A dictionary, listing, or index of geographic names
● NLP Gazetteer: A closed list of names (entities) to match in the text
● Solr implementation: Tagger handler aka SolrTextTagger
○ Given a dictionary (a Solr index) with a name-like field, you can post text to this request
handler and it will return every occurrence of one of those names with offsets and other
document metadata desired. Tagger does Lucene text analysis.
● Reference: https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html
○ Includes full working tutorial
○ Not going to repeat it here
● Let's use films example's name field as gazetteer
○ Create films example as per example/films/README.txt (but don't index text yet)
○ Add Tagger schema changes (skip text field definition) and request handler definition
○ Index films into updated definition (or reindex, if you indexed already)
Reminder - Film Example
● Recently added example in example/films
○ 1100 records about real movies
○ available in XML, JSON, and CSV format to demonstrate indexing
○ uses basic schema and also shows how to work around "schemaless mode" limitations
○ gives full instructions to get it working
○ good toy dataset with text and facetable fields
● Sample record:
{
"id": "/en/black_hawk_down",
"directed_by": [ "Ridley Scott"],
"initial_release_date": "2001-12-18",
"name": "Black Hawk Down",
"genre": ["War film", "Action/Adventure", "Action Film",
"History", "Combat Films", "Drama"]
}
Gazetteer (reverse lookup) - calling
● The tagger is a separate request handler (/tag)
● We send it text (and parameters) and get back matches with desired fields
● curl -X POST 'http://localhost:8983/solr/films/tag?
fl="id,name,directed_by"&matchText="true"' -H 'Content-Type:text/plain' -d 'I
loved the movie A beautiful mind but was not too keen on a Knight Tale'
Gazetteer (reverse lookup) - result (reformatted)
{
"responseHeader":{ "status":0, "QTime":0},
"tagsCount":2,
"tags":[
[ "startOffset",19, "endOffset",35,
"matchText","A beautiful mind", "ids",["/en/a_beautiful_mind"]],
[ "startOffset",61, "endOffset",74,
"matchText","a Knight Tale", "ids",["/en/a_knights_tale"]]],
"response":{"numFound":2,"start":0,"docs":[
{
"id":"/en/a_beautiful_mind",
"directed_by":["Ron Howard"],
"name":"A Beautiful Mind"
},
{
"id":"/en/a_knights_tale",
"directed_by":["Brian Helgeland"],
"name":"A Knight's Tale"
}
]}}
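The lookup idea behind the tagger can be sketched in a few lines of Python. This is an illustration only, not how SolrTextTagger is implemented: it assumes a two-entry dictionary taken from the result above, a hypothetical analysis step (lowercasing plus dropping a possessive 's), and greedy longest-match lookup. Offsets are computed on the string as written here, so they need not equal the ones in the Solr response.

```python
import re

# Two-entry stand-in for the films index (ids and names from the result above).
FILMS = {
    "/en/a_beautiful_mind": "A Beautiful Mind",
    "/en/a_knights_tale": "A Knight's Tale",
}
TOKEN = re.compile(r"[A-Za-z0-9']+")


def analyze(text):
    """Return (term, start, end) triples: lowercased, possessive 's removed."""
    tokens = []
    for m in TOKEN.finditer(text):
        term = m.group().lower()
        if term.endswith("'s"):
            term = term[:-2]
        tokens.append((term, m.start(), m.end()))
    return tokens


def build_dictionary(names_by_id):
    """Map each analyzed name (a tuple of terms) to the ids that carry it."""
    dictionary = {}
    for doc_id, name in names_by_id.items():
        key = tuple(term for term, _, _ in analyze(name))
        dictionary.setdefault(key, []).append(doc_id)
    return dictionary


def tag(text, dictionary):
    """Scan left to right, preferring the longest dictionary match."""
    tokens = analyze(text)
    longest = max(len(key) for key in dictionary)
    tags, i = [], 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(term for term, _, _ in tokens[i:i + n])
            if key in dictionary:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                tags.append({"startOffset": start, "endOffset": end,
                             "matchText": text[start:end],
                             "ids": dictionary[key]})
                i += n
                break
        else:  # no name starts at this token
            i += 1
    return tags


TEXT = ("I loved the movie A beautiful mind "
        "but was not too keen on a Knight Tale")
tags = tag(TEXT, build_dictionary(FILMS))
for t in tags:
    print(t)
```

Matching "a Knight Tale" against "A Knight's Tale" works only because both sides go through the same analysis, which is the point of the tagger doing Lucene text analysis.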
Significant terms
● Significant terms - returns terms, scored on how frequently they appear in the
result set and how rarely they appear in the entire corpus.
● Uses TF-IDF to calculate score - not just appearance count
● Currently documented for Streams at:
https://lucene.apache.org/solr/guide/7_4/stream-source-reference.html#significantterms
● Is also available as a Query Parser, but in 7.4 it lacks documentation, was
misspelt, and had local-params issues.
● Here, we use 7.5 syntax; see SOLR-12395 and SOLR-12553 for details
● Syntax:
○ fq={!significantTerms field=name numTerms=3 minTermLength=5}
○ Has to be in fq as it does not affect documents, just outputs additional info
○ Has to be against text field (so genre, not genre_str in this specific example)
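The scoring idea (frequent in the result set, rare in the whole corpus) can be sketched in Python. The formula below is a simplified TF-IDF-style score chosen for illustration and is an assumption, not Solr's exact calculation; the titles are toy data invented for this sketch, and the parameter names merely echo numTerms and minTermLength.

```python
import math
from collections import Counter


def doc_frequencies(docs):
    """Number of documents each lowercased term appears in."""
    return Counter(t for doc in docs for t in set(doc.lower().split()))


def significant_terms(foreground_docs, all_docs, num_terms=3, min_term_length=5):
    """Rank foreground terms by (foreground doc count) x (corpus rarity)."""
    total = len(all_docs)
    background_df = doc_frequencies(all_docs)
    foreground_df = doc_frequencies(foreground_docs)
    scores = {}
    for term, fg_count in foreground_df.items():
        if len(term) < min_term_length:
            continue
        rarity = math.log((total + 1) / (background_df[term] + 1)) + 1.0
        scores[term] = fg_count * rarity
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:num_terms]


# Toy corpus (invented titles); the foreground plays the role of the result set.
ALL_DOCS = ["godzilla returns", "godzilla against mothra", "american story",
            "american dream", "american beauty", "american psycho", "tokyo story"]
FOREGROUND = ["godzilla returns", "godzilla against mothra", "tokyo story"]

result = significant_terms(FOREGROUND, ALL_DOCS)
print(result)
```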
Significant terms in the film example
● Query (7.5 syntax):
.../films/select?rows=0
&q=*:*
&facet=on&facet.field=name
&fq={!significantTerms
field=name
minTermLength=5
numTerms=10
}
● Compare pure frequency (facet) with significant terms
Significant terms in the film example - result
● q=*:*
○ Significant Terms (in decreasing significance order here; normally increasing):
american, movie, black, ghost, final, death, story, godzilla, blood
○ Facets (in decreasing count order here):
the, of, a, and, in, 2, to, american, movie, amp, black, dead, love, all, big, blue, for, an, from,
bad, dark, final, ghost, ii, with, 3, boys, day, death
● q=genre:Drama
○ Significant Terms: black, american, don't, about, ghost, dirty, story, blood, death
● q=genre:Romantic
○ Significant Terms: movie, story, house, dirty, brother, black
● q=genre:Japanese
○ Significant Terms (only 2): godzilla, death
Semantic Knowledge Graph
● Score relevance against background
○ Part of "new" JSON Facets API
○ Flexible about foreground/background/global queries
○ Context-aware if used in nested facets
○ Solr "Inception" (aka "Not sure I fully grok it yet")
● Reference (and hobbies vs age example):
https://lucene.apache.org/solr/guide/7_4/json-facet-api.html#semantic-knowledge-graphs
● Our case: Compare Genres for works of Ridley Scott and Steven Soderbergh
● Baseline statistics query (one line):
http://...../films/select?q=directed_by_str:"Ridley Scott"
&facet=on&facet.field=genre_str
&rows=0&facet.mincount=1
Baseline Genre statistics
● Steven Soderbergh
○ Drama 5
○ Romance Film 3
○ Biographical film 2
○ Comedy-drama 2
○ Indie film 2
○ The rest is 1 each: Comedy, Crime Fiction, Docudrama,
Drama film, Ensemble Film, Erotica,
Feminist Film, Historical drama, Legal drama,
Mystery, Romantic comedy, Thriller, Trial drama,
War film
● Ridley Scott
○ Drama 5
○ Action Film 3
○ Crime Thriller 2
○ War film 2
○ The rest is 1 each: Action/Adventure, Adventure Film,
Biographical film, Combat Films, Comedy,
Comedy of manners, Comedy-drama,
Crime Drama, Crime Fiction, Epic film,
Film adaptation, Gangster Film,
Historical drama, Historical period drama,
History, Horror, Mystery, Psychological thriller,
Romance Film, Romantic comedy, Slice of life,
Thriller, True crime
Basic JSON Facet query
POST http://.../films/select
{
params: {q:"*:*",rows: 0},
facet: {
genre: {
type: terms,
field: genre_str,
limit: 5
}}}
1. Find all documents
2. Return none of them (to reduce output)
3. Calculate facets on field genre_str
4. Return top 5 buckets
● Response (excerpt):
....
"facets": {
"count": 1100,
"genre": {
"buckets": [
{"val": "Drama", "count": 552},
{"val": "Comedy", "count": 389},
{"val": "Romance Film", "count": 270},
{"val": "Thriller", "count": 259},
{"val": "Action Film", "count": 196}
]
....
Comparative Relatedness
POST http://.../films/select
{
params: {
q:"*:*",rows: 0,
fore_SS:"directed_by:\"Steven Soderbergh\"",
fore_RS:"directed_by:\"Ridley Scott\"",
back: "*:*"
},
facet: {
genre: {
type: terms,
field: genre_str,
limit: 5,
sort: "rl-SS desc",
facet: {
rl-SS: "relatedness($fore_SS, $back)",
rl-RS: "relatedness($fore_RS, $back)"
}
}}}
● Response (excerpt):
....
{
"val": "Drama film", "count": 1,
"rl-SS": {
"relatedness": 0.13611,
"foreground_popularity": 0.00091,
"background_popularity": 0.00091
},
"rl-RS": {
"relatedness": -0.00075,
"foreground_popularity": 0,
"background_popularity": 0.00091
}
},
....
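Hand-writing these request bodies is error-prone because the foreground queries contain quotes of their own. One way around that is to build the body as a data structure and let a JSON library do the escaping. The Python sketch below only constructs and prints the body; the helper name and its arguments are hypothetical, and sending it to Solr is left out.

```python
import json


def relatedness_request(foreground_queries, background_query="*:*",
                        field="genre_str", limit=5, sort_by=None):
    """Build a JSON Facet body with one relatedness() stat per foreground."""
    params = {"q": "*:*", "rows": 0, "back": background_query}
    stats = {}
    for label, query in foreground_queries.items():
        params["fore_" + label] = query
        stats["rl-" + label] = "relatedness($fore_%s, $back)" % label
    genre = {"type": "terms", "field": field, "limit": limit, "facet": stats}
    if sort_by is not None:
        genre["sort"] = "rl-%s desc" % sort_by
    return {"params": params, "facet": {"genre": genre}}


body = relatedness_request(
    {"SS": 'directed_by:"Steven Soderbergh"',
     "RS": 'directed_by:"Ridley Scott"'},
    sort_by="SS",
)
print(json.dumps(body, indent=2))
```

The printed body is strict JSON, so the inner quotes around the director names come out escaped.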
Compare relatedness values - samples
Genre          Global   Steven Soderbergh      Ridley Scott
               Count    Count   Relatedness    Count   Relatedness
Drama film     1        1       0.13611        0       -0.00075
Legal drama    1        1       0.13611        0       -0.00075
Feminist Film  2        1       0.09963        0       -0.00107
Comedy-drama   58       2       0.0365         1       0.01602
Drama          552      5       0.00455        5       0.00455
● Not "Drama": that one is too popular
"Inception" Relatedness
POST http://.../films/select
{
params: {
q:"*:*",rows: 0,
fore_SS:"directed_by:\"Steven Soderbergh\"",
fore_RS:"directed_by:\"Ridley Scott\"",
back:"directed_by:(\"Steven Soderbergh\" OR \"Ridley Scott\")"
},
facet: {
genre: {
type: terms, field: genre_str, limit: 5,
sort: "rl-SS desc",
facet: {
rl-SS: "relatedness($fore_SS, $back)",
rl-RS: "relatedness($fore_RS, $back)",
year2000-2004: {
type: query,
q: "initial_release_date:[2000-01-01T00:00:00Z TO 2005-01-01T00:00:00Z]",
facet: {
rl-SS: "relatedness($fore_SS, $back)",
rl-RS: "relatedness($fore_RS, $back)"
}}
}
}}}
● Response (excerpt):
{"val": "Romance Film", "count": 270,
"rl-SS": {
"relatedness": 0.01003,
"foreground_popularity": 0.3,
"background_popularity": 0.4
},
"rl-RS": {
"relatedness": -0.01003,
"foreground_popularity": 0.1,
"background_popularity": 0.4
},
"year2000-2004": {"count": 156,
"rl-SS": {
"relatedness": 0.01539,
"foreground_popularity": 0.3,
"background_popularity": 0.6
},
"rl-RS": {
"relatedness": -0.01338,
"foreground_popularity": 0,
"background_popularity": 0.6}
}
},
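The two popularity figures reported next to each relatedness score are simple ratios. As far as I understand the Solr implementation, both are divided by the size of the background set and rounded to five digits; treat that as an assumption. The Python sketch below reproduces the popularity values shown in these slides from the baseline counts; the relatedness score itself is not reproduced, and the function name is hypothetical.

```python
def popularity(foreground_count, background_count, background_size):
    """Share of the background set matched by the bucket, in the foreground
    set and in the background set (both relative to the background size)."""
    return {
        "foreground_popularity": round(foreground_count / background_size, 5),
        "background_popularity": round(background_count / background_size, 5),
    }


# "Drama film" against the whole index: 1 Soderbergh film, 1 film overall,
# 1100 films in the background.
print(popularity(1, 1, 1100))

# "Romance Film" against the two directors only: 3 Soderbergh films, 4 films
# across both directors, and a background of 10 films (the size implied by
# the 0.3 and 0.4 values).
print(popularity(3, 4, 10))
```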
More awesomeness - another time
● Learning to Rank
○ Machine-learned ranking models
○ https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html
○ https://www.youtube.com/watch?v=OJJe-OWHjfI
■ Learning-to-Rank with Apache Solr & Bees - Christine Poerschke, Bloomberg
■ Lucene/Solr Revolution 2017
● Graph traversal
○ https://lucene.apache.org/solr/guide/7_4/graph-traversal.html
● Streaming (including Map/Reduce)
○ https://lucene.apache.org/solr/guide/7_4/streaming-expressions.html
● Result clustering
○ https://lucene.apache.org/solr/guide/7_4/result-clustering.html
● Commercial solutions (e.g. Basis Technology)
● Searching images by auto-generated captions
Activate - The Search and AI conference
● Used to be called Lucene/Solr Revolution
● This year in Montreal, October 17-18 (with training beforehand)
● New direction with focus on AI
● https://activate-conf.com/agenda/
● Samples:
○ Making Search at Reddit Relevant
○ The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep Learning
○ The Neural Search Frontier
○ How to Build a Semantic Search System
○ Query-time Nonparametric Regression with Temporally Bounded Models
○ Building Analytics Applications with Streaming Expressions in Apache Solr
Rapid Solr Schema Development (Phone directory)
Alexandre Rafalovitch
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approach
Alexandre Rafalovitch
 
Solr Masterclass Bangkok, June 2014
Solr Masterclass Bangkok, June 2014Solr Masterclass Bangkok, June 2014
Solr Masterclass Bangkok, June 2014
Alexandre Rafalovitch
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
Alexandre Rafalovitch
 

More from Alexandre Rafalovitch (6)

JSON in Solr: from top to bottom
JSON in Solr: from top to bottomJSON in Solr: from top to bottom
JSON in Solr: from top to bottom
 
From content to search: speed-dating Apache Solr (ApacheCON 2018)
From content to search: speed-dating Apache Solr (ApacheCON 2018)From content to search: speed-dating Apache Solr (ApacheCON 2018)
From content to search: speed-dating Apache Solr (ApacheCON 2018)
 
Rapid Solr Schema Development (Phone directory)
Rapid Solr Schema Development (Phone directory)Rapid Solr Schema Development (Phone directory)
Rapid Solr Schema Development (Phone directory)
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approach
 
Solr Masterclass Bangkok, June 2014
Solr Masterclass Bangkok, June 2014Solr Masterclass Bangkok, June 2014
Solr Masterclass Bangkok, June 2014
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 

Recently uploaded

Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
AnirbanRoy608946
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 

Recently uploaded (20)

Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 

Searching for AI - Leveraging Solr for classic Artificial Intelligence tasks

  • 1. Searching for AI Leveraging Solr for classic Artificial Intelligence tasks Alexandre Rafalovitch July 2018
  • 2. What are we discussing today 1. Search IS AI (NLP/Information Retrieval) 2. NGrams (on letters and terms) Example: Count-based Named Entity Recognition 3. OpenNLP (Statistical methods/ML) Example: ML-based Named Entity Recognition 4. Gazetteer Example: Lookup-based Named Entity Recognition 5. Significant Terms (query parser) example 6. Semantic Knowledge Graph (facets) example 2
  • 3. Basic text processing pipeline - English <fieldType name="text_en_splitting_tight" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/> <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.EnglishMinimalStemFilterFactory"/> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> <filter class="solr.FlattenGraphFilterFactory"/> </analyzer> <analyzer type="query"> .... 3
  • 4. Basic text processing pipeline - Farsi <fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100"> <analyzer> <!-- for ZeroWidthNonJoiner --> <charFilter class="solr.PersianCharFilterFactory"/> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.ArabicNormalizationFilterFactory"/> <filter class="solr.PersianNormalizationFilterFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fa.txt"/> </analyzer> </fieldType> 4
  • 5. Complex text processing pipeline - Thai <!-- 1) tokenize Thai text with built-in rules+dictionary 2) map it to Latin characters (with special accents indicating tones) 3) get rid of tone marks, as nobody uses them 4) do some phonetic (BMF) broadening to match possible alternative spellings in English --> <fieldType name="thai_english" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.ICUTokenizerFactory"/> <filter class="solr.ICUTransformFilterFactory" id="Thai-Latin" /> <filter class="solr.ICUTransformFilterFactory" id="NFD; [:Nonspacing Mark:] Remove; NFC"/> <filter class="solr.BeiderMorseFilterFactory"/> </analyzer> <analyzer type="query">... Source: https://github.com/arafalov/solr-thai-test/
  • 6. Resources ● Intro: https://www.slideshare.net/lucenerevolution/language-support-and-linguistics-in-lucene-solr-its-eco-system ● Solr Reference Guide: ○ https://lucene.apache.org/solr/guide/7_4/understanding-analyzers-tokenizers-and-filters.html ○ Understanding Analyzers, Tokenizers, and Filters ○ Analyzers ○ About Tokenizers ○ About Filters ○ Tokenizers ○ Filter Descriptions ○ CharFilterFactories ○ Language Analysis ○ Phonetic Matching (Beider-Morse Phonetic Matching (BMPM), Daitch-Mokotoff Soundex, Double Metaphone, Metaphone, Soundex, Refined Soundex, Caverphone, Kölner Phonetik, NYSIIS) ○ Running Your Analyzer ● http://www.solr-start.com/info/analyzers/ ○ Automatically-generated list of Analyzers, CharFilters, Tokenizers, and TokenFilters
  • 7. N-grams ● Character-level (ngram {2,3} abcd -> ab, abc, bc, bcd, cd) ○ EdgeNGramTokenizerFactory and EdgeNGramFilterFactory (prefixes) ○ NGramTokenizerFactory and NGramFilterFactory (anywhere in the text) ○ CJKBigramFilterFactory (Chinese-Japanese-Korean) ● Token (word) level (the rain in spain falls mainly -> the rain, rain in, ...) ○ ShingleFilterFactory (token n-grams) ○ CommonGramsFilterFactory and CommonGramsQueryFilterFactory ■ shingles but only with common (stop) words ● Can be used for named entities identification ○ Shingle the normalized tokens (e.g. lowercased) ○ Facet on the results 7
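The two kinds of n-grams on the slide above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Solr's factories are implemented; the function names `char_ngrams` and `shingles` are made up for this example.

```python
def char_ngrams(text, min_size, max_size):
    """Character n-grams of every size in [min_size, max_size], by start position."""
    grams = []
    for start in range(len(text)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def shingles(tokens, size=2, separator=" "):
    """Token-level n-grams (shingles) of a fixed size."""
    return [separator.join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1)]


if __name__ == "__main__":
    print(char_ngrams("abcd", 2, 3))
    print(shingles("the rain in spain falls mainly".split()))
```

For the slide's input `abcd` with sizes 2 to 3 this yields ab, abc, bc, bcd, cd, and the shingle call starts with "the rain", "rain in".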
  • 8. N-grams - example - define <fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2" outputUnigrams="false" outputUnigramsIfNoShingles="false" tokenSeparator=" " fillerToken="_"/> </analyzer> </fieldType> <field name="shingles" type="text_shingle" indexed="true" stored="true"/> 8
  • 9. N-grams - example - use ● Index ○ The rain in Spain falls gently on the plane. ○ The rain is quite heavy in Spain ○ Heavy rain could be dangerous ○ The weather in Spain could be quite nice ● Query .../select? q=*:* &facet=on &facet.field=shingles ● Result (top entries) ○ in spain,3 ○ could be,2 ○ the rain,2 ○ be dangerous,1 ○ ...
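The facet counts above can be reproduced outside Solr with a rough Python equivalent of the `text_shingle` analysis chain: whitespace tokenization, lowercasing, then two-word shingles. The sketch assumes a facet count is the number of documents containing a shingle, so each document contributes a shingle at most once; `bigram_shingles` is a hypothetical helper name.

```python
from collections import Counter

DOCS = [
    "The rain in Spain falls gently on the plane.",
    "The rain is quite heavy in Spain",
    "Heavy rain could be dangerous",
    "The weather in Spain could be quite nice",
]


def bigram_shingles(text):
    """Whitespace-tokenize, lowercase, and join adjacent token pairs."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Count documents per shingle (set() avoids double counting within one document).
counts = Counter(s for doc in DOCS for s in set(bigram_shingles(doc)))

if __name__ == "__main__":
    for shingle, count in counts.most_common(3):
        print(f"{shingle},{count}")
```

Because the tokenizer splits on whitespace only, punctuation stays attached to tokens ("plane." keeps its period), which is one reason a real pipeline may need more normalization before faceting.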
  • 10. OpenNLP integration ● The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. ● Reference: https://lucene.apache.org/solr/guide/7_4/language-analysis.html ● OpenNLP in Solr ○ OpenNLPTokenizerFactory (including sentence chunking) ○ OpenNLPLemmatizerFilter (as opposed to stemming) ○ OpenNLPPOSFilterFactory (part of speech: noun, verb, adjective, etc) ○ OpenNLPChunkerFilter (e.g. Noun Phrase) ○ OpenNLPExtractNamedEntitiesUpdateProcessorFactory (NER URP!) ○ OpenNLPLangDetectUpdateProcessorFactory (detect language) ■ one of 3 language detectors in Solr ● Challenge ○ All require models ○ Solr does not include models ○ OpenNLP only provides some models - need to train your own 10
  • 11. OpenNLP - NER - managed-schema <fieldType name="opennlp-en-tokenization" class="solr.TextField"> <analyzer> <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="conf/models/en-sent.bin" tokenizerModel="conf/models/en-token.bin"/> </analyzer> </fieldType> 11
  • 12. OpenNLP - NER - example - solrconfig.xml ● Chain definition (not default): <updateRequestProcessorChain name="opennlp-extract"> <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory"> <str name="modelFile">conf/models/en-ner-person.bin</str> <str name="analyzerFieldType">opennlp-en-tokenization</str> <str name="source">text_s</str> <str name="dest">people_ss</str> </processor> <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory"> <str name="modelFile">conf/models/en-ner-organization.bin</str> <str name="analyzerFieldType">opennlp-en-tokenization</str> <str name="source">text_s</str> <str name="dest">organizations_ss</str> </processor> <processor class="solr.LogUpdateProcessorFactory"/> <processor class="solr.RunUpdateProcessorFactory"/> </updateRequestProcessorChain> 12
  • 13. OpenNLP - NER - example - cont ● In solrconfig.xml, add extra libraries: <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs/" regex="lucene-analyzers-opennlp-.*\.jar" /> <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-analysis-extras-.*\.jar" /> ● Download (4) models from OpenNLP site: http://opennlp.sourceforge.net/models-1.5/ ● Put them into <core>/conf/models (for non-Cloud setup) ● Reference: ○ https://lucene.apache.org/solr/guide/7_4/update-request-processors.html#update-processor-factories-that-can-be-loaded-as-plugins
  • 14. OpenNLP - NER - example - index and query ● Index (one long line): bin/post -c test -params update.chain="opennlp-extract" -type text/csv -out yes -d $'id,text_s\n1,"When I was working at IBM, I ate a lot of apples. Now that I work at Apple, I eat a lot of pears. I have no idea what fruit I will like if I ever work with John Scott for General Motors."' ● Query: http://localhost:8983/solr/test/select?q=*:* ● Result (excerpt): ... { id:1, text_s:When I was working at IBM, I ate a lot of apples. Now that I work at Apple, I eat a lot of pears. I have no idea what fruit I will like if I ever work with John Scott for General Motors., people_ss:[John Scott], organizations_ss:[IBM, Apple, General Motors], _version_:1606739364120887296}] }
  • 15. Gazetteer (reverse lookup) ● Gazetteer: A dictionary, listing, or index of geographic names ● NLP Gazetteer: A closed list of names (entities) to match in the text ● Solr implementation: Tagger handler aka SolrTextTagger ○ Given a dictionary (a Solr index) with a name-like field, you can post text to this request handler and it will return every occurrence of one of those names with offsets and other document metadata desired. Tagger does Lucene text analysis. ● Reference: https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html ○ Includes full working tutorial ○ Not going to repeat it here ● Let's use films example's name field as gazetteer ○ Create films example as per example/films/README.txt (but don't index text yet) ○ Add Tagger schema changes (skip text field definition) and request handler definition ○ Index films into updated definition (or reindex, if you indexed already) 15
  • 16. Reminder - Film Example ● Recently added example in example/films ○ 1100 records about real movies ○ available in XML, JSON, and CSV format to demonstrate indexing ○ uses basic schema and also shows how to work around "schemaless mode" limitations ○ gives full instructions to get it working ○ good toy dataset with text and facetable fields ● Sample record: { "id": "/en/black_hawk_down", "directed_by": [ "Ridley Scott"], "initial_release_date": "2001-12-18", "name": "Black Hawk Down", "genre": ["War film", "Action/Adventure", "Action Film", "History", "Combat Films", "Drama"] }
  • 17. Gazetteer (reverse lookup) - calling ● The tagger is a separate request handler (/tag) ● We send it text (and parameters) and get back matches with desired fields ● curl -X POST 'http://localhost:8983/solr/films/tag?fl="id,name,directed_by"&matchText="true"' -H 'Content-Type:text/plain' -d 'I loved the movie A beautiful mind but was not too keen on a Knight Tale'
  • 18. Gazetteer (reverse lookup) - result (reformatted) { "responseHeader":{ "status":0, "QTime":0}, "tagsCount":2, "tags":[ [ "startOffset",19, "endOffset",35, "matchText","A beautiful mind", "ids",["/en/a_beautiful_mind"]], [ "startOffset",61, "endOffset",74, "matchText","a Knight Tale", "ids",["/en/a_knights_tale"]]], "response":{"numFound":2,"start":0,"docs":[ { "id":"/en/a_beautiful_mind", "directed_by":["Ron Howard"], "name":"A Beautiful Mind" }, { "id":"/en/a_knights_tale", "directed_by":["Brian Helgeland"], "name":"A Knight's Tale" } ]}} 18
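The lookup idea behind the tagger can be sketched with a toy in-memory gazetteer. This is a minimal stand-in and not the Tagger's implementation: the real handler works against a Solr index and uses the field's Lucene analysis, while this sketch assumes a very simple normalization (lowercasing and dropping a trailing possessive 's). The names `analyze`, `build_gazetteer`, and `tag` are made up for the example.

```python
import re


def analyze(text):
    """Return (normalized_token, start, end) triples; a crude stand-in for text analysis."""
    tokens = []
    for match in re.finditer(r"[A-Za-z0-9']+", text):
        token = match.group().lower()
        if token.endswith("'s"):
            token = token[:-2]  # "knight's" -> "knight"
        tokens.append((token, match.start(), match.end()))
    return tokens


def build_gazetteer(names):
    """Map each name's normalized token sequence back to the original name."""
    return {tuple(tok for tok, _, _ in analyze(name)): name for name in names}


def tag(text, gazetteer):
    """Find the longest gazetteer match starting at each token position."""
    tokens = analyze(text)
    longest = max(len(key) for key in gazetteer)
    tags, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + size])
            if key in gazetteer:
                start, end = tokens[i][1], tokens[i + size - 1][2]
                tags.append({"startOffset": start, "endOffset": end,
                             "matchText": text[start:end], "name": gazetteer[key]})
                i += size
                break
        else:
            i += 1  # no name starts at this token
    return tags


if __name__ == "__main__":
    films = build_gazetteer(["A Beautiful Mind", "A Knight's Tale", "Black Hawk Down"])
    sentence = ("I loved the movie A beautiful mind but was not too keen "
                "on a Knight Tale")
    for hit in tag(sentence, films):
        print(hit)
```

Offsets here are plain Python string indexes, so they need not line up with the offsets in the Solr response above.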
  • 19. Significant terms ● Significant terms - returns terms, scored on how frequently they appear in the result set and how rarely they appear in the entire corpus. ● Uses TF-IDF to calculate score - not just appearance count ● Currently documented for Streams at: https://lucene.apache.org/solr/guide/7_4/stream-source-reference.html#significantterms ● Also available as a Query Parser, but in 7.4 it lacks documentation, is misspelt, and has local-params issues. ● Here, we use 7.5 syntax, see SOLR-12395 and SOLR-12553 for details ● Syntax: ○ fq={!significantTerms field=name numTerms=3 minTermLength=5} ○ Has to be in fq as it does not affect documents, just outputs additional info ○ Has to be against text field (so genre, not genre_str in this specific example)
  • 20. Significant terms in the film example ● Query (7.5 syntax): .../films/select?rows=0 &q=*:* &facet=on&facet.field=name &fq={!significantTerms field=name minTermLength=5 numTerms=10 } ● Compare pure frequency (facet) with significant terms
  • 21. Significant terms in the film example - result ● q=*:* ○ Significant Terms (in decreasing significance order here, normally increasing): american, movie, black, ghost, final, death, story, godzilla, blood ○ Facets (in decreasing count order here): the, of, a, and, in, 2, to, american, movie, amp, black, dead, love, all, big, blue, for, an, from, bad, dark, final, ghost, ii, with, 3, boys, day, death ● q=genre:Drama ○ Significant Terms: black, american, don't, about, ghost, dirty, story, blood, death ● q=genre:Romantic ○ Significant Terms: movie, story, house, dirty, brother, black ● q=genre:Japanese ○ Significant Terms (only 2): godzilla, death
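The "frequent in the result set, rare in the corpus" idea can be illustrated with a small TF-IDF-style scorer. The formula below is an assumption chosen for illustration and is not claimed to be the one Solr uses; `significant_terms` and its parameters only mirror the numTerms and minTermLength options named above.

```python
import math
from collections import Counter


def significant_terms(foreground_docs, all_docs, num_terms=3, min_term_length=5):
    """Rank terms by (foreground document frequency) x (inverse corpus document frequency)."""
    total = len(all_docs)
    background = Counter(t for doc in all_docs for t in set(doc.lower().split()))
    foreground = Counter(t for doc in foreground_docs for t in set(doc.lower().split()))
    scores = {
        # Illustrative score only: the +1 terms simply avoid division by zero.
        term: freq * math.log((total + 1) / (background[term] + 1))
        for term, freq in foreground.items()
        if len(term) >= min_term_length
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda term: (-scores[term], term))[:num_terms]


if __name__ == "__main__":
    corpus = [
        "the godzilla returns", "godzilla versus death", "the love story",
        "the big house", "the american movie", "the dark story",
    ]
    print(significant_terms(corpus[:2], corpus, num_terms=3))
```

A common word such as "the" never ranks here for two reasons: it is shorter than the minimum term length, and its high corpus frequency would push its inverse-frequency weight toward zero anyway.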
  • 22. Semantic Knowledge Graph ● Score relevance against background ○ Part of "new" JSON Facets API ○ Flexible about foreground/background/global queries ○ Context-aware if used in nested facets ○ Solr "Inception" (aka "Not sure I fully grok it yet") ● Reference (and hobbies vs age example): https://lucene.apache.org/solr/guide/7_4/json-facet-api.html#semantic-knowledge-graphs ● Our case: Compare Genres for works of Ridley Scott and Steven Soderbergh ● Baseline statistics query (one line): http://...../films/select?q=directed_by_str:"Ridley Scott" &facet=on&facet.field=genre_str &rows=0&facet.mincount=1 22
  • 23. Baseline Genre statistics ● Steven Soderbergh: Drama 5, Romance Film 3, Biographical film 2, Comedy-drama 2, Indie film 2; the rest is 1 each: Comedy, Crime Fiction, Docudrama, Drama film, Ensemble Film, Erotica, Feminist Film, Historical drama, Legal drama, Mystery, Romantic comedy, Thriller, Trial drama, War film ● Ridley Scott: Drama 5, Action Film 3, Crime Thriller 2, War film 2; the rest is 1 each: Action/Adventure, Adventure Film, Biographical film, Combat Films, Comedy, Comedy of manners, Comedy-drama, Crime Drama, Crime Fiction, Epic film, Film adaptation, Gangster Film, Historical drama, Historical period drama, History, Horror, Mystery, Psychological thriller, Romance Film, Romantic comedy, Slice of life, Thriller, True crime
  • 24. Basic JSON Facet query ● Request: POST http://.../films/select { params: {q:"*:*",rows: 0}, facet: { genre: { type: terms, field: genre_str, limit: 5 }}} ● What it does: 1. Find all documents 2. Return none of them (to reduce output) 3. Calculate facets on field genre_str 4. Return top 5 buckets ● Response: .... "facets": { "count": 1100, "genre": { "buckets": [ {"val": "Drama", "count": 552}, {"val": "Comedy", "count": 389}, {"val": "Romance Film", "count": 270}, {"val": "Thriller", "count": 259}, {"val": "Action Film", "count": 196} ] ....
  • 25. Comparative Relatedness ● Request: POST http://.../films/select { params: { q:"*:*",rows: 0, fore_SS:"directed_by:\"Steven Soderbergh\"", fore_RS:"directed_by:\"Ridley Scott\"", back: "*:*" }, facet: { genre: { type: terms, field: genre_str, limit: 5, sort: "rl-SS desc", facet: { rl-SS: "relatedness($fore_SS, $back)", rl-RS: "relatedness($fore_RS, $back)", } }}} ● Response: .... { "val": "Drama film", "count": 1, "rl-SS": { "relatedness": 0.13611, "foreground_popularity": 0.00091, "background_popularity": 0.00091 }, "rl-RS": { "relatedness": -0.00075, "foreground_popularity": 0, "background_popularity": 0.00091 } }, .... ● Not "Drama": that one is too popular
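The nested quotes in the foreground queries are easy to get wrong by hand. One way to sidestep that is to build the request body as a Python dict and let `json.dumps` do the escaping. The sketch below only constructs and prints the body and sends nothing; the helper name `relatedness_request` is made up, and the parameter and facet names are copied from the slide.

```python
import json


def relatedness_request(director_a, director_b, field="genre_str", limit=5):
    """Build a JSON Facet request comparing two foreground queries to one background."""
    return {
        "params": {
            "q": "*:*",
            "rows": 0,
            # The inner double quotes are escaped by json.dumps at serialization time.
            "fore_SS": f'directed_by:"{director_a}"',
            "fore_RS": f'directed_by:"{director_b}"',
            "back": "*:*",
        },
        "facet": {
            "genre": {
                "type": "terms",
                "field": field,
                "limit": limit,
                "sort": "rl-SS desc",
                "facet": {
                    "rl-SS": "relatedness($fore_SS, $back)",
                    "rl-RS": "relatedness($fore_RS, $back)",
                },
            }
        },
    }


if __name__ == "__main__":
    body = relatedness_request("Steven Soderbergh", "Ridley Scott")
    print(json.dumps(body, indent=2))
```

The printed text is strict JSON, so it can be posted as the request body with any HTTP client.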
  • 26. Compare relatedness values - samples
      Genre         | Global count | Soderbergh count | Soderbergh relatedness | Scott count | Scott relatedness
      Drama film    | 1            | 1                | 0.13611                | 0           | -0.00075
      Legal drama   | 1            | 1                | 0.13611                | 0           | -0.00075
      Feminist Film | 2            | 1                | 0.09963                | 0           | -0.00107
      Comedy-drama  | 58           | 2                | 0.0365                 | 1           | 0.01602
      Drama         | 552          | 5                | 0.00455                | 5           | 0.00455
  • 27. "Inception" Relatedness ● Request: POST http://.../films/select { params: { q:"*:*",rows: 0, fore_SS:"directed_by:\"Steven Soderbergh\"", fore_RS:"directed_by:\"Ridley Scott\"", back:"directed_by:(\"Steven Soderbergh\" OR \"Ridley Scott\")" }, facet: { genre: { type: terms, field: genre_str, limit: 5, sort: "rl-SS desc", facet: { rl-SS: "relatedness($fore_SS, $back)", rl-RS: "relatedness($fore_RS, $back)", year2000-2004: { type: query, q: "initial_release_date:[2000-01-01T00:00:00Z TO 2005-01-01T00:00:00Z]", facet: { rl-SS: "relatedness($fore_SS, $back)", rl-RS: "relatedness($fore_RS, $back)", }} } }}} ● Response: {"val": "Romance Film", "count": 270, "rl-SS": { "relatedness": 0.01003, "foreground_popularity": 0.3, "background_popularity": 0.4 }, "rl-RS": { "relatedness": -0.01003, "foreground_popularity": 0.1, "background_popularity": 0.4 }, "year2000-2004": {"count": 156, "rl-SS": { "relatedness": 0.01539, "foreground_popularity": 0.3, "background_popularity": 0.6 }, "rl-RS": { "relatedness": -0.01338, "foreground_popularity": 0, "background_popularity": 0.6} } },
  • 28. More awesomeness - another time ● Learning to Rank ○ Machine-learned ranking models ○ https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html ○ https://www.youtube.com/watch?v=OJJe-OWHjfI ■ Learning-to-Rank with Apache Solr & Bees - Christine Poerschke, Bloomberg ■ Lucene/Solr Revolution 2017 ● Graph traversal ○ https://lucene.apache.org/solr/guide/7_4/graph-traversal.html ● Streaming (including Map/Reduce) ○ https://lucene.apache.org/solr/guide/7_4/streaming-expressions.html ● Result clustering ○ https://lucene.apache.org/solr/guide/7_4/result-clustering.html ● Commercial solutions (e.g. Basis Technology) ● Searching images by auto-generated captions
  • 29. Activate - The Search and AI conference ● Used to be called Lucene/Solr Revolution ● This year in Montreal, October 17-18 (with training beforehand) ● New direction with focus on AI ● https://activate-conf.com/agenda/ ● Samples: ○ Making Search at Reddit Relevant ○ The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep Learning ○ The Neural Search Frontier ○ How to Build a Semantic Search System ○ Query-time Nonparametric Regression with Temporally Bounded Models ○ Building Analytics Applications with Streaming Expressions in Apache Solr 29