Angel Borroy
Software Engineer
March 2020
A Practical
Introduction to
Apache SOLR
CODELAB
2
Requirements
Java Runtime Environment 1.8+
$ java -version
openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)
Supported Operating Systems
• Linux
• MacOS
• Windows
https://lucene.apache.org/solr/downloads.html
https://www.slideshare.net/angelborroy/a-practical-introduction-to-apache-solr
3
A Practical Introduction to Apache SOLR
• Open Source
• What is SOLR
• Key SOLR Concepts
• SOLR Lab
• Quick References
NEOCOM 2020
4
5
Why should you use Open Source?
• State of the Art Technologies
• Community Support
• Vast Documentation
• Code is accessible
• Customizable
• Mostly free licensing
6
Why should you contribute to Open Source?
• Share Knowledge and Ideas
• Improve established Technologies
• Become part of a Community
• Not only code, all your skills are relevant
• Be useful to the World
7
8
What is SOLR
• A Search Engine
• A REST-like API
• Built on Lucene
• Open Source
• Blazing-fast
• Scalable
• Fault tolerant
9
Why SOLR
Scalable
Solr scales by distributing work (indexing and query processing) to multiple servers in a cluster.
Ready to deploy
Solr is open source, is easy to install and configure, and provides a preconfigured example to help you get started.
Optimized for search
Solr is fast and can execute complex queries in under a second, often in only tens of milliseconds.
Large volumes of documents
Solr is designed to deal with indexes containing many millions of documents.
Text-centric
Solr is optimized for searching natural-language text, like emails, web pages, resumes, PDF documents, and social messages such as tweets or blogs.
Results sorted by relevance
Solr returns documents in ranked order based on how relevant each document is to the user’s query.
10
Lucene-based Search Engines
• Amazon Elasticsearch Service
11
Features overview
• Pagination and sorting
• Faceting
• Autosuggest
• Spell-checking
• Highlighting
• Geospatial search
• More Like This
12
Features overview
• Flexible query support
• Document clustering
• Import rich document formats (PDF, Office…)
• Import data from databases (DIH: Data Import Handler)
• Multilingual support
13
Companies using SOLR
14
Key SOLR Concepts
15
Key SOLR Concepts
• Documents
• Searching
• Relevancy
• Precision and Recall
• Searching at Scale
[Diagram labels: Storage, Retrieval, Tracking, Indexing, Query]
16
Lucene Document
• Documents are the unit of information for indexing and search
• A Document is a set of fields
• Each field has a name and a value
• All field types must be defined, and all field names (or dynamic field-naming patterns) should be specified in Solr’s schema.xml
Schema Configuration
• Per collection/index
• XML file
• Defines how the inverted index will be built
• Field and field type definitions
[Figure labels: Document, Field]
17
Lucene Document – Search problem
The Beginner’s Guide to Buying a House
How to Buy Your First House
Purchasing a Home
Becoming a New Home owner
Buying a New Home
Decorating Your Home
A Fun Guide to Cooking
How to Raise a Child
Buying a New Car
SELECT * FROM Books WHERE Name = 'buying a new home';
0 results
SELECT * FROM Books
WHERE Name LIKE '%buying%'
AND Name LIKE '%a%'
AND Name LIKE '%home%';
1 result
Buying a New Home
SELECT * FROM Books
WHERE Name LIKE '%buying%'
OR Name LIKE '%a%'
OR Name LIKE '%home%';
8 results
A Fun Guide to Cooking, Decorating Your Home, How to Raise a Child, Buying a New Car, Buying a New Home, The Beginner’s Guide to Buying a House, Purchasing a Home, Becoming a New Home owner
Problems these SQL queries do not handle:
• Unimportant words
• Synonyms
• Linguistic variations
• Ordering
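The three SQL queries above can be reproduced with a small sketch. It assumes an in-memory SQLite database (not part of the original lab) and a hypothetical helper named count_matches; SQLite's LIKE is case-insensitive for ASCII text while = is case-sensitive, which matches the counts on the slide.

```python
import sqlite3

# The nine book titles from the slide.
TITLES = [
    "The Beginner's Guide to Buying a House",
    "How to Buy Your First House",
    "Purchasing a Home",
    "Becoming a New Home owner",
    "Buying a New Home",
    "Decorating Your Home",
    "A Fun Guide to Cooking",
    "How to Raise a Child",
    "Buying a New Car",
]


def count_matches(where_clause):
    """Load the titles into an in-memory table and count the matching rows.

    Hypothetical helper for illustration only; the WHERE clause is trusted input.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Books (Name TEXT)")
    conn.executemany("INSERT INTO Books (Name) VALUES (?)", [(t,) for t in TITLES])
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM Books WHERE " + where_clause
    ).fetchone()
    conn.close()
    return count


exact = count_matches("Name = 'buying a new home'")
all_terms = count_matches(
    "Name LIKE '%buying%' AND Name LIKE '%a%' AND Name LIKE '%home%'"
)
any_term = count_matches(
    "Name LIKE '%buying%' OR Name LIKE '%a%' OR Name LIKE '%home%'"
)
print(exact, all_terms, any_term)
```

The exact match finds nothing, the AND query is too strict, and the OR query returns almost everything in no useful order.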
18
Lucene Document – Inverted Index
Doc #   Content field
1       A Fun Guide to Cooking
2       Decorating Your Home
3       How to Raise a Child
4       Buying a New Car
5       Buying a New Home
6       The Beginner’s Guide to Buying a House
7       Purchasing a Home
8       Becoming a New Home Owner
9       How to Buy Your First House
INVERTED INDEX
Term         Doc #
a            1,3,4,5,6,7,8
becoming     8
beginner’s   6
buy          9
buying       4,5,6
child        3
cooking      1
decorating   2
home         2,5,7,8
house        6,9
how          3,9
new          4,5,8
purchasing   7
your         2,9
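A minimal sketch of how such an inverted index could be built from the nine titles. It assumes the simplest possible analysis (lowercase, split on whitespace), which is far less than what Lucene analyzers do; build_inverted_index is a hypothetical name.

```python
from collections import defaultdict

# Documents from the slide, keyed by document number.
DOCS = {
    1: "A Fun Guide to Cooking",
    2: "Decorating Your Home",
    3: "How to Raise a Child",
    4: "Buying a New Car",
    5: "Buying a New Home",
    6: "The Beginner's Guide to Buying a House",
    7: "Purchasing a Home",
    8: "Becoming a New Home Owner",
    9: "How to Buy Your First House",
}


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


index = build_inverted_index(DOCS)
for term in ("a", "buying", "home", "house"):
    print(term, index[term])
```

Looking up a term is now a dictionary access instead of a scan over every title.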
19
Searching
TERM     DOCS
buying   4,5,6,7,9
home     2,5,6,7,8,9
• Unimportant word “a” is skipped
• Synonyms: purchasing ~ buying
• Linguistic variations: buy ~ buying
• Synonyms: house ~ home
buying AND home = 5,6,7,9
• Buying a New Home
• The Beginner’s Guide to Buying a House
• Purchasing a Home
• How to Buy Your First House
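The search on this slide can be sketched as a union over equivalent terms followed by an intersection across query terms. The postings come from the inverted index slide; the stop-word set, the equivalence table and the function name search_all_terms are assumptions made for illustration, not how Solr stores synonyms.

```python
# Postings taken from the inverted index slide (only the terms needed here).
POSTINGS = {
    "a": [1, 3, 4, 5, 6, 7, 8],
    "buy": [9],
    "buying": [4, 5, 6],
    "home": [2, 5, 7, 8],
    "house": [6, 9],
    "purchasing": [7],
}

# Assumed configuration: words to skip and terms treated as equivalent.
STOP_WORDS = {"a"}
EQUIVALENTS = {
    "buying": ["buying", "buy", "purchasing"],
    "home": ["home", "house"],
}


def search_all_terms(query):
    """Return the documents matching every non-stop-word term (AND semantics)."""
    result = None
    for term in query.lower().split():
        if term in STOP_WORDS:
            continue
        docs = set()
        for variant in EQUIVALENTS.get(term, [term]):
            docs.update(POSTINGS.get(variant, []))
        result = docs if result is None else result & docs
    return sorted(result or [])


matches = search_all_terms("buying a home")
print(matches)
```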
20
Searching operators
• Required terms: buying AND home
• Optional terms: buying OR home
• Negated terms: buying NOT home
• Phrases: "buying a home"
• Grouped expressions: (buying OR renting) AND home
• Fuzzy matching
  • Wildcard: offi* off*r off?r
  • Range: yearsOld:[18 TO 21]
  • Distance: administrator~
  • Proximity: "chief officer"~1
21
Relevancy up to SOLR 5 (TF/IDF)
A relevancy score is calculated for each document, and the search results are sorted from the highest score to the lowest.
Similarity
Term frequency
• A document is more relevant for a particular term if the term appears in it multiple times
Inverse document frequency
• Measures how “rare” a search term is, based on its document frequency (the number of documents in which the term appears)
Boosting
• Multiplier applied at query time to adjust the weight of a field
• title:solr^2.5 description:solr
Normalization factors for fields, queries and coord
Ordering
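A rough sketch of the idea, using the textbook form tf * ln(N / df) with an optional boost multiplier. This is an assumption made for illustration: Lucene's classic similarity uses a different variant and adds the normalization factors mentioned above. The function name tf_idf is hypothetical.

```python
import math

# Tokenized titles from the earlier slides (lowercased, split on whitespace).
DOCS = [
    "a fun guide to cooking".split(),
    "decorating your home".split(),
    "how to raise a child".split(),
    "buying a new car".split(),
    "buying a new home".split(),
    "the beginner's guide to buying a house".split(),
    "purchasing a home".split(),
    "becoming a new home owner".split(),
    "how to buy your first house".split(),
]


def tf_idf(term, doc_terms, all_docs, boost=1.0):
    """Textbook TF/IDF: term frequency times log of inverse document frequency."""
    tf = doc_terms.count(term)
    df = sum(1 for doc in all_docs if term in doc)
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(all_docs) / df)
    return boost * tf * idf


rare = tf_idf("cooking", DOCS[0], DOCS)    # appears in 1 of 9 documents
common = tf_idf("a", DOCS[0], DOCS)        # appears in 7 of 9 documents
boosted = tf_idf("cooking", DOCS[0], DOCS, boost=2.5)
print(rare, common, boosted)
```

The rare term "cooking" contributes far more to the score than the common word "a", and the boost simply scales the contribution.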
22
Relevancy from SOLR 6 (BM25)
BM25 improves upon TF/IDF
BM25 stands for “Best Matching 25” (25th iteration on TF/IDF)
Includes different factors
• Frequency of a term in all Documents
• Term Frequency in a Document
• Document Length
BM25 limits the influence of term frequency:
• less influence of common words
With TF/IDF, short fields (title, ...) are automatically scored higher
BM25 scales field length relative to the average field length:
• field length treatment does not automatically boost short fields
Ordering
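The saturation and length effects can be seen in the term-frequency part of the standard BM25 formula. This is a sketch of that one component only (the IDF part is left out), with k1 = 1.2 and b = 0.75 as commonly used defaults; bm25_term_weight is a hypothetical name.

```python
def bm25_term_weight(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency component of BM25.

    freq        : occurrences of the term in the field
    doc_len     : length of this field (in terms)
    avg_doc_len : average field length across the index
    """
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return freq * (k1 + 1) / (freq + k1 * length_norm)


# Saturation: repeating a term keeps helping, but never beyond k1 + 1.
once = bm25_term_weight(1, doc_len=10, avg_doc_len=10)
many = bm25_term_weight(100, doc_len=10, avg_doc_len=10)

# Length: the same frequency counts for more in a shorter-than-average field.
short_field = bm25_term_weight(1, doc_len=5, avg_doc_len=10)
long_field = bm25_term_weight(1, doc_len=20, avg_doc_len=10)
print(once, many, short_field, long_field)
```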
23
Precision and Recall
Precision is a measure of how “good” each of the results of a query is. A query that returns one single correct document out of a million other correct documents is still considered perfectly precise.
Recall is a measure of how many of the correct documents are returned. A query that returns one single correct document out of a million other correct documents has very poor recall.
>> Balancing Precision and Recall will improve the quality of your search results.
Example: 20 correct documents exist; the search results contain 10 documents (8 correct and 2 incorrect)
Precision = 80% (8 / 10)
Recall = 40% (8 / 20)
What is the precision and recall for the previous “buying a home” sample?
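The worked numbers on this slide can be checked with a short sketch. The document identifiers are made up for illustration; only the counts (20 correct documents, 10 returned, 8 of them correct) come from the slide.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result list against the correct documents."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: documents 1..20 are correct, 101 and 102 are not.
relevant_docs = range(1, 21)
returned_docs = list(range(1, 9)) + [101, 102]

precision, recall = precision_recall(returned_docs, relevant_docs)
print(precision, recall)
```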
24
Searching at Scale
Scaling SOLR
Solr can scale to handle billions of documents and a very high volume of queries by adding servers.
Some limitations
• You can insert, delete, and update documents, but you cannot (easily) update single fields
• Solr is not optimized for processing very long queries (thousands of terms) or returning very large result sets to users.
25
Lab
26
Requirements
• Java Runtime Environment 1.8+
$ java -version
openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)
• Supported Operating Systems
• Linux
• MacOS
• Windows
https://lucene.apache.org/solr/downloads.html
27
Directory layout
bin/
• solr | solr.cmd : start SOLR
• post : posting content to SOLR
• solr.in.sh | solr.in.cmd : configuration
contrib/
• add-ons plugins
dist/
• SOLR Jar files
docs/
• JavaDocs
example/
• CSV, XML and JSON
• DIH for databases
• Word and PDF files
licenses/
• 3rd party libraries
server/
• SOLR Admin UI
• Jetty Libraries
• Log files
• Sample configsets
28
Starting SOLR
• Use the command line interface tool called bin/solr (Linux) or bin\solr.cmd (Windows)
$ bin/solr start -p 8983
Waiting up to 180 seconds to see Solr running on port 8983 []
Started Solr server on port 8983 (pid=4521). Happy searching!
• Check if Solr is Running
$ bin/solr status
Found 1 Solr nodes:
Solr process 4521 running on port 8983
{
"solr_home":"/Users/aborroy/Downloads/solr-introduction-university/solr-8.4.1/server/solr",
"version":"8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 2020-01-10 13:40:28",
"startTime":"2020-03-08T08:13:49.969Z",
"uptime":"0 days, 0 hours, 17 minutes, 56 seconds",
"memory":"91.6 MB (%17.9) of 512 MB"}
29
The SOLR Admin Web Interface
http://127.0.0.1:8983/solr/#/
30
Creating a new Core
$ bin/solr create -c films
• -c indicates the name of the core (or collection) to create
Check default fields added by SOLR to the Schema
Check JSON Data to be posted in example/films/films.json
{
  "id": "/en/45_2006",
  "directed_by": [
    "Gary Lennon"
  ],
  "initial_release_date": "2006-11-30",
  "genre": [
    "Black comedy",
    "Thriller"
  ],
  "name": ".45"
}
31
Posting data
$ bin/post -c films example/films/films.json
Posting files to [base] url http://localhost:8983/solr/films/update...
POSTing file films.json (application/json) to [base]/json/docs
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8983/solr/films/update/json/docs
SimplePostTool: WARNING: Response: {
"responseHeader":{
"status":400,
"QTime":120},
"error":{
"metadata":[
"error-class","org.apache.solr.common.SolrException",
"root-error-class","java.lang.NumberFormatException"],
"msg":"ERROR: [doc=/en/quien_es_el_senor_lopez] Error adding field 'name'='¿Quién es el señor López?' msg=For input string: "¿Quién es el señor López?"",
"code":400}}
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/films/update/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/films/update...
Time spent: 0:00:00.323
32
How many results were posted?
http://127.0.0.1:8983/solr/films/select?indent=on&q=*:*&wt=json
• q: main query
• fq: filter queries
• sort: sort field and direction (asc or desc)
• start, rows: offset and number of rows
• fl: list of fields to return
• wt: response format (XML or JSON)
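The parameters above can be assembled into a select URL with the standard library. This is only a sketch: build_select_url is a hypothetical helper, the base address is the one used in these slides, and the example values are taken from later slides in the lab.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SOLR_BASE = "http://127.0.0.1:8983/solr"  # address used in the slides


def build_select_url(core, q="*:*", fq=(), sort=None, start=0, rows=10,
                     fl=(), wt="json"):
    """Build a /select URL from the common query parameters (hypothetical helper)."""
    params = [("q", q)]
    params += [("fq", f) for f in fq]      # fq may be repeated
    if sort:
        params.append(("sort", sort))
    params += [("start", start), ("rows", rows)]
    if fl:
        params.append(("fl", ",".join(fl)))
    params.append(("wt", wt))
    return f"{SOLR_BASE}/{core}/select?{urlencode(params)}"


url = build_select_url(
    "films",
    q="genre:Fantasy",
    fq=["initial_release_date:[NOW-12YEAR TO *]"],
    sort="initial_release_date desc",
    fl=["name", "genre"],
)
decoded = parse_qs(urlsplit(url).query)  # round trip to check the encoding
print(url)
print(decoded)
```

Letting urlencode escape characters such as ":", "[" and "*" avoids hand-built URLs that break on spaces or brackets.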
33
What was wrong?
Check carefully the JSON data to be posted in example/films/films.json
{
  "id": "/en/quien_es_el_senor_lopez",
  "directed_by": [
    "Luis Mandoki"
  ],
  "genre": [
    "Documentary film"
  ],
  "name": "\u00bfQui\u00e9n es el se\u00f1or L\u00f3pez?"
},
http://127.0.0.1:8983/solr/#/films/schema?field=name
34
Auto-Generated SOLR Schema
http://127.0.0.1:8983/solr/#/films/files?file=managed-schema
• multiValued: a single document might contain multiple values for this field type
• indexed: the value of the field can be used in queries to retrieve matching documents (true by default)
• required: SOLR rejects any attempt to add a document which does not have a value for this field
• stored: the actual value of the field can be retrieved by queries
name can contain text! SOLR guessed a numeric type for it from the first document posted (".45").
35
Re-Creating the Core
Deleting core “films”
$ bin/solr delete -c films
Deleting core 'films' using command:
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=films&deleteIndex=true&deleteDataDir=true&deleteInstanceDir=true
Creating core “films”
$ bin/solr create -c films
Created new core 'films'
Creating the field “name” for the core “films”
http://127.0.0.1:8983/solr/#/films/schema
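The slide creates the field through the Admin UI. As an alternative, the same change can be sent to the Schema API with an add-field command. The sketch below only builds the request; the field type text_general and the stored flag are assumptions (the slide does not state which type was chosen), and add_field_request is a hypothetical helper.

```python
import json
from urllib.request import Request

SOLR_BASE = "http://127.0.0.1:8983/solr"  # address used in the slides


def add_field_request(core, name, field_type, stored=True):
    """Build (but do not send) a Schema API request that adds one field."""
    payload = {"add-field": {"name": name, "type": field_type, "stored": stored}}
    return Request(
        url=f"{SOLR_BASE}/{core}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumption: text_general is a suitable text type for film names.
request = add_field_request("films", "name", "text_general")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To apply it against a running SOLR: urllib.request.urlopen(request)
```

Defining the field before posting keeps SOLR from guessing its type from the first document.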
36
Posting Data 2
$ bin/post -c films example/films/films.json
Posting files to [base] url http://localhost:8983/solr/films/update...
POSTing file films.json (application/json) to [base]/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/films/update...
Time spent: 0:00:00.417
http://127.0.0.1:8983/solr/films/select?indent=on&q=*:*&wt=json
37
Exploring SOLR Analyzers
• Solr analyzes both index content and query input before matching the results
• The live analysis can be observed by using “Analysis” option from Solr Admin UI
38
Exploring SOLR Analyzers
• Using the right locale will produce better results
39
Searching
q = genre:Fantasy directed_by:"Robert Zemeckis"
• This query matches films that have genre Fantasy or were directed by Robert Zemeckis (OR is the default operator)
40
Filtering
q = genre:Fantasy
fq = initial_release_date:[NOW-12YEAR TO *]
• This query searches for films of genre Fantasy released in the last 12 years
41
Sorting
q = *:*
sort = initial_release_date desc
• This query orders all the films by release date in descending order
42
Fuzzy Edit
q = directed_by:Zemeckis
q = directed_by:Zemekis~1
q = directed_by:Zemequis~2
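The number after ~ is the maximum edit distance allowed between the query term and an indexed term. A plain Levenshtein distance, sketched below, shows why the two misspellings need ~1 and ~2 respectively. This is an illustration only: the terms are lowercased here as an assumption about the analysis, and Lucene's fuzzy matching is implemented differently and also counts a transposition as a single edit.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


one_edit = edit_distance("zemekis", "zemeckis")    # missing "c"
two_edits = edit_distance("zemequis", "zemeckis")  # "qu" instead of "ck"
print(one_edit, two_edits)
```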
43
Faceting
q = *:*
fq = genre:epic
facet = on
facet.field = directed_by_str
http://127.0.0.1:8983/solr/films/select?facet.field=directed_by_str&facet=on&facet.mincount=1&fq=genre:epic&indent=on&q=*:*&wt=json
44
Faceting
Multiple fields for faceting
http://127.0.0.1:8983/solr/films/select?facet.field=directed_by_str&facet.field=genre&facet=on&indent=on&q=*:*&wt=json
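In the default JSON response, each facet field comes back as a flat list that alternates value and count. A small sketch of turning that list into a dictionary follows; the response fragment and its counts are invented for illustration, and facet_counts is a hypothetical helper.

```python
def facet_counts(response, field):
    """Convert SOLR's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hypothetical fragment of a response; the names and counts are made up.
sample_response = {
    "facet_counts": {
        "facet_fields": {
            "directed_by_str": ["Director A", 3, "Director B", 2, "Director C", 1],
        }
    }
}

by_director = facet_counts(sample_response, "directed_by_str")
print(by_director)
```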
45
Highlighting
q = genre:epic
hl = on
hl.fl = genre
46
Indexing Documents
Create a new collection
$ bin/solr create -c files -d example/files/conf
Posting Word and PDF Documents
$ bin/post -c files ../Documents
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
Entering recursive mode, max depth=999, delay=0s
Indexing directory ../Documents (3 files, depth=0)
POSTing file Non-text-searchable.pdf (application/pdf) to [base]/extract
POSTing file Sample-Document.pdf (application/pdf) to [base]/extract
POSTing file Sample-Document-scoring.docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document) to [base]/extract
3 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/files/update...
Time spent: 0:00:06.338
47
Searching documents
q = video
48
Documents: ExtractingRequestHandler
The magic happens in files/conf/solrconfig.xml
<!-- Solr Cell Update Request Handler
     http://wiki.apache.org/solr/ExtractingRequestHandler
-->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="xpath">/xhtml:html/xhtml:body/descendant:node()</str>
    <str name="capture">content</str>
    <str name="fmap.meta">attr_meta_</str>
    <str name="uprefix">attr_</str>
    <str name="lowernames">true</str>
  </lst>
</requestHandler>
49
Alfresco using Apache SOLR
50
Alfresco uses an Angular app to get results from SOLR
[Diagram labels: User, ADF Angular App, Repository REST API, SOLR, Indexes, Files, DB]
51
Alfresco Content Application
52
References
53
Quick References
SOLR
• https://lucene.apache.org/solr/resources.html#documentation
• https://www.manning.com/books/solr-in-action
• https://github.com/treygrainger/solr-in-action
“Let’s Build an Inverted Index: Introduction to Apache Lucene/Solr” by Sease
• https://www.slideshare.net/SeaseLtd/lets-build-an-inverted-index-introduction-to-apache-lucenesolr
Source code
• https://github.com/apache/lucene-solr
• https://cwiki.apache.org/confluence/display/solr/HowToContribute
This presentation
• https://www.slideshare.net/angelborroy/a-practical-introduction-to-apache-solr