Solr Workshop

Yasas Senarath
Visiting Instructor & Research Assistant
Dept. of Computer Science and Engineering,
University of Moratuwa

Introduction [Recall]
● Search Platform
● Open-Source
● Search Applications
● Built on top of Lucene
● Why…
○ Enterprise-ready
○ Fast
○ Highly Scalable
● Search + NoSQL
○ Non Relational Data Storage

Features of Apache Solr [Recall]
● RESTful APIs
○ No Java programming skills required
● Full text search
○ tokens, phrases, spell check, wildcard, and auto-complete
● Enterprise ready
● Flexible and Extensible
● NoSQL database
● Admin Interface
● Highly Scalable
● Text-Centric and Sorted by Relevance

Installing Solr
● Go to the Solr website and download the binary release of Solr 8.1.1 (the latest version at the time of this workshop)
● Extract the downloaded archive
● In a terminal, change to the extracted Solr folder and run:
○ Unix*: bin/solr start
○ Windows: bin\solr.cmd start
● Go to http://localhost:8983/

Techproducts Example
● Starting Solr with Example
○ Unix*: bin/solr -e techproducts
○ Windows: bin\solr.cmd -e techproducts
● To verify that Solr is running, you can do this:
○ Unix*: bin/solr status
○ Windows: bin\solr.cmd status
● Access Admin Panel
○ http://localhost:8983/solr/

Adding Documents
● Open example/exampledocs/sd500.xml
● Add files to Solr using post.jar
○ cd example/exampledocs
○ java -Dc=techproducts -jar post.jar sd500.xml
● 2 main ways
○ HTTP
○ Native client
<add><doc>
<field name="id">9885A004</field>
<field name="name">Canon PowerShot SD500</field>
<field name="manu">Canon Inc.</field>
...
<field name="inStock">true</field>
</doc></add>
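The update message above can also be generated from code. The following is a minimal sketch in Python using only the standard library; the helper name `build_add_xml` is hypothetical, and only the fields visible on the slide are included (the elided ones are left out).

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Build a Solr <add><doc>...</doc></add> update message from a dict."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # Each key/value pair becomes one <field name="...">value</field>.
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


xml_message = build_add_xml({
    "id": "9885A004",
    "name": "Canon PowerShot SD500",
    "manu": "Canon Inc.",
    "inStock": "true",
})
print(xml_message)
```

The resulting string could be saved to a file and posted with post.jar in the same way as sd500.xml.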

Searching Overview
● Select API Command
○ http://localhost:8983/solr/techproducts/select?q=sd500&wt=json
● Need only Name and ID of all elements?
○ http://localhost:8983/solr/techproducts/select?q=inStock:false&wt=json&fl=id,name
● Shutdown
○ Unix*: bin/solr stop
○ Windows: bin\solr.cmd stop
● Delete Collection
○ Unix*: bin/solr delete -c techproducts
○ Windows: bin\solr.cmd delete -c techproducts
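The two select requests on this slide can be assembled from their parameters rather than typed by hand. This is a rough sketch, assuming Solr runs at the default local address used in the slides; `select_url` is a hypothetical helper, not part of Solr.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # default local address from the slides


def select_url(collection, **params):
    """Return a /select URL for the given collection and query parameters."""
    return f"{SOLR_BASE}/{collection}/select?{urlencode(params)}"


# q is the query, wt the response format, fl the list of fields to return.
url_all_fields = select_url("techproducts", q="sd500", wt="json")
url_id_and_name = select_url(
    "techproducts", q="inStock:false", wt="json", fl="id,name"
)
print(url_all_fields)
print(url_id_and_name)
```

`urlencode` percent-encodes reserved characters such as `:` and `,`, which Solr decodes back to the same query.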

Basic Solr Concepts
● Inverted Index
● Index consists of one or more Documents
● Document consists of one or more Fields
● Every field has a Field Type
● Schema
○ Before adding documents to Solr, you need to specify the schema (very important!)
○ Schema File: schema.xml
● Schema declares
○ what kinds of fields there are
○ which field should be used as the unique/primary key
○ which fields are required
○ how to index and search each field
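To make the inverted index idea concrete, here is a toy sketch in Python. It is only an illustration of the data structure, not how Lucene implements it, and the three documents are made-up examples.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Illustrative documents: id -> value of a single text field.
docs = {
    1: "Toy Story",
    2: "The Mask",
    3: "The Never Ending Story",
}
index = build_inverted_index(docs)
print(sorted(index["story"]))  # ids of documents containing "story"
print(sorted(index["the"]))    # ids of documents containing "the"
```

A search for a term then becomes a dictionary lookup instead of a scan over every document.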

Basic Solr Concepts [Contd..]
● Field Types
○ float
○ long
○ double
○ date
○ Text
● Define new field types!
<fieldType name="phonetic" stored="false" indexed="true" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldType>

Basic Solr Concepts [Contd..]
● Defining a Field
○ name: Name of the field
○ type: Field type
○ indexed: Should this field be added to the inverted index?
○ stored: Should the original value of this field be stored?
○ multiValued: Can this field have multiple values?
<field name="id" type="text" indexed="true" stored="true" multiValued="true"/>

Example Documents
● Use your own project corpus
● Movie Dataset: URL: https://bit.ly/2JhpEhF

Create a Collection
● Start Solr
○ Unix*: bin/solr start
○ Windows: bin\solr.cmd start
● Create Collection
○ Unix*: bin/solr create -c movies
○ Windows: bin\solr.cmd create -c movies
● Defining Schema
○ Two Approaches
■ Schemaless with “field guessing” feature (Managed Schema)
■ Use schema.xml with custom schema

Custom Schema
● Rename the managed-schema file to schema.xml
● schema.xml
○ <field name="title" type="text_general" indexed="true" stored="true" multiValued="false"/>
○ <field name="tagline" type="text_general" indexed="true" stored="true" multiValued="false"/>
○ <field name="overview" type="text_general" indexed="true" stored="true" multiValued="false"/>
○ <field name="status" type="text_general" indexed="true" stored="true" multiValued="false"/>
○ <field name="budget" type="plong" indexed="true" stored="true" multiValued="false"/>
○ <field name="popularity" type="pdouble" indexed="true" stored="true" multiValued="false"/>
○ <field name="release_date" type="pdate" indexed="true" stored="true" multiValued="false"/>
○ <field name="revenue" type="plong" indexed="true" stored="true" multiValued="false"/>
○ <field name="runtime" type="pint" indexed="true" stored="true" multiValued="false"/>
○ <field name="vote_average" type="pfloat" indexed="true" stored="true" multiValued="false"/>
○ <field name="vote_count" type="pint" indexed="true" stored="true" multiValued="false"/>
● solrconfig.xml
○ <schemaFactory class="ClassicIndexSchemaFactory"/>
○ ${update.autoCreateFields:false}

Add Documents
curl "http://localhost:8983/solr/movies/update?commit=true" \
  --data-binary @example/movies/movies_metadata.csv \
  -H "Content-type:application/csv"

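The curl command above can be mirrored with Python's standard library. This is a sketch only: `build_csv_update_request` is a hypothetical helper, and the two-line CSV is an in-memory stand-in for movies_metadata.csv. The request is prepared but not sent, so the snippet runs without a Solr instance.

```python
import urllib.request


def build_csv_update_request(collection, csv_bytes,
                             base_url="http://localhost:8983/solr"):
    """Prepare (but do not send) a POST that indexes CSV data and commits."""
    return urllib.request.Request(
        url=f"{base_url}/{collection}/update?commit=true",
        data=csv_bytes,
        headers={"Content-Type": "application/csv"},
        method="POST",
    )


# Tiny stand-in for the real CSV file (illustrative rows only).
sample_csv = b"id,title\n1,Toy Story\n"
request = build_csv_update_request("movies", sample_csv)
print(request.get_method(), request.full_url)

# To send it to a running Solr instance, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```
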
Basic Queries
Get All Documents:
http://localhost:8983/solr/movies/select?q=*:*&wt=json
Search Documents Containing the Phrase “Toy Story” in “title” field:
http://localhost:8983/solr/movies/select?q=title:%22Toy%20Story%22&wt=json
Search Documents Containing “Toy Story”:
http://localhost:8983/solr/movies/select?q=Toy%20Story&wt=json
(!) No field is specified, so this query relies on the default search field

The Fix… (Copy Field)
● Add a Copy Field
<copyField source="*" dest="_text_"/>
● Is it ok? No!
● Only Few Fields
● Which Fields?
○ Title
○ Tagline
○ Overview

Custom Copy Fields
● Add following to schema.xml
<copyField source="title" dest="_text_"/>
<copyField source="tagline" dest="_text_"/>
<copyField source="overview" dest="_text_"/>
● Note that the destination should be marked multiValued="true"
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
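Why the destination must be multi-valued can be seen from a small simulation. This is a conceptual sketch in Python, not Solr's implementation; `apply_copy_fields` and the sample movie values are hypothetical.

```python
def apply_copy_fields(doc, copy_rules):
    """Return a copy of doc with source values appended to each destination."""
    result = dict(doc)
    for source, dest in copy_rules:
        if source in doc:
            # The destination collects one value per source field,
            # which is why it has to be multi-valued.
            result.setdefault(dest, []).append(doc[source])
    return result


copy_rules = [
    ("title", "_text_"),
    ("tagline", "_text_"),
    ("overview", "_text_"),
]
# Illustrative document; it has no overview, so only two values are copied.
movie = {"title": "Example Movie", "tagline": "A sample tagline", "runtime": 90}
indexed = apply_copy_fields(movie, copy_rules)
print(indexed["_text_"])
```

Fields without a copy rule, such as runtime here, never reach `_text_`, so a default-field search does not match on them.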

Analyzers
● Analyzers are specified as a child of the <fieldType>
<fieldType name="nametext" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>
● Using simple processing steps
<fieldType name="nametext" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Filters
● Create custom Text Field: text_title
● Filters used in Analyzers
○ Tokenize: Tokenizer
○ Stopwords: Filter (stopwords.txt)
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
○ LowerCase: Filter
○ Synonyms: Filter (synonyms.txt)
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
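The tokenize, stopword, lowercase, and synonym steps listed for this slide can be imitated in a few lines of Python. This is a simplified sketch of the idea and not Solr's analysis code; the stopword set and synonym map are hypothetical stand-ins for stopwords.txt and synonyms.txt.

```python
import re

STOPWORDS = {"the", "a", "of"}          # hypothetical stand-in for stopwords.txt
SYNONYMS = {"tale": ["tale", "story"]}  # hypothetical stand-in for synonyms.txt


def analyze(text):
    """Run text through tokenizer, stop, lowercase and synonym steps in order."""
    tokens = re.findall(r"\w+", text)                           # tokenizer
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]  # stop filter, ignoreCase
    tokens = [t.lower() for t in tokens]                        # lowercase filter
    expanded = []
    for token in tokens:                                        # synonym filter, expand
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded


print(analyze("The Tale of Despereaux"))
```

Each filter sees only the output of the step before it, so the order of the filters in the analyzer matters.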

Analysis Phases
● Separate Analyzers for Index and Query
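<fieldType name="nametext" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>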

Synonyms (synonyms.txt)
● Add Some Synonyms to synonyms.txt
○ story, tale, fiction
○ heat, hot, warm
○ se7en, seven, 7
● Spell correction with Synonyms
○ stores => stories
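The two rule styles above (comma-separated equivalents and `=>` mappings) can be read with a small parser. This is a simplified Python model of the file format, assuming expand="true" for equivalent terms; `parse_synonym_rules` is a hypothetical helper and does not merge overlapping rules the way Solr does.

```python
def parse_synonym_rules(lines):
    """Parse 'a, b, c' and 'x => y' rules into a term -> replacements map."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: left-hand terms are replaced by the right side.
            left, right = line.split("=>", 1)
            targets = [t.strip() for t in right.split(",")]
            for term in left.split(","):
                mapping[term.strip()] = targets
        else:
            # Equivalent terms: each term expands to the whole group.
            group = list(dict.fromkeys(t.strip() for t in line.split(",")))
            for term in group:
                mapping[term] = group
    return mapping


rules = parse_synonym_rules(["story, tale, fiction", "stores => stories"])
print(rules["tale"])
print(rules["stores"])
```

The `=>` form is one-directional, which is why it suits the spelling-correction case: "stores" is rewritten to "stories", but not the other way around.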

Advanced Queries
● Search for Mask in the title and hero in the tagline
○ title:Mask AND tagline:hero
○ http://localhost:8983/solr/movies/select?q=title%3AMask%20AND%20tagline%3Ahero
● Search for “The Mask” in the title, or Mask in the title with hero in the tagline
○ (title:Mask AND tagline:hero) OR title:"The Mask"
○ http://localhost:8983/solr/movies/select?q=(title%3AMask%20AND%20tagline%3Ahero)%20OR%20title%3A%22The%20Mask%22
● Wildcard matching: Search movies whose title contains a word starting with “the”
○ title:the*
○ http://localhost:8983/solr/movies/select?q=title%3Athe*
● Proximity matching: Search for “exorcist” and “spirits” within 4 words of each other in the overview field
○ overview:"exorcist spirits"~4
○ http://localhost:8983/solr/movies/select?q=overview%3A%22exorcist%20spirits%22~4
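The percent-encoded URLs on this slide follow mechanically from the readable queries. Below is a minimal Python sketch that encodes two of them with the standard library; it assumes Python 3.7 or later, where `~` is left unescaped.

```python
from urllib.parse import quote

BASE = "http://localhost:8983/solr/movies/select?q="

queries = [
    'title:Mask AND tagline:hero',
    'overview:"exorcist spirits"~4',
]
# quote() turns ':' into %3A, spaces into %20 and double quotes into %22.
urls = [BASE + quote(q) for q in queries]
for url in urls:
    print(url)
```

Writing the query in readable form and encoding it last makes mistakes in the query syntax easier to spot.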

Advanced Queries [Contd..]
● Range Queries
○ Inclusive Range Query: Square brackets [ & ]
■ budget:[500000 TO *]
○ Exclusive Range Query: Curly brackets { & }
■ budget:{500000 TO *}
● Boosting a Term with ^
○ Want a term to be more relevant?
■ toy^4 story
● For more about Queries:
○ https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html
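The range and boost syntax can be generated by small string helpers. This Python sketch only builds the query text shown on the slide; `range_query` and `boost` are hypothetical names, not part of any Solr client.

```python
def range_query(field, low="*", high="*", inclusive=True):
    """Build a Solr range clause; '*' leaves that end of the range open."""
    open_b, close_b = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{open_b}{low} TO {high}{close_b}"


def boost(term, factor):
    """Append a ^boost so matches on this term weigh more in the score."""
    return f"{term}^{factor}"


inclusive_clause = range_query("budget", low=500000)
exclusive_clause = range_query("budget", low=500000, inclusive=False)
boosted_query = boost("toy", 4) + " story"
print(inclusive_clause)
print(exclusive_clause)
print(boosted_query)
```

With square brackets a budget of exactly 500000 matches; with curly brackets it does not.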

The Schemaless Approach
● Let's do the same using the schemaless approach

Questions?
Yasas Senarath
Visiting Instructor & Research Assistant
Dept. of Computer Science and Engineering,
University of Moratuwa
