Introduction to Solr
Radu Gheorghe
Sematext Group, Inc.

About me
Sematext products and services: Logsene, SPM, ES API, metrics, ...
+ https://sematext.com/blog/author/radu7gheorghe/
+ https://www.manning.com/books/elasticsearch-in-action

Agenda
What is Solr
When to use it
When not to use it
How it works
Demo
Pleeeeease ask questions.
Otherwise it will be boring :(

What is Solr
Open source
Search engine
Based on Apache Lucene*
Distributed (SolrCloud) or not (master-slave)
* Actually, the two projects merged in 2010

More on search: the term dictionary and its friends
Documents:
1) Big data is fun
2) Other text
3) Bucharest is big
Term dictionary (after analysis):
Term       Docs  Positions  counts, stored, etc
big        1,3   [0],[2]    ...
bucharest  3     [0]
data       1     [1]
fun        1     ...
is         1,3
other      2
text       2
Queries:
big AND data
“big data”
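The term dictionary above can be sketched in a few lines of Python. This is only a toy model, not how Lucene stores things: the `analyze` function (lowercase plus whitespace split) and all the helper names are assumptions made for illustration. It shows why `big AND data` only needs the doc lists, while the phrase query “big data” also needs positions.

```python
from collections import defaultdict

# The three example documents from the slide.
DOCS = {
    1: "Big data is fun",
    2: "Other text",
    3: "Bucharest is big",
}


def analyze(text):
    """Toy analysis chain (an assumption): lowercase, then split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Map each term to {doc_id: [positions of the term in that doc]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search_and(index, *terms):
    """Docs containing every term: intersect the doc lists."""
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


def search_phrase(index, *terms):
    """Docs where the terms occur at consecutive positions."""
    hits = []
    for doc_id in search_and(index, *terms):
        starts = index[terms[0]][doc_id]
        if any(all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms))
               for start in starts):
            hits.append(doc_id)
    return hits


INDEX = build_index(DOCS)
print(search_and(INDEX, "big", "data"))     # AND query
print(search_phrase(INDEX, "big", "data"))  # phrase query
```

Swapping the phrase to `search_phrase(INDEX, "data", "big")` returns nothing, because both terms are in doc 1 but not in that order.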

The [relevancy] score
BM25: bag-of-words, based on TF-IDF
q=big AND data
Document: I have big big big data
Term Frequency: more occurrences in the document, more weight
Inverse Document Frequency: fewer occurrences in the index, more weight
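To make the two weights concrete, here is a minimal sketch of the textbook BM25 formula in Python. It is an illustration, not Lucene's code: the defaults `k1=1.2` and `b=0.75` are the commonly quoted ones, the tiny corpus is made up from the example sentences in these slides, and real scores from Solr will differ in the details.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, docs_with_term, total_docs,
                    k1=1.2, b=0.75):
    """Score of one query term against one document (textbook BM25)."""
    # Inverse document frequency: rarer in the index, more weight.
    idf = math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Term frequency, saturated by k1 and normalized by document length.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


def bm25_score(query_terms, doc_terms, corpus):
    """Bag-of-words score: sum the per-term scores, ignoring word order."""
    avg_len = sum(len(doc) for doc in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        docs_with_term = sum(1 for doc in corpus if term in doc)
        score += bm25_term_score(tf, len(doc_terms), avg_len,
                                 docs_with_term, len(corpus))
    return score


# Made-up corpus, already analyzed into lowercase terms.
CORPUS = [
    "i have big big big data".split(),
    "big data is fun".split(),
    "other text".split(),
    "bucharest is big".split(),
]

for doc in CORPUS:
    print(" ".join(doc), "->", round(bm25_score(["big", "data"], doc, CORPUS), 3))
```

Two things are worth noticing when playing with it: repeating a term keeps adding weight, but less and less each time (saturation), and longer documents are penalized relative to the average length.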

Relevancy tuning
q=big AND data
Document 1:
title: Big Data
description: this is a book about big data
published: 2016
Document 2:
title: Spark Rulz
description: big data big data big data big data
published: 2015
Tuning options:
boost fields
boost values
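One way to express both kinds of boost is through the eDisMax query parser, where `qf` weights fields and `bq` adds a boost query for specific values. The sketch below only builds the request parameters in Python; the weights (`title^3`, `published:2016^2`) are arbitrary numbers picked for illustration, and the function name is hypothetical.

```python
from urllib.parse import urlencode


def build_params(boost_fields=None, boost_values=None):
    """Build Solr query parameters for q=big AND data with optional boosts.

    boost_fields: {field: weight}, becomes qf (e.g. title^3 description^1)
    boost_values: {(field, value): weight}, becomes one bq per entry
    """
    params = {"q": "big AND data", "defType": "edismax"}
    if boost_fields:
        params["qf"] = " ".join(
            f"{field}^{weight}" for field, weight in boost_fields.items())
    if boost_values:
        params["bq"] = [
            f"{field}:{value}^{weight}"
            for (field, value), weight in boost_values.items()]
    return params


params = build_params(
    boost_fields={"title": 3, "description": 1},  # a match in the title counts more
    boost_values={("published", 2016): 2},        # newer books get a nudge
)
# doseq=True repeats the bq parameter once per boost query.
print(urlencode(params, doseq=True))
```

With boosts like these, the "Big Data" book from 2016 would be favored over the one that merely repeats "big data" in its description, though the final order still depends on the actual scores.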

Back to sorting: where the inverted index fails
Term (rating, in stars)  Docs
1                        1,2,8,5,128
2                        7,84,129
3                        3,29,345
4                        11,123,455
5                        12,14,16,17
Search returned docs 84, 455, 12 and 8.
Now sort them by rating.
¯\_(ツ)_/¯

Enter doc values
Doc  Terms (rating)
8    1
12   3
84   5
129  4
455  2
Search returned docs 84, 455, 12 and 8.
Now sort them by rating.
Similar, but not quite like stored fields*
* Faster retrieval for doc values. For analyzed text, you’re stuck with stored fields and in-memory field cache
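The difference between the two structures can be sketched in Python. This is a toy model under stated assumptions: the ratings come from the doc values table above, the inverted index is derived from them just for the comparison, and the function names are made up. Real doc values are an on-disk columnar format, not a dict.

```python
# Doc values: doc ID -> rating, one column for the "rating" field.
DOC_VALUES = {8: 1, 12: 3, 84: 5, 129: 4, 455: 2}


def invert(doc_values):
    """Build the term -> docs view of the same data."""
    inverted = {}
    for doc_id, value in doc_values.items():
        inverted.setdefault(value, []).append(doc_id)
    return inverted


def sort_with_inverted_index(inverted, hits):
    """Walk every term in order and pick out the hits.

    The work grows with the size of the index, not with the number of hits.
    """
    wanted = set(hits)
    ordered = []
    for term in sorted(inverted):
        ordered.extend(doc for doc in inverted[term] if doc in wanted)
    return ordered


def sort_with_doc_values(doc_values, hits):
    """One lookup per hit, then a plain sort of the hits."""
    return sorted(hits, key=doc_values.__getitem__)


HITS = [84, 455, 12, 8]
print(sort_with_inverted_index(invert(DOC_VALUES), HITS))
print(sort_with_doc_values(DOC_VALUES, HITS))
```

Both calls return the same order; the point is that the second one never touches docs outside the result set.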

Facets
search returns doc IDs
facet=true
facet.field=host
doc1: host=server01
doc2: host=server02
doc3: host=server01
doc4: host=server01
Facet counts (from doc values, usually*):
server01: 3
server02: 1
* can be filter cache on low cardinality fields (depends on facet.method)
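Conceptually, field faceting over doc values is a lookup per matching doc followed by a count. A minimal Python sketch, where the dict standing in for the `host` doc values column and the function name are assumptions for illustration:

```python
from collections import Counter

# Stand-in for the doc values column of the "host" field.
HOST_DOC_VALUES = {
    "doc1": "server01",
    "doc2": "server02",
    "doc3": "server01",
    "doc4": "server01",
}


def facet_field(doc_values, matching_docs):
    """Count field values across the docs that the search returned."""
    return Counter(doc_values[doc_id]
                   for doc_id in matching_docs
                   if doc_id in doc_values)


# A search matching all four docs gives the counts from the slide.
counts = facet_field(HOST_DOC_VALUES, ["doc1", "doc2", "doc3", "doc4"])
print(counts.most_common())
```

Facet counts always follow the search results: if the query only matched doc1 and doc2, each server would get a count of 1.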

Facets can be hierarchical
Request:
top_genres:{
  terms:{
    field: genre,
    limit: 5,
    facet:{
      top_authors:{
        terms:{
          field: author,
          limit: 2
        }
      }
    }
  }
}
Response:
"top_genres":{
  "buckets":[
    {
      "val":"Fantasy",
      "count":5432,
      "top_authors":{ // top authors in the "Fantasy" genre
        "buckets":[{
            "val":"Mercedes Lackey",
            "count":121},
          {
            "val":"Piers Anthony",
            "count":98}
        ]
      }
    },
    {
      "val":"Mystery",
      "count":4322,
      "top_authors":{ // top authors in the "Mystery" genre
        "buckets":[{
            "val":"James Patterson",
            "count":146},
          ...
        ]
      }
    },
    ...
  ]
}
Can also be numeric/date ranges or functions like avg, sum, unique or percentile
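The nesting can be mimicked in Python to show what the response means: each top-level bucket re-runs the sub-facet on only the docs that fell into that bucket. This is a sketch, not Solr's implementation; the book list, its counts and the `Some Other Author` placeholder are made up, and `terms_facet` is a hypothetical helper.

```python
from collections import Counter


def terms_facet(docs, field, limit, sub_facet=None):
    """Top `limit` values of `field`, optionally with a nested terms facet.

    sub_facet is a (name, field, limit) tuple applied inside each bucket.
    """
    counts = Counter(doc[field] for doc in docs if field in doc)
    buckets = []
    for value, count in counts.most_common(limit):
        bucket = {"val": value, "count": count}
        if sub_facet:
            name, sub_field, sub_limit = sub_facet
            members = [doc for doc in docs if doc.get(field) == value]
            bucket[name] = terms_facet(members, sub_field, sub_limit)
        buckets.append(bucket)
    return {"buckets": buckets}


# Made-up catalogue: 6 Fantasy books by three authors, 2 Mystery books.
BOOKS = (
    [{"genre": "Fantasy", "author": "Mercedes Lackey"}] * 3
    + [{"genre": "Fantasy", "author": "Piers Anthony"}] * 2
    + [{"genre": "Fantasy", "author": "Some Other Author"}]
    + [{"genre": "Mystery", "author": "James Patterson"}] * 2
)

result = terms_facet(BOOKS, "genre", 5, sub_facet=("top_authors", "author", 2))
print(result)
```

The third Fantasy author is counted in the genre total but cut from `top_authors` by the limit of 2, the same way the request above behaves.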

Beyond the shards: streaming aggregations
Sources: search, facet, jdbc, ...
Decorators: rollup, unique, innerJoin, parallel, ...
⇒ Parallel SQL, Text Classification, Graph Traversal
Diagram: client app → Solr endpoint → worker1, worker2 → shard1, shard2
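The source/decorator idea can be sketched with Python generators: sources emit sorted tuples from each shard, and decorators wrap a stream and transform it without loading everything in memory. This only models the concept and is not the Solr streaming API; the shard contents, field names and function names are assumptions for illustration.

```python
import heapq
from itertools import groupby

# Made-up tuples living on two shards.
SHARD1 = [{"host": "server02", "bytes": 5}, {"host": "server01", "bytes": 10}]
SHARD2 = [{"host": "server01", "bytes": 7}, {"host": "server03", "bytes": 1}]


def search(shard, sort_field):
    """Source: stream one shard's tuples, sorted by sort_field."""
    yield from sorted(shard, key=lambda t: t[sort_field])


def merge(streams, sort_field):
    """Merge already-sorted streams into one sorted stream."""
    return heapq.merge(*streams, key=lambda t: t[sort_field])


def rollup(stream, over, sum_field):
    """Decorator: aggregate consecutive tuples that share the same `over` value.

    Relies on the input being sorted by `over`, so only one group is held
    in memory at a time.
    """
    for value, group in groupby(stream, key=lambda t: t[over]):
        tuples = list(group)
        yield {
            over: value,
            "count(*)": len(tuples),
            f"sum({sum_field})": sum(t[sum_field] for t in tuples),
        }


merged = merge([search(SHARD1, "host"), search(SHARD2, "host")], "host")
for row in rollup(merged, over="host", sum_field="bytes"):
    print(row)
```

Because every stage is a generator, aggregation is not limited to what a single shard can see, which is the "beyond the shards" part.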

Master-slave: high-QPS on static data
Diagram: indexer → docs → master → replicates segments → slave1, slave2, slave3 ← queries ← searcher
Simple
Battle-tested
Index data only once
Slaves can cache like crazy
Separate roles ⇒ separate (read: optimized) hardware and configs

SolrCloud
Diagram: indexer and searcher → Solr nodes (leader1, replica1, leader2, replica2), coordinated by Zookeeper
Near realtime search
Durability
Scales both reads and writes
No SPOF
Central config, nicer APIs
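How reads and writes both scale can be sketched as a routing problem. In the Python below, each doc ID hashes to one shard, a write goes to that shard's leader and on to its replica, and a search needs just one node per shard. Everything here is a simplification under stated assumptions: the pairing of leader1/replica1 into "shard1" is inferred from the diagram, the function names are hypothetical, and `zlib.crc32` is only a stand-in for Solr's real hash-based router.

```python
import zlib

# Assumed layout: two shards, each with one leader and one replica.
CLUSTER = {
    "shard1": {"leader": "leader1", "replicas": ["replica1"]},
    "shard2": {"leader": "leader2", "replicas": ["replica2"]},
}


def shard_for(doc_id, cluster=CLUSTER):
    """Pick the shard that owns a document (stand-in hash, see above)."""
    names = sorted(cluster)
    return names[zlib.crc32(doc_id.encode("utf-8")) % len(names)]


def index_route(doc_id, cluster=CLUSTER):
    """Nodes a write passes through: the shard leader first, then its replicas."""
    shard = cluster[shard_for(doc_id, cluster)]
    return [shard["leader"], *shard["replicas"]]


def search_route(cluster=CLUSTER, pick=0):
    """One node per shard is enough to answer a query; `pick` rotates the choice."""
    route = []
    for name in sorted(cluster):
        nodes = [cluster[name]["leader"], *cluster[name]["replicas"]]
        route.append(nodes[pick % len(nodes)])
    return route


print(index_route("book-42"))
print(search_route(pick=0), search_route(pick=1))
```

Adding shards spreads the writes, adding replicas spreads the queries, and with a replica behind every leader no single node is a SPOF.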

In a nutshell
Typical use-cases:
Product search (books, movies, bikes, weapons… anything that requires relevancy)
Time-series data (logs, metrics, social media...)
Search on top of (or as a source of) other Big Data tools (Spark, HDFS…)
Search on top of (or alongside) relational DBs
Typical challenges:
Updates (though there’s WiP for numeric doc values in SOLR-5944)
Not really schema-less (schema can only be appended)
Doesn’t like sparse data (again, there’s ongoing work to make it better, see LUCENE-7253)
Some relational, stream and batch processing capabilities, but not the tool for those jobs

Demo
Commands available at
https://github.com/sematext/meetups/blob/master/introduction_to_solr_demo_commands.sh

Thank you!
Radu Gheorghe
radu.gheorghe@sematext.com
@radu0gheorghe
Sematext
info@sematext.com
http://sematext.com
@sematext
Join Us! We are hiring!
http://sematext.com/jobs
Backend, UI, Sales, Consulting, Trainers
