Introduction to Solr
Radu Gheorghe
Sematext Group, Inc.
About me
[Diagram: Sematext products and services: Logsene, SPM, ES API, metrics, ...]
+ https://sematext.com/blog/author/radu7gheorghe/
+ https://www.manning.com/books/elasticsearch-in-action
Agenda
What is Solr
When to use it
When not to use it
How it works
Demo
Pleeeeease ask questions.
Otherwise it will be boring :(
What is Solr
Open source
Search engine
Based on Apache Lucene*
Distributed (SolrCloud) or not (master-slave)
* Actually, the two projects merged in 2010
More on search: the term dictionary and its friends
Documents:
1) Big data is fun
2) Other text
3) Bucharest is big
After analysis:
Term       Docs   Positions   Counts, stored, etc.
big        1,3    [0],[2]     ...
bucharest  3      [0]
data       1      [1]
fun        1      ...
is         1,3
other      2
text       2
Queries: big AND data; “big data”
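As a rough illustration of the slide above, here is a minimal Python sketch of a positional inverted index over the same three documents. The `analyze`, `search_and` and `search_phrase` helpers are hypothetical names made up for this example. They are not Solr or Lucene APIs, and real analysis chains do much more than lowercasing and splitting.

```python
import re
from collections import defaultdict

docs = {1: "Big data is fun", 2: "Other text", 3: "Bucharest is big"}

def analyze(text):
    """Toy analysis: lowercase, then split into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

# Term dictionary: term -> {doc_id: [positions]}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, term in enumerate(analyze(text)):
        index[term].setdefault(doc_id, []).append(pos)

def search_and(*terms):
    """Docs containing every term: intersect the postings lists."""
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

def search_phrase(phrase):
    """Docs where the terms appear at consecutive positions."""
    terms = analyze(phrase)
    hits = []
    for doc_id in search_and(*terms):
        starts = index[terms[0]][doc_id]
        if any(all(s + i in index[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return hits

print(search_and("big", "data"))   # AND query: only needs the doc lists
print(search_phrase("big data"))   # phrase query: also needs positions
```

The AND query only intersects doc lists, while the phrase query has to consult positions, which is why the index stores them.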
Segments and merging
The [relevancy] score
BM25: bag-of-words, based on TF-IDF
q=big AND data
Example document: I have big big big data
Term Frequency: the more occurrences in the document, the more weight
Inverse Document Frequency: the fewer occurrences in the index, the more weight
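To make the TF and IDF intuition concrete, here is a small Python sketch of the textbook BM25 formula for a single term. It is an assumption-laden illustration: the function name, the default `k1` and `b` values, and the sample numbers are chosen for this example, and the exact formula in a given Lucene version may differ in its details.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                    k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document."""
    # IDF: the fewer documents contain the term, the larger the weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: long documents get less credit per occurrence.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # TF: more occurrences score higher, but with diminishing returns.
    return idf * tf * (k1 + 1) / (tf + norm)

# "I have big big big data": 6 terms, "big" occurs 3 times, "data" once.
# The index size (100 docs) and document frequencies are made up.
score_big = bm25_term_score(tf=3, doc_len=6, avg_doc_len=6,
                            n_docs=100, doc_freq=40)
score_data = bm25_term_score(tf=1, doc_len=6, avg_doc_len=6,
                             n_docs=100, doc_freq=10)
print(score_big + score_data)  # score for q=big AND data
```

Summing the per-term scores is what makes this a bag-of-words model: term order and proximity play no part.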
Relevancy tuning
title: Big Data
description: this is a book about big data
published: 2016
title: Spark Rulz
description: big data big data big data big data
published: 2015
q=big AND data
boost fields
boost values
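One possible way to express both kinds of boost is through the eDisMax query parser, sketched below in Python. The host, port and `books` collection name are hypothetical, and the boost factors are arbitrary. `qf` weights fields and `bq` adds a boost for documents matching a value.

```python
from urllib.parse import urlencode

params = {
    "q": "big AND data",
    "defType": "edismax",
    # Boost fields: a match in title counts 3x a match in description.
    "qf": "title^3 description",
    # Boost values: documents published in 2016 get extra score.
    "bq": "published:2016^2",
}

# Hypothetical local Solr instance and collection name.
url = "http://localhost:8983/solr/books/select?" + urlencode(params)
print(url)
```

With field boosts like these, the "Big Data" book would no longer be outranked just because "Spark Rulz" repeats the terms in its description.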
Back to sorting: where the inverted index fails
Term       Docs
1 [star]   1,2,8,5,128
2          7,84,129
3          3,29,345
4          11,123,455
5          12,14,16,17
Search returned docs 84, 455, 12 and 8. Now sort them by rating.
¯\_(ツ)_/¯
Enter doc values
Doc   Terms
8     1
12    3
84    5
129   4
455   2
Search returned docs 84, 455, 12 and 8. Now sort them by rating.
Similar, but not quite like stored fields*
* Faster retrieval for doc values. For analyzed text, you’re stuck with stored fields and in-memory field cache
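The difference can be sketched in a few lines of Python, using the doc values table from this slide. This is only an analogy with plain dictionaries, not how Lucene stores either structure on disk.

```python
from collections import defaultdict

# Doc values: doc -> value, so a lookup per hit is a single access.
doc_values = {8: 1, 12: 3, 84: 5, 129: 4, 455: 2}

# The inverted view of the same data: value -> docs.
inverted = defaultdict(list)
for doc, rating in doc_values.items():
    inverted[rating].append(doc)

def rating_via_inverted(doc):
    """Finding one doc's rating means scanning postings lists."""
    for rating, docs in inverted.items():
        if doc in docs:
            return rating
    return None

hits = [84, 455, 12, 8]
by_rating = sorted(hits, key=doc_values.__getitem__, reverse=True)
print(by_rating)
```

Both structures hold the same information; only the doc-to-value direction makes sorting a result set cheap.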
Facets
Search returns doc IDs; facet=true, facet.field=host
doc1: host=server01
doc2: host=server02
doc3: host=server01
doc4: host=server01
Facet counts:
server01: 3
server02: 1
Counts come from doc values, usually*
* can be filter cache on low cardinality fields (depends on facet.method)
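Conceptually, a field facet is a count of doc values over the matching doc IDs. A minimal Python sketch with the four documents from the slide (the variable names are made up for illustration):

```python
from collections import Counter

# Doc values for the host field: doc ID -> value.
host_doc_values = {
    "doc1": "server01",
    "doc2": "server02",
    "doc3": "server01",
    "doc4": "server01",
}

# Pretend the search matched all four documents.
matching_docs = ["doc1", "doc2", "doc3", "doc4"]

facet_counts = Counter(host_doc_values[doc] for doc in matching_docs)
for host, count in facet_counts.most_common():
    print(f"{host}: {count}")
```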
Facets can be hierarchical
Request:
top_genres:{
  terms:{
    field: genre,
    limit: 5,
    facet:{
      top_authors:{
        terms:{
          field: author,
          limit: 2
        }
      }
    }
  }
}
Response (truncated):
"top_genres":{
  "buckets":[
    {
      "val":"Fantasy",
      "count":5432,
      "top_authors":{ // top authors in the "Fantasy" genre
        "buckets":[{
          "val":"Mercedes Lackey",
          "count":121},
        {
          "val":"Piers Anthony",
          "count":98}
        ]
      }
    },
    {
      "val":"Mystery",
      "count":4322,
      "top_authors":{ // top authors in the "Mystery" genre
        "buckets":[{
          "val":"James Patterson",
          "count":146},
        ...
Can also be numeric/date ranges or functions like avg, sum, unique or percentile
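The nesting can be pictured as a terms facet computed again inside each bucket of the outer facet. Below is a small in-memory Python sketch of that idea. The `nested_terms_facet` function and the tiny book list are made up for illustration, and the counts are not the ones from the response above.

```python
from collections import Counter, defaultdict

def nested_terms_facet(docs, outer, inner, outer_limit, inner_limit):
    """Top values of `outer`, and per bucket the top values of `inner`."""
    outer_counts = Counter(d[outer] for d in docs)
    inner_counts = defaultdict(Counter)
    for d in docs:
        inner_counts[d[outer]][d[inner]] += 1
    return {"buckets": [
        {"val": val,
         "count": count,
         "top_" + inner: {"buckets": [
             {"val": v, "count": c}
             for v, c in inner_counts[val].most_common(inner_limit)]}}
        for val, count in outer_counts.most_common(outer_limit)]}

# Toy data, invented for this example.
books = [
    {"genre": "Fantasy", "author": "Mercedes Lackey"},
    {"genre": "Fantasy", "author": "Mercedes Lackey"},
    {"genre": "Fantasy", "author": "Piers Anthony"},
    {"genre": "Mystery", "author": "James Patterson"},
]

top_genres = nested_terms_facet(books, "genre", "author",
                                outer_limit=5, inner_limit=2)
print(top_genres)
```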
Beyond the shards: streaming aggregations
Sources: search, facet, jdbc, ...
Decorators: rollup, unique, innerJoin, parallel, ...
⇒ Parallel SQL, Text Classification, Graph Traversal
[Diagram: client app, Solr endpoint, worker1 and worker2, shard1 and shard2]
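The source/decorator split can be imitated with plain Python generators: sources emit sorted tuples from each shard, and decorators wrap a stream to transform it. This is only an analogy under made-up names (`search`, `merge_shards`, `rollup`) and toy data. It does not use Solr's streaming expression syntax or API.

```python
import heapq
from itertools import groupby

def search(shard_docs, sort_field):
    """Source: emit one shard's tuples, sorted by sort_field."""
    yield from sorted(shard_docs, key=lambda t: t[sort_field])

def merge_shards(streams, sort_field):
    """Merge already-sorted shard streams into one sorted stream."""
    yield from heapq.merge(*streams, key=lambda t: t[sort_field])

def rollup(stream, over):
    """Decorator: count tuples per value; needs input sorted on `over`."""
    for value, group in groupby(stream, key=lambda t: t[over]):
        yield {over: value, "count": sum(1 for _ in group)}

# Toy shards, invented for this example.
shard1 = [{"host": "server02"}, {"host": "server01"}]
shard2 = [{"host": "server01"}, {"host": "server01"}]

streams = [search(shard1, "host"), search(shard2, "host")]
result = list(rollup(merge_shards(streams, "host"), over="host"))
print(result)
```

Because every stage consumes sorted input lazily, the aggregation never needs the whole result set in memory, which is the point of going "beyond the shards".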
Master-slave: high-QPS on static data
[Diagram: the indexer sends docs to the master; the master replicates segments to slave1, slave2 and slave3; the searcher sends queries to the slaves]
Simple
Battle-tested
Index data only once
Slaves can cache like crazy
Separate roles ⇒ separate (read: optimized) hardware and configs
SolrCloud
[Diagram: indexer and searcher talk to the Solr nodes (leader1, replica1, leader2, replica2), which coordinate through ZooKeeper]
Near realtime search
Durability
Scales both reads and writes
No SPOF
Central config, nicer APIs
In a nutshell
Typical use-cases
Product search (books, movies, bikes, weapons… anything that requires relevancy)
Time-series data (logs, metrics, social media...)
Search on top of (or as a source of) other Big Data tools (Spark, HDFS…)
Search on top of (or alongside) relational DBs
Typical challenges
Updates (though there’s WiP for numeric doc values in SOLR-5944)
Not really schema-less (schema can only be appended)
Doesn’t like sparse data (again, there’s ongoing work to make it better, see LUCENE-7253)
Some relational, stream and batch processing capabilities, but not the tool for those jobs
Demo
Commands available at
https://github.com/sematext/meetups/blob/master/introduction_to_solr_demo_commands.sh
Thank you!
Radu Gheorghe
radu.gheorghe@sematext.com
@radu0gheorghe
Sematext
info@sematext.com
http://sematext.com
@sematext
Join Us! We are hiring!
http://sematext.com/jobs
Backend, UI, Sales,
Consulting, Trainers
