Solr on Cloud
Tallinn University of Technology
Introduction to Development in Cloud by Anton Vedešin
Road Management Team
2
What is ?What is ?
“Solr is the popular, blazing-fast open source enterprise search platform built on Apache Lucene™.
Solr powers the search and navigation features of many of the world's largest internet sites.”

What is Solr not?

Key aspects of Solr
highly reliable
scalable
fault tolerant
provides distributed indexing
replication
load-balanced querying
automated failover and recovery
centralised configuration

Why do we need Solr?
optimised for search
large volumes of documents
text-centric
results sorted by relevance
read-dominant
document-oriented
flexible schema

How does it work?

Inverted index
All terms in the index map to one or more documents.
Terms in the inverted index are sorted in ascending lexicographical order.
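A minimal sketch of the inverted index idea, assuming whitespace tokenisation and lowercasing (real analysers do much more). The sample documents and the function name `build_inverted_index` are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: naive tokenisation by whitespace, lowercased.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Keep the terms in ascending lexicographical order.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical sample documents keyed by document id.
docs = {
    1: "solr is a search platform",
    2: "lucene is a search library",
    3: "solr is built on lucene",
}

index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Looking up a term is then a dictionary access, for example `index["solr"]`, instead of a scan over every document.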

Finding sets of documents
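One way to picture this step: once each term maps to a set of document ids, boolean queries reduce to set operations. The sketch below assumes a tiny hand-written postings table; the helper names are hypothetical and not part of Solr.

```python
# Hypothetical postings: term -> set of ids of documents containing the term.
postings = {
    "solr": {1, 3, 4},
    "lucene": {2, 3, 4},
    "cloud": {4, 5},
}


def docs_with_all(terms):
    """AND query: documents that contain every term."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def docs_with_any(terms):
    """OR query: documents that contain at least one term."""
    result = set()
    for term in terms:
        result |= postings.get(term, set())
    return result


def docs_with_but_not(required, excluded):
    """AND NOT query: documents with `required` but without `excluded`."""
    return postings.get(required, set()) - postings.get(excluded, set())


print(docs_with_all(["solr", "lucene"]))
print(docs_with_any(["solr", "cloud"]))
print(docs_with_but_not("solr", "cloud"))
```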

Relevancy calculation
term frequency (tf)
inverse document frequency (idf)
term boosts (t.getBoost)
field normalisation (norm)
coordination factor (coord)
query normalisation (queryNorm)
Score
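The factors listed above combine roughly as follows in Lucene's classic TF-IDF similarity. This is a sketch of that documented formula, not necessarily the exact form shown on the slide, and newer Solr releases use BM25 by default.

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \Big( \mathrm{tf}(t,d)\cdot \mathrm{idf}(t)^{2}\cdot t.\mathrm{getBoost}()\cdot \mathrm{norm}(t,d) \Big)
```

Here q is the query, d a document, and the sum runs over the query terms t.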

Inverse document frequency (idf)
Not all search terms are created equal!
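The idea that rare terms should weigh more than common ones can be sketched numerically. The formula below is the one used by Lucene's classic similarity, taken here as an assumption; the document counts are invented.

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency: the rarer the term, the higher the value.

    Assumption: idf = 1 + ln(num_docs / (doc_freq + 1)), as in Lucene's
    classic TF-IDF similarity.
    """
    return 1.0 + math.log(num_docs / (doc_freq + 1))


num_docs = 1000                   # hypothetical index size
common = idf(899, num_docs)       # a term found in most documents
rare = idf(9, num_docs)           # a term found in only a few documents
print(f"common term idf: {common:.3f}")
print(f"rare term idf:   {rare:.3f}")
```

A match on the rare term therefore contributes more to the score than a match on the common one.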

Unstructured data

Text-centric data

Read-dominant

Document-oriented

Flexible schema

Keyword search
relevant results must be returned quickly
spelling correction is needed
autosuggestions save keystrokes
synonyms of query terms must be recognised
phrase handling is needed
queries with common words must be handled
show more results if the top results aren’t satisfactory

Ranked retrieval

Faceted search
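Faceted search can be pictured as counting the matching documents per value of a field, so that the user can narrow the results. A minimal sketch, with invented result documents and a hypothetical helper `facet_counts`:

```python
from collections import Counter

# Hypothetical documents that matched some query.
results = [
    {"id": 1, "brand": "Acme", "colour": "red"},
    {"id": 2, "brand": "Acme", "colour": "blue"},
    {"id": 3, "brand": "Globex", "colour": "red"},
    {"id": 4, "brand": "Initech"},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)


def filter_by(docs, field, value):
    """Narrow the result set after the user picks a facet value."""
    return [doc for doc in docs if doc.get(field) == value]


print(facet_counts(results, "brand"))
print(facet_counts(filter_by(results, "brand", "Acme"), "colour"))
```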

Scalable
cache management
concurrent queries
CPU & I/O constraints
query throughput
number of documents indexed
replicas
shards

Fault-tolerant
number of documents indexed

Geospatial search

Multilingual support

Near real-time search (NRT)

Data modeling features
Result grouping/field collapsing
Flexible query support
Joins
Document clustering
Importing rich document formats such as PDF, Word
Importing data from relational databases
flat denormalised document

Other important features
Atomic updates with optimistic concurrency
Real-time get
Write-durability using a transaction log
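A toy model of an atomic update guarded by optimistic concurrency. It is loosely modelled on the `_version_` convention described in the Solr documentation (greater than 1: must match exactly, 1: document must exist, negative: document must not exist, 0: no check). The class `TinyStore` is hypothetical and in-memory only, and merging fields is a simplification of Solr's update modifiers.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version does not hold."""


class TinyStore:
    """In-memory stand-in for an index; not a Solr client."""

    def __init__(self):
        self._docs = {}          # doc_id -> (version, fields)
        self._last_version = 1   # issued versions are always greater than 1

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def update(self, doc_id, fields, expected_version=0):
        current = self._docs.get(doc_id)
        exists = current is not None
        if expected_version > 1 and (not exists or current[0] != expected_version):
            raise VersionConflict("document changed since it was read")
        if expected_version == 1 and not exists:
            raise VersionConflict("document does not exist")
        if expected_version < 0 and exists:
            raise VersionConflict("document already exists")
        self._last_version += 1
        merged = dict(current[1]) if exists else {}
        merged.update(fields)    # only the supplied fields change
        self._docs[doc_id] = (self._last_version, merged)
        return self._last_version


store = TinyStore()
v1 = store.update("doc1", {"title": "Solr on Cloud"}, expected_version=-1)
v2 = store.update("doc1", {"views": 1}, expected_version=v1)
print(store.get("doc1"))
```

A writer that still holds `v1` after the second update would get a `VersionConflict` and has to re-read the document before retrying.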

SolrCloud
centralised configuration
distributed indexing with no SPoF
automated failover to a new shard leader
queries can be sent to any node in a cluster to trigger
a full, distributed search across all shards, with
failover and load-balancing support built in.
fault-tolerance & high availability
ZooKeeper
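The distributed search described above can be pictured as scatter-gather: the receiving node asks every shard for its best hits and merges them by score. The sketch below assumes each shard is just a list of (doc_id, score) pairs; it leaves out networking, failover and load balancing, and the function names are hypothetical.

```python
import heapq


def search_shard(shard_hits, top_k):
    """Return the top_k (doc_id, score) hits of a single shard."""
    return heapq.nlargest(top_k, shard_hits, key=lambda hit: hit[1])


def distributed_search(shards, top_k):
    """Scatter the query to all shards, then gather and merge by score."""
    partial_results = [search_shard(hits, top_k) for hits in shards]
    all_hits = (hit for hits in partial_results for hit in hits)
    return heapq.nlargest(top_k, all_hits, key=lambda hit: hit[1])


# Hypothetical per-shard hits for one query.
shards = [
    [("a", 0.90), ("b", 0.20)],
    [("c", 0.70)],
    [("d", 0.95), ("e", 0.10)],
]

print(distributed_search(shards, top_k=3))
```

Each shard only needs to return its own top_k hits, because no document outside a shard's top_k can appear in the global top_k.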

When not to use Solr
request a large result set
do deep analytic tasks
querying across relationships
document-level security

References
http://lucene.apache.org/solr/quickstart.html
https://www.manning.com/books/solr-in-action?a_bid=39472865&a_aid=1 (Chapters 1 and 3)
https://engineering.linkedin.com/faceting/many-facets-faceted-search

Thank you!

Who?
Postgres DBA @ 2ndQuadrant
Studying MSc Comp. & Systems Eng. @ Tallinn University of Technology
Studied BSc Maths Eng. @ Yildiz Technical University
Writes on the 2ndQuadrant blog
Does some childish paintings
Loves independent films
@apatheticmagpie
Skype: gulcin2ndq
Github: gulcin