Enterprise Search platform
Building solid, scalable enterprise search REST services on top of Apache Lucene




                                 Tommaso Teofili
Agenda

• Apache Lucene overview


• Why do we need Apache Solr?


• Everyman tales from Solr


• Enterprise what?


• One step beyond...
Apache Lucene overview

• Information Retrieval library


• Inverted indexes are quick and efficient


• Vector space model


• Advanced search options (synonyms, stopwords, similarity, proximity)


• Different language implementations (Java, .NET, C, Python)
The Lucene API

• Lucene indexes are built on a Directory


• Directory can be accessed by IndexReaders and IndexWriters


• IndexSearchers are built on top of Directories and IndexReaders


• IndexWriters can write Documents inside the index


• Documents are made of Fields


• Fields have value(s) and options


• Directory > IndexReader/Writer > Document > Field
Indexing Lucene

• A Lucene index has one or more segments and a generation


• Changes to the index must be committed (and optionally optimized)


• No fixed schema


• Each field can be STORED, INDEXED and ANALYZED


• Each field can have NORMS and TERM VECTORS
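
A minimal indexing sketch in Java, assuming the Lucene 3.x-era API (the path and field names are just examples):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexingExample {
  public static void main(String[] args) throws Exception {
    // a Directory is the abstraction over the index storage
    Directory dir = FSDirectory.open(new File("/tmp/example-index"));
    // an IndexWriter writes Documents into the Directory
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
    // a Document is made of Fields, each with a value and options
    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("title", "Apache Lucene overview", Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.commit(); // changes must be committed to become visible to new readers
    writer.close();
  }
}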
Searching Lucene

• Open an IndexSearcher on top of an IndexReader over a Directory


• Many query types: TermQuery, MultiTermQuery, BooleanQuery,
  WildcardQuery, PhraseQuery, PrefixQuery, MultiPhraseQuery, FuzzyQuery,
  TermRangeQuery, NumericRangeQuery


• Get results from a TopDocs object
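
A matching search sketch, again assuming the Lucene 3.x-era API and the example index built above:

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class SearchingExample {
  public static void main(String[] args) throws Exception {
    // open an IndexSearcher on top of an IndexReader over a Directory
    IndexReader reader = IndexReader.open(FSDirectory.open(new File("/tmp/example-index")));
    IndexSearcher searcher = new IndexSearcher(reader);
    // a TermQuery matches documents containing the given term in the given field
    TopDocs topDocs = searcher.search(new TermQuery(new Term("title", "lucene")), 10);
    for (ScoreDoc sd : topDocs.scoreDocs) {
      Document hit = searcher.doc(sd.doc);
      System.out.println(hit.get("title") + " (score=" + sd.score + ")");
    }
    searcher.close();
  }
}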
Why do we need Apache Solr?

• Lucene is a library

• Lucene by itself can only be queried programmatically

• Often the search system has to be totally independent from other
  systems (e.g.: a CMS)

• A ready-to-deploy search server is what you need

• Need to scale both vertically and horizontally
The Solar System
Everyman tales with Solr
Apache Solr - Overview

• Ready to use enterprise search server


• REST (and programmatic) API


• Results in XML, JSON, PHP, Ruby, etc...


• Exploits Lucene's power


• Scaling capabilities (replication, distributed search)


• Easy administration interface


• Easy to extend and customize (plugin architecture)
Apache Solr - Project status

• Latest release: 1.4.1 (June 2010)


• Lots of new features on trunk


• Most of the new features are on the 3.0 branch


• A huge, very active community


• A project powered by Lucid Imagination
Solr - 5-minute tutorial

• Download latest release (1.4.1)


• cd $SOLR_HOME/example


• java -jar -server start.jar


• You now have an up-and-running Solr instance, accessible at http://localhost:8983/solr
  (it runs on top of Jetty)


• cd $SOLR_HOME/example/exampledocs


• Index with the command: sh post.sh *.xml


• Search with your browser
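
For example, with the stock example documents indexed as above, pointing the browser at http://localhost:8983/solr/select?q=solr&indent=true returns the matching documents as XML.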
Solr - Query syntax

• The default operator is OR (you can override it by adding &q.op=AND to the HTTP request)


• You can query fields with fieldname:value


• Common + - AND OR NOT modifiers


• Range queries on date or numeric fields timestamp:[* TO NOW]


• Boost terms, e.g.: roma^2 inter


• Fuzzy search roam~0.6


• ...
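
A full query might combine these, e.g. q=title:solr AND timestamp:[* TO NOW] with &q.op=AND appended (field names here are hypothetical, and spaces and brackets must be URL-encoded in the actual request).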
Solr - Basic configuration steps
• Define fields, types and analysis inside schema.xml


• Play with solrconfig.xml:


    • request handlers (update, search)


    • index parameters


    • caches


    • deletion policy


    • autowarming


    • replication, clustering, etc...
Solr - schema.xml

• Types


• Analyzers to use for each type


• Fields with name, type and options


• Unique key


• Dynamic fields


• Copy fields


• Don’t use the default schema.xml, write it from scratch!
Solr - Type definition

• Analyzers for querying and indexing are declared inside the schema
Solr - solrconfig.xml

• Where Solr will write the index


• Index merge factor


• Control different caches: documents, query results, filters


• Request handlers available to consume (HTTP) requests; typically at least a (standard)
  search handler and an update handler exist


• Update request processor chains to configure indexing behavior


• Event listeners (newSearcher, firstSearcher)


• and much more...
Solr - Indexing

• Update requests are sent to the index as XML commands via HTTP POST


• <add> to insert and update




• <delete> to remove by unique key or query
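
The same commands can also be issued programmatically; a sketch using the SolrJ client that ships with Solr 1.4 (the URL, field names and query are just examples):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexingExample {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("title", "Enterprise search with Solr");
    server.add(doc);                            // issues an <add> command
    server.deleteByQuery("category:obsolete");  // issues a <delete> command
    server.commit();                            // makes the changes searchable
  }
}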
Solr - Searching

• HTTP GET to the Solr instance with the mandatory q parameter, which specifies the
  query


• df - the default field to query


• fl - the list of fields to return (stored fields only)


• sort - fields used for sorting, defaults to score (which is not a field)


• start, rows - paging attributes


• wt - response type, defaults to xml, can be json, php, ruby, etc.
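
The same parameters mapped onto a SolrJ query, as a sketch (field names and the query are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearchExample {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery("title:solr");  // q - the query
    query.set("fl", "id,title,score");              // fl - fields to return
    query.set("sort", "score desc");                // sort
    query.setStart(0);                              // start - paging offset
    query.setRows(10);                              // rows - page size
    QueryResponse rsp = server.query(query);
    for (SolrDocument d : rsp.getResults()) {
      System.out.println(d.getFieldValue("id") + " - " + d.getFieldValue("title"));
    }
  }
}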
Solr - Data import

• Typically “old” systems rely on databases


• Data can be imported from DBs using the DataImportHandler component


• Define datasource, driver and mappings
Solr - Highlighting

• Useful when a snippet of the search results is needed


• In Solr 1.4.1 only stored fields can be highlighted


• Add &hl=true&hl.fl=field1,field2 to the HTTP search request to enable
  highlighting on field1 and field2
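
With SolrJ this maps to the highlight parameters on the SolrQuery and server objects from the searching sketch above (the field names are illustrative):

query.setHighlight(true);          // &hl=true
query.addHighlightField("title");  // &hl.fl=title
query.addHighlightField("description");
QueryResponse rsp = server.query(query);
// snippets are keyed by the document unique key, then by field name
java.util.Map<String, java.util.Map<String, java.util.List<String>>> snippets = rsp.getHighlighting();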
Solr - Faceting

• Break up search results into multiple categories showing counts for each


• Often used in e-commerce sites


• Can be very useful in guiding user experience


• Users can then drill down into the results of a given category only
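
Continuing the same SolrQuery, faceting on a hypothetical category field looks like:

query.setFacet(true);             // &facet=true
query.addFacetField("category");  // &facet.field=category
QueryResponse rsp = server.query(query);
// one count per category value
for (org.apache.solr.client.solrj.response.FacetField.Count c
     : rsp.getFacetField("category").getValues()) {
  System.out.println(c.getName() + " (" + c.getCount() + ")");
}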
Solr - Filter queries

• Queries applied as filters alongside the main query


• They define a document superset without influencing the score


• Useful for domain-specific queries where you want the user to search only in
  certain “areas” of the index


• Add &fq=somefilterquery with the default Solr syntax
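
A drill-down into one of the facet values above can be expressed as a filter query on the same SolrQuery (the value is just an example):

query.addFilterQuery("category:books");  // &fq=category:books - restricts results without changing scores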
Solr - Enterprise what?
Multicore
Replication
Distributed search
...
Solr - Multi core

• Define multiple Solr cores inside a single Solr instance


• Each core maintains its own index


• Unified administration interface


• Runtime commands to create, swap, load, unload, delete cores
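
Core administration goes through HTTP as well; for instance, assuming the default admin path and two hypothetical cores core0 and core1, a swap looks like http://localhost:8983/solr/admin/cores?action=SWAP&core=core0&other=core1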
Solr - Replication

• In case of high traffic it is useful to replicate a Solr instance and split the
  queries (possibly with a load balancer in front)


• The master has the original index


• Slaves poll the master asking for the latest version of the index


• If a slave has an older version of the index, it asks the master for the difference
  (rsync-like)


• In the meantime the indexes remain available
Solr - Distributed search

• When an index is too large, in terms of space or memory required, it can be
  useful to define two or more shards


• A shard is a Solr instance and can be searched or indexed independently


• At the same time it’s possible to query all the shards, with the results merged
  from the sub-results of each shard


• http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=category:information


• Note that the document distribution among the indexes is up to the user (or to
  whoever feeds the indexes)
One step beyond...

• Solr in the cloud


• Spatial search


• Solr & UIMA :-)
References

• http://lucene.apache.org/solr/


• http://lucene.apache.org/solr/tutorial.html


• http://wiki.apache.org/solr/FrontPage
