Hippo meetup: enterprise search with Solr and elasticsearch

15th January 2013 – Hippo meetup

Luca Cavanna
Software developer & Search consultant at Trifork Amsterdam

luca.cavanna@trifork.nl - @lucacavanna

Trifork (aka Jteam/Dutchworks/Orange11)

Focus areas:
– Big data & Search
– Mobile
– Custom solutions
– Knowledge (GOTO Amsterdam)

● Hippo partner

● Hippo related search projects:
– uva.nl
– working on rijksoverheid.nl

Agenda

● Search introduction
– Lucene foundation
– Why do we need Solr or elasticsearch?
● Scaling with Solr
● Elasticsearch distributed nature
● Elasticsearch features

Apache Lucene

● High-performance, full-featured text search engine
library written entirely in Java

● It indexes documents as collections of fields

● A field is a string based key-value pair

● What data structure does it use under the hood?

Inverted index

Documents:
1  The old night keeper keeps the keep in the town
2  In the big old house in the big old gown.
3  The house in the town had the big old keep
4  Where the old night keeper never did sleep.
5  The night keeper keeps the keep in the night
6  And keeps in the dark and sleeps in the light.

term     freq   posting list
and      1      6
big      2      2 3
dark     1      6
did      1      4
gown     1      2
had      1      3
house    2      2 3
in       5      1 2 3 5 6
keep     3      1 3 5
keeper   3      1 4 5
keeps    3      1 5 6
light    1      6
never    1      4
night    3      1 4 5
old      4      1 2 3 4
sleep    1      4
sleeps   1      6
the      6      1 2 3 4 5 6
town     2      1 3
where    1      4

Inverted index

● Indexing
– Text analysis
● Tokenization, lowercasing and more

● The inverted index can contain more data
– Term offsets and more

● The inverted index itself doesn't contain the text for
displaying the search results
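
To make the data structure concrete, here is a minimal sketch of an inverted index in plain Java. It is not Lucene code: the class name `InvertedIndexSketch` and its methods are made up for illustration, and the "analysis" is reduced to lowercasing and splitting on non-letters. It only shows how terms map to posting lists of document ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index (illustration only, not how Lucene stores data on disk):
// maps each term to the sorted list of ids of the documents that contain it.
final class InvertedIndexSketch {

    static final List<String> DOCS = Arrays.asList(
            "The old night keeper keeps the keep in the town",
            "In the big old house in the big old gown.",
            "The house in the town had the big old keep",
            "Where the old night keeper never did sleep.",
            "The night keeper keeps the keep in the night",
            "And keeps in the dark and sleeps in the light.");

    // Simplified text analysis: lowercase, then split on anything that is not a letter.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Documents are numbered from 1 in the order given.
    static Map<String, List<Integer>> build(List<String> documents) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int i = 0; i < documents.size(); i++) {
            int docId = i + 1;
            for (String term : analyze(documents.get(i))) {
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                // Ids arrive in increasing order, so checking the last entry avoids duplicates.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = build(DOCS);
        System.out.println(index.get("big"));  // [2, 3]
        System.out.println(index.get("keep")); // [1, 3, 5]
    }
}
```

Answering a one-term query is then a single map lookup, which is why the original text has to be kept separately (as stored fields) if it needs to be displayed.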

Indexing

● Lucene writes indexes as segments
● Segments are not modifiable: Write-Once
● Each segment is a searchable mini index

● Each segment contains
– Inverted index
– Stored fields
– ...and more

Indexing: the commit operation

● Documents are searchable only after a commit!

● A commit also gives durability

● The most expensive operation in Lucene!!!

Near-real-time search (since Lucene 2.9, exposed in Solr 4.0)

● With the Lucene near-real-time API you don't need a
commit to make new documents searchable

● Less expensive than commit

● Doesn't guarantee durability though

● Exposed as soft commit in Solr 4.0
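
The trade-off between a commit and a near-real-time reopen can be pictured with a toy model. The sketch below is not Lucene or Solr code: `NearRealTimeSketch` and its methods are hypothetical names, and it only tracks which documents are searchable and which would survive a crash.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Lucene code): tracks which documents are searchable
// and which ones would survive a crash.
final class NearRealTimeSketch {
    private final List<String> buffered = new ArrayList<>();   // added, not yet visible
    private final List<String> searchable = new ArrayList<>(); // visible to searches
    private final List<String> durable = new ArrayList<>();    // safely on disk

    void addDocument(String doc) {
        buffered.add(doc);
    }

    // Near-real-time reopen ("soft commit" in Solr 4.0): cheap, makes documents
    // searchable, but does not make them durable.
    void softCommit() {
        searchable.addAll(buffered);
        buffered.clear();
    }

    // Full commit: makes documents both searchable and durable.
    void commit() {
        softCommit();
        durable.clear();
        durable.addAll(searchable);
    }

    // Simulated crash: everything that was not committed is lost.
    void crash() {
        buffered.clear();
        searchable.clear();
        searchable.addAll(durable);
    }

    boolean isSearchable(String doc) {
        return searchable.contains(doc);
    }

    public static void main(String[] args) {
        NearRealTimeSketch index = new NearRealTimeSketch();
        index.addDocument("doc-1");
        System.out.println(index.isSearchable("doc-1")); // false: nothing reopened yet
        index.softCommit();
        System.out.println(index.isSearchable("doc-1")); // true: visible after the soft commit
        index.crash();
        System.out.println(index.isSearchable("doc-1")); // false: it was never committed
    }
}
```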

Lucene code example – indexing data

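import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Open (or create) an index in the "data" directory
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
        new StandardAnalyzer(Version.LUCENE_40));
Directory directory = FSDirectory.open(new File("data"));
IndexWriter writer = new IndexWriter(directory, config);

Document document = new Document();

// id: indexed as a single token (not analyzed) and stored
FieldType idFieldType = new FieldType();
idFieldType.setIndexed(true);
idFieldType.setStored(true);
idFieldType.setTokenized(false);
document.add(new Field("id", "id-1", idFieldType));

// title: analyzed, indexed and stored
FieldType titleFieldType = new FieldType();
titleFieldType.setIndexed(true);
titleFieldType.setStored(true);
document.add(new Field("title", "This is the title", titleFieldType));

// description: analyzed and indexed but not stored (searchable, not retrievable)
FieldType descriptionFieldType = new FieldType();
descriptionFieldType.setIndexed(true);
document.add(new Field("description", "This is the description", descriptionFieldType));

writer.addDocument(document);

// close() commits the pending changes and releases the write lock
writer.close();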

Lucene code example – querying and showing results

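import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.Version;

// queryAsString holds the user's query; "title" is the default search field
QueryParser queryParser = new QueryParser(Version.LUCENE_40, "title",
        new StandardAnalyzer(Version.LUCENE_40));
Query query = queryParser.parse(queryAsString);

Directory directory = FSDirectory.open(new File("data"));
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
TopDocs topDocs = indexSearcher.search(query, 10); // top 10 hits

System.out.println("Total hits: " + topDocs.totalHits);

for (ScoreDoc hit : topDocs.scoreDocs) {
    // Only stored fields come back with the document
    Document document = indexSearcher.doc(hit.doc);
    for (IndexableField field : document) {
        System.out.println(field.name() + ": " + field.stringValue());
    }
}
indexReader.close();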

What's missing?

● A common way to represent documents
● An interface to send documents to (HTTP)
● A way to represent queries
● An interface to send queries to (HTTP)
● Configuration
● Caching
● Distributed infrastructure
● And more....

Scaling – why?

‣ The more concurrent searches you run, the slower they
get

‣ Indexing and searching on the same machine will
substantially harm search performance

‣ Segment merging can be a CPU- and I/O-intensive
operation

‣ Disk cache invalidation

‣ Failover

Solr replication (pull approach)

• Master-slave based solution
• Single machine for indexing data (master)
• Multiple machines for querying (slaves)
• Master is not aware of the slaves
• Slave is aware of the master
• Load balancer responsible for balancing the query
requests

• What about real-time search? No way!
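
The pull approach can be sketched as a slave that periodically compares its index version with the master's and copies the index when it is behind. This is a toy model with made-up classes (`Master`, `Slave`, `poll()`), not Solr's replication code, but it shows why a newly indexed document is not searchable on a slave until the next poll.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of pull replication (hypothetical classes, not Solr code).
final class PullReplicationSketch {

    static final class Master {
        private final List<String> documents = new ArrayList<>();
        private long indexVersion = 0;

        void index(String doc) {
            documents.add(doc);
            indexVersion++; // every change produces a new index version
        }

        long version() {
            return indexVersion;
        }

        List<String> snapshot() {
            return new ArrayList<>(documents);
        }
    }

    static final class Slave {
        private final Master master; // the slave knows the master, not the other way around
        private List<String> documents = new ArrayList<>();
        private long localVersion = 0;

        Slave(Master master) {
            this.master = master;
        }

        // Called periodically; copies the index only if the master has a newer version.
        boolean poll() {
            if (master.version() > localVersion) {
                documents = master.snapshot();
                localVersion = master.version();
                return true;
            }
            return false;
        }

        boolean canFind(String doc) {
            return documents.contains(doc);
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Slave slave = new Slave(master);
        master.index("doc-1");
        System.out.println(slave.canFind("doc-1")); // false: the slave has not polled yet
        slave.poll();
        System.out.println(slave.canFind("doc-1")); // true: visible after the poll
    }
}
```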

SolrCloud

• A set of new distributed capabilities in Solr
• Uses Apache ZooKeeper as a system of record for
the cluster state, for central configuration, and for
leader election

• Whichever server (shard) you send data to:
• the documents get distributed over the shards
• A shard can be a leader or a replica and contains a
subset of the data

• Easily scale out by adding new Solr nodes

elasticsearch

● Distributed search engine built on top of Lucene
● Apache 2 license
● Written in Java
● RESTful
● Created and mainly developed by Shay Banon
● A company behind it: elasticsearch.com
● Regular releases
– Latest release 0.20.2

elasticsearch

● Schemaless
– Uses defaults and automatic type guessing
– Custom mappings may be defined if needed
● JSON oriented
● Multi tenancy
– Multiple indexes per node, multiple types per index
● Designed to be distributed from the beginning
● Almost everything is available as API (including
configuration)
● Wide range of administration APIs

elasticsearch distributed terminology

● Node: a running instance of elasticsearch which belongs
to a cluster (usually one node per server)
● Cluster: one or more nodes with the same cluster name
● Shard: a single Lucene instance. A low-level worker unit
managed by elasticsearch. An index is split into one or
more shards.
● Index: a logical namespace which points to one or more
shards
– Your code won't deal directly with a shard, only with
an index
– But an index is composed of multiple Lucene indexes
(one per shard)

elasticsearch distributed terminology

● More shards:
– improve indexing performance
– increase data distribution (depends on # of nodes)
– Watch out: each shard has a cost as well!

● More replicas:
– improve failover
– improve querying performance

Transaction Log

• Indexed docs are fully persistent
• No need for a Lucene IndexWriter#commit
• Managed using a transaction log / WAL
• Full single-node durability (survives kill -9)
• Utilized when doing hot relocation of shards
• Periodically “flushed” (calling IndexWriter#commit)
• Durability and real time search together!
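
A minimal sketch of the idea, assuming a simple append-only log: every operation is written to the log as well as applied in memory, a flush performs the expensive commit and starts an empty log, and recovery replays the log on top of the last commit. The class `TransactionLogSketch` is hypothetical and is not elasticsearch's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Toy write-ahead log in front of an index (illustration only).
final class TransactionLogSketch {
    private final List<String> committed = new ArrayList<>(); // survives a crash (last commit)
    private final List<String> translog = new ArrayList<>();  // survives a crash (append-only log)
    private final List<String> inMemory = new ArrayList<>();  // lost on a crash

    // Each operation is appended to the log as well as applied in memory.
    void index(String doc) {
        translog.add(doc);
        inMemory.add(doc);
    }

    // Periodic flush: do the expensive commit, then start with an empty log.
    void flush() {
        committed.addAll(inMemory);
        inMemory.clear();
        translog.clear();
    }

    // Crash and restart: memory is gone, the log is replayed on top of the last commit.
    void crashAndRecover() {
        inMemory.clear();
        inMemory.addAll(translog);
    }

    List<String> documents() {
        List<String> all = new ArrayList<>(committed);
        all.addAll(inMemory);
        return all;
    }

    public static void main(String[] args) {
        TransactionLogSketch shard = new TransactionLogSketch();
        shard.index("doc-1");
        shard.flush();
        shard.index("doc-2"); // not committed yet, only in the log
        shard.crashAndRecover();
        System.out.println(shard.documents()); // [doc-1, doc-2]
    }
}
```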

Index - Shards & Replicas

Node                 Node

Client:
curl -XPUT localhost:9200/hippo -d '
{
    "index" : {
        "number_of_shards" : 2,
        "number_of_replicas" : 1
    }
}'

Index - Shards & Replicas

Node                 Node
Shard 0 (primary)    Shard 0 (replica)
Shard 1 (replica)    Shard 1 (primary)

Client:
{
    "index" : {
        "number_of_shards" : 2,
    }
}'

Indexing - 1

• Automatic sharding, push replication
Node                 Node
Shard 0 (primary)    Shard 0 (replica)
Shard 1 (replica)    Shard 1 (primary)

Client:
curl -XPUT localhost:9200/hippo/users/1 -d '
{
    "name" : {
        "first" : "Luca",
        "last" : "Cavanna"
    }
}'

Indexing - 2

Node                 Node
Shard 0 (primary)    Shard 0 (replica)
Shard 1 (replica)    Shard 1 (primary)

Client:
curl -XPUT localhost:9200/hippo/users/2 -d '
{
    "name" : {
        "first" : "Jeroen",
        "last" : "Reijn"
    }
}'
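
Automatic sharding boils down to a deterministic mapping from a document id to a shard number. The sketch below assumes the common hash-modulo scheme; `String.hashCode()` is only a stand-in for whatever hash function elasticsearch uses internally, and `ShardRoutingSketch` is a made-up name.

```java
// Toy routing function (illustration only).
final class ShardRoutingSketch {

    // Assumption: String.hashCode() stands in for elasticsearch's own hash function.
    static int shardFor(String documentId, int numberOfShards) {
        // floorMod keeps the result in [0, numberOfShards) even for negative hash codes.
        return Math.floorMod(documentId.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        int numberOfShards = 2;
        System.out.println("user 1 -> shard " + shardFor("1", numberOfShards)); // shard 1
        System.out.println("user 2 -> shard " + shardFor("2", numberOfShards)); // shard 0
    }
}
```

Because the shard number depends on the number of shards, changing that number would send existing ids to different shards, which is why the shard count is fixed when the index is created while the replica count can change later.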

Search - 1

• Scatter / Gather search
Node                 Node
Shard 0 (primary)    Shard 0 (replica)
Shard 1 (replica)    Shard 1 (primary)

Client:
curl -XGET 'localhost:9200/hippo/_search?q=luca'
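
The gather half of scatter/gather can be sketched as a merge of per-shard result lists: each shard returns its own best hits, and the node that received the request keeps the best ones overall. The `Hit` class, the scores and the `merge` method below are illustrative assumptions, not elasticsearch internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy gather step (illustration only).
final class ScatterGatherSketch {

    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Merge the per-shard top hits and keep the best `size` hits overall.
    static List<Hit> merge(List<List<Hit>> shardResults, int size) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> shardHits : shardResults) {
            all.addAll(shardHits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(size, all.size())));
    }

    public static void main(String[] args) {
        // Made-up results from two shards
        List<Hit> shard0 = Arrays.asList(new Hit("1", 0.9), new Hit("3", 0.2));
        List<Hit> shard1 = Arrays.asList(new Hit("2", 0.5));
        for (Hit hit : merge(Arrays.asList(shard0, shard1), 2)) {
            System.out.println(hit.id + " " + hit.score);
        }
    }
}
```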

Search - 2

• Automatic balancing between replicas
Node                 Node
Shard 0 (primary)    Shard 0 (replica)
Shard 1 (replica)    Shard 1 (primary)

Client


Search - 3

• Automatic failover
Node                 Node
Shard 0 (primary)    Shard 0 (replica)
Shard 1 (replica)    failure    Shard 1 (primary)

Client


Adding a node

• “Hot” reallocation of shards to the new node

Node                 Node
Shard 0 (primary)    Shard 0 (replica)
Shard 1 (replica)    Shard 1 (primary)

Adding a node


Node                 Node                 Node
Shard 0 (primary)    Shard 0 (replica)
Shard 1 (replica)    Shard 1 (primary)

Adding a node


Node                 Node                 Node
Shard 0 (primary)    Shard 0 (replica)    Shard 0 (replica)
Shard 1 (replica)    Shard 1 (primary)

Node failure

Node                 Node                 Node
Shard 0 (primary)    Shard 0 (replica)
Shard 1 (replica)    Shard 1 (primary)

Node failure - 1

• Replicas can automatically become primaries

Node                 Node
Shard 0 (primary)
Shard 1 (primary)
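
Replica promotion can be pictured with a toy cluster state that keeps, for each shard, the list of nodes holding a copy, with the first entry acting as primary. The names (`FailoverSketch`, `node-1`, `node-2`) are hypothetical, and the real allocation logic in elasticsearch is considerably more involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy cluster state (illustration only): for each shard, the nodes holding a copy.
// The first entry in each list is treated as the primary.
final class FailoverSketch {
    private final Map<Integer, List<String>> copies = new LinkedHashMap<>();

    void assign(int shard, String primaryNode, String... replicaNodes) {
        List<String> nodes = new ArrayList<>();
        nodes.add(primaryNode);
        for (String replica : replicaNodes) {
            nodes.add(replica);
        }
        copies.put(shard, nodes);
    }

    // Dropping a failed node promotes the next copy, if any,
    // because the primary is simply the first remaining entry.
    void nodeFailed(String node) {
        for (List<String> nodes : copies.values()) {
            nodes.remove(node);
        }
    }

    // Returns null when no copy of the shard is left.
    String primaryOf(int shard) {
        List<String> nodes = copies.get(shard);
        return nodes.isEmpty() ? null : nodes.get(0);
    }

    public static void main(String[] args) {
        FailoverSketch cluster = new FailoverSketch();
        cluster.assign(0, "node-1", "node-2");
        cluster.assign(1, "node-2", "node-1");
        cluster.nodeFailed("node-2");
        System.out.println(cluster.primaryOf(0)); // node-1 (unchanged)
        System.out.println(cluster.primaryOf(1)); // node-1 (promoted replica)
    }
}
```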

Node failure - 2

• Shards are automatically assigned and do “hot”
recovery

Node                 Node
Shard 0 (replica)    Shard 0 (primary)
Shard 1 (primary)    Shard 1 (replica)

Dynamic Replicas

Node                 Node                 Node
Shard 0 (primary)    Shard 0 (replica)

Client:
{
    "index" : {
        "number_of_shards" : 1,
    }
}'

Dynamic Replicas

Node                 Node                 Node
Shard 0 (primary)    Shard 0 (replica)    Shard 0 (replica)

Client:
{
    "index" : {
        "number_of_replicas" : 2
    }
}'

Indexing (Push) - ElasticSearch

• Documents added through push requests

• Full JSON Object representation of Documents supported

• Embedded objects

• First-class parent/child and versioning support

• Near Realtime index refreshing available

• Realtime get supported

{
    "name": "Luca Cavanna",
    "location": {
        "city": "Amsterdam",
        "country": "The Netherlands"
    }
}

Indexing (Pull) - ElasticSearch

• Data flows from sources using ‘Rivers’

• Continues to add data as it ‘flows’

• Can be added, removed, configured dynamically

• Out-of-the-box support for CouchDB, Twitter (implemented by the
elasticsearch team)

• Community implementations for DBs, other NoSQL and Solr

Searching - ElasticSearch

• Search request in Request Body

• Powerful and extensible Query DSL

• Separation of Query and Filters

• Named filters allow tracking which documents matched which
filters

• The source of each document is stored by default (_source field)

• Catch-all field enabled by default (_all field)

• Sorting of results

• Highlighting, Faceting, Boosting...and more

Search Example - ElasticSearch

Request:

$ curl -XGET 'http://localhost:9200/hippo/users/_search' -d '
{
    "query" : {
        "term" : { "first_name" : "luca" }
    }
}'

Response:

{
    "_shards": {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    },
    "hits": {
        "total" : 1,
        "hits" : [
            {
                "_index" : "hippo",
                "_type" : "users",
                "_id" : "1",
                "_source" : {
                    "first_name" : "Luca",
                    "last_name" : "Cavanna"
                }
            }
        ]
    }
}

Thanks

There is a lot more to say:
• Query DSL

• Scripting module (pluggable implementation)

• Percolator

• Running it embedded

Check them out yourself if you are interested!

Questions?
