Elasticsearch War Stories

Well Hello There! I am Arno Broekhof
Data Engineer (full stack) @Dataworkz

Working with Elasticsearch since 2011

Dutch National Police

History of Elasticsearch
• Created by Shay Banon

• Compass

• Elasticsearch == Compass 3.0

• First release in February 2010

• Abstraction layer on top of Lucene

Present Day
24 Elasticsearch Clusters
441 Nodes
5477 GB RAM
343 TB Used Data
3798 Indices

Zen Discovery
discovery.zen.ping.multicast.enabled: true
• Elasticsearch nodes use multicast traffic for discovery

• Default setting in ES < 2.0 (moved to a plugin in 2.0, removed in 5.0)

Not a database
• Persistency
• Consistency
• Security
• SELECT * FROM pet WHERE name LIKE 'b%';
• Total amount of data < 512GB

Shard Sizing
“Too Many Shards or the Gazillion Shards Problem”
• A shard is a Lucene index under the covers, which uses file handles, memory, and CPU cycles.

• Every search request needs to hit a copy of every shard in the index. That’s fine if every shard is
sitting on a different node, but not if many shards have to compete for the same resources.

• Term statistics, used to calculate relevance, are per shard. Having a small amount of data in many
shards leads to poor relevance.

How many shards?
• 1,000,000 documents

• Index of 256GB

• 6 nodes

• Each node has 8 cores and 30GB heap

256GB / (80% of one node's 30GB heap = 24GB) ≈ 10.7, so roughly 10 shards

Check the current index sizes with:
  curl -XGET http://localhost:9200/_cat/indices
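
The rule of thumb on this slide can be written down as a small calculation. This is only a sketch of the slide's heuristic (index size divided by 80% of one node's heap), not an official Elasticsearch formula; the class and method names are made up for illustration.

```java
/** Sketch of the slide's rule of thumb; names here are hypothetical. */
final class ShardSizing {

    /**
     * Estimates a shard count by capping each shard at a fraction of one node's heap.
     *
     * @param indexSizeGb   total index size in GB (256 on the slide)
     * @param heapPerNodeGb heap of a single node in GB (30 on the slide)
     * @param heapFraction  share of the heap used as the target shard size (0.8 on the slide)
     */
    static int estimateShards(double indexSizeGb, double heapPerNodeGb, double heapFraction) {
        double targetShardGb = heapPerNodeGb * heapFraction; // 24 GB for the slide's numbers
        return (int) Math.ceil(indexSizeGb / targetShardGb); // round up so no shard exceeds the target
    }

    public static void main(String[] args) {
        System.out.println(estimateShards(256, 30, 0.8));
    }
}
```

Rounding up gives 11 for the slide's numbers, in the same ballpark as the "+/- 10 shards" quoted above. With 6 nodes that is about two primary shards per node before replicas.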

Disable _source field
Think twice. Without _source you lose:
• The update, update_by_query, and reindex APIs.

• On-the-fly highlighting.

• The ability to reindex from one Elasticsearch index to another,  
either to change mappings or analysis,  
or to upgrade an index to a new major version.

• The ability to debug queries or aggregations  
by viewing the original document used at index time.

• Potentially in the future, the ability to repair index corruption automatically.

How many indices?
“remember that there is no rule that limits
your application to using only a single index.”

Dynamic Mappings
• Not everything needs to be searchable
  "avatarLink": {
    "type": "string",
    "index": "not_analyzed",
    "doc_values": true
  },
• Use explicit mapping when possible
  {
    "job": "Some job description",
    "date": "1-10-2017"
  }
  {
    "job": "Some job description",
    "date": "NO_DATE"
  }

Where is my memory?
{
  "aggs": {
    "players": {
      "terms": {
        "field": "players",
        "size": 10
      },
      "aggs": {
        "other": {
          "terms": {
            "field": "players",
            "size": 5
          }
        }
      }
    }
  }
}
• The aggregation will return a list of the  
top 10 players and a list of the  
top five supporting players for each top player

• 50 results

• Minimal effort, Maximum memory

Where is my memory?
{
  "aggs": {
    "players": {
      "terms": {
        "field": "players",
        "size": 10,
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "other": {
          "terms": {
            "field": "players",
            "size": 5
          }
        }
      }
    }
  }
}
• Use collect_mode: breadth_first if possible

• Trims one level at a time

• Minimal change, Maximum performance
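
To get a feel for why the collect mode matters, here is a rough worst-case bucket count for both modes. It is a back-of-the-envelope model, not how Elasticsearch accounts for memory: it assumes every player can co-occur with every other player, and the number of distinct players (1,000 in the usage note) is a made-up figure. breadth_first also has to cache matching documents to replay them for the surviving buckets, which this model ignores.

```java
/** Rough worst-case bucket counts for a two-level terms aggregation; a simplified model. */
final class BucketEstimate {

    /** Default mode: the whole tree is built first, then pruned to the requested sizes. */
    static long depthFirst(long distinctPlayers) {
        // one bucket per player, plus up to one child bucket per (player, other player) pair
        return distinctPlayers + distinctPlayers * distinctPlayers;
    }

    /** breadth_first: the top level is pruned to topSize before child buckets are built. */
    static long breadthFirst(long distinctPlayers, int topSize) {
        // one bucket per player, then child buckets only under the surviving top players
        return distinctPlayers + (long) topSize * distinctPlayers;
    }

    /** Buckets that actually end up in the response. */
    static long returned(int topSize, int childSize) {
        return (long) topSize * childSize;
    }
}
```

With 1,000 distinct players this model gives up to 1,001,000 buckets in the default mode against 11,000 with breadth_first, for the same 50 returned results.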

Where is my data?
public void insert(final JsonArray jsonArray) {
    if (jsonArray.size() == 0) {
        return;
    }
    BulkRequestBuilder bulkRequestBuilder = transportClient.prepareBulk();
    this.setEsRefreshInterval("-1");
    jsonArray.forEach(e -> {
        String id = e.getAsJsonObject().get("name").toString();
        bulkRequestBuilder.add(
            transportClient.prepareIndex(configuration.getEsIndex(), configuration.getEsTypeName())
                .setSource(e.toString(), XContentType.JSON)
                .setId(id));
    });
    BulkResponse bulkResponse = bulkRequestBuilder.get();
    LOGGER.debug("bulk inserted {} items took: {} with failures: {}",
        bulkResponse.getItems().length, bulkResponse.getTook(), bulkResponse.hasFailures());
}
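
One likely reading of this slide: the refresh interval is set to "-1" (refresh disabled) before the bulk request and never set back, so newly indexed documents do not become visible to search. A minimal sketch of one way to guard against that, assuming a hypothetical RefreshControl interface standing in for the slide's setEsRefreshInterval helper; the "1s" value in the usage note is an assumed normal interval.

```java
/** Hypothetical stand-in for whatever updates index.refresh_interval on the cluster. */
interface RefreshControl {
    void setRefreshInterval(String interval);
}

/** Disables refresh for the duration of a bulk load and always restores it afterwards. */
final class BulkLoader {
    private final RefreshControl refresh;
    private final String normalInterval;

    BulkLoader(RefreshControl refresh, String normalInterval) {
        this.refresh = refresh;
        this.normalInterval = normalInterval;
    }

    void load(Runnable bulkInsert) {
        refresh.setRefreshInterval("-1");                // no refreshes while loading
        try {
            bulkInsert.run();                            // the actual bulk request goes here
        } finally {
            refresh.setRefreshInterval(normalInterval);  // restored even if the bulk call throws
        }
    }
}
```

Usage would look like new BulkLoader(control, "1s").load(() -> insert(jsonArray)). Two other things worth a look in the original method: toString() on a Gson JsonElement returns the JSON form of the value, so string ids keep their surrounding quotes (getAsString() avoids that), and hasFailures() is only logged at debug level.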

Query or Filter?
Queries -> use them for full-text search,
or whenever results need to be scored (think search results ranked by relevance).

Filters -> much faster than queries, mainly because they don't score the results.
If you just want to return all of the products that are blue,
or that cost more than €50, use filters!

_type == _type
• Use unique types

• Why WordPress post_type == _type is a bad idea

• When deleting a post, the document is identified by both its _id and its _type

Search limits
• By default a search returns 10 results

• from + size is capped at 10,000 (index.max_result_window)

• If you want everything, use the scroll API
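
A small sketch of the paging limit above: a request is accepted only while from + size stays within the result window. The class and method names are made up for illustration; 10 and 10,000 are the defaults mentioned on this slide.

```java
/** Checks a page request against the default search limits; names are hypothetical. */
final class ResultWindow {
    static final int DEFAULT_SIZE = 10;            // results returned when no size is given
    static final int MAX_RESULT_WINDOW = 10_000;   // default cap on from + size

    /** True if the page [from, from + size) can be served by a normal search request. */
    static boolean fitsInWindow(int from, int size) {
        return (long) from + size <= MAX_RESULT_WINDOW;
    }
}
```

fitsInWindow(9990, 10) is still fine, fitsInWindow(9991, 10) is not. Anything deeper than that is a job for the scroll API rather than ever larger from values.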

We have a distributed search engine, nodes can fail!
• We have shard replicas

• Single master

• Use dedicated masters

Slow recovery
curl -XPUT http://localhost:9200/_cluster/settings -d '{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": "5",
    "cluster.routing.allocation.node_concurrent_recoveries": "5",
    "cluster.routing.allocation.node_initial_primaries_recoveries": "4",
    "indices.recovery.concurrent_streams": "4",
    "indices.recovery.max_bytes_per_sec": "200mb",
    "indices.store.throttle.max_bytes_per_sec": "100mb"
  }
}'

What does the future bring?
• Java Transport Client is deprecated, REST is the way to go

• Cross Cluster Searches

• Index sorting during indexing

• Only one type can exist

• Better use of transaction logs

• Sparse Doc Values
