Elasticsearch
“war stories”
Well Hello There! I am Arno Broekhof
Data Engineer ( full stack ) @Dataworkz

Working with Elasticsearch since 2011

Dutch National Police
History of Elasticsearch
• Created by Shay Banon

• Compass

• Elasticsearch == Compass 3.0

• First release in February 2010

• Abstraction layer on top of Lucene
Present Day
24 Elasticsearch Clusters
441 Nodes
5477 GB RAM
343 TB Used Data
3798 Indices
Zen Discovery
discovery.zen.ping.multicast.enabled: true
• Elasticsearch nodes use multicast traffic for discovery

• Default setting in ES < 2.0; moved to a plugin in 2.x and removed in 5.0
Not a database
• Persistency
• Consistency
• Security
• SELECT * FROM pet WHERE name LIKE 'b%';
• Total amount of data < 512GB
Shard Sizing
“Too Many Shards or the Gazillion Shards Problem”
• A shard is a Lucene index under the covers, which uses file handles, memory, and CPU cycles.

• Every search request needs to hit a copy of every shard in the index. That’s fine if every shard is
sitting on a different node, but not if many shards have to compete for the same resources.

• Term statistics, used to calculate relevance, are per shard. Having a small amount of data in many
shards leads to poor relevance.
How many shards?
• 1,000,000 documents

• Index of 256GB

• 6 nodes

• 1 node has 8 cores and 30GB Heap
256 GB / (80% of one node's 30 GB heap = 24 GB) = +/- 10 shards
curl -XGET http://localhost:9200/_cat/indices
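The rule of thumb on this slide can be written out as a small calculation. This is only a sketch of the arithmetic shown above, not an official sizing formula; the class and method names are hypothetical, and rounding up is an assumption.

```java
// Sketch of the slide's rule of thumb: index size divided by 80% of one node's heap.
final class ShardSizing {

    /** Hypothetical helper: estimates a shard count, rounding up to whole shards. */
    static int estimateShards(double indexSizeGb, double heapPerNodeGb, double heapFraction) {
        if (indexSizeGb <= 0 || heapPerNodeGb <= 0 || heapFraction <= 0) {
            throw new IllegalArgumentException("all inputs must be positive");
        }
        // Target size of a single shard, e.g. 30 GB * 0.8 = 24 GB.
        double targetShardSizeGb = heapPerNodeGb * heapFraction;
        return (int) Math.ceil(indexSizeGb / targetShardSizeGb);
    }

    public static void main(String[] args) {
        // Numbers from the slide: 256 GB index, 30 GB heap per node, 80% of the heap.
        System.out.println("estimated shards: " + estimateShards(256, 30, 0.8));
    }
}
```

With the slide's numbers this gives 11, in line with the "+/- 10 shards" estimate.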
Disable _source field
Think twice: without the _source field you lose
• The update, update_by_query, and reindex APIs.

• On-the-fly highlighting.

• The ability to reindex from one Elasticsearch index to another, either to change mappings or analysis, or to upgrade an index to a new major version.

• The ability to debug queries or aggregations by viewing the original document used at index time.

• Potentially in the future, the ability to repair index corruption automatically.
How many indices?
“remember that there is no rule that limits
your application to using only a single index.”
Dynamic Mappings
• Not everything needs to be searchable
"avatarLink": {
  "type": "string",
  "index": "not_analyzed",
  "doc_values": true
},
• Use Explicit Mapping when possible
{
  "job": "Some job description",
  "date": "1-10-2017"
}
{
  "job": "Some job description",
  "date": "NO_DATE"
}
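The two documents above differ only in the date value. With an explicit date mapping, a placeholder such as "NO_DATE" would typically be rejected at index time, so one option is to check the value before sending the document. The sketch below assumes the date is written day-month-year; the class and method names are hypothetical and not part of the original talk.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Sketch: validate a date string before indexing it into a date field.
final class DateFieldGuard {

    // Assumption: "1-10-2017" means 1 October 2017 (day-month-year).
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("d-M-yyyy");

    /** Returns the parsed date, or an empty Optional for values such as "NO_DATE". */
    static Optional<LocalDate> parseDate(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(LocalDate.parse(raw, FORMAT));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDate("1-10-2017").isPresent()); // a real date
        System.out.println(parseDate("NO_DATE").isPresent());   // a placeholder
    }
}
```

A caller could leave the date field out of the document when the Optional is empty, rather than indexing the placeholder.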
Where is my memory?
{
  "aggs": {
    "players": {
      "terms": {
        "field": "players",
        "size": 10
      },
      "aggs": {
        "other": {
          "terms": {
            "field": "players",
            "size": 5
          }
        }
      }
    }
  }
}
• The aggregation will return a list of the top 10 players and a list of the top five supporting players for each top player

• 50 results

• Minimal effort, Maximum memory
Where is my memory?
{
  "aggs": {
    "players": {
      "terms": {
        "field": "players",
        "size": 10,
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "other": {
          "terms": {
            "field": "players",
            "size": 5
          }
        }
      }
    }
  }
}
• Use collect_mode breadth_first where possible

• Trims one level at a time

• Minimal change, Maximum performance
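To get a feel for why trimming one level at a time matters, the rough worst-case count of child buckets can be compared for both modes. This is a simplified model, not how Elasticsearch accounts for memory internally: it assumes every player can co-occur with every other player, and the class and method names are hypothetical.

```java
// Sketch: worst-case number of child buckets built before the final 10 x 5 result is returned.
final class CollectModeEstimate {

    /** Depth-first: child buckets are built under every top-level bucket, then pruned. */
    static long depthFirstChildBuckets(long distinctPlayers) {
        return Math.multiplyExact(distinctPlayers, distinctPlayers);
    }

    /** Breadth-first: the top level is pruned to topSize first, then children are built. */
    static long breadthFirstChildBuckets(long distinctPlayers, int topSize) {
        return Math.multiplyExact((long) topSize, distinctPlayers);
    }

    public static void main(String[] args) {
        long distinctPlayers = 1_000_000L; // assumed cardinality of the "players" field
        System.out.println("depth first:   " + depthFirstChildBuckets(distinctPlayers));
        System.out.println("breadth first: " + breadthFirstChildBuckets(distinctPlayers, 10));
    }
}
```

Under these assumptions the depth-first count grows with the square of the field's cardinality, while the breadth-first count grows linearly.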
Where is my data?
public void insert(final JsonArray jsonArray) {
    if (jsonArray.size() == 0) {
        return;
    }
    BulkRequestBuilder bulkRequestBuilder = transportClient.prepareBulk();
    this.setEsRefreshInterval("-1");
    jsonArray.forEach(e -> {
        String id = e.getAsJsonObject().get("name").toString();
        bulkRequestBuilder.add(transportClient.prepareIndex(configuration.getEsIndex(),
                configuration.getEsTypeName()).setSource(e.toString(), XContentType.JSON).setId(id));
    });
    BulkResponse bulkResponse = bulkRequestBuilder.get();
    LOGGER.debug("bulk inserted {} items took: {} with failures: {}",
            bulkResponse.getItems().length, bulkResponse.getTook(), bulkResponse.hasFailures());
}
Where is my data?
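public void insert(final JsonArray jsonArray) {
    if (jsonArray.size() == 0) {
        return;
    }
    BulkRequestBuilder bulkRequestBuilder = transportClient.prepareBulk();
    // Refresh is switched off for the bulk load; nothing becomes searchable until it is switched back on.
    this.setEsRefreshInterval("-1");
    try {
        jsonArray.forEach(e -> {
            // getAsString() returns the bare value; toString() would keep the JSON quotes in the id.
            String id = e.getAsJsonObject().get("name").getAsString();
            bulkRequestBuilder.add(transportClient.prepareIndex(configuration.getEsIndex(),
                    configuration.getEsTypeName()).setSource(e.toString(), XContentType.JSON).setId(id));
        });
        BulkResponse bulkResponse = bulkRequestBuilder.get();
        LOGGER.debug("bulk inserted {} items took: {} with failures: {}",
                bulkResponse.getItems().length, bulkResponse.getTook(), bulkResponse.hasFailures());
        if (bulkResponse.hasFailures()) {
            // Failures are per item; a debug line alone is easy to miss.
            LOGGER.error("bulk insert failures: {}", bulkResponse.buildFailureMessage());
        }
    } finally {
        // Assumption: "1s" (the Elasticsearch default) is the interval this index should return to.
        this.setEsRefreshInterval("1s");
    }
}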
Query or Filter?
Queries -> should be used when performing a full-text search, or when scoring of results is required (think search results ranked by relevancy).

Filters -> are generally faster than queries, mainly because they don't score the results and can be cached. If you just want to return all of the products that are blue, or that cost more than €50, use filters!
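As an illustration of the product example, a search body can put its clauses in the filter section of a bool query so that no scoring takes place. This is a minimal sketch that only builds the JSON string: the field names "color" and "price" are hypothetical, both conditions are combined (blue and more than 50), and the bool/filter syntax assumes Elasticsearch 2.0 or later.

```java
// Sketch: build a bool query whose clauses all run in filter context (no scoring).
final class ProductFilterQuery {

    /**
     * Field names "color" and "price" are hypothetical.
     * The color value is not JSON-escaped, so only pass trusted input.
     */
    static String colorAndPricierThan(String color, int minPrice) {
        return "{"
            + "\"query\":{\"bool\":{\"filter\":["
            + "{\"term\":{\"color\":\"" + color + "\"}},"
            + "{\"range\":{\"price\":{\"gt\":" + minPrice + "}}}"
            + "]}}}";
    }

    public static void main(String[] args) {
        // Request body for "blue products that cost more than 50".
        System.out.println(colorAndPricierThan("blue", 50));
    }
}
```

The resulting string could be sent as the body of a _search request; a full-text clause that needs scoring would go into the bool query's must section instead.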
_type == _type
• Use unique types

• Why WordPress post_type == _type is a bad idea

• When deleting a post a document is identified both by its _id and _type
Search limits
• A search returns 10 hits by default (size)

• from + size is capped at 10,000 by default (index.max_result_window)

• If you want everything, use the scroll API
We have a distributed search engine, nodes can fail!
• We have shard replicas

• Single master

• Use dedicated masters
Slow recovery
curl -XPUT http://localhost:9200/_cluster/settings -d '{
  "transient" : {
    "cluster.routing.allocation.cluster_concurrent_rebalance" : "5",
    "cluster.routing.allocation.node_concurrent_recoveries" : "5",
    "cluster.routing.allocation.node_initial_primaries_recoveries" : "4",
    "indices.recovery.concurrent_streams" : "4",
    "indices.recovery.max_bytes_per_sec" : "200mb",
    "indices.store.throttle.max_bytes_per_sec" : "100mb"
  }
}'
What does the future bring?
• Java Transport Client is deprecated, REST is the way to go

• Cross Cluster Searches

• Index sorting during indexing

• Only one type per index

• Better use of transaction logs

• Sparse Doc Values
Questions?
