The Loggly service uses Elasticsearch as the search engine underpinning much of our core functionality. Log management imposes tough requirements on search technology. Boiled down, it must be able to:
• Reliably perform near real-time indexing at huge scale – in our case, more than 100,000 log events per second
• Simultaneously handle high search volumes on the same index with solid performance and efficiency
When we were building our Gen2 log management service, we wanted to be sure that we were setting all of Elasticsearch’s configurations in the way that would deliver maximum performance for both indexing and search. Unfortunately, we found it very difficult to find this information in the Elasticsearch documentation because it’s not located in one place. This deck summarizes our learnings and can serve as a checklist of configuration properties you can reference to optimize ES for your application.
Get even more tips and insight on our full blog post → http://bit.ly/NineTipsOnES
Elasticsearch Tip #1
Know Your Deployment Topology
Before You Set Configs
• Loggly is running ES 0.90.13 with separate master and data nodes.
• In addition, we use the ES node client to talk to the data nodes. This makes
the data nodes transparent to our application; all it needs to care about is
talking to the node client. You designate each ES node as a data or master
node using two boolean properties. For example, to make an
Elasticsearch node a dedicated data node, you set: node.master: false and node.data: true
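The role split above can be sketched in elasticsearch.yml like this (a minimal sketch; the exact split across machines depends on your own topology):

```yaml
# elasticsearch.yml on a dedicated DATA node:
# holds and serves data, but is never eligible to become master.
node.master: false
node.data: true

# elasticsearch.yml on a dedicated MASTER node would invert both flags:
#   node.master: true
#   node.data: false
```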
Elasticsearch Tip #2
mlockall Offers the Biggest Bang for
the Performance Efficiency Buck
• Linux divides its physical RAM into chunks of memory called pages.
Swapping is the process whereby a page of memory is copied to the
preconfigured space on the hard disk, called swap space, to free up that
page of memory. The combined size of the physical memory and the swap
space is the amount of virtual memory available.
• Swapping does have a downside. Compared to memory, disks are very slow.
• The mlockall property in ES prevents the ES node's memory from being swapped. This
property can be set in the yaml file as follows:
bootstrap.mlockall: true
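As a sketch, the setting sits in elasticsearch.yml; note that the operating system must also permit the ES user to lock memory, and the JVM heap should fit comfortably in physical RAM for the lock to succeed:

```yaml
# elasticsearch.yml: lock the process memory into RAM so the JVM heap
# can never be swapped out to disk.
# Prerequisites (outside this file): the OS must allow memory locking for
# the ES user (e.g. "ulimit -l unlimited" or an entry in
# /etc/security/limits.conf), and the heap must fit in physical RAM.
bootstrap.mlockall: true
```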
Elasticsearch Tip #3
discovery.zen Properties Control the
Discovery Protocol for Elasticsearch
• Zen discovery is the protocol Elasticsearch uses for the nodes in a cluster to discover and communicate with each other. Zen
discovery is controlled by the discovery.zen.* properties. Both unicast and multicast are available as part of the discovery
protocol.
• The discovery.zen.minimum_master_nodes property controls the minimum number of master-eligible nodes that a node must “see” in
order to operate within the cluster. It’s recommended that you set it higher than 1 when running more than 2 nodes
in the cluster.
• Data and master nodes detect each other in two different ways:
– The master pings all other nodes in the cluster to verify they are up and running
– All other nodes ping the master node to verify that it is up and running or whether an election process needs to be
initiated
• The node detection process is controlled by the discovery.zen.fd.ping_timeout property, which determines how long a node
will wait for a response; the default value is 30s. This property should be adjusted if you are operating on a slow or congested
network: the higher the value, the smaller the chance of a spurious discovery failure.
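Putting the properties from this tip together, a sketch of the discovery section of elasticsearch.yml for a cluster with three master-eligible nodes (the hostnames are placeholders):

```yaml
# elasticsearch.yml: zen discovery settings for 3 master-eligible nodes.
discovery.zen.minimum_master_nodes: 2        # quorum of 3 master-eligible nodes
discovery.zen.fd.ping_timeout: 30s           # raise on slow/congested networks
discovery.zen.ping.multicast.enabled: false  # use explicit unicast instead
discovery.zen.ping.unicast.hosts: ["master1", "master2", "master3"]
```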
Elasticsearch Tip #4
Watch Out for delete_all_indices!
• It’s really important to know that the ES HTTP API has essentially no
authentication built into it. A single curl command can delete all of your indices
and lose all of your data.
• This is just one example of a command that could cause a mistaken deletion:
curl -XDELETE 'http://localhost:9200/*/'
• To avoid this type of grief, you can set the following property:
action.disable_delete_all_indices: true
– With this set, the command above will not delete any indices and will
instead return an error.
Elasticsearch Tip #5
Field Data Caching Can Cause
Extremely Slow Facet Searches
• The field data cache is used mainly when sorting or faceting on a field. It loads all
the field values into memory in order to provide fast document-based access to those
values.
• You need to keep in mind that not setting this value properly can cause:
– Facet searches and sorting to have very poor performance
– The ES node to run out of memory if you run the facet query against a large index
• An example: indices.fielddata.cache.size: 25%
• The key to setting this value correctly is understanding what kinds of facet searches your application performs.
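As an elasticsearch.yml sketch (the 25% figure is illustrative; tune it against your own facet workload and heap size):

```yaml
# elasticsearch.yml: bound the field data cache so faceting or sorting on a
# large index cannot exhaust the heap. Unbounded is the default, which is
# exactly the out-of-memory risk described above.
indices.fielddata.cache.size: 25%
```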
Elasticsearch Tip #6
Optimizing Index Requests
• At Loggly, we built our own index management system since the nature of log management means
that we have frequent updates and mapping changes. This index manager’s responsibility is to
manage indices on our ES cluster. It detects when the index needs to be created or closed based on
the configured policies. There are many policies in the index manager. For example, if the index
grows beyond a certain size or lives for more than a certain time, the index manager will close the
index and create a new one.
• When the index manager sends a node an index request to process, the node updates its own
mapping and then sends that mapping to the master. While the master processes it, that node
receives a cluster state that includes an older version of the mapping. A conflict here is not harmful
(the cluster state will eventually contain the correct mapping), but by default the node sends a
refresh-mapping request back to the master just in case. To make index requests more efficient, we have set this
property on our data nodes: indices.cluster.send_refresh_mapping: false
Elasticsearch Tip #7
Navigating Elasticsearch’s
Allocation-related Properties
• Shard allocation is the process of allocating shards to nodes. This can happen during initial
recovery, replica allocation, or rebalancing. Or it can happen when handling nodes that are
being added or removed.
• The cluster.routing.allocation.cluster_concurrent_rebalance property determines the number of
shards allowed for concurrent rebalance. This property needs to be set appropriately
depending on the hardware being used, for example the number of CPUs, IO capacity, etc. If
this property is not set appropriately, it can impact the performance of ES indexing.
• cluster.routing.allocation.cluster_concurrent_rebalance: 2
By default the value is set at 2, meaning that at any point in time only 2 shards are allowed
to be moving. It is good to set this property low so that the rebalance of shards is throttled
and doesn’t affect indexing.
Elasticsearch Tip #8
Recovery Properties Allow for
Faster Restart Times
• ES includes several recovery properties which improve both Elasticsearch cluster recovery
and restart times. The value that will work best for you depends on the hardware you have
in use, and the best advice we can give is to test, test, and test again.
• cluster.routing.allocation.node_concurrent_recoveries: 4
This property sets how many shards per node are allowed to recover at any moment in time.
Recovering shards is a very IO-intensive operation, so you should set this value with real
caution.
• cluster.routing.allocation.node_initial_primaries_recoveries: 18
This controls the number of primary shards initialized concurrently on a single node. The
number of parallel streams used to transfer data from a peer node when recovering a shard is
controlled by indices.recovery.concurrent_streams.
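The recovery properties above can be sketched together in elasticsearch.yml (the concurrent_streams value below is illustrative, not from the deck; test all three against your own hardware):

```yaml
# elasticsearch.yml: recovery throttles. The first two values are the ones
# quoted above; all three should be validated against your CPUs and IO capacity.
cluster.routing.allocation.node_concurrent_recoveries: 4
cluster.routing.allocation.node_initial_primaries_recoveries: 18
indices.recovery.concurrent_streams: 4   # illustrative value, not from the deck
```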
Elasticsearch Tip #9
Threadpool Properties Prevent
Data Loss
• An Elasticsearch node has several thread pools to improve how threads are managed
within the node. At Loggly, we use bulk requests extensively, and we have found that setting
the right size for the bulk thread pool's queue using the threadpool.bulk.queue_size property is crucial in
order to avoid data loss or _bulk retries:
• threadpool.bulk.queue_size: 3000
This property value is for bulk requests. It tells ES the number of requests that can be
queued for execution on a node when no thread is available to execute a bulk
request. This value should be set according to your bulk request load. If the number of
bulk requests exceeds the queue size, the node rejects them with a
RemoteTransportException.
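As an elasticsearch.yml sketch (3000 is the Loggly value quoted above; a deeper queue buffers indexing bursts at the cost of heap, so size it to your own bulk load):

```yaml
# elasticsearch.yml: deepen the bulk thread pool's queue so short bursts of
# bulk requests are buffered rather than rejected with RemoteTransportException.
threadpool.bulk.queue_size: 3000
```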
Log Management is Our Full-Time Job.
It Shouldn’t Be Yours.
Loggly is the world’s most popular cloud-based log management solution, used by
more than 5,000 happy customers to effortlessly spot problems in real-time, easily
pinpoint root causes and resolve issues faster to ensure application success.
Visit us at www.loggly.com or follow @loggly on Twitter.
Unless You Want it to Be Your Full-Time Job…
…If so, check out our careers page to see if there is a position open that is a great match
for your skills! Join us at www.loggly.com/careers.
Try Loggly for Free! → http://bit.ly/1nl87Uc