2. User Scenarios
• Internal IBM Infrastructure
– Relatively small number of groups that generate very large volumes of logs ( groups that want to generate 3-5 TB/day )
– Logs are produced on VMs running various cloud
services operated by IBM
• External Bluemix Log Producers
– Relatively large number of groups ( Bluemix organizations ) that generate a variety of log data but in relatively small quantities – total logs range anywhere from kilobytes to gigabytes per day
– Only a handful of organizations are currently
generating large volumes of log data
8. Service Architecture
Key facts
• OpenStack Heat automation
– Multiple AutoScale Groups (ASGs)
– Docker image per ASG
– Ansible to configure software
• Currently deployed on
OpenStack
– Each virtual machine hosts a single Docker container
– Security groups for firewall rules
– HAProxy for load balancing
9. Deployment and Automation
• OpenStack deployment using Heat templates
– Provides scale-up/down capabilities to add capacity when needed ( a sketch follows below )
• Ansible configuration automation integrated with the Heat deployment to configure the nodes
• Docker images are used as our standard deployment artifact ( configured by Ansible )
• Jenkins jobs for building and testing the Docker images
• UCD automation for deployment and upgrade processing – provides
operational management for tracking what is deployed to each of
the environments
• Mixture of Jenkins and UCD jobs to manage the daily operations, including items such as data expiration, index pre-creation and various health check scripts.
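A minimal sketch of the Heat side of this setup, assuming the OS::Heat::AutoScalingGroup resource; the group name, image, flavor and sizes are hypothetical, and the real templates also wire in the security groups, HAProxy and the Ansible configuration step:

  heat_template_version: 2014-10-16
  resources:
    indexer_asg:                        # hypothetical ASG name
      type: OS::Heat::AutoScalingGroup
      properties:
        min_size: 2
        max_size: 12
        resource:
          type: OS::Nova::Server        # one VM per member, each hosting a single Docker container
          properties:
            image: docker-host-image    # hypothetical image name
            flavor: m1.large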
11. Multi-tenant Logstash Forwarder
• Took the logstash forwarder and added multi-tenancy
capabilities
• Similar changes to the logstash input lumberjack plugin
• Fixed log rotation handling in the MT-LSF – it was triggering disk-full problems on clients because it held locks on files for up to 24 hours before timing out
• Found that increasing the spool size improved performance up to a point: 512 was the sweet spot, and larger values actually performed worse ( invocation sketched below )
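A rough sketch of the forwarder invocation with the tuned spool size, assuming the MT-LSF keeps the stock logstash-forwarder flags; the binary name and config path are hypothetical:

  ./mt-lsf -config /etc/mt-lsf/forwarder.conf -spool-size 512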
12. Multi-tenant Lumberjack Server
• The Lumberjack server had issues with long-lived connections and file descriptor leaks that required frequent restarts under load
• We terminated connections on the client to get better server utilization ( forcing a load balancer switch ), but that didn’t resolve the underlying issue
• The Logstash 1.5.2 lumberjack plugin solved the connection problems with a fix to the JRuby OpenSSL library, which was leaking file descriptors under load
• Switching the Kafka output plugin to async mode gave some performance improvement ( 10-15% ) – see the sketch below
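Roughly what the async change looks like in the era-appropriate logstash-output-kafka configuration; broker and topic names are hypothetical:

  output {
    kafka {
      broker_list   => "kafka1:9092,kafka2:9092"   # hypothetical brokers
      topic_id      => "logs"                      # hypothetical topic
      producer_type => "async"                     # async batching gave the 10-15% gain
    }
  }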
13. Logstash Lumberjack Performance
• The great thing about Logstash is that it’s a Swiss Army knife for solving data transformation problems
• 12 Lumberjack servers in a cluster can process about 50 MB/s ≈ 4.3 TB/day, which is pretty good for most logging applications
• If you are only using the basic input / output functionality, then building a task-specific solution can result in better performance
• We are prototyping a replacement for the Logstash server that handles the mt-lumberjack processing, and initial results are very good – in the area of a 12x throughput improvement on the same hardware.
• The queuing mechanism that makes Logstash very flexible also turns out to be one of the bottlenecks when the platform is under stress
14. Kafka
• Distributed messaging system for buffering log
and metric data
• We keep 3 days’ worth of data to absorb input spikes and to buffer logs when Elasticsearch or the Logstash indexers are not keeping up ( retention setting sketched below )
• Logs for Kafka itself can become quite large when errors occur, so getting the right logging settings is important
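The 3-day retention corresponds to a broker setting along these lines in server.properties:

  # keep 3 days of data on the brokers
  log.retention.hours=72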
15. Logstash Indexers
• Logstash Indexers are responsible for processing the
log entries and pushing the data to Elasticsearch
• Stability of Logstash 1.4.2 plugins for ES was not good
– Tried all 3 protocols ( node, transport, http )
– Node was fast but had issues when large cluster metadata was transferred on ES node failures ( frequent OOMs )
– Transport had reasonable performance and stability but
did not have multi-node support
– HTTP had the best performance after tuning it to use a larger batch size, but also did not have multi-node support
• Logstash 1.5.2 ES plugins all have multi-node support
• Settled on the 1.5.2 HTTP protocol version running against dedicated HTTP client nodes in the cluster ( output sketched below )
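A sketch of the indexer output using the Logstash 1.5.2 elasticsearch plugin; host names and the batch size are illustrative:

  output {
    elasticsearch {
      protocol   => "http"
      host       => ["es-http-1", "es-http-2", "es-http-3"]   # dedicated http client nodes
      flush_size => 5000                                       # larger batch size
    }
  }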
16. Logstash Indexers
• Even with Logstash 1.5.2, the indexers are somewhat gated by the amount of data a single node can process
• Expanded the number of Kafka partitions to allow growth beyond the initial 19 partitions we had allocated for the logging topic ( see below ).
• Logstash indexers can now be scaled beyond 19 nodes, enough to get to the point where we can stress the ES cluster
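Expanding an existing topic’s partition count can be done with the standard Kafka tooling, for example ( ZooKeeper address and counts are illustrative ):

  kafka-topics.sh --zookeeper zk1:2181 --alter --topic logs --partitions 38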
17. Indexing Log Data
• Relying on your users to be well behaved is dangerous – some logs contain what appear to be well-formed JSON documents with GUIDs as keys, and all of a sudden the field metadata explodes in ES
• Need to monitor which documents you run through the json filter in Logstash ( guarded filter sketched below )
• Adding filters to Logstash also slows down the indexing process, especially if you attempt to use many of the fancier plugins
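One way to guard the json filter is to apply it only to sources known to emit well-formed, bounded JSON; the type tag here is hypothetical:

  filter {
    if [type] == "app_json" {          # tag set by the forwarder for trusted JSON sources
      json { source => "message" }
    }
  }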
18. Elasticsearch
• If your network is a problem, then ES is not going to be
happy
• Elasticsearch 1.4.4 did not react well to network blips – indexes would start shuffling themselves in an attempt to proactively recover, which generally resulted in long recovery times with the default configuration
• The default recovery settings meant clusters remained
red or yellow for extended periods which impacted the
data ingestion
• Elasticsearch 1.7.1 has been much more stable for us
19. Sharding
• Pre-allocating the right number of shards for an
index is hard if you don’t know how much data
you are going to get
• A target that seems to work well is about 25 GB per shard ( worked example below )
• The problem with shard size really shows up when you need to recover a failed node
• How many shards can you put in an ES cluster?
– We found 80k was too many -> changed how we allocate shards based on historical usage
– We think the practical limit for our clusters is about 40k
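As a worked example of the 25 GB target: a tenant producing roughly 500 GB of logs per index would get 500 / 25 = 20 shards, which could be pre-allocated with an index template ( names are hypothetical ):

  curl -XPUT 'http://es-http-1:9200/_template/big_tenant_logs' -d '{
    "template": "logs-bigtenant-*",
    "settings": { "number_of_shards": 20, "number_of_replicas": 2 }
  }'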
20. Elasticsearch Configurations
• 3 master nodes
• 10 data nodes per cluster
• 3 http nodes per cluster for queries
• 30 GB heap
• 2 data replicas to allow 2 node failures
• index.translog.flush_threshold_size = 1g
• indices.fielddata.cache.size: 50%
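A sketch of how these settings might land in elasticsearch.yml for a data node, with the 30 GB heap set via ES_HEAP_SIZE in the environment; the node roles are assumptions based on the topology above:

  node.master: false
  node.data: true
  index.number_of_replicas: 2
  index.translog.flush_threshold_size: 1g
  indices.fielddata.cache.size: 50%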
21. Elasticsearch Recovery
• Increase the rate at which an index can
recover
indices.recovery.max_bytes_per_sec: 200mb
• Increase the concurrent recoveries supported
cluster.routing.allocation.node_concurrent_recoveries: 500
• Because the Kafka cluster buffers the data, we can tolerate windows where data is delayed getting to ES during recovery ( the settings above can also be applied at runtime, as sketched below )
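Both recovery settings are dynamic, so a sketch of applying them to a running cluster via the cluster settings API ( HTTP node name is hypothetical ):

  curl -XPUT 'http://es-http-1:9200/_cluster/settings' -d '{
    "persistent": {
      "indices.recovery.max_bytes_per_sec": "200mb",
      "cluster.routing.allocation.node_concurrent_recoveries": 500
    }
  }'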
22. Elasticsearch Load Testing
• Run client drivers to simulate traffic into the
external stack
• We have a number of sample workloads from real tenants that we use in our load tests
• There are lots of knobs to tune ES so having
some consistent workloads to validate our
theories has been invaluable
23. Performance
• Clusters running in production can support up
to around 70k records/sec ( 30 MB/s ) based
on our monitoring
• In our performance environments we are
seeing consistent numbers beyond 40 MB/s
• For larger indexes, increasing the number of shards helped – 50 GB of logs spread across 10 shards loaded about 50% faster than with 5 shards
24. Adjust throttling for loading large indices
[ Bar chart: indexing throughput in MB/sec for a 10-shard index under three throttling settings – baseline 20mb throttle, 100mb throttle, and no throttling ( settings sketched below ) ]
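The throttling settings being compared map to the store-level throttle, roughly as follows ( values illustrative; 20mb was the default at the time ):

  indices.store.throttle.max_bytes_per_sec: 100mb   # raise from the 20mb default
  indices.store.throttle.type: none                 # or disable throttling entirely during the load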
25. Scaling Elasticsearch
Multiple Elasticsearch Clusters
• Tenants get placed onto an ES cluster
• Tribe nodes federate access across ES clusters ( config sketched below )
– Enables massive tenants to span ES clusters
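A sketch of the tribe node configuration in elasticsearch.yml, federating two backing clusters; tribe and cluster names are hypothetical:

  tribe.logging1.cluster.name: es-logging-1
  tribe.logging2.cluster.name: es-logging-2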