
DOD 2016 - Rafał Kuć - Building a Resilient Log Aggregation Pipeline Using Elasticsearch and Kafka


YouTube: https://www.youtube.com/watch?v=1cCD5axQf9U&list=PLnKL6-WWWE_VtIMfNLW3N3RGuCUcQkDMl&index=7

Time-based data, especially logs, are all around us. Every application, system, or piece of hardware logs something - from simple messages to large stack traces. In this talk we will learn how to build and tune a resilient log aggregation pipeline with Elasticsearch and Kafka at its heart. We will start by looking at the overall architecture and how we can connect Elasticsearch and Kafka together. We will look at how to scale the system through a hybrid approach that combines time- and size-based indices, and how to divide the cluster into tiers in order to handle potentially spiky load in real time. Then we'll look at tuning individual nodes: everything from commits, buffers, merge policies, and doc values to OS settings like the disk scheduler, SSD caching, and huge pages. Finally, we'll look at the pipeline that gets the logs to Elasticsearch and how to make it fast and reliable: where buffers should live, which protocols to use, where the heavy processing (like parsing unstructured data) should be done, and which tools from the ecosystem can help.



  1. Building a Resilient Log Aggregation Pipeline Using Elasticsearch and Kafka - Rafał Kuć @ Sematext Group, Inc.
  2. Sematext & I - Logsene (logs) and SPM (metrics)
  3. Next 30 minutes: log shipping (buffers, protocols, parsing), central buffering (Kafka, Redis), storage & analysis (Elasticsearch, Kibana, Grafana)
  4. Log shipping architecture - [diagram: file shippers feeding a centralized buffer, which feeds an Elasticsearch cluster]
  5. Focus: Elasticsearch - [same diagram, with the Elasticsearch cluster highlighted]
  6. Elasticsearch cluster architecture - [diagram: client, data, master, and ingest nodes]
  7. Dedicated masters, please - set discovery.zen.minimum_master_nodes to N/2 + 1, where N is the number of master-eligible nodes
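For three dedicated master-eligible nodes, the N/2 + 1 rule from the slide works out to 2. A minimal elasticsearch.yml sketch (setting names match Elasticsearch 5.x; the node counts are illustrative):

```
# On each of the 3 dedicated master nodes:
node.master: true
node.data: false
node.ingest: false
# floor(3 / 2) + 1 = 2 master-eligible nodes must agree, preventing split-brain
discovery.zen.minimum_master_nodes: 2
```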
  8. One big index is a no-go - not scalable enough for time based data
  9. One big index is a no-go - indexing slows down with time
  10. One big index is a no-go - expensive merges
  11. One big index is a no-go - delete by query needed for data retention
  12. One big index is a no-go (summary) - not scalable enough for time based data, indexing slows down with time, expensive merges, delete by query needed for data retention
  13. Daily indices are a good start - 2016.11.18, 2016.11.19, ..., 2016.11.22, 2016.11.23; indexing hits the newest index, most searches hit the recent ones. Indexing is faster for smaller indices, deletes are cheap, search can be limited to the indices that are actually needed, and static indices are cache friendly.
  14. Daily indices are a good start - for retention we delete whole indices, not individual documents
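The cheap-delete point boils down to a single index delete (the index name here is just an example); compare that with a delete-by-query crawling through one big index:

```shell
# Retention = dropping the oldest daily index in one call
curl -XDELETE 'localhost:9200/logs_2016.11.18'
```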
  15. Daily indices are sub-optimal - load is not even (think Black Friday vs. the following Saturday and Sunday)
  16. Size based indices are optimal - set a size limit per index, around 5 - 10GB per shard on AWS
  17. Size based indices are optimal - index into logs_01 until it reaches the limit
  18. Size based indices are optimal - then switch indexing to logs_02
  19. Size based indices are optimal - and keep rolling over as indices fill up
  20. Size based indices are optimal - logs_01, logs_02, ..., logs_N
  21. Slice using size - predictable search and indexing performance, better index balancing, fewer shards, easier handling of spiky loads, and lower costs thanks to better hardware utilization
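Newer Elasticsearch versions automate this pattern with the Rollover API (the max_size condition requires 6.x or later; the 5.x releases current at the time of the talk only had max_age and max_docs). A sketch, assuming an alias logs_write points at the current write index:

```shell
# Create the first index with the write alias
curl -XPUT 'localhost:9200/logs_01' -d '{
  "aliases": { "logs_write": {} }
}'

# Run periodically (e.g. from cron): creates logs_02 and flips the alias
# once the current index grows past the size limit
curl -XPOST 'localhost:9200/logs_write/_rollover' -d '{
  "conditions": { "max_size": "10gb" }
}'
```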
  22. Proper Elasticsearch configuration - keep index.refresh_interval at the maximum possible value (1 sec -> 100% indexing throughput, 5 sec -> 125%, 30 sec -> 175%). You can also loosen up merges - possible because of heavy aggregation use: index.merge.policy.segments_per_tier -> higher, index.merge.policy.max_merge_at_once -> higher, index.merge.policy.max_merged_segment -> lower. All of this buys higher indexing throughput.
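These are dynamic index settings, so they can be applied to a live index; a sketch (the concrete values are illustrative assumptions, not recommendations from the talk):

```shell
curl -XPUT 'localhost:9200/logs_01/_settings' -d '{
  "index.refresh_interval": "30s",
  "index.merge.policy.segments_per_tier": 20,
  "index.merge.policy.max_merge_at_once": 20,
  "index.merge.policy.max_merged_segment": "2gb"
}'
```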
  23. Proper Elasticsearch configuration - index only the fields you need, use doc values, and disable the _source and _all fields if you can live without them
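One way that advice could look as a mapping - a sketch for Elasticsearch 5.x, where keyword and date fields keep doc values by default. The field names and the decision to drop _source are assumptions; without _source you lose reindexing, updates, and highlighting:

```shell
curl -XPUT 'localhost:9200/logs_01' -d '{
  "mappings": {
    "log": {
      "_all":    { "enabled": false },
      "_source": { "enabled": false },
      "properties": {
        "@timestamp": { "type": "date" },
        "severity":   { "type": "keyword" },
        "message":    { "type": "text" }
      }
    }
  }
}'
```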
  24. Optimization time - we can optimize the data nodes for time based data
  25. Hot - cold architecture - one ES hot node started with -Dnode.attr.tag=hot, two ES cold nodes started with -Dnode.attr.tag=cold
  26. Hot - cold architecture - create the new index on the hot tier: curl -XPUT localhost:9200/logs_2016.11.22 -d '{ "settings" : { "index.routing.allocation.exclude.tag" : "cold", "index.routing.allocation.include.tag" : "hot" } }'
  27. Hot - cold architecture - indexing goes to logs_2016.11.22 on the hot tier
  28. Hot - cold architecture - the next day, indexing goes to logs_2016.11.23 on the hot tier
  29. Hot - cold architecture - after its day ends, move the old index to the cold tier: curl -XPUT localhost:9200/logs_2016.11.22/_settings -d '{ "index.routing.allocation.exclude.tag" : "hot", "index.routing.allocation.include.tag" : "cold" }'
  30. Hot - cold architecture - logs_2016.11.22 now lives on the cold tier while indexing continues into logs_2016.11.23
  31. Hot - cold architecture - the next day brings logs_2016.11.24 on the hot tier
  32. Hot - cold architecture - again, move logs_2016.11.23 to the cold tier after its day ends
  33. Hot - cold architecture - the cycle repeats day after day
  34. Hot - cold architecture - hot ES tier: good CPU, lots of I/O; cold ES tier: memory bound, decent I/O
  35. Hot - cold architecture summary - optimized costs (different hardware for each tier), performance (use case optimized hardware), and isolation (long running searches don't affect indexing)
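The create-on-hot / move-to-cold steps from the previous slides could be tied together in a daily cron job like this sketch (the host, the GNU date usage, and running it right after midnight are assumptions):

```shell
#!/bin/sh
TODAY=$(date +%Y.%m.%d)
YESTERDAY=$(date -d yesterday +%Y.%m.%d)   # GNU date

# Today's index starts life on the hot tier
curl -XPUT "localhost:9200/logs_$TODAY" -d '{
  "settings": {
    "index.routing.allocation.exclude.tag": "cold",
    "index.routing.allocation.include.tag": "hot"
  }
}'

# Yesterday's index relocates to the cold tier
curl -XPUT "localhost:9200/logs_$YESTERDAY/_settings" -d '{
  "index.routing.allocation.exclude.tag": "hot",
  "index.routing.allocation.include.tag": "cold"
}'
```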
  36. Elasticsearch client node needs - [cluster diagram with the client nodes highlighted]
  37. Elasticsearch client node needs - no data = no IOPS; large query throughput = high CPU usage; lots of results = high memory usage; lots of concurrent queries = higher resource utilization
  38. Elasticsearch ingest node needs - [cluster diagram with the ingest nodes highlighted]
  39. Elasticsearch ingest node needs - no data = no IOPS; large index throughput = high CPU & memory usage; complicated rules = high CPU usage; larger documents = more resource utilization
  40. Elasticsearch master node needs - [cluster diagram with the master nodes highlighted]
  41. Elasticsearch master node needs - no data = no IOPS; large number of indices = high CPU & memory usage; complicated mappings = high memory usage; daily indices = spikes in resource utilization
  42. Focus: Centralized Buffer - [same log shipping diagram, with the centralized buffer highlighted]
  43. Why Apache Kafka? Fast & easy to use, easy to scale, fault tolerant and highly available, supports streaming, works in publish/subscribe mode
  44. Kafka architecture - [diagram: a ZooKeeper ensemble coordinating multiple Kafka brokers]
  45. Kafka & topics - Kafka stores data in topics written on disk, e.g. security_logs, access_logs, app1_logs, app2_logs
  46. Kafka & topics & partitions & replicas - each topic is split into partitions (logs partitions 1 - 4), and each partition has a replica
  47. Scaling Kafka - start with a single partition for the logs topic
  48. Scaling Kafka - grow to 4 partitions
  49. Scaling Kafka - grow to 16 partitions as throughput demands
  50. Things to remember when using Kafka - it scales by adding more partitions, not threads; the more IOPS the better; keep the number of consumers equal to the number of partitions; replicas are used for HA and FT only; offsets are stored per consumer, so multiple destinations are easily possible
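The partitions-not-threads advice maps onto Kafka's stock CLI like this sketch (the ZooKeeper address and the partition/replica counts are example values; kafka-topics.sh used the --zookeeper flag in the 0.10-era releases current at the time of the talk):

```shell
# Create the topic with one partition per planned consumer,
# plus a replica of each partition for HA/FT
kafka-topics.sh --create --zookeeper localhost:2181 \
  --topic logs --partitions 16 --replication-factor 2

# Scale later by adding partitions (never fewer), then add consumers to match
kafka-topics.sh --alter --zookeeper localhost:2181 \
  --topic logs --partitions 32
```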
  51. Focus: Shipper - [same log shipping diagram, with the file shippers highlighted]
  52. What about the shipper? Which shipper to use? Which protocol should be used? What about the buffering? Log to JSON, or parse - and how?
  53. Buffers - they give performance (batches & threads) and availability (they cover for when the central buffer is gone)
  54. Buffer types - disk || memory || combined (hybrid approach); on source || centralized. On source: a file or local log shipper next to each app; easy scaling, fewer moving parts; often with the use of a lightweight shipper. Centralized: Kafka / Redis / Logstash / etc.; one place for all changes; extra features made easy (like TTL).
  55. Buffers summary - [diagram: per-app buffers shipping straight to ES (simple) vs. apps shipping through a centralized buffer (reliable)]
  56. Protocols - UDP: fast, cool for the application, not reliable. TCP: reliable (almost) - the application gets an ACK when data is written to the buffer. Application level ACKs may be needed: HTTP (Logstash, rsyslog, Fluentd), RELP (Logstash, rsyslog), Beats (Logstash, Filebeat), Kafka (Logstash, rsyslog, Filebeat, Fluentd).
  57. Choosing the shipper - application -> rsyslog (via socket) -> Elasticsearch (via http), with memory & disk assisted queues in rsyslog
  58. Choosing the shipper - alternatively: application -> file -> rsyslog or filebeat as the consumer
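Slide 58's file-to-Elasticsearch variant could look roughly like this rsyslog sketch (module and queue parameter names follow rsyslog's imfile/omelasticsearch documentation; the file path and index name are assumptions):

```
module(load="imfile")
module(load="omelasticsearch")

# Consume the application's log file
input(type="imfile" File="/var/log/app/app.log" Tag="app:")

# Ship to Elasticsearch over HTTP with an in-memory queue that
# spills to disk (disk assisted) and retries while ES is down
action(type="omelasticsearch"
       server="localhost" serverport="9200"
       searchIndex="logs" bulkmode="on"
       queue.type="LinkedList"
       queue.filename="es_queue"
       queue.saveOnShutdown="on"
       action.resumeRetryCount="-1")
```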
  59. What about the OS? Say NO to swap. Set the right disk scheduler: CFQ for spinning disks, deadline for SSDs. Use proper mount options for ext4: noatime, nodiratime, data=writeback, nobarrier. For bare metal, check the CPU governor and disable transparent huge pages (/proc/sys/vm/nr_hugepages=0).
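As root, the tweaks above could be applied like this sketch (the block device name and mount point are examples; note that ext4's data= mode has to go in /etc/fstab, since it cannot be changed on a live remount):

```shell
swapoff -a                                        # say NO to swap
sysctl -w vm.swappiness=1                         # and keep it from coming back

echo deadline > /sys/block/sda/queue/scheduler    # deadline for SSDs (cfq for spinning disks)

# /etc/fstab line for the data partition:
#   /dev/sda1  /data  ext4  noatime,nodiratime,data=writeback,nobarrier  0 0

echo never > /sys/kernel/mm/transparent_hugepage/enabled   # disable THP
```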
  60. We are engineers! We develop DevOps tools! We are DevOps people! We do fun stuff ;) http://sematext.com/jobs
  61. Thank you for listening! Get in touch! Rafał - rafal.kuc@sematext.com - @kucrafal - http://sematext.com - @sematext - http://sematext.com/jobs - come talk to us at the booth
