
Introducing log analysis to your organization

This talk was given during DevOps Con 2017.
Have you ever spent time digging through various terminals, grepping, lessing and awking, trying to find those few log lines that may be important? Have you ever done that under time pressure, because mission-critical services were not working? Have you ever heard from your developers that they can't tell you anything, because they don't have access to application logs? Have you ever considered centralized storage for logs, but time and resources are not on your side?
If you said yes to any of the above questions, then this talk is for you. During the talk we'll introduce you to the world of log centralization and analysis, covering both open-source and commercial tools. We will go from top to bottom and learn how to set up log centralization and analysis for servers, virtualized environments and containers. We will go from log shipping, through centralized buffering, to storage and analysis, to show you that having a centralized log analysis tool is not rocket science.
Finally, you will see how useful it is to combine the logs from all your servers in a single place for blazingly fast correlation.


Introducing log analysis to your organization

  1. Introducing Log Analysis To Your Organization – Rafał Kuć
  2. Sematext And Me – logs, metrics & cloud
  3. Next 60 minutes… Log shipping (buffers, protocols, parsing); Central buffering (Kafka, Redis); Storage & Analysis (Elasticsearch, Kibana, Grafana); Why & How? (Should I try? Open source? Commercial?)
  4. Why You Should Care – Environments are getting bigger
  5. Why You Should Care – Environments are getting bigger; Containers are everywhere
  6. Why You Should Care – Environments are getting bigger; Containers are everywhere; Infrastructure work gets automated (image created by Kjpargeter, Freepik.com)
  7. Why You Should Care – Environments are getting bigger; Containers are everywhere; Infrastructure work gets automated; Logs & metrics at the same place
  8. Why You Should Care – Environments are getting bigger; Containers are everywhere; Infrastructure work gets automated; Faster diagnostics == less money spent; Logs & metrics at the same place
  9.–17. Going For Commercial Solution – cloud (image slides)
  18. Going For Commercial Solution – Icon made by Smashicons from www.flaticon.com
  19.–22. Going Open-Source (image slides)
  23. Going Open-Source – Today’s Focus
  24. Log shipping architecture – File
  25. Log shipping architecture – File → Shipper
  26. Log shipping architecture – File → Shipper (×3)
  27. Log shipping architecture – File → Shipper (×3) → Centralized Buffer
  28. Log shipping architecture – File → Shipper (×3) → Centralized Buffer → data
  29. Log shipping architecture – File → Shipper (×3) → Centralized Buffer → data → ES cluster (×9)
  30. Focus: Shipper – File → Shipper (×3) → Centralized Buffer → data → ES cluster (×9)
  31. What about the shipper? (logs → Centralized Buffer) Which shipper to use? Which protocol should be used? What about the buffering? Log to JSON, or parse – and how?
  32. Buffers – performance & availability; batches & threads; what happens when the central buffer is gone
  33. Buffer types – Disk || memory || combined (hybrid approach); On source || centralized. On source (App → Buffer): file or a local log shipper; easy scaling – fewer moving parts; often with the use of a lightweight shipper. Centralized (App → Kafka / Redis / Logstash / etc. → ES): one place for all changes; extra features made easy (like TTL)
  34. Buffers Summary – Simple; Reliable (diagrams: App → Buffer → ES on source, and App → central Buffer → ES)
  35. Protocols – UDP: fast, cool for the application, not reliable. TCP: reliable (almost); application gets an ACK when written to the buffer. Application-level ACKs may be needed: HTTP (Logstash, rsyslog, Fluentd), RELP (Logstash, rsyslog), Beats (Logstash, Filebeat), Kafka (Logstash, rsyslog, Filebeat, Fluentd)
  36. Choosing the shipper – application → socket → rsyslog (memory & disk-assisted queues) → http → Elasticsearch
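A minimal sketch of that pipeline in rsyslog's own configuration language, assuming the stock omelasticsearch output module; the file path, index name and queue limits below are illustrative, not from the talk:

# /etc/rsyslog.d/30-elasticsearch.conf
module(load="imuxsock")         # read application logs from the local socket
module(load="omelasticsearch")  # ship to Elasticsearch over HTTP

action(type="omelasticsearch"
       server="localhost" serverport="9200"
       searchIndex="logs" bulkmode="on"
       queue.type="linkedlist"        # in-memory queue...
       queue.filename="es_queue"      # ...spilling to disk when full (disk-assisted)
       queue.saveonshutdown="on"
       action.resumeretrycount="-1")  # keep retrying while Elasticsearch is down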
  37. Final Architecture – application → socket → rsyslog (memory & disk-assisted queues) → http → Elasticsearch; application → file → rsyslog / Logagent / filebeat as consumer
  38. Final Architecture – as above, with the callout: Parsing Done Here
  39. Focus: Centralized Buffer – File → Shipper (×3) → Centralized Buffer → data → ES cluster (×9)
  40. Why Apache Kafka? Fast & easy to use; Easy to scale; Fault tolerant and highly available; Supports streaming; Works in publish/subscribe mode
  41. Kafka architecture – ZooKeeper (×3) coordinating Kafka brokers (×4)
  42. Kafka & topics – Kafka stores data in topics, written on disk (e.g. security_logs, access_logs, app1_logs, app2_logs)
  43. Kafka & topics & partitions & replicas – logs partitions 1–4, each with a replica partition
  44. Scaling Kafka – logs partition 1
  45. Scaling Kafka – logs partitions 1–4
  46. Scaling Kafka – logs partitions 1–16
  47. Things to remember when using Kafka – Scales by adding more partitions, not threads; The more IOPS the better; Keep the number of consumers equal to the number of partitions; Replicas are used for HA and FT only; Offsets are stored per consumer, so multiple destinations are easily possible (see the sketch below)
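A sketch of those rules with Kafka's stock CLI tools (ZooKeeper-based flags of the 2017 era; host, topic name and counts are examples):

kafka-topics.sh --create --zookeeper localhost:2181 --topic logs --partitions 4 --replication-factor 2
# scale by adding partitions, not threads:
kafka-topics.sh --alter --zookeeper localhost:2181 --topic logs --partitions 8
# offsets are stored per consumer group, so a second destination can read independently:
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic logs --group shipper-to-es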
  48. Focus: Elasticsearch – File → Shipper (×3) → Centralized Buffer → data → ES cluster (×9)
  49. Elasticsearch cluster architecture – client (×3), data (×6), master (×3) and ingest (×3) nodes
  50. Dedicated masters please – client, data, master and ingest nodes; discovery.zen.minimum_master_nodes -> N/2 + 1 master-eligible nodes
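A sketch of that setting in a pre-7.x elasticsearch.yml, assuming three dedicated master-eligible nodes:

node.master: true
node.data: false
node.ingest: false
discovery.zen.minimum_master_nodes: 2   # N/2 + 1 with N = 3 master-eligible nodes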
  51. Elasticsearch – Indices: Index – logical place for data
  52. Elasticsearch – Indices: Index – logical place for data; Index – can be compared to a database in a DB
  53. Elasticsearch – Indices: Index – logical place for data; Index – can be compared to a database in a DB; Index – built out of one or more shards
  54. Elasticsearch – Indices: Index – logical place for data; Index – can be compared to a database in a DB; Index – built out of one or more shards; Shard – can be spread among multiple nodes
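Shard and replica counts are set at index-creation time; a sketch in the same curl style the deck uses later (index name and counts are examples):

curl -XPUT 'localhost:9200/logs' -d '{
  "settings" : {
    "number_of_shards" : 3,
    "number_of_replicas" : 1
  }
}'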
  55. Scaling Elasticsearch – Logs Shard1
  56. Scaling Elasticsearch – Logs Shard1, Users Shard1, Invoices Shard1
  57. Scaling Elasticsearch – Logs Shards 1–4
  58. Scaling Elasticsearch – Logs Shards 1–4 spread across nodes
  59. Scaling Elasticsearch – Logs Shards 1–4 plus Replicas 1–4
  60. One big index is a no-go – Not scalable enough for time-based data
  61. One big index is a no-go – Not scalable enough for time-based data; Indexing slows down with time
  62. One big index is a no-go – Not scalable enough for time-based data; Indexing slows down with time; Expensive merges
  63. One big index is a no-go – Not scalable enough for time-based data; Indexing slows down with time; Expensive merges; Delete-by-query needed for data retention
  64. Daily indices are a good start – 2017.11.16, 2017.11.17, …, 2017.11.20, 2017.11.21 (indexing goes to the newest index, most searches hit the recent ones); Indexing is faster for smaller indices; Deletes are cheap; Search can be performed only on the indices that are needed; Static indices are cache-friendly
  65. Daily indices are a good start – as above, plus: We delete whole indices
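With daily indices, retention becomes a cheap whole-index deletion instead of a delete-by-query; a sketch (index name is an example):

curl -XDELETE 'localhost:9200/logs_2017.11.16'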
  66. Daily indices are sub-optimal – black friday, saturday, sunday: load is not even
  67.–71. Size-based indices are optimal – set a size limit for indices (around 5–10GB per shard on AWS); indexing goes to logs_01 until it reaches the limit, then to logs_02, and so on up to logs_N (progressive build)
  72. Slice using size – Predictable searching and indexing performance; Better indices balancing; Fewer shards; Easier handling of spiky loads; Lower costs because of better hardware utilization
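One way to slice by size is the Rollover API, available since Elasticsearch 5.0; a sketch assuming a hypothetical write alias logs_write pointing at the newest index (a max_size condition only arrived in later releases, so age and document count stand in for size here):

curl -XPOST 'localhost:9200/logs_write/_rollover' -d '{
  "conditions" : {
    "max_age" : "1d",
    "max_docs" : 50000000
  }
}'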
  73. Proper Elasticsearch configuration – Keep index.refresh_interval at the maximum possible value (1 sec -> 100%, 5 sec -> 125%, 30 sec -> 175% relative indexing throughput). You can loosen up merges (possible because of heavy aggregation use): index.merge.policy.segments_per_tier -> higher, index.merge.policy.max_merge_at_once -> higher, index.merge.policy.max_merged_segment -> lower; together: higher indexing throughput
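A sketch of applying those settings to a live index (index name and values are examples, not recommendations from the talk):

curl -XPUT 'localhost:9200/logs_01/_settings' -d '{
  "index.refresh_interval" : "30s",
  "index.merge.policy.segments_per_tier" : 20,
  "index.merge.policy.max_merge_at_once" : 20,
  "index.merge.policy.max_merged_segment" : "2gb"
}'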
  74. Proper Elasticsearch configuration – Index only needed fields; Use doc values; Do not store _source; Do not index _all
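A sketch of such a mapping (5.x-era syntax with a hypothetical log type): _all and _source disabled, only needed fields indexed, and keyword/date fields relying on doc values, which they enable by default:

curl -XPUT 'localhost:9200/logs' -d '{
  "mappings" : {
    "log" : {
      "_all" : { "enabled" : false },
      "_source" : { "enabled" : false },
      "properties" : {
        "message" : { "type" : "text" },
        "severity" : { "type" : "keyword" },
        "timestamp" : { "type" : "date" }
      }
    }
  }
}'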
  75. Optimization time – We can optimize data nodes for time-based data (client, data, master, ingest nodes)
  76. Hot – cold architecture – ES hot (-Dnode.attr.tag=hot) and two ES cold nodes (-Dnode.attr.tag=cold)
  77. Hot – cold architecture – create logs_2017.11.22 on the hot tier: curl -XPUT localhost:9200/logs_2017.11.22 -d '{ "settings" : { "index.routing.allocation.exclude.tag" : "cold", "index.routing.allocation.include.tag" : "hot" } }'
  78. Hot – cold architecture – indexing goes to logs_2017.11.22 on ES hot
  79. Hot – cold architecture – logs_2017.11.22 and logs_2017.11.23 on ES hot; indexing goes to logs_2017.11.23
  80. Hot – cold architecture – move the index after the day ends: curl -XPUT localhost:9200/logs_2017.11.22/_settings -d '{ "index.routing.allocation.exclude.tag" : "hot", "index.routing.allocation.include.tag" : "cold" }'
  81. Hot – cold architecture – indexing to logs_2017.11.23 on ES hot; logs_2017.11.22 on ES cold
  82. Hot – cold architecture – indexing to logs_2017.11.24; logs_2017.11.23 still on ES hot; logs_2017.11.22 on ES cold
  83. Hot – cold architecture – as above; move the index after the day ends
  84. Hot – cold architecture – indexing to logs_2017.11.24 on ES hot; logs_2017.11.22 and logs_2017.11.23 on ES cold
  85. Hot – cold architecture – Hot ES Tier: good CPU, lots of I/O; Cold ES Tier: memory-bound, decent I/O
  86. Hot – cold architecture summary – Optimize costs: different hardware for different tiers; Performance: use-case-optimized hardware; Isolation: long-running searches don't affect indexing
  87. Elasticsearch client node needs (cluster diagram: client, data, master, ingest nodes)
  88. Elasticsearch client node needs – No data = no IOPS; Large query throughput = high CPU usage; Lots of results = high memory usage; Lots of concurrent queries = higher resource utilization
  89. Elasticsearch ingest node needs (cluster diagram: client, data, master, ingest nodes)
  90. Elasticsearch ingest node needs – No data = no IOPS; Large index throughput = high CPU & memory usage; Complicated rules = high CPU usage; Larger documents = more resource utilization
  91. Elasticsearch master node needs (cluster diagram: client, data, master, ingest nodes)
  92. Elasticsearch master node needs – No data = no IOPS; Large number of indices = high CPU & memory usage; Complicated mappings = high memory usage; Daily indices = spikes in resource utilization
  93. What about the OS? Say NO to swap. Set the right disk scheduler: CFQ for spinning disks, deadline for SSDs. Use proper mount options for ext4: noatime, nodiratime, data=writeback, nobarrier. For bare metal: check the CPU governor and disable transparent huge pages (/proc/sys/vm/nr_hugepages=0)
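Sketches of those OS knobs on Linux (run as root; the device and mount point are examples):

swapoff -a                                                  # say NO to swap
echo deadline > /sys/block/sda/queue/scheduler              # deadline for SSDs, cfq for spinning disks
echo never > /sys/kernel/mm/transparent_hugepage/enabled    # disable transparent huge pages
mount -o remount,noatime,nodiratime /var/lib/elasticsearch  # ext4 options (data=writeback, nobarrier need fstab + reboot)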
  94.–100. Analysis – Kibana (screenshot slides)
  101.–103. Analysis – Grafana (screenshot slides)
  104. Where To Go From Here?
  105. We are engineers! We develop DevOps tools! We are DevOps people! We do fun stuff ;) http://sematext.com/jobs
  106. Thank you for listening! Get in touch! Rafał – rafal.kuc@sematext.com – @kucrafal – http://sematext.com – @sematext – http://sematext.com/jobs
