Log Analytics with ELK Stack
(Architecture for aggressive cost optimization and infinite data scale)
Denis D’Souza | 27th July 2019
About me...
● Currently a DevOps engineer at Moonfrog Labs
● 6+ years working as a DevOps Engineer, SRE and Linux administrator
Worked on a variety of technologies in both service-based and
product-based organisations
● How do I spend my free time?
Learning new technologies and playing PC games
www.linkedin.com/in/denis-dsouza
• A mobile gaming company making mass-market social games
• 5M+ Daily Active Users, 15M+ Weekly Active Users
• Real-time, cross-platform games optimized for our primary
market(s): India and the subcontinent
• Profitable!
Current Scale
Who we are?
1. Our business requirements
2. Choosing the right option
3. ELK Stack overview
4. Our ELK architecture
5. Optimizations we did
6. Cost savings
7. Key takeaways
Our problem statement
● Log analytics platform (Web-Server, Application, Database logs)
● Data Ingestion rate: ~300GB/day
● Frequently accessed data: last 8 days
● Infrequently accessed data: older than 8 days
● Uptime: 99.90%
● Hot Retention period: 90 days
● Cold Retention period: 90 days (with potential to increase)
● Simple and Cost effective solution
● Fairly predictable concurrent user-base
● Not to be used for storing user/business data
Our business requirements
                 ELK stack              Splunk                            Sumo Logic
Product          Self managed           Cloud                             Professional
Pricing          ~ $30 per GB / month   ~ $100 per GB / month *           ~ $108 per GB / month *
Data Ingestion   ~ 300 GB / day         ~ 100 GB / day *                  ~ 20 GB / day *
                                        (post ingestion custom pricing)   (post ingestion custom pricing)
Retention        ~ 90 days              ~ 90 days *                       ~ 30 days *
Cost/GB/day      ~ $0.98 per GB / day   ~ $3.33 per GB / day *            ~ $3.60 per GB / day *
* Values are estimates taken from the product pricing web pages of the respective products; they may not represent the actual values and are meant for comparison only.
References:
https://www.splunk.com/en_us/products/pricing/calculator.html#tabs/tab2
https://www.sumologic.com/pricing/apac/
Choosing the right option
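The Cost/GB/day figures for Splunk and Sumo Logic look like the monthly per-GB price divided by a 30-day month. That divisor is an assumption inferred from the numbers, not something the pricing pages state, and the helper name `per_gb_per_day` is hypothetical. A minimal sketch of the arithmetic:

```python
def per_gb_per_day(monthly_price_per_gb: float, days_per_month: int = 30) -> float:
    """Convert a monthly per-GB price into a daily per-GB price."""
    # Assumption: a 30-day month, chosen because it reproduces the table values.
    return round(monthly_price_per_gb / days_per_month, 2)


splunk = per_gb_per_day(100)      # ~ $100 per GB / month
sumo_logic = per_gb_per_day(108)  # ~ $108 per GB / month
print(splunk, sumo_logic)
```

The ELK figure of ~$0.98 does not come from the rounded ~$30; it matches the measured daily cost shown later in the deck (~$294 per day for ~300 GB).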
ELK Stack overview
● Index
● Shard
○ Primary
○ Replica
● Segment
● Node
References:
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/_basic_concepts.html
ELK Stack overview: Terminologies
Our ELK architecture
Our ELK architecture: Hot-Warm-Cold data storage
(infinite scale)
#   Service         Number of Nodes   Total CPU Cores   Total RAM   Storage (EBS)
1   Elasticsearch   7                 28                141 GB
2   Logstash        3                 6                 12 GB
3   Kibana          1                 1                 4 GB
    Total           11                35                157 GB      ~ 20 TB

Data-ingestion per day: ~ 300 GB
Hot Retention period: 90 days
Docs/sec (at peak load): ~ 7K
Our ELK architecture: Size and scale
Application Side
● Logstash
● Elasticsearch
Infrastructure Side
● EC2
● EBS
● Data transfer
Optimizations we did
Optimizations we did: Application side
Logstash
Pipeline Workers:
● Set "pipeline.workers" to 4x the number of
cores to improve CPU utilisation on the Logstash
server (as threads may spend significant time in
an I/O wait state)
logstash.yml:
### Core-count: 2 ###
...
pipeline.workers: 8
...
References:
https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html
Optimizations we did: Logstash
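As a rough sketch of the sizing rule on this slide (workers = 4 x cores), the snippet below derives the value for a given host. The multiplier of 4 is the figure from the slide; the function name `pipeline_workers` is hypothetical, and the right multiplier for another workload depends on how long its threads sit in I/O wait.

```python
import os


def pipeline_workers(core_count: int, multiplier: int = 4) -> int:
    """Return a pipeline.workers value as a multiple of the core count."""
    if core_count < 1 or multiplier < 1:
        raise ValueError("core_count and multiplier must be positive")
    return core_count * multiplier


# Print the line that would go into logstash.yml for this machine.
cores = os.cpu_count() or 1
print(f"pipeline.workers: {pipeline_workers(cores)}")
```

For the 2-core Logstash nodes on the slide this gives 8, matching the config shown.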
'info' logs:
● Separated application 'info' logs to be stored in a
different index with a retention policy of fewer days
Logstash config (output section):
if [sourcetype] == "app_logs" and [level] == "info"
{
  elasticsearch {
    index => "%{sourcetype}-%{level}-%{+YYYY.MM.dd}"
    ...
'200' response-code logs:
● Separated access logs with a '200' response-code to
be stored in a different index with a retention policy
of fewer days
Logstash config (output section):
if [sourcetype] == "nginx" and [status] == "200"
{
  elasticsearch {
    index => "%{sourcetype}-%{status}-%{+YYYY.MM.dd}"
    ...
References:
https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html
Optimizations we did: Logstash
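To make the routing concrete, here is a small sketch that mimics how the two conditionals pick an index name for an event. The function name `index_name` is hypothetical, and the fallback index for all other events is an assumption, since the slides elide the rest of the output section.

```python
from datetime import datetime, timezone


def index_name(event: dict, when: datetime) -> str:
    """Mimic the conditional index naming from the Logstash config."""
    day = when.strftime("%Y.%m.%d")  # same shape as %{+YYYY.MM.dd}
    sourcetype = event.get("sourcetype", "unknown")
    if sourcetype == "app_logs" and event.get("level") == "info":
        return f"{sourcetype}-info-{day}"
    if sourcetype == "nginx" and event.get("status") == "200":
        return f"{sourcetype}-200-{day}"
    # Assumption: everything else lands in a plain per-sourcetype daily index.
    return f"{sourcetype}-{day}"


when = datetime(2019, 7, 27, tzinfo=timezone.utc)
print(index_name({"sourcetype": "nginx", "status": "200"}, when))
```

Because the noisy events live in their own daily indexes, a shorter retention can be applied to them without touching error logs or non-200 responses.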
Log ‘message’ field:
● Removed the "message" field when there were no
grok failures in Logstash while applying grok
patterns
(reduced storage footprint by ~30% per doc)
Filter config:
if "_grokparsefailure" not in [tags] {
  mutate {
    remove_field => ["message"]
  }
}
Eg:
Nginx log message: 127.0.0.1 - - [26/Mar/2016:19:09:19 -0400] "GET / HTTP/1.1" 401 194 "" "Mozilla/5.0 Gecko" "-"
Grok pattern: %{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) %{USER:ident} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent} %{QS:forwarder}
Optimizations we did: Logstash
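The same idea, sketched outside Logstash: parse the line into fields, and keep the raw "message" only when parsing fails so nothing is lost. The regular expression below is a simplified stand-in for the grok pattern, not an equivalent of it, and the function name `parse_access_log` is hypothetical.

```python
import re

# Simplified stand-in for the grok pattern (an assumption, not a full port).
NGINX_LINE = re.compile(
    r'(?P<clientip>\S+) \S+ (?P<ident>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\w+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d+) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)" "(?P<forwarder>[^"]*)"'
)


def parse_access_log(line: str) -> dict:
    """Return parsed fields; keep the raw message only on parse failure."""
    event = {"message": line, "tags": []}
    match = NGINX_LINE.match(line)
    if match is None:
        # Mirror Logstash: tag the event and keep the original text.
        event["tags"].append("_grokparsefailure")
        return event
    event.update(match.groupdict())
    del event["message"]  # every field is already captured separately
    return event


sample = ('127.0.0.1 - - [26/Mar/2016:19:09:19 -0400] "GET / HTTP/1.1" '
          '401 194 "" "Mozilla/5.0 Gecko" "-"')
print(parse_access_log(sample))
```

Keeping the message on failure is what makes the removal safe: unparsed lines can still be searched and the pattern fixed later.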
Elasticsearch
Optimizations we did: Application side
JVM heap vs non-heap memory:
● Optimised the JVM heap size by monitoring the GC
interval; this helped utilise system memory
efficiently (33% for JVM heap, 66% for non-heap) *
jvm.options:
### Total system Memory 15GB ###
-Xms5g
-Xmx5g
[Figures: GC graphs for "Heap too small", "Heap too large" and "Optimised Heap"]
* Recommended heap-size settings by Elastic:
https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
Optimizations we did: Elasticsearch
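A minimal sketch of the split used on this slide, one third of system memory for the JVM heap and the rest left to the OS. The one-third ratio is what worked for this cluster after watching GC behaviour, not a general rule, and the function name `heap_flags` is hypothetical.

```python
def heap_flags(total_memory_gb: int, heap_divisor: int = 3) -> list:
    """Return matching -Xms/-Xmx flags for a heap of total/heap_divisor GB."""
    if total_memory_gb < 1 or heap_divisor < 1:
        raise ValueError("total_memory_gb and heap_divisor must be positive")
    heap_gb = max(1, total_memory_gb // heap_divisor)
    # Min and max are set to the same value so the heap is never resized.
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


print("\n".join(heap_flags(15)))
```

For the 15GB nodes above this yields the same `-Xms5g` / `-Xmx5g` pair shown in jvm.options.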
Shards:
● Created templates with a number of shards that
is a multiple of the number of Elasticsearch
nodes
(helps fix shard distribution imbalance, which
resulted in uneven disk and compute resource
usage)
Replicas:
● Removed replicas for the required indexes
(50% savings on storage cost, ~30% reduction in
compute resource utilization)
Trade-offs:
● Removing replicas will result in search queries
running slower, as replicas are also used while
performing search operations
● It is not recommended to run production clusters
without replicas
Template config:
### Number of ES nodes: 5 ###
{
  "template": "appserver-*",
  "settings": {
    "number_of_shards": "5",
    "number_of_replicas": "0",
    ...
  }
}
Optimizations we did: Elasticsearch
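Why a multiple of the node count helps can be seen with a toy model. The sketch below spreads an index's primary shards round-robin across nodes; Elasticsearch's real allocator weighs more factors, so treat this as an illustration only. The function name `shards_per_node` is hypothetical.

```python
def shards_per_node(primary_shards: int, nodes: int) -> list:
    """Spread shards round-robin and return the shard count per node."""
    if nodes < 1:
        raise ValueError("nodes must be positive")
    counts = [0] * nodes
    for shard in range(primary_shards):
        counts[shard % nodes] += 1
    return counts


print(shards_per_node(5, 5))  # even: one shard on every node
print(shards_per_node(7, 5))  # uneven: two nodes carry double the load
```

With daily indexes the uneven case repeats every day, which is how disk and CPU usage drift apart between nodes.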
AWS
● EC2
● EBS
● Data transfer (Inter AZ)
Spotinst platform allows users to reliably
leverage excess capacity, simplify cloud
operations and save 80% on compute costs.
Optimizations we did: Infrastructure side
Optimizations we did: Infrastructure side
EC2
Stateful EC2 Spot instances:
● Moved all ELK nodes to run on spot instances
(Instances maintain IP address, EBS volumes)
Recovery time: < 10 mins
Trade-offs:
● Prefer using previous generation instance
types to reduce frequent spot take-backs
Optimizations we did: EC2 and spot
Auto-Scaling:
● Performance/time based auto-scaling for
Logstash Instances
Optimizations we did: EC2 and spot
Optimizations we did: Infrastructure side
EBS
"Hot-Warm" Architecture:
● "Hot" nodes: store active indexes, use GP2
EBS-disks (General Purpose SSD)
● "Warm" nodes: store passive indexes, use SC1
EBS-disks (Cold HDD)
(~69% savings on storage cost)
elasticsearch.yml:
node.attr.box_type: hot
...
Template config:
"template": "appserver-*",
"settings": {
  "index": {
    "routing": {
      "allocation": {
        "require": {
          "box_type": "hot"
        }
      }
    }
  },
  ...
Trade-offs:
● Since "Warm" nodes use SC1 EBS-disks, they have
lower IOPS and throughput, so search operations
on them are comparatively slower
References:
https://cinhtau.net/2017/06/14/hot-warm-architecture/
Optimizations we did: EBS
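The nested template setting above can also be written as a single flat key, which is the form typically sent to an index's `_settings` endpoint when an index changes tier. A small sketch that builds that body; the function name `allocation_settings` is hypothetical, and `box_type` is simply the node attribute name chosen on the slide.

```python
import json


def allocation_settings(box_type: str) -> dict:
    """Build an index settings body that pins an index to one node tier."""
    if box_type not in {"hot", "warm"}:
        raise ValueError("box_type must be 'hot' or 'warm'")
    # Flat-key form of settings.index.routing.allocation.require.box_type
    return {"index.routing.allocation.require.box_type": box_type}


print(json.dumps(allocation_settings("warm"), indent=2))
```

Once the setting changes, Elasticsearch relocates the index's shards to nodes whose `node.attr.box_type` matches.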
Moving indexes to "Warm" nodes:
● Reallocated indexes older than 8 days to "Warm"
nodes
● Recommended to perform this operation during
off-peak hours as it is I/O intensive
Curator config:
actions:
  1:
    action: allocation
    description: "Move index to Warm-nodes after 8 days"
    options:
      key: box_type
      value: warm
      allocation_type: require
      timeout_override:
      continue_if_exception: false
    filters:
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 8
...
References:
https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x
Optimizations we did: EBS
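What the age filter selects can be sketched in a few lines: read the date from the end of each index name and keep the ones older than 8 days. This is a day-granularity approximation of Curator's behaviour, not its implementation; the function name `indexes_older_than` and the sample index names are hypothetical.

```python
from datetime import date, datetime, timedelta


def indexes_older_than(names, today: date, days: int = 8,
                       timestring: str = "%Y.%m.%d") -> list:
    """Return index names whose trailing date is more than `days` days old."""
    cutoff = today - timedelta(days=days)
    selected = []
    for name in names:
        suffix = name.rsplit("-", 1)[-1]
        try:
            index_day = datetime.strptime(suffix, timestring).date()
        except ValueError:
            continue  # no date in the name, so the filter ignores it
        if index_day < cutoff:
            selected.append(name)
    return selected


names = ["appserver-2019.07.26", "appserver-2019.07.18", "nginx-200-2019.07.01", ".kibana"]
print(indexes_older_than(names, today=date(2019, 7, 27)))
```

Filtering on the name rather than on creation time keeps the result stable even if an index is restored from a snapshot later.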
Single Availability Zone:
● Migrated all ELK nodes to a single availability zone
(reduced inter-AZ data transfer cost for ELK nodes
by 100%)
● Data transfer/day: ~700GB
(Logstash to Elasticsearch: ~300GB,
Elasticsearch inter-communication: ~400GB)
Trade-offs:
● It is not recommended to run production clusters in
a single AZ as it will result in downtime and
potential data loss in case of AZ failures
Optimizations we did: Inter-AZ data transfer
Using S3 for index Snapshots:
● Take snapshots of indexes and store them in S3
Backup:
curl -XPUT "http://<domain>:9200/_snapshot/s3_repository/snap1?pretty&wait_for_completion=true" -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html
https://medium.com/@federicopanini/elasticsearch-backup-snapshot-and-restore-on-aws-s3-f1fc32fbca7f
Data backup and restore
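For scripting the backup instead of calling curl by hand, the request can be assembled with the Python standard library. This sketch only builds the request object and does not send it; the base URL is a placeholder, the function name `snapshot_request` is hypothetical, and the `Content-Type` header is included on the assumption that the cluster version enforces it.

```python
import json
import urllib.request


def snapshot_request(base_url: str, repository: str, snapshot: str, indices):
    """Build (but do not send) a PUT request that snapshots the given indexes."""
    body = {
        "indices": ",".join(indices),
        "ignore_unavailable": True,
        "include_global_state": False,
    }
    url = f"{base_url}/_snapshot/{repository}/{snapshot}?wait_for_completion=true"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder host; passing the request to urllib.request.urlopen() would send it.
req = snapshot_request("http://localhost:9200", "s3_repository", "snap1",
                       ["index_1", "index_2"])
print(req.get_method(), req.full_url)
```

Building the body with `json.dumps` avoids the hand-written JSON mistakes (stray commas, unbalanced quotes) that are easy to make in a shell one-liner.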
Restore:
curl -s -XPOST --url "http://<domain>:9200/_snapshot/s3_repository/snap1/_restore" -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
On-demand Elasticsearch cluster:
● Launch an on-demand ES cluster and import
the snapshots from S3
Existing Cluster:
● Restore the required snapshots to the existing cluster
Data backup and restore
Data corruption:
● List indexes with status ‘Red’
● Delete the corrupted indexes
● Restore indexes from S3 snapshots
● Recovery time: depends on size of data
Node failure due to AZ going down:
● Launch a new ELK cluster using AWS CloudFormation
templates
● Do the necessary config changes in Filebeat,
Logstash etc.
● Restore the required indexes from S3 snapshots
● Recovery time: depends on provisioning time and
size of data
Node failures due to underlying hardware issue:
● Recycle node in Spotinst console
(will take AMI of root volume, launch new instance,
attach EBS volumes, maintain private IP)
● Recovery time: < 10 mins/node
Snapshot restore time (estimates):
● < 4mins for a 20GB snapshot (test-cluster: 3
nodes, multiple indexes with 3 primary shards
each, no replicas)
Disaster recovery
EC2
Instance type                 Service              Daily cost
5 x r5.xlarge (20C, 160GB)    Elasticsearch        $40.80
3 x c5.large (6C, 12GB)       Logstash             $7.17
1 x t3.medium (2C, 4GB)       Kibana               $1.29
Total                                              ~ $49.26

EC2 (optimized)
Daily cost: 65% savings + Spotinst charges (20% of savings)
Instance type                 Service              Daily cost   Total Savings
5 x m4.xlarge (20C, 80GB)     Elasticsearch Hot    $14.64
2 x r4.xlarge (8C, 61GB)      Elasticsearch Warm   $7.50
3 x c4.large (6C, 12GB)       Logstash             $3.50
1 x t2.medium (2C, 4GB)       Kibana               $0.69
Total                                              ~ $26.33     ~ 47%
Cost savings: EC2
Storage
Ingesting: 300GB/day
Retention: 90 days
Replica count: 1
Storage type          Retention   Daily cost
~ 54TB (GP2)          90 days     ~ $237.60

Storage (optimized)
Ingesting: 300GB/day
Retention: 90 days
Replica count: 0
Backups: Daily S3 snapshots
Storage type          Retention   Daily cost   Total Savings
~ 3TB (GP2) Hot       8 days      $12.00
~ 24TB (SC1) Warm     82 days     $24.00
~ 27TB (S3) Backup    90 days     $22.50
Total                             ~ $58.50     ~ 75%
Cost savings: Storage
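The volume sizes in both storage tables follow from ingestion rate x retention x copies. A short sketch of that arithmetic, using decimal terabytes (1 TB = 1000 GB); the function name `storage_tb` is hypothetical, and the slide figures are rounded versions of these results.

```python
def storage_tb(ingest_gb_per_day: float, days: int, replicas: int = 0) -> float:
    """Storage needed in TB for `days` of retention with `replicas` extra copies."""
    return ingest_gb_per_day * days * (1 + replicas) / 1000


baseline = storage_tb(300, 90, replicas=1)  # everything on GP2, one replica
hot = storage_tb(300, 8)                    # GP2, no replicas
warm = storage_tb(300, 82)                  # SC1, no replicas
backup = storage_tb(300, 90)                # S3 snapshots
print(baseline, hot, warm, backup)
```

Dropping the replica halves the footprint on its own; the hot/warm split then moves most of what remains onto the cheaper volume type.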
                     ELK stack    ELK stack (optimized)   Savings
EC2                  49.40        26.33                   47%
Storage              237.60       58.50                   75%
Data-transfer        7            0                       100%
Total (daily cost)   ~ $294.00    ~ $84.83                ~ 71% *
Cost/GB (daily)      ~ $0.98      ~ $0.28
* Total savings are exclusive of some of the application-level optimizations done
Total savings
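The totals and percentages in this table can be re-derived from its own rows. The sketch below uses the daily costs exactly as listed in the table and the ~300GB/day ingestion rate; the variable and function names are illustrative.

```python
BASELINE = {"EC2": 49.40, "Storage": 237.60, "Data-transfer": 7.00}
OPTIMIZED = {"EC2": 26.33, "Storage": 58.50, "Data-transfer": 0.00}
INGEST_GB_PER_DAY = 300


def savings_pct(before: float, after: float) -> int:
    """Percentage saved when a daily cost drops from `before` to `after`."""
    return round((1 - after / before) * 100)


total_before = sum(BASELINE.values())
total_after = sum(OPTIMIZED.values())

for item in BASELINE:
    print(item, f"{savings_pct(BASELINE[item], OPTIMIZED[item])}%")
print("Total", round(total_before, 2), round(total_after, 2),
      f"{savings_pct(total_before, total_after)}%")
print("Cost/GB", round(total_before / INGEST_GB_PER_DAY, 2),
      round(total_after / INGEST_GB_PER_DAY, 2))
```

Storage dominates the baseline bill, which is why the replica and hot/warm changes account for most of the overall saving.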
                 ELK Stack (optimized)   ELK Stack              Splunk                            Sumo Logic
Product          Self managed            Self managed           Cloud                             Professional
Data Ingestion   ~ 300GB/day             ~ 300GB/day            ~ 100 GB / day *                  ~ 20 GB / day *
                                                                (post ingestion custom pricing)   (post ingestion custom pricing)
Retention        ~ 90 days               ~ 90 days              ~ 90 days *                       ~ 30 days *
Cost/GB/day      ~ $0.28 per GB / day    ~ $0.98 per GB / day   ~ $3.33 per GB / day *            ~ $3.60 per GB / day *
Savings over traditional ELK stack: 71% *
* Total savings are exclusive of some of the application-level optimizations done
Our Costs vs other Platforms
ELK Stack Scalability:
● Logstash: auto-scaling
● Elasticsearch: overprovisioning (nodes run at 60% capacity during peak load), predictive vertical/horizontal scaling
Handling potential data-loss while AZ is down:
● DR mechanisms in place, daily/hourly backups stored in S3; potential data loss of about 1 hour
● We do not store user-data or business metrics in ELK, users/business will not be impacted
Handling potential data-corruptions in Elasticsearch:
● DR mechanisms in place, recover index from S3 index-snapshots
Managing downtime during spot take-backs:
● Logstash: multiple nodes, minimal impact
● Elasticsearch/Kibana: < 10min downtime per node
● Use previous generation instance types as spot take-back chances are comparatively low
Key Takeaways
Handling back-pressure when a node is down:
● Filebeat: will auto-retry to send old logs
● Logstash: use ‘date’ filter for document timestamp, auto-scaling
● Elasticsearch: overprovisioning
Other log analytics alternatives:
● We have only evaluated ELK, Splunk and Sumo Logic
ELK stack upgrade path:
● Blue Green deployment for major version upgrade
Key Takeaways
● We built a platform tailored to our requirements, yours might be different...
● Building a log analytics platform is not rocket science, but it can be painfully iterative if you
are not aware of the options
● Be aware of the trade-offs you are ‘OK with’ and you can roll out a solution optimised for
your specific requirements
Reflection
Thank you!
Happy to take your questions.
Copyright Disclaimer: All rights to the materials used for this presentation belong to their respective owners.

Exploring opportunities with communities for a successful career
 
Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3
 
Cloud Security
Cloud SecurityCloud Security
Cloud Security
 
Amazon EC2 Spot Instances
Amazon EC2 Spot InstancesAmazon EC2 Spot Instances
Amazon EC2 Spot Instances
 
Cost Optimization in AWS
Cost Optimization in AWSCost Optimization in AWS
Cost Optimization in AWS
 
Keynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practicedKeynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practiced
 

Recently uploaded

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 

Log analytics with ELK stack

  • 1. Log Analytics with ELK Stack (Architecture for aggressive cost optimization and infinite data scale) Denis D’Souza | 27th July 2019
  • 2. About me...
    ● Currently a DevOps engineer at Moonfrog Labs
    ● 6+ years working as a DevOps Engineer, SRE and Linux administrator; worked on a variety of technologies in both service-based and product-based organisations
    ● How do I spend my free time? Learning new technologies and playing PC games
    www.linkedin.com/in/denis-dsouza
  • 3. Who we are? (Current scale)
    • A mobile gaming company making mass-market social games
    • 5M+ daily active and 15M+ weekly active users
    • Real-time, cross-platform games optimized for our primary market(s): India and the subcontinent
    • Profitable!
  • 4. Our problem statement
    1. Our business requirements
    2. Choosing the right option
    3. ELK Stack overview
    4. Our ELK architecture
    5. Optimizations we did
    6. Cost savings
    7. Key takeaways
  • 5. Our business requirements
    ● Log analytics platform (Web-Server, Application, Database logs)
    ● Data ingestion rate: ~300GB/day
    ● Frequently accessed data: last 8 days
    ● Infrequently accessed
    ● Uptime: 99.90%
    ● Hot retention period: 90 days
    ● Cold retention period: 90 days (with potential to increase)
    ● Simple and cost-effective solution
    ● Fairly predictable concurrent user-base
    ● Not to be used for storing user/business data
  • 6. Choosing the right option
                     | ELK stack             | Splunk                                                  | Sumo Logic
    Product          | Self managed          | Cloud                                                   | Professional
    Pricing          | ~ $30 per GB / month  | ~ $100 per GB / month *                                 | ~ $108 per GB / month *
    Data ingestion   | ~ 300 GB / day        | ~ 100 GB / day * (post ingestion custom pricing)        | ~ 20 GB / day * (post ingestion custom pricing)
    Retention        | ~ 90 days             | ~ 90 days *                                             | ~ 30 days *
    Cost/GB/day      | ~ $0.98 per GB / day  | ~ $3.33 per GB / day *                                  | ~ $3.60 per GB / day *
    * Values are estimates taken from the product pricing web pages of the respective products; they may not represent the actual values and are meant for comparison only.
    References:
    https://www.splunk.com/en_us/products/pricing/calculator.html#tabs/tab2
    https://www.sumologic.com/pricing/apac/
  • 8. ELK Stack overview: Terminologies
    ● Index
    ● Shard
      ○ Primary
      ○ Replica
    ● Segment
    ● Node
    References: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/_basic_concepts.html
  • 10. Our ELK architecture: Hot-Warm-Cold data storage (infinite scale)
  • 11. Our ELK architecture: Size and scale
      | Service       | Number of nodes | Total CPU cores | Total RAM | Storage (EBS)
    1 | Elasticsearch | 7               | 28              | 141 GB    |
    2 | Logstash      | 3               | 6               | 12 GB     |
    3 | Kibana        | 1               | 1               | 4 GB      |
      | Total         | 11              | 35              | 157 GB    | ~ 20 TB
    Data ingestion per day: ~ 300 GB
    Hot retention period: 90 days
    Docs/sec (at peak load): ~ 7K
  • 12. Optimizations we did
    Application side
    ● Logstash
    ● Elasticsearch
    Infrastructure side
    ● EC2
    ● EBS
    ● Data transfer
  • 13. Optimizations we did: Application side Logstash
  • 14. Optimizations we did: Logstash
    Pipeline workers:
    ● Set "pipeline.workers" to 4x the number of cores to improve CPU utilisation on the Logstash server (threads may spend significant time in an I/O wait state)
    logstash.yml:
      ### Core-count: 2 ###
      ...
      pipeline.workers: 8
      ...
    References: https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html
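    The sizing rule on this slide can be written down as a tiny helper. This is only a sketch: the function names are hypothetical, and the 4x multiplier is the value the slide reports for this workload, not a general recommendation.

```python
def suggested_pipeline_workers(cores, multiplier=4):
    """Return a pipeline.workers value as a multiple of the CPU core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cores * multiplier


def logstash_yml_line(cores):
    """Render the matching logstash.yml line for a host with `cores` cores."""
    return f"pipeline.workers: {suggested_pipeline_workers(cores)}"


# The slide's example host has 2 cores.
print(logstash_yml_line(2))
```

    A multiplier above 1 only pays off when worker threads are mostly waiting on I/O; a CPU-bound pipeline would likely want a value closer to the core count.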
  • 15. Optimizations we did: Logstash
    'info' logs:
    ● Stored application 'info' logs in a separate index with a shorter retention policy
    Filter config:
      if [sourcetype] == "app_logs" and [level] == "info" {
        elasticsearch {
          index => "%{sourcetype}-%{level}-%{+YYYY.MM.dd}"
          ...
    '200' response-code logs:
    ● Stored access logs with a '200' response code in a separate index with a shorter retention policy
    Filter config:
      if [sourcetype] == "nginx" and [status] == "200" {
        elasticsearch {
          index => "%{sourcetype}-%{status}-%{+YYYY.MM.dd}"
          ...
    References: https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html
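    The routing rules above boil down to choosing an index name per event. The sketch below mirrors that decision in Python so it can be checked in isolation. The fallback name for all other events ("sourcetype-date") is an assumption, since the slide elides the rest of the config, and `index_name` is a hypothetical helper rather than anything Logstash provides.

```python
from datetime import date

# (sourcetype, field, value) triples that get their own short-retention index,
# taken from the two conditionals on the slide.
SHORT_RETENTION_RULES = (
    ("app_logs", "level", "info"),
    ("nginx", "status", "200"),
)


def index_name(event, day):
    """Return the daily index an event would be written to."""
    suffix = day.strftime("%Y.%m.%d")
    sourcetype = event["sourcetype"]
    for rule_type, field, value in SHORT_RETENTION_RULES:
        if sourcetype == rule_type and str(event.get(field)) == value:
            return f"{sourcetype}-{value}-{suffix}"
    # Assumed default: everything else stays in the regular daily index.
    return f"{sourcetype}-{suffix}"


print(index_name({"sourcetype": "nginx", "status": "200"}, date(2019, 7, 27)))
```

    Because the short-retention indexes have distinct name prefixes, a retention job can delete them earlier than the main indexes by matching on the prefix.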
  • 16. Optimizations we did: Logstash
    Log 'message' field:
    ● Removed the "message" field when Logstash reported no grok failures while applying grok patterns (reduced storage footprint by ~30% per doc)
    Filter config:
      if "_grokparsefailure" not in [tags] {
        mutate {
          remove_field => ["message"]
        }
      }
    E.g. Nginx log message:
      127.0.0.1 - - [26/Mar/2016:19:09:19 -0400] "GET / HTTP/1.1" 401 194 "" "Mozilla/5.0 Gecko" "-"
    Grok pattern:
      %{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) %{USER:ident} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent} %{QS:forwarder}
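    To illustrate why dropping "message" is safe only after a successful parse, here is a minimal Python sketch of the same logic. It uses a simplified regular expression as a stand-in for the grok pattern (it is not equivalent to it), and `parse_event` is a hypothetical helper.

```python
import re

# Simplified stand-in for the grok pattern on the slide.
NGINX_LINE = re.compile(
    r'(?P<clientip>\S+) \S+ (?P<ident>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\w+) (?P<request>\S+)(?: HTTP/(?P<httpversion>[\d.]+))?" '
    r'(?P<response>\d+) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)" "(?P<forwarder>[^"]*)"'
)


def parse_event(message):
    """Parse one access-log line; keep the raw message only on failure."""
    match = NGINX_LINE.match(message)
    if match is None:
        # Same idea as the "_grokparsefailure" tag: keep the raw text so
        # nothing is lost when the pattern does not apply.
        return {"message": message, "tags": ["_grokparsefailure"]}
    # All information is now in structured fields, so the raw line is dropped.
    return match.groupdict()


line = '127.0.0.1 - - [26/Mar/2016:19:09:19 -0400] "GET / HTTP/1.1" 401 194 "" "Mozilla/5.0 Gecko" "-"'
print(parse_event(line))
```

    The saving comes from not storing the same text twice, once raw and once as fields; events that fail to parse keep their raw text for debugging.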
  • 18. Optimizations we did: Elasticsearch
    JVM heap vs non-heap memory:
    ● Optimised the JVM heap size by monitoring the GC interval, which helped utilise system memory efficiently (33% for JVM heap, 66% for non-heap) *
    jvm.options:
      ### Total system Memory 15GB ###
      -Xms5g
      -Xmx5g
    (Figure: GC patterns for "Heap too small", "Heap too large" and "Optimised Heap")
    * Recommended heap-size settings by Elastic: https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
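    The one-third split on this slide can be expressed as a small calculation. This is a sketch under stated assumptions: `heap_flags` is a hypothetical helper, the divisor of 3 is the ratio the slide arrived at for this cluster, and the 31 GB cap is an added safeguard based on Elastic's guidance to stay below the compressed object pointer threshold.

```python
def heap_flags(total_memory_gb, heap_divisor=3, max_heap_gb=31):
    """Return matching -Xms/-Xmx flags for a node with the given memory."""
    heap_gb = max(1, total_memory_gb // heap_divisor)
    # Assumed cap: keep the heap under the ~32 GB compressed-oops threshold.
    heap_gb = min(heap_gb, max_heap_gb)
    # Min and max are set to the same value so the heap is never resized.
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


# The slide's example node has 15 GB of system memory.
print("\n".join(heap_flags(15)))
```

    The memory left outside the heap is not wasted: the operating system uses it for the filesystem cache, which Lucene relies on heavily.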
  • 19. Optimizations we did: Elasticsearch
    Shards:
    ● Created templates whose number of shards is a multiple of the number of Elasticsearch nodes (this fixed the shard-distribution imbalance that had caused uneven disk and compute usage)
    Replicas:
    ● Removed replicas for the required indexes (50% savings on storage cost, ~30% reduction in compute resource utilization)
    Template config:
      ### Number of ES nodes: 5 ###
      {
        "template": "appserver-*",
        "settings": {
          "number_of_shards": "5",
          "number_of_replicas": "0",
          ...
        }
      }
    Trade-offs:
    ● Removing replicas makes search queries slower, as replicas are also used to serve search operations
    ● It is not recommended to run production clusters without replicas
  • 20. Optimizations we did: Infrastructure side
    AWS
    ● EC2
    ● EBS
    ● Data transfer (Inter AZ)
    Spotinst
    ● Platform that allows users to reliably leverage excess capacity, simplify cloud operations and save 80% on compute costs
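    A template following the "shards as a multiple of nodes" rule could be generated rather than hand-edited whenever the node count changes. The sketch below assumes a hypothetical `build_template` helper and only produces the settings shown on the slide; it does not talk to a cluster.

```python
import json


def build_template(pattern, node_count, shards_per_node=1, replicas=0):
    """Build an index template whose shard count is a multiple of the node count."""
    if node_count < 1 or shards_per_node < 1:
        raise ValueError("node_count and shards_per_node must be at least 1")
    return {
        "template": pattern,
        "settings": {
            # A multiple of the node count lets shards spread evenly.
            "number_of_shards": str(node_count * shards_per_node),
            # 0 replicas halves storage but removes redundancy (see trade-offs).
            "number_of_replicas": str(replicas),
        },
    }


print(json.dumps(build_template("appserver-*", node_count=5), indent=2))
```

    With 5 nodes and 5 primary shards, each node holds exactly one shard of every daily index, which is what evens out disk and CPU usage.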
  • 21. Optimizations we did: Infrastructure side EC2
  • 22. Optimizations we did: EC2 and spot
    Stateful EC2 Spot instances:
    ● Moved all ELK nodes to run on spot instances (instances maintain their IP address and EBS volumes)
    ● Recovery time: < 10 mins
    Trade-offs:
    ● Prefer previous-generation instance types to reduce frequent spot take-backs
  • 23. Optimizations we did: EC2 and spot
    Auto-Scaling:
    ● Performance/time-based auto-scaling for Logstash instances
  • 24. Optimizations we did: Infrastructure side EBS
  • 25. Optimizations we did: EBS
    "Hot-Warm" architecture:
    ● "Hot" nodes: store active indexes, use GP2 EBS disks (General purpose SSD)
    ● "Warm" nodes: store passive indexes, use SC1 EBS disks (Cold storage) (~69% savings on storage cost)
    elasticsearch.yml:
      node.attr.box_type: hot
      ...
    Template config:
      "template": "appserver-*",
      "settings": {
        "index": {
          "routing": {
            "allocation": {
              "require": { "box_type": "hot" }
            }
          }
        },
        ...
    Trade-offs:
    ● "Warm" nodes use SC1 EBS disks, which have lower IOPS and throughput, so search operations on them are comparatively slower
    References: https://cinhtau.net/2017/06/14/hot-warm-architecture/
  • 26. Optimizations we did: EBS
    Moving indexes to "Warm" nodes:
    ● Reallocated indexes older than 8 days to "Warm" nodes
    ● Recommended to perform this operation during off-peak hours as it is I/O intensive
    Curator config:
      actions:
        1:
          action: allocation
          description: "Move index to Warm-nodes after 8 days"
          options:
            key: box_type
            value: warm
            allocation_type: require
            timeout_override:
            continue_if_exception: false
          filters:
          - filtertype: age
            source: name
            direction: older
            timestring: '%Y.%m.%d'
            unit: days
            unit_count: 8
            ...
    References: https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x
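    The age filter in the Curator config selects indexes by the date embedded in their names. The sketch below reproduces that selection in plain Python so the cutoff can be reasoned about. It is not Curator's implementation: `indexes_to_move` and `warm_allocation_settings` are hypothetical helpers, and treating an index as "older" only when its date is strictly before the cutoff is an assumption.

```python
import re
from datetime import date, datetime, timedelta

# Matches a trailing date in the '%Y.%m.%d' form used by the daily indexes.
DATE_SUFFIX = re.compile(r"(\d{4}\.\d{2}\.\d{2})$")


def indexes_to_move(names, today, unit_count=8):
    """Return index names whose date suffix is more than unit_count days old."""
    cutoff = today - timedelta(days=unit_count)
    selected = []
    for name in names:
        match = DATE_SUFFIX.search(name)
        if match is None:
            continue  # no date in the name, e.g. ".kibana"
        index_day = datetime.strptime(match.group(1), "%Y.%m.%d").date()
        if index_day < cutoff:
            selected.append(name)
    return selected


def warm_allocation_settings():
    """Index settings that require allocation on nodes tagged box_type=warm."""
    return {"index.routing.allocation.require.box_type": "warm"}


names = ["appserver-2019.07.10", "appserver-2019.07.25", ".kibana"]
print(indexes_to_move(names, today=date(2019, 7, 27)))
print(warm_allocation_settings())
```

    Applying the allocation setting makes Elasticsearch relocate the index's shards to warm nodes, which is the I/O-heavy step the slide suggests scheduling off-peak.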
  • 27. Optimizations we did: Inter-AZ data transfer
    Single Availability Zone:
    ● Migrated all ELK nodes to a single availability zone (reduced inter-AZ data transfer cost for ELK nodes by 100%)
    ● Data transfer/day: ~700GB (Logstash to Elasticsearch: ~300GB, Elasticsearch inter-communication: ~400GB)
    Trade-offs:
    ● It is not recommended to run production clusters in a single AZ, as an AZ failure will result in downtime and potential data loss
  • 28. Data backup and restore
    Using S3 for index snapshots:
    ● Take snapshots of indexes and store them in S3
    Backup:
      curl -XPUT "http://<domain>:9200/_snapshot/s3_repository/snap1?pretty&wait_for_completion=true" -d'
      {
        "indices": "index_1,index_2",
        "ignore_unavailable": true,
        "include_global_state": false
      }'
    References:
    https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html
    https://medium.com/@federicopanini/elasticsearch-backup-snapshot-and-restore-on-aws-s3-f1fc32fbca7f
  • 29. Data backup and restore
    Restore:
      curl -s -XPOST --url "http://<domain>:9200/_snapshot/s3_repository/snap1/_restore" -d'
      {
        "indices": "index_1,index_2",
        "ignore_unavailable": true,
        "include_global_state": false
      }'
    On-demand Elasticsearch cluster:
    ● Launch an on-demand ES cluster and import the snapshots from S3
    Existing cluster:
    ● Restore the required snapshots to the existing cluster
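    The two curl calls above differ only in HTTP method and path, so they can be built from one place. The sketch below constructs the requests with the Python standard library without sending them; the host name is a placeholder, the helper names are hypothetical, and the `Content-Type` header is included because newer Elasticsearch versions reject bodies without it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def _body(indices):
    """JSON body shared by the snapshot and restore calls on the slides."""
    return json.dumps({
        "indices": ",".join(indices),
        "ignore_unavailable": True,
        "include_global_state": False,
    }).encode("utf-8")


def snapshot_request(base_url, repository, snapshot, indices):
    """Build (but do not send) the PUT request that creates a snapshot."""
    query = urlencode({"wait_for_completion": "true"})
    return Request(
        url=f"{base_url}/_snapshot/{repository}/{snapshot}?{query}",
        data=_body(indices),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def restore_request(base_url, repository, snapshot, indices):
    """Build (but do not send) the POST request that restores a snapshot."""
    return Request(
        url=f"{base_url}/_snapshot/{repository}/{snapshot}/_restore",
        data=_body(indices),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "es.example.internal" is a placeholder host, not taken from the slides.
req = snapshot_request("http://es.example.internal:9200", "s3_repository",
                       "snap1", ["index_1", "index_2"])
print(req.get_method(), req.full_url)
```

    Passing either object to `urllib.request.urlopen` would execute the call; keeping construction separate makes the URL and body easy to check before touching a cluster.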
  • 30. Disaster recovery
    Data corruption:
    ● List indexes with status 'Red'
    ● Delete the corrupted indexes
    ● Restore the indexes from S3 snapshots
    ● Recovery time: depends on the size of the data
    Node failure due to an AZ going down:
    ● Launch a new ELK cluster using AWS CloudFormation templates
    ● Make the necessary config changes in Filebeat, Logstash etc.
    ● Restore the required indexes from S3 snapshots
    ● Recovery time: depends on provisioning time and the size of the data
    Node failures due to underlying hardware issues:
    ● Recycle the node in the Spotinst console (takes an AMI of the root volume, launches a new instance, attaches the EBS volumes, maintains the private IP)
    ● Recovery time: < 10 mins/node
    Snapshot restore time (estimates):
    ● < 4 mins for a 20GB snapshot (test cluster: 3 nodes, multiple indexes with 3 primary shards each, no replicas)
  • 31. Cost savings: EC2
    EC2
    Instance type              | Service       | Daily cost
    5 x r5.xlarge (20C, 160GB) | Elasticsearch | 40.80
    3 x c5.large (6C, 12GB)    | Logstash      | 7.17
    1 x t3.medium (2C, 4GB)    | Kibana        | 1.29
    Total                      |               | ~ 49.26$
    EC2 (optimized)
    Instance type              | Service            | Daily cost (65% savings + Spotinst charges of 20% of savings) | Total savings
    5 x m4.xlarge (20C, 80GB)  | Elasticsearch Hot  | 14.64                                                         |
    2 x r4.xlarge (8C, 61GB)   | Elasticsearch Warm | 7.50                                                          |
    3 x c4.large (6C, 12GB)    | Logstash           | 3.50                                                          |
    1 x t2.medium (2C, 4GB)    | Kibana             | 0.69                                                          |
    Total                      |                    | ~ 26.33$                                                      | ~ 47%
  • 32. Cost savings: Storage
    Storage (Ingesting: 300GB/day, Retention: 90 days, Replica count: 1)
    Storage type        | Retention | Daily cost
    ~ 54TB (GP2)        | 90 days   | ~ 237.60$
    Storage (optimized) (Ingesting: 300GB/day, Retention: 90 days, Replica count: 0, Backups: daily S3 snapshots)
    Storage type        | Retention | Daily cost | Total savings
    ~ 3TB (GP2) Hot     | 8 days    | 12.00      |
    ~ 24TB (SC1) Warm   | 82 days   | 24.00      |
    ~ 27TB (S3) Backup  | 90 days   | 22.50      |
    Total               |           | ~ 58.50$   | ~ 75%
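    The storage volumes on this slide follow directly from ingestion rate, retention and copy count. The sketch below reproduces that arithmetic; it assumes 1 TB = 1000 GB and no compression or index overhead, takes the daily costs from the slide rather than from any AWS price list, and `stored_tb` is a hypothetical helper.

```python
def stored_tb(ingest_gb_per_day, days, copies=1):
    """Approximate data held for a retention window, in TB (1 TB = 1000 GB)."""
    return ingest_gb_per_day * days * copies / 1000


INGEST_GB_PER_DAY = 300

# Baseline: 90 days on GP2 with one replica, i.e. two copies of every index.
baseline_tb = stored_tb(INGEST_GB_PER_DAY, 90, copies=2)

# Optimized: no replicas, 8 days hot, 82 days warm, 90 days of S3 snapshots.
hot_tb = stored_tb(INGEST_GB_PER_DAY, 8)
warm_tb = stored_tb(INGEST_GB_PER_DAY, 82)
backup_tb = stored_tb(INGEST_GB_PER_DAY, 90)

# Daily costs as quoted on the slide (in $).
baseline_cost = 237.60
optimized_cost = 12.00 + 24.00 + 22.50
savings_pct = (1 - optimized_cost / baseline_cost) * 100

print(f"baseline: {baseline_tb} TB, hot: {hot_tb} TB, "
      f"warm: {warm_tb} TB, backup: {backup_tb} TB")
print(f"optimized daily cost: {optimized_cost:.2f}$, savings: {savings_pct:.0f}%")
```

    The computed hot and warm volumes (2.4 TB and 24.6 TB) differ slightly from the rounded figures on the slide (~3 TB and ~24 TB), which presumably reflect provisioned disk sizes.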
  • 33. Total savings
                       | ELK stack  | ELK stack (optimized) | Savings
    EC2                | 49.40      | 26.33                 | 47%
    Storage            | 237.60     | 58.50                 | 75%
    Data-transfer      | 7          | 0                     | 100%
    Total (daily cost) | ~ 294.00$  | ~ 84.83$              | ~ 71% *
    Cost/GB (daily)    | ~ 0.98$    | ~ 0.28$               |
    * Total savings are exclusive of some of the application-level optimizations done
  • 34. Our costs vs other platforms
                   | ELK Stack (optimized) | ELK Stack            | Splunk                                           | Sumo Logic
    Product        | Self managed          | Self managed         | Cloud                                            | Professional
    Data ingestion | ~ 300GB/day           | ~ 300GB/day          | ~ 100 GB / day * (post ingestion custom pricing) | ~ 20 GB / day * (post ingestion custom pricing)
    Retention      | ~ 90 days             | ~ 90 days            | ~ 90 days *                                      | ~ 30 days *
    Cost/GB/day    | ~ $0.28 per GB / day  | ~ $0.98 per GB / day | ~ $3.33 per GB / day *                           | ~ $3.60 per GB / day *
    Savings over traditional ELK stack: 71% *
    * Total savings are exclusive of some of the application-level optimizations done
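    The totals in this table can be recomputed from the per-category daily costs. The sketch below does that with the figures exactly as they appear on the slide; the variable and function names are illustrative only.

```python
INGEST_GB_PER_DAY = 300

# Daily costs in $ as quoted in the table above.
baseline = {"EC2": 49.40, "Storage": 237.60, "Data-transfer": 7.00}
optimized = {"EC2": 26.33, "Storage": 58.50, "Data-transfer": 0.00}


def savings_pct(before, after):
    """Percentage saved when a cost drops from `before` to `after`."""
    return (1 - after / before) * 100


baseline_total = round(sum(baseline.values()), 2)
optimized_total = round(sum(optimized.values()), 2)

for category in baseline:
    pct = savings_pct(baseline[category], optimized[category])
    print(f"{category}: {pct:.0f}% saved")

print(f"total: {baseline_total:.2f}$ -> {optimized_total:.2f}$ "
      f"({savings_pct(baseline_total, optimized_total):.0f}% saved)")
print(f"cost/GB/day: {baseline_total / INGEST_GB_PER_DAY:.2f}$ -> "
      f"{optimized_total / INGEST_GB_PER_DAY:.2f}$")
```

    Dividing the daily total by the 300 GB ingested per day is what yields the per-GB figures used in the platform comparison.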
  • 35. Key Takeaways
    ELK Stack scalability:
    ● Logstash: auto-scaling
    ● Elasticsearch: overprovisioning (nodes run at 60% capacity during peak load), predictive vertical/horizontal scaling
    Handling potential data loss while an AZ is down:
    ● DR mechanisms in place, daily/hourly backups stored in S3; potential data loss of about 1 hour
    ● We do not store user data or business metrics in ELK, so users/business will not be impacted
    Handling potential data corruption in Elasticsearch:
    ● DR mechanisms in place, recover indexes from S3 index snapshots
    Managing downtime during spot take-backs:
    ● Logstash: multiple nodes, minimal impact
    ● Elasticsearch/Kibana: < 10 min downtime per node
    ● Use previous-generation instance types, as their spot take-back chances are comparatively low
  • 36. Key Takeaways
    Handling back-pressure when a node is down:
    ● Filebeat: will auto-retry sending old logs
    ● Logstash: use the 'date' filter for the document timestamp, auto-scaling
    ● Elasticsearch: overprovisioning
    Other log analytics alternatives:
    ● We have only evaluated ELK, Splunk and Sumo Logic
    ELK stack upgrade path:
    ● Blue-green deployment for major version upgrades
  • 37. Reflection
    ● We built a platform tailored to our requirements; yours might be different...
    ● Building a log analytics platform is not rocket science, but it can be painfully iterative if you are not aware of the options
    ● Be aware of the trade-offs you are 'OK with' and you can roll out a solution optimised for your specific requirements
  • 38. Thank you! Happy to take your questions.
    Copyright disclaimer: All rights to the materials used for this presentation belong to their respective owners.