Scaling Logging and Monitoring
JP Parkin (jpparkin@ca.ibm.com)
User Scenarios
• Internal IBM Infrastructure
– A relatively small number of groups that generate very large
log volumes (each wants to generate 3-5 TB/day)
– Logs are produced on VMs running various cloud
services operated by IBM
• External Bluemix Log Producers
– A relatively large number of groups (Bluemix
organizations) that generate a variety of log data but
in much smaller quantities – total logs range
anywhere from kilobytes to gigabytes per day
– Only a handful of organizations are currently
generating large volumes of log data
Bluemix Logging
Bluemix Metrics
Advanced View - Kibana
Advanced View - Grafana
Grafana – Build your own Dashboard
Service Architecture
Key facts
• OpenStack Heat automation
– Multiple AutoScale Groups (ASGs)
– Docker image per ASG
– Ansible to configure software
• Currently deployed on OpenStack
– Each virtual machine hosts a single Docker container
– Security groups for firewall rules
– HAProxy for load balancing
Deployment and Automation
• OpenStack deployment using Heat templates
– Provides scale-up/down capabilities to add capacity when needed
• Ansible configuration automation is integrated with the Heat
deployment to configure the nodes
• Docker images are our standard deployment artifact
(configured by Ansible)
• Jenkins jobs build and test the Docker images
• UCD (UrbanCode Deploy) automation handles deployment and upgrade
processing and provides operational tracking of what is deployed to
each of the environments
• A mixture of Jenkins and UCD jobs manages the daily operations,
including items such as data expiration, index pre-creation and
various health check scripts
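The data expiration and index pre-creation jobs mentioned above can be sketched roughly as follows. This is only an illustration: the per-tenant daily index naming scheme, the tenant names, and the 14-day retention value are assumptions, not details from the slides.

```python
from datetime import date, timedelta


def index_name(tenant: str, day: date) -> str:
    # Hypothetical naming scheme: one index per tenant per day.
    return f"logstash-{tenant}-{day:%Y.%m.%d}"


def indices_to_precreate(tenants, today, days_ahead=1):
    """Names of the indices a pre-creation job would create for the coming days."""
    return [
        index_name(tenant, today + timedelta(days=offset))
        for offset in range(1, days_ahead + 1)
        for tenant in tenants
    ]


def expired_days(existing_days, today, retention_days):
    """Days whose indices fall outside the retention window and can be dropped."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in existing_days if d < cutoff)


if __name__ == "__main__":
    today = date(2015, 9, 1)
    print(indices_to_precreate(["acme", "globex"], today))
    print(expired_days([date(2015, 8, 1), date(2015, 8, 30)], today, retention_days=14))
```

Creating tomorrow's indices ahead of time keeps index creation off the ingest path at the day boundary, which is presumably why it runs as a scheduled job.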
Node Configurations
System                     CPU  Memory  Java Heap  Local Disk
Lumberjack                 4    8 GB    5 GB       25 GB
Logstash                   4    8 GB    5 GB       25 GB
Kafka                      4    8 GB    3 GB       25 GB + 5 TB volume
Elasticsearch Master Node  10   32 GB   16 GB      25 GB
Elasticsearch Http Node    10   32 GB   16 GB      25 GB
Elasticsearch Data Node    20   64 GB   30 GB      18 TB spinning local RAID disk
Multi-tenant Logstash Forwarder
• Took the Logstash Forwarder and added multi-tenancy
capabilities
• Made similar changes to the Logstash lumberjack input plugin
• Fixed log rotation in the MT-LSF – it was
triggering disk-full problems on clients because it
held locks on files for up to 24 hours before timing
out
• Found that increasing the spool size improved
performance up to a certain point. 512
was the sweet spot; larger values ended up
performing worse.
Multi-tenant Lumberjack Server
• The Lumberjack server had issues with long-lasting
connections and file descriptor leaks that required
frequent restarts under load
• Terminating connections on the client gave better
server utilization (it forced a load balancer switch) but
didn’t resolve the underlying issue
• The Logstash 1.5.2 lumberjack plugin solved the
connection problems with a fix to the JRuby OpenSSL
library, which had been leaking file descriptors
under load
• Switching the Kafka output plugin to run in async mode
gave some performance improvement (10-15%)
Logstash Lumberjack Performance
• The great thing about Logstash is that it’s a Swiss Army knife for
solving data transformation problems
• 12 Lumberjack servers in a cluster can process about 50 MB/s ≈ 4.3
TB/day, which is pretty good for most logging applications
• If you only use the basic input/output functionality, then
a purpose-built solution can give better
performance
• We are prototyping a replacement for the Logstash server that handles
the mt-lumberjack processing, and initial results are very good –
around a 12x throughput improvement on the same hardware
• The queuing mechanism that makes Logstash very flexible also turns out
to be one of the bottlenecks when stressing the platform
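The throughput figure above can be checked with a short calculation. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and an even spread of load across the 12 servers; the slides state neither.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def mb_per_sec_to_tb_per_day(mb_per_sec: float) -> float:
    """Convert a sustained rate in MB/s to TB/day using decimal units."""
    return mb_per_sec * SECONDS_PER_DAY / 1_000_000


cluster_rate = 50.0  # MB/s for the whole cluster, from the slide
servers = 12         # Lumberjack servers in the cluster, from the slide

print(f"cluster: {mb_per_sec_to_tb_per_day(cluster_rate):.2f} TB/day")  # cluster: 4.32 TB/day
print(f"per server: {cluster_rate / servers:.2f} MB/s")                 # per server: 4.17 MB/s
```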
Kafka
• Distributed messaging system for buffering log
and metric data
• We keep 3 days’ worth of data so that we can
absorb input spikes and buffer logs when
Elasticsearch or the Logstash indexers are not
performing well
• Kafka’s own logs can become quite large
when errors occur, so getting the logging
settings right is important
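A rough way to size the Kafka buffer for a retention window like this is sketched below. The 50 MB/s ingest rate is borrowed from the Lumberjack performance slide, and the replication factor is an assumption because the slides do not state one.

```python
def retention_storage_tb(ingest_mb_per_sec, retention_days, replication_factor=1):
    """Disk needed to retain `retention_days` of data at a sustained rate (decimal TB)."""
    seconds = retention_days * 24 * 60 * 60
    return ingest_mb_per_sec * seconds * replication_factor / 1_000_000


# 3 days of retention at 50 MB/s, without and with a hypothetical replication factor of 2
print(retention_storage_tb(50, 3))
print(retention_storage_tb(50, 3, replication_factor=2))
```

At 50 MB/s, three days of unreplicated data comes to about 12.96 TB, so with the 5 TB volume per Kafka node listed in the node configurations, at least three brokers would be needed before any replication or headroom.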
Logstash Indexers
• Logstash Indexers are responsible for processing the
log entries and pushing the data to Elasticsearch
• Stability of the Logstash 1.4.2 plugins for ES was not good
– Tried all 3 protocols (node, transport, http)
– Node was fast but had issues when large metadata was
transferred on ES node failures (frequent OOM)
– Transport had reasonable performance and stability but
did not have multi-node support
– Http had the best performance after tuning to use a larger
batch size, but did not have multi-node support
• The Logstash 1.5.2 ES plugins all have multi-node support
• Settled on the 1.5.2 Http protocol version running
against dedicated http client nodes in the cluster
Logstash Indexers
• Even with Logstash 1.5.2, the indexers are
somewhat limited by the amount of data a
single node can process
• Expanded the number of Kafka partitions to
allow growth beyond the initial 19 partitions
we had allocated for the logging topic
• Logstash indexers can now be scaled beyond 19
nodes, to the point where we
can stress the ES cluster
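The partition count matters because, within one Kafka consumer group, each partition is read by at most one consumer, so indexers beyond the partition count sit idle. The sketch below uses a simplified round-robin assignment rather than Kafka's actual assignor, and the 25-indexer and 40-partition figures are hypothetical.

```python
def assign_partitions(num_partitions, num_consumers):
    """Round-robin partitions over the consumers of a single consumer group."""
    assignment = {consumer: [] for consumer in range(num_consumers)}
    for partition in range(num_partitions):
        assignment[partition % num_consumers].append(partition)
    return assignment


def active_consumers(num_partitions, num_consumers):
    """Count consumers that received at least one partition."""
    assignment = assign_partitions(num_partitions, num_consumers)
    return sum(1 for partitions in assignment.values() if partitions)


print(active_consumers(19, 25))  # 19: six of the 25 indexers would be idle
print(active_consumers(40, 25))  # 25: every indexer has work
```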
Indexing Log Data
• Relying on your users to be well behaved is
dangerous – some logs contain what appears to
be a well-formed JSON document but use a GUID as a
key, and all of a sudden the field metadata
explodes in ES
• Need to monitor which documents you run
through the json filter in Logstash
• Adding filters to Logstash also slows down the
indexing process, especially if you
use many of the cool plugins
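One way to watch for this kind of field explosion is to audit sample documents before they reach the json filter. The sketch below is a hypothetical standalone check, not part of Logstash; the `max_fields` threshold and the sample documents are made up for illustration.

```python
import re

GUID_RE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def field_paths(doc, prefix=""):
    """Yield a dotted path for every leaf field in a nested JSON object."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from field_paths(value, path)
        else:
            yield path


def audit(docs, max_fields=1000):
    """Return (distinct field count, fields containing a GUID-like key, over-limit flag)."""
    fields = set()
    guid_fields = 0
    for doc in docs:
        for path in field_paths(doc):
            if path not in fields:
                fields.add(path)
                if any(GUID_RE.match(part) for part in path.split(".")):
                    guid_fields += 1
    return len(fields), guid_fields, len(fields) > max_fields


docs = [
    {"msg": "ok", "sessions": {"3f2504e0-4f89-11d3-9a0c-0305e82c3301": {"state": "open"}}},
    {"msg": "ok", "sessions": {"9b2d7c1a-1c3e-4f5a-8b6d-2e7f9a0b1c2d": {"state": "closed"}}},
]
print(audit(docs, max_fields=2))  # (3, 2, True)
```

Each new GUID key adds a new field, so the distinct field count grows with the number of documents instead of levelling off.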
Elasticsearch
• If your network is a problem, then ES is not going to be
happy
• Elasticsearch 1.4.4 did not react well to network blips –
indexes would start shuffling themselves trying to
proactively recover, which generally resulted in long
recovery times with default configurations
• With the default recovery settings, clusters remained
red or yellow for extended periods, which impacted
data ingestion
• Elasticsearch 1.7.1 has been much more stable for us
Sharding
• Pre-allocating the right number of shards for an
index is hard if you don’t know how much data
you are going to get
• A target that seems to work well is about 25 GB per
shard
• Problems with shard size are really highlighted
when you need to recover a failed node
• How many shards can you put in an ES cluster?
– We found 80k was too many, so we changed how we
allocate shards based on historical usage
– We think that for our clusters the limit is about 40k
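The 25 GB per shard target translates into a simple sizing rule. The sketch below is an assumption about how such a rule might look, not the allocation logic the team actually used; the expected index sizes are made-up inputs that would in practice come from historical usage.

```python
import math

TARGET_SHARD_GB = 25  # per-shard target from the slide


def primary_shards(expected_index_gb, target_gb=TARGET_SHARD_GB):
    """Primary shards needed to keep each shard near the target size."""
    return max(1, math.ceil(expected_index_gb / target_gb))


def total_shards(primaries, replicas):
    """Shards the cluster must hold for one index, counting replica copies."""
    return primaries * (1 + replicas)


big = primary_shards(120)    # hypothetical tenant expected to produce 120 GB
small = primary_shards(0.5)  # hypothetical tenant expected to produce 0.5 GB
print(big, total_shards(big, replicas=2))      # 5 15
print(small, total_shards(small, replicas=2))  # 1 3
```

Sizing small tenants at a single primary shard is what keeps the cluster-wide shard count down when most tenants produce little data.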
Elasticsearch Configurations
• 3 master nodes
• 10 data nodes per cluster
• 3 http nodes per cluster for queries
• 30 GB heap
• 2 data replicas to allow 2 node failures
• index.translog.flush_threshold_size: 1g
• indices.fielddata.cache.size: 50%
Elasticsearch Recovery
• Increase the rate at which an index can
recover
indices.recovery.max_bytes_per_sec: 200mb
• Increase the number of concurrent recoveries supported
cluster.routing.allocation.node_concurrent_recoveries: 500
• Because the Kafka cluster buffers data, we have
a window in which data can be delayed in
reaching ES during recovery
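The slides do not say how these recovery settings were applied. One option, assuming both are dynamically updatable in the Elasticsearch version in use, is to send them as the body of a `PUT /_cluster/settings` request. The sketch below only builds that JSON body; choosing `transient` over `persistent` is an assumption.

```python
import json

# Values are taken from the slide; the "transient" scope is an assumption.
recovery_settings = {
    "transient": {
        "indices.recovery.max_bytes_per_sec": "200mb",
        "cluster.routing.allocation.node_concurrent_recoveries": 500,
    }
}

body = json.dumps(recovery_settings, indent=2)
print(body)
```

Transient settings are lost on a full cluster restart, so putting the same keys in `elasticsearch.yml` would be the alternative for a permanent change.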
Elasticsearch Load Testing
• Run client drivers to simulate traffic into the
external stack
• We have a number of sample workloads from
real tenants that we use in our tests
• There are lots of knobs to tune ES so having
some consistent workloads to validate our
theories has been invaluable
Performance
• Clusters running in production can support up
to around 70k records/sec (30 MB/s) based
on our monitoring
• In our performance environments we are
seeing consistent numbers beyond 40 MB/s
• For larger indexes, increasing the number of
shards helped – 50 GB of logs spread across
10 shards loaded about 50% faster than
with 5 shards
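Two quick figures follow from these numbers: the implied average record size, and how long a 50 GB load would take at a given sustained rate. The sketch assumes decimal units, and pairing the 50 GB load with the 40 MB/s rate is an illustration rather than a measurement from the slides.

```python
def avg_record_bytes(mb_per_sec, records_per_sec):
    """Average record size implied by a throughput pair."""
    return mb_per_sec * 1_000_000 / records_per_sec


def load_minutes(size_gb, mb_per_sec):
    """Minutes needed to load `size_gb` of logs at a sustained rate."""
    return size_gb * 1000 / mb_per_sec / 60


print(round(avg_record_bytes(30, 70_000)))  # 429 bytes per record
print(round(load_minutes(50, 40), 1))       # 20.8 minutes
```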
Adjust throttling for loading large indices
[Bar chart, 10 shards: load rate in MB/sec (axis 0 to 45) by throttling setting: baseline, 20 mb throttle, 100 mb throttle, no throttle]
Scaling Elasticsearch
Multiple Elasticsearch Clusters
• Tenants get placed onto an ES cluster
• Tribe nodes to federate access across ES clusters
– Enables massive tenants spanning ES clusters

Salesforce Community Group Quito, Salesforce 101
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Toronto High Scalability meetup - Scaling ELK

  • 1. Scaling Logging and Monitoring JP Parkin (jpparkin@ca.ibm.com)
  • 2. User Scenarios • Internal IBM Infrastructure – Relatively small number of groups that generate a ton of logs ( groups that want to generate 3-5 TB/day ) – Logs are produced on VMs running various cloud services operated by IBM • External Bluemix Log Producers – Relatively large number of groups ( Bluemix organizations ) that generate a variety of log data but in relatively smaller quantities – total logs measured anywhere from kilobytes to gigabytes / day – Only a handful of organizations are currently generating large volumes of log data
  • 6. Advanced View - Grafana
  • 7. Grafana – Build your own Dashboard
  • 8. Service Architecture Key facts • OpenStack Heat automation – Multiple AutoScale Groups (ASGs) – Docker image per ASG – Ansible to configure s/w • Currently deployed on OpenStack – Virtual Machines host single docker container – Security groups for firewall rules – HAProxy for load balancing
  • 9. Deployment and Automation • OpenStack deployment using Heat templates – Provides scale-up/down capabilities to add capacity when needed • Ansible configuration automation integrated with the Heat deployment to configure the nodes • Docker images are used as our standard deployment artifact ( configured by Ansible ) • Jenkins jobs for building and testing the Docker images • UCD automation for deployment and upgrade processing – provides operational management for tracking what is deployed to each of the environments • Mixture of Jenkins and UCD for jobs to manage the daily operations, including items such as data expiration, index pre-creation and various health check scripts.
  • 10. Node Configurations
      System                      CPU  Memory  Java Heap  Local Disk
      Lumberjack                  4    8 GB    5 GB       25 GB
      Logstash                    4    8 GB    5 GB       25 GB
      Kafka                       4    8 GB    3 GB       25 GB + 5 TB volume
      Elasticsearch Master Node   10   32 GB   16 GB      25 GB
      Elasticsearch Http Node     10   32 GB   16 GB      25 GB
      Elasticsearch Data Node     20   64 GB   30 GB      18 TB spinning local RAID disk
  • 11. Multi-tenant Logstash Forwarder • Took the logstash forwarder and added multi-tenancy capabilities • Made similar changes to the logstash lumberjack input plugin • Fixed log rotation in the MT-LSF – it was triggering disk-full problems on clients because it held locks on files for up to 24 hours before timing out • Found that increasing the spool size improved performance up to a point. 512 was the sweet spot; larger values ended up performing worse.
  • 12. Multi-tenant Lumberjack Server • Lumberjack server had issues with long-lasting connections and file descriptor leaks that required frequent restarts under load • Terminating connections on the client gave better server utilization ( forced load balancer switch ) but didn’t resolve the underlying issue • The Logstash 1.5.2 lumberjack plugin solved the connection problems with a fix to the JRuby OpenSSL library, which was leaking file descriptors under load • Switching the kafka output plugin to run with async gave some performance improvements ( 10-15% )
  • 13. Logstash Lumberjack Performance • The great thing about logstash is that it’s a Swiss Army Knife for solving data transformation problems • 12 Lumberjack servers in a cluster can process about 50 MB/s ≈ 4.3 TB/day, which is pretty good for most logging applications • If you are only using the basic input / output functionality, then a purpose-built solution can give better performance • We are prototyping a replacement logstash server to handle the mt-lumberjack processing, and initial results are very good – in the area of 12x throughput improvement on the same hardware • The queuing mechanism that makes logstash very flexible also turns out to be one of the bottlenecks when stressing the platform
  • 14. Kafka • Distributed messaging system for buffering log and metric data • We keep 3 days’ worth of data to allow us to handle input spikes and to buffer logs when Elasticsearch or the logstash indexers are not performing well • Logs for Kafka itself can become quite large when errors occur, so getting the right logging settings is important
  • 15. Logstash Indexers • Logstash Indexers are responsible for processing the log entries and pushing the data to Elasticsearch • Stability of Logstash 1.4.2 plugins for ES was not good – Tried all 3 protocols ( node, transport, http ) – Node was fast but had issues when large metadata was transferred on ES node failures ( frequent OOM ) – Transport had reasonable performance and stability but did not have multi-node support – Http had the best performance after tuning to use a larger batch size, but did not have multi-node support • Logstash 1.5.2 ES plugins all have multi-node support • Settled on the 1.5.2 Http protocol version running against dedicated http client nodes in the cluster
  • 16. Logstash Indexers • Even with Logstash 1.5.2, the indexers are somewhat gated by the amount of data a single node can process • Expanded the number of Kafka partitions to allow growth beyond the initial 19 partitions we had allocated for the logging topic • Logstash indexers can now be scaled beyond 19 nodes, to the point where we can stress the ES cluster
  • 17. Indexing Log Data • Relying on your users to be well behaved is dangerous – some logs contain what appear to be well-formed JSON documents with a GUID as a key, and all of a sudden the field metadata explodes in ES • Need to monitor which documents you run through the json filter in Logstash • Adding filters to Logstash also slows down the indexing process, especially if you are attempting to use many of the cool plugins
  • 18. Elasticsearch • If your network is a problem, then ES is not going to be happy • Elasticsearch 1.4.4 did not react well to network blips – indexes would start shuffling themselves trying to proactively recover, which generally resulted in long recovery times with default configurations • The default recovery settings meant clusters remained red or yellow for extended periods, which impacted data ingestion • Elasticsearch 1.7.1 has been much more stable for us
  • 19. Sharding • Pre-allocating the right number of shards for an index is hard if you don’t know how much data you are going to get • A target that seems to work well is about 25 GB per shard • Problems with shard size are really highlighted when you need to recover a failed node • How many shards can you put in an ES cluster? – We found 80k was too many -> changed how we allocated shards based on historical usage – We think that for our clusters the practical number is about 40k
  • 20. Elasticsearch Configurations • 3 master nodes • 10 data nodes per cluster • 3 http nodes per cluster for queries • 30 GB heap • 2 data replicas to allow 2 node failures • index.translog.flush_threshold_size: 1g • indices.fielddata.cache.size: 50%
  • 21. Elasticsearch Recovery • Increase the rate at which an index can recover: indices.recovery.max_bytes_per_sec: 200mb • Increase the number of concurrent recoveries supported: cluster.routing.allocation.node_concurrent_recoveries: 500 • Having the Kafka cluster buffer data gives us a window during recovery in which data is delayed getting to ES rather than lost
  • 22. Elasticsearch Load Testing • Run client drivers to simulate traffic into the external stack • We have a number of sample workloads from real tenants that we use in our load tests • There are lots of knobs to tune in ES, so having some consistent workloads to validate our theories has been invaluable
  • 23. Performance • Clusters running in production can support up to around 70k records/sec ( 30 MB/s ) based on our monitoring • In our performance environments we are seeing consistent numbers beyond 40 MB/s • For larger indexes, increasing the number of shards improved load speed – 50 GB of logs spread across 10 shards was loaded about 50% faster than with 5 shards
  • 24. Adjust throttling for loading large indices [Chart: load rate in MB/sec ( axis 0 to 45 ) by throttling setting ( baseline 20 mb, throttle 100 mb, throttle none ), 10 shards]
  • 25. Scaling Elasticsearch: Multiple Elasticsearch Clusters • Tenants get placed onto an ES cluster • Tribe nodes to federate access across ES clusters – Enables massive tenants spanning ES clusters
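
The throughput and sharding figures quoted in slides 13, 19 and 23 can be cross-checked with some back-of-the-envelope arithmetic. The sketch below is not from the deck: the function names are hypothetical, decimal units (1 TB = 1,000,000 MB) are assumed, and the 3 TB daily index is only an illustrative input taken from the 3-5 TB/day range in the user scenarios.

```python
# Back-of-the-envelope capacity arithmetic using figures quoted in the slides.
# Assumptions (not from the deck): decimal units, hypothetical function names.
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def tb_per_day(mb_per_sec):
    """Convert a sustained ingest rate in MB/s to TB/day (decimal units)."""
    return mb_per_sec * SECONDS_PER_DAY / 1_000_000


def shards_for(index_size_gb, target_gb_per_shard=25.0):
    """Primary shards needed to keep each shard near the target size."""
    return max(1, math.ceil(index_size_gb / target_gb_per_shard))


def avg_record_bytes(mb_per_sec, records_per_sec):
    """Average record size implied by a byte rate and a record rate."""
    return mb_per_sec * 1_000_000 / records_per_sec


if __name__ == "__main__":
    # Slide 13: 12 Lumberjack servers at about 50 MB/s.
    print(f"50 MB/s is {tb_per_day(50):.2f} TB/day")
    # Slide 23: production clusters at about 30 MB/s and 70k records/sec.
    print(f"30 MB/s is {tb_per_day(30):.2f} TB/day")
    print(f"average record is about {avg_record_bytes(30, 70_000):.0f} bytes")
    # Slide 19: about 25 GB per shard; 3 TB/day is an illustrative input.
    print(f"a 3 TB daily index needs {shards_for(3000)} primary shards")
```

The 25 GB target is a sizing guide for recovery rather than a load-speed optimum: by this rule a 50 GB index would get 2 shards, whereas slide 23 reports that the same data loaded faster on 10 shards than on 5, so the two goals have to be traded off against each other.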