How KeyBank Used Elastic to Build an Enterprise Monitoring Solution

Mick Miller
Senior Product Manager, Cloud Native
KeyBank
mick_miller@keybank.com
@mickmill
Lessons from the trenches
How KeyBank used Elastic to build an enterprise monitoring solution

Business and technology landscape
• 15 states
• 1,100+ branches
• 1,400+ ATMs
• 17,206 employees
• $143.6B assets
• $6.4B revenue
• 2 datacenters

1. Design by entropy
2. The floodgates open
3. Urgency

Business and technology problems
21+ monitoring systems resulted in:
• Slow MTTR, great MTTB
• Difficulty in identifying the root cause of problems
• Poor mobile application satisfaction scores
• Poor branch workstation performance metrics


The floodgates open
Enterprise Monitoring transformation timeline
• 2017–2018: Architecture
• 2018: Pilot/Prod
• 2019: Scale out
Late 2017
• All new monitoring stopped
• Log storage cost and stability issues
• Huge backlog of monitoring requests
Early 2018
• Critical situation with eastern WA branches
• No visibility into root cause
• Elasticsearch dev environment functional
• Deployed Metricbeat and Winlogbeat
The floodgates opened!
Two days later...
• Root causes identified
• Remediation project launched


Urgency
Pathway to production cleared
First production cluster was deployed mid-2018
Funding provided by retiring 5 of the 21 existing monitoring systems
• More systems retired in 2018 and 2019, with plans for more in 2020
Estimated savings over $5M
• No accurate way to estimate savings from eliminating all the different support and skill sets required to support 21 systems
Cluster immediately started to experience performance issues; tuning and scaling became the highest priority

Lessons from KeyBank's monitoring transformation
1. Cluster size estimations
2. Architecture and design approaches
3. Automation
4. Feedback loops and monitoring
5. Unexpected business value

Cluster size estimation
Bad news: it's nearly impossible to precisely predict your workload
• Index growth, data ingestion, query usage, etc.
• Expect to iterate three or more times
• You're always wrong the first time
• Elasticsearch is not an RDBMS and does not scale like a datastore
There's no way anyone can tell you how to design a perfect cluster, and those who claim they do are liars.
—Fred de Villamil, Elasticsearch for fun and profit
Good news: it is possible to design for growth
• Start small and isolated; KeyBank used a dedicated eight-server hyper-converged cluster
• Plan on scaling out and up, with a preference for scaling out
• Automation is the key to scaling out and up without outages

Development cluster: first iteration
• No separation

Development cluster: current iteration
• Breakout of Elasticsearch services: data nodes, master nodes, coordinator nodes

Production cluster: first iteration
• Master and data nodes only

Production cluster: current iteration
• Breakout of cluster services
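
The move from "no separation" to dedicated data, master, and coordinator nodes comes down to per-node role settings. The sketch below renders those settings for each tier. It assumes Elasticsearch 7.9 or later, where the `node.roles` setting exists; the deck does not state which version KeyBank ran, and the node names and the `render_node_settings` helper are hypothetical.

```python
# Role layout mirroring the three tiers named on the slides.
# Assumption: Elasticsearch 7.9+, where `node.roles` is the role setting.
TIER_ROLES = {
    "master": ["master"],
    "data": ["data"],
    "coordinator": [],  # an empty role list makes a coordinating-only node
}


def render_node_settings(node_name: str, tier: str) -> str:
    """Return elasticsearch.yml lines for a node in the given tier."""
    role_list = ", ".join(TIER_ROLES[tier])
    return f"node.name: {node_name}\nnode.roles: [{role_list}]\n"


if __name__ == "__main__":
    # Hypothetical node names, one per tier.
    for name, tier in [
        ("es-master-01", "master"),
        ("es-data-01", "data"),
        ("es-coord-01", "coordinator"),
    ]:
        print(render_node_settings(name, tier))
```

Keeping the tier-to-role mapping in one table makes it easy to template from an automation tool and to add tiers later.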

Current compute node specs
Type              OS        Heap size (GB)  RAM (GB)  Disk (GB)  CPU
Data node         RHEL 7.x  30              80        8000       16
Master node       RHEL 7.x  30              64        40         2
Coordinator node  RHEL 7.x  30              64        40         8

Initial lessons learned:
• Keep heap around 30 GB
• 16 cores are enough
• Monitor garbage collection
• VMs on hyper-converged systems are not great for data nodes
• Hyper-converged nodes are expensive to scale
• Low-cost physical; more compute for your $
Pain points with current design
• Hyper-converged systems / hypervisor disk I/O too high
• Hypervisor VM replication wastes disk space and drive writes
• Logstash ingestion pipeline is fragile
Next iteration design goals
• Move data nodes to 49 physical servers; keep master and coordinator nodes on hyper-converged systems
• Move Logstash ingestion pipelines to containers for isolation and on-demand scaling
Unless you plan to use Elasticsearch to power the search of your blog or a small e-commerce website, your first design will fail.
—Fred de Villamil, Elasticsearch for fun and profit
Current issues and next iteration goals

Next iteration cluster design: massive scale out
• Ingestion ≈ 1.5 TB/day
• 49 physical servers
• 14 VMs on HX
• 3 TB RAM
• 638 cores
• Hot nodes: 14 TB
• Warm nodes: 228 TB
• Cold nodes: 182.3 TB
• Total cluster: 453 TB
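
A rough way to reason about these tier sizes is to ask how many days of ingestion each tier could hold. The sketch below does that arithmetic with the slide's figures. The replica count, the disk headroom, and the assumption that indexed data takes the same space as raw ingested data are all hypothetical inputs, not KeyBank numbers.

```python
DAILY_INGEST_TB = 1.5  # from the slide
TIER_CAPACITY_TB = {"hot": 14.0, "warm": 228.0, "cold": 182.3}  # from the slide


def days_of_retention(capacity_tb: float, daily_ingest_tb: float,
                      replicas: int = 1, headroom: float = 0.15) -> float:
    """Estimate how many days of data a tier can hold.

    Assumptions (hypothetical, not from the deck):
    - indexed size on disk equals raw ingested size
    - `replicas` copies are kept in addition to each primary
    - `headroom` is the fraction of disk deliberately left free
    """
    usable_tb = capacity_tb * (1.0 - headroom)
    stored_per_day_tb = daily_ingest_tb * (1 + replicas)
    return usable_tb / stored_per_day_tb


if __name__ == "__main__":
    for tier, capacity in TIER_CAPACITY_TB.items():
        days = days_of_retention(capacity, DAILY_INGEST_TB)
        print(f"{tier}: about {days:.1f} days")
```

Changing `replicas` or `headroom` shows how sensitive the estimate is, which fits the deck's point that the first estimate is expected to be wrong.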


Architecture and design approaches
• Independently scalable tiers
• Loosely coupled tiers
• High availability
• Design to maintain the environment during business hours with no outages

Logical architecture

How to scale
Independently scalable tiers allow for surgical scaling
HA tiers allow for live updates in production
Physical vs. virtual
• Virtual → Kafka, Elasticsearch master and coordinator nodes
• Physical → Elasticsearch data nodes
• Containerize Logstash streams for pipeline isolation and on-demand scaling
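
Live updates on an HA tier usually mean touching one node at a time. The deck does not describe KeyBank's procedure, so the sketch below only lists the steps of a generic Elasticsearch rolling restart as data that an automation tool could walk through. The `rolling_restart_plan` helper and the node names are hypothetical.

```python
from typing import Any, Dict, List

ALLOCATION_SETTING = "cluster.routing.allocation.enable"


def rolling_restart_plan(node_names: List[str]) -> List[Dict[str, Any]]:
    """Return ordered steps for restarting nodes one at a time."""
    steps: List[Dict[str, Any]] = []
    for node in node_names:
        steps.extend([
            # Stop replica reallocation while the node is away.
            {"action": "PUT _cluster/settings",
             "body": {"persistent": {ALLOCATION_SETTING: "primaries"}}},
            {"action": "POST _flush"},
            {"action": f"restart Elasticsearch service on {node}"},
            # None serializes to JSON null, which resets the setting.
            {"action": "PUT _cluster/settings",
             "body": {"persistent": {ALLOCATION_SETTING: None}}},
            # Wait for recovery before moving to the next node.
            {"action": "GET _cluster/health?wait_for_status=green"},
        ])
    return steps


if __name__ == "__main__":
    for step in rolling_restart_plan(["es-data-01", "es-data-02"]):
        print(step["action"])
```

Expressing the plan as data keeps the ordering reviewable in source control before anything runs against a cluster.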

When to scale
Measure ingestion to indexed performance (SLAs)
• KeyBank SLA is under 0.5 sec. from inbound raw Kafka topic to indexed in Elasticsearch
• Backlog item to create Kibana dashboard and alerting
• Ingestion SLAs are only part of the metric; Kafka output topics also need to be near empty all the time
• When end-to-end indexing time exceeds the SLA, it's time to tune or scale out
Measure shards per data node
• When the count gets too high (approx. 300), it's time to tune or scale out
Monitor query execution times
• Slow queries are possible feedback that it is time to scale out
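
The scaling triggers above can be combined into one check. This is a minimal sketch: the 0.5 second SLA and the limit of roughly 300 shards come from the slide, while the `ClusterSnapshot` type, the `scaling_signals` function, and the `backlog_limit` default are hypothetical stand-ins for "near empty".

```python
from dataclasses import dataclass
from typing import List

INGEST_SLA_SECONDS = 0.5          # from the slide
SHARDS_PER_DATA_NODE_LIMIT = 300  # approximate, from the slide


@dataclass
class ClusterSnapshot:
    ingest_latency_seconds: float   # raw Kafka topic to indexed in Elasticsearch
    kafka_output_backlog: int       # messages waiting in Kafka output topics
    max_shards_per_data_node: int   # highest shard count on any data node


def scaling_signals(snap: ClusterSnapshot, backlog_limit: int = 1000) -> List[str]:
    """Return the reasons, if any, to tune or scale out."""
    signals = []
    if snap.ingest_latency_seconds > INGEST_SLA_SECONDS:
        signals.append("ingestion latency exceeds SLA")
    if snap.kafka_output_backlog > backlog_limit:
        signals.append("Kafka output topics are backing up")
    if snap.max_shards_per_data_node > SHARDS_PER_DATA_NODE_LIMIT:
        signals.append("too many shards per data node")
    return signals


if __name__ == "__main__":
    print(scaling_signals(ClusterSnapshot(0.8, 10, 350)))
```

Returning reasons instead of a single boolean keeps the alert text specific about which tier needs attention.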


Automate everything
• Core principle: Infrastructure as Code (IaC)
• Automation platform: Ansible and AWX
• Source control: Bitbucket and simplified gitflow

Feedback loops and monitoring

Important metrics for monitoring:
• Number of shards per node
  • High shard per node count is a leading indicator of degraded system performance
• Disk I/O per second
• Heap size
• Excessive garbage collection
• CPU utilization
• Ingestion SLA (time from Kafka raw topic to fully indexed)
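
Shards per node is straightforward to derive from the rows returned by `GET _cat/shards?format=json`. The sketch below counts shards from rows that have already been fetched and parsed, so it needs no cluster connection. The sample rows and node names are made up, and the limit of 300 is the approximate figure quoted earlier in the deck.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def shards_per_node(rows: Iterable[Dict[str, Optional[str]]]) -> Counter:
    """Count shards per node from parsed `_cat/shards?format=json` rows."""
    counts: Counter = Counter()
    for row in rows:
        node = row.get("node")
        if not node:
            continue  # unassigned shards have no node
        # Relocating shards may list "source -> ... target"; count the source.
        counts[node.split(" ")[0]] += 1
    return counts


def nodes_over_limit(counts: Counter, limit: int = 300) -> List[str]:
    """Return node names whose shard count exceeds the limit."""
    return sorted(name for name, count in counts.items() if count > limit)


if __name__ == "__main__":
    # Made-up sample rows for illustration.
    sample = [
        {"index": "logs-1", "shard": "0", "prirep": "p", "node": "es-data-01"},
        {"index": "logs-1", "shard": "0", "prirep": "r", "node": "es-data-02"},
        {"index": "logs-2", "shard": "0", "prirep": "p", "node": "es-data-01"},
        {"index": "logs-2", "shard": "0", "prirep": "r", "node": None},
    ]
    counts = shards_per_node(sample)
    print(dict(counts), nodes_over_limit(counts, limit=1))
```

A count like this could feed a dashboard or alert alongside the other metrics in the list.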


Unexpected business insight through integrations
Initial focus on enterprise monitoring data led to integrations with:
• Database
• ITSM
• Collaboration
• Development pipeline
References
Elasticsearch documentation
https://www.elastic.co/guide/index.html
Running Elasticsearch for fun and profit
https://fdv.github.io/running-elasticsearch-fun-profit/
https://github.com/fdv/running-elasticsearch-fun-profit
