
4 - Customer story: Telenet

Telenet was looking to centralise their logs to make them easier to search and to allow easier troubleshooting of their infrastructure. They partnered with Kangaroot for the design and implementation and enjoy Enterprise Support via Elastic. During this session, you'll find out how they started this project.


  1. ELASTICSEARCH @OPEN19, 14 May 2019
  2. OUTLINE
     § Why Elastic?
     § Use cases:
       – Elasticsearch for troubleshooting
       – Elasticsearch for trending (metrics)
       – Monitoring the ELK stack
     § Setup
       – Implementation diagram
       – Details
       – HW migration strategy
       – Numbers
       – Alerting
       – Intelligent alerts
  3. WHY ELASTIC
     – Central location of logs
     – To allow easier troubleshooting of infrastructure/apps
     – No need to log in to different systems to check the logs
       > Keep logs longer than allowed by local disk space on the app servers
     – Implementation
       > Partnered with Kangaroot for design/implementation
         – Using Ansible for deployment/upgrades
       > Enterprise support via Elastic
  4. USE CASES
     § Log analysis of F5 access logs
       – Graphs/alerts on average response times for web apps (query sketch below)
       – Heavily used by Operations
     § VMware logs
       – vCenter logs for auditing reasons (Oracle licensing)
       – When ESXi crashes you might lose your logs
     § Network & storage device logs
     § Kafka broker monitoring
       – {metric,file}beat
     § Monitoring Elastic itself
       – Logstash, Filebeat, Elastic nodes, Kibana nodes, Elastic cluster health
     § Application logs for developers to allow easier troubleshooting
       – WebLogic, Tomcat, JBoss/WildFly, AEM, …
     § Generate alerts towards the enterprise monitoring solution using watches
     § Replacement of GSA with a custom API backed by Elastic
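To make the F5 use case concrete, the query sketch referenced above: a minimal example of the kind of aggregation a response-time graph or alert could be built on. The endpoint, credentials, index pattern (f5-access-*) and field names (vhost, response_time_ms) are placeholders, not Telenet's actual mapping.

    import requests

    ES_URL = "https://elastic.example.net:9200"   # placeholder coordinating node
    AUTH = ("elastic", "changeme")                # placeholder credentials

    # Average response time per web app over the last 15 minutes.
    query = {
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": "now-15m"}}},
        "aggs": {
            "per_app": {
                "terms": {"field": "vhost"},      # assumed keyword field identifying the web app
                "aggs": {"avg_rt": {"avg": {"field": "response_time_ms"}}},  # assumed numeric field
            }
        },
    }

    resp = requests.post(f"{ES_URL}/f5-access-*/_search", json=query, auth=AUTH)
    resp.raise_for_status()
    for bucket in resp.json()["aggregations"]["per_app"]["buckets"]:
        print(f'{bucket["key"]}: {bucket["avg_rt"]["value"]:.1f} ms average')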
  5. § ELK implementation diagram (shipper nodes feeding indexer nodes)
  6. DETAILS
     § Logstash
       – Shipper layer uses 1 pipeline
       – Index layer uses multiple pipelines
         > Grok filters for parsing logfiles, which need some logging standards
         > Alternative: use a native JSON logging format (sketch below)
       – Monitoring via X-Pack
         > Destination: Elastic monitoring cluster
     § Kafka
       – Monitoring using Filebeat/Metricbeat
         > Destination: Elastic cluster, bypassing Logstash/Kafka
     § Kibana
       – Using a coordination-only node
       – Load-balances queries across Elastic nodes
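The native JSON logging alternative mentioned under Logstash can be sketched as follows: when the application writes one JSON object per line, the indexer pipeline can decode it with a plain json filter/codec instead of maintaining grok patterns per log format. The logger name and fields below are illustrative, not a Telenet standard.

    import json
    import logging
    from datetime import datetime, timezone

    class JsonLineFormatter(logging.Formatter):
        """Render every log record as a single JSON line."""
        def format(self, record):
            return json.dumps({
                "@timestamp": datetime.now(timezone.utc).isoformat(),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler()              # in practice: a file picked up by Filebeat
    handler.setFormatter(JsonLineFormatter())

    log = logging.getLogger("myapp")               # hypothetical application logger
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    log.info("order processed")   # emits {"@timestamp": "...", "level": "INFO", "logger": "myapp", ...}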
  7. HW MIGRATION STRATEGY
     § Set up a new independent cluster on new HW (master nodes, data nodes, Kibana)
     § Set up a new Logstash indexer layer using a unique group_id (different Kafka consumer_id)
     § Migrate index patterns, existing roles, index templates, visualizations & dashboards, watches
     § Data sources need no modification
     § Data is ingested into both clusters
       – Allows testing the new hardware without impact on the current cluster
       – Data migration of older data, if needed, using snapshot/restore (sketch below)
       – Minimal to no data migration by running in parallel for the length of the retention period
       – Once done => switch the Kibana VIP from the old to the new Kibana instance
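For the snapshot/restore bullet, a rough sketch of the REST calls involved, assuming a shared snapshot repository (here called migration_repo) is registered on both clusters; the endpoints, snapshot name and index pattern are placeholders.

    import requests

    OLD = "https://old-cluster.example.net:9200"   # placeholder endpoints
    NEW = "https://new-cluster.example.net:9200"
    AUTH = ("elastic", "changeme")

    # 1. Snapshot the older indices on the existing cluster.
    requests.put(
        f"{OLD}/_snapshot/migration_repo/logs-2019-03",
        json={"indices": "logstash-2019.03.*", "include_global_state": False},
        auth=AUTH,
    ).raise_for_status()

    # 2. Poll GET _snapshot/migration_repo/logs-2019-03 until it reports SUCCESS (omitted here).

    # 3. Restore on the new cluster from the same shared repository;
    #    the target indices must not already exist there (or must be closed).
    requests.post(
        f"{NEW}/_snapshot/migration_repo/logs-2019-03/_restore",
        json={"indices": "logstash-2019.03.*"},
        auth=AUTH,
    ).raise_for_status()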
  8. NUMBERS
     § PRD cluster
       – 7 physical warm data nodes
       – 3 physical hot data nodes
       – 3 dedicated virtual master nodes
     § Currently running version 6.5
     § Retention (cleanup sketch below):
       – 30 days of data for infrastructure-related logs
       – 3 weeks of data for application logs
       – A few months for metrics
     § Current replicated data volume: 32 TB
     § Roughly 850 GB/day of incoming logs
     § 7,000 events/s for F5 access logs => replicated volume: 500 to 600 GB/day
     § 3,200 events/s for VMware logs => replicated volume: 350 GB/day
     § 500 events/s for Metricbeat => replicated volume: 400 GB/month
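The retention figures imply periodic index cleanup, referenced above as the cleanup sketch. A simplified daily job that drops application-log indices older than three weeks could look like this; the index naming convention (applogs-YYYY.MM.dd), endpoint and credentials are assumptions, and in practice a tool such as Curator is often used instead.

    from datetime import datetime, timedelta, timezone
    import requests

    ES_URL = "https://elastic.example.net:9200"    # placeholder endpoint
    AUTH = ("elastic", "changeme")
    RETENTION = timedelta(days=21)                 # 3 weeks for application logs
    cutoff = datetime.now(timezone.utc) - RETENTION

    # Assumes daily indices named applogs-YYYY.MM.dd
    indices = requests.get(
        f"{ES_URL}/_cat/indices/applogs-*?h=index&format=json", auth=AUTH
    ).json()

    for entry in indices:
        name = entry["index"]
        day = datetime.strptime(name.split("-", 1)[1], "%Y.%m.%d").replace(tzinfo=timezone.utc)
        if day < cutoff:
            requests.delete(f"{ES_URL}/{name}", auth=AUTH).raise_for_status()
            print(f"deleted {name}")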
  9. ALERTING
     § WATCHER (example watch sketched below)
       – Input
         > Search (Elastic query)
         > HTTP request
       – Trigger
         > Time based: when to execute the watcher (e.g. every 5 min)
       – Condition
         > Decides whether the actions should run, evaluated against the input's result
       – Actions to take if the condition is met
         > Log message to file
         > Send e-mail
         > Notification to a chat tool (e.g. Slack)
         > Call a webhook
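Putting those four building blocks together, a hedged sketch of what such a watch can look like on a 6.x cluster (the 6.x endpoint is _xpack/watcher; on 7.x+ it is _watcher). The index pattern, field, threshold and monitoring webhook are placeholders, not the actual Telenet watches.

    import requests

    ES_URL = "https://elastic.example.net:9200"    # placeholder endpoint
    AUTH = ("elastic", "changeme")

    watch = {
        # Trigger: run every 5 minutes.
        "trigger": {"schedule": {"interval": "5m"}},
        # Input: search the F5 access logs for slow responses in the last 5 minutes.
        "input": {"search": {"request": {
            "indices": ["f5-access-*"],                          # assumed index pattern
            "body": {"query": {"bool": {"filter": [
                {"range": {"@timestamp": {"gte": "now-5m"}}},
                {"range": {"response_time_ms": {"gte": 500}}},   # assumed field and threshold
            ]}}},
        }}},
        # Condition: only act when slow requests were actually found.
        "condition": {"compare": {"ctx.payload.hits.total": {"gt": 0}}},
        # Actions: log a message and notify the enterprise monitoring tool via a webhook.
        "actions": {
            "log_it": {"logging": {"text": "{{ctx.payload.hits.total}} slow F5 requests"}},
            "notify_monitoring": {"webhook": {
                "scheme": "https",
                "host": "monitoring.example.net",                # placeholder monitoring endpoint
                "port": 443,
                "method": "post",
                "path": "/api/events",
                "body": "{{ctx.payload.hits.total}} slow F5 requests in the last 5 minutes",
            }},
        },
    }

    requests.put(f"{ES_URL}/_xpack/watcher/watch/f5_slow_responses",
                 json=watch, auth=AUTH).raise_for_status()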
  10. INTELLIGENT ALERTS
      § Alerts are typically static
        – E.g. CPU usage should be below 90%, response times should be below 0.5 s
        – Not aware of periodicity, e.g. billing cycle, weekends, …
      § Enter machine learning (ML)
        – Creates an ML model that recognizes periodicity and can do forecasting
        – Anomaly detection: visually identify anomalies using a heatmap
        – Simple ML jobs
          > Based on a single metric
        – Multi-metric ML jobs (job sketch below)
          > Split a single time series into multiple time series based on a categorical field
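As an illustration of the multi-metric case (the job sketch referenced above), an anomaly detection job that models the mean response time and splits it per web app via a partition field. The job name, fields and bucket span are assumptions, and a separate datafeed would still be needed to push data into the job; on 6.x the ML API lives under _xpack/ml.

    import requests

    ES_URL = "https://elastic.example.net:9200"    # placeholder endpoint
    AUTH = ("elastic", "changeme")

    job = {
        "description": "Mean F5 response time, split per web app",
        "analysis_config": {
            "bucket_span": "15m",
            "detectors": [{
                "function": "mean",
                "field_name": "response_time_ms",   # assumed metric field
                "partition_field_name": "vhost",    # splits one series into one per app
            }],
            "influencers": ["vhost"],
        },
        "data_description": {"time_field": "@timestamp"},
    }

    # Create the job; a datafeed pointing at the F5 indices would be defined separately.
    requests.put(f"{ES_URL}/_xpack/ml/anomaly_detectors/f5_response_times",
                 json=job, auth=AUTH).raise_for_status()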
  11. THANK YOU
