Why Architecting for Disaster Recovery is Important for Your Time Series Data by Capital One

Time Series data at Capital One consists of Infrastructure, Application, and Business Process Metrics. Internal stakeholders rely on the combination of these metrics for observability, which allows them to deliver better service and uptime for their customers. Protecting this critical data with a proven and tested recovery plan is therefore not a “nice to have” but a “must have.”

In this talk, IT staff members Saravanan Krisharaju, Rajeev Tomer, and Karl Daman share how they built a fault-tolerant solution based on InfluxEnterprise and AWS that collects and stores metrics and events. They added Machine Learning, which uses the collected time series to model predictions that are then written back into the InfluxDB time series database for real-time access. The team also shares the journey they took to architect and build this solution and to plan and execute their disaster recovery plan.

Published in: Technology

  1. Confidential. Time Series Data Management Evolution at Capital One: Building a highly resilient Enterprise Influx across multiple regions. Monitoring Intelligence, October 23, 2018
  2. Agenda: 1. Influx at Capital One; 2. Architecture (How Influx Architecture evolved); 3. Resiliency (Complete protection against entire region failure); 4. Performance Metrics (Critical Platform Metrics); 5. Q & A
  3. Influx at Capital One: Business Transaction Metrics; Infrastructure Health Metrics; Application Performance Metrics; Service Adoption Metrics
  4. Agenda: 1. Influx at Capital One; 2. Architecture (How Influx Architecture evolved); 3. Resiliency (Complete protection against entire region failure); 4. Performance Metrics (Critical Platform Metrics); 5. Q & A
  5. Architecture – Gen1. Diagram labels: Splunk, Direct API, Telegraf; https; LB; Primary Site with InfluxDB (Active); DR Site with InfluxDB (Standby); Backup/Restore; Visualization. Features: 1. Grafana for visualization. Challenges: 1. High data retention (> 400 days); 2. Unstable DR solution.
  6. Architecture – Gen2. Diagram labels: Splunk, Direct API, Telegraf; https; LB; Primary Site with InfluxDB (Active); DR Site with InfluxDB (Standby); Backup/Restore; Visualization; AWS S3; Data Lake; Daily batch; ML Model Dev/Train; Model Governance; ML Model Execution; step labels A, B. Review, C. Deploy, D, E. Features: 1. Grafana for visualization; 2. Raw data exported daily to One Lake; 3. Raw data available for ML. Challenges: 1. High data retention (> 400 days); 2. Unstable DR solution.
  7. Architecture – Current. Diagram labels: Splunk, Direct API, Telegraf; https; LB; Primary Site with InfluxDB (Active); DR Site with InfluxDB (Passive); AWS S3 Export/Import; Visualization; AWS S3; Data Lake; Mini batch; ML Model Dev/Train; Model Governance; ML Model Execution; step labels A, B. Review, C. Deploy, D, E. Features: 1. Grafana for visualization; 2. Raw data exported every 30 minutes to One Lake; 3. Raw data available for ML; 4. Stable DR solution with Passive Cluster. Challenges: High data retention (> 400 days); Unstable DR solution.
  8. Agenda: 1. Influx at Capital One; 2. Architecture (How Influx Architecture evolved); 3. Resiliency (Complete protection against entire region failure); 4. Performance Metrics (Critical Platform Metrics); 5. Q & A
  9. Resiliency – Region 1 Active. Diagram labels: Splunk, Direct API, Telegraf; https; Route53 (DNS Switch); Region 1 (Active) with LB, Cluster 1 (a Data Node and a Meta Node in each of Zone 1, Zone 2, Zone 3), Admin Node, ASG, S3; Region 2 (DR) with LB, Cluster 2 (a Data Node and a Meta Node in each of Zone 1, Zone 2, Zone 3), Admin Node, ASG, S3; Cross Region Replication. Legend: M = Meta Node, D = Data Node, A = Admin Node. Steps: 1. All traffic routed to Region 1; 2. Influx Export Script every 15 min; 3. Data replicated to Region 2; 4. Influx Import Script every 15 min.
  10. Resiliency – Region 2 Active. Diagram labels: Splunk, Direct API, Telegraf; https; Route53 (DNS Switch); Region 1 (DR) with LB, Cluster 1 (a Data Node and a Meta Node in each of Zone 1, Zone 2, Zone 3), Admin Node, ASG, S3; Region 2 (Active) with LB, Cluster 2 (a Data Node and a Meta Node in each of Zone 1, Zone 2, Zone 3), Admin Node, ASG, S3; Cross Region Replication. Legend: M = Meta Node, D = Data Node, A = Admin Node. Steps: 1. All traffic routed to Region 2; 2. Influx Export Script every 15 min; 3. Data replicated to Region 1; 4. Influx Import Script every 15 min.
  11. Resiliency – Export/Import. Export* and Import* listings (shown as images on the slide; text not available). * High-level pseudocode, not the actual code.
  12. Agenda: 1. Influx at Capital One; 2. Architecture (How Influx Architecture evolved); 3. Resiliency (Complete protection against entire region failure); 4. Performance Metrics (Critical Platform Metrics); 5. Q & A
  13. Performance Metrics: Performance metrics collected using Telegraf
  14. Q & A
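
The resiliency flow on slides 9 and 10 (export every 15 minutes, cross-region S3 replication, import every 15 minutes) can be illustrated with a small sketch. The slides' own export and import pseudocode is not available in this transcript, so the following Python is only an assumption about how such a job might pick its time window and track what still needs importing; it is not the presenters' code. The function names, the S3 key layout, and the `.lp.gz` suffix are all hypothetical, and the actual export, upload, and import calls are deliberately left out.

```python
from datetime import datetime, timedelta, timezone

# The 15-minute interval comes from the slides; everything else is illustrative.
WINDOW = timedelta(minutes=15)


def window_for(now: datetime) -> tuple[datetime, datetime]:
    """Return the most recent fully elapsed 15-minute window as (start, end).

    `now` is expected to be a timezone-aware datetime; it is normalized to UTC.
    """
    now = now.astimezone(timezone.utc)
    # Round down to the nearest quarter hour to get the window's end.
    end = now.replace(minute=now.minute - now.minute % 15, second=0, microsecond=0)
    return end - WINDOW, end


def object_key(database: str, start: datetime) -> str:
    """Hypothetical S3 key layout: one object per database per window."""
    return f"{database}/{start:%Y/%m/%d/%H%M}.lp.gz"


def pending_imports(exported: set[str], imported: set[str]) -> list[str]:
    """Keys present in the replicated bucket but not yet loaded.

    Sorting the zero-padded keys yields oldest-first order within a database.
    """
    return sorted(exported - imported)
```

For example, a run at 10:37:12 UTC on October 23, 2018 would select the window from 10:15 to 10:30, and `object_key("metrics", start)` would give `metrics/2018/10/23/1015.lp.gz`. Keeping a record of imported keys, as `pending_imports` assumes, would let the import job in the DR region catch up after a missed run without loading the same window twice.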