Monitoring & Incident Management for Yamato
2023
Yamato monitoring system
Agenda
● Overview
● Monitoring
● Incident Management
● Monitoring technical details
● Incident Management technical details
● Concerns and next steps
● QA
Overview
The monitoring system has two components: a monitoring dashboard to analyse performance, and
Incident Management to react to incidents.
Monitoring
The Yamato platform dashboard shows the current status and the historical data from a
performance point of view:
● General overview
● DWH: Infrastructure
● DWH: Workload
Monitoring
The Yamato platform jobs dashboard shows the historical data of jobs once they have completed:
● Completed jobs
● Instrumented jobs
● Maintenance jobs
Incident Management
PagerDuty collects the alarms sent by New Relic.
● Yamato INFRA collects alarms from the Yamato Redshift Cluster, WLM Shift Enforcer, …
● Yamato DEV collects alarms from everything developed on top of the infrastructure (maintenance, RDL (livesync
and hydra), other pipelines, …)
Every alarm has a defined priority (high, low or info notification).
To review the IM SLAs, use PD Insight.
Incident Management
Info notifications are not incidents; they can be reviewed in the Alerts panel.
Incident Management
Incidents are routed to the EU Data team, which has two
schedules to manage incident escalation.
● Level 1 (data team developers) receives low and high incidents
● Level 2 (data team managers) receives high incidents if they are
not acknowledged by Level 1
Note: The data team manages the infrastructure and development
incidents that they own. SRE focus on the infra that they own. This is about
the nature of the incidents, don’t confuse them :)
Note: The purpose of Level 2 is not to resolve an incident.
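The escalation flow above can be sketched as a small routing function. This is only an illustration, not the actual PagerDuty configuration: the function name `notified_levels`, the level labels and the 30-minute acknowledgement timeout are assumptions made for the example.

```python
from typing import List


def notified_levels(urgency: str, minutes_unacknowledged: float,
                    ack_timeout_min: float = 30.0) -> List[str]:
    """Return which schedules are notified for an incident.

    Hypothetical sketch: the 30-minute timeout is an assumed value,
    not taken from the real escalation policy.
    """
    if urgency not in ("low", "high"):
        # Info notifications are not incidents, so nobody is paged.
        return []
    levels = ["level1"]  # Level 1 receives both low and high incidents
    if urgency == "high" and minutes_unacknowledged >= ack_timeout_min:
        # Level 2 is only notified for unacknowledged high incidents.
        levels.append("level2")
    return levels
```

For example, a low incident never reaches Level 2 however long it stays unacknowledged, while a high incident does once the assumed timeout has passed.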
Monitoring technical details
Monitoring Tech details
The dashboard uses CloudWatch metrics converted to NR metrics through the metric streamer.
We also add Custom Metrics from CW.
The custom metrics use the label field of the Redshift system tables. This approach is used by the odyn dags and it
allows us to identify the job name and the job type.
This is applied in all the instrumented jobs (livesync, hydra, amnesia) and the odyn dags.
The rest show as unknown.
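A minimal sketch of how a query label could be turned into a job type and job name, with the "unknown" fallback described above. The `<jobType>:<jobName>` label convention and the function name `parse_query_label` are assumptions for illustration; the slides do not state the real label format. On Redshift, one way to attach such a label to a session's queries is `SET query_group`, but how the jobs actually set it is not described here.

```python
from typing import Tuple

UNKNOWN = "unknown"


def parse_query_label(label: str) -> Tuple[str, str]:
    """Split a query label into (job_type, job_name).

    Assumes a hypothetical "<jobType>:<jobName>" label convention.
    Labels that do not follow it are reported as unknown.
    """
    # System table text columns can be space padded, so strip first.
    cleaned = (label or "").strip()
    job_type, sep, job_name = cleaned.partition(":")
    if not sep or not job_type or not job_name:
        return (UNKNOWN, UNKNOWN)
    return (job_type, job_name)
```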
Monitoring Tech details
The Custom metrics importer pushes two types of metrics:
● Completed metrics: These metrics cover finished queries, aggregated at job level. The
dimensions are jobType, jobName and Username. The importer currently runs every 30 minutes using the super
admin queue.
"JobQueries","JobCPUUtilization","JobExecutionTime","JobQueueTime","JobBlocksRead","JobTempBlocksToDisk",
"JobSpectrumUsage","JobNestedLoopJoinRowCount","JobReturnRowCount","JobJoinRowCount","JobMaxSegmentExecutionTime",
"JobMaxSegmentIoSkew","JobConcurrencyScalingTime","JobResultCachingRatio","JobWorkmem","JobSessions"
● Inflight metrics: These metrics cover queries that are still in memory (in flight), aggregated at job level. The dimensions are
jobType, jobName, QueueName and Username. The importer runs every minute using the super admin queue.
"InflightJobQueries","InflightJobCPUUtilization","InflightJobExecutionTime","InflightJobQueueTime","InflightJobBlocksRead",
"InflightJobTempBlocksToDisk","InflightJobSpectrumUsage","InflightJobNestedLoopJoinRowCount","InflightJobReturnRowCount",
"InflightJobJoinRowCount","InflightJobMaxSegmentExecutionTime","InflightJobMaxSegmentIoSkew","InflightJobConcurrencyScalingTime",
"InflightJobWorkmem","InflightJobSessions"
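The job-level aggregation can be illustrated with a short sketch that rolls per-query rows up into two of the completed metrics, `JobQueries` and `JobExecutionTime`, keyed by the jobType, jobName and Username dimensions. The input row fields (`job_type`, `job_name`, `username`, `execution_seconds`), the function name and the choice to sum execution times are assumptions; the output dictionaries are shaped like CloudWatch metric datums, and the actual push to CloudWatch is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_job_metrics(rows: Iterable[dict]) -> List[dict]:
    """Aggregate per-query rows into per-job metric datums."""
    totals: Dict[Tuple[str, str, str], Dict[str, float]] = defaultdict(
        lambda: {"JobQueries": 0.0, "JobExecutionTime": 0.0}
    )
    for row in rows:
        key = (row["job_type"], row["job_name"], row["username"])
        totals[key]["JobQueries"] += 1
        # Summing is an assumed aggregation, chosen for illustration.
        totals[key]["JobExecutionTime"] += row["execution_seconds"]

    datums = []
    for (job_type, job_name, username), metrics in sorted(totals.items()):
        dimensions = [
            {"Name": "jobType", "Value": job_type},
            {"Name": "jobName", "Value": job_name},
            {"Name": "Username", "Value": username},
        ]
        for metric_name, value in sorted(metrics.items()):
            datums.append({"MetricName": metric_name,
                           "Dimensions": dimensions,
                           "Value": value})
    return datums
```

The inflight variant would follow the same pattern with QueueName added to the dimension key.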
Monitoring Tech details
The NR alarms are defined in these three policies:
● Infra Alert policy: Redshift, lambda enforcer
● Dev Job alert policy: livesync, hydra, amnesia, maintenance, custom metrics importer
● Custom Metric alert policy: Inflight Job metrics
Monitoring Tech details
Every policy gathers the alarm information and passes it to PD using an NR notification template.
● DEV Job notification
● INFRA notification
● DEV Custom Metric Job notification
IM technical details
IM Tech details
The New Relic incident lifecycle auto-resolves an incident when the alarm condition recovers. This is a good
approach for infrastructure incidents, but it is not acceptable for developer incidents, which require manual
intervention to be closed.
The orchestration rules in PD can disable the auto-resolve behaviour:
● The infrastructure service allows NR to auto-resolve an incident
● The developer service does not allow NR to auto-resolve an incident
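The per-service rule above can be expressed as a small decision function. This is a sketch of the logic only: the real behaviour is configured in PD orchestration rules rather than in code, and the service keys and the function name `should_forward` are hypothetical. The action names `trigger`, `acknowledge` and `resolve` are the event actions used by the PagerDuty Events API.

```python
# Assumed service keys; True means NR may auto-resolve incidents there.
AUTO_RESOLVE_ALLOWED = {"infrastructure": True, "developer": False}


def should_forward(event_action: str, service: str) -> bool:
    """Decide whether an incoming NR event is applied to a PD incident."""
    if event_action != "resolve":
        return True  # trigger/acknowledge events always pass through
    # Unknown services default to manual resolution (an assumption).
    return AUTO_RESOLVE_ALLOWED.get(service, False)
```

Dropping the resolve event for the developer service keeps the incident open until someone closes it manually.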
IM Tech details
To manage prioritisation, every alarm defines a pdUrgency and/or pdPriority attribute which, through the
orchestration rules, sets the severity and priority in PD.
The services in PD are configured based on the severity of the alert.
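A minimal sketch of the mapping that the orchestration rules perform from alarm attributes to severity and priority. The attribute names `pdUrgency` and `pdPriority` come from the text; the mapping table, the default urgency and the function name `classify` are assumptions, and the actual rules may differ.

```python
from typing import Dict, Optional

# Assumed mapping; the real orchestration rules may differ.
URGENCY_TO_SEVERITY = {"high": "critical", "low": "warning", "info": "info"}


def classify(attributes: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Derive severity and priority from NR alarm attributes."""
    # Alarms without pdUrgency are treated as low (an assumption).
    urgency = attributes.get("pdUrgency", "low").lower()
    severity = URGENCY_TO_SEVERITY.get(urgency, "warning")
    # pdPriority is optional, so it may be absent.
    priority = attributes.get("pdPriority")
    return {"severity": severity, "priority": priority}
```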
IM Tech details
● The INFRA alarms define the pdUrgency and pdPriority attributes in NR
IM Tech details
● The Job DEV alarms for APM (instrumented jobs) define the prioritisation attributes using an
environment variable that you can set at the job pipeline level
● The Job DEV alarms for the Maintenance Jobs define the prioritisation in the NR alarm.
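For the instrumented jobs, reading the prioritisation from the environment could look like the sketch below. The variable name `PD_URGENCY`, the default of `low` and the function name are hypothetical; the slides do not name the real environment variable.

```python
import os
from typing import Mapping

VALID_URGENCIES = ("high", "low", "info")


def urgency_from_env(env: Mapping[str, str] = os.environ,
                     var_name: str = "PD_URGENCY",
                     default: str = "low") -> str:
    """Read the job urgency from an environment variable.

    PD_URGENCY is a hypothetical variable name used for illustration.
    """
    value = env.get(var_name, default).strip().lower()
    # Fall back to the default when the value is not a known urgency.
    return value if value in VALID_URGENCIES else default
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.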
IM Tech details
● The Custom Metrics DEV alarms define the prioritisation in the NR alarm too.
Concerns and next steps
Concerns and next steps
● Custom metrics bring valuable information, but this has a cost in terms of performance. IMO these
metrics are fundamental and must have the highest priority. The inflight queries use 6% of the
time and the completed queries take 2.2% of total time.
● We need to know about every error that happens and try not to hide errors through retries. Alarm noise could be
reduced by grouping alarms, so that we are not overwhelmed.
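One way to reduce noise without hiding errors is to collapse repeated alarms into a single entry that carries a count. The sketch below assumes each alarm is a dictionary with hypothetical `job_name` and `error` fields; it is an illustration of the idea, not a description of how NR or PD group alerts.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def group_alarms(alarms: Iterable[dict]) -> List[Tuple[str, str, int]]:
    """Collapse repeated alarms into one entry per (job, error) pair."""
    counts = Counter((a["job_name"], a["error"]) for a in alarms)
    # Most frequent first; ties broken alphabetically for stable output.
    return [(job, error, n)
            for (job, error), n in sorted(counts.items(),
                                          key=lambda kv: (-kv[1], kv[0]))]
```

Every error is still counted, so retries that fail repeatedly remain visible as a higher count rather than as separate notifications.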
Next steps
● Add the query trace to other pipelines to get better visibility in AWS Custom Metrics.
● Integrate YUNS as a PD extension.
● Integrate Redshift event notifications with NR and PD.
Q/A
