Just enough web ops for web developers


Datadog is monitoring that does not suck. It's metrics-friendly, people-friendly, and developer-friendly monitoring. Learn more at https://www.datadoghq.com/


Just Enough WebOps
for Developers
Alexis Lê-Quôc @alq
http://www.datadoghq.com

Datadog is Monitoring that does not suck... as a Service
“Metrics made social”

Dev: 930,000
Ops: 350,000
2010 US figures from BLS

The New Development Equation
Code + + AWS =
3 months 5 minutes
Web Operations?

Provide access
reliable, fast, cheap
24x7
without going crazy

WebOps Cycle
Dev → Release → Measure & Log → Monitor → Alert → Fix || Escalate → Investigate → Change → Dev

Measure
Purpose
Collect quantitative metrics
Process
Instrument servers
Instrument code
Instrument SaaS deps
Automate collection
Risks
Imprecise metric definition
Manual collection
“What does it mean?”
Tools
System (ganglia, collectd, munin, nagios, etc.)
Code (metrics, statsd)
SaaS (Datadog et al.)
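
To make "Instrument code" concrete, here is a minimal sketch of a StatsD-style client using only the Python standard library. It assumes a StatsD-compatible agent listening on UDP port 8125 (the usual default); the class name and the metric names are hypothetical and chosen only for illustration.

```python
import socket
import time
from contextlib import contextmanager


def format_counter(name: str, value: int = 1) -> str:
    """Build a StatsD counter line, e.g. 'web.signup:1|c'."""
    return f"{name}:{value}|c"


def format_timer(name: str, millis: float) -> str:
    """Build a StatsD timer line in whole milliseconds, e.g. 'web.request:12|ms'."""
    return f"{name}:{millis:.0f}|ms"


class StatsdClient:
    """Fire-and-forget UDP sender; host and port are assumptions."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125) -> None:
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(self, line: str) -> None:
        try:
            self._sock.sendto(line.encode("ascii"), self._addr)
        except OSError:
            # Instrumentation should never take the application down.
            pass

    def incr(self, name: str, value: int = 1) -> None:
        self.send(format_counter(name, value))

    @contextmanager
    def timed(self, name: str):
        """Time the wrapped block and report its duration."""
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            self.send(format_timer(name, elapsed_ms))
```

Wrapping a request handler in `with client.timed("web.request"):` keeps collection automatic, and a precise metric name in the code helps answer “What does it mean?” later.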

Log
Purpose
Collect meaningful, timestamped events
Process
All the time
In one place
Access for everyone
Discipline
Risks
TiB of garbage
Non-uniform timestamps
Non-uniform formats
Tools
log4j et al.
syslog et al.
logstash, splunk
+ Logging-as-a-Service
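
One way to reduce the "non-uniform timestamps" and "non-uniform formats" risks is to emit every event as one JSON object with a UTC ISO-8601 timestamp. The sketch below uses Python's standard `logging` module; the formatter name and the field names (`ts`, `level`, `logger`, `msg`) are assumptions, not a standard.

```python
import json
import logging
from datetime import datetime, timezone


class JsonUtcFormatter(logging.Formatter):
    """Render each record as a single JSON line with a UTC timestamp."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps(
            {
                "ts": ts.isoformat(timespec="milliseconds"),
                "level": record.levelname,
                "logger": record.name,
                "msg": record.getMessage(),
            },
            sort_keys=True,
        )


def make_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes uniform JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonUtcFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

If every service uses the same formatter, a collector such as logstash or syslog can ship the lines to one place without per-service parsing rules.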

Monitor
Purpose
Watch actionable events & metrics
Process
Health of the app?
Which metrics for health?
Compute metrics
Metric domain
Access for everyone
Pretty graphs
Risks
Non-actionable metrics
Tools
graphite, cubism et al.
+ services
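
As an illustration of "Compute metrics", the sketch below derives two candidate health metrics from raw measurements: a latency percentile and a server-error rate. Choosing these two metrics, the nearest-rank percentile method, and treating status codes of 500 and above as errors are all assumptions made for the example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of responses with a 5xx status; 0.0 when there is no traffic."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)
```

Both numbers are actionable in the sense the slide asks for: a rising value points at something a person can investigate.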

Alert
Purpose
Bring human in the loop when automated fix does not work
Process
Alert on vital monitors
Add new alerts with new monitors
Compute metrics from alerts
Ruthlessly edit
Risks
Too many alerts
Become desensitized
Ignore alerts
App crashes for real
Pendulum swings back
Tools
nagios
+ services
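
"Compute metrics from alerts" and "Ruthlessly edit" can be sketched as a count of how often each monitor fired, followed by a list of the noisiest ones to review. The alert record shape (a dict with a `monitor` key), the monitor names, and the threshold are hypothetical.

```python
from collections import Counter


def alerts_per_monitor(alerts):
    """Count how many times each monitor fired."""
    return Counter(alert["monitor"] for alert in alerts)


def noisy_monitors(alerts, max_alerts):
    """Return (monitor, count) pairs above max_alerts, noisiest first."""
    counts = alerts_per_monitor(alerts)
    noisy = [(name, n) for name, n in counts.items() if n > max_alerts]
    return sorted(noisy, key=lambda pair: (-pair[1], pair[0]))
```

Monitors that appear in this list week after week are candidates for tuning or removal before the team becomes desensitized.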

Fix || Escalate
Purpose
Fix issue or find someone who can
Process
(fix) capture actions as soon as possible (while or shortly after)
(fix) runbooks
(fix) automate fixes
(escalation) on-call rotation
(escalation) agree on rules
Risks
Burn out
Tools
PagerDuty
Bug tracker
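
The two escalation items can be written down as data plus two small functions. This is a sketch under assumptions: a weekly rotation, and an escalation policy expressed as (minutes unacknowledged, target) pairs. The names and thresholds are hypothetical, and a service such as PagerDuty would normally hold this configuration.

```python
from datetime import date


def on_call(rotation, day, start):
    """Return who is on call on `day` for a weekly rotation that began on `start`."""
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


def escalation_target(policy, minutes_unacked):
    """Pick the last policy step whose delay has elapsed.

    `policy` is a list of (after_minutes, target) pairs sorted by delay,
    and its first step is expected to start at 0 minutes.
    """
    target = policy[0][1]
    for after_minutes, who in policy:
        if minutes_unacked >= after_minutes:
            target = who
    return target
```

Keeping the rules explicit is what "agree on rules" amounts to: everyone can read who gets paged and when, which also spreads the load and lowers the burn-out risk.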

Investigate
Purpose
Collect evidence
Reconstruct what happened
Process
Start where/when problem 1st detected
Work your way from there
Capture relevant graphs/logs
Risks
Missing the starting point
Lagging events/metrics
Low-level events/metrics
Blame game
Tools
Post-mortems
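
"Start where/when problem 1st detected" and "Work your way from there" can be sketched as merging events from several sources into one timeline around the detection time. The source names, the event shape (timestamp, message), and the window size are assumptions for illustration.

```python
from datetime import datetime, timedelta


def timeline(sources, detected_at, window):
    """Merge events within `window` of `detected_at` into one sorted list.

    `sources` maps a source name to a list of (timestamp, message) pairs.
    Returns (timestamp, source, message) tuples in chronological order.
    """
    merged = [
        (ts, name, message)
        for name, events in sources.items()
        for ts, message in events
        if abs(ts - detected_at) <= window
    ]
    return sorted(merged)
```

This only works if timestamps are comparable across sources, which is why uniform timestamps matter at the logging stage.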

Change
Purpose
Fewer alerts
Better service
Process
Change infrastructure, code
Infrastructure == code
Add/Edit monitors & alerts
Risks
Ad-hoc changes
Tools
...
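
One reading of "Infrastructure == code" applied to "Add/Edit monitors & alerts" is to keep monitor definitions as data under version control and compute a change plan from them, rather than editing monitors ad hoc. The sketch below assumes monitors are plain dicts keyed by name; the definition fields are hypothetical.

```python
def plan_changes(current, desired):
    """Compare deployed monitors with the desired definitions.

    Both arguments map a monitor name to its definition (any comparable value).
    Returns the names to add, update, and remove, each sorted alphabetically.
    """
    to_add = sorted(desired.keys() - current.keys())
    to_remove = sorted(current.keys() - desired.keys())
    to_update = sorted(
        name
        for name in desired.keys() & current.keys()
        if desired[name] != current[name]
    )
    return {"add": to_add, "update": to_update, "remove": to_remove}
```

Reviewing the plan before applying it gives monitor changes the same scrutiny as code changes.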

Questions?
Comments?
@alq


Just enough web ops for web developers

1. Just Enough WebOps for Developers Alexis Lê-Quôc @alq http://www.datadoghq.com

2. @alq

3. @alq Co-founder DATADOG

4. Datadog is Monitoring that does not suck... as a Service

5. Datadog is Monitoring that does not suck... as a Service “Metrics made social”

6. People-friendly Monitoring

7. Developer-friendly Monitoring

8. 930,000 350,000 Dev Ops 2010 US figures from BLS

9. The New Development Equation

10. The New Development Equation Code + + AWS =

11. The New Development Equation Code + + AWS = 3 months

12. The New Development Equation Code + + AWS = 3 months 5 minutes

13.

14. Web Operations?

15. The New Development Equation Code + + AWS = 3 months 5 minutes

16. The New Development Equation Code + + AWS = 3 months 5 minutes Web Operations?

17.

18. Cargo cult Operations

19.

20. Common vocabulary between Dev & WebOps?

21. Users SysAdmin

22. “Come and get it” “We want root!”

23. Dev WebOps

24. WebOps and this is what I do

25. But first an important digression

26. Product Service

27. Service = Code + Infrastructure

28. Service = Product + Access

29.

30. Provide access

31. Provide access

32. reliable, fast, cheap Provide access

33. reliable, fast, cheap Provide access

34. reliable, fast, cheap Provide access 24x7 without going crazy

35. 24x7 && !crazy

36. Development Models

37.

38. Delivery historically not the focus

39. Agile Cycle Delivery

40. Agile Cycle Delivery

41. Agile Cycle Delivery WebOps Cycle

42. WebOps and this is what I do

43. Dev Release Measure & Log Monitor Change WebOps Cycle Investigate Alert Fix || Escalate

44. (Release)

45. Dev Release Measure & Log Monitor Change Investigate Alert Fix || Escalate

46. Purpose Collect quantitative metrics Process Instrument servers Instrument code Instrument SaaS deps Automate collection Measure Risks Imprecise metric definition Manual collection “What does it mean?” Tools System (ganglia, collectd, munin, nagios, etc.) Code (metrics, statsd) SaaS (Datadog et al.)

47. Dev Release Measure & Log Monitor Change Investigate Alert Fix || Escalate

48. Purpose Collect meaningful, timestamped events Log Process All the time In one place Access for everyone Discipline Risks TiB of garbage Non-uniform timestamps Non-uniform formats Tools log4j et al. syslog et al. logstash, splunk + Logging-as-a-Service

49. Dev Release Measure & Log Change Investigate Alert Fix || Escalate Monitor

50. Purpose Watch actionable events & metrics Process Health of the app? Which metrics for health? Compute metrics Metric domain Access for everyone Pretty graphs Monitor Risks Non-actionable metrics Tools graphite, cubism et al. + services

51. Dev Release Measure & Log Monitor Change Investigate Alert Fix || Escalate

52. Purpose Bring human in the loop when automated fix does not work Alert Process Alert on vital monitors Add new alerts with new monitors Compute metrics from alerts Ruthlessly edit Risks Too many alerts Become desensitized Ignore alerts App crashes for real Pendulum swings back Tools nagios + services

53. Dev Release Measure & Log Monitor Change Investigate Alert Fix || Escalate

54. Purpose Fix issue or find someone who can Process (fix) capture actions as soon as possible (while or shortly after) (fix) runbooks (fix) automate fixes (escalation) on-call rotation (escalation) agree on rules Fix || Escalate Risks Burn out Tools PagerDuty Bug tracker

55. Dev Release Measure & Log Monitor Alert Change Investigate Fix || Escalate

56. Purpose Collect evidence Reconstruct what happened Process Start where/when problem 1st detected Work your way from there Capture relevant graphs/logs Investigate Risks Missing the starting point Lagging events/metrics Low-level events/metrics Blame game Tools Post-mortems

57. Dev Release Measure & Log Monitor Investigate Alert Fix || Escalate Change

58. Change Purpose Fewer alerts Better service Process Change infrastructure, code Infrastructure == code Add/Edit monitors & alerts Risks ad-hoc changes Tools ...

59. WebOps and this is what I do

60. Dev Release Measure & Log Monitor Change Questions? Comments? @alq Investigate Alert Fix || Escalate
