Monitoring and scaling postgres at datadog

•Download as PPTX, PDF•

2 likes•320 views

An overview of how Datadog has built and scaled our Postgres clusters to support the ingestion of trillions of metric data points per day, by Seth Rosenblum, Lead Data Reliability Engineer.

Software

Monitoring and Scaling
Postgres at Datadog
Seth Rosenblum, Datadog @SethRosenblum

Seth Rosenblum
Data Reliability Engineering Lead
‣ Kafka
‣ Elasticsearch
‣ Cassandra
‣ Postgres!
@SethRosenblum

@datadoghq
SaaS-based monitoring
Trillions of data points per day
We’re hiring!
https://jobs.datadoghq.com

Collecting data is cheap; not
having it when you need it can
be expensive

Collecting data is cheap; not
having it when you need it can
be expensive
...So instrument all the things!

Metrics
connections
commits
rollbacks
disk_read
buffer_hit
rows_returned
rows_fetched
rows_inserted
rows_updated
rows_deleted
database_size
deadlocks
temp_bytes
What metrics do we gather?
temp_files
bgwriter.checkpoints_timed
bgwriter.checkpoints_requested
bgwriter.buffers_checkpoint
bgwriter.buffers_clean
bgwriter.maxwritten_clean
bgwriter.buffers_backend
bgwriter.buffers_alloc
bgwriter.buffers_backend_fsync
bgwriter.write_time
bgwriter.sync_time
locks
seq_scans
seq_rows_read
index_scans
index_rows_fetched
rows_hot_updated
live_rows
dead_rows
index_rows_read
table_size
index_size
total_size
table.count
max_connections
percent_usage_connections
replication_delay
replication_delay_bytes
heap_blocks_read
heap_blocks_hit
index_blocks_read
index_blocks_hit
toast_blocks_read
toast_blocks_hit
toast_index_blocks_read
toast_index_blocks_hit

“Postgres performance-optimizes a lot better
when it has a consistent workload”
Josh Berkus

How we do it
Requirements
▸Write master is writeable, read replicas are
readable!

How we do it
Requirements
▸Write master is writeable, read replicas are
readable!
▸Read replicas are up to date and don’t lag

How we do it
Requirements
▸Write master is writeable, read replicas are
readable!
▸Read replicas are up to date and don’t lag
▸Additional read replicas can be provisioned
quickly

How we do it
Solutions
▸PostgreSQL!
▸http://bit.ly/pg-repl-docs
▸WAL-E
▸https://github.com/wal-e/wal-e

How we do it
Solutions
▸PostgreSQL!
▸http://bit.ly/pg-repl-docs
▸WAL-E WAL-G
▸https://github.com/wal-g/wal-g

What are we alerting on?
▸Write master is writeable, read replicas are
readable!
▸Up/Down checks
▸Latency

What are we alerting on?
▸Read replicas are up to date and don’t lag
▸Write master standby availability
▸Write master standby replication lag
▸Read replica lag

What are we alerting on?
▸Additional read replicas can be provisioned
quickly
▸Base backups are functioning properly

When do we care that a backup has failed?

“Aside from shared_buffers, the most
important memory-allocation parameter
is work_mem… Raising this value can
dramatically improve the performance of certain
queries…”
Robert Haas

Latency vs Potential
EXPLAIN ANALYZE
http://bit.ly/pg-explain
‣ Explain displays the execution plan

Latency vs Potential
EXPLAIN ANALYZE
http://bit.ly/pg-explain
‣ Explain displays the execution plan
‣ Analyze runs it and gathers stats

Latency vs Potential
EXPLAIN ANALYZE
Merge Right Join (cost=25870.55..31017.51 rows=229367 width=92) (actual time=2884.501..5147.047 rows=354834 loops=1)
Merge Cond: (a.uid = b.uid)
-> Index Scan using foo on bar a (cost=0.00..537.29 rows=9246 width=27) (actual time=0.049..41.782 rows=9246 loops=1)
-> Materialize (cost=25870.49..27204.80 rows=106745 width=81) (actual time=2884.413..3804.537 rows=354834 loops=1)
-> Sort (cost=25870.49..26137.35 rows=106745 width=81) (actual time=2884.406..3099.732 rows=111878 loops=1)
Sort Key: b.uid
Sort Method: external merge Disk: 8928kB
…
Total runtime: 5588.105 ms
(14 rows)
http://bit.ly/pg-auto-explain

Summary
1. Collect as many metrics as you can, before
you need them
2. If the metrics that you have aren’t providing
the right value, build ones that do
3. Be aggressive in monitoring slow queries,
catch them while they’re easy to find

Resources
‣ http://dtdg.co/monitor-postgres
‣ https://dtdg.co/gcp-sql
‣ https://dtdg.co/postgresql-vacuums

Questions?
Seth Rosenblum @SethRosenblum
seth@datadoghq.com

Certaines fonctionnalités au cœur du produit de Datadog reposent sur des pipelines de données construits avec Spark qui traitent des milliers de milliards de points chaque jour. Dans cette présentation, nous verrons les grands principes que nous appliquons chez Datadog pour assurer que nos pipelines restent fiables malgré la croissance exponentielle du volume de données, les pannes matérielles, les données corrompues et les erreurs humaines. Paris Data Eng' Meetup du 26 février 2019 @Datadog

Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day

C4Media

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2mAKgJi. Ian Nowland and Joel Barciauskas talk about the challenges Datadog faces as the company has grown its real-time metrics systems that collect, process, and visualize data to the point they now handle trillions of points per day. They also talk about how the architecture has evolved, and what they are looking to in the future as they architect for a quadrillion points per day. Filmed at qconnewyork.com. Ian Nowland is the VP Engineering Metrics and Alerting at Datadog. Joel Barciauskas currently leads Datadog's distribution metrics team, providing accurate, low latency percentile measures for customers across their infrastructure.

Provisioning Datadog with Terraform

Matt Spurlin

Monitoring, Hold the Infrastructure - Getting the Most out of AWS Lambda – Da...

Amazon Web Services

Just as we got a hang of monitoring our server-based applications, they take away the server. How do you monitor something that doesn’t exist? What metrics matter most in a serverless world? In this session, we will look at how applications are different in a AWS Lambda-based world and how to monitor them. Join us as we work our way through the stack and demonstrate how to capture the health and performance of your services. The focus of this session is not tool specific. Attendees will learn production tested lessons and leave with frameworks they can implement with their serverless workloads regardless of the platforms and tools they use. Speaker: Matt Williams, Evangelist, Datadog

Using ClickHouse for Experimentation

Gleb Kanterov

tdtechtalk20160330johan

Johan Gustavsson

Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans

Spark Summit

Hoodie: How (And Why) We built an analytical datastore on Spark

Vinoth Chandar

In this session, we’ll follow the flow of data through an end-to-end system built to handle tens of terabytes an hour of event-oriented data, providing real-time streaming, in-memory, SQL, and batch access to this data. We’ll go into detail on how open source systems such as Hadoop, Kafka, Solr, and Impala/Hive can be stitched together to form the base platform; describe how and where to perform data transformation and aggregation; provide a simple and pragmatic way of managing event metadata; and talk about how applications built on top of this platform get access to data and extend its functionality. Finally, a brief demo of Rocana Ops, an application for large scale data center operations, will be given, along with an explanation about how it uses the underlying platform.

DataEngConf SF16 - High cardinality time series search

Hakka Labs

Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing

Vianney FOUCAULT

Rental Cars and Industrialized Learning to Rank with Sean Downes

Databricks

Data can be viewed as the exhaust of online activity. With the rise of cloud-based data platforms, barriers to data storage and transfer have crumbled. The demand for creative applications and learning from those datasets has accelerated. Rapid acceleration can quickly accrue disorder, and disorderly data design can turn the deepest data lake into an impenetrable swamp. In this talk, I will discuss the evolution of the data science workflow at Expedia with a special emphasis on Learning to Rank problems. From the heroic early days of ad-hoc Spark exploration to our first production sort model on the cloud, we will explore the process of industrializing the workflow. Layered over our story, I will share some best practices and suggestions on how to keep your data productive, or even pull your organization out of the data swamp.

Elastic Data Analytics Platform @Datadog

C4Media

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L. Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com. Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.

Streaming data for real time analysis

Amazon Web Services

Macy's: Changing Engines in Mid-Flight

DataStax Academy

This presentation recounts the story of Macys.com and Bloomingdales.com's migration from legacy RDBMS to NoSQL Cassandra in partnership with DataStax. One thing that differentiates this talk from others on Cassandra is Macy's philosophy of "doing more with less." You will see why we emphasize the performance tuning aspects of iterative development when you see how much processing we can support on relatively small configurations. This session will cover: 1) The process that led to our decision to use Cassandra 2) The approach we used for migrating from DB2 & Coherence to Cassandra without disrupting the production environment 3) The various schema options that we tried and how we settled on the current one. We'll show you a selection of some of our extensive performance tuning benchmarks, as well as how these performance results figured into our final schema designs. 4) Our lessons learned and next steps

Scaling graphite for application metrics

Jim Plush

Scaling monitoring with Datadogalexismidon

netflix-real-time-data-strata-talk

Danny Yuan

empirical analysis modeling of power dissipation control in internet data ce...

saadjamil31

DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ

Hakka Labs

Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund

Spark Summit

Symantec: Cassandra Data Modelling techniques in action

DataStax Academy

Our product presents an aggregated view of metadata collected for billions of objects (files, emails, sharepoint objects etc.). We used Cassandra to store those billions of objects along with aggregated view of that metadata. Customers can analyse the corpus of data in real time by searching in completely flexible way i.e. be able to get summary aggregates for many billions of objects, and then be able to further drill down to items by filtering using various facets of the metadata. We achieve this using a combination of Cassandra and ElasticSearch. This presentation will talk about various data modelling techniques we use to aggregate and then further summarise all that metadata and be able to search the summary in real t

Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...

Spark Summit

Analyzing and comparing your energy consumption with that of other consumers provides healthy peer pressure and useful insight leading to energy conservation and impacting the bottom line. We helped GridPocket (http://www.gridpocket.com/), a smart grid company developing energy management applications for electricity water and gas utilities, implement high scale anonymized energy comparison queries with an order of magnitude lower cost and higher performance than was previously possible. IoT use cases like that of GridPocket are swamping our planet with data, and drive demand for analytics on extremely scalable and low cost storage. Enter Spark SQL over Object Storage: highly scalable and low cost storage which provides RESTful APIs to store and retrieve objects and their metadata. Key performance indicators (KPIs) of query performance and cost are the number of bytes shipped from Object Storage to Spark and the number of incurred REST requests. We propose Pluggable Spark SQL Filters, which extend the existing Spark SQL partitioning mechanism with an ability to dynamically filter irrelevant objects during query execution. Our approach handles any data format supported by Spark SQL (Parquet, JSON, csv etc.), and unlike pushdown compatible formats such as Parquet which require touching each object to determine its relevance, it avoids accessing irrelevant objects altogether. We developed a pluggable interface for developing and deploying Filters, and implemented GridPocket’s filter which screens objects according to their metadata, for example geo-spatial bounding boxes which describe the area covered by an object’s data points. This leads to drastically lower KPIs since there is no need to ship the entire dataset from Object Storage to Spark if you are only comparing yourself with your neighborhood. We demonstrate GridPocket analytics notebooks, report on our implementation and resulting 10-20x speedups, explain how to implement a Pluggable File Filter, and how we applied this to other use cases.

Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber

confluent

Change Data Capture with Data Collector @OVH

Paris Data Engineers !

La collecte de données au sein d'un DataLake sans impacter les systèmes opérationnels est un challenge pour de nombreuses entreprises. Lors du meetup Paris Data Engineers du 26 mars 2019, Dimitri Capitaine nous a présenté Data Collector qui est un outil de Change Data Capture (CDC) développé en interne chez OVH. Data Collector est capable d'assurer une réplication fiable et performante des bases de données jusqu'au DataLake. Hugo Larcher nous a alors présenté un cas d'utilisation autour de l'exploitation de données aéronautiques avec une touche d'IoT et de DataViz.

Elastic Stack roadmap deep dive

Elasticsearch

Capital One: Using Cassandra In Building A Reporting Platform

DataStax Academy

As a leader in the financial industry, Capital One applications generate huge amounts of data that require fast and accurate handling, storage and analysis. We are transforming how we report operational data to our internal users so that they can make quick and precise business decisions to serve our customers. As part of this transformation, we are building a new Go-based data processing framework that will enable us to transfer data from multiple data stores (RDBMS, files, etc.) to a single NoSQL database - Cassandra. This new NoSQL store will act as a reporting database that will receive data on a near real-time basis and serve the data through scorecards and reports. We would like to share our experience in defining this fast data platform and the methodologies used to model financial data in Cassandra.

HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...

Altinity Ltd

Leveraging Amazon Redshift for Your Data Warehouse

Amazon Web Services

In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. You¹ll also hear from Dan Wagner, CEO at Civis Analytics, as he discusses why the Civis data science platform was designed on top of Amazon Redshift and the AWS platform in order to help smart organizations bridge their data silos, build 360 degree view of their customer relationships, and identify opportunities for driving their companies forward by leveraging enormous datasets, the power of analytics, and economies of scale on the AWS platform.

Ledingkart Meetup #4: Data pipeline @ lk

Mukesh Singh

What's hot

Building a system for machine and event-oriented data with Rocana

Treasure Data, Inc.

DataEngConf SF16 - High cardinality time series search

Hakka Labs

Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing

Vianney FOUCAULT

Rental Cars and Industrialized Learning to Rank with Sean Downes

Databricks

Elastic Data Analytics Platform @Datadog

C4Media

Streaming data for real time analysis

Amazon Web Services

Macy's: Changing Engines in Mid-Flight

DataStax Academy

Scaling graphite for application metrics

Jim Plush

Scaling monitoring with Datadogalexismidon

netflix-real-time-data-strata-talk

Danny Yuan

empirical analysis modeling of power dissipation control in internet data ce...

saadjamil31

DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ

Hakka Labs

Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund

Spark Summit

Symantec: Cassandra Data Modelling techniques in action

DataStax Academy

Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...

Spark Summit

Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber

confluent

Change Data Capture with Data Collector @OVH

Paris Data Engineers !

Elastic Stack roadmap deep dive

Elasticsearch

Capital One: Using Cassandra In Building A Reporting Platform

DataStax Academy

HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...

Altinity Ltd

What's hot (20)

Building a system for machine and event-oriented data with Rocana

DataEngConf SF16 - High cardinality time series search

Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing

Rental Cars and Industrialized Learning to Rank with Sean Downes

Elastic Data Analytics Platform @Datadog

Streaming data for real time analysis

Macy's: Changing Engines in Mid-Flight

Scaling graphite for application metrics

Scaling monitoring with Datadog

netflix-real-time-data-strata-talk

empirical analysis modeling of power dissipation control in internet data ce...

DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ

Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund

Symantec: Cassandra Data Modelling techniques in action

Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...

Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber

Change Data Capture with Data Collector @OVH

Elastic Stack roadmap deep dive

Capital One: Using Cassandra In Building A Reporting Platform

HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...

GlobusWorld 2024 Opening Keynote session

Globus

Globus Connect Server Deep Dive - GlobusWorld 2024

Globus

2024 RoOUG Security model for the cloud.pptx

Georgi Kodinov

Into the Box 2024 - Keynote Day 2 Slides.pdf

Ortus Solutions, Corp

Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...

Hivelance Technology

Cryptocurrency trading bots are computer programs designed to automate buying, selling, and managing cryptocurrency transactions. These bots utilize advanced algorithms and machine learning techniques to analyze market data, identify trading opportunities, and execute trades on behalf of their users. By automating the decision-making process, crypto trading bots can react to market changes faster than human traders Hivelance, a leading provider of cryptocurrency trading bot development services, stands out as the premier choice for crypto traders and developers. Hivelance boasts a team of seasoned cryptocurrency experts and software engineers who deeply understand the crypto market and the latest trends in automated trading, Hivelance leverages the latest technologies and tools in the industry, including advanced AI and machine learning algorithms, to create highly efficient and adaptable crypto trading bots

First Steps with Globus Compute Multi-User Endpoints

Globus

In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.

In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...

Juraj Vysvader

Recently uploaded (20)

Vitthal Shirke Microservices Resume Montevideo

Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...

Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...

Cyaniclab : Software Development Agency Portfolio.pdf

Designing for Privacy in Amazon Web Services

OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam

A Comprehensive Look at Generative AI in Retail App Testing.pdf

TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR

Globus Compute Introduction - GlobusWorld 2024

Why React Native as a Strategic Advantage for Startup Innovation.pdf

Corporate Management | Session 3 of 3 | Tendenci AMS

Lecture 1 Introduction to games development

Software Testing Exam imp Ques Notes.pdf

GlobusWorld 2024 Opening Keynote session

Globus Connect Server Deep Dive - GlobusWorld 2024

2024 RoOUG Security model for the cloud.pptx

Into the Box 2024 - Keynote Day 2 Slides.pdf

Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...

First Steps with Globus Compute Multi-User Endpoints

In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...

Monitoring and scaling postgres at datadog

1. Monitoring and Scaling Postgres at Datadog Seth Rosenblum, Datadog @SethRosenblum

2. Seth Rosenblum Data Reliability Engineering Lead ‣ Kafka ‣ Elasticsearch ‣ Cassandra ‣ Postgres! @SethRosenblum

3. @datadoghq SaaS-based monitoring Trillions of data points per day We’re hiring! https://jobs.datadoghq.com

4. Collecting data is cheap; not having it when you need it can be expensive

5. Collecting data is cheap; not having it when you need it can be expensive ...So instrument all the things!

6. Metrics connections commits rollbacks disk_read buffer_hit rows_returned rows_fetched rows_inserted rows_updated rows_deleted database_size deadlocks temp_bytes What metrics do we gather? temp_files bgwriter.checkpoints_timed bgwriter.checkpoints_requested bgwriter.buffers_checkpoint bgwriter.buffers_clean bgwriter.maxwritten_clean bgwriter.buffers_backend bgwriter.buffers_alloc bgwriter.buffers_backend_fsync bgwriter.write_time bgwriter.sync_time locks seq_scans seq_rows_read index_scans index_rows_fetched rows_hot_updated live_rows dead_rows index_rows_read table_size index_size total_size table.count max_connections percent_usage_connections replication_delay replication_delay_bytes heap_blocks_read heap_blocks_hit index_blocks_read index_blocks_hit toast_blocks_read toast_blocks_hit toast_index_blocks_read toast_index_blocks_hit

7. Scaling PostgreSQL at Datadog

8. Moar Resources!

9. Moar Instances!

10. Writes Repl Reads

11. Writes Repl Reads

12. “Postgres performance-optimizes a lot better when it has a consistent workload” Josh Berkus

13. Writes Repl Standbys

14. How we do it Requirements ▸Write master is writeable, read replicas are readable!

15. How we do it Requirements ▸Write master is writeable, read replicas are readable! ▸Read replicas are up to date and don’t lag

16. How we do it Requirements ▸Write master is writeable, read replicas are readable! ▸Read replicas are up to date and don’t lag ▸Additional read replicas can be provisioned quickly

17. How we do it Solutions ▸PostgreSQL! ▸http://bit.ly/pg-repl-docs ▸WAL-E ▸https://github.com/wal-e/wal-e

18. How we do it Solutions ▸PostgreSQL! ▸http://bit.ly/pg-repl-docs ▸WAL-E WAL-G ▸https://github.com/wal-g/wal-g

19. How we do it Requirements ▸Write master is writeable, read replicas are readable! ▸Read replicas are up to date and don’t lag ▸Additional read replicas can be provisioned quickly

20. What are we alerting on? ▸Write master is writeable, read replicas are readable! ▸Up/Down checks ▸Latency

21. What are we alerting on? ▸Read replicas are up to date and don’t lag ▸Write master standby availability ▸Write master standby replication lag ▸Read replica lag

22.

23. What are we alerting on? ▸Additional read replicas can be provisioned quickly ▸Base backups are functioning properly

24.

25. When do we care that a backup has failed?

26.

27.

28. Monitoring to Improve Performance

29. Slow Queries

30. Slow Queries

31. Performance: RAM vs Disk

32. “Aside from shared_buffers, the most important memory-allocation parameter is work_mem… Raising this value can dramatically improve the performance of certain queries…” Robert Haas

33. Finding **Inefficient** Queries

34. Finding **Inefficient** Queries

35. Latency vs Potential EXPLAIN ANALYZE http://bit.ly/pg-explain ‣ Explain displays the execution plan

36. Latency vs Potential EXPLAIN ANALYZE http://bit.ly/pg-explain ‣ Explain displays the execution plan ‣ Analyze runs it and gathers stats

37. Latency vs Potential EXPLAIN ANALYZE Merge Right Join (cost=25870.55..31017.51 rows=229367 width=92) (actual time=2884.501..5147.047 rows=354834 loops=1) Merge Cond: (a.uid = b.uid) -> Index Scan using foo on bar a (cost=0.00..537.29 rows=9246 width=27) (actual time=0.049..41.782 rows=9246 loops=1) -> Materialize (cost=25870.49..27204.80 rows=106745 width=81) (actual time=2884.413..3804.537 rows=354834 loops=1) -> Sort (cost=25870.49..26137.35 rows=106745 width=81) (actual time=2884.406..3099.732 rows=111878 loops=1) Sort Key: b.uid Sort Method: external merge Disk: 8928kB … Total runtime: 5588.105 ms (14 rows) http://bit.ly/pg-auto-explain

38. “Aside from shared_buffers, the most important memory-allocation parameter is work_mem… Raising this value can dramatically improve the performance of certain queries, but it's important not to overdo it.” Robert Haas

39. Track Connections

40. Connections By Application

41. Summary 1. Collect as many metrics as you can, before you need them 2. If the metrics that you have aren’t providing the right value, build ones that do 3. Be aggressive in monitoring slow queries, catch them while they’re easy to find

42. Resources ‣ http://dtdg.co/monitor-postgres ‣ https://dtdg.co/gcp-sql ‣ https://dtdg.co/postgresql-vacuums

43. Questions? Seth Rosenblum @SethRosenblum seth@datadoghq.com

Monitoring and scaling postgres at datadog

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Monitoring and scaling postgres at datadog

Similar to Monitoring and scaling postgres at datadog (20)

Recently uploaded

Recently uploaded (20)

Monitoring and scaling postgres at datadog