The Art of Database Experiments – PostgresConf Silicon Valley 2018 / San Jose
This document discusses the importance of database experiments for optimizing PostgreSQL performance and presents a structured approach to conducting them. It introduces Nancy CLI, an open-source tool developed by Postgres.ai for automating database experiments, and outlines best practices for performance analysis, including configuration tuning and workload evaluation. It also presents real-life examples showing how specific configuration changes affect database performance.
Overview of the presentation's focus on database experiments, speaker's expertise and background in PostgreSQL.
Importance of identifying SQL performance issues using monitoring tools such as pg_stat_statements and log analysis.
Methods to enhance SQL performance, including tuning configurations, index management, and resource allocation, with emphasis on best practices.
A case study on adjusting default_statistics_target with results demonstrating varying impacts on SQL performance.
Drawing parallels between database change management and automated testing frameworks used in other industries.
Live demonstration of Nancy CLI, a tool for conducting database experiments within various environments.
Components and structure of a database experiment, detailing input requirements like environment, object, workload, and delta.
Discussion on leveraging pre-existing tools to facilitate database experimentation and automation, including their integration.
Challenges faced in conducting experiments and managing databases effectively, including logging impacts and security concerns.
The Art of Database Experiments
Postgres.ai
Nikolay Samokhvalov
email: ns@postgres.ai
twitter: @postgresmen
About me:
Postgres experience: 13+ years (databases: 17+)
Past: Co-founder/CTO of 3 startups (total 30M+ users), all based on Postgres
Founder of #RuPostgres (1700+ members on Meetup.com, the 2nd largest globally)
Re-launched consulting practice in the SF Bay Area: PostgreSQL.support
Founder of Postgres.ai – a platform aimed to automate what is not yet automated
Twitter: @postgresmen
Email: ns@postgres.ai
The most common tools for general SQL performance analysis:
[Diagram: prod, monitoring, engineer; automated signals (alerts) and manual observations]
● pg_stat_statements
○ no query examples
○ no query plans
● log analysis (pgBadger)
○ requires maintenance
○ not “live”
○ no query plans (unless auto_explain is ON)
○ usually, not the full picture (log_min_duration_statement >> 0)
Four ways to improve performance:
1. Tune Postgres configuration (~280 knobs!)
2. Add/remove indexes (btree, hash, GiST, SP-GiST, GIN, RUM, BRIN, Bloom; unique, partial, functional, covering)
3. Rewrite query / redesign DB schema
4. Add more resources (CPU, RAM, disks)
No real-workload and real-data verification (or very limited and affecting production)
Sub-optimal or even far from optimal decisions
Expert DBA skips many steps… Black magic!
A real-life example. default_statistics_target: 100 vs 1000
The idea (from articles/blogs):
The default default_statistics_target (100) is too low.
Let’s change it to 1000!
...Sounds good!
...Deployed!
But did it really improve everything? …anything?
Let’s check with the Postgres.ai platform!
Two years later, we decided to conduct a DB experiment to compare 100 and 1000:
“before”: default_statistics_target = 100
“after”: default_statistics_target = 1000
In general, the new value gives better performance. Great!
Scroll down, a couple of query groups…
One query group became much slower after the change was applied!
Resolved with “ALTER TABLE/INDEX … ALTER COLUMN SET STATISTICS …” for the particular table/index and column
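The lesson of this example (an overall improvement can hide a per-query regression) can be sketched in a few lines of Python. This is only an illustration, not how the Postgres.ai platform works: the function name, the 1.5x threshold, and all timing numbers below are hypothetical.

```python
from typing import Dict, List, Tuple


def find_regressions(
    before_ms: Dict[str, float],
    after_ms: Dict[str, float],
    threshold: float = 1.5,  # hypothetical cut-off: flag groups that got 1.5x slower
) -> List[Tuple[str, float]]:
    """Return (query_group, slowdown_ratio) pairs where after/before >= threshold."""
    regressions = []
    for group, before in before_ms.items():
        after = after_ms.get(group)
        if after is None or before <= 0:
            continue  # group missing in the "after" run, or no usable baseline
        ratio = after / before
        if ratio >= threshold:
            regressions.append((group, ratio))
    # Worst slowdown first
    return sorted(regressions, key=lambda item: item[1], reverse=True)


# Hypothetical total time per query group, in ms (not taken from the talk)
before = {"group 1": 120.0, "group 2": 40.0, "group 3": 10.0}
after = {"group 1": 60.0, "group 2": 20.0, "group 3": 50.0}

total_before = sum(before.values())
total_after = sum(after.values())
print("overall better:", total_after < total_before)
print("regressed groups:", find_regressions(before, after))
```

With these made-up numbers the total time drops, yet one group is flagged as five times slower, which is exactly the situation a summary-only comparison would miss.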
Image credit: GFDL and CC-BY-2.5, Wikipedia.org
One more example...
Aviation, aeronautics, sports cars, etc.:
wind tunnel
Experiments in other fields:
1. Are conducted in a special “lab” environment, not prod (call it “staging”)
2. With very detailed, deep analysis
3. Highly automated. Multiple experimental runs, faster, less expensive
What is a Database Experiment?
The Input:
1. Environment
hardware, software, OS, FS, Postgres version, configuration
2. Object
some database (e.g. cloned production)
3. Workload
some SQL
4. Delta (optional; multiple values allowed)
some Postgres configuration change, or new index
The Output:
1. Summary
better or worse, in general?
2. Artifacts
any useful “telemetry” data
3. Per-query deep SQL analysis
each query: better or worse?
+ histograms (ms per query group #, before vs. after)
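The input/output structure above can be written down as plain data types. The following Python sketch is one possible modelling, assuming nothing about Nancy CLI's internals: all class names, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExperimentInput:
    environment: Dict[str, str]  # hardware, software, OS, FS, Postgres version, configuration
    database: str                # the "object", e.g. a clone of production
    workload: List[str]          # some SQL
    deltas: List[str] = field(default_factory=list)  # optional; multiple values allowed


@dataclass
class QueryGroupResult:
    query_group: str
    before_ms: float
    after_ms: float

    def verdict(self) -> str:
        """Per-query answer: better or worse?"""
        if self.after_ms < self.before_ms:
            return "better"
        if self.after_ms > self.before_ms:
            return "worse"
        return "same"


@dataclass
class ExperimentOutput:
    per_query: List[QueryGroupResult]
    artifacts: Dict[str, str] = field(default_factory=dict)  # any useful telemetry data

    def summary(self) -> str:
        """Overall answer, based on total time across all query groups."""
        before = sum(r.before_ms for r in self.per_query)
        after = sum(r.after_ms for r in self.per_query)
        if after < before:
            return "better"
        if after > before:
            return "worse"
        return "same"


# Hypothetical values, for illustration only
experiment = ExperimentInput(
    environment={"postgres_version": "9.6", "fs": "ext4"},
    database="clone_of_production",
    workload=["select 1;"],
    deltas=["default_statistics_target = 1000"],
)
output = ExperimentOutput(
    per_query=[
        QueryGroupResult("group 1", before_ms=120.0, after_ms=60.0),
        QueryGroupResult("group 2", before_ms=10.0, after_ms=50.0),
    ]
)
print(output.summary(), [r.verdict() for r in output.per_query])
```

Keeping the per-query results next to the summary makes it hard to report "better in general" without also seeing which individual groups got worse.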
Is it possible with existing tools? Yes!
● Docker
● pgreplay
● pg_stat_***
● auto_explain
● pgBadger (with JSON output)
● AWS EC2 spot instances
All the blocks exist already!
Nancy CLI: all these blocks (and more!), integrated and automated
https://github.com/postgres-ai/nancy
DIY automated pipeline for DB experiments and optimization
How to automate database optimization using ecosystem tools and AWS?
Analyze:
● pg_stat_statements
● auto_explain
● pgBadger to parse logs, use JSON output
● pg_query to group queries better
● pg_stat_kcache to analyze FS-level ops
Configuration:
● annotated.conf, pgtune, pgconfigurator, postgresqlco.nf
● ottertune
Suggested indexes (internal “what-if” API w/o actual execution)
● (useful: pgHero, POWA, HypoPG, dexter, plantuner)
Conduct experiments:
● pgreplay to replay logs (different log_line_prefix, you need to handle it)
● EC2 spot instances
Machine learning
● MADlib
pgBadger:
● Grouping queries can be implemented better (see pg_query)
● Makes all queries lower cased (hurts "camelCased" names)*
● Doesn’t really support plans (auto_explain)*
pgreplay and pgBadger are not friends: they require different log formats
*) Fixed/improved in pgBadger 10.0
Postgres.ai – artificial DBA/DBRE assistants
AI-based cloud-friendly platform to automate database administration
Steve
AI-based expert in capacity planning and database tuning
Joe
AI-based expert in query optimization and Postgres indexes
Nancy
AI-based expert in database experiments. Conducts experiments and presents results to human and artificial DBAs
Sign up for early access:
https://Postgres.ai
Meet Nancy CLI (open source)
Nancy CLI https://github.com/postgres-ai/nancy
● custom Docker image (Postgres with extensions & tools)
● nancy prepare-workload to convert Postgres logs (now only .csv) to a workload binary file
● nancy run to run experiments
● able to run locally (any machine) or in an EC2 spot instance (low price!), including i3.*** instances (with NVMe)
● fully automated management of EC2 spots
Nancy CLI – Use Cases
- Deep performance analysis (“clean run”), not affecting production
- Change management
- Regression testing when upgrading (software, hardware)
- Verify config optimization ideas
- Find optimal configuration
- … and train ML models for Postgres AI!
Part 4. Real-life challenges*
*) based on experience in 4 companies with small to mid-sized databases (up to a few TBs) and OLTP workloads (up to 15k TPS)
Fears of log_min_duration_statement = 0
Evaluate the impact of log_min_duration_statement = 0 (quickly and w/o any changes!):
https://gist.github.com/NikolayS/08d9b7b4845371d03e195a8d8df43408 (pay attention to comments)
Only ~300 kB/s expected,
~800 log entries per second
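The arithmetic behind such a forecast is simple enough to sketch. This is not the linked gist, only a back-of-the-envelope illustration: the function name is hypothetical, and the 375-byte average entry size is back-calculated from the two figures on the slide (~300 kB/s at ~800 entries per second), not measured.

```python
def forecast_log_write_rate(entries_per_second: float, avg_entry_bytes: float) -> float:
    """Estimated bytes written to the Postgres log per second."""
    return entries_per_second * avg_entry_bytes


# ~800 entries/s is the figure from the slide; 375 bytes/entry is an assumption
rate = forecast_log_write_rate(entries_per_second=800, avg_entry_bytes=375)
print(f"{rate / 1000:.0f} kB/s")  # 300 kB/s
```

In practice the entries-per-second input would have to come from an observation of the real workload, and the average entry size depends on query text length and log_line_prefix.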
What is better for intensive logging?
log_destination = syslog
logging_collector = off
or
log_destination = stderr
logging_collector = off
or
log_destination = csvlog
logging_collector = on
# Postgres 9.6.10
pgbench -U postgres -j2 -c24 -T60 -rnf - <<EOF
select;
EOF
All-queries logging with syslog is
● 44x slower compared to “no logging”
● 33x slower compared to stderr / logging collector
Be careful with syslog / journald
Advice for log_min_duration_statement = 0
● Forecast expected IOPS and write bandwidth
● Analyze your logging system (syslog/journald slows down everything; use stderr / logging collector)
● If the forecasted impact is too high (say, dozens of MB/s), consider sampling ("SET log_min_duration_statement = 0;" in particular, random sessions)
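The sampling idea (enable full logging only in a random subset of sessions) could look roughly like the sketch below on the application side. The function name and the 10% sample rate are hypothetical; how the returned statement gets executed at session start depends on the driver or connection pooler in use. Note that changing this setting at session level normally requires superuser privileges.

```python
import random


def session_setup_sql(sample_rate: float, rng: random.Random) -> str:
    """Return the SQL to run at session start, or an empty string if not sampled."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if rng.random() < sample_rate:
        return "SET log_min_duration_statement = 0;"
    return ""


# Simulate 10,000 new sessions with a hypothetical 10% sample rate
rng = random.Random(42)
statements = [session_setup_sql(0.1, rng) for _ in range(10_000)]
sampled_share = sum(1 for s in statements if s) / len(statements)
print(f"sessions with full logging: {sampled_share:.1%}")
```

The log volume then shrinks roughly in proportion to the sample rate, at the cost of seeing only part of the workload.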
Nancy real-world examples: shared_buffers
postila.ru,
Real workload (5 min),
61 GB RAM (ec2 i3.2xlarge),
DB size: ~500GB
Various shared_buffers values
shared_buffers = 15GB (25%)
If we go from 25% to higher values (~80%), we improve SQL performance ~50%
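A set of shared_buffers values for such a series of runs can be derived from the machine's RAM. The sketch below is only an illustration: the helper name and the intermediate 50% step are assumptions, while 61 GB, 25%, and ~80% come from the slide.

```python
from typing import List


def shared_buffers_candidates(ram_gb: float, fractions: List[float]) -> List[str]:
    """Build one config line per RAM fraction, rounded down to whole GB."""
    return [f"shared_buffers = {int(ram_gb * fraction)}GB" for fraction in fractions]


# 61 GB RAM as on the i3.2xlarge from the slide; 0.5 is a hypothetical middle point
candidates = shared_buffers_candidates(ram_gb=61, fractions=[0.25, 0.5, 0.8])
print(candidates)
# ['shared_buffers = 15GB', 'shared_buffers = 30GB', 'shared_buffers = 48GB']
```

Each resulting line is one "delta" for a separate experimental run against the same database and workload.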
Nancy real-world examples: seq_page_cost
GitLab.com: random_page_cost = 1.5 and seq_page_cost = 1
A decision was made to switch to seq_page_cost = 4
DB experiments with Nancy CLI were conducted to check whether this decision was good in terms of performance.
Results:
https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4835#note_106669373
– in general, SQL performance improved ~40%
WIP here; why this is so remains an open question.
Nancy real-world examples: educate yourself
PostgreSQL Documentation “19.5. Write Ahead Log”
https://www.postgresql.org/docs/current/static/runtime-config-wal.html
Just conduct a DB experiment with Nancy CLI, use --keep-alive 3600 and compare!
Various ways to create an experimental database
● plain text pg_dump
○ restoration is very slow (1 vcpu utilized)
○ “logical” – physical structure is lost (cannot experiment with bloat, etc)
○ small (if compressed)
○ “snapshot” only
● pg_dump with either -Fd (“directory”) or -Fc (“custom”):
○ restoration is faster (multiple vCPUs, -j option)
○ “logical” (again: bloat, physical layout is “lost”)
○ small (because compressed)
○ “snapshot” only
● pg_basebackup + WALs, point-in-time recovery (PITR), possibly with help from WAL-E, WAL-G, pgBackRest
○ less reliable, sometimes there are issues (especially if 3rd-party tools are involved, e.g. WAL-E & WAL-G don’t support tablespaces, there are bugs sometimes, etc.)
○ “physical”: bloat and physical structure are preserved
○ not small – ~ size of the DB
○ can “walk in time” (PITR)
○ requires warm-up procedure (data is not in the memory!)
● AWS RDS: create a replica + promote it
○ no Spots :-/
○ Lazy Load is tricky (it looks like the DB is there but it’s very slow – warm-up is needed)
● Snapshots
● Ideas for serialization
○ Stop Postgres / rsync “back” or re-copy locally on NVMe / start Postgres
How can we speed up experimental runs?
● Prepare the EC2 instance(s) in advance and keep it
● Prepare EBS volume(s) only (perhaps, using an instance of the different
type) and keep it ready. When attached to the new instance, do warm-up
● Resource re-usage:
○ reuse docker container
○ reuse EC2 instance
○ serialize experimental runs (DDL Do/Undo; VACUUM FULL; cleanup)
● Partial database snapshots (dump/restore only needed tables)
The future development of Nancy CLI
● Speedup DB creation
● ZFS/XFS snapshots to revert PGDATA state within seconds
● Support GCP
● More artifacts delivered: pg_stat_kcache, etc
● nancy see-report to print the summary + top-30 queries
● nancy compare-reports to print the “diff” for 2+ reports (the summary + numbers for top-30 queries, ordered by total time based on the 1st report)
● Postgres 11
● pgbench -i for database initialization
● pgbench to generate multithreaded synthetic workload
● Workload analysis: automatically detect “N+1 SELECT” when running workload
● Better support for the serialization of experimental runs
● Better support for multiple runs https://github.com/postgres-ai/nancy/pull/97
○ interval with step
○ gradient descent
● Provide costs estimation (time + money)
● Rewrite in Python or Go
Feedback/contributions welcome
https://github.com/postgres-ai/nancy
Challenge: security issues
Problem: a developer doesn’t have access to production.
Nancy works with production data/workload.
What about permissions and data protection?
Possible solutions:
● Option 1: allow using Nancy CLI only to those who already have access to production (DBAs, SREs, DBREs)
● Option 2: obfuscate data when preparing a DB clone (no universal
solution yet, TODO)
● Option 3: allow access only to GUI, hide/obfuscate parameters (TODO)
Challenge: reliable results
Issues:
1. A single run is not enough (fluctuations) – must repeat
2. “Before”/“after” runs on 2 different machines / EC2 nodes – “not fair” comparison (defective hardware, throttling)
Solutions (ideally: combination of them):
● Sequential runs
● 4+ iterations of each experimental run
● “Baseline benchmark” https://github.com/postgres-ai/nancy/issues/94
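Repeating each run several times makes it possible to separate a real change from run-to-run fluctuations. The sketch below shows one simple way to do that. It is a rough heuristic, not a statistical test and not part of Nancy CLI: the function names, the noise factor of 2, and all timing numbers are hypothetical.

```python
from statistics import median, stdev
from typing import List


def check_iterations(durations_ms: List[float], minimum: int = 4) -> None:
    """Enforce the '4+ iterations of each experimental run' rule."""
    if len(durations_ms) < minimum:
        raise ValueError(f"need at least {minimum} iterations, got {len(durations_ms)}")


def is_real_change(before_ms: List[float], after_ms: List[float],
                   noise_factor: float = 2.0) -> bool:
    """Treat a change as real only if the shift of the median exceeds
    noise_factor times the larger standard deviation of the two series."""
    check_iterations(before_ms)
    check_iterations(after_ms)
    shift = abs(median(after_ms) - median(before_ms))
    noise = max(stdev(before_ms), stdev(after_ms))
    return shift > noise_factor * noise


# Hypothetical total workload times (ms) from sequential runs on the same machine
before = [100.0, 104.0, 98.0, 102.0]
after_tuned = [80.0, 83.0, 79.0, 82.0]
after_noise = [99.0, 103.0, 101.0, 100.0]

print(is_real_change(before, after_tuned))  # True: shift is well above the noise
print(is_real_change(before, after_noise))  # False: shift is within fluctuations
```

Running "before" and "after" sequentially on the same machine keeps hardware differences out of the comparison, so the remaining spread reflects only run-to-run fluctuations.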