As the popularity of PostgreSQL continues to soar, many companies are exploring ways of migrating their application database over. At Redgate Software, we recently added PostgreSQL as an optional data store for SQL Monitor, our flagship monitoring application, after nearly 18 years of being backed exclusively by SQL Server. Knowing that others will be taking this journey in the near future, we'd like to discuss what we learned. In this training, we'll discuss the planning that needs to take place before a migration begins, including datatype changes, PostgreSQL configuration modifications, and query differences. This will be a mix of slides and demo from our own learnings, as well as those of some clients we've helped along the way.
8. • Application layer
• What tooling changes are necessary?
• Do you rely on SDK/framework features that are different in PostgreSQL
versions?
• If you have a multi-tenant application, could the migration be used
to improve tenant separation?
• Does your application rely on query hints?
• Why did you have them and how will you approach PostgreSQL with these
queries?
9. • Are partitions a better option in PostgreSQL?
• How will you manage partitions?
• Do you need to normalize some relationships more?
• Should you de-normalize a relationship that's been difficult
to maintain in its current form?
• Aggregated queries and materialized views – or methods
for doing incremental materialized views?
10.
11.
12. • Every non-serverless Postgres instance should be tuned
• Edit postgresql.conf
https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
https://www.youtube.com/watch?v=IFIXpm73qtk
https://postgresqlco.nf/
Helpful Tuning Resources
13. shared_buffers:
• data cache
• at least 25% of total RAM
work_mem:
• max memory for each plan operation
• JOINS, GROUP BY, ORDER BY can all have workers
maintenance_work_mem:
• memory for background tasks like VACUUM and ANALYZE
max_connections:
• exactly as it says
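As an illustrative sketch only (the values below are placeholders for a hypothetical 16 GB server, not recommendations), these settings can be applied with ALTER SYSTEM instead of editing postgresql.conf by hand:

```sql
-- Illustrative values for a hypothetical server with 16 GB RAM
ALTER SYSTEM SET shared_buffers = '4GB';          -- ~25% of total RAM
ALTER SYSTEM SET work_mem = '32MB';               -- per plan operation, per worker
ALTER SYSTEM SET maintenance_work_mem = '512MB';  -- background tasks like VACUUM
ALTER SYSTEM SET max_connections = 200;

-- shared_buffers and max_connections require a restart;
-- the others take effect after:
SELECT pg_reload_conf();
```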
14. wal_compression:
• Off (default)
• When enabled, uses pglz by default
• PG15 can use lz4 or zstd
autovacuum:
• On (default)
• Do not turn off
• Might have to tune individual tables
autoanalyze:
• On (default)
• Do not turn off
• Might have to tune individual tables
15. • SHOW will display the current value
• SET allows settings to be modified (when possible)
• Examples:
• SHOW ALL – page all settings
• SHOW max_connections – configured connection limit
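A quick sketch of SHOW and SET in practice (work_mem is just an example setting):

```sql
SHOW work_mem;                 -- display the current value
SET work_mem = '64MB';         -- session-level change
SET LOCAL work_mem = '64MB';   -- only for the current transaction
RESET work_mem;                -- back to the configured default
```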
16.
17.
18. For the default Read Committed Isolation Level:
• Each query only sees transactions that were committed before it
started (ID less than current transaction)
• At the beginning of a query, PostgreSQL records:
• The transaction ID (xid) of the current command
• All transaction IDs that are in process
• Multi-statement transactions also record any of their own
completed statements from earlier in transaction
19. • In PostgreSQL, transaction IDs are assigned when
the first command/statement is run within a
transaction, not when BEGIN is executed.
• XIDs are stored in-table and on-row
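Since xmin and xmax are hidden system columns, they can be inspected directly (using the deck's actor table as an example):

```sql
-- xmin/xmax are not returned by SELECT *; name them explicitly
SELECT xmin, xmax, * FROM actor LIMIT 5;

-- The current transaction ID can be inspected with:
SELECT txid_current();   -- pg_current_xact_id() in PG13+
```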
20. xmin xmax User Data
48 0 Hello!
INSERT
• xmin = current transaction ID
• xmax = zero (invalid)
21. xmin xmax User Data
48 78 Hello!
DELETE
• xmin is unchanged
• xmax = current transaction ID
Dead Tuple
22. xmin xmax User Data
48 78 Hello!
78 0 Hello World!
UPDATE
Original tuple:
• xmin is unchanged
• xmax = current transaction ID
New tuple:
• xmin = current transaction ID
• xmax = zero
Dead Tuple
23. xmin xmax User Data
48 78 Hello!
78 0 Hello!
UPDATE (with the same value)
Original tuple:
• xmin is unchanged
• xmax = current transaction ID
New tuple:
• xmin = current transaction ID
• xmax = zero
Dead Tuple
32. xmin xmax ID User data…
Sequential Scan
SELECT a, b, c FROM t;
33. xmin xmax ID User data…
Sequential Scan
SELECT a, b, c FROM t
WHERE ID=10;
Index
34. xmin xmax ID User data…
Index Scan
SELECT a, b, c FROM t
WHERE ID=10;
Index
35. • Serves two main purposes:
• Free up space on pages used by dead tuples
• Mark rows as "frozen" if their XID is sufficiently older
than the most recent XID, allowing the XID to be reused
in the future
• Additionally, if all rows on a page are frozen, the page can
be marked frozen to speed up scans
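A manual VACUUM can perform these same tasks on demand (using the deck's bluebox.rental table as an example):

```sql
VACUUM bluebox.rental;                     -- reclaim space from dead tuples
VACUUM (FREEZE) bluebox.rental;            -- aggressively freeze old rows
VACUUM (VERBOSE, ANALYZE) bluebox.rental;  -- also refresh planner statistics
```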
41. "Scale Factor" = Percentage of rows modified before process runs
"Threshold" = Added number of rows to add to the percent calculation.
(this prevents very small tables from taking resources)
autovacuum_vacuum_scale_factor = 0.2 (20%)
autovacuum_vacuum_threshold = 50
autovacuum_analyze_scale_factor = 0.1 (10%)
autovacuum_analyze_threshold = 50
Global Autovacuum and Autoanalyze Settings
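With the defaults above, the trigger point works out to a simple formula, and dead-tuple counts can be checked in pg_stat_user_tables:

```sql
-- autovacuum runs on a table when:
--   dead tuples > autovacuum_vacuum_threshold
--                 + autovacuum_vacuum_scale_factor * reltuples
-- e.g. for a 1,000,000-row table: 50 + 0.2 * 1,000,000 = 200,050 dead rows

SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```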
42. • Autovacuum triggers at 20% data modification
• Autoanalyze triggers at 10% data modification
• Analogous to SQL Server trace flag 2371, but per table!
43. Vacuum at 5% modification:
ALTER TABLE t SET (autovacuum_vacuum_scale_factor=0.05)
Vacuum threshold added to the percentage:
ALTER TABLE t SET (autovacuum_vacuum_threshold=500)
44. Setting maintenance_work_mem correctly:
• maintenance_work_mem is the amount of RAM that a
maintenance process like autovacuum can use to process pages of
data. If it is not sufficient, vacuuming (and other essential
maintenance processes) will take more time and fall behind.
• If autovacuum isn't completing its work regularly, check this setting!
Default setting = 64MB
53. • Own database objects
• Tables, Functions, Etc.
• Have cluster-level privileges (attributes)
• Granted privileges to database objects
• Can possibly grant privileges to other roles
54. • Semantically the same as roles
• By Convention:
• User = LOGIN
• Group = NOLOGIN
• PostgreSQL 8.2+ CREATE (USER|GROUP) is an
alias
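A minimal sketch of the user/group convention (role names are hypothetical):

```sql
-- A NOLOGIN role acts as a "group"; a LOGIN role acts as a "user"
CREATE ROLE reporting NOLOGIN;
CREATE ROLE alice LOGIN PASSWORD 'changeme';

GRANT reporting TO alice;   -- alice now inherits reporting's privileges
```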
55. • All roles always have membership in PUBLIC role
(different from public schema)
• This is where CONNECT is usually granted, but can
be provided on a role-by-role basis
56. • Standard PostgreSQL functions are typically
installed into the PUBLIC schema
• <=PG14 all roles had CREATE on this schema by
default
• Schemas are a common security boundary
• Search path assumes PUBLIC schema by default
57. • Current best practice is to never create user objects in
the PUBLIC schema
• This will require setting the "search_path" to avoid
schema-qualified object references
• Easily limit what users are allowed to do with
PostgreSQL supplied objects vs. user/application
objects
58. • Object creator = owner
• Initial object access = Principle of Least Privilege
• Unless specifically granted ahead of time, objects are
owned and "accessible" by the creator/superuser only
• Roles can specify default privileges to GRANT for
each object type that they create
61. ALTER DEFAULT PRIVILEGES
GRANT SELECT ON TABLES TO public;
cituscon=> \ddp
Default access privileges
Owner | Schema | Type | Access privileges
----------+--------+-------+---------------------------
postgres | | table | =r/postgres +
| | | postgres=arwdDxt/postgres
62. • Explicit GRANTs can be added each time
• Default privileges can be set by each role
• Very useful for things like migrations with an
"admin" application role
63.
64.
65. SQL Server         PostgreSQL
BIT                    BOOLEAN
VARCHAR/NVARCHAR       TEXT
DATETIME/DATETIME2     TIMESTAMP (kind of)
DATETIMEOFFSET         TIMESTAMP WITH TIME ZONE
UNIQUEIDENTIFIER       UUID (uuid-ossp extension; gen_random_uuid() built in since PG13)
VARBINARY              BYTEA
(no equivalent)        ARRAYS
(no equivalent)        RANGE
(no equivalent)        ENUM
66. • TEXT is suitable for 99% of string data
• Query memory doesn't work the same way,
therefore TEXT doesn't impact the allocated
memory like VARCHAR(MAX) in SQL Server
• Space efficient like VARCHAR
67. • Use TEXT with a length CHECK constraint rather than
VARCHAR(n)
• Allows easy modification to the column length
without costly table modification
• NULL characters can cause issues
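A sketch of the pattern (column names and the limit of 100 are arbitrary):

```sql
CREATE TABLE bluebox.actor (
    actor_id   bigint NOT NULL,
    first_name text NOT NULL CHECK (length(first_name) <= 100),
    last_name  text NOT NULL CHECK (length(last_name) <= 100)
);

-- Raising the limit later means dropping and re-adding the CHECK
-- constraint: a validation scan, but no table rewrite.
```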
69. CREATE TABLE bluebox.film (
film_id bigint NOT NULL,
title TEXT NOT NULL,
overview TEXT NOT NULL,
release_date date,
original_language TEXT,
rating public.mpaa_rating,
popularity real,
vote_count integer,
vote_average real,
budget bigint,
revenue bigint,
runtime integer,
fulltext tsvector GENERATED ALWAYS AS (to_tsvector('english'::regconfig,
((COALESCE(title, ''::text) || ' '::text) || COALESCE(overview,
''::text)))) STORED
);
70. SELECT e'Hello world!\000';
-- ERROR: invalid byte sequence for encoding "UTF8": 0x00
INSERT INTO actor (first_name, last_name)
VALUES (e'Hello world!\000','Smith');
71. • PostgreSQL has DATE, TIME, TIMESTAMP, and
TIMESTAMPTZ
• Most of the time you likely want TIMESTAMPTZ unless
you have a specific need for just DATE or TIME
• Date math is simple
• Use INTERVALs to modify by arbitrary units of time
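A few examples of the date math described above:

```sql
SELECT now() + interval '3 days';

-- Subtracting two dates yields an integer day count
SELECT date '2024-01-31' - date '2024-01-01';   -- 30

-- INTERVALs can combine arbitrary units
SELECT timestamptz '2024-01-01 00:00+00' + interval '1 month 2 hours';
```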
72. CREATE TABLE bluebox.rental (
rental_id INT NOT NULL,
rental_start datetime2,
rental_end datetime2,
inventory_id INT NOT NULL,
customer_id INT NOT NULL,
staff_id INT NOT NULL,
last_update datetime2 DEFAULT now() NOT NULL
);
73. CREATE TABLE bluebox.rental (
rental_id INT NOT NULL,
rental_start timestamptz,
rental_end timestamptz,
inventory_id INT NOT NULL,
customer_id INT NOT NULL,
staff_id INT NOT NULL,
last_update timestamptz DEFAULT now() NOT NULL
);
74. • Currently supports INT, NUMERIC, DATE, and TIMESTAMP(TZ)
• A possible replacement for typical schema design of
start_date/end_date
• Indexable with GIST indexes
• Unique range type operators for comparisons like overlap and
contains
• Does present some (potentially) annoying query modifications that
need to use LOWER/UPPER to grab different ends of the range
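As a sketch, assuming a rental_period tstzrange column in place of start/end dates:

```sql
-- Overlap (&&) finds rentals touching January 2024
SELECT * FROM bluebox.rental
WHERE rental_period && tstzrange('2024-01-01', '2024-02-01');

-- lower()/upper() grab the two ends of the range
SELECT lower(rental_period) AS rental_start,
       upper(rental_period) AS rental_end
FROM bluebox.rental;
```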
75. CREATE TABLE bluebox.rental (
rental_id INT NOT NULL,
rental_start datetime2,
rental_end datetime2,
inventory_id INT NOT NULL,
customer_id INT NOT NULL,
staff_id INT NOT NULL,
last_update datetime2 DEFAULT now() NOT NULL
);
76. CREATE TABLE bluebox.rental (
rental_id INT NOT NULL,
rental_period tstzrange,
inventory_id INT NOT NULL,
customer_id INT NOT NULL,
staff_id INT NOT NULL,
last_update timestamptz DEFAULT now() NOT NULL
);
77. • Any defined data type can have a corresponding
array type
• Indexable with GIN and GIST indexes
• Not useful as a foreign key replacement
78. CREATE TABLE bluebox.film (
film_id bigint NOT NULL,
title varchar(50) NOT NULL,
overview varchar(2000) NOT NULL,
release_date date,
original_language varchar(16),
rating public.mpaa_rating,
popularity real,
vote_count integer,
vote_average real,
budget bigint,
revenue bigint,
runtime integer
);
CREATE TABLE bluebox.genre (
genre_id int NOT NULL,
genre varchar(64) NOT NULL
);
CREATE TABLE bluebox.film_genre (
film_id int NOT NULL,
genre_id int NOT NULL
);
79. CREATE TABLE bluebox.film (
film_id bigint NOT NULL,
title TEXT NOT NULL,
overview TEXT NOT NULL,
release_date date,
original_language TEXT,
genre_ids int[],
rating public.mpaa_rating,
popularity real,
vote_count integer,
vote_average real,
budget bigint,
revenue bigint,
runtime integer,
fulltext tsvector GENERATED ALWAYS AS (to_tsvector('english'::regconfig,
((COALESCE(title, ''::text) || ' '::text) || COALESCE(overview,
''::text)))) STORED
);
80. Name |Value
-----------------+-----------------------
film_id |1726
title |Iron Man
overview |After being held captiv
release_date |2008-05-02
genre_ids |{28,878,12}
original_language|en
rating |PG-13
popularity |63.926
vote_count |24957
vote_average |7.6
budget |140000000
revenue |585174222
runtime |126
SELECT * FROM film;
81. Name |Value
-----------------+-----------------------
film_id |1726
title |Iron Man
overview |After being held captiv
release_date |2008-05-02
genre_ids |{28,878,12}
original_language|en
rating |PG-13
popularity |63.926
vote_count |24957
vote_average |7.6
budget |140000000
revenue |585174222
runtime |126
SELECT * FROM film
WHERE genre_ids @> '{28}';
82. Name |Value
-----------------+-----------------------
film_id |1726
title |Iron Man
overview |After being held captiv
release_date |2008-05-02
genre_ids |{28,878,12}
original_language|en
rating |PG-13
popularity |63.926
vote_count |24957
vote_average |7.6
budget |140000000
revenue |585174222
runtime |126
SELECT * FROM film
WHERE genre_ids && '{28,100}';
85. • Creation defaults to B-tree
• Support for:
• multi-column
• partial (filtered)
• covering indexes
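Sketches of each variation, using the deck's bluebox.rental table (index names are arbitrary):

```sql
-- Multi-column
CREATE INDEX idx_rental_cust_inv
    ON bluebox.rental (customer_id, inventory_id);

-- Partial (filtered): only index open rentals
CREATE INDEX idx_rental_open
    ON bluebox.rental (customer_id)
    WHERE rental_end IS NULL;

-- Covering: INCLUDE columns can be returned without heap fetches
CREATE INDEX idx_rental_cover
    ON bluebox.rental (customer_id)
    INCLUDE (staff_id);
```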
86. • CONCURRENT create, reindex, and drop
• Multiple types supported
• Custom implementations of index types supported
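The CONCURRENTLY variants avoid blocking writes during index maintenance (REINDEX ... CONCURRENTLY requires PG12+):

```sql
CREATE INDEX CONCURRENTLY idx_film_release ON bluebox.film (release_date);
REINDEX INDEX CONCURRENTLY idx_film_release;
DROP INDEX CONCURRENTLY idx_film_release;
```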
87. • PostgreSQL uses HEAP-based storage
• No support for CLUSTERED/INDEX ORGANIZED
tables
• This does impact query plans because of random
page access
• Indexing strategies may change based on this
89. • Classic tree-based indexing
• Default if no index type is specified
• Used for all traditional predicate operators
• Cannot index composite types (e.g. arrays, JSONB)
• PostgreSQL 13+ provides deduplication
90. • Very fast pre-hash of column values
• Can only be used for equality (=) operator
• Don't use prior to PostgreSQL 10
• Before PG10 they were not WAL-logged, so not crash safe or replicated
91. • Generalized Search Tree Index
• Well suited for overlapping data
• Intermediate index pages are typically bitmaps
Common Uses
• Full-text search capabilities
• Geospatial
92. • Generalized INverted Index
• Useful when multiple values point to one row on disk
• Maintenance heavy
Common Uses
• Arrays
• JSONB
• Range types
93. • Block Range INdex
• Stores high/low for each range of pages on disk
• Block defaults to 128 pages
Common Uses
• Data with natural ordering on disk
• Timestamps, Zip Codes, odometer-like readings
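A sketch of a BRIN index on a timestamp column with natural insert ordering (index names are arbitrary):

```sql
-- Suits append-only data where on-disk order tracks the column value
CREATE INDEX idx_rental_start_brin
    ON bluebox.rental USING brin (rental_start);

-- pages_per_range defaults to 128; smaller ranges give finer granularity
-- at the cost of a larger index
CREATE INDEX idx_rental_start_brin_fine
    ON bluebox.rental USING brin (rental_start)
    WITH (pages_per_range = 32);
```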
94.
95. • Don't Repeat Yourself
• "If you can dream it, there is a SQL function for it in
PostgreSQL"
• Operators, Window, Aggregate, Arrays, and more!
96. • Functions can be overloaded
• Written in any supported language or compiled
variant
• Numerous query planner parameters available to
influence function execution
• Three modes: IMMUTABLE/STABLE/VOLATILE
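A minimal sketch of a volatility declaration (the function itself is hypothetical):

```sql
-- IMMUTABLE: same inputs always produce the same output,
-- so the planner may pre-evaluate or fold calls
CREATE FUNCTION full_title(title text, yr int)
RETURNS text
LANGUAGE sql IMMUTABLE
AS $$ SELECT title || ' (' || yr || ')' $$;

SELECT full_title('Iron Man', 2008);   -- Iron Man (2008)
```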
97. SQL Server:
• Scalar
• Inline Table Valued
• Multi-statement Table Valued
• No tuning hints for query planner
• T-SQL or CLR (less common)
Postgres:
• Scalar
• Table/Set Types
• Triggers
• Query planner tuning parameters
• Any procedural language, but typically SQL or PL/pgSQL
98. ROWS – Hint row count for planner (1,000 default)
COST – Estimated execution cost (per row)
PARALLEL – SAFE/RESTRICTED/UNSAFE
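These parameters attach to the function definition; a hedged sketch against the deck's bluebox.rental table:

```sql
CREATE FUNCTION recent_rentals(cust int)
RETURNS SETOF bluebox.rental
LANGUAGE sql STABLE
ROWS 20          -- hint: expect ~20 rows, not the 1,000 default
COST 50          -- estimated cost (per returned row for set-returning functions)
PARALLEL SAFE
AS $$ SELECT * FROM bluebox.rental WHERE customer_id = cust $$;
```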
99. • Introduced in PostgreSQL 11
• Nobody calls them "SPROC"
• Cannot return a result set
• One INOUT parameter possible
• No plan cache improvement (mostly)
100. • Stored procedures can create sub-transactions
• Commit transactions that survive later failure
• Helpful in data processing/ETL routines
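A sketch of the batch-commit pattern (staging_log and the procedure are hypothetical):

```sql
CREATE TABLE staging_log (batch int);   -- hypothetical target table

CREATE PROCEDURE batch_load()
LANGUAGE plpgsql
AS $$
BEGIN
    FOR i IN 1..10 LOOP
        INSERT INTO staging_log (batch) VALUES (i);
        COMMIT;   -- each batch survives even if a later one fails
    END LOOP;
END;
$$;

CALL batch_load();
```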
101. SQL Server
• Complex logic
• Multi-language
• Contained in calling
transaction scope
• With some work, can return
results
Postgres
• Complex logic
• Multi-language
• Can issue COMMIT/ROLLBACK
and keep processing
105. • TEXT instead of VARCHAR
• UTF-8 by default. No need for two separate kinds of
text fields
• ICU in recent versions for better collation control
106. • Use IDENTITY columns rather than SERIAL
• Semantically the same under the covers for now (a
sequence is still created), but more control using
SQL standard IDENTITY
• By default, IDENTITY can be inserted into
• Must use ALWAYS to prevent this
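A sketch of the ALWAYS variant using the deck's genre table:

```sql
CREATE TABLE bluebox.genre (
    genre_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    genre    text NOT NULL
);

-- GENERATED ALWAYS rejects explicit inserts into genre_id;
-- GENERATED BY DEFAULT would allow them
INSERT INTO bluebox.genre (genre) VALUES ('Action');
```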
107. • All collations are case sensitive
• PostgreSQL converts to lower case unless text qualified
• By convention, prefer snake_case rather than
CamelCase
• Using camel case will require quotes – e.g. "CamelCase"
108. • PostgreSQL timestamp resolution is microseconds
• SQL Server's datetime2 resolves to 100 nanoseconds
• Practically, this matters little for most calculations,
but if anything relies on that extra precision (incoming
data), it will need to be handled.
109. • INSERT… ON CONFLICT
• DO UPDATE - Simple, constraint aware method for doing
"upserts"
• DO NOTHING - Error-free inserts when data is duplicated
• MERGE (as of PG15)
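A sketch of both ON CONFLICT forms, assuming a unique or primary-key constraint on genre_id:

```sql
-- Upsert: insert, or update the existing row on conflict
INSERT INTO bluebox.genre (genre_id, genre)
VALUES (28, 'Action')
ON CONFLICT (genre_id)
DO UPDATE SET genre = EXCLUDED.genre;

-- Or silently skip duplicates
INSERT INTO bluebox.genre (genre_id, genre)
VALUES (28, 'Action')
ON CONFLICT DO NOTHING;
```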
110. • PostgreSQL can create prepared statements with
parameters
• There are no named parameters – only positional ($1, $2, …)
• Limited to 65,535 parameters per statement
111. • PostgreSQL executes SQL as… SQL, not a
procedural variant
• To use procedural features, something like
pl/pgSQL needs to be invoked specifically
• Inline SQL variables must be used in an anonymous
code block, but they can't return a table/set of data
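A minimal sketch of an anonymous DO block with a variable (the body is illustrative):

```sql
DO $$
DECLARE
    total bigint;
BEGIN
    SELECT count(*) INTO total FROM bluebox.film;
    RAISE NOTICE 'films: %', total;   -- DO blocks cannot return a result set
END;
$$;
```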
112. • Move towards functions or stored procedures
• Any supported and installed language can be used,
most without pre-compiling (although compiled will
be faster)
117. • pg_dump – by default, creates a SQL script for
recreating the database & inserting data
o You can dump to a binary format, making for easier transport and
management
o If you dump with pg_dump's binary format, you'll need pg_restore to
recover
• pg_dumpall
o Runs pg_dump for every database, plus cluster-wide objects
(roles, tablespaces)
118. • Write Ahead Log (WAL) Archive
o wal_level must be set to replica or logical
o archive_mode must be set to on or always
o Edit the archive_command to supply a path to your WAL
archive
o Must have a base backup
119. • Base Backup
o Pg_basebackup
o All databases are copied to target location & WAL start
point is set
120. • First, run pg_switch_wal to close the current WAL file and move it to the archive
• Stop the service
• Remove all files & subdirectories in your data directory
• Remove all files from pg_wal
• Restore databases from Base Backup
• Change the postgresql.conf file to start recovery, add recovery_target_time
• Create recovery.signal in the cluster data directory
• Start the service
121.
122. • Before/After scripts
• Logfile support
• Data casting
• Error support
• Continuous migrations
• ...and more
• Initially created by Dimitri Fontaine
• CLI application
• Many migration formats
• CSV, DBF, IXF
• SQLite, MySQL, SQL Server
• PostgreSQL to PostgreSQL
• Limited Redshift support
123. • Perl-based migration tool
• Open-source with long-standing usage and support
• Export all major objects and data to SQL
• Migration assessment tooling
https://ora2pg.darold.net/
124. • Wide-ranging conversion tool for AWS hosted
databases
• Converts schema into target database
• Data is usually handled separately with extraction
agents
https://docs.aws.amazon.com/SchemaConversionTool/latest/userguide/CHAP_Welcome.html
125. • Transparent SQL Server -> PostgreSQL migration
• Open-source
• Available in AWS Aurora Postgres
• Stackgres.io
• Not fully ready for complex applications
https://babelfishpg.org/
127. • Managed service for end-to-end migrations
• Primarily intended to migrate into AWS Aurora DB
• Ongoing change replication
• Pricing is complex like many hosted services
https://aws.amazon.com/dms/
130. DevOps Advocate
Microsoft Data Platform MVP
AWS Community Builder
grant@scarydba.com
scarydba.com
@gfritchey
linkedin.com/in/grant-fritchey
Editor's Notes
Presenter: Udara
Steve will cover the intro to database DevOps
Anderson will do the workshop component
Talk about breaks and lunch; after the workshop, if you are available to hang around, we have some drinks available at the George Bar.
The goal of this training is to provide lots of tips and learnings from many years working between different database migrations.
The hope is that you can take away even one bit of information that makes a big difference in your understanding. I attended a similar training years ago for SQL Server, and while there was lots of good information provided, one minor slide at the end of the day talked about a trace flag that solved one of the major problems we were having. While I learned a lot, that one thing made it all worth it.
Hopefully, one thing we share this afternoon will be that kind of help.