Build a Complex, Realtime Data Management App with Postgres 14!

Chicago PostgreSQL User Group - October 20, 2021 Jonathan S. Katz
Let's Build a Complex, Real-Time Data Management Application

• VP, Platform Engineering @ Crunchy Data
• Previously: Engineering Leadership @ Startups
• Longtime PostgreSQL community contributor
• Core Team Member
• Various Governance Committees
• Conference Organizer / Speaker
• @jkatz05
About Me

• Leading Team in Postgres – 10 contributors
• Certified Open Source PostgreSQL Distribution
• Leader in Postgres Technology for Kubernetes
• Crunchy Bridge: Fully managed cloud service
Crunchy Data
Your partner in deploying
open source PostgreSQL
throughout your enterprise.

CPSM Provider Plugin
This talk introduces many different tools and techniques available
in PostgreSQL for building applications.
It introduces different features and where to find out more
information.
We have a lot of material to cover in a short time - the slides and
demonstrations will be made available
How to Approach This Talk

Imagine we are managing virtual rooms for an event platform.
We have a set of operating hours in which the rooms can be
booked.
Only one booking can occur in a virtual room at a given time.
The Problem

For Example

We need to know...
- All the rooms that are available to book
- When the rooms are available to be booked (operating hours)
- When the rooms have been booked
And...
The system needs to be able to CRUD fast
(Create, Read, Update, Delete. Fast).
Specifications

Interlude:
Finding Availability

Availability can be thought about in three ways:
Closed
Available
Unavailable (or "booked")
Our ultimate "calendar tuple" is (room, status, range)
Managing Availability

PostgreSQL 9.2 introduced "range types" that included the ability to store and
efficiently search over ranges of data.
Built-in:
Date, Timestamps
Integer, Numeric
Lookups (e.g. overlaps) can be sped up using GiST indexes
Postgres Range Types
SELECT tstzrange('2021-10-28 09:30'::timestamptz, '2021-10-28 10:30'::timestamptz);
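To make the overlap lookup concrete outside the database, here is a minimal Python sketch of a half-open range, which is the default bound style of tstzrange. The TimeRange class and the ts helper are hypothetical names used only for illustration; in PostgreSQL the same check is the && operator.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class TimeRange(NamedTuple):
    """Half-open range [lower, upper), like the default tstzrange bounds."""
    lower: datetime
    upper: datetime

    def overlaps(self, other: "TimeRange") -> bool:
        # Two half-open ranges overlap when each starts before the other ends.
        return self.lower < other.upper and other.lower < self.upper


def ts(hour: int, minute: int = 0) -> datetime:
    """Hypothetical helper: a UTC timestamp on 2021-10-28."""
    return datetime(2021, 10, 28, hour, minute, tzinfo=timezone.utc)


booked = TimeRange(ts(9, 30), ts(10, 30))
adjacent = TimeRange(ts(10, 30), ts(11, 0))   # starts exactly when booked ends
clashing = TimeRange(ts(10, 0), ts(11, 0))    # starts inside booked

print(booked.overlaps(adjacent))  # False
print(booked.overlaps(clashing))  # True
```

Because the upper bound is exclusive, back-to-back bookings do not count as a double-booking.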

Availability
SELECT *
FROM (
VALUES
('closed', tstzrange('2021-10-28 0:00', '2021-10-28 8:00')),
('available', tstzrange('2021-10-28 08:00', '2021-10-28 09:30')),
('unavailable', tstzrange('2021-10-28 09:30', '2021-10-28 10:30')),
('unavailable', tstzrange('2021-10-28 16:30', '2021-10-28 18:30')),
('closed', tstzrange('2021-10-28 20:00', '2021-10-29 0:00'))
) x(status, calendar_range)
ORDER BY lower(x.calendar_range);
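As a rough illustration of where rows shaped like (status, range) can come from, the sketch below derives a full day of calendar entries from operating hours and a list of bookings. The function name build_calendar, the at helper, and the 08:00 to 20:00 operating hours are assumptions made for this example; it is not the code used in the demo.

```python
from datetime import datetime


def build_calendar(day_start, day_end, open_hours, bookings):
    """Return (status, (start, end)) rows covering day_start..day_end.

    Assumes bookings are sorted, do not overlap each other,
    and fall entirely inside open_hours.
    """
    open_start, open_end = open_hours
    rows = []
    if day_start < open_start:
        rows.append(("closed", (day_start, open_start)))
    cursor = open_start
    for start, end in bookings:
        if cursor < start:
            # free time between the previous booking and this one
            rows.append(("available", (cursor, start)))
        rows.append(("unavailable", (start, end)))
        cursor = end
    if cursor < open_end:
        rows.append(("available", (cursor, open_end)))
    if open_end < day_end:
        rows.append(("closed", (open_end, day_end)))
    return rows


def at(day, hour, minute=0):
    """Hypothetical helper: a timestamp in October 2021."""
    return datetime(2021, 10, day, hour, minute)


rows = build_calendar(
    at(28, 0), at(29, 0),
    open_hours=(at(28, 8), at(28, 20)),
    bookings=[(at(28, 9, 30), at(28, 10, 30)), (at(28, 16, 30), at(28, 18, 30))],
)
for status, (start, end) in rows:
    print(status, start.time(), end.time())
```

Every booking splits an available range into up to two pieces, which is the "dividing them up" work the database has to do on each change.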

Insert new ranges and dividing them up
PostgreSQL did not work well with noncontiguous ranges…until PostgreSQL 14
Availability
Just for one day - what about other days?
What happens with data in the past?
What happens with data in the future?
Unavailability
Ensure no double-bookings
Overlapping Events?
Handling multiple spaces
But…

availability_rule
  id <serial> PRIMARY KEY
  room_id <int> REFERENCES (room)
  days_of_week <int[]>
  start_time <time>
  end_time <time>
  generate_weeks_into_future <int> DEFAULT 52
room
  id <serial> PRIMARY KEY
  name <text>
availability
  room_id <int> REFERENCES (room)
  availability_rule_id <int> REFERENCES (availability_rule)
  available_date <date>
  available_range <tstzrange>
unavailability
  room_id <int> REFERENCES (room)
  unavailable_date <date>
  unavailable_range <tstzrange>
calendar
  room_id <int> REFERENCES (room)
  status <text> DOMAIN: {available, unavailable, closed}
  calendar_date <date>
  calendar_range <tstzrange>

We can now store data, but what about:
Generating initial calendar?
Generating availability based on rules?
Generating unavailability?
Sounds like we need to build an application

To build our application, there are a few topics we will need to explore first:
generate_series
Recursive queries
Ranges and Multiranges
SQL Functions
Set returning functions
PL/pgsql
Triggers

Generate series is a "set returning" function, i.e. a function that can return
multiple rows of data.
Generate series can return:
A set of numbers (int, bigint, numeric) either incremented by 1 or some
other integer interval
A set of timestamps incremented by a time interval(!!)
generate_series:
More Than Just For Test Data
SELECT x::date
FROM generate_series(
'2021-01-01'::date, '2021-12-31'::date, '1 day'::interval
) x;
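For readers more at home in Python, the sketch below mimics a date-based generate_series with a generator and then filters the dates by weekday, roughly the way an availability rule's days_of_week could be applied. The date_series name and the ISO weekday numbering (Monday = 1) are assumptions for this example; the deck does not say which numbering days_of_week uses.

```python
from datetime import date, timedelta


def date_series(start, stop, step=timedelta(days=1)):
    """Yield dates from start to stop inclusive, like generate_series."""
    current = start
    while current <= stop:
        yield current
        current += step


# Assumption: ISO weekday numbers, Monday = 1 ... Sunday = 7
days_of_week = {1, 3}  # Mondays and Wednesdays

matching = [
    d for d in date_series(date(2021, 10, 18), date(2021, 10, 31))
    if d.isoweekday() in days_of_week
]
print(matching)
```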

PostgreSQL 8.4 introduced the "WITH" syntax and with it also introduced the
ability to perform recursive queries
WITH RECURSIVE ... AS ()
Base case vs. recursive case
UNION vs. UNION ALL
CAN HIT INFINITE LOOPS
Recursion in SQL?

Recursion in SQL?
WITH RECURSIVE fac AS (
SELECT
1::numeric AS n,
1::numeric AS i
UNION
SELECT
fac.n * (fac.i + 1),
fac.i + 1 AS i
FROM fac
)
SELECT fac.n, fac.i
FROM fac;
Infinite Recursion

Recursion in SQL?
WITH RECURSIVE fac AS (
SELECT
1::numeric AS n,
1::numeric AS i
UNION
SELECT
fac.n * (fac.i + 1),
fac.i + 1 AS i
FROM fac
)
SELECT fac.n, fac.i
FROM fac
LIMIT 100;
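One way to picture why a LIMIT tames the infinite query: the recursive term behaves like a lazy generator, and the outer query only pulls as many rows as it needs. The Python sketch below is an analogy under that assumption, not a description of PostgreSQL internals; factorial_rows is a hypothetical name.

```python
from itertools import islice


def factorial_rows():
    """Yield (n, i) rows in the order the recursive query produces them."""
    n, i = 1, 1                      # base case: SELECT 1 AS n, 1 AS i
    while True:                      # the recursive case never stops on its own
        yield n, i
        n, i = n * (i + 1), i + 1    # fac.n * (fac.i + 1), fac.i + 1


# islice plays the role of LIMIT: only five rows are ever computed
rows = list(islice(factorial_rows(), 5))
print(rows)  # [(1, 1), (2, 2), (6, 3), (24, 4), (120, 5)]
```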

Postgres 14 introduces multirange types
Ordered list of ranges
Can be noncontiguous
Adds the range_agg aggregate and unnest support for multiranges
Multirange Types
SELECT
datemultirange(
daterange(CURRENT_DATE, CURRENT_DATE + 1),
daterange(CURRENT_DATE + 5, CURRENT_DATE + 8),
daterange(CURRENT_DATE + 15, CURRENT_DATE + 22)
);
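The "ordered list of ranges" idea can be sketched in a few lines of Python: sort the ranges and merge any that overlap or touch, leaving an ordered, disjoint list. This is an illustrative analogy to what aggregating ranges into a multirange produces; merge_ranges is a hypothetical helper and integers stand in for dates.

```python
def merge_ranges(ranges):
    """Merge half-open (lower, upper) ranges into an ordered, disjoint list."""
    merged = []
    for lower, upper in sorted(ranges):
        if merged and lower <= merged[-1][1]:
            # overlaps or touches the previous range: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], upper))
        else:
            merged.append((lower, upper))
    return merged


print(merge_ranges([(15, 22), (0, 1), (5, 8), (7, 10), (1, 2)]))
# [(0, 2), (5, 10), (15, 22)]
```

The result can have gaps between its members, which is exactly what a single range type cannot express.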

PostgreSQL provides the ability to write functions to help encapsulate
repeated behavior
PostgreSQL 11 introduced stored procedures, which enable you to
embed transactions! PostgreSQL 14 adds the ability to get output from stored
procedures!
SQL functions have many properties, including:
Input / output
Volatility (IMMUTABLE, STABLE, VOLATILE) (default VOLATILE)
Parallel safety (default PARALLEL UNSAFE)
LEAKPROOF; SECURITY DEFINER
Execution Cost
Language type (more on this later)
Functions

Functions
CREATE OR REPLACE FUNCTION chipug_fac(n int)
RETURNS numeric
AS $$
WITH RECURSIVE fac AS (
SELECT
1::numeric AS n,
1::numeric AS i
UNION
SELECT
fac.n * (fac.i + 1),
fac.i + 1 AS i
FROM fac
WHERE i + 1 <= $1
)
SELECT max(fac.n)
FROM fac;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;

Functions
CREATE OR REPLACE FUNCTION chipug_fac_set(n int)
RETURNS SETOF numeric
AS $$
WITH RECURSIVE fac AS (
SELECT
1::numeric AS n,
1::numeric AS i
UNION
SELECT
fac.n * (fac.i + 1),
fac.i + 1 AS i
FROM fac
WHERE i + 1 <= $1
)
SELECT fac.n
FROM fac
ORDER BY fac.n;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;

Functions
CREATE OR REPLACE FUNCTION chipug_fac_table(n int)
RETURNS TABLE(n numeric)
AS $$
WITH RECURSIVE fac AS (
SELECT
1::numeric AS n,
1::numeric AS i
UNION
SELECT
fac.n * (fac.i + 1),
fac.i + 1 AS i
FROM fac
WHERE i + 1 <= $1
)
SELECT fac.n
FROM fac
ORDER BY fac.n;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;

PostgreSQL has the ability to load in procedural languages ("PL") and execute
code in them beyond SQL.
Built-in: pgSQL, Python, Perl, Tcl
Others: Javascript, R, Java, C, JVM, Container, LOLCODE, Ruby, PHP, Lua,
pgPSM, Scheme
Procedural Languages

PL/pgSQL
CREATE EXTENSION IF NOT EXISTS plpgsql;
CREATE OR REPLACE FUNCTION chipug_fac_plpgsql(n int)
RETURNS numeric
AS $$
DECLARE
fac numeric;
i int;
BEGIN
fac := 1;
FOR i IN 1..n LOOP
fac := fac * i;
END LOOP;
RETURN fac;
END;
$$ LANGUAGE plpgsql IMMUTABLE PARALLEL SAFE;

Triggers are functions that can be called before/after/instead of an operation or event
Data changes (INSERT/UPDATE/DELETE)
Events (DDL, DCL, etc. changes)
Atomic
Must return "trigger" or "event_trigger"
(Return "NULL" in a trigger if you want to skip operation)
(Gotcha: RETURN OLD in an INSERT trigger or RETURN NEW in a DELETE trigger returns NULL, which skips the operation)
Execute once per modified row or once per SQL statement
Multiple triggers on same event will execute in alphabetical order
Writable in any procedural language that defines a trigger interface
Triggers

Building a
Synchronized System

We'll Scan the Code
It's Available for Download 😉

[Test your live demos before running them, and you will have much
success!]
availability_rule inserts took some time, > 350ms
availability: INSERT 52
calendar: INSERT 52 from nontrivial function
Updates on individual availability / unavailability are not too painful
Lookups are faaaaaaaast
Lessons of the Test

Recursive CTE 😢
Even with only 100 more rooms with a few sets of rules, rule
generation time increased significantly
Multirange Types
These are still pretty fast and are handling scaling up well.
May still be slow for a web transaction.
Lookups are still lightning fast!
Web Scale

Added in PostgreSQL 9.4
Replays all logical changes made to the database
Create a logical replication slot in your database
Only one receiver can consume changes from one slot at a time
Slot keeps track of last change that was read by a receiver
If receiver disconnects, slot will ensure database holds changes until
receiver reconnects
Only changes from tables with primary keys are relayed
You can also set a "REPLICA IDENTITY" (available since PostgreSQL 9.4) on
UNIQUE, NOT NULL, non-deferrable, non-partial column(s)
Basis for Logical Replication
Logical Decoding

A logical replication slot has a name and an output plugin
PostgreSQL comes with the "test_decoding" output plugin
Have to write a custom parser to read changes from the test_decoding output plugin
Several output plugins and libraries available
wal2json: https://github.com/eulerto/wal2json
jsoncdc: https://github.com/instructure/jsoncdc
Debezium: http://debezium.io/
(Test: https://www.postgresql.org/docs/current/static/test-decoding.html)
Logical Replication (pgoutput)
Every data change in the database is streamed
Need to be aware of the logical decoding format
Logical Decoding Out of the Box

C: libpq
pg_recvlogical
PostgreSQL functions
Python: psycopg2 - version 2.7
JDBC: version 42
Go: pgx
JavaScript: node-postgres (pg-logical-replication)
Driver Support

Using Logical Decoding

We know it takes time to regenerate calendar
Want to ensure changes always propagate but want to ensure all users
(managers, calendar searchers) have good experience
Thoughts🤔

Will use the same data model as before as well as the same helper
functions, but without the triggers
We will have a Python script that reads from a logical replication
slot and if it detects a relevant change, take an action
Similar to what we did with triggers, but this moves the work to
OUTSIDE the transaction
BUT...we can confirm whether or not the work is completed, thus if
the program fails, we can restart from last acknowledged
transaction ID
Replacing Triggers
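The "restart from the last acknowledged position" idea can be simulated without a database. In this sketch, changes are (position, change) pairs and a checkpoint is only advanced after the handler succeeds, so a crashed run picks up where it left off. Checkpoint, consume, and the integer positions are hypothetical stand-ins; a real slot tracks WAL positions on the server side.

```python
class Checkpoint:
    """Remembers the last position whose processing was confirmed."""
    def __init__(self):
        self.acknowledged = 0


def consume(changes, handle, checkpoint):
    """Process (position, change) pairs in order, confirming each on success."""
    for position, change in changes:
        if position <= checkpoint.acknowledged:
            continue                         # already confirmed in an earlier run
        handle(change)                       # may raise; checkpoint is untouched
        checkpoint.acknowledged = position   # confirm only after the work is done


changes = [(1, "a"), (2, "b"), (3, "c")]
processed = []
failures = {"b"}                             # "b" fails the first time only


def handle(change):
    if change in failures:
        failures.discard(change)
        raise RuntimeError("worker crashed")
    processed.append(change)


checkpoint = Checkpoint()
try:
    consume(changes, handle, checkpoint)     # stops at "b"
except RuntimeError:
    pass
consume(changes, handle, checkpoint)         # resumes after position 1
print(processed, checkpoint.acknowledged)    # ['a', 'b', 'c'] 3
```

Nothing is processed twice and nothing is skipped, because the acknowledgement happens strictly after the work.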

A consumer of the logical stream can only read one change at a time
If our processing of a change takes a lot of time, it will create a backlog
of changes
Backlog means the PostgreSQL server needs to retain more WAL logs
Retaining too many WAL logs can lead to running out of disk space
Running out of disk space can lead to...rough times.
The Consumer Bottleneck
🌤
🌥
☁
🌩

Can utilize a durable message queueing system to store any WAL changes
that are necessary to perform post-processing on
Ensure the changes are worked on in order
"Divide-and-conquer" workload - have multiple workers acting on
different "topics"
Remove WAL bloat
Shifting the Workload
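A small in-memory sketch of the divide-and-conquer idea: route each change to a per-table queue so that order is preserved within a topic while different topics can be worked on independently. Plain deques stand in for a durable message queue here, and the route function and the "seq" field are assumptions made for illustration.

```python
from collections import defaultdict, deque

# tables we care about; anything else is dropped
TABLES = {"availability", "availability_rule", "room", "unavailability"}


def route(changes):
    """Group relevant changes into one ordered queue per table ("topic")."""
    queues = defaultdict(deque)
    for change in changes:
        table = change.get("table")
        if table in TABLES:
            queues[table].append(change)   # arrival order is kept per topic
    return queues


changes = [
    {"kind": "insert", "table": "room", "seq": 1},
    {"kind": "insert", "table": "audit_log", "seq": 2},   # not a relevant table
    {"kind": "insert", "table": "availability_rule", "seq": 3},
    {"kind": "update", "table": "room", "seq": 4},
]
queues = route(changes)
print({topic: [c["seq"] for c in q] for topic, q in queues.items()})
# {'room': [1, 4], 'availability_rule': [3]}
```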

Durable message processing and distribution system
Streams
Supports parallelization of consumers
Multiple consumers, partitions
Highly-available, distributed architecture
Acknowledgement of receiving, processing messages; can replay (sounds like
WAL?)
Can also accomplish this with Debezium, which interfaces with Kafka +
Postgres
Apache Kafka

Architecture

WAL Consumer
import json, sys
from kafka import KafkaProducer
from kafka.errors import KafkaError
import psycopg2
import psycopg2.extras

# tables whose changes should be passed on to Kafka
TABLES = set([
    'availability', 'availability_rule', 'room', 'unavailability',
])

class WALConsumer(object):
    def __init__(self):
        self.connection = psycopg2.connect("dbname=realtime",
            connection_factory=psycopg2.extras.LogicalReplicationConnection,
        )
        self.producer = KafkaProducer(
            bootstrap_servers=['localhost:9092'],
            value_serializer=lambda m: json.dumps(m).encode('ascii'),
        )

    def __call__(self, msg):
        payload = json.loads(msg.payload, strict=False)
        print(payload)
        # determine if the payload should be passed on to a consumer listening
        # to the Kafka queue
        for data in payload['change']:
            if data.get('table') in TABLES:
                self.producer.send(data.get('table'), data)
        # ensure everything is sent; call flush at this point
        self.producer.flush()
        # acknowledge that the change has been read - tells PostgreSQL to stop
        # holding onto this log file
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

# start reading from the replication slot once the consumer class is defined
reader = WALConsumer()
cursor = reader.connection.cursor()
cursor.start_replication(slot_name='schedule', decode=True)
try:
    cursor.consume_stream(reader)
except KeyboardInterrupt:
    print("Stopping reader...")
finally:
    cursor.close()
    reader.connection.close()
    print("Exiting reader")

Kafka Consumer
import json
from kafka import KafkaConsumer
from kafka.structs import OffsetAndMetadata, TopicPartition
import psycopg2

class Worker(object):
    """Base class to perform any post-processing on changes"""
    OPERATIONS = set([])  # override with "insert", "update", "delete"

    def __init__(self, topic):
        # connect to the PostgreSQL database
        self.connection = psycopg2.connect("dbname=realtime")
        # connect to Kafka
        self.consumer = KafkaConsumer(
            bootstrap_servers=['localhost:9092'],
            value_deserializer=lambda m: json.loads(m.decode('utf8')),
            auto_offset_reset="earliest",
            group_id='1')
        # subscribe to the topic(s)
        self.consumer.subscribe(topic if isinstance(topic, list) else [topic])

Kafka Consumer
    def run(self):
        """Function that runs ad-infinitum"""
        # loop through the payloads from the consumer
        # determine if there are any follow-up actions based on the kind of
        # operation, and if so, act upon it
        # always commit when done.
        for msg in self.consumer:
            print(msg)
            # load the data from the message
            data = msg.value
            # determine if there are any follow-up operations to perform
            if data['kind'] in self.OPERATIONS:
                # open up a cursor for interacting with PostgreSQL
                cursor = self.connection.cursor()
                # put the parameters in an easy to digest format
                params = dict(zip(data['columnnames'], data['columnvalues']))
                # call the function
                getattr(self, data['kind'])(cursor, params)
                # commit any work that has been done, and close the cursor
                self.connection.commit()
                cursor.close()
            # acknowledge the message has been handled; the committed offset
            # is the position of the next message to read, hence the + 1
            tp = TopicPartition(msg.topic, msg.partition)
            offsets = {tp: OffsetAndMetadata(msg.offset + 1, None)}
            self.consumer.commit(offsets=offsets)
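The parameter handling and dispatch in run() can be tried without Kafka or PostgreSQL. The sketch below feeds one hand-written change to a small stand-in class; RoomWorker and the sample change are hypothetical, and the dictionary layout (kind, table, columnnames, columnvalues) is an assumption based on the keys the worker reads.

```python
class RoomWorker:
    """Stand-in worker: same dispatch idea, no Kafka or database needed."""
    OPERATIONS = {"insert"}

    def __init__(self):
        self.seen = []

    def insert(self, params):
        # post-processing for an INSERT; here we only record the room name
        self.seen.append(params["name"])

    def handle(self, data):
        if data["kind"] in self.OPERATIONS:
            # pair column names with their values, as run() does
            params = dict(zip(data["columnnames"], data["columnvalues"]))
            # look up the method named after the operation and call it
            getattr(self, data["kind"])(params)


change = {
    "kind": "insert",
    "table": "room",
    "columnnames": ["id", "name"],
    "columnvalues": [1, "Room A"],
}
ignored = {"kind": "update", "table": "room",
           "columnnames": ["id", "name"], "columnvalues": [1, "Room B"]}

worker = RoomWorker()
worker.handle(change)
worker.handle(ignored)   # "update" is not in OPERATIONS, so nothing happens
print(worker.seen)       # ['Room A']
```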

Kafka Consumer
    # override with the appropriate post-processing code
    def insert(self, cursor, params):
        """Override with any post-processing to be done on an ``INSERT``"""
        raise NotImplementedError()

    def update(self, cursor, params):
        """Override with any post-processing to be done on an ``UPDATE``"""
        raise NotImplementedError()

    def delete(self, cursor, params):
        """Override with any post-processing to be done on a ``DELETE``"""
        raise NotImplementedError()

Logical decoding allows the bulk inserts to occur significantly faster from a
transactional view
Potential bottleneck for long running execution, but bottlenecks are isolated to
specific queues
Newer versions of PostgreSQL have features that make it easier to build
applications and scale
Lessons

PostgreSQL is robust.
Triggers will keep your data in sync but can have significant
performance overhead
Utilizing a logical replication slot can eliminate trigger overhead
and transfer the computational load elsewhere
Not a panacea: still need to use good architectural patterns!
Conclusion

Thank You
jonathan.katz@crunchydata.com
@jkatz05
https://github.com/CrunchyData/postgres-realtime-demo
