Build a Complex, Realtime Data Management App with Postgres 14!

Chicago PostgreSQL User Group - October 20, 2021 Jonathan S. Katz
Let's Build a Complex, Real-
Time Data Management
Applic...

Congratulations: you've been selected to build an application that will manage reservations for rooms!

On the surface, this sounds simple, but you are building a system for managing a high-traffic reservation web page, so we know that a lot of people will be accessing it. Therefore, we need to ensure that the system can handle all of the eager users who will be flooding the website to check what availability each room has.

Fortunately, PostgreSQL is prepared for this! And even better, we will be using Postgres 14 to make the problem even easier!

We will explore the following PostgreSQL features:

* Data types and their functionality, such as:
  * Date/Time types
  * Ranges / Multiranges
* Indexes, such as:
  * GiST
* Common Table Expressions and Recursion (though multiranges will make things easier!)
* Set generating functions and LATERAL queries
* Functions and PL/pgSQL
* Triggers
* Logical decoding and streaming

We will be writing our application primarily with SQL, though we will sneak in a little bit of Python and use Kafka to demonstrate the power of logical decoding.

At the end of the presentation, we will have a working application, and you will be happy knowing that you provided a wonderful user experience for all users, made possible by the innovation of PostgreSQL!

  1. Let's Build a Complex, Real-Time Data Management Application
     Jonathan S. Katz
     Chicago PostgreSQL User Group - October 20, 2021
  2. About Me
     • VP, Platform Engineering @ Crunchy Data
     • Previously: Engineering Leadership @ Startups
     • Longtime PostgreSQL community contributor
     • Core Team Member
     • Various Governance Committees
     • Conference Organizer / Speaker
     • @jkatz05
  3. Crunchy Data
     Your partner in deploying open source PostgreSQL throughout your enterprise.
     • Leading Team in Postgres – 10 contributors
     • Certified Open Source PostgreSQL Distribution
     • Leader in Postgres Technology for Kubernetes
     • Crunchy Bridge: Fully managed cloud service
  4. How to Approach This Talk
     This talk introduces many different tools and techniques available in PostgreSQL for building applications. It introduces different features and where to find out more information. We have a lot of material to cover in a short time - the slides and demonstrations will be made available.
  5. The Problem
     Imagine we are managing virtual rooms for an event platform. We have a set of operating hours in which the rooms can be booked. Only one booking can occur in a virtual room at a given time.
  6. For Example
  7. Specifications
     We need to know...
     - All the rooms that are available to book
     - When the rooms are available to be booked (operating hours)
     - When the rooms have been booked
     And... the system needs to be able to CRUD fast (Create, Read, Update, Delete. Fast).
  8. 🤔
  9. Interlude: Finding Availability
  10. Managing Availability
      Availability can be thought about in three ways:
      - Closed
      - Available
      - Unavailable (or "booked")
      Our ultimate "calendar tuple" is (room, status, range)
  11. Postgres Range Types
      PostgreSQL 9.2 introduced "range types" that included the ability to store and efficiently search over ranges of data.
      - Built-in: Date, Timestamps, Integer, Numeric
      - Lookups (e.g. overlaps) can be sped up using GiST indexes

      SELECT tstzrange('2021-10-28 09:30'::timestamptz, '2021-10-28 10:30'::timestamptz);
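To make the overlap lookup on the range types slide concrete, here is a minimal pure-Python sketch of a half-open time range, matching the default `[)` bounds that `tstzrange` uses when called with two arguments. The `TimeRange` class and the `overlaps` and `at` helpers are illustrative names that do not appear in the talk; in the database the equivalent test is the `&&` operator, and this sketch says nothing about how PostgreSQL implements it.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class TimeRange(NamedTuple):
    """Half-open range: lower is included, upper is excluded."""
    lower: datetime
    upper: datetime


def overlaps(a: TimeRange, b: TimeRange) -> bool:
    """True when the two ranges share at least one instant."""
    return a.lower < b.upper and b.lower < a.upper


def at(hour: int, minute: int = 0) -> datetime:
    """Hypothetical helper: a UTC timestamp on 2021-10-28."""
    return datetime(2021, 10, 28, hour, minute, tzinfo=timezone.utc)


booked = TimeRange(at(9, 30), at(10, 30))
back_to_back = TimeRange(at(10, 30), at(11, 30))
clashing = TimeRange(at(10, 0), at(11, 0))

# Because the upper bound is excluded, back-to-back bookings do not collide.
print(overlaps(booked, back_to_back))
print(overlaps(booked, clashing))
```

The excluded upper bound is what lets one booking end at 10:30 and the next begin at 10:30 without being treated as a double-booking.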
  12. Availability
  13. Availability

      SELECT *
      FROM (
          VALUES
              ('closed', tstzrange('2021-10-28 0:00', '2021-10-28 8:00')),
              ('available', tstzrange('2021-10-28 08:00', '2021-10-28 09:30')),
              ('unavailable', tstzrange('2021-10-28 09:30', '2021-10-28 10:30')),
              ('available', tstzrange('2021-10-28 10:30', '2021-10-28 16:30')),
              ('unavailable', tstzrange('2021-10-28 16:30', '2021-10-28 18:30')),
              ('available', tstzrange('2021-10-28 18:30', '2021-10-28 20:00')),
              ('closed', tstzrange('2021-10-28 20:00', '2021-10-29 0:00'))
      ) x(status, calendar_range)
      ORDER BY lower(x.calendar_range);
  14. Easy, Right?
  15. But…
      - Insert new ranges and dividing them up
        - PostgreSQL did not work well with noncontiguous ranges…until PostgreSQL 14
      - Availability
        - Just for one day - what about other days?
        - What happens with data in the past?
        - What happens with data in the future?
      - Unavailability
        - Ensure no double-bookings
        - Overlapping Events?
        - Handling multiple spaces
  16. Managing Availability

      room
        id                          <serial> PRIMARY KEY
        name                        <text>

      availability_rule
        id                          <serial> PRIMARY KEY
        room_id                     <int> REFERENCES (room)
        days_of_week                <int[]>
        start_time                  <time>
        end_time                    <time>
        generate_weeks_into_future  <int> DEFAULT 52

      availability
        id                          <serial> PRIMARY KEY
        room_id                     <int> REFERENCES (room)
        availability_rule_id        <int> REFERENCES (availability_rule)
        available_date              <date>
        available_range             <tstzrange>

      unavailability
        id                          <serial> PRIMARY KEY
        room_id                     <int> REFERENCES (room)
        unavailable_date            <date>
        unavailable_range           <tstzrange>

      calendar
        id                          <serial> PRIMARY KEY
        room_id                     <int> REFERENCES (room)
        status                      <text> DOMAIN: {available, unavailable, closed}
        calendar_date               <date>
        calendar_range              <tstzrange>
  17. Managing Availability
      We can now store data, but what about:
      - Generating initial calendar?
      - Generating availability based on rules?
      - Generating unavailability?
      Sounds like we need to build an application
  18. Managing Availability
      To build our application, there are a few topics we will need to explore first:
      - generate_series
      - Recursive queries
      - Ranges and Multiranges
      - SQL Functions
      - Set returning functions
      - PL/pgSQL
      - Triggers
  19. generate_series: More Than Just For Test Data
      generate_series is a "set returning" function, i.e. a function that can return multiple rows of data.
      generate_series can return:
      - A set of numbers (int, bigint, numeric), incremented by 1 or by another step value
      - A set of timestamps incremented by a time interval(!!)

      SELECT x::date
      FROM generate_series(
          '2021-01-01'::date, '2021-12-31'::date, '1 day'::interval
      ) x;
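For readers more at home in Python, the query above behaves roughly like the generator below: start at the first value, add the step, and stop once the end value is passed (the end value itself is included). This is only an analogy that assumes a positive step; the `rule_days` filter is a hypothetical illustration of how a `days_of_week` rule could select dates, and the ISO weekday numbering it uses is an assumption, since the slides do not say how `days_of_week` is encoded.

```python
from datetime import date, timedelta
from typing import Iterator


def generate_series(start: date, stop: date, step: timedelta) -> Iterator[date]:
    """Yield start, start + step, ... up to and including stop (positive step only)."""
    current = start
    while current <= stop:
        yield current
        current += step


days_2021 = list(
    generate_series(date(2021, 1, 1), date(2021, 12, 31), timedelta(days=1))
)

# Hypothetical rule: keep only Mondays and Wednesdays.
# isoweekday() numbers Monday as 1 through Sunday as 7.
rule_days = {1, 3}
matching = [d for d in days_2021 if d.isoweekday() in rule_days]

print(len(days_2021), len(matching))
```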
  20. Recursion in SQL?
      PostgreSQL 8.4 introduced the "WITH" syntax and with it also introduced the ability to perform recursive queries
      - WITH RECURSIVE ... AS ()
      - Base case vs. recursive case
      - UNION vs. UNION ALL
      - CAN HIT INFINITE LOOPS
  21. Recursion in SQL? (Infinite Recursion)

      WITH RECURSIVE fac AS (
          SELECT 1::numeric AS n, 1::numeric AS i
          UNION
          SELECT fac.n * (fac.i + 1), fac.i + 1 AS i
          FROM fac
      )
      SELECT fac.n, fac.i
      FROM fac;
  22. Recursion in SQL?

      WITH RECURSIVE fac AS (
          SELECT 1::numeric AS n, 1::numeric AS i
          UNION
          SELECT fac.n * (fac.i + 1), fac.i + 1 AS i
          FROM fac
      )
      SELECT fac.n, fac.i
      FROM fac
      LIMIT 100;
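The base case and recursive case of the factorial query can be read as a Python generator: the first `SELECT` seeds the working row, and the second `SELECT` produces each next row from the previous one. The sketch below mirrors that shape, with `itertools.islice` standing in for `LIMIT`; the function name `fac_rows` is illustrative. Note that stopping a recursive query with `LIMIT` relies on PostgreSQL only evaluating as many rows as the outer query fetches, so an explicit `WHERE` condition in the recursive term is usually the safer choice.

```python
from itertools import islice
from typing import Iterator, Tuple


def fac_rows() -> Iterator[Tuple[int, int]]:
    """Yield (n, i) rows in the order the recursive CTE produces them."""
    n, i = 1, 1  # base case: SELECT 1 AS n, 1 AS i
    while True:  # the recursive case has no stopping condition of its own
        yield n, i
        n, i = n * (i + 1), i + 1  # recursive case: fac.n * (fac.i + 1), fac.i + 1


# Equivalent of adding LIMIT 5 to the outer query.
first_five = list(islice(fac_rows(), 5))
print(first_five)
```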
  23. Multirange Types
      Postgres 14 introduces multirange types
      - Ordered list of ranges
      - Can be noncontiguous
      - Adds the range_agg aggregate, plus unnest to expand a multirange back into ranges

      SELECT datemultirange(
          daterange(CURRENT_DATE, CURRENT_DATE + 1),
          daterange(CURRENT_DATE + 5, CURRENT_DATE + 8),
          daterange(CURRENT_DATE + 15, CURRENT_DATE + 22)
      );
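The reason multiranges simplify the calendar problem is that "available minus booked" becomes a single set operation instead of a recursive split. The following Python sketch shows the idea on plain integer ranges (minutes since midnight, half-open). The `merge` and `subtract` functions are illustrative stand-ins, loosely analogous to `range_agg` and multirange subtraction; they are not how PostgreSQL implements either.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [lower, upper), e.g. minutes since midnight


def merge(ranges: List[Range]) -> List[Range]:
    """Combine overlapping or adjacent ranges into an ordered, disjoint list."""
    merged: List[Range] = []
    for lower, upper in sorted(ranges):
        if lower >= upper:
            continue  # skip empty ranges
        if merged and lower <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], upper))
        else:
            merged.append((lower, upper))
    return merged


def subtract(available: List[Range], booked: List[Range]) -> List[Range]:
    """Remove every booked range from the available ranges."""
    result: List[Range] = []
    booked = merge(booked)
    for lower, upper in merge(available):
        cursor = lower  # start of the part not yet accounted for
        for b_lower, b_upper in booked:
            if b_upper <= cursor or b_lower >= upper:
                continue  # booking lies outside the remaining window
            if b_lower > cursor:
                result.append((cursor, b_lower))
            cursor = max(cursor, b_upper)
        if cursor < upper:
            result.append((cursor, upper))
    return result


# Open 08:00-20:00, booked 09:30-10:30 and 16:30-18:30.
open_hours = [(8 * 60, 20 * 60)]
bookings = [(9 * 60 + 30, 10 * 60 + 30), (16 * 60 + 30, 18 * 60 + 30)]
free = subtract(open_hours, bookings)
print(free)
```

With the example day used earlier in the talk, the three remaining ranges correspond to 08:00-09:30, 10:30-16:30, and 18:30-20:00.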
  24. Functions
      PostgreSQL provides the ability to write functions to help encapsulate repeated behavior
      - PostgreSQL 11 introduced stored procedures, which enable you to embed transactions!
      - PostgreSQL 14 adds the ability to get output from stored procedures!
      SQL functions have many properties, including:
      - Input / output
      - Volatility (IMMUTABLE, STABLE, VOLATILE) (default VOLATILE)
      - Parallel safety (default PARALLEL UNSAFE)
      - LEAKPROOF; SECURITY DEFINER
      - Execution Cost
      - Language type (more on this later)
  25. Functions

      CREATE OR REPLACE FUNCTION chipug_fac(n int)
      RETURNS numeric
      AS $$
          WITH RECURSIVE fac AS (
              SELECT 1::numeric AS n, 1::numeric AS i
              UNION
              SELECT fac.n * (fac.i + 1), fac.i + 1 AS i
              FROM fac
              WHERE i + 1 <= $1
          )
          SELECT max(fac.n)
          FROM fac;
      $$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
  26. Functions

      CREATE OR REPLACE FUNCTION chipug_fac_set(n int)
      RETURNS SETOF numeric
      AS $$
          WITH RECURSIVE fac AS (
              SELECT 1::numeric AS n, 1::numeric AS i
              UNION
              SELECT fac.n * (fac.i + 1), fac.i + 1 AS i
              FROM fac
              WHERE i + 1 <= $1
          )
          SELECT fac.n
          FROM fac
          ORDER BY fac.n;
      $$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
  27. Functions

      CREATE OR REPLACE FUNCTION chipug_fac_table(n int)
      RETURNS TABLE(n numeric)
      AS $$
          WITH RECURSIVE fac AS (
              SELECT 1::numeric AS n, 1::numeric AS i
              UNION
              SELECT fac.n * (fac.i + 1), fac.i + 1 AS i
              FROM fac
              WHERE i + 1 <= $1
          )
          SELECT fac.n
          FROM fac
          ORDER BY fac.n;
      $$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
  28. Procedural Languages
      PostgreSQL has the ability to load in procedural languages ("PL") and execute code in them beyond SQL.
      - Built-in: pgSQL, Python, Perl, Tcl
      - Others: Javascript, R, Java, C, JVM, Container, LOLCODE, Ruby, PHP, Lua, pgPSM, Scheme
  29. PL/pgSQL

      CREATE EXTENSION IF NOT EXISTS plpgsql;

      CREATE OR REPLACE FUNCTION chipug_fac_plpgsql(n int)
      RETURNS numeric
      AS $$
      DECLARE
          fac numeric;
          i int;
      BEGIN
          fac := 1;
          FOR i IN 1..n LOOP
              fac := fac * i;
          END LOOP;
          RETURN fac;
      END;
      $$ LANGUAGE plpgsql IMMUTABLE PARALLEL SAFE;
  30. Triggers
      Triggers are functions that can be called before/after/instead of an operation or event
      - Data changes (INSERT/UPDATE/DELETE)
      - Events (DDL, DCL, etc. changes)
      - Atomic
      - Must return "trigger" or "event_trigger"
        - (Return "NULL" in a trigger if you want to skip operation)
        - (Gotcha: RETURN OLD [INSERT] / RETURN NEW [DELETE])
      - Execute once per modified row or once per SQL statement
      - Multiple triggers on the same event will execute in alphabetical order
      - Writable in any procedural language that defines a trigger interface
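Two of the trigger rules above, name-ordered execution and "return NULL to skip", are easy to get wrong, so here is a toy Python model of BEFORE row-level triggers. It is only a mental model under simplifying assumptions: the function and trigger names are hypothetical, Python's `sorted` stands in for the database's ordering by trigger name, and `None` stands in for a trigger returning NULL. Each trigger receives the row returned by the previous one.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]
TriggerFn = Callable[[Row], Optional[Row]]

fired: List[str] = []  # records the order in which triggers ran


def normalize_name(row: Row) -> Optional[Row]:
    fired.append("a_normalize_name")
    return {**row, "name": row["name"].strip()}


def reject_blank(row: Row) -> Optional[Row]:
    fired.append("b_reject_blank")
    return row if row["name"] else None  # None models RETURN NULL


def fire_before_row_triggers(triggers: Dict[str, TriggerFn], new_row: Row) -> Optional[Row]:
    """Run triggers in name order; a None result skips the operation."""
    current = new_row
    for name in sorted(triggers):
        result = triggers[name](current)
        if result is None:
            return None  # later triggers are not fired for this row
        current = result
    return current


# Registration order is deliberately reversed; execution still follows the names.
triggers = {"b_reject_blank": reject_blank, "a_normalize_name": normalize_name}

kept = fire_before_row_triggers(triggers, {"name": "  Room A "})
skipped = fire_before_row_triggers(triggers, {"name": "   "})
print(kept, skipped, fired)
```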
  31. Building a Synchronized System
  32. We'll Scan the Code. It's Available for Download 😉
  33. The Test
  34. Lessons of the Test
      [Test your live demos before running them, and you will have much success!]
      - availability_rule inserts took some time, > 350ms
        - availability: INSERT 52
        - calendar: INSERT 52 from nontrivial function
      - Updates on individual availability / unavailability are not too painful
      - Lookups are faaaaaaaast
  35. How About At (Web) Scale?
  36. Web Scale
      - Recursive CTE 😢
        - Even with only 100 more rooms with a few sets of rules, rule generation time increased significantly
      - Multirange Types
        - These are still pretty fast and are handling scaling up well.
        - May still be slow for a web transaction.
      - Lookups are still lightning fast!
  37. Logical Decoding
      - Added in PostgreSQL 9.4
      - Replays all logical changes made to the database
      - Create a logical replication slot in your database
        - Only one receiver can consume changes from one slot at a time
        - Slot keeps track of last change that was read by a receiver
        - If receiver disconnects, slot will ensure database holds changes until receiver reconnects
      - Only changes from tables with primary keys are relayed
        - As of PostgreSQL 10, you can set a "REPLICA IDENTITY" on a UNIQUE, NOT NULL, non-deferrable, non-partial column(s)
      - Basis for Logical Replication
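The slot behavior described above (track the last acknowledged change, retain everything after it, resend on reconnect) can be pictured with a small in-memory model. The `ToySlot` class below is purely illustrative: integer positions stand in for WAL positions, and nothing here reflects PostgreSQL's actual data structures.

```python
from typing import List, Tuple

Change = Tuple[int, str]  # (position, description of the change)


class ToySlot:
    """Toy stand-in for a logical replication slot; not PostgreSQL's implementation."""

    def __init__(self) -> None:
        self.retained: List[Change] = []
        self.confirmed = 0      # last position acknowledged by the receiver
        self.next_position = 1

    def write(self, change: str) -> None:
        """The database records a change at the next position."""
        self.retained.append((self.next_position, change))
        self.next_position += 1

    def pending(self) -> List[Change]:
        """Everything a connecting (or reconnecting) receiver would be sent."""
        return [c for c in self.retained if c[0] > self.confirmed]

    def acknowledge(self, position: int) -> None:
        """Receiver confirms it has handled all changes up to position."""
        self.confirmed = max(self.confirmed, position)
        # Acknowledged changes no longer need to be held.
        self.retained = [c for c in self.retained if c[0] > self.confirmed]


slot = ToySlot()
for change in ("INSERT room 1", "INSERT availability_rule 1", "UPDATE room 1"):
    slot.write(change)

# The receiver handles the first two changes, acknowledges them, then disconnects.
slot.acknowledge(2)
after_reconnect = slot.pending()
print(after_reconnect)
```

The flip side of this guarantee is visible in the model: if `acknowledge` is never called, `retained` grows without bound, which is the disk-space risk discussed later in the talk.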
  38. Logical Decoding Out of the Box
      - A logical replication slot has a name and an output plugin
        - PostgreSQL comes with the "test" output plugin (test_decoding)
        - Have to write a custom parser to read changes from the test output plugin
      - Several output plugins and libraries available
        - wal2json: https://github.com/eulerto/wal2json
        - jsoncdc: https://github.com/instructure/jsoncdc
        - Debezium: http://debezium.io/
        - (Test: https://www.postgresql.org/docs/current/static/test-decoding.html)
        - Logical Replication (pgoutput)
      - Every data change in the database is streamed
      - Need to be aware of the logical decoding format
  39. Driver Support
      - C: libpq
      - pg_recvlogical
      - PostgreSQL functions
      - Python: psycopg2 - version 2.7
      - JDBC: version 42
      - Go: pgx
      - JavaScript: node-postgres (pg-logical-replication)
  40. Using Logical Decoding
  41. Thoughts 🤔
      - We know it takes time to regenerate calendar
      - Want to ensure changes always propagate but want to ensure all users (managers, calendar searchers) have good experience
  42. Replacing Triggers
      - Will use the same data model as before as well as the same helper functions, but without the triggers
      - We will have a Python script that reads from a logical replication slot and if it detects a relevant change, take an action
        - Similar to what we did with triggers, but this moves the work to OUTSIDE the transaction
        - BUT...we can confirm whether or not the work is completed, thus if the program fails, we can restart from last acknowledged transaction ID
  43. Reviewing the Code
  44. The Consumer Bottleneck 🌤 🌥 ☁ 🌩
      - A consumer of the logical stream can only read one change at a time
      - If our processing of a change takes a lot of time, it will create a backlog of changes
      - Backlog means the PostgreSQL server needs to retain more WAL logs
      - Retaining too many WAL logs can lead to running out of disk space
      - Running out of disk space can lead to...rough times.
  45. Eliminating the Bottleneck
  46. Shifting the Workload
      - Can utilize a durable message queueing system to store any WAL changes that are necessary to perform post-processing on
      - Ensure the changes are worked on in order
      - "Divide-and-conquer" workload - have multiple workers acting on different "topics"
      - Remove WAL bloat
  47. Apache Kafka
      - Durable message processing and distribution system
      - Streams
      - Supports parallelization of consumers
        - Multiple consumers, partitions
      - Highly-available, distributed architecture
      - Acknowledgement of receiving, processing messages; can replay (sounds like WAL?)
      - Can also accomplish this with Debezium, which interfaces with Kafka + Postgres
  48. Architecture
  49. WAL Consumer

      import json, sys
      from kafka import KafkaProducer
      from kafka.errors import KafkaError
      import psycopg2
      import psycopg2.extras

      TABLES = set([
          'availability', 'availability_rule', 'room', 'unavailability',
      ])

      # WALConsumer is the class shown on the next slide; in a single script
      # it has to be defined before this point.
      reader = WALConsumer()
      cursor = reader.connection.cursor()
      cursor.start_replication(slot_name='schedule', decode=True)
      try:
          cursor.consume_stream(reader)
      except KeyboardInterrupt:
          print("Stopping reader...")
      finally:
          cursor.close()
          reader.connection.close()
          print("Exiting reader")
  50. WAL Consumer

      class WALConsumer(object):
          def __init__(self):
              self.connection = psycopg2.connect("dbname=realtime",
                  connection_factory=psycopg2.extras.LogicalReplicationConnection,
              )
              self.producer = KafkaProducer(
                  bootstrap_servers=['localhost:9092'],
                  value_serializer=lambda m: json.dumps(m).encode('ascii'),
              )

          def __call__(self, msg):
              payload = json.loads(msg.payload, strict=False)
              print(payload)
              # determine if the payload should be passed on to a consumer listening
              # to the Kafka queue
              for data in payload['change']:
                  if data.get('table') in TABLES:
                      self.producer.send(data.get('table'), data)
              # ensure everything is sent; call flush at this point
              self.producer.flush()
              # acknowledge that the change has been read - tells PostgreSQL to stop
              # holding onto this log file
              msg.cursor.send_feedback(flush_lsn=msg.data_start)
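The routing step inside `__call__` can be tried without PostgreSQL or Kafka by separating it from the I/O. The sketch below assumes the payload has the shape that the slide's code reads (a top-level `change` list whose entries carry `kind`, `table`, `columnnames`, and `columnvalues`), which appears to match wal2json's version 1 output. The `route_changes` function, the sample values, and the `audit_log` table are hypothetical and exist only for illustration.

```python
import json
from typing import Dict, Iterator, List, Tuple

TABLES = {"availability", "availability_rule", "room", "unavailability"}


def route_changes(raw_payload: str) -> Iterator[Tuple[str, Dict]]:
    """Yield (topic, change) pairs for changes on the tables we care about."""
    payload = json.loads(raw_payload, strict=False)
    for change in payload.get("change", []):
        table = change.get("table")
        if table in TABLES:
            yield table, change  # the table name doubles as the topic name


# Hand-written sample payload; the values are made up.
sample = json.dumps({
    "change": [
        {"kind": "insert", "schema": "public", "table": "room",
         "columnnames": ["id", "name"], "columntypes": ["integer", "text"],
         "columnvalues": [1, "Room A"]},
        {"kind": "insert", "schema": "public", "table": "audit_log",
         "columnnames": ["id"], "columntypes": ["integer"],
         "columnvalues": [7]},
    ]
})

# In the real consumer each pair would go to producer.send(topic, change).
sent: List[Tuple[str, Dict]] = list(route_changes(sample))
print(sent)
```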
  51. Kafka Consumer

      import json
      from kafka import KafkaConsumer
      from kafka.structs import OffsetAndMetadata, TopicPartition
      import psycopg2

      class Worker(object):
          """Base class to perform any post-processing on changes"""
          OPERATIONS = set([])  # override with "insert", "update", "delete"

          def __init__(self, topic):
              # connect to the PostgreSQL database
              self.connection = psycopg2.connect("dbname=realtime")
              # connect to Kafka
              self.consumer = KafkaConsumer(
                  bootstrap_servers=['localhost:9092'],
                  value_deserializer=lambda m: json.loads(m.decode('utf8')),
                  auto_offset_reset="earliest",
                  group_id='1')
              # subscribe to the topic(s)
              self.consumer.subscribe(topic if isinstance(topic, list) else [topic])
  52. Kafka Consumer

          def run(self):
              """Function that runs ad-infinitum"""
              # loop through the payloads from the consumer
              # determine if there are any follow-up actions based on the kind of
              # operation, and if so, act upon it
              # always commit when done.
              for msg in self.consumer:
                  print(msg)
                  # load the data from the message
                  data = msg.value
                  # determine if there are any follow-up operations to perform
                  if data['kind'] in self.OPERATIONS:
                      # open up a cursor for interacting with PostgreSQL
                      cursor = self.connection.cursor()
                      # put the parameters in an easy to digest format
                      params = dict(zip(data['columnnames'], data['columnvalues']))
                      # call the function
                      getattr(self, data['kind'])(cursor, params)
                      # commit any work that has been done, and close the cursor
                      self.connection.commit()
                      cursor.close()
                  # acknowledge the message has been handled; Kafka expects the
                  # committed offset to be the next offset to read, hence the + 1
                  tp = TopicPartition(msg.topic, msg.partition)
                  offsets = {tp: OffsetAndMetadata(msg.offset + 1, None)}
                  self.consumer.commit(offsets=offsets)
  53. Kafka Consumer

          # override with the appropriate post-processing code
          def insert(self, cursor, params):
              """Override with any post-processing to be done on an ``INSERT``"""
              raise NotImplementedError()

          def update(self, cursor, params):
              """Override with any post-processing to be done on an ``UPDATE``"""
              raise NotImplementedError()

          def delete(self, cursor, params):
              """Override with any post-processing to be done on a ``DELETE``"""
              raise NotImplementedError()
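The dispatch pattern in `run` (look up a method named after the change kind, but only for kinds listed in `OPERATIONS`) can be exercised on its own. The sketch below strips out Kafka and PostgreSQL and keeps just that logic; `RoomWorker`, `handle`, and `change_params` are hypothetical names. It also assumes, and this should be checked against the output plugin in use, that delete events may carry an `oldkeys` object (with `keynames` and `keyvalues`) rather than `columnnames`, in which case the `dict(zip(...))` line on the slide would need a fallback like the one shown here.

```python
from typing import Dict, List, Set


def change_params(data: Dict) -> Dict:
    """Pair column names with values; fall back to old keys when present."""
    if "columnnames" in data:
        return dict(zip(data["columnnames"], data["columnvalues"]))
    keys = data.get("oldkeys", {})  # assumed shape for delete events
    return dict(zip(keys.get("keynames", []), keys.get("keyvalues", [])))


class RoomWorker:
    """Hypothetical worker that only reacts to inserts."""

    OPERATIONS: Set[str] = {"insert"}

    def __init__(self) -> None:
        self.handled: List[Dict] = []

    def handle(self, data: Dict) -> bool:
        """Dispatch to the method named after the change kind, if enabled."""
        if data["kind"] not in self.OPERATIONS:
            return False
        getattr(self, data["kind"])(change_params(data))
        return True

    def insert(self, params: Dict) -> None:
        # A real worker would run SQL here; this sketch just records the call.
        self.handled.append(params)


insert_event = {"kind": "insert", "table": "room",
                "columnnames": ["id", "name"], "columnvalues": [1, "Room A"]}
delete_event = {"kind": "delete", "table": "room",
                "oldkeys": {"keynames": ["id"], "keyvalues": [1]}}

worker = RoomWorker()
print(worker.handle(insert_event), worker.handle(delete_event), worker.handled)
```

Keeping `OPERATIONS` as an explicit allow-list means a subclass never hits an unimplemented handler for a kind it did not opt into.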
  54. Testing the Application
  55. Lessons
      - Logical decoding allows the bulk inserts to occur significantly faster from a transactional view
      - Potential bottleneck for long running execution, but bottlenecks are isolated to specific queues
      - Newer versions of PostgreSQL have features that make it easier to build applications and scale
  56. Conclusion
      - PostgreSQL is robust.
      - Triggers will keep your data in sync but can have significant performance overhead
      - Utilizing a logical replication slot can eliminate trigger overhead and transfer the computational load elsewhere
      - Not a panacea: still need to use good architectural patterns!
  57. Thank You
      jonathan.katz@crunchydata.com
      @jkatz05
      https://github.com/CrunchyData/postgres-realtime-demo