A couple of things about PostgreSQL...

PostgreSQL is a wild beast and a wrong approach can become a slow descent into hell. We'll talk about common mistakes, confusing jargon, the online manual's lost pages (well hidden in the source code) and best practices to avoid headaches for your DBA.

Published in: Technology


Transcript

  • 1. A couple of things to know about PostgreSQL... (Before you start coding) Federico Campoli 9 July 2013 Federico Campoli () A couple of things to know about PostgreSQL... 9 July 2013 1 / 76
  • 5. Introduction. What is blue, bigger on the inside and has time travel? If your answer is the TARDIS then yes, you’re close enough, but the correct answer is PostgreSQL. And regarding the couple of things... I lied.
  • 6. Introduction. PostgreSQL is a wild beast. We’ll talk about the common mistakes, the confusing jargon, the online manual’s lost pages and the best practices to avoid headaches for your DBA. The major version used in this talk is 9.2. So let’s start with the table of contents.
  • 13. Table of contents
      • A byte it’s a byte, it’s a byte it’s a byte: the database physical storage.
      • The magic of the MVCC: how PostgreSQL keeps things consistent.
      • TOAST Please, and don’t forget the Marmite: the power of the out-of-line storage, up to 1 GB and free of charge.
      • It’s bigger on the inside: the database memory, how to stick an elephant in a smart car.
      • The answer is 42: explaining the unexplainable, the CBO and the execution plan.
      • Why do we fall? Crashing the most advanced open source database is easy...
      • And I thought my jokes were bad: and then I’ll need a back door to escape...
  • 14. A byte it’s a byte, it’s a byte it’s a byte. Jargon
      • oid column: hidden column in system tables; it must be listed explicitly in the select list to be displayed.
      • relation: a relational object such as a table or an index.
      • initdb: the executable which initialises the data directory.
      • cluster: PostgreSQL as a whole: data, memory and processes.
      • shared objects: relations visible in every database created in the cluster.
      • template databases: databases used to build the other databases in the cluster.
      • tablespace: a logical name pointing to a physical location.
      • WAL: write ahead log; non-volatile storage where the block changes are written before they reach the data files.
      • heap page: a table’s data page.
      • index page: an index data page.
      • tuple: a physical data or index row.
      • point in time recovery: physical backup strategy which makes it possible to restore a database to the exact point in time wanted.
  • 15. A byte it’s a byte, it’s a byte it’s a byte. PostgreSQL stores the data in a dedicated directory, identified by the environment variable $PGDATA on Unix and %PGDATA% on Windows. The cluster contains various folders, each with a specific function. Figure: PGDATA
  • 16. A byte it’s a byte, it’s a byte it’s a byte. The global directory. Contains the cluster’s shared objects, like pg_database and pg_tablespace, and a small 8 kB file, pg_control, probably the most important file in the entire system. pg_control tracks the database’s vital activities; with a corrupted or missing pg_control the database cannot start.
  • 17. A byte it’s a byte, it’s a byte it’s a byte. The base directory
      • The default location when a new database is created without the TABLESPACE clause.
      • Contains folders with numeric names, one for each database. The number is the database object id and is stored in the pg_database system table.
      • Contains an optional folder, pgsql_tmp, used for external sorts and temporary files.
      • The base location corresponds to the pg_default name in the pg_tablespace system table.
      • Usually the subfolder 1 contains the template1 database; the next two values are the databases template0 and postgres.
  • 21. A byte it’s a byte, it’s a byte it’s a byte. The base directory. Each subdirectory contains... just guess... files with numeric names.
      • Each file can grow to at most 1 GB; then a new chunk is generated with a sequence number suffix.
      • The data files are organized in fixed-size blocks, 8192 bytes by default. Changing the block size requires a build from source and a new initdb.
      • The data files are called file nodes and are mapped to the relations in the pg_class system table...
      • And yes, we are dealing with an object-relational database system.
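The segment arithmetic described in this slide can be sketched in a few lines of Python. This is only an illustration of the 1 GB / 8192 byte figures quoted above: the function name `locate_block` and the OIDs used in the example are hypothetical, and the real mapping between a relation and its file node lives in pg_class.

```python
BLOCK_SIZE = 8192                      # default block size quoted in the slide
SEGMENT_SIZE = 1024 ** 3               # each data file grows to at most 1 GB
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE


def locate_block(db_oid, relfilenode, block_number):
    """Return (path relative to $PGDATA, byte offset) for a block number.

    db_oid and relfilenode are whatever pg_database and pg_class report;
    the values used in the example below are made up.
    """
    segment, block_in_segment = divmod(block_number, BLOCKS_PER_SEGMENT)
    # The first segment has no suffix; the following ones get .1, .2, ...
    suffix = "" if segment == 0 else ".%d" % segment
    path = "base/%d/%d%s" % (db_oid, relfilenode, suffix)
    return path, block_in_segment * BLOCK_SIZE


if __name__ == "__main__":
    print(BLOCKS_PER_SEGMENT)
    print(locate_block(16384, 16390, 0))
    print(locate_block(16384, 16390, 131073))
```

With the default block size a segment holds 131072 blocks, so block 131073 of a relation is the second block of the file with the .1 suffix.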
  • 22. A byte it’s a byte, it’s a byte it’s a byte. The pg_tblspc directory
      • Contains the symbolic links to the tablespaces.
      • Very useful to spread tables and indices over different physical devices.
      • Combined with logical volume management it can dramatically improve the performance... or drive the project to a complete failure.
      • An object’s tablespace location can be safely changed, but this requires an exclusive lock on the affected object.
      • The pg_tablespace system table maps the tablespace names to their identifiers.
  • 24. A byte it’s a byte, it’s a byte it’s a byte. The pg_xlog directory. WARNING: INCOMING AIRSTRIKE
      • Also known as the write ahead log directory, WAL.
      • It is the most important and critical directory in the cluster.
      • Contains the 16 MB segments used by the database to save the block changes.
      • Each segment contains the blocks changed in the volatile memory.
      • Not used when the database is stopped cleanly.
      • Absolutely critical when a crash or an unclean shutdown happens.
      • A single corrupted block found during the crash recovery makes the cluster unrecoverable.
      • The number of segments is managed automatically by the database.
      • Putting the location on a dedicated and highly reliable device is vital.
  • 25. A byte it’s a byte, it’s a byte it’s a byte. Pages. Voyage to the centre of the datafile. Each block is structured almost the same way, for tables and indices. Figure: Page schema
  • 26. A byte it’s a byte, it’s a byte it’s a byte. Pages. Each page starts with a 24-byte header followed by an optional bitmap to track nulls. After the header’s end the tuple pointers are stored, usually 4 bytes each. The physical tuples are stored at the page’s end. Figure: Page header
  • 29. A byte it’s a byte, it’s a byte it’s a byte. Page header. The page header contains a couple of interesting things...
      • pd_lsn is the most recent WAL sequence number for the page.
      • pd_tli is the page’s timeline id.
      • Yes, PostgreSQL has timelines... When a point in time recovery happens, a new timeline is generated to avoid conflicts and paradoxes.
  • 30. A byte it’s a byte, it’s a byte it’s a byte. Time and relative dimension in space. People assume that transactions in PostgreSQL are a strict progression of xid, but actually, from a non-linear, non-subjective viewpoint and thanks to the timelines, it’s more like a big ball of wibbly wobbly... timey wimey... stuff. Figure: Would you like a jelly baby?
  • 31. A byte it’s a byte, it’s a byte it’s a byte. The tuples. Now we can finally look at the physical tuples and discover another header, this one of 27 bytes. The numbers in the figure are the bytes used by the single values. Each tuple, even a simple boolean value, has a 27-byte overhead. The user data can be the actual data stream or a pointer to the out-of-line data stream. Figure: Tuple structure
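To get a feeling for what these overheads mean, the figures quoted in the slides (24-byte page header, 4-byte tuple pointers, 27-byte tuple header) can be plugged into a rough capacity estimate. The sketch below is an upper bound only: it assumes no alignment padding, no null bitmap and a completely full page, and the function names are made up for the illustration.

```python
PAGE_SIZE = 8192      # default block size
PAGE_HEADER = 24      # page header size quoted in the slides
ITEM_POINTER = 4      # bytes per tuple pointer, as quoted
TUPLE_HEADER = 27     # per-tuple overhead, as quoted


def max_tuples_per_page(user_data_bytes):
    """Upper bound on the tuples in one page, ignoring alignment padding."""
    per_tuple = ITEM_POINTER + TUPLE_HEADER + user_data_bytes
    return (PAGE_SIZE - PAGE_HEADER) // per_tuple


def overhead_ratio(user_data_bytes):
    """Fraction of a full page that does not hold user data."""
    tuples = max_tuples_per_page(user_data_bytes)
    return 1 - (tuples * user_data_bytes) / PAGE_SIZE


if __name__ == "__main__":
    for size in (1, 100, 1000):
        print(size, max_tuples_per_page(size), round(overhead_ratio(size), 3))
```

A table made of a single boolean column is the extreme case: almost the whole page is headers and pointers.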
  • 32. The magic of the MVCC. Jargon
      • transaction: a unit of work, coherent, reliable and independent of other transactions.
      • transaction isolation level: property that defines how and when the changes made by one operation become visible to other concurrent operations.
      • non repeatable read: a transaction re-reads data it has previously read and finds that the data has been modified by another transaction (that committed since the initial read).
      • phantom read: a transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows satisfying the condition has changed due to another recently committed transaction.
      • read committed: permits non repeatable reads and phantom reads.
      • serializable: forbids non repeatable reads and phantom reads.
      • xid: transaction id, a 4-byte integer.
  • 33. The magic of the MVCC. PostgreSQL consistency. Statements in PostgreSQL happen through transactions. By default, when a single statement completes successfully the database automatically commits the transaction. It’s possible to wrap multiple statements in a single transaction using the keywords BEGIN; ... COMMIT; (or ROLLBACK;). The lowest available transaction isolation level is READ COMMITTED. PostgreSQL supports savepoints for partial rollbacks.
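A partial rollback with a savepoint looks like the sketch below. To keep it runnable without a server it uses Python's bundled sqlite3 module as a stand-in, which is an assumption of this example and not part of the talk; the BEGIN, SAVEPOINT, ROLLBACK TO SAVEPOINT and COMMIT statements are spelled the same way in PostgreSQL. The table name t_demo and the savepoint name sp_one are made up.

```python
import sqlite3

# SQLite stands in for PostgreSQL only so that the sketch runs anywhere.
# isolation_level=None disables the implicit BEGIN issued by the module.
conn = sqlite3.connect(":memory:", isolation_level=None)
cur = conn.cursor()
cur.execute("CREATE TABLE t_demo (i_id integer)")

cur.execute("BEGIN")
cur.execute("INSERT INTO t_demo VALUES (1)")
cur.execute("SAVEPOINT sp_one")
cur.execute("INSERT INTO t_demo VALUES (2)")
cur.execute("ROLLBACK TO SAVEPOINT sp_one")   # undoes only the second insert
cur.execute("COMMIT")                          # the first insert survives

rows = [row[0] for row in cur.execute("SELECT i_id FROM t_demo ORDER BY i_id")]
print(rows)
```

Only the work done after the savepoint is discarded; the transaction itself stays open until the COMMIT.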
  • 34. The magic of the MVCC. How PostgreSQL keeps things consistent. PostgreSQL’s consistency is achieved using MVCC, which stands for Multi Version Concurrency Control. The base logic seems simple. A 4-byte unsigned integer called xid is incremented by 1 and assigned to the current transaction. Every committed xid lower than the current xid is in the past and therefore visible to the current session. Every xid greater than the current xid is in the future and therefore invisible to the current session. The commit status is managed in $PGDATA using the directory pg_clog, where small 8k files track the transaction statuses. The xid match is performed against the tuple header seen before.
  • 36. The magic of the MVCC
      • t_xmin contains the xid generated at tuple insert.
      • t_xmax contains the xid generated at tuple delete.
      • t_cid contains the internal command id, used to track the sequence inside the same transaction.
      • There’s something missing, isn’t there? Where is the field to store the UPDATE xid? Figure: Tuple structure
  • 37. The magic of the MVCC. Well, PostgreSQL actually NEVER performs an update in place. When an UPDATE statement is issued, the updated rows are inserted as new versions with t_xmin set to the current XID value. The old row versions are marked as dead by writing the current transaction id into their t_xmax field. The database manages the tuple’s visibility using this simple routine.
  • 38. The magic of the MVCC. Source code comment in src/backend/utils/time/tqual.c:
        /*
         * The satisfaction of "now" requires the following:
         *
         * ((Xmin == my-transaction &&        inserted by the current transaction
         *   Cmin < my-command &&             before this command, and
         *   (Xmax is null ||                 the row has not been deleted, or
         *    (Xmax == my-transaction &&      it was deleted by the current transaction
         *     Cmax >= my-command)))          but not before this command,
         * ||                                 or
         *  (Xmin is committed &&             the row was inserted by a committed transaction, and
         *   (Xmax is null ||                 the row has not been deleted, or
         *    (Xmax == my-transaction &&      the row is being deleted by this transaction
         *     Cmax >= my-command) ||         but it's not deleted "yet", or
         *    (Xmax != my-transaction &&      the row was deleted by another transaction
         *     Xmax is not committed))))      that has not been committed
         */
  • 39. The magic of the MVCC. Source code comment in src/backend/utils/time/tqual.c (continued):
         * HeapTupleSatisfiesNow
         *     True iff heap tuple is valid "now".
         *
         * Here, we consider the effects of:
         *     all committed transactions (as of the current instant)
         *     previous commands of this transaction
         *
         * Note we do _not_ include changes made by the current command. This
         * solves the "Halloween problem" wherein an UPDATE might try to re-update
         * its own output tuples, http://en.wikipedia.org/wiki/Halloween_Problem.
         *
         * Note:
         *     Assumes heap tuple is valid.
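The boolean condition quoted in the comment can be transcribed almost literally. The Python sketch below is not the tqual.c implementation: it uses a simplified tuple header where xmax is None for a row that was never deleted, and it assumes the commit status is available as a plain set of committed xids instead of being looked up in pg_clog. All the names are made up for the illustration.

```python
from collections import namedtuple

# Simplified tuple header; xmax and cmax are None when the row was never deleted.
TupleHeader = namedtuple("TupleHeader", "xmin cmin xmax cmax")


def satisfies_now(tup, my_xid, my_cid, committed):
    """Transcription of the quoted comment; `committed` is a set of xids."""
    xmax_null = tup.xmax is None
    # Deleted by this transaction, but not before the current command.
    deleted_later_by_me = (
        not xmax_null and tup.xmax == my_xid and tup.cmax >= my_cid
    )
    # Deleted by another transaction which has not committed.
    deleted_by_uncommitted_other = (
        not xmax_null and tup.xmax != my_xid and tup.xmax not in committed
    )

    own_insert = (
        tup.xmin == my_xid
        and tup.cmin < my_cid
        and (xmax_null or deleted_later_by_me)
    )
    committed_insert = tup.xmin in committed and (
        xmax_null or deleted_later_by_me or deleted_by_uncommitted_other
    )
    return own_insert or committed_insert


if __name__ == "__main__":
    committed = {80, 90}
    print(satisfies_now(TupleHeader(90, 0, None, None), 100, 2, committed))
    print(satisfies_now(TupleHeader(90, 0, 80, 0), 100, 2, committed))
```

A row inserted by the current command itself (Cmin equal to the current command id) is not visible, which is the Halloween problem protection mentioned in the comment.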
  • 40. The magic of the MVCC. The dead tuples are not immediately reclaimed and add overhead to any I/O operation. This happens because the whole page is read to determine which tuples are visible to the transaction’s xid. To free the space the VACUUM command must be used. It’s designed to have minimal impact on the database’s normal activity, so the effect on other sessions is very limited. VACUUM scans the relation and its indices for dead tuples no longer visible to any open transaction. It is absolutely vital to run VACUUM on each database of the cluster at least every 2 billion transactions.
  • 41. The magic of the MVCC. The XID wraparound failure
      • XID is a 4-byte unsigned integer.
      • Every 4 billion transactions the value wraps.
      • PostgreSQL uses the modulo-2^31 comparison method.
      • For each value, 2 billion XIDs are in the future and 2 billion are in the past.
      • When an xid’s age gets too close to 2 billion, VACUUM freezes the xmin value to a hardcoded xid which is forever in the past.
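The circular comparison can be sketched as follows: the difference between two xids is taken on 32 bits and read as a signed number, so half of the circle is "older" and half is "newer". This is a minimal Python illustration with a made-up function name; it ignores the special reserved xids, including the frozen one, which are treated separately.

```python
def xid_precedes(a, b):
    """True when xid `a` is logically older than xid `b` on the 32-bit circle."""
    diff = (a - b) & 0xFFFFFFFF      # wrap the difference to 32 bits
    return diff >= 0x80000000        # i.e. negative when read as signed 32-bit


if __name__ == "__main__":
    print(xid_precedes(5, 10))             # plain case, no wraparound
    print(xid_precedes(4294967290, 10))    # 10 was assigned after the wrap
```

An xid just below 2^32 is still in the past of xid 10, because 10 was assigned after the counter wrapped.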
  • 42. The magic of the MVCC. The XID wraparound failure. If for any reason an xid gets within 10 million transactions of the wraparound failure, the database starts emitting scary messages:
        WARNING: database "mydb" must be vacuumed within 177009986 transactions
        HINT: To avoid a database shutdown, execute a database-wide VACUUM in "mydb".
    If an xid gets within 1 million transactions of the wraparound failure, the database simply shuts down and can be started only in single user mode to perform the VACUUM. Anyway, the autovacuum daemon, even if turned off, starts the required VACUUM long before this catastrophic scenario happens.
  • 43. TOAST Please, and don’t forget the Marmite. Jargon
      • bytea: binary string used by postgres to store large object data.
      • varlena: variable length data type.
      • datum: everything and nothing; it is the raw data in memory before the processing.
  • 45. TOAST Please, and don’t forget the Marmite. TOAST, the best thing since sliced bread. Funny people indeed: TOAST is the acronym for The Oversized-Attribute Storage Technique. The attribute is also known as field. TOAST can store up to 1 GB in the out-of-line storage (and free of charge).
  • 46. TOAST Please, and don’t forget the Marmite. Fixed-length data types like integer, date and timestamp are not TOASTable. The data is stored after the tuple header. Varlena data types such as character varying without an upper bound, text or bytea are stored in line or out of line. The storage technique used depends on the data stream size and on the storage method assigned to the attribute. Depending on the storage strategy, the data can be stored in an external relation and/or compressed using pglz, a fast algorithm of the LZ family.
  • 47. TOAST Please, and don’t forget the Marmite. TOAST permits four storage strategies (shamelessly copied from the online manual).
      • PLAIN prevents either compression or out-of-line storage. This is the only possible strategy for columns of non-TOAST-able data types.
      • EXTENDED allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression will be attempted first, then out-of-line storage if the row is still too big.
      • EXTERNAL allows out-of-line storage but not compression. Use of EXTERNAL will make substring operations on wide text and bytea columns faster, at the penalty of increased storage space.
      • MAIN allows compression but not out-of-line storage. Actually, out-of-line storage will still be performed for such columns, but only as a last resort.
  • 48. TOAST Please, and don’t forget the Marmite. When the out-of-line storage is used, the data is encoded as bytea and, if needed, split into multiple chunks. A unique index over chunk_id and chunk_seq both prevents duplicate data and speeds up the lookups. Figure: Toast table
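The chunking idea can be sketched in a few lines of Python. This is an illustration only: the chunk size of 2000 bytes, the function names and the chunk_id value used in the example are assumptions, and the real TOAST table also deals with compression and the storage strategies listed before.

```python
CHUNK_SIZE = 2000  # assumed chunk payload in bytes, chosen for the illustration


def toast_chunks(chunk_id, data):
    """Split `data` into (chunk_id, chunk_seq, chunk_data) rows."""
    return [
        (chunk_id, seq, data[offset:offset + CHUNK_SIZE])
        for seq, offset in enumerate(range(0, len(data), CHUNK_SIZE))
    ]


def detoast(rows):
    """Rebuild the value by reading the chunks in chunk_seq order."""
    return b"".join(chunk for _, _, chunk in sorted(rows, key=lambda r: r[1]))


if __name__ == "__main__":
    value = bytes(range(256)) * 20          # 5120 bytes of sample data
    rows = toast_chunks(16400, value)       # 16400 is a made-up chunk_id
    print([(cid, seq, len(chunk)) for cid, seq, chunk in rows])
    print(detoast(rows) == value)
```

The pair (chunk_id, chunk_seq) identifies each row uniquely, which is why a unique index over the two columns is enough to fetch the chunks back in order.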
  • 49. It’s bigger on the inside. Jargon
      • backend process: a postgres process attached to the shared buffer.
      • heap page: a table’s data page.
      • index page: an index data page.
      • buffer: a page, index or heap, loaded in the shared buffer.
      • dirty buffer: a buffer WAL-logged but not yet written to disk.
      • clean buffer: a buffer written to its corresponding data file.
      • pinned buffer: a buffer held by a backend process.
      • unpinned buffer: a buffer released and available to be pinned again.
  • 50. It’s bigger on the inside. A PostgreSQL instance is a memory segment shared between multiple processes accessing the data directory. When a new connection happens, a new postgres process is forked and attached to the shared memory, also known as the shared buffer. PostgreSQL is a multiprocess database system, not a multithreaded one: each process can use only one processor or core. To keep things consistent every single block, whether for reading or for writing, must pass through the shared buffer. Usually the shared buffer is smaller than the database, and sometimes smaller than a single table, so the blocks in memory must be managed and the space allocation must adapt to the usage.
  • 51. It’s bigger on the inside. In the early days of PostgreSQL 7.x a simple most recently used list was used: the algorithm moved a buffer to the first position of the list after each pin. During the development of the new generation 8.0 a powerful algorithm was introduced: the Adaptive Replacement Cache, two self-adapting memory pools for the most recently used and the most frequently used buffers. This algorithm was removed a few weeks before the stable release because of a software patent. An emergency two-queue algorithm was adopted, making the memory management workable but not brilliant. The next year, release 8.1 adopted the clock sweep memory manager. The algorithm is still in use with few improvements: simple, flexible and free.
  • 52. It’s bigger on the inside. The buffer manager’s main goal is to keep the most recently used blocks cached in memory and to adapt dynamically to the most frequently used blocks. To do this, a small memory portion is used as a free list of the buffers available for eviction. Figure: Free list
  • 53. It’s bigger on the inside. Each buffer has a usage counter which increases by one when the buffer is pinned, up to a small limit value. Figure: Block usage counter
  • 54. It’s bigger on the inside. Shamelessly copied from the file src/backend/storage/buffer/README: There is a "free list" of buffers that are prime candidates for replacement. In particular, buffers that are completely free (contain no valid page) are always in this list. To choose a victim buffer to recycle when there are no free buffers available, we use a simple clock-sweep algorithm, which avoids the need to take system-wide locks during common operations.
  • 55. It’s bigger on the inside. It works like this: Each buffer header contains a usage counter, which is incremented (up to a small limit value) whenever the buffer is pinned. (This requires only the buffer header spinlock, which would have to be taken anyway to increment the buffer reference count, so it’s nearly free.) The "clock hand" is a buffer index, NextVictimBuffer, that moves circularly through all the available buffers. NextVictimBuffer is protected by the BufFreelistLock.
  • 56. It’s bigger on the inside. The algorithm for a process that needs to obtain a victim buffer is:
      1. Obtain BufFreelistLock.
      2. If buffer free list is nonempty, remove its head buffer. If the buffer is pinned or has a nonzero usage count, it cannot be used; ignore it and return to the start of step 2. Otherwise, pin the buffer, release BufFreelistLock, and return the buffer.
      3. Otherwise, select the buffer pointed to by NextVictimBuffer, and circularly advance NextVictimBuffer for next time.
      4. If the selected buffer is pinned or has a nonzero usage count, it cannot be used. Decrement its usage count (if nonzero) and return to step 3 to examine the next buffer.
      5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
    (Note that if the selected buffer is dirty, we will have to write it out before we can recycle it; if someone else pins the buffer meanwhile we will have to give up and try another buffer. This however is not a concern of the basic select-a-victim-buffer algorithm.)
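The five steps can be followed in a small single-process simulation. The Python sketch below is not the PostgreSQL implementation: there is no BufFreelistLock because nothing runs concurrently, the limit of 5 for the usage counter is an assumption, and the error raised when every buffer is pinned is a guard added only to make the loop terminate. The class and method names are made up.

```python
class Buffer:
    def __init__(self):
        self.pinned = False
        self.usage_count = 0


class ClockSweep:
    MAX_USAGE = 5  # the "small limit value"; the exact number is an assumption

    def __init__(self, n_buffers):
        self.buffers = [Buffer() for _ in range(n_buffers)]
        self.free_list = list(range(n_buffers))  # all buffers start free
        self.next_victim = 0                     # the "clock hand"

    def pin(self, index):
        buf = self.buffers[index]
        buf.pinned = True
        buf.usage_count = min(buf.usage_count + 1, self.MAX_USAGE)

    def unpin(self, index):
        self.buffers[index].pinned = False

    def get_victim(self):
        # Step 2: try the free list first, skipping unusable entries.
        while self.free_list:
            index = self.free_list.pop(0)
            buf = self.buffers[index]
            if not buf.pinned and buf.usage_count == 0:
                self.pin(index)
                return index
        # Steps 3 and 4: sweep the clock hand, decrementing usage counts.
        pinned_in_a_row = 0
        while pinned_in_a_row < len(self.buffers):
            index = self.next_victim
            self.next_victim = (index + 1) % len(self.buffers)
            buf = self.buffers[index]
            if buf.pinned or buf.usage_count > 0:
                pinned_in_a_row = pinned_in_a_row + 1 if buf.pinned else 0
                if buf.usage_count > 0:
                    buf.usage_count -= 1
                continue
            # Step 5: the buffer is unpinned with a zero usage count.
            self.pin(index)
            return index
        raise RuntimeError("no unpinned buffers available")


if __name__ == "__main__":
    pool = ClockSweep(3)
    print([pool.get_victim() for _ in range(3)])  # served from the free list
    for i in range(3):
        pool.unpin(i)
    pool.pin(1)
    pool.unpin(1)             # buffer 1 is now "hotter" than the others
    print(pool.get_victim())  # the sweep needs more than one pass
```

Frequently pinned buffers collect a higher usage count and therefore survive more passes of the clock hand, which is how the algorithm approximates "most frequently used" without keeping an ordered list.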
  • 57. It’s bigger on the inside. Figure: The NextVictimBuffer
  • 58. It’s bigger on the inside. Since version 8.3 the buffer manager has the ring buffer strategy. Operations which require a large number of buffers in memory, like VACUUM or sequential scans of large tables, have a dedicated 256 kB ring buffer, small enough to fit in the processor’s L2 cache.
  • 59. The answer is 42. How PostgreSQL executes a query. Jargon
      • OID: Object ID, a 4-byte unsigned integer used to map any system object to a unique value.
      • class: any relational object: table, index, view, sequence...
      • attribute: basically, the table fields.
      • execution plan: the steps needed by the database to execute the query, with or without returning data.
      • nodes: the execution plan’s steps.
      • CBO: cost based optimizer; the execution plan is generated by evaluating the plan’s cost.
      • cost: arbitrary value used to determine a score for the nodes and the execution plans.
  • 60. The answer is 42. How PostgreSQL executes a query. When a query is sent for processing, PostgreSQL first performs a syntax analysis using the query parser. Any error in this phase stops the execution, throwing a syntax error. As this stage doesn’t require access to the system catalogue, no xid is wasted. If the syntax is correct the parser returns a parse tree ready for the next step.
  • 61. The answer is 42. The query tree. The second stage is still managed by the parser, which accesses the system catalogue and generates a query tree from the parse tree. This is the SQL’s logical representation, where every object and attribute is unique.
  • 62. The answer is 42. The query tree. Figure: A simple query tree
  • 63. The answer is 42. The query tree. To generate the query tree the parser accesses the system catalogue and retrieves the corresponding OID for each class and attribute in the query. Ambiguous names generate an error. The optional filtering elements are translated into the query tree as well.
  • 65. The answer is 42. The planner stage. The query tree is then sent to the query planner, which traverses the tree and generates all the possible execution plans, each with an arbitrary cost estimated from the statistics collected by the database. The plan with the minimum estimated cost is chosen for processing and sent to the executor. Let me stress again the word estimate. A database with old or missing statistics will generate inefficient plans, resulting in slow queries.
  • 66. The answer is 42 The executor The planner then returns the execution plan: a sequence of steps to retrieve the requested data, manipulate the data or change the database structure. The last stage is the executor. The steps of the execution plan are executed, then any output is returned to the backend.
  • 70. The answer is 42 EXPLAIN (or EXTERMINATE) The EXPLAIN statement returns the estimated execution plan for the query that follows it. The optional clause ANALYZE actually executes the query, discards the results and returns the real execution plan. DML queries run with EXPLAIN ANALYZE will change the data; they should be wrapped between BEGIN; and ROLLBACK; to avoid unwanted results. Let’s see EXPLAIN in action.
  • 71. The answer is 42 For our purpose we’ll create a test table with two fields: an identifier, a 4-byte integer with an auto-incrementing value, and a character varying field to store md5 values. The serial pseudo-type is short for CREATE SEQUENCE t_test_i_id_seq (self-generated name), then integer NOT NULL DEFAULT nextval('t_test_i_id_seq'::regclass).
    Listing 1: Create table
    test=# CREATE TABLE t_test
           (
               i_id serial,
               v_value character varying(50)
           );
    NOTICE:  CREATE TABLE will create implicit sequence "t_test_i_id_seq" for serial column "t_test.i_id"
    CREATE TABLE
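What serial expands to can also be written out by hand. A sketch following the expansion described in the PostgreSQL manual; t_test_manual is a hypothetical table name, chosen only to avoid clashing with t_test.

```sql
-- Hand-written equivalent of "i_id serial" (t_test_manual is a hypothetical name).
CREATE SEQUENCE t_test_manual_i_id_seq;

CREATE TABLE t_test_manual
(
    i_id integer NOT NULL DEFAULT nextval('t_test_manual_i_id_seq'::regclass),
    v_value character varying(50)
);

-- Tie the sequence to the column, so that dropping the column or the table drops it too.
ALTER SEQUENCE t_test_manual_i_id_seq OWNED BY t_test_manual.i_id;
```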
  • 72. The answer is 42 Now let’s add some rows to our table
    Listing 2: Insert in table
    test=# INSERT INTO t_test (v_value)
           SELECT v_value
           FROM
           (
               SELECT
                   generate_series(1,1000) AS i_cnt,
                   md5(random()::text) AS v_value
           ) t_gen;
    INSERT 0 1000
  • 73. The answer is 42 Let’s generate the estimated plan for a one-row result
    Listing 3: EXPLAIN
    test=# EXPLAIN SELECT * FROM t_test WHERE i_id=20;
                           QUERY PLAN
    --------------------------------------------------------
     Seq Scan on t_test  (cost=0.00..21.50 rows=1 width=37)
       Filter: (i_id = 20)
    (2 rows)
    As the table has no indexes, the only possible action is a sequential scan of the table. The cost is expressed in arbitrary units. The first cost value is the startup cost, the cost to deliver the first row to the next operator or to the backend. The second value is the total cost needed to return all the rows. The rows value is the estimated number of rows returned by the operation, and the width is the estimated average row width in bytes.
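The total cost of a sequential scan can be rebuilt by hand from the planner settings and the table size recorded in pg_class. The query below is a sketch of that arithmetic, following the worked example in the PostgreSQL manual; it assumes a single table named t_test exists and that its statistics are current.

```sql
-- relpages and reltuples are only refreshed by VACUUM, ANALYZE and a few DDL commands.
ANALYZE t_test;

SELECT
    relpages,
    reltuples,
    relpages * current_setting('seq_page_cost')::float8           -- reading every page
    + reltuples * current_setting('cpu_tuple_cost')::float8       -- processing every row
    + reltuples * current_setting('cpu_operator_cost')::float8    -- evaluating i_id = 20 on every row
        AS seq_scan_total_cost
FROM
    pg_class
WHERE
    relname = 't_test';
```

With the default settings (1.0, 0.01 and 0.0025), a 1000-row table stored in 9 pages gives 9 + 10 + 2.5 = 21.5, which is consistent with the 21.50 in Listing 3. The page count is an inference from that figure, not something the slides state.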
  • 74. The answer is 42 Now let’s generate the real execution plan for a one-row result
    Listing 4: EXPLAIN ANALYZE
    test=# EXPLAIN ANALYZE SELECT * FROM t_test WHERE i_id=20;
                           QUERY PLAN
    --------------------------------------------------------------------------------------------------
     Seq Scan on t_test  (cost=0.00..21.50 rows=1 width=37) (actual time=0.021..0.198 rows=1 loops=1)
       Filter: (i_id = 20)
       Rows Removed by Filter: 999
     Total runtime: 0.235 ms
    (4 rows)
    The actual time values give the real time, in milliseconds, spent before the first row is returned and to return all the rows. The loops value shows how many times the operator is executed. At the bottom, the total runtime gives the real execution time for the query.
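The earlier warning about DML under EXPLAIN ANALYZE can be sketched with the same t_test table. The UPDATE is an arbitrary example statement, not one taken from the slides.

```sql
BEGIN;

-- EXPLAIN ANALYZE really runs the UPDATE inside this transaction.
EXPLAIN ANALYZE
UPDATE t_test
SET v_value = md5(random()::text)
WHERE i_id = 20;

-- Undo the change made while measuring.
ROLLBACK;
```

The plan and timings are still reported, while the row keeps its original value after the ROLLBACK.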
  • 75. The answer is 42 Let’s add an index on the id field...
    Listing 5: CREATE INDEX
    test=# CREATE INDEX idx_i_id ON t_test (i_id);
    CREATE INDEX
  • 76. The answer is 42 and generate a new execution plan
    Listing 6: EXPLAIN ANALYZE WITH INDEX
    test=# EXPLAIN ANALYZE SELECT * FROM t_test WHERE i_id=20;
                           QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------
     Index Scan using idx_i_id on t_test  (cost=0.00..8.27 rows=1 width=37) (actual time=0.019..0.020 rows=1 loops=1)
       Index Cond: (i_id = 20)
     Total runtime: 0.055 ms
    (3 rows)
    The scan itself is about ten times faster (0.020 ms against 0.198 ms) and the total runtime about four times faster (0.055 ms against 0.235 ms).
  • 77. The answer is 42 The cost-based optimizer is consistently clever. For example, if we ask for most of the table, the database will choose the sequential scan, which is estimated to be much cheaper than a full index scan.
    Listing 7: EXPLAIN ANALYZE WITH INDEX
    test=# EXPLAIN ANALYZE SELECT * FROM t_test WHERE i_id>20;
                           QUERY PLAN
    ------------------------------------------------------------------------------------------------------
     Seq Scan on t_test  (cost=0.00..21.50 rows=980 width=37) (actual time=0.013..0.148 rows=980 loops=1)
       Filter: (i_id > 20)
       Rows Removed by Filter: 20
     Total runtime: 0.209 ms
    (4 rows)
    test=# SET enable_seqscan='off';
    SET
    test=# EXPLAIN ANALYZE SELECT * FROM t_test WHERE i_id>20;
                           QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------
     Index Scan using idx_i_id on t_test  (cost=0.00..49.40 rows=980 width=37) (actual time=0.042..0.390 rows=980 loops=1)
       Index Cond: (i_id > 20)
     Total runtime: 0.507 ms
    (3 rows)
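After a plain SET, enable_seqscan stays off for the rest of the session, so an experiment like the one above is easier to undo when it is scoped to a transaction. A minimal sketch using SET LOCAL and the same t_test table:

```sql
BEGIN;

-- SET LOCAL lasts only until the end of the current transaction.
SET LOCAL enable_seqscan = off;

EXPLAIN SELECT * FROM t_test WHERE i_id > 20;

ROLLBACK;

-- Shows the session value again (on, unless it was changed elsewhere).
SHOW enable_seqscan;
```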
  • 81. The answer is 42 Scan nodes
    seq scan: sequentially scans all the blocks in the table and discards the unmatched rows.
    index scan: reads the index tree with random disk reads and accesses the heap blocks pointed to by the index. This returns ordered data.
    index only scan: reads the index tree with random disk reads and returns the data without accessing the heap pages, as long as the visibility map marks them as all-visible. This returns ordered data.
    bitmap index/heap scan: reads the index and builds a bitmap of the matching rows, then visits the heap pages in physical order and rechecks the condition where needed. It’s a good compromise between a seq scan and a full index scan. This doesn’t return ordered data.
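The queries below are candidates for the last two scan types, using the t_test table and the idx_i_id index from the listings. This is only a sketch: which node the planner actually picks depends on the PostgreSQL version, the table size and the statistics, so the plans may differ from the ones the comments aim for.

```sql
-- Refresh the statistics and the visibility map, which index only scans rely on.
VACUUM ANALYZE t_test;

-- Only the indexed column is requested: a candidate for an index only scan.
EXPLAIN SELECT i_id FROM t_test WHERE i_id = 20;

-- A mid-sized range: a candidate for a bitmap index/heap scan.
EXPLAIN SELECT * FROM t_test WHERE i_id BETWEEN 100 AND 300;
```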
  • 85. The answer is 42 Join nodes
    nested loop: for each row of the relation on the left, scans the relation on the right for the rows matching the join condition.
    hash join: the rows of one table are entered into an in-memory hash table, then the other table is scanned and the hash table is probed for matches to each row.
    merge join: the two tables are read in join-key order and the matching rows are merged. Requires sorted input.
    It is possible to enable or disable nodes with SET, using the enable_* class of parameters.
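The planner switches can be listed from pg_settings, and their effect on the join node compared inside a transaction. A sketch assuming the t_test table from the listings; the self join and its aliases are made up for illustration, and the join node chosen in each case depends on the installation.

```sql
-- List the planner switches and their current values.
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'enable%'
ORDER BY name;

BEGIN;

-- Plan chosen by the planner for a self join on the indexed column.
EXPLAIN
SELECT tst_left.i_id, tst_right.v_value
FROM t_test tst_left
    INNER JOIN t_test tst_right ON tst_left.i_id = tst_right.i_id;

-- Discourage hash joins for this transaction only, then compare the plan.
SET LOCAL enable_hashjoin = off;

EXPLAIN
SELECT tst_left.i_id, tst_right.v_value
FROM t_test tst_left
    INNER JOIN t_test tst_right ON tst_left.i_id = tst_right.i_id;

ROLLBACK;
```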
  • 86. The answer is 42 ANALYZE
    The ANALYZE command does random reads on a portion of the table.
    The gathered statistics are stored in the pg_statistic system table.
    The pg_stats view presents the statistics in human-readable form.
    The default_statistics_target parameter sets the level of detail of the statistics (the number of most-common values and histogram buckets kept per column) and, with it, the size of the sample read; it is not a percentage of the table.
    The statistics target can be fine-tuned per column.
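The points above can be tried on the t_test table from the listings. A minimal sketch; the per-column target of 500 is an arbitrary value picked for illustration.

```sql
-- Collect statistics for the whole table.
ANALYZE t_test;

-- Human-readable view over pg_statistic.
SELECT attname, null_frac, n_distinct, correlation
FROM pg_stats
WHERE tablename = 't_test';

-- Current default target.
SHOW default_statistics_target;

-- Raise the target for one column only (500 is an arbitrary example value),
-- then refresh the statistics for that column.
ALTER TABLE t_test ALTER COLUMN v_value SET STATISTICS 500;
ANALYZE t_test (v_value);
```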
  • 87. Why do we fall? jargon
    REINDEX: reads the table’s data and creates a new index file, writes the change into the system catalogue and deletes the old index file.
    pg_ls_dir: low-level function to read directories from the database.
    client_min_messages: verbosity level for the client’s messages.
    trace_sort: when set to on, with client_min_messages at a debug level, shows the sort operations in high detail.
    DDL query: data definition language query (CREATE TABLE, REINDEX, etc.).
    crash recovery: replay of the logged WAL segments needed to restore the database to a consistent state after a crash. All the uncommitted transactions are rolled back.
    checkpoint: disk consolidation activity. During a checkpoint all the dirty buffers are synced to the data files and the corresponding WAL files can be recycled or deleted.
    log switch: when the WAL writer switches from one WAL file to another.
    checkpoint_segments: number of log switches before a new checkpoint starts.
  • 88. Why do we fall? Crash, recovery and dead files Before starting let’s describe the scenario:
    pg_xlog is on a smaller dedicated filesystem
    a long running transaction is doing a massive data update
    a reindex is then started on a big index
    the checkpoint_segments parameter is very very very high
  • 89. Why do we fall? Crash, recovery and dead files Because of the high value of checkpoint_segments:
    checkpoints don’t happen often
    the WAL files cannot be recycled
    the database therefore generates new WAL files to store the updated blocks
    reindex starts building a new file node, writing a not yet committed entry in pg_class with the new file node reference
  • 90. Why do we fall? Crash, recovery and dead files If pg_xlog runs out of space:
    the cluster crashes
    the subsequent crash recovery replays all the WAL files from the latest checkpoint
    the reindex transaction is rolled back
    the new file node generated by reindex is not deleted
  • 91. Why do we fall? Crash, recovery and dead files Getting rid of this kind of file is quite difficult and dangerous, as only the database knows what is in the system catalogue. Using pg_ls_dir it is possible to match the data files with the pg_class entries.
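The matching of data files against pg_class could look like the sketch below. It rests on several assumptions: it needs superuser rights, it only looks at the current database in the default tablespace, and it only considers the first segment of each main fork (file names made of digits only). Files belonging to relations created by transactions still in progress also show up, so the output is a list of candidates to investigate, never a list of files that are safe to delete.

```sql
-- Data files in the current database directory with no matching pg_class entry.
SELECT
    dir_files.file_name
FROM
    pg_ls_dir(
        'base/' || (
            SELECT oid::text
            FROM pg_database
            WHERE datname = current_database()
        )
    ) AS dir_files (file_name)
WHERE
    dir_files.file_name ~ '^[0-9]+$'    -- skip _fsm, _vm, extra segments and temp files
    AND NOT EXISTS
    (
        SELECT 1
        FROM pg_class cls
        -- pg_relation_filenode also resolves the mapped system catalogues
        WHERE pg_relation_filenode(cls.oid)::text = dir_files.file_name
    )
ORDER BY
    dir_files.file_name;
```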
  • 92. And I thought my jokes were bad The angry DBA When I started my Oracle DBA career, I was introduced to the company’s coding best practices, as Oracle’s SQL has limitations on the size and the usable characters for identifiers. PostgreSQL is less restrictive, giving us freedom of choice up to 63 bytes for an identifier’s name (NAMEDATALEN, 64 by default, minus one). It’s a good thing but, sometimes, it can generate monsters. I’ve written some guidelines to get things sorted out.
  • 93. And I thought my jokes were bad Rule #1 Avoid camel case or reserved keywords for identifiers PostgreSQL’s identifier names are folded to lower case by default. In order to preserve the case, the identifiers must be enclosed in double quotes ("). That means that ANY query involving a camel case identifier will contain the quotes, and this affects the readability of the query. For keywords the double quoting is optional when the use isn’t ambiguous, but the result is quite disturbing in syntax highlighters.
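The case folding is easy to see in psql. A short sketch; "t_CamelCase" and "i_Id" are hypothetical names used only for illustration, and the third statement is expected to fail.

```sql
-- "t_CamelCase" and "i_Id" are hypothetical names used only for illustration.
CREATE TABLE "t_CamelCase" ("i_Id" integer);

-- Works: the quotes preserve the case.
SELECT "i_Id" FROM "t_CamelCase";

-- Fails: the unquoted names are folded to lower case.
SELECT i_Id FROM t_CamelCase;
-- ERROR:  relation "t_camelcase" does not exist

DROP TABLE "t_CamelCase";
```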
  • 94. And I thought my jokes were bad Rule #2 self explaining schema PostgreSQL stores every relation in one system table named pg_class and gives you system views like pg_tables and pg_indexes to query. But why should I bother searching the system catalogue or asking the developers if I can read the object’s nature from the database schema? Adding a t_ prefix to the table name tells us that the relation carries physical data. Adding a v_ prefix to any view tells us that we are dealing with a shortcut to a query, hopefully preventing the bad habit of joining views, which sends the query planner nuts.
  • 95. And I thought my jokes were bad Rule #2 self explaining schema
    Object               Prefix
    Table                t_
    View                 v_
    Index                idx_
    Unique Index         uidx_
    Primary key          pk_
    Unique key           uk_
    Foreign key          fk_
    Check                chk_
    Custom type          ty_
    Sql function         fn_sql_
    PlPgSql function     fn_plpg_
    PlPython function    fn_plpy_
    PlPerl function      fn_plpr_
    Trigger              trg_
    Rule                 rul_
    Table: Object prefix
  • 96. And I thought my jokes were bad Rule #2 self explaining schema
    Type                 Prefix
    Character            c_
    Character varying    v_
    Integer              i_
    Numeric              n_
    Bytea                by_
    Table: Object prefix
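Applied together, the object and type prefixes give a schema that can be read without querying the catalogue. A sketch with made-up names: t_customer, its columns and its constraints are hypothetical and not part of the talk.

```sql
-- All object names below are hypothetical examples of the prefix rules.
CREATE TABLE t_customer
(
    i_id_customer serial,
    v_name character varying(100) NOT NULL,
    n_credit numeric(10,2),
    CONSTRAINT pk_t_customer PRIMARY KEY (i_id_customer),
    CONSTRAINT chk_t_customer_credit CHECK (n_credit >= 0)
);

CREATE UNIQUE INDEX uidx_t_customer_v_name
    ON t_customer (v_name);

CREATE VIEW v_customer_names AS
SELECT
    i_id_customer,
    v_name
FROM
    t_customer;
```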
  • 97. And I thought my jokes were bad Rule #3 query formatting
    select * from debitnoteshead a join debitnoteslines b on debnotid where a.datnot=b.datnot and b.deblin>1;
    No different from minified JavaScript, all on the same line. Before doing anything other than looking at it, it must be prettified. Here are some simple rules to write decent SQL code...
  • 98. And I thought my jokes were bad Rule #3 query formatting
    1 SQL keywords shall be in upper case
    2 All the identifiers shall be indented at the same level
    3 SQL keywords must be indented at the same level
    4 Please don’t SELECT *: it is a waste of memory, and debugging or tuning requires an explicit rewrite
    5 If possible write INNER JOIN, not JOIN
    6 Avoid the implicit join: use the keyword ON to make clear which column is used for joining
    7 In the table list use self explaining aliases, not a, b, c, d...
    8 Performance tip: don’t use user defined functions in the SELECT list or the WHERE condition
  • 99. And I thought my jokes were bad Rule #3 query formatting The previous query prettified
    SELECT
        productcode,
        noteid,
        hdr.datnot
    FROM
        debitnoteshead hdr,
        debitnoteslines lns
    WHERE
        hdr.debnotid=lns.debnotid
        AND hdr.datnot=lns.datnot
        AND lns.deblin>1
    ;
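Rules 5 and 6 ask for an explicit INNER JOIN with the join columns in ON, so the same query could also be written as below. This is only a sketch: the two temporary tables are hypothetical stand-ins so that the query can run, and which table owns productcode and noteid, as well as every column type, is an assumption.

```sql
-- Hypothetical minimal tables; the real structure is not shown in the slides.
CREATE TEMPORARY TABLE debitnoteshead
(
    debnotid integer,
    datnot date,
    noteid integer
);

CREATE TEMPORARY TABLE debitnoteslines
(
    debnotid integer,
    datnot date,
    deblin integer,
    productcode text
);

-- Join columns live in ON, the row filter stays in WHERE.
SELECT
    lns.productcode,
    hdr.noteid,
    hdr.datnot
FROM
    debitnoteshead hdr
    INNER JOIN debitnoteslines lns
        ON  hdr.debnotid = lns.debnotid
        AND hdr.datnot = lns.datnot
WHERE
    lns.deblin > 1
;
```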
  • 100. What’s next? Anyone interested in a Brighton PostgreSQL Group?
  • 101. What’s next? Questions? Please be very basic, I’m just an electrician after all...
  • 102. Contacts and license Twitter: 4thdoctor_scarf (yes, I have the scarf) Blog: http://www.pgdba.co.uk This document is distributed under the terms of the Creative Commons