String Comparison Surprises: Did Postgres lose my data?

Putting words in order without losing your mind or your data
Jeremy Schneider

Jeremy Schneider
DB & Perf Engineer / AWS
Slack: pgtreats.info/slack-invite
Blog: ardentperf.com
Twitter/X: @jer_s
Organizer / Seattle PG User Group
• Engineering for Aurora and RDS Open Source: Aurora PostgreSQL first GA release (2017), ICU & collation, data durability & corruption protection, performance, fleet data collection and analysis, early logical replication work, new major versions, addressing complex operational issues, recruiting & training
• Launched an internal Amazon-wide dedicated PostgreSQL chat channel in 2018, which now has thousands of members and numerous technical discussions daily
• Founder of "RAC Attack" Community Driven Oracle Cluster Database Workshop – almost 40 events across 15 countries between 2011 and 2016
• Participant, speaker and/or organizer at Linux and Oracle user groups & conferences since the early 2000s; PostgreSQL user groups & conferences since 2017
• Database blog since 2007, Oracle ACE Alumni, Oak Table

Ager, Simon. "Omniglot - writing systems and languages of the world". Date accessed: 6th June 2023. www.omniglot.com

#PASSDataCommunitySummit
aws ec2 run-instances \
  --instance-type t2.micro --key-name mac --tag-specifications \
  'ResourceType=instance,Tags=[{Key=Name,Value=research-db}]' \
  --image-id ami-0172070f66a8ebe63 --region us-east-1
sudo apt install postgresql-common
sudo sh /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
sudo apt install postgresql-15
create database research_texts template=template0
  locale_provider=icu icu_locale='en-US';

create table arabic_dictionary_research (
word text,
crossreferences text,
notes text
) partition by range (word);
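create table arabic_dictionary_research_p1 partition of arabic_dictionary_research
for values from ('ا') to ('ح');
-- partition names below (_p2 to _p4, _default) are assumed; the slide text shows only the bounds
create table arabic_dictionary_research_p2 partition of arabic_dictionary_research
for values from ('ح') to ('س');
create table arabic_dictionary_research_p3 partition of arabic_dictionary_research
for values from ('س') to ('ل');
create table arabic_dictionary_research_p4 partition of arabic_dictionary_research
for values from ('ل') to ('ے');
create table arabic_dictionary_research_default partition of arabic_dictionary_research
default;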

© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Photo by Daniel Chekalov on Unsplash

Photo by Kanashi on Unsplash

aws ec2 run-instances \
  --instance-type t2.micro --key-name mac --tag-specifications \
  'ResourceType=instance,Tags=[{Key=Name,Value=research-db-hotstandby}]' \
  --image-id ami-0fd2c44049dd805b8 --region us-east-1
sudo apt install postgresql-common
sudo sh /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
sudo apt install postgresql-15
# cut and paste instructions from
# https://ubuntu.com/server/docs/databases-postgresql
# to easily set up the hot standby database

Photo by Aditya Chinchure on Unsplash

Checklist for Responding to Data Corruption
• Verify Backup and Log File Retention (long enough for investigation)
• Articulate and Write the Business Impact at Present
• Freeze Ongoing Changes (any dev teams)
• Inventory Copies of Data
• Safely Scan to Determine If There’s More Corruption
• Follow General Best Practices
  • Two-person rule, rename/move not delete, verify/compare healthy neighboring data, test remediations before applying on prod, document everything.
https://ardentperf.com/2019/11/08/postgresql-invalid-page-and-checksum-verification-failed/

Quick Refresher:
Encoding
Photo by Kelly Sikkema on Unsplash

Image by Simon from Pixabay

Length of this Unicode string?
• 4
• 5
• 6
• 12
• 14
• 20
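The answer depends on what is being counted. A minimal Python sketch, using a hypothetical sample string (the string shown on the slide is not reproduced in this text), illustrates how one piece of text has different lengths as code points, UTF-8 bytes and UTF-16 code units, and how normalization changes the count again:

```python
import unicodedata

# Hypothetical sample: "e" + combining acute accent, followed by one emoji.
sample = "e\u0301\U0001F600"

code_points = len(sample)                            # Python counts code points
utf8_bytes = len(sample.encode("utf-8"))             # bytes in a UTF-8 encoding
utf16_units = len(sample.encode("utf-16-le")) // 2   # 16-bit code units
nfc_code_points = len(unicodedata.normalize("NFC", sample))  # e + accent composed

print(code_points, utf8_bytes, utf16_units, nfc_code_points)
```

For this sample the four counts are 3, 7, 4 and 2, while a reader would likely say the string has two characters.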

What is encoding?
CREATE EXTENSION pageinspect;
UPDATE mytable SET id=2 WHERE id=1;
SELECT lp, t_ctid, t_xmax, t_data FROM
heap_page_items(get_raw_page('mytable',0));
 lp | t_ctid | t_xmax | t_data
----+--------+--------+--------
  1 | (0,2)  |   1113 | \x0100
  2 | (0,2)  |      0 | \x0200
https://github.com/petergeoghegan/pg_hexedit
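Encoding is the mapping between values and the bytes that are stored. The sketch below decodes the two t_data values from the output above and then shows the same idea for text. It assumes the id column is a 2-byte integer stored on a little-endian machine, which the slide does not state.

```python
# Assumption: t_data (hex 0100 and 0200) holds a 2-byte little-endian integer.
old_row = int.from_bytes(bytes.fromhex("0100"), byteorder="little")
new_row = int.from_bytes(bytes.fromhex("0200"), byteorder="little")

# The same applies to text: one character, different bytes per encoding.
enye_utf8 = "ñ".encode("utf-8")
enye_latin1 = "ñ".encode("latin-1")

# Reading bytes with the wrong encoding gives wrong text, not an error.
misread = enye_utf8.decode("latin-1")

print(old_row, new_row, enye_utf8.hex(), enye_latin1.hex(), misread)
```

The two rows decode to 1 and 2, matching the UPDATE. The letter ñ is two bytes (c3 b1) in UTF-8 and one byte (f1) in Latin-1, and the UTF-8 bytes read as Latin-1 come out as two unrelated characters.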

Assume Unexpected Characters

1978: JIS C 6226 (JIS X 0208) “2-byte” character set was developed to express hiragana and kanji characters (Japan); followed by GB/T 2312 (1980, China) and Big5 (1984, Taiwan)
1981: Xerox 8010 Information System released; family included 16-bit encodings and 27 languages (sadly didn’t sell)
1985-1987: Discussions between Xerox & Apple engineers about the need for “one universal character set”, soon joined by engineers from many other companies (IBM, Microsoft, Sun, NeXT, Novell, etc.) as well as Toronto University
1991: Unicode Consortium incorporated to “standardize, extend and promote the Unicode character encoding, a fixed-width, 16-bit character encoding for over 60,000 graphic characters.”
1991: First Unicode Standard is published
“…begin at 0 and add the next character”

https://home.unicode.org/technical-quick-start-guide/
Chris Weber

PostgreSQL Supports Client & Server Encodings

Only ASCII In Global Catalogs

Avoid Conversion When Possible
https://www.cybertec-postgresql.com/en/fix-bad-encoding-postgresql/
Laurenz Albe
May 30, 2023

1. use unicode
2. localize applications
3. assume unexpected chars

Putting Strings In Order
select *
from (values ('Baptisto')
, ('banqueta')
, ('baño')
, ('como')
, ('chorizo')
) list(word)
order by word;

Putting Strings In Order
banqueta    Baptisto    baño
baño        banqueta    banqueta
Baptisto    baño        Baptisto
como        chorizo     chorizo
chorizo     como        como
select * from (values('Baptisto'),('banqueta'),('baño'),('como'),('chorizo')) list(word)
order by word;
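The same five words can legitimately come back in three different orders. The Python sketch below reproduces the three result orderings shown above with three hand-written sort keys. The key functions are illustrative stand-ins, not the rules of any real collation library, and the slide text does not say which collation produced which ordering.

```python
import unicodedata

words = ["Baptisto", "banqueta", "baño", "como", "chorizo"]

def base_letters(word):
    """Strip accents and case so that only base letters are compared."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

# Traditional Spanish alphabet, where "ch" and "ll" count as single letters.
SPANISH = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l",
           "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w",
           "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(SPANISH)}

def traditional_spanish(word):
    """Sort key that treats the contractions "ch" and "ll" as one letter."""
    text = unicodedata.normalize("NFC", word).casefold()
    key, i = [], 0
    while i < len(text):
        step = 2 if text[i:i + 2] in ("ch", "ll") else 1
        # Letters outside this toy alphabet sort last, by code point.
        key.append(RANK.get(text[i:i + step], len(SPANISH) + ord(text[i])))
        i += step
    return key

by_code_point = sorted(words)                      # raw code point order
by_base_letter = sorted(words, key=base_letters)   # accents and case ignored
by_tradition = sorted(words, key=traditional_spanish)

print(by_code_point, by_base_letter, by_tradition, sep="\n")
```

Code point order puts the capital B first, the accent-insensitive key puts baño before banqueta, and the traditional key puts como before chorizo because "ch" sorts as its own letter after "c".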

Linguistic Collation is Complex
• Contractions: two (or more) characters sort as if they were a single base letter. In Table 4, CH acts like a single letter sorted after C.
• Expansions: a single character sorts as if it were a sequence of two (or more) characters. In Table 4, an Œ ligature sorts as if it were the sequence of O + E.
• Backwards Accent: In row 1 of Table 5, the first accent difference is on the o, so that is what determines the order. In some French dictionary ordering traditions, however, it is the last accent difference that determines the order, as shown in row 2.
https://www.unicode.org/reports/tr10/
https://www.cybertec-postgresql.com/en/case-insensitive-pattern-matching-in-postgresql/

• 6.1 (1997) – FAQ instructions to set locale as user running PG (for formatting)
• 6.3.1 (1998) – Multibyte support at database level
• 7.3 (2002) – Multibyte & locale by default, remove --enable-multibyte
• 8.4 (2009) – Locale (LC_COLLATE) at database level
• 9.1 (2011) – Support collation per-column and COLLATE clause
• 10 (2017) – Support ICU at column level
• 13 (2020) – Track collation-related versions for some Operating Systems
• 15 (2022) – Support ICU collation at database/cluster level
• 16 (2023) – Support custom ICU collation rules, build ICU by default
• Future – Discussion: multiple ICU versions, initdb use ICU by default
Tatsuo Ishii, Oleg Bartunov, Oleg Broytmann, Josef Balatka, Radek Strnad, Heikki Linnakangas, Tom Lane, Peter Eisentraut, Jeff Davis, and more…
Localization Support in PostgreSQL

• Formatting in TO_CHAR, TO_NUMBER, etc
• Automatic “recoding” of data and query results to client encoding
• Encoding and language/translation of messages (error, warning, etc)
• Character classification (letter, number, punctuation, whitespace, etc)
Flexible Collation:
• ORDER BY
• LIKE, regex
• =
• Partitions, Constraints, Generated Columns, Triggers…
• Any expression that involves strings (cf. COLLATE sql clause)
Localization Support in PostgreSQL

PostgreSQL does not include its own
collation code. Instead, it depends on
an external library installed and
managed separately.
PostgreSQL Collation Libraries

PostgreSQL Collation Libraries
• Operating System C Library (libc): Default
• International Components for Unicode (icu): Recommended

Collation Precedence in PostgreSQL

Levels of Defaults:
• OS Environment (for initdb)
• Template0/1 (for database)
• Database
• Table/Column
• Data Type (for constants)
• Explicit in SQL statement
Collation Precedence in PostgreSQL
Conflict Resolution Rules:
1. Explicit > Implicit
2. Non-default > Default
3. Indeterminate collation only raises an error if a collation is needed at runtime
Docs: Part III (Server Administration), Chapter 24 (Localization), Section 24.2 (Collation Support)

Advanced Collation Support with ICU
• Case insensitive comparison
• Comparison of base characters, ignoring accents
  • Example: count rows where user input was Mexico, México, mexico, or méxico
• Compare digits by numeric value
  • Example: id-45 < id-123
• Ignore whitespace, so that similar strings are kept close together
  • By default, glibc keeps similar strings close together, but with ICU, whitespace can cause similar strings to sort far apart from each other.
  • Example: “full time” and “full-time” and “fulltime”
• May get extra performance by comparing without normalizing
  • Safe for strings that are system-generated and guaranteed to be consistent, or that are pre-normalized
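To make the first two examples concrete, here is a rough Python imitation of what such comparisons do. It is only a sketch of the behavior using the standard library; in PostgreSQL these comparisons would come from ICU collation settings rather than hand-written key functions, and the function names here are hypothetical.

```python
import re
import unicodedata

def fold(text):
    """Compare base characters only: drop accents, ignore case."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def numeric_key(text):
    """Split into text and digit runs so digit runs compare by numeric value."""
    parts = re.split(r"(\d+)", text)
    # re.split with a capture group alternates text, digits, text, ...
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

user_input = ["Mexico", "México", "mexico", "méxico", "Monaco"]
mexico_rows = sum(1 for value in user_input if fold(value) == fold("Mexico"))

plain_order = sorted(["id-123", "id-45"])                     # character by character
numeric_order = sorted(["id-123", "id-45"], key=numeric_key)  # digits as numbers

print(mexico_rows, plain_order, numeric_order)
```

All four spellings of Mexico are counted together, and id-45 sorts before id-123 only when the digit runs are compared as numbers.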

Advanced Collation Support with ICU

Collation Challenges in PostgreSQL
Version management remains a significant challenge for collation:
• With both libc and ICU, changing the version of the library can cause corruption that isn’t noticed until long afterwards.
• For OS upgrades, it’s safe to use dump-and-load or logical replication.
• It is safe to install an old ICU version on a new OS, but you may need to build it from source.
• The “ALTER … REFRESH … VERSION” commands (ALTER COLLATION … REFRESH VERSION, ALTER DATABASE … REFRESH COLLATION VERSION) are dangerous.
  • They do not rebuild or repair anything; they only update the recorded version, so the database permanently forgets the warning.
• On version mismatch, PostgreSQL unfortunately issues a WARNING instead of an ERROR, perhaps only to server logs. The admin is responsible for knowing not to change the OS/libs, and that doing so means possible corruption.

With both libc and ICU, changing the version of the library can cause corruption that isn’t noticed until long afterwards.
Can trigger a version change:
• OS Upgrade
• Failover and Hot Standby
• Patroni, Kubernetes, etc.
• Distributed Systems
Can be corrupted by version change:
• Indexes
  • All types, not just btree
• Constraints
  • All types, not just unique/primary-key
• Partitions
• FDWs, e.g. mergejoin depends on same local/remote ordering
• Maybe: un-refreshed materialized views, triggers, generated columns? (I’m not sure)
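Why a sort order change looks like lost data can be shown with a toy model. In the Python sketch below, a sorted list stands in for an index and a binary search stands in for an index lookup; the two "collation" functions are hypothetical stand-ins for an old and a new library version, not real libc or ICU behavior.

```python
import unicodedata

def old_collation(word):
    """Stand-in for the old library version: plain code point order."""
    return word

def new_collation(word):
    """Stand-in for the new library version: accents and case are ignored."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def index_lookup(index, target, collation):
    """Binary search that trusts the list to be sorted under `collation`."""
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        if collation(index[mid]) < collation(target):
            lo = mid + 1
        else:
            hi = mid
    return lo < len(index) and index[lo] == target

words = ["Baptisto", "banqueta", "baño", "como", "chorizo"]
index = sorted(words, key=old_collation)   # "index" built before the upgrade

found_before = index_lookup(index, "baño", old_collation)
found_after = index_lookup(index, "baño", new_collation)   # after the library change

print(found_before, found_after, "baño" in index)
```

The word is still stored, and a full scan finds it, but the lookup that trusts the old ordering under the new comparison rules reports it missing. Nothing fails at the moment the library changes, which is why the damage can go unnoticed for a long time.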

👍

😭

Localization Challenges in PostgreSQL
👍


PostgreSQL Development
and Safety Improvements

Collation Torture Test
Data to answer the questions:
Is this really a problem?
How common are sort order changes?
• 10 years of historical versions
• Ubuntu and RHEL
• ICU and glibc
github.com/ardentperf/glibc-unicode-sorting

286,654 Unicode code points
91 string patterns
26 million strings

Collation Torture Test
• In summary: both glibc and ICU have regular collation changes. Both have had at least one release with very large numbers of changes.
• Code published on GitHub to generate the 26 million strings with a PL/pgSQL snippet
https://github.com/ardentperf/glibc-unicode-sorting/blob/main/run-icu.sh#L65

https://aws.amazon.com/blogs/database/manage-collation-changes-in-postgresql-on-amazon-aurora-and-amazon-rds/

Key Takeaways
• Use collation features for more readable and elegant SQL, when doing fuzzy
comparisons or multi-lingual sorting
• ICU brings powerful new capabilities around linguistic collation
• Assume there are exotic unexpected characters in your data
• Move toward using ICU in PostgreSQL (for both safety and capability)
• When upgrading your operating system, (1) dump and load, (2) use logical replication, or (3) install the old ICU version
• Not a great idea to under-pay your administrators. Give them lots of thanks and some extra vacation time. 🏝


Thank you
aws.amazon.com/rds
Jeremy Schneider
Slack: pgtreats.info/slack-invite
Blog: ardentperf.com
Twitter/X: @jer_s
