Comparisons are fundamental to computing - and comparing strings is not nearly as straightforward as you might think. Come learn about the history, nuance and surprises of “putting words in order” that you never knew existed in computer science, and how that nuance impacts both general programming and SQL programming. Next, walk through a few actual scenarios and demonstrations using PostgreSQL as a user and administrator, which you can re-run yourself later for further study, including one way you could easily corrupt your self-managed PostgreSQL database if you aren't prepared. Finally we’ll dive into an explanation of the surprising behaviors we saw in PostgreSQL, and learn more about user and administrative features PostgreSQL provides related to localized string comparison.
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
String Comparison Surprises: Did Postgres lose my data?
1. String Comparison Surprises:
Did Postgres Lose My Data?
Putting words in order
without losing your mind or your data
Jeremy Schneider
2. Jeremy
DB & Perf Engineer / AWS
Slack: pgtreats.info/slack-invite
Blog: ardentperf.com
Twitter/X: @jer_s
Organizer / Seattle PG User Group
• Engineering for Aurora and RDS Open Source: Aurora PostgreSQL
first GA release (2017), ICU & collation, data durability &
corruption protection, performance, fleet data collection and
analysis, early logical replication work, new major versions,
addressing complex operational issues, recruiting & training
• Launched an internal Amazon-wide dedicated PostgreSQL chat
channel in 2018 which now has thousands of members and
numerous technical discussions daily
• Founder of "RAC Attack" Community Driven Oracle Cluster
Database Workshop – almost 40 events across 15 countries
between 2011 and 2016
• Participant, speaker and/or organizer at Linux and Oracle
user groups & conferences since the early 2000’s. PostgreSQL user
groups & conferences since 2017.
• Database blog since 2007, Oracle ACE Alumni, Oak Table
Schneider
7. #PASSDataCommunitySummit
create table arabic_dictionary_research (
word text,
crossreferences text,
notes text
) partition by range (word);
create table arabic_dictionary_research_p1 partition of arabic_dictionary_research
for values from (')'ا to (';)'ح
create table arabic_dictionary_research_p2 partition of arabic_dictionary_research
for values from (')'ح to (';)'س
create table arabic_dictionary_research_p3 partition of arabic_dictionary_research
for values from (')'س to (';)'ل
create table arabic_dictionary_research_p4 partition of arabic_dictionary_research
for values from (')'ل to (';)'ے
create table arabic_dictionary_research_p5 partition of arabic_dictionary_research
default;
14. #PASSDataCommunitySummit
aws ec2 run-instances
--instance-type t2.micro --key-name mac --tag-specifications
'ResourceType=instance,Tags=[{Key=Name,Value=research-db-hotstandby}]'
--image-id ami-0fd2c44049dd805b8 --region us-east-1
sudo apt install postgresql-common
sudo sh /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
sudo apt install postgresql-15
# cut and paste instructions from
# https://ubuntu.com/server/docs/databases-postgresql
# to easily set up the hot standby database
19. #PASSDataCommunitySummit
• Verify Backup and Log File Retention (long enough for investigation)
• Articulate and Write the Business Impact at Present
• Freeze Ongoing Changes (any dev teams)
• Inventory Copies of Data
• Safely Scan to Determine If There’s More Corruption
• Follow General Best Practices
• Two-person rule, rename/move not delete, verify/compare healthy neighboring data,
test remediations before applying on prod, document everything.
Checklist for Responding to Data Corruption
https://ardentperf.com/2019/11/08/postgresql-invalid-page-and-checksum-verification-failed/
30. #PASSDataCommunitySummit
1978: JIS C 6226 (JIS X 0208) “2-byte” character set was
developed to express hiragana and kanji characters (Japan);
followed by GB/T 2312 (1980, China) and Big5 (1984, Taiwan)
1981: Xerox 8010 Information System released; family included
16-bit encodings and 27 languages (sadly didn’t sell)
1985-1987: Discussions between Xerox & Apple Engineers about
need for “one universal character set”, soon joined by engineers
from many other companies (IBM, Microsoft, Sun, NeXT, Novell,
etc.) as well as Toronto University
1991: Unicode Consortium Incorporated to 'standardize, extend
and promote the Unicode character encoding, a fixed-width, 16-
bit character encoding for over 60,000 graphic characters.’
1991: First Unicode Standard is published
“…begin at 0 and add the next character”
43. #PASSDataCommunitySummit
Putting Strings In Order
banqueta
baño
Baptisto
como
chorizo
Baptisto
banqueta
baño
chorizo
como
baño
banqueta
Baptisto
chorizo
como
select * from (values('Baptisto'),('banqueta'),('baño'),('como'),('chorizo’)) list(word)
order by word;
44. #PASSDataCommunitySummit
• Contractions: two (or more) characters sort as if they
were a single base letter. In Table 4, CH acts like a
single letter sorted after C.
• Expansions: a single character sorts as if it were a
sequence of two (or more) characters. In Table 4, an Œ
ligature sorts as if it were the sequence of O + E.
• Backwards Accent: In row 1 of Table 5, the first accent
difference is on the o, so that is what determines the
order. In some French
dictionary ordering traditions, however,
it is the last accent difference that
determines the order, as shown in row 2.
Linguistic Collation is Complex
https://www.unicode.org/reports/tr10/
https://www.cybertec-postgresql.com/en/case-insensitive-pattern-matching-in-postgresql/
47. #PASSDataCommunitySummit
• 6.1 (1997) – FAQ instructions to set locale as user running PG (for formatting)
• 6.3.1 (1998) – Multibyte support at database level
• 7.3 (2002) – Multibyte & locale by default, remove --enable-multibyte
• 8.4 (2009) – Locale (LC_COLLATE) at database level
• 9.1 (2011) – Support collation per-column and COLLATE clause
• 10 (2017) – Support ICU at column level
• 13 (2020) – Track collation-related versions for some Operating Systems
• 15 (2022) – Support ICU collation at database/cluster level
• 16 (2023) – Support custom ICU collation rules, build ICU by default
• Future – Discussion: multiple ICU versions, initdb use ICU by default
Tatsuo Ishii, Oleg Bartunov, Oleg Broytmann, Josef Balatka, Radek Strnad, Heikki Linnakangas, Tom Lane, Peter Eisentraut, Jeff Davis, and more…
Localization Support in PostgreSQL
48. #PASSDataCommunitySummit
• Formatting in TO_CHAR, TO_NUMBER, etc
• Automatic “recoding” of data and query results to client encoding
• Encoding and language/translation of messages (error, warning, etc)
• Character classification (letter, number, punctuation, whitespace, etc)
Flexible Collation:
• ORDER BY
• LIKE, regex
• =
• Partitions, Constraints, Generated Columns, Triggers…
• Any expression that involves strings (cf. COLLATE sql clause)
Localization Support in PostgreSQL
50. PostgreSQL does not include its own
collation code. Instead, it depends on
an external library installed and
managed separately.
PostgreSQL Collation Libraries
53. #PASSDataCommunitySummit
Levels of Defaults:
• OS Environment (for initdb)
• Template0/1 (for database)
• Database
• Table/Column
• Data Type (for constants)
• Explicit in SQL statement
Collation Precedence in PostgreSQL
Conflict Resolution Rules:
1. Explicit > Implicit
2. Non-default > Default
3. Indeterminate collation only
raises error if collation is needed
at runtime
Docs: Part III (Server Admin)
Chapter 24 (Localization)
Part 24.2 (Collation Support)
54. #PASSDataCommunitySummit
• Case insensitive comparison
• Comparison of base characters, ignoring accents
• Example: count rows where user input was Mexico, México, mexico, or méxico
• Compare digits by numeric value
• Example: id-45 < id-123
• Ignore whitespace, so that similar strings are kept close together
• By default, glibc keeps similar strings close but with ICU whitespace can cause similar
strings to sort far apart from each other.
• Example: “full time” and “full-time” and “fulltime”
• May get extra performance by comparing without normalizing
• Safe for strings that are system-generated and guaranteed to be consistent, or that are pre-
normalized
Advanced Collation Support with ICU
58. #PASSDataCommunitySummit
Version management remains a significant challenge for collation:
• With both libc and ICU, changing the version of the library can cause
corruption that isn’t noticed until long afterwards.
• For OS upgrades, it’s safe to use dump-and-load or logical replication.
• Safe to install old ICU version on new OS, but may need to build from source.
• The “ALTER … REFRESH COLLATION” command is dangerous.
• Does not change anything; instructs database to permanently forget the warning.
• On version mismatch, PostgreSQL unfortunately issues WARNING instead
of ERROR, perhaps only to server logs. Responsibility of the admin to know
you shouldn’t change the OS/libs, and that it means possible corruption.
Collation Challenges in PostgreSQL
59. #PASSDataCommunitySummit
With both libc and ICU, changing the
version of the library can cause corruption
that isn’t noticed until long afterwards.
Can trigger a version change:
• OS Upgrade
• Failover and Hot Standby
• Patroni, Kubernetes, etc
• Distributed Systems
Collation Challenges in PostgreSQL
Can be corrupted by version change:
• Indexes
• All types, not just btree
• Constraints
• All types, not just unique/primary-key
• Partitions
• FDWs – eg. mergejoin depends on same
local/remote ordering
• Maybe: un-refreshed materialized views,
triggers, generated columns? (I’m not sure)
67. #PASSDataCommunitySummit
Version management remains a significant challenge for collation:
• With both libc and ICU, changing the version of the library can cause
corruption that isn’t noticed until long afterwards.
• For OS upgrades, it’s safe to use dump-and-load or logical replication.
• Safe to install old ICU version on new OS, but may need to build from source.
• The “ALTER … REFRESH COLLATION” command is dangerous.
• Does not change anything; instructs database to permanently forget the warning.
• On version mismatch, PostgreSQL unfortunately issues WARNING instead
of ERROR, perhaps only to server logs. Responsibility of the admin to know
you shouldn’t change the OS/libs, and that it means possible corruption.
Localization Challenges in PostgreSQL
68. #PASSDataCommunitySummit
Version management remains a significant challenge for collation:
• With both libc and ICU, changing the version of the library can cause
corruption that isn’t noticed until long afterwards.
• For OS upgrades, it’s safe to use dump-and-load or logical replication.
• Safe to install old ICU version on new OS, but may need to build from source.
• The “ALTER … REFRESH COLLATION” command is dangerous.
• Does not change anything; instructs database to permanently forget the warning.
• On version mismatch, PostgreSQL unfortunately issues WARNING instead
of ERROR, perhaps only to server logs. Responsibility of the admin to know
you shouldn’t change the OS/libs, and that it means possible corruption.
Localization Challenges in PostgreSQL
71. #PASSDataCommunitySummit
Collation Torture Test
Data to answer the questions:
Is this really a problem?
How common are sort order changes?
• 10 years of historical versions
• Ubuntu and RHEL
• ICU and glibc
github.com/ardentperf/glibc-unicode-sorting
74. #PASSDataCommunitySummit
• In summary: both glibc and ICU have regular collation
changes. Both have had at least one release with very
large numbers of changes.
• Code published on github to generate the 26 million
strings with PL/pgSQL snippet
https://github.com/ardentperf/glibc-unicode-sorting/blob/main/run-icu.sh#L65
Collation Torture Test
77. #PASSDataCommunitySummit
• 6.1 (1997) – FAQ instructions to set locale as user running PG (for formatting)
• 6.3.1 (1998) – Multibyte support at database level
• 7.3 (2002) – Multibyte & locale by default, remove --enable-multibyte
• 8.4 (2009) – Locale (LC_COLLATE) at database level
• 9.1 (2011) – Support collation per-column and COLLATE clause
• 10 (2017) – Support ICU at column level
• 13 (2020) – Track collation-related versions for some Operating Systems
• 15 (2022) – Support ICU collation at database/cluster level
• 16 (2023) – Support custom ICU collation rules, build ICU by default
• Future – Discussion: multiple ICU versions, initdb use ICU by default
Tatsuo Ishii, Oleg Bartunov, Oleg Broytmann, Josef Balatka, Radek Strnad, Heikki Linnakangas, Tom Lane, Peter Eisentraut, Jeff Davis, and more…
Localization Support in PostgreSQL
78. #PASSDataCommunitySummit
• 6.1 (1997) – FAQ instructions to set locale as user running PG (for formatting)
• 6.3.1 (1998) – Multibyte support at database level
• 7.3 (2002) – Multibyte & locale by default, remove --enable-multibyte
• 8.4 (2009) – Locale (LC_COLLATE) at database level
• 9.1 (2011) – Support collation per-column and COLLATE clause
• 10 (2017) – Support ICU at column level
• 13 (2020) – Track collation-related versions for some Operating Systems
• 15 (2022) – Support ICU collation at database/cluster level
• 16 (2023) – Support custom ICU collation rules, build ICU by default
• Future – Discussion: multiple ICU versions, initdb use ICU by default
Tatsuo Ishii, Oleg Bartunov, Oleg Broytmann, Josef Balatka, Radek Strnad, Heikki Linnakangas, Tom Lane, Peter Eisentraut, Jeff Davis, and more…
Localization Support in PostgreSQL
79. Key Takeaways
• Use collation features for more readable and elegant SQL, when doing fuzzy
comparisons or multi-lingual sorting
• ICU brings powerful new capabilities around linguistic collation
• Assume there are exotic unexpected characters in your data
• Move toward using ICU in PostgreSQL (for both safety and capability)
• When upgrading your operating system, (1) dump or (2) logical or (3) use old ICU
• Not a great idea to under-pay your administrators.
Give them lots of thanks and some extra vacation time. 🏝