SlideShare a Scribd company logo
1 of 83
Download to read offline
String Comparison Surprises:
Did Postgres Lose My Data?
Putting words in order
without losing your mind or your data
Jeremy Schneider
Jeremy
DB & Perf Engineer / AWS
Slack: pgtreats.info/slack-invite
Blog: ardentperf.com
Twitter/X: @jer_s
Organizer / Seattle PG User Group
• Engineering for Aurora and RDS Open Source: Aurora PostgreSQL
first GA release (2017), ICU & collation, data durability &
corruption protection, performance, fleet data collection and
analysis, early logical replication work, new major versions,
addressing complex operational issues, recruiting & training
• Launched an internal Amazon-wide dedicated PostgreSQL chat
channel in 2018 which now has thousands of members and
numerous technical discussions daily
• Founder of "RAC Attack" Community Driven Oracle Cluster
Database Workshop – almost 40 events across 15 countries
between 2011 and 2016
• Participant, speaker and/or organizer at Linux and Oracle
user groups & conferences since the early 2000’s. PostgreSQL user
groups & conferences since 2017.
• Database blog since 2007, Oracle ACE Alumni, Oak Table
Schneider
“My Opinions Are My Own”
Ager, Simon. "Omniglot - writing systems and languages of the world".
Date accessed: 6th
June 2023. www.omniglot.com
#PASSDataCommunitySummit
aws ec2 run-instances
--instance-type t2.micro --key-name mac --tag-specifications
'ResourceType=instance,Tags=[{Key=Name,Value=research-db}]'
--image-id ami-0172070f66a8ebe63 --region us-east-1
sudo apt install postgresql-common
sudo sh /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
sudo apt install postgresql-15
create database research_texts template=template0
locale_provider=icu icu_locale="en-US"
#PASSDataCommunitySummit
create table arabic_dictionary_research (
word text,
crossreferences text,
notes text
) partition by range (word);
create table arabic_dictionary_research_p1 partition of arabic_dictionary_research
for values from ('‫)'ا‬ to ('‫;)'ح‬
create table arabic_dictionary_research_p2 partition of arabic_dictionary_research
for values from ('‫)'ح‬ to ('‫;)'س‬
create table arabic_dictionary_research_p3 partition of arabic_dictionary_research
for values from ('‫)'س‬ to ('‫;)'ل‬
create table arabic_dictionary_research_p4 partition of arabic_dictionary_research
for values from ('‫)'ل‬ to ('‫;)'ے‬
create table arabic_dictionary_research_p5 partition of arabic_dictionary_research
default;
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Photo by Daniel Chekalov on Unsplash
#PASSDataCommunitySummit
#PASSDataCommunitySummit
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Photo by Kanashi on Unsplash
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
#PASSDataCommunitySummit
aws ec2 run-instances
--instance-type t2.micro --key-name mac --tag-specifications
'ResourceType=instance,Tags=[{Key=Name,Value=research-db-hotstandby}]'
--image-id ami-0fd2c44049dd805b8 --region us-east-1
sudo apt install postgresql-common
sudo sh /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
sudo apt install postgresql-15
# cut and paste instructions from
# https://ubuntu.com/server/docs/databases-postgresql
# to easily set up the hot standby database
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Photo by Aditya Chinchure on Unsplash
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
#PASSDataCommunitySummit
• Verify Backup and Log File Retention (long enough for investigation)
• Articulate and Write the Business Impact at Present
• Freeze Ongoing Changes (any dev teams)
• Inventory Copies of Data
• Safely Scan to Determine If There’s More Corruption
• Follow General Best Practices
• Two-person rule, rename/move not delete, verify/compare healthy neighboring data,
test remediations before applying on prod, document everything.
Checklist for Responding to Data Corruption
https://ardentperf.com/2019/11/08/postgresql-invalid-page-and-checksum-verification-failed/
#PASSDataCommunitySummit
Quick Refresher:
Encoding
Photo by Kelly Sikkema on Unsplash
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Image by Simon from Pixabay
#PASSDataCommunitySummit
• 4
• 5
• 6
• 12
• 14
• 20
Length of this Unicode string?
#PASSDataCommunitySummit
$ hexdump -C /etc/hosts
00000000 23 23 0a 23 20 48 6f 73 74 20 44 61 74 61 62 61 |##.# Host Databa|
00000010 73 65 0a 23 0a 23 20 6c 6f 63 61 6c 68 6f 73 74 |se.#.# localhost|
00000020 20 69 73 20 75 73 65 64 20 74 6f 20 63 6f 6e 66 | is used to conf|
00000030 69 67 75 72 65 20 74 68 65 20 6c 6f 6f 70 62 61 |igure the loopba|
00000040 63 6b 20 69 6e 74 65 72 66 61 63 65 0a 23 20 77 |ck interface.# w|
00000050 68 65 6e 20 74 68 65 20 73 79 73 74 65 6d 20 69 |hen the system i|
00000060 73 20 62 6f 6f 74 69 6e 67 2e 20 20 44 6f 20 6e |s booting. Do n|
00000070 6f 74 20 63 68 61 6e 67 65 20 74 68 69 73 20 65 |ot change this e|
00000080 6e 74 72 79 2e 0a 23 23 0a 31 32 37 2e 30 2e 30 |ntry..##.127.0.0|
00000090 2e 31 09 6c 6f 63 61 6c 68 6f 73 74 0a 32 35 35 |.1.localhost.255|
000000a0 2e 32 35 35 2e 32 35 35 2e 32 35 35 09 62 72 6f |.255.255.255.bro|
000000b0 61 64 63 61 73 74 68 6f 73 74 0a 3a 3a 31 20 20 |adcasthost.::1 |
000000c0 20 20 20 20 20 20 20 20 20 20 20 6c 6f 63 61 6c | local|
000000d0 68 6f 73 74 0a |host.|
000000d5
What is encoding?
#PASSDataCommunitySummit
What is encoding?
CREATE EXTENSION pageinspect;
UPDATE mytable SET id=2 WHERE id=1;
SELECT lp,t_ctid,t_xmax,t_data from
heap_page_items(get_raw_page('mytable',0));
lp | t_ctid | t_xmax | t_data
----+--------+--------+--------
1 | (0,2) | 1113 | x0100
2 | (0,2) | 0 | x0200
https://github.com/petergeoghegan/pg_hexedit
#PASSDataCommunitySummit
Assume Unexpected Characters
#PASSDataCommunitySummit
#PASSDataCommunitySummit
#PASSDataCommunitySummit
1978: JIS C 6226 (JIS X 0208) “2-byte” character set was
developed to express hiragana and kanji characters (Japan);
followed by GB/T 2312 (1980, China) and Big5 (1984, Taiwan)
1981: Xerox 8010 Information System released; family included
16-bit encodings and 27 languages (sadly didn’t sell)
1985-1987: Discussions between Xerox & Apple Engineers about
need for “one universal character set”, soon joined by engineers
from many other companies (IBM, Microsoft, Sun, NeXT, Novell,
etc.) as well as Toronto University
1991: Unicode Consortium Incorporated to 'standardize, extend
and promote the Unicode character encoding, a fixed-width, 16-
bit character encoding for over 60,000 graphic characters.’
1991: First Unicode Standard is published
“…begin at 0 and add the next character”
https://home.unicode.org/technical-quick-start-guide/
Chris Weber
#PASSDataCommunitySummit
What about PostgreSQL?
#PASSDataCommunitySummit
PostgreSQL Supports Client & Server
Encodings
#PASSDataCommunitySummit
PostgreSQL Supports Client & Server
Encodings
#PASSDataCommunitySummit
Only ASCII In Global Catalogs
#PASSDataCommunitySummit
Avoid Conversion When Possible
https://www.cybertec-postgresql.com/en/fix-bad-encoding-postgresql/
Laurenz Albe
May 30, 2023
1. use unicode
2. localize applications
3. assume unexpected chars
1. use unicode
2. localize applications
3. assume unexpected chars
Collation
String Comparison
Image by Simon from Pixabay
#PASSDataCommunitySummit
Putting Strings In Order
select *
from (values ('Baptisto')
, ('banqueta')
, ('baño')
, ('como')
, ('chorizo')
) list(word)
order by word;
#PASSDataCommunitySummit
Putting Strings In Order
banqueta
baño
Baptisto
como
chorizo
Baptisto
banqueta
baño
chorizo
como
baño
banqueta
Baptisto
chorizo
como
select * from (values('Baptisto'),('banqueta'),('baño'),('como'),('chorizo’)) list(word)
order by word;
#PASSDataCommunitySummit
• Contractions: two (or more) characters sort as if they
were a single base letter. In Table 4, CH acts like a
single letter sorted after C.
• Expansions: a single character sorts as if it were a
sequence of two (or more) characters. In Table 4, an Œ
ligature sorts as if it were the sequence of O + E.
• Backwards Accent: In row 1 of Table 5, the first accent
difference is on the o, so that is what determines the
order. In some French
dictionary ordering traditions, however,
it is the last accent difference that
determines the order, as shown in row 2.
Linguistic Collation is Complex
https://www.unicode.org/reports/tr10/
https://www.cybertec-postgresql.com/en/case-insensitive-pattern-matching-in-postgresql/
What about PostgreSQL?
#PASSDataCommunitySummit
• 6.1 (1997) – FAQ instructions to set locale as user running PG (for formatting)
• 6.3.1 (1998) – Multibyte support at database level
• 7.3 (2002) – Multibyte & locale by default, remove --enable-multibyte
• 8.4 (2009) – Locale (LC_COLLATE) at database level
• 9.1 (2011) – Support collation per-column and COLLATE clause
• 10 (2017) – Support ICU at column level
• 13 (2020) – Track collation-related versions for some Operating Systems
• 15 (2022) – Support ICU collation at database/cluster level
• 16 (2023) – Support custom ICU collation rules, build ICU by default
• Future – Discussion: multiple ICU versions, initdb use ICU by default
Tatsuo Ishii, Oleg Bartunov, Oleg Broytmann, Josef Balatka, Radek Strnad, Heikki Linnakangas, Tom Lane, Peter Eisentraut, Jeff Davis, and more…
Localization Support in PostgreSQL
#PASSDataCommunitySummit
• Formatting in TO_CHAR, TO_NUMBER, etc
• Automatic “recoding” of data and query results to client encoding
• Encoding and language/translation of messages (error, warning, etc)
• Character classification (letter, number, punctuation, whitespace, etc)
Flexible Collation:
• ORDER BY
• LIKE, regex
• =
• Partitions, Constraints, Generated Columns, Triggers…
• Any expression that involves strings (cf. COLLATE sql clause)
Localization Support in PostgreSQL
#PASSDataCommunitySummit
PostgreSQL does not include its own
collation code. Instead, it depends on
an external library installed and
managed separately.
PostgreSQL Collation Libraries
#PASSDataCommunitySummit
PostgreSQL Collation Libraries
Operating
System C
Library (libc)
International
Components for
Unicode (icu)
Default
Recommended
#PASSDataCommunitySummit
Collation Precedence in PostgreSQL
#PASSDataCommunitySummit
Levels of Defaults:
• OS Environment (for initdb)
• Template0/1 (for database)
• Database
• Table/Column
• Data Type (for constants)
• Explicit in SQL statement
Collation Precedence in PostgreSQL
Conflict Resolution Rules:
1. Explicit > Implicit
2. Non-default > Default
3. Indeterminate collation only
raises error if collation is needed
at runtime
Docs: Part III (Server Admin)
Chapter 24 (Localization)
Part 24.2 (Collation Support)
#PASSDataCommunitySummit
• Case insensitive comparison
• Comparison of base characters, ignoring accents
• Example: count rows where user input was Mexico, México, mexico, or méxico
• Compare digits by numeric value
• Example: id-45 < id-123
• Ignore whitespace, so that similar strings are kept close together
• By default, glibc keeps similar strings close but with ICU whitespace can cause similar
strings to sort far apart from each other.
• Example: “full time” and “full-time” and “fulltime”
• May get extra performance by comparing without normalizing
• Safe for strings that are system-generated and guaranteed to be consistent, or that are pre-
normalized
Advanced Collation Support with ICU
#PASSDataCommunitySummit
Advanced Collation Support with ICU
#PASSDataCommunitySummit
Advanced Collation Support with ICU
#PASSDataCommunitySummit
Advanced Collation Support with ICU
#PASSDataCommunitySummit
Version management remains a significant challenge for collation:
• With both libc and ICU, changing the version of the library can cause
corruption that isn’t noticed until long afterwards.
• For OS upgrades, it’s safe to use dump-and-load or logical replication.
• Safe to install old ICU version on new OS, but may need to build from source.
• The “ALTER … REFRESH COLLATION” command is dangerous.
• Does not change anything; instructs database to permanently forget the warning.
• On version mismatch, PostgreSQL unfortunately issues WARNING instead
of ERROR, perhaps only to server logs. Responsibility of the admin to know
you shouldn’t change the OS/libs, and that it means possible corruption.
Collation Challenges in PostgreSQL
#PASSDataCommunitySummit
With both libc and ICU, changing the
version of the library can cause corruption
that isn’t noticed until long afterwards.
Can trigger a version change:
• OS Upgrade
• Failover and Hot Standby
• Patroni, Kubernetes, etc
• Distributed Systems
Collation Challenges in PostgreSQL
Can be corrupted by version change:
• Indexes
• All types, not just btree
• Constraints
• All types, not just unique/primary-key
• Partitions
• FDWs – eg. mergejoin depends on same
local/remote ordering
• Maybe: un-refreshed materialized views,
triggers, generated columns? (I’m not sure)
#PASSDataCommunitySummit
Collation Challenges in PostgreSQL
👍
#PASSDataCommunitySummit
Collation Challenges in PostgreSQL
😭
Did Postgres Lose My Data?
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
#PASSDataCommunitySummit
#PASSDataCommunitySummit
Localization Challenges in PostgreSQL
👍
#PASSDataCommunitySummit
#PASSDataCommunitySummit
Version management remains a significant challenge for collation:
• With both libc and ICU, changing the version of the library can cause
corruption that isn’t noticed until long afterwards.
• For OS upgrades, it’s safe to use dump-and-load or logical replication.
• Safe to install old ICU version on new OS, but may need to build from source.
• The “ALTER … REFRESH COLLATION” command is dangerous.
• Does not change anything; instructs database to permanently forget the warning.
• On version mismatch, PostgreSQL unfortunately issues WARNING instead
of ERROR, perhaps only to server logs. Responsibility of the admin to know
you shouldn’t change the OS/libs, and that it means possible corruption.
Localization Challenges in PostgreSQL
#PASSDataCommunitySummit
Version management remains a significant challenge for collation:
• With both libc and ICU, changing the version of the library can cause
corruption that isn’t noticed until long afterwards.
• For OS upgrades, it’s safe to use dump-and-load or logical replication.
• Safe to install old ICU version on new OS, but may need to build from source.
• The “ALTER … REFRESH COLLATION” command is dangerous.
• Does not change anything; instructs database to permanently forget the warning.
• On version mismatch, PostgreSQL unfortunately issues WARNING instead
of ERROR, perhaps only to server logs. Responsibility of the admin to know
you shouldn’t change the OS/libs, and that it means possible corruption.
Localization Challenges in PostgreSQL
PostgreSQL Development
and Safety Improvements
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
#PASSDataCommunitySummit
Collation Torture Test
Data to answer the questions:
Is this really a problem?
How common are sort order changes?
• 10 years of historical versions
• Ubuntu and RHEL
• ICU and glibc
github.com/ardentperf/glibc-unicode-sorting
#PASSDataCommunitySummit
286,654
unicode code points
91
string patterns
26 million
strings
#PASSDataCommunitySummit
#PASSDataCommunitySummit
• In summary: both glibc and ICU have regular collation
changes. Both have had at least one release with very
large numbers of changes.
• Code published on github to generate the 26 million
strings with PL/pgSQL snippet
https://github.com/ardentperf/glibc-unicode-sorting/blob/main/run-icu.sh#L65
Collation Torture Test
https://aws.amazon.com/blogs/database/manage-collation-changes-in-postgresql-on-amazon-aurora-and-amazon-rds/
#PASSDataCommunitySummit
• 6.1 (1997) – FAQ instructions to set locale as user running PG (for formatting)
• 6.3.1 (1998) – Multibyte support at database level
• 7.3 (2002) – Multibyte & locale by default, remove --enable-multibyte
• 8.4 (2009) – Locale (LC_COLLATE) at database level
• 9.1 (2011) – Support collation per-column and COLLATE clause
• 10 (2017) – Support ICU at column level
• 13 (2020) – Track collation-related versions for some Operating Systems
• 15 (2022) – Support ICU collation at database/cluster level
• 16 (2023) – Support custom ICU collation rules, build ICU by default
• Future – Discussion: multiple ICU versions, initdb use ICU by default
Tatsuo Ishii, Oleg Bartunov, Oleg Broytmann, Josef Balatka, Radek Strnad, Heikki Linnakangas, Tom Lane, Peter Eisentraut, Jeff Davis, and more…
Localization Support in PostgreSQL
#PASSDataCommunitySummit
• 6.1 (1997) – FAQ instructions to set locale as user running PG (for formatting)
• 6.3.1 (1998) – Multibyte support at database level
• 7.3 (2002) – Multibyte & locale by default, remove --enable-multibyte
• 8.4 (2009) – Locale (LC_COLLATE) at database level
• 9.1 (2011) – Support collation per-column and COLLATE clause
• 10 (2017) – Support ICU at column level
• 13 (2020) – Track collation-related versions for some Operating Systems
• 15 (2022) – Support ICU collation at database/cluster level
• 16 (2023) – Support custom ICU collation rules, build ICU by default
• Future – Discussion: multiple ICU versions, initdb use ICU by default
Tatsuo Ishii, Oleg Bartunov, Oleg Broytmann, Josef Balatka, Radek Strnad, Heikki Linnakangas, Tom Lane, Peter Eisentraut, Jeff Davis, and more…
Localization Support in PostgreSQL
Key Takeaways
• Use collation features for more readable and elegant SQL, when doing fuzzy
comparisons or multi-lingual sorting
• ICU brings powerful new capabilities around linguistic collation
• Assume there are exotic unexpected characters in your data
• Move toward using ICU in PostgreSQL (for both safety and capability)
• When upgrading your operating system, (1) dump or (2) logical or (3) use old ICU
• Not a great idea to under-pay your administrators.
Give them lots of thanks and some extra vacation time. 🏝
Session evaluation
Your feedback is important to us
Evaluate this session at:
www.PASSDataCommunitySummit.com/evaluation
Thank you
aws.amazon.com/rds
Jeremy Schneider
Slack: pgtreats.info/slack-invite
Blog: ardentperf.com
Twitter/X: @jer_s
String Comparison Surprises: Did Postgres lose my data?

More Related Content

Similar to String Comparison Surprises: Did Postgres lose my data?

扩展世界上最大的图片Blog社区
扩展世界上最大的图片Blog社区扩展世界上最大的图片Blog社区
扩展世界上最大的图片Blog社区yiditushe
 
Fotolog: Scaling the World's Largest Photo Blogging Community
Fotolog: Scaling the World's Largest Photo Blogging CommunityFotolog: Scaling the World's Largest Photo Blogging Community
Fotolog: Scaling the World's Largest Photo Blogging Communityfarhan "Frank"​ mashraqi
 
Beyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisBeyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisC4Media
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packagesAjay Ohri
 
2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumertirlukachaitanya
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
HandlerSocket plugin for MySQL (English)
HandlerSocket plugin for MySQL (English)HandlerSocket plugin for MySQL (English)
HandlerSocket plugin for MySQL (English)akirahiguchi
 
Improving the performance of Odoo deployments
Improving the performance of Odoo deploymentsImproving the performance of Odoo deployments
Improving the performance of Odoo deploymentsOdoo
 
Blades for HPTC
Blades for HPTCBlades for HPTC
Blades for HPTCGuy Coates
 
Being HAPI! Reverse Proxying on Purpose
Being HAPI! Reverse Proxying on PurposeBeing HAPI! Reverse Proxying on Purpose
Being HAPI! Reverse Proxying on PurposeAman Kohli
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfcookie1969
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
JCConf 2022 - New Features in Java 18 & 19
JCConf 2022 - New Features in Java 18 & 19JCConf 2022 - New Features in Java 18 & 19
JCConf 2022 - New Features in Java 18 & 19Joseph Kuo
 
Regex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language InsteadRegex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language InsteadAll Things Open
 
Tutorial On Database Management System
Tutorial On Database Management SystemTutorial On Database Management System
Tutorial On Database Management Systempsathishcs
 
Questions On The Code And Core Module
Questions On The Code And Core ModuleQuestions On The Code And Core Module
Questions On The Code And Core ModuleKatie Gulley
 

Similar to String Comparison Surprises: Did Postgres lose my data? (20)

MySQL8.0 in COSCUP2017
MySQL8.0 in COSCUP2017MySQL8.0 in COSCUP2017
MySQL8.0 in COSCUP2017
 
扩展世界上最大的图片Blog社区
扩展世界上最大的图片Blog社区扩展世界上最大的图片Blog社区
扩展世界上最大的图片Blog社区
 
Fotolog: Scaling the World's Largest Photo Blogging Community
Fotolog: Scaling the World's Largest Photo Blogging CommunityFotolog: Scaling the World's Largest Photo Blogging Community
Fotolog: Scaling the World's Largest Photo Blogging Community
 
Beyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisBeyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic Analysis
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packages
 
2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
Survey of Percona Toolkit
Survey of Percona ToolkitSurvey of Percona Toolkit
Survey of Percona Toolkit
 
HandlerSocket plugin for MySQL (English)
HandlerSocket plugin for MySQL (English)HandlerSocket plugin for MySQL (English)
HandlerSocket plugin for MySQL (English)
 
Improving the performance of Odoo deployments
Improving the performance of Odoo deploymentsImproving the performance of Odoo deployments
Improving the performance of Odoo deployments
 
Blades for HPTC
Blades for HPTCBlades for HPTC
Blades for HPTC
 
MLflow with R
MLflow with RMLflow with R
MLflow with R
 
Being HAPI! Reverse Proxying on Purpose
Being HAPI! Reverse Proxying on PurposeBeing HAPI! Reverse Proxying on Purpose
Being HAPI! Reverse Proxying on Purpose
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
JCConf 2022 - New Features in Java 18 & 19
JCConf 2022 - New Features in Java 18 & 19JCConf 2022 - New Features in Java 18 & 19
JCConf 2022 - New Features in Java 18 & 19
 
Regex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language InsteadRegex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language Instead
 
Tutorial On Database Management System
Tutorial On Database Management SystemTutorial On Database Management System
Tutorial On Database Management System
 
Questions On The Code And Core Module
Questions On The Code And Core ModuleQuestions On The Code And Core Module
Questions On The Code And Core Module
 

Recently uploaded

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

String Comparison Surprises: Did Postgres lose my data?

  • 1. String Comparison Surprises: Did Postgres Lose My Data? Putting words in order without losing your mind or your data Jeremy Schneider
  • 2. Jeremy DB & Perf Engineer / AWS Slack: pgtreats.info/slack-invite Blog: ardentperf.com Twitter/X: @jer_s Organizer / Seattle PG User Group • Engineering for Aurora and RDS Open Source: Aurora PostgreSQL first GA release (2017), ICU & collation, data durability & corruption protection, performance, fleet data collection and analysis, early logical replication work, new major versions, addressing complex operational issues, recruiting & training • Launched an internal Amazon-wide dedicated PostgreSQL chat channel in 2018 which now has thousands of members and numerous technical discussions daily • Founder of "RAC Attack" Community Driven Oracle Cluster Database Workshop – almost 40 events across 15 countries between 2011 and 2016 • Participant, speaker and/or organizer at Linux and Oracle user groups & conferences since the early 2000’s. PostgreSQL user groups & conferences since 2017. • Database blog since 2007, Oracle ACE Alumni, Oak Table Schneider
  • 3. “My Opinions Are My Own”
  • 4.
  • 5. Ager, Simon. "Omniglot - writing systems and languages of the world". Date accessed: 6th June 2023. www.omniglot.com
  • 6. #PASSDataCommunitySummit aws ec2 run-instances --instance-type t2.micro --key-name mac --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=research-db}]' --image-id ami-0172070f66a8ebe63 --region us-east-1 sudo apt install postgresql-common sudo sh /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh sudo apt install postgresql-15 create database research_texts template=template0 locale_provider=icu icu_locale="en-US"
  • 7. #PASSDataCommunitySummit create table arabic_dictionary_research ( word text, crossreferences text, notes text ) partition by range (word); create table arabic_dictionary_research_p1 partition of arabic_dictionary_research for values from ('‫)'ا‬ to ('‫;)'ح‬ create table arabic_dictionary_research_p2 partition of arabic_dictionary_research for values from ('‫)'ح‬ to ('‫;)'س‬ create table arabic_dictionary_research_p3 partition of arabic_dictionary_research for values from ('‫)'س‬ to ('‫;)'ل‬ create table arabic_dictionary_research_p4 partition of arabic_dictionary_research for values from ('‫)'ل‬ to ('‫;)'ے‬ create table arabic_dictionary_research_p5 partition of arabic_dictionary_research default;
  • 8. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Photo by Daniel Chekalov on Unsplash
  • 11. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 12. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Photo by Kanashi on Unsplash
  • 13. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 14. #PASSDataCommunitySummit aws ec2 run-instances --instance-type t2.micro --key-name mac --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=research-db-hotstandby}]' --image-id ami-0fd2c44049dd805b8 --region us-east-1 sudo apt install postgresql-common sudo sh /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh sudo apt install postgresql-15 # cut and paste instructions from # https://ubuntu.com/server/docs/databases-postgresql # to easily set up the hot standby database
  • 15. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Photo by Aditya Chinchure on Unsplash
  • 16. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 17. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 18. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 19. #PASSDataCommunitySummit • Verify Backup and Log File Retention (long enough for investigation) • Articulate and Write the Business Impact at Present • Freeze Ongoing Changes (any dev teams) • Inventory Copies of Data • Safely Scan to Determine If There’s More Corruption • Follow General Best Practices • Two-person rule, rename/move not delete, verify/compare healthy neighboring data, test remediations before applying on prod, document everything. Checklist for Responding to Data Corruption https://ardentperf.com/2019/11/08/postgresql-invalid-page-and-checksum-verification-failed/
  • 21. Quick Refresher: Encoding Photo by Kelly Sikkema on Unsplash
  • 22. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Image by Simon from Pixabay
  • 23. #PASSDataCommunitySummit • 4 • 5 • 6 • 12 • 14 • 20 Length of this Unicode string?
  • 24. #PASSDataCommunitySummit $ hexdump -C /etc/hosts 00000000 23 23 0a 23 20 48 6f 73 74 20 44 61 74 61 62 61 |##.# Host Databa| 00000010 73 65 0a 23 0a 23 20 6c 6f 63 61 6c 68 6f 73 74 |se.#.# localhost| 00000020 20 69 73 20 75 73 65 64 20 74 6f 20 63 6f 6e 66 | is used to conf| 00000030 69 67 75 72 65 20 74 68 65 20 6c 6f 6f 70 62 61 |igure the loopba| 00000040 63 6b 20 69 6e 74 65 72 66 61 63 65 0a 23 20 77 |ck interface.# w| 00000050 68 65 6e 20 74 68 65 20 73 79 73 74 65 6d 20 69 |hen the system i| 00000060 73 20 62 6f 6f 74 69 6e 67 2e 20 20 44 6f 20 6e |s booting. Do n| 00000070 6f 74 20 63 68 61 6e 67 65 20 74 68 69 73 20 65 |ot change this e| 00000080 6e 74 72 79 2e 0a 23 23 0a 31 32 37 2e 30 2e 30 |ntry..##.127.0.0| 00000090 2e 31 09 6c 6f 63 61 6c 68 6f 73 74 0a 32 35 35 |.1.localhost.255| 000000a0 2e 32 35 35 2e 32 35 35 2e 32 35 35 09 62 72 6f |.255.255.255.bro| 000000b0 61 64 63 61 73 74 68 6f 73 74 0a 3a 3a 31 20 20 |adcasthost.::1 | 000000c0 20 20 20 20 20 20 20 20 20 20 20 6c 6f 63 61 6c | local| 000000d0 68 6f 73 74 0a |host.| 000000d5 What is encoding?
  • 25. #PASSDataCommunitySummit What is encoding? CREATE EXTENSION pageinspect; UPDATE mytable SET id=2 WHERE id=1; SELECT lp,t_ctid,t_xmax,t_data from heap_page_items(get_raw_page('mytable',0)); lp | t_ctid | t_xmax | t_data ----+--------+--------+-------- 1 | (0,2) | 1113 | x0100 2 | (0,2) | 0 | x0200 https://github.com/petergeoghegan/pg_hexedit
  • 29.
  • 30. #PASSDataCommunitySummit 1978: JIS C 6226 (JIS X 0208) “2-byte” character set was developed to express hiragana and kanji characters (Japan); followed by GB/T 2312 (1980, China) and Big5 (1984, Taiwan) 1981: Xerox 8010 Information System released; family included 16-bit encodings and 27 languages (sadly didn’t sell) 1985-1987: Discussions between Xerox & Apple Engineers about need for “one universal character set”, soon joined by engineers from many other companies (IBM, Microsoft, Sun, NeXT, Novell, etc.) as well as Toronto University 1991: Unicode Consortium Incorporated to 'standardize, extend and promote the Unicode character encoding, a fixed-width, 16- bit character encoding for over 60,000 graphic characters.’ 1991: First Unicode Standard is published “…begin at 0 and add the next character”
  • 37. #PASSDataCommunitySummit Avoid Conversion When Possible https://www.cybertec-postgresql.com/en/fix-bad-encoding-postgresql/ Laurenz Albe May 30, 2023
  • 38. 1. use unicode 2. localize applications 3. assume unexpected chars
  • 39. 1. use unicode 2. localize applications 3. assume unexpected chars
  • 41. Image by Simon from Pixabay
  • 42. #PASSDataCommunitySummit Putting Strings In Order select * from (values ('Baptisto') , ('banqueta') , ('baño') , ('como') , ('chorizo') ) list(word) order by word;
  • 43. #PASSDataCommunitySummit Putting Strings In Order banqueta baño Baptisto como chorizo Baptisto banqueta baño chorizo como baño banqueta Baptisto chorizo como select * from (values('Baptisto'),('banqueta'),('baño'),('como'),('chorizo’)) list(word) order by word;
  • 44. #PASSDataCommunitySummit • Contractions: two (or more) characters sort as if they were a single base letter. In Table 4, CH acts like a single letter sorted after C. • Expansions: a single character sorts as if it were a sequence of two (or more) characters. In Table 4, an Œ ligature sorts as if it were the sequence of O + E. • Backwards Accent: In row 1 of Table 5, the first accent difference is on the o, so that is what determines the order. In some French dictionary ordering traditions, however, it is the last accent difference that determines the order, as shown in row 2. Linguistic Collation is Complex https://www.unicode.org/reports/tr10/ https://www.cybertec-postgresql.com/en/case-insensitive-pattern-matching-in-postgresql/
  • 45.
  • 47. #PASSDataCommunitySummit • 6.1 (1997) – FAQ instructions to set locale as user running PG (for formatting) • 6.3.1 (1998) – Multibyte support at database level • 7.3 (2002) – Multibyte & locale by default, remove --enable-multibyte • 8.4 (2009) – Locale (LC_COLLATE) at database level • 9.1 (2011) – Support collation per-column and COLLATE clause • 10 (2017) – Support ICU at column level • 13 (2020) – Track collation-related versions for some Operating Systems • 15 (2022) – Support ICU collation at database/cluster level • 16 (2023) – Support custom ICU collation rules, build ICU by default • Future – Discussion: multiple ICU versions, initdb use ICU by default Tatsuo Ishii, Oleg Bartunov, Oleg Broytmann, Josef Balatka, Radek Strnad, Heikki Linnakangas, Tom Lane, Peter Eisentraut, Jeff Davis, and more… Localization Support in PostgreSQL
  • 48. #PASSDataCommunitySummit • Formatting in TO_CHAR, TO_NUMBER, etc • Automatic “recoding” of data and query results to client encoding • Encoding and language/translation of messages (error, warning, etc) • Character classification (letter, number, punctuation, whitespace, etc) Flexible Collation: • ORDER BY • LIKE, regex • = • Partitions, Constraints, Generated Columns, Triggers… • Any expression that involves strings (cf. COLLATE sql clause) Localization Support in PostgreSQL
  • 50. PostgreSQL does not include its own collation code. Instead, it depends on an external library installed and managed separately. PostgreSQL Collation Libraries
  • 51. #PASSDataCommunitySummit PostgreSQL Collation Libraries Operating System C Library (libc) International Components for Unicode (icu) Default Recommended
  • 53. #PASSDataCommunitySummit Levels of Defaults: • OS Environment (for initdb) • Template0/1 (for database) • Database • Table/Column • Data Type (for constants) • Explicit in SQL statement Collation Precedence in PostgreSQL Conflict Resolution Rules: 1. Explicit > Implicit 2. Non-default > Default 3. Indeterminate collation only raises error if collation is needed at runtime Docs: Part III (Server Admin) Chapter 24 (Localization) Part 24.2 (Collation Support)
  • 54. #PASSDataCommunitySummit • Case insensitive comparison • Comparison of base characters, ignoring accents • Example: count rows where user input was Mexico, México, mexico, or méxico • Compare digits by numeric value • Example: id-45 < id-123 • Ignore whitespace, so that similar strings are kept close together • By default, glibc keeps similar strings close but with ICU whitespace can cause similar strings to sort far apart from each other. • Example: “full time” and “full-time” and “fulltime” • May get extra performance by comparing without normalizing • Safe for strings that are system-generated and guaranteed to be consistent, or that are pre- normalized Advanced Collation Support with ICU
  • 58. #PASSDataCommunitySummit Version management remains a significant challenge for collation: • With both libc and ICU, changing the version of the library can cause corruption that isn’t noticed until long afterwards. • For OS upgrades, it’s safe to use dump-and-load or logical replication. • Safe to install old ICU version on new OS, but may need to build from source. • The “ALTER … REFRESH COLLATION” command is dangerous. • Does not change anything; instructs database to permanently forget the warning. • On version mismatch, PostgreSQL unfortunately issues WARNING instead of ERROR, perhaps only to server logs. Responsibility of the admin to know you shouldn’t change the OS/libs, and that it means possible corruption. Collation Challenges in PostgreSQL
  • 59. #PASSDataCommunitySummit With both libc and ICU, changing the version of the library can cause corruption that isn’t noticed until long afterwards. Can trigger a version change: • OS Upgrade • Failover and Hot Standby • Patroni, Kubernetes, etc • Distributed Systems Collation Challenges in PostgreSQL Can be corrupted by version change: • Indexes • All types, not just btree • Constraints • All types, not just unique/primary-key • Partitions • FDWs – eg. mergejoin depends on same local/remote ordering • Maybe: un-refreshed materialized views, triggers, generated columns? (I’m not sure)
  • 62. Did Postgres Lose My Data?
  • 63. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 67. #PASSDataCommunitySummit Version management remains a significant challenge for collation: • With both libc and ICU, changing the version of the library can cause corruption that isn’t noticed until long afterwards. • For OS upgrades, it’s safe to use dump-and-load or logical replication. • Safe to install old ICU version on new OS, but may need to build from source. • The “ALTER … REFRESH COLLATION” command is dangerous. • Does not change anything; instructs database to permanently forget the warning. • On version mismatch, PostgreSQL unfortunately issues WARNING instead of ERROR, perhaps only to server logs. Responsibility of the admin to know you shouldn’t change the OS/libs, and that it means possible corruption. Localization Challenges in PostgreSQL
  • 68. #PASSDataCommunitySummit Version management remains a significant challenge for collation: • With both libc and ICU, changing the version of the library can cause corruption that isn’t noticed until long afterwards. • For OS upgrades, it’s safe to use dump-and-load or logical replication. • Safe to install old ICU version on new OS, but may need to build from source. • The “ALTER … REFRESH COLLATION” command is dangerous. • Does not change anything; instructs database to permanently forget the warning. • On version mismatch, PostgreSQL unfortunately issues WARNING instead of ERROR, perhaps only to server logs. Responsibility of the admin to know you shouldn’t change the OS/libs, and that it means possible corruption. Localization Challenges in PostgreSQL
  • 70. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 71. #PASSDataCommunitySummit Collation Torture Test Data to answer the questions: Is this really a problem? How common are sort order changes? • 10 years of historical versions • Ubuntu and RHEL • ICU and glibc github.com/ardentperf/glibc-unicode-sorting
  • 74. #PASSDataCommunitySummit • In summary: both glibc and ICU have regular collation changes. Both have had at least one release with very large numbers of changes. • Code published on github to generate the 26 million strings with PL/pgSQL snippet https://github.com/ardentperf/glibc-unicode-sorting/blob/main/run-icu.sh#L65 Collation Torture Test
  • 76.
  • 77. #PASSDataCommunitySummit • 6.1 (1997) – FAQ instructions to set locale as user running PG (for formatting) • 6.3.1 (1998) – Multibyte support at database level • 7.3 (2002) – Multibyte & locale by default, remove --enable-multibyte • 8.4 (2009) – Locale (LC_COLLATE) at database level • 9.1 (2011) – Support collation per-column and COLLATE clause • 10 (2017) – Support ICU at column level • 13 (2020) – Track collation-related versions for some Operating Systems • 15 (2022) – Support ICU collation at database/cluster level • 16 (2023) – Support custom ICU collation rules, build ICU by default • Future – Discussion: multiple ICU versions, initdb use ICU by default Tatsuo Ishii, Oleg Bartunov, Oleg Broytmann, Josef Balatka, Radek Strnad, Heikki Linnakangas, Tom Lane, Peter Eisentraut, Jeff Davis, and more… Localization Support in PostgreSQL
  • 78. #PASSDataCommunitySummit • 6.1 (1997) – FAQ instructions to set locale as user running PG (for formatting) • 6.3.1 (1998) – Multibyte support at database level • 7.3 (2002) – Multibyte & locale by default, remove --enable-multibyte • 8.4 (2009) – Locale (LC_COLLATE) at database level • 9.1 (2011) – Support collation per-column and COLLATE clause • 10 (2017) – Support ICU at column level • 13 (2020) – Track collation-related versions for some Operating Systems • 15 (2022) – Support ICU collation at database/cluster level • 16 (2023) – Support custom ICU collation rules, build ICU by default • Future – Discussion: multiple ICU versions, initdb use ICU by default Tatsuo Ishii, Oleg Bartunov, Oleg Broytmann, Josef Balatka, Radek Strnad, Heikki Linnakangas, Tom Lane, Peter Eisentraut, Jeff Davis, and more… Localization Support in PostgreSQL
  • 79. Key Takeaways • Use collation features for more readable and elegant SQL, when doing fuzzy comparisons or multi-lingual sorting • ICU brings powerful new capabilities around linguistic collation • Assume there are exotic unexpected characters in your data • Move toward using ICU in PostgreSQL (for both safety and capability) • When upgrading your operating system, (1) dump or (2) logical or (3) use old ICU • Not a great idea to under-pay your administrators. Give them lots of thanks and some extra vacation time. 🏝
  • 80.
  • 81. Session evaluation Your feedback is important to us Evaluate this session at: www.PASSDataCommunitySummit.com/evaluation
  • 82. Thank you aws.amazon.com/rds Jeremy Schneider Slack: pgtreats.info/slack-invite Blog: ardentperf.com Twitter/X: @jer_s