Addressing vendor weaknesses in user space (Robert Treat)
1. Addressing Vendor Weaknesses in User-Space
ROBERT TREAT,
OmniTI
Highload++ 2011
@robtreat2
xzilla.net
+Robert Treat
Monday, October 3, 11
5. Who Am I?
OmniTI - Internet Scalability Consultants
Lead Database Operations
“Large Scale”
High Transactions
TB+ Data
Mission Critical
6. Who Am I?
Database Operations @OmniTI
Postgres
MySQL
Oracle
& More
7. Postgres for Scalability
Traditional RDBMS
Highly Extensible
Runs Everywhere
Talks To Everything
“BSD” Licensed
15+ Years Development
Open Development Community
11. The Bloat Problem
Data Footprint Can Be Critical To Performance
Size On Disk Affects The Needs Of
RAM, Disk Speed, Storage
“Bloat” is unused, wasted disk space,
taken up by the database,
but not needed for actual data storage
Why?
12. MVCC Architecture
Multiversion Concurrency Control (MVCC) allows
Postgres to offer high concurrency even during
significant database read/write activity. MVCC
specifically offers behavior where "readers never block
writers, and writers never block readers".
13. MVCC Architecture
• Oracle
• MySQL (InnoDB)
• Informix
• Firebird
• MSSQL (optional)
18. Postgres MVCC Architecture
• Postgres maintains global transaction counters
• Keeps track of transaction counter per row for
• creating transaction
• removing transaction
• Using these counters, Postgres allows different
transactions to see different rows, based on visibility rules.
Transaction Reading An Old Row
Doesn’t Block Transaction Writing A Row
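The visibility idea can be sketched with a toy model. This is a deliberate simplification, assuming a "snapshot" is just a single transaction counter value; real Postgres visibility rules also consider commit status and in-progress transactions. The names `RowVersion`, `visible` and `read` are hypothetical, and the counter values 43 and 56 are only illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    value: str
    created_by: int                   # counter of the creating transaction
    expired_by: Optional[int] = None  # counter of the removing transaction, if any


def visible(row: RowVersion, snapshot: int) -> bool:
    """Simplified rule: visible if created at or before the snapshot
    and not yet expired as of the snapshot."""
    created = row.created_by <= snapshot
    expired = row.expired_by is not None and row.expired_by <= snapshot
    return created and not expired


# Transaction 56 updated the row: the old version is expired, a new one is created.
old = RowVersion("X69 (old)", created_by=43, expired_by=56)
new = RowVersion("X69 (new)", created_by=56)


def read(snapshot: int) -> list:
    """Return the row versions a reader with this snapshot would see."""
    return [r.value for r in (old, new) if visible(r, snapshot)]


early_reader = read(50)   # started before the update
late_reader = read(60)    # started after the update
```

Both versions sit on disk at the same time, so the early reader keeps seeing the old row without blocking the writer.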
23. MVCC Architecture
user_id X69 | Create 43 | Expire 56 <~~ DEAD ROW (Clean Up / Bloat)
user_id X69 | Create 43 | Expire <~~ VISIBLE ROW
Speed Up SQL Commands By
Dealing With Clean Up Later
25. How Postgres Deals With Bloat
• Heap-Only-Tuples (HOT)
• On-The-Fly, Per Page Cleanup
• Marks Given Row’s Space Reusable
• Update Only
• VACUUM
• Non-Blocking Bulk Cleanup
• Removes End-Of-File Pages
• “autovacuum” Process Monitors Tables
27. Problems With Automatic Cleanup
• HOT
• Update Only
• Doesn’t Work When Changing Index Data
• VACUUM
• Must Wait For Long Transactions To Complete
• Costs I/O, Can Only Work So Fast
• Can’t Remove Non End-Of-File Pages
• Leaves A “High Water Mark”
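The "high water mark" limitation can be illustrated with a toy model. It assumes a table file is a list of fixed-size pages and that `None` marks a slot already freed for reuse; `vacuum_truncate` is a hypothetical helper, not a Postgres function.

```python
def vacuum_truncate(pages):
    """Drop empty pages from the end of the file only.
    Empty pages in the middle stay in place."""
    pages = [list(p) for p in pages]
    while pages and all(slot is None for slot in pages[-1]):
        pages.pop()
    return pages


table = [
    ["a", None],   # page 0: half empty
    [None, None],  # page 1: entirely free, but not at the end of the file
    ["b", None],   # page 2: one live row pins the high water mark
    [None, None],  # page 3: empty and at the end, so it can be removed
]

after = vacuum_truncate(table)
pages_before = len(table)
pages_after = len(after)
```

Only the trailing page is given back; page 1 stays allocated even though it holds no data.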
29. Dealing With Bloat - The Hard Way
• VACUUM FULL / CLUSTER
• The Good
• Reclaims All “Dead Rows”
• The Bad
• Exclusive Lock
• Rewrite All Data In Tables
• Needs Working Space
• Heavy I/O
30. Monitoring Your Bloat
• check_postgres.pl
• Nagios plugin
• Compares physical size to row size estimates
• http://bucardo.org/wiki/Check_postgres
• “bloat report”
• Script to measure table/index bloat
• Compares physical size to row size estimates
• http://labs.omniti.com/labs/pgtreats/browser/trunk/tools/
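"Compares physical size to row size estimates" boils down to simple arithmetic. The sketch below is not the query used by check_postgres or the bloat report; it is a rough illustration that ignores page and row headers, and the function name and sample numbers are made up. The only hard fact used is the default Postgres block size of 8192 bytes.

```python
PAGE_BYTES = 8192  # default Postgres block size


def estimate_bloat(actual_pages, row_count, avg_row_bytes, page_bytes=PAGE_BYTES):
    """Return (expected_pages, wasted_bytes) for a table.
    expected_pages is what the live rows would need if packed tightly."""
    data_bytes = row_count * avg_row_bytes
    expected_pages = -(-data_bytes // page_bytes)      # ceiling division
    wasted_pages = max(actual_pages - expected_pages, 0)
    return expected_pages, wasted_pages * page_bytes


# Hypothetical table: 1M rows of ~100 bytes stored in 30,000 pages.
expected, wasted = estimate_bloat(
    actual_pages=30_000, row_count=1_000_000, avg_row_bytes=100
)
```

Because the row size is itself an estimate, the result is best read as a signal for which tables to look at, not as an exact byte count.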
32. Dealing With Bloat In Userspace
• Solving MVCC Bloat Is A “Hard Problem”
• Even a good solution would be hard to
implement in core
• Can we build a tool in user space?
• Develop solution quicker
• Easier to deploy and maintain
• Provide a prototype for future development
37. Dealing With Bloat Redux
• Updating A Row Rewrites Data To New Location
• Use Vacuum To Mark Old Rows “Reusable”
• Update Row To Rewrite Data At “Front” Of Page
• Use Vacuum To Reclaim Space From End Of File
• Put A Script On It
• https://labs.omniti.com/pgtreats/trunk/tools/compact_table
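The update-then-vacuum loop above can be sketched as a toy simulation. This is not the compact_table script; it assumes a table is a list of pages, `None` is a slot vacuum has already marked reusable, and an update always lands in the first free slot nearer the front, which real Postgres does not guarantee. `compact` is a hypothetical name.

```python
def compact(pages):
    """Move rows from the end of the file into free slots near the front,
    then truncate the now-empty trailing pages."""
    pages = [list(p) for p in pages]
    for src in range(len(pages) - 1, 0, -1):        # last page first
        for slot, row in enumerate(pages[src]):
            if row is None:
                continue
            # "Update" the row: its new version goes to the first free earlier slot.
            target = next(
                ((p, s) for p in range(src)
                 for s, v in enumerate(pages[p]) if v is None),
                None,
            )
            if target is None:
                continue                            # nowhere earlier to move it
            p, s = target
            pages[p][s] = row
            pages[src][slot] = None                 # old version is dead, then reusable
    # Final vacuum: give back empty pages at the end of the file.
    while pages and all(v is None for v in pages[-1]):
        pages.pop()
    return pages


table = [["a", None], [None, None], ["b", None], [None, "c"]]
compacted = compact(table)
```

Every moved row is a real update in the database, which is why this approach costs time and I/O and also touches the indexes.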
38. Dealing With Bloat Redux
• “Compact Table”
• Requires Lots of Time, I/O
• Often Causes Heavy Index Bloat
• Heavy Concurrency Bloats Faster Than
We Can Recover It
44. Dealing With Bloat For Real!
• Enter “pg_reorg”
• Vacuum / Cluster Replacement
• Command Line Tool
• Online Table Rewrite
• Uses Minimal Locking
• Developed By NTT
• BSD Licensed
• C Code
• http://pgfoundry.org/projects/reorg/
45. How pg_reorg Works
• Create a log table for changes
• Create triggers on the old table to log changes (I/U/D)
• Create a new table with a copy of all data in old table
• Create all indexes on the new table
• Apply all changes from the log table to the new table
• Modify the system catalogs information about table files
• Drop old table, leaving new table in its place
46. How pg_reorg Works
• Create a log table for changes
• Create triggers on the old table to log changes
• Create a new table with a copy of all data in old table
• Create all indexes on the new table
• Apply all changes from the log table to the new table
• MODIFY THE SYSTEM CATALOGS
INFORMATION ABOUT THE TABLE FILES (!!!)
• Drop old table, leaving the new table in its place
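The steps listed above can be walked through in a small in-memory model. This is a sketch of the sequence only, not pg_reorg's C code: `Table`, `reorg` and the `catalog` dict are hypothetical stand-ins, index builds are skipped, and the brief locks the real tool takes are ignored.

```python
class Table:
    """Toy table: primary key -> row value, plus row-level triggers."""

    def __init__(self, rows=None):
        self.rows = dict(rows or {})
        self.triggers = []

    def write(self, op, key, value=None):
        if op == "D":
            self.rows.pop(key, None)
        else:                                   # "I" or "U"
            self.rows[key] = value
        for trigger in self.triggers:
            trigger(op, key, value)


def reorg(catalog, name, concurrent_writes):
    old = catalog[name]
    log = []                                                      # log table
    old.triggers.append(lambda op, k, v: log.append((op, k, v)))  # log I/U/D
    new = Table(old.rows)                                         # copy all data
    concurrent_writes(old)            # application traffic arriving during the copy
    # (index creation on the new table would happen here)
    for op, k, v in log:                                          # apply logged changes
        new.write(op, k, v)
    catalog[name] = new               # the "modify the system catalogs" step
    old.triggers.clear()              # old table is dropped
    return new


catalog = {"orders": Table({1: "a", 2: "b"})}


def traffic(table):
    table.write("U", 1, "a2")
    table.write("I", 3, "c")
    table.write("D", 2)


reorg(catalog, "orders", traffic)
final_rows = catalog["orders"].rows
```

The swap is the only step with no ordinary SQL equivalent, which is why it gets the capital letters.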
47. Dealing With Bloat For Real!
Open Source Code
The Power Is In Your Hands
Look At Code
Examine the SQL
(User Space Is Really Visible)
TEST!
49. Dealing With Bloat For Real!
What Does Testing Look Like?
Create Some Tables,
Create Artificial Bloat,
run pg_reorg
WIN!
52. Dealing With Bloat For Real!
Test In “Prod”
Find Some Bloated Tables,
Make Backup Of Tables,
Cross Fingers,
pg_reorg
WIN!
53. Dealing With Bloat For Real!
Eventually You Have To Use It
On Something That Matters
55. pg_reorg In The Real World
• Production Database (OLTP)
• 540GB Size
• 2000 TPS (off-peak time, multiple statements)
• Largest Table (pre-reorg) 127GB
• Rebuild Stats
• 5.75 Hours To Rebuild
• Reclaimed 52GB Disk Space
• No outages reported for Website/APIs
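For a sense of scale, the rebuild numbers can be turned into rates. This assumes the 5.75 hours and the 52GB both refer to the 127GB table, which the slide does not state outright.

```python
table_gb = 127          # largest table before the rebuild
rebuild_hours = 5.75
reclaimed_gb = 52

rate_gb_per_hour = table_gb / rebuild_hours        # roughly 22 GB/hour
reclaimed_pct = 100 * reclaimed_gb / table_gb      # roughly 41% of the table was bloat
size_after_gb = table_gb - reclaimed_gb            # 75 GB left after the rebuild
```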
64. “your faith in your
friends is yours.”
-Emperor Palpatine
65. Sometimes You Can Have Both
Trust in NTT’s Code == faith in friends
Success in production == overconfidence
67. When Good pg_reorgs Go Bad!
WARNING: unexpected attrdef record found
for attr 61 of rel orders
WARNING: 1 attrdef record(s) missing for rel
orders
Yes, On A Production System
Yes, Trying To Take 1000s of Orders Per Second
71. When Good pg_reorgs Go Bad!
create table test (
a int4,
b int4 default 2112,
c bool
);
Postgres internals track defaults / constraints
based on column position “2”, not column name “b”
If you drop column “a” and then do pg_reorg, column
“c” is now column “2”, and default 2112 is on boolean
This Is Fair - pg_reorg hacks the system tables
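The position mix-up can be reproduced in a toy model of the catalogs. The dictionaries below are hypothetical stand-ins for the system tables, assuming defaults are stored by column position and that a rebuilt table gets its columns renumbered from 1; the last step, remapping by column name, is shown only as a contrast and is not a claim about how the bug was actually fixed.

```python
def drop_column(columns, name):
    """Dropping a column leaves a hole: remaining columns keep their positions."""
    return {pos: col for pos, col in columns.items() if col[0] != name}


def rebuild(columns):
    """A rebuilt table is created fresh, so columns are numbered 1..n again."""
    ordered = [col for _, col in sorted(columns.items())]
    return {pos: col for pos, col in enumerate(ordered, start=1)}


columns = {1: ("a", "int4"), 2: ("b", "int4"), 3: ("c", "bool")}
defaults = {2: "2112"}                      # keyed by position, not by name

after_drop = drop_column(columns, "a")      # b is still 2, c is still 3
rebuilt = rebuild(after_drop)               # b becomes 1, c becomes 2

# Carrying the default records over unchanged: position 2 now means "c".
misapplied = {rebuilt[pos]: expr for pos, expr in defaults.items()}

# Contrast: translate old positions to new ones through the column name.
old_pos_to_name = {pos: col[0] for pos, col in after_drop.items()}
new_name_to_pos = {col[0]: pos for pos, col in rebuilt.items()}
remapped = {
    new_name_to_pos[old_pos_to_name[pos]]: expr for pos, expr in defaults.items()
}
```

Here `misapplied` puts the integer default 2112 on the boolean column, matching the slide's description, while `remapped` keeps it on "b".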
74. When Good pg_reorgs Go Bad!
Basic Fix: Drop All Defaults And Recreate
Alternative Fix: Hack System Catalogs Some More
Haven’t we had enough
system catalog hacking
for now?
75. When Good pg_reorgs Go Bad!
“now, if you'll excuse me,
I'll go away and have a
heart attack.”
76. What Next?
Report Problem To Mailing List
Submit A Patch
Ultimately The Problem Is Fixed
Everyone’s Happy?
78. Hackers Discussion
Postgres Development Community Is Funny
Sometimes Hard To Get Them To
Recognize Problems
Not Everyone Sees Online Rebuild As A Big Problem
In All Fairness,
Not Everyone Has This Problem
79. Hackers Discussion
Hackers Meeting 2011,
Discussion On Internal Queuing System
Could Be Used As Underlying Basis For
On-Line Rebuilding
Until Then...
80. pg_reorg Is A Great Tool!
Best Option For Difficult Situation
Just Be Careful!
81. THANKS!
Highload++
NTT
OmniTI
Postgres Community
Momjian, Depesz, Patel, Kocoloski
xzilla.net
@robtreat2
+ Robert Treat