2. Who am I?
• Startup junkie / masochist
• Deliver stuff that works in startup time
• Old timer on the NYC startup scene
• ♥ the luxury of choosing tools. And living with them.
4. This Talk is ...
• Pragmatic, Concrete
• About My Experiences and Lessons Learned
• About 3 recent startups built with PostgreSQL
• Going to focus on Postgres, but leak into overall architecture
5. Brief History: me + PostgreSQL
• Digital Railroad (Deadpool 2007)
• Shrty (acquired by Collecta 2008)
• Outside.in (acquired by AOL 2011)
• Bookish (stealth)
7. The IM You Don’t Want to Get. Ever.
1:05 am “Site’s down”
1:06 am “U seeing all these alerts?”
1:09 am “What’s it mean - no such device?”
9. Fallout
• A lot of the system was down for a short time
• When it came back up, data was old
• New data had to be merged with incoming
• But the incoming pipeline was never compromised
11. Shrty
• Social Network Aggregation
• Seed capital
• 2 developers
• First attempt to run Postgres on EC2
12. Story
• 3 guys with an idea and a logo
• Built in 2 months in RoR and Java
• Modest traffic, tested up to 100K users
• Investor pitches
• “Production”
• Sold.
13. Lessons Learned
• PostgreSQL + EC2 : it works!
• Cheap!
• I/O is massively unpredictable
• Ephemeral storage is ... ephemeral
• No SLA in the Cloud
14. outside.in
• Hyperlocal News
• Geotag and categorize web pages, blog posts and tweets from hundreds of thousands of sources
• Organize data into ~85,000 neighborhoods
• Query for news within 1000 ft. of a user
• Chose Postgres for PostGIS
• Powers local on CNN’s homepage and many other sites
• Now part of AOL’s Patch
15. Architecture
[Diagram: RoR app and APIs plus Scala services (denormalization, indexing, queueing, text mining, mobile APIs, public API) in front of a Postgres master with slaves and a Solr master with slaves]
16. EC2 DB “Hardware”
• m2.4xlarge = High-Memory Quadruple Extra Large!
• 68.4 GB RAM
• High I/O Performance
• 8 virtual cores
17. The Cloud Giveth and Taketh
• Machines vanish (network, switch, power ...)
• Network availability
• Multi-tenant machines
• SAN location
• OI became a large AWS customer, assigned an acct. manager and access to EC2 engineers
• Email you don’t want to get on a Friday night...
18. Hello,
One of your instances in the us-east-1 region is on hardware that requires network related maintenance. Your other instances that are not listed here will not be affected.
i-3fcdb156
For the above instance, we recommend migrating to a replacement instance to avoid any downtime. Your replacement instance would not be subject to this maintenance. If you leave your instance running, you will lose network connectivity for up to two hours. The maintenance will occur during a 12-hour window starting at 12:00am PST on Monday, February 15, 2010. After the maintenance is complete, network connectivity will be restored to your instance.
As always, we recommend keeping current backups of data stored on your instance.
Sincerely,
The Amazon EC2 Team
19. Failure is Assured
• Load balance with health checks (Varnish)
• Use DNS. Private IPs *do* change
• Use Puppet (or Chef)
• Hardened basic image, apply security patches there
• Puppet bootstraps from there
• Replace instances before they fail when possible
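The health-check half of this can be sketched in Varnish 2.x VCL. Backend names, ports, and the /health URL below are illustrative assumptions, not the actual outside.in config; note the backends are addressed by DNS name, not private IP:

```vcl
# Sketch: Varnish load balancing with active health probes.
backend app1 {
  .host = "app1.internal.example.com";  # DNS name - private IPs *do* change
  .port = "8080";
  .probe = {
    .url = "/health";      # hypothetical health endpoint
    .interval = 5s;
    .timeout = 1s;
    .window = 5;           # look at the last 5 probes...
    .threshold = 3;        # ...require 3 good ones to stay healthy
  }
}

director apps round-robin {
  { .backend = app1; }
}
```

A backend that fails its probes is taken out of rotation automatically, which is what lets you replace instances before (or when) they fail.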
20. Resource Contention
• Everyone needs data, everyone needs it NOW.
• Put WAL on a separate disk (log writing bounds write throughput)
• Keep an eye on iostat - one bad disk in a RAID 0 can ruin your day
• Backups, buffer cache filling, vacuuming
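A rough sketch of the WAL-related knobs (values are illustrative, 8.x-era settings; the WAL move itself is done by symlinking $PGDATA/pg_xlog to a directory on the dedicated volume while the server is stopped):

```
# postgresql.conf fragment - settings that interact with WAL write pressure
wal_buffers = 8MB                   # default is tiny
checkpoint_segments = 32            # fewer, larger checkpoints
checkpoint_completion_target = 0.7  # spread checkpoint I/O out

# The WAL directory itself (server stopped):
#   mv $PGDATA/pg_xlog /wal_disk/pg_xlog
#   ln -s /wal_disk/pg_xlog $PGDATA/pg_xlog
# Then watch per-device utilization with: iostat -x 5
```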
21. Connections
• Managing max_connections
• PGBouncer = basic conn pooler
• Session mode - life of connection
• Tx mode - life of transaction
• Statement mode - life of single statement
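A minimal pgbouncer.ini sketch (database name, paths, and pool sizes are illustrative). Transaction mode gives the biggest multiplexing win, but only if clients don't depend on session state - and only if they actually release connections, which bit us in practice:

```
; pgbouncer.ini fragment
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction   ; session | transaction | statement
max_client_conn = 500     ; app-side connections
default_pool_size = 20    ; actual server connections per db/user
```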
22. Containment Problem
• Places (points) need to be placed into neighborhoods properly
• Neighborhood and municipal boundaries are complex
• Neighborhoods overlap towns - need % intersection
• Containment projects upward
• US shape data is messy
23. Geometry is Slow :(
• Simplify shapes - if you can
• Avoid complex geo queries online (ST_Contains, ST_Intersection, ST_Centroid)
• Cache containment. Geo will never be faster than a simple SELECT
• Eventually... index containment in Lucene
• PostGIS for generating and updating the containment cache only (periodic, offline)
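The containment cache can be sketched roughly like this; table and column names are hypothetical, not the real schema. PostGIS runs offline to build the cache, and the online path becomes a plain indexed SELECT:

```sql
-- Offline: points into neighborhoods (hypothetical schema)
CREATE TABLE place_containment AS
SELECT p.id AS place_id, n.id AS neighborhood_id
FROM   places p
JOIN   neighborhoods n ON ST_Contains(n.geom, p.geom);

-- Offline: neighborhood/town overlap, with % intersection
CREATE TABLE neighborhood_towns AS
SELECT n.id AS neighborhood_id,
       t.id AS town_id,
       ST_Area(ST_Intersection(n.geom, t.geom)) / ST_Area(n.geom) AS pct
FROM   neighborhoods n
JOIN   towns t ON ST_Intersects(n.geom, t.geom);

CREATE INDEX place_containment_place_idx ON place_containment (place_id);

-- Online: no geometry math at request time
SELECT neighborhood_id FROM place_containment WHERE place_id = 42;
```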
24. Hyperlocal at CNN Scale
• Strategic investor
• Initial API impl was the CNN homepage!
• Many MM page views
• 350 req/s
• News = sensitive to caching
25. Replication
• Done via WAL shipping
• Postgres 8.4: warm standby only
• Base (hot) backup, then ship and apply WAL
• Replica sometimes came out of standby mode (manual procedure to remedy)
• WAL shipping to multiple slaves:
• Make some with RAID for emergency promotion to master
• Make one use a single EBS volume and snapshot that.
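An 8.4-era warm-standby setup along these lines (hostnames, paths, and trigger file are illustrative):

```
# Master, postgresql.conf - ship completed WAL segments to the standby
archive_mode    = on
archive_command = 'rsync -a %p standby:/var/lib/pgsql/wal_archive/%f'

# Standby, recovery.conf - pg_standby replays segments as they arrive,
# and the trigger file is how you promote it (the manual procedure above)
restore_command = 'pg_standby -t /tmp/pgsql.trigger /var/lib/pgsql/wal_archive %f %p'
```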
26. Backup
• Periodic full pg_dump -> S3
• Lots of I/O pressure
• Experiments using XFS RAID snapshotting. Don’t do it.
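A sketch of the periodic dump (database, bucket, and schedule are made up, and s3cmd stands in for whatever uploader you use). Running it against a slave keeps the I/O pressure off the master:

```
# crontab fragment - nightly custom-format dump, then push to S3
# (in crontab, % must be escaped as \%)
0 3 * * * pg_dump -Fc mydb > /backups/mydb-$(date +\%F).dump && s3cmd put /backups/mydb-$(date +\%F).dump s3://my-backup-bucket/
```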
27. Load Balancing
• HAProxy
• ELB for Application Servers - not for internal use!
• From the horse’s mouth: ELB scales up HAProxy cores with the # of unique IPs, NOT raw traffic.
28. Linux Buffer Cache
• Postgres highly dependent on warm OS caches
• Crazy variances in query times:
• 10 ms in Staging
• 5000 ms in Prod
• Data stampedes
• Warm-up time for db = warming caches
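A crude warm-up pass before putting a fresh replica into rotation might look like this (table names are the illustrative ones used elsewhere in this deck). Large sequential scans mostly warm the OS page cache rather than shared_buffers; newer Postgres (9.4+) has pg_prewarm for doing this properly:

```sql
-- Touch the hot tables so the first real queries don't eat cold-cache latency
SELECT count(*) FROM stories;
SELECT count(*) FROM blips WHERE blip_type_id IN (1, 3);  -- also walks hot index paths
```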
29. I/O
• DB performance is a game of maximizing I/O, where EC2 is your opponent.
• Guaranteed IOPs (???)
• RAID 0 or RAID 10?
32. Keeping Things Healthy
• Monitor bloat
  • Vacuum as needed
  • autovacuum may not be enough
  • VACUUM FULL may be too much (locks)
• Vacuum analyze frequently
  • Use autovacuum but tune carefully
• PgFouine FTW!
  • Log analysis
  • Slow queries
  • Vacuum analysis
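Tuning and monitoring along these lines (table name and thresholds are illustrative; per-table autovacuum settings exist from 8.4 on):

```sql
-- Vacuum a hot table more aggressively than the global defaults
ALTER TABLE stories SET (
  autovacuum_vacuum_scale_factor  = 0.05,  -- vacuum after ~5% dead rows
  autovacuum_analyze_scale_factor = 0.02
);

-- Quick bloat smell test: dead vs. live tuples per table
SELECT relname, n_live_tup, n_dead_tup
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC
LIMIT  10;
```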
33. More Performance
• Use stored procedures (and the debugger)
• The query optimizer doesn’t always do what you expect!
• Maximize statistics (but beware dynamic SQL)
ALTER TABLE <table> ALTER COLUMN <column> SET STATISTICS <number>
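For example (table, column, and target value are illustrative); the new target only takes effect after the next analyze:

```sql
-- Crank the per-column statistics target on a skewed column,
-- then re-analyze so the planner sees the richer histogram
ALTER TABLE stories ALTER COLUMN sort_date SET STATISTICS 1000;
ANALYZE stories;
```

The trade-off from the speaker notes applies: ANALYZE gets slower, and dynamic SQL pays the price at plan time.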
34. Heinous SQL
SELECT
  stories.id,
  (SELECT fsa.title
     FROM feed_story_attachments fsa
     LEFT OUTER JOIN feed_publication_attachments fpa
       ON fsa.feed_id=fpa.feed_id AND fpa.publication_id=112
    WHERE (fpa.owned=TRUE OR fpa.owned IS NULL)
      AND fsa.story_id=stories.id
    ORDER BY fsa.created_at ASC
    LIMIT 1) AS title,
  (SELECT f.title
     FROM feeds f
     JOIN feed_story_attachments fsa ON f.id=fsa.feed_id
     LEFT OUTER JOIN feed_publication_attachments fpa
       ON fsa.feed_id=fpa.feed_id AND fpa.publication_id=112
    WHERE (fpa.owned=TRUE OR fpa.owned IS NULL)
      AND fsa.story_id=stories.id
    ORDER BY fsa.created_at DESC
    LIMIT 1) AS "author",
  (SELECT fsa.url
     FROM feed_story_attachments fsa
     LEFT OUTER JOIN feed_publication_attachments fpa
       ON fsa.feed_id=fpa.feed_id AND fpa.publication_id=112
    WHERE (fpa.owned=TRUE OR fpa.owned IS NULL)
      AND fsa.story_id=stories.id
    ORDER BY fsa.created_at DESC
    LIMIT 1) AS url,
  SUBSTRING(stories.summary FROM 1 FOR 200) AS summary,
  stories.sort_date AS published_at,
  (SELECT f.base_url
     FROM feeds f
     JOIN feed_story_attachments fsa ON f.id=fsa.feed_id
     LEFT OUTER JOIN feed_publication_attachments fpa
       ON fsa.feed_id=fpa.feed_id AND fpa.publication_id=112
    WHERE (fpa.owned=TRUE OR fpa.owned IS NULL)
      AND fsa.story_id=stories.id
    ORDER BY fsa.created_at DESC
    LIMIT 1) AS author_url,
  (EXISTS (
    SELECT fpa.id
      FROM feed_publication_attachments fpa
      JOIN feed_story_attachments fsa ON fsa.feed_id=fpa.feed_id
     WHERE stories.id = fsa.story_id
       AND fpa.publication_id=112
       AND fpa.owned
  )) AS promoted
FROM stories
JOIN blips b
  ON b.story_id = stories.id
 AND b.location_id=1435491
 AND b.publisher_id IN (0,115)
WHERE
  b.blip_type_id IN (1,3) AND -- comment out to run prior query form
  (
    NOT EXISTS (
      SELECT bf.id
        FROM blip_filters bf
       WHERE bf.location_id=1435491
         AND bf.story_id = stories.id
         AND bf.publisher_id=115
    )
    AND EXISTS (
      SELECT f.id
        FROM feeds f
        JOIN feed_story_attachments fsa ON f.id=fsa.feed_id
        LEFT OUTER JOIN feed_publication_attachments fpa
          ON fsa.feed_id=fpa.feed_id AND fpa.publication_id=112
       WHERE (fpa.owned=TRUE OR fpa.owned IS NULL)
         AND fsa.story_id=stories.id
    )
    AND NOT EXISTS (
      SELECT psf.id
        FROM publication_story_filters psf
       WHERE psf.story_id = stories.id
         AND psf.publication_id=112
    )
  )
36. Make Heinous SQL Run Fast!
• Fast = subsecond
• Ideally < 250 ms
• Query planner - feed it stats
• Sometimes rewrite queries to take advantage of GiST indexes (critical for geo)
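A sketch of the geo case, assuming PostGIS 1.5's geography type (table, column, and coordinates are illustrative; 1000 ft is roughly 305 m). The functional index matches the expression in the query, so the planner can use it:

```sql
-- GiST index on the geography form of the column
CREATE INDEX places_geog_gist ON places USING GIST (geography(geom));

-- "News within 1000 ft. of a user" as an index-friendly radius query
SELECT id
FROM   places
WHERE  ST_DWithin(
         geography(geom),
         ST_SetSRID(ST_MakePoint(-73.99, 40.73), 4326)::geography,
         305);  -- meters
```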
37. Costs
[Chart: monthly $$ for reserved vs. standard instances, Jan through Dec]
38. Lessons Learned
• EC2 is still cheap-ish, but not without careful planning!
• Denormalize into something else (Lucene, geo cache)
• Monitor the crap out of everything
• Send a synthetic transaction ID through the stack
• Plan on a few failures a week
39. • Hybrid Postgres/MongoDB/Lucene Data Stack
• Postgres 9.0
• Mongo for social graph and event-logging
• UUIDs for shared references
• Hot Standby
• Streaming Replication
• VPC and Dedicated Instances ($$)
• Experimenting with other Clouds for the Production Environment
• Launching late summer - and we’re hiring!
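The Hot Standby + Streaming Replication bullets above boil down to a few settings in 9.0 (hostnames and the replication role name are illustrative):

```
# Primary, postgresql.conf
wal_level         = hot_standby
max_wal_senders   = 3
wal_keep_segments = 128   # keep enough WAL for a lagging standby

# Standby, postgresql.conf - allow read-only queries during recovery
hot_standby = on

# Standby, recovery.conf - stream WAL directly from the primary
standby_mode     = 'on'
primary_conninfo = 'host=primary.internal port=5432 user=replicator'
```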
41. Some Thoughts and Conclusions
• PostgreSQL is a GREAT choice if you are starting out now, on EC2.
• The Postgres community is awesome.
• Organized governing body - who needs it?
• Let’s see a shrink-wrapped EC2 cloud provider. We’ll be customer #2 :)
Speaker notes:
• Seen Josh Berkus’ “7 Habits of Highly Ineffective Presenters”
• DRR - photography. Shrty - social network aggregation, innovative at the time. OI - hyperlocal news. Obikosh - stealth.
• Going to start with a story. 5 years ago.
• ... but especially on a Friday night. Failure of primary and secondary backup devices on site. Offsite backups were old.
• Bad went to worse and worse and worse. Painful recovery process. This is NOT what I mean by “operational fun”. Besides friendships with the engineers, learned a few things.
• Digital Railroad: used by news photographers worldwide. FTP service stayed up - separation of concerns. Happened to be built on Postgres by one of our devs.
• ... and it happens at 1 am on a Friday night.
• Production wasn’t really production.
• No SLA: nature of EC2 scale. Often we would know of failures long before AWS. Support can really only tell you that indeed, you have experienced a failure.
• 150 instances.
• Each connection has its own working set, so be careful. We got up to 500. Could not use PgBouncer because clients never release conns.
• Engineer on the team came up with the containment cache idea.
• Explain WAL. Hot standby in 9.0!
• Elastic load balancers: no good for internal use.
• Tell Ironhide story. Service for the iPhone app and a version of the API.
• RAID 10 will tolerate volume loss. We did not do I/O tests. fio used for testing. Guaranteed IOPS?
• Used mdadm to build software RAID 0.
• autovacuum can kick in at inopportune moments (data load, data fixup).
• Crank up stats for better, smarter query plans - but it comes at a price: slower VACUUM ANALYZE, and you pay the price in planning time with dynamic SQL.
• The query planner is really good. Bruce Momjian said they want the query optimizer to do everything for you. If you have a problem they will get you a patch.
• 8 months to realize savings from reservations, given typical flux of reserved and non-reserved instances.