2. Who am I?
• Startup junkie / masochist
• Deliver stuff that works in startup time
• Old timer on the NYC startup scene
• ♥ the luxury of choosing tools. And living with them.
4. This Talk is ...
• Pragmatic, Concrete
• About My Experiences and Lessons Learned
• About 3 recent startups built with PostgreSQL
• Going to focus on Postgres, but leak into overall architecture
5. Brief History: me + PostgreSQL
• Digital Railroad (Deadpool 2007)
• Shrty (acquired by Collecta 2008)
• Outside.in (acquired by AOL 2011)
• Bookish (stealth)
7. The IM You Don’t Want to Get. Ever.
1:05 am “Site’s down”
1:06 am “U seeing all these alerts?”
1:09 am “What’s it mean - no such device?”
9. Fallout
• A lot of the system was down for a short time
• When it came back up, data was old
• New data had to be merged with incoming
• But the incoming pipeline was never compromised
11. Shrty
• Social Network Aggregation
• Seed capital
• 2 developers
• First attempt to run Postgres on EC2
12. Story
• 3 guys with an idea and a logo
• Built in 2 months in RoR and Java
• Modest traffic, tested up to 100K users
• Investor pitches
• “Production”
• Sold.
13. Lessons Learned
• PostgreSQL + EC2 : it works!
• Cheap!
• I/O is massively unpredictable
• Ephemeral storage is ... ephemeral
• No SLA in the Cloud
14. outside.in
• Hyperlocal News
• Geotag and categorize web pages, blog posts and tweets from hundreds of thousands of sources
• Organize data into ~85,000 neighborhoods
• Query for news within 1000 ft. of a user
• Chose Postgres for PostGIS
• Powers local on CNN’s homepage and many other sites
• Now part of AOL’s Patch
15. Architecture
[Diagram: RoR app and APIs plus Scala services (denormalization, indexing, queueing, text mining, mobile APIs, public API) in front of a Postgres master with slaves and a Solr master with slaves]
16. EC2 DB “Hardware”
• m2.4xlarge = High-Memory Quadruple Extra Large!
• 68.4 GB RAM
• High I/O Performance
• 8 virtual cores
17. The Cloud Giveth and Taketh
• Machines vanish (network, switch, power ...)
• Network availability
• Multi-tenant machines
• SAN location
• OI became a large AWS customer, assigned an acct. manager and access to EC2 engineers
• Email you don’t want to get on a Friday night...
18. Hello,
One of your instances in the us-east-1 region is on hardware that requires network related maintenance. Your other instances that are not listed here will not be affected.
i-3fcdb156
For the above instance, we recommend migrating to a replacement instance to avoid any downtime. Your replacement instance would not be subject to this maintenance. If you leave your instance running, you will lose network connectivity for up to two hours. The maintenance will occur during a 12-hour window starting at 12:00am PST on Monday, February 15, 2010. After the maintenance is complete, network connectivity will be restored to your instance.
As always, we recommend keeping current backups of data stored on your instance.
Sincerely,
The Amazon EC2 Team
19. Failure is Assured
• Load balance with health checks (Varnish)
• Use DNS. Private IPs *do* change
• Use Puppet (or Chef)
• Hardened basic image, apply security patches there
• Puppet bootstraps from there
• Replace instances before they fail when possible
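The health-check half of this can be sketched in Varnish 2.x VCL. Backend names, ports, and the /health URL below are illustrative assumptions, not the actual outside.in config; note the backends are addressed by DNS name, not private IP:

```vcl
# Sketch: Varnish load balancing with active health probes.
backend app1 {
  .host = "app1.internal.example.com";  # DNS name - private IPs *do* change
  .port = "8080";
  .probe = {
    .url = "/health";      # hypothetical health endpoint
    .interval = 5s;
    .timeout = 1s;
    .window = 5;           # look at the last 5 probes...
    .threshold = 3;        # ...require 3 good ones to stay healthy
  }
}

director apps round-robin {
  { .backend = app1; }
}
```

A backend that fails its probes is taken out of rotation automatically, which is what lets you replace instances before (or when) they fail.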
20. Resource Contention
• Everyone needs data, everyone needs it NOW.
• Put WAL on a separate disk (log writing bounds write throughput)
• Keep an eye on iostat - one bad disk in a RAID 0 can ruin your day
• Backups, buffer cache filling, vacuuming
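A rough sketch of the WAL-related knobs (values are illustrative, 8.x-era settings; the WAL move itself is done by symlinking $PGDATA/pg_xlog to a directory on the dedicated volume while the server is stopped):

```
# postgresql.conf fragment - settings that interact with WAL write pressure
wal_buffers = 8MB                   # default is tiny
checkpoint_segments = 32            # fewer, larger checkpoints
checkpoint_completion_target = 0.7  # spread checkpoint I/O out

# The WAL directory itself (server stopped):
#   mv $PGDATA/pg_xlog /wal_disk/pg_xlog
#   ln -s /wal_disk/pg_xlog $PGDATA/pg_xlog
# Then watch per-device utilization with: iostat -x 5
```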
21. Connections
• Managing max_connections
• PGBouncer = basic conn pooler
• Session mode - life of connection
• Tx mode - life of transaction
• Statement mode - life of single statement
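A minimal pgbouncer.ini sketch (database name, paths, and pool sizes are illustrative). Transaction mode gives the biggest multiplexing win, but only if clients don't depend on session state - and only if they actually release connections, which bit us in practice:

```
; pgbouncer.ini fragment
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction   ; session | transaction | statement
max_client_conn = 500     ; app-side connections
default_pool_size = 20    ; actual server connections per db/user
```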
22. Containment Problem
• Places (points) need to be placed into neighborhoods properly
• Neighborhood and municipal boundaries are complex
• Neighborhoods overlap towns - need % intersection
• Containment projects upward
• US shape data is messy
23. Geometry is Slow :(
• Simplify shapes - if you can
• Avoid complex geo queries online (ST_Contains, ST_Intersection, ST_Centroid)
• Cache containment. Geo will never be faster than a simple SELECT
• Eventually... index containment in Lucene
• PostGIS for generating and updating the containment cache only (periodic, offline)
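The containment cache can be sketched roughly like this; table and column names are hypothetical, not the real schema. PostGIS runs offline to build the cache, and the online path becomes a plain indexed SELECT:

```sql
-- Offline: points into neighborhoods (hypothetical schema)
CREATE TABLE place_containment AS
SELECT p.id AS place_id, n.id AS neighborhood_id
FROM   places p
JOIN   neighborhoods n ON ST_Contains(n.geom, p.geom);

-- Offline: neighborhood/town overlap, with % intersection
CREATE TABLE neighborhood_towns AS
SELECT n.id AS neighborhood_id,
       t.id AS town_id,
       ST_Area(ST_Intersection(n.geom, t.geom)) / ST_Area(n.geom) AS pct
FROM   neighborhoods n
JOIN   towns t ON ST_Intersects(n.geom, t.geom);

CREATE INDEX place_containment_place_idx ON place_containment (place_id);

-- Online: no geometry math at request time
SELECT neighborhood_id FROM place_containment WHERE place_id = 42;
```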
24. Hyperlocal at CNN Scale
• Strategic investor
• Initial API impl was the CNN homepage!
• Many MM page views
• 350 req/s
• News = sensitive to caching
25. Replication
• Done via WAL shipping
• Postgres 8.4: warm standby only
• Base (hot) backup, then ship and apply WAL
• Replica sometimes came out of standby mode (manual procedure to remedy)
• WAL shipping to multiple slaves:
• Make some with RAID for emergency promotion to master
• Make one use a single EBS volume and snapshot that.
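An 8.4-era warm-standby setup along these lines (hostnames, paths, and trigger file are illustrative):

```
# Master, postgresql.conf - ship completed WAL segments to the standby
archive_mode    = on
archive_command = 'rsync -a %p standby:/var/lib/pgsql/wal_archive/%f'

# Standby, recovery.conf - pg_standby replays segments as they arrive,
# and the trigger file is how you promote it (the manual procedure above)
restore_command = 'pg_standby -t /tmp/pgsql.trigger /var/lib/pgsql/wal_archive %f %p'
```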
26. Backup
• Periodic full pg_dump -> S3
• Lots of I/O pressure
• Experiments using XFS RAID snapshotting. Don’t do it.
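A sketch of the periodic dump (database, bucket, and schedule are made up, and s3cmd stands in for whatever uploader you use). Running it against a slave keeps the I/O pressure off the master:

```
# crontab fragment - nightly custom-format dump, then push to S3
# (in crontab, % must be escaped as \%)
0 3 * * * pg_dump -Fc mydb > /backups/mydb-$(date +\%F).dump && s3cmd put /backups/mydb-$(date +\%F).dump s3://my-backup-bucket/
```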
27. Load Balancing
• HAProxy
• ELB for Application Servers - not for internal use!
• From the horse’s mouth: ELB scales up HAProxy cores with the # of unique IPs, NOT raw traffic.
28. Linux Buffer Cache
• Postgres highly dependent on warm OS caches
• Crazy variances in query times:
• 10 ms in Staging
• 5000 ms in Prod
• Data stampedes
• Warm-up time for db = warming caches
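A crude warm-up pass before putting a fresh replica into rotation might look like this (table names are the illustrative ones used elsewhere in this deck). Large sequential scans mostly warm the OS page cache rather than shared_buffers; newer Postgres (9.4+) has pg_prewarm for doing this properly:

```sql
-- Touch the hot tables so the first real queries don't eat cold-cache latency
SELECT count(*) FROM stories;
SELECT count(*) FROM blips WHERE blip_type_id IN (1, 3);  -- also walks hot index paths
```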
29. I/O
• DB performance is a game of maximizing I/O, where EC2 is your opponent.
• Guaranteed IOPs (???)
• RAID 0 or RAID 10?
32. Keeping Things Healthy
• Monitor bloat
  • Vacuum as needed
  • autovacuum may not be enough
  • VACUUM FULL may be too much (locks)
• Vacuum analyze frequently
  • Use autovacuum but tune carefully
• PgFouine FTW!
  • Log analysis
  • Slow queries
  • Vacuum analysis
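Tuning and monitoring along these lines (table name and thresholds are illustrative; per-table autovacuum settings exist from 8.4 on):

```sql
-- Vacuum a hot table more aggressively than the global defaults
ALTER TABLE stories SET (
  autovacuum_vacuum_scale_factor  = 0.05,  -- vacuum after ~5% dead rows
  autovacuum_analyze_scale_factor = 0.02
);

-- Quick bloat smell test: dead vs. live tuples per table
SELECT relname, n_live_tup, n_dead_tup
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC
LIMIT  10;
```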
33. More Performance
• Use stored procedures (and the debugger)
• The query optimizer doesn’t always do what you expect!
• Maximize statistics (but beware dynamic SQL)
ALTER TABLE <table> ALTER COLUMN <column> SET STATISTICS <number>
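For example (table, column, and target value are illustrative); the new target only takes effect after the next analyze:

```sql
-- Crank the per-column statistics target on a skewed column,
-- then re-analyze so the planner sees the richer histogram
ALTER TABLE stories ALTER COLUMN sort_date SET STATISTICS 1000;
ANALYZE stories;
```

The trade-off from the speaker notes applies: ANALYZE gets slower, and dynamic SQL pays the price at plan time.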
34. Heinous SQL
SELECT
  stories.id,
  (SELECT fsa.title
     FROM feed_story_attachments fsa
     LEFT OUTER JOIN feed_publication_attachments fpa
       ON fsa.feed_id=fpa.feed_id AND fpa.publication_id=112
    WHERE (fpa.owned=TRUE OR fpa.owned IS NULL)
      AND fsa.story_id=stories.id
    ORDER BY fsa.created_at ASC
    LIMIT 1) AS title,
  (SELECT f.title
     FROM feeds f
     JOIN feed_story_attachments fsa ON f.id=fsa.feed_id
     LEFT OUTER JOIN feed_publication_attachments fpa
       ON fsa.feed_id=fpa.feed_id AND fpa.publication_id=112
    WHERE (fpa.owned=TRUE OR fpa.owned IS NULL)
      AND fsa.story_id=stories.id
    ORDER BY fsa.created_at DESC
    LIMIT 1) AS "author",
  (SELECT fsa.url
     FROM feed_story_attachments fsa
     LEFT OUTER JOIN feed_publication_attachments fpa
       ON fsa.feed_id=fpa.feed_id AND fpa.publication_id=112
    WHERE (fpa.owned=TRUE OR fpa.owned IS NULL)
      AND fsa.story_id=stories.id
    ORDER BY fsa.created_at DESC
    LIMIT 1) AS url,
  SUBSTRING(stories.summary FROM 1 FOR 200) AS summary,
  stories.sort_date AS published_at,
  (SELECT f.base_url
     FROM feeds f
     JOIN feed_story_attachments fsa ON f.id=fsa.feed_id
     LEFT OUTER JOIN feed_publication_attachments fpa
       ON fsa.feed_id=fpa.feed_id AND fpa.publication_id=112
    WHERE (fpa.owned=TRUE OR fpa.owned IS NULL)
      AND fsa.story_id=stories.id
    ORDER BY fsa.created_at DESC
    LIMIT 1) AS author_url,
  (EXISTS (
    SELECT fpa.id
      FROM feed_publication_attachments fpa
      JOIN feed_story_attachments fsa ON fsa.feed_id=fpa.feed_id
     WHERE stories.id = fsa.story_id
       AND fpa.publication_id=112
       AND fpa.owned
  )) AS promoted
FROM stories
JOIN blips b
  ON b.story_id = stories.id
 AND b.location_id=1435491
 AND b.publisher_id IN (0,115)
WHERE
  b.blip_type_id IN (1,3) AND -- comment out to run prior query form
  (
    NOT EXISTS (
      SELECT bf.id
        FROM blip_filters bf
       WHERE bf.location_id=1435491
         AND bf.story_id = stories.id
         AND bf.publisher_id=115
    )
    AND EXISTS (
      SELECT f.id
        FROM feeds f
        JOIN feed_story_attachments fsa ON f.id=fsa.feed_id
        LEFT OUTER JOIN feed_publication_attachments fpa
          ON fsa.feed_id=fpa.feed_id AND fpa.publication_id=112
       WHERE (fpa.owned=TRUE OR fpa.owned IS NULL)
         AND fsa.story_id=stories.id
    )
    AND NOT EXISTS (
      SELECT psf.id
        FROM publication_story_filters psf
       WHERE psf.story_id = stories.id
         AND psf.publication_id=112
    )
  )
36. Make Heinous SQL Run Fast!
• Fast = subsecond
• Ideally < 250 ms
• Query planner - feed it stats
• Sometimes rewrite queries to take advantage of GiST indexes (critical for geo)
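A sketch of the geo case, assuming PostGIS 1.5's geography type (table, column, and coordinates are illustrative; 1000 ft is roughly 305 m). The functional index matches the expression in the query, so the planner can use it:

```sql
-- GiST index on the geography form of the column
CREATE INDEX places_geog_gist ON places USING GIST (geography(geom));

-- "News within 1000 ft. of a user" as an index-friendly radius query
SELECT id
FROM   places
WHERE  ST_DWithin(
         geography(geom),
         ST_SetSRID(ST_MakePoint(-73.99, 40.73), 4326)::geography,
         305);  -- meters
```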
37. Costs
[Chart: monthly $$ for reserved vs. standard instances, Jan through Dec]
38. Lessons Learned
• EC2 is still cheap-ish, but not without careful planning!
• Denormalize into something else (Lucene, geo cache)
• Monitor the crap out of everything
• Send a synthetic transaction ID through the stack
• Plan on a few failures a week
39. • Hybrid Postgres/MongoDB/Lucene Data Stack
• Postgres 9.0
• Mongo for social graph and event-logging
• UUIDs for shared references
• Hot Standby
• Streaming Replication
• VPC and Dedicated Instances ($$)
• Experimenting with other Clouds for the Production Environment
• Launching late summer - and we’re hiring!
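The Hot Standby + Streaming Replication bullets above boil down to a few settings in 9.0 (hostnames and the replication role name are illustrative):

```
# Primary, postgresql.conf
wal_level         = hot_standby
max_wal_senders   = 3
wal_keep_segments = 128   # keep enough WAL for a lagging standby

# Standby, postgresql.conf - allow read-only queries during recovery
hot_standby = on

# Standby, recovery.conf - stream WAL directly from the primary
standby_mode     = 'on'
primary_conninfo = 'host=primary.internal port=5432 user=replicator'
```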
41. Some Thoughts and Conclusions
• PostgreSQL is a GREAT choice if you are starting out now, on EC2.
• The Postgres community is awesome.
• Organized governing body - who needs it?
• Let’s see a shrink-wrapped EC2 cloud provider. We’ll be customer #2 :)
Speaker notes:
• Seen Josh Berkus’ “7 Habits of Highly Ineffective Presenters”
• DRR - photography. Shrty - social network aggregation, innovative at the time. OI - hyperlocal news. Obikosh - stealth.
• Going to start with a story. 5 years ago.
• ... but especially on a Friday night. Failure of primary and secondary backup devices on site. Offsite backups were old.
• Bad went to worse and worse and worse. Painful recovery process. This is NOT what I mean by “operational fun”. Besides friendships with the engineers, learned a few things.
• Digital Railroad: used by news photographers worldwide. FTP service stayed up - separation of concerns. Happened to be built on Postgres by one of our devs.
• ... and it happens at 1 am on a Friday night.
• Production wasn’t really production.
• No SLA: nature of EC2 scale. Often we would know of failures long before AWS. Support can really only tell you that indeed, you have experienced a failure.
• 150 instances.
• Each connection has its own working set, so be careful. We got up to 500. Could not use PgBouncer because clients never release conns.
• Engineer on the team came up with the containment cache idea.
• Explain WAL. Hot standby in 9.0!
• Elastic load balancers: no good for internal use.
• Tell Ironhide story. Service for the iPhone app and a version of the API.
• RAID 10 will tolerate volume loss. We did not do I/O tests. fio used for testing. Guaranteed IOPS?
• Used mdadm to build software RAID 0.
• autovacuum can kick in at inopportune moments (data load, data fixup).
• Crank up stats for better, smarter query plans - but it comes at a price: slower VACUUM ANALYZE, and you pay the price in planning time with dynamic SQL.
• The query planner is really good. Bruce Momjian said they want the query optimizer to do everything for you. If you have a problem they will get you a patch.
• 8 months to realize savings from reservations, given typical flux of reserved and non-reserved instances.