20080330 Postgresqlconference2008 Pg In Web2.0 Samokhvalov

Using PostgreSQL In Web 2.0 Applications
How PostgreSQL helps to build Web 2.0 Apps

Nikolay Samokhvalov
Postgresmen, LLC (Moscow, Russia)

PostgreSQL Conference East 2008

What Is Web 2.0?

For users:
● Collaborative web (UGC*, comments, rating systems, etc.)
● Web applications, not web sites (AJAX, more interaction)
● Web as a platform (interoperability, RSS, microformats, etc.)
● Rounded corners, mirrored logos, etc. :)

*) UGC: user-generated content


What Is Web 2.0?

For software developers it means:
● more users, more developers, more brands
● more competitors on the market
● larger number of users, pageviews, TPS, data volumes, etc.: N = e^t
● higher rates of development, shorter iterations
● rapidly changing business requirements

Better technologies help to win!


Why PostgreSQL?

1. Performance, scalability
2. Reliability
3. Powerful capabilities
4. Standards compliance, proper approaches
5. Freedom


Why PostgreSQL?

1. Performance, scalability
2. Reliability } Quality

3. Powerful capabilities } Development efficiency
4. Standards compliance, proper approaches
5. Freedom
} HR


How to Deal With UGC?

1. Taxonomy
● Catalogs
2. Folksonomy
● Tags
3. Hybrid, two ways:
● Tags + Catalogs
● Both users and editors control Catalogs


UGC: Taxonomy

1. Taxonomy (Catalogs)
● EAV, where ATTRIBUTE table is [almost] constant
● intarray / hstore


EAV: Entity-Attribute-Value

[Diagram: Entity, Value, and Attribute tables]

intarray / hstore
item

obj_id INT8
item_section_id INT8
item_vendor_id INT8
item_model_id INT8
item_year INT2
item_price NUMERIC(30,6)
item_props intarray

What about performance?
● This approach saves a lot of space
● Performance is good if you combine GiST/GIN search with FTS
● If you use intarray, it is better to cache tag values in an external cache (e.g. Memcache), but in this case using FTS is a bit harder


UGC: Folksonomy

2. Folksonomy (Tags)
● EAV (again), with a user-controlled ATTRIBUTE table
● intarray / hstore (again)

It's almost the same as taxonomy: you just give control to your users.


UGC: Hybrid

3. Hybrid, two ways:
● Tags + Catalogs: common practice
● Both users and editors control Catalogs: the most interesting option, but the most difficult to implement and maintain
  ● UGC-only catalog entries are not shown in common <SELECT> lists; they wait for editors' approval.
  ● The 'merge' procedure is really complicated (merging UGC with editors' data; merging duplicates, synonyms, etc.).
  ● FTS (stemming, morphology, thesaurus), pg_trgm, metaphone, soundex, etc. may help. BUT: human work is still needed.
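
To make the "merge duplicates" step more concrete, here is a minimal Python sketch of trigram similarity in the spirit of pg_trgm. It only approximates the extension (words are split on whitespace, and the names `trigrams`, `similarity`, `merge_candidates` are hypothetical); the 0.3 threshold is an assumption to tune per data set. It proposes candidates for an editor to review, it does not replace the human work.

```python
def trigrams(text):
    """Return the set of trigrams of `text`, padding each word with two
    leading spaces and one trailing space (an approximation of pg_trgm)."""
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def merge_candidates(new_tag, known_tags, threshold=0.3):
    """Return (tag, score) pairs that look like duplicates of `new_tag`,
    best match first. The threshold is an assumed starting point."""
    scored = [(tag, similarity(new_tag, tag)) for tag in known_tags]
    return sorted((p for p in scored if p[1] >= threshold), key=lambda p: -p[1])


if __name__ == "__main__":
    # A user-entered tag is compared against the editors' catalog entries.
    for tag, score in merge_candidates("postgre", ["postgres", "mysql", "PostgreSQL"]):
        print(tag, round(score, 3))
```

In a real system the same ranking would normally be done inside the database with pg_trgm and an index, with the result shown to an editor for approval.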


UGC: More About Tags

1. Use FTS (tsearch2) to integrate tag searching into your search subsystem:
● use FTS categories to distinguish tag words from ordinary words when needed;
● use a separate FTS configuration to process tags, if needed.
2. Use "prefix search" for tag searching, but it's not straightforward (wait for the next slides ;) )


UGC: Tags And Prefix Search

"Prefix search" helps to build something like this: [screenshot not preserved]

If you use a simple LIKE 'bla%', the result will be somewhat disappointing:
test=# EXPLAIN ANALYZE SELECT * FROM tag WHERE tag_name LIKE 'bla%';
                              QUERY PLAN

Seq Scan on tag  (cost=0.00..6182.75 rows=1 width=105) (actual
time=0.951..102.779 rows=162 loops=1)
   Filter: ((tag_name)::text ~~ 'bla%'::text)
Total runtime: 102.871 ms
(3 rows)

Note: the table contains ~300k unique tags

Tags And Prefix Search: The Proper Solution
1. Use text_pattern_ops to speed up LIKE 'bla%' queries:
test=# CREATE INDEX i_tag_prefix ON tag
            USING btree(lower(tag_name) text_pattern_ops);
CREATE INDEX

test=# EXPLAIN ANALYZE SELECT * FROM tag
            WHERE lower(tag_name) LIKE lower('bla%');
                             QUERY PLAN

Bitmap Heap Scan on tag  (cost=43.55..2356.16 rows=1096 width=105)
(actual time=0.164..0.791 rows=235 loops=1)
   Filter: (lower((tag_name)::text) ~~ 'bla%'::text)
   ->  Bitmap Index Scan on i_tag_prefix  (cost=0.00..43.28 rows=1096
width=0) (actual time=0.116..0.116 rows=235 loops=1)
         Index Cond: ((lower((tag_name)::text) ~>=~ 'bla'::text) AND
(lower((tag_name)::text) ~<~ 'мис'::text))
(5 rows)

Notes: (1) ILIKE cannot use this index, so use lower(); (2) be careful with non-ASCII characters (e.g. it works for Russian in UTF-8 except for the rare 'ё' and 'Ё' characters)
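
The Index Cond in the plan above shows the idea behind the speed-up: a prefix match becomes a range scan between the prefix and an upper bound. The Python sketch below imitates that on a sorted list, which stands in for the B-tree. It is an illustration only; `prefix_upper_bound` and `prefix_search` are hypothetical names, and comparing by code point is assumed to mirror the character-by-character comparison of text_pattern_ops.

```python
from bisect import bisect_left


def prefix_upper_bound(prefix):
    """Smallest string greater than every string starting with `prefix`:
    the prefix with its last character bumped to the next code point.
    (Assumes the last character is not the maximum code point.)"""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def prefix_search(sorted_tags, prefix):
    """Return all entries of `sorted_tags` (sorted, lower-cased) that start
    with `prefix`, using two binary searches instead of a full scan."""
    prefix = prefix.lower()
    if not prefix:
        return list(sorted_tags)
    lo = bisect_left(sorted_tags, prefix)                      # first tag >= 'bla'
    hi = bisect_left(sorted_tags, prefix_upper_bound(prefix))  # first tag >= 'blb'
    return sorted_tags[lo:hi]


if __name__ == "__main__":
    # Lower-casing before sorting plays the role of the lower(tag_name) index.
    tags = sorted(t.lower() for t in ["Blackberry", "blog", "Bla", "blank", "black", "css"])
    print(prefix_search(tags, "BLA"))
```

Lower-casing both the stored values and the search prefix is what makes the search case-insensitive without ILIKE.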

Tags And Prefix Search: The Proper Solution
2. Create tag_words (unique tag words) table to work with words, not with phrases:
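CREATE TABLE tag_words AS
    SELECT DISTINCT word
    FROM ts_stat('SELECT to_tsvector(tag_name) FROM tag'); -- heavy
DROP INDEX i_tag_prefix;
-- index the same expression the query below uses; the one-argument
-- to_tsvector() is not IMMUTABLE and cannot be used in an index
CREATE INDEX i_tag_fts ON tag
    USING gin(to_tsvector('utf8_russian'::regconfig, tag_name::text));
CREATE INDEX i_tag_words_prefix ON tag_words
    USING btree(lower(word) text_pattern_ops);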

test=# EXPLAIN ANALYZE
  SELECT * FROM tag
  WHERE to_tsvector('utf8_russian'::regconfig, tag_name::text)
        @@ to_tsquery('utf8_russian', '(' || (
          -- array_accum is the custom aggregate from the PostgreSQL docs (array_agg in 8.4+)
          SELECT array_to_string(array_accum(lower(word)), '|')
          FROM tag_words
          WHERE lower(word) LIKE 'bla%') || ')'); -- add '...&word1&word2' if needed
/* plan is omitted */
(11 rows)

Notes: (1) it is better to limit the number of tag words found by the inner query (e.g. by ordering by word age: dirty, but it works); (2) unfortunately, the word order of the original query is lost; (3) GIN indexes are better than GiST here
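
The string assembled by the inner query can also be built on the application side, which makes note (1) easy to apply. This is a minimal sketch, not the author's code: `build_prefix_tsquery` and its parameters are hypothetical, the words are assumed to be plain lexemes (no characters that are special to to_tsquery), and the limit of 50 is an arbitrary placeholder.

```python
def build_prefix_tsquery(tag_words, prefix, required=(), limit=50):
    """Build a string such as '(black|blank)&web' for to_tsquery().

    tag_words: known tag words (the contents of the tag_words table)
    prefix:    what the user has typed so far
    required:  fully typed words that must also match (joined with '&')
    limit:     cap on the number of OR-ed words (assumed value)
    Returns None when no word matches the prefix.
    """
    prefix = prefix.lower()
    matches = sorted({w.lower() for w in tag_words if w.lower().startswith(prefix)})
    matches = matches[:limit]  # alphabetical cut-off; the slide suggests word age instead
    if not matches:
        return None
    query = "(" + "|".join(matches) + ")"
    for word in required:
        query += "&" + word.lower()
    return query


if __name__ == "__main__":
    words = ["black", "Blank", "blog", "css", "web"]
    print(build_prefix_tsquery(words, "bla"))
    print(build_prefix_tsquery(words, "bla", required=["web"]))
```

The resulting string would be passed to to_tsquery() as a bound parameter rather than concatenated into the SQL text.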

Rate And Comment Everything
PostgreSQL Inheritance helps to achieve development efficiency
obj

obj_id               INT8       Not SERIAL, wait for the next slide to see details
obj_status_did       INT8       Dictionary value
obj_creator_obj_id   INT8       ID of user who created the record (if applicable)
obj_created          TIMESTAMP  NOT NULL DEFAULT CURRENT_TIMESTAMP
obj_modified         TIMESTAMP  NOT NULL DEFAULT CURRENT_TIMESTAMP
obj_commented        TIMESTAMP  NOT NULL DEFAULT CURRENT_TIMESTAMP
obj_marks_count      INT4       rate everything!
obj_marks_rating     FLOAT8     rate everything!
obj_tsvector         tsvector   Almost all business objects need FTS

Other tables in the diagram:
● user2obj: u2o_user_obj_id, u2o_obj_obj_id, u2o_mark, u2o_is_favorite
● group
● user
● comment: comment_author_obj_id, comment_text


Rate And Comment Everything
CREATE TABLE comment (
   obj_id INT8 NOT NULL DEFAULT
      (((nextval('comment_obj_id_seq'::regclass) * 223072849) % (1000000000)::bigint) + 41000000000)
      CONSTRAINT c_obj_comment_obj_id CHECK
         (obj_id BETWEEN 41000000000 AND 41999999999),
   comment_author_obj_id INT8,
   comment_body VARCHAR(2000) NOT NULL,
   CONSTRAINT PK_MESSAGE PRIMARY KEY (obj_id)
)
INHERITS (obj);

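ID generation scheme:
nextID = (N * X) mod Y + S,
        where N is the next sequence value, X and Y are coprime, and S is an interval shift
        (in the example above: X = 223072849, Y = 1000000000, S = 41000000000)

Use a separate sequence for each table!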
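
The DEFAULT expression in the CREATE TABLE above multiplies the sequence value by X, takes the result modulo Y, and adds S. The Python sketch below replays that arithmetic with the constants from the slide to show why X and Y must be coprime: the mapping then produces no duplicates until the sequence has issued Y values, and every ID stays inside the table's interval. The function name `next_id` is hypothetical.

```python
from math import gcd

X = 223072849        # multiplier from the DEFAULT expression
Y = 1_000_000_000    # modulus: number of IDs available to one table
S = 41_000_000_000   # interval shift of the comment table


def next_id(n, x=X, y=Y, s=S):
    """Map the n-th sequence value to a scattered ID inside [s, s + y - 1]."""
    return (n * x) % y + s


if __name__ == "__main__":
    # Y = 2^9 * 5^9 and X is odd and not divisible by 5, so they share no factor.
    assert gcd(X, Y) == 1

    sample = [next_id(n) for n in range(1, 100_001)]
    assert len(set(sample)) == len(sample)            # no collisions in the sample
    assert all(S <= i <= S + Y - 1 for i in sample)   # satisfies the CHECK constraint
    print(sample[:3])
```

Because each table gets its own shift S, the table an object lives in can be read off its ID, which is what lets constraint exclusion skip the other child tables.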

Do not forget:
SET constraint_exclusion = on;


Build your Google Maps mashup: with PostgreSQL it's easy
Ways to store & index geo data in PostgreSQL:
● two integer columns and B-tree
● point column and R-tree (MirTesen.ru)
● PostGIS (GiST)
● pgSphere (GiST)
● Q3C


Conclusion

● PostgreSQL provides a great set of capabilities to meet the needs of Web 2.0 developers
● PostgreSQL lets you develop quickly without losing quality


Contacts
● nikolay@samokhvalov.com
● Blog: http://nikolay.samokhvalov.com
● XMPP/GTalk: samokhvalov@gmail.com
● Skype: samokhvalov OR postgresmen
● +7 905 783 9804

