Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Indexing	
  Complex	
  PostgreSQL	
  
Data	
  Types
Jonathan	
  S.	
  Katz	
  -­‐	
  PGDay	
  UK	
  2014	
  -­‐	
  July	
 ...
About
• Jonathan	
  S.	
  Katz	
  
– CTO,	
  VenueBook	
  
– Co-­‐Organizer,	
  NYC	
  PostgreSQL	
  User	
  Group	
  
– P...
Quick	
  Overview
• Introductory	
  Talk	
  with	
  demos	
  and	
  fun	
  
• B-­‐trees	
  
• GiST:	
  Generalized	
  Sear...
Assumptions
• PostgreSQL	
  9.3+	
  
• most	
  will	
  be	
  9.0+	
  
• PostGIS	
  2.0+	
  
• Believe	
  it	
  will	
  wor...
The	
  Beginning
5
The	
  Beginning
6
4
3
8
7
2
9
1
6
10
6
Find	
  3
7
4
3
8
7
2
9
1
6
10
5
Find	
  3
8
4
3
8
7
2
9
1
6
10
5
Find	
  3
9
4
3
8
7
2
9
1
6
10
5
Find	
  10
10
4
3
8
7
2
9
1
6
10
5
Find	
  10
11
4
3
8
7
2
9
1
6
10
5
Find	
  10
12
4
3
8
7
2
9
1
6
10
5
Find	
  10
13
4
3
8
7
2
9
1
6
10
5
Find	
  10
14
4
3
8
7
2
9
1
6
10
5
Find	
  10
15
4
3
8
7
2
9
1
6
10
5
Find	
  10
16
4
3
8
7
2
9
1
6
10
5
Find	
  10
17
4
3
8
7
2
9
1
6
10
5
Find	
  10
18
4
3
8
7
2
9
1
6
10
5
Find	
  10
19
4
3
8
7
2
9
1
6
10
5
What	
  We	
  Learned
• Without	
  any	
  data	
  structure	
  around	
  search,	
  we	
  
rely	
  on	
  "hope"	
  
• Assu...
What	
  We	
  Need
• Need	
  a	
  data	
  structure	
  for	
  search	
  that:	
  
– allows	
  efficient	
  lookups	
  
– p...
B-­‐Trees
• "default"	
  index	
  
• quick	
  traversal	
  to	
  leaf	
  nodes	
  
• leaf	
  nodes	
  pre-­‐sorted	
  
• n...
B-­‐Trees	
  and	
  PostgreSQL
• supports	
  
– <=,	
  <,	
  =,	
  >,	
  >=	
  
– BETWEEN,	
  IN	
  
– IS	
  NOT	
  NULL,	...
B-­‐Trees	
  and	
  PostgreSQL
• data	
  types	
  supported	
  
– any	
  data	
  type	
  with	
  all	
  the	
  equality	
 ...
Demo	
  Specs	
  +	
  Configuration
• Hardware	
  specs	
  
– 2.3GHz	
  Intel	
  i7	
  2x4	
  core	
  
– 16GB	
  RAM	
  DD...
Demo	
  #1:	
  Basic	
  Indexing	
  Plans
26
Demo	
  #1	
  Notes
• Index	
  maintenance	
  
• VACUUM	
  –	
  "cleans	
  up"	
  after	
  writes	
  on	
  table	
  /
inde...
Indexing	
  in	
  Production
• CREATE	
  INDEX	
  CONCURRENTLY	
  
• REINDEX	
  
– corruption,	
  bloat,	
  invalid	
  
• ...
Demo	
  #2:	
  Partial	
  Indexes
29
CREATE INDEX indexname ON tablename (columnname)
WHERE somecondition;
Demo	
  #2	
  Notes
• Partial	
  Indexes	
  are	
  
– good	
  if	
  known	
  to	
  query	
  limited	
  subset	
  of	
  tab...
Unique	
  Indexes
• only	
  for	
  B-­‐trees	
  
• NULL	
  not	
  unique	
  
• use	
  UNIQUE	
  constraints	
  –	
  automa...
Multi-­‐Column	
  Indexes
• Useful	
  for	
  
– querying	
  two	
  columns	
  that	
  are	
  frequently	
  queried	
  
tog...
Multi-­‐Column	
  Indexes
33
CREATE INDEX multicolumn_idx ON
tablename (col1, col2);
!
!
!
CREATE UNIQUE INDEX pn_idx ON
p...
Demo	
  #3	
  Notes
• Multi-­‐column	
  indexes	
  can	
  be	
  
– efficient	
  for	
  speed	
  +	
  space	
  
– inefficie...
Expression	
  Indexes
• can	
  index	
  on	
  expressions	
  to	
  speed	
  up	
  lookups	
  
– e.g.	
  case	
  insensitiv...
Demo	
  #4:	
  Expression	
  Indexes
36
Demo	
  #4	
  Notes
• fast	
  lookups,	
  slow	
  updates	
  
• size	
  on	
  disk	
  
• cool	
  data	
  types	
  
– geome...
Geometric	
  Data	
  Types
38
Geometric	
  Data	
  Types
CREATE TABLE points (coord point);
!
CREATE INDEX points_idx ON points (coord);
ERROR: data typ...
GiST
• "generalized	
  search	
  tree"	
  
• infrastructure	
  that	
  provides	
  template	
  to	
  create	
  arbitrary	
...
Demo	
  #5:	
  Geometric	
  GiST	
  +	
  KNN-­‐Gist
41
Demo	
  #5	
  Notes
• GiST	
  indexes	
  on	
  geometric	
  types	
  radically	
  
speedup	
  reads	
  
• Writes	
  are	
 ...
PostGIS
• For	
  when	
  you	
  are	
  doing	
  real	
  things	
  with	
  shapes
43• (and	
  geographic	
  information	
  ...
PostGIS	
  +	
  Indexes
• B-­‐Tree?	
  
• R-­‐Tree?	
  
• PostGIS	
  docs	
  do	
  not	
  recommend	
  using	
  just	
  
a...
PostGIS	
  +	
  GiST
45
2-­‐D
CREATE INDEX zipcodes_geom_gist_idx ON zipcodes
USING gist(geom);
N-­‐D	
  (PostGIS	
  2.0+
...
Example	
  -­‐	
  USA	
  Zipcode	
  Boundaries
• 33,120	
  rows	
  
• geom:	
  MultiPolygon	
  
• 52MB	
  without	
  index...
What	
  Zipcode	
  Is	
  My	
  Office	
  In?
• Geocoded	
  Address	
  
• Lat,Long	
  =	
  40.7356197,-­‐73.9891102	
  
• P...
What	
  Zipcode	
  Is	
  My	
  Office	
  In?
• No	
  Index
48
Seq Scan on zipcodes (cost=0.00..15382.00 rows=1 width=6) (a...
What	
  Zipcode	
  Is	
  My	
  Office	
  In?
• Here’s	
  the	
  GiST:
49
Index Scan using zipcodes_geom_gist on zipcodes (...
PostGIS	
  +	
  GiST	
  Conclusion
• 5153ms	
  =>	
  0.235ms	
  =	
  21,900x	
  speedup	
  :-­‐)
50
SELECT zcta5ce10 AS zi...
Full	
  Text	
  Search
• PostgreSQL	
  offers	
  full	
  text	
  search	
  with	
  the	
  
tsearch2	
  engine	
  
– algori...
Test	
  Data	
  Set
• Wikipedia	
  English	
  category	
  titles	
  –	
  all	
  
1,823,644	
  that	
  I	
  downloaded
52
Full-­‐Text	
  Search:	
  Basics
53
SELECT title
FROM category
WHERE
to_tsvector('english', title) @@ to_tsquery('united &...
Full-­‐Text	
  Search	
  +	
  GiST
54
CREATE INDEX category_title_gist_idx ON category
USING gist(to_tsvector('english', t...
Full	
  Text	
  Search	
  +	
  GiST
• GiST	
  indexes	
  can	
  produce	
  false	
  positives	
  
– "documents"	
  represe...
Performance	
  Summary	
  with	
  GiST
• initial	
  index	
  build	
  takes	
  awhile	
  =>	
  slow	
  writes	
  
• reads	...
GIN	
  Index
• "generalized	
  inverted	
  index"	
  
• supports	
  searching	
  within	
  composite	
  data	
  
– arrays,...
Full	
  Text	
  Search	
  +	
  GIN
58
CREATE INDEX category_title_gin_idx ON category
USING gin(to_tsvector('english', tit...
Performance	
  Summary	
  with	
  GIN
• index	
  build	
  was	
  much	
  quicker	
  
• significant	
  speedup	
  from	
  n...
What	
  Was	
  Not	
  Discussed
• Word	
  density	
  
– prior	
  to	
  9.3,	
  performance	
  issues	
  with	
  greater	
 ...
Full	
  Text	
  Search	
  –	
  GiST	
  vs	
  GIN
• Reads	
  
– overall,	
  GIN	
  should	
  win	
  
• Writes	
  
– traditi...
Regular	
  Expression	
  Indexes
• Added	
  in	
  9.3	
  
• Support	
  for	
  LIKE/ILIKE	
  wildcard	
  indexes	
  in	
  9...
Regular	
  Expressions	
  -­‐	
  No	
  Index
63
EXPLAIN ANALYZE SELECT title FROM category
WHERE title ~ '(([iI]sland(s)?)...
Regular	
  Expressions	
  -­‐	
  Indexed
64
CREATE INDEX category_title_regex_idx ON category
USING gin(title gin_trgm_ops...
Range	
  Types
• stores	
  range	
  data	
  
– 	
  1	
  to	
  6	
  
– 2013-­‐10-­‐29	
  –	
  2013-­‐11-­‐2	
  
• easy-­‐to...
Range	
  Type	
  Examples
66
--find all ranges that overlap with [100, 200)
!
SELECT * FROM ranges WHERE int4range(100, 20...
Range	
  Types	
  +	
  GiST
67
CREATE INDEX ranges_range_gist_idx ON ranges USING gist(range);
!
EXPLAIN ANALYZE SELECT * ...
SP-­‐GiST
• space-­‐partitioned	
  generalized	
  search	
  tree	
  
• ideal	
  for	
  non-­‐balanced	
  data	
  structure...
Range	
  Types:	
  GiST	
  vs	
  SP-­‐Gist
CREATE TABLE ranges AS
SELECT
int4range(
(random()*5)::int,
(random()*5)::int +...
N	
  =	
  1,000,000
70
CREATE INDEX ranges_range_spgist_idx ON ranges USING spgist(range);
ERROR: unexpected spgdoinsert()...
Range	
  Types:	
  GiST	
  vs	
  SP-­‐GiST
CREATE TABLE ranges AS
SELECT
int4range(x, x + (random()*5)::int + 5)
AS range
...
N	
  =	
  250,000
72
GiST	
  Used GiST	
  Time SP-­‐GiST	
  Used SP-­‐GiST	
  Time
= Yes 0.5 Yes 0.7
&& Yes 0.3 Yes 0.3
@>...
GiST	
  vs	
  SP-­‐GiST:	
  Space
73
GiST	
  Clustered SP-­‐GiST	
  Clustered GiST	
  Sparse SP-­‐GiST	
  Sparse
100K	
  S...
Integer	
  Arrays
74
CREATE UNLOGGED TABLE int_arrays AS
SELECT ARRAY[x, x + 1, x + 2] AS data
FROM generate_series(1,1000...
B-­‐Tree(?)	
  +	
  Integer	
  Arrays
75
EXPLAIN ANALYZE
SELECT *
FROM int_arrays
WHERE 5432 = ANY (data);
QUERY PLAN
----...
GIN	
  +	
  Integer	
  Arrays
76
EXPLAIN ANALYZE
SELECT *
FROM int_arrays
WHERE ARRAY[5432] <@ data;
QUERY PLAN
----------...
Hash	
  Indexes
• only	
  work	
  with	
  "="	
  operator	
  
• are	
  still	
  not	
  WAL	
  logged	
  as	
  of	
  9.4	
 ...
btree_gin
78
CREATE EXTENSION IF NOT EXISTS btree_gin;
!
CREATE UNLOGGED TABLE numbers AS
SELECT (random() * 2000)::int AS...
btree_gin	
  vs	
  btree
79
-- btree
SELECT pg_size_pretty(pg_total_relation_size('numbers_idx'));
pg_size_pretty
--------...
hstore	
  -­‐	
  the	
  PostgreSQL	
  Key-­‐Value	
  
Store
80
CREATE EXTENSION IF NOT EXISTS hstore;
!
CREATE UNLOGGED TA...
hstore	
  -­‐	
  the	
  PostgreSQL	
  Key-­‐Value	
  
Store
81
CREATE INDEX keypairs_data_gin_idx ON keypairs
USING gin(da...
JSONB:	
  Coming	
  in	
  9.4
82
INSERT INTO documents
SELECT row_to_json(ROW(x, x + 2, x + 3))::jsonb
FROM generate_serie...
Awesome	
  vs	
  WTF:	
  
A	
  Note	
  On	
  Operator	
  Indexability
83
EXPLAIN ANALYZE SELECT * FROM documents WHERE dat...
For	
  More	
  Information…
• http://www.postgresql.org/docs/current/static/
indexes.html	
  
• http://www.postgresql.org/...
Conclusion
• Postgres	
  has	
  *a	
  lot*	
  of	
  different	
  types	
  of	
  
indexes,	
  and	
  variations	
  on	
  ea...
Thanks!
• http://nyc.pgconf.us	
  
• Twitter:	
  @jkatz05
86
Upcoming SlideShare
Loading in …5
×

Indexing Complex PostgreSQL Data Types

15,872 views

Published on

PostgreSQL comes built-in with a variety of indexes, some of which are further extensible to build powerful new indexing schemes. But what are all these index types? What are some of the special features of these indexes? What are the size & performance tradeoffs? How do I know which ones are appropriate for my application?

Fortunately, this talk aims to answer all of these questions as we explore the whole family of PostgreSQL indexes: B-tree, expression, GiST (of all flavors), GIN and how they are used in theory and practice.

Published in: Technology
  • If someone knows links to sound or text or some record with this report, please, share the link? I can't find this report on youtuve.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Indexing Complex PostgreSQL Data Types

  1. 1. Indexing  Complex  PostgreSQL   Data  Types Jonathan  S.  Katz  -­‐  PGDay  UK  2014  -­‐  July  9,  2014
  2. 2. About • Jonathan  S.  Katz   – CTO,  VenueBook   – Co-­‐Organizer,  NYC  PostgreSQL  User  Group   – PGConf  NYC  2015   • Mar  25  -­‐  27,  2015   • New  York  Marriott  Downtown   • http://nyc.pgconf.us   – @jkatz05 2
  3. 3. Quick  Overview • Introductory  Talk  with  demos  and  fun   • B-­‐trees   • GiST:  Generalized  Search  Trees   • GIN:  Generalized  Inverted  Index   • SP-­‐GiST:  Space  Partitioned  Generalized  Search   Trees 3
  4. 4. Assumptions • PostgreSQL  9.3+   • most  will  be  9.0+   • PostGIS  2.0+   • Believe  it  will  work  for  most  available   versions   • PostgreSQL  9.4  beta? 4
  5. 5. The  Beginning 5
  6. 6. The  Beginning 6 4 3 8 7 2 9 1 6 10 6
  7. 7. Find  3 7 4 3 8 7 2 9 1 6 10 5
  8. 8. Find  3 8 4 3 8 7 2 9 1 6 10 5
  9. 9. Find  3 9 4 3 8 7 2 9 1 6 10 5
  10. 10. Find  10 10 4 3 8 7 2 9 1 6 10 5
  11. 11. Find  10 11 4 3 8 7 2 9 1 6 10 5
  12. 12. Find  10 12 4 3 8 7 2 9 1 6 10 5
  13. 13. Find  10 13 4 3 8 7 2 9 1 6 10 5
  14. 14. Find  10 14 4 3 8 7 2 9 1 6 10 5
  15. 15. Find  10 15 4 3 8 7 2 9 1 6 10 5
  16. 16. Find  10 16 4 3 8 7 2 9 1 6 10 5
  17. 17. Find  10 17 4 3 8 7 2 9 1 6 10 5
  18. 18. Find  10 18 4 3 8 7 2 9 1 6 10 5
  19. 19. Find  10 19 4 3 8 7 2 9 1 6 10 5
  20. 20. What  We  Learned • Without  any  data  structure  around  search,  we   rely  on  "hope"   • Assumed  "unique  values"  and  "equality"   – would  have  to  scan  all  rows  otherwise   • …and  what  about:   – INSERT   – UPDATE   – DELETE 20
  21. 21. What  We  Need • Need  a  data  structure  for  search  that:   – allows  efficient  lookups   – plays  nicely  with  disk  I/O   – does  not  take  too  long  for  updates   21
  22. 22. B-­‐Trees • "default"  index   • quick  traversal  to  leaf  nodes   • leaf  nodes  pre-­‐sorted   • node  size  designed  to  fit  in  disk  block  size   – "degree"  of  nodes  has  max-­‐size   • theoretical  performance   – reads:  O(log  n)   – writes:  O(log  n)   – space:  O(n) 22
  23. 23. B-­‐Trees  and  PostgreSQL • supports   – <=,  <,  =,  >,  >=   – BETWEEN,  IN   – IS  NOT  NULL,  IS  NULL   – LIKE  in  specific  case  of  ‘plaintext%’   – ~  in  specific  case  of  ‘^plaintext’   – ILIKE  and  ~*  if  pattern  starts  with  nonalpha  characters   • does  not  support   • IS  NOT  DISTINCT  FROM 23
  24. 24. B-­‐Trees  and  PostgreSQL • data  types  supported   – any  data  type  with  all  the  equality  operators  defined   – number  types   • integer,  numeric,  decimal   – text   • char,  varchar,  text   – date  /  times   • timestamptz,  timestamp,  date,  time,  timetz,  interval   - arrays,  ranges 24
  25. 25. Demo  Specs  +  Configuration • Hardware  specs   – 2.3GHz  Intel  i7  2x4  core   – 16GB  RAM  DDR3  1600MHz   – Apple  SSD  6Gbps   • postgresql.conf  (9.3,  9.4  beta  1)   – shared_buffers  =  1GB   – work_mem  =  64MB   – maintenance_work_mem  =  1024MB   – effective_cache_size  =  8GB 25
  26. 26. Demo  #1:  Basic  Indexing  Plans 26
  27. 27. Demo  #1  Notes • Index  maintenance   • VACUUM  –  "cleans  up"  after  writes  on  table  / indexes   – ANALYZE  –  keeps  statistics  up-­‐to-­‐date  for  planner   ! VACUUM ANALYZE tablename;   ! • Good  idea  to  leave  autovacuum  on 27
  28. 28. Indexing  in  Production • CREATE  INDEX  CONCURRENTLY   • REINDEX   – corruption,  bloat,  invalid   • FILLFACTOR   – 10  –  100   – default:  90   – strategy:  lower  %  ::  write-­‐activity   • TABLESPACE   • NULLS  LAST,  NULLS  FIRST 28
  29. 29. Demo  #2:  Partial  Indexes 29 CREATE INDEX indexname ON tablename (columnname) WHERE somecondition;
  30. 30. Demo  #2  Notes • Partial  Indexes  are   – good  if  known  to  query  limited  subset  of  table   – take  up  less  space   – allow  for  much  quicker  writes   • Like  all  good  things,  do  not  overuse  and   saturate  your  I/O 30
  31. 31. Unique  Indexes • only  for  B-­‐trees   • NULL  not  unique   • use  UNIQUE  constraints  –  automatically  create   indexes   ! CREATE TABLE foo (bar int UNIQUE);   -- or   CREATE UNIQUE INDEX foo_bar_idx ON foo (bar);   ALTER TABLE foo ADD CONSTRAINT a_unique USING INDEX a_idx; 31
  32. 32. Multi-­‐Column  Indexes • Useful  for   – querying  two  columns  that  are  frequently  queried   together   – enforcing  UNIQUEness  across  columns   • n.b.  creating  UNIQUE  constraint  on  table  creates  UNIQUE   INDEX   • PostgreSQL  supports   – up  to  32  columns   – B-­‐tree,  GiST,  GIN   • Be  careful  of  how  you  choose  initial  column  order! 32
  33. 33. Multi-­‐Column  Indexes 33 CREATE INDEX multicolumn_idx ON tablename (col1, col2); ! ! ! CREATE UNIQUE INDEX pn_idx ON phone_numbers (country_code, national_number) WHERE extension IS NULL
  34. 34. Demo  #3  Notes • Multi-­‐column  indexes  can  be   – efficient  for  speed  +  space   – inefficient  with  performance   • Usage  depends  on  your  application  needs 34
  35. 35. Expression  Indexes • can  index  on  expressions  to  speed  up  lookups   – e.g.  case  insensitive  email  addresses   – can  use  functions  or  scalars   • (x * y) / 100   • COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')   • LOWER(email_address)   • tradeoff:  slower  writes 35
  36. 36. Demo  #4:  Expression  Indexes 36
  37. 37. Demo  #4  Notes • fast  lookups,  slow  updates   • size  on  disk   • cool  data  types   – geometric   – JSON 37
  38. 38. Geometric  Data  Types 38
  39. 39. Geometric  Data  Types CREATE TABLE points (coord point); ! CREATE INDEX points_idx ON points (coord); ERROR: data type point has no default operator class for access method "btree" HINT: You must specify an operator class for the index or define a default operator class for the data type. 39
  40. 40. GiST • "generalized  search  tree"   • infrastructure  that  provides  template  to  create  arbitrary   indexing  schemes   – supports  concurrency,  logging,  searching  –  only  have  to  define   behavior   – user-­‐defined  operator  class   • <<,  &<,  &>,  >>,  <<|,  &<|,  |&>,  |>>,  @>,  <@,  ~=,  &&   – have  to  implement  functions  in  interface   • supports  lossless  +  lossy  indexes   • provides  support  for  "nearest-­‐neighbor"  queries  –   "KNN-­‐Gist" 40 CREATE INDEX points_coord_gist_idx ON points USING gist(coord)
  41. 41. Demo  #5:  Geometric  GiST  +  KNN-­‐Gist 41
  42. 42. Demo  #5  Notes • GiST  indexes  on  geometric  types  radically   speedup  reads   • Writes  are  slower  due  to  distance  calculation   • Index  size  can  be  very  big 42
  43. 43. PostGIS • For  when  you  are  doing  real  things  with  shapes 43• (and  geographic  information  systems)
  44. 44. PostGIS  +  Indexes • B-­‐Tree?   • R-­‐Tree?   • PostGIS  docs  do  not  recommend  using  just   an  R-­‐Tree  index   • GiST   • overlaps!    containment!   • uses  a  combination  of  GiST  +  R-­‐Tree 44
  45. 45. PostGIS  +  GiST 45 2-­‐D CREATE INDEX zipcodes_geom_gist_idx ON zipcodes USING gist(geom); N-­‐D  (PostGIS  2.0+ CREATE INDEX zipcodes_geom_gist_idx ON zipcodes USING gist(geom gist_geometry_ops_nd);
  46. 46. Example  -­‐  USA  Zipcode  Boundaries • 33,120  rows   • geom:  MultiPolygon   • 52MB  without  indexes   • With  geometry  GiST  +  integer  B-­‐Tree:  869MB 46
  47. 47. What  Zipcode  Is  My  Office  In? • Geocoded  Address   • Lat,Long  =  40.7356197,-­‐73.9891102   • PostGIS:  POINT(-­‐73.9891102  40.7356197)   • 4269  -­‐  “SRID”  -­‐  unique  ID  for  coordinate  system   definitions 47 SELECT zcta5ce10 AS zipcode FROM zipcodes WHERE ST_Contains( geom, --MultiPolygon ST_GeomFromText('POINT(-73.9891102 40.7356197)', 4269) );
  48. 48. What  Zipcode  Is  My  Office  In? • No  Index 48 Seq Scan on zipcodes (cost=0.00..15382.00 rows=1 width=6) (actual time=64.780..5153.485 rows=1 loops=1) Filter: ((geom && '0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry) AND _st_contains(geom, '0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry)) Rows Removed by Filter: 33119 Total runtime: 5153.505 ms
  49. 49. What  Zipcode  Is  My  Office  In? • Here’s  the  GiST: 49 Index Scan using zipcodes_geom_gist on zipcodes (cost=0.28..8.54 rows=1 width=6) (actual time=0.120..0.207 rows=1 loops=1) Index Cond: (geom && '0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry) Filter: _st_contains(geom, '0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry) Rows Removed by Filter: 1 ! Total runtime: 0.235 ms
  50. 50. PostGIS  +  GiST  Conclusion • 5153ms  =>  0.235ms  =  21,900x  speedup  :-­‐) 50 SELECT zcta5ce10 AS zipcode FROM zipcodes WHERE ST_Contains( geom, ST_GeomFromText('POINT(-73.9891102 40.7356197)', 4269) ); ! zipcode --------- 10003
  51. 51. Full  Text  Search • PostgreSQL  offers  full  text  search  with  the   tsearch2  engine   – algorithms  for  performing  FTS   – to_tsvector('english',  content)  @@   to_tsquery('irish  &  conference  |  meeting')   – provides  indexing  capabilities  for  efficient  search 51
  52. 52. Test  Data  Set • Wikipedia  English  category  titles  –  all   1,823,644  that  I  downloaded 52
  53. 53. Full-­‐Text  Search:  Basics 53 SELECT title FROM category WHERE to_tsvector('english', title) @@ to_tsquery('united & kingdom’); ! title ----- Lists of railway stations in the United Kingdom Political history of the United Kingdom Military of the United Kingdom United Kingdom constitution Television channels in the United Kingdom United Kingdom Roman Catholic secondary schools in the United Kingdom [results truncated] ! ! QUERY PLAN ------------ Seq Scan on category (cost=0.00..49262.77 rows=46 width=29) (actual time=21.900..16809.890 rows=8810 loops=1) Filter: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom'::text)) Rows Removed by Filter: 1814834 ! Total runtime: 16811.108 ms
  54. 54. Full-­‐Text  Search  +  GiST 54 CREATE INDEX category_title_gist_idx ON category USING gist(to_tsvector('english', title)); ! SELECT title FROM category WHERE to_tsvector('english', title) @@ to_tsquery('united & kingdom'); QUERY PLAN ------------- Bitmap Heap Scan on category (cost=4.77..182.47 rows=46 width=29) (actual time=75.517..180.650 rows=8810 loops=1) Recheck Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom'::text)) -> Bitmap Index Scan on category_title_gist_idx (cost=0.00..4.76 rows=46 width=0) (actual time=74.687..74.687 rows=8810 loops=1) Index Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom’::text)) ! Total runtime: 181.354 ms
  55. 55. Full  Text  Search  +  GiST • GiST  indexes  can  produce  false  positives   – "documents"  represented  by  fixed  length  signature   • words  are  hashed  into  single  bits  and  concatenated   – when  false  positive  occurs,  row  is  returned  and   checked  to  see  if  false  match   • Extra  validations  =  performance  degradation 55
  56. 56. Performance  Summary  with  GiST • initial  index  build  takes  awhile  =>  slow  writes   • reads  are  quick   • Table  size:  271MB   • Index  size:  83MB 56
  57. 57. GIN  Index • "generalized  inverted  index"   • supports  searching  within  composite  data   – arrays,  full-­‐text  documents,  hstore   • key  is  stored  once  and  points  to  composites  it  is   contained  in   • like  GiST,  provides  index  infrastructure  to  extend  GIN   based  on  behavior   – supports  operators  <@,  @>,  =,  &&   • GIN  performance  ⬄  log(#  unique  things) 57
  58. 58. Full  Text  Search  +  GIN 58 CREATE INDEX category_title_gin_idx ON category USING gin(to_tsvector('english', title)); ! EXPLAIN ANALYZE SELECT title FROM category WHERE to_tsvector('english', title) @@ to_tsquery('united & kingdom'); ! QUERY PLAN ------- Bitmap Heap Scan on category (cost=28.36..206.06 rows=46 width=29) (actual time=8.864..14.674 rows=8810 loops=1) Recheck Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom'::text)) -> Bitmap Index Scan on category_title_gin_idx (cost=0.00..28.35 rows=46 width=0) (actual time=7.905..7.905 rows=8810 loops=1) Index Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom'::text)) ! ! Total runtime: 15.157 ms
  59. 59. Performance  Summary  with  GIN • index  build  was  much  quicker   • significant  speedup  from  no  index   – (12,000ms  =>  15ms)   • significant  speedup  from  GiST   – (181ms  =>  15ms)   • Table  size:  271MB   • Index  size:   • 9.3:  71MB   • 9.4  beta  1:    40MB 59
  60. 60. What  Was  Not  Discussed • Word  density   – prior  to  9.3,  performance  issues  with  greater  word   density   • Type  of  text  data  –  phrases  vs  paragraphs 60
  61. 61. Full  Text  Search  –  GiST  vs  GIN • Reads   – overall,  GIN  should  win   • Writes   – traditionally,  GiST  has  better  performance  for   writes   – GIN   • FASTUPDATE   • 9.4:  compression 61
  62. 62. Regular  Expression  Indexes • Added  in  9.3   • Support  for  LIKE/ILIKE  wildcard  indexes  in  9.1   – title  LIKE  '%ab%e'   • Uses  pg_trgm  extension  +  GIN   ! CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE INDEX category_title_regex_idx ON category USING GIN(title gin_trgm_ops); 62
  63. 63. Regular  Expressions  -­‐  No  Index 63 EXPLAIN ANALYZE SELECT title FROM category WHERE title ~ '(([iI]sland(s)?)|([pP]eninsula))$'; ! QUERY PLAN ---------- Seq Scan on category (cost=0.00..40144.55 rows=182 width=29) (actual time=2.509..4260.792 rows=5878 loops=1) Filter: (title ~ '(([iI]sland(s)?)|([pP]eninsula))$'::text) Rows Removed by Filter: 1817766 ! Total runtime: 4261.204 ms
  64. 64. Regular  Expressions  -­‐  Indexed 64 CREATE INDEX category_title_regex_idx ON category USING gin(title gin_trgm_ops); ! EXPLAIN ANALYZE SELECT title FROM category WHERE title ~ '(([iI]sland(s)?)|([pP]eninsula))$'; QUERY PLAN ----------- Bitmap Heap Scan on category (cost=197.41..871.77 rows=182 width=29) (actual time=107.445..146.713 rows=5878 loops=1) Recheck Cond: (title ~ '(([iI]sland(s)?)|([pP]eninsula))$'::text) Rows Removed by Index Recheck: 4712 -> Bitmap Index Scan on category_title_regex_idx (cost=0.00..197.37 rows=182 width=0) (actual time=106.645..106.645 rows=10590 loops=1) Index Cond: (title ~ '(([iI]sland(s)?)|([pP]eninsula))$'::text) ! ! Total runtime: 147.026 ms
  65. 65. Range  Types • stores  range  data   –  1  to  6   – 2013-­‐10-­‐29  –  2013-­‐11-­‐2   • easy-­‐to-­‐use  operators  to  check  inclusion,   overlaps   • built-­‐in  types:  integers,  numerics,  dates,   timestamps   • extensible 65
  66. 66. Range  Type  Examples 66 --find all ranges that overlap with [100, 200) ! SELECT * FROM ranges WHERE int4range(100, 200) && range; ! range ----------- [10,102) [13,102) [18,108) [32,101) [34,134) [37,123) [43,111) [46,132) [48,107) [results trunctated] !QUERY PLAN ----------- Seq Scan on ranges (cost=0.00..14239.86 rows=7073 width=32) (actual time=0.018..185.411 rows=143 loops Filter: ('[100,200)'::int4range && range) Rows Removed by Filter: 999857 ! Total runtime: 185.439 ms
  67. 67. Range  Types  +  GiST 67 CREATE INDEX ranges_range_gist_idx ON ranges USING gist(range); ! EXPLAIN ANALYZE SELECT * FROM ranges WHERE int4range(100, 200) && range; ! QUERY PLAN ------------ Bitmap Heap Scan on ranges (cost=5.29..463.10 rows=130 width=13) (actual time=0.120..0.135 rows=144 loops=1) Recheck Cond: ('[100,200)'::int4range && range) -> Bitmap Index Scan on ranges_range_gist_idx (cost=0.00..5.26 rows=130 width=0) (actual time=0.109..0.109 rows=144 loops=1) Index Cond: ('[100,200)'::int4range && range) ! ! Total runtime: 0.168 ms
  68. 68. SP-­‐GiST • space-­‐partitioned  generalized  search  tree   • ideal  for  non-­‐balanced  data  structures   – k-­‐d  trees,  quad-­‐trees,  suffix  trees   – divides  search  space  into  partitions  of  unequal  size   • matching  partitioning  rule  =  fast  search   • traditionally  for  "in-­‐memory"  transactions,   converted  to  play  nicely  with  I/O 68
  69. 69. Range  Types:  GiST  vs  SP-­‐Gist CREATE TABLE ranges AS SELECT int4range( (random()*5)::int, (random()*5)::int + 5 ) AS range FROM generate_series(1,<N>) x; ! SELECT * FROM ranges WHERE range <operator> int4range(3,6); 69
  70. 70. N  =  1,000,000 70 CREATE INDEX ranges_range_spgist_idx ON ranges USING spgist(range); ERROR: unexpected spgdoinsert() failure Fixed in 9.3.2 GiST  Used GiST  Time SP-Gist Used SP-GiST Time= Yes 121 Yes 37 && No 257  No 260 @> No 223 No 223 <@ Yes 163  Yes 111 << Yes 95 Yes 5 >> Yes 95 Yes 25 &< No 184 No 185 &> No 203 No 203
  71. 71. Range  Types:  GiST  vs  SP-­‐GiST CREATE TABLE ranges AS SELECT int4range(x, x + (random()*5)::int + 5) AS range FROM generate_series(1,<N>) x; 71
  72. 72. N  =  250,000 72 GiST  Used GiST  Time SP-­‐GiST  Used SP-­‐GiST  Time = Yes 0.5 Yes 0.7 && Yes 0.3 Yes 0.3 @> Yes 0.3 Yes 0.3 <@ Yes 0.06 Yes 0.25 << No 40 Yes 0.2 >> No 60 No 60 &< Yes 0.3 Yes 0.2 &> No 74 No 61
  73. 73. GiST  vs  SP-­‐GiST:  Space 73 GiST  Clustered SP-­‐GiST  Clustered GiST  Sparse SP-­‐GiST  Sparse 100K  Size 6MB 5MB 6MB 11MB 100K  Time 0.5s .4s 2.5s 7.8s 250K  Size 15MB 12MB 15MB 28MB 250K  Time 1.5s 1.1s 6.3s 47.2s 500K  Size 30MB 25MB 30MB 55MB 500K  Time 3.1s 3.0s 13.9s 192s 1MM  Size 59MB 52MB! 60MB 110MB 1MM  Time 5.1s 5.7s 29.2 777s
  74. 74. Integer  Arrays 74 CREATE UNLOGGED TABLE int_arrays AS SELECT ARRAY[x, x + 1, x + 2] AS data FROM generate_series(1,1000000) x; ! CREATE INDEX int_arrays_data_idx ON int_arrays (data); ! CREATE INDEX int_arrays_data_gin_idx ON int_arrays USING GIN(data);
  75. 75. B-­‐Tree(?)  +  Integer  Arrays 75 EXPLAIN ANALYZE SELECT * FROM int_arrays WHERE 5432 = ANY (data); QUERY PLAN ----------- Seq Scan on int_arrays (cost=0.00..30834.00 rows=5000 width=33) (actual time=1.260..159.197 rows=3 loops=1) Filter: (5432 = ANY (data)) Rows Removed by Filter: 999997 ! Total runtime: 159.222 ms
  76. 76. GIN  +  Integer  Arrays 76 EXPLAIN ANALYZE SELECT * FROM int_arrays WHERE ARRAY[5432] <@ data; QUERY PLAN ----------- Bitmap Heap Scan on int_arrays (cost=70.75..7680.14 rows=5000 width=33) (actual time=0.020..0.021 rows=3 loops=1) Recheck Cond: ('{5432}'::integer[] <@ data) -> Bitmap Index Scan on int_arrays_data_gin_idx (cost=0.00..69.50 rows=5000 width=0) (actual time=0.014..0.014 rows=3 loops=1) Index Cond: ('{5432}'::integer[] <@ data) ! Total runtime: 0.045 ms
  77. 77. Hash  Indexes • only  work  with  "="  operator   • are  still  not  WAL  logged  as  of  9.4  beta  1   – not  crash  safe   – not  replicated 77
  78. 78. btree_gin 78 CREATE EXTENSION IF NOT EXISTS btree_gin; ! CREATE UNLOGGED TABLE numbers AS SELECT (random() * 2000)::int AS a FROM generate_series(1, 2000000) x; ! CREATE INDEX numbers_gin_idx ON numbers USING gin(a); ! EXPLAIN ANALYZE SELECT * FROM numbers WHERE a = 1000; ! QUERY PLAN ------------ Bitmap Heap Scan on numbers (cost=113.50..9509.26 rows=10000 width=4) (actual time=0.388..1.459 rows=991 loops=1) Recheck Cond: (a = 1000) -> Bitmap Index Scan on numbers_gin_idx (cost=0.00..111.00 rows=10000 width=0) (actual time=0.232..0.232 rows=991 loops=1) Index Cond: (a = 1000) ! Total runtime: 1.563 ms
  79. 79. btree_gin  vs  btree 79 -- btree SELECT pg_size_pretty(pg_total_relation_size('numbers_idx')); pg_size_pretty ---------------- 43 MB ! ! ! -- GIN SELECT pg_size_pretty(pg_total_relation_size('numbers_gin_idx')); pg_size_pretty ---------------- 16 MB • Only  use  GIN  over  btree  if  you  have  a  lot  of  duplicate  entries
  80. 80. hstore  -­‐  the  PostgreSQL  Key-­‐Value   Store 80 CREATE EXTENSION IF NOT EXISTS hstore; ! CREATE UNLOGGED TABLE keypairs AS SELECT (x || ' => ' || (x + (random() * 5)::int))::hstore AS data FROM generate_series(1,1000000) x; SELECT pg_size_pretty(pg_relation_size('keypairs')); ! ! SELECT * FROM keypairs WHERE data ? ‘3'; data ---------- "3"=>"4" ! EXPLAIN ANALYZE SELECT * FROM keypairs WHERE data ? ‘3'; QUERY PLAN ----------- Seq Scan on keypairs (cost=0.00..19135.06 rows=950 width=32) (actual time=0.065..208.808 rows=1 loops=1) Filter: (data ? '3'::text) Rows Removed by Filter: 999999 ! Total runtime: 208.825 ms
  81. 81. hstore  -­‐  the  PostgreSQL  Key-­‐Value   Store 81 CREATE INDEX keypairs_data_gin_idx ON keypairs USING gin(data); ! EXPLAIN ANALYZE SELECT * FROM keypairs WHERE data ? ‘3'; ! QUERY PLAN ----------- Bitmap Heap Scan on keypairs (cost=27.75..2775.66 rows=1000 width=24) (actual time=0.044..0.045 rows=1 loops=1) Recheck Cond: (data ? '3'::text) -> Bitmap Index Scan on keypairs_data_gin_idx (cost=0.00..27.50 rows=1000 width=0) (actual time=0.039..0.039 rows=1 loops=1) Index Cond: (data ? '3'::text) ! Total runtime: 0.071 ms
  82. 82. JSONB:  Coming  in  9.4 82 INSERT INTO documents SELECT row_to_json(ROW(x, x + 2, x + 3))::jsonb FROM generate_series(1,1000000) x; ! ! CREATE INDEX documents_data_gin_idx ON documents USING gin(data jsonb_path_ops); ! ! ! SELECT * FROM documents WHERE data @> '{ "f1": 10 }'; data -------------------------------- {"f1": 10, "f2": 12, "f3": 13} ! ! Execution time: 0.084 ms
  83. 83. Awesome  vs  WTF:   A  Note  On  Operator  Indexability 83 EXPLAIN ANALYZE SELECT * FROM documents WHERE data @> '{ "f1": 10 }'; ! QUERY PLAN ----------- Bitmap Heap Scan on documents (cost=27.75..3082.65 rows=1000 width=66) (actual time=0.029..0.030 rows=1 loops=1) Recheck Cond: (data @> '{"f1": 10}'::jsonb) Heap Blocks: exact=1 -> Bitmap Index Scan on documents_data_gin_idx (cost=0.00..27.50 rows=1000 width=0) (actual time=0.014..0.014 rows=1 loops=1) Index Cond: (data @> '{"f1": 10}'::jsonb) ! Execution time: 0.084 ms ! EXPLAIN ANALYZE SELECT * FROM documents WHERE '{ "f1": 10 }' <@ data; ! QUERY PLAN ----------- Seq Scan on documents (cost=0.00..24846.00 rows=1000 width=66) (actual time=0.015..245.924 rows=1 loops=1) Filter: ('{"f1": 10}'::jsonb <@ data) Rows Removed by Filter: 999999 ! Execution time: 245.947 ms
  84. 84. For  More  Information… • http://www.postgresql.org/docs/current/static/ indexes.html   • http://www.postgresql.org/docs/current/static/ gist.html   • http://www.postgresql.org/docs/current/static/ gin.html   • http://www.postgresql.org/docs/current/static/ spgist.html   • GiST  +  GIN  +  Full  Text  Search:   – http://www.postgresql.org/docs/current/static/textsearch-­‐ indexes.html 84
  85. 85. Conclusion • Postgres  has  *a  lot*  of  different  types  of   indexes,  and  variations  on  each  of  its  engines   • Extensions  make  use  of  PostgreSQL  indexes   – PostGIS   • Need  to  understand  where  index  usage  is   appropriate  in  your  application 85
  86. 86. Thanks! • http://nyc.pgconf.us   • Twitter:  @jkatz05 86

×