Craig Kerstiens - Scalable Uniques in Postgres @ Postgres Open


  1. Scalable Uniques in Postgres - Craig Kerstiens, Heroku Postgres
  2. Postgresql-HLL
  3. Truviso • Extended Postgres to do streaming • Various markets • Ad space • Wanted unique impressions • Sort of wanted unique impressions
  4. SELECT count(*)
  5. Approx Top K
  6. Compressed Bitmap
  7. HyperLogLog
  8. HyperLogLog • KMV - K minimum value
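
The K-minimum-value idea named on this slide can be sketched in a few lines of Python: hash every value to a number in (0, 1], keep only the k smallest distinct hashes, and estimate the distinct count as (k - 1) divided by the k-th smallest hash. This is an illustrative sketch, not code from the talk or from postgresql-hll; the names `unit_hash` and `kmv_estimate`, the SHA-1 hash, and the default `k=256` are assumptions.

```python
import hashlib
import heapq


def unit_hash(value):
    """Map a value to a pseudo-random float in (0, 1] (SHA-1 is an arbitrary choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2.0 ** 64


def kmv_estimate(values, k=256):
    """Estimate the number of distinct values using only the k smallest hashes."""
    heap = []        # max-heap (hashes stored negated) holding the k smallest hashes
    members = set()  # hashes currently in the heap, used to skip duplicates
    for value in values:
        h = unit_hash(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        # Fewer than k distinct values seen, so the count is exact.
        return float(len(heap))
    kth_smallest = -heap[0]
    return (k - 1) / kth_smallest
```

Memory stays bounded by k no matter how long the input stream is, and the relative error shrinks roughly as 1/sqrt(k).
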
  9. HyperLogLog • KMV - K minimum value • Bit-pattern observables
  10. HyperLogLog • KMV - K minimum value • Bit-pattern observables • Stochastic averaging
  11. HyperLogLog • KMV - K minimum value • Bit-pattern observables • Stochastic averaging • Harmonic averaging
  13. HyperLogLog • KMV - K minimum value • Bit-pattern observables • Stochastic averaging • Harmonic averaging • Implemented by Aggregate Knowledge
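
To show how the ingredients on this slide fit together, here is a minimal HyperLogLog sketch in Python. It is not the Aggregate Knowledge implementation: the class name `TinyHLL`, the helper `hash64`, the SHA-1 hash, and the default `log2m=11` are assumptions for illustration, and the explicit and sparse representations used by the extension are left out.

```python
import hashlib
import math


def hash64(value):
    """Hash a value to a 64-bit integer (SHA-1 truncated; an arbitrary choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TinyHLL:
    def __init__(self, log2m=11):
        # The alpha constant used in cardinality() assumes at least 128 registers.
        assert log2m >= 7
        self.log2m = log2m
        self.m = 1 << log2m
        self.registers = [0] * self.m

    def add(self, value):
        h = hash64(value)
        # Stochastic averaging: the low log2m bits pick one of m registers,
        # so a single hash function behaves like m independent experiments.
        index = h & (self.m - 1)
        rest = h >> self.log2m
        # Bit-pattern observable: 1-based position of the lowest set bit.
        # Seeing position r suggests roughly 2**r distinct values hit this register.
        if rest == 0:
            rank = 64 - self.log2m + 1
        else:
            rank = (rest & -rest).bit_length()
        if rank > self.registers[index]:
            self.registers[index] = rank

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        # Harmonic averaging: combine registers through the sum of 2**-rank,
        # which damps the effect of a few unusually large ranks.
        harmonic_sum = sum(2.0 ** -r for r in self.registers)
        raw = alpha * m * m / harmonic_sum
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting) from the original paper.
            return m * math.log(m / zeros)
        return raw
```

Adding the same value twice never changes a register, which is why duplicates do not inflate the estimate.
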
  14. HyperLogLog: Probabilistic uniques with small footprint
  15. HyperLogLog: Probabilistic uniques with small footprint • Close enough distinct with small footprint
  16. Use cases
  17. Use cases • Semi distinct count • Think pg_stat_statements • Ad networks • Web traffic
  18. Use cases • Semi distinct count • Think pg_stat_statements • Ad networks • Web traffic • With rollups/groupings
  19. Digging in
      CREATE EXTENSION hll;
      CREATE TABLE helloworld (
          id   integer,
          set  hll
      );
  21. Inserting data
      UPDATE helloworld
      SET set = hll_add(set, hll_hash_integer(12345))
      WHERE id = 1;
      UPDATE helloworld
      SET set = hll_add(set, hll_hash_text('hello world'))
      WHERE id = 1;
  22. Real world
      CREATE TABLE daily_uniques (
          date   date UNIQUE,
          users  hll
      );
  23. Real world
      INSERT INTO daily_uniques(date, users)
      SELECT date, hll_add_agg(hll_hash_integer(user_id))
      FROM users
      GROUP BY 1;
  24. Real world
      SELECT
          EXTRACT(MONTH FROM date) AS month,
          hll_cardinality(hll_union_agg(users))
      FROM daily_uniques
      WHERE date >= '2012-01-01' AND
            date <  '2013-01-01'
      GROUP BY 1;
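
The monthly rollup above works because HyperLogLog sketches can be merged without losing information: the union of two sketches is the register-wise maximum, and it equals the sketch that would have been built from the combined raw data. The Python sketch below illustrates that property only. It is not the extension's code; `registers_for`, `union`, the SHA-1 hash, and `LOG2M = 11` are assumptions, and the user id ranges are made up.

```python
import hashlib

LOG2M = 11  # assumed register-count exponent for this illustration


def registers_for(values, log2m=LOG2M):
    """Build HyperLogLog registers for an iterable of values."""
    m = 1 << log2m
    registers = [0] * m
    for value in values:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h & (m - 1)   # low bits choose the register
        rest = h >> log2m     # remaining bits supply the bit pattern
        # 1-based position of the lowest set bit in the remaining bits.
        rank = (rest & -rest).bit_length() if rest else 64 - log2m + 1
        registers[index] = max(registers[index], rank)
    return registers


def union(a, b):
    """Merge two sketches of the same size: register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


# Two hypothetical days with overlapping user ids.
monday = registers_for(range(0, 6000))
tuesday = registers_for(range(4000, 10000))
both_days = registers_for(range(0, 10000))

print(union(monday, tuesday) == both_days)  # True
```

Summing the two daily cardinalities would count the overlapping users twice; merging the sketches first does not.
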
  26. Good practices
  28. Good practices • It uses update
  29. Good practices • It uses update • Do as a batch in most cases
  30. Good practices • It uses update • Do as a batch in most cases • Tweak the config
  31. Tuning Parameters
  32. Tuning Parameters • log2m - log base 2 of the number of registers • Between 4 and 17 • Each increase of 1 doubles storage
  33. Tuning Parameters • log2m - log base 2 of the number of registers • Between 4 and 17 • Each increase of 1 doubles storage • regwidth - bits per register
  34. Tuning Parameters • log2m - log base 2 of the number of registers • Between 4 and 17 • Each increase of 1 doubles storage • regwidth - bits per register • expthresh - threshold for explicit vs sparse
  35. Tuning Parameters • log2m - log base 2 of the number of registers • Between 4 and 17 • Each increase of 1 doubles storage • regwidth - bits per register • expthresh - threshold for explicit vs sparse • sparseon - on/off for sparse
  36. Is it better?
  37. 1280 bytes • Estimates counts in the tens of billions • Few percent error
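
The 1280-byte figure and the "few percent" error can be tied back to the tuning parameters with a little arithmetic. The Python sketch below assumes log2m = 11 and regwidth = 5, which reproduce the 1280 bytes on the slide, counts only the dense register array (no header bytes, no explicit or sparse representation), and uses the standard HyperLogLog error estimate of about 1.04 / sqrt(number of registers). The function names are hypothetical.

```python
import math


def dense_bytes(log2m, regwidth):
    """Bytes needed for 2**log2m registers of regwidth bits each."""
    return (2 ** log2m) * regwidth // 8


def relative_error(log2m):
    """Approximate standard error of a HyperLogLog estimate."""
    return 1.04 / math.sqrt(2 ** log2m)


print(dense_bytes(11, 5))                  # 1280
print(dense_bytes(12, 5))                  # 2560, one more bit of log2m doubles storage
print(round(relative_error(11) * 100, 1))  # 2.3 (percent)
```

Raising log2m by 1 doubles storage but only improves the error by a factor of about 1.41, so the parameter is best chosen from the accuracy actually needed.
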
  38. Resources • https://github.com/aggregateknowledge/postgresql-hll • http://blog.aggregateknowledge.com/2013/02/04/open-source-release-postgresql-hll/ • http://tapoueh.org/blog/2013/02/25-postgresql-hyperloglog
  39. Questions
