Vertica

Published in: Technology

  1. 1. Vertica
  2. 2. Why? ● Postgres benchmarks (2014-01-13) ● Remember, these queries are expected to run within a web request/response cycle! ● After 60 seconds, connections time out ● We are used to web pages loading in 1-2 seconds
  3. 3. Count ● SELECT count(*) FROM transactions ● (229527.0ms) ● => [{"count"=>78144197}] ● SELECT count(*) FROM transactions WHERE client_id = 131 ● (85451.0ms) ● => [{"count"=>34406416}]
  4. 4. Yikes.
  5. 5. Don't panic! (and carry a towel)
  6. 6. We have a few tricks. ● What if we had a table with 1 row per client that tracked the count of transactions for that client?

     id  client_id  count_transactions
     1   131        34406416
     2   132        10587625
     3   133        85095

     ● What if we wired this table up to a SQL parser?
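The aggregate-table idea on this slide can be sketched with an in-memory SQLite database. This is only an illustration under assumptions: the table name `client_transaction_counts`, the sample rows, and the helper functions are hypothetical, and SQLite stands in for Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Fact table: one row per transaction (tiny made-up sample).
conn.execute(
    "CREATE TABLE transactions ("
    "id INTEGER PRIMARY KEY, client_id INTEGER, user_id INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions (client_id, user_id) VALUES (?, ?)",
    [(131, 1), (131, 2), (131, 1), (132, 3), (133, 4)],
)

# Aggregate table: one row per client with a precomputed count.
# The table name is hypothetical; the columns follow the slide.
conn.execute(
    """
    CREATE TABLE client_transaction_counts AS
    SELECT client_id, count(*) AS count_transactions
    FROM transactions
    GROUP BY client_id
    """
)


def count_for_client(client_id):
    """Answer a per-client count from the aggregate table only."""
    row = conn.execute(
        "SELECT count_transactions FROM client_transaction_counts "
        "WHERE client_id = ?",
        (client_id,),
    ).fetchone()
    return row[0] if row else 0


def count_all():
    """Counts are additive, so the grand total is a sum over a few rows."""
    return conn.execute(
        "SELECT COALESCE(sum(count_transactions), 0) "
        "FROM client_transaction_counts"
    ).fetchone()[0]
```

Both helpers read one row per client instead of scanning every transaction, which is the saving the slide is after. Keeping the aggregate table in sync with new transactions is left out of this sketch.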
  7. 7. Mondrian! ● Robust aggregate table interface ● Auto-recognizes aggregate tables via naming convention ● Queries are directed to the correct table ● If aggregate tables are missed, falls back to the fact table ● Can define multiple aggregate tables per fact table ● Also has an intelligent segment cache
  8. 8. But there's a problem. ● SELECT count(distinct(user_id)) FROM transactions ● Aggregate tables rely on properties of addition operations ● distinct(set_1) + distinct(set_2) != distinct(set_1 + set_2) ● We have no choice but to query our fact table.
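A small worked example of why distinct counts cannot be rolled up from per-client aggregates. The client IDs reuse the ones from the earlier slide, but the user IDs are made up for illustration.

```python
# Distinct users seen per client (hypothetical sample data).
# User 3 has transactions with both clients.
users_by_client = {
    131: {1, 2, 3},
    132: {3, 4},
}

# Adding the per-client distinct counts double-counts user 3.
sum_of_distinct_counts = sum(len(users) for users in users_by_client.values())

# The true answer needs the union of the underlying sets,
# which means going back to the raw rows.
true_distinct_count = len(set().union(*users_by_client.values()))

print(sum_of_distinct_counts, true_distinct_count)
```

The two numbers only agree when no user appears under more than one client, which a precomputed count cannot tell you.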
  9. 9. Ok, now we can panic.
  10. 10. Options? ● NoSQL (map reduce) – HBase/Hadoop, Mongo, etc. ● Columnar – LucidDB, ParAccel, Vertica ● Bleeding Edge – Google BigQuery, Apache Drill
  11. 11. Much Cluster, Many Computer ● All of these solutions are using distributed systems to query lots of data quickly ● Querying 100 million rows on a single computer is not fast on current hardware ● And we are projecting to have a lot more than 100 million rows this year
  12. 12. Vertica ● Columnar ● Distributed ● Speaks SQL ● Compatible with Mondrian ● It's fast! ● “drop in” replacement for Postgres
  13. 13. Row based database

      id  name     favorite_color
      1   brian    blue
      2   dennis   red
      3   nelson   green
      4   spencer  green

      (1,brian,blue)(2,dennis,red)(3,nelson,green)(4,spencer,green)
  14. 14. Columnar database

      id  name     favorite_color
      1   brian    blue
      2   dennis   red
      3   nelson   green
      4   spencer  green

      (1,2,3,4)(brian,dennis,nelson,spencer)(blue,red,green,green)
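The two layouts on slides 13 and 14 can be mimicked with plain Python lists. This is a toy model of the idea, not how any particular database stores data; the variable names are chosen for illustration.

```python
from collections import Counter

header = ("id", "name", "favorite_color")

# Row-based layout: each record is kept together.
rows = [
    (1, "brian", "blue"),
    (2, "dennis", "red"),
    (3, "nelson", "green"),
    (4, "spencer", "green"),
]

# Columnar layout: each column is kept together.
columns = {
    name: [row[i] for row in rows]
    for i, name in enumerate(header)
}

# An aggregate over one column touches only that column's values,
# rather than walking through every full record.
color_counts = Counter(columns["favorite_color"])

# Reading one full record, by contrast, has to visit every column.
record_3 = tuple(columns[name][2] for name in header)
```

The last two statements hint at the tradeoff: aggregates read a single contiguous list, while a single-record lookup has to stitch values back together from every column.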
  15. 15. Do you even index, bro?
  16. 16. Nope! ● Vertica has no indexes ● Vertica has “projections”, which are similar to materialized views ● Projections are transparent to the query (like an index) ● Projections are used to optimize JOIN, GROUP BY, and other sorts of queries ● Vertica provides a tool to auto-build projections based on query analysis
  17. 17. Tradeoffs

      Columnar                   Row Based
      ● Slow single row read     ● Fast single row read
      ● Slow single row write    ● Fast single row write
      ● Fast aggregates          ● Slow aggregates
      ● Compression (5-10x)      ● No compression
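One intuition for why a column compresses well: its values share a type and often repeat. The sketch below uses run-length encoding on a sorted column purely as an illustration. The slides do not say which encodings are behind the 5-10x figure, and the function names here are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, length in pairs for _ in range(length)]


# The favorite_color column from the earlier slides, sorted so that
# equal values sit next to each other.
favorite_color = sorted(["blue", "red", "green", "green"])

encoded = rle_encode(favorite_color)
```

On four rows the saving is negligible, but a low-cardinality column with millions of sorted rows collapses to a handful of pairs. A row-based layout interleaves unrelated values, so runs like this rarely form.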
  18. 18. Distributed ● Data split among servers ● Horizontal scaling ● Data is compressed, so it's stored in memory ● Node failure is tolerated ● Network IO is important
  19. 19. Count All Transactions: Postgres – 230s, Vertica – 2.10s

      Distinct User Count, All Transactions: Postgres – 187s, Vertica – 0.63s
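To put the timings on this slide in proportion, the speedup is the Postgres time divided by the Vertica time. The dictionary keys below are labels chosen for illustration; the numbers are the ones on the slide.

```python
# (postgres_seconds, vertica_seconds) as reported on the slide.
benchmarks = {
    "count_all_transactions": (230.0, 2.10),
    "distinct_user_count": (187.0, 0.63),
}

# Speedup factor = Postgres time / Vertica time.
speedups = {
    name: postgres_s / vertica_s
    for name, (postgres_s, vertica_s) in benchmarks.items()
}

for name, factor in speedups.items():
    print(f"{name}: about {factor:.0f}x faster")
```

That works out to roughly 110x for the plain count and roughly 297x for the distinct user count.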
  20. 20. So you just drop it in, right? ● 6 or 7 gems needed updates ● Had to roll an ActiveRecord driver ● Arel saved us from a lot of pain ● Still some SQL problems (database drop, multi-row insert) ● Lots of DevOps help needed ● Currently deployed to sand and qa, hitting production soon!
  21. 21. Thank You
