Transcript

  • 1. Vertica
  • 2. Why?
    ● Postgres benchmarks (2014-01-13)
    ● Remember, these queries are expected to occur within a web request/response cycle!
    ● After 60 seconds, connections time out
    ● We are used to web pages loading in 1-2 seconds
  • 3. Count
    ● SELECT count(*) FROM transactions
    ● (229527.0ms)
    ● => [{"count"=>78144197}]
    ● SELECT count(*) FROM transactions WHERE client_id = 131
    ● (85451.0ms)
    ● => [{"count"=>34406416}]
  • 4. Yikes.
  • 5. Don't panic! (and carry a towel)
  • 6. We have a few tricks.
    ● What if we had a table with one row per client that tracked the count of transactions for each client?

        id | client_id | count_transactions
        ---+-----------+-------------------
         1 | 131       | 34406416
         2 | 132       | 10587625
         3 | 133       | 85095

    ● What if we wired this table up to a SQL parser?
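    A minimal SQL sketch of that trick, assuming the fact table is the transactions table from slide 3 and the rollup is rebuilt out of band (the table name client_transaction_counts is invented; the columns mirror the slide):

      -- Sketch only (the deck doesn't show DDL): one row per client,
      -- rebuilt from the fact table on a schedule rather than kept live.
      CREATE TABLE client_transaction_counts (
        id                 serial PRIMARY KEY,
        client_id          integer NOT NULL,
        count_transactions bigint  NOT NULL
      );

      INSERT INTO client_transaction_counts (client_id, count_transactions)
      SELECT client_id, count(*)
      FROM transactions
      GROUP BY client_id;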
  • 7. Mondrian!
    ● Robust aggregate table interface
    ● Auto-recognizes aggregate tables via naming convention
    ● Queries are directed to the correct table
    ● If the aggregate tables are missed, fall back to the fact table
    ● Can define multiple aggregate tables per fact table
    ● Also has an intelligent segment cache
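    If we wanted Mondrian's default recognizer to pick up the rollup sketched above, the table would need to follow the naming convention (roughly, an agg_ prefix plus the fact table name as a suffix, with a fact_count column; check the Mondrian docs for the exact rules). A hypothetical rename of the earlier sketch:

      -- Hypothetical: rename the rollup so Mondrian's default rules can
      -- recognize it against the transactions fact table automatically.
      ALTER TABLE client_transaction_counts RENAME TO agg_by_client_transactions;
      ALTER TABLE agg_by_client_transactions RENAME COLUMN count_transactions TO fact_count;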
  • 8. But there's a problem.
    ● SELECT count(distinct(user_id)) FROM transactions
    ● Aggregate tables rely on the properties of addition
    ● distinct(set_1) + distinct(set_2) != distinct(set_1 + set_2)
    ● We have no choice but to query our fact table.
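    A worked illustration (the overlap is invented for the example): if user 42 has transactions with both client 131 and client 132, summing per-client distinct counts double-counts that user, so no pre-aggregated rollup can answer the global question.

      -- Per-client distinct counts: user 42 is counted once for 131 AND once for 132.
      SELECT client_id, count(DISTINCT user_id) AS distinct_users
      FROM transactions
      GROUP BY client_id;

      -- Global distinct count: user 42 is counted exactly once, so summing the
      -- per-client numbers above would overstate this result.
      SELECT count(DISTINCT user_id) AS distinct_users
      FROM transactions;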
  • 9. Ok, now we can panic.
  • 10. Options?
    ● NoSQL (map reduce) – HBase/Hadoop, Mongo, etc.
    ● Columnar – Lucid, ParAccel, Vertica
    ● Bleeding edge – Google BigQuery, Apache Drill
  • 11. Much Cluster, Many Computer
    ● All of these solutions are using distributed systems to query lots of data quickly
    ● Querying 100 million rows on a single computer is not fast on current hardware
    ● And we are projecting to have a lot more than 100 million rows this year
  • 12. Vertica
    ● Columnar
    ● Distributed
    ● Speaks SQL
    ● Compatible with Mondrian
    ● It's fast!
    ● “Drop in” replacement for Postgres
  • 13. Row based database

        id | name    | favorite_color
        ---+---------+---------------
         1 | brian   | blue
         2 | dennis  | red
         3 | nelson  | green
         4 | spencer | green

    Stored as: (1,brian,blue)(2,dennis,red)(3,nelson,green)(4,spencer,green)
  • 14. Columnar database

        id | name    | favorite_color
        ---+---------+---------------
         1 | brian   | blue
         2 | dennis  | red
         3 | nelson  | green
         4 | spencer | green

    Stored as: (1,2,3,4)(brian,dennis,nelson,spencer)(blue,red,green,green)
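    To see why this layout pays off for aggregates, consider a query over the toy table above (calling it users is an assumption): a columnar engine can answer it by scanning only the favorite_color column, while a row store must read every row in full.

      -- Only the favorite_color column needs to be read from disk in a
      -- columnar store to answer this aggregate query.
      SELECT favorite_color, count(*) AS people
      FROM users
      GROUP BY favorite_color;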
  • 15. Do you even index, bro?
  • 16. Nope!
    ● Vertica has no indexes
    ● Vertica has "projections", which are similar to a materialized view
    ● Projections are transparent to the query (like an index)
    ● Projections are used to optimize JOIN, GROUP BY, and other sorts of queries
    ● Provides a tool to auto-build projections based on query analysis
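    For context, a hand-built projection looks roughly like the following; the transactions columns and the segmentation choice are assumptions, and the auto-build tool the slide refers to is Vertica's Database Designer. Treat this as a sketch rather than exact DDL.

      -- Rough sketch of a Vertica projection pre-sorted and segmented by
      -- client_id, which would serve GROUP BY client_id queries well.
      -- Column names are assumed, not taken from the deck.
      CREATE PROJECTION transactions_by_client AS
      SELECT client_id, user_id, amount
      FROM transactions
      ORDER BY client_id
      SEGMENTED BY HASH(client_id) ALL NODES;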
  • 17. Tradeoffs

        Columnar               | Row Based
        -----------------------+-----------------------
        Slow single row read   | Fast single row read
        Slow single row write  | Fast single row write
        Fast aggregates        | Slow aggregates
        Compression (5-10x)    | No compression
  • 18. Distributed
    ● Data split among servers
    ● Horizontal scaling
    ● Data is compressed, so it's stored in memory
    ● Node failure is tolerated
    ● Network IO is important
  • 19. Count All Transactions
        Postgres – 230s
        Vertica – 2.10s
      Distinct User Count, All Transactions
        Postgres – 187s
        Vertica – 0.63s
  • 20. So you just drop it in, right?
    ● 6 or 7 gems needed updates
    ● Had to roll an ActiveRecord driver
    ● Arel saved us from a lot of pain
    ● Still some SQL problems (database drop, multi-row insert)
    ● Lots of DevOps help needed
    ● Currently deployed to sand and qa, hitting production soon!
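    For concreteness, one of the SQL gaps called out above is the multi-row VALUES form that Postgres (and ActiveRecord bulk helpers) happily emit; the column names below are assumptions, and the deck only says this construct was a problem, not how it was worked around.

      -- Multi-row insert as Postgres accepts it; the deck lists this form
      -- (along with dropping a database) among the SQL problems hit on Vertica.
      INSERT INTO transactions (client_id, user_id, amount)
      VALUES (131, 1, 10.00),
             (131, 2, 12.50),
             (132, 3, 7.25);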
  • 21. Thank You