Uploaded as OpenOffice · © All Rights Reserved

Vertica: Presentation Transcript

  • Vertica
  • Why?
    ● Postgres benchmarks (2014-01-13)
    ● Remember, these queries are expected to occur within a web request/response cycle!
    ● After 60 seconds, connections time out
    ● We are used to web pages loading in 1-2 seconds
  • Count
    ● SELECT count(*) FROM transactions
      (229527.0ms) => [{"count"=>78144197}]
    ● SELECT count(*) FROM transactions WHERE client_id = 131
      (85451.0ms) => [{"count"=>34406416}]
  • Yikes.
  • Don't panic! (and carry a towel)
  • We have a few tricks.
    ● What if we had a table that recorded 1 row per client that tracked all the counts of transactions for each client?

      id  client_id  count_transactions
      1   131        34406416
      2   132        10587625
      3   133        85095

    ● What if we wired this table up to a SQL parser?
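The aggregate-table trick above can be sketched in a few lines. This is a toy illustration using sqlite3 as a stand-in for Postgres; the table and column names (`transactions`, `agg_transactions_by_client`) are illustrative, not the real schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transactions (id INTEGER, client_id INTEGER)")
db.executemany(
    "INSERT INTO transactions (id, client_id) VALUES (?, ?)",
    [(i, 131 if i % 3 else 132) for i in range(1, 10)],
)

# Precompute one row per client so a count becomes a single-row lookup
# instead of a scan over the whole fact table.
db.execute(
    """CREATE TABLE agg_transactions_by_client AS
       SELECT client_id, count(*) AS count_transactions
       FROM transactions GROUP BY client_id"""
)

fast = db.execute(
    "SELECT count_transactions FROM agg_transactions_by_client WHERE client_id = 131"
).fetchone()[0]
slow = db.execute(
    "SELECT count(*) FROM transactions WHERE client_id = 131"
).fetchone()[0]
assert fast == slow  # same answer, without touching the fact table
```

The aggregate table trades write-time work (keeping the counts current) for read-time speed, which is exactly the property Mondrian exploits below.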
  • Mondrian!
    ● Robust aggregate table interface
    ● Auto-recognizes aggregate tables via naming convention
    ● Queries are directed to the correct table
    ● If aggregate tables are missing, falls back to the fact table
    ● Can define multiple aggregate tables per fact table
    ● Also has an intelligent segment cache
  • But there's a problem.
    ● SELECT count(distinct(user_id)) FROM transactions
    ● Aggregate tables rely on properties of addition operations
    ● distinct(set_1) + distinct(set_2) != distinct(set_1 + set_2)
    ● We have no choice but to query our fact table.
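The non-additivity that breaks aggregate tables is easy to see with two overlapping sets (the user names here are made up):

```python
# Distinct counts don't add: a user who appears in both partitions of the
# data gets counted twice if you sum per-partition distinct counts.
part_1 = {"alice", "bob", "carol"}   # user_ids seen in partition 1
part_2 = {"bob", "carol", "dave"}    # user_ids seen in partition 2

summed = len(part_1) + len(part_2)   # what an aggregate table would give: 6
true_distinct = len(part_1 | part_2) # what the fact table actually says: 4

assert summed != true_distinct
```

Because `count(distinct ...)` can only be answered from the raw rows, the fact-table scan is unavoidable, which is what motivates the options on the next slide.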
  • Ok, now we can panic.
  • Options?
    ● NoSQL (map reduce) – HBase/Hadoop, Mongo, etc.
    ● Columnar – Lucid, ParAccel, Vertica
    ● Bleeding edge – Google BigQuery, Apache Drill
  • Much Cluster, Many Computer
    ● All of these solutions use distributed systems to query lots of data quickly
    ● Querying 100 million rows on a single computer is not fast on current hardware
    ● And we are projecting to have a lot more than 100 million rows this year
  • Vertica
    ● Columnar
    ● Distributed
    ● Speaks SQL
    ● Compatible with Mondrian
    ● It's fast!
    ● “drop in” replacement for Postgres
  • Row-based database

      id  name     favorite_color
      1   brian    blue
      2   dennis   red
      3   nelson   green
      4   spencer  green

    Stored on disk as: (1,brian,blue)(2,dennis,red)(3,nelson,green)(4,spencer,green)

  • Columnar database

    The same table, stored on disk as: (1,2,3,4)(brian,dennis,nelson,spencer)(blue,red,green,green)
  • Do you even index, bro?
  • Nope!
    ● Vertica has no indexes
    ● Vertica has “projections”, which are similar to a materialized view
    ● Projections are transparent to the query (like an index)
    ● Projections are used to optimize JOIN, GROUP BY, and other sorts of queries
    ● Vertica provides a tool to auto-build projections based on query analysis
  • Tradeoffs

      Columnar               Row-based
      Slow single-row read   Fast single-row read
      Slow single-row write  Fast single-row write
      Fast aggregates        Slow aggregates
      Compression (5-10x)    No compression
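One intuition for the 5-10x compression figure: a sorted, low-cardinality column collapses under run-length encoding. This RLE is a toy sketch, not Vertica's actual encoding:

```python
from itertools import groupby

# A sorted column of favorite colors, as a column store would lay it out.
colors = ["blue", "green", "green", "green", "green", "red", "red"]

# Run-length encode: store each value once with its run length.
rle = [(value, len(list(run))) for value, run in groupby(colors)]

# 7 values shrink to 3 (value, count) pairs, with nothing lost.
assert sum(n for _, n in rle) == len(colors)
```

Row stores can't do this well because unrelated fields are interleaved on disk, breaking up the runs.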
  • Distributed
    ● Data split among servers
    ● Horizontal scaling
    ● Data is compressed, so it's stored in memory
    ● Node failure is tolerated
    ● Network IO is important
  • Count All Transactions: Postgres – 230s, Vertica – 2.10s
    Distinct User Count, All Transactions: Postgres – 187s, Vertica – 0.63s
  • So you just drop it in, right?
    ● 6 or 7 gems needed updates
    ● Had to roll an ActiveRecord driver
    ● Arel saved us from a lot of pain
    ● Still some SQL problems (database drop, multi-row insert)
    ● Lots of DevOps help needed
    ● Currently deployed to sand and qa, hitting production soon!
  • Thank You