Optimising Column stores with statistical analysis

A presentation about column stores, how they work, and how you can optimise their compression

Published in: Technology

  1. 1. Turning Rows into Columns
     Sales
       Product  Customer   Date        Sale
       Beer     Thomas     2011-11-25  2 GBP
       Beer     Thomas     2011-11-25  2 GBP
       Vodka    Thomas     2011-11-25  10 GBP
       And so on… until…
       Whiskey  Christian  2011-11-25  5 GBP
       Whiskey  Christian  2011-11-25  5 GBP
       Vodka    Alexei     2011-11-25  10 GBP
     Product          Customer
       ID  Value        ID  Customer
       1   Beer         1   Thomas
       2   Beer         2   Thomas
       3   Vodka        3   Thomas
       4   Whiskey      4   Christian
       5   Whiskey      5   Christian
       6   Vodka        6   Alexei
       7   Vodka        7   Alexei
  2. 2. And we get…
     Product          Customer          Date              Sale
       ID  Value        ID  Customer      ID  Date          ID  Sale
       1   Beer         1   Thomas        1   2011-11-25    1   2 GBP
       2   Beer         2   Thomas        2   2011-11-25    2   2 GBP
       3   Vodka        3   Thomas        3   2011-11-25    3   10 GBP
       4   Whiskey      4   Christian     4   2011-11-25    4   5 GBP
       5   Whiskey      5   Christian     5   2011-11-25    5   5 GBP
       6   Vodka        6   Alexei        6   2011-11-25    6   10 GBP
       7   Vodka        7   Alexei        7   2011-11-25    7   10 GBP
  3. 3. And what now? Run length encode
     Product          Product’
       ID  Value        ID   Value
       1   Beer         1-2  Beer
       2   Beer         3    Vodka
       3   Vodka        4-5  Whiskey
       4   Whiskey      6-7  Vodka
       5   Whiskey
       6   Vodka
       7   Vodka
  4. 4. Applying Compression
     Product’          Customer’          Date’              Sale’
       ID   Value        ID   Customer      ID   Date          ID   Sale
       1-2  Beer         1-3  Thomas        1-7  2011-11-25    1-2  2 GBP
       3    Vodka        4-5  Christian                        3    10 GBP
       4-5  Whiskey      6-7  Alexei                           4-5  5 GBP
       6-7  Vodka                                              6-7  10 GBP
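The run-length encoding applied on slides 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration, not the encoding any particular column store uses. The function name `run_length_encode` and the `(first_id, last_id, value)` tuple layout are assumptions, chosen to mirror the "1-2 Beer" notation on the slides.

```python
from itertools import groupby

def run_length_encode(column):
    """Collapse consecutive equal values into (first_id, last_id, value) runs.

    Row IDs are 1-based, as on the slides.
    """
    runs = []
    row_id = 1
    for value, group in groupby(column):
        length = sum(1 for _ in group)  # number of rows in this run
        runs.append((row_id, row_id + length - 1, value))
        row_id += length
    return runs

# Columns taken from slide 2.
product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
customer = ["Thomas", "Thomas", "Thomas", "Christian", "Christian", "Alexei", "Alexei"]

for name, column in (("Product", product), ("Customer", customer)):
    print(name)
    for first, last, value in run_length_encode(column):
        ids = str(first) if first == last else f"{first}-{last}"
        print(f"  {ids:<4} {value}")
```

Seven Product rows collapse to four runs and seven Customer rows to three, which matches the Product’ and Customer’ tables on slide 4.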
  5. 5. Insights Product’ ID One RL Value 12 3 Beer 45 Whiskey 67 Vodka Vodka
  6. 6. Ordering Example Product Customer Product Customer Product Customer Beer Thomas Beer Thomas Beer Christian Whiskey Christian Beer Thomas Vodka Thomas Vodka Thomas Vodka Thomas Whiskey Christian Whiskey Christian Whiskey Christian Beer Thomas Beer Thomas Whiskey Christian Vodka Alexei Vodka Alexei Vodka Alexei Vodka Alexei Customer Thomas Whiskey Product Vodka Alexei Beer Thomas Vodka Whiskey Christian Vodka Alexei
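To see why row order matters for run-length encoding, the sketch below counts the runs that the seven Product/Customer rows from slide 2 produce under a few different row orders. The helper names `count_runs` and `total_runs` are illustrative assumptions, and the "hand-picked" order is one chosen for this example, not necessarily the order shown on the slide.

```python
from itertools import groupby

# (product, customer) rows in their original order, from slide 2.
rows = [
    ("Beer", "Thomas"), ("Beer", "Thomas"), ("Vodka", "Thomas"),
    ("Whiskey", "Christian"), ("Whiskey", "Christian"),
    ("Vodka", "Alexei"), ("Vodka", "Alexei"),
]

def count_runs(column):
    """Number of runs of consecutive equal values in one column."""
    return sum(1 for _ in groupby(column))

def total_runs(table):
    """Sum of run counts over every column of a row-ordered table."""
    return sum(count_runs(column) for column in zip(*table))

by_product = sorted(rows, key=lambda r: (r[0], r[1]))
by_customer = sorted(rows, key=lambda r: (r[1], r[0]))
# Keep all Vodka rows together and all Thomas rows together.
hand_picked = [
    ("Beer", "Thomas"), ("Beer", "Thomas"), ("Vodka", "Thomas"),
    ("Vodka", "Alexei"), ("Vodka", "Alexei"),
    ("Whiskey", "Christian"), ("Whiskey", "Christian"),
]

for label, table in [("original", rows), ("sort by product", by_product),
                     ("sort by customer", by_customer), ("hand-picked", hand_picked)]:
    print(f"{label:<17} {total_runs(table)} runs")
```

On this data both alphabetical sorts leave the total at seven runs, while the hand-picked order needs only six. A plain sort on one column is therefore not guaranteed to give the best compression.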
  7. 7. There is some overhead…
                            Cluster on ID   Heap
       Data Size            327MB           327MB
       Column Index Size    59MB            142MB
  8. 8. Rule of Thumb? Lowest first is worse!
  9. 9. OK, so what about highest first? Loose correlation. Highest first is worse!
  10. 10. What are we looking for?
  11. 11. Just Read the Magic Code?
  12. 12. Coming to Terms with Entropy
                            SKEW    SPLAT   ID
        Histogram
        COUNT DISTINCT      10001   10001   1000000
        DISTINCT / COUNT    0.01    0.01    1
  13. 13. SKEW ≈ 0.21, SPLAT ≈ 13, ID ≈ 20
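The figures on slide 13 are consistent with Shannon entropy measured in bits, which the sketch below computes from per-value row counts. The transcript does not preserve the actual histograms, so the distributions here are assumptions: SKEW is modelled as one value covering 990,000 rows plus 10,000 values with one row each, SPLAT as 1,000,000 rows spread almost evenly over 10,001 values, and ID as 1,000,000 unique values. The function names are illustrative.

```python
import math
from collections import Counter

def entropy_from_counts(counts):
    """Shannon entropy in bits, given the number of rows per distinct value."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def column_entropy(values):
    """Shannon entropy in bits of a column given as a sequence of values."""
    return entropy_from_counts(Counter(values).values())

ROWS = 1_000_000
DISTINCT = 10_001

# Assumed distributions; the slide's histograms are not in the transcript.
skew_counts = [ROWS - (DISTINCT - 1)] + [1] * (DISTINCT - 1)
base, extra = divmod(ROWS, DISTINCT)
splat_counts = [base + 1] * extra + [base] * (DISTINCT - extra)
id_counts = [1] * ROWS

for name, counts in (("SKEW", skew_counts), ("SPLAT", splat_counts), ("ID", id_counts)):
    print(f"{name:<6} {entropy_from_counts(counts):.2f} bits")
```

Under these assumptions SKEW and SPLAT have the same COUNT DISTINCT and the same DISTINCT / COUNT ratio, yet their entropies differ by a factor of roughly sixty. That is the point of slide 12: cardinality alone does not tell you how compressible a column is.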
  14. 14. You will NEVER win
  15. 15. Columns that “cluster” with other columns?
  16. 16. I(X;Y)
  17. 17. d(A, B) is zero!
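Slides 16 and 17 name the mutual information I(X;Y) and a distance d(A, B) without defining them in the transcript. The sketch below assumes the standard identity I(X;Y) = H(X) + H(Y) - H(X,Y) and takes d to be the variation of information, d(X,Y) = 2H(X,Y) - H(X) - H(Y). That choice of distance is an assumption and may not be the one the presentation used. It is zero exactly when each column determines the other.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of a sequence of hashable values."""
    values = list(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(zip(x, y))

def information_distance(x, y):
    """Variation of information: d(X,Y) = 2*H(X,Y) - H(X) - H(Y)."""
    return 2 * entropy(zip(x, y)) - entropy(x) - entropy(y)

# Columns from slide 2.
product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
customer = ["Thomas", "Thomas", "Thomas", "Christian", "Christian", "Alexei", "Alexei"]
sale = ["2 GBP", "2 GBP", "10 GBP", "5 GBP", "5 GBP", "10 GBP", "10 GBP"]

print("d(product, sale)     =", round(information_distance(product, sale), 4))
print("d(product, customer) =", round(information_distance(product, customer), 4))
print("I(product; customer) =", round(mutual_information(product, customer), 4))
```

In the example data every product has exactly one sale value and vice versa, so the distance between Product and Sale is zero: any row order that produces long runs in one column produces equally long runs in the other. Product and Customer only partly determine each other, so their distance is positive.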
  18. 18. Reflecting on Information Distance
  19. 19. There are MORE than n! routes http://arxiv.org/pdf/1207.2189.pdf
  20. 20. Heuristics are your Best Bet
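To give a feel for the search space behind slides 19 and 20, the sketch below tries every column-priority sort order for the four-column table from slide 2 and counts the resulting runs. It is an illustration under assumptions, not the method from the presentation or the linked paper. The names `total_runs` and `best_sort_order` are hypothetical, and sale amounts are compared as strings, so "10 GBP" sorts before "2 GBP".

```python
from itertools import groupby, permutations

COLUMNS = ("product", "customer", "date", "sale")
ROWS = [
    ("Beer", "Thomas", "2011-11-25", "2 GBP"),
    ("Beer", "Thomas", "2011-11-25", "2 GBP"),
    ("Vodka", "Thomas", "2011-11-25", "10 GBP"),
    ("Whiskey", "Christian", "2011-11-25", "5 GBP"),
    ("Whiskey", "Christian", "2011-11-25", "5 GBP"),
    ("Vodka", "Alexei", "2011-11-25", "10 GBP"),
    ("Vodka", "Alexei", "2011-11-25", "10 GBP"),
]

def total_runs(rows):
    """Sum over all columns of the number of runs of equal consecutive values."""
    return sum(sum(1 for _ in groupby(column)) for column in zip(*rows))

def best_sort_order(rows):
    """Try all n! column priorities; return the best one and every score."""
    scores = {}
    for order in permutations(range(len(rows[0]))):
        ordered = sorted(rows, key=lambda r: tuple(r[i] for i in order))
        scores[order] = total_runs(ordered)
    best = min(scores, key=scores.get)
    return best, scores

best, scores = best_sort_order(ROWS)
print(len(scores), "column orderings tried")
print("original row order:", total_runs(ROWS), "runs")
print("best sort priority:", [COLUMNS[i] for i in best], "->", scores[best], "runs")
```

Four columns already mean 24 sorts, and the count grows factorially with the number of columns. Sorting by column priority also covers only some of the possible row orders, which is presumably why the slide speaks of more than n! routes. Exhaustive search stops being practical very quickly, hence the case for heuristics.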
