
Online statistical analysis using transducers and sketch algorithms


Online statistical analysis using transducers and sketch algorithms. Don’t know what either is? You are going to learn something very cool (and perspective-changing) then. Know them, but want an experience report? Got you covered, fam.

Published in: Data & Analytics

  1. 1. Online statistical analysis using transducers and sketch algorithms simon@metabase.com @sbelak
  2. 2. Metabase ❤ • github.com/metabase/metabase • Open source analytics tool • Building a “data scientist in a box” • Hundreds to billions of rows • Some DBs optimised for analytics, some not
  3. 3. Transducers at a glance • Transducers decomplect recursion mechanism, transformation, building the output, and access mechanism • 3 user-facing “protocols”: xf, transducer, and CollReduce
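To make the decoupling concrete, here is a minimal sketch in Python rather than Clojure. The helper names (`mapping`, `filtering`, `transduce`, `append`, `add`) are illustrative assumptions, not the Clojure API, and the sketch leaves out the init and completion arities that real Clojure transducers have. A transducer here is a function that takes a reducing function and returns a new one, so the transformation knows nothing about how the input is traversed or how the output is built.

```python
from functools import reduce


def mapping(f):
    """Transducer: apply f to each input before passing it to the next step."""
    def transducer(rf):
        def step(acc, x):
            return rf(acc, f(x))
        return step
    return transducer


def filtering(pred):
    """Transducer: pass an input on only if pred(x) is true."""
    def transducer(rf):
        def step(acc, x):
            return rf(acc, x) if pred(x) else acc
        return step
    return transducer


def transduce(xform, rf, init, coll):
    """Reduce coll with the reducing function rf wrapped by the transducer xform."""
    return reduce(xform(rf), coll, init)


def append(acc, x):
    """Reducing function that builds a list."""
    acc.append(x)
    return acc


def add(acc, x):
    """Reducing function that builds a sum."""
    return acc + x


increment = mapping(lambda x: x + 1)

# The same transformation feeds two different ways of building the output.
as_list = transduce(increment, append, [], range(5))
as_sum = transduce(increment, add, 0, iter(range(5)))
print(as_list, as_sum)
```

The transformation (`increment`) is written once and reused with a list-building step, a summing step, a materialised range, and an iterator, which is the separation the slide describes.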
  4. 4. xf and transducer
  5. 5. Composing transducers 1. comp xfs 2. xf and transducer 3. github.com/henrygarner/redux: post-complete, fuse
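A rough Python sketch of two of these composition styles follows. The names `comp` and `fuse` are borrowed from the slide, but the signatures are assumptions made for illustration and do not reproduce the Clojure or redux APIs. `comp` chains transducers so data flows through them left to right, and `fuse` combines several reducing functions so they all run in one pass over the data.

```python
from functools import reduce


def mapping(f):
    """Transducer: apply f to each input."""
    return lambda rf: (lambda acc, x: rf(acc, f(x)))


def filtering(pred):
    """Transducer: keep only inputs for which pred(x) is true."""
    return lambda rf: (lambda acc, x: rf(acc, x) if pred(x) else acc)


def comp(*xforms):
    """Compose transducers; inputs pass through them left to right."""
    def composed(rf):
        for xform in reversed(xforms):
            rf = xform(rf)
        return rf
    return composed


def fuse(reducers):
    """Combine named (step, start) pairs into a single (step, start) pair.

    The fused accumulator is a dict holding one accumulator per name.
    """
    def step(acc, x):
        return {name: rf(acc[name], x) for name, (rf, _) in reducers.items()}
    start = {name: initial for name, (_, initial) in reducers.items()}
    return step, start


# Keep odd numbers, scale them, then count and sum them in one pass.
xform = comp(filtering(lambda x: x % 2 == 1), mapping(lambda x: x * 10))
step, start = fuse({
    "count": (lambda acc, _: acc + 1, 0),
    "sum": (lambda acc, x: acc + x, 0),
})
summary = reduce(xform(step), range(6), start)
print(summary)
```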

  6. 6. On-line/streaming analysis
  7. 7. Many batch algorithms can be turned into online ones • Parallelize independent computations • Find a recursive relation
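As one well-known instance of "find a recursive relation", the mean and variance can be updated one value at a time with Welford's algorithm instead of being computed over the whole batch. The sketch below is illustrative only; the class name `RunningStats` and its fields are assumptions and are not taken from the talk or from any library it mentions.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Online mean and variance via Welford's recurrence."""
    n: int = 0          # number of values seen so far
    mean: float = 0.0   # running mean
    m2: float = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); NaN until two values are seen."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
print(stats.n, stats.mean, stats.variance)
```

Only three numbers are kept in memory regardless of how many values stream through, which is what makes the computation usable as a reducing step.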
  8. 8. github.com/MastodonC/kixi.stats • Count • (Arithmetic) mean • Geometric mean • Harmonic mean • Median • Variance • Interquartile range • Standard deviation • Standard error • Skewness • Kurtosis • Covariance • Covariance matrix • Correlation • Correlation matrix • Simple linear regression • Standard error of the mean • Standard error of the estimate • Standard error of the prediction • …
  9. 9. Single-pass analysis
  10. 10. Using transducers is worth it for the composition alone
  11. 11. Annoyances • Can only transduce one coll at a time • Always have to pass in an xf • Having functions that may or may not return a transducer is error-prone
  12. 12. Sketch algorithms
  13. 13. Idea: summarise your data with some data structure and query that
  14. 14. Histograms
  15. 15. Histogram construction 1. Pick a number of buckets K 2. For each incoming value: (a) if a bucket for it exists, increment it; (b) else, add a new bucket with count = 1; (c) if there are > K buckets, find the two closest adjacent buckets and merge them
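The construction steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: buckets are stored as `[mean, count]` pairs, "closest" is taken to mean the smallest gap between neighbouring bucket means, and a merged bucket takes the count-weighted mean of the two it replaces. The class name `StreamingHistogram` is hypothetical, and this is not the API of any library named in the talk.

```python
import bisect


class StreamingHistogram:
    """Fixed-size histogram that merges its two closest buckets when full."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # list of [mean, count], kept sorted by mean

    def insert(self, x):
        # Rebuilding the list of means is O(K) per insert; fine for a sketch.
        means = [b[0] for b in self.bins]
        i = bisect.bisect_left(means, x)
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += 1          # a bucket for x exists: increment it
        else:
            self.bins.insert(i, [x, 1])   # otherwise add a new bucket
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Find the neighbouring pair with the smallest gap between means.
        i = min(range(len(self.bins) - 1),
                key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
        (m1, c1), (m2, c2) = self.bins[i], self.bins[i + 1]
        total = c1 + c2
        self.bins[i:i + 2] = [[(m1 * c1 + m2 * c2) / total, total]]


hist = StreamingHistogram(max_bins=3)
for value in [1, 2, 10, 11, 20]:
    hist.insert(value)
print(hist.bins)
```

Memory stays bounded by K buckets however many values arrive, at the cost of losing exact positions inside each merged bucket.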
  16. 16. Nice property: merge
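One way to picture the merge property: two histograms built independently (for example on different partitions of the data) can be combined by pooling their buckets and repeatedly merging the closest pair until the bucket budget is met again. The function below is a hedged sketch that assumes buckets are `(mean, count)` pairs; `merge_histograms` is a hypothetical name and not part of any library mentioned in the talk.

```python
def merge_histograms(a, b, max_bins):
    """Combine two lists of (mean, count) buckets into at most max_bins buckets."""
    bins = sorted([list(bucket) for bucket in list(a) + list(b)])
    while len(bins) > max_bins:
        # Merge the neighbouring pair whose means are closest together.
        i = min(range(len(bins) - 1),
                key=lambda j: bins[j + 1][0] - bins[j][0])
        (m1, c1), (m2, c2) = bins[i], bins[i + 1]
        total = c1 + c2
        bins[i:i + 2] = [[(m1 * c1 + m2 * c2) / total, total]]
    return [tuple(bucket) for bucket in bins]


left = [(1.0, 2), (10.0, 1)]
right = [(2.0, 2), (11.0, 3)]
merged = merge_histograms(left, right, max_bins=2)
print(merged)
```

The total count is preserved by the merge, so summaries computed per partition or per time slice can be combined later without revisiting the raw data.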
  17. 17. Estimating values • Assume the bin mean is also its median • Do weighted interpolations • Often we can be precise up to the two bounding buckets
  18. 18. Nice property II: decouples data collection from computation
  19. 19. github.com/bigmlcom/histogram
  20. 20. Aside: transducers are a good way to wrap Java/imperative construction
  21. 21. Having distributions readily available is great
  22. 22. Sampling trick
  23. 23. Time slices
  24. 24. What about categorical data?
  25. 25. Count–min sketch
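For categorical data, a count-min sketch approximates per-item frequencies in fixed memory. The Python sketch below is a minimal illustration, not the API of any library named in the talk: the class name, the default `width` and `depth`, and the use of salted BLAKE2b hashes to derive one hash function per row are all assumptions. Each item increments one counter per row, and the estimate is the minimum across rows, so it can overcount because of collisions but never undercounts.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using depth rows of width counters."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive a separate hash function per row by salting with the row number.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        """Record count occurrences of item."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        """Upper-bound estimate of how often item was added."""
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


cms = CountMinSketch(width=64, depth=4)
for word in ["a", "b", "a", "a"]:
    cms.add(word)
print(cms.estimate("a"), cms.estimate("b"), cms.estimate("unseen"))
```

Widening the table lowers the chance of collisions and adding rows lowers the chance that every row collides, so both trade memory for accuracy.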
  26. 26. Often approximations are good enough
  27. 27. github.com/addthis/stream-lib
  28. 28. Takeaways • Transducers are not only performant but also a good modularization protocol • You don’t realise how often you want a distribution until you have it readily available • Often approximations are good enough • You can get surprisingly far on a single machine
  29. 29. Questions simon@metabase.com @sbelak
