HyperLogLog

HyperLogLog is dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, "short bytes"), HyperLogLog performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/sqrt(m). This improves on the best previously known cardinality estimator, LogLog, whose accuracy can be matched by consuming only 64% of the original memory.

Definitions and facts
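- harmonic mean: for positive values $x_1, \ldots, x_m$, $H = m \big/ \sum_{j=1}^{m} \frac{1}{x_j}$
- if each of a collection of m independent random variables has standard
deviation σ, then their arithmetic mean has standard deviation σ/√m
- the 68–95–99.7 rule: for a normal distribution, about 68%, 95%, and 99.7% of
values lie within 1, 2, and 3 standard deviations of the mean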
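
To make the first definition concrete, here is a minimal sketch comparing the harmonic mean with the arithmetic mean. The list of values is made up for illustration; it mimics a set of estimates where one is a large outlier.

```python
from statistics import harmonic_mean, mean

# Hypothetical estimates: most are close to 16, one is a large outlier.
estimates = [16, 16, 8, 32, 16, 1024]

def harmonic(values):
    """Harmonic mean: the number of values divided by the sum of their reciprocals."""
    return len(values) / sum(1.0 / v for v in values)

arithmetic = mean(estimates)   # pulled far upwards by the outlier (about 185.3)
harm = harmonic(estimates)     # stays near the typical value (about 17.4)

print(f"arithmetic mean = {arithmetic:.1f}, harmonic mean = {harm:.1f}")
```

The harmonic mean is far less sensitive to a single very large value, which is a plausible reason for preferring it when averaging estimates that occasionally overshoot.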

Structure
1. The HyperLogLog algorithm
2. Mean value analysis
3. Variance and other stories
4. Discussion

Problem statement and naive solution
- given a multiset M, find the number of distinct elements
- build a hash table on M?
- sort M, then scan for changes between neighbours?
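
The two naive approaches can be sketched as follows. The function names are illustrative only; both return the exact answer but need memory proportional to the input or to the number of distinct elements.

```python
def count_distinct_hash(items):
    """Exact count with a hash set: memory grows with the number of distinct elements."""
    return len(set(items))

def count_distinct_sorted(items):
    """Exact count by sorting, then scanning for changes between neighbours."""
    ordered = sorted(items)
    distinct = 0
    previous = object()  # sentinel that compares unequal to every real element
    for item in ordered:
        if item != previous:
            distinct += 1
            previous = item
    return distinct

M = ["a", "b", "a", "c", "b", "a"]
print(count_distinct_hash(M), count_distinct_sorted(M))
```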

Issues
- the cardinality of the data set is too large for the distinct elements to fit in memory
- the data set is stored in a distributed environment

Examples
- Google search: number of distinct search queries
- traffic monitoring (DoS attacks)
- correlation of genomes in human DNA: number of distinct subwords of fixed size k

Constraints
- the crucial step is to relax the constraint of computing the cardinality
exactly
- this allows a whole range of probabilistic algorithms to be applied
- in 99% of practical applications, a tolerance of a few percent on the result
is acceptable

Idea of probabilistic counting
- imagine I flip a coin many times and count the number of consecutive
heads before the first tail
- repeat it several times
- Sequence 1: HHHT
- Sequence 2: HT
- Sequence 3: HHT

What if?
- what if I tell you that I generated 1000 sequences and the maximum index was 2?
- what if I tell you that I generated 10 sequences and the maximum index was 100?
- $X \approx 2^k$, where X is the number of sequences and k is the maximum index
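
The coin-flipping intuition can be tried out with a small simulation. This is a sketch under assumptions: the function names are made up, and "index" is taken to mean the number of heads before the first tail. The estimate 2^k is only an order-of-magnitude guess, so individual runs can be off by a large factor.

```python
import random

def heads_before_first_tail(rng):
    """Flip a fair coin until the first tail; return the number of heads seen."""
    heads = 0
    while rng.random() < 0.5:
        heads += 1
    return heads

def max_index(num_sequences, rng):
    """Largest run of heads observed over num_sequences independent sequences."""
    return max(heads_before_first_tail(rng) for _ in range(num_sequences))

rng = random.Random(42)  # fixed seed so the run is repeatable
for x in (10, 1000, 100000):
    k = max_index(x, rng)
    print(f"X = {x:>6}, k = {k:>2}, 2^k = {2 ** k}")
```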

Prototype
- let $h: D \to \{0, 1\}^\infty$
- $h(v_1) = 0001001110011\ldots$
- $h(v_2) = 0100100110011\ldots$
- $h(v_3) = 0010011010011\ldots$
- observe $0^{p-1}1$ patterns, $\rho(h(v)) = p$, $v \in M$
- $k = \max_{v \in M} \rho(h(v))$
- cardinality(M) ≈ $2^k$
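
A minimal sketch of this prototype follows. The hash is an assumption: a 64-bit prefix of SHA-1 stands in for the idealised infinite bit string, and the names `hash_bits`, `rho` and `prototype_estimate` are illustrative.

```python
import hashlib

HASH_BITS = 64  # assumption: a finite 64-bit hash replaces the infinite bit string

def hash_bits(value):
    """Map a value to a 64-bit integer, read as a bit string from the most significant bit."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def rho(bits, width=HASH_BITS):
    """1-based position of the leftmost 1-bit; width + 1 if all bits are zero."""
    return width - bits.bit_length() + 1

def prototype_estimate(multiset):
    """Single-observable estimate: 2^k, where k is the largest rho seen."""
    k = max(rho(hash_bits(v)) for v in multiset)
    return 2 ** k

# The three 13-bit prefixes shown above give rho = 4, 2 and 3, so k = 4.
for s in ("0001001110011", "0100100110011", "0010011010011"):
    print(s, rho(int(s, 2), width=len(s)))
```

Duplicates hash to the same bit string, so they never change k; that is what makes the observable depend only on the set of distinct elements.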

m different hash functions, drawbacks
- complexity = O(Nm)
- it would necessitate a large set (e.g.: $10^4$ to decrease error by $10^2$) of
independent hashing functions, for which no construction is known

Split one problem into m sub-problems
- split M into m buckets
- estimate the cardinality of each bucket (about X/m)
- compute the mean of all the estimates
- multiply the result by m (the estimate then has accuracy σ/√m)
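
The steps above can be sketched with a single hash function: the first b bits pick the bucket and the remaining bits feed ρ. This is a sketch, not a production implementation. It assumes a 64-bit SHA-1 prefix as the hash, averages with the harmonic mean from "Definitions and facts", and takes the bias constant α_m = 0.7213/(1 + 1.079/m) (for m ≥ 128) from the HyperLogLog paper (reference 1). Function names are illustrative.

```python
import hashlib

def hash64(value):
    """Assumed hash: the first 64 bits of SHA-1, as an integer."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hll_registers(items, b=10):
    """Split the stream into m = 2^b buckets and keep the maximum rho per bucket."""
    m = 1 << b
    registers = [0] * m
    rest_bits = 64 - b
    for item in items:
        x = hash64(item)
        bucket = x >> rest_bits                    # first b bits choose the bucket
        rest = x & ((1 << rest_bits) - 1)          # remaining bits are used for rho
        rank = rest_bits - rest.bit_length() + 1   # position of the leftmost 1-bit
        registers[bucket] = max(registers[bucket], rank)
    return registers

def hll_estimate(registers):
    """Raw estimate: alpha_m * m * (harmonic mean of 2^M[j]) = alpha_m * m^2 / sum(2^-M[j])."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)  # constant from the paper, stated for m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in registers)

registers = hll_registers(range(100000), b=10)
print(round(hll_estimate(registers)))  # expected to land within a few percent of 100000
```

With m = 1024 the registers fit in about a kilobyte, regardless of how many items are processed.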

Edge cases
- small cardinalities (< m)
- large cardinalities (comparable to $2^t$, where t is the number of bits in the hash function)
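
For the small-cardinality case, the HyperLogLog paper (reference 1) switches to linear counting (reference 3) when many registers are still empty. A minimal sketch follows. The threshold 5m/2 is the one given in the paper and is not stated on the slide; the function name is illustrative.

```python
import math

def small_range_correction(raw_estimate, registers):
    """Use linear counting when the raw estimate is small and some registers are empty."""
    m = len(registers)
    empty = registers.count(0)
    if raw_estimate <= 2.5 * m and empty > 0:
        # Linear counting: estimate from the fraction of registers never touched.
        return m * math.log(m / empty)
    return raw_estimate

# Hypothetical register state: half of the 1024 registers were never hit.
registers = [0] * 512 + [1] * 512
print(small_range_correction(100.0, registers))  # 1024 * ln(2)
```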

Chart, relative error
$n = 10^7$, $m = 1024$, 3%

Histogram, relative error
+/- 3σ = 99.7%
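
The figures on these two slides follow from the standard error 1.04/√m quoted in the abstract. A short worked check, with illustrative function names:

```python
import math

def standard_error(m):
    """Relative standard error of HyperLogLog with m registers."""
    return 1.04 / math.sqrt(m)

def three_sigma_interval(n, m):
    """Range that should contain about 99.7% of estimates for true cardinality n."""
    sigma = standard_error(m)
    return n * (1 - 3 * sigma), n * (1 + 3 * sigma)

sigma = standard_error(1024)                     # 1.04 / 32 = 0.0325, i.e. about 3%
low, high = three_sigma_interval(10**7, 1024)    # 9,025,000 to 10,975,000
print(f"sigma = {sigma:.4f}, 3-sigma range = [{low:,.0f}, {high:,.0f}]")
```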

Data structure
- estimate the cardinality of the union of multiple sets. It is natural to combine
multiple HLLs: for each register, take the largest count of consecutive
leading 0's across all the HLLs
- estimate the overlap of two sets. Since |A ∩ B| = |A| + |B| – |A ∪ B|, the
overlap of two sets can be calculated from the cardinality of each set and
the cardinality of their union
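
Both operations can be sketched on plain register lists. This assumes the sketches were built with the same hash function and the same number of registers; the names and numbers below are illustrative.

```python
def merge(registers_a, registers_b):
    """Union of two sketches: register-wise maximum."""
    if len(registers_a) != len(registers_b):
        raise ValueError("sketches must use the same number of registers")
    return [max(a, b) for a, b in zip(registers_a, registers_b)]

def intersection_size(card_a, card_b, card_union):
    """Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    return card_a + card_b - card_union

# Toy sketches with four registers each.
a = [3, 0, 2, 5]
b = [1, 4, 2, 0]
print(merge(a, b))                         # [3, 4, 2, 5]
print(intersection_size(600, 500, 900))    # 200
```

The merged sketch is identical to the one that would have been built from the combined stream, which is why this works well in a distributed setting. The overlap estimate inherits the errors of all three cardinality estimates, so it is likely to be unreliable when the intersection is small relative to the sets.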

Usage
- Elasticsearch (HLL++, “precision_threshold”)
- Redis (HLL++, PF*, 12k + 8 bytes)
- Spark (HLL++, approx_count_distinct, “rsd”)
- ...

References
1. http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
2. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf
3. http://dblab.kaist.ac.kr/Publication/pdf/ACM90_TODS_v15n2.pdf

HyperLogLog

1. HyperLogLog Eugen Kosteev, SE @ Tubular

2. Paper

3. Math

4. Math

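16. HyperLogLog - $E = \alpha_m m^2 Z$, with $Z = \big(\sum_{j=1}^{m} 2^{-M[j]}\big)^{-1}$ - mZ is the harmonic mean of the $2^{M[j]}$ - $\alpha_m \approx 0.7213$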

17. Theorem 1

18. Poissonization

20. Final implementation

21. Comparison

22. Types of observables

28. Thanks! Questions?
