HyperLogLog in Hive
How to count sheep efficiently?
Phillip Capper: Whitecliffs Sheep
@bzamecnik
Agenda
● the problem – count distinct elements
● exact counting
● fast approximate counting – using HLL in Hive
● comparing performance and accuracy
● appendix – a bit of theory of probabilistic counting
○ how does it work?
The problem: count distinct elements
● e.g. the number of unique visitors
● each visitor can make a lot of clicks
● typically grouped in various ways
● "set cardinality estimation" problem
Small data solutions
● sort the data O(N*log(N)) and skip duplicates O(N)
○ O(N) space
● put data into a hash or tree set and iterate
○ hash set: O(N^2) worst case build, O(N) iteration
○ tree set: O(N*log(N)) build, O(N) iteration
○ both O(N) space
● but: we have big data
Example:
~100M unique values in 5B rows each day
32 bytes per value -> 3 GB unique, 150 GB total
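For small data the hash-set variant is a one-liner; a minimal Python sketch:

# exact distinct count via a hash set: O(N) time, O(N) space
def count_distinct(values):
    return len(set(values))

With ~100M unique 32-byte values, the set alone needs several GB of memory on a single machine.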
Problems with counting big data
● data is partitioned
○ across many machines
○ in time
● we can't sum cardinality of each partition
○ since the subsets are generally not disjoint
○ we would overestimate
count(part1) + count(part2) >= count(part1 ∪ part2)
● we need to merge estimators and then estimate
cardinality
count(estimator(part1) ∪ estimator(part2))
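A toy Python illustration of the overestimate (with hypothetical partitions):

part1 = {"alice", "bob"}
part2 = {"bob", "carol"}
print(len(part1) + len(part2))  # 4 – "bob" counted twice
print(len(part1 | part2))       # 3 – the true distinct count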
Exact counting in Hive
SELECT COUNT(DISTINCT user_id)
FROM events;
single reducer!
Exact counting in Hive – subquery
SELECT COUNT(*) FROM (
SELECT 1 FROM events
GROUP BY user_id
) unique_guids;
Or more concisely:
SELECT COUNT(*) FROM (
SELECT DISTINCT user_id
FROM events
) unique_guids;
many reducers
two phases
cannot combine more aggregations
Exact counting in Hive
● hive.optimize.distinct.rewrite
○ allows rewriting COUNT(DISTINCT) to a subquery
○ since Hive 1.2.0
Probabilistic counting
● fast results, but approximate
● practical example of using HLL in Hive
● more theory in the appendix
Implementations of HLL as Hive UDFs
● klout/brickhouse
○ a single option (bit_precision)
○ no JAR, some tests
○ based on HLL++ from stream-lib (quite fast)
● jdmaturen/hive-hll
○ no options (they are in the API, but not implemented!)
○ no JAR, no tests
○ compatible with java-hll, pg-hll, js-hll
● t3rmin4t0r/hive-hll-udf
○ no options, no JAR, no tests
UDFs in Hive
● User-Defined Functions
● function registered from a class (loaded from JAR)
● JAR needs to be on HDFS (otherwise it fails)
● you can choose the UDF name at will
● work both in HiveServer2/Beeline and Hive CLI
ADD JAR hdfs:///path/to/the/library.jar;
CREATE TEMPORARY FUNCTION foo_func
AS 'com.example.foo.FooUDF';
● Usage:
SELECT foo_func(...) FROM ...;
General UDFs API for HLL
● to_hll(value)
○ aggregate values to HLL
○ UDAF (aggregation function)
○ also hashes each value itself
○ optionally configurable (e.g. for precision)
● union_hlls(hll)
○ union multiple HLLs
○ UDAF
● hll_approx_count(hll)
○ estimate cardinality from an HLL
○ UDF
HLL can be stored as binary or string type.
Example usage
● Estimate of total unique visitors:
SELECT hll_approx_count(to_hll(user_id))
FROM events;
● Estimate of total events + unique visitors at once:
SELECT
count(*) AS total_events,
hll_approx_count(to_hll(user_id))
AS unique_visitors
FROM events;
Example usage
● Compute each daily estimator once:
CREATE TABLE daily_user_hll AS
SELECT `date`, to_hll(user_id) AS users_hll
FROM events
GROUP BY `date`;
● Then quickly aggregate and estimate:
SELECT hll_approx_count(union_hlls(users_hll))
AS user_count
FROM daily_user_hll
WHERE `date` BETWEEN '2015-01-01' AND '2015-01-31';
Brickhouse – installation
https://github.com/klout/brickhouse - Hive UDF
https://github.com/addthis/stream-lib - HLL++
$ git clone https://github.com/klout/brickhouse
# disable maven-javadoc-plugin in pom.xml (since it fails)
$ mvn package
$ wget http://central.maven.org/maven2/com/clearspring/analytics/stream/2.3.0/stream-2.3.0.jar
$ scp target/brickhouse-0.7.1-SNAPSHOT.jar stream-2.3.0.jar cluster-host:
cluster-host$ hdfs dfs -copyFromLocal *.jar /user/me/hive-libs
Brickhouse – usage
ADD JAR /user/zamecnik/lib/brickhouse-0.7.1-15f5e8e.jar;
ADD JAR /user/zamecnik/lib/stream-2.3.0.jar;
CREATE TEMPORARY FUNCTION to_hll AS 'brickhouse.udf.hll.HyperLogLogUDAF';
CREATE TEMPORARY FUNCTION union_hlls AS 'brickhouse.udf.hll.UnionHyperLogLogUDAF';
CREATE TEMPORARY FUNCTION hll_approx_count AS 'brickhouse.udf.hll.EstimateCardinalityUDF';
to_hll(value, [bit_precision])
● bit_precision: 4 to 16 (default 6)
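For intuition: the standard error of an HLL with m registers is about 1.04/√m; assuming bit_precision is log2 of the register count (as in HLL++), a quick Python check:

import math
for p in (4, 6, 11, 16):
    m = 2 ** p
    print(p, m, "%.2f %%" % (104.0 / math.sqrt(m)))
# p=4 -> 26.00 %, p=6 -> 13.00 %, p=11 -> 2.30 %, p=16 -> 0.41 %

So the default of 6 is quite coarse; higher precision trades register space for accuracy.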
Hive-hll usage
ADD JAR /user/zamecnik/lib/hive-hll-0.1-2807db.jar;
CREATE TEMPORARY FUNCTION hll_hash AS 'com.kresilas.hll.HashUDF';
CREATE TEMPORARY FUNCTION to_hll AS 'com.kresilas.hll.AddAggUDAF';
CREATE TEMPORARY FUNCTION union_hlls AS 'com.kresilas.hll.UnionAggUDAF';
CREATE TEMPORARY FUNCTION hll_approx_count AS 'com.kresilas.hll.CardinalityUDF';
Hive-hll usage
We have to explicitly hash the value:
SELECT
hll_approx_count(to_hll(hll_hash(user_id)))
FROM events;
Options for creating HLL:
to_hll(x, [log2m, regwidth, expthresh, sparseon])
hardcoded to:
[log2m=11, regwidth=5, expthresh=-1, sparseon=true]
Nice things
● HLLs are additive
○ can be computed once
○ various partitions can be merged and estimated for
cardinality later
● we can count multiple unique columns at once
○ no need for a subquery
○ we can do wild grouping (by country, browser, …)
● HLLs take very little space
Rolling window
-- keep a reasonable number of tasks for a month of data
SET mapreduce.input.fileinputformat.split.maxsize=5368709120;
-- keep low number of output files (HLLs are quite small)
SET hive.merge.mapredfiles=true;
-- maximum precision
SET hivevar:hll_precision=16;
-- HLL for each day
CREATE TABLE guids_parquet_hll AS
SELECT
'${year}' AS year,
'${month}' AS month,
day,
to_hll(guid, ${hll_precision}) AS guid_hll
FROM parquet.dump_${year}_${month}
GROUP BY day;
Rolling window
-- for each day, estimate the number of guids 7 days back
CREATE TABLE zamecnik.guids_parquet_rolling_30_day_count
AS
SELECT
`date`,
hll_approx_count(guids_union) AS guid_count
FROM (
SELECT
concat(`year`, '-', `month`, '-', `day`) as `date`,
union_hlls(guid_hll) OVER w AS guids_union
FROM guids_parquet_hll
WINDOW w AS (
ORDER BY `year`, `month`, `day` ROWS 6 PRECEDING
)
) rolling_guids;
Pitfalls
● when JARs are not on HDFS, the query fails (why?)
● computing on many days of raw clickstream fails in
Beeline (works in Hive CLI); Parquet is OK
● HIVE-9073 WINDOW + custom UDAF → NPE
○ fixed in Hive 1.2.0
● DISTRO-631
Approximation error
● Typically < 1-2 %
● Can be controlled by the parameters
● Example: 1 year of guids
Appendix – more interesting things
Probabilistic counting
● trade-off: some approximation error for far better
performance and memory consumption
● sketch – streaming & probabilistic algorithm
● KMV – k minimal values
● linear counter
● loglog counter
LogLog counter
● run length of initial zeros
● multiple estimators (registers)
● stochastic averaging
○ single hash function
○ multiple buckets
● hash → (register index, run length)
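A minimal Python sketch of the hash split, assuming a 32-bit murmur3 hash (the mmh3 package, as in the linear counter below):

import mmh3

p = 4                                    # first p hash bits select the register
def bucket_and_run_length(value):
    h = mmh3.hash(value) & 0xFFFFFFFF    # unsigned 32-bit hash
    j = h >> (32 - p)                    # register index: top p bits
    w = h & ((1 << (32 - p)) - 1)        # remaining 32-p bits
    rho = (32 - p) - w.bit_length() + 1  # 1-based position of the leftmost 1-bit
    return j, rho                        # register j keeps the max rho seen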
Linear counter
import math
import mmh3                      # murmur3 hash
from bitarray import bitarray

m = 20                           # size of the register, in bits
register = bitarray(m)           # register, m bits
register.setall(0)

def add(value):
    h = mmh3.hash(value) % m     # select bit index
    register[h] = 1              # = max(1, register[h])

def cardinality():
    u_n = register.count(0)      # number of zeros
    v_n = u_n / float(m)         # relative number of zeros
    n_hat = -m * math.log(v_n)   # estimate of the set cardinality
    return n_hat
HyperLogLog (HLL)
● structure like the loglog counter
● harmonic mean to combine registers
● correction for small and large cardinalities
● values need to be hashed well – murmur3
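A compact Python sketch of the core HLL estimate, reusing the hash split from the loglog slide; the small/large-range corrections are omitted and alpha is the standard bias correction for large m:

import mmh3

p = 11                                   # log2 of the register count
m = 1 << p                               # 2048 registers
M = [0] * m                              # max run length seen per register

def add(value):
    h = mmh3.hash(value) & 0xFFFFFFFF    # unsigned 32-bit hash
    j = h >> (32 - p)                    # register index
    w = h & ((1 << (32 - p)) - 1)
    rho = (32 - p) - w.bit_length() + 1  # run length of initial zeros, 1-based
    M[j] = max(M[j], rho)

def estimate():
    alpha = 0.7213 / (1 + 1.079 / m)     # bias correction for large m
    return alpha * m * m / sum(2.0 ** -r for r in M)  # harmonic mean of 2^M[j]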
HLL union
● just take max of each register value
● no loss – same result as HLL of union of streams
● parallelizable
● union preserves error bound, intersection/diff do not
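A sketch of the union, continuing from the registers M above – just the register-wise max:

def union_registers(M1, M2):
    # identical to the registers of one HLL that saw both streams – lossless
    return [max(r1, r2) for r1, r2 in zip(M1, M2)]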
Further reading
● very nice explanation of HLL
● Probabilistic Data Structures For Web Analytics And
Data Mining
● Sketch of the Day: HyperLogLog — Cornerstone of a
Big Data Infrastructure
● HyperLogLog in Pure SQL
● Use Subqueries to Count Distinct 50X Faster
● It is possible to combine HLL of different sizes
Papers
● HyperLogLog in Practice: Algorithmic Engineering of
a State of The Art Cardinality Estimation Algorithm
● https://github.com/addthis/stream-lib#cardinality
Other problems & structures
● set membership – bloom filter
● top-k elements – count-min-sketch, stream-summary
