Successfully reported this slideshow.
Upcoming SlideShare
×

2,034 views

Published on

Efficient count distinct in Hive using HLL UDFs.

Published in: Engineering
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

1. 1. HyperLogLog in Hive How to count sheep efficiently? Phillip Capper: Whitecliffs Sheep @bzamecnik
2. 2. Agenda ● the problem – count distinct elements ● exact counting ● fast approximate counting – using HLL in Hive ● comparing performance and accuracy ● appendix – a bit of theory of probabilistic counting ○ how it works?
3. 3. The problem: count distinct elements ● eg. the number of unique visitors ● each visitor can make a lot of clicks ● typically grouped in various ways ● "set cardinality estimation" problem
4. 4. Small data solutions ● sort the data O(N*log(N)) and skip duplicates O(N) ○ O(N) space ● put data into a hash or tree set and iterate ○ hash set: O(N^2) worst case build, O(N) iteration ○ tree set: O(N*log(N)) build, O(N) iteration ○ both O(N) space ● but: we have big data Example: ~100M unique values in 5B rows each day 32 bytes per value -> 3 GB unique, 150 GB total
5. 5. Problems with counting big data ● data is partitioned ○ across many machines ○ in time ● we can't sum cardinality of each partition ○ since the subsets are generally not disjoint ○ we would overestimate count(part1) + count(part1) >= count(part1 ∪ part2) ● we need to merge estimators and then estimate cardinality count(estimator(part1) ∪ estimator(part2))
6. 6. SELECT COUNT(DISTINCT user_id) FROM events; single reducer! Exact counting in Hive
7. 7. Exact counting in Hive – subquery SELECT COUNT(*) FROM ( SELECT 1 FROM events GROUP BY user_id ) unique_guids; Or more concisely: SELECT COUNT(*) FROM ( SELECT DISTINCT user_id FROM events ) unique_guids; many reducers two phases cannot combine more aggregations
8. 8. Exact counting in Hive ● hive.optimize.distinct.rewrite ○ allows to rewrite COUNT(DISTINCT) to subquery ○ since Hive 1.2.0
9. 9. Probabilistic counting ● fast results, but approximate ● practical example of using HLL in Hive ● more theory in the appendix
10. 10. ● klout/brickhouse ○ single option ○ no JAR, some tests ○ based on HLL++ from stream-lib (quite fast) ● jdmaturen/hive-hll ○ no options (they are in API, but not implemented!) ○ no JAR, no tests ○ compatible with java-hll, pg-hll, js-hll ● t3rmin4t0r/hive-hll-udf ○ no options, no JAR, no tests Implementations of HLL as Hive UDFs
11. 11. ● User-Defined Functions ● function registered from a class (loaded from JAR) ● JAR needs to be on HDFS (otherwise it fails) ● you can choose the UDF name at will ● work both in HiveServer2/Beeline and Hive CLI ADD JAR hdfs:///path/to/the/library.jar; CREATE TEMPORARY FUNCTION foo_func AS 'com.example.foo.FooUDF'; ● Usage: SELECT foo_func(...) FROM ...; UDFs in Hive
12. 12. ● to_hll(value) ○ aggregate values to HLL ○ UDAF (aggregation function) ○ + hash each value ○ optionally can be configured (eg. for precision) ● union_hlls(hll) ○ union multiple HLLs ○ UDAF ● hll_approx_count(hll) ○ estimate cardinality from a HLL ○ UDF HLL can be stored as binary or string type. General UDFs API for HLL
13. 13. ● Estimate of total unique visitors: SELECT hll_approx_count(to_hll(user_id)) FROM events; ● Estimate of total events + unique visitors at once: SELECT count(*) AS total_events hll_approx_count(to_hll(user_id)) AS unique_visitors FROM events; Example usage
14. 14. Example usage ● Compute each daily estimator once: CREATE TABLE daily_user_hll AS SELECT date, to_hll(user_id) AS users_hll FROM events GROUP BY date; ● Then quickly aggregate and estimate: SELECT hll_approx_count(union_hlls(users_hll)) AS user_count FROM daily_user_hll WHERE date BETWEEN '2015-01-01' AND '2015-01-31';
15. 15. https://github.com/klout/brickhouse - Hive UDF https://github.com/addthis/stream-lib - HLL++ \$ git clone https://github.com/klout/brickhouse disable maven-javadoc-plugin in pom.xml (since it fails) \$ mvn package \$ wget http://central.maven. org/maven2/com/clearspring/analytics/stream/2.3.0/stream- 2.3.0.jar \$ scp target/brickhouse-0.7.1-SNAPSHOT.jar stream-2.3.0.jar cluster-host: cluster-host\$ hdfs dfs -copyFromLocal *.jar /user/me/hive-libs Brickhouse – installation
16. 16. Brickhouse – usage ADD JAR /user/zamecnik/lib/brickhouse-0.7.1-15f5e8e.jar; ADD JAR /user/zamecnik/lib/stream-2.3.0.jar; CREATE TEMPORARY FUNCTION to_hll AS 'brickhouse.udf.hll. HyperLogLogUDAF'; CREATE TEMPORARY FUNCTION union_hlls AS 'brickhouse.udf. hll.UnionHyperLogLogUDAF'; CREATE TEMPORARY FUNCTION hll_approx_count AS 'brickhouse. udf.hll.EstimateCardinalityUDF'; to_hll(value, [bit_precision]) ● bit_precision: 4 to 16 (default 6)
17. 17. Hive-hll usage ADD JAR /user/zamecnik/lib/hive-hll-0.1-2807db.jar; CREATE TEMPORARY FUNCTION hll_hash as 'com.kresilas.hll. HashUDF'; CREATE TEMPORARY FUNCTION to_hll AS 'com.kresilas.hll. AddAggUDAF'; CREATE TEMPORARY FUNCTION union_hlls AS 'com.kresilas.hll. UnionAggUDAF'; CREATE TEMPORARY FUNCTION hll_approx_count AS 'com. kresilas.hll.CardinalityUDF';
18. 18. We have to explicitly hash the value: SELECT hll_approx_count(to_hll(hll_hash(user_id))) FROM events; Options for creating HLL: to_hll(x, [log2m, regwidth, expthresh, sparseon]) hardcoded to: [log2m=11, regwidth=5, expthresh=-1, sparseon=true] Hive-hll usage
19. 19. Nice things ● HLLs are additive ○ can be computed once ○ various partitions can be merged and estimated for cardinality later ● we can count multiple unique columns at once ○ no need to subquery ○ we can do wild grouping (by country, browser, …) ● HLLs take only little space
20. 20. Rolling window -- keep reasonable number of task for month of data SET mapreduce.input.fileinputformat.split.maxsize=5368709120; -- keep low number of output files (HLLs are quite small) SET hive.merge.mapredfiles=true; -- maximum precision SET hivevar:hll_precision=16; -- HLL for each day CREATE TABLE guids_parquet_hll AS SELECT '\${year}' AS year, '\${month}' AS month, day, to_hll(guid, \${hll_precision}) AS guid_hll FROM parquet.dump_\${year}_\${month} GROUP BY day;
21. 21. -- for each day estimate number of guids 7-days back CREATE TABLE zamecnik.guids_parquet_rolling_30_day_count AS SELECT `date`, hll_approx_count(guids_union) AS guid_count FROM ( SELECT concat(`year`, '-', `month`, '-', `day`) as `date`, union_hlls(guid_hll) OVER w AS guids_union FROM guids_parquet_hll WINDOW w AS ( ORDER BY `year`, `month`, `day` ROWS 6 PRECEDING ) ) rolling_guids; Rolling window
22. 22. ● when JARs are not on HDFS the query fails (why?) ● computing on many days of raw clickstream fails in Beeline (works in Hive CLI), parquet is ok ● HIVE-9073 WINDOW + custom UDAF → NPE ○ fixed in Hive 1.2.0 ● DISTRO-631 Pitfalls
23. 23. Approximation error ● Typically < 1-2 % ● Can be controlled by the parameters ● Example: 1 year of guids
24. 24. Appendix – more interesting things
25. 25. ● trade-off: some approximation error for far better performance and memory consumption ● sketch - streaming & probabilistic algorithm ● KMV - k minimal values ● linear counter ● loglog counter Probabilistic counting
26. 26. LogLog counter ● run length of initial zeros ● multiple estimators (registers) ● stochastic averaging ○ single hash function ○ multiple buckets ● hash → (register index, run length)
27. 27. Linear counter m = 20 # size of the register register = bitarray(m) # register, m bits def add(value): h = mmh3.hash(value) % m # select bit index register[h] = 1 # = max(1, register[h]) def cardinality(): u_n = register.count(0) # number of zeros v_n = u_n / m # relative number of zeros n_hat = -m * math.log(v_n) # estimate of the set cardinality return n_hat
28. 28. ● structure like loglog counter ● harmonic mean to combine registers ● correction for small and large cardinalities ● values needs to be hashed well – murmur3 HyperLogLog (HLL)
29. 29. HLL union ● just take max of each register value ● no loss – same result as HLL of union of streams ● parallelizable ● union preserves error bound, intersection/diff do not
30. 30. Further reading ● very nice explanation of HLL ● Probabilistic Data Structures For Web Analytics And Data Mining ● Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure ● HyperLogLog in Pure SQL ● Use Subqueries to Count Distinct 50X Faster ● It is possible to combine HLL of different sizes
31. 31. Papers ● HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm ● https://github.com/addthis/stream-lib#cardinality
32. 32. Other problems & structures ● set membership – bloom filter ● top-k elements – count-min-sketch, stream-summary