HyperLogLog in Hive - How to count sheep efficiently?


Efficient count distinct in Hive using HLL UDFs.

Published in: Engineering

  1. 1. HyperLogLog in Hive - How to count sheep efficiently? ● Phillip Capper: Whitecliffs Sheep ● @bzamecnik
  2. 2. Agenda ● the problem – count distinct elements ● exact counting ● fast approximate counting – using HLL in Hive ● comparing performance and accuracy ● appendix – a bit of theory of probabilistic counting ○ how does it work?
  3. 3. The problem: count distinct elements ● e.g. the number of unique visitors ● each visitor can make a lot of clicks ● typically grouped in various ways ● "set cardinality estimation" problem
  4. 4. Small data solutions ● sort the data O(N*log(N)) and skip duplicates O(N) ○ O(N) space ● put data into a hash or tree set and iterate ○ hash set: O(N^2) worst case build, O(N) iteration ○ tree set: O(N*log(N)) build, O(N) iteration ○ both O(N) space ● but: we have big data ○ example: ~100M unique values in 5B rows each day ○ 32 bytes per value -> 3 GB unique, 150 GB total
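The two small-data approaches can be sketched in a few lines of Python. This is only an illustration: the function names and the visitor IDs are made up, and both variants hold all values in memory, which is exactly what stops working at the data sizes quoted on the slide.

```python
def count_distinct_sorted(values):
    """Sort, then count the positions where the value changes."""
    ordered = sorted(values)          # O(N log N)
    count = 0
    previous = object()               # sentinel that equals no real value
    for value in ordered:             # O(N) pass that skips duplicates
        if value != previous:
            count += 1
            previous = value
    return count


def count_distinct_hashed(values):
    """Put everything into a hash set and take its size."""
    return len(set(values))           # O(N) expected time, O(N) space


clicks = ["ann", "bob", "ann", "cid", "bob", "ann"]   # made-up visitor IDs
print(count_distinct_sorted(clicks), count_distinct_hashed(clicks))   # 3 3
```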
  5. 5. Problems with counting big data ● data is partitioned ○ across many machines ○ in time ● we can't sum cardinality of each partition ○ since the subsets are generally not disjoint ○ we would overestimate: count(part1) + count(part2) >= count(part1 ∪ part2) ● we need to merge estimators and then estimate cardinality: count(estimator(part1) ∪ estimator(part2))
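A tiny worked example of the overestimate, using plain Python sets as a stand-in for two partitions (the visitor IDs are invented for illustration):

```python
part1 = {"ann", "bob", "cid"}            # visitors seen in partition 1
part2 = {"bob", "cid", "dan", "eve"}     # visitors seen in partition 2

sum_of_counts = len(part1) + len(part2)  # counts bob and cid twice
count_of_union = len(part1 | part2)      # what we actually want
overlap = len(part1 & part2)             # visitors present in both

print(sum_of_counts, count_of_union, overlap)   # 7 5 2
```

The sum is too high by exactly the size of the overlap, and the overlap cannot be recovered from the two counts alone. That is why the partitions have to be merged at the level of the estimators rather than at the level of the final numbers.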
  6. 6. Exact counting in Hive ● SELECT COUNT(DISTINCT user_id) FROM events; ● single reducer!
  7. 7. Exact counting in Hive – subquery ● SELECT COUNT(*) FROM ( SELECT 1 FROM events GROUP BY user_id ) unique_guids; ● or more concisely: SELECT COUNT(*) FROM ( SELECT DISTINCT user_id FROM events ) unique_guids; ● many reducers ● two phases ● cannot combine more aggregations
  8. 8. Exact counting in Hive ● hive.optimize.distinct.rewrite ○ allows rewriting COUNT(DISTINCT) to a subquery ○ since Hive 1.2.0
  9. 9. Probabilistic counting ● fast results, but approximate ● practical example of using HLL in Hive ● more theory in the appendix
  10. 10. Implementations of HLL as Hive UDFs ● klout/brickhouse ○ single option ○ no JAR, some tests ○ based on HLL++ from stream-lib (quite fast) ● jdmaturen/hive-hll ○ no options (they are in API, but not implemented!) ○ no JAR, no tests ○ compatible with java-hll, pg-hll, js-hll ● t3rmin4t0r/hive-hll-udf ○ no options, no JAR, no tests
  11. 11. UDFs in Hive ● User-Defined Functions ● function registered from a class (loaded from JAR) ● JAR needs to be on HDFS (otherwise it fails) ● you can choose the UDF name at will ● works in both HiveServer2/Beeline and the Hive CLI ● Registration: ADD JAR hdfs:///path/to/the/library.jar; CREATE TEMPORARY FUNCTION foo_func AS ''; ● Usage: SELECT foo_func(...) FROM ...;
  12. 12. General UDFs API for HLL ● to_hll(value) ○ aggregates values to an HLL ○ UDAF (aggregation function) ○ + hashes each value ○ optionally can be configured (e.g. for precision) ● union_hlls(hll) ○ unions multiple HLLs ○ UDAF ● hll_approx_count(hll) ○ estimates cardinality from an HLL ○ UDF ● HLL can be stored as binary or string type.
  13. 13. Example usage ● Estimate of total unique visitors: SELECT hll_approx_count(to_hll(user_id)) FROM events; ● Estimate of total events + unique visitors at once: SELECT count(*) AS total_events, hll_approx_count(to_hll(user_id)) AS unique_visitors FROM events;
  14. 14. Example usage ● Compute each daily estimator once: CREATE TABLE daily_user_hll AS SELECT date, to_hll(user_id) AS users_hll FROM events GROUP BY date; ● Then quickly aggregate and estimate: SELECT hll_approx_count(union_hlls(users_hll)) AS user_count FROM daily_user_hll WHERE date BETWEEN '2015-01-01' AND '2015-01-31';
  15. 15. Brickhouse – installation - Hive UDF - HLL++ $ git clone disable maven-javadoc-plugin in pom.xml (since it fails) $ mvn package $ wget http://central.maven.org/maven2/com/clearspring/analytics/stream/2.3.0/stream-2.3.0.jar $ scp target/brickhouse-0.7.1-SNAPSHOT.jar stream-2.3.0.jar cluster-host: cluster-host$ hdfs dfs -copyFromLocal *.jar /user/me/hive-libs
  16. 16. Brickhouse – usage ADD JAR /user/zamecnik/lib/brickhouse-0.7.1-15f5e8e.jar; ADD JAR /user/zamecnik/lib/stream-2.3.0.jar; CREATE TEMPORARY FUNCTION to_hll AS 'brickhouse.udf.hll.HyperLogLogUDAF'; CREATE TEMPORARY FUNCTION union_hlls AS 'brickhouse.udf.hll.UnionHyperLogLogUDAF'; CREATE TEMPORARY FUNCTION hll_approx_count AS 'brickhouse.udf.hll.EstimateCardinalityUDF'; to_hll(value, [bit_precision]) ● bit_precision: 4 to 16 (default 6)
  17. 17. Hive-hll usage ADD JAR /user/zamecnik/lib/hive-hll-0.1-2807db.jar; CREATE TEMPORARY FUNCTION hll_hash AS 'com.kresilas.hll.HashUDF'; CREATE TEMPORARY FUNCTION to_hll AS 'com.kresilas.hll.AddAggUDAF'; CREATE TEMPORARY FUNCTION union_hlls AS 'com.kresilas.hll.UnionAggUDAF'; CREATE TEMPORARY FUNCTION hll_approx_count AS 'com.kresilas.hll.CardinalityUDF';
  18. 18. Hive-hll usage ● We have to explicitly hash the value: SELECT hll_approx_count(to_hll(hll_hash(user_id))) FROM events; ● Options for creating HLL: to_hll(x, [log2m, regwidth, expthresh, sparseon]) ○ hardcoded to: [log2m=11, regwidth=5, expthresh=-1, sparseon=true]
  19. 19. Nice things ● HLLs are additive ○ can be computed once ○ various partitions can be merged and estimated for cardinality later ● we can count multiple unique columns at once ○ no need to subquery ○ we can do wild grouping (by country, browser, …) ● HLLs take very little space
  20. 20. Rolling window -- keep a reasonable number of tasks for a month of data SET mapreduce.input.fileinputformat.split.maxsize=5368709120; -- keep a low number of output files (HLLs are quite small) SET hive.merge.mapredfiles=true; -- maximum precision SET hivevar:hll_precision=16; -- HLL for each day CREATE TABLE guids_parquet_hll AS SELECT '${year}' AS year, '${month}' AS month, day, to_hll(guid, ${hll_precision}) AS guid_hll FROM parquet.dump_${year}_${month} GROUP BY day;
  21. 21. Rolling window -- for each day, estimate the number of guids over the last 7 days CREATE TABLE zamecnik.guids_parquet_rolling_30_day_count AS SELECT `date`, hll_approx_count(guids_union) AS guid_count FROM ( SELECT concat(`year`, '-', `month`, '-', `day`) AS `date`, union_hlls(guid_hll) OVER w AS guids_union FROM guids_parquet_hll WINDOW w AS ( ORDER BY `year`, `month`, `day` ROWS 6 PRECEDING ) ) rolling_guids;
  22. 22. Pitfalls ● when JARs are not on HDFS the query fails (why?) ● computing on many days of raw clickstream fails in Beeline (works in Hive CLI), parquet is ok ● HIVE-9073 WINDOW + custom UDAF → NPE ○ fixed in Hive 1.2.0 ● DISTRO-631
  23. 23. Approximation error ● Typically < 1-2 % ● Can be controlled by the parameters ● Example: 1 year of guids
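How the parameters control the error can be made concrete with the relative standard error usually quoted for HyperLogLog, about 1.04 / sqrt(m) for m registers. The sketch below assumes that brickhouse's bit_precision and hive-hll's log2m both mean log2(m); the slides do not state this explicitly, and the function name is invented here.

```python
import math


def hll_standard_error(precision_bits):
    """Approximate relative standard error for 2**precision_bits registers."""
    m = 2 ** precision_bits
    return 1.04 / math.sqrt(m)


# 6 = brickhouse default, 11 = hive-hll hardcoded log2m, 16 = maximum precision
for p in (6, 11, 16):
    print(p, 2 ** p, "{:.2%}".format(hll_standard_error(p)))
```

Under this formula the default precision of 6 gives roughly 13 %, log2m=11 roughly 2.3 %, and precision 16 roughly 0.4 %, so an error below 1-2 % would need about 12 bits or more. Each extra bit doubles the number of registers and shrinks the error by a factor of about 1.4.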
  24. 24. Appendix – more interesting things
  25. 25. Probabilistic counting ● trade-off: some approximation error for far better performance and memory consumption ● sketch - streaming & probabilistic algorithm ● KMV - k minimal values ● linear counter ● loglog counter
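As an illustration of the simplest sketch in the list, here is a minimal KMV (k minimal values) estimator. It is a sketch under assumptions, not code from any of the libraries above: SHA-1 stands in for a fast hash such as murmur3, and unit_hash and kmv_estimate are names made up for this example. The idea is that if hashes are uniform on [0, 1), the k-th smallest hash of n distinct values sits near k / n, so (k - 1) / kth_min estimates n.

```python
import hashlib
import heapq


def unit_hash(value):
    """Map a value to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0 ** 64


def kmv_estimate(values, k=256):
    heap = []          # max-heap (negated) holding the k smallest hashes
    members = set()    # hashes currently in the heap, to skip duplicates
    for value in values:
        h = unit_hash(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)   # drop the largest kept hash
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))        # fewer than k distinct values: exact
    return (k - 1) / -heap[0]          # -heap[0] is the k-th smallest hash


stream = [i % 20000 for i in range(60000)]   # 20000 distinct values, each 3 times
print(round(kmv_estimate(stream)))
```

Memory is bounded by k regardless of the stream length, and the relative error shrinks roughly as 1 / sqrt(k).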
  26. 26. LogLog counter ● run length of initial zeros ● multiple estimators (registers) ● stochastic averaging ○ single hash function ○ multiple buckets ● hash → (register index, run length)
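The "hash → (register index, run length)" step can be sketched as follows. This is an illustration with assumed names (hash64, split_hash, add) and SHA-1 standing in for the hash function. It follows the common convention of storing the position of the first 1 bit, which is the run of initial zeros plus one.

```python
import hashlib


def hash64(value):
    """64-bit hash of a value (SHA-1 prefix, used here only for illustration)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def split_hash(h, index_bits, hash_bits=64):
    """Split a hash into (register index, position of the first 1 bit)."""
    rest_bits = hash_bits - index_bits
    index = h >> rest_bits                      # top bits select the register
    rest = h & ((1 << rest_bits) - 1)           # remaining bits
    rank = rest_bits - rest.bit_length() + 1    # leading zeros in rest, plus 1
    return index, rank


def add(registers, value, index_bits):
    """Stochastic averaging: one hash, many registers, each keeps its maximum."""
    index, rank = split_hash(hash64(value), index_bits)
    registers[index] = max(registers[index], rank)


# 8-bit toy hash 10|000101: register 2, first 1 bit at position 4
print(split_hash(0b10000101, index_bits=2, hash_bits=8))   # (2, 4)

registers = [0] * 16
for visitor in range(1000):
    add(registers, visitor, index_bits=4)
```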
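  27. 27. Linear counter
      import math
      import mmh3                        # MurmurHash3 bindings
      from bitarray import bitarray

      m = 20                             # size of the register
      register = bitarray(m)             # register, m bits
      register.setall(0)                 # start from all zeros (older bitarray versions leave bits uninitialized)

      def add(value):
          h = mmh3.hash(value) % m       # select bit index
          register[h] = 1                # = max(1, register[h])

      def cardinality():
          u_n = register.count(0)        # number of zeros
          v_n = u_n / m                  # relative number of zeros (undefined estimate when u_n == 0)
          n_hat = -m * math.log(v_n)     # estimate of the set cardinality
          return n_hat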
  28. 28. HyperLogLog (HLL) ● structure like loglog counter ● harmonic mean to combine registers ● correction for small and large cardinalities ● values need to be hashed well – murmur3
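A compact sketch of how these pieces fit together, under stated assumptions: the class name TinyHLL is invented, SHA-1 replaces murmur3 to keep the example dependency-free, and the bias constant is the usual one for m >= 128 registers. Only the small-cardinality correction (falling back to linear counting) is included; with a 64-bit hash the large-cardinality correction is commonly omitted.

```python
import hashlib
import math


class TinyHLL:
    def __init__(self, p=10):
        self.p = p                       # bits used for the register index
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        rest_bits = 64 - self.p
        index = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)                 # bias constant, m >= 128
        # harmonic mean of 2**register across all registers, scaled by alpha * m
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)               # small range: linear counting
        return raw


hll = TinyHLL(p=10)
for i in range(150000):
    hll.add(i % 50000)          # 50000 distinct values, each seen 3 times
print(round(hll.cardinality()))
```

With 1024 registers the expected relative error is around 3 %, and the sketch never stores more than 1024 small integers however long the stream is.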
  29. 29. HLL union ● just take max of each register value ● no loss – same result as HLL of union of streams ● parallelizable ● union preserves error bound, intersection/diff do not
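The "no loss" property is easy to check on raw registers. The sketch below uses assumed names (registers_of, union) and SHA-1 as a stand-in hash: it builds registers for two overlapping streams, merges them with a register-wise max, and compares the result with registers built directly from the combined stream.

```python
import hashlib

P = 8            # index bits
M = 1 << P       # number of registers


def registers_of(values):
    """Build HLL-style registers for a stream of values."""
    regs = [0] * M
    for value in values:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - P)
        rest = h & ((1 << (64 - P)) - 1)
        rank = (64 - P) - rest.bit_length() + 1
        regs[index] = max(regs[index], rank)
    return regs


def union(regs_a, regs_b):
    """Merge two register arrays of the same size."""
    return [max(a, b) for a, b in zip(regs_a, regs_b)]


day1 = range(0, 6000)          # made-up visitor IDs, overlapping with day2
day2 = range(4000, 10000)

merged = union(registers_of(day1), registers_of(day2))
direct = registers_of(list(day1) + list(day2))
print(merged == direct)        # True
```

Because each register only ever keeps a maximum, the order and grouping of the inputs do not matter, which is what makes per-day HLLs mergeable in any combination.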
  30. 30. Further reading ● very nice explanation of HLL ● Probabilistic Data Structures For Web Analytics And Data Mining ● Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure ● HyperLogLog in Pure SQL ● Use Subqueries to Count Distinct 50X Faster ● It is possible to combine HLL of different sizes
  31. 31. Papers ● HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm ●
  32. 32. Other problems & structures ● set membership – bloom filter ● top-k elements – count-min-sketch, stream-summary