
ClickHouse Features for Advanced Users, by Aleksei Milovidov


Presentation from ClickHouse Meetup at Cloudflare
April 27, 2018


  1. ClickHouse Features For Advanced Users
  2. SAMPLE key Suppose you have clickstream data and you store it in non-aggregated form. You need to generate reports for your customers on the fly. This is a typical ClickHouse use case.
  3. SAMPLE key Most customers are small, but some are rather big. You want to get instant reports even for the largest customers. Solution: define a sampling key in your MergeTree table.
  4. SAMPLE key CREATE TABLE ... ENGINE = MergeTree ORDER BY (CounterID, Date, intHash32(UserID)) PARTITION BY toYYYYMM(Date) SAMPLE BY intHash32(UserID)
  5. SAMPLE key SELECT uniq(UserID) FROM hits_all WHERE CounterID = 76543 AND EventDate BETWEEN '2018-03-25' AND '2018-04-25' ┌─uniq(UserID)─┐ │ 47362335 │ └──────────────┘ 1 rows in set. Elapsed: 4.571 sec. Processed 1.17 billion rows, 16.37 GB (255.88 million rows/s., 3.58 GB/s.)
  6. SAMPLE key SELECT uniq(UserID) FROM hits_all SAMPLE 1/10 WHERE CounterID = 76543 AND EventDate BETWEEN '2018-03-25' AND '2018-04-25' ┌─uniq(UserID)─┐ │ 4742578 │ └──────────────┘ 1 rows in set. Elapsed: 0.638 sec. Processed 117.73 million rows, 1.65 GB (184.50 million rows/s., 2.58 GB/s.)
  7. SAMPLE key Must be: — included in the primary key; — uniformly distributed in the domain of its data type: Bad: Timestamp; Good: intHash32(UserID); — cheap to calculate: Bad: cityHash64(URL); Good: intHash32(UserID); — not placed after highly granular fields in the primary key: Bad: ORDER BY (Timestamp, sample_key); Good: ORDER BY (CounterID, Date, sample_key).
  8. SAMPLE key Sampling is: — deterministic; — works in a consistent way across different tables; — allows reading less data from disk.
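     Because sampling is deterministic and behaves consistently across tables that share the same sampling expression, per-user results from two tables land on the same subset of users and can be joined. A minimal sketch, assuming two hypothetical tables hits and visits that both declare SAMPLE BY intHash32(UserID):

        -- The same 1/10 of users is selected from both tables,
        -- so the sampled per-user counts can be joined consistently.
        SELECT UserID, hits, visits
        FROM
        (
            SELECT UserID, count() AS hits
            FROM hits SAMPLE 1/10
            WHERE CounterID = 76543
            GROUP BY UserID
        )
        ANY LEFT JOIN
        (
            SELECT UserID, count() AS visits
            FROM visits SAMPLE 1/10
            WHERE CounterID = 76543
            GROUP BY UserID
        )
        USING UserID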
  9. SAMPLE key, bonus SAMPLE 1/10 — select data for 1/10 of all possible sample keys; SAMPLE 1000000 — select from about (not less than) 1 000 000 rows on each shard; — you can use the _sample_factor virtual column to determine the relative sample factor; SAMPLE 1/10 OFFSET 1/10 — select the second 1/10 of all possible sample keys; SET max_parallel_replicas = 3 — select from multiple replicas of each shard in parallel;
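     A rough sketch of using the _sample_factor virtual column to extrapolate an additive aggregate from a sample back to the full table (count() scales with the factor; uniq() does not scale this simply):

        -- With SAMPLE 1000000, at least about one million rows are read per shard;
        -- multiply the sampled count by the sample factor to estimate the total.
        SELECT count() * any(_sample_factor) AS estimated_hits
        FROM hits_all SAMPLE 1000000
        WHERE CounterID = 76543
          AND EventDate BETWEEN '2018-03-25' AND '2018-04-25'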
  10. Aggregate function combiners -If -Array -ForEach -Merge -State Example: sumIf(x, cond)
  11. Aggregate function combiners: -If SELECT uniqIf(UserID, RefererDomain = 'yandex.ru') AS users_yandex, uniqIf(UserID, RefererDomain = 'google.ru') AS users_google FROM test.hits ┌─users_yandex─┬─users_google─┐ │ 19731 │ 8149 │ └──────────────┴──────────────┘
  12. Aggregate function combiners: -Array SELECT uniq(arr), uniqArray(arr), groupArray(arr), groupUniqArray(arr), groupArrayArray(arr), groupUniqArrayArray(arr) FROM ( SELECT ['hello', 'world'] AS arr UNION ALL SELECT ['goodbye', 'world'] ) FORMAT Vertical
  13. Aggregate function combiners: -Array Row 1: ────── uniq(arr): 2 uniqArray(arr): 3 groupArray(arr): [['hello','world'],['goodbye','world']] groupUniqArray(arr): [['hello','world'],['goodbye','world']] groupArrayArray(arr): ['hello','world','goodbye','world'] groupUniqArrayArray(arr): ['goodbye','world','hello']
  14. Aggregate function combiners ... they may be combined. Example: sumArrayIf, sumIfArray.
  15. Aggregate function combiners ... they may be combined. Example: sumForEachStateForEachIfArrayIfState.
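     An illustrative sketch of how the order of combiners changes the argument shapes, using sumArrayIf and sumIfArray from slide 14 (the toy data and column names here are made up):

        -- sumArrayIf(arr, cond): for rows where cond holds, sum every element of arr.
        -- sumIfArray(arr, conds): sum arr[i] only where the corresponding conds[i] holds.
        SELECT
            sumArrayIf(values, flag) AS sum_of_whole_arrays_where_flag,
            sumIfArray(values, conds) AS sum_of_flagged_elements
        FROM
        (
            SELECT [1, 2, 3] AS values, 1 AS flag, [1, 0, 1] AS conds
            UNION ALL
            SELECT [10, 20, 30], 0, [0, 1, 0]
        )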
  16. Intermediate aggregation states are first-class citizens Obtain an intermediate state with the -State combiner; Example: uniqState(user_id) AS state; — it returns a value of the AggregateFunction(...) data type; — you can store such values in tables; — merge them back with the -Merge combiner; Example: uniqMerge(state) AS result;
  17. Intermediate aggregation states SELECT avg(x), uniq(x) FROM ( SELECT 123 AS x UNION ALL SELECT 456 ) ┌─avg(x)─┬─uniq(x)─┐ │ 289.5 │ 2 │ └────────┴─────────┘
  18. Intermediate aggregation states SELECT avgState(x), uniqState(x) FROM ( SELECT 123 AS x UNION ALL SELECT 456 ) ┌─avgState(x)───┬─uniqState(x)──┐ │ (binary data) │ (binary data) │ └───────────────┴───────────────┘
  19. Intermediate aggregation states SELECT toTypeName(avgState(x)), toTypeName(uniqState(x)) FROM ( SELECT 123 AS x UNION ALL SELECT 456 ) FORMAT Vertical Row 1: ────── toTypeName(avgState(x)): AggregateFunction(avg, UInt16) toTypeName(uniqState(x)): AggregateFunction(uniq, UInt16)
  20. Intermediate aggregation states CREATE TABLE t ( users_state AggregateFunction(uniq, UInt64), ... ) ENGINE = AggregatingMergeTree ORDER BY ... SELECT uniqMerge(users_state) FROM t GROUP BY ...
  21. Intermediate aggregation states Main use case: incremental data aggregation with the AggregatingMergeTree table engine and a MATERIALIZED VIEW.
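     A minimal sketch of that use case, with hypothetical table and view names (hits as the source, daily_uniq_mv as the view):

        -- Incrementally maintain per-counter, per-day unique users as uniq states.
        CREATE MATERIALIZED VIEW daily_uniq_mv
        ENGINE = AggregatingMergeTree() ORDER BY (CounterID, EventDate)
        AS SELECT CounterID, EventDate, uniqState(UserID) AS users_state
        FROM hits
        GROUP BY CounterID, EventDate

        -- At query time, merge the stored states into final values.
        SELECT CounterID, EventDate, uniqMerge(users_state) AS users
        FROM daily_uniq_mv
        GROUP BY CounterID, EventDate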
  22. How we can make it better — versioning of the state serialization format; — identify the cases when different aggregate functions have the same state (sumState and sumIfState must be compatible); — allow creating an aggregation state with a function (for now it's possible to use arrayReduce for that purpose); — allow inserting AggregateFunction values into a table directly as a tuple of arguments; — adaptive index_granularity;
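     As a side note on the arrayReduce workaround mentioned above, a small sketch, assuming arrayReduce accepts combinator-suffixed names such as uniqState:

        -- Build a uniq aggregation state from a plain array, then merge it back.
        -- Expected result: u = 3.
        SELECT uniqMerge(state) AS u
        FROM (SELECT arrayReduce('uniqState', [1, 2, 3, 1]) AS state)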
  23. Consistency modes By default, ClickHouse implements asynchronous, conflict-free, multi-master replication. Asynchronous: an INSERT is acknowledged after being written on a single replica, and replication is done in the background. Some replicas may lag and miss some data; all replicas may miss different parts of the data. By default, you have only eventual consistency.
  24. Consistency modes You can enable strong consistency. SET insert_quorum = 2; — each INSERT is acknowledged by a quorum of replicas; — all replicas in the quorum are consistent: they contain data from all previous INSERTs (INSERTs are linearized); SET select_sequential_consistency = 1; — allow SELECTing only acknowledged data from consistent replicas (those that contain all acknowledged INSERTs).
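     A short sketch of the two settings in action, assuming a hypothetical replicated table events:

        -- Writer session: every INSERT must be acknowledged by 2 replicas.
        SET insert_quorum = 2;
        INSERT INTO events VALUES (1, '2018-04-27');

        -- Reader session: only read from replicas that already contain
        -- all quorum-acknowledged INSERTs.
        SET select_sequential_consistency = 1;
        SELECT count() FROM events;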
  25. GROUP BY in external memory
  26. GROUP BY in external memory You can simply increase max_memory_usage.
  27. GROUP BY in external memory You can also enable aggregation with external memory: max_bytes_before_external_group_by, distributed_aggregation_memory_efficient
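     A sketch of the relevant settings (the table name and thresholds are illustrative; the documentation recommends an external-aggregation threshold of about half of max_memory_usage):

        -- Spill aggregation state to disk once it exceeds ~10 GB,
        -- while keeping a 20 GB per-query memory limit.
        SET max_bytes_before_external_group_by = 10000000000;
        SET max_memory_usage = 20000000000;
        -- For distributed queries, merge aggregated data from shards
        -- in a more memory-efficient, streaming fashion on the initiator.
        SET distributed_aggregation_memory_efficient = 1;

        SELECT URL, count() AS c FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10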
  28. Geospatial functions — pointInPolygon; — pointInEllipses; — greatCircleDistance; SELECT pointInPolygon((lat, lon), [(6, 0), (8, 4), (5, 8), (0, 2), ...])
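     Two illustrative calls (the coordinates are made up; for greatCircleDistance the argument order is assumed to be lon1, lat1, lon2, lat2 in degrees, with the result in meters):

        -- Approximate distance between two points on the globe.
        SELECT greatCircleDistance(37.62, 55.75, 30.31, 59.93) AS distance_m

        -- 1 if the point lies inside the polygon given by its vertices, 0 otherwise.
        SELECT pointInPolygon((3., 3.), [(6., 0.), (8., 4.), (5., 8.), (0., 2.)]) AS inside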
  29. Machine learned models SELECT modelEvaluate('name', f1, ... fn) AS ctr_prediction https://www.altinity.com/blog/2018/1/18/clickhouse-for-machine-learning
  30. Machine learned models How we can make it better: — add simple regression models; — train models in ClickHouse directly; — online training of models; — parameterized models (dictionaries of multiple models);
  31. Data processing without server clickhouse-local tool $ clickhouse-local --input-format=CSV --output-format=PrettyCompact --structure="SearchPhrase String, UserID UInt64" --query="SELECT SearchPhrase, count(), uniq(UserID) FROM table WHERE SearchPhrase != '' GROUP BY SearchPhrase ORDER BY count() DESC LIMIT 20" < hits.csv ┌─SearchPhrase────────────┬─count()─┬─uniq(UserID)─┐ │ интерьер ванной комнаты │ 2166 │ 1 │ │ яндекс │ 1655 │ 478 │ │ весна 2014 мода │ 1549 │ 1 │ │ фриформ фото │ 1480 │ 1 │ │ анджелина джоли │ 1245 │ 1 │
  32. Data processing without server Bonus: SELECT and process data from an offline server.
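     A hedged sketch of that bonus use: pointing clickhouse-local at the data directory of a stopped server via the --path option (the path /var/lib/clickhouse and the table default.hits are assumptions for illustration):

        $ sudo clickhouse-local --path /var/lib/clickhouse \
              --query "SELECT count() FROM default.hits"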
  33. Data processing without server How can we make it better? — add more supported formats for Date and DateTime values in text form; — add formats like Avro and Parquet; — customizable CSV format; — "template" and "regexp" formats for more freeform data;
  35. Web site: https://clickhouse.yandex/ Google Groups: https://groups.google.com/forum/#!forum/clickhouse Mailing list: clickhouse-feedback@yandex-team.com Telegram chat: https://telegram.me/clickhouse_en and https://telegram.me/clickhouse_ru (now with over 1500 members) GitHub: https://github.com/yandex/ClickHouse/ Twitter: https://twitter.com/ClickHouseDB + meetups: Moscow, Saint Petersburg, Novosibirsk, Ekaterinburg, Minsk, Nizhny Novgorod, Berlin, Palo Alto, Beijing, Sunnyvale, San Francisco...
