
Web analytics at scale with Druid at naver.com


Slides from Strata 2018 London:

https://conferences.oreilly.com/strata/strata-eu/public/schedule/detail/65329


Web analytics at scale with Druid at naver.com

  1. 1. Web analytics at scale with Druid at naver.com Jason Heo (analytic.js.heo@navercorp.com) Doo Yong Kim (dooyong.kim@navercorp.com)
  2. 2. • Part 1 • About naver.com • What is & Why Druid • The Architecture of our service • Part 2 • Druid Segment File Structure • Spark Druid Connector • TopN Query • Plywood & Split-Apply-Combine • How to fix TopN’s unstable results • Appendix Agenda
  3. 3. About naver.com https://en.wikipedia.org/wiki/Naver • naver.com • The biggest website in South Korea • The Google of South Korea • 74.7% of all web searches in South Korea
  4. 4. • Developed Analytics Systems at Naver • Working with Databases since 2000 • Author of 3 MySQL books • Currently Elasticsearch, Spark, Kudu, and Druid • Working on Spark and Druid-based OLAP platform • Implemented search infrastructure at coupang.com • Have been interested in MPP and advanced file formats for big data Jason Heo Doo Yong Kim About Speakers
  5. 5. Platforms we've tested so far Parquet ORC Carbon Data Elasticsearch ClickHouse Kudu Druid SparkSQL Hive Impala Drill Presto Kylin Phoenix Query Engine Storage Format
  6. 6. • What is Druid? • Our Requirements • Why Druid? • Experimental Results What is & Why Druid
  7. 7. • Column-oriented distributed datastore • Real-time streaming ingestion • Scalable to petabytes of data • Approximate algorithms (hyperLogLog, theta sketch) https://www.slideshare.net/HadoopSummit/scalable- realtime-analytics-using-druid From HORTONWORKS What is Druid?
  8. 8. From my point of view • Druid is a cumbersome version of Elasticsearch (w/o search feature) • Similar points • Secondary Index • DSLs for query • Flow of Query Processing • Terms Aggregation ↔ TopN Query, Coordinator ↔ Broker, Data Node ↔ Historical • Different points • more complicated to operate • better with much more data • better for Ultra High Cardinality • less GC overhead • better for Spark Connectivity (for Full Scan) What is Druid?
  9. 9. What is Druid? - Architecture (diagram). Components: Real-time Node, Historical, Broker, Overlord, Middle Manager, Coordinator. Kafka feeds the index service; the Coordinator handles segment management; MySQL stores metadata; ZooKeeper handles cluster management; Deep Storage (HDFS, S3) stores Druid segments for durability. Query service: clients send Druid DSL to the Broker; Historicals download segments from deep storage to serve queries.
  10. 10. What is Druid? - Queries. Queries are sent as DSL to the Broker, which fans out to Real-time Nodes and Historicals. GroupBy example: { "queryType": "groupBy", "dataSource": "sample_data", "dimension": ["country", "device"], "filter": {}, "aggregation": [...], "limitSpec": [...] }. TopN example: { "queryType": "topN", "dataSource": "sample_data", "dimension": "sample_dim", "filter": {...}, "aggregation": [...], "threshold": 5 }. SQL (SELECT ... FROM dataSource) can be converted to Druid DSL • No JOIN
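For reference, a minimal sketch of how a client could submit such a DSL query over HTTP (plain Scala; the Broker host/port, datasource, interval, and aggregator below are placeholder assumptions, not taken from the slides — Druid Brokers accept native JSON queries on POST /druid/v2):

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import scala.io.Source

// A syntactically complete topN query; adjust datasource, interval, and metric to your data
val dsl =
  """{"queryType": "topN", "dataSource": "sample_data",
    |"intervals": ["2018-05-01/2018-06-01"], "granularity": "all",
    |"dimension": "sample_dim", "metric": "count",
    |"aggregations": [{"type": "count", "name": "count"}], "threshold": 5}""".stripMargin

val conn = new URL("http://broker-host:8082/druid/v2").openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
conn.getOutputStream.write(dsl.getBytes(StandardCharsets.UTF_8))
println(Source.fromInputStream(conn.getInputStream, "UTF-8").mkString) // JSON array of results
```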
  11. 11. SELECT COUNT(*) FROM logs WHERE url = ?; 1. Random Access (OLTP) SELECT url, COUNT(*) FROM logs GROUP BY url ORDER BY COUNT(*) DESC LIMIT 10; 2. Most Viewed SELECT visitor, COUNT(*) FROM logs GROUP BY visitor; 3. Full Aggregation SELECT ... FROM logs INNER JOIN users GROUP BY ... HAVING ... 4. JOIN Why Druid? - Requirements
  12. 12. Why Druid? For OLTP: supports Bitmap Index and fast Random Access. For OLAP: supports TopN Query (100x faster than GroupBy query) and complex queries (JOIN, HAVING, etc.) with our Spark Druid Connector. A perfect solution for both OLTP and OLAP. 1. Random Access ★★★★☆ 2. Most Viewed ★★★★★ 3. Full Aggregation ★★★★☆ 4. JOIN ★★★★☆
  13. 13. Comparison – Elasticsearch. Pros: fast Random Access; Terms Aggregation (the TopN Query counterpart); easy to manage. Cons: slow full scan with es-hadoop; low performance for multi-field terms aggregation (especially at high cardinality); GC overhead. 1. Random Access ★★★★★ 2. Most Viewed ★★★☆☆ 3. Full Aggregation ☆☆☆☆☆ 4. JOIN ☆☆☆☆☆
  14. 14. Comparison – Kudu + Impala. Pros: fast Random Access via Primary Key; fast OLAP with Impala. Cons: no Secondary Index; no TopN Query. 1. Random Access ★★★★★ (PK) / ★☆☆☆☆ (non-PK) 2. Most Viewed ☆☆☆☆☆ 3. Full Aggregation ★★★★★ 4. JOIN ★★★★★
  15. 15. Experimental Results – Response Time (seconds, lower is better). Random Access: Elasticsearch 0.003, Kudu+Impala 0.14, Druid 0.03. Most Viewed (1 field): Elasticsearch 0.25, Kudu+Impala 0.35, Druid 0.08. Most Viewed (2 fields): Elasticsearch 2.7, Kudu+Impala 2.9, Druid 0.78.
  16. 16. Experimental Results – Notes. Random Access: ES uses the Lucene index, Kudu+Impala the Primary Key, Druid the Bitmap Index. Most Viewed: ES uses Terms Aggregation, Kudu+Impala GROUP BY, Druid TopN (Split-Apply-Combine for multiple fields). Data sets: 210 million rows, same parallelism, same number of shards/partitions/segments.
  17. 17. The Architecture of our service (diagram). Logs arrive through Kafka. Real-time ingestion: Kafka → Kafka Indexing Service → Druid. Batch ingestion: Kafka → transform logs → Parquet → remove duplicated logs → Parquet → daily batch job → batch ingestion into Druid (Coordinator, Overlord, Middle Manager, Peon, Historical, Broker). Query path: Zeppelin and the API Server go through Plywood, which generates Druid DSL for the Broker; the Spark Thrift Server runs SparkSQL on Spark Executors that read segment files on the Historicals.
  18. 18. Switching
  19. 19. Introduction – Who am I? 1. Doo Yong Kim 2. Naver 3. Software engineer 4. Big data
  20. 20. Contents 1. Druid Storage Model 2. Spark Druid Connector Implementation 3. TopN Query 4. Plywood & Split-Apply-Combine 5. Extending Druid Query
  21. 21. Druid Storage Model – 4 characteristics • Columnar format • Explicit distinction between dimensions and metrics • Bitmap index • Dictionary encoding
  22. 22. Druid Storage Model - background. Druid treats dimensions and metrics separately. Dimensions: carry a Bitmap Index and serve as GroupBy fields. Metrics: serve as arguments of aggregate functions. Druid Ingestion Spec: { "dimensionsSpec": { "dimensions": ["country", "device", ...] }, ... "metricsSpec": [ { "type": "count", "name": "count" }, { "type": "doubleSum", "fieldName": "duration", "name": "duration" } ] }
  23. 23. Druid Storage Model - Dimension. Country (Dimension) column: Korea, UK, Korea, Korea, Korea, UK. Dictionary for country: Korea ↔ 0, UK ↔ 1. Dictionary-encoded values: 0 1 0 0 0 1. Bitmaps: Korea → 101110, UK → 010001 (UK appears in the 2nd and 6th rows).
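A minimal sketch of the encoding described above, in plain Scala rather than Druid's actual segment code: the column is dictionary-encoded, and one bitmap is kept per dictionary value.

```scala
// Column values from the slide
val country = Seq("Korea", "UK", "Korea", "Korea", "Korea", "UK")

// Dictionary: value -> id, in order of first appearance (Korea -> 0, UK -> 1)
val dictionary = country.distinct.zipWithIndex.toMap

// Dictionary-encoded column: 0 1 0 0 0 1
val encoded = country.map(dictionary)

// One bitmap per value; bit i is set when row i holds that value
// Korea -> "101110", UK -> "010001"
val bitmaps = dictionary.keys.map { v =>
  v -> country.map(c => if (c == v) '1' else '0').mkString
}.toMap
```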
  24. 24. Druid Storage Model - Metric. Country (Dimension) / duration (Metric): Korea 13, UK 2, Korea 15, Korea 29, Korea 30, UK 14. The metric column is stored as the plain values 13, 2, 15, 29, 30, 14.
  25. 25. Druid Storage Model – filtering. SELECT country, device, duration FROM logs WHERE country = 'Korea' AND device LIKE 'Iphone%'. The equality predicate country = 'Korea' is filtered by the country bitmap; the LIKE predicate on device is evaluated as a manual row filter on the remaining rows, yielding rows such as ('Korea', 'Iphone 6s', 13).
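A rough illustration of that two-step evaluation (plain Scala, much simplified from what Druid actually does): the equality predicate is answered from the bitmap, and the LIKE predicate is then checked row by row only on the bitmap-selected rows.

```scala
// (country, device, duration) rows; only the first row is taken from the slide,
// the others are made up for the example
val rows = Vector(
  ("Korea", "Iphone 6s", 13L),
  ("UK",    "Galaxy S8",  2L),
  ("Korea", "Galaxy S7", 15L),
  ("Korea", "Iphone X",  29L))

// Step 1: bitmap lookup for country = 'Korea' (bit i is set when row i matches)
val koreaBitmap = rows.map(_._1 == "Korea")

// Step 2: manual row filter for device LIKE 'Iphone%', applied only where the bitmap is set
val result = rows.zip(koreaBitmap).collect {
  case (row @ (_, device, _), true) if device.startsWith("Iphone") => row
}
// result: Vector(("Korea","Iphone 6s",13), ("Korea","Iphone X",29))
```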
  26. 26. Spark Druid Connector
  27. 27. Spark Druid Connector 1. 3 Ways to implement, Our implementation 2. What is needed to implement 3. Sample Codes, Performance Test 4. How to implement
  28. 28. Spark Druid Connector - 3 Ways to implement. 1st way: the Spark Driver rewrites SQL into DSL and sends it to the Druid Broker. Good if the SQL is rewritable to DSL, but DSL does not support all SQL (e.g. JOIN, sub-queries). 2nd way: Spark Executors send Select DSL to Historicals and receive large JSON results. Easy to implement and no need to understand the Druid index library, but ser/de is expensive and parallelism is bounded by the number of Historicals.
  29. 29. Spark Druid Connector - 3 Ways to implement. 3rd way: read Druid segment files directly with the Druid segment library, similar to the way Parquet is read, and allocate Spark executors on the Historical nodes. Difficult to implement and requires understanding the Druid segment library. We chose this way!
  30. 30. spark.read .format("com.navercorp.ni.druid.spark.druid") .option("coordinator", "host1.com:18081") .option("broker", "host2.com:18082") .option("datasource", "logs").load() .createOrReplaceTempView("logs") Spark Druid Connector – How to use spark.sql(""" SELECT country, device, duration FROM logs WHERE country = 'Korea' AND device LIKE 'Iphone%' """).show(false) Create table Execute Query
  31. 31. Spark Druid Connector - Performance (seconds, lower is better; 4.4B rows in total). Random Access: Spark Druid 0.21 vs. Spark Parquet 7.5. Full Scan & GROUP BY: Spark Druid 24.1 vs. Spark Parquet 7.7.
  32. 32. Spark Druid Connector – How to implement
  33. 33. Spark Druid Connector – How to implement 1. Druid Rest API 2. Druid Segment Library 3. Spark Data Source API
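To make those three pieces concrete, here is a skeleton of a Spark Data Source (V1) relation; the class names and wiring are a hypothetical sketch, not the actual connector code. Spark pushes the required columns and filters into buildScan, and the relation is expected to turn them into Druid segment reads.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

// Registered under a short name so spark.read.format("druid") can find this sketch
class DefaultSource extends RelationProvider with DataSourceRegister {
  override def shortName(): String = "druid"
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new DruidRelation(sqlContext, parameters)
}

class DruidRelation(val sqlContext: SQLContext, parameters: Map[String, String])
  extends BaseRelation with PrunedFilteredScan {

  // In the real connector the schema comes from a segmentMetadata query (next slide);
  // hard-coded here to keep the sketch self-contained
  override def schema: StructType = StructType(Seq(
    StructField("__time",   TimestampType),
    StructField("country",  StringType),
    StructField("device",   StringType),
    StructField("duration", DoubleType)))

  // requiredColumns / filters are pushed down by Spark; a real implementation would
  // prune segments, convert filters to DimFilters, and return an RDD that reads
  // segment files with the Druid segment library
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = ???
}
```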
  34. 34. Spark Druid Connector – Get table schema Spark Driver Druid Broker { "queryType": "segmentMetadata", "dataSource": "logs", "merge": "true" } { "columns": { "__time": {...}, "country": {...}, "device": {...}, "duration": {...} ... } } spark.read .format("...") .option("coordinator", "...") .option("broker", "...") .option("datasource", "logs") .load() Schema
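A small sketch of how the segmentMetadata response could be turned into a Spark schema; the Druid type names are the ones the API commonly reports, but the exact mapping here is an assumption, not the connector's code.

```scala
import org.apache.spark.sql.types._

// Map a Druid column type (as reported by segmentMetadata) to a Spark field
def toSparkField(name: String, druidType: String): StructField = druidType match {
  case "LONG"             => StructField(name, LongType)
  case "FLOAT" | "DOUBLE" => StructField(name, DoubleType)
  case "STRING"           => StructField(name, StringType)
  case _                  => StructField(name, StringType) // fallback for complex types
}

// Columns as returned by the Broker for datasource "logs"
val columns = Seq("country" -> "STRING", "device" -> "STRING", "duration" -> "DOUBLE")
val schema  = StructType(StructField("__time", TimestampType) +: columns.map {
  case (n, t) => toSparkField(n, t)
})
```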
  35. 35. Spark Druid Connector – Partition pruning. WHERE country = 'Korea' AND __time = CAST('2018-05-23' AS TIMESTAMP). Segments can be pruned by the interval condition and by single-dimension partitioning: 1. Interval condition: the serverview returns only matching segments (GET /.../logs/intervals/2018-05-23/serverview). 2. Single-dimension partition: compare the shardSpec start and end with the given filter. Example serverview response: [ { "segment": { "shardSpec": { "dimension": "country", "start": "null", "end": "b" ...}, "id": "segmentId" }, "servers": [ {"host": "host1"}, {"host": "host2"} ] }, { "segment": ...}, ... ]
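A simplified sketch of the single-dimension-partition check (hypothetical types; Druid's real shardSpec logic is richer): a segment can be skipped when the equality-filter value falls outside the shard's [start, end) range.

```scala
// start/end of "null" in the serverview response mean an unbounded range
case class SingleDimShardSpec(dimension: String, start: Option[String], end: Option[String])

def mayContain(shard: SingleDimShardSpec, filterDim: String, value: String): Boolean =
  if (shard.dimension != filterDim) true // cannot prune on a different dimension
  else shard.start.forall(_ <= value) && shard.end.forall(value < _)

// WHERE country = 'korea': a shard covering [null, "b") can be pruned
mayContain(SingleDimShardSpec("country", None, Some("b")), "country", "korea") // false
// while a shard covering ["b", null) must be scanned
mayContain(SingleDimShardSpec("country", Some("b"), None), "country", "korea") // true
```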
  36. 36. Spark Druid Connector – Spark filters to Druid filters WHERE country = 'Korea' AND city = 'Seoul' buildScan(requiredColumns: [country, device, duration], filters: [EqualTo(country, Korea), EqualTo(city, Seoul)]) Spark's filters are converted into Druid's DimFilter private def toDruidDimFilters(sparkFilter: Filter): DimFilter = { sparkFilter match { ... case EqualTo(attribute, value) => { new SelectorDimFilter( attribute, value.toString, null ) } case GreaterThan(attribute, value) => ...
  37. 37. Spark Druid Connector – Attach locality (avoid RACK_LOCAL) • getPreferredLocations(partition: Partition) • Returns the hosts that hold the Druid segments • Caution: Spark does not always guarantee that executors launch on preferred locations • Set spark.locality.wait to a very large value
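Roughly, the locality hint boils down to overriding getPreferredLocations on the connector's RDD; the classes below are a hypothetical sketch rather than the actual connector.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// One partition per Druid segment, remembering which Historicals hold it
case class DruidSegmentPartition(index: Int, segmentId: String, hosts: Seq[String]) extends Partition

class DruidSegmentRDD(sc: SparkContext, segments: Seq[DruidSegmentPartition])
  extends RDD[Row](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    segments.toArray[Partition]

  // Ask the scheduler to run each task on a Historical that already has the segment;
  // combine with a large spark.locality.wait so Spark actually waits for those hosts
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    split.asInstanceOf[DruidSegmentPartition].hosts

  override def compute(split: Partition, context: TaskContext): Iterator[Row] = ???
}
```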
  38. 38. Spark Druid Connector - How to implement Done! Now Spark executor can read records from Druid segment files. Segment File Spark Druid Connector Spark
  39. 39. TopN Query
  40. 40. TopN Query 1. How TopN Query works 2. Performance 3. Limitation
  41. 41. TopN Query – We heavily use TopN query. TopN query flow (N=100): the user sends the query to the Broker; each Historical node (with its segment cache) returns its local top 100 results; the Broker merges the per-Historical results and builds the final records for the client.
  42. 42. TopN Query - Example: Top 3 country ORDER BY SUM(duration). Top 3 of Historical a: korea 114, uk 47, usa 21. Top 3 of Historical b: uk 67, korea 24, usa 3. Top 3 of Historical c: korea 87, uk 57, china 33. Broker merge: korea 225, uk 171, china 33, usa 24. Broker Top 3 result: korea 225, uk 171, china 33.
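The merge in this example can be sketched in a few lines of plain Scala (not Druid's code): each Historical contributes only its local top 3, and the Broker sums and re-ranks what it received.

```scala
def mergeTopN(partials: Seq[Seq[(String, Long)]], n: Int): Seq[(String, Long)] =
  partials.flatten
    .groupBy(_._1)                                   // group partial sums by country
    .map { case (c, vs) => c -> vs.map(_._2).sum }   // SUM(duration) over what was reported
    .toSeq
    .sortBy(-_._2)                                   // ORDER BY SUM(duration) DESC
    .take(n)

val histA = Seq("korea" -> 114L, "uk" -> 47L, "usa" -> 21L)
val histB = Seq("uk" -> 67L, "korea" -> 24L, "usa" -> 3L)
val histC = Seq("korea" -> 87L, "uk" -> 57L, "china" -> 33L)

mergeTopN(Seq(histA, histB, histC), 3)
// List(("korea",225), ("uk",171), ("china",33)) -- the Broker's Top 3 result
```

As the next slide shows, this is where the approximation comes from: anything that misses a Historical's local top N never reaches the Broker.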
  43. 43. TopN – is an approximate approach. Full per-Historical aggregates: Historical a: korea 114, uk 47, usa 21, china 17. Historical b: uk 67, korea 24, usa 3, china 1. Historical c: korea 87, uk 57, usa 22, china 33. Merged Top 3: korea 225, uk 171, china 33 – values that fall outside a node's local top 3 are missing from the merge.
  44. 44. TopN – 100x faster than GroupBy. GroupBy (few minutes) vs. TopN (1,536 ms):
rank  GroupBy metric  TopN metric
1     1,948,297       1,948,297
2     1,404,167       1,404,167
3     1,383,538       1,383,538
4     1,141,977       1,141,977
5     1,099,028       1,090,277
6     1,090,277       1,079,242
7     1,051,448       1,051,448
8     996,961         996,961
9     941,284         941,284
10    937,078         937,078
Notes: 1. rank changed (rank 5 → rank 6), 2. value changed (1,099,028 → 1,079,242).
  45. 45. TopN – Limitations 1. TopN supports only one dimension. 2. Results are unstable when the replication factor is 2 or more.
  46. 46. Plywood 1. Plywood 2. Split-Apply-Combine 3. Our Improvement
  47. 47. 1. https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf 2. http://plywood.imply.io/index // Split [ country, city, device ] ply() .apply(dataSource, $(dataSource).filter(...)) // Filter1 .apply(dataSource, $(dataSource).filter(...)) // Filter2 .apply(dataSource, $(dataSource).filter(...)) // Filter3 .apply('country', $(dataSource).split(...) .apply(...) // Filter to Split1 (country) .apply('city', $(dataSource).split(...) .apply(...) // Filter to Split2 (city) .apply(...) // Filter to Split2 (city) .apply('device', $(dataSource).split(...) .apply(...) // Filter to Split3 (device) ) ) ) SELECT country, city, device FROM $TABLE WHERE … GROUP BY country, city, device ≒ Split Apply Combine - SAC
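In effect, the nested applies emulate a multi-column GROUP BY by cascading single-dimension splits. A rough sketch of that split-apply-combine flow over an in-memory dataset (plain Scala; Plywood issues Druid TopN/groupBy queries instead, and the column names are simply the ones from the snippet above):

```scala
// (country, city, device, duration) rows
type LogRow = (String, String, String, Long)

// "Apply": aggregate SUM(duration) per key and keep the top n keys
def topN[K](rows: Seq[LogRow], n: Int)(key: LogRow => K): Seq[(K, Long)] =
  rows.groupBy(key)
    .map { case (k, g) => k -> g.map(_._4).sum }
    .toSeq.sortBy(-_._2).take(n)

// "Split" by country, then within each top country split by city, then by device
def splitApplyCombine(rows: Seq[LogRow], n: Int): Seq[(String, String, String, Long)] =
  for {
    (country, _)    <- topN(rows, n)(_._1)
    countryRows      = rows.filter(_._1 == country)      // filter to Split1 (country)
    (city, _)       <- topN(countryRows, n)(_._2)
    cityRows         = countryRows.filter(_._2 == city)  // filter to Split2 (city)
    (device, total) <- topN(cityRows, n)(_._3)
  } yield (country, city, device, total)
```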
  48. 48. Plywood tuning: before vs. after.
  49. 49. Tuning Results – Throughput (qps, higher is better): before vs. after.
  50. 50. Challenge
  51. 51. Stable TopN - Motivation. The same query can return different results when the replication factor is 2 or more. Example: Historical 1 and Historical 2 each hold Seg_1, Seg_2, and Seg_3. First run: the Broker merges TopN(Seg_1 + Seg_2) from Historical 1 with TopN(Seg_3) from Historical 2. Second run: it merges TopN(Seg_1) from Historical 1 with TopN(Seg_2 + Seg_3) from Historical 2. The two results can differ (first result != second result).
  52. 52. by_segment patch: bypass the Historical-side TopN merge and do the merge on the Broker, combining the TopN results of each segment in segment ID order. First run: TopN(Seg_1) + TopN(Seg_2) from Historical 1 and TopN(Seg_3) from Historical 2. Second run: TopN(Seg_1) from Historical 1 and TopN(Seg_2) + TopN(Seg_3) from Historical 2. The results are always identical.
  53. 53. Navis @ SK TelecomEns @ Naver Special Thanks
  54. 54. Thank you!
  55. 55. Appendix
  56. 56. • 10 Broker Nodes • 40 Historical Nodes • 2 MiddleManager & Overlord Nodes • 2 Coordinator Nodes • 10 Yarn & HDFS Nodes for Batch Ingestion • Spark Standalone Cluster runs on Historical Nodes • for Locality Druid Deploy & Configuration (1)
  57. 57. Druid Deploy & Configuration (2) • Druid version: 0.11 • H/W spec for Broker & Historical: CPU 40 cores (w/ hyperthreading), RAM 128GB, HDD: SSD w/ RAID 5 • Memory configuration:
Configuration                        Broker   Historical
-Xmx                                 20GB     12GB
-XX:MaxDirectMemorySize              30GB     45GB
druid.processing.numMergeBuffers     10       20
druid.processing.numThreads          20       30
druid.processing.buffer.sizeBytes    512MB    800MB
druid.cache.sizeInBytes              0        5GB
druid.server.http.numThreads         40       40
  58. 58. Use Yarn External Resource for Batch Ingestion "tuningConfig": { "type": "hadoop", "jobProperties": { "yarn.resourcemanager.hostname" : "host1.com", "yarn.resourcemanager.address" : "host1.com:8032", "yarn.resourcemanager.scheduler.address": "host1.com:8030", "yarn.resourcemanager.webapp.address": "host1.com:8088", "yarn.resourcemanager.resource-tracker.address": "host1.com:8031", "yarn.resourcemanager.admin.address": "host1.com:8033" } } Ingest Spec for External Yarn and HDFS
  59. 59. Use External HDFS for intermediate MR output "tuningConfig": { "type": "hadoop", "jobProperties": { "fs.defaultFS": "hdfs://DEFAULT_FS:8020", "dfs.namenode.http-address": "NAMENODE:50070", "dfs.namenode.https-address": "NAMENODE:50470", "dfs.namenode.servicerpc-address": "NAMENODE:8022" } } Ingest Spec for External Yarn and HDFS
  60. 60. Lambda Architecture with Two Databases https://en.wikipedia.org/wiki/Lambda_architecture Lambda Architecture with Druid https://www.slideshare.net/gianmerlino/druid-at-sf-big-analytics- 2015-1201 Why Druid? – Simple Lambda Architecture
  61. 61. How the Kafka Indexing Service works
  62. 62. https://github.com/knoguchi/cm-druid Druid on CDH
  63. 63. Extending Druid Query 1. Accumulated Metric in TopN 2. Stable TopN Result
  64. 64. Extending Druid Query (diagram). Client → Broker → Historical. Inside the Historical, a Cursor produces a row stream that feeds the Aggregation; a second query means a second pass over the row stream and a second result.
  65. 65. Extending Druid Query - Motivation. Two queries are needed to build the following table: 1. a TopN query for the top 3 countries, and 2. an aggregation query for the total duration. Country / SUM(duration) / Ratio over total duration: korea 225 (20%), uk 171 (15.2%), usa 33 (2.9%). Can we do it at once?
  66. 66. Extending Druid Query - Background. Yes we can! Just do the TopN operation and the SUM operation simultaneously. Segment data (country, duration): korea 100, korea 14, uk 40, uk 7, usa 21, china 17. Aggregated in a map structure: korea 114, uk 47, usa 21, china 17. Final TopN records: korea 114, uk 47, usa 21. The total duration equals the sum of all metric values!
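A toy version of the idea in plain Scala: a single pass over a segment's rows maintains both the per-country sums used for the TopN and the running total used as the ratio denominator.

```scala
import scala.collection.mutable

def topNWithTotal(rows: Iterator[(String, Long)], n: Int): (Seq[(String, Long)], Long) = {
  val perCountry = mutable.Map.empty[String, Long].withDefaultValue(0L)
  var total = 0L
  for ((country, duration) <- rows) {
    perCountry(country) += duration   // feeds the TopN aggregation
    total += duration                 // accumulated metric for the ratio denominator
  }
  (perCountry.toSeq.sortBy(-_._2).take(n), total)
}

val segment = Iterator("korea" -> 100L, "korea" -> 14L, "uk" -> 40L,
                       "uk" -> 7L, "usa" -> 21L, "china" -> 17L)
val (top3, total) = topNWithTotal(segment, 3)
// top3  = List(("korea",114), ("uk",47), ("usa",21)), total = 199
// ratio for each country = 100.0 * duration / total
```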
  67. 67. Extending Druid Query in TopN. User request: { "queryType": "topN", ... "metric": "edits", "accMetrics": ["edits"], ... } Druid response: { ... "edits": 33, "__acc_edits": 1234 ... } Inside the Historical, the cursor's row stream feeds both the TopN aggregation (TopN queue) and the accumulated count metric. We customized Druid to calculate the total edits and the metric at once!
  68. 68. Druid Spark Batch. Huge intermediate files with MapReduce • Druid's default batch ingestion uses MapReduce • To ingest a 1.4GB Parquet file (single-dimension partition): Read 16.6GB, Write 20.5GB, Total 41.1GB
  69. 69. Druid Spark Batch. We modified the original Druid Spark Batch • https://github.com/metamx/druid-spark-batch • The original version of Druid Spark Batch from Metamarkets (creator of Druid) • We added some features: Parquet input, single-dimension partitioning, query granularity, and the same ingest spec as Druid's MapReduce batch
  70. 70. Druid Spark Batch. Disk read & write (GB, lower is better): MapReduce 37.1 vs. Spark 7. Ingest time (single-dimension partition, 3 segments of 430MB each; seconds, lower is better): MapReduce 759 vs. Spark 2,260. Ingest time (single-dimension partition, 11 segments of 135MB each; seconds, lower is better): MapReduce 333 vs. Spark 376.
