
Funnel Analysis with Apache Spark and Druid


Every day, millions of advertising campaigns are happening around the world.

For campaign owners, measuring ongoing campaign effectiveness (e.g. “how many distinct users saw my online ad vs. how many distinct users saw my online ad, clicked it and purchased my product?”) is crucial.

However, this task (often referred to as “funnel analysis”) is not easy, especially if the chronological order of events matters.

One way to mitigate this challenge is combining Apache Druid and Apache DataSketches, to provide fast analytics on large volumes of data.

However, while that combination can answer some of these questions, it still can’t answer the question “how many distinct users viewed the brand’s homepage FIRST and THEN viewed product X page?”

In this talk, we will discuss how we combine Spark, Druid and DataSketches to answer such questions at scale.


Funnel Analysis with Apache Spark and Druid

  1. 1. The future is open
  2. 2. @ItaiYaffe, @ettigur Digital advertising - a multi-billion dollar industry ● In 2019, internet advertising spending worldwide was over $290B (statista.com, June 2020)
  3. 3. @ItaiYaffe, @ettigur Digital advertising - a multi-billion dollar industry ● In 2019, internet advertising spending worldwide was over $290B (statista.com, June 2020) ● Apple spent over $110M on iPhone & TV+ advertising during September and October 2019 (9To5Mac.com, November 2019)
  4. 4. @ItaiYaffe, @ettigur Digital advertising - a multi-billion dollar industry ● In 2019, internet advertising spending worldwide was over $290B (statista.com, June 2020) ● Apple spent over $110M on iPhone & TV+ advertising during September and October 2019 (9To5Mac.com, November 2019) ● GM is shifting ‘SIGNIFICANT’ dollars to connected TV advertising (AdAge.com, January 2020)
  5. 5. @ItaiYaffe, @ettigur Digital advertising - a multi-billion dollar industry ● In 2019, internet advertising spending worldwide was over $290B (statista.com, June 2020) ● Apple spent over $110M on iPhone & TV+ advertising during September and October 2019 (9To5Mac.com, November 2019) ● GM is shifting ‘SIGNIFICANT’ dollars to connected TV advertising (AdAge.com, January 2020) $$$ spent each year on digital advertising campaigns
  6. 6. @ItaiYaffe, @ettigur What does a funnel look like? AD EXPOSURE 100M → (85M Drop-off) → HOMEPAGE 15M → (5M Drop-off) → PRODUCT PAGE 10M. So everybody wants to measure their campaigns’ efficiency!
  7. 7. @ItaiYaffe, @ettigur What does a funnel look like? AD EXPOSURE 100M → (85M Drop-off) → HOMEPAGE 15M → (5M Drop-off) → PRODUCT PAGE 10M. So everybody wants to measure their campaigns’ efficiency! But how???
  8. 8. Funnel Analysis with Apache Spark and Druid Etti Gur, Nielsen Itai Yaffe, Imply
  9. 9. @ItaiYaffe, @ettigur Introduction ● Etti Gur (@ettigur) ○ Senior Big Data Engineer @ Nielsen ○ Building data pipelines using Spark, Kafka, Druid, Airflow and more ● Itai Yaffe (@ItaiYaffe) ○ Principal Solutions Architect @ Imply, prev. Big Data Tech Lead @ Nielsen ○ Dealing with Big Data challenges since 2012
  10. 10. @ItaiYaffe, @ettigur Nielsen Identity ● Data and Measurement company ● Media consumption ● Single source of truth of individuals and households ○ Unifies many proprietary datasets ○ Generates holistic view of a consumer
  11. 11. @ItaiYaffe, @ettigur Nielsen Identity in numbers ● >10B events/day ● 60TB/day (S3) ● 6000 nodes/day ● 10’s of TB ingested/day (Druid)
  12. 12. @ItaiYaffe, @ettigur The challenges ● Scalability ● Cost Efficiency ● Fault-tolerance
  13. 13. @ItaiYaffe, @ettigur What will you learn? How to overcome the technical challenges of Funnel Analysis
  14. 14. @ItaiYaffe, @ettigur What will you learn? How to overcome the technical challenges of Funnel Analysis using Apache Spark, Druid and DataSketches,
  15. 15. @ItaiYaffe, @ettigur What will you learn? How to overcome the technical challenges of Funnel Analysis using Apache Spark, Druid and DataSketches, and why you should even care
  16. 16. @ItaiYaffe, @ettigur Campaign phases - user’s point-of-view ● Awareness: Exposed to campaign (e.g. via online ad) ● Consideration: Interest is expressed (e.g. clicked ad) ● Intent: Steps taken towards making a purchase (e.g. added product to cart) ● Purchase
  17. 17. @ItaiYaffe, @ettigur Campaign phases - user’s point-of-view ● Awareness: Exposed to campaign (e.g. via online ad) ● Consideration: Interest is expressed (e.g. clicked ad) ● Intent: Steps taken towards making a purchase (e.g. added product to cart) ● Purchase (diagram labels: Tactic, Stages)
  18. 18. @ItaiYaffe, @ettigur Campaign phases - campaign owner’s point-of-view: Awareness → Consideration → Intent → Purchase, with a Drop-off between each pair of consecutive phases
  19. 19. @ItaiYaffe, @ettigur Campaign phases - why is it called “a funnel”? AD EXPOSURE 100M UUs → (85M Drop-off) → HOMEPAGE 15M UUs → (5M Drop-off) → PRODUCT PAGE 10M UUs → (7M Drop-off) → CHECKOUT 3M UUs * UUs = Unique Users
  20. 20. @ItaiYaffe, @ettigur Campaign phases - why is it called “a funnel”? AD EXPOSURE 100M UUs → (85M Drop-off) → HOMEPAGE 15M UUs → (5M Drop-off) → PRODUCT PAGE 10M UUs → (7M Drop-off) → CHECKOUT 3M UUs * UUs = Unique Users. We need to analyze the funnel, hence: “Funnel Analysis”
  21. 21. @ItaiYaffe, @ettigur Views vs Unique Users 2 Unique Users 7 Views 2 Purchases $$$ $$$
  22. 22. @ItaiYaffe, @ettigur Everybody wants to measure their campaigns’ efficiency! What does a funnel look like? AD EXPOSURE 100M → (85M Drop-off) → HOMEPAGE 15M → (5M Drop-off) → PRODUCT PAGE 10M
  23. 23. @ItaiYaffe, @ettigur But how can one measure campaign efficiency? ● Collect a huge stream of events (i.e. user activities) while the campaign is live ● Map events to funnel stages ○ E.g. ad exposure = tactic ● Provide insights quickly
  24. 24. @ItaiYaffe, @ettigur So… what’s wrong with off-the-shelf alternatives?

      | Topic | Off-the-shelf alternatives |
      | --- | --- |
      | Scalability | Limited |
      | Access to raw data | Lack access |
      | Count-distinct operations | Very slow |

      * Based on tinyurl.com/qqza5ur
  25. 25. @ItaiYaffe, @ettigur Introducing: Apache Druid
  26. 26. @ItaiYaffe, @ettigur Why is it cool? ● Store trillions of events, petabytes of data ● Sub-second analytic queries ● Highly scalable ● Cost effective ● Decoupled architecture ○ E.g. ingestion is separated from query
  27. 27. @ItaiYaffe, @ettigur Roll-up - Simple Count (Views)

      Raw events:

      | Timestamp | Website | Device ID |
      | --- | --- | --- |
      | 2021-05-26 | www.a.com | 3a4c1f2d84a5c179435c1fea86e6ae02 |
      | 2021-05-26 | www.a.com | 3a4c1f2d84a5c179435c1fea86e6ae02 |
      | 2021-05-26 | www.a.com | 5dd59f9bd068f802a7c6dd832bf60d02 |
      | 2021-05-26 | www.b.com | 5dd59f9bd068f802a7c6dd832bf60d02 |
      | 2021-05-26 | www.c.com | 5dd59f9bd068f802a7c6dd832bf60d02 |

      After roll-up (LongSumAggregator):

      | Timestamp | Website | Views |
      | --- | --- | --- |
      | 2021-05-26 | www.a.com | 3 |
      | 2021-05-26 | www.b.com | 1 |
      | 2021-05-26 | www.c.com | 1 |
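The roll-up on slide 27 can be illustrated in plain Python. This is a minimal sketch and not Druid's implementation: `rollup_views` is a hypothetical helper, and a `Counter` stands in for the long-sum aggregation over rows that share the same (timestamp, website) dimensions.

```python
from collections import Counter

# Raw events copied from the slide: (timestamp, website, device_id).
events = [
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.b.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.c.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def rollup_views(rows):
    """Collapse raw rows into one row per (timestamp, website) with a view count."""
    counts = Counter((ts, site) for ts, site, _device in rows)
    return dict(counts)


views = rollup_views(events)
for (ts, site), n in sorted(views.items()):
    print(ts, site, n)
```

Note that the device ID is discarded by this roll-up, which is why a plain count cannot answer "how many unique users".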
  28. 28. @ItaiYaffe, @ettigur Druid architecture
  29. 29. @ItaiYaffe, @ettigur Powered by Druid
  30. 30. @ItaiYaffe, @ettigur Common use-cases for Druid ● Clickstream analytics ○ Funnel analysis ● Network performance monitoring ● Application performance management ● Supply chain analytics ○ Manufacturing (IoT and device) metrics ● BI and OLAP ● And more...
  31. 31. @ItaiYaffe, @ettigur Druid in a nutshell ● A real-time analytics database ○ Time-series, columnar ● Can ingest and store trillions of events, and serve analytic queries with sub-second latency ● Highly scalable, cost-effective ● Widely used among Big Data companies for: ○ Application performance management ○ Clickstream analytics and funnel analysis ○ And more
  32. 32. @ItaiYaffe, @ettigur Druid in a nutshell ● A real-time analytics database ○ Time-series, columnar ● Can ingest and store trillions of events, and serve analytic queries with sub-second latency ● Highly scalable, cost-effective ● Widely used among Big Data companies for: ○ Application performance management ○ Clickstream analytics and funnel analysis ○ And more
  33. 33. @ItaiYaffe, @ettigur Why is Druid suitable for the task?

      | Topic | Off-the-shelf alternatives | Druid |
      | --- | --- | --- |
      | Scalability | Limited | Highly scalable |
      | Access to raw data | Lack access | Can store trillions of events |
      | Count-distinct operations | Very slow | Sub-second approximate count distinct with set operations using the Theta Sketch module |

      * Based on tinyurl.com/qqza5ur
  34. 34. @ItaiYaffe, @ettigur Why is Druid suitable for the task?

      | Topic | Off-the-shelf alternatives | Druid |
      | --- | --- | --- |
      | Scalability | Limited | Highly scalable |
      | Access to raw data | Lack access | Can store trillions of events |
      | Count-distinct operations | Very slow | Sub-second approximate count distinct with set operations using the Theta Sketch module |

      Theta Sketch??? * Based on tinyurl.com/qqza5ur
  35. 35. @ItaiYaffe, @ettigur What is Theta Sketch? ● ThetaSketch mathematical framework - generalization of KMV ● K Minimum Values (KMV) ● Estimate set cardinality ● Supports set-theoretic operations
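To make the K Minimum Values idea concrete, here is a toy estimator in plain Python. It is a sketch under stated assumptions, not the DataSketches implementation: `hash01` and `kmv_estimate` are hypothetical names, SHA-256 is just a convenient stand-in hash, and for clarity the code hashes everything up front instead of maintaining a bounded structure.

```python
import hashlib


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) using SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(items, k: int) -> float:
    """Estimate the number of distinct items from the k smallest hash values."""
    smallest = sorted({hash01(x) for x in items})[:k]
    if len(smallest) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return float(len(smallest))
    # Classic KMV estimator: (k - 1) divided by the k-th smallest hash value.
    return (k - 1) / smallest[-1]


# 20,000 distinct users, each appearing three times in the stream.
users = [f"uid{i}" for i in range(20_000)] * 3
estimate = kmv_estimate(users, k=1024)
print(f"estimated distinct users: {estimate:.0f}")
```

Duplicates hash to the same value, so repeated events from the same user do not change the estimate; only the k smallest hash values need to be retained, which is what keeps a real sketch small.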
  36. 36. @ItaiYaffe Theta Sketch error - error as function of K

      | K | 1 std dev (68.27% confidence interval) | 2 std dev (95.45% confidence interval) |
      | --- | --- | --- |
      | 16,384 | 0.78% | 1.56% |
      | 32,768 | 0.55% | 1.10% |
      | 65,536 | 0.39% | 0.78% |

      * Larger K = more memory & storage needed
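The figures in the error table are consistent with a relative standard error of roughly 1/sqrt(K). That formula is an inference from the numbers on the slide rather than something the slide states, and `relative_standard_error` is a hypothetical helper name. A quick check:

```python
import math


def relative_standard_error(k: int) -> float:
    """Approximate relative standard error for a sketch of nominal size k."""
    return 1.0 / math.sqrt(k)


for k in (16_384, 32_768, 65_536):
    rse = relative_standard_error(k)
    # One standard deviation covers ~68.27% of estimates, two cover ~95.45%.
    print(f"K={k:>6}: 1 std dev = {rse:.2%}, 2 std dev = {2 * rse:.2%}")
```

Under this approximation, quadrupling K halves the error, which is the trade-off against memory and storage that the slide's footnote points at.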
  37. 37. @ItaiYaffe, @ettigur Theta Sketch demo tinyurl.com/ugk6p67
  38. 38. @ItaiYaffe, @ettigur The Theta Sketch module in Druid ● Part of the Apache DataSketches library (datasketches.apache.org) ● At ingestion time ○ Sketches are created and stored in Druid segments ● At query time ○ Sketches are aggregated (i.e. union, intersection or difference between sketches) ○ The result - estimated number of unique entries in the aggregated sketch ● Also see this short video - tinyurl.com/vdwojh6
  39. 39. @ItaiYaffe, @ettigur Roll-up - Count Distinct (Unique Users)

      Raw events:

      | Timestamp | Website | Device ID |
      | --- | --- | --- |
      | 2021-05-26 | www.a.com | 3a4c1f2d84a5c179435c1fea86e6ae02 |
      | 2021-05-26 | www.a.com | 3a4c1f2d84a5c179435c1fea86e6ae02 |
      | 2021-05-26 | www.a.com | 5dd59f9bd068f802a7c6dd832bf60d02 |
      | 2021-05-26 | www.b.com | 5dd59f9bd068f802a7c6dd832bf60d02 |
      | 2021-05-26 | www.c.com | 5dd59f9bd068f802a7c6dd832bf60d02 |

      After roll-up (ThetaSketchAggregator):

      | Timestamp | Website | Unique Users* |
      | --- | --- | --- |
      | 2021-05-26 | www.a.com | 2* |
      | 2021-05-26 | www.b.com | 1* |
      | 2021-05-26 | www.c.com | 1* |

      * What is actually stored is a ThetaSketch object. The actual result is calculated in real-time, which allows us to do UNIONs and INTERSECTIONs
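The footnote on slide 39 is the key point: the rolled-up row stores a mergeable object, not a number. The following sketch illustrates that idea with exact Python sets standing in for Theta Sketch objects (an assumption made for readability; real sketches are approximate and bounded in size). `rollup_unique_users` is a hypothetical helper.

```python
from collections import defaultdict

# Raw events copied from the slide: (timestamp, website, device_id).
events = [
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.b.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.c.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def rollup_unique_users(rows):
    """Keep one set of device IDs per (timestamp, website) instead of a count."""
    per_key = defaultdict(set)
    for ts, site, device in rows:
        per_key[(ts, site)].add(device)
    return dict(per_key)


sketches = rollup_unique_users(events)
a = sketches[("2021-05-26", "www.a.com")]
b = sketches[("2021-05-26", "www.b.com")]

unique_a = len(a)             # unique users on www.a.com
visited_a_or_b = len(a | b)   # union, computed at query time
visited_a_and_b = len(a & b)  # intersection, computed at query time
print(unique_a, visited_a_or_b, visited_a_and_b)
```

Because the stored objects can be combined after the fact, the same rolled-up data answers questions that were not anticipated at ingestion time.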
  40. 40. @ItaiYaffe, @ettigur Cool, so… Back to funnel analysis?
  41. 41. @ItaiYaffe, @ettigur Funnel analysis - simple use-case How many unique users viewed online ad? VS How many unique users viewed online ad AND viewed product X page?
  42. 42. @ItaiYaffe, @ettigur Funnel analysis - simple use-case 5/1/2021 - 5/26/2021 5/1/2021 - 5/26/2021
  43. 43. @ItaiYaffe, @ettigur Funnel analysis - simple use-case
  44. 44. @ItaiYaffe, @ettigur Funnel analysis pipeline - high-level architecture 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher
  45. 45. @ItaiYaffe, @ettigur Funnel analysis pipeline - Data Lake 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher {event_time=2021-05-26T..., userid=uid1, attribute=online_ad} {event_time=2021-05-26T..., userid=uid1, attribute=homepage} {event_time=2021-05-26T..., userid=uid1, attribute=productX_page} .... date=2021-05-24 date=2021-05-25 date=2021-05-26
  46. 46. @ItaiYaffe, @ettigur Funnel analysis pipeline - Mart Generator 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher campaign=1210 campaign=1319 campaign=1472 date=2021-05-24 date=2021-05-25 date=2021-05-26 {event_time=2021-05-26T... , userid=uid1, attribute=online_ad, type=Tactic} {event_time=2021-05-26T... , userid=uid1, attribute=homepage, type=Stage} {event_time=2021-05-26T... , userid=uid1, attribute=productX_page , type=Stage} ....
  47. 47. @ItaiYaffe, @ettigur Funnel analysis pipeline - Enricher 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher campaign=1210 campaign=1319 campaign=1472 date=2021-05-24 date=2021-05-25 date=2021-05-26 {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage} {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=productX_page } .... ....
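The Enricher step shown on slide 47 turns typed events into (tactic, stage) rows per user. The slides do not show its code, so the following is only a guess at the shape of that transformation in plain Python rather than Spark: `enrich` is a hypothetical function, grouping per user and per day is an assumption, and the timestamps are borrowed from the later sequential example (slide 62) because the ones on this slide are elided.

```python
from itertools import product

# Mart Generator output for one user; timestamps taken from slide 62.
events = [
    {"event_time": "2021-05-26T09:15", "userid": "uid1",
     "attribute": "productX_page", "type": "Stage"},
    {"event_time": "2021-05-26T10:10", "userid": "uid1",
     "attribute": "online_ad", "type": "Tactic"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1",
     "attribute": "homepage", "type": "Stage"},
]


def enrich(events):
    """Pair every tactic a user saw with every stage that user reached that day."""
    tactics, stages = {}, {}
    for e in events:
        key = (e["event_time"][:10], e["userid"])  # (event_date, userid)
        target = tactics if e["type"] == "Tactic" else stages
        target.setdefault(key, set()).add(e["attribute"])

    rows = []
    # Only users that have both a tactic and at least one stage produce rows.
    for key in sorted(tactics.keys() & stages.keys()):
        date, user = key
        for tactic, stage in product(sorted(tactics[key]), sorted(stages[key])):
            rows.append({"event_date": date, "userid": user,
                         "tactic": tactic, "stage": stage})
    return rows


rows = enrich(events)
for row in rows:
    print(row)
```

In this non-sequential version the order of events is ignored, so the product page view at 09:15 is still attributed to the ad seen at 10:10.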
  48. 48. @ItaiYaffe, @ettigur Funnel analysis pipeline - ingesting data into Druid 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher "type": "index_hadoop", "spec": { "dataSchema": { "dataSource": "campaign_1472", "granularitySpec": { "queryGranularity": "day", "segmentGranularity": "day", "type": "uniform", "intervals": ["2021-05-01/2021-05-27"] ...
  49. 49. @ItaiYaffe, @ettigur Funnel analysis pipeline - ingesting data into Druid 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher "timestampSpec": { "column": "event_date", "format": "yyyy-MM-dd" }, "dimensionsSpec": { "dimensions": ["tactic", "stage"] }, "metricsSpec": [{ "fieldName": "userid", "type": "thetaSketch", "name": "user_id_sketch", "size": 65536}], ...
  50. 50. @ItaiYaffe, @ettigur Funnel analysis pipeline - ingesting data into Druid 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher "inputSpec": {"type": "multi", "children": [ {"type": "dataSource", "ingestionSpec": { "intervals": ["2021-05-01/2021-05-27"], "dataSource": "campaign_1472", ...}}, {"type": "static", "paths": "s3://<BUCKET_NAME>/date=2021-05-26/campaign=1472", ...}, ...
  51. 51. @ItaiYaffe, @ettigur Funnel analysis pipeline - Druid datasources 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher {__time=2021-05-26, tactic=online_ad, stage=homepage, user_id_sketch=<Object>} {__time=2021-05-26, tactic=online_ad, stage=productX_page , user_id_sketch=<Object>} .... .... campaign_1210 campaign_1319 campaign_1472
  52. 52. @ItaiYaffe, @ettigur Funnel analysis pipeline - querying Druid (SQL) 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher SELECT APPROX_COUNT_DISTINCT_DS_THETA(user_id_sketch,65536) as homepage_sketch FROM campaign_1472 WHERE (("tactic" = 'online_ad') AND ("stage" = 'homepage')) AND __time BETWEEN '2021-05-01T00:00:00.000' AND '2021-05-26T23:59:59.000' * This specific query returns the estimated number of unique users that viewed the online ad AND viewed the homepage
  53. 53. @ItaiYaffe, @ettigur Funnel analysis - simple use-case revisited 5/1/2021 - 5/26/2021
  54. 54. @ItaiYaffe, @ettigur Funnel analysis - simple use-case revisited
  55. 55. @ItaiYaffe, @ettigur Funnel analysis - simple use-case revisited
  56. 56. @ItaiYaffe, @ettigur Funnel analysis - simple use-case revisited 3,100 - 2,500 ≠ 1,000
  57. 57. @ItaiYaffe, @ettigur Funnel analysis - simple use-case revisited: ONLINE AD 8.1M UUs → HOMEPAGE 3.1K UUs → (2.5K Drop-off) → PRODUCT PAGE 1K UUs → ... * UUs = Unique Users
  58. 58. @ItaiYaffe, @ettigur Funnel analysis - simple use-case revisited: ONLINE AD 8.1M UUs → HOMEPAGE 3.1K UUs → (2.5K Drop-off) → PRODUCT PAGE 1K UUs → ... * UUs = Unique Users
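One possible reading of the mismatch on slide 56 is that, when event order is ignored, the product page audience is not a subset of the homepage audience, so subtracting the drop-off from the homepage count does not yield the product page count. The toy sets below are invented purely to illustrate that arithmetic and are not taken from the talk.

```python
# Hypothetical audiences; user IDs are made up for illustration.
homepage = {"u1", "u2", "u3", "u4", "u5"}
product_page = {"u4", "u5", "u6", "u7"}  # u6 and u7 never viewed the homepage

drop_off = homepage - product_page   # viewed the homepage but went no further
continued = homepage & product_page  # viewed the homepage and the product page

print(len(homepage) - len(drop_off))  # homepage minus drop-off
print(len(product_page))              # product page audience
```

Homepage minus drop-off equals the size of the intersection, which only matches the product page audience when every product page viewer also appears in the homepage audience.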
  59. 59. @ItaiYaffe, @ettigur Funnel analysis - complex use-case How many unique users viewed online ad? VS How many unique users viewed online ad FIRST and THEN viewed product X page?
  60. 60. @ItaiYaffe, @ettigur Funnel analysis - complex use-case ● This is what we call a sequential funnel ○ Chronological order of events is important ● The data pipeline is very similar, but… ○ Taking into account only events that happened in the pre-defined order of the funnel ● That way we better represent the efficiency of a specific tactic (i.e. advertisement)
  61. 61. @ItaiYaffe, @ettigur Funnel analysis pipeline - reminder 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher
  62. 62. @ItaiYaffe, @ettigur Funnel analysis pipeline - Data Lake 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher {event_time=2021-05-26T09:15, userid=uid1, attribute=productX_page} {event_time=2021-05-26T10:10, userid=uid1, attribute=online_ad} {event_time=2021-05-26T10:11, userid=uid1, attribute=homepage} .... date=2021-05-24 date=2021-05-25 date=2021-05-26
  63. 63. @ItaiYaffe, @ettigur Funnel analysis pipeline - Mart Generator 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher campaign=1210 campaign=1319 campaign=1472 date=2021-05-24 date=2021-05-25 date=2021-05-26 {event_time=2021-05-26T09:15 , userid=uid1, attribute=productX_page , type=Stage} {event_time=2021-05-26T10:10 , userid=uid1, attribute=online_ad, type=Tactic} {event_time=2021-05-26T10:11 , userid=uid1, attribute=homepage, type=Stage} ....
  64. 64. @ItaiYaffe, @ettigur Funnel analysis pipeline - Enricher 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher campaign=1210 campaign=1319 campaign=1472 date=2021-05-24 date=2021-05-25 date=2021-05-26 {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=productX_page } {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage} ....
  65. 65. @ItaiYaffe, @ettigur Funnel analysis pipeline - Enricher 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher campaign=1210 campaign=1319 campaign=1472 date=2021-05-24 date=2021-05-25 date=2021-05-26 {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=productX_page } {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage} ....
  66. 66. @ItaiYaffe, @ettigur Funnel analysis pipeline - Enricher 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher campaign=1210 campaign=1319 campaign=1472 date=2021-05-24 date=2021-05-25 date=2021-05-26 {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage} .... ....
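Slides 62 to 66 show the sequential Enricher dropping the product page view because it happened before the ad exposure. The slides do not give the exact rule, so the sketch below assumes the simplest one that reproduces the slide's output: keep a stage only if it occurred after the user's first exposure to the tactic. `enrich_sequential` is a hypothetical name, this is plain Python rather than Spark, and ordering between stages is deliberately not enforced here.

```python
# Mart Generator output from slide 63.
events = [
    {"event_time": "2021-05-26T09:15", "userid": "uid1",
     "attribute": "productX_page", "type": "Stage"},
    {"event_time": "2021-05-26T10:10", "userid": "uid1",
     "attribute": "online_ad", "type": "Tactic"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1",
     "attribute": "homepage", "type": "Stage"},
]


def enrich_sequential(events):
    """Keep only stages that happened after the user's first tactic exposure."""
    first_exposure = {}  # (userid, tactic) -> earliest event_time
    for e in events:
        if e["type"] == "Tactic":
            key = (e["userid"], e["attribute"])
            # ISO-8601 strings in the same format compare chronologically.
            if key not in first_exposure or e["event_time"] < first_exposure[key]:
                first_exposure[key] = e["event_time"]

    rows = set()
    for e in events:
        if e["type"] != "Stage":
            continue
        for (user, tactic), seen_at in first_exposure.items():
            if user == e["userid"] and e["event_time"] > seen_at:
                rows.add((e["event_time"][:10], user, tactic, e["attribute"]))
    return sorted(rows)


rows = enrich_sequential(events)
print(rows)
```

With this rule only the homepage view survives for `uid1`, matching the single enriched row on slide 66.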
  67. 67. @ItaiYaffe, @ettigur Funnel analysis pipeline - querying Druid (SQL) 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher SELECT APPROX_COUNT_DISTINCT_DS_THETA(THETA_SKETCH_NOT(65536, THETA_SKETCH_INTERSECT(65536,a,b), THETA_SKETCH_UNION(65536,c,d,e))) as dropoff_sketch FROM ( SELECT DS_THETA("user_id_sketch") FILTER (WHERE tactic = 'online_ad') as a, DS_THETA("user_id_sketch") FILTER (WHERE stage = 'homepage') as b, DS_THETA("user_id_sketch") FILTER (WHERE stage = 'productX_page') as c, DS_THETA("user_id_sketch") FILTER (WHERE stage = 'add_to_cart') as d, DS_THETA("user_id_sketch") FILTER (WHERE stage = 'checkout') as e FROM campaign_1472 WHERE stage in ('homepage','productX_page','add_to_cart','checkout') AND tactic = 'online_ad' AND __time BETWEEN '2021-05-01T00:00:00.000' AND '2021-05-26T23:59:59.000' ) subquery * This specific query should return the estimated number of unique users for the drop-off between the homepage and product X page
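The set logic of the drop-off query on slide 67 is `NOT(INTERSECT(a, b), UNION(c, d, e))`: users in both `a` and `b` who appear in none of the later stages. The snippet below mirrors that logic with exact Python sets as a stand-in for Theta Sketches; `drop_off` is a hypothetical helper and the user IDs are invented for illustration.

```python
def drop_off(a, b, *later_stages):
    """Users in both a and b who do not appear in any later stage."""
    return (a & b) - set().union(*later_stages)


# Hypothetical audiences for one campaign.
online_ad = {"u1", "u2", "u3", "u4", "u5", "u6"}  # a
homepage = {"u1", "u2", "u3", "u4"}               # b
product_page = {"u3", "u4"}                       # c
add_to_cart = {"u4"}                              # d
checkout = set()                                  # e

dropped = drop_off(online_ad, homepage, product_page, add_to_cart, checkout)
print(sorted(dropped))
```

Here `u1` and `u2` saw the ad and the homepage but went no further, so they form the drop-off between the homepage and the product page.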
  68. 68. @ItaiYaffe, @ettigur Funnel analysis - complex use-case 5/1/2021 - 5/26/2021
  69. 69. @ItaiYaffe, @ettigur Funnel analysis - complex use-case
  70. 70. @ItaiYaffe, @ettigur Funnel analysis - complex use-case
  71. 71. @ItaiYaffe, @ettigur Funnel analysis - complex use-case 3,100 - 2,500 = 600
  72. 72. @ItaiYaffe, @ettigur Funnel analysis - complex use-case: ONLINE AD 8.1M UUs → HOMEPAGE 3.1K UUs → (2.5K Drop-off) → PRODUCT PAGE 0.6K UUs → ... * UUs = Unique Users
  73. 73. @ItaiYaffe, @ettigur Funnel analysis - complex use-case: ONLINE AD 8.1M UUs → HOMEPAGE 3.1K UUs → (2.5K Drop-off) → PRODUCT PAGE 0.6K UUs → ... * UUs = Unique Users
  74. 74. @ItaiYaffe, @ettigur A few tips ● Use Druid with Theta Sketch for fast approximate count distinct ○ Allows set operations (intersection/union/negation) ● Use Spark to pre-process incoming events ○ Allows you to take into account only events that happened in the pre-defined order of the funnel ○ Check out Etti’s “Optimizing Spark-based data pipelines” talk (video - tinyurl.com/7hvyxtc8, slides - tinyurl.com/3rvc9mus) ● Optimize your ingestion process ○ Write Theta Sketch objects from Spark app ○ Load to Druid using isInputThetaSketch=true flag
  75. 75. @ItaiYaffe, @ettigur What have we learned? ● Funnel analysis ○ Very important for advertisers ○ Not easy to solve technically (especially if chronological order of events matters)
  76. 76. @ItaiYaffe, @ettigur What have we learned? ● Funnel analysis ○ Very important for advertisers ○ Not easy to solve technically (especially if chronological order of events matters) ● Druid is a very powerful tool for real-time analytics ○ Highly scalable, can ingest and store trillions of events, and serve analytic queries with sub-second latency ○ Used for many different use-cases
  77. 77. @ItaiYaffe, @ettigur What have we learned? ● Funnel analysis ○ Very important for advertisers ○ Not easy to solve technically (especially if chronological order of events matters) ● Druid is a very powerful tool for real-time analytics ○ Highly scalable, can ingest and store trillions of events, and serve analytic queries with sub-second latency ○ Used for many different use-cases ● Combining Apache Spark, Druid and DataSketches FTW! ○ Pre-process events before ingesting into Druid ○ Decide how to handle out-of-order events
  78. 78. @ItaiYaffe, @ettigur Want to know more? ● Women in Big Data ○ A world-wide program that aims to inspire, connect, grow, and champion success of women in the Big Data & analytics field ○ 30+ chapters and 17,000+ members world-wide ○ Everyone can join (regardless of gender), so find a chapter near you - www.womeninbigdata.org/membership/ ● Conference talks ○ Casting the Spell: Druid in Practice (Berlin Buzzwords, June 17th 2021) - tinyurl.com/559hufnj ○ Migrating Airflow-based Spark Jobs to K8s (Data+AI Summit Europe 2020) - tinyurl.com/cbm42mn8 ● Our Tech Blog - medium.com/nmc-techblog ○ Data Retention and Deletion in Apache Druid - tinyurl.com/yymrvrn2
  79. 79. QUESTIONS
  80. 80. THANK YOU Etti Gur, Itai Yaffe
  81. 81. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
