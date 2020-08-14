

SF Big Analytics Meetup - Exact Count Distinct with Apache Kylin

With over 450 million customers, Didi (the world’s largest rideshare company) conducts complex user behavior analysis on huge datasets daily. Exact Count Distinct is one of Didi’s most critical metrics, but it is computationally heavy and notoriously slow. The difference between exact and approximate Count Distinct can cost Didi millions of dollars. In this talk, Kaige Liu of the Apache Kylin project explains how Didi uses Apache Kylin to return exact Count Distinct results on billions of rows of data with sub-second latency, giving it the most accurate picture of its business.

You will also learn about the latest developments in modern OLAP technologies. Kaige will share how Didi and Truck Alliance (a truck-hailing company that processes $100 billion worth of goods yearly) use Apache Kylin to power analytics platforms that let hundreds of analysts achieve sub-second latency on petabyte-scale data.

Published in: Data & Analytics

  1. How to Guarantee Exact COUNT DISTINCT Queries with Sub-Second Latency on Massive Datasets. Kaige Liu, Solution Architect, Apache Kylin Committer. 2020.8
  2. © Kyligence Inc. 2019, Confidential. Agenda: Business Scenarios, Demo, Technical Principles, Use Cases, Q&A
  3. Business Scenarios
  4. What Is Count Distinct? Count Distinct is used to compute the number of unique values in a data set.
     • PV (Page View)
     • UV (Unique Visitors)
     ID | Username | Page
     1 | Alice | /kyligence
     2 | Alice | /Kyligence/Blog
     3 | Carol | /Kyligence/Events
     4 | Bob | /Kyligence/Resources
     5 | Alice | /Kyligence/Downloads
     Distinct usernames: Alice, Bob, Carol (count: 3)
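The PV/UV distinction on slide 4 can be sketched in a few lines of Python. This is only an illustration using the slide's sample rows; the variable names are made up for the example.

```python
# Rows from the slide's sample table: (ID, Username, Page).
visits = [
    (1, "Alice", "/kyligence"),
    (2, "Alice", "/Kyligence/Blog"),
    (3, "Carol", "/Kyligence/Events"),
    (4, "Bob", "/Kyligence/Resources"),
    (5, "Alice", "/Kyligence/Downloads"),
]

# PV: every row is a page view, duplicates included.
page_views = len(visits)

# UV: COUNT(DISTINCT Username), so each user is counted once.
unique_visitors = len({user for _id, user, _page in visits})

print(page_views, unique_visitors)  # prints: 5 3
```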
  5. Approximate and Exact Count Distinct
     Approximate Count Distinct
     • Quick, less memory/CPU
     • Not accurate
     • Suited to trend analysis, where small errors are acceptable
     Exact Count Distinct
     • Slow, more memory/CPU
     • Accurate
     • Transaction relevant: paid advertising, precision marketing, etc.
     Error Rate | $1 Million | $1 Billion
     1.22% | $12,200 | $12,200,000
     2.44% | $24,400 | $24,400,000
     9.75% | $97,500 | $97,500,000
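The dollar figures in the error-rate table on slide 5 are the transaction total multiplied by the error rate; for example, 2.44% of $1 million works out to $24,400. A minimal sketch of that arithmetic follows. The function name and the use of basis points (to stay in exact integer math) are choices made for this example, not something from the talk.

```python
def cost_of_error(total_dollars, error_basis_points):
    """Dollar amount affected by a counting error.

    1 basis point = 0.01%, so 122 basis points = 1.22%.
    Integer math avoids floating-point rounding noise.
    """
    return total_dollars * error_basis_points // 10_000

error_rates_bp = [122, 244, 975]          # 1.22%, 2.44%, 9.75%
totals = [1_000_000, 1_000_000_000]       # $1 million, $1 billion

table = {bp: [cost_of_error(t, bp) for t in totals] for bp in error_rates_bp}

for bp, costs in table.items():
    print(f"{bp / 100:.2f}%", *[f"${c:,}" for c in costs])
```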
  6. Scenarios - Web/App Analytics
     • Where are they coming from?
     • Who are my visitors?
     • Which page lost the most users?
     • How many active users?
     • How many new users?
     • How many unique visitors?
  7. Scenarios - User Behavior Analytics: Retention Analysis, Funnel Analysis
  8. Technical Principles
  9. Challenges with Exact Count Distinct
     • Approximate Count Distinct is easy: HyperLogLog
     • Exact Count Distinct is a big challenge for all query engines at massive scale
     Challenges
     • Bad performance: need to scan all data
     • Non-cumulative: hard to do rollup and/or operations
     • Hard to optimize on multiple columns
     • Analysis always requires more than one count distinct operation
  10. Count Distinct Performance on Different Platforms
      • Google BigQuery
      • Snowflake
      • Athena
      • Apache Kylin
      • Kyligence
  11. Kyligence = Kylin + Intelligence. Accelerate Critical Business Decisions with AI-Augmented Data Management and Analytics
      • Founded in 2016 by the creators of Apache Kylin
      • Built around Kylin, with augmented AI and enhanced to deliver unprecedented enterprise analytic performance
      • CRN Top-10 big data startups in 2018
      • Global Presence: San Jose, Seattle, New York, Shanghai, Beijing
      • VCs: Fidelity International, Shunwei Capital, Broadband Capital, Redpoint, Cisco, Coatue
      Timeline: 2016 Founded, Pre-A (Redpoint, Cisco); 2017 Series A (CBC, Shunwei); 2018 Series B (8Roads); 2019 Series C (Coatue)
  12. How Does Apache Kylin Achieve This? Pre-Aggregation, Bitmap, Dictionary
      • Pre-aggregate count distinct in cubes
      • Fetch results directly without on-the-fly calculations
      • Supports rollup
      • Reduces memory/storage significantly
      • Supports String type and detail queries
  13. Pre-Aggregation
      Raw data:
      Date | UID | Page
      2020-04-01 | 1 | /kyligence
      2020-04-01 | 1 | /Kyligence/Blog
      2020-04-01 | 2 | /Kyligence/News
      2020-04-02 | 3 | /Kyligence/Events
      2020-04-02 | 2 | /Kyligence/Resources
      2020-04-02 | 1 | /Kyligence/Downloads
      Pre-aggregated per day:
      Date | Count(UID) | Count(distinct UID)
      2020-04-01 | 3 | 2
      2020-04-02 | 3 | 3
      Rolled up across both days:
      Date | Count(UID) | Count(distinct UID)
      2020-04-01 and 2020-04-02 | 6 | ??
  14. Bitmap
      [Figure: a table of UIDs (1, 2, 4, 5, 7, 9, 10, 11, 13) encoded as a bitmap over bit positions 7 to 0]
      • Saves storage significantly
      • Supports logical operations directly
      • Contains information needed to do aggregation
      • RoaringBitmap
  15. Bitmap
      Raw data:
      Date | UID | Page
      2020-04-01 | 1 | /kyligence
      2020-04-01 | 1 | /Kyligence/Blog
      2020-04-01 | 2 | /Kyligence/News
      2020-04-02 | 3 | /Kyligence/Events
      2020-04-02 | 2 | /Kyligence/Resources
      2020-04-02 | 1 | /Kyligence/Downloads
      Pre-aggregated per day, as plain counts:
      Date | Count(UID) | Count(distinct UID)
      2020-04-01 | 3 | 2
      2020-04-02 | 3 | 3
      Pre-aggregated per day, as bitmaps:
      Date | Count(UID) | Count(distinct UID)
      2020-04-01 | 3 | Bitmap(1,2)
      2020-04-02 | 3 | Bitmap(1,2,3)
      Rolled up across both days:
      Date | Count(UID) | Count(distinct UID)
      2020-04-01 and 2020-04-02 | 6 | Bitmap(1,2,3)
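Slides 13 and 15 show why plain distinct counts cannot be rolled up while bitmaps can. The sketch below reproduces that with the slides' sample rows. It uses a Python integer as a stand-in bitmap (bit `uid` set when that UID was seen); this is an assumption made for illustration, not Kylin's implementation, which the slides say is based on RoaringBitmap.

```python
from collections import defaultdict

# Sample rows from the slides: (Date, UID, Page).
rows = [
    ("2020-04-01", 1, "/kyligence"),
    ("2020-04-01", 1, "/Kyligence/Blog"),
    ("2020-04-01", 2, "/Kyligence/News"),
    ("2020-04-02", 3, "/Kyligence/Events"),
    ("2020-04-02", 2, "/Kyligence/Resources"),
    ("2020-04-02", 1, "/Kyligence/Downloads"),
]

def cardinality(bitmap):
    """Number of set bits, i.e. the distinct count the bitmap encodes."""
    return bin(bitmap).count("1")

# Pre-aggregation: one bitmap per date.
daily_bitmap = defaultdict(int)
for date, uid, _page in rows:
    daily_bitmap[date] |= 1 << uid

daily_counts = {date: cardinality(b) for date, b in daily_bitmap.items()}

# Summing the per-day distinct counts double-counts returning users.
naive_sum = sum(daily_counts.values())

# Rolling up the bitmaps with OR keeps each UID once.
merged = 0
for b in daily_bitmap.values():
    merged |= b
exact = cardinality(merged)

print(daily_counts)      # {'2020-04-01': 2, '2020-04-02': 3}
print(naive_sum, exact)  # 5 3
```

The rollup never touches the raw rows again: it only combines the two stored bitmaps, which is what makes the pre-aggregated result reusable across date ranges.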
  16. Operations in Bitmap
      • Two bitmaps, each containing a different data set: [1, 3, 4, 5] and [2, 3, 4, 6]
      • And: all elements contained in both bitmaps. [1, 3, 4, 5] and [2, 3, 4, 6] = [3, 4]. Scenarios: Retention Analysis, Funnel Analysis
      • Or: all elements in either bitmap. [1, 3, 4, 5] or [2, 3, 4, 6] = [1, 2, 3, 4, 5, 6]. Scenarios: Cross-Dimension Analysis
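The And/Or examples on slide 16 can be checked with a small sketch. As an assumption for illustration, a Python integer plays the role of the bitmap; the helper names `to_bitmap` and `to_list` are hypothetical.

```python
def to_bitmap(values):
    """Set bit v for every value v in the input."""
    bitmap = 0
    for v in values:
        bitmap |= 1 << v
    return bitmap

def to_list(bitmap):
    """Return the positions of the set bits in ascending order."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

a = to_bitmap([1, 3, 4, 5])
b = to_bitmap([2, 3, 4, 6])

both = to_list(a & b)    # And: elements present in both bitmaps
either = to_list(a | b)  # Or: elements present in either bitmap

print(both)    # [3, 4]
print(either)  # [1, 2, 3, 4, 5, 6]
```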
  17. Dictionary
      Bitmap can only support int values. How about String columns? Dictionary.
      Raw data:
      Date | USERNAME | Page
      2020-04-01 | Alice | /kyligence
      2020-04-01 | Alice | /Kyligence/Blog
      2020-04-01 | Bob | /Kyligence/News
      2020-04-02 | Coral | /Kyligence/Events
      2020-04-02 | Bob | /Kyligence/Resources
      2020-04-02 | Alice | /Kyligence/Downloads
      Dictionary:
      USERNAME | ENCODED
      Alice | 1
      Bob | 2
      Coral | 3
      Pre-aggregated per day, as bitmaps:
      Date | Count(UID) | Count(distinct UID)
      2020-04-01 | 3 | Bitmap(1,2)
      2020-04-02 | 3 | Bitmap(1,2,3)
      Rolled up across both days:
      Date | Count(UID) | Count(distinct UID)
      2020-04-01 and 2020-04-02 | 6 | Bitmap(1,2,3)
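The dictionary step on slide 17 maps each string to an integer so that it can be stored in a bitmap. A minimal sketch follows, assuming IDs are handed out in order of first appearance, which happens to reproduce the slide's mapping. The `encode` helper is hypothetical, and the plain `dict` is only a stand-in for whatever dictionary structure Kylin actually uses.

```python
from collections import defaultdict

# Sample rows from the slide: (Date, USERNAME, Page).
rows = [
    ("2020-04-01", "Alice", "/kyligence"),
    ("2020-04-01", "Alice", "/Kyligence/Blog"),
    ("2020-04-01", "Bob", "/Kyligence/News"),
    ("2020-04-02", "Coral", "/Kyligence/Events"),
    ("2020-04-02", "Bob", "/Kyligence/Resources"),
    ("2020-04-02", "Alice", "/Kyligence/Downloads"),
]

dictionary = {}

def encode(name):
    """Return the integer ID for a name, assigning the next ID if it is new."""
    return dictionary.setdefault(name, len(dictionary) + 1)

# Build one integer-backed bitmap per date from the encoded usernames.
daily_bitmap = defaultdict(int)
for date, username, _page in rows:
    daily_bitmap[date] |= 1 << encode(username)

# Roll up both days and count the distinct users.
merged = 0
for b in daily_bitmap.values():
    merged |= b
distinct_users = bin(merged).count("1")

print(dictionary)      # {'Alice': 1, 'Bob': 2, 'Coral': 3}
print(distinct_users)  # 3
```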
  18. Use Cases
  19. Manbang Group
      • The largest Chinese truck logistics startup
      • 7 million+ trucks
      • 2.25 million active users
      • 8 apps and 10 TB+ data
      Requirements
      • Retention analysis on a wide range of dimensions and date ranges
      • Funnel analysis with ability to customize funnel
      • User profile analysis
  20. Architecture with Apache Kylin
  21. Retention Analysis for Manbang Group
      • Users can choose any column and any date range to do the retention analysis
  22. Funnel Analysis for Manbang Group
      • Users can customize funnels with any number of steps
      • Can identify the specific users lost between steps
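A funnel with any number of steps, including the users lost between steps, can be expressed with bitmap And and And-Not. The sketch below is a hedged illustration: the user IDs and step contents are hypothetical, Python integers stand in for bitmaps, and nothing here is taken from Manbang's actual setup.

```python
def to_bitmap(ids):
    """Set bit i for every encoded user ID i."""
    bitmap = 0
    for i in ids:
        bitmap |= 1 << i
    return bitmap

def to_ids(bitmap):
    """Return the encoded user IDs stored in the bitmap, ascending."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical encoded user IDs that performed each funnel step.
steps = [
    to_bitmap([1, 2, 3, 4, 5, 6]),  # step 1
    to_bitmap([1, 2, 4, 6, 7]),     # step 2
    to_bitmap([2, 6, 8]),           # step 3
]

reached = steps[0]
funnel = [to_ids(reached)]  # users still in the funnel after each step
lost = []                   # users dropped between consecutive steps

for step in steps[1:]:
    lost.append(to_ids(reached & ~step))  # And-Not: in the funnel, skipped this step
    reached &= step                       # And: in the funnel and did this step
    funnel.append(to_ids(reached))

print(funnel)  # [[1, 2, 3, 4, 5, 6], [1, 2, 4, 6], [2, 6]]
print(lost)    # [[3, 5], [1, 4]]
```

Because the result of each step is still a bitmap, the loop works for any number of steps, and the lost users come out as concrete IDs rather than just a count.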
  23. DiDi
      • #1 ride-share company in China
      • 92 million monthly active users (as of Dec. 2019)
      • 24 million rides per day in 2019
      Requirements
      • User profile analysis
      • Precision marketing
  24. Scenarios – Apache Kylin in Didi
      • Precision Marketing
        o Send coupons to exact target users
        o Upgrade cars for specific users
      • Promotion Activity Analysis
        o How many new/returned users are gained in this activity?
        o Which kind of users are most interested in this activity?
      • Optimize User Experience
        o Which stages lost the most users?
        o How to increase customer stickiness?
      [Diagram labels: User Profile, Precision Marketing, User Behavior Analysis, User Tags, Workflow Analysis, Promotion Activity Analysis]
  25. Didi Kylin Usage: 200 TB+ data, 5,000+ cubes, 7,000+ jobs per day, 7 clusters
  26. Join the Community: https://github.com/apache/kylin, apache-kylin.slack.com, user@kylin.apache.org
  27. THANK YOU
