How to Guarantee Exact COUNT DISTINCT Queries
with Sub-Second Latency on Massive Datasets
Kaige Liu
2020.5
© Kyligence Inc. 2019, Confidential.
Agenda
• Business Scenarios
• Technical Principles
• Demo
• Use Cases
• Q&A
Business Scenarios
What Is Count Distinct?
Count Distinct is used to compute the number of
unique values in a data set.
• PV (Page View)
• UV (Unique Visitors)
ID  Username  Page
1   Alice     /kyligence
2   Alice     /Kyligence/Blog
3   Carol     /Kyligence/Events
4   Bob       /Kyligence/Resources
5   Alice     /Kyligence/Downloads
Distinct usernames: Alice, Bob, Carol
Count Distinct = 3
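The difference between PV and UV on the example table can be sketched in a few lines of Python. This is only an illustration; the variable names are assumptions, and the rows are copied from the table above.

```python
# Rows from the example table: (id, username, page)
visits = [
    (1, "Alice", "/kyligence"),
    (2, "Alice", "/Kyligence/Blog"),
    (3, "Carol", "/Kyligence/Events"),
    (4, "Bob", "/Kyligence/Resources"),
    (5, "Alice", "/Kyligence/Downloads"),
]

# PV: every row counts, repeats included.
page_views = len(visits)

# UV: a set keeps each username once, which is what Count Distinct measures.
unique_visitors = {username for _, username, _ in visits}

print(page_views)               # 5
print(sorted(unique_visitors))  # ['Alice', 'Bob', 'Carol']
print(len(unique_visitors))     # 3
```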
Approximate and Exact Count Distinct
• Approximate Count Distinct
• Quick, less memory/CPU
• Not accurate
• Trend analysis, small errors are acceptable
• Exact Count Distinct
• Slow, more memory/CPU
• Accurate
• Transaction-related: paid advertising, precision marketing, etc.
Error Rate  $1 Million  $1 Billion
1.22%       $12,200     $12,200,000
2.44%       $24,400     $24,400,000
9.75%       $97,500     $97,500,000
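Each figure in the table is the transaction amount multiplied by the error rate. A minimal sketch of that arithmetic follows; the function name is hypothetical, and rates are held in basis points (an assumption made here to keep the math in exact integers).

```python
# Error rates from the table, in basis points (1.22% = 122 bp).
ERROR_RATES_BP = [122, 244, 975]
AMOUNTS = [1_000_000, 1_000_000_000]

def misattributed(amount, rate_bp):
    """Money affected when a count is off by rate_bp basis points."""
    return amount * rate_bp // 10_000

for rate_bp in ERROR_RATES_BP:
    row = [misattributed(amount, rate_bp) for amount in AMOUNTS]
    print(f"{rate_bp / 100:.2f}%", *(f"${value:,}" for value in row))
```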
Scenarios - Web/App Analytics
• Where are they coming from?
• Who are my visitors?
• Which page lost the most users?
• How many active users?
• How many new users?
• How many unique visitors?
Scenarios - User Behavior Analytics
Retention Analysis
Funnel Analysis
Technical Principles
Challenges with Exact Count Distinct
• Approximate Count Distinct is easy – HyperLogLog
• Exact Count Distinct is a big challenge for all query engines at massive scale
Challenges
• Poor performance – every query needs to scan all the data
• Non-additive – partial results cannot simply be summed, so rollups and AND/OR operations are hard
• Hard to optimize across multiple columns
• Analyses usually require more than one count distinct operation
Count Distinct Performance on Different Platforms
• Google BigQuery
• Snowflake
• Athena
• Apache Kylin
• Kyligence
Kyligence = Kylin + Intelligence
• Founded in 2016 by the creators of Apache Kylin
• Built around Kylin, augmented with AI and enhanced to deliver unprecedented enterprise analytic performance
• Named one of CRN's top 10 big data startups in 2018
• Global Presence: San Jose, Seattle, New York, Shanghai, Beijing
• VCs: Fidelity International, Shunwei Capital, Broadband Capital,
Redpoint, Cisco, Coatue
Accelerate Critical Business Decisions with AI-Augmented Data Management
and Analytics
2016: Founded; Pre-A (Redpoint, Cisco)
2017: Series A (CBC, Shunwei)
2018: Series B (8Roads)
2019: Series C (Coatue)
How Does Apache Kylin Achieve This?
Pre-Aggregation
• Pre-aggregate count distinct in cubes
• Fetch results directly without on-the-fly calculations
Bitmap
• Supports rollup
• Reduces memory/storage significantly
Dictionary
• Supports String type and detail queries
Pre-Aggregation
Date        UID  Page
2020-04-01  1    /kyligence
2020-04-01  1    /Kyligence/Blog
2020-04-01  2    /Kyligence/News
2020-04-02  3    /Kyligence/Events
2020-04-02  2    /Kyligence/Resources
2020-04-02  1    /Kyligence/Downloads
Date        Count(UID)  Count(distinct UID)
2020-04-01  3           2
2020-04-02  3           3
Date                       Count(UID)  Count(distinct UID)
2020-04-01 and 2020-04-02  6           ??
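The "??" arises because pre-aggregated distinct counts cannot be added the way plain counts can. A small sketch using the rows above shows the gap; Python sets stand in for whatever the engine stores, which is an assumption for illustration only.

```python
from collections import defaultdict

# (date, uid, page) rows from the table above
rows = [
    ("2020-04-01", 1, "/kyligence"),
    ("2020-04-01", 1, "/Kyligence/Blog"),
    ("2020-04-01", 2, "/Kyligence/News"),
    ("2020-04-02", 3, "/Kyligence/Events"),
    ("2020-04-02", 2, "/Kyligence/Resources"),
    ("2020-04-02", 1, "/Kyligence/Downloads"),
]

uids_by_date = defaultdict(set)
for date, uid, _page in rows:
    uids_by_date[date].add(uid)

# Per-day pre-aggregates that keep only the number, not the members.
daily_distinct = {date: len(uids) for date, uids in uids_by_date.items()}

# Adding the per-day numbers double-counts users seen on both days.
sum_of_daily = sum(daily_distinct.values())

# The correct answer needs the members themselves.
true_distinct = len(set().union(*uids_by_date.values()))

print(daily_distinct)  # {'2020-04-01': 2, '2020-04-02': 3}
print(sum_of_daily)    # 5
print(true_distinct)   # 3
```

Count(UID) rolls up correctly as 3 + 3 = 6, but Count(distinct UID) needs more than the two numbers 2 and 3.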
Bitmap
Table (UID column): 1, 2, 4, 5, 7, 9, 10, 11, 13
Bitmap (one row per byte; a set bit marks a UID that is present):
Bit position  7 6 5 4 3 2 1 0
UIDs 7-0      1 0 1 1 0 1 1 0
UIDs 15-8     0 0 1 0 1 1 1 0
• Saves storage significantly
• Supports logical operations directly
• Contains information needed to do aggregation
• RoaringBitmap
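As a rough sketch of the encoding, the UID list can be turned into a bitmap by setting bit i for each UID i. A plain Python int stands in here for a compressed bitmap such as RoaringBitmap; the helper names are hypothetical.

```python
def to_bitmap(uids):
    """Set bit i for every UID i (assumes small non-negative ints)."""
    bitmap = 0
    for uid in uids:
        bitmap |= 1 << uid
    return bitmap

def byte_rows(bitmap, n_bytes):
    """Render the bitmap one byte per row, bit 7 on the left."""
    return [format((bitmap >> (8 * i)) & 0xFF, "08b") for i in range(n_bytes)]

uids = [1, 2, 4, 5, 7, 9, 10, 11, 13]
bitmap = to_bitmap(uids)
rows = byte_rows(bitmap, 2)

# The distinct count is the number of set bits.
distinct_count = bin(bitmap).count("1")

for row in rows:
    print(" ".join(row))
print(distinct_count)
```

Nine UIDs fit in two bytes here, and the count is recovered by counting set bits rather than rescanning the table.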
Bitmap
Date        UID  Page
2020-04-01  1    /kyligence
2020-04-01  1    /Kyligence/Blog
2020-04-01  2    /Kyligence/News
2020-04-02  3    /Kyligence/Events
2020-04-02  2    /Kyligence/Resources
2020-04-02  1    /Kyligence/Downloads
Date        Count(UID)  Count(distinct UID)
2020-04-01  3           2
2020-04-02  3           3
Date        Count(UID)  Count(distinct UID)
2020-04-01  3           Bitmap(1,2)
2020-04-02  3           Bitmap(1,2,3)
Date                       Count(UID)  Count(distinct UID)
2020-04-01 and 2020-04-02  6           Bitmap(1,2,3)
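Storing a bitmap per day instead of a number makes the rollup a bitwise OR. The sketch below assumes Python ints as bitmaps and hypothetical helper names; it is not Kylin's implementation.

```python
# Pre-aggregated measure per date: bit i set means UID i was seen that day.
daily_bitmaps = {
    "2020-04-01": (1 << 1) | (1 << 2),             # Bitmap(1,2)
    "2020-04-02": (1 << 1) | (1 << 2) | (1 << 3),  # Bitmap(1,2,3)
}

def rollup(bitmaps):
    """Merge any number of pre-aggregated bitmaps with bitwise OR."""
    merged = 0
    for bitmap in bitmaps:
        merged |= bitmap
    return merged

def cardinality(bitmap):
    """Exact distinct count = number of set bits."""
    return bin(bitmap).count("1")

both_days = rollup(daily_bitmaps.values())

print(cardinality(daily_bitmaps["2020-04-01"]))  # 2
print(cardinality(daily_bitmaps["2020-04-02"]))  # 3
print(cardinality(both_days))                    # 3
```

The rollup touches only the two stored bitmaps, not the six detail rows.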
Operations in Bitmap
• Two bitmaps, each representing a different data set:
[1, 3, 4, 5]
[2, 3, 4, 6]
• And – All elements contained in both bitmaps:
[1, 3, 4, 5] and [2, 3, 4, 6] = [3, 4]
Scenarios: Retention Analysis, Funnel Analysis
• Or – All elements contained in either bitmap:
[1, 3, 4, 5] or [2, 3, 4, 6] = [1, 2, 3, 4, 5, 6]
Scenarios: Cross-Dimension Analysis
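Both operations map onto single bitwise instructions. A minimal sketch, again using Python ints as stand-in bitmaps with hypothetical helper names:

```python
def from_values(values):
    """Encode a list of non-negative ints as a bitmap."""
    bitmap = 0
    for value in values:
        bitmap |= 1 << value
    return bitmap

def to_values(bitmap):
    """Decode a bitmap back into a sorted list of ints."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

a = from_values([1, 3, 4, 5])
b = from_values([2, 3, 4, 6])

in_both = to_values(a & b)    # And: elements in both bitmaps
in_either = to_values(a | b)  # Or: elements in either bitmap

print(in_both)    # [3, 4]
print(in_either)  # [1, 2, 3, 4, 5, 6]
```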
Dictionary
Bitmap can only support int values. How about String columns? Use a dictionary to encode each string as an int.
Date        USERNAME  Page
2020-04-01  Alice     /kyligence
2020-04-01  Alice     /Kyligence/Blog
2020-04-01  Bob       /Kyligence/News
2020-04-02  Coral     /Kyligence/Events
2020-04-02  Bob       /Kyligence/Resources
2020-04-02  Alice     /Kyligence/Downloads
USERNAME  ENCODED
Alice     1
Bob       2
Coral     3
Date        Count(UID)  Count(distinct UID)
2020-04-01  3           Bitmap(1,2)
2020-04-02  3           Bitmap(1,2,3)
Date                       Count(UID)  Count(distinct UID)
2020-04-01 and 2020-04-02  6           Bitmap(1,2,3)
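The dictionary step can be sketched as follows: each distinct string gets the next integer ID, and the bitmaps are then built over those IDs. The first-seen ordering and the function names are assumptions for illustration; a production dictionary would need to stay consistent across builds.

```python
def build_dictionary(values):
    """Assign each distinct string the next int ID, in first-seen order, starting at 1."""
    dictionary = {}
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary) + 1
    return dictionary

def decode(bitmap):
    """List the encoded IDs present in a bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# (date, username) pairs from the table above
rows = [
    ("2020-04-01", "Alice"), ("2020-04-01", "Alice"), ("2020-04-01", "Bob"),
    ("2020-04-02", "Coral"), ("2020-04-02", "Bob"), ("2020-04-02", "Alice"),
]

dictionary = build_dictionary(name for _, name in rows)

# Build one bitmap per date over the encoded IDs.
bitmaps = {}
for date, name in rows:
    bitmaps[date] = bitmaps.get(date, 0) | (1 << dictionary[name])

print(dictionary)                     # {'Alice': 1, 'Bob': 2, 'Coral': 3}
print(decode(bitmaps["2020-04-01"]))  # [1, 2]
print(decode(bitmaps["2020-04-02"]))  # [1, 2, 3]
```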
Use Cases
Manbang Group
• The largest Chinese truck logistics startup
• 7 million+ trucks
• 2.25 million active users
• 8 apps and 10 TB+ data
Requirements
• Retention analysis on a wide range of dimensions and date ranges
• Funnel analysis with customizable funnels
• User profile analysis
Architecture with Apache Kylin
Retention Analysis for Manbang Group
• Users can choose any column and any date range to do the retention analysis
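Retention for a chosen pair of dates reduces to a bitmap AND followed by a bit count. The sketch below uses made-up user IDs and hypothetical function names; it illustrates the bitmap technique from the earlier slides and is not Manbang's actual implementation.

```python
def cardinality(bitmap):
    """Number of set bits, i.e. number of distinct users."""
    return bin(bitmap).count("1")

def retention_rate(cohort_bitmap, later_bitmap):
    """Share of the cohort that is also present in the later bitmap."""
    cohort_size = cardinality(cohort_bitmap)
    if cohort_size == 0:
        return 0.0
    return cardinality(cohort_bitmap & later_bitmap) / cohort_size

# Hypothetical data: bit i set means user ID i was active that day.
day0 = 0b0011_1110  # users 1, 2, 3, 4, 5
day7 = 0b0110_1010  # users 1, 3, 5, 6

retained = cardinality(day0 & day7)
rate = retention_rate(day0, day7)

print(retained)  # 3
print(rate)      # 0.6
```

Because the per-day bitmaps can be pre-aggregated per dimension value, any date pair can be compared without rescanning detail rows.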
Funnel Analysis for Manbang Group
• Users can customize funnels with any number of steps
• Can identify the specific users lost between steps
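A funnel with any number of steps can be sketched as a chain of bitmap ANDs, with the users lost at each step recovered by AND NOT. The step names and user IDs below are invented, the helpers are hypothetical, and the sketch ignores event ordering and time windows that a real funnel would also check.

```python
def to_bitmap(user_ids):
    bitmap = 0
    for user_id in user_ids:
        bitmap |= 1 << user_id
    return bitmap

def to_ids(bitmap):
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

def funnel(steps):
    """steps: list of (name, bitmap). Returns (name, reached_ids, lost_ids) per step."""
    report = []
    reached = None
    for name, bitmap in steps:
        if reached is None:
            survivors, lost = bitmap, 0
        else:
            survivors = reached & bitmap   # users who got this far
            lost = reached & ~bitmap       # users who dropped out at this step
        report.append((name, to_ids(survivors), to_ids(lost)))
        reached = survivors
    return report

steps = [
    ("open_app", to_bitmap([1, 2, 3, 4, 5, 6])),
    ("search", to_bitmap([1, 2, 3, 5, 7])),
    ("place_order", to_bitmap([2, 3, 8])),
]

report = funnel(steps)
for name, reached_ids, lost_ids in report:
    print(name, reached_ids, lost_ids)
```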
DiDi
• #1 ride-share company in China
• 92 million monthly active users (as of Dec. 2019)
• 24 million rides per day in 2019
Requirements
• User profile analysis
• Precision marketing
Scenarios – Apache Kylin in DiDi
• Precision Marketing
o Send coupons to exact target users
o Upgrade cars for specific users
• Promotion Activity Analysis
o How many new/returning users does this activity gain?
o Which kinds of users are most interested in this activity?
• Optimize User Experience
o Which stages lose the most users?
o How to increase customer stickiness?
[Diagram: User Profile, Precision Marketing, User Behavior Analysis, User Tags, Workflow Analysis, Promotion Activity Analysis]
DiDi Kylin Usage
• 200 TB+ data
• 5,000+ cubes
• 7,000+ jobs per day
• 7 clusters
Join the Community
https://github.com/apache/kylin
apache-kylin.slack.com
user@kylin.apache.org
THANK YOU

 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 

How to Guarantee Exact COUNT DISTINCT Queries with Sub-Second Latency on Massive Datasets

  • 1. How to Guarantee Exact COUNT DISTINCT Queries with Sub-Second Latency on Massive Datasets Kaige Liu 2020.5
  • 2. Agenda: Business Scenarios, Technical Principles, Demo, Use Cases, Q&A. © Kyligence Inc. 2019, Confidential.
  • 3. Business Scenarios
  • 4. What Is Count Distinct? Count Distinct computes the number of unique values in a data set. Typical metrics: PV (Page Views) and UV (Unique Visitors).
       ID | Username | Page
       1 | Alice | /kyligence
       2 | Alice | /Kyligence/Blog
       3 | Carol | /Kyligence/Events
       4 | Bob | /Kyligence/Resources
       5 | Alice | /Kyligence/Downloads
       Distinct usernames: Alice, Bob, Carol. Count Distinct = 3.
  • 5. Approximate and Exact Count Distinct
       Approximate Count Distinct: quick, uses less memory/CPU; not accurate; suited to trend analysis where small errors are acceptable.
       Exact Count Distinct: slow, uses more memory/CPU; accurate; needed for transaction-relevant work such as paid advertising and precision marketing.
       Error Rate | $1 Million | $1 Billion
       1.22% | $12,200 | $12,200,000
       2.44% | $24,400 | $24,400,000
       9.75% | $97,500 | $97,500,000
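The error-rate table on slide 5 is plain multiplication of a spend figure by an error rate. A minimal sketch of that arithmetic follows; the function name `error_cost` is hypothetical and not part of the talk. Note that 2.44% of $1 million works out to $24,400.

```python
def error_cost(total_spend: float, error_rate_percent: float) -> float:
    """Spend that could be misattributed at the given count error rate.

    Hypothetical helper for illustration only.
    """
    return total_spend * error_rate_percent / 100


# Error rates taken from the slide; spend levels are $1 million and $1 billion.
for rate in (1.22, 2.44, 9.75):
    on_million = error_cost(1_000_000, rate)
    on_billion = error_cost(1_000_000_000, rate)
    print(f"{rate}%: ${on_million:,.0f} on $1M, ${on_billion:,.0f} on $1B")
```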
  • 6. Scenarios - Web/App Analytics: Who are my visitors? Where are they coming from? Which page lost the most users? How many active users? How many new users? How many unique visitors?
  • 7. Scenarios - User Behavior Analytics: Retention Analysis, Funnel Analysis
  • 8. Technical Principles
  • 9. Challenges with Exact Count Distinct
       Approximate Count Distinct is easy: HyperLogLog.
       Exact Count Distinct is a big challenge for all query engines at massive scale.
       Challenges: bad performance (need to scan all data); non-cumulative (hard to do rollup and AND/OR operations); hard to optimize on multiple columns; analysis always requires more than one count distinct operation.
  • 10. Count Distinct Performance on Different Platforms: Google BigQuery, Snowflake, Athena, Apache Kylin, Kyligence
  • 11. Kyligence = Kylin + Intelligence. Accelerate Critical Business Decisions with AI-Augmented Data Management and Analytics.
       Founded in 2016 by the creators of Apache Kylin.
       Built around Kylin, with augmented AI and enhanced to deliver unprecedented enterprise analytic performance.
       CRN Top-10 big data startups in 2018.
       Global presence: San Jose, Seattle, New York, Shanghai, Beijing.
       VCs: Fidelity International, Shunwei Capital, Broadband Capital, Redpoint, Cisco, Coatue.
       Timeline: 2016 founded, Pre-A (Redpoint, Cisco); 2017 Series A (CBC, Shunwei); 2018 Series B (8Roads); 2019 Series C (Coatue).
  • 12. How Does Apache Kylin Achieve This?
       Pre-Aggregation: pre-aggregate count distinct in cubes; fetch results directly without on-the-fly calculations.
       Bitmap: supports rollup; reduces memory/storage significantly.
       Dictionary: supports String type and detail queries.
  • 13. Pre-Aggregation
       Source table:
       Date | UID | Page
       2020-04-01 | 1 | /kyligence
       2020-04-01 | 1 | /Kyligence/Blog
       2020-04-01 | 2 | /Kyligence/News
       2020-04-02 | 3 | /Kyligence/Events
       2020-04-02 | 2 | /Kyligence/Resources
       2020-04-02 | 1 | /Kyligence/Downloads
       Pre-aggregated by date:
       Date | Count(UID) | Count(distinct UID)
       2020-04-01 | 3 | 2
       2020-04-02 | 3 | 3
       Rolled up across both dates:
       Date | Count(UID) | Count(distinct UID)
       2020-04-01 and 2020-04-02 | 6 | ??
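The "??" on slide 13 marks the core problem: per-day distinct counts cannot simply be added. The following sketch reproduces the slide's six rows with plain Python sets to show the gap; the variable names are illustrative only.

```python
from collections import defaultdict

# Rows from the slide: (date, uid, page).
rows = [
    ("2020-04-01", 1, "/kyligence"),
    ("2020-04-01", 1, "/Kyligence/Blog"),
    ("2020-04-01", 2, "/Kyligence/News"),
    ("2020-04-02", 3, "/Kyligence/Events"),
    ("2020-04-02", 2, "/Kyligence/Resources"),
    ("2020-04-02", 1, "/Kyligence/Downloads"),
]

# Collect the set of UIDs seen on each date.
uids_by_date = defaultdict(set)
for date, uid, _page in rows:
    uids_by_date[date].add(uid)

# Per-day distinct counts, as stored in a naive pre-aggregated table.
daily_distinct = {date: len(uids) for date, uids in uids_by_date.items()}

# Adding the per-day numbers double-counts users active on both days.
sum_of_daily = sum(daily_distinct.values())

# The correct answer needs the underlying sets, not just their sizes.
true_distinct = len(set().union(*uids_by_date.values()))

print(daily_distinct, sum_of_daily, true_distinct)
```

Summing the daily values gives 5, while the true distinct count over both days is 3, so a pre-aggregated number alone is not enough to answer the rolled-up query.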
  • 14. Bitmap
       [Figure: a table of UIDs (1, 2, 4, 5, 7, 9, 10, 11, 13) stored as a bitmap, one bit per UID, with bit positions labeled 7 to 0; bit rows shown: 10010110 and 00101110]
       Saves storage significantly.
       Supports logical operations directly.
       Contains the information needed to do aggregation.
       Implementation: RoaringBitmap.
  • 15. Bitmap
       Source table:
       Date | UID | Page
       2020-04-01 | 1 | /kyligence
       2020-04-01 | 1 | /Kyligence/Blog
       2020-04-01 | 2 | /Kyligence/News
       2020-04-02 | 3 | /Kyligence/Events
       2020-04-02 | 2 | /Kyligence/Resources
       2020-04-02 | 1 | /Kyligence/Downloads
       Pre-aggregated by date, storing only the number:
       Date | Count(UID) | Count(distinct UID)
       2020-04-01 | 3 | 2
       2020-04-02 | 3 | 3
       Pre-aggregated by date, storing a bitmap instead:
       Date | Count(UID) | Count(distinct UID)
       2020-04-01 | 3 | Bitmap(1,2)
       2020-04-02 | 3 | Bitmap(1,2,3)
       Rolled up across both dates:
       Date | Count(UID) | Count(distinct UID)
       2020-04-01 and 2020-04-02 | 6 | Bitmap(1,2,3)
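Storing a bitmap per date, as on slide 15, makes the rollup a bitwise OR followed by a bit count. The sketch below uses a Python integer as a stand-in bitmap; this is an assumption for illustration, since the talk names RoaringBitmap as the actual structure. The helpers `to_bitmap` and `cardinality` are hypothetical.

```python
def to_bitmap(uids):
    """Set bit number `uid` for every UID; duplicates collapse naturally."""
    bitmap = 0
    for uid in uids:
        bitmap |= 1 << uid
    return bitmap


def cardinality(bitmap):
    """Number of set bits, i.e. the exact distinct count."""
    return bin(bitmap).count("1")


# UIDs per date, taken from the slide's source table.
day1 = to_bitmap([1, 1, 2])   # Bitmap(1,2)
day2 = to_bitmap([3, 2, 1])   # Bitmap(1,2,3)

# Rolling up two dates is a bitwise OR of the stored bitmaps.
rollup = day1 | day2

print(cardinality(day1), cardinality(day2), cardinality(rollup))
```

The rolled-up cardinality is 3, which fills in the value that a table of plain numbers could not provide.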
  • 16. Operations in Bitmap
       Two bitmaps, each containing a different data set: [1, 3, 4, 5] and [2, 3, 4, 6].
       And: all elements contained in both bitmaps. [1, 3, 4, 5] and [2, 3, 4, 6] = [3, 4]. Scenarios: Retention Analysis, Funnel Analysis.
       Or: all elements in either bitmap. [1, 3, 4, 5] or [2, 3, 4, 6] = [1, 2, 3, 4, 5, 6]. Scenarios: Cross-Dimension Analysis.
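The two operations on slide 16 can be checked directly. This sketch again treats a Python integer as a simple bitmap (an illustrative assumption, not RoaringBitmap itself); `to_bitmap` and `members` are hypothetical helper names.

```python
def to_bitmap(values):
    """Build an integer bitmap with bit `v` set for each value."""
    bitmap = 0
    for v in values:
        bitmap |= 1 << v
    return bitmap


def members(bitmap):
    """List the positions of the set bits in ascending order."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


a = to_bitmap([1, 3, 4, 5])
b = to_bitmap([2, 3, 4, 6])

both = members(a & b)     # elements contained in both bitmaps
either = members(a | b)   # elements contained in either bitmap

print(both)
print(either)
```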
  • 17. Dictionary
       A bitmap can only hold int values. How about String columns? Use a dictionary.
       Source table:
       Date | USERNAME | Page
       2020-04-01 | Alice | /kyligence
       2020-04-01 | Alice | /Kyligence/Blog
       2020-04-01 | Bob | /Kyligence/News
       2020-04-02 | Carol | /Kyligence/Events
       2020-04-02 | Bob | /Kyligence/Resources
       2020-04-02 | Alice | /Kyligence/Downloads
       Dictionary:
       USERNAME | ENCODED
       Alice | 1
       Bob | 2
       Carol | 3
       Pre-aggregated by date:
       Date | Count(UID) | Count(distinct UID)
       2020-04-01 | 3 | Bitmap(1,2)
       2020-04-02 | 3 | Bitmap(1,2,3)
       Rolled up across both dates:
       Date | Count(UID) | Count(distinct UID)
       2020-04-01 and 2020-04-02 | 6 | Bitmap(1,2,3)
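The dictionary step on slide 17 maps each string to an integer so it can be stored in a bitmap. A minimal sketch follows, assuming IDs are assigned in first-seen order starting at 1; how Kylin actually builds its dictionary is not described in the slides, and `build_dictionary` is a hypothetical name.

```python
def build_dictionary(values):
    """Assign each distinct string the next integer ID, in first-seen order."""
    dictionary = {}
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary) + 1
    return dictionary


# (date, username) pairs from the slide's source table.
rows = [
    ("2020-04-01", "Alice"), ("2020-04-01", "Alice"), ("2020-04-01", "Bob"),
    ("2020-04-02", "Carol"), ("2020-04-02", "Bob"), ("2020-04-02", "Alice"),
]

encoding = build_dictionary(name for _date, name in rows)

# Encode usernames into per-date integer bitmaps (bit n set means ID n is present).
bitmaps = {}
for date, name in rows:
    bitmaps[date] = bitmaps.get(date, 0) | (1 << encoding[name])

# The reverse mapping recovers usernames from a bitmap, e.g. for detail queries.
decoding = {code: name for name, code in encoding.items()}
rollup = bitmaps["2020-04-01"] | bitmaps["2020-04-02"]
users = [decoding[i] for i in range(rollup.bit_length()) if (rollup >> i) & 1]

print(encoding, users)
```

Keeping the reverse mapping is what allows the encoded bitmap to answer "which users" as well as "how many".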
  • 18. Use Cases
  • 19. Manbang Group
       The largest Chinese truck logistics startup; 7 million+ trucks; 2.25 million active users; 8 apps and 10 TB+ data.
       Requirements: retention analysis on a wide range of dimensions and date ranges; funnel analysis with the ability to customize the funnel; user profile analysis.
  • 20. Architecture with Apache Kylin
  • 21. Retention Analysis for Manbang Group: users can choose any column and any date range to do the retention analysis.
  • 22. Funnel Analysis for Manbang Group: users can customize funnels with any number of steps, and can identify the specific users lost between steps.
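Identifying the users lost between funnel steps, as slide 22 describes, reduces to bitmap AND and AND-NOT. The sketch below uses made-up user IDs and step names purely for illustration, with a Python integer standing in for the bitmap; none of these values come from the Manbang deployment.

```python
def to_bitmap(user_ids):
    """Build an integer bitmap with bit `uid` set for each user ID."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def members(bitmap):
    """List the user IDs whose bits are set, in ascending order."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical funnel: users who performed each step.
steps = [
    ("view", to_bitmap([1, 2, 3, 4, 5, 6])),
    ("add_to_cart", to_bitmap([2, 3, 5, 6])),
    ("pay", to_bitmap([3, 6])),
]

# Walk the funnel: users remaining after each step, and users lost at each step.
remaining = steps[0][1]
lost_at = {}
for name, step_bitmap in steps[1:]:
    lost_at[name] = members(remaining & ~step_bitmap)  # reached so far, not this step
    remaining &= step_bitmap                           # reached this step too

print(lost_at, members(remaining))
```

Because each step is just another bitmap, the funnel can have any number of steps, and a retention query over two dates is the same AND applied to the two dates' bitmaps.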
  • 23. DiDi
       #1 ride-share company in China; 92 million monthly active users (as of Dec. 2019); 24 million rides per day in 2019.
       Requirements: user profile analysis; precision marketing.
  • 24. Scenarios - Apache Kylin in DiDi
       Precision Marketing: send coupons to exact target users; upgrade cars for specific users.
       Promotion Activity Analysis: How many new/returned users are gained in this activity? Which kind of users are most interested in this activity?
       Optimize User Experience: Which stages lost the most users? How to increase customer stickiness?
       [Diagram labels: User Profile, Precision Marketing, User Behavior Analysis, User Tags, Workflow Analysis, Promotion Activity Analysis]
  • 25. DiDi Kylin Usage: 200 TB+ data, 5,000+ cubes, 7,000+ jobs per day, 7 clusters
  • 26. Join the Community: https://github.com/apache/kylin, apache-kylin.slack.com, user@kylin.apache.org

Editor's Notes

  1. UV/PV: put some words in the slide
  2. Put a static image instead of a GIF
  3. Link AND/OR to analysis scenarios