Cloud-native Semantic Layer on Data Lake

Dong Li
PMC @ Apache Kylin, Founding Member @ Kyligence

About Me
Dong Li is a Founding Member and Head of Product and Innovation
at Kyligence, an Apache Kylin Core Developer (Committer) and member of
the Project Management Committee (PMC) where he focuses on big data
technology development.
Previously, he was a Senior Engineer in eBay’s Global Analytics
Infrastructure Department and a Software Development Engineer for
Microsoft Cloud Computing and Enterprise Products.

Agenda
▪ Start from a real challenge
▪ The Solution
▪ Q&A

Customer Background
• A fast-growing SaaS company in the US
• 1800 customers in 40+ countries
• Used by 1/3 of the Fortune 500
• 8 Billion transactions per year
• Dashboards for end users

Landscape & Challenges
• Source Data in AWS RDS
• Materialized views used for dashboards
• Slow queries take 5+ seconds
• 4+ hours to refresh materialized views every day
• Bottleneck at ~10 concurrent users
• Couldn’t provide flexible dashboards
• Number of views keeps increasing
[Diagram: OLTP (RDS), OLAP (RDS), Materialized View, Dashboard; connections labeled Export, ETL, SQL]

Expectation for the future data platform
• Flexible dashboards for end users
• High performance (< 2s), high concurrency (> 100 users)
• Easy to scale
• Low data preparation latency (< 1 hour)
• Flexible for new requirements
• Enterprise-grade security: data recovery, row/column level access etc.
• Totally on AWS
• Low TCO
• Open Platform for Machine learning, Internal Analytics etc.

Apache Kylin: Open Source Distributed Analytical Data Warehouse

Apache Kylin: Managing Your Most Valuable Data
• OLAP Data Modelling
• Speed Up Analytics Using Pre-Calculation
• ANSI SQL Interface
• High Concurrency and High Performance
• Batch & Streaming Together
[Diagram: layered stack labeled Presentation / Visualization, Data Mart, Big Data Platform, and Data Source; components shown: Hive, Impala, Spark SQL, Kafka, MapReduce, Spark, …]

Apache Kylin Community & Adoptions
1000+ Global Adoptions
Leading Open Source OLAP
[Charts: GitHub Stars, JIRA Issues]

Star Schema Benchmark
Star schema benchmark: http://www.cs.umb.edu/~poneil/StarSchemaB.PDF
[Chart: SQL Latency, Kylin vs. SQL on Hadoop. X-axis: SSB queries 1.1 to 4.3; y-axis: latency (s). Lower is better.]
[Chart: Data Volume Scale, Kylin vs. SQL on Hadoop. X-axis: data scale; y-axis: latency (s). Lower is better.]

Interactive Analysis with BI for Petabyte-Scale Datasets

How Does Kylin Accelerate Queries?
• Kylin uses Apache Calcite as the SQL parser and optimizer

select
    l_returnflag,
    o_orderstatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price
from
    v_lineitem
    inner join v_orders on l_orderkey = o_orderkey
where
    l_shipdate <= '1998-09-16'
group by
    l_returnflag,
    o_orderstatus
order by
    l_returnflag,
    o_orderstatus;

[Diagram: the SQL is parsed into an execution plan: Sort ← Aggr ← Filter ← Join ← Tables, O(N)]

How Does Kylin Accelerate Queries?
• Kylin optimizes and adapts the plan to an OLAP cube.
• With less processing, Kylin can return the result instantly.
[Diagram: Kylin picks the best-matched cube and rewrites the plan Sort ← Aggr ← Filter ← Join ← Tables to Sort ← Filter ← Cube, O(1). The steps dropped from the plan have already been completed in the cube build.]
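The rewrite above can be sketched with Python's built-in sqlite3 module. This is only an illustration of the pre-calculation idea, not Kylin's actual storage or planner: the toy rows, the `cube` table name, and the choice of dimensions are assumptions. The join and the first aggregation run once at "build" time; the query then only filters and sums pre-aggregated cells.

```python
import sqlite3

# Hypothetical toy rows; table and column names follow the query on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE v_orders (o_orderkey INTEGER, o_orderstatus TEXT);
CREATE TABLE v_lineitem (l_orderkey INTEGER, l_returnflag TEXT,
                         l_quantity REAL, l_extendedprice REAL, l_shipdate TEXT);
INSERT INTO v_orders VALUES (1, 'F'), (2, 'O');
INSERT INTO v_lineitem VALUES
  (1, 'R', 10, 100.0, '1998-01-05'),
  (1, 'R',  5,  50.0, '1998-09-16'),
  (2, 'N',  7,  70.0, '1998-03-01'),
  (2, 'N',  3,  30.0, '1998-12-01');
""")

# The original query: join, filter, aggregate, and sort over the raw tables.
RAW_QUERY = """
SELECT l_returnflag, o_orderstatus,
       SUM(l_quantity) AS sum_qty, SUM(l_extendedprice) AS sum_base_price
FROM v_lineitem INNER JOIN v_orders ON l_orderkey = o_orderkey
WHERE l_shipdate <= '1998-09-16'
GROUP BY l_returnflag, o_orderstatus
ORDER BY l_returnflag, o_orderstatus
"""

# "Cube build" (assumed layout): join and aggregate once, ahead of query time,
# keeping every column the query filters or groups on as a dimension.
conn.execute("""
CREATE TABLE cube AS
SELECT l_returnflag, o_orderstatus, l_shipdate,
       SUM(l_quantity) AS sum_qty, SUM(l_extendedprice) AS sum_base_price
FROM v_lineitem INNER JOIN v_orders ON l_orderkey = o_orderkey
GROUP BY l_returnflag, o_orderstatus, l_shipdate
""")

# The rewritten query: no join, only a filter and a sum over cube cells.
CUBE_QUERY = """
SELECT l_returnflag, o_orderstatus,
       SUM(sum_qty) AS sum_qty, SUM(sum_base_price) AS sum_base_price
FROM cube
WHERE l_shipdate <= '1998-09-16'
GROUP BY l_returnflag, o_orderstatus
ORDER BY l_returnflag, o_orderstatus
"""

raw_result = conn.execute(RAW_QUERY).fetchall()
cube_result = conn.execute(CUBE_QUERY).fetchall()
print(raw_result == cube_result)
```

Both queries return the same rows; the difference is that the cube query touches one pre-aggregated cell per dimension combination instead of every source row.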

Architecture
The architecture of Apache Kylin v4.0.0-alpha
[Diagram: BI Tools, Apps, and Machine Learning send SQL to Apache Kylin. Other labels: Runtime Workload, Offload Workload, Scan & filter, Extract, Load.]

Use Case: Online Shopping Reporting
The most visited website in Japan
https://techblog.yahoo.co.jp/oss/apache-kylin/
§ Our reporting system previously used Impala as its backend
database.
- It took a long time (about 60 sec) to show the Web
UI.
§ To lower the latency, we moved to Apache
Kylin.
- Average latency < 1 sec for most cases
Thanks to Kylin's low latency, we can now focus on
adding functions for users.
§ We provide a reporting system that shows
statistics for store owners.
- e.g. impressions, clicks and sales.

Apache Kylin 4.0 Roadmap: Cloud Native
[Diagram labels: Data analytics; Interactive Reporting Dashboard; OLAP / Data mart; Apache Kylin; Container Service (K8S, Docker); Resource Orchestration; Metadata; Security; Data Lake: Source file, Streams, Parquet on Object Storage (S3, ADLS)]
• Fewer Dependencies, More Lightweight
• Automated Scaling
• Lower Computing and Storage Cost
• Automated DevOps

Data Is the Next Oil
“The world’s most valuable resource is no longer oil, but data.” -- The Economist
China’s Datasphere is expected to grow 30% on average over the next 7
years and will be the largest Datasphere of all regions by 2025 -- IDC
175 Zettabytes by 2025 -- IDC

But, the Chaos Happens Again!!!

What is missing here?
[Diagram labels: users (Data Analysts, Marketing User, Operation Analysts); SQL / MDX; Application (Reporting, Dashboard, Ad-Hoc, Data-as-a-Services, Machine Learning); a gap marked "? ? ?"; Data Lake (EDW Datasets, Data Lake Datasets, Cloud Datasets)]

Unified Semantic Layer
[Diagram labels: Govern; Data Platform; SQL / MDX; Application (Reporting, Dashboard, Ad-Hoc, Data-as-a-Services, Machine Learning); Data Lake (EDW Datasets, Data Lake Datasets, Cloud Datasets)]
Managed Datasets
• CUSTOMER: CUSTOMER NUMBER, CUSTOMER NAME, CUSTOMER CITY, CUSTOMER POST, CUSTOMER ST, CUSTOMER ADDR, CUSTOMER PHONE, CUSTOMER FAX
• ORDER: ORDER NUMBER, ORDER DATE, STATUS
• ORDER ITEM BACKORDERED: QUANTITY
• ITEM: ITEM NUMBER, QUANTITY, DESCRIPTION
• ORDER ITEM SHIPPED: QUANTITY, SHIP DATE
Managed KPIs
• Finance KPI, ERP KPI, Accounting KPI, Marketing KPI, Sales KPI, ……
One-stop Governed Platform
• Data as a service
• Single source of truth
• Managed golden data
Intelligent Data Platform
• Machine learning recommendations from SQL history
• Optimized for PB-scale data
• High performance and high concurrency
Analyst-Friendly Platform
• Supports the most popular BI tools
• Supports standard SQL/MDX
• Reduces engineering effort
[Users: Data Analysts, Marketing User, Operation Analysts]
Intelligent Cubing: managed data at scale
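One way to picture "managed KPIs" as a single source of truth is a registry that defines each KPI once and generates SQL from it. The sketch below is hypothetical throughout: the `Kpi` class, the `kpi_to_sql` helper, the KPI names, and the `orders` dataset with its columns are illustrative assumptions, not Kylin or Kyligence APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    """A governed KPI: one agreed definition shared by every consumer."""
    name: str
    expression: str  # SQL aggregate expression (hypothetical columns)
    dataset: str     # managed dataset the KPI is computed from


# Hypothetical registry; real definitions would live in the semantic layer.
KPIS = {
    "total_sales": Kpi("total_sales", "SUM(quantity * unit_price)", "orders"),
    "order_count": Kpi("order_count", "COUNT(DISTINCT order_number)", "orders"),
}


def kpi_to_sql(kpi_name, group_by=()):
    """Translate a KPI name plus optional dimensions into a SQL string."""
    kpi = KPIS[kpi_name]
    dims = ", ".join(group_by)
    select_dims = f"{dims}, " if dims else ""
    sql = f"SELECT {select_dims}{kpi.expression} AS {kpi.name} FROM {kpi.dataset}"
    if dims:
        sql += f" GROUP BY {dims}"
    return sql


print(kpi_to_sql("total_sales", ["customer_city"]))
```

Because every dashboard or ad-hoc tool goes through the same definition, two teams asking for `total_sales` cannot end up with two different formulas.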

Unified Data-as-a-Service Layer
Unified Semantic Layer

Use Case: Fewer Cubes for More
2 Cubes vs 1200 IBM Cognos Cubes
Challenge
• 1200+ existing cubes to manage
• 1000+ jobs to maintain
• Time-to-insight over 4 days
[Diagram, before: 1200+ IBM Cognos Cubes, 1000+ ETL Jobs with dependencies. Each topic (Merchant, Region, Agency, Channel) has a Daily Cube and a Monthly Cube, plus sub-scenario cubes such as the Merchant Topic Shanghai, Beijing, Zhejiang, and Guangdong Cubes.]
[Diagram, after: 2 Cubes, 1 ETL Job]
Solution
• Replaced the IBM Cognos backend with Kylin while
keeping Cognos Reporting, which now interacts
with Kyligence
Result
• 1000x improvement in maintenance efficiency
• 10x faster and more stable analytics performance
• Time-to-insight less than 4 hours
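Why a couple of cubes can stand in for separate daily, monthly, and per-region cubes can be sketched as a roll-up over one finer-grained cube. This is a minimal illustration under assumed data: the cell layout (date, region, merchant), the amounts, and the `rollup` helper are hypothetical and do not describe the customer's actual model.

```python
from collections import defaultdict

# Hypothetical daily cube cells: (date, region, merchant) -> amount.
DAILY_CUBE = {
    ("2020-01-01", "Shanghai", "m1"): 100,
    ("2020-01-02", "Shanghai", "m1"): 50,
    ("2020-01-02", "Beijing", "m2"): 70,
    ("2020-02-01", "Beijing", "m2"): 30,
}


def rollup(cube, keep, where=None):
    """Aggregate cube cells down to the dimensions returned by `keep`."""
    out = defaultdict(int)
    for dims, amount in cube.items():
        if where is None or where(dims):
            out[keep(dims)] += amount
    return dict(out)


# A "monthly by region" view derived from the daily cube, not a separate cube.
monthly_by_region = rollup(DAILY_CUBE, keep=lambda d: (d[0][:7], d[1]))

# A "Shanghai daily" sub-scenario is just a filter on the same cube.
shanghai_daily = rollup(
    DAILY_CUBE, keep=lambda d: d[0], where=lambda d: d[1] == "Shanghai"
)

print(monthly_by_region)
print(shanghai_daily)
```

Each sub-scenario becomes a query over shared cells, so there is one build job to maintain rather than one per cube.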

Cloud-native Semantic Layer on Data Lake
[Diagram stages: Source, Data Lake, Aggregation & Index, Applications. Other labels: Azure Blob Storage, Index, Finance, Marketing, Sales, More…]

