SlideShare a Scribd company logo
1 of 28
The Evolution of Apache Kylin
Realtime & Plugin Architecture in Kylin
Li, Yang | 李扬
Co-founder & CTO at Kyligence Inc.
Agenda
 What’s Apache Kylin?
 New Features in Kylin
 Plugin Architecture
 Fast Cubing
 Parallel Scan
 Streaming Cubing
 User Defined Aggregation
 Summary
Extreme OLAP Engine for Big Data
Kylin is an open source Distributed Analytics Engine from eBay that
provides SQL interface and multi-dimensional analysis (OLAP) on
Hadoop supporting extremely large datasets
What’s Kylin
kylin / ˈkiːˈlɪn / 麒麟
--n. (in Chinese art) a mythical animal of composite form
• Nov 2014 -- Apache Incubator Project
• Nov 2015 – Apache Top Level Project
Feature – SQL Interface
Hive Table
Build Cube
(Index)
SQL Query
 eBay
Feature – Big Data
Case Cube Size Raw Records
Session Analysis 20 TB 81+ billion rows
Traffic Analysis 30 TB 28+ billion rows
Transaction Analysis 560 GB 1.2+ billion rows
90% queries <5s
Dark-blue line: 90%tile queries
Light-blue line: 95%tile queries
90%ile query returns in 3 seconds
Feature – Low Latency
Feature – BI Integration via ODBC, JDBC
Linear scale out with more nodes
Feature – Scalable Throughput
Agenda
 What’s Apache Kylin?
 New Features in Kylin
 Plugin Architecture
 Fast Cubing
 Parallel Scan
 Streaming Cubing
 User Defined Aggregation
 Summary
Cube Builder (MapReduce…)
SQL
Low Latency -
SecondsRouting
3rd Party App
(Web App, Mobile…)
Metadata
SQL-Based Tool
(BI Tools: Tableau…)
Query Engine
Hadoop
Hive
REST API JDBC/ODBC
 Online Analysis Data Flow
 Offline Data Flow
 Clients/Users interactive with
Kylin via SQL
 OLAP Cube is transparent to
users
Star Schema Data Key Value Data
Data
Cube
OLAP
Cubes
(HBase)
SQL
REST ServerDataSource
Abstraction
Engine
Abstraction
Storage
Abstraction
Plugin Architecture Overview
MR Engine
IN OUT
Hive
Source
HBase
Storage
Cube Metadata
SourceFactory StorageFactoryEngineFactory
Plugin Architecture
MR Engine
Plugin Architecture
Hive Adapter HBase Adapter
load data save cubeHive
Source
HBase
Storage
adapt to IN adapt to OUT
 Engine
 MR V1
 MR V2
 Spark (early)
 Streaming (experimental)
 Source
 Hive
 Kafka
 Spark SQL & DataFrames
 Storage
 HBase
 ? Kudu
 ? Cassandra
Developing Modules
 Freedom
 Zoo break, not bound to Hadoop any more
 Free to go to a better engine or storage
 Extensibility
 Accept any input, e.g. Kafka
 Embrace next-gen distributed platform, e.g. Spark
 Flexibility
 Choose different engine for different data set
The Freedom, Extensibility, Flexibility
Full Data
0-D Cuboid
1-D Cuboid
2-D Cuboid
3-D Cuboid
4-D Cuboid
MR
MR
MR
MR
MR
A,B,C,D
A,B,C A,B,D A,C,D B,C,D
Layered Cubing (MR Engine V1)
 Pros
 Simple implementation, depends
on MR shuffle to merge sort and
then aggregate
 Little requirement on memory
 Cons
 Aggregation happens at reducer
side
 Mapper outputs raw data thus
shuffle is huge
 Multiple rounds of MR overhead
 Shuffle can be 100x of cube size,
big I/O pressure
mapper mapper mapper
reducer
Fast Cubing
 Pros
 In-mem cubing algorithm that can
be reused by Streaming, Spark etc.
 Mapper side aggregation
 Lesser shuffling given the right data
split
 One round MR
 Cons
 Code complexity
 High mapper CPU/Mem
consumption
Data Split Data Split Data Split
……
Final Cube
Merge Sort
(Shuffle)
 If data splits are unique
 Fast cubing wins
 If data splits are common
 Layer cubing wins
 New cube engine chooses
the right algorithm based on
data sampling.
 Overall build time is 1.5x
faster, sum results from 500
jobs.
Fast Cubing (MR Engine V2)
 Slow queries are 5-10x
faster.
 New Hbase storage
enables partition on
cuboids that are big
enough.
 Overall query time is 2x
faster than before, sum
results from 10,000+
queries.
Parallel Scan
Query
Cuboid A
Cuboid B
Query
A1 B1
A2 B2
A3 C
Cuboid C
Server 1
Server 2
Server 3
Server 1
Server 2
Server 3
Near Realtime Incremental Build
 Minutes micro cubes
 Kafka source
 In-mem cubing
 Auto merge
Cube StorageReal-time In-Mem Store
streaming Kafka
SQL Query
minute batch
Latest second
Inverted
Index
Hybrid Storage
Interface
Cube
Future Lambda Architecture for Realtime
Use Case: SEO Operational Dashboard
 eBay Site
 ebay.com, ebay.co.uk, ebay.de
 Buyer Country
 US, CN, RU
 Search Engine
 Google, Bing, Yahoo!
 Referrer
 google.com, google.co.uk
 Page
 Search, View Item, Product
 User Experience
 Desktop, Mobile APP, mWeb
• Visits, GMB $, GMB share,
conversion rate, bounce rate, # of
view items, # of bought items etc.
Dimensions
Measurements
 HyperLogLog Count Distinct
 TopN
 BitMap Precise Count Distinct
 from Sun, Yerui (netease.com)
 Raw Records
 from Wang, Xiaoyu (jd.com)
 Domain specific aggregations now become easy
 aggregate user events to detect time serials or access patterns
 draw a sketch of certain user groups
 pre-calculate clusters of data points
 histogram…
User Defined Aggregation Types
DT,LOC TopN
2015-10-1,CN Item A, $500
Item B, $300
…
TopN Support
select dt, loc, item, sum(gmv)
from test_kylin_fact
where dt=‘2015-10-1’ and loc=‘CN’
group by dt, loc, item
order by 4 desc
limit 100 cube pre-calculation
 TopN as a measure
 Approximate algorithm
 SpaceSaving TopN
 Ahmed Metwally, et al. “Efficient computation of frequent and top-k elements in data streams”.
Proceeding ICDT'05 Proceedings of the 10th international conference on Database Theory, 2005.
 A parallel version
 Massimo Cafaro, et al. “A parallel space saving algorithm for frequent items and the Hurwitz zeta
distribution”. Proceeding arXiv: 1401.0702v12 [cs.DS] 19 Setp 2015.
 Answer TopN queries directly from pre-calculation
 Works with Tableau 9.1
 Works with MS Excel
 Works with MS Power BI
ODBC Enhancement
Zeppelin Integration
Agenda
 What’s Apache Kylin?
 New Features in Kylin
 Plugin Architecture
 Fast Cubing
 Parallel Scan
 Streaming Cubing
 User Defined Aggregation
 Summary
 New in Apache Kylin
 Plugin-able architecture
 New MR Cube Engine with fast cubing (1.5x faster)
 New HBase Storage with parallel scan (2x faster)
 Near real-time analysis (experimental)
 User defined aggregations
 Excel / PowerBI / Zeppelin integration
Summary
Thanks!
http://kylin.apache.org

More Related Content

What's hot

Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
larsgeorge
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
DataWorks Summit
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 

What's hot (20)

Apache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and TimeApache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and Time
 
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
 
Introducing ios core ml
Introducing ios core mlIntroducing ios core ml
Introducing ios core ml
 
Ray and Its Growing Ecosystem
Ray and Its Growing EcosystemRay and Its Growing Ecosystem
Ray and Its Growing Ecosystem
 
How does Microsoft solve Big Data?
How does Microsoft solve Big Data?How does Microsoft solve Big Data?
How does Microsoft solve Big Data?
 
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache KafkaKafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
 
Spark tunning in Apache Kylin
Spark tunning in Apache KylinSpark tunning in Apache Kylin
Spark tunning in Apache Kylin
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
An Introduction to Apache Hadoop Yarn
An Introduction to Apache Hadoop YarnAn Introduction to Apache Hadoop Yarn
An Introduction to Apache Hadoop Yarn
 
Managed Feature Store for Machine Learning
Managed Feature Store for Machine LearningManaged Feature Store for Machine Learning
Managed Feature Store for Machine Learning
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 

Similar to The Evolution of Apache Kylin

Similar to The Evolution of Apache Kylin (20)

Apache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesApache Kylin 1.5 Updates
Apache Kylin 1.5 Updates
 
Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015
 
Apache kylin (china hadoop summit 2015 shanghai)
Apache kylin (china hadoop summit 2015 shanghai)Apache kylin (china hadoop summit 2015 shanghai)
Apache kylin (china hadoop summit 2015 shanghai)
 
Apache Kylin Extreme OLAP Engine for Big Data
Apache Kylin Extreme OLAP Engine for Big DataApache Kylin Extreme OLAP Engine for Big Data
Apache Kylin Extreme OLAP Engine for Big Data
 
Apache Kylin Streaming
Apache Kylin Streaming Apache Kylin Streaming
Apache Kylin Streaming
 
Apache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 DecApache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 Dec
 
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
 
AWS re:Invent 2016: Auto Scaling – the Fleet Management Solution for Planet E...
AWS re:Invent 2016: Auto Scaling – the Fleet Management Solution for Planet E...AWS re:Invent 2016: Auto Scaling – the Fleet Management Solution for Planet E...
AWS re:Invent 2016: Auto Scaling – the Fleet Management Solution for Planet E...
 
M7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal HausenblasM7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal Hausenblas
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBase
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
Kylin OLAP Engine Tour
Kylin OLAP Engine TourKylin OLAP Engine Tour
Kylin OLAP Engine Tour
 
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
 
Launching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSLaunching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWS
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify Story
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
 

More from DataWorks Summit/Hadoop Summit

How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 

The Evolution of Apache Kylin

  • 1. The Evolution of Apache Kylin Realtime & Plugin Architecture in Kylin Li, Yang | 李扬 Co-founder & CTO at Kyligence Inc.
  • 2. Agenda  What’s Apache Kylin?  New Features in Kylin  Plugin Architecture  Fast Cubing  Parallel Scan  Streaming Cubing  User Defined Aggregation  Summary
  • 3. Extreme OLAP Engine for Big Data Kylin is an open source Distributed Analytics Engine from eBay that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets What’s Kylin kylin / ˈkiːˈlɪn / 麒麟 --n. (in Chinese art) a mythical animal of composite form • Nov 2014 -- Apache Incubator Project • Nov 2015 – Apache Top Level Project
  • 4. Feature – SQL Interface Hive Table Build Cube (Index) SQL Query
  • 5.  eBay Feature – Big Data Case Cube Size Raw Records Session Analysis 20 TB 81+ billion rows Traffic Analysis 30 TB 28+ billion rows Transaction Analysis 560 GB 1.2+ billion rows
  • 6. 90% queries <5s Dark-blue line: 90%tile queries Light-blue line: 95%tile queries 90%ile query returns in 3 seconds Feature – Low Latency
  • 7. Feature – BI Integration via ODBC, JDBC
  • 8. Linear scale out with more nodes Feature – Scalable Throughput
  • 9. Agenda  What’s Apache Kylin?  New Features in Kylin  Plugin Architecture  Fast Cubing  Parallel Scan  Streaming Cubing  User Defined Aggregation  Summary
  • 10. Cube Builder (MapReduce…) SQL Low Latency - SecondsRouting 3rd Party App (Web App, Mobile…) Metadata SQL-Based Tool (BI Tools: Tableau…) Query Engine Hadoop Hive REST API JDBC/ODBC  Online Analysis Data Flow  Offline Data Flow  Clients/Users interactive with Kylin via SQL  OLAP Cube is transparent to users Star Schema Data Key Value Data Data Cube OLAP Cubes (HBase) SQL REST ServerDataSource Abstraction Engine Abstraction Storage Abstraction Plugin Architecture Overview
  • 11. MR Engine IN OUT Hive Source HBase Storage Cube Metadata SourceFactory StorageFactoryEngineFactory Plugin Architecture
  • 12. MR Engine Plugin Architecture Hive Adapter HBase Adapter load data save cubeHive Source HBase Storage adapt to IN adapt to OUT
  • 13.  Engine  MR V1  MR V2  Spark (early)  Streaming (experimental)  Source  Hive  Kafka  Spark SQL & DataFrames  Storage  HBase  ? Kudu  ? Cassandra Developing Modules
  • 14.  Freedom  Zoo break, not bound to Hadoop any more  Free to go to a better engine or storage  Extensibility  Accept any input, e.g. Kafka  Embrace next-gen distributed platform, e.g. Spark  Flexibility  Choose different engine for different data set The Freedom, Extensibility, Flexibility
  • 15. Full Data 0-D Cuboid 1-D Cuboid 2-D Cuboid 3-D Cuboid 4-D Cuboid MR MR MR MR MR A,B,C,D A,B,C A,B,D A,C,D B,C,D Layered Cubing (MR Engine V1)  Pros  Simple implementation, depends on MR shuffle to merge sort and then aggregate  Little requirement on memory  Cons  Aggregation happens at reducer side  Mapper outputs raw data thus shuffle is huge  Multiple rounds of MR overhead  Shuffle can be 100x of cube size, big I/O pressure
  • 16. mapper mapper mapper reducer Fast Cubing  Pros  In-mem cubing algorithm that can be reused by Streaming, Spark etc.  Mapper side aggregation  Lesser shuffling given the right data split  One round MR  Cons  Code complexity  High mapper CPU/Mem consumption Data Split Data Split Data Split …… Final Cube Merge Sort (Shuffle)
  • 17.  If data splits are unique  Fast cubing wins  If data splits are common  Layer cubing wins  New cube engine chooses the right algorithm based on data sampling.  Overall build time is 1.5x faster, sum results from 500 jobs. Fast Cubing (MR Engine V2)
  • 18.  Slow queries are 5-10x faster.  New Hbase storage enables partition on cuboids that are big enough.  Overall query time is 2x faster than before, sum results from 10,000+ queries. Parallel Scan Query Cuboid A Cuboid B Query A1 B1 A2 B2 A3 C Cuboid C Server 1 Server 2 Server 3 Server 1 Server 2 Server 3
  • 19. Near Realtime Incremental Build  Minutes micro cubes  Kafka source  In-mem cubing  Auto merge
  • 20. Cube StorageReal-time In-Mem Store streaming Kafka SQL Query minute batch Latest second Inverted Index Hybrid Storage Interface Cube Future Lambda Architecture for Realtime
  • 21. Use Case: SEO Operational Dashboard  eBay Site  ebay.com, ebay.co.uk, ebay.de  Buyer Country  US, CN, RU  Search Engine  Google, Bing, Yahoo!  Referrer  google.com, google.co.uk  Page  Search, View Item, Product  User Experience  Desktop, Mobile APP, mWeb • Visits, GMB $, GMB share, conversion rate, bounce rate, # of view items, # of bought items etc. Dimensions Measurements
  • 22.  HyperLogLog Count Distinct  TopN  BitMap Precise Count Distinct  from Sun, Yerui (netease.com)  Raw Records  from Wang, Xiaoyu (jd.com)  Domain specific aggregations now become easy  aggregate user events to detect time serials or access patterns  draw a sketch of certain user groups  pre-calculate clusters of data points  histogram… User Defined Aggregation Types
  • 23. DT,LOC TopN 2015-10-1,CN Item A, $500 Item B, $300 … TopN Support select dt, loc, item, sum(gmv) from test_kylin_fact where dt=‘2015-10-1’ and loc=‘CN’ group by dt, loc, item order by 4 desc limit 100 cube pre-calculation  TopN as a measure  Approximate algorithm  SpaceSaving TopN  Ahmed Metwally, et al. “Efficient computation of frequent and top-k elements in data streams”. Proceeding ICDT'05 Proceedings of the 10th international conference on Database Theory, 2005.  A parallel version  Massimo Cafaro, et al. “A parallel space saving algorithm for frequent items and the Hurwitz zeta distribution”. Proceeding arXiv: 1401.0702v12 [cs.DS] 19 Setp 2015.  Answer TopN queries directly from pre-calculation
  • 24.  Works with Tableau 9.1  Works with MS Excel  Works with MS Power BI ODBC Enhancement
  • 26. Agenda  What’s Apache Kylin?  New Features in Kylin  Plugin Architecture  Fast Cubing  Parallel Scan  Streaming Cubing  User Defined Aggregation  Summary
  • 27.  New in Apache Kylin  Plugin-able architecture  New MR Cube Engine with fast cubing (1.5x faster)  New HBase Storage with parallel scan (2x faster)  Near real-time analysis (experimental)  User defined aggregations  Excel / PowerBI / Zeppelin integration Summary

Editor's Notes

  1. Olap Big data Vs ubuntu kylin Ebay 第一个贡献到apache的开源项目,也是完整由中国团队贡献到Apache的第一个项目
  2. 介绍query 1台机器4个tomcat instanc可以达到300左右的QPS
  3. A High Level Architecture for Kylin which is a Standard MOLAP Architecture built on Hadoop. Data Sources to build your MOLAP Cubes primarily Hive, We have a fantastic project in the works for a Storage Abstraction Layer and support other NoSQL Stores such as Cassandra/CouchBase. An Engine Abstraction which maintains the Cube Metadata and a Cube Builder. Today a set of Map Reduce Jobs to build the cubes. A storage layer to store the Cubes in Hbase, primarily through a Bulk Load of the aggregrates into Hbase. We are looking for active community participation to build out additional Data Source, Engine and Storage plugins into Kylin. A Query Engine that directly index into the multi-dimensional arrays built into Hbase.