Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
If you want to do multi-dimensional analysis on large datasets (billion+ rows) with low query latency (sub-second), Kylin is a good option. Kylin also provides seamless integration with existing BI tools (e.g. Tableau).
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
1. Kylin: Hadoop OLAP Engine - Tech Deep Dive
Jiang Xu, Architect of Kylin
October 2014
2. Kylin Overview
Kylin - n. (in Chinese art) a mythical animal of composite form
http://kylin.io
Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
3. What Is Kylin?
• Extremely Fast OLAP Engine at Scale
Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data.
• ANSI-SQL Interface on Hadoop
Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions.
• Interactive Query Capability
Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries against the same dataset.
• MOLAP Cube
Users can define a data model and pre-build it in Kylin from more than 10 billion raw data records.
• Seamless Integration with BI Tools
Kylin currently offers integration with BI tools such as Tableau. Integration with MicroStrategy and Excel is coming soon.
4. What Is Kylin? - Other Highlights
• Job Management and Monitoring
• Compression and Encoding Support
• Incremental Refresh of Cubes
• Leverage HBase Coprocessor to reduce query latency
• Approximate Query Capability for Distinct Count (HyperLogLog)
• Easy Web interface to manage, build, monitor and query cubes
• Security capability to set ACLs at Cube/Project level
• Support for LDAP Integration
5. Glance of the SQL-on-Hadoop Ecosystem …
• SQL translated to MapReduce jobs
– Hive
– Stinger without Tez
• SQL processed by an MPP engine
– Impala
– Drill
– Presto
– Spark + Shark
• SQL processed by an existing SQL engine + HDFS
– EMC Greenplum (Postgres)
– Taobao Garuda (MySQL)
• OLAP on Hadoop in other companies
– Adobe: HBase Cube
– LinkedIn: Avatara
– Salesforce.com: Phoenix
6. Why Do We Build Kylin?
• Why do existing SQL-on-Hadoop solutions fall short?
Existing SQL-on-Hadoop engines need to scan part or all of the dataset to answer a user query, and table joins may trigger data transfer across hosts. Due to the large scan ranges and network traffic, many queries are very slow (minute+ latency).
• What is MOLAP/ROLAP?
– MOLAP (Multi-dimensional OLAP) pre-computes data along the dimensions of interest and stores the resulting values in a cube. MOLAP is much faster but inflexible. Kylin is closer to MOLAP.
– ROLAP (Relational OLAP) uses a star or snowflake schema to aggregate at runtime. ROLAP is flexible but much slower. All existing SQL-on-Hadoop engines are a kind of ROLAP.
• How does Kylin support ROLAP/MOLAP?
Kylin builds a data cube (MOLAP) from a Hive table (ROLAP) according to the metadata definition.
– If a query can be fulfilled by the data cube, Kylin routes it to the data cube (MOLAP).
– If a query can’t be fulfilled by the data cube, Kylin routes it to the Hive table (ROLAP).
– Basically, you can think of Kylin as HOLAP on top of MOLAP and ROLAP.
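The routing rule above can be sketched in a few lines: a query is answerable by a cube only if the dimensions and measures it needs are subsets of what was pre-computed. This is a simplified illustration with hypothetical names, not Kylin's actual API.

```python
# Simplified sketch of HOLAP query routing (hypothetical names, not Kylin's
# actual API): route to the MOLAP cube when the query only needs pre-computed
# dimensions and measures, otherwise fall back to the ROLAP source (Hive).

def route_query(query_dims, query_measures, cube_dims, cube_measures):
    """Return 'cube' (MOLAP) if the cube can answer the query, else 'hive' (ROLAP)."""
    if set(query_dims) <= set(cube_dims) and set(query_measures) <= set(cube_measures):
        return "cube"
    return "hive"

cube_dims = {"time", "item", "location", "supplier"}
cube_measures = {"sum(price)", "count(*)"}

print(route_query({"time", "item"}, {"sum(price)"}, cube_dims, cube_measures))      # cube
print(route_query({"time", "buyer_id"}, {"sum(price)"}, cube_dims, cube_measures))  # hive
```

The real routing also has to match joins and filters against the cube's model, but the subset check captures the core decision.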
8. Analytics Query Taxonomy
• High Level Aggregation – very high level, e.g. GMV by site, by vertical, by week
• Analysis Query – middle level, e.g. GMV by site, by vertical, by category (level x), past 12 weeks
• Drill Down to Detail – detail level (SSA table)
• Low Level Aggregation – e.g. by Seller ID
• Transaction Level – e.g. by Transaction ID
80+% of queries are analytics queries. Kylin is designed to accelerate analytics queries!
9. Query Performance -- Compared to Hive
(Bar chart: query latency in seconds, Kylin vs. Hive, for SQL #1-#3)

#  Query Type              Return Dataset  Query on Kylin (s)  Query on Hive (s)  Hive/Kylin
1  High Level Aggregation  4               0.129               157.437            1,217 times
2  Analysis Query          22,669          1.615               109.206            68 times
3  Drill Down to Detail    325,029         12.058              113.123            9 times
4  Drill Down to Detail    524,780         22.42               6383.21            278 times
5  Data Dump               972,002         49.054              N/A                N/A
13. Background – Cube & Cuboid
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)
For dimensions (time, item, location, supplier), the cuboid lattice is:
– 0-D (apex) cuboid: ()
– 1-D cuboids: (time), (item), (location), (supplier)
– 2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
– 3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
– 4-D (base) cuboid: (time, item, location, supplier)
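The lattice above is just the power set of the dimensions, which a few lines of Python can enumerate:

```python
# Enumerate all cuboids of a cube: every subset of the dimension set, from the
# 0-D apex cuboid up to the N-D base cuboid. An N-dimension cube has 2^N cuboids.
from itertools import combinations

def cuboids(dimensions):
    """Yield every combination of dimensions, smallest first."""
    for k in range(len(dimensions) + 1):
        yield from combinations(dimensions, k)

dims = ("time", "item", "location", "supplier")
all_cuboids = list(cuboids(dims))
print(len(all_cuboids))  # 16 == 2**4
```

The apex cuboid `()` is the grand total; the base cuboid carries the finest granularity from which every other cuboid can be derived.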
14. Metadata - Overview
Cube metadata describes how a source star schema maps to the target HBase storage:
• Cube: fact table, dimensions, measures, row key, HBase mapping
• Source (star schema): one fact table joined with its dimension tables
• Target (HBase storage): each cube row maps to an HBase row key, with values stored as columns in a column family
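One way to picture the row-key part of that mapping (an illustrative simplification, not Kylin's actual byte layout): the HBase row key concatenates a cuboid identifier with fixed-width, dictionary-encoded dimension IDs, so rows of the same cuboid sort together.

```python
import struct

# Illustrative sketch of a cube-to-HBase row-key mapping (simplified; not
# Kylin's actual byte layout): a cuboid id followed by fixed-width
# dictionary-encoded dimension IDs. Aggregated measures become the cell value.

def encode_row_key(cuboid_id, dim_ids):
    """Pack the cuboid id (8 bytes, big-endian) then each dimension id (4 bytes)."""
    return struct.pack(">Q", cuboid_id) + b"".join(struct.pack(">I", d) for d in dim_ids)

# Cuboid 0b1111 = the 4-D base cuboid (all four dimensions present)
key = encode_row_key(cuboid_id=0b1111, dim_ids=[3, 17, 2, 5])
print(len(key))  # 24 (8 + 4*4 bytes)
```

Because the cuboid id is the key prefix, a query that targets one cuboid turns into a contiguous HBase scan range.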
16. Metadata – Measure
• Measures
– Sum
– Count
– Max
– Min
– Average
– Distinct Count (based on HyperLogLog)
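Distinct count is the one approximate measure here, based on HyperLogLog. A minimal sketch of the idea (an illustration of the algorithm, not Kylin's implementation): hash each value, use the first p bits to pick a register, keep per register the maximum position of the first 1-bit in the remaining bits, and estimate cardinality from the harmonic mean of the registers.

```python
import hashlib
import math

# Minimal HyperLogLog sketch (illustrative, not Kylin's implementation).
# Memory is fixed at m = 2^p small registers, regardless of cardinality.
class HyperLogLog:
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias correction, m >= 128

    def add(self, value):
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of first 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:           # small-range correction
            est = self.m * math.log(self.m / zeros)
        return int(est)

hll = HyperLogLog(p=10)
for i in range(10000):
    hll.add(f"user-{i}")
est = hll.count()
print(est)  # close to 10000 (standard error ~1.04/sqrt(2^p), a few percent here)
```

The tradeoff is exactly what a pre-built cube needs: the sketch is tiny, mergeable across cuboid cells, and accurate to a few percent.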
17. Query Engine - Overview
• The query engine is based on Calcite
• Query Execution Plan
• Calcite Plug-ins in the Query Engine
18. Query Engine – Calcite
• Calcite (http://incubator.apache.org/projects/calcite.html), formerly known as Optiq, is an extensible open source SQL engine that is also used in Stinger, Drill, and Cascading.
19. Query Engine – Explain Plan

SELECT test_cal_dt.week_beg_dt, test_category.lv1_categ, test_category.lv2_categ, test_category.lv3_categ, test_kylin_fact.format_name, test_sites.site_name, sum(test_kylin_fact.price) AS total_price, count(*) AS total_count
FROM test_kylin_fact
LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
LEFT JOIN test_category_groupings ON test_kylin_fact.leaf_categ_id = test_category_groupings.leaf_categ_id AND test_kylin_fact.lstg_site_id = test_category_groupings.site_id
LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id
WHERE test_kylin_fact.seller_id = 123456 OR test_kylin_fact.lstg_format_name = 'New'
GROUP BY test_cal_dt.week_beg_dt, test_category.lv1_categ, test_category.lv2_categ, test_category.lv3_categ, test_kylin_fact.format_name, test_sites.site_name

OLAPToEnumerableConverter
  OLAPProjectRel(WEEK_BEG_DT=[$0], LV1_CATEG=[$1], LVL2_CATEG=[$2], LVL3_CATEG=[$3], FORMAT_NAME=[$4], SITE_NAME=[$5], TOTAL_PRICE=[CASE(=($7, 0), null, $6)], TOTAL_COUNT=[$8])
    OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
      OLAPProjectRel(WEEK_BEG_DT=[$13], LV1_CATEG=[$21], LVL2_CATEG=[$15], LVL3_CATEG=[$14], FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
        OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
          OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
            OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
              OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
              OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
            OLAPTableScan(table=[[DEFAULT, TEST_CATEGORY_GROUPINGS]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
          OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
20. Query Engine – Calcite Plug-ins
• Metadata SPI
– Provide table schemas from Kylin metadata
• Optimize Rule
– Translate logical operators into Kylin operators
• Relational Operator
– Find the right cube
– Translate SQL into storage engine API calls
– Generate the physical execution plan via the linq4j Java implementation
• Result Enumerator
– Translate storage engine results into Java results
• SQL Function
– Add HyperLogLog for distinct count
– Implement date/time related functions (e.g. QUARTER)
21. Storage Engine - Overview
• Provides cube queries for the query engine
– Common iterator interface over storage engines
– Isolates the query engine from the underlying storage
• HBase Storage
– HBase schema
– Pre-join & pre-aggregation
– Dictionary
– HBase coprocessor
23. Storage Engine – Pre-join & Pre-aggregation
• Pre-join (Hive)
– Kylin generates a pre-join HQL statement, based on the metadata, that joins the fact table with its dimension tables
– Kylin uses Hive to execute the pre-join HQL, producing a pre-joined flat table
• Pre-aggregation (MapReduce)
– Kylin generates a minimum spanning tree of cuboids from the cube lattice graph
– Kylin generates MapReduce jobs based on the minimum spanning tree
– The MapReduce jobs pre-aggregate to generate all cuboids, from N dimensions down to 1 dimension
• MOLAP Cube (HBase)
– The pre-join & pre-aggregation results (i.e. the MOLAP cube) are stored in HBase
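The tier-by-tier pre-aggregation can be sketched in miniature (an illustration of the idea, not Kylin's MapReduce code): each (N-1)-D cuboid is produced by dropping one dimension from an N-D parent and re-aggregating, so lower tiers never re-scan the raw data.

```python
from collections import defaultdict

# Miniature sketch of tiered cuboid pre-aggregation (illustrative, not Kylin's
# MapReduce code): an (N-1)-D child cuboid is built from an N-D parent by
# dropping one dimension and summing the measure.

def aggregate_child(parent_rows, parent_dims, drop_dim):
    """Drop one dimension from a parent cuboid and re-aggregate sum(measure)."""
    keep = [i for i, d in enumerate(parent_dims) if d != drop_dim]
    child = defaultdict(float)
    for key, measure in parent_rows.items():
        child[tuple(key[i] for i in keep)] += measure
    return dict(child)

# Base (2-D) cuboid on (time, item): key -> sum(price)
base = {("2014-10", "phone"): 100.0, ("2014-10", "book"): 30.0,
        ("2014-11", "phone"): 50.0}
by_time = aggregate_child(base, ("time", "item"), drop_dim="item")
print(by_time)  # {('2014-10',): 130.0, ('2014-11',): 50.0}
```

Measures must be re-aggregable for this to work, which is why the cube stores sum/count (average is derived) and HyperLogLog sketches rather than exact distinct counts.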
24. Storage Engine – Dictionary
• The dictionary maps dimension values to IDs, which reduces the memory and storage footprint
• Metadata can define whether a dimension uses a dictionary or not
• The dictionary is based on a trie
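A flat sketch of dictionary encoding shows the payoff (Kylin's real dictionary is a trie; this illustrates only the value-to-ID mapping): repeated strings in row keys become small fixed-width integers.

```python
# Sketch of dictionary encoding for a dimension (Kylin's actual dictionary is
# a trie; this flat version only illustrates the value <-> ID mapping and the
# space saving of storing small IDs instead of repeated strings).

def build_dictionary(values):
    """Assign each distinct dimension value a small integer ID, in sorted order."""
    distinct = sorted(set(values))
    encode = {v: i for i, v in enumerate(distinct)}
    decode = distinct  # decode[i] recovers the original value
    return encode, decode

rows = ["US", "UK", "US", "DE", "US", "UK"]
encode, decode = build_dictionary(rows)
encoded = [encode[v] for v in rows]
print(encoded)                               # [2, 1, 2, 0, 2, 1]
print([decode[i] for i in encoded] == rows)  # True
```

Sorted IDs additionally preserve value order, so range filters can be evaluated directly on the encoded IDs.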
25. Storage Engine – HBase Coprocessor
• The HBase coprocessor reduces network traffic and parallelizes scan logic
• HBase client
– Serializes the filter + dimensions + metrics into bytes
– Sends the encoded bytes to the coprocessor
• HBase coprocessor
– Deserializes the bytes into filter + dimensions + metrics
– Iterates over all rows in each scan range
• Filters unmatched rows
• Aggregates the matched rows by dimension in a cache
• Sends back the aggregated rows from the cache
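The coprocessor's scan loop can be sketched as follows (a simplified stand-in with hypothetical names, not the actual HBase coprocessor API): filter each row server-side, group the survivors by the requested dimensions, and ship back only the aggregated cache instead of every raw row.

```python
from collections import defaultdict

# Simplified stand-in for the endpoint-coprocessor scan (hypothetical names,
# not the HBase coprocessor API): filter rows where they live, aggregate
# matches by dimension in a cache, return only the aggregates over the network.

def coprocessor_scan(rows, predicate, group_dims, measure):
    cache = defaultdict(float)
    for row in rows:
        if predicate(row):                      # filter unmatched rows
            key = tuple(row[d] for d in group_dims)
            cache[key] += row[measure]          # aggregate in the cache
    return dict(cache)                          # only aggregates cross the wire

rows = [
    {"site": "US", "format": "New", "price": 10.0},
    {"site": "US", "format": "Used", "price": 4.0},
    {"site": "UK", "format": "New", "price": 7.0},
]
result = coprocessor_scan(rows, lambda r: r["format"] == "New", ("site",), "price")
print(result)  # {('US',): 10.0, ('UK',): 7.0}
```

The saving is proportional to the ratio of scanned rows to result groups, which for wide scans over a cuboid can be several orders of magnitude.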
29. Cube Build – Steps
• Build dictionaries from the dimension tables on local disk, and copy the dictionaries to HDFS
• Run a Hive query to build the joined flat table
• Run MapReduce jobs to build cuboids as HDFS sequence files, from tier 1 to tier N
• Calculate the key distribution of the HDFS sequence files, and evenly split the key space into K regions
• Translate the HDFS sequence files into HBase HFiles
• Bulk load the HFiles into HBase
30. Cube Optimization - Overview
• “Curse of dimensionality”: an N-dimension cube has 2^N cuboids
– Full Cube vs. Partial Cube
• Huge data volume
– Incremental Build
• Slow Table Scan – TopN Query on a High Cardinality Dimension
– Bitmap inverted index
– Time range partition
– In-memory parallel scan: block cache + endpoint coprocessor
31. Cube Optimization – Full Cube vs. Partial Cube
• Full Cube
– Pre-aggregates all dimension combinations
– “Curse of dimensionality”: an N-dimension cube has 2^N cuboids
• Partial Cube
– To avoid dimension explosion, we divide the dimensions into different aggregation groups
– For a cube with 30 dimensions, dividing the dimensions into 3 groups reduces the cuboid count from about 1 billion to about 3 thousand
– Tradeoff between online aggregation and offline pre-aggregation
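The aggregation-group arithmetic above checks out directly (simplified: three independent groups of 10 dimensions each; real aggregation groups also involve mandatory and hierarchy dimensions):

```python
# Full cube over 30 dimensions vs. three aggregation groups of 10 dimensions
# each (simplified model: each group is cubed independently).

full_cube = 2 ** 30           # every combination of 30 dimensions
partial_cube = 3 * (2 ** 10)  # 3 groups, each a full cube over its 10 dims
print(full_cube)              # 1073741824  (~1 billion cuboids)
print(partial_cube)           # 3072        (~3 thousand cuboids)
```

Queries that cross group boundaries fall back to runtime aggregation from an existing cuboid, which is the online/offline tradeoff the slide refers to.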
34. Cube Optimization – TopN Query on High Cardinality Dimension
• Bitmap inverted index
• Separate high cardinality dimensions from low cardinality dimensions
• Time range partition
• In-memory (block cache)
• Parallel scan (endpoint coprocessor)
35. Kylin Resources
• Web Site: http://kylin.io
• Google Groups: https://groups.google.com/forum/#!forum/kylin-olap
• Source Code: https://github.com/KylinOLAP/Kylin