
Kylin and Druid Presentation



This presentation contains the following slides:
Introduction To OLAP
Data Warehousing Architecture
The OLAP Cube
Types Of OLAP
Benefits Of OLAP
Introduction - Apache Kylin
Kylin - Architecture
Kylin - Advantages and Limitations
Introduction - Druid
Druid - Architecture
Druid vs Apache Kylin


Published in: Technology


  1. 1. INTRODUCTION TO OLAP ● OLAP (online analytical processing) is computer processing that enables a user to easily and selectively extract and view data from different points of view. ● OLAP allows users to analyze database information from multiple database systems at one time. ● OLAP data is stored in multidimensional databases.
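  2. 2. DATA WAREHOUSING ARCHITECTURE [Diagram: external sources and operational databases pass through extract, transform, load, and refresh into the data warehouse, which is accompanied by a metadata repository and monitoring & administration. The warehouse serves OLAP servers, which support analysis, query/reporting, and data mining.]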
  3. 3. ● A multidimensional cube can combine data from disparate data sources and store the information in a fashion that is logical for business users.
  4. 4. THE OLAP CUBE ● An OLAP cube is a data structure that allows fast analysis of data. ● The arrangement of data into cubes overcomes a limitation of relational databases. ● The OLAP cube consists of numeric facts called measures, which are categorized by dimensions.
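The idea of measures categorized by dimensions can be illustrated with a minimal Python sketch. The fact rows, the dimension names (`region`, `product`), and the measure (`sales`) are hypothetical and do not come from the slides; a real OLAP engine stores and indexes cubes very differently.

```python
from collections import defaultdict

# Hypothetical fact rows: two dimensions (region, product) and one measure (sales).
facts = [
    {"region": "EU", "product": "A", "sales": 10},
    {"region": "EU", "product": "B", "sales": 5},
    {"region": "US", "product": "A", "sales": 7},
    {"region": "US", "product": "A", "sales": 3},
]


def build_cube(rows, dimensions, measure):
    """Sum the measure for every distinct combination of dimension values."""
    cube = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)  # one cell of the cube
        cube[key] += row[measure]
    return dict(cube)


cube = build_cube(facts, ["region", "product"], "sales")
print(cube[("US", "A")])  # 10: the two US/A rows are summed into one cell
```

Each key of `cube` is one cell addressed by its dimension values, and the stored number is the aggregated measure for that cell.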
  5. 5. OLAP CUBE
  6. 6. TWO TYPES OF DATABASE ACTIVITY ● OLTP (Online Transaction Processing) ● OLAP (Online Analytical Processing)
  7. 7. OLTP vs. OLAP
     ● On-Line Transaction Processing (OLTP): technology used to perform updates on operational or transactional systems (e.g., point-of-sale systems).
     ● On-Line Analytical Processing (OLAP): technology used to perform complex analysis of the data in a data warehouse.
     ● OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the dimensionality of the enterprise as understood by the user. [source: OLAP Council]
  8. 8. OLTP vs. OLAP
  9. 9. TYPES OF OLAP
     ● Relational OLAP (ROLAP):
       – Uses a relational or specialized relational DBMS to store and manage warehouse data.
       – OLAP middleware supplies the missing pieces: optimization for each DBMS backend, aggregation navigation logic, and additional tools and services.
       – An extended RDBMS maps operations on multidimensional data to standard relational operations.
       – Examples: MicroStrategy, MetaCube (Informix).
     ● Multidimensional OLAP (MOLAP):
       – Array-based storage structures.
       – Direct access to array data structures.
       – Operations are implemented directly on multidimensional data.
       – Example: Essbase (Arbor).
     ● Hybrid OLAP (HOLAP):
       – A hybrid approach in which the aggregated totals are stored in a multidimensional database while the detail data is stored in the relational database.
       – It balances the data efficiency of the ROLAP model against the performance of the MOLAP model.
  10. 10. ROLAP vs. MOLAP
      Characteristic | ROLAP                                                          | MOLAP
      Schema         | Uses star schema; additional dimensions can be added dynamically | Uses data cubes; additional dimensions require recreation of the data cube
      Database size  | Medium to large                                                | Small to medium
      Architecture   | Client/server                                                  | Client/server
      Access         | Supports ad-hoc requests                                       | Limited to pre-defined dimensions
  11. 11. ROLAP vs. MOLAP (contd.)
      Characteristic | ROLAP                                                              | MOLAP
      Resources      | High                                                               | Very high
      Flexibility    | High                                                               | Low
      Scalability    | High                                                               | Low
      Speed          | Good with small data sets; average for medium to large data sets   | Faster for small to medium data sets; average for large data sets
  12. 12. BENEFITS OF OLAP
      ● One main benefit of OLAP is consistency of information and calculations.
      ● "What if" scenarios are among the most popular uses of OLAP software, and multidimensional processing makes them far more practical.
      ● It allows a manager to pull data from an OLAP database in broad or specific terms.
      ● OLAP creates a single platform for all information and business needs: planning, budgeting, forecasting, reporting, and analysis.
  13. 13. (Contd.) ● Marketing and sales analysis ● Consumer goods industries ● Financial services industry (insurance, banks, etc.) ● Database marketing
  14. 14. Apache Kylin: What Is It? ● Open source ● Distributed analytics engine ● Provides a SQL interface ● Multi-dimensional analysis (OLAP) on Hadoop ● Faster and more responsive to users than relational online analytical processing (ROLAP)
  15. 15. The Fundamental Idea ● The idea behind Kylin is not brand new. ● The underlying techniques are to store pre-calculated results to serve analysis queries, to generate each level’s cuboids from all possible combinations of dimensions, and to calculate all metrics at each level.
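Generating cuboids from all possible combinations of dimensions can be sketched as follows. This is an illustration only, not Kylin's cube-building code; the dimension names are hypothetical. With n dimensions there are 2^n cuboids, which is why the number of dimensions matters so much for build cost.

```python
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions, level by level.

    The first entry is the base cuboid (all dimensions); the last is the
    apex cuboid (no dimensions, i.e. the grand total).
    """
    cuboids = []
    for level in range(len(dimensions), -1, -1):
        cuboids.extend(combinations(dimensions, level))
    return cuboids


dims = ["year", "region", "product"]  # hypothetical dimensions
cuboids = enumerate_cuboids(dims)
print(len(cuboids))  # 8, i.e. 2 ** 3
for cuboid in cuboids:
    print(cuboid)
```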
  16. 16. From Relational to Key-Value ● Avoids large table scans and the long delay before an answer is returned. ● It makes sense to calculate those values once and store them for later use. ● This process generates all of the dimension combinations and their measured values.
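The move from relational scans to key-value lookups can be sketched in Python under simplifying assumptions: a plain dictionary stands in for the key-value store, the rows and dimension names are hypothetical, and the only measure is a sum. A query then becomes a single lookup instead of a scan over the fact rows.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("year", "region")  # hypothetical dimensions

# Hypothetical fact rows: (year, region, sales).
rows = [
    ("2015", "EU", 10),
    ("2015", "US", 7),
    ("2016", "EU", 4),
]


def precompute(fact_rows):
    """Aggregate sales for every dimension combination and store it by key."""
    store = defaultdict(int)
    for *dim_values, sales in fact_rows:
        for level in range(len(DIMS) + 1):
            for idx in combinations(range(len(DIMS)), level):
                cuboid = tuple(DIMS[i] for i in idx)        # which dimensions
                values = tuple(dim_values[i] for i in idx)  # their values
                store[(cuboid, values)] += sales
    return dict(store)


store = precompute(rows)

# "Total sales in 2015" is now a lookup, not a table scan.
print(store[(("year",), ("2015",))])  # 17
print(store[((), ())])                # 21, the grand total
```

The key combines the cuboid (which dimensions are fixed) with the dimension values, so each pre-calculated result has exactly one address.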
  17. 17. Github Page
  18. 18. How It Works ● Read data from Hive (which is stored on HDFS) ● Run MapReduce jobs to pre-calculate the cube ● Store cube data in HBase ● Use ZooKeeper for job coordination
  19. 19. Apache Foundation Blog December 2015 ● Apache Kylin is the best OLAP engine on Big Data so far. ● While other OLAP engines struggle with the data volume, Kylin enables query responses in the milliseconds. ● Starting to leverage Kylin for near real time data streaming storage and analytics engine.
  20. 20. Advantages ● Kylin integrates well with BI tools such as Tableau and Excel. ● Kylin supports MOLAP cubes and performs very well on complex queries over billion-row data sets.
  21. 21. Limitations ● Real Time Support hasn’t yet been built. ● Kylin only supports the star schema. You are limited to a single fact table for each cube.
  22. 22. Druid: Key Features ● Open source. ● Distributed architecture. ● Real-time ingestion. ● Column-oriented for speed. ● Fast filtering. ● Operational simplicity. ● Support for OLAP queries.
  23. 23. Druid Architecture: Types of Nodes
      ● Historical nodes
        ➢ Backbone of a Druid cluster.
        ➢ Download segments and serve queries over them.
      ● Broker nodes
        ➢ Clients query the broker node to get data from Druid.
        ➢ Know the location of the segments.
        ➢ Scatter queries to the nodes that hold the relevant segments.
        ➢ Gather and merge the results.
      ● Coordinator nodes
        ➢ Manage segments on historical nodes.
        ➢ Load new segments, drop old segments, and move segments to balance load.
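The broker's scatter and gather role can be illustrated with a toy Python sketch. Everything here is hypothetical: the node names, the segment identifiers, and the in-memory dictionaries that stand in for historical nodes. A real broker talks to nodes over the network and uses cluster metadata to locate segments.

```python
from collections import Counter

# Hypothetical cluster state: node -> segment -> {dimension value: count}.
historical_nodes = {
    "historical-1": {"seg-2015-01": {"EU": 3, "US": 1}},
    "historical-2": {"seg-2015-02": {"EU": 2}, "seg-2015-03": {"US": 4}},
}


def query_historical(node, segment_ids):
    """A historical node answers only for the segments it has loaded."""
    partial = Counter()
    for segment_id in segment_ids:
        partial.update(historical_nodes[node][segment_id])
    return partial


def broker_query(segment_ids):
    """Scatter the query to the nodes holding the segments, then merge."""
    # Scatter: the broker knows which node serves which segment.
    plan = {}
    for node, segments in historical_nodes.items():
        wanted = [s for s in segment_ids if s in segments]
        if wanted:
            plan[node] = wanted
    # Gather and merge the partial results.
    merged = Counter()
    for node, wanted in plan.items():
        merged.update(query_historical(node, wanted))
    return dict(merged)


result = broker_query(["seg-2015-01", "seg-2015-03"])
print(result)  # {'EU': 3, 'US': 5}
```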
  24. 24. Ingestion Methods
      ● Streaming (real-time):
        – Used when the dataset originates in a streaming system such as Kafka.
        – Kafka lets you process streams of records as they occur.
        – The Kafka cluster stores streams of records in categories called topics.
        – Each record consists of a key, a value, and a timestamp.
      ● File-based (batch):
        – Load data in batches from HDFS, local files, etc.
  25. 25. Segments ● Druid stores its index in segment files, partitioned by time (timestamp). ● Data structure of a segment file: columnar, meaning the data for each column is laid out in separate data structures.
  26. 26. ● A segment consists of the timestamp column, dimension columns, and metric columns.
      ● The timestamp and metric columns are simple: each is an array of integer or floating-point values. Values in metric columns are pulled out to perform aggregations.
      ● Dimension columns are different because they support filter and group-by operations, and each requires:
        ➢ A dictionary that encodes column values: { "Justin Bieber": 0, "Ke$ha": 1 }
        ➢ The column data, encoded with the dictionary: [0, 0, 1, 1]
        ➢ Bitmaps, one for each unique value of the column:
          value="Justin Bieber": [1,1,0,0]
          value="Ke$ha": [0,0,1,1]
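The three structures behind a dimension column can be reproduced with a short Python sketch, using the same example values as the slide. The function names are hypothetical, and assigning dictionary IDs in sorted order is an assumption of this sketch; real segments use compressed bitmaps rather than Python lists.

```python
def encode_dimension(values):
    """Build the dictionary, the encoded column, and one bitmap per value."""
    # Assumption: IDs are assigned in sorted order of the distinct values.
    dictionary = {value: i for i, value in enumerate(sorted(set(values)))}
    encoded = [dictionary[value] for value in values]
    bitmaps = {
        value: [1 if code == value_id else 0 for code in encoded]
        for value, value_id in dictionary.items()
    }
    return dictionary, encoded, bitmaps


def matching_rows(bitmap):
    """Row numbers whose bit is set, i.e. the rows that pass the filter."""
    return [row for row, bit in enumerate(bitmap) if bit]


column = ["Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]
dictionary, encoded, bitmaps = encode_dimension(column)

print(dictionary)  # {'Justin Bieber': 0, 'Ke$ha': 1}
print(encoded)     # [0, 0, 1, 1]
print(bitmaps)     # {'Justin Bieber': [1, 1, 0, 0], 'Ke$ha': [0, 0, 1, 1]}
print(matching_rows(bitmaps["Ke$ha"]))  # [2, 3]
```

A filter on one value reads a single bitmap, and a filter on several values can combine their bitmaps bit by bit, which is what makes filtering fast without scanning the column itself.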
  27. 27. Druid vs Apache Kylin
      Characteristic         | Druid                               | Apache Kylin
      Query speed            | Very fast                           | Fast
      Type of analysis       | Real-time analysis                  | Focuses on OLAP cases; real-time analysis under development
      SQL support            | Absent                              | Present
      Fault tolerance        | All nodes need to be set up
      BI tools integration   | Under development                   | Present (Tableau or Excel)
      Integration with Kafka | Present                             | Absent
      Complex queries        | Bad for big data sets               | Good performance
      Storage type           | Bitmap index                        | OLAP cube
      Underlying technology  | Own computation and storage cluster | Hadoop for cube build, HBase for storage
  28. 28. Miscellaneous Points to Consider
      ● Druid has limitations on table joins.
      ● Apache Kylin supports the star schema.
      ● Modern corporations are increasingly looking for near-real-time analytics and insights to make actionable decisions.
      ● Druid is trying to support integration with BI tools using Apache Hive at Hortonworks.
      ● Previous versions of Druid were under the GPL v2 license. The latest version of Druid is under the Apache License v2, as is Apache Kylin.
      ● Druid has 181 contributors to its GitHub project, whereas Apache Kylin has 60.
  33. 33. THANK YOU!!