
Apache Kylin Extreme OLAP Engine for Big Data



Apache Kylin Slides for Strata+Hadoop World 2015 London.

Published in: Technology

  1. 1. Apache Kylin Extreme OLAP Engine for Big Data Luke Han, Yang Li 2015-05-06 http://kylin.io | @ApacheKylin
  2. 2. About us  Luke Han | lukehan@apache.org | @lukehq  Apache Kylin PMC member & Product Owner  Sr. Product Manager of eBay GDI  from Shanghai China  Yang Li | yangli@apache.org  Apache Kylin PMC member & Tech Leader  Sr. Architect of eBay GDI  from Shanghai China
  3. 3. Agenda  About Apache Kylin  Feature Highlights  Tech Highlights  Roadmap  Q&A
  4. 4. What kylin / ˈkiːˈlɪn / 麒麟 --n. (in Chinese art) a mythical animal of composite form. Extreme OLAP Engine for Big Data: Kylin is an open source distributed analytics engine, contributed by eBay Inc., that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets. • Open sourced on Oct 1st, 2014 • Accepted as an Apache Incubator project on Nov 25th, 2014 • http://kylin.io (http://kylin.incubator.apache.org) @ApacheKylin
  5. 5. Why Happiness Latency 10s
  6. 6. Balance Between Space and Time: OLAP Cube
      Cuboid lattice for the dimensions time, item, location, supplier:
        0-D (apex) cuboid: ()
        1-D cuboids: time | item | location | supplier
        2-D cuboids: time, item | time, location | time, supplier | item, location | item, supplier | location, supplier
        3-D cuboids: time, item, location | time, item, supplier | time, location, supplier | item, location, supplier
        4-D (base) cuboid: time, item, location, supplier
      • Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
        1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
        2. (9/15, milk, Urbana, *) - <time, item, location>
        3. (*, milk, Urbana, *) - <item, location>
        4. (*, milk, Chicago, *) - <item, location>
        5. (*, milk, *, *) - <item>
      • Cuboid = one combination of dimensions
      • Cube = all combinations of dimensions (all cuboids)
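The lattice on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `enumerate_cuboids` is hypothetical and is not part of Kylin; it simply lists every subset of the four dimensions, grouped by size.

```python
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Group every subset of `dimensions` by size: 0-D apex up to the base cuboid."""
    return {
        size: [tuple(c) for c in combinations(dimensions, size)]
        for size in range(len(dimensions) + 1)
    }


dims = ["time", "item", "location", "supplier"]
lattice = enumerate_cuboids(dims)

for size, cuboids in lattice.items():
    print(f"{size}-D: {len(cuboids)} cuboid(s): {cuboids}")

# A cube over N dimensions contains 2**N cuboids in total.
total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

With four dimensions the layers hold 1, 4, 6, 4, and 1 cuboids, which is where the 2^N growth discussed later in the deck comes from.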
  7. 7. How Map Reduce Kylin BI Tools, Web App… ANSI SQL
  8. 8. Agenda  About Apache Kylin  Feature Highlights  Tech Highlights  Roadmap  Q&A
  9. 9. Feature Highlights • Extremely fast OLAP engine at scale • ANSI SQL interface on Hadoop • Seamless integration with BI tools, like Tableau • Interactive query capability • MOLAP cube • Incremental build of cubes • Approximate query capability for distinct count (HyperLogLog) • Leverages HBase coprocessors to reduce query latency • Job management and monitoring • User-friendly web GUI to manage, build, monitor, and query cubes • Security capability to set ACLs at cube/project level • Supports LDAP integration
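The slide names HyperLogLog but does not show it. The sketch below is a textbook HyperLogLog in Python, not Kylin's implementation: the class name, the register count (2^12), the SHA-1 hash, and the `user-…` values are all assumptions made for illustration. It shows why the technique suits pre-aggregation: two sketches merge by taking the register-wise maximum, so partial results can be combined without revisiting the raw rows.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant; this closed form is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # 64-bit hash: the first p bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Union of two sketches: register-wise maximum."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        estimate = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for i in range(200_000):
    hll.add(f"user-{i % 50_000}")  # 50,000 distinct values, each seen 4 times
estimate = hll.count()
print(f"estimated distinct count: {estimate:.0f}")
```

With 4,096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%.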
  10. 10. Define Data Model
  11. 11. Manage Jobs
  12. 12. Explore the Data
  13. 13. Interact with BI Tools - Tableau
  14. 14. Who is using Kylin?
      • eBay - 90% of queries < 5 seconds
      • Baidu - Baidu Map internal analysis
      • Many other proofs of concept - Huawei, Bloomberg Law, British GAS, JD.com, Microsoft, StubHub, Tableau …

      Case                    Cube Size   Raw Records
      User Session Analysis   26 TB       28+ billion rows
      Traffic Analysis        21 TB       20+ billion rows
      Behavior Analysis       560 GB      1.2+ billion rows
      (from mailing list)
  15. 15. Agenda  About Apache Kylin  Feature Highlights  Tech Highlights  Roadmap  Q&A
  16. 16. Kylin Architecture Overview
      [Architecture diagram; recoverable labels:]
      • Clients: 3rd-party apps (web app, mobile…) via REST API; SQL-based tools (BI tools: Tableau…) via JDBC/ODBC
      • Kylin: REST Server, Query Engine, Routing, Metadata, Cube Build Engine (MapReduce…)
      • Storage: Hadoop / Hive (star schema data); OLAP cube in HBase (key-value data)
      • Online analysis data flow: SQL with low latency (seconds) or mid latency (minutes)
      • Offline data flow: star schema data is built into the data cube
      • Clients/users interact with Kylin via SQL; the OLAP cube is transparent to users
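  17. 17. Data Modeling
      [Diagram; recoverable labels:]
      • Source star schema: one fact table with dimension tables
      • Cube metadata: Cube: … Fact Table: … Dimensions: … Measures: … Storage (HBase): …
      • Target HBase storage: rows (row A, row B, row C) with a row key and a column family holding column values (Val 1, Val 2, Val 3)
      • Mapping from the source star schema to the target HBase storage
      • Roles: End User, Cube Modeler, Admin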
  18. 18. Cube Build Job Flow
  19. 19. How to Store Cube - HBase Schema
  20. 20. Kylin Query Engine - Explain Plan

      SELECT test_cal_dt.week_beg_dt,
             test_category.category_name,
             test_category.lvl2_name,
             test_category.lvl3_name,
             test_kylin_fact.lstg_format_name,
             test_sites.site_name,
             SUM(test_kylin_fact.price) AS GMV,
             COUNT(*) AS TRANS_CNT
      FROM test_kylin_fact
      LEFT JOIN test_cal_dt
        ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
      LEFT JOIN test_category
        ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id
       AND test_kylin_fact.lstg_site_id = test_category.site_id
      LEFT JOIN test_sites
        ON test_kylin_fact.lstg_site_id = test_sites.site_id
      WHERE test_kylin_fact.seller_id = 123456
         OR test_kylin_fact.lstg_format_name = 'New'
      GROUP BY test_cal_dt.week_beg_dt,
               test_category.category_name,
               test_category.lvl2_name,
               test_category.lvl3_name,
               test_kylin_fact.lstg_format_name,
               test_sites.site_name

      OLAPToEnumerableConverter
        OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
          OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
            OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
              OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
                OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
                  OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
                    OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                      OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                      OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
                    OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
                  OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
  21. 21. Cube Optimization  Curse of Dimensionality  An N-dimension cube has 2^N cuboids  Full Cube vs. Partial Cube  Huge Data Volume  Dictionary Encoding  Incremental Building
  22. 22. Full Cube vs. Partial Cube
      • Full Cube
        - Pre-aggregate all dimension combinations
        - "Curse of dimensionality": an N-dimension cube has 2^N cuboids
      • Partial Cube
        - To avoid dimension explosion, divide the dimensions into different aggregation groups
        - 2^(N+M+L) -> 2^N + 2^M + 2^L
        - For a cube with 30 dimensions divided into 3 groups, the number of cuboids drops from about 1 billion to about 3 thousand
        - 2^30 -> 2^10 + 2^10 + 2^10
        - Tradeoff between online aggregation and offline pre-aggregation
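The arithmetic on this slide is easy to check. The helper names below are hypothetical and exist only for this example; the counting rule (one full sub-cube per aggregation group, summed) is the simple one the slide states.

```python
def full_cube_cuboids(n_dims):
    """A full cube pre-aggregates every subset of its dimensions."""
    return 2 ** n_dims


def partial_cube_cuboids(group_sizes):
    """With aggregation groups, each group is cubed on its own and the counts add up."""
    return sum(2 ** size for size in group_sizes)


full = full_cube_cuboids(30)
partial = partial_cube_cuboids([10, 10, 10])

print(f"full cube:    {full:,} cuboids")
print(f"partial cube: {partial:,} cuboids")
print(f"reduction:    {full // partial:,}x")
```

The price of the reduction is the tradeoff the slide names: a query that mixes dimensions from different groups can no longer be answered from a single pre-aggregated cuboid and needs more aggregation at query time.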
  23. 23. Partial Cube
  24. 24. Incremental Build
  25. 25. What’s Next  Improve cube algorithm  Cube by segments, 30%-50% faster  Build delay down to tens of minutes  Streaming cubing  Analyze real-time data  Build delay down to seconds  Spark
  26. 26. Cube by Layer
      • The current algorithm
        - Many MR rounds, growing with the number of dimensions
        - Huge shuffles, aggregation at reduce side, 100x of total cube size
      [Diagram: Full Data and the 0-D, 1-D, 2-D, 3-D, and 4-D cuboids, each layer produced by its own MR job]
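A layer-by-layer build can be sketched in plain Python, with each call to `aggregate` standing in for one MR round. This is a simplified illustration, not Kylin's code: it assumes the build starts from the base cuboid and derives each smaller layer from the layer above it, and the sample rows and measure values are made up for the example.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("time", "item", "location", "supplier")


def aggregate(records, keep):
    """One simulated MR round: map each key to the kept positions, sum at the 'reducer'."""
    out = defaultdict(int)
    for key, measure in records:
        out[tuple(key[i] for i in keep)] += measure
    return dict(out)


def build_by_layer(rows):
    """Build all cuboids, one layer per round, each from a parent in the layer above."""
    n = len(DIMS)
    base = tuple(range(n))
    cube = {base: aggregate(rows, base)}  # round 1: full data -> base cuboid
    for size in range(n - 1, -1, -1):     # then the (n-1)-D layer down to the 0-D apex
        for cuboid in combinations(range(n), size):
            # Any cuboid one layer up that contains all of this cuboid's dimensions.
            parent = next(
                p for p in cube if len(p) == size + 1 and set(cuboid) <= set(p)
            )
            keep = tuple(parent.index(d) for d in cuboid)
            cube[cuboid] = aggregate(cube[parent].items(), keep)
    return cube


# Hypothetical fact rows: (time, item, location, supplier) -> measure.
rows = [
    (("9/15", "milk", "Urbana", "Dairy_land"), 3),
    (("9/15", "milk", "Chicago", "Dairy_land"), 2),
    (("9/16", "bread", "Urbana", "Bakers"), 5),
]
cube = build_by_layer(rows)
print("cuboids built:", len(cube))
print("<item, location>:", cube[(1, 2)])
print("apex:", cube[()])
```

Every layer re-reads and re-shuffles the output of the previous one, which is the cost the slide attributes to this algorithm.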
  27. 27. Cube by Segments
      • The to-be algorithm, 30%-50% faster
        - 1 round of MR
        - Reduced shuffles, map-side aggregation, 20x total cube size
        - Hourly incremental build done in tens of minutes
      [Diagram: each data split goes to a mapper that builds a cube segment; the segments are merge-sorted (shuffle) into the final cube]
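The single-round idea can be sketched as follows. This is a deliberately naive illustration under stated assumptions: each simulated mapper aggregates every cuboid for its own data split in memory by brute force (a real engine would derive cuboids from one another), and the function names and sample rows are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_segment(split, n_dims):
    """Simulated mapper: aggregate every cuboid for one data split in memory."""
    segment = Counter()
    for key, measure in split:
        for size in range(n_dims + 1):
            for cuboid in combinations(range(n_dims), size):
                cell = tuple(key[i] for i in cuboid)
                segment[(cuboid, cell)] += measure
    return segment


def merge_segments(segments):
    """Simulated shuffle and reduce: sum matching cells across all segments."""
    final = Counter()
    for segment in segments:
        final.update(segment)  # Counter.update adds counts for matching keys
    return final


# Hypothetical fact rows: (time, item, location, supplier) -> measure.
rows = [
    (("9/15", "milk", "Urbana", "Dairy_land"), 3),
    (("9/15", "milk", "Chicago", "Dairy_land"), 2),
    (("9/16", "bread", "Urbana", "Bakers"), 5),
]
splits = [rows[:2], rows[2:]]  # two data splits, one per mapper
final_cube = merge_segments(build_segment(split, 4) for split in splits)

print("apex total:", final_cube[((), ())])
print("<item> = milk:", final_cube[((1,), ("milk",))])
```

Because each mapper emits already-aggregated cells, only one shuffle is needed and it carries far fewer records than the per-layer shuffles of the by-layer approach.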
  28. 28. Streaming Cubing
      • Cube is great but…
        - A cube takes time to build; how about real-time analysis?
        - Sometimes we want to drill down to row-level information
      • Streaming cubing
        - Build micro cube segments from streaming data
        - Use an inverted index to capture last-minute data
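To make the inverted-index idea concrete, here is a minimal in-memory sketch. It is not Kylin's streaming storage: the class, its methods, and the sample rows are assumptions for illustration. Each (column, value) pair maps to the ids of the rows that contain it, so recent rows stay queryable at row level without waiting for a cube build.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps (column, value) to the set of row ids that contain it."""

    def __init__(self):
        self.rows = []
        self.postings = defaultdict(set)

    def add(self, row):
        """Append a row and index each of its column values."""
        row_id = len(self.rows)
        self.rows.append(row)
        for column, value in row.items():
            self.postings[(column, value)].add(row_id)

    def lookup(self, **filters):
        """Return rows matching all equality filters, in arrival order."""
        ids = None
        for column, value in filters.items():
            matches = self.postings.get((column, value), set())
            ids = matches if ids is None else ids & matches
        if ids is None:  # no filters: return everything
            ids = range(len(self.rows))
        return [self.rows[i] for i in sorted(ids)]


index = InvertedIndex()
index.add({"item": "milk", "location": "Urbana", "price": 3})
index.add({"item": "milk", "location": "Chicago", "price": 2})
index.add({"item": "bread", "location": "Urbana", "price": 5})

milk_rows = index.lookup(item="milk")
urbana_total = sum(row["price"] for row in index.lookup(location="Urbana"))
print("milk rows:", milk_rows)
print("Urbana total:", urbana_total)
```

Unlike a cube, the index keeps the original rows, which is what allows the row-level drill-down mentioned on the slide.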
  29. 29. Kylin Lambda Architecture
      [Diagram: streaming data from the last hour is served from the inverted index; data from before the last hour is served from the cube; both sit behind a hybrid storage interface]
  30. 30. Adding Spark Support
      • Cubing efficiency
        - MR is not the optimal framework
        - Spark cubing engine
      • Source from SparkSQL
        - Read data from SparkSQL instead of Hive
      • Route to SparkSQL
        - Unsupported queries can be covered by SparkSQL
  31. 31. Agenda  About Apache Kylin  Feature Highlights  Tech Highlights  Roadmap  Q&A
  32. 32. Kylin Evolution Roadmap
      Timeline: 2013, 2014, 2015, 2016, Future (milestones: Sep 2013; Jan 2014; Oct 2014; H1 2015; TBD)
      • Initial: Prototype for MOLAP
        - Basic end-to-end POC
      • MOLAP
        - Incremental Refresh
        - ANSI SQL
        - ODBC Driver
        - Web GUI
        - Tableau
        - ACL
        - Open Source
      • StreamingOLAP
        - Streaming OLAP
        - JDBC Driver
        - New UI
        - Excel
        - SparkSQL
        - … more
      • HybridOLAP
        - Lambda Arch
        - Automation
        - Capacity Management
        - Spark
        - … more
      • Next Gen
        - Adv OLAP Functions
        - In-Memory Analysis (TBD)
        - Mobile (TBD)
        - … more
  33. 33. Kylin Ecosystem
      • Kylin Core
        - Fundamental framework of the Kylin OLAP engine
      • Extension
        - Plugins to support additional functions and features
        - Security, Redis Storage, Spark Engine, Docker
      • Integration
        - Lifecycle management support to integrate with other applications
        - ODBC Driver, ETL, Drill, SparkSQL
      • Interface
        - Allows third-party users to build more features via user interfaces atop the Kylin core
        - Web Console, Customized BI, Ambari/Hue Plugin
  34. 34. http://kylin.io If you want to go fast, go alone. If you want to go far, go together. --African Proverb dev@kylin.incubator.apache.org
