Apache Kylin 2.0
From classic OLAP to real-time data warehouse
李扬 | Li, Yang
Co-founder & CTO

Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.

  1. Apache Kylin 2.0: From classic OLAP to real-time data warehouse. 李扬 | Li, Yang, Co-founder & CTO
  2. What is Apache Kylin. All rights reserved ©Kyligence Inc. http://kyligence.io [Architecture diagram: BI / Visualization (Interactive Reporting, Dashboard) on top of Apache Kylin, the OLAP Engine, which sits on Hive, HBase, and HDFS in Hadoop] - 3,000 billion rows, < 1 sec query latency @toutiao.com, top 1 news feed app in China - 60+ dimensions cube @CPIC, top 3 insurance group in China - JDBC / ODBC / RestAPI - BI integration
  3. Why Kylin is Fast. A sample query: report revenue by “returnflag” and “orderstatus” over a time period. select l_returnflag, o_orderstatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price … from v_lineitem inner join v_orders on l_orderkey = o_orderkey where l_shipdate <= '1998-09-16' group by l_returnflag, o_orderstatus order by l_returnflag, o_orderstatus; [Execution plan: Tables -> Join -> Filter -> Aggr. -> Sort, O(N)]
  4. Why Kylin is Fast. Precalculate the Kylin Cube. [Execution plan comparison: Tables -> Join -> Filter -> Aggr. -> Sort at O(N), versus Cuboid -> Filter -> Sort at O(flag x status x days) = O(1)]
  5. Kylin is about Precalculation. [Cuboid lattice: 0-D (apex) cuboid; 1-D cuboids: time, item, location, supplier; 2-D cuboids: time+item, time+location, time+supplier, item+location, item+supplier, location+supplier; 3-D cuboids: time+item+location, time+item+supplier, time+location+supplier, item+location+supplier; 4-D (base) cuboid: time+item+location+supplier] - Based on the Cube Theory - Model and Cube define the space of precalculation - Build Engine carries out the precalculation - Query Engine runs SQL on the precalculated result
  6. O(1) regardless of Data Size. [Chart: Response Time vs. Data Size; Online Calculation grows as O(N), Apache Kylin stays O(1)]
  7. Snowflake Schema Support. Runs TPC-H benchmark.
  8. Kylin 1.0 Star Schema Limitation. Precalculate the join of 1 level of lookups. - Supports star schema only - Doesn't allow same-name columns from different tables - Doesn't allow a table joining itself - Difficult to support real-world business cases [Diagram: LINEORDER joined with DATES, PART, CUSTOMER, SUPPLIER]
  9. Kylin 2.0 Snowflake Schema. Precalculate unlimited levels of lookups. - Snowflake schema support (KYLIN-1875) - Allows a table to be joined multiple times - Big metadata change at the Model level - Many bug fixes regarding joins and sub-queries - Supports complex models of any kind and flexible queries on those models [Snowflake diagram: ORDERS, CUSTOMER, SUPPLIER, PART, LINEITEM, PARTSUPP, NATION, REGION connected by joins]
  10. TPC-H on Kylin 2.0. TPC-H is a benchmark for decision support systems. - Popular among commercial RDBMS & DW solutions - Queries and data have broad industry-wide relevance - Examines large volumes of data - Executes queries with a high degree of complexity - Gives answers to critical business questions. Kylin 2.0 runs all 22 TPC-H queries (KYLIN-2467). - Precalculation can answer very complex queries - Goal is functionality at this stage - Try it: https://github.com/Kyligence/kylin-tpch [Schema diagram: ORDERS, CUSTOMER, SUPPLIER, PART, LINEITEM, PARTSUPP, NATION, REGION]
  11. Complex Query 1. TPC-H query 07 - 0.17 sec (Hive+Tez 35.23 sec) - 2 sub-queries. select supp_nation, cust_nation, l_year, sum(volume) as revenue from ( select n1.n_name as supp_nation, n2.n_name as cust_nation, l_shipyear as l_year, l_saleprice as volume from v_lineitem inner join supplier on s_suppkey = l_suppkey inner join v_orders on l_orderkey = o_orderkey inner join customer on o_custkey = c_custkey inner join nation n1 on s_nationkey = n1.n_nationkey inner join nation n2 on c_nationkey = n2.n_nationkey where ( (n1.n_name = 'KENYA' and n2.n_name = 'PERU') or (n1.n_name = 'PERU' and n2.n_name = 'KENYA') ) and l_shipdate between '1995-01-01' and '1996-12-31' ) as shipping group by supp_nation, cust_nation, l_year order by supp_nation, cust_nation, l_year [Execution plan: the joins over LINEITEM, SUPPLIER, ORDERS, CUSTOMER, and NATION x2 are precalculated; Proj. / Filter / Aggr. / Sort remain online]
  12. Complex Query 2. TPC-H query 11 - 3.42 sec (Hive+Tez 15.87 sec) - 4 sub-queries, 1 online join. with q11_part_tmp_cached as ( select ps_partkey, sum(ps_partvalue) as part_value from v_partsupp inner join supplier on ps_suppkey = s_suppkey inner join nation on s_nationkey = n_nationkey where n_name = 'GERMANY' group by ps_partkey ), q11_sum_tmp_cached as ( select sum(part_value) as total_value from q11_part_tmp_cached ) select ps_partkey, part_value from ( select ps_partkey, part_value, total_value from q11_part_tmp_cached, q11_sum_tmp_cached ) a where part_value > total_value * 0.0001 order by part_value desc; [Execution plan: two branches, each reading a precalculated PARTSUPP / SUPPLIER / NATION cuboid, joined online, then Filter / Sort]
  13. Complex Query 3. TPC-H query 12 - 7.66 sec (Hive+Tez 12.64 sec) - 5 sub-queries, 2 online joins. with in_scope_data as( select l_shipmode, o_orderpriority from v_lineitem inner join v_orders on l_orderkey = o_orderkey where l_shipmode in ('REG AIR', 'MAIL') and l_receiptdelayed = 1 and l_shipdelayed = 0 and l_receiptdate >= '1995-01-01' and l_receiptdate < '1996-01-01' ), all_l_shipmode as( select distinct l_shipmode from in_scope_data ), high_line as( select l_shipmode, count(*) as high_line_count from in_scope_data where o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' group by l_shipmode ), low_line as( select l_shipmode, count(*) as low_line_count from in_scope_data where o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' group by l_shipmode ) select al.l_shipmode, hl.high_line_count, ll.low_line_count from all_l_shipmode al left join high_line hl on al.l_shipmode = hl.l_shipmode left join low_line ll on al.l_shipmode = ll.l_shipmode order by al.l_shipmode [Execution plan: three branches, each reading a precalculated LINEITEM / ORDERS cuboid, joined online twice, then Sort]
  14. More than MOLAP. - Supports complex data models and sub-queries; runs TPC-H - Percentile / Window / Time functions [Chart: SQL maturity vs. speed, positioning Kylin 1.0, Kylin 2.0, DW on Hadoop, MOLAP, and Analytics DW]
  15. Spark Cubing. Halves the build time.
  16. A Bit of History. Kylin 1.5 attempted Spark cubing, but the feature was never released. - It was a port of the MR in-mem cubing algorithm - Exploits memory and builds the whole cube in one round - No improvement observed - Spark did nothing differently from MR
  17. Spark Cubing in 2.0. Kylin 2.0 did a complete rework based on the Layered Cubing algorithm. - Each layer of cuboids is an RDD (RDD-1 through RDD-5) - The parent RDD is cached for the next round - Each RDD exports to a sequence file, the same output format as MR - Translate “map” to “flatMap” and “reduce” to “reduceByKey”; most code gets reused
  18. DAG Calculating the 3rd Layer.
  19. Spark Cubing vs. MR Layered Cubing. Halves the build time; the advantage decreases as data size increases. - 4-node cluster - Spark 1.6.3 on YARN - 24 vcores, 30 GB memory - 3 data sets of increasing size: 0.15 GB / 2.5 GB / 8 GB
  20. Spark Cubing vs. MR In-mem Cubing. Almost equally fast, and more adaptable to general data sets. - In-mem cubing expects sharded data and works poorly on random data sets - Spark cubing is more adaptable to different kinds of data distribution
  21. Near Real-time Streaming. Build latency down to a few minutes.
  22. New in Kylin 1.6. [Diagram: In-mem Cubing feeding Kylin; BI Tools, Web App… query via ANSI SQL (new)]
  23. Demo of Twitter Analysis. http://hub.kyligence.io Incremental build triggers every 2 minutes; each build finishes in 3 minutes. - 8-node cluster on AWS, 3 Kafka brokers - Twitter sample feed, 10+ K messages per second - Cube has 9 dimensions and 3 measures - 2 jobs running at the same time
  24. Summary. Apache Kylin 2.0: - Kylin 2.0 Beta download available - Snowflake schema support - Runs TPC-H benchmark - Time / Window / Percentile functions - Spark cubing - Near real-time streaming. What is next: - Hadoop 3.0 support (Erasure Coding) - Spark cubing enhancement - Connect more sources (JDBC, SparkSQL) - Alternative storage (Kudu?) - Real-time support, lambda architecture
  25. Thanks! See you at our next talk!

Editor's Notes

  • Thank you all for coming. Thanks for your time.
    My name is Yang, or 李扬 in Chinese. Co-founder and CTO of Kyligence. Also a PMC member of Apache Kylin.
    The topic today is Apache Kylin 2.0. It is in beta at the moment and is planned for release in April.
    I will first briefly introduce what Apache Kylin is and then talk about the important new features it has, including snowflake schema support, Spark cubing, and streaming.
  • So what is Apache Kylin? It is an OLAP engine on Hadoop, perhaps the most popular one at the moment. If you google “OLAP on Hadoop”, Kylin is the first result, as I tried this morning.

    It sits on top of the Hadoop infrastructure and exposes your relational data to upper-layer applications via a standard SQL interface.

    Kylin can handle very big data sets and is very fast in terms of query latency. For example, the biggest Kylin instance we know of in production is at toutiao.com, the top 1 news feed app in China. It has a table of 3,000 billion rows, and with Kylin, the average query response time is below 1 second.

    Kylin can handle very complex data models, and we will talk more about snowflake support soon. The widest cube we know of is at CPIC (China Pacific Insurance), the top 3 insurance group in China. It contains more than 60 dimensions.

    And Kylin provides standard JDBC / ODBC / RestAPI interfaces. It integrates very well with existing BI tools, like Tableau and Power BI, you name it.
  • Why is Kylin so fast? We can show it with an example. Imagine a retail scenario: I want to report revenue by “returnflag” and “orderstatus” within a date range, to see the total amount of successful transactions, canceled transactions, returned transactions, etc. A typical SQL query will look like this.

    To execute it, we compile it into a relational expression like the diagram on the left, the so-called execution plan. From the plan we can see that executing the query involves scanning all the rows in the tables, joining them together, going through the date-range filter, summing revenue by “returnflag” and “orderstatus”, and finally producing the sorted result.

    It is easy to see that the time complexity of such an execution is at least O(N), where N is the total number of rows in the tables, because each table row is visited at least once. And that assumes the joined tables are perfectly partitioned and indexed, so that the expensive join operator also finishes in linear time, which is rarely the case in practice.

    Anyway, O(N) is the best you can get with ad-hoc SQL processing.
  • How can Kylin go beyond O(N)? That is by precalculation.

    So if I know the query pattern in advance, I can precalculate the Aggregate, Join, and Table Scan operators to create a cuboid. If “cuboid” sounds unfamiliar, you can think of it as a materialized summary table.

    The summary table is transaction amount grouped by “returnflag”, “orderstatus”, and “date”. Because there are a fixed number of return flags and order statuses, and let's say the date range is limited too, say 3 years, about 1,000 days, the number of rows in the summary table is at most “flag x status x days”, which is a constant in big-O notation.

    That means, if we execute the same SQL on the precalculated cuboid, the maximum number of rows to process is a constant. And that is why Kylin can be faster.
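The build-once / query-many split described above can be sketched in a few lines. This is not Kylin's internals, just the principle: the build phase aggregates detail rows into a cuboid whose size is bounded by the dimension cardinalities, and the online query then scans only the cuboid. The rows are made-up toy data.

```python
from collections import defaultdict

# Toy detail rows: (returnflag, orderstatus, shipdate, quantity),
# standing in for the joined lineitem/orders table.
rows = [
    ("R", "F", "1998-09-01", 10),
    ("R", "F", "1998-09-01", 5),
    ("N", "O", "1998-09-02", 7),
    ("R", "F", "1998-09-02", 3),
]

# Build phase (done once, offline): aggregate detail rows into the cuboid.
# Its size is bounded by |flags| * |statuses| * |days|, not by len(rows).
cuboid = defaultdict(int)
for flag, status, day, qty in rows:
    cuboid[(flag, status, day)] += qty

# Query phase (online): the sample GROUP BY now scans only the cuboid.
result = defaultdict(int)
for (flag, status, day), qty in cuboid.items():
    if day <= "1998-09-16":            # the WHERE filter
        result[(flag, status)] += qty  # aggregate away the day dimension

print(sorted(result.items()))
# [(('N', 'O'), 7), (('R', 'F'), 18)]
```

However large `rows` grows, the online scan stays bounded by the cuboid size, which is the constant the talk refers to.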
  • So Kylin is all about precalculation. The core idea is based on the classic cube theory and is developed from there into the SQL-on-big-data domain.

    Kylin provides Model and Cube to help you define the space of precalculation. It has a Build Engine that executes the precalculation on a distributed system using MR or Spark, and a Query Engine that allows running SQL on top of the precalculated result.

    The key here is modeling. If you have a good understanding of your data and business analysis requirements, you can capture the necessary precalculation with the right cube.
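The "space of precalculation" on the lattice slide can be enumerated mechanically: every subset of the cube's dimensions is one cuboid. A minimal sketch, using the dimension names from the slide (not Kylin code):

```python
from itertools import combinations

# Dimension names taken from the lattice slide.
dims = ("time", "item", "location", "supplier")

# Every subset of the dimensions is one cuboid, from the 0-D apex
# cuboid (the grand total) to the 4-D base cuboid (full detail).
lattice = {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids: {len(cuboids)}")
# Counts per layer: 1, 4, 6, 4, 1 -- 2^4 = 16 cuboids in total,
# matching the slide's apex / 1-D / 2-D / 3-D / base layers.
```

This is why cube design matters: the lattice grows as 2^d in the number of dimensions, so the model should include only the dimension combinations the queries actually need.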
  • Once you have the right precalculation, most of your queries (if not all) can be transformed into cube queries like the one we have just seen. The execution time complexity is reduced to O(1), achieving very fast query speed regardless of the original data size.

    That is the brief introduction of Apache Kylin.
  • Next we will talk about the Snowflake Schema support, which is the most important new feature in Kylin 2.0.
  • Kylin 1.0 has a limitation: it only supports the star schema data model. That means it only allows precalculating the join of 1 level of lookup tables, as the diagram shows. It is difficult to support real-world business cases, which are often more complicated than a star schema.
  • Kylin 2.0 introduced big changes to the model metadata and supports the snowflake data model out of the box, allowing precalculation of unlimited levels of lookup tables, as the diagram shows.

    There were also many bug fixes and improvements regarding joins and sub-queries. As a result, Kylin 2.0 is able to support very complex data models and queries.
  • To demonstrate the greatly improved modeling and SQL capability, we made the effort to run the TPC-H queries on Kylin 2.0. For those who are not familiar with TPC-H, the following is quoted from the TPC-H website: it is a decision support benchmark popular among commercial RDBMS & DW solutions. It includes both queries and data that have broad industry-wide relevance. The queries are designed to examine large volumes of data, to have a high degree of complexity, and to give answers to critical business questions.

    Kylin 2.0 can run all 22 TPC-H queries. We have put up a page with all the steps and resources needed to reproduce the work, so anyone who wants to verify it can give it a try in their own environment. This shows that with a precalculated cube, Kylin is also very flexible and can answer very complex queries.

    Another note: the goal here is not to compare performance with other TPC-H results. On one hand, according to the TPC-H spec, precalculation across tables is not permitted, so in that sense Kylin does not qualify as a valid system to compare with other TPC-H results. On the other hand, we haven't done performance tuning for the TPC-H queries yet; we just had enough time to let all the queries pass. The room for performance improvement is still very big.
  • To quickly show you some interesting TPC-H queries: on the left side is the SQL. Its font size is very small and it is not intended to be read; I highlighted the sub-queries in different colors so we get a feeling for the complexity. On the right side is the relational expression of the query in tree form. In the tree, I marked precalculated nodes with a white background and red text, so they look lightweight. The other solid nodes are submitted to online calculation and will be the cause when a query runs slowly.

    This is TPC-H query 07. As you can see, almost all of the relational operators are precalculated. The remaining Sort / Proj / Filter operators work on very few records, so the query is very fast: it takes Kylin only 0.17 seconds to run, while on the same hardware and data set, Hive+Tez takes 35.23 seconds for this query. That shows the difference between precalculation and online calculation.
  • This is TPC-H query 11. It has 4 sub-queries in terms of SQL and is more complex than the previous query. Its relational expression tree is more complex too: it includes two branches, each loading from a precalculated cuboid, and finally the results of the branches are joined again, which is a very heavy online computation. With the percentage of online calculation increased, the query takes longer for Kylin, running in 3.42 seconds, while a full online calculation is still much slower, taking 15.87 seconds.
  • This is an even more complex query, TPC-H query 12. It contains 5 sub-queries; the SQL font size is even smaller than on the previous pages, to fit the SQL on screen. The relational form is uglier too: it reads 3 cuboids in 3 branches and joins them online, with 2 heavy online join operators. As expected, the more online calculation, the slower the query gets: it takes 7.66 seconds to run in Kylin and 12.64 seconds in Hive+Tez. As more heavy operators stay online, the advantage of precalculation decreases.
  • So to sum it up, Kylin 2.0 is much more than multidimensional analysis. With snowflake schema enabled, it can support very complex relational models and answer very complex queries. It can run the TPC-H benchmark, and has enhanced or added many other SQL features like Percentile functions, Window functions, and Time functions.

    Compared to Kylin 1.0, Kylin 2.0 is a big step forward in terms of SQL maturity. Given the right model design, it can answer all your queries, of any kind of complexity. Compared to other data warehouses on Hadoop, Kylin 2.0 may still be a little behind in terms of SQL, e.g. it does not have UDFs yet, but the gap is very small now. And in terms of query latency, because of precalculation, Kylin has a big advantage.
  • Next we will talk about Spark Cubing, which cuts the build time by half.
  • Since 1.5, we have been trying to do cubing on Spark. However, at that time the attempt was not successful.

    The first attempt was a port of the MR in-mem cubing algorithm onto Spark, chosen for ease of implementation. The algorithm uses as much memory as it can and builds the whole cube in one round, so Spark's memory cache is not an advantage, since the MR job does the same. The result was no obvious improvement compared to MapReduce.
  • So 2.0 did a complete rework based on the layered cubing algorithm. In layered cubing, the cube is calculated layer by layer; each layer's output is the next layer's input. We use an RDD to abstract each layer's data, so the parent RDD can be cached to speed up the processing of the next layer. The RDD can be exported to a sequence file using the same format as MapReduce, which keeps the cube data format compatible. Implementation-wise, the original mapper translates into “flatMap” and the original reducer into “reduceByKey”. Most of the code gets reused.
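The map-to-flatMap / reduce-to-reduceByKey translation can be sketched without Spark at all. Below is a toy sketch of layered cubing, with plain Python lists and dicts standing in for RDDs; it is an illustration of the algorithm's shape under simplifying assumptions, not Kylin's actual code. The "drop only positions left of the first aggregated slot" rule is one simple way to give each child cuboid exactly one parent so no measure is double counted; the real build engine chooses parent/child paths through the lattice by its own rules.

```python
from collections import defaultdict

def reduce_by_key(pairs):
    """Stand-in for Spark's reduceByKey: sum values sharing a key."""
    acc = defaultdict(int)
    for key, value in pairs:
        acc[key] += value
    return list(acc.items())

def flat_map_children(dims, measure):
    """Stand-in for the mapper-turned-flatMap: from one row of a parent
    cuboid, emit one child row per aggregated-away dimension (None marks
    an aggregated slot). Dropping only positions left of the first None
    gives each child cuboid exactly one parent, avoiding double counting."""
    first_none = next((i for i, d in enumerate(dims) if d is None), len(dims))
    for i in range(first_none):
        yield dims[:i] + (None,) + dims[i + 1:], measure

# Base cuboid (the talk's 4-D layer; here just 2-D with toy data):
# (returnflag, orderstatus) -> sum(quantity).
layer = [(("R", "F"), 18), (("N", "O"), 7)]
while any(d is not None for key, _ in layer for d in key):
    emitted = [kv for key, value in layer
                  for kv in flat_map_children(key, value)]
    layer = reduce_by_key(emitted)  # one Spark stage per cube layer
    print(layer)
# The loop ends at the apex cuboid: all dimensions aggregated away,
# one row holding the grand total (18 + 7 = 25).
```

Each pass of the loop corresponds to one cached-parent-RDD stage in the talk's DAG; in real Spark, `emitted` would never be materialized on the driver.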
  • This is a sample Spark job calculating the 3rd layer of a cube. The first 2 stages are skipped because they were calculated and cached by the previous layer's job. The later two stages do the additional processing to calculate the 3rd layer of the cube.
  • Comparing Spark cubing performance with MR layered cubing: the red bar is Spark and the blue bar is MR.

    The test was done on a 4-node cluster: Spark 1.6.3 on YARN, with 24 vcores and 30 GB memory allocated to Spark. We tested 3 data sets of increasing size: 0.15 GB, 2.5 GB, and 8 GB. The first two data sets are very small compared to 30 GB of memory, so Spark could cache everything in memory, and the result is that the Spark build time is about 50% of MR layered cubing. The 8 GB data set is much bigger, and as cubing causes data expansion, it cannot all fit in memory, so the advantage of Spark decreases a little, as the diagram shows.
  • Comparing Spark cubing to MR in-mem cubing: again, the red bar is Spark, the blue bar is MR in-mem cubing.

    They are almost equally fast. However, remember that in-mem cubing is very picky about data distribution: it is only effective when the data is sharded or nearly sharded, and if you force in-mem cubing to run on random data, it performs even slower than MR layered cubing. In the diagram, the data sets are all sharded in favor of in-mem cubing.

    On the other hand, Spark cubing shows consistently good performance, regardless of the data distribution. So in general, Spark cubing is still a better choice than in-mem cubing.
  • Last thing is about near real-time streaming.
  • This is actually a 1.6 feature. However, I feel it was not marketed well enough, so here it is again.

    Starting from 1.6, Kylin can connect to Kafka as a source, just like Hive. Using the in-mem cubing algorithm, we can trigger micro incremental builds very frequently, e.g. every 2 minutes. The result is many small cube segments that can be queried to give near-real-time results.
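Kylin manages the micro-builds through Kafka offsets and segment metadata internally; as a rough illustration of the 2-minute micro-segment idea only (not Kylin's API), each incoming event can be aligned to the window whose build will cover it:

```python
from datetime import datetime

def segment_start(ts, interval_min=2):
    """Align an event timestamp to the start of its micro-build segment
    (2-minute windows, matching the demo's build trigger interval)."""
    return ts.replace(minute=ts.minute - ts.minute % interval_min,
                      second=0, microsecond=0)

# Toy event times: each message falls into one small cube segment;
# queries union all segments to return near-real-time results.
events = [datetime(2017, 3, 1, 9, 0, 30),
          datetime(2017, 3, 1, 9, 1, 59),
          datetime(2017, 3, 1, 9, 2, 1)]
for e in events:
    print(e.time(), "->", segment_start(e).time())
# 09:00:30 and 09:01:59 land in the 09:00 segment; 09:02:01 in 09:02.
```

The freshness delay the notes mention then follows directly: roughly one window length plus one build duration.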
  • To show this really works, we have put up a demo site that analyzes Twitter messages in real time.

    It runs on an 8-node AWS cluster with 3 Kafka brokers. The input is the Twitter sample feed, which carries 10+ K messages per second. The cube is of average complexity: 9 dimensions and 3 measures. With this setup, the incremental build is triggered every 2 minutes and finishes in 3 minutes. That's why you see 2 jobs running at the same time in the screenshot, which is perfectly OK as long as your cluster continues to complete jobs at the fixed rate.

    As a result, the system has about 5 minutes of delay in terms of real-time freshness.
  • The demo shows Twitter message trends by language and by device. The left diagram is the message trend by language over a whole day. We can see English message volume go up during US daytime, while Asian message volume goes down because it is night in Asia.
  • There is also a tag cloud showing the most recent hot topics, and below it the trend of the hottest tags.

    All the charts are real-time to within the latest 5 minutes.
  • To summarize.

    Apache Kylin is about to reach its 2.0 release. It has rich features like snowflake schema support, runs the TPC-H benchmark, and adds many enhanced SQL functions. It has Spark cubing, which halves the build time, and streaming capability. 2.0 is still in beta at the moment; there is a beta package you can download and try. We are very eager to hear feedback from the community, good or bad. After fixing any critical issues, the plan is to release in April.

    As for Kylin's roadmap, there is a lot we want to do. Hadoop 3.0 with erasure coding could greatly reduce cube storage, something we will definitely catch up with. Spark cubing has much room for improvement too, e.g. at the moment it cannot source from Kafka yet. Connecting more sources is another frequently asked requirement: we could source from JDBC, or maybe SparkSQL. Alternative storage, like Kudu? Last but not least, a lambda architecture to support true real-time.

    That is all. Thanks for your time again.
