Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi

2,710 views

Published on

Apache Kylin is a distributed OLAP engine on Hadoop, which provides sub-second level query latency over datasets scaling to petabytes. Kylin’s superior query performance relies on pre-calculated multi-dimension Cube, which is often time-consuming to build. By default, Kylin uses MapReduce Cube Engine built atop of Hadoop MapReduce framework to aggregate huge amounts of source data. The MR Engine has been well-tuned over years and proven to be stable in hundreds of production deployments. Recently, the Kylin team is trying to further speed up the process of cube building by replacing MR with Spark. Kyligence has initiated the new Spark Cube Engine with some benchmarks between Spark and MR over different datasets, and has received some promising results. Hear about their results and experiences on moving Cube building, which is a huge computing task, to Spark.

Published in: Data & Analytics
  • Hello! High Quality And Affordable Essays For You. Starting at $4.99 per page - Check our website! https://vk.cc/82gJD2
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi

  1. 1. Luke Han, luke@kyligence.io Shaofeng Shi, shaofeng@kyligence.io Apache Kylin: Speed up Cubing with Spark
  2. 2. About Us Shaofeng Shi (史少锋) • Apache Kylin PMC • Sr. Architect, Kyligence Inc. Luke Han (韩卿) • VP of Apache Kylin • Co-founder & CEO, Kyligence Inc. formed by the team who created Apache Kylin http://kyligence.io
  3. 3. Agenda • What is Apache Kylin • Challenges with MapReduce • Speed up Cubing with Spark • Q&A
  4. 4. About Apache Kylin ü Leading open source OLAP on Hadoop ü Fast growing open source community ü Adopted by 200+ global organizations ü First born in China Apache Top Level Project ü InfoWorld BossieAward: – Best Open Source Big Data Tool (2015) – Best Open Source Big Data Tool (2016) Kylin / ˈkiːˈlɪn / 麒麟 -- n. (in Chinese art) a mythical animal of composite form Apache Kylin Extreme OLAP Engine for Big Data http://kylin.apache.org
  5. 5. Global Users 200+ use cases in production
  6. 6. Use Case
  7. 7. About Apache Kylin • Extreme OLAP Engine on Hadoop ü High Performance ü High Concurrency ü ANSI SQL ü Native on Hadoop ü Cloud Ready
  8. 8. The Basic: OLAP Cube • Pre-calculate metrics by dimensions
  9. 9. The Magic: Pre-Calculation Sort Aggr Filter Tables Join Sort Filter Online:Query from Cube Post Aggr. SELECT returned, status, sum(quantity), sum(price) FROM lineitem INNER JOIN orders on l_orderkey =o_orderkey WHERE shipdate =’2016-09-16' GROUP BY returned, status ORDER BY returned, status; Offline:Cubing Cube
  10. 10. Apache Kylin: O(1) O(N) O(1) Apache Kylin Data Size Response Time Other Engines
  11. 11. Star-Schema Benchmark Run SSB at 10, 20 and 40 million-row scales, Kylin’s response time keep stable Read more at: https://github.com/kyligence/ssb-kylin Kylin vs Hive, O(1) : O(N)
  12. 12. Architecture
  13. 13. Cubing Process Cubing: Map -> Shuffle -> Reduce
  14. 14. Build Cube with MapReduce • Calculate Cuboids by layer :N dim (Base cuboid), N-1 dim, N-2…, 1, 0 • Reuse previous layer’s result • HDFS used for data sharing • Totally need N round MR;
  15. 15. Challenges with MapReduce • Slow data sharing ü Serialization, Replication, Deserialization… • Repeated job submission ü Submit dependent jar/files repeatedly ü Re-queue when cluster is busy • Short of streaming support ü Couldn’t support low latency data processing • Limited storage support ü Couldn’t ingest data from non-HDFS
  16. 16. Speed up Cubing with Spark • Abstract each layer cuboids as a RDD • Cache parent RDD to generate Child RDD • Export RDD when Child be generated • As-is measure aggregators can be reused with Spark Java API RDD-1 RDD-2 RDD-3 RDD-4 RDD-5
  17. 17. Speed up Cubing with Spark Reuse data in memory
  18. 18. Speed up Cubing with Spark One job finishes all layers aggregation!
  19. 19. Performance Benchmark • Environment ü 4 nodes Hadoop cluster; each has 28 GB RAM and 12 cores ü CDH 5.8, Apache Kylin 2.0 • Spark ü Spark 1.6.3 on YARN ü 6 executors, each has 4 cores, 5GB memory • Test Data ü Airline data of US DoT, totally 160 million rows ü Cube:10 dimensions, 5 measures (SUM) • Test Scenarios ü Build the cube at different scale: 3 million, 50 million and 160 million rows
  20. 20. Spark Cubing vs. MR Cubing Build Cube Time Comparison BuildTime(minute) 0 8 15 23 30 Source data rows 3-million 50-million 160-million MapReduce Spark MapReduce Spark MapReduce Spark 17 8 3 29 19 6
  21. 21. Spark Cubing vs. MR In-Mem Cubing MR In-mem is slow when dataset is big and random distributed When data is in sharding, MR In-mem can be fast Spark is fast & stable regardless of data distribution
  22. 22. Benefits with Spark • Spark speeds up Cubing at 1x ü Half time be saved • Spark simplifies Kylin’s development ü More functions, less codes • Spark brings Kylin to a new era ü Real-time OLAP, Ad hoc query, Cloud integration
  23. 23. Thank You. Join Apache Kylin community dev@kylin.apache.org Visit Kylin & Kyligence home http://kylin.apache.org http://kyligence.io

×