
Apache Kylin @ Big Data Europe 2015

Apache Kylin, Extreme OLAP Engine for Big Data. Build petabyte-scale OLAP cubes.

Published in: Technology

  1. 1. Using Apache Kylin for Large Scale Data Analytics Apache Big Data Europe 2015
  2. 2. http://kylin.io Agenda • Big Data @ eBay • What’s Apache Kylin? • Cube Segments & Streaming • Q&A
  3. 3. Big Data @ eBay $28B GMV VIA MOBILE (2014) 266M MOBILE APP GLOBALLY 1B LISTINGS CREATED VIA MOBILE 157M ACTIVE BUYERS 25M ACTIVE SELLERS 800M ACTIVE LISTINGS 8.8M NEW LISTINGS EVERY WEEK
  4. 4. CHECKOUT WATCH LIST SHARE LOCAL PREFERENCES BID 100’s PB 9 Quarters of historical data 100M QUERIES/MONTH 20TB Behavior Data Capture every day CLICK SEARCH
  5. 5. Hadoop @ eBay 2007 2010 2011 2012 2013/2014 2009
  6. 6. What’s Kylin: Extreme OLAP Engine for Big Data. Apache Kylin is an open source distributed analytics engine from eBay that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets. kylin /ˈkiːˈlɪn/ 麒麟, n. (in Chinese art) a mythical animal of composite form • Open sourced on Oct 1st, 2014 • Accepted as an Apache Incubator project on Nov 25th, 2014
  7. 7. Business Needs for Big Data Analysis • Sub-second query latency on billions of rows • ANSI SQL for both analysts and engineers • Full OLAP capability to offer advanced functionality • Seamless integration with BI tools • Support of high cardinality and high dimensions • High concurrency – thousands of end users • Distributed and scale-out architecture for large data volume
  8. 8. Features Highlights • Extremely Fast OLAP Engine at Scale: Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data • ANSI SQL Interface on Hadoop: Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions • Seamless Integration with BI Tools: Kylin currently offers integration capability with BI tools like Tableau • Interactive Query Capability: users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset • MOLAP Cube: users can define a data model and pre-build it in Kylin with more than 10 billion raw data records
  9. 9. Apache Kylin Adoption
      External: 25+ contributors in community
      Adoption:
        In Production: eBay, Baidu Maps
        Evaluation: Huawei, Bloomberg Law, British Gas, JD.com, Microsoft, Tableau…
      eBay Use Cases:
        Case | Cube Size | Raw Records
        User Session Analysis | 14+ TB | 75+ billion rows
        Traffic Analysis | 21+ TB | 20+ billion rows
        Behavior Analysis | 560+ GB | 1.2+ billion rows
  10. 10. Cube Designer
  11. 11. Job Management
  12. 12. Query and Visualization
  13. 13. Tableau Integration
  14. 14. Kylin Architecture Overview • Online Analysis Data Flow • Offline Data Flow • Clients/users interact with Kylin via SQL • OLAP Cube is transparent to users. Diagram components: 3rd Party App (Web App, Mobile…), SQL-Based Tool (BI Tools: Tableau…), REST API, JDBC/ODBC, REST Server, Query Engine, Routing, Metadata, Cube Builder (MapReduce…), Hadoop Hive (Star Schema Data), OLAP Cubes (HBase, Key Value Data), Data Source Abstraction, Engine Abstraction, Storage Abstraction. SQL queries are served at low latency (seconds).
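
The slide above says third-party apps reach Kylin through a REST API using SQL. As a minimal sketch of what such a client call might look like, the snippet below only builds the HTTP request and does not send it. The endpoint path `/kylin/api/query`, the JSON fields `sql` and `project`, and HTTP basic authentication are assumptions based on Kylin's public REST documentation, not something stated in the slides; the host, project, table, and credentials are hypothetical placeholders.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, project, sql, user, password):
    """Build (but do not send) a POST request carrying one SQL query.

    Assumption: the server accepts JSON with "sql" and "project" fields
    at <base_url>/kylin/api/query and uses HTTP basic authentication.
    """
    body = json.dumps({"sql": sql, "project": project}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Hypothetical host, project, table, and credentials, for illustration only.
request = build_query_request(
    "http://kylin.example.com:7070",
    "demo_project",
    "SELECT site, SUM(gmv) FROM sales_fact GROUP BY site",
    "some_user",
    "some_password",
)
# urllib.request.urlopen(request) would send it to a running server.
```

Because the cube is transparent to users, the client sends ordinary SQL against the star-schema tables; choosing a cube is the query engine's job.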
  15. 15. Cube Storage in HBase
  16. 16. OLAP Cube – Balance between Space and Time
      0-D (apex) cuboid
      1-D cuboids: time; item; location; supplier
      2-D cuboids: time, item; time, location; time, supplier; item, location; item, supplier; location, supplier
      3-D cuboids: time, item, location; time, item, supplier; time, location, supplier; item, location, supplier
      4-D (base) cuboid: time, item, location, supplier
      • Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
        1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
        2. (9/15, milk, Urbana, *) - <time, item, location>
        3. (*, milk, Urbana, *) - <item, location>
        4. (*, milk, Chicago, *) - <item, location>
        5. (*, milk, *, *) - <item>
      • Cuboid = one combination of dimensions
      • Cube = all combinations of dimensions (all cuboids)
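
To make the cuboid lattice concrete, here is a small sketch that enumerates every cuboid for the four example dimensions and rolls a base cell up to an aggregate cell. The function names `cuboids` and `roll_up` are illustrative and not taken from Kylin.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def cuboids(dimensions):
    """Return every cuboid (subset of dimensions), ordered from 0-D to N-D."""
    return [
        combo
        for k in range(len(dimensions) + 1)
        for combo in combinations(dimensions, k)
    ]


def roll_up(cell, keep):
    """Project a base cell onto a cuboid; aggregated dimensions become '*'."""
    return tuple(v if d in keep else "*" for d, v in zip(DIMENSIONS, cell))


all_cuboids = cuboids(DIMENSIONS)
base_cell = ("9/15", "milk", "Urbana", "Dairy_land")

print(len(all_cuboids))                          # 16, i.e. 2**4
print(roll_up(base_cell, ("item", "location")))  # ('*', 'milk', 'Urbana', '*')
```

The second printed cell matches cell 3 in the slide: it is an ancestor of the base cell in the `<item, location>` cuboid.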
  17. 17. Querying the Cube • Dynamic data management framework. • Formerly known as Optiq, Calcite is an Apache incubator project, used by Apache Drill and Apache Hive, among others. • http://optiq.incubator.apache.org
  18. 18. Query Engine based on Calcite • Metadata SPI – Provide table schema from Kylin metadata • Optimization Rules – Translate the logical operators into Kylin operators • Relational Operators – Find the right cube – Translate SQL into storage engine API calls – Generate the physical execution plan with the linq4j Java implementation • Result Enumerator – Translate the storage engine result into the Java implementation result • SQL Function – Add HyperLogLog for distinct count – Implement date/time related functions (e.g. Quarter)
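
The slide mentions HyperLogLog for distinct count. The sketch below is a generic textbook HyperLogLog, not Kylin's implementation; the precision `p=12`, the SHA-1 hash, and the class layout are all assumptions chosen for illustration. It is included mainly to show why the structure suits a pre-aggregated cube: two sketches can be merged without revisiting the raw rows.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog with 2**p registers (illustrative, not Kylin's code)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Union of two sketches: keep the larger rank per register."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


# Two partial sketches, as two cube cells might hold, with overlapping users.
left, right = HyperLogLog(), HyperLogLog()
for user_id in range(0, 6000):
    left.add(user_id)
for user_id in range(4000, 10000):
    right.add(user_id)
left.merge(right)
approx_distinct = left.count()   # close to 10000, not exact
```

An exact distinct count cannot be summed across cells because the same user may appear in several of them; merging registers avoids that double counting at the cost of a small estimation error.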
  19. 19. Optimizing Cube Builds – Full Cube vs. Partial Cube • Full Cube – Pre-aggregate all dimension combinations – “Curse of dimensionality”: an N-dimension cube has 2^N cuboids • Partial Cubes – To avoid dimension explosion, we divide the dimensions into different aggregation groups: 2^(N+M+L) → 2^N + 2^M + 2^L – For a cube with 30 dimensions, if we divide these dimensions into 3 groups, the number of cuboids drops from about 1 billion to about 3 thousand: 2^30 → 2^10 + 2^10 + 2^10 • Tradeoff between online aggregation and offline pre-aggregation
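
The cuboid counts quoted on this slide can be checked with a few lines of arithmetic. The helper names are illustrative; the formula simply follows the slide, counting 2^size cuboids per aggregation group.

```python
def full_cube_cuboids(n_dims):
    """A full cube pre-aggregates every subset of its dimensions."""
    return 2 ** n_dims


def partial_cube_cuboids(group_sizes):
    """With aggregation groups, each group contributes 2**size cuboids."""
    return sum(2 ** size for size in group_sizes)


full = full_cube_cuboids(30)                    # 1073741824, about 1 billion
partial = partial_cube_cuboids([10, 10, 10])    # 3072, about 3 thousand
print(full, partial, full // partial)
```

The price of the smaller build is that a query grouping by dimensions from different groups has no exact pre-built cuboid and needs more aggregation at query time, which is the tradeoff the last bullet refers to.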
  20. 20. Optimizing Cube Builds: Partial Cubes
  21. 21. Optimizing Cube Building – Incremental Building
  22. 22. Optimizing Cube Builds: Improved Parallelism • The current algorithm – Many MR jobs, growing with the number of dimensions – Huge shuffles, aggregation on the reduce side, 100x the total cube size. Diagram: Full Data, 0-D Cuboid, 1-D Cuboid, 2-D Cuboid, 3-D Cuboid, 4-D Cuboid, linked by one MR step each
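
A rough, in-memory sketch of the layer-by-layer approach described above, where each call to `aggregate` stands in for one MapReduce round. The direction (base cuboid first, then each smaller cuboid derived from a parent one layer up) is an assumption about the diagram, and the three-dimension sample data is invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("time", "item", "location")


def aggregate(rows, parent_dims, child_dims):
    """Simulate one MR round: re-key rows from parent_dims to child_dims and sum."""
    positions = [parent_dims.index(d) for d in child_dims]
    out = defaultdict(int)
    for key, measure in rows:
        out[tuple(key[i] for i in positions)] += measure
    return list(out.items())


def build_by_layer(fact_rows):
    """Build all cuboids from the N-D base cuboid down to the 0-D apex."""
    cube = {DIMS: aggregate(fact_rows, DIMS, DIMS)}   # round 1: base cuboid
    rounds = 1
    for k in range(len(DIMS) - 1, -1, -1):
        for child in combinations(DIMS, k):
            # Any cuboid one layer up that contains the child can be its parent.
            parent = next(
                p for p in cube if len(p) == k + 1 and set(child) <= set(p)
            )
            cube[child] = aggregate(cube[parent], parent, child)
        rounds += 1                                    # one round per layer
    return cube, rounds


FACT = [
    (("9/15", "milk", "Urbana"), 3),
    (("9/15", "milk", "Chicago"), 2),
    (("9/16", "bread", "Urbana"), 5),
]
cube, rounds = build_by_layer(FACT)
print(rounds)               # 4 rounds for 3 dimensions
print(cube[("item",)])      # totals per item
```

Every round re-reads and re-shuffles the previous layer's output, which is where the large shuffle volume mentioned on the slide comes from.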
  23. 23. Optimizing Cube Builds: Improved Parallelism • The to-be algorithm, 30%-50% faster – 1 round MR – Reduced shuffles, map side aggregation, 20x total cube size – Hourly incremental build done in tens of minutes Data Split Cube Segment Data Split Cube Segment Data Split Cube Segment …… Final Cube Merge Sort (Shuffle) mapper mapper mapper
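
For contrast with the multi-round approach, here is a minimal sketch of the single-round idea on this slide: each mapper aggregates every cuboid for its own data split in memory, and a merge step combines the per-split segments. It is an in-memory illustration under assumed sample data, not Kylin's actual job code.

```python
from collections import Counter
from itertools import combinations

DIMS = ("time", "item", "location")


def map_split(split):
    """Mapper: compute all cuboids for one data split, aggregating map-side."""
    segment = Counter()
    for values, measure in split:
        for k in range(len(DIMS) + 1):
            for positions in combinations(range(len(DIMS)), k):
                cuboid = tuple(DIMS[i] for i in positions)
                key = tuple(values[i] for i in positions)
                segment[(cuboid, key)] += measure
    return segment


def merge_segments(segments):
    """Merge step: sum the already-aggregated segments into the final cube."""
    final = Counter()
    for segment in segments:
        final.update(segment)   # Counter.update adds counts
    return final


split_a = [(("9/15", "milk", "Urbana"), 3), (("9/15", "milk", "Chicago"), 2)]
split_b = [(("9/16", "bread", "Urbana"), 5), (("9/15", "milk", "Urbana"), 1)]

final_cube = merge_segments([map_split(split_a), map_split(split_b)])
print(final_cube[(("item",), ("milk",))])   # 6
print(final_cube[((), ())])                 # 11
```

Because each mapper emits one pre-aggregated value per cell instead of one record per input row per cuboid, far less data crosses the shuffle.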
  24. 24. Optimizing Cube Builds: Streaming. Diagram labels: Kafka Topics, Recent Data, Historical Data, Inverted Index, Cube Segments, Hybrid Storage Interface, SQL Query
  25. 25. Use Case: SEO Operational Dashboard. Dimensions: • eBay Site – ebay.com, ebay.co.uk, ebay.de • Buyer Country – US, CN, RU • Search Engine – Google, Bing, Yahoo! • Referrer – google.com, google.co.uk • Page – Search, View Item, Product • User Experience – Desktop, Mobile APP, mWeb. Measurements: • Visits, GMB $, GMB share, conversion rate, bounce rate, # of view items, # of bought items, etc.
  26. 26. Q & A
  27. 27. http://kylin.io Agenda • What’s Apache Kylin? • Features & Tech Highlights • Performance • Roadmap • Q & A
  28. 28. Cube Storage in HBase
  29. 29. How to Query Cube? Storage Engine • Pluggable storage engine – Common iterator interface for storage engine – Isolate query engine from underlying storage • Translate cube query into HBase table scan – Columns, Groups -> Cuboid ID – Filters -> Scan Range (Row Key) – Aggregations -> Measure Columns (Row Values) • Scan HBase table and translate HBase result into cube result – HBase Result (key + value) -> Cube Result (dimensions + measures)
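
The translation steps above can be illustrated with a toy row-key scheme. Everything here is a hypothetical layout chosen for readability: a 4-bit cuboid ID rendered as text, `|` as separator, and string values that never contain `|`. Kylin's real row keys are encoded bytes and differ in detail.

```python
DIMS = ("time", "item", "location", "supplier")


def cuboid_id(dims):
    """Bitmask with one bit per dimension; the first dimension is the highest bit."""
    mask = 0
    for i, d in enumerate(DIMS):
        if d in dims:
            mask |= 1 << (len(DIMS) - 1 - i)
    return mask


def row_key(dims, values):
    """Hypothetical row key: cuboid ID, then dimension values in DIMS order."""
    ordered = [values[d] for d in DIMS if d in dims]
    return "|".join([format(cuboid_id(dims), "04b")] + ordered) + "|"


def scan_range(group_by, filters):
    """Columns and groups pick the cuboid; leading equality filters narrow the range."""
    dims = set(group_by) | set(filters)
    prefix = [format(cuboid_id(dims), "04b")]
    for d in DIMS:
        if d not in dims:
            continue
        if d not in filters:
            break   # later filters must be applied while scanning, not via the key
        prefix.append(filters[d])
    start = "|".join(prefix) + "|"
    stop = start[:-1] + "}"   # '}' is the character right after '|'
    return start, stop


# SELECT item, SUM(...) WHERE time = '9/15' GROUP BY item
start, stop = scan_range(["item"], {"time": "9/15"})
key = row_key({"time", "item"}, {"time": "9/15", "item": "milk"})
print(start, stop, start <= key < stop)
```

Putting the cuboid ID first keeps each cuboid's rows contiguous, so a query touches only the one cuboid that matches its grouping and filter columns.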
  30. 30. Features Highlights… • Compression and encoding support • Incremental refresh of cubes • Approximate query capability for distinct count (HyperLogLog) • Leverage HBase Coprocessor for query latency • Job management and monitoring • Easy web interface to manage, build, monitor and query cubes • Security capability to set ACL at cube/project level • Support LDAP integration
  31. 31. Data Modeling. Source: Star Schema (Fact, Dim, Dim, Dim). Mapping: Cube Metadata (Cube: … Fact Table: … Dimensions: … Measures: … Storage(HBase): …). Target: HBase Storage (Row Key, Column Family, Column; row A, row B, row C; Val 1, Val 2, Val 3). Roles: End User, Cube Modeler, Admin
  33. 33. Optimizing Cube Builds: Streaming • Data of different ages has different analysis needs, resulting in different storage architectures • Inverted index may keep data long enough to support row-level analysis • Last Hour: inverted index, in memory, row level • Last Week: cube segments, in memory, minute granularity • Months: cube segments, on disk, hour granularity
  34. 34. Analytics Query Taxonomy (from OLAP to OLTP; levels: Strategy, Operation, Transaction). Kylin is designed to accelerate 80+% of analytics queries on Hadoop • High Level Aggregation: very high level, e.g. GMV by site by vertical by weeks • Analysis Query: middle level, e.g. GMV by site by vertical, by category (level x), past 12 weeks • Drill Down to Detail: detail level (Summary Table) • Low Level Aggregation: first level aggregation • Transaction Level: transaction data
  35. 35. Cube Build Job Flow
  36. 36. Technical Challenges • Huge volume data • Table scan • Big table joins • Data shuffling • Analysis on different granularity • Runtime aggregation expensive • Map Reduce job • Batch processing
  37. 37. Big Data Era • More and more data becoming available on Hadoop • Limitations in existing Business Intelligence (BI) tools – Limited support for Hadoop – Data size growing exponentially – High latency of interactive queries – Scale-up architecture • Challenges to adopt Hadoop as interactive analysis system – Majority of analyst groups are SQL savvy – No mature SQL interface on Hadoop – OLAP capability on Hadoop ecosystem not ready yet
  38. 38. Streaming Cubes • Cube takes time to build, how about real-time analysis? • Sometimes we want to drill down to row level information • Streaming Cubing – Build micro cube segments from streaming – Use inverted index to capture last minute data
  39. 39. Query Engine – Kylin Explain Plan
  40. 40. Streaming… Diagram labels: Kafka, streaming, Real-time Store, Historic Store, Inverted Index, Cube, minutes batch, hourly/daily batch, Hybrid Storage Interface, SQL Query
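
One way to picture the hybrid storage interface is as a router that splits a query's time range between the historic cube and the real-time inverted index, then adds the partial results. The sketch below is an assumption about how such routing could work, with hours as plain numbers and invented sample data; the slides do not describe the actual mechanism.

```python
def route(query_start, query_end, cube_built_until):
    """Split [query_start, query_end) at the point up to which cubes are built."""
    plan = []
    if query_start < cube_built_until:
        plan.append(("cube", query_start, min(query_end, cube_built_until)))
    if query_end > cube_built_until:
        plan.append(("inverted_index", max(query_start, cube_built_until), query_end))
    return plan


def hybrid_sum(start, end, built_until, cube_hours, index_rows):
    """Answer SUM(measure) over a time range from both stores."""
    total = 0
    for store, lo, hi in route(start, end, built_until):
        if store == "cube":
            # Pre-aggregated: one value per hour.
            total += sum(v for hour, v in cube_hours.items() if lo <= hour < hi)
        else:
            # Row level: one entry per event, aggregated at query time.
            total += sum(v for ts, v in index_rows if lo <= ts < hi)
    return total


cube_hours = {0: 10, 1: 20}               # hourly totals already cubed
index_rows = [(2.25, 1), (2.5, 2)]        # recent raw events, not yet cubed
plan = route(1, 3, 2)
total = hybrid_sum(1, 3, 2, cube_hours, index_rows)
print(plan, total)                        # total is 23
```

The caller issues a single SQL query and never sees which store answered which part of the range.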
  41. 41. http://kylin.io Agenda • What’s Apache Kylin? • Features & Tech Highlights • Performance • Roadmap • Q & A
  42. 42. Kylin vs. Hive (based on a 12+ billion record case)
      # | Query Type | Return Dataset | Query on Kylin (s) | Query on Hive (s) | Comments
      1 | High Level Aggregation | 4 | 0.129 | 157.437 | 1,217 times
      2 | Analysis Query | 22,669 | 1.615 | 109.206 | 68 times
      3 | Drill Down to Detail | 325,029 | 12.058 | 113.123 | 9 times
      4 | Drill Down to Detail | 524,780 | 22.42 | 6383.21 | 278 times
      5 | Data Dump | 972,002 | 49.054 | N/A |
      Chart: query time on Hive vs. Kylin for SQL #1 to SQL #3 (axis 0 to 200).
      Query types: High Level Aggregation, Analysis Query, Drill Down to Detail, Low Level Aggregation, Transaction Level.
  43. 43. Performance -- Concurrency Linear scale out with more nodes
  44. 44. Performance - Query Latency 90%tile queries <5s Green Line: 90%tile queries Gray Line: 95%tile queries
  45. 45. http://kylin.io Agenda • What’s Apache Kylin? • Features & Tech Highlights • Performance • Roadmap • Q & A
  46. 46. Kylin Evolution Roadmap
  47. 47. Kylin Ecosystem • Kylin Core – Fundamental framework of Kylin OLAP Engine • Extension – Plugins to support additional functions and features • Integration – Lifecycle management support to integrate with other applications • Interface – Allows third-party users to build more features via a user interface atop Kylin core • Driver – ODBC and JDBC drivers. Diagram: Kylin OLAP Core; Extension: Security, Redis Storage, Spark Engine, Docker; Interface: Web Console, Customized BI, Ambari/Hue Plugin; Integration: ODBC Driver, ETL, Drill, SparkSQL
  48. 48. Adding Spark Support • Cubing Efficiency – MR is not the optimal framework – Spark Cubing Engine • Source from SparkSQL – Read data from SparkSQL instead of Hive • Route to SparkSQL – Unsupported queries can be covered by SparkSQL
  49. 49. The DRY Principle • Hive – Input source – Pre-join star schema during cube building • MapReduce – Pre-aggregate metrics during cube building • HDFS – Store intermediate files during cube building • HBase – Store data cube – Serve queries on data cube – Coprocessor is used for query processing • Calcite – Query parsing – Query execution
  50. 50. If you want to go fast, go alone. If you want to go far, go together. --African Proverb http://kylin.io