Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Apache Kylin - Balance between space and time - Hadoop Summit 2015

2,151 views

Published on

Hadoop Summit 2015 Presentation on Apache Kylin - Balance between space and time

Published in: Technology

Apache Kylin - Balance between space and time - Hadoop Summit 2015

  1. 1. Apache Kylin Balance between Space and Time Debashis Saha | Luke Han 2015-06-09 http://kylin.io | @ApacheKylin
  2. 2. About us §Debashis Saha (@debashis_saha ) - VP, eBay Cloud Services (Platform, Infrastructure, Data) §Luke Han (@lukehq) - Sr. Product Manager, Analytics Data Infrastructure - Committer & PMC Member of Apache Kylin
  3. 3. Agenda §About Apache Kylin §Feature Highlights §Tech Highlights §Roadmap §Q&A http://kylin.io
  4. 4. What kylin    /  ˈkiːˈlɪn  /  麒麟 -­‐-­‐n.  (in  Chinese  art)  a  mythical  animal  of  composite  form   Extreme OLAP Engine for Big Data Kylin is an open source Distributed Analytics Engine, contributed by eBay Inc., provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets http://kylin.io • Open  Sourced  on  Oct  1st,  2014   • Accepted  as  Apache  Incubator  Project  on  Nov  25th,  2014   • http://kylin.io  (http://kylin.incubator.apache.org) @ApacheKylin
  5. 5. § External § 25+ contributors in community § Adoption: § On Production: Baidu Map § Evaluation: Huawei, Bloomberg Law, British Gas, JD.com Microsoft, Tableau… § eBay Internal Cases - 90% ile query < 5 seconds Case Cube  Size Raw  Records User  Session  Analysis 26  TB 28+  billion  rows Traffic  Analysis 21  TB 20+  billion  rows Behavior  Analysis 560  GB 1.2+  billion  rows —from mailing list Who are using Kylin? http://kylin.io
  6. 6. Why http://kylin.io Happiness Latency 10s size
  7. 7. Balance Between Space and Time http://kylin.io time, item time, item, location time, item, location, supplier time item location supplier time, location Time, supplier item, location item, supplier location, supplier time, item, supplier time, location, supplier item, location, supplier 0-D(apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D(base) cuboid • Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells 1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier> 2. (9/15, milk, Urbana, *) - <time, item, location> 3. (*, milk, Urbana, *) - <item, location> 4. (*, milk, Chicago, *) - <item, location> 5. (*, milk, *, *) - <item> • Cuboid = one combination of dimensions • Cube = all combination of dimensions (all cuboids) OLAP Cube
  8. 8. How http://kylin.io Map Reduce Kylin BI Tools,Web App… ANSI SQL
  9. 9. Agenda §About Apache Kylin §Feature Highlights §Tech Highlights §Roadmap §Q&A http://kylin.io
  10. 10. Feature Highlights • Extremely Fast OLAP Engine at scale • ANSI SQL Interface on Hadoop • Seamless Integration with BI Tools, like Tableau • Interactive Query Capability • MOLAP Cube • Incremental Build of Cubes • Approximate Query Capability for Distinct Count (HyperLogLog) • Leverage HBase Coprocessor for query latency • Job Management and Monitoring • User friendly Web GUI for manage, build, monitor and query cubes • Security capability to set ACL at Cube/Project Level • Support LDAP Integration http://kylin.io
  11. 11. Define Data Model http://kylin.io
  12. 12. Manage Jobs http://kylin.io
  13. 13. Explore the Data http://kylin.io
  14. 14. Interactive with BI Tool - Tableau http://kylin.io
  15. 15. Agenda §About Apache Kylin §Feature Highlights §Tech Highlights §Roadmap §Q&A http://kylin.io
  16. 16. http://kylin.io Cube  Build  Engine   (MapReduce…) SQL Low    Latency  -­‐  SecondsMid  Latency  -­‐  Minutes Routing 3rd  Party  App   (Web  App,  Mobile…) Metadata SQL-­‐Based  Tool   (BI  Tools:  Tableau…) Query  Engine Hadoop Hive REST  API JDBC/ODBC ➢Online  Analysis  Data  Flow   ➢Offline  Data  Flow   ➢Clients/Users  interactive  with  Kylin   via  SQL   ➢OLAP  Cube  is  transparent  to  users Star  Schema  Data Key  Value  Data Data   Cube OLAP   Cube   (HBase) SQL REST  Server Kylin Architecture Overview
  17. 17. http://kylin.io Cube:  …   Fact  Table:  …   Dimensions:  …   Measures:  …   Storage(HBase):  … Fact Dim Dim Dim Source   Star  Schema row  A row  B row  C Column  Family Val  1 Val  2 Val  3 Row  Key Column Target     HBase  Storage Mapping   Cube  Metadata End  User Cube  Modeler Admin Data Modeling
  18. 18. http://kylin.io Cube Build Job Flow
  19. 19. http://kylin.io How to Store Cube - HBase Schema
  20. 20. http://kylin.io SELECT  test_cal_dt.week_beg_dt,   test_category.category_name,  test_category.lvl2_name,   test_category.lvl3_name,  test_kylin_fact.lstg_format_name,     test_sites.site_name,  SUM(test_kylin_fact.price)  AS  GMV,   COUNT(*)  AS  TRANS_CNT   FROM    test_kylin_fact        LEFT  JOIN  test_cal_dt  ON  test_kylin_fact.cal_dt  =   test_cal_dt.cal_dt      LEFT  JOIN  test_category  ON  test_kylin_fact.leaf_categ_id  =   test_category.leaf_categ_id    AND  test_kylin_fact.lstg_site_id   =  test_category.site_id      LEFT  JOIN  test_sites  ON  test_kylin_fact.lstg_site_id  =   test_sites.site_id   WHERE  test_kylin_fact.seller_id  =  123456OR   test_kylin_fact.lstg_format_name  =  ’New'   GROUP  BY  test_cal_dt.week_beg_dt,   test_category.category_name,  test_category.lvl2_name,   test_category.lvl3_name,   test_kylin_fact.lstg_format_name,test_sites.site_name OLAPToEnumerableConverter      OLAPProjectRel(WEEK_BEG_DT=[$0],  category_name=[$1],   CATEG_LVL2_NAME=[$2],  CATEG_LVL3_NAME=[$3],  LSTG_FORMAT_NAME=[$4],   SITE_NAME=[$5],  GMV=[CASE(=($7,  0),  null,  $6)],  TRANS_CNT=[$8])          OLAPAggregateRel(group=[{0,  1,  2,  3,  4,  5}],  agg#0=[$SUM0($6)],   agg#1=[COUNT($6)],  TRANS_CNT=[COUNT()])              OLAPProjectRel(WEEK_BEG_DT=[$13],  category_name=[$21],   CATEG_LVL2_NAME=[$15],  CATEG_LVL3_NAME=[$14],   LSTG_FORMAT_NAME=[$5],  SITE_NAME=[$23],  PRICE=[$0])                  OLAPFilterRel(condition=[OR(=($3,  123456),  =($5,  ’New'))])                      OLAPJoinRel(condition=[=($2,  $25)],  joinType=[left])                          OLAPJoinRel(condition=[AND(=($6,  $22),  =($2,  $17))],  joinType=[left])                              OLAPJoinRel(condition=[=($4,  $12)],  joinType=[left])                                  OLAPTableScan(table=[[DEFAULT,  TEST_KYLIN_FACT]],  fields=[[0,  1,  2,  3,   4,  5,  6,  7,  8,  9,  10,  11]])                                  OLAPTableScan(table=[[DEFAULT,  TEST_CAL_DT]],  fields=[[0,  1]])                              OLAPTableScan(table=[[DEFAULT,  test_category]],  fields=[[0,  1,  2,  3,  4,  5,   6,  7,  8]])                          OLAPTableScan(table=[[DEFAULT,  TEST_SITES]],  fields=[[0,  1,  2]]) Kylin Query Engine - Explain Plan
  21. 21. Cube Optimization § Curse  of  Dimensionality   § N  dimension  cube  has  2N  cuboid   § Full  Cube  vs.  Parqal  Cube     § Huge  Data  Volume     § Dicqonary  Encoding   § Incremental  Building http://kylin.io
  22. 22. § Full  Cube - Pre-aggregate all dimension combinations - “Curse of dimensionality”: N dimension cube has 2N cuboid. § ParOal  Cube - To avoid dimension explosion, we divide the dimensions into different aggregation groups - 2N+M+L à 2N + 2M + 2L - For cube with 30 dimensions, if we divide these dimensions into 3 group, the cuboid number will reduce from 1 Billion to 3 Thousands - 230 à 210 + 210 + 210 - Tradeoff between online aggregation and offline pre-aggregation http://kylin.io Full Cube vs. Partical Cube
  23. 23. http://kylin.io Partical Cube
  24. 24. http://kylin.io Incremental Build
  25. 25. What’s Next § Improve  cube  algorithm   § Cube  by  segments,  30%-­‐50%  faster   § Build  delay  down  to  tens  of  minutes   § Streaming  cubing   § Analyze  real-­‐qme  data   § Build  delay  down  to  seconds   § Spark http://kylin.io
  26. 26. Cube by Layer § The current algorithm - Many MRs, the number of dimensions - Huge shuffles, aggregation at reduce side, 100x of total cube size http://kylin.io Full  Data 0-­‐D  Cuboid 1-­‐D  Cuboid 2-­‐D  Cuboid 3-­‐D  Cuboid 4-­‐D  Cuboid MR MR MR MR MR
  27. 27. Cube by Segments § The to-be algorithm, 30%-50% faster - 1 round MR - Reduced shuffles, map side aggregation, 20x total cube size - Hourly incremental build done in tens of minutes http://kylin.io Data  Split Cube  Segment Data  Split Cube  Segment Data  Split Cube  Segment …… Final  Cube Merge  Sort   (Shuffle) mapper mapper mapper
  28. 28. Streaming Cubing § Cube is great but… - Cube takes time to build, how about real-time analysis? - Sometimes we want to drill down to row level information § Streaming cubing - Build micro cube segments from streaming - Use inverted index to capture last minute data http://kylin.io
  29. 29. http://kylin.io Streaming seconds  delay Last  Hour Inverted   Index Before  Last  Hour Cube Kylin Lambda Architecture minutes  delay Query  Engine ANSI  SQL Hybrid Storage Interface
  30. 30. Adding Spark Support § Cubing Efficiency § MR is not optimal framework § Spark Cubing Engine § Source from SparkSQL § Read data from SparkSQL instead of Hive § Route to SparkSQL § Unsupported queries be coved by SparkSQL http://kylin.io
  31. 31. Agenda §About Apache Kylin §Feature Highlights §Tech Highlights §Roadmap §Q&A http://kylin.io
  32. 32. Future2015 2016 Kylin Evolution Roadmap http://kylin.io 20142013 Initial Prototype  for   MOLAP   • Basic  end  to  end  POC   MOLAP   • Incremental  Refresh   • ANSI  SQL   • ODBC  Driver   • Web  GUI   • Tableau   • ACL   • Open  Source StreamingOLAP   • Streaming  OLAP   • JDBC  Driver   • New  UI   • Excel   • SparkSQL   • …  more   TBD Sep,  2013 Jan,  2014 Oct,  2014 H1,  2015 HybridOLAP   • Lambda  Arch   • Automation   • Capacity   Management   • Spark     • …  more Next  Gen   • Adv  OLAP  Functions   • In-­‐Memory  Analysis   (TBD)   • Mobile  (TBD)   • …  more
  33. 33. Kylin Ecosystem ■ Kylin Core ■ Fundamental framework of Kylin OLAP Engine ■ Extension ■ Plugins to support for additional functions and features ■ Integration ■ Lifecycle Management Support to integrate with other applications ■ Interface ■ Allows for third party users to build more features via user- interface atop Kylin core http://kylin.io Kylin OLAP Core Extension à Security à Redis Storage à Spark Engine à Docker Interface à Web Console à Customized BI à Ambari/Hue Plugin Integration à ODBC Driver à ETL à Drill à SparkSQL
  34. 34. http://kylin.io If  you  want  to  go  fast,  go  alone.   If  you  want  to  go  far,  go  together. -­‐-­‐African  Proverb dev@kylin.incubator.apache.org

×