
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin Meetup @Shanghai

Published in: Software


  1. http://kylin.io | Apache Kylin Deep Dive: Streaming & Plugin Architecture | Oct 10, 2015 | @ApacheKylin | Yang Li, Architect & Tech Leader | yangli9@ebay.com
  2. Agenda: What's Apache Kylin?; Plugin Architecture; Fast Cubing; Streaming Cubing; Summary
  3. What's Kylin: Extreme OLAP Engine for Big Data
      - Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets. It was originally contributed by eBay Inc.
      - kylin / ˈkiːˈlɪn / 麒麟, n. (in Chinese art) a mythical animal of composite form
      - Open sourced on Oct 1st, 2014
      - Accepted as an Apache Incubator project on Nov 25th, 2014
  4. Big Data @ eBay
      - $28B GMV via mobile (2014)
      - 266M mobile app globally
      - 1B listings created via mobile
      - 157M active buyers
      - 25M active sellers
      - 800M active listings
      - 8.8M new listings every week
  5. Feature – Big Data
      - eBay
      - Adoptions
          - Baidu Map, China Mobile, 明略数据 (MiningLamp), 京东 (JD.com), 美团 (Meituan), 唯品会 (Vipshop)…
          - Expedia, Microsoft, Tableau, Infoworks.io…

      | Case | Cube Size | Raw Records |
      | --- | --- | --- |
      | Session Analysis | 20 TB | 81+ billion rows |
      | Traffic Analysis | 30 TB | 28+ billion rows |
      | Transaction Analysis | 560 GB | 1.2+ billion rows |
  6. Feature – SQL Interface
  7. Feature – BI Integration via ODBC, JDBC
  8. Feature – Low Latency
      - 90% of queries return in under 5 seconds
      - Chart legend: dark-blue line is the 90th percentile of queries; light-blue line is the 95th percentile
      - Chart caption: 90% of queries return in 3 seconds
  9. Feature – Scalable Throughput: linear scale out with more nodes
  10. How it works – Materialized View: a query may consider only 3 dimensions
  11. How it works – OLAP Cube, space for time
      - Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
          1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
          2. (9/15, milk, Urbana, *) - <time, item, location>
          3. (*, milk, Urbana, *) - <item, location>
          4. (*, milk, Chicago, *) - <item, location>
          5. (*, milk, *, *) - <item>
      - Cuboid = one combination of dimensions
      - Cube = all combinations of dimensions (all cuboids)
      - Cuboid lattice (diagram)
          - 0-D (apex) cuboid
          - 1-D cuboids: time; item; location; supplier
          - 2-D cuboids: time, item; time, location; time, supplier; item, location; item, supplier; location, supplier
          - 3-D cuboids: time, item, location; time, item, supplier; time, location, supplier; item, location, supplier
          - 4-D (base) cuboid: time, item, location, supplier
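The cuboid lattice on this slide can be enumerated mechanically: a cube over n dimensions has 2^n cuboids, one per subset of the dimensions. The following is a minimal sketch, not Kylin code; the function name `enumerate_cuboids` is an assumption made for illustration.

```python
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Group every cuboid (subset of the dimensions) by its dimension count."""
    levels = {}
    for k in range(len(dimensions) + 1):
        # combinations() yields each k-sized subset once, in input order
        levels[k] = list(combinations(dimensions, k))
    return levels


dims = ["time", "item", "location", "supplier"]
levels = enumerate_cuboids(dims)
for k, cuboids in levels.items():
    print(f"{k}-D: {len(cuboids)} cuboid(s)")

# 4 dimensions give 2**4 = 16 cuboids in total
total = sum(len(cuboids) for cuboids in levels.values())
print("total cuboids:", total)
```

The count doubles with every added dimension, which is the "space for time" trade named in the slide title.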
  12. Agenda: What's Apache Kylin?; Plugin Architecture; Fast Cubing; Streaming Cubing; Summary
  13. Kylin Architecture Overview (diagram)
      - Clients: 3rd party apps (web app, mobile…) via REST API; SQL-based tools (BI tools: Tableau…) via JDBC/ODBC
      - Kylin components: REST Server, Query Engine, Routing, Metadata, Cube Builder (MapReduce…)
      - Abstractions: Data Source Abstraction, Engine Abstraction, Storage Abstraction
      - Data: star schema data in Hadoop Hive; key-value data in OLAP cubes (HBase)
      - Legend: online analysis data flow (SQL, low latency, seconds); offline data flow
      - Clients/users interact with Kylin via SQL
      - OLAP cube is transparent to users
  14. Plugin Architecture (diagram): IN: Hive; Engine; OUT: HBase; Cube Meta
  15. Plugin Architecture (diagram): Hive; Hive Adapter; MapRed; HBase Adapter; HBase; Cube Meta
  16. 2.x Developing Modules
      - Engine: MR V1; MR V2; Spark; Streaming
      - Source: Hive; Kafka; Spark SQL & DataFrames
      - Storage: HBase; ? Kudu (Cloudera); ? Cassandra
  17. The Freedom, Extensibility, Flexibility
      - The freedom
          - Zoo break, not bound to Hadoop any more
          - Free to go to a better engine or storage
      - Extensibility
          - Accept any input, e.g. Kafka
          - Embrace next-gen distributed platforms, e.g. Spark
      - Flexibility
          - Choose a different engine for each data set
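The separation into source, engine, and storage can be pictured as three small interfaces that are wired together at build time. This is a sketch under stated assumptions: every class and method name below (`Source`, `Engine`, `Storage`, `ListSource`, `DictStorage`, `InMemoryEngine`) is hypothetical and does not reproduce Kylin's actual Java interfaces.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Source(ABC):
    """Hypothetical source abstraction (a Hive or Kafka adapter would sit here)."""
    @abstractmethod
    def read(self): ...


class Storage(ABC):
    """Hypothetical storage abstraction (an HBase adapter would sit here)."""
    @abstractmethod
    def write(self, cuboid, rows): ...


class Engine(ABC):
    """Hypothetical engine abstraction (MR, Spark, or streaming would sit here)."""
    @abstractmethod
    def build(self, source, storage, dimensions, measure): ...


class ListSource(Source):
    def __init__(self, records):
        self.records = records

    def read(self):
        return iter(self.records)


class DictStorage(Storage):
    def __init__(self):
        self.tables = {}

    def write(self, cuboid, rows):
        self.tables[cuboid] = dict(rows)


class InMemoryEngine(Engine):
    def build(self, source, storage, dimensions, measure):
        # Aggregate the measure per dimension-value combination (one cuboid only)
        totals = defaultdict(int)
        for record in source.read():
            totals[tuple(record[d] for d in dimensions)] += record[measure]
        storage.write(tuple(dimensions), totals)


records = [
    {"item": "milk", "location": "Urbana", "gmv": 3},
    {"item": "milk", "location": "Urbana", "gmv": 4},
    {"item": "bread", "location": "Chicago", "gmv": 2},
]
storage = DictStorage()
InMemoryEngine().build(ListSource(records), storage, ["item", "location"], "gmv")
print(storage.tables)
```

Because the engine only sees the two abstract interfaces, swapping `ListSource` for another source or `DictStorage` for another store would not touch the engine code, which is the flexibility the slide describes.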
  18. Agenda: What's Apache Kylin?; Plugin Architecture; Fast Cubing; Streaming Cubing; Summary
  19. Layered Cubing (MR Engine V1) (diagram): one MR round per layer, from Full Data to the 4-D cuboid (A,B,C,D), then the 3-D cuboids (A,B,C; A,B,D; A,C,D; B,C,D), then the 2-D, 1-D, and 0-D cuboids
  20. Layered Cubing (MR Engine V1)
      - Pros
          - Simple implementation; depends on MR shuffle to merge sort and then aggregate
          - Little requirement on memory
      - Cons
          - Aggregation happens at reducer side
          - Mapper outputs raw data, thus shuffle is huge
          - Multiple rounds of MR overhead
          - Shuffle can be 100x of cube size, big I/O pressure
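The layer-by-layer idea can be sketched in a single process: build the base cuboid from the raw records, then derive each smaller cuboid from a parent one layer above by dropping a dimension and re-aggregating. This is an illustration only, assuming a sum measure; the function `layered_cubing` is a made-up name, and each pass of the outer loop stands in for one MR round.

```python
from collections import defaultdict
from itertools import combinations


def layered_cubing(records, dims, measure):
    """Build all cuboids layer by layer, from the base cuboid down to the apex."""
    base = tuple(dims)
    cube = {base: defaultdict(int)}
    for record in records:
        cube[base][tuple(record[d] for d in base)] += record[measure]

    # One "round" per layer: (N-1)-D cuboids, then (N-2)-D, ... down to 0-D
    for k in range(len(dims) - 1, -1, -1):
        for child in combinations(dims, k):
            # Any cuboid with exactly one extra dimension can serve as parent
            parent = next(
                p for p in cube if len(p) == k + 1 and set(child) <= set(p)
            )
            positions = [parent.index(d) for d in child]
            rows = defaultdict(int)
            for key, value in cube[parent].items():
                rows[tuple(key[i] for i in positions)] += value
            cube[child] = rows
    return cube


records = [
    {"item": "milk", "location": "Urbana", "gmv": 3},
    {"item": "milk", "location": "Chicago", "gmv": 5},
    {"item": "bread", "location": "Urbana", "gmv": 2},
]
cube = layered_cubing(records, ["item", "location"], "gmv")
print(dict(cube[("item",)]))
print(dict(cube[()]))
```

In the MR setting every one of these rounds writes its full input through the shuffle before aggregating, which is where the I/O pressure listed under Cons comes from.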
  21. Fast Cubing (MR Engine V2) (diagram): each mapper turns one data split into a cube segment; a merge sort (shuffle) feeds the reducer, which produces the final cube
  22. Fast Cubing (MR Engine V2)
      - One round of MR calculates the whole cube
          - Minimizes scheduling overhead
      - Aggregation happens at mapper side
          - 1M raw records become 10K at base level
          - Reduced shuffle size, 20x total cube size
      - Memory eater
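Mapper-side aggregation can be sketched as follows: each mapper aggregates its data split into a complete cube segment in memory, and the reducer only sums values that share a key. This is a simplified illustration, not the star-cubing variant the deck refers to; it naively projects every record onto every cuboid, and the names `build_segment` and `merge_segments` are assumptions.

```python
from collections import Counter
from itertools import combinations


def build_segment(split, dims, measure):
    """Mapper side: aggregate one data split into every cuboid, in memory."""
    segment = Counter()
    for record in split:
        for k in range(len(dims) + 1):
            for cuboid in combinations(dims, k):
                key = (cuboid, tuple(record[d] for d in cuboid))
                segment[key] += record[measure]
    return segment


def merge_segments(segments):
    """Reducer side: sum the values that share the same (cuboid, key)."""
    final = Counter()
    for segment in segments:
        final.update(segment)  # Counter.update adds values, it does not overwrite
    return final


dims = ["item", "location"]
splits = [
    [{"item": "milk", "location": "Urbana", "gmv": 3},
     {"item": "milk", "location": "Urbana", "gmv": 4}],
    [{"item": "milk", "location": "Chicago", "gmv": 5},
     {"item": "bread", "location": "Urbana", "gmv": 2}],
]
segments = [build_segment(split, dims, "gmv") for split in splits]
final = merge_segments(segments)

# Only pre-aggregated rows cross the shuffle, not one row per raw record
print("rows shuffled:", sum(len(s) for s in segments))
print("milk total:", final[(("item",), ("milk",))])
```

The first split holds two raw records with identical dimension values, so its segment carries one row per cuboid instead of two. That is the effect behind the smaller shuffle, and holding whole segments in memory is why the slide calls the approach a memory eater.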
  23. In-Mem Cubing
      - A simplified star cubing algorithm
          - Xin, Dong, et al. "Star-cubing: Computing iceberg cubes by top-down and bottom-up integration." Proceedings of the 29th International Conference on Very Large Data Bases, Volume 29. VLDB Endowment, 2003.
      - Top-down; free resources when a branch completes
      - Multi-threading if memory is available; ordered output
  24. Fast Cubing Summary
      - Pros
          - Less network pressure
          - Independent cubing algorithm that can be reused by Streaming, Spark, etc.
          - Seems 30%-50% faster
      - Cons
          - Code complexity
          - High mapper CPU/memory consumption
  25. Comparison on ~500 GB cubes: fast cubing is 30%-50% faster (bar chart: Layered Cubing vs. Fast Cubing for Case 1 and Case 2; axis from 0 to 120)
  26. Agenda: What's Apache Kylin?; Plugin Architecture; Fast Cubing; Streaming Cubing; Summary
  27. Incremental Build
  28. Push the Idea to Near Realtime
      - Do micro batches at minute intervals
      - Source data from streaming input
      - Fast cubing
          - Xin, Dong, et al. "Star-cubing: Computing iceberg cubes by top-down and bottom-up integration." Proceedings of the 29th International Conference on Very Large Data Bases, Volume 29. VLDB Endowment, 2003.
      - Cube auto merge and garbage collection
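Micro-batching at minute intervals amounts to bucketing incoming events by the start of their time window, so that each bucket can then be cubed as its own small segment. The sketch below assumes events carry an integer epoch-seconds field named `ts`; that field name and the function `micro_batches` are illustrative assumptions, not Kylin's streaming API.

```python
from collections import defaultdict


def micro_batches(events, interval_seconds=60):
    """Group events into fixed windows keyed by the window start time."""
    batches = defaultdict(list)
    for event in events:
        # Round the timestamp down to the start of its window
        window_start = event["ts"] - event["ts"] % interval_seconds
        batches[window_start].append(event)
    # Return windows in time order so segments are built oldest first
    return dict(sorted(batches.items()))


events = [
    {"ts": 0, "item": "milk", "gmv": 3},
    {"ts": 59, "item": "milk", "gmv": 4},
    {"ts": 60, "item": "bread", "gmv": 2},
    {"ts": 125, "item": "milk", "gmv": 5},
]
batches = micro_batches(events)
for window_start, batch in batches.items():
    print(window_start, len(batch))
```

Many small per-minute segments accumulate quickly, which is why the slide pairs micro-batching with automatic merging and garbage collection.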
  29. Streaming Setup (diagram): Kafka; Kafka Adapter; Fast Cubing; HBase Adapter; HBase; Streaming Cube
  30. Stream Data Consuming
  31. Cube Auto Merge (diagram): In-Memory Cube Building; Auto Cube Merge with MR
  32. Use Case: SEO Operational Dashboard
      - Dimensions
          - eBay Site: ebay.com, ebay.co.uk, ebay.de
          - Buyer Country: US, CN, RU
          - Search Engine: Google, Bing, Yahoo!
          - Referrer: google.com, google.co.uk
          - Page: Search, View Item, Product
          - User Experience: Desktop, Mobile APP, mWeb
      - Measurements
          - Visits, GMB $, GMB share, conversion rate, bounce rate, # of view items, # of bought items, etc.
  33. Future Lambda Architecture for Realtime (diagram)
      - Components: Kafka (streaming); Real-time In-Mem Store; Inverted Index; Cube Storage; Cube; Hybrid Storage Interface; SQL Query
      - Labels: minute batch; latest second
  34. TopN Support
      - Example query: select dt, loc, item, sum(gmv) from test_kylin_fact where dt='2015-10-1' and loc='CN' group by dt, loc, item order by sum(gmv) desc limit 100
      - Cube pre-calculation (example row): DT,LOC = 2015-10-1,CN; TopN = Item A, $500; Item B, $300; …
      - TopN as a measure
          - Answer TopN queries directly from pre-calculation
      - Approximate algorithm
          - SpaceSaving TopN
              - Ahmed Metwally, et al. "Efficient computation of frequent and top-k elements in data streams". Proceedings of the 10th International Conference on Database Theory (ICDT'05), 2005.
          - A parallel version
              - Massimo Cafaro, et al. "A parallel space saving algorithm for frequent items and the Hurwitz zeta distribution". arXiv:1401.0702v12 [cs.DS], 19 Sep 2015.
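The SpaceSaving idea cited on this slide keeps a fixed number of counters and, when a new item arrives while all counters are taken, evicts the smallest one and hands its count to the newcomer. The following is a minimal single-threaded sketch of a weighted variant, so it can sum a measure such as GMV. It is not Kylin's implementation, and the function name `space_saving` is an assumption.

```python
def space_saving(stream, capacity):
    """Approximate top-N over (item, weight) pairs using at most `capacity` counters."""
    counters = {}
    for item, weight in stream:
        if item in counters:
            counters[item] += weight
        elif len(counters) < capacity:
            counters[item] = weight
        else:
            # Evict the smallest counter; the newcomer inherits its count,
            # so a stored count may overestimate but never underestimates
            victim = min(counters, key=counters.get)
            floor = counters.pop(victim)
            counters[item] = floor + weight
    return sorted(counters.items(), key=lambda kv: kv[1], reverse=True)


stream = [("A", 500), ("B", 300), ("C", 5), ("D", 7), ("A", 100)]
top = space_saving(stream, capacity=3)
print(top)
```

In this run item D ends with a count of 12 rather than its true 7, because it inherited the 5 from the evicted item C. Heavy items such as A and B keep exact totals, which is why the approach suits TopN even though it is approximate.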
  35. Agenda: What's Apache Kylin?; Plugin Architecture; Fast Cubing; Streaming Cubing; Summary
  36. New features in 2.x
      - Coming soon…
      - Plugin Architecture
          - Replaceable engine, storage, source
      - Fast Cubing
          - 30%-50% faster
      - Streaming Cubing
          - Support NRT analysis
          - Lightning fast TopN
  37. We are hiring
      - Kylin Site: http://kylin.io
      - Twitter/微博 (Weibo): @ApacheKylin
      - 微信公众号 (WeChat official account): ApacheKylin
