
PCM18 (Big Data Analytics)

PCM18 (Big Data Analytics) slides of Pentaho Community Meetup in Bologna about Big Data OLAP using Pentaho, Vertica and Kylin

Published in: Data & Analytics

  1. 1. www.stratebi.com Emilio Arias Co-Founder at StrateBI. Follow us on Twitter @Stratebi @TodoBI_OS
  2. 2. www.stratebi.com Roberto Tardío Head of Big Data at StrateBI. Follow me on Twitter @RoberTardio
  3. 3. www.stratebi.com
  4. 4. www.stratebi.com
  5. 5. www.stratebi.com • OLAP (On-Line Analytical Processing) • Analytical systems that enable interactive queries. • Requires very low query latency: milliseconds to seconds. • Usually supports SQL and, sometimes, the MDX query language. • Enables aggregation and filtering of KPI data across hierarchical multidimensional structures (OLAP cubes). • Used as a data source for different goals: • Detailed data analysis (OLAP views). • Dashboarding. • Reporting.
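As a rough illustration of aggregating and filtering a KPI across a dimension hierarchy, the sketch below rolls a tiny in-memory fact table up from city to country level and then slices it by country and year. The rows, column names and the `aggregate` helper are invented for this example and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical fact rows: (country, city, year, sales). Invented data.
FACTS = [
    ("Spain", "Madrid", 2017, 120.0),
    ("Spain", "Madrid", 2018, 150.0),
    ("Spain", "Alicante", 2018, 80.0),
    ("Italy", "Bologna", 2018, 95.0),
]

COLUMNS = ("country", "city", "year")


def aggregate(facts, group_by, where=None):
    """Sum the sales KPI grouped by the given dimension levels.

    group_by: tuple of column names, e.g. ("country",) for a roll-up.
    where: optional dict of column -> value used as a slice filter.
    """
    where = where or {}
    totals = defaultdict(float)
    for row in facts:
        dims = dict(zip(COLUMNS, row[:3]))
        # Skip rows that do not match every filter condition.
        if any(dims[col] != val for col, val in where.items()):
            continue
        key = tuple(dims[col] for col in group_by)
        totals[key] += row[3]
    return dict(totals)


# Detailed level: sales per (country, city).
by_city = aggregate(FACTS, ("country", "city"))
# Roll-up: the same KPI one level higher in the hierarchy.
by_country = aggregate(FACTS, ("country",))
# Slice on country and year, then view the result per city.
spain_2018 = aggregate(FACTS, ("city",), {"country": "Spain", "year": 2018})
```

An OLAP engine answers the same kind of question over billions of rows, which is why the latency requirement above becomes the hard part.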
  6. 6. www.stratebi.com • Big Data OLAP • Big Data: Volume, Variety and Velocity. • OLAP applications over Big Data sets. • Main challenges: • Very low query latency over fact and dimension tables of billions to trillions of rows. • Support for ANSI SQL and BI Tools integration. • Real time data ingestion and processing.
  7. 7. www.stratebi.com • Current Approaches (some of them).
  8. 8. www.stratebi.com • Why Apache Kylin? • Sub-second queries over fact tables with more than 12 billion rows. • Best query latency results (in our deployments and benchmarks). • ANSI SQL and BI tools integration. • Integration with Pentaho possible through JDBC, Mondrian and PME. • Also Superset, Tableau, Power BI, Zeppelin, MicroStrategy… • Full support for star and snowflake schemas. • Not all tools support them (e.g. Druid). • Near real-time data ingestion (Kafka) and processing. • It is an Apache open-source project. • Currently in version 2.5.
  9. 9. www.stratebi.com • Apache Kylin Architecture • M-OLAP approach: • Data pre-aggregation. • Enables only analytical queries. • Hadoop-based tool • Full scalability • Hadoop nodes • HBase and Kylin on separate clusters (if needed)
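The pre-aggregation idea behind the M-OLAP approach can be sketched in a few lines: at build time, the measure is summed for every combination of dimensions (a "cuboid"), and at query time the answer is a lookup rather than a scan. This is a deliberately simplified toy, not Kylin's implementation; the dimensions, rows and function names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("country", "year", "product")

# Invented rows: (country, year, product, revenue).
ROWS = [
    ("Spain", 2017, "A", 10),
    ("Spain", 2018, "A", 20),
    ("Spain", 2018, "B", 5),
    ("Italy", 2018, "A", 7),
]


def build_cube(rows):
    """Pre-aggregate revenue for every combination of dimensions (cuboid)."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), size):
            cuboid = defaultdict(int)
            for row in rows:
                cuboid[tuple(row[i] for i in dims)] += row[-1]
            cube[tuple(DIMENSIONS[i] for i in dims)] = dict(cuboid)
    return cube


def query(cube, group_by):
    """Answer an aggregate query by lookup; the raw rows are never scanned."""
    # Cuboids are keyed in DIMENSIONS order, so normalise the requested order.
    key = tuple(d for d in DIMENSIONS if d in group_by)
    return cube[key]


CUBE = build_cube(ROWS)
revenue_by_year = query(CUBE, ("year",))
```

Because only aggregates are stored, a design like this serves analytical queries but cannot return individual source rows, which matches the "enables only analytical queries" point above. The number of cuboids grows as 2^n with the number of dimensions, so build time and storage are the price paid for low query latency.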
  10. 10. www.stratebi.com
  11. 11. www.stratebi.com • A real love story
  12. 12. www.stratebi.com • Why Apache Kylin and Pentaho BA Server? • It is becoming more and more necessary to provide dashboarding, reporting and OLAP viewing in Big Data scenarios. • Using our STTools Pentaho plugins: STPivot, STReport, STDashboard,… • Also Pentaho Reporting, Community Dashboard Editor, Saiku (plugin),… • Both Kylin and Pentaho are leading BI & Big Data open-source tools. • Pentaho enables integration with the best-known Big Data tools: Hive, Impala, Spark SQL,… • Integration with Pentaho possible through JDBC, Mondrian and PME • Mondrian 4.X, using the existing Mondrian 4.4 (Lagunitas). • Mondrian 3.X, with a great effort by our team. • Using Pentaho BA Server 7.1
  13. 13. www.stratebi.com • Identified issues and solutions: Kylin and Mondrian 3.X (3.14) • Issue 1: Kylin needs ANSI-92 inner joins, but Mondrian 3.X generated old-style joins. • Solution: We defined a Mondrian dialect and used this patch to implement the allowsJoinOn() method. • Issue 2: Mondrian's native cross join and non-empty properties produced invalid SQL for Kylin. • Solution: We disabled these properties for the Kylin dialect. • Issue 3: Kylin needs the fact table to be the first table in the SQL FROM clause. • Workaround: We modified the Mondrian code to identify fact tables by a name prefix (F or FT) and place them first in the FROM clause.
  14. 14. www.stratebi.com • Identified issues and solutions: Kylin and Mondrian 3.X • Some useful references: • How to implement Kylin dialect for Mondrian • https://web.archive.org/web/20171010103502/http://dekarlab.de/wp/?p=443 • Pentaho JIRA - MONDRIAN-955 • Mondrian should support the Dialect.allowsJoinOn() option • Patch • Pentaho JIRA - MONDRIAN-2364 • Add dialect for Apache Kylin
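To make issues 1 and 3 concrete, the sketch below generates the same query in both join styles and moves the fact table to the front of the FROM clause based on its name prefix. It is an illustration in Python, not the Mondrian (Java) code the team modified; the table names, the join conditions and the underscore in the `F_`/`FT_` prefixes are assumptions, since the slides only mention "F or FT".

```python
# Assumed naming convention; the slides only say the prefix is "F or FT".
FACT_PREFIXES = ("F_", "FT_")


def is_fact_table(name):
    """Treat a table as a fact table when its name starts with a known prefix."""
    return name.upper().startswith(FACT_PREFIXES)


def old_style_join(tables, conditions, measure):
    """Comma-separated FROM with the join conditions in WHERE (pre-ANSI-92 style)."""
    return (f"SELECT {measure} FROM {', '.join(tables)} "
            f"WHERE {' AND '.join(conditions.values())}")


def ansi92_join(tables, conditions, measure):
    """Explicit INNER JOIN ... ON, with the fact table placed first in FROM."""
    facts = [t for t in tables if is_fact_table(t)]
    dims = [t for t in tables if not is_fact_table(t)]
    if len(facts) != 1:
        raise ValueError("expected exactly one fact table")
    sql = f"SELECT {measure} FROM {facts[0]}"
    for dim in dims:
        sql += f" INNER JOIN {dim} ON {conditions[dim]}"
    return sql


# Hypothetical star schema: one fact table and two dimensions.
TABLES = ["D_CUSTOMER", "FT_SALES", "D_DATE"]
CONDITIONS = {
    "D_CUSTOMER": "FT_SALES.CUSTOMER_ID = D_CUSTOMER.ID",
    "D_DATE": "FT_SALES.DATE_ID = D_DATE.ID",
}

old_sql = old_style_join(TABLES, CONDITIONS, "SUM(FT_SALES.AMOUNT)")
new_sql = ansi92_join(TABLES, CONDITIONS, "SUM(FT_SALES.AMOUNT)")
```

Here `old_sql` lists the tables in their original order with the conditions in WHERE, while `new_sql` starts from `FT_SALES` and joins each dimension explicitly, which is the shape the slide says Kylin requires.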
  15. 15. www.stratebi.com • Identified issues and solutions: Kylin and Pentaho Metadata Editor • Issue 1: There is no dialect for Kylin in PME. • Solution: Definition of the Kylin dialect using the Hive 2 SQL dialect. • Works perfectly without changing anything. • JDBC connections between Pentaho BA Server and Kylin: • Initially we used the generic connection through a JDBC driver. • To simplify the connection, we defined the connection interface for Kylin in Pentaho BA Server. • We have used Pentaho BA Server 7.1 but a connection to Kylin has not yet been included in Pentaho 8.1.
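For the generic JDBC route, the two values a connection dialog typically needs are the driver class and the connection URL. The helper below assembles them in the form used by the Kylin JDBC driver (`jdbc:kylin://<host>:<port>/<project>`, driver class `org.apache.kylin.jdbc.Driver`). The host and project names are placeholders, the default port of 7070 is an assumption about the deployment, and the function itself is hypothetical, so check these against your own Kylin installation.

```python
KYLIN_DRIVER_CLASS = "org.apache.kylin.jdbc.Driver"


def kylin_jdbc_settings(host, project, port=7070):
    """Build the driver class and URL for a generic JDBC connection to Kylin.

    host and project are placeholders supplied by the caller; port 7070 is
    assumed as the default and may differ in a given deployment.
    """
    if not host or not project:
        raise ValueError("host and project are required")
    return {
        "driver_class": KYLIN_DRIVER_CLASS,
        "url": f"jdbc:kylin://{host}:{port}/{project}",
    }


# Placeholder values for illustration only.
settings = kylin_jdbc_settings("kylin.example.com", "ssb")
```

The resulting `settings["url"]` and `settings["driver_class"]` are the values one would paste into a generic database connection, with the Kylin JDBC driver JAR available on the server's classpath.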
  16. 16. www.stratebi.com • Enabling security at schema, concept and data levels: • Mondrian 3.X • We could not use views to filter data (a limitation of the Kylin approach). • Solution: We used a Mondrian Dynamic Schema Processor. • We extended the typical Mondrian DSP class with a variable that replaces a piece of XML in the schema. • Pentaho Metadata Editor • PME requires the role and user tables to be created in the same data source, but Kylin does not allow it (a limitation of the Kylin approach). • Solution: We created JDBCSecuritySqlGenerator. • An extension of the existing PME security class. • The security is defined in a file we called securitySQLGenerator-properties.xml.
  17. 17. www.stratebi.com • What have we obtained? • Dashboarding, reporting and OLAP viewing using our Pentaho STTools over cubes with more than a billion rows (1,000,000,000). • Sub-second Roll-up, Drill-down, Slice and Dice and Pivot OLAP operations. • We have carried out the first deployment of Kylin for a Spain-based company. • Try our demo with Kylin, Pentaho and the STPivot viewer (available in the Marketplace) • http://bigdata.stratebi.com/kylin-olap/index.htm
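The Dynamic Schema Processor idea, replacing a piece of schema XML per user before Mondrian loads it, can be sketched as plain text substitution. The real DSP is a Java class inside Mondrian; this Python version only illustrates the mechanism. The placeholder name `%SECURITY_FILTER%`, the roles, the filter conditions and the use of an `<SQL>` child under `<Table>` to carry the row filter are all assumptions for the example, not details taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment; %SECURITY_FILTER% is an invented placeholder name.
SCHEMA_TEMPLATE = """<Schema name="Sales">
  <Cube name="Sales">
    <Table name="FT_SALES">
      %SECURITY_FILTER%
    </Table>
  </Cube>
</Schema>"""

# Invented mapping from user role to a row-level restriction.
ROLE_FILTERS = {
    "spain_analyst": "FT_SALES.COUNTRY = 'Spain'",
    "admin": None,  # no restriction
}


def process_schema(template, role):
    """Return the schema text with the placeholder replaced for this role."""
    if role not in ROLE_FILTERS:
        raise PermissionError(f"unknown role: {role}")
    condition = ROLE_FILTERS[role]
    fragment = f'<SQL dialect="generic">{condition}</SQL>' if condition else ""
    schema = template.replace("%SECURITY_FILTER%", fragment)
    ET.fromstring(schema)  # fail early if the result is not well-formed XML
    return schema


analyst_schema = process_schema(SCHEMA_TEMPLATE, "spain_analyst")
admin_schema = process_schema(SCHEMA_TEMPLATE, "admin")
```

Filtering in the schema rather than in database views fits the constraint described above, since the restriction travels with the generated SQL instead of requiring extra objects in Kylin.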
  18. 18. www.stratebi.com
  19. 19. www.stratebi.com • Kylin applied to digital marketing scenario • Initial Scenario • OLAP system for data analysis using an in-house reporting tool. • Based on MySQL (80% of queries) + Redshift (20% of queries) • Several million rows per hour in some fact tables • Goals • Reduce query latency (some queries take >20s to run) • Reduce ETL processing time: "Data freshness". • Implementation of open-source BI tools (STTools) • Self-service OLAP, reporting and dashboarding
  20. 20. www.stratebi.com • Kylin applied to digital marketing scenario • Architecture
  21. 21. www.stratebi.com • Kylin applied to digital marketing scenario • Goals achieved • Reduced query latency: user queries were compared for the company's three most important reports. • Kylin queries run 4 times faster than the same queries on Redshift. • Most Kylin queries have response times below 1 second. • Some very complex queries that take about 30 seconds in Redshift run in just over 400 milliseconds using Kylin. • Full integration with open-source BI tools (STTools) • STPivot, STReport, STDashboard • Security implemented at schema and data levels (Mondrian and PME).
  22. 22. www.stratebi.com • Kylin applied to digital marketing scenario • Kylin vs Redshift (figure labels: Redshift, Kylin)
  23. 23. www.stratebi.com
  24. 24. www.stratebi.com • Why is Vertica an alternative to Kylin for Big Data OLAP? • Sub-second queries over fact tables with billions of rows. • In our implementations and benchmark it achieves very good query latency results. • But it is not as fast as Kylin for extremely large fact tables. • ANSI SQL and BI tools integration. • Integration with Pentaho possible through JDBC, Mondrian and PME • Also Superset, Tableau, Power BI, Zeppelin, MicroStrategy… • Full support for star and snowflake schemas. • Near real-time data ingestion and processing. • Micro Focus Vertica is not an open-source project. • But there is a free community version, which is enough for many typical Big Data scenarios.
  25. 25. www.stratebi.com • Vertica Architecture • Distributed processing in cluster mode. • But it does not need a Hadoop cluster to work. • Although it does support integration with Hadoop (e.g. Spark or Hive) • Columnar and distributed storage • Hybrid OLAP (tables, projections, flattened tables…)
  26. 26. www.stratebi.com • Integration with Pentaho and STTools • Seamless integration with Pentaho PDI for data warehouse loading • Including bulk load steps • We have also integrated Vertica with Pentaho BA Server in several successful use cases • Be careful when defining the Mondrian OLAP schema to achieve good performance. • In PME we faced issues similar to those with Kylin (use of the PostgreSQL dialect) • Retail sector use case • More than 3,000 points of sale = high concurrency • Data volume determined by sales-line-level detail • Need for highly customized graphics (we have implemented a lot of CDE dashboards)
  27. 27. www.stratebi.com
  28. 28. www.stratebi.com • Why a Big Data OLAP Benchmark? • To test the performance of the two most powerful Big Data OLAP tools • Kylin vs Vertica • To compare their performance against OLAP implementations in traditional databases • PostgreSQL: open-source relational database that performs well for OLAP systems.
  29. 29. www.stratebi.com • Benchmark implementation • We have used the SSB benchmark • A star schema version of the better-known TPC-H (an industry standard) • The Kyligence team has an implementation of the SSB benchmark for Apache Kylin. • Including schemas and data generator. • We have adapted it for use with Vertica and PostgreSQL. • It provides a set of 13 analytical queries
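  30. 30. www.stratebi.com • Tests performed • Number of rows in the fact and dimension tables for each test performed. • Hardware used

     | Test   | LINEORDER (fact, KPI) | CUSTOMER (dimension) | PART (dimension) | SUPPLIER (dimension) | DATE (dimension) |
     |--------|-----------------------|----------------------|------------------|----------------------|------------------|
     | 100M   | 100,000,000           | 40,000               | 32,000           | 20,000               | 2,556            |
     | 500M   | 500,000,000           | 200,000              | 48,000           | 100,000              | 2,556            |
     | 1,000M | 1,000,000,000         | 400,000              | 56,000           | 200,000              | 2,556            |

     | Tool           | Distributed processing | Kind of hardware | No. of hosts | Processor                             | Cores | RAM   |
     |----------------|------------------------|------------------|--------------|---------------------------------------|-------|-------|
     | Kylin 2.4      | Yes                    | Dedicated cloud  | 3            | Intel(R) Atom(TM) CPU C2750 @ 2.40GHz | 8     | 32 GB |
     | Vertica 9.1    | Yes                    | Dedicated cloud  | 3            | Intel(R) Atom(TM) CPU C2750 @ 2.40GHz | 8     | 32 GB |
     | PostgreSQL 9.6 | No                     | Dedicated cloud  | 1            | Intel(R) Atom(TM) CPU C2750 @ 2.40GHz | 8     | 32 GB |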
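  31. 31. www.stratebi.com • Benchmark Results • Test P1, query latency in seconds for the 100M, 500M and 1,000M tests

     | Query | Kylin (100M) | Vertica (100M) | PostgreSQL (100M) | Kylin (500M) | Vertica (500M) | PostgreSQL (500M) | Kylin (1,000M) | Vertica (1,000M) | PostgreSQL (1,000M) |
     |-------|--------------|----------------|-------------------|--------------|----------------|-------------------|----------------|------------------|---------------------|
     | Q1.1  | 0.2          | 0.2            | 22.4              | 0.3          | 0.3            | +280              | 0.6            | 0.6              | -                   |
     | Q1.2  | 0.2          | 0.4            | 18.7              | 0.3          | 0.2            | +280              | 0.5            | 0.3              | -                   |
     | Q1.3  | 0.2          | 0.4            | 18.5              | 0.3          | 0.3            | +280              | 0.6            | 0.2              | -                   |
     | Q2.1  | 0.3          | 1.1            | 18.1              | 0.4          | 2.7            | +280              | 0.6            | 9.1              | -                   |
     | Q2.2  | 0.3          | 0.8            | 16.3              | 0.4          | 2.7            | +280              | 0.7            | 8.2              | -                   |
     | Q2.3  | 0.3          | 0.8            | 15.2              | 0.4          | 2.2            | +280              | 0.6            | 7.4              | -                   |
     | Q3.1  | 0.3          | 1.4            | 23.9              | 0.4          | 3.7            | +280              | 0.8            | 15.1             | -                   |
     | Q3.2  | 0.6          | 0.7            | 18.5              | 0.8          | 0.7            | +280              | 0.9            | 9.8              | -                   |
     | Q3.3  | 0.3          | 0.9            | 15.8              | 0.3          | 0.6            | +280              | 0.7            | 3.7              | -                   |
     | Q3.4  | 0.2          | 0.6            | 15.9              | 0.2          | 0.2            | +280              | 0.2            | 1.0              | -                   |
     | Q4.1  | 0.3          | 1.4            | 23.7              | 0.4          | 7.3            | +280              | 0.7            | 14.7             | -                   |
     | Q4.2  | 0.3          | 1.0            | 23.3              | 0.4          | 2.0            | +280              | 0.7            | 3.8              | -                   |
     | Q4.3  | 2.5          | 0.8            | 17.1              | 2.4          | 1.3            | +280              | 2.9            | 2.0              | -                   |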
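One way to read the results is to summarise the largest test per tool. The sketch below takes the Kylin and Vertica latencies for the 1,000M test as given in the benchmark results and reports which tool wins each query, together with the median and worst-case latency. The `summarize` helper and its output keys are invented for this example; only the numbers come from the slides.

```python
from statistics import median

# Latencies in seconds for the 1,000M test, copied from the benchmark
# results: query -> (Kylin, Vertica).
P1_1000M = {
    "Q1.1": (0.6, 0.6), "Q1.2": (0.5, 0.3), "Q1.3": (0.6, 0.2),
    "Q2.1": (0.6, 9.1), "Q2.2": (0.7, 8.2), "Q2.3": (0.6, 7.4),
    "Q3.1": (0.8, 15.1), "Q3.2": (0.9, 9.8), "Q3.3": (0.7, 3.7),
    "Q3.4": (0.2, 1.0), "Q4.1": (0.7, 14.7), "Q4.2": (0.7, 3.8),
    "Q4.3": (2.9, 2.0),
}


def summarize(results):
    """Compare two tools query by query and report simple latency statistics."""
    kylin = [k for k, _ in results.values()]
    vertica = [v for _, v in results.values()]
    return {
        "kylin_faster": sorted(q for q, (k, v) in results.items() if k < v),
        "vertica_faster": sorted(q for q, (k, v) in results.items() if v < k),
        "ties": sorted(q for q, (k, v) in results.items() if k == v),
        "kylin_median": median(kylin),
        "vertica_median": median(vertica),
        "kylin_max": max(kylin),
        "vertica_max": max(vertica),
    }


SUMMARY = summarize(P1_1000M)
```

On these figures Kylin is faster on 9 of the 13 queries, Vertica on 3, with one tie, and the median latency is 0.7 s for Kylin against 3.8 s for Vertica. The single-run numbers do not show variance, so the comparison should be read as indicative.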
  32. 32. www.stratebi.com • Benchmark Results • Relationship between the number of rows in the fact table and query latency for Kylin and Vertica
  33. 33. www.stratebi.com • Benchmark Results • Kylin and Vertica are both suitable for Big Data OLAP applications. • Apache Kylin has the best query performance. • But high hardware, software (Hadoop) and know-how requirements. • 100% open source version without limitations. • Vertica is the alternative to Kylin for less extreme Big Data scenarios. • Lower hardware, software and know-how requirements. • Free community version with some limitations. • PostgreSQL is not suitable for Big Data OLAP.
  34. 34. www.stratebi.com
  35. 35. www.stratebi.com • Pentaho also integrates with many other Big Data tools • Lince Big Data Stack • Our selection of Big Data tools based on experience and tests. • All of them allow the integration with Pentaho open source tools. • Lince BI tools (formerly STTools) are used to analyze the data from Big Data repositories. • STPivot: OLAP Viewer. • STReport: Ad-Hoc Reporting. • STDashboard: Fast Dashboards. • STCard: Balanced Scorecards.
  36. 36. www.stratebi.com
  37. 37. www.stratebi.com • Visit our Big Data demos website • http://bigdata.stratebi.com/ • Pentaho • Kylin + STPivot (Mondrian 4.X) • Hadoop + PDI (HDFS, Hive, Oozie,…steps) • PDI + Spark MLlib + Zeppelin • Other open-source Big Data tools • Kafka + Spark Streaming • Kylin + Superset • Neo4j • …and much more.
  38. 38. www.stratebi.com
  39. 39. www.stratebi.com • Pentaho BA Server enables Big Data OLAP in combination with Kylin or Vertica. • Easy to integrate through the JDBC connector with SQL-based plugins (CDE dashboards). • We have worked hard to integrate these tools with Mondrian 3.X and PME 7.1. • Best performance results with the integration between Pentaho, Kylin and STTools • Sub-second Roll-up, Drill-down, Slice and Dice and Pivot OLAP operations. • Observed performance with STTools is really good, but we still need to extend our benchmark to measure it (Kylin with Mondrian or PME). • Pentaho tools are useful for Big Data ETL and analysis • However, our experience tells us that many of the Pentaho Big Data connectors and features are very hard to configure. • We propose including Kylin and Vertica dialects (Mondrian and PME) in future Pentaho versions.
