
Polymorphic Table Functions: The Best Way to Integrate SQL and Apache Spark


Databases not only store the data used for computations in Spark; they often also want to consume the output of Spark computations directly in queries, using that output like a relational table. So-called polymorphic table functions provide a mechanism for achieving this. This presentation explains what polymorphic table functions are, how they are used, and why they are a very efficient way of communicating between Spark and a SQL engine: they minimize network traffic and maximize parallelism by co-locating the workers of the SQL engine with the Spark executors. In addition, use cases are presented, such as performing a complex transformation on a table in the SQL engine by passing the table as an argument to a polymorphic table function and then using the result of the transformation again as a table.

Published in: Data & Analytics


  1. Polymorphic Table Functions: The best way to integrate SQL and Spark (Andreas Weininger, @aweininger)
  2. Please note: IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
     Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
  3. Notices and disclaimers © 2020 International Business Machines Corporation. No part of this document may be reproduced or transmitted in any form without written permission from IBM. U.S. Government Users Restricted Rights: use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM. This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates. Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. This document is distributed “as is” without any warranty, either express or implied. In no event shall IBM be liable for any damage arising from the use of this information, including but not limited to, loss of data, business interruption, loss of profit or loss of opportunity. IBM products and services are warranted per the terms and conditions of the agreements under which they are provided. The performance data and client examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions. IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have been previously installed. Regardless, our warranty terms apply. Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice. Performance data contained herein was generally obtained in a controlled, isolated environment. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved.
     Actual performance, cost, savings or other results in other operating environments may vary. References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.
  4. Notices and disclaimers, continued: It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer follows any law. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right. IBM and the IBM logo are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information”.
  5. Agenda
     Motivation
       Traditional ways of integrating Spark and SQL
       Shortcomings of the traditional approach
     Polymorphic Table Functions
       What are PTFs?
       How do PTFs work?
       Use cases for PTFs
     Conclusions
  6. Motivation
  7. Traditional Way of Integrating Spark and Data Stores
     ▪ Spark reads data from and writes data to data stores (e.g. JDBC, Parquet, …)
     ▪ Spark initiates the communication
     ▪ Spark processes the data it reads and writes the data it generates
     ▪ Spark is in the driver seat
     [Diagram: Spark and a data store; Spark is in the driver seat]
  8. A New Way of Integrating Spark and Data Stores
     ▪ The data store is in the driver seat
     ▪ It uses the result of a Spark computation as a table
     ▪ The data store delegates complex transformations on tables to Spark
     ▪ Solution: Polymorphic Table Functions (PTFs)
     [Diagram: a data store and Spark; the data store is in the driver seat]
  9. Polymorphic Table Functions
  10. What are Polymorphic Table Functions?
     ▪ A table function is used to invoke the Spark code
     ▪ The result of a computation in Spark is presented as a relational table
     ▪ Why polymorphic?
       ▪ A single function is used to execute all kinds of Spark code
       ▪ This table function can therefore return many different types of tables: different numbers of columns, different column types
     ▪ Tables may in addition be passed as arguments to the Spark computation
     ▪ The table function may be used like a normal table in SQL (e.g. for joins with other tables or table functions, for filtering, etc.)
  11. Example: Polymorphic Table Functions in Big SQL
     ▪ The polymorphic table function is called EXECSPARK
     ▪ A class which implements the SparkPtf interface must be passed as an argument to EXECSPARK
     ▪ The class may be written in Scala or Java
     ▪ The class must implement four methods:
       describe: used by Big SQL to determine which columns with which data types are returned
       execute: called at runtime; implements the actual Spark computation
       destroy: used for cleanup
       cardinality: used by the Big SQL optimizer to estimate the size of the returned table
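To make the four-method contract concrete, here is a plain-Java mock of the lifecycle. The interface and class names (MockPtf, MockPtfDemo) and the List-based schema and row types are stand-ins chosen so the sketch runs without a Spark or Big SQL runtime; the real SparkPtf methods take an SQLContext and return a StructType and a Dataset<Row>:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the SparkPtf contract: describe/execute/cardinality/destroy.
interface MockPtf {
    List<String> describe(Map<String, Object> arguments);   // column names for the planner
    List<Object[]> execute(Map<String, Object> arguments);  // the actual rows at runtime
    long cardinality(Map<String, Object> arguments);        // row-count estimate for the optimizer
    void destroy(Map<String, Object> arguments);            // cleanup after the query
}

public class MockPtfDemo implements MockPtf {
    private List<Object[]> cached; // mirrors the DataFrame caching in the JSON example

    @Override
    public List<String> describe(Map<String, Object> arguments) {
        cached = load(arguments);
        return Arrays.asList("country", "population");
    }

    @Override
    public List<Object[]> execute(Map<String, Object> arguments) {
        if (cached == null) cached = load(arguments); // describe may not have run in this process
        return cached;
    }

    @Override
    public long cardinality(Map<String, Object> arguments) {
        Object card = arguments.get("CARD");
        return card == null ? 100L : ((Number) card).longValue(); // default guess, as in the slides
    }

    @Override
    public void destroy(Map<String, Object> arguments) {
        cached = null;
    }

    private List<Object[]> load(Map<String, Object> arguments) {
        // A real PTF would read arguments.get("URI") with Spark; we fake two rows.
        return Arrays.asList(new Object[]{"DE", 83L}, new Object[]{"FR", 68L});
    }

    public static void main(String[] args) {
        MockPtfDemo ptf = new MockPtfDemo();
        Map<String, Object> params = new HashMap<>();
        // Engine-side call order: describe at compile time, cardinality for the
        // optimizer, execute at runtime, destroy for cleanup.
        System.out.println(ptf.describe(params));       // [country, population]
        System.out.println(ptf.cardinality(params));    // 100 (no CARD argument given)
        System.out.println(ptf.execute(params).size()); // 2
        ptf.destroy(params);
    }
}
```

The main method shows the order in which the engine drives the four methods; note that execute defends against describe not having been called in the same process, just like the Big SQL example below does.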
  12. PTF: A Complete Example: Reading a JSON File (Calling SQL Statement)

     select cast(country as varchar(10)) c
     from table(
         SYSHADOOP.EXECSPARK(
             class => 'org.examples.ReadJsonFileSecure',
             uri   => '/user/bigsql/tmp/demo.json'
         )
     ) as doc
     where country is not null
     order by c;
  13. PTF: A Complete Example: Reading a JSON File (PTF Implementation: Imports)

     package org.examples;

     import java.util.Map;
     import org.apache.spark.sql.Dataset;
     import org.apache.spark.sql.Row;
     import org.apache.spark.sql.SQLContext;
     import org.apache.spark.sql.types.DataTypes;
     import org.apache.spark.sql.types.StructField;
     import org.apache.spark.sql.types.StructType;
     import ...; // import of the SparkPtf interface (truncated in the source)
  14. PTF: A Complete Example: Reading a JSON File (PTF Implementation: Class Definition & describe Method)

     public class ReadJsonFileSecure implements SparkPtf {
         Dataset<Row> df = null;

         @Override
         public StructType describe(SQLContext ctx, Map<String, Object> arguments) {
             // read the JSON file whose path is passed via the URI argument
             df = ctx.read().json((String) arguments.get("URI"));
             df.cache();
             return df.schema();
         }
  15. PTF: A Complete Example: Reading a JSON File (PTF Implementation: execute Method)

     @Override
     public Dataset<Row> execute(SQLContext ctx, Map<String, Object> arguments) {
         if (df == null) { // describe may not have been called in this process
             df = ctx.read().json((String) arguments.get("URI"));
             df.cache();
         }
         return df;
     }
  16. PTF: A Complete Example: Reading a JSON File (PTF Implementation: destroy & cardinality Methods)

     @Override
     public void destroy(SQLContext ctx, Map<String, Object> arguments) {
         if (df != null) {
             df = null;
         }
     }

     @Override
     public long cardinality(SQLContext ctx, Map<String, Object> arguments) {
         if (arguments.get("CARD") == null)
             return 100;
         else
             return (long) arguments.get("CARD");
     }
     }
  17. What is IBM Db2 Big SQL?
     ▪ An advanced, fully featured SQL engine for physical and virtual data lakes on Hadoop and/or cloud object storage
     ▪ Focuses on data virtualization, SQL compatibility, scalability, performance, and enterprise security/governance
     ▪ Supports queries on Hive and HBase data
     ▪ Uses the Hive metastore
     ▪ Supports CSV, Parquet, ORC, …
     ▪ Bidirectional integration with Spark:
       ▪ The Spark JDBC data source enables execution of Big SQL queries from Spark and consumes the results as data frames
       ▪ The polymorphic table function enables execution of Spark jobs from Big SQL and consumes the results as tables
  18. Virtualized Environment to Query Heterogeneous Data
     ▪ SQL-on-Hadoop engine that virtualizes more than 10 different data sources (RDBMS, NoSQL, HDFS, or object store) with efficient query rewrite and predicate pushdown
     ▪ Sources include MS SQL Server, Netezza, Oracle, PostgreSQL, Teradata, Db2 LUW, Db2 for z/OS, Db2 on i, Informix, WebHDFS, Object Store (S3), Hive, HBase, and HDFS
     ▪ Transparent: appears to be one source; programmers don’t need to know how or where data is stored
     ▪ High function: full SQL support against all data, plus the capabilities of the sources themselves
     ▪ Autonomous: non-disruptive to data sources, existing applications, and systems
     ▪ High performance: optimization of distributed queries
  19. Efficient Integration
     ▪ Low latency: Spark jobs execute on a long-running Spark application that is co-located with the Big SQL engine (no spark-submit per job)
     ▪ High throughput: data movement exploits parallelism and local data transfer at the nodes
     [Diagram: Big SQL workers co-located with Spark executors; the head node runs the Big SQL head and the Spark driver; each worker node runs a Big SQL worker and a Spark executor against HDFS data; control flows across engines, control/data flows within each engine]
  20. How It Works
     ▪ The head node runs the Big SQL head and the Spark driver; each worker node runs a Big SQL worker and a Spark executor with HDFS or local storage
     ▪ For a query like SELECT * FROM TABLE(EXECSPARK(class => 'com….PTF', …)), the Big SQL head first calls describe on the PTF class to obtain the schema
     ▪ At runtime, execute launches the Spark stages/tasks, and the Big SQL workers scan the resulting DataFrame partitions
  21. High-Throughput Data Transfer
     ▪ Parallelism across nodes and within nodes
     ▪ DataFrame partitions are streamed from the Spark executor to the Big SQL worker as soon as Spark produces them
     ▪ Partition data is serialized into the Big SQL runtime format on the Spark side
     ▪ No deserialization overhead once the data arrives in Big SQL
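The "stream partitions as soon as they are produced" idea can be sketched with a producer/consumer pair: one thread stands in for the Spark executor emitting serialized partitions, the other for the co-located Big SQL worker consuming each one immediately. The BlockingQueue hand-off is only an illustration of the concept, not the actual transfer mechanism Big SQL uses:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PartitionStreamDemo {
    private static final byte[] POISON = new byte[0]; // end-of-stream marker

    // Streams `partitions` fake partitions from a producer thread to the caller,
    // which consumes each one as soon as it is available instead of waiting for
    // the whole result to materialize.
    static List<Integer> stream(int partitions) throws InterruptedException {
        BlockingQueue<byte[]> channel = new ArrayBlockingQueue<>(4);
        Thread executor = new Thread(() -> {
            try {
                for (int p = 0; p < partitions; p++) {
                    // In Big SQL the partition would already be serialized in the
                    // engine's runtime format here, avoiding deserialization later.
                    channel.put(new byte[]{(byte) p});
                }
                channel.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        executor.start();

        List<Integer> received = new ArrayList<>();
        for (byte[] part = channel.take(); part != POISON; part = channel.take()) {
            received.add((int) part[0]); // worker processes the partition right away
        }
        executor.join();
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(stream(8).size()); // 8
    }
}
```

The bounded queue also models back-pressure: a slow consumer naturally throttles the producer, so partitions never pile up in memory.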
  22. Use Cases
     ▪ Data virtualization for many data sources otherwise not supported: all sources for which a Spark DataSource exists become available in the SQL engine
     ▪ Complex transformations that are hard or impossible to express in SQL become possible
     ▪ Exploitation of Spark libraries and external packages in the SQL engine: ETL, machine learning, custom aggregates
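As an example of a transformation that is awkward in plain SQL but natural in procedural code behind a PTF, consider sessionization: assigning consecutive events to the same session as long as the gap between them stays below a timeout. This standalone sketch (the class Sessionize is hypothetical, not from the presentation) shows the kind of logic one would delegate to Spark:

```java
import java.util.ArrayList;
import java.util.List;

public class Sessionize {
    // Assigns a session id to each timestamp: a new session starts whenever the
    // gap to the previous event exceeds `gap`. Timestamps must be sorted ascending.
    static List<Integer> sessionIds(long[] timestamps, long gap) {
        List<Integer> ids = new ArrayList<>();
        int session = 0;
        for (int i = 0; i < timestamps.length; i++) {
            if (i > 0 && timestamps[i] - timestamps[i - 1] > gap) session++;
            ids.add(session);
        }
        return ids;
    }

    public static void main(String[] args) {
        long[] ts = {0, 10, 20, 500, 510, 2000};
        System.out.println(sessionIds(ts, 100)); // [0, 0, 0, 1, 1, 2]
    }
}
```

Wrapped in a PTF, the resulting (timestamp, session id) pairs would come back to the SQL engine as an ordinary table that can be joined, filtered, or aggregated.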
  23. Example: Using the DataSource PTF to Query a MySQL Table

     SELECT people.firstname, people.lastname
     FROM TABLE(execspark(
         class    => '',
         format   => 'jdbc',
         url      => 'jdbc:mysql://localhost/mydb',
         driver   => 'com.mysql.jdbc.Driver',
         user     => 'joe',
         password => 'joespwd',
         load     => 'default.people'
     )) AS people
     WHERE people.age > 55
  24. Example: Using the DataSource PTF to Describe a CSV File

     SELECT DESCRIBE doc.*
     FROM TABLE(execspark(
         class      => '',
         format     => 'csv',
         skipHeader => true,
         load       => 'hdfs://user/guest/sample.csv'
     )) AS doc
  25. Example: Using the DataSource PTF to Query a Parquet File

     SELECT people.firstname, people.lastname
     FROM TABLE(execspark(
         class  => '',
         format => 'parquet',
         load   => 'hdfs://user/guest/people.parquet'
     )) AS people
     WHERE people.age > 55
  26. Conclusions
  27. Summary: Polymorphic table functions
     ▪ Provide a fast and scalable way to transfer data from Spark to a SQL engine, with the SQL engine in control
     ▪ Require no intermediate storage, unlike the naïve solution
     ▪ Support many possible use cases
     ▪ Let you leverage the analytics capabilities of Spark in the SQL engine
  28. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions. Andreas Weininger, @aweininger