
Hive-Kafka Integration for Real-Time Kafka SQL Queries

Kafka SQL: What Our Kafka Customers Have Been Asking For

● Stream processing engines/libraries like Kafka Streams provide a programmatic stream processing access pattern to Kafka.
● Application developers love this access pattern, but BI developers have quite different analytics requirements, focused on use cases such as ad hoc analytics, data exploration, and trend discovery. BI persona requirements for Kafka access include:
● Treat Kafka topics/streams as tables.
● Support for ANSI SQL.
● Support for complex joins (different join keys, multi-way joins, join predicates on non-table keys, non-equi joins, multiple joins in the same query).
● UDF support for extensibility.
● JDBC/ODBC support.
● Creating views for column masking.
● Rich ACL support, including column-level security.

● To address these requirements, the HDP 3.1 release adds a new Hive storage handler for Kafka that allows users to view Kafka topics as Hive tables.
● This new feature allows BI developers to take full advantage of Hive's analytical operations and capabilities, including complex joins, aggregations, window functions, UDFs, and pushdown predicate filtering.

Kafka Hive C-A-T (Connect, Analyze, Transform)

● The goal of the Hive-Kafka integration is to enable users to connect, analyze, and transform data in Kafka via SQL quickly.
● Connect: Users can create an external table that maps to a Kafka topic without copying or materializing the data to HDFS or any other persistent storage. Using this table, users can run any SQL statement with out-of-the-box authentication and authorization support via Ranger.
● Analyze: Leverage Kafka's time-travel capabilities and offset-based seeks to minimize I/O. These capabilities enable ad hoc queries across time slices in the stream and allow exactly-once offloading by controlling the position in the stream.
● Transform: Users can mask, join, aggregate, and change the serialization encoding of the original stream and create a stream persisted in a Kafka topic. Joins can be against any dimension table or any stream. Users can also offload the data from Kafka to the Hive warehouse (e.g., HDFS, S3, etc.), as in the offload sketch below.
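
A minimal sketch of the Transform/offload step, assuming the Kafka-backed table kafka_truck_geo_events defined in the Connect section below; the target table name and the payload columns (driverid, route, eventtype) are illustrative only:

    -- Offload a copy of the stream into an ORC table in the Hive warehouse.
    -- The Kafka metadata columns are aliased so the archive keeps the stream lineage.
    CREATE TABLE truck_geo_events_archive STORED AS ORC AS
    SELECT `__partition` AS kafka_partition,
           `__offset`    AS kafka_offset,
           `__timestamp` AS kafka_timestamp,
           driverid, route, eventtype
    FROM kafka_truck_geo_events;
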

The Connect of Kafka Hive C-A-T

● To connect to a Kafka topic, execute a DDL statement to create an external Hive table representing a live view of the Kafka stream.
● The external table definition is handled by a storage handler implementation called 'KafkaStorageHandler'.
● The storage handler relies on two mandatory table properties to map the Kafka topic name and the Kafka broker connection string. A sample DDL statement is shown below.
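
A sketch of such a DDL statement: the storage handler class and the kafka.topic / kafka.bootstrap.servers table properties are the ones the storage handler expects, while the table name, payload columns, topic, and broker address are illustrative:

    CREATE EXTERNAL TABLE kafka_truck_geo_events (
      driverid INT,        -- payload columns deserialized from the Kafka value
      route STRING,
      eventtype STRING
    )
    STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
    TBLPROPERTIES (
      "kafka.topic" = "truck_geo_events",            -- topic to expose as a table
      "kafka.bootstrap.servers" = "localhost:9092"   -- broker connection string
    );
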
● Recall that records in Kafka are stored as key-value pairs, so the user needs to supply serialization/deserialization classes to transform the value byte array into a set of columns.
● The SerDe is supplied via the table property "kafka.serde.class". The default is JsonSerDe, and out-of-the-box SerDes exist for formats such as CSV, Avro, and others; a CSV example follows below.
● In addition to the schema columns defined in the DDL, the storage handler captures metadata columns for the Kafka topic, including partition, timestamp, and offset. These metadata columns allow Hive to optimize queries for "time travel," partition pruning, and offset-based seeks; an ad hoc query over them is sketched further below.
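
For example, a CSV-encoded topic could be mapped by overriding the SerDe property, roughly as follows (table name, topic, and broker are illustrative):

    CREATE EXTERNAL TABLE kafka_truck_geo_events_csv (
      driverid INT,
      route STRING,
      eventtype STRING
    )
    STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
    TBLPROPERTIES (
      "kafka.topic" = "truck_geo_events_csv",
      "kafka.bootstrap.servers" = "localhost:9092",
      "kafka.serde.class" = "org.apache.hadoop.hive.serde2.OpenCSVSerde"  -- override the default JsonSerDe
    );
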
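An ad hoc query over these metadata columns might look like the following sketch (table and payload columns as in the earlier illustrative DDL); Kafka timestamps are in milliseconds, hence the factor of 1000:

    -- Read only the last hour of the stream and expose the Kafka coordinates of each record.
    SELECT `__partition`, `__offset`, `__timestamp`, driverid, eventtype
    FROM kafka_truck_geo_events
    WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - INTERVAL '1' HOUR);
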

Time Travel, Partition Pruning, and Offset-Based Seeks: Optimizations for Fast SQL in Kafka

● From Kafka release 0.10.1 onwards, every Kafka message has a timestamp associated with it. The semantics of this timestamp are configurable (e.g., assigned by the producer, set when the leader receives the message, or set when a consumer receives the message).
● Hive adds this timestamp field as a column to the Kafka Hive table. With this column, users can use filter predicates to time travel (e.g., read only records past a given point in time).
● This is achieved using the Kafka consumer public API offsetsForTimes, which returns, for each partition, the earliest offset whose timestamp is greater than or equal to the given timestamp.
● Hive parses the filter expression tree and looks for any predicate of the form __timestamp [>=, >, =] constant_int64 to provide time-based offset optimizations, so the scan can start at the computed offsets instead of reading each topic partition from the beginning.
● Customers leverage this powerful optimization by creating time-based views of the data streaming into Kafka, e.g., a view over the last 15 minutes of the stream flowing into the Kafka topic kafka_truck_geo_events, as sketched below. The implementation of this optimization can be found in KafkaScanTrimmer#buildScanForTimesPredicate.
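
A sketch of such a view, reusing the illustrative kafka_truck_geo_events table; the view name is made up, and the factor of 1000 converts seconds to the millisecond __timestamp values:

    -- View over the last 15 minutes of the stream; the __timestamp predicate lets Hive
    -- translate the time range into starting offsets instead of scanning the whole topic.
    CREATE VIEW kafka_truck_geo_events_15min AS
    SELECT *
    FROM kafka_truck_geo_events
    WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - INTERVAL '15' MINUTE);
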
● Partition pruning is another powerful optimization. Each Kafka Hive table is partitioned based on the __partition metadata column, so any filter predicate over the __partition column can be used to eliminate the unused partitions, as in the example below. See the optimization implementation in KafkaScanTrimmer#buildScanFromPartitionPredicate.
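
For instance, a predicate on __partition like the one below (illustrative table) restricts the scan to a single Kafka partition:

    -- Only Kafka partition 0 is read; all other partitions are pruned.
    SELECT COUNT(*)
    FROM kafka_truck_geo_events
    WHERE `__partition` = 0;
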
● Kafka Hive also takes advantage of offset-based seeks, which allow users to seek to a specific offset in the stream. Thus any predicate that can be used as a start point, e.g., __offset > constant_int64, can be used to seek in the stream. Supported operators are =, >, >=, <, <=.
● See the optimization implementation in KafkaScanTrimmer#buildScanFromOffsetPredicate. A sketch of such a query follows.
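
A minimal sketch of an offset-based seek, combining a partition predicate with an offset start point (table and payload columns illustrative):

    -- Start reading partition 0 at offset 100000 instead of from the beginning.
    SELECT `__offset`, driverid, eventtype
    FROM kafka_truck_geo_events
    WHERE `__partition` = 0
      AND `__offset` > 100000;
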

Demo Video
