1 / 18
Hive-Kafka Integration for Real-Time Kafka SQL Queries
2 / 18
Kafka SQL: What Our Kafka Customers Have Been Asking For
● Stream processing engines/libraries like Kafka Streams provide a programmatic stream processing access pattern to Kafka.
● Application developers love this access pattern, but BI developers have quite different analytics requirements, focused on use cases around ad hoc analytics, data exploration, and trend discovery. BI persona requirements for Kafka access include:
3 / 18
● Treat Kafka topics/streams as tables.
● Support for ANSI SQL.
● Support for complex joins (different join keys, multi-way joins, join predicates on non-table keys, non-equi joins, multiple joins in the same query).
● UDF support for extensibility.
● JDBC/ODBC support.
● Creating views for column masking.
● Rich ACL support, including column-level security.
4 / 18
● To address these requirements, the new HDP 3.1 release has added a new Hive Storage Handler for Kafka, which allows users to view Kafka topics as Hive tables.
● This new feature allows BI developers to take full advantage of Hive's analytical operations/capabilities, including complex joins, aggregations, window functions, UDFs, predicate pushdown filtering, windowing, etc.
5 / 18
(Image-only slide.)
6 / 18
Kafka Hive C-A-T (Connect, Analyze, Transform)
● The goal of the Hive-Kafka integration is to enable users to connect to, analyze, and transform data in Kafka via SQL, quickly.
● Connect: Users will be able to create an external table that maps to a Kafka topic without actually copying or materializing the data to HDFS or any other persistent storage. Using this table, users will be able to run any SQL statement, with out-of-the-box authentication and authorization support using Ranger.
7 / 18
● Analyze: Leverage Kafka's time-travel capabilities and offset-based seeks in order to minimize I/O. These capabilities enable ad hoc queries across time slices in the stream and allow exactly-once offloading by controlling the read position in the stream.
● Transform: Users will be able to mask, join, aggregate, and change the serialization encoding of the original stream and create a stream persisted in a Kafka topic. Joins can be against any dimension table or any stream. Users will also be able to offload the data from Kafka to the Hive warehouse (e.g., HDFS, S3, etc.); a sketch follows this slide.
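A minimal sketch of the offload path, assuming a Kafka-backed table kafka_truck_geo_events from the Connect step and an ORC warehouse table truck_geo_events_orc with matching columns (all table and column names here are illustrative):

INSERT INTO TABLE truck_geo_events_orc
SELECT `__partition`, `__offset`, driverid, latitude, longitude
FROM kafka_truck_geo_events
WHERE `__timestamp` <= 1555372800000;  -- fixed millisecond-epoch bound chosen per run, so a re-run reads the same slice

Keeping `__partition` and `__offset` in the warehouse table lets downstream jobs deduplicate records and resume from the last offloaded position.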
8 / 18
(Image-only slide.)
9 / 18
The Connect of Kafka Hive C-A-T
● To connect to a Kafka topic, execute a DDL statement to create an external Hive table representing a live view of the Kafka stream.
● The external table definition is handled by a storage handler implementation called 'KafkaStorageHandler.'
● The storage handler relies on two mandatory table properties to map the Kafka topic name and the Kafka broker connection string. Below is a sample DDL statement.
10 / 18
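A minimal sketch of the sample DDL (the storage handler class and the two mandatory table properties follow the documented syntax; the topic, broker list, and columns are illustrative):

CREATE EXTERNAL TABLE kafka_truck_geo_events (
  driverid INT,
  latitude DOUBLE,
  longitude DOUBLE,
  event_time TIMESTAMP
)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "truck-geo-events",           -- mandatory: the Kafka topic to map
  "kafka.bootstrap.servers" = "broker1:9092"    -- mandatory: the broker connection string
);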
11 / 18
● Recall that records in Kafka are stored as key-value pairs; therefore, the user needs to supply the serialization/deserialization classes to transform the value byte array into a set of columns.
● The serializer/deserializer is supplied using the table property "kafka.serde.class". As of today the default is JsonSerDe, and out-of-the-box serdes are available for formats such as CSV, Avro, and others; see the sketch below.
● In addition to the schema columns defined in the DDL, the storage handler captures metadata columns for the Kafka topic, including partition, timestamp, and offset. The metadata columns allow Hive to optimize queries for "time travel," partition pruning, and offset-based seeks.
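For example, switching the table to Avro is a one-property change, and the metadata columns can be selected like any other column (table name illustrative; the serde class is the standard Hive AvroSerDe):

ALTER TABLE kafka_truck_geo_events
SET TBLPROPERTIES ("kafka.serde.class" = "org.apache.hadoop.hive.serde2.avro.AvroSerDe");

SELECT `__partition`, `__offset`, `__timestamp`, driverid
FROM kafka_truck_geo_events
LIMIT 10;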
12 / 18
(Image-only slide.)
13 / 18
Time Travel, Partition Pruning, and Offset-Based Seeks: Optimizations for Fast SQL in Kafka
● From Kafka release 0.10.1 onwards, every Kafka message has a timestamp associated with it. The semantics of this timestamp are configurable (e.g., value assigned by the producer, when the leader receives the message, when a consumer receives the message, etc.).
● Hive adds this timestamp field as a column to the Kafka Hive table. With this column, users can use filter predicates to time travel (e.g., read only records past a given point in time), as in the sketch below.
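A minimal time-travel filter (table name illustrative; `__timestamp` is in milliseconds while to_unix_timestamp returns seconds, hence the factor of 1000):

SELECT *
FROM kafka_truck_geo_events
WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - INTERVAL '10' MINUTES);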
14 / 18
● This is achieved using the Kafka consumer public API offsetsForTimes, which returns, for each partition, the earliest offset whose timestamp is greater than or equal to the given timestamp.
● Hive parses the filter expression tree and looks for any predicate of the following form to provide time-based offset optimizations: __timestamp [>=, >, =] constant_int64. This allows the scan to begin at the matching offsets rather than reading each partition from its beginning.
15 / 18
● Customers leverage this powerful optimization by creating time-based views over the streaming data in Kafka, like the following (e.g., a view over the last 15 minutes of the streaming data in Kafka topic kafka_truck_geo_events). The implementation of this optimization can be found in KafkaScanTrimmer#buildScanForTimesPredicate.
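A sketch of such a view (the view name is illustrative):

CREATE VIEW kafka_truck_geo_events_last_15_min AS
SELECT *
FROM kafka_truck_geo_events
WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - INTERVAL '15' MINUTES);

Because the predicate is over `__timestamp`, every query against this view is rewritten into offset seeks rather than a full-topic scan.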
16 / 18
● Partition pruning is another powerful optimization. Each Kafka Hive table is partitioned based on the __partition metadata column. Any filter predicate over the __partition column can be used to eliminate the unused partitions, as sketched below. See the optimization implementation in KafkaScanTrimmer#buildScanFromPartitionPredicate.
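For example (table name illustrative), a predicate restricting the scan to a single Kafka partition:

SELECT count(*)
FROM kafka_truck_geo_events
WHERE `__partition` = 0;  -- only partition 0 is consumed; all other partitions are pruned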
17 / 18
● Kafka Hive also takes advantage of offset-based seeks, which allow users to seek to a specific offset in the stream. Thus any predicate that can be used as a start point, e.g., __offset > constant_int64, can be used to seek in the stream. Supported operators are =, >, >=, <, <=.
● See the optimization implementation in KafkaScanTrimmer#buildScanFromOffsetPredicate; a combined example follows.
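Combining the metadata predicates yields precise, low-I/O slices of the stream (table name and offsets illustrative):

SELECT *
FROM kafka_truck_geo_events
WHERE `__partition` = 0
  AND `__offset` >= 100000   -- seek directly to this offset
  AND `__offset` <  110000;  -- stop after 10,000 records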
18 / 18
Demo Video