Kafka SQL: What Our Kafka Customers Have Been Asking For
● Stream processing engines/libraries like Kafka Streams provide a programmatic stream processing access pattern to Kafka.
● Application developers love this access pattern, but when you talk to BI developers, their analytics requirements are quite different: they focus on use cases around ad hoc analytics, data exploration, and trend discovery. BI persona requirements for Kafka access include:
● Treat Kafka topics/streams as tables.
● Support for ANSI SQL.
● Support complex joins (different join keys, multi-way joins, join predicates on non-table keys, non-equi joins, multiple joins in the same query).
● UDF support for extensibility.
● JDBC/ODBC support.
● Creating views for column masking.
● Rich ACL support, including column-level security.
● To address these requirements, the new HDP 3.1 release has added a new Hive Storage Handler for Kafka, which allows users to view Kafka topics as Hive tables.
● This new feature allows BI developers to take full advantage of Hive's analytical operations/capabilities, including complex joins, aggregations, window functions, UDFs, pushdown predicate filtering, windowing, etc.
Kafka Hive C-A-T (Connect, Analyze, Transform)
● The goal of the Hive-Kafka integration is to enable users to quickly connect to, analyze, and transform data in Kafka via SQL.
● Connect: Users will be able to create an external table that maps to a Kafka topic without actually copying or materializing the data to HDFS or any other persistent storage. Using this table, users will be able to run any SQL statement with out-of-the-box authentication and authorization support using Ranger.
● Analyze: Leverage Kafka time-travel capabilities and offset-based seeks in order to minimize I/O. Having such capabilities enables both ad hoc queries across time slices in the stream and exactly-once offloading by controlling the position in the stream.
● Transform: Users will be able to mask, join, aggregate, and change the serialization encoding of the original stream and create a stream persisted in a Kafka topic. Joins can be against any dimension table or any stream. Users will also be able to offload the data from Kafka to the Hive warehouse (e.g., HDFS, S3, etc.), as sketched below.
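As a rough illustration of the Transform/offload step, here is a minimal sketch assuming a Kafka-backed table named kafka_events, a warehouse table named events_archive, and a second Kafka-backed table named kafka_events_masked; all table and column names are hypothetical, and mask_hash is used illustratively as a Hive masking UDF.

    -- Offload a copy of the stream into an ORC table in the Hive warehouse (HDFS/S3)
    -- (hypothetical table and column names)
    CREATE TABLE events_archive STORED AS ORC AS
    SELECT user_id, event_type, event_time
    FROM kafka_events;

    -- Or write a masked/transformed stream back to another Kafka-backed table
    INSERT INTO TABLE kafka_events_masked
    SELECT mask_hash(user_id), event_type, event_time
    FROM kafka_events;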
The Connect of Kafka Hive C-A-T
● To connect to a Kafka topic, execute a DDL statement to create an external Hive table representing a live view of the Kafka stream.
● The external table definition is handled by a storage handler implementation called 'KafkaStorageHandler.'
● The storage handler relies on two mandatory table properties to map the Kafka topic name and the Kafka broker connection string. A sample DDL statement is shown below.
● Recall that records in Kafka are stored as key-value pairs, therefore the user needs to supply the serialization/deserialization classes to transform the value byte array into a set of columns.
● Serialization/deserialization classes are supplied using the table property "kafka.serde.class." As of today the default is JsonSerDe, and there are out-of-the-box SerDes for formats such as CSV, Avro, and others.
● In addition to the schema columns defined in the DDL, the storage handler captures metadata columns for the Kafka topic, including partition, timestamp, and offset. The metadata columns allow Hive to optimize queries for "time travel," partition pruning, and offset-based seeks, as illustrated below.
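As a hedged sketch against the illustrative kafka_events_table from the earlier DDL: a table over an Avro-encoded topic would add the SerDe property to its TBLPROPERTIES, and the metadata columns can be queried like any other column (the SerDe class shown is an assumption based on Hive's standard Avro SerDe).

    -- For an Avro-encoded topic, the DDL's TBLPROPERTIES would also include, e.g.:
    --   "kafka.serde.class" = "org.apache.hadoop.hive.serde2.avro.AvroSerDe"

    -- Metadata columns are exposed alongside the schema columns,
    -- e.g. to inspect the latest offset seen in each partition:
    SELECT `__partition`, max(`__offset`) AS latest_offset, count(*) AS record_count
    FROM kafka_events_table
    GROUP BY `__partition`;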
Time Travel, Partition Pruning, and Offset-Based Seeks: Optimizations for Fast SQL in Kafka
● From Kafka release 0.10.1 onwards, every Kafka message has a timestamp associated with it. The semantics of this timestamp are configurable (e.g., value assigned by the producer, when the leader receives the message, when a consumer receives the message, etc.).
● Hive adds this timestamp field as a column to the Kafka Hive table. With this column, users can use filter predicates to time travel (e.g., read only records past a given point in time), as in the sketch below.
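A minimal sketch of such a time-travel predicate, assuming the illustrative kafka_events_table and an arbitrary 10-minute window:

    -- __timestamp holds epoch milliseconds, hence the multiplication by 1000
    SELECT *
    FROM kafka_events_table
    WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - INTERVAL '10' MINUTES);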
● This is achieved using the Kafka consumer public API offsetsForTimes, which returns, for each partition, the earliest offset whose timestamp is greater than or equal to the given timestamp.
● Hive parses the filter expression tree and looks for any predicate of the following form to provide time-based offset optimization: __timestamp [>=, >, =] constant_int64. This allows the scan to start at the computed offsets rather than reading each partition from the beginning.
● Customers leverage this powerful optimization by creating time-based views over the data streaming into Kafka, e.g., a view over the last 15 minutes of the streaming data in the Kafka topic kafka_truck_geo_events, as sketched below. The implementation of this optimization can be found here: KafkaScanTrimmer#buildScanForTimesPredicate
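The view definition itself is not reproduced in this text; a minimal sketch, assuming a Kafka-backed table named kafka_truck_geo_events and a hypothetical view name, might look like:

    -- View restricted to roughly the last 15 minutes of the stream;
    -- the __timestamp predicate lets Hive seek directly to the matching offsets
    CREATE VIEW truck_geo_events_last_15_min AS
    SELECT *
    FROM kafka_truck_geo_events
    WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - INTERVAL '15' MINUTES);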
● Partition pruning is another powerful optimization. Each Kafka Hive table is partitioned based on the __partition metadata column. Any filter predicate over the __partition column can be used to eliminate the unused partitions, as in the example below. See the optimization implementation here: KafkaScanTrimmer#buildScanFromPartitionPredicate
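For instance, a hypothetical query that touches only two partitions of the same illustrative table:

    -- Only partitions 0 and 1 are scanned; the remaining partitions are pruned
    SELECT *
    FROM kafka_truck_geo_events
    WHERE `__partition` IN (0, 1)
      AND `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - INTERVAL '15' MINUTES);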
● Kafka Hive also takes advantage of offset-based seeks, which allow users to seek to a specific offset in the stream. Thus any predicate that can be used as a start point, e.g., __offset > constant_int64, can be used to seek in the stream. Supported operators are =, >, >=, <, <=.
● See the optimization implementation here: KafkaScanTrimmer#buildScanFromOffsetPredicate, and the example below.
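To close with a hedged example of an offset-based seek (again using the illustrative table and arbitrary offset values):

    -- Read a narrow slice of partition 0 between two known offsets
    SELECT *
    FROM kafka_truck_geo_events
    WHERE `__partition` = 0
      AND `__offset` >= 100000
      AND `__offset` < 100500;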