Streaming data systems have grown rapidly in importance within the modern data stack. Kafka’s ksqlDB provides an interface for analytic tools that speak SQL. Apache Superset, the most popular modern open-source visualization and analytics solution, plugs into nearly any data source that speaks SQL, including Kafka. Here, we review and compare methods for connecting Kafka to Superset to enable streaming analytics use cases including anomaly detection, operational monitoring, and online data integration.
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
1. Streaming Data Analytics with ksqlDB and Superset
w/ Robert Stolz
Email: robert@preset.io
GitHub: garden-of-delete
Find me on the Superset Slack!
2. Who am I?
● Data Engineer and Developer Advocate @ Preset
● Background in scientific research, computational
biology, mathematics, open-source software
● Data architecture and best practices nerd
● New(ish) to Kafka
3. Agenda
• The history and anatomy of Apache Superset
• What Superset offers a streaming data architecture
• Streaming analytics w/ Kafka: paths and challenges
Feel free to ask questions as they come up
Keep an eye out for this series on the Preset Blog!
6. Apache Superset
● Dynamic Dashboards: dashboard filters and Jinja templating enable end users to drill deeper into data
● No Code Exploration: create beautiful, complex charts from your data without having to write any code
● SQL Lab: state-of-the-art SQL IDE with a rich metadata browser for deeper analysis
● Rich Visualizations: beautiful array of interactive visualizations, including geospatial
● Granular Permissions: row-level security, configurable data policies
● Semantic Layer: support for virtual columns, virtual tables, view creation, and more
● Caching: reduce load on the database for faster queries and faster results
● Modern Datastack Support: connect to any SQL-speaking database, including popular cloud data warehouses and SQL engines
● Alerts & Reports: get notified via Slack or email when dips or spikes happen in your data
● Custom Viz Plugins: build your own custom visualization plug-in or connect to popular 3rd-party plug-ins
9. Value proposition of open-source BI
● Extensibility: custom analytics, embedding, piecemeal
● Control: avoid vendor lock-in
● Cost: free to use and modify, but can be expensive to maintain an
enterprise deployment
● Quality: open-source is a better process for making software
18. Why connect streaming data to the BI layer?
● BI is one of the primary sensory organs of modern organizations
● Faster well-informed decision-making is a generally desirable thing
● Many more specific business use-cases require fast response to external events
○ Anomaly detection
○ Location and time-sensitive services
○ Extreme event monitoring
○ Visualizing and analyzing a real-world process that is constantly evolving
19. The Question
Want to understand: what paths exist for getting streaming data from
Kafka into Superset? (and more generally into the BI/analytics layer)
Distinct from wanting to analyze metadata from a Kafka deployment
20. Best practice: Intermediate datastore
22. Direct connection
- Superset would need to consume data from Kafka topics directly
- Undesirable to have data live in the BI/Analytics layer
23. Streaming Analytics w/ Superset + ksqlDB
- ksqlDB provides a SQL speaking interface for data in Kafka topics
- Powered by Kafka’s stream processing framework
24. Streaming Analytics w/ Superset + ksqlDB
- No SQLAlchemy dialect for ksqlDB (as of today)
- Probably undesirable to have historical data, complex aggregates,
etc. accessible only through Kafka’s stream-processing framework
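Without a SQLAlchemy dialect, one way to reach ksqlDB from Python is through its REST API. The following is a minimal sketch, not an established integration: it assumes a ksqlDB server at `http://localhost:8088`, and the table name `pageviews_by_user` in the usage note is hypothetical. It builds a request for the `/query` endpoint and, when a server is reachable, decodes the response of a pull query.

```python
import json
import urllib.request

# Assumption: a ksqlDB server is listening locally on its default port.
KSQLDB_URL = "http://localhost:8088"


def build_query_request(sql: str, base_url: str = KSQLDB_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for ksqlDB's /query endpoint."""
    statement = sql.strip()
    if not statement.endswith(";"):
        statement += ";"  # ksqlDB statements are terminated with a semicolon
    body = json.dumps({"ksql": statement, "streamsProperties": {}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
        method="POST",
    )


def run_pull_query(sql: str, base_url: str = KSQLDB_URL):
    """Send a pull query and return the decoded JSON response.

    Requires a running ksqlDB server. Push queries (EMIT CHANGES) stream
    results and would need incremental reading instead.
    """
    request = build_query_request(sql, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `run_pull_query("SELECT * FROM pageviews_by_user WHERE user_id = 'u1'")` would return the rows for that key. This is a point lookup against materialized state, which illustrates why history and complex aggregates are better served from a dedicated datastore.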
25. Best-practice: Intermediate datastore
- Desirable properties: high write-volume, robust support for event
data, low read-after-write latency, integrated Kafka consumer
26. Best-practice: Intermediate datastore
- Desirable properties: high write-volume, robust support for time-series
data, low read-after-write latency, integrated Kafka consumer
- Druid, ClickHouse, Rockset, Pinot, Cassandra, etc.
28. Path 1: Integrated consumer
- Integrated consumers ingest event data directly from Kafka topics
- Transformation can be handled by the datastore or by Kafka Streams
- Best performance, limited flexibility in choice of datastore
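As an illustration of an integrated consumer, Apache Druid can be told to read a Kafka topic by submitting a supervisor spec. The sketch below only builds such a spec as a Python dictionary. The topic, datasource, and column names are hypothetical, and the field names reflect Druid's Kafka ingestion spec as commonly documented; check them against the documentation for your Druid version before use.

```python
import json


def kafka_supervisor_spec(topic, datasource, bootstrap_servers,
                          timestamp_column="ts", dimensions=()):
    """Return a Druid Kafka supervisor spec as a plain dict.

    All argument values are illustrative; the structure should be
    verified against the Druid version in use.
    """
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Which field of each event carries the event time
                "timestampSpec": {"column": timestamp_column, "format": "iso"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


if __name__ == "__main__":
    spec = kafka_supervisor_spec(
        topic="clicks",                       # hypothetical topic
        datasource="clicks",                  # hypothetical Druid datasource
        bootstrap_servers="localhost:9092",   # assumed local broker
        dimensions=["user_id", "action"],
    )
    print(json.dumps(spec, indent=2))
```

The printed JSON is what would be submitted to Druid's supervisor API. Once ingestion is running, Superset queries the Druid datasource like any other SQL table and never talks to Kafka itself.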
29. Path 2: ksqlDB connection
- Some transformation tasks are handled by ksqlDB (Kafka Streams)
- Expands the list of possible intermediate datastores
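To make the division of labor concrete, here is a hedged sketch of pushing an aggregation into ksqlDB. It renders a persistent `CREATE TABLE ... AS SELECT` statement with a tumbling window and wraps it in the JSON body that ksqlDB's `/ksql` statement endpoint accepts. The stream, table, and column names are hypothetical, and the helper functions are illustrative rather than part of any library.

```python
import json


def windowed_count_statement(source_stream, target_table, key_column, window_minutes=1):
    """Render a ksqlDB statement that counts events per key per tumbling window."""
    for name in (source_stream, target_table, key_column):
        # Crude guard so only simple identifiers are interpolated into SQL
        if not name.isidentifier():
            raise ValueError(f"not a simple identifier: {name!r}")
    minutes = int(window_minutes)
    if minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return (
        f"CREATE TABLE {target_table} AS "
        f"SELECT {key_column}, COUNT(*) AS event_count "
        f"FROM {source_stream} "
        f"WINDOW TUMBLING (SIZE {minutes} MINUTES) "
        f"GROUP BY {key_column} "
        f"EMIT CHANGES;"
    )


def ksql_payload(statement):
    """Wrap a statement in the JSON body expected by the /ksql endpoint."""
    return json.dumps({"ksql": statement, "streamsProperties": {}})


if __name__ == "__main__":
    # Hypothetical names: a `pageviews` stream aggregated into `pageviews_per_user`
    print(ksql_payload(windowed_count_statement("pageviews", "pageviews_per_user", "user_id", 5)))
```

The resulting table is backed by a Kafka topic, so the pre-aggregated output can be delivered to whichever datastore Superset reads from, for example through a sink connector.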
30. Path 3: Ad-hoc consumers
- Maximum flexibility around choice of datastore
- Comes at the expense of performance
- Can be harder to maintain
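An ad-hoc consumer is ordinary application code that reads messages and writes them to a datastore of your choice. The sketch below shows the core loop under stated assumptions: messages arrive as JSON-encoded bytes with hypothetical fields `user_id`, `action`, and `ts`, and SQLite stands in for the real datastore so the example is self-contained. In practice the iterable would come from a Kafka client library, for instance the `.value` of each record yielded by a consumer.

```python
import json
import sqlite3
from typing import Iterable


def load_events(raw_messages: Iterable[bytes], conn: sqlite3.Connection,
                batch_size: int = 500) -> int:
    """Parse JSON messages and insert them in batches; return rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id TEXT, action TEXT, ts TEXT)"
    )
    batch, written = [], 0

    def flush():
        nonlocal written
        if batch:
            conn.executemany("INSERT INTO events VALUES (?, ?, ?)", batch)
            conn.commit()
            written += len(batch)
            batch.clear()

    for raw in raw_messages:
        try:
            event = json.loads(raw)
            row = (event["user_id"], event["action"], event["ts"])
        except (ValueError, KeyError, TypeError):
            continue  # skip malformed or incomplete messages
        batch.append(row)
        if len(batch) >= batch_size:
            flush()
    flush()  # write whatever remains once the iterable is exhausted
    return written


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sample = [
        b'{"user_id": "u1", "action": "click", "ts": "2024-01-01T00:00:00Z"}',
        b"not json",
    ]
    print(load_events(sample, conn))
```

Even this small loop has to decide how to batch, what to do with bad messages, and when to commit. A production consumer would also need offset management, retries, and a time-based flush for an unbounded stream, which is where the maintenance cost of this path comes from.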
31. Superset fits into batch and streaming data architectures
Src: Designing Cloud Data Platforms by Danil Zburivsky and Lynda Partner
32. Three ways to run Superset
Manual Setup
• Complex set-up
• Maximum control over configuration
• Good for enterprise deployments
• Advanced features require additional set-up (Async Queries, Query Caching, Prophet integration, Dashboard thumbnails, Alerts and Reports)
Docker-compose
• Easiest set-up
• Great for trying out Superset and local development
• Some features are part of the stack by default (caching) and some aren’t (alerts and reports, Prophet integration)
Preset Cloud
• No set-up
• Good for individual evaluation all the way up to enterprise needs
• All advanced Superset features available
• Still FREE for small teams!