© Instaclustr Pty Limited, 2024
30 Of My Favourite
Open Source
Technologies
In 30 Minutes
Paul Brebner
Open Source Technology Evangelist
Paul Brebner (Netherlands 30
minutes bike parking zone)
What do they have in Common?
• Instaclustr provides some as
managed services
• They are complementary and
can be used together
• And I’ve used them to build
realistic demo applications
over the last 7 years
A Strange Toy I Found At The Shop
• What’s that?!
• An escaped “Pokemon”!
• When my kids were growing up Pokemon lived inside a
“Game Boy”
Format
• Name, Overview, Superpower(s), Watch out for …
• E.g. “Pokemon”
• Name: Charmander
• What: A fire lizard
• Superpower: Evolves to Charizard, a flying, fire-breathing lizard
• Watch out for: Water
+ Use Cases and What’s New?
Countdown!
Flickr CCL + Wikimedia CCL
1. Apache Cassandra
Office Typing
Pool, 1918
Wikipedia Public Domain
Apache Cassandra
• What?
• NoSQL Horizontally Scalable Key-Value Database
• Superpowers
• Fast Writes (lots of typewriters)
• Wide Column Store
• Clustering Columns, good for hierarchical data modelling (E.g. Geospatial)
• In-built multi-DC replication
• My Use Cases
Anomaly Detection: 19 Million
checks/day
Apache Cassandra
Apache Kafka
Kubernetes
And more
Global low-latency Fintech
Apache Cassandra
• Watch Out For
• CQL != SQL
• Different data model
§ Design for reads
§ De-normalization is normal
• Consistency < traditional SQL databases
• Reads are slower
• What’s New?
• Vector Search in 5.0
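To make "design for reads" and "de-normalization is normal" a little more concrete, here is a minimal Python sketch. It uses plain dictionaries, not a Cassandra driver, and the table names, keys and sample rows are all hypothetical. The idea it illustrates: each read query gets its own table, the partition key decides which partition a row lives in, and a clustering column keeps rows sorted inside that partition.

```python
from bisect import insort
from collections import defaultdict

# Each "table" maps a partition key to rows kept sorted by a clustering column.
# Plain-Python stand-in for illustration only, not the Cassandra driver API.
checks_by_account = defaultdict(list)   # query: checks for an account, by time
checks_by_location = defaultdict(list)  # query: checks in a location, by time


def record_check(account_id, location, timestamp, amount):
    """Write the same event to every table that serves a read query."""
    insort(checks_by_account[account_id], (timestamp, location, amount))
    insort(checks_by_location[location], (timestamp, account_id, amount))


def recent_checks(table, partition_key, limit):
    """Read a single partition, newest rows first, with no joins."""
    return list(reversed(table[partition_key][-limit:]))


record_check("acct-1", "Sydney", 100, 25.0)
record_check("acct-1", "Perth", 105, 40.0)
record_check("acct-2", "Sydney", 110, 12.5)

print(recent_checks(checks_by_account, "acct-1", 2))
print(recent_checks(checks_by_location, "Sydney", 2))
```

The write path does more work (one write per table) so that every read touches exactly one partition, which matches the "fast writes" superpower.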
2. Apache Spark
Car Factory
Assembly Line
Wikimedia Public Domain
Apache Spark
• What?
• Cluster batch/stream processing, analytics and ML
• Superpowers
• In-memory → fast
• Good support for ML
§ + Cassandra (wide columns) as a feature store
• Good for heavy transformation operations at scale
• My Use Cases
ML of Cassandra Monitoring Data
Apache Spark
Apache Cassandra
MLlib
DataFrames
Spark Streaming
Apache Spark
• Watch Out For
• Lots of RAM, else OOM (Out-of-Memory Errors)
• Spark Streaming is near real-time (micro-batch)
• What’s New?
• 3.4 has Spark Connect for a decoupled client-server architecture
• Ocean for Apache Spark
(Spot by NetApp)
3. Apache Zeppelin
Graf Zeppelin
exploring the
Arctic, 1931
Wikimedia Public Domain
Apache Zeppelin
• What?
• Web-based notebook for data exploration
• Superpowers
• Interactive “notebook” style tool
• Supports Apache Spark
Apache Zeppelin
• Watch Out For
• Sufficient Zeppelin resources
• We don’t support it anymore
• What’s New?
• Jupyter Notebook!
§ Good Kafka and Cassandra integration
The Galilean moons of Jupiter (Wikimedia CCL)
4. Apache Lucene
A Librarian using
a card catalogue
(1940)
Library of Congress Public
Domain
Apache Lucene
• What?
• Fast Full-featured Search Engine
• Superpowers
• Lucene plugin + Cassandra for enhanced Cassandra search
§ Works as a Cassandra secondary index
§ Supports Vector Search too
• Watch Out For
• Performance
• We currently support it: https://github.com/instaclustr/cassandra-lucene-index
• My Use Cases
Geospatial Anomaly Detection
Apache Cassandra
Apache Lucene Plugin
Geospatial searches
Wikimedia Public Domain
5. Apache Kafka
Postal Delivery
Service
Railway Post
Office:
Mail bags
snatched by
speeding train
Wikimedia CCL
Apache Kafka
• What?
• Distributed publish-subscribe messaging system
• Superpowers
• Fast
• Highly distributed and horizontally scalable, available and durable
• Buffering and message replay
• My Use Cases
Xmas Tree Lights Simulation
“Kongo” IoT Logistics Simulation
Apache Kafka
Guava Event Bus
Real-time logistics
Tracking and checking
Anomaly Detection: 19 Million
checks/day
Apache Cassandra
Apache Kafka
Kubernetes
And more
Apache Kafka
• Watch Out For
• Too many topics/partitions impacts throughput
• What’s New?
• KRaft (replacing ZooKeeper) for faster meta-data operations
§ And maybe even faster data workloads
• Tiered Storage (3.6)
• End-to-end client monitoring (3.7)
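As a rough illustration of the "buffering and message replay" superpower, the following Python sketch models one topic partition as an append-only log and a consumer as nothing more than an offset into it. The class and method names are hypothetical stand-ins, not the Kafka client API. The point is that reading does not remove records, so a consumer can seek back and replay.

```python
class TopicPartition:
    """Append-only log; records stay in the log after being read."""

    def __init__(self):
        self.log = []

    def append(self, value):
        self.log.append(value)
        return len(self.log) - 1  # offset of the record just written


class Consumer:
    """Tracks its own position; the log itself is never modified by reads."""

    def __init__(self, partition):
        self.partition = partition
        self.offset = 0  # next offset to read

    def poll(self, max_records=10):
        batch = self.partition.log[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        self.offset = offset


partition = TopicPartition()
for event in ["e0", "e1", "e2"]:
    partition.append(event)

consumer = Consumer(partition)
first_pass = consumer.poll()   # reads everything buffered so far
consumer.seek(1)               # rewind to an earlier offset
replayed = consumer.poll()     # the same records are delivered again
print(first_pass, replayed)
```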
6. Apache Kafka Streams
Niagara Falls
Daredevil
Shutterstock
Apache Kafka Streams
• What?
• Stream processing API and client for Kafka
• From/to Kafka cluster
• Superpowers
• Complex stateful stream processing operations (e.g. joins)
• Over time windows and multiple topics and state stores
• My Use Cases
Kafka Streams IoT Application
Truck Overload
Apache Kafka Streams
• Watch Out For
• Complex stream topologies
• Debugging is tricky
• Performance
• What’s New?
• Alternatives (E.g. Apache Flink, RisingWave, etc)
7. Apache Kafka Connect
Telephone
Switchboard
Operators
Connecting Calls
Wikimedia Public Domain
Apache Kafka Connect
• What?
• Kafka API for streaming from source to sink systems
• Via Kafka cluster
• Superpowers
• Heterogeneous integration
• Code-free – just connector configuration
• Independently scalable
§ connectors run on independent Kafka Connect cluster
• My Use Cases
Zero-code Data Pipelines
REST Tidal Data to PostgreSQL + Superset
REST Tidal Data to OpenSearch
OpenSearch sink
connector
Apache Kafka Connect
• Watch Out For
• Open-source connector evaluation and selection
• Error handling
• Source/sink system scalability
• What’s New?
• Debezium
8. Kafka MirrorMaker 2 (MM2)
Head of Kafka
Replicated tiers
move
Shutterstock
Kafka MirrorMaker 2
• What?
• Replicates Kafka topics between clusters
• Superpowers
• Uses Kafka Connect (but reads/writes from/to Kafka clusters)
• Topic renaming, prevents loops
• Complex bi-directional topologies
• Many use cases for multiple Kafka clusters:
§ Cluster migration
§ Geographical distribution
§ Low latency, redundancy
§ Fan-out architectures
§ Edge computing, etc
Kafka MirrorMaker 2
• Watch Out For
• Bi-directional flow requires TWO Kafka Connect Clusters
• Duplicate events (from overlapping topic subscriptions)
• Use topic renaming and the default source cluster alias to
§ Prevent cycles and infinite topic creation
• What’s New?
• For me, automated consumer offset sync across clusters
§ In 2.7.0 (2020)!
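The way topic renaming prevents cycles can be sketched in a few lines of Python. This is a simplified model that assumes the default naming convention of prefixing a replicated topic with the source cluster alias and a "." separator. The function names and the cluster aliases "A" and "B" are hypothetical, and the real MirrorMaker 2 logic has more cases than this.

```python
SEPARATOR = "."


def remote_topic_name(source_alias, topic):
    """Name given to a replicated topic on the target cluster."""
    return f"{source_alias}{SEPARATOR}{topic}"


def is_cycle(topic, target_alias, known_aliases):
    """True if the topic name shows it already passed through the target."""
    while SEPARATOR in topic:
        prefix, rest = topic.split(SEPARATOR, 1)
        if prefix not in known_aliases:
            return False  # an ordinary topic that happens to contain a dot
        if prefix == target_alias:
            return True
        topic = rest
    return False


def topics_to_replicate(topics, source_alias, target_alias):
    """Map each topic that should flow source -> target to its remote name."""
    aliases = {source_alias, target_alias}
    return {
        t: remote_topic_name(source_alias, t)
        for t in topics
        if not is_cycle(t, target_alias, aliases)
    }


# Bi-directional flow between clusters A and B, both with a local "orders" topic.
a_to_b = topics_to_replicate(["orders", "B.orders"], "A", "B")
b_to_a = topics_to_replicate(["orders", "A.orders"], "B", "A")
print(a_to_b, b_to_a)
```

Without the prefix check, "B.orders" would be copied back to B as "A.B.orders", then back to A again, and so on, which is the infinite topic creation the slide warns about.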
9. Apache Camel
Camel Train In
Broome, WA
(Adobe Stock by scottimage)
Apache Camel - Kafka Connectors
• What?
• Apache Camel – Integration framework
• Apache Camel Kafka Connectors – open source Kafka connectors
• Superpowers
• Large number of open source Kafka Connectors – 172 (officially), 179 sources and sinks
• Auto-generated from Camel components
Apache Camel - Kafka Connectors
• Watch Out For
• Configuration!
§ Need to read (1) Camel component, (2) Basic connector configuration, and (3) connector specific documentation
• Some connectors are both sources and sinks (source or sink depends on configuration)
• What’s New?
• Kamelets!
§ Can appear in the configuration
10. Kafka Parallel Consumer
Jacquard Loom,
Berlin
(Paul Brebner)
Kafka Parallel Consumer
• What?
• Multi-threaded Kafka Consumer
• Superpowers
• Multi-threaded, cf. the default single-threaded consumer
• Higher concurrency with fewer consumers and partitions
• Use Cases
• Low latency, High Throughput
• Slow consumers
• Replacement for my multiple pool consumer hack
Kafka Parallel Consumer
• Watch Out For
• Configure for
§ Ordering mode
• Partition → Key → Unordered (Increasing concurrency)
§ Max threads
• What’s New?
• Choice of commit modes
§ Consumer Asynchronous, Synchronous and Producer Transactions
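The effect of the ordering mode on concurrency can be sketched with a small calculation. The Python below is only an illustration of the Partition → Key → Unordered progression, not the Parallel Consumer library itself. The mode strings simply mirror the three modes on the slide, and the sample records and thread limits are made up.

```python
from collections import namedtuple

Record = namedtuple("Record", ["partition", "key", "value"])


def max_concurrency(records, ordering_mode, max_threads):
    """Upper bound on records that can be processed at the same time."""
    if ordering_mode == "PARTITION":
        units = {r.partition for r in records}            # one in flight per partition
    elif ordering_mode == "KEY":
        units = {(r.partition, r.key) for r in records}   # one in flight per key
    elif ordering_mode == "UNORDERED":
        units = range(len(records))                       # no ordering constraint
    else:
        raise ValueError(f"unknown ordering mode: {ordering_mode}")
    return min(len(units), max_threads)


batch = [
    Record(0, "a", 1), Record(0, "b", 2), Record(0, "a", 3),
    Record(1, "c", 4), Record(1, "d", 5), Record(1, "c", 6),
]

for mode in ("PARTITION", "KEY", "UNORDERED"):
    print(mode, max_concurrency(batch, mode, max_threads=16))
```

With only two partitions, partition ordering allows two records in flight, key ordering allows four, and unordered allows all six, until the max threads setting becomes the limit.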
11. Apache ZooKeeper
12. Apache Curator
Being a
ZooKeeper in
Australia can be
risky!
(Shutterstock)
Apache ZooKeeper
• What?
• Distributed systems coordination and
meta-data management
• Superpowers
• High consistency, availability and performance (reads)
• Use Cases
• Until recently, used in Kafka, Pulsar, etc
Apache ZooKeeper (and Curator)
Meet the Dining Philosophers
Wikipedia CCL
Apache ZooKeeper
• Watch Out For
• Low-level
• Apache Curator (high level ZK client) is better with
§ Leader Latch
§ Shared Lock
§ Shared Counter
• Scalability limitations
• Slow for writes, max cluster size is 7 servers
• What’s New?
• KRaft – Kafka-based Raft implementation
§ For meta-data management and leader election
§ Faster meta-data operations, more partitions etc. Potentially faster data workloads
13. Kubernetes
Greek Triremes
ruled the seas
Captained by
Helmsmen
(Kubernetes)
(Wikipedia CCL)
Kubernetes
• What?
• Automation of containerized applications
• Superpowers
• Available on public clouds, E.g. AWS EKS
• Ephemeral Pods are the unit of concurrency
• Easy to scale applications (more or fewer Pods)
• My Use Cases
Anomaly Detection: 19 Million
checks/day
Apache Cassandra
Apache Kafka
Kubernetes
And more
Kubernetes
• Watch Out For
• Pod and resource scaling
§ Easy to create many Pods
• With insufficient or excessive resources
• Tuning the application can be tricky
§ Optimize the number of Pods vs Kafka consumers/partitions, Cassandra database
connections, etc
• What’s New?
• Operators
§ E.g. Strimzi for Kafka
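One part of the Pods vs Kafka consumers/partitions trade-off is simple arithmetic: within a consumer group a partition is read by at most one consumer, so consumers beyond the partition count sit idle. The Python sketch below shows that calculation. The function names and the numbers are hypothetical, and it ignores other limits such as Cassandra connections per Pod.

```python
def consumer_utilisation(pods, consumers_per_pod, partitions):
    """Active vs idle consumers for one consumer group reading one topic."""
    total = pods * consumers_per_pod
    active = min(total, partitions)  # at most one group consumer per partition
    return {"total": total, "active": active, "idle": total - active}


def pods_needed(partitions, consumers_per_pod):
    """Smallest Pod count that gives every partition its own consumer."""
    return -(-partitions // consumers_per_pod)  # ceiling division


print(consumer_utilisation(pods=10, consumers_per_pod=4, partitions=24))
print(pods_needed(partitions=24, consumers_per_pod=4))
```

In this made-up case, scaling to 10 Pods leaves 16 consumers idle, while 6 Pods already cover all 24 partitions.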
14. Prometheus
15. Grafana
Counting on an
Abacus
(Wikimedia Public Domain)
Prometheus + Grafana
• What?
• Prometheus: Monitoring and Alerting
• Grafana: Graphing
• Superpowers
• Instrumentation or Agents (Exporters) to expose application metrics
• Time series data with counter, gauge, histogram and summary metrics
• My Use Cases
• Monitoring and scaling/optimization/debugging
§ Anomaly Detector (Cassandra, Kafka, Kubernetes) application
§ Kafka Connect data pipelines
• Instaclustr’s Monitoring API has a Prometheus version
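For a feel of what "exposing application metrics" looks like, here is a minimal Python sketch of a counter and a gauge rendered in the Prometheus text exposition format. It is a hand-rolled illustration only; a real application would normally use an official client library, and the metric names and values here are hypothetical.

```python
class Counter:
    """Value that only goes up, e.g. the number of checks performed."""

    kind = "counter"

    def __init__(self, name, help_text):
        self.name, self.help_text, self.value = name, help_text, 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount


class Gauge:
    """Value that can go up or down, e.g. current consumer lag."""

    kind = "gauge"

    def __init__(self, name, help_text):
        self.name, self.help_text, self.value = name, help_text, 0.0

    def set(self, value):
        self.value = float(value)


def render(metrics):
    """Text a Prometheus server would scrape from a metrics endpoint."""
    lines = []
    for m in metrics:
        lines.append(f"# HELP {m.name} {m.help_text}")
        lines.append(f"# TYPE {m.name} {m.kind}")
        lines.append(f"{m.name} {m.value}")
    return "\n".join(lines) + "\n"


checks = Counter("anomaly_checks_total", "Anomaly checks performed")
lag = Gauge("consumer_lag", "Records waiting to be consumed")
for _ in range(3):
    checks.inc()
lag.set(42)

exposition = render([checks, lag])
print(exposition)
```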
Prometheus + Grafana
• Watch Out For
• Need to run a Prometheus server
• Configuring Prometheus with Kubernetes is tricky
§ use Prometheus Operator
• What’s New?
• Since I used it, Grafana has moved to the AGPL license
§ modified code has to be open sourced
16. OpenTracing
17. OpenTelemetry
18. Jaeger (and others)
X-Ray Vision!
Public Domain
OpenTracing/OpenTelemetry
• What?
• OpenTracing: End-to-end distributed tracing
• Superpowers
• End-to-end distributed application visibility
§ Traces have Spans
• Visualisation of system topology and times
OpenTracing/OpenTelemetry
• Watch Out For
• Originally used OpenTracing and Jaeger
• Manual instrumentation
• What’s New?
• OpenTelemetry is the new standard
§ Tracing, metrics and logs
§ Automatic instrumentation
§ Lots of open-source visualization tools
• Jaeger, SigNoz, Uptrace, OpenSearch
§ Used in new client monitoring KIP-714, Kafka 3.7.0
SigNoz Service Map for
Toy+Boxes application
19. PostgreSQL
Elephant vs. Tree
Elephants are
Powerful
Adobe
PostgreSQL
• What?
• Powerful SQL Database
• Superpowers
• SQL + Object Database
• Extensible
• JSONB+GIN indexes (efficient storage and search of JSON)
PostgreSQL
• Watch Out For
• Scalability
§ Vertical; limited horizontal
• Benefits from connection pooling
• What’s New?
• PGVector (vector similarity search)
• Significant performance improvement
§ on NetApp Azure Files
• FerretDB (MongoDB front-end)
20. Apache Superset
All superheroes
(B) are a superset
of those who use
weapons (A)
(Shutterstock)
Apache Superset
• What?
• Powerful data visualization tool
• Superpowers
• Reads from SQL sources
• Lots of visualization and graph types including geospatial
• My Use Case
• Visualization of tidal data from a Kafka Connect pipeline
§ Easy integration with PostgreSQL + JSONB
21. OpenSearch
22. Dashboard
Library of Congress
Card Division 1919
(City block long)
(Library of Congress Public
Domain)
OpenSearch + Dashboard
• What?
• Open-source fork of Elasticsearch
• Based on Lucene → powerful + scalable text searching
• Superpowers
• Ingestion, indexing and searching of JSON documents
• Integrated dashboard for visualization
• Computational linguistics support:
§ Stemming, Lemmatization, Levenshtein Fuzzy Queries,
N-grams, Slop, Partial matching!
• My Use Cases
• Sink and visualization for a Kafka Connect
tidal data processing pipeline
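Two of the computational linguistics features listed above, Levenshtein fuzzy queries and N-grams, are easy to sketch in plain Python. This is the textbook edit-distance algorithm rather than OpenSearch's own implementation, and the vocabulary and edit limits are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(term, vocabulary, max_edits=2):
    """Indexed terms within max_edits of the search term."""
    return [w for w in vocabulary if levenshtein(term, w) <= max_edits]


def ngrams(text, n):
    """Overlapping character n-grams, useful for partial matching."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(levenshtein("kitten", "sitting"))
print(fuzzy_match("tidal", ["tidal", "tidel", "title", "kafka"], max_edits=1))
print(ngrams("tide", 2))
```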
OpenSearch + Dashboard
• Watch Out For
• Default mappings and ingestion may not work
§ E.g. geospatial data needs custom mappings and ingest pipelines
• Reindexing
• Kafka Connect Sink → OpenSearch throughput
§ Needed the BULK API
• What’s New?
• Vector Search
23. Redis
Look! Up in the sky!
It’s an in-memory
key-value store!
It’s a database!
It’s Redis!
(Shutterstock)
Redis
• What?
• Fast (in-memory) Data Structures server
• Superpowers
• Lots of data types
§ Keys, Strings, Lists, Hashes, Sets, Sorted sets, bitmaps, geospatial, streams, time series,
HyperLogLogs (approximate counting)
• Pub/Sub
§ Connected and disconnected delivery
• Client-side caching for ultra-low latency – e.g. Redisson client
Redis
• Watch Out For
• Pipeline tuning impacts throughput
• Often used as a cache to reduce load on backend database
§ I.e. better efficiency, not necessarily lower latency
• As other factors may dominate
• What’s New?
• Redis Functions
§ Code executed on the server (Redis 7)
• License change (7.4 source-available)
24. Uber’s Cadence
Railway Signal “man”
(Signalwoman!)
(Wikimedia Public Domain)
Uber’s Cadence
• What?
• Scalable code-as-workflows engine
• Superpowers
• Sequenced, stateful, long-running, scheduled steps
• Scalable and reliable using event-sourcing
§ Workflows are fail-proof: history is replayed up to the point of failure, then the workflow resumes
• My Use Cases
Drone Delivery Application
Kafka Microservices
Integration of fast/slow systems
Uber’s Cadence
• Watch Out For
• Uses Apache Cassandra and OpenSearch backends
• Code must be deterministic (replayed on failure)
§ Use special functions for non-deterministic functions
• What’s New?
• Potential use cases
§ Scalable push notifications (Uber)
§ ML workflows
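Why workflow code must be deterministic, and what a "special function" for non-deterministic steps does, can be sketched in Python. This is a toy model of replay from recorded history, not the Cadence client API. The class, the `side_effect` method and the drone delivery workflow are hypothetical names chosen for illustration.

```python
import random


class WorkflowContext:
    """Records results of non-deterministic steps so a replay can reuse them."""

    def __init__(self, history=None):
        self.history = list(history or [])  # recorded results, oldest first
        self._position = 0

    def side_effect(self, fn):
        """Run fn the first time; on replay return the recorded result."""
        if self._position < len(self.history):
            result = self.history[self._position]
        else:
            result = fn()
            self.history.append(result)
        self._position += 1
        return result


def delivery_workflow(ctx, order_id):
    # Random choices are wrapped so that a replay takes the same path.
    drone = ctx.side_effect(lambda: random.randint(1, 1000))
    eta = ctx.side_effect(lambda: random.choice([10, 20, 30]))
    return f"order {order_id}: drone {drone}, eta {eta} min"


first_run = WorkflowContext()
result = delivery_workflow(first_run, "A42")

# Simulate recovery after a failure: re-run the code against the saved history.
replay = WorkflowContext(history=first_run.history)
replayed_result = delivery_workflow(replay, "A42")
print(result == replayed_result)
```

If the workflow called `random` directly, the replay could pick a different drone and diverge from the recorded history.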
25. Debezium
Animal speed transformation (Shutterstock)
Debezium
• What?
• Change Data Capture (CDC)
• Superpowers
• Captures slow database state changes
• Turns them into fast Kafka events
• Uses Kafka: Kafka Connect, and/or DB-specific “Connectors”
• Can be used to replicate databases (same type), or send events to different sink
systems
• My Use Cases
• Debezium Cassandra Connector (doesn’t use Kafka Connect, writes to Kafka directly)
• Debezium PostgreSQL Connector (Kafka source connector)
Debezium
• Watch Out For
• The DB specific connectors need to be configured/run in the DB
• Debezium change data format is complex
§ Actual content depends on the source DB
• Schemas may be inline or just an ID
• May include schema changes
• Tricky to find Kafka Connect sink connectors that work correctly
• Duplicates and ordering issues, latency and scalability challenges
• Schema IDs require a Kafka Schema Registry
• What’s New?
• GA on Instaclustr’s managed Cassandra (Dec 2023)
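To give a feel for the change data format, here is a hedged Python sketch that applies a stream of change events to an in-memory replica. It assumes a simplified envelope with `op`, `before` and `after` fields, optionally nested under `payload` when the schema is inline. As noted above, the actual content depends on the source DB, so the field layout, the `id` key and the sample events are assumptions for illustration only.

```python
def apply_change(replica, event, key_field="id"):
    """Apply one change event to a dict acting as the sink table."""
    payload = event.get("payload", event)  # inline-schema events nest the data
    op = payload["op"]
    if op in ("c", "r", "u"):              # create, snapshot read, update
        row = payload["after"]
        replica[row[key_field]] = row
    elif op == "d":                        # delete: only "before" is populated
        row = payload["before"]
        replica.pop(row[key_field], None)
    else:
        raise ValueError(f"unsupported operation: {op}")
    return replica


events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "ordered"}},
    {"payload": {"op": "u",
                 "before": {"id": 1, "status": "ordered"},
                 "after": {"id": 1, "status": "delivered"}}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "ordered"}},
    {"op": "d", "before": {"id": 2, "status": "ordered"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)
print(replica)
```

Because the sink is keyed by `id`, replaying a duplicate create or update leaves the replica unchanged, which softens (but does not remove) the duplicate and ordering issues listed above.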
26. Karapace
Karapace in the
driver's seat!
(Shutterstock)
Karapace
• What?
• Open-source Kafka Schema Registry
• Superpowers
• Adds Schemas to Schemaless Kafka
• Supports multiple schema formats
§ Avro, Protobuf and JSON Schemas
• Kafka cluster is not directly involved
§ Karapace enforces schema checks for clients only
• Use Cases
• Debezium
Karapace
• Watch Out For
• Auto vs. manual schema registration – manual is safer in production
• Schema compatibility, compatibility modes, and evolution: complex!
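A very reduced picture of one compatibility mode, backward compatibility (a new schema must be able to read data written with the old one), is sketched below in Python. The rules are deliberately simplified: a field added without a default, or a field whose type changes, is reported as a problem, and type promotion, unions and nested records are ignored. The Avro-style schemas and the function name are hypothetical, and this is not how Karapace implements its checks.

```python
def backward_problems(old_schema, new_schema):
    """Reasons the new (reader) schema cannot read data written with the old one."""
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        old = old_fields.get(field["name"])
        if old is None:
            if "default" not in field:
                problems.append(f"new field '{field['name']}' has no default")
        elif old["type"] != field["type"]:
            problems.append(f"field '{field['name']}' changed type")
    return problems  # fields dropped by the new schema are simply ignored


order_v1 = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
]}
order_v2_ok = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "AUD"},
]}
order_v2_bad = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "string"},
    {"name": "currency", "type": "string"},
]}

print(backward_problems(order_v1, order_v2_ok))
print(backward_problems(order_v1, order_v2_bad))
```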
27. FerretDB
Fish/Shark?
(Adobe)
FerretDB
• What?
• Open-source MongoDB proxy for PostgreSQL
• Superpowers
• Compatible with MongoDB drivers on the front-end
• Pluggable backends including PostgreSQL (using JSONB/GIN indexes)
• Query Pushdown for efficiency/performance
28. RisingWave
Wave processing
(Adobe)
RisingWave
• What?
• Stream processing database – also as a Service
• Superpowers
• Stateful stream processing
§ Using Cloud Native Storage
§ Potential replacement for Kafka Streams
• PostgreSQL compatible
§ Works with Apache Superset
• My Use Cases
Santa’s Elves Toy + Box Packing
Streaming joins to match toys and boxes (Adobe)
Service Map using OpenTelemetry + SigNoz
RisingWave
• Watch Out For
• SQL != Kafka Streams DSL
• Kafka keys not propagated
• Windowing has different semantics
© Instaclustr Pty Limited, 2024
29. TensorFlow
What does the
future hold?
(Adobe)
TensorFlow
• What?
• Neural network ML library
• Superpowers
• Supports incremental ML
• From streaming Kafka data
• My Use Cases
ML Over Streaming Kafka
Data – With Concept Drift
Kafka Streams
TensorFlow
• Watch Out For
• ML over streaming spatiotemporal data with concept drifts is tricky
§ Time/space bias
• Wild model accuracy oscillation
§ Concept shift can result in very low-accuracy models initially
• Train/use Multiple Models
[Chart: Concept Drift - incremental training (time vs accuracy); series: same model, reset model, guessing; accuracy axis 0 to 1, time axis 0 to 120]
30. Yours Here
Invent your own
(DeepAI)
Integration Example 1
Our Customer Facing Monitoring
Before:
Spark and API
requests
→ High load on
Cassandra
Integration Example 1
Our Customer Facing Monitoring
After:
Kafka + Kafka
Streams + Redis
Reduced
Cassandra Load
Recent metrics
served from Redis,
or Cassandra on
cache miss
PostgreSQL
1 – get meta-data
2 – get data from Redis
3 – or from Cassandra
20k Nodes
Thanks to my colleague
Kuangda He
for this information
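The read path in the "After" design (recent metrics from Redis, or Cassandra on a cache miss) can be sketched in Python with dictionaries standing in for both systems. The key names, the statistics counters and the decision to populate the cache on a miss are assumptions made for illustration; the slide does not say how the real system handles a miss beyond reading from Cassandra.

```python
def read_metric(key, cache, store, stats):
    """Serve from the cache when possible, else fall back to the backend store."""
    if key in cache:
        stats["cache_hits"] += 1
        return cache[key]
    stats["store_reads"] += 1
    value = store[key]   # slower backend read (Cassandra on the slide)
    cache[key] = value   # assumption: populate the cache on a miss
    return value


store = {"node-1:cpu": 0.42, "node-2:cpu": 0.90}  # stand-in for Cassandra
cache = {"node-1:cpu": 0.42}                      # stand-in for Redis
stats = {"cache_hits": 0, "store_reads": 0}

values = [
    read_metric("node-1:cpu", cache, store, stats),  # hit
    read_metric("node-2:cpu", cache, store, stats),  # miss, then cached
    read_metric("node-2:cpu", cache, store, stats),  # hit
]
print(values, stats)
```

Counting backend reads rather than timing them reflects the point made on the Redis slide: the main gain is reduced load on the backend database.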
Integration Example 2
Drone Delivery Demo
Kafka
Streams
Customers
Order
Shops
Busy warnings
Uses Cassandra+OpenSearch
ML over streaming data
Demo/POC
Integration Example 2
Drone Delivery Prod?
Kafka
Streams
Customers
Order
PostgreSQL
Drone Operations
Order Tracking
Shops
Busy warnings
Uses Cassandra+OpenSearch
ML over streaming data
Drone/order locations cached in Redis
Read-through or write-behind
Kafka sink
connectors
www.instaclustr.com
info@instaclustr.com
@instaclustr
THANK
YOU!
© Instaclustr Pty Limited, 2024

30 Of My Favourite Open Source Technologies In 30 Minutes

  • 1.
    © Instaclustr PtyLimited, 2024 30 Of My Favourite Open Source Technologies Paul Brebner Open Source Technology Evangelist
  • 2.
    © Instaclustr PtyLimited, 2024 30 Of My Favourite Open Source Technologies In 30 Minutes Paul Brebner Open Source Technology Evangelist Paul Brebner (Netherlands 30 minutes bike parking zone)
  • 3.
    © Instaclustr PtyLimited, 2024
  • 4.
    © Instaclustr PtyLimited, 2024 What do they have in Common? • Instaclustr provides some as managed services • They are complementary and can be used together • And I’ve used them to build realistic demo applications over the last 7 years
  • 5.
    © Instaclustr PtyLimited, 2024 A Strange Toy I Found At The Shop • What’s that?! • An escaped “Pokemon”! • When my kids were growing up Pokemon lived inside a “Game Boy”
  • 6.
    © Instaclustr PtyLimited, 2024 Format • Name, Overview, Superpower(s), Watch out for … • E.g. “Pokemon” • Name: Charmander • What: A fire Lizard • Superpower: Evolves to Charizard, a flying fire breathing lizard • Watch out for: Water + Use Cases and What’s New?
  • 7.
    © Instaclustr PtyLimited, 2024 Countdown! Flicker CCL + Wikimedia CCL
  • 8.
    © Instaclustr PtyLimited, 2024 1. Apache Cassandra Office Typing Pool, 1918 Wikipedia Public Domain
  • 9.
    © Instaclustr PtyLimited, 2024 Apache Cassandra • What? • NoSQL Horizontally Scalable Key-Value Database • Superpowers • Fast Writes (lots of typewriters) • Wide Column Store • Clustering Columns, good for hierarchical data modelling (E.g. Geospatial) • In-built multi-DC replication • My Use Cases
  • 10.
    © Instaclustr PtyLimited, 2024 Anomaly Detection: 19 Million checks/day Apache Cassandra Apache Kafka Kubernetes And more
  • 11.
    © Instaclustr PtyLimited, 2024 Global low-latency Fintech
  • 12.
    © Instaclustr PtyLimited, 2024 Apache Cassandra • Watch Our For • CQL != SQL • Different data model § Design for reads § De-normalization is normal • Consistency < traditional SQL databases • Reads are slower • What’s New? • Vector Search in 5.0
  • 13.
    © Instaclustr PtyLimited, 2024 2. Apache Spark Car Factory Assembly Line Wikimedia Public Domain
  • 14.
    © Instaclustr PtyLimited, 2024 Apache Spark • What? • Cluster batch/stream processing, analytics and ML • Superpowers • In-memory à fast • Good support for ML § + Cassandra (wide columns) as a feature store • Good for heavy transformation operations at scale • My Use Cases
  • 15.
    © Instaclustr PtyLimited, 2024 ML of Cassandra Monitoring Data Apache Spark Apache Cassandra MLlib DataFrames Spark Streaming
  • 16.
    © Instaclustr PtyLimited, 2024 Apache Spark • Watch Our For • Lots of RAM, else OOM (Out-of-Memory Errors) • Spark Streaming is near real-time (micro-batch) • What’s New? • 3.4 has Spark Connect for decoupled client-servers • Ocean for Apache Spark (Spot by NetApp)
  • 17.
    © Instaclustr PtyLimited, 2024 3. Apache Zeppelin Graf Zeppelin exploring the Arctic, 1931 Wikimedia Public Domain
  • 18.
    © Instaclustr PtyLimited, 2024 Apache Zeppelin • What? • Web-based notebook for data exploration • Superpowers • Interactive “notebook” style tool • Supports Apache Spark
  • 19.
    © Instaclustr PtyLimited, 2024 Apache Zeppelin • Watch Our For • Sufficient Zeppelin resources • We don’t support it anymore • What’s New? • Jupyter Notebook! § Good Kafka and Cassandra integration The Galilean moons of Jupiter (Wikimedia CCL)
  • 20.
    © Instaclustr PtyLimited, 2024 4. Apache Lucene A Librarian using a card catalogue (1940) Library of Congress Public Domain
  • 21.
    © Instaclustr PtyLimited, 2024 Apache Lucene • What? • Fast Full-featured Search Engine • Superpowers • Lucene plugin + Cassandra for enhanced Cassandra search § Works as a Cassandra secondary index § Support Vector Search too • Watch Our For • Performance • We currently support it: https://github.com/instaclustr/cassandra-lucene-index • My Use Cases
  • 22.
    © Instaclustr PtyLimited, 2024 Geospatial Anomaly Detection Apache Cassandra Apache Lucene Plugin Geospatial searches Wikimedia Public Domain
  • 23.
    © Instaclustr PtyLimited, 2024 5. Apache Kafka Postal Delivery Service Railway Post Office: Mail bags snatched by speeding train Wikimedia CCL
  • 24.
    © Instaclustr PtyLimited, 2024 Apache Kafka • What? • Distributed publish-subscribe messaging system • Superpowers • Fast • Highly distributed and horizontally scalable, available and durable • Buffering and message replay • My Use Cases
  • 25.
    © Instaclustr PtyLimited, 2024 Xmas Tree Lights Simulation
  • 26.
    © Instaclustr PtyLimited, 2024 “Kongo” IoT Logistics Simulation Apache Kafka Guava Event Bus Real-time logistics Tracking and checking
  • 27.
    © Instaclustr PtyLimited, 2024 Anomaly Detection: 19 Million checks/day Apache Cassandra Apache Kafka Kubernetes And more
  • 28.
    © Instaclustr PtyLimited, 2024 Apache Kafka • Watch Our For • Too many topics/partitions impacts throughput • What’s New? • KRaft (replacing ZooKeeper) for faster meta-data operations § And maybe even faster data workloads • Tiered Storage (3.6) • End-to-end client monitoring (3.7)
  • 29.
    © Instaclustr PtyLimited, 2024 6. Apache Kafka Streams Niagra Falls Darevevil Shutterstock
  • 30.
    © Instaclustr PtyLimited, 2024 Apache Kafka Streams • What? • Stream processing API and client for Kafka • From/to Kafka cluster • Superpowers • Complex stateful stream processing operations (e.g. joins) • Over time windows and multiple topics and state stores • My Use Cases
  • 31.
    © Instaclustr PtyLimited, 2024 Kafka Streams IoT Application Truck Overload
  • 32.
    © Instaclustr PtyLimited, 2024 Apache Kafka Streams • Watch Our For • Complex stream topologies • Debugging is tricky • Performance • What’s New? • Alternatives (E.g. Apache Flink, RisingWave, etc)
  • 33.
    © Instaclustr PtyLimited, 2024 7. Apache Kafka Connect Telephone Switchboard Operators Connecting Calls Wikimedia Public Domain
  • 34.
    © Instaclustr PtyLimited, 2024 Apache Kafka Connect • What? • Kafka API for streaming from source to sink systems • Via Kafka cluster • Superpowers • Heterogeneous integration • Code-free – just connector configuration • Independently scalable § connectors run on independent Kafka Connect cluster • My Use Cases
  • 35.
    © Instaclustr PtyLimited, 2024 Zero-code Data Pipelines REST Tidal Data to PostgreSQL + Superset REST Tidal Data to OpenSearch OpenSearch sink connector
  • 36.
    © Instaclustr PtyLimited, 2024 Apache Kafka Connect • Watch Our For • Open-source connector evaluation and selection • Error handling • Source/sink system scalability • What’s New? • Debezium
  • 37.
    © Instaclustr PtyLimited, 2024 8. Kafka MirrorMaker 2 (MM2) Head of Kafka Replicated tiers move Shutterstock
  • 38.
    © Instaclustr PtyLimited, 2024 Kafka MirrorMaker 2 • What? • Replicates Kafka topics between clusters • Superpowers • Uses Kafka Connect (but reads/writes from/to Kafka clusters) • Topic renaming, prevents loops • Complex bi-directional topologies • Many use cases for multiple Kafka clusters: § Cluster migration § Geographical distribution § Low latency, redundancy § Fan-out architectures § Edge computing, etc
  • 39.
    © Instaclustr PtyLimited, 2024 Kafka MirrorMaker 2 • Watch Our For • Bi-directional flow requires TWO Kafka Connect Clusters • Duplicate events (from overlapping topic subscriptions) • Use topic renaming and the default source cluster alias to § Prevent cycles and infinite topic creation • What’s New? • For me, automated consumer offset sync across clusters § In 2.7.0 (2020)!
  • 40.
    © Instaclustr PtyLimited, 2024 9. Apache Camel Camel Train In Broome, WA (Adobe Stock by scottimage)
  • 41.
    © Instaclustr PtyLimited, 2024 Apache Camel - Kafka Connectors • What? • Apache Camel – Integration framework • Apache Camel Kafka Connectors – open source Kafka connectors • Superpowers • Large number of open source Kafka Connectors – 172 (officially), 179 sources and sinks • Auto-generated from Camel components
  • 42.
    © Instaclustr PtyLimited, 2024 • Watch Our For • Configuration! § Need to read (1) Camel component, (2) Basic connector configuration, and (3) connector specific documentation • Some connectors are both sources and sinks (source or sink depends on configuration) • What’s New? • Kamelets! § Can appear in the configuration Apache Camel - Kafka Connectors
  • 43.
    © Instaclustr PtyLimited, 2024 10. Kafka Parallel Consumer Jacquard Loom, Berlin (Paul Brebner)
  • 44.
    © Instaclustr PtyLimited, 2024 Kafka Parallel Consumer • What? • Multi-threaded Kafka Consumer • Superpowers • Multi-threaded c.f. default consumers single-threaded • Higher concurrency with less consumers and partitions • Use Cases • Low latency, High Throughput • Slow consumers • Replacement for my multiple pool consumer hack
  • 45.
    © Instaclustr PtyLimited, 2024 Kafka Parallel Consumer • Watch Our For • Configure for § Ordering mode • Partition à Key à Unordered (Increasing concurrency) § Max threads • What’s New? • Choice of commit modes § Consumer Asynchronous, Synchronous and Producer Transactions
  • 46.
    © Instaclustr PtyLimited, 2024 11. Apache ZooKeeper 12. Apache Curator Being a ZooKeeper in Australia can be risky! (Shutterstock)
  • 47.
    © Instaclustr PtyLimited, 2024 Apache ZooKeeper • What? • Distributed systems and coordination and meta-data management • Superpowers • High consistency, availability and performance (reads) • Use Cases • Until recently, used in Kafka, Pulsar, etc
  • 48.
    © Instaclustr PtyLimited, 2024 Apache ZooKeeper (and Curator) Meet the Dining Philosophers Wikipedia CCL
  • 49.
    © Instaclustr PtyLimited, 2024 Apache ZooKeeper • Watch Our For • Low-level • Apache Curator (high level ZK client) is better with § Leader Latch § Shared Lock § Shared Counter • Scalability limitations • Slow for writes, max cluster size is 7 servers • What’s New? • KRaft – Kafka based RAFT implementation § For meta-data management and leader election § Faster meta-data operations, more partitions etc. Potentially faster data workloads
  • 50.
    © Instaclustr PtyLimited, 2024 13. Kubernetes Greek Triremes ruled the seas Captained by Helmsmen (Kubernetes) (Wikipedia CCL)
  • 51.
    © Instaclustr PtyLimited, 2024 Kubernetes • What? • Automation of containerized applications • Superpowers • Available on public clouds, E.g. AWS EKS • Ephemeral Pods are the unit of concurrency • Easy to scale applications (more or less Pods) • My Use Cases
  • 52.
    © Instaclustr PtyLimited, 2024 Anomaly Detection: 19 Million checks/day Apache Cassandra Apache Kafka Kubernetes And more
  • 53.
    © Instaclustr PtyLimited, 2024 Kubernetes • Watch Our For • Pod and resource scaling § Easy to create many Pods • With insufficient or lots of resources • Tuning the application can be tricky § Optimize the number of Pods vs Kafka consumers/partitions, Cassandra database connections, etc • What’s New? • Operators § E.g. Strimzi for Kafka
  • 54.
    © Instaclustr PtyLimited, 2024 14. Prometheus 15. Grafana Counting on an Abacus (Wikimedia Public Domain)
  • 55.
    © Instaclustr PtyLimited, 2024 Prometheus + Grafana • What? • Prometheus: Monitoring and Alerting • Grafana: Graphing • Superpowers • Instrumentation or Agents (Exporters) to expose application metrics • Time series data with counter, gauge, histogram and summary metrics • My Use Cases • Monitoring and scaling/optimization/debugging § Anomaly Detector (Cassandra, Kafka, Kubernetes) application § Kafka Connect data pipelines • Instaclustr’s Monitoring API has a Prometheus version
  • 56.
    Prometheus + Grafana • Watch Out For • Need to run a Prometheus server • Configuring Prometheus with Kubernetes is tricky § use Prometheus Operator • What’s New? • Since I used it, Grafana has become AGPL licensed § modified code has to be open sourced
  • 57.
    16. OpenTracing 17. OpenTelemetry 18. Jaeger (and others) X-Ray Vision! Public Domain
  • 58.
    OpenTracing/OpenTelemetry • What? • OpenTracing: End-to-end distributed tracing • Superpowers • End-to-end distributed application visibility § Traces have Spans • Visualisation of system topology and times
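
"Traces have Spans" can be sketched as a plain data structure: each span records its parent, the service that produced it, and start/end times, and the caller-to-callee edges of a topology view fall out of the parent links. The `Span` class, service names and timings below are hypothetical and are not the OpenTracing or OpenTelemetry API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]   # None marks the root span of the trace
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def service_edges(spans):
    """Derive caller -> callee service pairs from parent/child spans."""
    by_id = {s.span_id: s for s in spans}
    edges = set()
    for s in spans:
        parent = by_id.get(s.parent_id)
        if parent and parent.service != s.service:
            edges.add((parent.service, s.service))
    return sorted(edges)


trace = [
    Span("1", None, "producer", 0, 50),
    Span("2", "1", "kafka", 5, 20),
    Span("3", "2", "consumer", 20, 45),
]
edges = service_edges(trace)
root = next(s for s in trace if s.parent_id is None)
```

`edges` gives the topology (producer to kafka, kafka to consumer) and `root.duration_ms` the end-to-end time, which is the information a tracing UI visualises.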
  • 59.
    OpenTracing/OpenTelemetry • Watch Out For • I originally used OpenTracing and Jaeger • Manual instrumentation • What’s New? • OpenTelemetry is the new standard § Tracing, metrics and logs § Automatic instrumentation § Lots of open-source visualization tools • Jaeger, SigNoz, Uptrace, OpenSearch § Used for the new client monitoring in KIP-714 (Kafka 3.7.0)
  • 60.
    SigNoz Service Map for Toy+Boxes application
  • 61.
    19. PostgreSQL Elephant vs. Tree Elephants are Powerful (Adobe)
  • 62.
    PostgreSQL • What? • Powerful SQL Database • Superpowers • SQL + Object Database • Extensible • JSONB+GIN indexes (efficient storage and search of JSON)
  • 63.
    PostgreSQL • Watch Out For • Scalability § Vertical; limited horizontal • Benefits from connection pooling • What’s New? • PGVector (vector similarity search) • Significant performance improvement § on NetApp Azure Files • FerretDB (MongoDB front-end)
  • 64.
    20. Apache Superset All superheroes (B) are a superset of those who use weapons (A) (Shutterstock)
  • 65.
    Apache Superset • What? • Powerful data visualization tool • Superpowers • Reads from SQL sources • Lots of visualization and graph types including geospatial • My Use Case • Visualization of tidal data from a Kafka Connect pipeline § Easy integration with PostgreSQL + JSONB
  • 66.
    21. OpenSearch 22. Dashboard Library of Congress Card Division 1919 (City block long) (Library of Congress Public Domain)
  • 67.
    OpenSearch + Dashboard • What? • Open-source fork of Elasticsearch • Based on Lucene → powerful + scalable text searching • Superpowers • Ingestion, indexing and searching of JSON documents • Integrated dashboard for visualization • Computational linguistics support: § Stemming, Lemmatization, Levenshtein Fuzzy Queries, N-grams, Slop, Partial matching! • My Use Cases • Sink and visualization for Kafka Connect tidal data processing pipeline
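
Two of the text-matching ideas listed above, Levenshtein distance and N-grams, are easy to show in plain Python. This is a textbook sketch of the concepts, not how OpenSearch or Lucene implement them internally; the example words are arbitrary.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes or substitutions."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete from a
                            curr[j - 1] + 1,     # insert into a
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]


def ngrams(text, n):
    """All contiguous character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


distance = levenshtein("kitten", "sitting")
bigrams = ngrams("tide", 2)
```

A fuzzy query accepts terms within a small edit distance of the search term, while N-grams let partial strings match indexed text.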
  • 68.
    OpenSearch + Dashboard • Watch Out For • Default mappings and ingestion may not work § E.g. geospatial data needs custom mappings and ingest pipelines • Reindexing • Kafka Connect Sink → OpenSearch throughput § Needed the Bulk API • What’s New? • Vector Search
  • 69.
    23. Redis Look! Up in the sky! It’s an in-memory key-value store! It’s a database! It’s Redis! (Shutterstock)
  • 70.
    Redis • What? • Fast (in-memory) Data Structures server • Superpowers • Lots of data types § Keys, Strings, Lists, Hashes, Sets, Sorted sets, bitmaps, geospatial, streams, time series, HyperLogLogs (approximate counting) • Pub/Sub § Connected and disconnected delivery • Client-side caching for ultra-low latency – e.g. Redisson client
  • 71.
    Redis • Watch Out For • Pipeline tuning impacts throughput • Often used as a cache to reduce load on backend database § I.e. the gain is efficiency, not necessarily improved latency • As other factors may dominate • What’s New? • Redis Functions § Code executed on the server (Redis 7) • License change (7.4 source-available)
  • 72.
    24. Uber’s Cadence Railway Signal “man” (Signalwoman!) (Wikimedia Public Domain)
  • 73.
    Uber’s Cadence • What? • Scalable code-as-workflows engine • Superpowers • Sequenced, stateful, long-running, scheduled steps • Scalable and reliable using event-sourcing § Workflows are fault-tolerant: history is replayed up to the point of failure, then the workflow resumes • My Use Cases
  • 74.
    Drone Delivery Application Kafka Microservices Integration of fast/slow systems
  • 75.
    Uber’s Cadence • Watch Out For • Uses Apache Cassandra and OpenSearch backends • Code must be deterministic (replayed on failure) § Use special functions for non-deterministic operations • What’s New? • Potential use cases § Scalable push notifications (Uber) § ML workflows
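
Why workflow code must be deterministic can be sketched with a toy replay loop: results of non-deterministic steps are recorded in a history the first time they run and read back from that history on replay, so the replayed workflow follows the same path. `ReplayContext`, `side_effect` and `delivery_workflow` are hypothetical names for this illustration and are not the Cadence client API.

```python
import random


class ReplayContext:
    """Toy event-sourced context: records step results, reuses them on replay."""

    def __init__(self, history=None):
        self.history = list(history or [])
        self._cursor = 0

    def side_effect(self, fn):
        """Run fn once; on replay return the recorded result instead."""
        if self._cursor < len(self.history):
            result = self.history[self._cursor]   # replaying: reuse recorded value
        else:
            result = fn()                         # first execution: record it
            self.history.append(result)
        self._cursor += 1
        return result


def delivery_workflow(ctx):
    # Random values stand in for any non-deterministic input.
    drone_id = ctx.side_effect(lambda: random.randint(1, 1000))
    eta_minutes = ctx.side_effect(lambda: random.randint(5, 30))
    return f"drone-{drone_id} arrives in {eta_minutes} min"


first = ReplayContext()
result_1 = delivery_workflow(first)

# Simulate recovery after a failure: replay against the recorded history.
replayed = ReplayContext(history=first.history)
result_2 = delivery_workflow(replayed)
```

Calling `random.randint` directly inside the workflow would give a different result on replay, which is the failure mode the slide warns about.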
  • 76.
    25. Debezium Animal speed transformation (Shutterstock)
  • 77.
    Debezium • What? • Change Data Capture (CDC) • Superpowers • Captures slow database state changes • Turns them into fast Kafka events • Uses Kafka: Kafka Connect, and/or DB-specific “Connectors” • Can be used to replicate databases (same type), or send events to different sink systems • My Use Cases • Debezium Cassandra Connector (doesn’t use Kafka Connect, writes to Kafka directly) • Debezium PostgreSQL Connector (Kafka source connector)
  • 78.
    Debezium • Watch Out For • The DB-specific connectors need to be configured/run in the DB • Debezium change data format is complex § Actual content depends on the source DB • Schemas may be inline or just an ID • May include schema changes • Tricky to find Kafka Connect sink connectors that work correctly • Duplicates and ordering issues, latency and scalability challenges • Schema IDs require a Kafka Schema Registry • What’s New? • GA on Instaclustr’s managed Cassandra (Dec 2023)
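
A simplified sketch of consuming change events, assuming an envelope with `op` (`c` create, `r` snapshot read, `u` update, `d` delete), `before` and `after` fields. As the slide notes, the actual content depends on the source DB, and real events also carry source metadata and possibly schemas, so the shapes, table contents and `apply_change` helper here are illustrative assumptions.

```python
def apply_change(replica, event, key_field="id"):
    """Apply one simplified change event to an in-memory replica (a dict)."""
    op = event["op"]
    if op in ("c", "r", "u"):
        row = event["after"]
        replica[row[key_field]] = row             # upsert keyed by primary key
    elif op == "d":
        replica.pop(event["before"][key_field], None)
    return replica


events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "ordered"}},
    {"op": "u", "before": {"id": 1, "status": "ordered"},
     "after": {"id": 1, "status": "delivered"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "ordered"}},
    {"op": "d", "before": {"id": 2, "status": "ordered"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)
```

Because each event is applied as an upsert or delete by key, a duplicated delivery of the same event leaves the replica unchanged, though out-of-order events would still need separate handling.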
  • 79.
    26. Karapace Karapace in the driver's seat! (Shutterstock)
  • 80.
    Karapace • What? • Open-source Kafka Schema Registry • Superpowers • Adds Schemas to Schemaless Kafka • Supports multiple schema formats § Avro, Protobuf and JSON Schemas • Kafka cluster is not directly involved § Karapace enforces schema checks for clients only • Use Cases • Debezium
  • 81.
    Karapace • Watch Out For • Auto vs. manual schema registration – manual is safer in production • Schema compatibility, compatibility modes, and evolution: complex!
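
One compatibility rule can be sketched in a few lines: for backward compatibility, a consumer on the new schema must still be able to read data written with the old one, so any added field needs a default. The check below is a deliberately simplified assumption for Avro-style record schemas (it ignores type promotion, unions, aliases and nested records) and is not Karapace's implementation.

```python
def is_backward_compatible(old_schema, new_schema):
    """Simplified check: can the new schema read records written with the old?"""
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    for field in new_schema["fields"]:
        old = old_fields.get(field["name"])
        if old is None:
            if "default" not in field:
                return False          # added field with no default: old data unreadable
        elif old["type"] != field["type"]:
            return False              # stricter than real Avro, which allows promotions
    return True


# Hypothetical schemas for illustration.
v1 = {"fields": [{"name": "id", "type": "int"}]}
v2 = {"fields": [{"name": "id", "type": "int"},
                 {"name": "note", "type": "string", "default": ""}]}
v3 = {"fields": [{"name": "id", "type": "int"},
                 {"name": "note", "type": "string"}]}

ok = is_backward_compatible(v1, v2)
broken = is_backward_compatible(v1, v3)
```

Other compatibility modes (forward, full, transitive) apply the same idea in different directions or across more versions, which is where the complexity comes from.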
  • 82.
    27. FerretDB Fish/Shark? (Adobe)
  • 83.
    FerretDB • What? • Open-source MongoDB proxy for PostgreSQL • Superpowers • Compatible with MongoDB drivers on the front-end • Pluggable backends including PostgreSQL (using JSONB/GIN indexes) • Query Pushdown for efficiency/performance
  • 84.
    28. RisingWave Wave processing (Adobe)
  • 85.
    RisingWave • What? • Stream processing database – also as a Service • Superpowers • Stateful stream processing § Using Cloud Native Storage § Potential replacement for Kafka Streams • PostgreSQL compatible § Works with Apache Superset • My Use Cases
  • 86.
    Santa’s Elves Toy + Box Packing Streaming joins to match toys and boxes (Adobe) Service Map using OpenTelemetry + SigNoz
  • 87.
    RisingWave • Watch Out For • SQL != Kafka Streams DSL • Kafka keys not propagated • Windowing has different semantics
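
The stateful part of a streaming join can be sketched as two buffers keyed by the join attribute: an arriving toy is matched with a waiting box of the same key, otherwise it waits. This is a one-to-one matching variant with no time window, written in plain Python; the join key (size) and event names are assumptions, and a SQL streaming join would instead emit every matching pair within its window.

```python
from collections import defaultdict, deque


class ToyBoxJoin:
    """Toy-to-box matcher holding unmatched events as state."""

    def __init__(self):
        self.waiting_toys = defaultdict(deque)    # size -> toys awaiting a box
        self.waiting_boxes = defaultdict(deque)   # size -> boxes awaiting a toy

    def on_toy(self, toy, size):
        if self.waiting_boxes[size]:
            return (toy, self.waiting_boxes[size].popleft())
        self.waiting_toys[size].append(toy)
        return None

    def on_box(self, box, size):
        if self.waiting_toys[size]:
            return (self.waiting_toys[size].popleft(), box)
        self.waiting_boxes[size].append(box)
        return None


join = ToyBoxJoin()
results = [
    join.on_toy("teddy", "M"),    # no box yet: buffered
    join.on_box("box-1", "S"),    # no toy yet: buffered
    join.on_box("box-2", "M"),    # matches the waiting teddy
    join.on_toy("car", "S"),      # matches the waiting box-1
]
```

The buffers are the state a stream processor has to persist, and how long events may wait in them is what windowing semantics define.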
  • 88.
    29. TensorFlow What does the future hold? (Adobe)
  • 89.
    TensorFlow • What? • Neural network ML library • Superpowers • Supports incremental ML • From streaming Kafka data • My Use Cases
  • 90.
    ML Over Streaming Kafka Data – With Concept Drift Kafka Streams
  • 91.
    TensorFlow • Watch Out For • ML over streaming spatiotemporal data with concept drifts is tricky § Time/space bias • Wild model accuracy oscillation § Concept shift can result in very low-accuracy models initially • Train/use Multiple Models [Chart: Concept Drift - incremental training (time vs accuracy); series: same model, reset model, guessing]
  • 92.
    30. Yours Here Invent your own (DeepAI)
  • 93.
    Integration Example 1 Our Customer Facing Monitoring Before: Spark and API requests → High load on Cassandra
  • 94.
    Integration Example 1 Our Customer Facing Monitoring After: Kafka + Kafka Streams + Redis Reduced Cassandra Load Recent metrics served from Redis, or Cassandra on cache miss [Diagram labels: PostgreSQL; 1 – get meta-data; 2 – get data from Redis; 3 – or from Cassandra; 20k Nodes] Thanks to my colleague Kuangda He for this information
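
The read path on this slide (recent metrics from Redis, or Cassandra on a cache miss) can be sketched with dictionaries standing in for both stores. The class, keys and values are hypothetical; the sketch assumes the cache is filled elsewhere (e.g. by the streaming pipeline) and does not model the meta-data lookup.

```python
class MetricsReader:
    """Serve a metric from the cache if present, otherwise from the backing store."""

    def __init__(self, cache, store):
        self.cache = cache          # stand-in for Redis (recent metrics only)
        self.store = store          # stand-in for Cassandra (all metrics)
        self.store_reads = 0        # counts how often the backing store is hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # cache hit: no load on the backing store
        self.store_reads += 1
        return self.store[key]      # cache miss: fall back to the store


store = {"node1:cpu:latest": 0.42, "node1:cpu:last-week": 0.35}
cache = {"node1:cpu:latest": 0.42}

reader = MetricsReader(cache, store)
recent = reader.get("node1:cpu:latest")
older = reader.get("node1:cpu:last-week")
```

Only the request for older data reaches the backing store, which is the load reduction the "After" architecture is aiming for.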
  • 95.
    Integration Example 2 Drone Delivery Demo Kafka Streams Customers Order Shops Busy warnings Uses Cassandra+OpenSearch ML over streaming data Demo/POC
  • 96.
    Integration Example 2 Drone Delivery Prod? Kafka Streams Customers Order PostgreSQL Drone Operations Order Tracking Shops Busy warnings Uses Cassandra+OpenSearch ML over streaming data Drone/order locations cached in Redis Read-through or write-behind Kafka sink connectors
  • 97.