Keepin’ It Real(-Time) With Nadine Farah | Current 2022
Let’s get real: companies are incorporating more streaming sources into their data stacks to surface customer and business trends in real time. While many data engineers ingest streaming data into their data warehouses or data lakes, they are not unlocking its full potential. To extract the most value from your streaming data, you’ll need to consider:
- data freshness
- query latency
- storage
- concurrency
- data mutability
- analyzing streaming data in context (i.e. JOINing) with data from other data sources
In this tech talk, we’ll cover each of these considerations in detail. We’ll show you how to build a SQL-based, real-time recommendation engine and customer 360 data application using Kafka, Rockset, and Retool. By the end, you’ll be equipped to evaluate databases and tools against your real-time streaming data needs.
1. Nadine Farah, Senior Developer Advocate
@heyerrrbody
in/nadinefarah
Current 2022, Austin
Keepin' it real(-time): How to generate
instant, actionable insights on streaming data
2. Agenda

Section | Duration
Streaming data is on the rise 📈 | 5 min
🤔 Real-time analytics on streaming data challenges | 20 min
🛠 Rockset deep dive + build a real-time customer 360 app with recommendations | 10 min
Q&A | 5 min
3. Soo.. about me
● 3+ years focusing on real-time analytics & streaming data
● Recently led a workshop on real-time analytics with Kafka
● Lead Rockset’s developer initiatives
● Best friends with Ferro and my dog, Romeo
in/nadinefarah/
@nfarah86
10. Less efficient ways of scaling bursty data traffic
● Manual reconfigurations
○ You create bottlenecks when you want to scale up because scaling isn’t triggered automatically
● Tightly coupled compute and storage
○ Ingesting and querying data affect each other: writing large amounts of data slows the reads, and vice versa
○ Data has to be moved closer to memory to make use of the available resources
14. Copy-on-write: less efficient for updating a field
● One way immutable databases (e.g., data warehouses) handle out-of-order events is copy-on-write
○ Any update requires both writing the new data and rewriting already-written adjacent data so that everything is stored on disk in the right time order
○ This requires a significant amount of processing power and time
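The cost above can be sketched in a few lines. This is a toy model of copy-on-write over an in-memory "partition" (not any specific warehouse’s file format): changing one field of one record forces every row in the partition to be rewritten.

```python
# Toy sketch of copy-on-write: updating one field rewrites the whole
# partition, because the "file" is immutable and must stay time-ordered.

def apply_update_copy_on_write(partition, record_id, field, value):
    """Return a fully rewritten partition plus the number of rows copied."""
    new_partition = []
    rows_rewritten = 0
    for row in partition:
        row = dict(row)                # copy even the unchanged rows
        if row["id"] == record_id:
            row[field] = value
        new_partition.append(row)
        rows_rewritten += 1
    return new_partition, rows_rewritten

partition = [{"id": i, "status": "ok"} for i in range(5)]
updated, rows_rewritten = apply_update_copy_on_write(partition, 2, "status", "late")
print(rows_rewritten)  # 5 rows rewritten to change a single field
```

A mutable database would touch one row here; copy-on-write touches all five, and real partitions hold millions of rows, not five.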
15. Problems with immutability for real-time analytics
● Usage and volume of streaming data are increasing
● Immutable databases can’t do in-place updates or deletes; they are append-only
● Shift from batch to streaming:
○ Data apps have tighter SLAs for query and data latency, so they need an efficient real-time database to handle the nuances and volume of streaming data
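To make the append-only constraint concrete, here is a minimal sketch (my own illustration, not any particular product’s mechanism) of how an immutable log can emulate updates: every change is appended as a new event, and a read-time "latest event timestamp wins" pass resolves the current value per key, even when events arrive out of order.

```python
# Minimal sketch: append-only event log + read-time "latest wins" resolution.
log = []

def append(key, value, event_ts):
    """Writes never mutate the log; they only append."""
    log.append({"key": key, "value": value, "ts": event_ts})

def latest_state():
    """Resolve the current value per key by event time, not arrival order."""
    state = {}
    for e in log:
        cur = state.get(e["key"])
        if cur is None or e["ts"] > cur["ts"]:
            state[e["key"]] = e
    return {k: v["value"] for k, v in state.items()}

append("user:1", "bronze", event_ts=100)
append("user:1", "gold", event_ts=300)
append("user:1", "silver", event_ts=200)   # late, out-of-order arrival
print(latest_state())  # {'user:1': 'gold'}
```

The catch is the cost: the log only grows, and every read has to pay for the resolution step, which is exactly why tight query-latency SLAs push toward databases with true field-level mutability.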
17. Challenge 3: Schema changes
● Rigid ETL jobs are hard to update when nested JSON objects get fields added, changed, or deleted
● Relational databases can take a performance hit if you query JSON data without converting it to SQL fields
● Nested JSON objects are hard to work with right away because you have to build processes to flatten them
18. Strong dynamic typing and indexing make it easier to work with schema changes
● A database that can index nested JSON docs
● A database that supports strong dynamic typing, so you can query fields with multiple data types
● A database that easily turns nested JSON into a SQL table at runtime, without prior transformations
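The flattening step that such a database does for you at runtime can be sketched by hand. This is an illustrative helper (names like `flatten` are my own), turning a nested JSON doc into dotted SQL-style column names; note the `price` arriving as a string, the mixed-type case strong dynamic typing is meant to absorb.

```python
# Sketch: flatten a nested JSON doc into dotted, SQL-queryable field names.
def flatten(doc, prefix=""):
    out = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, path + "."))   # recurse into nesting
        else:
            out[path] = value
    return out

event = {"user": {"id": 7, "geo": {"city": "Austin"}}, "price": "9.99"}
print(flatten(event))
# {'user.id': 7, 'user.geo.city': 'Austin', 'price': '9.99'}
```

A rigid ETL pipeline hard-codes this mapping and breaks when a field appears or changes type; doing it dynamically per document is what keeps schema drift from becoming an outage.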
19. Challenge 4: running queries with low efficiency
● For user-facing analytics, where the access pattern can be unknown, defining all the indexes up front is a challenge
● Columnar stores that rely on brute-force scans are slow and not ideal when you are constantly querying data, because you have to throw more compute at them to get faster speeds
20. Auto-indexing reduces compute resources for querying data
● You’ll need a database that automatically creates indexes, so you don’t have to create or manage them manually when the streaming data changes
● With indexes in place, less compute is needed to serve each query
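The scan-versus-index difference is easy to see with any SQL engine’s query planner. A small sketch using SQLite as a stand-in (table and index names are illustrative): the same point query does a brute-force scan until an index exists, then becomes an index search.

```python
import sqlite3

# Sketch: the same point query with and without an index,
# inspected via SQLite's EXPLAIN QUERY PLAN.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(i % 100, i * 0.5) for i in range(10_000)])

query = "SELECT * FROM events WHERE user_id = 42"

# Without an index: the planner falls back to a full scan.
plan_before = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_before[0][3])   # typically something like: SCAN events

# With an index: the planner switches to an index search.
con.execute("CREATE INDEX idx_user ON events(user_id)")
plan_after = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_after[0][3])    # typically: SEARCH events USING INDEX idx_user ...
```

Auto-indexing makes this switch happen without anyone having to anticipate the access pattern and run the `CREATE INDEX` by hand.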
21. SQL: best for complex analytics
● NoSQL databases:
○ Easy lookups
○ Have to learn a new language
○ No JOIN support at scale
○ Struggle with complex aggregations
● SQL databases:
○ Easy to JOIN (at scale), aggregate,
and search
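The "easy to JOIN and aggregate" claim is one line of SQL. A small sketch (SQLite in-memory, with illustrative table and column names) of the kind of JOIN-plus-aggregate query that is awkward to express in a lookup-oriented NoSQL store:

```python
import sqlite3

# Sketch: revenue per plan, computed as a JOIN + GROUP BY in plain SQL.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users  (id INTEGER, plan TEXT);
CREATE TABLE orders (user_id INTEGER, total REAL);
INSERT INTO users  VALUES (1, 'free'), (2, 'pro');
INSERT INTO orders VALUES (1, 10.0), (2, 25.0), (2, 15.0);
""")

rows = con.execute("""
    SELECT u.plan, SUM(o.total)
    FROM orders o
    JOIN users  u ON u.id = o.user_id
    GROUP BY u.plan
    ORDER BY u.plan
""").fetchall()
print(rows)  # [('free', 10.0), ('pro', 40.0)]
```

In a key-value store the same answer typically means fetching every order, looking up each user, and aggregating in application code, which is the "no JOIN support at scale" problem from the slide.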
22. Batch architecture: streaming data challenges
[Diagram: streaming sources (Kafka, Kinesis, Google Pub/Sub, Azure Event Hubs), operational databases (Microsoft SQL Server, Postgres, Oracle, MySQL, MongoDB, DynamoDB), and cloud storage/warehouses (Amazon S3, Google Cloud, Azure Blob, BigQuery, Snowflake, Redshift, Databricks) feeding slow, expensive user-facing analytics 🐌]
● Inefficient ingest: expensive MERGE operations for processing inserts, updates, deletes
● Time-consuming ETL jobs: e.g., pre-aggregations
● Inefficient queries: expensive full table scans, index tuning
● Data latency: > 1 hour
● Query latency: > 1 min
23. Real-time architecture: purpose-built for data applications
[Diagram: the same streaming sources, operational databases, and cloud storage/warehouses feeding fast, efficient user-facing analytics]
● Efficient upserts: mutable at the field level to avoid MERGE operations
● Ingest rollups: transform and pre-aggregate to reduce storage 10-100x
● Fast queries: a Converged Index avoids SCAN operations
● Cloud-native: compute-storage separation to avoid over-provisioning
● Data latency: < 2 seconds
● Query latency: < 1 second
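The ingest-rollup idea is worth a quick sketch. This is my own toy illustration (a hypothetical per-minute click counter, not any product’s rollup syntax): raw events are pre-aggregated into (page, minute) buckets at ingest time, so storage holds one row per bucket instead of one row per event.

```python
from collections import defaultdict

# Sketch of an ingest-time rollup: count clicks per (page, minute bucket)
# as events arrive, instead of storing every raw event.
rollup = defaultdict(int)

def ingest(page, event_ts):
    minute_bucket = event_ts - event_ts % 60   # truncate to 1-minute bucket
    rollup[(page, minute_bucket)] += 1

raw_events = [("home", t) for t in range(0, 120)]  # 2 minutes of clicks
for page, ts in raw_events:
    ingest(page, ts)

print(len(raw_events), "raw rows ->", len(rollup), "stored rows")
# 120 raw rows -> 2 stored rows
```

With one event per second per page, the rollup stores 60x fewer rows; heavier traffic on the same buckets is where the 10-100x storage reduction comes from.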
26. Real-time analytics at your fingertips
● Ability to handle bursty traffic
● Out-of-order data
● Strong dynamic schema
● No ETL jobs
● Serverless architecture
27. Rockset is the real-time analytics platform built for the cloud.
Rockset enables sub-second queries on real-time data.
Build user-facing analytics with surprising efficiency.
35. rockset.com/docs
Booth S3 at the Austin Convention Center
… Or come find me and let’s chat about Kafka and real-time analytics over a tasty beverage, on me!