Data used to be a batch thing, but more and more we get unbounded streams of data, fast or slow, that we need to process and analyse in near real time.
In this talk I’ll show you how you can use Apache Flink and QuestDB to build reliable streaming data pipelines that can grow as much as you need.
Processing and analysing streaming data with Python. Pycon Italy 2022
1. Processing and analysing streaming
data with Python
using Apache Kafka, Apache Flink, QuestDB and Grafana
Javier Ramirez
(@supercoco9)
Developer Advocate
2. A simple problem (until you know the details)
● I want to calculate the total and average of several numbers
3. A simple big data problem (until you know the details)
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive
4. A simple streaming problem
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive
● The dataset is not static, new numbers are coming all the time
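A minimal sketch of what this slide implies, assuming the numbers arrive one at a time from any Python iterable: keep only a running count and total, so memory stays constant however many values arrive. The names `RunningStats` and `consume` are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RunningStats:
    """Constant-memory aggregate: stores only a count and a sum."""
    count: int = 0
    total: float = 0.0

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value

    @property
    def average(self) -> float:
        # Return 0.0 for an empty stream instead of dividing by zero.
        return self.total / self.count if self.count else 0.0


def consume(stream: Iterable[float],
            stats: Optional[RunningStats] = None) -> RunningStats:
    """Fold every value of the stream into the running aggregate."""
    if stats is None:
        stats = RunningStats()
    for value in stream:
        stats.add(value)
    return stats
```

Passing an existing `RunningStats` back into `consume` lets the same aggregate keep growing as new batches of numbers arrive.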
5. A simplish streaming problem
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time
6. A quite standard streaming problem
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time
● And since they use 3G and batteries, some might go quiet for a while and
then send a bunch of stale data
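One common way to cope with stale data is to aggregate by the time the reading was taken (event time) rather than the time it arrived, and to accept late readings only up to a limit. The sketch below is a simplified, pure-Python illustration of that idea, assuming timestamps are integer seconds; the class name, the 60-second window and the 120-second allowed lateness are all hypothetical choices, and this is not how any particular engine implements it.

```python
from collections import defaultdict


class EventTimeAggregator:
    """Tumbling event-time windows with a simple watermark."""

    def __init__(self, window_seconds: int = 60, allowed_lateness: int = 120):
        self.window_seconds = window_seconds
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0
        # window_start -> [count, total]
        self.windows = defaultdict(lambda: [0, 0.0])
        self.dropped = 0

    @property
    def watermark(self) -> int:
        # Windows ending at or before this point are considered closed.
        return self.max_event_time - self.allowed_lateness

    def add(self, event_time: int, value: float) -> bool:
        window_start = event_time - event_time % self.window_seconds
        window_end = window_start + self.window_seconds
        if window_end <= self.watermark:
            # Too stale: the window this reading belongs to is closed.
            self.dropped += 1
            return False
        self.max_event_time = max(self.max_event_time, event_time)
        bucket = self.windows[window_start]
        bucket[0] += 1
        bucket[1] += value
        return True
```

A reading that arrives out of order but within the allowed lateness still lands in the window it belongs to; anything older is counted in `dropped` so it can be inspected or routed elsewhere.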
7. An elastic and scalable streaming problem
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time
● And since they use 3G and batteries, some might go quiet for a while and
then send a bunch of stale data
● Flow will not be constant (from a few events per second to thousands)
8. An almost real-life streaming analytics scenario
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a single
hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be adding
and removing sensors all the time
● And since they use 3G and batteries, some might go quiet for a while and then send
a bunch of stale data
● Flow will not be constant (from a few events per second to thousands)
● And I don’t want just the overall total and average, but totals per month, per week, per day, per
hour, per minute…
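A rough sketch of what "per month, per week, per day, per hour, per minute" can look like in code, assuming each event is a `(datetime, value)` pair: every event is added to one bucket per granularity. The helper names `bucket_keys` and `rollup` are hypothetical. In a real pipeline this kind of rollup would more likely be left to the stream processor or the time series database.

```python
from collections import defaultdict
from datetime import datetime


def bucket_keys(ts: datetime) -> dict:
    """Return the bucket label of a timestamp at each granularity."""
    iso_year, iso_week, _ = ts.isocalendar()
    return {
        "month": ts.strftime("%Y-%m"),
        "week": f"{iso_year}-W{iso_week:02d}",
        "day": ts.strftime("%Y-%m-%d"),
        "hour": ts.strftime("%Y-%m-%d %H"),
        "minute": ts.strftime("%Y-%m-%d %H:%M"),
    }


def rollup(events):
    """Aggregate (timestamp, value) pairs into count/total per bucket."""
    totals = defaultdict(
        lambda: defaultdict(lambda: {"count": 0, "total": 0.0})
    )
    for ts, value in events:
        for granularity, key in bucket_keys(ts).items():
            bucket = totals[granularity][key]
            bucket["count"] += 1
            bucket["total"] += value
    return totals
```

The average for any bucket is its `total` divided by its `count`, so both figures come out of the same pass over the data.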
9. A real business problem you can solve with streaming
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a single hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be adding and removing sensors all the time
● And since they use 3G and batteries, some might go quiet for a while and then send a bunch of stale data
● Flow will not be constant (from a few events per second to thousands)
● And I don’t want just the overall total and average, but totals per month, per week, per day, per hour, per minute…
● We need pretty dashboards with current status, comparison with the past, trends, and
anomaly detection
● To run this reliably, we need advanced monitoring, alerts, and autoscaling
● No, I am not hiring a whole new operations team to manage the system
10. Let’s play a game
YOU are hired to build an end-to-end solution ingesting sensor data from thousands of devices for an e-health application or an
industrial customer.
Or YOU are hired to build tracking of vehicle fleets, like Uber, Lyft, or an airline.
Or YOU are going to ingest data from a game like Candy Crush or Fortnite.
Or YOU are in charge of monitoring and securing thousands of servers.
Or YOU are a video platform, or a telecom provider, and want to detect quality issues.
Or YOU have an ecommerce application where you need to display ads in real-time, provide recommendations, do fraud detection,
and detect cart abandonment.
11.
12.
13. Streaming data pipeline overview
Key requirements
Durable
Stateful
Continuous
Fast
Correct
Reactive
Reliable
Pipeline stages: Ingest, Transform, Analyze, React, Persist, Analyze/Reuse
14. Apache Kafka: A distributed streaming platform
Apache Flink: Stateful computations over data streams
QuestDB: The fastest open source time series database
Grafana: Query, visualize, alert on, and understand your
data no matter where it’s stored.
Apache Zeppelin: Web-based notebook that enables
data-driven, interactive data analytics and collaborative
documents with SQL, Scala, Python, R and more.