Processing and analysing streaming
data with Python
using Apache Kafka, Apache Flink, QuestDB and Grafana
Javier Ramirez
(@supercoco9)
Developer Advocate
A simple problem (until you know the details)
● I want to calculate the total and average of several numbers
A simple big data problem (until you know the details)
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive
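The constraint above (more numbers than fit in memory) suggests keeping only a running count and sum rather than the numbers themselves. A minimal sketch in plain Python; `RunningStats` and `numbers()` are hypothetical names invented for illustration, not part of the talk:

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class RunningStats:
    """Keeps only a count and a sum, never the numbers themselves."""
    count: int = 0
    total: float = 0.0

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value

    @property
    def average(self) -> float:
        # The average of an empty stream is undefined; 0.0 is an arbitrary choice.
        return self.total / self.count if self.count else 0.0


def numbers() -> Iterator[float]:
    """Hypothetical source: yields values one at a time, so nothing is stored."""
    for n in range(1, 1_000_001):
        yield float(n)


stats = RunningStats()
for value in numbers():
    stats.add(value)

print(stats.count, stats.total, stats.average)
```

Memory use stays constant regardless of how many values arrive, which is why the same idea carries over once the dataset stops being static.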
A simple streaming problem
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive
● The dataset is not static, new numbers are coming all the time
A simplish streaming problem
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time
A quite standard streaming problem
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time
● And since they use 3G and batteries, some might go quiet for a while and
then send a bunch of stale data
An elastic and scalable streaming problem
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time
● And since they use 3G and batteries, some might go quiet for a while and
then send a bunch of stale data
● Flow will not be constant (from a few events per second to thousands)
An almost real-life streaming analytics scenario
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a single
hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be adding
and removing sensors all the time
● And since they use 3G and batteries, some might go quiet for a while and then send
a bunch of stale data
● Flow will not be constant (from a few events per second to thousands)
● And I don’t want just the total average, but total per month, per week, per day, per
hour, per minute…
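Per-minute totals combined with stale data arriving late can be sketched by grouping on the event's own timestamp instead of its arrival time. This is an illustrative sketch in plain Python, not how Flink implements windows; the names `window_start`, `ingest`, and `t`, and the sample dates and values, are all assumptions made up for the example:

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 60  # one-minute tumbling windows


def window_start(event_time: datetime) -> datetime:
    """Round an event time down to the start of its window."""
    epoch = int(event_time.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)


# window start -> [count, total]
windows = defaultdict(lambda: [0, 0.0])


def ingest(event_time: datetime, value: float) -> None:
    """Add a reading to the window it belongs to, whenever it arrives."""
    bucket = windows[window_start(event_time)]
    bucket[0] += 1
    bucket[1] += value


def t(hour: int, minute: int, second: int) -> datetime:
    """Helper for building sample timestamps on an arbitrary date."""
    return datetime(2022, 6, 2, hour, minute, second, tzinfo=timezone.utc)


events = [
    (t(10, 0, 5), 10.0),
    (t(10, 1, 10), 30.0),
    (t(10, 0, 50), 20.0),  # arrives last but belongs to the 10:00 window
]
for event_time, value in events:
    ingest(event_time, value)

averages = {start: total / count for start, (count, total) in windows.items()}
```

This sketch keeps every window open forever. A stream processor has to decide when a window is complete, which is the role watermarks play in Flink.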
A real business problem you can solve with streaming
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a single hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be adding and removing sensors all the time
● And since they use 3G and batteries, some might go quiet for a while and then send a bunch of stale data
● Flow will not be constant (from a few events per second to thousands)
● And I don’t want just the total average, but total per month, per week, per day, per hour, per minute…
● We need pretty dashboards with current status, comparison with the past, trends, and
anomaly detection
● To run this reliably, we need advanced monitoring, alerts, and autoscaling
● No, I am not hiring a whole new operations team to manage the system
Let’s play a game
YOU are hired to build an end-to-end solution ingesting sensor data from thousands of devices for an e-health application or an
industrial customer.
Or YOU are hired to build tracking of vehicle fleets, like Uber, Lyft, or an airline.
Or YOU are going to ingest data from a game like Candy Crush or Fortnite.
Or YOU are in charge of monitoring and securing thousands of servers.
Or YOU are a video platform, or a telecom provider, and want to detect quality issues.
Or YOU have an ecommerce application where you need to display ads in real-time, provide recommendations, do fraud detection,
and detect cart abandonment.
Streaming data pipeline overview
Key requirements
Durable
Stateful
Continuous
Fast
Correct
Reactive
Reliable
Pipeline stages: Ingest, Transform, Analyze, React, Persist, Analyze/Reuse
Apache Kafka: A distributed streaming platform
Apache Flink: Stateful computations over data streams
QuestDB: The fastest open source time series database
Grafana: Query, visualize, alert on, and understand your
data no matter where it’s stored.
Apache Zeppelin: Web-based notebook that enables
data-driven, interactive data analytics and collaborative
documents with SQL, Scala, Python, R and more.
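One of the ways QuestDB accepts data is the InfluxDB line protocol, a text format with one row per line. The sketch below only builds such a line in plain Python to show its shape. The function `to_line`, the table name `sensors`, the tag and field names, and the timestamp are hypothetical, and the sketch does not escape special characters. In practice an official client library would normally do this work.

```python
from typing import Mapping


def to_line(
    table: str,
    tags: Mapping[str, str],
    fields: Mapping[str, float],
    timestamp_ns: int,
) -> str:
    """Format one row as: table,tag=value field=value timestamp_ns

    Simplification: no escaping of spaces, commas, or equals signs,
    and only float fields are handled.
    """
    tag_part = "".join(f",{key}={value}" for key, value in tags.items())
    field_part = ",".join(f"{key}={float(value)}" for key, value in fields.items())
    return f"{table}{tag_part} {field_part} {timestamp_ns}\n"


line = to_line(
    "sensors",                 # hypothetical table name
    {"device": "d42"},         # hypothetical tag
    {"temperature": 21.5},     # hypothetical field
    1654164005000000000,       # arbitrary epoch timestamp in nanoseconds
)
print(line, end="")
```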
Demo time
https://github.com/questdb/questdb
Javier Ramirez
(@supercoco9)
Developer Advocate
Grazie

Processing and analysing streaming data with Python. PyCon Italy 2022
