Processing and analysing streaming data with Python. Pycon Italy 2022

Data used to be a batch thing, but more and more we get unbounded streams of data, fast or slow, that we need to process and analyse in near real time. In this talk I’ll show you how you can use Apache Flink and QuestDB to build reliable streaming data pipelines that can grow as much as you need.

Processing and analysing streaming
data with Python
using Apache Kafka, Apache Flink, QuestDB and Grafana
Javier Ramirez
(@supercoco9)
Developer Advocate

A simple problem (until you know the details)
● I want to calculate the total and average of several numbers

A simple big data problem (until you know the details)
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive

A simple streaming problem
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive
● The dataset is not static, new numbers are coming all the time
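A minimal sketch of the idea on this slide, in plain Python: when the numbers never stop arriving and cannot all be stored, the total and the average can still be kept up to date with constant memory by updating a count and a sum per value. The `RunningStats` class and the `readings()` source are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class RunningStats:
    """Total and average kept in constant memory, one value at a time."""
    count: int = 0
    total: float = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.total += value

    @property
    def average(self) -> float:
        # The average of an empty stream is undefined; 0.0 is an arbitrary choice.
        return self.total / self.count if self.count else 0.0


def readings() -> Iterator[float]:
    """Hypothetical stand-in for an unbounded source of numbers."""
    yield from [3.0, 5.0, 10.0]


stats = RunningStats()
for value in readings():
    stats.update(value)  # no past values are stored, only count and total
    print(stats.total, stats.average)
```

Because only two numbers are kept, the same loop works whether the source yields three values or never ends.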

A simplish streaming problem
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time

A quite standard streaming problem
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time
● And since they use 3G and batteries, some might go quiet for a while and
then send a bunch of stale data
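One common way to cope with stale data is to aggregate by the time the event happened rather than the time it arrived, and to accept late events only up to a limit. The sketch below is plain Python, not Flink code; the one-minute window, the two-minute allowed lateness, and the `EventTimeAggregator` name are assumptions made for the example, and real stream processors handle watermarks in a more elaborate way.

```python
from collections import defaultdict

WINDOW_SECONDS = 60             # assumption: one-minute windows
ALLOWED_LATENESS_SECONDS = 120  # assumption: tolerate events up to two minutes late


class EventTimeAggregator:
    """Per-window count and total, keyed by event time, with a simple watermark."""

    def __init__(self) -> None:
        self.windows = defaultdict(lambda: [0, 0.0])  # window start -> [count, total]
        self.max_event_time = 0
        self.dropped = 0

    def add(self, event_time: int, value: float) -> bool:
        """Add one event; return False if it arrived too late to be counted."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - ALLOWED_LATENESS_SECONDS
        window_start = event_time - event_time % WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= watermark:
            # The window this event belongs to is already considered closed.
            self.dropped += 1
            return False
        bucket = self.windows[window_start]
        bucket[0] += 1
        bucket[1] += value
        return True


agg = EventTimeAggregator()
# (event time in seconds, value); the third and fifth events arrive out of order.
for event_time, value in [(0, 1.0), (65, 2.0), (10, 3.0), (400, 4.0), (20, 5.0)]:
    agg.add(event_time, value)

print(dict(agg.windows), agg.dropped)
```

The event at second 10 is late but still inside the allowed lateness, so it is counted in its original window; the event at second 20 arrives after the stream has moved on to second 400 and is dropped. This sketch never evicts old windows, which a long-running job would need to do.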

An elastic and scalable streaming problem
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a
single hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time
● And since they use 3G and batteries, some might go quiet for a while and
then send a bunch of stale data
● Flow will not be constant (from a few events per second to thousands)

An almost real-life streaming analytics scenario
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a single
hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be adding
and removing sensors all the time
● And since they use 3G and batteries, some might go quiet for a while and then send
a bunch of stale data
● Flow will not be constant (from a few events per second to thousands)
● And I don’t want just the overall total and average, but totals per month, per week,
per day, per hour, per minute…
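As a rough illustration of totals at several granularities, each event can be added to one bucket per time unit, with the bucket key obtained by truncating the event timestamp. This is a plain Python sketch with made-up sample events; the `GRANULARITIES` table and the `aggregate` function are hypothetical, and the per-week bucket is left out to keep the example short.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Bucket key formats: truncating a timestamp is done by formatting it
# down to the desired precision.
GRANULARITIES = {
    "minute": "%Y-%m-%d %H:%M",
    "hour": "%Y-%m-%d %H",
    "day": "%Y-%m-%d",
    "month": "%Y-%m",
}


def aggregate(events):
    """Return {(granularity, bucket): (count, total, average)} for all events."""
    totals = defaultdict(lambda: [0, 0.0])
    for moment, value in events:
        for name, fmt in GRANULARITIES.items():
            bucket = totals[(name, moment.strftime(fmt))]
            bucket[0] += 1
            bucket[1] += value
    return {
        key: (count, total, total / count)
        for key, (count, total) in totals.items()
    }


# Made-up sample readings: (event timestamp, value).
events = [
    (datetime(2022, 6, 2, 10, 0, 30, tzinfo=timezone.utc), 10.0),
    (datetime(2022, 6, 2, 10, 0, 45, tzinfo=timezone.utc), 20.0),
    (datetime(2022, 6, 2, 10, 1, 5, tzinfo=timezone.utc), 30.0),
    (datetime(2022, 6, 3, 9, 0, 0, tzinfo=timezone.utc), 40.0),
]

result = aggregate(events)
print(result[("minute", "2022-06-02 10:00")])
print(result[("month", "2022-06")])
```

Every event updates one bucket per granularity, so the cost per event stays constant while the number of buckets grows with the time range covered.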

A real business problem you can solve with streaming
● I want to calculate the total and average of several numbers
● They might be MANY numbers, more than you can store in memory, or in a single hard drive
● The dataset is not static, new numbers are coming all the time
● From different sensors, which are geo distributed and moving. We will be adding and removing sensors all the time
● And since they use 3G and batteries, some might go quiet for a while and then send a bunch of stale data
● Flow will not be constant (from a few events per second to thousands)
● And I don’t want just the overall total and average, but totals per month, per week, per day, per hour, per minute…
● We need pretty dashboards with current status, comparison with the past, trends, and
anomaly detection
● To run this reliably, we need advanced monitoring, alerts, and autoscaling
● No, I am not hiring a whole new operations team to manage the system

Let’s play a game
YOU are hired to build an end-to-end solution ingesting sensor data from thousands of devices for an e-health application or an
industrial customer.
Or YOU are hired to build tracking of vehicle fleets, like Uber, Lyft, or an airline.
Or YOU are going to ingest data from a game like Candy Crush or Fortnite.
Or YOU are in charge of monitoring and securing thousands of servers.
Or YOU are a video platform, or a telecom provider, and want to detect quality issues.
Or YOU have an ecommerce application where you need to display ads in real-time, provide recommendations, do fraud detection,
and detect cart abandonment.

Streaming data pipeline overview
Key requirements: Durable, Stateful, Continuous, Fast, Correct, Reactive, Reliable
Pipeline stages: Ingest, Transform, Analyze, React, Persist, Analyze/Reuse
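The stages named on this slide can be sketched as a chain of Python generators, where each stage consumes the previous one lazily. This is a toy model only: the function names, the sample events, and the 40-degree alert threshold are all hypothetical, and the comments that map stages to tools are an assumption about how the stack in this talk is usually arranged, not something the slide states.

```python
from typing import Iterable, Iterator, List


def ingest() -> Iterator[dict]:
    """Hypothetical source (assumed role of a message broker such as Kafka)."""
    yield {"sensor": "a", "celsius": 21.5}
    yield {"sensor": "b", "celsius": 48.0}
    yield {"sensor": "a", "celsius": 22.5}


def transform(events: Iterable[dict]) -> Iterator[dict]:
    """Enrich each event, here with a derived Fahrenheit reading."""
    for event in events:
        yield {**event, "fahrenheit": event["celsius"] * 9 / 5 + 32}


def analyze(events: Iterable[dict]) -> Iterator[dict]:
    """Stateful step: attach the running average seen so far."""
    count, total = 0, 0.0
    for event in events:
        count += 1
        total += event["celsius"]
        yield {**event, "running_avg": total / count}


def react(events: Iterable[dict], alerts: List[str], threshold: float = 40.0) -> Iterator[dict]:
    """Record an alert for readings above the (made-up) threshold."""
    for event in events:
        if event["celsius"] > threshold:
            alerts.append(event["sensor"])
        yield event


def persist(events: Iterable[dict], store: List[dict]) -> None:
    """Append to a list (assumed role of a time series database)."""
    for event in events:
        store.append(event)


alerts: List[str] = []
store: List[dict] = []
persist(react(analyze(transform(ingest())), alerts), store)
print(alerts, len(store))
```

Nothing runs until `persist` starts pulling events, so each event flows through every stage before the next one is read, which is the continuous behaviour the requirements list asks for.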

Apache Kafka: A distributed streaming platform
Apache Flink: Stateful computations over data streams
QuestDB: The fastest open source time series database
Grafana: Query, visualize, alert on, and understand your
data no matter where it’s stored.
Apache Zeppelin: Web-based notebook that enables
data-driven, interactive data analytics and collaborative
documents with SQL, Scala, Python, R and more.

Demo time

https://github.com/questdb/questdb
Javier Ramirez
(@supercoco9)
Developer Advocate
Grazie
