SlideShare a Scribd company logo
1 of 29
Download to read offline
NYC 2023
Javier Ramírez
QuestDB
@supercoco9 / @j@chaos.social
Deduplicating And Analysing
Time-Series Data With
Apache Beam And QuestDB
About me: I like databases & open source
2022- today. Developer relations at an open source database vendor
● QuestDB, PostgreSQL, MongoDB, Timescale, InfluxDB, Apache Flink
2019-2022. Data & Analytics specialist at a cloud provider
● Amazon Aurora, Neptune, Athena, Timestream, DynamoDB, DocumentDB, Kinesis Data Streams, Kinesis Data
Analytics, Redshift, ElastiCache for Redis, QLDB, ElasticSearch, OpenSearch, Cassandra, Spark…
2013-2018. Data Engineer/Big Data & Analytics consultant
● PostgreSQL, Redis, Neo4j, Google BigQuery, BigTable, Google Cloud Spanner, Apache Spark, Apache BEAM,
Apache Flink, HBase, MongoDB, Presto
2006-2012 - Web developer
● MySQL, Redis, PostgreSQL, Sqlite, ElasticSearch
late nineties to 2005. Desktop/CGI/Servlets/ EJBs/CORBA
● MS Access, MySQL, Oracle, Sybase, Informix
As a student/hobbyist (late eighties - early nineties)
● Amsbase, DBase III, DBase IV, Foxpro, Microsoft Works, Informix
The pre-SQL years
The licensed SQL period
The libre and open SQL
revolution / The NoSQL
rise
The hadoop dark ages / The
python hegemony/ The cloud
database big migrations
The streaming era/ The
database as a service
singularity
The SQL revenge/ the
realtime database/the
embedded database
BEAM SUMMIT NYC 2023
#
Agenda
● The problem of data duplication
● The problem of data duplication
● The problem of data duplication
● The problem of data duplication
● Behold: a dashboard!
● The many challenges of time-series data
● QuestDB to the rescue
● Down the rabbit hole of writing a custom BEAM Sink
○ Finding several needles on a documentation haystack
○ When I sadly discovered Python streaming support is meh
○ The unsung hero saves the day (again): implementing the Sink in Java
BEAM SUMMIT NYC 2023
#
Duplication
WHY
BEAM SUMMIT NYC 2023
#
Duplication
HOW
BEAM SUMMIT NYC 2023
#
Duplication
WHAT
BEAM SUMMIT NYC 2023
#
My lazy approach to choosing a database
If you can use only one
database for everything, go
with PostgreSQL*
* Or any other major and well supported RDBMS
BEAM SUMMIT NYC 2023
#
Imagine…
a factory floor with 500 machines, or
a fleet with 500 vehicles, or
50 trains, with 10 cars each, or
500 users with a mobile phone, or
500 financial instruments generating tick data
…sending data every second
BEAM SUMMIT NYC 2023
#
A conventional database’s nightmare
43,200,000 rows a day…….
302,400,000 rows a week….
1,314,144,000 rows a month
BEAM SUMMIT NYC 2023
#
Timestamps are hard
BEAM SUMMIT NYC 2023
#
Time-series analytics in a nutshell
Working with timestamped data
in a database is tricky*
* specially working with analytics of data changing over time or at a high rate
BEAM SUMMIT NYC 2023
#
We’d like to be known for
● Performance
○ Better performance with smaller machines
● Developer Experience
● Proudly Open Source (Apache 2.0)
NYC 2023
A quick overview of some
interesting queries
BEAM SUMMIT NYC 2023
#
Try it live on https://demo.questdb.io
WHERE … TIME RANGE
SELECT * from trips WHERE pickup_datetime in '2018';
SELECT * from trips WHERE pickup_datetime in '2018-06';
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59';
SELECT * from trips WHERE pickup_datetime in '2018;2M' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;10s' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;-3d' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59:58;4s;1d;7'
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59:58;4s;-1d;7'
BEAM SUMMIT NYC 2023
#
Try it live on https://demo.questdb.io
SAMPLE BY
Aggregates data in homogeneous time chunks
SELECT
timestamp,
sum(price * amount) / sum(amount) AS vwap_price,
sum(amount) AS volume
FROM trades
WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now())
SAMPLE BY 15m ALIGN TO CALENDAR;
SELECT timestamp, min(tempF),
max(tempF), avg(tempF)
FROM weather SAMPLE BY 1M;
BEAM SUMMIT NYC 2023
#
Try it live on https://demo.questdb.io
SAMPLE BY … FILL
Can fill missing time chunks using different strategies (NULL, constant, LINEAR, PREVious value)
SELECT
timestamp,
sum(price * amount) / sum(amount) AS vwap_price,
sum(amount) AS volume
FROM trades
WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now())
SAMPLE BY 1s FILL(NULL) ALIGN TO CALENDAR;
BEAM SUMMIT NYC 2023
#
Try it live on https://demo.questdb.io
LATEST ON …
PARTITION BY …
Retrieves the latest entry by timestamp for a given key or combination of keys, for scenarios where
multiple time series are stored in the same table.
SELECT * FROM trades
WHERE symbol in ('BTC-USD', 'ETH-USD')
LATEST ON timestamp PARTITION BY symbol, side;
BEAM SUMMIT NYC 2023
#
Try it live on https://demo.questdb.io
ASOF JOIN / LT JOIN
SPLICE JOIN
ASOF JOIN joins two different time-series measured. For each row in the first time-series, the ASOF
JOIN takes from the second time-series a timestamp that meets both of the following criteria:
● The timestamp is the closest to the first timestamp.
● The timestamp is strictly prior or equal to the first timestamp.
WITH trips2018 AS (
SELECT * from trips WHERE pickup_datetime in '2016'
)
SELECT pickup_datetime, fare_amount, tempF, windDir
FROM trips2018
ASOF JOIN weather;
BEAM SUMMIT NYC 2023
#
Building a Sink connector
QuestDB cannot do in-stream
deduplications.
Apache BEAM can help
BEAM SUMMIT NYC 2023
#
The Python QuestDB Sink
● WriteToQuestDB(PTransform) class
○ Receives the args you need to pass to the sink
○ Implements the expand method, which receives the PCollection then invokes
ParDo to _WriteTOQuestDBFn
○
● _WriteToQuestDBFn(DoFn) class
○ Instantiates _QuestDBSink on start_bundle
○ Flushes/releases _QuestDBSink on finish_bundle
○ Implements display_data to show info on the UI
○ Calls to _QuestDBSink.write on the process method
○
● _QuestDBSink class
○ Deals with the QuestDB connection itself
BEAM SUMMIT NYC 2023
#
The Python QuestDB Sink
https://github.com/javier/questdb-beam/tree/main/python
pcoll | WriteToQuestDB(table,
symbols=[list_of_symbols],
columns=[list_of_columns],
host=host,
port=port,
batch_size=optionalSizeOfBatch,
tls=optionalBoolean,
auth=optionalAuthDict)
BEAM SUMMIT NYC 2023
#
The Java QuestDB Sink
● QuestDbIO.Write class, extends PTransform
○ Receives the args you need to pass to the sink
○ Uses @AutoValue to generate classes “magically”
○ Implements the expand method, which receives the PCollection then invokes
ParDo to QuestDbIO.Write.WriteFn (with optional deduplication)
○ Implements populateDisplayData
● QuestDbIO.Write.WriteFn class, extends DoFn
○ Instantiates QuestDBSender on start_bundle
○ Flushes/closes QuestDBSender on finish_bundle
○ Parses/sends the QuestDbRow to QuestDB on the process method
BEAM SUMMIT NYC 2023
#
Where the magic happens
https://github.com/javier/questdb-beam/blob/main/java/src/main/java/org/apache/beam/sdk/io/questdb/QuestDbIO.java
keydAndWindowed = (PCollection) input.apply(WithKeys.of(new SerializableFunction<QuestDbRow, String>() {
@Override
public String apply(QuestDbRow r) {
return String.valueOf(r.hashCode());
}
}));
PCollection windowedItems = (PCollection)
keydAndWindowed.apply(
Window.
<KV<String, String>>into(
Sessions.
withGapDuration(
Duration.standardSeconds(deduplicationDurationMillis())
)
)
);
PCollection<QuestDbRow> uniqueRows = (PCollection<QuestDbRow>)
((PCollection) keydAndWindowed.apply(
Deduplicate.keyedValues()
)
).apply(Values.create());
BEAM SUMMIT NYC 2023
#
The Java QuestDB Sink
https://github.com/javier/questdb-beam/tree/main/java
// pcoll needs to be of type QuestDbRow
pcoll.apply(ParDo.of(new LineToMapFn()));
parsedLines.apply(QuestDbIO.write()
.withUri("your-instance-host.questdb.com:YOUR_PORT")
.withTable("beam_demo")
.withDeduplicationEnabled(true)
.withDeduplicationByValue(false)
.withDeduplicationDurationMillis(5L)
.withSSLEnabled(true)
.withAuthEnabled(true)
.withAuthUser("admin")
.withAuthToken("verySecretToken")
https://github.com/questdb/questdb
https://cloud.questdb.com
NYC 2023
QUESTIONS?
Javier Ramirez
@supercoco9
https://github.com/javier/questdb-beam
https://github.com/javier/questdb-quickstart
https://github.com/questdb/questdb
https://demo.questdb.io
https://cloud.questdb.com
Thanks!

More Related Content

Similar to Deduplicating and analysing time-series data with Apache Beam and QuestDB

Using SQL-MapReduce for Advanced Analytics
Using SQL-MapReduce for Advanced AnalyticsUsing SQL-MapReduce for Advanced Analytics
Using SQL-MapReduce for Advanced AnalyticsTeradata Aster
 
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...Codemotion
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
IOT with PostgreSQL
IOT with PostgreSQLIOT with PostgreSQL
IOT with PostgreSQLEDB
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
Benefits of Using MongoDB Over RDBMS (At An Evening with MongoDB Minneapolis ...
Benefits of Using MongoDB Over RDBMS (At An Evening with MongoDB Minneapolis ...Benefits of Using MongoDB Over RDBMS (At An Evening with MongoDB Minneapolis ...
Benefits of Using MongoDB Over RDBMS (At An Evening with MongoDB Minneapolis ...MongoDB
 
Advanced kapacitor
Advanced kapacitorAdvanced kapacitor
Advanced kapacitorInfluxData
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...Karthik Murugesan
 
Data Warehousing with Amazon Redshift: Data Analytics Week at the SF Loft
Data Warehousing with Amazon Redshift: Data Analytics Week at the SF LoftData Warehousing with Amazon Redshift: Data Analytics Week at the SF Loft
Data Warehousing with Amazon Redshift: Data Analytics Week at the SF LoftAmazon Web Services
 
Benefits of Using MongoDB Over RDBMSs
Benefits of Using MongoDB Over RDBMSsBenefits of Using MongoDB Over RDBMSs
Benefits of Using MongoDB Over RDBMSsMongoDB
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Robbie Strickland
 
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...Gianfranco Palumbo
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Giridhar Addepalli
 
Adtech x Scala x Performance tuning
Adtech x Scala x Performance tuningAdtech x Scala x Performance tuning
Adtech x Scala x Performance tuningYosuke Mizutani
 
Precog & MongoDB User Group: Skyrocket Your Analytics
Precog & MongoDB User Group: Skyrocket Your Analytics Precog & MongoDB User Group: Skyrocket Your Analytics
Precog & MongoDB User Group: Skyrocket Your Analytics MongoDB
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
Data Warehousing with Amazon Redshift: Data Analytics Week SF
Data Warehousing with Amazon Redshift: Data Analytics Week SFData Warehousing with Amazon Redshift: Data Analytics Week SF
Data Warehousing with Amazon Redshift: Data Analytics Week SFAmazon Web Services
 
GeoPython - Mapping Data in Jupyter Notebooks with PixieDust
GeoPython - Mapping Data in Jupyter Notebooks with PixieDustGeoPython - Mapping Data in Jupyter Notebooks with PixieDust
GeoPython - Mapping Data in Jupyter Notebooks with PixieDustMargriet Groenendijk
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 

Similar to Deduplicating and analysing time-series data with Apache Beam and QuestDB (20)

Using SQL-MapReduce for Advanced Analytics
Using SQL-MapReduce for Advanced AnalyticsUsing SQL-MapReduce for Advanced Analytics
Using SQL-MapReduce for Advanced Analytics
 
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
IOT with PostgreSQL
IOT with PostgreSQLIOT with PostgreSQL
IOT with PostgreSQL
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
Benefits of Using MongoDB Over RDBMS (At An Evening with MongoDB Minneapolis ...
Benefits of Using MongoDB Over RDBMS (At An Evening with MongoDB Minneapolis ...Benefits of Using MongoDB Over RDBMS (At An Evening with MongoDB Minneapolis ...
Benefits of Using MongoDB Over RDBMS (At An Evening with MongoDB Minneapolis ...
 
Advanced kapacitor
Advanced kapacitorAdvanced kapacitor
Advanced kapacitor
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...
 
Data Warehousing with Amazon Redshift: Data Analytics Week at the SF Loft
Data Warehousing with Amazon Redshift: Data Analytics Week at the SF LoftData Warehousing with Amazon Redshift: Data Analytics Week at the SF Loft
Data Warehousing with Amazon Redshift: Data Analytics Week at the SF Loft
 
Benefits of Using MongoDB Over RDBMSs
Benefits of Using MongoDB Over RDBMSsBenefits of Using MongoDB Over RDBMSs
Benefits of Using MongoDB Over RDBMSs
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015
 
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01
 
Adtech x Scala x Performance tuning
Adtech x Scala x Performance tuningAdtech x Scala x Performance tuning
Adtech x Scala x Performance tuning
 
Precog & MongoDB User Group: Skyrocket Your Analytics
Precog & MongoDB User Group: Skyrocket Your Analytics Precog & MongoDB User Group: Skyrocket Your Analytics
Precog & MongoDB User Group: Skyrocket Your Analytics
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
Data Warehousing with Amazon Redshift: Data Analytics Week SF
Data Warehousing with Amazon Redshift: Data Analytics Week SFData Warehousing with Amazon Redshift: Data Analytics Week SF
Data Warehousing with Amazon Redshift: Data Analytics Week SF
 
GeoPython - Mapping Data in Jupyter Notebooks with PixieDust
GeoPython - Mapping Data in Jupyter Notebooks with PixieDustGeoPython - Mapping Data in Jupyter Notebooks with PixieDust
GeoPython - Mapping Data in Jupyter Notebooks with PixieDust
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 

More from javier ramirez

¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfest¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfestjavier ramirez
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...javier ramirez
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)javier ramirez
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...javier ramirez
 
QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728javier ramirez
 
Processing and analysing streaming data with Python. Pycon Italy 2022
Processing and analysing streaming  data with Python. Pycon Italy 2022Processing and analysing streaming  data with Python. Pycon Italy 2022
Processing and analysing streaming data with Python. Pycon Italy 2022javier ramirez
 
QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...javier ramirez
 
Servicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en AragónServicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en Aragónjavier ramirez
 
Primeros pasos en desarrollo serverless
Primeros pasos en desarrollo serverlessPrimeros pasos en desarrollo serverless
Primeros pasos en desarrollo serverlessjavier ramirez
 
How AWS is reinventing the cloud
How AWS is reinventing the cloudHow AWS is reinventing the cloud
How AWS is reinventing the cloudjavier ramirez
 
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAMAnalitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAMjavier ramirez
 
Getting started with streaming analytics
Getting started with streaming analyticsGetting started with streaming analytics
Getting started with streaming analyticsjavier ramirez
 
Getting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipelineGetting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipelinejavier ramirez
 
Getting started with streaming analytics: Deep Dive
Getting started with streaming analytics: Deep DiveGetting started with streaming analytics: Deep Dive
Getting started with streaming analytics: Deep Divejavier ramirez
 
Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)javier ramirez
 
Monitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWSMonitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWSjavier ramirez
 
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...javier ramirez
 
Recomendaciones, predicciones y detección de fraude usando servicios de intel...
Recomendaciones, predicciones y detección de fraude usando servicios de intel...Recomendaciones, predicciones y detección de fraude usando servicios de intel...
Recomendaciones, predicciones y detección de fraude usando servicios de intel...javier ramirez
 
Open Distro for ElasticSearch and how Grimoire is using it. Madrid DevOps Oct...
Open Distro for ElasticSearch and how Grimoire is using it. Madrid DevOps Oct...Open Distro for ElasticSearch and how Grimoire is using it. Madrid DevOps Oct...
Open Distro for ElasticSearch and how Grimoire is using it. Madrid DevOps Oct...javier ramirez
 

More from javier ramirez (20)

¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfest¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfest
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728
 
Processing and analysing streaming data with Python. Pycon Italy 2022
Processing and analysing streaming  data with Python. Pycon Italy 2022Processing and analysing streaming  data with Python. Pycon Italy 2022
Processing and analysing streaming data with Python. Pycon Italy 2022
 
QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...
 
Servicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en AragónServicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en Aragón
 
Primeros pasos en desarrollo serverless
Primeros pasos en desarrollo serverlessPrimeros pasos en desarrollo serverless
Primeros pasos en desarrollo serverless
 
How AWS is reinventing the cloud
How AWS is reinventing the cloudHow AWS is reinventing the cloud
How AWS is reinventing the cloud
 
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAMAnalitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
 
Getting started with streaming analytics
Getting started with streaming analyticsGetting started with streaming analytics
Getting started with streaming analytics
 
Getting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipelineGetting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipeline
 
Getting started with streaming analytics: Deep Dive
Getting started with streaming analytics: Deep DiveGetting started with streaming analytics: Deep Dive
Getting started with streaming analytics: Deep Dive
 
Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)
 
Monitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWSMonitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWS
 
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
 
Recomendaciones, predicciones y detección de fraude usando servicios de intel...
Recomendaciones, predicciones y detección de fraude usando servicios de intel...Recomendaciones, predicciones y detección de fraude usando servicios de intel...
Recomendaciones, predicciones y detección de fraude usando servicios de intel...
 
Open Distro for ElasticSearch and how Grimoire is using it. Madrid DevOps Oct...
Open Distro for ElasticSearch and how Grimoire is using it. Madrid DevOps Oct...Open Distro for ElasticSearch and how Grimoire is using it. Madrid DevOps Oct...
Open Distro for ElasticSearch and how Grimoire is using it. Madrid DevOps Oct...
 

Recently uploaded

Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 

Recently uploaded (20)

Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 

Deduplicating and analysing time-series data with Apache Beam and QuestDB

  • 1.
  • 2. NYC 2023 Javier Ramírez QuestDB @supercoco9 / @j@chaos.social Deduplicating And Analysing Time-Series Data With Apache Beam And QuestDB
  • 3. About me: I like databases & open source 2022- today. Developer relations at an open source database vendor ● QuestDB, PostgreSQL, MongoDB, Timescale, InfluxDB, Apache Flink 2019-2022. Data & Analytics specialist at a cloud provider ● Amazon Aurora, Neptune, Athena, Timestream, DynamoDB, DocumentDB, Kinesis Data Streams, Kinesis Data Analytics, Redshift, ElastiCache for Redis, QLDB, ElasticSearch, OpenSearch, Cassandra, Spark… 2013-2018. Data Engineer/Big Data & Analytics consultant ● PostgreSQL, Redis, Neo4j, Google BigQuery, BigTable, Google Cloud Spanner, Apache Spark, Apache BEAM, Apache Flink, HBase, MongoDB, Presto 2006-2012 - Web developer ● MySQL, Redis, PostgreSQL, Sqlite, ElasticSearch late nineties to 2005. Desktop/CGI/Servlets/ EJBs/CORBA ● MS Access, MySQL, Oracle, Sybase, Informix As a student/hobbyist (late eighties - early nineties) ● Amsbase, DBase III, DBase IV, Foxpro, Microsoft Works, Informix The pre-SQL years The licensed SQL period The libre and open SQL revolution / The NoSQL rise The hadoop dark ages / The python hegemony/ The cloud database big migrations The streaming era/ The database as a service singularity The SQL revenge/ the realtime database/the embedded database
  • 4. BEAM SUMMIT NYC 2023 # Agenda ● The problem of data duplication ● The problem of data duplication ● The problem of data duplication ● The problem of data duplication ● Behold: a dashboard! ● The many challenges of time-series data ● QuestDB to the rescue ● Down the rabbit hole of writing a custom BEAM Sink ○ Finding several needles on a documentation haystack ○ When I sadly discovered Python streaming support is meh ○ The unsung hero saves the day (again): implementing the Sink in Java
  • 5. BEAM SUMMIT NYC 2023 # Duplication WHY
  • 6. BEAM SUMMIT NYC 2023 # Duplication HOW
  • 7. BEAM SUMMIT NYC 2023 # Duplication WHAT
  • 8. BEAM SUMMIT NYC 2023 # My lazy approach to choosing a database If you can use only one database for everything, go with PostgreSQL* * Or any other major and well supported RDBMS
  • 9. BEAM SUMMIT NYC 2023 # Imagine… a factory floor with 500 machines, or a fleet with 500 vehicles, or 50 trains, with 10 cars each, or 500 users with a mobile phone, or 500 financial instruments generating tick data …sending data every second
  • 10. BEAM SUMMIT NYC 2023 # A conventional database’s nightmare 43,200,000 rows a day……. 302,400,000 rows a week…. 1,314,144,000 rows a month
  • 11. BEAM SUMMIT NYC 2023 # Timestamps are hard
  • 12. BEAM SUMMIT NYC 2023 # Time-series analytics in a nutshell Working with timestamped data in a database is tricky* * specially working with analytics of data changing over time or at a high rate
  • 13.
  • 14. BEAM SUMMIT NYC 2023 # We’d like to be known for ● Performance ○ Better performance with smaller machines ● Developer Experience ● Proudly Open Source (Apache 2.0)
  • 15. NYC 2023 A quick overview of some interesting queries
  • 16. BEAM SUMMIT NYC 2023 # Try it live on https://demo.questdb.io WHERE … TIME RANGE SELECT * from trips WHERE pickup_datetime in '2018'; SELECT * from trips WHERE pickup_datetime in '2018-06'; SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59'; SELECT * from trips WHERE pickup_datetime in '2018;2M' LIMIT -10; SELECT * from trips WHERE pickup_datetime in '2018;10s' LIMIT -10; SELECT * from trips WHERE pickup_datetime in '2018;-3d' LIMIT -10; SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59:58;4s;1d;7' SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59:58;4s;-1d;7'
  • 17. BEAM SUMMIT NYC 2023 # Try it live on https://demo.questdb.io SAMPLE BY Aggregates data in homogeneous time chunks SELECT timestamp, sum(price * amount) / sum(amount) AS vwap_price, sum(amount) AS volume FROM trades WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now()) SAMPLE BY 15m ALIGN TO CALENDAR; SELECT timestamp, min(tempF), max(tempF), avg(tempF) FROM weather SAMPLE BY 1M;
  • 18. BEAM SUMMIT NYC 2023 # Try it live on https://demo.questdb.io SAMPLE BY … FILL Can fill missing time chunks using different strategies (NULL, constant, LINEAR, PREVious value) SELECT timestamp, sum(price * amount) / sum(amount) AS vwap_price, sum(amount) AS volume FROM trades WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now()) SAMPLE BY 1s FILL(NULL) ALIGN TO CALENDAR;
  • 19. BEAM SUMMIT NYC 2023 # Try it live on https://demo.questdb.io LATEST ON … PARTITION BY … Retrieves the latest entry by timestamp for a given key or combination of keys, for scenarios where multiple time series are stored in the same table. SELECT * FROM trades WHERE symbol in ('BTC-USD', 'ETH-USD') LATEST ON timestamp PARTITION BY symbol, side;
  • 20. BEAM SUMMIT NYC 2023 # Try it live on https://demo.questdb.io ASOF JOIN / LT JOIN SPLICE JOIN ASOF JOIN joins two different time-series measured. For each row in the first time-series, the ASOF JOIN takes from the second time-series a timestamp that meets both of the following criteria: ● The timestamp is the closest to the first timestamp. ● The timestamp is strictly prior or equal to the first timestamp. WITH trips2018 AS ( SELECT * from trips WHERE pickup_datetime in '2016' ) SELECT pickup_datetime, fare_amount, tempF, windDir FROM trips2018 ASOF JOIN weather;
  • 21. BEAM SUMMIT NYC 2023 # Building a Sink connector QuestDB cannot do in-stream deduplications. Apache BEAM can help
  • 22. BEAM SUMMIT NYC 2023 # The Python QuestDB Sink ● WriteToQuestDB(PTransform) class ○ Receives the args you need to pass to the sink ○ Implements the expand method, which receives the PCollection then invokes ParDo to _WriteTOQuestDBFn ○ ● _WriteToQuestDBFn(DoFn) class ○ Instantiates _QuestDBSink on start_bundle ○ Flushes/releases _QuestDBSink on finish_bundle ○ Implements display_data to show info on the UI ○ Calls to _QuestDBSink.write on the process method ○ ● _QuestDBSink class ○ Deals with the QuestDB connection itself
  • 23. BEAM SUMMIT NYC 2023 # The Python QuestDB Sink https://github.com/javier/questdb-beam/tree/main/python pcoll | WriteToQuestDB(table, symbols=[list_of_symbols], columns=[list_of_columns], host=host, port=port, batch_size=optionalSizeOfBatch, tls=optionalBoolean, auth=optionalAuthDict)
  • 24.
  • 25. BEAM SUMMIT NYC 2023 # The Java QuestDB Sink ● QuestDbIO.Write class, extends PTransform ○ Receives the args you need to pass to the sink ○ Uses @AutoValue to generate classes “magically” ○ Implements the expand method, which receives the PCollection then invokes ParDo to QuestDbIO.Write.WriteFn (with optional deduplication) ○ Implements populateDisplayData ● QuestDbIO.Write.WriteFn class, extends DoFn ○ Instantiates QuestDBSender on start_bundle ○ Flushes/closes QuestDBSender on finish_bundle ○ Parses/sends the QuestDbRow to QuestDB on the process method
  • 26. BEAM SUMMIT NYC 2023 # Where the magic happens https://github.com/javier/questdb-beam/blob/main/java/src/main/java/org/apache/beam/sdk/io/questdb/QuestDbIO.java keydAndWindowed = (PCollection) input.apply(WithKeys.of(new SerializableFunction<QuestDbRow, String>() { @Override public String apply(QuestDbRow r) { return String.valueOf(r.hashCode()); } })); PCollection windowedItems = (PCollection) keydAndWindowed.apply( Window. <KV<String, String>>into( Sessions. withGapDuration( Duration.standardSeconds(deduplicationDurationMillis()) ) ) ); PCollection<QuestDbRow> uniqueRows = (PCollection<QuestDbRow>) ((PCollection) keydAndWindowed.apply( Deduplicate.keyedValues() ) ).apply(Values.create());
  • 27. BEAM SUMMIT NYC 2023 # The Java QuestDB Sink https://github.com/javier/questdb-beam/tree/main/java // pcoll needs to be of type QuestDbRow pcoll.apply(ParDo.of(new LineToMapFn())); parsedLines.apply(QuestDbIO.write() .withUri("your-instance-host.questdb.com:YOUR_PORT") .withTable("beam_demo") .withDeduplicationEnabled(true) .withDeduplicationByValue(false) .withDeduplicationDurationMillis(5L) .withSSLEnabled(true) .withAuthEnabled(true) .withAuthUser("admin") .withAuthToken("verySecretToken")