Your Timestamps Deserve Better than a Generic Database, by javier ramirez (QuestDB)

This document discusses the challenges of working with timestamped data in databases and introduces QuestDB, a time-series database designed to address them. It highlights QuestDB's high performance for ingesting and querying large volumes of timestamped data, demonstrates several time-series query patterns (time-range queries, sampling, filling missing data, retrieving the latest value, and approximate joins between tables), and outlines some areas QuestDB is exploring to further improve performance.
2. About me: I like databases & open source
2022 - today. Developer relations at an open source database vendor
● QuestDB, PostgreSQL, MongoDB, Timescale, InfluxDB, Apache Flink
2019 - 2022. Data & Analytics specialist at a cloud provider
● Amazon Aurora, Neptune, Athena, Timestream, DynamoDB, DocumentDB, Kinesis Data Streams, Kinesis Data Analytics, Redshift, ElastiCache for Redis, QLDB, ElasticSearch, OpenSearch, Cassandra, Spark…
2013 - 2018. Data Engineer / Big Data & Analytics consultant
● PostgreSQL, Redis, Neo4j, Google BigQuery, BigTable, Google Cloud Spanner, Apache Spark, Apache BEAM, Apache Flink, HBase, MongoDB, Presto
2006 - 2012. Web developer
● MySQL, Redis, PostgreSQL, SQLite, ElasticSearch
Late nineties to 2005. Desktop/CGI/Servlets/EJBs/CORBA
● MS Access, MySQL, Oracle, Sybase, Informix
As a student/hobbyist (late eighties - early nineties)
● Amsbase, dBase III, dBase IV, FoxPro, Microsoft Works, Informix
(Timeline slide: the database eras)
● The pre-SQL years
● The licensed SQL period
● The libre and open SQL revolution / The NoSQL rise
● The Hadoop dark ages / The Python hegemony / The cloud database big migrations
● The streaming era / The database-as-a-service singularity
● The SQL revenge / the realtime database / the embedded database
10. Working with timestamped data in a database is tricky*
* especially when doing analytics on data that changes over time or arrives at a high rate
11. If you can use only one database for everything, go with PostgreSQL*
* Or any other major, well-supported RDBMS
12. Some things RDBMS are not designed for
● Writing data faster than it is read (several million inserts per day, and faster)
● Querying the rate of ingestion (how many records per second?) and identifying gaps
● Joining tables by approximate timestamp
● Aggregates over billions of records
● Sparse data (tables with hundreds or thousands of columns)
18. Now imagine
● a factory floor with 500 machines, or
● a fleet with 500 vehicles, or
● 50 trains with 10 cars each, or
● 500 users with a mobile phone
…each sending data every second
19. 500 sources × 86,400 seconds = 43,200,000 rows a day…
302,400,000 rows a week…
1,314,144,000 rows a month
23. We would like to be known for:
● Performance
○ Better performance with smaller machines
● Developer Experience
● Proudly Open Source (Apache 2.0)
24. Do you have a time-series problem? Write patterns
● You mostly insert data. You rarely update or delete individual rows
● Since data keeps growing, you will very likely end up with much more data than your typical operational database would be happy with
● Both ingestion and querying speed are critical for your business
25. Do you have a time-series problem? Read patterns
● Most of your queries are scoped to a time range
● You typically access recent/fresh data rather than older data
● Maybe you eventually just delete data because it is too old
● You often need to resample your data for aggregations/analytics
● You often need to align timestamps from multiple data series
● Your data powers near real-time dashboards
27. All benchmarks are lies (but they give us a ballpark)
Ingesting over 1.4 million rows per second (using 5 CPU threads)
https://questdb.io/blog/2021/05/10/questdb-release-6-0-tsbs-benchmark/
While running queries scanning over 4 billion rows per second (16 CPU threads)
https://questdb.io/blog/2022/05/26/query-benchmark-questdb-versus-clickhouse-timescale/
Time-series specialised benchmark
https://github.com/questdb/tsbs
34. Try out query performance on open datasets
https://demo.questdb.io/
Using SQL and the Postgres Wire Protocol
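For example, this sketch (using the public trips demo table that also appears later in this deck) can be pasted straight into the console; SAMPLE BY is covered in more detail a few slides further on:

-- Daily average fare across one month of taxi data.
SELECT pickup_datetime, avg(fare_amount) AS avg_fare
FROM trips
WHERE pickup_datetime IN '2018-06'
SAMPLE BY 1d;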
35. Performance is only part of the question. Usability and time-series semantics are critical
38. WHERE … TIME RANGE
SELECT * FROM trips WHERE pickup_datetime IN '2018';
SELECT * FROM trips WHERE pickup_datetime IN '2018-06';
SELECT * FROM trips WHERE pickup_datetime IN '2018-06-21T23:59';
SELECT * FROM trips WHERE pickup_datetime IN '2018;2M' LIMIT -10;
SELECT * FROM trips WHERE pickup_datetime IN '2018;10s' LIMIT -10;
SELECT * FROM trips WHERE pickup_datetime IN '2018;-3d' LIMIT -10;
SELECT * FROM trips WHERE pickup_datetime IN '2018-06-21T23:59:58;4s;1d;7';
SELECT * FROM trips WHERE pickup_datetime IN '2018-06-21T23:59:58;4s;-1d;7';
Try it live on https://demo.questdb.io
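A rough guide to the shorthand above, as I read QuestDB's interval notation (worth checking against the docs): '2018;2M' is the whole of 2018 extended by two months, '2018;-3d' extends the interval backwards by three days, the four-part form '2018-06-21T23:59:58;4s;1d;7' is a 4-second window starting at that instant and repeated daily seven times, and a negative LIMIT returns the last rows rather than the first.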
39. SAMPLE BY
Aggregates data into homogeneous time chunks. Note the units: 15m is fifteen minutes, 1M is one month. ALIGN TO CALENDAR snaps the buckets to calendar boundaries rather than to the first observed timestamp.
SELECT
timestamp,
sum(price * amount) / sum(amount) AS vwap_price,
sum(amount) AS volume
FROM trades
WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now())
SAMPLE BY 15m ALIGN TO CALENDAR;

SELECT timestamp, min(tempF), max(tempF), avg(tempF)
FROM weather
SAMPLE BY 1M;
Try it live on https://demo.questdb.io
40. How do you ask your database which data is not stored?
41. I am sending data every second or so. Tell me which devices went more than 1.5 seconds without sending any data.
42. SAMPLE BY … FILL
Can fill missing time chunks using different strategies: NULL, a constant, LINEAR interpolation, or PREV (the previous value).
SELECT
timestamp,
sum(price * amount) / sum(amount) AS vwap_price,
sum(amount) AS volume
FROM trades
WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now())
SAMPLE BY 1s FILL(NULL) ALIGN TO CALENDAR;
Try it live on https://demo.questdb.io
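Putting slides 41 and 42 together, a minimal sketch of gap detection for a single device (the sensor_data table, device_id column, and 'sensor-42' value are hypothetical stand-ins, not the demo schema):

-- Bucket the last hour into 1-second chunks; FILL(0) materialises empty
-- seconds, so any bucket with readings = 0 is a gap of at least a second,
-- and gaps longer than 1.5 seconds show up as runs of consecutive zero buckets.
SELECT timestamp, count() AS readings
FROM sensor_data
WHERE device_id = 'sensor-42' AND timestamp > dateadd('h', -1, now())
SAMPLE BY 1s FILL(0);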
43. LATEST ON … PARTITION BY …
Retrieves the latest entry by timestamp for a given key or combination of keys, for scenarios where multiple time series are stored in the same table.
SELECT * FROM trades
WHERE symbol IN ('BTC-USD', 'ETH-USD')
LATEST ON timestamp PARTITION BY symbol;

44. The same, partitioned by a combination of keys:
SELECT * FROM trades
WHERE symbol IN ('BTC-USD', 'ETH-USD')
LATEST ON timestamp PARTITION BY symbol, side;
Try it live on https://demo.questdb.io
45. What if I have two tables where data is (obviously) not sent at the same exact timestamps, and I want to join by the closest matching timestamp?
46. ASOF JOIN (LT JOIN and SPLICE JOIN variations)
ASOF JOIN joins two different time series. For each row in the first time series, it takes from the second time series the row whose timestamp meets both of the following criteria:
● The timestamp is the closest to the first timestamp.
● The timestamp is prior to or equal to the first timestamp.
WITH trips2018 AS (
SELECT * FROM trips WHERE pickup_datetime IN '2018'
)
SELECT pickup_datetime, fare_amount, tempF, windDir
FROM trips2018
ASOF JOIN weather;
Try it live on https://demo.questdb.io
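As I understand the two variations (worth verifying against the QuestDB docs): LT JOIN is the strict flavour, matching only timestamps strictly earlier than the current row, and SPLICE JOIN keeps unmatched rows from both tables. A minimal LT JOIN sketch on the same demo tables:

-- Each trip paired with the latest weather reading strictly before pickup.
SELECT pickup_datetime, fare_amount, tempF
FROM trips
LT JOIN weather;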
52. Some things we are trying out next for performance
● Compression, and exploring data formats like Arrow / Parquet
● Our own ingestion protocol
● Embedding Julia in the database for custom code/UDFs
● Moving some parts to Rust
● Second-level partitioning
● Improved vectorization of some operations (GROUP BY on multiple columns or on expressions)
● Adding specific join optimizations (index nested-loop joins, for example)