Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features introduced in Spark 2.3. This talk dives into the design and implementation of Data Source API V2 and compares it with Data Source API V1. We also demonstrate how to implement a file-based data source using Data Source API V2 to show its generality and flexibility.
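For context on how such a source is consumed once implemented, here is a minimal, hypothetical sketch of the user-facing read path in PySpark; the format name `com.example.myformat` and its options are placeholders for illustration, not an API from the talk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dsv2-demo").getOrCreate()

# Reading through a Data Source V2 implementation looks like any other source:
# Spark resolves the format name to the registered connector and, where the
# connector supports it, pushes column pruning and filters down to it.
# "com.example.myformat" is a hypothetical connector name.
df = (spark.read
      .format("com.example.myformat")
      .option("path", "/data/events")   # options are passed through to the connector
      .load())

# If the source implements pruning and pushdown, only the selected columns and
# matching rows need to be produced.
df.select("user_id", "ts").filter("ts > '2018-01-01'").show()
```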
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P... (Databricks)
Parquet is a very popular column-based format. Spark can automatically skip useless data by pushing filters down against Parquet file statistics, such as min/max values. In addition, Spark users can enable the Parquet vectorized reader to read Parquet files in batches. These features improve Spark performance greatly and save both CPU and IO. Parquet is the default data format of the data warehouse at Bytedance. In practice, we found that Parquet pushdown filters work poorly and read far too much unnecessary data, because the statistics have no discrimination across Parquet row groups (column data is out of order when ETL jobs write the Parquet files).
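As a rough, hedged sketch of the open-source Spark mechanisms this abstract refers to (not the ByteDance-internal changes), the pushdown and vectorized-reader settings, plus the "sort before write" remedy for undiscriminating row-group statistics, look roughly like this; paths and column names are invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-pushdown").getOrCreate()

# Both settings default to true in recent Spark versions; set explicitly here
# for illustration.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")           # min/max row-group skipping
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")   # batch (vectorized) reads

events = spark.read.parquet("/warehouse/events_raw")

# When the filter column is unsorted, min/max ranges overlap across row groups
# and pushdown skips almost nothing. Sorting within partitions before writing
# gives the statistics discriminating power.
(events
 .sortWithinPartitions("event_time")
 .write
 .mode("overwrite")
 .parquet("/warehouse/events_sorted"))

# Queries filtering on event_time can now skip whole row groups.
spark.read.parquet("/warehouse/events_sorted") \
     .filter("event_time >= '2021-06-01'") \
     .count()
```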
The Parquet Format and Performance Optimization Opportunities (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
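To make a few of the levers above concrete, here is a small PySpark sketch (paths, columns, and the codec choice are illustrative assumptions) showing page compression, directory partitioning for pruning, and file-count control for the "many small files" problem:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-layout").getOrCreate()

df = spark.read.json("/raw/clickstream")   # hypothetical input

(df
 .repartition("country")           # fewer, larger output files per partition value
 .write
 .mode("overwrite")
 .option("compression", "snappy")  # page/column-chunk compression codec
 .partitionBy("country")           # directory-level partitioning enables partition pruning
 .parquet("/lake/clickstream"))

# Partition pruning plus min/max predicate pushdown: only the matching country
# directory and the row groups whose statistics overlap the predicate are read.
spark.read.parquet("/lake/clickstream") \
     .filter("country = 'NL' AND event_ts >= '2019-01-01'") \
     .explain()
```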
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx (InfluxData)
Query Processing in InfluxDB IOx
InfluxDB IOx Query Processing: In this talk we provide an overview of query execution in IOx, describing how data becomes queryable once it is ingested, both via SQL and via Flux and InfluxQL (through the storage gRPC APIs).
Change Data Feed is a new feature of Delta Lake on Databricks that has been available as a public preview since DBR 8.2. This feature enables a new class of ETL workloads, such as incremental table/view maintenance and change auditing, that were not possible before. In short, users can now query row-level changes across different versions of a Delta table.
In this talk we will dive into how Change Data Feed works under the hood, show how to use it with existing ETL jobs to make them more efficient, and go over some new workloads it can enable.
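As a quick illustration of the reader-side API (a sketch based on the documented Delta Lake Change Data Feed options; the table name and version numbers are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdf-demo").getOrCreate()

# The table must have the feed enabled, e.g.
#   ALTER TABLE orders SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

# Read row-level changes between two table versions. Each row carries the
# _change_type (insert / update_preimage / update_postimage / delete),
# _commit_version, and _commit_timestamp metadata columns.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 5)   # placeholder versions
           .option("endingVersion", 9)
           .table("orders"))

changes.filter("_change_type = 'delete'").show()
```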
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Together with the Hive Metastore, these table formats are trying to solve problems that have long stood in traditional data lakes, with declared features like ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
Parquet performance tuning: the missing guide (Ryan Blue)
Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet’s features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he’s learned, creating the missing guide you need.
Topics include:
* The tools and techniques Netflix uses to analyze Parquet tables
* How to spot common problems
* Recommendations for Parquet configuration settings to get the best performance out of your processing platform
* The impact of this work in speeding up applications like Netflix’s telemetry service and A/B testing platform
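The concrete recommendations are in the talk itself; as a hedged illustration of the kind of writer knobs involved, open-source Spark forwards Parquet writer properties supplied as write options to the underlying writer. The values below are placeholders rather than the talk's recommendations, and the option pass-through is an assumption about typical Spark versions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-tuning").getOrCreate()
df = spark.read.parquet("/lake/events")   # hypothetical input

# Hadoop/Parquet writer properties passed as write options; values are
# illustrative placeholders only.
(df.write
   .mode("overwrite")
   .option("parquet.block.size", 128 * 1024 * 1024)   # row-group size in bytes
   .option("parquet.page.size", 1024 * 1024)          # page size in bytes
   .option("parquet.enable.dictionary", "true")       # keep dictionary encoding on
   .parquet("/lake/events_tuned"))
```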
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro (Databricks)
Zstandard is a fast compression algorithm that you can use in Apache Spark in various ways. In this talk, I briefly summarize the evolution of Apache Spark in this area, four main use cases, the benefits, and the next steps:
1) Zstandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It is beneficial not only when you use `emptyDir` with the `memory` medium, but it also maximizes the OS cache benefit when you use shared SSDs or container local storage. In Spark 3.2, SPARK-34390 takes advantage of the Zstandard buffer pool feature, and its performance gain is impressive, too.
2) Event log compression is another area to save storage cost on cloud storage like S3 and to improve usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 already supports Zstandard, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There is more community work to utilize Zstandard to improve Spark. For example, the Apache Avro community also supports Zstandard, and SPARK-34479 aims to support Zstandard in Spark's Avro file format in Spark 3.2.0.
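A hedged sketch of what enabling Zstandard in the places listed above looks like with plain Spark configuration; the snippet assumes Spark 3.2+ so that all of the settings and codecs mentioned in the abstract are available.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("zstd-everywhere")
         # 1) shuffle / local disk IO: use zstd for Spark's internal compression
         .config("spark.io.compression.codec", "zstd")
         # 2) event logs: zstd is the default codec from SPARK-34503 onward,
         #    shown explicitly here
         .config("spark.eventLog.compress", "true")
         .config("spark.eventLog.compression.codec", "zstd")
         .getOrCreate())

# 3) data files: write Parquet (or ORC) with zstd compression
df = spark.range(1_000_000).selectExpr("id", "id % 7 AS bucket")
df.write.mode("overwrite").option("compression", "zstd").parquet("/tmp/zstd_demo")

# 4) MapStatus serialization already uses zstd internally since Spark 3.0;
#    no user configuration is needed for that case.
```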
Practical learnings from running thousands of Flink jobs (Flink Forward)
Flink Forward San Francisco 2022.
Task Managers constantly running out of memory? Flink job keeps restarting from cryptic Akka exceptions? Flink job running but doesn’t seem to be processing any records? We share practical learnings from running thousands of Flink jobs for different use cases and take a look at common challenges those jobs have run into, such as out-of-memory errors, timeouts and job instability. We will cover memory tuning and S3 and Akka configurations to address common pitfalls, as well as the approaches we take to automate health monitoring and management of Flink jobs at scale.
by
Hong Teoh & Usamah Jassat
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 (StreamNative)
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL, with the ability to join Pulsar data with other data sources. This plugin will soon be renamed to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename, and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
Dynamic Partition Pruning in Apache Spark (Databricks)
In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPCDS queries.
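A minimal star-schema sketch of the scenario described above (table and column names are invented for illustration). With dynamic partition pruning enabled, the filter on the dimension table is turned at runtime into a partition filter on the fact table:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dpp-demo").getOrCreate()

# Enabled by default in Spark 3.x; set explicitly here for clarity.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

sales = spark.read.parquet("/lake/sales")        # fact table, partitioned by date_id
dim_date = spark.read.parquet("/lake/dim_date")  # small dimension table

# The only selective filter sits on the dimension table. At runtime the result
# of the (broadcast) dimension scan is reused to filter sales.date_id, so only
# the matching fact-table partitions are scanned.
q4_2019 = dim_date.where((F.col("year") == 2019) & (F.col("quarter") == 4))

revenue = (sales.join(q4_2019, "date_id")
                .groupBy("store_id")
                .agg(F.sum("amount").alias("revenue")))

revenue.explain()   # the plan shows a dynamicpruning subquery on the sales scan
```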
How to build a streaming Lakehouse with Flink, Kafka, and Hudi (Flink Forward)
Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep on how Flink can leverage the newest features of Hudi like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces the query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
by
Ethan Guo & Kyle Weller
Apache Flink is a popular framework for real-time stream computing. Many stream computing algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program with historical data to seed the state before switching to real-time data.
This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
Speaker
Gregory Fee, Principal Engineer, Lyft
Delta from a Data Engineer's Perspective (Databricks)
Take a walk through the daily struggles of a data engineer in this presentation as we cover what is truly needed to create robust end to end Big Data solutions.
Tuning Apache Kafka Connectors for Flink.pptx (Flink Forward)
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows, and in this session we’ll explore a few knobs that can save the day in atypical scenarios. First, we’ll take a detailed look at the parameters available when reading from Kafka. We’ll inspect the parameters that help us quickly spot an application lock or crash, the ones that can significantly improve performance, and the ones to touch with gloves since they could cause more harm than benefit. Moreover, we’ll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we’ll discuss the Kafka sink. After browsing the available options, we’ll dive deep into how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by
Olena Babenko
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... (Spark Summit)
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query... (InfluxData)
InfluxDB IOx Tech Talks
This talk presents a design of a distributed database system that splits data to gain query performance. The talk will define four main properties of data splitting: sharding, partitioning, sorting, and encoding; and then delve into examples to show their impacts on query performance.
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns (Databricks)
In the data warehouse area, it is common to use one or more columns of complex type, such as map, and put many subfields into them. This can hurt query performance dramatically because: 1) It wastes IO. The whole map column, which may contain tens of subfields, must be read, and Spark then traverses the whole map to get the value of the target key. 2) Vectorized reads cannot be exploited when a nested-type column is read. 3) Filter pushdown cannot be utilized when nested columns are read. Over the last year, we have added a series of optimizations in Apache Spark to solve the above problems for Parquet.
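The materialized-column feature described here lives in the authors' own Spark changes, but the underlying idea can be approximated in stock PySpark by promoting a frequently read map key to an ordinary top-level column at write time (column and path names below are invented):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("materialized-column").getOrCreate()

events = spark.read.parquet("/lake/events")   # has a wide map<string,string> column "props"

# Materialize the hot subfield as a plain column. Queries that filter on it can
# then use vectorized reads and min/max filter pushdown, and no longer need to
# read and traverse the whole map.
(events
 .withColumn("device_type", F.col("props")["device_type"])
 .write
 .mode("overwrite")
 .parquet("/lake/events_materialized"))

fast = spark.read.parquet("/lake/events_materialized") \
            .where(F.col("device_type") == "ios")
fast.count()
```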
Cost-based Query Optimization in Apache Phoenix using Apache Calcite (Julian Hyde)
This talk, given by Maryann Xue and Julian Hyde at Hadoop Summit, San Jose on June 30th, 2016, describes how we re-engineered Apache Phoenix with a cost-based optimizer based on Apache Calcite.
Apache Phoenix has rapidly become a workhorse in many organizations, providing a convenient standard SQL interface to HBase suitable for a wide variety of workloads from transactions to ETL and analytics. But Phoenix's initial query optimizer was based on static optimization procedures and thus could not choose between several potential plans or indices based on cost metrics.
We describe how we rebuilt Phoenix's parser and query optimizer using the Calcite framework, improving Phoenix's performance and SQL compliance. The new architecture uses relational algebra as an intermediate language, and this enables you to switch in other engines, especially those also based on Calcite. As an example of this, we demonstrate querying a Phoenix database via Apache Drill.
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi... (Databricks)
“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new API’s available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael
Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer"
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
Follow Michael on -
Twitter: https://twitter.com/michaelarmbrust
LinkedIn: https://www.linkedin.com/in/michaelarmbrust
Covered:
1. Databases and Schemas
2. Tablespaces
3. Data Type
4. Exploring Databases
5. Locating the database server's message log
6. Locating the database's system identifier
7. Listing databases on this database server
8. How much disk space does a table use?
9. Which are my biggest tables?
10. How many rows are there in a table?
11. Quickly estimating the number of rows in a table
12. Understanding object dependencies
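Several of the items above (8, 9, and 11 in particular) reduce to short catalog queries. Below is a minimal sketch using psycopg2 against standard PostgreSQL catalog functions; the connection string and table names are placeholders.

```python
import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect("dbname=appdb user=postgres host=localhost")
cur = conn.cursor()

# 8. How much disk space does a table use? (includes indexes and TOAST)
cur.execute("SELECT pg_size_pretty(pg_total_relation_size('public.orders'))")
print("orders size:", cur.fetchone()[0])

# 9. Which are my biggest tables?
cur.execute("""
    SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
    FROM pg_class
    WHERE relkind = 'r'
    ORDER BY pg_total_relation_size(oid) DESC
    LIMIT 10
""")
for name, size in cur.fetchall():
    print(name, size)

# 11. Quickly estimating the number of rows (uses planner statistics, no scan)
cur.execute("SELECT reltuples::bigint FROM pg_class WHERE relname = 'orders'")
print("estimated rows:", cur.fetchone()[0])

cur.close()
conn.close()
```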
Writing Continuous Applications with Structured Streaming Python APIs in Apac... (Databricks)
Description:
We are amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that is continuous and reacts and interacts with data in real time. We call this a continuous application, which we will discuss.
Abstract:
We are amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that is continuous and reacts and interacts with data in real time. We call this a continuous application.
In this talk we will explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark 2.x enables writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through a short demo and code examples, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what’s a continuous application, appreciate the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark 2.x is a step forward in developing new kinds of streaming applications.
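As a taste of the programming model discussed here, a minimal Structured Streaming job in PySpark looks like this; the source path and schema are invented, and the talk's own demo may differ.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("continuous-app").getOrCreate()

# Streaming sources need an explicit schema; this one is a made-up example.
schema = StructType([
    StructField("user", StringType()),
    StructField("action", StringType()),
    StructField("ts", TimestampType()),
])

# Treat a directory of JSON files as an unbounded table.
events = spark.readStream.schema(schema).json("/stream/events")

# The same DataFrame operations used in batch define the streaming query:
# a running count of actions per 10-minute event-time window.
counts = (events
          .withWatermark("ts", "15 minutes")
          .groupBy(F.window("ts", "10 minutes"), "action")
          .count())

# Incrementally emit updated aggregates; the console sink is for demos only.
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())

query.awaitTermination()
```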
Structuring Spark: DataFrames, Datasets, and Streaming (Databricks)
As Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk I give an overview of some of the exciting new APIs available in Spark 2.0, namely Datasets and Streaming DataFrames/Datasets. Datasets provide an evolution of the RDD API by allowing users to express computation as type-safe lambda functions on domain objects, while still leveraging the powerful optimizations supplied by the Catalyst optimizer and Tungsten execution engine. I will describe the high-level concepts as well as dive into the details of the internal code generation that enables us to provide good performance automatically. Streaming DataFrames/Datasets let developers seamlessly turn their existing structured pipelines into real-time incremental processing engines. I will demonstrate this new API’s capabilities and discuss future directions including easy sessionization and event-time-based windowing.
Large scale, interactive ad-hoc queries over different datastores with Apache... (jaxLondonConference)
Presented at JAX London 2013
Apache Drill is a distributed system for interactive ad-hoc query and analysis of large-scale datasets. It is the Open Source version of Google’s Dremel technology. Apache Drill is designed to scale to thousands of servers and able to process Petabytes of data in seconds, enabling SQL-on-Hadoop and supporting a variety of data sources.
Writing Continuous Applications with Structured Streaming PySpark API (Databricks)
"We're amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application.
In this tutorial we'll explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark™ enable writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through presentation, code examples, and notebooks, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what’s a continuous application, appreciate the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark is a step forward in developing new kinds of streaming applications.
This tutorial will be both instructor-led and hands-on interactive session. Instructions in how to get tutorial materials will be covered in class.
WHAT YOU’LL LEARN:
– Understand the concepts and motivations behind Structured Streaming
– How to use DataFrame APIs
– How to use Spark SQL and create tables on streaming data
– How to write a simple end-to-end continuous application
PREREQUISITES
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition"
Speaker: Jules Damji
Writing Continuous Applications with Structured Streaming in PySpark (Databricks)
We are in the midst of a Big Data Zeitgeist in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that reacts and interacts with data in real-time. We call this a continuous application. In this talk we will explore the concepts and motivations behind continuous applications and how Structured Streaming Python APIs in Apache Spark 2.x enables writing them. We also will examine the programming model behind Structured Streaming and the APIs that support them. Through a short demo and code examples, Jules will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames, and Datasets APIs.
InfluxData is excited to announce InfluxDB Clustered, the self-managed version of InfluxDB 3.0 with unparalleled flexibility, speed, performance, and scale. The evolution of InfluxDB Enterprise, InfluxDB Clustered is delivered as a collection of Kubernetes-based containers and services, which enables you to run and operate InfluxDB 3.0 where you need it, whether that's on-premises or in a private cloud environment. With this new enterprise offering, we’re excited to provide our customers with real-time queries, low-cost object storage, unlimited cardinality, and SQL language support – all with improved data access, support, and security! The newest version of InfluxDB was built on Apache Arrow, and through the open source ecosystem and integrations, extends the value of your time-stamped data.
Join this webinar to learn more about InfluxDB Clustered, and how to manage your large mission-critical workloads in the highly available database service offering!
In this webinar, Balaji Palani and Gunnar Aasen will dive into:
Key features of the new InfluxDB Clustered solution
Use cases for using the newest version of the purpose-built time series database
Live demo
During this 1-hour technical webinar, you’ll also get a chance to ask your questions live.
Best Practices for Leveraging the Apache Arrow Ecosystem (InfluxData)
Apache Arrow is an open source project intended to provide a standardized columnar memory format for flat and hierarchical data. It enables more efficient analytics workloads for modern CPU and GPU hardware, which makes working with large data sets easier and cheaper.
InfluxData and Dremio are both members of the Apache Software Foundation (ASF). Dremio is a data lakehouse management service known for its scalability and capacity for direct querying across diverse data sources. InfluxDB is the purpose-built time series database, and InfluxDB 3.0 has a new columnar storage engine and uses the Arrow format for representing data and moving data to and from Parquet. Discover how InfluxDB and Dremio have advanced their solutions by relying on the Apache Arrow framework.
Join this live panel as Alex Merced and Anais Dotis-Georgiou dive into:
Advantages to utilizing the Apache Arrow ecosystem
Tips and tricks for implementing the columnar data structure
How developers can best utilize the ASF to innovate and contribute to new industry standards
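For readers new to the format discussed above, here is a minimal pyarrow sketch of building a columnar in-memory table and round-tripping it through Parquet; column names and values are invented.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a columnar, typed, in-memory table (Arrow's standardized format).
table = pa.table({
    "sensor": ["a", "a", "b"],
    "ts": pa.array([1_690_000_000, 1_690_000_060, 1_690_000_120], type=pa.int64()),
    "temp": [21.5, 21.7, 19.9],
})

# Persist to Parquet and read it back with column projection; Arrow is the
# common in-memory representation that engines such as InfluxDB 3.0 and
# Dremio exchange.
pq.write_table(table, "/tmp/sensors.parquet")
round_trip = pq.read_table("/tmp/sensors.parquet", columns=["sensor", "temp"])

print(round_trip.schema)
print(round_trip.to_pydict())
```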
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu... (InfluxData)
Bevi are the creators of smart water dispensers which empower people to choose their desired beverage — flat or sparkling, their desired flavor and temperature. Since 2014, Bevi users have saved more than 350 million bottles and cans. Their "smart" water coolers have prevented the extraction of 1.4 trillion oz of oil from Earth and have saved 21.7 billion grams of CO2 from the atmosphere.
Discover how Bevi uses a time series database to enable better predictive maintenance and alerting of their entire ecosystem — including the hardware and software. They are using InfluxDB to collect sensor data in real time, remotely, from their internet-connected machines about their status and activity — e.g., flavor and CO2 levels, water temp, filter status, etc. They are using these metrics to improve their customer experience and continuously improve their sustainability practices. Gain tips and tricks on how to best utilize InfluxDB's schema-less design.
Join this webinar as Spencer Gagnon dives into:
Bevi's approach to reducing organizations' carbon footprint — they are saving 50K+ bottles and cans annually
Their entire system architecture — including InfluxDB Cloud, Grafana, Kafka, and DigitalOcean
The importance of using time-stamped data to extend the life of their machines
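As a hedged sketch of what writing machine telemetry like this into InfluxDB looks like with the official Python client; the measurement, tags, fields, and credentials below are invented for illustration.

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details.
client = InfluxDBClient(url="https://us-east-1-1.aws.cloud2.influxdata.com",
                        token="MY_TOKEN", org="bevi-demo")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One reading per dispenser: tags identify the machine, fields hold the values.
point = (Point("dispenser_status")
         .tag("machine_id", "unit-0042")
         .field("co2_level", 0.83)
         .field("water_temp_c", 5.2)
         .field("filter_remaining_pct", 61))

write_api.write(bucket="fleet-telemetry", record=point)
client.close()
```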
Power Your Predictive Analytics with InfluxDB (InfluxData)
If you're using InfluxDB to store and manage your time series data, you're already off to a great start. But why stop there? In our upcoming webinar, we'll show you how to take your data analysis to the next level by building predictive analytics using a variety of tools and techniques.
We will demonstrate how to use Quix to create custom dashboards and visualizations that allow you to monitor your data in real-time. We'll also introduce you to Hugging Face, a powerful tool for building models that can predict future trends and identify anomalies. With these tools at your disposal, you'll be able to extract valuable insights from your data and make more informed decisions about the future. Don't miss out on this opportunity to improve your data analysis skills and take your business to the next level!
What you will learn:
Use InfluxDB to store and manage time series data
Utilize Quix and Hugging Face to build models, visualize trends, and identify anomalies
Extract valuable insights from your data
Improve your data analysis skills to make informed decisions
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base (InfluxData)
Are you considering replacing your legacy data historian and moving your OT data to the cloud? Join this technical webinar to learn how to adopt InfluxDB and IO Base - a digital platform used to improve operational efficiencies!
Teréga Solutions are the creators of digital solutions used to improve energy efficiencies and to address decarbonization challenges. Their network includes 5,000+ km of gas pipelines within France; they aim to help France attain carbon neutrality by 2050. With these impressive goals in mind, Teréga has created IO-Base — the digital platform to improve industrial performance, and increase profitability. Creating digital twins for their clients allows them to collect data from all production sites and view it in real time, from anywhere and at any time.
Discover how Teréga uses InfluxDB, Docker, and AWS to monitor its gas and hydrogen pipeline infrastructure. They chose to replace their legacy data historian with InfluxDB — the purpose-built time series database. They are collecting more than 100K different metrics at various frequencies — some every 5 seconds, others only every 1-2 minutes. They have reduced overall IT spend by 50% and collect 2x the amount of data at 20x the frequency! By using various industrial protocols (Modbus, OPC-UA, etc.), Teréga improved output, reduced the TCO, and is now able to create added-value services: forecasting, monitoring, and predictive maintenance.
Join this webinar as Thomas Delquié dives into:
Teréga's approach to modernizing fossil fuel pipelines IT systems while improving yields and safety
Their centralized methodology to collecting sensor, hardware, and network metrics
The importance of time series data and why they chose InfluxDB
Build an Edge-to-Cloud Solution with the MING Stack (InfluxData)
FlowForge enables organizations to reliably deliver Node-RED applications in a continuous, collaborative, and secure manner. Node-RED is the popular, low-code programming solution that makes it easy to connect different services using a visual programming environment. InfluxData is the creator of InfluxDB, the purpose-built time series database run by developers at scale and in any environment in the cloud, on-premises, or at the edge.
Jump-start monitoring your industrial IoT devices and discover how to build an edge-to-cloud solution with the MING stack. The MING stack includes Mosquitto/MQTT, InfluxDB, Node-RED, and Grafana. This solution can be used to improve fleet management, enable predictive maintenance of industrial machines and power generation equipment (e.g., turbines and generators), and increase safety practices (e.g., buildings, construction sites). Join this webinar to learn best practices from industrial IoT SMEs.
In this webinar, Robert Marcer and Jay Clifford dive into:
Best practices for monitoring sensor data collected by everyone — from the edge to the factory
Tips and tricks for using Node-RED and InfluxDB together
Demo — see Node-RED and InfluxDB live
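Below is a minimal sketch of the "M" and "I" parts of the MING stack glued together in Python: subscribe to an MQTT topic with paho-mqtt and forward each reading into InfluxDB. The broker address, topic layout, payload shape, and credentials are all assumptions; in the stack itself, Node-RED usually plays this bridging role.

```python
import json
import paho.mqtt.client as mqtt
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details.
influx = InfluxDBClient(url="http://localhost:8086", token="MY_TOKEN", org="factory")
write_api = influx.write_api(write_options=SYNCHRONOUS)

def on_message(client, userdata, msg):
    # Payloads are assumed to be JSON like {"temp": 22.4, "humidity": 41}.
    reading = json.loads(msg.payload)
    point = Point("machine_sensors").tag("topic", msg.topic)
    for field, value in reading.items():
        point = point.field(field, float(value))
    write_api.write(bucket="edge", record=point)

# paho-mqtt 1.x style callback API assumed here.
mqtt_client = mqtt.Client()
mqtt_client.on_message = on_message
mqtt_client.connect("localhost", 1883)        # Mosquitto broker
mqtt_client.subscribe("factory/+/sensors")    # one subtopic per machine
mqtt_client.loop_forever()
```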
Meet the Founders: An Open Discussion About Rewriting Using Rust (InfluxData)
Rust is a systems programming language designed for high performance, type safety, and concurrency. According to Stack Overflow’s annual survey in 2022, Rust is the most loved language with 87% of developers saying they want to continue using it. The same survey also reported that nearly 20% of developers aren’t currently using Rust, but want to start developing using it.
Ockam’s suite of programming libraries, command line tools, and managed cloud services enables developers to orchestrate end-to-end encryption. InfluxDB is the purpose-built time series database developed to handle time series data for IoT, monitoring, and real-time analytics. Ockam was originally developed using C, and InfluxDB was originally written using Go; both solutions have been completely rewritten in Rust. Discover why two founders decided to rewrite their developer tools using Rust, and gain insight into the strategy beforehand and the entire process.
Join this live panel as Mrinal Wadhwa and Paul Dix dive into:
Their approach to rewriting a project in Rust
How to build and train engineering teams
Tips and tricks learned along the way - pitfalls to look out for!
Join this webinar as there will be a live discussion with Q&A
InfluxData is excited to announce the general availability of InfluxDB Cloud Dedicated! It is a fully managed time series database service running on cloud infrastructure resources that are dedicated to a single tenant. With this new offering, we’re excited to provide our customers with additional security options, and more custom configuration options to best suit customers’ workload requirements. Join this webinar to learn more about InfluxDB Cloud, and the new dedicated database service offering!
In this webinar, Balaji Palani and Gary Fowler will dive into:
Key features of the new InfluxDB Cloud Dedicated solution
Use cases for using the newest version of the purpose-built time series database
Live demo
During this 1-hour technical webinar, you’ll also get a chance to ask your questions live.
Gain Better Observability with OpenTelemetry and InfluxDB (InfluxData)
Many developers and DevOps engineers have become aware of using their observability data to gain greater insights into their infrastructure systems. InfluxDB is the purpose-built time series database used to collect metrics and gain observability into apps, servers, containers, and networks. Developers use InfluxDB to improve the quality and efficiency of their CI/CD pipelines. Start using InfluxDB to aggregate infrastructure and application performance monitoring metrics to enable better anomaly detection, root-cause analysis, and alerting.
This session will demonstrate how to record metrics, logs, and traces with one library — OpenTelemetry — and store them in one open source time series database — InfluxDB. Zoe will demonstrate how easy it is to set up the OpenTelemetry Operator for Kubernetes and to store and analyze your data in InfluxDB.
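A hedged sketch of the application side of such a setup using the OpenTelemetry Python SDK: spans are exported over OTLP to whatever collector forwards them into InfluxDB (for example a Telegraf OpenTelemetry input, as one common arrangement). The endpoint and service name below are placeholders, and the collector wiring shown in the session may differ.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Send spans over OTLP/gRPC to the collector that forwards into InfluxDB
# (endpoint is a placeholder).
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("order.id", "o-1234")
    # ... application work happens here ...
```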
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali... (InfluxData)
American Metal Processing Company ("AMP") is the largest commercial rotary heat treat facility in the US, with customers in the automotive, construction, military, and agriculture industries. They use their atmosphere-protected rotary retort furnaces to provide their clients with three primary hardening services: neutral hardening (quench and temper), carburizing, and carbonitriding.
This furnace style ensures a consistent, uniform heat treatment process vs. traditional batch- or belt-style furnaces; excels at processing high volumes of smaller parts with tight tolerances; and improves the strength and toughness of plain carbon steels. Discover why AMP’s use of Telegraf, InfluxDB, Node-RED, and Grafana allows them to gain 24/7 insights into their plant operations and metallurgical results. Learn how they use time-stamped data to gain accurate metrics about their consumables usage, furnace profiles, and machine status.
Join this webinar as Grant Pinkos dives into:
American Metal Processing's approach to heat treating in a digitized environment through connected systems
Their approach to collecting and measuring sensor data to enable predictive maintenance and improve product quality
Why they need a time series database for managing and analyzing vast amounts of time-stamped data
How Delft University's Engineering Students Make Their EV Formula-Style Race ... (InfluxData)
Delft University is the oldest and largest technical university in the Netherlands with 25,000+ students. Since 1999, they have had a team of students (undergraduate and graduate) designing, building, and racing cars, as part of the Formula Student worldwide competition. The competition has grown to include teams from 1K+ universities in 20+ countries. Students are responsible for all aspects of car manufacturing (research, construction, testing, developing, marketing, management, and fundraising). Delft University's team includes 90 students across disciplines.
Discover how Delft University's team uses Marple and InfluxDB to collect telemetry and sensor metrics while they develop, test, and race their electric cars. They collect sensor data about their EV's control systems using a time series platform. During races, they collect IoT data about their batteries, accelerometer, gyroscope, tires, etc. The engineers are able to share important car stats during races, which helps the drivers tweak their driving decisions — all with the goal of winning. After races, the entire team is able to analyze the data in Marple to understand what to do better next time. By using Marple + InfluxDB, the team is able to collect, share, and analyze high-frequency car data used to make their car faster at competitions.
Join this webinar as Robbin Baauw and Nero Vanbiervliet dive into:
Marple's approach to empowering engineers to organize, analyze, and visualize their data
Delft University's collaborative methodology to building and racing their Formula-style race car
How InfluxDB is crucial to their collaborative engineering and racing process
Introducing InfluxDB’s New Time Series Database Storage Engine (InfluxData)
InfluxData is excited to announce the general availability of InfluxDB Cloud's new storage engine! It is a cloud-native, real-time, columnar database optimized for time series data. InfluxDB's rebuilt core was coded in Rust and sits on top of Apache Arrow and DataFusion. InfluxData's team picked Apache Parquet as the persistent format. In this webinar, Paul Dix and Balaji Palani will demonstrate key product features including the removal of cardinality limits!
They will dive into:
The next phase of the InfluxDB platform
How using Apache Arrow's ecosystem has improved InfluxDB's performance and scalability
Key features of InfluxDB Cloud's new core — including SQL native support
Start Automating InfluxDB Deployments at the Edge with balena (InfluxData)
balena.io helps companies develop, deploy, update, and manage IoT devices. By using Linux containers and other cloud technologies, balena enables teams to quickly and easily build fleets of connected devices. Developers are able to use containers with the language of choice and pull IoT sensor data from 70+ different single board computers into balenaCloud. Discover how to use balena.io to automate your InfluxDB deployments at the edge!
During this one-hour session, experts from balena and InfluxData will demonstrate how to build and deploy your own air quality IoT solution. You will learn:
The fundamentals of IoT sensor deployment and management using balena.
How to use a time series platform to collect and visualize metrics from edge devices.
Tips and tricks to using balenaCloud to automate InfluxDB deployments and Telegraf configurations.
How to use InfluxDB's Edge Data Replication feature to collect sensor data and push it to InfluxDB Cloud for analysis.
No coding experience required, just a curiosity to start your own IoT adventure.
Understanding InfluxDB’s New Storage Engine (InfluxData)
Learn more about InfluxDB’s new storage engine! The team developed a cloud-native, real-time, columnar database optimized for time series data. We built it all in Rust and it sits on top of Apache Arrow and DataFusion. We chose Apache Parquet as the persistent format, which is an open source columnar data file format. This new storage engine provides InfluxDB Cloud users with new functionality, including the removal of cardinality limits, so developers can bring in massive amounts of time series data at scale.
In this webinar, Anais Dotis-Georgiou will dive into:
Requirements for rebuilding InfluxDB’s core
Key product features and timeline
How Apache Arrow’s ecosystem is used to meet those requirements
Stick around for a demo and live Q&A
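The new engine is not something you drive directly from application code, but to give a feel for two of the building blocks named above (DataFusion and Parquet), here is a small standalone sketch using the DataFusion Python bindings. This is not InfluxDB's API; the file path and column names are invented.

```python
from datafusion import SessionContext

# InfluxDB's new core persists data as Parquet and plans/executes queries with
# DataFusion. This standalone sketch exercises those building blocks directly.
ctx = SessionContext()
ctx.register_parquet("cpu", "/tmp/cpu_metrics.parquet")   # hypothetical file

df = ctx.sql("""
    SELECT host, avg(usage_user) AS avg_user
    FROM cpu
    GROUP BY host
    ORDER BY avg_user DESC
""")
print(df.collect())   # results come back as Arrow record batches
```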
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB (InfluxData)
RudderStack — the creators of the leading open source Customer Data Platform (CDP) — needed a scalable way to collect and store metrics related to customer events and processing times (down to the nanosecond). They provide their clients with data pipelines that simplify data collection from applications, websites, and SaaS platforms. RudderStack's solution enables clients to stream customer data in real time — they quickly deploy flexible data pipelines that send the data to the customer's entire stack without engineering headaches. Customers are able to stream data from any tool using their 16+ SDKs, and they are able to transform the data in transit using JavaScript or Python. How does RudderStack use a time series platform to provide their customers with real-time analytics?
Join this webinar as Ryan McCrary dives into:
RudderStack's approach to streamlining data pipelines with their 180+ out-of-the-box integrations
Their data architecture including Kapacitor for alerting and Grafana for customized dashboards
Why InfluxDB was crucial for fast data collection and for providing a single source of truth for their customers
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...InfluxData
Customers using ThingWorx and the Manufacturing Solutions often need to store property data for longer than the Solutions' defaults allow. These customers are recommended to use InfluxDB, and this presentation will cover the key considerations for moving to InfluxDB versus the standard ThingWorx value streams. Join this session as Ward highlights ThingWorx's solution and its straightforward implementation process.
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022InfluxData
Two new features are coming to Flux that add flexibility and functionality to your data workflow: polymorphic labels and dynamic types. This session walks through these new features and shows how they work.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, machine learning over just any symbolic structure is not sufficient to really harvest the gains of NeSy. These gains will only be realized when the symbolic structures have an actual semantics. I give an operational definition of semantics as "predictable inference".
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chains and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a PASSION for technology and for making things work, along with a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring of JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes real work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at every stage.
7. Today: IOx Team at InfluxData
Past life 1: Query Optimizer @ Vertica, also on Oracle DB server
Past life 2: Chief Architect + VP Engineering roles at some ML startups
8. Talk Outline
What is a Query Engine
Introduction to DataFusion / Apache Arrow
DataFusion Architectural Overview
10. Motivation
Users who want to access data without writing a program; UIs (visual and textual); the data is stored somewhere.
11. Motivation
A Query Engine sits between the users/UIs and the stored data; SQL is the common interface.
12. DataFusion Use Cases
1. Data engineering / ETL: construct fast and efficient data pipelines (~ Spark)
2. Data science: prepare data for ML / other tasks (~ Pandas)
3. Database systems: e.g. IOx, Ballista, Cloudfuse Buzz, various internal systems
13. Why DataFusion?
High Performance: memory efficiency (no GC) and speed, leveraging Rust/Arrow
Easy to Connect: Interoperability with other tools via Arrow, Parquet and Flight
Easy to Embed: Can extend data sources, functions, operators
First Class Rust: High quality Query / SQL Engine entirely in Rust
High Quality: Extensive tests and integration tests with Arrow ecosystems
My goal: DataFusion to be *the* choice for any SQL support in Rust
14. DBMS vs Query Engine
Database Management Systems (DBMS) are full featured systems
● Storage system (stores actual data)
● Catalog (store metadata about what is in the storage system)
● Query Engine (query, and retrieve requested data)
● Access Control and Authorization (users, groups, permissions)
● Resource Management (divide resources between uses)
● Administration utilities (monitor resource usage, set policies, etc)
● Clients for Network connectivity (e.g. implement JDBC, ODBC, etc)
● Multi-node coordination and management
DataFusion provides only the Query Engine piece of this list; it is not a full DBMS.
15. What is DataFusion?
“DataFusion is an in-memory query engine that uses Apache Arrow as the memory model” - crates.io
● In the Apache Arrow github repo
● Apache licensed
● Not part of the Arrow spec, uses Arrow
● Initially implemented and donated by Andy Grove; design based on How Query Engines Work
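Below is a minimal sketch of embedding DataFusion in a Rust program: it registers the parquet file used in the CLI example later in the deck and runs a SQL query. The names (SessionContext, ParquetReadOptions) follow recent DataFusion releases; the API at the time of this talk used ExecutionContext instead, so treat the exact calls as assumptions rather than the talk's own code.

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Create a query context
    let ctx = SessionContext::new();

    // Register a Parquet file as a table (same example file as the CLI slide)
    ctx.register_parquet(
        "http_api_requests_total",
        "http_api_requests_total.parquet",
        ParquetReadOptions::default(),
    )
    .await?;

    // Plan and run a SQL query; results come back as Arrow RecordBatches
    let df = ctx
        .sql("SELECT status, COUNT(1) FROM http_api_requests_total GROUP BY status")
        .await?;
    df.show().await?;
    Ok(())
}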
17. DataFusion Extensibility 🧰
● User Defined Functions
● User Defined Aggregates
● User Defined Optimizer passes
● User Defined LogicalPlan nodes
● User Defined ExecutionPlan nodes
● User Defined TableProvider for tables
* Built in data persistence using parquet and CSV files
18. What is a Query Engine?
1. Frontend
a. Query Language + Parser
2. Intermediate Query Representation
a. Expression / Type system
b. Query Plan w/ Relational Operators (Data Flow Graph)
c. Rewrites / Optimizations on that graph
3. Concrete Execution Operators
a. Allocate resources (CPU, Memory, etc)
b. Push bytes around, vectorized calculations, etc
22. DataFusion CLI
> CREATE EXTERNAL TABLE http_api_requests_total
STORED AS PARQUET
LOCATION 'http_api_requests_total.parquet';

> SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;

+--------+-----------------+
| status | COUNT(UInt8(1)) |
+--------+-----------------+
| 4XX    | 73621           |
| 2XX    | 338304          |
+--------+-----------------+
23. EXPLAIN Plan
Gets a textual representation of the LogicalPlan

> EXPLAIN SELECT status, COUNT(1) FROM http_api_requests_total
  WHERE path = '/api/v2/write' GROUP BY status;

+--------------+----------------------------------------------------------+
| plan_type    | plan                                                     |
+--------------+----------------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] |
|              | Selection: #path Eq Utf8("/api/v2/write")                |
|              | TableScan: http_api_requests_total projection=None       |
+--------------+----------------------------------------------------------+
24. Plans as DataFlow graphs
The plan is a tree of operators: TableScan: http_api_requests_total projection=None feeds Filter: #path Eq Utf8("/api/v2/write"), which feeds Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]].
Step 1: the Parquet file is read. Step 2: the predicate is applied. Step 3: the data is aggregated. Data flows up from the leaves to the root of the tree.
25. More than initially meets the eye
Use EXPLAIN VERBOSE to see optimizations applied
> EXPLAIN VERBOSE SELECT status, COUNT(1) FROM http_api_requests_total
WHERE path = '/api/v2/write' GROUP BY status;
+----------------------+----------------------------------------------------------------+
| plan_type            | plan                                                           |
+----------------------+----------------------------------------------------------------+
| logical_plan         | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]       |
|                      | Selection: #path Eq Utf8("/api/v2/write")                      |
|                      | TableScan: http_api_requests_total projection=None             |
| projection_push_down | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]       |
|                      | Selection: #path Eq Utf8("/api/v2/write")                      |
|                      | TableScan: http_api_requests_total projection=Some([6, 8])     |
| type_coercion        | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]       |
|                      | Selection: #path Eq Utf8("/api/v2/write")                      |
|                      | TableScan: http_api_requests_total projection=Some([6, 8])     |
...
+----------------------+----------------------------------------------------------------+
The optimizer "pushed down" the projection, so only the status and path columns were read from the parquet file.
29. DataFusion Planning Flow
SQL Query (e.g. SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status;)
→ Parsing/Planning → LogicalPlan → Optimization → LogicalPlan → Execution (via an ExecutionPlan) → RecordBatches
Other names: the LogicalPlan is also called the "Query Plan" (Postgres: "Query Tree"); the ExecutionPlan is also called the "Access Plan" or "Operator Tree" (Postgres: "Plan Tree").
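The same flow can be traced from code. Here is a hedged sketch (assuming a recent DataFusion release, where DataFrame exposes logical_plan() and collect()) that parses and plans the SQL text, prints the resulting LogicalPlan, then optimizes and executes it to produce RecordBatches.

use datafusion::error::Result;
use datafusion::prelude::*;

async fn planning_flow(ctx: &SessionContext) -> Result<()> {
    // Parsing/Planning: SQL text -> DataFrame wrapping a LogicalPlan
    let df = ctx
        .sql(
            "SELECT status, COUNT(1) \
             FROM http_api_requests_total \
             WHERE path = '/api/v2/write' \
             GROUP BY status",
        )
        .await?;

    // Inspect the logical plan (same shape as the EXPLAIN output above)
    println!("{}", df.logical_plan().display_indent());

    // Optimization + Execution happen when results are collected
    let batches = df.collect().await?;
    println!("got {} record batches", batches.len());
    Ok(())
}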
30. DataFusion Logical Plan Creation
● Declarative: describe WHAT you want; the system figures out HOW
  ○ Input: "SQL" text (postgres dialect)
● Procedural: describe HOW directly
  ○ Input is a program that builds up the plan
  ○ Two options (see the sketch below):
    ■ Use a LogicalPlanBuilder, a Rust-style builder
    ■ DataFrame - the model popularized by Pandas and Spark
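As an illustration of the procedural route, here is a sketch of the same query built with the DataFrame API instead of SQL. The exact import path for the count aggregate has moved between DataFusion releases, so the one used here is an assumption based on recent versions.

use datafusion::error::Result;
// Assumed path for `count` in recent releases; older versions re-exported it from the prelude
use datafusion::functions_aggregate::expr_fn::count;
use datafusion::prelude::*;

async fn build_with_dataframe(ctx: &SessionContext) -> Result<()> {
    // Procedural: chain DataFrame operations to build up the plan directly
    let df = ctx
        .table("http_api_requests_total")
        .await?
        .filter(col("path").eq(lit("/api/v2/write")))?
        .aggregate(vec![col("status")], vec![count(lit(1))])?;

    df.show().await?;
    Ok(())
}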
34. Query Optimization Overview
Compute the same (correct) result, only faster.
The "Optimizer" is a sequence of passes: LogicalPlan (input) → Optimizer Pass 1 → LogicalPlan (intermediate) → Optimizer Pass 2 → ... other passes ... → LogicalPlan (output)
35. Built in DataFusion Optimizer Passes (source link)
ProjectionPushDown: minimize the number of columns passed from node to node, to minimize intermediate result size (number of columns)
FilterPushdown ("predicate pushdown"): push filters as close to the scans as possible, to minimize intermediate result size
HashBuildProbeOrder ("join reordering"): order joins to minimize the intermediate result size and hash table sizes
ConstantFolding: partially evaluates expressions at plan time, e.g. ColA && true → ColA
37. Expression Evaluation
Arrow Compute Kernels typically operate on 1 or 2 arrays and/or scalars.
Partial list of included comparison kernels (a usage sketch follows after this list):
eq Perform left == right operation on two arrays.
eq_scalar Perform left == right operation on an array and a scalar value.
eq_utf8 Perform left == right operation on StringArray / LargeStringArray.
eq_utf8_scalar Perform left == right operation on StringArray / LargeStringArray and a scalar.
and Performs AND operation on two arrays. If either left or right value is null then the result is also null.
is_not_null Returns a non-null BooleanArray with whether each value of the array is not null.
or Performs OR operation on two arrays. If either left or right value is null then the result is also null.
...
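To make the kernel names above concrete, here is a small sketch that evaluates path = '/api/v2/write' OR path IS NOT NULL one kernel at a time, the way a physical expression would. The module paths follow the arrow-rs releases contemporary with this talk (the comparison and boolean kernels have since been reorganized), so treat them as assumptions.

use arrow::array::{BooleanArray, StringArray};
use arrow::compute::kernels::boolean::{is_not_null, or};
use arrow::compute::kernels::comparison::eq_utf8_scalar;
use arrow::error::Result;

fn evaluate(paths: &StringArray) -> Result<BooleanArray> {
    let eq = eq_utf8_scalar(paths, "/api/v2/write")?; // left == right against a scalar
    let not_null = is_not_null(paths)?;               // per-row "is not null" check
    or(&eq, &not_null)                                // element-wise OR (null-propagating)
}

fn main() {
    let paths = StringArray::from(vec![Some("/api/v2/write"), None, Some("/ping")]);
    // Values: true, null, true (the OR kernel returns null when either input is null)
    println!("{:?}", evaluate(&paths).unwrap());
}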
38. Exprs for evaluating arbitrary expressions
path = '/api/v2/write' OR path IS NULL
The expression is represented as a tree of Exprs: a BinaryExpr with op Or at the root, whose left child is a BinaryExpr with op Eq over Column path and Literal ScalarValue::Utf8 '/api/v2/write', and whose right child is IsNull over Column path.
Expression Builder API:
col("path").eq(lit("/api/v2/write")).or(col("path").is_null())
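The builder snippet compiles essentially as written; a small self-contained sketch (col and lit come from the DataFusion prelude):

use datafusion::prelude::{col, lit};

fn main() {
    // Builds the Expr tree shown above: Or(Eq(path, '/api/v2/write'), IsNull(path))
    let predicate = col("path")
        .eq(lit("/api/v2/write"))
        .or(col("path").is_null());
    println!("{}", predicate);
}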
47. Type Coercion
sqrt(col)
col is Int8, but sqrt is implemented for Float32 or Float64
⇒ Type Coercion adds a typecast so the implementation can be called:
sqrt(col) → sqrt(CAST col as Float32)
Note: coercion is lossless; if col were Float64, it would not be coerced down to Float32
Source Code: coercion.rs
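The effect of the inserted cast can be reproduced directly with Arrow's cast compute kernel; a minimal sketch, assuming a recent arrow-rs layout where cast lives in arrow::compute:

use arrow::array::{Array, Float64Array, Int8Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() {
    // An Int8 column is cast to Float64 so a float-only function like sqrt can run on it
    let col = Int8Array::from(vec![1, 4, 9]);
    let as_float = cast(&col, &DataType::Float64).unwrap();
    let as_float = as_float.as_any().downcast_ref::<Float64Array>().unwrap();
    let roots: Vec<f64> = as_float.iter().map(|v| v.unwrap().sqrt()).collect();
    println!("{:?}", roots); // [1.0, 2.0, 3.0]
}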
49. Plan Execution Overview
Typically called the "execution engine" in database systems
DataFusion features:
● Async: mostly avoids blocking I/O
● Vectorized: processes a RecordBatch at a time, with a configurable batch size
● Eager Pull: data is produced using a pull model, giving natural backpressure (see the sketch below)
● Partitioned: each operator produces partitions, in parallel
● Multi-Core*
* Uses async tasks; there is still some unease about this and about whether we need another thread pool
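A sketch of what the pull model looks like from the caller's side (assuming a recent DataFusion release where DataFrame has execute_stream): the plan is executed as an async stream and the caller drives it by awaiting next(), receiving one RecordBatch per iteration.

use datafusion::error::Result;
use datafusion::prelude::*;
use futures::StreamExt;

async fn drain(ctx: &SessionContext) -> Result<()> {
    let df = ctx
        .sql("SELECT status, COUNT(1) FROM http_api_requests_total GROUP BY status")
        .await?;

    // Pull model: each next().await yields one vectorized RecordBatch
    let mut stream = df.execute_stream().await?;
    while let Some(batch) = stream.next().await {
        let batch = batch?;
        println!("batch with {} rows", batch.num_rows());
    }
    Ok(())
}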
53. next()
A single-partition plan runs as a chain of SendableRecordBatchStreams: a "ParquetStream"* for file1 feeds a FilterExecStream, which feeds a GroupHashAggregateStream. A Rust Stream is an async iterator that produces record batches.
Execution of the GroupHash starts eagerly, before next() is called on it:
Step 0: a new task is spawned and starts computing its input immediately. Step 1: data is read from parquet and returned. Step 2: the data is filtered. Step 3: the data is fed into a hash table. Step 4: the hash is done and output is produced. Step 5: output is requested via next().await. Step 6: a RecordBatch is returned to the caller. Ready to produce values! 😅
54. next()
With multiple partitions, several GroupHashAggregateStreams feed a MergeStream. The MergeStream eagerly starts on its own task, with back pressure via bounded channels.
Step 0: new tasks are spawned for the MergeStream and for each input stream; each starts computing its input immediately. Step 1: output is requested via next().await. Step 2: eventually a RecordBatch is produced downstream and returned. Step 3: Merge passes on the RecordBatch. Step 4: data is fed into a hash table. Step 5: the hash is done and output is produced. Step 6: the result is returned to the caller.
55. Get Involved
Check out the project Apache Arrow
Join the mailing list (links on project page)
Test out Arrow (crates.io) and DataFusion (crates.io) in your projects
Help out with the docs/code/tickets on GitHub
Thank You!!!!