CREATE STATISTICS - What is it for? (PostgresLondon)Tomas Vondra
CREATE STATISTICS command was introduced in PostgreSQL 10, and improved in PostgreSQL 12. Let's see what queries it's meant to improve and how it works.
This document discusses using Amazon Rekognition to index faces from images stored in an S3 bucket into a collection. It describes calling the IndexFaces API to index the full image, as well as cropped left, right, top-left, top-right, bottom-left, and bottom-right regions of faces. Promises are used to index the different regions concurrently. The document also mentions storing face metadata like bounding box coordinates and assigning external IDs to each indexed face for later searching by the SearchFacesByImage API.
Apache Spark™ Applications the Easy Way - Pierre Borckmanssparktc
At the sold-out Spark & Machine Learning Meetup in Brussels on October 27, 2016, Pierre Borckmans of Real Impact Analytics delivered a lightning talk called "Writing Spark applications, the easy way".
As Pierre explained, even though Apache Spark™ offers intuitive and high-level APIs, writing production-ready Spark data pipelines involves non-trivial challenges for data scientists without expert background in software development and devops matters. In this short talk, he showed how his team tackled these issues at Real Impact Analytics, by developing an intuitive framework for writing dataflows, offering convenient data exploration and testing facilities, while hiding devops-related complexity.
Stratio: Geospatial and bitemporal search in Cassandra with pluggable Lucene ...DataStax Academy
The document describes a method for indexing and searching bitemporal data in Cassandra using Lucene indexes. It proposes using four R-trees, with each R-tree represented using two DateRangePrefixTrees in Lucene to index the data by valid and transaction time ranges. Queries are transformed and distributed to search the appropriate R-trees and DateRangePrefixTrees to retrieve bitemporal data within the specified time ranges.
Geospatial and bitemporal search in cassandra with pluggable lucene indexAndrés de la Peña
Stratio presented its open source Lucene-based implementation of Cassandra’s secondary indexes at Cassandra Summit London 2014, which provided several search engine features. It used to be distributed as a fork of Apache Cassandra, which was a huge problem both for users and maintainers. Nowadays, due to some changes introduced at C* 2.1.6, we are proud to announce that it has become a plugin that can be attached to the official Apache Cassandra. With the plugin we have been able to provide C* with geospatial capabilities, making it possible to index geographical positions and perform bounding box and radial distance queries. This is achieved through Lucene’s geospatial module. Another feature we have provided with our plugin is the possibility of indexing bitemporal data models, which distinguish between system time and business time. This way, it is possible to make queries over C* such as “give me what system thought in a certain instant about what happened in another instant”. The implementation has been performed combining range prefix trees with the 4R-Tree approach exposed by Bliujūtė et al. Both full-text, geospatial and bitemporal queries can be combined with Apache Spark to avoid systematic full-scan, dramatically reducing the amount of data to be processed.
This webinar will give an overview of CREATE STATISTICS in PostgreSQL. This command allows the database to collect multi-column statistics, helping the optimizer understand dependencies between columns, produce more accurate estimates, and better query plans.
The following key topics will be covered during the webinar:
- Why CREATE STATISTICS may be needed at all
- How the command works
- Which cases CREATE STATISTICS already addresses
- What improvements are in the queue for future PostgreSQL versions (either already committed to PostgreSQL 13 or beyond)
Big Data Analytics with MariaDB ColumnStoreMariaDB plc
MariaDB ColumnStore is an open source columnar database storage engine that provides high performance analytics capabilities on large datasets using standard SQL. It uses a distributed architecture that stores data by column rather than by row to enable fast queries by only accessing the relevant columns. It can scale horizontally on commodity servers to support analytics workloads on datasets ranging from millions to trillions of rows.
CREATE STATISTICS - What is it for? (PostgresLondon)Tomas Vondra
CREATE STATISTICS command was introduced in PostgreSQL 10, and improved in PostgreSQL 12. Let's see what queries it's meant to improve and how it works.
This document discusses using Amazon Rekognition to index faces from images stored in an S3 bucket into a collection. It describes calling the IndexFaces API to index the full image, as well as cropped left, right, top-left, top-right, bottom-left, and bottom-right regions of faces. Promises are used to index the different regions concurrently. The document also mentions storing face metadata like bounding box coordinates and assigning external IDs to each indexed face for later searching by the SearchFacesByImage API.
Apache Spark™ Applications the Easy Way - Pierre Borckmanssparktc
At the sold-out Spark & Machine Learning Meetup in Brussels on October 27, 2016, Pierre Borckmans of Real Impact Analytics delivered a lightning talk called "Writing Spark applications, the easy way".
As Pierre explained, even though Apache Spark™ offers intuitive and high-level APIs, writing production-ready Spark data pipelines involves non-trivial challenges for data scientists without expert background in software development and devops matters. In this short talk, he showed how his team tackled these issues at Real Impact Analytics, by developing an intuitive framework for writing dataflows, offering convenient data exploration and testing facilities, while hiding devops-related complexity.
Stratio: Geospatial and bitemporal search in Cassandra with pluggable Lucene ...DataStax Academy
The document describes a method for indexing and searching bitemporal data in Cassandra using Lucene indexes. It proposes using four R-trees, with each R-tree represented using two DateRangePrefixTrees in Lucene to index the data by valid and transaction time ranges. Queries are transformed and distributed to search the appropriate R-trees and DateRangePrefixTrees to retrieve bitemporal data within the specified time ranges.
Geospatial and bitemporal search in cassandra with pluggable lucene indexAndrés de la Peña
Stratio presented its open source Lucene-based implementation of Cassandra’s secondary indexes at Cassandra Summit London 2014, which provided several search engine features. It used to be distributed as a fork of Apache Cassandra, which was a huge problem both for users and maintainers. Nowadays, due to some changes introduced at C* 2.1.6, we are proud to announce that it has become a plugin that can be attached to the official Apache Cassandra. With the plugin we have been able to provide C* with geospatial capabilities, making it possible to index geographical positions and perform bounding box and radial distance queries. This is achieved through Lucene’s geospatial module. Another feature we have provided with our plugin is the possibility of indexing bitemporal data models, which distinguish between system time and business time. This way, it is possible to make queries over C* such as “give me what system thought in a certain instant about what happened in another instant”. The implementation has been performed combining range prefix trees with the 4R-Tree approach exposed by Bliujūtė et al. Both full-text, geospatial and bitemporal queries can be combined with Apache Spark to avoid systematic full-scan, dramatically reducing the amount of data to be processed.
This webinar will give an overview of CREATE STATISTICS in PostgreSQL. This command allows the database to collect multi-column statistics, helping the optimizer understand dependencies between columns, produce more accurate estimates, and better query plans.
The following key topics will be covered during the webinar:
- Why CREATE STATISTICS may be needed at all
- How the command works
- Which cases CREATE STATISTICS already addresses
- What improvements are in the queue for future PostgreSQL versions (either already committed to PostgreSQL 13 or beyond)
Big Data Analytics with MariaDB ColumnStoreMariaDB plc
MariaDB ColumnStore is an open source columnar database storage engine that provides high performance analytics capabilities on large datasets using standard SQL. It uses a distributed architecture that stores data by column rather than by row to enable fast queries by only accessing the relevant columns. It can scale horizontally on commodity servers to support analytics workloads on datasets ranging from millions to trillions of rows.
Big Data-Driven Applications with Cassandra and SparkArtem Chebotko
This document discusses using Cassandra and Spark for big data applications. It describes how Cassandra is well-suited for operational workloads with millisecond response times and linear scalability, while Spark can handle real-time, streaming and batch analytics up to 100x faster than Hadoop. The Spark-Cassandra connector allows Spark to efficiently read from and write into Cassandra by optimizing for predicate pushdown, data locality, joins and grouping. The document provides an architecture overview and examples of modeling data in Cassandra and interacting with it from Spark using the connector.
Improving MariaDB’s Query Optimizer with better selectivity estimatesSergey Petrunya
The document discusses improving selectivity estimates in MariaDB's query optimizer. It begins with background on selectivity estimates and how the query optimizer uses statistics like cardinalities and selectivities. It then covers computing selectivity for local and join conditions, including techniques like histograms. The document discusses different types of histograms used in various databases and ongoing work in MariaDB to improve its histograms. It concludes with discussing computing selectivity for multiple conditions.
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB
This document describes using MongoDB and Python for data analysis pipelines. It discusses using MongoDB for data storage and aggregation, connecting to MongoDB from Python using Monary and PyMongo, and using Python for data processing, analysis and visualization. It provides an example MongoDB aggregation query in Python to sum the population by state using Monary. It also shows an example Airflow DAG for running the aggregation query as part of a scheduled data pipeline with monitoring and dependencies.
Community-driven Language Design at TC39 on the JavaScript Pipeline Operator ...Igalia
By Daniel Ehrenberg.
Slides at https://docs.google.com/presentation/d/1bpvESaWtNnhXV0a95b6GWHhhqUVnGCIPcs37ngqx4Uo/edit#slide=id.g38b91fc952_0_103
Following ES6, TC39 is working with the broader JS developer community to continue evolving the JavaScript programming language. The pipeline operator `x |> f |> g` is an early stage, community-driven proposal to make it easier to compose multiple functions, inspired by similar syntax in other programming language and frameworks like RxJS. In this talk, I'll explain how TC39 works and how this proposal is being carefully developed with extensive feedback and opportunities for you to get involved.
(c) WorkerConf 2018
28th June 2018 (Dornbirn, Austria)
https://worker.sh/
(NET301) New Capabilities for Amazon Virtual Private CloudAmazon Web Services
Amazon's Virtual Private Cloud (Amazon VPC) continues to evolve with new capabilities and enhancements. These features give you increasingly greater isolation, control, and visibility at the all-important networking layer. In this session, we review some of the latest changes, discuss their value, and describe their use cases.
Application Monitoring using Open Source: VictoriaMetrics - ClickHouseVictoriaMetrics
Monitoring is the key to successful operation of any software service, but commercial solutions are complex, expensive, and slow. Let us show you how to build monitoring that is simple, cost-effective, and fast using open source stacks easily accessible to any developer.
We’ll start with the elements of monitoring systems: data ingest, query engine, visualization, and alerting. We’ll then explain and contrast two implementation approaches. The first uses VictoriaMetrics, a fast growing, high performance time series database that uses PromQL for queries. The second is based on ClickHouse, a popular real-time analytics database that speaks SQL. Fast, affordable monitoring is within reach. This webinar provides designs and working code to get you there.
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...Altinity Ltd
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHouse Webinar Slides
Monitoring is the key to the successful operation of any software service, but commercial solutions are complex, expensive, and slow. Let us show you how to build monitoring that is simple, cost-effective, and fast using open-source stacks easily accessible to any developer.
We’ll start with the elements of monitoring systems: data ingest, query engine, visualization, and alerting. We’ll then explain and contrast two implementation approaches. The first uses VictoriaMetrics, a fast-growing, high-performance time series database that uses PromQL for queries. The second is based on ClickHouse, a popular real-time analytics database that speaks SQL. Fast, affordable monitoring is within reach. This webinar provides designs and working code to get you there.
Presented by:
Roman Khavronenko, Co-Founder at VictoriaMetrics
Robert Hodges, CEO at Altinity
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...EDB
What do you do, when you have to deal with poor database and query performance in PostgreSQL and there is no one around to help? Let us introduce you to important commands in PostgreSQL - EXPLAIN and EXPLAIN ANALYZE. Knowing how to use these 'tools' will help you identify query performance bottlenecks and opportunities. It provides a query plan detailing what approach the planner took to execute the statement provided.
Attend this webinar to learn:
- What are EXPLAIN and EXPLAIN ANALYZE in PostgreSQL?
- How do they help?
- Know planner tuning parameters
This document summarizes a presentation given at Paris Data Ladies #6 on Spark Streaming. The presentation introduced Spark Streaming, discussing how it can be used to perform real-time data analysis. It covered topics such as batch versus streaming processing, why Spark Streaming is useful, and how to build a basic Twitter trending hashtag application using Spark Streaming. Code examples were provided and a live demo of the Twitter application was shown. Considerations for productionizing Spark Streaming applications were also discussed.
The document describes using Python scripts and ArcGIS tools to analyze drive time service areas and population data. It discusses using a cursor to select individual records from a feature class, creating a new feature class for each record, clipping census block groups to the new feature class, calculating total population for the service area, and merging the individual feature classes into a single output. The purpose is to summarize the total population within different drive time thresholds from a given location.
How Russia’s #1 Internet Provider Gets High Performance at Low Cost
What if you could store huge amounts of data using traditional low-cost HDDs while still maintaining low single-digit millisecond latencies? That’s just what they did at Mail.Ru, the largest email and Internet service provider in Russia.
Actian Matrix is a massively scalable columnar database that can scale out horizontally on commodity hardware. It uses a leader node and compute nodes architecture with dynamic slicing of data to distribute query processing across nodes. This allows queries on very large datasets to be processed rapidly in parallel. Matrix has demonstrated excellent performance and scalability on the TPC-H benchmark, processing a 100GB dataset on 16 nodes in a very short time. Its ability to scale out nearly linearly with added nodes and disks is the secret to its strong performance on big data workloads.
A short introduction to reproducible research, reproducibility with R, Docker, and all together for reproducible research using R and Docker containers. Includes demos of Rocker and containerit.
The document provides an overview of Oracle indexes from a conceptual and internal perspective. It discusses B-tree indexes, including their structure, leaf and branch nodes, and how select, update, delete and insert operations are handled internally. It also covers bitmap indexes and their storage format, as well as function-based and reversed indexes.
The document provides information about troubleshooting and monitoring PostgreSQL performance. It includes:
- Top output showing PostgreSQL processes using significant CPU resources
- Iotop output showing disk IO from PostgreSQL processes
- Pg_stat_activity output listing active queries and sessions
- Pg_stat_statements output showing top queries by execution time
- Explanations and examples of using EXPLAIN, EXPLAIN ANALYZE, and ANALYZE to view query plans and statistics
MongoDB World 2016: The Best IoT Analytics with MongoDBMongoDB
The document summarizes using MongoDB and Spark for IoT analytics on aircraft ADS-B data. Key points include:
- ADS-B data from a SDR antenna is ingested into MongoDB for real-time dashboards and historical analysis using aggregation pipelines.
- The aggregation framework enables ad-hoc queries on aircraft attributes like speed, altitude and operator without ETL.
- The BI connector projects data to SQL for analytics tools. Examples analyze FedEx flight paths and aircraft types by altitude and speed.
- Spark is used for machine learning like k-means clustering on attributes to identify patterns in the data.
Обзорный рассказ про новые возможности в мире PostgreSQL для митапа Big Data Minsk User Group 29 апреля 2016 г.: https://www.facebook.com/events/120784531655479/
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent time. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by
David Andreson
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...Big Data Spain
Stratio presented its open source Lucene-based implementation of Cassandra’s secondary indexes at Cassandra Summit London 2014, which provided several search engine features. It used to be distributed as a fork of Apache Cassandra, which was a huge problem both for users and maintainers. Nowadays, due to some changes introduced at C* 2.1.6, we are proud to announce that it has become a plugin that can be attached to the official Apache Cassandra.
With the plugin we have been able to provide C* with geospatial capabilities, making it possible to index geographical positions and perform bounding box and radial distance queries. This is achieved through Lucene’s geospatial module.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-9.html
This document provides examples of how to obtain MAC and IP address accounting information from Cisco routers using SNMP. MAC address accounting tracks packet and byte counts for each unique MAC address on a LAN interface. IP address accounting tracks packet and byte counts for source and destination IP address pairs. The document outlines how to retrieve this information using SNMP commands to access the CISCO-IP-STAT-MIB and OLD-CISCO-IP-MIB, including using checkpointing to copy the active IP accounting database to a stable checkpoint database.
Let's discuss the various sources of data corruption in databases (and in PostgreSQL in general) - cosmic rays, storage, bugs in the OS, database or application. I'll share some general observations about those various sources, and also some recommendations how to detect those issues in PostgreSQL (data checksums, ...) and how to deal with them.
The document discusses different approaches to storing sensitive data like credit card numbers in a database. It compares using full-disk encryption versus application-level encryption. Full-disk encryption protects against theft but prevents indexing and queries on encrypted fields. Application-level encryption requires decrypting and re-encrypting all data, moving processing to the application. The document proposes a solution of using a separate trusted component that can compare encrypted values to enable indexing and aggregation while keeping data encrypted in the database.
Big Data-Driven Applications with Cassandra and SparkArtem Chebotko
This document discusses using Cassandra and Spark for big data applications. It describes how Cassandra is well-suited for operational workloads with millisecond response times and linear scalability, while Spark can handle real-time, streaming and batch analytics up to 100x faster than Hadoop. The Spark-Cassandra connector allows Spark to efficiently read from and write into Cassandra by optimizing for predicate pushdown, data locality, joins and grouping. The document provides an architecture overview and examples of modeling data in Cassandra and interacting with it from Spark using the connector.
Improving MariaDB’s Query Optimizer with better selectivity estimatesSergey Petrunya
The document discusses improving selectivity estimates in MariaDB's query optimizer. It begins with background on selectivity estimates and how the query optimizer uses statistics like cardinalities and selectivities. It then covers computing selectivity for local and join conditions, including techniques like histograms. The document discusses different types of histograms used in various databases and ongoing work in MariaDB to improve its histograms. It concludes with discussing computing selectivity for multiple conditions.
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB
This document describes using MongoDB and Python for data analysis pipelines. It discusses using MongoDB for data storage and aggregation, connecting to MongoDB from Python using Monary and PyMongo, and using Python for data processing, analysis and visualization. It provides an example MongoDB aggregation query in Python to sum the population by state using Monary. It also shows an example Airflow DAG for running the aggregation query as part of a scheduled data pipeline with monitoring and dependencies.
Community-driven Language Design at TC39 on the JavaScript Pipeline Operator ...Igalia
By Daniel Ehrenberg.
Slides at https://docs.google.com/presentation/d/1bpvESaWtNnhXV0a95b6GWHhhqUVnGCIPcs37ngqx4Uo/edit#slide=id.g38b91fc952_0_103
Following ES6, TC39 is working with the broader JS developer community to continue evolving the JavaScript programming language. The pipeline operator `x |> f |> g` is an early stage, community-driven proposal to make it easier to compose multiple functions, inspired by similar syntax in other programming language and frameworks like RxJS. In this talk, I'll explain how TC39 works and how this proposal is being carefully developed with extensive feedback and opportunities for you to get involved.
(c) WorkerConf 2018
28th June 2018 (Dornbirn, Austria)
https://worker.sh/
(NET301) New Capabilities for Amazon Virtual Private CloudAmazon Web Services
Amazon's Virtual Private Cloud (Amazon VPC) continues to evolve with new capabilities and enhancements. These features give you increasingly greater isolation, control, and visibility at the all-important networking layer. In this session, we review some of the latest changes, discuss their value, and describe their use cases.
Application Monitoring using Open Source: VictoriaMetrics - ClickHouseVictoriaMetrics
Monitoring is the key to successful operation of any software service, but commercial solutions are complex, expensive, and slow. Let us show you how to build monitoring that is simple, cost-effective, and fast using open source stacks easily accessible to any developer.
We’ll start with the elements of monitoring systems: data ingest, query engine, visualization, and alerting. We’ll then explain and contrast two implementation approaches. The first uses VictoriaMetrics, a fast growing, high performance time series database that uses PromQL for queries. The second is based on ClickHouse, a popular real-time analytics database that speaks SQL. Fast, affordable monitoring is within reach. This webinar provides designs and working code to get you there.
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...Altinity Ltd
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHouse Webinar Slides
Monitoring is the key to the successful operation of any software service, but commercial solutions are complex, expensive, and slow. Let us show you how to build monitoring that is simple, cost-effective, and fast using open-source stacks easily accessible to any developer.
We’ll start with the elements of monitoring systems: data ingest, query engine, visualization, and alerting. We’ll then explain and contrast two implementation approaches. The first uses VictoriaMetrics, a fast-growing, high-performance time series database that uses PromQL for queries. The second is based on ClickHouse, a popular real-time analytics database that speaks SQL. Fast, affordable monitoring is within reach. This webinar provides designs and working code to get you there.
Presented by:
Roman Khavronenko, Co-Founder at VictoriaMetrics
Robert Hodges, CEO at Altinity
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...EDB
What do you do, when you have to deal with poor database and query performance in PostgreSQL and there is no one around to help? Let us introduce you to important commands in PostgreSQL - EXPLAIN and EXPLAIN ANALYZE. Knowing how to use these 'tools' will help you identify query performance bottlenecks and opportunities. It provides a query plan detailing what approach the planner took to execute the statement provided.
Attend this webinar to learn:
- What are EXPLAIN and EXPLAIN ANALYZE in PostgreSQL?
- How do they help?
- Know planner tuning parameters
This document summarizes a presentation given at Paris Data Ladies #6 on Spark Streaming. The presentation introduced Spark Streaming, discussing how it can be used to perform real-time data analysis. It covered topics such as batch versus streaming processing, why Spark Streaming is useful, and how to build a basic Twitter trending hashtag application using Spark Streaming. Code examples were provided and a live demo of the Twitter application was shown. Considerations for productionizing Spark Streaming applications were also discussed.
The document describes using Python scripts and ArcGIS tools to analyze drive time service areas and population data. It discusses using a cursor to select individual records from a feature class, creating a new feature class for each record, clipping census block groups to the new feature class, calculating total population for the service area, and merging the individual feature classes into a single output. The purpose is to summarize the total population within different drive time thresholds from a given location.
How Russia’s #1 Internet Provider Gets High Performance at Low Cost
What if you could store huge amounts of data using traditional low-cost HDDs while still maintaining low single-digit millisecond latencies? That’s just what they did at Mail.Ru, the largest email and Internet service provider in Russia.
Actian Matrix is a massively scalable columnar database that can scale out horizontally on commodity hardware. It uses a leader node and compute nodes architecture with dynamic slicing of data to distribute query processing across nodes. This allows queries on very large datasets to be processed rapidly in parallel. Matrix has demonstrated excellent performance and scalability on the TPC-H benchmark, processing a 100GB dataset on 16 nodes in a very short time. Its ability to scale out nearly linearly with added nodes and disks is the secret to its strong performance on big data workloads.
A short introduction to reproducible research, reproducibility with R, Docker, and all together for reproducible research using R and Docker containers. Includes demos of Rocker and containerit.
The document provides an overview of Oracle indexes from a conceptual and internal perspective. It discusses B-tree indexes, including their structure, leaf and branch nodes, and how select, update, delete and insert operations are handled internally. It also covers bitmap indexes and their storage format, as well as function-based and reversed indexes.
The document provides information about troubleshooting and monitoring PostgreSQL performance. It includes:
- Top output showing PostgreSQL processes using significant CPU resources
- Iotop output showing disk IO from PostgreSQL processes
- Pg_stat_activity output listing active queries and sessions
- Pg_stat_statements output showing top queries by execution time
- Explanations and examples of using EXPLAIN, EXPLAIN ANALYZE, and ANALYZE to view query plans and statistics
MongoDB World 2016: The Best IoT Analytics with MongoDBMongoDB
The document summarizes using MongoDB and Spark for IoT analytics on aircraft ADS-B data. Key points include:
- ADS-B data from a SDR antenna is ingested into MongoDB for real-time dashboards and historical analysis using aggregation pipelines.
- The aggregation framework enables ad-hoc queries on aircraft attributes like speed, altitude and operator without ETL.
- The BI connector projects data to SQL for analytics tools. Examples analyze FedEx flight paths and aircraft types by altitude and speed.
- Spark is used for machine learning like k-means clustering on attributes to identify patterns in the data.
Обзорный рассказ про новые возможности в мире PostgreSQL для митапа Big Data Minsk User Group 29 апреля 2016 г.: https://www.facebook.com/events/120784531655479/
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent time. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by
David Andreson
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...Big Data Spain
Stratio presented its open source Lucene-based implementation of Cassandra’s secondary indexes at Cassandra Summit London 2014, which provided several search engine features. It used to be distributed as a fork of Apache Cassandra, which was a huge problem both for users and maintainers. Nowadays, due to some changes introduced at C* 2.1.6, we are proud to announce that it has become a plugin that can be attached to the official Apache Cassandra.
With the plugin we have been able to provide C* with geospatial capabilities, making it possible to index geographical positions and perform bounding box and radial distance queries. This is achieved through Lucene’s geospatial module.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-9.html
This document provides examples of how to obtain MAC and IP address accounting information from Cisco routers using SNMP. MAC address accounting tracks packet and byte counts for each unique MAC address on a LAN interface. IP address accounting tracks packet and byte counts for source and destination IP address pairs. The document outlines how to retrieve this information using SNMP commands to access the CISCO-IP-STAT-MIB and OLD-CISCO-IP-MIB, including using checkpointing to copy the active IP accounting database to a stable checkpoint database.
Similar to CREATE STATISTICS - what is it for? (20)
Let's discuss the various sources of data corruption in databases (and in PostgreSQL in general) - cosmic rays, storage, bugs in the OS, database or application. I'll share some general observations about those various sources, and also some recommendations how to detect those issues in PostgreSQL (data checksums, ...) and how to deal with them.
The document discusses different approaches to storing sensitive data like credit card numbers in a database. It compares using full-disk encryption versus application-level encryption. Full-disk encryption protects against theft but prevents indexing and queries on encrypted fields. Application-level encryption requires decrypting and re-encrypting all data, moving processing to the application. The document proposes a solution of using a separate trusted component that can compare encrypted values to enable indexing and aggregation while keeping data encrypted in the database.
PostgreSQL performance improvements in 9.5 and 9.6Tomas Vondra
The document summarizes performance improvements in PostgreSQL versions 9.5 and 9.6. Some key improvements discussed include optimizations to sorting, hash joins, BRIN indexes, parallel query processing, aggregate functions, checkpoints, and freezing. Performance tests on sorting, hash joins, and parallel queries show significant speedups from these changes, such as faster sorting times and better scalability with parallel queries.
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016Tomas Vondra
The document provides an overview of different file systems for PostgreSQL including EXT3/4, XFS, BTRFS and ZFS. It discusses the evolution and improvements made to EXT3/4 and XFS over time to address scalability, bugs and new storage technologies like SSDs. BTRFS and ZFS are described as designed for large data volumes and built-in features like snapshots and checksums but BTRFS is still considered experimental. Benchmark results show ZFS and optimized EXT4/XFS performing best and BTRFS performance significantly reduced due to copy-on-write. The conclusion recommends EXT4/XFS for traditional needs and ZFS for advanced features, avoiding BTRFS.
PostgreSQL on EXT4, XFS, BTRFS and ZFSTomas Vondra
One of the most common question I'm asked by users and customer about configuring a Linux-based system for PostgreSQL is "What file fystem should I use to get the best performance?" Sadly, most of the recommendations is based on obsolete "common knowledge" passed from generation to generation. But in recent years the file systems improved a lot - we've seen both evolution of established file systems (EXT family, XFS, ...) and revolution in the form of BTRFS, ZFS or F2FS. And of course new kinds of hardware (SSDs) which certainly impacts the design of a file system.
Performance improvements in PostgreSQL 9.5 and beyondTomas Vondra
This document discusses several performance improvements made in PostgreSQL versions 9.5 and beyond. Some key improvements discussed include:
- Faster sorting through allowing sorting by inlined functions, abbreviated keys for VARCHAR/TEXT/NUMERIC, and Sort Support benefits.
- Improved hash joins through reduced palloc overhead, smaller NTUP_PER_BUCKET, and dynamically resizing the hash table.
- Index improvements like avoiding index tuple copying, GiST and bitmap index scan optimizations, and block range tracking in BRIN indexes.
- Aggregate functions see speedups through using 128-bit integers for internal state instead of NUMERIC in some cases.
- Other optimizations affect PL/pgSQL performance,
The document summarizes performance benchmarks comparing PostgreSQL versions from 7.4 to the latest version. The benchmarks show significant performance improvements across different workloads over time, including:
- Up to 6x faster transaction processing rates in pgbench
- Much better scalability to multiple cores and clients
- 6x faster query times for TPC-DS data warehouse benchmark
- Over 10x speedup for some full text search queries in latest version
The benchmarks demonstrate that PostgreSQL has become dramatically faster and more scalable over the past 10+ years, while also using less disk space and memory.
Standardní implementace ispell slovníků v PostgreSQL má bohužel několik nevýhod co se CPU a paměti týká. Napsal jsem extension která umožňuje slovníky inicializovat jen jednou a sdílet je mezi spojeními.
Jaký dopad na výkon má přesun transakčního logu nebo indexů na samostatný disk? Jak výsledek závisí na typu workloadu (OLTP vs. DWH) a na typu disku (HDD vs. SSD)?
Stručný úvod do čtení explain planu - seznámení s principem jak databáze vybírá postup vyhodnocení dotazu, na jaké základní operace v něm můžeme narazit (varianty čtení tabulek a joinů). Zejména se ale podíváme na to jak identifikovat problematická místa v exekučním plánu, v čem může být příčina a jak ji odstranit.
Přehled variant databázové replikace (master-slave vs. master-master, physical vs. logical) a jejich pro a proti. Krátká historie replikace v PostgreSQL s přehled nástrojů které dnes máme k dispozici (slony-I, pgpool-II, Londiste a Bucardo). A nakonec stručné srovnání s replikací ve dvou populárních databázích (MySQL a Oracle).
Prezentace připravená pro konferenci Prague PostgreSQL Developers' Day 2010. Hlavním tématem je monitoring výkonu PostgreSQL databázového serveru, nástrojů které k tomu lze použít, atd. Na začátku jsou představeny základní pojmy (např. co je to snapshot), a poté je předvedeno několik užitečných nástrojů (jeden z nich vyvíjím já).
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Infrastructure Challenges in Scaling RAG with Custom AI modelsZilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
19. https://github.com/tvondra/create-statistics-talk https://www.2ndQuadrant.com
pgconf.eu 2018
Lisbon, October 23-26, 2018
Functional Dependencies
●
value in column A determines value in column B
●
trivial example: primary key determines everything
– zip code {place, state, province, community→ }
– 4625-113 {Favões, Porto, Marco de Canaveses, Favões→ }
●
other dependencies:
– place community→
– community province→
– province state→
22. https://github.com/tvondra/create-statistics-talk https://www.2ndQuadrant.com
pgconf.eu 2018
Lisbon, October 23-26, 2018
EXPLAIN (ANALYZE, TIMING off)
SELECT * FROM zip_codes WHERE place_name = 'Berlin'
AND state_name = 'Berlin';
QUERY PLAN
------------------------------------------------------------------
Seq Scan on zip_codes (cost=0.00..5378.11 rows=7076 width=56)
(actual rows=9165 loops=1)
Filter: (((place_name)::text = 'Lisboa'::text)
AND ((state_name)::text = 'Lisboa'::text))
Rows Removed by Filter: 197776
Underestimate : fied
23. https://github.com/tvondra/create-statistics-talk https://www.2ndQuadrant.com
pgconf.eu 2018
Lisbon, October 23-26, 2018
EXPLAIN (ANALYZE, TIMING off)
SELECT * FROM zip_codes WHERE place_name = 'Lisboa'
AND state_name != 'Lisboa';
QUERY PLAN
----------------------------------------------------------------
Seq Scan on zip_codes (cost=0.00..5378.11 rows=7166 width=56)
(actual rows=1 loops=1)
Filter: (((state_name)::text <> 'Lisboa'::text)
AND ((place_name)::text = 'Lisboa'::text))
Rows Removed by Filter: 206940
Functional dependencies only work with equalities.
Overestimate #1: not fied :-(
24. https://github.com/tvondra/create-statistics-talk https://www.2ndQuadrant.com
pgconf.eu 2018
Lisbon, October 23-26, 2018
EXPLAIN (ANALYZE, TIMING off)
SELECT * FROM zip_codes WHERE place_name = 'Lisboa'
AND state_name = 'Porto';
QUERY PLAN
------------------------------------------------------------------
Seq Scan on zip_codes (cost=0.00..5378.11 rows=1374 width=56)
(actual rows=0 loops=1)
Filter: (((place_name)::text = 'Lisboa'::text)
AND ((state_name)::text = 'Porto'::text))
Rows Removed by Filter: 206941
The queries need to respect the functional dependencies.
Overestimate #2: not fied :-(
31. https://github.com/tvondra/create-statistics-talk https://www.2ndQuadrant.com
pgconf.eu 2018
Lisbon, October 23-26, 2018
CREATE STATISTICS s (ndistinct)
ON state_name, province_name, community_name
FROM zip_codes;
ANALYZE zip_codes;
SELECT stxndistinct FROM pg_statistic_ext;
stxndistinct
------------------------------------------------------------
{"3, 4": 308, "3, 5": 2858, "4, 5": 2858, "3, 4, 5": 2858}
32. https://github.com/tvondra/create-statistics-talk https://www.2ndQuadrant.com
pgconf.eu 2018
Lisbon, October 23-26, 2018
EXPLAIN (ANALYZE, TIMING off)
SELECT count(*) FROM zip_codes GROUP BY province_name, community_name;
QUERY PLAN
-------------------------------------------------------------------------
HashAggregate (cost=102569.26..102597.84 rows=2858 width=36)
(actual rows=3845 loops=1)
Group Key: state_name, province_name, community_name
-> Seq Scan on zip_codes (cost=0.00..69467.13 rows=3310213 width=28)
(actual rows=3311056 loops=1)
Planning Time: 1.367 ms
Execution Time: 2343.846 ms
33. https://github.com/tvondra/create-statistics-talk https://www.2ndQuadrant.com
pgconf.eu 2018
Lisbon, October 23-26, 2018
ndistinct
●
the “old behavior” was defensive
– unreliable estimates with multiple columns
– HashAggregate can’t spill to disk (OOM)
– rather than crash do Sort+GroupAggregate (slow)
●
ndistinct coefcients
– make multi-column ndistinct estimates more reliable
– reduced danger of OOM
– large tables + GROUP BY multiple columns
34. https://github.com/tvondra/create-statistics-talk https://www.2ndQuadrant.com
pgconf.eu 2018
Lisbon, October 23-26, 2018
Future Improvements
●
additional types of statistics
– MCV lists, histograms, …
●
statistics on expressions
– currently only simple column references
– alternative to functional indexes
●
improving join estimates
– using MCV lists
– special multi-table statistics (syntax already supports it)