User-defined features and session extensions in Apache Spark allow for high-level and low-level customization. High-level customization includes user-defined types (UDT), user-defined functions (UDF), and user-defined aggregate functions (UDAF). Low-level customization involves extensions to the analyzer rules, optimizations, physical execution, and more. The document provides examples of UDT, UDF, and UDAF and discusses how they allow incorporating custom logic into Spark applications similar to stored procedures in databases.
Apache Spark in your likeness - low and high level customization
1. Apache Spark in your likeness
User-Defined features and session extensions
Bartosz Konieczny
@waitingforcode
2. First things first
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com
#@waitingforcode
#github.com/bartosz25
3. Apache Spark Community Feedback initiative
https://www.waitingforcode.com/static/spark-feedback
Why?
● a single place with all best practices
● community-driven
● open
● interactive
How?
● fill the form (https://forms.gle/sjSWPKmudhM6a3776)
● validate
● share
● learn
7. High level customization
● User-Defined Type (UDT)
● User-Defined Function (UDF)
● User-Defined Aggregate Functions (UDAF)
⇒ RDBMS in Apache Spark
⇒ no need to know the internals
8. User-Defined Type
● prior to 2.0 only - ongoing effort for 3.0
● UDTRegistration
● Dataset substitution - your class in DataFrame
● examples: VectorUDT, MatrixUDT
def sqlType: DataType
def pyUDT: String = null
def serializedPyClass: String = null
def serialize(obj: UserType): Any
def deserialize(datum: Any): UserType
9. User-Defined Type - example
@SQLUserDefinedType(udt = classOf[CityUDT])
case class City(name: String, country: Countries) {
def isFrench: Boolean = country == Countries.France
}
class CityUDT extends UserDefinedType[City] {
override def sqlType: DataType = StructType(Seq(StructField("name", StringType),
StructField("country", StringType)))
// ...
}
val cities = Seq(City("Paris", Countries.France), City("London", Countries.England)).toDF("city")
11. User-Defined Type - expression retrieval
cities.where("city.name == 'Paris'").show()
org.apache.spark.sql.AnalysisException: Can't extract value from city#3: need struct type but got struct<name:string,region:string>; line 1 pos 0
at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
12. User-Defined Function
● SQL's: CREATE FUNCTION
● blackbox
● vectorized UDFs for PySpark - ML purpose
case class UserDefinedFunction protected[sql] (f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]]) {
def nullable: Boolean = _nullable
def deterministic: Boolean = _deterministic
def apply(exprs: Column*): Column = { … }
}
14. User-Defined Function
In the query
sparkSession.udf.register("EvenFlagResolver_registerTest", evenFlagResolver _)
val rows = letterNumbers.selectExpr("letter", "EvenFlagResolver_registerTest(number) as isEven")
Programmatically
val udfEvenResolver = udf(evenFlagResolver _)
val rows = letterNumbers.select($"letter", udfEvenResolver($"number") as "isEven")
15. ● if-else == CASE WHEN
● tokenize == LOWER(REGEXP_REPLACE(.....)
● IN clause
● columns equality == abs(col1 - col2) < allowed_precision
● wrapping DataFrame execution == JOIN
val jobnameDF = jobnameSeq.toDF("jobid","jobname")
sqlContext.udf.register("getJobname", (id: String) => (
jobnameDF.filter($"jobid" === id).select($"jobname")
)
)
● testing ML model == MLlib
User-Defined Function
StackOverflow overused
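For the first two items in the list, a hedged sketch of the native alternatives (reusing the letterNumbers DataFrame and implicits from the earlier sketch):

import org.apache.spark.sql.functions.{lower, regexp_replace, when}

// if-else logic as CASE WHEN: stays visible to the Catalyst optimizer, unlike a UDF
val withParity = letterNumbers.withColumn("parity",
  when($"number" % 2 === 0, "even").otherwise("odd"))

// simple tokenization as LOWER(REGEXP_REPLACE(...)) instead of a string-manipulating UDF
val normalized = letterNumbers.withColumn("letterNormalized",
  lower(regexp_replace($"letter", "[^a-zA-Z]+", " ")))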
23. Parser - from text to AST example
SELECT id, login FROM users WHERE id > 1 AND active = true
"SELECT", "i", "d", ",", "l", "o", "g", "i", "n", "WHERE", "i", "d", ">", "1",
"AND", "a", "c", "t", "i", "v", "e", "=", "t", "r", "u", "e"
(whitespaces omitted for readability)
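To inspect what the parser produces for such a query, you can hand it to the session's SQL parser; a minimal sketch, assuming the sparkSession from the earlier snippets:

// parsePlan returns the unresolved logical plan built from the ANTLR parse tree
val parsedPlan = sparkSession.sessionState.sqlParser.parsePlan(
  "SELECT id, login FROM users WHERE id > 1 AND active = true")
// the tree shows 'Project and 'Filter nodes over an unresolved relation `users`
println(parsedPlan.treeString)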
24. Resolution rules
● handles unresolved, i.e. unknown becomes known
● Example:
SELECT * FROM dataset_1
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `dataset_1`
== Analyzed Logical Plan ==
letter: string, nr: int, a_flag: int
Project [letter#7, nr#8, a_flag#9]
+- SubqueryAlias `dataset_1`
+- Project [_1#3 AS letter#7, _2#4 AS nr#8, _3#5 AS a_flag#9]
+- LocalRelation [_1#3, _2#4, _3#5]
25. Post hoc resolution rules
● after resolution rules (post hoc)
● same as custom optimization rules
● order does matter
PreprocessTableCreation → PreprocessTableInsertion → DataSourceAnalysis
Examples:
● normalization - casting, renaming
● partitioning checks, e.g. "$partKey is not a partition column"
● generic LogicalPlan resolution, e.g. CreateTable(with query) ⇒
CreateDataSourceTableAsSelectCommand,
CreateTable(without query) ⇒
CreateDataSourceTableCommand
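Custom resolution and post hoc resolution rules are plugged in through SparkSessionExtensions; a minimal, hedged sketch of a do-nothing post hoc rule that only logs the plan (MyPostHocRule and MyExtensions are made-up names):

import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// runs after the resolution batch; logs the resolved plan and returns it unchanged
case class MyPostHocRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"post hoc resolution of:\n${plan.treeString}")
    plan
  }
}

class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectPostHocResolutionRule(session => MyPostHocRule(session))
  }
}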
26. Check analysis rules
● plain assertions
Connection conn = getConnection();
assert conn != null : "Connection is null";
● clearer error messages:
"assertion failed: No plan for CreateTable CatalogTable" ⇒ ""Hive
support is required to use CREATE Hive TABLE AS SELECT"
● API:
class MyAnalysisRule extends (LogicalPlan => Unit) {
  def apply(plan: LogicalPlan): Unit = {
    // throw new AnalysisException("Analysis error message")
  }
}
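Such a check rule is registered through SparkSessionExtensions as well; a minimal sketch reusing the MyAnalysisRule class above (MyCheckRuleExtensions is a made-up name):

import org.apache.spark.sql.SparkSessionExtensions

class MyCheckRuleExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    // the builder receives the SparkSession; the rule itself is just a LogicalPlan => Unit
    extensions.injectCheckRule(session => new MyAnalysisRule())
  }
}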
27. Check analysis rules - PreWriteCheck
object PreWriteCheck extends (LogicalPlan => Unit) {
  def apply(plan: LogicalPlan): Unit = {
    plan.foreach {
      case InsertIntoTable(l @ LogicalRelation(relation, _, _, _), partition, query, _, _) =>
        val srcRelations = query.collect { case LogicalRelation(src, _, _, _) => src }
        if (srcRelations.contains(relation)) {
          failAnalysis("Cannot insert into table that is also being read from.")
        } else {
          // ...
28. Logical optimization rule
● simplification :
(id > 0 OR login == 'test') AND id > 0 == id > 0
● collapse :
.repartition(10).repartition(20) == .repartition(20)
● dataset reduction :
columns pruning, predicate pushdown
● human mistakes :
trivial filters (2 > 1), execution tree cleaning (identity functions), redundancy
(projection, aliases)
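A minimal, hedged sketch of such a rule, removing the trivial always-true filters mentioned above (MyOptimizationRule is a made-up name; Spark already ships similar built-in rules):

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// drops Filter nodes whose condition is the literal true and keeps their child plan
object MyOptimizationRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    case Filter(Literal(true, _), child) => child
  }
}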
29. Logical optimization rule - diff transform vs resolve
General template:
def apply(plan: LogicalPlan): LogicalPlan = plan.{{TRANSFORMATION}} {
case agg: Aggregate => …
case projection: Project => ...
}
{{TRANSFORMATION}} = transformUp/transformDown, resolveOperatorsUp/resolveOperatorsDown
32. Catalog listeners
* Holder for injection points to the [[SparkSession]]. We make NO
guarantee about the stability
* regarding binary compatibility and source compatibility of methods here.
* This current provides the following extension points:
* <ul>
...
* <li>(External) Catalog listeners.</li>
...
class SparkSessionExtensions {
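A minimal sketch of the two ways to attach such extensions to a session, reusing the made-up MyOptimizationRule and MyCheckRuleExtensions names from the previous sketches (the package name is made up too):

import org.apache.spark.sql.SparkSession

// 1) programmatically, on the builder
val customizedSession = SparkSession.builder()
  .master("local[*]")
  .withExtensions(extensions => extensions.injectOptimizerRule(_ => MyOptimizationRule))
  .getOrCreate()

// 2) declaratively, by pointing spark.sql.extensions at a SparkSessionExtensions => Unit class
// spark.sql.extensions=com.mycompany.MyCheckRuleExtensions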
33. Catalog listeners
● ExternalCatalogWithListener
val catalogEvents = new scala.collection.mutable.ListBuffer[ExternalCatalogEvent]()
TestedSparkSession.sparkContext.addSparkListener(new SparkListener {
override def onOtherEvent(event: SparkListenerEvent): Unit = {
event match {
case externalCatalogEvent: ExternalCatalogEvent =>
catalogEvents.append(externalCatalogEvent)
case _ => {}
}
}
})
// ExternalCatalogEvent = (CreateTablePreEvent, CreateTableEvent, AlterTableEvent, ...)
34. Lessons learned
● Apache Spark first ⇒ do not write UDF just to write one, prefer native API
● debug & log
● analyze
● disable rules - much easier
● start small
● find inspirations → NoSQL connectors, "extends SparkPlan", "extends Rule[LogicalPlan]"
● test at scale
explain the idea of the talk and how the series about customizing Apache Spark started
I will show it later in detail, but everything I will present, do it only if you have no other choice, like no existing SQL operator, no existing data source or optimization (give an example of the Cassandra join optimization I found the other day)
Moreover, SparkSessionExtensions are still a @DeveloperApi, so more or less "use at your own risk".
TODO: does it work in PySpark?
3 types, actually 2 but it's worth knowing
if you did RDBMS before, you will retrieve very similar principles
code - a function, a class, no need to delve deep into the details; much easier, but also a risk of overuse; I will show it later
TODO: explain the purpose of VectorUDT and MatrixUDT (Spark MLlib)
was made private in https://issues.apache.org/jira/browse/SPARK-14155 because a new UDT API, supporting vectorized (batch) data and working better with Datasets, was supposed to replace it
e.g. enum
sqlType → only intended to represent the type at the Apache Spark storage level. It's not exposed to the end user, so you can't do df.filter("myUdt.field_a = 'a'").show()! See here for more information: you get this error because the schema defined by sqlType is never exposed and is not intended to be accessed directly. It simply provides a way to express complex data types using native Spark SQL types. To access these properties, either use row.getAs[MyUdt] or a UDF https://stackoverflow.com/questions/33747851/spark-sql-referencing-attributes-of-udt?lq=1
pyUDT = paired Python UDT if exists
as of this writing (16.08.2019), the ticket intended to make the API public again (https://issues.apache.org/jira/browse/SPARK-7768) is still in progress and there is no information about how far along it is; the targeted release is 3.0 but it probably won't make it
How to use? You can directly access the property of given type in map or filter function, see an example here https://stackoverflow.com/a/51957666/9726075
UDT - you can use it in `row.getAs[MyType]("column")` methods, so in any mapping, filter, groupBy function
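A minimal sketch of that access pattern, reusing the cities DataFrame and the City UDT from the example slide:

// the UDT's sqlType is only a storage detail; the user object comes back in typed functions
cities
  .filter(row => row.getAs[City]("city").isFrench)
  .show()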
MatrixUDT & VectorUDT - both are private and should be used from org.apache.spark.ml.linalg.SQLDataTypes
https://issues.apache.org/jira/browse/SPARK-14155 https://issues.apache.org/jira/browse/SPARK-7768
I was still coding in Java, the code is a little bit longer but I use shorter version for presentation purposes https://www.waitingforcode.com/apache-spark-sql/used-defined-type/read
Vectorized UDF - a normal UDF operates on one row at a time; mostly used for ML purposes and, more exactly, as a @pandas_udf which applies a function on a pandas Series rather than row by row; for some cases the speedup is about 242x!
returns a column, so you can't use it for instance, inside an aggregation
deterministic - sometimes the query planner can skip some optimizations and degrade the performance; if executed multiple times for the same input, the function always generates the same result
TODO: add this to the link with resources https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-function-transact-sql?view=sql-server-2017
https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_5009.htm
blackbox ⇒ be careful about the implementation, Apache Spark doesn't know how to optimize them for you
udf.register for PySpark works too
you can later use "MyUDF" in the string expressions
you can't do that for udf(...) which simply transforms a Scala function into the UDF that you can use later in operations like letterNumbers.select($"letter", udfEvenResolver($"number") as "isEven") for val udfEvenResolver = udf(evenFlagResolver _)
>>> wrapping - a great anti-pattern and proof that UDF will perform worse than native Apache Spark code most of the time - if used in wrong context
not to blame anyone but simply to highlight the simplicity, which is good and bad at the same time
I don't know why? To simplify? To write a unit test? But we can still write it with Apache Spark
https://stackoverflow.com/questions/46464125/how-to-write-multiple-if-statements-in-spark-udf/46464610 https://stackoverflow.com/questions/55135347/how-to-pass-dataframe-to-spark-udf
https://stackoverflow.com/questions/35905273/using-a-udf-in-spark-data-frame-for-text-mining/35908115
https://stackoverflow.com/questions/47985382/how-to-use-udf-in-where-clause-in-scala-spark?rq=1
https://stackoverflow.com/questions/50760841/spark-sql-udf-cast-return-value?rq=1
ML: https://stackoverflow.com/questions/53551000/spark-create-dataframe-in-udf
In clause: https://stackoverflow.com/questions/57109478/filtering-a-datasetrow-if-month-is-in-list-of-integers
CREATE aggregate = PostgreSQL, SQL Server
use cases - any custom aggregates, like geometric mean, weighted mean
deterministic - if 2 calls of the same function (with the same parameters) always return the same results. It's mostly used in plan optimization and sometimes during the analysis step: when the child node is not deterministic, it shouldn't appear in the aggregation:
if (!child.deterministic) {
failAnalysis(
s"nondeterministic expression ${expr.sql} should not " +
s"appear in the arguments of an aggregate function.")
}
failAnalysis(
s"""nondeterministic expressions are only allowed in
|Project, Filter, Aggregate or Window, found:
| ${o.expressions.map(_.sql).mkString(",")}
|in operator ${operator.simpleString}
""".stripMargin)
custom aggregate can be called after groupBy(...) method, exactly like the aggregates like average, sum and so forth
evaluate - final result
bufferSchema → a UDAF works on partial aggregates and this schema represents the intermediate results. That's why it's different from dataType, which describes the returned value.
merge - merges partial results
update - adds a new value to the buffer
https://www.waitingforcode.com/apache-spark-sql/user-defined-aggregate-functions/read
examples: https://stackoverflow.com/questions/4421768/the-most-useful-user-defined-aggregate-functions
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCreateUDA.html
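The notes above map one-to-one to the UserDefinedAggregateFunction contract; a minimal, hedged sketch of the geometric mean aggregate mentioned earlier (class and column names are made up):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class GeometricMean extends UserDefinedAggregateFunction {
  // input: a single numeric column
  override def inputSchema: StructType = StructType(Seq(StructField("value", DoubleType)))
  // intermediate state: count of values and sum of their logarithms
  override def bufferSchema: StructType = StructType(Seq(
    StructField("count", LongType), StructField("logSum", DoubleType)))
  override def dataType: DataType = DoubleType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0.0
  }
  // update - adds a new value to the partial aggregate
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + 1L
      buffer(1) = buffer.getDouble(1) + math.log(input.getDouble(0))
    }
  }
  // merge - combines two partial aggregates
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getDouble(1) + buffer2.getDouble(1)
  }
  // evaluate - produces the final result
  override def evaluate(buffer: Row): Any =
    if (buffer.getLong(0) == 0L) null else math.exp(buffer.getDouble(1) / buffer.getLong(0))
}

// usage after groupBy, like any built-in aggregate:
// df.groupBy($"group").agg(new GeometricMean()($"value") as "geoMean")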
https://issues.apache.org/jira/browse/SPARK-18127 - adds support for extensions
say that "catalog listeners are not really there"
https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/
once the plan is parsed, it is still unresolved - Apache Spark simply knows which SQL operators and corresponding logical plan nodes should be executed
later it uses the metadata catalog in order to resolve the data - that creates a fully executable logical plan
that logical plan could be executed as is, but before that, the engine also tries to optimize it by applying optimization rules
logical - reduce the # of operations and apply some of them at the data source level; optimizations executed iteratively in batch (spark.sql.optimizer.maxIterations, default = 100)
parser is not called for any methods invoking the API like map(...), select("", "") because it directly constructs logical plan nodes
order is defined in org.apache.spark.sql.catalyst.analysis.Analyzer#batches:
resolution rules are first
post-hoc resolution rules are next
extended check rules (analysis rules)
they're executed after resolving the plan: org.apache.spark.sql.catalyst.analysis.Analyzer#executeAndCheck
org.apache.spark.sql.internal.BaseSessionStateBuilder#sqlParser for parser
org.apache.spark.sql.catalyst.rules.RuleExecutor#execute ⇒ executed logical optimizations; called by lazy val optimizedPlan: LogicalPlan = sparkSession.sessionState.optimizer.execute(withCachedData) !
org.apache.spark.sql.internal.BaseSessionStateBuilder#analyzer has all rules applied during the analysis stage
org.apache.spark.sql.internal.BaseSessionStateBuilder#optimizer ⇒ all logical optimization rules
parsePlan → SQL query (SELECT * …)
parseExpression → nr > 1 (nr = col)
parseTableIdentifier → converts a table name into a TableIdentifier, e.g. DataFrameWriter.insertInto (.write.insertInto); just a case class holding table name and database attributes
AstBuilder: converts an ANTLR4 ParseTree into a catalyst Expression, LogicalPlan or TableIdentifier.
https://blog.octo.com/mythbuster-apache-spark-parsing-requete-sql/
https://www.slideshare.net/SandeepJoshi55/apache-spark-undocumented-extensions-78929290
"SELECT", "i" ⇒ tokens built from lexer phasis
AST later built from parser phasis
handles unresolved ⇒ I consider it as unknown. If you define an alias in your query, Apache Spark doesn't know whether the columns really exist. It has to resolve them
* - UnresolvedStar, resolves attributes directly from the SubqueryAlias dataset_1 when the UnresolvedStar#expand method is called
alias for: dataset1.sqlContext.sql("SELECT nr + 1 + 3 + 4, letter AS letter2, nr AS nr2 FROM dataset_1").explain(true)
relation: UnresolvedRelation, Holds the name of a relation that has yet to be looked up in a catalog. UnresolvedRelation becomes +- Project [_1#3 AS letter#7, _2#4 AS nr#8, _3#5 AS a_flag#9] +- LocalRelation [_1#3, _2#4, _3#5] for val dataset1 = Seq(("A", 1, 1), ("B", 2, 1), ("C", 3, 1), ("D", 4, 1), ("E", 5, 1)).toDF("letter", "nr", "a_flag")
dataset1.createOrReplaceTempView("dataset_1")
dataset1.sqlContext.sql("SELECT letter AS letter2, nr AS nr2 FROM dataset_1").explain(true)
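the intermediate plans can also be inspected programmatically through QueryExecution (sketch, reusing the dataset_1 view defined above):
val query = dataset1.sqlContext.sql("SELECT letter AS letter2, nr AS nr2 FROM dataset_1")
query.queryExecution.logical        // parsed, still unresolved plan ('UnresolvedRelation `dataset_1`)
query.queryExecution.analyzed       // resolved plan (Project over SubqueryAlias/LocalRelation)
query.queryExecution.optimizedPlan  // plan after the logical optimizations
query.queryExecution.executedPlan   // physical plan (SparkPlan)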
order does matter, e.g. for DataSourceAnalysis which "must be run after `PreprocessTableCreation` and `PreprocessTableInsertion`." ;
e.g. DataSourceAnalysis - replaces generic operations like InsertIntoTable by more specific (Spark SQL) operations, e.g. InsertIntoDataSourceCommand; another example InsertIntoDir ⇒ InsertIntoDataSourceDirCommand
TODO: show INSERT INTO TABLE tab1 SELECT 1, 2
TODO: generate an example with RunnableCommand
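a rough sketch of such a command (hypothetical name; RunnableCommand lives in org.apache.spark.sql.execution.command, is internal API and its signature may differ between versions):
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.command.RunnableCommand

// hypothetical command, executed eagerly on the driver when the plan runs,
// just like InsertIntoDataSourceCommand is for inserts
case class LogTableNameCommand(tableName: String) extends RunnableCommand {
  override def output: Seq[Attribute] = Seq.empty

  override def run(sparkSession: SparkSession): Seq[Row] = {
    println(s"about to touch table $tableName")
    Seq.empty[Row]
  }
}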
order does matter ⇒ from the DataSourceAnalysis scaladoc:
/**
 * Replaces generic operations with specific variants that are designed to work with Spark
 * SQL Data Sources.
 *
 * Note that, this rule must be run after `PreprocessTableCreation` and
 * `PreprocessTableInsertion`.
 */
case class DataSourceAnalysis(conf: SQLConf) extends Rule[LogicalPlan] with CastSupport {
fail-fast approach - executed before physically running the query
mostly executed as a pattern matching on the LogicalPlan nodes
examples: PreWriteCheck (e.g. "Cannot insert into table that is also being read from"), PreReadCheck (e.g. the input_file_name function in Hive, https://issues.apache.org/jira/browse/SPARK-21354, which does not support more than one source)
see this: https://github.com/apache/spark/commit/2b10ebe6ac1cdc2c723cb47e4b88cfbf39e0de08#diff-73bd90660f41c12a87ee9fe8d35d856a for HiveSupport
override val extendedCheckRules: Seq[LogicalPlan => Unit] =
  PreWriteCheck +:
  PreReadCheck +:
  HiveOnlyCheck +: …
PreWriteCheck: "A rule to do various checks before inserting into or writing to a data source table.", e.g. it does not allow writing to the table used as the source (INSERT INTO clause)
PreReadCheck: "A rule to do various checks before reading a table."
HiveOnlyCheck: "A rule to check whether the functions are supported only when Hive support is enabled", i.e. it verifies that you do not execute Hive queries without Hive support enabled
here org.apache.spark.sql.execution.datasources.PreWriteCheck$#failAnalysis is your friend - it only checks and fails fast, like a Java assert() would
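custom check rules can be plugged into the same fail-fast spot through the extension points (sketch; the rule and the exception are illustrative - Spark's own checks go through failAnalysis):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}

// hypothetical check rule: reject joins without a join condition
val noConditionlessJoins: SparkSession => LogicalPlan => Unit = { _ => plan =>
  plan.foreach {
    case join: Join if join.condition.isEmpty =>
      throw new UnsupportedOperationException("joins without a condition are not allowed here")
    case _ => // nothing to check
  }
}

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(_.injectCheckRule(noConditionlessJoins))
  .getOrCreate()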
rules can be excluded via the spark.sql.optimizer.excludedRules property
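sketch (works only for rules that are not on the optimizer's non-excludable list):
// can also be passed with --conf or spark-defaults.conf
spark.conf.set("spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")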
dataset reduction - the plan is rewritten to execute filters on the data source side, e.g. PushDownPredicate; it reverses Filter and Project:
case Filter(condition, project @ Project(fields, grandChild))
    if fields.forall(_.deterministic) && canPushThroughCondition(grandChild, condition) =>
  // Create a map of Aliases to their values from the child projection.
  // e.g., 'SELECT a + b AS c, d ...' produces Map(c -> a + b).
  val aliasMap = AttributeMap(fields.collect {
    case a: Alias => (a.toAttribute, a.child)
  })
  project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild))
e.g.
// *Project [amount#8L, id#9L]
// +- *Filter (isnotnull(amount#8L) && (amount#8L > 10))
instead of the Filter sitting above the Project; the plan is read bottom-up, so the filter is executed before the projection
transform - recursively applies the rule on the AST; it can go top-down or bottom-up (bottom-up visits the children first and the current node at the end); e.g. operations are reordered for predicate pushdown (Filter with Project), sometimes operations are replaced (e.g. 2 Filter nodes merged into 1 Filter node containing both conditions, which can even be removed later if the filter is always true) or removed (e.g. when the filter is always true, or when the same SELECT is called twice)
resolve - similar to transform but skips already analyzed sub-trees; when resolve* is called, Apache Spark starts by checking the analyzed flag of the plan and, if the flag is already set, it simply skips the rule logic. Important point to note: even though you use resolve*, a transform* can create a completely new plan and invalidate the value of the analyzed flag. Resolve applies mostly to the nodes that can be evaluated only once, like alias resolution, substitution methods (inclusion of the CTE plan, children plans substituted with window spec definitions), relation resolution
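a sketch of a custom rule built on transform and plugged in through the extension points (illustrative only: it mimics what Spark's own CombineFilters already does):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.And
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// collapses two adjacent Filter nodes into a single one holding both conditions
object CombineAdjacentFilters extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(outerCondition, Filter(innerCondition, child)) =>
      Filter(And(innerCondition, outerCondition), child)
  }
}

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(_.injectOptimizerRule(_ => CombineAdjacentFilters))
  .getOrCreate()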
Node - kind of container; most of the time it's what will be interpreted in custom logical optimizations
Expression - a stringified version of what you want to do with the operator, e.g. a list of columns or a filter expression (simple comparison, IN statement); globally it can be considered as a method taking some input and generating some output
different variants: unary (1 input, 1 output), named (e.g. an alias), binary (2 inputs, 1 output), ternary (3 inputs, 1 output, e.g. months_between)
PART OF SPARKPLAN
sequential execution, one tree level at a time; doExecute can call leftInput.execute() and internally it operates mostly on RDD functions like map, mapPartitions, foreachPartition and so on
doExecuteBroadcast - used for instance in BroadcastHashJoinExec to broadcast one side of the join to the executors
doPrepare - if something must be initialized before the physical execution; e.g. subquery execution initializes the subquery here, defined as a lazy Future → private lazy val relationFuture: Future[Array[InternalRow]], and doPrepare simply triggers it:
protected override def doPrepare(): Unit = {
  relationFuture
}
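a rough sketch of a custom physical node on the Spark 2.x API (hypothetical; a real operator would also care about partitioning, ordering and possibly codegen, and newer Spark versions additionally require withNewChildInternal):
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}

// hypothetical pass-through operator
case class PassThroughExec(child: SparkPlan) extends UnaryExecNode {
  // this node does not change the schema
  override def output: Seq[Attribute] = child.output

  // sequential execution: ask the child for its RDD[InternalRow] and build on top of it
  override protected def doExecute(): RDD[InternalRow] =
    child.execute().mapPartitions { rows =>
      // a real operator would transform the rows here; this sketch forwards them untouched
      rows
    }
}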
PART OF CODEGENSUPPORT trait
doProduce ⇒ generates the code that produces the rows to process
doConsume ⇒ generates the code that processes the rows (or columns) produced by the child physical plan
inputRDDs → returns the RDDs of InternalRow providing the input rows for this plan
codegen optimizes CPU usage by generating a single optimized function in bytecode for the set of operators in a SQL query (when possible), instead of generating iterator code for each operator.
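the generated code can be inspected with the debug helpers (sketch):
import org.apache.spark.sql.execution.debug._
// prints each whole-stage codegen subtree together with its generated Java code
spark.sql("SELECT letter, nr FROM dataset_1 WHERE nr > 1").debugCodegen()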
https://www.slideshare.net/datamantra/anatomy-of-spark-sql-catalyst-part-2
https://www.slideshare.net/databricks/a-deep-dive-into-spark-sqls-catalyst-optimizer-with-yin-huai
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
recall information about data catalogs
say that the comment is not true
but promising since catalog federation is an ongoing effort for Apache Spark → https://issues.apache.org/jira/browse/SPARK-15777
but despite the lack of support, you can still extend the catalogs
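not an extension point per se, but the public org.apache.spark.sql.catalog.Catalog API already lets you inspect what the session catalog knows about (sketch):
spark.catalog.listDatabases().show()
spark.catalog.listTables("default").show()
spark.catalog.listFunctions().filter("isTemporary = true").show()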
spark.sql.optimizer.excludedRules
find inspirations ⇒ not clearly documented