The Parquet Format and Performance Optimization Opportunities - Databricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
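As a rough illustration of the kind of tips discussed, here is a minimal Spark SQL sketch (table and column names are hypothetical and the codec choice is just an example) combining a compression setting, a partitioned Parquet table, and a query whose filters can benefit from partition pruning and row-group min/max skipping:
-- choose the per-file compression codec for Parquet output
SET spark.sql.parquet.compression.codec=snappy;
-- write a Parquet table partitioned by date to keep scans narrow
CREATE TABLE events_parquet
USING parquet
PARTITIONED BY (event_date)
AS SELECT * FROM events_staging;
-- the partition filter prunes directories; the user_id predicate can be
-- checked against row-group min/max statistics and dictionary pages
SELECT user_id, event_type
  FROM events_parquet
 WHERE event_date = DATE'2019-10-01'
   AND user_id = 42;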
This talk will break down merge in Delta Lake, explaining what is actually happening under the hood, and then cover how you can optimize a merge. Some code snippets and sample configs will also be shared.
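For orientation, a minimal sketch of such a merge in Delta Lake SQL (table and column names are hypothetical, not the snippets from the talk):
MERGE INTO target AS t
USING updates AS s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.value = s.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (s.id, s.value);
-- a common tuning step is narrowing the ON condition (for example with a
-- partition filter) so that fewer files have to be read and rewritten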
Oracle Database 19c builds upon key architectural, distributed data, and performance innovations established in the earlier Oracle Database 12c and 18c releases. Oracle 19c has many new features; in this presentation we have covered the areas below:
Automated Installation, Configuration and Patching
AutoUpgrade and Database Utilities
Parquet performance tuning: the missing guide - Ryan Blue
Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet’s features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he’s learned, creating the missing guide you need.
Topics include:
* The tools and techniques Netflix uses to analyze Parquet tables
* How to spot common problems
* Recommendations for Parquet configuration settings to get the best performance out of your processing platform
* The impact of this work in speeding up applications like Netflix’s telemetry service and A/B testing platform
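As a hedged sample of the kind of knobs such a guide typically covers (the property names below are standard Spark and parquet-mr settings, but the values are purely illustrative, not recommendations):
-- session-level Spark SQL settings
SET spark.sql.parquet.compression.codec=snappy;    -- codec applied per page
SET spark.sql.files.maxPartitionBytes=134217728;   -- read-split size (~128 MB)
-- parquet-mr writer settings, usually passed as Hadoop configuration:
--   parquet.block.size         row-group size; larger groups favor big scans
--   parquet.page.size          page size inside each column chunk
--   parquet.enable.dictionary  toggle dictionary encoding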
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro - Databricks
Zstandard is a fast compression algorithm which you can use in Apache Spark in various ways. In this talk, I briefly summarize the evolution of Apache Spark in this area, four main use cases, their benefits, and the next steps:
1) ZStandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It’s beneficial not only when you use `emptyDir` with the `memory` medium, but it also maximizes the OS cache benefit when you use shared SSDs or container-local storage. In Spark 3.2, SPARK-34390 takes advantage of the Zstandard buffer pool feature, and its performance gain is impressive, too.
2) Event log compression is another area where you can save storage cost on cloud storage like S3 and improve usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 already supports Zstandard, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There is more community work underway to utilize Zstandard to improve Spark. For example, the Apache Avro community also supports Zstandard, and SPARK-34479 aims to support Zstandard in Spark's Avro file format in Spark 3.2.0.
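For reference, a hedged sketch of the corresponding settings roughly as they would appear in spark-defaults.conf (availability and defaults depend on the Spark, Parquet and ORC versions in use):
# shuffle, spill and broadcast block compression
spark.io.compression.codec            zstd
# compressed event logs on cloud storage such as S3
spark.eventLog.compress               true
spark.eventLog.compression.codec      zstd
# data file compression for the built-in columnar sources
spark.sql.parquet.compression.codec   zstd
spark.sql.orc.compression.codec       zstd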
In Spark SQL the physical plan provides the fundamental information about the execution of the query. The objective of this talk is to convey understanding and familiarity of query plans in Spark SQL, and use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in the query plan and explain some relevant information that can be useful in order to understand some details about the execution. If you understand the query plan, you can look for the weak spot and try to rewrite the query to achieve a more optimal plan that leads to more efficient execution.
The main content of this talk is based on Spark source code but it will reflect some real-life queries that we run while processing data. We will show some examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what is happening under the hood when the plan is generated focusing mainly on the phase of physical planning. In general, in this talk we want to share what we have learned from both Spark source code and real-life queries that we run in our daily data processing.
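As a small illustration (the query and tables are hypothetical), Spark can print the physical plan directly from SQL; in Spark 3.x, EXPLAIN FORMATTED separates the operator tree from the per-operator details:
EXPLAIN FORMATTED
SELECT o.customer_id, SUM(o.amount) AS total
  FROM orders o
  JOIN customers c ON o.customer_id = c.id
 WHERE c.country = 'AT'
 GROUP BY o.customer_id;
-- typical operators to look for in the output include FileScan (with its
-- PushedFilters and PartitionFilters), Exchange (a shuffle), the join
-- strategy (BroadcastHashJoin vs. SortMergeJoin) and HashAggregate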
ORC File and Vectorization - Hadoop Summit 2013 - Owen O'Malley
Eric Hanson and I gave this presentation at Hadoop Summit 2013:
Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. Hive 0.11 added a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding — resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query.
Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technical breakthrough called vectorized query execution works nicely with column store formats to do this. Vectorized query execution has proven to give dramatic performance speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive, coupling it with ORC with a vectorized iterator.
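To make the storage-side ideas concrete, here is a small hypothetical Hive-style example of an ORC table that layers generic compression on top of ORC's lightweight encodings, plus a query that benefits from projection and min/max skipping:
CREATE TABLE sales_orc (
  customer_id BIGINT,
  amount      DOUBLE,
  sale_date   DATE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");
-- only the referenced columns are read, and the built-in min/max indexes
-- let the reader skip sets of rows that cannot match the filter
SELECT SUM(amount)
  FROM sales_orc
 WHERE sale_date = '2013-06-01';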
Modeling Data and Queries for Wide Column NoSQL - ScyllaDB
Discover how to model data for wide column databases such as ScyllaDB and Apache Cassandra. Contrast the difference from traditional RDBMS data modeling, going from a normalized “schema first” design to a denormalized “query first” design. Plus how to use advanced features like secondary indexes and materialized views to use the same base table to get the answers you need.
Hive Bucketing in Apache Spark with Tejas Patil - Databricks
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance by 3-5x when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in a 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading of cumulative tables with daily delta, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
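For reference, a minimal Spark SQL sketch of a bucketed, sorted table (hypothetical names and bucket count; the DDL shown is the datasource-table form):
CREATE TABLE user_events (
  user_id    BIGINT,
  event_type STRING,
  event_time TIMESTAMP
)
USING parquet
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 128 BUCKETS;
INSERT INTO user_events
SELECT user_id, event_type, event_time FROM raw_events;
-- a join on the bucketing key of two tables bucketed the same way can run
-- as a single-stage sort-merge join, avoiding the shuffle and sort
SELECT e.user_id, p.plan
  FROM user_events e
  JOIN user_profiles p ON e.user_id = p.user_id;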
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder has been involved in a number of Exadata migration projects since its introduction, mostly in the area of performance assurance, troubleshooting and capacity planning.
These slides, originally presented at UKOUG in 2010, cover some of the most interesting challenges, surprises and lessons learnt from planning and executing large Oracle database migrations to Exadata v2 platform.
This material does not just repeat the marketing material or Oracle's official whitepapers.
Radical Speed for SQL Queries on Databricks: Photon Under the Hood - Databricks
Join this session to hear from the Photon product and engineering team talk about the latest developments with the project.
As organizations embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly ingest and analyze massive amounts and types of data. With their data lakes, organizations can store all their data assets in cheap cloud object storage. But data lakes alone lack robust data management and governance capabilities. Fortunately, Delta Lake brings ACID transactions to your data lakes – making them more reliable while retaining the open access and low storage cost you are used to.
Using Delta Lake as its foundation, the Databricks Lakehouse platform delivers a simplified and performant experience with first-class support for all your workloads, including SQL, data engineering, data science & machine learning. With a broad set of enhancements in data access and filtering, query optimization and scheduling, as well as query execution, the Lakehouse achieves state-of-the-art performance to meet the increasing demands of data applications. In this session, we will dive into Photon, a key component responsible for efficient query execution.
Photon was first introduced at Spark and AI Summit 2020 and is written from the ground up in C++ to take advantage of modern hardware. It uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level parallelism in CPUs, enhancing performance on real-world data and applications — all natively on your data lake. Photon is fully compatible with the Apache Spark™ DataFrame and SQL APIs to ensure workloads run seamlessly without code changes. Come join us to learn more about how Photon can radically speed up your queries on Databricks.
Optimizing Spark jobs through a true understanding of Spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your Spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
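Several of these questions come down to a handful of settings and operations; a hedged Spark SQL example (table names are hypothetical and the values are illustrative only):
SET spark.sql.shuffle.partitions=400;              -- partitions produced by shuffles
SET spark.sql.files.maxPartitionBytes=134217728;   -- target read-split size (~128 MB)
-- reduce the number of output files by repartitioning before the write
INSERT OVERWRITE TABLE daily_summary
SELECT /*+ REPARTITION(16) */ *
  FROM staging_summary;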
This session is for you if you want to learn tips and techniques that are used to optimize database development, with special emphasis on SQL Server 2005. If you write a lot of stored procedures and want to learn the tools of a DBA, this is the session for you. If you are new to the SQL Server development environment, you will learn how the various constructs compare to each other and how better performance can be produced every time, with a brief introduction to understanding Execution Plans.
This session aims at introducing a less familiar audience to the Oracle database statistics concept, why statistics are necessary, and how the Oracle Cost-Based Optimizer uses them.
Reduce Amazon RDS Costs up to 50% with Proxies - John Varghese
Amazon RDS is one of the more expensive line items in an AWS bill. In this session, we will discuss techniques to offload SQL for improved performance while reducing database costs. Features include:
Query caching into Amazon ElastiCache
Read/Write split
We will go over customer case studies on how they were able to drive down costs while scaling out.
Photon Technical Deep Dive: How to Think Vectorized - Databricks
Photon is a new vectorized execution engine powering Databricks written from scratch in C++. In this deep dive, I will introduce you to the basic building blocks of a vectorized engine by walking you through the evaluation of an example query with code snippets. You will learn about expression evaluation, compute kernels, runtime adaptivity, filter evaluation, and vectorized operations against hash tables.
PostgreSQL Replication High Availability Methods - Mydbops
These slides illustrate the need for replication in PostgreSQL, why you need a replication DB topology, terminologies, replication nodes, and more.
Traditionally, database systems were optimized either for OLAP or for OLTP workloads. Mainstream DBMSes like Postgres, MySQL, ... are mostly used for OLTP, while Greenplum, Vertica, Clickhouse, SparkSQL, ... are oriented toward analytic queries. But today many companies do not want to have two different data stores for OLAP and OLTP and need to perform analytic queries on the most recent data. I want to discuss which features should be added to Postgres to efficiently handle HTAP workloads.
Designing for Performance: Database Related Worst Practices - Christian Antognini
Optimal performance is not simply a product one can buy but rather the result of accurate planning and a correct implementation. Given that applications should be designed for performance, it would be useful to cover an approach to doing that in great detail. However, for obvious reasons, it is not a subject that can be covered in a presentation. For this reason, I limit myself to briefly describing the most common database-related design problems that frequently lead to suboptimal performance.
IT 655 Milestone One Guidelines and Rubric Presenta.docx - aryan532920
IT 655 Milestone One Guidelines and Rubric
Presentation for the Chief Information Officer (CIO) Overview:
IT professionals are often tasked with optimizing, tuning, and resolving various errors associated with large commercial database systems. As the world becomes
more and more focused on the collection and use of data for various purposes, there is a greater need for building database applications that serve numerous
complex functions. The final assessment will provide a comprehensive overview of the concepts, techniques, and skills that you have acquired throughout the
course. The ability to interface a database management system with other applications and even third-party services is a commonly performed task in this
discipline. The scenario, presented below, provides a real-world situation in which such skills will need to be employed.
Scenario:
The County of Everstone is a large county in a densely populated state in the United States. This county has a total of 34 incorporated cities within its boundaries.
Overall, the county has a population exceeding 4 million residents. The county employs a proprietary database management system that is responsible for the
county’s property tax assessment, parcel, and payment information. In a recent initiative, the county has been highly promoting the use of its online property tax
payment and lookup information to reduce mailed payments and in-office visits. The marketing campaign for online property tax payments and information
lookup has been highly successful. However, the county’s database systems have not sufficiently handled the increased traffic. As such, the web application
available to residents often experiences high amounts of latency, frequently locks up, or in many instances is unavailable.
The board of directors has asked executive IT management to look into the issue and provide a prompt resolution to the matter. A review of the hardware and
software systems that the database management systems employ reveals that these systems “should” be able to take the capacity and load. Therefore, the IT
management has ruled out the hardware systems as being the issue. Upon further investigation, a lead database systems administrator preliminarily reports that
the issue appears to stem not from the hardware, but rather from a need to provide performance tuning on the database. The database design seems to be the
culprit, since this database systems administrator noted large quantities of duplicated data, poorly written database queries, and triggers.
Furthermore, the chief information officer (CIO) has also directed the web programmers and web developers to look at the web application for potential issues.
The web development team has reported that while no design or programming issues were found with the web application, the stored procedures (queries) that
it runs from the database server take long periods of time to run and frequently lock up the web server ...
Best Laid Plans: Saving Time, Money and Trouble with Optimal Forecasting - Eric Kavanagh
Expectations have changed. That's true for users, executives and customers alike. There's no time for systems running slowly, or cost overruns. That's why fundamentals like capacity planning have become mission-critical. By paying attention to the details, and doing effective forecasts, companies can optimize their information architecture, keeping everyone happy. Register for this episode of Hot Technologies to learn from veteran Analysts Dr. Robin Bloor and Rick Sherman who will offer insights about how and why to do capacity planning. They'll be briefed by Bullett Manale of IDERA, who will explain how his company's SQL Diagnostic Manager can track a wide range of usage metrics which can be used for accurate forecasting.
This complete deck can be used to present to your team. It has PPT slides on various topics highlighting all the core areas of your business needs. This complete deck focuses on Database Performance Improvements Environment Document Requirement Planning Analysis and has professionally designed templates with suitable visuals and appropriate content. This deck consists of total of twelve slides. All the slides are completely customizable for your convenience. You can change the colour, text and font size of these templates. You can add or delete the content if needed. Get access to this professionally designed complete presentation by clicking the download button below. https://bit.ly/3rZcc59
I built this presentation for Informatica World in 2006. It is all about Data Administration, Data Quality and Data Management. It is NOT about the Informatica product. This presentation was a hit, with standing room only and about 150 people in attendance. The content is still useful and applicable today. If you want to use my material, please put (C) Dan Linstedt, all rights reserved, http://LearnDataVault.com
Government and Education Webinar: Simplify Your Database Performance Manageme... - SolarWinds
As a data professional, you need tools to help you detect and diagnose performance issues and errors quickly. Learn how our portfolio of database tools can improve and amplify your ability to keep your databases in top condition.
A Database Developer (DBD) designs and implements database systems, writes queries, and integrates databases with applications. They focus on creating and optimizing database schemas, writing queries, and developing applications that interact with databases.
In addition, DBDs are responsible for developing database engineering solutions, which involve tasks like designing and implementing data pipelines, optimizing database performance in distributed and cloud-based environments, and implementing data governance policies and access controls.
These solutions ensure the databases are designed, built, and maintained to meet the specific needs of an application or organization.
Database Administrator: Job Description, Salary and Future Scope - HR Krutika Meheta
What does a database administrator do?
A Database Administrator (DBA) is solely responsible for the performance, reliability and protection of a database. They will also be involved in the planning, growth and development of the database, as well as fixing any problems reported by the users.
Term Paper Virtualization Due Week 10 and worth 210 points This.docx - mattinsonjanel
Term Paper: Virtualization
Due Week 10 and worth 210 points
This assignment contains two (2) sections: Written Report and PowerPoint Presentation. You must submit both sections as separate files for the completion of this assignment. Label each file name according to the section of the assignment it is written for. Additionally, you may create and / or assume all necessary assumptions needed for the completion of this assignment.
According to a TechRepublic survey performed in 2013, (located at http://www.techrepublic.com/blog/data-center/research-desktop-virtualization-growing-in-popularity/#) desktop virtualization is growing in popularity. Use the Internet and Strayer Library to research this technique. Research the top three (3) selling brands of virtualization software.
Imagine that the Chief Technology Officer (CTO) of your organization, or of an organization in which you are familiar, has tasked you with researching the potential for using virtualization in the organization. You must write a report that the CTO and many others within the organization will read. You must also summarize the paper and share your key ideas, via a PowerPoint presentation, with the CTO and steering committee of the organization.
This paper and presentation should enlighten the organization as to whether or not virtualization is a worthwhile investment that could yield eventual savings to the organization.
Section 1: Written Report
1. Write an six to eight (6-8) page paper in which you:
0. Compare and contrast the top three (3) brands of virtualization software available. Focus your efforts on components such as standard configuration, hardware requirements price, and associated costs.
0. Examine the major pros and major cons of each of the top three (3) software packages available. Recommend the virtualization software that you feel is most appropriate for the organization. Provide a rationale for your recommendation.
0. Explore the major advantages and major disadvantages that your chosen organization may experience when using virtualization software. Give your opinion on whether or not you believe virtualization software is the right fit for your chosen company. Provide a rationale for your response.
0. Create a Microsoft Word table that identifies the advantages, disadvantages, computer requirements, initial costs, and future savings for an organization considering an engagement in virtualization.
0. Use at least six (6) quality resources in this assignment. Note: Wikipedia and similar Websites do not qualify as quality resources.
Your assignment must follow these formatting requirements:
1. Be typed, double spaced, using Times New Roman font (size 12), with one-inch margins on all sides; citations and references must follow APA or school-specific format. Check with your professor for any additional instructions.
1. Include a cover page containing the title of the assignment, the student’s name, the professor’s name, the course title, and the date. The ...
Standard SQL features where PostgreSQL beats its competitors - Markus Winand
The SQL standard has more than 4300 pages and hundreds of optional features. The number of features offered by different products varies vastly. PostgreSQL implements a relatively large number of them.
In this session I present some standard SQL features that work in PostgreSQL, but not in other popular open-source databases. But when it comes to standard conformance, PostgreSQL doesn’t even need to fear the comparison to its commercial competitors: PostgreSQL also supports a few useful standard SQL features that don’t work in any of the three most popular commercial SQL databases.
ISO SQL:2016 introduced Row Pattern Matching: a feature to apply (limited) regular expressions on table rows and perform analysis on each match. As of writing, this feature is only supported by the Oracle Database 12c.
Backend to Frontend: When database optimization affects the full stack - Markus Winand
OFFSET is the root of all evil: it's slow and actually delivers the wrong result. OFFSET is just not the right answer to implement pagination. There is a better approach, called Key-Set Pagination, that is superior in many regards. However, introducing it to your application affects the full technology stack up to "Layer 8": The User.
Modern SQL in Open Source and Commercial Databases - Markus Winand
SQL has gone out of fashion lately—partly due to the NoSQL movement, but mostly because SQL is often still used like 20 years ago. As a matter of fact, the SQL standard continued to evolve during the past decades resulting in the current release of 2016. In this session, we will go through the most important additions since the widely known SQL-92. We will cover common table expressions and window functions in detail and have a very short look at the temporal features of SQL:2011 and row pattern matching from SQL:2016.
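A compact, generic example of the two features covered in detail, a common table expression feeding a window function (the table and columns are hypothetical):
WITH monthly AS (
  SELECT customer_id,
         EXTRACT(YEAR FROM order_date)  AS yr,
         EXTRACT(MONTH FROM order_date) AS mon,
         SUM(amount)                    AS revenue
    FROM orders
   GROUP BY customer_id,
            EXTRACT(YEAR FROM order_date),
            EXTRACT(MONTH FROM order_date)
)
SELECT customer_id, yr, mon, revenue,
       SUM(revenue) OVER (PARTITION BY customer_id
                          ORDER BY yr, mon) AS running_total
  FROM monthly;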
Links:
http://modern-sql.com/
http://winand.at/
http://sql-performance-explained.com/
The SQL OFFSET keyword is evil. It basically behaves like SLEEP in other programming languages: the bigger the number, the slower the execution.
Fetching results in a page-by-page fashion in SQL doesn't require OFFSET at all but an even simpler SQL clause. Besides being faster, you don't have to cope with drifting results if new data is inserted between two page fetches.
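A minimal sketch of both approaches (hypothetical table; the key-set variant assumes the client passes in the sort-key values of the last row it has already seen):
-- OFFSET pagination: the database still produces and then discards the skipped rows
SELECT id, title, created_at
  FROM articles
 ORDER BY created_at DESC, id DESC
OFFSET 10000 ROWS FETCH NEXT 10 ROWS ONLY;
-- key-set ("seek") pagination: continue right below the last row of the previous page
SELECT id, title, created_at
  FROM articles
 WHERE (created_at, id) < (:last_created_at, :last_id)
 ORDER BY created_at DESC, id DESC
 FETCH FIRST 10 ROWS ONLY;
-- the row-value comparison is standard SQL; on databases that lack it,
-- the same condition can be spelled out with AND/OR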
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf - 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
JMeter webinar - integration with InfluxDB and Grafana - RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Transcript: Selling digital books in 2024: Insights from industry leaders - T... - BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Elevating Tactical DDD Patterns Through Object Calisthenics - Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... - Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
GraphRAG is All You need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
2. Takeaway #1: Pandemic Scale
It affects you!
(Symbolic image; not real data)
http://upload.wikimedia.org/wikipedia/commons/c/c7/2009_world_subdivisions_flu_pandemic.png
6. The Problem: Improper Index Use
“A very common cause of performance
problems is lack of proper indexes or the
use of queries that are not using
existing indexes.”
—Buda Consulting
http://www.budaconsulting.com/Portals/52677/docs/top_5_tech_brief.pdf
7. The Problem: Improper Index Use
“A very common cause of performance
problems is lack of proper indexes or the
use of queries that are not using
existing indexes.”
—Buda Consulting
http://www.budaconsulting.com/Portals/52677/docs/top_5_tech_brief.pdf
8. Quantifying the Problem
Percona White Paper:
Reasons of performance problems
that caused production downtime:
38% bad SQL
15% schema and indexing
http://www.percona.com/files/white-papers/causes-of-downtime-in-mysql.pdf
9. Quantifying the Problem
Survey by sqlskills.com:
Root causes of the last few SQL
Server performance problems:
27% T-SQL
19% Poor indexing
http://www.sqlskills.com/blogs/paul/survey-what-are-the-most-common-causes-of-performance-problems/
10. Quantifying the Problem
Craig S. Mullins (strategist and researcher):
„As much as 75% of poor relational performance
is caused by "bad" SQL and application code.”
Noel Yuhanna (Forrester Research):
„The key difficulties surrounding performance
continue to be poorly written SQL statements,
improper DBMS configuration and a lack of clear
understanding of how to tune databases to solve
performance issues.”
11. Quantifying the Problem
My observation:
~50% of SQL performance problems
are caused by improper index use
14. The Root Cause: DBAs are Indexing
How did databases
work before SQL?
15. The Root Cause: DBAs are Indexing
Index use was intrinsically
tied to the queries.
16. The Root Cause: DBAs are Indexing
Example: dBase
Developers had to...
...use indexes explicitly when searching:
set index to last_name
   find winand
...take care of index maintenance:
set index to last_name, idx2
    append
17. The Root Cause: DBAs are Indexing
SQL is an abstraction that only
defines the logical view.
The actual SQL implementation
takes care of everything else.
18. The Root Cause: DBAs are Indexing
Transactions
Constraints
Views
Tables
Data
manipulation
Queries
SQL (language)
has:
SQL Databases (software)
have:
19. The Root Cause: DBAs are Indexing
Backup
& recovery
Storage
management
Bugs &
patches
Tuning
parameters
Transactions
Constraints
Views
Tables
Data
manipulation
Queries
SQL (language)
has:
SQL Databases (software)
have:
High
Availability
20. The Root Cause: DBAs are Indexing
Indexes
Backup
& recovery
Storage
management
Bugs &
patches
Tuning
parameters
Transactions
Constraints
Views
Tables
Data
manipulation
Queries
SQL (language)
has:
SQL Databases (software)
have:
High
Availability
21. The Root Cause: DBAs are Indexing
Indexes
Backup
& recovery
Storage
management
Bugs &
patches
Tuning
parameters
Transactions
Constraints
Views
Tables
Data
manipulation
Queries
SQL Databases (software)
have:
Developers
High
Availability
22. The Root Cause: DBAs are Indexing
Indexes
Backup
& recovery
Storage
management
Bugs &
patches
Tuning
parameters
Transactions
Constraints
Views
Tables
Data
manipulation
Queries
Developers Administrators
High
Availability
23. The Root Cause: DBAs are Indexing
Indexing is considered a system
tuning task that belongs to the
administrators' responsibilities.
24. The Root Cause: DBAs are Indexing
A misconception that causes new problems:
25. The Root Cause: DBAs are Indexing
A misconception that causes new problems:
DBAs don’t know
the queries
Have to “investigate”
to find the queries.
It is time consuming and
almost always incomplete.
by G-10gian82
deviantart.com
26. The Root Cause: DBAs are Indexing
A misconception that causes new problems:
DBAs don’t know
the queries
Have to “investigate”
to find the queries.
It is time consuming and
almost always incomplete.
DBAs can’t change
the queries
Can make the index
match the query.
Can’t make the query
match the index!
29. The Solution: It’s a Dev Task
Indexes
Backup
& recovery
Storage
management
Tuning
parameters
Transactions
Constraints
Views
Tables
Data
manipulation
Queries
Developers Administrators
High
Availability
Bugs &
patches
30. The Solution: It’s a Dev Task
Indexes
Backup
& recovery
Storage
management
Tuning
parameters
Transactions
Constraints
Views
Tables
Data
manipulation
Queries
Developers Administrators
Must match!
High
Availability
Bugs &
patches
31. Another Problem: It’s not Taught
Indexes are not part of the pure SQL (language) literature
because indexes are not part of the SQL standard.
11 SQL books analyzed: only 1.0% of the pages are
about indexes (70 out of 7330 pages).
Examples:
Oracle SQL by Example: 2.0% (19/960)
Beginning DBs with PostgreSQL: 0.8% (5/664)
Learning SQL: 3.3% (11/336 — highest rate in class)
32. Another Problem: It’s not Taught
Proper index usage is sometimes covered in database
tuning books but is always buried between hundreds of
pages of HW, OS and DB parameterization topics.
14 database administration books analyzed: 5.1% of the
pages are about indexes (307 out of 6069 pages).
Examples:
Oracle Performance Survival Guide: 5.2% (38/730)
High Performance MySQL: 8% (55/684)
PostgreSQL 9 High Performance: 5.8% (27/468)
33. Another Problem: It’s not Taught
Consequence:
Developers don’t know how to use
indexes properly.
34. Another Problem: It’s not Taught
Consequence:
Developers don’t know how to use
indexes properly.
Results of the 3-minute online quiz:
http://use-the-index-luke.com/3-minute-test
5 questions: each about a specific index
usage pattern.
Non-representative!
35. Q1: Good or Bad? (Function use)
CREATE INDEX tbl_idx ON tbl (date_column);
SELECT text, date_column
  FROM tbl
 WHERE TO_CHAR(date_column, 'YYYY') = '1983';
3-Minute Quiz: Indexing Skills
36. Q1: Good or Bad? (Function use)
CREATE INDEX tbl_idx ON tbl (date_column);
SELECT text, date_column
  FROM tbl
 WHERE TO_CHAR(date_column, 'YYYY') = '1983';
3-Minute Quiz: Indexing Skills
37. Q1: Good or Bad? (Function use)
CREATE INDEX tbl_idx ON tbl (date_column);
SELECT text, date_column
  FROM tbl
 WHERE TO_CHAR(date_column, 'YYYY') = '1983';
3-Minute Quiz: Indexing Skills
38. 3-Minute Quiz: Indexing Skills
Q2: Good or Bad? (Indexed Top-N, no IOS)
CREATE INDEX tbl_idx ON tbl (a, date_col);
SELECT id, a, date_col
  FROM tbl
 WHERE a = $1
 ORDER BY date_col DESC
 LIMIT 10;
39. 3-Minute Quiz: Indexing Skills
Q2: Good or Bad? (Indexed Top-N, no IOS)
CREATE INDEX tbl_idx ON tbl (a, date_col);
SELECT id, a, date_col
  FROM tbl
 WHERE a = $1
 ORDER BY date_col DESC
 LIMIT 10;
Understandable
controversy!
40. 3-Minute Quiz: Indexing Skills
Q3: Good or Bad? (Column order)
CREATE INDEX tbl_idx ON tbl (a, b);
SELECT id, a, b FROM tbl
WHERE a = $1 AND b = $2;
SELECT id, a, b FROM tbl
WHERE b = $1;
41. 3-Minute Quiz: Indexing Skills
Q3: Good or Bad? (Column order)
CREATE INDEX tbl_idx ON tbl (a, b);
SELECT id, a, b FROM tbl
WHERE a = $1 AND b = $2;
SELECT id, a, b FROM tbl
WHERE b = $1;
42. 3-Minute Quiz: Indexing Skills
Q4: Good or Bad? (Indexing LIKE)
CREATE INDEX tbl_idx
ON tbl (text varchar_pattern_ops);
SELECT id, text
FROM tbl
WHERE text LIKE '%TERM%';
43. 3-Minute Quiz: Indexing Skills
Q4: Good or Bad? (Indexing LIKE)
CREATE INDEX tbl_idx
ON tbl (text varchar_pattern_ops);
SELECT id, text
FROM tbl
WHERE text LIKE '%TERM%';
44. 3-Minute Quiz: Indexing Skills
Q5: Good or Bad? (equality vs. ranges)
CREATE INDEX tbl_idx
ON tbl (date_col, state);
SELECT id, date_col, state FROM tbl
WHERE date_col =
CURRENT_DATE - INTERVAL '5' YEAR
AND state = 'X';
45. 3-Minute Quiz: Indexing Skills
Q5: Good or Bad? (equality vs. ranges)
CREATE INDEX tbl_idx
ON tbl (date_col, state);
SELECT id, date_col, state FROM tbl
WHERE date_col =
CURRENT_DATE - INTERVAL '5' YEAR
AND state = 'X';
46. Indexes: The Neglected All-Rounder
Everybody knows indexing is
important for performance,
yet nobody takes the time to
learn and apply it properly.
47. Indexes: The Neglected All-Rounder
Index details are hardly known.
! “Details” like column-order or equality vs. range
conditions must be learned and understood.
Only one index capability is used: finding data quickly
! Indexes have three capabilities (powers):
finding data, clustering data, and sorting data.
Indexing is done from single query perspective.
! Should be done from application perspective
(considering all queries). It’s a design task!
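As a small illustration of those three powers working together (the schema is hypothetical), a single two-column index can find the rows, benefit from them being clustered together, and deliver them pre-sorted for a Top-N query:
CREATE INDEX sales_cust_date_idx ON sales (customer_id, sale_date);
-- finding:    the index locates the requested customer directly
-- clustering: all entries for that customer sit next to each other in the index
-- sorting:    the entries are already ordered by sale_date, so the Top-N query
--             can stop after ten rows without a separate sort step
SELECT sale_date, amount
  FROM sales
 WHERE customer_id = 42
 ORDER BY sale_date DESC
 FETCH FIRST 10 ROWS ONLY;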
48. Indexes: The Neglected All-Rounder
Are you just adding indexes
or
are you designing indexes?
49. About Markus Winand
Tuning developers for
high SQL performance
Training co (one-man show):
winand.at
Geeky blog:
use-the-index-luke.com
Author of:
SQL Performance Explained