This document provides an overview of using PolyBase for data virtualization in SQL Server. It discusses installing and configuring PolyBase, connecting to external data sources like Azure Blob Storage and other SQL Server instances, using PolyBase DMVs for monitoring and troubleshooting, and techniques for optimizing performance such as predicate pushdown and creating statistics on external tables. The presentation explains how PolyBase can be leveraged to virtually access and query external data using T-SQL, without needing to know the physical data locations or move the data.
3. A community for professionals who use the Microsoft Data Platform
Articles • Webinars • Videos • Presentations • Events • Resources • News
./c/sqlschool.gr • Sqlschool.gr Group • @antoniosch • @sqlschool • SQLschool.gr UG & Page
Connect / Explore / Learn
5. Presentation Content
• Overview
• Installing and Configuring PolyBase
• Data Virtualization using PolyBase
• DMVs and PolyBase
• Performance and Troubleshooting
6. Overview
• Proliferation of Data Platform technologies
• What is Data Virtualization?
• What is PolyBase?
• Data Virtualization using PolyBase
7. Proliferation of Data Platform technologies (Connect / Explore / Learn)
The problem: a massively increasing amount of data, spread across many technologies beyond the RDBMS.
(Diagram: The Problem / Technologies / RDBMS)
8. What is Data Virtualization? (Connect / Explore / Learn)
A modern take on the classic problem of ETL: data appears to come from one source system while, under the covers, links define where the data really lives.
The end user or analyst:
• Can read this data using one SQL dialect.
• Can join structured data sets from different systems without needing to know the source of each data set.
• Has no dependency on database developers to build ETL flows to move data from one system to the next.
9. What is PolyBase? (Connect / Explore / Learn)
PolyBase has been available since 2010 and became generally available in SQL Server 2016.
PolyBase's original purpose was to integrate SQL Server with Hadoop by allowing us to run MapReduce jobs against a remote Hadoop cluster and bring the results back into SQL Server, reducing the computational burden on our relatively more expensive SQL Server instances.
PolyBase in SQL Server 2019 has grown and adapted to this era of data virtualization and gives us the ability to integrate with a variety of source systems: a Hadoop cluster, Azure Blob Storage, other SQL Server instances, Oracle, Teradata, MongoDB, Cosmos DB, an Apache Spark cluster, Apache Hive tables, and even Microsoft Excel.
The best part is that developers need only T-SQL.
PolyBase is no panacea, however, and there are trade-offs compared to storing all data natively in one source system, particularly around performance.
12. PolyBase Configuration (Connect / Explore / Learn)
Scale-Out Group rules:
• Each machine hosting SQL Server must be part of the same Active Directory domain.
• You must use the same Active Directory service account for each installation of the PolyBase Engine and PolyBase Data Movement services.
• Each machine hosting SQL Server must be able to communicate with all other Scale-Out Group members: in close physical proximity and on the same network, avoiding geographically distributed servers and communication through the Internet.
• Each SQL Server instance must be running the same major version of SQL Server.
• PolyBase services are machine-level rather than instance-level services.
21. PolyBase vs. Linked Servers (Connect / Explore / Learn)
• Object scope: a PolyBase external table is database level, focusing on a single table; a linked server is instance level.
• Operational intent: a PolyBase external table is read-only; a linked server allows read and write.
• Scale-out: PolyBase is able to use Scale-Out Groups; linked servers have no scale-out capabilities.
• Expected data size: PolyBase suits large tables with analytic workloads; linked servers suit OLTP-style workloads querying a small number of rows.
22. Data Virtualization using PolyBase: PolyBase DMVs
• Metadata DMVs
• Service and Node Resources DMVs
• Data Movement Service DMVs
• Troubleshooting Queries DMVs
23. Metadata DMVs (Connect / Explore / Learn)
use PolybaseDemo;
select * from sys.external_data_sources;   -- one row per external data source
select * from sys.external_file_formats;   -- settings for each external file format
select * from sys.external_tables;         -- external table definitions
go
24. Service and Node Resources DMVs (Connect / Explore / Learn)
use master;
select * from sys.dm_exec_compute_nodes;        -- head node plus each compute node
select * from sys.dm_exec_compute_node_status;  -- availability, memory, and CPU per node
select * from sys.dm_exec_compute_node_errors;  -- history of node error messages
go
25. Data Movement Service DMVs (Connect / Explore / Learn)
use master;
select * from sys.dm_exec_dms_services;  -- one row per compute node, with service status
select * from sys.dm_exec_dms_workers;   -- per-step performance of data movement work
go
26. Troubleshooting Queries DMVs (Connect / Explore / Learn)
use PolybaseDemo;
select * from sys.dm_exec_external_work;               -- recent and active PolyBase queries
select * from sys.dm_exec_external_operations;         -- pushdown (e.g., MapReduce) operations
select * from sys.dm_exec_distributed_requests;        -- one row per distributed operation
select * from sys.dm_exec_distributed_request_steps;   -- one row per execution ID and step
select * from sys.dm_exec_distributed_sql_requests;    -- SQL steps per compute node and distribution
go
28. Data Virtualization using PolyBase: Performance and Troubleshooting
• Statistics on External Tables
• Predicate Pushdown
• PolyBase Log Files
• Data Issues
29. Statistics on External Tables (Connect / Explore / Learn)
• Fundamentally the same as statistics on regular tables.
• Because the data lives outside of SQL Server:
SQL Server cannot automatically create or maintain statistics against external tables.
We can create statistics from 100% of the data (the default) or from a sample of the data.
Disk space is needed during statistics creation because all the data from the external table is streamed into a temporary table.
• Performance impact:
External statistics can make a difference when they help the optimizer decide whether to push down a predicate or reorder joins to other tables, not in full scans.
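As a minimal sketch, assuming a hypothetical external table dbo.Population with a PopulationYear column, creating external statistics uses the regular syntax (FULLSCAN is the default; WITH SAMPLE reduces how much data is streamed into the temporary table):

create statistics s_Population_PopulationYear
on dbo.Population (PopulationYear)
with fullscan;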
30. Predicate Pushdown (Connect / Explore / Learn)
• Pushdown computation improves the performance of queries on external data sources.
• In SQL Server 2019, pushdown is available for Hadoop, Oracle, Teradata, MongoDB, generic ODBC types, and SQL Server.
• SQL Server allows the following basic expressions and operators for predicate pushdown:
Binary comparison operators (<, >, =, !=, <>, >=, <=) for numeric, date, and time values.
Arithmetic operators (+, -, *, /, %).
Logical operators (AND, OR).
Unary operators (NOT, IS NULL, IS NOT NULL).
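As a sketch against a Hadoop-backed external table (the table and column names are hypothetical), you can also steer the decision explicitly with a query hint:

select PopulationType, Population
from dbo.Population
where PopulationYear >= 2010        -- a candidate predicate for pushdown
option (force externalpushdown);    -- or: option (disable externalpushdown)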
32. PolyBase Log Files (Connect / Explore / Learn)
Located at %PROGRAMFILES%\Microsoft SQL Server\MSSQL##.MSSQLSERVER\MSSQL\Log\Polybase
33. Data Issues (Connect / Explore / Learn)
• Structural
• Unsupported characters
• Date formats
• Limitations:
The maximum possible row size (the full length of variable-length columns) can't exceed 32 KB in SQL Server or 1 MB in Azure Synapse Analytics.
Text-heavy columns might be limited.
Hello and welcome to another SQL Night!
I am Antonios Chatzipavlis.
I am a Data Solutions Consultant and Trainer, and I have been in the Information Technology industry since 1988.
I have been an MCT since 2000 and a Microsoft Data Platform MVP since 2010.
I started using SQL Server at version 6.0, which means I have more than 25 years of experience with this product in large-scale environments.
I have more than 60 (sixty) certifications, mostly in Microsoft products.
Finally, I am the founder of SQLschool.gr.
SQLschool.gr is a community for Greek professionals who use the Microsoft Data Platform.
There you will find articles, webinars, videos, presentations, resources, and news about the Microsoft Data Platform.
You can join us as a member or follow us on social media to keep up with our community.
This year SQLschool.gr turned 10 years old, and I would like to thank you all for your participation and support.
There are two components selected aside from Database Engine Services: the PolyBase Query Service for External Data and the Java connector for HDFS data sources.
The Java connector for HDFS data sources provides us support for connecting to Hadoop and Azure Blob Storage, which were the two endpoints available with PolyBase in SQL Server 2016 and SQL Server 2017; I refer to this throughout the book as PolyBase V1.
SQL Server 2019 also adds the PolyBase Query Service for External Data component, which includes support for services like Oracle, Teradata, MongoDB, Cosmos DB, and even other SQL Server instances. In order to install this component, SQL Server’s installer will also install the Microsoft Visual C++ 2017 Redistributable.
Before you begin installation, it is important to know whether you want to install PolyBase as a standalone service or as part of a Scale-Out Group because you will not be able to switch between the two afterward without uninstalling and reinstalling the PolyBase features. If you are using SQL Server on Linux, the only option available to you at this time is to install standalone; SQL Server on Windows allows for both installation methods. All other things equal, a Scale-Out Group is preferable to a standalone installation. The reason for this is that PolyBase is a Massively Parallel Processing (MPP) technology. This means we can scale PolyBase horizontally, improving performance by adding additional servers. That only works if you incorporate your machine as part of a Scale-Out Group, however; as a standalone installation, your SQL Server instance will not be able to enlist the support of other SQL Server instances when using PolyBase to perform queries.
The preceding text makes sense when all other things are equal, but installing PolyBase as part of a Scale-Out Group has some requirements which standalone PolyBase does not. To wit, in order to install PolyBase as part of a Scale-Out Group, all of the Scale-Out Group rules listed on slide 12 must be true.
The first option is to install the Azul Zulu Open JRE. This is a distribution of Oracle’s Open Java Runtime Environment which Azul Systems supports. Your license for SQL Server includes support for this particular distribution of Open JRE, meaning that you could contact Microsoft support for issues related to the JRE. The link on the installation page includes more information on this licensing agreement.
If you are already a licensed Oracle Standard Edition (SE) customer, you can of course install the Oracle SE version of the Java Runtime Environment. To do so, select the “Provide the location of a different version that has been installed to on this computer” option and navigate to your already-installed version of the Java Runtime Environment. SQL Server 2016 and 2017 supported JRE version 7 update 51 and later, as well as JRE version 8. SQL Server 2019 supports later versions of the Java Runtime Environment, including version 11.
If you are not a licensed Oracle SE customer, you can also install Oracle’s Open JRE. The downside to this is that your support options are limited to public forum access.
Configuration.sql
PolybaseBlob.sql
PolybaseSQL.sql
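As a rough sketch of what scripts like these might contain (the storage account, container, secret, and table shape below are hypothetical placeholders, not the speaker's actual demo contents), a PolyBase setup against Azure Blob Storage typically looks like this:

-- Configuration.sql-style steps (a service restart is required afterward)
exec sp_configure 'polybase enabled', 1;
reconfigure;
exec sp_configure 'hadoop connectivity', 7;   -- an option value that covers Azure Blob Storage
reconfigure;

-- PolybaseBlob.sql-style steps: credential, data source, file format, external table
use PolybaseDemo;
go
create master key encryption by password = '<StrongPassword1!>';
create database scoped credential AzureBlobCredential
with identity = 'user', secret = '<storage-account-access-key>';
go
create external data source AzureBlobStorage
with (
    type = HADOOP,
    location = 'wasbs://demo@mystorageaccount.blob.core.windows.net',
    credential = AzureBlobCredential
);
go
create external file format CsvFileFormat
with (
    format_type = DELIMITEDTEXT,
    format_options (field_terminator = ',', string_delimiter = '"')
);
go
create external table dbo.Population
(
    PopulationType nvarchar(50),
    PopulationYear int,
    Population bigint
)
with (
    data_source = AzureBlobStorage,
    location = '/population/',
    file_format = CsvFileFormat,
    reject_type = VALUE,
    reject_value = 1
);
go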
Linked servers are a classic technique database administrators and developers can use to query another server’s data from the local server. On the plus side, there is extensive OLEDB driver support, and linked servers can reach out to technologies like Oracle, Apache Hive, other SQL Server instances, and even Excel. On the minus side, linked servers have an oft-deserved reputation for bringing over too much data from the remote server during queries and a somewhat undeserved reputation for being a security issue. Still, introducing the idea of an alternative for linked servers should excite many a DBA. Here is where I have mixed news for you: PolyBase can be superior to linked servers in some circumstances, but you will not want to replace all of your linked servers with external tables, as there are some cases where linked servers will be superior. Instead, think of these as two complementary technologies with considerable overlap.
Object Scopes
Linked servers are scoped at the instance level, which means that when you create a linked server, any database on that instance has access to the linked server. Furthermore, on the remote side, linked servers allow you to query any table or view on any database where the remote login has rights. The advantage to the linked server model is its flexibility: you can use linked servers for any number of queries across an indefinite number of remote tables or views. The biggest disadvantage of this approach is that it promotes the idea that perhaps you ought to make that cross-server join of two very large tables.
By contrast, PolyBase requires more deliberation: a database administrator or developer needs to create the external table link on a table-by-table or view-by-view basis before anybody can use it. This additional effort should make the creator think about whether a cross-server link is really necessary and can provide a bit of extra documentation about which tables the staff intend to use for cross-server queries. The downside to this is, if you have a large number of tables to query, it means writing a large number of external table definitions and also maintaining these definitions across table changes. This makes PolyBase a better choice for more stable data models and linked servers for more dynamic data models.
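To make the contrast concrete, here is a hypothetical sketch (the server, database, and table names are illustrative): the linked server query reaches any remote object through a four-part name, while the PolyBase query works only against a pre-created external table:

-- Linked server: instance-scoped; any table the remote login can see
select FirstName from SQLCONTROL.PolyBaseRevealed.dbo.Person;

-- PolyBase: database-scoped external table, defined ahead of time via
-- create external table dbo.Person (...) with (location = 'PolyBaseRevealed.dbo.Person', ...);
select FirstName from dbo.Person;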
Operational Intent
Linked servers allow for reads as well as inserts, updates, and deletes. With PolyBase V1, we were able to read and insert but could not update or delete data. For the PolyBase V2 types, we are able to read but the engine prohibits any data modification, including inserts. If you attempt a data modification statement against a PolyBase V2 external table, you will get an error message similar to this: Msg 46519 – DML Operations are not supported with external tables.
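A quick hypothetical illustration, assuming dbo.Person is a V2 external table:

insert into dbo.Person (FirstName) values ('Test');
-- Fails with Msg 46519: DML Operations are not supported with external tables.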
Scale-Out Capabilities
Linked servers offer no ability to scale out. One SQL Server instance may read from one SQL Server instance. If you experience performance problems, there is no way to add additional SQL Server instances to the mix to share the load. PolyBase, meanwhile, offers Scale-Out Groups for cases when three or four servers are better than one. In this regard, PolyBase is strictly superior.
Data Sizes
Tying in with scale-out capabilities, linked servers and PolyBase have different expectations for ideal data size. If you intend to pull back one row or a few rows from a small table, linked servers will generally be a superior option because there are fewer moving parts. As you get more complicated queries with larger data sets, PolyBase tends to do at least as well and often better.
Over the rest of this chapter, we will test the performance of PolyBase vs. linked servers in several scenarios to see when PolyBase succeeds and when linked servers come out ahead.
There are 13 Dynamic Management Views available in SQL Server 2019 which relate to PolyBase. In this section, we will review each of these at a high level, starting with basic metadata resources, followed by the DMVs which help with service and node setup, and finishing with DMVs for query troubleshooting.
External Data Sources (sys.external_data_sources): returns one row per external data source.
External File Formats (sys.external_file_formats): shows each of the most important settings for an external file format.
External Tables (sys.external_tables): inherits several columns from sys.objects and contains PolyBase-specific columns. The useful external table columns include the external data source and external file format IDs, allowing us to tie these three views together. For PolyBase V2 tables, the file format ID will be 0, as we do not use external file formats for these data sources.
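As a minimal sketch of tying the three views together (a left join on file formats, since V2 tables report a file format ID of 0):

select t.name as external_table, ds.name as data_source, ff.name as file_format
from sys.external_tables t
join sys.external_data_sources ds on t.data_source_id = ds.data_source_id
left join sys.external_file_formats ff on t.file_format_id = ff.file_format_id;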
Compute Nodes
The sys.dm_exec_compute_nodes DMV returns one row for the head node and one row for each PolyBase compute node, including the server name and port, as well as its IP address.
If you have a standalone installation of PolyBase, you will get two rows back: one for the head and one for the local instance’s compute node.
If you are using a scale-out cluster, you will get back the two rows in a standalone installation as well as one row for each scale-out compute node you have in the cluster.
The sys.dm_exec_compute_node_status DMV connects to each compute node in order to determine if it is available. It retrieves server-level information such as allocated and available memory (in bytes), process and total CPU utilization (in ticks), the last communication time per node, and the latest error to have occurred as well. Figure 10-5 shows an example of some of the columns in this DMV.
When it comes to errors, however, we can see the value of all of the columns while on-premises by querying sys.dm_exec_compute_node_errors. This DMV holds a history of error messages and is a good place to look when troubleshooting failures on a system.
In addition to its unique ID data type, the data in sys.dm_exec_compute_node_errors will persist even after we restart the SQL Server services. Most Dynamic Management Views—for example, wait stat measures—reset when the database engine restarts, but compute node errors will stick around.
The first of these is sys.dm_exec_dms_services. This view returns one row per compute node—including one row for the head instance's compute node—and the status for each of these nodes. Figure 10-7 shows the output of this DMV.
We also have the ability to see the outputs of data movement service operations using the sys.dm_exec_dms_workers DMV. This gives us one row for each execution ID and execution step, and includes performance measures such as bytes and rows processed, total elapsed time, CPU utilization time, and more.
To clear up potential confusion, the total elapsed time and query time values are in milliseconds, whereas CPU time is in ticks, where 10,000 ticks add up to a millisecond. Therefore, to get a clearer measure across the board, we want to divide the CPU time column by 10,000 to get a better picture of just how much CPU time we are actually using in relation to total elapsed time.
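A small sketch of that normalization:

select execution_id, step_index, total_elapsed_time,
    cpu_time / 10000.0 as cpu_time_ms   -- 10,000 ticks = 1 millisecond
from sys.dm_exec_dms_workers;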
In addition to these measures, we are also able to see the source SQL query for these operations, as well as the error ID if an operation fails. Unlike the compute node errors DMV, the DMS workers Dynamic Management View resets every time you restart the PolyBase engine service.
The final five Dynamic Management Views help us learn more about the SQL queries users run on our instances. Like the server and node resource DMVs we just looked at, these are all instance-level DMVs, meaning we will get the same results when running in any database. These views break down into two types, based on their names: external work and distributed requests. The external work results reset each time we restart the PolyBase engine, whereas the distributed requests DMVs persist even after service restarts.
First up in our set of views is sys.dm_exec_external_work. This Dynamic Management View returns one row for each of the last 1000 PolyBase queries we have run since the last time the PolyBase engine started, as well as any active queries currently running.
This DMV contains information on the current status of each execution, including the latest step for each compute node and Data Movement Service step.
We can see the type of operation, which is “File Split” for PolyBase V1 queries and “ODBC Data Split” for PolyBase V2 queries.
The input name tells us which file, folder, or table we are reading—for the SQL Server example on the first line, the input name is sqlserver://sqlcontrol/PolyBaseRevealed.dbo.Person. If we are reading from a file, the read_location field gives us the starting offset from 0 bytes. In the three cases in Figure 10-9, we read the file starting from the beginning. We can see the actual ODBC command next in the read_command column, which is a new field for SQL Server 2019. Finally, there are some columns containing top-level metrics, including bytes processed, file length (when reading files), start and end dates, the total elapsed time in milliseconds, and the status of each request. This status will be one of the following values: Pending, Processing, Done, Failed, or Aborted.
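Pulling just those top-level columns, as a sketch:

select execution_id, type, input_name, read_location, read_command,
    bytes_processed, total_elapsed_time, status
from sys.dm_exec_external_work;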
If you perform a predicate pushdown operation against a Hadoop cluster, the sys.dm_exec_external_operations Dynamic Management View will give you a rundown of these pushdown operations. Figure 10-10 shows an example of a pushdown MapReduce job which failed—we can see that the map and reduce progress values are both at 0%.
The sys.dm_exec_distributed_requests view returns one line per distributed operation. It provides us with one extremely helpful piece of information: a SQL handle, which we can use to return query text or an execution plan for our PolyBase queries. Figure 10-11 shows several rows from this table, including QID2260, which failed in the prior figure.
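For example, a sketch of turning that handle into the original query text:

select r.execution_id, r.status, t.text
from sys.dm_exec_distributed_requests r
cross apply sys.dm_exec_sql_text(r.sql_handle) t;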
The sys.dm_exec_distributed_request_steps view returns one row per execution ID and step. It is particularly useful when you already know an execution ID and want to understand what happened at each step along the way. Figure 10-12 gives us a glimpse at some of the most important columns here.
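A sketch of walking the steps for a known execution (the QID value is illustrative):

select step_index, operation_type, location_type, status, total_elapsed_time, command
from sys.dm_exec_distributed_request_steps
where execution_id = 'QID2148'
order by step_index;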
The sys.dm_exec_distributed_sql_requests Dynamic Management View is our final DMV of note. It contains one row per SQL-related step on each compute node and for each distribution. Figure 10-13 shows an example of this for execution ID QID2148.
This view makes clear the distributed nature of PolyBase: each distributed request step has eight separate SPIDs running on a single compute node. As with the distributed request steps DMV, we will look at this DMV in some greater detail next.
The first step creates the name of our temp table, TEMP_ID_XX. This appears to be an incrementing value, and the operation runs on the head node.
The second step has us create a temporary table on each compute node named TEMP_ID_XX. The shape of this table is the set of columns that we will need for our query: population type, year, and population.
The third step adds an extended property named IS_EXTERNAL_STREAMING_TABLE to each of the temp tables, presumably to make it easier to track which temp tables are used for loading external data.
The fourth step runs a statistics update, telling SQL Server that we expect the temp table to have 566 rows.
Our fifth step (i.e., step index 4) runs on the head node once more and is a MultiStreamOperation. There is no official documentation on this step, but it takes up 848 of the 919 total milliseconds of elapsed time and appears to be the operation which causes our compute nodes to do work.
From there, we see a HadoopShuffleOperation on the Data Movement Service. This returns all 13,607 rows in the population table. We can see from the cleaned-up query in Listing 10-3 that this is a simple query of all rows from our population table.
While we shuffle data across our compute nodes' Data Movement Services, the next step runs: a StreamingReturnOperation. We can tell these are running concurrently because the shuffle operation takes 845 milliseconds and the streaming return operation 804 milliseconds, yet our entire query finished in under a second. This streaming query, which again runs on each of the compute nodes, queries TEMP_ID_73 and performs the aggregation we requested. Of interest is the fact that this query does not follow exactly the same shape as what we sent the database engine.
PolyBase Log Files
C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\Log\Polybase
DMS Errors
The DMS error log gives stack traces when an exception occurs in the data movement service. One of the more common errors you might find when reading through this log is System.Data.SqlClient.SqlException: Operation cancelled by user. This exception occurs when a user or application stops a query, such as when a user hits the “Stop” button in Azure Data Studio. You can safely ignore this error.
This particular log file tends to give you a high-level view of when errors occur but little information on the root cause or even the specific error. One of the more common errors I tend to see in this log is Internal Query Processor Error: The query processor encountered an unexpected error during the processing of a remote query phase. This phrase will not help me diagnose the problem, but this log file does tend to include information like the query ID and plan ID, which I can use to figure out which queries are failing.
DMS Movement
The data movement service writes a good amount of information to the DMS Movement log and includes detailed information on what data moves over from Azure Blob Storage or Hadoop to SQL Server. This includes the SQL queries the PolyBase data movement service generates, configuration settings such as the number of readers the DMS will use to migrate data, and detailed operation at each step. Combined with the DMS error log, we can start to piece together our errors.
DWEngine Errors
Like the DMS error log, the DWEngine error log gives a higher-level overview of when errors occur, as well as stack traces. This file can help you pinpoint when an error occurs. The errors in this file tend to be a bit more descriptive than the ones in the DMS error log. For example, we can find errors relating to the maximum reject threshold in this file: Query aborted-- the maximum reject threshold (1 rows) was reached while reading from an external source: 2 rows rejected out of total 2 rows processed.
DWEngine Movement
This log provides us with more detail on queries and errors which the DWEngine error log captures. In some cases, this file has enough information to drive to the root cause. In Figure 5-3, we see an example of a clear error message where I defined a column in an ORC file as a string data type but am trying to use an integer data type to access it via PolyBase.
DWEngine Server
The DWEngine Server log contains a few pieces of useful information. One of the most useful is that it contains the create statements for external data sources, file formats, and tables. We can use this log to determine what our external resources looked like at the time of exception, just in case somebody changed one of them during troubleshooting.
This log also contains information on failed external table access attempts. If you have firewall or connection problems, this should be your first log to review. Figure 5-4 shows an example of a common HDFS bridge error whose root cause is insufficient permissions granted to the PolyBase pdw_user account.
DMS PolyBase
The DMS PolyBase log shows us something extremely important: any data translation failure. Figure 5-5 gives us three examples of data translation errors, including column conversion errors, data length errors, and string delimiter errors. We can also find cases where values are NULL, but the external table requires a non-nullable field, invalid date conversion attempts, and more.
DWEngine PolyBase
This file is much less interesting than most of the other logs. In my work, I have not seen it stretch to more than a few lines, and the most interesting thing in this log is the location of new Hadoop clusters as you create external data sources.
Structural Mismatch
The first common data problem is structural mismatch—that is, when you define your external table one way but the data does not comport to that structure. For example, you might define an external table as having eight columns, but the underlying data set has seven or nine columns. In that case, the PolyBase engine will reject rows because they do not fit the expected structure.
Caution
In production Hadoop systems, developers are liable to change the structure of files and leave old files as is. For example, a report with eight columns might suddenly populate with nine columns on a certain date. The PolyBase engine cannot support multiple data structures for the same external table and will reject at least one of the two structures. This might cause a previously working external table query suddenly and unexpectedly to fail.
Aside from column totals, there are several other mismatch problems which can cause queries to fail. For example, text files might have different schemas or delimiters: one type might be comma-delimited and another pipe-delimited. Some text files might use the quotation mark as a string delimiter, and others might use brackets or tildes. Any lack of consistency will cause the PolyBase engine to fail processing. If you do run into this scenario, an easy solution would be to create several external tables—one for each distinct file structure—and use a view to combine them together as one logical unit.
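A hypothetical sketch of that workaround, assuming an older eight-column layout and a newer nine-column layout of the same report, each behind its own external table pointing at its own folder:

create view dbo.SalesReport as
select Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8,
       cast(null as nvarchar(50)) as Col9
from dbo.SalesReport_8Col    -- external table over the old files
union all
select Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9
from dbo.SalesReport_9Col;   -- external table over the new files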
Unsupported Characters or Formats
PolyBase supports only a limited number of date formats. The safest route is to limit your text file dates to use supported formats. You can find these on Microsoft Docs (https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql).
PolyBase also struggles with newlines in text fields, so strip those out before trying to load data. Even within a quoted delimiter, newlines will cause the PolyBase engine to think it is starting a new record.
PolyBase Data Limitations
PolyBase also has limits to what data it can support. From Microsoft Docs (https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-versioned-feature-summary), we can see that the maximum row size cannot exceed 32KB for SQL Server or 1MB for Azure Synapse Analytics. In addition, if you save your data in ORC format, you might receive Java out-of-memory exceptions due to data size. For text-heavy files, it might be best to keep them as delimited files rather than ORC files.
The maximum possible row size, which includes the full length of variable length columns, can't exceed 32 KB in SQL Server or 1 MB in Azure Synapse Analytics.
When data is exported into an ORC file format from SQL Server or Azure Synapse Analytics, text-heavy columns might be limited. They can be limited to as few as 50 columns because of Java out-of-memory error messages. To work around this issue, export only a subset of the columns.
PolyBase can't connect to a Hortonworks instance if Knox is enabled.
If you use Hive tables with transactional = true, PolyBase can't access the data in the Hive table's directory.