Data Integration through Data Virtualization - PolyBase and new SQL Server 2019 Features (Presented at SQL Server Konferenz 2019 on February 21st, 2019)
Data virtualization, Data Federation & IaaS with JBoss Teiid - Anil Allewar
Enterprises have always grappled with the problem of information silos that needed to be merged using multiple data warehouses (DWs) and business intelligence (BI) tools so that enterprises could mine this disparate data for business decisions and strategy. Traditionally, this data integration was done with ETL by consolidating multiple DBMSs into a single data storage facility.
Data virtualization enables abstraction, transformation, federation, and delivery of data taken from a variety of heterogeneous data sources as if it were a single virtual data source, without the need to physically copy the data for integration. It allows consuming applications or users to access data from these various sources via a request to a single access point and delivers information-as-a-service (IaaS).
In this presentation, we will explore what data virtualization is and how it differs from the traditional data integration architecture. We’ll also look at validating the data virtualization and federation concepts by working through an example (see videos at the GitHub repo) to federate data across two heterogeneous data sources, MySQL and MongoDB, using the JBoss Teiid data virtualization platform.
These are the slides for my talk "An intro to Azure Data Lake" at Azure Lowlands 2019. The session was held on Friday January 25th from 14:20 - 15:05 in room Santander.
Choosing technologies for a big data solution in the cloud - James Serra
Has your company been building data warehouses for years using SQL Server? And are you now tasked with creating or moving your data warehouse to the cloud and modernizing it to support “Big Data”? What technologies and tools should you use? That is what this presentation will help you answer. First, we will cover what questions to ask concerning data (type, size, frequency), reporting, performance needs, on-prem vs cloud, staff technology skills, OSS requirements, cost, and MDM needs. Then we will show you common big data architecture solutions and help you to answer questions such as: Where do I store the data? Should I use a data lake? Do I still need a cube? What about Hadoop/NoSQL? Do I need the power of MPP? Should I build a "logical data warehouse"? What is this lambda architecture? Can I use Hadoop for my DW? Finally, we’ll show some architectures of real-world customer big data solutions. Come to this session to get started down the path to making the proper technology choices in moving to the cloud.
Microsoft Data Platform - What's included - James Serra
The pace of Microsoft product innovation is so fast that even though I spend half my days learning, I struggle to keep up. And as I work with customers I find they are often in the dark about many of the products that we have, since they are focused on just keeping what they have running and putting out fires. So, let me cover what products you might have missed in the Microsoft data platform world. Be prepared to discover all the various Microsoft technologies and products for collecting data, transforming it, storing it, and visualizing it. My goal is to help you not only understand each product but understand how they all fit together and their proper use cases, allowing you to build the appropriate solution that can incorporate any data in the future no matter the size, frequency, or type. Along the way we will touch on technologies covering NoSQL, Hadoop, and open source.
AWS Cloud Kata 2013 | Singapore - Getting to Scale on AWS - Amazon Web Services
This session will focus on how to get from 'Minimum Viable Product' (MVP) to scale. It will also explain how to deal with unpredictable demand and how to build a scalable business. Attend this session to learn how to:
Scale web servers and app services with Elastic Load Balancing and Auto Scaling on Amazon EC2
Scale your storage on Amazon S3 and S3 Reduced Redundancy Storage
Scale your database with Amazon DynamoDB, Amazon RDS, and Amazon ElastiCache
Scale your customer base by reaching customers globally in minutes with Amazon CloudFront
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
Many companies today move mountains of data using ETL (extract, transform, load) technology. But data volumes are growing too large to move, customers are now expecting real-time data, and ETL costs now account for 10-15% of computing capacity. In this slide presentation, you can see how data virtualization enables data structures that were designed independently to be leveraged together, in real time, and without data movement, reducing complexity, lowering IT costs, and minimizing risk.
The new Microsoft Azure SQL Data Warehouse (SQL DW) is an elastic data warehouse-as-a-service and a Massively Parallel Processing (MPP) solution for "big data" with true enterprise-class features. The SQL DW service is built for data warehouse workloads from a few hundred gigabytes to petabytes of data, with truly unique features like disaggregated compute and storage that allow customers to utilize the service to match their needs. In this presentation, we take an in-depth look at implementing a SQL DW, elastic scale (grow, shrink, and pause), and hybrid data clouds with Hadoop integration via PolyBase, allowing for a true SQL experience across structured and unstructured data.
Azure SQL Database Managed Instance is a new flavor of Azure SQL Database that is a game changer. It offers near-complete SQL Server compatibility and network isolation to easily lift and shift databases to Azure (you can literally back up an on-premises database and restore it into an Azure SQL Database Managed Instance). Think of it as an enhancement to Azure SQL Database that is built on the same PaaS infrastructure and maintains all its features (e.g. active geo-replication, high availability, automatic backups, database advisor, threat detection, intelligent insights, vulnerability assessment, etc.) but adds support for databases up to 35TB, VNET, SQL Agent, cross-database querying, replication, etc. So, you can migrate your databases from on-prem to Azure with very little migration effort, which is a big improvement over the current Singleton or Elastic Pool flavors, which can require substantial changes.
Think of big data as all data, no matter what the volume, velocity, or variety. The simple truth is a traditional on-prem data warehouse will not handle big data. So what is Microsoft’s strategy for building a big data solution? And why is it best to have this solution in the cloud? That is what this presentation will cover. Be prepared to discover all the various Microsoft technologies and products for collecting data, transforming it, storing it, and visualizing it. My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your company's big data solution.
How to Build Modern Data Architectures Both On Premises and in the Cloud - VMware Tanzu
Enterprises are beginning to consider the deployment of data science and data warehouse platforms on hybrid (public cloud, private cloud, and on premises) infrastructure. This delivers the flexibility and freedom of choice to deploy your analytics anywhere you need it and to create an adaptable and agile analytics platform.
But the market is conspiring against customer desire for innovation...
Leading public cloud vendors are interested in pushing their new, but proprietary, analytic stacks, locking customers into subpar Analytics as a Service (AaaS) for years to come.
In tandem, legacy data warehouse vendors are trying to extend the lifecycle of their costly and aging appliances with new features of marginal value, simply imitating the same limiting models of the public cloud vendors.
New vendors are coming up with interesting ideas, but these ideas often lack critical features, such as support for hybrid solutions, limiting their immediate value to users.
It is 2017—you can, in fact, have your analytics cake and eat it too! Solve your short-term cost and capability challenges, and establish a long-term hybrid data strategy by running the same open source analytics platform on your infrastructure as it exists today.
In this webinar you will learn how Pivotal can help you build a modern analytical architecture able to run on your public, private cloud, or on-premises platform of your choice, while fully leveraging proven open source technologies and supporting the needs of diverse analytical users.
Let’s have a productive discussion about how to deploy a solid cloud analytics strategy.
Presenter : Jacque Istok, Head of Data Technical Field for Pivotal
https://content.pivotal.io/webinars/jul-20-how-to-build-modern-data-architectures-both-on-premises-and-in-the-cloud
Overview of Microsoft Appliances: Scaling SQL Server to Hundreds of Terabytes - James Serra
Learn how SQL Server can scale to HUNDREDS of terabytes for BI solutions. This session will focus on Fast Track Solutions and Appliances, Reference Architectures, and Parallel Data Warehouse (PDW). Included will be performance numbers and lessons learned from a PDW implementation and how a successful BI solution was built on top of it using SSAS.
SQL Server 2017 Enhancements You Need To Know - Quest
In this session, database experts Pini Dibask and Jason Hall reveal the lesser-known features that’ll help you improve database performance in record time.
Data Analytics Meetup: Introduction to Azure Data Lake Storage - CCG
Microsoft Azure Data Lake Storage is designed to enable operational and exploratory analytics through a hyper-scale repository. Journey through Azure Data Lake Storage Gen1 with Microsoft Data Platform Specialist, Audrey Hammonds. In this video, she explains the fundamentals of Gen 1 and Gen 2, walks us through how to provision a Data Lake, and gives tips to avoid turning your Data Lake into a swamp.
Learn more about Data Lakes with our blog - Data Lakes: Data Agility is Here Now https://bit.ly/2NUX1H6
Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea... - Impetus Technologies
Traditional databases and batch ETL operations have not been able to keep up with growing data volumes and the need for fast, continuous data processing.
How can modern enterprises provide their business users real-time access to the most up-to-date and complete data?
In our upcoming webinar, our experts will talk about how real-time CDC improves data availability and fast data processing through incremental updates in the big data lake, without modifying or slowing down source systems. Join this session to learn:
What is CDC and how it impacts business
The various methods for CDC in the enterprise data warehouse
The key factors to consider while building a next-gen CDC architecture:
Batch vs. real-time approaches
Moving from just capturing and storing, to capturing, enriching, transforming, and storing
Avoiding stopgap silos by moving to straight-through processing
Implementation of CDC through a live demo and use-case
You can view the webinar here - https://www.streamanalytix.com/webinar/planning-your-next-gen-change-data-capture-cdc-architecture-in-2019/
For more information visit - https://www.streamanalytix.com
Enabling transparent SQL/SPARQL access to both static and dynamically-computed data
Query languages for databases (e.g., SQL) and knowledge graphs (e.g., SPARQL) provide a concise, declarative, and highly flexible mechanism to access stored data. Yet, many use cases also involve dynamically-computed data available through web APIs or other forms of external services. In such settings, data access is comparatively less flexible (e.g., due to restrictions on available input/output methods), convenient, and sometimes prohibitively slow for users interactively querying data. In this talk, we discuss these problems and present open source solutions that enable querying dynamically-computed data as a “virtual” (since not fully materialized) relational database via SQL, or as a “virtual” knowledge graph via SPARQL, at the same time providing pre-computation and caching solutions to speed up data access. The core components presented in the talk have been developed in the context of the HIVE “Fusion Grant” project and the OntoCRM project, both involving UNIBZ and Ontopic srl. In both projects, we aim at extending virtual knowledge graphs to dynamically-computed data, with a particular focus on applications in the domains of environmental sustainability and climate risk management.
Samedi SQL Québec - The Azure Data Platform - MSDEVMTL
June 6th, 2015
Samedi SQL in Quebec City
Session 3 - Data (SQL Azure, Table and Blob Storage) (Eric Moreau)
SQL Azure is a relational database as a service. Azure Storage lets you store and retrieve large volumes of unstructured data (for example, documents and media files) with Azure blobs, structured NoSQL data with Azure tables, and reliable messages with Azure queues.
In the spirit of the book 7 Databases in 7 Weeks, Lara Rubbelke and Karen Lopez cover ~seven databases and datastores in the SQL and NoSQL world, when to use them, and how they are SQL-like.
From SQLBits XV
Notice an error? Let me know. I welcome this sort of feedback.
Microsoft released SQL Azure more than two years ago - that's enough time for testing (I hope!). So, are you ready to move your data to the Cloud? If you’re considering a business (i.e. a production environment) in the Cloud, you need to think about methods for backing up your data, a backup plan for your data and, eventually, restoring with Red Gate Cloud Services (and not only). In this session, you’ll see the differences, functionality, restrictions, and opportunities in SQL Azure and On-Premise SQL Server 2008/2008 R2/2012. We’ll consider topics such as how to be prepared for backup and restore, and which parts of a cloud environment are most important: keys, triggers, indexes, prices, security, service level agreements, etc.
Similar to Data Integration through Data Virtualization (SQL Server Konferenz 2019)
The Battle of the Data Transformation Tools (PASS Data Community Summit 2023) - Cathrine Wilhelmsen
The Battle of the Data Transformation Tools (Presented as part of the "Battle of the Data Transformation Tools" Learning Path at PASS Data Community Summit on November 16th, 2023)
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS... - Cathrine Wilhelmsen
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (Presented as part of the "Battle of the Data Transformation Tools" Learning Path at PASS Data Community Summit on November 15th, 2023)
Building an End-to-End Solution in Microsoft Fabric: From Dataverse to Power ... - Cathrine Wilhelmsen
Building an End-to-End Solution in Microsoft Fabric: From Dataverse to Power BI (Presented at SQLSaturday Oregon & SW Washington on November 11th, 2023)
Stressed, Depressed, or Burned Out? The Warning Signs You Shouldn't Ignore (S... - Cathrine Wilhelmsen
Stressed, Depressed, or Burned Out? The Warning Signs You Shouldn't Ignore (Presented at SQLBits on March 18th, 2023)
We all experience stress in our lives. When the stress is time-limited and manageable, it can be positive and productive. This kind of stress can help you get things done and lead to personal growth. However, when the stress stretches out over longer periods of time and we are unable to manage it, it can be negative and debilitating. This kind of stress can affect your mental health as well as your physical health, and increase the risk of depression and burnout.
The tricky part is that both depression and burnout can hit you hard without the warning signs you might recognize from stress. Where stress barges through your door and yells "hey, it's me!", depression and burnout can silently sneak in and gradually make adjustments until one day you turn around and see them smiling while realizing that you no longer recognize your house. I know, because I've dealt with both. And when I thought I had kicked them out, they both came back for new visits.
I don't have the Answers™️ or Solutions™️ to how to keep them away forever. But in hindsight, there were plenty of warning signs I missed, ignored, or was oblivious to at the time. In this deeply personal session, I will share my story of dealing with both depression and burnout. What were the warning signs? Why did I miss them? Could I have done something differently? And most importantly, what can I - and you - do to help ourselves or our loved ones if we notice that something is not quite right?
"I can't keep up!" - Turning Discomfort into Personal Growth in a Fast-Paced ...Cathrine Wilhelmsen
"I can't keep up!" - Turning Discomfort into Personal Growth in a Fast-Paced World (Presented at SQLBits on March 17th, 2023)
Do you sometimes think the world is moving so fast that you're struggling to keep up?
Does it make you feel a little uncomfortable?
Awesome!
That means that you have ambitions. You want to learn new things, take that next step in your career, achieve your goals. You can do anything if you set your mind to it.
It just might not be easy.
All growth requires some discomfort. You need to manage and balance that discomfort, find a way to push yourself a little bit every day without feeling overwhelmed. In a fast-paced world, you need to know how to break down your goals into smaller chunks, how to prioritize, and how to optimize your learning.
Are you ready to turn your "I can't keep up" into "I can't believe I did all of that in just one year"?
Lessons Learned: Implementing Azure Synapse Analytics in a Rapidly-Changing S... - Cathrine Wilhelmsen
Lessons Learned: Implementing Azure Synapse Analytics in a Rapidly-Changing Startup (Presented at SQLBits on March 11th, 2022)
What happens when you mix one rapidly-changing startup, one data analyst, one data engineer, and one hypothesis that Azure Synapse Analytics could be the right tool of choice for gaining business insights?
We had no idea, but we gave it a go!
Our ambition was to think big, start small, and act fast – to deliver business value early and often.
Did we succeed?
Join us for an honest conversation about why we decided to implement Azure Synapse Analytics alongside Power BI, how we got started, which areas we completely messed up at first, what our current solution looks like, the lessons learned along the way, and the things we would have done differently if we could start all over again.
6 Tips for Building Confidence as a Public Speaker (SQLBits 2022) - Cathrine Wilhelmsen
6 Tips for Building Confidence as a Public Speaker (Presented at SQLBits on March 10th, 2022)
Do you feel nervous about getting on stage to deliver a presentation?
That was me a few years ago. Palms sweating. Hands shaking. Voice trembling. I could barely breathe and talked at what felt like a thousand words per second. Now, public speaking is one of my favorite hobbies. Sometimes, I even plan my vacations around events! What changed?
There are no shortcuts to building confidence as a public speaker. However, there are many things you can do to make the journey a little easier for yourself. In this session, I share the top tips I have learned over the years. All it takes is a little preparation and practice.
You can do this!
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to evolve, driven by institutional investment rotating out of offices and into work from home (“WFH”) arrangements, and by the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next four years.
While competitive headwinds remain, represented by the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, and MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to be over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Adjusting primitives for graph : SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Data Integration through Data Virtualization (SQL Server Konferenz 2019)
1. Data Integration through Data Virtualization
Cathrine Wilhelmsen, Inmeta
@cathrinew | cathrinew.net
February 21st 2019
2. Abstract
Data virtualization is an alternative to Extract, Transform and Load (ETL) processes. It handles the complexity of integrating different data sources and formats without requiring you to replicate or move the data itself. Save time, minimize effort, and eliminate duplicate data by creating a virtual data layer using PolyBase in SQL Server.
In this session, we will first go through fundamental PolyBase concepts such as external data sources and external tables. Then, we will look at the PolyBase improvements in SQL Server 2019. Finally, we will create a virtual data layer that accesses and integrates both structured and unstructured data from different sources. Along the way, we will cover lessons learned, best practices, and known limitations.
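To make those two fundamental concepts concrete, here is a minimal sketch in T-SQL of what an external data source and an external table can look like with PolyBase in SQL Server 2019. The server name, credentials, and table definitions below are hypothetical placeholders for illustration, not objects from the demo.

-- Hypothetical master key and credential for authenticating against the remote source
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword>';

CREATE DATABASE SCOPED CREDENTIAL RemoteSqlCredential
WITH IDENTITY = 'remote_user', SECRET = '<RemotePassword>';

-- External data source pointing at another SQL Server instance
CREATE EXTERNAL DATA SOURCE RemoteSqlSource
WITH (
    LOCATION = 'sqlserver://remoteserver.contoso.com',
    CREDENTIAL = RemoteSqlCredential
);

-- External table mapping to a table in the remote database
CREATE EXTERNAL TABLE dbo.RemoteCustomers
(
    CustomerID   INT           NOT NULL,
    CustomerName NVARCHAR(200) NOT NULL
)
WITH (
    DATA_SOURCE = RemoteSqlSource,
    LOCATION    = 'RemoteDatabase.dbo.Customers'
);

-- The external table is queried like a local table; no data is copied or moved
SELECT CustomerID, CustomerName
FROM dbo.RemoteCustomers;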
45. 1. Install Prerequisites
Microsoft .NET Framework 4.5
Oracle Java SE Runtime Environment (JRE) 7 or 8
2. Install PolyBase
Single Node or Scale-Out Group
3. Enable PolyBase
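As a rough sketch, step 3 can be done in T-SQL once the feature is installed (SQL Server 2019 syntax):

-- Enable PolyBase on the instance
EXEC sp_configure @configname = 'polybase enabled', @configvalue = 1;
RECONFIGURE;

-- Returns 1 if PolyBase is installed on this instance
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;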
46. Install Prerequisites
Microsoft .NET Framework 4.5
https://www.microsoft.com/nl-nl/download/details.aspx?id=30653
Oracle Java SE Runtime Environment (JRE) 7 or 8
https://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html
47. Install PolyBase
Note: PolyBase can be installed on only one SQL Server instance per machine.
Note: After you install PolyBase either standalone or in a scale-out group, you have to uninstall and reinstall to change it.
. . . Ask me how I know : )
80. Create Statistics
Note: To create statistics, SQL Server imports the external data into a temp table first. Remember to choose sampling or full scan.
Note: Updating statistics is not supported. Drop and re-create instead.
81. Create Statistics
CREATE STATISTICS <StatName>
ON <TableName>(<ColumnName>);
CREATE STATISTICS <StatName>
ON <TableName>(<ColumnName>) WITH FULLSCAN;
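For example, against the hypothetical dbo.RemoteCustomers external table sketched earlier, the template above could be filled in like this:

CREATE STATISTICS stat_RemoteCustomers_CustomerID
ON dbo.RemoteCustomers (CustomerID) WITH FULLSCAN;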
87. Unexpected error encountered filling record
reader buffer: HadoopExecutionException:
Not enough columns in this line.
88. Unexpected error encountered filling record
reader buffer: HadoopExecutionException:
Too many columns in the line.
89. Unexpected error encountered filling record
reader buffer: HadoopExecutionException:
Could not find a delimiter after
string delimiter.
90. Unexpected error encountered filling record
reader buffer: HadoopExecutionException:
Error converting data type NVARCHAR to INT.
91. Unexpected error encountered filling record
reader buffer: HadoopExecutionException:
Conversion failed when converting the
NVARCHAR value '"0"' to data type BIT.
92. Unexpected error encountered filling record
reader buffer: HadoopExecutionException:
Too long string in column [-1]:
Actual len = [4242]. MaxLEN=[4000]
93. Msg 46518, Level 16, State 12, Line 1:
The type 'nvarchar(max)' is not supported
with external tables.
94. Msg 2717, Level 16, State 2, Line 1:
The size (10000) given to the parameter
exceeds the maximum allowed (4000).
95. Msg 131, Level 15, State 2, Line 1:
The size (10000) given to the column
exceeds the maximum allowed for any data
type (8000).
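Most of the errors above come from rows or columns in the source files that do not match the external table definition. As a hedged sketch (all object names are hypothetical, and a Hadoop-style external data source named HadoopSource is assumed to exist), reject options let a query tolerate a limited number of bad rows instead of failing on the first one, and column sizes are kept within the documented limits:

-- Hypothetical delimited file format for CSV files
CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', USE_TYPE_DEFAULT = TRUE)
);

-- Hypothetical external table over the files, with reject options
CREATE EXTERNAL TABLE dbo.SalesStaging
(
    SaleID     INT,
    SaleAmount DECIMAL(10, 2),
    Comment    NVARCHAR(4000)   -- NVARCHAR(MAX) is not supported for external tables
)
WITH (
    DATA_SOURCE  = HadoopSource,   -- assumed to be created elsewhere
    LOCATION     = '/data/sales/',
    FILE_FORMAT  = CsvFileFormat,
    REJECT_TYPE  = VALUE,
    REJECT_VALUE = 10              -- fail only after more than 10 rejected rows
);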
98. SQL Server 2019 Big Data Clusters
SQL Server, Spark, and HDFS
Scalable clusters of containers
Runs on Kubernetes
99. [Architecture diagram: SQL Server 2019 Big Data Cluster on Kubernetes. One Kubernetes pod runs the SQL Server master instance; additional pods each run SQL Server, an HDFS data node, and Spark.]
109. Biml 💚 PolyBase
Ben Weissman:
Using Biml to automagically keep your external
polybase tables in sync!
https://www.solisyon.de/biml-polybase-external-tables/
110.
111. Azure Data Studio
1. Install Azure Data Studio
docs.microsoft.com/en-us/sql/azure-data-studio/download
2. Install Extension: SQL Server 2019 (Preview)
docs.microsoft.com/en-us/sql/azure-data-studio/sql-server-2019-extension