
Data Integration through Data Virtualization (SQL Server Konferenz 2019)


Data Integration through Data Virtualization - PolyBase and new SQL Server 2019 Features (Presented at SQL Server Konferenz 2019 on February 21st, 2019)

Published in: Data & Analytics


  1. Data Integration through Data Virtualization Cathrine Wilhelmsen, Inmeta @cathrinew | cathrinew.net February 21st, 2019
  2. Abstract Data virtualization is an alternative to Extract, Transform and Load (ETL) processes. It handles the complexity of integrating different data sources and formats without requiring you to replicate or move the data itself. Save time, minimize effort, and eliminate duplicate data by creating a virtual data layer using PolyBase in SQL Server. In this session, we will first go through fundamental PolyBase concepts such as external data sources and external tables. Then, we will look at the PolyBase improvements in SQL Server 2019. Finally, we will create a virtual data layer that accesses and integrates both structured and unstructured data from different sources. Along the way, we will cover lessons learned, best practices, and known limitations.
  3. @cathrinew cathrinew.net
  4. …the next 60 minutes… PolyBase Virtual Data Layer Data Virtualization Data Integration
  5. Data Integration
  6. Combine Data in Different Formats from Separate Sources into Useful and Valuable Information
  7. Combine Data in Different Formats from Separate Sources into Useful and Valuable Information
  8. Combine Data Extract Transform Load Extract Load Transform Data Ingestion Data Preparation Data Wrangling
  9. Combine Data in Different Formats from Separate Sources into Useful and Valuable Information
  10. Different Formats SQL TXT CSV XLS XML JSON ORC Parquet
  11. Combine Data in Different Formats from Separate Sources into Useful and Valuable Information
  12. Separate Sources SQL Server Oracle Teradata MongoDB Hadoop Azure Blob Storage Azure Data Lake Local File System
  13. Combine Data in Different Formats from Separate Sources into Useful and Valuable Information
  14. Useful Information Accurate Complete Consistent Timely Unique Valid
  15. Combine Data in Different Formats from Separate Sources into Useful and Valuable Information
  16. Valuable Information What you need Answer questions Solve problems Timesaving Reduce effort Improve efficiency
  17. Combine Data in Different Formats from Separate Sources into Useful and Valuable Information
  18. ETL – Extract Transform Load ELT – Extract Load Transform
  19. ETL – Extract Transform Load ELT – Extract Load Transform = data movement
  20. Data Movement: Costs Duplicated storage costs Need resources to build and maintain
  21. Data Movement: Speed Takes time to build and maintain Delays before data can be used
  22. Data Movement: Security Increased attack surface area Inconsistent security models
  23. Data Movement: Data Quality More storage layers and pipelines Higher complexity
  24. Data movement is a barrier to faster insights - Microsoft
  25. Data Virtualization
  26. Data Virtualization Logical Layers and Abstractions (Near) Real-Time View of Data Store in separate locations View in one location
  27. Data Virtualization: Costs Lower storage costs Fewer resources to build and maintain
  28. Data Virtualization: Speed No data latency Rapid iterations and prototypes
  29. Data Virtualization: Security Smaller attack surface area Consistent security models
  30. Data Virtualization: Data Quality Fewer storage layers and pipelines Less complexity
  31. Data virtualization creates solutions - Microsoft
  32. Data Movement = Bad ? Data Virtualization = Good ?
  33. Data Movement = Bad ? Data Virtualization = Good ? no, just different use cases!
  34. PolyBase
  35. PolyBase Feature in SQL Server 2016 and later Query tables and files using T-SQL Used to query, import, and export data
  36. PolyBase Performance Push-Down Computations Scale-Out Groups
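Push-down behavior can also be steered per query with documented T-SQL hints. A minimal sketch, assuming a hypothetical external table dbo.SalesExternal over a Hadoop source:

      -- Force PolyBase to push the predicate and aggregation work down to Hadoop
      -- (dbo.SalesExternal and its columns are hypothetical):
      SELECT CustomerId, SUM(Amount) AS TotalAmount
      FROM dbo.SalesExternal
      WHERE OrderYear = 2018
      GROUP BY CustomerId
      OPTION (FORCE EXTERNALPUSHDOWN);

      -- The opposite hint streams all rows to SQL Server instead:
      -- OPTION (DISABLE EXTERNALPUSHDOWN);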
  37. PolyBase in SQL Server 2016 / 2017 Hadoop Azure Blob Storage Azure Data Lake
  38. PolyBase in SQL Server 2019 Hadoop Azure Blob Storage Azure Data Lake SQL Server Oracle Teradata MongoDB
  39. [Diagram: PolyBase as the hub connecting Relational Databases, NoSQL, Big Data, and ODBC sources]
  40. How to use PolyBase? 1. Install PolyBase 2. Configure PolyBase Connectivity 3. Create Database Master Key 4. Create Database Scoped Credential 5. ...
  41. How to use PolyBase? 4. ... 5. Create External Data Sources 6. Create External File Formats 7. Create External Tables 8. Create Statistics
  42. Install PolyBase
  43. 1. Install Prerequisites Microsoft .NET Framework 4.5 Oracle Java SE Runtime Environment (JRE) 7 or 8 2. Install PolyBase Single Node or Scale-Out Group 3. Enable PolyBase
  44. Install Prerequisites Microsoft .NET Framework 4.5 https://www.microsoft.com/nl-nl/download/details.aspx?id=30653 Oracle Java SE Runtime Environment (JRE) 7 or 8 https://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html
  45. Install PolyBase Note: PolyBase can be installed on only one SQL Server instance per machine. Note: After you install PolyBase either standalone or in a scale-out group, you have to uninstall and reinstall to change it. … Ask me how I know :)
  46. Enable PolyBase sp_configure 'polybase enabled', 1; RECONFIGURE;
  47. Verify Installation SELECT SERVERPROPERTY('IsPolyBaseInstalled');
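Putting those two snippets together, a post-install check might look like the sketch below; the DMV query at the end is optional and assumes the PolyBase services are running:

      EXEC sp_configure 'polybase enabled', 1;
      RECONFIGURE;

      -- Returns 1 when PolyBase is installed on this instance:
      SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

      -- The head node (and any compute nodes) should be listed once the services run:
      SELECT * FROM sys.dm_exec_compute_nodes;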
  48. Configure PolyBase Connectivity
  49. 1. Configure PolyBase Connectivity 2. Restart Services SQL Server SQL Server PolyBase Engine SQL Server PolyBase Data Movement
  50. Configure PolyBase Connectivity sp_configure 'hadoop connectivity', 7; RECONFIGURE;
  51. Configure PolyBase Connectivity Hadoop Connectivity: • Specify type of data source • Values: 0-7 • 1, 4, 7: Multiple Data Sources
  52. Configure PolyBase Connectivity
  53. Configure PolyBase Connectivity
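For reference, a hedged summary of the 'hadoop connectivity' values as documented for SQL Server 2016-2019; check the Microsoft docs for the exact provider versions supported by your build:

      -- 'hadoop connectivity' values (per the Microsoft docs at the time):
      --   0 = Hadoop connectivity disabled
      --   1 = Hortonworks HDP 1.3 on Windows Server + Azure Blob Storage
      --   2 = Hortonworks HDP 1.3 on Linux
      --   3 = Cloudera CDH 4.3 on Linux
      --   4 = Hortonworks HDP 2.x on Windows Server + Azure Blob Storage
      --   5 = Hortonworks HDP 2.x on Linux
      --   6 = Cloudera CDH 5.x on Linux
      --   7 = Hortonworks HDP 2.x on Linux and Windows Server + Azure Blob Storage (default)
      EXEC sp_configure 'hadoop connectivity', 7;
      RECONFIGURE;
      -- The new value only takes effect after restarting SQL Server
      -- and both PolyBase services.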
  54. Restart Services
  55. Restart Services
  56. Create Database Master Key
  57. Create Database Master Key CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
  58. Create Database Scoped Credential
  59. Create Database Scoped Credential CREATE DATABASE SCOPED CREDENTIAL <CredentialName> WITH IDENTITY = '<identity>', SECRET = '<secret>';
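A filled-in sketch with hypothetical names: for Azure Blob Storage the IDENTITY string is not used for authentication and the SECRET is the storage account key, while for Oracle the credential is simply the database login.

      -- The master key protects the secrets of the credentials below:
      CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0meStr0ng&SecurePassword!';

      -- Azure Blob Storage: SECRET = storage account access key
      CREATE DATABASE SCOPED CREDENTIAL AzureBlobCredential
      WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

      -- Oracle: IDENTITY/SECRET = Oracle login and password
      CREATE DATABASE SCOPED CREDENTIAL OracleCredential
      WITH IDENTITY = 'oracle_user', SECRET = '<oracle-password>';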
  60. Create External Data Sources
  61. Create External Data Source
  62. Create External Data Source
  63. Create External Data Source
  64. Create External Data Source CREATE EXTERNAL DATA SOURCE <HadoopName> WITH ( TYPE = HADOOP, LOCATION = '<hdfs://...>', CREDENTIAL = <CredentialName>, RESOURCE_MANAGER_LOCATION = '<ip>' );
  65. Create External Data Source CREATE EXTERNAL DATA SOURCE <AzureBlobName> WITH ( TYPE = HADOOP, LOCATION = '<wasbs://...>', CREDENTIAL = <CredentialName> );
  66. Create External Data Source CREATE EXTERNAL DATA SOURCE <OracleName> WITH ( LOCATION = '<oracle://...>', CREDENTIAL = <CredentialName> );
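Filled in with hypothetical names (reusing the credentials sketched earlier), the Azure Blob Storage and Oracle variants might look like this; note that the new SQL Server 2019 relational sources need no TYPE clause:

      CREATE EXTERNAL DATA SOURCE AzureBlobStore
      WITH (
          TYPE = HADOOP,
          LOCATION = 'wasbs://demo@mystorageaccount.blob.core.windows.net',
          CREDENTIAL = AzureBlobCredential
      );

      CREATE EXTERNAL DATA SOURCE OracleSales
      WITH (
          LOCATION = 'oracle://oracleserver:1521',  -- no TYPE for SQL Server 2019 sources
          CREDENTIAL = OracleCredential
      );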
  67. Create External File Formats
  68. Create External File Format
  69. Create External File Format
  70. Create External File Format
  71. Create External File Format CREATE EXTERNAL FILE FORMAT <FileFormatName> WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS ( FIELD_TERMINATOR = ';', USE_TYPE_DEFAULT = TRUE ) );
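For comma-separated files with quoted strings, a variant of the format above might look like the sketch below (option names per CREATE EXTERNAL FILE FORMAT; the name CsvFileFormat is hypothetical):

      CREATE EXTERNAL FILE FORMAT CsvFileFormat
      WITH (
          FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (
              FIELD_TERMINATOR = ',',
              STRING_DELIMITER = '"',     -- strips the quotes around text values
              USE_TYPE_DEFAULT = TRUE     -- missing fields get the type's default value
          )
      );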
  72. Create External Tables
  73. Create External Table
  74. Create External Table
  75. Create External Table
  76. Create External Table CREATE EXTERNAL TABLE [SchemaName].[TableName] ( [ColumnName] INT NOT NULL ) WITH ( LOCATION = '<FileName>', DATA_SOURCE = <DataSourceName>, FILE_FORMAT = <FileFormatName> );
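Tying the hypothetical pieces together, a concrete external table over the blob store could look like this; once created, it can be queried like any local table:

      CREATE EXTERNAL TABLE dbo.SalesExternal (
          SaleId     INT            NOT NULL,
          CustomerId INT            NOT NULL,
          Amount     DECIMAL(10, 2) NOT NULL
      )
      WITH (
          LOCATION = '/sales/2019/',     -- file or folder path within the data source
          DATA_SOURCE = AzureBlobStore,
          FILE_FORMAT = CsvFileFormat
      );

      SELECT TOP (10) * FROM dbo.SalesExternal;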
  77. Create Statistics
  78. Create Statistics Note: To create statistics, SQL Server first imports the external data into a temp table. Remember to choose sampling or full scan. Note: Updating statistics is not supported. Drop and re-create instead.
  79. Create Statistics CREATE STATISTICS <StatName> ON <TableName>(<ColumnName>); CREATE STATISTICS <StatName> ON <TableName>(<ColumnName>) WITH FULLSCAN;
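Because updating statistics is not supported on external tables, a refresh is a drop followed by a re-create, sketched here with hypothetical names:

      DROP STATISTICS dbo.SalesExternal.CustomerIdStats;
      CREATE STATISTICS CustomerIdStats
          ON dbo.SalesExternal (CustomerId) WITH FULLSCAN;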
  80. All Done
  81. Verify using Catalog Views SELECT * FROM sys.external_data_sources; SELECT * FROM sys.external_file_formats; SELECT * FROM sys.external_tables;
  82. T-SQL All The Things :)
  83. …or…?
  84. PolyBase can be grumpy :(
  85. Unexpected error encountered filling record reader buffer: HadoopExecutionException: Not enough columns in this line.
  86. Unexpected error encountered filling record reader buffer: HadoopExecutionException: Too many columns in the line.
  87. Unexpected error encountered filling record reader buffer: HadoopExecutionException: Could not find a delimiter after string delimiter.
  88. Unexpected error encountered filling record reader buffer: HadoopExecutionException: Error converting data type NVARCHAR to INT.
  89. Unexpected error encountered filling record reader buffer: HadoopExecutionException: Conversion failed when converting the NVARCHAR value '"0"' to data type BIT.
  90. Unexpected error encountered filling record reader buffer: HadoopExecutionException: Too long string in column [-1]: Actual len = [4242]. MaxLEN=[4000]
  91. Msg 46518, Level 16, State 12, Line 1: The type 'nvarchar(max)' is not supported with external tables.
  92. Msg 2717, Level 16, State 2, Line 1: The size (10000) given to the parameter exceeds the maximum allowed (4000).
  93. Msg 131, Level 15, State 2, Line 1: The size (10000) given to the column exceeds the maximum allowed for any data type (8000).
  94. = Know your data :)
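Besides sizing columns to match the data (note the 4000/8000 limits above), the reject options on CREATE EXTERNAL TABLE can keep occasional dirty rows from failing a whole query. A hedged sketch with an illustrative threshold and hypothetical names:

      CREATE EXTERNAL TABLE dbo.SalesExternalTolerant (
          SaleId     INT            NOT NULL,
          CustomerId INT            NOT NULL,
          Amount     DECIMAL(10, 2) NOT NULL
      )
      WITH (
          LOCATION = '/sales/2019/',
          DATA_SOURCE = AzureBlobStore,
          FILE_FORMAT = CsvFileFormat,
          REJECT_TYPE = VALUE,   -- count rejected rows...
          REJECT_VALUE = 100     -- ...and only fail after 100 of them
      );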
  95. SQL Server 2019 Big Data Clusters
  96. SQL Server 2019 Big Data Clusters SQL Server, Spark, and HDFS Scalable clusters of containers Runs on Kubernetes
  97. [Diagram: a SQL Server Master Instance pod alongside multiple Kubernetes pods, each running SQL Server, an HDFS Data Node, and Spark]
  98. Build Virtual Data Layer
  99. Build Virtual Data Layer Scenarios: 1. Text Files in Azure Blob Storage 2. Tables in Oracle Database
  100. Text Files in Azure Blob Storage
  101. Tables in Oracle Database
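The end state of the demo is a single T-SQL query spanning both sources; roughly, with all names hypothetical:

      -- Join CSV files in Azure Blob Storage with a table virtualized from Oracle:
      SELECT c.CustomerName, SUM(s.Amount) AS TotalAmount
      FROM dbo.SalesExternal AS s        -- external table over text files in blob storage
      JOIN dbo.OracleCustomers AS c      -- external table over an Oracle table
          ON c.CustomerId = s.CustomerId
      GROUP BY c.CustomerName
      ORDER BY TotalAmount DESC;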
  102. DEMO Build Virtual Data Layer in SSMS
  103. It's as easy as 1, 2, 3!
  104. …4, 5, 6, 7, 8, 9, 10…
  105. Is there an easier way?
  106. Biml 💚 PolyBase Ben Weissman: Using Biml to automagically keep your external polybase tables in sync! https://www.solisyon.de/biml-polybase-external-tables/
  107. Azure Data Studio 1. Install Azure Data Studio docs.microsoft.com/en-us/sql/azure-data-studio/download 2. Install Extension: SQL Server 2019 (Preview) docs.microsoft.com/en-us/sql/azure-data-studio/sql-server-2019-extension
  108. Extension: SQL Server 2019 (Preview)
  109. Extension: SQL Server 2019 (Preview) Double-clicking the .vsix file doesn't work…
  110. Extension: SQL Server 2019 (Preview) …install preview extensions from Azure Data Studio
  111. Azure Data Studio Wizard: CSV Files
  112. Azure Data Studio Wizard: Oracle
  113. DEMO Build Virtual Data Layer in ADS
  114. Next Steps
  115. Where can I learn more? Microsoft SQL Docs: docs.microsoft.com/sql
  116. Where can I learn more? Kevin Feasel: 36chambers.wordpress.com/polybase
  117. How can I try PolyBase? Microsoft Hands-on Labs: microsoft.com/handsonlabs
  118. How can I try SQL Server 2019? For Windows, Linux, and containers: aka.ms/trysqlserver2019
  119. How can I try Big Data Clusters? SQL Server 2019 Early Adoption Program: aka.ms/eapsignup
  120. @cathrinew cathrinew.net hi@cathrinew.net Thank you very much!
  121. Thank you very much for your attention.
