1. SQL Server
Big Data Clusters
Rock Pereira
SQL Saturday, Redmond
April 27, 2019
2. Contents
1. Kubernetes for Data Science
2. SQL Server Big Data Clusters
3. Understand the problem
4. Data exploration and analysis
5. Data-driven application development with Kubernetes
10. 2.1 What is a Big Data Cluster?
Unified data platform for analytics
Data-driven solutions using Kubernetes
Components of a BDC:
● Spark – distributed, in-memory compute
● HDFS – elastic storage
● SQL Server – data hub for structured and unstructured data
● Kubernetes – scale-out, fault-tolerant
11. 2.2 Features
● Deploy anywhere there is managed Kubernetes
● Management services for logging, monitoring, backup, and high availability
● Consistent portal for managing all your clusters
12. 2.3 Polybase
Query HDFS (Azure Blob Storage, Hortonworks, Cloudera) using External Tables in SQL Server
● Manage permissions with Active Directory
● No data duplication – the external data is not persisted in SQL Server
New in SQL Server 2019:
● Connectors to Azure SQL DB, Azure SQL DW, Oracle, Teradata, MongoDB, Azure Cosmos DB, plus any source with an ODBC-compliant driver (IBM DB2, SAP HANA, Microsoft Excel)
● Read CSV & Parquet files
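The External Table flow described above can be sketched in T-SQL roughly as follows. The names used here (StoragePool, csv_format, clicks_hdfs, the /clickstream path, and the column list) are hypothetical placeholders, not part of the deck:

```sql
-- Hypothetical sketch: expose a CSV file in HDFS as an external table.
-- All object names and the HDFS path below are illustrative placeholders.
CREATE EXTERNAL DATA SOURCE StoragePool
    WITH (LOCATION = 'sqlhdfs://controller-svc/default');

CREATE EXTERNAL FILE FORMAT csv_format
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2));

CREATE EXTERNAL TABLE clicks_hdfs (
    click_time DATETIME2,
    user_id    INT,
    url        NVARCHAR(400)
)
WITH (DATA_SOURCE  = StoragePool,
      LOCATION    = '/clickstream',
      FILE_FORMAT = csv_format);

-- Query it like any other table; no data is copied into SQL Server.
SELECT TOP 10 * FROM clicks_hdfs;
```

The external table is only metadata: dropping it removes the definition, not the files in HDFS.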
13. 2.4 Architecture
● Compute Pool: parallel ingest
● Storage Pool: scalable storage and data processing
● SQL Data Pool: caches external data, distributed across SQL Server instances
● SQL Server Master Pool: read-write OLTP; stores dimensional data
14. 2.5 Azure Data Studio
● Work with relational and big data in SQL Server
● HDFS browser – like Azure Storage Explorer
● External Table wizard, including column mapping
● Jupyter-based notebooks
● Collaboration
● Code with IntelliSense
● Submit Spark jobs
15. 2.6 Deploying a Big Data Cluster
Deployment targets: Minikube | On-Prem | Cloud (AKS)
● Minikube: single node; requirements – 32 GB memory, 8 CPUs, 100 GB disk
● On-Prem: use kubeadm
● Cloud (AKS): use the Python deployment script
Set environment variables before deploying.
Tools:
mssqlctl (app_commands, ref), kubectl, Azure CLI
Azure Data Studio + SQL Server 2019 extension
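The "set environment variables before deploying" step looked roughly like this in the SQL Server 2019 CTP releases. The variable names and values below are illustrative assumptions from that era; check the release notes for your build before relying on them:

```shell
# Deployment configuration read by mssqlctl (CTP-era names; illustrative,
# not authoritative -- consult the docs for your specific build).
export ACCEPT_EULA=yes
export CLUSTER_PLATFORM=aks                 # or: kubernetes (kubeadm), minikube
export CONTROLLER_USERNAME=admin
export CONTROLLER_PASSWORD='<strong-password>'   # placeholder, do not hardcode
export DOCKER_REGISTRY=mcr.microsoft.com

# With the variables set, the cluster is created with mssqlctl, e.g.:
# mssqlctl create cluster mssql-cluster
```

kubectl can then be pointed at the resulting namespace to watch the pods for each pool come up.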