SQL Server
Big Data Clusters
Rock Pereira
SQL Saturday, Redmond
April 27, 2019
Contents
1. Kubernetes for Data Science
2. SQL Server Big Data Clusters
3. Understand the problem
4. Data exploration and analysis
5. Data-driven application development with Kubernetes
1 Kubernetes for Data Science
1.1 What is Kubernetes?
● Docker Containers
● MCR: Microsoft Container Registry
1.2 Benefits
● Build & Configure
● Insight / Observation
● Estimate Compute Needs
● Parameterized Deployment
● Autoscaling
1.3 Team Data Science Lifecycle
1.4 Demo: SQL Server 2019 in Minikube
2 SQL Server Big Data Clusters
2.1 What is a Big Data Cluster?
Unified data platform for analytics
Data-driven solutions using Kubernetes
Components of a BDC:
● Spark - distributed, in-memory compute
● HDFS - elastic storage
● SQL Server - data hub for structured & unstructured data
● Kubernetes - scale-out, fault-tolerant orchestration
2.2 Features
● Deploy anywhere there is managed Kubernetes
● Management services for logging, monitoring, backup and high availability
● Consistent portal for managing all your clusters
2.3 PolyBase
Query HDFS (Azure Blob Storage, Hortonworks, Cloudera) using external tables in SQL Server.
● Manage permissions with Active Directory
● No data duplication – the data is not persisted in SQL Server
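As a minimal sketch of what this looks like in a Big Data Cluster, assuming the CTP-era tutorial syntax (the table, column, and HDFS path names below are illustrative, not from the slides):

    -- Built-in data source for the cluster's HDFS storage pool
    IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
        CREATE EXTERNAL DATA SOURCE SqlStoragePool
        WITH (LOCATION = 'sqlhdfs://controller-svc/default');

    -- File format describing the CSV files (use FORMAT_TYPE = PARQUET for Parquet)
    CREATE EXTERNAL FILE FORMAT csv_file
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2));

    -- External table over a directory in HDFS; the data stays in HDFS
    CREATE EXTERNAL TABLE dbo.web_clickstreams_hdfs
        (wcs_user_sk BIGINT, wcs_web_page_sk BIGINT, wcs_click_date_sk BIGINT)
    WITH (DATA_SOURCE = SqlStoragePool,
          LOCATION = '/clickstream_data',
          FILE_FORMAT = csv_file);

    -- Query HDFS with ordinary T-SQL; nothing is copied into SQL Server
    SELECT TOP 10 * FROM dbo.web_clickstreams_hdfs;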
New in SQL Server 2019:
● Connectors to Azure SQL DB, Azure SQL DW, Oracle, Teradata, MongoDB, Azure Cosmos DB + any ODBC-compliant source with an ODBC driver (IBM DB2, SAP HANA, Microsoft Excel)
● Read CSV & Parquet files
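The new connectors follow the same external-table pattern. A hedged sketch for Oracle, assuming a reachable server and valid credentials (the host, schema, table, and credential names below are placeholders):

    -- Requires a database master key before creating the credential
    CREATE DATABASE SCOPED CREDENTIAL oracle_cred
    WITH IDENTITY = 'oracle_user', SECRET = '<password>';

    -- Data source pointing at the remote Oracle instance
    CREATE EXTERNAL DATA SOURCE oracle_src
    WITH (LOCATION = 'oracle://oracleserver:1521', CREDENTIAL = oracle_cred);

    -- External table mapped to a remote Oracle table; predicates can be
    -- pushed down to Oracle, and no data is duplicated locally
    CREATE EXTERNAL TABLE dbo.inventory
        (inv_date DATE, inv_item_sk INT, inv_quantity_on_hand INT)
    WITH (LOCATION = 'ORCL.SALES.INVENTORY', DATA_SOURCE = oracle_src);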
2.4 Architecture
● Compute Pool: parallel ingest
● Storage Pool: scalable storage and data processing
● SQL Data Pool: caches external data, distributed across SQL Server instances
● SQL Server Master Pool: read-write OLTP, stores dimensional data
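A minimal sketch of how the SQL data pool caches external data, reusing the HDFS external table sketched above (SqlDataPool is the built-in data source for the data pool; the cache table name is illustrative):

    -- External table distributed across the data pool's SQL Server instances
    CREATE EXTERNAL TABLE dbo.web_clickstreams_cache
        (wcs_user_sk BIGINT, wcs_web_page_sk BIGINT, wcs_click_date_sk BIGINT)
    WITH (DATA_SOURCE = SqlDataPool, DISTRIBUTION = ROUND_ROBIN);

    -- Ingest: copy rows from HDFS into the data pool cache in parallel
    INSERT INTO dbo.web_clickstreams_cache
    SELECT wcs_user_sk, wcs_web_page_sk, wcs_click_date_sk
    FROM dbo.web_clickstreams_hdfs;

    -- Reads now fan out across the data pool instances
    SELECT COUNT(*) FROM dbo.web_clickstreams_cache;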
2.5 Azure Data Studio
● Work with relational (big) data in SQL Server
● HDFS browser – like Azure Storage Explorer
● External Table wizard, including column mapping
● Jupyter-based notebooks
● Collaboration
● Code with IntelliSense
● Submit Spark jobs
2.6 Deploying a Big Data Cluster
Deployment targets:
● Minikube – single node; requires 32 GB memory, 8 CPUs, 100 GB disk space
● On-prem – use kubeadm
● Cloud (AKS) – use the python script

Set environment variables before deploying.

Tools:
● mssqlctl (app_commands, ref), kubectl, Azure CLI
● Azure Data Studio + SQL Server 2019 extension
