SQL Server
Big Data Clusters
Rock Pereira
SQL Saturday, Redmond
April 27, 2019
Contents
1. Kubernetes for Data Science
2. SQL Server Big Data Clusters
3. Understand the problem
4. Data exploration and analysis
5. Data-driven application development with Kubernetes
1 Kubernetes for Data Science
1.1 What is Kubernetes?
● Docker Containers
● MCR: Microsoft Container Registry
1.2 Benefits
● Build & Configure
● Insight / Observation
● Estimate Compute Needs
● Parameterized Deployment
● Autoscaling
1.3 Team Data Science Lifecycle
1.4 Demo: SQL Server 2019 in Minikube
2 SQL Server Big Data Clusters
2.1 What is a Big Data Cluster?
Unified data platform for analytics
Data-driven solutions using Kubernetes
Components of a BDC:
● Spark - distributed, in-memory compute
● HDFS - elastic storage
● SQL Server - data hub for structured & unstructured data
● Kubernetes - scale-out, fault-tolerant orchestration
2.2 Features
● Deploy anywhere there is managed Kubernetes
● Management services for logging, monitoring, backup and high availability
● Consistent portal for managing all your clusters
2.3 PolyBase
Query HDFS (Azure Blob Storage, Hortonworks, Cloudera) using external tables in SQL Server.
● Manage permissions with Active Directory
● No data duplication – the data is not persisted in SQL Server
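As a minimal sketch of what this looks like in a Big Data Cluster, assuming the CTP-era tutorial syntax (the table, column, and HDFS path names below are illustrative, not from the slides):

    -- Built-in data source for the cluster's HDFS storage pool
    IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
        CREATE EXTERNAL DATA SOURCE SqlStoragePool
        WITH (LOCATION = 'sqlhdfs://controller-svc/default');

    -- File format describing the CSV files (use FORMAT_TYPE = PARQUET for Parquet)
    CREATE EXTERNAL FILE FORMAT csv_file
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2));

    -- External table over a directory in HDFS; the data stays in HDFS
    CREATE EXTERNAL TABLE dbo.web_clickstreams_hdfs
        (wcs_user_sk BIGINT, wcs_web_page_sk BIGINT, wcs_click_date_sk BIGINT)
    WITH (DATA_SOURCE = SqlStoragePool,
          LOCATION = '/clickstream_data',
          FILE_FORMAT = csv_file);

    -- Query HDFS with ordinary T-SQL; nothing is copied into SQL Server
    SELECT TOP 10 * FROM dbo.web_clickstreams_hdfs;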
New in SQL Server 2019:
● Connectors to Azure SQL DB, Azure SQL DW, Oracle, Teradata, MongoDB, Azure Cosmos DB + any ODBC-compliant source with an ODBC driver (IBM DB2, SAP HANA, Microsoft Excel)
● Read CSV & Parquet files
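The new connectors follow the same external-table pattern. A hedged sketch for Oracle, assuming a reachable server and valid credentials (the host, schema, table, and credential names below are placeholders):

    -- Requires a database master key before creating the credential
    CREATE DATABASE SCOPED CREDENTIAL oracle_cred
    WITH IDENTITY = 'oracle_user', SECRET = '<password>';

    -- Data source pointing at the remote Oracle instance
    CREATE EXTERNAL DATA SOURCE oracle_src
    WITH (LOCATION = 'oracle://oracleserver:1521', CREDENTIAL = oracle_cred);

    -- External table mapped to a remote Oracle table; predicates can be
    -- pushed down to Oracle, and no data is duplicated locally
    CREATE EXTERNAL TABLE dbo.inventory
        (inv_date DATE, inv_item_sk INT, inv_quantity_on_hand INT)
    WITH (LOCATION = 'ORCL.SALES.INVENTORY', DATA_SOURCE = oracle_src);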
2.4 Architecture
● Compute Pool: parallel ingest
● Storage Pool: scalable storage and data processing
● SQL Data Pool: caches external data, distributed across SQL Server instances
● SQL Server Master Pool: read-write OLTP, stores dimensional data
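A minimal sketch of how the SQL data pool caches external data, reusing the HDFS external table sketched above (SqlDataPool is the built-in data source for the data pool; the cache table name is illustrative):

    -- External table distributed across the data pool's SQL Server instances
    CREATE EXTERNAL TABLE dbo.web_clickstreams_cache
        (wcs_user_sk BIGINT, wcs_web_page_sk BIGINT, wcs_click_date_sk BIGINT)
    WITH (DATA_SOURCE = SqlDataPool, DISTRIBUTION = ROUND_ROBIN);

    -- Ingest: copy rows from HDFS into the data pool cache in parallel
    INSERT INTO dbo.web_clickstreams_cache
    SELECT wcs_user_sk, wcs_web_page_sk, wcs_click_date_sk
    FROM dbo.web_clickstreams_hdfs;

    -- Reads now fan out across the data pool instances
    SELECT COUNT(*) FROM dbo.web_clickstreams_cache;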
2.5 Azure Data Studio
● Work with relational (big) data in SQL Server
● HDFS browser – like Azure Storage Explorer
● External Table wizard, including column mapping
● Jupyter-based notebooks
● Collaboration
● Code with IntelliSense
● Submit Spark jobs
2.6 Deploying a Big Data Cluster
Deployment targets:
● Minikube – single node; requires 32 GB memory, 8 CPUs, 100 GB disk space
● On-prem – use kubeadm
● Cloud (AKS) – use the python script

Set environment variables before deploying.

Tools:
● mssqlctl (app_commands, ref), kubectl, Azure CLI
● Azure Data Studio + SQL Server 2019 extension
