SQL Server Big Data Cluster Layout
[Diagram: a Kubernetes cluster of nodes with persistent storage. The control plane holds the controller and the SQL Server master instance. The compute plane holds compute pools, each made of SQL Compute nodes. The storage plane holds a data pool of SQL Data nodes and a storage pool of HDFS Data Nodes, each running Spark and SQL Server in a Kubernetes pod; data is read directly from HDFS. Inputs: IoT data and external data sources. Consumers: analytics, custom apps, and BI.]
What is Kubernetes and what does it do?
Kubernetes is a container orchestrator. It is responsible for:
Running a cluster of hosts
Scheduling containers to run on different hosts
Facilitating communication between the containers
Providing and controlling access to and from the outside world
Tracking and optimizing resource usage
Similar solutions
Docker Swarm, Mesos Marathon, Amazon ECS, HashiCorp Nomad
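The responsibility "schedule containers to run on different hosts" can be illustrated with a toy placement function. This is a minimal sketch, not how the Kubernetes scheduler is implemented: the container names, node names, CPU figures, and the "most free CPU wins" rule are all assumptions made for illustration.

```python
def schedule(containers, nodes):
    """Assign each container to the node with the most free CPU (toy rule).

    containers: dict of container name -> CPU request (hypothetical units)
    nodes: dict of node name -> free CPU
    Returns a dict of container name -> node name.
    """
    free = dict(nodes)  # copy so the caller's view of capacity is untouched
    placement = {}
    # Place the largest requests first so they are less likely to be stranded.
    for name, cpu in sorted(containers.items(), key=lambda kv: -kv[1]):
        target = max(free, key=free.get)
        if free[target] < cpu:
            raise RuntimeError(f"no node has {cpu} CPU free for {name}")
        free[target] -= cpu
        placement[name] = target
    return placement


# Hypothetical workload: three containers, two equally sized nodes.
placement = schedule({"web": 2, "db": 3, "cache": 1}, {"node-1": 4, "node-2": 4})
print(placement)
```

A real scheduler weighs many more signals (memory, affinity, taints); the point here is only that placement is computed from requests and remaining capacity.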
Master Nodes
Responsible for managing the cluster
Typically more than one is installed
In HA mode, one master node is the leader
Can be reached via CLI (kubectl), APIs, or Dashboard
[Diagram: three master nodes, each running a Scheduler, a Controller, an api-server, and a Key-Value Store.]
Scheduler: schedules the work on different nodes
Controller: takes care of 1) control loops and 2) desired state
api-server: 1) performs administrative tasks and 2) stores cluster state
Key-Value Store: etcd is used, and it can be 1) part of the master or 2) installed externally
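The "control loops" and "desired state" ideas can be sketched as a reconcile step: compare what is wanted with what is observed and emit the actions that close the gap. This is a simplified illustration with hypothetical workload names, not the Kubernetes controller implementation.

```python
def reconcile(desired, actual):
    """Return the actions that move `actual` toward `desired`.

    Both arguments map a workload name to a replica count.
    """
    actions = []
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have < want:
            actions.append(("start", name, want - have))
        elif have > want:
            actions.append(("stop", name, have - want))
    # Anything running that is no longer desired gets stopped entirely.
    for name, have in actual.items():
        if name not in desired and have > 0:
            actions.append(("stop", name, have))
    return actions


def apply(actions, actual):
    """Apply the actions to a copy of the observed state (stands in for real work)."""
    state = dict(actual)
    for verb, name, count in actions:
        delta = count if verb == "start" else -count
        state[name] = state.get(name, 0) + delta
    return {name: n for name, n in state.items() if n > 0}


desired = {"web": 3, "db": 1}        # hypothetical desired state
actual = {"web": 1, "old-job": 2}    # hypothetical observed state
actions = reconcile(desired, actual)
actual = apply(actions, actual)
print(actions, actual)
```

A controller would run this comparison repeatedly; once the observed state matches the desired state, the reconcile step returns no actions.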
(Worker) Nodes
Initially called Minions
Container runtime
containerd, rkt, lxd
Kubelet
Communicates with master
Uses CRI shims
kube-proxy
Network proxy
[Diagram: a node running kube-proxy, Kubelet, and a container runtime, hosting Pod 1 and Pod 2.]
Pods (1)
Smallest unit of scheduling
Contains one or more containers
Containers share the pod environment
Scheduled on nodes
Created via manifest files
[Diagram: a pod containing a main container and supporting containers that share one environment (net, mount, ...).]
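As an illustration of "created via manifest files", the sketch below builds a pod manifest with one main container and one supporting container and prints it as JSON. The pod name, container names, images, labels, and port are hypothetical; only the top-level field names follow the Kubernetes Pod schema.

```python
import json

# Hypothetical names and images, chosen only for illustration.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo-pod", "labels": {"app": "myapp", "version": "v01"}},
    "spec": {
        "containers": [
            # Main container
            {"name": "main", "image": "example/app:1.0", "ports": [{"containerPort": 8080}]},
            # Supporting container that shares the pod environment
            {"name": "log-forwarder", "image": "example/log-forwarder:1.0"},
        ]
    },
}

# kubectl accepts JSON as well as YAML manifests.
manifest_text = json.dumps(pod_manifest, indent=2)
print(manifest_text)
```

Saved to a file, a manifest like this could be submitted with `kubectl apply -f <file>`.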
Pods (2)
Each pod has a unique IP address
Inter-pod communication is via a pod network
Intra-pod communication is via localhost and port
[Diagram: Pod 1 (10.10.20.20) and Pod 2 (10.10.20.21) connected by the pod network; containers inside a pod communicate over localhost.]
Deployments
Even higher-level workload, built on top of ReplicaSets
Simplifies updates and rollbacks
Declarative and imperative approach
Self-documenting
Suitable for versioning
[Diagram: a Deployment wraps a ReplicaSet, which wraps a Pod.]
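The declarative approach can be sketched by generating a Deployment manifest from a few parameters. The application name, images, and replica count are hypothetical; the field layout follows the `apps/v1` Deployment schema. Because the manifest is plain data, each revision can be stored in version control.

```python
def deployment_manifest(name, image, replicas):
    """Build a Deployment manifest as a plain dict (apps/v1 schema)."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels in the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


# Two hypothetical revisions of the same application.
v1 = deployment_manifest("myapp", "example/myapp:1.0", replicas=3)
v2 = deployment_manifest("myapp", "example/myapp:1.1", replicas=3)

# An update is a change to the declared pod template; a rollback is a return to v1.
changed = v1["spec"]["template"] != v2["spec"]["template"]
print(changed)
```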
Services (1)
Provide reliable network endpoint
IP address
DNS name
Port
Expose Pods to the outside world
NodePort (cluster-wide port)
LoadBalancer (cloud-based)
Use an Endpoints object to track Pods
[Diagram: a Service (IP = 10.10.10.1, DNS = demo-svc, Port = 32000) backed by an Endpoints object listing Pod A IP, Pod B IP, ...; Pod A (10.10.20.21) runs on Node 1 and Pod B (10.10.20.22) runs on Node 2.]
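A NodePort Service matching the example values above might be declared as follows. This is a sketch under assumptions: the slide gives only the name `demo-svc` and port 32000, so treating 32000 as the node port, and the `port`, `targetPort`, and selector values, are illustrative guesses.

```python
service_manifest = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "demo-svc"},  # also the DNS name inside the cluster
    "spec": {
        "type": "NodePort",
        "selector": {"app": "myapp"},  # hypothetical label
        "ports": [
            {
                "port": 80,          # port on the Service's cluster IP (assumed)
                "targetPort": 8080,  # port the container listens on (assumed)
                "nodePort": 32000,   # port opened on every node (assumed mapping)
            }
        ],
    },
}

node_port = service_manifest["spec"]["ports"][0]["nodePort"]
print(node_port)
```

The default NodePort range in Kubernetes is 30000-32767, which is consistent with the port shown in the example.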
Services (2)
Services use label selectors to find the Pods they route traffic to
[Diagram: a Service with the selector version=v01, app=myapp, matching two Pods that both carry the labels version=v01, app=myapp.]
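Label selection can be sketched in a few lines: a pod matches when its labels contain every key/value pair in the selector. The third pod and its IP address are hypothetical additions used to show a non-match; this is an illustration of the matching rule, not Kubernetes code.

```python
def select_pods(selector, pods):
    """Return the IPs of pods whose labels contain every selector pair."""
    return [
        pod["ip"]
        for pod in pods
        if all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


pods = [
    {"ip": "10.10.20.21", "labels": {"app": "myapp", "version": "v01"}},
    {"ip": "10.10.20.22", "labels": {"app": "myapp", "version": "v01"}},
    # Hypothetical third pod with a different version label.
    {"ip": "10.10.20.23", "labels": {"app": "myapp", "version": "v02"}},
]

endpoints = select_pods({"app": "myapp", "version": "v01"}, pods)
print(endpoints)
```

The resulting list plays the role of the Endpoints object: it changes as pods with matching labels come and go, while the Service address stays fixed.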
Base node configuration
Applies to nodes across all planes. Services:
kubelet – K8s local agent
kube-proxy – network config and forwarding
supervisord – process monitor and control
fluentd – node logging
flanneld – software-defined network
collectd – OS and application data collection
SQL Big Data watchdog – config sync, watchdog, data collector (DMV, etc.)
[Diagram: a Kubernetes node running watchdog, kubelet, kube-proxy, supervisord, fluentd, flanneld, and collectd.]
Control Plane
External Endpoints:
Kubernetes (REST)
Aris Control Service (REST)
Knox Gateway (REST gateway for Hadoop APIs)
SQL Server Master (TDS gateway for data marts and SQL Master Service)
Services:
etcd
Kubernetes Master Services
Controller
SQL Master instance
SQL Big Data Admin Portal
Knox Gateway
HDFS Name Service
YARN Master
Hive Metastore
InfluxDB (metrics store)
Livy (REST interface for Spark)
Spark Driver
[Diagram: three Kubernetes nodes, each running the base node services plus etcd. Node 1: K8s Master service, Spark driver, SQL Big Data Admin portal, InfluxDB, Grafana. Node 2: Controller, Proxy, SQL Master, HDFS Name Node, Kibana. Node 3: Livy, Knox, Elasticsearch, Hive Metastore, YARN Master.]
Controller
External REST/HTTPS Endpoint
Bootstrap and Build out
Manage Capacity
Configure High Availability and recover from failure (AGs)
Security (authN, authZ, certificate rotation)
Lifecycle (upgrade/downgrade/rollback)
Configuration management
Monitoring - capacity, health, metrics, logs
Troubleshooting – performance, failures
Cluster Admin Portal
[Diagram: the Controller service covers buildout, upgrade/rollback, add/remove capacity, central AuthZ/AuthN, the Cluster Admin Portal, and troubleshooting, alongside the controller metadata.]
SQL Master Instance
TDS endpoint into the cluster
High value data
OLTP server
Data connectors
Machine learning & extensibility
Scalable query engine
[Diagram: master instance availability group with one primary and two readable secondaries.]
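Since the master instance is the TDS endpoint into the cluster, a client connects to it like any SQL Server. The sketch below only builds an ODBC connection string; the host, port, database, and credentials are placeholders, and the driver name assumes "ODBC Driver 17 for SQL Server" is installed. Whether a read-intent connection is actually served by a readable secondary depends on how the availability group is configured.

```python
def master_connection_string(host, port, database, user, password, read_only=False):
    """Build an ODBC connection string for the SQL Server master instance."""
    parts = [
        "Driver={ODBC Driver 17 for SQL Server}",
        f"Server=tcp:{host},{port}",
        f"Database={database}",
        f"UID={user}",
        f"PWD={password}",
    ]
    if read_only:
        # Declares read intent so the connection may be served by a readable secondary.
        parts.append("ApplicationIntent=ReadOnly")
    return ";".join(parts)


# Placeholder endpoint and credentials (documentation IP range, assumed port).
rw = master_connection_string("203.0.113.10", 31433, "master", "admin", "<password>")
ro = master_connection_string("203.0.113.10", 31433, "master", "admin", "<password>", read_only=True)
print(ro)
```

The string could be passed to an ODBC client library such as pyodbc; passwords containing `;` or braces would need escaping, which this sketch does not handle.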
Compute plane
Hosts one or more SQL Compute Pools
A compute pool is a group of instances that forms a data, security, and resource boundary.
A compute pool processes complex distributed queries against the data plane.
Local storage is used for shuffling data if necessary.
[Diagram: four compute pool nodes, each running the base node services and a SQL Engine.]
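Shuffling in a distributed query can be illustrated with a small aggregation: rows are routed so that all rows with the same key reach the same node, each node aggregates locally, and the partial results are merged. This is a generic sketch of the technique, not the SQL Server compute pool implementation; the CRC32 routing, the column names, and the sample rows are assumptions.

```python
import zlib
from collections import defaultdict


def node_for(key, node_count):
    """Pick a compute node for a key with a deterministic hash."""
    return zlib.crc32(str(key).encode("utf-8")) % node_count


def shuffle(rows, key, node_count):
    """Move rows so that all rows sharing a key land on the same node."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[node_for(row[key], node_count)].append(row)
    return buckets


def distributed_sum(rows, key, value, node_count):
    """Sum `value` per `key`: shuffle, aggregate on each node, then merge."""
    totals = {}
    for node_rows in shuffle(rows, key, node_count).values():
        local = defaultdict(int)
        for row in node_rows:
            local[row[key]] += row[value]
        totals.update(local)  # keys never overlap across nodes after the shuffle
    return totals


# Hypothetical IoT readings.
rows = [
    {"device": "a", "reading": 3},
    {"device": "b", "reading": 5},
    {"device": "a", "reading": 4},
    {"device": "c", "reading": 1},
]
totals = distributed_sum(rows, "device", "reading", node_count=4)
print(totals)
```

In a real engine the shuffled rows are what would be written to each node's local storage when they do not fit in memory.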
Data plane
Storage pool:
Data ingestion through Spark (batch and streaming)
Data storage in HDFS
Data access through HDFS and SQL endpoints. The SQL engine reads files in HDFS directly
Data pool:
Partitioned, in-memory cache for external data
Scale-out data storage for append only data sets
Data ingestion through Spark
Provide persistent SQL Server storage for the cluster
[Diagram: two storage pool nodes, each running the base node services, a SQL Engine, HDFS, and Spark; one data pool node running the base node services and a SQL Engine.]
Installation, configurations, and tools
Installation methods:
• Cloud - platform such as Azure Kubernetes Service (AKS)
• On-premises - VMs, bare metal
• Localhost - using minikube (for training and testing only)
Configurations:
• All-in-one single node and different multi-node options
Tools:
• mssqlctl, kubectl, Azure Data Studio, SQL Server 2019 extension, Azure CLI (for AKS), mssql-cli, sqlcmd, curl
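As a small example of using kubectl from a script, the sketch below wraps `kubectl get pods -n <namespace> -o json` and summarizes pod phases. The helper names and the sample pod names are hypothetical; the parser is exercised here on a hand-written sample so that no cluster is needed.

```python
import json
import subprocess


def kubectl_get_pods(namespace):
    """Run `kubectl get pods` for a namespace and return the raw JSON text."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def pod_phases(pods_json):
    """Map each pod name to its phase (Pending, Running, Succeeded, ...)."""
    items = json.loads(pods_json)["items"]
    return {item["metadata"]["name"]: item["status"]["phase"] for item in items}


# Hand-written sample in the shape kubectl returns; the pod names are made up.
sample = json.dumps(
    {
        "items": [
            {"metadata": {"name": "master-0"}, "status": {"phase": "Running"}},
            {"metadata": {"name": "compute-0-0"}, "status": {"phase": "Pending"}},
        ]
    }
)
phases = pod_phases(sample)
print(phases)
```

Against a live cluster, `pod_phases(kubectl_get_pods("<namespace>"))` would give the same summary for the namespace the cluster was deployed into.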