Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session

Deep Dive On SQL Server and
Big Data
Travis Wright, Mihaela Blendea, Umachandar Jayachandran
Program Managers, SQL Server

Build intelligent apps and
AI with all your data
Analyzing all data
Easily and securely manage
data big and small
Managing all data
Simplified management and analysis through a unified deployment, governance, and tooling
SQL Server enables
intelligence over all your data
Unified access to all your data with
unparalleled performance
Integrating all data

Easily deploy and manage a
SQL Server + Big Data cluster
Easily deploy and manage a Big Data cluster using Microsoft’s
Kubernetes-based Big Data solution built-in to SQL Server
Hadoop Distributed File System (HDFS) storage, SQL Server
relational engine, and Spark analytics are deployed as containers
on Kubernetes in one easy-to manage package
Benefits of containers and Kubernetes:
Fast to deploy
Self-contained – no installation required
Upgrades are easy because - just upload a new image
Scalable, multi-tenant, designed for elasticity

SQL Server Big Data Cluster Layout
Compute pool
SQL Compute
Node
SQL Compute
Node
SQL Compute
Node
…
Compute pool
SQL Compute
Node
IoT data
Directly
read from
HDFS
Persistent storage
…
Storage pool
SQL
Server
Spark
HDFS Data Node
SQL
Server
Spark
HDFS Data Node
SQL
Server
Spark
HDFS Data Node
Kubernetes pod
Analytics
Custom
apps BI
SQL Server
master instance
Node Node Node Node Node Node Node
SQL
Data pool
SQL Data
Node
SQL Data
Node
Compute pool
SQL Compute
Node
Storage Storage
Controller

Base node configuration
Applies to nodes across all planes
Services
kubelet – K8s local agent
kube-proxy – network config and forwarding
supervisord – process monitor and control
fluentd – node logging
flanneld – Software defined network
collectd – OS and application data collection
SQL Big Data watchdog– config sync, watchdog, data
collector (DMV, etc) Kubernetes node
watchdog
kubelet
supervisord
fluentd
kube-proxy
flanneld
collectd

Control plane
External Endpoints
Kubernetes (REST)
Aris Control Service (REST)
Knox Gateway (REST gateway for Hadoop APIs)
SQL Server Master (TDS gateway for data marts and SQL
Master Service)
Services
etcd
Kubernetes Master Services
Controller
SQL Master instance
SQL Big Data Admin Portal
Knox Gateway
HDFS Name Service
YARN Master
Hive Metastore
InfluxDB (metrics store)
Livy (REST interface for Spark)
Spark Driver
Kubernetes node
Base node services + etcd
Controller
SQL Master
Proxy
HDFS Name Node
Kubernetes node
Kubernetes Master Services
SQL Big Data Admin Portal
Spark Driver
InfluxDB
Kubernetes node
Livy
Elastic Search
Knox
Hive Metastore
Grafana
Kibana
Yarn Master

Controller
External REST/HTTPS Endpoint
Bootstrap and Build out
Manage Capacity
Configure High Availability and recover from failure (AGs)
Security (authN, authZ, certificate rotation)
Lifecycle (upgrade/downgrade/rollback)
Configuration management
Monitoring - capacity, health, metrics, logs
Troubleshooting – performance, failures
Cluster Admin Portal
Controller Service
Buildout
Upgrade/Rollback
HADR
Add/Remove Capacity
Central AuthZ/AuthN
Cluster Admin Portal
Troubleshooting

SQL Master instance
TDS endpoint into the cluster
High value data
OLTP server
Data connectors
Machine learning & extensibility
Scalable query engine with readable secondary replicas
Built-in high availability with Always On Availability Groups
(coming soon)
Master Instance Availability Group

Compute plane
Hosts one or more SQL Compute Pools
Compute pool is a group of instances that forms a data,
security, and resource boundary.
Compute pool processes complex distributed queries against
the data plane.
Local storage is used for shuffling data if necessary.
Compute pool node
Base node services
SQL Engine
Compute pool node
Base node services
SQL Engine
Compute pool node
Base node services
SQL Engine
Compute pool node
Base node services
SQL Engine

Data plane
Storage pool
Data ingestion through Spark (batch and streaming)
Data storage in HDFS
Data access through HDFS and SQL endpoints
SQL engine reads files in HDFS directly
Data pool
Partitioned, in-memory cache for external data or HDFS
Scale-out data storage for append only data sets
Data ingestion through Spark
Storage pool node
Base node services
SQL Engine
Data pool node
Base node services
SQL Engine
HDFS
Spark
Storage pool node
Base node services
SQL Engine
HDFS
Spark












SQL Server
T-SQLAnalytics Apps
ODBC NoSQL Relational databases Big Data
PolyBase external tables
SQL Server is the hub for integrating data
Easily combine across relational and non-relational data stores

Scale-out data pools combine and cache data from many
sources for fast querying
Scenario
 A global car manufacturing company wants to join data
from across multiple sources including HDFS, SQL Server,
and Cosmos DB
Solution
• Query data in relational and non-relational data stores with
new PolyBase connectors
• Create a scale-out data pool cache of combined data
• Expose the datasets as a shared data source, without
writing code to move and integrate data
SQL Server
Scale-out data pool
HDFS Cosmos DB SQL Server
Polybase
connectors
Shard 1 Shard nShard 2IoT data

SQL Server can now read directly from HDFS files
Elastically scale compute and storage using HDFS-based
storage pools with SQL Server and Spark built in
Mount and manage remote stores through HDFS
Mount various on-prem and cloud data stores
Accelerate computation by caching data locally
Disaster recovery/Data backup
Storage pool
SQL Server Master instance/Spark
SQL
Server
HDFS Data Node
Spark
SQL
Server
HDFS Data Node
Spark
SQL
Server
HDFS Data Node
Spark
Other HDFS store Remote cloud
store










Data scientists can use familiar tools to analyze
structured and unstructured data
1. Use SQL Ops Studio notebooks run a Spark job
over structured and unstructured data
2. Spark SQL jobs can access data in SQL Server
3. Queries can be pushed down to other data
sources like Oracle Database and Mongo DB
4. The Spark job returns the data to the notebook
SQL Server master instance
External data
sources
Storage pool
Spark Spark Spark
Azure Data
Studio

Model & serve
Business/custom apps
(Structured)
Logs, files and media
(unstructured)
Sensors and IoT
(unstructured)
Predictive
apps
BI tools
Store
HDFS
SQL Server data
pools
Ingest
Spark streaming
Prep & train
Spark
Spark ML
SQL Server
ML Services
SQL Server
master instance
Simplified management and analysis through a unified deployment, governance, and tooling
Integrate structured and unstructured data
SQL Server
master instance
REST API containers
for models

SQL Server master instance
Storage pool
Spark
MLeap Runtime
Spark Spark
Model
Scoring
Training Training Training

Managed SQL Server, Spark
and data lake
Store high volume data in a data lake and access
it easily using either SQL or Spark
Management services, admin portal, and
integrated security make it all easy to manage
SQL Server
Data virtualization
Combine data from many sources without
moving or replicating it
Scale out compute and caching to boost
performance
T-SQL
Analytics Apps
Open
database
connectivity
NoSQL Relational
databases
HDFS
Complete AI platform
Easily feed integrated data from many sources
to your model training
Ingest and prep data and then train, store, and
operationalize your models all in one system
SQL Server
External Tables
Compute pools and data pools
Spark
Scalable, shared storage (HDFS)
External
data sources
Admin portal and management services
Integrated AD-based security
SQL Server
ML Services
Spark & Spark
ML
HDFS
REST API
containers
for models

Apply to join the SQL Server
Early Adoption Program

Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session

Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session

Similar to Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session (20)

More from Travis Wright

More from Travis Wright (17)

Recently uploaded

Recently uploaded (20)

Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session

Editor's Notes