Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session
1.
2. Deep Dive On SQL Server and
Big Data
Travis Wright, Mihaela Blendea, Umachandar Jayachandran
Program Managers, SQL Server
3. Build intelligent apps and
AI with all your data
Analyzing all data
Easily and securely manage
data big and small
Managing all data
Simplified management and analysis through a unified deployment, governance, and tooling
SQL Server enables
intelligence over all your data
Unified access to all your data with
unparalleled performance
Integrating all data
5. Easily deploy and manage a
SQL Server + Big Data cluster
Easily deploy and manage a Big Data cluster using Microsoft’s
Kubernetes-based Big Data solution built-in to SQL Server
Hadoop Distributed File System (HDFS) storage, SQL Server
relational engine, and Spark analytics are deployed as containers
on Kubernetes in one easy-to manage package
Benefits of containers and Kubernetes:
Fast to deploy
Self-contained – no installation required
Upgrades are easy because - just upload a new image
Scalable, multi-tenant, designed for elasticity
6. SQL Server Big Data Cluster Layout
Compute pool
SQL Compute
Node
SQL Compute
Node
SQL Compute
Node
…
Compute pool
SQL Compute
Node
IoT data
Directly
read from
HDFS
Persistent storage
…
Storage pool
SQL
Server
Spark
HDFS Data Node
SQL
Server
Spark
HDFS Data Node
SQL
Server
Spark
HDFS Data Node
Kubernetes pod
Analytics
Custom
apps BI
SQL Server
master instance
Node Node Node Node Node Node Node
SQL
Data pool
SQL Data
Node
SQL Data
Node
Compute pool
SQL Compute
Node
Storage Storage
Controller
7.
8. Base node configuration
Applies to nodes across all planes
Services
kubelet – K8s local agent
kube-proxy – network config and forwarding
supervisord – process monitor and control
fluentd – node logging
flanneld – Software defined network
collectd – OS and application data collection
SQL Big Data watchdog– config sync, watchdog, data
collector (DMV, etc) Kubernetes node
watchdog
kubelet
supervisord
fluentd
kube-proxy
flanneld
collectd
9. Control plane
External Endpoints
Kubernetes (REST)
Aris Control Service (REST)
Knox Gateway (REST gateway for Hadoop APIs)
SQL Server Master (TDS gateway for data marts and SQL
Master Service)
Services
etcd
Kubernetes Master Services
Controller
SQL Master instance
SQL Big Data Admin Portal
Knox Gateway
HDFS Name Service
YARN Master
Hive Metastore
InfluxDB (metrics store)
Livy (REST interface for Spark)
Spark Driver
Kubernetes node
Base node services + etcd
Controller
SQL Master
Proxy
HDFS Name Node
Kubernetes node
Base node services + etcd
Kubernetes Master Services
SQL Big Data Admin Portal
Spark Driver
InfluxDB
Kubernetes node
Base node services + etcd
Livy
Elastic Search
Knox
Hive Metastore
Grafana
Kibana
Yarn Master
10. Controller
External REST/HTTPS Endpoint
Bootstrap and Build out
Manage Capacity
Configure High Availability and recover from failure (AGs)
Security (authN, authZ, certificate rotation)
Lifecycle (upgrade/downgrade/rollback)
Configuration management
Monitoring - capacity, health, metrics, logs
Troubleshooting – performance, failures
Cluster Admin Portal
Controller Service
Buildout
Upgrade/Rollback
HADR
Add/Remove Capacity
Central AuthZ/AuthN
Cluster Admin Portal
Troubleshooting
11. SQL Master instance
TDS endpoint into the cluster
High value data
OLTP server
Data connectors
Machine learning & extensibility
Scalable query engine with readable secondary replicas
Built-in high availability with Always On Availability Groups
(coming soon)
Master Instance Availability Group
12. Compute plane
Hosts one or more SQL Compute Pools
Compute pool is a group of instances that forms a data,
security, and resource boundary.
Compute pool processes complex distributed queries against
the data plane.
Local storage is used for shuffling data if necessary.
Compute pool node
Base node services
SQL Engine
Compute pool node
Base node services
SQL Engine
Compute pool node
Base node services
SQL Engine
Compute pool node
Base node services
SQL Engine
13. Data plane
Storage pool
Data ingestion through Spark (batch and streaming)
Data storage in HDFS
Data access through HDFS and SQL endpoints
SQL engine reads files in HDFS directly
Data pool
Partitioned, in-memory cache for external data or HDFS
Scale-out data storage for append only data sets
Data ingestion through Spark
Storage pool node
Base node services
SQL Engine
Data pool node
Base node services
SQL Engine
HDFS
Spark
Storage pool node
Base node services
SQL Engine
HDFS
Spark
18. SQL Server
T-SQLAnalytics Apps
ODBC NoSQL Relational databases Big Data
PolyBase external tables
SQL Server is the hub for integrating data
Easily combine across relational and non-relational data stores
19.
20. Scale-out data pools combine and cache data from many
sources for fast querying
Scenario
A global car manufacturing company wants to join data
from across multiple sources including HDFS, SQL Server,
and Cosmos DB
Solution
• Query data in relational and non-relational data stores with
new PolyBase connectors
• Create a scale-out data pool cache of combined data
• Expose the datasets as a shared data source, without
writing code to move and integrate data
SQL Server
Scale-out data pool
HDFS Cosmos DB SQL Server
Polybase
connectors
Shard 1 Shard nShard 2IoT data
21.
22. SQL Server can now read directly from HDFS files
Elastically scale compute and storage using HDFS-based
storage pools with SQL Server and Spark built in
Mount and manage remote stores through HDFS
Mount various on-prem and cloud data stores
Accelerate computation by caching data locally
Disaster recovery/Data backup
Storage pool
SQL Server Master instance/Spark
SQL
Server
HDFS Data Node
Spark
SQL
Server
HDFS Data Node
Spark
SQL
Server
HDFS Data Node
Spark
Other HDFS store Remote cloud
store
27. Data scientists can use familiar tools to analyze
structured and unstructured data
1. Use SQL Ops Studio notebooks run a Spark job
over structured and unstructured data
2. Spark SQL jobs can access data in SQL Server
3. Queries can be pushed down to other data
sources like Oracle Database and Mongo DB
4. The Spark job returns the data to the notebook
SQL Server master instance
External data
sources
Storage pool
Spark Spark Spark
Azure Data
Studio
28. Model & serve
Business/custom apps
(Structured)
Logs, files and media
(unstructured)
Sensors and IoT
(unstructured)
Predictive
apps
BI tools
Store
HDFS
SQL Server data
pools
Ingest
Spark streaming
Prep & train
Spark
Spark ML
SQL Server
ML Services
SQL Server
master instance
Simplified management and analysis through a unified deployment, governance, and tooling
Integrate structured and unstructured data
SQL Server
master instance
REST API containers
for models
30. SQL Server master instance
Storage pool
Spark
MLeap Runtime
Spark Spark
Model
Scoring
Training Training Training
31.
32.
33. Managed SQL Server, Spark
and data lake
Store high volume data in a data lake and access
it easily using either SQL or Spark
Management services, admin portal, and
integrated security make it all easy to manage
SQL Server
Data virtualization
Combine data from many sources without
moving or replicating it
Scale out compute and caching to boost
performance
T-SQL
Analytics Apps
Open
database
connectivity
NoSQL Relational
databases
HDFS
Complete AI platform
Easily feed integrated data from many sources
to your model training
Ingest and prep data and then train, store, and
operationalize your models all in one system
SQL Server
External Tables
Compute pools and data pools
Spark
Scalable, shared storage (HDFS)
External
data sources
Admin portal and management services
Integrated AD-based security
SQL Server
ML Services
Spark & Spark
ML
HDFS
REST API
containers
for models
Challenge: Managing relational and Big Data is complicated
Pillar: Simplify big data management with improved performance and security
Kubernetes-based Big Data distribution is easy to deploy, upgrade and patch, and scales in seconds using compute pools
visually tie this container to the containers in the previous slide
zoom in to show you what's inside one of the containers
Challenge: Data movement can be problematic
Pillar: Gain unified access to all your data using data virtualization
Query across relational and non-relational data stores including Oracle, Teradata, Mongo etc
Besides supporting Hadoop, now PolyBase will let you query over RDBMS, NoSQL and generic ODBC sources.
Increase performance for data virtualization using data pools in scale-out data pools
SCENARIO: Scale out SQL Server with Hadoop and Spark to unlock Big Data
Scenario: A car manufacturing company wants to join data from across multiple sources including Cloudera for sensor data, SQL Server that has customer data with PII (that they want to keep in SQL and mask), and Cosmos DB that has connected car data in Azure. Today the car manufacturer has a Cloudera cluster with 100 nodes on-prem and 1.5 Petabytes of data. They are running into issues with Hive performance for interactive queries.
Problem: To join this data now, they would have to move it into a single system, a huge undertaking.
Solution: Create a shared, scalable data lake based on HDFS. Expose these datasets as a shared semantic layer, so that it can be used to business analysts without moving the data.
In this scenario, you still have to refresh the data periodically (still lives in its home database – not in the data lake)
Application developers can access relational and non-relational data using familiar T-SQL commands
Challenge: Managing relational and Big Data is complicated
Pillar: Simplify big data management with improved performance and security
Data scientists can easily analyze data through Spark jobs
Easy tooling to ingest, store, prep & train, model and serve high velocity using unified data management
Spark streaming in the box
Data preparation tools
Integrated Jupyter notebooks
Open source effort
A common serialization format and execution engine
Train your pipeline and score it on a JVM
Supports Spark ML, scikit-learn and Tensorflow
Core execution engine implemented in Scala
Choose from two portable serialization formats (JSON, Protobuf)
MLeap DataFrame
JSON/Avro/Binary
Bring your own Java :
- Windows: 1.10 (1.8 support is coming)
- Linux 1.8 support
Get started today with the SQL Server v.Next preview by going to https://aka.ms/eapsignup