Microsoft ignite 2018 SQL Server 2019 big data clusters - intro session

The Future of SQL Server 2019
and Big Data

In 2016, 16.1 ZB of data was generated
In 2025, 163 ZB of data will be generated
*IDC White Paper, Data Age 2025: The Evolution of Data to Life-Critical

Barriers to insights are
barriers to success
The task of generating insights from ever-increasing data is tough

Organizations that transform data into insights
outperform the competition
Source: Keystone Strategy interviews Oct 2015 - Mar 2016
74% of leaders use predictive models
37% of leaders dynamically update data models
Leaders combine structured and unstructured data in a data lake 8X as often
Integrate data
without ETL
Combine data in a
central data store
Perform
predictive analytics
What do these organizations do differently?

Build intelligent apps and
AI with all your data
Analyzing all data
Easily and securely manage
data big and small
Managing all data
Simplified management and analysis through a unified deployment, governance, and tooling model
SQL Server enables
intelligence over all your data
Unified access to all your data with
unparalleled performance
Integrating all data

Data movement is a barrier to
faster insights
Costs
Duplicated storage costs
Engineering effort to build and
maintain data pipelines
Delays while data is integrated before it
can be used
Increased data latency
Increased attack surface area
Inconsistent security models
Data quality issues can be created
by ETL pipelines
Increased governance
issues
Yes, 76%
No, 19%
Don't know, 5%
3/4 of respondents say that
untimely data has inhibited business opportunities
Speed
Security
Quality
Compliance
*IDC 3rd Platform Information Management Requirements Survey, Oct 2016

Data virtualization
creates solutions
Costs
Lower storage costs
Less dev time spent on integration
Rapid iterations and prototypes
Timely data
Smaller attack surface area
Consistent security model
Fresh and accurate data
Easier data governance
Speed
Security
Quality
Compliance
Data virtualization integrates data from disparate
sources, locations and formats, without replicating or
moving the data, to create a single "virtual" data fabric
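As a rough analogy only (this is not PolyBase and not SQL Server), the idea of querying several stores in place can be sketched with Python's built-in sqlite3 module. Two separate database files stand in for two independent sources, and a third "hub" connection attaches both and joins them in a single query without copying rows into a central store first. The file names, table names, and sample rows are all made up for illustration.

```python
import os
import sqlite3
import tempfile


def create_source(path, ddl, insert_sql, rows):
    """Create one standalone SQLite file that plays the role of a data source."""
    con = sqlite3.connect(path)
    try:
        con.execute(ddl)
        con.executemany(insert_sql, rows)
        con.commit()
    finally:
        con.close()


def query_across_sources(orders_path, customers_path):
    """Join two attached sources from a hub connection; no rows are copied."""
    hub = sqlite3.connect(":memory:")
    try:
        hub.execute("ATTACH DATABASE ? AS orders_src", (orders_path,))
        hub.execute("ATTACH DATABASE ? AS customers_src", (customers_path,))
        return hub.execute(
            """
            SELECT c.name, SUM(o.amount)
            FROM customers_src.customers AS c
            JOIN orders_src.orders AS o ON o.customer_id = c.id
            GROUP BY c.name
            ORDER BY c.name
            """
        ).fetchall()
    finally:
        hub.close()


def demo():
    """Build two throwaway sources and return the cross-source totals."""
    with tempfile.TemporaryDirectory() as folder:
        orders_path = os.path.join(folder, "orders.db")
        customers_path = os.path.join(folder, "customers.db")
        create_source(
            orders_path,
            "CREATE TABLE orders (customer_id INTEGER, amount REAL)",
            "INSERT INTO orders VALUES (?, ?)",
            [(1, 20.0), (1, 5.0), (2, 12.5)],
        )
        create_source(
            customers_path,
            "CREATE TABLE customers (id INTEGER, name TEXT)",
            "INSERT INTO customers VALUES (?, ?)",
            [(1, "Ada"), (2, "Grace")],
        )
        return query_across_sources(orders_path, customers_path)


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is the shape of the query: the join names both sources directly, so there is no pipeline to build or keep in sync.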

SQL Server
T-SQL Analytics Apps
ODBC NoSQL Relational databases Big Data
PolyBase external tables
SQL Server is the hub for integrating data
Easily combine across relational and non-relational data stores
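To make the external-table idea more concrete, here is a small Python helper that renders the kind of T-SQL involved. It is a sketch, not a statement of the exact syntax for any given build: the statement shapes follow the PolyBase DDL (CREATE EXTERNAL DATA SOURCE and CREATE EXTERNAL TABLE) as documented for SQL Server 2019 and should be checked against the official reference. The helper name, host, credential, and object names are all hypothetical, and the helper does no escaping.

```python
def external_table_ddl(source_name, location, credential, table, columns, remote_object):
    """Render PolyBase-style DDL as text. No escaping is done: illustration only."""
    column_sql = ",\n    ".join(f"[{name}] {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL DATA SOURCE [{source_name}]\n"
        f"WITH (LOCATION = '{location}', CREDENTIAL = [{credential}]);\n"
        "\n"
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"    {column_sql}\n"
        ")\n"
        f"WITH (LOCATION = '{remote_object}', DATA_SOURCE = [{source_name}]);"
    )


# Hypothetical names throughout; the credential is assumed to exist already.
ddl = external_table_ddl(
    source_name="OracleSales",
    location="oracle://oracle-host.example:1521",
    credential="OracleCredential",
    table="dbo.customers_ext",
    columns=[("customer_id", "INT"), ("name", "NVARCHAR(100)")],
    remote_object="XE.SALES.CUSTOMERS",
)
print(ddl)
```

Once such a table exists, applications query it with ordinary T-SQL and can join it to local tables, while the rows stay in the remote system.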

Complex scale-out deployment
Time-consuming patching and upgrades
Cumbersome security management

Easily deploy and manage a
SQL Server + Big Data cluster
Easily deploy and manage a Big Data cluster using Microsoft’s
Kubernetes-based Big Data solution built into SQL Server
Hadoop Distributed File System (HDFS) storage, the SQL Server
relational engine, and Spark analytics are deployed as containers
on Kubernetes in one easy-to-manage package

Simplified deployment with
containers & Kubernetes
A container is a standardized unit of software that includes
everything needed to run it
Kubernetes is a container hosting platform
Benefits of containers and Kubernetes:
1. Fast to deploy
2. Self-contained – no installation required
3. Upgrades are easy: just upload a new image
4. Scalable, multi-tenant, designed for elasticity
Diagram: a Kubernetes pod containing SQL Server, an HDFS Data Node, and Spark
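The pod-as-a-unit idea and the "upgrade by swapping the image" benefit can be sketched as plain data. The snippet below builds a Kubernetes Pod manifest as a Python dictionary with the three containers named on the slide, then produces an upgraded copy by changing only the image tags. The registry, image names, tags, and label are invented for illustration; the manifests that the actual product deploys are generated by Microsoft's tooling and will differ.

```python
import copy
import json


def storage_pod_manifest(name, image_tag):
    """Describe one pod that bundles SQL Server, an HDFS data node, and Spark."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"role": "storage-pool"}},
        "spec": {
            "containers": [
                {"name": "sql-server", "image": f"example.registry/sql-server:{image_tag}"},
                {"name": "hdfs-datanode", "image": f"example.registry/hdfs-datanode:{image_tag}"},
                {"name": "spark", "image": f"example.registry/spark:{image_tag}"},
            ]
        },
    }


def with_new_image_tag(manifest, new_tag):
    """Return an upgraded copy: same pod definition, only the image tags change."""
    upgraded = copy.deepcopy(manifest)
    for container in upgraded["spec"]["containers"]:
        repository = container["image"].rsplit(":", 1)[0]
        container["image"] = f"{repository}:{new_tag}"
    return upgraded


current = storage_pod_manifest("storage-0", "v1")
upgraded = with_new_image_tag(current, "v2")

if __name__ == "__main__":
    print(json.dumps(upgraded, indent=2))
```

Nothing is installed or patched in place in this model: the upgrade is a new description of the same pod, which the platform then rolls out.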

SQL Server can now read directly from HDFS files
Elastically scale compute and storage using HDFS-based
storage pools with SQL Server and Spark built in
Apps, BI, and analytics access Big Data through the
SQL Server master instance
Scale Big Data on demand
Diagram: custom apps, BI, and analytics connect through SQL to the SQL Server master instance, which sits in front of three Kubernetes nodes. Each node hosts a pod containing SQL Server, an HDFS Data Node, and Spark, backed by persistent storage.

Scale-out data pools combine and cache data from many
sources for fast querying
Scenario
 A global car manufacturing company wants to join data
from across multiple sources including HDFS, SQL Server,
and Cosmos DB
Solution
• Query data in relational and non-relational data stores with
new PolyBase connectors
• Create a scale-out data pool cache of combined data
• Expose the datasets as a shared data source, without
writing code to move and integrate data
Diagram: SQL Server uses PolyBase connectors to HDFS, Cosmos DB, and SQL Server sources and caches the combined data in a scale-out data pool (Shard 1, Shard 2, ..., Shard n)

Persistent storage
SQL Server
Scale-out data pool
IoT data
Extend SQL Server with a scale-out storage tier by
partitioning the data across multiple instances
Speed up query performance by scaling out the filtering
and local aggregation across multiple instances
Shard 1, Shard 2, ..., Shard n
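The two claims on this slide, partitioning data across instances and pushing filtering and local aggregation out to each instance, follow a familiar scatter-gather pattern. The sketch below models it in plain Python: rows are assigned to shards by hashing a key, each shard filters and pre-aggregates its own rows, and only the small partial results are combined. The function names, the choice of CRC32 as the hash, and the IoT-style sample fields ("device", "temp") are assumptions for illustration and say nothing about how SQL Server distributes data internally.

```python
from collections import defaultdict
from zlib import crc32


def shard_for(key, shard_count):
    """Pick a shard for a key with a stable hash."""
    return crc32(key.encode("utf-8")) % shard_count


def partition(rows, shard_count):
    """Spread rows across shards by device key."""
    shards = [[] for _ in range(shard_count)]
    for row in rows:
        shards[shard_for(row["device"], shard_count)].append(row)
    return shards


def local_aggregate(shard, min_temp):
    """Run on each shard: filter rows, then keep only (sum, count) per device."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in shard:
        if row["temp"] >= min_temp:
            totals[row["device"]][0] += row["temp"]
            totals[row["device"]][1] += 1
    return {device: tuple(pair) for device, pair in totals.items()}


def combine(partials):
    """Run once at the top: merge the partial sums and counts into averages."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for device, (total, count) in partial.items():
            merged[device][0] += total
            merged[device][1] += count
    return {device: total / count for device, (total, count) in merged.items()}


def scale_out_average(rows, shard_count, min_temp):
    """Average temperature per device, computed shard by shard."""
    shards = partition(rows, shard_count)
    return combine(local_aggregate(shard, min_temp) for shard in shards)
```

Because each shard returns only a sum and a count per device, the amount of data that has to travel to the combining step stays small however many raw rows each shard holds.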

Increase analytics and apps performance
Diagram: analytics, custom apps, and BI connect through SQL to the SQL Server master instance. Behind it are compute pools of SQL compute nodes, a data pool of SQL data nodes, and a storage pool of Kubernetes pods that each contain SQL Server, Spark, and an HDFS Data Node. IoT data is read directly from HDFS. All pools run on Kubernetes nodes with persistent storage.

Azure Data Studio provides a unified tool for
querying data using a notebook experience for
both T-SQL and Spark
Easily access all your data across SQL Server and
HDFS
The cluster administration portal provides easy-to-use,
cloud-style managed services for HA,
monitoring, backup/recovery, security, and
provisioning.
The REST API and command line tools simplify
automation
The development and management experience is
consistent regardless of where you run – on prem
or any of the major cloud providers

Integrated Big Data and SQL Server security model
Simple, single sign-on with Active Directory authentication
Manage data access with SQL Server security roles
Access reporting for audit and compliance
Central security
and governance
External data sources
Active Directory
App and AI Developer
Impersonation
Active Directory

Developers struggle to access
insights from Big Data
Data science is siloed from
operational data
Lengthy time to train and
operationalize models

Storage pool
Access relational and non-relational data using familiar
T-SQL commands and development frameworks
Enrich apps with data from other sources like Oracle
Database and MongoDB
Build intelligent applications with access to unstructured,
high volume, and high velocity data
Train R and Python models against Big Data stored in
Hadoop and score your application data without ever
leaving SQL Server
Apply easy-to-use tools like Azure Data Studio and Visual
Studio Code
Django framework
Diagram: a storage pool of three pods, each containing SQL Server, an HDFS Data Node, and Spark

Data scientists can use familiar tools to analyze
structured and unstructured data
1. Use Azure Data Studio notebooks to run a Spark
job over structured and unstructured data
2. Spark jobs can access data in SQL Server
through JDBC, Tedious, etc.
3. Queries can access data from other sources
like Oracle Database and MongoDB via
external tables
4. The Spark job returns the data to the notebook
Diagram: a SQL Ops Studio notebook, Spark instances in the storage pool, and external data sources
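Step 2 above, a Spark job reading SQL Server data over JDBC, can be sketched with Spark's generic JDBC data source. The helper below only assembles the connection options, so it runs without Spark installed; the second function shows how those options would be handed to an existing SparkSession. The host, port, database, table, and credentials are placeholders, and the sketch assumes the Microsoft JDBC driver for SQL Server is available on the Spark classpath.

```python
def sql_server_jdbc_options(host, port, database, table, user, password):
    """Build the option set for Spark's generic JDBC reader."""
    return {
        "url": f"jdbc:sqlserver://{host}:{port};databaseName={database}",
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_sql_server_table(spark, options):
    """Load the table as a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Placeholder connection details, for illustration only.
options = sql_server_jdbc_options(
    host="sql-master.example",
    port=1433,
    database="sales",
    table="dbo.orders",
    user="spark_reader",
    password="change-me",
)
```

In a notebook, `read_sql_server_table(spark, options)` would return a DataFrame that the job can join with files read from HDFS before returning results to the notebook.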

Model & serve
Business/custom apps
(Structured)
Logs, files and media
(unstructured)
Sensors and IoT
(unstructured)
Predictive
apps
BI tools
Store
HDFS
SQL Server data
pools
Ingest
Spark streaming
Prep & train
Spark
Spark ML
SQL Server
ML Services
SQL Server
master instance
Simplified management and analysis through a unified deployment, governance, and tooling model
Integrate structured and unstructured data
SQL Server
master instance
REST API containers
for models
SQL Server
Integration Services

Volume Variety Velocity Veracity

Mount and manage remote stores through HDFS
Mount various on-prem and cloud data stores
Accelerate computation by caching data locally
Disaster recovery/Data backup
Diagram: the SQL Server master instance and Spark sit above a storage pool of three pods (each with SQL Server, an HDFS Data Node, and Spark); the storage pool connects to another HDFS store and a remote cloud store

SQL Server 2019 big data & analytics
Managed SQL Server, Spark, and data lake
Store high-volume data in a data lake and access it easily using either SQL or Spark
Management services, admin portal, and integrated security make it all easy to manage
Diagram labels: SQL Server, Spark, scalable shared storage (HDFS), external data sources, admin portal and management services, integrated AD-based security
Data virtualization
Combine data from many sources without moving or replicating it
Scale out compute and caching to boost performance
Diagram labels: T-SQL, analytics, apps, open database connectivity, NoSQL, relational databases, HDFS, SQL Server external tables, compute pools and data pools
Complete AI platform
Easily feed integrated data from many sources to your model training
Ingest and prep data and then train, store, and operationalize your models all in one system
Diagram labels: SQL Server ML Services, Spark & Spark ML, HDFS, REST API containers for models

Intelligence
over all data
drives innovation
Simplified management and analysis through a unified deployment, governance, and tooling model
Analyzing all data
Managing all data
Integrating all data

Apply to join the SQL Server 2019
Early Adoption Program


More from Travis Wright
