"Modern" data warehouse/data lake architectures often brim with layers and services. Such systems can manage and analyze petabytes of data. But all of this has its price (complexity, latency, stability), and not every project ends up happy with this approach.
The talk traces the journey from a technology-infatuated solution to an environment tailored to the users' needs. It shows the bright and dark sides of massively parallel systems and aims to sharpen the senses for taking in real customer requirements.
2. Agenda/Goal
3. Manufacturing companies are becoming data companies
• 10 years ago: selling hardware
• 5 years ago: transition to a leasing business
− Challenge → how to maintain the hardware for leasing?
− Solution → put sensors on everything, and predictive maintenance on IoT data
• Trend now: manufacturing companies are becoming data companies
4. Architecture Overview for Information Management
[Architecture diagram: sources (ERP, CRM, BW, IoT, non-structured data, …) are loaded via Theobald IS & SSIS, Azure Data Factory (ADF), IoT Hub, Stream Analytics and Blob Storage into Azure DW (via Polybase) and the Data Lake Store; analysis runs on SSAS, Data Lake Analytics, HDInsight (Spark) and Azure Machine Learning. Consumers: data scientists (Visual Studio, R-Studio, local SQL, MSR), local IT (ADF/SSIS, Direct Query/Process against local SQL reporting & analysis on Azure SQL DB), Market Reach (Excel, MS Access, apps via SQL/API) and product managers (browser/API). Arrow types: push, pull, SQL connect.]

ADW:
+ No size limit
+ Scalability
+ Polybase
− Less compatible
− Concurrency limit
− No row-level security
− No referential integrity

SQL DB / SSAS:
+ Row-level security
+ High concurrency
+ High T-SQL compatibility
− Limited DB size
5. Architecture Overview for Information Management
6. Azure SQL Data Warehouse performance advantage
Overview
SQL Data Warehouse's industry-leading price-performance comes from leveraging the Azure ecosystem and core SQL Server engine improvements to produce massive gains in performance. These benefits require no customer configuration and are provided out of the box for every data warehouse.
• Gen2 adaptive caching – uses non-volatile memory solid-state drives (NVMe) to increase the I/O bandwidth available to queries.
• Azure FPGA-accelerated networking enhancements – move data at rates of up to 1 GB/sec per node to improve queries.
• Instant data movement – leverages multi-core parallelism in the underlying SQL Servers to move data efficiently between compute nodes.
• SQL Query Optimizer – ongoing investments in distributed query optimization.
10. Tables – Distributions

CREATE TABLE Sales.Order
(
    OrderId INT NOT NULL,
    Date DATE NOT NULL,
    Name VARCHAR(2),
    Country VARCHAR(2)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([OrderId]) | ROUND_ROBIN | REPLICATE
);

Round-robin distributed
Distributes table rows evenly across all distributions at random.

Hash distributed
Distributes table rows across the compute nodes by using a deterministic hash function to assign each row to one distribution.

Replicated
A full copy of the table is accessible on each compute node.
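As a concrete sketch, the three options can each be spelled out as complete DDL. The table layouts follow the Sales.Order sample above; the extra table names and the dimension example are illustrative, not part of the original deck:

```sql
-- Hash-distributed: co-locates rows with the same OrderId on one distribution;
-- a good fit for large fact tables that are frequently joined on OrderId.
CREATE TABLE Sales.Order_Hash
(
    OrderId INT NOT NULL,
    Date    DATE NOT NULL,
    Country VARCHAR(2)
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([OrderId]));

-- Round-robin: spreads rows evenly regardless of content;
-- the default, and a reasonable choice for staging tables.
CREATE TABLE Sales.Order_Staging
(
    OrderId INT NOT NULL,
    Date    DATE NOT NULL,
    Country VARCHAR(2)
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN);

-- Replicated: a full copy on every compute node; suitable for small
-- dimension tables, because joins then need no data movement.
CREATE TABLE Sales.Country_Dim
(
    Country VARCHAR(2)  NOT NULL,
    Name    VARCHAR(50) NOT NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = REPLICATE);
```

For hash distribution the column should have many distinct values and appear in frequent joins; otherwise the data skews onto few distributions.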
13. Architecture Overview for Information Management
• Slow load and query performance
• Limited number of concurrent queries (improved with ADW Gen2)
14. Design Decisions/Table Distributions
All tables are currently round-robin distributed. There are two main reasons for this:
• When tables are loaded through Theobald, using a drop-and-create approach, they are created with default settings, which means a round-robin distribution
• They are loaded via the head node
An optimal data distribution would allow all compute nodes to be used equally during the most frequent operations. In our case, this is inserting and reading data. At the moment, the most intensive data reads are the ones on the Salesdetails view:
• These read a month of XXXX data and join it with the Date, TerritoryHierarchy and Regions tables to identify the data that needs to be sent to the spoke.
• Distributing by region, or even by company code, would not give enough distinct values to use all nodes evenly.
• Distributing by month would reduce data movement but would not provide any load balancing during querying, as all data would come from the same node.
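If the team later moves away from the round-robin default, a hash distribution can be applied with CTAS plus a rename. This is a sketch only; the table name FactSalesDetails and the hash column DocumentNumber are illustrative, not taken from the real project:

```sql
-- Re-create the round-robin table as hash-distributed via CTAS
CREATE TABLE [dbo].[FactSalesDetails_Hash]
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([DocumentNumber]))
AS
SELECT * FROM [dbo].[FactSalesDetails];

-- Swap the tables so downstream views keep working
RENAME OBJECT [dbo].[FactSalesDetails] TO [FactSalesDetails_old];
RENAME OBJECT [dbo].[FactSalesDetails_Hash] TO [FactSalesDetails];
DROP TABLE [dbo].[FactSalesDetails_old];
```

A document number typically has enough distinct values to avoid the skew problems described above, unlike region or company code.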
15. Architecture Overview for Information Management
16. Architecture Overview for Information Management
17. Data lake platform: hub-spoke concept using ADW
[Diagram: SAP BW, SAP ERP, SAP CRM, SAP ByD and other sources are loaded via Theobald and ADF into the Azure SQL DWH hub. ADF moves data onward to read-only spokes per region (E1–E4 Spoke) and to read-write spokes for local applications (E1W–E4W Spoke). Legend: data movement vs. data reference (Elastic Query).]
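The "data reference (Elastic Query)" links require a one-time setup on each spoke database. A minimal sketch; the server name, database name, credential and passwords are placeholders, not the project's real values:

```sql
-- On a spoke database: prerequisites for elastic query against the hub
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword!1>';

CREATE DATABASE SCOPED CREDENTIAL HubCredential
WITH IDENTITY = 'hub_reader', SECRET = '<StrongPassword!2>';

-- External data source pointing at the hub database
CREATE EXTERNAL DATA SOURCE HubSource
WITH
(
    TYPE = RDBMS,
    LOCATION = 'myserver.database.windows.net',
    DATABASE_NAME = 'HubDb',
    CREDENTIAL = HubCredential
);
```

External tables created against this data source can then be queried on the spoke as if they were local objects.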
18. The data platform landscape: Azure Blob Storage, Hadoop, Azure Data Lake Storage, MySQL, PostgreSQL, MariaDB, SQL Server in Azure, Azure SQL Data Warehouse, Azure Cosmos DB, SQL Server Hyperscale & data virtualization
19. SQL Server PaaS offerings
• SQL Database (PaaS): Singleton and Elastic Pool (general availability), Managed Instance (preview)
• SQL Server in a VM

System  | Current | Near Future | +1 Y Future
SAP BW  | 1 TB    | 4 TB        | 8 TB
20. Data lake platform: hub-spoke concept using ADW
• Sub-optimal availability
• Long load times
• Bad query performance using distributed queries
21. Data lake platform: hub-spoke concept using ADW
• Sub-optimal availability
• Long load times
• Bad query performance using distributed queries
• Component SLAs: 99.9 % / 99.99 % / 99.99 %; chained across hub and spokes, at most 99.89 % resp. 99.98 %
22. Data lake platform: hub-spoke concept using ADW
• Sub-optimal availability
• Long load times
• Bad query performance using distributed queries
• Most metadata is copied to each of the spokes
• Spoke E1 represents 60 % of all transactions
23. AdventureWorks sample mapped to the referenced project
– views could replace multiple copy operations
[Diagram: a single SQL Server instance hosts the read-only spoke for all regions and the read-write spoke for local applications; region views expose the data instead of per-region copies. Sources: SAP ERP, SAP CRM, SAP ByD, … loaded via Theobald.]
24. Data lake platform: hub-spoke concept using ADW
• Sub-optimal availability
• Long load times
• Bad query performance using distributed queries
26. Spoke database on Azure SQL Server
[Diagram: each Azure SQL database (DB1 … DBn) runs in its own SQL Server instance, each with its own CPU, memory (DB buffer cache, procedure cache, log cache) and files.]
Hub and spoke objects are stored in different SQL Server instances
– access via the network (external tables), e.g.:

-- e.g. in [AdventureWorksDW2017_US]
CREATE EXTERNAL TABLE [dbo].[DimScenario]
(
    [ScenarioKey] [int] NOT NULL,
    [ScenarioName] [nvarchar](50) NULL
)
WITH
(
    DATA_SOURCE = [AdventureWorksDW2017],
    SCHEMA_NAME = N'Mart_US',
    OBJECT_NAME = N'DimScenario'
);
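Once defined, the external table is queried on the spoke like a local object. A join with a local spoke table (the fact table name below is hypothetical) is resolved over the network by elastic query:

```sql
-- Run on the spoke: join the remote hub dimension with a local fact table
SELECT s.[ScenarioName],
       SUM(f.[Amount]) AS TotalAmount
FROM [dbo].[DimScenario] AS s          -- external table (data lives in the hub)
INNER JOIN [dbo].[FactFinance_Local] AS f  -- hypothetical local spoke table
    ON f.[ScenarioKey] = s.[ScenarioKey]
GROUP BY s.[ScenarioName];
```

As the execution-plan comparisons on the following slides show, such distributed joins of local and remote objects can become expensive.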
27. Execution Plans
(Test 9: spoke query accessing hub and spoke objects)
Azure SQL Database
28. Azure SQL Server Managed Instance
[Diagram: one SQL Server instance with shared CPU, memory (DB buffer cache, procedure cache, log cache) and files hosting DB1, DB2, … DBn.]
• Data from all databases shares the same memory space
• Access via simple three-part naming is possible: db.schema.object
29. Spoke database on SQL Server Managed Instance
[Diagram: the read-only spoke for all regions, the read-write spoke for local applications and the region views all live in one SQL Server instance with shared CPU, memory and files (DB1, DB2, … DBn).]
All objects are stored in the same SQL Server instance
– three-part naming can be used to access objects in another database, e.g.:

-- e.g. in [AdventureWorksDW2017_US]
CREATE VIEW [dbo].[DimScenario]
AS
SELECT *
FROM [AdventureWorksDW2017].[Mart_US].[DimScenario];

or

-- e.g. in [AdventureWorksDW2017_US]
CREATE VIEW [dbo].[DimScenario]
AS
SELECT [ScenarioKey],
       [ScenarioName]
FROM [AdventureWorksDW2017].[Mart_US].[DimScenario];
30. Execution Plans
(Test 9: spoke query accessing hub and spoke objects)
Azure SQL Database
Managed Instance
31. Layers and objects (via sample AdventureWorksDW)
• Schema .dbo
• Schemas .Mart_DE and .Mart_US
• Most views will be 1:1 mappings; some views will filter data
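The two view flavors can be sketched against the AdventureWorksDW sample. The Germany filter via SalesTerritoryKey = 8 is an assumption for illustration, taken from the sample's DimSalesTerritory data:

```sql
-- 1:1 mapping: the regional mart simply exposes the shared dimension
CREATE VIEW [Mart_DE].[DimProduct]
AS
SELECT * FROM [dbo].[DimProduct];

-- Filtering view: the German mart only sees its own territory's facts
CREATE VIEW [Mart_DE].[FactInternetSales]
AS
SELECT *
FROM [dbo].[FactInternetSales]
WHERE [SalesTerritoryKey] = 8;  -- assumption: 8 = Germany in DimSalesTerritory
```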
32. Test Queries (1)

Select some attributes with filter (2 columns / 6 rows):
SELECT LastName, FirstName
FROM dbo.dimCustomer
WHERE LastName = 'Adams'
  AND NumberChildrenAtHome = 3;

Aggregate single table (1 column / 1 row):
SELECT SUM([SalesAmount])
FROM [dbo].[FactInternetSales];

Aggregate single table with simple filter (1 column / 1 row):
SELECT SUM([SalesAmount])
FROM [dbo].[FactInternetSales]
WHERE [SalesOrderLineNumber] > 1;

Aggregate single table with filter (1 column / 1 row – tests filter push down):
WITH FilterKriteria AS
(SELECT 1 AS [MinSalesOrderLineNumber])
SELECT SUM([SalesAmount])
FROM [dbo].[FactInternetSales]
CROSS JOIN FilterKriteria
WHERE [SalesOrderLineNumber] > [MinSalesOrderLineNumber];

Inner database join and aggregate (2 columns / 130 rows – tests join push down):
SELECT [EnglishProductName], SUM(SalesAmount)
FROM [dbo].[FactInternetSales] [S]
INNER JOIN [dbo].[DimProduct] [P]
    ON [S].[ProductKey] = [P].[ProductKey]
GROUP BY [P].[EnglishProductName]
ORDER BY 2 DESC;
33. Test Queries (2)

Transfer data to spoke (27 columns / 21'344 rows – tests transfer speed of data between databases):
SELECT *
INTO [dbo].[LocalFactInternetSales]
FROM [dbo].[FactInternetSales];

Cross database join with aggregate (3 columns / 1'761 rows – tests handling of cross-database joins):
SELECT [OrderDateKey], [Gender], SUM([SalesAmount])
FROM [dbo].[LocalFactInternetSales] AS [S]
INNER JOIN [dbo].[DimCustomer] AS [C]
    ON [S].[CustomerKey] = [C].[CustomerKey]
GROUP BY [OrderDateKey], [Gender];

[Diagram: hub query vs. spoke query against the read-only spoke for all regions and the read-write spoke for local applications, with region views in between.]
34. Performance Tests (1)
[Bar chart, scale 0–60: runtimes of test package one (select some attributes with filter, aggregate single table, aggregate single table with simple filter, aggregate single table with filter, inner database join and aggregate) for Azure SQL Server hub, Azure SQL Server spoke, Managed Instance hub and Managed Instance spoke.]
35. Performance Tests (2)
[Bar chart, scale 0–400: runtimes of test package two (transfer data to spoke, cross database join with aggregate) for Azure SQL Server hub, Azure SQL Server spoke, Managed Instance hub and Managed Instance spoke.]
• Part of today's ETL performance problem
• Issue if a spoke uses local spoke data and remote hub data in one query
36. New architecture
[Diagram: read-only spoke for all regions, read-write spoke for local applications and region views consolidated in one SQL Server instance.]
From a system with:
• 1 Azure SQL DWH
• 9 Azure SQL DBs
to:
• 1 Azure SQL MI (SLA 99.99 %)
37. Future Option
The same platform landscape as on slide 18 (Azure Blob Storage, Hadoop, Azure Data Lake Storage, MySQL, PostgreSQL, MariaDB, SQL Server in Azure, Azure SQL Data Warehouse, Azure Cosmos DB), now highlighting SQL Server Hyperscale & data virtualization as the future option.
38. Hyperscale service tier for up to 100 TB
• Support for up to 100 TB of database size
• Higher overall performance due to higher log throughput and faster transaction commit times, regardless of data volumes
• Nearly instantaneous database backups (snapshots of Azure Blob storage)
• Fast database restores (based on file snapshots)
• Rapid read scale-out
• Rapid scale-up
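Moving an existing Azure SQL database to Hyperscale is a single statement. A sketch; the database name and the service objective HS_Gen5_4 are illustrative and depend on the chosen hardware generation and core count:

```sql
-- Migrate an Azure SQL database to the Hyperscale service tier
ALTER DATABASE [AdventureWorksDW2017]
MODIFY (EDITION = 'Hyperscale', SERVICE_OBJECTIVE = 'HS_Gen5_4');
```

Note that, at least initially, this migration was one-way: a Hyperscale database could not simply be moved back to another service tier.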
39. SQL Server Data Virtualization
• Allows the data to stay in its original location while you virtualize it in a SQL Server instance
• There it can be queried like any other table in SQL Server
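With SQL Server 2019 PolyBase, data virtualization looks similar to the elastic-query setup on the spokes. A sketch with illustrative names; RemoteCred, RemoteSrc, the remote host and the SalesDb table are assumptions:

```sql
-- On the virtualizing SQL Server 2019 instance (PolyBase feature installed)
CREATE DATABASE SCOPED CREDENTIAL RemoteCred
WITH IDENTITY = 'reader', SECRET = '<StrongPassword!3>';

-- Point at another SQL Server as the remote source
CREATE EXTERNAL DATA SOURCE RemoteSrc
WITH (LOCATION = 'sqlserver://remotehost:1433', CREDENTIAL = RemoteCred);

-- Virtualized table: the data stays remote but is queried locally
CREATE EXTERNAL TABLE [dbo].[RemoteOrders]
(
    [OrderId] INT NOT NULL,
    [Amount]  DECIMAL(18, 2) NULL
)
WITH (DATA_SOURCE = RemoteSrc, LOCATION = 'SalesDb.dbo.Orders');

SELECT COUNT(*) FROM [dbo].[RemoteOrders];
```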
40. Conclusion: KISS (Keep It Simple [not Stupid])
• All the services used are excellent services
• Azure SQL Data Warehouse
• Azure SQL Database
• Azure SQL MI
• (SQL in a VM) → don't treat PaaS as a religion
• Theobald
• Technical implementation details can make the difference
• Transparent does not necessarily mean fast!
• My personal advice
• Try to use as few and as "simple" services as possible (but not fewer)
• For each service you use, you should have a good chain of arguments for why you use it
• POCs help you to understand the different technologies
• There is no free lunch
• E.g. with databases like Azure SQL Data Warehouse or Cosmos DB you get "endless scale", but you must deal with data distributions/partitions