
Azure Days 2019: Bigger and More Complex Is Not Always Better (Grösser und Komplexer ist nicht immer besser), Meinrad Weiss


"Modern" data warehouse and data lake architectures are often teeming with layers and services. Such systems can manage and analyze petabytes of data. All of this comes at a price, however (complexity, latency, stability), and not every project is well served by this approach.

The talk traces the journey from a technology-infatuated solution to an environment tailored to user needs. It shows the bright and dark sides of massively parallel systems and aims to sharpen the audience's attention to capturing real customer requirements.

Published in: Technology


  1. Bigger and More Complex Is Not Always Better (Grösser und Komplexer ist nicht immer besser). Meinrad Weiss, Senior Cloud Solution Architect
  2. Agenda/Goal
     "Modern" data warehouse and data lake architectures are often teeming with layers and services.
     • Such systems can manage and analyze petabytes of data.
     • All of this comes at a price, however (complexity, latency, stability), and not every project is well served by this approach.
     The talk traces the journey from a technology-infatuated solution to an environment tailored to user needs. It shows the bright and dark sides of massively parallel systems and aims to sharpen the audience's attention to capturing real customer requirements.
  3. Manufacturing companies are becoming data companies
     • 10 years ago: selling hardware
     • 5 years ago: transition to the leasing business
       − Challenge -> How to maintain the hardware for leasing?
       − Solution -> Put sensors on everything and run predictive maintenance on IoT data
     • Trend now: manufacturing companies are becoming data companies
  4. Architecture Overview for Information Management
     • Diagram labels, sources and ingestion: IoT, Non Structured, CRM, BW, ERP, …; IoT Hub, Stream Analytics, Blob Storage, Polybase; ADF: Azure Data Factory (path 1); Theobald IS & SSIS (path 2); Logic, API, App
     • Diagram labels, platform: Azure DW, Data Lake Store, Data Lake Analysis, HDInsight (Spark), Azure Machine Learning, SSAS, Azure SQL DB, Reporting & Analysis
     • Diagram labels, consumers and access: Data scientist (Visual Studio, R-Studio, MSR), Local IT, Market Reach, Product Mgr.; Excel, MS Access, xxx Apps, Browser, Local SQL; SQL (1), API (2), Direct Query (1), Process (2), Push, Pull, SQL connect
     • ADW: + No size limit, + Scalability, + Polybase, - Less compatible, - Concurrency limit, - No row-level security, - No referential integrity
     • SQL DB / SSAS: + Row-level security, + High concurrency, + High T-SQL compatibility, - Limited DB size
  5. Architecture Overview for Information Management (same diagram and ADW versus SQL DB / SSAS comparison as on slide 4)
  6. Azure SQL Data Warehouse performance advantage: Overview
     SQL Data Warehouse's industry-leading price-performance comes from leveraging the Azure ecosystem and core SQL Server engine improvements to produce massive gains in performance. These benefits require no customer configuration and are provided out of the box for every data warehouse.
     • Gen2 adaptive caching: uses non-volatile memory solid-state drives (NVMe) to increase the I/O bandwidth available to queries.
     • Azure FPGA-accelerated networking enhancements: move data at rates of up to 1 GB/sec per node to improve queries.
     • Instant data movement: leverages multi-core parallelism in the underlying SQL Servers to move data efficiently between compute nodes.
     • SQL Query Optimizer: ongoing investments in distributed query optimization.
  7. Logical overview
  8. Mapping Compute in SQLDW: DW200, 2 compute nodes * 30 distributions = 60 (diagram showing distributions 01 to 60)
  9. Mapping Compute in SQLDW: DW300, 3 compute nodes * 20 distributions = 60 (diagram showing distributions 01 to 60)
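The arithmetic on slides 8 and 9 can be sketched in a few lines. The fixed total of 60 distributions and the 2-node and 3-node examples come from the slides; the function name and the requirement that the node count divides 60 evenly are assumptions made for this illustration.

```python
TOTAL_DISTRIBUTIONS = 60  # fixed number of distributions shown on the slides


def distributions_per_node(compute_nodes: int) -> int:
    """Return how many of the 60 distributions each compute node serves.

    Assumes (as in the slide examples) that the node count divides 60 evenly.
    """
    if compute_nodes < 1 or TOTAL_DISTRIBUTIONS % compute_nodes != 0:
        raise ValueError("node count must be a positive divisor of 60")
    return TOTAL_DISTRIBUTIONS // compute_nodes


print(distributions_per_node(2))  # DW200 example from slide 8: 30
print(distributions_per_node(3))  # DW300 example from slide 9: 20
```

Scaling up therefore does not create more distributions; it only spreads the same 60 over more nodes.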
  10. Tables – Distributions
      CREATE TABLE [Sales].[Order]
      (
          OrderId INT NOT NULL,
          Date    DATE NOT NULL,
          Name    VARCHAR(2),
          Country VARCHAR(2)
      )
      WITH
      (
          CLUSTERED COLUMNSTORE INDEX,
          DISTRIBUTION = HASH([OrderId]) -- alternatives: ROUND_ROBIN | REPLICATE
      );
      • Round-robin distributed: distributes table rows evenly across all distributions at random.
      • Hash distributed: distributes table rows across the compute nodes by using a deterministic hash function to assign each row to one distribution.
      • Replicated: a full copy of the table is accessible on each compute node.
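To make the difference between hash and round-robin distribution on slide 10 concrete, here is a minimal simulation. It is a sketch under stated assumptions: SHA-256 is only a stand-in for the engine's internal hash function (which the slides do not describe), and round-robin is modelled as a simple cycle rather than the random placement mentioned on the slide. The function names are hypothetical.

```python
import hashlib
from collections import Counter

DISTRIBUTIONS = 60  # total number of distributions, as on slides 8 and 9


def hash_distribution(order_id: int) -> int:
    """Deterministically map a key to a distribution (stand-in hash, not the engine's)."""
    digest = hashlib.sha256(str(order_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTIONS


def round_robin_distribution(row_number: int) -> int:
    """Assign rows in turn, ignoring their content (simplified cyclic model)."""
    return row_number % DISTRIBUTIONS


order_ids = list(range(1, 6001))
hash_counts = Counter(hash_distribution(o) for o in order_ids)
rr_counts = Counter(round_robin_distribution(i) for i in range(len(order_ids)))

# The same OrderId always lands in the same distribution under hashing,
# which is what allows joins on OrderId to avoid data movement.
print(hash_distribution(42) == hash_distribution(42))
# Round-robin is perfectly even, but rows with the same key are scattered.
print(min(rr_counts.values()), max(rr_counts.values()))
```

A replicated table needs no assignment function at all, because every compute node holds the full copy.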
  11. (Diagram: rows distributed by OrderId)
  12. (Diagram: rows distributed by OrderId)
  13. Architecture Overview for Information Management (same diagram and ADW versus SQL DB / SSAS comparison as on slide 4), annotated with the observed problems:
      • Slow load and query performance
      • Limited number of concurrent queries (improved with ADW Gen2)
  14. Design Decisions/Table Distributions
      All tables are currently round-robin distributed. There are two main reasons for this:
      • When tables are loaded through Theobald using a drop-and-create approach, they are created with default settings, which means round-robin distribution.
      • They are loaded via the head node.
      An optimal data distribution would let all compute nodes be used equally during the most frequent operations, which in our case are inserting and reading data. At the moment, the most intensive reads are those on the Salesdetails view:
      • These read a month of XXXX data and join it with the Date, TerritoryHierarchy and Regions tables to identify the data that needs to be sent to the spoke.
      • Distributing by region, or even by company code, would not give enough distinct values to use all nodes evenly.
      • Distributing by month would reduce data movement but would not provide any load balancing during querying, as all data would come from the same node.
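The argument on slide 14 about too few distinct values can be checked with a short experiment: a hash-distributed table can never occupy more distributions than its key has distinct values. This is a sketch only. The region codes, the row counts and the SHA-256 stand-in hash are assumptions for illustration, not details from the project.

```python
import hashlib

DISTRIBUTIONS = 60


def distribution_for(value: str) -> int:
    """Stand-in deterministic hash; the real engine's function is not public in the slides."""
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTIONS


def distributions_used(keys) -> set:
    """Return the set of distributions that receive at least one row."""
    return {distribution_for(k) for k in keys}


regions = ["R1", "R2", "R3", "R4"]  # hypothetical low-cardinality key
region_rows = [regions[i % len(regions)] for i in range(10_000)]
order_rows = [f"order-{i}" for i in range(10_000)]  # hypothetical high-cardinality key

print(len(distributions_used(region_rows)))  # never more than 4 of the 60
print(len(distributions_used(order_rows)))
```

With only four key values, at least 56 distributions stay empty no matter how many rows are loaded, so most of the compute capacity would sit idle.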
  15. Architecture Overview for Information Management (same diagram as on slide 4), annotated with the observed problems:
      • Slow load and query performance
      • Limited number of concurrent queries (improved with ADW Gen2)
  16. Architecture Overview for Information Management (same diagram and ADW versus SQL DB / SSAS comparison as on slide 4)
  17. Data lake platform: Hub – spoke concept using ADW
      • Sources: SAP BW, SAP ERP, SAP CRM, SAP ByD, …
      • Load into the hub: Theobald and ADF into Azure SQL DWH (HUB)
      • Read-only spoke per region, filled via ADF: E1 Spoke, E2 Spoke, E3 Spoke, E4 Spoke
      • Read-write spoke for local applications: E1W Spoke, E2W Spoke, E3W Spoke, E4W Spoke
      • Legend: Data Reference (Elastic Query), Data Movement
  18. SQL Server Hyperscale & Data virtualization
      • Diagram labels: Azure Blob Storage, Hadoop, Azure Data Lake Storage, MySQL, PostgreSQL, MariaDB, SQL Server in Azure, Azure SQL Data Warehouse, Azure Cosmos DB
  19. SQL Server PaaS offerings
      • SQL Database (PaaS): Elastic Pool, Managed Instance, Singleton
      • SQL Server in a VM
      • Status labels on the slide: General availability, General availability, Preview
      • Size outlook:
          System    Current    Near Future    + 1 Y Future
          SAP BW    1 TB       4 TB           8 TB
  20. Data lake platform: Hub – spoke concept using ADW (same diagram as on slide 17), annotated with the observed problems:
      • Sub-optimal availability
      • Long load times
      • Bad query performance using distributed queries
  21. Data lake platform: Hub – spoke concept using ADW (same diagram as on slide 17), annotated with the observed problems:
      • Sub-optimal availability
      • Long load times
      • Bad query performance using distributed queries
      • SLA annotations in the diagram: 99.9 %, 99.99 %, 99.99 %, max 99.89 %, 99.98 %
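The "max 99.89 %" and "99.98 %" figures on slide 21 are consistent with multiplying the SLAs of services that a request depends on in series. The transcript does not say which services are chained, so the pairings below are an assumption, as is the premise of independent failures; the function name is hypothetical.

```python
from math import prod


def composite_sla(slas):
    """Availability of a chain where every component must be up (independent failures assumed)."""
    return prod(slas)


# Assumed pairing: a 99.9 % service in series with a 99.99 % service
print(f"{composite_sla([0.999, 0.9999]):.2%}")   # 99.89%
# Assumed pairing: two 99.99 % services in series
print(f"{composite_sla([0.9999, 0.9999]):.2%}")  # 99.98%
```

Every additional service in the path can only lower the composite figure, which is one reason the talk argues for fewer components.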
  22. Data lake platform: Hub – spoke concept using ADW (same diagram as on slide 17), annotated with the observed problems:
      • Sub-optimal availability
      • Long load times
      • Bad query performance using distributed queries
      • Most metadata is copied to each of the spokes
      • Spoke E1 represents 60% of all transactions
  23. AdventureWorks sample mapped to the referenced project
      • Views could replace multiple copy operations
      • Diagram labels: Theobald, SAP CRM, SAP ERP, SAP ByD, …; SQL Server Instance; Read Only Spoke For all Regions; Read Write Spoke for local applications; Region Views
  24. Data lake platform: Hub – spoke concept using ADW (same diagram as on slide 17), annotated with the observed problems:
      • Sub-optimal availability
      • Long load times
      • Bad query performance using distributed queries
  25. Azure SQL Database
  26. Spoke Database on Azure SQL Server
      • Diagram: one SQL Server instance per database (DB1 … DBn), each with its own CPU, memory (DB buffer cache, procedure cache, log cache) and files; Read Only Spoke For all Regions, Read Write Spoke for local applications, Region Views
      • Hub and spoke objects are stored in different SQL Server instances
        − Access via network (external tables)
      -- e.g. in [AdventureWorksDW2017_US]
      CREATE EXTERNAL TABLE [dbo].[DimScenario]
      (
          [ScenarioKey]  [int] NOT NULL,
          [ScenarioName] [nvarchar](50) NULL
      )
      WITH
      (
          DATA_SOURCE = [AdventureWorksDW2017]
         ,SCHEMA_NAME = N'Mart_US'
         ,OBJECT_NAME = N'DimScenario'
      )
  27. Execution Plans (Test 9: spoke query accessing hub and spoke objects), Azure SQL Database
  28. Azure SQL Server Managed Instance
      • Diagram: a single SQL Server instance (CPU, memory with DB buffer cache, procedure cache and log cache, files) hosting DB1, DB2 … DBn
      • Data from all databases share the same memory space
      • Access via simple three-part naming is possible: db.schema.object
  29. Spoke Database on SQL Server Managed Instance
      • Diagram: a single SQL Server instance (CPU, memory with DB buffer cache, procedure cache and log cache, files) hosting DB1, DB2 … DBn; Read Only Spoke For all Regions, Read Write Spoke for local applications, Region Views
      • All objects are stored in the same SQL Server instance
        − Three-part naming can be used to access objects in another database
      -- e.g. in [AdventureWorksDW2017_US]
      CREATE VIEW [dbo].[DimScenario] AS
      SELECT * FROM [AdventureWorksDW2017].[Mart_US].[DimScenario];
      or
      -- e.g. in [AdventureWorksDW2017_US]
      CREATE VIEW [dbo].[DimScenario] AS
      SELECT [ScenarioKey]
            ,[ScenarioName]
      FROM [AdventureWorksDW2017].[Mart_US].[DimScenario];
  30. Execution Plans (Test 9: spoke query accessing hub and spoke objects), Azure SQL Database Managed Instance
  31. Layers and objects (via sample AdventureWorksDW)
      • Schemas shown: .dbo, .Mart_DE, .Mart_US
      • Most views will be 1:1 mappings
      • Some views will filter data
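Because most spoke views are plain 1:1 mappings, their definitions could be generated rather than written by hand. The helper below is a hypothetical sketch, not part of the presented solution: it only builds the T-SQL text, using the database, schema and table names from slide 29, and assumes the names come from a trusted source (no escaping of `]` is attempted).

```python
def one_to_one_view(hub_db: str, mart_schema: str, table: str, columns: list[str]) -> str:
    """Build T-SQL for a spoke view that maps 1:1 to a hub object via three-part naming."""
    column_list = ", ".join(f"[{name}]" for name in columns)
    return (
        f"CREATE VIEW [dbo].[{table}] AS "
        f"SELECT {column_list} "
        f"FROM [{hub_db}].[{mart_schema}].[{table}];"
    )


ddl = one_to_one_view(
    "AdventureWorksDW2017", "Mart_US", "DimScenario",
    ["ScenarioKey", "ScenarioName"],
)
print(ddl)
```

Listing the columns explicitly, as in the second variant on slide 29, keeps the view stable if columns are later added to the hub table.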
  32. Test Queries (1)
      • Select some attributes with filter (result: 2 columns/6 rows)
          SELECT LastName, FirstName FROM dbo.dimCustomer WHERE LastName = 'Adams' AND NumberChildrenAtHome = 3;
      • Aggregate single table (result: 1 column/1 row)
          select sum([SalesAmount]) from [dbo].[FactInternetSales]
      • Aggregate single table with simple filter (result: 1 column/1 row)
          select sum([SalesAmount]) from [dbo].[FactInternetSales] where [SalesOrderLineNumber] > 1
      • Aggregate single table with filter (result: 1 column/1 row; tests filter push down)
          with FilterKriteria as ( select 1 as [MinSalesOrderLineNumber])
          select sum([SalesAmount]) from [dbo].[FactInternetSales] cross join FilterKriteria where [SalesOrderLineNumber] > [MinSalesOrderLineNumber]
      • Inner database join and aggregate (result: 2 columns/130 rows; tests join push down)
          select [EnglishProductName], sum(SalesAmount) from [dbo].[FactInternetSales] [S] inner join [dbo].[DimProduct] [P] on [S].[ProductKey] = [P].[ProductKey] group by [P].[EnglishProductName] order by 2 desc
  33. Test Queries (2)
      • Transfer data to spoke (result: 27 columns/21,344 rows; tests transfer speed of data between databases)
          select * into [dbo].[LocalFactInternetSales] from [dbo].[FactInternetSales]
      • Cross database join with aggregate (result: 3 columns/1761 rows; tests handling of cross database joins)
          select [OrderDateKey], [Gender], sum([SalesAmount]) from [dbo].[LocalFactInternetSales] as [S] inner join [dbo].[DimCustomer] as [C] on [S].[CustomerKey] = [C].[CustomerKey] group by [OrderDateKey], [Gender]
      • Diagram labels: SQL Server Instance, Read Only Spoke For all Regions, Read Write Spoke for local applications, Region Views, Hub query, Spoke query
  34. Performance Tests (1): bar chart "Test package one" (axis 0 to 60, unit not stated in the transcript) comparing Azure SQL Server Hub, Azure SQL Server Spoke, Managed Instance Hub and Managed Instance Spoke for the five queries of Test Queries (1)
  35. Performance Tests (2): bar chart "Test package two" (axis 0 to 400, unit not stated in the transcript) comparing the same four configurations for the two queries of Test Queries (2)
      • Transfer data to spoke: part of today's ETL performance problem
      • Cross database join with aggregate: an issue if the spoke uses local spoke data and remote hub data in one query
  36. New architecture
      • Diagram labels: SQL Server Instance, Read Only Spoke For all Regions, Read Write Spoke for local applications, Region Views
      • From a system with 1 Azure SQL DWH and 9 Azure SQL DBs
      • To 1 Azure SQL MI (SLA 99.99%)
  37. SQL Server Hyperscale & Data virtualization: future option
      • Diagram labels: Azure Blob Storage, Hadoop, Azure Data Lake Storage, MySQL, PostgreSQL, MariaDB, SQL Server in Azure, Azure SQL Data Warehouse, Azure Cosmos DB
  38. Hyperscale service tier for up to 100 TB
      • Support for up to 100 TB of database size
      • Higher overall performance due to higher log throughput and faster transaction commit times, regardless of data volumes
      • Nearly instantaneous database backups (snapshots of Azure Blob storage)
      • Fast database restores (based on file snapshots)
      • Rapid read scale-out
      • Rapid scale-up
  39. SQL Server Data Virtualization
      • Allows the data to stay in its original location while being virtualized in a SQL Server instance
      • There it can be queried like any other table in SQL Server
  40. Conclusion: KISS (Keep It Simple [not Stupid])
      • All used services are excellent services
        − Azure SQL Data Warehouse
        − Azure SQL Database
        − Azure SQL MI
        − (SQL in a VM) -> Don't take PaaS as a religion
        − Theobald
      • Technical implementation details can make the difference
        − Transparent does not necessarily mean fast!
      • My personal advice
        − Try to use as few and as "simple" services as possible (but not fewer)
        − For each service you use, you should have a good chain of arguments for why you use it
        − POCs help you understand the different technologies
      • There is no free lunch
        − E.g. with databases like Azure SQL Data Warehouse or Cosmos DB you get "endless scale", but you must deal with data distributions/partitions
  41. © Copyright Microsoft Corporation. All rights reserved.
