Nilesh Gule
@nileshgule | www.HandsOnArchitect.com
Modern Data Warehouse
Using
Azure
$whoami
{
“name” : “Nilesh Gule”,
“website” : “https://www.HandsOnArchitect.com",
“github” : “https://github.com/NileshGule"
“twitter” : “@nileshgule”,
“linkedin” : “https://www.linkedin.com/in/nileshgule”,
“email” : “nileshgule@gmail.com",
“likes” : “Technical Evangelism, Cricket”,
“co-organizer” : “Azure Singapore UG”
}
Credits: James Serra
Part 1 - Recap – ADLS & ADF
• Petabyte scale storage
• Hierarchical namespace
• Hadoop compatible access with ABFS
driver
ADLS - Main features
ADF - Main features
• Cloud ETL service
• Scale-out serverless data integration & data
transformation
• Code-free UI
• Monitoring & Management
Part 2 - Recap
• Collaborative Spark based Analytical service
• Different cluster types (automated / interactive / pool)
• Autoscale based on workloads
• Fine grained access controls
Azure Databricks - Main features
Azure Synapse
Limitless analytics service for
enterprise data warehousing
and
Big Data analytics
Parallelism
• Uses many separate CPUs running in parallel to execute a single
program
• Shared Nothing: Each CPU has its own memory and disk (scale-out)
• Segments communicate using high-speed network between nodes
MPP - Massively
Parallel
Processing
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• All SQL Server implementations up until now have been SMP
• Mostly, the solution is housed on a shared SAN
SMP - Symmetric
Multiprocessing
Synapse Architecture
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/massively-parallel-processing-mpp-architecture
• Control Node
• Compute Node
• Data Movement
Service (DMS)
Components
• Hash
• Round Robin
• Replicate
Distributions
Synapse Data Distributions
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/massively-parallel-processing-mpp-architecture
• Highest query perf for joins &
aggregations on large tables
• Rows per distribution varies
Hash
• Fastest query performance for
small tables
Replicated
ALTER DATABASE ContosoDW MODIFY
(service_objective = 'DW1000');
DWU
DW100
DW200
DW300
DW400
DW500
DW1000
DW1500
DW2000
DW2500
DW3000
DW5000
DW6000
DW7500
DW10000
DW15000
DW30000
Azure SQL Data Warehouse
Engine Worker1
Azure Storage Blob(s)
D12D11 D13 D14 D15 D16 D18D17 D19 D20
D22D21 D23 D24 D25 D26 D28D27 D29 D30
D32D31 D33 D34 D35 D36 D38D37 D39 D40
D42D41 D43 D44 D45 D46 D48D47 D49 D50
D52D51 D53 D54 D55 D56 D58D57 D59 D60
D2D1 D3 D4 D5 D6 D8D7 D9 D10
Azure SQL Data Warehouse
Engine
Worker4
Azure Storage Blob(s)
Worker1
Worker5
Worker3
Worker2
Worker6 D52D51 D53 D54 D55 D56 D58D57 D59 D60
D12D11 D13 D14 D15 D16 D18D17 D19 D20
D22D21 D23 D24 D25 D26 D28D27 D29 D30
D32D31 D33 D34 D35 D36 D38D37 D39 D40
D42D41 D43 D44 D45 D46 D48D47 D49 D50
D2D1 D3 D4 D5 D6 D8D7 D9 D10
Azure Databricks – SQL DW Connectivity
External Data Sources
• External Data Source
• Hadoop, ADLS
• External File Format
• File types
• Delimited Text, Hive RCFile, Hive ORC file, Parquet
• Data Compression
• Gzip, Snappy
• Field Delimiters
• Date Format
• External Table
What workloads are NOT suitable?
• High frequency reads and writes.
• Large numbers of singleton
selects.
• High volumes of single row
inserts.
Operational workloads (OLTP)
• Row by row processing needs.
• Incompatible formats (XML).
Data Preparations
SQL
SQL
What Workloads are Suitable?
Store large volumes of data.
Consolidate disparate data into a single location.
Shape, model, transform and aggregate data.
Batch/Micro-batch loads.
Perform query analysis across large datasets.
Ad-hoc reporting across large data volumes.
All using simple SQL constructs.
Analytics
Summary
• MPP Architecture
• Can be paused
• Optimized for analytics workloads
• Supports multiple external file formats
• Works with Polybase
Azure Synapse - Main features
SQL Server & SQL Data Warehouse Differences
Azure Synapse
Workload Management
External Data Source
External File Formats
External Table
SQL Data Warehouse Benchmark
References – MS Learn
https://docs.microsoft.com/en-us/learn/paths/implement-sql-data-warehouse
Thank you very much
Code with Passion and Strive for Excellence
https://www.slideshare.net/nileshgule/presentations
https://speakerdeck.com/nileshgule/
Nilesh Gule
ARCHITECT | MICROSOFT MVP
“Code with Passion and
Strive for Excellence”
nileshgule @nileshgule Nilesh Gule
NileshGule
www.handsonarchitect.com
Q&A

Part 3 - Modern Data Warehouse with Azure Synapse

  • 1.
    Nilesh Gule @nileshgule |www.HandsOnArchitect.com Modern Data Warehouse Using Azure
  • 2.
    $whoami { “name” : “NileshGule”, “website” : “https://www.HandsOnArchitect.com", “github” : “https://github.com/NileshGule" “twitter” : “@nileshgule”, “linkedin” : “https://www.linkedin.com/in/nileshgule”, “email” : “nileshgule@gmail.com", “likes” : “Technical Evangelism, Cricket”, “co-organizer” : “Azure Singapore UG” }
  • 4.
  • 6.
    Part 1 -Recap – ADLS & ADF • Petabyte scale storage • Hierarchical namespace • Hadoop compatible access with ABFS driver ADLS - Main features ADF - Main features • Cloud ETL service • Scale-out serverless data integration & data transformation • Code-free UI • Monitoring & Management
  • 7.
    Part 2 -Recap • Collaborative Spark based Analytical service • Different cluster types (automated / interactive / pool) • Autoscale based on workloads • Fine grained access controls Azure Databricks - Main features
  • 8.
    Azure Synapse Limitless analyticsservice for enterprise data warehousing and Big Data analytics
  • 9.
    Parallelism • Uses manyseparate CPUs running in parallel to execute a single program • Shared Nothing: Each CPU has its own memory and disk (scale-out) • Segments communicate using high-speed network between nodes MPP - Massively Parallel Processing • Multiple CPUs used to complete individual processes simultaneously • All CPUs share the same memory, disks, and network controllers (scale-up) • All SQL Server implementations up until now have been SMP • Mostly, the solution is housed on a shared SAN SMP - Symmetric Multiprocessing
  • 10.
    Synapse Architecture https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/massively-parallel-processing-mpp-architecture • ControlNode • Compute Node • Data Movement Service (DMS) Components • Hash • Round Robin • Replicate Distributions
  • 11.
    Synapse Data Distributions https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/massively-parallel-processing-mpp-architecture •Highest query perf for joins & aggregations on large tables • Rows per distribution varies Hash • Fastest query performance for small tables Replicated
  • 12.
    ALTER DATABASE ContosoDWMODIFY (service_objective = 'DW1000'); DWU DW100 DW200 DW300 DW400 DW500 DW1000 DW1500 DW2000 DW2500 DW3000 DW5000 DW6000 DW7500 DW10000 DW15000 DW30000
  • 13.
    Azure SQL DataWarehouse Engine Worker1 Azure Storage Blob(s) D12D11 D13 D14 D15 D16 D18D17 D19 D20 D22D21 D23 D24 D25 D26 D28D27 D29 D30 D32D31 D33 D34 D35 D36 D38D37 D39 D40 D42D41 D43 D44 D45 D46 D48D47 D49 D50 D52D51 D53 D54 D55 D56 D58D57 D59 D60 D2D1 D3 D4 D5 D6 D8D7 D9 D10
  • 14.
    Azure SQL DataWarehouse Engine Worker4 Azure Storage Blob(s) Worker1 Worker5 Worker3 Worker2 Worker6 D52D51 D53 D54 D55 D56 D58D57 D59 D60 D12D11 D13 D14 D15 D16 D18D17 D19 D20 D22D21 D23 D24 D25 D26 D28D27 D29 D30 D32D31 D33 D34 D35 D36 D38D37 D39 D40 D42D41 D43 D44 D45 D46 D48D47 D49 D50 D2D1 D3 D4 D5 D6 D8D7 D9 D10
  • 15.
    Azure Databricks –SQL DW Connectivity
  • 16.
    External Data Sources •External Data Source • Hadoop, ADLS • External File Format • File types • Delimited Text, Hive RCFile, Hive ORC file, Parquet • Data Compression • Gzip, Snappy • Field Delimiters • Date Format • External Table
  • 17.
    What workloads areNOT suitable? • High frequency reads and writes. • Large numbers of singleton selects. • High volumes of single row inserts. Operational workloads (OLTP) • Row by row processing needs. • Incompatible formats (XML). Data Preparations SQL SQL
  • 18.
    What Workloads areSuitable? Store large volumes of data. Consolidate disparate data into a single location. Shape, model, transform and aggregate data. Batch/Micro-batch loads. Perform query analysis across large datasets. Ad-hoc reporting across large data volumes. All using simple SQL constructs. Analytics
  • 19.
    Summary • MPP Architecture •Can be paused • Optimized for analytics workloads • Supports multiple external file formats • Works with Polybase Azure Synapse - Main features
  • 20.
    SQL Server &SQL Data Warehouse Differences Azure Synapse Workload Management External Data Source External File Formats External Table SQL Data Warehouse Benchmark
  • 21.
    References – MSLearn https://docs.microsoft.com/en-us/learn/paths/implement-sql-data-warehouse
  • 22.
    Thank you verymuch Code with Passion and Strive for Excellence https://www.slideshare.net/nileshgule/presentations https://speakerdeck.com/nileshgule/
  • 23.
    Nilesh Gule ARCHITECT |MICROSOFT MVP “Code with Passion and Strive for Excellence” nileshgule @nileshgule Nilesh Gule NileshGule www.handsonarchitect.com
  • 24.