Data Modeling on Azure for Analytics

Ike Ellis, MVP
General Manager – Data & AI Practice
Solliance
@ike_ellis
www.ikeellis.com
youtube.com/IkeEllisOnTheMic
• MVP since 2011
• Author of Developing Azure Solutions, Power BI MVP Book
• Speaker at PASS Summit, SQLBits, DevIntersections, TechEd, Craft, Microsoft Data & AI Conference
• Founder of the San Diego Software Architecture Group
• Founder of the San Diego Technology Immersion Group
• Lead a team of Data Engineers, Data Architects, Data Scientists, and Data Creatives

• Data Platform Components
• Azure Data Platform
• Orchestration and data processing
• Data Virtualization and data in files
• Organize data in RAW
• Organize data in PREPARED
• Organize data in SERVING
• Star Schemas
• Dimensions
• Fact Tables
• Slowly Changing Dimensions
• Aggregating data
• Securing data

Data Platform Components

Data sources
• Relational data source
• Web URI data source
• Source files (JSON, CSV, etc.)
• Streaming data

Data storage
• File-based storage: slow, cheap; usually Parquet files
• Streaming: very fast storage and processing
• Data marts: medium-fast, relational storage
• Aggregation layer: takes some amount of work, but aggregations are very fast and cached; can be relational

Data processing
• Data pipeline processes: many processes used to clean, organize, and prepare data. Often written in several different tools and languages as time goes on. Could be streaming or batch.
• Orchestration: layer to control the order and workflow of the pipeline processes
• Data virtualization: layer that connects data in all of the storage locations to create a single interface for interacting with data

Data users
• Reporting tools
• Visualization tools
• Business tool, usually MS Excel
• Machine learning and other advanced analytics

Azure Data Platform

Data sources
• Relational data source
• Web URI data source
• Source files (JSON, CSV, etc.)
• Streaming data

Data storage
• File-based storage: ADLS Gen 2, Azure Blob Storage
• Streaming: Event Hubs, Event Grid, Service Bus
• Data marts: Azure SQL Database, Azure Synapse Dedicated SQL Pools
• Aggregation layer: Azure Analysis Services, Power BI Data Model, manual aggregation tables

Data processing
• Data pipeline processes: Azure Databricks, Azure Synapse, Azure Functions, Azure Data Factory, Azure Stream Analytics
• Orchestration: Azure Data Factory, Azure Synapse
• Data virtualization: SQL Server PolyBase, Azure Synapse (or Databricks) Spark virtual tables

Data users
• Reporting tools: SSRS, Power BI
• Visualization tools: Power BI
• Business tool: MS Excel
• Machine learning: Azure ML Studio

[Diagram: Azure analytics architecture. Labels shown: source; ETL; Data Factory (Mapping Dataflows, Pipelines, SSIS Packages, Triggered & Scheduled Pipelines); ETL logic and calculations; Azure Databricks; Data Lake Store / Azure Blob Storage; Azure Storage; SQL DB / SQL Server; SQL DW; Azure Analysis Services; web applications; dashboards; direct download.]

The whole idea of an analytical system is that data duplication will speed up
aggregations and reporting. Files allow for cheap duplication, which allows us
to duplicate more data more frequently.

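-- Relational table stored in the database
CREATE TABLE Customers
(CustID int NOT NULL,
CompanyName varchar(100) NOT NULL) -- length is illustrative; a bare varchar holds 1 character

-- External table defined over the file ORDERS.parquet (definition not shown on the slide)
CREATE EXTERNAL TABLE Orders

-- One query joins the relational table and the file-backed table
SELECT *
FROM Customers c
JOIN Orders o
ON c.CustID = o.CustID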
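
As a rough illustration of the idea above (one SQL interface over a relational table and file data), here is a minimal Python sketch. It uses the standard-library sqlite3 module and an in-memory CSV as a stand-in for ORDERS.parquet; the sample rows, the Amount column, and the function names are assumptions made for the example. Unlike a true external table, the sketch copies the file rows into the database rather than reading the file in place.

```python
import csv
import io
import sqlite3

# Stand-in for ORDERS.parquet: a small CSV (hypothetical sample data).
ORDERS_CSV = (
    "OrderID,CustID,Amount\n"
    "1,10,250.0\n"
    "2,10,100.0\n"
    "3,20,75.5\n"
)


def build_virtual_layer():
    """Expose a relational table and file data through one SQL interface."""
    conn = sqlite3.connect(":memory:")
    # Relational side: the Customers table from the slide.
    conn.execute(
        "CREATE TABLE Customers (CustID INTEGER NOT NULL, CompanyName TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO Customers VALUES (?, ?)",
        [(10, "Contoso"), (20, "Fabrikam")],
    )
    # File side: load the file rows into a table named Orders.
    conn.execute("CREATE TABLE Orders (OrderID INTEGER, CustID INTEGER, Amount REAL)")
    rows = [
        (int(r["OrderID"]), int(r["CustID"]), float(r["Amount"]))
        for r in csv.DictReader(io.StringIO(ORDERS_CSV))
    ]
    conn.executemany("INSERT INTO Orders VALUES (?, ?, ?)", rows)
    return conn


def orders_per_customer(conn):
    """Join across both sources, in the spirit of the SELECT on the slide."""
    return conn.execute(
        "SELECT c.CompanyName, COUNT(*), SUM(o.Amount) "
        "FROM Customers c JOIN Orders o ON c.CustID = o.CustID "
        "GROUP BY c.CompanyName ORDER BY c.CompanyName"
    ).fetchall()


if __name__ == "__main__":
    print(orders_per_customer(build_virtual_layer()))
```

The caller only sees SQL tables; whether the rows originated in a database or a file is hidden behind the interface.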

What you do with them
• Remove bad rows
• Change column data types
• Pivot
• Unpivot
• Combine columns
• Remove columns
• Split columns
• Change format
• Replace values
• Merge/join data files and tables
• Append data files and tables
• Fill with a literal
• Change the format
• Perform mathematical calculations
• Change the location of data

How you do it
• Python
• C#
• Azure Functions
• Azure Databricks
• Azure Synapse Data Flows
• Power BI Data Flows
• Stored Procedures
• Pandas
• Azure Stream Analytics
• Azure Kubernetes Service
• Azure VMs
• And much, much more
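
A few of the operations listed above can be sketched in plain Python. This is only an illustration: the column names (cust_id, name, state, unit_price, qty), the sample rows, and the function name are assumptions, and a real pipeline would more likely use one of the tools in the second list.

```python
def clean_rows(rows):
    """Apply several of the listed transformations to a list of dict rows."""
    cleaned = []
    for row in rows:
        # Remove bad rows: skip records with no customer id.
        if not row.get("cust_id"):
            continue
        # Split columns: "Last, First" becomes two columns.
        last, _, first = row["name"].partition(",")
        cleaned.append({
            # Change column data types: text to int.
            "cust_id": int(row["cust_id"]),
            "first_name": first.strip(),
            "last_name": last.strip(),
            # Replace values: normalize a known abbreviation.
            "state": {"Calif.": "CA"}.get(row["state"], row["state"]),
            # Perform mathematical calculations; the two inputs are then
            # dropped from the output (remove columns).
            "line_total": float(row["unit_price"]) * int(row["qty"]),
        })
    return cleaned


SAMPLE = [
    {"cust_id": "1", "name": "Ellis, Ike", "state": "Calif.", "unit_price": "2.50", "qty": "4"},
    {"cust_id": "", "name": "Bad, Row", "state": "WA", "unit_price": "1", "qty": "1"},
    {"cust_id": "2", "name": "Doe, Jane", "state": "WA", "unit_price": "10", "qty": "3"},
]

if __name__ == "__main__":
    for r in clean_rows(SAMPLE):
        print(r)
```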

Accounting database, CRM database
-> Copy
-> RAW folder on ADLS (.parquet files)
-> Clean data, but don't change shape
-> Enriched folder on ADLS (.parquet files)
-> Create star schema
-> Data marts
-> Create aggregations
-> Analysis Services cubes
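
The first two hops of this flow (copy into RAW unchanged, then clean into Enriched without changing shape) might look like the sketch below. It is a local stand-in only: a temporary directory plays the role of ADLS, CSV replaces Parquet so that no extra libraries are needed, and the file name CRMCustomers, the columns, and all function names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path, header, rows):
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def read_csv(path):
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


def copy_to_raw(lake, source_name, header, rows):
    """Copy step: land the source rows in the raw folder unchanged."""
    target = lake / "raw" / f"{source_name}.csv"
    write_csv(target, header, rows)
    return target


def raw_to_enriched(lake, source_name):
    """Clean step: trim whitespace and drop empty rows; same columns out."""
    rows = read_csv(lake / "raw" / f"{source_name}.csv")
    header = list(rows[0].keys()) if rows else []
    cleaned = [
        [(v or "").strip() for v in r.values()]
        for r in rows
        if any((v or "").strip() for v in r.values())
    ]
    target = lake / "enriched" / f"{source_name}.csv"
    write_csv(target, header, cleaned)
    return target


def demo_layers():
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        copy_to_raw(
            lake,
            "CRMCustomers",
            ["CustID", "Name"],
            [["1", " Contoso "], ["", ""], ["2", "Fabrikam"]],
        )
        return read_csv(raw_to_enriched(lake, "CRMCustomers"))


if __name__ == "__main__":
    print(demo_layers())
```

Keeping the raw copy untouched means the clean step can be rerun at any time without going back to the source system.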

Files (Parquet, ADLS, Azure Blob Storage)
• Very cheap
• Not very fast
• Great for long-term storage, archives
• Great for staging/raw
• Great for enriched layer
• Great for duplicating data
• Great for machine learning and other analytics
• Can use SQL to query it
• Great for JSON, CSVs, TSVs, any other files

Relational (Azure SQL Database, Azure Database, Azure Synapse Dedicated SQL Pools)
• Great for serving layer
• Great for interactivity
• Great for using SQL
• Somewhat expensive
• Bad for long-term storage (> 5 years)
• Medium-term storage (1 to 5 years)
• Forces the format to be primarily tabular (with rows and columns)
• Generally bad for JSON data

Cache (Azure Cache for Redis, Azure Cosmos DB)
• Great for repeated, short-term storage
• Very expensive
• Great for geo-replication
• Great for data that changes quickly
• Great for JSON data
• Can use a SQL variant to query it (not full featured)

Stream (Azure Event Hubs, Azure Event Grid, Azure Service Bus, Azure Stream Analytics)
• Great for seven days of data
• Very expensive
• Great for alerting

ERP data source -> Copy Table -> RAWERPCustomers.parquet

CREATE EXTERNAL TABLE [dbo].[tempSalesOrderHeader]
(
[SalesOrderID] [int] NULL,
[SalesOrderDetailID] [int] NULL,
[OrderQty] [int] NULL,
[ProductID] [int] NULL,
[UnitPrice] [numeric](19,4) NULL,
[UnitPriceDiscount] [numeric](19,4) NULL,
[LineTotal] [numeric](38,6) NULL,
[rowguid] [varchar](8000) NULL,
[ModifiedDate] [datetime2](7) NULL
)
WITH (DATA_SOURCE = [ikedatalakefs_ikedatalake_dfs_core_windows_net]
, LOCATION = N'raw/SalesLTSalesOrderDetail.parquet'
, FILE_FORMAT = [SynapseParquetFormat]
, REJECT_TYPE = VALUE
, REJECT_VALUE = 0 )
GO

RAWERPCustomers.parquet -> Transform -> EnrichedCustomer.parquet

Star schema example

FactOrders: CustomerKey, SalesPersonKey, ProductKey, ShippingAgentKey, TimeKey, OrderNo, LineItemNo, Quantity, Revenue, Cost, Profit
DimSalesPerson: SalesPersonKey, SalesPersonName, StoreName, StoreCity, StoreRegion
DimProduct: ProductKey, ProductName, ProductLine, SupplierName
DimCustomer: CustomerKey, CustomerName, City, Region
DimDate: DateKey, Year, Quarter, Month, Day
DimShippingAgent: ShippingAgentKey, ShippingAgentName
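
To show how a star schema like this one is typically queried, here is a small sketch using Python's built-in sqlite3 module. Only two of the dimensions and a subset of the fact columns are included, and all sample rows are invented for the example; the function names are likewise assumptions.

```python
import sqlite3


def build_star():
    """Create a reduced version of the star schema with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE DimCustomer (
            CustomerKey INTEGER PRIMARY KEY, CustomerName TEXT, City TEXT, Region TEXT);
        CREATE TABLE DimProduct (
            ProductKey INTEGER PRIMARY KEY, ProductName TEXT, ProductLine TEXT,
            SupplierName TEXT);
        CREATE TABLE FactOrders (
            CustomerKey INTEGER REFERENCES DimCustomer(CustomerKey),
            ProductKey INTEGER REFERENCES DimProduct(ProductKey),
            OrderNo INTEGER, LineItemNo INTEGER,
            Quantity INTEGER, Revenue REAL, Cost REAL);
        INSERT INTO DimCustomer VALUES
            (1, 'Contoso', 'San Diego', 'West'),
            (2, 'Fabrikam', 'Boston', 'East');
        INSERT INTO DimProduct VALUES
            (1, 'Road Bike', 'Bikes', 'Acme'),
            (2, 'Helmet', 'Accessories', 'Acme');
        INSERT INTO FactOrders VALUES
            (1, 1, 1001, 1, 1, 1000.0, 600.0),
            (1, 2, 1001, 2, 2, 100.0, 40.0),
            (2, 2, 1002, 1, 1, 50.0, 20.0);
    """)
    return conn


def revenue_by_region_and_line(conn):
    """Filter and group by dimension attributes; aggregate fact measures."""
    return conn.execute("""
        SELECT c.Region, p.ProductLine, SUM(f.Quantity), SUM(f.Revenue)
        FROM FactOrders f
        JOIN DimCustomer c ON c.CustomerKey = f.CustomerKey
        JOIN DimProduct p ON p.ProductKey = f.ProductKey
        GROUP BY c.Region, p.ProductLine
        ORDER BY c.Region, p.ProductLine
    """).fetchall()


if __name__ == "__main__":
    for row in revenue_by_region_and_line(build_star()):
        print(row)
```

Every join goes from the fact table straight to a dimension, which keeps reporting queries short and predictable.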

EnrichedCustomer.parquet -> Data pipelines

DimSalesPerson
• SalesPersonKey (surrogate key)
• EmployeeNo (business key)
• SalesPersonName
• StoreName, StoreCity, StoreRegion (denormalized; no separate store table)
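
One way to picture how such a dimension gets built: take the source employee and store records, fold the store attributes into each salesperson row, and assign a new integer surrogate key alongside the business key. The sketch below assumes a hypothetical StoreID column linking the two sources, and the sample data and function name are illustrative only.

```python
def build_dim_salesperson(employees, stores):
    """Denormalize store attributes into the dimension and assign surrogate keys."""
    stores_by_id = {s["StoreID"]: s for s in stores}
    dim = []
    # Sort by business key so surrogate keys are assigned deterministically.
    ordered = sorted(employees, key=lambda e: e["EmployeeNo"])
    for key, emp in enumerate(ordered, start=1):
        store = stores_by_id[emp["StoreID"]]
        dim.append({
            "SalesPersonKey": key,             # surrogate key
            "EmployeeNo": emp["EmployeeNo"],   # business key from the source
            "SalesPersonName": emp["Name"],
            # Store columns are copied in, so no separate store table is needed.
            "StoreName": store["StoreName"],
            "StoreCity": store["StoreCity"],
            "StoreRegion": store["StoreRegion"],
        })
    return dim


EMPLOYEES = [
    {"EmployeeNo": "E200", "Name": "Ana", "StoreID": 2},
    {"EmployeeNo": "E100", "Name": "Bo", "StoreID": 1},
]
STORES = [
    {"StoreID": 1, "StoreName": "Downtown", "StoreCity": "San Diego", "StoreRegion": "West"},
    {"StoreID": 2, "StoreName": "Harbor", "StoreCity": "Boston", "StoreRegion": "East"},
]

if __name__ == "__main__":
    for row in build_dim_salesperson(EMPLOYEES, STORES):
        print(row)
```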

FactOrders (grain = order line item)
• Dimension keys: CustomerKey, SalesPersonKey, ProductKey, TimeKey
• Degenerate dimensions: OrderNo, LineItemNo, PaymentMethod
• Additive measures: Quantity, Revenue, Cost, Profit
• Nonadditive measure: Margin

FactAccountTransaction
• Dimension keys: CustomerKey, BranchKey, AccountTypeKey
• Degenerate dimension: AccountNo
• Additive measure: CreditDebitAmount
• Semi-additive measure: AccountBalance
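
The difference between additive, nonadditive, and semi-additive measures is easiest to see with numbers. In the sketch below, Margin is treated as Profit divided by Revenue, and a Date field is added to the account transactions so that "latest balance" is well defined; both are assumptions for the example, as are the sample rows and function names.

```python
def total_margin(order_lines):
    """Nonadditive: recompute the ratio from summed additive measures."""
    revenue = sum(line["Revenue"] for line in order_lines)
    profit = sum(line["Profit"] for line in order_lines)
    return profit / revenue


def closing_balance_total(transactions):
    """Semi-additive: sum across accounts, but never across time.

    Only the most recent balance for each account is kept.
    """
    latest = {}
    for t in sorted(transactions, key=lambda t: t["Date"]):
        latest[t["AccountNo"]] = t["AccountBalance"]
    return sum(latest.values())


ORDER_LINES = [
    {"Revenue": 100.0, "Profit": 50.0, "Margin": 0.5},
    {"Revenue": 300.0, "Profit": 30.0, "Margin": 0.1},
]
TRANSACTIONS = [
    {"AccountNo": "A1", "Date": "2024-01-01", "AccountBalance": 100.0},
    {"AccountNo": "A1", "Date": "2024-01-02", "AccountBalance": 150.0},
    {"AccountNo": "A2", "Date": "2024-01-01", "AccountBalance": 200.0},
]

if __name__ == "__main__":
    # Summing the Margin column would give 0.6, which is meaningless.
    print(total_margin(ORDER_LINES))
    # Summing every AccountBalance row would give 450.0 and double count A1.
    print(closing_balance_total(TRANSACTIONS))
```

Additive measures such as Revenue and Profit need no special handling, which is why they are summed directly inside both helpers.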

IoT stream of each ping. Can only hold 3 days of data, since there are hundreds of trucks.
-> Copy
-> RAW files: sensor detail
-> Transfer and quick aggregation
-> Enriched file: Truck ID, Date & Hour, Miles Per Hour
-> Transfer and star schema
-> Data mart table: TruckID, Date, HoursPerDay

Data Virtualization: Three tables for different granularities. The star schema is in the most expensive, fastest storage. The others are file-based and cheaper, kept for reference or to redo an aggregation.
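
The two aggregation hops in this truck example could be sketched as follows. The ping fields (truck_id, timestamp, mph) and the function names are assumptions, and HoursPerDay is interpreted here as the number of hours in a day during which the truck was moving, which the slide does not spell out.

```python
from collections import defaultdict


def pings_to_hourly(pings):
    """Raw -> enriched: average miles per hour for each truck and hour."""
    buckets = defaultdict(list)
    for p in pings:
        hour = p["timestamp"][:13]  # keeps "YYYY-MM-DDTHH"
        buckets[(p["truck_id"], hour)].append(p["mph"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


def hourly_to_daily(hourly):
    """Enriched -> data mart: hours per day in which each truck was moving."""
    days = defaultdict(int)
    for (truck_id, hour), mph in hourly.items():
        if mph > 0:
            days[(truck_id, hour[:10])] += 1  # keeps "YYYY-MM-DD"
    return dict(days)


PINGS = [
    {"truck_id": "T1", "timestamp": "2024-05-01T08:05:00", "mph": 40},
    {"truck_id": "T1", "timestamp": "2024-05-01T08:35:00", "mph": 60},
    {"truck_id": "T1", "timestamp": "2024-05-01T09:10:00", "mph": 0},
    {"truck_id": "T1", "timestamp": "2024-05-01T10:00:00", "mph": 30},
    {"truck_id": "T2", "timestamp": "2024-05-01T08:00:00", "mph": 55},
]

if __name__ == "__main__":
    hourly = pings_to_hourly(PINGS)
    print(hourly)
    print(hourly_to_daily(hourly))
```

Each level is smaller than the one before it, so the coarsest table can sit in the fastest storage while the detailed levels stay in cheap files.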
