2. Ike Ellis, MVP
General Manager – Data & AI Practice
Solliance
@ike_ellis
www.ikeellis.com
youtube.com/IkeEllisOnTheMic
• MVP since 2011
• Author of Developing Azure Solutions, Power BI MVP Book
• Speaker at PASS Summit, SQLBits, DevIntersections, TechEd, Craft, Microsoft Data & AI Conference
• Founder of the San Diego Software Architecture Group
• Founder of the San Diego Technology Immersion Group
• Lead a team of Data Engineers, Data Architects, Data Scientists, and Data Creatives
3. • Data Platform Components
• Azure Data Platform
• Orchestration and data processing
• Data Virtualization and data in files
• Organize data in RAW
• Organize data in PREPARED
• Organize data in SERVING
• Star Schemas
• Dimensions
• Fact Tables
• Slowly Changing Dimensions
• Aggregating data
• Securing data
5. • Relational data source
• Web URI data source
• Source files (JSON, CSV, etc.)
• Streaming data
• File-based storage: slow, cheap. Usually Parquet files
• Streaming: very fast storage and processing
• Data marts: medium-fast relational storage
• Reporting tools
• Visualization tools
• Business tool, usually MS Excel
• Machine learning and other advanced analytics
• Data pipeline processes: many processes used to clean, organize, and prepare data. Often written in several different tools and languages as time goes on. Could be streaming or batch
• Orchestration: layer to control the order and workflow of the processes below
• Data virtualization: layer that connects data in all of the locations above to create a single interface for interacting with data
• Diagram layers: data storage, data processing, data sources, data users
• Aggregation layer: takes some amount of work, but aggregations are very fast and cached. Can be relational
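The orchestration layer described on this slide controls the order in which pipeline processes run. A minimal sketch of that idea follows, using Python's standard `graphlib` module. The step names (`copy_raw`, `clean`, `build_star_schema`, `aggregate`) and the `run_pipeline` helper are hypothetical, chosen only for illustration; in practice a service such as Azure Data Factory would play this role.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps, each mapped to the steps it depends on.
DEPENDENCIES = {
    "copy_raw": set(),
    "clean": {"copy_raw"},
    "build_star_schema": {"clean"},
    "aggregate": {"build_star_schema"},
}


def run_pipeline(dependencies, actions):
    """Run each step only after all of its dependencies have run."""
    executed = []
    for step in TopologicalSorter(dependencies).static_order():
        actions[step]()  # invoke the process registered for this step
        executed.append(step)
    return executed


# Each hypothetical action just records that it ran.
LOG = []
ACTIONS = {name: (lambda n=name: LOG.append(f"ran {n}")) for name in DEPENDENCIES}

ORDER = run_pipeline(DEPENDENCIES, ACTIONS)
print(ORDER)
```

The dependency map is the only thing the orchestrator needs to know; the individual processes can be written in any tool or language.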
6. • Relational data source
• Web URI data source
• Source files (JSON, CSV, etc.)
• Streaming data
• File-based storage: ADLS Gen 2, Azure Blob Storage
• Streaming: Event Hubs, Event Grid, Service Bus
• Data marts: Azure SQL Database, Azure Synapse Dedicated SQL Pools
• Reporting tools: SSRS, Power BI
• Visualization tools: Power BI
• Business tool: MS Excel
• Machine learning: Azure ML Studio
• Data pipeline processes: Azure Databricks, Azure Synapse, Azure Functions, Azure Data Factory, Azure Stream Analytics
• Orchestration: Azure Data Factory, Azure Synapse
• Data virtualization: SQL Server PolyBase, Azure Synapse (or Databricks) Spark virtual tables
• Diagram layers: data storage, data processing, data sources, data users
• Aggregation layer: Azure Analysis Services, Power BI data model, manual aggregation tables
7. • Web applications
• Dashboards
• Azure Databricks
• SQL DB / SQL Server
• SQL DW
• Azure Analysis Services
• Data Lake Store / Azure Blob Storage
• Data Factory: Mapping Dataflows, Pipelines, SSIS Packages, Triggered & Scheduled Pipelines
• ETL logic
• Calculations
• Azure Storage
• Direct download
• ETL
• Source
8. The whole idea of an analytical system is that data duplication will speed up
aggregations and reporting. Files allow for cheap duplication, which allows us
to duplicate more data more frequently.
9. Relational table:
CREATE TABLE Customers
(CustID int NOT NULL,
CompanyName varchar NOT NULL)
File exposed as an external table:
ORDERS.parquet
CREATE EXTERNAL TABLE Orders
One query across both:
SELECT *
FROM Customers c
JOIN Orders o
ON c.CustID = o.CustID
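The slide's point is that a table and a file can be joined through one SQL interface. A rough sketch of that idea with Python's standard library follows. It makes several assumptions: SQLite stands in for the relational engine, CSV text stands in for ORDERS.parquet (reading Parquet needs a third-party library), and the sample rows and the `OrderID` and `Amount` columns are invented for illustration. SQLite has no external tables here, so the file rows are loaded into a temporary table to imitate one.

```python
import csv
import io
import sqlite3

# Stand-in for ORDERS.parquet: CSV text with made-up sample rows.
ORDERS_FILE = io.StringIO(
    "OrderID,CustID,Amount\n"
    "1,10,25.0\n"
    "2,10,40.0\n"
    "3,20,15.5\n"
)

conn = sqlite3.connect(":memory:")

# The relational table from the slide, with hypothetical customers.
conn.execute(
    "CREATE TABLE Customers (CustID INTEGER NOT NULL, CompanyName TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)", [(10, "Contoso"), (20, "Fabrikam")]
)

# Imitate an external table: expose the file's rows under a table name.
conn.execute("CREATE TEMP TABLE Orders (OrderID INTEGER, CustID INTEGER, Amount REAL)")
file_rows = [
    (int(r["OrderID"]), int(r["CustID"]), float(r["Amount"]))
    for r in csv.DictReader(ORDERS_FILE)
]
conn.executemany("INSERT INTO Orders VALUES (?, ?, ?)", file_rows)

# One query across both sources, mirroring the join on the slide.
RESULT = conn.execute(
    "SELECT c.CompanyName, o.OrderID, o.Amount "
    "FROM Customers c JOIN Orders o ON c.CustID = o.CustID "
    "ORDER BY o.OrderID"
).fetchall()
print(RESULT)
```

A real data virtualization layer, such as PolyBase, would read the file in place rather than copying its rows.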
11. What you do with them
• Remove bad rows
• Change column data types
• Pivot
• Unpivot
• Combine columns
• Remove columns
• Split columns
• Change format
• Replace values
• Merge/join data files and tables
• Append data files and tables
• Fill with a literal
• Change the format
• Perform mathematical calculations
• Change the location of data
How you do it
• Python
• C#
• Azure Functions
• Azure Databricks
• Azure Synapse Data Flows
• Power BI Data Flows
• Stored Procedures
• Pandas
• Azure Stream Analytics
• Azure Kubernetes Service
• Azure VMs
• And much, much more
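A few of the transformations listed above (remove bad rows, change column data types, split columns, replace values) can be sketched in plain Python. The sample rows, the column names, and the `prepare` helper are all hypothetical; the same steps could equally be written in Pandas or any of the other tools on the slide.

```python
# Hypothetical raw rows, as they might arrive from a source file.
RAW_ROWS = [
    {"name": "Ada Lovelace", "state": "CA", "amount": "12.50"},
    {"name": "Alan Turing", "state": "Calif.", "amount": "7"},
    {"name": "", "state": "NY", "amount": "oops"},  # bad row
]

# Hypothetical lookup used to standardize values.
STATE_FIXES = {"Calif.": "CA"}


def prepare(rows):
    """Apply a handful of cleaning steps and return the prepared rows."""
    cleaned = []
    for row in rows:
        # Change column data type: amount text becomes a float.
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # remove bad rows: amount is not numeric
        if not row["name"]:
            continue  # remove bad rows: name is missing
        # Split columns: one name column becomes first and last name.
        first, _, last = row["name"].partition(" ")
        cleaned.append(
            {
                "first_name": first,
                "last_name": last,
                # Replace values: map known variants to one spelling.
                "state": STATE_FIXES.get(row["state"], row["state"]),
                "amount": amount,
            }
        )
    return cleaned


PREPARED = prepare(RAW_ROWS)
print(PREPARED)
```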
14. Files (Parquet, ADLS, Azure Blob Storage)
• Very cheap
• Not very fast
• Great for long-term storage, archives
• Great for staging/raw
• Great for enriched layer
• Great for duplicating data
• Great for machine learning and other analytics
• Can use SQL to query it
• Great for JSON, CSVs, TSVs, any other files
Relational (Azure SQL Database, Azure Database, Azure Synapse Dedicated SQL Pools)
• Great for serving layer
• Great for interactivity
• Great for using SQL
• Somewhat expensive
• Bad for long-term storage (> 5 years)
• Medium-term storage (1 to 5 years)
• Forces the format to be primarily tabular (with rows and columns)
• Generally bad for JSON data
Cache (Azure Cache for Redis, Azure Cosmos DB)
• Great for repeated, short-term storage
• Very expensive
• Great for geo-replication
• Great for data that changes quickly
• Great for JSON data
• Can use a SQL variant to query it (not full featured)
Stream (Azure Event Hubs, Azure Event Grid, Azure Service Bus, Azure Stream Analytics)
• Great for seven days of data
• Very expensive
• Great for alerting
28. • IoT stream of each ping. Can only hold 3 days of data, since there are hundreds of trucks
• Copy to RAW files: sensor detail
• Transfer and quick aggregation to enriched file: Truck ID, Date & Hour, Miles Per Hour
• Transfer and star schema to data mart table: TruckID, Date, HoursPerDay
• Data virtualization: three tables for different granularities. The star schema is in the most expensive, fastest storage. The others are file-based and cheaper, kept for reference or to redo an aggregation
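The three granularities on this slide (raw sensor detail, an enriched per-hour file, and a per-day data mart table) can be sketched as two aggregation steps. Everything specific here is an assumption made for illustration: the sample pings, the `enrich` and `to_data_mart` helper names, the use of average miles per hour in the enriched step, and the rule that an hour counts toward HoursPerDay only when the truck's average speed in that hour is above zero.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw sensor pings: (truck_id, ISO timestamp, miles per hour).
RAW_PINGS = [
    ("T1", "2024-01-01T08:05:00", 50.0),
    ("T1", "2024-01-01T08:35:00", 60.0),
    ("T1", "2024-01-01T09:10:00", 0.0),
    ("T1", "2024-01-01T10:20:00", 30.0),
    ("T2", "2024-01-01T08:15:00", 45.0),
]


def enrich(pings):
    """Quick aggregation: average MPH per truck, per date and hour."""
    buckets = defaultdict(list)
    for truck_id, stamp, mph in pings:
        ts = datetime.fromisoformat(stamp)
        buckets[(truck_id, ts.date().isoformat(), ts.hour)].append(mph)
    return {key: sum(values) / len(values) for key, values in buckets.items()}


def to_data_mart(enriched):
    """Fact rows for the data mart: hours with movement per truck per day."""
    hours = defaultdict(int)
    for (truck_id, date, _hour), avg_mph in enriched.items():
        if avg_mph > 0:  # assumption: an hour counts only if the truck moved
            hours[(truck_id, date)] += 1
    return dict(hours)


ENRICHED = enrich(RAW_PINGS)
DATA_MART = to_data_mart(ENRICHED)
print(ENRICHED)
print(DATA_MART)
```

Keeping the raw and enriched outputs in cheap file storage means the data mart rows can be rebuilt if the aggregation rule changes.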