How to model data for analytics in a
modern world
Data Modeling Trends
for 2019 and Beyond
Ike Ellis, Microsoft MVP
General Manager – Data & AI Practice
Solliance
Please silence
cell phones
everything PASS has to offer
• Free online webinar events
• Free 1-day local training events
• Local user groups around the world
• Online special interest user groups
• Business analytics training
• Free online resources
• Newsletters
Get involved — explore PASS.org
Ike Ellis
General Manager – Data &
AI Practice
Solliance
/ikeellis
@ike_ellis
www.ikeellis.com
• Founder of the San Diego Power BI and PowerApps User Group
• Founder of the San Diego Software Architecture Group
• MVP since 2011
• Author of Developing Azure Solutions, Power BI MVP Book
• Speaker at PASS Summit, SQLBits, DevIntersections, TechEd, Craft, Microsoft Data & AI Conference
agenda
• where we are today:
• physical storage design
• logical data schema design
• where we are headed:
• data lakes
• lambdas
• iot
• challenges to data modeling from the business
• challenges with the cloud
• positioning with the business
reasons we build a data system for analytics
• alert for things like fraud
• reporting to wall street, auditors, compliance
• reporting to upper management, board of directors
• tactical reporting to other management
• data analysis, machine learning, deep learning
• data lineage
• data governance
• data brokerage between transactional applications
• historical data, archiving data
common enterprise data architecture (eda)
[diagram: sources → etl → staging and/or ods → etl → data warehouse]
staging
• relational database that looks like the source systems
• usually keeps about 90 days’ worth of data, or some other increment of data
• used so that if etl fails, we don’t have to go back to the source system
ods
• operational data store
• totally normalized
• used to mimic the source systems, but consolidate them into one database
• three databases into one database
• might have light cleaning
• often used for data brokerage
• might have 300 – 500 tables
kimball method: sample project plan
1) find executive sponsor
2) gather requirements / define vocabulary / wiki definitions / requirements document
3) physical design
4) dimensional modeling
5) etl development
6) deployment/training
enterprise bus matrix
kimball dimensional modeling
business questions focus on measures that are aggregated by
business dimensions
measures are facts about the business (nouns and strong words)
dimensions are ways in which the measures can be aggregated:
• pay attention to the “by” keyword
• examples:
• sales revenue by salesperson
• profit by product line
• order quantity by year
[diagram: measures (Quantity, Revenue, Profit) aggregated by dimensions (Salesperson, Product Line, Time)]
star schemas
• group related dimensions into
dimension tables
• group related measures into fact tables
• relate fact tables to dimension tables by
using foreign keys
DimSalesPerson (SalesPersonKey, SalesPersonName, StoreName, StoreCity, StoreRegion)
DimProduct (ProductKey, ProductName, ProductLine, SupplierName)
DimCustomer (CustomerKey, CustomerName, City, Region)
DimDate (DateKey, Year, Quarter, Month, Day)
DimShippingAgent (ShippingAgentKey, ShippingAgentName)
FactOrders (CustomerKey, SalesPersonKey, ProductKey, ShippingAgentKey, TimeKey, OrderNo, LineItemNo, Quantity, Revenue, Cost, Profit)
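the star schema above can be sketched end-to-end with python's built-in sqlite3. the table and column names follow the diagram, but the rows and values below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimProduct (
    ProductKey INTEGER PRIMARY KEY,   -- surrogate key
    ProductName TEXT, ProductLine TEXT);
CREATE TABLE FactOrders (
    ProductKey INTEGER REFERENCES DimProduct(ProductKey),
    Quantity INTEGER, Revenue REAL, Profit REAL);
""")
conn.executemany("INSERT INTO DimProduct VALUES (?, ?, ?)",
                 [(1, "Widget", "Hardware"),
                  (2, "Gadget", "Hardware"),
                  (3, "Manual", "Books")])
conn.executemany("INSERT INTO FactOrders VALUES (?, ?, ?, ?)",
                 [(1, 10, 100.0, 20.0),
                  (2, 5, 250.0, 50.0),
                  (3, 2, 30.0, 10.0)])

# "profit by product line": aggregate the fact table by a dimension attribute
rows = conn.execute("""
    SELECT d.ProductLine, SUM(f.Profit)
    FROM FactOrders f JOIN DimProduct d ON f.ProductKey = d.ProductKey
    GROUP BY d.ProductLine ORDER BY d.ProductLine
""").fetchall()
print(rows)  # [('Books', 10.0), ('Hardware', 70.0)]
```

note that the aggregation is one join and one pass over the fact table — the shape that makes star schemas cheap to query.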
considerations for dimension tables
denormalization:
• dimension tables are usually “wide”
• wide tables are often faster because they require fewer joins
• duplicated values are usually preferable to joins
keys:
• create new surrogate keys to identify each row:
• simple integer keys provide the best performance
• retain original business keys
conformed dimensions
• Dimensions that can be shared across multiple fact tables
DimSalesPerson (SalesPersonKey [surrogate key], EmployeeNo [business key], SalesPersonName, StoreName, StoreCity, StoreRegion [denormalized: no separate store table])
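a minimal sketch of surrogate-key assignment during a dimension load. the source rows and the EmployeeNo business key are hypothetical; real etl tools handle this for you, this only shows the idea of a simple incrementing integer key alongside the retained business key:

```python
def assign_surrogate_keys(source_rows, key_map):
    """Assign incrementing integer surrogate keys per business key,
    retaining the original business key on each dimension row.
    `key_map` persists business-key -> surrogate-key across loads."""
    dimension_rows = []
    for row in source_rows:
        business_key = row["EmployeeNo"]
        if business_key not in key_map:
            key_map[business_key] = len(key_map) + 1  # simple integer surrogate
        dimension_rows.append({
            "SalesPersonKey": key_map[business_key],  # surrogate key
            "EmployeeNo": business_key,               # retained business key
            "SalesPersonName": row["Name"],
        })
    return dimension_rows

key_map = {}
rows = assign_surrogate_keys(
    [{"EmployeeNo": "E-207", "Name": "Ana"},
     {"EmployeeNo": "E-113", "Name": "Raj"},
     {"EmployeeNo": "E-207", "Name": "Ana"}],  # repeat reuses the same key
    key_map)
print(rows[0]["SalesPersonKey"], rows[2]["SalesPersonKey"])  # 1 1
```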
considerations for fact tables
grain:
• use the lowest level of detail that relates to all dimensions
• create multiple fact tables if multiple grains are required
keys:
• the primary key is usually a composite key that includes
dimension foreign keys
measures:
• additive: Measures that can be aggregated across all dimensions
• nonadditive: Measures that cannot be aggregated
• semi-additive: Measures that can be aggregated across some
dimensions, but not others
degenerate dimensions:
• dimensions in the fact table
FactOrders (CustomerKey, SalesPersonKey, ProductKey, TimeKey, OrderNo, LineItemNo, PaymentMethod, Quantity, Revenue, Cost, Profit, Margin)
• grain = order line item
• degenerate dimensions: OrderNo, LineItemNo, PaymentMethod
• additive: Quantity, Revenue, Cost, Profit; nonadditive: Margin
FactAccountTransaction (CustomerKey, BranchKey, AccountTypeKey, AccountNo, CreditDebitAmount, AccountBalance)
• additive: CreditDebitAmount; semi-additive: AccountBalance
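the additive vs. semi-additive distinction can be shown with a few made-up account transactions: the credit/debit amount sums safely across every dimension including date, while the balance must not be summed across the date dimension:

```python
# Transactions for one account across two days (made-up figures).
transactions = [
    {"date": "2019-01-01", "credit_debit": 100.0, "balance": 100.0},
    {"date": "2019-01-01", "credit_debit": -30.0, "balance": 70.0},
    {"date": "2019-01-02", "credit_debit": 50.0,  "balance": 120.0},
]

# Additive: CreditDebitAmount can be summed across all dimensions, date included.
net_change = sum(t["credit_debit"] for t in transactions)

# Semi-additive: summing AccountBalance across dates would double-count money;
# take the closing (last) balance instead.
closing_balance = transactions[-1]["balance"]

print(net_change, closing_balance)  # 120.0 120.0
```

the two numbers agree here only because the account started at zero; the point is that one is a sum and the other is a "last value" aggregate.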
reasons to make a star schema
• easy to use and understand
• one version of the truth
• easy to create aggregations in a single pass over the data
• much smaller table count (12 – 25 tables)
• faster queries
• good place to feed cubes (either azure analysis services or power bi shared
datasets)
• supported by many business intelligence tools (excel pivot tables, power bi,
tableau, etc)
• what i always say:
• “you can either start out by making a star schema, or you can one day wish you did. those
are the two choices”
snowflake schemas
consider when:
• a subdimension can be shared
between multiple dimensions
• a hierarchy exists, and the
dimension table contains a small
subset of data that may be
changed frequently
• a sparse dimension has several
different subtypes
• multiple fact tables of varying
grain reference different levels in
the dimension hierarchy
DimSalesPerson (SalesPersonKey, SalesPersonName, StoreKey)
DimStore (StoreKey, StoreName, GeographyKey)
DimProduct (ProductKey, ProductName, ProductLineKey, SupplierKey)
DimProductLine (ProductLineKey, ProductLineName)
DimSupplier (SupplierKey, SupplierName)
DimCustomer (CustomerKey, CustomerName, GeographyKey)
DimGeography (GeographyKey, City, Region)
DimDate (DateKey, Year, Quarter, Month, Day)
DimShippingAgent (ShippingAgentKey, ShippingAgentName)
FactOrders (CustomerKey, SalesPersonKey, ProductKey, ShippingAgentKey, TimeKey, OrderNo, LineItemNo, Quantity, Revenue, Cost, Profit)
• normalized dimension tables
common enterprise data architecture (eda)
[diagram: sources → etl → staging and/or ods → etl → data warehouse]
weaknesses
weakness #1: let’s add a single column
[diagram: the same pipeline, with the new column touched in nine numbered places along the way: every source, every etl step, staging, ods, and the data warehouse. x9!!!!]
so many of you have decided to just go directly to the source!
[diagram: fifteen separate source → power query pipelines]
mayhem
• spread business logic
• when something changes, you have
to change a ton of places
• inconsistent
• repeatedly cleaning the same data
and i’ve seen these data models
weakness #2: sql server for staging/ods
• sql actually writes data multiple times on an insert
• one write to the log (traversing the log structure), and then to the disk subsystem
• one write to the mdf, and then to the disk subsystem
• writes to maintain indexes, and then to the disk subsystem
• sql is strongly consistent
• the write isn’t successful until all the tables and indexes reflect the write and all the triggers fire
• sql is expensive
• you get charged based on the number of cores you use
weakness #3: great big data warehouses are very difficult to change and maintain
• all tables need to be consistent with one another
• historical data makes queries slow
• historical data makes dws hard to back up, restore, and manage
• indexes take too long to maintain
• dev and other environments are too difficult to spin up
• shared database environments are too hard to use in development
environments
• keeping track of pii and sensitive information is very difficult
• creating automated tests is very difficult
weakness #4: very difficult to move this to the cloud
• in the cloud you pay for four things, all wrapped up differently:
• cpu
• memory
• network
• disk
• the most expensive: cpu/compute!
• when you have a big data warehouse, you often need a lot of memory
• but memory is anchored to the cpu
• big things in the cloud are a lot more expensive than a lot of small things
• favor many small, cheap things over one big, expensive thing so you have lots of knobs to turn
common business complaints about analytic teams
• “they don't change as quickly as we'd like”
• “they said they'd work for all departments, but it seems like they only
work for finance and sales”
• hr, it, and supporting departments never get attention
• “the data is unreliable”
• “this report is far different than this other report”
• “every time i ask for something new, it takes six weeks and by then the
opportunity has passed”
• “the cubes have outdated references to business concepts”
• outdated terms
these problems are actually caused by the business
consistency vs. speed: it only takes 4–16 hours to make a report; it’s making it consistent and reliable that takes so long
the business worships at the altar of consistency
three types of consistency that cause overhead
• internal consistency: does the report match what the report is saying?
• external consistency: does the report match every other report?
• historical consistency: does the report match itself from years past?
• the first one is mandatory, but the remaining two can cause a lot of problems
and a lot of headache
• how can the business change key metrics and then maintain consistency in the
future?
• do we change the past?
• do we keep the past, but then have two different versions of the metric, the report, the
cube?
boat ownership
• alert for things like fraud
• reporting to wall street, auditors, compliance
• reporting to upper management, board of directors
• tactical reporting to other management
• data analysis, machine learning, deep learning
• data lineage
• data governance
• data brokerage between transactional applications
• historical data, archiving data
no separation of concerns in the architecture
trying to make a star schema do everything
[diagram: sources → etl → staging and/or ods → etl → data warehouse]
data latency
[diagram: source → etl → staging → etl → ods → etl → data warehouse]
data movement takes a long time
don’t be afraid of files
• files are fast!
• files are flexible
• new files can have a new data structure without changing the old data structure
• files write the data only once, through a single data structure
• files can be indexed
• file storage is cheap
• files can have high data integrity
• files can be unstructured or non-relational
parquet files
• Organizing by column allows for better compression
• The space savings are very noticeable at the scale of a Hadoop cluster
• I/O is reduced because we can efficiently scan only a subset of the columns while reading the data
• Better compression also reduces the bandwidth required to read the input
• Splittable
• Horizontally scalable
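a toy sketch of the columnar idea using only python's zlib. this is not parquet's actual encoding, just an illustration of why storing each column contiguously compresses well and lets a reader touch only the columns it needs:

```python
import zlib

# A toy table: one highly repetitive column, as fact tables often have.
columns = {
    "product_line": ["Hardware"] * 1000,
    "quantity": list(range(1000)),
    "revenue": [i * 1.5 for i in range(1000)],
}

# Columnar layout: each column serialized and compressed separately.
stored = {name: zlib.compress("\n".join(map(str, vals)).encode())
          for name, vals in columns.items()}

# Column pruning: reading just `quantity` decompresses one column,
# never touching the other columns' bytes.
quantity = [int(v)
            for v in zlib.decompress(stored["quantity"]).decode().split("\n")]

# The all-identical column collapses to almost nothing when stored contiguously.
print(len(stored["product_line"]), sum(quantity))
```

real parquet adds run-length and dictionary encodings, row groups, and footer metadata on top of this idea, which is also what makes the files splittable.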
in the cloud small things are scalable and changeable
• we want a lot of small things
• we want to avoid compute charges for historical data, snapshotting, and
archival
basic physical idea of a data lake
[diagram: source → etl → staging → ods → etl → multiple data marts]
example modern data architecture
[diagram: sources → etl via DATA FACTORY (mapping dataflows, pipelines, SSIS packages, triggered & scheduled pipelines) → DATA LAKE STORE / Azure Blob Storage → AZURE DATABRICKS (etl logic, calculations) → SQL DB / SQL Server and SQL DW → AZURE ANALYSIS SERVICES → dashboards and web applications; AZURE STORAGE for direct download]
back to this list: these should be separate parts of a single system
• alert for things like fraud
• reporting to wall street, auditors, compliance
• reporting to upper management, board of directors
• tactical reporting to other management
• data analysis, machine learning, deep learning
• data lineage
• data governance
• data brokerage between transactional applications
• historical data, archiving data
pitch it to the business: consistent path
[diagram: source → etl via DATA FACTORY (mapping dataflows, pipelines, SSIS packages, triggered & scheduled pipelines) → DATA LAKE STORE / Azure Blob Storage → SQL DB / SQL Server → AZURE ANALYSIS SERVICES → dashboards]
consistent path tips
• never say:
• one version of the truth
• all the data will be cleaned
• all the data will be consistent
• keep total number of reports on the consistent path to less than 50
• remind them of boat ownership
• new report authoring takes a long time
• changes are slow and methodical
new logo
pitch it to the business: alerting
[diagram: source → etl → DATA LAKE STORE / Azure Blob Storage → AZURE DATABRICKS (etl logic, calculations) → dashboards, web applications, and direct download]
pitch it to the business: just-in-time reporting (flexible, inconsistent, but we don’t say that)
[diagram: source → etl → staging → ods → BoD → etl → multiple data marts]
• done in a couple of days: decoupled, not quite as clean, useful
• done in a couple of weeks: coupled, clean, consistent
a properly designed data lake
easy to organize!
• do it by folders
• staging
• ods
• organize by source with folders
• sap
• salesforce
• greatplains
• marketingorigin
• salesorigin
• historical
• snapshot
• temporal
• adls allows you to change a file’s storage cost tier without changing the uri
• allows delta loads
• read and analyze with a myriad of tools, including azure databricks, power bi, excel
• separation from compute so we don’t get charged for compute!
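a sketch of that folder layout with python's pathlib. the zone and source names are the examples from the slide, and the "parquet" file is an empty placeholder standing in for a real delta load:

```python
import tempfile
from pathlib import Path

# Toy data lake following the folder scheme above.
root = Path(tempfile.mkdtemp()) / "datalake"
for zone in ["staging", "ods"]:
    for source in ["sap", "salesforce", "greatplains"]:
        (root / zone / source).mkdir(parents=True)
for kind in ["snapshot", "temporal"]:
    (root / "historical" / kind).mkdir(parents=True)

# A dated delta load lands as a NEW file; older files are never rewritten,
# so new data can carry a new structure without changing the old data.
delta = root / "staging" / "salesforce" / "2019-11-05.parquet"
delta.write_bytes(b"")  # placeholder; a real load would write parquet bytes

print(sorted(p.name for p in (root / "staging").iterdir()))
```

because this is just storage, nothing here incurs compute charges until a tool actually reads the files.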
data virtualization as a concept
• data stays in its original place
• sql server
• azure blob storage
• azure data lake storage gen 2
• metadata repository is over the data where it is
• data can then be queried and joined in a single location
• spark sql
• polybase
• hive
• power bi
• sql server big data clusters
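spark sql and polybase aren't shown here; as a stand-in, sqlite's ATTACH illustrates the core idea: the data stays in its original files, and a single query surface joins across them without copying the data first:

```python
import os
import sqlite3
import tempfile

# Two "source systems" that stay in their original files (made-up data).
d = tempfile.mkdtemp()
sales_path = os.path.join(d, "sales.db")
crm_path = os.path.join(d, "crm.db")
with sqlite3.connect(sales_path) as c:
    c.execute("CREATE TABLE orders (customer_id INT, revenue REAL)")
    c.execute("INSERT INTO orders VALUES (1, 100.0), (1, 50.0)")
with sqlite3.connect(crm_path) as c:
    c.execute("CREATE TABLE customers (customer_id INT, name TEXT)")
    c.execute("INSERT INTO customers VALUES (1, 'Contoso')")

# One query location over both stores: attach them and join in place.
hub = sqlite3.connect(":memory:")
hub.execute(f"ATTACH DATABASE '{sales_path}' AS sales")
hub.execute(f"ATTACH DATABASE '{crm_path}' AS crm")
result = hub.execute("""
    SELECT c.name, SUM(o.revenue)
    FROM sales.orders o JOIN crm.customers c USING (customer_id)
    GROUP BY c.name
""").fetchall()
print(result)  # [('Contoso', 150.0)]
```

the virtualization tools on the slide do the same thing at a much bigger scale: a metadata layer over files and databases, queried and joined from one place.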
data virtualization in action
using parquet files
more spark sql
conclusion
• yes, we still make star schemas
• yes, we still use slowly-changing dimensions
• yes, we still use cubes
• we need to understand their limitations
• don’t be afraid of files and just in time analytics
• don’t conflate alerting and speed with consistency
• consistent reporting should be kept to 50 reports or fewer
• everything else should be decoupled and flexible so it can change quickly
• we can create analytic systems without SQL
• file-based (parquet)
• we still primarily use SQL as a language
• cheap
• massively parallel
• easily changeable
• distilled using a star schema and data virtualization
Session
Evaluations
Submit by 5pm Friday,
November 15th to
win prizes.
Download the GuideBook App
and search: PASS Summit 2019
Follow the QR code link on session
signage
Go to PASSsummit.com
3 W A Y S T O A C C E S S
Thank You
Speaker name
@yourhandle
youremail@email.com

More Related Content

What's hot

Embed Interactive Reports in Your Apps
Embed Interactive Reports in Your AppsEmbed Interactive Reports in Your Apps
Embed Interactive Reports in Your Apps
Teo Lachev
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
Lucidworks
 
Agile Data Warehousing
Agile Data WarehousingAgile Data Warehousing
Agile Data Warehousing
Davide Mauri
 

What's hot (20)

Building Search Engines - Lucene, SolR and Elasticsearch
Building Search Engines - Lucene, SolR and ElasticsearchBuilding Search Engines - Lucene, SolR and Elasticsearch
Building Search Engines - Lucene, SolR and Elasticsearch
 
Blockchain for the DBA and Data Professional
Blockchain for the DBA and Data ProfessionalBlockchain for the DBA and Data Professional
Blockchain for the DBA and Data Professional
 
Democratizing Data Science in the Enterprise
Democratizing Data Science in the EnterpriseDemocratizing Data Science in the Enterprise
Democratizing Data Science in the Enterprise
 
Self-Service Data Integration with Power Query
Self-Service Data Integration with Power QuerySelf-Service Data Integration with Power Query
Self-Service Data Integration with Power Query
 
Azure Stream Analytics
Azure Stream AnalyticsAzure Stream Analytics
Azure Stream Analytics
 
Tips & Tricks SQL in the City Seattle 2014
Tips & Tricks SQL in the City Seattle 2014Tips & Tricks SQL in the City Seattle 2014
Tips & Tricks SQL in the City Seattle 2014
 
SenchaCon 2016: Using Ext JS to Turn Big Data into Intelligence - Olga Petrov...
SenchaCon 2016: Using Ext JS to Turn Big Data into Intelligence - Olga Petrov...SenchaCon 2016: Using Ext JS to Turn Big Data into Intelligence - Olga Petrov...
SenchaCon 2016: Using Ext JS to Turn Big Data into Intelligence - Olga Petrov...
 
SenchaCon 2016: Web Development at the Speed of Thought: Succeeding in the Ap...
SenchaCon 2016: Web Development at the Speed of Thought: Succeeding in the Ap...SenchaCon 2016: Web Development at the Speed of Thought: Succeeding in the Ap...
SenchaCon 2016: Web Development at the Speed of Thought: Succeeding in the Ap...
 
ECS19 - Ahmad Najjar and Serge Luca - Power Platform Tutorial
ECS19 - Ahmad Najjar and Serge Luca - Power Platform TutorialECS19 - Ahmad Najjar and Serge Luca - Power Platform Tutorial
ECS19 - Ahmad Najjar and Serge Luca - Power Platform Tutorial
 
Event Hub & Azure Stream Analytics
Event Hub & Azure Stream AnalyticsEvent Hub & Azure Stream Analytics
Event Hub & Azure Stream Analytics
 
Cloud architectural patterns and Microsoft Azure tools
Cloud architectural patterns and Microsoft Azure toolsCloud architectural patterns and Microsoft Azure tools
Cloud architectural patterns and Microsoft Azure tools
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
 
Real Time Power BI
Real Time Power BIReal Time Power BI
Real Time Power BI
 
Designing dashboards for performance shridhar wip 040613
Designing dashboards for performance shridhar wip 040613Designing dashboards for performance shridhar wip 040613
Designing dashboards for performance shridhar wip 040613
 
Embed Interactive Reports in Your Apps
Embed Interactive Reports in Your AppsEmbed Interactive Reports in Your Apps
Embed Interactive Reports in Your Apps
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
 
PowerBI v2, Power to the People, 1 year later
PowerBI v2, Power to the People, 1 year laterPowerBI v2, Power to the People, 1 year later
PowerBI v2, Power to the People, 1 year later
 
Agile Data Warehousing
Agile Data WarehousingAgile Data Warehousing
Agile Data Warehousing
 
5 Amazing Reasons DBAs Need to Love Extended Events
5 Amazing Reasons DBAs Need to Love Extended Events5 Amazing Reasons DBAs Need to Love Extended Events
5 Amazing Reasons DBAs Need to Love Extended Events
 
NYC Data Amp - Microsoft Azure and Data Services Overview
NYC Data Amp - Microsoft Azure and Data Services OverviewNYC Data Amp - Microsoft Azure and Data Services Overview
NYC Data Amp - Microsoft Azure and Data Services Overview
 

Similar to Data modeling trends for Analytics

The final frontier
The final frontierThe final frontier
The final frontier
Terry Bunio
 
The final frontier v3
The final frontier v3The final frontier v3
The final frontier v3
Terry Bunio
 
The Death of the Star Schema
The Death of the Star SchemaThe Death of the Star Schema
The Death of the Star Schema
DATAVERSITY
 
Data Detectives - Presentation
Data Detectives - PresentationData Detectives - Presentation
Data Detectives - Presentation
Clint Campbell
 
Data Refinement: The missing link between data collection and decisions
Data Refinement: The missing link between data collection and decisionsData Refinement: The missing link between data collection and decisions
Data Refinement: The missing link between data collection and decisions
Vivastream
 
Ds03 data analysis
Ds03   data analysisDs03   data analysis
Ds03 data analysis
DotNetCampus
 
Graphs fun vjug2
Graphs fun vjug2Graphs fun vjug2
Graphs fun vjug2
Neo4j
 

Similar to Data modeling trends for Analytics (20)

Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptx
 
Data modeling trends for analytics
Data modeling trends for analyticsData modeling trends for analytics
Data modeling trends for analytics
 
Building Data Warehouse in SQL Server
Building Data Warehouse in SQL ServerBuilding Data Warehouse in SQL Server
Building Data Warehouse in SQL Server
 
The final frontier
The final frontierThe final frontier
The final frontier
 
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile ApproachUsing OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach
 
Data Warehouse approaches with Dynamics AX
Data Warehouse  approaches with Dynamics AXData Warehouse  approaches with Dynamics AX
Data Warehouse approaches with Dynamics AX
 
The final frontier v3
The final frontier v3The final frontier v3
The final frontier v3
 
Store, Extract, Transform, Load, Visualize. Untagged Conference
Store, Extract, Transform, Load, Visualize. Untagged ConferenceStore, Extract, Transform, Load, Visualize. Untagged Conference
Store, Extract, Transform, Load, Visualize. Untagged Conference
 
The Death of the Star Schema
The Death of the Star SchemaThe Death of the Star Schema
The Death of the Star Schema
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
 
Why Data Modeling Is Fundamental
Why Data Modeling Is FundamentalWhy Data Modeling Is Fundamental
Why Data Modeling Is Fundamental
 
Data Detectives - Presentation
Data Detectives - PresentationData Detectives - Presentation
Data Detectives - Presentation
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
Rohit Resume
Rohit ResumeRohit Resume
Rohit Resume
 
Data Refinement: The missing link between data collection and decisions
Data Refinement: The missing link between data collection and decisionsData Refinement: The missing link between data collection and decisions
Data Refinement: The missing link between data collection and decisions
 
Ds03 data analysis
Ds03   data analysisDs03   data analysis
Ds03 data analysis
 
Agile Data Warehousing
Agile Data WarehousingAgile Data Warehousing
Agile Data Warehousing
 
Graphs fun vjug2
Graphs fun vjug2Graphs fun vjug2
Graphs fun vjug2
 
Relational data modeling trends for transactional applications
Relational data modeling trends for transactional applicationsRelational data modeling trends for transactional applications
Relational data modeling trends for transactional applications
 
Develop a Custom Data Solution Architecture with NorthBay
Develop a Custom Data Solution Architecture with NorthBayDevelop a Custom Data Solution Architecture with NorthBay
Develop a Custom Data Solution Architecture with NorthBay
 

More from Ike Ellis

More from Ike Ellis (20)

Storytelling with Data with Power BI
Storytelling with Data with Power BIStorytelling with Data with Power BI
Storytelling with Data with Power BI
 
Storytelling with Data with Power BI.pptx
Storytelling with Data with Power BI.pptxStorytelling with Data with Power BI.pptx
Storytelling with Data with Power BI.pptx
 
Data Modeling on Azure for Analytics
Data Modeling on Azure for AnalyticsData Modeling on Azure for Analytics
Data Modeling on Azure for Analytics
 
Migrate a successful transactional database to azure
Migrate a successful transactional database to azureMigrate a successful transactional database to azure
Migrate a successful transactional database to azure
 
Power bi premium
Power bi premiumPower bi premium
Power bi premium
 
Move a successful onpremise oltp application to the cloud
Move a successful onpremise oltp application to the cloudMove a successful onpremise oltp application to the cloud
Move a successful onpremise oltp application to the cloud
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
Pass 2018 introduction to dax
Pass 2018 introduction to daxPass 2018 introduction to dax
Pass 2018 introduction to dax
 
Pass the Power BI Exam
Pass the Power BI ExamPass the Power BI Exam
Pass the Power BI Exam
 
Slides for PUG 2018 - DAX CALCULATE
Slides for PUG 2018 - DAX CALCULATESlides for PUG 2018 - DAX CALCULATE
Slides for PUG 2018 - DAX CALCULATE
 
Introduction to DAX
Introduction to DAXIntroduction to DAX
Introduction to DAX
 
14 Habits of Great SQL Developers
14 Habits of Great SQL Developers14 Habits of Great SQL Developers
14 Habits of Great SQL Developers
 
14 Habits of Great SQL Developers
14 Habits of Great SQL Developers14 Habits of Great SQL Developers
14 Habits of Great SQL Developers
 
Dive Into Azure Data Lake - PASS 2017
Dive Into Azure Data Lake - PASS 2017Dive Into Azure Data Lake - PASS 2017
Dive Into Azure Data Lake - PASS 2017
 
Survey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data LandscapeSurvey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data Landscape
 
11 Goals of High Functioning SQL Developers
11 Goals of High Functioning SQL Developers11 Goals of High Functioning SQL Developers
11 Goals of High Functioning SQL Developers
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
Azure DocumentDB 101
Azure DocumentDB 101Azure DocumentDB 101
Azure DocumentDB 101
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
 
SQL Pass Architecture SQL Tips & Tricks
SQL Pass Architecture SQL Tips & TricksSQL Pass Architecture SQL Tips & Tricks
SQL Pass Architecture SQL Tips & Tricks
 

Recently uploaded

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 

Recently uploaded (20)

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀

Data modeling trends for Analytics

  • 1. How to model data for analytics in a modern world
    Data Modeling Trends for 2019 and Beyond
    Ike Ellis, Microsoft MVP
    General Manager – Data & AI Practice
    Solliance
  • 3. everything PASS has to offer
    • Free online webinar events
    • Free 1-day local training events
    • Local user groups around the world
    • Online special interest user groups
    • Business analytics training
    • Free online resources and newsletters
    Get involved – explore PASS.org
  • 4. Ike Ellis
    General Manager – Data & AI Practice, Solliance
    /ikeellis • @ike_ellis • www.ikeellis.com
    • Founder of the San Diego Power BI and PowerApps User Group
    • Founder of the San Diego Software Architecture Group
    • MVP since 2011
    • Author of Developing Azure Solutions, Power BI MVP Book
    • Speaker at PASS Summit, SQLBits, DevIntersections, TechEd, Craft, Microsoft Data & AI Conference
  • 5. agenda
    • where we are today:
      • physical storage design
      • logical data schema design
    • where we are headed:
      • data lakes
      • lambdas
      • iot
    • challenges to data modeling from the business
    • challenges with the cloud
    • positioning with the business
  • 6. reasons we build a data system for analytics
    • alert for things like fraud
    • reporting to wall street, auditors, compliance
    • reporting to upper management, board of directors
    • tactical reporting to other management
    • data analysis, machine learning, deep learning
    • data lineage
    • data governance
    • data brokerage between transactional applications
    • historical data, archiving data
  • 7. common enterprise data architecture (eda)
    (diagram) multiple source systems → etl → staging → etl → ods and/or data warehouse
  • 8. staging
    • relational database that looks like the source systems
    • usually keeps about 90 days' worth of data, or some increment of data
    • used so that if etl fails, we don't have to go back to the source system
  • 9. ods
    • operational data store
    • totally normalized
    • used to mimic the source systems, but consolidates them into one database (e.g., three databases into one)
    • might have light cleaning
    • often used for data brokerage
    • might have 300 – 500 tables
  • 10. kimball method: sample project plan
    1) find executive sponsor
    2) gather requirements / define vocabulary / wiki definitions / requirements document
    3) physical design
    4) dimensional modeling
    5) etl development
    6) deployment/training
  • 12. kimball dimensional modeling
    • business questions focus on measures that are aggregated by business dimensions
    • measures are facts about the business (nouns and strong words)
    • dimensions are ways in which the measures can be aggregated:
      • pay attention to the "by" keyword
      • examples:
        • sales revenue by salesperson
        • profit by product line
        • order quantity by year
    (diagram) dimensions: Product Line, Salesperson, Time; measures: Quantity, Revenue, Profit
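The "measure aggregated by a dimension" idea can be sketched in a few lines of plain Python; the rows and column names below are made up for illustration:

```python
# Aggregate a measure (revenue) by a dimension (salesperson).
# Rows and names are hypothetical, for illustration only.
orders = [
    {"salesperson": "Ann", "product_line": "Bikes", "revenue": 100.0},
    {"salesperson": "Bob", "product_line": "Bikes", "revenue": 250.0},
    {"salesperson": "Ann", "product_line": "Helmets", "revenue": 50.0},
]

# "sales revenue by salesperson" -- the "by" word picks the grouping key
revenue_by_salesperson = {}
for row in orders:
    key = row["salesperson"]
    revenue_by_salesperson[key] = revenue_by_salesperson.get(key, 0.0) + row["revenue"]

print(revenue_by_salesperson)  # {'Ann': 150.0, 'Bob': 250.0}
```

Swapping the grouping key to `product_line` gives "profit by product line" with the same shape of loop; that symmetry is what dimensional modeling formalizes.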
  • 13. star schemas
    • group related dimensions into dimension tables
    • group related measures into fact tables
    • relate fact tables to dimension tables by using foreign keys
    (diagram) FactOrders (CustomerKey, SalesPersonKey, ProductKey, ShippingAgentKey, TimeKey, OrderNo, LineItemNo, Quantity, Revenue, Cost, Profit) surrounded by DimSalesPerson, DimProduct, DimCustomer, DimDate, DimShippingAgent
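A minimal sketch of how a star-schema query works, assuming two tiny hypothetical dimension tables and one fact table: the fact rows carry only surrogate keys, and dimension lookups resolve them to attributes.

```python
# Star-schema join in miniature: fact rows hold surrogate keys,
# dimension tables resolve those keys to attributes. Shapes are illustrative.
dim_product = {1: {"ProductName": "Road Bike", "ProductLine": "Bikes"}}
dim_salesperson = {10: {"SalesPersonName": "Ann", "StoreRegion": "West"}}

fact_orders = [
    {"ProductKey": 1, "SalesPersonKey": 10, "Quantity": 2, "Revenue": 3000.0},
]

# "Join" by key lookup, then aggregate revenue by product line and region.
totals = {}
for f in fact_orders:
    line = dim_product[f["ProductKey"]]["ProductLine"]
    region = dim_salesperson[f["SalesPersonKey"]]["StoreRegion"]
    totals[(line, region)] = totals.get((line, region), 0.0) + f["Revenue"]

print(totals)  # {('Bikes', 'West'): 3000.0}
```

In SQL this is the familiar fact-to-dimension join with a GROUP BY; the point is that every query follows the same one-hop pattern.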
  • 14. considerations for dimension tables
    • denormalization:
      • dimension tables are usually "wide"
      • wide tables are often faster when there are fewer joins
      • duplicated values are usually preferable to joins
    • keys:
      • create new surrogate keys to identify each row:
        • simple integer keys provide the best performance
        • retain original business keys
    • conformed dimensions:
      • dimensions that can be shared across multiple fact tables
    (diagram) DimSalesPerson: SalesPersonKey (surrogate key), EmployeeNo (business key), SalesPersonName, StoreName, StoreCity, StoreRegion (denormalized – no separate store table)
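Surrogate-key assignment can be sketched like this (the `EmployeeNo` business key and row values are hypothetical):

```python
# Assign simple integer surrogate keys while retaining the business key.
# EmployeeNo is the business key from the source system (illustrative values).
source_rows = [
    {"EmployeeNo": "E-1042", "SalesPersonName": "Ann"},
    {"EmployeeNo": "E-0007", "SalesPersonName": "Bob"},
]

dim_salesperson = []
key_by_business_key = {}
for i, row in enumerate(source_rows, start=1):
    key_by_business_key[row["EmployeeNo"]] = i   # lookup for etl re-runs
    dim_salesperson.append({"SalesPersonKey": i, **row})

print(key_by_business_key)  # {'E-1042': 1, 'E-0007': 2}
```

Keeping the business key alongside the surrogate is what lets later loads tie incoming source rows back to existing dimension rows.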
  • 15. considerations for fact tables
    • grain:
      • use the lowest level of detail that relates to all dimensions
      • create multiple fact tables if multiple grains are required
    • keys:
      • the primary key is usually a composite key that includes dimension foreign keys
    • measures:
      • additive: measures that can be aggregated across all dimensions
      • nonadditive: measures that cannot be aggregated
      • semi-additive: measures that can be aggregated across some dimensions, but not others
    • degenerate dimensions:
      • dimensions stored in the fact table (e.g., OrderNo, LineItemNo, PaymentMethod)
    (diagram) FactOrders (grain = order line item: Quantity, Revenue, Cost, Profit additive; Margin nonadditive) and FactAccountTransaction (CreditDebitAmount additive; AccountBalance semi-additive)
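The additive/semi-additive distinction is easy to see with an account balance, the classic semi-additive measure; the rows below are made up:

```python
# Additive vs. semi-additive measures (illustrative rows).
# A balance can be summed ACROSS ACCOUNTS at one point in time,
# but summing it ACROSS TIME double-counts the same money.
txns = [
    {"AccountNo": "A", "Month": 1, "AccountBalance": 100.0},
    {"AccountNo": "A", "Month": 2, "AccountBalance": 120.0},
    {"AccountNo": "B", "Month": 2, "AccountBalance": 80.0},
]

# Across accounts at month 2: summing is meaningful.
total_month2 = sum(t["AccountBalance"] for t in txns if t["Month"] == 2)

# Across time for account A: take the latest balance, never the sum.
latest_a = max((t for t in txns if t["AccountNo"] == "A"),
               key=lambda t: t["Month"])

print(total_month2, latest_a["AccountBalance"])  # 200.0 120.0
```

A truly additive measure like a credit/debit amount would sum safely along both dimensions; a nonadditive one like a margin percentage along neither.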
  • 16. reasons to make a star schema
    • easy to use and understand
    • one version of the truth
    • easy to create aggregations in a single pass over the data
    • much smaller table count (12 – 25 tables)
    • faster queries
    • good place to feed cubes (either azure analysis services or power bi shared datasets)
    • supported by many business intelligence tools (excel pivot tables, power bi, tableau, etc.)
    • what i always say:
      • "you can either start out by making a star schema, or you can one day wish you did. those are the two choices"
  • 17. snowflake schemas
    • normalized dimension tables
    • consider when:
      • a subdimension can be shared between multiple dimensions
      • a hierarchy exists, and the dimension table contains a small subset of data that may change frequently
      • a sparse dimension has several different subtypes
      • multiple fact tables of varying grain reference different levels in the dimension hierarchy
    (diagram) FactOrders with dimensions normalized into subdimensions: DimProduct → DimProductLine / DimSupplier, DimSalesPerson → DimStore → DimGeography, DimCustomer → DimGeography
  • 18. common enterprise data architecture (eda)
    (diagram, repeated from slide 7) multiple source systems → etl → staging → etl → ods and/or data warehouse
  • 20. weakness #1: let's add a single column
    (diagram) the eda pipeline from slide 7, annotated 1 – 9: adding one column at the source means changing it in nine places (×9!)
  • 21. so many of you have decided to just go directly to the source!
    (diagram) source → power query
  • 22. mayhem
    (diagram) source → power query, repeated fourteen times over
    • spread business logic
    • when something changes, you have to change a ton of places
    • inconsistent
    • repeatedly cleaning the same data
  • 23. and i’ve seen these data models
  • 24. weakness #2: sql server for staging/ods
    • sql actually writes data multiple times on insert:
      • one write for the log (traverse the log structure), then the write to the disk subsystem
      • one write to the mdf, then the write to the disk subsystem
      • writes to maintain indexing, then the writes to the disk subsystem
    • sql is strongly consistent
      • the write isn't successful until all the tables and indexes represent the consistent write and all the triggers fire
    • sql is expensive
      • you get charged based on the number of cores you use
  • 25. weakness #3: great big data warehouses are very difficult to change and maintain
    • all tables need to be consistent with one another
    • historical data makes queries slow
    • historical data makes dws hard to back up, restore, and manage
    • indexes take too long to maintain
    • dev and other environments are too difficult to spin up
    • shared database environments are too hard to use in development environments
    • keeping track of pii and sensitive information is very difficult
    • creating automated tests is very difficult
  • 26. weakness #4: very difficult to move this to the cloud
    • in the cloud you pay for four things, all wrapped up differently:
      • CPU
      • memory
      • network
      • disk
    • most expensive: CPU/compute!
    • when you have a big data warehouse, you often need a lot of memory, but memory is anchored to the CPU
    • big things in the cloud are a lot more expensive than a lot of small things
    • make the big things cheap and the small things carry the cost, so you have a lot of knobs to turn
  • 27. common business complaints about analytic teams
    • "they don't change as quickly as we'd like"
    • "they said they'd work for all departments, but it seems like they only work for finance and sales"
      • hr, it, and supporting departments never get attention
    • "the data is unreliable"
    • "this report is far different than this other report"
    • "every time i ask for something new, it takes six weeks and by then the opportunity has passed"
    • "the cubes have outdated references to business concepts"
      • outdated terms
  • 28. these problems are actually caused by the business
    • consistency vs. speed: it only takes 4 – 16 hours to make a report; it's making it consistent and reliable that takes so long
  • 29. the business worships at the altar of consistency
  • 30. three types of consistency that cause overhead
    • internal consistency: does the report match what the report is saying?
    • external consistency: does the report match every other report?
    • historical consistency: does the report match itself from years past?
    • the first one is mandatory, but the remaining two can cause a lot of problems and a lot of headaches
    • how can the business change key metrics and then maintain consistency in the future?
      • do we change the past?
      • do we keep the past, but then have two different versions of the metric, the report, the cube?
  • 32. no separation of concerns in the architecture: trying to make a star schema do everything
    • alert for things like fraud
    • reporting to wall street, auditors, compliance
    • reporting to upper management, board of directors
    • tactical reporting to other management
    • data analysis, machine learning, deep learning
    • data lineage
    • data governance
    • data brokerage between transactional applications
    • historical data, archiving data
    (diagram) the eda pipeline from slide 7
  • 33. data latency
    (diagram) source → etl → staging → etl → ods → etl → data warehouse: data movement takes a long time
  • 34. don’t be afraid of files
    • files are fast!
    • files are flexible
      • new files can have a new data structure without changing the old data structure
      • files only write the data through one data structure
    • files can be indexed
    • file storage is cheap
    • files can have high data integrity
    • files can be unstructured or non-relational
  • 35. parquet files
    • organizing by column allows for better compression
      • the space savings are very noticeable at the scale of a hadoop cluster
    • i/o is reduced because we can efficiently scan only a subset of the columns while reading the data
    • better compression also reduces the bandwidth required to read the input
    • splittable
    • horizontally scalable
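The columnar advantage can be demonstrated with nothing but the standard library. This is a deliberately simplified sketch (real Parquet adds encodings, row groups, and statistics), using made-up rows:

```python
# Why columnar layouts help: a column store lets a query read only the
# columns it needs, and runs of similar values compress well.
# Minimal stdlib sketch; real Parquet adds encodings, row groups, stats.
import zlib

rows = [("Ann", "West", 100), ("Bob", "West", 250), ("Cal", "East", 50)] * 100

# Row layout: a scan touches every byte of every row.
row_bytes = repr(rows).encode()

# Column layout: pull out just the "region" column.
regions = [r[1] for r in rows]
col_bytes = repr(regions).encode()

# The single column is both smaller to read and compresses far better,
# because repeated same-typed values sit next to each other.
print(len(zlib.compress(col_bytes)) < len(zlib.compress(row_bytes)))  # True
```

The same effect is why a Parquet reader that projects two columns out of fifty does a small fraction of the i/o a row-oriented scan would.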
  • 36. in the cloud, small things are scalable and changeable
    • we want a lot of small things
    • we want to avoid compute charges for historical data, snapshotting, and archival
  • 37. basic physical idea of a data lake
    (diagram) source → etl → staging → ods → etl → several data marts, each fed by its own etl
  • 38. example modern data architecture
    (diagram) sources → etl via DATA FACTORY (mapping dataflows, pipelines, SSIS packages; triggered & scheduled pipelines) → DATA LAKE STORE / Azure Blob Storage → AZURE DATABRICKS (etl logic, calculations) → SQL DB / SQL Server and SQL DW → AZURE ANALYSIS SERVICES → web applications and dashboards, plus direct download from azure storage
  • 39. back to this list: these should be separate parts of a single system
    • alert for things like fraud
    • reporting to wall street, auditors, compliance
    • reporting to upper management, board of directors
    • tactical reporting to other management
    • data analysis, machine learning, deep learning
    • data lineage
    • data governance
    • data brokerage between transactional applications
    • historical data, archiving data
  • 40. pitch it to the business: consistent path
    (diagram) source → etl via DATA FACTORY (mapping dataflows, pipelines, SSIS packages; triggered & scheduled pipelines) → DATA LAKE STORE / Azure Blob Storage → SQL DB / SQL Server → AZURE ANALYSIS SERVICES → dashboards
  • 41. consistent path tips
    • never say:
      • one version of the truth
      • all the data will be cleaned
      • all the data will be consistent
    • keep the total number of reports on the consistent path to fewer than 50
    • remind them of boat ownership
    • new report authoring takes a long time
    • changes are slow and methodical
  • 43. pitch it to the business: alerting
    (diagram) source → etl → DATA LAKE STORE / Azure Blob Storage → AZURE DATABRICKS (etl logic, calculations) → web applications, dashboards, direct download
  • 44. pitch it to the business: just-in-time reporting (flexible, inconsistent, but we don’t say that)
    (diagram) source → etl → staging → etl → ods → etl → data marts (including a BoD data mart)
    • just-in-time data marts: done in a couple of days; decoupled, not quite as clean, useful
    • consistent path: done in a couple of weeks; coupled, clean, consistent
  • 45. a properly designed data lake is easy to organize!
    • do it by folders:
      • staging
      • ods
      • organized by source with folders: sap, salesforce, greatplains, marketingorigin, salesorigin
      • historical: snapshot, temporal
    • adls allows you to choose the expense of the file without changing the uri
    • allows delta loads
    • read and analyze with a myriad of tools, including azure databricks, power bi, excel
    • separation from compute so we don’t get charged for compute!
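The folder layout above can be sketched locally; the folder names follow the slide, but in practice these would be ADLS Gen2 / blob container paths rather than a local filesystem:

```python
# Sketch of the data-lake folder layout described above, built in a
# local temp directory purely for illustration.
import os
import tempfile

root = tempfile.mkdtemp()
for path in [
    "staging/sap", "staging/salesforce", "staging/greatplains",
    "staging/marketingorigin", "staging/salesorigin",
    "ods",
    "historical/snapshot", "historical/temporal",
]:
    os.makedirs(os.path.join(root, path), exist_ok=True)

print(sorted(os.listdir(root)))  # ['historical', 'ods', 'staging']
```

Because the hierarchy lives in the path, a delta load only touches one source's folder, and a storage-tier change doesn't alter any uri a consumer depends on.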
  • 46. data virtualization as a concept
    • data stays in its original place:
      • sql server
      • azure blob storage
      • azure data lake storage gen 2
    • a metadata repository sits over the data where it lives
    • data can then be queried and joined in a single location:
      • spark sql
      • polybase
      • hive
      • power bi
      • sql server big data clusters
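Data virtualization in miniature: a catalog maps logical table names onto readers over data that stays where it is, and queries join across sources through that one interface. Everything here (source names, data) is hypothetical:

```python
# Minimal sketch of data virtualization: a metadata catalog maps logical
# table names to readers; the underlying data never moves.
import csv
import io

crm_csv = "CustomerKey,Name\n1,Ann\n2,Bob\n"        # stands in for a file source
erp_rows = [{"CustomerKey": 1, "Revenue": 3000.0}]  # stands in for a database source

catalog = {
    "customers": lambda: list(csv.DictReader(io.StringIO(crm_csv))),
    "orders": lambda: erp_rows,
}

# Query and join across both sources through the single catalog.
names = {int(c["CustomerKey"]): c["Name"] for c in catalog["customers"]()}
joined = [{**o, "Name": names[o["CustomerKey"]]} for o in catalog["orders"]()]
print(joined)  # [{'CustomerKey': 1, 'Revenue': 3000.0, 'Name': 'Ann'}]
```

Engines like spark sql or polybase do the same thing at scale: the "catalog" is external-table metadata, and the lambdas are connectors that push work down to the source.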
  • 50. conclusion
    • yes, we still make star schemas
    • yes, we still use slowly-changing dimensions
    • yes, we still use cubes
      • we need to understand their limitations
    • don’t be afraid of files and just-in-time analytics
    • don’t conflate alerting and speed with consistency
    • consistent reporting should be kept to 50 reports or fewer
    • everything else should be decoupled and flexible so it can change quickly
    • we can create analytic systems without SQL Server, even if we still primarily use SQL as a language:
      • file-based (parquet)
      • cheap
      • massively parallel
      • easily changeable
      • distilled using a star schema and data virtualization
  • 51. Session Evaluations
    Submit by 5pm Friday, November 15th to win prizes.
    3 ways to access:
    • Download the GuideBook App and search: PASS Summit 2019
    • Follow the QR code link on session signage
    • Go to PASSsummit.com