Building a Data Platform for Analytics in Azure
Ike Ellis
Solliance
Ike Ellis
General Manager – Data & AI Practice
Solliance
/ikeellis
@ike_ellis
www.ikeellis.com
youtube.com/IkeEllisOnTheMic
• Founder of San Diego Power BI UserGroup
• Founder of the San Diego Software Architecture Group
• Co-chair of San Diego Data Engineering Meetup
• MVP since 2011
• Author of Developing Azure Solutions, Power BI MVP Book
• Speaker at PASS Summit, SQLBits, DevIntersections, TechEd, Craft, Microsoft Data & AI Conference
© Microsoft Azure + AI Conference All rights reserved.
DEVSpring22
Agenda
• What you have likely built already
• The need for star schemas
• Why SQL Server makes a poor data platform
• How the cloud solves this
• The need for files
• The need for python or code-based pipelines
• The elements of a successful data platform
• Data pipelines to raw
• Different data sources
• Creating star-schema based data marts
• Alerting
• Storing sensitive data
• Deciding where the star schema should be
• Ingesting into Power BI
Research Azure data architecture
You’re likely to see this
Looks scary
At the end of this lecture, I’ll show you how to build this
Most of the components you need to get started and be successful organizing data for Power BI
Common Enterprise Data Architecture (EDA)
[Diagram: four sources → etl → staging and/or ods → etl → data warehouse]
Staging
• Relational database that looks like the source systems
• Usually keeps about 90 days’ worth of data or some increment of data
• Used so that if ETL fails, we don’t have to go back to the source system
ODS
• Operational Data Store
• Totally normalized
• Used to mimic the source systems, but consolidate them into one database
• for example, three databases into one database
• Might have light cleaning
• Often used for data brokerage
• Might have 300 – 500 tables
Kimball dimensional modeling
Business questions focus on measures that are aggregated by
business dimensions
Measures are facts about the business (nouns and strong words)
Dimensions are ways in which the measures can be aggregated:
• pay attention to the “by” keyword
• examples:
• sales revenue by salesperson
• profit by product line
• order quantity by Year
[Diagram labels: dimensions Product Line, Salesperson, Time; measures Quantity, Revenue, Profit]
Star schemas
• Group related dimensions into dimension tables
• Group related measures into fact tables
• Relate fact tables to dimension tables by using foreign keys
DimSalesPerson: SalesPersonKey, SalesPersonName, StoreName, StoreCity, StoreRegion
DimProduct: ProductKey, ProductName, ProductLine, SupplierName
DimCustomer: CustomerKey, CustomerName, City, Region
FactOrders: CustomerKey, SalesPersonKey, ProductKey, ShippingAgentKey, TimeKey, OrderNo, LineItemNo, Quantity, Revenue, Cost, Profit
DimDate: DateKey, Year, Quarter, Month, Day
DimShippingAgent: ShippingAgentKey, ShippingAgentName
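A cut-down version of this star schema can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: two of the dimensions are included, the sample rows are invented, and SQLite stands in for whatever engine actually hosts the data mart. The query answers the "sales revenue by salesperson" question with one join and one GROUP BY.

```python
import sqlite3


def build_star_schema(conn):
    """Create a reduced FactOrders star: two dimension tables and one fact table."""
    conn.executescript("""
        CREATE TABLE DimSalesPerson (
            SalesPersonKey INTEGER PRIMARY KEY,
            SalesPersonName TEXT,
            StoreRegion TEXT
        );
        CREATE TABLE DimProduct (
            ProductKey INTEGER PRIMARY KEY,
            ProductName TEXT,
            ProductLine TEXT
        );
        CREATE TABLE FactOrders (
            SalesPersonKey INTEGER REFERENCES DimSalesPerson(SalesPersonKey),
            ProductKey INTEGER REFERENCES DimProduct(ProductKey),
            OrderNo INTEGER,
            LineItemNo INTEGER,
            Quantity INTEGER,
            Revenue REAL
        );
    """)
    # Invented sample rows, for illustration only.
    conn.executemany("INSERT INTO DimSalesPerson VALUES (?, ?, ?)",
                     [(1, "Ana", "West"), (2, "Raj", "East")])
    conn.executemany("INSERT INTO DimProduct VALUES (?, ?, ?)",
                     [(10, "Road Bike", "Bikes"), (11, "Helmet", "Accessories")])
    conn.executemany("INSERT INTO FactOrders VALUES (?, ?, ?, ?, ?, ?)",
                     [(1, 10, 1001, 1, 2, 3000.0),
                      (1, 11, 1001, 2, 2, 100.0),
                      (2, 10, 1002, 1, 1, 1500.0)])


def revenue_by_salesperson(conn):
    """Aggregate the Revenue measure by the salesperson dimension."""
    rows = conn.execute("""
        SELECT d.SalesPersonName, SUM(f.Revenue)
        FROM FactOrders AS f
        JOIN DimSalesPerson AS d ON d.SalesPersonKey = f.SalesPersonKey
        GROUP BY d.SalesPersonName
        ORDER BY d.SalesPersonName
    """).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
print(revenue_by_salesperson(conn))
```

Swapping DimSalesPerson for DimProduct in the join gives "revenue by product line" without touching the fact table.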
Considerations for dimension tables
Denormalization:
• dimension tables are usually “wide”
• wide tables are often faster when there are fewer joins
• duplicated values are usually preferable to joins
Keys:
• create new surrogate keys to identify each row:
• simple integer keys provide the best performance
• retain original business keys
DimSalesPerson
SalesPersonKey (surrogate key)
EmployeeNo (business key)
SalesPersonName
StoreName, StoreCity, StoreRegion (denormalized; no separate store table)
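The key handling described above can be sketched in a few lines of plain Python. The function name, the sample rows, and the choice to number keys from 1 are assumptions made for the illustration; the point is only that each business key (EmployeeNo) gets one new integer surrogate key and is retained alongside it.

```python
from itertools import count


def assign_surrogate_keys(source_rows, business_key, surrogate_name):
    """Return one dimension row per business key, each with a new integer surrogate key."""
    next_key = count(start=1)
    dimension = {}
    for row in source_rows:
        value = row[business_key]
        if value not in dimension:  # duplicates of a business key are collapsed
            dimension[value] = {surrogate_name: next(next_key), **row}
    return list(dimension.values())


# Invented source rows; store attributes are repeated on each row (denormalized).
source = [
    {"EmployeeNo": "E-204", "SalesPersonName": "Ana", "StoreName": "Downtown", "StoreCity": "San Diego"},
    {"EmployeeNo": "E-117", "SalesPersonName": "Raj", "StoreName": "Downtown", "StoreCity": "San Diego"},
    {"EmployeeNo": "E-204", "SalesPersonName": "Ana", "StoreName": "Downtown", "StoreCity": "San Diego"},
]

dim_sales_person = assign_surrogate_keys(source, "EmployeeNo", "SalesPersonKey")
for row in dim_sales_person:
    print(row)
```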
Considerations for fact tables
Grain:
• use the lowest level of detail that relates to all dimensions
• create multiple fact tables if multiple grains are required
Keys:
• the primary key is usually a composite key that includes dimension foreign keys
Measures:
• additive: Measures that can be aggregated across all dimensions
• nonadditive: Measures that cannot be aggregated
• semi-additive: Measures that can be aggregated across some dimensions, but not others
FactOrders: CustomerKey, SalesPersonKey, ProductKey, TimeKey, OrderNo, LineItemNo, PaymentMethod, Quantity, Revenue, Cost, Profit, Margin
FactAccountTransaction: CustomerKey, BranchKey, AccountTypeKey, AccountNo, CreditDebitAmount, AccountBalance
[Diagram callouts: Additive, Nonadditive, Semi-additive, Degenerate Dimensions, Grain = Order Line Item]
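The difference between additive and semi-additive measures is easiest to see with numbers. The sketch below uses invented account transactions and assumes a simple "day" column as the time dimension. CreditDebitAmount can be summed across everything, while AccountBalance can be summed across accounts but not across time, so only the latest balance per account is used.

```python
# Invented rows: (AccountNo, day, CreditDebitAmount, AccountBalance after the transaction)
transactions = [
    ("A1", 1, 100.0, 100.0),
    ("A1", 2, -30.0, 70.0),
    ("B7", 1, 50.0, 50.0),
    ("B7", 2, 25.0, 75.0),
]


def total_amount(rows):
    """Additive: amounts can be summed across accounts and across time."""
    return sum(amount for _, _, amount, _ in rows)


def naive_balance_sum(rows):
    """Wrong for a semi-additive measure: adds the same account's balance once per day."""
    return sum(balance for _, _, _, balance in rows)


def closing_balance(rows):
    """Semi-additive: keep the latest balance per account, then sum across accounts."""
    latest = {}
    for account, day, _, balance in rows:
        if account not in latest or day > latest[account][0]:
            latest[account] = (day, balance)
    return sum(balance for _, balance in latest.values())


print(total_amount(transactions))       # sum of all credits and debits
print(naive_balance_sum(transactions))  # overstated, balances double counted over time
print(closing_balance(transactions))    # balance across accounts at the latest day
```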
Reasons to make a star schema
• Easy to use and understand
• One version of the truth
• Easy to create aggregations in a single pass over the data
• Much smaller table count (12 – 25 tables)
• Faster queries
• Good place to feed cubes (either Azure Analysis Services or Power BI shared datasets)
• Supported by many business intelligence tools (Excel pivot tables, Power BI, Tableau, etc.)
• What I always say:
• “You can either start out by making a star schema, or you can one day wish you did. Those are the two choices.”
Weaknesses with traditional data platforms
Weakness: let’s add a single column
[Diagram: the same source → staging and/or ods → data warehouse architecture, with each component and etl step numbered 1 through 9]
x9!!!!
So many of you have decided to just go directly to the source!
[Diagram: many separate source → Power Query connections]
Mayhem
• spread business logic
• when something changes, you have to change a ton of places
• inconsistent
• repeatedly cleaning the same data
And I’ve seen these data models
Weakness: SQL Server is bad for every stage of a large data platform
• SQL Server actually writes data multiple times on the insert
• one write for the log (traverse the log structure)
• and then writing for the disk subsystem
• one write to the MDF
• and then writing for the disk subsystem
• writing to maintain indexing
• and then writing for the disk subsystem
• SQL Server is strongly consistent
• the write isn’t successful until all the tables and indexes represent the consistent write and all the triggers fire
• SQL Server is expensive
• you get charged based on the number of cores you use
• To add columns, you have to provide a base value
• Large databases are difficult when you make a dev environment
The Traditional Data Lake
Don’t be afraid of files
• files are fast!
• files are flexible
• new files can have a new data structure without changing the old data structure
• files only write the data through one data structure
• files can be indexed
• file storage is cheap
• files can have high data integrity
• files can be unstructured or non-relational
• Easily copied
The whole idea of an analytical system is that data duplication will speed up aggregations and reporting. Files allow for cheap duplication, which allows us to duplicate more data more frequently.
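The flexibility point (new files can carry a new structure without changing the old ones) can be sketched with two CSV files. The file names, columns, and the read_all helper are hypothetical; CSV is used here only because it needs no extra libraries. The older file is never rewritten, and the reader fills the missing column with None.

```python
import csv
import tempfile
from pathlib import Path


def read_all(folder):
    """Read every CSV in a folder; columns absent from older files come back as None."""
    rows, columns = [], []
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                rows.append(row)
                columns.extend(name for name in row if name not in columns)
    return [{name: row.get(name) for name in columns} for row in rows]


with tempfile.TemporaryDirectory() as folder:
    Path(folder, "orders_2021.csv").write_text("OrderNo,Revenue\n1001,3000\n")
    # The newer file adds a Channel column; the 2021 file is left untouched.
    Path(folder, "orders_2022.csv").write_text("OrderNo,Revenue,Channel\n1002,1500,Web\n")
    orders = read_all(folder)

for order in orders:
    print(order)
```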
Parquet files
• Organizing by column allows for better compression.
• The space savings are very noticeable at the scale of a Hadoop cluster.
• I/O will be reduced as we can efficiently scan only a subset of the columns while reading the data.
• Better compression also reduces the bandwidth required to read the input.
• Splittable
• Horizontally scalable
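Why organizing by column helps can be shown without any Parquet library. The sketch below is a toy model, not the Parquet format itself: it pivots invented rows into one list per column, reads a single column for an aggregate, and run-length encodes a column of repeated values, which is one of the encodings that columnar formats rely on.

```python
from itertools import groupby

# Invented row-oriented records.
rows = [
    {"OrderNo": 1001, "Region": "West", "Revenue": 3000.0},
    {"OrderNo": 1002, "Region": "West", "Revenue": 100.0},
    {"OrderNo": 1003, "Region": "West", "Revenue": 1500.0},
    {"OrderNo": 1004, "Region": "East", "Revenue": 200.0},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# An aggregate touches only the column it needs, not every field of every row.
total_revenue = sum(columns["Revenue"])

# Repeated values sit next to each other in a column, so they encode compactly.
encoded_region = run_length_encode(columns["Region"])

print(total_revenue)
print(encoded_region)
```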
Basic physical idea of a data lake
[Diagram: source → etl → staging → etl → ods → etl → multiple data marts]
Data virtualization as a concept
• data stays in its original place
• sql server
• azure blob storage
• azure data lake storage gen 2
• a metadata repository sits over the data where it is
• data can then be queried and joined in a single location
• spark sql
• polybase
• hive
• power bi
• sql server big data clusters
The need for Python or code-based pipelines and PySpark
• Easier testing
• Easier modularity/code re-use
• Easier code structuring
• Easier code changes
• Lazy Execution
• Code-based IDEs, like PyCharm, VS Code
• Easier integration into deployment pipelines, CI/CD, source control. Better tooling for that
• Better code-based standards. Code is Pythonic
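The testing and modularity points can be illustrated with a tiny pipeline written as plain functions. Everything here is hypothetical (the step names, the record shape, the run_pipeline helper), and plain Python lists stand in for Spark DataFrames so the sketch runs anywhere. Each step is a small pure function that can be tested on its own and reused in other pipelines.

```python
def trim_name(record):
    """Step: strip stray whitespace from the customer name."""
    return {**record, "CustomerName": record["CustomerName"].strip()}


def normalize_region(record):
    """Step: make region codes consistent."""
    return {**record, "Region": record["Region"].strip().upper()}


def run_pipeline(records, steps):
    """Apply each step, in order, to every record."""
    for step in steps:
        records = [step(record) for record in records]
    return records


def test_normalize_region():
    """A unit test needs no cluster, no database, and no source system."""
    assert normalize_region({"Region": " west "}) == {"Region": "WEST"}


test_normalize_region()

cleaned = run_pipeline(
    [{"CustomerName": " Ana ", "Region": "west"}],
    [trim_name, normalize_region],
)
print(cleaned)
```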
What we will build
OK, let’s build the platform
• The elements of a successful data platform
• Data pipelines to raw
• Different data sources
• Creating star-schema based data marts
• Alerting
• Storing sensitive data
• Deciding where the star schema should be
• Ingesting into Power BI
Landing the data
• Chaos here
• Can land data in JSON, CSV, Parquet, TSV, or any weird format
• Could be connecting to APIs, SQL Servers, flat files, NoSQL data stores, other data platforms, etc.
• Needs to have great alerting
• Create an ADLS container called “Landing”
• Store all the various junk in the landing folder, but try to keep some organization
Could be Parquet, or CSV, or JSON
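One way to "keep some organization" in the Landing container is a predictable path convention. The layout below (source system, entity, then load date) is an assumption made for illustration, not the convention used in the talk, and landing_path is a hypothetical helper.

```python
from datetime import date
from pathlib import PurePosixPath


def landing_path(source_system, entity, load_date, file_format):
    """Build a path such as Landing/<source>/<entity>/<yyyy>/<mm>/<dd>/<entity>.<format>."""
    return PurePosixPath(
        "Landing",
        source_system,
        entity,
        f"{load_date:%Y/%m/%d}",
        f"{entity}.{file_format}",
    )


# The file format is whatever the source produced; nothing is converted at this stage.
crm_path = landing_path("crm", "customers", date(2022, 4, 5), "json")
erp_path = landing_path("erp", "orders", date(2022, 4, 5), "csv")
print(crm_path)
print(erp_path)
```

With a convention like this, a failed load can be rerun for a single source, entity, and day without touching anything else.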
Creating a raw layer
• Always a single file format
• Parquet
• Delta
• Whatever you choose
• This is where data is organized for consistency
• Analysts should be using this to explore data and do preliminary analytics
• Dirty
• Highly relational
• Difficult to use
• Should have a metadata layer
• Can use T-SQL, Spark SQL, or Python to interact
• Can save results in notebooks, use source control, etc
Different ways of creating pipelines in Synapse
• SQL
• Python
• Mixed
• Scala
• C#/.NET
• All using the same source control
• same environment
Building star schemas
• Usually build the dimensions first
• Build fact tables second
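The ordering matters because fact rows need the dimension's surrogate keys. The sketch below assumes invented order rows and hypothetical function names, and uses plain Python instead of Spark. Step 1 builds the product dimension; step 2 builds the fact rows by looking up each business key in that dimension.

```python
# Invented raw-layer rows that still carry the business key (ProductCode).
raw_orders = [
    {"OrderNo": 1001, "ProductCode": "BK-R", "ProductName": "Road Bike", "Quantity": 2},
    {"OrderNo": 1002, "ProductCode": "HL-1", "ProductName": "Helmet", "Quantity": 1},
    {"OrderNo": 1003, "ProductCode": "BK-R", "ProductName": "Road Bike", "Quantity": 1},
]


def build_dim_product(orders):
    """Step 1: one dimension row per business key, each with a new surrogate key."""
    dimension = {}
    for order in orders:
        code = order["ProductCode"]
        if code not in dimension:
            dimension[code] = {
                "ProductKey": len(dimension) + 1,
                "ProductCode": code,
                "ProductName": order["ProductName"],
            }
    return dimension


def build_fact_orders(orders, dim_product):
    """Step 2: swap the business key for the surrogate key found in the dimension."""
    return [
        {
            "ProductKey": dim_product[order["ProductCode"]]["ProductKey"],
            "OrderNo": order["OrderNo"],
            "Quantity": order["Quantity"],
        }
        for order in orders
    ]


dim_product = build_dim_product(raw_orders)
fact_orders = build_fact_orders(raw_orders, dim_product)
print(fact_orders)
```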
Deciding the correct location for data marts
Spark Tables
• Great for Power BI refreshes for direct import
• Good performance for querying
• Really cheap
Dedicated SQL Pools
• Most expensive option
• Great for very large data, partitioned data that needs to be very fast
• Pausable compute to save expenses
Azure SQL DB Hyperscale
• Great for large databases that have a lot of data to load and have log contention
• Cheaper than Dedicated SQL Pools
• Same interface as Azure SQL DB
Azure SQL DB MI
• Good if you still need SSIS, SQL Server Agent, Cross-database queries, Linked Servers, etc.
Regular Azure SQL DB
• Good for small, medium data marts
• Can be cheap or expensive
• Cost can be more elastic
Alerting
• First build a class based on dataclasses with JSON
• Needs a requirements file for the spark clusters to work
• Load it in the clusters
Alerting
• Build a connection to Azure Log Analytics
• Put LA secrets in Key Vault
• Configure configuration file with access information
• Load it into the Spark Configuration
Use the class
• Log4J interface
• Call the class methods
• Notice the to_json method, which gets the data into JSON the same way each time
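A minimal sketch of such an alert class, using only the standard library. The class name, its fields, and the hand-written to_json are assumptions for illustration; the class shown in the talk may differ, for example by relying on a third-party package listed in the requirements file. Sorting the keys means every alert serializes with the same shape, which keeps downstream queries simple.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class PipelineAlert:
    """Hypothetical alert record emitted by a pipeline step."""
    pipeline: str
    level: str
    message: str
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self):
        """Serialize every alert the same way: all fields, keys in sorted order."""
        return json.dumps(asdict(self), sort_keys=True)


alert = PipelineAlert(
    pipeline="load_dim_customer",
    level="ERROR",
    message="row count dropped to zero",
    logged_at="2022-04-05T00:00:00+00:00",  # fixed here so the output is repeatable
)
print(alert.to_json())
```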
Power BI to Log Analytics
• Create a query
• Export the query to Power BI
Storing sensitive keys
• Azure Key Vault
• Add it as a linked service
Creating an orchestration pipeline
• Use pipelines (or ADF) for orchestration, not for real work
• Do the real work in notebooks
• Much easier to troubleshoot and debug and support
• Keeps the pipelines as simple as possible
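The split between orchestration and real work can be sketched as follows. This is a plain-Python stand-in, not Synapse pipeline or ADF code: the orchestrate function and the step names are hypothetical, and each step function represents a notebook. The orchestrator only sequences steps and reports where a failure happened; it contains no transformation logic of its own.

```python
def orchestrate(steps):
    """Run named steps in order; stop at the first failure and report which step failed."""
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as error:
            return {
                "status": "failed",
                "failed_step": name,
                "error": str(error),
                "completed": completed,
            }
        completed.append(name)
    return {"status": "succeeded", "completed": completed}


# Stand-ins for notebooks; the real work would live inside them.
def load_raw():
    pass


def build_dimensions():
    raise ValueError("DimCustomer source is empty")


def build_facts():
    pass


result = orchestrate([
    ("load_raw", load_raw),
    ("build_dimensions", build_dimensions),
    ("build_facts", build_facts),
])
print(result)
```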
Loading Power BI
• Can load Spark tables using just the SQL Server connection
• Power BI data models should be simple
• Fight for simplicity
• Most cleaning and data prep should be in Synapse Pipelines
GitHub Integration
• Synapse connects to GitHub or ADO
• Branching/Merging
• PRs
• Commit history
• Rollbacks
• Merge conflict resolution
• All Synapse can be deployed as code
• Infrastructure
• Code Pipelines
• Great for promoting to test/production
Conclusion
• yes, we still make star schemas
• yes, we still use slowly-changing dimensions
• yes, we still use cubes
• we need to understand their limitations
• don’t be afraid of files and just-in-time analytics
• don’t conflate alerting and speed with consistency
• consistent reporting should be kept to 50 reports or fewer
• everything else should be de-coupled and flexible so it can change quickly
• we can create analytic systems without SQL Server (but with SQL)
• file-based (parquet)
• we still primarily use SQL as a language
• cheap
• massively parallel
• easily change-able
• distilled using a star schema and data virtualization
APRIL 5-7, 2022 LAS VEGAS, NV
MGM GRAND

More Related Content

What's hot

Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databasesJames Serra
 
Data Warehouse 2.0: Master Techniques for EPM Guys (Powered by ODI)
Data Warehouse 2.0: Master Techniques for EPM Guys (Powered by ODI)Data Warehouse 2.0: Master Techniques for EPM Guys (Powered by ODI)
Data Warehouse 2.0: Master Techniques for EPM Guys (Powered by ODI)Rodrigo Radtke de Souza
 
Gs08 modernize your data platform with sql technologies wash dc
Gs08 modernize your data platform with sql technologies   wash dcGs08 modernize your data platform with sql technologies   wash dc
Gs08 modernize your data platform with sql technologies wash dcBob Ward
 
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Michael Rys
 
Essbase Statistics DW: How to Automatically Administrate Essbase Using ODI
Essbase Statistics DW: How to Automatically Administrate Essbase Using ODIEssbase Statistics DW: How to Automatically Administrate Essbase Using ODI
Essbase Statistics DW: How to Automatically Administrate Essbase Using ODIRodrigo Radtke de Souza
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Amazon Web Services
 
Amazon Athena Hands-On Workshop
Amazon Athena Hands-On WorkshopAmazon Athena Hands-On Workshop
Amazon Athena Hands-On WorkshopDoiT International
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?James Serra
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
Azure DocumentDB 101
Azure DocumentDB 101Azure DocumentDB 101
Azure DocumentDB 101Ike Ellis
 
Near Real-Time Data Analysis With FlyData
Near Real-Time Data Analysis With FlyData Near Real-Time Data Analysis With FlyData
Near Real-Time Data Analysis With FlyData FlyData Inc.
 
Searching Billions of Product Logs in Real Time (Use Case)
Searching Billions of Product Logs in Real Time (Use Case)Searching Billions of Product Logs in Real Time (Use Case)
Searching Billions of Product Logs in Real Time (Use Case)Ryan Tabora
 
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...Carolyn Duby
 
Introducing Microsoft SQL Server 2017
Introducing Microsoft SQL Server 2017Introducing Microsoft SQL Server 2017
Introducing Microsoft SQL Server 2017David J Rosenthal
 
Hitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BIHitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BIAndrew Brust
 
SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?Venu Anuganti
 
Introducing Azure Databases
Introducing Azure DatabasesIntroducing Azure Databases
Introducing Azure DatabasesGrant Fritchey
 
Data Warehousing in the Era of Big Data: Deep Dive into Amazon Redshift
Data Warehousing in the Era of Big Data: Deep Dive into Amazon RedshiftData Warehousing in the Era of Big Data: Deep Dive into Amazon Redshift
Data Warehousing in the Era of Big Data: Deep Dive into Amazon RedshiftAmazon Web Services
 

What's hot (20)

Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
Data Warehouse 2.0: Master Techniques for EPM Guys (Powered by ODI)
Data Warehouse 2.0: Master Techniques for EPM Guys (Powered by ODI)Data Warehouse 2.0: Master Techniques for EPM Guys (Powered by ODI)
Data Warehouse 2.0: Master Techniques for EPM Guys (Powered by ODI)
 
Gs08 modernize your data platform with sql technologies wash dc
Gs08 modernize your data platform with sql technologies   wash dcGs08 modernize your data platform with sql technologies   wash dc
Gs08 modernize your data platform with sql technologies wash dc
 
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
 
Essbase Statistics DW: How to Automatically Administrate Essbase Using ODI
Essbase Statistics DW: How to Automatically Administrate Essbase Using ODIEssbase Statistics DW: How to Automatically Administrate Essbase Using ODI
Essbase Statistics DW: How to Automatically Administrate Essbase Using ODI
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Amazon Athena Hands-On Workshop
Amazon Athena Hands-On WorkshopAmazon Athena Hands-On Workshop
Amazon Athena Hands-On Workshop
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Azure DocumentDB 101
Azure DocumentDB 101Azure DocumentDB 101
Azure DocumentDB 101
 
Near Real-Time Data Analysis With FlyData
Near Real-Time Data Analysis With FlyData Near Real-Time Data Analysis With FlyData
Near Real-Time Data Analysis With FlyData
 
Searching Billions of Product Logs in Real Time (Use Case)
Searching Billions of Product Logs in Real Time (Use Case)Searching Billions of Product Logs in Real Time (Use Case)
Searching Billions of Product Logs in Real Time (Use Case)
 
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
 
Introducing Microsoft SQL Server 2017
Introducing Microsoft SQL Server 2017Introducing Microsoft SQL Server 2017
Introducing Microsoft SQL Server 2017
 
Hitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BIHitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BI
 
SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?
 
Introducing Azure Databases
Introducing Azure DatabasesIntroducing Azure Databases
Introducing Azure Databases
 
Data Warehousing in the Era of Big Data: Deep Dive into Amazon Redshift
Data Warehousing in the Era of Big Data: Deep Dive into Amazon RedshiftData Warehousing in the Era of Big Data: Deep Dive into Amazon Redshift
Data Warehousing in the Era of Big Data: Deep Dive into Amazon Redshift
 

Similar to Build a modern data platform.pptx

Taming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI OptionsTaming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI OptionsKellyn Pot'Vin-Gorman
 
Building better SQL Server Databases
Building better SQL Server DatabasesBuilding better SQL Server Databases
Building better SQL Server DatabasesColdFusionConference
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
ETL for the masses with Power Query and M
ETL for the masses with Power Query and METL for the masses with Power Query and M
ETL for the masses with Power Query and MRégis Baccaro
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Scaling the Web: Databases & NoSQL
Scaling the Web: Databases & NoSQLScaling the Web: Databases & NoSQL
Scaling the Web: Databases & NoSQLRichard Schneeman
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Michael Rys
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxRahul Borate
 
Data Warehouse Best Practices
Data Warehouse Best PracticesData Warehouse Best Practices
Data Warehouse Best PracticesEduardo Castro
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Fwdays
 
Accelerating Business Intelligence Solutions with Microsoft Azure pass
Accelerating Business Intelligence Solutions with Microsoft Azure   passAccelerating Business Intelligence Solutions with Microsoft Azure   pass
Accelerating Business Intelligence Solutions with Microsoft Azure passJason Strate
 
Data Modeling for NoSQL
Data Modeling for NoSQLData Modeling for NoSQL
Data Modeling for NoSQLTony Tam
 

Similar to Build a modern data platform.pptx (20)

Taming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI OptionsTaming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI Options
 
Building better SQL Server Databases
Building better SQL Server DatabasesBuilding better SQL Server Databases
Building better SQL Server Databases
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Rdbms
RdbmsRdbms
Rdbms
 
Taming the shrew Power BI
Taming the shrew Power BITaming the shrew Power BI
Taming the shrew Power BI
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
ETL for the masses with Power Query and M
ETL for the masses with Power Query and METL for the masses with Power Query and M
ETL for the masses with Power Query and M
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Scaling the Web: Databases & NoSQL
Scaling the Web: Databases & NoSQLScaling the Web: Databases & NoSQL
Scaling the Web: Databases & NoSQL
 
Revision
RevisionRevision
Revision
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
 
Scalable web architecture
Scalable web architectureScalable web architecture
Scalable web architecture
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptx
 
Data Warehouse Best Practices
Data Warehouse Best PracticesData Warehouse Best Practices
Data Warehouse Best Practices
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Accelerating Business Intelligence Solutions with Microsoft Azure pass
Accelerating Business Intelligence Solutions with Microsoft Azure   passAccelerating Business Intelligence Solutions with Microsoft Azure   pass
Accelerating Business Intelligence Solutions with Microsoft Azure pass
 
NoSQL.pptx
NoSQL.pptxNoSQL.pptx
NoSQL.pptx
 
Data Modeling for NoSQL
Data Modeling for NoSQLData Modeling for NoSQL
Data Modeling for NoSQL
 

More from Ike Ellis

Storytelling with Data with Power BI
Storytelling with Data with Power BIStorytelling with Data with Power BI
Storytelling with Data with Power BIIke Ellis
 
Storytelling with Data with Power BI.pptx
Storytelling with Data with Power BI.pptxStorytelling with Data with Power BI.pptx
Storytelling with Data with Power BI.pptxIke Ellis
 
Migrate a successful transactional database to azure
Migrate a successful transactional database to azureMigrate a successful transactional database to azure
Migrate a successful transactional database to azureIke Ellis
 
Relational data modeling trends for transactional applications
Relational data modeling trends for transactional applicationsRelational data modeling trends for transactional applications
Relational data modeling trends for transactional applicationsIke Ellis
 
Power bi premium
Power bi premiumPower bi premium
Power bi premiumIke Ellis
 
Move a successful onpremise oltp application to the cloud
Move a successful onpremise oltp application to the cloudMove a successful onpremise oltp application to the cloud
Move a successful onpremise oltp application to the cloudIke Ellis
 
Pass 2018 introduction to dax
Pass 2018 introduction to daxPass 2018 introduction to dax
Pass 2018 introduction to daxIke Ellis
 
Pass the Power BI Exam
Pass the Power BI ExamPass the Power BI Exam
Pass the Power BI ExamIke Ellis
 
Slides for PUG 2018 - DAX CALCULATE
Slides for PUG 2018 - DAX CALCULATESlides for PUG 2018 - DAX CALCULATE
Slides for PUG 2018 - DAX CALCULATEIke Ellis
 
Introduction to DAX
Introduction to DAXIntroduction to DAX
Introduction to DAXIke Ellis
 
14 Habits of Great SQL Developers
14 Habits of Great SQL Developers14 Habits of Great SQL Developers
14 Habits of Great SQL DevelopersIke Ellis
 
14 Habits of Great SQL Developers
14 Habits of Great SQL Developers14 Habits of Great SQL Developers
14 Habits of Great SQL DevelopersIke Ellis
 
Survey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data LandscapeSurvey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data LandscapeIke Ellis
 
11 Goals of High Functioning SQL Developers
11 Goals of High Functioning SQL Developers11 Goals of High Functioning SQL Developers
11 Goals of High Functioning SQL DevelopersIke Ellis
 
SQL PASS BAC - 60 reporting tips in 60 minutes
SQL PASS BAC - 60 reporting tips in 60 minutesSQL PASS BAC - 60 reporting tips in 60 minutes
SQL PASS BAC - 60 reporting tips in 60 minutesIke Ellis
 
Tips & Tricks SQL in the City Seattle 2014
Tips & Tricks SQL in the City Seattle 2014Tips & Tricks SQL in the City Seattle 2014
Tips & Tricks SQL in the City Seattle 2014Ike Ellis
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerIke Ellis
 
SQL Pass Architecture SQL Tips & Tricks
SQL Pass Architecture SQL Tips & TricksSQL Pass Architecture SQL Tips & Tricks
SQL Pass Architecture SQL Tips & TricksIke Ellis
 
Continuous integration sql in the city
Continuous integration sql in the cityContinuous integration sql in the city
Continuous integration sql in the cityIke Ellis
 
SQL Server Tips & Tricks
SQL Server Tips & TricksSQL Server Tips & Tricks
SQL Server Tips & TricksIke Ellis
 

Build a modern data platform.pptx

  • 1. Building a Data Platform for Analytics in Azure Ike Ellis Solliance
  • 2. Ike Ellis General Manager – Data & AI Practice Solliance /ikeellis @ike_ellis www.ikeellis.com youtube.com/IkeEllisOnTheMic • Founder of San Diego Power BI UserGroup • Founder of the San Diego Software Architecture Group • Co-chair of San Diego Data Engineering Meetup • MVP since 2011 • Author of Developing Azure Solutions, Power BI MVP Book • Speaker at PASS Summit, SQLBits, DevIntersections, TechEd, Craft, Microsoft Data & AI Conference
  • 3. © Microsoft Azure + AI Conference All rights reserved. DEVSpring22 DEVSpring22
  • 4. Agenda • What you have likely built already • The need for star schemas • Why SQL Server makes a poor data platform • How the cloud solves this • The need for files • The need for python or code-based pipelines • The elements of a successful data platform • Data pipelines to raw • Different data sources • Creating star-schema based data marts • Alerting • Storing sensitive data • Deciding where the star schema should be • Ingesting into Power BI
  • 5. Research Azure data architecture You’re likely to see this Looks scary
  • 6. At the end of this lecture, I’ll show you how to build this Most of the components you need to get started and be successful organizing data for Power BI
  • 7. Common Enterprise Data Architecture (EDA). Diagram: several sources feed staging, then an ODS and/or a data warehouse, with an ETL step between each stage.
  • 8. Staging • Relational database that looks like the source systems • Usually keeps about 90 days worth of data or some increment of data • Used so that if etl fails, we don’t have to go back to the source system
  • 9. ODS • Operational Data Store • Totally normalized • Used to mimic the source systems, but consolidate them into one database • three databases into one database • Might have light cleaning • Often used for data brokerage • Might have 300 – 500 tables
  • 10. Kimball dimensional modeling • Business questions focus on measures that are aggregated by business dimensions • Measures are facts about the business (nouns and strong words) • Dimensions are ways in which the measures can be aggregated: • pay attention to the “by” keyword • examples: • sales revenue by salesperson • profit by product line • order quantity by year • Diagram labels: Product Line, Salesperson, Time (dimensions); Quantity, Revenue, Profit (measures)
  • 11. Star schemas • Group related dimensions into dimension tables • Group related measures into fact tables • Relate fact tables to dimension tables by using foreign keys • Example tables: DimSalesPerson (SalesPersonKey, SalesPersonName, StoreName, StoreCity, StoreRegion); DimProduct (ProductKey, ProductName, ProductLine, SupplierName); DimCustomer (CustomerKey, CustomerName, City, Region); FactOrders (CustomerKey, SalesPersonKey, ProductKey, ShippingAgentKey, TimeKey, OrderNo, LineItemNo, Quantity, Revenue, Cost, Profit); DimDate (DateKey, Year, Quarter, Month, Day); DimShippingAgent (ShippingAgentKey, ShippingAgentName)
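The star schema on slide 11 can be tried out without any Azure service. The following is a minimal sketch using Python's built-in `sqlite3` module, assuming a cut-down version of the slide's tables (only `DimProduct`, `DimSalesPerson`, and `FactOrders`) and made-up sample rows. It answers one "measure by dimension" question: revenue by product line.

```python
import sqlite3

# In-memory database; table and column names follow slide 11,
# the sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimProduct (
    ProductKey  INTEGER PRIMARY KEY,
    ProductName TEXT,
    ProductLine TEXT
);
CREATE TABLE DimSalesPerson (
    SalesPersonKey  INTEGER PRIMARY KEY,
    SalesPersonName TEXT
);
CREATE TABLE FactOrders (
    ProductKey     INTEGER REFERENCES DimProduct(ProductKey),
    SalesPersonKey INTEGER REFERENCES DimSalesPerson(SalesPersonKey),
    Quantity       INTEGER,
    Revenue        REAL
);
INSERT INTO DimProduct VALUES
    (1, 'Road Bike', 'Bikes'),
    (2, 'Helmet', 'Accessories'),
    (3, 'Mountain Bike', 'Bikes');
INSERT INTO DimSalesPerson VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO FactOrders VALUES
    (1, 1, 2, 2000.0),
    (2, 1, 5, 250.0),
    (3, 2, 1, 1500.0);
""")


def revenue_by_product_line(connection):
    """Aggregate the Revenue measure by the ProductLine dimension attribute."""
    rows = connection.execute("""
        SELECT p.ProductLine, SUM(f.Revenue)
        FROM FactOrders AS f
        JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
        GROUP BY p.ProductLine
        ORDER BY p.ProductLine
    """).fetchall()
    return dict(rows)


result = revenue_by_product_line(conn)
print(result)
```

Every question of the form "measure by dimension" follows the same pattern: one join from the fact table to a dimension table, then a `GROUP BY` on a dimension attribute.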
  • 12. Considerations for dimension tables • Denormalization: • dimension tables are usually “wide” • wide tables are often faster when there are fewer joins • duplicated values are usually preferable to joins • Keys: • create new surrogate keys to identify each row: • simple integer keys provide the best performance • retain original business keys • Example: DimSalesPerson with SalesPersonKey (surrogate key), EmployeeNo (business key), SalesPersonName, StoreName, StoreCity, StoreRegion (denormalized; no separate store table)
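The key guidance on slide 12 (new integer surrogate keys, original business keys retained, store attributes kept on the same wide row) might look like the sketch below. The function name `build_dim_salesperson` and the sample rows are hypothetical; only the column names come from the slide.

```python
from itertools import count


def build_dim_salesperson(source_rows):
    """Return one dimension row per business key, each with a new surrogate key."""
    next_key = count(start=1)  # simple integer surrogate keys
    dimension = {}
    for row in source_rows:
        employee_no = row["EmployeeNo"]  # business key from the source system
        if employee_no not in dimension:
            dimension[employee_no] = {
                "SalesPersonKey": next(next_key),  # surrogate key
                "EmployeeNo": employee_no,         # business key is retained
                "SalesPersonName": row["SalesPersonName"],
                # Store attributes stay on the same row (denormalized).
                "StoreName": row["StoreName"],
                "StoreCity": row["StoreCity"],
            }
    return list(dimension.values())


# Invented source rows; E100 appears twice and must yield a single dimension row.
source_rows = [
    {"EmployeeNo": "E100", "SalesPersonName": "Ana", "StoreName": "Downtown", "StoreCity": "San Diego"},
    {"EmployeeNo": "E200", "SalesPersonName": "Ben", "StoreName": "Downtown", "StoreCity": "San Diego"},
    {"EmployeeNo": "E100", "SalesPersonName": "Ana", "StoreName": "Downtown", "StoreCity": "San Diego"},
]
dim_salesperson = build_dim_salesperson(source_rows)
```

The store name and city are repeated on both rows on purpose: the duplicated values remove the need for a join to a separate store table.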
  • 13. Considerations for fact tables • Grain: • use the lowest level of detail that relates to all dimensions • create multiple fact tables if multiple grains are required • Keys: • the primary key is usually a composite key that includes dimension foreign keys • Measures: • additive: measures that can be aggregated across all dimensions • nonadditive: measures that cannot be aggregated • semi-additive: measures that can be aggregated across some dimensions, but not others • Example tables: FactOrders, grain = order line item (CustomerKey, SalesPersonKey, ProductKey, TimeKey, OrderNo, LineItemNo, PaymentMethod, Quantity, Revenue, Cost, Profit, Margin); FactAccountTransaction (CustomerKey, BranchKey, AccountTypeKey, AccountNo, CreditDebitAmount, AccountBalance) • Diagram labels: Additive, Nonadditive, Semi-additive, Degenerate Dimensions
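The difference between additive and semi-additive measures on slide 13 is easiest to see with numbers. The sketch below assumes a tiny, invented set of account transactions: a transaction amount can be summed across every dimension, while an account balance can be summed across accounts but not across time.

```python
# Invented sample rows, loosely shaped like FactAccountTransaction.
transactions = [
    # (AccountNo, Day, CreditDebitAmount, AccountBalance after the transaction)
    ("A", 1, 100.0, 100.0),
    ("A", 2, -30.0, 70.0),
    ("B", 1, 50.0, 50.0),
    ("B", 3, 25.0, 75.0),
]


def total_movement(rows):
    """Additive measure: summing amounts over accounts and days is meaningful."""
    return sum(amount for _, _, amount, _ in rows)


def closing_balance(rows):
    """Semi-additive measure: take the latest balance per account, then sum accounts."""
    latest = {}
    for account, _, _, balance in sorted(rows, key=lambda r: r[1]):
        latest[account] = balance  # later days overwrite earlier ones
    return sum(latest.values())


def naive_balance_sum(rows):
    """Summing balances across time double-counts money and gives a wrong total."""
    return sum(balance for _, _, _, balance in rows)


print(total_movement(transactions))     # 145.0
print(closing_balance(transactions))    # 145.0
print(naive_balance_sum(transactions))  # 295.0
```

A ratio such as Margin is nonadditive for a similar reason: it has to be recomputed from the summed Revenue and Cost rather than summed or averaged directly.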
  • 14. Reasons to make a star schema • Easy to use and understand • One version of the truth • Easy to create aggregations in a single pass over the data • Much smaller table count (12 – 25 tables) • Faster queries • Good place to feed cubes (either Azure Analysis Services or Power BI shared datasets) • Supported by many business intelligence tools (Excel pivot tables, Power BI, Tableau, etc.) • What I always say: • “You can either start out by making a star schema, or you can one day wish you did. Those are the two choices.”
  • 15. Weaknesses with traditional data platforms
  • 16. Weakness: let’s add a single column. Diagram: the source, staging, ODS, and data warehouse architecture from slide 7, with nine numbered places marked for change (x9!!!!)
  • 17. so many of you have decided to just go directly to the source! source power query
  • 18. Mayhem. Diagram: many separate “source to Power Query” connections. • business logic is spread out • when something changes, you have to change a ton of places • inconsistent • repeatedly cleaning the same data
  • 19. and i’ve seen these data models
  • 20. Weakness: SQL Server is bad for every stage of a large data platform • SQL Server actually writes data multiple times on the insert • one write for the log (traverse the log structure) • and then writing for the disk subsystem • one write to the MDF • and then writing for the disk subsystem • writing to maintain indexing • and then writing for the disk subsystem • SQL Server is strongly consistent • the write isn’t successful until all the tables and indexes represent the consistent write and all the triggers fire • SQL Server is expensive • you get charged based on the number of cores you use • To add columns, you have to provide a base value • Large databases are difficult when you make a dev environment
  • 22. Don’t be afraid of files • files are fast! • files are flexible • new files can have a new data structure without changing the old data structure • files only write the data through one data structure • files can be indexed • file storage is cheap • files can have high data integrity • files can be unstructured or non-relational • Easily copied The whole idea of an analytical system is that data duplication will speed up aggregations and reporting. Files allow for cheap duplication, which allows us to duplicate more data more frequently.
  • 23. Parquet files • Organizing by column allows for better compression, • The space savings are very noticeable at the scale of a Hadoop cluster. • I/O will be reduced as we can efficiently scan only a subset of the columns while reading the data. • Better compression also reduces the bandwidth required to read the input. • Splittable • Horizontally scalable
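To make the column-oriented argument on slide 23 concrete, here is a toy illustration in plain Python. It is not Parquet's file format or API (Parquet's real encodings are more sophisticated); it only assumes a handful of invented rows and shows two effects: values in one column repeat and compress well, and a query that needs one column can skip the others.

```python
from itertools import groupby

# Invented row-oriented records: (Country, ProductLine, Quantity).
rows = [
    ("US", "Bikes", 10),
    ("US", "Bikes", 12),
    ("US", "Helmets", 3),
    ("CA", "Bikes", 7),
    ("CA", "Bikes", 9),
]


def to_columns(records, names):
    """Pivot row tuples into one list per column."""
    return {name: [record[i] for record in records] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows, ["Country", "ProductLine", "Quantity"])

# Repeated neighbours in a column collapse into a few pairs.
encoded_country = run_length_encode(columns["Country"])
print(encoded_country)  # [('US', 3), ('CA', 2)]

# An aggregate over Quantity touches only that column, not the whole row.
total_quantity = sum(columns["Quantity"])
print(total_quantity)  # 41
```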
  • 24. Basic physical idea of a data lake. Diagram: source, staging, ODS, and several data marts connected by ETL steps.
  • 25. Data virtualization as a concept • data stays in its original place • SQL Server • Azure Blob Storage • Azure Data Lake Storage Gen2 • metadata repository is over the data where it is • data can then be queried and joined in a single location • Spark SQL • PolyBase • Hive • Power BI • SQL Server Big Data Clusters
  • 26. The need for python or code-based pipelines and PySpark • Easier testing • Easier modularity/code re-use • Easier code structuring • Easier code changes • Lazy Execution • Code-based IDEs, like PyCharm, VS Code • Easier integration into deployment pipelines, CI/CD, source control. Better tooling for that • Better code-based standards. Code is pythonic
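Several points on slide 26 (easier testing, modularity, lazy execution) can be shown with a small plain-Python sketch. The function names and the record shape are hypothetical, and the generator here is only an analogy for lazy evaluation, not PySpark's own execution model.

```python
def parse_amount(raw):
    """Small, reusable unit: turn a string such as ' 1,200.50 ' into a float."""
    return round(float(raw.strip().replace(",", "")), 2)


def clean_orders(rows):
    """Lazily yield cleaned rows; no work happens until the result is consumed."""
    for row in rows:
        if not row.get("order_id"):
            continue  # drop rows without a key
        yield {
            "order_id": int(row["order_id"]),
            "amount": parse_amount(row["amount"]),
        }


# Because the steps are ordinary functions, a test is a plain assertion.
sample = [
    {"order_id": "7", "amount": " 1,200.50 "},
    {"order_id": "", "amount": "3"},
]
cleaned = list(clean_orders(sample))
assert cleaned == [{"order_id": 7, "amount": 1200.5}]
```

The same functions can live in a module under source control and be exercised by a test runner in a CI/CD pipeline, which is harder to do with logic embedded in a drag-and-drop tool.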
  • 27. What we will build
  • 28. OK, let’s build the platform • The elements of a successful data platform • Data pipelines to raw • Different data sources • Creating star-schema based data marts • Alerting • Storing sensitive data • Deciding where the star schema should be • Ingesting into Power BI
  • 29. Landing the data • Chaos here • Can land data in JSON, CSV, Parquet, TSV, or any weird format • Could be connecting to APIs, SQL Servers, flat files, NoSQL data stores, other data platforms, etc. • Needs to have great alerting • Create an ADLS container called “Landing” • Store all the various junk in the landing folder, but try to keep some organization • Screenshot captions: could be Parquet, or CSV, or JSON
  • 30. Creating a raw layer • Always a single file format • Parquet • Delta • Whatever you choose • This is where data is organized for consistency • Analysts should be using this to explore data and do preliminary analytics • Dirty • Highly relational • Difficult to use • Should have a metadata layer • Can use T-SQL, Spark SQL, or Python to interact • Can save results in notebooks, use source control, etc
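Slides 29 and 30 describe landing data in whatever format arrives and then converging on a single format in the raw layer. The sketch below shows only the first half of that idea, assuming in-memory text instead of ADLS files and a hypothetical `read_landed` helper: each landed format is parsed into the same list-of-dicts shape, ready to be written out in one chosen format.

```python
import csv
import io
import json


def read_landed(text, fmt):
    """Parse landed text in one of several formats into a list of dict records."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "tsv":
        return list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    if fmt == "json":
        return json.loads(text)
    # Unknown formats should fail loudly so alerting can pick them up.
    raise ValueError(f"unsupported landing format: {fmt}")


# Invented payloads carrying the same record in three formats.
from_csv = read_landed("id,name\n1,Ana\n", "csv")
from_tsv = read_landed("id\tname\n1\tAna\n", "tsv")
from_json = read_landed('[{"id": "1", "name": "Ana"}]', "json")
print(from_csv == from_tsv == from_json)  # True
```

In a real pipeline the records would then be written to the raw layer in the one format chosen for it (for example Parquet or Delta), which needs a library beyond the standard library.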
  • 31. Different ways of creating pipelines in Synapse • SQL • Python • Mixed • Scala • C#/.NET • All using the same source control • same environment
  • 32. Building star schemas • Usually build the dimensions first • Build fact tables second
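The ordering on slide 32 matters because fact rows carry the surrogate keys that the dimension load creates. A minimal sketch, assuming invented order records and hypothetical helper names:

```python
def build_dimension(business_keys):
    """Step 1: assign an integer surrogate key to each distinct business key."""
    return {bk: key for key, bk in enumerate(sorted(set(business_keys)), start=1)}


def build_fact(orders, product_keys):
    """Step 2: swap each business key for the surrogate key created in step 1."""
    return [
        {
            "ProductKey": product_keys[order["product_code"]],
            "Quantity": order["quantity"],
            "Revenue": order["revenue"],
        }
        for order in orders
    ]


# Invented source orders keyed by a product business key.
orders = [
    {"product_code": "HEL", "quantity": 5, "revenue": 250.0},
    {"product_code": "BIK", "quantity": 2, "revenue": 2000.0},
    {"product_code": "BIK", "quantity": 1, "revenue": 1500.0},
]

product_keys = build_dimension(order["product_code"] for order in orders)
fact_orders = build_fact(orders, product_keys)
print(product_keys)  # {'BIK': 1, 'HEL': 2}
```

If the fact load ran first, the lookup in `build_fact` would have nothing to resolve against, which is why the dimensions are usually built before the facts.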
  • 33. Deciding the correct location for data marts • Spark Tables: great for Power BI refreshes for direct import; good performance for querying; really cheap • Dedicated SQL Pools: most expensive option; great for very large data, partitioned data that needs to be very fast; pausable compute to save expenses • Azure SQL DB Hyperscale: great for large databases that have a lot of data to load and have log contention; cheaper than Dedicated SQL Pools; same interface as Azure SQL DB • Azure SQL DB MI: good if you still need SSIS, SQL Server Agent, cross-database queries, Linked Servers, etc. • Regular Azure SQL DB: good for small, medium data marts; can be cheap or expensive; cost can be more elastic
  • 34. Alerting • First build a class based on dataclasses with JSON • Needs a requirements file for the spark clusters to work • Load it in the clusters
  • 35. Alerting • Build a connection to Azure Log Analytics • Put LA secrets in Key Vault • Configure configuration file with access information • Load it into the Spark Configuration
  • 36. Use the class • Log4J interface • Call the class methods • Notice the to_json to get the data into json the same way each time
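Slides 34 to 36 describe an alerting class built on dataclasses with a `to_json` method so every message is serialized the same way. The deck's own class is not shown in the transcript, so the following is only a sketch of that idea: the class name `PipelineAlert`, its fields, and the use of Python's standard `logging` module (as a stand-in for the Log4J interface and the Azure Log Analytics connection) are all assumptions.

```python
import json
import logging
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class PipelineAlert:
    """Hypothetical alert record; field names are illustrative only."""
    pipeline: str
    level: str
    message: str
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self):
        # sort_keys keeps the JSON layout identical for every alert.
        return json.dumps(asdict(self), sort_keys=True)


logger = logging.getLogger("data_platform")


def emit(alert):
    """Serialize the alert once and hand it to the logger at the matching level."""
    payload = alert.to_json()
    logger.log(getattr(logging, alert.level, logging.INFO), payload)
    return payload


alert = PipelineAlert(
    pipeline="load_raw",
    level="ERROR",
    message="row count mismatch",
    logged_at="2022-04-05T00:00:00+00:00",
)
payload = emit(alert)
```

Because the payload is always built by `to_json`, a downstream query over the logs can rely on the same keys appearing in every record.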
  • 37. Power BI to Log Analytics • Create a query • Export the query to Power BI
  • 38. Storing sensitive keys • Azure Key Vault • Add it as a linked service
  • 39. Storing sensitive keys • Azure Key Vault • Add it as a linked service
  • 40. Creating an orchestration pipeline • Use pipelines (or ADF) for orchestration, not for real work • Do the real work in notebooks • Much easier to troubleshoot and debug and support • Keeps the pipelines as simple as possible
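The split on slide 40 between orchestration and real work can be sketched in plain Python. This is an analogy rather than Synapse or ADF code: the hypothetical `run_pipeline` function only sequences steps and reports which one failed, while each step (standing in for a notebook) contains the actual logic.

```python
def run_pipeline(steps):
    """Run named steps in order; the orchestrator holds no business logic."""
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            # Name the failing step so troubleshooting starts in the right place.
            raise RuntimeError(f"step '{name}' failed") from exc
        completed.append(name)
    return completed


# Stand-ins for notebooks; each one does its own work.
log = []
steps = [
    ("land", lambda: log.append("landed")),
    ("raw", lambda: log.append("raw built")),
]
completed = run_pipeline(steps)
print(completed)  # ['land', 'raw']
```

Keeping the orchestrator this thin means a failure always points at one named step, which can then be debugged on its own.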
  • 41. Loading Power BI • Can load Spark tables through just the SQL Server connection • Power BI data models should be simple • Fight for simplicity • Most cleaning and data prep should be in Synapse Pipelines
  • 42. GitHub Integration • Synapse connects to GitHub or ADO • Branching/Merging • PRs • Commit history • Rollbacks • Merge conflict resolution • All of Synapse can be deployed as code • Infrastructure • Code Pipelines • Great for promoting to test/production
  • 43. conclusion • yes, we still make star schemas • yes, we still use slowly-changing dimensions • yes, we still use cubes • we need to understand their limitations • don’t be afraid of files and just in time analytics • don’t conflate alerting and speed with consistency • consistent reporting should be kept to 50 reports or less • everything else should be de-coupled and flexible so they can change quickly • we can create analytic systems without SQL Server (but with SQL) • file-based (parquet) • we still primarily use SQL as a language • cheap • massively parallel • easily change-able • distilled using a star schema and data virtualization
  • 45. APRIL 5-7, 2022 LAS VEGAS, NV MGM GRAND FOR INFORMATION ABOUT OUR NEXT IN PERSON EVENT, VISIT OUR WEBSITE AT