Azure BI Cloud
Architectural Guidelines
Pedro Bonillo, Ph.D.
Executive summary
This document is intended to provide guidelines for building architectures on cloud BI projects.
Considerations
To define an architecture for your project, we suggest you look at these criteria:
- Source: where your data is located
- ETL complexity: what kind of business rules and transformations you need to support
- Data volumes: the sheer size of the data
- Model complexity: the business problem you are representing and the kind of KPIs you'll need to support
- Sharing needs: whether the data is only used for this project or if it needs to integrate with core data assets
- Reporting demand: the expected rendering speed, and for how many users
Templates
We’ve defined four templates based on common needs patterns which can be reused as-is or slightly modified to suit your particular case.
- Hulk: when you need pure muscle-power (ex: CDB Reporting, Datalens)
- Thor: for simple reporting over large data (ex: Radarly, Digital Dashboard)
- Iron Man: for complex reporting over light data (ex: Budgit)
- Hawkeye: quick and agile projects for POCs or short-lived needs (ex: Weezevent)
Architectural considerations
Sources
Cloud data sources are simple to capture. On-premises data sources may require a gateway (or Integration Runtime), a push from the local infrastructure, or VPN access linking cloud resources to local networks.
Data volumes
Small data volumes can generally be processed in memory all at once and fit within Power BI's 1 GB dataset limit. Medium data volumes can be processed on a single machine, whereas large data volumes require cluster-based, parallelized processing.
Data interests
Local data interests can be managed in a fully autonomous way, isolated from other projects and stakeholders. Global data interests intend for their results to be reused by other projects and teams. As such, these projects have more complex integration phases and require more advanced security features to manage them.
The criteria you should consider when planning an architecture
Architectural considerations
ETL complexity
Simple ETL involves only light transformations and data type casting. Medium ETL transforms an incoming landing model into a fully-fledged star schema. Complex ETL involves proactive data quality management, advanced dimensional models and/or intricate business rules.
Model complexity
Simple models use additive measures over a single star schema or a flat dataset. Medium models include advanced DAX with semi-additive measures and/or calculations over multiple star schemas. Complex models require performance-hindering features such as row-level security, bi-directional cross-filtering or very advanced DAX calculations.
Reporting demand
Low demand means it is acceptable to have longer response times (5-15 s). Medium demand requires snappy response times (<100 ms) for a small number of concurrent users. High demand involves snappy response times for a large number of concurrent users.
Functional phases
An architecture is divided into functional workloads. A single technology can support multiple workloads, and a single workload can sometimes be shared between different technologies.
- Sources: where the original data lives
- Transporting: what moves the data from the source to the platform
- Orchestrating: what coordinates the different services
- Processing: what cleans and transforms the data from its raw state to its usable form
- Storage: where data lives in its cold form
- Live calculation: where reporting calculations are made for the end users
- Data access: how the end users access the data
Sources
Sources come in two main categories: cloud sources and on-premises sources. Generally speaking, cloud sources are relatively simple to manage, whereas on-prems sources have to deal with the added complexity of networking.
Cloud: in this category we find object storage like AWS S3 or Azure Blob, API calls, and user documents (ex: Excel files) stored on SharePoint Online.
- Object storage is straightforward and is handled with an ID/Secret mechanism.
- API calls can be a bit more complex, especially depending on the authentication mechanism, but often offer a
good amount of flexibility in what is returned. They are often capped in terms of data size per call and require
more custom logic to handle.
- Documents stored on SPO allow users to give direct input in the solution but come with the perils of poorly formed
Excel files. Whenever possible, we recommend capturing user inputs through a small web application or a
PowerApp.
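To illustrate the ID/Secret mechanism used for cloud object storage above, here is a minimal Python sketch that reads one object from AWS S3 with boto3; the bucket, key and credentials are hypothetical placeholders, and the same pattern applies to Azure Blob with an account name and key.

```python
import boto3

# Hypothetical credentials and object location, for illustration only.
s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",        # the "ID"
    aws_secret_access_key="<secret>",   # the "Secret"
    region_name="eu-west-1",
)

# Download one landed file into memory (small objects only).
response = s3.get_object(Bucket="my-landing-bucket", Key="sales/2020/01/sales.csv")
raw_bytes = response["Body"].read()
print(f"Fetched {len(raw_bytes)} bytes")
```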
On-prems: these sources are highly valuable (they often form the core of information systems) but can be tricky to access from cloud services. A few options are available to handle this situation:
- Joining the cloud resource to the internal network through VPN
- Exposing part of the source (or extracts) in a DMZ. This may not be possible if the data is sensitive
- Having an on-premises ETL push the data to the cloud rather than having cloud services fetching the data
- Using a gateway like Azure Data Factory’s Integration Runtime to act as a bridge between the on-prems
resources and the cloud service. This tends to be the easiest scenario.
Transportation
Transportation refers to the Extract-and-Load workloads (without the transformations). Throughput, connectivity,
parametrization, and monitoring are the key aspects in choosing the right transportation solution.
Power BI Dataflows: when data volumes are small (less than 1 GB), transformations are simple and the end destination is solely meant to be used in Power BI, its Dataflow/Power Query engine can be used. It supports a large array of connectors and decent parametrization possibilities.
Python: ideally run in a serverless environment (AWS SageMaker, Airflow or Azure Functions), managed code can adapt to a wide variety of data sources, shapes and destinations… provided you have the skill and time to code the E-L solution. This is better suited to small-to-medium data volumes since the code is usually limited to a single machine (and often to a single CPU thread). It is fully DevOps compatible.
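As a minimal sketch of such a managed-code E-L job, assuming a hypothetical paged REST API and an Azure Blob landing container (the endpoint, container and credential are placeholders):

```python
import json
import requests
from azure.storage.blob import BlobServiceClient

API_URL = "https://api.example.com/v1/orders"          # hypothetical endpoint
ACCOUNT_URL = "https://mylanding.blob.core.windows.net"  # hypothetical storage account

blob_service = BlobServiceClient(account_url=ACCOUNT_URL, credential="<account-key>")
container = blob_service.get_container_client("landing")

page, rows = 1, []
while True:
    # Extract: pull one page at a time; most APIs cap the payload per call.
    resp = requests.get(API_URL, params={"page": page, "page_size": 500}, timeout=60)
    resp.raise_for_status()
    batch = resp.json()
    if not batch:
        break
    rows.extend(batch)
    page += 1

# Load: land the raw payload as-is; transformations happen later, in the processing layer.
container.upload_blob(name="orders/orders_dump.json",
                      data=json.dumps(rows), overwrite=True)
```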
Azure Data Factory: the go-to solution for E-L workloads in Azure, ADF is capable of handling large workloads with excellent throughput (especially when landing to ADLS) and is fully DevOps compatible. ADF's main downside is that while it can read from many different sources, it typically writes only to Microsoft destinations (with a few exceptions).
Our verdict:
- Power BI Dataflows: for self-service projects
- Python: your can opener for complex files and API calls
- Azure Data Factory: the go-to solution, even more so with on-prems sources
Processing
The processing layer is where data quality and business logic are applied. Some projects require very light transformations, whereas others completely change the model and apply complex business logic.
PySpark: a managed-code framework that can scale very well to big data scenarios. Spark-based solutions can be easily implemented in a PaaS format through Databricks, with fully managed notebooks, simple industrialization of code and good monitoring capabilities. For maintainability and support, we recommend using PySpark as the main Spark language.
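A minimal PySpark sketch of such a transformation (deduplication, type casting and a simple business rule), with hypothetical paths and column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_etl").getOrCreate()

# Read the landed raw data (hypothetical path; on Databricks this would be an abfss:// mount).
raw = spark.read.option("header", True).csv("/mnt/landing/orders/*.csv")

fact_orders = (
    raw.dropDuplicates(["order_id"])                       # basic data quality
       .withColumn("order_date", F.to_date("order_date"))  # type casting
       .withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount") > 0)                        # simple business rule
       .select("order_id", "customer_id", "order_date", "amount")
)

# Write the cleaned fact table in a columnar format for the downstream layers.
fact_orders.write.mode("overwrite").parquet("/mnt/curated/fact_orders")
```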
SQL Procedures: transformations can take place within the database itself, limiting data movement. Azure SQL DB supports a DevOps-compatible, fully-fledged language (T-SQL) and is suitable for small-to-medium data volumes. Snowflake's SQL programmatic objects are less developed, but the platform can handle very large data volumes. Azure SQL DWH offers big-data levels of volume using SQL stored procedures but needs to be managed more actively (manual cluster starting/scaling/stopping).
Azure Data Factory: ADF offers a GUI-based dataflow engine powered by Spark. It can handle very large data volumes but may be somewhat more limited for very complex transformations. Simple to medium ETL complexity can be handled without problems.
Python: Python's processing role is similar to its transportation workload: very flexible, but potentially longer to develop and best used for smaller data volumes.
PBI Dataflow: Power BI offers a simple GUI-based (codeless) interface for light ETL workloads. It can be used to develop simple transformations over low data volumes very quickly and has a large array of source connectors. Its output is limited to Power BI and the Common Data Model in ADLS.
Our verdict:
- PBI Dataflow: for self-service projects
- Azure Data Factory: for point-and-click ETL in the cloud
- Databricks (PySpark): for big data projects
- Python: for complex flows and ML
- SQL Procedures: for SQL professionals
ETL key feature comparisons
Here are the key differentiators for typical ETL solutions.
[Comparison matrix of PBI Dataflow, PySpark (Databricks), ADF + Dataflow, ADF + SQL Procs (Azure SQL DB) and Python (Airflow), rated on: cloud sources, on-prems sources, handling of semi-structured data, handling of Excel files, destinations, data volume, transformation capabilities, machine learning capabilities, CI/CD capabilities, alerting and monitoring, and ease of development.]
Storage
The storage layer is where cold data rests. Its main concerns are data throughput (in and out), data management capabilities (RLS, data masking, Active Directory authentication, etc.) and DevOps compatibility.
Azure Data Lake Storage: ADLS is an object-based storage system with an HDFS-compatible interface. It has excellent throughput but is limited to file-level security (not row-level or column-level). As such, it is best used for massive import/export or where 100% of the file needs to be read.
Snowflake: soon to be available globally on Azure, Snowflake is a true data warehouse solution in the cloud. As a pure storage layer, it doesn't have the same data management or DevOps capabilities as Azure SQL DB, but it supports an impressive per-project compute costing model over the same data. Its throughput is similar to Azure SQL DWH.
Azure SQL DB: Azure's main fully-featured DBMS, SQL DB has excellent data management capabilities and Active Directory support. It supports a declarative development approach which offers many DevOps opportunities. However, its throughput is not great, and massive data volumes should be loaded using an incremental approach for best performance. When hosting medium data volumes or more, consider an S3 service tier or higher to access advanced features like columnstore storage.
Azure SQL DWH: a massively parallel processing (MPP) version of Azure SQL DB. By default, data is distributed across 60 shards, themselves spread across 1 to 60 nodes, which offers very high throughput when required. Azure SQL DWH supports programmatic objects such as stored procedures but has a slightly different writing style from Azure SQL DB in order to take advantage of its MPP capabilities.
Best for:
- ADLS: landing native files
- Azure SQL DB: small structured data
- Azure SQL DWH: big data ETL
- Snowflake: big data ad hoc use
Database key feature comparisons
The cloud now offers multiple solutions for hosting relational data. While these have more or less feature parity on all core functionalities (they can all perform relatively well), key differences do exist between them.
[Comparison matrix of Snowflake, Azure SQL DB and Azure SQL DWH, rated on: scale time, compute/storage isolation, semi-structured data, PBI integration, Azure Active Directory integration, DevOps & CI/CD support, temporal tables, data cloning, DB programming, cost when used as ETL, cost when used as reporting engine, and ease of budget forecasting.]
Live calculation
Reporting calculation engines perform the real-time data crunching when users consume reports. This tends to be a high-demand, 24/7 service due to the group's international nature. Model complexity and reporting demand are the key drivers when choosing an appropriate technology for this layer.
Power BI models: if the data volumes are small (less than 1 GB of compressed data), Power BI native models, especially when used on a Premium capacity, give the most fully-featured capabilities for this workload. Their only real drawback is the lack of a real developer experience, such as source control or CI/CD capabilities. Premium workload data size limits will soon be increased to 10 GB… for a fee.
Composite Models: composite models allow Power BI to natively keep some of its data (like dimensions or aggregated tables) in-memory and the rest in DirectQuery mode. This reduces the need for computation at the reporting layer. It may not be adequate for complex models since the DAX used by DirectQuery still has limitations, and performance can be uneven depending on whether a query can hit an in-memory PBI table or needs to fall back to DirectQuery. They are best suited for dashboarding scenarios (limited interactivity) over large data.
Azure Analysis Services: AAS is the scale-out version of Power BI native models. It can perform the same complex calculations over very large datasets at a cost-efficient level (compared to Power BI Premium). AAS has a slower release cycle than Power BI and thus tends to lack the latest features supported by PBI. It does, however, support a real developer experience with source control and development environments.
DirectQuery: often misunderstood, this mode makes every visualization on a report send one (or many) SQL queries to the underlying source for every change on the page (e.g. a filter applied). Performance will be slower than in-memory (although it may still be acceptable depending on data volume and back-end compute power). Other issues include DAX limitations and some hard limits on the data volume being returned. For these reasons, limit DirectQuery to exploratory scenarios, near-time scenarios or dashboarding scenarios with limited complexity and interactivity. Depending on the source, there may also be significant pressure on data gateways.
Our verdict:
- Power BI models: when it fits!
- Azure Analysis Services: when it doesn't!
- Composite models: simple reports over big data
- DirectQuery: near-time, dashboards and exploration
Live calculation feature comparisons
This workload comparison is somewhat more complicated because of the relationship between data volumes and calculation complexity. Regardless, here are some general trends we can observe.
[Comparison matrix of DirectQuery (on Snowflake), Composite (on Snowflake), AAS and PBI models (non-Premium), rated on: cost, volume, model and KPI complexity, refreshing, CI/CD, row-level security, data mashing, and calculation speed.]
Data Access
Data access refers to how users are able to get to the data. This can be done through the options described below.
Power BI embedded: preconfigured reports and dashboards for a topic are accessible key-in-hand from the data portal.
Direct access – data models: if there is a need to create a local version of the report and enhance the model with additional KPIs, it is possible to connect directly to the data model through Excel or Power BI. This allows the report maker to start from pre-validated dimensional data and KPIs and focus on their own additional KPIs and visuals. This method, however, doesn't allow the report maker to mash additional data into the current model.
Direct access – curated data: if a BI developer wants to create their own model, it is possible to access the curated dimensional data directly from the Core DB. This significantly lowers the cost of a project by diminishing ETL costs and tapping into pre-validated dimensional data. KPIs and any additional data will still need to be developed and tested before use.
Direct access – data lake: a BI developer or a data scientist may wish to have access to raw data files for their own project. This may or may not be possible depending on security needs, on a per-dataset basis.
Where to connect: what you get depends on where you connect.
- Data lake (datasets for data science): raw data
- Core DB (import for your own dataware): curated dimensional data, business rules applied and tested, ETL already done, one location for multiple data sources
- Analysis Services (build your own reports): KPIs built and tested, a calculation engine for self-service
- Power BI (key-in-hand reports): data visualization based on common needs
Orchestrating
Orchestration is a key feature of a cloud architecture. Not only does it need to launch and manage jobs, it should also be able to interact with the cloud fabric (resource scaling, creation/deletion, etc.). Key features include native connectivity to various APIs, programmatic flows (ifs, loops, branches, error handling, etc.), trigger-based launching, DevOps compatibility and GUIs for development and monitoring.
Logic Apps: the de facto orchestration tool in the Azure stack, it includes all the key features required in such a tool. It offers a GUI-based experience that can be scaled up to a full DevOps pipeline.
Azure Data Factory: ADF includes its own simple scheduling and orchestration tool. While not as developed as Logic Apps, it can be a valid choice when the orchestration is limited to simple scheduling, core data activities (ADF pipelines, Databricks jobs, SQL procs, etc.) and basic REST calls.
Airflow: a code-oriented scheduler, allowing people to create complex workflows with many tasks. It is mainly used to execute Python or Spark tasks. It is seamlessly integrated into the Data Portal, so it is a good choice if you keep your data in the data lake and want a single entry point to monitor both your data and your processes.
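As a minimal sketch of an Airflow DAG for such a flow (the DAG name, schedule and task callables are hypothetical):

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def land_raw_data(**context):
    # Hypothetical extract-and-load step (e.g. API or S3 to the data lake).
    pass

def run_spark_transform(**context):
    # Hypothetical trigger of a Databricks/Spark job for the processing layer.
    pass

with DAG(
    dag_id="bi_nightly_load",                      # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",                 # nightly at 02:00
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    land = PythonOperator(task_id="land_raw_data", python_callable=land_raw_data)
    transform = PythonOperator(task_id="spark_transform", python_callable=run_spark_transform)
    land >> transform   # transform runs only after landing succeeds
```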
Our verdict:
- Logic Apps: the go-to solution in Azure
- Azure Data Factory: simple scheduling needs with ADF
- Airflow: data science projects in AWS
Feature comparisons for orchestration
Here are the key differentiators for orchestration solutions.
[Comparison matrix of ADF, Airflow and Logic Apps, rated on: CI/CD, scheduling and triggering, native connectors, debugging, ease of development, alerting & monitoring, parameterization, control flow, interfacing with on-prems assets, and secrets management.]
Architecture templates
Whilst not the only considerations to take into account, architectures can be broadly segmented by the volume of data they handle and the complexity of the ETL and model they must support. Based on this, we've defined template architectures to guide you through the design process.
- Hulk (when you need pure muscle-power; ex: CDB Reporting, Datalens): any sources, large data volumes, global interest; complex ETL, complex model, high demand
- Thor (simple reporting over large data; ex: Radarly, Digital Dashboard): any sources, large data volumes, global interest; complex ETL, simple model, medium demand
- Iron Man (complex reporting over light data; ex: Budgit): any sources, small data volumes, local interest; complex ETL, complex model, medium demand
- Hawkeye (quick and agile project for POCs or short-lived needs; ex: Weezevent): any sources, small data volumes, local interest; simple ETL, medium model, medium demand
Hulk
When you need pure muscle-power
Best used for: any sources, large data volumes, global interest; complex ETL, complex model, high demand.
- Sources: any
- Transporting: ADF
- Orchestrating: Logic Apps
- Processing: Databricks (pySpark) and/or ADF (dataflow)
- Storage: ADLS, Core DB (shared resource), project data mart
- Live calculation: AAS
- Data access: PBI Embedded
Step 1
Data is landed from S3 to ADLS Gen2 via an ADF pipeline. This ensures a fast, bottleneck-free landing phase. Due to the volume, an incremental loading approach is highly recommended to limit the impact on an on-prems IR gateway and the throughput to the SQL DB. ADF scheduling could have been used in simple scenarios; however, for uniformity's sake, and to benefit from extra alerting capabilities, Logic Apps is preferred as the overall scheduler.
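The landing itself is done by ADF; purely to illustrate the incremental (high-water-mark) pattern that such a pipeline implements, here is a Python sketch in which the bucket, container and watermark blob are hypothetical placeholders.

```python
from datetime import datetime

import boto3
from azure.storage.blob import BlobServiceClient

s3 = boto3.client("s3")  # credentials assumed to come from the environment
adls = BlobServiceClient(
    account_url="https://mydatalake.blob.core.windows.net", credential="<key>"
).get_container_client("landing")

# High-water mark persisted between runs (kept in a tiny blob here, for illustration).
# The stored value is assumed to be a timezone-aware ISO timestamp.
watermark_blob = adls.get_blob_client("control/orders_watermark.txt")
last_loaded = datetime.fromisoformat(watermark_blob.download_blob().readall().decode())

new_watermark = last_loaded
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="source-bucket", Prefix="orders/"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] > last_loaded:
            # Copy only objects modified since the last run.
            body = s3.get_object(Bucket="source-bucket", Key=obj["Key"])["Body"].read()
            adls.upload_blob(name=obj["Key"], data=body, overwrite=True)
            new_watermark = max(new_watermark, obj["LastModified"])

# Persist the new high-water mark for the next run.
watermark_blob.upload_blob(new_watermark.isoformat(), overwrite=True)
```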
Step 2
Databricks or ADF is used to perform the complex ETL over a large dataset by leveraging its Spark SQL engine in Python. It fetches the landed data from ADLS and enriches it with curated data from the Core DB to perform the complex ETL.
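A minimal PySpark sketch of this enrichment step, assuming hypothetical paths, table names and JDBC connection details for the Core DB:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hulk_step2").getOrCreate()

# Landed raw data from ADLS (hypothetical mount path).
orders = spark.read.parquet("/mnt/landing/orders")

# Curated dimension fetched from the Core DB over JDBC (connection details are placeholders).
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://coredb.database.windows.net:1433;database=CoreDB")
    .option("dbtable", "dbo.dim_customer")
    .option("user", "etl_user")
    .option("password", "<secret>")
    .load()
)

# Enrich the raw facts with curated attributes before pushing to the data mart.
enriched = orders.join(customers, on="customer_id", how="left")
enriched.write.mode("overwrite").parquet("/mnt/curated/orders_enriched")
```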
Step 3
Any group data that can be used in an overall curated dimensional model useful for other projects is pushed back to the Core DB. This database is thus enriched, project by project, with easy-to-use, vouched-for datasets. Data that is purely report-specific is pushed to a project data mart.
Step 4
Due to the size of the reporting dataset, the complexity of its model and KPIs, and the expected reporting performance, an AAS cube is used as the main reporting engine. The final report is built on top of this cube and exposed in the Data Portal through Power BI Embedded.
Step 5
Subsidiaries that wish to do so can access the AAS cube to build their own custom reports and/or fetch the dimensional data through the Core DB, using custom views for security purposes.
Thor
Simple reporting over large data
Best used for: any sources, large data volumes, global interest; complex ETL, simple model, medium demand.
- Sources: any
- Transporting: ADF
- Orchestrating: Logic Apps
- Processing: Databricks (pySpark) and/or ADF (dataflow)
- Storage: ADLS, Core DB (shared resource), project data mart
- Live calculation: composite models
- Data access: PBI Embedded
Step 1
The overall architecture resembles a Hulk-like scenario due to the volume of data. Transformations are performed in Databricks or ADF even though the ETL is rather simple, because the data volumes can overwhelm a single-machine architecture and/or the throughput of the target database.
Step 2
Two options are available for reporting calculations. The simplest one is to use an AAS cube. This is potentially more expensive in terms of software, but simpler in development. The alternative is to use PBI models with aggregations, if the KPIs are simple enough to hit the aggregations on a regular basis. However, this complicates the data model (and thus development time), can hurt performance when the query is passed through to the source, and incurs costs on the data mart layer (when using Snowflake, because of its per-query costing model).
Thor - AWS
Simple reporting over large data
Best used for: large data volumes, global interest; complex ETL, simple model, medium demand.
- Sources: cloud sources (Amazon S3, etc.)
- Orchestrating: Airflow
- Transporting and processing: PR Python Operator / PR Spark Operator, Databricks (pySpark)
- Storage: Snowflake
- Live calculation and data access: PBI Embedded (composite models) or Superset
Iron Man
Complex reporting over light data
Best used for: any sources, medium data volumes, local interest; complex ETL, complex model, medium demand.
- Sources: any
- Transporting: ADF (dataflow)
- Orchestrating: Logic Apps
- Processing: SQL procedures
- Storage: project data mart
- Live calculation: PBI model
- Data access: PBI Embedded
Step 1
This architecture is designed for "tactical projects" where data sharing is not paramount and data volumes are low, but which may still require a fair amount of business rules and data cleansing. The low volume means we can write raw data directly to the database without a file-based landing in a data lake.
Step 2
The complex ETL can then be implemented in SQL stored procedures within the database itself. For simplicity's sake, the orchestration in Logic Apps launches the ADF pipeline, and ADF launches the procedures after the landing. This reduces the complexity of the Logic App code needed to handle long-running procedures.
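In this template the procedures are launched by ADF; purely as an illustration of what such a call looks like when scripted, here is a hedged Python sketch using pyodbc, with a hypothetical connection string and procedure name.

```python
import pyodbc

# Hypothetical connection string and procedure name, for illustration only.
conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=ironman-sql.database.windows.net;DATABASE=datamart;"
    "UID=etl_user;PWD=<secret>"
)

# autocommit avoids keeping a transaction open for the whole long-running procedure.
with pyodbc.connect(conn_str, autocommit=True, timeout=30) as conn:
    cursor = conn.cursor()
    cursor.execute("EXEC dbo.usp_apply_business_rules @run_date = ?", "2020-01-01")
```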
Step 3
The low data volumes also mean that the entire dataset can be uploaded to Power BI, where complex KPIs can be calculated. The report is made accessible through the data portal.
Iron Man - AWS
Complex reporting over light data
Best used for: small data volumes, local interest; complex ETL, complex model, medium demand.
- Sources: cloud sources (Amazon S3, etc.)
- Orchestrating: Airflow
- Transporting and processing: PR Python Operator, Databricks (pySpark)
- Storage: Snowflake
- Live calculation and data access: PBI Embedded (import)
Hawkeye
Quick project for POCs or short-lived needs
Best used for: any sources, small data volumes, local interest; simple ETL, medium model, medium demand.
- Sources: any
- Transporting and processing: PBI dataflows
- Orchestrating: PBI scheduling
- Storage: PBI dataflow storage
- Live calculation: PBI model
- Data access: PBI Embedded
Step 1
A fully self-service Power BI stack is possible for small data volumes and low ETL complexity. This architecture should be kept to proofs of concept or temporary projects where time-to-market is paramount and maintainability is not required. While PBI dataflows can handle larger and larger data volumes with Premium capacities, the current price-performance ratio is highly suboptimal on larger data volumes.
Step 2
Power Query and the M language are capable of handling low-to-medium levels of complexity in the ETL. However, they currently lack the life-cycle tooling (version control, automated deployments, etc.) required for professional development.
Step 3
A major roadblock preventing this architecture from being deployed beyond POCs and temporary projects is the current lack of integration with external storage systems. The only current possibilities are tightly integrated with the Common Data Model initiative in ADLS, which has yet to prove its viability beyond Dynamics 365.
Step 4
The calculation and data access are of course done in Power BI directly. Here again, keep data volumes to a minimum. While the Premium capacities can accommodate larger and larger volumes, the price-performance ratio is downright disastrous compared to the alternatives (AAS cubes and composite models).
Hawkeye - AWS
Quick project for POCs or short-lived needs
Best used for: cloud sources, small data volumes, local interest; simple ETL, simple model, medium demand.
- Sources: cloud sources (Amazon S3, etc.)
- Orchestrating: Airflow
- Transporting and processing: PR Python Operator
- Storage: Snowflake
- Live calculation and data access: Superset
Where?
The Core DB is present in the Hulk and Thor templates. It is a single database used by several projects: a shared resource sitting in the storage layer of those architectures, alongside ADLS and the project data marts.
What is Core DB?
Core DB contains global information that can be used for all affiliates and across several projects. It is a central repository that is gradually built from widely used business data (e-commerce, prisma, websites, consumer activities, etc.).
[Diagram: sources feed a data lake and the Core DB (ref/MDM data, data quality, 3rd normal form, conformed dimensions, EDW), which in turn feeds project data marts built as star schemas.]
This widely used pattern allows:
- Consistency across projects
- Better quality and overall data management
- Lower project costs through reuse of validated assets
It includes:
- Common dimensions and referentials: products, entities, contacts, geography, …
- Widely used business events: e-commerce orders, prisma data, website views, consumer activities, …
The Core DB model is business oriented, not source oriented. It has its own IDs to allow cross-source identification: for the same business item (for example a contact), it can ingest data from several sources (CDB, client database, employee database, etc.). Therefore, the model and schemas do not depend on the sources.
[Diagram: several sources map to a single Core DB contact record with columns Contact ID, CDB_id, Email, City, Score.]
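As a minimal sketch of this cross-source identification, assuming two hypothetical source extracts, the example below conforms contacts on a business key and assigns a Core DB surrogate ID that is independent of either source:

```python
import pandas as pd

# Hypothetical extracts from two source systems.
cdb = pd.DataFrame({"cdb_id": [101, 102], "email": ["a@example.com", "b@example.com"],
                    "city": ["Paris", "Lyon"]})
crm = pd.DataFrame({"crm_id": ["X1", "X2"], "email": ["b@example.com", "c@example.com"],
                    "score": [0.7, 0.4]})

# Conform on a business key (here: email) rather than on any source system's ID.
contacts = pd.merge(cdb, crm, on="email", how="outer")

# Assign the Core DB's own surrogate key, independent of the sources.
contacts = contacts.reset_index(drop=True)
contacts["contact_id"] = contacts.index + 1

core_contact = contacts[["contact_id", "cdb_id", "email", "city", "score"]]
print(core_contact)
```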
Because of its multi-project nature, the Core DB has special requirements in terms of data management and development practices.
Core DB requirements
- The data model is managed by a central data architect
- Changes must be handled by pull request on a central repository
- Permissions have to be managed granularly per user and asset
- Access must be granted through objects (views, procs) which can be part of an automated testing pipeline
[Diagram: project-based DB development with DEV branches, branching, pull requests, automated testing, on-demand CI/CD builds to TEST and PROD, and a daily rebuild & sanitizing.]
Datalake vs Core DB
Core DB approach:
- Strong cost of input (ETL)
- Small cost of output
- Structured data
- Business-event oriented
- Recommendation: only common data
Data Lake approach:
- Small cost of input
- Strong cost of output (data prep)
- Miscellaneous data
- "Find signal in the noise"
- Recommendation: all data

More Related Content

What's hot

Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
Databricks
 

What's hot (20)

Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Building your Datalake on AWS
Building your Datalake on AWSBuilding your Datalake on AWS
Building your Datalake on AWS
 
Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
 
Azure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar PresentationAzure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar Presentation
 
DataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukDataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de Kreuk
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
 
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
 
Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)
 
The ABCs of Treating Data as Product
The ABCs of Treating Data as ProductThe ABCs of Treating Data as Product
The ABCs of Treating Data as Product
 
Scaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastScaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and Feast
 
Modern Data Warehouse with Azure Synapse.pdf
Modern Data Warehouse with Azure Synapse.pdfModern Data Warehouse with Azure Synapse.pdf
Modern Data Warehouse with Azure Synapse.pdf
 
Introduction to Azure Data Lake
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data Lake
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 
Enterprise Data Architecture Deliverables
Enterprise Data Architecture DeliverablesEnterprise Data Architecture Deliverables
Enterprise Data Architecture Deliverables
 
Data and AI reference architecture
Data and AI reference architectureData and AI reference architecture
Data and AI reference architecture
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Modern Data Architecture
Modern Data Architecture Modern Data Architecture
Modern Data Architecture
 

Similar to Azure BI Cloud Architectural Guidelines.pdf

Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafka
Jeffrey T. Pollock
 
A complete-guide-to-oracle-to-redshift-migration
A complete-guide-to-oracle-to-redshift-migrationA complete-guide-to-oracle-to-redshift-migration
A complete-guide-to-oracle-to-redshift-migration
bindu1512
 

Similar to Azure BI Cloud Architectural Guidelines.pdf (20)

Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafka
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
 
Exploring Microsoft Azure Infrastructures
Exploring Microsoft Azure InfrastructuresExploring Microsoft Azure Infrastructures
Exploring Microsoft Azure Infrastructures
 
Analytics and Lakehouse Integration Options for Oracle Applications
Analytics and Lakehouse Integration Options for Oracle ApplicationsAnalytics and Lakehouse Integration Options for Oracle Applications
Analytics and Lakehouse Integration Options for Oracle Applications
 
Afternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data ServicesAfternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data Services
 
2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with Azure2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with Azure
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
 
ETL Technologies.pptx
ETL Technologies.pptxETL Technologies.pptx
ETL Technologies.pptx
 
A complete-guide-to-oracle-to-redshift-migration
A complete-guide-to-oracle-to-redshift-migrationA complete-guide-to-oracle-to-redshift-migration
A complete-guide-to-oracle-to-redshift-migration
 
Building the Next-gen Digital Meter Platform for Fluvius
Building the Next-gen Digital Meter Platform for FluviusBuilding the Next-gen Digital Meter Platform for Fluvius
Building the Next-gen Digital Meter Platform for Fluvius
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Data Driven Advanced Analytics using Denodo Platform on AWS
Data Driven Advanced Analytics using Denodo Platform on AWSData Driven Advanced Analytics using Denodo Platform on AWS
Data Driven Advanced Analytics using Denodo Platform on AWS
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Migration to Oracle 12c Made Easy Using Replication Technology
Migration to Oracle 12c Made Easy Using Replication TechnologyMigration to Oracle 12c Made Easy Using Replication Technology
Migration to Oracle 12c Made Easy Using Replication Technology
 
Modern Database Development Oow2008 Lucas Jellema
Modern Database Development Oow2008 Lucas JellemaModern Database Development Oow2008 Lucas Jellema
Modern Database Development Oow2008 Lucas Jellema
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The Field
 
Simplifying Cloud Architectures with Data Virtualization
Simplifying Cloud Architectures with Data VirtualizationSimplifying Cloud Architectures with Data Virtualization
Simplifying Cloud Architectures with Data Virtualization
 
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2
 
Agile data warehousing
Agile data warehousingAgile data warehousing
Agile data warehousing
 

Recently uploaded

Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...
Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...
Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...
HyderabadDolls
 
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
HyderabadDolls
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 

Recently uploaded (20)

Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...
Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...
Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...
 
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 

Azure BI Cloud Architectural Guidelines.pdf

  • 1. Azure BI Cloud Architectural Guidelines Ph. D. Pedro Bonillo
  • 2. Executive summary This document is intended to provide guidelines for building architectures on cloud BI projects. Considerations To define an architecture for your project, we suggest you look at these criteria : Source: where your data is located ETL complexity: what kind of business rules and transformations you need to support Data volumes: The sheer size of the data Model complexity: the business problem you are representing and the kind of KPIs you’ll need to support Sharing needs: whether the data is only used for this project or if it needs to integrate with core data assets Reporting demand: the expected rendering speed, and for how many users Templates We’ve defined four templates based on common needs patterns which can be reused as-is or slightly modified to suit your particular case. When you need pure muscle-power Ex: CDB Reporting, Datalens For simple reporting over large data Ex: Radarly, Digital Dashboard For complex reporting over light data Ex: Budgit Quick and agile project for POCs or short-lived needs Ex: Weezevent Hulk Iron man Thor Hawkeye
  • 3. Architectural considerations Sources Cloud data source are simple to capture. On-premises data sources can imply a form of gateway (or IR), a push from the local infrastructure or a VPN access linking cloud resources to local networks. Data volumes Small data volumes can generally be processed in memory all at once and fit within the 1GB of data limitation in Power BI. Medium data volumes can be processed with a single machine whereas large data volumes require cluster-based, parallelized processing. Data interests Local data interests can be managed in a fully autonomous way, isolated from other projects and stakeholders. Global data interest intend to have their results be reused by other projects and teams. As such, these projects have more complex integration phases and have more advanced security features to manage them. The criteria you should consider when planning an architecture
  • 4. Architectural considerations ETL complexity Simple ETL involves only light transformations and data type casting. Medium ETL transforms an incoming landing model into a fully-fledged star schema. Complex ETL involves proactive data quality management, advanced dimensional models and/or intricate business rules. Model complexity Simple models use additive measures over a single star schema or a flat dataset. Medium models include advanced DAX with semi- additive measures and/or calculation over multiple star schemas. Complex models require performance hindering features such as row-level security, bi-directional cross- filtering or very advanced DAX calculations. Reporting demand Low demands infer that it is acceptable to have longer response times (5-15s). Medium demands require snappy response times (<100ms) over a small number of concurrent users. High demands involving having snappy response times over a large number of concurrent users.
  • 5. Functional phases SOURCES ORCHESTRATING PROCESSING STORAGE LIVE CALCULATION DATA ACCESS TRANSPORTING Where the original data lives What moves the data from the source to the platform What coordinates the different services What cleans and transforms the data from its raw state to its usable form Where data lives in its cold form Where reporting calculations are made for the end- users How the end uses access the data An architecture is divided into functional workloads. A single technology can support multiple workload, and a singe workload can sometimes be shared between different technologies.
  • 6. Sources Sources come in two main categories : cloud sources and on-premises sources. Generally speaking, cloud sources are relatively simple to manage whereas on-prems have to deal with the added complexity of networking. Cloud : in this category, we find object storage like AWS S3 or Azure Blob, API calls and user documents (ex: Excel files) stored on SharePoint Online. - Object storage is straightforward and is handled with an ID/Secret mechanism. - API calls can be a bit more complex, especially depending on the authentication mechanism, but often offer a good amount of flexibility in what is returned. They are often capped in terms of data size per call and require more custom logic to handle. - Documents stored on SPO allow users to give direct input in the solution but come with the perils of poorly formed Excel files. Whenever possible, we recommend capturing user inputs through a small web application or a PowerApp. On-prems : these sources are highly valuable (they often form the core of information systems) but can be tricky to access from a cloud services. A few options are available to handle this situation : - Joining the cloud resource to the internal network through VPN - Exposing part of the source (or extracts) in a DMZ. This may not be possible if the data is sensitive - Having an on-premises ETL push the data to the cloud rather than having cloud services fetching the data - Using a gateway like Azure Data Factory’s Integration Runtime to act as a bridge between the on-prems resources and the cloud service. This tends to be the easiest scenario.
  • 7. Transportation Transportation refers to the Extract-and-Load workloads (without the transformations). Throughput, connectivity, parametrization, and monitoring are the key aspects in choosing the right transportation solution. Power BI Dataflows : when data volumes are small (less than 1GB), transformations are simple and the end destination is solely meant to be used in Power BI, it’s Dataflow/PowerQuery engine can be used. It supports a large array of connectors and decent parametrization possibilities. Python : Ideally ran in a serverless environment (AWS Sage Maker, Airflow or Azure Functions), managed code can adapt to a wide variety of data sources, shapes and destinations… provided you have the skill and time to code the E-L solution. This is more adapted to small-to-medium data volumes since the code is usually limited to a single machine (and often to a single cpu thread). It is fully DevOps compatible. Azure Data Factory : the go-to solution for E-L workloads in Azure, ADF is capable of handling large workloads with excellent throughput (especially if landing to ADLS) and is fully DevOps compatible. ADF’s main downside is that while it can read from many different sources, it typically only writes only to MS destinations (with a few exceptions). For self-service projects The go-to solution. Even more with on- prems sources Your can opener for complex files and API calls OUR VERDICT
  • 8. OUR VERDICT Processing The processing layer is where the data quality and business logic is applied. Some projects require very light transformations whereas other completely change the model and apply complex business logic. PySpark: is a managed code framework that can scale very well to Big Data scenario. Spark-based solutions can be easily implemented in a PaaS format through Databricks with fully managed notebook, simple industrialization of code and good monitoring capabilities. For maintainability and support, we recommend using PySpark as the main Spark language. SQL Procedures : Transformations can take place within the database itself, limiting data movement. Azure SQL DB supports a DevOps-compatible fully-fledge language (T-SQL) and is suitable for small-to-medium data volumes. Snowflake’s SQL programmatic objects are less developed but the platform can handle very large data volumes. Azure SQL DWH offers big data levels of volume using SQL Stored procedures but need to be managed more actively (manual cluster starting/scaling/stopping). Azure Data Factory : ADF offers a GUI-based dataflow engine powered by Spark. It can handle very large data volumes, but may be somewhat more limited for very complex transformations. Simple to medium complexity of ETL can be performed without problems. Python : Python’s processing is similar as it’s transportation’s workload : very flexible but potentially longer to develop and best used for smaller data volumes. PBI Dataflow : Power BI offers a simple GUI-based (codeless) interface for light ETL workloads. It can be used to develop simple transformations over low data volumes very quickly and has a large array of source connectors. It is limited for its output to Power BI and the Common Data Model in ADLS. For self-service projects For point-and- clic ETL in the cloud For complex flows and ML For SQL professionals For big data projects
• 9. ETL key feature comparisons
Here are the key differentiators for typical ETL solutions. The comparison matrix covers PBI, PySpark (Databricks), ADF + Dataflow, ADF + SQL Procs (Azure SQL DB) and Python (Airflow) across: cloud sources, on-prems sources, handling of semi-structured data, handling of Excel files, destinations, data volume, transformation capabilities, machine learning capabilities, CI/CD capabilities, alerting and monitoring, and ease of development.
• 10. Storage
The storage layer is where cold data rests. Its main concerns are data throughput (in and out), data management capabilities (RLS, data masking, Active Directory authentication, etc.) and DevOps compatibility.
Azure Data Lake Storage: ADLS is an object-based storage system based on HDFS. It has excellent throughput but is limited to file-level security (not row-level or column-level). As such, it is best used for massive imports/exports or where 100% of the file needs to be read.
Snowflake: soon to be available globally on Azure, Snowflake is a true data warehouse solution in the cloud. As a pure storage layer it doesn't have the same data management or DevOps capabilities as Azure SQL DB, but it supports an impressive per-project compute costing model over the same data. Its throughput is similar to Azure SQL DWH.
Azure SQL DB: Azure's main fully featured DBMS, SQL DB has excellent data management capabilities and Active Directory support. It supports a declarative development approach, which opens up many DevOps opportunities. However, its throughput is not great, and massive data volumes should be loaded using an incremental approach for best performance. When hosting medium data volumes or more, consider an S3 service level or higher to access advanced features such as columnstore storage.
Azure SQL DWH: a massively parallel processing (MPP) version of Azure SQL DB. By default, data is spread across 60 shards, themselves distributed over 1 to 60 nodes, which offers very high throughput when required. Azure SQL DWH supports programmatic objects such as stored procedures but has a slightly different writing style from Azure SQL DB in order to take advantage of its MPP capabilities.
BEST FOR: ADLS for landing native files; Azure SQL DB for small structured data; Azure SQL DWH for big data ETL; Snowflake for big data ad hoc use.
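The incremental approach recommended for Azure SQL DB can be sketched as a watermark pattern. This is only an illustrative PySpark version under assumed table, column and connection names (`dbo.fact_orders`, `modified_at`, placeholder credentials), not the document's prescribed implementation.

```python
# Watermark-based incremental load sketch: append only rows newer than the last load.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=datamart"
props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# 1. Read the current high-water mark from the target table
watermark = (
    spark.read.jdbc(jdbc_url, "(SELECT MAX(modified_at) AS wm FROM dbo.fact_orders) t", properties=props)
         .collect()[0]["wm"]
)

# 2. Keep only new or changed rows from the curated layer
new_rows = spark.read.parquet("abfss://curated@<account>.dfs.core.windows.net/fact_orders/")
if watermark is not None:
    new_rows = new_rows.filter(F.col("modified_at") > F.lit(watermark))

# 3. Append the delta instead of reloading the full table
new_rows.write.jdbc(jdbc_url, "dbo.fact_orders", mode="append", properties=props)
```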
• 11. Database key feature comparisons
The cloud now offers multiple solutions for hosting relational data. While these have more or less feature parity on all core functionalities (they can all perform relatively well), key differences do exist between them. The comparison matrix covers Snowflake, Azure SQL DB and Azure SQL DWH across: scale time, compute/storage isolation, semi-structured data, PBI integration, Azure Active Directory integration, DevOps & CI/CD support, temporal tables, data cloning, DB programming, cost when used as an ETL engine, cost when used as a reporting engine, and ease of budget forecasting.
• 12. Live calculation
Reporting calculation engines perform the real-time data crunching when users consume reports. This tends to be a high-demand, 24/7 service due to the group's international nature. Model complexity and reporting demand are the key drivers when choosing an appropriate technology for this layer.
Power BI models: if data volumes are small (less than 1 GB of compressed data), Power BI native models, especially when used on a Premium capacity, offer the most fully featured capabilities for this workload in Power BI. Their only real drawback is the lack of a real developer experience such as source control or CI/CD capabilities. Premium workload data size limits will soon be increased to 10 GB… for a fee.
Composite models: composite models allow Power BI to natively keep some of its data (such as dimensions or aggregated tables) in-memory and the rest in DirectQuery mode. This reduces the need for computation at the reporting layer. It may not be adequate for complex models, since the DAX used by DirectQuery still has limitations, and performance can be uneven depending on whether a query can hit an in-memory PBI table or needs to fall back to DirectQuery data. Composite models are best suited to dashboarding scenarios (limited interactivity) over large data.
Azure Analysis Services: AAS is the scale-out version of Power BI native models. It can perform the same complex calculations over very large datasets at a cost-efficient level (when compared to Power BI Premium). AAS has a slower release cycle than Power BI and thus tends to lack the latest features supported by PBI. It does, however, support a real developer experience with source control and development environments.
DirectQuery: often misunderstood, this mode makes every visualization on a report send one (or many) SQL queries to the underlying source for every change on the page (ex: a filter applied). Performance will be slower than in-memory (although it may still be acceptable depending on data volume and back-end compute power). Other issues include DAX limitations and some hard limits on the data volume returned. For these reasons, limit DirectQuery to exploratory scenarios, near-time scenarios or dashboarding scenarios with limited complexity and interactivity. Depending on the source, there may also be significant pressure on data gateways.
OUR VERDICT: Power BI models when the data fits; AAS when it doesn't; DirectQuery for near-time, dashboards and exploration; composite models for simple reports over big data.
• 13. Live calculation feature comparison
This workload comparison is somewhat more complicated because of the relationship between data volumes and calculation complexity. Regardless, some general trends can be observed in the matrix, which compares DirectQuery (on Snowflake), Composite (on Snowflake), AAS and PBI models (non-Premium) across: cost, volume, model and KPI complexity, refreshing, CI/CD, row-level security, data mashing, and calculation speed.
• 14. Data access
Data access refers to how users are able to get to the data. Several access modes are possible:
Power BI Embedded: preconfigured reports and dashboards for a topic are accessible turnkey from the data portal.
Direct access – data models: if there is a need to create a local version of the report and enhance the model with additional KPIs, it is possible to connect directly to the data model through Excel or Power BI. This lets the report maker start from pre-validated dimensional data and KPIs and focus on his/her own additional KPIs and visuals. This method, however, does not allow the report maker to mash additional data into the current model.
Direct access – curated data: if a BI developer wants to create his/her own model, it is possible to access the curated dimensional data directly from the Core DB. This significantly lowers the cost of a project by diminishing ETL costs and tapping into pre-validated dimensional data. KPIs and any additional data will still need to be developed and tested before use.
Direct access – data lake: a BI developer or a data scientist may wish to have access to raw data files for their own project. This may or may not be possible depending on security needs, on a per-dataset basis.
USE FOR: Power BI Embedded for turnkey reports; data models to build your own reports; curated data to import into your own data warehouse; the data lake for data science datasets.
• 15. Where to connect
What you get depends on where you connect:
- Data lake: raw data.
- Core DB: curated dimensional data, business rules applied and tested, ETL already done, one location for multiple data sources.
- Analysis Services: KPIs built and tested, a calculation engine for self-service.
- Power BI: data visualization based on common needs.
• 16. Orchestrating
Orchestrating is a key feature of cloud architecture. Not only does the orchestrator need to launch and manage jobs, it should also be able to interact with the cloud fabric (resource scaling, creation/deletion, etc.). Key features include native connectivity to various APIs, programmatic flows (ifs, loops, branches, error handling, etc.), trigger-based launching, DevOps compatibility, and GUIs for development and monitoring.
Logic Apps: the de facto orchestration tool in the Azure stack, including all the key features required of such a tool. It offers a GUI-based experience that can be scaled up to a full DevOps pipeline.
Azure Data Factory: ADF includes its own simple scheduling and orchestrating tool. While not as developed as Logic Apps, it can be a valid choice when orchestration is limited to simple scheduling, core data activities (ADF pipelines, Databricks jobs, SQL procs, etc.) and basic REST calls.
Airflow: a code-oriented scheduler that lets teams create complex workflows with many tasks. It is mainly used to execute Python or Spark tasks. It is seamlessly integrated into the Data Portal, so it is a good choice if you keep your data in the data lake and want a single entry point to monitor both your data and your processes.
OUR VERDICT: Logic Apps as the go-to solution in Azure; Airflow for data science projects in AWS; ADF for simple scheduling needs.
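For the code-oriented option, here is a minimal Airflow sketch of the pattern described above: a daily DAG with a landing task followed by a processing task. The DAG id, schedule and task callables are illustrative assumptions, not part of the original guidelines.

```python
# Minimal Airflow DAG sketch: land data, then trigger the processing step.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def land_from_source():
    """Extract-and-load step (e.g. API -> S3/ADLS); see the transportation sketch."""
    ...


def run_processing():
    """Trigger the Spark/Databricks job that builds the curated layer."""
    ...


with DAG(
    dag_id="bi_project_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 4 * * *",  # every day at 04:00 UTC
    catchup=False,
) as dag:
    land = PythonOperator(task_id="land_from_source", python_callable=land_from_source)
    process = PythonOperator(task_id="run_processing", python_callable=run_processing)

    land >> process  # processing only runs once landing has succeeded
```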
• 17. Feature comparisons for orchestration
Here are the key differentiators for orchestration solutions. The comparison matrix covers ADF, Airflow and Logic Apps across: CI/CD, scheduling and triggering, native connectors, debugging, ease of development, alerting & monitoring, parameterization, control flow, interfacing with on-prems assets, and secrets management.
• 18. Architecture templates
Whilst not the only considerations to take into account, architectures can be broadly segmented by the volume of data they handle and the complexity of the ETL and model they must support. Based on this, we've defined template architectures to guide you through the design process:
- Hulk (when you need pure muscle-power; ex: CDB Reporting, Datalens): any sources, large data volumes, global interest, complex ETL, complex model, high demand.
- Thor (simple reporting over large data; ex: Radarly, Digital Dashboard): any sources, large data volumes, global interest, complex ETL, simple model, medium demand.
- Iron Man (complex reporting over light data; ex: Budgit): any sources, small data volumes, local interest, complex ETL, complex model, medium demand.
- Hawkeye (quick and agile project for POCs or short-lived needs; ex: Weezevent): any sources, small data volumes, local interest, simple ETL, medium model, medium demand.
• 19. Hulk: when you need pure muscle-power
Best used for: any sources, large data volumes, global interest, complex ETL, complex model, high demand.
- Transporting: ADF
- Orchestrating: Logic Apps
- Processing: Databricks (PySpark), ADF (dataflow)
- Storage: ADLS, Core DB, project data mart
- Live calculation: AAS
- Data access: PBI Embedded
• 20. Hulk: Step 1
Data is landed from S3 to ADLS gen2 via an ADF pipeline. This ensures a fast, bottleneck-free landing phase. Due to the volume, an incremental loading approach is highly recommended to limit the impact on an on-prems IR gateway and the throughput to the SQL DB. We could have used ADF scheduling in simple scenarios; however, for uniformity's sake, and to benefit from extra alerting capabilities, Logic Apps is preferred as the overall scheduler.
• 21. Hulk: Step 2
Databricks or ADF is used to perform the complex ETL over a large dataset by leveraging its Spark SQL engine from Python. It fetches the landed data from ADLS and enriches it with curated data from the Core DB to perform the complex transformations.
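A hedged sketch of this step, using Spark SQL from Python as described: landed files are read from ADLS, a curated dimension is pulled from the Core DB over JDBC, and the two are joined into a reporting-ready fact. Paths, table names and credentials are illustrative assumptions.

```python
# Illustrative Hulk Step 2: enrich landed data with Core DB dimensions via Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

landed = spark.read.parquet("abfss://landing@<account>.dfs.core.windows.net/sales/")

dim_product = spark.read.jdbc(
    "jdbc:sqlserver://<server>.database.windows.net:1433;database=coredb",
    "dbo.dim_product",
    properties={
        "user": "<user>",
        "password": "<password>",
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    },
)

# Register both datasets so the transformation can be expressed in Spark SQL
landed.createOrReplaceTempView("landed_sales")
dim_product.createOrReplaceTempView("dim_product")

fact_sales = spark.sql("""
    SELECT s.order_id, s.order_date, p.product_key, s.quantity, s.net_amount
    FROM landed_sales s
    LEFT JOIN dim_product p ON s.product_code = p.product_code
""")

fact_sales.write.mode("overwrite").parquet(
    "abfss://curated@<account>.dfs.core.windows.net/fact_sales/"
)
```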
• 22. Hulk: Step 3
Any group data that can be used in an overall curated dimensional model useful to other projects is pushed back to the Core DB. This database is thus enriched, project by project, with easy-to-use, vouched-for datasets. Data that is purely report-specific is pushed to a project data mart.
• 23. Hulk: Step 4
Due to the size of the reporting dataset, the complexity of its model and KPIs, and the expected reporting performance, an AAS cube is used as the main reporting engine. The final report is built on top of this cube and exposed in the Data Portal through Power BI Embedded.
• 24. Hulk: Step 5
Subsidiaries that wish to do so can access the AAS cube to build their own custom reports and/or fetch the dimensional data from the Core DB through custom views, for security purposes.
• 25. Thor: simple reporting over large data
Best used for: any sources, large data volumes, global interest, complex ETL, simple model, medium demand.
- Transporting: ADF
- Orchestrating: Logic Apps
- Processing: Databricks, ADF (dataflow)
- Storage: ADLS, Core DB, project data mart
- Live calculation: composite models
- Data access: PBI Embedded
• 26. Thor: Step 1
The overall architecture resembles a Hulk-like scenario due to the volume of data. Transformations are performed in Databricks or ADF even though the ETL is rather simple, because the data volumes could overwhelm a single-machine architecture and/or the throughput of the target database.
• 27. Thor: Step 2
Two options are available for the reporting calculations. The simplest one is to use an AAS cube; this is potentially more expensive in terms of software, but simpler in development. The alternative is to use PBI models with aggregations, if the KPIs are simple enough to hit the aggregations on a regular basis. However, this complicates the data model (and thus increases development time), can hurt performance when the query is passed through to the source, and incurs costs on the data mart layer (for example with Snowflake, because of its per-query costing model).
• 28. Thor (AWS): simple reporting over large data
Best used for: large data volumes, global interest, complex ETL, simple model, medium demand.
- Sources: cloud sources
- Transporting: Airflow Python operator to Amazon S3
- Orchestrating: Airflow
- Processing: Databricks (PySpark) via a Spark operator
- Storage: Amazon S3, Snowflake
- Live calculation and data access: PBI Embedded (composite models), Superset
• 29. Iron Man: complex reporting over light data
Best used for: any sources, medium data volumes, local interest, complex ETL, complex model, medium demand.
- Transporting: ADF (dataflow)
- Orchestrating: Logic Apps
- Processing: SQL procedures
- Storage: project data mart
- Live calculation: PBI model
- Data access: PBI Embedded
• 30. Iron Man: Step 1
This architecture is designed for "tactical projects" where data sharing is not paramount and data volumes are low, but which may still require a fair amount of business rules and data cleansing. The low volume means we can write raw data directly to the database without a file-based landing in a data lake.
• 31. Iron Man: Step 2
The complex ETL can then be implemented in SQL stored procedures within the database itself. For simplicity's sake, the orchestration in Logic Apps launches ADF, and ADF launches the procedures after the landing. This avoids complex Logic Apps code for handling long-running procedures.
• 32. Iron Man: Step 3
The low data volumes also mean that the entire dataset can be uploaded to Power BI, where complex KPIs can be calculated. The report is made accessible through the Data Portal.
• 33. Iron Man (AWS): complex reporting over light data
Best used for: small data volumes, local interest, complex ETL, complex model, medium demand.
- Sources: cloud sources
- Transporting: Airflow Python operator to Amazon S3
- Orchestrating: Airflow
- Processing: Databricks (PySpark)
- Storage: Amazon S3, Snowflake
- Live calculation and data access: PBI Embedded (import mode)
• 34. Hawkeye: quick project for POCs or short-lived needs
Best used for: any sources, small data volumes, local interest, simple ETL, medium model, medium demand.
- Transporting: PBI dataflow
- Orchestrating: PBI scheduling
- Processing: PBI dataflow
- Storage: PBI dataflow (internal storage)
- Live calculation: PBI model
- Data access: PBI Embedded
• 35. Hawkeye: Step 1
A fully self-service Power BI stack is possible for small data volumes and low ETL complexity. This architecture should be kept to proofs of concept or temporary projects where time-to-market is paramount and maintainability is not required. While PBI dataflows are able to handle larger and larger data volumes with Premium capacities, the current price-performance ratio is highly suboptimal for larger data volumes.
• 36. Hawkeye: Step 2
Power Query and the M language are capable of handling low-to-medium levels of complexity in the ETL. However, they currently lack the life-cycle tooling (version control, automated deployments, etc.) required for professional development.
• 37. Hawkeye: Step 3
A major roadblock preventing this architecture from being used beyond POCs and temporary projects is the current lack of integration with external storage systems. The only current possibilities are tightly integrated with the Common Data Model initiative in ADLS, which has yet to prove its viability beyond Dynamics 365.
• 38. Hawkeye: Step 4
The live calculation and data access are, of course, handled directly in Power BI. Here again, keep data volumes to a minimum. While Premium capacities can accommodate larger and larger volumes, the price-performance ratios are downright disastrous compared to the alternatives (AAS cubes and composite models).
• 39. Hawkeye (AWS): quick project for POCs or short-lived needs
Best used for: cloud sources, small data volumes, local interest, simple ETL, simple model, medium demand.
- Transporting and processing: Airflow Python operators
- Orchestrating: Airflow
- Storage: Amazon S3, Snowflake
- Live calculation and data access: Superset
• 40. Where does Core DB fit?
The Core DB is present in the Hulk and Thor templates, where it sits in the shared storage layer alongside the project-specific resources. It is a single database used by several projects.
• 41. What is Core DB?
Core DB contains global information that can be used by all affiliates and across several projects. It is a central repository that is gradually built from widely used business data (e-commerce, prisma, websites, consumer activities, etc.).
It follows a classic EDW pattern: sources land in the data lake; data quality rules and reference/MDM data are applied to build the Core DB (conformed dimensions in third normal form, the EDW proper); project data marts then expose star schemas built on top of it.
This widely used pattern allows:
- Consistency across projects
- Better quality and overall data management
- Lower project costs through reuse of validated assets
• 42. What is Core DB?
Core DB contains global information that can be used by all affiliates and across several projects:
- Common dimensions and reference data: products, entities, contacts, geography, …
- Widely used business events: e-commerce orders, prisma data, website views, consumer activities, …
• 43. What is Core DB?
The model is business-oriented, not source-oriented. It has its own IDs to allow cross-source identification: for the same business item (for example a contact), it can ingest data from several sources (CDB, client database, employee database…). The model and schemas therefore do not depend on the sources. A contact in Core DB, for instance, carries its own Contact ID alongside attributes fed from the sources, such as CDB_id, Email, City and Score.
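The idea can be illustrated with a hedged PySpark sketch: contacts coming from two hypothetical source tables are conformed to one schema, matched on email, and assigned a Core DB surrogate ID that is independent of any source ID. Table and column names are assumptions for illustration only, and the email-based matching is deliberately naive.

```python
# Illustrative conformance sketch: business-keyed contacts built from several sources.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

cdb = spark.table("landing.cdb_contacts").select(
    F.col("cdb_id").alias("CDB_id"),
    F.lower("email").alias("Email"),
    F.col("city").alias("City"),
    F.col("score").alias("Score"),
)

crm = spark.table("landing.client_db_contacts").select(
    F.lit(None).cast("string").alias("CDB_id"),   # this source has no CDB identifier
    F.lower("mail").alias("Email"),
    F.col("town").alias("City"),
    F.col("score").alias("Score"),
)

contacts = (
    cdb.unionByName(crm)
       .dropDuplicates(["Email"])                                  # naive cross-source matching on email
       .withColumn("ContactID", F.monotonically_increasing_id())   # Core DB's own ID, source-independent
       .select("ContactID", "CDB_id", "Email", "City", "Score")
)
```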
• 44. Core DB requirements
Because of its multi-project nature, Core DB has special requirements in terms of data management and development practices:
- The data model is managed by a central data architect
- Changes must be handled by pull request on a central repository
- Permissions have to be managed granularly, per user and per asset
- Access must be granted through objects (views, procs) which can be part of an automated testing pipeline (see the sketch below)
The development workflow is branch-based and project-oriented: branches and pull requests feed automated testing, on-demand CI/CD builds promote changes through DEV, TEST and PROD, with a daily rebuild and sanitizing step.
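As a sketch of how the testing pipeline could exercise those access objects, here is a small pytest check against a hypothetical Core DB view. The view name, columns and connection string are assumptions, not part of the actual Core DB.

```python
# Minimal pytest sketch: CI/CD checks run against a Core DB access view.
import os

import pyodbc
import pytest


@pytest.fixture(scope="module")
def conn():
    # Connection string injected by the pipeline, e.g. from a key vault
    connection = pyodbc.connect(os.environ["COREDB_CONNECTION_STRING"])
    yield connection
    connection.close()


def test_contact_view_has_no_duplicate_business_ids(conn):
    # The business key exposed by the view must be unique
    dupes = conn.cursor().execute(
        "SELECT COUNT(*) FROM (SELECT ContactID FROM dbo.vw_contacts "
        "GROUP BY ContactID HAVING COUNT(*) > 1) d"
    ).fetchone()[0]
    assert dupes == 0


def test_contact_view_exposes_expected_columns(conn):
    # The contract with consumers: these columns must stay available
    row = conn.cursor().execute(
        "SELECT TOP 1 ContactID, Email, City, Score FROM dbo.vw_contacts"
    ).fetchone()
    assert row is None or len(row) == 4
```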
• 45. Data lake vs Core DB
Core DB approach:
- High cost of input (ETL)
- Low cost of output
- Structured data
- Business-event oriented
- Recommendation: only common data
Data lake approach:
- Low cost of input
- High cost of output (data prep)
- Miscellaneous data
- "Find the signal in the noise"
- Recommendation: all data