This document provides guidelines for building cloud BI project architectures. It discusses considerations for architectural design such as data sources, volumes, model complexity and sharing needs. It then presents four common architecture templates - Hulk, Iron Man, Thor and Hawkeye - tailored to different needs around reporting demand, data volume and complexity. Key aspects of architectures like sources, transportation, processing, storage, live calculation, data access and orchestration are examined. Finally, it compares features of technologies that can fulfill different functional roles.
2. Executive summary
This document is intended to provide guidelines for building architectures on cloud BI projects.
Considerations
To define an architecture for your project, we suggest you look at these criteria:
- Source: where your data is located
- ETL complexity: what kind of business rules and transformations you need to support
- Data volumes: the sheer size of the data
- Model complexity: the business problem you are representing and the kind of KPIs you'll need to support
- Sharing needs: whether the data is only used for this project or if it needs to integrate with core data assets
- Reporting demand: the expected rendering speed, and for how many users
Templates
We’ve defined four templates based on common needs patterns which can be reused as-is or slightly modified to suit your particular case.
- Hulk: when you need pure muscle-power (Ex: CDB Reporting, Datalens)
- Thor: for simple reporting over large data (Ex: Radarly, Digital Dashboard)
- Iron Man: for complex reporting over light data (Ex: Budgit)
- Hawkeye: quick and agile project for POCs or short-lived needs (Ex: Weezevent)
3. Architectural considerations
Sources
Cloud data sources are simple to capture. On-premises data sources can imply a form of gateway (or IR), a push from the local infrastructure, or a VPN access linking cloud resources to local networks.
Data volumes
Small data volumes can generally be processed in memory all at once and fit within the 1GB data limitation in Power BI. Medium data volumes can be processed with a single machine, whereas large data volumes require cluster-based, parallelized processing.
Data interests
Local data interests can be managed in a fully autonomous way, isolated from other projects and stakeholders. Global data interests intend to have their results reused by other projects and teams. As such, these projects have more complex integration phases and more advanced security features to manage.
The criteria you should consider when planning an architecture
4. Architectural considerations
ETL complexity
Simple ETL involves only light transformations and data type casting. Medium ETL transforms an incoming landing model into a fully-fledged star schema. Complex ETL involves proactive data quality management, advanced dimensional models and/or intricate business rules.
Model complexity
Simple models use additive measures over a single star schema or a flat dataset. Medium models include advanced DAX with semi-additive measures and/or calculations over multiple star schemas. Complex models require performance-hindering features such as row-level security, bi-directional cross-filtering or very advanced DAX calculations.
Reporting demand
Low demand implies that it is acceptable to have longer response times (5-15s). Medium demand requires snappy response times (<100ms) for a small number of concurrent users. High demand involves having snappy response times for a large number of concurrent users.
5. Functional phases
An architecture is divided into functional workloads. A single technology can support multiple workloads, and a single workload can sometimes be shared between different technologies.
- Sources: where the original data lives
- Transporting: what moves the data from the source to the platform
- Orchestrating: what coordinates the different services
- Processing: what cleans and transforms the data from its raw state to its usable form
- Storage: where data lives in its cold form
- Live calculation: where reporting calculations are made for the end users
- Data access: how the end users access the data
6. Sources
Sources come in two main categories: cloud sources and on-premises sources. Generally speaking, cloud sources are relatively simple to manage, whereas on-premises sources have to deal with the added complexity of networking.
Cloud: in this category, we find object storage like AWS S3 or Azure Blob, API calls, and user documents (ex: Excel files) stored on SharePoint Online.
- Object storage is straightforward and is handled with an ID/Secret mechanism.
- API calls can be a bit more complex, especially depending on the authentication mechanism, but often offer a
good amount of flexibility in what is returned. They are often capped in terms of data size per call and require
more custom logic to handle.
- Documents stored on SPO allow users to give direct input in the solution but come with the perils of poorly formed
Excel files. Whenever possible, we recommend capturing user inputs through a small web application or a
PowerApp.
On-premises: these sources are highly valuable (they often form the core of information systems) but can be tricky to access from cloud services. A few options are available to handle this situation:
- Joining the cloud resource to the internal network through VPN
- Exposing part of the source (or extracts) in a DMZ. This may not be possible if the data is sensitive
- Having an on-premises ETL push the data to the cloud rather than having cloud services fetching the data
- Using a gateway like Azure Data Factory’s Integration Runtime to act as a bridge between the on-prems
resources and the cloud service. This tends to be the easiest scenario.
7. Transportation
Transportation refers to the Extract-and-Load workloads (without the transformations). Throughput, connectivity,
parametrization, and monitoring are the key aspects in choosing the right transportation solution.
Power BI Dataflows: when data volumes are small (less than 1GB), transformations are simple and the end destination is solely meant to be used in Power BI, its Dataflow/Power Query engine can be used. It supports a large array of connectors and decent parametrization possibilities.
Python: ideally run in a serverless environment (AWS SageMaker, Airflow or Azure Functions), managed code can adapt to a wide variety of data sources, shapes and destinations… provided you have the skill and time to code the E-L solution. This is better adapted to small-to-medium data volumes, since the code is usually limited to a single machine (and often to a single CPU thread). It is fully DevOps compatible.
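To make the trade-off concrete, here is a minimal sketch of such a managed-code E-L job; the paginated source is simulated in-memory, and names like `fetch_page` and the landing file are hypothetical, not part of any actual pipeline described here.

```python
import json
from pathlib import Path
from typing import Callable, Iterator


def extract(fetch_page: Callable[[int], list]) -> Iterator[dict]:
    """Pull every record from a paginated source, one page at a time
    (API sources are often capped in data size per call)."""
    page = 0
    while True:
        records = fetch_page(page)
        if not records:
            return
        yield from records
        page += 1


def load(records: Iterator[dict], landing_file: Path) -> int:
    """Land the records as newline-delimited JSON; return the row count."""
    count = 0
    with landing_file.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
            count += 1
    return count


# Simulated paginated source: two pages of two records, then an empty page.
_pages = [[{"id": 1}, {"id": 2}], [{"id": 3}, {"id": 4}], []]

n_rows = load(extract(lambda p: _pages[p]), Path("landing.jsonl"))
# n_rows == 4
```

The flexibility is exactly what the paragraph above describes: swapping `fetch_page` adapts the job to a new source without touching the load side, but everything runs on one machine.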
Azure Data Factory: the go-to solution for E-L workloads in Azure, ADF is capable of handling large workloads with excellent throughput (especially if landing to ADLS) and is fully DevOps compatible. ADF's main downside is that while it can read from many different sources, it typically writes only to Microsoft destinations (with a few exceptions).
OUR VERDICT: Power BI Dataflows for self-service projects; ADF as the go-to solution, even more with on-prems sources; Python as your can opener for complex files and API calls.
8. Processing
The processing layer is where data quality and business logic are applied. Some projects require very light transformations, whereas others completely change the model and apply complex business logic.
PySpark: a managed-code framework that can scale very well to big data scenarios. Spark-based solutions can be easily implemented in a PaaS format through Databricks, with fully managed notebooks, simple industrialization of code and good monitoring capabilities. For maintainability and support, we recommend using PySpark as the main Spark language.
SQL procedures: transformations can take place within the database itself, limiting data movement. Azure SQL DB supports a DevOps-compatible, fully-fledged language (T-SQL) and is suitable for small-to-medium data volumes. Snowflake's SQL programmatic objects are less developed, but the platform can handle very large data volumes. Azure SQL DWH offers big-data levels of volume using SQL stored procedures but needs to be managed more actively (manual cluster starting/scaling/stopping).
Azure Data Factory: ADF offers a GUI-based dataflow engine powered by Spark. It can handle very large data volumes, but may be somewhat more limited for very complex transformations. Simple-to-medium ETL complexity can be handled without problems.
Python: Python's processing role is similar to its transportation workload: very flexible, but potentially longer to develop and best used for smaller data volumes.
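As an illustration of what a medium-complexity processing step does, here is a pure-Python sketch that turns a flat landing dataset into a small star schema (one dimension plus a fact table with surrogate keys); the column names are invented for the example.

```python
from collections import OrderedDict


def to_star_schema(landing_rows):
    """Split a flat landing dataset into a product dimension and a fact
    table keyed by surrogate keys -- the core move of a medium-complexity ETL."""
    dim_product = OrderedDict()  # business key -> surrogate key
    facts = []
    for row in landing_rows:
        key = row["product"]
        if key not in dim_product:
            dim_product[key] = len(dim_product) + 1
        facts.append({"product_sk": dim_product[key],
                      "date": row["date"],
                      "amount": row["amount"]})
    dim = [{"product_sk": sk, "product": name}
           for name, sk in dim_product.items()]
    return dim, facts


landing = [
    {"product": "A", "date": "2020-01-01", "amount": 10.0},
    {"product": "B", "date": "2020-01-01", "amount": 5.0},
    {"product": "A", "date": "2020-01-02", "amount": 7.5},
]
dim, facts = to_star_schema(landing)
# dim has 2 rows (A, B); facts has 3 rows referencing them by surrogate key
```

The same logic, expressed in PySpark or T-SQL, is what the cluster- and database-based options above industrialize for larger volumes.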
PBI Dataflow: Power BI offers a simple GUI-based (codeless) interface for light ETL workloads. It can be used to develop simple transformations over low data volumes very quickly and has a large array of source connectors. Its output, however, is limited to Power BI and the Common Data Model in ADLS.
OUR VERDICT: PySpark (Databricks) for big data projects; SQL procedures for SQL professionals; ADF for point-and-click ETL in the cloud; Python for complex flows and ML; PBI Dataflow for self-service projects.
9. ETL key feature comparisons
Here are the key differentiators for typical ETL solutions: PBI, PySpark (Databricks), ADF + Dataflow, ADF + SQL Procs (Azure SQL DB) and Python (Airflow). The compared features are: cloud sources; on-prems sources; handling of semi-structured data; handling of Excel files; destinations; data volume; transformation capabilities; machine learning capabilities; CI/CD capabilities; alerting and monitoring; ease of development.
10. Storage
The storage layer is where cold data rests. Its main concerns are data throughput (in and out), data management capabilities (RLS, data masking, Active Directory authentication, etc.) and DevOps compatibility.
Azure Data Lake Storage : ADLS is an object-based storage system based on HDFS. It has excellent throughput but is
limited to file-level security (not row-level or column-level). As such, it is best used during massive import/export or
where 100% of the file needs to be read.
Snowflake: soon to be present globally on Azure, Snowflake is a true data warehouse solution in the cloud. As a pure storage layer, it doesn't have the same data management or DevOps capabilities as Azure SQL DB, but it supports an impressive per-project compute costing model over the same data. Its throughput is similar to Azure SQL DWH.
Azure SQL DB: Azure's main fully-featured DBMS, SQL DB has excellent data management capabilities and Active Directory support. It supports a declarative development approach which offers many DevOps opportunities. However, its throughput is not great, and massive data volumes should be loaded using an incremental approach for best performance. When hosting medium data volumes or more, consider an S3 service tier or higher to access advanced features like columnstore storage.
Azure SQL DWH: a massively parallel processing (MPP) version of Azure SQL DB. By default, data is spread across 60 shards, themselves distributed between 1 and 60 nodes, which offers very high throughput when required. Azure SQL DWH supports programmatic objects such as stored procedures, but has a slightly different writing style to Azure SQL DB in order to take advantage of its MPP capabilities.
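The MPP layout can be pictured with a toy hash-distribution function; the real engine's hash is internal to the service, so this is only a conceptual sketch of how a distribution-column value decides where a row lands.

```python
import hashlib


def distribution_for(key, n_distributions=60):
    """Toy stand-in for hash distribution: each row lands in one of the
    60 distributions based on its distribution-column value."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_distributions


# Rows sharing a distribution-column value always land in the same
# distribution, so joins and aggregations on that column avoid
# cross-node data movement.
same_a = distribution_for("customer_42")
same_b = distribution_for("customer_42")
```

Choosing the distribution column so that large joins stay co-located is the main "different writing style" the paragraph above alludes to.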
BEST FOR: ADLS for landing native files; Azure SQL DB for small structured data; Azure SQL DWH for big data ETL; Snowflake for big data ad hoc use.
11. Database key feature comparisons
The cloud now offers multiple solutions for hosting relational data: Snowflake, Azure SQL DB and Azure SQL DWH. While these have more or less feature parity on all core functionalities (they can all perform relatively well), key differences do exist between them on: scale time; compute/storage isolation; semi-structured data; PBI integration; Azure Active Directory integration; DevOps & CI/CD support; temporal tables; data cloning; cost when used as an ETL engine; cost when used as a reporting engine; ease of budget forecasting; DB programming.
12. Live calculation
Reporting calculation engines perform the real-time data crunching when users consume reports. This tends to be a high-demand, 24/7 service due to the group's international nature. Model complexity and reporting demand are key drivers when choosing an appropriate technology for this layer.
Power BI models: if the data volumes are small (less than 1GB of compressed data), Power BI native models, especially when used on a Premium capacity, give globally the most fully-featured capabilities for this workload on Power BI. Their only real drawback is the lack of a real developer experience, like source control or CI/CD capabilities. Premium workload data size limits will soon be increased to 10GB… for a fee.
Composite Models: Composite Models allow Power BI to natively keep some of its data (like dimensions or aggregated tables) in-memory and the rest in DirectQuery mode. This reduces the need for computation at the reporting layer. It may not be adequate for complex models, since the DAX used by DirectQuery still has limitations, and performance may be uneven depending on whether a query can hit an in-memory PBI table or needs to fall back to DirectQuery data. They are best suited for dashboarding scenarios (limited interactivity) over large data.
Azure Analysis Services: AAS is the scale-out version of Power BI native models. It can perform the same complex calculations over very large datasets at a cost-efficient level (when compared to Power BI Premium). AAS has a slower release cycle than Power BI, and thus tends to lack the latest features supported by PBI. It does however support a real developer experience, with source control and development environments.
DirectQuery: often misunderstood, this mode makes every visualization on a report send one (or many) SQL queries to the underlying source for every change on the page (ex: a filter applied). Performance will be slower than in-memory (although it may still be acceptable depending on data volume and back-end compute power). Other issues include DAX limitations and some hard limits on the data volume being returned. For these reasons, limit DirectQuery to exploratory scenarios, near-time scenarios or dashboarding scenarios with limited complexity and interactivity. Depending on the source, there may also be significant pressure on data gateways.
OUR VERDICT: Power BI models when it fits! AAS when it doesn't! Composite Models for simple reports over big data; DirectQuery for near-time, dashboards and exploration.
13. Live calculations feature comparison
This workload comparison is somewhat more complicated because of the relationship between data volumes and calculation complexity. Regardless, some general trends can be observed across DirectQuery (on Snowflake), Composite (on Snowflake), AAS and PBI models (non Premium) on: cost; volume; model and KPI complexity; refreshing; CI/CD; row-level security; data mashing; calculation speed.
14. Data access
Data access refers to how the users are able to get to the data. This is done through the Data Portal.
Power BI Embedded: preconfigured reports and dashboards for a topic are accessible turnkey from the data portal.
Direct access – data models: if there's a need to create a local version of the report and enhance the model with additional KPIs, it is possible to connect directly to the data model through Excel or Power BI. This allows the report maker to start from pre-validated dimensional data and KPIs and focus on his/her own additional KPIs and visuals. This method however doesn't allow the report maker to mash additional data into the current model.
Direct access – curated data: if a BI developer wants to create his/her own model, it is possible to access the curated dimensional data directly from the Core DB. This significantly lowers the cost of a project by diminishing ETL costs and tapping into pre-validated dimensional data. KPIs and any additional data will still need to be developed and tested before use.
Direct access – data lake: a BI developer or a data scientist may wish to have access to raw data files for their own project. This may or may not be possible depending on security needs, on a per-dataset basis.
USE FOR: Power BI Embedded for turnkey reports; direct access to data models to build your own reports; direct access to curated data to import into your own data warehouse; direct access to the data lake for data science datasets.
15. Where to connect
What you get depends on where you connect:
- Data lake: raw data
- Core DB: curated dimensional data; business rules applied and tested; ETL already done; one location for multiple data sources
- Analysis Services: KPIs built and tested; calculation engine for self-service
- Power BI: data visualization based on common needs
16. Orchestrating
Orchestrating is a key feature of cloud architecture. Not only does the orchestrator need to launch and manage jobs, it should also be able to interact with the cloud fabric (resource scaling, creation/deletion, etc.). Key features include native connectivity to various APIs, programmatic flows (if, loops, branches, error-handling, etc.), trigger-based launching, DevOps compatibility and GUIs for development and monitoring.
Logic Apps: the de facto orchestration tool in the Azure stack, which includes all the key features required of such a tool. It offers GUI-based experiences that can be scaled to a full DevOps pipeline.
Azure Data Factory: ADF includes its own simple scheduling and orchestrating tool. While not as developed as Logic Apps, it can be a valid choice when the orchestration is limited to simple scheduling, core data activities (ADF pipelines, Databricks jobs, SQL procs, etc.) and basic REST calls.
Airflow: Airflow is a code-oriented scheduler, allowing people to create complex workflows with many tasks. It is mainly used to execute Python or Spark tasks. It is seamlessly integrated into the Data Portal, so it is a good choice if you keep your data in the data lake and want a single entry point to monitor both your data and your processes.
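As a flavor of what such a code-oriented workflow looks like, here is a configuration sketch of an Airflow DAG chaining an extract task and a processing task. It assumes Airflow 2.x; the DAG id, schedule and task callables are hypothetical, not taken from any actual pipeline in this document.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...  # pull data from the source into the data lake


def transform():
    ...  # trigger the Spark/Databricks processing job


with DAG(
    dag_id="project_daily_load",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    land = PythonOperator(task_id="land_to_datalake", python_callable=extract)
    process = PythonOperator(task_id="process_with_spark", python_callable=transform)

    # Programmatic control flow: processing only runs after a successful landing.
    land >> process
```

Because the whole workflow is Python, it versions, reviews and deploys like any other code, which is the main appeal over GUI-based schedulers.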
OUR VERDICT: Logic Apps as the go-to solution in Azure; Airflow for data science projects in AWS; ADF's own scheduler for simple scheduling needs.
17. Feature comparisons for orchestration
Here are the key differentiators for orchestrating solutions: ADF, Airflow and Logic Apps. The compared features are: CI/CD; scheduling and triggering; native connectors; debugging; ease of development; alerting & monitoring; parameterization; control flow; interfacing with on-prems assets; secrets management.
18. Architecture templates
Whilst not the only considerations to take into account, architectures can be broadly segmented by the volume of data they handle and the complexity of ETL and model they must support. Based on this, we've defined template architectures to guide you through the design process:
- Hulk: when you need pure muscle-power (Ex: CDB Reporting, Datalens). Any sources, large data volumes, global interest, complex ETL, complex model, high demand.
- Thor: simple reporting over large data (Ex: Radarly, Digital Dashboard). Any sources, large data volumes, global interest, complex ETL, simple model, medium demand.
- Iron Man: complex reporting over light data (Ex: Budgit). Any sources, small data volumes, local interest, complex ETL, complex model, medium demand.
- Hawkeye: quick and agile project for POCs or short-lived needs (Ex: Weezevent). Any sources, small data volumes, local interest, simple ETL, medium model, medium demand.
19. Hulk
When you need pure muscle-power
Best-used for: any sources, large data volumes, global interest, complex ETL, complex model, high demand.
Components: ADF handles transporting; Logic Apps orchestrates; Databricks (pySpark) and ADF (dataflow) handle processing; storage uses ADLS and a project data mart (project resources) plus the Core DB (shared resource); AAS performs live calculation; data access goes through PBI Embedded.
20. Hulk – Step 1
Data is landed from S3 to ADLS gen2 via an ADF pipeline. This ensures a fast, bottleneck-free landing phase. Due to the volume, an incremental loading approach is highly recommended to limit the impact on an on-prems IR gateway and the throughput to the SQL DB. ADF scheduling could have been used in simple scenarios; however, for uniformity's sake, and to benefit from extra alerting capabilities, Logic Apps is preferred as the overall scheduler.
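The incremental approach described above typically relies on a persisted watermark. A simplified Python sketch of the idea (the `modified_at` column name and the in-memory rows are hypothetical stand-ins for the real source table):

```python
from datetime import datetime


def incremental_extract(rows, watermark):
    """Keep only the rows modified since the last successful load and
    compute the new watermark to persist for the next run."""
    fresh = [r for r in rows if r["modified_at"] > watermark]
    new_watermark = max((r["modified_at"] for r in fresh), default=watermark)
    return fresh, new_watermark


source_rows = [
    {"id": 1, "modified_at": datetime(2020, 1, 1)},
    {"id": 2, "modified_at": datetime(2020, 1, 3)},
]
delta, wm = incremental_extract(source_rows, datetime(2020, 1, 2))
# delta holds only row 2; wm moves forward to 2020-01-03
```

Only the delta crosses the gateway on each run, which is why this pattern protects both the IR gateway and the target database's throughput.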
21. Hulk – Step 2
Databricks or ADF is used to perform complex ETL over a large dataset by leveraging its Spark SQL engine in Python. It fetches the landed data from ADLS and enriches it with curated data from the Core DB to perform the complex ETL.
22. Hulk – Step 3
Any group data that can be used in an overall curated dimensional model useful for other projects is pushed back to the Core DB. This database is thus enriched, project by project, with easy-to-use, vouched-for datasets. Data that is purely report-specific is pushed to a project data mart.
23. Hulk – Step 4
Due to the size of the reporting dataset, the complexity of its model and KPIs, and the expected reporting performance, an AAS cube is used as the main reporting engine. The final report is built on top of this cube and exposed in the Data Portal through Power BI Embedded.
24. Hulk – Step 5
Subsidiaries that wish to do so can access the AAS cube to make their own custom reports and/or fetch the dimensional data through the Core DB by using custom views for security purposes.
26. Thor – Step 1
Simple reporting over large data
The overall architecture resembles a Hulk-like scenario due to the volume of data, with composite models replacing AAS for live calculation. Transformations are performed in Databricks or ADF despite the fact that the ETL is rather simple, because data volumes can overwhelm a single-machine architecture and/or the throughput of the target database.
27. Thor – Step 2
Two options are available for reporting calculations. The simplest one is to use an AAS cube: potentially more expensive in terms of software, but simpler in development. The alternative is to use PBI models with aggregations, if the KPIs are simple enough to hit the aggregations on a regular basis. However, this complexifies the data model (thus development time), can hurt performance when the query is passed through to the source, and incurs costs on the data mart layer (using Snowflake because of its per-query costing model).
29. Iron Man
Complex reporting over light data
Best-used for: any sources, medium data volumes, local interest, complex ETL, complex model, medium demand.
Components: ADF (dataflow) handles transporting; Logic Apps orchestrates; SQL procedures handle processing; storage is a project data mart; a PBI model performs live calculation; data access goes through PBI Embedded.
30. Iron Man – Step 1
This architecture is designed for "tactical projects" where data sharing is not paramount and data volumes are low, but which may still necessitate a fair amount of business rules and data cleansing. The low volume means we can write raw data directly to the database without a file-based landing in a data lake.
31. Iron Man – Step 2
The complex ETL can then be implemented in SQL stored procedures within the database itself. For simplicity's sake, the orchestration in Logic Apps launches ADF, and ADF launches the procs after the landing. This reduces the complexity of the Logic Apps code needed to handle long-running procedures.
33. Iron Man - AWS
Complex reporting over light data
Best-used for: cloud sources, small data volumes, local interest, complex ETL, complex model, medium demand.
Components: Amazon S3 holds the sources; Airflow orchestrates; a PR Python Operator or Databricks (pySpark) handles transporting and processing; Snowflake provides storage; live calculation and data access go through PBI Embedded (Import).
34. Hawkeye
Quick project for POCs or short-lived needs
Best-used for: any sources, small data volumes, local interest, simple ETL, medium model, medium demand.
Components: PBI dataflows handle transporting and processing; PBI scheduling orchestrates; a PBI model performs live calculation; data access goes through PBI Embedded.
35. Hawkeye – Step 1
A fully self-service Power BI stack is possible on small data volumes and low ETL complexity. This architecture should be kept to proofs of concept or temporary projects where time-to-market is paramount and maintainability is not required. While PBI dataflows are able to handle larger and larger data volumes with premium capacities, current price-performance ratios are highly suboptimal on larger data volumes.
36. Hawkeye – Step 2
Power Query and the M language are capable of handling low-to-medium levels of complexity in the ETL. However, they currently lack the life-cycle tooling (version control, automated deployments, etc.) that professional development requires.
37. Hawkeye – Step 3
A major roadblock preventing this architecture from being deployed beyond POCs and temporary projects is the current lack of integration with external storage systems. The only current possibilities are tightly integrated with the Common Data Model initiative in ADLS, which has yet to prove its viability beyond Dynamics 365.
38. Hawkeye – Step 4
The calculation and data access are of course done in Power BI directly. Here again, keep data volumes to a minimum. While premium capacities can accommodate larger and larger volumes, the price-performance ratios are downright disastrous compared to alternatives (AAS cubes and composite models).
39. Hawkeye
Quick project for POCs or short-lived needs
Best-used for: cloud sources, small data volumes, local interest, simple ETL, simple model, medium demand.
Components: Amazon S3 holds the sources; Airflow orchestrates; a PR Python Operator handles transporting and processing; Snowflake provides storage; Superset handles live calculation and data access.
40. Where?
The Core DB is present in the Hulk and Thor templates. It is a single database used by several projects: a shared resource, while the rest of each architecture (ADF, Databricks, ADLS, data mart, AAS, PBI Embedded) belongs to project resources.
41. What is Core DB?
Core DB contains global information that can be used for all affiliates and across several projects. It is a central repository that is gradually built from widely used business data (e-commerce, prisma, websites, consumer activities, etc.).
Data flows from the sources into the data lake, then into the Core DB (an EDW holding ref/MDM data, with data quality applied, 3rd Normal Form modelling and conformed dimensions), and finally into project data marts (star schemas). This widely used pattern allows:
- Consistency across projects
- Better quality and overall data management
- Lower project costs through reuse of validated assets
42. What is Core DB?
Core DB contains global information that can be used for all affiliates and across several projects:
• Common dimensions and reference data:
• products,
• entities,
• contacts
• geography
• …
• Widely used business events:
• e-commerce orders,
• prisma data,
• website views,
• consumer activities
• …
43. What is Core DB?
The model is business-oriented, not source-oriented. It has its own IDs to allow cross-source identification: for the same business item (for example a contact) it can ingest data from several sources (CDB, client database, employee database…). Therefore, the model and schemas do not depend on the sources.
A Core DB contact record thus carries its own Contact ID alongside source fields such as CDB_id, Email, City and Score.
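A toy sketch of this cross-source identification; matching on a normalized email is a simplifying assumption for the example (real matching rules are usually richer), and the field names mirror the contact record above.

```python
def assign_contact_ids(records):
    """Assign each distinct business contact its own Core DB contact ID
    (keyed here on a normalized email), regardless of the source system."""
    ids = {}      # matching key -> Core DB contact ID
    merged = []
    for rec in records:
        key = rec["email"].strip().lower()
        if key not in ids:
            ids[key] = len(ids) + 1
        merged.append({"contact_id": ids[key], **rec})
    return merged


rows = [
    {"source": "CDB", "email": "Ada@example.com", "city": "Paris"},
    {"source": "client_db", "email": "ada@example.com ", "score": 42},
]
contacts = assign_contact_ids(rows)
# both records resolve to contact_id 1
```

Because the Core DB ID is generated here rather than inherited from any source, the model and schemas stay independent of the sources, as described above.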
44. Core DB requirements
Because of its multi-project nature, Core DB has special requirements in terms of data management and development practices:
- The data model is managed by a central data architect
- Changes must be handled by pull request on a central repository
- Permissions have to be managed granularly per user and asset
- Access must be granted through objects (views, procs) which can be part of an automated testing pipeline
Development is project-based, with branching, pull requests backed by automated testing between DEV, TEST and PROD, on-demand CI/CD builds, and a daily rebuild & sanitizing.
45. Datalake vs Core DB
Core DB approach:
• High cost of input (ETL)
• Low cost of output
• Structured data
• Business-event oriented
Recommendation: only common data
Data lake approach:
• Low cost of input
• High cost of output (data prep)
• Miscellaneous data
• "Find signal in the noise"
Recommendation: all data