CI/CD for a Data Platform
How to enable consistent data pipelines
2
Your Host
| Koen Rottiers
| Senior Consultant @ Codit
| 9 years in IT, track record in networking
and infrastructure
| Combining people, business and
technology
CI/CD for a Data Platform: How to enable consistent data pipelines
@KoenRottiers
Agenda
| A Data Platform?
| What is Azure Data Factory?
| The Data Lake architecture
| Why do CI/CD for a Data Platform?
| Azure Data Factory Git integration
3
A Data Platform?
4
Data Platform overview
5
| Ingestion from different sources
| Centralized data store
| Data flows through
| Output curated data
| Multiple inputs and
outputs
What is Azure Data Factory?
6
Azure Data Factory
7
| Orchestrator
| Connectors to different data sources
| Cloud and on-premises
| Mapping data flows
| Wrangling data flows
| External compute integration
| Databricks
| Azure ML
| Azure Functions
| ....
Place in the data platform
8
The Data Lake architecture
9
High-Level Architecture
10
On-Premises
Other Azure Resources
Azure DevOps Project
for DataLake infra and
code
DB
DB
File Server
ExpressRoute
vNet Integrated
External Connections/
Sources/Destinations
Transformation
Rg-bru-{env}-datalake-001
App-bru-{env}-
{action}-datalake-001
la-bru-{env}-{action}-
datalake-001
Kb-bru-{env}-datalake-001
Stabru{env}landingdatalake001
mi-bru-{env}-datalake-001
Stabru{env}rawdatalake001
Stabru{env}curateddatalake001
Stabru{env}outputdatalake001
df-bru-{env}-datalake-001
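The resource names on this slide share one pattern per resource, with `{env}` and `{action}` as placeholders. A minimal Python sketch of expanding those patterns for a single environment could look as follows. The patterns are copied from the slide as written; the function name `resource_names` and the example values `dev` and `ingest` are assumptions for illustration only.

```python
from __future__ import annotations

# Name patterns copied from the slide; {env} and {action} are its placeholders.
PATTERNS = [
    "Rg-bru-{env}-datalake-001",
    "App-bru-{env}-{action}-datalake-001",
    "la-bru-{env}-{action}-datalake-001",
    "Kb-bru-{env}-datalake-001",
    "Stabru{env}landingdatalake001",
    "mi-bru-{env}-datalake-001",
    "Stabru{env}rawdatalake001",
    "Stabru{env}curateddatalake001",
    "Stabru{env}outputdatalake001",
    "df-bru-{env}-datalake-001",
]


def resource_names(env: str, action: str) -> list[str]:
    """Fill the placeholders of every pattern for one environment.

    Patterns without an {action} placeholder simply ignore that argument.
    """
    return [pattern.format(env=env, action=action) for pattern in PATTERNS]


if __name__ == "__main__":
    # "dev" and "ingest" are hypothetical values, not taken from the slides.
    for name in resource_names("dev", "ingest"):
        print(name)
```

Generating the names from one list keeps every environment on the same convention, which is the consistency a deployment pipeline relies on.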
Self-Hosted integration runtimes
11
On-Premises
Azure Networks
DB
File Server
ExpressRoute
vNet Peering
df-bru-{env}-datalake-001
Hub Network
Self-Hosted Runtime
Azure Integration Runtime
DB
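In Data Factory, a linked service selects its integration runtime through a `connectVia` reference. When the reference is omitted, the default Azure integration runtime is used. The sketch below builds such a definition as a plain Python dictionary for a SQL Server source. The function name, the linked service names, the runtime name `SelfHostedRuntime`, and the connection string placeholders are hypothetical; in practice the connection string would normally come from a secret store rather than from code.

```python
import json
from typing import Optional


def sql_server_linked_service(
    name: str,
    connection_string: str,
    self_hosted_runtime: Optional[str] = None,
) -> dict:
    """Build a linked service definition for a SQL Server source.

    If self_hosted_runtime is given, the linked service connects through
    that runtime (needed for on-premises sources). Otherwise connectVia is
    left out and the default Azure integration runtime applies.
    """
    properties = {
        "type": "SqlServer",
        "typeProperties": {"connectionString": connection_string},
    }
    if self_hosted_runtime is not None:
        properties["connectVia"] = {
            "referenceName": self_hosted_runtime,
            "type": "IntegrationRuntimeReference",
        }
    return {"name": name, "properties": properties}


if __name__ == "__main__":
    # All names and connection strings here are placeholders.
    on_prem = sql_server_linked_service(
        "ls_onprem_db", "<connection string>", "SelfHostedRuntime"
    )
    print(json.dumps(on_prem, indent=2))
```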
Why do CI/CD for a Data Platform?
12
Data Platform Roles and Responsibilities
13
- Data platform owner: This person owns the overall data platform and is responsible for it.
- Data platform operator: This role is responsible for the day-to-day operational tasks of the platform.
- Data pipeline owner: Different pipelines will be running on the platform. Each pipeline has its own
purpose and therefore its own specific owner. This is someone from the BI team or the business.
- Data pipeline developer: This person develops new pipelines or makes adjustments to
existing ones.
- Data source owner: Different data sources will be integrated with the data platform. Every data source
needs an owner who determines access rights, the manner of access, and so on. This person is responsible
for the data residing in the source system. Most of the time this will be the application owner of the
application that uses the data source.
Key Advantages
14
| Consistent deployment of data pipelines
| Full testing of data flows in the Data Lake
| Better collaboration
| Feature development tracking
| Pipeline quality reviews
| More fine-grained data security
| Tracking data movements
Azure Data Factory Git integration
15
Data Factory Git Integration
16
Repos and branches
17
What does it look like?
18
Azure DevOps – Infra Git Repository
19
Azure DevOps – Pipelines Git Repository
20
Azure DevOps – Pipelines
21
Azure Data Factory – Git Integration
22
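As a rough illustration of what the Git integration amounts to in configuration terms, the sketch below builds the `repoConfiguration` block that a factory's ARM definition uses for an Azure DevOps repository. The organisation, project, repository, and branch values are hypothetical placeholders. Connecting only the `dev` factory to Git and deploying the other environments through release pipelines is an assumption here (a common setup), not something stated on the slides.

```python
from typing import Optional


def repo_configuration(env: str) -> Optional[dict]:
    """Return the Git settings for a factory, or None if it is not Git-connected.

    Assumption: only the dev factory is linked to the repository; other
    environments receive their content through deployment pipelines.
    """
    if env != "dev":
        return None
    return {
        "type": "FactoryVSTSConfiguration",     # Azure DevOps Git
        "accountName": "my-devops-org",         # hypothetical organisation
        "projectName": "DataLake",              # hypothetical project
        "repositoryName": "datalake-pipelines", # hypothetical repository
        "collaborationBranch": "main",          # hypothetical branch name
        "rootFolder": "/",
    }


if __name__ == "__main__":
    for env in ("dev", "tst", "prd"):  # hypothetical environment names
        print(env, repo_configuration(env))
```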
So why?
23
| Let data engineers/data scientists focus on delivering value and insights to the
business
| Enable an agile process in data engineering
| Consistency across environments
| Track feature development / bug fixing
| Be able to audit your data streams
Do you want a demo?
Feel free to reach out to us.
24
Q&A
25

Editor's Notes

  • #12 A third option has recently become available: you can fully integrate your data factory into your vNet.