Data Engineering Guide 101
15-Nov-2024
Usman Khan
© Copyright Microsoft Corporation. All rights reserved.
Hello!
Thank you for joining me today
Instructor: Usman Khan
Senior Data Engineer / Data Architect
📍Dusseldorf, Germany
instagram.com/usmankhandev
linkedin.com/in/usmanniazi99
Let’s have a great time together
We all contribute to a great session
• Be Interactive
• Take notes
• Be present (avoid mail, calls, side meetings)
• Ask questions at the end
What you should know about the session
• DWH Framework
• Problem Solving
• Solution Agnostic
• Processes & Systems
• Short Networking after the Session
Session objectives
In this course, you’ll learn how to implement end-to-end analytics
solutions with Microsoft Fabric. You’ll learn how to:
1. Understand DWH in detail
2. Plan, implement, and manage a solution for data analytics
3. Understand Data Architectures
4. Implement and manage Data Processes
Data Engineering?
General Process
Choose your career path → Obtain education → Gain experience → Get certified → Build your portfolio → Apply for jobs
Fundamentals
Databases
Operational Databases (OLTP)
• An operational database is a type of database designed to support the day-to-day transactions
and data processing needs of an organization. Operational databases store and manage data
related to ongoing business operations, such as sales transactions, customer information,
inventory levels, and financial records.
• Operational databases are designed for OLTP (Online Transaction Processing) workloads, which typically involve many users executing a high number of small, atomic transactions. In other words, operational databases are optimized for high-throughput transactional processing of records (see the sketch below).
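To make the idea of a small atomic transaction concrete, here is a minimal sketch using Python's built-in sqlite3 module; the tables, columns, and values are illustrative only, not part of any particular system.

```python
import sqlite3

# Minimal sketch of an atomic OLTP transaction: record a sale and adjust
# inventory together, or not at all. Tables, columns, and values are
# illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, stock INTEGER)")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY AUTOINCREMENT, product_id INTEGER, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES (1, 100)")

with conn:  # opens a transaction; commits on success, rolls back on any error
    conn.execute("INSERT INTO sales (product_id, qty) VALUES (?, ?)", (1, 2))
    conn.execute("UPDATE inventory SET stock = stock - ? WHERE product_id = ?", (2, 1))

print(conn.execute("SELECT stock FROM inventory WHERE product_id = 1").fetchone())  # (98,)
```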
Data Warehousing
Features of a DWH
• Subject Oriented
• A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on ongoing operations; rather, it focuses on modelling and analysis of data for decision making.
• Integrated
• A data warehouse is constructed by integrating data from heterogeneous sources such as relational
databases, flat files, etc. This integration enhances the effective analysis of data.
• Time Variant
• The data collected in a data warehouse is identified with a particular time period. The data in a data
warehouse provides information from the historical point of view.
• Non-volatile
• Non-volatile means that previous data is not erased when new data is added. A data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.
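One way to see "time variant" and "non-volatile" in practice: warehouse tables are typically appended to with a snapshot or load date rather than updated in place. A minimal sketch, assuming a hypothetical customer_snapshot table:

```python
import sqlite3

# Sketch of a non-volatile, time-variant table: rows are appended with a
# snapshot_date and never overwritten, so history is preserved.
# The table, columns, and dates are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_snapshot (customer_id INTEGER, city TEXT, snapshot_date TEXT)")

def load_snapshot(rows, snapshot_date):
    conn.executemany(
        "INSERT INTO customer_snapshot VALUES (?, ?, ?)",
        [(cid, city, snapshot_date) for cid, city in rows],
    )

load_snapshot([(1, "Berlin")], "2024-01-01")
load_snapshot([(1, "Duesseldorf")], "2024-02-01")  # customer moved; the old row is kept

for row in conn.execute("SELECT * FROM customer_snapshot ORDER BY snapshot_date"):
    print(row)
```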
Understanding the Data Warehouse
• A data warehouse is a database, which is kept separate from the organization's operational
database.
• There is no frequent updating done in a data warehouse.
• It possesses consolidated historical data, which helps the organization to analyze its business.
• A data warehouse helps executives organize, understand, and use their data to make strategic decisions.
• Data warehouse systems help integrate a diversity of application systems.
• A data warehouse system helps in consolidated historical data analysis.
Differences
Terminologies
Metadata
•Metadata is a road map to the data warehouse.
•Metadata in a data warehouse defines the warehouse objects.
•Metadata acts as a directory. This directory helps the decision support system to locate the contents of a data warehouse.
Data Cube
• According to the formal definition, a data cube refers to a multi-dimensional data
structure. That is, data within the data cube is explained by specific dimensional
values.
• A data cube helps us represent data in multiple dimensions. It is defined by
dimensions and facts. The dimensions are the entities with respect to which an
enterprise preserves the records.
Examples
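A simple way to make the cube idea concrete is a pivot over two dimensions with one fact measure. The sketch below uses pandas with made-up data; the dimensions (region, quarter) and the measure (sales) are assumptions for illustration.

```python
import pandas as pd

# Toy fact rows described by two dimensions (region, quarter) and one measure (sales).
facts = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [100, 120, 80, 95],
})

# A two-dimensional slice of the cube: sales aggregated by region x quarter.
cube = facts.pivot_table(index="region", columns="quarter", values="sales", aggfunc="sum")
print(cube)
```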
Data Marts
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is a departmentally structured data warehouse.
• Data marts are flexible.
Schema
•Schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also needs to maintain a schema. A database uses the relational model, while a data warehouse uses a Star, Snowflake, or Fact Constellation schema (see the sketch below). Some examples are:
•Star Schema
•Snowflake Schema
•Fact Constellation Schema
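To make the star schema concrete, here is a minimal sketch of one fact table with two dimension tables, created through Python's sqlite3; all table and column names are assumptions, not a prescribed design.

```python
import sqlite3

# Minimal star schema sketch: one fact table referencing two dimension tables.
# All table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,
        full_date TEXT,
        year      INTEGER,
        month     INTEGER
    );

    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name        TEXT,
        category    TEXT
    );

    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        revenue     REAL
    );
""")
```

A snowflake schema would further normalize the dimensions (e.g., moving category into its own table), while a fact constellation shares dimension tables across several fact tables.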
Types of Data Warehouse
1. Enterprise Data Warehouse (EDW)
2. Operational Data Store (ODS)
3. Data Mart
4. Modern Data Warehouse (MDW)
Data Warehouses
•Snowflake
•Redshift
•Teradata
•SQL Server
•Google BigQuery
•Synapse
•Postgres
•Cloudera
ETL
Extraction Sources
• SQL Databases
• NoSQL Databases
• Flat Files
• APIs
• Data Streams
• Log Files
• Emails
• Webhooks
• CRMs
• ERPs
• Events
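For illustration, a hedged sketch of extracting from two of the source types above, a flat file and a REST API; the file path and endpoint URL are placeholders, not real resources.

```python
import csv
import json
from urllib.request import urlopen

def extract_orders_from_csv(path="exports/orders.csv"):
    """Flat-file extraction: read raw rows from a CSV export (path is a placeholder)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_customers_from_api(url="https://example.com/api/customers"):
    """API extraction: fetch JSON records from a REST endpoint (URL is a placeholder)."""
    with urlopen(url) as resp:
        return json.load(resp)
```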
Transformation Activities
•Cleaning and Standardization
•Verification and Validation
•Filtering and Sorting
•De-duplication
•Data audits
•Calculations, Translations
•Formatting
•Data encryption, protection
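A minimal pandas sketch of a few of these activities (cleaning, standardization, type validation, de-duplication, and a derived calculation) on made-up records:

```python
import pandas as pd

# Made-up raw records with the usual problems: whitespace, inconsistent casing,
# string-typed numbers, a missing value, and a duplicate.
raw = pd.DataFrame({
    "customer": [" Alice ", "BOB", "Bob", None],
    "country":  ["de", "DE", "de", "us"],
    "amount":   ["10.5", "20", "20", "7"],
})

df = raw.dropna(subset=["customer"]).copy()              # cleaning: drop incomplete rows
df["customer"] = df["customer"].str.strip().str.title()  # standardization
df["country"] = df["country"].str.upper()
df["amount"] = df["amount"].astype(float)                # validation / type conversion
df = df.drop_duplicates(subset=["customer", "amount"])   # de-duplication
df["amount_eur"] = df["amount"] * 0.92                   # calculation (made-up conversion rate)
print(df)
```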
Loading
In this final step of the ETL process, the transformed data is loaded into its target destination, which can be a simple database, a data warehouse, or a data mart. The size and complexity of the data, along with specific organizational needs, determine the nature of the destination.
The load process can be:
•Full loading – typically performed for the initial load, or when rebuilding the target (e.g., for disaster recovery)
•Incremental loading – loading only the data that is new or has been updated since the last run (see the sketch below)
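Incremental loading is often implemented with a high-water mark: only rows newer than the latest timestamp already in the warehouse are pulled and appended. A minimal sketch with hypothetical source and warehouse tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (order_id INTEGER, updated_at TEXT);
    CREATE TABLE dwh_orders   (order_id INTEGER, updated_at TEXT);
    INSERT INTO source_orders VALUES (1, '2024-01-01'), (2, '2024-03-01');
""")

def incremental_load():
    # High-water mark: the newest timestamp already present in the warehouse table.
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM dwh_orders"
    ).fetchone()
    new_rows = conn.execute(
        "SELECT order_id, updated_at FROM source_orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    conn.executemany("INSERT INTO dwh_orders VALUES (?, ?)", new_rows)
    return len(new_rows)

print(incremental_load())  # first run behaves like a full load: 2 rows
print(incremental_load())  # nothing new beyond the watermark: 0 rows
```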
ELT
ETL Tools
• Azure Data Factory
• SSIS
• dbt
• Talend
• Airflow
• Oracle Data Integrator
• AWS Glue
• Matillion
• Fivetran
• Apache Spark
• NiFi
Additional Concepts
Data Modelling
Cloud – AWS / Azure
DWH Architecture
Roles within Data Engineering
Roles Within Data Engineering
•Database Administrator
•Data Warehouse Architect / Engineer
•Data Governance
•Data Quality
•Data Architect
•Data Platform Engineer
•Data Infrastructure Engineer
•Data as a Service
•DataOps
•Cloud Data Engineer & Much More
Skills & Qualifications
Skills that you should have
• Python
• SQL
• Cloud – Any
• DWH
• ETL Tool
• Data Modelling
• Communication Skills*
• Excel*
How to Learn Data Engineering?
Learning & Trainings
•Data Talks Club Project
•AWS Data Engineer Certification
•FreeCodeCamp – YouTube
•DataCamp
•365 Data Science
•Kaggle
•GitHub
•Coursera
•Airflow, DBT, Docker, Terraform
Job Market
What should you do?
• Explore things
• Build, Build, Build.
• Contribute to Open Source. Push to GitHub
• Write Medium/Dev.to Blogs
• Make a Digital Presence
• Network
• Don’t go after the trends. Find what you love.
• ASK!!
• Get Domain Knowledge
• Build a Good Resume.
Questions
Thank you
