Fundamentals of Data Engineering
iabac.org
What is Data Engineering?
The process of creating, developing, and managing systems for data collection, storage, and processing.
Ensures that data is accurate, readily available, and ready for analysis.
Connects raw data to useful discoveries.
Key Components of Data Engineering
Data Collection – Gathering information from multiple sources.
Data Storage – Utilizing databases, data lakes, and warehouses for storage.
Data Processing – Converting raw data into formats that can be used effectively.
Data Workflow Orchestration – Streamlining the movement of data through automation.
Data Governance & Security – Maintaining compliance and safeguarding data integrity.
Data Engineering vs Data Science
Data Engineering
Key Tools: SQL, Spark, Airflow
Goal: Reliable data for analytics
Data Science
Focus: Analysis & modeling
Key Tools: Python, ML libraries
Goal: Insights & predictions
Data Storage Technologies
Relational (SQL) databases (PostgreSQL and MySQL).
NoSQL databases (MongoDB and Cassandra).
Data warehouses (BigQuery, Snowflake, and Redshift).
Data lakes (S3 and Delta Lake).
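To make the relational versus NoSQL distinction concrete, here is a minimal sketch using only the Python standard library. It assumes an in-memory SQLite database as a stand-in for a relational store and a list of JSON strings as a stand-in for a document store; the table, field names, and sample records are hypothetical and do not come from any specific product listed on the slide.

```python
import json
import sqlite3


def relational_demo() -> list[str]:
    """Relational style: fixed schema, queried with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO users (id, name, country) VALUES (?, ?, ?)",
        [(1, "Ada", "UK"), (2, "Lin", "SG")],
    )
    rows = conn.execute(
        "SELECT name FROM users WHERE country = ? ORDER BY id", ("UK",)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


def document_demo() -> list[str]:
    """Document style: each record is self-describing and fields may differ."""
    stored = [
        json.dumps({"id": 1, "name": "Ada", "tags": ["admin"]}),
        json.dumps({"id": 2, "name": "Lin"}),  # no "tags" field at all
    ]
    docs = [json.loads(raw) for raw in stored]
    return [doc["name"] for doc in docs if "admin" in doc.get("tags", [])]
```

The relational version rejects rows that do not fit the declared columns, while the document version tolerates a missing field and leaves handling it to the reader of the data.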
Data Processing Frameworks
Batch Processing (ETL) – for example, Apache Spark and Hadoop.
Stream Processing – for example, Apache Kafka and Flink.
Hybrid Approaches – Combining batch and real-time processing.
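The difference between batch and stream processing can be illustrated without any framework. The sketch below is plain Python, not Spark, Kafka, or Flink code; the event shape (`user`, `amount`) and the function names are hypothetical. The batch function sees the whole dataset and returns one result, while the stream function updates its result as each event arrives.

```python
from typing import Iterable, Iterator


def batch_totals(events: list[dict]) -> dict[str, float]:
    """Batch: process a complete, bounded dataset and return one result."""
    totals: dict[str, float] = {}
    for event in events:
        totals[event["user"]] = totals.get(event["user"], 0.0) + event["amount"]
    return totals


def stream_totals(events: Iterable[dict]) -> Iterator[dict[str, float]]:
    """Stream: consume events one at a time and emit a running result."""
    totals: dict[str, float] = {}
    for event in events:
        totals[event["user"]] = totals.get(event["user"], 0.0) + event["amount"]
        yield dict(totals)  # snapshot after each event


EVENTS = [
    {"user": "a", "amount": 10.0},
    {"user": "b", "amount": 7.0},
    {"user": "a", "amount": 5.0},
]
```

A hybrid approach would typically use the streaming path for fresh, approximate results and the batch path to recompute authoritative totals over the full history.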
Data Pipeline Orchestration
Workflow automation tools: Apache Airflow,
Prefect, Dagster.
Steps in a pipeline:
1. Data ingestion
2. Cleaning & transformation
3. Storage & indexing
4. Delivery to consumers
Data Governance & Security
Data Quality – Validation, deduplication, and anomaly detection.
Security – Encryption and access control (IAM).
Compliance – GDPR, HIPAA, SOC 2.
Metadata Management – Classifying data so that it is discoverable.
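The three data quality checks named on this slide can be sketched in a few lines of standard-library Python. The record shape (`id`, `value`) is hypothetical, and the anomaly rule shown here, a modified z-score based on the median absolute deviation with a threshold of 3.5, is one common choice assumed for illustration, not a method the slides prescribe.

```python
from statistics import median


def validate(records: list[dict]) -> list[dict]:
    """Keep only records with an integer id and a numeric value."""
    return [
        r for r in records
        if isinstance(r.get("id"), int) and isinstance(r.get("value"), (int, float))
    ]


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each id."""
    seen: set[int] = set()
    unique = []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            unique.append(r)
    return unique


def find_anomalies(records: list[dict], threshold: float = 3.5) -> list[dict]:
    """Flag values whose modified z-score exceeds the threshold."""
    if not records:
        return []
    values = [r["value"] for r in records]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [r for r in records if 0.6745 * abs(r["value"] - med) / mad > threshold]
```

Running validation before deduplication means malformed records never reach the later steps, so the anomaly check only sees values it can compare.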
Tools & Technologies in Data Engineering
Data Storage: PostgreSQL, MongoDB, and Snowflake.
Processing: Spark, Flink, and dbt.
Orchestration: Airflow and Prefect.
Cloud Platforms: AWS, GCP, and Azure.
Thank You
Visit: iabac.org
