DEMYSTIFYING DATA
ENGINEERING
BASICS & GETTING STARTED
Source: The AI Hierarchy of Needs - Monica Rogati
TYPICAL ARCHITECTURE/BLUEPRINT
Natural Language Processing, Artificial Intelligence, Machine Learning, and
Deep Learning all need a strong data foundation.
Where to begin?
There is nothing! It's a huge mess.
DATA ENGINEERING
● Data engineers design and build pipelines that transform and transport
data into a format that is highly usable by the time it reaches data
scientists or other end users. These pipelines must take data from many
disparate sources and collect it into a single warehouse that represents
the data uniformly, as a single source of truth.
● Designing, building and scaling systems that organize data for analytics.
● Data Engineers prepare the Big Data infrastructure to be analyzed by Data
Scientists.
● Data engineering is the process of designing and building systems that let
people collect and analyze raw data from multiple sources and formats.
SKILL SET
Development (Software Engineering) + Cloud Computing + Big Data + Databases
DISTINCT ROLES
ROLES
Data Engineer:
● Data engineers work in a variety of settings to build systems that collect, manage, and convert raw
data into usable information for data scientists and business analysts to interpret.
Data Scientist:
● They use linear algebra and multivariable calculus to create new insights from existing data.
Business Analyst:
● Analysis and exploration of historical data → identify trends, patterns & understand the information →
drive business change
Let's talk about the specifics…
ETL (EXTRACT, TRANSFORM, LOAD)
the absolute core of Data Engineering
ETL Process
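A minimal sketch of the three steps in Python, assuming a hypothetical raw_events.csv with user_id, event, and amount columns, and SQLite standing in for the warehouse:

```python
# Minimal ETL sketch: extract rows from a CSV source, transform them,
# and load them into a SQLite "warehouse" table.
# The file name, column names, and table name are illustrative assumptions.
import csv
import sqlite3

def extract(path):
    """Extract: read raw records from the source file."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transform: clean and reshape records into the target schema."""
    for row in rows:
        yield (row["user_id"].strip(), row["event"].lower(), float(row["amount"]))

def load(records, db_path="warehouse.db"):
    """Load: write the transformed records into the warehouse."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS events (user_id TEXT, event TEXT, amount REAL)")
    con.executemany("INSERT INTO events VALUES (?, ?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("raw_events.csv")))
```

Chaining generators this way means each record flows through extract → transform → load without ever materializing the whole data set in memory.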
BIG DATA
PROPERTIES
V’s of BIG DATA
Volume
◾ How much data you have
Velocity
◾ How fast data is getting to you
Variety
◾ How different your data is
Veracity
◾ How reliable your data is
DATA
TYPES/CLASSIFICATION
TYPES
Unstructured/Raw data
● Unprocessed data in the format used at the source: text, CSV, images, video, etc.
● High Latency
● No schema applied
● Stored in Google Cloud Storage, AWS S3
● Tools like Snowflake and MongoDB provide their own ways to query unstructured data
Structured/Processed data
● Raw data with schema applied
● Stored in event tables/destinations in pipelines
● Analytics query language: ideally SQL-like
● Low latency data ingestion
● Read-focused: queries scan large portions of the data
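To make the distinction concrete, here is a small sketch of applying a schema on read; the raw log lines and their format are made-up examples, not a standard:

```python
# Sketch: turning unstructured/raw text into structured records by
# applying a schema on read. The log line format is a made-up example.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PageView:          # the "schema" we apply to raw lines
    ts: datetime
    user: str
    url: str

raw_lines = [
    "2024-05-01T12:00:00 alice /home",
    "2024-05-01T12:00:03 bob /pricing",
]

structured = [
    PageView(datetime.fromisoformat(ts), user, url)
    for ts, user, url in (line.split() for line in raw_lines)
]

for pv in structured:
    print(pv.user, pv.url)   # fields are now typed and queryable
```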
DATA
PROCESSING
METHODS
BATCH PROCESSING
Process data in large, accumulated chunks, on a schedule
STREAM PROCESSING
Process data on the fly, as it comes in
Batch vs Stream
Batch Processing
● Data scope: all or most of the data set
● Data size: large batches of data
● Latency: minutes to hours

Stream Processing
● Data scope: a rolling window or the most recent records
● Data size: individual records or micro-batches of a few records
● Latency: seconds or milliseconds
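A toy sketch of the difference, with an illustrative list of numbers standing in for incoming events and a rolling window of three records:

```python
# Sketch contrasting batch and stream processing over the same events.
# The events and the window size are illustrative assumptions.
from collections import deque

events = [3, 1, 4, 1, 5, 9, 2, 6]

# Batch: wait for the whole data set, then compute over all of it.
batch_avg = sum(events) / len(events)
print("batch average:", batch_avg)

# Stream: process each record as it arrives, keeping only a rolling window.
window = deque(maxlen=3)
for e in events:                      # imagine these arriving one by one
    window.append(e)
    print("rolling average:", sum(window) / len(window))
```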
PROCESSING
FRAMEWORKS
MAP REDUCE
● MapReduce is a processing technique and a programming model for
distributed computing.
● The model consists of two important tasks: Map and Reduce. Map takes a
set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
● The Reduce task takes the output from a Map as input and combines those
data tuples into a smaller set of tuples. As the name MapReduce implies,
the Reduce task is always performed after the Map job.
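Word count is the canonical MapReduce example; here it is sketched in plain Python. A real framework would distribute each phase across many machines, but the three phases are the same:

```python
# Word count as map -> shuffle -> reduce, run locally for illustration.
from collections import defaultdict

docs = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: each input record becomes a set of (key, value) tuples.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group all values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: combine each key's values into a smaller set of results.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'the': 3, 'quick': 1, ...}
```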
SPARK VS HADOOP
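For comparison, the same word count in Spark's RDD API (a sketch, assuming a local PySpark installation): the map, shuffle, and reduce phases collapse into a few chained calls, and intermediate results stay in memory rather than being written to disk between jobs as in classic Hadoop MapReduce.

```python
# The MapReduce word count above, expressed with PySpark.
from operator import add
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")
docs = sc.parallelize(["the quick brown fox", "the lazy dog", "the fox"])

counts = (docs.flatMap(lambda line: line.split())  # map: line -> words
              .map(lambda word: (word, 1))         # map: word -> (word, 1)
              .reduceByKey(add))                   # shuffle + reduce

print(counts.collect())   # [('the', 3), ('fox', 2), ...]
sc.stop()
```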
DATA STORAGE
Relational Database
(SQL)
Document Store
(NoSQL)
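A sketch of the same record in both models, with SQLite standing in for a relational database and a JSON document standing in for a store like MongoDB; the table and field names are illustrative:

```python
# Sketch: one "user" stored relationally (SQLite) and as a document (JSON).
import json
import sqlite3

# Relational: fixed schema, rows and columns, queried with SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
con.execute("INSERT INTO users VALUES (1, 'alice', 'Houghton')")
print(con.execute("SELECT name FROM users WHERE city = 'Houghton'").fetchall())

# Document: flexible schema, nested structure, stored as one document.
user_doc = json.dumps({
    "id": 1,
    "name": "alice",
    "city": "Houghton",
    "interests": ["data", "hiking"],   # nested data needs no extra table
})
print(user_doc)
```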
DEMO/POC
REFERENCES
The Data Engineering
Cookbook
https://github.com/andkret/Cookbook
THANK YOU
Connect:
● Ketan (LinkedIn)
○ Computer Science ‘24 Grad @ Michigan Tech
○ Ex-Data Engineer @ Abzooba: one of the top 50 best data science firms in
India to work for, focused on developing the highest-quality analytics
products and services using expertise in Big Data and Cloud, AI, and ML.
○ A constant learner
