data_engineering_basics.pdf

  1. DEMYSTIFYING DATA ENGINEERING BASICS & GETTING STARTED
  2. Source: The AI Hierarchy of Needs - Monica Rogati
  3. TYPICAL ARCHITECTURE/BLUEPRINT
  4. Natural Language Processing, Artificial Intelligence, Machine Learning and Deep Learning needs a strong Data foundation.
  5. Where to begin? Without that foundation there is nothing to build on: just a huge mess of raw data.
  6. DATA ENGINEERING
  7. ● Data engineers design and build pipelines that transform and transport data so that, by the time it reaches data scientists or other end users, it is in a highly usable state. These pipelines must take data from many disparate sources and collect it into a single warehouse that represents the data uniformly as a single source of truth.
     ● Designing, building, and scaling systems that organize data for analytics.
     ● Data engineers prepare the big-data infrastructure to be analyzed by data scientists.
     ● Data engineering is the process of designing and building systems that let people collect and analyze raw data from multiple sources and formats.
  8. SKILL SET
  9. Software Engineering + Cloud Computing + Big Data + Databases
  10. DISTINCT ROLES
  11. ROLES
      Data Engineer:
      ● Data engineers work in a variety of settings to build systems that collect, manage, and convert raw data into usable information for data scientists and business analysts to interpret.
      Data Scientist:
      ● They use linear algebra and multivariable calculus to create new insights from existing data.
      Business Analyst:
      ● Analysis and exploration of historical data → identify trends and patterns, understand the information → drive business change
  12. let’s talk about the specifics….
  13. ETL (EXTRACT, TRANSFORM, LOAD) the absolute core of Data Engineering
  14. ETL Process
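The extract → transform → load flow can be sketched in plain Python. This is a minimal illustration, not any particular tool's API; the CSV sample, field names, and in-memory "warehouse" list are all made up for the example.

```python
# Minimal ETL sketch: extract rows from a raw CSV source, transform them
# (cast types, drop invalid rows), and load them into a destination.
import csv
import io

RAW_CSV = """user_id,amount,country
1,10.50,US
2,not_a_number,IN
3,7.25,US
"""

def extract(raw):
    """Extract: parse the raw source into dict records."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(records):
    """Transform: enforce types and skip rows that fail validation."""
    clean = []
    for r in records:
        try:
            clean.append({"user_id": int(r["user_id"]),
                          "amount": float(r["amount"]),
                          "country": r["country"]})
        except ValueError:
            continue  # malformed row: dropped during transform
    return clean

def load(records, warehouse):
    """Load: append validated records to the destination table."""
    warehouse.extend(records)
    return warehouse

warehouse = load(transform(extract(RAW_CSV)), [])
print(warehouse)  # two valid rows; the malformed one is dropped
```

In a real pipeline each stage would talk to external systems (an API or file store on the extract side, a warehouse such as BigQuery or Snowflake on the load side), but the three-stage shape stays the same.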
  15. BIG DATA PROPERTIES
  16. V’s of BIG DATA
      ◾ Volume: how much data you have
      ◾ Velocity: how fast data is getting to you
      ◾ Variety: how different your data is
      ◾ Veracity: how reliable your data is
  17. DATA TYPES/CLASSIFICATION
  18. TYPES
      Unstructured/Raw data
      ● Unprocessed data in the format used at the source: text, CSV, images, video, etc.
      ● High latency
      ● No schema applied
      ● Stored in object stores such as Google Cloud Storage or AWS S3
      ● Tools like Snowflake and MongoDB provide their own ways to query unstructured data
      Structured/Processed data
      ● Raw data with a schema applied
      ● Stored in event tables/destinations in pipelines
      ● Analytics query language: ideally SQL-like
      ● Low-latency data ingestion
      ● Reads focus on large portions of the data
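The unstructured-vs-structured distinction above can be shown with one event in both forms. This is a hypothetical record; the field names are invented for illustration.

```python
# The same event as unstructured raw text vs. a schema-applied record.
import json
from datetime import datetime

# Unstructured: stored exactly as it arrived; any schema is applied
# only when someone reads it ("schema on read").
raw_event = '{"user": "42", "ts": "2024-01-01T00:00:00", "action": "click"}'

# Structured: schema applied on write - fixed fields, enforced types.
parsed = json.loads(raw_event)
structured = {
    "user_id": int(parsed["user"]),                  # string -> int
    "ts": datetime.fromisoformat(parsed["ts"]),      # string -> timestamp
    "action": parsed["action"],
}
print(structured["user_id"])  # 42, now a real integer
```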
  19. DATA PROCESSING METHODS
  20. BATCH PROCESSING
  21. STREAM PROCESSING Process data on the fly, as it comes in
  22. Batch vs Stream
      Data scope: batch processes all or most of the data set; stream processes a rolling window or the most recent records.
      Data size: batch handles large batches of data; stream handles individual records or micro-batches of a few records.
      Latency: batch completes in minutes to hours; stream responds in the order of seconds or milliseconds.
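The contrast in the comparison above can be sketched on the same toy data set: batch operates on the complete collection at once, while stream updates state as each record arrives.

```python
# Batch vs. stream on the same data (toy numbers, invented for illustration).
events = [3, 1, 4, 1, 5]

# Batch: one computation over the whole data set.
batch_total = sum(events)

# Stream: maintain running state, emitting a result after every record.
running = 0
stream_totals = []
for e in events:              # imagine each e arriving over time
    running += e
    stream_totals.append(running)

print(batch_total)            # 14
print(stream_totals[-1])      # 14: same answer, different latency profile
```

Both models reach the same final total; the difference is that the stream version had a usable partial answer after every single event.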
  23. PROCESSING FRAMEWORKS
  24. MAP REDUCE ● MapReduce is a processing technique and a programming model for distributed computing. ● The model consists of two important tasks: Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). ● The Reduce task takes the output from a Map as its input and combines those data tuples into a smaller set of tuples. As the order of the name MapReduce implies, the Reduce task is always performed after the Map task.
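The Map and Reduce tasks described above can be sketched as the classic word count. This single-process toy shows only the programming model; a real MapReduce framework would distribute the map, shuffle, and reduce phases across many machines.

```python
# Word count in the MapReduce style:
#   map     -> emit (word, 1) pairs for each input line
#   shuffle -> group pairs by key
#   reduce  -> sum the values in each group
from collections import defaultdict

def map_phase(line):
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big pipelines", "big data"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'pipelines': 1}
```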
  25. SPARK VS HADOOP
  26. DATA STORAGE
  27. Relational Database (SQL) Document Store (NoSQL)
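The two storage models above can be contrasted in a few lines. The relational side uses Python's built-in sqlite3; the document side is a plain dict standing in for a document store such as MongoDB. The table, collection, and record contents are invented for the example.

```python
# Relational: fixed schema, rows in tables, queried with SQL.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO users VALUES (1, 'Ada'), (2, 'Linus')")
rows = con.execute("SELECT name FROM users WHERE id = 1").fetchall()

# Document store: each record is a self-describing document;
# fields can vary per document instead of following a fixed table schema.
documents = {
    "1": {"name": "Ada", "tags": ["math"]},
    "2": {"name": "Linus"},          # no 'tags' field, and that's allowed
}
print(rows[0][0], documents["1"]["name"])  # Ada Ada
```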
  28. DEMO/POC
  29. REFERENCES
  30. The Data Engineering Cookbook https://github.com/andkret/Cookbook
  31. THANK YOU
  32. Connect: ● Ketan (LinkedIn) ○ Computer Science ‘24 Grad @ Michigan Tech ○ Ex-Data Engineer @ Abzooba: one of the top 50 best data science firms to work for in India, focused on developing high-quality analytics products and services using expertise in Big Data, Cloud, AI, and ML. ○ A constant learner