Demystifying Data Engineering
AI in Production Meetup
About me
• Bob Bui (linkedin.com/in/thangbn/)
• Senior Data Engineer @ EquitySim: an AI-EdTech company building a financial
simulation platform.
• Previously
• Senior Software Engineer @ SAP Innovation Center Singapore: SAP
Leonardo Machine Learning.
• Also building a variety of software products.
Agenda
• Data engineering
• Revisit Data Science
• The need for data engineering & data engineers
• Common concepts
• Typical big data analytics architecture
Data science
Diagram: the data scientist extracts insight & knowledge from unstructured and structured data.
Maslow's Hierarchy of Needs
Source: https://www.simplypsychology.org/maslow.html
Source: The AI Hierarchy of Needs, Monica Rogati
AI and ML need a Strong Data Foundation
But what if … there is nothing, OR there is a big mess?
Data Engineer is
someone who prepares the big data infrastructure to be analyzed by Data Scientists.
Data Engineer's skill set
Diagram: spans BI, big data, software engineering, and data warehousing.
Skills: Data Scientist vs Data Engineer
Both roles compared across: Programming, Data Wrangling, Software Engineering, Software Design & Architecture, Software Ops, Data Intuition, Statistics & Mathematics, AI/Machine Learning, Data Visualization.
Other related roles
• Data Analyst: queries data, processes data, provides reports, summarizes and visualizes data.
• BI Developer/Report Developer: builds BI and reporting solutions.
• ML Developer: has ML and Statistics knowledge; focuses on implementing ML algorithms.
Common Concepts
Business Analytics vs. Business Intelligence
• BI: analysis of historical data → problem identification & resolution → improve
business
• BA: exploration of historical data → identify trends, patterns & understand the
information → drive business change
BI vs BA:
• Collect, analyze, visualize data: BI ✅, BA ✅
• Identify problems: BI ✅, BA ✅
• Descriptive analytics: BI ✅, BA ❌
• Diagnostic analytics: BI ✅, BA ❌
• Predictive analytics: BI ❌, BA ✅
• Prescriptive analytics: BI ❌, BA ✅
Data lake vs Data warehouse
• Data warehouse: current and historical data used for reporting and data
analysis
• Data lake: a repository storing raw, structured, and unstructured data; anything and
everything.
• Data swamp: poorly managed data lake → inaccessible, little value
Data lake vs Data warehouse
Data Warehouse vs Data Lake:
• DATA: Warehouse: processed, structured. Lake: processed and unprocessed; structured, unstructured, raw.
• PROCESSING: Warehouse: schema-on-write, ETL, more expensive. Lake: schema-on-read, ELT, less expensive.
• AGILITY: Warehouse: fixed, less agile. Lake: flexible, highly agile.
• READINESS: Warehouse: ready to be analyzed. Lake: needs more processing before it becomes useful.
ETL vs. ELT
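A minimal sketch of the contrast, using pandas and an in-memory SQLite database as a stand-in "warehouse"; the file and table names are illustrative assumptions, not from the talk:

```python
# Sketch: ETL vs ELT, with pandas and an in-memory SQLite DB as the "warehouse".
import pandas as pd
import sqlite3

warehouse = sqlite3.connect(":memory:")
raw = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})

# ETL: transform first (schema-on-write), then load the curated table.
curated = raw.groupby("user_id", as_index=False)["amount"].sum()
curated.to_sql("spend_by_user", warehouse, index=False)

# ELT: load the raw data as-is, then transform inside the warehouse with SQL
# (schema-on-read happens at query time).
raw.to_sql("raw_events", warehouse, index=False)
elt_result = pd.read_sql(
    "SELECT user_id, SUM(amount) AS amount FROM raw_events GROUP BY user_id",
    warehouse,
)
print(elt_result)
```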
Typical Big Data Analytics System Architecture
Typical architecture
Break down architecture
Data ingestion
• Role: streaming data from sources into the pipeline.
• Characteristics:
• High Performance, Low latency
• Superbly Scalable
• Durable
• Integration with existing DB systems
• Common options:
• Kafka
• AWS Kinesis
• GCP PubSub
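For example, a minimal ingestion sketch with the kafka-python client, assuming a local broker and a hypothetical "events" topic:

```python
# Minimal Kafka producer sketch (kafka-python); broker address and topic
# name are assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # wait for replication: favors durability over latency
)

producer.send("events", {"user_id": 42, "action": "login"})
producer.flush()  # block until buffered records are delivered
```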
Big Data Processing Techs
• Uses:
• ETL: clean, flatten, transform, and aggregate data into a more analyzable format.
• Analytics
• Training data for Machine Learning
• Characteristics:
• Able to handle big data
• Scale out
• Low latency
Processing model: Batch vs Stream
• Data scope: Batch: processing over all or most of the data set. Stream: processing over a rolling window or the most recent record.
• Data size: Batch: large batches of data. Stream: individual records or micro-batches of a few records.
• Latency: Batch: minutes to hours. Stream: on the order of seconds or milliseconds.
• Analytics: Batch: complex analytics. Stream: simple response functions, aggregates, and rolling metrics.
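A toy illustration of the difference in plain Python: the batch pass aggregates the whole data set at once, while the streaming loop keeps a rolling 60-second window per incoming record (illustrative only, not a production engine):

```python
# Toy contrast between batch and stream processing (illustrative only).
from collections import deque

events = [{"ts": t, "user": t % 3} for t in range(180)]  # fake event stream

# Batch: one pass over the full data set, result available at the end.
batch_count_per_user = {}
for e in events:
    batch_count_per_user[e["user"]] = batch_count_per_user.get(e["user"], 0) + 1

# Stream: process records one by one, keeping a rolling 60-second window.
window = deque()
for e in events:
    window.append(e)
    while window and window[0]["ts"] <= e["ts"] - 60:
        window.popleft()          # evict records older than the window
    rolling_count = len(window)   # metric is available after every record
```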
Processing model: Data parallelism vs Task parallelism
• Fashion: Data parallelism: the same operations are performed on different subsets of the same data. Task parallelism: different operations are performed on the same or different data.
• Computation: Data parallelism: synchronous. Task parallelism: asynchronous.
• Amount of parallelization: Data parallelism: proportional to the input data size. Task parallelism: proportional to the number of independent tasks to be performed.
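A small standard-library illustration, with hypothetical functions: data parallelism maps one function over many items, while task parallelism runs different functions concurrently:

```python
# Data parallelism vs task parallelism with Python's standard library.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def square(x):          # one operation applied to many data items
    return x * x

def fetch_report():     # independent tasks doing different work
    return "report"

def clean_logs():
    return "cleaned"

if __name__ == "__main__":
    data = list(range(10))

    # Data parallelism: the same function over different subsets of the data.
    with ProcessPoolExecutor() as pool:
        squares = list(pool.map(square, data))

    # Task parallelism: different operations running concurrently.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fetch_report), pool.submit(clean_logs)]
        results = [f.result() for f in futures]
```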
Unified Model
• Combine
• Batch and Stream
• Data parallelism and task parallelism
Popular processing techs
• Hadoop ecosystem: on-disk batch processing
• Spark: in-memory batch / "pseudo-streaming" processing
• Flink, Storm: native stream processing
• Beam: unified model framework
• Hosted:
• GCP Dataflow: managed service for running Beam pipelines
• AWS Data Pipeline: web service, S3- and EMR-centric
• Azure Data Factory: Drag & drop data pipeline builder GUI
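As a taste of the unified model, a minimal Apache Beam sketch (Python SDK); the input file and field names are assumptions, and the same pipeline can be submitted to a hosted runner such as Dataflow by changing the pipeline options:

```python
# Minimal Apache Beam pipeline sketch (Python SDK); file and field names
# are hypothetical. Runs locally in batch by default.
import json
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("events.jsonl")
        | "Parse" >> beam.Map(json.loads)
        | "KeyByUser" >> beam.Map(lambda e: (e["user_id"], 1))
        | "CountPerUser" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("user_counts")
    )
```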
Why do we need another "database"?
• Collect data from multiple sources
• Different data model needs
• Workload: Transactional vs Analytical (OLTP vs. OLAP)
• Some NoSQL databases are not suitable for data analytics
• Storage structure optimized for slice & dice queries
Storage Unstructured
• Unstructured: text, CSV, images, video, etc.
• Usually a highly scalable key-value object store.
• Options:
• Managed: Google Cloud Storage, AWS S3
• Open source: OpenStack Swift, MinIO
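Landing raw files in object storage is typically a single call; a sketch with boto3 against a hypothetical S3 bucket (the Google Cloud Storage client is analogous):

```python
# Sketch: dropping a raw file into S3 object storage with boto3.
# Bucket, key, and file names are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="exports/orders_2019-01-01.csv",  # local file to land
    Bucket="my-data-lake-raw",                 # raw landing bucket
    Key="raw/orders/2019-01-01/orders.csv",    # partition-style key prefix
)
```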
Storage Structured
• Database: SQL, NoSQL
• Characteristics:
• Analytics query language: ideally SQL-like
• Massively scalable to billions of rows
• Low-latency data ingestion
• Read-focused over large portions of data
• There are MANY options
Option 1: using the same DB as the application DB
• App database
• A read-replica of app DB
• A separate data warehouse
Diagram: three setups: the app DB itself; the app DB plus a read replica; the app DB syncing to a separate DW.
Option 2: SQL-based analytics DB
• SQL-like databases optimized for analytical workloads
• Options:
• Open source (Postgres-based): Citus, Greenplum
• Hosted: Athena, Redshift, Azure, BigQuery
• Proprietary: Teradata, Oracle
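To show the kind of workload these engines target, a slice & dice query through the BigQuery Python client; the project, dataset, and table names are made up, and the SQL would look similar on Redshift, Athena, Citus, or Greenplum:

```python
# Sketch: an analytical "slice & dice" query via the BigQuery client.
# Project, dataset, and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT country, DATE_TRUNC(order_date, MONTH) AS month,
           SUM(amount) AS revenue
    FROM `my_project.sales.orders`
    GROUP BY country, month
    ORDER BY month, revenue DESC
"""
for row in client.query(sql).result():
    print(row.country, row.month, row.revenue)
```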
Option 3: SQL-on-Hadoop
• Leverage data from Hadoop-based processing framework
• Techs: Spark SQL, Drill, Hive, Impala, Presto
• Pros:
• Can scale to massive data sets
• Use common SQL dialects
• Decent tool support
• Joins between different types of data sources: SQL, NoSQL, structured files.
• Cons:
• Languages are very low level
• Requires running a Hadoop cluster
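For instance, a small Spark SQL sketch querying a raw JSON dataset with plain SQL; the path and column names are assumptions:

```python
# Sketch: Spark SQL over a raw JSON dataset (hypothetical path and columns).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-raw-data").getOrCreate()

events = spark.read.json("s3a://my-data-lake-raw/events/")  # schema inferred
events.createOrReplaceTempView("events")

top_users = spark.sql("""
    SELECT user_id, COUNT(*) AS actions
    FROM events
    GROUP BY user_id
    ORDER BY actions DESC
    LIMIT 10
""")
top_users.show()
```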
Option 4: ElasticSearch
• Leverage its query language to power search-oriented analytics
• Pros:
• FAST
• Strong ability to search your data
• Cons:
• Slow ingestion
• Difficult query language that is optimized for search, not analytics
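A quick sketch of the search-flavored analytics it handles well, using the elasticsearch-py client against a hypothetical "app-logs" index on a local cluster:

```python
# Sketch: a terms aggregation with elasticsearch-py; index, field, and
# cluster address are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="app-logs",
    body={
        "size": 0,  # we only want the aggregation, not the matching docs
        "query": {"match": {"message": "timeout"}},
        "aggs": {"by_service": {"terms": {"field": "service.keyword"}}},
    },
)
for bucket in resp["aggregations"]["by_service"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```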
Option 5: In-memory databases
• If you want super low latency
• Techs: Druid, Pinot, SAP HANA
• Pros:
• FAST, FAST, FAST
• Cons:
• Needs A LOT OF RAM
• Less flexible and powerful query languages
• Joins are limited
• Challenging to deploy and manage
Takeaways
• There is overlap between Data Scientist and Data Engineer, but the distinction is becoming clearer
• No role is better than another; know what your organization needs
