Demystifying Data Engineering
AI in Production Meetup
About me
• Bob Bui (linkedin.com/in/thangbn/)
• Senior Data Engineer @ EquitySim: an AI-EdTech company building a financial
simulation platform.
• Previously
• Senior Software Engineer @ SAP Innovation Center Singapore: SAP
Leonardo Machine Learning.
• Also building a variety of software products.
Agenda
• Data engineering
• Revisit Data Science
• The need for data engineering & data engineers
• Common concepts
• Typical big data analytics architecture
Data science
Diagram: the data scientist extracts insight & knowledge from unstructured and structured data.
Maslow's Hierarchy of Needs
Source: https://www.simplypsychology.org/maslow.html
Source: The AI Hierarchy of Needs, Monica Rogati
AI and ML need a Strong Data Foundation
But what if … there is nothing, OR there is a big mess?
Data Engineer is
someone who prepares the big data infrastructure to be analyzed by Data Scientists.
Data Engineer's skill set
Diagram: spans BI, big data, software engineering, and data warehousing.
Skills: Data Scientist vs Data Engineer
Both roles compared across: Programming, Data Wrangling, Software Engineering, Software Design & Architecture, Software Ops, Data Intuition, Statistics & Mathematics, AI/Machine Learning, Data Visualization.
Other related roles
• Data Analyst: queries data, processes data, provides reports, summarizes and visualizes data.
• BI Developer/Report Developer: builds BI and reporting solutions.
• ML Developer: has ML and Statistics knowledge; focuses on implementing ML algorithms.
Common Concepts
Business Analytics vs. Business Intelligence
• BI: analysis of historical data → problem identification & resolution → improve
business
• BA: exploration of historical data → identify trends, patterns & understand the
information → drive business change
BI vs BA:
• Collect, analyze, visualize data: BI ✅, BA ✅
• Identify problems: BI ✅, BA ✅
• Descriptive analytics: BI ✅, BA ❌
• Diagnostic analytics: BI ✅, BA ❌
• Predictive analytics: BI ❌, BA ✅
• Prescriptive analytics: BI ❌, BA ✅
Data lake vs Data warehouse
• Data warehouse: current and historical data used for reporting and data
analysis
• Data lake: a repository storing raw, structured, and unstructured data; anything and
everything.
• Data swamp: poorly managed data lake → inaccessible, little value
Data lake vs Data warehouse
Data Warehouse vs Data Lake:
• DATA: Warehouse: processed, structured. Lake: processed and unprocessed; structured, unstructured, raw.
• PROCESSING: Warehouse: schema-on-write, ETL, more expensive. Lake: schema-on-read, ELT, less expensive.
• AGILITY: Warehouse: fixed, less agile. Lake: flexible, highly agile.
• READINESS: Warehouse: ready to be analyzed. Lake: needs more processing before it becomes useful.
ETL vs. ELT
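A minimal sketch of the contrast, using pandas and an in-memory SQLite database as a stand-in "warehouse"; the file and table names are illustrative assumptions, not from the talk:

```python
# Sketch: ETL vs ELT, with pandas and an in-memory SQLite DB as the "warehouse".
import pandas as pd
import sqlite3

warehouse = sqlite3.connect(":memory:")
raw = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})

# ETL: transform first (schema-on-write), then load the curated table.
curated = raw.groupby("user_id", as_index=False)["amount"].sum()
curated.to_sql("spend_by_user", warehouse, index=False)

# ELT: load the raw data as-is, then transform inside the warehouse with SQL
# (schema-on-read happens at query time).
raw.to_sql("raw_events", warehouse, index=False)
elt_result = pd.read_sql(
    "SELECT user_id, SUM(amount) AS amount FROM raw_events GROUP BY user_id",
    warehouse,
)
print(elt_result)
```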
Typical Big Data Analytics System Architecture
Typical architecture
Break down architecture
Data ingestion
• Role: streaming data from sources into the pipeline.
• Characteristics:
• High Performance, Low latency
• Superbly Scalable
• Durable
• Integration with existing DB systems
• Common options:
• Kafka
• AWS Kinesis
• GCP PubSub
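For example, a minimal ingestion sketch with the kafka-python client, assuming a local broker and a hypothetical "events" topic:

```python
# Minimal Kafka producer sketch (kafka-python); broker address and topic
# name are assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # wait for replication: favors durability over latency
)

producer.send("events", {"user_id": 42, "action": "login"})
producer.flush()  # block until buffered records are delivered
```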
Big Data Processing Techs
• Uses:
• ETL: clean, flatten, transform, and aggregate data into a more analyzable format.
• Analytics
• Training data for Machine Learning
• Characteristics:
• Able to handle big data
• Scale out
• Low latency
Processing model: Batch vs Stream
• Data scope: Batch: processing over all or most of the data set. Stream: processing over a rolling window or the most recent record.
• Data size: Batch: large batches of data. Stream: individual records or micro-batches of a few records.
• Latency: Batch: minutes to hours. Stream: on the order of seconds or milliseconds.
• Analytics: Batch: complex analytics. Stream: simple response functions, aggregates, and rolling metrics.
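A toy illustration of the difference in plain Python: the batch pass aggregates the whole data set at once, while the streaming loop keeps a rolling 60-second window per incoming record (illustrative only, not a production engine):

```python
# Toy contrast between batch and stream processing (illustrative only).
from collections import deque

events = [{"ts": t, "user": t % 3} for t in range(180)]  # fake event stream

# Batch: one pass over the full data set, result available at the end.
batch_count_per_user = {}
for e in events:
    batch_count_per_user[e["user"]] = batch_count_per_user.get(e["user"], 0) + 1

# Stream: process records one by one, keeping a rolling 60-second window.
window = deque()
for e in events:
    window.append(e)
    while window and window[0]["ts"] <= e["ts"] - 60:
        window.popleft()          # evict records older than the window
    rolling_count = len(window)   # metric is available after every record
```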
Processing model: Data parallelism vs Task parallelism
• Fashion: Data parallelism: the same operations are performed on different subsets of the same data. Task parallelism: different operations are performed on the same or different data.
• Computation: Data parallelism: synchronous. Task parallelism: asynchronous.
• Amount of parallelization: Data parallelism: proportional to the input data size. Task parallelism: proportional to the number of independent tasks to be performed.
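A small standard-library illustration, with hypothetical functions: data parallelism maps one function over many items, while task parallelism runs different functions concurrently:

```python
# Data parallelism vs task parallelism with Python's standard library.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def square(x):          # one operation applied to many data items
    return x * x

def fetch_report():     # independent tasks doing different work
    return "report"

def clean_logs():
    return "cleaned"

if __name__ == "__main__":
    data = list(range(10))

    # Data parallelism: the same function over different subsets of the data.
    with ProcessPoolExecutor() as pool:
        squares = list(pool.map(square, data))

    # Task parallelism: different operations running concurrently.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fetch_report), pool.submit(clean_logs)]
        results = [f.result() for f in futures]
```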
Unified Model
• Combine
• Batch and Stream
• Data parallelism and task parallelism
Popular processing techs
• Hadoop ecosystem: on-disk batch processing
• Spark: in-memory batch / "pseudo-streaming" processing
• Flink, Storm: native stream processing
• Beam: unified model framework
• Hosted:
• GCP Dataflow: managed service for running Beam pipelines
• AWS Data Pipeline: web service, S3- and EMR-centric
• Azure Data Factory: Drag & drop data pipeline builder GUI
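As a taste of the unified model, a minimal Apache Beam sketch (Python SDK); the input file and field names are assumptions, and the same pipeline can be submitted to a hosted runner such as Dataflow by changing the pipeline options:

```python
# Minimal Apache Beam pipeline sketch (Python SDK); file and field names
# are hypothetical. Runs locally in batch by default.
import json
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("events.jsonl")
        | "Parse" >> beam.Map(json.loads)
        | "KeyByUser" >> beam.Map(lambda e: (e["user_id"], 1))
        | "CountPerUser" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("user_counts")
    )
```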
Why do we need another "database"?
• Collect data from multiple sources
• Different data model needs
• Workload: Transactional vs Analytical (OLTP vs. OLAP)
• Some NoSQL databases are not suitable for data analytics
• Storage structure optimized for slice & dice queries
Storage Unstructured
• Unstructured: text, CSV, images, video, etc.
• Usually a highly scalable key-value object store.
• Options:
• Managed: Google Cloud Storage, AWS S3
• Open source: OpenStack Swift, MinIO
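Landing raw files in object storage is typically a single call; a sketch with boto3 against a hypothetical S3 bucket (the Google Cloud Storage client is analogous):

```python
# Sketch: dropping a raw file into S3 object storage with boto3.
# Bucket, key, and file names are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="exports/orders_2019-01-01.csv",  # local file to land
    Bucket="my-data-lake-raw",                 # raw landing bucket
    Key="raw/orders/2019-01-01/orders.csv",    # partition-style key prefix
)
```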
Storage Structured
• Database: SQL, NoSQL
• Characteristics:
• Analytics query language: ideally SQL-like
• Massively scalable to billions of rows
• Low-latency data ingestion
• Read-focused over large portions of data
• There are MANY options
Option 1: using the same DB as the application DB
• App database
• A read-replica of app DB
• A separate data warehouse
Diagram: three setups: the app DB itself; the app DB plus a read replica; the app DB syncing to a separate DW.
Option 2: SQL-based analytics DB
• SQL-like databases optimized for analytical workloads
• Options:
• Open source (Postgres-based): Citus, Greenplum
• Hosted: Athena, Redshift, Azure, BigQuery
• Proprietary: Teradata, Oracle
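To show the kind of workload these engines target, a slice & dice query through the BigQuery Python client; the project, dataset, and table names are made up, and the SQL would look similar on Redshift, Athena, Citus, or Greenplum:

```python
# Sketch: an analytical "slice & dice" query via the BigQuery client.
# Project, dataset, and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT country, DATE_TRUNC(order_date, MONTH) AS month,
           SUM(amount) AS revenue
    FROM `my_project.sales.orders`
    GROUP BY country, month
    ORDER BY month, revenue DESC
"""
for row in client.query(sql).result():
    print(row.country, row.month, row.revenue)
```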
Option 3: SQL-on-Hadoop
• Leverage data from Hadoop-based processing framework
• Techs: Spark SQL, Drill, Hive, Impala, Presto
• Pros:
• Can scale to massive data sets
• Use common SQL dialects
• Decent tool support
• Joins between different types of data sources: SQL, NoSQL, structured files.
• Cons:
• Languages are very low level
• Requires running a Hadoop cluster
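For instance, a small Spark SQL sketch querying a raw JSON dataset with plain SQL; the path and column names are assumptions:

```python
# Sketch: Spark SQL over a raw JSON dataset (hypothetical path and columns).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-raw-data").getOrCreate()

events = spark.read.json("s3a://my-data-lake-raw/events/")  # schema inferred
events.createOrReplaceTempView("events")

top_users = spark.sql("""
    SELECT user_id, COUNT(*) AS actions
    FROM events
    GROUP BY user_id
    ORDER BY actions DESC
    LIMIT 10
""")
top_users.show()
```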
Option 4: ElasticSearch
• Leverage its query language to power search-oriented analytics
• Pros:
• FAST
• Strong ability to search your data
• Cons:
• Slow ingestion
• Difficult query language that is optimized for search, not analytics
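A quick sketch of the search-flavored analytics it handles well, using the elasticsearch-py client against a hypothetical "app-logs" index on a local cluster:

```python
# Sketch: a terms aggregation with elasticsearch-py; index, field, and
# cluster address are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="app-logs",
    body={
        "size": 0,  # we only want the aggregation, not the matching docs
        "query": {"match": {"message": "timeout"}},
        "aggs": {"by_service": {"terms": {"field": "service.keyword"}}},
    },
)
for bucket in resp["aggregations"]["by_service"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```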
Option 5: In-memory databases
• If you want super low latency
• Techs: Druid, Pinot, SAP HANA
• Pros:
• FAST, FAST, FAST
• Cons:
• Needs A LOT OF RAM
• Less flexible and powerful query languages
• Joins are limited
• Challenging to deploy and manage
Takeaways
• There is overlap between Data Scientist and Data Engineer, but the distinction is becoming clearer
• No role is better than another; know what your organization needs
