The document discusses the roles and responsibilities of data engineers and data scientists, emphasizing the need for a solid data engineering foundation in AI and machine learning applications. It highlights key concepts such as the differences between data lakes and data warehouses, various processing technologies, and the significance of architecture in big data analytics. The conclusion notes the overlapping skills of data scientists and data engineers while advocating for clarity in their distinct roles based on organizational needs.
About me
• Bob Bui (linkedin.com/in/thangbn/)
• Senior Data Engineer @ EquitySim: an AI EdTech company building a financial simulation platform.
• Previously:
• Senior Software Engineer @ SAP Innovation Center Singapore: SAP Leonardo Machine Learning.
• Built a variety of other software products.
Agenda
• Data engineering
• Revisit data science
• The need for data engineering & data engineers
• Common concepts
• Typical big data analytics architecture
Skills: Data Scientist vs Data Engineer
• Data Engineer: Programming, Data Wrangling, Software Engineering, Software Design & Architecture, Software Ops
• Data Scientist: Programming, Data Wrangling, Data Intuition, Statistics & Mathematics, AI/Machine Learning, Data Visualization
Other related roles
• Data Analyst: queries and processes data; provides reports; summarizes and visualizes data.
• BI Developer/Report Developer: builds BI and reporting solutions.
• ML Developer: has ML and statistics knowledge; focuses on implementing ML algorithms.
Business Analytics vs. Business Intelligence
• BI: analysis of historical data → problem identification & resolution → improve business
• BA: exploration of historical data → identify trends, patterns & understand the information → drive business change

• Collect, analyze, visualize data: BI ✅, BA ✅
• Identify problems: BI ✅, BA ✅
• Descriptive analytics: BI ✅, BA ❌
• Diagnostic analytics: BI ✅, BA ❌
• Predictive analytics: BI ❌, BA ✅
• Prescriptive analytics: BI ❌, BA ✅
Data lake vs Data warehouse
• Data warehouse: current and historical data used for reporting and data analysis.
• Data lake: a repository storing raw, structured, and unstructured data; anything and everything.
• Data swamp: a poorly managed data lake → inaccessible, little value.
Data lake vs Data warehouse
• Data — warehouse: processed, structured; lake: processed/unprocessed; structured, unstructured, raw
• Processing — warehouse: schema-on-write (ETL), more expensive; lake: schema-on-read (ELT), less expensive
• Agility — warehouse: fixed, less agile; lake: flexible, highly agile
• Readiness — warehouse: ready to be analyzed; lake: needs more processing before becoming useful
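The schema-on-write vs schema-on-read distinction above can be sketched in a few lines of plain Python (an illustrative toy, not any specific warehouse or lake product; the field names are made up):

```python
import json

# Schema-on-write (warehouse style): the record is shaped to a fixed
# schema at ingest time; anything outside the schema is dropped.
SCHEMA = ("user_id", "amount")

def write_to_warehouse(raw_line, table):
    record = json.loads(raw_line)
    table.append({k: record[k] for k in SCHEMA})

# Schema-on-read (lake style): store the raw line untouched and
# interpret it only when a query needs it.
def query_lake(raw_lines, field):
    for line in raw_lines:
        record = json.loads(line)  # schema applied here, at read time
        yield record.get(field)

raw = ['{"user_id": 1, "amount": 9.5, "note": "extra field kept in the lake"}']
table = []
write_to_warehouse(raw[0], table)
print(table)                           # only the modeled columns survive
print(list(query_lake(raw, "note")))   # raw detail is still queryable
```

Note the trade-off the table describes: the warehouse row is immediately analyzable but lost the `note` field, while the lake keeps everything at the cost of doing the parsing on every read.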
Data ingestion
• Role: streams data from sources into the pipeline.
• Characteristics:
• High performance, low latency
• Highly scalable
• Durable
• Integrates with existing DB systems
• Common options:
• Kafka
• AWS Kinesis
• GCP Pub/Sub
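A minimal sketch of the producer/consumer shape these ingestion systems share, using an in-memory queue as a stand-in (real systems such as Kafka add durability, partitioning, and replay, which this toy does not):

```python
import queue
import threading

# Toy stand-in for an ingestion pipeline: producers append events,
# consumers read them independently of the producers.
pipeline = queue.Queue()

def produce(events):
    for e in events:
        pipeline.put(e)  # a real Kafka producer would send to a topic here

def consume(n):
    return [pipeline.get() for _ in range(n)]

t = threading.Thread(target=produce, args=([{"click": i} for i in range(3)],))
t.start()
t.join()
print(consume(3))  # → [{'click': 0}, {'click': 1}, {'click': 2}]
```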
Big Data Processing Techs
• Uses:
• ETL: clean, flatten, transform, and aggregate data into a more analyzable format.
• Analytics
• Preparing training data for machine learning
• Characteristics:
• Able to handle big data
• Scales out
• Low latency
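The clean/flatten/transform/aggregate steps above, shown on a tiny in-memory data set (the event shape and field names are hypothetical; a real job would run this on a distributed engine):

```python
from collections import defaultdict

# Raw nested events, as they might arrive from an ingestion pipeline.
raw_events = [
    {"user": {"id": 1}, "purchase": {"amount": 10.0}},
    {"user": {"id": 2}, "purchase": {"amount": 5.0}},
    {"user": {"id": 1}, "purchase": {"amount": 7.5}},
]

# Transform: flatten the nested structure into flat rows.
rows = [{"user_id": e["user"]["id"], "amount": e["purchase"]["amount"]}
        for e in raw_events]

# Aggregate: total spend per user — the "more analyzable" output table.
totals = defaultdict(float)
for r in rows:
    totals[r["user_id"]] += r["amount"]

print(dict(totals))  # → {1: 17.5, 2: 5.0}
```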
Processing model: Batch vs Stream
• Data scope — batch: processing over all or most of the data set; stream: processing over a rolling window or the most recent records
• Data size — batch: large batches of data; stream: individual records or micro-batches of a few records
• Latency — batch: minutes to hours; stream: seconds or milliseconds
• Analytics — batch: complex analytics; stream: simple instant response functions, aggregates, and rolling metrics
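The data-scope difference above in miniature: a batch pass computes over the whole data set at once, while a stream maintains a rolling aggregate over a fixed window as each record arrives (a conceptual sketch, not any particular framework's API):

```python
from collections import deque

values = [3, 1, 4, 1, 5, 9, 2, 6]

# Batch processing: one pass over the complete data set.
batch_mean = sum(values) / len(values)

# Stream processing: a rolling mean over the last 3 records,
# updated incrementally as each record arrives.
window = deque(maxlen=3)
rolling_means = []
for v in values:
    window.append(v)
    rolling_means.append(sum(window) / len(window))

print(batch_mean)         # → 3.875
print(rolling_means[-1])  # mean of the last 3 records: (9 + 2 + 6) / 3
```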
Data parallelism vs Task parallelism
• Fashion — data parallelism: the same operation is performed on different subsets of the same data; task parallelism: different operations are performed on the same or different data
• Computation — data parallelism: synchronous; task parallelism: asynchronous
• Amount of parallelization — data parallelism: proportional to the input data size; task parallelism: proportional to the number of independent tasks to be performed
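Both styles in one small sketch using the standard library (illustrative only; big data engines apply the same ideas across machines rather than threads):

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(8))

with ThreadPoolExecutor() as pool:
    # Data parallelism: the SAME operation applied to different pieces of the data.
    squares = list(pool.map(lambda x: x * x, data))

    # Task parallelism: DIFFERENT operations running concurrently over the same data.
    total = pool.submit(sum, data)
    biggest = pool.submit(max, data)

print(squares)                            # → [0, 1, 4, 9, 16, 25, 36, 49]
print(total.result(), biggest.result())   # → 28 7
```

Notice the proportionality from the table: `pool.map` gains more parallelism as `data` grows, while the second pattern is capped at the number of independent tasks (here, two).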
Popular processing techs
• Hadoop ecosystem: on-disk batch processing
• Spark: in-memory batch/"pseudo-streaming" processing
• Flink, Storm: native stream processing
• Beam: unified model framework
• Hosted:
• GCP Dataflow: programming framework
• AWS Data Pipeline: S3 and AWS EMR centric, web service
• Azure Data Factory: drag & drop data pipeline builder GUI
Why need another "database"?
• Collects data from multiple sources
• Different data model needs
• Workload: transactional vs. analytical (OLTP vs. OLAP)
• Some NoSQL stores are not suitable for data analytics
• Storage structure optimized for slice & dice queries
Storage: Unstructured
• Unstructured: text, CSV, images, video, etc.
• Usually a highly scalable key-value object store.
• Options:
• Managed: Google Cloud Storage, AWS S3
• Open source: OpenStack Swift, Minio
Storage: Structured
• Database: SQL, NoSQL
• Characteristics:
• Analytics query language: ideally SQL-like
• Massively scalable to billions of rows
• Low-latency data ingestion
• Read-focused over large portions of data
• MANY options available
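The "analytics query language, ideally SQL-like" point in miniature, using Python's built-in SQLite (illustrative only — SQLite is not an analytical warehouse, but the GROUP BY aggregate is exactly the kind of slice & dice query an analytics store must run over billions of rows; the `sales` table is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 100.0), ("EU", 50.0), ("US", 200.0)])

# Slice & dice: read a large portion of the data, grouped by a dimension.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # → [('EU', 150.0), ('US', 200.0)]
```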
Option 1: using the same DB as the application DB
• The app database itself
• A read replica of the app DB
• A separate data warehouse kept in sync with the app DB
Option 3: SQL-on-Hadoop
• Leverages data from Hadoop-based processing frameworks
• Techs: Spark SQL, Drill, Hive, Impala, Presto
• Pros:
• Can scale to massive data sets
• Uses common SQL dialects
• Decent tool support
• Joins between different types of data sources: SQL, NoSQL, structured files
• Cons:
• Languages are very low level
• Requires running a Hadoop cluster
Option 4: ElasticSearch
• Leverages its query language to power search-oriented analytics
• Pros:
• FAST
• Strong ability to search your data
• Cons:
• Slow ingestion
• Query language is difficult and optimized for search, not analytics
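To see why the query language feels search-shaped rather than analytics-shaped, here is a hypothetical Elasticsearch query-DSL body built as a Python dict (the `message` and `host` fields are made up): a full-text `match` drives the query, and any analytics has to be bolted on as an aggregation.

```python
import json

# Hypothetical example: find log lines mentioning "error",
# then bucket the matching documents by host.
query = {
    "query": {"match": {"message": "error"}},
    "aggs": {"by_host": {"terms": {"field": "host"}}},
}
print(json.dumps(query, indent=2))
```

Compare this nested-JSON shape with the one-line SQL GROUP BY it roughly corresponds to; that gap is the "difficult for analytics" con above.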
Option 5: In-memory databases
• For super low latency requirements
• Techs: Druid, Pinot, SAP HANA
• Pros:
• FAST, FAST, FAST
• Cons:
• Requires A LOT OF RAM
• Less flexible and powerful query language
• Joins are limited
• Challenging to deploy and manage
Take away
• There is overlap between the Data Scientist and Data Engineer roles, but the distinction is becoming clearer.
• Neither role is better than the other; know what your organization needs.