https://heapcon.io/speakers/nikolay-matvienko/designing-data-intensive-applications-in-serverless-architecture/
I’ll talk about how to design a serverless architecture for a data-intensive application in order to process a Data Lake or data streams. I’ll show you how we did it using AWS cloud functions on Node.js, running thousands of functions in parallel that process terabytes of data in an ETL pipeline. Step by step we will build a serverless architecture for the processing pipeline, consider the choice of services, queues, streams and databases, and dive into the tuning of cloud functions to build a reliable, massively scalable cloud computing platform. We will talk about the advantages of such an architecture and platform, its possible limitations and how to get around them.
2. Nikolay Matvienko
Senior Software Engineer and Node.js expert at Grid Dynamics
Diagnostics and performance-improvement consultant.
Works in the USA and in Russia.
Designed serverless computing platforms on AWS.
You can find me at twitter.com/matvi3nko and github.com/matvi3nko
3. DESIGNING SERVERLESS
1. SERVERLESS COMPUTING HAS BECOME VERY POPULAR
2. SERVERLESS SOLUTIONS ARE ~60% AND UP TO SEVERAL TIMES CHEAPER
3. EVERY YEAR AWS ANNOUNCES NEW SERVICES
4. BUT THE LACK OF PATTERNS AND MISUSE CAN MAKE THE SOLUTION SEVERAL TIMES MORE EXPENSIVE
[Chart: FaaS 21% vs. containers 19%]
5. WHERE DOES THE DATA COME FROM?
- IoT DEVICES → REAL-TIME STREAM PROCESSING
- ENTERPRISE PLATFORMS → BATCH / STREAM PROCESSING
- DBs → DATA LAKE
6. USE EXISTING OR BUILD YOUR OWN PIPELINE
- ETL BATCH PROCESSING: GLUE JOBS
- STREAM PROCESSING: KINESIS STREAMS & ANALYTICS
- OR BUILD YOUR OWN WITH AWS LAMBDA
DATA → PIPELINE → DATA WAREHOUSE
7. DYNAMICALLY UPDATED PROCESSING LOGIC
AWS LAMBDA functions apply many separate sets of PROCESSING RULES before writing to the DATA WAREHOUSE.
REAL-LIFE CASE: 300+ CUSTOMERS, EACH ESTABLISHES ITS OWN DATA PROCESSING RULES.
9. SERVERLESS COMPUTE SERVICE
AWS LAMBDA: SERVERLESS FUNCTION (* symbol used in this presentation)

export const handler = async (event) => {
  const data = event.Records[0].body;
  // - TRANSFORM data
  // - WRITE to DB or
  // - PUT TO QUEUE/STREAM
  return 'success';
};

- CHOOSE YOUR CODE LANGUAGE
- NO INFRASTRUCTURE TO MANAGE
- TRIGGERED BY EVENTS
- HIGHLY SCALABLE
- STATELESS
- COST-EFFECTIVE
15. DATA EXTRACTION PATTERNS
1. MOVE FROM BIG DATA TO A LARGE NUMBER OF MESSAGES
2. USE QUEUES FOR MESSAGES, AND DATA STREAMS TO TRANSFER MODELS / LARGE COLLECTIONS
3. BUT DO NOT RUSH TO USE STREAMS: CHOOSE THE TRANSPORT FOR YOUR NEEDS
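Pattern 1 ("move from big data to a large number of messages") can be sketched in Node.js. The `toSqsBatches` helper below is a hypothetical illustration, not the talk's actual code; it splits a large collection into SQS-sized batches, since SQS `SendMessageBatch` accepts at most 10 messages per call.

```javascript
// Instead of dragging one huge payload through the pipeline, fan a large
// collection out into many small queue messages (hypothetical helper).
function toSqsBatches(items, batchSize = 10) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(
      items.slice(i, i + batchSize).map((item, j) => ({
        Id: String(i + j),                 // unique id within the batch
        MessageBody: JSON.stringify(item), // one small message per item
      }))
    );
  }
  return batches;
}

// e.g. 25 file paths discovered in the Data Lake become 3 batch calls
const paths = Array.from({ length: 25 }, (_, i) => ({ path: `file-${i}` }));
const batches = toSqsBatches(paths);
console.log(batches.length);    // 3
console.log(batches[2].length); // 5
```

Each resulting batch would then be passed to a real `SendMessageBatch` call; the transport choice itself (queue vs. stream) stays separate from this chunking logic.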
18. SEPARATE FUNCTIONS BY RESPONSIBILITY
One function, one role, connected by queues:
- MERGER (builds a MERGED RECORD from tables/files)
- FAN-OUT
- COPIER
- SPLITTER
- WRITER
- FILTER
- PROCESS/ANALYZE MODEL
20. QUERIES & CONNECTIONS PROBLEMS
TRANSFORM 1, TRANSFORM 2 and TRANSFORM 3 each run SELECT / INSERT / UPDATE against the SQL DB directly.
PROBLEMS:
1. MANY LAMBDAS HAVE TO QUERY THE SQL DB
2. A LOT OF CONNECTIONS
3. A LOT OF DEPENDENCIES
21. DB QUERY LOGIC ENCAPSULATION
TRANSFORM 1, TRANSFORM 2 and TRANSFORM 3 call GET / PUT / UPDATE on an API function, which is the only component that talks to the SQL DB.
BENEFITS:
1. QUERIES & LOGIC ARE HIDDEN BEHIND THE API
2. FEWER CONNECTIONS, A CONTROLLED CONNECTION POOL
3. FEWER DEPENDENCIES
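A minimal sketch of such an API function, assuming a pool-style driver: `createPool` here is a stub standing in for something like `pg.Pool`, purely for illustration. The key detail is that the pool lives in module scope, so warm invocations of the same Lambda container reuse one connection pool instead of opening a connection per request.

```javascript
// API function that hides all SQL from the transform lambdas (sketch).
let pool;
let poolsCreated = 0; // exposed only so the demo can show reuse

function createPool() {
  // hypothetical stand-in for a real driver, e.g. new pg.Pool(...)
  poolsCreated += 1;
  return {
    query: async (sql, params) => ({ rows: [{ sql, params }] }),
  };
}

function getPool() {
  if (!pool) pool = createPool(); // lazy init, once per container
  return pool;
}

// The only place in the pipeline that knows the SQL.
async function handler(event) {
  const db = getPool();
  switch (event.action) {
    case 'GET':
      return db.query('SELECT * FROM models WHERE id = $1', [event.id]);
    case 'PUT':
      return db.query('INSERT INTO models VALUES ($1)', [event.model]);
    default:
      throw new Error(`unknown action: ${event.action}`);
  }
}

// Two "warm" invocations share one pool.
(async () => {
  await handler({ action: 'GET', id: 1 });
  await handler({ action: 'PUT', model: 'm' });
  console.log(poolsCreated); // 1
})();
```

With this in place, the transform functions depend only on the API function's event shape, not on the DB driver or the schema.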
22. RELIABILITY: RETRY STRATEGY
Pipeline: DISPATCH (new snapshot date) → QUEUE (file paths: scan all objects for each folder in the DATA LAKE) → TRANSFORM (get all records and aggregate each with details) → QUEUE (models) → WRITE → DATABASE.
Each step RETRIES 3 TIMES (BY DEFAULT).
23. DEAD LETTER QUEUE
1-3. TRANSFORM retries 3 times and returns an Error.
4. The DLQ stores the failed messages.
5-7. An ACTOR sends a request to a REPROCESSING TOPIC named ”reprocess_[type]_errors”, and REPROCESS pulls messages from the DLQ on that request.
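The retry-then-DLQ control flow can be made explicit with a small sketch. In AWS the retries and DLQ redrive are configured on the function or event source rather than hand-coded; this in-memory version (with hypothetical `transform` and `dlq` stand-ins) only illustrates the behavior.

```javascript
// Retry a message up to maxRetries times; after the last failed attempt
// the message is parked in the DLQ instead of being lost (sketch).
function processWithRetries(message, transform, dlq, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return transform(message);
    } catch (err) {
      if (attempt === maxRetries) {
        dlq.push({ message, error: err.message }); // DLQ stores failed messages
        return null;
      }
      // otherwise: fall through and retry
    }
  }
}

// Demo: a transform that always fails lands in the DLQ after 3 tries.
const dlq = [];
let calls = 0;
const failing = () => { calls += 1; throw new Error('boom'); };
processWithRetries({ id: 42 }, failing, dlq);
console.log(calls, dlq.length); // 3 1
```

A reprocess actor would later read the parked `{ message, error }` pairs on request, exactly as the slide's reprocessing topic does.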
24. KINESIS ERROR HANDLING
Records are retried N times, for the stream's 24h - 7d retention period.
The ANALYZE FUNCTION separates error types:
- INFRASTRUCTURE / INTERNAL ERRORS → reject(error), so the record is retried
- BUSINESS ERRORS → resolve(), and the record is sent to an SNS topic; the DLQ stores the failed records
REPROCESS pulls messages from the DLQ on request, by error type.
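The classification idea above can be sketched as follows. With Kinesis, an error thrown from the handler makes Lambda retry the same shard batch, so only infrastructure errors should propagate; business errors are parked in a DLQ topic and swallowed so one bad record cannot block the shard. `isInfrastructureError` and `publishToDlq` are hypothetical stand-ins, not AWS APIs.

```javascript
// Decide per record: retry (rethrow) or park in the DLQ (sketch).
function isInfrastructureError(err) {
  // assumption: transient errors are tagged by name
  return err.name === 'TimeoutError' || err.name === 'ConnectionError';
}

function handleRecord(record, process, publishToDlq) {
  try {
    process(record);
    return 'ok';
  } catch (err) {
    if (isInfrastructureError(err)) throw err;      // reject(): Kinesis retries
    publishToDlq({ record, reason: err.message });  // resolve(): park business error
    return 'parked';
  }
}

// Demo: a validation failure is parked rather than retried forever.
const parked = [];
const badData = (r) => {
  const e = new Error('invalid model');
  e.name = 'ValidationError';
  throw e;
};
console.log(handleRecord({ seq: 1 }, badData, (m) => parked.push(m))); // parked
```

Keeping the error taxonomy in one place also makes the "pull by error type" reprocessing on the slide straightforward: the DLQ payload already carries the reason.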
25. DLQ FOR THE QUEUE
Pipeline: DISPATCH (new snapshot date) → QUEUE (object paths: scan all objects for each folder in the DATA LAKE) → TRANSFORM (get all records and aggregate each with details; merge by factory) → QUEUE (models) → ANALYZE → DATABASE.
Failed messages go to a DLQ; an ACTOR requests the ”reprocess_messages” REPROCESSING TOPIC, and REPROCESS pulls messages from the DLQ on that request.
1. EASY TO TROUBLESHOOT
2. NO NEED TO REPROCESS GIGABYTES OF DATA AGAIN
26. DATA TRANSFORMATION PATTERNS
1. ONE LAMBDA FUNCTION – ONE RESPONSIBILITY
2. DIVIDE THE PIPELINE INTO BOUNDED CONTEXTS WITH FIXED DATA INTERFACES
3. ENCAPSULATE DB QUERIES BEHIND AN API FUNCTION
4. USE DEAD LETTER QUEUES FOR RELIABILITY
28. LOAD SECTION
Pipeline: DISPATCH (new snapshot date) → QUEUE (file paths: scan all objects for each folder in the DATA LAKE) → TRANSFORM (get all records and aggregate each with details; merge by type) → DATA STREAMS (records) → ANALYZE & WRITE → DATABASE.
Failed messages go to a DLQ; an ACTOR requests the ”reprocess_messages” REPROCESSING TOPIC, and REPROCESS pulls messages from the DLQ on that request.
29. WRITING TO DATABASES
A LAMBDA MICROSERVICE processes and/or filters data and writes it to the DB; downstream, that data feeds a UI DASHBOARD, ACCOUNTING and FURTHER DATA PROCESSING.
With 1 - 10K messages per batch, each write is a point of failure:
1. RETRY THE WHOLE BATCH?
2. RETRY THE FAILED RECORD MANUALLY IN THE LAMBDA?
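One possible answer to the "retry the batch or retry one record?" dilemma is Lambda's partial batch response: with `ReportBatchItemFailures` enabled on the SQS event source, the handler returns only the ids of the failed messages, and only those are redelivered instead of the whole 1-10K batch. Whether this fits depends on your event source configuration; `writeToDb` below is a hypothetical stand-in for the real write.

```javascript
// SQS handler that reports per-message failures instead of failing the
// whole batch (sketch of the partial batch response shape).
function handler(event, writeToDb) {
  const batchItemFailures = [];
  for (const record of event.Records) {
    try {
      writeToDb(JSON.parse(record.body));
    } catch (err) {
      // only this message is retried; the rest of the batch is done
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}

// Demo: 3 records, the middle one fails on write.
const event = {
  Records: [
    { messageId: 'a', body: '{"id":1}' },
    { messageId: 'b', body: '{"id":2}' },
    { messageId: 'c', body: '{"id":3}' },
  ],
};
const writeToDb = (m) => { if (m.id === 2) throw new Error('constraint violation'); };
console.log(handler(event, writeToDb));
```

This keeps successful writes from being replayed (and double-written) just because one record in a large batch failed.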
36. DATA LOADING PATTERNS
1. DECOUPLE FUNCTIONS BY POINTS OF FAILURE
2. USE INFRASTRUCTURE AS CODE VS ORCHESTRATION IN THE CODE
3. USE AWS STEP FUNCTIONS AND DYNAMODB STREAMS FOR TRANSACTIONS
4. THE PIPELINE IS THE CORE; THINK ABOUT FUTURE PLATFORMS AROUND IT AND HOW THEY WILL ACCESS THE DATA
37. SCALABILITY
DATA → ETL PIPELINE (EXTRACT, TRANSFORM, LOAD) → DATA WAREHOUSE → ANALYSE, VISUALISE.
After STEP 1 and STEP 2: IS THE ETL JOB DONE?
HOW DO YOU KNOW THAT THE JOB IS COMPLETED?
38. THE PROBLEM OF DISTRIBUTED DATA PROCESSING
Both the ETL PIPELINE (EXTRACT → TRANSFORM → LOAD) and the STREAM PROCESSING PIPELINE WRITE IN PARALLEL to the DATA WAREHOUSE.
39. ETL JOB STATE CONTROL
Pipeline: DISPATCH (new snapshot date) → QUEUE (file paths: scan all objects for each folder) → TRANSFORM (get all records and aggregate each with details) → QUEUE (models) → WRITE → DATABASE.
On START, the job is registered in a DYNAMODB JOB TABLE, and each record is tracked in a DYNAMODB PROCESSED RECORD TABLE. DYNAMODB STREAMS evaluate the condition:
JOB IS NOT DONE && COUNT OF NOT-DONE RECORDS === 0 → JOB IS DONE! (END)
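The completion check above can be sketched in memory. In AWS, the decrement would be an atomic DynamoDB update and the condition would be evaluated by a function triggered from DynamoDB Streams; the names and table shape below are illustrative assumptions.

```javascript
// In-memory sketch of the job-state table and the completion condition.
const jobTable = new Map();

function startJob(jobId, totalRecords) {
  // dispatcher registers the job and how many records it emitted
  jobTable.set(jobId, { dispatched: true, pending: totalRecords, done: false });
}

function recordProcessed(jobId) {
  // in AWS: an atomic counter update, checked by a DynamoDB Streams trigger
  const job = jobTable.get(jobId);
  job.pending -= 1;
  // the slide's condition: job is not done && count of not-done records === 0
  if (job.dispatched && !job.done && job.pending === 0) {
    job.done = true; // JOB IS DONE!
  }
  return job.done;
}

// Demo: a job with 3 records completes only after the third record.
startJob('snapshot-1', 3);
console.log(recordProcessed('snapshot-1')); // false
console.log(recordProcessed('snapshot-1')); // false
console.log(recordProcessed('snapshot-1')); // true
```

The `dispatched` flag matters: without it, a job could look "done" while the dispatcher is still scanning folders and the pending count transiently hits zero.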
44. BLUEPRINT
ETL BATCH PROCESSING: a CLOUDWATCH CRON EVENT triggers the ETL JOB RUNNER (AWS STEP FUNCTIONS); a GLUE JOB prepares STRUCTURED DATA in the S3 DATA LAKE; the ETL JOB runs with its JOB STATE in REDIS, and a JOB STATE CONTROLLER (via SNS) tracks the ETL JOB STATE: DONE / NOT DONE.
AD-HOC STREAM PROCESSING: KINESIS → READER → PROCESSORS → AURORA / DYNAMO, with a DYNAMO JOBS table and DLQs at each stage.
5K - 10K FUNCTIONS IN PARALLEL, PROCESSING 10-100 GB/SEC.
45. CONCLUSIONS
1. DESIGN MISTAKES DETERMINE THE SOLUTION'S PRICE
2. IN SERVERLESS YOU CAN BUILD A VERY FLEXIBLE ARCHITECTURE:
YOU CAN SWITCH FROM BATCH TO STREAM PROCESSING
3. YOU CAN RUN THOUSANDS OF LAMBDA FUNCTIONS IN PARALLEL
AND PROCESS GBs OF DATA PER SECOND