This document provides an overview of building a serverless data lake architecture on AWS. It discusses using AWS S3 for storage, AWS Glue for data cataloging and ETL processing, AWS Athena for running SQL queries, and Jupyter Notebooks for exploratory analysis. The full architecture shown brings these services together to allow for ingesting, storing, processing, and analyzing large amounts of data in a serverless and cost-effective manner.
Serverless data lake architecture
1. Serverless data lake
A quickstart guide to your first data lake architecture on AWS.
Maik Wiesmüller
2020-05-27
maik@intenics.io
2. The scope
What is a data lake and why you should care?
What do we want to achieve?
How to get started?
3. What is a data lake and why we should care?
“[...] A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. [...]”
“Data lake”, Wikipedia, Wikimedia Foundation, 23:46, 9 May 2020, https://en.wikipedia.org/wiki/Data_lake.
● explore and discover huge amounts of data.
● get valuable business insights, market trends or patterns.
● or even sell data
● store everything now, analyze later.
● no hard scaling limits
● self-service
4. What do we want to achieve?
Storage
Ingestion
Processing
Analysis
7. AWS S3 as foundation of our data lake
● managed object storage
● organized in “buckets”
● designed for a durability of 99.999999999%
● standard availability of 99.99%
● pay-as-you-go
S3
Storage
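To make the two percentages above more tangible, here is a small arithmetic sketch. It treats the quoted figures as design targets (not guarantees) and assumes, purely for illustration, a bucket holding 10,000,000 objects.

```python
from fractions import Fraction

# Figures quoted on the slide; they are design targets, not guarantees.
DURABILITY = Fraction("99.999999999") / 100  # eleven nines
AVAILABILITY = Fraction("99.99") / 100       # S3 Standard

MINUTES_PER_YEAR = 365 * 24 * 60


def expected_objects_lost_per_year(object_count: int) -> Fraction:
    """Average number of objects lost per year at the design durability."""
    return object_count * (1 - DURABILITY)


def downtime_minutes_per_year() -> Fraction:
    """Unavailable minutes per year implied by the availability figure."""
    return MINUTES_PER_YEAR * (1 - AVAILABILITY)


if __name__ == "__main__":
    # 10,000,000 objects is an illustrative assumption.
    lost = expected_objects_lost_per_year(10_000_000)
    print(f"Expected loss: {float(lost)} objects per year")
    print(f"That is one object every {1 / lost} years on average")
    print(f"Implied downtime: {float(downtime_minutes_per_year()):.2f} minutes per year")
```

Exact fractions are used instead of floats so the eleven nines are not blurred by rounding.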
8. AWS S3 as foundation of our data lake
[Diagram: S3 in the Storage layer, alongside the Ingestion, Processing and Analysis stages]
9. AWS Glue as data catalog
● fully managed service
● integrated data catalog
● metadata in Glue tables
● integrates seamlessly with S3
● Crawlers determine the schema automatically
Glue
Crawler
Glue
Catalog
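A minimal sketch of how a crawler could be pointed at an S3 prefix with the AWS SDK for Python (boto3). The crawler name, role ARN, database name and S3 path are hypothetical placeholders; the Glue client is passed in, so the helper itself does not need AWS credentials to be imported.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for Glue's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler assumes
        "DatabaseName": database,  # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue_client, name, role_arn, database, s3_path):
    """Register the crawler, then trigger one run; returns the crawler name."""
    request = crawler_request(name, role_arn, database, s3_path)
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=name)
    return name


# Usage (assumes boto3 is installed and credentials are configured;
# all names below are placeholders):
#   import boto3
#   create_and_start_crawler(
#       boto3.client("glue"),
#       name="raw-data-crawler",
#       role_arn="arn:aws:iam::123456789012:role/example-glue-role",
#       database="datalake_raw",
#       s3_path="s3://example-datalake-bucket/raw/",
#   )
```

Once the crawler run finishes, the inferred schema should show up as tables in the given catalog database.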
10. AWS S3 and Glue as Storage solution
[Diagram: S3, the Glue Crawler and the Glue Catalog in the Storage layer, alongside the Ingestion, Processing and Analysis stages]
12. Data ingestion to S3
3 methods to get started:
● AWS CLI - using the command line.
● AWS SDK - 9 programming languages.
● Kinesis Data Firehose - managed service
Local backups, web servers, databases, IoT endpoints...
#aws
SDK
CLI
Firehose
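A rough sketch of the SDK and Firehose paths using boto3-style clients. The date-partitioned key layout (`raw/<source>/year=.../month=.../day=...`) is an assumption chosen for illustration, not something the slides prescribe; bucket and stream names are placeholders supplied by the caller.

```python
import json
from datetime import datetime, timezone


def raw_object_key(source, event_time, name):
    """Date-partitioned object key (assumed layout, Hive-style partitions)."""
    return (
        f"raw/{source}/year={event_time:%Y}/month={event_time:%m}/"
        f"day={event_time:%d}/{name}"
    )


def put_json_record(s3_client, bucket, source, record, name, event_time=None):
    """SDK path: write one JSON document straight into the bucket."""
    event_time = event_time or datetime.now(timezone.utc)
    key = raw_object_key(source, event_time, name)
    body = json.dumps(record).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return key


def send_to_firehose(firehose_client, stream_name, record):
    """Firehose path: hand the record to a delivery stream that batches into S3."""
    # Newline-delimited so records stay separable inside the batched object.
    data = (json.dumps(record) + "\n").encode("utf-8")
    firehose_client.put_record(DeliveryStreamName=stream_name, Record={"Data": data})
    return data


# Usage (assumes boto3 and configured credentials; names are placeholders):
#   import boto3
#   put_json_record(boto3.client("s3"), "example-datalake-bucket",
#                   "webserver", {"path": "/", "status": 200}, "event-1.json")
```

The CLI path needs no code: `aws s3 cp` copies a local file to a bucket from the command line.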
15. Process data with AWS Glue
ETL (extract, transform, load) can be done at large scale with AWS Glue.
● fully managed Spark jobs
● Python or Scala
● integrates with data catalog.
● generate ETL code
* link to glue pricing
Glue Jobs
Python/Scala
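To illustrate the three ETL stages themselves, here is a deliberately tiny sketch in plain Python. It is not a Glue job and uses no Glue or Spark API; a real job would express the same steps on Spark at scale. The column names and sample data are made up for the example.

```python
import csv
import io
import json

# Made-up sample input; the second data row has no user id.
RAW = "user_id,amount\nu1,9.50\n,3.00\nu2,12\n"


def extract(csv_text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: drop rows without a user id and cast amount to float."""
    cleaned = []
    for row in rows:
        if not row.get("user_id"):
            continue
        cleaned.append({"user_id": row["user_id"], "amount": float(row["amount"])})
    return cleaned


def load(rows):
    """Load: serialize the cleaned rows as JSON Lines, ready to be written out."""
    return "\n".join(json.dumps(row) for row in rows)


if __name__ == "__main__":
    print(load(transform(extract(RAW))))
```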
18. Explore data with SQL queries
Use AWS Athena to explore our data with SQL queries.
● managed query service
● CSV, JSON, ORC, Avro, and Parquet
● integrates with Glue data catalog
● standard SQL
● create new data sets out of query results
Athena
SQL
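A minimal sketch of running an Athena query from Python with a boto3-style client: submit the query, then poll until it reaches a final state. Database, table and output location are placeholders. The second helper builds a CREATE TABLE AS statement for the "new data sets out of query results" bullet; it does plain string formatting, so it is only suitable for trusted inputs.

```python
import time


def run_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait for it to finish, and return its execution id."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Query {query_id} ended in state {state}")
        time.sleep(poll_seconds)  # still QUEUED or RUNNING


def ctas_statement(new_table, select_sql, external_location):
    """Build a CREATE TABLE AS statement that stores the result as Parquet."""
    return (
        f"CREATE TABLE {new_table} "
        f"WITH (format = 'PARQUET', external_location = '{external_location}') "
        f"AS {select_sql}"
    )


# Usage (assumes boto3 and configured credentials; names are placeholders):
#   import boto3
#   athena = boto3.client("athena")
#   run_query(athena, "SELECT * FROM events LIMIT 10",
#             "datalake_raw", "s3://example-athena-results/")
```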
19. Explore data with SQL queries
[Diagram: Ingestion (SDK, CLI, Firehose), Storage (S3, Glue Crawler, Glue Catalog), Processing (Glue Jobs in Python/Scala), Analysis (Athena SQL)]
20. Explore data with Python
Jupyter Notebooks are documents that contain live code, equations, visualizations and explanatory text.
● ready-to-use Jupyter installation for Python.
● connected to our data catalog
● AWS example notebooks or take a look at https://jupyter.org.
● stop and resume
* link to sagemaker pricing
** link to glue pricing
Jupyter
Notebooks
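As one possible notebook cell, the sketch below turns a query result into plain Python rows and counts values in a column. It assumes the response shape returned by Athena's GetQueryResults for a SELECT query (first row holds the column names); the sample response and its column names are made up.

```python
from collections import Counter

# Made-up sample shaped like one page of an Athena GetQueryResults response.
SAMPLE = {
    "ResultSet": {
        "Rows": [
            {"Data": [{"VarCharValue": "country"}, {"VarCharValue": "status"}]},
            {"Data": [{"VarCharValue": "DE"}, {"VarCharValue": "200"}]},
            {"Data": [{"VarCharValue": "DE"}, {"VarCharValue": "404"}]},
            {"Data": [{"VarCharValue": "FR"}, {"VarCharValue": "200"}]},
        ]
    }
}


def rows_to_dicts(result):
    """Convert a result page into a list of dicts, using the first row as header."""
    rows = result["ResultSet"]["Rows"]
    if not rows:
        return []
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]


def count_by(records, column):
    """Count how often each value occurs in the given column."""
    return Counter(record[column] for record in records)


if __name__ == "__main__":
    records = rows_to_dicts(SAMPLE)
    print(count_by(records, "country"))
```

All values arrive as strings (or None for NULL), so numeric columns need an explicit cast before any arithmetic.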
21. The full picture
[Diagram: Ingestion (SDK, CLI, Firehose), Storage (S3, Glue Crawler, Glue Catalog), Processing (Glue Jobs in Python/Scala), Analysis (Athena SQL, Jupyter / Glue Notebooks)]
22. That's a lot of services
AWS Lake Formation: central dashboard
[Diagram: Lake Formation together with S3 Storage, Glue Crawler, Glue Data catalog, Glue Jobs, Athena SQL and Jupyter / Glue Notebooks]
23. What to consider next
● Access management
● IAM users and roles
● Bucket policies
● Service Roles
● Encryption
● Metrics
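As a starting point for the bucket policy and encryption items, here is a hedged sketch using boto3-style calls. It shows one common hardening policy (deny any request not sent over TLS) and default server-side encryption with S3-managed keys; whether these fit a given data lake is a judgment call, and the bucket name is supplied by the caller.

```python
import json


def tls_only_policy(bucket):
    """Bucket policy document that denies all requests made without TLS."""
    arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [arn, f"{arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


def harden_bucket(s3_client, bucket):
    """Attach the TLS-only policy and enable default SSE-S3 encryption."""
    s3_client.put_bucket_policy(Bucket=bucket, Policy=json.dumps(tls_only_policy(bucket)))
    s3_client.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
        },
    )


# Usage (assumes boto3 and configured credentials; bucket name is a placeholder):
#   import boto3
#   harden_bucket(boto3.client("s3"), "example-datalake-bucket")
```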
24. Maik Wiesmüller
Cloud solutions consultant @ Intenics with 20+ years of experience in various IT positions
maik@intenics.io
+49 176 614 39 280
intenics.io
/wiesmueller
If you want to know more about serverless data lake design, visit us at intenics.io
25. You can also download the more detailed version of this guide here:
https://pages.intenics.io/download-your-copy-of-our-quick-start-guide-to-you-first-data-lake