Serverless data lake architecture

Serverless data lake
A quickstart guide to your first data lake architecture on AWS.
Maik Wiesmüller
2020-05-27
maik@intenics.io

The scope
What is a data lake and why should you care?
What do we want to achieve?
How to get started?

What is a data lake and why should we care?
“[...] A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. [...]”
“Data lake”, Wikipedia, Wikimedia Foundation, 23:46, 9 May 2020, https://en.wikipedia.org/wiki/Data_lake.
● explore and discover huge amounts of data.
● get valuable business insights, market trends or patterns.
● or even sell data
● store everything now, analyze later.
● no hard scaling limits
● self-service

What do we want to achieve?
Storage
Ingestion
Processing
Analysis

What is really needed?
Ingestion
Storage
Processing
Analysis

Storage
Where to put all the data
and how to keep track of it?

AWS S3 as foundation of our data lake
● managed object storage
● organized in “buckets”
● designed for a durability of 99.999999999%
● standard availability of 99.99%
● pay-as-you-go
[Diagram: S3 Storage]
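
A minimal sketch of how files could be laid out and uploaded to a bucket with the AWS SDK for Python (boto3). The `raw/` zone, the date-partitioned key layout and the function names are assumptions for illustration; the slides do not prescribe a layout.

```python
from datetime import date
from pathlib import Path


def build_object_key(source: str, day: date, filename: str, zone: str = "raw") -> str:
    """Build a date-partitioned key, e.g. raw/weblogs/year=2020/month=05/day=27/a.json."""
    return (
        f"{zone}/{source}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


def upload_to_lake(local_path: str, bucket: str, source: str, day: date) -> str:
    """Upload one local file to the bucket and return the key it was stored under."""
    import boto3  # imported lazily so the key helper works without AWS access

    key = build_object_key(source, day, Path(local_path).name)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key
```

A call such as `upload_to_lake("access.log", "my-data-lake-bucket", "weblogs", date(2020, 5, 27))` would need valid AWS credentials and an existing bucket; the bucket name here is hypothetical.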

AWS S3 as foundation of our data lake
[Diagram: Ingestion | Storage (S3) | Processing | Analysis]

AWS Glue as data catalog
● fully managed service
● integrated data catalog
● metadata in Glue tables
● integrates seamlessly with S3
● crawlers determine the schema automatically
[Diagram: Glue Crawler, Glue Catalog]
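
As a rough sketch, a crawler that scans an S3 prefix and writes the discovered tables into a catalog database could be set up with boto3 like this. The crawler name, role ARN, database name and S3 path are placeholders, and the IAM role is assumed to exist already.

```python
def crawler_definition(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Keyword arguments for glue.create_crawler with a single S3 target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(definition: dict) -> None:
    """Register the crawler and trigger its first run."""
    import boto3  # imported lazily so the definition helper works without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**definition)
    glue.start_crawler(Name=definition["Name"])
```

Once the crawler run has finished, the tables it inferred should show up in the Glue data catalog under the given database.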

AWS S3 and Glue as Storage solution
[Diagram: Ingestion | Storage (S3, Glue Crawler, Glue Catalog) | Processing | Analysis]

Ingestion
How to get data loaded
into the lake?

Data ingestion to S3
Local backups, web servers, databases, IoT endpoints...
3 methods to get started:
● AWS CLI - using the command line
● AWS SDK - 9 programming languages
● Kinesis Data Firehose - managed service
[Diagram: SDK, CLI, Firehose]
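
For the Firehose route, a minimal sketch of sending single events from Python might look as follows. The delivery stream is assumed to exist and to deliver to the lake bucket; the stream name, the event fields and the newline-delimited JSON encoding are illustrative choices.

```python
import json


def to_firehose_record(event: dict) -> dict:
    """Encode an event as one line of JSON; the trailing newline keeps
    records separable once Firehose concatenates them into S3 objects."""
    payload = json.dumps(event, separators=(",", ":")) + "\n"
    return {"Data": payload.encode("utf-8")}


def send_event(stream_name: str, event: dict) -> dict:
    """Put a single record on the delivery stream and return the API response."""
    import boto3  # imported lazily so the encoding helper works without AWS access

    return boto3.client("firehose").put_record(
        DeliveryStreamName=stream_name,
        Record=to_firehose_record(event),
    )
```

For higher volumes, batching records is usually preferable to one call per event; this sketch keeps to the single-record case for brevity.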

AWS S3 and Glue Catalog as Storage
[Diagram: Ingestion (SDK, CLI, Firehose) | Storage (S3, Glue Crawler, Glue Catalog) | Processing | Analysis]

Processing
How to extract, transform
and load huge amounts of
data (ETL)?

Process data with AWS Glue
extract, transform, load -> ETL
Can be done with AWS Glue at large scale.
● fully managed Spark jobs
● Python or Scala
● integrates with the data catalog
● generates ETL code
* link to glue pricing
[Diagram: Glue Jobs (Python/Scala)]
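
The job script itself runs inside Glue, but starting a run can be sketched from any Python environment with boto3. The job is assumed to be defined already; the job name and the parameter names below are hypothetical.

```python
def job_arguments(params: dict) -> dict:
    """Glue expects job parameters as strings under keys prefixed with '--'."""
    return {f"--{name}": str(value) for name, value in params.items()}


def start_etl_job(job_name: str, params: dict) -> str:
    """Start one run of an existing Glue job and return its run id."""
    import boto3  # imported lazily so the argument helper works without AWS access

    response = boto3.client("glue").start_job_run(
        JobName=job_name,
        Arguments=job_arguments(params),
    )
    return response["JobRunId"]
```

Something like `start_etl_job("weblogs-to-parquet", {"source_path": "s3://my-data-lake-bucket/raw/weblogs/"})` would then trigger the job with that parameter available to the script.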

Process data with AWS Glue
[Diagram: Ingestion (SDK, CLI, Firehose) | Storage (S3, Glue Crawler, Glue Catalog) | Processing (Glue Jobs, Python/Scala) | Analysis]

Analysis
How to get insights from
our data?

Explore data with SQL queries
Use AWS Athena to explore our data with SQL queries.
● managed query service
● CSV, JSON, ORC, Avro, and Parquet
● integrates with the Glue data catalog
● standard SQL
● create new data sets out of query results
[Diagram: Athena (SQL)]
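
To illustrate the last bullet, a new data set can be created from a query result with a CREATE TABLE AS SELECT (CTAS) statement. The sketch below builds such a statement and submits it with boto3; table, database and bucket names are placeholders, and the plain string formatting is only suitable for trusted input.

```python
def ctas_statement(new_table: str, select_sql: str, output_location: str,
                   data_format: str = "PARQUET") -> str:
    """Build an Athena CTAS statement that stores the SELECT result as a new table."""
    return (
        f"CREATE TABLE {new_table} "
        f"WITH (format = '{data_format}', external_location = '{output_location}') "
        f"AS {select_sql}"
    )


def run_query(sql: str, database: str, results_location: str) -> str:
    """Submit a query to Athena and return its execution id (the query runs asynchronously)."""
    import boto3  # imported lazily so the SQL helper works without AWS access

    response = boto3.client("athena").start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": results_location},
    )
    return response["QueryExecutionId"]
```

The returned execution id can be used to poll for completion before the new table is queried.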

Explore data with SQL queries
[Diagram: Ingestion (SDK, CLI, Firehose) | Storage (S3, Glue Crawler, Glue Catalog) | Processing (Glue Jobs, Python/Scala) | Analysis (Athena, SQL)]

Explore data with Python
Jupyter Notebooks are documents that contain live code, equations, visualizations and explanatory text.
● ready-to-use Jupyter installation for Python
● connected to our data catalog
● AWS example notebooks, or take a look at https://jupyter.org
● stop and resume
* link to sagemaker pricing
** link to glue pricing
[Diagram: Jupyter Notebooks]
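
One way a notebook cell could pull data for exploration is to read the result of a finished Athena query. This is only a sketch under assumptions: it uses boto3, reads the first page of results only, and expects the first row to hold the column names, as is the case for SELECT queries.

```python
def rows_as_dicts(result: dict) -> list:
    """Turn an Athena get_query_results response into a list of row dicts.

    The first row is treated as the header; NULL cells become None.
    """
    rows = result["ResultSet"]["Rows"]
    if not rows:
        return []
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]


def fetch_rows(query_execution_id: str) -> list:
    """Fetch the first page of results for a completed query."""
    import boto3  # imported lazily so the parsing helper works without AWS access

    result = boto3.client("athena").get_query_results(
        QueryExecutionId=query_execution_id
    )
    return rows_as_dicts(result)
```

All values arrive as strings, so numeric columns would still need converting before plotting or aggregating in the notebook.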

The full picture
[Diagram: Ingestion (SDK, CLI, Firehose) | Storage (S3, Glue Crawler, Glue Catalog) | Processing (Glue Jobs, Python/Scala) | Analysis (Athena, SQL; Jupyter / Glue Notebooks)]

That's a lot of services
AWS Lake Formation: central dashboard
[Diagram: Lake Formation with S3 Storage, Glue Crawler, Glue Data catalog, Glue Jobs, Athena (SQL), Jupyter / Glue Notebooks]

What to consider next
● Access management
● IAM users and roles
● Bucket policies
● Service Roles
● Encryption
● Metrics
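
As one small example for the bucket-policy and encryption items, the sketch below builds a policy that denies any request not made over TLS and attaches it to a bucket with boto3. The bucket name is a placeholder and the policy is a common pattern, not something taken from the slides; a real setup would likely combine it with further statements.

```python
import json


def tls_only_policy(bucket: str) -> dict:
    """Bucket policy that denies all S3 actions over unencrypted connections."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


def apply_policy(bucket: str) -> None:
    """Attach the policy to the bucket, replacing any existing bucket policy."""
    import boto3  # imported lazily so the policy helper works without AWS access

    boto3.client("s3").put_bucket_policy(
        Bucket=bucket,
        Policy=json.dumps(tls_only_policy(bucket)),
    )
```

Because `put_bucket_policy` replaces the whole policy, existing statements would have to be merged in before applying it to a bucket that already has one.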

Maik Wiesmüller
Cloud solutions consultant @ Intenics with 20+ years of experience in various IT positions
maik@intenics.io
+49 176 614 39 280
intenics.io
/wiesmueller
If you want to know more about serverless data lake design, visit us at intenics.io

You can also download the more detailed version of this guide here:
https://pages.intenics.io/download-your-copy-of-our-quick-start-guide-to-you-first-data-lake
