Serverless data lake
A quickstart guide to your
first data lake architecture
on AWS.
Maik Wiesmüller
2020-05-27
maik@intenics.io
2
The scope
What is a data lake and why you
should care?
What do we want to achieve?
How to get started?
3
What is a data lake and why we should care?
“[...] A data lake is a system or repository of data stored in its natural/raw format,
usually object blobs or files. A data lake is usually a single store of all enterprise data
including raw copies of source system data and transformed data used for tasks
such as reporting, visualization, advanced analytics and machine learning. [...]”
“Data lake”, Wikipedia, Wikimedia Foundation, 23:46, 9 May 2020, https://en.wikipedia.org/wiki/Data_lake.
● explore and discover huge amounts of data.
● get valuable business insights, market trends or patterns.
● or even sell data
● store everything now, analyze later.
● no hard scaling limits
● self-service
4
What do we want to achieve?
[Diagram: Ingestion, Storage, Processing, Analysis]
5
What is really needed?
Processing
Analysis
6
Storage
Where to put all the data
and how to keep track of it?
7
AWS S3 as foundation of our data lake
● managed object storage.
● organized in “buckets”
● designed for a durability of 99.999999999%
● standard availability of 99.99%
● pay-as-you-go
[Icon: S3 Storage]
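To make the durability and availability figures more tangible, here is a small arithmetic sketch. It treats both percentages as per-year design targets, which is a simplifying assumption; the function names and the lake size of 10 million objects are made up for illustration.

```python
# Rough arithmetic behind the S3 design targets quoted on the slide.
# Assumption: both figures are treated as per-year design targets,
# not as measured values or guarantees.

DURABILITY = 0.99999999999   # 99.999999999 % (eleven nines)
AVAILABILITY = 0.9999        # 99.99 %

MINUTES_PER_YEAR = 365 * 24 * 60


def expected_objects_lost_per_year(object_count: int,
                                   durability: float = DURABILITY) -> float:
    """Expected number of objects lost in one year at the given durability."""
    return object_count * (1.0 - durability)


def expected_downtime_minutes_per_year(availability: float = AVAILABILITY) -> float:
    """Minutes per year in which the service may be unavailable."""
    return MINUTES_PER_YEAR * (1.0 - availability)


if __name__ == "__main__":
    objects = 10_000_000  # hypothetical lake size
    lost = expected_objects_lost_per_year(objects)
    print(f"{objects:,} objects: about {lost:.4f} expected losses per year")
    print(f"downtime budget: about {expected_downtime_minutes_per_year():.1f} minutes per year")
```

For a lake of 10 million objects this works out to about one lost object every 10,000 years on average, and a downtime budget of roughly 53 minutes per year.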
8
AWS S3 as foundation of our data lake
[Diagram: Ingestion, Storage (S3), Processing, Analysis]
9
AWS Glue as data catalog
● fully managed service
● integrated data catalog
● metadata in Glue tables.
● integrates seamlessly with S3
● Crawlers determine the schema automatically
[Icons: Glue Crawler, Glue Catalog]
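As a rough illustration of how a crawler can be pointed at an S3 prefix, the sketch below uses the boto3 Glue client. The crawler name, role ARN, database name and bucket path are placeholders, and the helper names are hypothetical; running the second function needs boto3 and valid AWS credentials.

```python
def crawler_definition(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build the keyword arguments for the Glue CreateCrawler call."""
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(definition: dict) -> None:
    """Register the crawler and run it once (requires boto3 and credentials)."""
    import boto3  # imported lazily so the helper above works without it

    glue = boto3.client("glue")
    glue.create_crawler(**definition)
    glue.start_crawler(Name=definition["Name"])


if __name__ == "__main__":
    # All values below are placeholders, not real resources.
    definition = crawler_definition(
        name="example-raw-crawler",
        role_arn="arn:aws:iam::123456789012:role/example-glue-role",
        database="example_lake",
        s3_path="s3://example-lake-bucket/raw/",
    )
    print(definition)
```

After a successful run, the tables the crawler inferred should show up in the named catalog database.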
10
AWS S3 and Glue as Storage solution
[Diagram: Ingestion, Storage (S3, Glue Crawler, Glue Catalog), Processing, Analysis]
11
Ingestion
How to get data loaded
into the lake?
3 methods to get started:
● AWS CLI - using the command line.
● AWS SDK - 9 programming languages.
● Kinesis Data Firehose - managed service
Local backups, web-servers, databases, IoT Endpoints...
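The SDK and Firehose routes could look roughly like the sketch below, written against boto3. Bucket, key and stream names are placeholders and the helper names are hypothetical. The batching reflects the Firehose limit of 500 records per PutRecordBatch call; the CLI route would be a plain `aws s3 cp <file> s3://<bucket>/<key>`.

```python
import json
from typing import Iterable, Iterator, List

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_records(events: Iterable[dict]) -> List[dict]:
    """Encode each event as one newline-terminated JSON record."""
    return [{"Data": (json.dumps(event) + "\n").encode("utf-8")} for event in events]


def batches(records: List[dict], size: int = MAX_RECORDS_PER_BATCH) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def upload_file(path: str, bucket: str, key: str) -> None:
    """SDK route: copy one local file into the lake bucket."""
    import boto3  # lazy import; needs boto3 and AWS credentials

    boto3.client("s3").upload_file(path, bucket, key)


def send_events(stream_name: str, events: Iterable[dict]) -> None:
    """Firehose route: push events in batches to a delivery stream."""
    import boto3  # lazy import; needs boto3 and AWS credentials

    firehose = boto3.client("firehose")
    for batch in batches(to_firehose_records(events)):
        # A production version should inspect FailedPutCount in the
        # response and retry the failed records.
        firehose.put_record_batch(DeliveryStreamName=stream_name, Records=batch)
```

Newline-delimited JSON is chosen here only because it is easy for crawlers and query engines to read; any format the downstream tools understand would do.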
12
Data ingestion to S3
[Icons: AWS SDK, CLI, Firehose]
13
AWS S3 and Glue Catalog as Storage
[Diagram: Ingestion (AWS SDK, CLI, Firehose), Storage (S3, Glue Crawler, Glue Catalog), Processing, Analysis]
14
Processing
How to extract, transform
and load huge amounts of
data (ETL)?
15
Process data with AWS Glue
Extract, transform, load (ETL)
can be done with AWS Glue at large scale.
● fully managed Spark jobs
● Python or Scala
● integrates with data catalog.
● generate ETL code
* link to glue pricing
[Icon: Glue Jobs (Python/Scala)]
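Once a Glue job exists, it can be started from Python. The sketch below assumes a job that was already created in the console or via the API; the job name and parameter names are invented for illustration. Glue expects job parameters as `--name` keys, which the first helper takes care of.

```python
def job_arguments(**params: str) -> dict:
    """Turn keyword parameters into Glue's '--name': 'value' argument map."""
    return {f"--{name}": str(value) for name, value in params.items()}


def start_etl_job(job_name: str, **params: str) -> str:
    """Start one run of an existing Glue job and return its run id."""
    import boto3  # lazy import; needs boto3 and AWS credentials

    glue = boto3.client("glue")
    response = glue.start_job_run(JobName=job_name,
                                  Arguments=job_arguments(**params))
    return response["JobRunId"]


if __name__ == "__main__":
    # Hypothetical parameters; the job script decides which ones it reads.
    print(job_arguments(source_table="raw_events",
                        target_path="s3://example-lake-bucket/curated/"))
```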
16
Process data with AWS Glue
[Diagram: Ingestion (AWS SDK, CLI, Firehose), Storage (S3, Glue Crawler, Glue Catalog), Processing (Glue Jobs, Python/Scala), Analysis]
17
Analysis
How to get insights from
our data?
18
Explore data with SQL queries
Use Amazon Athena to explore our data with SQL queries.
● managed query service
● CSV, JSON, ORC, Avro, and Parquet
● integrates with Glue data catalog
● standard SQL
● create new data sets out of query results
[Icon: Athena (SQL)]
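Creating a new data set out of a query result can be done with a CREATE TABLE AS SELECT (CTAS) statement. The sketch below builds such a statement and submits it through boto3; table names, the database and all S3 locations are placeholders, and the helper names are hypothetical. The SQL is assembled by plain string formatting, so it should only be fed trusted input.

```python
def ctas_statement(new_table: str, select_sql: str, output_location: str) -> str:
    """Build an Athena CTAS statement that stores the result as Parquet."""
    return (
        f"CREATE TABLE {new_table}\n"
        f"WITH (format = 'PARQUET', external_location = '{output_location}')\n"
        f"AS {select_sql}"
    )


def run_query(sql: str, database: str, results_location: str) -> str:
    """Submit a query to Athena and return the query execution id."""
    import boto3  # lazy import; needs boto3 and AWS credentials

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": results_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    print(ctas_statement(
        new_table="curated_events",
        select_sql="SELECT id, created_at FROM raw_events",
        output_location="s3://example-lake-bucket/curated/events/",
    ))
```

Athena runs queries asynchronously, so a caller would normally poll the execution id until the query has finished.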
19
Explore data with SQL queries
[Diagram: Ingestion (AWS SDK, CLI, Firehose), Storage (S3, Glue Crawler, Glue Catalog), Processing (Glue Jobs, Python/Scala), Analysis (Athena, SQL)]
20
Explore data with Python
Jupyter Notebooks are documents that contain live code, equations,
visualizations and explanatory text.
● ready-to-use Jupyter installation for Python.
● connected to our data catalog
● AWS example notebooks or take a look at https://jupyter.org.
● stop and resume
* link to sagemaker pricing
** link to glue pricing
[Icon: Jupyter Notebooks]
21
The full picture
[Diagram: Ingestion (AWS SDK, CLI, Firehose), Storage (S3, Glue Crawler, Glue Catalog), Processing (Glue Jobs, Python/Scala), Analysis (Athena SQL, Jupyter / Glue Notebooks)]
22
That's a lot of services
AWS Lake Formation
central dashboard
[Diagram: Lake Formation, S3 Storage, Glue Crawler, Glue Data Catalog, Glue Jobs, Athena (SQL), Jupyter / Glue Notebooks]
23
What to consider next
● Access management
● IAM users and roles
● Bucket policies
● Service Roles
● Encryption
● Metrics
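As one possible starting point for the bucket policy and encryption items, the sketch below builds a policy that denies any request to the lake bucket that does not use TLS. This covers encryption in transit only and is just one of many reasonable policies; the bucket name and helper names are placeholders.

```python
import json


def deny_insecure_transport_policy(bucket: str) -> dict:
    """Bucket policy that rejects every request not sent over TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


def apply_policy(bucket: str, policy: dict) -> None:
    """Attach the policy to the bucket (requires boto3 and credentials)."""
    import boto3  # lazy import so the policy builder works without it

    boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))


if __name__ == "__main__":
    print(json.dumps(deny_insecure_transport_policy("example-lake-bucket"), indent=2))
```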
24
maik@intenics.io
+49 176 614 39 280
intenics.io
/wiesmueller
Maik Wiesmüller
Cloud solutions consultant @ Intenics
with 20+ years of experience in various
IT positions
If you want to know more about serverless data lake design, visit us at intenics.io
You can also download the more detailed version of this guide here:
https://pages.intenics.io/download-your-copy-of-our-quick-start-guide-to-you-first-data-lake

More Related Content

What's hot

Kickstart your data strategy for 2018: Getting started with Amazon Redshift
Kickstart your data strategy for 2018: Getting started with Amazon RedshiftKickstart your data strategy for 2018: Getting started with Amazon Redshift
Kickstart your data strategy for 2018: Getting started with Amazon Redshift
Matillion
 
Cloudian HyperStore Operating Environment
Cloudian HyperStore Operating EnvironmentCloudian HyperStore Operating Environment
Cloudian HyperStore Operating Environment
Cloudian
 
Spark + Flashblade: Spark Summit East talk by Brian Gold
Spark + Flashblade: Spark Summit East talk by Brian GoldSpark + Flashblade: Spark Summit East talk by Brian Gold
Spark + Flashblade: Spark Summit East talk by Brian Gold
Spark Summit
 
Unleash the power of Azure Data Factory
Unleash the power of Azure Data Factory Unleash the power of Azure Data Factory
Unleash the power of Azure Data Factory
Sergio Zenatti Filho
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
Kujambu Murugesan
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
Adam Doyle
 
How to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and ApplicationsHow to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and Applications
Alluxio, Inc.
 
Azure Data Lake Store and Analytics
Azure Data Lake Store and AnalyticsAzure Data Lake Store and Analytics
Azure Data Lake Store and Analytics
Sergio Zenatti Filho
 
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and LogstashKeeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Amazon Web Services
 
All data accessible to all my organization - Presentation at OW2con'19, June...
 All data accessible to all my organization - Presentation at OW2con'19, June... All data accessible to all my organization - Presentation at OW2con'19, June...
All data accessible to all my organization - Presentation at OW2con'19, June...
OW2
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
Alluxio, Inc.
 
Xanadu Big Data Platform Technology Introduction
Xanadu Big Data Platform Technology IntroductionXanadu Big Data Platform Technology Introduction
Xanadu Big Data Platform Technology Introduction
Alex G. Lee, Ph.D. Esq. CLP
 
Open source big data landscape and possible ITS applications
Open source big data landscape and possible ITS applicationsOpen source big data landscape and possible ITS applications
Open source big data landscape and possible ITS applications
SoftwareMill
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Au cœur de la roadmap de la Suite Elastic
Au cœur de la roadmap de la Suite ElasticAu cœur de la roadmap de la Suite Elastic
Au cœur de la roadmap de la Suite Elastic
Elasticsearch
 
Big data introduction (HackTM 2016)
Big data introduction (HackTM 2016)Big data introduction (HackTM 2016)
Big data introduction (HackTM 2016)
Moldovan Radu Adrian
 
Big data advanced topics - part I
Big data   advanced topics - part IBig data   advanced topics - part I
Big data advanced topics - part I
Moldovan Radu Adrian
 
Elastic Stack Roadmap
Elastic Stack RoadmapElastic Stack Roadmap
Elastic Stack Roadmap
Imma Valls Bernaus
 
Azure Data Lake and Azure Data Lake Analytics
Azure Data Lake and Azure Data Lake AnalyticsAzure Data Lake and Azure Data Lake Analytics
Azure Data Lake and Azure Data Lake Analytics
Waqas Idrees
 

What's hot (20)

Kickstart your data strategy for 2018: Getting started with Amazon Redshift
Kickstart your data strategy for 2018: Getting started with Amazon RedshiftKickstart your data strategy for 2018: Getting started with Amazon Redshift
Kickstart your data strategy for 2018: Getting started with Amazon Redshift
 
Cloudian HyperStore Operating Environment
Cloudian HyperStore Operating EnvironmentCloudian HyperStore Operating Environment
Cloudian HyperStore Operating Environment
 
Spark + Flashblade: Spark Summit East talk by Brian Gold
Spark + Flashblade: Spark Summit East talk by Brian GoldSpark + Flashblade: Spark Summit East talk by Brian Gold
Spark + Flashblade: Spark Summit East talk by Brian Gold
 
Unleash the power of Azure Data Factory
Unleash the power of Azure Data Factory Unleash the power of Azure Data Factory
Unleash the power of Azure Data Factory
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
How to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and ApplicationsHow to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and Applications
 
Azure Data Lake Store and Analytics
Azure Data Lake Store and AnalyticsAzure Data Lake Store and Analytics
Azure Data Lake Store and Analytics
 
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and LogstashKeeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
 
All data accessible to all my organization - Presentation at OW2con'19, June...
 All data accessible to all my organization - Presentation at OW2con'19, June... All data accessible to all my organization - Presentation at OW2con'19, June...
All data accessible to all my organization - Presentation at OW2con'19, June...
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Xanadu Big Data Platform Technology Introduction
Xanadu Big Data Platform Technology IntroductionXanadu Big Data Platform Technology Introduction
Xanadu Big Data Platform Technology Introduction
 
Open source big data landscape and possible ITS applications
Open source big data landscape and possible ITS applicationsOpen source big data landscape and possible ITS applications
Open source big data landscape and possible ITS applications
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Au cœur de la roadmap de la Suite Elastic
Au cœur de la roadmap de la Suite ElasticAu cœur de la roadmap de la Suite Elastic
Au cœur de la roadmap de la Suite Elastic
 
Big data introduction (HackTM 2016)
Big data introduction (HackTM 2016)Big data introduction (HackTM 2016)
Big data introduction (HackTM 2016)
 
Big data advanced topics - part I
Big data   advanced topics - part IBig data   advanced topics - part I
Big data advanced topics - part I
 
Elastic Stack Roadmap
Elastic Stack RoadmapElastic Stack Roadmap
Elastic Stack Roadmap
 
Azure Data Lake and Azure Data Lake Analytics
Azure Data Lake and Azure Data Lake AnalyticsAzure Data Lake and Azure Data Lake Analytics
Azure Data Lake and Azure Data Lake Analytics
 

Similar to Serverless data lake architecture

IBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lakeIBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lake
Torsten Steinbach
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
Amazon Web Services
 
Big Data Building Blocks with AWS Cloud
Big Data Building Blocks with AWS CloudBig Data Building Blocks with AWS Cloud
Big Data Building Blocks with AWS Cloud
Blazeclan Technologies Private Limited
 
AWS Tech Talks - Data Lake Analytics
AWS Tech Talks - Data Lake AnalyticsAWS Tech Talks - Data Lake Analytics
AWS Tech Talks - Data Lake Analytics
Amazon Web Services LATAM
 
Aws meetup 20190427
Aws meetup 20190427Aws meetup 20190427
Aws meetup 20190427
Sridevi Murugayen
 
Data Con LA 2018 - A tale of two BI standards: Data warehouses and data lakes...
Data Con LA 2018 - A tale of two BI standards: Data warehouses and data lakes...Data Con LA 2018 - A tale of two BI standards: Data warehouses and data lakes...
Data Con LA 2018 - A tale of two BI standards: Data warehouses and data lakes...
Data Con LA
 
Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AI
Torsten Steinbach
 
AWS Data Lakes & Best Practices - GoDgtl
AWS Data Lakes & Best Practices - GoDgtlAWS Data Lakes & Best Practices - GoDgtl
AWS Data Lakes & Best Practices - GoDgtl
Mezzybatliwala
 
AWS Data Lakes and Best Practices
AWS Data Lakes and Best PracticesAWS Data Lakes and Best Practices
AWS Data Lakes and Best Practices
PeeterParkar
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
Amazon Web Services
 
Big Data LDN 2018: A TALE OF TWO BI STANDARDS: DATA WAREHOUSES AND DATA LAKES
Big Data LDN 2018: A TALE OF TWO BI STANDARDS: DATA WAREHOUSES AND DATA LAKESBig Data LDN 2018: A TALE OF TWO BI STANDARDS: DATA WAREHOUSES AND DATA LAKES
Big Data LDN 2018: A TALE OF TWO BI STANDARDS: DATA WAREHOUSES AND DATA LAKES
Matt Stubbs
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data Lake
DATAVERSITY
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Amazon Web Services
 
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
Amazon Web Services
 
Serverless SQL
Serverless SQLServerless SQL
Serverless SQL
Torsten Steinbach
 
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWSACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
AWS User Group Kochi
 
Implementing a Data Lake
Implementing a Data LakeImplementing a Data Lake
Implementing a Data Lake
Amazon Web Services
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
Roy Kim
 
Building your First Big Data Application on AWS
Building your First Big Data Application on AWSBuilding your First Big Data Application on AWS
Building your First Big Data Application on AWS
Amazon Web Services
 

Similar to Serverless data lake architecture (20)

IBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lakeIBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lake
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Big Data Building Blocks with AWS Cloud
Big Data Building Blocks with AWS CloudBig Data Building Blocks with AWS Cloud
Big Data Building Blocks with AWS Cloud
 
AWS Tech Talks - Data Lake Analytics
AWS Tech Talks - Data Lake AnalyticsAWS Tech Talks - Data Lake Analytics
AWS Tech Talks - Data Lake Analytics
 
Aws meetup 20190427
Aws meetup 20190427Aws meetup 20190427
Aws meetup 20190427
 
Data Con LA 2018 - A tale of two BI standards: Data warehouses and data lakes...
Data Con LA 2018 - A tale of two BI standards: Data warehouses and data lakes...Data Con LA 2018 - A tale of two BI standards: Data warehouses and data lakes...
Data Con LA 2018 - A tale of two BI standards: Data warehouses and data lakes...
 
Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AI
 
AWS Data Lakes & Best Practices - GoDgtl
AWS Data Lakes & Best Practices - GoDgtlAWS Data Lakes & Best Practices - GoDgtl
AWS Data Lakes & Best Practices - GoDgtl
 
AWS Data Lakes and Best Practices
AWS Data Lakes and Best PracticesAWS Data Lakes and Best Practices
AWS Data Lakes and Best Practices
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Big Data LDN 2018: A TALE OF TWO BI STANDARDS: DATA WAREHOUSES AND DATA LAKES
Big Data LDN 2018: A TALE OF TWO BI STANDARDS: DATA WAREHOUSES AND DATA LAKESBig Data LDN 2018: A TALE OF TWO BI STANDARDS: DATA WAREHOUSES AND DATA LAKES
Big Data LDN 2018: A TALE OF TWO BI STANDARDS: DATA WAREHOUSES AND DATA LAKES
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data Lake
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
 
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
 
Serverless SQL
Serverless SQLServerless SQL
Serverless SQL
 
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWSACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
 
Implementing a Data Lake
Implementing a Data LakeImplementing a Data Lake
Implementing a Data Lake
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
 
Building your First Big Data Application on AWS
Building your First Big Data Application on AWSBuilding your First Big Data Application on AWS
Building your First Big Data Application on AWS
 

More from Maik Wiesmüller

Serverless lessons learned #8 backoff
Serverless lessons learned #8 backoffServerless lessons learned #8 backoff
Serverless lessons learned #8 backoff
Maik Wiesmüller
 
Serverless lessons learned #7 rate limiting
Serverless lessons learned #7 rate limitingServerless lessons learned #7 rate limiting
Serverless lessons learned #7 rate limiting
Maik Wiesmüller
 
Serverless lessons learned #6 delivery strategies
Serverless lessons learned #6 delivery strategiesServerless lessons learned #6 delivery strategies
Serverless lessons learned #6 delivery strategies
Maik Wiesmüller
 
Serverless lessons learned #5 retries
Serverless lessons learned #5 retriesServerless lessons learned #5 retries
Serverless lessons learned #5 retries
Maik Wiesmüller
 
Serverless lessons learned #4 circuit breaker
Serverless lessons learned #4 circuit breakerServerless lessons learned #4 circuit breaker
Serverless lessons learned #4 circuit breaker
Maik Wiesmüller
 
Serverless lessons learned #3 reserved concurrency
Serverless lessons learned #3 reserved concurrencyServerless lessons learned #3 reserved concurrency
Serverless lessons learned #3 reserved concurrency
Maik Wiesmüller
 
Serverless lessons learned #2 dead letter queues
Serverless lessons learned #2 dead letter queuesServerless lessons learned #2 dead letter queues
Serverless lessons learned #2 dead letter queues
Maik Wiesmüller
 
Serverless lessons learned #1 custom sdk timeouts
Serverless lessons learned #1 custom sdk timeoutsServerless lessons learned #1 custom sdk timeouts
Serverless lessons learned #1 custom sdk timeouts
Maik Wiesmüller
 
AWS CloudFormation Macros
AWS CloudFormation MacrosAWS CloudFormation Macros
AWS CloudFormation Macros
Maik Wiesmüller
 

More from Maik Wiesmüller (9)

Serverless lessons learned #8 backoff
Serverless lessons learned #8 backoffServerless lessons learned #8 backoff
Serverless lessons learned #8 backoff
 
Serverless lessons learned #7 rate limiting
Serverless lessons learned #7 rate limitingServerless lessons learned #7 rate limiting
Serverless lessons learned #7 rate limiting
 
Serverless lessons learned #6 delivery strategies
Serverless lessons learned #6 delivery strategiesServerless lessons learned #6 delivery strategies
Serverless lessons learned #6 delivery strategies
 
Serverless lessons learned #5 retries
Serverless lessons learned #5 retriesServerless lessons learned #5 retries
Serverless lessons learned #5 retries
 
Serverless lessons learned #4 circuit breaker
Serverless lessons learned #4 circuit breakerServerless lessons learned #4 circuit breaker
Serverless lessons learned #4 circuit breaker
 
Serverless lessons learned #3 reserved concurrency
Serverless lessons learned #3 reserved concurrencyServerless lessons learned #3 reserved concurrency
Serverless lessons learned #3 reserved concurrency
 
Serverless lessons learned #2 dead letter queues
Serverless lessons learned #2 dead letter queuesServerless lessons learned #2 dead letter queues
Serverless lessons learned #2 dead letter queues
 
Serverless lessons learned #1 custom sdk timeouts
Serverless lessons learned #1 custom sdk timeoutsServerless lessons learned #1 custom sdk timeouts
Serverless lessons learned #1 custom sdk timeouts
 
AWS CloudFormation Macros
AWS CloudFormation MacrosAWS CloudFormation Macros
AWS CloudFormation Macros
 

Recently uploaded

Quarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabhQuarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabh
aisafed42
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
Quickdice ERP
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
kalichargn70th171
 
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
kalichargn70th171
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
Patrick Weigel
 
Malibou Pitch Deck For Its €3M Seed Round
Malibou Pitch Deck For Its €3M Seed RoundMalibou Pitch Deck For Its €3M Seed Round
Malibou Pitch Deck For Its €3M Seed Round
sjcobrien
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
Sven Peters
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
Remote DBA Services
 
Preparing Non - Technical Founders for Engaging a Tech Agency
Preparing Non - Technical Founders for Engaging  a  Tech AgencyPreparing Non - Technical Founders for Engaging  a  Tech Agency
Preparing Non - Technical Founders for Engaging a Tech Agency
ISH Technologies
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Julian Hyde
 
Modelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - AmsterdamModelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - Amsterdam
Alberto Brandolini
 
UI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design SystemUI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design System
Peter Muessig
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
Grant Fritchey
 
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Paul Brebner
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
Peter Muessig
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
XfilesPro
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
VALiNTRY360
 
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSISDECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
Tier1 app
 
The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024
Yara Milbes
 

Recently uploaded (20)

Quarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabhQuarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabh
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
 
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
 
WWDC 2024 Keynote Review: For CocoaCoders Austin

Serverless data lake architecture

  • 1. Serverless data lake A quickstart guide to your first data lake architecture on AWS. Maik Wiesmüller 2020-05-27 maik@intenics.io
  • 2. The scope: What is a data lake and why you should care? What do we want to achieve? How to get started?
  • 3. What is a data lake and why we should care? “[...] A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. [...]” “Data lake”, Wikipedia, Wikimedia Foundation, 23:46, 9 May 2020, https://en.wikipedia.org/wiki/Data_lake. ● explore and discover huge amounts of data ● get valuable business insights, market trends or patterns ● or even sell data ● store everything now, analyze later ● no hard scaling limits ● self-service
  • 4. What do we want to achieve? Storage Ingestion Processing Analysis
  • 5. What is really needed? Ingestion Storage Processing Analysis
  • 6. Storage: Where to put all the data and how to keep track of it?
  • 7. AWS S3 as foundation of our data lake ● managed object storage ● organized in “buckets” ● designed for a durability of 99.999999999% ● standard availability of 99.99% ● pay-as-you-go S3 Storage
  • 8. AWS S3 as foundation of our data lake Ingestion Processing Analysis S3 Storage Storage
  • 9. AWS Glue as data catalog ● fully managed service ● integrated data catalog ● metadata in Glue tables ● integrates seamlessly with S3 ● crawlers determine the schema automatically Glue Crawler Glue Catalog
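
A minimal sketch of how a crawler could be pointed at an S3 prefix so that the schema lands in the Glue catalog, as slide 9 describes. It assumes boto3 is installed and credentials are configured; the crawler name, role ARN, database and bucket names are hypothetical placeholders, not values from the talk.

```python
def crawler_config(name, role_arn, database, bucket, prefix=""):
    """Build the keyword arguments for Glue's create_crawler call."""
    # Normalise the target so it always ends with exactly one slash.
    path = f"s3://{bucket}/{prefix}".rstrip("/") + "/"
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": path}]},
    }


def create_and_start_crawler(config):
    """Create the crawler and trigger one run (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    cfg = crawler_config(
        "raw-events-crawler",
        "arn:aws:iam::123456789012:role/GlueCrawlerRole",
        "lake_raw",
        "my-data-lake",
        "raw/events",
    )
    print(cfg["Targets"])
```

Keeping the configuration in a plain dictionary makes it easy to review or version before anything is created in the account.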
  • 10. AWS S3 and Glue as Storage solution Ingestion Processing Analysis S3 Storage Glue Crawler Glue Catalog Storage
  • 11. Ingestion: How to get data loaded into the lake?
  • 12. Data ingestion to S3. 3 methods to get started: ● AWS CLI - using the command line. ● AWS SDK - 9 programming languages. ● Kinesis Data Firehose - managed service. Local backups, web servers, databases, IoT endpoints... #aws SDK CLI Firehose
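
As a rough illustration of the SDK and Firehose routes from slide 12, the sketch below uses the AWS SDK for Python (boto3). The `raw/<source>/year=/month=/day=` key layout is an assumption chosen for this example (Hive-style prefixes that Glue crawlers can pick up as partitions); the bucket, stream and file names are hypothetical.

```python
import json
from datetime import date


def partitioned_key(source, day, filename):
    """Build a date-partitioned object key for a raw file (assumed layout)."""
    return f"raw/{source}/year={day:%Y}/month={day:%m}/day={day:%d}/{filename}"


def firehose_record(event):
    """Encode one event as newline-delimited JSON for Firehose."""
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}


def upload_file(local_path, bucket, key):
    """SDK route: copy a local file into the lake bucket (needs credentials)."""
    import boto3  # imported lazily so the pure helpers run without boto3

    boto3.client("s3").upload_file(local_path, bucket, key)


def send_event(stream_name, event):
    """Firehose route: hand one event to a delivery stream (needs credentials)."""
    import boto3

    boto3.client("firehose").put_record(
        DeliveryStreamName=stream_name, Record=firehose_record(event)
    )


if __name__ == "__main__":
    key = partitioned_key("webserver", date(2020, 5, 27), "access.log")
    print(key)
    # CLI route, for comparison:
    #   aws s3 cp access.log s3://my-data-lake/<key>
```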
  • 13. AWS S3 and Glue Catalog as Storage Processing Analysis S3 Storage Glue Crawler Glue Catalog Storage Ingestion #aws SDK CLI Firehose
  • 14. Processing: How to extract, transform and load huge amounts of data (ETL)?
  • 15. Process data with AWS Glue. Extract, transform, load (ETL) can be done with AWS Glue at large scale. ● fully managed Spark jobs ● Python or Scala ● integrates with data catalog ● generates ETL code * link to glue pricing Glue Jobs Python/Scala
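
One way to register and run a managed Spark job from Python is sketched below. It assumes an ETL script has already been uploaded to S3; the job name, role ARN, script location, Glue version and worker settings are illustrative assumptions rather than recommendations from the talk.

```python
def glue_job_config(name, role_arn, script_location, workers=2):
    """Build the keyword arguments for Glue's create_job call."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                  # Spark ETL job type
            "ScriptLocation": script_location,  # s3:// path of the job script
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",        # assumed version for this sketch
        "WorkerType": "G.1X",        # assumed worker size
        "NumberOfWorkers": workers,
        "DefaultArguments": {"--job-language": "python"},
    }


def create_and_run_job(config):
    """Create the job and start one run (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    glue.create_job(**config)
    run = glue.start_job_run(JobName=config["Name"])
    return run["JobRunId"]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    cfg = glue_job_config(
        "events-to-parquet",
        "arn:aws:iam::123456789012:role/GlueJobRole",
        "s3://my-data-lake/scripts/events_to_parquet.py",
    )
    print(cfg["Command"])
```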
  • 16. Process data with AWS Glue Analysis Glue Jobs Python/Scala S3 Storage Glue Crawler Glue Catalog Storage Processing Ingestion #aws SDK CLI Firehose
  • 17. Analysis: How to get insights from our data?
  • 18. Explore data with SQL queries. Use AWS Athena to explore our data with SQL queries. ● managed query service ● CSV, JSON, ORC, Avro, and Parquet ● integrates with Glue data catalog ● standard SQL ● create new data sets out of query results Athena SQL
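
The last bullet of slide 18, creating new data sets out of query results, can be approximated with a CREATE TABLE AS SELECT statement. The sketch below builds such a statement and submits it through boto3; table, database and S3 locations are hypothetical, and the helper does no SQL escaping, so it is only suitable for trusted input.

```python
import time


def ctas_statement(new_table, select_sql, output_location, fmt="PARQUET"):
    """Wrap a SELECT in a CTAS statement that writes a new data set to S3."""
    return (
        f"CREATE TABLE {new_table} "
        f"WITH (format = '{fmt}', external_location = '{output_location}') "
        f"AS {select_sql}"
    )


def run_query(sql, database, results_location, poll_seconds=1.0):
    """Submit a query and wait for a final state (needs AWS credentials)."""
    import boto3  # imported lazily so ctas_statement works without boto3

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": results_location},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)


if __name__ == "__main__":
    # Hypothetical table and bucket names, for illustration only.
    sql = ctas_statement(
        "daily_errors",
        "SELECT day, count(*) AS errors FROM access_log WHERE status >= 500 GROUP BY day",
        "s3://my-data-lake/curated/daily_errors/",
    )
    print(sql)
```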
  • 19. Explore data with SQL queries Athena SQL Glue Jobs Python/Scala S3 Storage Glue Crawler Glue Catalog Storage Processing Analysis Ingestion #aws SDK CLI Firehose
  • 20. Explore data with Python. Jupyter Notebooks are documents that contain live code, equations, visualizations and explanatory text. ● ready-to-use Jupyter installation for Python ● connected to our data catalog ● AWS example notebooks or take a look at https://jupyter.org ● stop and resume * link to sagemaker pricing ** link to glue pricing Jupyter Notebooks
  • 21. The full picture Glue Jobs Python/Scala Athena SQL S3 Storage Glue Crawler Glue Catalog Jupyter Glue Notebooks Storage Processing Analysis Ingestion #aws SDK CLI Firehose
  • 22. That's a lot of services: AWS Lake Formation central dashboard Glue Jobs S3 Storage Glue Crawler Glue Data catalog Lake Formation Athena SQL Jupyter Glue Notebooks
  • 23. What to consider next ● Access management ● IAM users and roles ● Bucket policies ● Service roles ● Encryption ● Metrics
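
For the bucket policy and encryption items on slide 23, a possible starting point is sketched below: a policy that denies any request not made over TLS, plus default server-side encryption for the bucket. The bucket name is hypothetical and the policy is a common baseline chosen for this example, not a complete access model.

```python
import json


def tls_only_policy(bucket):
    """Bucket policy that denies all S3 actions over unencrypted transport."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


def secure_bucket(bucket):
    """Apply the policy and default encryption (needs AWS credentials)."""
    import boto3  # imported lazily so tls_only_policy works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(tls_only_policy(bucket)))
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
            ]
        },
    )


if __name__ == "__main__":
    print(json.dumps(tls_only_policy("my-data-lake"), indent=2))
```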
  • 24. maik@intenics.io +49 176 614 39 280 intenics.io /wiesmueller Maik Wiesmüller, Cloud solutions consultant @ Intenics with 20+ years of experience in various IT positions. If you want to know more about serverless data lake design, visit us at intenics.io
  • 25. You can also download the more detailed version of this guide here: https://pages.intenics.io/download-your-copy-of-our-quick-start-guide-to-you-first-data-lake