Serverless data lake
A quickstart guide to your
first data lake architecture
on AWS.
Maik Wiesmüller
2020-05-27
maik@intenics.io
2
The scope
What is a data lake and why should you care?
What do we want to achieve?
How to get started?
3
What is a data lake and why should we care?
“[...] A data lake is a system or repository of data stored in its natural/raw format,
usually object blobs or files. A data lake is usually a single store of all enterprise data
including raw copies of source system data and transformed data used for tasks
such as reporting, visualization, advanced analytics and machine learning. [...]”
“Data lake”, Wikipedia, Wikimedia Foundation, 23:46, 9 May 2020, https://en.wikipedia.org/wiki/Data_lake.
● explore and discover huge amounts of data.
● get valuable business insights, market trends or patterns.
● or even sell data
● store everything now, analyze later.
● no hard scaling limits
● self-service
4
What do we want to achieve?
Storage
Ingestion
Processing
Analysis
5
What is really needed?
[Diagram: Ingestion, Storage, Processing, Analysis]
6
Storage
Where to put all the data
and how to keep track of it?
7
AWS S3 as foundation of our data lake
● managed object storage
● organized in “buckets”
● designed for a durability of 99.999999999%
● designed for 99.99% availability (S3 Standard)
● pay-as-you-go
[Diagram: S3 as Storage]
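To put the two percentages on this slide into perspective, here is a small arithmetic sketch. It treats the durability figure as an average annual per-object loss probability and the availability figure as a share of a 365-day year. Both are simplifying assumptions, and the function names are purely illustrative.

```python
from decimal import Decimal

# Figures quoted on the slide (design targets, not guarantees).
DURABILITY = Decimal("0.99999999999")  # 99.999999999 %, "eleven nines"
AVAILABILITY = Decimal("0.9999")       # 99.99 %

MINUTES_PER_YEAR = 365 * 24 * 60       # assumption: a 365-day year


def expected_objects_lost_per_year(object_count: int) -> Decimal:
    """Average number of objects lost per year under the simple model."""
    return object_count * (1 - DURABILITY)


def unavailable_minutes_per_year() -> Decimal:
    """Minutes per year that fall outside the availability target."""
    return MINUTES_PER_YEAR * (1 - AVAILABILITY)


losses = expected_objects_lost_per_year(10_000_000)
downtime = unavailable_minutes_per_year()
print(f"expected losses per year for 10 million objects: {losses}")
print(f"minutes per year outside the availability target: {downtime}")
```

Under these assumptions, 10 million stored objects correspond to 0.0001 expected losses per year (about one object every 10,000 years), and the availability target leaves room for roughly 52.56 minutes of unavailability per year.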
8
AWS S3 as foundation of our data lake
[Diagram: Storage (S3); Ingestion, Processing, Analysis]
9
AWS Glue as data catalog
● fully managed service
● integrated data catalog
● metadata in Glue tables.
● integrates seamlessly with S3
● Crawlers determine the schema automatically
[Diagram: Glue Crawler, Glue Catalog]
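As a rough sketch of how a crawler could be pointed at an S3 prefix with the AWS SDK for Python: the crawler name, role ARN, database name and bucket path below are made-up placeholders, and `RecordingClient` is only a stand-in so the example runs without AWS access. With real credentials, the object returned by `boto3.client("glue")` would be passed in instead.

```python
from typing import Any, Dict, List, Tuple


def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> Dict[str, Any]:
    """Keyword arguments for the Glue CreateCrawler operation."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue_client: Any, request: Dict[str, Any]) -> None:
    """Register the crawler, then trigger its first run."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])


class RecordingClient:
    """Stand-in for a Glue client: records calls instead of sending them."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def __getattr__(self, operation: str):
        def call(**kwargs: Any) -> Dict[str, Any]:
            self.calls.append((operation, kwargs))
            return {}
        return call


client = RecordingClient()  # with AWS access: boto3.client("glue")
request = build_crawler_request(
    name="raw-events-crawler",                                # hypothetical
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical
    database="datalake",                                      # hypothetical
    s3_path="s3://my-data-lake-bucket/raw/",                  # hypothetical
)
create_and_start_crawler(client, request)
for operation, kwargs in client.calls:
    print(operation, kwargs)
```

Once a real crawler run finishes, the tables it discovered appear in the Glue data catalog under the chosen database.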
10
AWS S3 and Glue as Storage solution
[Diagram: Storage (S3, Glue Crawler, Glue Catalog); Ingestion, Processing, Analysis]
11
Ingestion
How to get data loaded
into the lake?
12
Data ingestion to S3
3 methods to get started:
● AWS CLI - using the command line.
● AWS SDK - 9 programming languages.
● Kinesis Data Firehose - managed service
Sources: local backups, web servers, databases, IoT endpoints...
[Diagram: AWS CLI, SDK, Firehose]
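A minimal sketch of the SDK and Firehose routes in Python. The key layout with `year=/month=/day=` prefixes is an assumed convention (not something the slides prescribe), and the bucket, stream and file names are placeholders. The actual AWS calls are shown only as comments, so the block runs without credentials.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict


def partitioned_key(source: str, filename: str, when: datetime) -> str:
    """Object key with date partitions in key=value style (assumed layout)."""
    return f"raw/{source}/year={when:%Y}/month={when:%m}/day={when:%d}/{filename}"


def firehose_record(event: Dict[str, Any]) -> Dict[str, bytes]:
    """Firehose takes bytes; a trailing newline keeps JSON records separable."""
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}


key = partitioned_key("webserver", "access.log", datetime(2020, 5, 27, tzinfo=timezone.utc))
record = firehose_record({"device": "sensor-1", "temp": 21.5})
print(key)
print(record)

# With boto3 installed and credentials configured, these would be sent as:
#   boto3.client("s3").upload_file("access.log", "my-data-lake-bucket", key)
#   boto3.client("firehose").put_record(DeliveryStreamName="my-stream", Record=record)
```

Keeping the date in the object key is one common way to let later stages read only the days they need.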
13
AWS S3 and Glue Catalog as Storage
[Diagram: Ingestion (AWS CLI, SDK, Firehose); Storage (S3, Glue Crawler, Glue Catalog); Processing, Analysis]
14
Processing
How to extract, transform
and load huge amounts of
data (ETL)?
15
Process data with AWS Glue
Extract, transform, load (ETL) can be done with AWS Glue at large scale.
● fully managed Spark jobs
● Python or Scala
● integrates with data catalog.
● generate ETL code
* link to glue pricing
[Diagram: Glue Jobs (Python/Scala)]
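To make the three ETL steps concrete, here is a deliberately tiny plain-Python illustration. It does not use Glue or Spark. The column names and sample data are invented, and a real Glue job would apply the same idea to data in S3 at much larger scale.

```python
import csv
import io
import json
from typing import Dict, Iterable, Iterator


def extract(csv_text: str) -> Iterator[Dict[str, str]]:
    """Extract: parse raw CSV text into one dict per row."""
    yield from csv.DictReader(io.StringIO(csv_text))


def transform(rows: Iterable[Dict[str, str]]) -> Iterator[Dict[str, object]]:
    """Transform: drop rows without a reading, normalise names and types."""
    for row in rows:
        if not row["temp_c"]:
            continue
        yield {"device": row["device"].strip().lower(), "temp_c": float(row["temp_c"])}


def load(rows: Iterable[Dict[str, object]]) -> str:
    """Load: serialise as JSON lines (here into a string instead of S3)."""
    return "\n".join(json.dumps(row) for row in rows)


RAW = "device,temp_c\nSensor-1,21.5\nSensor-2,\nSENSOR-3,19.0\n"  # invented sample
result = load(transform(extract(RAW)))
print(result)
```

The generators keep each step lazy, so rows flow through one at a time rather than being held in memory all at once.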
16
Process data with AWS Glue
[Diagram: Ingestion (AWS CLI, SDK, Firehose); Storage (S3, Glue Crawler, Glue Catalog); Processing (Glue Jobs, Python/Scala); Analysis]
17
Analysis
How to get insights from
our data?
18
Explore data with SQL queries
Use AWS Athena to explore our data with SQL queries.
● managed query service
● CSV, JSON, ORC, Avro, and Parquet
● integrates with Glue data catalog
● standard SQL
● create new data sets out of query results
[Diagram: Athena (SQL)]
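The last bullet, creating new data sets out of query results, can be sketched with a CREATE TABLE AS SELECT statement submitted through the SDK. Table, database and bucket names are hypothetical, and the block only builds the request. The call that would send it is shown as a comment.

```python
from typing import Any, Dict


def ctas_query(new_table: str, source_table: str, data_location: str) -> str:
    """CREATE TABLE AS SELECT: store a query result as a new Parquet table."""
    return (
        f"CREATE TABLE {new_table} "
        f"WITH (format = 'PARQUET', external_location = '{data_location}') AS "
        f"SELECT device, avg(temp_c) AS avg_temp_c "
        f"FROM {source_table} GROUP BY device"
    )


def athena_request(sql: str, database: str, results_location: str) -> Dict[str, Any]:
    """Keyword arguments for the Athena StartQueryExecution operation."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": results_location},
    }


sql = ctas_query(
    new_table="device_averages",                                 # hypothetical
    source_table="raw_events",                                   # hypothetical
    data_location="s3://my-data-lake-bucket/curated/device_averages/",
)
request = athena_request(sql, "datalake", "s3://my-data-lake-bucket/athena-results/")
print(request["QueryString"])

# With boto3 installed and credentials configured:
#   boto3.client("athena").start_query_execution(**request)
```

Athena runs queries asynchronously, so a real caller would keep the returned query execution ID and poll for completion before reading results.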
19
Explore data with SQL queries
[Diagram: Ingestion (AWS CLI, SDK, Firehose); Storage (S3, Glue Crawler, Glue Catalog); Processing (Glue Jobs, Python/Scala); Analysis (Athena, SQL)]
20
Explore data with Python
Jupyter Notebooks are documents that contain live code, equations,
visualizations and explanatory text.
● ready-to-use Jupyter installation for Python.
● connected to our data catalog
● AWS example notebooks or take a look at https://jupyter.org.
● stop and resume
* link to sagemaker pricing
** link to glue pricing
[Diagram: Jupyter Notebooks]
21
The full picture
[Diagram: Ingestion (AWS CLI, SDK, Firehose); Storage (S3, Glue Crawler, Glue Catalog); Processing (Glue Jobs, Python/Scala); Analysis (Athena SQL, Jupyter / Glue Notebooks)]
22
That's a lot of services
AWS Lake Formation
central dashboard
[Diagram: Lake Formation with S3 Storage, Glue Crawler, Glue Data catalog, Glue Jobs, Athena (SQL), Jupyter / Glue Notebooks]
23
What to consider next
● Access management
● IAM users and roles
● Bucket policies
● Service roles
● Encryption
● Metrics
24
maik@intenics.io
+49 176 614 39 280
intenics.io
/wiesmueller
Maik Wiesmüller
Cloud solutions consultant @ Intenics
with 20+ years of experience in various
IT positions
If you want to know more about serverless data lake design, visit us at intenics.io
You can also download the more detailed version of this guide here:
https://pages.intenics.io/download-your-copy-of-our-quick-start-guide-to-you-first-data-lake

Serverless data lake architecture

  • 1. Serverless data lake: A quickstart guide to your first data lake architecture on AWS. Maik Wiesmüller, 2020-05-27, maik@intenics.io
  • 2. The scope: What is a data lake and why should you care? What do we want to achieve? How to get started?
  • 3. What is a data lake and why should we care?
      “[...] A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. [...]”
      “Data lake”, Wikipedia, Wikimedia Foundation, 23:46, 9 May 2020, https://en.wikipedia.org/wiki/Data_lake.
      ● explore and discover huge amounts of data
      ● get valuable business insights, market trends or patterns
      ● or even sell data
      ● store everything now, analyze later
      ● no hard scaling limits
      ● self-service
  • 4. What do we want to achieve? Storage, Ingestion, Processing, Analysis.
  • 5. What is really needed? Ingestion, Storage, Processing, Analysis.
  • 6. Storage: Where to put all the data and how to keep track of it?
  • 7. AWS S3 as foundation of our data lake
      ● managed object storage
      ● organized in “buckets”
      ● designed for a durability of 99.999999999%
      ● standard availability of 99.99%
      ● pay-as-you-go
      [Diagram: Storage (S3)]
  • 8. AWS S3 as foundation of our data lake
      [Diagram: Ingestion; Storage (S3); Processing; Analysis]
  • 9. AWS Glue as data catalog
      ● fully managed service
      ● integrated data catalog
      ● metadata in Glue tables
      ● integrates seamlessly with S3
      ● crawlers determine the schema automatically
      [Diagram: Glue Crawler, Glue Catalog]
  • 10. AWS S3 and Glue as Storage solution
      [Diagram: Ingestion; Storage (S3, Glue Crawler, Glue Catalog); Processing; Analysis]
  • 11. Ingestion: How to get data loaded into the lake?
  • 12. Data ingestion to S3
      3 methods to get started:
      ● AWS CLI - using the command line
      ● AWS SDK - 9 programming languages
      ● Kinesis Data Firehose - managed service
      Local backups, web servers, databases, IoT endpoints...
      [Diagram: SDK, CLI, Firehose]
  • 13. AWS S3 and Glue Catalog as Storage
      [Diagram: Ingestion (SDK, CLI, Firehose); Storage (S3, Glue Crawler, Glue Catalog); Processing; Analysis]
  • 14. Processing: How to extract, transform and load huge amounts of data (ETL)?
  • 15. Process data with AWS Glue
      Extract, transform, load (ETL) can be done with AWS Glue at large scale.
      ● fully managed Spark jobs
      ● Python or Scala
      ● integrates with the data catalog
      ● generates ETL code
      * link to glue pricing
      [Diagram: Glue Jobs (Python/Scala)]
  • 16. Process data with AWS Glue
      [Diagram: Ingestion (SDK, CLI, Firehose); Storage (S3, Glue Crawler, Glue Catalog); Processing (Glue Jobs, Python/Scala); Analysis]
  • 17. Analysis: How to get insights from our data?
  • 18. Explore data with SQL queries
      Use AWS Athena to explore our data with SQL queries.
      ● managed query service
      ● CSV, JSON, ORC, Avro, and Parquet
      ● integrates with Glue data catalog
      ● standard SQL
      ● create new data sets out of query results
      [Diagram: Athena (SQL)]
  • 19. Explore data with SQL queries
      [Diagram: Ingestion (SDK, CLI, Firehose); Storage (S3, Glue Crawler, Glue Catalog); Processing (Glue Jobs, Python/Scala); Analysis (Athena SQL)]
  • 20. Explore data with Python
      Jupyter Notebooks are documents that contain live code, equations, visualizations and explanatory text.
      ● ready-to-use Jupyter installation for Python
      ● connected to our data catalog
      ● AWS example notebooks, or take a look at https://jupyter.org
      ● stop and resume
      * link to sagemaker pricing
      ** link to glue pricing
      [Diagram: Jupyter Notebooks]
  • 21. The full picture
      [Diagram: Ingestion (SDK, CLI, Firehose); Storage (S3, Glue Crawler, Glue Catalog); Processing (Glue Jobs, Python/Scala); Analysis (Athena SQL, Jupyter Glue Notebooks)]
  • 22. That's a lot of services
      AWS Lake Formation: central dashboard
      [Diagram: Lake Formation; Glue Jobs; S3 Storage; Glue Crawler; Glue Data catalog; Athena SQL; Jupyter Glue Notebooks]
  • 23. What to consider next
      ● Access management
      ● IAM users and roles
      ● Bucket policies
      ● Service roles
      ● Encryption
      ● Metrics
  • 24. Maik Wiesmüller, cloud solutions consultant @ Intenics with 20+ years of experience in various IT positions. Contact: maik@intenics.io, +49 176 614 39 280, intenics.io, /wiesmueller. If you want to know more about serverless data lake design, visit us at intenics.io.
  • 25. You can also download the more detailed version of this guide here: https://pages.intenics.io/download-your-copy-of-our-quick-start-guide-to-you-first-data-lake
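
Slide 9 describes Glue crawlers that determine the schema of data in S3 and store it in the Glue catalog. A minimal sketch of how that could be driven from the AWS SDK for Python (boto3) follows. The crawler name, role ARN, database name and S3 path are hypothetical placeholders, and the Glue client is passed in as a parameter so the functions can be exercised without AWS access.

```python
def crawl_prefix(glue_client, name, role_arn, database, s3_path):
    """Create a crawler for one S3 prefix and start it (all names are placeholders)."""
    glue_client.create_crawler(
        Name=name,
        Role=role_arn,  # IAM role the crawler assumes to read the bucket
        DatabaseName=database,  # catalog database that receives the tables
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue_client.start_crawler(Name=name)


def list_table_names(glue_client, database):
    """Return table names from the first page of get_tables results."""
    response = glue_client.get_tables(DatabaseName=database)
    return [table["Name"] for table in response["TableList"]]


# Usage (needs boto3 and AWS credentials; values are assumptions):
#   import boto3
#   glue = boto3.client("glue")
#   crawl_prefix(glue, "raw-crawler", "arn:aws:iam::123456789012:role/GlueCrawlerRole",
#                "datalake_raw", "s3://my-data-lake-bucket/raw/")
```

The crawler runs asynchronously, so tables only appear in the catalog once the run has finished.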
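
Slide 12 lists the AWS SDK and Kinesis Data Firehose as ways to get data into S3. The sketch below shows roughly what both could look like with boto3. The bucket name, stream name and the Hive-style `year=/month=/day=` key layout are assumptions for illustration and do not come from the slides; the clients are injected so the code does not need AWS access to be tried out.

```python
import json
from datetime import date, datetime, timezone


def build_raw_key(source, filename, day):
    """Build an object key with date partitions (assumed layout, not from the slides)."""
    return (
        f"raw/{source}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


def upload_event(s3_client, bucket, source, event):
    """Serialize one event as JSON and store it as a single S3 object."""
    now = datetime.now(timezone.utc)
    key = build_raw_key(source, f"{now:%H%M%S%f}.json", now.date())
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(event).encode("utf-8"),
    )
    return key


def send_to_firehose(firehose_client, stream_name, event):
    """Hand one event to a Kinesis Data Firehose delivery stream."""
    # A trailing newline keeps records separable once Firehose batches them.
    firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )


# Usage (needs boto3 and AWS credentials; names are assumptions):
#   import boto3
#   upload_event(boto3.client("s3"), "my-data-lake-bucket", "webshop", {"order_id": 1})
#   send_to_firehose(boto3.client("firehose"), "my-delivery-stream", {"order_id": 1})
```

Writing one object per event is fine for a first experiment; for many small events, Firehose batches records into larger objects.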
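
Slide 15 introduces Glue jobs as managed Spark jobs written in Python or Scala. The job script itself runs inside Glue, but starting a run and checking its state can be sketched with boto3 as below. The job name and the `--source_path` / `--target_path` arguments are hypothetical; which arguments a job accepts depends entirely on how its script is written.

```python
def start_etl_job(glue_client, job_name, source_path, target_path):
    """Start one run of an existing Glue job and return its run id."""
    response = glue_client.start_job_run(
        JobName=job_name,
        # Job arguments are passed as "--name": "value" pairs (names are assumptions).
        Arguments={"--source_path": source_path, "--target_path": target_path},
    )
    return response["JobRunId"]


def job_state(glue_client, job_name, run_id):
    """Return the current state of a job run, e.g. RUNNING or SUCCEEDED."""
    response = glue_client.get_job_run(JobName=job_name, RunId=run_id)
    return response["JobRun"]["JobRunState"]


# Usage (needs boto3, AWS credentials and an existing job; names are assumptions):
#   import boto3
#   glue = boto3.client("glue")
#   run_id = start_etl_job(glue, "raw-to-parquet", "s3://my-data-lake-bucket/raw/",
#                          "s3://my-data-lake-bucket/curated/")
#   print(job_state(glue, "raw-to-parquet", run_id))
```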
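
Slide 18 describes Athena as a managed query service that uses the Glue data catalog. A minimal sketch of submitting a query through boto3 and waiting for the result is shown here. The database name, output location and SQL text are placeholders, the polling loop is deliberately simple, and only the first page of results is read.

```python
import time


def run_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Run one SQL query in Athena and return the first page of rows as lists of strings."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a final state is reached.
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    result = athena_client.get_query_results(QueryExecutionId=query_id)
    return [
        [column.get("VarCharValue") for column in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]


# Usage (needs boto3 and AWS credentials; names are assumptions):
#   import boto3
#   rows = run_query(boto3.client("athena"), "SELECT * FROM orders LIMIT 10",
#                    "datalake_raw", "s3://my-data-lake-bucket/athena-results/")
```

Athena writes the full result set to the output location in S3, which is one way to create new data sets out of query results.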
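
Slide 23 names bucket policies and encryption as next steps without going into detail. As one possible starting point, the sketch below builds a bucket policy that denies any request not made over TLS and applies it with boto3. The choice of this particular policy and the bucket name are assumptions for illustration, not a recommendation from the slides.

```python
import json


def tls_only_policy(bucket):
    """Build a bucket policy document that denies requests made without TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


def apply_tls_only_policy(s3_client, bucket):
    """Attach the policy to the bucket; this replaces any existing bucket policy."""
    s3_client.put_bucket_policy(
        Bucket=bucket,
        Policy=json.dumps(tls_only_policy(bucket)),
    )


# Usage (needs boto3 and AWS credentials; the bucket name is an assumption):
#   import boto3
#   apply_tls_only_policy(boto3.client("s3"), "my-data-lake-bucket")
```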