SlideShare a Scribd company logo
Designing performant and scalable data
lakes using Azure Data Lake Storage
Rukmani Gopalan
@RukmaniGopalan
Agenda • Data Lake Concepts and Patterns
• Designing your data lake
• Set up
• Organize data
• Secure data
• Manage cost
• Optimizing your data lake
• Achieve the best performance and scale
Traditional on-prem analytics pipeline
Operational
database
Business/custom apps
Operational
database
Operational
database
Enterprise data
warehouse
Data mart
Data mart
Data mart
ETL
ETL
ETL
ETL ETL
ETL
ETL
Reporting
Analytics
Data mining
Modern data warehouse
Logs (structured)
Media (unstructured)
Files (unstructured)
Business/custom apps
(structured)
Ingest Prep & train Model & serve
Store
Azure Data Lake Storage
Azure DatabricksAzure Data Factory
Power BI
Azure Synapse Analytics
Azure Synapse Analytics
Advanced Analytics
Logs (structured)
Media (unstructured)
Files (unstructured)
Business/custom apps
(structured)
Ingest Prep & train Model & serve
Store
Azure Data Lake Storage
Azure Data Factory
Power BI
Apps
Azure Databricks
Azure Synapse Analytics Azure Synapse Analytics
Cosmos DB
Realtime Analytics
Logs (structured)
Media (unstructured)
Files (unstructured)
Business/custom apps
(structured)
Ingest Prep & train Model & serve
Store
Azure Data Lake Storage
Azure DatabricksAzure Data Factory
Power BI
Apps
Message Broker
Azure Synapse Analytics Azure Synapse Analytics
Cosmos DB
Sensors and IoT
(unstructured)
A “no-compromises” Data Lake: secure, performant, massively-scalable Data Lake storage that brings the cost and
scale profile of object storage together with the performance and analytics feature set of data lake storage
A z u r e D a t a L a k e S t o r a g e
M A N A G E A B L ES C A L A B L E F A S T S E C U R E
 No limits on
data store size
 Global footprint
(50 regions)
 Optimized for Spark
and Hadoop
Analytic Engines
 Tightly integrated
with Azure end to
end analytics
solutions
 Automated
Lifecycle Policy
Management
 Object Level
tiering
 Support for fine-
grained ACLs,
protecting data at the
file and folder level
 Multi-layered
protection via at-rest
Storage Service
encryption and Azure
Active Directory
integration
C O S T
E F F E C T I V E
I N T E G R AT I O N
R E A D Y
 Atomic directory
operations
means jobs
complete faster
 Object store
pricing levels
 File system
operations
minimize
transactions
required for job
completion
Azure Data Lake Storage
Cloud Storage platform with first class file/folder semantics and support for multiple
protocols and cost/performance tiers. Built on Object Storage.
Common Blob Storage Foundation
Blob API ADLS API
Server Backups, Archive
Storage, Semi-structured
Data
Object Data
Hadoop File System, File
and Folder Hierarchy,
Granular ACLS Atomic File
Transactions
Analytics Data
Object Tiering and Lifecycle
Policy Management
AAD Integration, RBAC,
Storage Account Security
HA/DR support through ZRS
and RA-GRS
NFS v3 (preview)
HPC Data, Applications
using NFS v3 against large
sequentially read data sets
File Data
Data Lake Architecture - Summary
Store large volume of multi-structured data in its native format
Defer work to ‘schematize’ after value & requirements are known
(Schema-on-read)
Extract high value insights from the multi-structured data
Build intelligent business scenarios based on the insights
Designing Your Data Lake
• How do I set up my data lake?
• How do I organize my data?
• How do I secure my data?
• How do I manage cost?
How do I set up my data lake?
• Centralized vs Federated implementation
• Data management and administration – done by a central team vs business units/domains
• Blueprint approach to federated data lakes with centralized governance
Flexible – single or multiple storage accounts
Blueprint
Recommendations
 Isolate development vs pre-production and production data lakes
 Identify logical datasets, resources and management needs – this
drives the centralized vs federated approach
• Business unit boundaries
• Regional boundaries
 Promote sharing data/insights across business units – beware of
data silos
How do I organize my data?
• Azure Data Lake Storage hierarchy
• Storage account
Azure resource that contains data objects
• Container
Organize within storage account - contains a set
of files/folders
• Folder/directory
Organize within container - contains a set of
files/folders, Hadoop file system friendly
• File
Holds data that can be read or written
Recommendations
 Organize data based on semantic structure as well as desired
access control
 Separate the different zones into different accounts, containers or
folders depending on business need
How do I secure my data?
PERIMETER/NETWORK
Service Endpoints
Private Endpoints
AUTHENTICATION
Azure Active Directory
(recommended)
Shared Keys
SAS tokens
Shared Key
AUTHORIZATION
RBACs (coarse grained)
POSIX ACLs (fine
grained)
Shared Key
DATA PROTECTION
Encryption on-the-wire with HTTPS
Encryption at Rest
• Service and Customer Managed Keys
Diagnostic Logs
A Little More on Authorization
 RBACs and ACLs integrated with AAD
• RBACs – Storage account and container
• ACLs – File and folders
 Other access mechanisms (not
recommended)
• Shared Keys – Disable if not needed
(preview)
• SAS Tokens – short lived access
Recommendations
 Service or Private endpoints for network security
 Use Azure Active Directory authentication to manage access
 Use RBACs for coarse grained access (at storage account or
container level) and ACLs for fine grained access control (at file or
folder level)
 AAD groups largely simplify your access management issues
How do I manage cost?
• Choose the right set of features for your business – cost vs benefit
• E.g. Redundancy option – criticality of geo-redundancy for production vs dev environments
LRS ZRS GZRS(RA-)GRS
Single Region Dual RegionGRS
How do I manage cost? (Continued…)
• Control data growth –
minimize risk of data
swamp
• Workspace data
management
• Leverage lifecycle
management policies
• Tiering
• Retention
Recommendations
 Choose the features of data lake storage based on business need
 Pre-prod and development environment needs might vary from
production environment needs
 Leverage lifecycle management policies for better data
management
 Move data to a cooler tier if not actively used – be aware of higher
transaction costs and minimum retention policies
 Use retention policies to delete data that is not needed
How do I optimize my data lake?
Goal
• Optimize for performance AND scale as the data and
applications continue to grow on the data lake
The basic considerations are…
• Optimize for high throughput
• Target getting at least a few MBs (higher the better)
per transaction.
• Optimize data access patterns
• Reduce unnecessary scanning of files
• Read only the data you need to read.
• Write efficiently so downstream applications that read
the data benefit
File size and format
• Too many small files adversely impact performance
• Choosing the right format – better performance
AND lower cost
• Parquet – integrated optimizations with Azure Synapse
Analytics and Azure Databricks
• Recommendations
 Modify source to ingest larger files into the data lake
 Coalesce and convert to right format (E.g. Parquet) in
curation phase of your analytics pipelines
 Realtime analytics pipelines (E.g. sensor data in IoT
application) – microbatch for larger writes
Partition your data for optimized access
Partition based on consumption patterns for optimized performance
Sensor ID Year Temperature
Humidity
Pressure
Microsoft Confidential
Query Acceleration (Preview)
 Optimize access to structured data by filtering data directly in the storage service
 Single file predicate evaluation and column projection to optimize analytics engines
 Eg:
 SELECT _1, _3 FROM BlobStorage WHERE _14 < 250 AND _16 > '2019-07-01'
Guidance from experts
Microsoft Docs
Explore overviews, tutorials,
code samples, and more.
Azure Data Lake Storage: https://docs.microsoft.com/azure/storage/blobs/data-lake-storage-introduction
Azure Data Lake Storage Guidance Document: https://aka.ms/adls/guidancedoc
Azure Synapse Analytics: https://docs.microsoft.com/azure/synapse-analytics
© Copyright Microsoft Corporation. All rights reserved.

More Related Content

What's hot

Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Luan Moreno Medeiros Maciel
 
Azure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekAzure Data Factory for Azure Data Week
Azure Data Factory for Azure Data Week
Mark Kromer
 
Deep Dive into Azure Data Factory v2
Deep Dive into Azure Data Factory v2Deep Dive into Azure Data Factory v2
Deep Dive into Azure Data Factory v2
Eric Bragas
 
Azure Data Factory
Azure Data FactoryAzure Data Factory
Azure Data Factory
HARIHARAN R
 
Azure data factory
Azure data factoryAzure data factory
Azure data factory
BizTalk360
 
Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005
Mark Kromer
 
An intro to Azure Data Lake
An intro to Azure Data LakeAn intro to Azure Data Lake
An intro to Azure Data Lake
Rick van den Bosch
 
ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300
Mark Kromer
 
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMicrosoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Mark Kromer
 
Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)
Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)
Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)
Cathrine Wilhelmsen
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
Mark Kromer
 
A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data Factory
BizTalk360
 
Integration Monday - Analysing StackExchange data with Azure Data Lake
Integration Monday - Analysing StackExchange data with Azure Data LakeIntegration Monday - Analysing StackExchange data with Azure Data Lake
Integration Monday - Analysing StackExchange data with Azure Data Lake
Tom Kerkhove
 
Azure data factory
Azure data factoryAzure data factory
Azure data factory
David Giard
 
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
DataConf
 
Pipelines and Packages: Introduction to Azure Data Factory (Techorama NL 2019)
Pipelines and Packages: Introduction to Azure Data Factory (Techorama NL 2019)Pipelines and Packages: Introduction to Azure Data Factory (Techorama NL 2019)
Pipelines and Packages: Introduction to Azure Data Factory (Techorama NL 2019)
Cathrine Wilhelmsen
 
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Carole Gunst
 
Azure Data Factory for Redmond SQL PASS UG Sept 2018
Azure Data Factory for Redmond SQL PASS UG Sept 2018Azure Data Factory for Redmond SQL PASS UG Sept 2018
Azure Data Factory for Redmond SQL PASS UG Sept 2018
Mark Kromer
 
Analyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data LakeAnalyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data Lake
BizTalk360
 
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
MS Cloud Summit
 

What's hot (20)

Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Azure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekAzure Data Factory for Azure Data Week
Azure Data Factory for Azure Data Week
 
Deep Dive into Azure Data Factory v2
Deep Dive into Azure Data Factory v2Deep Dive into Azure Data Factory v2
Deep Dive into Azure Data Factory v2
 
Azure Data Factory
Azure Data FactoryAzure Data Factory
Azure Data Factory
 
Azure data factory
Azure data factoryAzure data factory
Azure data factory
 
Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005
 
An intro to Azure Data Lake
An intro to Azure Data LakeAn intro to Azure Data Lake
An intro to Azure Data Lake
 
ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300
 
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMicrosoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview Slides
 
Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)
Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)
Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
 
A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data Factory
 
Integration Monday - Analysing StackExchange data with Azure Data Lake
Integration Monday - Analysing StackExchange data with Azure Data LakeIntegration Monday - Analysing StackExchange data with Azure Data Lake
Integration Monday - Analysing StackExchange data with Azure Data Lake
 
Azure data factory
Azure data factoryAzure data factory
Azure data factory
 
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
 
Pipelines and Packages: Introduction to Azure Data Factory (Techorama NL 2019)
Pipelines and Packages: Introduction to Azure Data Factory (Techorama NL 2019)Pipelines and Packages: Introduction to Azure Data Factory (Techorama NL 2019)
Pipelines and Packages: Introduction to Azure Data Factory (Techorama NL 2019)
 
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2
 
Azure Data Factory for Redmond SQL PASS UG Sept 2018
Azure Data Factory for Redmond SQL PASS UG Sept 2018Azure Data Factory for Redmond SQL PASS UG Sept 2018
Azure Data Factory for Redmond SQL PASS UG Sept 2018
 
Analyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data LakeAnalyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data Lake
 
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
 

Similar to Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data Lake Storage

AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS
Amazon Web Services
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
Amazon Web Services
 
Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2Amazon Web Services
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
Amazon Web Services
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
Amazon Web Services
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
Martin Bém
 
Dipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAsDipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAs
Bob Pusateri
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
Amazon Web Services
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
Amazon Web Services
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
Rakesh Jayaram
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
James Serra
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS Cloud
Amazon Web Services
 
AWS Tech Talks - Data Lake Analytics
AWS Tech Talks - Data Lake AnalyticsAWS Tech Talks - Data Lake Analytics
AWS Tech Talks - Data Lake Analytics
Amazon Web Services LATAM
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Amazon Web Services LATAM
 
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and QuicksightServerlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
Amazon Web Services
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
Amazon Web Services Korea
 
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Trivadis
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
Amazon Web Services
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in Cloud
CAMMS
 

Similar to Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data Lake Storage (20)

AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Dipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAsDipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAs
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS Cloud
 
AWS Tech Talks - Data Lake Analytics
AWS Tech Talks - Data Lake AnalyticsAWS Tech Talks - Data Lake Analytics
AWS Tech Talks - Data Lake Analytics
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
 
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and QuicksightServerlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
 
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in Cloud
 

Recently uploaded

Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 

Recently uploaded (20)

Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 

Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data Lake Storage

  • 1. Designing performant and scalable data lakes using Azure Data Lake Storage Rukmani Gopalan @RukmaniGopalan
  • 2. Agenda • Data Lake Concepts and Patterns • Designing your data lake • Set up • Organize data • Secure data • Manage cost • Optimizing your data lake • Achieve the best performance and scale
  • 3. Traditional on-prem analytics pipeline Operational database Business/custom apps Operational database Operational database Enterprise data warehouse Data mart Data mart Data mart ETL ETL ETL ETL ETL ETL ETL Reporting Analytics Data mining
  • 4. Modern data warehouse Logs (structured) Media (unstructured) Files (unstructured) Business/custom apps (structured) Ingest Prep & train Model & serve Store Azure Data Lake Storage Azure DatabricksAzure Data Factory Power BI Azure Synapse Analytics Azure Synapse Analytics
  • 5. Advanced Analytics Logs (structured) Media (unstructured) Files (unstructured) Business/custom apps (structured) Ingest Prep & train Model & serve Store Azure Data Lake Storage Azure Data Factory Power BI Apps Azure Databricks Azure Synapse Analytics Azure Synapse Analytics Cosmos DB
  • 6. Realtime Analytics Logs (structured) Media (unstructured) Files (unstructured) Business/custom apps (structured) Ingest Prep & train Model & serve Store Azure Data Lake Storage Azure DatabricksAzure Data Factory Power BI Apps Message Broker Azure Synapse Analytics Azure Synapse Analytics Cosmos DB Sensors and IoT (unstructured)
  • 7. A “no-compromises” Data Lake: secure, performant, massively-scalable Data Lake storage that brings the cost and scale profile of object storage together with the performance and analytics feature set of data lake storage A z u r e D a t a L a k e S t o r a g e M A N A G E A B L ES C A L A B L E F A S T S E C U R E  No limits on data store size  Global footprint (50 regions)  Optimized for Spark and Hadoop Analytic Engines  Tightly integrated with Azure end to end analytics solutions  Automated Lifecycle Policy Management  Object Level tiering  Support for fine- grained ACLs, protecting data at the file and folder level  Multi-layered protection via at-rest Storage Service encryption and Azure Active Directory integration C O S T E F F E C T I V E I N T E G R AT I O N R E A D Y  Atomic directory operations means jobs complete faster  Object store pricing levels  File system operations minimize transactions required for job completion
  • 8. Azure Data Lake Storage Cloud Storage platform with first class file/folder semantics and support for multiple protocols and cost/performance tiers. Built on Object Storage. Common Blob Storage Foundation Blob API ADLS API Server Backups, Archive Storage, Semi-structured Data Object Data Hadoop File System, File and Folder Hierarchy, Granular ACLS Atomic File Transactions Analytics Data Object Tiering and Lifecycle Policy Management AAD Integration, RBAC, Storage Account Security HA/DR support through ZRS and RA-GRS NFS v3 (preview) HPC Data, Applications using NFS v3 against large sequentially read data sets File Data
  • 9. Data Lake Architecture - Summary Store large volume of multi-structured data in its native format Defer work to ‘schematize’ after value & requirements are known (Schema-on-read) Extract high value insights from the multi-structured data Build intelligent business scenarios based on the insights
  • 10. Designing Your Data Lake • How do I set up my data lake? • How do I organize my data? • How do I secure my data? • How do I manage cost?
  • 11. How do I set up my data lake? • Centralized vs Federated implementation • Data management and administration – done by a central team vs business units/domains • Blueprint approach to federated data lakes with centralized governance Flexible – single or multiple storage accounts Blueprint
  • 12. Recommendations  Isolate development vs pre-production and production data lakes  Identify logical datasets, resources and management needs – this drives the centralized vs federated approach • Business unit boundaries • Regional boundaries  Promote sharing data/insights across business units – beware of data silos
  • 13. How do I organize my data? • Azure Data Lake Storage hierarchy • Storage account Azure resource that contains data objects • Container Organize within storage account - contains a set of files/folders • Folder/directory Organize within container - contains a set of files/folders, Hadoop file system friendly • File Holds data that can be read or written
  • 14. Recommendations  Organize data based on semantic structure as well as desired access control  Separate the different zones into different accounts, containers or folders depending on business need
  • 15. How do I secure my data? PERIMETER/NETWORK Service Endpoints Private Endpoints AUTHENTICATION Azure Active Directory (recommended) Shared Keys SAS tokens Shared Key AUTHORIZATION RBACs (coarse grained) POSIX ACLs (fine grained) Shared Key DATA PROTECTION Encryption on-the-wire with HTTPS Encryption at Rest • Service and Customer Managed Keys Diagnostic Logs
  • 16. A Little More on Authorization  RBACs and ACLs integrated with AAD • RBACs – Storage account and container • ACLs – File and folders  Other access mechanisms (not recommended) • Shared Keys – Disable if not needed (preview) • SAS Tokens – short lived access
  • 17. Recommendations  Service or Private endpoints for network security  Use Azure Active Directory authentication to manage access  Use RBACs for coarse grained access (at storage account or container level) and ACLs for fine grained access control (at file or folder level)  AAD groups largely simplify your access management issues
  • 18. How do I manage cost? • Choose the right set of features for your business – cost vs benefit • E.g. Redundancy option – criticality of geo-redundancy for production vs dev environments LRS ZRS GZRS(RA-)GRS Single Region Dual RegionGRS
  • 19. How do I manage cost? (Continued…) • Control data growth – minimize risk of data swamp • Workspace data management • Leverage lifecycle management policies • Tiering • Retention
  • 20. Recommendations  Choose the features of data lake storage based on business need  Pre-prod and development environment needs might vary from production environment needs  Leverage lifecycle management policies for better data management  Move data to a cooler tier if not actively used – be aware of higher transaction costs and minimum retention policies  Use retention policies to delete data that is not needed
  • 21. How do I optimize my data lake? Goal • Optimize for performance AND scale as the data and applications continue to grow on the data lake The basic considerations are… • Optimize for high throughput • Target getting at least a few MBs (higher the better) per transaction. • Optimize data access patterns • Reduce unnecessary scanning of files • Read only the data you need to read. • Write efficiently so downstream applications that read the data benefit
  • 22. File size and format • Too many small files adversely impact performance • Choosing the right format – better performance AND lower cost • Parquet – integrated optimizations with Azure Synapse Analytics and Azure Databricks • Recommendations  Modify source to ingest larger files into the data lake  Coalesce and convert to right format (E.g. Parquet) in curation phase of your analytics pipelines  Realtime analytics pipelines (E.g. sensor data in IoT application) – microbatch for larger writes
  • 23. Partition your data for optimized access Partition based on consumption patterns for optimized performance Sensor ID Year Temperature Humidity Pressure
  • 24. Microsoft Confidential Query Acceleration (Preview)  Optimize access to structured data by filtering data directly in the storage service  Single file predicate evaluation and column projection to optimize analytics engines  Eg:  SELECT _1, _3 FROM BlobStorage WHERE _14 < 250 AND _16 > '2019-07-01'
  • 25. Guidance from experts Microsoft Docs Explore overviews, tutorials, code samples, and more. Azure Data Lake Storage: https://docs.microsoft.com/azure/storage/blobs/data-lake-storage-introduction Azure Data Lake Storage Guidance Document: https://aka.ms/adls/guidancedoc Azure Synapse Analytics: https://docs.microsoft.com/azure/synapse-analytics
  • 26. © Copyright Microsoft Corporation. All rights reserved.

Editor's Notes

  1. An Azure Virtual Network (VNet) is a representation of your own network in the cloud. It is a logical isolation of the Azure cloud dedicated to your subscription. ... When you create a VNet, your services and VMs within your VNet can communicate directly and securely with each other in the cloud.
  2. Symptom: Job latencies Investigation Storage request throttling Root cause Too many read operations to storage. Large number of row groups in databrick delta parquet file resulted in lots of reads operations. Solution Adjusted parquet.block.size config value to reduce number of row groups per parquet file Job runtimes reduced by 3x
  3. Symptom: Job timeouts Investigation Transaction and throughput peaks, bursty pattern of load Storage request throttling Root cause Data cleanup during SLA job execution Large number of partitions 10’s of thousands of partitions Solution Reduced number of partitions to 250 Best practice: Partitioning strategy must align with your query pattern Reduced the number of delete operations while SLA job is running