An intro to Azure Data Lake

Rick van den Bosch
Rick van den BoschCloud Solution Architect at Betabit
An intro to Azure Data Lake
Rick van den Bosch
Cloud Solutions Architect
@rickvdbosch
rickvandenbosch.net
r.van.den.bosch@betabit.nl
AN INTRO TO AZURE DATA LAKE
Calendar
• About Azure Data Lake
• Azure Data Lake Store
• Demo
• Azure Data Lake HDInsight
• Azure Data Lake Analytics
• Demo
• Power BI
• Resources
AZURE DATA LAKE
Azure Data Lake
Example
Azure Data Lake
AZURE DATA LAKE STORE
Store
• Enterprise-wide hyper-scale repository
• Data of any size, type and ingestion speed
• Operational and exploratory analytics
• WebHDFS-compatible API
• Specifically designed to enable analytics
• Tuned for (data analytics scenario) performance
• Out of the box:
security, manageability, scalability, reliability, and availability
Key capabilities
• Built for Hadoop
• Unlimited storage, petabyte files
• Performance-tuned for big data analytics
• Enterprise-ready: Highly-available and secure
• All data
Security
• Authentication
• Azure Active Directory integration
• Oauth 2.0 support for REST interface
• Access control
• Supports POSIX-style permissions (exposed by WebHDFS)
• ACLs on root, subfolders and individual files
• Encryption
Compatibility
Ingest data
Ingest data – Ad hoc
• Local computer
• Azure Portal
• Azure PowerShell
• Azure CLI
• Using Data Lake Tools for Visual Studio
• Azure Storage Blob
• Azure Data Factory
• AdlCopy tool
• DistCp running on HDInsight cluster
Ingest data – Streamed data
• Azure Stream Analytics
• Azure HDInsight Storm
• EventProcessorHost
Ingest data – Relational data
• Apache Sqoop
• Azure Data Factory
Ingest data – Web server log data
Upload using custom applications
• Azure CLI
• Azure PowerShell
• Azure Data Lake Storage Gen1 .NET SDK
• Azure Data Factory
Ingest data - Data associated with Azure HDInsight
clusters
• Apache DistCp
• AdlCopy service
• Azure Data Factory
Ingest data – Really large datasets
• ExpressRoute
• “Offline” upload of data
• Azure Import/Export service
Process data
Download data
Visualize data
DEMO
Storage Gen2 (Preview)
• Dedicated to big data analytics
• Built on top of Azure Storage
• The only cloud-based multi-modal storage service
“In Data Lake Storage Gen2, all the qualities of object storage remain
while adding the advantages of a file system interface optimized for
analytics workloads.”
Store Gen2 (Preview)
• Optimized performance
• No need to copy or transform data
• Easier management
• Organize and manipulate files through directories and subdirectories
• Enforceable security
• POSIX permissions on folders or individual files
• Cost effectiveness
• Built on top of the low-cost Azure Blob storage
AZURE DATA LAKE HDINSIGHT
HDInsight
• Cloud distribution of the (Hortonworks) Hadoop components
• Supports multiple Hadoop cluster versions (can be deployed any time)
• Hadoop
• YARN for job scheduling & resource management
• MapReduce for parallel processing
• HDFS
An intro to Azure Data Lake
Cluster types
• Apache Hadoop
• Apache Spark
• Apache Kafka
• Apache Interactive Query (AKA: Live Long and Process)
• Apache Storm
• Microsoft Machine Learning Services (R Server)
Component & utilities
• Ambari
• Avro
• Hive & HCatalog
• Mahout
• MapReduce
• Oozie
• Phoenix
• Pig
• Sqoop
• Tez
• YARN
• ZooKeeper
Languages - Default
• Java
- Clojure
- Jython
- Scala
• Python
• Pig Latin (for Pig jobs)
• HiveQL for Hive jobs and SparkSQL
AZURE DATA LAKE ANALYTICS
Analytics
• Dynamic scaling
• Develop faster, debug and optimize smarter using familiar tools
• U-SQL: simple and familiar, powerful, and extensible
• Integrates seamlessly with your IT investments
• Affordable and cost effective
• Works with all your Azure data
Analytics
• on-demand analytics job service to simplify big data analytics
• can handle jobs of any scale instantly
• Azure Active Directory integration
• U-SQL
U-SQL
• language that combines declarative SQL with imperative C#
U-SQL – Key concepts
• Rowset variables
• Each query expression that produces a rowset can be assigned to a variable.
• EXTRACT
• Reads data from a file and defines the schema on read *
• OUTPUT
• Writes data from a rowset to a file *
U-SQL – Scalar variables
DECLARE @in string = "/Samples/Data/SearchLog.tsv";
DECLARE @out string = "/output/SearchLog-scalar-variables.csv";
@searchlog =
EXTRACT UserId int,
ClickedUrls string
FROM @in
USING Extractors.Tsv();
OUTPUT @searchlog
TO @out
USING Outputters.Csv();
U-SQL – Transform rowsets
@searchlog =
EXTRACT UserId int,
Region string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@rs1 =
SELECT UserId, Region
FROM @searchlog
WHERE Region == "en-gb";
OUTPUT @rs1
TO "/output/SearchLog-transform-rowsets.csv"
USING Outputters.Csv();
U-SQL – Extractor parameters
• delimiter
• encoding
• escapeCharacter
• nullEscape
• quoting
• rowDelimiter
• silent
• skipFirstNRows
• charFormat
U-SQL – Outputter parameters
• delimiter
• dateTimeFormat
• encoding
• escapeCharacter
• nullEscape
• quoting
• rowDelimeter
• charFormat
• outputHeader
U-SQL
Built-in extractors and outputters:
• Text
• Csv
• Tsv
A (for instance) CSV Extractor or Outputter is
EXACTLY THAT
Data sources
• Options in the Azure Portal:
• Data Lake Storage Gen1
• Azure Storage
DEMO
POWERBI
Power BI
DEMO
USING AZURE SQL IN DATA LAKE ANALYTICS
Data Sources
CREATE DATA SOURCE –statement
• Azure SQL Database
• Azure SQL Datawarehouse
• SQL Server 2012 and up in an Azure VM
Create Azure SQL Data Source
1. Make sure your SQL Server firewall settings allow Azure Services to
connect
2. Create a ‘database’ in the Data Lake Analytics account
3. Create a Data Lake Analytics Catalog Credential
4. Create a Data Lake Analytics Data Source
5. Query your Azure SQL Database from Data Lake Analytics
Create ‘database’ in DLA (U-SQL)
CREATE DATABASE <YourDatabaseName>;
Create credential (PowerShell)
Login-AzureRmAccount;
Set-AzureRMContext -SubscriptionId <YourSubscriptionId>;
New-AzureRmDataLakeAnalyticsCatalogCredential
-AccountName "<YourDLAAccount>"
-DatabaseName "<YourDatabaseName>"
-CredentialName "YourCredentialName"
-Credential (Get-Credential)
-DatabaseHost "<YourAzureSqlServer>.database.windows.net"
-Port 1433;
Create Data Source (U-SQL)
USE DATABASE <YourDatabaseName>;
CREATE DATA SOURCE <YourDataSourceName>
FROM AZURESQLDB
WITH
(
PROVIDER_STRING =
"Database=<YourAzureSQLDatabaseName>;Trusted_Connection=False;
Encrypt=True",
CREDENTIAL = <YourCredentialName>,
REMOTABLE_TYPES = (bool, byte, sbyte, int, string, …)
);
Data Source (under Data Explorer)
Query your Azure SQL Database (U-SQL)
USE DATABASE <YourDatabaseName>;
@results =
SELECT *
FROM EXTERNAL <YourDataSourceName> EXECUTE
@"<QueryForYourSQLDatabase>";
OUTPUT @results
TO "<OutputFileName>"
USING Outputters.Csv(outputHeader: true);
Resources
• Basic example
• Advanced example
• Create Database (U-SQL) & Create Data Source (U-SQL)
• This example
• Azure blog
• Azure roadmap
• rickvandenbosch.net
THANK YOU
An intro to Azure Data Lake
1 of 58

Recommended

Dipping Your Toes: Azure Data Lake for DBAs by
Dipping Your Toes: Azure Data Lake for DBAsDipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAsBob Pusateri
240 views74 slides
Azure Lowlands: An intro to Azure Data Lake by
Azure Lowlands: An intro to Azure Data LakeAzure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeRick van den Bosch
1.6K views52 slides
Data Analytics Meetup: Introduction to Azure Data Lake Storage by
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage CCG
1.2K views28 slides
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen by
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenMS Cloud Summit
504 views60 slides
Running cost effective big data workloads with Azure Synapse and Azure Data L... by
Running cost effective big data workloads with Azure Synapse and Azure Data L...Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Michael Rys
734 views12 slides
Introduction to Azure Data Lake by
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data LakeAntonios Chatzipavlis
3.8K views32 slides

More Related Content

What's hot

Cortana Analytics Workshop: Azure Data Lake by
Cortana Analytics Workshop: Azure Data LakeCortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeMSAdvAnalytics
3.2K views26 slides
Azure data bricks by Eugene Polonichko by
Azure data bricks by Eugene PolonichkoAzure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAlex Tumanoff
435 views22 slides
Azure data factory by
Azure data factoryAzure data factory
Azure data factoryBizTalk360
4.2K views16 slides
Microsoft cloud big data strategy by
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategyJames Serra
8.7K views58 slides
Introduction to Azure Databricks by
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure DatabricksJames Serra
27.2K views53 slides
Azure Data Lake Analytics Deep Dive by
Azure Data Lake Analytics Deep DiveAzure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep DiveIlyas F ☁☁☁
2.8K views59 slides

What's hot(20)

Cortana Analytics Workshop: Azure Data Lake by MSAdvAnalytics
Cortana Analytics Workshop: Azure Data LakeCortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data Lake
MSAdvAnalytics3.2K views
Azure data bricks by Eugene Polonichko by Alex Tumanoff
Azure data bricks by Eugene PolonichkoAzure data bricks by Eugene Polonichko
Azure data bricks by Eugene Polonichko
Alex Tumanoff435 views
Azure data factory by BizTalk360
Azure data factoryAzure data factory
Azure data factory
BizTalk3604.2K views
Microsoft cloud big data strategy by James Serra
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategy
James Serra8.7K views
Introduction to Azure Databricks by James Serra
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra27.2K views
Azure Data Factory by HARIHARAN R
Azure Data FactoryAzure Data Factory
Azure Data Factory
HARIHARAN R807 views
Azure Data Lake and U-SQL by Michael Rys
Azure Data Lake and U-SQLAzure Data Lake and U-SQL
Azure Data Lake and U-SQL
Michael Rys5.1K views
Streaming Real-time Data to Azure Data Lake Storage Gen 2 by Carole Gunst
Streaming Real-time Data to Azure Data Lake Storage Gen 2Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Carole Gunst6.1K views
A lap around Azure Data Factory by BizTalk360
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data Factory
BizTalk3604K views
RDX Insights Presentation - Microsoft Business Intelligence by Christopher Foot
RDX Insights Presentation - Microsoft Business IntelligenceRDX Insights Presentation - Microsoft Business Intelligence
RDX Insights Presentation - Microsoft Business Intelligence
Christopher Foot764 views
Azure Data Factory V2; The Data Flows by Thomas Sykes
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data Flows
Thomas Sykes658 views
Leveraging Azure Databricks to minimize time to insight by combining Batch an... by Microsoft Tech Community
Leveraging Azure Databricks to minimize time to insight by combining Batch an...Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
The Developer Data Scientist – Creating New Analytics Driven Applications usi... by Microsoft Tech Community
The Developer Data Scientist – Creating New Analytics Driven Applications usi...The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat... by Microsoft Tech Community
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Azure Synapse Analytics Overview (r1) by James Serra
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra24.7K views

Similar to An intro to Azure Data Lake

Move your on prem data to a lake in a Lake in Cloud by
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudCAMMS
95 views18 slides
Talavant Data Lake Analytics by
Talavant Data Lake Analytics Talavant Data Lake Analytics
Talavant Data Lake Analytics Sean Forgatch
210 views57 slides
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini... by
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Michael Rys
1.8K views57 slides
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor... by
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Michael Rys
2.3K views42 slides
CC -Unit4.pptx by
CC -Unit4.pptxCC -Unit4.pptx
CC -Unit4.pptxRevathiparamanathan
7 views37 slides
Survey of the Microsoft Azure Data Landscape by
Survey of the Microsoft Azure Data LandscapeSurvey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data LandscapeIke Ellis
1.3K views47 slides

Similar to An intro to Azure Data Lake(20)

Move your on prem data to a lake in a Lake in Cloud by CAMMS
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in Cloud
CAMMS95 views
Talavant Data Lake Analytics by Sean Forgatch
Talavant Data Lake Analytics Talavant Data Lake Analytics
Talavant Data Lake Analytics
Sean Forgatch210 views
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini... by Michael Rys
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Michael Rys1.8K views
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor... by Michael Rys
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Michael Rys2.3K views
Survey of the Microsoft Azure Data Landscape by Ike Ellis
Survey of the Microsoft Azure Data LandscapeSurvey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data Landscape
Ike Ellis1.3K views
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635) by Michael Rys
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Michael Rys2.5K views
Big Data Developers Moscow Meetup 1 - sql on hadoop by bddmoscow
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow931 views
Using existing language skillsets to create large-scale, cloud-based analytics by Microsoft Tech Community
Using existing language skillsets to create large-scale, cloud-based analyticsUsing existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analytics
Big Data on azure by David Giard
Big Data on azureBig Data on azure
Big Data on azure
David Giard823 views
Introducing U-SQL (SQLPASS 2016) by Michael Rys
Introducing U-SQL (SQLPASS 2016)Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)
Michael Rys2K views
4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p... by PROIDEA
4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...
4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...
PROIDEA40 views
Solr + Hadoop: Interactive Search for Hadoop by gregchanan
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
gregchanan2.4K views
Best Practices: Hadoop migration to Azure HDInsight by Revin Chalil
Best Practices: Hadoop migration to Azure HDInsightBest Practices: Hadoop migration to Azure HDInsight
Best Practices: Hadoop migration to Azure HDInsight
Revin Chalil508 views
Search onhadoopsfhug081413 by gregchanan
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413
gregchanan1.4K views
U-SQL - Azure Data Lake Analytics for Developers by Michael Rys
U-SQL - Azure Data Lake Analytics for DevelopersU-SQL - Azure Data Lake Analytics for Developers
U-SQL - Azure Data Lake Analytics for Developers
Michael Rys5.7K views
USQL Trivadis Azure Data Lake Event by Trivadis
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
Trivadis465 views
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014 by NoSQLmatters
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
NoSQLmatters2.4K views
Basic Introduction to Crate @ ViennaDB Meetup by Johannes Moser
Basic Introduction to Crate @ ViennaDB MeetupBasic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB Meetup
Johannes Moser308 views

More from Rick van den Bosch

Configuration in azure done right by
Configuration in azure done rightConfiguration in azure done right
Configuration in azure done rightRick van den Bosch
353 views28 slides
Getting started with Azure Cognitive services by
Getting started with Azure Cognitive servicesGetting started with Azure Cognitive services
Getting started with Azure Cognitive servicesRick van den Bosch
99 views29 slides
From .NET Core 3, all the rest will be legacy by
From .NET Core 3, all the rest will be legacyFrom .NET Core 3, all the rest will be legacy
From .NET Core 3, all the rest will be legacyRick van den Bosch
290 views54 slides
Getting sh*t done with Azure Functions (on AKS!) by
Getting sh*t done with Azure Functions (on AKS!)Getting sh*t done with Azure Functions (on AKS!)
Getting sh*t done with Azure Functions (on AKS!)Rick van den Bosch
773 views43 slides
SAFwAD @ Intelligent Cloud Conference by
SAFwAD @ Intelligent Cloud ConferenceSAFwAD @ Intelligent Cloud Conference
SAFwAD @ Intelligent Cloud ConferenceRick van den Bosch
298 views45 slides
Securing an Azure Function REST API with Azure Active Directory by
Securing an Azure Function REST API with Azure Active DirectorySecuring an Azure Function REST API with Azure Active Directory
Securing an Azure Function REST API with Azure Active DirectoryRick van den Bosch
259 views29 slides

More from Rick van den Bosch(11)

From .NET Core 3, all the rest will be legacy by Rick van den Bosch
From .NET Core 3, all the rest will be legacyFrom .NET Core 3, all the rest will be legacy
From .NET Core 3, all the rest will be legacy
Rick van den Bosch290 views
Getting sh*t done with Azure Functions (on AKS!) by Rick van den Bosch
Getting sh*t done with Azure Functions (on AKS!)Getting sh*t done with Azure Functions (on AKS!)
Getting sh*t done with Azure Functions (on AKS!)
Rick van den Bosch773 views
Securing an Azure Function REST API with Azure Active Directory by Rick van den Bosch
Securing an Azure Function REST API with Azure Active DirectorySecuring an Azure Function REST API with Azure Active Directory
Securing an Azure Function REST API with Azure Active Directory
Rick van den Bosch259 views
TechDays 2017 - Going Serverless (2/2): Hands-on with Azure Event Grid by Rick van den Bosch
TechDays 2017 - Going Serverless (2/2): Hands-on with Azure Event GridTechDays 2017 - Going Serverless (2/2): Hands-on with Azure Event Grid
TechDays 2017 - Going Serverless (2/2): Hands-on with Azure Event Grid
Rick van den Bosch379 views
TechDays 2016 - Case Study: Azure + IOT + LoRa = ”Leven is Water” by Rick van den Bosch
TechDays 2016 - Case Study: Azure + IOT + LoRa = ”Leven is Water”TechDays 2016 - Case Study: Azure + IOT + LoRa = ”Leven is Water”
TechDays 2016 - Case Study: Azure + IOT + LoRa = ”Leven is Water”
Rick van den Bosch996 views
Take control of your deployments with Release Management by Rick van den Bosch
Take control of your deployments with Release ManagementTake control of your deployments with Release Management
Take control of your deployments with Release Management
Rick van den Bosch967 views

Recently uploaded

Design Driven Network Assurance by
Design Driven Network AssuranceDesign Driven Network Assurance
Design Driven Network AssuranceNetwork Automation Forum
15 views42 slides
PRODUCT PRESENTATION.pptx by
PRODUCT PRESENTATION.pptxPRODUCT PRESENTATION.pptx
PRODUCT PRESENTATION.pptxangelicacueva6
14 views1 slide
SAP Automation Using Bar Code and FIORI.pdf by
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdfVirendra Rai, PMP
23 views38 slides
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Safe Software
263 views86 slides
Microsoft Power Platform.pptx by
Microsoft Power Platform.pptxMicrosoft Power Platform.pptx
Microsoft Power Platform.pptxUni Systems S.M.S.A.
53 views38 slides
STPI OctaNE CoE Brochure.pdf by
STPI OctaNE CoE Brochure.pdfSTPI OctaNE CoE Brochure.pdf
STPI OctaNE CoE Brochure.pdfmadhurjyapb
14 views1 slide

Recently uploaded(20)

SAP Automation Using Bar Code and FIORI.pdf by Virendra Rai, PMP
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdf
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software263 views
STPI OctaNE CoE Brochure.pdf by madhurjyapb
STPI OctaNE CoE Brochure.pdfSTPI OctaNE CoE Brochure.pdf
STPI OctaNE CoE Brochure.pdf
madhurjyapb14 views
Case Study Copenhagen Energy and Business Central.pdf by Aitana
Case Study Copenhagen Energy and Business Central.pdfCase Study Copenhagen Energy and Business Central.pdf
Case Study Copenhagen Energy and Business Central.pdf
Aitana16 views
Unit 1_Lecture 2_Physical Design of IoT.pdf by StephenTec
Unit 1_Lecture 2_Physical Design of IoT.pdfUnit 1_Lecture 2_Physical Design of IoT.pdf
Unit 1_Lecture 2_Physical Design of IoT.pdf
StephenTec12 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker37 views
Serverless computing with Google Cloud (2023-24) by wesley chun
Serverless computing with Google Cloud (2023-24)Serverless computing with Google Cloud (2023-24)
Serverless computing with Google Cloud (2023-24)
wesley chun11 views
The details of description: Techniques, tips, and tangents on alternative tex... by BookNet Canada
The details of description: Techniques, tips, and tangents on alternative tex...The details of description: Techniques, tips, and tangents on alternative tex...
The details of description: Techniques, tips, and tangents on alternative tex...
BookNet Canada127 views
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson85 views
handbook for web 3 adoption.pdf by Liveplex
handbook for web 3 adoption.pdfhandbook for web 3 adoption.pdf
handbook for web 3 adoption.pdf
Liveplex22 views
Attacking IoT Devices from a Web Perspective - Linux Day by Simone Onofri
Attacking IoT Devices from a Web Perspective - Linux Day Attacking IoT Devices from a Web Perspective - Linux Day
Attacking IoT Devices from a Web Perspective - Linux Day
Simone Onofri16 views

An intro to Azure Data Lake

  • 2. Rick van den Bosch Cloud Solutions Architect @rickvdbosch rickvandenbosch.net r.van.den.bosch@betabit.nl
  • 3. AN INTRO TO AZURE DATA LAKE
  • 4. Calendar • About Azure Data Lake • Azure Data Lake Store • Demo • Azure Data Lake HDInsight • Azure Data Lake Analytics • Demo • Power BI • Resources
  • 10. Store • Enterprise-wide hyper-scale repository • Data of any size, type and ingestion speed • Operational and exploratory analytics • WebHDFS-compatible API • Specifically designed to enable analytics • Tuned for (data analytics scenario) performance • Out of the box: security, manageability, scalability, reliability, and availability
  • 11. Key capabilities • Built for Hadoop • Unlimited storage, petabyte files • Performance-tuned for big data analytics • Enterprise-ready: Highly-available and secure • All data
  • 12. Security • Authentication • Azure Active Directory integration • Oauth 2.0 support for REST interface • Access control • Supports POSIX-style permissions (exposed by WebHDFS) • ACLs on root, subfolders and individual files • Encryption
  • 15. Ingest data – Ad hoc • Local computer • Azure Portal • Azure PowerShell • Azure CLI • Using Data Lake Tools for Visual Studio • Azure Storage Blob • Azure Data Factory • AdlCopy tool • DistCp running on HDInsight cluster
  • 16. Ingest data – Streamed data • Azure Stream Analytics • Azure HDInsight Storm • EventProcessorHost
  • 17. Ingest data – Relational data • Apache Sqoop • Azure Data Factory
  • 18. Ingest data – Web server log data Upload using custom applications • Azure CLI • Azure PowerShell • Azure Data Lake Storage Gen1 .NET SDK • Azure Data Factory
  • 19. Ingest data - Data associated with Azure HDInsight clusters • Apache DistCp • AdlCopy service • Azure Data Factory
  • 20. Ingest data – Really large datasets • ExpressRoute • “Offline” upload of data • Azure Import/Export service
  • 24. DEMO
  • 25. Storage Gen2 (Preview) • Dedicated to big data analytics • Built on top of Azure Storage • The only cloud-based multi-modal storage service “In Data Lake Storage Gen2, all the qualities of object storage remain while adding the advantages of a file system interface optimized for analytics workloads.”
  • 26. Store Gen2 (Preview) • Optimized performance • No need to copy or transform data • Easier management • Organize and manipulate files through directories and subdirectories • Enforceable security • POSIX permissions on folders or individual files • Cost effectiveness • Built on top of the low-cost Azure Blob storage
  • 27. AZURE DATA LAKE HDINSIGHT
  • 28. HDInsight • Cloud distribution of the (Hortonworks) Hadoop components • Supports multiple Hadoop cluster versions (can be deployed any time) • Hadoop • YARN for job scheduling & resource management • MapReduce for parallel processing • HDFS
  • 30. Cluster types • Apache Hadoop • Apache Spark • Apache Kafka • Apache Interactive Query (AKA: Live Long and Process) • Apache Storm • Microsoft Machine Learning Services (R Server)
  • 31. Component & utilities • Ambari • Avro • Hive & HCatalog • Mahout • MapReduce • Oozie • Phoenix • Pig • Sqoop • Tez • YARN • ZooKeeper
  • 32. Languages - Default • Java - Clojure - Jython - Scala • Python • Pig Latin (for Pig jobs) • HiveQL for Hive jobs and SparkSQL
  • 33. AZURE DATA LAKE ANALYTICS
  • 34. Analytics • Dynamic scaling • Develop faster, debug and optimize smarter using familiar tools • U-SQL: simple and familiar, powerful, and extensible • Integrates seamlessly with your IT investments • Affordable and cost effective • Works with all your Azure data
  • 35. Analytics • on-demand analytics job service to simplify big data analytics • can handle jobs of any scale instantly • Azure Active Directory integration • U-SQL
  • 36. U-SQL • language that combines declarative SQL with imperative C#
  • 37. U-SQL – Key concepts • Rowset variables • Each query expression that produces a rowset can be assigned to a variable. • EXTRACT • Reads data from a file and defines the schema on read * • OUTPUT • Writes data from a rowset to a file *
  • 38. U-SQL – Scalar variables DECLARE @in string = "/Samples/Data/SearchLog.tsv"; DECLARE @out string = "/output/SearchLog-scalar-variables.csv"; @searchlog = EXTRACT UserId int, ClickedUrls string FROM @in USING Extractors.Tsv(); OUTPUT @searchlog TO @out USING Outputters.Csv();
  • 39. U-SQL – Transform rowsets @searchlog = EXTRACT UserId int, Region string FROM "/Samples/Data/SearchLog.tsv" USING Extractors.Tsv(); @rs1 = SELECT UserId, Region FROM @searchlog WHERE Region == "en-gb"; OUTPUT @rs1 TO "/output/SearchLog-transform-rowsets.csv" USING Outputters.Csv();
  • 40. U-SQL – Extractor parameters • delimiter • encoding • escapeCharacter • nullEscape • quoting • rowDelimiter • silent • skipFirstNRows • charFormat
  • 41. U-SQL – Outputter parameters • delimiter • dateTimeFormat • encoding • escapeCharacter • nullEscape • quoting • rowDelimeter • charFormat • outputHeader
  • 42. U-SQL Built-in extractors and outputters: • Text • Csv • Tsv A (for instance) CSV Extractor or Outputter is EXACTLY THAT
  • 43. Data sources • Options in the Azure Portal: • Data Lake Storage Gen1 • Azure Storage
  • 44. DEMO
  • 47. DEMO
  • 48. USING AZURE SQL IN DATA LAKE ANALYTICS
  • 49. Data Sources CREATE DATA SOURCE –statement • Azure SQL Database • Azure SQL Datawarehouse • SQL Server 2012 and up in an Azure VM
  • 50. Create Azure SQL Data Source 1. Make sure your SQL Server firewall settings allow Azure Services to connect 2. Create a ‘database’ in the Data Lake Analytics account 3. Create a Data Lake Analytics Catalog Credential 4. Create a Data Lake Analytics Data Source 5. Query your Azure SQL Database from Data Lake Analytics
  • 51. Create ‘database’ in DLA (U-SQL) CREATE DATABASE <YourDatabaseName>;
  • 52. Create credential (PowerShell) Login-AzureRmAccount; Set-AzureRMContext -SubscriptionId <YourSubscriptionId>; New-AzureRmDataLakeAnalyticsCatalogCredential -AccountName "<YourDLAAccount>" -DatabaseName "<YourDatabaseName>" -CredentialName "YourCredentialName" -Credential (Get-Credential) -DatabaseHost "<YourAzureSqlServer>.database.windows.net" -Port 1433;
  • 53. Create Data Source (U-SQL) USE DATABASE <YourDatabaseName>; CREATE DATA SOURCE <YourDataSourceName> FROM AZURESQLDB WITH ( PROVIDER_STRING = "Database=<YourAzureSQLDatabaseName>;Trusted_Connection=False; Encrypt=True", CREDENTIAL = <YourCredentialName>, REMOTABLE_TYPES = (bool, byte, sbyte, int, string, …) );
  • 54. Data Source (under Data Explorer)
  • 55. Query your Azure SQL Database (U-SQL) USE DATABASE <YourDatabaseName>; @results = SELECT * FROM EXTERNAL <YourDataSourceName> EXECUTE @"<QueryForYourSQLDatabase>"; OUTPUT @results TO "<OutputFileName>" USING Outputters.Csv(outputHeader: true);
  • 56. Resources • Basic example • Advanced example • Create Database (U-SQL) & Create Data Source (U-SQL) • This example • Azure blog • Azure roadmap • rickvandenbosch.net

Editor's Notes

  1. WebHDFS means Hadoop & HDInsight compatibility
  2. Apache Hadoop file system compatible with Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem It does not impose any limits on account sizes, file sizes, or the amount of data that can be stored in a data lake. Data is stored durably by making multiple copies and there is no limit on the duration of time The data lake spreads parts of a file over a number of individual storage servers. This improves the read throughput when reading the file in parallel for performing data analytics. Redundant copies, enterprise-grade security for the stored data Data in native format, loading data doesn’t require a schema, structured, semi-structured, and unstructured data
  3. including multi-factor authentication, conditional access, role-based access control, application usage monitoring, security monitoring and alerting, etc. Do not enable encryption. Use keys managed by Data Lake Storage Gen1 Use keys from your own Key Vault.
  4. smaller data sets that are used for prototyping
  5. You can use Azure HDInsight and Azure Data Lake Analytics to run data analysis jobs on the data stored in Data Lake Storage Gen1.
  6. Apache Sqoop, Azure Data Factory, Apache DistCp Custom script / application Azure CLI, Azure PowerShell, Azure Data Lake Storage Gen1 .NET SDK
  7. Hadoop, Spark, Hive, LLAP, Kafka, Storm, and Microsoft Machine Learning Server
  8. - architected for cloud scale and performance. It dynamically provisions resources and lets you do analytics on terabytes or even exabytes of data Deep integration with Visual Studio Query language can use your existing IT investments for identity, management, security, and data warehousing You pay on a per-job basis when data is processed. - optimized to work with Azure Data Lake - providing the highest level of performance, throughput, and parallelization for your big data workloads. Data Lake Analytics can also work with Azure Blob storage and Azure SQL Database.
  9. * You can develop custom ones!
  10. Stolpersteine (struikelstenen) zijn gedenktekens die de kunstenaar Gunter Demnig (1947, Berlijn) plaatst in het trottoir voor de huizen waar mensen gewoond hebben die door de nazi's verdreven, gedeporteerd, vermoord of tot zelfmoord gedreven zijn. Het bestand bevat de (punt-)locaties van de struikelstenen in de openbare ruimte, met daaraan gekoppeld de na(a)m(en) van de mensen die herdacht worden met de betreffende steen, hun geboortedatum, deportatiedatum, vermoedelijke overlijdensdatum en de plaats van overlijden.