The document provides an introduction to Azure Data Lake, detailing its capabilities as an enterprise-wide repository for large-scale data storage and analytics. It covers ingestion methods for different data types, performance optimization, security features, and integration with Azure services like HDInsight and Power BI. Key components include U-SQL for data processing and Azure Data Lake Storage Gen2, emphasizing its advantages over traditional storage solutions.
Introductory details on Azure Data Lake and presentation agenda.
General overview of Azure Data Lake features and its importance.
Description of Azure Data Lake Store, its capabilities, security, and compatibility.
Various approaches for data ingestion including local, streamed, relational data, and large datasets.
Overview of data processing, downloading, and visualizing data in Azure Data Lake.
Introduction of Storage Gen2 aimed at big data analytics, along with its advantages.
Details on HDInsight including cluster types and components for Hadoop ecosystem.
Description and features of Azure Data Lake Analytics, including U-SQL language.
Insights into advanced U-SQL functionalities including extractor and output parameters.
Instructions on using Azure SQL as a data source and querying data in Azure Data Lake. Final thoughts and resources for further learning about Azure Data Lake capabilities.
Store
• Enterprise-wide hyper-scale repository
• Data of any size, type and ingestion speed
• Operational and exploratory analytics
• WebHDFS-compatible API
• Specifically designed to enable analytics
• Tuned for performance in data analytics scenarios
• Out of the box:
security, manageability, scalability, reliability, and availability
11.
Key capabilities
• Built for Hadoop
• Unlimited storage, petabyte files
• Performance-tuned for big data analytics
• Enterprise-ready: highly available and secure
• All data
12.
Security
• Authentication
• Azure Active Directory integration
• OAuth 2.0 support for REST interface
• Access control
• Supports POSIX-style permissions (exposed by WebHDFS)
• ACLs on root, subfolders and individual files
• Encryption
Ingest data – Ad hoc
• Local computer
• Azure Portal
• Azure PowerShell
• Azure CLI
• Using Data Lake Tools for Visual Studio
• Azure Storage Blob
• Azure Data Factory
• AdlCopy tool
• DistCp running on HDInsight cluster
16.
Ingest data – Streamed data
• Azure Stream Analytics
• Azure HDInsight Storm
• EventProcessorHost
17.
Ingest data – Relational data
• Apache Sqoop
• Azure Data Factory
18.
Ingest data – Web server log data
Upload using custom applications
• Azure CLI
• Azure PowerShell
• Azure Data Lake Storage Gen1 .NET SDK
• Azure Data Factory
19.
Ingest data – Data associated with Azure HDInsight clusters
• Apache DistCp
• AdlCopy service
• Azure Data Factory
20.
Ingest data – Really large datasets
• ExpressRoute
• “Offline” upload of data
• Azure Import/Export service
Storage Gen2 (Preview)
• Dedicated to big data analytics
• Built on top of Azure Storage
• The only cloud-based multi-modal storage service
“In Data Lake Storage Gen2, all the qualities of object storage remain
while adding the advantages of a file system interface optimized for
analytics workloads.”
26.
Storage Gen2 (Preview)
• Optimized performance
• No need to copy or transform data
• Easier management
• Organize and manipulate files through directories and subdirectories
• Enforceable security
• POSIX permissions on folders or individual files
• Cost effectiveness
• Built on top of the low-cost Azure Blob storage
Analytics
• Dynamic scaling
• Develop faster, debug and optimize smarter using familiar tools
• U-SQL: simple and familiar, powerful, and extensible
• Integrates seamlessly with your IT investments
• Affordable and cost effective
• Works with all your Azure data
35.
Analytics
• On-demand analytics job service to simplify big data analytics
• Can handle jobs of any scale instantly
• Azure Active Directory integration
• U-SQL
U-SQL – Key concepts
• Rowset variables
• Each query expression that produces a rowset can be assigned to a variable.
• EXTRACT
• Reads data from a file and defines the schema on read
• OUTPUT
• Writes data from a rowset to a file
38.
U-SQL – Scalar variables
DECLARE @in string = "/Samples/Data/SearchLog.tsv";
DECLARE @out string = "/output/SearchLog-scalar-variables.csv";
@searchlog =
EXTRACT UserId int,
ClickedUrls string
FROM @in
USING Extractors.Tsv();
OUTPUT @searchlog
TO @out
USING Outputters.Csv();
39.
U-SQL – Transform rowsets
@searchlog =
EXTRACT UserId int,
Region string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@rs1 =
SELECT UserId, Region
FROM @searchlog
WHERE Region == "en-gb";
OUTPUT @rs1
TO "/output/SearchLog-transform-rowsets.csv"
USING Outputters.Csv();
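The summary above also mentions extractor and output parameters. A minimal sketch (the file paths and columns are illustrative placeholders) shows the built-in Tsv extractor's skipFirstNRows parameter skipping a header row on read, and the Csv outputter's outputHeader parameter emitting one on write:

```
// Skip a header row on read and emit a header row on write.
// Paths and column names are illustrative placeholders.
@searchlog =
    EXTRACT UserId int,
            Region string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv(skipFirstNRows: 1);

OUTPUT @searchlog
TO "/output/SearchLog-with-header.csv"
USING Outputters.Csv(outputHeader: true);
```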
Data Sources
CREATE DATA SOURCE statement
• Azure SQL Database
• Azure SQL Data Warehouse
• SQL Server 2012 and up in an Azure VM
50.
Create Azure SQL Data Source
1. Make sure your SQL Server firewall settings allow Azure Services to
connect
2. Create a ‘database’ in the Data Lake Analytics account
3. Create a Data Lake Analytics Catalog Credential
4. Create a Data Lake Analytics Data Source
5. Query your Azure SQL Database from Data Lake Analytics
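Steps 2 and 4 above can be sketched in U-SQL. The names (SqlDbSample, MySqlDataSource, MyCredential) and the provider string are placeholders; the catalog credential from step 3 is created outside U-SQL, via an Azure PowerShell cmdlet:

```
// Step 2: create a database in the Data Lake Analytics catalog
// (SqlDbSample is a placeholder name).
CREATE DATABASE IF NOT EXISTS SqlDbSample;
USE DATABASE SqlDbSample;

// Step 4: create the data source pointing at the Azure SQL Database.
// MyCredential must already exist as a catalog credential (step 3);
// the provider string below is an illustrative placeholder.
CREATE DATA SOURCE IF NOT EXISTS MySqlDataSource
FROM AZURESQLDB
WITH
(
    PROVIDER_STRING = "Database=mydb;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = MyCredential,
    REMOTABLE_TYPES = (bool, int, long, decimal, double, string, DateTime)
);
```

Once the data source exists, step 5 is the USE DATABASE / FROM EXTERNAL query shown on the next slide.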
Query your Azure SQL Database (U-SQL)
USE DATABASE <YourDatabaseName>;
@results =
SELECT *
FROM EXTERNAL <YourDataSourceName> EXECUTE
@"<QueryForYourSQLDatabase>";
OUTPUT @results
TO "<OutputFileName>"
USING Outputters.Csv(outputHeader: true);
56.
Resources
• Basic example
•Advanced example
• Create Database (U-SQL) & Create Data Source (U-SQL)
• This example
• Azure blog
• Azure roadmap
• rickvandenbosch.net
#11 WebHDFS means Hadoop & HDInsight compatibility
#12 Apache Hadoop file system compatible with Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem
It does not impose any limits on account sizes, file sizes, or the amount of data that can be stored in a data lake. Data is stored durably by making multiple copies and there is no limit on the duration of time
The data lake spreads parts of a file over a number of individual storage servers. This improves the read throughput when reading the file in parallel for performing data analytics.
Redundant copies, enterprise-grade security for the stored data
Data in native format, loading data doesn’t require a schema, structured, semi-structured, and unstructured data
#13 including multi-factor authentication, conditional access, role-based access control, application usage monitoring, security monitoring and alerting, etc.
Encryption options: do not enable encryption; use keys managed by Data Lake Storage Gen1; or use keys from your own Key Vault.
#16 smaller data sets that are used for prototyping
#22 You can use Azure HDInsight and Azure Data Lake Analytics to run data analysis jobs on the data stored in Data Lake Storage Gen1.
#23 Apache Sqoop, Azure Data Factory, Apache DistCp
Custom script / application
Azure CLI, Azure PowerShell, Azure Data Lake Storage Gen1 .NET SDK
#31 Hadoop, Spark, Hive, LLAP, Kafka, Storm, and Microsoft Machine Learning Server
#35 - architected for cloud scale and performance. It dynamically provisions resources and lets you do analytics on terabytes or even exabytes of data
Deep integration with Visual Studio
Query language
can use your existing IT investments for identity, management, security, and data warehousing
You pay on a per-job basis when data is processed.
- optimized to work with Azure Data Lake - providing the highest level of performance, throughput, and parallelization for your big data workloads. Data Lake Analytics can also work with Azure Blob storage and Azure SQL Database.
#46 Stolpersteine (stumbling stones) are memorials that the artist Gunter Demnig (b. 1947, Berlin) places in the pavement in front of the houses where people lived who were driven out, deported, murdered, or driven to suicide by the Nazis. The file contains the (point) locations of the Stolpersteine in public space, together with the name(s) of the people commemorated by each stone, their date of birth, date of deportation, presumed date of death, and place of death.