With the boom in data, both in volume and complexity, the trend is to move data to the cloud. Where and how do we do this? Azure gives you an answer. In this session, I will introduce Azure Data Lake and Azure Data Factory and explain why they suit this type of problem. You will learn how large datasets can be stored in the cloud, and how you can transport your data to this store. The session briefly covers Azure Data Lake as the modern warehouse for data on the cloud.
5. What are the challenges?
• Limited storage
• Limited processing power
• High hardware cost
• High maintenance cost
• No disaster recovery
• Availability and reliability issues
• Scalability issues
• Security
• Solution: Azure Data Lake
6. What is Azure Data Lake?
• Highly scalable data storage and analytics service
• Intended for big data storage and analysis
• A faster and more efficient solution than on-premises data centers
• Three services:
Azure Data Lake Analytics
Azure Data Lake Storage
Azure HDInsight (“managed clusters”)
8. Azure Data Lake Store
• Built for Hadoop
• Compatible with most components in the Hadoop ecosystem
• WebHDFS API
• Unlimited storage, petabyte files
• Performance-tuned for big data analytics
• High throughput, IOPs
• Parts of a file are stored across multiple servers, enabling parallel reads
• Enterprise-ready: Highly-available and secure
• All Data, One Place
• Any Data in native format
• No schema, No prior processing
9. Optimized for Big Data Analytics
• Multiple copies of the same file improve read performance
• Locally redundant (multiple copies of data in one Azure region)
• Parallel reading and writing
• Configurable throughput
• No limit on file size or total storage
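The parallel-read idea above can be sketched locally: split a file into byte ranges and read them concurrently. This is a simplified stand-in for how a client might fetch parts of a Data Lake file from multiple servers; the file contents, chunk size, and function names are illustrative, not part of any Azure SDK.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_chunk(path, offset, size):
    """Read one byte range of the file (each worker opens its own handle)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return offset, f.read(size)

def parallel_read(path, chunk_size=4):
    """Issue concurrent byte-range reads, then reassemble the file in order."""
    total = os.path.getsize(path)
    offsets = range(0, total, chunk_size)
    with ThreadPoolExecutor(max_workers=4) as pool:
        parts = list(pool.map(lambda off: read_chunk(path, off, chunk_size),
                              offsets))
    # Sort by offset so out-of-order completions still reassemble correctly.
    return b"".join(chunk for _, chunk in sorted(parts))

# Demo with a small temporary file standing in for a Data Lake file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789abcdef")
data = parallel_read(tmp.name)
os.unlink(tmp.name)
```

In the real service the per-range reads would be remote requests against different storage servers, which is where the throughput gain comes from.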
10. Secure Data in Azure Data Lake Store
• Authentication
• Azure Active Directory
• All AAD features
• End-user authentication or Service-to-service authentication
• Access Control
• POSIX-style permissions
• Read, Write, Execute
• ACLs can be enabled on the root folder, on subfolders, and on individual files.
• Encryption
• Encryption at rest
• Encryption in transit (HTTPS)
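The POSIX-style Read/Write/Execute permissions mentioned above are the same three-bit-per-class scheme used in an octal mode such as 750. A minimal sketch of how such a mode decodes into rwx triplets (the helper name is ours, not part of any Azure SDK):

```python
def rwx_string(mode):
    """Render a 3-digit octal mode (e.g. 0o750) as POSIX rwx triplets."""
    out = []
    for shift in (6, 3, 0):              # owner, group, other
        bits = (mode >> shift) & 0b111
        out.append("".join(flag if bits & (4 >> i) else "-"
                           for i, flag in enumerate("rwx")))
    return "".join(out)

# 0o750: owner has rwx, group has r-x, others have nothing.
print(rwx_string(0o750))
```

Data Lake Store applies these permission sets through ACL entries, which (as noted above) can be placed on the root folder, subfolders, and individual files.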
11. How to ingest data into Azure Data Lake Store
• Small Data Sets
• Azure Portal
• Azure PowerShell
• Azure Cross-Platform CLI 2.0
• Data Lake Tools For Visual Studio
• Streamed data
• Azure Stream Analytics
• Azure HDInsight Storm
• Data Lake Store .NET SDK
• Relational data
• Apache Sqoop
• Azure Data Factory
• Large Data Set
• Azure PowerShell
• Azure Cross-Platform CLI 2.0
• Azure Data Lake Store .NET SDK
• Azure Data Factory
• Really Large Data Sets
• Azure ExpressRoute
• Azure Import/Export service
12. How is it different from Azure Blob Storage?
Azure Data Lake Store vs. Azure Blob Storage
• Purpose
Data Lake Store: optimized storage for big data analytics workloads
Blob Storage: general purpose
• Use case
Data Lake Store: batch, interactive, and streaming analytics and machine learning data such as log files, IoT data, click streams, and large datasets
Blob Storage: any type of text or binary data, such as application back ends, backup data, media storage for streaming, and general-purpose data
• Key concepts
Data Lake Store: contains folders, which in turn contain data stored as files
Blob Storage: contains containers, which in turn hold data in the form of blobs
• Size limits
Data Lake Store: no limits on account size, file size, or number of files
Blob Storage: 500 TiB per storage account
• Geo-redundancy
Data Lake Store: locally redundant (multiple copies of data in one Azure region)
Blob Storage: locally redundant (LRS), globally redundant (GRS), or read-access globally redundant (RA-GRS)
13. Azure Data Lake Analytics
• Massive processing power
• Adjustable parallelism
• No servers, VMs, or clusters to maintain
• Pay per job
• Use existing .NET, R, and Python libraries
• New language: U-SQL
14. U-SQL: C# + SQL
• Combines the declarative logic of SQL with the procedural logic of C#
• Case sensitive
• “Schema on read”
Example U-SQL query:
@ExtraRuns =
    SELECT IPLYear,
           Bowler,
           SUM(string.IsNullOrWhiteSpace(ExtraRuns) ? 0 : Convert.ToInt32(ExtraRuns)) AS ExtraRuns,
           ExtraType
    FROM @MatchData
    GROUP BY IPLYear, Bowler, ExtraType;
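The U-SQL above treats blank or null ExtraRuns values as 0 before summing per (IPLYear, Bowler, ExtraType) group. The same null-safe grouped aggregation can be sketched in plain Python; the sample rows are made up for illustration and do not come from the talk's dataset.

```python
from collections import defaultdict

rows = [  # hypothetical @MatchData rows
    {"IPLYear": 2017, "Bowler": "A", "ExtraType": "wide",   "ExtraRuns": "1"},
    {"IPLYear": 2017, "Bowler": "A", "ExtraType": "wide",   "ExtraRuns": ""},
    {"IPLYear": 2017, "Bowler": "A", "ExtraType": "wide",   "ExtraRuns": "2"},
    {"IPLYear": 2017, "Bowler": "B", "ExtraType": "noball", "ExtraRuns": None},
]

extra_runs = defaultdict(int)
for r in rows:
    v = r["ExtraRuns"]
    # Blank/None counts as 0, mirroring string.IsNullOrWhiteSpace(...) ? 0 : ...
    extra_runs[(r["IPLYear"], r["Bowler"], r["ExtraType"])] += (
        int(v) if v and v.strip() else 0
    )
```

The ternary inside SUM is exactly the C#-expression-inside-SQL mix that U-SQL is built around: SQL does the grouping, C# does the per-value conversion.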
18. Pricing
• Pay-as-you-go: 1 TB of storage for one month = $39.94
• Monthly commitment packages: 1 TB of storage for one month = $35
• Usage-based charges (https://azure.microsoft.com/en-us/pricing/details/data-lake-store/):
Write operations (per 10,000): $0.05
Read operations (per 10,000): $0.004
Delete operations: free
Transaction size limit: no limit
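The figures above can be combined into a rough monthly bill estimate. The rates are copied from the slide; real Azure pricing varies by region and has changed since this talk, so treat this purely as arithmetic on the quoted numbers.

```python
def monthly_cost(tb_stored, writes, reads, pay_as_you_go=True):
    """Estimate a monthly Data Lake Store bill from the slide's quoted rates."""
    storage_rate = 39.94 if pay_as_you_go else 35.0   # $ per TB per month
    cost = tb_stored * storage_rate
    cost += (writes / 10_000) * 0.05     # write operations, per 10,000
    cost += (reads / 10_000) * 0.004     # read operations, per 10,000
    # Delete operations are free; transaction size is unlimited.
    return round(cost, 2)

# 1 TB stored, 1M writes, 10M reads in the month:
payg = monthly_cost(1, writes=1_000_000, reads=10_000_000)
committed = monthly_cost(1, writes=1_000_000, reads=10_000_000,
                         pay_as_you_go=False)
```

At this usage level the commitment package saves the $4.94/TB storage difference; transaction charges are identical under both plans.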
Editor's Notes
On prem
Lots of data
Limited space
Maintain the servers
Lot of processing power
Grow hardware on demand
Upgrade instantly
Availability and reliability: multiple copies of data; downtime for maintenance; hardware failures cause business issues
Increase or decrease hardware on demand
Ability to fail fast: if a project fails, no hardware investment is wasted
Ability to move to the latest technologies
Scalability: on-prem hardware takes time to scale
Apache Hadoop file system compatible with Hadoop Distributed File System (HDFS)
Works with application which support Web HDFS
3 copies in a single region
IOPS: input/output operations per second
Automatically optimized for any throughput
250 AU max
1 AU = 2 CPU cores, 6 GB RAM
Pay-as-you-go price: $2 for 1 AU for 1 hour
Monthly: 100 AUs, $100
Declarative logic
Procedure logic
SQL to query, C# to customize
Case sensitive
C# data type
C# comparison
Some commonly used SQL keywords, including WHILE, UPDATE, and MERGE are not supported in U-SQL
A cloud integration service
Workflows are called pipelines
Activities run inside pipelines
Integration runtime: self-hosted option available
Activities: copy data, run SSIS packages, execute stored procedures, execute U-SQL queries
Pricing: number of activity runs and data movement hours,
or SSIS runtime based on VM size and running time