2. Introduction to Azure Data Lake Analytics
Presenter: Waqas Idrees
Principal Software Engineer
https://www.linkedin.com/in/mdwaqas/
3. Agenda
1. What is Big Data?
2. Azure Data Lake History / Origin
3. Azure Data Lake Overview
o Azure Data Lake Store
o Azure Data Lake Analytics
4. Azure Data Factory
5. Azure Data Lake Analytics (U-SQL)
6. Q & A
4. There’s data, and then there’s Big data.
So, what’s the difference?
5. What is Big Data?
• Big Data = All Data
• Big data is the collection and analysis
of information from various sources.
6. What is Big Data?
• Big Data sets can include
o Structured
o Semi-structured
o Unstructured
7. What is Big Data? 3Vs
Big data is characterized by the three Vs:
1. An extreme volume of data.
2. A broad variety of types of data.
3. The velocity at which the data needs
to be processed and analyzed.
8. Who Uses Big Data?
Companies consider big data an integral part of their
strategy because
• It gives businesses the power to pinpoint the cause of their
problems.
• It reveals behaviors such as customers’ buying habits.
9. Who Uses Big Data?
• They can optimize offerings
• They can reduce cost and time
• It helps them make sound decisions
15. Features of Azure Data Lake
• The ability to store and analyze data of any kind and
size.
• Multiple access methods including U-SQL, Spark,
Hive, and Storm.
• Dynamic scaling to match your business priorities.
• Enterprise-grade security with Azure Active Directory.
17. Azure Data Lake Store
• Users can store structured, semi-structured,
or unstructured data.
18. Azure Data Lake Store
• A single Azure Data Lake Store account can
store trillions of files.
• A single file can be greater than a petabyte
in size.
24. Azure Data Lake Analytics
• On-demand job service
• Deploy on Azure and schedule using
Azure Data Factory
• Affordable and cost-effective (pay as
you go)
25. U-SQL
• Syntax familiar to millions of SQL and .NET
developers
• Unifies the declarative nature of SQL with the
imperative power of C#
• Unifies structured, semi-structured, and
unstructured data.
• Distributed query support over all data.
U-SQL
A new language for Big Data
26. U-SQL Language Overview
U-SQL Fundamentals
• All the familiar SQL Clauses
SELECT | FROM | WHERE | GROUP BY | OVER
• Operate on Structured and Unstructured Data
.NET Integration and Extensibility
• U-SQL Expressions are full C# expressions
• Reuse .NET code in other assemblies
• Use C# to define your own
Types | Functions | Aggregations | IO
27. ADLA Executions
U-SQL Cloud Execution
• The data read or written by the script will also be in Azure -
typically in an Azure Data Lake Store account
• You pay for any compute and storage used by the script.
28. ADLA Executions
U-SQL Local Execution
• The data read and written by the script will be on your own
machine.
• There is no additional cost.
29. System Requirements
• x64 CPU
• Minimum of 16 GB RAM
• Windows 10 is recommended
• Visual Studio 2015 or later
• Azure Data Lake Tools for Visual Studio
30. First U-SQL Script
• Create new Azure Data Lake > U-SQL Project.
• The project contains an empty U-SQL script called "Script.usql" and its code-behind file.
31. First U-SQL Script
@searchlog =
    EXTRACT UserId      int,
            Start       DateTime,
            Region      string,
            Query       string,
            Duration    int?,
            Urls        string,
            ClickedUrls string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

OUTPUT @searchlog
    TO "/output/SearchLog-first-u-sql.csv"
    USING Outputters.Csv();
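Callouts on the script:
• Row set: @searchlog
• Apply schema on read: the column list in EXTRACT
• File path: the FROM clause
• Write out: the OUTPUT statement
• Easy delimited text handling: Extractors.Tsv() and Outputters.Csv()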
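
For readers who do not have the Visual Studio tools installed, the flow of the script above (read a tab-separated file, apply a schema while reading, write comma-separated output) can be approximated in plain Python. This is only a rough sketch of the idea, not how U-SQL is implemented. The sample rows, the ISO date format, and the helper names `extract_tsv`, `output_csv`, and `nullable_int` are assumptions made for illustration, and the output formatting does not claim to match `Outputters.Csv()` exactly.

```python
import csv
import io
from datetime import datetime


def nullable_int(text):
    """Mimic the nullable column 'Duration int?': an empty field becomes None."""
    return int(text) if text != "" else None


# Column names and converters mirroring the EXTRACT column list in the slide.
SCHEMA = [
    ("UserId", int),
    ("Start", datetime.fromisoformat),  # assumption: sample dates are ISO formatted
    ("Region", str),
    ("Query", str),
    ("Duration", nullable_int),
    ("Urls", str),
    ("ClickedUrls", str),
]


def extract_tsv(stream):
    """Yield one dict per input line, applying SCHEMA as the data is read."""
    for fields in csv.reader(stream, delimiter="\t"):
        if len(fields) != len(SCHEMA):
            raise ValueError(f"expected {len(SCHEMA)} columns, got {len(fields)}")
        yield {name: convert(value)
               for (name, convert), value in zip(SCHEMA, fields)}


def output_csv(rows, stream):
    """Write the row set as comma-separated text; None becomes an empty field."""
    writer = csv.writer(stream, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if row[name] is None else row[name]
                         for name, _ in SCHEMA])


# Invented sample data standing in for /Samples/Data/SearchLog.tsv.
SAMPLE_TSV = (
    "399266\t2012-02-15T11:53:16\ten-us\thow to make nachos\t73\turl1;url2\turl1\n"
    "382045\t2012-02-15T11:53:18\ten-gb\tbest ski resorts\t\turl3\t\n"
)

rows = list(extract_tsv(io.StringIO(SAMPLE_TSV)))
out = io.StringIO()
output_csv(rows, out)
print(out.getvalue())
```

The point of the sketch is that the file itself carries no schema: the types are supplied by the reader at read time, which is what the "apply schema on read" callout refers to.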
36. When does a job get Queued?
Local Cause
• Queue is already at max concurrency
Cloud Cause
• Shortage of Azure Data Lake Analytics Units
(ADLAUs)
• Queue is already at max concurrency
38. ADLA Cloud Account Configurations
• Maximum number of ADLA accounts per subscription per region: 5
• Maximum number of concurrent U-SQL jobs per account: 20
• Maximum number of Analytics Units (AUs) per account: 32
• Maximum number of Analytics Units (AUs) per job: 32
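
The queueing causes on slide 36 and the account limits on this slide can be combined into a small admission check. The sketch below is a simplified model and not the actual ADLA scheduler: the class `AccountState`, the function `queue_reason`, and the rule that a job waits until all of its requested AUs are free are assumptions made for illustration. The default limits are the ones listed on this slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccountState:
    """Hypothetical snapshot of an ADLA account at submission time."""
    running_jobs: int
    aus_in_use: int
    max_concurrent_jobs: int = 20  # limit taken from the slide
    max_aus: int = 32              # limit taken from the slide


def queue_reason(state: AccountState, requested_aus: int) -> Optional[str]:
    """Return None if the job could start now, otherwise the reason it queues."""
    if requested_aus < 1:
        raise ValueError("a job needs at least one AU")
    if state.running_jobs >= state.max_concurrent_jobs:
        return "queue is already at max concurrency"
    if state.aus_in_use + requested_aus > state.max_aus:
        return "shortage of Analytics Units"
    return None


# Example: 5 jobs are running and use 30 of the 32 AUs.
busy = AccountState(running_jobs=5, aus_in_use=30)
print(queue_reason(busy, requested_aus=2))  # fits exactly, so it can start
print(queue_reason(busy, requested_aus=4))  # not enough free AUs
```

In this model a job that requests fewer AUs is more likely to start immediately, which is one practical reason to avoid reserving more AUs than the job can use.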
39. What is an Azure Data Lake Analytics Unit?
An Azure Data Lake Analytics Unit (AU) is a unit of compute resources in
Azure Data Lake Analytics.
One AU is the equivalent of 2 CPU cores and 6 GB of RAM.
40. How are AUs used during U-SQL Query Execution?
When we submit a U-SQL job, we specify three things:
1. U-SQL Script
2. Input and Output Files
3. Reserved AUs
41. How are AUs used during U-SQL Query Execution?
• The U-SQL compiler and optimizer turn the script into a plan.
• Each task in the plan is called a vertex.
42. How are AUs used during U-SQL Query Execution?
• We need an AU to run a vertex.
• When the vertex is finished, the AU is assigned to another
vertex.
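
The two bullets above describe a simple reuse rule: one AU runs one vertex at a time, and a freed AU picks up the next waiting vertex. A toy simulation of that rule is sketched below. It is an illustration under strong assumptions: the function name `simulate_job` and the vertex durations are invented, and the model ignores dependencies between vertices, which a real plan has.

```python
import heapq


def simulate_job(vertex_durations, reserved_aus):
    """Return the job's finish time when each vertex needs one AU.

    vertex_durations: seconds each vertex runs (hypothetical values).
    reserved_aus: number of AUs reserved for the job.
    """
    if reserved_aus < 1:
        raise ValueError("at least one AU is required")
    # Min-heap holding the time at which each AU next becomes free.
    free_at = [0] * reserved_aus
    finish = 0
    for duration in vertex_durations:
        start = heapq.heappop(free_at)   # earliest available AU
        end = start + duration
        heapq.heappush(free_at, end)     # the AU is reused once the vertex ends
        finish = max(finish, end)
    return finish


vertices = [30, 30, 30, 30]  # four vertices of 30 seconds each
for aus in (1, 2, 4, 8):
    seconds = simulate_job(vertices, aus)
    print(f"{aus} AU(s): job finishes after {seconds} s")
```

With these numbers the job takes 120 s on 1 AU, 60 s on 2 AUs, and 30 s on 4 AUs. Reserving 8 AUs still gives 30 s, because there are only four vertices to run: extra AUs beyond the available parallelism sit idle.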
43. How are AUs used during U-SQL Query Execution?
45. What is an AU Second?
An AU Second is the unit used to measure the compute
resources used for a job.
46. What is an AU Second?
• 1 AU for a job that executes for 1 second = 1 AU Second.
• 1 AU for a job that executes for 1 minute (60 seconds) = 60 AU Seconds.
• 2 AUs for a job that executes for 100 seconds = 200 AU Seconds.
• 10 AUs for a job that executes for 5 minutes (300 seconds) = 3000 AU
Seconds.
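
The four examples above all follow the same rule: AU Seconds = number of AUs × job duration in seconds. A minimal sketch of that arithmetic is shown below. The function names `au_seconds` and `au_hours` are chosen here for illustration only.

```python
def au_seconds(aus, duration_seconds):
    """AU Seconds consumed by a job: AUs multiplied by run time in seconds."""
    if aus < 0 or duration_seconds < 0:
        raise ValueError("AUs and duration must be non-negative")
    return aus * duration_seconds


def au_hours(aus, duration_seconds):
    """The same quantity expressed in AU hours (3600 AU Seconds = 1 AU hour)."""
    return au_seconds(aus, duration_seconds) / 3600


# The four cases from the slide: (AUs, seconds).
for aus, seconds in [(1, 1), (1, 60), (2, 100), (10, 300)]:
    print(f"{aus} AU(s) x {seconds} s = {au_seconds(aus, seconds)} AU Seconds")
```

Converting to AU hours is useful because the monthly commitment packages are sold in AU hours: 3600 AU Seconds make one AU hour.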
48. Pricing Details
INCLUDED ANALYTICS UNIT HOURS   PRICE/MONTH   SAVINGS OVER PAY-AS-YOU-GO
100                             $100          50%
500                             $450          55%
1,000                           $800          60%
5,000                           $3,600        64%
10,000                          $6,500        68%
50,000                          $29,000       71%
100,000                         $52,000       74%
> 100,000                       Contact Us
Monthly commitment packages
Monthly commitment packages provide you with a significant discount (up to 74%) compared to Pay-as-You-Go pricing.
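
The savings column can be cross-checked from the other two columns. The sketch below computes the effective price per AU hour for each package and the savings against a Pay-as-You-Go rate. The rate of $2.00 per AU hour is an assumption: it is not stated on the slide, but it is what the first row implies (100 hours for $100 at 50% savings). The names `PACKAGES`, `PAYG_RATE`, `effective_rate`, and `savings` are illustrative.

```python
# (included AU hours, price per month in USD), copied from the table above.
PACKAGES = [
    (100, 100),
    (500, 450),
    (1_000, 800),
    (5_000, 3_600),
    (10_000, 6_500),
    (50_000, 29_000),
    (100_000, 52_000),
]

# Assumption: Pay-as-You-Go rate implied by the first row of the table.
PAYG_RATE = 2.00


def effective_rate(hours, price):
    """Price actually paid per included AU hour."""
    return price / hours


def savings(hours, price, payg_rate=PAYG_RATE):
    """Fraction saved compared with paying payg_rate for every AU hour."""
    return 1 - effective_rate(hours, price) / payg_rate


for hours, price in PACKAGES:
    rate = effective_rate(hours, price)
    print(f"{hours:>7,} hours: ${rate:.2f} per AU hour, "
          f"{savings(hours, price):.1%} savings")
```

Under this assumed rate, the computed savings agree with the table for every package. The 10,000-hour package works out to about 67.5%, which the table shows rounded as 68%.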
49. What can I do with Azure Data Lake Analytics?
• Prepping large amounts of data for insertion into a Data Warehouse
• Processing scraped web data for science and analysis
• Using image processing intelligence to quickly process unstructured
image data
• Replacing long-running monthly batch processing with shorter running
distributed processes
50. What makes it different?
• Only one language to learn
• Only offered as a platform service
• Pricing per job; not per hour
- Multiple definitions of Big Data are available on the internet.
- In general, Big Data refers to data sets that are so large in volume and so complex that current data processing products are
not capable of capturing, managing, or processing them within a reasonable amount of time.
- Big Data is all data which can be mined for insights.
- Big Data is the collection and analysis of data from various sources such as websites, social media, mobile apps, sensors,
the Internet of Things, or scientific experiments.
Companies treat big data as an integral part of their strategy because it can reduce cost and time,
help develop new products, optimize offerings, and support sound decisions.
It gives businesses the power to pinpoint the cause of their problems and to understand behaviors
such as customers’ buying habits and risk portfolios. I'll present more advanced topics on this.
1- Azure Data Lake was built on the learnings and technologies of Cosmos.
2- Cosmos is Microsoft's internal big data analysis platform.
2.1 There's not a lot of public information available about Cosmos.
3- Cosmos is used extensively within Microsoft, across a huge number of servers.
4- It is used to store and process data for applications such as Azure, AdCenter, Bing, MSN, Skype and Windows Live.
5- They collect information on our every click and visual search, and analyze that data to improve their services and
ad experiences.
YARN allows different data processing engines, such as graph processing, interactive processing, stream processing, and batch processing, to run and process data stored in HDFS.
Hadoop comes in many different flavors, some running on-premises, others running in the cloud. Some are managed BY you, others are managed FOR you.
Most Big Data cloud offerings that are available are priced per hour based on how long you keep your cluster up and running. ADLA takes a different approach to pricing.