Big Data Analytics in the Cloud using Microsoft Azure services was discussed. Key points included:
1) Azure provides tools for collecting, processing, analyzing and visualizing big data including Azure Data Lake, HDInsight, Data Factory, Machine Learning, and Power BI. These services can be used to build solutions for common big data use cases and architectures.
2) U-SQL is a language for preparing, transforming and analyzing data that allows users to focus on the what rather than the how of problems. It uses SQL and C# and can operate on structured and unstructured data.
3) Visual Studio provides an integrated environment for authoring, debugging, and monitoring U-SQL scripts and jobs, allowing developers to build and tune big data applications with familiar tooling.
2. Big Data Analytics in the Cloud
Microsoft Azure
Cortana Intelligence Suite
Mark Kromer
Microsoft Azure Cloud Data Architect
@kromerbigdata
@mssqldude
3. What is Big Data Analytics?
Tech Target: “… the process of examining large data sets to uncover hidden patterns, unknown correlations, market
trends, customer preferences and other useful business information.”
Techopedia: “… the strategy of analyzing large volumes of data, or big data. This big data is gathered from a wide
variety of sources, including social networks, videos, digital images, sensors, and sales transaction records. The
aim in analyzing all this data is to uncover patterns and connections that might otherwise be invisible, and that
might provide valuable insights about the users who created it. Through this insight, businesses may be able to
gain an edge over their rivals and make superior business decisions.”
• Requires lots of data wrangling and Data Engineers
• Requires Data Scientists to uncover patterns from complex raw data
• Requires Business Analysts to provide business value from multiple data sources
• Requires additional tools and infrastructure not provided by traditional database and BI technologies
Why Cloud for Big Data Analytics?
• Quick and easy to stand-up new, large, big data architectures
• Elastic scale
• Metered pricing
• Quickly evolve architectures to rapidly changing landscapes
• Prototype, tear down
4. Big Data Analytics Tools & Use Cases vs. “Traditional BI”
Traditional BI
• Sales reports
• Post-campaign marketing research & analysis
• CRM reports
• Enterprise data assets
• Can’t miss any transactions, records or rows
• DWs
• Relational Databases
• Well-defined and formatted data sources
• Direct connections to OLTP and LOB data sources
• Excel
• Well-defined business semantic models
• OLAP cubes
• MDM, Data Quality, Data Governance
Big Data Analytics
• Sentiment Analysis
• Predictive Maintenance
• Churn Analytics
• Customer Analytics
• Real-time marketing
• Avoid simply siphoning off data for BI tools
• Architect multiple paths for data pipelines: speed, batch, analytical
• Plan for data of varying types, volumes and formats
• Data can/will land at any time, any speed, any format
• It’s OK to miss a few records and data points
• NoSQL
• MPP DWs
• Hadoop, Spark, Storm
• R & ML to find patterns in masses of data lakes
5. • Key Values / JSON / CSV
• Compress files
• Columnar
• Land raw data fast
• Data Wrangle/Munge/Engineer
• Find patterns
• Prepare for business models
• Present to business decision makers
A few basic fundamentals
Big Data Analytics in the Cloud
• Collect and land data in the lake
• Process data pipelines (stream, batch, analysis)
• Presentation layer: surface knowledge to business decision makers
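The collect → process → present flow above can be sketched in plain Python (a conceptual illustration only, with made-up sample data, not an Azure API):

```python
# Illustrative sketch of the three stages: land raw data untouched,
# process it in batch, then surface a summary for decision makers.

def collect(raw_events):
    """Land raw data in the 'lake' as-is -- no schema applied yet."""
    return list(raw_events)

def process(lake):
    """Batch path: parse the raw records and aggregate per user."""
    totals = {}
    for line in lake:
        user, amount = line.split(",")
        totals[user] = totals.get(user, 0.0) + float(amount)
    return totals

def present(totals):
    """Presentation layer: a ranked summary for business users."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

raw = ["alice,10.0", "bob,5.5", "alice,2.5"]
report = present(process(collect(raw)))
```

The point is that schema and aggregation are applied downstream of landing, so the lake can accept data of any shape at any speed.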
9. Azure Data Factory
What it is:
When to use it:
A pipeline system to move data in, perform activities on data,
move data around, and move data out
• Create solutions using multiple tools as a single process
• Orchestrate processes - Scheduling
• Monitor and manage pipelines
• Call and re-train Azure ML models
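The orchestration idea in the bullets above can be sketched as a toy dependency-ordered pipeline runner (plain Python, not the Azure Data Factory SDK; all activity names are hypothetical):

```python
# Conceptual sketch: a pipeline of activities run in dependency order,
# with per-activity status for monitoring -- the core ADF idea.

class Activity:
    def __init__(self, name, func, depends_on=()):
        self.name, self.func, self.depends_on = name, func, list(depends_on)

def run_pipeline(activities):
    """Run each activity once all of its dependencies have completed."""
    status, done, pending = {}, set(), list(activities)
    while pending:
        for act in list(pending):
            if all(d in done for d in act.depends_on):
                act.func()                      # the activity's work
                status[act.name] = "Succeeded"  # record for monitoring
                done.add(act.name)
                pending.remove(act)
    return status

log = []
pipeline = [
    Activity("ingest", lambda: log.append("copied raw files")),
    Activity("transform", lambda: log.append("ran Hive step"), ["ingest"]),
    Activity("publish", lambda: log.append("loaded warehouse"), ["transform"]),
]
statuses = run_pipeline(pipeline)
```

A real factory adds scheduling windows and retries on top of this basic dependency graph.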
12. Example – Customer Churn
Azure Data Factory pipeline: Ingest → Transform & Analyze → Publish
• Data Sources: Call Log Files, Customer Table
• Ingest: Call Log Files, Customer Table
• Transform & Analyze: Customer Call Details
• Publish: Customer Churn Table (Customers Likely to Churn)
13. Simple ADF
• Business Goal: Transform and analyze Web Logs each month
• Design Process: Transform raw weblogs using a Hive query, storing the results in Blob Storage
Web logs loaded to Blob → HDInsight HIVE query to transform log entries → Files ready for analysis and use in AzureML
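The transform step above can be simulated in plain Python (the deck does not specify a log format, so the space-delimited format here is hypothetical):

```python
# Sketch of the Hive transform step: raw web-log lines in,
# structured rows out, written as CSV like a result landed in Blob Storage.

import csv, io

RAW_LOGS = """\
2016-03-01 /index.html 200
2016-03-01 /cart 500
2016-03-02 /index.html 200
"""

def transform(raw):
    """Parse each log entry and keep only successful (200) requests."""
    rows = []
    for line in raw.strip().splitlines():
        date, url, status = line.split()
        if status == "200":
            rows.append((date, url))
    return rows

def to_blob_csv(rows):
    """Serialize the transformed rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

csv_text = to_blob_csv(transform(RAW_LOGS))
```

In the real pipeline this logic would live in a HiveQL script run on HDInsight, with ADF handling the monthly schedule.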
14. Azure SQL Data Warehouse
What it is:
When to use it:
A Scaling Data Warehouse Service in the Cloud
• When you need a large-data BI solution in the cloud
• MPP SQL Server in the Cloud
• Elastic scale data warehousing
• When you need pause-able scale-out compute
15. Elastic scale & performance
• Real-time elasticity: resize in <1 minute
• On-demand compute: expand or reduce as needed
• Pause the Data Warehouse to save on compute costs, e.g. pause during non-business hours
16. Elastic scale & performance
• Storage can be as big or small as required
• Users can execute niche workloads without re-scanning data
18. The control node distributes a query to all compute nodes, where each node executes the same statement in parallel against its slice of the data:
SELECT COUNT_BIG(*)
FROM dbo.[FactInternetSales];
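The scatter/gather pattern behind this slide can be sketched in a few lines of Python (a conceptual model, not SQL DW internals; the partition contents are made up):

```python
# MPP scatter/gather: the control node sends the same COUNT to every
# compute node, then sums the partial results.

def node_count(partition):
    """Each compute node counts only the rows it holds locally."""
    return len(partition)

def control_count(partitions):
    """The control node gathers and adds the per-node counts."""
    return sum(node_count(p) for p in partitions)

# FactInternetSales rows distributed across six compute nodes
partitions = [[1, 2], [3], [4, 5, 6], [], [7], [8, 9]]
total = control_count(partitions)   # equivalent to one COUNT_BIG(*)
```

Scaling out means adding partitions (nodes), not rewriting the query.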
19. Azure Data Lake
What it is:
When to use it:
Data storage (Web-HDFS) and Distributed Data Processing (HIVE, Spark,
HBase, Storm, U-SQL) Engines
• Low-cost, high-throughput data store
• Non-relational data
• Larger storage limits than Blobs
20. Ingest all data regardless of requirements → Store all data in native format without schema definition → Do analysis using analytic engines like Hadoop and ADLA
Workloads served: interactive queries, batch queries, machine learning, data warehouse, real-time analytics, devices
22. Introducing ADLS
Azure Data Lake Store: a hyperscale repository for big data analytics workloads
• No limits to SCALE
• Store ANY DATA in its native format
• HADOOP FILE SYSTEM (HDFS) for the cloud
• Optimized for analytic workload PERFORMANCE
• ENTERPRISE GRADE authentication, access control, audit, encryption at rest
23. Enterprise-grade • Limitless scale • Productivity from day one • Easy and powerful data preparation • All data
24. Developing big data apps
• Author, debug, & optimize big data apps in Visual Studio
• Multiple languages: U-SQL, Hive, & Pig
• Seamlessly integrate .NET
25. Work across all cloud data
Azure Data Lake Analytics works across Azure SQL DW, Azure SQL DB, Azure Storage Blobs, Azure Data Lake Store, and SQL DB in an Azure VM.
26. What is U-SQL?
A hyper-scalable, highly extensible language for preparing, transforming and analyzing all data
Allows users to focus on the what—not the how—of business problems
Built on familiar languages (SQL and C#) and supported by a fully integrated development environment
Built for data developers & scientists
27. U-SQL language philosophy
Declarative query and transformation language:
• Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL Analytics functions
• Optimizable, scalable
Operates on unstructured & structured data:
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from the ground up:
• Type system is based on C#
• Expression language is C#
• User-defined functions (U-SQL and C#)
• User-defined types (U-SQL/C#) (future)
• User-defined aggregators (C#)
• User-defined operators (UDO) (C#)
U-SQL provides the parallelization and scale-out framework for user code:
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINERS
Expression-flow programming style:
• Easy-to-use functional lambda composition
• Composable, globally optimizable
Federated query across distributed data sources (soon)
REFERENCE MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime
              , last_order DateTime, order_count int
              , order_amount float );
@o = EXTRACT oid int, cid int, odate DateTime, amount float
     FROM "/input/orders.txt"
     USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string
     FROM "/input/customers.txt"
     USING Extractors.Csv();
@j = SELECT c.cid, MIN(o.odate) AS firstorder
          , MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
          , SUM(o.amount) AS totalamount
     FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
     WHERE c.city.StartsWith("New")
        && MyNamespace.MyFunction(o.odate) > 10
     GROUP BY c.cid;
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;
28. Expression-flow programming style
• Automatic "in-lining" of U-SQL expressions – the whole script leads to a single execution model
• Execution plan that is optimized out of the box and without user intervention
• Per-job and user-driven parallelization
• Detailed visibility into execution steps, for debugging
• Heat-map functionality to identify performance bottlenecks
29. “Unstructured” Files
• Schema on Read
• Write to File
• Built-in and custom Extractors and
Outputters
• ADL Storage and Azure Blob
Storage
EXTRACT Expression
@s = EXTRACT a string, b int
     FROM "filepath/file.csv"
     USING Extractors.Csv();
• Built-in Extractors: Csv, Tsv, Text with lots of options
• Custom Extractors: e.g., JSON, XML, etc.
OUTPUT Expression
OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();
• Built-in Outputters: Csv, Tsv, Text
• Custom Outputters: e.g., JSON, XML, etc. (see http://usql.io)
Filepath URIs
• Relative URI to default ADL Storage account: "filepath/file.csv"
• Absolute URIs:
• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
• WASB: "wasb://container@account/filepath/file.csv"
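The schema-on-read idea behind the EXTRACT/OUTPUT pair above can be mirrored in plain Python (a local sketch, not U-SQL or any Azure API): the file stays raw, and the (a string, b int) schema is applied only at read time.

```python
# Minimal schema-on-read: apply a (name, type) schema to raw CSV text
# as it is read, then write rows back out -- analogous to a built-in
# Extractor and Outputter pair.

import csv, io

def extract_csv(text, schema):
    """Apply (name, type) pairs to each CSV record at read time."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        rows.append({name: typ(val) for (name, typ), val in zip(schema, rec)})
    return rows

def output_csv(rows, fieldnames):
    """Serialize rows back to CSV text."""
    buf = io.StringIO()
    csv.DictWriter(buf, fieldnames=fieldnames).writerows(rows)
    return buf.getvalue()

raw = "x,1\ny,2\n"
rows = extract_csv(raw, [("a", str), ("b", int)])
out = output_csv(rows, ["a", "b"])
```

Because the schema lives in the query rather than the file, the same raw file can be read with different schemas by different jobs.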
32. What can you do with Visual Studio?
• Author U-SQL scripts (with C# code)
• Create metadata objects
• Browse the metadata catalog
• Submit and cancel U-SQL jobs
• Debug U-SQL and C# code
• Visualize and replay the progress of a job
• Fine-tune query performance
• Visualize the physical plan of a U-SQL query
34. Authoring U-SQL queries
Visual Studio fully supports authoring U-SQL scripts. While editing, it provides:
• IntelliSense
• Syntax color coding
• Syntax checking
• Contextual menus
• …
35. Job execution graph
After a job is submitted, the progress of its execution through the different stages is shown and updated continuously. Important stats about the job are also displayed and updated continuously.
37. HDInsight: Cloud Managed Hadoop
What it is:
When to use it:
Microsoft’s implementation of Apache Hadoop (as a service) that uses Blobs for persistent storage
• When you need to process large scale data (PB+)
• When you want to use Hadoop or Spark as a service
• When you want to compute data and retire the servers, but
retain the results
• When your team is familiar with the Hadoop Zoo
44. Deploying HDInsight Clusters
• Cluster Type: Hadoop, Spark, HBase and Storm.
• Hadoop clusters: for query and analysis workloads
• HBase clusters: for NoSQL workloads
• Spark clusters: for in-memory processing, interactive queries, stream, and machine learning workloads
• Operating System: Windows or Linux
• Can be deployed from the Azure portal, Azure Command Line Interface (CLI), Azure PowerShell, or Visual Studio
• A UI dashboard is provided to the cluster through Ambari.
• Remote Access through SSH, REST API, ODBC, JDBC.
• Remote Desktop (RDP) access for Windows clusters
45. Azure Machine Learning
What it is:
When to use it:
A multi-platform environment and engine to create and deploy Machine Learning models and APIs
• When you need to create predictive analytics
• When you need to share Data Science experiments across
teams
• When you need to create callable APIs for ML functions
• When you also have R and Python experience on your Data
Science team
47. Basic Azure ML Elements
Import Data
Preprocess
Algorithm
Train Model
Split Data
Score Model
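The six elements above can be sketched end to end in plain Python (no Azure ML APIs; the data and the trivial threshold "algorithm" are made up purely to show the pipeline shape):

```python
# Import -> Preprocess -> Split -> Train -> Score, as ordinary functions.

def import_data():
    # (value, label) pairs -- fabricated data for illustration
    return [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (1.5, 0), (8.5, 1)]

def preprocess(data):
    return [(round(v, 1), y) for v, y in data]   # stand-in for cleaning

def split(data, ratio=0.67):
    n = int(len(data) * ratio)
    return data[:n], data[n:]

def train(train_set):
    """'Algorithm': threshold at the midpoint between class means."""
    zeros = [v for v, y in train_set if y == 0]
    ones = [v for v, y in train_set if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def score(model, test_set):
    """Score Model: accuracy of the threshold on held-out data."""
    preds = [1 if v > model else 0 for v, _ in test_set]
    correct = sum(p == y for p, (_, y) in zip(preds, test_set))
    return correct / len(test_set)

train_set, test_set = split(preprocess(import_data()))
accuracy = score(train(train_set), test_set)
```

In Azure ML Studio each of these functions corresponds to a draggable module, and the trained model can be published as a web service API.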
50. Power BI
What it is:
When to use it:
Interactive Report and Visualization creation for computing
and mobile platforms
• When you need to create and view interactive reports that
combine multiple datasets
• When you need to embed reporting into an application
• When you need customizable visualizations
• When you need to create shared datasets, reports, and
dashboards that you publish to your team
54. Bulk Ingestion and Preparation
Business apps, custom apps, and sensors and devices bulk-load data into the platform through Azure Data Factory.
55. Big Data Lambda Architecture
• Data Collection: event & data producers (applications, web and social, devices, sensors) connecting through cloud gateways (web APIs) and field gateways
• Queuing System: Event Hubs, Kafka/RabbitMQ/ActiveMQ
• Data Transformation: Storm / Stream Analytics, Hive / U-SQL, Pig, Azure ML, Data Factory
• Data Storage: Blob Storage, DocumentDB, MongoDB, SQL Azure, ADW, HBase
• Presentation and action: Azure Search, data analytics (Excel, Power BI, Looker, Tableau), web/thick client dashboards, live dashboards, devices to take action
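The core lambda idea in this architecture can be illustrated with a toy in-memory model (plain Python, not any Azure SDK; event shapes are invented): every event lands in the batch store, the speed layer keeps a small real-time view, and the serving layer can reconcile the two.

```python
# Toy lambda architecture: fan each event out to a batch path (complete
# history, recomputed periodically) and a speed path (live aggregates).

batch_store = []   # e.g. Blob Storage / data lake role
speed_view = {}    # e.g. Stream Analytics output role

def ingest(event):
    """Fan each event out to both the batch and speed paths."""
    batch_store.append(event)
    key = event["device"]
    speed_view[key] = speed_view.get(key, 0) + event["reading"]

def batch_totals():
    """Recompute exact totals from the full history (slow, periodic)."""
    totals = {}
    for e in batch_store:
        totals[e["device"]] = totals.get(e["device"], 0) + e["reading"]
    return totals

for e in [{"device": "d1", "reading": 3}, {"device": "d2", "reading": 5},
          {"device": "d1", "reading": 4}]:
    ingest(e)
```

Here the speed view happens to match the batch recomputation exactly; in practice the speed layer trades some accuracy for latency, and the batch layer corrects it.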
What you can do with it: https://azure.microsoft.com/en-us/overview/what-is-azure/
Platform: http://microsoftazure.com
Storage: https://azure.microsoft.com/en-us/documentation/services/storage/
Networking: https://azure.microsoft.com/en-us/documentation/services/virtual-network/
Security: https://azure.microsoft.com/en-us/documentation/services/active-directory/
Services: https://azure.microsoft.com/en-us/documentation/articles/best-practices-scalability-checklist/
Virtual Machines: https://azure.microsoft.com/en-us/documentation/services/virtual-machines/windows/ and https://azure.microsoft.com/en-us/documentation/services/virtual-machines/linux/
PaaS: https://azure.microsoft.com/en-us/documentation/services/app-service/
Azure Data Factory: http://azure.microsoft.com/en-us/services/data-factory/
Video of this process: https://azure.microsoft.com/en-us/documentation/videos/azure-data-factory-102-analyzing-complex-churn-models-with-azure-data-factory/
More options: Prepare System: https://azure.microsoft.com/en-us/documentation/articles/data-factory-build-your-first-pipeline-using-editor/ - Follow steps
Another Lab: https://azure.microsoft.com/en-us/documentation/articles/data-factory-samples/
Azure SQL Data Warehouse: http://azure.microsoft.com/en-us/services/sql-data-warehouse/
Azure Data Lake: http://azure.microsoft.com/en-us/campaigns/data-lake/
All data
Unstructured, semi-structured, and structured
Domain-specific user defined types using C#
Queries over Data Lake and Azure Blobs
Federated Queries over Operational and DW SQL stores removing the complexity of ETL
Productive from day one
Effortless scale and performance without need to manually tune/configure
Best developer experience throughout development lifecycle for both novices and experts
Leverage your existing skills with SQL and .NET
Easy and powerful data preparation
Easy to use built-in connectors for common data formats
Simple and rich extensibility model for adding customer-specific data transformations – both existing and new
No limits scale
Scales on demand with no change to code
Automatically parallelizes SQL and custom code
Designed to process petabytes of data
Enterprise grade
Managing, securing, sharing, and discovery of familiar data and code objects (tables, functions etc.)
Role based authorization of Catalogs and storage accounts using AAD security
Auditing of catalog objects (databases, tables, etc.)
ADLA allows you to compute on data anywhere and join data from multiple cloud sources.
Primary site: https://azure.microsoft.com/en-us/services/hdinsight/
Quick overview: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/
4-week online course through the edX platform: https://www.edx.org/course/processing-big-data-azure-hdinsight-microsoft-dat202-1x
11 minute introductory video: https://channel9.msdn.com/Series/Getting-started-with-Windows-Azure-HDInsight-Service/Introduction-To-Windows-Azure-HDInsight-Service
Microsoft Virtual Academy Training (4 hours) - https://mva.microsoft.com/en-US/training-courses/big-data-analytics-with-hdinsight-hadoop-on-azure-10551?l=UJ7MAv97_5804984382
Learning path for HDInsight: https://azure.microsoft.com/en-us/documentation/learning-paths/hdinsight-self-guided-hadoop-training/
Azure Feature Pack for SQL Server 2016, i.e., SSIS (SQL Server Integration Services): https://msdn.microsoft.com/en-us/library/mt146770(v=sql.130).aspx
Azure Portal: https://portal.azure.com
Provisioning Clusters: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-provision-clusters/
Different clusters have different node types, number of nodes, and node sizes.