Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Big Data Analytics from Azure Cloud to Power BI Mobile
1. Big Data Analytics from Azure
Data Platform to Power BI
Azure Batch, Azure Data Lake, Azure HDInsight, ML, Power BI
March 22, 2017
Roy Kim
@RoyKimYYZ
roykimtoronto@gmail.com
2. Agenda
Overview of Big Data + Azure + Data Insights
Job Postings demo solution architecture &
implementation
Mobile Demo with Power BI
Q&A
Author: Roy Kim
By: Roy Kim
3. Bio
Roy Kim
14+ Years of Microsoft Technology Solutions
.NET, SharePoint, BI, Office 365, Azure Solutions
IT Consultant
University of Toronto – Computer Science Degree
Author: Roy Kim
By: Roy Kim
4. Data to Insight
Author: Roy Kim
Big Data
Data Platform
Technologies
Solution Data Insights
By: Roy Kim
5. Job Postings Demo Solution
Author: Roy Kim References:
https://softwarestrategiesblog.com/2015/09/05/10-ways-big-data-is-revolutionizing-supply-chain-management
Job Postings Azure Data Platform
Data Lake, HDInsight,
SQL, Power BI
Job trends, analysis
By: Roy Kim
11. Job Postings Data Set
Volume
• Many national
job sites
• New job
postings daily
• Metadata and
full text.
Velocity
• New job
postings
created every
minute
Variety
• Semi-
structured
• Job Title
• Location
• Company
• Unstructured
• Job
Description
Veracity
• Incomplete/Im
precise
• Salary, Per
hour
• FT, PT, Temp,
Contract,
Seasonal
• Main
profession
By: Roy Kim
12. Power BI – Job Postings Demo Reports
By: Roy Kim
13. Power BI – Job Postings Demo Reports
By: Roy Kim
14. Job Postings Big Data Solution Architecture
Azure Data Lake
Analytics
Internet Data
Sets
USQL
Storage Account
Blob Store
Azure Batch
.NET Console
App
Blob Store
(WebHDFS)
Azure HDInsight
Hive
Azure
Active Directory
HDInsight
Azure SQL
database
SQL Data
Warehouse
Storage blob
Storage (Azure)
Visual Studio
Online
Data Lake
Azure SQL / Data
Warehouse
SQL DB
Machine
Learning
ML Studio
StorageTierServicesTier
REST/HTML/..
Visualization
/Reporting
Tools
Presentation
Tier
Mobile
Pig
Scoop
By: Roy Kim
Desktop
Batch
Business
Users
Report Builders
Data Analysts
Azure Data
Factory
Pipeline
Data Factory
Browser
Service
Applicatio
n Insights
Microsoft
Azure
Data
Analysis Services
Tabular
Machine
Learning
Storage Account
Blob Store
(HDFS)
Azure Data Lake
Store
Query
15. Job Postings from Internet Job Boards
Web sites that offer APIs
Use any server-side programming language to retrieve data such as NET, Java,
Node.js, etc.
If no APIs, consider HTML web page scraping
REST API
http end points typically return JSON or XML data formats
Html Web Page Scraping
HTML Agility Pack to assist in parsing the Document Object Model for data points.
https://www.nuget.org/packages/HtmlAgilityPack
HTML parsing supporting XPath to traverse the Document Object Model
(DOM)
E.g. doc.DocumentElement.SelectSingleNode(“//div*@id=‘Total Sales’+”)By: Roy Kim
16. Job Postings Data Collector .NET Console Application
.NET console application to read data from the internet and store into Azure Storage
accounts
Concurrent requests to job postings public API and HTML pages
Multi-threaded to increase speed and throughput
Parse HTML pages and JSON
Store JSON files directly into Azure Data Lake Store with ADLS .NET SDK
Leverages Azure Application Insights for logging trace and exception error
messages.
To store files into Azure Data Lake Store, the .NET application needs to access with
an Azure AD service principal with the appropriate access control.
By: Roy Kim
18. Azure Application Insights
By: Roy Kim
Application Insights Core API. This package provides core functionality for transmission of all
Application Insights Telemetry Types and is a dependent package for all other Application Insights
packages.
19. Azure Batch
A managed Azure service executing
command line applications.
For batch processing or batch
computing--running a large volume of
similar tasks to get some desired result.
Commonly used by organizations that
regularly process, transform, and
analyze large volumes of data.
Simply, a set of Azure Virtual Machines
running a console application to process
data that can be on a recurring schedule
and in parallel
References:
https://github.com/Microsoft/azure-docs/blob/master/articles/batch/batch-technical-overview.md
Author: Roy Kim
20. Azure Batch – Demo Implementation
Azure Batch runs the Console Application on a daily schedule against one node (Virtual
Machine - 2 cores)
To run console application in parallel through compute nodes, used the sample Parallel
Tasks .NET solution which uses the Azure Batch Client SDK.
https://github.com/Azure/azure-batch-
samples/tree/master/CSharp/ArticleProjects/ParallelTasks
Azure batch is an architecture option to support data collection in terms of velocity and
volume.
By: Roy Kim
21. Azure Data Lake
Intended for data storage in its raw format for future analysis, processing or
data modelling.
For developers, data scientists, and analysts to store data of any size, shape,
and speed.
To do all types of processing and analytics across different platforms and
languages.
Extract and load, minimal transformations
To manage data in characteristic of variety, velocity and volume
Two Components
1. Azure Data Lake Store
2. Azure Data Lake Analytics
References:
https://azure.microsoft.com/en-us/solutions/data-lake/
By: Roy Kim
22. Azure Data Lake Store
Azure Data Lake Store is a hyper-scale repository for big data analytic
workloads. Azure Data Lake enables you to capture data of any size, type,
and ingestion speed in one single place for operational and exploratory
analytics.
The Azure Data Lake store is an Apache Hadoop file system compatible with
Hadoop Distributed File System (HDFS)
Can be accessed from Hadoop (available with HDInsight cluster) using the
WebHDFS-compatible REST APIs
References:
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview
By: Roy Kim
23. Azure Data Lake Store
Use Cases
Store social media
posts, log files, sensor
data
Store corporate data
such as
relational databases
(as flat files)
References:
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview
By: Roy Kim
24. Azure Data Lake Analytics
Azure Data Lake Analytics is built to make big data analytics easy.
Focus on writing, running, and managing jobs, rather than operating distributed
infrastructure. Instead of deploying, configuring, and tuning hardware.
Write queries to transform your data and extract valuable insights. The analytics
service can handle jobs of any scale instantly by setting the dial for how much
power you need.
U-SQL – a Big Data query language. Likeness of SQL + C#
”schema on reads”
Pay for your job when it is running; making it cost-effective.
Data Collector app stores .json files in respective folders
USQL scripts logic:
reads 1000s of JSON files in a given folder
Outputs to one TSV (tab delimited) file
Create a Tables to schematize the TSV files
Query against tables to analyze or transform to a new output file.
References:
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview
By: Roy Kim
25. Azure Data Lake Analytics – Demo Implementation
By: Roy Kim
USQL script: process json files into a tab delimited file
26. Azure Data Lake Analytics – Demo Implementation
By: Roy Kim
Roy Kim
# of JSON
Files
Single output
file
50 compute
nodes
3.4 mins
duration
27. Azure HDInsight
Hadoop refers to an ecosystem of open-source software that is a framework for
distributed processing, storing, and analysis of big data sets on clusters of
commodity computer hardware.
Azure HDInsight makes the Hadoop components from the Hortonworks Data
Platform (HDP) distribution available in Azure, deploys managed clusters with high
reliability and availability, and provides enterprise-grade security and governance
with Active Directory.
HDInsight offers the cluster types - Hadoop, HBase, Spark, Kafka, Interactive Hive,
Storm, customized, etc.
Supports integration with BI tools such as Power BI, Excel, SQL Server Analysis
Services, and SQL Server Reporting Services.
By: Roy Kim
28. Azure HDInsight – Demo Implementation
Hadoop Cluster Type
Data Source
Windows Azure Storage Account
Data Lake Store access
Data Lake Store Account
Hive Tables
JobPostings (internal) Table
Schema definition
Data loaded from Azure Data Lake Store .TSV file into Hadoop cluster’s WASB
JobPostings External Table
Data referenced in Azure Data Lake Store .TSV file. This is external to
HDInsight storage account.
By: Roy Kim
29. Azure HD Insight – Demo Implementation
Considerations
To manage the compute costs, script the provisioning and de-provisioning of the
cluster.
While a cluster is running, execute scripts and query the data into self service BI
tools and into other data warehouses.
In comparison to Azure Data Lake, ADL Analytics may be more cost effective
since it is pay per use at a more granular level - # of nodes and execution time.
E.g. Running against 100 nodes may cost a few dollars per minute in ADL
Analytics; whereas, in HDInsight, 13 nodes for small VM size may cost a few
dollars an hour.
By: Roy Kim
30. Azure SQL Database
A relational database-as-a-service in the cloud built on the Microsoft SQL Server
engine
No need to manage the infrastructure.
Scale up or down based on Database Transaction Units (DTUs).
1TB storage maximum
Can be used as a simpler data warehouse.
By: Roy Kim
31. Azure SQL Database – Demo Implementation
Developed a simple data warehouse
modelling
Job Postings data loaded from ADLS
Star schema
Added a date dimension table
Table of # of jobs for each province by
a date hierarchy
By: Roy Kim
32. Azure Data Factory
Cloud-based data integration service that orchestrates and automates
the movement and transformation of data.
Create data pipelines that move and transform data, and then run the pipelines on a specified
schedule (hourly, daily, weekly, etc.)
By: Roy Kim
33. Azure Data Factory
Category Data store Supported as a source Supported as a sink
Azure Azure Blob storage
Azure Data Lake Store
Azure SQL Database
Azure SQL Data Warehouse
Azure Table storage
Azure DocumentDB
Azure Search Index
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
Databases SQL Server*
Oracle*
MySQL*
DB2*
Teradata*
PostgreSQL*
Sybase*
Cassandra*
MongoDB*
Amazon Redshift
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
File File System*
HDFS*
Amazon S3
FTP
✓
✓
✓
✓
✓
Others Salesforce
Generic ODBC*
Generic OData
Web Table (table from HTML)
GE Historian*
✓
✓
✓
✓
✓
By: Roy Kim
34. Azure Machine Learning – Demo Implementation
By: Roy Kim
Predicting Salary for a
given set of parameters
such as job title and
location
37. The main features of
your Power BI service UI:
1. navigation bar
2. dashboard with tiles
3. Q&A question box
4. help and feedback
buttons
5. dashboard title
6. Office 365 app
launcher
7. Power BI home
buttons
8. Additional dashboard
actions
Power BI App Service
By: Roy Kim
38. • Frequently updated and accessed reports
• Minutes, hours, daily, weekly
• Fast and easy access of reports and dashboards
• IoT and sensor data
• Retail and customer analytics
• Team and organizational performance and productivity e.g. ticket
management
• Collaborative analysis and decision making
• Not always in front of a large screen device
Key Mobile Scenarios
By: Roy Kim
39. • Navigation
• Dashboards and Reports
• Responsive design
• Visualization interaction
• Sharing
• Annotations
• Q&A
• Alerts
• Favourites
Annotations
Mobile App IOS Key Features & Demo
By: Roy Kim
41. Security Architecture
Azure Data Lake
Analytics
Internet Data
Sources
USQL
Storage Account
Blob Store
Azure Batch
.NET Console
App
Azure Data Lake Store
Blob Store
(HDFS)
Azure HDInsight
Hive
Storage Account
Blob Store
(HDFS)
Azure
Active Directory
HDInsight
Azure SQL
database
SQL Data
Warehouse
Storage blob
Storage (Azure)
Visual Studio
Online
Data Lake
Azure SQL / Data
Warehouse
SQL DB
Analysis Services
(preview)
Tabular
StorageTierServicesTier
REST/HTML/..
Visualization
/Reporting
Tools
Pig
Scoop
Roy Kim
Desktop
Batch
BI
Developers
IT Ops
Azure Data
Factory
Pipeline
Data Factory
Microsoft
AzureAPI
Key
API
Key
AAD App
Service Principal
AAD App
Service Principal
AAD User
SQL account
AAD User
SQL account
Account
Key
AAD User
End Users
By: Roy Kim
Applicatio
n Insights
42. Data Processing & Formats
Azure Data Lake
Analytics
Internet Data
Sources
USQL
Azure Batch
.NET Console
App
Azure HDInsight
Hive
HDInsight
Azure SQL
database
SQL Data
Warehouse
Data Lake
Azure SQL
SQL DB
Analysis Services
(preview)
Tabular
DataFormatServicesTier
REST/HTML/.. Pig
Scoop
Roy Kim
Batch
Azure Data
Factory
Pipeline
Data Factory
JSON
HTML
JSON
TSV Hive Ext.
Table
Relational
DB
Hive Int.
Table
Table
By: Roy Kim
43. Closing Remarks
Cloud services such as Azure Data Platform provide new capabilities in
Data Analytics. That is in terms of scale, cost and agility.
Azure Data Lake is a productive option for organizations new to
Hadoop. Yet continue to plan for other Hadoop offerings best fit for
other scenarios.
Many azure services fit together to make the appropriate solution.
That is SaaS, PaaS, IaaS, Data, App, Operational, etc.
As part of planning and design, be aware of MS roadmap and industry
trends.
By: Roy Kim