© Copyright 2016 Hitachi Consulting
Microsoft Azure Batch
High Performance Computing with an Application to
Scalable File Processing
Khalid M. Salama, Ph.D.
Business Insights & Analytics
Hitachi Consulting UK
We Make it Happen. Better.
Outline
• What is Azure Batch and High Performance Computing?
• When to Use Azure Batch?
• Azure Batch Constructs
• Scalable Data Loading Solution with Azure Batch
• .NET Code Walk-through & Demo
• Useful Resources
High Performance Computing
What is Azure Batch?
Yet another Azure service…
A High Performance Computing (HPC) environment on Azure, used to scale and parallelize compute-intensive workloads on a managed cluster of VMs. The computation on the cluster is managed using the Azure Batch APIs.
• On-demand: pay as you use
• Elastic: scale up/down or shut down
• PaaS: no infrastructure configuration is needed
Computing Example
Sequential Processing
A job is decomposed into six tasks (Task 1 … Task 6), executed one after another on a single compute unit:
Start: T = 0; Task 1 completes at T = 1X; Task 2 at T = 2X; Task 3 at T = 3X; Task 4 at T = 4X; Task 5 at T = 5X; Task 6 at T = 6X; End: T = 6X+
High Performance Computing
Refers to the use of parallel processing to run compute-intensive job programs efficiently by aggregating compute power:
• Divide: a job is decomposed into multiple independent tasks
• Distribute: tasks are processed on separate compute nodes, simultaneously
• Scale out: using multiple compute units
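The divide/distribute/scale-out pattern can be sketched in a few lines. The deck's solution is .NET, but the idea is language-agnostic, so here is a minimal Python illustration; `process_task` is a placeholder for any independent, compute-intensive unit of work:

```python
from concurrent.futures import ThreadPoolExecutor

def process_task(task_id):
    # Placeholder for a compute-intensive, independent unit of work.
    return f"task-{task_id}:done"

def run_job(task_ids, max_workers=6):
    # Divide: the job is already decomposed into independent tasks.
    # Distribute: each task runs on a separate worker, simultaneously.
    # Scale out: raising max_workers adds compute units.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_task, task_ids))

results = run_job(range(1, 7))
```

Because the tasks are independent, the workers need no coordination; `pool.map` still returns results in task order.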
Computing Example
Parallel Processing
The same job's six tasks (Task 1 … Task 6) are distributed across a compute cluster and executed simultaneously, one task per node:
Start: T = 0; Tasks 1–6 each complete at T = 1X; End: T = 1X+ (versus 6X+ sequentially)
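The timing comparison above can be captured as a small makespan calculation. This is a simplified model assuming equal-sized, independent tasks greedily packed onto identical compute units:

```python
import math

def makespan(num_tasks, task_time, num_units):
    # Independent, equal-sized tasks packed onto identical units run in
    # ceil(tasks / units) "waves", each taking task_time.
    return math.ceil(num_tasks / num_units) * task_time

sequential = makespan(6, 1, 1)  # 6 tasks on a single compute unit -> 6X
parallel = makespan(6, 1, 6)    # 6 tasks on a six-node cluster -> 1X
```

The same formula also shows why a pool smaller than the task count still helps: 14 tasks on 10 single-core nodes finish in two waves rather than fourteen.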
Big Data vs. Big Compute
The big brothers
Big Data
• Data centric
• Increase in data Volume + Velocity + Variety
= technologies to store and process the data efficiently
• Azure HDInsight
Big Compute
• CPU & memory intensive
• Increase in computation and algorithm complexity
= technologies to parallelize/distribute the workload
• Azure Batch
Big Data processing is a subset of Big Compute; the latter covers a wider spectrum of computing problems.
When to use Azure Batch
Use cases for Big Compute: intrinsically parallel (also known as "embarrassingly parallel") applications
• Image rendering and graphics processing
• Search and optimization problems
• Various experimental/simulation computing applications
• Massively parallel data file processing & loading
• Executing thousands of DB stored procedures simultaneously: NO! Remember where the computation occurs: the stored procedures would all execute on the database server, not on the Batch nodes.
For applications that need task-to-task interaction, Message Passing Interface (MPI) is supported in Azure Batch (distributed processing). In some cases, communication between tasks can be managed via a shared data store (parallel processing).
Azure Batch
Azure Batch Constructs
Putting together the pieces of the picture
Azure Batch Account
• Pool
− Number of VMs
− VM size
− VM OS family
• Job
− Set of tasks
− Priority
− Max. execution time
• Task
− Parent job
− Resources (.config, .dlls)
− Cmd executable (.exe)
− Cmd parameters
Azure Storage Account
• Hosts all the task resources (.dlls & .exe)
A Batch account can contain multiple pools (each defined by its number of nodes, OS family, and node size). Jobs (each with a priority and a max. execution time) are submitted to a pool, and each job contains its own tasks (each with its executable and resources); several jobs, each with its own tasks, can run in the same pool.
Compute Size
Resource                   Default   Maximum limit
Azure Batch accounts       1         50
Pools per Batch account    20        5000
Cores per Batch account    20        N/A
Tasks per compute node     1         4 × node cores
Number of nodes vs. node size:
• Many small nodes → many tasks that are not compute/memory intensive
• Few big nodes → few compute/memory-intensive tasks (with potential multi-threading per task)
• Task queueing is automatically managed by Azure Batch
Compute Size
What if:
• Pool size = 10 nodes
• Node size = Small (1 core)
• Total cores = 10
And you have:
• 2 jobs, each with 7 tasks (14 tasks in total)
By default, 1 core can process only 1 task at a time. Then:
• The 7 tasks of the higher-priority job will be executed (status = "Running")
• The first 3 tasks added to the lower-priority job will be executed (status = "Running")
• The remaining 4 tasks of the lower-priority job will be queued (status = "Active")
• As soon as a "Running" task finishes (status = "Completed"), an "Active" task is assigned to the freed compute node
• If a job is already executing (status = "Running") and a higher-priority job is submitted to the same pool, Azure Batch will "pause" tasks of the lower-priority job (status = "Suspended") to free cores for the higher-priority job, then resume them when resources become available
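The core-allocation behaviour described above can be simulated in a few lines. This is an illustrative model of the default one-task-per-core scheduling, not the actual Batch scheduler:

```python
def schedule(total_cores, jobs):
    # jobs: list of (priority, task_count); lower number = higher priority.
    # By default each core runs one task, so at most total_cores tasks are
    # "Running"; the rest stay "Active" (queued) until a core frees up.
    statuses = {}
    free = total_cores
    for priority, task_count in sorted(jobs):
        running = min(task_count, free)
        free -= running
        statuses[priority] = {"Running": running, "Active": task_count - running}
    return statuses

# 10 single-core nodes, 2 jobs with 7 tasks each:
result = schedule(10, [(1, 7), (2, 7)])
```

Running it reproduces the slide's numbers: all 7 higher-priority tasks run, 3 of the lower-priority tasks run, and 4 remain queued.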
Use Case: Parallel Data File Loading
Parallel Data Loading with Azure Batch
Problem Context
• Source data is a set of files, in different formats (fixed width, delimited, XML, JSON, mainframe, other), in Azure Blob Storage
• Blob Storage structure: "<DataDomain><DataFeed><DataFeed>_<Timestamp>.<ext>"
• 200+ data feeds, each producing 1-3 files daily
• Data feed formats (columns, data types, file format) are described in a MetadataDB (Azure SQL DB)
• The objective is to build a data loading solution that:
− Parses the files and loads them into a database (Azure SQL DW)
− Is scalable: used for ongoing data loading and historical data migration
− Is metadata driven: new data feeds can be handled by the solution by adding metadata
− Logs execution history and errors
Parallel Data Loading with Azure Batch
Parallelism Level
The task (the unit of parallelization, or granule) can be:
• Processing a feed
− balanced number of files/file sizes in each feed
− files are loaded in sequence
− files can be processed simultaneously on the same node using multi-threading (CPU/memory implications)
• Processing a file
− no file sequencing is needed
− fine grained: more control and better utilization of resources
− less manageable (many tasks per job)
• Processing a file line
− multi-threading on the same node
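The choice of granule amounts to how the blob paths are partitioned into tasks. A minimal sketch of the first two options; the `"<feed>/<file>"` naming used here is a simplifying assumption for illustration:

```python
from collections import defaultdict

def tasks_per_feed(blob_paths):
    # Coarse granule: one task per feed; the task loads its files in sequence.
    feeds = defaultdict(list)
    for path in blob_paths:
        feed = path.split("/")[0]   # assumes "<feed>/<file>" naming
        feeds[feed].append(path)
    return [sorted(files) for _, files in sorted(feeds.items())]

def tasks_per_file(blob_paths):
    # Fine granule: one task per file; more tasks, better utilization,
    # but many more tasks per job to manage.
    return [[path] for path in sorted(blob_paths)]

paths = ["sales/sales_01.csv", "sales/sales_02.csv", "stock/stock_01.csv"]
```

With three files across two feeds, the coarse partitioning yields two tasks and the fine partitioning yields three.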
Parallel Data Loading with Azure Batch
Solution Architecture
Components: the Azure Batch Runner (host), Metadata (Azure SQL DB), the source (Azure Blob Storage), the compute cluster (Azure Batch pool), and the destination (Azure SQL DW).
1. Get the list of feeds to process from the Metadata DB
2. Create a job
3. Create a task for each feed
4. Add the tasks to the job
5. Submit the job to the compute cluster
Each task (Task 1 … Task N, one per feed) then reads its feed's files from Blob Storage and loads the resulting data sets into the destination (Azure SQL DW).
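The runner's steps 2-4 can be sketched with plain data structures. The types below are illustrative stand-ins, not the real Batch SDK classes, and the `LoaderTask.exe` command line is a hypothetical example:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    command_line: str      # the task console app plus its parameters

@dataclass
class Job:
    job_id: str
    pool_id: str
    tasks: list = field(default_factory=list)

def build_job(job_id, pool_id, feeds):
    # Create a job, create one task per feed, and add the tasks to the job.
    # Submission via the Batch API (step 5) is omitted in this sketch.
    job = Job(job_id=job_id, pool_id=pool_id)
    for feed in feeds:
        job.tasks.append(Task(task_id=f"load-{feed}",
                              command_line=f"LoaderTask.exe --feed {feed}"))
    return job

job = build_job("daily-load", "loader-pool", ["sales", "stock"])
```

Because each task carries its own command line and parameters, the same task executable serves every feed; only the metadata differs.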
Parallel Data Loading with Azure Batch
Task Processing Steps
1. Get the feed format info from the Metadata DB
2. Create the destination tables
3. Get the list of files to process
4. Load the parser class to use
5. For each file to process:
− Load the file content from Blob Storage
− Parse the file content into a DataTable
− Dump the DataTable content to the destination (DW)
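The per-task loop above can be sketched end-to-end. Here `parse_delimited` is a stand-in for the metadata-driven parser (the real solution picks the parser class and column types from the MetadataDB), and a plain list stands in for the DW destination:

```python
import csv
import io

def parse_delimited(content):
    # Stand-in for the metadata-driven parser class.
    return list(csv.reader(io.StringIO(content)))

def process_feed(file_contents, parse, batch_size, destination):
    # For each file: load the content, parse it into rows, and dump the
    # rows to the destination in fixed-size batches.
    for content in file_contents:
        rows = parse(content)
        for i in range(0, len(rows), batch_size):
            destination.extend(rows[i:i + batch_size])
    return destination

loaded = process_feed(["a,1\nb,2\n", "c,3\n"], parse_delimited, 2, [])
```

Batching the dump step matters because a feed's files can be large; the SQL bulk copy section below returns to that point.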
.NET Solution Structure
• Processing Logic (class library): Model, Database services, Blob Storage services, and Parsers; deployed to Azure Blob Storage
• Task (console app): receives command-line parameters and performs the operation according to the supplied parameters; deployed to Azure Blob Storage
• Runner (console app): Azure Batch services; creates pools/jobs/tasks; runs on a host
Hosting the Azure Batch Runner
• None: one-off execution
• SQL Agent job (VM + SQL Server)
• SQL Server Integration Services (VM + SQL Server)
• Azure WebJob + Azure Scheduler (or on-demand)
• Azure Data Factory
• Azure Orchestration???
Code Walk-through
Code Walk-through
This is how we do it
• Solution Structure
• Azure Batch Bits
• Azure Blob Storage Bits
• Text File Processing
• XML & JSON (Quick and Dirty)
• SQL Bulk Copy with Retry Pattern
Code Walk-through
Solution Structure
Code Walk-through
Azure Batch Bits
Waiting on the job is very useful if you want to sync with subsequent processing steps, i.e., start a subsequent step only when the job finishes.
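A hedged sketch of that synchronization: poll the task states until every task reports "Completed", or give up at a timeout. In the real solution this would query the Batch API; here `get_task_states` is a stub:

```python
import time

def wait_for_job(get_task_states, timeout_s=60.0, poll_s=1.0):
    # Poll task states until every task is "Completed" (or time out),
    # so a subsequent step starts only when the whole job has finished.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if all(s == "Completed" for s in get_task_states()):
            return True
        time.sleep(poll_s)
    return False

# get_task_states would call the Batch API; this stub finishes immediately:
done = wait_for_job(lambda: ["Completed", "Completed"], poll_s=0.01)
```

The timeout mirrors a job's max. execution time: the caller can decide whether a timed-out job should be failed or retried.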
Code Walk-through
Azure Blob Storage
Streaming is very efficient for processing large files: the content is processed as it is read, instead of downloading the whole file before processing it.
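A minimal sketch of the streaming idea: read the blob in fixed-size chunks and yield complete lines, so the whole file never has to sit in memory at once:

```python
import io

def stream_lines(blob_stream, chunk_size=64 * 1024):
    # Read the stream in fixed-size chunks and yield complete lines;
    # only the current chunk plus one partial line is held in memory.
    pending = ""
    while True:
        chunk = blob_stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        *lines, pending = pending.split("\n")
        yield from lines
    if pending:
        yield pending

lines = list(stream_lines(io.StringIO("row1\nrow2\nrow3"), chunk_size=4))
```

A `StringIO` stands in for the blob's download stream here; the chunked-read pattern is the same either way.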
Code Walk-through
Text File Parsing – FileHelpers Library
Parallel processing at the file level (a separate thread per line to parse).
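A rough Python analogue of that pattern (FileHelpers itself is a .NET library): parse the lines of one file on a thread pool, keeping the original line order. The two-field record layout is a hypothetical example:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_line(line):
    # Stand-in for a FileHelpers-style record parser mapping a
    # delimited line onto typed fields.
    name, value = line.split(",")
    return {"name": name, "value": int(value)}

def parse_file(lines, max_workers=4):
    # Parallelism within one file: lines are parsed on separate threads;
    # map() still returns the records in the original line order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse_line, lines))

records = parse_file(["a,1", "b,2"])
```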
Code Walk-through
XML & JSON File Parsing – Quick & Dirty
• The content of the whole file is loaded into a dataset
• Data cannot be flushed in batches
• Unlike streaming, this is a more memory-intensive approach
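A sketch of the quick-and-dirty approach; the `"rows"`/`"id"`/`"amount"` field names are hypothetical:

```python
import json

def parse_json_file(content):
    # Quick and dirty: json.loads pulls the WHOLE document into memory
    # first, so nothing can be flushed in batches while parsing, unlike
    # the streaming approach used for flat text files.
    document = json.loads(content)
    return [(row["id"], row["amount"]) for row in document["rows"]]

rows = parse_json_file('{"rows": [{"id": 1, "amount": 9.5}]}')
```

For moderately sized XML/JSON feeds this is acceptable; for very large ones an incremental parser would be the memory-safe alternative.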
Code Walk-through
SQL Bulk Copy – Loading in Batches
Batch size < (available memory / record size)
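That rule of thumb can be written down directly; the 0.5 headroom factor is an assumed safety margin for parsing overhead, not a figure from the deck:

```python
def safe_batch_size(available_memory_bytes, record_size_bytes, headroom=0.5):
    # Keep the batch size below available memory / record size; the
    # headroom factor leaves room for parsing overhead (assumed value).
    return max(1, int(available_memory_bytes * headroom // record_size_bytes))

def batches(rows, batch_size):
    # Feed the bulk copy fixed-size batches instead of one huge insert.
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

size = safe_batch_size(1_000_000, 500)   # 1 MB budget, 500-byte records
```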
Code Walk-through
SQL Bulk Copy – Asynchronous
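A sketch of the asynchronous idea; the real code uses SqlBulkCopy's async API, and `asyncio.sleep(0)` stands in for the server round-trip:

```python
import asyncio

async def write_batch(destination, batch):
    # Stand-in for an awaited bulk-copy write: the task stays responsive
    # while the server ingests the batch.
    await asyncio.sleep(0)          # simulated I/O
    destination.extend(batch)

async def bulk_copy(rows, batch_size, destination):
    # Write the rows batch by batch, awaiting each server round-trip.
    for i in range(0, len(rows), batch_size):
        await write_batch(destination, rows[i:i + batch_size])
    return destination

loaded = asyncio.run(bulk_copy(list(range(5)), 2, []))
```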
Code Walk-through
SQL Bulk Copy – Retry Pattern
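A minimal sketch of the retry pattern: retry a transiently failing operation with exponential backoff, re-raising once the attempts are exhausted. (In the real solution the operation is the bulk copy and the caught errors are transient SQL errors; here a generic exception keeps the sketch self-contained.)

```python
import time

def with_retry(operation, max_attempts=3, base_delay_s=0.01):
    # Retry pattern: re-run the operation on failure, doubling the delay
    # each attempt, and re-raise once max_attempts is reached.
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay_s * (2 ** (attempt - 1)))

attempts = {"count": 0}

def flaky():
    # Fails twice, then succeeds, simulating a transient error.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "loaded"

result = with_retry(flaky)
```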
Some Important Notes – PolyBase
• Since the destination database is an Azure SQL DW, PolyBase (a Big Data technology) is the best option to load data from Blob Storage into it, by creating external tables that define the format of the data files.
• However, to use PolyBase, the Blob Storage account needs to be locally redundant, and each folder should contain only one data file type.
• A pre-processing step is therefore to move the data files from the original Blob Storage account (which might be geo-redundant) to a temporary locally redundant one, in a proper folder structure.
• Parsing data files with complex formats (e.g., parent-child, mainframe, JSON, XML) is not possible in PolyBase (yet), but PolyBase can load each line of a file into a one-column table, where T-SQL is used to parse it.
• If the source is not Blob Storage (e.g., a file system), or your destination is not Azure SQL DW (e.g., Azure SQL DB, DocumentDB, or another Azure Blob Storage/Data Lake), or your file processing involves more than loading data into a database (e.g., processing requests to initiate workflows), Azure Batch is the right tool.
Useful Resources
Check these out…
• Azure Batch Documentation
https://azure.microsoft.com/en-us/documentation/articles/batch-technical-overview
• Azure Batch Explorer
https://github.com/Azure/azure-batch-samples/tree/master/CSharp/BatchExplorer
• HPC and data orchestration using Azure Batch and Data Factory
https://azure.microsoft.com/en-us/documentation/articles/data-factory-data-processing-using-batch
• FileHelpers Library
http://www.filehelpers.net
• Retry Pattern
https://msdn.microsoft.com/en-us/library/dn589788.aspx
• Spinning up 16,000 A1 Virtual Machines on Azure Batch
https://blogs.endjin.com/2015/07/spinning-up-16000-a1-virtual-machines-on-azure-batch
• Parallel Computing
https://en.wikipedia.org/wiki/Parallel_computing
Acknowledgement
These guys are awesome…
Thanks to James Fox and Alessandro Aeberli for their efforts in building the awesome Data Landing Solution for Argos. Nirav is currently the master of the landing solution.
My Background
Applying Computational Intelligence in Data Mining
• Honorary Research Fellow, School of Computing, University of Kent
• Ph.D. Computer Science, University of Kent, Canterbury, UK
• M.Sc. Computer Science, The American University in Cairo, Egypt
• 25+ published journal and conference papers, focusing on:
− classification rule induction,
− decision tree construction,
− Bayesian classification modelling,
− data reduction,
− instance-based learning,
− evolving neural networks, and
− data clustering
• Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Applied Soft Computing, and Memetic Computing
• Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, IEEE WCCI, and INNS-BigData
ResearchGate.org
[JSS2015] Azure SQL Data Warehouse - Azure Data Lake
GUSS
 
Azure SQL DWH
Azure SQL DWHAzure SQL DWH
Azure SQL DWH
Shy Engelberg
 
SQL Saturday #313 Rheinland - MapReduce in der Praxis
SQL Saturday #313 Rheinland - MapReduce in der PraxisSQL Saturday #313 Rheinland - MapReduce in der Praxis
SQL Saturday #313 Rheinland - MapReduce in der Praxis
Sascha Dittmann
 
AnalyticsConf : Azure SQL Data Warehouse
AnalyticsConf : Azure SQL Data WarehouseAnalyticsConf : Azure SQL Data Warehouse
AnalyticsConf : Azure SQL Data Warehouse
Wlodek Bielski
 
How to deploy SQL Server on an Microsoft Azure virtual machines
How to deploy SQL Server on an Microsoft Azure virtual machinesHow to deploy SQL Server on an Microsoft Azure virtual machines
How to deploy SQL Server on an Microsoft Azure virtual machines
SolarWinds
 
Datawarehouse como servicio en azure (sqldw)
Datawarehouse como servicio en azure (sqldw)Datawarehouse como servicio en azure (sqldw)
Datawarehouse como servicio en azure (sqldw)
Enrique Catala Bañuls
 
Enterprise Cloud Data Platforms - with Microsoft Azure
Enterprise Cloud Data Platforms - with Microsoft AzureEnterprise Cloud Data Platforms - with Microsoft Azure
Enterprise Cloud Data Platforms - with Microsoft Azure
Khalid Salama
 
Microsoft Azure Data Warehouse Overview
Microsoft Azure Data Warehouse OverviewMicrosoft Azure Data Warehouse Overview
Microsoft Azure Data Warehouse Overview
Justin Munsters
 
SQL Azure Data Warehouse - Silviu Niculita
SQL Azure Data Warehouse - Silviu NiculitaSQL Azure Data Warehouse - Silviu Niculita
SQL Azure Data Warehouse - Silviu Niculita
ITCamp
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
Khalid Salama
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
James Serra
 
Introducing Azure SQL Database
Introducing Azure SQL DatabaseIntroducing Azure SQL Database
Introducing Azure SQL Database
James Serra
 
Intorducing Big Data and Microsoft Azure
Intorducing Big Data and Microsoft AzureIntorducing Big Data and Microsoft Azure
Intorducing Big Data and Microsoft Azure
Khalid Salama
 
Cortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeCortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data Lake
MSAdvAnalytics
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Khalid Salama
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
James Serra
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
Mark Kromer
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
Hortonworks
 

Viewers also liked (20)

Spark with HDInsight
Spark with HDInsightSpark with HDInsight
Spark with HDInsight
 
20060416 Azure Boot Camp 2016- Azure Data Lake Storage and Analytics
20060416   Azure Boot Camp 2016- Azure Data Lake Storage and Analytics20060416   Azure Boot Camp 2016- Azure Data Lake Storage and Analytics
20060416 Azure Boot Camp 2016- Azure Data Lake Storage and Analytics
 
[JSS2015] Azure SQL Data Warehouse - Azure Data Lake
[JSS2015] Azure SQL Data Warehouse - Azure Data Lake[JSS2015] Azure SQL Data Warehouse - Azure Data Lake
[JSS2015] Azure SQL Data Warehouse - Azure Data Lake
 
Azure SQL DWH
Azure SQL DWHAzure SQL DWH
Azure SQL DWH
 
SQL Saturday #313 Rheinland - MapReduce in der Praxis
SQL Saturday #313 Rheinland - MapReduce in der PraxisSQL Saturday #313 Rheinland - MapReduce in der Praxis
SQL Saturday #313 Rheinland - MapReduce in der Praxis
 
AnalyticsConf : Azure SQL Data Warehouse
AnalyticsConf : Azure SQL Data WarehouseAnalyticsConf : Azure SQL Data Warehouse
AnalyticsConf : Azure SQL Data Warehouse
 
How to deploy SQL Server on an Microsoft Azure virtual machines
How to deploy SQL Server on an Microsoft Azure virtual machinesHow to deploy SQL Server on an Microsoft Azure virtual machines
How to deploy SQL Server on an Microsoft Azure virtual machines
 
Datawarehouse como servicio en azure (sqldw)
Datawarehouse como servicio en azure (sqldw)Datawarehouse como servicio en azure (sqldw)
Datawarehouse como servicio en azure (sqldw)
 
Enterprise Cloud Data Platforms - with Microsoft Azure
Enterprise Cloud Data Platforms - with Microsoft AzureEnterprise Cloud Data Platforms - with Microsoft Azure
Enterprise Cloud Data Platforms - with Microsoft Azure
 
Microsoft Azure Data Warehouse Overview
Microsoft Azure Data Warehouse OverviewMicrosoft Azure Data Warehouse Overview
Microsoft Azure Data Warehouse Overview
 
SQL Azure Data Warehouse - Silviu Niculita
SQL Azure Data Warehouse - Silviu NiculitaSQL Azure Data Warehouse - Silviu Niculita
SQL Azure Data Warehouse - Silviu Niculita
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
 
Introducing Azure SQL Database
Introducing Azure SQL DatabaseIntroducing Azure SQL Database
Introducing Azure SQL Database
 
Intorducing Big Data and Microsoft Azure
Intorducing Big Data and Microsoft AzureIntorducing Big Data and Microsoft Azure
Intorducing Big Data and Microsoft Azure
 
Cortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeCortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data Lake
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 

Similar to Microsoft Azure Batch

Easy and Efficient Batch Computing on AWS
Easy and Efficient Batch Computing on AWSEasy and Efficient Batch Computing on AWS
Easy and Efficient Batch Computing on AWS
Amazon Web Services
 
Lean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big DataLean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big DataStylight
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
Kimmo Kantojärvi
 
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Amazon Web Services
 
Code for the earth OCP APAC Tokyo 2013-05
Code for the earth OCP APAC Tokyo 2013-05Code for the earth OCP APAC Tokyo 2013-05
Code for the earth OCP APAC Tokyo 2013-05
Tetsu Saburi
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User Store
DataStax Academy
 
Peteris Arajs - Where is my data
Peteris Arajs - Where is my dataPeteris Arajs - Where is my data
Peteris Arajs - Where is my data
Andrejs Vorobjovs
 
Software Defined Infrastructure
Software Defined InfrastructureSoftware Defined Infrastructure
Software Defined Infrastructure
inside-BigData.com
 
Amazon WorkSpaces-Virtual Desktops in Cloud
Amazon WorkSpaces-Virtual Desktops in CloudAmazon WorkSpaces-Virtual Desktops in Cloud
Amazon WorkSpaces-Virtual Desktops in Cloud
amodkadam
 
How Edmodo Uses Splunk For Real-Time Tag-Based Reporting of AWS Billing and U...
How Edmodo Uses Splunk For Real-Time Tag-Based Reporting of AWS Billing and U...How Edmodo Uses Splunk For Real-Time Tag-Based Reporting of AWS Billing and U...
How Edmodo Uses Splunk For Real-Time Tag-Based Reporting of AWS Billing and U...
cloudcontroller
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
Ashish Mrig
 
Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017
Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017
Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017
Amazon Web Services
 
Cost Optimization as Major Architectural Consideration for Cloud Application
Cost Optimization as Major Architectural Consideration for Cloud ApplicationCost Optimization as Major Architectural Consideration for Cloud Application
Cost Optimization as Major Architectural Consideration for Cloud Application
Udayan Banerjee
 
ExpertsLive Asia Pacific 2017 - Planning and Deploying SharePoint Server 2016...
ExpertsLive Asia Pacific 2017 - Planning and Deploying SharePoint Server 2016...ExpertsLive Asia Pacific 2017 - Planning and Deploying SharePoint Server 2016...
ExpertsLive Asia Pacific 2017 - Planning and Deploying SharePoint Server 2016...
Thuan Ng
 
Linux on Azure Pitch Deck
Linux on Azure Pitch DeckLinux on Azure Pitch Deck
Linux on Azure Pitch Deck
Nicholas Vossburg
 
Task programming
Task programmingTask programming
Task programming
Yogendra Tamang
 
EFFICIENT TRUSTED CLOUD STORAGE USING PARALLEL CLOUD COMPUTING
EFFICIENT TRUSTED CLOUD STORAGE USING PARALLEL CLOUD COMPUTINGEFFICIENT TRUSTED CLOUD STORAGE USING PARALLEL CLOUD COMPUTING
EFFICIENT TRUSTED CLOUD STORAGE USING PARALLEL CLOUD COMPUTING
International Journal of Technical Research & Application
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
DataWorks Summit/Hadoop Summit
 
Cloud Economics, from Genesis to Scale
Cloud Economics, from Genesis to ScaleCloud Economics, from Genesis to Scale
Cloud Economics, from Genesis to Scale
Amazon Web Services
 
ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre...
 ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre... ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre...
ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre...
Amazon Web Services
 

Similar to Microsoft Azure Batch (20)

Easy and Efficient Batch Computing on AWS
Easy and Efficient Batch Computing on AWSEasy and Efficient Batch Computing on AWS
Easy and Efficient Batch Computing on AWS
 
Lean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big DataLean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big Data
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
 
Code for the earth OCP APAC Tokyo 2013-05
Code for the earth OCP APAC Tokyo 2013-05Code for the earth OCP APAC Tokyo 2013-05
Code for the earth OCP APAC Tokyo 2013-05
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User Store
 
Peteris Arajs - Where is my data
Peteris Arajs - Where is my dataPeteris Arajs - Where is my data
Peteris Arajs - Where is my data
 
Software Defined Infrastructure
Software Defined InfrastructureSoftware Defined Infrastructure
Software Defined Infrastructure
 
Amazon WorkSpaces-Virtual Desktops in Cloud
Amazon WorkSpaces-Virtual Desktops in CloudAmazon WorkSpaces-Virtual Desktops in Cloud
Amazon WorkSpaces-Virtual Desktops in Cloud
 
How Edmodo Uses Splunk For Real-Time Tag-Based Reporting of AWS Billing and U...
How Edmodo Uses Splunk For Real-Time Tag-Based Reporting of AWS Billing and U...How Edmodo Uses Splunk For Real-Time Tag-Based Reporting of AWS Billing and U...
How Edmodo Uses Splunk For Real-Time Tag-Based Reporting of AWS Billing and U...
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
 
Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017
Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017
Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017
 
Cost Optimization as Major Architectural Consideration for Cloud Application
Cost Optimization as Major Architectural Consideration for Cloud ApplicationCost Optimization as Major Architectural Consideration for Cloud Application
Cost Optimization as Major Architectural Consideration for Cloud Application
 
ExpertsLive Asia Pacific 2017 - Planning and Deploying SharePoint Server 2016...
ExpertsLive Asia Pacific 2017 - Planning and Deploying SharePoint Server 2016...ExpertsLive Asia Pacific 2017 - Planning and Deploying SharePoint Server 2016...
ExpertsLive Asia Pacific 2017 - Planning and Deploying SharePoint Server 2016...
 
Linux on Azure Pitch Deck
Linux on Azure Pitch DeckLinux on Azure Pitch Deck
Linux on Azure Pitch Deck
 
Task programming
Task programmingTask programming
Task programming
 
EFFICIENT TRUSTED CLOUD STORAGE USING PARALLEL CLOUD COMPUTING
EFFICIENT TRUSTED CLOUD STORAGE USING PARALLEL CLOUD COMPUTINGEFFICIENT TRUSTED CLOUD STORAGE USING PARALLEL CLOUD COMPUTING
EFFICIENT TRUSTED CLOUD STORAGE USING PARALLEL CLOUD COMPUTING
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Cloud Economics, from Genesis to Scale
Cloud Economics, from Genesis to ScaleCloud Economics, from Genesis to Scale
Cloud Economics, from Genesis to Scale
 
ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre...
 ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre... ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre...
ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre...
 

More from Khalid Salama

Microsoft R - ScaleR Overview
Microsoft R - ScaleR OverviewMicrosoft R - ScaleR Overview
Microsoft R - ScaleR Overview
Khalid Salama
 
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
Khalid Salama
 
Microservices, DevOps, and Continuous Delivery
Microservices, DevOps, and Continuous DeliveryMicroservices, DevOps, and Continuous Delivery
Microservices, DevOps, and Continuous Delivery
Khalid Salama
 
Graph Analytics
Graph AnalyticsGraph Analytics
Graph Analytics
Khalid Salama
 
NoSQL with Microsoft Azure
NoSQL with Microsoft AzureNoSQL with Microsoft Azure
NoSQL with Microsoft Azure
Khalid Salama
 
Hive with HDInsight
Hive with HDInsightHive with HDInsight
Hive with HDInsight
Khalid Salama
 
Real-Time Event & Stream Processing on MS Azure
Real-Time Event & Stream Processing on MS AzureReal-Time Event & Stream Processing on MS Azure
Real-Time Event & Stream Processing on MS Azure
Khalid Salama
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!
Khalid Salama
 

More from Khalid Salama (8)

Microsoft R - ScaleR Overview
Microsoft R - ScaleR OverviewMicrosoft R - ScaleR Overview
Microsoft R - ScaleR Overview
 
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
 
Microservices, DevOps, and Continuous Delivery
Microservices, DevOps, and Continuous DeliveryMicroservices, DevOps, and Continuous Delivery
Microservices, DevOps, and Continuous Delivery
 
Graph Analytics
Graph AnalyticsGraph Analytics
Graph Analytics
 
NoSQL with Microsoft Azure
NoSQL with Microsoft AzureNoSQL with Microsoft Azure
NoSQL with Microsoft Azure
 
Hive with HDInsight
Hive with HDInsightHive with HDInsight
Hive with HDInsight
 
Real-Time Event & Stream Processing on MS Azure
Real-Time Event & Stream Processing on MS AzureReal-Time Event & Stream Processing on MS Azure
Real-Time Event & Stream Processing on MS Azure
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!
 

Recently uploaded

哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 

Recently uploaded (20)

哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 

Microsoft Azure Batch

  • 1. | © Copyright 2016 Hitachi Consulting1 Microsoft Azure Batch High Performance Computing with an Application of Scalable Files Processing Khalid M. Salama, Ph.D. Business Insights & Analytics Hitachi Consulting UK We Make it Happen. Better.
  • 2. Outline  What is Azure Batch and High Performance Computing?  When to Use Azure Batch?  Azure Batch Constructs  Scalable Data Loading Solution with Azure Batch  .NET Code Walk-through & Demo  Useful Resources
  • 3. High Performance Computing
  • 4. What is Azure Batch? Yet another Azure service… A High Performance Computing (HPC) environment on Azure.
  • 5. What is Azure Batch? Yet another Azure service… A High Performance Computing (HPC) environment on Azure. Used to scale/parallelize compute-intensive workloads on a managed cluster of VMs.
  • 6. What is Azure Batch? Yet another Azure service… A High Performance Computing (HPC) environment on Azure. The computation on the cluster is managed using the Azure Batch APIs. Used to scale/parallelize compute-intensive workloads on a managed cluster of VMs.
  • 7. What is Azure Batch? Yet another Azure service… A High Performance Computing (HPC) environment on Azure. The computation on the cluster is managed using the Azure Batch APIs. On-demand – Pay as you use. Elastic – Scale up/down or shut down. PaaS – No infrastructure configuration is needed. Used to scale/parallelize compute-intensive workloads on a managed cluster of VMs.
  • 8. Computing Example Job Sequential Processing
  • 9. Computing Example Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Sequential Processing
  • 10. Computing Example Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Sequential Processing Single Compute Unit
  • 11. Computing Example Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 1 Sequential Processing Single Compute Unit Start T = 0
  • 12. Computing Example Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 2 Sequential Processing Task 1 T = 1X Start T = 0 Single Compute Unit
  • 13. Computing Example Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 3 Sequential Processing Task 1 T = 1X Start T = 0 Task 2 T = 2X Single Compute Unit
  • 14. Computing Example Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 1 T = 1X Start T = 0 Task 2 T = 2X Task 3 T = 3X Task 4 T = 4X Task 5 T = 5X Task 6 T = 6X Sequential Processing End T = 6X+ Single Compute Unit
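The sequential timing model in slides 8–14 can be sketched in a few lines. Although the deck's walkthrough is in .NET, this is a language-neutral, conceptual sketch in Python (not Azure Batch API code): six equal tasks on a single compute unit finish at T = 6X because each task must wait for its predecessor.

```python
# Conceptual sketch of the slides' sequential model (not Azure Batch code):
# a job of num_tasks equal tasks on one compute unit, each taking X time
# units, completes only at T = num_tasks * X.

def run_sequentially(num_tasks: int, x: float) -> float:
    """Return total elapsed time when tasks run one after another."""
    elapsed = 0.0
    for _ in range(num_tasks):
        elapsed += x  # the single compute unit is busy for X per task
    return elapsed

total = run_sequentially(num_tasks=6, x=1.0)
print(total)  # 6.0 -> the slides' T = 6X
```

The "+" in the slides' "T = 6X+" hints at the extra scheduling overhead a real run would add on top of this idealized model.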
  • 15. High Performance Computing Refers to the use of parallel processing to run compute-intensive jobs efficiently by aggregating compute power
  • 16. High Performance Computing Refers to the use of parallel processing to run compute-intensive jobs efficiently by aggregating compute power Scale out Using multiple compute units Divide A job is decomposed into multiple independent tasks Distribute Tasks are processed on separate compute nodes, simultaneously
  • 17. Computing Example Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Parallel Processing
  • 18. Computing Example Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Parallel Processing Compute Cluster
  • 19. Computing Example Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Parallel Processing Compute Cluster Task 1 Task 2 Task 3 Task 4 Task 5 Task 6
  • 20. Computing Example Job Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Parallel Processing Compute Cluster Task 1 T = 1X Start T = 0 Task 2 T = 1X Task 3 T = 1X Task 4 T = 1X Task 5 T = 1X Task 6 T = 1X End T = 1X+
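Slides 17–20 show the same job finishing at T = 1X once each task gets its own compute node. A minimal Python sketch (again conceptual, not Azure Batch API code) of that fan-out, using one worker per task so independent tasks run simultaneously and the job ends when the slowest task does:

```python
# Conceptual sketch of the slides' parallel model (not Azure Batch code):
# one worker per task mimics a cluster with one compute node per task, so
# independent tasks overlap and the job's elapsed time is max(task times),
# i.e. T = 1X instead of 6X.
from concurrent.futures import ThreadPoolExecutor

def process(task_id: int) -> str:
    # Stand-in for real per-task work (e.g. processing one data file).
    return f"task {task_id} done"

def run_in_parallel(num_tasks: int) -> list:
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        # map preserves task order even though execution overlaps
        return list(pool.map(process, range(1, num_tasks + 1)))

results = run_in_parallel(6)
print(results[0])  # task 1 done
```

This only pays off because the tasks are independent; the closing slide returns to what happens when they are not.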
  • 21. Big Data vs. Big Compute The big brothers Big Data  Data Centric  Increase of data Volume + Velocity + Variety = Technologies to store and process the data efficiently  Azure HDInsight
  • 22. Big Data vs. Big Compute The big brothers Big Data Big Compute  Data Centric  Increase of data Volume + Velocity + Variety = Technologies to store and process the data efficiently  Azure HDInsight  CPU & Memory Intensive  Increase of computation and algorithm complexity = Technologies to parallelize/distribute the workload  Azure Batch
  • 23. Big Data vs. Big Compute Big Data processing is a subset of Big Compute; the latter covers a wider spectrum of computing problems The big brothers Big Data Big Compute  Data Centric  Increase of data Volume + Velocity + Variety = Technologies to store and process the data efficiently  Azure HDInsight  CPU & Memory Intensive  Increase of computation and algorithm complexity = Technologies to parallelize/distribute the workload  Azure Batch
  • 24. | © Copyright 2016 Hitachi Consulting24 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications Use cases for Big Compute
  • 25. | © Copyright 2016 Hitachi Consulting25 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications  Image rendering and graphics processing  Search and optimization problems  Various experimental/simulation computing applications  Massively parallel data file processing & loading Use cases for Big Compute
  • 26. | © Copyright 2016 Hitachi Consulting26 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications  Image rendering and graphics processing  Search and optimization problems  Various experimental/simulation computing applications  Massively parallel data file processing & loading  Executing thousands of DB Stored Procedures simultaneously Use cases for Big Compute
  • 27. | © Copyright 2016 Hitachi Consulting27 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications  Image rendering and graphics processing  Search and optimization problems  Various experimental/simulation computing applications  Massively parallel data file processing & loading  Executing thousands of DB Stored Procedures simultaneously NO! Remember where the computation occurs! Use cases for Big Compute
  • 28. | © Copyright 2016 Hitachi Consulting28 When to use Azure Batch Intrinsically parallel (also known as "embarrassingly parallel") applications  Image rendering and graphics processing  Search and optimization problems  Various experimental/simulation computing applications  Massively parallel data file processing & loading  Executing thousands of DB Stored Procedures simultaneously NO! Remember where the computation occurs! For applications that need task-to-task interaction, the Message Passing Interface (MPI) is supported in Azure Batch – Distributed Processing. In some cases, communication between tasks can be managed via a shared data store – Parallel Processing. Use cases for Big Compute
  • 29. | © Copyright 2016 Hitachi Consulting29 Azure Batch
  • 30. | © Copyright 2016 Hitachi Consulting30 Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Azure Batch Account • Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 31. | © Copyright 2016 Hitachi Consulting31 Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool (number of nodes, osFamily, Node Size) Azure Batch Account • Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 32. | © Copyright 2016 Hitachi Consulting32 Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool 1 (number of nodes, osFamily, Node Size) Pool 2 (number of nodes, osFamily, Node Size) Azure Batch Account • Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 33. | © Copyright 2016 Hitachi Consulting33 Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool 1 (number of nodes, osFamily, Node Size) Job (priority, max execution time) Task 1 (job, exe resources) Task 2 (job, exe resources) Task 3 (job, exe resources) Pool 2 (number of nodes, osFamily, Node Size) Azure Batch Account • Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 34. | © Copyright 2016 Hitachi Consulting34 Job 2 (priority, max execution time) Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool 1 (number of nodes, osFamily, Node Size) Job 1 (priority, max execution time) Task 1 (job, exe resources) Task 2 (job, exe resources) Task 3 (job, exe resources) Task A (job, exe resources) Task B (job, exe resources) Job 3 (priority, max execution time) Task X (job, exe resources) Task Y (job, exe resources) Pool 2 (number of nodes, osFamily, Node Size) Azure Batch Account • Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 35. | © Copyright 2016 Hitachi Consulting35 Job 2 (priority, max execution time) Azure Batch Constructs Putting together the pieces of the picture Azure Batch Account Pool 1 (number of nodes, osFamily, Node Size) Job 1 (priority, max execution time) Task 1 (job, exe resources) Task 2 (job, exe resources) Task 3 (job, exe resources) Task A (job, exe resources) Task B (job, exe resources) Job 3 (priority, max execution time) Task X (job, exe resources) Task Y (job, exe resources) Pool 2 (number of nodes, osFamily, Node Size) Azure Batch Account • Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 36. | © Copyright 2016 Hitachi Consulting36 Compute Size
Resource | Default | Maximum Limit
Azure Batch Account | 1 | 50
Pools per Batch Account | 20 | 5000
Cores per Batch Account | 20 | N/A
Tasks per Compute Node | 1 | 4 × cores per node
Number of Nodes vs Node Size:  Many small nodes → many tasks, not compute/memory intensive  Few big nodes → few tasks, compute/memory intensive (potential multi-threading per task)  Task queueing is automatically managed by Azure Batch Azure Batch Account • Pool − Number of VMs − VM Size − VM OS Family  Job − Set of Tasks − Priority − Max. Execution time  Task − Parent Job − Resources (.config, .dlls) − Cmd Executable (.exe) − Cmd Parameters Azure Storage Account  Hosts all the task resources (.dlls & .exe)
  • 37. | © Copyright 2016 Hitachi Consulting37 Compute Size What If:  Pool Size = 10 Nodes  Node Size = Small (1 Core)  Total Cores = 10 And you have:  2 Jobs  Each Job has 7 tasks  Total tasks = 14 By default:  1 Core can process only 1 task
  • 38. | © Copyright 2016 Hitachi Consulting38 Compute Size What If:  Pool Size = 10 Nodes  Node Size = Small (1 Core)  Total Cores = 10 And you have:  2 Jobs  Each Job has 7 tasks  Total tasks = 14 By default:  1 Core can process only 1 task Then:  The 7 tasks of the higher-priority job will be executed (status = “Running”)  The first 3 tasks added to the lower-priority job will be executed (status = “Running”)  The remaining 4 tasks of the lower-priority job will be queued (status = “Active”)  As soon as a “Running” task finishes (status = “Completed”), an “Active” task will be assigned to the freed compute node
  • 39. | © Copyright 2016 Hitachi Consulting39 Compute Size What If:  Pool Size = 10 Nodes  Node Size = Small (1 Core)  Total Cores = 10 And you have:  2 Jobs  Each Job has 7 tasks  Total tasks = 14 By default:  1 Core can process only 1 task Then:  The 7 tasks of the higher-priority job will be executed (status = “Running”)  The first 3 tasks added to the lower-priority job will be executed (status = “Running”)  The remaining 4 tasks of the lower-priority job will be queued (status = “Active”)  As soon as a “Running” task finishes (status = “Completed”), an “Active” task will be assigned to the freed compute node  If a job is executing (status = “Running”) and a higher-priority job is submitted to the same pool: − Azure Batch will “pause” tasks of the lower-priority job (status = “Suspended”) to free resources (cores) for the higher-priority job, − then resume them when resources become available
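The scheduling scenario above can be checked with a toy simulation. This is not the Azure Batch API, just a hypothetical sketch of the described behaviour: the higher-priority job's tasks take cores first, tasks within a job are taken in the order they were added, and leftover tasks queue as "Active".

```python
def schedule(jobs_by_priority, total_cores):
    """Toy scheduler: jobs with higher priority grab free cores first;
    within a job, tasks run in the order they were added. Tasks that
    find no free core are queued ("Active")."""
    running, active = [], []
    free = total_cores
    for priority, job_tasks in sorted(jobs_by_priority.items(), reverse=True):
        for task in job_tasks:
            if free > 0:
                running.append(task)   # status = "Running"
                free -= 1
            else:
                active.append(task)    # status = "Active" (queued)
    return running, active

jobs = {2: [f"hi-{i}" for i in range(7)],   # higher-priority job, 7 tasks
        1: [f"lo-{i}" for i in range(7)]}   # lower-priority job, 7 tasks
running, active = schedule(jobs, total_cores=10)
```

As on the slide: all 7 high-priority tasks plus the first 3 low-priority tasks run; the remaining 4 queue until a core frees up.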
  • 40. | © Copyright 2016 Hitachi Consulting40 Use Case: Parallel Data Files Loading
  • 41. | © Copyright 2016 Hitachi Consulting41 Parallel Data Loading with Azure Batch  Source data is a set of files, with different formats (fixed-width, delimited, XML, JSON, mainframe, other), in Azure Blob Storage  Blob Storage Structure: “<DataDomain><DataFeed><DataFeed>_<Timestamp>.<ext>”  200+ data feeds, each produces 1-3 files daily  Data feed formats (columns, data types, file format) are described in a MetadataDB (Azure SQL DB)  The objective is to build a Data Loading Solution to:  Parse the files and load them into a database (Azure SQL DW)  Be scalable – usable for both ongoing data loading and historical data migration  Be metadata-driven – new data feeds can be handled by the solution simply by adding metadata  Log execution history and errors Problem Context
  • 42. | © Copyright 2016 Hitachi Consulting42 Parallel Data Loading with Azure Batch The task (the unit of parallelization, or granule) can be:  Processing a Feed  balanced number of files/file sizes per feed  loads files in sequence  files can be processed simultaneously on the same node using multithreading (CPU/memory implications)  Processing a File  no file sequencing is needed  fine-grained, more control, better utilization of resources  less manageable (many tasks per job)  Processing a File Line  multithreading on the same node Parallelism Level
  • 43. | © Copyright 2016 Hitachi Consulting43 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . Destination <Azure SQL DW> Metadata <Azure SQL DB>
  • 44. | © Copyright 2016 Hitachi Consulting44 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Metadata <Azure SQL DB> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . 1 - Get list of feeds to process Destination <Azure SQL DW>
  • 45. | © Copyright 2016 Hitachi Consulting45 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . 1 - Get list of feeds to process 2 – Create a Job 3 – Create a task for each feed 4 – add the tasks to the job 5 – Submit the job Metadata <Azure SQL DB> Destination <Azure SQL DW>
  • 46. | © Copyright 2016 Hitachi Consulting46 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Metadata <Azure SQL DB> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . Task 1 Task 2 Task N Destination <Azure SQL DW>
  • 47. | © Copyright 2016 Hitachi Consulting47 Parallel Data Loading with Azure Batch Solution Architecture Azure Batch Runner <Host> Source <Azure Blob Storage> Compute Cluster <Azure Batch Pool> Feed 1 Feed 2 Feed N . . . . . . File 1 File 2 . . . DS 1 DS 2 . . . Task 1 Task 2 Task N Metadata <Azure SQL DB> Destination <Azure SQL DW>
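Steps 2-5 of the runner (create a job, create a task per feed, add the tasks, submit) boil down to building one command line per feed for the task console app. A minimal sketch, assuming a hypothetical executable name and parameter convention; the real solution submits these through the Azure Batch .NET API:

```python
def build_feed_tasks(feeds, exe="FeedLoader.exe"):
    """One task per feed: each task invokes the task console app with the
    feed name as its command-line parameter. The executable name and the
    --feed flag are illustrative, not the solution's actual interface."""
    return [{"id": f"load-{feed}", "command_line": f"{exe} --feed {feed}"}
            for feed in feeds]

# Hypothetical feed names standing in for the list read from MetadataDB.
tasks = build_feed_tasks(["Sales", "Inventory", "Customers"])
```

The resulting list maps one-to-one onto the Task 1 … Task N boxes in the architecture diagram.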
  • 48. | © Copyright 2016 Hitachi Consulting48 Parallel Data Loading with Azure Batch Task Processing Steps Get feed format Info from Metadata
  • 49. | © Copyright 2016 Hitachi Consulting49 Parallel Data Loading with Azure Batch Get feed format Info from Metadata Create destination tables Task Processing Steps
  • 50. | © Copyright 2016 Hitachi Consulting50 Parallel Data Loading with Azure Batch Get feed format info from Metadata Create destination tables Get list of files to process Task Processing Steps
  • 51. | © Copyright 2016 Hitachi Consulting51 Parallel Data Loading with Azure Batch Get feed format info from Metadata Create destination tables Get list of files to process Load parser class to use Task Processing Steps
  • 52. | © Copyright 2016 Hitachi Consulting52 Parallel Data Loading with Azure Batch Get feed format info from Metadata Create destination tables Get list of files to process Load parser class to use For each file to process Task Processing Steps
  • 53. | © Copyright 2016 Hitachi Consulting53 Parallel Data Loading with Azure Batch Get feed format info from Metadata Create destination tables Get list of files to process Load parser class to use For each file to process Load file content from Blob Storage Task Processing Steps
  • 54. | © Copyright 2016 Hitachi Consulting54 Parallel Data Loading with Azure Batch Get feed format info from Metadata Create destination tables Get list of files to process Load parser class to use For each file to process Load file content from Blob Storage Parse file content to DataTable Task Processing Steps
  • 55. | © Copyright 2016 Hitachi Consulting55 Parallel Data Loading with Azure Batch Get feed format info from Metadata Create destination tables Get list of files to process Load parser class to use For each file to process Load file content from Blob Storage Parse file content to DataTable Dump DataTable content to destination (DW) Task Processing Steps
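Put together, the task's processing steps look roughly like this. The metadata, blob store, and destination here are in-memory stand-ins for the solution's Azure SQL DB, Blob Storage, and SQL DW services; the structure of the loop, not the API, is the point.

```python
# Hypothetical in-memory stand-ins for MetadataDB and Blob Storage.
METADATA = {"Sales": {"columns": ["id", "amount"], "delimiter": ","}}
BLOBS = {"Sales": {"Sales_20160101.csv": "1,10.5\n2,20.0\n"}}

def parse(content, fmt):
    # Parse delimited content into rows, analogous to filling a DataTable.
    rows = []
    for line in content.strip().splitlines():
        rows.append(dict(zip(fmt["columns"], line.split(fmt["delimiter"]))))
    return rows

def process_feed(feed, destination):
    fmt = METADATA[feed]                       # 1. feed format info from Metadata
    destination.setdefault(feed, [])           # 2. create destination table
    for name, content in BLOBS[feed].items():  # 3. list of files to process
        rows = parse(content, fmt)             # 4-6. load + parse file content
        destination[feed].extend(rows)         # 7. dump rows to destination (DW)

dw = {}  # stand-in for the Azure SQL DW destination
process_feed("Sales", dw)
```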
  • 56. | © Copyright 2016 Hitachi Consulting56 .NET Solution Structure • Model • Database Services • Blob Storage Services • Parsers Processing Logic (Class Library) • Receives Command Line parameters • Performs the operation according to the supplied parameters Task (Console App) • Azure Batch Services • Creates Pools/Jobs/Tasks Runner (Console App)
  • 57. | © Copyright 2016 Hitachi Consulting57 .NET Solution Structure }Azure Blob Storage } A Host • Model • Database Services • Blob Storage Services • Parsers Processing Logic (Class Library) • Receives Command Line parameters • Performs the operation according to the supplied parameters Task (Console App) • Azure Batch Services • Creates Pools/Jobs/Tasks Runner (Console App)
  • 58. | © Copyright 2016 Hitachi Consulting58 Hosting Azure Batch Runner None! – One-off execution SQL Agent Job (VM + SqlServer) SQL Server Integration Services (VM + SqlServer) Azure WebJob + Azure Scheduler (or on-demand) Azure Data Factory Azure Orchestration???
  • 59. | © Copyright 2016 Hitachi Consulting59 Code Walk-through
  • 60. | © Copyright 2016 Hitachi Consulting60 Code Walk-through  Solution Structure  Azure Batch Bits  Azure Blob Storage Bits  Text File Processing  XML & JSON – (Quick and Dirty)  SQL Bulk Copy with Retry Pattern This is how we do it
  • 61. | © Copyright 2016 Hitachi Consulting61 Code Walk-through Solution Structure
  • 62. | © Copyright 2016 Hitachi Consulting62 Code Walk-through Azure Batch Bits Very useful if you want to sync with subsequent processing steps, i.e., start a subsequent step only when the job finishes.
  • 63. | © Copyright 2016 Hitachi Consulting63 Code Walk-through Azure Batch Bits
  • 64. | © Copyright 2016 Hitachi Consulting64 Code Walk-through Azure Batch Bits
  • 65. | © Copyright 2016 Hitachi Consulting65 Code Walk-through Azure Blob Storage Streaming is much more efficient for processing large files than downloading the whole file before processing it
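The streaming idea can be illustrated without any Azure SDK: read the blob in fixed-size chunks and yield complete lines as they appear, so memory use stays bounded regardless of file size. A hedged Python sketch (the deck's actual code uses a .NET stream over the blob):

```python
import io

def stream_lines(blob_stream, chunk_size=64):
    """Yield lines from a file-like stream without materialising the whole
    file in memory -- the same idea as reading a blob through a stream
    instead of downloading it first."""
    buffer = ""
    while True:
        chunk = blob_stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            yield line
    if buffer:  # final line without a trailing newline
        yield buffer

fake_blob = io.StringIO("row1\nrow2\nrow3")  # stand-in for a blob stream
lines = list(stream_lines(fake_blob))
```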
  • 66. | © Copyright 2016 Hitachi Consulting66 Code Walk-through Text File Parsing – FileHelpers Library Parallel processing within a file (a separate thread to parse each line)
  • 67. | © Copyright 2016 Hitachi Consulting67 Code Walk-through XML & JSON Files Parsing – Quick & Dirty • The whole file content is loaded into a DataSet • Data cannot be flushed in batches • Unlike streaming, this is a more memory-intensive approach
  • 68. | © Copyright 2016 Hitachi Consulting68 Code Walk-through SQL Bulk Copy – Loading in Batches Batch size < (available memory / record size)
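The rule of thumb on the slide (batch size < available memory / record size) translates directly into a chunking helper. A small illustrative sketch, not the solution's actual code:

```python
def batches(rows, available_memory_bytes, record_size_bytes):
    """Flush rows in batches no larger than memory allows, per the slide's
    heuristic: batch size < available memory / record size."""
    batch_size = max(1, available_memory_bytes // record_size_bytes)
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

# 400 bytes of headroom / 100 bytes per record -> batches of 4.
chunks = list(batches(list(range(10)),
                      available_memory_bytes=400,
                      record_size_bytes=100))
```

Each chunk would then be handed to the bulk-copy write, keeping per-flush memory bounded.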
  • 69. | © Copyright 2016 Hitachi Consulting69 Code Walk-through SQL Bulk Copy – Asynchronous
  • 70. | © Copyright 2016 Hitachi Consulting70 Code Walk-through SQL Bulk Copy – Retry Pattern
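The retry pattern around the bulk copy can be sketched generically: retry a transiently failing operation a bounded number of times with exponential backoff, re-raising on the final attempt. The names below are illustrative, not the solution's actual classes; in the real code the wrapped operation is the SQL bulk-copy write.

```python
import time

def with_retry(operation, max_attempts=3, base_delay=0.01):
    """Run `operation`, retrying on failure with exponential backoff.
    Re-raises the last error once max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# A stand-in operation that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky_bulk_copy():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "loaded"

result = with_retry(flaky_bulk_copy)
```

In practice the except clause should filter for transient error codes only, so permanent failures (e.g., schema mismatches) fail fast instead of being retried.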
  • 71. | © Copyright 2016 Hitachi Consulting71 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option for loading data into it from Blob Storage, by creating external tables that define the format of the data files.
  • 72. | © Copyright 2016 Hitachi Consulting72 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option for loading data into it from Blob Storage, by creating external tables that define the format of the data files.  However, to use Polybase, the Blob Storage account needs to be locally-redundant, and each folder should contain only one data file type.
  • 73. | © Copyright 2016 Hitachi Consulting73 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option for loading data into it from Blob Storage, by creating external tables that define the format of the data files.  However, to use Polybase, the Blob Storage account needs to be locally-redundant, and each folder should contain only one data file type.  A pre-processing step moves the data files from the original Blob Storage (which might be geo-redundant) to a temporary locally-redundant Blob Storage, in a proper folder structure.
  • 74. | © Copyright 2016 Hitachi Consulting74 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option for loading data into it from Blob Storage, by creating external tables that define the format of the data files.  However, to use Polybase, the Blob Storage account needs to be locally-redundant, and each folder should contain only one data file type.  A pre-processing step moves the data files from the original Blob Storage (which might be geo-redundant) to a temporary locally-redundant Blob Storage, in a proper folder structure.  Parsing data files with complex formats (e.g., parent-child, mainframe, JSON, XML) is not possible in Polybase (yet), but Polybase can load each line of a file into a one-column table, where T-SQL is then used to parse it.
  • 75. | © Copyright 2016 Hitachi Consulting75 Some Important Notes - Polybase  Since the destination database is an Azure SQL DW, Polybase - a Big Data technology - is the best option for loading data into it from Blob Storage, by creating external tables that define the format of the data files.  However, to use Polybase, the Blob Storage account needs to be locally-redundant, and each folder should contain only one data file type.  A pre-processing step moves the data files from the original Blob Storage (which might be geo-redundant) to a temporary locally-redundant Blob Storage, in a proper folder structure.  Parsing data files with complex formats (e.g., parent-child, mainframe, JSON, XML) is not possible in Polybase (yet), but Polybase can load each line of a file into a one-column table, where T-SQL is then used to parse it.  If the source is not Blob Storage (e.g., a file system), or your destination is not Azure SQL DW (e.g., Azure SQL DB, DocumentDB, or another Azure Blob Storage/Data Lake), or your file processing involves more than loading data into a database (e.g., processing requests to initiate workflows), Azure Batch is the right tool.
  • 76. | © Copyright 2016 Hitachi Consulting76 Useful Resources Check these out… • Azure Batch Documentation https://azure.microsoft.com/en-us/documentation/articles/batch-technical-overview • Azure Batch Explorer https://github.com/Azure/azure-batch-samples/tree/master/CSharp/BatchExplorer • HPC and data orchestration using Azure Batch and Data Factory https://azure.microsoft.com/en-us/documentation/articles/data-factory-data-processing-using-batch • FileHelpers Library http://www.filehelpers.net • Retry Pattern https://msdn.microsoft.com/en-us/library/dn589788.aspx • Spinning up 16,000 A1 Virtual Machines on Azure Batch https://blogs.endjin.com/2015/07/spinning-up-16000-a1-virtual-machines-on-azure-batch • Parallel Computing https://en.wikipedia.org/wiki/Parallel_computing
  • 77. | © Copyright 2016 Hitachi Consulting77 Acknowledgement These guys are awesome… Thanks to James Fox and Alessandro Aeberli for their efforts in building the awesome Data Landing Solution for Argos. Nirav is currently the master of the landing solution 
  • 78. | © Copyright 2016 Hitachi Consulting78 My Background Applying Computational Intelligence in Data Mining • Honorary Research Fellow, School of Computing, University of Kent. • Ph.D. Computer Science, University of Kent, Canterbury, UK. • M.Sc. Computer Science, The American University in Cairo, Egypt. • 25+ published journal and conference papers, focusing on: – classification rules induction, – decision trees construction, – Bayesian classification modelling, – data reduction, – instance-based learning, – evolving neural networks, and – data clustering • Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Applied Soft Computing, and Memetic Computing. • Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, IEEE WCCI and INNS-BigData. ResearchGate.org