Data Architecture in Azure

Elena López
Microsoft MVP – Data Platform
elopez@solvex.com.do
www.solvex.com.do
Beyond Data Factory, Power BI, and Azure SQL Database

Azure Data Services
Advanced Analytics
Data sources: Social, LOB, Graph, IoT, Image, CRM
INGEST (data orchestration and monitoring)
• Azure Data Factory
• SSIS
STORE (big data store)
• Azure Data Lake Storage Gen2
• Blob Storage
• Cosmos DB
PREP (transform & clean)
• Azure Databricks
• Azure HDInsight
• Power BI Dataflow
• Azure Data Lake Analytics
MODEL & SERVE (& store): data warehouse, AI, BI + reporting
• Azure SQL Data Warehouse
• Azure Analysis Services
• Cosmos DB
• Power BI Aggregations

Azure Data Lake Storage Gen2
A “no-compromises” Data Lake: secure, performant, massively scalable Data Lake storage that brings the cost and scale profile of object storage together with the performance and analytics feature set of data lake storage
MANAGEABLE
• Automated Lifecycle Policy Management
• Object-level tiering
SCALABLE
• No limits on data store size
• Global footprint (50 regions)
FAST
• Optimized for Spark and Hadoop analytic engines
• Atomic file operations mean jobs complete faster
SECURE
• Support for fine-grained ACLs, protecting data at the file and folder level
• Multi-layered protection via at-rest Storage Service Encryption and Azure Active Directory integration
COST EFFECTIVE
• Object store pricing levels
• File system operations minimize transactions required for job completion
INTEGRATION READY
• Tightly integrated with Azure end-to-end analytics solutions
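The lifecycle-policy and object-level tiering bullets can be illustrated with a minimal sketch. This is not the Azure lifecycle management API: the tier names mirror the Azure Blob access tiers (hot, cool, archive), while the function name `choose_tier` and the 30-day and 180-day thresholds are assumptions chosen only for illustration.

```python
from datetime import date

# Hypothetical thresholds (days since last modification); real policies
# are configured per storage account and will differ.
COOL_AFTER_DAYS = 30
ARCHIVE_AFTER_DAYS = 180


def choose_tier(last_modified: date, today: date) -> str:
    """Return the access tier an object would move to under this toy policy."""
    age_days = (today - last_modified).days
    if age_days >= ARCHIVE_AFTER_DAYS:
        return "archive"
    if age_days >= COOL_AFTER_DAYS:
        return "cool"
    return "hot"


print(choose_tier(date(2019, 1, 1), date(2019, 2, 15)))  # 45 days old -> cool
```

A rule of this shape is evaluated per object, which is what makes tiering "object level" rather than per container.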

Organizing a Data Lake – Folder structure
Objectives
• Plan the structure based on optimal data retrieval
• Avoid a chaotic, unorganized data swamp
Common ways to organize the data:
Time Partitioning
• Year/Month/Day/Hour/Minute
Subject Area
Security Boundaries
• Department
• Business unit
• etc…
Downstream App/Purpose
Data Retention Policy
• Temporary data
• Permanent data
• Applicable period (e.g., project lifetime)
• etc…
Business Impact / Criticality
• High (HBI)
• Medium (MBI)
• Low (LBI)
• etc…
Owner / Steward / SME
Probability of Data Access
• Recent/current data
• Historical data
• etc…
Confidential Classification
• Public information
• Internal use only
• Supplier/partner confidential
• Personally identifiable information (PII)
• Sensitive – financial
• Sensitive – intellectual property
• etc…
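As a minimal sketch of how several of these dimensions can combine into one folder path, the helper below nests a security boundary (department), a subject area, and time partitions down to the hour. The function name `lake_path`, the ordering of the levels, and the sample values are assumptions for illustration; the right ordering depends on how the data is queried and secured.

```python
from datetime import datetime
from pathlib import PurePosixPath


def lake_path(department: str, subject_area: str, ts: datetime) -> str:
    """Build a folder path: department/subject_area/YYYY/MM/DD/HH."""
    return str(
        PurePosixPath(
            department, subject_area,
            f"{ts:%Y}", f"{ts:%m}", f"{ts:%d}", f"{ts:%H}",
        )
    )


print(lake_path("finance", "sales", datetime(2019, 3, 7, 14, 30)))
# prints: finance/sales/2019/03/07/14
```

Zero-padded partitions keep folders in chronological order when listed alphabetically, which helps with range reads over recent versus historical data.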

Architecture
External Data
• Energy Consumption Data & Weather Data (Public Source)
Azure Services
• Azure WebJob: runs jobs to get data from the public source
• Azure Event Hub: stores streaming data
• Azure Stream Analytics: processes events as they arrive in the Event Hub
• Azure SQL: contains historical energy consumption & weather data
• Azure Data Factory: pipeline invokes the AML Web Service
• Automated ML
• Power BI Dashboard
Flow labels in the diagram: Get Data; Data for Real-time Processing; Data Stream Job; Real-time data stats; Send to Azure SQL for predictions; Hourly Prediction Updates
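The stream-processing step can be sketched in plain Python. Azure Stream Analytics expresses this kind of logic in its own SQL-like query language, so the code below only illustrates the idea of grouping events into one-hour tumbling windows; the event shape `(timestamp, load_mw)` and the sample readings are made up.

```python
from collections import defaultdict
from datetime import datetime


def hourly_average(events):
    """Group (timestamp, load_mw) events into one-hour tumbling windows
    and return {window_start: average load}, ordered by window start."""
    sums = defaultdict(lambda: [0.0, 0])  # window_start -> [total, count]
    for ts, load_mw in events:
        window_start = ts.replace(minute=0, second=0, microsecond=0)
        bucket = sums[window_start]
        bucket[0] += load_mw
        bucket[1] += 1
    return {w: total / count for w, (total, count) in sorted(sums.items())}


# Made-up readings, for illustration only.
events = [
    (datetime(2019, 1, 1, 10, 5), 100.0),
    (datetime(2019, 1, 1, 10, 45), 120.0),
    (datetime(2019, 1, 1, 11, 10), 90.0),
]
print(hourly_average(events))
```

Here the two readings in the 10:00 window average to 110.0 and the single 11:00 reading stays at 90.0.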

Machine Learning Process
Data sources: SQL DB, Cosmos DB, Data warehouse, Data lake, Blob storage, …
Steps: Prepare Data → Build & Train → Deploy

Machine Learning Problem Example
How much is this car worth?

Model Creation Is Typically Time-Consuming
Which features?
• Mileage
• Condition
• Car brand
• Year of make
• Regulations
• …
Which algorithm?
• Gradient Boosted
• Nearest Neighbors
• SVM
• Bayesian Regression
• LGBM
• …
Which parameters?
• Gradient Boosted: Criterion, Loss, Min Samples Split, Min Samples Leaf, Others
• Nearest Neighbors: N Neighbors, Weights, Metric, P, Others
• …
Iterate
• Gradient Boosted with Mileage, Car brand, Year of make → Model
• Nearest Neighbors with Car brand, Year of make, Condition → Model
• …
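The three questions multiply, which is why manual iteration is slow. The sketch below counts the candidate configurations for a deliberately small search space. Feature, algorithm, and hyperparameter names follow the slide; the candidate values in `param_grids` are hypothetical, and only two of the listed algorithms are included.

```python
from itertools import combinations, product

features = ["mileage", "condition", "car_brand", "year_of_make", "regulations"]

# Hyperparameter names follow the slide; the candidate values are made up.
param_grids = {
    "gradient_boosted": {"min_samples_split": [2, 4, 8], "min_samples_leaf": [1, 2]},
    "nearest_neighbors": {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
}


def candidate_configs():
    """Yield every (algorithm, params, feature_subset) combination."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            for algo, grid in param_grids.items():
                names = list(grid)
                for values in product(*(grid[n] for n in names)):
                    yield algo, dict(zip(names, values)), subset


total = sum(1 for _ in candidate_configs())
print(total)  # 31 feature subsets x 12 settings = 372
```

Even this toy space needs 372 training runs for an exhaustive search; adding algorithms or parameter values grows the count multiplicatively.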

Introducing Automated Machine Learning
Inputs: Dataset, Optimization Metric, Constraints (Time/Cost)
Automated ML → ML Model
Accessible & Faster

Automated ML Accelerates Model Development
Input: Enter data, Define goals, Apply constraints
Intelligently test multiple models in parallel
Output: Optimized model
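The input, constraint, and output roles can be sketched as a naive search loop. This is not how Azure Automated ML works internally and it does not use the Azure SDK; it is a sequential stand-in that evaluates candidates until a trial budget is exhausted and keeps the best one. The names `constrained_search` and `toy_error`, and the use of a trial count as the constraint, are assumptions.

```python
from typing import Callable, Iterable, Optional, Tuple


def constrained_search(
    candidates: Iterable[dict],
    evaluate: Callable[[dict], float],
    max_trials: int,
) -> Tuple[Optional[dict], float]:
    """Evaluate at most `max_trials` candidates and return the one with
    the lowest metric value (for example, a validation error)."""
    best_config, best_score = None, float("inf")
    for trial, config in enumerate(candidates):
        if trial >= max_trials:  # constraint reached: stop searching
            break
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Toy stand-in for "train a model and measure its error".
def toy_error(config: dict) -> float:
    return abs(config["n_neighbors"] - 5)


grid = [{"n_neighbors": k} for k in (1, 3, 5, 7, 9)]
print(constrained_search(grid, toy_error, max_trials=2))  # ({'n_neighbors': 3}, 2)
print(constrained_search(grid, toy_error, max_trials=5))  # ({'n_neighbors': 5}, 0)
```

With a budget of two trials the search settles for a worse configuration than with five, which shows the trade-off the time and cost constraints control.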

Automated ML Capabilities
• Based on Microsoft Research
• Brain trained with several million experiments
• Collaborative filtering and Bayesian optimization
• Privacy preserving: no need to “see” the data

Automated ML Capabilities
• ML Scenarios: Classification & Regression
• Integration: Azure Machine Learning, Azure Notebooks, Jupyter Notebooks
• Data Type: Numeric, Text
• Languages: Python SDK for deployment and hosting for inference
• Training Compute: Local Machine, Remote Azure DSVM (Linux), Azure Batch AI
• Transparency: View run history, model metrics
• Scale: Faster model training using multiple cores and parallel experiments

Model Explainability
GA:
• Feature importance as part of training
• Simple UX for feature importance for a selected iteration
• Local feature importance for a given sample
Post GA:
• Importance of raw data columns
• Accuracy and performance improvements
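One common way to estimate feature importance is permutation importance: permute one feature column and measure how much the model's error grows. The sketch below shows that general technique and is not necessarily what Automated ML computes. The pricing model `toy_model` and the sample rows are hypothetical, and the permutation is a fixed one-step rotation rather than a random shuffle so that the result is reproducible.

```python
def mean_abs_error(model, rows, targets):
    """Average absolute difference between predictions and targets."""
    return sum(abs(model(r) - t) for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_features):
    """Increase in error when each feature column is permuted in turn."""
    baseline = mean_abs_error(model, rows, targets)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rotated = column[1:] + column[:1]  # deterministic permutation
        permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, rotated)]
        importances.append(mean_abs_error(model, permuted, targets) - baseline)
    return importances


# Hypothetical model: price depends on mileage (feature 0), ignores feature 1.
def toy_model(row):
    return 20000 - row[0] // 10


rows = [(10000, 1), (50000, 2), (90000, 3)]
targets = [toy_model(r) for r in rows]
print(permutation_importance(toy_model, rows, targets, n_features=2))
```

The first value is positive because the model depends on mileage; the second is zero because the model never reads that column.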

File – New Project
Let’s do it!

1. Download Azure Storage Explorer
2. Save for later
Server: demoml.database.windows.net
User: elopez
3. Free temporary Azure account
User: lab@solvex.com.do
Pass: Dac93748
