@ashishth
• The most trusted and compliant platform
A secure and managed Apache Hadoop and Spark platform for building data lakes in the Cloud
OSS Framework Choices
Security
HA & DR
Storage
Monitoring
Cost Optimization
[Architecture diagram labels]
Sources: Devices & Sensors, Corporate Data, SaaS Data, Web Data
Storage: Data Lake Store Gen 2, Blob Storage
Layers: Speed Layer, ETL, Serving Layer, Hive LLAP
Consumers: Streaming/Real-Time Application; Advanced Analytics & Data Science (Machine Learning; R, Python, APIs); Analytics; Data Exploration; Corporate Reporting; Self-Service BI
Capability | Spark | Pig | Hive
Designed for | ETL | ETL | Data warehousing
Adoption | High, increasing | Low, decreasing | Stable
Number of connectors | Highest | High | High
Languages | Python, R, Scala, Java, SQL | Pig | SQL
Performance | High | Medium | Medium
Capability | Spark Structured Streaming | Storm
Adoption | High, increasing | Decreasing
Event processing guarantee | Exactly once | At least once
Throughput | High | Low
Processing Model | Micro Batch | Real-Time
Latency | High | Low
Event time support | Yes | Yes
Languages | Python, R, Scala, Java, SQL | Java
Capability | Hive LLAP | Spark SQL | Presto
Interactive Query Speed | High | High | Medium
Scale | High | High | Low
Caching | Yes | Yes | Early Support
Result Caching | Yes | No | No
Intelligent Cache Eviction | Yes | No | No
Materialized Views | Yes | No | No
Complex Fact to Fact Joins | Yes | Yes | No
Transactions | Yes | No | No
Query Concurrency | High | Low | Low
Row, Column level security | Yes [Apache Ranger + AAD] | Medium | Medium
Rich end user Tools | Yes | Yes | Yes
Language Support | SQL, UDF | SQL, Scala, Python | SQL
Data Source Connector Support | Storage Handlers | Data Sources | High number of connectors
[Diagram: Spark metadata and Hive metadata in Azure HDInsight 3.6 with Hadoop 2.6 versus Azure HDInsight 4.0 with Hadoop 3.x]
Hive Metastore migration tool: https://azure.microsoft.com/en-us/blog/hdinsight-metastore-migration-tool-open-source-release-now-available/
Capability | ADF | Airflow | Oozie
Service management | Azure PaaS | IaaS VM | HDInsight
Code | JSON | Python | Java
GUI | ADF V2 has great UX | Good UX | Below Average UX
Community | Microsoft | Growing (10893 Stars) | Declining (454 Stars)
On-demand clusters | Yes | No, but extensible | No
Extensibility | Custom action-only | Full, graph + actions | Custom action-only
Pipeline definition | JSON/UX | Python/UX | XML/UX
Devops-first design | Yes | Yes | Yes
Pipeline monitoring | Yes | Yes | Yes
Scheduling | Event, Time | Event | Event, Time
Motivation and benefits
Architecture best practices
Infrastructure best practices
Storage best practices
Data migration best practices
Security and DevOps best practices
https://azure.microsoft.com/en-us/blog/migrating-on-premises-hadoop-infrastructure-to-azure-hdinsight/
[Architecture diagram labels]
Data Sources: Apps, Sensors and devices
Stages: Data Ingestion, Data Prep/Management, Advanced Analytics, BI/Visualization
Consumers: People, Automated Systems, Apps (Web, Mobile, Bots)
Data catalog/Governance/Lineage
Connectors: JDBC, ODBC
Productivity Tools
Enterprise grade add-ons (hybrid, backup, DR, security, performance)
Data movement
Caching
Storage options and tradeoffs
Data Qty | Network Bandwidth: 45 Mbps (T3) | 100 Mbps | 1 Gbps
1 TB | 2 days | 1 day | 2 hours
10 TB | 22 days | 10 days | 1 day
35 TB | 76 days | 34 days | 3 days
80 TB | 173 days | 78 days | 8 days
100 TB | 216 days | 97 days | 10 days
200 TB | 1 year | 194 days | 19 days
500 TB | 3 years | 1 year | 49 days
1 PB | 6 years | 3 years | 97 days
2 PB | 12 years | 5 years | 194 days
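The figures in the table above can be sanity-checked with a back-of-the-envelope calculation. The sketch below is not from the original deck: it assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a fully utilized link, so it reproduces the table's order of magnitude rather than its exact rounding. The names `transfer_time_seconds`, `humanize`, and the `efficiency` parameter are illustrative.

```python
TB = 10**12      # bytes, decimal terabyte (assumption)
MBPS = 10**6     # bits per second
GBPS = 10**9     # bits per second


def transfer_time_seconds(data_bytes, link_bits_per_second, efficiency=1.0):
    """Ideal transfer time; efficiency < 1 derates the link for protocol overhead."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return (data_bytes * 8) / (link_bits_per_second * efficiency)


def humanize(seconds):
    """Render a duration in the coarse units the slide uses."""
    days = seconds / 86_400
    if days >= 365:
        return f"{days / 365:.1f} years"
    if days >= 1:
        return f"{days:.1f} days"
    return f"{seconds / 3600:.1f} hours"


if __name__ == "__main__":
    links = (("45 Mbps (T3)", 45 * MBPS), ("100 Mbps", 100 * MBPS), ("1 Gbps", GBPS))
    for size_tb in (1, 10, 100):
        row = [humanize(transfer_time_seconds(size_tb * TB, rate)) for _, rate in links]
        print(f"{size_tb} TB:", ", ".join(row))
```

Passing an `efficiency` below 1.0 gives a more conservative estimate when the link is shared or the protocol adds overhead, which is usually the case in practice.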
Network Transfer with TLS
• Over Internet
• ExpressRoute
• Data Box online transfer
Shipping data offline
• Import / Export service
• Data Box offline data transfer
https://github.com/alkohli/azure-docs-pr/blob/4023eb52cc6ed103e0fa7e794e039c143b6d2a6a/articles/storage/blobs/data-lake-storage-migrate-on-prem-HDFS-cluster.md
Storage | Type | Latency | Consistency | Workloads | Bandwidth | Key Benefits
ADLS Gen 1 | Hierarchical | 10-100ms | Low | HDInsight 3.6 (No HBase) | High | Atomic Rename, File Folder level ACLs
ADLS Gen 2 | Hierarchical | 10-50ms | Medium | HDInsight 3.6 & 4.0 | Unconstrained | Atomic Rename, File Folder level ACLs
Standard BLOB | Object Store | 10-50ms | Medium | HDInsight 3.6 & 4.0 | Unconstrained | Mature
Premium BLOB | Object Store | ~5ms | High | HBase in Preview | Unconstrained | Fast
Premium Managed Disks | Hierarchical | ~5ms | High | Kafka, HBase in preview | Based on disk | Consistent
Scenario | Supported | Workaround
HDInsight 3.6 & 4.0 with Standard Blob as primary and/or secondary | Yes |
HDInsight 3.6 & 4.0 with ADLS Gen2 as primary | Yes |
HDInsight 3.6 & 4.0 with ADLS Gen2 as primary & Blob as additional | Yes |
HDInsight 3.6 & 4.0 with Blob as primary & ADLS Gen2 as additional | No |
HDInsight 3.6 with multiple ADLS Gen2 accounts | Yes |
HDInsight 3.6 & 4.0 with ADLS Gen1 and ADLS Gen 2 | No | Distcp across two clusters
HDInsight 4.0 with ADLS Gen 1 | No | Distcp across two clusters
[Diagram labels: Client (Put, Delete, Get); RegionServer with three Regions; Log Flusher; six Store Files (HFiles); WASB]
Write path challenges with Write Ahead Log
[Diagram labels: Client (Put, Update, Get, Delete); RegionServer; Log Flusher; WASB; Insert, Update, Get, Delete; Sync Operation]
• Inconsistent latencies
• High latencies
Introducing Premium Managed Disk for WAL
[Diagram labels: Client (Put, Update, Get, Delete); RegionServer; Log Flusher; Premium Managed Disk(s); Insert, Update, Get, Delete; Sync Operation]
• Consistent latencies
• Low latencies
• Data durability
[Diagram labels: Client (Put, Delete, Get); RegionServer with three Regions; Log Flusher; six Store Files (HFiles); WASB; Premium Managed Disk(s)]
[Diagram labels: Client (Put, Delete, Get); RegionServer with three Regions; Log Flusher; six Store Files (HFiles); Premium Blob; Premium Managed Disk(s); Local SSD + DRAM]
Cluster Type | Operation | Row Size | # ops | # Region Servers | Region Server Node Size | # Clients | Throughput | Avg Latency (ms) | Run Time (min)
Standard | Write | 1KB | 107,374,182 | 4 | Standard_D4_V2 | 2 | 37,958 | 0.417 | 47
Premium WAL | Write | 1KB | 107,374,182 | 4 | Standard_D4_V2 | 2 | 57,812 | 0.271 | 31
Standard | Small Write | 100 Bytes | 1,073,741,824 | 4 | Standard_DS4_V2 | 2 | 84,910 | 0.186 | 210
Premium WAL | Small Write | 100 Bytes | 1,073,741,824 | 4 | Standard_DS4_V2 | 2 | 701,234 | 0.016 | 25
Standard | Read | 100 Bytes | 925,075 | 4 | Standard_D4_V2 | 2 | 256 | 62 | 60
Premium WAL & Premium Blob | Read | 100 Bytes | 33,503,676 | 4 | Standard_D4_V2 | 2 | 9,306 | 1.7 | 60
Standard | Large Read | 1K | 945,682 | 4 | Standard_D4_V2 |  | 262 | 61 | 60
Premium WAL & Premium Blob | Large Read | 1K | 24,846,209 | 4 | Standard_D4_V2 | 2 | 6,901 | 2.3 | 60
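To make the benchmark above easier to compare, the following sketch computes the premium-over-standard gain for each operation. It is an illustration rather than part of the original deck: the `Run` class and `premium_gains` function are hypothetical names, and the two latency values the slide prints as "0..271" and "0..186" are read here as 0.271 and 0.186, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    operation: str
    storage: str           # "standard" or "premium"
    throughput: float      # value from the Throughput column
    avg_latency_ms: float  # value from the Avg Latency (ms) column


# Values copied from the benchmark table on the slide.
RUNS = [
    Run("Write", "standard", 37_958, 0.417),
    Run("Write", "premium", 57_812, 0.271),        # slide prints "0..271" (assumed typo)
    Run("Small Write", "standard", 84_910, 0.186),  # slide prints "0..186" (assumed typo)
    Run("Small Write", "premium", 701_234, 0.016),
    Run("Read", "standard", 256, 62),
    Run("Read", "premium", 9_306, 1.7),
    Run("Large Read", "standard", 262, 61),
    Run("Large Read", "premium", 6_901, 2.3),
]


def premium_gains(runs):
    """Return {operation: (throughput_ratio, latency_ratio)}, premium vs. standard."""
    by_operation = {}
    for run in runs:
        by_operation.setdefault(run.operation, {})[run.storage] = run
    gains = {}
    for operation, pair in by_operation.items():
        std, prem = pair["standard"], pair["premium"]
        gains[operation] = (
            prem.throughput / std.throughput,          # higher is better
            std.avg_latency_ms / prem.avg_latency_ms,  # higher is better
        )
    return gains


if __name__ == "__main__":
    for operation, (tput, lat) in premium_gains(RUNS).items():
        print(f"{operation:12s} throughput x{tput:.1f}, latency reduced x{lat:.1f}")
```

On these numbers the premium configuration improves throughput by roughly 1.5x for 1 KB writes and roughly 36x for 100-byte reads, which suggests the read path benefits most from Premium Blob.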
Workload | Caching Options | Key benefits
Spark | Spark IO Cache | Up to ~8 to 10x perf improvements
HBase & Phoenix | Bucket cache | Up to 5-10x perf gains on recently read or written data
Hive + LLAP | LLAP Intelligent cache/Result Cache | Up to ~4-100X gain on cached data
Azure Data Lake Storage
INSTANCE | CORE | RAM | TEMP SSD
D1 v2 | 1 | 3.50 GiB | 50 GiB
D2 v2 | 2 | 7.00 GiB | 100 GiB
D3 v2 | 4 | 14.00 GiB | 200 GiB
D4 v2 | 8 | 28.00 GiB | 400 GiB
D5 v2 | 16 | 56.00 GiB | 800 GiB
• Significant Spark performance speed up with IO cache (up to 9X perf gains)
• Automatic cache resource management
• DRAM + Temp SSD makes large cache pool
HDInsight Cluster
Gateways
Head Node 1 Head Node 2
Worker Node Worker Node Worker Node Worker Node
Zookeeper1
Zookeeper1
Zookeeper1
Hive Metastore
YARN
https://cluster.azurehdinsight.net/APIs
Workload | DR Option
Spark / Hive | Manual, Partner solution
HBase | HBase replication, Snapshot export, Import Export, Copy Tables
Kafka | Mirror Maker
Active / Hot standby
[Diagram: two sites, Active and Hot standby, each with Ingest, Process, Publish]
RPO: Low
RTO: None
Cost: High
Active / Cold standby
[Diagram: Active site with Ingest, Process, Publish; Replication; Cold standby]
RPO: Medium
RTO: Medium
Cost: High
Active / DR - Cloud Storage
[Diagram: Active site with Ingest, Process, Publish; Replication; DR - Cloud Storage]
RPO: Highest
RTO: Highest
Cost: Lowest
Effort: Highest
Active / Active
[Diagram: two Active sites, each with Ingest, Process, Publish]
RPO: Lowest
RTO: None
Cost: Highest
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-mirroring
https://github.com/anagha-microsoft/hdi-spark-dr
https://github.com/anagha-microsoft/hdi-kafka-dr
https://docs.microsoft.com/en-us/azure/hdinsight/hbase/apache-hbase-backup-replication
Virtual Network (10.1.0.0/16)
HDInsight Cluster in Subnet (10.1.1.0/24)
Gateways
Head Node 1
Head Node 2
Worker Node Worker Node Worker Node Worker Node
Allow VNet
(10.1.0.0/16)
Allow VNet
(10.1.0.0/16)
Hive Metastore
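The address ranges in this diagram can be checked with Python's standard `ipaddress` module. The sketch below is a simplification, not an Azure API call: it models only the "Allow VNet" rule shown on the slide, and `is_allowed` is a hypothetical helper. A real network security group carries additional rules that are not modeled here.

```python
import ipaddress

# Address ranges taken from the slide.
VNET = ipaddress.ip_network("10.1.0.0/16")
CLUSTER_SUBNET = ipaddress.ip_network("10.1.1.0/24")


def is_allowed(source_ip, allowed_range=VNET):
    """Model the 'Allow VNet' rule: accept traffic only from inside the VNet."""
    return ipaddress.ip_address(source_ip) in allowed_range


if __name__ == "__main__":
    # The cluster subnet must sit inside the virtual network's address space.
    print("subnet inside VNet:", CLUSTER_SUBNET.subnet_of(VNET))
    for ip in ("10.1.1.4", "10.1.200.9", "10.2.0.1", "52.1.2.3"):
        print(ip, "allowed" if is_allowed(ip) else "denied")
```

Under this rule, any address in 10.1.0.0/16 reaches the secured resource, including addresses outside the cluster subnet itself, while everything else is rejected.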
Scenario | Authorizing Component
Yarn: Submit-App | Apache Ranger: Yarn Plugin
Hive Operations: Create, Select, Update, Drop, Index, Lock, Read, Write, Masking, Row level filter on Hive Database, Table & Columns | Apache Ranger: Hive Plugin
Create/Alter Table with storage location reference | Apache Ranger + ADLS Gen 2 ACLs
Spark SQL access with Hive Metastore | Apache Ranger: Hive Plugin
HBase Access Policies | Apache Ranger: HBase Plugin
Kafka Access Policies | Apache Ranger: Kafka Plugin
Access Azure Data Lake Storage Gen2 using the Spark DataFrame API | ADLS Gen 2 ACLs
Access Azure Data Lake Storage Gen2 using the RDD API | ADLS Gen 2 ACLs
HDFS operations: mkdir, ls, put, copyFromLocal, get, cat, mv, cp etc. | ADLS Gen 2 ACLs
Running MapReduce jobs | ADLS Gen 2 ACLs
Apache Ambari
Azure Log Analytics Integration
HDInsight Cluster Metrics
INSTANCE | VCPU | RAM | TEMPORARY STORAGE | PAYG
D14 v2 | 16 | 112.00 GiB | 800 GiB | $1.196/hour
E16 v3 | 16 | 128.00 GiB | 400 GiB | $1.064/hour
12.5%
11%
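The two percentages on this slide are unlabeled. The sketch below works through the arithmetic under one plausible reading, which is an assumption: 11% is the hourly price saving of the E16 v3 over the D14 v2, and 12.5% is the extra 16 GiB of RAM expressed as a share of the E16 v3's 128 GiB. The variable and function names are illustrative.

```python
# Instance figures copied from the pricing table on the slide.
D14_V2 = {"vcpu": 16, "ram_gib": 112.0, "temp_gib": 800, "usd_per_hour": 1.196}
E16_V3 = {"vcpu": 16, "ram_gib": 128.0, "temp_gib": 400, "usd_per_hour": 1.064}


def price_saving_pct(old, new):
    """Hourly price reduction when moving from `old` to `new`, in percent."""
    return (old["usd_per_hour"] - new["usd_per_hour"]) / old["usd_per_hour"] * 100


def extra_ram_share_pct(old, new):
    """Additional RAM as a share of the new instance's total RAM, in percent."""
    return (new["ram_gib"] - old["ram_gib"]) / new["ram_gib"] * 100


def extra_ram_growth_pct(old, new):
    """Additional RAM relative to the old instance's RAM, in percent."""
    return (new["ram_gib"] - old["ram_gib"]) / old["ram_gib"] * 100


if __name__ == "__main__":
    print(f"price saving:       {price_saving_pct(D14_V2, E16_V3):.1f}%")
    print(f"extra RAM (of new): {extra_ram_share_pct(D14_V2, E16_V3):.1f}%")
    print(f"extra RAM (vs old): {extra_ram_growth_pct(D14_V2, E16_V3):.1f}%")
```

Measured against the D14 v2's 112 GiB instead, the RAM increase comes to about 14.3%, so the way the comparison is framed changes the headline number. Note also that the E16 v3 has half the temporary storage, which matters for cache-heavy workloads.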

HDInsight for Architects

Editor's Notes

  1. Azure HDInsight is a secure and managed platform for building data lakes on Azure based on the Apache Hadoop and Spark frameworks. So, what does HDInsight have to offer?
     Reliable open source analytics with an industry-leading SLA. HDInsight allows you to easily spin up open source cluster types guaranteed with the industry's best 99.9% SLA and 24/7 support. We guarantee this SLA for the entire big data solution, not just the VM instances. HDInsight is architected for full redundancy and high availability, including head node replication, data geo-replication, and a built-in standby NameNode, making HDInsight resilient to critical failures not addressed in standard Hadoop implementations. Azure also offers cluster monitoring and 24x7 enterprise support backed by Microsoft and Hortonworks, with 37 combined committers for Hadoop core (more than all other managed cloud providers combined) to support your deployment, and the ability to fix and commit code back to Hadoop.
     Enterprise-grade security and monitoring. HDInsight protects your data assets and easily extends your on-premises security and governance controls to the cloud. We feature single sign-on (SSO), multi-factor authentication, and seamless management of millions of identities through Azure Active Directory. You can authorize users and groups with fine-grained access control policies over all your enterprise data with Apache Ranger. HDInsight meets HIPAA, PCI, and SOC compliance, ensuring your enterprise data assets are always protected with the highest security and regulatory compliance. To ensure the highest level of business continuity, HDInsight extends capabilities for alerting, monitoring, defining pre-emptive actions, and enhanced workload protection through native integration with Azure Operations Management Suite (OMS).
     Most productive platform for developers and scientists. HDInsight offers developers tailored experiences through rich productivity suites for Hadoop and Spark, with integrated development environments using Visual Studio, Eclipse, and IntelliJ supporting Scala, Python, R, Java, and .NET. HDInsight gives data scientists the ability to create narratives that combine code, statistical equations, and visualizations that tell a story about the data through integration with the two most popular notebooks: Jupyter and Zeppelin. HDInsight is also the only managed cloud Hadoop solution with integration to Microsoft R Server. Multi-threaded math libraries and transparent parallelization in R Server mean handling up to 1000x more data at up to 50x faster speeds than open source R, helping you train more accurate models for better predictions than previously possible.
     Cost-effective cloud scale. HDInsight has decoupled compute and storage, enabling you to cost-effectively scale workloads up or down, independent of storage. Local storage can still be used for caching and fast I/O. Spark and interactive Hive users can choose SSD memory for interactive performance, while Kafka users can retain all streaming data in premium managed disks. You only pay for the compute and storage you use, and you can choose the Azure VM types that enable the best utilization of resources. A recent study showed HDInsight delivering 63% lower TCO than deploying Hadoop on premises over 5 years.*
     Integration with leading productivity applications. In the broader ecosystem for Hadoop, there is a thriving market of independent software vendors (ISVs) who provide value-added solutions. Through a unique design where every cluster is extended with edge nodes and script actions, HDInsight lets customers spin up Hadoop and Spark clusters pre-integrated and pre-tuned with any ISV application out of the box. Datameer, Cask, AtScale, and StreamSets are a few such applications that are very popular on the HDInsight platform today.
     Easy for administrators to manage. With HDInsight, administrators can deploy Hadoop in the cloud without buying new hardware or incurring other up-front costs. There is no time-consuming installation or setup, and no need to patch the operating system or upgrade the Hadoop versions: Azure does it for you. Launch your first cluster in minutes.
  2. Build 2015
  3. The new world of HDInsight 4.0 with Hadoop 3.0 brings the Spark and Hive worlds closer together. Let's see how. Before Hadoop 3.0, the Spark executors would directly access the Hive metastore. While on the surface this seems like a fine thing to do, it is rife with problems. The new architecture instead requires explicit registration of Hive transactional tables as Spark external tables through Hive Warehouse Connector. While it adds one extra step during configuration, this approach greatly increases the reliability of data access. Hive Warehouse Connector supports efficient predicate pushdown and Apache Arrow-based communication between Spark executors and Hive LLAP daemons, which keeps the communication overhead between the two systems small. With Hive Warehouse Connector, Apache Spark on HDInsight 4.0 gets mature transactional capabilities. The new integration between Apache Spark and Hive LLAP in HDInsight 4.0 delivers new capabilities for business analysts, data scientists, and data engineers. Business analysts get a performant SQL engine in the form of Hive LLAP (Interactive Query), while data scientists and data engineers get a great platform for ML experimentation and ETL with Apache Spark over transactional data in Hive tables.
  4. Reference https://azure.microsoft.com/en-us/blog/deploying-apache-airflow-in-azure-to-build-and-run-data-pipelines/
  5. Build 2015
  6. Transfer data over network with TLS
     • Over internet: You can transfer data to Azure storage over a regular internet connection using any one of several tools, such as Azure Storage Explorer, AzCopy, Azure PowerShell, and Azure CLI. See Moving data to and from Azure Storage for more information.
     • ExpressRoute: ExpressRoute is an Azure service that lets you create private connections between Microsoft datacenters and infrastructure that's on your premises or in a colocation facility. ExpressRoute connections do not go over the public Internet, and offer higher security, reliability, and speeds with lower latencies than typical connections over the Internet. For more information, see Create and modify an ExpressRoute circuit.
     • Data Box online data transfer: Data Box Edge and Data Box Gateway are online data transfer products that act as network storage gateways to manage data between your site and Azure. Data Box Edge, an on-premises network device, transfers data to and from Azure and uses artificial intelligence (AI)-enabled edge compute to process data. Data Box Gateway is a virtual appliance with storage gateway capabilities. For more information, see Azure Data Box Documentation - Online Transfer.
     Shipping data offline
     • Import/Export service: You can send physical disks to Azure and they will be uploaded for you. For more information, see What is Azure Import/Export service?
     • Data Box offline data transfer: Data Box, Data Box Disk, and Data Box Heavy devices help you transfer large amounts of data to Azure when the network isn't an option. These offline data transfer devices are shipped between your organization and the Azure datacenter. They use AES encryption to help protect your data in transit, and they undergo a thorough post-upload sanitization process to delete your data from the device. For more information, see Azure Data Box Documentation - Offline Transfer.
  7. Before I describe specific capabilities and value propositions of HDInsight, let us take a quick look at the architecture of an HDInsight cluster. We will build upon this when we talk about security later in the presentation. First off, a key difference between an on-premises Hadoop cluster and an HDInsight cluster is that with HDInsight, the storage and compute layers are separated. This allows storage and compute to be scaled independently of each other. We have seen in numerous customer cases that trying to combine storage and compute on a single cluster often leads to underutilization of one, the other, or both. With HDInsight, you can keep loading data into Azure Data Lake Storage Gen1 or Gen2, or into WASB, and you can create small or large clusters as and when needed. Each HDInsight cluster comes with 2 gateway nodes, 2 head nodes, and 3 ZooKeeper nodes. In most cases, these are free of charge. As we will discuss later, we provision multiples of these nodes to ensure high availability. Each HDInsight cluster lives within a VNET. The gateway nodes are the ONLY public endpoints accessible from outside the VNET. As we will see later, this architecture allows you to securely lock down your HDInsight cluster.
  8. Let’s start with network security. Previously, you could inject an HDInsight cluster into a VNET and secure access to it from the public internet using NSG firewalls. Now you can ensure that any resources the cluster needs to access, e.g. Azure Storage accounts, Hive metastores etc., can themselves be secured. With the new service endpoint capability, Azure resources such as Azure Storage, Azure DB, Cosmos DB etc. can be secured via service endpoints. HDInsight now integrates with this capability. Let me walk you through how this would work. [WALK THROUGH THE ANIMATION]