@ashishth
LLAP
[Diagram: HDInsight cluster connected over the network to Azure BLOB Store / Azure Data Lake Store]
[Diagram: Hadoop, Spark, LLAP, and Presto clusters connected over the network to Azure BLOB Store / Azure Data Lake Store]
[Diagram: On Demand Processing Clusters and Data Serving Clusters, both using Azure BLOB Store / Azure Data Lake Store and a Common Hive Metastore]
[Diagram: HDInsight Spark/Hive/MR cluster between two storage endpoints]
1. Create cluster
2. Submit jobs
6. Drop cluster jobs
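The create, submit, and drop steps shown on the slide can be sketched as a context manager that guarantees the cluster is dropped even if a job fails. This is only an illustration: `FakeClusterClient`, `on_demand_cluster`, and `run_jobs` are hypothetical names, the client is an in-memory stand-in rather than any Azure API, and the steps between 2 and 6 are not shown in the source and are not reconstructed here.

```python
from contextlib import contextmanager


class FakeClusterClient:
    """Hypothetical in-memory stand-in for a cluster provisioning API."""

    def __init__(self):
        self.active = set()   # names of clusters that currently exist
        self.log = []         # ordered record of lifecycle events

    def create(self, name):
        self.active.add(name)
        self.log.append(("create", name))

    def submit(self, name, job):
        if name not in self.active:
            raise RuntimeError(f"cluster {name!r} does not exist")
        self.log.append(("submit", job))

    def drop(self, name):
        self.active.discard(name)
        self.log.append(("drop", name))


@contextmanager
def on_demand_cluster(client, name):
    """Create a cluster, hand it to the caller, and always drop it afterwards."""
    client.create(name)
    try:
        yield name
    finally:
        client.drop(name)


def run_jobs(client, name, jobs):
    """Run every job on a short-lived cluster and return the event log."""
    with on_demand_cluster(client, name) as cluster:
        for job in jobs:
            client.submit(cluster, job)
    return client.log
```

Because the data and the Hive metastore live outside the cluster, dropping the cluster in the `finally` branch loses no state in this model.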
[Diagram: Azure BLOB Store / Azure Data Lake Store with a Common Hive Metastore]
[Diagram: Azure HDInsight used by Analyst, Power User, Data Engineer, and Data Scientist]
Ingest → Transform → Convert to ORC/Parquet → Load to Relational Store → Serve
Ingest → Transform → Convert to ORC/Parquet → Load to Relational Store → Serve (stages laid out along a time axis)
Ingest → Serve (stages laid out along a time axis)
• Hive Low Latency Analytical Processing (LLAP)
• Serves queries directly from Azure BLOB/ADLS
• Works with TEXT, JSON, CSV, TSV, ORC, Parquet
• Super fast performance with TEXT data
• Modern scalable query concurrency architecture
• Security with Apache Ranger and Active Directory
HDInsight Interactive Query architecture
Memory + SSD cache
Intelligent cache
Automatically reacts to changes in underlying data
o Shared cache between queries
o Cache eviction is based on source file last modified date
o Every query checks the modified date and reloads if a new file has arrived
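The eviction rule described above can be sketched in a few lines. This is a minimal illustration, not the LLAP implementation: it assumes local files stand in for ADLS/BLOB objects, and `ModifiedDateCache` is a hypothetical name. Every read checks the source's last-modified time and reloads only when it has changed.

```python
import os


class ModifiedDateCache:
    """Shared cache that reloads an entry when the source file's mtime changes."""

    def __init__(self):
        self._entries = {}   # path -> (mtime_ns seen at load time, cached bytes)
        self.loads = 0       # number of times data was read from the source

    def read(self, path):
        mtime = os.stat(path).st_mtime_ns   # checked on every query
        cached = self._entries.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]                # cache hit: source unchanged
        with open(path, "rb") as f:         # miss or stale entry: reload
            data = f.read()
        self._entries[path] = (mtime, data)
        self.loads += 1
        return data
```

Because one cache instance is shared, a second query on an unchanged file is served from memory, while an updated file is picked up on the next read without any explicit invalidation call.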
[Diagram: DRAM and SSD cache tiers above ADLS/BLOB Store, with updates]
• LLAP, Spark, and Presto compared on 1 TB of data derived from the TPC-DS benchmark
• Out-of-the-box HDInsight configuration
• 45 queries derived from the TPC-DS benchmark that ran successfully on all engines
• We tested concurrency performance at a number of different concurrency levels
• 99 queries on 1 TB of data on a 32-worker-node cluster with max concurrency set to 32
Test 1: Run all 99 queries, 1 at a time - Concurrency = 1
Test 2: Run all 99 queries, 2 at a time - Concurrency = 2
Test 3: Run all 99 queries, 4 at a time - Concurrency = 4
Test 4: Run all 99 queries, 8 at a time - Concurrency = 8
Test 5: Run all 99 queries, 16 at a time - Concurrency = 16
Test 6: Run all 99 queries, 32 at a time - Concurrency = 32
Test 7: Run all 99 queries, 64 at a time - Concurrency = 64
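The test matrix above can be driven by a small harness like the following. It is a sketch under stated assumptions: `fake_query` is a hypothetical placeholder that sleeps instead of submitting SQL, and a thread pool is used to cap the number of queries in flight; the source does not say how the actual tests were driven.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 32, 64]   # Tests 1-7 from the slide
NUM_QUERIES = 99


def run_level(run_query, concurrency, num_queries=NUM_QUERIES):
    """Run all queries with at most `concurrency` in flight; return stats."""
    lock = threading.Lock()
    state = {"in_flight": 0, "peak": 0}

    def tracked(query_id):
        with lock:
            state["in_flight"] += 1
            state["peak"] = max(state["peak"], state["in_flight"])
        try:
            return run_query(query_id)
        finally:
            with lock:
                state["in_flight"] -= 1

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(tracked, range(1, num_queries + 1)))
    elapsed = time.perf_counter() - start
    return {"concurrency": concurrency, "completed": len(results),
            "peak_in_flight": state["peak"], "seconds": elapsed}


def fake_query(query_id):
    """Hypothetical placeholder; a real harness would submit SQL to the engine."""
    time.sleep(0.001)
    return query_id


def run_all_levels(run_query=fake_query):
    """Run the full matrix, one concurrency level at a time."""
    return [run_level(run_query, level) for level in CONCURRENCY_LEVELS]
```

Recording the peak number of in-flight queries alongside the wall-clock time makes it easy to confirm that each run really exercised the intended concurrency level.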
Capability                    | Interactive Query         | Spark SQL          | Presto
Interactive Query Speed       | High                      | High               | Medium
Scale                         | High                      | High               | Low
Caching                       | Yes                       | Yes                | Early Support
Intelligent Cache Eviction    | Yes                       | No                 | No
Complex Fact to Fact Joins    | Yes                       | Yes                | No
Transactions                  | Yes                       | No                 | No
Query Concurrency             | High                      | Low                | Low
Row, Column level security    | Yes [Apache Ranger + AAD] | High               | Medium
Rich end user Tools           | Yes                       | Yes                | Yes
Language Support              | SQL, UDF                  | SQL, Scala, Python | SQL
Data Source Connector Support | Storage Handlers          | Data Sources       | High number of connectors
Microsoft Azure Estimate
Your Estimate
Service type | Custom name | Region | Description | Estimated Cost
HDInsight | | East US | Interactive Query Component: 2 A3 (4 cores, 7 GB RAM) Head nodes x 730 Hours, 6 D14V2 (16 cores, 112 GB RAM) Region nodes x 730 Hours, 3 A1 (1 core, 1.75 GB RAM) Zookeeper nodes x 730 Hours, 0 D4V2 (8 cores, 28 GB RAM) Edge nodes x 730 Hours | $7,163.27
Storage | | East US | Block Blob Storage, General Purpose V2, LRS Redundancy, Hot Access Tier, 100 TB Capacity, 10,000,000 Write operations, 100,000 List and Create Container Operations, 99,999,000 Read operations, 9,990,000 Other operations, 500 TB Data Retrieval, 50 TB Data Write | $2,181.82
Support | | | Support | $0.00
Monthly Total | | | | $9,345.09
Annual Total | | | | $112,141.06
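The totals in the estimate can be cross-checked with a few lines of decimal arithmetic. The figures are copied from the estimate; the variable names are illustrative. Multiplying the rounded monthly total by 12 gives a value two cents above the stated annual total, presumably because the calculator annualizes unrounded monthly amounts (an assumption, not something the estimate states).

```python
from decimal import Decimal

# Line items copied from the estimate above (US dollars per month).
LINE_ITEMS = {
    "HDInsight": Decimal("7163.27"),
    "Storage": Decimal("2181.82"),
    "Support": Decimal("0.00"),
}

HOURS_PER_MONTH = 730
# Node counts from the HDInsight line: head, region, ZooKeeper, edge.
NODE_COUNTS = {"head": 2, "region": 6, "zookeeper": 3, "edge": 0}

monthly_total = sum(LINE_ITEMS.values(), Decimal("0"))
annual_from_rounded_monthly = monthly_total * 12
node_hours = sum(NODE_COUNTS.values()) * HOURS_PER_MONTH

print(monthly_total)                 # 9345.09
print(annual_from_rounded_monthly)   # 112141.08
print(node_hours)                    # 8030
```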
Disclaimer
All prices shown are in US Dollar ($). This is a summary estimate, not a quote. For up to date pricing information please visit https://azure.microsoft.com/pricing/calculator/
This estimate was created at 4/13/2018 7:48:34 PM UTC.
DataLakeProbe
HBaseHealthProbe
HBaseMetricsProbe
HBaseProbe
HdfsProbe
HdinsightZookeeperProbe
……..
EdgenodeSSHWatchdog
GatewayTCPPingWatchdog
SSHTCPPingWatchdog
RStudioWatchdog
CertRolloverWatchdog
JobSubmissionPingWatchdog
OozieWatchdog
DataNodesUpWatchdog
NodeManagersUpWatchdog
ResourceHealthWatchdog
AzureNodeStatusWatchdog
ClusterMALoggingHashWatchdog
ClusterAvailabilityWatchdog
ClusterHealthWatchdog
……..
namenode_ha_health
ams_metrics_collector_process
ams_metrics_collector_autostart
ams_metrics_collector_hbase_master_process
namenode_last_checkpoint
namenode_webui
increase_nn_heap_usage_daily
hive_metastore_process
ambari_server_stale_alerts
ambari_server_agent_heartbeat
metrics_monitor_process_percent
……….
[Diagram: OMS Agent for Linux on HDInsight nodes (Head, Worker, Zookeeper), running FluentD with the HDInsight plugin]
1. `in_tail` plugin for all logs; allows a regexp to create a JSON object
2. `grep` filter plugin: filter for WARN and above for each log type
3. Output to the out_oms_api type
4. Exec plugin for metrics
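The first three steps above (tail and parse with a regexp, keep WARN and above, emit JSON) can be modeled in plain Python to show the data flow. This is not FluentD configuration and not the HDInsight plugin's code: the log layout in `LINE_RE`, the severity ordering, and all function names are assumptions made for illustration.

```python
import json
import re

# Assumed log layout: "<date> <time> <LEVEL> <message>"; real logs will differ.
LINE_RE = re.compile(
    r"^(?P<time>\S+ \S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)
SEVERITY = {"TRACE": 0, "DEBUG": 1, "INFO": 2, "WARN": 3, "ERROR": 4, "FATAL": 5}


def parse_line(line, log_type):
    """Step 1: turn a tailed log line into a JSON-ready dict (None if no match)."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    record = match.groupdict()
    record["log_type"] = log_type
    return record


def is_warn_or_above(record):
    """Step 2: keep only WARN and more severe records."""
    return SEVERITY.get(record["level"], -1) >= SEVERITY["WARN"]


def process(lines, log_type):
    """Steps 1-3: parse, filter, and serialize records for an output sink."""
    out = []
    for line in lines:
        record = parse_line(line, log_type)
        if record is not None and is_warn_or_above(record):
            out.append(json.dumps(record, sort_keys=True))
    return out
```

Filtering at the agent, before records leave the node, keeps routine INFO traffic off the network and out of the Log Analytics workspace.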
[Diagram: per-workload configs for HBase, Spark, Hive/LLAP, Storm, and Kafka, sending to the Log Analytics (OMS) Service]
HDInsight Log Analytics Architecture
Perimeter level security
Authentication
Authorization
Data security
Roadmap US Gov
Zero ETL analytics with LLAP in Azure HDInsight