@ashishth
Free
Proven @ Scale
No Lock-in
Many Options
Not really Free
Operationalization is Hard
Expertise
60% of Big data projects will fail*
*According to Gartner, 60% of Advanced Analytics projects will fail in 2017
Cloud
Cloud Optimizations/Security
Node 1 Node 2 Node 3 Node 4
HDFS scales by adding more nodes
HDInsight cluster
Network
Azure BLOB Store/Azure Data Lake Store (Storage)
HDInsight Spark/Hive/MR cluster
1. Create cluster
2. Submit jobs
3. Drop cluster
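The create, submit, drop lifecycle above can be sketched in a few lines of Python. This is a minimal illustration only: `InMemoryClusterService`, `transient_cluster`, and `run_pipeline` are hypothetical names invented for the example, and they do not correspond to any Azure SDK or Data Factory API. The point is the shape of the pattern: the cluster exists only for the duration of the jobs.

```python
from contextlib import contextmanager


class InMemoryClusterService:
    """Hypothetical stand-in for a cluster provisioning service (not a real Azure API)."""

    def __init__(self):
        self.active = set()   # names of clusters that currently exist
        self.events = []      # ordered record of what happened

    def create_cluster(self, name):
        self.active.add(name)
        self.events.append(("create", name))

    def submit_job(self, name, job):
        if name not in self.active:
            raise RuntimeError(f"cluster {name!r} is not running")
        self.events.append(("submit", job))
        return f"{job}: done"

    def drop_cluster(self, name):
        self.active.discard(name)
        self.events.append(("drop", name))


@contextmanager
def transient_cluster(service, name):
    """Create a cluster, hand it to the caller, and always drop it afterwards."""
    service.create_cluster(name)
    try:
        yield name
    finally:
        # Dropping in `finally` means a failed job does not leave the cluster running.
        service.drop_cluster(name)


def run_pipeline(service, name, jobs):
    """Run all jobs on a cluster that lives only for the duration of the call."""
    with transient_cluster(service, name) as cluster:
        return [service.submit_job(cluster, job) for job in jobs]
```

Because the data lives in external storage rather than on the cluster, dropping the cluster after the last job loses nothing but the compute.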
HDInsight orchestration with Azure Data Factory (ADF)
Source 1, Source 2, Source 3, … Source n → HDInsight Cluster → Result Set 1, Result Set 2, Result Set 3, … Result Set n
DataLakeProbe
HBaseHealthProbe
HBaseMetricsProbe
HBaseProbe
HdfsProbe
HdinsightZookeeperProbe
……..
EdgenodeSSHWatchdog
GatewayTCPPingWatchdog
SSHTCPPingWatchdog
RStudioWatchdog
CertRolloverWatchdog
JobSubmissionPingWatchdog
OozieWatchdog
DataNodesUpWatchdog
NodeManagersUpWatchdog
ResourceHealthWatchdog
AzureNodeStatusWatchdog
ClusterMALoggingHashWatchdog
ClusterAvailabilityWatchdog
ClusterHealthWatchdog
……..
namenode_ha_health
ams_metrics_collector_process
ams_metrics_collector_autostart
ams_metrics_collector_hbase_master_process
namenode_last_checkpoint
namenode_webui
increase_nn_heap_usage_daily
hive_metastore_process
ambari_server_stale_alerts
ambari_server_agent_heartbeat
metrics_monitor_process_percent
……….
Ingest → Transform → Convert to ORC/Parquet → Load to Relational Store → Serve
Time
• Hive Low Latency and Analytical Processing (LLAP)
• Serves queries directly from Azure BLOB/ADLS
• Works with TEXT, JSON, CSV, TSV, ORC, Parquet
• Super fast performance with TEXT data
• Modern scalable query concurrency architecture
• Security with Apache Ranger and Active Directory
Ingest → Serve
Time
HDInsight Interactive Query architecture
Memory + SSD cache
Intelligent cache
Automatically reacting to changes in underlying data
o Shared cache between queries
o Cache eviction is based on source file last modified date
o Every query will check the modified date, and reload if a new file has arrived
DRAM
SSD
ADLS/BLOBStore
Updates
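The invalidation rule described above (shared cache, eviction driven by the source file's last modified date, checked on every read) can be sketched as follows. This is an illustrative model only, not LLAP's actual cache: the class name `ModifiedTimeCache` and its injectable `get_mtime`/`load` parameters are assumptions made for the example, and it does not model the DRAM/SSD tiers.

```python
import os


def _read_bytes(path):
    """Default loader: read the whole file from local disk."""
    with open(path, "rb") as f:
        return f.read()


class ModifiedTimeCache:
    """Shared cache whose entries are invalidated by the source's last-modified time."""

    def __init__(self, get_mtime=os.path.getmtime, load=_read_bytes):
        # Both callables are injectable so a remote store could be plugged in (assumption).
        self._get_mtime = get_mtime
        self._load = load
        self._entries = {}  # path -> (mtime seen at load time, cached data)
        self.loads = 0      # number of times the source was actually read

    def read(self, path):
        # Every read checks the current modified time of the source.
        mtime = self._get_mtime(path)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]  # unchanged source: serve from cache
        # New or changed source: reload and replace the stale entry.
        data = self._load(path)
        self._entries[path] = (mtime, data)
        self.loads += 1
        return data
```

Two queries that read the same unchanged file share one load; a rewrite of the file changes its modified time, so the next read reloads it.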
• LLAP, Spark, and Presto against 1 TB of data derived from the TPC-DS benchmark
• Out-of-the-box HDInsight configuration
• 45 queries derived from the TPC-DS benchmark that ran successfully on all engines
• We used a number of different concurrency levels to test concurrency performance
• 99 queries on 1 TB of data with a 32-worker-node cluster, with max concurrency set to 32.
Test 1: Run all 99 queries, 1 at a time - Concurrency = 1
Test 2: Run all 99 queries, 2 at a time - Concurrency = 2
Test 3: Run all 99 queries, 4 at a time - Concurrency = 4
Test 4: Run all 99 queries, 8 at a time - Concurrency = 8
Test 5: Run all 99 queries, 16 at a time - Concurrency = 16
Test 6: Run all 99 queries, 32 at a time - Concurrency = 32
Test 7: Run all 99 queries, 64 at a time - Concurrency = 64
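A test plan like the one above can be driven by a small harness. The sketch below is an assumption about how such a run might be structured, not the harness used for these results: `run_query` is a hypothetical callable that the caller supplies (for example, one that submits a query to the engine under test), and a thread pool caps the number of queries in flight.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Concurrency levels from Test 1 through Test 7.
CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 32, 64]


def run_at_concurrency(queries, concurrency, run_query):
    """Run every query with at most `concurrency` in flight.

    Returns (results in query order, elapsed wall-clock seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(run_query, queries))
    return results, time.perf_counter() - start


def run_all_tests(queries, run_query, levels=CONCURRENCY_LEVELS):
    """Repeat the full query set once per concurrency level and record the timings."""
    timings = {}
    for level in levels:
        _, elapsed = run_at_concurrency(queries, level, run_query)
        timings[level] = elapsed
    return timings
```

Measuring wall-clock time for the whole query set at each level makes the runs directly comparable across engines.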
Azure HDInsight
Analyst
Power User
Data Engineer
Data Scientist
Use IntelliJ to run and debug Spark applications remotely on an HDInsight cluster at any time. Developers can inspect variables, watch intermediate data, step through code, and finally edit the app and resume execution, all against Azure HDInsight clusters with cluster data.
Set a breakpoint for both driver and executor code. Debugging executor code lets developers detect data-related errors by viewing RDD intermediate values, tracking distributed task operations, and stepping through execution units.
Set a breakpoint in Spark external libraries, allowing developers to step into Spark code and debug in the Spark framework.
View both driver and executor code execution logs in the console panel.
• Interactive responses bring together the best properties of Python and Spark, with the flexibility to execute one or multiple statements.
• Built-in Python language services such as IntelliSense auto-suggest, autocomplete, and error markers, among others.
• Preview and export your PySpark interactive query results to CSV, JSON, and Excel formats.
• Integration with Azure for HDInsight cluster management and query submissions.
• Link with Spark UI and YARN UI for further troubleshooting.
Perimeter level security
Authentication
Authorization
Data security
Roadmap US Gov
OMS Agent for Linux
HDInsight nodes (Head, Worker, Zookeeper)
FluentD HDInsight plugin
1. Plugin for `in_tail` for all logs; allows a regexp to create a JSON object
2. Filter for WARN and above for each log type, using the `grep` filter plugin
3. Output to the out_oms_api type
4. Exec plugin for metrics
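The first three steps above (tail a log, turn each line into a JSON object with a regexp, forward only WARN and above) can be illustrated in plain Python. This is a sketch of the idea, not Fluentd configuration or the plugin's code: the log line layout, the regular expression, the set of forwarded levels, and the function names are all assumptions chosen for the example.

```python
import json
import re

# Assumed log layout: "2017-06-01 12:00:00,123 WARN some.Logger: message"
LOG_PATTERN = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<logger>\S+): "
    r"(?P<message>.*)$"
)

# "WARN and above" for this assumed set of level names.
FORWARDED_LEVELS = {"WARN", "ERROR", "FATAL"}


def parse_line(line, log_type):
    """Turn one raw log line into a dict, or None if it does not match the pattern."""
    match = LOG_PATTERN.match(line.rstrip("\n"))
    if match is None:
        return None
    record = match.groupdict()
    record["log_type"] = log_type  # tag the record with its source, e.g. "hbase"
    return record


def forward(lines, log_type):
    """Parse lines, keep WARN and above, and return them as JSON strings."""
    out = []
    for line in lines:
        record = parse_line(line, log_type)
        if record is not None and record["level"] in FORWARDED_LEVELS:
            out.append(json.dumps(record))
    return out
```

Filtering before output keeps INFO-level noise on the node, so only the records worth alerting on reach the analytics service.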
HBase Config
Spark Config
Hive/LLAP Config
Storm Config
Kafka Config
osmconfig
Log Analytics (OMS) Service
HDInsight Log Analytics Architecture
Azure HDInsight