3. Open source analytics service for the Enterprise
Fully managed Hadoop and Spark for the cloud, with a 99.9% SLA
100% open source Hortonworks Data Platform
Clusters up and running in minutes
Familiar BI tools, interactive open source notebooks
Scale clusters on demand
Secure Hadoop workloads via Active Directory and Ranger
Compliance for Open Source bits
Best in class monitoring with Azure Log Analytics
Native Integration with leading ISVs
5. More value to our customers
Up to a 52% price reduction
An additional 80% price reduction for R Server for Azure HDInsight
6. Interactive Query (GA, released Sept ‘17)
Blazing-fast SQL queries on hyper-scale data
https://docs.microsoft.com/en-us/azure/hdinsight/interactive-query/apache-interactive-query-get-started
• Fast interactive SQL queries on petabyte-scale data
• Intelligent caching that leverages local SSDs
• Modern, scalable query-concurrency architecture
• Rich connectivity with the most popular authoring tools
• No data-format conversion needed to get faster results
• Enterprise-grade security and monitoring
7. HDInsight integration with Azure Log Analytics (GA, released Sept ‘17)
Enterprise-grade production monitoring for Hadoop and Spark workloads
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-oms-log-analytics-tutorial
• Monitor all HDInsight clusters and other Azure resources from a single pane of glass.
• Extensible workload-specific dashboards, along with a sophisticated analytical query language for deep analytics.
• Collect and correlate data from multiple open-source services.
• Alert on critical issues with the built-in Log Analytics alerting infrastructure.
• Troubleshoot issues faster by having Hadoop, YARN, Spark, Kafka, HBase, Hive, and Storm logs and metrics in one place.
• Perform rich log exploration with interactive queries.
8. VSCode integration with HDInsight (Public Preview, Sept ‘17)
First-class cross-platform integration with Spark & Hive workloads
User manual: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-for-vscode?branch=pr-en-us-26060
• Interactive responses bring the best properties of Python and Spark, with the flexibility to execute one or multiple statements.
• Built-in Python language services such as IntelliSense auto-suggest, auto-complete, and error markers, among others.
• Preview and export your PySpark interactive query results to CSV, JSON, and Excel formats.
• Integration with Azure for HDInsight cluster management and query submissions.
• Links to the Spark UI and YARN UI for further troubleshooting.
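The CSV/JSON export feature amounts to serializing result rows; a minimal sketch in plain Python (the function names and sample rows here are illustrative, not the plugin's actual API):

```python
import csv
import io
import json

# Hypothetical sample of rows returned by a PySpark interactive query;
# the real plugin captures these from the cluster's response.
rows = [
    {"customer_id": 413707, "rating": 9.3},
    {"customer_id": 391234, "rating": 4.6},
]

def export_csv(rows):
    """Serialize query results to CSV text, as the CSV export would."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def export_json(rows):
    """Serialize query results to JSON text, as the JSON export would."""
    return json.dumps(rows, indent=2)

csv_text = export_csv(rows)
json_text = export_json(rows)
```

Excel export would follow the same pattern with a spreadsheet library instead of the standard-library serializers.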
9. Advanced development tools for Spark (Public Preview, Sept ‘17)
Distributed debugging of Spark code running across multiple Spark executors
User manual: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-intellij-tool-debug-remotely-through-ssh
• Use IntelliJ to run and debug Spark applications remotely on an HDInsight cluster at any time. Developers can inspect variables, watch intermediate data, step through code, and even edit the app and resume execution – all against Azure HDInsight clusters with cluster data.
• Set breakpoints in both driver and executor code. Debugging executor code lets developers detect data-related errors by viewing RDD intermediate values, tracking distributed task operations, and stepping through execution units.
• Set breakpoints in Spark external libraries, allowing developers to step into Spark code and debug inside the Spark framework.
• View both driver and executor code execution logs in the console panel.
10. Apache Kafka for HDInsight (GA, released Dec ‘17)
Enterprise-proven Kafka service for the cloud
99.9% SLA
Highest level of availability with rack awareness
Native integration with Azure Managed Disks means faster data ingestion and lower costs
Build real-time solutions faster
Get a cluster up and running in four clicks
Easy data mirroring and setup
Out-of-the-box alerting and monitoring
Integration with Apache Spark and Apache Storm
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-get-started
11. Enterprise Security Package for HDInsight (Public Preview, Dec ‘17)
Enterprise-grade security for Hadoop and Spark workloads
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-domain-joined-introduction
• Multi-user authentication using Active Directory or Azure Active Directory.
• Multi-user Zeppelin notebooks with a collaborative data-science experience.
• Role-based access control for Ambari operations.
• Fine-grained role-based access control for Hive SQL and Spark SQL using Apache Ranger.
• Masking of sensitive data using Apache Ranger.
• Seamless integration with file- and folder-level ACLs in Azure Data Lake Store.
• Auditing of all access to sensitive data, as well as changes to access policies.
• Transparent server-side encryption at rest, as well as encryption in transit.
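The data-masking bullet can be made concrete with a minimal sketch (a hypothetical policy function, not Apache Ranger's implementation — Ranger applies such policies server-side before results reach the user):

```python
# Illustrative Ranger-style column masking: a policy on a sensitive
# column replaces all but the last four digits before results are
# returned to a user who lacks the "see full value" privilege.
def mask_credit_card(value: str) -> str:
    """Keep the last 4 digits, mask the rest with asterisks."""
    return "*" * (len(value) - 4) + value[-4:]

# Hypothetical row, shaped like the customer table later in this deck.
row = {"customer_id": 413707, "credit_card": "4147202109819679"}
masked = {**row, "credit_card": mask_credit_card(row["credit_card"])}
# masked["credit_card"] == "************9679"
```

In Ranger the equivalent is a masking policy on the Hive column, so the query layer never exposes the raw value at all.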
26. o Hive Low-Latency Analytical Processing (LLAP)
o Serves queries directly from Azure Blob storage/ADLS
o Works with TEXT, JSON, CSV, TSV, ORC, and Parquet
o Super-fast performance with TEXT data
o Modern, scalable query-concurrency architecture
o Security with Apache Ranger and Active Directory
28. Intelligent cache
[Diagram: cache tiers DRAM → SSD → ADLS/Blob store]
Automatically reacts to changes in underlying data:
o Shared cache between queries
o Cache eviction is based on the source file's last-modified date
o Every query checks the modified date and reloads if a new file has arrived
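The eviction rule above can be sketched as a toy in-memory cache keyed on file modification time (illustrative only, not LLAP's actual cache code):

```python
import os
import tempfile
import time

class MtimeCache:
    """Toy cache: an entry is reused only while the source file's
    last-modified time matches the value recorded at load time."""

    def __init__(self):
        self._cache = {}  # path -> (mtime_at_load, data)

    def read(self, path):
        mtime = os.path.getmtime(path)
        entry = self._cache.get(path)
        if entry and entry[0] == mtime:
            return entry[1]          # cache hit: file unchanged
        with open(path) as f:        # miss or stale: reload and re-cache
            data = f.read()
        self._cache[path] = (mtime, data)
        return data

# Demo: write a file, cache it, then simulate "a new file has arrived".
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("v1")
    path = f.name

cache = MtimeCache()
first = cache.read(path)             # populates the cache
again = cache.read(path)             # served from cache

with open(path, "w") as f:           # new data lands in the same path
    f.write("v2")
os.utime(path, (time.time(), time.time() + 10))  # force a distinct mtime
second = cache.read(path)            # stale entry evicted, reloaded
os.remove(path)
```

LLAP's real cache works on column chunks across DRAM and SSD tiers, but the freshness check is the same idea: compare the source's modified date before trusting a cached entry.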
29. • LLAP, Spark, and Presto against 1 TB of data derived from the TPC-DS benchmark
• Out-of-the-box HDInsight configuration
• 45 queries derived from the TPC-DS benchmark that ran successfully on all engines
33. • We used a number of different concurrency levels to test concurrency performance
• 99 queries on 1 TB of data, on a 32-worker-node cluster with max concurrency set to 32.
Test 1: Run all 99 queries, 1 at a time - Concurrency = 1
Test 2: Run all 99 queries, 2 at a time - Concurrency = 2
Test 3: Run all 99 queries, 4 at a time - Concurrency = 4
Test 4: Run all 99 queries, 8 at a time - Concurrency = 8
Test 5: Run all 99 queries, 16 at a time - Concurrency = 16
Test 6: Run all 99 queries, 32 at a time - Concurrency = 32
Test 7: Run all 99 queries, 64 at a time - Concurrency = 64
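The shape of this harness — run all 99 queries, N at a time — can be sketched with a bounded thread pool (`run_query` is a hypothetical stand-in for submitting one benchmark query to the cluster):

```python
from concurrent.futures import ThreadPoolExecutor

def run_query(query_id: int) -> int:
    """Placeholder for submitting one of the 99 benchmark queries."""
    return query_id

def run_suite(concurrency: int, n_queries: int = 99):
    """Run all queries, `concurrency` at a time, as in Tests 1-7 above:
    the pool never has more than `concurrency` queries in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(run_query, range(n_queries)))

# One suite per concurrency level, mirroring Tests 1 through 7.
for level in (1, 2, 4, 8, 16, 32, 64):
    results = run_suite(level)
    assert len(results) == 99
```

In the real tests each `run_query` call would block on the cluster, so wall-clock time per suite is what the concurrency comparison measures.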
41. OMS Agent for Linux on HDInsight nodes (head, worker, ZooKeeper), with a FluentD HDInsight plugin:
1. 'in_tail' plugin for all logs; a regexp turns each entry into a JSON object
2. `grep` filter plugin keeps WARN and above for each log type
3. Output to the out_oms_api type
4. Exec plugin for metrics
[Diagram: per-workload configs (Spark, Hive/LLAP, Storm, Kafka, HBase) feed the FluentD omsconfig pipeline, which forwards to the Log Analytics (OMS) service.]
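Steps 1-2 of that pipeline amount to regexp parsing plus a severity filter; a minimal Python sketch (the log-line format and field names are assumptions for illustration, not the plugin's actual config):

```python
import json
import re

# Assumed log-line shape: "<date> <time> <LEVEL> <message>".
LOG_RE = re.compile(r"^(?P<time>\S+ \S+) (?P<level>\w+) (?P<message>.*)$")
KEEP = {"WARN", "ERROR", "FATAL"}  # "WARN and above"

def parse(line):
    """Step 1: regexp turns a tailed log line into a dict (JSON object)."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def filter_warn_and_above(lines):
    """Step 2: grep-style filter keeps only WARN-and-above events."""
    events = (parse(line) for line in lines)
    return [json.dumps(e) for e in events if e and e["level"] in KEEP]

lines = [
    "2017-12-01 10:00:01 INFO Started container",
    "2017-12-01 10:00:02 WARN Slow disk on worker 3",
    "2017-12-01 10:00:03 ERROR Task failed",
]
kept = filter_warn_and_above(lines)
# kept holds the WARN and ERROR events as JSON strings
```

In the real agent, FluentD's `grep` filter and `out_oms_api` output do this per log type and ship the JSON to Log Analytics.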
43. o Available for Hadoop, Spark, and LLAP in preview
o Enabled with AAD + AAD DS, or an AD-on-IaaS setup
o AAD DS is available in ARM VNets and the new Azure portal
o The Ranger database can be external to the cluster
45. Product demand analysis – Delivery and Operations

Customers:
Customer ID | Name | Cell phone | Email | Address | City | State | Zip | Credit card
413707 | LUNA PARK | 3122049789 | luna.park@gmail.com | 3250 W FOSTER AVE | CHICAGO | IL | 60625 | 4147202109819679
391234 | MARIE | 3121069067 | marie@outlook.com | 4729 N LINCOLN AVE | CHICAGO | IL | 60625 | 5166550002516678
413751 | MANU WORKY | 8471909522 | manu.work@gmail.com | 11601 W TOUHY AVE | CHICAGO | IL | 60666 | 5159550002367622
413708 | STEVE BENCH | 3122049411 | steve.bench@outlook.com | 325 N LA SALLE ST BLDG | CHICAGO | IL | 60654 | 4149098188760969
… | … | … | … | … | … | … | … | …

Reviews:
Customer ID | Review | Rating
413707 | SPICY, YET HEALTHY. WOULD ORDER AGAIN | 9.3
391234 | HATS OFF TO MAINTAIN PROPER | 4.6
413751 | AMAZING FOOD PREPARED RIGHT | 9.4
413708 | Decent Food | 7.1
… | … | …

Orders:
Id | Customer ID | Orders placed | Discount | Date | Revenue
102456 | 68252 | 277 | $526.30 | 8/1/2016 | $2,243.70
102457 | 413488 | 282 | $84.60 | 8/1/2016 | $2,735.40
102458 | 250405 | 134 | $281.40 | 8/1/2016 | $1,058.60
102459 | 114533 | 141 | $253.80 | 8/1/2016 | $1,156.20
102460 | 315209 | 289 | $346.80 | 8/1/2016 | $2,543.20
… | … | … | … | … | …

Deliveries:
Id | Customer ID | Time taken | Cost | Date
102456 | 68252 | 63 | $224.00 | 8/1/2016
102457 | 413488 | 65 | $235.00 | 8/1/2016
102458 | 250405 | 67 | $245.00 | 8/1/2016
102459 | 114533 | 71 | $227.00 | 8/1/2016
102460 | 315209 | 72 | $213.00 | 8/1/2016
… | … | … | … | …
For PySpark developers who value the productivity of the Python language, the new VSCode plugin for HDInsight offers first-class integration with this popular code editor. Developers can edit their scripts on their laptops and submit PySpark statements to an HDInsight cluster with interactive responses. This interactivity brings the best properties of Python and Spark to developers and makes their work more enjoyable and productive.