3. Open source analytics service for the Enterprise
Fully managed Hadoop and Spark for the cloud, with a 99.9% SLA
100% open source Hortonworks Data Platform
Clusters up and running in minutes
Familiar BI tools, interactive open source notebooks
Scale clusters on demand
Secure Hadoop workloads via Active Directory and Ranger
Compliance for Open Source bits
Best in class monitoring with Azure Log Analytics
Native Integration with leading ISVs
5. More value to our customers
Up to a 52% price reduction
An additional 80% price reduction for R Server for Azure HDInsight
6. Interactive Query (GA, released Sept ‘17)
Blazing-fast SQL queries on hyper-scale data
https://docs.microsoft.com/en-us/azure/hdinsight/interactive-query/apache-interactive-query-get-started
• Fast interactive SQL queries on petabyte-scale data
• Intelligent caching that leverages local SSDs
• Modern, scalable query-concurrency architecture
• Rich connectivity with the most popular authoring tools
• No data-format conversion needed to get faster results
• Enterprise-grade security and monitoring
7. HDInsight integration with Azure Log Analytics (GA, released Sept ‘17)
Enterprise-grade production monitoring for Hadoop and Spark workloads
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-oms-log-analytics-tutorial
• Monitor all HDInsight clusters and other Azure resources from a single pane of glass.
• Extensible workload-specific dashboards, along with a sophisticated analytical query language for deep analytics.
• Collect and correlate data from multiple open-source services.
• Alert on critical issues with the built-in Log Analytics alerting infrastructure.
• Troubleshoot issues faster by having Hadoop, YARN, Spark, Kafka, HBase, Hive, and Storm logs and metrics in one place.
• Perform rich log exploration with interactive queries.
8. VSCode integration with HDInsight (Public Preview, Sept ‘17)
First-class cross-platform integration with Spark & Hive workloads
User manual: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-for-vscode?branch=pr-en-us-26060
• Interactive responses bring the best properties of Python and Spark, with the flexibility to execute one or multiple statements.
• Built-in Python language services such as IntelliSense auto-suggest, auto-complete, and error markers, among others.
• Preview and export your PySpark interactive query results to CSV, JSON, and Excel formats.
• Integration with Azure for HDInsight cluster management and query submissions.
• Links to the Spark UI and YARN UI for further troubleshooting.
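The CSV/JSON export feature amounts to serializing result rows; a minimal sketch in plain Python (the function names and sample rows here are illustrative, not the plugin's actual API):

```python
import csv
import io
import json

# Hypothetical sample of rows returned by a PySpark interactive query;
# the real plugin captures these from the cluster's response.
rows = [
    {"customer_id": 413707, "rating": 9.3},
    {"customer_id": 391234, "rating": 4.6},
]

def export_csv(rows):
    """Serialize query results to CSV text, as the CSV export would."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def export_json(rows):
    """Serialize query results to JSON text, as the JSON export would."""
    return json.dumps(rows, indent=2)

csv_text = export_csv(rows)
json_text = export_json(rows)
```

Excel export would follow the same pattern with a spreadsheet library instead of the standard-library serializers.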
9. Advanced development tools for Spark (Public Preview, Sept ‘17)
Distributed debugging of Spark code running across multiple Spark executors
User manual: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-intellij-tool-debug-remotely-through-ssh
• Use IntelliJ to run and debug Spark applications remotely on an HDInsight cluster at any time. Developers can inspect variables, watch intermediate data, step through code, and even edit the app and resume execution – all against Azure HDInsight clusters with cluster data.
• Set breakpoints in both driver and executor code. Debugging executor code lets developers detect data-related errors by viewing RDD intermediate values, tracking distributed task operations, and stepping through execution units.
• Set breakpoints in Spark external libraries, allowing developers to step into Spark code and debug inside the Spark framework.
• View both driver and executor code execution logs in the console panel.
10. Apache Kafka for HDInsight (GA, released Dec ‘17)
Enterprise-proven Kafka service for the cloud
99.9% SLA
Highest level of availability with rack awareness
Native integration with Azure Managed Disks means faster data ingestion and lower costs
Build real-time solutions faster
Get a cluster up and running in four clicks
Easy data mirroring and setup
Out-of-the-box alerting and monitoring
Integration with Apache Spark and Apache Storm
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-get-started
11. Enterprise Security Package for HDInsight (Public Preview, Dec ‘17)
Enterprise-grade security for Hadoop and Spark workloads
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-domain-joined-introduction
• Multi-user authentication using Active Directory or Azure Active Directory.
• Multi-user Zeppelin notebooks with a collaborative data-science experience.
• Role-based access control for Ambari operations.
• Fine-grained role-based access control for Hive SQL and Spark SQL using Apache Ranger.
• Masking of sensitive data using Apache Ranger.
• Seamless integration with file- and folder-level ACLs in Azure Data Lake Store.
• Auditing of all access to sensitive data, as well as changes to access policies.
• Transparent server-side encryption at rest, as well as encryption in transit.
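The data-masking bullet can be made concrete with a minimal sketch (a hypothetical policy function, not Apache Ranger's implementation — Ranger applies such policies server-side before results reach the user):

```python
# Illustrative Ranger-style column masking: a policy on a sensitive
# column replaces all but the last four digits before results are
# returned to a user who lacks the "see full value" privilege.
def mask_credit_card(value: str) -> str:
    """Keep the last 4 digits, mask the rest with asterisks."""
    return "*" * (len(value) - 4) + value[-4:]

# Hypothetical row, shaped like the customer table later in this deck.
row = {"customer_id": 413707, "credit_card": "4147202109819679"}
masked = {**row, "credit_card": mask_credit_card(row["credit_card"])}
# masked["credit_card"] == "************9679"
```

In Ranger the equivalent is a masking policy on the Hive column, so the query layer never exposes the raw value at all.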
26. o Hive Low-Latency Analytical Processing (LLAP)
o Serves queries directly from Azure Blob storage/ADLS
o Works with TEXT, JSON, CSV, TSV, ORC, and Parquet
o Super-fast performance with TEXT data
o Modern, scalable query-concurrency architecture
o Security with Apache Ranger and Active Directory
28. Intelligent cache
[Diagram: cache tiers DRAM → SSD → ADLS/Blob store]
Automatically reacts to changes in underlying data:
o Shared cache between queries
o Cache eviction is based on the source file's last-modified date
o Every query checks the modified date and reloads if a new file has arrived
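The eviction rule above can be sketched as a toy in-memory cache keyed on file modification time (illustrative only, not LLAP's actual cache code):

```python
import os
import tempfile
import time

class MtimeCache:
    """Toy cache: an entry is reused only while the source file's
    last-modified time matches the value recorded at load time."""

    def __init__(self):
        self._cache = {}  # path -> (mtime_at_load, data)

    def read(self, path):
        mtime = os.path.getmtime(path)
        entry = self._cache.get(path)
        if entry and entry[0] == mtime:
            return entry[1]          # cache hit: file unchanged
        with open(path) as f:        # miss or stale: reload and re-cache
            data = f.read()
        self._cache[path] = (mtime, data)
        return data

# Demo: write a file, cache it, then simulate "a new file has arrived".
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("v1")
    path = f.name

cache = MtimeCache()
first = cache.read(path)             # populates the cache
again = cache.read(path)             # served from cache

with open(path, "w") as f:           # new data lands in the same path
    f.write("v2")
os.utime(path, (time.time(), time.time() + 10))  # force a distinct mtime
second = cache.read(path)            # stale entry evicted, reloaded
os.remove(path)
```

LLAP's real cache works on column chunks across DRAM and SSD tiers, but the freshness check is the same idea: compare the source's modified date before trusting a cached entry.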
29. • LLAP, Spark, and Presto against 1 TB of data derived from the TPC-DS benchmark
• Out-of-the-box HDInsight configuration
• 45 queries derived from the TPC-DS benchmark that ran successfully on all engines
33. • We used a number of different concurrency levels to test concurrency performance
• 99 queries on 1 TB of data, on a 32-worker-node cluster with max concurrency set to 32.
Test 1: Run all 99 queries, 1 at a time - Concurrency = 1
Test 2: Run all 99 queries, 2 at a time - Concurrency = 2
Test 3: Run all 99 queries, 4 at a time - Concurrency = 4
Test 4: Run all 99 queries, 8 at a time - Concurrency = 8
Test 5: Run all 99 queries, 16 at a time - Concurrency = 16
Test 6: Run all 99 queries, 32 at a time - Concurrency = 32
Test 7: Run all 99 queries, 64 at a time - Concurrency = 64
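The shape of this harness — run all 99 queries, N at a time — can be sketched with a bounded thread pool (`run_query` is a hypothetical stand-in for submitting one benchmark query to the cluster):

```python
from concurrent.futures import ThreadPoolExecutor

def run_query(query_id: int) -> int:
    """Placeholder for submitting one of the 99 benchmark queries."""
    return query_id

def run_suite(concurrency: int, n_queries: int = 99):
    """Run all queries, `concurrency` at a time, as in Tests 1-7 above:
    the pool never has more than `concurrency` queries in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(run_query, range(n_queries)))

# One suite per concurrency level, mirroring Tests 1 through 7.
for level in (1, 2, 4, 8, 16, 32, 64):
    results = run_suite(level)
    assert len(results) == 99
```

In the real tests each `run_query` call would block on the cluster, so wall-clock time per suite is what the concurrency comparison measures.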
41. OMS Agent for Linux on HDInsight nodes (head, worker, ZooKeeper), with a FluentD HDInsight plugin:
1. 'in_tail' plugin for all logs; a regexp turns each entry into a JSON object
2. `grep` filter plugin keeps WARN and above for each log type
3. Output to the out_oms_api type
4. Exec plugin for metrics
[Diagram: per-workload configs (Spark, Hive/LLAP, Storm, Kafka, HBase) feed the FluentD omsconfig pipeline, which forwards to the Log Analytics (OMS) service.]
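Steps 1-2 of that pipeline amount to regexp parsing plus a severity filter; a minimal Python sketch (the log-line format and field names are assumptions for illustration, not the plugin's actual config):

```python
import json
import re

# Assumed log-line shape: "<date> <time> <LEVEL> <message>".
LOG_RE = re.compile(r"^(?P<time>\S+ \S+) (?P<level>\w+) (?P<message>.*)$")
KEEP = {"WARN", "ERROR", "FATAL"}  # "WARN and above"

def parse(line):
    """Step 1: regexp turns a tailed log line into a dict (JSON object)."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def filter_warn_and_above(lines):
    """Step 2: grep-style filter keeps only WARN-and-above events."""
    events = (parse(line) for line in lines)
    return [json.dumps(e) for e in events if e and e["level"] in KEEP]

lines = [
    "2017-12-01 10:00:01 INFO Started container",
    "2017-12-01 10:00:02 WARN Slow disk on worker 3",
    "2017-12-01 10:00:03 ERROR Task failed",
]
kept = filter_warn_and_above(lines)
# kept holds the WARN and ERROR events as JSON strings
```

In the real agent, FluentD's `grep` filter and `out_oms_api` output do this per log type and ship the JSON to Log Analytics.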
43. o Available for Hadoop, Spark, and LLAP in preview
o Enabled with AAD + AAD DS, or an AD-on-IaaS setup
o AAD DS is available in ARM VNets and the new Azure portal
o The Ranger database can be external to the cluster
45. Product demand analysis – Delivery and Operations

Customers:
Customer ID | Name | Cell phone | Email | Address | City | State | Zip | Credit card
413707 | LUNA PARK | 3122049789 | luna.park@gmail.com | 3250 W FOSTER AVE | CHICAGO | IL | 60625 | 4147202109819679
391234 | MARIE | 3121069067 | marie@outlook.com | 4729 N LINCOLN AVE | CHICAGO | IL | 60625 | 5166550002516678
413751 | MANU WORKY | 8471909522 | manu.work@gmail.com | 11601 W TOUHY AVE | CHICAGO | IL | 60666 | 5159550002367622
413708 | STEVE BENCH | 3122049411 | steve.bench@outlook.com | 325 N LA SALLE ST BLDG | CHICAGO | IL | 60654 | 4149098188760969
… | … | … | … | … | … | … | … | …

Reviews:
Customer ID | Review | Rating
413707 | SPICY, YET HEALTHY. WOULD ORDER AGAIN | 9.3
391234 | HATS OFF TO MAINTAIN PROPER | 4.6
413751 | AMAZING FOOD PREPARED RIGHT | 9.4
413708 | Decent Food | 7.1
… | … | …

Orders:
Id | Customer ID | Orders placed | Discount | Date | Revenue
102456 | 68252 | 277 | $526.30 | 8/1/2016 | $2,243.70
102457 | 413488 | 282 | $84.60 | 8/1/2016 | $2,735.40
102458 | 250405 | 134 | $281.40 | 8/1/2016 | $1,058.60
102459 | 114533 | 141 | $253.80 | 8/1/2016 | $1,156.20
102460 | 315209 | 289 | $346.80 | 8/1/2016 | $2,543.20
… | … | … | … | … | …

Deliveries:
Id | Customer ID | Time taken | Cost | Date
102456 | 68252 | 63 | $224.00 | 8/1/2016
102457 | 413488 | 65 | $235.00 | 8/1/2016
102458 | 250405 | 67 | $245.00 | 8/1/2016
102459 | 114533 | 71 | $227.00 | 8/1/2016
102460 | 315209 | 72 | $213.00 | 8/1/2016
… | … | … | … | …
For PySpark developers who value the productivity of the Python language, the new VSCode plugin for HDInsight offers first-class integration with this popular code editor. Developers can edit their scripts on their laptops and submit PySpark statements to an HDInsight cluster with interactive responses. This interactivity brings the best properties of Python and Spark to developers and makes their work more enjoyable and productive.