Contents
Overview
Deploy HDP Sandbox
Create the VM
Configure Azure Data Lake and SQL Data Warehouse
Terms of Use
3. Interactive queries using Spark SQL on Azure HDInsight
Summary
To complete the labs we have prepared, you need an Azure
subscription with admin rights. This will allow you to create the
small clusters (max 4 nodes) that we will use during the lab. We
ask that you create the HDP sandbox before arriving at the lab
(see ‘Deploy HDP Sandbox’). Please liaise with your internal IT
organization to gain the necessary privileges to complete the lab.
Once your internal IT organization has granted you access to the
Azure Portal, we highly recommend that you complete the sections
in this document before coming to the lab, to test the access
granted.
This document should take no more than 30 minutes to complete.
If you have any difficulties at all then please get in contact with
your Microsoft representative.
The first lab will work with the Hortonworks sandbox environment.
We recommend you deploy this and shut it down before attending
the lab. We will also use the Twitter API; the ‘Deploy HDP
Sandbox’ section includes a link to instructions for setting that up.
As part of this lab we will also be using Visual Studio to submit
Hive queries. The software required to complete the lab is
already installed on a pre-configured VM in Azure called the Data
Science Virtual Machine. This virtual machine has the following
software installed:
Visual Studio 2015 Community Edition
Azure SDK
Revolution R Open
Power BI Desktop
SQL Server Express 2014
IPython
Azure PowerShell
Azure Storage Explorer
In this activity we will create an instance of this virtual machine
and install tools on the VM.
Overview
Summary
Deploying the Hortonworks sandbox can be done before the lab,
and confirms that you have all the necessary rights within your
Azure subscription. It only takes a few minutes, and the sandbox
can be shut down once set up to prevent any further charges.
The instructions to set up a single-node Hortonworks environment
are here:
http://hortonworks.com/hadoop-tutorial/deploying-hortonworks-sandbox-on-microsoft-azure/
Once deployed, you can shut the sandbox down as follows:
1. Log in to the Azure portal at https://portal.azure.com/
2. If you cannot see a VM on the dashboard with the name of
your cluster, you can search for it using the search box at
the top.
3. Select your VM and, in the Dashboard, select ‘Stop’.
4. Confirm that you wish to stop the VM, then check that the
status updates to ‘Stopped (Deallocated)’ after a few minutes
to ensure you aren’t charged for the virtual machine.
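The status check in the last step matters because a VM that is merely ‘stopped’ (for example, shut down from inside the guest OS) still reserves compute and is billed, whereas a ‘deallocated’ VM is not. The following is a minimal Python sketch of that check, not part of the lab itself. It assumes status codes in the `PowerState/<state>` form that the Azure APIs report; the function name and the sample values are hypothetical, and no call to Azure is made.

```python
def is_deallocated(status_codes):
    """Return True only if the reported power state is 'deallocated'.

    status_codes: iterable of status code strings, such as those found in
    a VM instance view (assumed format: 'PowerState/<state>').
    """
    for code in status_codes:
        prefix, _, state = code.partition("/")
        if prefix.lower() == "powerstate":
            return state.lower() == "deallocated"
    # No power state was reported, so deallocation is not confirmed.
    return False


if __name__ == "__main__":
    # Hypothetical sample values for illustration.
    print(is_deallocated(["ProvisioningState/succeeded", "PowerState/deallocated"]))
    print(is_deallocated(["ProvisioningState/succeeded", "PowerState/stopped"]))
```

The function deliberately returns False when no power state is present, so an incomplete status listing is never mistaken for a safely deallocated VM.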
Deploy HDP Sandbox
Create the Twitter API Keys
As part of the lab we’ll be collecting and processing Twitter feeds.
In order to connect to the Twitter API you’ll need to create a
Twitter app and collect the API keys. Instructions for this can be
obtained here:
http://www.gabfirethemes.com/create-twitter-api-key/
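Once the keys have been collected, it helps to keep them out of scripts and to check that none is missing before the lab. The sketch below is one way to do that in Python, assuming the usual four values issued for a Twitter app (consumer key, consumer secret, access token, access token secret). The environment variable names and the function name are hypothetical, chosen only for this illustration.

```python
import os

# Hypothetical environment variable names; choose your own.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_keys(env=None):
    """Return the four keys as a dict, or raise if any is missing or blank."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]
    if missing:
        raise KeyError("Missing Twitter API settings: " + ", ".join(missing))
    return {name: env[name].strip() for name in REQUIRED_KEYS}
```

Calling `load_twitter_keys()` with no argument reads from the process environment; passing a plain dict makes the check easy to try without setting any variables.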
The HDInsight lab will also use Visual Studio to connect to an
HDInsight cluster and run a Hive script. Visual Studio Community
Edition is installed and configured on the Data Science VM, which
is freely available from the Azure Marketplace.
1. Sign in to the Azure portal - https://portal.azure.com/
2. Click on + New.
3. In the search box, type ‘Data Science Virtual Machine’ and
press the Return key. You should see the following:
4. Click on the Data Science Virtual Machine (published by
Microsoft)
5. Click on Create.
6. In the Basics blade, fill out a Name (n.b. this must be unique
across the whole of Azure), User name, Password and
Resource group. Select the location nearest to you (this is the
location of the Microsoft data center). An example entry is
outlined below:
Create the VM
7. The Size blade will pop up next. Select A3 (n.b. we will shut
down the VM at the end of this lab).
8. On the Settings blade click OK:
9. On the Summary blade click OK:
10. On the Buy blade click Purchase:
11. On the startboard you will see the VM being deployed. This
will take approximately 5-10 minutes.
12. Once it is successfully deployed you will see the following on
the startboard:
13. Click on the VM you created from the startboard to get the
following page:
14. Click on the Connect button as highlighted above. Save the
RDP file.
15. Double click on the downloaded RDP file to connect to the VM
and enter your credentials (note the \ before the username):
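The RDP file saved in step 14 is a plain-text list of settings, one per line, in `name:type:value` form, so it can be inspected before connecting. The sketch below parses such text in Python to show which address the file points at. The function name and the sample contents are hypothetical (the address is a documentation-only IP), and a real downloaded file may contain more settings and use a different text encoding.

```python
def parse_rdp(text):
    """Parse .rdp file text into {setting_name: value}.

    Each line is expected to look like 'name:type:value', where type is
    'i' (integer) or 's' (string). Lines in any other shape are skipped.
    """
    settings = {}
    for line in text.splitlines():
        # Split on the first two colons only, so values such as
        # 'host:3389' keep their own colon.
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue
        name, kind, value = parts
        if kind == "i" and value.lstrip("-").isdigit():
            settings[name] = int(value)
        else:
            settings[name] = value
    return settings


if __name__ == "__main__":
    # Hypothetical file contents for illustration.
    sample = "full address:s:203.0.113.10:3389\nprompt for credentials:i:1\n"
    print(parse_rdp(sample)["full address"])
```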
The next steps are optional, should you wish to explore other
features of the SDKs.
1. Once you have connected to the Data Science Virtual
Machine, install the Azure SDK by double clicking on the
Microsoft Web Platform shortcut on the desktop:
In the installer click on Add for Microsoft Azure SDK for .NET
(VS 2015) - <VERSION NUMBER> and then install:
This takes approximately 5 minutes to finish installing.
2. Ensure Azure Data Lake Tools for Visual Studio are installed
(Data Lake Tools for Visual Studio). Once Data Lake Tools
for Visual Studio is installed, you will see a Data Lake menu in
Visual Studio.
3. Next, install Rtools by visiting the following site -
https://cran.r-project.org/bin/windows/Rtools/ - in Internet
Explorer and downloading Rtools33.exe.
Run through the installer, ensuring that at the additional tasks
stage the following checkbox is ticked:
This updates the PATH environment variable so that the R
tooling is available from the command line.
4. Close the remote desktop connection to the VM by clicking
the X on the blue bar highlighted below:
5. Shut down the VM by clicking on the Stop button on the VM
blade in the Azure preview portal (this will take a couple of
minutes).
6. If you managed to successfully complete all these steps, then
you are ready for the Advanced Analytics lab!
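For optional step 3 above, one way to confirm that the Rtools installer really did update PATH is to look for Rtools directories among the PATH entries. The Python sketch below does this with the standard library only. The function name is hypothetical, and matching on the substring "rtools" is an assumption about the install folder name rather than a guarantee.

```python
import os


def rtools_entries(path_value=None, sep=None):
    """Return the PATH entries that mention 'rtools' (case-insensitive).

    path_value defaults to the current PATH and sep to the platform's
    separator (';' on Windows); both can be passed in for testing.
    """
    path_value = os.environ.get("PATH", "") if path_value is None else path_value
    sep = os.pathsep if sep is None else sep
    return [entry for entry in path_value.split(sep) if "rtools" in entry.lower()]


if __name__ == "__main__":
    found = rtools_entries()
    print("Rtools on PATH:" if found else "No Rtools entry found on PATH.")
    for entry in found:
        print("  " + entry)
```

An empty result after installation suggests the checkbox was not ticked, or that the session was opened before PATH was updated and needs to be restarted.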
Introduction
This is an optional activity and is not required to complete the
labs. However, during the labs you may also wish to review some
of the other Big Data services available in Azure: Azure Data Lake
and the Data Lake Analytics service.
1) Familiarize yourself with Azure Data Lake Store by
reading this.
2) By completing this tutorial you will enable your Azure
subscription for the Data Lake Store Public Preview,
create an Azure Data Lake Store account and test
some basic Data Lake Store functionality. Do not
delete the ADL account at the end.
3) Understand this post.
4) By completing this tutorial you will create a Data Lake
Analytics account, prepare source data and submit
Data Lake Analytics jobs.
5) Familiarize yourself with Azure SQL Data Warehouse
by reading this.
6) Create a SQL Data Warehouse by completing this
tutorial.
7) Configure integration with Visual Studio by
completing this tutorial.
Configure Azure Data Lake and SQL Data Warehouse