2. Who Am I?
Neil Mackenzie
Windows Azure Architect @ Satory Global
Windows Azure MVP
Blog: http://convective.wordpress.com/
Twitter: @mknz
Book:
Microsoft Windows Azure Development Cookbook
3. Goals and Agenda
Goals
Introduce Windows Azure HDInsight Service to the Windows
Azure developer
Introduce Windows Azure to the Hadoop user
Not a tutorial on how to use Hadoop features
Agenda
Big Data
Windows Azure
Windows Azure HDInsight Service
4. Big Data
Problem:
How do we create value from enormous amounts of low-value
data?
Solution:
Analyze it using a lot of commodity hardware.
5. Three Vs of Big Data
Volume
How much data is there?
Variety
What are the sources of the data?
Velocity
How fast is the data being generated?
6. MapReduce
Distributed computational model for data analysis.
Map function:
Processes a key-value pair to generate intermediate key-value pairs.
Reduce function:
Merges all intermediate values with the same intermediate key.
Map and reduce functions are allocated to many
compute nodes, each working on data stored locally.
Raw MapReduce functions are written in Java.
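The model above can be sketched in a few lines of Python. This is only a single-process illustration of the map, group-by-key, and reduce steps, not Hadoop's Java API; the names map_fn, reduce_fn and run_mapreduce and the sample input are invented for this sketch.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: process one (key, value) pair and emit intermediate (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share one intermediate key."""
    return key, sum(values)

def run_mapreduce(records):
    # Group intermediate values by intermediate key (the "shuffle" step).
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    # Apply the reduce function once per intermediate key.
    return dict(reduce_fn(k, v) for k, v in groups.items())

# Hypothetical input: (line number, line text) pairs.
lines = [(0, "big data big value"), (1, "data on commodity hardware")]
word_counts = run_mapreduce(lines)
```

On a real cluster the map calls run on many nodes next to the data, and the grouping step moves intermediate pairs across the network before the reduce calls run.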
7. Apache Hadoop
Modules:
Hadoop Distributed File System (HDFS)
MapReduce
Related projects:
HBase – scalable, distributed database
Hive – data warehouse infrastructure
Mahout – scalable machine learning library
Pig – high-level data-flow language
Other:
Sqoop – import and export to relational databases
8. Windows Azure
Compute
PaaS: Cloud Services, Windows Azure Web Sites
IaaS: Virtual Machines
Storage
Windows Azure Storage Service: blobs, tables, queues
Windows Azure SQL Database
IaaS: Microsoft SQL Server, MongoDB, Cassandra, etc.
Connectivity
HTTP, TCP, UDP, Site-to-Site VPN
Administration
Portal, Service Management API
9. Windows Azure HDInsight Service
Components:
HadoopCore – v1.0.1
HDFS & ASV
Pig – v0.9.3
Hive – v0.8.1
Sqoop – v1.4.2
Excel/Hive
Note: this was formerly known as Hadoop on Azure.
10. Hadoop Administration
Portal
http://www.hadooponazure.com
Apply to join preview
Create and manage Hadoop cluster
3 nodes for 5 days
Access the Interactive console
Hive
Invoke Hive statements
JavaScript
Invoke HDFS commands
Invoke Hive & Pig statements
11. Distributed File Systems
HDFS
Contents deleted when cluster deleted
ASV
Azure Storage Vault
Data stored in Windows Azure Blob Storage
Configured on Hadoop on Azure portal
Contents survive deletion of Hadoop cluster
Supports multi-level structure, e.g.:
containername/input/file1
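As a rough illustration of the multi-level structure, the sketch below splits an asv:// path into a container name and a blob name. It assumes the first path segment is the container, as in the example above; the helper split_asv_path is hypothetical and not part of any Hadoop or Azure library.

```python
def split_asv_path(uri):
    """Split 'asv://container/path/to/blob' into (container, blob_name)."""
    prefix = "asv://"
    if not uri.startswith(prefix):
        raise ValueError("not an asv:// URI: " + uri)
    # Everything up to the first '/' is the container; the rest is the blob
    # name, which may itself contain '/' to give a folder-like hierarchy.
    container, _, blob_name = uri[len(prefix):].partition("/")
    return container, blob_name

container, blob_name = split_asv_path("asv://flightdata/input/flightdata.txt")
```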
12. Pig
Hadoop feature to perform data-flow operations:
Execution environment
Language: Pig Latin
Execution Environment
Runs locally in a single JVM or distributed on a Hadoop cluster
Pig Latin
High-level language
Describes data-flow operations
Automatically invokes MapReduce jobs
Much simpler than using MapReduce directly
13. Pig Example
records = LOAD 'asv://flightdata/input/flightdata.txt'
    AS (year:int, month:int, day:int, carrier:chararray,
        origin:chararray, dest:chararray, depdelay:int, arrdelay:int);
modified_records = FOREACH records
    GENERATE origin, depdelay;
STORE modified_records
    INTO 'my_output' USING PigStorage(',');
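To make the data flow concrete, here is a minimal Python sketch of what the script does to each record: load with a schema, project two fields, and store as comma-separated text. It assumes tab-delimited input (the default for a Pig LOAD with no USING clause); the function names and the two sample rows are invented for illustration.

```python
FIELDS = ["year", "month", "day", "carrier", "origin", "dest", "depdelay", "arrdelay"]
INT_FIELDS = {"year", "month", "day", "depdelay", "arrdelay"}

def load(lines, delimiter="\t"):
    """LOAD ... AS (...): parse each line into a typed record."""
    for line in lines:
        values = line.rstrip("\n").split(delimiter)
        yield {name: int(v) if name in INT_FIELDS else v
               for name, v in zip(FIELDS, values)}

def project(records):
    """FOREACH records GENERATE origin, depdelay."""
    for record in records:
        yield record["origin"], record["depdelay"]

def store(rows, delimiter=","):
    """STORE ... USING PigStorage(','): format rows as delimited text."""
    return [delimiter.join(str(v) for v in row) for row in rows]

# Made-up sample rows, not real flight data.
sample = ["2012\t1\t15\tAA\tSEA\tJFK\t12\t-3\n",
          "2012\t1\t15\tDL\tSFO\tSEA\t0\t5\n"]
output_lines = store(project(load(sample)))
```

Pig compiles the same three steps into MapReduce jobs, so the script scales to inputs far larger than one machine can hold.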
14. Hive
Hadoop feature to perform data warehouse
operations
HiveQL
high-level, SQL-like language
Supports equi-joins
Schema on read NOT schema on write
Automatically invokes MapReduce jobs
Much simpler than using MapReduce directly
Metadata store
Contains descriptions of tables
15. Hive Example
FROM flightdata_asv
INSERT OVERWRITE TABLE origin_counts
SELECT origin, COUNT(*)
GROUP BY origin
INSERT OVERWRITE TABLE dest_counts
SELECT dest, COUNT(*)
GROUP BY dest
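The multi-insert form reads the source table once and feeds both aggregations. A single-pass Python sketch of the same idea follows; the function name and the sample rows are assumptions made for illustration, and real Hive would run this as MapReduce jobs over the table's files.

```python
from collections import Counter

def count_origins_and_dests(rows):
    """One scan over the rows, two GROUP BY ... COUNT(*) results."""
    origin_counts, dest_counts = Counter(), Counter()
    for row in rows:
        origin_counts[row["origin"]] += 1
        dest_counts[row["dest"]] += 1
    return origin_counts, dest_counts

# Made-up rows standing in for the flightdata_asv table.
rows = [
    {"origin": "SEA", "dest": "JFK"},
    {"origin": "SEA", "dest": "SFO"},
    {"origin": "SFO", "dest": "JFK"},
]
origin_counts, dest_counts = count_origins_and_dests(rows)
```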
16. Sqoop
Feature for importing data from, and exporting data to, SQL
databases
Uses JDBC connector
Works with Windows Azure SQL Database
Table must exist before export
17. Sqoop Example
Exporting a table:
sqoop.cmd export --connect
"jdbc:sqlserver://sql_database_server.database.windows.net:1433;database=sql_database_instance;user=sqoop_login@sql_database_server;password=sqoop_login_password"
--table sql_database_table
--export-dir "/user/hive/warehouse/hive_table"
--input-fields-terminated-by "\001"
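The connection string is the part most easily mistyped, so it can help to assemble it from its pieces. The helper below is a hypothetical sketch that reproduces the format used in the example above, with the same placeholder names; it is not part of Sqoop or of the SQL Server JDBC driver.

```python
def sqlserver_jdbc_url(server, database, login, password, port=1433):
    """Build a JDBC URL in the format used in the export example."""
    host = f"{server}.database.windows.net"
    # Windows Azure SQL Database logins take the form login@server.
    return (f"jdbc:sqlserver://{host}:{port};database={database};"
            f"user={login}@{server};password={password}")

url = sqlserver_jdbc_url(
    "sql_database_server",
    "sql_database_instance",
    "sqoop_login",
    "sqoop_login_password",
)
```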
18. Excel and Hadoop on Azure
Example of Microsoft business intelligence strategy
Expose Hadoop to existing tools
HiveODBC connector for Excel
Create Hive queries from Excel
Invoke them from Excel
19. More Information
Sign up for preview:
http://www.hadooponazure.com
Support:
http://social.msdn.microsoft.com/Forums/en-US/hdinsight
Avkash Chauhan’s blog:
http://blogs.msdn.com/b/avkashchauhan/archive/tags/hadoop
Roger Jennings’ blog:
http://oakleafblog.blogspot.com/2012/04/using-data-in-windows-azure-blobs-with.html
20. Summary
Hadoop:
De facto solution to the Big Data problem
Windows Azure HDInsight Service
Native Hadoop implementation
Managed Hadoop service for Windows Azure
Currently in preview