This document provides an overview of Hadoop on Azure and how to work with HDInsight Hadoop clusters on the Microsoft Azure cloud platform. It discusses what Hadoop and HDInsight are, how to set up and configure Hadoop clusters on Azure, how to perform common administration tasks, how data is stored and processed using MapReduce, how to write and run MapReduce jobs in different languages, and how to monitor and query Hadoop jobs and data using tools like Hive, Pig, and Excel.
2. Data Expertise / Lynn Langit
• Practicing Architect
  • Cloud Deployments (Azure, AWS, Google)
• Technical author / trainer
  • Google Cloud Developer Series
  • SQL Server 2012 Developer Series
  • Cloudera Certified Developer
  • 2 books on SQL Server BI
• Industry awards
  • Microsoft – MVP for SQL Server
  • Google – GDE for Cloud Platform
  • 10Gen – Master for MongoDB
• Former MSFT FTE, 4 years
3. What is Hadoop?
• HUGE hype factor in 2011 / 2012
Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license
• Uses HDFS storage to enable applications to work with thousands of nodes and petabytes of data
• Uses MapReduce to process the data
• Inspired by Google
  • MapReduce
  • Google File System
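The MapReduce model named above can be sketched in a few lines of plain Python. This is a minimal single-process illustration of the map, shuffle, and reduce phases using word count; it is not Hadoop code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the list of counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

On a real cluster the map and reduce calls run in parallel on many nodes against data in HDFS, and the framework performs the shuffle; the sketch only shows how the work is divided.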
4. What is HDInsight?
• Hadoop on Windows
  • Azure
  • On-premise
• Microsoft worked with Hortonworks to port Hadoop to Windows (from Linux)
6. RDBMS vs. Hadoop
                      RDBMS                        Hadoop
Data Size             Gigabytes (Terabytes)        Petabytes (Exabytes)
Access                Interactive and Batch        Batch – NOT Interactive
Updates               Read / Write many times      Write once, Read many times
Structure             Static Schema                Dynamic Schema
Integrity             High (ACID)                  Low
Scaling               Nonlinear                    Linear
Query Response Time   Can be near immediate        Has latency (due to batch processing)
24. Monitoring Job Results
• In the portal
  • Main Console
    • Job icon (button) status summary
    • Job History
  • Interactive Console
    • JS quick feedback
    • JS detailed feedback (log)
• Using RDP
  • Map/Reduce tool
  • Hadoop command prompt
32. Hadoop To-Do List
BigData = Hadoop
• Use Hadoop when business needs dictate
• Use other NoSQL if a better fit
Hadoop on the cloud
• Quick and cheap
• Specialized use cases
• Behavioral data
• Dev, test, training environments
Hadoop access technologies
• Learn Map/Reduce
• Use HIVE via Excel
• Pay attention to Impala