Hadoop on Windows Azure - an Introduction

Instead of reinventing the wheel, Microsoft has made a strong and brilliant move by integrating Hadoop into its blockbuster cloud computing PaaS stack. Of course, LINQ to HPC was embraced by many .NET developers, but a Hadoop distribution for Windows is also the safer move. This paper evaluates the early preview of Hadoop on Azure and covers the basics of using it. It helps to read about MapReduce and Hadoop topology before learning about Hadoop on Azure.



Introducing Hadoop on Azure
M Sheik Uduman Ali, Technical Architect, Aditi Technologies

For comments or questions regarding the content of this paper, please contact Sunny Neogi (sunnyn@aditi.com) or Arun Kumar (arung@aditi.com).
Why do we need Hadoop?

The simple answer to this question is "big data analysis". Some examples of big data analysis are:

• Calculating consumer purchasing trends for particular product categories from a big data set growing at a rate of 1 million transactions per hour
• Web application log analysis
• Internet search indexing
• Social network data

Relational databases and their ecosystem were designed around a "scale-up" strategy with centralized data processing, so they are not well suited to the data warehousing space. Moreover, the data persisted by modern applications is a mix of relational, structured, and unstructured content. Hence, we need a much more powerful system. Hadoop is one of the most successful open source platforms based on the MapReduce principle, which in turn follows the "making big by small" philosophy.

A big data processing task is called a "job" because it is run on demand: frequently, periodically, once in a while, or only once. It is not part of day-to-day business processing.
What is MapReduce?

The input data is processed on "n" small physical nodes in a clustered environment in two phases:

• Map: The input data is grouped into <k1, v1> key-value pairs. For example, if the input data resides in one or more files, k1 would be the file name and v1 the file content; the map phase therefore receives a list of <k1, v1> pairs, which are split across the available map nodes in the cluster. On every node, the map function mostly performs filtering and transformation and produces <k2, v2> pairs. For example, if you want to count the number of occurrences of words in a given set of documents, <filename, content> is the <k1, v1> pair and each map node counts the words in its v1. This generates an output pair such as <"aditi", 1> for every occurrence of the word "aditi" in a document. Hence, the output of the map phase is a list of <k2, v2> pairs; for example, that list may contain many <"aditi", 1> pairs.

• Reduce: All <k2, v2> pairs are aggregated into <k2, list(v2)>. In the word count example, a node in the Hadoop cluster may produce <"aditi", list(1, 1, 1, 1)> from the documents processed on different nodes. Each list(v2) for a k2 is passed to a node for reducing, and the output is a list of <k3, v3> pairs. For example, if a node receives "aditi" as k2, it simply accumulates list(1+1+1+1) as v2 and produces 4 as v3; here, k3 is again "aditi". Each reducer node does the same for different words.

The <k2, v2> aggregation is actually performed by a component called the "combiner". For now, let us keep the focus on the mapper and reducer, shown in figure 1 and sketched in code below.
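As described later in this paper, on Azure a job can be written in C# and run through Hadoop streaming, where the mapper and reducer are ordinary console programs that read lines from standard input and write tab-separated key-value pairs to standard output. The following is a minimal sketch of the word count example in C#; the class names and the two-program layout are illustrative assumptions, not taken from the original paper.

    // WordCountMapper.cs -- compiled to WordCountMapper.exe (streaming mapper).
    // Reads document text from stdin and emits one "<word>TAB1" line per
    // occurrence, i.e. the <k2, v2> pairs described above.
    using System;

    class WordCountMapper
    {
        static void Main()
        {
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                var words = line.Split(new[] { ' ', '\t', '.', ',' },
                                       StringSplitOptions.RemoveEmptyEntries);
                foreach (var word in words)
                    Console.WriteLine("{0}\t1", word.ToLowerInvariant());
            }
        }
    }

    // WordCountReducer.cs -- compiled to WordCountReducer.exe (streaming reducer).
    // Streaming delivers the reducer's input sorted by key, so all lines for one
    // word arrive together; the loop sums them and emits the <k3, v3> pair.
    using System;

    class WordCountReducer
    {
        static void Main()
        {
            string currentWord = null;
            int count = 0;
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                var parts = line.Split('\t');
                if (parts[0] != currentWord)
                {
                    if (currentWord != null)
                        Console.WriteLine("{0}\t{1}", currentWord, count);
                    currentWord = parts[0];
                    count = 0;
                }
                count += int.Parse(parts[1]);
            }
            if (currentWord != null)
                Console.WriteLine("{0}\t{1}", currentWord, count);
        }
    }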
Hadoop Cluster

A Hadoop cluster is an infrastructure with many physical nodes, some configured for mapping and some for reducing, along with administrative, tracking, and data persistence nodes called the "Name Node", "Job Tracker", "Task Tracker", and "Data Node" respectively. This is a master/slave architecture: the Name Node and Job Tracker are the masters, and the remaining nodes are slaves, as shown in figure 2. To handle big data storage and processing, Hadoop uses HDFS, a distributed file system that can handle even 100 TB of content as a single file.
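To make the master/slave wiring concrete: in the Hadoop 1.x generation that this preview builds on, every node finds its masters through two XML configuration files. This is a hedged sketch; the property names are the standard Hadoop 1.x ones, but the host names and ports are illustrative assumptions.

    <!-- core-site.xml: tells every node where the Name Node (HDFS master) runs -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode-host:9000</value>
      </property>
    </configuration>

    <!-- mapred-site.xml: tells every node where the Job Tracker (MapReduce master) runs -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker-host:9001</value>
      </property>
    </configuration>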
Hadoop Ecosystem on Azure

Since every task is packaged as a "job", you can rent the required nodes for your job, use them, and release them. Hence, the elastic compute and data storage (blob and table storage) in Azure is definitely a good choice for running your Hadoop job. Hadoop's homeland is Java, so at this early stage on Azure the Hadoop Java SDK is one of the good options for writing your job. In addition, Hadoop on Azure leverages the elasticity of Azure storage through Hadoop streaming, by which you can write your job in C# or F# and use Azure blob storage for data persistence (this scheme is called ASV). Figure 3 shows the Hadoop ecosystem on Azure.

To create directories, get and put files, and issue data processing commands on HDFS/ASV, Azure provides an interactive JavaScript console (in the standard Hadoop distribution, Java is the main interface for this). In addition, Azure supports Hive (a SQL-like language for Hadoop) and Pig Latin (a high-level data processing language).
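To illustrate the two storage schemes, the same file can be addressed either on the cluster's own HDFS or in an Azure blob container via the asv:// prefix; the host, container, and path names below are illustrative assumptions, as is the Hive table in the one-line query that follows.

    hdfs://namenode-host:9000/user/demo/input/docs.txt   (file on the cluster's HDFS)
    asv://mycontainer/input/docs.txt                     (same data kept in Azure blob storage)

    -- HiveQL sketch: count word occurrences in a hypothetical "words" table
    SELECT word, COUNT(*) AS occurrences FROM words GROUP BY word;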
The Web Portal for Hadoop on Azure

www.hadooponazure.com is the management portal where you create, release, and renew clusters for your jobs. The following are the steps you need to perform to run a job:

1. Develop the map and reduce functions, either in Java or on your preferred platform. Outside Windows this could be shell scripts, Ruby, PHP, Python, and so on; on Azure, you can write the code in .NET.
2. Decide where the input data and output results of the job will be managed: either HDFS or Azure blob storage.
3. Request a cluster for the job in the portal.
4. Specify all the parameters for the job, including the job's executable and the input and output details (a sample streaming submission command appears later in this paper).
5. Run the job and get the output.
6. Release the cluster.

In this paper, let us look at step 3: how to create a cluster for a job.

Requesting a New Cluster

After you have entered the portal, provide the following details for the new cluster environment, as shown in figure 4:

• DNS name (<dnsname>.cloudapp.net)
• Cluster size, similar to an Azure role size: 4 nodes + 2 TB of disk space = small, 32 nodes + 16 TB = extra large
• Cluster login information
After entering these details, press the Request Cluster button. This creates the cluster environment for your job; the screen shows the progress of provisioning the new nodes for the cluster, as shown in figure 5.
After the provisioning completes, you will see a screen as shown in figure 6. You can start creating a new job, and if you want to access the environment you can use either the "Interactive Console" or "Remote Desktop".
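From a command prompt on the cluster (for example, over Remote Desktop), a streaming job built from the C# mapper and reducer sketched earlier would be submitted along the following lines. This is a hedged sketch: the -input, -output, -mapper, -reducer, and -file options are standard Hadoop streaming flags, but the streaming jar location, directories, and executable names are illustrative assumptions that depend on the distribution.

    hadoop jar hadoop-streaming.jar ^
        -input /user/demo/input ^
        -output /user/demo/output ^
        -mapper WordCountMapper.exe ^
        -reducer WordCountReducer.exe ^
        -file WordCountMapper.exe ^
        -file WordCountReducer.exe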
In the portal, when you click on New Job, you will see the screen shown in figure 7. The figure shows a Hadoop streaming based job.

About the Author: M Sheik Uduman Ali is a cloud architect at Aditi involved in its cloud practice. He is a blogger and has published an online book on "Domain Specific Languages in .NET".

ABOUT ADITI
Aditi helps product companies, web businesses and enterprises leverage the power of cloud, e-social and mobile to drive competitive advantage. We are Microsoft's Cloud Partner of the Year, one of the top 3 Platform-as-a-Service solution providers globally, and one of the top 5 Microsoft technology partners in the US. We are passionate about emerging technologies and focused on custom development.
