Windows Azure HDInsight Service
Describe the Hadoop features provided in Windows Azure HDInsight


Published in Technology
  • 1. Windows Azure HDInsight Service Hadoop on Windows Azure NEIL MACKENZIE
  • 2. Who Am I?
    Neil Mackenzie
      • Windows Azure Architect @ Satory Global
      • Windows Azure MVP
      • Blog:
      • Twitter: @mknz
      • Book: Microsoft Windows Azure Development Cookbook
  • 3. Goals and Agenda
    Goals
      • Introduce Windows Azure HDInsight Service to the Windows Azure developer
      • Introduce Windows Azure to the Hadoop user
      • Not a tutorial on how to use Hadoop features
    Agenda
      • Big Data
      • Windows Azure
      • Windows Azure HDInsight Service
  • 4. Big Data
    Problem:
      • How do we create value from enormous amounts of low-value data?
    Solution:
      • Analyze it using a lot of commodity hardware.
  • 5. Three Vs of Big Data
    Volume
      • How much data is there?
    Variety
      • What are the sources of the data?
    Velocity
      • How fast is the data being generated?
  • 6. MapReduce
    Distributed computational model for data analysis.
      • Map function: processes a key-value pair to generate intermediate pairs.
      • Reduce function: merges all intermediate values with the same intermediate key.
    Map and reduce functions are allocated to many compute nodes with data stored locally.
    Raw MapReduce functions are written in Java.
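
The map and reduce roles described on this slide can be sketched in a few lines of Python. This is only a single-process illustration of the model, not Hadoop's Java API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the in-memory grouping step stands in for the shuffle that a real cluster performs between nodes.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: process one (key, value) input pair and emit intermediate pairs."""
    # key is the line number (unused here); value is one line of text.
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run mapper over every record, group by intermediate key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for intermediate_key, intermediate_value in mapper(key, value):
            groups[intermediate_key].append(intermediate_value)
    return dict(reducer(k, v) for k, v in groups.items())


# Made-up input records: (line number, line text).
lines = [(0, "big data big value"), (1, "data from commodity hardware")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
```

On a real cluster the mapper calls run in parallel on the nodes that hold the data, and only the grouped intermediate pairs move across the network.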
  • 7. Apache Hadoop
    Modules:
      • Hadoop Distributed File System (HDFS)
      • MapReduce
    Related projects:
      • HBase – scalable, distributed database
      • Hive – data warehouse infrastructure
      • Mahout – scalable machine learning library
      • Pig – high-level data-flow language
    Other:
      • Sqoop – import and export to relational databases
  • 8. Windows Azure Compute  PaaS: Cloud Services, Windows Azure Web Sites  IaaS: Virtual Machines Storage  Windows Azure Storage Service: blobs, tables, queues  Windows Azure SQL Database  IaaS: Microsoft SQL Server, MongoDB, Cassandra, etc. Connectivity  HTTP, TCP, UDP, Site-to-Site VPN Administration  Portal, Service Management API
  • 9. Windows Azure HDInsight Service Components:  HadoopCore – v1.0.1  HDFS & ASV  Pig – v0.9.3  Hive – v0.8.1  Sqoop – v1.4.2  Excel/Hive Note: this was formerly known as Hadoop on Azure.
  • 10. Hadoop Administration
    Portal
      • Apply to join preview
      • Create and manage Hadoop cluster (3 nodes for 5 days)
      • Access the Interactive console
    Hive
      • Invoke Hive statements
    JavaScript
      • Invoke HDFS commands
      • Invoke Hive & Pig statements
  • 11. Distributed File Systems
    HDFS
      • Contents deleted when cluster deleted
    ASV
      • Azure Storage Vault
      • Data stored in Windows Azure Blob Storage
      • Configured on Hadoop on Azure portal
      • Contents survive deletion of Hadoop cluster
      • Supports multi-level structure, e.g.: containername/input/file1
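
To make the multi-level ASV naming concrete, here is a small Python sketch that splits an ASV path into its blob container and the path inside that container. The helper `split_asv_uri` is hypothetical and handles only the short `asv://container/path` form used in this deck's Pig example; any other URI forms the service may accept are outside what it assumes.

```python
def split_asv_uri(uri):
    """Split 'asv://container/dir/file' into (container, 'dir/file').

    Hypothetical helper: it assumes the short asv://container/path form only.
    """
    scheme = "asv://"
    if not uri.startswith(scheme):
        raise ValueError("not an ASV URI: " + uri)
    # Everything up to the first slash is the container; the rest is the blob path.
    container, _, blob_path = uri[len(scheme):].partition("/")
    if not container:
        raise ValueError("missing container name: " + uri)
    return container, blob_path


container, blob_path = split_asv_uri("asv://flightdata/input/flightdata.txt")
```

The slashes inside the blob path are what give a flat blob container the appearance of a directory tree.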
  • 12. Pig
    Hadoop feature to perform data-flow operations:
      • Execution environment
      • Language: Pig Latin
    Execution Environment
      • Local in local JVM or distributed on Hadoop cluster
    Pig Latin
      • High-level language
      • Describes data-flow operations
      • Automatically invokes MapReduce jobs
      • Much simpler than using MapReduce directly
  • 13. Pig Example
      records = LOAD 'asv://flightdata/input/flightdata.txt'
          AS (year:int, month:int, day:int, carrier:chararray,
              origin:chararray, dest:chararray, depdelay:int, arrdelay:int);
      modified_records = FOREACH records GENERATE origin, depdelay;
      STORE modified_records INTO 'my_output' USING PigStorage(',');
  • 14. Hive
    Hadoop feature to perform data warehouse operations
    HiveQL
      • High-level, SQL-like language
      • Supports equi-joins
      • Schema on read, not schema on write
      • Automatically invokes MapReduce jobs
      • Much simpler than using MapReduce directly
    Metadata store
      • Contains descriptions of tables
  • 15. Hive Example
      FROM flightdata_asv
      INSERT OVERWRITE TABLE origin_counts
          SELECT origin, COUNT(*) GROUP BY origin
      INSERT OVERWRITE TABLE dest_counts
          SELECT dest, COUNT(*) GROUP BY dest
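
The Hive example is a multi-table insert: a single `FROM` clause feeds two aggregations. The Python sketch below shows the same idea of computing both groupings in one pass over the source rows. It runs on a small made-up in-memory sample, not on Hive, and `count_origins_and_dests` is a hypothetical name.

```python
from collections import Counter


def count_origins_and_dests(rows):
    """Compute COUNT(*) GROUP BY origin and GROUP BY dest in one pass."""
    origin_counts = Counter()
    dest_counts = Counter()
    for row in rows:
        origin_counts[row["origin"]] += 1
        dest_counts[row["dest"]] += 1
    return origin_counts, dest_counts


# Made-up sample rows standing in for the flightdata_asv table.
flights = [
    {"origin": "SEA", "dest": "SFO"},
    {"origin": "SEA", "dest": "LAX"},
    {"origin": "SFO", "dest": "SEA"},
]
origin_counts, dest_counts = count_origins_and_dests(flights)
```

Reading the source once matters when the table is large, which is the motivation for putting both `INSERT` clauses under one `FROM`.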
  • 16. Sqoop
    Feature allowing import and export from SQL databases
      • Uses JDBC connector
      • Works with Windows Azure SQL Database
      • Table must exist before export
  • 17. Sqoop Example
    Exporting a table:
      sqoop.cmd export --connect "jdbc:sqlserver://;database=sql_database_instance;user=sqoop_login@sql_database_server;password=sqoop_login_password" --table sql_database_table --export-dir "/user/hive/warehouse/hive_table" --input-fields-terminated-by "\001"
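
The `--input-fields-terminated-by` option tells Sqoop how the fields of the exported files are separated. Assuming the value is meant to be the Ctrl-A character (octal `\001`), which is Hive's default field delimiter, the Python sketch below shows how one line of such a file splits into columns. `parse_hive_row` and the sample line are hypothetical.

```python
# Ctrl-A (octal 001, hex 01): Hive's default field delimiter for text tables.
FIELD_DELIMITER = "\x01"


def parse_hive_row(line):
    """Split one line of a Hive text-format file into its column values."""
    return line.rstrip("\n").split(FIELD_DELIMITER)


# Made-up line as it might appear under /user/hive/warehouse/hive_table.
row = parse_hive_row("SEA\x0142\n")
```

Because Ctrl-A is invisible in most editors, a file in this format can look as if its columns run together even though they are cleanly delimited.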
  • 18. Excel and Hadoop on Azure
    Example of Microsoft business intelligence strategy
      • Expose Hadoop to existing tools
    HiveODBC connector for Excel
      • Create Hive queries from Excel
      • Invoke them from Excel
  • 19. More Information
      • Sign up for preview:
      • Support:
      • Avkash Chauhan’s blog:
      • Roger Jennings’ blog: windows-azure-blobs-with.html
  • 20. Summary
    Hadoop:
      • De facto solution to the Big Data problem
    Windows Azure HDInsight Service
      • Native Hadoop implementation
      • Managed Hadoop service for Windows Azure
      • Currently in preview