Windows Azure HDInsight Service


Describes the Hadoop features provided in Windows Azure HDInsight.

Published in: Technology

  1. Windows Azure HDInsight Service: Hadoop on Windows Azure (Neil Mackenzie)
  2. Who Am I?
      - Neil Mackenzie
      - Windows Azure Architect @ Satory Global
      - Windows Azure MVP
      - Blog:
      - Twitter: @mknz
      - Book: Microsoft Windows Azure Development Cookbook
  3. Goals and Agenda
      - Goals
          - Introduce Windows Azure HDInsight Service to the Windows Azure developer
          - Introduce Windows Azure to the Hadoop user
          - Not a tutorial on how to use Hadoop features
      - Agenda
          - Big Data
          - Windows Azure
          - Windows Azure HDInsight Service
  4. Big Data
      - Problem: How do we create value from enormous amounts of low-value data?
      - Solution: Analyze it using a lot of commodity hardware.
  5. Three Vs of Big Data
      - Volume: How much data is there?
      - Variety: What are the sources of the data?
      - Velocity: How fast is the data being generated?
  6. MapReduce
      - Distributed computational model for data analysis.
      - Map function: processes a key-value pair to generate intermediate key-value pairs.
      - Reduce function: merges all intermediate values that share the same intermediate key.
      - Map and reduce functions are allocated to many compute nodes, each working on data stored locally.
      - Raw MapReduce functions are written in Java.
  7. Apache Hadoop
      - Modules
          - Hadoop Distributed File System (HDFS)
          - MapReduce
      - Related projects
          - HBase – scalable, distributed database
          - Hive – data warehouse infrastructure
          - Mahout – scalable machine learning library
          - Pig – high-level data-flow language
      - Other
          - Sqoop – import and export to relational databases
  8. 8. Windows Azure Compute  PaaS: Cloud Services, Windows Azure Web Sites  IaaS: Virtual Machines Storage  Windows Azure Storage Service: blobs, tables, queues  Windows Azure SQL Database  IaaS: Microsoft SQL Server, MongoDB, Cassandra, etc. Connectivity  HTTP, TCP, UDP, Site-to-Site VPN Administration  Portal, Service Management API
  9. Windows Azure HDInsight Service
      - Components
          - Hadoop Core – v1.0.1
          - HDFS & ASV
          - Pig – v0.9.3
          - Hive – v0.8.1
          - Sqoop – v1.4.2
          - Excel/Hive
      - Note: this was formerly known as Hadoop on Azure.
  10. Hadoop Administration
      - Portal
          - Apply to join preview
          - Create and manage Hadoop cluster (3 nodes for 5 days)
          - Access the interactive console
      - Interactive console
          - Hive: invoke Hive statements
          - JavaScript: invoke HDFS commands; invoke Hive & Pig statements
  11. Distributed File Systems
      - HDFS
          - Contents deleted when cluster deleted
      - ASV (Azure Storage Vault)
          - Data stored in Windows Azure Blob Storage
          - Configured on Hadoop on Azure portal
          - Contents survive deletion of Hadoop cluster
          - Supports multi-level structure, e.g.: containername/input/file1
  12. Pig
      - Hadoop feature to perform data-flow operations, consisting of an execution environment and a language (Pig Latin)
      - Execution environment
          - Local in a local JVM, or distributed on a Hadoop cluster
      - Pig Latin
          - High-level language
          - Describes data-flow operations
          - Automatically invokes MapReduce jobs
          - Much simpler than using MapReduce directly
  13. Pig Example

          records = LOAD 'asv://flightdata/input/flightdata.txt'
              AS (year:int, month:int, day:int, carrier:chararray,
                  origin:chararray, dest:chararray, depdelay:int, arrdelay:int);
          modified_records = FOREACH records GENERATE origin, depdelay;
          STORE modified_records INTO 'my_output' USING PigStorage(',');
  14. Hive
      - Hadoop feature to perform data warehouse operations
      - HiveQL
          - High-level, SQL-like language
          - Supports equi-joins
          - Schema on read, not schema on write
          - Automatically invokes MapReduce jobs
          - Much simpler than using MapReduce directly
      - Metadata store
          - Contains descriptions of tables
  15. Hive Example

          FROM flightdata_asv
          INSERT OVERWRITE TABLE origin_counts
              SELECT origin, COUNT(*) GROUP BY origin
          INSERT OVERWRITE TABLE dest_counts
              SELECT dest, COUNT(*) GROUP BY dest
  16. Sqoop
      - Hadoop feature for importing from and exporting to SQL databases
      - Uses JDBC connector
      - Works with Windows Azure SQL Database
      - Table must exist before export
  17. Sqoop Example: exporting a table (a single command, entered on one line)

          sqoop.cmd export --connect "jdbc:sqlserver://;database=sql_database_instance;user=sqoop_login@sql_database_server;password=sqoop_login_password" --table sql_database_table --export-dir "/user/hive/warehouse/hive_table" --input-fields-terminated-by "\001"
  18. Excel and Hadoop on Azure
      - Example of Microsoft business intelligence strategy
          - Expose Hadoop to existing tools
      - Hive ODBC connector for Excel
          - Create Hive queries from Excel
          - Invoke them from Excel
  19. More Information
      - Sign up for preview:
      - Support:
      - Avkash Chauhan's blog:
      - Roger Jennings' blog: windows-azure-blobs-with.html
  20. Summary
      - Hadoop
          - De facto solution to the Big Data problem
      - Windows Azure HDInsight Service
          - Native Hadoop implementation
          - Managed Hadoop service for Windows Azure
          - Currently in preview
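
The MapReduce model described on slide 6 can be sketched in a few lines of Python. This is only an illustrative, single-process sketch under stated assumptions, not how HDInsight runs jobs (the slides note that raw MapReduce functions are written in Java). The sample records, the assumption that the flight data is comma-separated, and the names `map_fn`, `reduce_fn`, and `run_mapreduce` are all hypothetical. It computes the same per-origin counts as the Hive example on slide 15.

```python
from collections import defaultdict

# Hypothetical sample input: (line_number, flight line) pairs. The field order
# follows the Pig example (year, month, day, carrier, origin, dest, depdelay,
# arrdelay); the comma delimiter is an assumption.
SAMPLE = [
    (0, "2012,1,1,AA,SEA,SFO,5,3"),
    (1, "2012,1,1,UA,SFO,SEA,0,-2"),
    (2, "2012,1,2,AA,SEA,LAX,12,10"),
]


def map_fn(key, value):
    """Map: process one key-value pair and emit intermediate (origin, 1) pairs."""
    fields = value.split(",")
    yield fields[4], 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Apply the mapper, group by intermediate key, then apply the reducer."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # On a real cluster the grouping (shuffle) moves data between nodes;
    # here it is just an in-memory dictionary.
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


origin_counts = run_mapreduce(SAMPLE, map_fn, reduce_fn)
print(origin_counts)
```

On a cluster, the map and reduce calls would be spread across many compute nodes. The Pig and Hive slides show how a result like this is expressed without writing the two functions by hand.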