Windows Azure HDInsight Service

Describe the Hadoop features provided in Windows Azure HDInsight

Transcript

  • 1. Windows Azure HDInsight Service: Hadoop on Windows Azure. Neil Mackenzie
  • 2. Who Am I? Neil Mackenzie; Windows Azure Architect @ Satory Global; Windows Azure MVP. Blog: http://convective.wordpress.com/ Twitter: @mknz Book: Microsoft Windows Azure Development Cookbook
  • 3. Goals and Agenda. Goals: introduce Windows Azure HDInsight Service to the Windows Azure developer; introduce Windows Azure to the Hadoop user; not a tutorial on how to use Hadoop features. Agenda: Big Data; Windows Azure; Windows Azure HDInsight Service.
  • 4. Big Data. Problem: How do we create value from enormous amounts of low-value data? Solution: Analyze it using a lot of commodity hardware.
  • 5. Three Vs of Big Data. Volume: How much data is there? Variety: What are the sources of the data? Velocity: How fast is the data being generated?
  • 6. MapReduce Distributed computational model for data analysis.  Map function:  Processes a key-value pair to generate intermediate pairs  Reduce function:  Merges all intermediate values with the same intermediate key. Map and reduce functions allocated to many compute nodes with data stored locally. Raw MapReduce functions are written in Java.
  • 7. Apache Hadoop Modules:  Hadoop Distributed File System (HDFS)  MapReduce Related projects:  HBase – scalable, distributed database  Hive – data warehouse infrastructure  Mahout – scalable machine learning library  Pig – high-level data-flow language Other:  Sqoop –import and export to relational database
  • 8. Windows Azure Compute  PaaS: Cloud Services, Windows Azure Web Sites  IaaS: Virtual Machines Storage  Windows Azure Storage Service: blobs, tables, queues  Windows Azure SQL Database  IaaS: Microsoft SQL Server, MongoDB, Cassandra, etc. Connectivity  HTTP, TCP, UDP, Site-to-Site VPN Administration  Portal, Service Management API
  • 9. Windows Azure HDInsight Service Components:  HadoopCore – v1.0.1  HDFS & ASV  Pig – v0.9.3  Hive – v0.8.1  Sqoop – v1.4.2  Excel/Hive Note: this was formerly known as Hadoop on Azure.
  • 10. Hadoop Administration Portal  http://www.hadooponazure.com  Apply to join preview  Create and manage Hadoop cluster  3 nodes for 5 days  Access the Interactive console  Hive  Invoke Hive statements  JavaScript  Invoke HDFS commands  Invoke Hive & Pig statements
  • 11. Distributed File Systems. HDFS: contents deleted when the cluster is deleted. ASV (Azure Storage Vault): data stored in Windows Azure Blob Storage; configured on the Hadoop on Azure portal; contents survive deletion of the Hadoop cluster; supports multi-level structure, e.g. containername/input/file1.
  • 12. Pig. Hadoop feature to perform data-flow operations, made up of an execution environment and a language (Pig Latin). Execution environment: runs locally in a single JVM or distributed on a Hadoop cluster. Pig Latin: high-level language; describes data-flow operations; automatically invokes MapReduce jobs; much simpler than using MapReduce directly.
  • 13. Pig Example
      records = LOAD 'asv://flightdata/input/flightdata.txt'
          AS (year:int, month:int, day:int, carrier:chararray, origin:chararray, dest:chararray, depdelay:int, arrdelay:int);
      modified_records = FOREACH records GENERATE origin, depdelay;
      STORE modified_records INTO 'my_output' USING PigStorage(',');
  • 14. Hive Hadoop feature to perform data warehouse operations HiveQL  high-level, SQL-like language  Supports equi-joins  Schema on read NOT schema on write  Automatically invokes MapReduce jobs  Much simpler than using MapReduce directly Metadata store  Contains descriptions of tables
  • 15. Hive Example
      FROM flightdata_asv
      INSERT OVERWRITE TABLE origin_counts
          SELECT origin, COUNT(*) GROUP BY origin
      INSERT OVERWRITE TABLE dest_counts
          SELECT dest, COUNT(*) GROUP BY dest;
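As a rough illustration of what the two INSERT clauses compute, the sketch below reproduces the same two grouped counts in plain Python. It is not how Hive executes the query (Hive compiles it into MapReduce jobs); the sample rows and the function name `count_origins_and_dests` are hypothetical, and only the column names `origin` and `dest` come from the slide.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical sample rows standing in for the flightdata_asv table.
flights = [
    {"origin": "SEA", "dest": "SFO"},
    {"origin": "SEA", "dest": "JFK"},
    {"origin": "SFO", "dest": "SEA"},
]


def count_origins_and_dests(
    rows: Iterable[Dict[str, str]],
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Produce both grouped counts while reading the source rows only once."""
    origin_counts: Counter = Counter()
    dest_counts: Counter = Counter()
    for row in rows:
        origin_counts[row["origin"]] += 1  # SELECT origin, COUNT(*) GROUP BY origin
        dest_counts[row["dest"]] += 1      # SELECT dest, COUNT(*) GROUP BY dest
    return dict(origin_counts), dict(dest_counts)


origin_counts, dest_counts = count_origins_and_dests(flights)
print(origin_counts)
print(dest_counts)
```

Placing FROM before both INSERT clauses in the HiveQL statement expresses the same idea: one source table feeding two result tables.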
  • 16. Sqoop Feature allowing import and export from SQL databases  Uses JDBC connector  Works with Windows Azure SQL Database  Table must exist before export
  • 17. Sqoop Example Exporting a table:sqoop.cmd export –connect"jdbc:sqlserver://sql_database_server.database.windows.net:1433;database=sql_database_instance;user=sqoop_login@sql_database_server;password=sqoop_login_password"--table sql_database_table--export-dir "/user/hive/warehouse/hive_table"--input-fields-terminated-by "001"
  • 18. Excel and Hadoop on Azure. Example of Microsoft business intelligence strategy: expose Hadoop to existing tools. HiveODBC connector for Excel: create Hive queries from Excel; invoke them from Excel.
  • 19. More Information Sign up for preview: http://www.hadooponazure.com Support: http://social.msdn.microsoft.com/Forums/en-US/hdinsight Avkash Chauhan’s blog: http://blogs.msdn.com/b/avkashchauhan/archive/tags/hadoop Roger Jennings’ blog: http://oakleafblog.blogspot.com/2012/04/using-data-in- windows-azure-blobs-with.html
  • 20. Summary Hadoop:  De-facto solution to the Big Data problem Windows Azure HDInsight Service  Native Hadoop implementation  Managed Hadoop service for Windows Azure  Currently in preview