Windows Azure HDInsight Service Hadoop on Windows Azure NEIL MACKENZIE
Who Am I? Neil Mackenzie Windows Azure Architect @ Satory Global Windows Azure MVP Blog: http://convective.wordpress.com/ Twitter: @mknz Book: Microsoft Windows Azure Development Cookbook
Goals and Agenda Goals Introduce Windows Azure HDInsight Service to the Windows Azure developer Introduce Windows Azure to the Hadoop user Not a tutorial on how to use Hadoop features Agenda Big Data Windows Azure Windows Azure HDInsight Service
Big Data Problem: How do we create value from enormous amounts of low-value data? Solution: Analyze it using a lot of commodity hardware.
Three Vs of Big Data Volume How much data is there? Variety What are the sources of the data? Velocity How fast is the data being generated?
MapReduce Distributed computational model for data analysis. Map function: Processes a key-value pair to generate intermediate pairs Reduce function: Merges all intermediate values with the same intermediate key. Map and reduce functions allocated to many compute nodes with data stored locally. Raw MapReduce functions are written in Java.
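As a rough illustration of the model on this slide, here is a minimal single-process sketch in Python of a word count. It is not the Hadoop API (raw Hadoop MapReduce functions are written in Java), and the names map_fn, reduce_fn and run_mapreduce are hypothetical, chosen only for this example. The grouping step stands in for the shuffle that the framework would perform across compute nodes.

```python
from collections import defaultdict
from itertools import chain


def map_fn(key, value):
    """Map: take one (key, value) pair and emit intermediate (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    intermediate = chain.from_iterable(mapper(k, v) for k, v in records)
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


# Hypothetical input: (line number, line text) pairs.
records = [(0, "big data on azure"), (1, "big clusters process big data")]
word_counts = run_mapreduce(records, map_fn, reduce_fn)
```

On a real cluster the map calls would run on the nodes that hold each block of input, and only the grouped intermediate pairs would move between nodes.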
Apache Hadoop Modules: Hadoop Distributed File System (HDFS) MapReduce Related projects: HBase – scalable, distributed database Hive – data warehouse infrastructure Mahout – scalable machine learning library Pig – high-level data-flow language Other: Sqoop – import from and export to relational databases
Windows Azure Compute PaaS: Cloud Services, Windows Azure Web Sites IaaS: Virtual Machines Storage Windows Azure Storage Service: blobs, tables, queues Windows Azure SQL Database IaaS: Microsoft SQL Server, MongoDB, Cassandra, etc. Connectivity HTTP, TCP, UDP, Site-to-Site VPN Administration Portal, Service Management API
Windows Azure HDInsight Service Components: HadoopCore – v1.0.1 HDFS & ASV Pig – v0.9.3 Hive – v0.8.1 Sqoop – v1.4.2 Excel/Hive Note: this was formerly known as Hadoop on Azure.
Distributed File Systems HDFS Contents deleted when cluster deleted ASV Azure Storage Vault Data stored in Windows Azure Blob Storage Configured on Hadoop on Azure portal Contents survive deletion of Hadoop cluster Supports multi-level structure, e.g.: containername/input/file1
Pig Hadoop feature to perform data-flow operations, made up of: Execution environment Language: Pig Latin Execution Environment Runs locally in a single JVM or distributed on a Hadoop cluster Pig Latin High-level language Describes data-flow operations Automatically invokes MapReduce jobs Much simpler than using MapReduce directly
Hive Hadoop feature to perform data warehouse operations HiveQL high-level, SQL-like language Supports equi-joins Schema on read NOT schema on write Automatically invokes MapReduce jobs Much simpler than using MapReduce directly Metadata store Contains descriptions of tables
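The "schema on read" point can be sketched in Python as follows. This is only an illustration of the idea, not how Hive is implemented: the data, the SCHEMA list and the read_with_schema helper are all hypothetical. The raw text is stored untouched, and the table description is applied only when rows are read; in this sketch, fields that do not fit the schema become None.

```python
import csv
import io

# Raw file contents are stored untouched; nothing is validated on write.
RAW_DATA = "SEA,JFK,2012\nSFO,SEA,2012\nbad-row\n"

# Hypothetical table description, as a metadata store might hold it.
SCHEMA = [("origin", str), ("dest", str), ("year", int)]


def read_with_schema(raw_text, schema):
    """Apply the schema only while reading; fields that do not fit become None."""
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for index, (name, cast) in enumerate(schema):
            try:
                row[name] = cast(fields[index])
            except (IndexError, ValueError):
                row[name] = None
        yield row


rows = list(read_with_schema(RAW_DATA, SCHEMA))
```

With schema on write, the malformed third line would have been rejected at load time. Here it is loaded as-is and only shows up as missing values when queried.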
Hive Example
FROM flightdata_asv
INSERT OVERWRITE TABLE origin_counts
SELECT origin, COUNT(*)
GROUP BY origin
INSERT OVERWRITE TABLE dest_counts
SELECT dest, COUNT(*)
GROUP BY dest
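To make the shape of this multi-insert query concrete, here is a small Python sketch of what it computes: a single pass over the source rows that fills two separate aggregates, one keyed by origin and one by destination. The sample rows are invented for illustration and are not taken from the flightdata_asv table.

```python
from collections import Counter

# Hypothetical rows standing in for the flightdata_asv table.
flightdata = [
    {"origin": "SEA", "dest": "JFK"},
    {"origin": "SEA", "dest": "SFO"},
    {"origin": "SFO", "dest": "JFK"},
]

origin_counts = Counter()
dest_counts = Counter()

# One pass over the source feeds both aggregates, like the single FROM clause
# shared by the two INSERT ... SELECT ... GROUP BY branches.
for row in flightdata:
    origin_counts[row["origin"]] += 1
    dest_counts[row["dest"]] += 1
```

Sharing the scan is the reason to write the query with a leading FROM clause rather than as two independent statements.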
Sqoop Feature allowing import from and export to SQL databases Uses JDBC connector Works with Windows Azure SQL Database Table must exist before export
Excel and Hadoop on Azure Example of Microsoft business intelligence strategy Expose Hadoop to existing tools HiveODBC connector for Excel Create Hive queries from Excel Invoke them from Excel
More Information Sign up for preview: http://www.hadooponazure.com Support: http://social.msdn.microsoft.com/Forums/en-US/hdinsight Avkash Chauhan’s blog: http://blogs.msdn.com/b/avkashchauhan/archive/tags/hadoop Roger Jennings’ blog: http://oakleafblog.blogspot.com/2012/04/using-data-in-windows-azure-blobs-with.html
Summary Hadoop: De facto solution to the Big Data problem Windows Azure HDInsight Service Native Hadoop implementation Managed Hadoop service for Windows Azure Currently in preview