This document discusses Hadoop and its relationship to Microsoft technologies. It provides an overview of what Big Data is, how Hadoop fits into the Windows and Azure environments, and how to program against Hadoop in Microsoft environments. It describes Hadoop capabilities like Extract-Load-Transform and distributed computing. It also discusses how HDFS works on Azure storage and support for Hadoop in .NET, JavaScript, HiveQL, and Polybase. The document aims to show Microsoft's vision of making Hadoop better on Windows and Azure by integrating with technologies like Active Directory, System Center, and SQL Server. It provides links to get started with Hadoop on-premises and on Windows Azure.
2. Session Objectives
• What is Big Data?
• How it fits into the Windows and Windows Azure environments
• How do I program against it in the Microsoft environment?
3. What is Big Data?
• Traditionally:
  • Physics experiments, sensor data, satellite data, …
• Now:
  • Operational logs
  • Customer behavior
  • Social interactions online
  • …
• From terabytes in the 1990s, through petabytes today, to zettabytes in the future
5. Big Data.
• Volume (Size)
• Variety (Structure)
• Velocity (Speed)
6. New Questions.
• What's the social sentiment of my product?
• How do I better predict future outcomes?
• How do I optimize my services based on patterns of weather, traffic, etc.?
8. What is Hadoop (v1)?
• Processing platform for Big Data
• Uses the "Map-Reduce" processing paradigm
• Characteristics:
  • Highly scalable (scaled out)
  • Commodity hardware-based
  • Open source
=> Very low acquisition and storage costs
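The "Map-Reduce" paradigm named on this slide can be sketched in a few lines. The following is a minimal, single-process illustration in JavaScript, not Hadoop code: `runMapReduce`, `mapLogLine`, and `sumCounts` are hypothetical names chosen for this sketch, and the log lines are made-up sample input. It only shows the three phases (map, group by key, reduce) that Hadoop spreads across many machines.

```javascript
// Minimal single-process sketch of the Map-Reduce paradigm.
// All names here are hypothetical; Hadoop runs these phases across a cluster.

// Map phase: turn one input record into zero or more [key, value] pairs.
function mapLogLine(line) {
  const level = line.split(" ")[0]; // first token, e.g. "ERROR"
  return [[level, 1]];
}

// Reduce phase: combine all values that share one key.
function sumCounts(key, values) {
  return values.reduce((total, v) => total + v, 0);
}

// Driver: map every record, group pairs by key (the "shuffle"), then reduce.
function runMapReduce(records, mapFn, reduceFn) {
  const groups = new Map();
  for (const record of records) {
    for (const [key, value] of mapFn(record)) {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    }
  }
  const result = {};
  for (const [key, values] of groups) {
    result[key] = reduceFn(key, values);
  }
  return result;
}

// Made-up operational log lines used as sample input.
const sampleLogs = [
  "ERROR disk full",
  "INFO job started",
  "ERROR timeout",
  "INFO job finished",
  "WARN slow response",
];

const levelCounts = runMapReduce(sampleLogs, mapLogLine, sumCounts);
console.log(levelCounts);
```

Because the map and reduce functions only ever see one record or one key at a time, the driver is free to split the work across machines, which is what makes the approach scale out on commodity hardware.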
14. HDFS on Azure: Tale of two File Systems
[Diagram: the HDFS API sits on top of two file systems. One is DFS on the compute cluster, with a NameNode and one Data Node per Worker Role. The other is Azure Storage Vault (ASV), which uses containers on Azure Blob Storage and is built from Front end, Partition Layer, and Stream Layer tiers.]
15. .Net Map/Reduce Support
• Install NuGet
• Use NuGet to install the Microsoft .Net MapReduce API for Hadoop
• Provide an implementation of a HadoopJob
• Execute the job via either
  • MRLib\MRRunner.exe -dll ConsoleAppHadoopJob.exe
  or
  • HadoopJobExecutor.ExecuteJob<HadoopJobClass>();
• Collect your result on HDFS
16. JavaScript Map/Reduce Support
• Provide a map and a reduce function variable in a JS file
• Use the JavaScript console with
  • runJS("/user/myself/MRjob.js", "/path/to/inputfile", "/path/to/output/dir")
• Collect your result on HDFS
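As a rough illustration of the "map and reduce function variable" this slide mentions, a word-count job file might look like the sketch below. The `(key, value, context)` signatures, `context.write`, and the `values.hasNext()`/`values.next()` iterator are assumptions about the console's conventions that the slide does not spell out, so check them against the samples shipped with your cluster. The `runLocally` helper is purely hypothetical and exists only so the two functions can be exercised outside Hadoop.

```javascript
// Sketch of a job file such as MRjob.js: two top-level variables, map and reduce.
// ASSUMPTION: the (key, value, context) signatures and context.write are not
// given on the slide; they are assumed here for illustration.

var map = function (key, value, context) {
  // value is assumed to be one line of input text.
  var words = value.split(/[^a-zA-Z]+/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  // values is assumed to be an iterator with hasNext()/next().
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// HYPOTHETICAL local harness (not part of Hadoop): feeds lines through map,
// groups the emitted pairs by key, then calls reduce once per key.
function runLocally(lines) {
  var grouped = Object.create(null);
  lines.forEach(function (line, offset) {
    map(offset, line, {
      write: function (k, v) {
        (grouped[k] = grouped[k] || []).push(v);
      },
    });
  });
  var output = {};
  Object.keys(grouped).forEach(function (k) {
    var pos = 0;
    var iterator = {
      hasNext: function () { return pos < grouped[k].length; },
      next: function () { return grouped[k][pos++]; },
    };
    reduce(k, iterator, {
      write: function (outKey, v) { output[outKey] = v; },
    });
  });
  return output;
}

var counts = runLocally(["Big Data is big", "data is everywhere"]);
console.log(counts);
```

On a cluster only the `map` and `reduce` variables would matter; the harness stands in for the grouping that Hadoop performs between the two phases.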
17. Invoking HiveQL Queries
• Run queries in the Hadoop Command Shell after invoking hive
• Through the web console
• Programmatically through ODBC
• Coming soon: LINQ to Hive!
18. Polybase - Enhancing PDW query engine
[Diagram: Data scientists, BI users, and DB admins issue regular T-SQL and receive results from the enhanced PDW query engine. The engine spans Hadoop (unstructured data from social apps, sensors & RFID, mobile apps, and web apps) and PDW V2 (structured data from traditional schema-based DW applications).]
19. Microsoft Hadoop Vision
Better on Windows and Azure
• Active Directory
• System Center
• .Net Programmability
Microsoft Data Connectivity
• SQL Server / SQL Parallel Data Warehouse
• Azure Storage / Azure Data Market
Microsoft Business Intelligence (BI)
• Hive ODBC Connectivity
• BI Tools for Big Data
Collaborate with and Contribute to OSS
• Collaborate with Hortonworks
• Provide improvements and Windows support back to OSS
20. Getting started
• On prem: http://www.microsoft.com/bigdata/
  • Single node cluster (onebox) install
  • C:\hadoop
  • Starts local services
  • Can start/stop them with start-onebox.cmd/stop-onebox.cmd
  • Comes with:
    • Hadoop command line (shell)
    • Hadoop Status for name node and map-reduce cluster
    • HDInsight Dashboard
• On Windows Azure: http://HadoopOnAzure.com/
  • 3 node cluster running as a service in Azure
  • Can be used for 5 days
  • Provides samples and HDInsight Dashboard
  • TAP Program
Big Data
This is a picture down the center aisle of a shipping container from one of Microsoft's datacenters. We put ~1800 computers inside one of these containers. Some of us had the privilege of working on the data storage and computational platform that powers Bing. We used 22 of these containers, spanning 40,000 machines, where we stored over 100 PB of data. This was three years ago, and now these servers are almost obsolete.
Big Data is in constant motion and growing at an incredible rate, with 90% of the world's data generated in just the past two years. That's remarkable growth. Technology history has taught us that the one with the most data wins. Data empires like Twitter, Facebook, and Yahoo are all able to capitalize on the notion that data equates to power. More and more companies are utilizing Hadoop to power Big Data analytics and drive revenue and profit.
It's all about your data.
I'd like to introduce the 3 V's of Big Data. Is it big as in Volume, where your data exceeds the physical limits of today's systems? Is it Velocity, where the data is moving at a fast rate and its value can decay over time? Is it Variety of structure, from unstructured through semi-structured to highly structured data? (Doug Laney, http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf) The answer is that it's all of the above. Finally, some refer to a fourth V of Big Data, Value: the value of the insight that can be gained from your Big Data sources.
Given all of this data and the variety of sources, there are new questions that we can answer today that we couldn't just a few years ago. By asking and answering these questions you can reap the benefits of Big Data. Data is everywhere to be mined, but we have what one can call "the pomegranate problem". Imagine all of your data being inside a pomegranate. When you eat a pomegranate, getting all of the little pieces out takes a bit of work. That's the process you need to go through to extract insights from your data. It's useful to think of it this way: your data is the platform, not the tooling that surrounds it. It's all about the data. It's all about the questions that you ask.
The second thing I want to talk about is Hadoop and how Hadoop is set up to deliver breakthrough insights from your data. How many of you are familiar with Hadoop? How many of you are using Hadoop for projects today? How many are planning on using Hadoop in the next 12 months? How about in the cloud?
When people talk about Hadoop they are often talking about specific computational patterns, including MapReduce, which emerged as a method to process lots of unstructured data on top of a distributed storage system in a highly fault-tolerant and embarrassingly scalable way. Hadoop allows us to store and process large amounts of data on commodity hardware. In the past you would spend large amounts of money on very specialized hardware. Today you can do this with off-the-shelf hardware running Hadoop. Now, Hadoop doesn't have a monopoly on "big", "real time" or "unstructured", but it does provide some unique capabilities.
I'd like to share my experience with an internal Microsoft service: Halo 4. We launched Halo 4 recently; players are playing over 25 million games per day. Each of those games uploads many metrics, which are coming in every minute. Something amazing happened when we moved the Halo 4 event stream into Hadoop/Hive: we noticed a change in how we thought about data. We were freed from the constant anxiety of wondering how we were going to handle an ever-increasing amount of data, and we shifted from trying to store only what we really needed to storing everything. It's a digital shoebox of information. Then the questions started to shift from how and what to store to how to gain breakthrough insights from the data. In a traditional database or data warehouse you have to define the structure of the data, or schema, up front; with Hadoop you define the structure of the data when you use it. It's schema on read vs. schema on write.
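The schema-on-read idea can be illustrated with a small sketch. The event lines and field names below (`player`, `event`, `score`, `map`) are invented for illustration and are not actual Halo 4 telemetry, and `readWithSchema` is a hypothetical helper. Raw lines are stored untouched, and each query applies its own structure when it reads them.

```javascript
// Schema on read: raw events are stored exactly as they arrive, and structure
// is imposed only when a question is asked. All data here is made up.

// The "digital shoebox": raw lines, no schema enforced at write time.
const rawEvents = [
  '{"player":"p1","event":"kill","score":10}',
  '{"player":"p2","event":"death"}',
  '{"player":"p1","event":"kill","score":15,"map":"alpha"}',
  "not even json",
];

// HYPOTHETICAL helper: apply a schema (field name -> converter) while reading.
// Lines that cannot be parsed are skipped; missing fields become null.
function readWithSchema(lines, schema) {
  const rows = [];
  for (const line of lines) {
    let record;
    try {
      record = JSON.parse(line);
    } catch (err) {
      continue; // unparseable line: skipped by this reader, but still stored
    }
    if (record === null || typeof record !== "object") continue;
    const row = {};
    for (const [field, convert] of Object.entries(schema)) {
      row[field] = field in record ? convert(record[field]) : null;
    }
    rows.push(row);
  }
  return rows;
}

// Two different questions read the same raw data with two different schemas.
const scoreRows = readWithSchema(rawEvents, { player: String, score: Number });
const mapRows = readWithSchema(rawEvents, { event: String, map: String });

const totalScoreP1 = scoreRows
  .filter((r) => r.player === "p1" && r.score !== null)
  .reduce((sum, r) => sum + r.score, 0);

console.log(totalScoreP1);
console.log(mapRows);
```

Nothing about the stored lines had to change when the second question (which map was played) came along, which is the practical difference from a schema-on-write warehouse.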
Manage data of any type or size
To gain the full value of Big Data you need a modern data platform that manages data of any type, whether structured or unstructured, and of any size, from gigabytes to petabytes. Your Big Data solution should also manage data at rest or in motion. Leverage the power of HDInsight on Windows Server or as a Windows Azure service. HDInsight provides simplicity, ease of management, and an open, enterprise-ready Hadoop service that runs on premises or in the cloud.
Enrich your data with the world's data
HDInsight enables you to realize new value in the data you have and to combine these new insights with third-party datasets simply and elegantly. The time your data analysts spend trying to surface the right data and source for your precise needs is costly. By connecting to external data sources you can begin to answer new types of questions and deliver new value in ways that previously were not possible.
Gain insight from any data
You cannot begin to realize the value of Big Data until you can deliver new insights from all types of data: structured, unstructured, previously archived or discarded. The benefits of Big Data are not limited to business intelligence experts or data scientists. Nearly everyone in your organization can analyze and make more informed decisions with the right tools, including Microsoft Office Excel.
Key Capabilities
Any Data, Any Size, Anywhere
Microsoft Big Data offers an integrated platform for managing data of any shape or any size, whether it's structured data in relational databases, unstructured data with Hadoop, or streaming data.
Enterprise-ready Hadoop
• Seamlessly extend access privileges across HDInsight with Active Directory.
• Manage your HDInsight clusters easily with System Center 2012.
• Enjoy the reliability and high availability of 100% Apache Hadoop compatible HDInsight.
Gain Windows Simplicity and Manageability for Hadoop
• Simplicity on premises with a virtualized deployment model.
• Consistent platform on Windows or on Windows Azure with shared codebase.
• Deploy Hadoop easily thanks to smart packaging and cloud optimization from Microsoft.
Scale on Demand in the Cloud
• Benefit from deployment options for Big Data on both Windows Server and Windows Azure.
• Enjoy elastic scalability in the cloud.
• Gain better control of your data and costs.
Open Big Data Platform
• Gain from the strategic Microsoft and Hortonworks partnership.
• Leverage the benefits of Microsoft HDInsight that offers 100% compatibility with Apache Hadoop.
Connecting with the World's Data
Microsoft offers unparalleled opportunities for discovery and enrichment by enabling end users to connect to the world's data and services.
Connect Hadoop to the World via Windows Azure Marketplace
• Access a wide variety of data from reliable providers such as the U.S. Census Bureau, United Nations, and Dun & Bradstreet, to name a few.
• Take advantage of hundreds of applications built on the Windows Azure platform.
• Integrate smart data mining algorithms, such as Microsoft Translator, which uses machine learning for automated text translation.
Enrich Your Data with External Information Services
• Convert raw data into useful information through data transformation, advanced analytics, and mashups with external data.
• Utilize out-of-the-box tools, such as SQL Server Integration Services (SSIS) and Data Quality Services, for data transformation and cleansing.
• Enrich your raw data using smart analytical algorithms. (For instance, you can use a segmentation model to enhance targeting.)
Access Predictive Analytics on Hadoop
• Gain new insights through predictive analytics, the process of inferring relationships and predictions from huge quantities of data.
• Unlock new insights from all of your data using smart data-mining tools in SQL Server Analysis Services.
• Simplify the data mining process using the Data Mining Add-in for Excel.
• Integrate a range of data mining tools from the open source community, such as Mahout and R.
Immersive Insights, Wherever You Are
Microsoft Big Data empowers end users to gain insights from any data, whether structured or unstructured, with the familiar tools they use every day. Developers can build Big Data applications with tools for simplified Hadoop programming.
I see the real breakthrough insights coming through when you take traditional "Business Intelligence" and add more capabilities like machine learning, predictive analysis, statistical analysis, large-scale graph processing, pattern mining, trend analysis, and economic modeling, all of which are a reality in Hadoop today. The implications of this are quite astounding when you think about it. This is huge.
Big Data, in terms of data volume, variety, and velocity at scale, is the first problem. But Big Data solutions and technology by themselves don't solve business objectives. People don't have a Hadoop problem; they have analytics, pattern mining, trend analysis, statistical inference, economic modeling, and market regression problems. Data science starts where utility-class services like Big Data Hadoop end. The real opportunity is to expose data science to everyone.
As powerful as Hadoop is, today it's still more of a computer scientist's or academically trained analyst's tool than an enterprise analytics product. Hadoop itself is controlled through programming code rather than anything that looks like it was designed for business unit personnel. Hadoop data is often more "raw" and "wild" than data typically fed to data warehouse and OLAP (Online Analytical Processing) systems. This is where I and Microsoft see opportunity. Essentially: wouldn't it be cool if mere mortals could use this stuff and consume insights that come directly from Hadoop?
Microsoft HDInsight enables you to gain insight from virtually any data, connect with the world of data, improve decision making, and enhance the development of the next generation of products and services. Nearly everyone in your organization can analyze and make more informed decisions with the right tools. PowerPivot for Microsoft Excel and Power View for SharePoint give nearly all users a view into structured and unstructured data. With the Hive Add-in for Excel and Hive ODBC Driver, almost anyone in your organization can directly access Hadoop data from end-user tools. Hadoop simplifies programming for developers with JavaScript for MapReduce jobs. The JavaScript implementation can also reduce your code by up to 10 times compared to Java.
Front End: Security/Auth and scaled-out request handler
Partition Layer: Object layer; mapping of objects such as Tables, Blobs, Queues to streams (cached in Front End), CC
Stream Layer: 3-node HA, scale-out stream store