Modern Data Warehousing with the Microsoft Analytics Platform System

The traditional data warehouse has served us well for many years, but new trends are causing it to break in four different ways: data growth, fast query expectations from users, non-relational/unstructured data, and cloud-born data. How can you prevent this from happening? Enter the modern data warehouse, which is able to handle and excel with these new trends. It handles all types of data (Hadoop), provides a way to easily interface with all these types of data (PolyBase), and can handle “big data” while providing fast queries. Is there one appliance that can support this modern data warehouse? Yes! It is the Analytics Platform System (APS) from Microsoft (formerly called Parallel Data Warehouse, or PDW), which is a Massively Parallel Processing (MPP) appliance that has recently been updated (v2 AU1). In this session I will dig into the details of the modern data warehouse and APS. I will give an overview of the APS hardware and software architecture, identify what makes APS different, and demonstrate the increased performance. In addition I will discuss how Hadoop, HDInsight, and PolyBase fit into this new modern data warehouse.

  • Key goal of slide: To convey what every IT person knows: the data warehouse and what it's for. Then we set up the Gartner quote to say that there is a tipping point. End the slide with a question: Why is it at a tipping point?
     
    Slide talk track:
    What is the “traditional” data warehouse?
    IT professionals know this well. A data warehouse, or enterprise data warehouse, is a database that was designed specifically for data analysis. It is the single source of truth, or the central repository for all data in the company. This means disparate data in the company coming from your transactional systems and your ERP, CRM, or line-of-business applications would all be extracted, transformed, cleansed, and put into the warehouse. It was built so that the people accessing the warehouse using BI tools are accessing data that has been provisioned by IT and represents accurate data sanctioned by the company.

    However, this traditional data warehouse is reaching an inflection point. Gartner, in their analysis of the state of data warehousing, noted that it is reaching the most significant tipping point since its inception. The question is why? What is going on?
  • Key goal of slide: To convey that the traditional data warehouse is going to break in one of four different ways. These ways should also not be a surprise to the IT professionals. At the end of the slide, IT should be asking, what can I do to prevent my warehouse from breaking?

    Slide talk track:
    There are many reasons why data warehouses are at a tipping point where something needs to change.
    The first trend that will break my traditional data warehouse is data growth. Data volumes are expected to grow 10X over the next five years and traditional data warehouses cannot keep up with this explosion of data.
    In addition to growing data, end users expect to get query results back faster, in near real time. End users are no longer apt to wait minutes to hours for their results, which is something traditional data warehouses cannot keep up with. They also want real-time data, not dated data pulled in during a maintenance window each night.
    The third trend is new types of data captured that are “non-relational.” 85% of data growth is coming from “non-relational” data in the form of things like web logs, sensor data, social sentiment and devices. You’ve probably heard the term “Big Data” and “Hadoop” quite a bit. This is where these technologies come into play. More on that later….
    The final trend that is appearing is cloud-born data. This is data that might be coming from some of IT's infrastructure that they are starting to host in the cloud (e.g., CRM, ERP) or data not stored by any type of corporate-owned system. How do you incorporate both on-premises and cloud data as part of your data warehouse? This is the last trend that is breaking the traditional data warehouse.

  • Key goal of slide: To convey that the modern data warehouse is something that the traditional data warehouse must evolve to. To have IT agree that their warehouses need to take advantage of these new technologies (specifically focusing on the middle and bottom layer).

    Slide talk track:
    To encompass these four trends, we need to evolve our traditional data warehouse to ensure that it does not break. It needs to become the “modern data warehouse.” What is the “modern data warehouse?” This is the new warehouse that is able to excel with these new trends and can be your warehouse now and into the future.
    The modern data warehouse has the ability to:
    Handle all types of data. Whether it be your structured, relational data sources or your non-relational data sources, the Modern data warehouse will incorporate Hadoop. It can handle real-time data by using complex event processor technologies.
    Provide a way to enrich your data with Extract, Transform, Load (ETL) capabilities as well as Master Data Management (MDM) and data quality.
    Provide a way for any BI tool or query mechanism to interface with all these different types of data with a single query model that leverages a single query language that users already know (example: SQL).
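    As an illustration of that single query model, one familiar T-SQL statement could span both worlds (the table names below are hypothetical, and the Hadoop-backed table is assumed to have been declared as a PolyBase external table beforehand):

    ```sql
    -- Hypothetical example: one SQL query spanning both worlds.
    -- dbo.DimCustomer lives in the relational warehouse (PDW);
    -- dbo.WebClicks_ext is assumed to be a PolyBase external table over Hadoop data.
    SELECT   c.CustomerName,
             COUNT(*) AS ClickCount
    FROM     dbo.DimCustomer AS c
    JOIN     dbo.WebClicks_ext AS w
        ON   w.CustomerId = c.CustomerId
    GROUP BY c.CustomerName;
    ```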

    Questions drive BI, Analytics drive questions

  • Top: the solution choice; bottom: the problem if you choose it

    Key goal of slide: To convey the limitations of current modern data warehouse options in the market.


    Slide talk track:
    Organizations are facing the challenge of now turning to two platforms for managing their data—relational database management systems (RDBMS) for traditional data and Apache Hadoop, the most widely used open source Big Data platform for large, non-relational data.

    Many brand-new tier-one appliances are expensive. Major vendors offer tier-one RDBMS appliances. However, many of these come with a high price tag, averaging millions of dollars, and in-company politics may result in long struggles to approve and implement them. Further, most of these appliances focus on point solutions instead of general-purpose ones and do not include a Hadoop solution, requiring a separate, additional appliance and ecosystem.

    Hadoop solutions are complex. Vendors can provide a Hadoop solution to you as their own distribution of Hadoop or as an appliance that comes pre-installed with Hadoop. The problem is that the Hadoop ecosystem requires significant training investment, and a major effort is needed to integrate it. There is a steep learning curve and ongoing operational cost when your IT department needs to re-orient themselves around HDFS, MapReduce, Hive, and HBase, rather than T-SQL and a standard RDBMS design. The result is often increased cost at a time when IT is expected to streamline.

    BI tools are unfamiliar. Surveys from Gartner, The BI Survey, and Intelligent Enterprise have found abysmal BI adoption of current solutions (~8%) due to complaints of the complexity of the tools and the cost of the solution. Users want tools they already know and can consume, but no vendor can deliver on all the solutions you need at a reasonable cost or in a natively-integrated manner.

    Troubleshooting, support and maintenance. Keeping up with configuration changes, support and maintenance with troubleshooting is not trivial.

    Today's world of data is changing rapidly, and organizations need a modern data warehouse to adapt successfully to these changes. However, companies want the smoothest path to this transformation: a path where costs, downtime, and training are minimal, and where performance and accessibility to data insights are vastly improved.
  • Key goal of slide: To convey that the major pillars of the Analytics Platform System with key points.

    To help organizations with a simple, smooth, and seamless transition to this new world of data, Microsoft introduces the Microsoft Analytics Platform System (APS) – the only no-compromise modern data warehouse solution that brings both Hadoop and RDBMS into a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility for all their users through some of the most widely used BI tools in the industry.
     
    Enterprise-ready Big Data: Microsoft APS combines Microsoft's industry-leading RDBMS platform, the Parallel Data Warehouse appliance (PDW), with Microsoft's Hadoop distribution, HDInsight, for non-relational data to offer an all-in-one Big Data analytics appliance.

    Tying together and integrating the worlds of relational and Hadoop data is PolyBase, Microsoft’s integrated query tool available only in APS.

    Your Modern Data Warehouse in One Turnkey Appliance
    APS integrates PDW and HDInsight to operate seamlessly together in a single appliance

    Integrated Querying across All Data Types Using T-SQL
    PolyBase allows Hadoop data to be queried using feature-rich T-SQL, while taking advantage of Hadoop processing, without additional Hadoop-based skills or training.

    Enterprise-Ready Hadoop
    HDInsight is Microsoft's Hadoop-based distribution, with end-user authentication via Active Directory, managed by IT using System Center.

    Big Data Insights to Any User
    Native Microsoft BI integration within PolyBase allows everyone access to insights through familiar tools such as SSAS and Excel

    Next-generation performance at scale: APS was built to scale into multi-petabytes, handling both RDBMS data and the data stored in Hadoop, to deliver the performance that meets today's near real-time and rapid-insight requirements.

    Scale-Out to accommodate your Growing Data
    APS contains PDW and HDInsight that both have linear scale-out architecture. Start small with a few terabytes and dynamically add capacity for seamless, linear scale-out

    Remove DW bottlenecks with MPP SQL Server
    Get the dynamic performance and scale that your modern data warehouse requires while retaining your skills and investment in SQL Server.

    Real-Time Performance with In-Memory
    Provides up to 100x improvement in query performance and 15x compression via updateable in-memory columnstore

    Concurrency that Supports High Adoption
    Scales with the number of simultaneous users. APS has high concurrency, allowing for multiple workloads.

    Optimal architecture: More than just a converged system, APS has reshaped the very hardware specifications required through software innovations to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value:

    APS Provides the Industry’s Lowest DW Price/TB
    Lower cost while maintaining performance by replacing a SAN with economical Windows Server 2012 Storage Spaces
    Save up to 70% of APS storage with up to 15x compression via updateable in-memory columnstore

    Value through Single Appliance Solution
    Reduce hardware footprint by having PDW and HDInsight within a single appliance
    Remove the need for costly integration efforts

    Value through Flexible Hardware Options
    Avoid hardware lock-in through flexible hardware options from HP, Dell, and Quanta
  • The Analytics Platform System is a pre-built appliance that ships to your door. As an appliance, all of the hardware has been pre-built: servers, storage arrays, switches, power, racks, and more. Also, all the software has been installed, configured, and tuned.
     
    Customers are delivered a fully packaged appliance solution that just works. All they have to do is plug the appliance in and start integrating their specific data into the solution.


    KEY POINT
    Use an interesting story to show how the new modern data warehouse can handle real-time performance with in-memory technologies.
     
    TALK TRACK
    We have a flexible choice of hardware vendors – there’s no lock-in to hardware that may not fit your exact needs and may also require unnecessarily expensive hardware due to lack of choice.

    Operating a Big Data analytics platform can be as simple as this. Avoid proprietary hardware lock-ins like others try to sell you, and rely on basic industry-standard components instead. The Microsoft Analytics Platform System gives customers the flexibility to choose their preferred hardware from Dell, HP, or Quanta, and each hardware choice has been designed, engineered, and tuned to perform optimally.
  • 8 tables (8 filegroups, since there is 1 filegroup per table). Each filegroup is made up of 2 physical files. Each scale unit has two compute nodes, so 16 filegroups and therefore 32 files. Each unit has 32 CPU cores, so there is 1 core for each file.

    Want high cardinality for distribution key

    PDW distributes a single large logical table into 8 smaller physical tables on each server.
    The distribution is performed by selecting a column in each table and applying a hash function to it.
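    As a sketch of that idea, a distributed table in PDW is declared by naming the hash column in the table's WITH clause (the table and columns below are made up for illustration; a high-cardinality column such as a customer key is a typical choice):

    ```sql
    -- Hypothetical fact table distributed across all compute nodes
    -- by hashing CustomerId; each node holds 8 distributions.
    CREATE TABLE dbo.FactSales
    (
        SaleId     BIGINT NOT NULL,
        CustomerId INT    NOT NULL,  -- high-cardinality distribution key
        SaleDate   DATE   NOT NULL,
        Amount     MONEY  NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH (CustomerId)
    );
    ```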
  • Partition 2 years by day = 2,568,493

    40 servers * 8 tables = 320 tables

    This horizontal partitioning breaks the table up into 8 distributions per compute node. Each of these distributions (essentially a table in and of itself) has dedicated CPU and disk, which is the essence of Massively Parallel Processing in APS. There are 8 internal disks per compute node.
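    Partitioning is layered on top of distribution in the same WITH clause. A hedged sketch (hypothetical table, boundary list abbreviated rather than spelling out two years of days):

    ```sql
    -- Each distribution is further range-partitioned by day;
    -- boundary values are abbreviated here for readability.
    CREATE TABLE dbo.FactSales
    (
        SaleId     BIGINT NOT NULL,
        CustomerId INT    NOT NULL,
        SaleDate   DATE   NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH (CustomerId),
        PARTITION (SaleDate RANGE RIGHT FOR VALUES
            ('2013-01-01', '2013-01-02', '2013-01-03' /* ...one boundary per day... */))
    );
    ```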
  • 1TB drive: 15TB uncompressed per unit (2 nodes), 60TB uncompressed per rack (4 units, 8 nodes), 420TB uncompressed for 7 racks (28 units, 56 nodes)
    3TB drive: 45TB uncompressed per unit (2 nodes), 180TB uncompressed per rack (4 units, 8 nodes), 1260TB uncompressed for 7 racks (28 units, 56 nodes)
    [see slide 125]
    tempdb, log, and the overhead of formatting the drives, storage spaces, etc. have already been subtracted out (about 47%) of the 70 1TB drives (4 hot spares, 2 for fabric storage, 32 for RAID 1, so 32 drives with unique data for 32TB per scale unit). That gives you 15TB of 'usable' space on a 1/4 rack; apply a 5:1 compression ratio and you get 75TB.
    HP ProLiant DL360p Gen8 Server, 256GB RAM, 1U
    Each server has 2 processors (E5-2690 “Sandy Bridge” 2.90 GHz, 20M cache) with 8 cores, so 16 cores per server
    Sixteen (16) HP 16GB (2R x 4) PC3-12800R (DDR3-1333) memory modules
    Two (2) internal HP 600GB 6G SAS 10K 2.5in drives
    Paired with 1 HP D6000 high-density storage enclosure (70 HDDs (7.2K) of either 1, 2, or 3 TB capacity) connected to each server through an H221 SAS HBA, 5U, 6Gb/s
    I usually use the word "conservative" when I'm talking about a 5:1 ratio. I also generally mention that most others in the industry use the same number.
  • Key goal of slide: Communicate what Big Data is.

    Slide talk track:
    ERP, SCM, CRM, and transactional web applications are classic examples of systems processing transactions. Highly structured data in these systems is typically stored in SQL databases. Web 2.0 is about how people and things interact with each other or with your business. Web logs, user clickstreams, social interactions and feeds, and user-generated content are classic places to find interaction data. Big Data is the explosion of data volume and types, inside and outside the business, too large for traditional systems to manage. There are multiple types of data, including personal, organizational, public, and private.

    More important, Big Data is changing how the business uses data, from historical analysis to predictive analytics. Enterprises are using data in more progressive and higher-value applications. These uses and applications are changing how data must be stored, managed, analyzed, and accessed in order to provide not just the historical and insight analysis of the current data warehouse, but the predictive analytics and forecasting needed to stay competitive in the current marketplace.
  • Key goal of slide: Communicate what Hadoop is.

    Slide talk track:
    Everyone has heard of Hadoop. But what is it? And do I need it? Apache Hadoop is an open-source framework that supports data-intensive distributed applications on large clusters of commodity hardware. Hadoop is composed of a few parts:
    HDFS – the Hadoop Distributed File System is Hadoop's file system, which stores large files (from gigabytes to terabytes) across multiple machines
    MapReduce – a programming model that performs filtering, sorting, and other data retrieval commands as a parallel, distributed algorithm
    Other parts of Hadoop include HBase, R, Pig, Hive, Flume, Mahout, Avro, and ZooKeeper, which are all parts of the Hadoop ecosystem that perform supplementary functions.
  • Key goal of slide: Communicate conceptually how companies are managing Big Data in current data warehouse environments. This shows both setting up a side-by-side Hadoop cluster and ETL-ing data into an existing data warehouse.

    Slide talk track:
    Many companies have responded to the explosion of Big Data by setting up side-by-side Hadoop ecosystems. However, these companies are learning the limitations of this approach, including the steep learning curve of MapReduce and other Hadoop ecosystem tools, and the cost of installing, maintaining, and tooling side-by-side ecosystems to support two separate query models. Many Hadoop solutions do not integrate into enterprise or other data warehouse systems, creating complexity and cost and slowing time to insights. Some Hadoop solutions feature vendor lock-in, creating long-term obligations.

    Other companies set up costly extract, transform, and load (ETL) operations to move non-relational data directly into the data warehouse. This requires IT to modify or create new data schemas for all new data, which is also time consuming and costly. As a result, performance is degraded, and it is often more expensive to integrate new data, build new applications, or access key BI insights.
  • Key goal of slide: Communicate what HDInsight is.

    Slide talk track:
    HDInsight is an enterprise-ready, Hadoop-based distribution from Microsoft that brings a 100% Apache Hadoop solution to the data warehouse. APS gives customers Hadoop with the simplicity of a single appliance, and Microsoft integrates Hadoop data processing directly into the architecture of the appliance for optimum performance. Each HDInsight node has 'shared nothing' access to CPU, memory, and storage.

    HDInsight for APS is the most enterprise-ready Hadoop distribution in the market. HDInsight offers enterprise-class security, scalability, and manageability. Thanks to a dedicated Secure Node, HDInsight helps you secure your Hadoop cluster. HDInsight also simplifies management through System Center, and organizations can provide multiple users simultaneous access to HDInsight within the appliance, which deploys with Active Directory.
  • This diagram illustrates the basic layout of the direct-to-fabric Hadoop region alongside a data warehouse region designed for the APS appliance and Windows Azure. Each region provides a boundary for workload, security, metering, and servicing.

    HDInsight is a Hadoop region that sits over the fabric of the appliance alongside the PDW region for processing. Both regions take advantage of PolyBase as a shared query and processing model, which results in exceptional performance improvements across every node. Based on the Hortonworks 1.0 HDFS, the new HDI (HDInsight) region within APS is a dedicated Hadoop region that sits directly on top of the fabric layer of the appliance to share metered resources with the APS engine and process Hadoop cluster data. In some aspects this transforms APS into a concurrent relational and Hadoop engine, resulting in much better performance.

    An appliance can be configured to support relational queries only (excluding the HDI region), to provide a Hadoop-only node, or to support both relational and Hadoop from a single appliance. In addition, HDInsight enables the processing of Hadoop data in place, without the need for expensive ETL (extract, transform, and load). By taking advantage of Azure Storage Vault blobs, HDInsight can even extend the storage of the traditional data warehouse into the cloud.

    Technically, adding one or more scale units of HDI to an all-APS rack is an "add region" operation, which is supported. Adding one or more scale units of HDI to a rack with HDI already in it is "add capacity/unit" and is not supported for AU1.
  • Key goal of slide: PolyBase is available only within the Microsoft Analytics Platform System.

    Slide talk track:
    PolyBase simplifies this by allowing Hadoop data to be queried with the standard Transact-SQL (T-SQL) query language, without the need to learn MapReduce and without the need to move the data into the data warehouse. PolyBase unifies relational and non-relational data at the query level.
    Integrated query: PolyBase accepts a standard T-SQL query that joins tables containing a relational source with tables in a Hadoop cluster referencing a non-relational source. It then seamlessly returns the results to the user. PolyBase can query Hadoop data in other Hadoop distributions such as Hortonworks or Cloudera.
    No difficult learning curve: Standard T-SQL can be used to query Hadoop data. Users are not required to learn MapReduce to execute the query.
    Cloud-hybrid scenario options: PolyBase can also query across Windows Azure HDInsight, providing a hybrid cloud solution to the data warehouse.

    The ability to query all of your company's data, independent of where it resides and what format it is stored in, in a performant way is crucial in today's data-centric world with massive, increasing data volume. Today, with AU1, one can query various Hadoop distributions plus data stored in Azure. For example, with one single T-SQL statement a user can query data stored in multiple HDP 2.0 clusters, combine it with data in PDW, and combine it with data stored in Azure. No one in the industry (as far as I'm aware) can do this in this simple fashion. Bringing all Microsoft assets together, on-premises and specifically through our Azure play, including various services that will be brought online in the future, we can clearly distinguish ourselves through our unique and complete end-to-end data management story.

    No doubt there are several pieces missing in our 'Poly' vision – including supporting other data stores, enabling push-down computation for our cloud story, more user-definable options language-wise, better automation/policies, and many more ideas we'd like to go after in the weeks and months ahead.
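    To make the PolyBase flow concrete, the DDL below sketches how Hadoop data is exposed and then joined with PDW data. All names are hypothetical, and the exact external-table syntax varies across PolyBase releases, so treat this as illustrative rather than definitive:

    ```sql
    -- Point PolyBase at a Hadoop cluster and describe the file layout
    -- (hypothetical names and locations).
    CREATE EXTERNAL DATA SOURCE MyHadoop
    WITH (TYPE = HADOOP, LOCATION = 'hdfs://hadoop-head:8020');

    CREATE EXTERNAL FILE FORMAT PipeDelimited
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

    -- Declare a table over files already sitting in HDFS
    CREATE EXTERNAL TABLE dbo.WebLogs_ext
    (
        CustomerId INT,
        Url        VARCHAR(500),
        HitDate    DATE
    )
    WITH (LOCATION = '/logs/web/', DATA_SOURCE = MyHadoop,
          FILE_FORMAT = PipeDelimited);

    -- One T-SQL statement now joins Hadoop data with relational data
    SELECT c.CustomerName, COUNT(*) AS Hits
    FROM   dbo.DimCustomer AS c
    JOIN   dbo.WebLogs_ext AS w ON w.CustomerId = c.CustomerId
    GROUP  BY c.CustomerName;
    ```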
  • HDInsight benefits: cheap, quick to procure.

    Key goal of slide: Highlight the four main use cases for PolyBase.

    Slide talk track:
    There are four key scenarios for using PolyBase with the data lake of data normally locked up in Hadoop.
    PolyBase leverages the APS MPP architecture, along with optimizations like push-down computation, to query data using Transact-SQL faster than other Hadoop technologies like Hive. More importantly, you can use Transact-SQL join syntax between Hadoop data and PDW data without having to import the data into PDW first.
    PolyBase is a great tool for archiving older or unused data in APS to less expensive storage on a Hadoop cluster. When you do need to access the data for historical purposes, you can easily join it back up with your PDW data using Transact-SQL.
    There are times when you need to share your PDW data with Hadoop users, and PolyBase makes it easy to copy data to a Hadoop cluster.
    Using a simple SELECT INTO statement, PolyBase makes it easy to import valuable Hadoop data into PDW without having to use external ETL processes.
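    The archive and import scenarios above can be sketched in T-SQL. In PDW the import is commonly written as CREATE TABLE AS SELECT (CTAS) rather than a literal SELECT INTO; the external data source, file format, and tables below are hypothetical and assumed to be defined already:

    ```sql
    -- Archive cold warehouse data out to cheaper Hadoop storage (CETAS)
    CREATE EXTERNAL TABLE dbo.FactSales_2010_ext
    WITH (LOCATION = '/archive/sales/2010/', DATA_SOURCE = MyHadoop,
          FILE_FORMAT = PipeDelimited)
    AS SELECT * FROM dbo.FactSales WHERE SaleDate < '2011-01-01';

    -- Import valuable Hadoop data into PDW for repeated fast querying (CTAS)
    CREATE TABLE dbo.WebLogs
    WITH (DISTRIBUTION = HASH (CustomerId))
    AS SELECT * FROM dbo.WebLogs_ext;
    ```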
  • Big Data adds value to the business when it is accessible to BI users with tools that are easy to use and consume for IT and business users alike. While some Hadoop solutions provide BI tools, or require customers to find third-party BI solutions, these often result in a low adoption rate due to learning curves. Surveys from Gartner, The BI Survey, and Intelligent Enterprise have found abysmal BI adoption of current solutions (~8%) due to complaints about the complexity of the tools and the cost of the solution. The BI solution must be provided to users in tools they already know and can consume.

    APS is the only data warehouse and Hadoop solution that has native end-to-end Microsoft BI integration with PolyBase, allowing users to create new insights themselves using tools they already know. Every Microsoft BI client – SSAS, SSRS, PowerPivot, and Power View – has native integration with APS and ubiquitous connectivity across the entire SQL Server ecosystem. With native BI integration, Microsoft is unique in offering an end-to-end Big Data solution where there are no barriers in the journey from acquiring raw data of all types to displaying high-value insights to all users. By providing the customer with an HDInsight region in APS, with PolyBase for querying and joining any type of data in T-SQL, and by democratizing access to data insight through familiar BI tools, Microsoft is prepared to provide Big Data insights to any user.
  • Today, if you are not using an MPP scale-out appliance, most likely your data warehouse is built on the traditional scale-up, SMP architecture and organized as row stores. A scale-up solution runs queries sequentially on a shared-everything architecture. This essentially means that everything is processed on a single box that shares memory, disk, I/O operations, and more. To get more scale in a scale-up solution, you need to acquire a more powerful hardware box every time; you will not be able to add more hardware to the existing rack solution. A scale-up solution also has diminishing returns after a certain scale.

    A rowstore stores data in traditional tables as rows. The values comprising one row are stored contiguously on a page. Rowstores are often not optimal for many queries issued to the data warehouse, because the query returns the entire row of data – including fields that might not be needed as part of the query.

    The combination of scale-up SMP and rowstores is a common limitation of existing warehouses that affects performance.
  • Key goal of slide: Communicate that the Microsoft Modern Data Warehouse can scale out to petabytes of relational data.

    Slide talk track:
    SQL Server 2012 APS is a scale-out, Massively Parallel Processing (MPP) architecture that represents the most powerful distributed computing and scale. This type of technology powers supercomputers to achieve raw computing horsepower. As more scale is needed, more resources can be added to scale out to the largest data warehousing projects. APS uses a shared-nothing architecture with multiple physical nodes, each running its own instance of SQL Server with dedicated CPU, memory, and storage. As queries go through the system, they are broken up to run simultaneously over each physical node. The benefit is the highest performance at scale through parallel execution. You need only add new resources to continually scale out this implementation.

    This means that if you also have high concurrency and complex queries at scale, APS can handle these queries with ease. It also means that APS can be optimized for "mixed workload" and "near real-time" data analysis. Enjoy faster data loading at more than two terabytes per hour.

    Other benefits of scale-out technologies:
    Start small and scale out to petabytes of data
    Optimized for "mixed workload" and "near real-time" data analysis
    Support for high concurrency
    Query while you load
    No hardware bottlenecks
    No "forklifting" when you want to scale your system
    Scale not only for data size but for faster queries
  • Key goal of slide: Use an interesting story to show how the new modern data warehouse can handle real-time performance with in-memory technologies. TODO: for parallel query execution – explain the difference from SMP.

    Slide talk track:
    The biggest issue with traditional data warehouses is that data is stored in rows. The values comprising one row are stored contiguously on a page. Rowstores are not optimal for many queries issued to the data warehouse, because the query will return the entire row of data – including fields that might not be needed as part of the query.

    By changing the primary storage engine to a new, updateable version of in-memory columnstore, data is grouped and stored one column at a time. The benefits of doing this are as follows:
    Only the columns needed must be read. Therefore, less data is read from disk to memory and later moved from memory to processor cache.
    Columns are heavily compressed, which reduces the number of bytes that must be read and moved.
    Most queries do not touch all columns of the table. Therefore, many columns will never be brought into memory. This, combined with excellent compression, improves buffer pool usage – which reduces total I/O.

    The result is massive compression (sometimes as much as 10x), as well as massive performance gains (as much as 100x). Use of columnstore also leverages your existing hardware instead of requiring you to purchase a new appliance.

    New in SQL Server 2012 APS and SQL Server 2014: updateable and clustered columnstore. SQL Server 2012 APS and SQL Server 2014 use the new updateable columnstore. Updates and direct bulk load are fully supported, which simplifies and accelerates data loading and enables real-time data warehousing and trickle loading. Using columnstore can also save roughly 70% of overall storage space if you choose to eliminate the rowstore copy of the data entirely.
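    A minimal sketch of the storage choice described above (hypothetical tables; the first form is the appliance-style declaration, the second converts an existing rowstore table on SMP SQL Server 2014):

    ```sql
    -- In PDW/APS, columnstore is declared at table creation time:
    CREATE TABLE dbo.FactInventory
    (
        ProductId INT  NOT NULL,
        StockDate DATE NOT NULL,
        Quantity  INT  NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH (ProductId),
        CLUSTERED COLUMNSTORE INDEX   -- column-oriented, compressed, updateable
    );

    -- On SMP SQL Server 2014, an existing rowstore table can be converted:
    CREATE CLUSTERED COLUMNSTORE INDEX cci_FactInventory_smp
        ON dbo.FactInventory_smp;
    ```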
  • Key goal of slide: Explain the move from the limitations of a serial-processing SMP architecture to high-concurrency MPP.
    - High-performance ad hoc analytic queries
    - Pull insights simultaneously throughout the day
    - Run multiple types of queries simultaneously
    - Run multiple types of workloads together with no tuning required
    - High concurrency means high availability, which means higher adoption
    Slide talk track:
    With the explosion of data and the growth of end users demanding real-time insights, data warehouses are growing not only in resources but also in the number of users frequently accessing them. A modern data warehouse needs to both scale out to deliver query results quickly and run mixed workloads all at the same time. Mixed workloads refer to concurrency: multiple types of queries are submitted, along with data loads and ELT processing. Under mixed workload scenarios, which organizations are certain to face, APS runs concurrent queries with little or no tuning. Organizations no longer have to worry about the types of workloads being run at any given time, and Microsoft APS can handle many users pulling insights simultaneously throughout the day.
  • Key goal of slide: Convey the major pillars of the Analytics Platform System with key points.
    Optimal architecture: More than just a converged system, APS has reshaped the very hardware specifications required through software innovations to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value:
    APS provides the industry's lowest DW price/TB:
    - Lower cost while maintaining performance by replacing SAN storage with economical Windows Server 2012 Storage Spaces
    - Save up to 70% of APS storage with up to 15x compression via the updateable in-memory columnstore
    Value through a single appliance solution:
    - Reduce hardware footprint by having PDW and HDInsight within a single appliance
    - Remove the need for costly integration efforts
    Value through flexible hardware options:
    - Avoid hardware lock-in through flexible hardware options from HP, Dell, and Quanta
  • EMC Greenplum, Teradata, Oracle Exadata, HP Vertica, and IBM Netezza
    Key goal of slide: Show how APS delivers the lowest price per terabyte through software innovation and hardware commoditization.
    Slide talk track:
    Value through software innovation / hardware commoditization: More than just a converged system, APS has reshaped the very hardware specifications required through software innovations to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value:
    - Through Storage Spaces, APS has the performance, reliability, and scale for storage built into the software, allowing it to replace the SAN with a more economical high-density disk option. This results in large capacity at low cost with no reduction in performance.
    - Hyper-V virtualization and hardware design minimize the hardware footprint and cost of the appliance, enabling high availability as simply as possible.
    Microsoft lowers the cost by reducing the hardware footprint through virtualization, providing Storage Spaces to replace expensive SAN storage, and compressing up to 15x to lower storage usage. These features give APS the lowest relational data warehouse price/terabyte of any vendor by a significant margin (~2x lower than market). The overall market's comparable price/terabyte ranges from $8–13K/TB. For example, Oracle announced Exadata in a 1/8th-rack form factor that costs $200K. However, this is only the hardware cost and does not include software prices, which can cost significantly more – hundreds of thousands to a million dollars.
Even accounting for Oracle's 10x compression, APS has a price/terabyte that is about half Oracle's list price for their normal drive sizes (non-high-capacity).
    - IBM PureData pricing: < $500,000 for a quarter rack (8TB uncompressed) @ 4x compression (= $12–15K/TB) — http://www.theregister.co.uk/2012/10/10/ibm_puredata_database_appliances/
    - Oracle Exadata pricing: HW pricing ($1.1M) — http://www.oracle.com/us/corporate/pricing/exadata-pricelist-070598.pdf; SW pricing ($7.2M) — http://www.oracle.com/us/corporate/pricing/technology-price-list-070617.pdf @ 100TB uncompressed @ 10x compression (= $8K/TB)
    - EMC Greenplum pricing: $1,000,000 for a half rack (18TB uncompressed) @ 4x compression (= $13.8K/TB) — http://www.informationweek.com/software/information-management/emc-intros-backup-savvy-greenplum-applia/227701321
    Pricing analysis was done on the last-known publicly accessible information available and represents the current view of Microsoft Corporation as of the date of this presentation. Because companies respond to changing market conditions, it should not be interpreted as a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided outside of the sources cited or after the date of this presentation.
    Source: Value Prism Consulting, "Microsoft's SQL Server Parallel Data Warehouse Provides High Performance and Great Value": http://www.valueprism.com/resources/resources/ResourceDetails.aspx?ID=100
  • Windows Server 2012 and Windows Azure Virtual Machines offer full virtualization services for both on-premises and on-demand installations.
    General details:
    - All hosts run Windows Server 2012 Standard and Windows Azure Virtual Machines
    - Fabric or workload runs in Hyper-V virtual machines
    - The fabric virtual machine, MAD01, and CTL share one server, giving lower overhead costs, especially for small topologies
    - The APS agent runs on all hosts and all virtual machines and collects appliance health data on fabric and workload
    - DWConfig and the Admin Console continue to exist, with minor extensions to expose host-level information
    - Windows Storage Spaces and Azure Storage blobs enable use of lower-cost DAS (JBODs)
    APS workload details:
    - SQL Server 2012 Enterprise Edition (APS build)
    - Control node and compute nodes for the APS workload
    Storage details:
    - More files per filegroup
    - Uses a larger number of spindles in parallel
  • Key goal of slide: APS was built to scale to the highest data requirements, handle the newest data types stored in Hadoop, and deliver performance that meets today's near-real-time requirements.
    Slide talk track:
    A modern data warehouse is progressive, meeting broad needs and requirements:
    - Hadoop integrates and operates seamlessly with your relational data warehouse
    - Data is easily queried by SQL users without additional skills or training
    - Enterprise-ready, meaning it is secure and easily managed by IT
    - Insights accessible to everyone
    The Microsoft Analytics Platform System (APS) is the only no-compromise modern data warehouse solution that brings both Hadoop and an RDBMS together in a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility for all users through some of the most widely used BI tools in the industry. Microsoft APS combines Microsoft's industry-leading RDBMS platform with Microsoft's Hadoop distribution, HDInsight, for non-relational data to offer an all-in-one big data analytics appliance. Tying together and integrating the worlds of relational and non-relational data is PolyBase, Microsoft's integrated query tool available only in APS.
  • Data capacity requirement: variable from smallest (15 terabytes) to largest (6 petabytes) of user data with 5:1 compression (1.2 petabytes uncompressed); from 1/4 rack up to 7 racks
    - Data loading speed: ideally 175GB/hour per node (8 nodes would give over 1TB/hour); 250GB/hour has been seen; 10–20x faster
    - Data compression: 3x–15x, but 5x is a conservative number; unique compression because of distribution across compute nodes
    - Query performance: 10x–100x, with a reasonably linear increase with more racks
  • Modern Data Warehousing with the Microsoft Analytics Platform System

    1. 1. Microsoft Analytics Platform System (APS) Modern Data Warehousing James Serra Big Data Evangelist Microsoft
    2. 2. Agenda • Traditional data warehouse & modern data warehouse • APS architecture • Hadoop & PolyBase • Performance and scale • Appliance benefits • Summarize/questions
    3. 3. 5 Data sources Will your current solution handle future needs?
    4. 4. 10 Data sources | Non-Relational Data
    5. 5.  Data sources Non-relational data
    6. 6. Are you using or going to use "Big Data" and/or "Hadoop"? Common pain points: No or limited access to detailed data; can only surface reports and cannot ask ad-hoc questions. Slow data loading performance cannot keep up with the need for data from transactional systems for intraday reporting. MOLAP cube processing and data refresh take too long. Slow query performance with need for constant tuning, especially with SAN storage. High cost of SAN storage chargeback.
    7. 7. Roadblocks to evolving to a modern data warehouse (solution, and the issue with that solution): Keep legacy investment – limited scalability & ability to handle new data types. Buy new tier-one hardware appliance – high acquisition/migration costs & no Hadoop. Acquire big data solution (Hadoop) – significant training & still siloed. Acquire business intelligence solution – complex with low adoption.
    8. 8. Introducing the Microsoft Analytics Platform System Your turnkey modern data warehouse appliance • Relational and non-relational data in a single appliance • Or, integrate relational data with non-relational data in an external Hadoop cluster on-premises or data stored in the cloud (hot, warm, cold) • Enterprise-ready Hadoop • Integrated querying across Hadoop and APS using T-SQL (PolyBase) • Direct integration with Microsoft BI tools such as Power BI • Near real-time performance with In-Memory • Scale out to accommodate your growing data or to increase performance (2 nodes to 56 nodes) • Remove SMP DW bottlenecks with MPP SQL Server • No rip and replace when more performance is needed • No performance tuning required • Concurrency that fuels rapid adoption • Industry's lowest DW price/TB • Value through a single appliance solution • Value with flexible hardware options using commodity hardware • Free up space on SAN (cost averages $10k per TB)
    9. 9. Hardware appliance vendor offerings
    10. 10. Hardware and software engineered together The ease of an appliance Co-engineered with HP, Dell, and Quanta best practices Leading performance with commodity hardware Pre-configured, built, and tuned software and hardware Integrated support plan with a single Microsoft contact. Components: PDW, HDInsight, PolyBase
    11. 11. APS History • DatAllegro started in 2003 • Microsoft acquired DatAllegro in September 2008 • PDW released in December 2010 (version 1) • Version 2 made available in March 2013 (PolyBase introduced) • AU1 released in April 2014; renamed from Parallel Data Warehouse (PDW) to Analytics Platform System (APS). It still includes the PDW region as well as a new HDInsight/Hadoop region • AU2 released in July 2014 • AU3 released in October 2014 There will be AU updates every 3–4 months. NOTE: This is a data warehouse solution and not an OLTP (online transaction processing) solution. Case studies: Go to https://customers.microsoft.com and enter "parallel data warehouse" (old name) in the keyword box and search the results, then enter "analytics platform system" (new name)
    12. 12. Parallelism • Uses many separate CPUs running in parallel to execute a single program • Shared Nothing: Each CPU has its own memory and disk (scale-out) • Segments communicate using high-speed network between nodes MPP - Massively Parallel Processing • Multiple CPUs used to complete individual processes simultaneously • All CPUs share the same memory, disks, and network controllers (scale-up) • All SQL Server implementations up until now have been SMP • Mostly, the solution is housed on a shared SAN SMP - Symmetric Multiprocessing
    13. 13. APS Logical Architecture (overview) “Compute” node Balanced storage SQL “Compute” node Balanced storage SQL “Compute” node Balanced storage SQL “Compute” node Balanced storage SQL DMS DMS DMS DMS Compute Node – the “worker bee” of APS • Runs SQL Server 2014 APS • Contains a “slice” of each database • CPU is saturated by storage Control Node – the “brains” of the APS • Also runs SQL Server 2014 APS • Holds a “shell” copy of each database • Metadata, statistics, etc • The “public face” of the appliance Data Movement Services (DMS) • Part of the “secret sauce” of APS • Moves data around as needed • Enables parallel operations among the compute nodes (queries, loads, etc) “Control” node SQL DMS
    14. 14. APS Logical Architecture (overview) “Compute” node Balanced storage SQL“Control” node SQL “Compute” node Balanced storage SQL “Compute” node Balanced storage SQL “Compute” node Balanced storage SQL DMS DMS DMS DMS DMS 1) User connects to the appliance (control node) and submits query 2) Control node query processor determines best *parallel* query plan 3) DMS distributes sub-queries to each compute node 4) Each compute node executes query on its subset of data 5) Each compute node returns a subset of the response to the control node 6) If necessary, control node does any final aggregation/computation 7) Control node returns results to user Queries running in parallel on a subset of the data, using separate pipes effectively making the pipe larger
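To see the parallel plan that the control node produces for a query (steps 2–3 above), APS/PDW offers the T-SQL `EXPLAIN` statement. A hedged sketch, with hypothetical table names:

```sql
-- EXPLAIN (PDW-specific T-SQL) returns the distributed query plan
-- as XML instead of executing the query: which operations run on
-- each compute node, and which data movement steps (broadcast or
-- shuffle) DMS must perform before the control node gathers results.
EXPLAIN
SELECT   c.CustName, SUM(f.DollarsSold) AS TotalSales
FROM     dbo.SalesFact   AS f
JOIN     dbo.CustomerDim AS c ON f.CustDimID = c.CustDimID
GROUP BY c.CustName;
```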
    15. 15. APS Data Layout Options – Star schema example: Time Dim (Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day); Store Dim (Store Dim ID, Store Name, Store Mgr, Store Size); Product Dim (Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc); Customer Dim (Cust Dim ID, Cust Name, Cust Addr, Cust Phone, Cust Email); Sales Fact (Date Dim ID, Store Dim ID, Prod Dim ID, Cust Dim ID, Qty Sold, Dollars Sold). Replicated table: copied to each compute node. Distributed table: spread across compute nodes based on a "hash". [Diagram: each compute node holds a full copy of the four dimension tables and a hash-distributed slice of the SalesFact table.]
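In PDW/APS T-SQL, the two layout options on this slide are declared in the table's `WITH` clause. A sketch (table and column names are hypothetical):

```sql
-- Small dimension: REPLICATE puts a full copy on every compute
-- node, so joins against it never require data movement.
CREATE TABLE dbo.StoreDim
(
    StoreDimID  INT         NOT NULL,
    StoreName   VARCHAR(50) NOT NULL,
    StoreSize   INT         NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);

-- Large fact: HASH spreads rows across all compute nodes by a
-- distribution column (pick a high-cardinality, evenly spread key).
CREATE TABLE dbo.SalesFact
(
    DateDimID   INT   NOT NULL,
    StoreDimID  INT   NOT NULL,
    CustDimID   INT   NOT NULL,
    QtySold     INT   NOT NULL,
    DollarsSold MONEY NOT NULL
)
WITH (DISTRIBUTION = HASH(CustDimID));
```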
    16. 16. DATA DISTRIBUTION – [Diagram: the FactSales table is split into eight distributions (FactSales_A … FactSales_H) on every compute node.] The control node creates the table metadata, then sends the CREATE TABLE SQL to each compute node, which creates FactSales_A through FactSales_H locally.
    CREATE TABLE FactSales (
        ProductKey INT NOT NULL,
        OrderDateKey INT NOT NULL,
        DueDateKey INT NOT NULL,
        ShipDateKey INT NOT NULL,
        ResellerKey INT NOT NULL,
        EmployeeKey INT NOT NULL,
        PromotionKey INT NOT NULL,
        CurrencyKey INT NOT NULL,
        SalesTerritoryKey INT NOT NULL,
        SalesOrderNumber VARCHAR(20) NOT NULL
    )
    WITH (
        DISTRIBUTION = HASH(ProductKey),
        CLUSTERED INDEX (OrderDateKey),
        PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20010601, 20010901))
    );
    17. 17. APS – Balanced across servers and within them. Largest table: 600,000,000,000 rows, randomly distributed across 40 compute nodes (5 racks) = 15,000,000,000 rows per server. In each server, randomly distributed into 8 tables (so 320 total tables) = 1,875,000,000 rows each. Each partition (2 years of data partitioned by week, benefiting queries by date) = 18,028,846 rows. As an end user or DBA you think about 1 table: LineItem. "Select * from LineItem" is split into 320 queries running in parallel against 320 (1.875b-row) tables. "Select * from LineItem where OrderDate = '1/1/2014'" is 320 queries against 320 (18m-row) tables. You don't care or need to know that there are actually 320 tables representing your 1 logical table. CCI can add further performance via segment elimination.
    18. 18. Appliance sizes: ¼ rack 15TB, ½ rack 30TB, full rack 60TB, 1¼ racks 75.5TB, 1½ racks 90.6TB, 2 racks 120.8TB, 3 racks 181.2TB (all uncompressed) • 2–56 compute nodes (32–896 cores) • 1–7 racks • 1, 2, or 3 TB drives • 15TB–1.2PB uncompressed • 75TB–6PB user data (5:1 compression) • Up to 7 spare nodes available across the entire appliance • Dual InfiniBand: 56Gbps
    19. 19. Microsoft Analytics Platform System Your turnkey modern data warehouse appliance
    20. 20. Advanced Analytics Defined
    21. 21. What is Hadoop? • Distributed, scalable system on commodity HW • Composed of a few parts: HDFS – distributed file system; MapReduce – programming model • Other tools: Hive, Pig, Sqoop, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm • Main players are Hortonworks, Cloudera, MapR • WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead) [Diagram: core services – operational services (AMBARI, OOZIE), data services (HIVE & HCATALOG, PIG, HBASE, FALCON, SQOOP, FLUME, NFS/WebHDFS load & extract), YARN, MapReduce, HDFS.] Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
    22. 22. Complex query and analysis with big data today: steep learning curve, slow and inefficient. [Diagram: "new" data sources land in the Hadoop ecosystem; to analyze them you must move HDFS data into the warehouse before analysis (ETL), learn new skills beyond T-SQL, and build, integrate, manage, maintain, and support the cluster.]
    23. 23. APS delivers enterprise-ready Hadoop with HDInsight Manageable, secured and highly available Hadoop integrated into the appliance High performance tuned within the appliance End-user authentication with Active Directory Accessible insights for everyone with Microsoft BI tools Managed and monitored using System Center 100% Apache Hadoop SQL Server Parallel Data Warehouse Microsoft HDInsight PolyBase Leverage your existing TSQL skills Additional features over a separate Hadoop cluster Plus one support contact still!
    24. 24. Parallel Data Warehouse region HDInsight region Fabric Hardware Appliance A region is a logical container within an appliance Each workload contains the following boundaries: • Security • Metering • Servicing APS appliance overview
    25. 25. Query Hadoop data with T-SQL using PolyBase – bringing the worlds of big data and the data warehouse together for users and IT • Provides a single T-SQL query model ("semantic layer") for APS and Hadoop with rich features of T-SQL, including joins without ETL • Uses the power of MPP to enhance query execution performance • Supports Windows Azure HDInsight to enable new hybrid cloud scenarios • Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera • Use existing SQL skillset, no IT intervention [Diagram: SQL Server Parallel Data Warehouse with PolyBase federating a SELECT into result sets across Cloudera CDH 5.1 (Linux), Hortonworks HDP 2.2 (Windows, Linux), Windows Azure HDInsight (HDP 2.2, WASB), Microsoft HDInsight (HDP 2.0), and possibly others (SQL Server, DB2, Oracle?) – a true federated query engine]
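A hedged sketch of what this looks like from the T-SQL side. The exact PolyBase DDL varies by APS appliance update; the external data source, file format, and all object names below are hypothetical and assumed to have been created beforehand:

```sql
-- Expose a set of delimited files in Hadoop as a table in APS.
CREATE EXTERNAL TABLE dbo.WebClicks
(
    ClickDate  DATE         NOT NULL,
    Url        VARCHAR(500) NOT NULL,
    CustDimID  INT          NOT NULL
)
WITH (
    LOCATION    = '/logs/clicks/',    -- HDFS directory (hypothetical)
    DATA_SOURCE = MyHadoopCluster,    -- pre-created external data source
    FILE_FORMAT = PipeDelimitedText   -- pre-created external file format
);

-- One T-SQL statement joins Hadoop data with relational data; no ETL.
SELECT   c.CustName, COUNT(*) AS Clicks
FROM     dbo.WebClicks   AS w
JOIN     dbo.CustomerDim AS c ON w.CustDimID = c.CustDimID
GROUP BY c.CustName;
```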
    26. 26. Use cases where PolyBase simplifies using Hadoop data • Bringing islands of Hadoop data together • High-performance queries against Hadoop data (predicate pushdown) • Archiving data warehouse data to Hadoop (move) (Hadoop as cold storage) • Exporting relational data to Hadoop (copy) (Hadoop as backup/DR, analysis, cloud use) • Importing Hadoop data into the data warehouse (copy) (Hadoop as staging area, sandbox, Data Lake)
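The export and import use cases above map to two PDW/APS statements, CETAS and CTAS. A sketch with hypothetical names, assuming the external data source and file format already exist:

```sql
-- Export (copy) cold rows out to Hadoop with CETAS
-- (CREATE EXTERNAL TABLE AS SELECT): Hadoop as cold storage.
CREATE EXTERNAL TABLE dbo.SalesFact_Archive2012
WITH (
    LOCATION    = '/archive/sales/2012/',
    DATA_SOURCE = MyHadoopCluster,
    FILE_FORMAT = PipeDelimitedText
)
AS SELECT *
   FROM  dbo.SalesFact
   WHERE DateDimID BETWEEN 20120101 AND 20121231;

-- Import (copy) Hadoop data into the warehouse with CTAS
-- (CREATE TABLE AS SELECT): Hadoop as a staging area.
CREATE TABLE dbo.WebClicks_Staged
WITH (
    DISTRIBUTION = HASH(CustDimID),
    CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM dbo.WebClicks;
```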
    27. 27. Big data insights for anyone • Native Microsoft BI integration to create new insights with familiar tools • Tools like Power BI minimize IT intervention for discovering data • T-SQL for DBAs and power users to join relational and Hadoop data • Hadoop tools like map-reduce, Hive, and Pig for data scientists • Leverages high adoption of Excel, Power View, Power Pivot, and SSAS [Diagram: power users, data scientists, and everyone else using Microsoft BI tools]
    28. 28. Microsoft Analytics Platform System Your turnkey modern data warehouse appliance
    29. 29. Scale-out technologies in the Analytics Platform System • Massively Parallel Processing (MPP) parallelizes queries (speed-driven, not just capacity-driven) • Multiple nodes with dedicated CPU, memory, and storage ("shared nothing") • Incrementally add HW for near-linear scale to multi-PB (no need to delete older data or stage) • Handles query complexity and concurrency at scale • No "forklift" of the prior warehouse to increase capacity • Start small with a few-terabyte warehouse • Mixed workload support: query while you load (250GB/hour per node); no need for a maintenance window [Diagram: scale from 0TB to 6PB by adding PDW or HDInsight nodes]
    30. 30. • Store data in columnar format for massive compression • Load data into or out of memory for next-generation performance • Updateable and clustered for real-time trickle loading • No secondary indexes required — Up to 100x faster queries and up to 15x more compression with an updatable clustered columnstore vs. a table with customary indexing [Diagram: columnstore index representation; parallel query execution]
    31. 31. Investment firm Before/After Results - HP SMP vs APS 21x improvement loading data (7:30 minutes vs 21 seconds) 62x improvement staging to landing (30 minutes vs 29 seconds) 17x, 166x, 169x query performance improvement (1:05 hour vs 23 seconds) Microsoft BI tools work unchanged 1.1 TB/hr loading time, 8.8x compression (2 billion rows) (472GB to 53GB) 46x improvement creating datamart (70 minutes vs 1:31 minutes)
    32. 32. BI Tools Reporting and cubes SQL Server SMP (Spoke) Concurrency that fuels rapid adoption Great performance with mixed workloads Analytics Platform System ETL/ELT with SSIS, DQS, MDS ERP CRM LOB APPS ETL/ELT with DWLoader Hadoop / Big Data PDW HDInsight PolyBase Ad hoc queries Intra-Day Near real-time Fast ad hoc Columnstore Polybase CRTAS “Link Table” Real-Time ROLAP / MOLAP DirectQuery SNAC
    33. 33. Stream Analytics TransformIngest Example overall data flow and Architecture Web logs Present & decide IoT, Mobile Devices etc. Social Data Event Hubs HDInsight Azure Data Factory Azure SQL DB Azure Blob Storage Azure Machine Learning (Fraud detection etc.) Power BI Web dashboards Mobile devices DW / Long-term storage Predictive analytics Event & data producers Analytics Platform Sys.
    34. 34. Microsoft Analytics Platform System Your turnkey modern data warehouse appliance
    35. 35. APS provides the industry's lowest DW appliance price/TB • Reshaped hardware specs through software innovation • Significantly lower price per TB than the closest competitor • Lower storage costs with Windows Server 2012 Storage Spaces • Small cost gap between multiple clustered HP DL980's with SAN vs. an APS 1/4 rack [Chart: TCO per TB (uncompressed), in thousands of dollars, for leading vendors as of Sept 2014 – Oracle, Pivotal, IBM, Teradata, Microsoft]
    36. 36. Virtualized architecture overview [Diagram: four hosts in a base unit with direct-attached SAS to economical disk storage, connected by InfiniBand and Ethernet; Host 1 runs the CTL, MAD, AD, and VMM virtual machines; the other hosts run the Compute 1 and Compute 2 VMs] • APS engine • DMS Manager • SQL Server 2012 Enterprise Edition (APS build) (AU3: SQL Server 2014) Software details: • All hosts run Windows Server 2012 Standard (AU3: 2012 R2) and Windows Azure Virtual Machines • Fabric or workload in Hyper-V virtual machines • Fabric virtual machine, management server (MAD01), and control server (CTL) share one server • APS agent runs on all hosts and all virtual machines • DWConfig and Admin Console • Windows Storage Spaces and Azure Storage blobs • Does not require expertise in Hyper-V or Windows
    37. 37. APS High Availability [Diagram: control host, compute hosts, and a failover host connected by redundant InfiniBand and Ethernet networks; on a failure, the virtual machines (FAB, AD, VMM, MAD, CTL, and the Compute VMs) migrate to the failover host] • No single point of failure • No need for SQL Server clustering
    38. 38. Less DBA Maintenance/Monitoring • No index creation • No deleting/archiving data to save space • Management simplicity (System Center, Admin console, DMVs) • No blocking • No logs • No query hints • No wait states • No IO tuning • No query optimization/tuning • No index reorgs/rebuilds • No partitioning • No managing filegroups • No shrinking/expanding databases • No managing physical servers • No patching servers and software RESULT: DBA’s spend more of their time as architects and not baby sitters!
    39. 39. The no-compromise modern data warehouse solution Microsoft’s turn-key modern data warehouse appliance Analytics Platform System Microsoft • Improved query performance • Faster data loading • Improved concurrency • Less DBA maintenance • Limited training needed • Use familiar BI tools • Ease of appliance deployment • Mixed workload support • Improved data compression • Scalability • High availability • PolyBase • Integration with cloud-born data • HDInsight/Hadoop integration • Data warehouse consolidation • Easy support model Summary of Benefits Bold = benefits of APS over upgrading to SQL Server 2014, no worry about future hardware roadblocks
    40. 40. Questions? James Serra jserra@microsoft.com Blog about PDW topics: http://www.jamesserra.com/archive/category/pdw/
    41. 41. Enterprise-ready big data – cloud enabled: • Improved PolyBase support • Cloudera 5.1 support • Partial aggregate pushdowns • Expanding big data capacity • Grow the HDInsight region on an appliance with an existing region
    Next-gen performance & engineered for optimal value: • 1.5x data return rate for SELECT * queries • Streaming large data sets for external apps (e.g., SSAS, SAS, R, etc.) • T-SQL compatibility: scalar UDFs (CREATE FUNCTION) • SQL Server SMP to APS (SQL Server MPP) Migration Utility • Bulk load / BCP through SQL Server command-line tools • OEM hardware refresh (HP Gen 9): HP ProLiant DL360 Gen9 server with 2x Intel Haswell processors, 256GB (16x16GB) 2133MHz memory; HP 5900 series switches (HA improvements)
    Symmetry between DW on-premises and Azure: T-SQL compatibility; appliance hardware
