Modernizing Your Data Warehouse using APS

  1. Modernizing Your Data Warehouse using APS. Big data. Small data. All data. Stéphane Fréchette, SQL Server MVP (@sfrechette), Database / Business Intelligence Solution Architect
  2. Gartner, “The State of Data Warehousing in 2012”
  3. Data sources: (1) increasing data volumes, (2) real-time data, (3) new data sources and types, (4) cloud-born data
  4. The modern data warehouse. Data sources. Non-relational data
  5. Insights from all your data. Enrich and optimize your data from non-traditional sources
  6. Roadblocks to a modern data warehouse
     Keep legacy investment; buy new tier-one hardware appliance; acquire Big Data solution; acquire business intelligence
     Limited scalability and ability to handle new data types; significant training and data silos; high acquisition and migration costs; complex with low adoption
  7. Introducing the Microsoft Analytics Platform System: the turnkey modern data warehouse appliance
     • Relational and non-relational data in a single appliance
     • Enterprise-ready Hadoop
     • Integrated querying across Hadoop and PDW using T-SQL
     • Direct integration with Microsoft BI tools such as Microsoft Excel
     • Near real-time performance with In-Memory Columnstore
     • Ability to scale out to accommodate growing data
     • Removal of data warehouse bottlenecks with MPP SQL Server
     • Concurrency that fuels rapid adoption
     • Industry’s lowest data warehouse appliance price per terabyte
     • Value through a single appliance solution
     • Value with flexible hardware options using commodity hardware
  8. Microsoft Analytics Platform System: the turnkey modern data warehouse appliance
  9. Evolution in the nature and use of data in the enterprise
     Chart axes: data complexity (variety and velocity, petabytes) and value to the business
     Chart stages: historical analysis, insight analysis, predictive analytics, predictive forecasting
  10. What is Hadoop? Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
     Diagram labels: operational services, data services, and core services; Ambari, MapReduce, HDFS, Flume, Sqoop, load and extract (NFS, WebHDFS), Oozie, YARN, Hive and HCatalog, Pig, Falcon, HBase
     Hadoop cluster: multiple compute and storage nodes
  11. Manageable, secured, and highly available Hadoop integrated into the appliance
     • High performance and tuned within the appliance
     • End-user authentication with Active Directory
     • Accessible insights for everyone with Microsoft BI tools
     • Managed and monitored using System Center
     • 100-percent Apache Hadoop
     Diagram labels: SQL Server Parallel Data Warehouse, PolyBase, Microsoft HDInsight
  12. A region is a logical container within an appliance
     Diagram labels: Parallel Data Warehouse workload, HDInsight workload, fabric, hardware, appliance
     Each workload contains the following boundaries:
     • Security
     • Metering
     • Servicing
  13. Bringing Hadoop point solutions and the data warehouse together for users and IT
     • Provides a single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL
     • Uses the power of MPP to enhance query execution performance
     • Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
     • Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera
     Diagram labels: Select…; SQL Server Parallel Data Warehouse; PolyBase; Microsoft Azure HDInsight; Microsoft HDInsight; Hortonworks for Windows and Linux; Cloudera; result set
  14. Direct and parallelized HDFS access
     Enhancing the Data Movement Service (DMS) of APS to allow direct communication between HDFS data nodes and PDW compute nodes
     Diagram labels: non-relational data (social apps, sensor and RFID, mobile apps, web apps) in Hadoop; relational data (traditional schema-based data warehouse applications) in PDW; regular T-SQL; external table, external data source, external file format; enhanced PDW query engine; HDFS bridge; results
  15. Diagram labels: source systems; Hadoop / Data Lake (Cloudera, Hortonworks, HDInsight); Day / Hour / Minute refresh; APS (SQL Server Parallel Data Warehouse, PolyBase, Microsoft HDInsight); MapReduce; T-SQL; SQL Server Data Marts; SQL Server Reporting Services; SQL Server Analysis Services; Analytics / Ad-hoc / Visualization
  16. HDFS File / Directory: //hdfs/social_media/twitter and //hdfs/social_media/twitter/Daily.log (Hadoop)
     Dynamic binding, column filtering, row filtering
     Sample table columns: User, Location, Product, Sentiment, Rtwt, Hour, Date
     User: Sean, Audie, Suz, Tom, Sanjay, Roger, Steve
     Location: CA, CO, WA, IL, MN, TX, AL
     Product: xbox, excel, xbox, sqls, wp8, ssas, ssrs
     Sentiment / Rtwt / Hour values as extracted: -1 0 1 1 1 1 5 0 8 0 0 0 8 8 2 2 1 23 23
     Date: 5-15-14, 5-15-14, 5-15-14, 5-13-14, 5-14-14, 5-14-14, 5-13-14
     SELECT User, Product, Sentiment FROM Twitter_Table WHERE Hour = Current - 1 AND Date = Today AND Sentiment >= 0
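The column filtering and row filtering named on slide 16 can be sketched in a few lines of Python. This is only an illustration of the idea, not how PolyBase is implemented: the function name `scan_external_rows`, the lower-case field names, and every Sentiment and Hour value in the sample rows are assumptions made up for the example (the numeric columns in the slide did not survive extraction), as are the fixed "today" and "current hour".

```python
def scan_external_rows(rows, columns, predicate):
    """Return only the requested columns of the rows that pass the predicate.

    Row filtering happens first (predicate), column filtering second
    (projection), so unneeded rows and columns never reach the caller.
    """
    return [
        {name: row[name] for name in columns}
        for row in rows
        if predicate(row)
    ]


# Hypothetical rows: names, locations, and products come from the slide,
# sentiment/hour values are invented for illustration.
TWITTER_ROWS = [
    {"user": "Sean", "location": "CA", "product": "xbox", "sentiment": -1, "hour": 8, "date": "5-15-14"},
    {"user": "Audie", "location": "CO", "product": "excel", "sentiment": 0, "hour": 8, "date": "5-15-14"},
    {"user": "Suz", "location": "WA", "product": "xbox", "sentiment": 1, "hour": 2, "date": "5-15-14"},
    {"user": "Tom", "location": "IL", "product": "sqls", "sentiment": 1, "hour": 8, "date": "5-13-14"},
]

TODAY = "5-15-14"   # assumed value of "Today"
CURRENT_HOUR = 9    # assumed value of "Current"

result = scan_external_rows(
    TWITTER_ROWS,
    columns=["user", "product", "sentiment"],
    predicate=lambda r: r["hour"] == CURRENT_HOUR - 1
    and r["date"] == TODAY
    and r["sentiment"] >= 0,
)
```

With these made-up rows only the second row satisfies all three conditions, and only three of its six fields are returned.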
  17. Improve APS operations by extending PolyBase
     HDFS file formats: Textfile and RCFile support
     • Microsoft Azure HDInsight
     • HDInsight on APS
     • Hortonworks Data Platform 1.3 and 2.0 (Linux/Windows Server)
     • Cloudera Linux 4.3
     Security and permission model. External table source and file format syntax. Microsoft Azure Storage Blobs
     AU1, PolyBase v2: Analytics Platform System (powered by PolyBase)
  18. Big Data insights for anyone: new insights with familiar tools through native Microsoft BI integration
     • Minimizes IT intervention for discovering data with tools such as Microsoft Excel
     • Enables DBA and power users to join relational and Hadoop data with T-SQL
     • Takes advantage of high adoption of Excel, Power View, PowerPivot, and SQL Server Analysis Services
     • Offers Hadoop tools like MapReduce, Hive, and Pig for data scientists
     Audiences: everyone else using Microsoft BI tools; power users; data scientists
  19. CREATE EXTERNAL TABLE table_name ( {<column_definition>} [,...n] )
        {WITH ( DATA_SOURCE = <data_source>,
                FILE_FORMAT = <file_format>,
                LOCATION = '<file_path>',
                [REJECT_VALUE = <value>], … )};
     1 Referencing external data source
     2 Referencing external file format
     3 Path of the Hadoop file/folder
     4 (Optional) Reject parameters
  20. CREATE EXTERNAL DATA SOURCE datasource_name
        {WITH ( TYPE = <data_source>,
                LOCATION = '<location>',
                [JOB_TRACKER_LOCATION = '<jb_location>'] )};
     1 Type of external data source
     2 Location of external data source
     3 Enabling or disabling of MapReduce job generation
  21. CREATE EXTERNAL FILE FORMAT fileformat_name
        {WITH ( FORMAT_TYPE = <type>,
                [SERDE_METHOD = '<serde_method>',]
                [DATA_COMPRESSION = '<compr_method>',]
                [FORMAT_OPTIONS (<format_options>)] )};
     1 Type of external file format
     2 (De)Serialization method [Hive RCFile]
     3 Compression method
     4 (Optional) Format Options [Text Files]
  22. <Format Options> ::=
        [FIELD_TERMINATOR = 'value']
        [, STRING_DELIMITER = 'value']
        [, DATE_FORMAT = 'value']
        [, USE_TYPE_DEFAULT = 'value']
     1 Column delimiter
     2 Delimiter for string data types
     3 To specify a particular date format
     4 How missing entries are handled
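To make the text-file options on slides 19 to 22 concrete, here is a minimal Python sketch of a delimited-text reader that mimics what FIELD_TERMINATOR, STRING_DELIMITER, USE_TYPE_DEFAULT, and REJECT_VALUE describe. It is an illustration under assumptions, not PolyBase itself: the function `load_delimited_rows` and its parameters are hypothetical, the type defaults (0 for numbers, empty string for text) are assumed, and the rule that loading fails once the rejected-row count exceeds the reject value is my reading of the "reject parameters" annotation.

```python
import csv

# Assumed per-type defaults used when use_type_default is True.
TYPE_DEFAULTS = {int: 0, float: 0.0, str: ""}


def load_delimited_rows(lines, column_types, field_terminator="|",
                        string_delimiter='"', use_type_default=False,
                        reject_value=0):
    """Parse delimited text lines into typed rows.

    Returns (rows, rejected_count). Raises RuntimeError as soon as the
    number of rejected rows exceeds reject_value.
    """
    reader = csv.reader(lines, delimiter=field_terminator,
                        quotechar=string_delimiter)
    rows, rejected = [], 0
    for fields in reader:
        try:
            if len(fields) != len(column_types):
                raise ValueError("unexpected number of columns")
            row = []
            for raw, col_type in zip(fields, column_types):
                if raw == "":
                    # Missing entry: type default or NULL (None).
                    row.append(TYPE_DEFAULTS[col_type] if use_type_default else None)
                else:
                    row.append(col_type(raw))
            rows.append(row)
        except ValueError:
            rejected += 1
            if rejected > reject_value:
                raise RuntimeError(
                    f"{rejected} rejected rows exceed reject value {reject_value}"
                )
    return rows, rejected


SAMPLE_LINES = [
    "Sean|CA|xbox|1",
    "Audie|CO||0",                # missing product
    "not a valid row",            # wrong column count, rejected
    '"Suz|Q"|WA|xbox|high',       # quoted terminator is kept, bad integer is rejected
]
SAMPLE_TYPES = [str, str, str, int]
```

Calling `load_delimited_rows(SAMPLE_LINES, SAMPLE_TYPES, reject_value=2)` keeps the two clean rows and reports two rejects, while a reject value of 1 makes the same input fail.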
  23. Bringing islands of Hadoop data together
     • Running high performance queries against Hadoop data
     • Archiving data warehouse data to Hadoop (move)
     • Exporting relational data to Hadoop (copy)
     • Importing Hadoop data into a data warehouse (copy)
  24. Microsoft Analytics Platform System: the turnkey modern data warehouse appliance
  25. Scale up. Rowstore. Diminishing scale as requirements grow ("forklift" upgrades)
     Diagram: data queried by row; pages 1 to 3 hold rows R1 to R6, each with columns C1 to C4
     Sub-optimal performance for many data warehouse queries
  26. Scaling out your data to petabytes: scale-out technologies in the Analytics Platform System
     • Multiple nodes with dedicated CPU, memory, and storage
     • Ability to incrementally add hardware for near-linear scale to multiple petabytes
     • Ability to handle query complexity and concurrency at scale
     • No “forklift” of prior warehouse to increase capacity
     • Ability to scale out HDInsight and PDW
     Diagram: PDW and PDW / HDInsight units added from 0 terabytes to 6 petabytes
  27. Blazing-fast performance: MPP and In-Memory Columnstore for next-generation performance
     Up to 100x faster queries. Up to 15x more compression. Updateable clustered columnstore vs. table with customary indexing
     • Store data in columnar format for massive compression
     • Load data into or out of memory for next-generation performance with up to 60% improvement in data loading speed
     • Updateable and clustered for real-time trickle loading
     Diagram labels: columnstore index representation; parallel query execution; query; results
  28. Why is a clustered columnstore index important?
     • Saves space
     • Provides easier management by eliminating maintenance of secondary indexes
     • Supports all PDW data types, including high-precision decimal data types and more
     Chart: space used in GB (table with 101 million rows), where space used = table space + index space; 91% savings
     In-Memory Columnstore is featured in the storage engine in PDW AU1
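A small Python sketch can show why storing data in columnar format helps compression, as slide 27 claims. It pivots rows into columns and run-length encodes one column. This is a simplified stand-in chosen for illustration: the slides do not say which encodings the In-Memory Columnstore uses, and the helper names and sample rows below are hypothetical.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical rows, sorted by location so repeats sit next to each other.
ROWS = [
    {"location": "CA", "product": "xbox"},
    {"location": "CA", "product": "excel"},
    {"location": "CA", "product": "xbox"},
    {"location": "WA", "product": "xbox"},
    {"location": "WA", "product": "xbox"},
]

columns = to_columns(ROWS)
encoded_location = run_length_encode(columns["location"])
```

In a row layout the five location values are interleaved with product values, so the repeats are never adjacent; in the column layout they collapse into two pairs. A query that touches only `location` also reads one list instead of every full row.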
  29. Relational query execution processing
     1 SQL queries sent to control node
     2 Control node creates query execution plan
     3 Query plan creates distributed queries to run on each compute node
     4 Distributed queries sent to compute nodes (all running in parallel)
     5 Control node collects query results and returns them to user
     Diagram labels: client; user query; control node (create query plan, aggregate query results); four compute nodes; appliance; management; query results
     Compute nodes process query plan operations in parallel
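The five steps on slide 29 follow a scatter-gather pattern, which the Python sketch below imitates with threads standing in for compute nodes. It is a toy model under assumptions, not the PDW engine: the function names, the CRC32-based distribution on the `user` column, the four-node count, and the sample rows are all hypothetical.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, key, node_count):
    """Hash-distribute rows across compute nodes on the given column."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        slot = zlib.crc32(str(row[key]).encode("utf-8")) % node_count
        nodes[slot].append(row)
    return nodes


def compute_node_query(local_rows):
    """Distributed query: each node sums retweets per product on its own rows."""
    partial = Counter()
    for row in local_rows:
        partial[row["product"]] += row["retweets"]
    return partial


def control_node_query(rows, node_count=4):
    """Control node: send the distributed query to all nodes, then merge."""
    nodes = distribute(rows, "user", node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(compute_node_query, nodes))  # runs in parallel
    total = Counter()
    for partial in partials:
        total.update(partial)  # aggregate partial results
    return dict(total)


# Hypothetical input rows for the example.
ROWS = [
    {"user": "Sean", "product": "xbox", "retweets": 5},
    {"user": "Audie", "product": "excel", "retweets": 2},
    {"user": "Suz", "product": "xbox", "retweets": 8},
    {"user": "Tom", "product": "sqls", "retweets": 1},
]

totals = control_node_query(ROWS)
```

Because each node returns only a small partial aggregate, the control node merges a handful of counters rather than scanning every row itself, and the final totals do not depend on how rows were distributed.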
  30. Great performance with mixed workloads
     Diagram labels, sources: ERP, CRM, LOB apps (ETL/ELT with SSIS, DQS, MDS; ETL/ELT with DWLoader); Hadoop / Big Data
     Diagram labels, Analytics Platform System: PDW, PolyBase, HDInsight; ad hoc queries; intra-day; near real-time; fast ad hoc; columnstore; PolyBase
     Diagram labels, consumers: SQL Server SMP (CRTAS, Link Table); reporting and cubes; BI tools; real-time ROLAP / MOLAP; DirectQuery; SNAC
  31. Microsoft Analytics Platform System: the turnkey modern data warehouse appliance
  32. High performance using commodity hardware
     Price per terabyte for leading vendors: significantly lower price per terabyte than the closest competitor
     Chart: price per terabyte for user-available storage (compressed), in thousands of dollars, for Oracle, EMC, IBM, Teradata, and Microsoft. NOTE: Orange line indicates average price per terabyte.
     Lower storage costs with Windows Server 2012 Storage Spaces
  33. Hardware and software engineered together: the ease of an appliance
     • Co-engineered with HP, Dell, and Quanta best practices
     • Leading performance with commodity hardware
     • Integrated support plan with a single Microsoft PDW contact
     • Pre-configured, built, and tuned software and hardware
     Diagram labels: PolyBase, HDInsight
  34. Hardware architecture
     Rack #1 labels: networking (InfiniBand, InfiniBand, Ethernet, Ethernet); PDW region; HDInsight region; control node; failover node; master node; failover node; three sets of compute nodes with economical disk storage
     Rack #2 labels: InfiniBand, InfiniBand, Ethernet, Ethernet; PDW region; failover node; three sets of compute nodes with economical disk storage; HDI extension base unit; HDI active scale unit; HDI active scale unit; HDI extension base unit
     Host labels: HST-01, HST-02, HSA-01, HST-02; economical disk storage; IB and Ethernet
     • Active Unit: addition of two or three compute nodes depending on OEM hardware configuration and related storage
     • Passive Unit: host for non-worker HDInsight nodes
     • Failover Node: high availability for the rack
  35. Base Unit: software details
     • PDW engine
     • DMS Manager
     • SQL Server 2012 Enterprise Edition (PDW build)
     Diagram labels: Host 1, Host 2, Host 3, Host 4; CTL, MAD, AD, VMM, Compute 1, Compute 2; economical disk storage; IB and Ethernet; direct attached SAS
     • All hosts run Windows Server 2012 Standard and Windows Azure Virtual Machines
     • Fabric or workload in Hyper-V Virtual Machines
     • Fabric virtual machine, management server (MAD01), and control server (CTL) share one server
     • PDW agent that runs on all hosts and all virtual machines
     • DWConfig and Admin Console
     • Windows Storage Spaces and Azure Storage blobs
  36. Virtual machine migration can be used to move workload nodes to new hosts after hardware failure
     Diagram labels: Base Unit (Host 1 to Host 4; CTL, MAD, AD, VMM, FAB AD, Compute 1, Compute 2); Passive Unit (Host 5); economical disk storage; IB and Ethernet; direct attached SAS
     Cluster Shared Volumes
     • Enable all nodes to access logical unit numbers (LUNs) on economical disk storage
     • Use Server Message Block (SMB3) protocol
     Failover capabilities
     • Uses one cluster across the whole appliance
     • Automatically migrates virtual machines on host failure
     • Enforces rules with affinity and anti-affinity maps
     • Uses Windows Failover Cluster Manager