Modernizing Your Data Warehouse using APS
Big data. Small data. All data. 
Stéphane Fréchette - SQL Server MVP - @sfrechette 
Database / Business Intelligence Solution Architect
- Gartner, “The State of Data Warehousing in 2012”
1 Increasing data volumes
2 Real-time data
3 New data sources and types
4 Cloud-born data
Data sources
 
The modern data warehouse 
Data sources Non-relational data
Insights from all your data 
Enrich and optimize your data from non-traditional sources 
5
Roadblocks to a modern data warehouse 
Keep legacy investment
Buy new tier-one hardware appliance
Acquire Big Data solution
Acquire business intelligence
Limited scalability and ability to handle new data types
Significant training and data silos
High acquisition and migration costs
Complex with low adoption
Introducing the Microsoft Analytics Platform System 
The turnkey modern data warehouse appliance 
• Relational and non-relational data in a single appliance
• Enterprise-ready Hadoop
• Integrated querying across Hadoop and PDW using T-SQL
• Direct integration with Microsoft BI tools such as Microsoft Excel
• Near real-time performance with In-Memory Columnstore
• Ability to scale out to accommodate growing data
• Removal of data warehouse bottlenecks with MPP SQL Server
• Concurrency that fuels rapid adoption
• Industry’s lowest data warehouse appliance price per terabyte
• Value through a single appliance solution
• Value with flexible hardware options using commodity hardware
Microsoft Analytics Platform System 
The turnkey modern data warehouse appliance
Evolution in the nature and use of data in the enterprise
[Chart: data complexity (variety and velocity, growing to petabytes) plotted against value to the business. Stages shown: historical analysis, insight analysis, predictive analytics, predictive forecasting.]
What is Hadoop? 
[Diagram: Hadoop stack grouped into operational services, data services, and core services. Components shown: Ambari, Oozie, Falcon, Flume, Sqoop, load and extract (NFS, WebHDFS), Hive and HCatalog, Pig, HBase, MapReduce, YARN, HDFS. The Hadoop cluster is drawn as a row of compute and storage nodes.]
Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
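To make "distributed data processing" concrete, here is a minimal single-process sketch of the MapReduce pattern in Python. It is not Hadoop code: the function names and the idea of a "block" as a list of text lines are assumptions made for illustration, and a real cluster would run the map and reduce phases on separate nodes.

```python
from collections import defaultdict


def map_phase(block):
    """Emit (word, 1) pairs for one block of text, as a mapper would."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(blocks):
    """Run map, shuffle, and reduce over all blocks.

    Each block stands in for data stored on a different node (assumption).
    """
    pairs = [pair for block in blocks for pair in map_phase(block)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count([["big data small data"], ["all data"]]))
```

Because each block is mapped independently, adding nodes (and blocks) spreads the work without changing the logic, which is the scale-out property the slide describes.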
Manageable, secured, and highly available Hadoop integrated into the appliance 
• High performance and tuned within the appliance
• End-user authentication with Active Directory
• Accessible insights for everyone with Microsoft BI tools
• Managed and monitored using System Center
• 100-percent Apache Hadoop
[Diagram labels: SQL Server Parallel Data Warehouse, PolyBase, Microsoft HDInsight]
Parallel Data Warehouse workload
HDInsight workload
Fabric
Hardware
Appliance
A region is a logical container within an appliance
Each workload contains the following boundaries:
• Security 
• Metering 
• Servicing
Bringing Hadoop point solutions and the data warehouse together for users and IT 
• Provides a single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL
• Uses the power of MPP to enhance query execution performance
• Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
• Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera
[Diagram labels: SQL Server Parallel Data Warehouse, PolyBase, Microsoft Azure HDInsight, Microsoft HDInsight, Hortonworks for Windows and Linux, Cloudera, Select…, Result set]
Results 
Direct and parallelized HDFS access 
Enhancing the Data Movement Service (DMS) of APS to allow direct communication between HDFS data nodes and PDW compute nodes
[Diagram: non-relational data (social apps, sensor and RFID, mobile apps, web apps) is held in Hadoop; relational data comes from traditional schema-based data warehouse applications. Labels on the PDW side: regular T-SQL, external table, external data source, external file format, enhanced PDW query engine, HDFS bridge.]
[Diagram labels: source systems; Hadoop / Data Lake (Cloudera, Hortonworks, HDInsight); day / hour / minute refresh; MapReduce; T-SQL; APS (SQL Server Parallel Data Warehouse, PolyBase, Microsoft HDInsight); SQL Server Data Marts; SQL Server Reporting Services; SQL Server Analysis Services; analytics / ad-hoc / visualization]
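HDFS File / Directory
//hdfs/social_media/twitter
//hdfs/social_media/twitter/Daily.log
Hadoop
• Dynamic binding
• Column filtering
• Row filtering
[Table: sample Twitter data with columns User, Location, Product, Sentiment, Rtwt, Hour, Date. Values are listed column by column; the Sentiment, Rtwt, and Hour values cannot be separated reliably.]
User: Sean, Audie, Suz, Tom, Sanjay, Roger, Steve
Location: CA, CO, WA, IL, MN, TX, AL
Product: xbox, excel, xbox, sqls, wp8, ssas, ssrs
Sentiment, Rtwt, Hour: -1 0 1 1 1 1 5 0 8 0 0 0 8 8 2 2 1 23 23
Date: 5-15-14, 5-15-14, 5-15-14, 5-13-14, 5-14-14, 5-14-14, 5-13-14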
SELECT [User], Product, Sentiment
FROM Twitter_Table
WHERE [Hour] = DATEPART(hour, GETDATE()) - 1  -- "Current - 1" on the slide
AND [Date] = CAST(GETDATE() AS date)  -- "Today" on the slide
AND Sentiment >= 0;
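The query above combines row filtering (the WHERE clause) with column filtering (the SELECT list). The following Python sketch mimics that behavior on a handful of rows. The rows, the user names, and the function name are hypothetical and are not the slide's sample data; the sketch only illustrates the filtering order, not how PolyBase implements it.

```python
from datetime import date

# Hypothetical rows: (user, location, product, sentiment, retweets, hour, day).
ROWS = [
    ("user_a", "CA", "xbox", -1, 5, 8, date(2014, 5, 15)),
    ("user_b", "CO", "excel", 0, 0, 8, date(2014, 5, 15)),
    ("user_c", "WA", "xbox", 1, 8, 2, date(2014, 5, 15)),
    ("user_d", "IL", "sqls", 1, 0, 8, date(2014, 5, 13)),
]


def select_recent_non_negative(rows, current_hour, today):
    """Keep rows from the previous hour of today with sentiment >= 0,
    then project only the user, product, and sentiment columns."""
    selected = []
    for user, _location, product, sentiment, _retweets, hour, day in rows:
        # Row filtering: the three predicates of the WHERE clause.
        if hour == current_hour - 1 and day == today and sentiment >= 0:
            # Column filtering: only the columns named in the SELECT list.
            selected.append((user, product, sentiment))
    return selected


if __name__ == "__main__":
    print(select_recent_non_negative(ROWS, current_hour=9, today=date(2014, 5, 15)))
```

Filtering rows and columns as close to the data as possible reduces how much has to move between Hadoop and the warehouse.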
Improve APS operations by extending PolyBase 
HDFS file formats: Textfile and RCFile support
• Microsoft Azure HDInsight
• HDInsight on APS
• Hortonworks Data Platform 1.3 and 2.0 (Linux/Windows Server)
• Cloudera Linux 4.3
Security and permission model
External table source and file format syntax
Microsoft Azure Storage Blobs
AU1
PolyBase v2
Analytics Platform System (powered by PolyBase)
Big Data insights for anyone 
New insights with familiar tools through native Microsoft BI integration 
• Minimizes IT intervention for discovering data with tools such as Microsoft Excel
• Enables DBAs and power users to join relational and Hadoop data with T-SQL
• Takes advantage of high adoption of Excel, Power View, PowerPivot, and SQL Server Analysis Services
• Offers Hadoop tools like MapReduce, Hive, and Pig for data scientists
Audiences shown: everyone else using Microsoft BI tools, power users, data scientists
CREATE EXTERNAL TABLE table_name
    ({<column_definition>} [,...n])
    {WITH (
        DATA_SOURCE = <data_source>,   -- 1
        FILE_FORMAT = <file_format>,   -- 2
        LOCATION = '<file_path>',      -- 3
        [REJECT_VALUE = <value>],      -- 4
        …)};
1 Referencing external data source
2 Referencing external file format
3 Path of the Hadoop file/folder
4 (Optional) Reject parameters
CREATE EXTERNAL DATA SOURCE datasource_name
    {WITH (
        TYPE = <data_source>,                        -- 1
        LOCATION = '<location>'                      -- 2
        [, JOB_TRACKER_LOCATION = '<jb_location>']   -- 3
    )};
1 Type of external data source
2 Location of external data source
3 Enabling or disabling of MapReduce job generation
CREATE EXTERNAL FILE FORMAT fileformat_name
    {WITH (
        FORMAT_TYPE = <type>                       -- 1
        [, SERDE_METHOD = '<serde_method>']        -- 2
        [, DATA_COMPRESSION = '<compr_method>']    -- 3
        [, FORMAT_OPTIONS (<format_options>)]      -- 4
    )};
1 Type of external file format
2 (De)Serialization method [Hive RCFile]
3 Compression method
4 (Optional) Format Options [Text Files]
<format_options> ::=
    [FIELD_TERMINATOR = 'value']       -- 1
    [, STRING_DELIMITER = 'value']     -- 2
    [, DATE_FORMAT = 'value']          -- 3
    [, USE_TYPE_DEFAULT = 'value']     -- 4
1 Column delimiter
2 Delimiter for string data types
3 To specify a particular date format
4 How missing entries are handled
Bringing islands of Hadoop data together 
Running high performance queries against Hadoop data 
Archiving data warehouse data to Hadoop (move) 
Exporting relational data to Hadoop (copy) 
Importing Hadoop data into a data warehouse (copy)
Microsoft Analytics Platform System 
The turnkey modern data warehouse appliance
Scale up
• Diminishing scale as requirements grow
[Diagram: two successive "forklift" upgrades as data grows]
Rowstore
• Querying data by row
• Sub-optimal performance for many data warehouse queries
[Diagram: rows R1 to R6 of columns C1 to C4, stored row by row across Page 1, Page 2, and Page 3]
Scale out
• Multiple nodes with dedicated CPU, memory, and storage
• Ability to incrementally add hardware for near-linear scale to multiple petabytes
• Ability to handle query complexity and concurrency at scale
• No “forklift” of prior warehouse to increase capacity
• Ability to scale out HDInsight and PDW
Scaling out your data to petabytes
Scale-out technologies in the Analytics Platform System
[Diagram: a PDW unit surrounded by PDW / HDInsight units, on a scale from 0 terabytes to 6 petabytes]
Blazing-fast performance 
MPP and In-Memory Columnstore for next-generation performance
Up to 100x faster queries
Updateable clustered columnstore vs. table with customary indexing
• Store data in columnar format for massive compression
• Load data into or out of memory for next-generation performance with up to 60% improvement in data loading speed
• Updateable and clustered for real-time trickle loading
Up to 15x more compression
[Diagram labels: columnstore index representation, parallel query execution, query, results]
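The compression claim for columnar storage can be illustrated with a small Python sketch. It pivots rows into columns and applies run-length encoding, which is one common columnar compression technique; the slides do not say which techniques the appliance uses, so treat this as an assumption. The function names and sample rows are hypothetical.

```python
from itertools import groupby


def to_columns(rows, names):
    """Pivot a list of row tuples into a dict of column lists."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


if __name__ == "__main__":
    rows = [("CA", "xbox", 1), ("CA", "xbox", 0), ("WA", "excel", 1)]
    columns = to_columns(rows, ("location", "product", "sentiment"))
    for name, values in columns.items():
        print(name, run_length_encode(values))
```

Values within one column tend to repeat far more than values within one row, so storing each column contiguously gives the encoder longer runs to collapse. A query that touches only a few columns also reads only those columns.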
Why is a clustered columnstore index important?
• Saves space
• Provides easier management by eliminating maintenance of secondary indexes
• Supports all PDW data types, including high-precision decimal data types and more
[Chart: space used in GB (table space + index space) for a table with 101 million rows, six configurations on a 0.0 to 20.0 GB axis, annotated with 91% savings]
In-Memory Columnstore is featured in the storage engine in PDW AU1
Relational query execution processing 
1 SQL queries sent to control node
2 Control node creates query execution plan
3 Query plan creates distributed queries to run on each compute node
4 Distributed queries sent to compute nodes (all running in parallel)
5 Control node collects query results and returns them to user
[Diagram labels: client, user query, control node (create query plan, aggregate query results), four compute nodes (process query plan operations in parallel), appliance, management, query results]
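The five steps above follow a scatter/gather pattern, which the Python sketch below imitates with a thread pool. It is an illustration only: the partitions, the average-sentiment query, and both function names are hypothetical, and threads stand in for compute nodes that would each hold their own slice of the data.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_node_query(partition, min_sentiment):
    """Distributed query: one node aggregates only its own rows."""
    matching = [value for value in partition if value >= min_sentiment]
    return sum(matching), len(matching)


def control_node_query(partitions, min_sentiment):
    """Send the same distributed query to every node in parallel,
    then combine the partial results into one answer (an average)."""
    workers = max(1, len(partitions))  # the pool needs at least one worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(
            pool.map(lambda part: compute_node_query(part, min_sentiment), partitions)
        )
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count if count else None


if __name__ == "__main__":
    # Each inner list stands in for the rows stored on one compute node.
    print(control_node_query([[1, -1, 0], [5, 1], [-1]], min_sentiment=0))
```

Each node returns a small partial result (a sum and a count) rather than its rows, so the control node's final aggregation stays cheap no matter how much data each node scanned.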
SQL Server SMP 
Reporting and cubes 
BI Tools 
Great performance with mixed workloads 
Analytics Platform System 
ETL/ELT with SSIS, DQS, MDS 
ERP CRM LOB APPS 
ETL/ELT with DWLoader 
Hadoop / Big Data 
PDW 
PolyBase 
HDInsight 
Ad hoc queries 
Intra-Day 
Near real-time 
Fast ad hoc 
Columnstore 
PolyBase
CRTAS 
Link Table 
Real-Time 
ROLAP / MOLAP 
DirectQuery 
SNAC
Microsoft Analytics Platform System 
The turnkey modern data warehouse appliance
High performance using commodity hardware 
Price per terabyte for leading vendors 
Significantly lower price per terabyte than the closest competitor
[Chart: price per terabyte for user-available storage (compressed), in thousands of dollars on a $0 to $30 axis, for Oracle, EMC, IBM, Teradata, and Microsoft. The orange line indicates the average price per terabyte.]
Lower storage costs with Windows Server 2012 Storage Spaces
Hardware and software engineered together 
The ease of an appliance 
• Co-engineered with HP, Dell, and Quanta best practices
• Leading performance with commodity hardware
• Integrated support plan with a single Microsoft PDW contact
• Pre-configured, built, and tuned software and hardware
[Diagram labels: PolyBase, HDInsight]
Hardware architecture
[Diagram: Rack #1 and Rack #2, each with two InfiniBand and two Ethernet networking connections. Labels shown: PDW region (control node, failover node, compute nodes with economical disk storage), HDInsight region (master node, failover node, compute nodes with economical disk storage), HDI extension base unit, HDI active scale unit, and hosts HST-01, HST-02, HSA-01, HST-02 with economical disk storage, IB and Ethernet.]
• Active Unit: Addition of two or three compute nodes depending on OEM hardware configuration and related storage
• Passive Unit: Host for non-worker HDInsight nodes
• Failover Node: High availability for the rack
• PDW engine
• DMS Manager
• SQL Server 2012 Enterprise Edition (PDW build)
[Diagram: Base Unit with Host 1 to Host 4, IB and Ethernet, and direct attached SAS to economical disk storage. Virtual machines shown: CTL, MAD, AD, VMM, Compute 1, Compute 2.]
Software details
• All hosts run Windows Server 2012 Standard and Windows Azure Virtual Machines
• Fabric or workload in Hyper-V Virtual Machines
• Fabric virtual machine, management server (MAD01), and control server (CTL) share one server
• PDW agent that runs on all hosts and all virtual machines
• DWConfig and Admin Console
• Windows Storage Spaces and Azure Storage blobs
[Diagram labels: Base Unit, Passive Unit, Host 1 to Host 5, economical disk storage, IB and Ethernet, direct attached SAS. Virtual machines shown: CTL, MAD, AD, FAB, VMM, Compute 1, Compute 2.]
Virtual machine migration can be used to move workload nodes to new hosts after hardware failure
Cluster Shared Volumes
• Enable all nodes to access logical unit numbers (LUNs) on economical disk storage
• Use Server Message Block (SMB3) protocol
Failover capabilities
• Uses one cluster across the whole appliance
• Automatically migrates virtual machines on host failure
• Enforces rules with affinity and anti-affinity maps
• Uses Windows Failover Cluster Manager